digantamisra98 / mish Goto Github PK

Official Repository for "Mish: A Self Regularized Non-Monotonic Neural Activation Function" [BMVC 2020]

Home Page: https://www.bmvc2020-conference.com/assets/papers/0928.pdf

License: MIT License

Python 24.43% Jupyter Notebook 75.57%

deep-learning activation-functions mathematics neural-networks computer-vision bmvc bmvc20 object-detection image-classification

mish's Introduction

Mish: Self Regularized
Non-Monotonic Activation Function

BMVC 2020 (Official Paper)

Notes: (Click to expand)

A considerably faster version based on CUDA can be found here - Mish CUDA (All credits to Thomas Brandon for the same)
Memory Efficient Experimental version of Mish can be found here
Faster variants for Mish and H-Mish by Yashas Samaga can be found here - ConvolutionBuildingBlocks
Alternative (experimental improved) variant of H-Mish developed by Páll Haraldsson can be found here - H-Mish (Available in Julia)
Variance based initialization method for Mish (experimental) by Federico Andres Lois can be found here - Mish_init

Changelogs/ Updates: (Click to expand)

[07/17] Mish added to OpenVino - Open-1187, Merged-1125
[07/17] Mish added to BetaML.jl
[07/17] Loss Landscape exploration progress in collaboration with Javier Ideami and Ajay Uppili Arasanipalai
[07/17] Poster accepted for presentation at DLRLSS hosted by MILA, CIFAR, Vector Institute and AMII
[07/20] Mish added to Google's AutoML - 502
[07/27] Mish paper accepted to 31st British Machine Vision Conference (BMVC), 2020. ArXiv version to be updated soon.
[08/13] New updated PyTorch benchmarks and pretrained models available on PyTorch Benchmarks.
[08/14] New updated Arxiv version of the paper is out.
[08/18] Mish added to Sony Nnabla - Merged-700
[09/02] Mish added to TensorFlow Swift APIs - Merged - 1068
[06/09] Official paper and presentation video for BMVC is released at this link.
[23/09] CSP-p7 + Mish (multi-scale) is currently the SOTA in Object Detection on MS-COCO test-dev while CSP-p7 + Mish (single-scale) is currently the 3rd best model in Object detection on MS-COCO test dev. Further details on paperswithcode leaderboards.
[11/11] Mish added to TFLearn - Merged 1159 (Follow up 1141)
[17/11] Mish added to MONAI - Merged 1235
[20/11] Mish added to plaidml - Merged 1566
[10/12] Mish added to Simd and Synet - Docs
[14/12] Mish added to OneFlow - Merged 3972
[24/12] Mish added to GPT-Neo
[21/04] Mish added to TensorFlow JS
[02/05] Mish added to Axon
[26/05] 🔥 Mish is added to PyTorch. Will be added in PyTorch 1.9. 🔥
[27/05] Mish is added to PyTorch YOLO v3
[09/06] 🔥 Mish is added to MXNet.
[03/07] Mish is added to TorchSharp.
[05/08] Mish is added to KotlinDL.

News/ Media Coverage:

(02/2020): Podcast episode on Mish at Machine Learning Café is out now. Listen on:

(02/2020): Talk on Mish and Non-Linear Dynamics at Sicara is out now. Watch on:

(07/2020): CROWN: A comparison of morphology for Mish, Swish and ReLU produced in collaboration with Javier Ideami. Watch on:

(08/2020): Talk on Mish and Non-Linear Dynamics at Computer Vision Talks. Watch on:

(12/2020): Talk on From Smooth Activations to Robustness to Catastrophic Forgetting at Weights & Biases Salon is out now. Watch on:

(12/2020) Weights & Biases integration is now added 🔥. Get started.
(08/2021) Comprehensive hardware based computation performance benchmark for Mish has been conducted by Benjamin Warner. Blogpost.

MILA/ CIFAR 2020 DLRLSS (Click on arrow to view)

Contents: (Click to expand)

Mish
a. Loss landscape
ImageNet Scores
MS-COCO
Variation of Parameter Comparison
a. MNIST
b. CIFAR10
Significance Level
Results
a. Summary of Results (Vision Tasks)
b. Summary of Results (Language Tasks)
Try It!
Acknowledgements
Cite this work

Mish:

$f(x) = x\tanh (softplus(x)) = x\tanh(\ln (1 + e^{x}))$

Minimum of f(x) is observed to be ≈-0.30884 at x≈-1.1924
Mish has a parametric order of continuity of: C^∞

Derivative of Mish with respect to Swish and Δ(x) preconditioning:

$f'(x) = (sech^{2}(softplus(x)))(xsigmoid(x)) + \frac{f(x)}{x}$

Further simplifying:

$f'(x) = \Delta(x)swish(x) + \frac{f(x)}{x}$

Alternative derivative form:

$f'(x) = \frac{e^{x}\omega}{\delta^{2}}$

where:

$\omega = 4(x+1)+4e^{2x} +e^{3x} +e^{x}(4x+6)$

$\delta = 2e^{x} +e^{2x} +2$

We hypothesize the Δ(x) to be exhibiting the properties of a pre-conditioner making the gradient more smoother. Further details are provided in the paper.

Loss Landscape:

To visit the interactive Loss Landscape visualizer, click here.

Loss landscape visualizations for a ResNet-20 for CIFAR 10 using ReLU, Mish and Swish (from L-R) for 200 epochs training:

Mish provides much better accuracy, overall lower loss, smoother and well conditioned easy-to-optimize loss landscape as compared to both Swish and ReLU. For all loss landscape visualizations please visit this readme.

We also investigate the output landscape of randomly initialized neural networks as shown below. Mish has a much smoother profile than ReLU.

ImageNet Scores:

For Installing DarkNet framework, please refer to darknet(Alexey AB)

For PyTorch based ImageNet scores, please refer to this readme

Network	Activation	Top-1 Accuracy	Top-5 Accuracy	cfg	Weights	Hardware
ResNet-50	Mish	74.244%	92.406%	cfg	weights	AWS p3.16x large, 8 Tesla V100
DarkNet-53	Mish	77.01%	93.75%	cfg	weights	AWS p3.16x large, 8 Tesla V100
DenseNet-201	Mish	76.584%	93.47%	cfg	weights	AWS p3.16x large, 8 Tesla V100
ResNext-50	Mish	77.182%	93.318%	cfg	weights	AWS p3.16x large, 8 Tesla V100

Network	Activation	Top-1 Accuracy	Top-5 Accuracy
CSPResNet-50	Leaky ReLU	77.1%	94.1%
CSPResNet-50	Mish	78.1%	94.2%

Pelee Net	Leaky ReLU	70.7%	90%
Pelee Net	Mish	71.4%	90.4%
Pelee Net	Swish	71.5%	90.7%

CSPPelee Net	Leaky ReLU	70.9%	90.2%
CSPPelee Net	Mish	71.2%	90.3%

Results on CSPResNext-50:

MixUp	CutMix	Mosaic	Blur	Label Smoothing	Leaky ReLU	Swish	Mish	Top -1 Accuracy	Top-5 Accuracy	cfg	weights
					✔️			77.9%(=)	94%(=)
✔️					✔️			77.2%(-)	94%(=)
	✔️				✔️			78%(+)	94.3%(+)
		✔️			✔️			78.1%(+)	94.5%(+)
			✔️		✔️			77.5%(-)	93.8%(-)
				✔️	✔️			78.1%(+)	94.4%(+)
						✔️		64.5%(-)	86%(-)
							✔️	78.9%(+)	94.5%(+)
	✔️	✔️		✔️	✔️			78.5%(+)	94.8%(+)
	✔️	✔️		✔️			✔️	79.8%(+)	95.2%(+)	cfg	weights

Results on CSPResNet-50:

CutMix	Mosaic	Label Smoothing	Leaky ReLU	Mish	Top -1 Accuracy	Top-5 Accuracy	cfg	weights
			✔️		76.6%(=)	93.3%(=)
✔️	✔️	✔️	✔️		77.1%(+)	94.1%(+)
✔️	✔️	✔️		✔️	78.1%(+)	94.2%(+)	cfg	weights

Results on CSPDarkNet-53:

CutMix	Mosaic	Label Smoothing	Leaky ReLU	Mish	Top -1 Accuracy	Top-5 Accuracy	cfg	weights
			✔️		77.2%(=)	93.6%(=)
✔️	✔️	✔️	✔️		77.8%(+)	94.4%(+)
✔️	✔️	✔️		✔️	78.7%(+)	94.8%(+)	cfg	weights

Results on SpineNet-49:

CutMix	Mosaic	Label Smoothing	ReLU	Swish	Mish	Top -1 Accuracy	Top-5 Accuracy	cfg	weights
			✔️			77%(=)	93.3%(=)	-	-
		✔️		✔️		78.1%(+)	94%(+)	-	-
✔️	✔️	✔️			✔️	78.3%(+)	94.6%(+)	-	-

MS-COCO:

For PyTorch based MS-COCO scores, please refer to this readme

Model	Mish	AP50...95	mAP50	CPU - 90 Watt - FP32 (Intel Core i7-6700K, 4GHz, 8 logical cores) OpenCV-DLIE, FPS	VPU-2 Watt- FP16 (Intel MyriadX) OpenCV-DLIE, FPS	GPU-175 Watt- FP32/16 (Nvidia GeForce RTX 2070) DarkNet-cuDNN, FPS
CSPDarkNet-53 (512 x 512)		42.4%	64.5%	3.5	1.23	43
CSPDarkNet-53 (512 x 512)	✔️	43%	64.9%	-	-	41
CSPDarkNet-53 (608 x 608)	✔️	43.5%	65.7%	-	-	26

Architecture	Mish	CutMix	Mosaic	Label Smoothing	Size	AP	AP50	AP75
CSPResNext50-PANet-SPP					512 x 512	42.4%	64.4%	45.9%
CSPResNext50-PANet-SPP		✔️	✔️	✔️	512 x 512	42.3%	64.3%	45.7%
CSPResNext50-PANet-SPP	✔️	✔️	✔️	✔️	512 x 512	42.3%	64.2%	45.8%

CSPDarkNet53-PANet-SPP		✔️	✔️	✔️	512 x 512	42.4%	64.5%	46%
CSPDarkNet53-PANet-SPP	✔️	✔️	✔️	✔️	512 x 512	43%	64.9%	46.5%

Credits to AlexeyAB, Wong Kin-Yiu and Glenn Jocher for all the help with benchmarking MS-COCO and ImageNet.

Variation of Parameter Comparison:

MNIST:

To observe how increasing the number of layers in a network while maintaining other parameters constant affect the test accuracy, fully connected networks of varying depths on MNIST, with each layer having 500 neurons were trained. Residual Connections were not used because they enable the training of arbitrarily deep networks. BatchNorm was used to lessen the dependence on initialization along with a dropout of 25%. The network is optimized using SGD on a batch size of 128, and for fair comparison, the same learning rates for each activation function was maintained. In the experiments, all 3 activations maintained nearly the same test accuracy for 15 layered Network. Increasing number of layers from 15 gradually resulted in a sharp decrease in test accuracy for Swish and ReLU, however, Mish outperformed them both in large networks where optimization becomes difficult.

The consistency of Mish providing better test top-1 accuracy as compared to Swish and ReLU was also observed by increasing Batch Size for a ResNet v2-20 on CIFAR-10 for 50 epochs while keeping all other network parameters to be constant for fair comparison.

Gaussian Noise with varying standard deviation was added to the input in case of MNIST classification using a simple conv net to observe the trend in decreasing test top-1 accuracy for Mish and compare it to that of ReLU and Swish. Mish mostly maintained a consistent lead over that of Swish and ReLU (Less than ReLU in just 1 instance and less than Swish in 3 instance) as shown below. The trend for test loss was also observed following the same procedure. (Mish has better loss than both Swish and ReLU except in 1 instance)

CIFAR10:

Significance Level:

The P-values were computed for different activation functions in comparison to that of Mish on terms of Top-1 Testing Accuracy of a Squeeze Net Model on CIFAR-10 for 50 epochs for 23 runs using Adam Optimizer at a Learning Rate of 0.001 and Batch Size of 128. It was observed that Mish beats most of the activation functions at a high significance level in the 23 runs, specifically it beats ReLU at a high significance of P < 0.0001. Mish also had a comparatively lower standard deviation across 23 runs which proves the consistency of performance for Mish.

Activation Function	Mean Accuracy	Mean Loss	Standard Deviation of Accuracy	P-value	Cohen's d Score	95% CI
Mish	87.48%	4.13%	0.3967	-	-	-
Swish-1	87.32%	4.22%	0.414	P = 0.1973	0.386	-0.3975 to 0.0844
E-Swish (β=1.75)	87.49%	4.156%	0.411	P = 0.9075	0.034444	-0.2261 to 0.2539
GELU	87.37%	4.339%	0.472	P = 0.4003	0.250468	-0.3682 to 0.1499
ReLU	86.66%	4.398%	0.584	P < 0.0001	1.645536	-1.1179 to -0.5247
ELU(α=1.0)	86.41%	4.211%	0.3371	P < 0.0001	2.918232	-1.2931 to -0.8556
Leaky ReLU(α=0.3)	86.85%	4.112%	0.4569	P < 0.0001	1.47632	-0.8860 to -0.3774
RReLU	86.87%	4.138%	0.4478	P < 0.0001	1.444091	-0.8623 to -0.3595
SELU	83.91%	4.831%	0.5995	P < 0.0001	7.020812	-3.8713 to -3.2670
SoftPlus(β = 1)	83.004%	5.546%	1.4015	P < 0.0001	4.345453	-4.7778 to -4.1735
HardShrink(λ = 0.5)	75.03%	7.231%	0.98345	P < 0.0001	16.601747	-12.8948 to -12.0035
Hardtanh	82.78%	5.209%	0.4491	P < 0.0001	11.093842	-4.9522 to -4.4486
LogSigmoid	81.98%	5.705%	1.6751	P < 0.0001	4.517156	-6.2221 to -4.7753
PReLU	85.66%	5.101%	2.2406	P = 0.0004	1.128135	-2.7715 to -0.8590
ReLU6	86.75%	4.355%	0.4501	P < 0.0001	1.711482	-0.9782 to -0.4740
CELU(α=1.0)	86.23%	4.243%	0.50941	P < 0.0001	2.741669	-1.5231 to -0.9804
Sigmoid	74.82%	8.127%	5.7662	P < 0.0001	3.098289	-15.0915 to -10.2337
Softshrink(λ = 0.5)	82.35%	5.4915%	0.71959	P < 0.0001	8.830541	-5.4762 to -4.7856
Tanhshrink	82.35%	5.446%	0.94508	P < 0.0001	7.083564	-5.5646 to -4.7032
Tanh	83.15%	5.161%	0.6887	P < 0.0001	7.700198	-4.6618 to -3.9938
Softsign	82.66%	5.258%	0.6697	P < 0.0001	8.761157	-5.1493 to -4.4951
Aria-2(β = 1, α=1.5)	81.31%	6.0021%	2.35475	P < 0.0001	3.655362	-7.1757 to -5.1687
Bent's Identity	85.03%	4.531%	0.60404	P < 0.0001	4.80211	-2.7576 to -2.1502
SQNL	83.44%	5.015%	0.46819	P < 0.0001	9.317237	-4.3009 to -3.7852
ELisH	87.38%	4.288%	0.47731	P = 0.4283	0.235784	-0.3643 to 0.1573
Hard ELisH	85.89%	4.431%	0.62245	P < 0.0001	3.048849	-1.9015 to -1.2811
SReLU	85.05%	4.541%	0.5826	P < 0.0001	4.883831	-2.7306 to -2.1381
ISRU (α=1.0)	86.85%	4.669%	0.1106	P < 0.0001	5.302987	-4.4855 to -3.5815
Flatten T-Swish	86.93%	4.459%	0.40047	P < 0.0001	1.378742	-0.7865 to -0.3127
SineReLU (ε = 0.001)	86.48%	4.396%	0.88062	P < 0.0001	1.461675	-1.4041 to -0.5924
Weighted Tanh (Weight = 1.7145)	80.66%	5.985%	1.19868	P < 0.0001	7.638298	-7.3502 to -6.2890
LeCun's Tanh	82.72%	5.322%	0.58256	P < 0.0001	9.551812	-5.0566 to -4.4642
Soft Clipping (α=0.5)	55.21%	18.518%	10.831994	P < 0.0001	4.210373	-36.8255 to -27.7154
ISRLU (α=1.0)	86.69%	4.231%	0.5788	P < 0.0001	1.572874	-1.0753 to -0.4856

Values rounded up which might cause slight deviation in the statistical values reproduced from these tests

Results:

News: Ajay Arasanipalai recently submitted benchmark for CIFAR-10 training for the Stanford DAWN Benchmark using a Custom ResNet-9 + Mish which achieved 94.05% accuracy in just 10.7 seconds in 14 epochs on the HAL Computing Cluster. This is the current fastest training of CIFAR-10 in 4 GPUs and 2nd fastest training of CIFAR-10 overall in the world.

Summary of Results (Vision Tasks):

Comparison is done based on the high priority metric, for image classification the Top-1 Accuracy while for Generative Networks and Image Segmentation the Loss Metric. Therefore, for the latter, Mish > Baseline is indicative of better loss and vice versa. For Embeddings, the AUC metric is considered.

Activation Function	Mish > Baseline Model	Mish < Baseline Model
ReLU	55	20
Swish-1	53	22
SELU	26	1
Sigmoid	24	0
TanH	24	0
HardShrink(λ = 0.5)	23	0
Tanhshrink	23	0
PReLU(Default Parameters)	23	2
Softsign	22	1
Softshrink (λ = 0.5)	22	1
Hardtanh	21	2
ELU(α=1.0)	21	7
LogSigmoid	20	4
GELU	19	3
E-Swish (β=1.75)	19	7
CELU(α=1.0)	18	5
SoftPlus(β = 1)	17	7
Leaky ReLU(α=0.3)	17	8
Aria-2(β = 1, α=1.5)	16	2
ReLU6	16	8
SQNL	13	1
Weighted TanH (Weight = 1.7145)	12	1
RReLU	12	11
ISRU (α=1.0)	11	1
Le Cun's TanH	10	2
Bent's Identity	10	5
Hard ELisH	9	1
Flatten T-Swish	9	3
Soft Clipping (α=0.5)	9	3
SineReLU (ε = 0.001)	9	4
ISRLU (α=1.0)	9	4
ELisH	7	3
SReLU	7	6
Hard Sigmoid	1	0
Thresholded ReLU(θ=1.0)	1	0

Summary of Results (Language Tasks):

Comparison is done based on the best metric score (Test accuracy) across 3 runs.

Activation Function	Mish > Baseline Model	Mish < Baseline Model
Penalized TanH	5	0
ELU	5	0
Sigmoid	5	0
SReLU	4	0
TanH	4	1
Swish	3	2
ReLU	2	3
Leaky ReLU	2	3
GELU	1	2

Try It!

Torch	DarkNet	Julia	FastAI	TensorFlow	Keras	CUDA
Source	Source	Source	Source	Source	Source	Source

Acknowledgments: (Click to expand)

Thanks to all the people who have helped and supported me massively through this project who include:

And many more including the Fast AI community, Weights and Biases Community, TensorFlow Addons team, SpaCy/Thinc team, Sicara team, Udacity scholarships team to name a few. Apologies if I missed out anyone.

Cite this work:

@article{misra2019mish,
  title={Mish: A self regularized non-monotonic neural activation function},
  author={Misra, Diganta},
  journal={arXiv preprint arXiv:1908.08681},
  year={2019}
}

mish's People

Contributors

Stargazers

Watchers

Forkers

soumik12345 geochri sagabanana meajagun binicodes ashishbairwa escanor1996 deepchatterjeevns dsnaveen amalbros uchebest iamshivamjaiswal rozgo stjordanis trendingtechnology hdocmsu gurupavan12 natitaye scape1989 liyunfei1994 lrosique joggyjagz7 tejamoy linhduongtuan lazycrazyowl xgenietony alimiraftab menguangwen-cn-0411 caojinpei thebhatman coolsunxu hanson-young garancerichard tonylibing abelkm99 guanlongtianzi anandprabhakar0507 zlannnn aipakchoi fengxingxiang prabindh gspuniani zlapp mbanuelos leiyan1225 saber5433 yueyedeai liuguoyou yehaizi1995 chaoso lennonmwy iamrishab ymlgithub leo-xxx tomawake mldl drahnreb lhutyra yuv4r4j rheehot marianophielipp jeonsworld njcurtis3 sachinprabhu007 musbell amirunpri2018 rasha-salim kawanda nazmul02 sweaterr jrborelli noguchi999 hide5stm peaceiris capinwind chomolungma sailfish009 csjiangwm xxzcc waterbearbee zerojuzi anandk025 riceshrimp adityak2920 hell-to-heaven cloudy4next saswat0 yueyub yj2victory whwme rmohanty61 ai-stuff y78h11b09 anshuman910 sinason yangxue0827 wuxiaolianggit zumbalamambo solameow jauhnni

mish's Issues

Small rant on the inertia of AI research

Hi!
This is not an issue per se and can be closed.

First of all, thank you for advancing progress in deep learning.

I'm just a random guy that want to implement an AGI (lol) and like many Nlp engeeners, I need HIGHLY accurate neural networks for fundamental NLP tasks (e.g POS tag, NER, dep parsing, Coref resolution, WSD, etc)
They are all not very accurate (often sub 95% F1 score) and their errors add up.

Such limitations make Nlp not yet suitable for many things.
This is why improving the state of the art (which can be observed on paperswithcode.com) is a crucial priority from academicians.

Effectively, many researchers have smart ideas to improve the state of the art and often slightly improve it by:
Having a "standard neural network" for the task and mix with it their new fancy idea.

I talk from knowledge, I've read most papers from state of the art leaderboards from most fundamental NLP tasks.
Almost always they have this common baseline + one idea, theirs.
The common baseline sometimes slowly evolve (e.g now it's often a pre trained model (say BERT) + fine tuning + their idea.

Sorry to say, but "this" is to me retarded
Where "this" mean the fact that by far, most researchers work in isolation, not integrating others ideas (or with such a slow inertia).
I would have wished that state of the art in one Nlp task would be a combination of e.g 50 innovative and complementary ideas from researchers.
You are researchers, do you have an idea why that is the case? If someone actually tried to merge all good complementary and compatible ideas, would they have the best, unmatchable state of the art?
Why facebookresearch, Microsoft, Google don't try the low hanging fruit in addition to producing X new shiny ideas per month, actually try to merge them in a coherent, synergetic manner??
I would like you to tell me what you think of this major issue that slow AI progress.

As an example of such inertia let's talk about Swish, Mish or RAdam :
Those things are incredibly easy to try and see "hey does it give to my neural network free accuracy gains?"
Yet not any paper on state of the art leaderboards has tried Swish, Mish or RAdam despite being soo simple to try (you don't need to change the neural network)
Not even pre trained models where so many papers depend on them (I opened issues for each of them).

Thank you for reading.

Issue about Table.1

Dear authors:
Thanks for your contribution in activation functions.
Now I have a question about a statistical index "mean loss" in Table.1. Could you please tell me what the meaning is?
Thanks again. And looking forward to your reply.

Application of Mish on ReXNet

ReXNet: Diminishing Representational Bottleneck on Convolutional Neural Network
This paper addresses representational bottleneck in a network and proposes new design principles.
The authors apply this method on MobileNetV1 and V2 and gain performance that scales much better than EfficientNets

One of the tools they use is Swish. In this instance, they actually provide theoretical and empirical justification for this. (This involves the rank of inputs and outputs, and is explained within the paper)

The authors do not discuss or use Mish however, which is shown to be better in extensive use cases.

Since the ReXNet is quite simple, it would seem to be relatively straight forward to conduct studies investigating the replacement of swish with mish in the relevant settings.

I hope this is of interest to you.

implementing pytorch mish inplace

Hi, do you have any tips on how to implement Mish as an inplace operation in PyTorch? Could this work? Is there a way of making the F.softplus() operation also inplace?

class Mish_(nn.Module):
    def __init__(self):
        super().__init__()
        
    def forward(self, x):
        softplus_res = F.softplus(x)
        torch.tanh_(softplus_res)
        return x * softplus_res

"Unexpected keyword argument" error when loading model with TF-keras version of Mish

Hi. I trained a model with the TF-Keras version of MIsh and saved it. However, I cannot figure out how to load the Mish custom layer object when attempting to load the model.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-99cc25741e5d> in <module>
    208     learning_rate=1e-4, base_lr=1e-4, max_lr=2e-4,
    209     img_size=300, train_img_folder="data/train_supercropboy/", valid_img_folder="data/train_supercropboy/",
--> 210     df_valid=None, pretrain=False, freeze_weights=False, freeze_weights_epochs=5)

<ipython-input-36-99cc25741e5d> in regression_cv(df, nfolds, epochs, batch_size, learning_rate, base_lr, max_lr, train_img_folder, valid_img_folder, img_size, df_valid, start_fold, pretrain, freeze_weights, freeze_weights_epochs)
    137         # Create model
    138 
--> 139         model = create_model(img_size=img_size, pretrain=pretrain, freeze_weights=freeze_weights)
    140 
    141         if freeze_weights:

<ipython-input-36-99cc25741e5d> in create_model(img_size, pretrain, freeze_weights)
     22         model = load_model("effnet_pretrain.h5",
     23                           custom_objects = {"root_mse": root_mse,
---> 24                                             "Mish": Mish})
     25 
     26     return(model)

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\saving\save.py in load_model(filepath, custom_objects, compile)
    144       h5py is not None and (
    145           isinstance(filepath, h5py.File) or h5py.is_hdf5(filepath))):
--> 146     return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
    147 
    148   if isinstance(filepath, six.string_types):

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py in load_model_from_hdf5(filepath, custom_objects, compile)
    210     model_config = json.loads(model_config.decode('utf-8'))
    211     model = model_config_lib.model_from_config(model_config,
--> 212                                                custom_objects=custom_objects)
    213 
    214     # set weights

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\saving\model_config.py in model_from_config(config, custom_objects)
     53                     '`Sequential.from_config(config)`?')
     54   from tensorflow.python.keras.layers import deserialize  # pylint: disable=g-import-not-at-top
---> 55   return deserialize(config, custom_objects=custom_objects)
     56 
     57 

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\layers\serialization.py in deserialize(config, custom_objects)
     87       module_objects=globs,
     88       custom_objects=custom_objects,
---> 89       printable_module_name='layer')

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py in deserialize_keras_object(identifier, module_objects, custom_objects, printable_module_name)
    190             custom_objects=dict(
    191                 list(_GLOBAL_CUSTOM_OBJECTS.items()) +
--> 192                 list(custom_objects.items())))
    193       with CustomObjectScope(custom_objects):
    194         return cls.from_config(cls_config)

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\engine\network.py in from_config(cls, config, custom_objects)
   1119     # First, we create all layers and enqueue nodes to be processed
   1120     for layer_data in config['layers']:
-> 1121       process_layer(layer_data)
   1122     # Then we process nodes in order of layer depth.
   1123     # Nodes that cannot yet be processed (if the inbound node

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\engine\network.py in process_layer(layer_data)
   1103       from tensorflow.python.keras.layers import deserialize as deserialize_layer  # pylint: disable=g-import-not-at-top
   1104 
-> 1105       layer = deserialize_layer(layer_data, custom_objects=custom_objects)
   1106       created_layers[layer_name] = layer
   1107 

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\layers\serialization.py in deserialize(config, custom_objects)
     87       module_objects=globs,
     88       custom_objects=custom_objects,
---> 89       printable_module_name='layer')

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py in deserialize_keras_object(identifier, module_objects, custom_objects, printable_module_name)
    192                 list(custom_objects.items())))
    193       with CustomObjectScope(custom_objects):
--> 194         return cls.from_config(cls_config)
    195     else:
    196       # Then `cls` may be a function returning a class.

~\Anaconda3\envs\r-tensorflow\lib\site-packages\tensorflow\python\keras\engine\base_layer.py in from_config(cls, config)
    444         A layer instance.
    445     """
--> 446     return cls(**config)
    447 
    448   def compute_output_shape(self, input_shape):

TypeError: __init__() got an unexpected keyword argument 'name'

Any advice on how to successfully load a saved model?

might as well include analysis for Flatten-T Swish in future

Flatten-T Swish is relu(x) * sigmoid(x) - 0.2
That activation showed better results than vanilla Swish in my experiments.

Demo Jupyter Notebooks - Link "404" error

Page link for testing with Jupyter notebook leads to "404 page not found" error.

Extended Coverage of other CNNs

(As a side note)
There are papers that noted information regarding the larger variety of CNNs for ImageNet testing.

Need help with Mish-Metal

I read your research paper on Mish, and from the graphs I saw, it is clearly the best activation function.

You had two different implementations of Mish with widely different computational overheads. There were standard Mish and Mish-CUDA, where Mish-CUDA has almost zero computational cost. I'm optimizing DL4S for Metal, and looking to optimize Mish so that it can be the top-recommended function. However, I need to know how you implemented Mish-CUDA.

Could you please explain how Mish-CUDA was faster than the standard Mish, and how I might implement it in a generic GPGPU context? I have a lot of experience with assembly-level optimization and I know the differences between Apple GPUs/Metal and Nvidia GPUs/CUDA well, if that helps with your explanation.

Correct gain value during kaiming weight initialization

Hello! Great work on this activation function! I've been using it in some of my projects with great success.

I want to let you know I found what the gain should be set at during kaiming weight initialization for Mish.

I found this experimentally using this code:

import torch
import torch.nn.functional as F
import pandas as pd
import numpy as np

device = 'cpu'

def mish(x):
    return x * (torch.tanh(F.softplus(x)))

aa = []
bb = []
for n in range(100):
    with torch.no_grad():
        a = torch.randn(5000, 5000, device=device)
        b = a
        x = 0.0 + 0.00001 * n
        for i in range(10):
            l = torch.nn.Linear(5000, 5000, bias=False).to(device)
            torch.nn.init.kaiming_uniform_(l.weight, a=x)
            b = mish(l(b))
        aa.append(b.std().item())
        bb.append(x)
        print(x)
        print (f"in: {a.std().item():.8f}, out: {b.std().item():.8f}")
pd.DataFrame(data=aa, index=bb).plot(figsize=(20,8))

which was talked about here. The "a" hyperparameter for init.kaiming_uniform_ is not actually the gain but the negative slope of a leaky relu, so really I experimentally found the equivalent negative slope of mish for kaiming_uniform_ init. The actual gain is found internally by math.sqrt(2.0 / (1 + a ** 2)).

This is an example of the code output. I found through repeated experiments that 0.0003 results in the most consistently efficient throughput through the network, so it is almost the zero slop of relu but not quite. a=0 did produce okay results, as did say 0.001, but the best averaged over many runs is 0.0003. This is important because for deep networks the pytorch default value of sqrt(5) for initializing conv layers is not a good default value if using mish.

I now use something like

for m in self.modules():
    if isinstance(m, (nn.Conv1d, nn.Linear)):
        torch.nn.init.kaiming_uniform_(m.weight, a=0.0003)

More comparison with existing methods?

Just wondering if all Activation Functions have been addressed in the ReadME.

might as well include analysis for GELU in future

Gaussian Error Linear Units (GELUs)

Should be mish used before or after batchnorm?

Mish and alternatives, including my own

First, congratulation on your Mish paper accepted.

I've been thinking about activation functions, and how they can be improved, what properties are most important for a long time, and also inspired by your paper (I noticed the days old update), and e.g. recent SharkFin (my own unpublished idea has some similarities).

I was just thinking if I could ask you some questions, I'm not sure if this is the right place.

It seems like almost any function can do (all except polynominal proven to work for shallow wide networks, and that restriction eliminated by deep-narrow networks):

Universal Approximation with Deep Narrow Networks
https://arxiv.org/pdf/1905.08539.pdf

we show that the class of neural networks of arbitrary depth, width n+m+2, and activation function ρ, is dense [..] This covers every activation function possible to use in practice, and also includes polynomial activation functions [..]

We refer to these as enhanced neurons. [..]

4.2. Square Model
Lemma 4.3 One layer of two enhanced neurons, with square activation function, may exactly represent the multiplication function (x,y)→xy on R^2 [..]

Remark 4.7 Lemma 4.5 is key to the proof of Proposition 4.6. It was fortunate that the reciprocal function may be approximated by a network of width two - note that even if Proposition 4.6 were already known, it would have required a network of width three. It remains unclear whether an arbitrary-depth network of width two, with square activation function, is dense in C(K). [..]

Remark 4.8 Note that allowing a single extra neuron in each layer would remove the need for the trick with the reciprocal, as it would allow [..] Doing so would dramatically reduce the depth of the network. We are thus paying a heavy price in depth in order to reduce the width by a single neuron.

[I've yet to read much further, but this seems very important.]

So my reading is all but identity function can work as activation function (when more than one hidden layer), and a network no wider than 4 (or 5 better, optimal?) can approximate all four elementary arithmetic operation. Could also approximate e.g. sine and exponential with such narrow network (through Fourier-theorem), I think.

Have you looked at Capsule networks, and deep variant?
https://arxiv.org/pdf/1904.09546.pdf

May main worry is that by thinking about better activation functions (of if there can be one best), I'm wasting my time, with them and/or (traditional) backpropagation going away, with the thousand-brain theory and more. Capsule networks seem similar, with a voting mechanism. It at least has ReLU in the first layer (I didn't look at more in detail).

Have you looked at BERT and variants? I assume they could use your functions, or do you know if there are exceptions, making GELU better for them? I'm thinking it's maybe just ignorance (or authors extending, want to change one thing at a time):

https://arxiv.org/pdf/1909.11942.pdf

The backbone of the ALBERT architecture is similar to BERT in that it uses a transformer encoder (Vaswani et al., 2017) with GELU nonlinearities

The Reversible Residual Network: Backpropagation Without Storing Activations
https://arxiv.org/pdf/1707.04585.pdf

TF-Keras module name

I know that it is minor, but it would ease things out to rename the folder TF-Keras.
Then we can just git clone repo & import.

Otherwise python doesn't like dashes in module/package names.

Discussion regarding GAN benchmarks

For starters I would like to mention that https://github.com/kozistr/Awesome-GANs and https://github.com/nightrome/really-awesome-gan exists for people to check references.

how to use mish in tensorflow slim.conv2d????

slim.conv2d it is relu ,how to change it to mish?

PyTorch Mish - 1.5x slower training, 2.9X more memory usage vs LeakyReLU(0.1)

Hi, thanks for this interesting new activation function. I've tested it with YOLOv3-SPP on a V100 from https://github.com/ultralytics/yolov3 and have mixed feedback. The performance improves slightly, but the training time is much slower and the GPU memory requirements are much higher vs LeakyReLU(0.1). Any suggestions on how to improve speed/memory in PyTorch? Thanks!

From AlexeyAB/darknet#3114 (comment):

	[email protected]	mAP0.5:0.95	GPU memory	Epoch time
`LeakyReLU(0.1)`	48.9	29.6	4.0G	31min
`Mish()`	50.9	31.2	11.1G	46min

class Swish(nn.Module):
    def __init__(self):
        super(Swish, self).__init__()

    def forward(self, x):
        return x.mul_(torch.sigmoid(x))


class Mish(nn.Module):  # https://github.com/digantamisra98/Mish
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x.mul_(F.softplus(x).tanh())

Double the training time

My implementation is
inputs = inputs * tf.math.tanh(tf.math.softplus(inputs))
Tensorflow 1.14
When I use this activation function for my own model, the training time is doubled compared to swish

Why does mish return significantly worse result than elu?

I've tested mish vs elu on a simple feedforward network with L2 and dropout and got this result:

You can check it with Colab Notebook, press ctrl+f9.

Spelling error in repo description

"Repsoitory" is written instead of "Repository," didn't know if anyone had caught that or if it was intentional for whatever reason that I'm not aware of lol

{NameError}name 'uses_learning_phase' is not defined

When using in an RNN, I get {NameError}name 'uses_learning_phase' is not defined when applying a Dense layer on the output of an LSTM.

Tensorflow 1.14

Equivalent, faster (?) formulation

Hello, thanks for the great work. Using the exponential identity for tanh, you can remove two of the transcendental operations (exp, log) and get what, hopefully, should be a faster implementation.

Since

$$\tanh(x) = (e^2x - 1) / (e^2x + 1)$$

you can express Mish as:

$$y = e^x mish(x) = x y (y + 2) / (y^2 + 2 y + 2)$$

or equivalently (to avoid overflow when x is large)

$$y = e^-x mish(x) = x (1 + 2 y) / (1 + 2 y + 2 y^2)$$

NB: With a little tweak, there is an interesting connection to the GELU approximated with a logistic distribution ("Logistic Error Linear Unit"?) (i.e. Swish)

$$x \tanh(0.5 \log( 1 + e^x) ) = x \sigma(x - \log 2)$$

c.f. the approximation x \sigma(1.702 x) from the GELU paper.

Computational cost of Mish vs GELU vs Swish

What are the computational cost (CPU/GPU cycles per node) of Mish vs GELU vs Swish?
If it is possible to reduce the CPU/GPU cycles of the computation through simplification,
it can save both time and energy, leading to lower green house gas emissions.
Of course convergence is important, RELU isn't that computational heavy, but it is weak,
and right now the second derivative of both GELU and Swish are symmetric.

Error in TF-Keras mish

In function call: return inputs * tf.math.tanh(tf.math.softplus(x))
Should be inputs instead of x inside softplus

Source Code Implentation - "404" Error

Keras and Tensorflow links lead to "404" error.

Comparison with TanhExp

Through the mish landing page I found this paper
TanhExp: A Smooth Activation Function with High Convergence Speed for Lightweight Neural Networks Paper

Long story short it is an adjustment on mish.
In many of the evaluations that they have done, they show consistent improvement across many benchmarks over mish.

Whether this improvement holds upon other datasets (ImageNet) or networks remain to be seen.
The authors definitely have not done as extensive testing as the author of mish has.

I do have problem with the following section
"4.6 Comparison of the Computation Speed"

Here they argue that their TanhExp function is faster then mish in its original form(forward pass), 1st derivative(backward pass) and 2nd derivative form.

However, I would like to point out that for at least the original form, they used the original formulation of mish and not the (potentially) faster mish discussed here before.

I have conducted my own tests and found that for CPU calculations, fastmish is faster than tanhexp.
Unfortunately, I have not been able to test backward passes. But to the best of my knowledge, the 1st and 2nd derivatives do not differ between the original formulation and the faster formulation. (This is because they are based on an identity and the speed gain comes from CPU friendly formulas)

On GPU, TanhExp was faster than both versions of mish.

Since they do not have a separate github page I decided to at least let you know so that you could raise an objection if need be.
I tried thinking of a faster formulation of TanhExp to be fair, but it is beyond my knowledge.

To be frank I am interested in the future of TanhExp since it shows consistent accuracy and training speed and stability gains in the limited scope they tested compared to mish.

However, activation functions come and go real fast(remember swish?) and for simplicity's sake, ReLu isnt going anywhere. Also, previous investigations into this issue showed that minor differences in activation function speed might be eclipsed by other bottlenecks.

Anyway here is my notebook.
TanhExp.zip

When trying to change cifar-10 to cifar-100(In cifar-10-senet-18-mish) it raises RuntimeError: CUDA error: device-side assert triggered.

When trying to change cifar-10 to cifar-100(In cifar-10-senet-18-mish), I try to change the code in the second cell like this:
def get_training_dataloader(train_transform, batch_size=128, num_workers=0, shuffle=True):
""" return training dataloader
Args:
train_transform: transfroms for train dataset
path: path to cifar100 training python dataset
batch_size: dataloader batchsize
num_workers: dataloader num_works
shuffle: whether to shuffle
Returns: train_data_loader:torch dataloader object
"""

transform_train = train_transform
cifar100_training = torchvision.datasets.CIFAR100(root='.', train=True, download=True, transform=transform_train)
cifar100_training_loader = DataLoader(
    cifar100_training, shuffle=shuffle, num_workers=num_workers, batch_size=batch_size)

return cifar100_training_loader

define test dataloader

def get_testing_dataloader(test_transform, batch_size=128, num_workers=0, shuffle=True):
""" return training dataloader
Args:
test_transform: transforms for test dataset
path: path to cifar100 test python dataset
batch_size: dataloader batchsize
num_workers: dataloader num_works
shuffle: whether to shuffle
Returns: cifar100_test_loader:torch dataloader object
"""

transform_test = test_transform
cifar100_test = torchvision.datasets.CIFAR100(root='.', train=False, download=True, transform=transform_test)
cifar100_test_loader = DataLoader(
    cifar100_test, shuffle=shuffle, num_workers=num_workers, batch_size=batch_size)

return cifar100_test_loader

However, it raises RuntimeError: CUDA error: device-side assert triggered.
Could you help me of how to do cifar-100?

Besides, I'd like to know how to use stats.ipynb...Thanks a lot!!!

@digantamisra98

Visualization code hyperlink page is not found.

https://github.com/digantamisra98/Mish/blob/master/output_landscape.py is not found
Please fix it

The result is not good (Fixed, improved mAP)

Hi, I tried to use mish instead of relu on mask r-cnn, including res-net and three head, but the AP dropped 0.3 in the end. What is the possible problem?
Thank you!

Saving trained model (mish-Keras)

Hi, I am trying to save weights of trained model utilizing Keras model_checkpoint.

The error I am getting (last couple of rows):

File "C:\Users\SParkhonyuk\AppData\Local\Continuum\anaconda3\envs\DL\lib\site-packages\keras\engine\network.py", line 860, in get_config
layer_config = layer.get_config()

File "C:\Users\SParkhonyuk\repos...\mish.py", line 36, in get_config
return dict(list(base_config.items()) + list(config.items()))

NameError: name 'config' is not defined

Some global configuration is missing in mish.py ( in fact, it is a straightforward function).

Can you please direct me to what exactly needs to be added to make the model save-able?

SiLU is a more relevant baseline than Swish

Although Swish by Google researchers is more popular, it is the result of NAS and not very well-justified. Actually Swish is a modification of the earlier activation function SiLU (Sigmoid-weighted linear units), which shows better results than Swish on several benchmarks and simpler conceptually. Moreover, SiLU is more theoretical justified, for example the idea of self-regularization was proposed. Please see the following paper.

Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118, 2017.