ReLU outperforming Softplus

Cross Validated. Asked by Mike Land on November 12, 2020

I have noticed that PyTorch models perform significantly better when ReLU is used instead of Softplus, with Adam as the optimiser.

How can it be that a non-differentiable function is easier to optimise than an analytic one? Is it true, then, that the optimisation is gradient-based in name only, and some kind of combinatorics is used under the hood?
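For reference, the non-differentiability in question is only at a single point. Here is a quick check, not from the original post, of what PyTorch's autograd returns there for each activation; the value at zero for ReLU is an implementation choice, so treat the printed numbers as assumptions about current behaviour.

```python
import torch
import torch.nn.functional as F

# Gradient of each activation at the non-smooth point x = 0.
x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)        # tensor(0.): a subgradient is chosen at the kink

y = torch.tensor(0.0, requires_grad=True)
F.softplus(y).backward()
print(y.grad)        # tensor(0.5000): softplus'(x) = sigmoid(x), so 0.5 at x = 0
```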

One Answer

ReLU is generally known to outperform many smoother activation functions. It is easy to optimize because it is piecewise linear: the gradient is exactly 1 for positive inputs and 0 for negative ones, so it is cheap to compute and does not saturate for positive activations. The non-differentiability at a single point (zero) is not a problem in practice: automatic differentiation simply uses a subgradient there, so this is still ordinary gradient-based optimization, with no combinatorics under the hood. The advantage of ReLU is usually speed, so it may well be that if you trained for more iterations, or used a different learning rate, batch size, or other hyperparameters, you would get similar results with Softplus.
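For concreteness, here is a minimal sketch, not part of the original answer, of the kind of comparison being discussed: the same small PyTorch model trained with Adam, once with ReLU and once with Softplus. The architecture, learning rate, and dummy data are arbitrary assumptions chosen only to make the snippet self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mlp(activation: nn.Module) -> nn.Sequential:
    # Identical architecture in both runs; only the activation differs.
    return nn.Sequential(nn.Linear(784, 256), activation, nn.Linear(256, 10))

for act in (nn.ReLU(), nn.Softplus()):
    torch.manual_seed(0)                       # same init and data for both runs
    model = make_mlp(act)
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(64, 784)                   # dummy batch of inputs
    y = torch.randint(0, 10, (64,))            # dummy labels
    for step in range(200):
        optimiser.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimiser.step()
    print(f"{act.__class__.__name__}: final loss {loss.item():.4f}")
```

On a toy task like this, both activations drive the loss down; the point of the answer is that any gap seen on a real task is about optimization speed and hyperparameter sensitivity, not about ReLU's kink breaking gradient descent.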

Answered by Tim on November 12, 2020
