Guidelines for selecting an optimizer for training neural networks

Data Science: Asked by mplappert on August 27, 2021

I have been using neural networks for a while now. However, one thing that I constantly struggle with is the selection of an optimizer for training the network (using backprop). What I usually do is just start with one (e.g. standard SGD) and then try others pretty much randomly. I was wondering if there’s a better (and less random) approach to finding a good optimizer, e.g. from this list:

  • SGD (with or without momentum)
  • AdaDelta
  • AdaGrad
  • RMSProp
  • Adam

In particular, I am interested if there’s some theoretical justification for picking one over another given the training data has some property, e.g. it being sparse. I would also imagine that some optimizers work better than others in specific domains, e.g. when training convolutional networks vs. feed-forward networks or classification vs. regression.

If any of you have developed some strategy and/or intuition on how you pick optimizers, I’d be greatly interested in hearing it. Furthermore, if there’s some work that provides theoretical justification for picking one over another, that would be even better.

4 Answers

My personal approach is to pick the newest optimizer (i.e. the one most recently published in a peer-reviewed venue), because its authors usually report results on standard datasets, beat the previous state of the art, or both. When I use Caffe, for example, I always use Adam.

Answered by mprat on August 27, 2021

  1. AdaGrad shrinks the learning rate aggressively for parameters that are updated frequently and keeps a larger learning rate for sparse parameters, i.e. those that are updated only rarely. In many problems the most critical information comes from features that are infrequent but informative, so if you are working with sparse data such as tf-idf features, AdaGrad can be useful.

  2. AdaDelta and RMSProp work along very similar lines; the main difference is that AdaDelta does not require an initial learning rate constant to start with.

  3. Adam combines the good properties of AdaDelta and RMSProp and hence tends to do better on most problems.

  4. Plain stochastic gradient descent is very basic and is seldom used on its own now. One problem is its single global learning rate: when parameters live on different scales, a low learning rate makes learning slow, while a large one can cause oscillations. SGD also generally has a hard time escaping saddle points, which AdaGrad, AdaDelta, RMSProp, and Adam tend to handle better. SGD with momentum adds some speed to the optimization and also helps escape poor local minima. The update rules behind these differences are sketched below.
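
For reference, here is a minimal NumPy sketch of the four update rules on a toy one-dimensional quadratic (the objective, starting point, and hyperparameters are illustrative placeholders, not tuned values):

    import numpy as np

    def grad(theta):
        # Gradient of the toy objective f(theta) = 0.5 * theta**2
        return theta

    lr, eps = 0.1, 1e-8
    theta = {name: 5.0 for name in ("sgd", "adagrad", "rmsprop", "adam")}
    G_ada = 0.0            # AdaGrad: running sum of squared gradients
    G_rms = 0.0            # RMSProp: decaying average of squared gradients
    m, v = 0.0, 0.0        # Adam: first and second moment estimates
    beta1, beta2, rho = 0.9, 0.999, 0.9

    for t in range(1, 101):
        # SGD: one global learning rate shared by every parameter
        theta["sgd"] -= lr * grad(theta["sgd"])

        # AdaGrad: the effective rate shrinks as squared gradients accumulate
        g = grad(theta["adagrad"])
        G_ada += g ** 2
        theta["adagrad"] -= lr * g / (np.sqrt(G_ada) + eps)

        # RMSProp: like AdaGrad, but the accumulator decays, so the rate does not vanish
        g = grad(theta["rmsprop"])
        G_rms = rho * G_rms + (1 - rho) * g ** 2
        theta["rmsprop"] -= lr * g / (np.sqrt(G_rms) + eps)

        # Adam: RMSProp-style scaling plus a momentum term, with bias correction
        g = grad(theta["adam"])
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta["adam"] -= lr * m_hat / (np.sqrt(v_hat) + eps)

    print(theta)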

Answered by Santanu_Pattanayak on August 27, 2021

Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)

Much like Adam is essentially RMSprop with momentum, Nadam is Adam with Nesterov momentum.
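
A minimal usage sketch, assuming the standalone Keras 2 API the signature above comes from (the model and its layers are placeholders, not part of the original answer; in tf.keras the argument names differ slightly, e.g. learning_rate instead of lr):

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import Nadam

    # Toy classifier; the point is only how the optimizer is plugged in.
    model = Sequential([
        Dense(64, activation="relu", input_shape=(100,)),
        Dense(10, activation="softmax"),
    ])

    model.compile(optimizer=Nadam(lr=0.002, beta_1=0.9, beta_2=0.999,
                                  epsilon=None, schedule_decay=0.004),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])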

Answered by rigo on August 27, 2021

According to Kingma and Ba (2014), Adam was developed for "large datasets and/or high-dimensional parameter spaces".

The authors claim that: "[Adam] combines the advantages of [...] AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives" (page 9).

In the paper, there are simulations comparing Adam to SGD with Nesterov momentum, AdaGrad, and RMSProp (MNIST and IMDB on page 6; CIFAR-10 with a ConvNet on page 7). Adam does very well compared to the others. The authors also find that Adam converges faster than AdaGrad in a convolutional network (5x5 convolution filters, 3x3 max pooling with stride 2, followed by a fully connected layer of 1000 rectified linear hidden units).
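
As an illustration only (a rough sketch along the lines of that description, not the exact model or training setup from the paper; filter counts, input shape, and learning rate are placeholders), such a ConvNet could be trained with Adam in Keras like this:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
    from keras.optimizers import Adam

    # 5x5 convolutions and 3x3 max pooling with stride 2, as described above,
    # followed by a fully connected ReLU layer; sized here for CIFAR-10-like input.
    model = Sequential([
        Conv2D(64, (5, 5), activation="relu", input_shape=(32, 32, 3)),
        MaxPooling2D(pool_size=(3, 3), strides=2),
        Conv2D(64, (5, 5), activation="relu"),
        MaxPooling2D(pool_size=(3, 3), strides=2),
        Flatten(),
        Dense(1000, activation="relu"),
        Dense(10, activation="softmax"),
    ])

    model.compile(optimizer=Adam(lr=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])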

Overall, Adam appears to be a good choice to start with for most (non-shallow) problems.

Answered by Peter on August 27, 2021
