
Which hyperparameters of a neural network can be tuned independently?

Data Science Asked by New Developer on December 3, 2020

Hyperparameter search is computationally expensive. I am wondering whether one can tune the hyperparameters independently: tune one hyperparameter while the other hyperparameters are held fixed. For example, let's say we have two hyperparameters, A and B. We search for the best value of A with B fixed at a random value, then we search for the best value of B with A fixed at its best value.
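
A minimal sketch of this coordinate-wise procedure (the hyperparameter values and the validation_loss function below are placeholders for an actual training run, not taken from any library):

    import random

    # Placeholder: train a model with hyperparameters a and b and
    # return its validation loss. Replace with a real training loop.
    def validation_loss(a, b):
        return (a - 0.3) ** 2 + (b - 0.7) ** 2 + 0.1 * a * b

    candidates_a = [0.1, 0.2, 0.3, 0.4, 0.5]
    candidates_b = [0.3, 0.5, 0.7, 0.9]

    # Step 1: tune A with B fixed at a random value.
    b_fixed = random.choice(candidates_b)
    best_a = min(candidates_a, key=lambda a: validation_loss(a, b_fixed))

    # Step 2: tune B with A fixed at its best value from step 1.
    best_b = min(candidates_b, key=lambda b: validation_loss(best_a, b))

    print("best A:", best_a, "best B:", best_b)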

This makes sense only if the other hyperparameters do not change the ordering of the validation loss with respect to the hyperparameter we want to tune. In that sense, the number of units and the number of layers cannot be tuned independently. According to Y. Bengio's paper (link), at some point the mini-batch size can be tuned independently (page 9, right column, "The Mini-Batch Size").
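
One way to make the "does not change the ordering" condition concrete is to check whether the ranking of one hyperparameter by validation loss is the same for every value of the other (reusing the placeholder validation_loss and candidate lists from the sketch above):

    # If the ranking of A-values is identical for every value of B,
    # then A can be tuned at any fixed B without changing the result.
    orderings = set()
    for b in candidates_b:
        ranked_a = tuple(sorted(candidates_a,
                                key=lambda a: validation_loss(a, b)))
        orderings.add(ranked_a)

    print("A can be tuned independently of B:", len(orderings) == 1)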

But what about the other ones? Learning rate, activation function, dropout, … which ones can be tuned independently?

2 Answers

Your question is indeed a little broad, but I will try to give you an overview of the relative importance of the hyper-parameters and of a few specific points.

Indeed, many hyper-parameters exist in the context of deep learning. At the same time, as Andrew Ng mentions in his courses, some are of bigger importance than others.

For instance, if you see that your training progresses very slowly (i.e. your convergence is relatively slow), you may want to fine-tune your learning rate.

The learning rate is a quintessential example of a hyper-parameter that is more important than the number of neurons in a FullyConnected layer or than changing the Dropout rate on a layer from 0.3 to 0.5.
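
As a toy illustration of why the learning rate dominates, here is plain gradient descent on a simple quadratic (a self-contained sketch, not from the course mentioned above): a tiny learning rate barely moves, a moderate one converges, and too large a one diverges.

    def gradient_descent(lr, steps=50):
        # Minimize f(w) = w**2 starting from w = 5; the gradient is 2*w.
        w = 5.0
        for _ in range(steps):
            w -= lr * 2 * w
        return w

    for lr in [0.001, 0.1, 0.9, 1.1]:
        print(f"lr={lr}: final w = {gradient_descent(lr):.4f}")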

At the same time, there are two well-known techniques for hyper-parameter search: grid search and random search. While the former behaves exactly as you described (keeping the values of N-1 hyper-parameters fixed and iterating over specific values of the N-th hyper-parameter), random search has proven to work better in practice, as it modifies all your hyper-parameters at each search step. Although it may not be intuitive at first sight, this can yield better results earlier than grid search.
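
A hedged sketch of the two strategies using scikit-learn's parameter iterators; the hyper-parameter names and ranges here are illustrative, and each trial would normally call your own training and evaluation code:

    from scipy.stats import loguniform, uniform
    from sklearn.model_selection import ParameterGrid, ParameterSampler

    # Grid search: every combination of a fixed set of values.
    grid = ParameterGrid({
        "learning_rate": [1e-4, 1e-3, 1e-2],
        "dropout": [0.3, 0.5],
    })

    # Random search: every trial redraws all hyper-parameters at once.
    sampler = ParameterSampler({
        "learning_rate": loguniform(1e-5, 1e-1),
        "dropout": uniform(0.1, 0.5),  # uniform on [0.1, 0.6)
    }, n_iter=6, random_state=0)

    for params in grid:
        print("grid trial:", params)
    for params in sampler:
        print("random trial:", params)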

Answered by Timbus Calin on December 3, 2020

As per the paper, the author has concluded:

"the wisdom distilled here should be taken as a guideline, to be tried and challenged,

not as a practice set in stone

. The practice summarized here, coupled with the increase in available computing power, now allows researchers to train neural networks on a scale that is far beyond what was possible at the time of the first edition of this book, helping to move us closer to artificial intelligence"

So we can't call it a set practice where we keep everything else fixed, tune only the learning rate first, and then keep the learning rate fixed and tune the rest. It doesn't even seem reasonable, knowing how gradient descent computes the errors and updates the weights.

By this:

"the mini-batch size can be tuned independently."

he seems to have meant that we can tune the other hyper-parameters with the mini-batch size fixed, take those values as guidelines, and then tune further. Hope it helps.

On another note, I don't see any tables or comparisons in the paper with results for the proposed techniques. Maybe you should contact the author to check your understanding.

Answered by BlackCurrant on December 3, 2020
