
Which learning rate should I choose?

Data Science Asked on August 10, 2021

I'm training a segmentation model, Unet++, on 2D images, and I am now trying to find the optimal learning rate.
The backbone of the model is ResNet34, the optimizer is Adam, and the loss function is the dice loss.
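
(For reference, the compilation step looks roughly like this; `build_unetpp` is just a placeholder for the actual Unet++ builder, and the input shape and learning rate are illustrative.)

import keras.backend as K
from keras.optimizers import Adam

def dice_loss(y_true, y_pred, smooth=1.0):
    # Soft dice loss: 1 - (2*intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return 1.0 - (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

# build_unetpp is a placeholder: any Unet++ builder with a ResNet34 backbone would do here
model = build_unetpp(backbone='resnet34', input_shape=(256, 256, 3))
model.compile(optimizer=Adam(2e-5), loss=dice_loss)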

I also use a few callback functions:

callbacks = [
            keras.callbacks.EarlyStopping(monitor='val_loss', patience=15, verbose=1, min_delta=epsilon, mode='min'),
            keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1, mode='min', cooldown=0, min_lr=1e-8),
            keras.callbacks.ModelCheckpoint(model_save_path, save_weights_only=True, save_best_only=True, mode='min'),
            keras.callbacks.CSVLogger(logger_save_path)
        ]

I plotted the curves of training loss over epochs for a few learning rates:

Loss function evolution over 24 epochs with learning rate equal to 2×10^-6

The validation and training losses seem to decrease slowly. On the other hand, the validation loss barely oscillates (it is almost always decreasing).

Loss function evolution over 8 epochs with learning rate equal to 2×10^-5

The validation and training losses decreased quickly over the first 2 or 3 epochs. After 6 or 7 epochs, the validation loss starts increasing again.

Loss function evolution over 4 epochs with learning rate equal to 1×10^-4
Loss function evolution over 16 epochs with learning rate equal to 5×10^-4

I have a few questions (I hope it is not too many):

  • What is normally the best way to find the learning rate? That is:
    • How many epochs should I wait before considering that a learning rate isn't good?
    • What are the criteria on the loss function for deciding that a learning rate is "good"?
  • Is there a big difference between using a small learning rate (which still converges) and the "optimal" learning rate?
  • Is it normal for the validation loss to oscillate during training?
  • Which learning rate should I use according to my results?

Even a partial response would help me a lot.

One Answer

I am afraid that, besides the learning rate, there are many hyperparameters you need to choose values for, especially if you are using Adam optimization.

A principled order of importance for tuning is as follows:

  1. Learning rate (α).
  2. Momentum term (β), number of hidden units in each layer, batch size.
  3. Number of hidden layers, learning rate decay.

To tune a set of hyperparameters, you need to define a range that makes sense for each parameter. Given the number of different values you want to try (according to your budget), you can then pick hyperparameter values by random sampling.

Specifically for the learning rate, though, you may want to try a wide range of values, e.g. from 0.0001 to 1. Rather than sampling uniformly between 0.0001 and 1, sample $x$ uniformly from $[-4, 0]$ and set $\alpha = 10^x$, essentially following a logarithmic scale.
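
As a quick sketch of that log-scale sampling (plain NumPy, the sample count here is arbitrary):

import numpy as np

# Draw 10 exponents uniformly from [-4, 0] and map them to learning rates 10**x
x = np.random.uniform(low=-4.0, high=0.0, size=10)
learning_rates = np.sort(10.0 ** x)
print(learning_rates)  # values between 1e-4 and 1, spread evenly across the decades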

As far as the number of epochs goes, you should set an early-stopping callback with a patience of roughly 50, depending on your "exploration" budget. This means you give up training with a certain learning rate value if there is no improvement for a defined number of epochs.
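
In Keras that could look roughly like this (the patience value is just the rough figure mentioned above):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=50,
                           restore_best_weights=True, verbose=1)
# pass it to training with model.fit(..., callbacks=[early_stop])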

Hyperparameter tuning for neural networks is a form of art, one could say. For this reason I suggest you look at basic methodologies for non-manual tuning, such as grid search and random search (GridSearchCV and RandomizedSearchCV in the sklearn package). Additionally, it may be worth looking at more advanced techniques such as Bayesian optimisation with Gaussian processes and Tree-structured Parzen Estimators. Good luck!


Randomized Search for parameter tuning in Keras

  1. Define a function that creates the model instance
# Assumed imports for the snippets below (standalone Keras plus the scikit-learn wrapper)
from scipy.stats import randint, uniform
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Dropout
from keras.optimizers import Adadelta
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
from sklearn.model_selection import RandomizedSearchCV

# Model instance
input_shape = X_train.shape[1]

def create_model(n_hidden=1, n_neurons=30, learning_rate=0.01, drop_rate=0.5, act_func='relu',
                 act_func_out='sigmoid', kernel_init='uniform', opt='Adadelta'):
    model = Sequential()
    model.add(Dense(n_neurons, input_shape=(input_shape,), activation=act_func,
                   kernel_initializer = kernel_init))
    model.add(BatchNormalization())
    model.add(Dropout(drop_rate))
    # Add as many hidden layers as specified in n_hidden
    for layer in range(n_hidden):
        # Each hidden layer has n_neurons units
        model.add(Dense(n_neurons, activation=act_func, kernel_initializer = kernel_init))
        model.add(BatchNormalization())
        model.add(Dropout(drop_rate))
    model.add(Dense(1, activation=act_func_out, kernel_initializer = kernel_init))
    # The optimizer is rebuilt here with the requested learning rate,
    # so the `opt` argument (and the `opt` entry in the search space) is effectively ignored
    opt = Adadelta(lr=learning_rate)
    # f1_m is a custom F1-score metric assumed to be defined elsewhere
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=[f1_m])
    return model
  2. Define the parameter search space
params = dict(n_hidden=randint(4, 32),
              epochs=[50],
              n_neurons=randint(512, 600),
              act_func=['relu'],
              act_func_out=['sigmoid'],
              learning_rate=[0.01, 0.1, 0.3, 0.5],
              opt=['adam', 'Adadelta', 'Adagrad', 'Rmsprop'],  # note: overridden inside create_model
              kernel_init=['uniform', 'normal', 'glorot_uniform'],
              batch_size=[256, 512, 1024, 2048],
              drop_rate=uniform(0.1, 0.3))  # continuous uniform over [0.1, 0.4]
  3. Wrap the Keras model with the sklearn API and instantiate the random search
model = KerasClassifier(build_fn=create_model)
random_search = RandomizedSearchCV(model,  params, n_iter=5, scoring='average_precision', 
                                   cv=5)
  4. Search for the optimal hyperparameters
random_search_results = random_search.fit(X_train, y_train, 
                                          validation_data =(X_test, y_test),
                                          callbacks=[EarlyStopping(patience=50)])
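
Once the search has finished, the best configuration and its cross-validated score can be inspected, along with the refitted best model:

print(random_search_results.best_params_)
print(random_search_results.best_score_)
best_model = random_search_results.best_estimator_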

Correct answer by hH1sG0n3 on August 10, 2021
