
Which learning rate should I choose?

Data Science Asked on August 10, 2021

I'm training a segmentation model, Unet++, on 2D images, and I am now trying to find the optimal learning rate.
The backbone of the model is ResNet34, the optimizer is Adam, and the loss function is the dice loss.
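
(For reference, the compilation step looks roughly like this; `build_unetpp` is just a placeholder for the actual Unet++ builder, and the input shape and learning rate are illustrative.)

import keras.backend as K
from keras.optimizers import Adam

def dice_loss(y_true, y_pred, smooth=1.0):
    # Soft dice loss: 1 - (2*intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return 1.0 - (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

# build_unetpp is a placeholder: any Unet++ builder with a ResNet34 backbone would do here
model = build_unetpp(backbone='resnet34', input_shape=(256, 256, 3))
model.compile(optimizer=Adam(2e-5), loss=dice_loss)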

I also use a few callback functions:

callbacks = [
            keras.callbacks.EarlyStopping(monitor='val_loss', patience=15, verbose=1, min_delta=epsilon, mode='min'),
            keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1, mode='min', cooldown=0, min_lr=1e-8),
            keras.callbacks.ModelCheckpoint(model_save_path, save_weights_only=True, save_best_only=True, mode='min'),
            keras.callbacks.CSVLogger(logger_save_path)
        ]

I plotted the curves of training loss over epochs for a few learning rates:

Loss function evolution over 24 epochs with learning rate equal to 2×10^-6

The validation and training losses seem to decrease slowly. On the other hand, the validation loss barely oscillates (it is almost always decreasing).

Loss function evolution over 8 epochs with learning rate equal to 2×10^-5

The validation and training losses decreased quickly over the first 2 or 3 epochs. After 6 or 7 epochs, the validation loss starts increasing again.

Loss function evolution over 4 epochs with learning rate equal to 1×10^-4
Loss function evolution over 16 epochs with learning rate equal to 5×10^-4

I have a few questions (I hope it is not too many):

  • What is normally the best way to find the learning rate? That is:
    • How many epochs should I wait before considering that a learning rate isn't good?
    • What are the criteria on the loss function for deciding that a learning rate is "good"?
  • Is there a big difference between using a small learning rate (which still converges) and the "optimal" learning rate?
  • Is it normal for the validation loss to oscillate during training?
  • Which learning rate should I use according to my results?

Even a partial response would help me a lot.

One Answer

I am afraid that, besides the learning rate, there are many hyperparameters you need to choose values for, especially if you are using Adam optimization.

A principled order of importance for tuning is as follows:

  1. Learning rate (α).
  2. Momentum term (β), number of hidden units in each layer, batch size.
  3. Number of hidden layers, learning rate decay.

To tune a set of hyperparameters, you need to define a range that makes sense for each parameter. Given the number of different values you want to try (according to your budget), you can then pick hyperparameter values by random sampling.

Specifically for the learning rate, though, you may want to try a wide range of values, e.g. from 0.0001 to 1. Rather than sampling uniformly between 0.0001 and 1, sample $x$ uniformly from $[-4, 0]$ and set $\alpha = 10^x$, essentially following a logarithmic scale.
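
As a quick sketch of that log-scale sampling (plain NumPy, the sample count here is arbitrary):

import numpy as np

# Draw 10 exponents uniformly from [-4, 0] and map them to learning rates 10**x
x = np.random.uniform(low=-4.0, high=0.0, size=10)
learning_rates = np.sort(10.0 ** x)
print(learning_rates)  # values between 1e-4 and 1, spread evenly across the decades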

As far as the number of epochs goes, you should set an early-stopping callback with a patience of roughly 50, depending on your "exploration" budget. This means you give up training with a certain learning rate value if there is no improvement for a defined number of epochs.
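
In Keras that could look roughly like this (the patience value is just the rough figure mentioned above):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=50,
                           restore_best_weights=True, verbose=1)
# pass it to training with model.fit(..., callbacks=[early_stop])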

Hyperparameter tuning for neural networks is a form of art, one could say. For this reason I suggest you look at basic methodologies for non-manual tuning, such as grid search and random search (GridSearchCV and RandomizedSearchCV in the sklearn package). Additionally, it may be worth looking at more advanced techniques such as Bayesian optimisation with Gaussian processes and Tree-structured Parzen Estimators. Good luck!


Randomized Search for parameter tuning in Keras

  1. Define a function that creates the model instance
# Assumed imports for the snippets below (standalone Keras plus the scikit-learn wrapper)
from scipy.stats import randint, uniform
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Dropout
from keras.optimizers import Adadelta
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
from sklearn.model_selection import RandomizedSearchCV

# Model instance
input_shape = X_train.shape[1]

def create_model(n_hidden=1, n_neurons=30, learning_rate=0.01, drop_rate=0.5, act_func='relu',
                 act_func_out='sigmoid', kernel_init='uniform', opt='Adadelta'):
    model = Sequential()
    model.add(Dense(n_neurons, input_shape=(input_shape,), activation=act_func,
                   kernel_initializer = kernel_init))
    model.add(BatchNormalization())
    model.add(Dropout(drop_rate))
    # Add as many hidden layers as specified in n_hidden
    for layer in range(n_hidden):
        # Each hidden layer has n_neurons units
        model.add(Dense(n_neurons, activation=act_func, kernel_initializer = kernel_init))
        model.add(BatchNormalization())
        model.add(Dropout(drop_rate))
    model.add(Dense(1, activation=act_func_out, kernel_initializer = kernel_init))
    # The optimizer is rebuilt here with the requested learning rate,
    # so the `opt` argument (and the `opt` entry in the search space) is effectively ignored
    opt = Adadelta(lr=learning_rate)
    # f1_m is a custom F1-score metric assumed to be defined elsewhere
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=[f1_m])
    return model
  2. Define the parameter search space
params = dict(n_hidden=randint(4, 32),
              epochs=[50],
              n_neurons=randint(512, 600),
              act_func=['relu'],
              act_func_out=['sigmoid'],
              learning_rate=[0.01, 0.1, 0.3, 0.5],
              opt=['adam', 'Adadelta', 'Adagrad', 'Rmsprop'],  # note: overridden inside create_model
              kernel_init=['uniform', 'normal', 'glorot_uniform'],
              batch_size=[256, 512, 1024, 2048],
              drop_rate=uniform(0.1, 0.3))  # continuous uniform over [0.1, 0.4]
  3. Wrap the Keras model with the sklearn API and instantiate the random search
model = KerasClassifier(build_fn=create_model)
random_search = RandomizedSearchCV(model,  params, n_iter=5, scoring='average_precision', 
                                   cv=5)
  4. Search for the optimal hyperparameters
random_search_results = random_search.fit(X_train, y_train, 
                                          validation_data =(X_test, y_test),
                                          callbacks=[EarlyStopping(patience=50)])
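
Once the search has finished, the best configuration and its cross-validated score can be inspected, along with the refitted best model:

print(random_search_results.best_params_)
print(random_search_results.best_score_)
best_model = random_search_results.best_estimator_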

Correct answer by hH1sG0n3 on August 10, 2021
