
What could be the reasons that make validation loss jump up and down?

Cross Validated Asked by Haitao Du on November 2, 2021

I am building an image classification model with a reasonably sized dataset (~3K images in each of the training and validation sets). However, I noticed that performance on the validation set is not stable.

For example, here are the outputs over 10 epochs (acc means accuracy; this is binary classification on balanced data):

epoch [1]: training loss: 2.27 - acc: 0.50 - val_loss: 3.29 - val_acc: 0.49
epoch [2]: training loss: 1.55 - acc: 0.50 - val_loss: 0.92 - val_acc: 0.50
epoch [3]: training loss: 1.07 - acc: 0.51 - val_loss: 1.43 - val_acc: 0.53
epoch [4]: training loss: 0.87 - acc: 0.58 - val_loss: 1.85 - val_acc: 0.61
epoch [5]: training loss: 0.59 - acc: 0.72 - val_loss: 0.58 - val_acc: 0.61
epoch [6]: training loss: 0.52 - acc: 0.79 - val_loss: 2.30 - val_acc: 0.50
epoch [7]: training loss: 0.38 - acc: 0.85 - val_loss: 0.17 - val_acc: 0.86
epoch [8]: training loss: 0.32 - acc: 0.88 - val_loss: 1.52 - val_acc: 0.60
epoch [9]: training loss: 0.21 - acc: 0.91 - val_loss: 0.14 - val_acc: 0.88
epoch [10]: training loss: 0.34 - acc: 0.88 - val_loss: 2.81 - val_acc: 0.49

We can see that training seems fine, but at epochs 6 and 8 the validation loss was very high, and by the final epoch (10) it got so high that the model became useless.

What could be causing this? If the model is overfitting the training data, why do we not see a steady increase in validation loss?

One Answer

My mental model is that NN loss surfaces are narrow valleys: they have steep sides, but the bottom of the valley shows only a shallow decline. In particular, the steepness of the sides means that the direction of steepest descent tends to be dominated by the sides rather than by the shallow decline at the bottom. So a learning rate that is too large will tend to jump from one side of the valley to the other, while still making less-pronounced progress toward the minimum -- moving mostly side-to-side, with a slow drift along the valley floor.
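This side-to-side behavior can be seen on a toy 2-D quadratic whose curvature is 100x steeper in one direction than the other. This is a hedged sketch of the geometry described above, not the questioner's actual model:

```python
# Plain gradient descent on a "narrow valley": a quadratic bowl that is
# 100x steeper in x than in y (an illustrative assumption).

def descend(lr, steps=50, x=1.0, y=1.0):
    """Run gradient-descent updates; return the x-trace and the losses."""
    xs, losses = [], []
    for _ in range(steps):
        # gradient of 0.5 * (100 * x**2 + y**2) is (100 * x, y)
        x, y = x - lr * 100 * x, y - lr * y
        xs.append(x)
        losses.append(0.5 * (100 * x**2 + y**2))
    return xs, losses

# Large rate: the x update multiplies by (1 - 100 * 0.0199) = -0.99, so x
# flips sign every step -- the iterate jumps from one valley wall to the
# other -- while y only creeps toward the floor.
xs_big, loss_big = descend(lr=0.0199)

# Small rate: the x update multiplies by (1 - 100 * 0.009) = 0.1, so x
# decays smoothly to the valley floor and the loss falls steadily.
xs_small, loss_small = descend(lr=0.009)
```

With the large rate the parameters ricochet between the walls and the loss decreases only slowly; with the small rate, the loss after 50 steps is far lower.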

Moreover, you've only reported the results at each epoch's end, not progress within an epoch. My hypothesis is that within an epoch, the training loss is fluctuating widely, but reporting only the mean discards information about those fluctuations. As further evidence, there's a hint that when the validation accuracy is low, the training accuracy tends to be lower as well (though not as low as the validation accuracy). This is consistent with my hypothesis. When we observe a large value of validation loss, we're just seeing the "snapshot" corresponding to wherever the parameters happen to be at that time. While the mean of the training loss suppresses this fluctuation, the validation loss exposes it, because the parameters aren't changing during validation, so we're not averaging over many different parameter values.

Tracking the training loss within epochs could confirm or disconfirm this hypothesis. (As an aside, measuring training statistics at every mini-batch could consume too much memory if you have a large dataset and/or a small mini-batch size. Instead, every $k > 1$ mini-batches, record two pieces of data:

  1. the most recent loss value, and
  2. the mean of the most recent $k$ mini-batch losses.

Choose the smallest $k$ that doesn't consume too much memory.)
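The bookkeeping in that aside can be sketched in a few lines; `batch_losses` here is a hypothetical stream standing in for the per-mini-batch losses a training loop would compute:

```python
from collections import deque

def track_losses(batch_losses, k):
    """Every k mini-batches, record (most recent loss, mean of the last k).

    `batch_losses` is any iterable of per-mini-batch loss values -- a
    stand-in for what the training loop would produce.
    """
    window = deque(maxlen=k)  # keeps only the last k loss values
    records = []
    for step, loss in enumerate(batch_losses, start=1):
        window.append(loss)
        if step % k == 0:
            records.append((loss, sum(window) / k))
    return records

# A widely fluctuating stream: the gap between each snapshot and its
# k-batch mean is exactly the fluctuation that an epoch-level mean hides.
records = track_losses([1.0, 3.0, 2.0, 6.0], k=2)  # [(3.0, 2.0), (6.0, 4.0)]
```

If the snapshots routinely sit far from their running means, that supports the hypothesis that within-epoch fluctuation is what the validation loss is exposing.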

My hypothesis is that lowering the learning rate will allow smoother progress over the loss surface. Instead of jumping around the steep sides of the narrow valley, the optimizer will be closer to the valley floor, and make steadier progress.

Answered by Sycorax on November 2, 2021
