Trained BERT models perform unpredictably on test set

Data Science: Asked by PeterPaul on April 1, 2021

We are training a BERT model (using the Huggingface library) for a sequence labeling task with six labels: five mark tokens that belong to one of the classes we are interested in, and one marks tokens that do not belong to any class.
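For concreteness, a minimal sketch of this kind of setup with the Huggingface library; the checkpoint name ("bert-base-cased") and the label names are placeholders, not our actual configuration:

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Five classes of interest plus one "outside" label, as described above.
    labels = ["CLASS_1", "CLASS_2", "CLASS_3", "CLASS_4", "CLASS_5", "O"]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-cased",
        num_labels=len(labels),
        id2label=dict(enumerate(labels)),
        label2id={label: i for i, label in enumerate(labels)},
    )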

Generally speaking, this works well: the loss decreases with each epoch, and we get good enough results. However, if we compute precision, recall and F-score on a test set after each epoch, we see that they oscillate quite a bit. We train for 1,000 epochs, and performance seems to plateau after about 100 epochs. During the remaining 900 epochs, precision jumps between seemingly random values from 0.677 to 0.709, and recall between 0.729 and 0.798. The model does not seem to stabilize.
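A minimal sketch of how such per-epoch metrics can be computed with scikit-learn on flattened token-level label ids; treating id 5 as the "no class" label is an assumption:

    from sklearn.metrics import precision_recall_fscore_support

    def evaluate(y_true, y_pred):
        # Exclude the "no class" label (id 5) so the metrics only
        # reflect the five classes of interest.
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=[0, 1, 2, 3, 4], average="micro"
        )
        return precision, recall, f1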
To mitigate the problem, we already tried the following:

  • We increased the size of our test data set.
  • We experimented with different learning rates and batch sizes.
  • We tried different transformer models from the Huggingface library, e.g. RoBERTa, GPT-2, etc.

None of this has helped.

Does anyone have any recommendations on what we could do here? How can we pick the “best model”? Currently, we pick the one that performs best on the test set, but we are unsure about this approach.

One Answer

BERT-style finetuning is known for its instability. Some aspects to take into account when dealing with this kind of issue are:

  • The number of epochs used to finetune BERT models is normally around 3.
  • The main source of instability is that the authors of the original BERT article suggested using the Adam optimizer but disabling its bias correction; this variant became known as "BertAdam".
  • Practitioners have since shifted from Adam to AdamW as the optimizer (see the sketch after this list).
  • It is typical to do multiple "restarts", that is, train the model multiple times and choose the one that performs best on the validation data.
  • Model checkpoints are normally saved after each epoch; the final model is the checkpoint with the best validation loss across all epochs of all restarts.
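A minimal sketch tying these points together, assuming a plain PyTorch training loop; make_model, train_one_epoch and validation_loss are hypothetical placeholders for your own code:

    import copy
    import torch

    NUM_RESTARTS = 5
    NUM_EPOCHS = 3  # BERT finetuning typically uses only a few epochs

    best_loss, best_state = float("inf"), None
    for restart in range(NUM_RESTARTS):
        model = make_model()  # hypothetical: build a fresh model per restart
        # torch.optim.AdamW keeps the Adam bias correction (unlike BertAdam)
        # and decouples weight decay from the gradient update.
        optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
        for epoch in range(NUM_EPOCHS):
            train_one_epoch(model, optimizer)  # hypothetical training step
            val_loss = validation_loss(model)  # hypothetical evaluation
            if val_loss < best_loss:
                # Keep the checkpoint with the best validation loss seen
                # across all epochs of all restarts.
                best_loss = val_loss
                best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)

Saving each checkpoint to disk instead of keeping the best state in memory works just as well and is safer for long runs.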

There are two main articles that study BERT-like finetuning instability and that may be of use to you. They describe in detail most of the aspects mentioned above:

  • Mosbach et al., "On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines" (arXiv:2006.04884)
  • Zhang et al., "Revisiting Few-sample BERT Fine-tuning" (arXiv:2006.05987)

Correct answer by noe on April 1, 2021
