
How to deal with class imbalance problem in natural language processing?

Data Science: Asked on April 6, 2021

I am doing an NLP binary classification task using BERT with a softmax layer on top. The network is trained with cross-entropy loss.

When the ratio of the positive class to the negative class is 1:1 or 1:2, the model classifies both classes well (per-class accuracy is around 0.92).

When the ratio is between 1:3 and 1:10, the model performs poorly, as expected. At a 1:10 ratio, it classifies negative instances with 0.98 accuracy but positive instances with only 0.80 accuracy.

This behavior is expected: the model tends to push most or all instances toward the negative class because the positive-to-negative ratio is 1:10.

I just want to ask: what is the recommended way to handle this kind of class imbalance problem specifically in natural language processing?

I have seen suggestions to change the loss function or to perform up-/down-sampling, but most of them target class imbalance in computer vision rather than NLP.
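For reference, the loss-function change I have seen suggested most often is class-weighted cross-entropy. A minimal PyTorch sketch of that idea (the counts, shapes, and tensors below are placeholders, not my actual data or model):

    # Illustrative sketch: class-weighted cross-entropy for a 1:10 imbalance.
    # The class counts and batch below are assumptions for demonstration only.
    import torch
    import torch.nn as nn

    num_neg, num_pos = 10_000, 1_000                      # assumed 1:10 ratio
    counts = torch.tensor([num_neg, num_pos], dtype=torch.float)
    weights = counts.sum() / (2 * counts)                 # inverse-frequency weights

    loss_fn = nn.CrossEntropyLoss(weight=weights)         # minority errors cost more

    logits = torch.randn(8, 2)      # stand-in for the classifier's raw logits
                                    # (CrossEntropyLoss applies softmax internally)
    labels = torch.randint(0, 2, (8,))
    loss = loss_fn(logits, labels)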

One Answer

Disclaimer: this answer might be disappointing ;)

In general, my advice would be to carefully analyze the errors the model makes and try to make the model handle those cases better. This can involve many different strategies depending on the task and the data. Here are a few general directions to consider:

  • Most of the time the imbalance is not the real problem; the real problem is why the model can't differentiate between the classes. Even with extreme imbalance, a model can perform very well if the classes are easy to discriminate. The imbalance only causes the model to assign the majority class when it doesn't have enough evidence to decide, so it resorts to the conservative choice.
  • If the minority class is really small in absolute terms, it's likely that there is not enough language diversity in the positive instances (data sparsity). This usually causes a kind of overfitting that can be hidden by the fact that the model almost always assigns the majority class. In this case the goal should be to treat the overfitting, so the first direction is to simplify the model and/or the data representation.
  • Sometimes it can make sense to consider alternative ML designs: a regular classification model relies, by design, on the distribution of the classes. Some alternative approaches might not be as influenced by the distribution, for example one-class classification (see the sketch after this list). Of course it's not suited for every problem.
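As a rough illustration of the one-class idea on text, a minimal scikit-learn sketch could look like the following; the example texts and hyperparameters are purely illustrative assumptions, not a recommended configuration:

    # Illustrative sketch: one-class classification on text.
    # Texts, features and hyperparameters below are placeholder assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    positive_texts = ["example of the minority class",
                      "another positive example",
                      "yet another positive instance"]
    new_texts = ["an unseen document", "another document to score"]

    vectorizer = TfidfVectorizer()
    X_pos = vectorizer.fit_transform(positive_texts)   # fit only on the class of interest

    clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    clf.fit(X_pos)                                     # learns a boundary around positives

    predictions = clf.predict(vectorizer.transform(new_texts))
    print(predictions)                                 # +1 = positive-like, -1 = outlier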

Overall, my old-school advice is not to rely too much on technical fixes such as resampling methods. They can make sense sometimes, but they shouldn't be used as some kind of magical answer instead of careful analysis.
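To make that concrete, a first pass at such an analysis might just be per-class metrics plus a look at the misclassified minority-class instances. A minimal scikit-learn sketch, where the tiny placeholder lists stand in for real predictions:

    # Illustrative sketch of per-class error analysis; the data is a placeholder.
    from sklearn.metrics import classification_report, confusion_matrix

    texts  = ["doc a", "doc b", "doc c", "doc d", "doc e"]   # placeholder documents
    y_true = [1, 0, 1, 0, 0]                                  # 1 = positive (minority)
    y_pred = [0, 0, 1, 0, 0]                                  # model output

    print(classification_report(y_true, y_pred, target_names=["negative", "positive"]))
    print(confusion_matrix(y_true, y_pred))

    # Inspect false negatives: positives the model pushed to the majority class.
    false_negatives = [t for t, yt, yp in zip(texts, y_true, y_pred)
                       if yt == 1 and yp == 0]
    print(false_negatives)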

Correct answer by Erwan on April 6, 2021
