Should I balance the classifier train/test sets if the metric is Precision/Recall (F1 score)?

Question

I want to train a classifier on an unbalanced data set. The proportions of classes C0/C1 are 65/35. Importantly, the success metric is F1_score. In other words, correct classification of class 1 (precision, recall) is important, while identification of class 0 (specificity, true negative rate) is far less important.
Most classifiers expect balanced sets (50/50), so the literature suggests starting with under/oversampling the train data. But what about the test set and the validation set? I think those should rather reflect reality. So even though the train set would be rebalanced (50/50), I understand that both the test and validation sets should keep the 65/35 ratio of classes.
And here lies the problem: the above does not seem to make sense if we focus on F1 (Precision + Recall). Suppose I train my classifier on that 50/50 train set. If my test set was also balanced 50/50, I would maybe get Accuracy = Precision = Recall = F1 = 0.8 (just as an example). But on the 65/35 test set, the results will immediately be worse: because class C0 is overrepresented in reality, the share of actual Negatives will be higher, and so the False Positives will grow proportionally. This will badly impact Precision and therefore F1. As a result, my F1_score will be 0.5 or so. So I wasted all the effort, because I trained the classifier on artificial data which did not reflect reality. This problem is described in more detail in the article linked below:
https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28
In summary, "we show the wrong proportions of the two classes to the classifier during the training. The classifier learned this way will then have a lower accuracy on the future real test data."
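The effect of class proportions on Precision and F1 can be checked with a little arithmetic. The sketch below is only an illustration under a strong assumption: the classifier's per-class rates (true positive rate and true negative rate, both set to 0.8 here to match the example above) stay the same when the class mix changes. The function name `metrics_at_prevalence` is made up for this example.

```python
def metrics_at_prevalence(tpr, tnr, positive_share):
    """Precision, recall and F1 for fixed per-class rates.

    tpr: fraction of actual positives predicted positive (recall)
    tnr: fraction of actual negatives predicted negative (specificity)
    positive_share: fraction of the evaluation set that is class 1
    """
    tp = tpr * positive_share              # true positives
    fn = (1 - tpr) * positive_share        # false negatives
    fp = (1 - tnr) * (1 - positive_share)  # false positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed rates: 0.8 for both classes, as in the 50/50 example.
balanced = metrics_at_prevalence(tpr=0.8, tnr=0.8, positive_share=0.50)
realistic = metrics_at_prevalence(tpr=0.8, tnr=0.8, positive_share=0.35)

for name, (p, r, f1) in [("50/50", balanced), ("65/35", realistic)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these assumed rates, precision falls from 0.8 to roughly 0.68 and F1 to roughly 0.74 on the 65/35 set, while recall is unchanged. The drop comes entirely from the extra false positives produced by the larger share of class 0.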
I share the author's observation, but I don't know what the way out is. The author suggests maybe not balancing the train set, which I am also doubtful of, as this brings in other obvious problems. To summarize my question: if the metric is F1_score, and if the real data is unbalanced (imbalanced), then how should one proceed to achieve an optimal F1_score for the trained model in the real scenario:

Should we really balance the train set (50/50)?
If so, should we also balance the test set (50/50)?
If so, should we also balance the validation set (50/50)?

What's your best practice? Also, are there other techniques for dealing with this situation that I should be aware of?

Data Man · Answer

The comment from @Dave helped, and I did some reading. For the benefit of other readers, I am posting a summary of my findings here. It is suggested that balancing the classes is not an optimal solution. Instead, I found three alternatives that various authors suggest:

Instead of Accuracy, Precision and Recall, proper scoring rules could be used. The most commonly used are the Brier score and the logarithmic score (see Wikipedia).

Use cost-sensitive learning, associating weights with classes either during or after training.

Pick a classifier that is robust against imbalanced classes. Examples include xgboost and adaboost.

Those are possible alternatives to balancing the classes for training, which has this fundamental deficiency: the data presented to the algorithm during training is far from reality, so the trained algorithm may not perform well in a real (unbalanced) environment.
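To make the first two alternatives more concrete, here is a minimal sketch in plain Python. It spells out the Brier score and the logarithmic score for predicted probabilities (scikit-learn offers these as `brier_score_loss` and `log_loss`), and then picks the decision threshold that maximizes F1 on an unbalanced validation set. Treating threshold selection as a way of applying class costs after training is my own reading, not necessarily what the cited authors meant. The helper names and the toy validation data (13 negatives, 7 positives, i.e. 65/35) are invented for illustration.

```python
import math


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def log_score(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood (log loss); probabilities are clipped."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def f1_at_threshold(y_true, p_pred, threshold):
    """F1 for class 1 when predicting 1 whenever p >= threshold."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, p_pred) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, p_pred):
    """Try every predicted probability as a threshold, keep the best F1."""
    candidates = sorted(set(p_pred))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, p_pred, t))


# Toy validation set kept at the real 65/35 ratio (made-up probabilities).
y_val = [0] * 13 + [1] * 7
p_val = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.30, 0.35,
         0.40, 0.45, 0.60,                           # class 0
         0.30, 0.45, 0.55, 0.65, 0.70, 0.80, 0.90]   # class 1

print("Brier score:", round(brier_score(y_val, p_val), 4))
print("Log score:  ", round(log_score(y_val, p_val), 4))

best_t = best_f1_threshold(y_val, p_val)
print("F1 at 0.5:        ", round(f1_at_threshold(y_val, p_val, 0.5), 3))
print("Best threshold:   ", best_t)
print("F1 at best choice:", round(f1_at_threshold(y_val, p_val, best_t), 3))
```

The point of the sketch is that the scoring rules evaluate the predicted probabilities directly, and the threshold is tuned on data with the real class ratio, so no resampling of the training set is involved. In practice the threshold chosen on the validation set should then be confirmed on the untouched test set.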
