
How to do class balancing?

Data Science Asked on September 5, 2021

I am working with a really imbalanced dataset ($\approx$ 1% positive cases) for a classification problem. I know that class balancing is an important step in this scenario.

I have two questions:

  1. Considering that I don’t want to assign 0/1 labels, but just to order the records according to the output score (which is always a calibrated probability of being in the positive class), is it still a good idea to do class balancing, or is it useless given the output I need?

    Basically, I do not care about the cut-off point; I just sort the records to identify the ones with the highest probability of being positive.

  2. Considering the really small percentage of positive cases, is it better to do over-sampling or under-sampling? Is there any rule of thumb for deciding the resampling proportion?

Thank you in advance!

3 Answers

Some scikit-learn models have the option class_weight="balanced". With it, you tell the algorithm that your data are imbalanced, and it makes the adjustments by itself. You can try this on a few models; I got better results with this option than with downsampling the majority class on the same problem.
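A minimal sketch of what this looks like (the logistic regression and the synthetic data below are just my illustration, assuming a ~1% positive rate as in the question):

    # Sketch: class_weight="balanced" in scikit-learn.
    # The synthetic data mimics the ~1% positive rate from the question.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0
    )

    # "balanced" weights each class inversely to its frequency, so mistakes
    # on the rare positive class cost more during training.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train, y_train)

    # Scores usable for ranking; note that reweighting distorts calibration.
    scores = clf.predict_proba(X_test)[:, 1]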

Answered by BeamsAdept on September 5, 2021

Referring to a previous answer and a blog post (which I'm aware is not that relevant, since its data is more balanced than yours), I think your first approach should be to train without handling the imbalance; if you're happy with the results, there is no need to work towards balanced solutions.

As with many ML topics, the best way to find out is to try: I recommend adapting the experiment in the blog post to your data.

However, a more specific answer to your question:

  1. I think that balancing usually messes up the calibration of your classifier on the training data, so if you need calibrated predictions I would advocate not balancing. If you don't care about calibration, balancing is not that bad.
  2. Under-sampling has worked better than over-sampling in my experience. The amount of under- or over-sampling can be a hyperparameter to tune (see the sketch after this list).
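For example, a sketch of that tuning, assuming the imbalanced-learn package (the library and the synthetic data are my choices, not a requirement):

    # Sketch: treating the under-sampling ratio as a hyperparameter.
    from imblearn.pipeline import Pipeline
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)

    # The sampler runs only during fit, so cross-validation stays honest.
    pipe = Pipeline([
        ("under", RandomUnderSampler(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # sampling_strategy is the minority/majority ratio after resampling;
    # searching over it answers "how much to under-sample?" empirically.
    grid = GridSearchCV(
        pipe,
        param_grid={"under__sampling_strategy": [0.1, 0.25, 0.5, 1.0]},
        scoring="average_precision",  # ranking-oriented, suits rare positives
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_)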

Answered by David Masip on September 5, 2021

With such a heavy imbalance and two classes (it seems) you could treat this as more of an outlier detection problem. You should read up on models and algorithms in that direction!
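As a sketch of that framing, assuming scikit-learn's IsolationForest as the detector (one of several possible choices):

    # Sketch: treating rare positives as anomalies (detector choice is mine).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import IsolationForest

    X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)

    # Fit without labels; the forest isolates "unusual" points quickly.
    iso = IsolationForest(random_state=0).fit(X)

    # score_samples: lower means more anomalous, so negate it to rank
    # the most anomalous (candidate positive) records first.
    anomaly_rank = -iso.score_samples(X)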

If you go forward with a traditional classification approach, you need to balance the dataset; consider methods such as SMOTE.

Depending on the size of your data, I would generally recommend downsampling the majority class, which avoids producing "synthetic" cases; more advanced methods such as SMOTE essentially take this decision off your hands.
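A sketch of both options, assuming the imbalanced-learn package (my choice of library):

    # Sketch: downsampling vs. SMOTE (library choice is an assumption).
    from collections import Counter
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)

    # Downsampling: drop majority cases; no synthetic data is created.
    X_down, y_down = RandomUnderSampler(random_state=0).fit_resample(X, y)

    # SMOTE: interpolate between minority-class neighbours to create
    # synthetic positives instead of discarding data.
    X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)

    print(Counter(y), Counter(y_down), Counter(y_smote))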

Can you also elaborate on what you mean by your first question? A classification algorithm needs 0/1 labels, so the output cannot be ordered in the way you describe. Some classification algorithms output a probability score instead of a predicted label; if that is what you mean, I can tell you that the imbalance will still be a problem.

Answered by Fnguyen on September 5, 2021
