Python: Handling imbalanced classes in machine learning

Data Science Asked on August 4, 2021

I have a dataset for which I am trying to predict a target variable.

Col1    Col2    Col3    Col4    Col5    
  1      2       23      11     1
  2      22      12      14     1
  22     11      43      38     3
  14     22      25      19     3
  12     42      11      14     1
  22     11      43      38     2
  1      2       23      11     4
  2      22      12      14     2
  22     11      43      38     3

I have provided a sample of the data, but mine has thousands of records distributed in a similar way. Here, Col1, Col2, Col3, and Col4 are my features and Col5 is the target variable, so the prediction should be 1, 2, 3, or 4, as these are the values the target takes. I have tried algorithms such as random forests and decision trees for the predictions.

As you can see, the values 1, 2, and 3 occur far more often than 4. As a result, my model is biased towards 1, 2, and 3 and makes very few predictions for 4 (I got only one prediction for class 4, out of thousands of records, when I looked at the confusion matrix).

To make my model generalize better, I randomly removed an equal percentage of the records belonging to values 1, 2, and 3: I grouped by each value of Col5 and dropped a certain percentage from each group to bring down the number of records. After this I saw some increase in accuracy and a reasonable increase in the number of predictions for value 4 in the confusion matrix.
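The removal I did looks roughly like the minimal sketch below (assuming the data sits in a pandas DataFrame with the column names from the sample above; the file name and the per-class cap are just illustrative placeholders):

import pandas as pd

# Assumed setup: df holds data like the sample above, with features
# Col1-Col4 and the target in Col5.
df = pd.read_csv("data.csv")  # placeholder file name

# Keep at most n_per_class rows per target value; classes that already
# have fewer rows than the cap are kept in full.
n_per_class = 500  # illustrative cap
balanced = (
    df.groupby("Col5", group_keys=False)
      .apply(lambda g: g.sample(n=min(len(g), n_per_class), random_state=42))
)
print(balanced["Col5"].value_counts())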

Is this the right approach to deal with the problem (randomly removing data from the classes the model is biased towards)?

I also tried built-in algorithms from sklearn such as AdaBoost and gradient boosting, which I read are meant to handle class imbalance, but I could not improve my accuracy with them; only the random removal of data gave some improvement.

Is this reduction an undersampling technique, and is it the right way to undersample?

If my random removal is wrong, are there any predefined packages in sklearn, or any logic I can implement in Python, to do this properly?

I also learnt about the SMOTE technique, which deals with oversampling. Should I try it for value 4, and can it be done with any built-in Python package? It would be great if someone could help me with this situation.

6 Answers

It depends on the ensemble technique you want to use. The basic issue is that you are working with a multi-class imbalance problem. Under-sampling can be used efficiently in bagging as well as in boosting techniques, and the SMOTE algorithm is very efficient at generating new samples. The data imbalance problem has been widely studied in the literature; I recommend reading about one of these algorithms: SMOTEBoost, SMOTE-Bagging, RUSBoost, or EUSBoost. These are boosting/bagging techniques designed specifically for the imbalanced data problem. Instead of SMOTE you can also try ADA-SMOTE or Borderline-SMOTE; I have used a modified Borderline-SMOTE for multi-class problems and it is very efficient. If your dataset is very large and the problem is easy, try the Viola-Jones classifier. I have also used it on imbalanced data and it is really efficient.
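A minimal sketch of how some of these ideas can be tried from Python, assuming the imbalanced-learn package (which ships a RUSBoost-style classifier and Borderline-SMOTE); the synthetic data below only stands in for your Col1-Col4 features and Col5 labels:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.ensemble import RUSBoostClassifier
from imblearn.over_sampling import BorderlineSMOTE

# Synthetic 4-class imbalanced data standing in for the real features/labels
X, y = make_classification(n_samples=4000, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=4,
                           weights=[0.4, 0.3, 0.25, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

# Boosting with random under-sampling of the majority classes at each round
rusboost = RUSBoostClassifier(n_estimators=100, random_state=42)
rusboost.fit(X_train, y_train)
print("RUSBoost test accuracy:", rusboost.score(X_test, y_test))

# Borderline-SMOTE: over-sample the minority classes near the class
# boundary, then train any classifier on the resampled data
X_res, y_res = BorderlineSMOTE(random_state=42).fit_resample(X_train, y_train)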

Answered by Bashar Haddad on August 4, 2021

Some of sklearn's algorithms have a parameter called class_weight that you can set to "balanced". That way sklearn will adjust its class weights depending on the number of samples that you have of each class.

For the random forest classifier, try the following and see if it improves your score:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(class_weight="balanced")  # also add your other parameters!

Answered by stmax on August 4, 2021

This paper suggests using ranking (I wrote it). Instead of using, for instance, an SVM directly, you would use RankSVM. Since rankers compare observations against observations, training is necessarily balanced. There are two "buts", however: training is much slower, and, in the end, what these models do is rank your observations by how likely they are to belong to one class versus another, so you need to apply a threshold afterwards.

If you are going to use pre-processing to fix your imbalance, I would suggest you look into MetaCost. This algorithm builds a bagging of models and then changes the class priors to make them balanced based on the hard-to-predict cases. It is very elegant. The nice thing about methods like SMOTE is that by fabricating new observations you might make small datasets more robust.

Anyhow, even though I have written some things on class imbalance, I am still skeptical that it is an important problem in the real world. It would be very uncommon to have imbalanced priors in your training set but balanced priors in your real-world data. Do you? What usually happens is that type I errors cost something different from type II errors, and I would bet most people would be better off using a cost matrix, which most training methods accept, or which you can apply by pre-processing using MetaCost or SMOTE. I think "fixing the imbalance" is often shorthand for "I do not want to bother thinking about the relative trade-off between type I and type II errors."
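As a rough illustration of the cost-matrix idea in sklearn (which does not accept a full cost matrix directly, so per-class misclassification costs are approximated through class or sample weights; the cost values and toy data below are made up):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Toy stand-in data; replace with your own Col1-Col4 features and Col5 labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
y_train = rng.choice([1, 2, 3, 4], size=1000, p=[0.4, 0.3, 0.25, 0.05])

# Assumed costs: errors on class 4 are treated as ten times more expensive
costs = {1: 1.0, 2: 1.0, 3: 1.0, 4: 10.0}

# Option 1: per-class weights on the estimator itself
clf = RandomForestClassifier(class_weight=costs, random_state=42)
clf.fit(X_train, y_train)

# Option 2: the same idea via per-sample weights, for any estimator
# whose fit() accepts sample_weight
w = compute_sample_weight(class_weight=costs, y=y_train)
clf2 = RandomForestClassifier(random_state=42)
clf2.fit(X_train, y_train, sample_weight=w)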

Addendum:

I tried built-in Python algorithms like AdaBoost and gradient boosting using sklearn. I read these algorithms are for handling imbalanced classes.

AdaBoost gives better results for class imbalance when you initialize the weight distribution with the imbalance in mind. I can dig up the thesis where I read this if you want.

Anyhow, of course, those methods won't give good accuracies. Do you have class imbalance in both your training and your validation data? You should use metrics such as the F1 score, or pass a cost matrix to the accuracy function. "Fixing" class imbalance is for when your priors are different in your training and your validation cases.
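A small sketch of what that initialization can look like with sklearn: AdaBoostClassifier.fit accepts sample_weight, which is used as the starting weight distribution for boosting, so weighting samples inversely to their class frequency starts the run with balanced attention across classes (the toy data below is only a placeholder for your own features and Col5 labels):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Placeholder data standing in for the real features and labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
y_train = rng.choice([1, 2, 3, 4], size=1000, p=[0.4, 0.3, 0.25, 0.05])

# Initial boosting weights inversely proportional to class frequency
init_w = compute_sample_weight(class_weight="balanced", y=y_train)

ada = AdaBoostClassifier(n_estimators=200, random_state=42)
ada.fit(X_train, y_train, sample_weight=init_w)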

Answered by Ricardo Magalhães Cruz on August 4, 2021

Yes, this is a fine technique to tackle class imbalance. However, under-sampling methods do lead to a loss of information (say, you just removed an interesting pattern among the remaining variables that could have contributed to better training of the model). This is why over-sampling methods are often preferred, especially for smaller datasets.

In response to your query about Python packages, the imbalanced-learn toolbox is dedicated to exactly this task. It provides several under-sampling and over-sampling methods. I would recommend trying the SMOTE technique.
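A minimal sketch with imbalanced-learn, showing SMOTE over-sampling next to random under-sampling (the latter being the programmatic version of the manual row removal in the question); the toy arrays only stand in for your real data:

from collections import Counter
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy stand-in for the Col1-Col4 features and Col5 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = rng.choice([1, 2, 3, 4], size=2000, p=[0.4, 0.3, 0.25, 0.05])

# Over-sample the minority classes with synthetic examples
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_sm))

# Or randomly drop rows from the majority classes
X_us, y_us = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_us))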

Answered by Saurav-- on August 4, 2021

There are already some good answers here. I just thought I would add one more technique, since you appear to be using ensembles of trees. In many cases you want to optimize the lift curve or the AUC of the ROC. For this I would recommend the Hellinger distance criterion for splitting the branches in your trees. At the time of writing it is not in the imbalanced-learn package, but there appears to be a plan to add it.

Answered by Keith on August 4, 2021

When dealing with a class imbalance problem you should concentrate first on the error metric, and choose the F1 score as that metric.
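For example, with sklearn's metrics (y_test and y_pred are placeholders for your own test labels and model predictions):

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Placeholder labels and predictions for illustration only
y_test = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([1, 1, 2, 3, 3, 3, 1, 4])

print(classification_report(y_test, y_pred))
print("macro F1:", f1_score(y_test, y_pred, average="macro"))
print(confusion_matrix(y_test, y_pred))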

After choosing the correct metric, we can use different techniques for dealing with this issue.

If you are interested, you can look into this blog; it explains the techniques used to solve the class imbalance problem very nicely.

Answered by saisubrahmanyam janapati on August 4, 2021
