Cross-validation scheme for an imbalanced dataset

Data Science Asked on November 26, 2020

Based on a previous post, I understand the need to ensure that the validation folds during the CV process preserve the same imbalanced class distribution as the original dataset when training a binary classification model on imbalanced data. My question is about the best training scheme.
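
In sklearn, this is exactly what stratified splitting provides: each validation fold keeps the original class ratio. A minimal sketch with toy data (the 900/100 labels and placeholder features are purely illustrative):

```python
# Minimal sketch: StratifiedKFold preserves the overall class ratio
# in every validation fold.
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 900 + [0] * 100)   # toy labels: 90% positive, 10% negative
X = np.zeros((len(y), 1))             # placeholder feature matrix

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for _, val_idx in cv.split(X, y):
    print(y[val_idx].mean())          # ~0.9 positive rate in each fold
```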

Let’s assume that I have an imbalanced dataset with 5M samples, 90% positive class vs 10% negative class, and I am going to use 5-fold CV for model tuning. Also, let’s assume I will hold out a random 100K samples for testing (90K positive vs 10K negative). Now I have two options:

Option 1)

  • Step 1: Randomly select 200K imbalanced samples for training (180K positive vs 20K negative)
  • Step 2: During each CV iteration:
    • The training fold will have 160K samples (144K positive vs 16K negative)
    • and the validation fold will have 40K samples (36K positive vs 4K negative)
  • Step 3: Apply a data balancing technique to the training fold (e.g., downsampling, upsampling, SMOTE) and fit a model (see the sketch after this list)
  • Step 4: Validate the model on the imbalanced validation fold
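
One way to implement Steps 3 and 4 without leaking resampled data into validation is an imbalanced-learn pipeline: the sampler is applied only when each training fold is fitted, so every validation fold keeps its original 90/10 distribution. A sketch, assuming `X` and `y` hold the 200K training samples (SMOTE and the average-precision scorer are illustrative choices, not the only options):

```python
# Sketch: balance only the training fold inside CV; validation folds
# are scored on their original, imbalanced distribution.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),          # applied to training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="average_precision")
print(scores.mean(), scores.std())
```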

However, given that I have enough data, I want to avoid using any data balancing algorithm for the training folds.

Option 2)

  • Step 1: Randomly select 200K balanced samples for training (100K positive vs 100K negative)
  • Step 2: During each CV iteration:
    • The training fold will have 160K samples (80K positive vs 80K negative)
    • and the validation fold will have 40K samples (20K positive vs 20K negative)
  • Step 3: Fit a model on the already balanced training fold
  • Step 4: Can I apply downsampling to the balanced validation fold to restore its original imbalanced ratio? If so, how can I do that in sklearn? (A sketch follows below.)
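
On Step 4: yes, you can restore the 9:1 ratio by keeping all samples of one class and randomly downsampling the other with `sklearn.utils.resample`. A sketch, assuming `X_val` and `y_val` are the 20K/20K balanced validation fold as NumPy arrays:

```python
# Sketch: restore a ~9:1 positive/negative ratio in a 50/50 validation fold
# by keeping all positives and randomly downsampling the negatives.
import numpy as np
from sklearn.utils import resample

pos = y_val == 1
X_pos, y_pos = X_val[pos], y_val[pos]      # 20K positives, all kept
X_neg, y_neg = X_val[~pos], y_val[~pos]    # 20K negatives

n_neg = len(y_pos) // 9                    # ~2,222 negatives -> 90%/10% split
X_neg_ds, y_neg_ds = resample(
    X_neg, y_neg, replace=False, n_samples=n_neg, random_state=42
)

X_val_imb = np.vstack([X_pos, X_neg_ds])
y_val_imb = np.concatenate([y_pos, y_neg_ds])
```

imbalanced-learn's `RandomUnderSampler` with a `sampling_strategy` dict of per-class counts is an equivalent alternative.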

I am also aware of a 3rd option, a variant of Option 1, where the model is trained directly on the imbalanced training fold, so that no data balancing algorithm is needed.

My questions are:

  1. Is option 2 better than option 1?
  2. How can I apply downsampling to a balanced validation dataset (Option 2, Step 4)?

One Answer

I'm not sure if there's a question here, but I'll add some comments.

Firstly, if you can get it in the wild, always work with balanced data. However, if you are going to construct a "balanced" data set manually, make sure that the selection criteria you use to create it are appropriate. For example, choosing the 100K most recent positive and the 100K most recent negative outcomes may not be appropriate, because the time frame of the rarer outcomes may extend well beyond that of the more common ones: in such a 200K data set, your 100K negative outcomes may come from the last year, while your 100K positive outcomes may span the last ten years.

Secondly, if you are going to balance your data, be aware of how the balancing technique works and try to understand its weaknesses and limitations. Be mindful that rebalancing produces a new data set, and remember that you will have to check that this new data set is still appropriate to use; for example, check that the distribution of the input variables is still roughly the same as before. Applying this thinking to your options above: can you be certain that the data in each fold will be roughly similar?
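
One concrete way to run that check is a two-sample Kolmogorov-Smirnov test on each input variable before and after rebalancing (or between folds). A sketch, assuming `feat_before` and `feat_after` are 1-D arrays holding the same feature:

```python
# Sketch: test whether a feature's distribution changed after rebalancing.
# A large p-value means no evidence that the two samples differ.
from scipy.stats import ks_2samp

stat, p_value = ks_2samp(feat_before, feat_after)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3f}")
```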

Lastly, if you are going to use a modelling framework that can handle imbalanced data, make sure you understand why it can handle it. In particular, if the framework applies some weighting or balancing technique in the background, you should be aware of this and be able to explain it.
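
sklearn makes this background weighting explicit: many estimators accept `class_weight="balanced"`, which reweights each class inversely to its frequency. A sketch, assuming `X_train` and `y_train` are the imbalanced training data:

```python
# Sketch: class_weight="balanced" sets weight_c = n_samples / (n_classes * n_c),
# so the rare class is upweighted instead of resampling the data.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
```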

Answered by bradS on November 26, 2020
