TransWikia.com

What is the definition of imbalanced data set

Data Science Asked by Alex P on November 11, 2020

I have thousands of data sources generating data from similar types of hardware. The different sources create different dynamics in the datasets, though!

Even though the features are the same, the datasets have very diverse characteristics.

I am working on a multiclass classification problem, trying to see how well specific models can tackle that domain.

The number of classes differs across data sources, so different models need to be built. That means that in the end I have many different models to evaluate: similar input, but a different number of classes to predict at the output.

Since this is a multiclass classification problem, I use things like confusion matrices and multiple ROC curves.
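
For reference, this is the kind of evaluation I mean: a minimal sketch with scikit-learn on toy data (dataset and parameters are illustrative only, not my actual setup):

```python
# Sketch: evaluating a multiclass classifier with a confusion matrix
# and one-vs-rest ROC AUC on synthetic 3-class data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_classes=3,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))       # rows: true, cols: predicted
auc = roc_auc_score(y_te, clf.predict_proba(X_te),   # one ROC curve per class,
                    multi_class="ovr")               # averaged one-vs-rest
print(cm)
print(auc)
```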

Now I am trying to see in more detail what might be causing poor performance in the poorest-performing models. Typically the reasons are:
1. not enough measurements
2. heavily imbalanced datasets
3. a combination of 1 and 2

The problem is that I do not have a definition of what an imbalanced dataset is in a multiclass problem. Ideally, if I could use a specific “rule” to label my datasets, I would be able to see things like the correlation between imbalance and precision.

When it comes to an imbalanced dataset with multiple classes, a single threshold is not enough, since it is the distribution of the available measurements across the classes that matters. I have no idea how to handle that.
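
To make this concrete, one candidate “rule” (just an illustration, not an established definition) would be to summarise the class distribution in a single number that works for any number of classes, e.g. the normalised Shannon entropy of the class proportions: 1.0 means perfectly balanced, values near 0 mean one class dominates, and a threshold on this score labels a dataset as imbalanced. A sketch:

```python
# Sketch: a single balance score for a multiclass label vector.
# Normalised Shannon entropy of class proportions: 1.0 = perfectly
# balanced, near 0 = one class dominates. The 0.8 threshold below
# is an arbitrary illustration, not a standard value.
import math
from collections import Counter

def balance_score(labels):
    counts = Counter(labels)
    k = len(counts)
    if k <= 1:
        return 0.0  # a single class carries no balance information
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)  # divide by the maximum entropy log(k)

balanced = [0, 1, 2] * 100              # 100 observations per class
skewed   = [0] * 290 + [1] * 5 + [2] * 5

print(round(balance_score(balanced), 3))  # -> 1.0
print(round(balance_score(skewed), 3))    # -> 0.154
is_imbalanced = balance_score(skewed) < 0.8
```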

How would you handle this case?

Thanks a lot for reading this and contributing to this community.

Regards
Alex

2 Answers

Imbalanced datasets are a problem for generative classifiers that use the prior probability to calculate the predicted label: as minority classes have a lower prior, they get a lower predicted probability.

There are several ways to cope with imbalanced datasets:

  1. Oversampling - randomly add copies of observations from the minority classes, so the prior probability of every class becomes the same.
  2. Undersampling - if you have a dataset with many observations but the majority class is a few times larger than the minority, randomly choose a subset of the whole dataset that includes the same number of observations for each label.
  3. Data augmentation - generate synthetic data that tries to simulate the distribution of the features within each label.
  4. Weighted classifiers - some classifiers support weights for the labels, so misclassifying a minority class costs more.
  5. If you use a neural network model, you can do transfer learning. Take a model trained on balanced data with a similar feature vector (you said you have such models), copy the network with its weights, and replace the last layer with a randomly initialised one (better to use a Xavier initialiser). Then freeze all of the layer weights except the last one and train it. It is better to keep the same proportion between the classes using (1) or (2), and it is also recommended to combine this with (3).
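
For instance, option (4) is available in scikit-learn via the `class_weight` parameter. A minimal sketch on synthetic imbalanced data (the 90/5/5 split and all parameters are illustrative):

```python
# Sketch of option (4): a classifier with per-class weights.
# class_weight="balanced" reweights each class inversely to its
# frequency, so minority classes contribute as much to the loss.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 3 classes with a 90/5/5 split to simulate heavy imbalance
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           weights=[0.9, 0.05, 0.05], random_state=0)

plain    = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X, y)

# The weighted model typically predicts the minority classes more often
print(np.bincount(plain.predict(X), minlength=3))
print(np.bincount(weighted.predict(X), minlength=3))
```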

Answered by Tolik on November 11, 2020

By definition, a balanced dataset has an equal number of data points in every class. All other datasets are deemed imbalanced.

You can very well use an imbalanced dataset to train your ML model as long as the predictions are accurate. If not, go for undersampling or oversampling depending on your use case. This blog covers it: https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28
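
A minimal sketch of random undersampling without extra libraries (the class counts are illustrative):

```python
# Sketch: random undersampling to the size of the smallest class.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 500 + [1] * 50 + [2] * 20)   # imbalanced labels
X = rng.normal(size=(len(y), 4))                # toy feature matrix

n_min = np.bincount(y).min()                    # smallest class size
keep = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
    for c in np.unique(y)
])
X_bal, y_bal = X[keep], y[keep]

print(np.bincount(y_bal))  # -> [20 20 20]
```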

Answered by Faiz Kidwai on November 11, 2020
