How to find whether a dataset is balanced or imbalanced?

Question

I have a few datasets for experimenting with multi-class classification. These datasets are about 400 GB. I want to know whether each dataset is balanced or imbalanced. Is there a principled way to determine this?

Murli · Answer

You can look at the number of samples in each class. Ideally, the classes should appear in roughly equal proportions. If one class has considerably more samples than the rest, the model will learn to predict that class more often than the others, which biases it toward the majority class and hurts its performance on the rarer classes.
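A minimal sketch of that check in Python, assuming the class labels are already available as an iterable. The `labels` list is made-up illustration data and `class_proportions` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

def class_proportions(labels):
    """Return {class: (count, share of all samples)} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in counts.items()}

# Made-up labels, for illustration only.
labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
summary = class_proportions(labels)
for cls, (n, share) in sorted(summary.items()):
    print(f"{cls}: {n} samples ({share:.0%})")
```

If the shares are far from 1/k for k classes, the dataset is imbalanced to that degree.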

Answered by Murli on February 9, 2021

cap · Answer

Typically, each class in a multi-class classification problem should be equally represented. With 4 classes, for example, the ratio of sample counts should ideally be n:n:n:n. Most classification datasets do not have exactly the same number of samples in each class, which is fine, and a small difference often does not matter. But if the difference is large, for example 100:5:9:13, then it matters and the dataset is imbalanced.
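One way to put a number on "how far from n:n:n:n" is sketched below in Python. It computes two common summary measures: the ratio of the largest to the smallest class, and the Shannon entropy of the class shares divided by its maximum. The function names are hypothetical, and no universal cutoff separates balanced from imbalanced, so treat the values as descriptive rather than as a verdict.

```python
import math

def imbalance_ratio(counts):
    """Largest class count divided by the smallest; 1.0 means perfectly balanced.

    Assumes every class has at least one sample.
    """
    return max(counts) / min(counts)

def normalized_entropy(counts):
    """Shannon entropy of the class shares divided by log(k); 1.0 means balanced.

    Assumes at least two classes.
    """
    total = sum(counts)
    shares = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(counts))

skewed = [100, 5, 9, 13]   # the 100:5:9:13 example from the text
even = [25, 25, 25, 25]    # an n:n:n:n dataset

print(imbalance_ratio(skewed), normalized_entropy(skewed))
print(imbalance_ratio(even), normalized_entropy(even))
```

The skewed example gives a ratio of 20 and a normalized entropy well below 1, while the even example gives 1 for both measures.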

As for reading 400 GB of data: depending on the file type, you can read it in chunks and keep only the target variable (the column that holds the class labels) from each chunk, instead of loading everything into memory.
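A rough sketch of that idea in Python, under the assumption that the data is a CSV file with a header row and that the target column is called `label` (both are assumptions; the question does not say what the file format is). The function tallies labels one row at a time, so memory use depends on the number of classes and not on the file size.

```python
import csv
import io
from collections import Counter

def count_labels(rows, label_column):
    """Tally label values one row at a time; only the tallies are kept in memory."""
    counts = Counter()
    for row in rows:
        counts[row[label_column]] += 1
    return counts

# Small in-memory stand-in for a huge file. With real data, open the file
# (with newline="") and pass csv.DictReader(file_object) instead.
sample = io.StringIO("feature,label\n0.1,cat\n0.7,dog\n0.3,cat\n0.9,bird\n")
counts = count_labels(csv.DictReader(sample), "label")
print(counts.most_common())
```

For other formats the reader changes, but the pattern of streaming the rows and updating a running tally stays the same.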

You can then visualize the target variable with a bar chart that shows the number of samples in each class. You can also calculate the class distribution (each class's share of the total) to get a better understanding of the data.
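As a dependency-free illustration, the sketch below draws a text bar chart from per-class counts and prints each class's share alongside it. The counts are made up (they reuse the 100:5:9:13 ratio mentioned earlier) and `text_bar_chart` is a hypothetical helper. A plotting library would give a proper figure.

```python
def text_bar_chart(counts, width=40):
    """Return one line per class: name, a bar scaled to the largest class, count, share."""
    total = sum(counts.values())
    largest = max(counts.values())
    lines = []
    # Largest class first, so the skew is visible at a glance.
    for cls, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        bar = "#" * max(1, round(width * n / largest))
        lines.append(f"{cls:>8} {bar} {n} ({n / total:.1%})")
    return lines

# Made-up per-class counts, for illustration only.
counts = {"A": 100, "B": 5, "C": 9, "D": 13}
chart = text_bar_chart(counts)
print("\n".join(chart))
```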

Michael Šòdéké · Answer

In R, do the following:

1. Convert the data frame to a tibble, which shows the data type of each column:

library(dplyr)        # provides as_tibble() and the %>% pipe
df <- InsectSprays    # built-in example dataset
df <- as_tibble(df)

> as_tibble(df)
# A tibble: 72 x 2
   count spray
   <dbl> <fct>
 1    10 A
 2     7 A
 3    20 A
 4    14 A
 5    14 A
 6    12 A
 7    10 A
 8    23 A
 9    17 A
10    20 A
# ... with 62 more rows

2. Nest the rows by the k-level factor column to see how many rows each class has:

df %>% group_by(spray) %>% group_nest()

# A tibble: 6 x 2
  spray               data
  <fct> <list<tbl_df[,1]>>
1 A               [12 x 1]
2 B               [12 x 1]
3 C               [12 x 1]
4 D               [12 x 1]
5 E               [12 x 1]
6 F               [12 x 1]

Each of the six levels has 12 rows, so this dataset is balanced.

Answered by Michael Šòdéké on February 9, 2021
