How to deal with highly skewed (on counts) dependent variables?

Question

I am working on a binary classification problem and the dataset consists of several variables which are count variables. For example, how many times a customer defaulted on a broadband bill payment in the last 3 months.

The problem is that these features are highly skewed. This is what the distribution of the above variable looks like (value, then percentage of rows):

0.0     98.175855
1.0      1.275902
2.0      0.348707
3.0      0.199535

This is due to the nature of the event being evaluated during the construction of the feature. For example, the majority of the population may not have defaulted hence the value is 0 for 98% of them.

There are several such variables and they measure important events, so I cannot remove them. However, I am afraid the model will not learn anything from these features, as there is very little information in them.

Questions:

Am I right in assuming these features will not be useful to the model in the current state?
How can these features be handled?

Leevo · Answer

What models will you employ?

If you are working with deep learning, you can train with mini-batch gradient descent and "artificially" build each mini-batch so that the classes you want to predict appear with a more balanced frequency.
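
A minimal sketch of what building balanced mini-batches could look like, using only the standard library. The function name balanced_batches, the toy labels and the batch size are all made up for illustration; most deep learning frameworks have their own sampler utilities that would replace this. Rare-class rows are drawn with replacement, so they will repeat across batches.

```python
import random

def balanced_batches(labels, batch_size, n_batches, seed=0):
    """Yield lists of row indices with an equal number of rows per class.

    Rows are sampled with replacement inside each class, so rows of the
    rare class are reused many times.
    """
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    classes = sorted(by_class)
    per_class = batch_size // len(classes)
    for _ in range(n_batches):
        batch = []
        for c in classes:
            batch.extend(rng.choices(by_class[c], k=per_class))
        rng.shuffle(batch)
        yield batch

# Hypothetical toy labels: 98 negatives and 2 positives.
labels = [0] * 98 + [1] * 2
batches = list(balanced_batches(labels, batch_size=8, n_batches=3))
```

Each batch here holds four indices from each class. Keep in mind that oversampling like this changes the class prior the model sees, so predicted probabilities may need recalibrating afterwards.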

Blenz · Answer

It is fine if the feature is a good predictor in itself, since you're using XGBoost and it will do feature selection while modelling. But if the difference in predictive power between the values 0, 1, 2 and 3 is not really strong, I'm not sure it will be very useful: most of your inputs will be 0, and even an input with a different value won't change the prediction much.
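
One rough way to check this is to compare the share of positive targets at each value of the count feature. The sketch below uses only the standard library; the function name target_rate_by_value and the toy data are hypothetical, and with so few non-zero rows in real data the rates for 1, 2 and 3 will be noisy.

```python
from collections import defaultdict

def target_rate_by_value(feature, target):
    """Return {feature value: (row count, share of rows with target == 1)}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for x, y in zip(feature, target):
        totals[x] += 1
        positives[x] += y
    return {v: (totals[v], positives[v] / totals[v]) for v in sorted(totals)}

# Hypothetical toy data: a count feature and a binary target.
defaults = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3]
target   = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1]

rates = target_rate_by_value(defaults, target)
for value, (n_rows, rate) in rates.items():
    print(value, n_rows, round(rate, 3))
```

If the positive rate barely moves between the values, the feature is unlikely to help much. If it jumps sharply for non-zero values, the few non-zero rows still carry signal.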

Answered by Blenz on February 27, 2021

Romid · Answer

You are generally right, but as you mentioned, these are important features, so you need to find a way to use them even though the non-zero signal covers less than 2% of the rows. You could try building more features that enrich the signal by combining existing ones: for example, sum the 1, 2 and 3 counts into a single group, or add these counts to other count features where that makes sense.
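
As a sketch of that kind of combination, the snippet below adds a total across several count features and an "any event" flag that merges the 1, 2 and 3 values into one group. The column names and the helper combine_counts are invented for the example; whether such sums are meaningful depends on what the counts actually measure.

```python
def combine_counts(rows, count_columns):
    """Return copies of the rows with a total-count feature and an any-event flag."""
    enriched = []
    for row in rows:
        total = sum(row[c] for c in count_columns)
        new_row = dict(row)
        new_row["total_events"] = total        # sum across related count features
        new_row["any_event"] = int(total > 0)  # merges all non-zero counts into one group
        enriched.append(new_row)
    return enriched

# Hypothetical rows with two sparse count features.
rows = [
    {"broadband_defaults_3m": 0, "mobile_defaults_3m": 0},
    {"broadband_defaults_3m": 1, "mobile_defaults_3m": 0},
    {"broadband_defaults_3m": 0, "mobile_defaults_3m": 2},
]
enriched = combine_counts(rows, ["broadband_defaults_3m", "mobile_defaults_3m"])
```

The combined features are non-zero more often than any single one, which gives a model more rows to split on.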

Another type of enrichment is to model the feature's distribution to get more samples for the higher counts. For example, counts of event occurrences in a time interval often follow a Poisson distribution, or a power law for long-tailed data where small values are common and high values are rare. You may be able to use this property to get more out of your features.


Once you have engineered your features and can't get more out of them, you can try models that are built to handle data with this property.

If you know that some of your zero counts are really missing values, don't fill them with zeros; use a model that can handle missing values directly instead. XGBoost, for example, accepts missing feature values and learns how to handle them in the way that minimizes the overall loss.

There are also statistical models designed for this kind of inherent skewness, such as Zero-Inflated Poisson and Zero-Inflated Negative Binomial regression, which are built specifically for count data with many zeros (both assume that a second process generates the excess zeros on top of the usual count process).
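
To make the zero-inflation idea concrete, here is a small sketch of the Zero-Inflated Poisson probability mass function: with probability pi a row is a "structural" zero, otherwise its count comes from an ordinary Poisson with rate lam. The parameter values are illustrative assumptions and not fitted to the data above; in practice you would fit them with a statistics package.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, pi, lam):
    """P(K = k) under a Zero-Inflated Poisson.

    pi  : probability of a structural zero (the extra zero-generating process)
    lam : rate of the Poisson that produces the remaining counts
    """
    structural_zero = pi if k == 0 else 0.0
    return structural_zero + (1.0 - pi) * poisson_pmf(k, lam)

# Illustrative parameters only (assumed, not estimated from the question's data).
pi, lam = 0.9, 0.5
probs = [zip_pmf(k, pi, lam) for k in range(50)]
print([round(p, 4) for p in probs[:4]])
```

With these parameters the model puts far more mass on zero than a plain Poisson with the same rate would, which is the pattern the two model families are meant to capture.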
