TransWikia.com

How to remove correlated features?

Cross Validated Asked by iCHAIT on December 27, 2021

I have a small dataset (200 samples and 22 features) and I am trying to solve a binary classification problem. All my features are continuous and lie on a scale of 0-1.

I computed the correlation among my features using the pandas dataframe correlation method. Then, I found all the pairs of features that had a correlation of more than 0.95, and I was left with about 20 pairs.
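For reference, the pair-finding step described above could look roughly like the sketch below. The data is synthetic (the real dataset is not shown), the column names `f0` to `f3` and the helper `high_corr_pairs` are made up for illustration, and using the absolute value of the correlation is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 200 samples, features on a 0-1 scale.
rng = np.random.default_rng(0)
base = rng.random((200, 3))
df = pd.DataFrame({
    "f0": base[:, 0],
    # f1 is a noisy near-copy of f0, so the two are almost perfectly correlated.
    "f1": np.clip(base[:, 0] + rng.normal(0, 0.01, 200), 0, 1),
    "f2": base[:, 1],
    "f3": base[:, 2],
})

def high_corr_pairs(frame, threshold=0.95):
    """Return (feature_a, feature_b, corr) for pairs with |corr| > threshold."""
    corr = frame.corr()  # Pearson correlation by default
    # Keep only the strict upper triangle so each pair is listed once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    stacked = corr.where(mask).stack()
    # NaN entries (masked cells) compare as False and are filtered out here.
    stacked = stacked[stacked.abs() > threshold]
    return [(a, b, float(c)) for (a, b), c in stacked.items()]

pairs = high_corr_pairs(df)
print(pairs)
```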

Now my question is, from these pairs, how do I decide which features to drop?

The same question has been asked on Stack Overflow. Both the top-voted answer there and the approach Chris Albon shares in his blog post (also the second most-voted answer in that thread) drop one feature from each highly correlated pair at random.

I don’t feel confident about randomly dropping features without taking into account the correlation of the features with other features.

Is there a more convincing or reliable way to decide which of the two features to drop?

One Answer

It depends on your goal: prediction or inference? If you only want good predictions, you can often leave the features as they are, because correlation among predictors generally does not hurt predictive accuracy. If you are interested in inference, however, it is a problem and you will have to address it. Say you are building a logistic regression model with highly correlated variables: the estimated coefficients will be unstable, with large variance, and therefore hard to interpret correctly. You could then use penalized logistic regression with a Lasso or Ridge penalty (or a mix of both, the elastic net). Lasso performs feature selection, while Ridge is more stable and can be more useful for interpretation.
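A minimal sketch of the three penalties mentioned above, using scikit-learn's LogisticRegression. The data is synthetic, and the regularization strength C=0.5 and the elastic-net mix l1_ratio=0.5 are arbitrary illustrative values; in practice you would tune them by cross-validation. Features are standardized first so the penalty treats them on a common scale.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: columns 0 and 1 are nearly identical, column 2 is pure noise.
rng = np.random.default_rng(0)
n = 200
x0 = rng.random(n)
X = np.column_stack([
    x0,
    np.clip(x0 + rng.normal(0, 0.01, n), 0, 1),
    rng.random(n),
])
y = (x0 + rng.normal(0, 0.2, n) > 0.5).astype(int)

# The saga solver supports L1, L2 and elastic-net penalties.
models = {
    "lasso": LogisticRegression(penalty="l1", solver="saga",
                                C=0.5, max_iter=5000),
    "ridge": LogisticRegression(penalty="l2", solver="saga",
                                C=0.5, max_iter=5000),
    "elastic_net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, C=0.5, max_iter=5000),
}

coefs = {}
for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X, y)
    coefs[name] = pipe[-1].coef_.ravel()  # coefficients on the standardized scale
    print(name, np.round(coefs[name], 3))
```

Comparing the printed coefficients for the two near-duplicate columns across the three fits gives a feel for how each penalty handles a correlated pair.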

There is also another way. Because all of your variables are continuous, you can use PCA for dimensionality reduction. The tradeoff is that you end up with principal components, which are just mathematical constructs and harder to interpret than the original features. In general, I am not an advocate of simply throwing away one of two highly correlated variables: you don't know whether only one of them is related to the outcome. Throwing away variables is not always a good approach to solving collinearity problems.
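The PCA route could be sketched as follows with scikit-learn. Again the data is synthetic (22 correlated features built from 4 hypothetical latent signals), and keeping enough components to explain 95% of the variance is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 22 correlated features driven by 4 latent signals plus noise.
rng = np.random.default_rng(0)
n = 200
latent = rng.random((n, 4))
weights = rng.random((4, 22))
X = latent @ weights + rng.normal(0, 0.02, (n, 22))

# Standardize, then keep as many components as needed for 95% of the variance.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95, svd_solver="full")
Z = pca.fit_transform(X_std)

print("reduced shape:", Z.shape)
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))

# The resulting components are mutually uncorrelated by construction.
corr = np.atleast_2d(np.corrcoef(Z, rowvar=False))
off_diag = corr - np.diag(np.diag(corr))
print("max |off-diagonal correlation|:", np.abs(off_diag).max())
```

The columns of Z can then be fed to a classifier in place of the original features, at the cost of interpretability noted above.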

Answered by treskov on December 27, 2021
