
Feature Selection with one-hot-encoded categorical data

Data Science Asked by dungeon on March 17, 2021

I have a dataset with 400+ columns. Almost 90% of these are one-hot-encoded (OHE) categorical features. I’m using the dataset for a classification problem.

My professors asked me to perform feature selection using sequential forward selection (mlxtend).

Is there really a point in doing this, since it is also very time-consuming? Is it logical to remove categorical features? If so, what k_features value should I use for SFS? Or is the method (SFS) even suited for this?

2 Answers

Sequential forward selection is, if I am not mistaken, a greedy search algorithm. You initially fit all possible one-variable models and choose the best-performing one. You then try to add a second variable to that model by fitting all possible two-variable models; you choose the best-performing two-variable model and, if it is superior to the best one-variable model, you move to it. The process continues with all three-variable models, then four (if the best three-variable model beats the best two), and so on, until you have chosen the best k features or the model stops improving.
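
In code, the greedy loop described above might look roughly like the sketch below. This is only an illustration of the idea, not mlxtend's actual implementation; it assumes a scikit-learn style estimator, a NumPy feature matrix X, and mean cross-validated accuracy as the score.

    import numpy as np
    from sklearn.model_selection import cross_val_score

    def forward_select(estimator, X, y, max_features, cv=5):
        selected, best_score = [], -np.inf
        remaining = list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            # Score every candidate subset that adds one more feature.
            scores = [(cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean(), j)
                      for j in remaining]
            score, j = max(scores)
            if score <= best_score:   # stop when no candidate improves on the current model
                break
            best_score = score
            selected.append(j)
            remaining.remove(j)
        return selected, best_score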

The best k seems quite arbitrary; if there is a way to let the algorithm continue until there is no further improvement, that is what I would do. However, here are some initial impressions I have of this approach:
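
With mlxtend specifically, my understanding is that you can pass k_features='best' (or 'parsimonious') instead of a fixed number, so the selector keeps whichever subset size scored highest rather than forcing an arbitrary k. A rough sketch, with X_train/y_train and the logistic-regression estimator as placeholders:

    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.linear_model import LogisticRegression

    sfs = SFS(LogisticRegression(max_iter=1000),
              k_features='best',   # or 'parsimonious' to prefer smaller subsets
              forward=True,
              floating=False,
              scoring='accuracy',
              cv=5,
              n_jobs=-1)
    sfs = sfs.fit(X_train, y_train)
    print(sfs.k_feature_idx_, sfs.k_score_)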

I suppose that one problem with this is that you have a large chance of not finding the optimal model (a local solution). A candidate variable may work particularly well with other variables that have not yet been included, or you may never get the chance to use potentially useful predictors if the algorithm terminates before reaching them.

However, the largest problem with this method, as I see it, is simply how expensive it is. You state that you have 400+ columns. Say you have 415 variables. Then in the first pass you have to fit 415 models, in the second pass 414, and so on, and this does not even include possible hyperparameter tuning and cross validation. That is a huge number of models, and to be honest this is often the problem with wrapper-based methods of feature selection in general: most of them end up fitting a large number of models for potentially marginal, if any, gain in model performance. The problem is made worse when you have hyperparameters that need to be tuned, so I find the trade-off between performance gains and the computational time spent on these methods not worthwhile, unless you want the absolute best model.
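
To make that cost concrete, here is a back-of-the-envelope count of model fits for p = 415 features, before cross validation or any hyperparameter grid multiplies it further:

    p = 415                                   # number of candidate features
    fits_for_k = lambda k: sum(p - i for i in range(k))
    print(fits_for_k(20))                     # 8,110 model fits just to pick 20 features
    print(p * (p + 1) // 2)                   # 86,320 fits for the full forward path
    # Multiply by the number of CV folds (and again by any hyperparameter grid).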

In your case, Peter's advice in the comments is the route I would take, purely because it is much faster and often good enough. Almost every modern ML method offers some form of regularization that will do the feature selection for you, a.k.a. embedded feature selection, either by not using unhelpful predictors at all or by strongly limiting their influence via shrinkage. Ridge/LASSO/Elastic Net, mentioned by Peter, is a terrific suggestion. Other methods, such as those based on trees, also have embedded feature selection and may work well in your case, considering the dimension of your dataset.
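
For example, an L1-penalised logistic regression combined with scikit-learn's SelectFromModel is one possible embedded route. This is only a sketch under assumed X_train/y_train names, not the only (or necessarily best) option:

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    # The L1 penalty drives the coefficients of unhelpful columns to exactly zero.
    lasso_like = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
    selector = SelectFromModel(lasso_like).fit(X_train, y_train)
    X_reduced = selector.transform(X_train)
    print(X_train.shape[1], "->", X_reduced.shape[1], "features kept")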

Correct answer by aranglol on March 17, 2021

One important point is missing: SFS is suitable here because it makes no assumption about whether features are categorical or numerical. However, one-hot encoding is redundant when you plan to use SFS. It only makes the process longer, because SFS then has to evaluate far more candidate features than the data actually contains.
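
As a small illustration of that inflation (with made-up column names), each categorical column becomes several binary columns after one-hot encoding, so SFS has that many more candidates to try at every step:

    import pandas as pd

    df = pd.DataFrame({"colour": ["red", "green", "blue", "red"],
                       "size":   ["S", "M", "L", "M"]})
    print(df.shape[1], "raw categorical columns")           # 2
    print(pd.get_dummies(df).shape[1], "one-hot columns")   # 6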

Answered by Remy on March 17, 2021
