
Can I apply feature selection before splitting by requiring that a feature is selected > 90% of the time?

Data Science Asked by ran8 on August 20, 2021

I want to move the feature selection step to before the train/test split in order to save time and allow a bigger input dataset. If, in repeated subsamples, a feature is selected in over X percent of cases, I will keep it. Alternatively, I could use a very low X just to remove features that will clearly never be selected. I have read warnings against doing this, including on this forum (Feature selection: Information leaking if done before CV-split?), because of information leakage. But if the feature would have been selected in almost all post-split cases anyway, where is the problem?
Edit: the selection does involve the target variable.

One Answer

As explained in the post you link, it depends on how you select the features: if the selection doesn't involve the target variable, then it's probably fine. I'll assume the most common case, i.e. that the selection relies on the target variable. There are two parts to this problem:

  1. About the ensemble method for feature selection (see the sketch after this list): let's say a particular feature is selected 90% of the time. This means it is not selected 10% of the time, so with a single training set there would be a 10% chance that this feature is not selected. In theory your model is therefore likely to be better than a "regular model", because it includes a few features which wouldn't have been selected in the "regular model". Note however that this method can also have negative side effects, because the features are selected individually rather than as a whole (i.e. the result might not be the optimal subset of features).
  2. About the risk of data leakage: by definition, applying this method on the whole data means using information from the test set, and therefore introduces a potential bias in the evaluation. It's true that the ensemble method decreases the risk of selecting a feature by chance, but for every feature there is a 10% chance that it wouldn't have been selected in the "regular model". Since the selection is based partly on the test set, you can't be certain that the evaluation is reliable.
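
Here is a minimal sketch of the repeated-subsample frequency idea, assuming a simple scikit-learn univariate selector (SelectKBest with an F-test) stands in for whatever selection method is actually used; the number of repeats, subsample fraction and 90% threshold are arbitrary illustrations. Note that running it on the full dataset, as shown, is exactly the source of the leakage risk described in point 2.

```python
# Sketch: stability of feature selection over repeated subsamples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

n_repeats, subsample_frac, k = 100, 0.8, 10
rng = np.random.default_rng(0)
counts = np.zeros(X.shape[1])

for _ in range(n_repeats):
    # Draw a random subsample and run the selector on it.
    idx = rng.choice(len(X), size=int(subsample_frac * len(X)), replace=False)
    selector = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
    counts[selector.get_support()] += 1

# Keep only the features selected in more than 90% of the subsamples.
stable_features = np.where(counts / n_repeats > 0.9)[0]
print("stable features:", stable_features)
```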

Assuming your goal is to use cross-validation and that the feature selection process is computationally expensive, I can think of two ways to do this properly:

  • Select a random N% of the data (where N is the size of the training set, e.g. 90% if using 10-fold CV), do the feature selection on this data, then use this predefined set of features every time, independently of the training set. The idea is that the selected set of features cannot take advantage of a selection on the whole data, so it is not particularly optimized for the test set or any CV split. That should be enough to obtain a fair evaluation, although technically there is still some data leakage this way.
  • Do an additional split of your data: on the training set, apply exactly the feature-selection process you described, then run CV on it; after this, apply the final model to the test set (as sketched below). For example, take 80% of the data as the training set, do feature selection and CV based only on this training set, and then use the remaining 20% of unseen instances as the test set. There is no data leakage this way, but the final evaluation comes from a single test set, not from CV (here the CV stage can be used to study performance variation or tune hyper-parameters).
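
A minimal sketch of the second option, again assuming a simple univariate selector stands in for the expensive feature-selection process: the 20% test set is set aside before any selection happens and is only touched once, at the very end.

```python
# Sketch: feature selection and CV on the training split only,
# final evaluation on a held-out test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# 80/20 split: the test set is held out before any selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Feature selection based only on the training set.
selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)

# CV on the training set, e.g. to study variance or tune hyper-parameters.
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train_sel, y_train, cv=10)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Final evaluation on the unseen test set with the same selected features.
model.fit(X_train_sel, y_train)
test_score = model.score(selector.transform(X_test), y_test)
print("test accuracy: %.3f" % test_score)
```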

Answered by Erwan on August 20, 2021
