General procedures for combined feature selection, model tuning, and model selection?

Question

What is the general procedure for a combined task of model tuning (i.e., hyperparameter selection), feature selection and model selection?

I know some basic principles for each task, but when combining them all together, I am confused.

For example, let's assume we have 1000 candidate features (already filtered using unsupervised methods) and 1000 samples, and the output label is a binary True/False outcome.

The candidate models considered here are k-nearest neighbors (KNN) model, and Support Vector Machines (SVM) with Linear Kernel, so the hyperparameters are the number of neighbors (k) in KNN and the cost (C) in SVM.

We would like to use the genetic algorithm (GA) to guide the search, and cross-validation to tune the hyperparameters during the search process.

Is the following procedure correct?

Steps 1 to 6 below are run separately for each of the candidate models (i.e., KNN and SVM):

1. Randomly choose 100 samples to be the test set (holdout).
2. Create 10 folds on the remaining 900 samples. Use 9 folds (i.e., 810 samples) as training data for the GA search, and 1 fold (i.e., 90 samples) for fitness evaluation.
3. In each GA generation, for each child, the hyperparameters are tuned using cross-validation on the 810 samples (i.e., model fitting and evaluation are performed solely on the 810 samples). The fitness of the tuned model is evaluated using the 90 samples. Based on the fitness values, the next generation is produced. This process continues until a stopping criterion is met. Record the number of generations it takes to reach the stopping criterion. (This is what I understood from the CARET package documentation.)
4. Repeat step 3, using another fold (another 90 samples) for fitness evaluation and the rest (810 samples) for the GA search, until all the samples have been used for fitness evaluation.
5. Determine the optimal number of generations from the resampling results of steps 3 and 4. Then run the GA again on all 900 samples for that number of generations to get the final model; model fitness is evaluated on the 100 holdout samples, and the hyperparameters are tuned using cross-validation on the 900 samples. The predictor subset depends on the optimal number of generations and on the search behavior of the GA, and thus can differ between GA runs.
6. The final model performance is evaluated on the 100 holdout samples. Steps 1 to 5 can be repeated many times with different random splits to get the range of model performance.

Build a KNN model and an SVM model using steps 1 to 6, and select the best model using the statistics obtained in step 6.
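To make the sample bookkeeping in steps 1, 2 and 4 concrete, here is a minimal sketch using only the Python standard library. The function name `holdout_and_folds` and its parameters are hypothetical, chosen for illustration; the GA search and hyperparameter tuning themselves are not shown.

```python
import random

def holdout_and_folds(n_samples=1000, n_holdout=100, n_folds=10, seed=0):
    """Split sample indices into a holdout set plus (search, fitness) folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # Step 1: set aside the holdout samples.
    holdout = indices[:n_holdout]
    rest = indices[n_holdout:]

    # Steps 2 and 4: each fold takes one turn as the fitness-evaluation set,
    # while the remaining folds form the data available to the GA search.
    fold_size = len(rest) // n_folds
    splits = []
    for i in range(n_folds):
        fitness = rest[i * fold_size:(i + 1) * fold_size]
        fitness_set = set(fitness)
        search = [j for j in rest if j not in fitness_set]
        splits.append((search, fitness))
    return holdout, splits

holdout, splits = holdout_and_folds()
print(len(holdout), len(splits), len(splits[0][0]), len(splits[0][1]))
# prints: 100 10 810 90
```

Inside each `(search, fitness)` pair, step 3 would run its own cross-validation on the 810 search samples only, so the 90 fitness samples never influence the tuning.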

The above procedure looks very complex, and the result is uncertain because the final features selected by the GA can differ between runs. Is there another way to do the above tasks (combined feature selection, model tuning and model comparison)?

I did not use recursive feature elimination methods (e.g., backwards selection), because some of the predictors are correlated, and removing correlated predictors can cause loss of information in this particular problem. So I think GA search is more flexible and therefore more suitable for correlated predictors. Is that right?

Daniel Soutar · Answer

This is a very good question and it is shameful that this has not been answered on this site. More content like this should be promoted in the community to minimise the number of junk articles that encourage poor practices.

Regarding your question (if you're still there!), the fact that some features are correlated is precisely why removing them is worthwhile. If two features are so highly correlated that one can use the first to infer the second, then the second is unnecessary! You can drop that feature, and if you knew those features were correlated before seeing the data, you can take them out yourself. Otherwise, use recursive feature elimination (RFE).
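As a rough illustration of the RFE suggestion, here is a minimal sketch with scikit-learn's `RFE` wrapped around a linear-kernel SVM. The synthetic data, the choice of 5 features to keep, and `C=1.0` are all assumptions made for the example, not values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Synthetic stand-in data: 20 features, 5 of which are redundant
# (linear combinations of the informative ones, hence correlated).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5,
    n_redundant=5, random_state=0,
)

# RFE repeatedly fits the linear SVM and drops the feature with the
# smallest absolute coefficient until 5 features remain.
selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=5, step=1)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
```

To keep the performance estimate honest, the selection step should be fitted inside each cross-validation fold (for example by placing `RFE` in a `Pipeline`) rather than once on the full data.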

Beyond that, I am less sure. As I understand it, one picks a particular model with fixed parameter settings and then uses cross-validation to evaluate that specific model. I think what you are doing is running an outer loop of cross-validation with an inner loop that chooses the best parameters for the model, so you should switch the two. The reason is that you are trying to evaluate a given model, but if you search for optimal parameters within each fold, you are not fixing the model and evaluating it; you are finding the best model for a given fold of your dataset. You also do not need a genetic algorithm to pick model parameters at the outset. If you're using Python and scikit-learn, GridSearchCV will be a good solution.
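For what it's worth, a minimal sketch of the GridSearchCV route for the two candidate models might look like the following. The synthetic data, the grid values for k and C, and the added scaling step are assumptions for illustration, not recommendations from the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; a 10% holdout mirrors the 100-of-1000 split.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0,
)

# One pipeline and one hyperparameter grid per candidate model.
candidates = {
    "knn": (
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9]},
    ),
    "svm": (
        make_pipeline(StandardScaler(), SVC(kernel="linear")),
        {"svc__C": [0.01, 0.1, 1, 10]},
    ),
}

results = {}
for name, (pipeline, grid) in candidates.items():
    # 10-fold cross-validation on the training data picks the hyperparameter;
    # the holdout set is only touched once, for the final score.
    search = GridSearchCV(pipeline, grid, cv=10)
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_score": search.best_score_,
        "holdout_score": search.score(X_test, y_test),
    }

for name, summary in results.items():
    print(name, summary)
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the validation fold leaks into the tuning.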

Hope this makes sense - please note I'm no genius or expert so do take my answer with a grain of salt. Hope this helps all the same!
