
Is it Valid to Grid Search Cross Validation for Model Hyperparameter Selection then a separate Cross Validation for Generalisation Error?

Cross Validated, asked by Benjamin Phua on January 11, 2021

The question has to do with Model Selection and Evaluation

I’m trying to wrap my head around how different nested cross-validation would really be from the following procedure:

Let’s say I am attempting to evaluate how suitable a model class is for a particular problem domain.

  1. Let’s assume, for hypothetical purposes, that nested cross-validation is not possible.

  2. I have a small random dataset from a particular domain, small enough to warrant grid-search cross-validation for hyperparameter selection rather than some other approach (AIC etc.). So I run a grid-search cross-validation to find the optimal hyperparameters (i.e. the optimal complexity/flexibility) for this model class on this domain, and let the program run.

  3. A few minutes later I obtain a fresh, similarly sized random sample from the same domain, a potential test set for the model. While similarly sized, it is still small, so the generalisation error it yields would likely have high variance if it were used as a single test set.

  4. Thus, I was wondering: would it be valid to take the selected hyperparameters from step 2 (a procedure meant to find the hyperparameters with the best complexity/flexibility to minimise error for that model class on that particular domain) and run a new cross-validation on the fresh sample from step 3 as an estimate of generalisation error, given the small test set?

My thinking is that if the cross-validation selection step is meant to find the optimal complexity for that model class [1][2], can’t I just use those hyperparameters in a fresh cross-validation to find the generalisation error?
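For concreteness, here is a minimal sketch of what I’m proposing, assuming scikit-learn, with a placeholder classifier, an arbitrary parameter grid, and made-up random data (none of these reflect my actual problem):

    # Minimal sketch of the proposed procedure (placeholder model, grid, and data).
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, cross_val_score

    rng = np.random.default_rng(0)
    X1, y1 = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)  # small sample 1
    X2, y2 = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)  # fresh sample 2

    # Step 2: grid-search CV on sample 1 to pick hyperparameters
    grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}, cv=5)
    grid.fit(X1, y1)

    # Step 4: fix those hyperparameters and run a new CV on the fresh sample 2
    # as the generalisation-error estimate
    fixed_model = SVC(**grid.best_params_)
    scores = cross_val_score(fixed_model, X2, y2, cv=5)
    print(grid.best_params_, scores.mean(), scores.std())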

At the moment I feel the flaws in my thinking are:

A. That not using the new test set in the second step biases the results to over-estimate generalisation error compared to nested cross-validation.

B. Because the datasets being used are small, further effort such as bootstrapping or repeated cross-validation could improve the standard errors of the generalisation error estimate (a sketch of what I mean is below).
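For instance, a rough sketch of what I mean by repeated cross-validation on the small fresh sample, again assuming scikit-learn with placeholder data and hyperparameters:

    # Sketch of point B: repeated k-fold CV with the hyperparameters fixed from
    # step 2, to reduce the variance of the estimate on the small fresh sample.
    import numpy as np
    from sklearn.model_selection import RepeatedKFold, cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X2, y2 = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)  # placeholder fresh sample

    cv = RepeatedKFold(n_splits=5, n_repeats=20, random_state=0)
    scores = cross_val_score(SVC(C=1.0, gamma="scale"), X2, y2, cv=cv)
    # naive standard error over all fold scores; it understates the true
    # uncertainty because the folds overlap across repeats
    print(scores.mean(), scores.std() / np.sqrt(len(scores)))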

Thank you for your time.

[1] James et al. (2013), An Introduction to Statistical Learning, p. 183:

"We find in Figure 5.6 that despite the fact that they sometimes underestimate the true test MSE, all of the CV curves come close to identifying the correct level of flexibility—"

[2] James et al. (2013), An Introduction to Statistical Learning, p. 186:

"Though the cross-validation error curve slightly underestimates the test error rate, it takes on a minimum very close to the best value for K."

2 Answers

First of all: yes, you can get a valid cross-validated performance estimate from a second data set on which you train with fixed hyperparameters.


However, consider the statistical properties:

  • the variance of the performance estimate due to finite sample size is the same whether you test a model m1 trained on data set 1 on the n2 cases of data set 2, or you test surrogate models m2i trained on k-1 folds of data set 2: the total number of tested cases is the same.

  • The cross-validation estimate will be slightly pessimistically biased compared to the final model m2, which is trained on all of data set 2 (with the hyperparameters taken from the grid search on data set 1).

  • The cross-validation estimate also contains some variance due to model instability, i.e. differences in true performance across the k2 surrogate models m2i.
    This is relevant for the uncertainty of the cross-validation estimate, since that estimate is taken as an approximation for another model, m2. But if model m1 is tested on data set 2, instability is not relevant, since you actually test the final model without the additional approximation step.

(This is also what @astel's answer says, just in different words, in case that helps.)

All in all, your second cross-validation is not invalid in the sense that you don't do anything that violates independence. But outside the corner case that model m2 is truly better than model m1, your procedure is no improvement over using data set 2 as an independent test set.

That is, the following corner case: data sets 1 and 2 are sufficiently similar that you can reasonably assume the optimal hyperparameters to be the same across the data sets, but you expect the data sets to be sufficiently different that you'd want to retrain the model parameters on data set 2.


Since you can obtain a second set, you may want to make full use of the advantages a truly separate test set can have over train/test splits (single split or cross validation or any other resampling validation). See e.g. Hold-out validation vs. cross-validation and maybe also Is hold-out validation a better approximation of "getting new data" than k-fold CV?.


In case you need an estimate of model instability in addition to the independent test-set performance on data set 2, you can also get that at very low computational effort: predict data set 2 also with the k surrogate models m1i you have from data set 1, and look at the variance across the predictions for each case. This is actually more efficient than even repeated cross-validation, since you have a fully crossed design (case × surrogate model).
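For illustration, a rough sketch of that check (assuming scikit-learn and a placeholder binary classifier with 0/1 labels; the surrogate models m1i are simply refits on the CV folds of data set 1):

    # Instability check: refit surrogate models m1_i on the CV folds of data set 1,
    # predict all of data set 2 with each, and look at the per-case disagreement.
    import numpy as np
    from sklearn.base import clone
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    def instability_on_test_set(model, X1, y1, X2, k=5):
        preds = []
        for train_idx, _ in KFold(n_splits=k, shuffle=True, random_state=0).split(X1):
            m1_i = clone(model).fit(X1[train_idx], y1[train_idx])  # surrogate model
            preds.append(m1_i.predict(X2))
        preds = np.asarray(preds)                          # shape (k, n2): fully crossed design
        majority = (preds.mean(axis=0) > 0.5).astype(int)  # majority vote (assumes 0/1 labels)
        return (preds != majority).mean(axis=0)            # per-case fraction of disagreeing surrogates

    # Usage with placeholder data: instability_on_test_set(SVC(C=1.0), X1, y1, X2)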


All this assumes that data sets 1 and 2 are of similar size: if the sizes vary substantially, you couldn't assume the same hyperparameters to work well.

Correct answer by cbeleites unhappy with SX on January 11, 2021

Let's say the two data sets have the same number of records, n1 = n2. You perform k-fold cross-validation on the first data set and select optimal hyper-parameters.

The traditional way (in your scenario) would be to use the second dataset as a test set, so you are testing n2 records on a model trained on n1 records.

What you are suggesting will lead to testing n2 records on models that are each trained on only n2*(k-1)/k records, since you want to do cross-validation on the test set itself. This will lead to a pessimistic bias in your estimate, since you are training on fewer records. It will also lead to more variance in your estimate, since you are adding randomness from how the test set is split into folds.

Okay, rereading all of your comments, I think I understand what you are trying to do. The reason you want to perform cross-validation on your test set, rather than simply using the test set to estimate error, is that you also want to use this test set to determine model complexity (i.e. you want to do feature selection on your test set). This is problematic: your estimate will be optimistically biased, since the entirety of your test set is used for feature selection and is therefore part of the training process, leaving you no final test set with which to evaluate your error rate.

I understand you don't want to do nested cross-validation for time reasons, but you could simply combine your old and new data, do a single train/test split (i.e. a single outer loop), and do cross-validation on the training set to find hyper-parameters and model complexity (feature selection) at the same time. Finding optimal features is essentially hyper-parameter selection, after all.
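For illustration, a minimal sketch of that alternative (assuming scikit-learn, with a placeholder classifier, grid, and made-up data):

    # Pool both samples, hold out a test set once (single outer loop), and do the
    # grid-search CV for hyperparameters only on the pooled training portion.
    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X_old, y_old = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)  # data set 1
    X_new, y_new = rng.normal(size=(100, 5)), rng.integers(0, 2, size=100)  # data set 2

    X = np.vstack([X_old, X_new])
    y = np.concatenate([y_old, y_new])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5).fit(X_tr, y_tr)
    print(search.best_params_, search.score(X_te, y_te))  # single held-out error estimate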

If you want to separate hyper-parameter selection and model complexity as in your current method, you are going to need a third data set with which to estimate error (or simply split your second data set into train/test before performing cross-validation).

Answered by astel on January 11, 2021
