How to estimate variance of classifier on test set?

Cross Validated Asked by pterojacktyl on February 15, 2021

I have a binary classification task for which I want to compare two different classification methods as well as hyper-parameters for each.
I have used k-fold cross-validation (k = 5) to obtain k estimates of my performance metric (average miss-rate over a given range of false positive rate) to give an approximate mean and variance.
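
For concreteness, the setup looks roughly like this (a sketch with a placeholder classifier and a simplified version of the miss-rate metric, not the actual pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def mean_miss_rate(y_true, scores, fpr_range=(0.0, 0.1)):
    """Rough average miss rate (1 - TPR) over an FPR range, without interpolation."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    mask = (fpr >= fpr_range[0]) & (fpr <= fpr_range[1])
    return float(np.mean(1.0 - tpr[mask]))

# Toy data and classifier standing in for the real task
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

fold_metrics = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[val_idx])[:, 1]
    fold_metrics.append(mean_miss_rate(y[val_idx], scores))

print(np.mean(fold_metrics), np.std(fold_metrics, ddof=1))  # approximate mean and spread
```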

This revealed method A to be superior to B on average, although the best performance of method B roughly matched the worst performance of method A.
For example, method A might have achieved a miss rate of 0.68 (lower is better) with a stddev of 0.02, whereas method B achieved 0.72, also with a stddev of 0.02.

For my task, it is standard procedure to report a single performance number on the test set.
When I choose the hyper-parameters according to cross-validation, the performance that method A and method B each achieve on the test set are virtually identical.
However, I fear this is simply due to the variance of the two methods, and perhaps method A would be better on average if I could sample more training and testing sets.
The numbers are quite far from those that I saw in cross-validation, suggesting a mismatch between the training and testing distributions.

QUESTION: Is there a principled way to estimate the variance of the classifiers using the testing distribution?

I thought of applying the k classifiers from cross-validation to the whole test set (or training new classifiers after re-sampling k training sets by bootstrapping) and looking at the variance of the results.
However, I'm concerned that this would be specific to the particular testing set that I have, rather than estimating a property of the testing distribution.
Perhaps I should instead divide the testing set into k random partitions?
Each partition would be relatively small, though; would that be statistically inefficient?
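
For reference, a sketch of the first of these ideas, applying the k fold-trained models to one fixed test set (placeholder data, classifier, and error metric); the spread this yields reflects only the training-fold variability, which is exactly the concern above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real data, classifier, and metric
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# One model per CV fold of the training data, each scored on the same fixed test set
test_errors = []
for train_idx, _ in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X_tr, y_tr):
    clf = LogisticRegression(max_iter=1000).fit(X_tr[train_idx], y_tr[train_idx])
    test_errors.append(np.mean(clf.predict(X_te) != y_te))

# The spread here comes from training-set variability only; the test set itself is fixed
print(np.mean(test_errors), np.std(test_errors, ddof=1))
```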

2 Answers

Is there a principled way to estimate the variance of the classifiers using the testing distribution?

Yes, and contrary to your intuition it is actually easy to do this by cross validation. The idea is that iterated/repeated cross-validation (or out-of-bag estimation, if you prefer to resample with replacement) allows you to compare the performance of slightly different "surrogate" models on the same test case, thus separating variance due to model instability (training) from variance due to the finite number of test cases (testing).

see e.g. Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations, Anal Bioanal Chem, 390, 1261-1271 (2008). DOI: 10.1007/s00216-007-1818-6
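
As a rough illustration of the idea (not the exact procedure from the paper), here is a sketch using repeated cross-validation in scikit-learn, with plain accuracy as a stand-in metric: disagreement of the surrogate models on the same test cases approximates the instability contribution, while a binomial term approximates the finite-test-size contribution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
n_splits, n_repeats = 5, 20
rcv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

# correct[i, r] = 1 if case i was classified correctly in repetition r (by some surrogate model)
correct = np.empty((len(y), n_repeats))
for split_idx, (train_idx, test_idx) in enumerate(rcv.split(X, y)):
    rep = split_idx // n_splits                       # splits are yielded repetition by repetition
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    correct[test_idx, rep] = (clf.predict(X[test_idx]) == y[test_idx])

accuracy = correct.mean()
# Instability: disagreement between surrogate models on the very same test cases
instability_var = correct.var(axis=1, ddof=1).mean()
# Finite-test-size variance: binomial variance for n test cases at the estimated accuracy
test_size_var = accuracy * (1 - accuracy) / len(y)
print(accuracy, instability_var, test_size_var)
```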

As @RyanBressler points out, there's the Bengio paper showing that cross-validation fundamentally underestimates the variance of the models. This underestimation is with respect to the assumption that resampling is a good approximation to drawing a new independent sample (which it obviously isn't). That matters if you want to compare the general performance of some type of classifier on some type of data, but not in applied scenarios where we talk about the performance of a classifier trained from the given data. Note also that separating this "applied" test variance into instability and testing variance takes a very different view of the resampling: here the surrogate models are treated as approximations, or slightly perturbed versions, of a model trained on the whole given training data, which should be a much better approximation.

the performance that method A and method B each achieve on the test set are virtually identical. However, I fear this is simply due to the variance of the two methods, and perhaps method A would be better on average if I could sample more training and testing sets.

This is quite possible. I'd suggest checking which of the two sources of variance, instability (training) or finite-test-set uncertainty (testing), is the larger one, and focusing on reducing that.

I think the question "Sample size calculation for ROC/AUC analysis" discusses the effects of finite test sample size on your AUC estimate.

However, for comparing the performance of two classifiers on the same data I'd suggest using a paired test such as McNemar's: to find out whether (or which) classifier is better, you can concentrate on the test cases that one classifier gets right and the other gets wrong. These counts are fractions of test cases, for which the binomial distribution lets you calculate the variance.
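
A minimal sketch of McNemar's test with statsmodels, using toy predictions in place of the two classifiers' actual outputs on the shared test set:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Toy predictions standing in for the two classifiers evaluated on the same test set
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=300)
pred_a = np.where(rng.random(300) < 0.90, y_test, 1 - y_test)   # ~90% accurate
pred_b = np.where(rng.random(300) < 0.85, y_test, 1 - y_test)   # ~85% accurate

correct_a = pred_a == y_test
correct_b = pred_b == y_test

# 2x2 table of joint correctness; the off-diagonal (discordant) cells drive the test
table = np.array([
    [np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
    [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
])
print(mcnemar(table, exact=True))   # statistic and p-value for the paired comparison
```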

Answered by cbeleites unhappy with SX on February 15, 2021

You want some sort of bootstrap or other method for generating independent measurements of performance.

You can't look at the k cross-validation folds or divide the test set into k partitions, as the observations won't be independent. This can and will introduce significant bias in the estimate of variance. See, for example, Yoshua Bengio's "No Unbiased Estimator of the Variance of K-Fold Cross-Validation".

It isn't even really valid to look at the best- and worst-case performance across the CV folds, since the folds aren't independent draws: some folds will simply have much better or much worse performance.

You could do an out-of-bag estimate of performance, where you repeatedly bootstrap training data sets and measure performance on the rest of the data. See the write-up by Breiman and the earlier work by Tibshirani that it references on estimating performance variance this way.
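
A minimal sketch of such an out-of-bag estimate (toy data and a placeholder classifier; the real performance metric would replace the plain error rate):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
rng = np.random.default_rng(0)
n = len(y)

oob_errors = []
for _ in range(200):
    boot = rng.integers(0, n, size=n)             # bootstrap training indices (with replacement)
    oob = np.setdiff1d(np.arange(n), boot)        # cases never drawn form the out-of-bag test set
    clf = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    oob_errors.append(np.mean(clf.predict(X[oob]) != y[oob]))

print(np.mean(oob_errors), np.std(oob_errors, ddof=1))
```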

If that is computationally prohibitive because you have a ton of data, I'd wonder about bootstrapping or otherwise resampling just the holdout set, but I can't think of or find a reference for that offhand.
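
For what it's worth, a sketch of that idea: resample the holdout cases with replacement, recompute the metric on each resample, and take the spread as an estimate of the finite-test-set uncertainty (toy predictions stand in for the fixed model's output):

```python
import numpy as np

# Toy predictions from one fixed model on the holdout set
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=1000)
pred = np.where(rng.random(1000) < 0.9, y_test, 1 - y_test)

n = len(y_test)
boot_errors = np.array([
    np.mean(pred[idx] != y_test[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(2000))   # resample test cases
])
print(boot_errors.std(ddof=1))                     # bootstrap standard error of the test error
print(np.percentile(boot_errors, [2.5, 97.5]))     # percentile confidence interval
```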

Answered by Ryan Bressler on February 15, 2021
