
Overfitted model produces similar AUC on test set, so which model do I go with?

Data Science Asked by rayven1lk on September 5, 2021

I was trying to compare the effect of running GridSearchCV on a dataset that was oversampled prior to the split versus one oversampled only after the training folds are selected. The oversampling approach I used was random oversampling.

I understand that the first approach is wrong, since observations that the model has seen bleed into the test set. I was just curious how much of a difference this causes.

I generated a binary classification dataset with following:

# Generate binary classification dataset with 5% minority class,
# 3 informative features, introduce noise with flip_y = 15%
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1,
                           weights=[0.95, 0.05],
                           flip_y=0.15,
                           class_sep=0.8)

I split this into a 60/40% train/test split and performed GridSearchCV with both approaches on a random forest model. I ended up with the following output based on best_estimator_ from both approaches:

Best Params from Post-Oversampled Grid CV:  {'n_estimators': 1000}
Best Params from Pre-Oversampled Grid CV:  {'classifier__n_estimators': 500}
AUC of Post-Oversampled Grid CV - training set:  0.9996723239846446
AUC of Post-Oversampled Grid CV - test set:  0.6060618701968091
AUC of Pre-Oversampled Grid CV - training set:  0.6578310812852065
AUC of Pre-Oversampled Grid CV - test set:  0.6112671617024038

As expected, the Post-Oversampled Grid CV training AUC is very high due to overfitting. However, evaluating both models on the test set led to very similar AUC results (60.6% vs 61.1%).

I had two questions. Why is this observed? I didn't assign a random_state to any of these steps and retried it many times, but I still end up with the same results. In such a case, which is the better model to progress with, since they produce similar results on the test set?

For oversampling and handling it through the pipeline, I made use of imblearn:

# imblearn functions
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as Imb_Pipeline
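
Roughly, a minimal sketch of how such a pipeline can be passed to GridSearchCV so that the oversampler is refit on each training fold only (the grid, scorer and variable names below are illustrative assumptions, not my exact code):

# Sketch: RandomOverSampler inside an imblearn Pipeline is refit on each
# training fold, so the validation folds stay untouched.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Imb_Pipeline([
    ('oversample', RandomOverSampler()),
    ('classifier', RandomForestClassifier()),
])

param_grid = {'classifier__n_estimators': [500, 1000]}  # assumed grid
grid = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)  # oversampling happens inside each CV split
print(grid.best_params_)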

Happy to share my code if needed. Thanks.

3 Answers

1/ Oversampling

Oversampling should be used only on the train set. Oversampling helps during training to have more data and/or to balance the classes if needed.

So in your case it is best to do it only on the train set, not on the test set.

GridSearchCV uses cross-validation and will split the train set into several folds.

Then the final evaluation of the model should be done on an untouched test set without oversampling. This reflects the real class distribution you have, and so it will reflect real performance.
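
As a minimal sketch of that idea (your variable names and the classifier settings are assumed here), the resampling touches only the training data and the score is computed on the original, untouched test set:

from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Resample the training data only; the test set keeps its real 95%/5% distribution.
X_train_res, y_train_res = RandomOverSampler().fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=500).fit(X_train_res, y_train_res)

# Evaluate on the untouched test set.
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))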

2/ Class Imbalance

Your classes are imbalanced: this usually makes it hard to find a good model. As you see, 60% AUC for a 95%/5% binary classification is not good. Moreover, random oversampling on your 95%/5% classes only duplicates the few minority observations many times without adding new information. You should experiment first with balanced classes and see if the same happens.

3/ Randomness

You use randomness without seeding it: make_classification and flip_y rely on randomness, and I cannot see the random_state parameter set here.

So have you set the random seed in the rest of your code? Could you check (see the sketch after this list):

  • the random_state parameter of train_test_split?
  • as you oversample randomly, have you set random_state there as well?
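
For example, a minimal sketch of seeding every random step (the seed value itself is arbitrary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

SEED = 42  # any fixed integer

# Seed the data generation (including flip_y noise), the split and the oversampler.
X, y = make_classification(n_samples=5000, n_features=3, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, weights=[0.95, 0.05],
                           flip_y=0.15, class_sep=0.8, random_state=SEED)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=SEED)

ros = RandomOverSampler(random_state=SEED)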

Answered by Malo on September 5, 2021

The main problem with oversampling before the split is that the reported score is optimistically biased (in your experiment, quite heavily!). The resulting model isn't necessarily bad for future purposes, just probably not nearly as good as you think. (N.B. the scores used to pick best hyperparameters are no longer unbiased estimators of performance either.)

Now with hyperparameter tuning in the mix, which looks at those scores, you might end up with a set of hyperparameters that helps the model overfit on the duplicated rows, so your resulting model might suffer in future performance for this reason. However, in your experiment you only tune the number of trees in the random forest, which has little effect on the final performance (just reducing the variance due to the random row/column sampling). So similar test set performances are not unexpected here.
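
For illustration only, a grid that also varies the parameters controlling tree complexity would give the tuning step something that can actually overfit or underfit (the specific parameters and values below are assumptions, not part of the original experiment):

# Assumed example grid: max_depth and min_samples_leaf constrain individual
# trees, so they affect over-/under-fitting far more than n_estimators alone.
param_grid = {
    'classifier__n_estimators': [500, 1000],
    'classifier__max_depth': [5, 10, None],
    'classifier__min_samples_leaf': [1, 5, 20],
}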

Answered by Ben Reiniger on September 5, 2021

If you don't provide an int for random_state in train_test_split, it doesn't mean there is no randomness. It defaults to None; check the documentation of train_test_split, which says:

If None, the random number generator is the RandomState instance used by np.random.

Now let's look at the RandomState instance in the documentation of np.random. Take a look at these points:

defaults to None. If size is None, then a single value is generated and returned.

and

If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise.

What these mean is that you can get the very same results several times, but there is no guarantee it will work for others. Nothing is truly random; everything uses a pseudo-random generator, so somehow a "random_state" is generated.

My best guess would be to try a train:valid:test split of 80:10:10 or 60:20:20. After doing all this GridSearchCV, compare your validation scores and then try the best estimator on the test set. But this assumes you have enough data.
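
A minimal sketch of such a 60:20:20 split, assuming the same X and y as in the question:

from sklearn.model_selection import train_test_split

# First carve off 20% as the final test set, then split the rest 75/25,
# which gives 60% train and 20% validation of the original data.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20)
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp,
                                                      test_size=0.25)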

Hope this helps. If anyone finds any mistakes, I'd be happy to get corrected.

Answered by tenshi on September 5, 2021
