# Cross validation and parameter tuning

Cross Validated Asked by Sana Sudheer on November 20, 2020

Can anyone tell me what exactly a cross-validation analysis gives as a result?
Is it just the average accuracy, or does it also give a model with tuned parameters?

I ask because I heard somewhere that cross-validation is used for parameter tuning.

1. Hyperparameter optimization (parameter tuning) searches for the best hyperparameters: parameters that are not directly learnt within estimators, but are instead passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for the Support Vector Classifier, alpha for Lasso, etc.

2. Model selection (model comparison) is the search for the model with the highest generalization ability (lowest generalization error) on your dataset.

3. For large datasets, the data is usually divided into two parts: training data and test data (similar to a Kaggle competition). The model is fit on the training data to learn the patterns, and the test data is used to evaluate the model's generalization ability. However, hyperparameters strongly affect a model's learning ability (as in XGBoost), so the optimal hyperparameter combination needs to be searched for. This is where K-fold cross validation comes in (GridSearchCV, RandomizedSearchCV): in each round, one fold serves as validation data to evaluate the model under the corresponding hyperparameter combination. Therefore, the training data is used to tune hyperparameters and fit the model, and the test data is used to estimate the generalization ability under the optimal hyperparameters and to compare different models.
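The large-dataset workflow in point 3 can be sketched as follows: hold out a test set, tune hyperparameters with K-fold CV on the training data only, then score the winner on the held-out test set. The dataset and the grid of C/gamma values here are illustrative, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Step 1: split off an independent test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: 5-fold CV over a small grid, using only the training data
search = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
search.fit(X_train, y_train)

# Step 3: the untouched test set estimates generalization ability
test_score = search.score(X_test, y_test)
print(search.best_params_, round(test_score, 3))
```

Note that `GridSearchCV` refits the best model on the whole training set by default, so `search` can be used directly as the final estimator.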

4. For small datasets ("large" or "small" refers to typical sample sizes in your study field), a single training/test split is not advisable. To estimate or compare the generalization ability of each model, it is recommended to use k-fold cross validation, which evaluates generalization over the whole dataset. The remaining question is how to choose the optimal hyperparameters, and nested cross validation is used for that. In each outer round, k-1 folds serve as training data and the left-out fold as test data: the k-1 folds are used for hyperparameter optimization (as in 3.) and the left-out fold to compare the models. This is called nested cross validation.
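Point 4 can be written as a short nested cross-validation sketch: the inner loop (GridSearchCV) picks hyperparameters, while the outer loop (cross_val_score) estimates generalization on folds the inner loop never saw. The grid values are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: 3-fold CV chooses C for each outer training set
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)

# Outer loop: 5-fold CV scores the tuned estimator on left-out folds
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```

Because the outer test folds never participate in tuning, their mean score is a less optimistic estimate than the inner CV score.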

5. I think the result of K-fold cross validation can be stored as the predicted values together with the corresponding observed values in a .csv file, and then processed further for your task.

Answered by Ali Ma on November 20, 2020

There are a few ways you can overfit your models to the training data; some are obvious, some less so. First, and most important, is overfitting of the training parameters (weights) to the data (curve-fitting parameters in logistic regression, network weights in a neural network, etc.). Then you would model the noise in the data: if you overfit, you capture not only the underlying generating function, but also randomness due to the sample size and the fact that the sample is not a perfect representation of the population. This overfitting can to a certain extent be mitigated by penalizing certain attributes (in general, complexity) of the model. This can be done by stopping the training once performance on a held-out validation sample is no longer significantly improving, by removing some neurons from a neural network (called dropout), by adding a term that explicitly penalizes the complexity of the model (https://ieeexplore.ieee.org/document/614177/), etc. However, these regularization strategies are themselves parametrized (when do you stop? how many neurons to remove?). In addition, most machine learning models have a number of hyper-parameters that need to be set before training begins, and these hyper-parameters are tuned in the parameter tuning phase.

That brings us to the second, and more subtle, type of overfitting: hyper-parameter overfitting. Cross-validation can be used to find the "best" hyper-parameters, by repeatedly training your model from scratch on k-1 folds of the sample and testing on the remaining fold.

So how is it done exactly? Depending on the search strategy (given by tenshi), you set the hyper-parameters of the model and train it k times, each time using a different test fold. You "remember" the average performance of the model over all test folds and repeat the whole procedure for another set of hyper-parameters. Then you choose the set of hyper-parameters that corresponds to the best performance during cross-validation. As you can see, the computational cost of this procedure depends heavily on the number of hyper-parameter sets that need to be considered. That's why some strategies for choosing this set have been developed (here I'm going to generalize what tenshi said):

1. Grid search: for each hyper-parameter you enumerate a finite number of possible values, and the procedure is then run exhaustively for all combinations of the enumerated hyper-parameters. Obviously, if you have continuous hyper-parameters, you cannot try them all.
2. Randomized grid search: similar to normal grid search, but instead of trying all combinations exhaustively, you sample a fixed number of times from all possible values. Note that here you can not only enumerate possible values for a hyper-parameter but also provide a distribution to sample from.
3. Bayesian search: the combination of hyper-parameter values is chosen to maximize the expected improvement of the score. For more: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf. And a library that deals only with that: https://github.com/hyperopt/hyperopt. As it's not as easy to combine with sklearn as what tenshi recommended, use it only if you're not working with sklearn.
4. Other ways of guided search in hyper-parameter space exist. From my experience they are rarely used, so I won't cover them here.
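Strategies 1 and 2 above map directly onto scikit-learn's GridSearchCV and RandomizedSearchCV. A minimal sketch, where the grid values and the log-uniform distribution for C are illustrative choices, not recommendations:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Strategy 1: exhaustive search over an enumerated grid
grid = GridSearchCV(model, {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X, y)

# Strategy 2: fixed budget (n_iter) of samples drawn from a distribution
rand = RandomizedSearchCV(
    model, {"C": loguniform(1e-3, 1e2)}, n_iter=8,
    cv=5, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)
```

The randomized version is what lets you handle continuous hyper-parameters: instead of enumerating values of C, you hand it a distribution to sample from.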

However, this is not the end of the story, as the hyper-parameters can (and will) also overfit the data. For most cases you can just live with it, but if you want to maximize the generalization power of your model, you might want to try to regularize the hyper-parameters as well. First, you can assess the performance on out-of-sample data a bit better by using nested grid search (details: http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html, discussion: Nested cross validation for model selection), or just use a validation set that is not used for hyper-parameter tuning. As for regularization in the hyper-parameter space, it's more or less an open question. Some ideas include choosing not the best set of hyper-parameter values, but something closer to the middle; the reasoning goes as follows: the best hyper-parameter values most likely overfit the data just because they perform better than the others on the training data, bad parameters are just bad, but the ones in the middle can possibly achieve better generalization than the best ones. Andrew Ng wrote a paper about it. Another option is limiting your search space (you're regularizing by introducing a strong bias here: values outside the search space will obviously never be selected).

Side remark: using accuracy as a performance metric is in most cases a very bad idea; look into the f1 and f_beta scores. These metrics will in most cases better reflect what you're actually trying to optimize in binary classification problems.
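A tiny worked example of why accuracy misleads: on an imbalanced problem, a degenerate classifier that always predicts the majority class scores high accuracy but zero F1. The 95/5 split below is made up purely for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 95 + [1] * 5   # imbalanced: 95% majority class
y_pred = [0] * 100            # "always predict 0" classifier

acc = accuracy_score(y_true, y_pred)           # high despite being useless
f1 = f1_score(y_true, y_pred, zero_division=0) # 0: no positives found
print(acc, f1)
```

F1 drops to zero because the classifier has zero recall on the minority class, which is exactly the failure accuracy hides.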

To summarize: cross-validation by itself is used to assess the performance of the model on out-of-sample data, but it can also be used to tune hyper-parameters in conjunction with one of the search strategies in hyper-parameter space. Finding good hyper-parameters allows you to avoid or at least reduce overfitting, but keep in mind that hyper-parameters can also overfit the data.

Answered by Wojtek on November 20, 2020

k-fold cross-validation splits the data into k partitions; the estimator is then trained on k-1 partitions and tested on the remaining kth partition. Since any of the k partitions can serve as the test partition, there are k possibilities, so you get k results from your estimator, one per choice of test partition.
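The splitting just described can be shown directly with scikit-learn's KFold: each of the k folds serves as the test partition exactly once, yielding k scores. The dataset and estimator are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
scores = []
# Each iteration trains on k-1 partitions, tests on the remaining one
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(len(scores), np.mean(scores))  # k results, one per fold
```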

These are computationally expensive methods, but if you are going to try different estimators you can use these three to do hyperparameter tuning along with CV:

i. GridSearchCV - exhaustively evaluates every combination of the candidate hyperparameter values for each estimator, and in the end reports the best hyperparameters according to that estimator's mean CV score.

ii. RandomizedSearchCV - does not try every combination of hyperparameters, but samples them randomly; this finds a close-to-optimal estimator while saving computation.

iii. BayesSearchCV - not part of scikit-learn (it lives in the scikit-optimize package), but uses Bayesian optimization to guide the search and fit the results.

tl;dr: CV is just used to avoid high bias and high variance in your estimator arising from the particular data you pass it. Hope it was helpful.

Answered by tenshi on November 20, 2020

However, if you use cross validation for parameter tuning, the held-out folds in fact become part of your model, so you need another independent sample to correctly measure the final model's performance.

Employed for measuring model performance, cross validation can measure more than just the average accuracy: it can also measure the model's stability with respect to changing training data. Cross validation builds lots of "surrogate" models that are trained with slightly differing training sets. If the models are stable, all these surrogate models are equivalent; if training is unstable, the surrogate models vary a lot. You can quantify this "varies a lot", e.g., as the variance of the predictions of different surrogate models for the same sample (in iterated/repeated cross validation), or, e.g., as the variance of the surrogate models' parameters.
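One way to quantify the stability described above is to fit a surrogate model on each training fold of repeated cross validation and look at how much the fitted coefficients vary across surrogates. This is a sketch under illustrative choices of dataset and model; small standard deviations suggest stable training.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

X, y = load_iris(return_X_y=True)
coefs = []
# 5 folds x 3 repeats = 15 surrogate models on slightly differing data
for train_idx, _ in RepeatedKFold(n_splits=5, n_repeats=3,
                                  random_state=0).split(X):
    surrogate = LogisticRegression(max_iter=1000).fit(X[train_idx],
                                                      y[train_idx])
    coefs.append(surrogate.coef_.ravel())

# Standard deviation of each coefficient across the surrogate models
coef_sd = np.std(np.array(coefs), axis=0)
print(coef_sd.max())
```

The same idea works with predictions instead of parameters: collect each surrogate's prediction for a fixed sample across repeats and compute its variance.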

Answered by cbeleites unhappy with SX on November 20, 2020

Cross-validation gives a measure of out-of-sample accuracy by averaging over several random partitions of the data into training and test samples. It is often used for parameter tuning by doing cross-validation for several (or many) possible values of a parameter and choosing the parameter value that gives the lowest cross-validation average error.

So the process itself doesn't give you a model or parameter estimates, but you can use it to help choose between alternatives.
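This answer's point can be made concrete with a short loop: run CV for several candidate values of a parameter, keep the one with the best average score, then refit separately, since CV itself only returns scores, not a fitted model. The candidate alphas are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
candidates = [0.01, 0.1, 1.0, 10.0]

# Mean 5-fold CV score for each candidate regularization strength
cv_means = {a: cross_val_score(Ridge(alpha=a), X, y, cv=5).mean()
            for a in candidates}
best_alpha = max(cv_means, key=cv_means.get)

# CV chose the parameter; the model itself is refit on all the data
final_model = Ridge(alpha=best_alpha).fit(X, y)
print(best_alpha)
```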

Answered by Jonathan Christensen on November 20, 2020
