Cross Validated Asked by Tomek Tarczynski on January 1, 2022
Let's assume that we have a dataset containing a continuous variable $Y$, which we want to predict, and 10 predictors $X_{1}, \dots, X_{10}$. The number of observations is $n=1000$. I have questions about proper cross validation in the following two situations:
I want to add a variable $X_{11}$ equal to the average of $Y$ over the 10 nearest observations (the metric is not important), and then fit a linear regression on this extended dataset. What is the proper way to do CV (k-fold with $k=5$)?
Two models were built (for example, a random forest and a gradient boosting machine), and now I want to make a linear blend of those two models. Which predictions from the models should be used as predictors? One solution is:
I strongly believe that both cases are well known, but I would like to know what is the state of the art in that matter.
Question 1:
This is very similar to what is called K-fold target encoding, and a correct way to do it is explained here:
https://medium.com/@pouryaayria/k-fold-target-encoding-dfe9a594874b
Your encoding is slightly different from what is described in the article above, but you can apply the same design.
Answered by steco on January 1, 2022
Looking for close-by cases and upweighting them for prediction is referred to as local models or local prediction.
For the proper way to do cross validation, remember that for each fold you fit using only the training cases, and then do with the test cases exactly what you would do to predict a new, unknown case.
I'd recommend treating the calculation of $X_{11}$ as part of the prediction, e.g. as a two-level model consisting of an $n$-nearest-neighbours step plus a second-level model:
So for prediction of a case $X_{new}$, you
You use exactly this prediction procedure to predict the test cases in the cross validation.
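A minimal sketch of that procedure, assuming Euclidean distance, ordinary least squares as the second-level model, and synthetic data (none of which come from the question). The helper names `knn_target_mean` and `cv_rmse` are made up for illustration. Excluding each training row from its own neighbourhood is also an assumption here; it is one way to make the training-time feature resemble what a new case would get.

```python
import numpy as np

def knn_target_mean(X_ref, y_ref, X_query, k=10, exclude_self=False):
    """Mean of y over the k nearest reference rows (Euclidean distance)."""
    # Pairwise squared distances, shape (n_query, n_ref).
    d2 = ((X_query[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    if exclude_self:
        # Only valid when X_query is X_ref: a row must not be its own neighbour.
        np.fill_diagonal(d2, np.inf)
    idx = np.argsort(d2, axis=1)[:, :k]
    return y_ref[idx].mean(axis=1)

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def cv_rmse(X, y, n_folds=5, k=10, seed=0):
    """k-fold CV in which X11 is recomputed inside every fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]
        # X11 for training rows: neighbours among the *other* training rows.
        x11_tr = knn_target_mean(X_tr, y_tr, X_tr, k, exclude_self=True)
        # X11 for test rows: neighbours among training rows only; y_te is never used.
        x11_te = knn_target_mean(X_tr, y_tr, X_te, k)
        coef = fit_linear(np.column_stack([X_tr, x11_tr]), y_tr)
        pred = predict_linear(coef, np.column_stack([X_te, x11_te]))
        scores.append(np.sqrt(np.mean((pred - y_te) ** 2)))
    return np.array(scores)

# Synthetic stand-in data (assumption): 200 rows, 10 predictors.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)
scores = cv_rmse(X, y)
print("RMSE per fold:", scores)
```

The point of the design is that the test fold's $Y$ values never enter the computation of $X_{11}$, neither for the training rows nor for the test rows.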
random forest tends to overfit on training data set
Usually a random forest will overfit only in situations where you have a hierarchical/clustered data structure that creates a dependence between (some) rows of your data.
Boosting is more prone to overfitting because of the iteratively weighted average (as opposed to the simple average of the random forest).
I did not yet completely understand your question (see comment). But here's my guess:
I assume you want to find out the optimal weight you should use for random forest and boosted prediction, which is a linear model of those two models. (I don't see how you could use the individual trees within those ensemble models because the trees will totally change between the splits). This again amounts to a 2 level model (or 3 levels if combined with the approach of question 1).
The general answer here is that whenever you do a data-driven model or hyperparameter optimization (e.g. optimize the weights for random forest prediction and gradient boosted prediction by test/cross validation results), you need to do an independent validation to assess the real performance of the resulting model. Thus you need either yet another independent test set, or a so-called nested or double cross validation.
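A rough sketch of such a nested (double) cross validation for blending weights. To stay self-contained it uses a ridge regression and a k-nearest-neighbours regressor as stand-ins for the random forest and the gradient boosting machine; these stand-ins, the synthetic data, and all helper names are assumptions for illustration. The blend is fitted without an intercept, which is also just one possible choice.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Stand-in base model 1: ridge regression with unpenalised intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    P = lam * np.eye(A.shape[1])
    P[0, 0] = 0.0
    coef = np.linalg.solve(A.T @ A + P, A.T @ y)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def fit_knn(X, y, k=5):
    """Stand-in base model 2: k-nearest-neighbours regression."""
    def predict(Xn):
        d2 = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return y[np.argsort(d2, axis=1)[:, :k]].mean(axis=1)
    return predict

BASE_FITTERS = [fit_ridge, fit_knn]

def kfold(n, n_folds, rng):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    folds = np.array_split(rng.permutation(n), n_folds)
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

def oof_predictions(X, y, n_folds, rng):
    """Out-of-fold predictions: each row is predicted by models that never saw it."""
    Z = np.empty((len(y), len(BASE_FITTERS)))
    for tr, te in kfold(len(y), n_folds, rng):
        for j, fit in enumerate(BASE_FITTERS):
            Z[te, j] = fit(X[tr], y[tr])(X[te])
    return Z

def fit_blend(Z, y):
    """Least-squares blending weights (no intercept)."""
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return w

def nested_cv_rmse(X, y, outer=5, inner=5, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for tr, te in kfold(len(y), outer, rng):
        # Inner loop: the weights are chosen using the outer-training data only.
        w = fit_blend(oof_predictions(X[tr], y[tr], inner, rng), y[tr])
        # Refit the base models on all outer-training data, then blend on the test fold.
        Z_te = np.column_stack([fit(X[tr], y[tr])(X[te]) for fit in BASE_FITTERS])
        scores.append(np.sqrt(np.mean((Z_te @ w - y[te]) ** 2)))
    return np.array(scores)

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + X[:, 1] + rng.normal(scale=0.3, size=200)
scores = nested_cv_rmse(X, y)
print("outer-fold RMSE of the blended model:", scores)
```

The outer test folds are touched only once, after the weights are fixed, so the outer scores estimate the performance of the whole procedure, including the weight selection.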
I'd recommend a different approach here: try to cut down the number of splits you need as far as possible by doing as few data-driven hyperparameter calculations or optimizations as possible. There cannot be any discussion about the need for a validation of the final model. But you may be able to show that no inner splitting is needed if you can show that the models you try to stack are not overfit. In addition, this would remove the need to stack at all:
Ensemble models only help if the underlying individual models suffer from variance, i.e. are unstable. (Or if they are biased in opposing directions, so the ensemble would roughly cancel the individual biases. I suspect that this is not the case here, assuming that your GBM uses trees like the RF.)
As for the instability, you can measure this easily by repeated (a.k.a. iterated) cross validation (see e.g. this answer). If this does not point to substantial variance in the prediction of the same case by models built on slightly varying training data (i.e. if your RF and GBM are stable), producing an ensemble of the ensemble models is not going to help.
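One way such a stability check could look, as a sketch only: repeat the k-fold split several times with different random partitions, keep every case's out-of-fold prediction from each repeat, and look at how much the predictions for the same case vary. The k-nearest-neighbours regressor below is merely a stand-in for the RF or GBM, and the data, the function name `repeated_cv_predictions`, and the choice of 20 repeats are assumptions.

```python
import numpy as np

def fit_knn(X, y, k=5):
    """Stand-in model (assumption): k-nearest-neighbours regression."""
    def predict(Xn):
        d2 = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return y[np.argsort(d2, axis=1)[:, :k]].mean(axis=1)
    return predict

def repeated_cv_predictions(X, y, fit, n_repeats=20, n_folds=5, seed=0):
    """Return an (n_repeats, n_cases) array of out-of-fold predictions."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_repeats, len(y)))
    for r in range(n_repeats):
        # A fresh random partition for every repeat.
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        for te in folds:
            tr = np.setdiff1d(np.arange(len(y)), te)
            preds[r, te] = fit(X[tr], y[tr])(X[te])
    return preds

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=150)

preds = repeated_cv_predictions(X, y, fit_knn)
# Spread of the predictions for the same case across slightly different training sets.
per_case_sd = preds.std(axis=0, ddof=1)
# Overall error, for scale.
rmse = np.sqrt(np.mean((preds - y) ** 2))
print("median per-case SD:", np.median(per_case_sd))
print("overall RMSE:      ", rmse)
```

If the per-case spread is small compared with the overall error, the model is stable in the sense used above, and averaging it further is unlikely to gain much.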
Answered by cbeleites unhappy with SX on January 1, 2022