Cross Validated Asked on November 8, 2020
I am validating the use of a clinical cardiovascular score to predict the risk of dementia, using data from a longitudinal study. My outcome is binary (dementia: yes or no), the independent variable (the score) is continuous, and I have a full set of covariates.
I ran a Cox regression to assess the association between baseline values and the outcome over time, but now I would like to validate the use of the score.
I thought about taking a random sub-sample of my cohort, splitting it into training and test sets, and running some sort of validation statistics (e.g. ROC curves), but I have some concerns about this for a number of reasons:
How should I evaluate the use of this score?
Can you suggest tests that are better suited to this problem?
For data analysis I use Stata.
So you basically want to cross-validate (CV) your model on your dataset.
One type of CV is the holdout method, which divides the data into two parts: Dtrain and Dtest. The model is fitted on Dtrain and then predicts the outcome for the observations in Dtest (x → y). Since we know the actual values in Dtest, that is, which x-value corresponds to which y-value, we can compare the predicted and the actual values and estimate the performance. The split ratio is fairly arbitrary; a common choice is Dtrain/Dtest = 70/30.
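To make the holdout idea concrete, here is a minimal sketch in plain Python (used only for illustration, since the question is about Stata). The function name `holdout_split` and the seed are my own assumptions; the 70/30 ratio and N=2500 come from the text above.

```python
import random

def holdout_split(n_rows, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) for one random holdout split."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)               # random order, so the split is not systematic
    n_test = round(n_rows * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx

# Hypothetical cohort of 2500 participants, split 70/30.
train_idx, test_idx = holdout_split(2500, test_fraction=0.3, seed=42)
print(len(train_idx), len(test_idx))  # 1750 750
```

The model would then be fitted on the rows in `train_idx` only and evaluated on the rows in `test_idx`.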
But your concern about splitting your dataset into subsets is valid. Why? Because when a fairly small dataset (in your case N=2500) is split into parts, there is a much higher chance that Dtest and Dtrain differ from one another, so the performance estimate depends heavily on which observations happen to land in each part.
We can address this with k-fold CV, which divides the dataset into k parts (folds). One fold is used as test data and the remaining k-1 folds are used as training data. This is repeated so that each fold serves as the test data exactly once, and an error rate is obtained for each iteration. The mean of these error rates is used as the performance value. Because every observation is used for both training and testing, the estimate relies less on a single split that may not be representative of the whole dataset.
The remaining problem with k-fold CV is that the result still depends on how the data happened to be divided into folds. An unlucky partition can make the model look better than it is: it predicts the given data points quite well, but when the model is given new data the error rate is high. This is more likely when there is little data to begin with. To reduce this we can use repeated cross-validation.
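A rough sketch of k-fold CV in plain Python follows. Everything in it is illustrative: the data are simulated, and `fit_threshold` is a toy stand-in model (a single cut-off on the score), not the Cox model from the question.

```python
import random
from statistics import mean

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_threshold(scores, labels):
    """Toy model: cut-off halfway between the mean score of each class."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (mean(pos) + mean(neg)) / 2

def k_fold_error(scores, labels, k=5, seed=0):
    """Return (mean error rate, list of per-fold error rates)."""
    fold_errors = []
    for test_idx in k_fold_indices(len(scores), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(scores)) if i not in held_out]
        # Fit on the k-1 training folds only.
        cutoff = fit_threshold([scores[i] for i in train_idx],
                               [labels[i] for i in train_idx])
        # Misclassification rate on the held-out fold.
        wrong = sum((scores[i] > cutoff) != (labels[i] == 1) for i in test_idx)
        fold_errors.append(wrong / len(test_idx))
    return mean(fold_errors), fold_errors

# Simulated data (an assumption): 500 people, about 30% with the outcome,
# and a score that tends to be higher among cases.
rng = random.Random(1)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(500)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

mean_error, fold_errors = k_fold_error(scores, labels, k=5, seed=1)
print(mean_error, fold_errors)
```

The spread of the values in `fold_errors` gives a feel for how much a single holdout split could have misled you.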
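Repeated cross-validation: the k-fold procedure is run several times, each time with a new random partition into folds, and the performance estimates are averaged over all runs.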
Figure 1. General idea behind CV.
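Repeated cross-validation can be sketched the same way: run k-fold several times, reshuffling the rows before each run, and average the results. As before, this is plain Python on simulated data with a toy cut-off model; the names `repeated_k_fold` and `threshold_error` are hypothetical, and the choice of 5 folds and 10 repeats is arbitrary.

```python
import random
from statistics import mean, stdev

def repeated_k_fold(n_rows, k=5, repeats=10, seed=0):
    """Yield (repeat, train_idx, test_idx); rows are reshuffled for every repeat."""
    rng = random.Random(seed)
    for r in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)               # new random partition for this repeat
        for f in range(k):
            test_idx = indices[f::k]
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield r, train_idx, test_idx

def threshold_error(scores, labels, train_idx, test_idx):
    """Toy model: fit a cut-off on the training rows, return test error rate."""
    pos = [scores[i] for i in train_idx if labels[i] == 1]
    neg = [scores[i] for i in train_idx if labels[i] == 0]
    cutoff = (mean(pos) + mean(neg)) / 2
    wrong = sum((scores[i] > cutoff) != (labels[i] == 1) for i in test_idx)
    return wrong / len(test_idx)

# Simulated data (an assumption), as a stand-in for the real cohort.
rng = random.Random(7)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(500)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

per_repeat = {}
for r, train_idx, test_idx in repeated_k_fold(len(scores), k=5, repeats=10, seed=7):
    per_repeat.setdefault(r, []).append(
        threshold_error(scores, labels, train_idx, test_idx))

repeat_means = [mean(errors) for errors in per_repeat.values()]
print(mean(repeat_means), stdev(repeat_means))
```

The standard deviation across repeats shows how much of the estimate is due to the particular partition rather than to the model.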
Answered by Lennart on November 8, 2020