# What test shall I use to validate the use of a certain score to predict my outcome in a survival analysis?

Cross Validated Asked on November 8, 2020

I am validating the use of a clinical cardiovascular score to predict the risk of dementia, using data from a longitudinal study. My outcome is therefore binary (dementia: yes/no), the independent variable (the score) is continuous, and of course I have a whole set of covariates.

I ran a Cox analysis to assess the association between baseline values and the outcome over time, but now I would like to validate the use of the score.
I thought about taking a random sub-sample of my cohort, splitting it into training and test sets, and running some sort of validation statistics (e.g. ROC curves), but I have concerns about this for a number of reasons:

• My sample is relatively small ($n=2500$), and I am afraid that taking a sub-sample would reduce the power too much.
• I am not sure whether ROC curves (or, alternatively, Somers' D) are the best tools in this case, as other measures (like those used to evaluate screening tests) may suit the problem better.
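For context, Somers' D for a survival model is a simple rescaling of Harrell's concordance index ($D = 2C - 1$), which Stata reports after `stcox` via `estat concordance`. Below is a minimal pure-Python sketch of Harrell's C on toy data; the function name and inputs are illustrative, not from the original post:

```python
def harrell_c(risk, time, event):
    """Harrell's concordance index: among usable pairs (the earlier time is an
    observed event), count pairs where the higher risk score had the earlier event."""
    concordant = ties = usable = 0
    n = len(risk)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i]:   # pair is comparable
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    ties += 1                    # ties count half
    return (concordant + 0.5 * ties) / usable

# perfectly ranked toy data: higher score -> earlier event
print(harrell_c([3, 2, 1], [1, 2, 3], [1, 1, 1]))   # 1.0
```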

How should I evaluate the use of this score?
Can you suggest tests better suited to the problem?

For data analysis I use Stata.

So you basically want to do cross-validation (CV) on your dataset.

One type of CV is the holdout method, which divides the data into two parts: Dtrain and Dtest. The model is fit on Dtrain and then predicts values for Dtest (x → y); since we know the actual values, i.e. which y-value corresponds to which x-value, we can compare the predicted and actual values and estimate the performance. The Dtrain/Dtest split is fairly arbitrary; 70/30 is a common choice.
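As a concrete illustration, a 70/30 holdout split can be sketched in a few lines of Python (a toy sketch; the function name and sample size are illustrative):

```python
import random

def holdout_split(data, test_frac=0.3, seed=0):
    """Randomly partition `data` into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = data[:]               # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(list(range(2500)))
print(len(train), len(test))         # 1750 750
```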

But your concern about splitting the dataset into subsets is valid. Why? Because when a fairly small dataset (in your case, N = 2500) is split into parts, there is a much higher chance that Dtest and Dtrain differ from one another.

We can address this with k-fold CV. K-fold cross-validation divides the dataset into k parts. One part is used as test data and the rest (k − 1 parts) as training data. The model is fit repeatedly, once per fold, and an error rate is obtained for each iteration. The mean error rate is used as the performance estimate. This mitigates the problem of subsets that are not representative of the whole dataset.
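The k-fold loop described above can be sketched generically; here `error_fn` stands in for fitting and scoring whatever model you use (a toy stand-in is shown, purely for illustration):

```python
import random

def k_fold_cv(data, k, error_fn, seed=0):
    """Split `data` into k folds; each fold serves once as the test set.
    `error_fn(train, test)` fits on `train` and returns an error rate."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for fi, f in enumerate(folds) if fi != i for j in f]
        errors.append(error_fn(train, test))
    return sum(errors) / k            # mean error rate over the k folds

# toy error_fn: fraction of test points above the training mean
def toy_error(train, test):
    m = sum(train) / len(train)
    return sum(x > m for x in test) / len(test)

print(k_fold_cv(list(range(100)), k=5, error_fn=toy_error))
```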

A remaining problem with k-fold cross-validation is that the folds may happen to be divided so as to give a model that overfits the data: it predicts the given data points quite well, but the error rate is high when the model is given new data. This usually happens when there is little data to begin with. To reduce this risk we can use repeated cross-validation.

Repeated cross-validation:

1. Run k-fold CV, which gives a mean error rate (Ê).
2. Reshuffle the data so the folds come out differently.
3. Repeat steps 1–2 and average the results.
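The steps above can be sketched as follows; a toy "error rate" (fraction of test points above the training mean) stands in for a real model's error, and all names are illustrative:

```python
import random

def mean_error_one_cv(values, k, seed):
    """One k-fold CV pass with a toy error measure."""
    rng = random.Random(seed)
    idx = list(range(len(values)))
    rng.shuffle(idx)                      # reorder so each pass gets different folds
    folds = [idx[i::k] for i in range(k)]
    errs = []
    for i in range(k):
        test = [values[j] for j in folds[i]]
        train = [values[j] for fi, f in enumerate(folds) if fi != i for j in f]
        m = sum(train) / len(train)
        errs.append(sum(x > m for x in test) / len(test))
    return sum(errs) / k                  # mean error rate of this pass

def repeated_cv(values, k=5, repeats=10):
    """Repeat the shuffle + CV cycle and average across repeats."""
    passes = [mean_error_one_cv(values, k, seed=r) for r in range(repeats)]
    return sum(passes) / repeats

data = [random.Random(42).random() for _ in range(200)]
print(repeated_cv(data))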

Figure 1. General idea behind CV.

Answered by Lennart on November 8, 2020
