How big should my subsample be?

Question

Say I have a dataset of 1 million high school students.
I'm trying to determine whether test scores predict success in college.
The dataset covers 1 million students, but it has many more rows than that because each student has multiple tests.
What I want to know is whether I should exclude students who have too few tests, since including them could throw off my results.
For example, I have 20 tests to use as data points for Student A, but only 2 for Student B. Should I keep Student B in the analysis or drop Student B?
In other words, is there a way to calculate the sample size required for a subgroup (here, the tests of a single student) within a larger sample?

MONODA43 · Answer

The comments mentioned that you should first identify what kind of bias you're concerned about. My thought is that the correlation between test scores and college performance may itself vary with the number of tests taken (there could be a confounding variable that affects both the number of tests and college performance).
One way to use all of your scores, rather than dropping students, is a Bayesian model in which you marginalize out the unknown test scores.
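One rough way to look at this before deciding on a cutoff is to compute the score/outcome correlation separately within strata of test count. The sketch below assumes you have already reduced the data to one row per student; the names `n_tests`, `mean_score`, `outcome`, and `edges` are made up for illustration and are not from your dataset. Note that a difference between strata can come from plain measurement noise (a mean of 2 tests is noisier than a mean of 20) as well as from confounding, so treat it as a diagnostic, not a verdict.

```python
import numpy as np

def correlation_by_test_count(n_tests, mean_score, outcome, edges):
    """Pearson correlation of mean test score vs. outcome within test-count strata.

    n_tests, mean_score, outcome: one entry per student (hypothetical inputs).
    edges: increasing cut points; stratum k covers edges[k] <= n_tests < edges[k+1].
    Returns {(lo, hi): (number_of_students, correlation)}.
    """
    n_tests = np.asarray(n_tests)
    mean_score = np.asarray(mean_score, dtype=float)
    outcome = np.asarray(outcome, dtype=float)

    results = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (n_tests >= lo) & (n_tests < hi)
        size = int(mask.sum())
        if size < 3:
            # Too few students in this stratum for a meaningful correlation.
            results[(lo, hi)] = (size, float("nan"))
            continue
        r = np.corrcoef(mean_score[mask], outcome[mask])[0, 1]
        results[(lo, hi)] = (size, float(r))
    return results
```

Calling it with something like `edges=[1, 5, 25]` would give one correlation for students with 1 to 4 tests and one for students with 5 to 24 tests. If the two are close, dropping the low-count students probably changes little.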
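As a much simpler stand-in for that idea (this is not the full Bayesian model, just a normal-normal partial-pooling sketch with variances estimated from the data), you can keep every student and shrink each student's mean score toward the overall mean, with more shrinkage the fewer tests they have. The function name and inputs here are hypothetical, and the sketch assumes test scores are roughly normal around each student's true level with a common noise variance, and that at least some students have two or more tests.

```python
import numpy as np

def shrink_student_means(student_ids, scores):
    """Empirical-Bayes style shrinkage of per-student mean scores.

    student_ids, scores: one entry per test (long format, hypothetical inputs).
    Returns (ids, counts, raw_means, shrunk_means, weights), one entry per student.
    """
    student_ids = np.asarray(student_ids)
    scores = np.asarray(scores, dtype=float)

    ids, inverse, counts = np.unique(
        student_ids, return_inverse=True, return_counts=True
    )
    raw_means = np.bincount(inverse, weights=scores) / counts
    grand_mean = scores.mean()  # overall mean across all tests

    # Pooled within-student variance of a single test score.
    resid = scores - raw_means[inverse]
    sigma2 = (resid ** 2).sum() / (len(scores) - len(ids))

    # Between-student variance by method of moments, floored at zero.
    tau2 = max(raw_means.var(ddof=1) - sigma2 * np.mean(1.0 / counts), 0.0)

    # Weight on the student's own mean; approaches 1 as the test count grows.
    weights = tau2 / (tau2 + sigma2 / counts)
    shrunk_means = grand_mean + weights * (raw_means - grand_mean)
    return ids, counts, raw_means, shrunk_means, weights
```

Under these assumptions, Student B with 2 tests is not dropped. Their estimate is simply pulled further toward the average than Student A's, which reflects how little the 2 tests say on their own.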
