How big should my subsample be?

Cross Validated Asked by kaecvtionr on December 11, 2020

Say I have a dataset of 1 million high school students.

I’m trying to determine whether test scores can be used to predict successful college performance.

While the dataset has 1 million students, it has many more rows because I have multiple tests for each student.

What I’m trying to decide is whether I should exclude students for whom I don’t have enough tests, since including them could throw off my results.

For example, I have 20 tests to use as data points for Student A, but only 2 tests for Student B. Should I keep Student B when conducting my analysis or drop Student B?

In other words, it seems there should be a way to calculate the required sample size for a subgroup within a larger sample.

One Answer

As the comments mentioned, you should first identify what kind of bias you're concerned about. My thought is that the correlation of test scores with college performance may vary with the number of tests taken: there could be some confounding variable that affects both the number of tests and college performance.
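One way to check whether the relationship really differs by number of tests is to compute the correlation separately within groups of students who took similar numbers of tests. The following is only a rough sketch on synthetic data: the function name `correlation_by_test_count`, the bin edges, and the simulated "ability" model (where a student's mean score is noisier when it is based on fewer tests) are all assumptions made for illustration, not anything taken from the actual dataset.

```python
import numpy as np

def correlation_by_test_count(n_tests, mean_score, outcome, bins=(1, 3, 6, 11, 21)):
    """Pearson correlation between mean test score and college outcome,
    computed within groups defined by number of tests taken.

    `bins` holds hypothetical group edges: [1, 3), [3, 6), [6, 11), [11, 21).
    Returns {(low, high_inclusive): (group_size, correlation)}.
    """
    n_tests = np.asarray(n_tests)
    mean_score = np.asarray(mean_score, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    result = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (n_tests >= lo) & (n_tests < hi)
        if mask.sum() < 3:
            continue  # too few students to estimate a correlation
        r = np.corrcoef(mean_score[mask], outcome[mask])[0, 1]
        result[(lo, hi - 1)] = (int(mask.sum()), float(r))
    return result

# Synthetic data (assumption): each student has a latent ability; the observed
# mean score is ability plus noise that shrinks as the number of tests grows.
rng = np.random.default_rng(0)
n_students = 5000
n_tests = rng.integers(1, 21, size=n_students)  # 1 to 20 tests per student
ability = rng.normal(size=n_students)
mean_score = ability + rng.normal(size=n_students) / np.sqrt(n_tests)
outcome = ability + rng.normal(scale=0.5, size=n_students)

for (lo, hi), (n, r) in correlation_by_test_count(n_tests, mean_score, outcome).items():
    print(f"{lo:>2}-{hi:<2} tests: n={n:5d}, r={r:.3f}")
```

If the correlations differ a lot between groups in your real data, that is a hint that the number of tests matters, either through noisier averages or through a confounder, and that simply dropping or keeping the low-count students could change your conclusion.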

One thought I had on how you could incorporate all your scores is to use a Bayesian model in which you marginalize out the unknown test scores.
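To give a flavour of how a Bayesian approach can keep students with few tests while trusting them less, here is a minimal sketch of a conjugate normal-normal model. Note that this is a simpler stand-in, not the full marginalization described above: it assumes each student has a true mean score drawn from a normal population distribution, that individual tests are normally distributed around that true mean, and that the population mean `mu`, between-student variance `tau2`, and within-student variance `sigma2` are known. All of these names and the numbers in the example are hypothetical.

```python
import numpy as np

def shrunken_student_means(student_means, n_tests, mu, tau2, sigma2):
    """Posterior mean and variance of each student's true mean score
    under a normal-normal model with known hyperparameters.

    student_means : observed average score per student
    n_tests       : number of tests behind each average
    mu, tau2      : population mean and between-student variance (assumed known)
    sigma2        : within-student (test-to-test) variance (assumed known)
    """
    student_means = np.asarray(student_means, dtype=float)
    n_tests = np.asarray(n_tests, dtype=float)
    # Weight on the student's own data; approaches 1 as n_tests grows.
    weight = tau2 / (tau2 + sigma2 / n_tests)
    post_mean = weight * student_means + (1.0 - weight) * mu
    # Equivalent to 1 / (1/tau2 + n_tests/sigma2).
    post_var = weight * sigma2 / n_tests
    return post_mean, post_var

# Student A: 20 tests averaging 80; Student B: 2 tests averaging 80.
# Hypothetical population: mean 70, between- and within-student variance 100.
post_mean, post_var = shrunken_student_means([80, 80], [20, 2],
                                             mu=70, tau2=100, sigma2=100)
print(post_mean, post_var)
```

Under these assumptions Student B's estimate is pulled further toward the population mean and carries a larger posterior variance than Student A's, so B still contributes to the analysis but with less influence. In practice the hyperparameters would be estimated from the data or given priors in a hierarchical model.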

Answered by MONODA43 on December 11, 2020
