Cross Validated Asked by iPlexipen on January 3, 2022

There is a dataset with 30 variables and over 5 million observations. We plan to use a subsample of the data for analysis. Between 0.02% and 2.5% of the values of EACH variable are missing. I plan to impute these in Stata, but I’m not sure if we should do the imputation for ALL 30 variables at once, or at different stages.

We will use 11 of the variables to create a subsample. As such, we plan to use imputation prior to this stage in order for the exclusion criteria to be applied correctly. However, once this is done, 3 different regressions will be run (OLS and logistic models). All 30 of the variables will be used at some point in these.

Here is the problem: should the imputation for the other variables (the 19 NOT used for the exclusion criteria) be conducted AFTER the exclusion criteria are applied, or should the imputation be done for ALL variables at the same time (prior to application of the exclusion criteria)?

The command in Stata, `hotdeck`, is what we were going to use.

Since you’ve decided on an imputation method relying on MCAR (missing completely at random) data, I infer that your data are indeed MCAR. In this case, you should impute the missing values **after the exclusion criteria** are applied, for two reasons:

- Speed (because there are fewer data points to process, downstream of exclusion criteria);
- Bespoke imputation for your data of interest. (Imputing all 30 variables before exclusion would instead draw on a larger, less specific population than the one under study.)

The caveat in the above is that it’s based on my inference that because you've chosen hotdeck you have MCAR data. If I’m mistaken, then:

- Don’t impute *any* data using hotdeck; use something such as multiple imputation by chained equations (MICE), for which there are toolboxes.
- Impute the data *before* the exclusion criteria are applied.

Basically, see the other answer here by Robert Long.
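To make the two orderings concrete, here is a minimal Python sketch on toy data. It is not Stata's `hotdeck` command: the `hot_deck` function below is a simple random hot deck written for illustration, and the variable names (`age`, `income`) and the exclusion rule (`age >= 18`) are assumptions, not part of the question.

```python
import random

def hot_deck(rows, var, rng):
    """Fill missing (None) values of `var` with a value drawn at random
    from the rows where `var` is observed (a simple random hot deck)."""
    donors = [r[var] for r in rows if r[var] is not None]
    return [dict(r, **{var: rng.choice(donors)}) if r[var] is None else dict(r)
            for r in rows]

def impute(rows, variables, rng):
    """Apply the hot deck to each listed variable in turn."""
    for var in variables:
        rows = hot_deck(rows, var, rng)
    return rows

def keep(row):
    # Hypothetical exclusion criterion: keep adults only.
    return row["age"] >= 18

rng = random.Random(0)

# Toy data: `age` drives the exclusion, `income` is an analysis variable.
rows = [{"age": a, "income": i} for a, i in
        [(25, 40.0), (None, 52.0), (17, None), (40, None), (31, 61.0), (12, 5.0)]]

# Ordering A: impute every variable on the full data, then exclude.
order_a = [r for r in impute(rows, ["age", "income"], rng) if keep(r)]

# Ordering B: impute only the exclusion variable, exclude,
# then impute the remaining variables within the subsample.
stage1 = impute(rows, ["age"], rng)
subsample = [r for r in stage1 if keep(r)]
order_b = impute(subsample, ["income"], rng)
```

In ordering B the income donors all come from retained rows, so an imputed income is never the excluded child's value of 5.0. In ordering A that value can be drawn as a donor. Either way the exclusion variable itself has to be filled in before the criterion can be evaluated for every row.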

Good luck!

**References:**

- Missing Data Problems in Machine Learning by B. Marlin (2008)
- Section 9.6 of The Elements of Statistical Learning, arguing for multiple imputation when data are not MCAR

Answered by Mark Ebden on January 3, 2022

You should do all the imputations first, otherwise you may get biased results.

I don't know what `hotdeck` in Stata does exactly, but if it is a single imputation method (i.e., you get one completed/imputed dataset) then I would advise against it. At the very least I would advise creating several completed datasets, if the algorithm allows a different seed to create different imputations. I don't know what your reasons for choosing hot decking are, but I have always found multiple imputation to be superior, and it has desirable statistical properties when certain assumptions hold, namely that the missingness is MAR (missing at random) or MCAR (missing completely at random) and not MNAR (missing not at random). Roughly, this means that, for any particular variable, if the missing data can be predicted from the other variables, or if the missing values are simply a random sample, multiple imputation will produce unbiased results.
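As a rough illustration of "several completed datasets with different seeds", the Python sketch below repeats a random hot deck `m` times and pools the estimate of a mean with Rubin's rules (pooled estimate = average of the per-dataset estimates; total variance = within-imputation variance + (1 + 1/m) × between-imputation variance). The function names and toy values are assumptions for illustration, and a plain hot deck repeated with different seeds is not guaranteed to be a "proper" multiple imputation, so treat this as a sketch of the pooling step rather than a recommended procedure.

```python
import random
import statistics

def hot_deck(values, rng):
    """Replace each missing (None) value with a randomly drawn observed value."""
    donors = [v for v in values if v is not None]
    return [rng.choice(donors) if v is None else v for v in values]

def pooled_mean(values, m=5, base_seed=0):
    """Create m completed datasets with different seeds and pool the
    estimated mean and its variance using Rubin's rules."""
    estimates, variances = [], []
    for k in range(m):
        completed = hot_deck(values, random.Random(base_seed + k))
        estimates.append(statistics.mean(completed))
        # Squared standard error of the mean within this completed dataset.
        variances.append(statistics.variance(completed) / len(completed))
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between    # Rubin's total variance
    return q_bar, total

values = [1.0, 2.0, None, 4.0, None, 3.0]
estimate, total_variance = pooled_mean(values, m=5)
```

The between-imputation term is what a single completed dataset cannot provide: it reflects how much the estimate moves when the missing values are filled in differently.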

Answered by Robert Long on January 3, 2022
