Advice on imputing temperature data with StatsModels MICE

Data Science Asked by plytheman on September 26, 2021

This may be a dumb question, but I can’t figure out how to actually get the values imputed by StatsModels MICE back into my data. I have a dataframe (dfLocal) with hourly temperature records for five neighboring stations (LOC1:LOC5) over many years, and I’d like to impute the missing data for any given site. Following the examples, I have:

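import statsmodels.api as sm
from statsmodels.imputation import mice

imp = mice.MICEData(dfLocal)
fml = 'LOC1 ~ LOC2 + LOC3 + LOC4 + LOC5'
# Named mice_model so the result does not shadow the imported mice module
mice_model = mice.MICE(fml, sm.OLS, imp)
results = mice_model.fit(10, 10)  # 10 burn-in cycles, 10 imputed datasets
print(results.summary())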

dfLocal.dropna(axis=0, how='all', inplace=True)
imp.data = imp.data.set_index(dfLocal.index)

# In this case I only want to fill one specific set of missing data,
# hence gapStart and gapEnd
dfLocal.loc[gapStart:gapEnd, 'LOC1'] = imp.data[fillSite]

My understanding of MICE is broadly that missing values are imputed multiple times and then combined to find the best value from the many. The only way I’ve found to actually get any numbers out of the above code is with imp.data, but I’m afraid that might just be one of the individual imputations before they’re combined. All I can seem to get from fitting the model (results), though, is the summary.

I’m far from a statistician (and not much of a programmer either), so I’ve been reading through the code for mice.MICE and other resources on general MICE applications, but I’d appreciate any guidance on this as I can’t find much about using statsmodels’ MICE online. Normally I’d post some data on Gist, but the full set is a bit large. That said, I’ll upload it if y’all think it would help.

Thanks!

One Answer

MICE does generate several datasets, but it does not then combine them. Rather, it fits your model on each of those datasets and combines the fitted models. If you really need an imputed dataset, you could just choose one, or combine them in whatever way makes sense for your problem (or you might be better off with another method).
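To make "combines those models" concrete, here is a minimal sketch of the usual pooling rule for multiple imputation (Rubin's rules), which, as far as I can tell, is what results.summary() reports. The function name pool_estimates and the numbers fed to it are made up for illustration; they do not come from the question's data.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Pool one coefficient fitted on m imputed datasets (Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical slope estimates and squared standard errors from m = 3 fits
pooled, total_var = pool_estimates([1.0, 1.2, 1.1], [0.04, 0.05, 0.03])
print(pooled, total_var)
```

Note that what gets averaged here is the model coefficients, not the imputed temperature values, which is why no single "combined" dataset falls out of the fit.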

Now, for the statsmodels implementation, imp.data only keeps track of the latest imputed set [1]. To get all of the datasets, rather than using fit, you can loop over calls to imp.update_all() and save imp.data after each one, as in the example in [2].
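A rough sketch of that loop follows, assuming a small synthetic frame in place of dfLocal. The names df, n_datasets, imputed and loc1_mean are placeholders of mine. Averaging across datasets is only one possible way to combine them and is not something MICE itself does.

```python
import numpy as np
import pandas as pd
from statsmodels.imputation import mice

# Synthetic stand-in for dfLocal: five correlated station series, gaps in LOC1
rng = np.random.default_rng(0)
base = rng.normal(10, 5, size=200)
df = pd.DataFrame(
    {f"LOC{i}": base + rng.normal(0, 1, size=200) for i in range(1, 6)}
)
df.loc[rng.choice(200, size=20, replace=False), "LOC1"] = np.nan

imp = mice.MICEData(df)
n_datasets = 5
imputed = []
for _ in range(n_datasets):
    imp.update_all()                 # run one more imputation cycle
    imputed.append(imp.data.copy())  # copy: imp.data is overwritten each cycle

# One possible combination: the mean of LOC1 across the imputed datasets
loc1_mean = pd.concat([d["LOC1"] for d in imputed], axis=1).mean(axis=1)
```

Observed values are identical in every dataset, so the averaging only changes the rows that were originally missing. From there, something like dfLocal.loc[gapStart:gapEnd, 'LOC1'] could be filled from the combined series once its index has been realigned the way the question already does.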

Answered by Ben Reiniger on September 26, 2021
