TransWikia.com

How to handle sparsely coded features in a dataframe

Data Science Asked on August 28, 2021

I have a dataset that contains information regarding diabetes patients, like so:

    id   diabetes   diet   insulin   lifestyle
    0    No         NaN    NaN       NaN
    1    Yes        Yes    Yes       NaN
    2    No         NaN    NaN       NaN
    3    Yes        NaN    NaN       NaN
    4    Yes        Yes    NaN       Yes
    5    Yes        Yes    Yes       Yes

Features diet, insulin and lifestyle have a high percentage of missing data (around 95% each), so initially I excluded these features from my dataset. However, after taking a closer look at the data, I found the values for diet, insulin and lifestyle to be associated with the value of the diabetes feature. This makes sense, as diabetes patients would be recommended treatment relating to diet, insulin intake and lifestyle changes.

  • So in cases where diabetes='No', values for features diet, insulin and lifestyle are missing.

  • And in cases where diabetes='Yes', I have found that in most cases at least one feature from diet, insulin and lifestyle has a value of 'Yes', and the remaining values are missing.

After some reading, I believe features diet, insulin and lifestyle are Missing at Random (MAR), and clearly not Missing Completely at Random (MCAR), as is explained here.

Anyway, my question is: should the nature of the missing data here change my decision to remove these features from the dataset due to their high percentage of missing values? Or should I impute the data for these features by filling in missing values with "No", like so:

    from sklearn.impute import SimpleImputer

    imputer = SimpleImputer(strategy='constant', fill_value='No')
    x[:, 2:5] = imputer.fit_transform(x[:, 2:5])

One Answer

The first question you have to answer is whether these are actually missing or simply sparsely coded!

For example, if the variables are only "supposed" to record a doctor-recommended diet, insulin treatment or lifestyle change, then we can naturally conclude that any NaN is actually a "No". In that case you do not even have to impute the data; you can simply recode "Yes" as 1 and NaN as 0.
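A minimal pandas sketch of that recoding, assuming the data sits in a DataFrame shaped like the example table in the question. The names `df`, `treatment_cols` and `encoded` are made up for illustration, and the toy rows are copied from the question:

```python
import numpy as np
import pandas as pd

# Toy data copied from the example table in the question.
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "diabetes": ["No", "Yes", "No", "Yes", "Yes", "Yes"],
    "diet": [np.nan, "Yes", np.nan, np.nan, "Yes", "Yes"],
    "insulin": [np.nan, "Yes", np.nan, np.nan, np.nan, "Yes"],
    "lifestyle": [np.nan, np.nan, np.nan, np.nan, "Yes", "Yes"],
})

# Hypothetical name for the sparsely coded columns.
treatment_cols = ["diet", "insulin", "lifestyle"]

# "Yes" becomes 1; anything else becomes 0. In sparsely coded data
# "anything else" is only NaN, but a literal "No" would also map to 0.
encoded = df.assign(
    **{col: df[col].eq("Yes").astype(int) for col in treatment_cols}
)
print(encoded)
```

This only makes sense once you have confirmed that NaN really means "No"; it is a plain recoding, not an imputation.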

How do you differentiate this case (sparsely coded) from the other case, actual missing values? Besides applying your domain knowledge, you should also test whether the following rules apply:

  • If diabetes == No then all treatment variables are missing
  • If diabetes == Yes then at least one treatment variable is "Yes"
  • Treatment variables are always only "Yes" or NaN

If all of these rules hold, your dataset is likely sparsely coded; otherwise you have actual missing values.
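The three rules can be checked mechanically. The following is a rough sketch, assuming a pandas DataFrame with the column names from the question; the helper `check_sparse_coding` and its return keys are hypothetical names, not part of any library:

```python
import numpy as np
import pandas as pd


def check_sparse_coding(df, flag_col="diabetes",
                        treatment_cols=("diet", "insulin", "lifestyle")):
    """Return one boolean per rule; all True suggests sparse coding."""
    treatments = df[list(treatment_cols)]
    is_yes = treatments.eq("Yes")      # NaN compares as False here
    is_missing = treatments.isna()
    no_rows = df[flag_col].eq("No")
    yes_rows = df[flag_col].eq("Yes")
    return {
        # Rule 1: non-diabetics have no treatment values at all.
        "no_implies_all_missing": bool(is_missing[no_rows].all(axis=1).all()),
        # Rule 2: every diabetic has at least one treatment marked "Yes".
        "yes_implies_some_treatment": bool(is_yes[yes_rows].any(axis=1).all()),
        # Rule 3: treatment columns contain nothing but "Yes" or NaN.
        "only_yes_or_nan": bool((is_yes | is_missing).all().all()),
    }


# Toy data copied from the example table in the question.
df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4, 5],
    "diabetes": ["No", "Yes", "No", "Yes", "Yes", "Yes"],
    "diet": [np.nan, "Yes", np.nan, np.nan, "Yes", "Yes"],
    "insulin": [np.nan, "Yes", np.nan, np.nan, np.nan, "Yes"],
    "lifestyle": [np.nan, np.nan, np.nan, np.nan, "Yes", "Yes"],
})

result = check_sparse_coding(df)
print(result)
```

On the six example rows, the first and third rules hold, while the second fails because patient 3 has diabetes but no treatment recorded. Whether such exceptions are untreated patients or genuinely missing records is something only domain knowledge can settle.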

Note, however, that if the values are actual missing values (e.g. because you identify some cases where non-diabetics have a treatment, or some cases are labeled "No" in the treatment variables), you can assume they are MAR but not MNAR. In this case I would recommend removing these variables, because imputing "No" or 0 does not make sense if you have no information about what the NaN actually means.

Correct answer by Fnguyen on August 28, 2021
