TransWikia.com

Handling features with multiple values per instance in Python for Machine Learning model

Data Science Asked by sums22 on August 25, 2021

I have a dataset containing medical data about children, and I am developing a machine learning model to predict adverse pregnancy outcomes. Most features in the dataset have a single value per child, e.g. gender = ["Male", "Female"].
However, some features have multiple values per child, such as abdominal circumference, which was recorded several times per child, like so:

    ChildID     abdomcirc
0   1           273
1   1           267
2   1           294
3   2           136
4   2           248

In the table above, child 1 has three values for abdomcirc and child 2 has two. Merging this feature with the rest of the dataset (which consists of single-value features) produces near-duplicate rows that differ only in abdomcirc, like so:

    ChildID     gender  diabetes  birthroute  abdomcirc
0   1           Male    No        Normal      273
1   1           Male    No        Normal      267
2   1           Male    No        Normal      294
3   2           Female  Yes       csection    136
4   2           Female  Yes       csection    248

I am unsure what the best way to deal with these features is, without merging the data and having near-duplicate rows. I have considered the following:

  • Using a Python list for abdomcirc. However, I do not know whether a machine learning model can handle this data type. My data would look something like this:

          ChildID     gender  diabetes  birthroute  abdomcirc
     0    1           Male    No        Normal      [273, 267, 294]
     1    2           Female  Yes       csection    [136, 248]
    
  • Transforming abdomcirc into a single-value feature by taking the mean per child (although I am not sure how useful this would be for my predictive model), like so:

          ChildID     gender  diabetes  birthroute  abdomcirc
     0    1           Male    No        Normal      278
     1    2           Female  Yes       csection    192
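
A minimal sketch of this second option, assuming pandas and two hypothetical DataFrames (`static` and `measurements`) that mirror the tables above. The repeated measurements are averaged per child and then joined back on ChildID, so each child keeps exactly one row:

```python
import pandas as pd

# Hypothetical frames mirroring the tables in the question.
static = pd.DataFrame({
    "ChildID": [1, 2],
    "gender": ["Male", "Female"],
    "diabetes": ["No", "Yes"],
    "birthroute": ["Normal", "csection"],
})
measurements = pd.DataFrame({
    "ChildID": [1, 1, 1, 2, 2],
    "abdomcirc": [273, 267, 294, 136, 248],
})

# Collapse the repeated measurements to one mean value per child.
mean_circ = measurements.groupby("ChildID", as_index=False)["abdomcirc"].mean()

# Left-join so every child in the static table keeps exactly one row.
merged = static.merge(mean_circ, on="ChildID", how="left")
print(merged)
```

A left join is used here so that a child with no measurements would still appear, with a missing abdomcirc value, rather than being dropped.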
    

I have tried looking for resources on this but have not been very successful, perhaps because I am not searching with the right keywords. I would appreciate your opinions and any helpful resources. Many thanks!

One Answer

One possible resource is featuretools, a library for feature engineering on data that has many records per instance. Its examples are not from medical cases, but I think it should work for you too.

You can also build several features manually. For instance, given the list of abdomcirc values for a child, you can compute its:

  • mean
  • maximum
  • minimum
  • variance
  • range (maximum minus minimum)
  • last value (if they are sorted by date)
  • number of unique values

These features would capture most of the information in the abdomcirc list, which should help your modelling.
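
As a rough illustration of the list above, here is a sketch using pandas named aggregation. The DataFrame `measurements` and the `circ_*` column names are assumptions made up for this example; the values are the ones from the question.

```python
import pandas as pd

# Hypothetical long-format table: one row per measurement.
measurements = pd.DataFrame({
    "ChildID": [1, 1, 1, 2, 2],
    "abdomcirc": [273, 267, 294, 136, 248],
})

# One row per child, one column per summary statistic.
# "last" only means "most recent" if the rows are already sorted by date.
features = measurements.groupby("ChildID")["abdomcirc"].agg(
    circ_mean="mean",
    circ_max="max",
    circ_min="min",
    circ_var="var",          # sample variance (ddof=1); NaN if a child has one value
    circ_last="last",
    circ_nunique="nunique",
)

# Difference from minimum to maximum.
features["circ_range"] = features["circ_max"] - features["circ_min"]

features = features.reset_index()
print(features)
```

The resulting table has one row per ChildID, so it can be merged with the single-value features without creating duplicate rows.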

I wouldn't go for the first approach of passing lists to the algorithm. Although it is possible, I think it is relatively advanced, and I would only try it if the simpler approaches don't work.

Correct answer by David Masip on August 25, 2021
