
Using a categorical feature both as a continuous feature and with one-hot encoding: is this overkill?

Cross Validated Asked by stats_nerd on January 10, 2021

I am working on a machine learning regression problem with a dataset that spans several years. From the “date” feature, I extracted the week number (0-53). Next, I am doing two things:

1) One-hot encoding: splitting this categorical “week number” feature into 53 binary features, where each feature indicates whether the data point belongs to that particular week number.

2) I am also using the cyclic variable (week number) as a continuous variable to predict my outcome. Before doing so, however, I convert it to the distance from week 1, so that weeks 2 and 53 do not represent drastically different time points.
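For concreteness, the two steps above could be sketched roughly as follows. This is only an illustration under assumptions: the column name `week`, the sample values, the 1..53 numbering, and the 53-week cycle length are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per observation, week number assumed to run 1..53.
df = pd.DataFrame({"week": [1, 2, 27, 52, 53]})

# 1) One-hot encoding: one binary column per week number present in the data.
one_hot = pd.get_dummies(df["week"], prefix="week")

# 2) Circular distance from week 1, assuming the cycle is 53 weeks long.
PERIOD = 53  # assumption; many years have only 52 weeks
raw = (df["week"] - 1).abs()
df["dist_from_week1"] = np.minimum(raw, PERIOD - raw)

print(df)
```

With this definition, weeks 2 and 53 both end up at distance 1 from week 1.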

My question is: am I making this more complicated than necessary, without any realistic gain in model performance? Does including the continuous variable actually give my model valuable information that is not already covered by the one-hot encoded features? Thank you in advance.

2 Answers

I think what you are doing is unnecessary.

Encoding of categorical features heavily depends on model choice:

In the case of linear models and neural nets:

The feature has to be encoded either as a continuous variable or with one-hot encoding. In your example, I believe it is better to encode it as a continuous variable, because one-hot encoding makes the dimensionality relatively high. Some approaches:

  • Mean encoding (use the expanding variant if you choose this option)
  • Cyclical encoding; the code I often use in my projects is at the end of this answer
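As a rough illustration of the expanding variant of mean encoding mentioned above, each row can be encoded with the mean of the target over earlier rows of the same category only, so a row never sees its own target. The column names `week` and `y` and the sample values are hypothetical, and the rows are assumed to be sorted by time already.

```python
import pandas as pd

# Hypothetical data, already in time order; "week" is the category, "y" the target.
df = pd.DataFrame({
    "week": [1, 1, 2, 1, 2, 2],
    "y":    [10.0, 20.0, 5.0, 30.0, 15.0, 25.0],
})

# Sum of the targets of EARLIER rows in the same week (the current row is excluded).
prior_sum = df.groupby("week")["y"].cumsum() - df["y"]
# Number of earlier rows in the same week.
prior_count = df.groupby("week").cumcount()

# Expanding mean of earlier targets; the first row of each week has no history (NaN).
df["week_mean_enc"] = prior_sum / prior_count

print(df)
```

The `NaN` values left for the first occurrence of each category still need a fill strategy, such as a prior chosen from the training data; which one is appropriate depends on the problem.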

In the case of tree-based models:

  • You can go with any method, but again I would not recommend one-hot encoding because of the dimensionality problem.
  • Ordinal features generally work well. Sometimes mean encoding provides some improvement.

In any case, the suggestion from @Ugur MULUK is helpful. Date features such as week of the month, day of the week, season, hour, year, or whether or not it is a business day help a lot.

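Code I use to create cyclical features:

```python
import numpy as np

# df.DATETIME is assumed to be a pandas datetime64 column.
# Divide by the number of distinct values (24 hours, 7 days) so that the last
# value lands next to the first one on the circle instead of on top of it.
df['hour_sin'] = np.sin(2 * np.pi * df.DATETIME.dt.hour / 24)
df['hour_cos'] = np.cos(2 * np.pi * df.DATETIME.dt.hour / 24)

df['day_sin'] = np.sin(2 * np.pi * df.DATETIME.dt.dayofweek / 7)
df['day_cos'] = np.cos(2 * np.pi * df.DATETIME.dt.dayofweek / 7)
```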
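The same idea could be applied to the week number from the question. The sketch below is only an illustration: the helper `add_cyclical`, the zero-based week values, and the 53-week period are assumptions made for the example, not part of the original code.

```python
import numpy as np
import pandas as pd

def add_cyclical(df, col, period):
    """Add sine/cosine columns for an integer feature that repeats every `period` steps."""
    angle = 2 * np.pi * df[col] / period
    df[f"{col}_sin"] = np.sin(angle)
    df[f"{col}_cos"] = np.cos(angle)
    return df

# Hypothetical zero-based week-of-year values (0..52), assumed 53-week cycle.
weeks = pd.DataFrame({"week": [0, 1, 26, 52]})
weeks = add_cyclical(weeks, "week", period=53)

# Compare distances between rows in (sin, cos) space.
points = weeks[["week_sin", "week_cos"]].to_numpy()
gap_adjacent = np.linalg.norm(points[0] - points[1])    # week 0 vs week 1
gap_wraparound = np.linalg.norm(points[0] - points[3])  # week 0 vs week 52
gap_half_year = np.linalg.norm(points[0] - points[2])   # week 0 vs week 26

print(gap_adjacent, gap_wraparound, gap_half_year)
```

In this representation the last week of the cycle is as close to the first week as the second week is, while weeks half a year apart are far from each other.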

Answered by Guneykan Ozkaya on January 10, 2021

1) What you are doing is one-hot encoding. That will give you a sparser dataset, but you do not need both encodings at the same time.

2) You should leave categorical variables as categorical; what you are doing is dangerous. The opposite transformation, from numeric to categorical, would be acceptable, however, through digitization (binning) and grouping.

I do not know your machine learning task. However, I’d recommend extracting features such as week of the month, day of the week, season, hour, year, or whether or not it is a business day.
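A minimal sketch of how such features might be extracted with pandas is given below. The column name `date`, the sample timestamps, and the particular definitions of week of the month, season, and business day are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical timestamps; the column name "date" is an assumption.
df = pd.DataFrame({"date": pd.to_datetime([
    "2021-01-09 14:30", "2021-01-11 08:00", "2021-07-31 23:15",
])})

dt = df["date"].dt
df["year"] = dt.year
df["hour"] = dt.hour
df["day_of_week"] = dt.dayofweek              # Monday=0 ... Sunday=6
df["week_of_month"] = (dt.day - 1) // 7 + 1   # simple definition: days 1-7 are week 1
df["season"] = dt.month % 12 // 3             # 0=Dec-Feb, 1=Mar-May, 2=Jun-Aug, 3=Sep-Nov
df["is_business_day"] = dt.dayofweek < 5      # Monday-Friday only; public holidays ignored

print(df)
```

Public holidays are not handled here; a holiday calendar for the relevant country would be needed for a more faithful business-day flag.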

Answered by Ugur MULUK on January 10, 2021
