TransWikia.com

One hot encoding alternatives for large categorical values

Data Science Asked by vinaykva on January 6, 2021

I have a data frame with a categorical column that has over 1600 distinct categories. Is there an alternative to one-hot encoding so that I don’t end up with over 1600 columns?

I found this interesting link.

But they convert the result to a class/object, which I don’t want. I want my final output as a data frame so that I can test it with different machine learning models. Alternatively, is there any way I can use the generated matrix to train machine learning models other than logistic regression or XGBoost?

Is there any way I can implement this?

4 Answers

One option is to map rare values to 'other'. This is commonly done in e.g. natural language processing - the intuition being that very rare labels don't carry much statistical power.
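A minimal pandas sketch of the "map rare values to 'other'" idea follows. The column name `city`, the helper `lump_rare_categories`, and the `min_count` threshold are all hypothetical choices for illustration; a suitable threshold depends on your data.

```python
import pandas as pd


def lump_rare_categories(series, min_count=10, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with other_label.
    return series.where(~series.isin(rare), other_label)


# Toy data: "a" and "b" are frequent, "c" and "d" each appear once.
df = pd.DataFrame({"city": ["a"] * 5 + ["b"] * 4 + ["c", "d"]})
df["city_lumped"] = lump_rare_categories(df["city"], min_count=3)

# One-hot encoding now yields one column per frequent value plus "other".
encoded = pd.get_dummies(df["city_lumped"], prefix="city")
```

The result is still an ordinary data frame, so it can be passed to any model that accepts one.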

I have also seen people map 1-hot categorical values to lower-dimensional vectors, where each 1-hot vector is re-represented as a draw from a multivariate Gaussian. See e.g. the paper Deep Knowledge Tracing, which says this approach is motivated by the idea of compressed sensing:

BARANIUK, R. Compressive sensing. IEEE signal processing magazine 24, 4 (2007).

Specifically, they map each vector of length N to a shorter vector of length log2(N). I have not done this myself but I think it would be worth trying.
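As a rough sketch of that idea (not necessarily the paper's exact procedure), each category can be assigned one fixed random Gaussian vector of length ceil(log2(N)). The helper name `gaussian_codes`, the fixed seed, and the toy column `item` are assumptions made for illustration.

```python
import math

import numpy as np
import pandas as pd


def gaussian_codes(series, seed=0):
    """Map each distinct category to a fixed random Gaussian vector."""
    idx, categories = pd.factorize(series)  # integer code per row
    n = len(categories)
    dim = max(1, math.ceil(math.log2(n)))   # assumed target length
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((n, dim))   # one row per category
    cols = [f"{series.name}_g{i}" for i in range(dim)]
    return pd.DataFrame(table[idx], columns=cols, index=series.index)


# Toy column with 1600 distinct categories.
s = pd.Series([f"cat_{i % 1600}" for i in range(5000)], name="item")
codes = gaussian_codes(s)  # 11 columns instead of 1600
```

Rows with the same category receive the same vector, and the output stays a data frame. Note that the seed (or the lookup table) must be stored so that new data is encoded consistently.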

Correct answer by tom on January 6, 2021

You can read the data and first get a list of all the unique values of your categorical variables. Then you can fit a one-hot encoder object (such as sklearn.preprocessing.OneHotEncoder, which absorbed the never-released CategoricalEncoder) on your list of unique values.
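A small sketch of this two-pass approach with scikit-learn's `OneHotEncoder` is shown below. The in-memory `chunks` list stands in for data read in pieces, and the column name `color` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Stand-in for data that arrives in chunks (e.g. from a chunked file read).
chunks = [
    pd.DataFrame({"color": ["red", "blue"]}),
    pd.DataFrame({"color": ["green", "red"]}),
]

# Pass 1: collect every unique value across all chunks.
unique_values = set()
for chunk in chunks:
    unique_values.update(chunk["color"].unique())

# Fit the encoder on the unique values only.
unique_df = pd.DataFrame({"color": sorted(unique_values)})
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(unique_df)

# Pass 2: every chunk is encoded with the same column layout.
encoded_chunks = [encoder.transform(chunk[["color"]]) for chunk in chunks]
```

By default the encoder returns a SciPy sparse matrix, so 1600 categories do not require 1600 dense columns in memory. Many scikit-learn estimators accept sparse input directly, and `pd.DataFrame.sparse.from_spmatrix` can wrap the matrix if a data frame is required.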

This method also helps in a train/test framework or when you are reading your data in chunks. I have created a Python module that does all this on its own. You can find it in this GitHub repository: dummyPy

A short tutorial on this: How to One Hot Encode Categorical Variables in Python?

Answered by Yashu Seth on January 6, 2021

You can bucket similar values: values (or columns) that are close to each other, or that follow a very similar pattern, can be replaced by a single value (or column). That way your 1600 values can come down to, say, 400 or even fewer.

For example, weather-like values such as Nimbus Clouds, drizzle, light rain, rain, and heavy rain can be reduced to light rain, rain, and heavy rain.
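The bucketing in the weather example could be sketched with a plain mapping, as below. Which fine-grained value goes into which bucket is an assumption here (nimbus clouds and drizzle are folded into light rain); in practice the mapping comes from domain knowledge.

```python
import pandas as pd

# Hypothetical mapping from fine-grained values to coarser buckets.
bucket_map = {
    "nimbus clouds": "light rain",
    "drizzle": "light rain",
}

weather = pd.Series(
    ["nimbus clouds", "drizzle", "light rain", "rain", "heavy rain"],
    name="weather",
)

# Values not listed in bucket_map are left unchanged.
bucketed = weather.replace(bucket_map)
```

Five distinct values become three, and the reduced column can then be one-hot encoded as usual.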

Answered by akash manakshe on January 6, 2021

Refer to this link, which also deals with a categorical feature that has many unique values:

https://datascience.stackexchange.com/a/64021/67149

For embeddings, you can refer to the link below (not written by me, but worth reading): https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9

Answered by Amandeep on January 6, 2021
