One hot encoding alternatives for large categorical values

Question

I have a data frame with a categorical variable that has over 1600 distinct categories. Is there an alternative to one-hot encoding so that I don't end up with over 1600 columns?
I found this interesting link.
However, it converts the data to a class/object, which I don't want. I want my final output to be a data frame so that I can test different machine learning models. Alternatively, is there a way to use the generated matrix to train models other than logistic regression or XGBoost?
Is there any way I can implement this?

tom · Accepted Answer

One option is to map rare values to 'other'. This is commonly done in, e.g., natural language processing; the intuition is that very rare labels don't carry much statistical power.
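As a rough illustration of the 'other' idea, here is a minimal pandas sketch. The column name `city`, the helper `lump_rare_categories`, and the threshold `min_count` are all made up for the example; a sensible cutoff depends on your data.

```python
import pandas as pd

def lump_rare_categories(series, min_count=2, other_label="other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with the catch-all label.
    return series.where(~series.isin(rare), other_label)

# Hypothetical toy data standing in for the 1600-category column.
df = pd.DataFrame({"city": ["A", "A", "A", "B", "B", "C", "D"]})
df["city_lumped"] = lump_rare_categories(df["city"], min_count=2)

# One-hot encode the reduced column; the result is still a DataFrame.
encoded = pd.get_dummies(df["city_lumped"], prefix="city")
```

Here `C` and `D` each appear once, so both end up in the `other` column and the encoding has three columns instead of four.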

I have also seen people map 1-hot categorical values to lower-dimensional vectors, where each 1-hot vector is re-represented as a draw from a multivariate Gaussian.  See e.g. the paper Deep Knowledge Tracing, which says this approach is motivated by the idea of compressed sensing:

BARANIUK, R. Compressive sensing. IEEE signal processing magazine 24, 4 (2007).

Specifically, they map each vector of length N to a shorter vector of length log2(N).  I have not done this myself but I think it would be worth trying.
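A sketch of what this could look like, assuming each category simply gets one fixed vector drawn from a standard normal distribution; the exact construction in the paper may differ. The helper name `random_gaussian_codes` and the column names are invented for the example.

```python
import math
import numpy as np
import pandas as pd

def random_gaussian_codes(categories, seed=0):
    """Assign each category a fixed random vector of length ceil(log2(N))."""
    categories = list(categories)
    dim = max(1, math.ceil(math.log2(len(categories))))
    rng = np.random.default_rng(seed)
    # Assumption: independent standard normal entries for every category.
    vectors = rng.standard_normal((len(categories), dim))
    return pd.DataFrame(vectors, index=categories,
                        columns=[f"code_{i}" for i in range(dim)])

# Hypothetical category labels; 1600 categories give ceil(log2(1600)) = 11 columns.
cats = [f"cat_{i}" for i in range(1600)]
codes = random_gaussian_codes(cats)

# Look up the code of each row; the output is an ordinary DataFrame.
df = pd.DataFrame({"feature": ["cat_3", "cat_1599", "cat_3"]})
dense = codes.loc[df["feature"]].reset_index(drop=True)
```

The lookup table `codes` has to be stored and reused at prediction time, since the vectors are random and carry no meaning on their own.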

Yashu Seth · Answer

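You can read the data and first get a list of all the unique values of your categorical variables. Then you can fit a one-hot encoder object (like sklearn.preprocessing.OneHotEncoder; the CategoricalEncoder class existed only in a development version of scikit-learn and was released as OneHotEncoder and OrdinalEncoder) on your list of unique values.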

This method can also help in a train/test framework or when you are reading your data in chunks. I have created a Python module that does all this on its own. You can find it in this GitHub repository: dummyPy
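A minimal sketch of the idea using scikit-learn directly rather than dummyPy. The chunks and the `color` column are hypothetical stand-ins for data read piece by piece (for instance with `pd.read_csv(..., chunksize=...)`).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical chunks of one larger data set.
chunks = [
    pd.DataFrame({"color": ["red", "green"]}),
    pd.DataFrame({"color": ["blue", "red", "green"]}),
]

# First pass: collect the unique values across all chunks.
unique_values = sorted(set().union(*(chunk["color"].unique() for chunk in chunks)))

# Fit the encoder on the unique values only (one row per value).
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(pd.DataFrame({"color": unique_values}))

# Second pass: every chunk gets the same columns in the same order.
encoded_chunks = [encoder.transform(chunk[["color"]]) for chunk in chunks]
```

By default the encoder returns SciPy sparse matrices, so 1600 categories do not cost 1600 dense columns in memory, and many scikit-learn estimators accept such matrices as input directly.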

A short tutorial on this: How to One Hot Encode Categorical Variables in Python?

akash manakshe · Answer

You can bucket similar values together, so that values (or columns) with very similar patterns are replaced by a single value (or column). That way your 1600 values can come down to, say, 400 or even fewer.

For example, weather values such as Nimbus clouds, drizzle, light rain, rain, and heavy rain can be reduced to light rain, rain, and heavy rain.
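One way to sketch such bucketing in pandas is a plain lookup dictionary. The particular grouping below (drizzle counted as light rain) and the `weather` column are assumptions for illustration; real buckets should come from domain knowledge.

```python
import pandas as pd

# Hypothetical grouping of fine-grained labels into coarser buckets.
bucket_map = {
    "drizzle": "light rain",
    "light rain": "light rain",
    "rain": "rain",
    "heavy rain": "heavy rain",
}

df = pd.DataFrame({"weather": ["drizzle", "rain", "heavy rain", "light rain", "fog"]})

# Values without a bucket (here "fog") keep their original label.
df["weather_bucket"] = df["weather"].map(bucket_map).fillna(df["weather"])
```

The bucketed column can then be one-hot encoded as usual, with far fewer resulting columns.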

Amandeep · Answer

Refer to this link (it also deals with a categorical feature that has many unique values):

https://datascience.stackexchange.com/a/64021/67149

For embeddings, you can refer to the link below (not written by me, but worth reading):
https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9
