What is the best way to encode an arbitrary collection of strings into int categorical variables?

Question

I have a bunch of categorical labels which I want to transform into int categorical features for an ML algorithm.
The problem is that I don't have a prior list of the categories, so I can't just define a dictionary or mapping function beforehand.

Say, for example, that I am using food labels. My current data has the following labels:
['Steak','Potatoes','Soup'], but it is possible that later I will get data with the labels 'Asparagus' or 'Chow mein', and I have no way of knowing the list of all potential labels beforehand. Moreover, it is possible that some of the incoming labels are proper names or strings that are idiosyncratic and not part of any standard vocabulary, e.g. 'Double Super Mac-Whopper'.

I thought of simply building my own hash map, but then I would have to put a lot of effort into saving and versioning the resulting map to maintain consistency across experiments and later in production.

I tried using the int.from_bytes function in Python 3, but it gives wildly varying int sizes (I think because it is using string length):

> int.from_bytes('steak'.encode('utf-8'),'little')
461195539571
> int.from_bytes('milk'.encode('utf-8'),'little')
1802266989
> int.from_bytes('Bok Choy'.encode('utf-8'),'little')
8750327238520172354
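
The size variation above follows from how int.from_bytes works: it reads the whole byte string as one number, so each extra byte of input can add up to 8 bits to the result. A minimal sketch of this, together with one possible fixed-width alternative, is below. The alternative is an assumption and not something tried in the question: it takes a SHA-256 digest from the standard hashlib module and reduces it to a bucket index. The names stable_bucket and n_buckets are hypothetical, and distinct labels can collide in the same bucket.

```python
import hashlib


def stable_bucket(label: str, n_buckets: int = 2**20) -> int:
    """Map a label to an int in [0, n_buckets) via a fixed-width digest.

    Hypothetical helper: the result depends only on the label's UTF-8
    bytes, so it is the same across runs and machines. Collisions between
    different labels are possible.
    """
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned 64-bit integer.
    return int.from_bytes(digest[:8], "big") % n_buckets


for label in ["steak", "milk", "Bok Choy"]:
    raw_bytes = label.encode("utf-8")
    raw = int.from_bytes(raw_bytes, "little")
    # raw.bit_length() grows with len(raw_bytes); the bucket range does not.
    print(label, len(raw_bytes), raw.bit_length(), stable_bucket(label))
```

Python's built-in hash() is not a substitute here, because string hashes are randomized per process unless PYTHONHASHSEED is fixed.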

I looked at the sklearn categorical encoders (preprocessing.LabelEncoder() or sklearn.feature_extraction.FeatureHasher), but they all seem to require knowledge of the number of categories beforehand (by having to specify a dictionary, fit an encoder to the available data, etc.).

I thought about using word embeddings like word2vec, but they return pretty large vectors, and all I need is an int. I don't really care about semantic similarity, so using a word embedding is overkill.

Is there some sort of preprocessing utility from an ML library, or some publicly available, stable string-to-int hash map that I can use?

Andrey Lukyanenko · Answer

Let's see. You want to be able to work with an unspecified number of categories.
There are multiple ways to handle this.

Create a special category "other" and put all very rare categories into it during preprocessing. When you encounter a new category, put it into this "other" category as well. You can then use any common preprocessing: label encoding, one-hot encoding and so on. This way you'll be able to make predictions for new categories. When you have enough data for them, you can leave them as they are and refit your preprocessing and the model.
Target encoding. There are multiple ways to convert categories to float numbers. You train the target encoding on the data you have and can then apply it to new categories (usually unknown values are assigned the global mean value).
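
Both ideas can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the min_count threshold and the toy labels and targets are all made up for the example, and the target encoding is a plain per-category mean with no smoothing.

```python
from collections import Counter

OTHER = "other"  # hypothetical name for the catch-all category


def fit_label_map(labels, min_count=2):
    """Assign an int to each frequent label; index 0 is reserved for OTHER."""
    counts = Counter(labels)
    frequent = sorted(l for l, c in counts.items() if c >= min_count)
    mapping = {OTHER: 0}
    for i, label in enumerate(frequent, start=1):
        mapping[label] = i
    return mapping


def encode(labels, mapping):
    """Rare and previously unseen labels both fall back to OTHER."""
    return [mapping.get(l, mapping[OTHER]) for l in labels]


def fit_target_encoding(labels, targets):
    """Return per-label target means and the global mean."""
    sums, counts = Counter(), Counter()
    for label, target in zip(labels, targets):
        sums[label] += target
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in counts}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def target_encode(labels, means, global_mean):
    """Unknown labels are assigned the global mean."""
    return [means.get(l, global_mean) for l in labels]


# Toy training data (made up for the example).
train_labels = ["Steak", "Steak", "Potatoes", "Potatoes", "Soup"]
train_targets = [1, 0, 1, 1, 0]

mapping = fit_label_map(train_labels)
print(encode(["Steak", "Soup", "Asparagus"], mapping))  # [2, 0, 0]

means, global_mean = fit_target_encoding(train_labels, train_targets)
print(target_encode(["Steak", "Chow mein"], means, global_mean))  # [0.5, 0.6]
```

Here 'Soup' is too rare and 'Asparagus' is new, so both map to the "other" index, while 'Chow mein' gets the global mean of the training targets. Refitting later, once enough data has arrived for the new labels, gives them their own codes.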
