Does label encoding an entire dataset cause data leakage?

Question

I have a dataset in which one of the features has many distinct categorical values. Using a LabelEncoder, OrdinalEncoder or OneHotEncoder results in an error because, after splitting the data, the test set ends up containing some values that are not present in the train set.
My question is: if I choose to encode my variables before splitting the data, does this cause data leakage?
I'm aware that I shouldn't perform any normalization or educated transformations on the data before splitting the dataset, but I couldn't find a solution for this problem inside scikit-learn.
Thanks in advance for any responses.
Edit: This particular feature has very high cardinality, with around 60k possible values. So using scikit-learn's OneHotEncoder with handle_unknown set to "ignore" would introduce too many new columns to the dataset.

Victor Luu · Answer

First, there is no data leakage here, because you are encoding a feature, not the target variable. Secondly, you can consider other useful encoding schemes such as target encoding, which will not create a huge number of columns the way one-hot encoding does. In fact, it creates just a single column. Note that target encoding uses the target values, so it should be fitted on the training split only. Also try to reduce the number of distinct values in your category; 60k is far too many.
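As a rough illustration of the target encoding idea mentioned above, here is a minimal sketch in plain Python. The helper names `fit_target_encoding` and `apply_target_encoding` and the toy data are hypothetical, and the choice to map unseen categories to the global training mean is an assumption made for this example, not something the answer prescribes. The statistics are learned from the training rows only.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Learn the mean target per category from training rows only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    # Fallback value for categories never seen during fitting (an assumption).
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def apply_target_encoding(categories, means, global_mean):
    """Replace each category with its learned mean, or the fallback."""
    return [means.get(cat, global_mean) for cat in categories]


# Toy data: "d" appears only in the test split.
train_cats = ["a", "a", "b", "b", "c"]
train_y = [1, 0, 1, 1, 0]
test_cats = ["a", "c", "d"]

means, global_mean = fit_target_encoding(train_cats, train_y)
encoded_test = apply_target_encoding(test_cats, means, global_mean)
print(encoded_test)
```

The feature stays a single numeric column regardless of how many categories exist. In practice some form of smoothing is often added for rare categories, which this sketch leaves out.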

Answered by Victor Luu on November 30, 2021

Sammy · Answer

The cleanest solution would be to apply scikit-learn's OneHotEncoder with the handle_unknown parameter set to "ignore":

handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
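A small sketch of the behaviour described in the quoted documentation, assuming a toy colour feature (the array contents are made up for illustration). The encoder is fitted on the training split only, and the unseen test value becomes an all-zero row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy single-feature data; "green" never appears in the training split.
X_train = np.array([["red"], ["blue"], ["red"]])
X_test = np.array([["blue"], ["green"]])

# Fit on the training split only, so nothing is learned from the test rows.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(X_train)

# The default output is a sparse matrix; convert it only for display.
encoded_test = encoder.transform(X_test).toarray()
print(encoder.categories_)
print(encoded_test)
```

Because the output is sparse by default, a high-cardinality feature adds many columns but stores only the non-zero entries. Whether that is acceptable still depends on the downstream model.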

Other manual solutions are described in this and this question on Stack Overflow, for example.

zachdj · Answer

Encoding labels before splitting the data set should not cause leakage, particularly in the case of ordinal encoding.  Ordinal encoding is just a transform from "label space" to "integer space".  Changing the names we use for the labels does not add any useful information that could change classification results, so no data leakage.
Think about it this way:  Suppose you have 3 labels "Red", "Blue", "Green".  But, for some reason, the software package you are using only works in Spanish.  So you change the labels to "Rojo", "Azul", and "Verde".  No data leakage has occurred - you've just started calling the labels something different.  This is almost perfectly analogous to ordinal encoding*.
I think you could make an argument that one-hot encoding allows for some very, very minor leakage.  Suppose you have labels "Red", "Blue", "Green" but only the first two appear in your training set.  By one-hot encoding the labels before splitting, you implicitly declare that there are three possible labels instead of two. Depending on the definition, this could be described as data leakage, since you can derive some information that's not actually included in the training set.
However, I can't imagine how an ML algorithm would gain an artificial benefit in this scenario, so I don't think it's anything to worry about.

*If you ignore the fact that some algorithms can find spurious relationships between numbers, but not between string labels.
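For readers who would rather fit the ordinal encoder on the training split only, here is a minimal sketch using scikit-learn's OrdinalEncoder with handle_unknown="use_encoded_value" (available in scikit-learn 0.24 and later). The toy data and the sentinel value -1 are assumptions chosen for this example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy single-feature data; "Green" appears only in the test split.
X_train = np.array([["Red"], ["Blue"]])
X_test = np.array([["Blue"], ["Green"]])

# Unseen categories are mapped to the sentinel -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(X_train)

codes = encoder.transform(X_test)
print(encoder.categories_)
print(codes)
```

This keeps the feature in a single column, which matters for a feature with around 60k values, and avoids the error from the question without looking at the test rows during fitting.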
