# Does label encoding an entire dataset cause data leakage?

I have a dataset on which one of the features has a lot of different categorical values. Trying to use a LabelEncoder, OrdinalEncoder or a OneHotEncoder results in an error, since when splitting the data, the test set ends up having some values that are not present in the train set.

My question is: if I choose to encode my variables before splitting the data, does this cause data leakage?

I’m aware that I shouldn’t perform any normalization or educated transformations on the data before splitting the dataset, but I couldn’t find a solution for this problem inside scikit-learn.

Thanks in advance for any responses.

Edit: This particular feature has very high cardinality, with around 60k possible values, so using scikit-learn's OneHotEncoder with handle_unknown set to "ignore" would introduce far too many new columns to the dataset.

Data Science Asked on November 30, 2021

First, there is no data leakage here, because you are encoding a feature, not the target variable. Second, consider other useful encoding schemes such as target encoding, which does not create the huge number of columns that one-hot encoding does; in fact, it creates just a single column. Also try to reduce the number of distinct values in your category: 60k is far too many.
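To make the idea concrete, here is a minimal pure-Python sketch of target encoding on hypothetical toy data (the categories and targets below are made up for illustration). The encoding is fit on the training split only, and categories unseen at fit time fall back to the global target mean:

```python
from statistics import mean

# Hypothetical toy data: (category, target) pairs in the training split.
train = [("a", 1.0), ("a", 0.0), ("b", 1.0), ("b", 1.0), ("c", 0.0)]
test_categories = ["a", "b", "z"]  # "z" never appears in training

# Fit: compute the mean target per category on the training split only.
global_mean = mean(t for _, t in train)
sums, counts = {}, {}
for cat, t in train:
    sums[cat] = sums.get(cat, 0.0) + t
    counts[cat] = counts.get(cat, 0) + 1
encoding = {cat: sums[cat] / counts[cat] for cat in sums}

# Transform: unseen categories fall back to the global mean.
encoded_test = [encoding.get(cat, global_mean) for cat in test_categories]
print(encoded_test)  # [0.5, 1.0, 0.6]
```

Note that a single numeric column replaces the 60k-wide one-hot matrix; in practice you would also add smoothing to avoid overfitting rare categories.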

Answered by Victor Luu on November 30, 2021

The cleanest solution would be to apply scikit's OneHotEncoder with the handle_unknown parameter set to "ignore":

handle_unknown : {'error', 'ignore'}, default='error'

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
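The "all zeros" behavior described above can be sketched in a few lines of pure Python (no scikit-learn required), using hypothetical color labels as the training categories:

```python
# Categories learned from the training split only.
train_categories = ["Blue", "Green", "Red"]

def one_hot(value, categories):
    """One-hot encode a single value; unseen categories produce an
    all-zeros row, mimicking scikit-learn's handle_unknown='ignore'."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("Green", train_categories))   # [0, 1, 0]
print(one_hot("Purple", train_categories))  # [0, 0, 0] -- unknown, all zeros
```

This is why the setting avoids the error at transform time: rows with unseen categories simply carry no signal in the encoded columns.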

Other manual solutions are described in this and this question on Stack Overflow, for example.

Answered by Sammy on November 30, 2021

Encoding labels before splitting the data set should not cause leakage, particularly in the case of ordinal encoding. Ordinal encoding is just a transform from "label space" to "integer space". Changing the names we use for the labels does not add any useful information that could change classification results, so no data leakage.

Think about it this way: Suppose you have 3 labels "Red", "Blue", "Green". But, for some reason, the software package you are using only works in Spanish. So you change the labels to "Rojo", "Azul", and "Verde". No data leakage has occurred - you've just started calling the labels something different. This is almost perfectly analogous to ordinal encoding*.

I think you could make an argument that one-hot encoding allows for some very, very minor leakage. Suppose you have labels "Red", "Blue", "Green" but only the first two appear in your training set. By one-hot encoding the labels before splitting, you implicitly declare that there are three possible labels instead of two. Depending on the definition, this could be described as data leakage, since you can derive some information that's not actually included in the training set. However, I can't imagine how an ML algorithm would gain an artificial benefit in this scenario, so I don't think it's anything to worry about.
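The difference can be seen in a small pure-Python sketch (the color labels are hypothetical). Fitting the category set before the split yields three columns, implicitly revealing that "Green" exists even though it never appears in the training rows:

```python
# Toy feature column; "Green" appears only in the held-out portion.
full_data = ["Red", "Blue", "Red", "Blue", "Green"]
train = full_data[:4]

cats_full = sorted(set(full_data))   # fitted before splitting: 3 categories
cats_train = sorted(set(train))      # fitted on train only:    2 categories

# Encoding "Red" against each category set:
print([1 if "Red" == c else 0 for c in cats_full])   # [0, 0, 1] -- 3 columns
print([1 if "Red" == c else 0 for c in cats_train])  # [0, 1]    -- 2 columns
```

The extra all-zero column is the only "leaked" information, which is why the effect is negligible in practice.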

*if you ignore the fact that some algorithms can find spurious relationships between numbers, but not string labels.

Answered by zachdj on November 30, 2021
