Target encoding with KFold cross-validation - how to transform test set?

Question

Let's say I have a categorical feature (cat):
import random
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold

random.seed(1234)
y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame.from_dict({"y": y, "cat": cat})

and I want to use target encoding with regularisation using CV like below:
X_train, X_test, y_train, y_test = train_test_split(df[["cat"]], df["y"], train_size=0.8, random_state=42)
df_train = pd.concat([X_train, y_train], axis=1).sort_index()
df_train["kfold"] = -1
idx = df_train.index
df_train = df_train.sample(frac=1)

skf = StratifiedKFold(n_splits=5)
for fold_id, (train_id, val_id) in enumerate(skf.split(X=df_train.drop("y", axis=1), y=df_train["y"])):
    df_train.iloc[val_id, df_train.columns.get_loc("kfold")] = fold_id

df_train = df_train.loc[idx]

encoded_dfs = []

for fold in df_train["kfold"].unique():
    df_train_cv = df_train[df_train["kfold"] != fold].copy()
    df_val_cv = df_train[df_train["kfold"] == fold].copy()

    means = df_train_cv.groupby('cat')['y'].mean()
    df_val_cv['cat'] = df_val_cv['cat'].map(means)
    encoded_dfs.append(df_val_cv)

encoded_dfs = pd.concat(encoded_dfs, axis=0).sort_index()
encoded_dfs.drop('kfold', axis=1, inplace=True)

However, I have some doubts about how I should then encode the test set. As there is no single mapping derived from the train set, I think we should use the whole train set to fit the encodings and then apply them to the test set:
means = df_train.groupby('cat')['y'].mean()
X_test['cat'] = X_test['cat'].map(means)

This seems to be the natural way to do it since, in fact, it is exactly what each CV step mimics. But the results of the model I got were off, which made me wonder whether I am missing something. Please note that, for the sake of simplicity, I omitted the additional smoothing I also applied. Therefore, my question is: is this the correct way to encode the test set?
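
Since the smoothing step is omitted above, here is a minimal sketch of one common variant (additive, or m-estimate, smoothing), fitted on the whole training set and then applied to the test set. The toy data, the helper name smoothed_means, and the weight m=2 are assumptions for illustration, not the smoothing actually used in the question.

```python
import pandas as pd

def smoothed_means(cat: pd.Series, y: pd.Series, m: float) -> pd.Series:
    """Per-category target mean shrunk toward the global mean.

    encoding = (sum_of_y_in_category + m * global_mean) / (count + m)
    """
    prior = y.mean()
    stats = y.groupby(cat).agg(["sum", "count"])
    return (stats["sum"] + m * prior) / (stats["count"] + m)

# Toy data (made up): category "C" has a single row, so it is shrunk the most.
train = pd.DataFrame({"cat": ["A", "A", "B", "B", "B", "C"],
                      "y":   [1,   0,   0,   0,   1,   1]})
test = pd.DataFrame({"cat": ["C", "A"]})

mapping = smoothed_means(train["cat"], train["y"], m=2)  # fitted on train only
test_encoded = test["cat"].map(mapping)
```

The mapping is computed from training rows only, so the test targets never enter the encoding. A larger m pulls rare categories harder toward the global mean.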

Carlos Mougan · Answer

I have some doubts about how I should then encode the test set. As there is no single mapping derived from the train set, I think we should use the whole train set to fit the encodings and then apply them to the test set

Yes, that seems fine, although the way you do it is a bit more complicated than using a pipeline. The idea of splitting into train and test is to mimic how the model will behave in production, on unseen data. Using the test targets for target encoding is data leakage and gives a misrepresentation of how the model will behave in production. So you compute the target means on train and then apply them to test.
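
To make the train/test asymmetry concrete, here is a rough sketch: training rows get out-of-fold encodings, test rows get the encoding fitted on the full training set, and the test targets are never read. The function name encode_train_and_test, the round-robin fold assignment, the fallback to the global mean, and the toy data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def encode_train_and_test(train, test, col="cat", target="y", n_splits=3):
    """Return (out-of-fold encoded train column, full-train encoded test column)."""
    prior = train[target].mean()
    # Assign folds round-robin by row position (a simplification; the
    # question uses StratifiedKFold on shuffled rows instead).
    folds = np.arange(len(train)) % n_splits
    oof = pd.Series(np.nan, index=train.index, dtype=float)
    for k in range(n_splits):
        fit_rows, val_rows = train[folds != k], train[folds == k]
        means = fit_rows.groupby(col)[target].mean()
        oof.loc[val_rows.index] = val_rows[col].map(means)
    # Categories missing from a fit split (or from train) fall back to the prior.
    oof = oof.fillna(prior)
    full_means = train.groupby(col)[target].mean()
    test_enc = test[col].map(full_means).fillna(prior)
    return oof, test_enc

train = pd.DataFrame({"cat": list("AABBABAB"), "y": [1, 1, 0, 0, 1, 0, 0, 1]})
test = pd.DataFrame({"cat": ["A", "B", "Z"]})  # no target column needed
train_enc, test_enc = encode_train_and_test(train, test)
```

Each training row is encoded without looking at its own target, while the test rows use every training row.
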
If you encode the test set this way and it contains a category that was not seen in training, map will leave it as NaN, and most models will then throw an error. The target encoder in the category_encoders library can deal with this:

handle_missing: str
options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

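You can handle it in different ways; the best one depends on your problem. The default is to return the target mean. Note that handle_missing covers NaN inputs; categories unseen during fit are governed by the sibling parameter handle_unknown, which takes the same options and the same default.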
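
The three options can be imitated in plain pandas to see what each one does with an unseen category. This is only a sketch of the behaviour described above, not the category_encoders implementation; the function name apply_encoding, its on_unknown argument, and the toy mapping are hypothetical.

```python
import pandas as pd

def apply_encoding(values: pd.Series, means: pd.Series, prior: float,
                   on_unknown: str = "value") -> pd.Series:
    """Map categories to target means with an explicit policy for unseen ones."""
    encoded = values.map(means)
    unseen = ~values.isin(means.index)
    if on_unknown == "error":
        if unseen.any():
            raise ValueError(f"unseen categories: {sorted(values[unseen].unique())}")
    elif on_unknown == "value":
        encoded = encoded.mask(unseen, prior)  # fall back to the target mean
    elif on_unknown != "return_nan":
        raise ValueError(f"unknown policy: {on_unknown!r}")
    return encoded

means = pd.Series({"A": 0.2, "B": 0.6})  # pretend these were fitted on train
test_cat = pd.Series(["A", "U"])          # "U" is unseen

filled = apply_encoding(test_cat, means, prior=0.4, on_unknown="value")
as_nan = apply_encoding(test_cat, means, prior=0.4, on_unknown="return_nan")
```

With "value" the unseen row gets the prior, with "return_nan" it stays NaN for a later imputation step, and with "error" the call raises.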
The best practice is to create a pipeline in which the target encoding is a step (a transformer). This lets you do CV, evaluate your model on the test set, and more. (Here is a tutorial on how to do it.)
A code snippet:
import random
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from category_encoders.target_encoder import TargetEncoder
from category_encoders.m_estimate import MEstimateEncoder
from sklearn.linear_model import ElasticNet,LogisticRegression

random.seed(1234)
y = random.choices([1, 0], weights=[0.2, 0.8], k=100)
cat = random.choices(["A", "B", "C"], k=100)
df = pd.DataFrame.from_dict({"y": y, "cat": cat})

X_train, X_test, y_train, y_test = train_test_split(df[["cat"]], df["y"], train_size=0.8, random_state=42)
skf = StratifiedKFold(n_splits=5)

clf = LogisticRegression()
te = TargetEncoder()

pipe = Pipeline(
    [
        ("te", te),
        ("clf", clf),
    ]
)

# Grid to search for the hyperparameters
pipe_grid = {
    "te__smoothing": [0.0001],
}

# Instantiate the grid search
pipe_cv = GridSearchCV(
    pipe,
    param_grid=pipe_grid,
    n_jobs=-1,
    cv=skf,
)

pipe_cv.fit(X_train, y_train)

# Replace every test category with one that was never seen in training.
X_test = X_test.assign(cat='UUUUU')

pipe_cv.predict(X_test)

Note that the code is not optimal, but it should show you how to handle target encoding on train and test with a pipeline, including unseen categories :)
Note that the categories were assigned randomly, so the model finds that the best it can do is predict the most frequent class. If you swap in ElasticNet (a regressor), you will get the mean.
If you remove the assignment of the unseen category to the test set, you will still get the same results.
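
To see what the pipeline buys you, here is a rough sketch with a hand-written encoder in place of TargetEncoder. MeanTargetEncoder is a hypothetical stand-in, not the category_encoders class. Because it only learns in fit, cross_val_score refits it on the training part of every fold, and the held-out part is encoded with those means.

```python
import random
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

class MeanTargetEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical minimal target encoder for a single column."""

    def __init__(self, col="cat"):
        self.col = col

    def fit(self, X, y):
        y = pd.Series(list(y), index=X.index)
        self.prior_ = y.mean()
        self.means_ = y.groupby(X[self.col]).mean()
        return self

    def transform(self, X):
        # Unseen categories fall back to the training mean.
        enc = X[self.col].map(self.means_).fillna(self.prior_)
        return enc.to_frame(self.col)

random.seed(1234)
df = pd.DataFrame({
    "y": random.choices([1, 0], weights=[0.2, 0.8], k=100),
    "cat": random.choices(["A", "B", "C"], k=100),
})

pipe = Pipeline([("te", MeanTargetEncoder()), ("clf", LogisticRegression())])
scores = cross_val_score(pipe, df[["cat"]], df["y"], cv=StratifiedKFold(n_splits=5))

# After fitting on all rows, an unseen category still gets a prediction.
pipe.fit(df[["cat"]], df["y"])
pred = pipe.predict(pd.DataFrame({"cat": ["UUUUU"]}))
```

The encoder never sees the held-out targets at any point, which is the same guarantee the manual fold loop in the question is trying to provide.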
