
Why Do Continuous Variable Buckets Overfit the Model?

Data Science Asked by user172500 on May 4, 2021

I have a continuous (high-cardinality discrete) variable 'numInteractionPoints' in my dataset. While training the model I binned this feature in order to avoid overfitting. The first (top) bar chart is from training and the second is from testing for this binned feature. Both datasets have the same distribution, as the following bar graphs show:

[Bar charts: distribution of the binned numInteractionPoints feature in the training set (top) and the test set (bottom)]

This is how I created the bins for this feature: I converted 'numInteractionPoints' into buckets using a simple binning method, with cut points chosen from visual analysis.

import numpy as np
import pandas as pd

# Bucket edges and labels chosen from visual inspection of the distribution.
# Note: pd.cut uses right-closed intervals by default, i.e. (0, 6], (6, 12], (12, inf).
bins = [0, 6, 12, np.inf]
names = ['0-6', '6-12', '12+']

train_data['numInteractionPointsRange'] = pd.cut(train_data['numInteractionPoints'], bins, labels=names)
train_data['numInteractionPointsRange'] = train_data['numInteractionPointsRange'].astype('object')
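
To confirm that the bin distributions really do match across the two sets, here is a minimal sketch (assuming test_data is a DataFrame with the same raw column; the bar charts above plot these proportions):

# Share of rows per bucket in each set
print(train_data['numInteractionPointsRange'].value_counts(normalize=True))
test_binned = pd.cut(test_data['numInteractionPoints'], bins, labels=names)
print(test_binned.value_counts(normalize=True))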

In both cases, whether I include the raw numInteractionPoints variable or the buckets defined above, the F1 score for class 1 drops from 0.47 to 0.33 on the test data: the validation set gives 0.47 while the test set gives 0.33, with or without the bins.


Any idea what I am missing? This behaviour occurs with both algorithms, XGBoost and CatBoost. The target class is highly imbalanced, and I am controlling that with the XGBoost parameter scale_pos_weight. Why does the model overfit with the defined buckets when, as the bar charts above show, the distributions of both datasets are the same?
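
For context, a minimal sketch of the imbalance handling mentioned above, assuming y_train holds the binary target (the variable names are illustrative, not from the original code):

import xgboost as xgb

# scale_pos_weight is conventionally the ratio of negative to positive instances
spw = (y_train == 0).sum() / (y_train == 1).sum()
model = xgb.XGBClassifier(scale_pos_weight=spw)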

One Answer

When you use buckets you lose some information: you implicitly assume the relationship between the variable and the target is flat within each interval, which is probably not the case. This is point #3 on Frank Harrell's list of reasons why you generally shouldn't categorise continuous variables at all.

More specifically, about 80% of your instances seem to fall in one bucket. That means your algorithm can't learn what happens inside that bucket. As a starting point you may use more buckets to check whether that was the problem. More sophisticated techniques would include splines to smooth the response (removing noise helps avoid learning from it). Honestly, given that you work with a tree-based algorithm (XGBoost), you should probably let it choose optimal splits instead of enforcing splits through variable categorisation. That is: don't use a priori variable categorisation at all, prevent overfitting through other parameters (tree depth at least), and if you really need categories, use those found by the algorithm, as in the sketch below.
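
A minimal sketch of that approach, assuming X_train contains the raw (unbinned) numInteractionPoints column and y_train the target; the parameter values are illustrative, not tuned:

import xgboost as xgb

# Feed the raw feature and let the trees choose their own split points;
# control overfitting with tree depth and leaf size instead of manual bins.
neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
model = xgb.XGBClassifier(
    max_depth=4,            # shallow trees limit how finely the feature is carved up
    min_child_weight=10,    # require enough instances per leaf
    learning_rate=0.05,
    n_estimators=300,
    scale_pos_weight=neg / pos,  # imbalance weight, as in the question
)
model.fit(X_train, y_train)

The manual bins hard-code three intervals the trees must respect; here the split points are learned from the data, and depth and leaf-size constraints do the regularisation instead.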

Edit: re-reading your question, it seems that you are overfitting in both cases... the solution might be more about tuning your XGBoost parameters.

Answered by lcrmorin on May 4, 2021
