How to restructure my dataset for interpretability without losing performance?

Question

What I am doing:
I am predicting product ratings using boosted trees (XGBoost) with a dataset in this format:

What I want to do:
I want to use SHAP TreeExplainer to interpret each prediction my model gives in terms of product attributes and user ids.
What I am getting:
My model is drawing all the conclusions based on product names and user ids, instead of product attributes and user ids.
What I tried:
I discovered that each product name has a unique combination of product attributes, i.e. by knowing the product attributes you can find its name. So my idea was to remove the product_name column, leaving only the attributes.
My reasoning was that restructuring the dataset in this way would lead to the interpretability that I wanted without any performance loss (since the product name doesn't add any new information).
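The one-to-one claim above can be checked directly before dropping the column. This is a minimal sketch using only the standard library. The rows, the column names (`color`, `size`), and the helper `names_per_attribute_combo` are all hypothetical stand-ins, since the real schema isn't shown in the post.

```python
from collections import defaultdict

# Hypothetical toy rows; the column names are assumptions for illustration only.
rows = [
    {"product_name": "A", "color": "red", "size": "S"},
    {"product_name": "A", "color": "red", "size": "S"},
    {"product_name": "B", "color": "blue", "size": "M"},
    {"product_name": "C", "color": "blue", "size": "L"},
]
attribute_cols = ["color", "size"]


def names_per_attribute_combo(rows, attribute_cols, name_col="product_name"):
    """Map each attribute combination to the set of product names that carry it."""
    combos = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in attribute_cols)
        combos[key].add(row[name_col])
    return dict(combos)


combos = names_per_attribute_combo(rows, attribute_cols)

# Any combination listed here is shared by more than one product name,
# which would mean the attributes do NOT fully identify the product.
ambiguous = {k: v for k, v in combos.items() if len(v) > 1}
print("attribute combos shared by several names:", ambiguous)
```

If `ambiguous` is non-empty on the real data, the product name does carry information that the attributes alone cannot reproduce.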
What I got:
The model performance decreased a lot. Even with a great deal of hyperparameter tuning, I couldn't get near the performance I had when also using the product name.
What I think may be going on:

1. My dataset is too small for the model to learn from the product attributes (10k samples, 60 attributes).

2. Maybe some attributes are adding bias and hurting my model's ability to generalize, leading to overfitting.

I am a little skeptical about number 2, since my training loss also went up when I removed the product name.
My question:
So, how can I restructure my dataset? Does anybody have a clue why my model can't reach the same performance without using the product name? Any light or ideas on what I can try?

Vikrant Arora · Answer

What may be happening is that your attribute features are weak, noisy predictors, so XGBoost can't build meaningful decision trees from them.

When you add the name as a predictor, XGBoost finds some signal with respect to your target variable (rating), and you get a better score. That may be why your name-plus-attributes model performs better than the attributes-only model.

So if you know from domain experience that the product attributes are only very weakly related to rating, you can conclude that this feature set is not going to help you make accurate predictions. Or, instead of relying on domain expertise, you can use correlation or other relevant statistical tests to measure each attribute's relationship to rating. If that relationship turns out to be nonexistent or very weak, you can conclude that a good attributes-only model isn't feasible.

So maybe add more relevant features, if possible, if you want to build a reasonably good model.

Regards
Vik
