TransWikia.com

Encode missing data and unseen data

Data Science Asked by Outcast on January 12, 2021

Let’s assume that I have a classification problem and all my features are categorical data.

I have missing data (and I do not want to do any imputation).

Also, I know that some of my features will contain categories in the test data that never appear in the training data.

My question is the following:

Should I encode the missing values and the unseen categories (of the test set) into the same category or into different ones?

What is the most common practice, and why?

One Answer

This depends at least a little on your model type and encoding strategy. It sounds like you'll be one-hot encoding, possibly with a level/dummy for "missing" and/or for "new". Given that you say you won't impute, I guess you'll be using tree-based models?

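If you have a dedicated level for unseen categories, then the trained model will have no idea what to do with it, because that level never occurs in the training data. Depending on the implementation, the outcome ranges from an error when the model is built, to the dummy feature being ignored (its input was always zero in training), to an essentially random contribution from the dummy feature.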
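To make the problem concrete, here is a minimal library-free sketch of one-hot encoding with a dedicated level for unseen categories. The bank names, the `__unseen__` token, and the `one_hot` helper are all made up for illustration. The point is only that the "unseen" column is constant zero over the training rows, so nothing can be learned about it.

```python
# Hypothetical toy data; bank names and the "__unseen__" token are illustrative.
train = ["bank_a", "bank_b", "bank_a", "bank_c"]
test = ["bank_b", "bank_z"]  # "bank_z" never occurs in training

UNSEEN = "__unseen__"
# Vocabulary fixed at training time, plus one dedicated slot for unseen values.
levels = sorted(set(train)) + [UNSEEN]

def one_hot(value, levels):
    """Return a 0/1 vector; values outside the vocabulary go to the UNSEEN slot."""
    key = value if value in levels else UNSEEN
    return [1 if level == key else 0 for level in levels]

train_encoded = [one_hot(v, levels) for v in train]
test_encoded = [one_hot(v, levels) for v in test]

# The dedicated column is all zeros in training: the model sees no signal for it.
unseen_idx = levels.index(UNSEEN)
unseen_column_in_train = [row[unseen_idx] for row in train_encoded]
print(unseen_column_in_train)
print(test_encoded)
```

Only at test time does that column ever become 1, which is exactly the situation the model was never trained on.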

Assigning new categories to the missing category is a little better, though how well it works will depend on how missing values are treated in your model training (you've said you don't want to impute). At least it gives a consistent result, since the "missing" level does occur in your training data. This may even be the best choice, depending on your data.
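A rough sketch of that option, again without any library: missing values (represented here as `None`) and unseen categories are both sent to a single `__missing__` level. The token name, the helper functions, and the toy data are assumptions for illustration only.

```python
MISSING = "__missing__"  # hypothetical token shared by missing and unseen values

def fit_levels(train_values):
    """Learn the vocabulary from training data, mapping None to MISSING."""
    return sorted({MISSING if v is None else v for v in train_values} | {MISSING})

def encode(value, levels):
    """One-hot encode; None and out-of-vocabulary values share the MISSING slot."""
    if value is None or value not in levels:
        value = MISSING
    return [1 if level == value else 0 for level in levels]

train = ["bank_a", None, "bank_b", "bank_a"]
levels = fit_levels(train)

test = ["bank_b", None, "bank_z"]  # "bank_z" is unseen
encoded = [encode(v, levels) for v in test]
print(levels)
print(encoded)
```

Here the missing row and the unseen row get identical encodings, and because training contains a genuinely missing value, the model has at least seen that column active.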

In your elaboration you mention bank names, which suggests an alternative (maybe several).
Perhaps these unseen banks will have something in common with banks in the training set (maybe they're small and local?). If so, you could try to coarsen the categories in the training set so that the new banks fall naturally into one of the coarser categories, and then encode all future unseen banks into that category. Make a note to reexamine this process when you retrain!
If your training data has lots of small levels, it can also make sense to lump the rare categories together (to help avoid overfitting); in that case you can just drop the new levels into that lumped category. Again, depending on your data, this might work well: small, local banks would also be rare, so in this case it incidentally recovers the previous idea.
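The rare-category lumping could look something like the sketch below. The `min_count` threshold, the `__other__` token, the helper names, and the bank names are all arbitrary choices for illustration; in practice the threshold would need tuning on your own data.

```python
from collections import Counter

OTHER = "__other__"  # hypothetical bucket for rare and unseen levels

def fit_frequent_levels(train_values, min_count):
    """Keep only the levels that occur at least min_count times in training."""
    counts = Counter(train_values)
    return {level for level, n in counts.items() if n >= min_count}

def coarsen(value, frequent_levels):
    """Map rare or never-seen levels to the shared OTHER bucket."""
    return value if value in frequent_levels else OTHER

train = ["big_bank"] * 5 + ["mid_bank"] * 3 + ["local_1", "local_2"]
frequent = fit_frequent_levels(train, min_count=3)

train_coarse = [coarsen(v, frequent) for v in train]
test_coarse = [coarsen(v, frequent) for v in ["big_bank", "local_1", "new_bank"]]
print(sorted(frequent))
print(test_coarse)
```

Because the rare training banks already populate the `__other__` bucket, a bank that first shows up at test time lands in a level the model has actually been trained on.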

Answered by Ben Reiniger on January 12, 2021