How to handle unseen labels in test data?

Question

I compute TF-IDF features for a training corpus in Python with something like TfidfVectorizer, where every word is a feature. Some words in the test corpus never appear in the training data, so the train and test matrices end up with different numbers of columns and the program raises an error.
How should I solve this problem? How should I handle unseen features in the test set?

prashant0598 · Accepted Answer

It depends on the problem; there is no single answer.
Things you can do:

Delete these features and focus on the features that appear in both the train and test sets.
Use semi-supervised learning.
Turn these unusual cases into a new, unique category.
Find the nearest neighbour of the unseen feature and use that neighbour's value for it.
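
As a rough illustration of the first and third options, here is a minimal pure-Python sketch. It uses raw word counts rather than TF-IDF weights, and the names `build_vocabulary`, `vectorize` and the `<unk>` token are made up for this example, not part of any library. The idea is that the columns are fixed by the training corpus alone, so a test document always produces a row of the same width.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the catch-all "unseen word" category


def build_vocabulary(train_docs, add_unk=False):
    """Map every word seen in the training corpus to a fixed column index."""
    words = sorted({w for doc in train_docs for w in doc.lower().split()})
    if add_unk:
        words.append(UNK)  # one extra column reserved for unseen words
    return {w: i for i, w in enumerate(words)}


def vectorize(doc, vocab):
    """Count words using only the columns defined by the training vocabulary."""
    row = [0] * len(vocab)
    for word, count in Counter(doc.lower().split()).items():
        if word in vocab:
            row[vocab[word]] += count
        elif UNK in vocab:
            row[vocab[UNK]] += count  # pool unseen words into one category
        # otherwise the unseen word is simply dropped
    return row


train_docs = ["the cat sat", "the dog barked"]
test_doc = "the bird sat quietly"  # "bird" and "quietly" are unseen

vocab_drop = build_vocabulary(train_docs)
vocab_unk = build_vocabulary(train_docs, add_unk=True)

row_drop = vectorize(test_doc, vocab_drop)  # unseen words ignored
row_unk = vectorize(test_doc, vocab_unk)    # unseen words counted under <unk>

print(len(vocab_drop), row_drop)
print(len(vocab_unk), row_unk)
```

With scikit-learn's TfidfVectorizer the first option is what you get if you call fit_transform on the training corpus and then only transform (not fit_transform) on the test corpus: words outside the fitted vocabulary are ignored, and both matrices have the same number of columns.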
