TransWikia.com

How to handle unseen labels in test data?

Data Science Asked by Meysam on August 22, 2020

I compute a TF-IDF representation of a training corpus in Python with something like TfidfVectorizer, where every word is a feature. Some words in the test corpus never appear in the training data, so the train and test matrices end up with different numbers of columns and the program raises an error.
How should I solve this problem? How should I handle unseen features in the test set?

One Answer

It depends on the problem. There is no single answer.

Things you can do:

  • Drop these features and keep only the features that appear in both the training and the test set.
  • Use semi-supervised learning.
  • Map these unseen cases to a single new category.
  • Find the nearest neighbour and use its value for that feature.
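
As a rough illustration of the first and third options, here is a minimal pure-Python sketch using plain word counts rather than TF-IDF. The helper names (build_vocab, vectorize) and the "<UNK>" bucket are made up for this example and are not part of any library. The key idea is that the columns are fixed from the training data only, so the test matrix always has the same width. scikit-learn's TfidfVectorizer behaves like the "drop" variant if you call fit (or fit_transform) on the training corpus and only transform on the test corpus.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical name for the catch-all column


def build_vocab(train_docs, add_unk=False):
    """Map each training word to a column index; optionally reserve an UNK column."""
    words = sorted({w for doc in train_docs for w in doc.lower().split()})
    if add_unk:
        words.append(UNK)
    return {w: i for i, w in enumerate(words)}


def vectorize(docs, vocab):
    """Count words per document using only the columns fixed by the training vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word, n in Counter(doc.lower().split()).items():
            if word in vocab:
                row[vocab[word]] += n
            elif UNK in vocab:
                row[vocab[UNK]] += n  # unseen word goes into the shared bucket
            # otherwise the unseen word is simply dropped
        rows.append(row)
    return rows


train = ["the cat sat", "the dog barked"]
test = ["the bird sat", "a dog"]  # "bird" and "a" are unseen

vocab_drop = build_vocab(train)
vocab_unk = build_vocab(train, add_unk=True)

X_train = vectorize(train, vocab_drop)
X_test_drop = vectorize(test, vocab_drop)  # option 1: drop unseen words
X_test_unk = vectorize(test, vocab_unk)    # option 3: one new category
```

Note that with the "<UNK>" variant the extra column is all zeros on the training data unless you also map rare training words to it, so a model cannot learn anything from that column otherwise.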

Correct answer by prashant0598 on August 22, 2020
