TransWikia.com

Can identifiers be used to train a model?

Data Science Asked on August 12, 2021

I recently participated in a machine learning competition where we were asked to decide whether a rider should accept a course or not (~2k riders and ~140k courses). It turned out that some of the winners used the rider's identifier (an integer unique to each rider) as a feature, and it greatly improved their score. The default notebook had discarded this column.

Is this legitimate? Can an identifier be used to train a model?

2 Answers

You can use any variable during model training as long as it is also available at inference time. This is the only technical restriction. Whether you should include the column is a separate question. If the column is unique for every row and carries no relevant information, you can discard it. Categorical columns with high cardinality (many unique values) can also hurt the model.

On the other hand, if including the identifier improved the score a lot (more than a random fluctuation), then there might be more to it, and you would need to share more information about the dataset and the models used.
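The checks mentioned in this answer (is the identifier unique per row, how high is its cardinality, will it be available at inference) can be sketched roughly as follows. This is only an illustration: the function name `id_column_report` and the toy ID lists are made up, not taken from the competition. Note that with ~2k riders and ~140k courses, each rider ID would recur about 70 times on average, so the column is far from unique per row.

```python
def id_column_report(train_ids, test_ids):
    """Summarise how an identifier column behaves (hypothetical helper)."""
    train_unique = set(train_ids)
    test_unique = set(test_ids)
    return {
        # Number of distinct IDs in the training data.
        "cardinality": len(train_unique),
        # Average number of training rows per ID; 1.0 means the ID is
        # unique per row and cannot generalise on its own.
        "rows_per_id": len(train_ids) / len(train_unique),
        # Share of distinct test IDs already seen in training, i.e. how
        # often the learned ID signal is usable at inference time.
        "test_ids_seen_in_train": len(test_unique & train_unique) / len(test_unique),
    }


# Toy data, invented for illustration only.
train_rider_ids = [1, 1, 2, 2, 2, 3]
test_rider_ids = [1, 3, 4]

report = id_column_report(train_rider_ids, test_rider_ids)
print(report)
```

If `rows_per_id` is close to 1, or few test IDs were seen in training, the identifier is unlikely to help; if each ID recurs many times and most test IDs are known, it can reasonably act as a per-rider feature.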

Answered by Yaroslaw Homenko on August 12, 2021

If adding the identifiers (I presume not as discrete values but converted to real numbers) improved the results, then there is probably a non-obvious correlation between the IDs and the target variable. Maybe the IDs reflect the seniority of the rider (the higher the ID, the lower the seniority) and therefore acted as a proxy for it in the model.

Although it may seem "unorthodox", if the IDs are available at inference time, there is no reason to disallow them, especially when they give the model information that no other variable contains. Furthermore, this looks like a great example of the importance of feature engineering in data science.
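One quick way to probe the proxy hypothesis above is to measure the correlation between the numeric ID and the target. The sketch below uses invented toy data in which lower (more senior) IDs accept more often. The `pearson_r` helper and both lists are assumptions for illustration, not the competition data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data, invented for illustration: one row per course offer.
rider_ids = [1, 2, 3, 4, 5, 6, 7, 8]
accepted = [1, 1, 1, 0, 1, 0, 0, 0]  # 1 = rider accepted the course

r = pearson_r(rider_ids, accepted)
print(f"correlation between rider ID and acceptance: {r:.3f}")
```

A correlation clearly different from zero on the real data would support the idea that the ID encodes something like seniority. A correlation near zero would not rule out a useful ID, since a model can also exploit non-linear or per-rider effects.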

Answered by noe on August 12, 2021
