TransWikia.com

Does using user-specific accumulative variables cause data leakage?

Data Science Asked by Corel on November 7, 2020

Say my observational unit is a bill issued after a service was provided, and my goal is to predict whether the bill will be paid. The system has users, so I include user-level variables such as the number of unpaid bills the user has and the user's time in the service system (seniority). I train on month 1 and test on month 2 (that is, on bills created in those months, respectively).

In the test month, these user-level counts will have increased. For example, if user_1 had been in my system for 100 days during training, then any bill associated with that user in the test month will carry a higher day count.

Is the accumulative nature of such variables considered data leakage between the train and test sets (because part of the information used in training is, in a sense, reused in testing)?

2 Answers

If the data is available at prediction time, then it is not data leakage.

In your specific example, historical user data should be used at prediction time to build a better model.

You do have to decide how to do the train/test split. You can split by user, by time, or by a combination of the two.
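A minimal sketch of the three split options, using only the Python standard library. The toy `bills` records, the `cutoff` date, and the `holdout_users` set are all made up for illustration; they are not from the question.

```python
from datetime import date

# Hypothetical toy data: one record per bill.
bills = [
    {"bill_id": 1, "user_id": "u1", "issued": date(2020, 1, 5)},
    {"bill_id": 2, "user_id": "u2", "issued": date(2020, 1, 20)},
    {"bill_id": 3, "user_id": "u1", "issued": date(2020, 2, 3)},
    {"bill_id": 4, "user_id": "u3", "issued": date(2020, 2, 15)},
]

# Split by time: train on bills issued before the cutoff, test on the rest.
cutoff = date(2020, 2, 1)  # assumed boundary between month 1 and month 2
train_time = [b for b in bills if b["issued"] < cutoff]
test_time = [b for b in bills if b["issued"] >= cutoff]

# Split by user: every bill of a held-out user goes to the test set.
holdout_users = {"u3"}  # assumed choice of held-out users
train_user = [b for b in bills if b["user_id"] not in holdout_users]
test_user = [b for b in bills if b["user_id"] in holdout_users]

# Combination: test only on later bills from users never seen in training.
seen_users = {b["user_id"] for b in train_time}
test_combined = [b for b in test_time if b["user_id"] not in seen_users]
```

Which split is appropriate depends on what the model will face in production. If it will score future bills of existing users, the time split is probably the closest match; if it must also generalize to new users, the combined split is a stricter check.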

Answered by Brian Spiering on November 7, 2020

Yes, it is clearly data leakage; in fact, it's a tricky one that many data scientists miss.

For each row, regardless of which dataset it belongs to, the count feature you describe should be based only on rows that are "older" than the row you are calculating it for. This makes your experimentation and feature engineering more complex, but it is the only way to truly reflect the data that you had (or will have) in reality.
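The "only older rows" rule can be sketched as a running count per user, read before the current row is added to it. This is an illustrative sketch, not the answerer's code: the `bills` records and the `add_prior_unpaid` helper are hypothetical. It also assumes that the paid/unpaid status of an earlier bill is already known when a later bill is issued; in a real system you would likely need to compare against the date on which each outcome became known.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy data: unpaid is 1 if the bill was not paid, else 0.
bills = [
    {"user_id": "u1", "issued": date(2020, 1, 5), "unpaid": 1},
    {"user_id": "u1", "issued": date(2020, 1, 20), "unpaid": 0},
    {"user_id": "u2", "issued": date(2020, 1, 10), "unpaid": 1},
    {"user_id": "u1", "issued": date(2020, 2, 3), "unpaid": 1},
    {"user_id": "u2", "issued": date(2020, 2, 15), "unpaid": 0},
]


def add_prior_unpaid(rows):
    """Return rows in issue order, each with a count of the user's
    unpaid bills issued strictly earlier (the current row is excluded)."""
    running = defaultdict(int)  # unpaid count per user so far
    out = []
    for row in sorted(rows, key=lambda r: r["issued"]):
        # Read the count BEFORE updating it, so the row never sees itself.
        out.append({**row, "prior_unpaid": running[row["user_id"]]})
        running[row["user_id"]] += row["unpaid"]
    return out


featured = add_prior_unpaid(bills)
for row in featured:
    print(row["user_id"], row["issued"], row["prior_unpaid"])
```

Because the feature is computed the same way for every row, month 1 and month 2 bills can be featurized in one pass and split afterwards. Bills that share the same issue date are processed in input order here, so ties would need an explicit rule in practice.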

Answered by Oren Razon on November 7, 2020
