Does using user-specific accumulative variables cause data leakage?

Question

Suppose my observational unit is a bill issued after a service was provided, and my goal is to predict whether the bill will be paid. The system has users, so I include user-level variables such as the number of unpaid bills the user has and the user's time in the service system (seniority). I train on month 1 and test on month 2 (bills created in those months, respectively).

In the test month, these user-level variables will have grown. For example, if user_1 had been in my system for 100 days during training, then any bill associated with him in the test month will of course show a higher day count.

Is the accumulative nature of such variables considered data leakage between the train and test sets (because part of the information used in training is, in a sense, used again in testing)?

Brian Spiering · Answer

If the data is available at prediction time, then it is not data leakage.
In your specific example, historical user data is available when the prediction is made, so it should be used to build a better model.
You do have to decide how to do the train/test split. You can split by user, by time, or by a combination of the two.
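
As a rough illustration of the three split options mentioned above, here is a minimal sketch in plain Python. The bill records, the `cutoff` date, and the choice of held-out user are all hypothetical and only serve to show the mechanics; a real pipeline would pick them from the actual data.

```python
from datetime import date

# Hypothetical bills: (bill_id, user_id, issue_date)
bills = [
    ("b1", "user_1", date(2023, 1, 5)),
    ("b2", "user_1", date(2023, 1, 20)),
    ("b3", "user_2", date(2023, 1, 25)),
    ("b4", "user_1", date(2023, 2, 3)),
    ("b5", "user_2", date(2023, 2, 10)),
]

cutoff = date(2023, 2, 1)   # assumed boundary between month 1 and month 2
test_users = {"user_2"}     # assumed set of users held out entirely

# Split by time: train on bills before the cutoff, test on bills after it.
train_time = [b for b in bills if b[2] < cutoff]
test_time = [b for b in bills if b[2] >= cutoff]

# Split by user: every bill of a held-out user goes to the test set.
train_user = [b for b in bills if b[1] not in test_users]
test_user = [b for b in bills if b[1] in test_users]

# Combination: train on early bills of training users,
# test on later bills of held-out users.
train_both = [b for b in bills if b[2] < cutoff and b[1] not in test_users]
test_both = [b for b in bills if b[2] >= cutoff and b[1] in test_users]
```

The time-based split matches the month 1 / month 2 setup in the question. The combined split is the strictest of the three and leaves the least data on each side.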

Oren Razon · Answer

Yes, it is clear data leakage. It's actually a tricky one that many data scientists miss.
The count feature you refer to should, for each row (regardless of which dataset it belongs to), be based only on rows that are "older" than the row you are calculating the count for. This makes your experimentation and feature engineering more complex, but it's the only way to really reflect the data that you had (or will have) in reality.
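
One way to sketch this "older rows only" rule is a running count per user that is read before it is updated. The example below is plain Python with hypothetical bill records, and it makes two simplifying assumptions: issue dates are distinct, and the outcome of each earlier bill is already known by the time a later bill is issued. In practice the second assumption may not hold (a recent bill may still be open), and the count would then need to be restricted to bills whose outcome was known at that date.

```python
from collections import defaultdict
from datetime import date

# Hypothetical bills: (bill_id, user_id, issue_date, was_paid)
bills = [
    ("b1", "user_1", date(2023, 1, 5), False),
    ("b2", "user_1", date(2023, 1, 20), True),
    ("b3", "user_2", date(2023, 1, 25), False),
    ("b4", "user_1", date(2023, 2, 3), False),
    ("b5", "user_2", date(2023, 2, 10), True),
]

def add_prior_unpaid_counts(bills):
    """Return one feature row per bill, counting only older bills of the same user."""
    unpaid_so_far = defaultdict(int)  # running unpaid count per user
    rows = []
    # Process bills in chronological order (assumes distinct issue dates).
    for bill_id, user_id, issued, was_paid in sorted(bills, key=lambda b: b[2]):
        # Read the running count BEFORE this bill's own outcome is added,
        # so the feature never includes the current row or any later row.
        rows.append({
            "bill_id": bill_id,
            "user_id": user_id,
            "issued": issued,
            "prior_unpaid": unpaid_so_far[user_id],
            "was_paid": was_paid,
        })
        if not was_paid:
            unpaid_so_far[user_id] += 1
    return rows

features = add_prior_unpaid_counts(bills)
```

Because the feature is computed the same way for every row, a month 2 bill can legitimately count the user's month 1 bills, while no row ever sees its own label or anything issued after it.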
