
Split train/validation/test sets by time, is it correct?

Cross Validated | Asked by wishihadabettername on December 31, 2020

Here’s the scenario, slightly altered to resemble a common one.

Credit card fraud, payments for the last 12 months (a rolling window). Train with the data from the first 10 months, validate with data from the 11th and test with the data from the 12th month.

My rationale is that when the model is used for real, we’ll always have the history available (be it of the same card or of everything in the past, like fraud patterns).

Are there any methodological problems with this approach?
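For concreteness, here is a minimal sketch of the split described above, assuming a pandas DataFrame `df` with a `date` column (the DataFrame and column name are illustrative assumptions, not from the question):

```python
import pandas as pd

# Hypothetical data: df covers the last 12 months of payments and has a
# 'date' column. Cut points are taken relative to the most recent date,
# approximating month boundaries in a rolling 12-month window.
cutoff_val = df['date'].max() - pd.DateOffset(months=2)   # ~end of month 10
cutoff_test = df['date'].max() - pd.DateOffset(months=1)  # ~end of month 11

train = df[df['date'] <= cutoff_val]                                 # months 1-10
val = df[(df['date'] > cutoff_val) & (df['date'] <= cutoff_test)]    # month 11
test = df[df['date'] > cutoff_test]                                  # month 12
```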

3 Answers

Your approach is statistically valid.

That said, you may also want to split the first 10 months into a training and validation set. Build a logistic regression model using only the training set, or perform cross-validation to obtain the lowest misclassification error rate on the validation set. This way you are less likely to overfit the first 10 months of data, which will likely lead to better predictions in the 11th month (your validation set), the 12th month (your test set), and, most importantly, future months. As with any predictive model, it will be important to see how the model holds up over time! A sketch of this idea follows below.
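One way to realize this with scikit-learn, using a chronological cross-validation inside the first 10 months (the feature matrix `X_train` and labels `y_train` are assumptions for illustration, not from the answer):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Chronological cross-validation within the first 10 months: each fold
# trains on earlier data and validates on later data, mirroring how the
# model will actually be used.
tscv = TimeSeriesSplit(n_splits=5)
model = LogisticRegression(max_iter=1000)

# Misclassification error = 1 - accuracy on each held-out fold.
scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='accuracy')
print("CV misclassification error per fold:", 1 - scores)

# Final fit on all 10 months before checking month 11 (validation)
# and month 12 (test).
model.fit(X_train, y_train)
```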

Please let me know if you need any further help! Happy modeling!

Correct answer by Matt Reichenbach on December 31, 2020

You can split your data based on both time and card-account (the account is assumed to be unchanged even if the card number changes).

  1. The reason for the first factor, time, is, as 'cbeleites' mentioned, that we need to look at the performance of the model over time.
  2. The reason for the second factor, card-account, is to avoid a potential overfitting issue in case the same cardholder experienced fraud in both the training and testing periods. In that case, the fraud cases in both periods may share the same profile, which amounts to information leakage; see the sketch after this answer.

One more thing to consider: if we split the data based on time, we should be careful about seasonal effects that may influence the performance of the final model. In other words, the fraud cases in the training and testing datasets should come from the same clusters or groups. This is a drawback of supervised learning: it cannot detect new trends of fraud.
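A sketch of the account-level part of this split, using scikit-learn's GroupShuffleSplit so that no card-account appears on both sides (the arrays `X`, `y`, and `accounts` are illustrative assumptions; the time-based split would be applied in addition):

```python
from sklearn.model_selection import GroupShuffleSplit

# Hold out 20% of card-accounts entirely: every transaction of a given
# account lands on exactly one side, so the same cardholder's fraud
# profile cannot leak from training into testing.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=accounts))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```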

Answered by LeiDing on December 31, 2020

What specifically do you want the validation (11th month) and test (12th month) sets for?

  • If you do any kind of optimization with the 11th month data, you are right that you need to test your final model on independent cases. However, the 12th month would not be independent in that scenario: it has an overlap of 10 months with the training and optimization (11th month) data. In order to have independent data for testing, you'd need to split the records into training and optimization on the one hand, and records kept apart for validation/testing of the final optimized model on the other (test on the 11th month of independent cards).

  • The 12th month of the same credit cards whose first 10 months you used for training would be suited, e.g., for testing how far into the future you can predict fraud risk (i.e. how fast the quality of your predictions deteriorates); a sketch of such a check follows below.
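A minimal sketch of such a decay check, scoring an already-fitted model separately on each future month (the fitted `model` and the monthly `test_sets` dict are assumptions for illustration, not from the answer):

```python
from sklearn.metrics import roc_auc_score

# test_sets maps a month index to its (features, labels) pair.
# A falling AUC as the month index grows shows how quickly the
# predictions of a model trained on months 1-10 go stale.
for month, (X_m, y_m) in sorted(test_sets.items()):
    auc = roc_auc_score(y_m, model.predict_proba(X_m)[:, 1])
    print(f"month {month}: AUC = {auc:.3f}")
```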

Answered by cbeleites unhappy with SX on December 31, 2020
