# Split train/validation/test sets by time, is it correct?

Here’s my scenario, slightly altered into a more common one.

Credit card fraud, with payments from the last 12 months (a rolling window). Train on data from the first 10 months, validate on data from the 11th, and test on data from the 12th month.

My rationale for this is that when the model is used for real, we’ll always have the history available (be it of the same card or of everything in the past, like fraud patterns).

Are there any methodological problems with this approach?
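The proposed chronological split can be sketched as follows. The toy `transactions` list and its field names (`month`, `amount`, `is_fraud`) are illustrative assumptions, not part of the original question:

```python
# Toy transaction records; fields and values are made up for illustration.
transactions = [
    {"month": m, "amount": 10.0 * m + i, "is_fraud": (m + i) % 7 == 0}
    for m in range(1, 13)  # months 1..12
    for i in range(3)      # three payments per month
]

# Chronological split: first 10 months train, 11th validation, 12th test.
train = [t for t in transactions if t["month"] <= 10]
val = [t for t in transactions if t["month"] == 11]
test = [t for t in transactions if t["month"] == 12]
```

The key property is that every training record precedes every validation record in time, and every validation record precedes every test record.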

With that being said, you may also want to split the first 10 months into a training and a validation set. Build a logistic regression model using only the training set, or perform cross-validation to obtain the lowest misclassification error rate on the validation set. This way you are less likely to overfit the first 10 months of data, which will likely lead to better predictions in the 11th month (your validation set), the 12th month (your test set), and most importantly, future months. As with any predictive model, it will be important to see how the model holds up over time!
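One way to cross-validate within the first 10 months while respecting time order is an expanding-window scheme (similar in spirit to scikit-learn's `TimeSeriesSplit`). This is a minimal sketch; the fold layout and `first_val_month` parameter are my own assumptions:

```python
def expanding_window_folds(months, first_val_month):
    """Time-ordered cross-validation folds: each fold trains on all
    months strictly before the validation month, so validation data
    always lies in the future of its training data."""
    folds = []
    for v in months:
        if v >= first_val_month:
            folds.append(([m for m in months if m < v], [v]))
    return folds

# Folds over the first 10 months, validating on months 8, 9 and 10 in turn.
folds = expanding_window_folds(list(range(1, 11)), first_val_month=8)
```

Averaging the misclassification rate across these folds gives a less optimistic estimate than a single random split, because no fold ever validates on data from its own training window's past.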

Please let me know if you need any further help! Happy modeling!

Correct answer by Matt Reichenbach on December 31, 2020

You can split your data based on two factors: time and card account (the account is assumed to be unchanged even if the card number changes).

1. The reason for the first factor, time, is, as 'cbeleites' mentioned, that we need to look at the performance of the model over time.
2. The reason for the second factor, card account, is to avoid a potential overfitting issue in case the same card holder experienced fraud in both the training and testing periods. In that case, the frauds in both periods may share the same profile, which amounts to information leakage.
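A group-aware split along the second factor can be sketched as below (similar to scikit-learn's `GroupShuffleSplit`). The `split_by_account` helper, the `account` field, and the `test_frac` parameter are illustrative assumptions:

```python
import random

def split_by_account(records, test_frac=0.25, seed=0):
    """Assign whole card accounts to either train or test, so the same
    account never contributes records to both sides. This avoids the
    leakage where one card holder's fraud profile appears in both
    the training and the testing period."""
    accounts = sorted({r["account"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(accounts)
    n_test = max(1, int(len(accounts) * test_frac))
    held_out = set(accounts[:n_test])
    train = [r for r in records if r["account"] not in held_out]
    test = [r for r in records if r["account"] in held_out]
    return train, test

# Toy data: 8 accounts, two payments each.
records = [{"account": a, "month": m} for a in "ABCDEFGH" for m in (1, 2)]
train, test = split_by_account(records)
```

The guarantee to check is that the sets of accounts on each side are disjoint, not merely that the records differ.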

One more thing we should consider: if we split the data based on time, we should be careful about seasonal effects that may influence the performance of the final model. In other words, the frauds in the training and testing datasets should come from the same clusters or groups. This points to a drawback of supervised learning: it cannot detect new trends of fraud.

Answered by LeiDing on December 31, 2020

What specifically do you want the validation (11th month) and test (12th month) sets for?

• If you do any kind of optimization with the 11th-month data, you are right that you need to test your final model on independent cases. However, the 12th month would not be independent in that scenario: it has a 10-month overlap with the training and optimization (11th-month) data. In order to have independent data for testing, you'd need to split the records into training and optimization on the one hand, and records kept apart for validation/testing of the final optimized model on the other (test on the 11th month of independent cards).

• The 12th month of the same credit cards whose first 10 months you used for training would be suited, e.g., for testing how far into the future you can predict the fraud risk (i.e. how fast the quality of your predictions deteriorates).
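Measuring that deterioration amounts to computing your metric separately per future month. A minimal sketch, where the `(month, y_true, y_pred)` triple format and accuracy as the metric are my own assumptions (for fraud you would more likely track precision/recall or AUC):

```python
from collections import defaultdict

def accuracy_by_month(predictions):
    """predictions: iterable of (month, y_true, y_pred) triples.
    Returns {month: accuracy}, so you can watch how prediction
    quality deteriorates the further you move from the training window."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for month, y_true, y_pred in predictions:
        total[month] += 1
        hits[month] += int(y_true == y_pred)
    return {m: hits[m] / total[m] for m in sorted(total)}
```

A downward trend across months 11, 12, 13, ... tells you how often the model needs retraining.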

Answered by cbeleites unhappy with SX on December 31, 2020
