TransWikia.com

Is using samples from the same person in both the training set and the test set considered data leakage?

Data Science Asked on September 5, 2021

Suppose a neural network is built for a binary classification problem, such as recognizing whether a face is smiling or not, using a dataset of 1000 people with ten face images per person.
If the dataset is randomly split into a training set and a test set at a ratio of 70:30, there is a high chance that face images of the same person will end up in both sets. Is this considered data leakage (train-test contamination)?

One Answer

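Yes, this is a form of data leakage. The test data should not be linked to the training data in any way. If images of the same person appear in both sets, the network can pick up on cues tied to that person's identity rather than to the expression, so the test score may overstate how well the model does on people it has never seen. Splitting by person, so that all ten images of a given person land in the same set, avoids this.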

Another way to think of it: if someone tried to replicate your results with their own test set, would your test set have given you an advantage, so that your results are generally better than theirs?
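As a rough illustration of keeping each person on one side of the split, here is a minimal sketch using only the Python standard library. The helper name `split_by_person`, the integer person IDs, and the fixed seed are assumptions made for the example, not part of the original question; the 1000 people with ten images each simply mirror the setup described there.

```python
import random

def split_by_person(person_ids, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) so that no person appears in both sets.

    person_ids[i] is the ID of the person shown in image i.
    The split is made over people, not over individual images.
    """
    unique_people = sorted(set(person_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_people)

    n_test = round(len(unique_people) * test_fraction)
    test_people = set(unique_people[:n_test])

    train_idx = [i for i, p in enumerate(person_ids) if p not in test_people]
    test_idx = [i for i, p in enumerate(person_ids) if p in test_people]
    return train_idx, test_idx

# Hypothetical dataset mirroring the question: 1000 people, 10 images each.
person_ids = [person for person in range(1000) for _ in range(10)]
train_idx, test_idx = split_by_person(person_ids)

# Check that the two sets share no person.
train_people = {person_ids[i] for i in train_idx}
test_people = {person_ids[i] for i in test_idx}
print(len(train_idx), len(test_idx))
print(train_people.isdisjoint(test_people))
```

Because every person has the same number of images here, a 70:30 split over people is also a 70:30 split over images; with uneven counts per person the image ratio would only be approximate. If scikit-learn is available, `GroupShuffleSplit` from `sklearn.model_selection` offers the same kind of group-aware split when the person IDs are passed as `groups`.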

Correct answer by Benji Albert on September 5, 2021
