Model tuning in the presence of incorrect training labels

Cross Validated. Asked by astel on November 9, 2021

I have a large amount of labeled data (~40 million records) with a binary outcome variable split about 50% positive and 50% negative. The issue is that I know the true proportion for these 40 million is closer to 75% positive to 25% negative, so a substantial share of the negative labels must be wrong. When I test my model, I therefore do not actually want to see low false positives and false negatives; in fact, I would prefer to see some number of apparent false positive cases.

Then I started to think: what about hyper-parameter tuning? For example, I was using glmnet with the lasso and cross-validation to choose lambda, and then thought: wait a minute, this is the lambda value that gives me the lowest classification error, which, as I said before, may not be what I actually want.

Am I correct in thinking that if I want to use cross-validation to train my model, I will have to tune toward the known true proportion rather than the lowest error?
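
For concreteness, something like the following is what I have in mind: a cross-validation loop where lambda is chosen so the held-out predicted positive rate is closest to the believed 75%, rather than by lowest error. This is only a sketch; x, y, the 0.5 threshold, and the selection rule are placeholders.

    # Sketch only: pick the lasso lambda whose held-out predicted positive
    # rate is closest to the believed true proportion (0.75 here).
    # Assumes a numeric matrix x and a 0/1 vector y.
    library(glmnet)

    target <- 0.75
    K <- 5
    folds <- sample(rep(1:K, length.out = nrow(x)))
    lambdas <- glmnet(x, y, family = "binomial", alpha = 1)$lambda

    pos_rate <- matrix(NA, K, length(lambdas))
    for (k in 1:K) {
      fit <- glmnet(x[folds != k, , drop = FALSE], y[folds != k],
                    family = "binomial", alpha = 1, lambda = lambdas)
      p <- predict(fit, x[folds == k, , drop = FALSE], type = "response")
      pos_rate[k, ] <- colMeans(p > 0.5)
    }

    # lambda whose average held-out positive rate best matches the target
    best_lambda <- lambdas[which.min(abs(colMeans(pos_rate) - target))]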

One Answer

Logistic regression can be tailored to the case of errors in the labels. There is an example (with code) in the fourth edition of MASS (the book), Chapter 16. Code can be found here.
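
As a rough sketch of that idea (this is not the MASS code; the variable names and starting values here are illustrative), one can assume each observed label is flipped with some probability delta and maximize the resulting likelihood:

    # Logistic regression with a label-flip probability delta:
    # P(y_obs = 1 | x) = (1 - delta) * p + delta * (1 - p),
    # with p = plogis(X %*% beta). Fit by maximum likelihood via optim().
    negloglik <- function(theta, X, y) {
      delta <- plogis(theta[1])            # keeps the flip rate in (0, 1)
      beta  <- theta[-1]
      p <- plogis(drop(X %*% beta))
      q <- (1 - delta) * p + delta * (1 - p)
      -sum(y * log(q) + (1 - y) * log(1 - q))
    }

    X <- cbind(1, x)                       # x: numeric covariate matrix
    init <- c(qlogis(0.05), rep(0, ncol(X)))   # start near delta = 0.05
    fit <- optim(init, negloglik, X = X, y = y, method = "BFGS")
    plogis(fit$par[1])                     # estimated flip probability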

If you really know (approximately) the proportions of the true labels, a Bayesian version could be useful.
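
For instance, a crude stand-in for a fully Bayesian fit is a MAP estimate: extend the sketch above to asymmetric flip rates and add Beta priors encoding "the truth is about 75/25 but the labels are 50/50". The prior parameters below are illustrative assumptions, and X and y are as above:

    # MAP estimate with asymmetric flip rates and Beta priors.
    neglogpost <- function(theta, X, y) {
      d10 <- plogis(theta[1])    # P(labelled 0 | truly 1)
      d01 <- plogis(theta[2])    # P(labelled 1 | truly 0)
      beta <- theta[-(1:2)]
      p <- plogis(drop(X %*% beta))
      q <- (1 - d10) * p + d01 * (1 - p)   # P(observed label = 1 | x)
      nll <- -sum(y * log(q) + (1 - y) * log(1 - q))
      # Beta(4, 8): roughly a third of true positives mislabelled;
      # Beta(1, 20): few true negatives mislabelled as positive.
      nll - dbeta(d10, 4, 8, log = TRUE) - dbeta(d01, 1, 20, log = TRUE)
    }

    fit <- optim(c(qlogis(1/3), qlogis(0.02), rep(0, ncol(X))),
                 neglogpost, X = X, y = y, method = "BFGS")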

But your case seems to be special, with a very high probability of wrong labels. Would it be feasible to do a small study: randomly sample some of your 40 million cases (a thousand, or a few thousand), verify their labels manually, and then investigate whether the probability of mislabeling depends on the covariates in some way? If so, you might get a much better model.
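
A sketch of what such an audit might look like (dat, its covariate columns x1 and x2, and the manually verified true_label are all hypothetical names):

    # Draw a random sample for manual checking, then model the
    # mislabeling probability as a function of covariates.
    set.seed(1)
    audit <- dat[sample(nrow(dat), 1000), ]
    # ... manual verification fills in audit$true_label ...
    audit$mislabelled <- as.integer(audit$label != audit$true_label)
    # Does the chance of a wrong label depend on the covariates?
    summary(glm(mislabelled ~ x1 + x2, family = binomial, data = audit))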

Answered by kjetil b halvorsen on November 9, 2021
