TransWikia.com

How to apply model to training data to identify mislabeled observations?

Data Science Asked by OverflowingTheGlass on December 18, 2020

I have a list of people, attributes about those people (height, weight, blood pressure, etc.), and a binary target variable called has_heart_issues. This data represents the full population, and I am trying to determine whether anyone who is listed as “No” for has_heart_issues is similar to the people who are listed as “Yes”.

To answer this question, I split the data into training (70%) and testing (30%). I trained a random forest model on the training set and evaluated it on the testing set. The results are good, but I don’t know how to apply the model to the full population, since I used most of it for training. Is there any way to apply the model to the full dataset (including the training portion), given that I had labels for the full dataset to start with? Essentially, I am trying to determine whether any of the people were mislabeled.

Is it okay to apply the model to the training data to find the “mislabeled” records?

2 Answers

Sure, it's called cross-validation. Have a look.
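In this setting, cross-validation means generating an out-of-fold prediction for every record: each person is scored by a model that never saw them during training, so the predictions are honest even though every record eventually gets one. A minimal sketch with scikit-learn's `cross_val_predict` (the data here is a synthetic stand-in for your attributes and has_heart_issues label):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in: X = person attributes, y = has_heart_issues (0/1).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Out-of-fold probabilities: in each of the 5 folds, the model is fit on
# the other 4 folds and predicts the held-out one, so every record is
# scored by a model that did not train on it.
model = RandomForestClassifier(n_estimators=200, random_state=0)
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

# Records labeled "No" (0) but assigned a high predicted probability of
# "Yes" are candidates for mislabeling; 0.9 is an arbitrary cutoff.
suspects = np.where((y == 0) & (proba > 0.9))[0]
print(f"{len(suspects)} possible mislabeled 'No' records")
```

You can then inspect the flagged records manually; the threshold controls how aggressively you flag.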

Answered by Noah Weber on December 18, 2020

There is exactly one thing you can check by examining the predictions on your training data. That is the numerical convergence of your model training routine. Any validation of model accuracy can only use holdout data or test data - that is the entire point of cross validation. Once the model architecture and hyperparameters have been optimized through n-fold cross-validation, the standard procedure is to train a single production model on the entire dataset. At that point, you've gotten all the information from the training set that you can.
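That procedure (tune hyperparameters with n-fold cross-validation, then fit one final model on the entire dataset) might look like the following in scikit-learn; the parameter grid and data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real attributes and labels.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# 5-fold cross-validation over an illustrative hyperparameter grid.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    cv=5,
)
search.fit(X, y)

# GridSearchCV refits the best configuration on the entire dataset by
# default (refit=True), so best_estimator_ is the single production
# model trained on all the data.
production_model = search.best_estimator_
print(search.best_params_)
```

Note that accuracy estimates come from the cross-validation folds; the final refit on all data is for deployment, not for evaluation.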

Answered by Dave Kielpinski on December 18, 2020
