Minimizing error on unseen data

Question

The classifier aims to minimize the loss function $(F(x) - \hat{F}(x))^2$, where $F(x)$ is the unknown true function and $\hat{F}(x)$ is the predicted function. If $F(x)$ is not known for unseen data, how do we compute this loss? Why is the training error used to estimate the error for unseen data?

Erwan · Answer

If we don't know $F(x)$ for unseen data, how does a decision tree minimize this error?

Every supervised ML method relies on the assumption that the test data (any unseen data) follows the same distribution as the training data (note that this is not specific to Decision Trees). In fact both the training data and the test data are assumed to be sampled from the true population data. As a consequence $F(x)$ is assumed to be the same for the training data and the test (unseen) data.
If one uses a trained model on some unseen data which is not distributed like the training data, the results are simply unpredictable and the performance is likely to drop.
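
To make the point about mismatched distributions concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration: the "true" function `true_f`, the noise level, the input ranges, and the use of a straight-line fit as the model. It fits the line on inputs drawn from [0, 1], then measures the squared error on fresh data from the same range and on data from a shifted range [2, 3].

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def true_f(x):
    # Assumed stand-in for the unknown F(x); picked only for illustration.
    return x ** 2

def sample(low, high, n):
    """Draw n noisy (x, y) pairs with x uniform on [low, high]."""
    xs = [rng.uniform(low, high) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, 0.05) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    a, b = model
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = sample(0, 1, 200)   # training distribution
same_x, same_y = sample(0, 1, 200)     # unseen data, same distribution
shift_x, shift_y = sample(2, 3, 200)   # unseen data, shifted distribution

model = fit_line(train_x, train_y)
mse_same = mse(model, same_x, same_y)
mse_shifted = mse(model, shift_x, shift_y)

print("MSE on unseen data, same distribution:   ", mse_same)
print("MSE on unseen data, shifted distribution:", mse_shifted)
```

In this toy setup the line is a reasonable approximation on [0, 1] but extrapolates badly to [2, 3], so the second error comes out far larger than the first.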

Why do we estimate the error for unseen data using the error observed in the training data?

You seem to suggest using the "unseen data" in the training process. You would indeed get better results on the "unseen data" if you optimized on it, but then you would lose the point of having a portion of the data set apart. "Unseen data" is necessary to estimate how well your model will perform on data it has never seen before. If you don't keep some data apart you may have a better model, but you have no way of estimating how well it will perform when put into production.
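
A small sketch of why the error measured on data the model was optimized on is not a trustworthy estimate. The data generator, the noise level, and the choice of a 1-nearest-neighbour predictor (which simply memorizes its training points) are all assumptions made for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

def true_f(x):
    # Assumed stand-in for the unknown F(x).
    return 3 * x

def sample(n):
    """Draw n noisy (x, y) pairs with x uniform on [0, 1]."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, 0.5)) for x in xs]

def predict_1nn(train, x):
    """Return the label of the training point whose x is closest."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def mse(train, data):
    """Mean squared error of the 1-NN predictor on the given data."""
    return sum((y - predict_1nn(train, x)) ** 2 for x, y in data) / len(data)

data = sample(300)
train, held_out = data[:200], data[200:]  # held_out is never used for fitting

train_mse = mse(train, train)
held_out_mse = mse(train, held_out)

print("MSE on the training data:", train_mse)
print("MSE on the held-out data:", held_out_mse)
```

The memorizing model scores a perfect zero on the points it was fitted to, which says nothing about new points. Only the held-out portion reveals the error you should expect on data never seen before.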

Dave · Answer

The idea of using a test set is to mimic the real-world application of using machine learning to, say, do speech recognition for people who aren’t born yet.
That’s the situation where you don’t know the label, so you can’t calculate the error or loss.
However, we mimic that (we hope) by withholding some data from training. We fit the model on the training data and then estimate the error on unseen data by using our withheld data, since we know the correct label or value for those points, even though we didn't tell the model.
Our estimate on the holdout data might be badly wrong when we deploy the model to, say, Siri or Alexa, but some kind of error calculation on withheld data is the next-best that we can do until it goes into production and we see how the model performs.
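
The withholding procedure can be sketched as follows. This is only a simulation under stated assumptions: `true_f`, the noise level, and the sample sizes are made up, and the "production" sample stands in for future data whose labels we would not actually have at deployment time. It is drawn from the same distribution as the training data, which is exactly the assumption that may fail in practice.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

def true_f(x):
    # Assumed stand-in for the unknown F(x).
    return 2 * x + 1

def sample(n):
    """Draw n noisy (x, y) pairs with x uniform on [0, 10]."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, 1.0) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    a, b = model
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Labelled data we have today, split into a training part and a withheld part.
xs, ys = sample(600)
train_x, train_y = xs[:400], ys[:400]
hold_x, hold_y = xs[400:], ys[400:]

model = fit_line(train_x, train_y)   # the model never sees the holdout points
holdout_mse = mse(model, hold_x, hold_y)

# Simulated future data; in reality these labels are unknown at deployment.
prod_x, prod_y = sample(20000)
production_mse = mse(model, prod_x, prod_y)

print("Holdout estimate of the error:", holdout_mse)
print("Error on simulated future data:", production_mse)
```

When the future data really does look like the withheld data, the two numbers land close together. When it does not, the holdout estimate can be badly wrong, as noted above.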
