
Role of misspecification by biased data in the generalization error

Cross Validated. Asked by synack on November 2, 2021

I am confused about the role that model misspecification plays in the generalization error, in particular when the misspecification is due to a biased (non-representative) training dataset. To clarify what I mean, imagine a non-representative sample in which there appears to be a linear pattern when in fact the pattern in the population is more complex. If we base our modeling decisions on that sample, we might opt for a linear model and, as a result, misspecify the model. So, some doubts I have:

  1. My understanding is that the bias and variance in the error decomposition are properties of the predictor, not of the data. However, biased training data seems to cause bias error, which suggests that the data does play a role in the decomposition. Is it that the decomposition assumes the data is always representative of the population? In other words, that it does not take into account biases in the data such as sampling bias or omitted variables?

  2. The other question I have is that it is not clear to me that biased data causes only bias error (misspecification); it seems it could cause variance error as well. My reasoning is that if we had chosen another training sample (one without the biases), our predictor would have made better predictions, indicating that there is variance error, too.

I think my misunderstanding concerns the assumptions of the learning paradigm: for instance, that biases in the data are out of the picture when talking about the variance and bias of the model. I would like to confirm this and shed some light on these questions.

One Answer

Bias in a model can be due to many factors. From what I understand, you may be talking about sampling bias. It happens when the training data does not accurately represent the environment the model is expected to run in.

It is not possible to train a model on the entire universe of data; instead, the data is assumed to be sampled from the population. There is a science to drawing a training sample that is both large enough and representative enough to mitigate sampling bias, and it must come from the same distribution as the population the model will be evaluated on. In your case, the data seems to come from a different distribution, which contradicts a core assumption of machine learning theory (training data should be sampled from the same population as the test data), and hence the model will not generalise.
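
As a minimal sketch of this failure mode (my own toy setup, assuming a hypothetical quadratic population function and a training sample restricted to a region where it looks nearly linear):

    import numpy as np

    rng = np.random.default_rng(0)

    def g(x):
        # Hypothetical "true" population function: quadratic, so a
        # linear fit is misspecified over the full input range.
        return 1.0 + 2.0 * x + 3.0 * x ** 2

    # Biased training sample: inputs restricted to a narrow region
    # where g(x) is locally well approximated by a straight line.
    x_train = rng.uniform(0.0, 0.5, 100)
    y_train = g(x_train) + rng.normal(0.0, 0.1, 100)

    # Representative test sample: inputs drawn from the full range.
    x_test = rng.uniform(-2.0, 2.0, 1000)
    y_test = g(x_test) + rng.normal(0.0, 0.1, 1000)

    # A straight line fitted on the biased sample looks fine in-sample
    # but fails on the population it is deployed on.
    slope, intercept = np.polyfit(x_train, y_train, 1)
    print("in-sample MSE:", np.mean((slope * x_train + intercept - y_train) ** 2))
    print("population MSE:", np.mean((slope * x_test + intercept - y_test) ** 2))

The in-sample error stays near the noise level while the population error explodes, which is exactly the generalisation failure described above.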

Machine learning theory is based on the assumption that there exists a true function g(x) (which we do not need to know explicitly) and that our goal is to come up with a hypothesis f(x) that approximates g(x) using the sampled dataset. Both the bias and the variance in the error decomposition depend on your choice of hypothesis f(x) and on the true function g(x) of the population.
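
Concretely, for squared loss with $y = g(x) + \varepsilon$ and noise variance $\sigma^2$, the expected error at a point $x$ decomposes over random draws of the training set $D$ as

$$\mathbb{E}_{D,\varepsilon}\big[(y - f_D(x))^2\big] = \underbrace{\big(g(x) - \mathbb{E}_D[f_D(x)]\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\big[(f_D(x) - \mathbb{E}_D[f_D(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$

where $f_D$ is the hypothesis fitted on $D$. Note that the expectation is over training sets drawn from the same population you evaluate on, which is exactly the representativeness assumption your first question asks about.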

  1. Bias occurs when your hypothesis is far from the true function. Even if you have representative data, if you are not using the right hypothesis you will have high bias. For example: using a linear hypothesis to model a non-linear function.

  2. Variance occurs when you have a large pool of hypotheses (high model complexity), which makes it difficult to navigate to the true function: the fitted hypothesis then depends strongly on the particular training sample drawn, resulting in higher variance. For example: using a high-degree polynomial regression to model a linear relationship. A numerical sketch of both effects follows after this list.
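
To see both terms numerically, here is a rough sketch (again my own toy setup): refit two model classes on many fresh training samples from the same population, then measure how far the average prediction is from $g(x)$ (bias) and how much the predictions scatter around that average (variance).

    import numpy as np

    rng = np.random.default_rng(1)

    def g(x):
        # Hypothetical non-linear true function.
        return np.sin(2.0 * np.pi * x)

    x_eval = np.linspace(0.0, 1.0, 50)   # fixed points where error is measured
    n_train, n_reps, noise = 30, 500, 0.2

    def predictions(degree):
        # Fit a polynomial of the given degree on n_reps independent
        # training samples and collect its predictions at x_eval.
        preds = np.empty((n_reps, x_eval.size))
        for r in range(n_reps):
            x = rng.uniform(0.0, 1.0, n_train)
            y = g(x) + rng.normal(0.0, noise, n_train)
            preds[r] = np.polyval(np.polyfit(x, y, degree), x_eval)
        return preds

    for degree in (1, 9):
        preds = predictions(degree)
        avg = preds.mean(axis=0)
        bias2 = np.mean((avg - g(x_eval)) ** 2)   # squared bias, averaged over x
        var = np.mean(preds.var(axis=0))          # variance, averaged over x
        print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")

The degree-1 fit shows high bias and low variance; the degree-9 fit shows the reverse. If the training samples were instead drawn from a biased distribution, the whole decomposition would be taken over that biased $D$: the systematic shift moves $\mathbb{E}_D[f_D(x)]$ away from $g(x)$ and is booked as bias, while the sample-to-sample scatter is still the variance.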

For a deeper understanding, check out https://www.youtube.com/watch?v=L_0efNkdGMc. It is one of the best resources for machine learning theory.

Answered by Vivek on November 2, 2021
