
Understanding the Bayesian Occam's razor graph

Artificial Intelligence · Asked by DuttaA on November 7, 2020

I came across the concept of the Bayesian Occam's razor in the book Machine Learning: A Probabilistic Perspective. According to the book:

Another way to understand the Bayesian Occam's razor effect is to note that probabilities must sum to one. Hence $\sum_{D'} p(D'|m) = 1$, where the sum is over all possible data sets. Complex models, which can predict many things, must spread their probability mass thinly, and hence will not obtain as large a probability for any given data set as simpler models. This is sometimes called the conservation of probability mass principle.

The figure below is used to explain the concept:

[Figure: the probability each of three models, $M_1$, $M_2$, $M_3$, assigns to every possible data set, with the observed data set $D_0$ marked by a vertical line.]

Image explanation: on the vertical axis we plot the predictions of three possible models: a simple one, $M_1$; a medium one, $M_2$; and a complex one, $M_3$. We also indicate the actually observed data $D_0$ by a vertical line. Model 1 is too simple and assigns low probability to $D_0$. Model 3 also assigns $D_0$ relatively low probability, because it can predict many data sets, and hence it spreads its probability quite widely and thinly. Model 2 is "just right": it predicts the observed data with a reasonable degree of confidence, but does not predict too many other things. Hence model 2 is the most probable model.

What I do not understand is this: when a complex model is used, it will likely overfit the data, so I would expect the plot for the complex model to look bell-shaped with its peak at $D_0$, while simpler models would have broader bells. But the graph here shows something else entirely. What am I missing?

One Answer

The original graph for the aforementioned Bayesian Occam's razor is similar to the graph in these slides (slide 18), which also include the accompanying calculations.

So, according to the tutorial, the graph shown should actually have the term $p(D|m)$ on the y-axis, which makes it a generative model. Now the graph starts to make sense: a model with low complexity cannot produce very complex data sets, so its probability will be centred around 0, while very complex models can produce richer data sets, which forces them to assign probability thinly over all data sets (to keep $\sum_{D'} p(D'|m) = 1$).
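As a rough illustration of why $p(D|m)$ behaves this way, here is a sketch (entirely my own construction, with made-up data, noise level, and model degrees) that estimates the marginal likelihood $p(D|m) = \int p(D|\theta, m)\, p(\theta|m)\, d\theta$ by naive Monte Carlo for polynomial models of increasing degree. The estimator is crude and noisy, but it is enough to show the trend.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical observed data set D0: noisy samples from a quadratic.
x = np.linspace(-1, 1, 15)
y = 1.0 - 2.0 * x + 1.5 * x**2 + rng.normal(0, 0.2, size=x.shape)
sigma = 0.2  # assumed known noise level

def log_marginal_likelihood(degree, n_samples=20000):
    """Monte Carlo estimate of p(D|m) = E_{theta ~ p(theta|m)}[p(D|theta, m)]
    for a polynomial model of the given degree with a N(0, 1) prior on each
    coefficient. Higher-degree models can generate many more data sets, so
    p(D0|m) need not grow with the degree."""
    log_liks = []
    for _ in range(n_samples):
        theta = rng.normal(0, 1, size=degree + 1)   # draw theta ~ p(theta|m)
        pred = np.polyval(theta, x)                  # model's predicted data set
        log_lik = (-0.5 * np.sum(((y - pred) / sigma) ** 2)
                   - len(x) * np.log(sigma * np.sqrt(2 * np.pi)))
        log_liks.append(log_lik)
    # log of the mean likelihood over the prior samples
    log_liks = np.array(log_liks)
    return np.logaddexp.reduce(log_liks) - np.log(n_samples)

for degree in [1, 2, 8]:
    print(f"degree {degree}: log p(D0|m) ~ {log_marginal_likelihood(degree):.2f}")
# Typically the degree-2 model obtains the highest evidence: degree 1 cannot
# fit D0 well at all, and degree 8 can fit it but wastes most of its prior
# mass on data sets that look nothing like D0.
```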

Answered by DuttaA on November 7, 2020
