# Understaning Bayesian Optimisation graph

Artificial Intelligence Asked by DuttaA on November 7, 2020

I came across the concept of Bayesian Occam Razor in the book Machine Learning: a Probabilistic Perspective. According to the book:

Another way to understand
the Bayesian Occam’s razor effect is to note that probabilities must
sum to one. Hence $$sum_D’ p(D’ |m) = 1$$, where the sum is over all possible data sets. Complex
models, which can predict many things, must spread their probability mass thinly, and hence
will not obtain as large a probability for any given data set as simpler models. This is sometimes called the conservation of probability mass principle.

The figure below is used to explain the concept:

Image Explanation: On the vertical axis we plot the predictions of 3 possible models: a simple one, $$M_1$$ ; a medium one, $$M_2$$ ; and a complex one, $$M_3$$ . We also indicate the actually observed
data $$D_0$$ by a vertical line. Model 1 is too simple and assigns low probability to $$D_0$$ . Model 3
also assigns $$D_0$$ relatively low probability, because it can predict many data sets, and hence it
spreads its probability quite widely and thinly. Model 2 is “just right”: it predicts the observed data with a reasonable degree of confidence, but does not predict too many other things. Hence model 2 is the most probable model.

What I do not understand is when a complex model is used, it will likely overfit data and hence the plot for a complex model will look like a bell shaped with its peak at $$D_0$$ while simpler models will more likely have a broader bell shape. But the graph here shows something else entirely. What am I missing here?

The original graph for the aforementioned Bayesian Optimisation is similar to the graph in these slides (slide 18) along with the calculations.

So, according to the tutorial the graph shown should actually have the term $$p(D|m)$$ on the y-axis, thus making it a generative model.Now the graph starts to make sense, since a model with low complexity cannot produce very complex datasets and will be centred around 0, while very complex models can produce richer datasets which makes them assign probability thinly over all the datatsets (to keep $$sum_{D'}p(D'|m) = 1$$).

Answered by DuttaA on November 7, 2020

## Related Questions

### Classification or regression for deep Q learning

0  Asked on December 16, 2021

### Is the Bellman equation that uses sampling weighted by the Q values (instead of max) a contraction?

0  Asked on December 16, 2021 by sirfroggy

### Why does reinforcement learning using a non-linear function approximator diverge when using strongly correlated data as input?

1  Asked on December 13, 2021

### How Graph Convolutional Neural Networks forward propagate?

1  Asked on December 13, 2021

### In which cases is the categorical cross-entropy better than the mean squared error?

3  Asked on December 11, 2021

### What are the keys and values of the attention model for the encoder and decoder in the “Attention Is All You Need” paper?

1  Asked on December 11, 2021

### Is my 57% sports betting accuracy correct?

1  Asked on December 11, 2021 by sports_stats

### Understanding the “unroling” step in the proof of the policy gradient theorem

2  Asked on December 9, 2021

### Forcing a neural network to be close to a previous model – Regularization through given model

0  Asked on December 9, 2021 by blba

### Why is DDPG not learning and it does not converge?

0  Asked on December 9, 2021 by i_al-thamary

### How artificial intelligence will change the future?

1  Asked on December 7, 2021

### Can residual neural networks use other activation functions different from ReLU?

1  Asked on December 7, 2021 by jr123456jr987654321

### Is it necessary to standardise the expected output

1  Asked on December 7, 2021

### Is CNN capable of extracting the descriptive statistics features

1  Asked on December 4, 2021 by nilsinelabore

### How to create Partially Connected NNs with prespecified connections using Tensorflow?

3  Asked on December 2, 2021 by pnar-demetci

### What is the best resources to learn Graph Convolutional Neural Networks?

2  Asked on December 2, 2021

### Is it possible to use AI to reverse engineer software?

2  Asked on November 29, 2021 by ipsumpanest

### Why do CNN’s sometimes make highly confident mistakes, and how can one combat this problem?

6  Asked on November 29, 2021

### Can you explain me this CNN architecture?

1  Asked on November 29, 2021 by sanmu

### In Deep Deterministic Policy Gradient, are all weights of the policy network updated with the same or different value?

1  Asked on November 29, 2021 by unter_983