Cross Validated Asked by P Lrc on January 3, 2022

I’m trying to understand Bayesian statistics. Recently I asked here whether we estimate the parameters of the a priori distribution in Bayesian statistics. I was told that we typically don’t estimate them unless we’re using Empirical Bayes, and because we’re going to "update" the a priori distribution anyway.

On Wikipedia I’ve read:

Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.

I thought that maybe we assume some a priori distribution, get our observations, calculate the a posteriori distribution, treat it as our a priori distribution, and repeat this procedure until convergence.

Unfortunately I’ve realised that this doesn’t make sense, since for example for the Poisson-Gamma model with an a priori distribution with parameters $\gamma, \beta$, the a posteriori distribution is again a gamma distribution, with parameters

$$\gamma’ = \gamma + \sum_{j=1}^n X_j$$

$$\beta’ = \beta + n$$

and such parameters cannot "converge". So:

(a) why don’t we need to bother ourselves with the exact form of the a priori distribution in pure Bayesian statistics?

(b) how do we "update" the a priori distribution?

(c) what exactly does sequential estimation mean?
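For concreteness, here is that conjugate update in code (a sketch with numpy; the true rate of 3 and the prior parameters are made up). The parameter values $\gamma', \beta'$ do grow without bound, but the posterior mean $\gamma'/\beta'$ settles down:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gamma(gamma, beta) prior (shape/rate parametrization) for a Poisson rate
gamma, beta = 2.0, 1.0
true_rate = 3.0

# Conjugate update: gamma' = gamma + sum(X_j), beta' = beta + n
x = rng.poisson(true_rate, size=1000)
gamma_post = gamma + x.sum()
beta_post = beta + len(x)

# The parameters themselves grow without bound as n grows...
print(gamma_post, beta_post)
# ...but the posterior mean gamma'/beta' approaches the true rate,
# and the posterior variance gamma'/beta'^2 shrinks toward zero.
print(gamma_post / beta_post)
print(gamma_post / beta_post**2)
```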

So a couple things to clarify:

- Prior Distribution: This typically represents the information that the modeler has about the system before looking at the data, expressed in probabilistic terms. There are many schools of thought on how one should do this exactly, and it is context dependent.

For concreteness, suppose we are medical researchers trying to evaluate the effectiveness of a treatment ($A$) on some (continuous) quality of life measure ($Y$), controlling for a vector of baseline covariates ($X$).

Suppose we model the data generating likelihood as normal:

$$Y \mid A, X \sim N(\alpha_y + A\beta_a + X\beta_x, \sigma)$$

Now our priors are the joint distribution of the parameters, $p(\alpha_y, \beta_a, \beta_x, \sigma)$, which we can specify however we want to represent what we know. In a medical context we might be able to bring in information from other studies or theoretical knowledge about how the control variables might impact the outcome. Or we could express some notion of ignorance with these priors.

Sometimes we might write down a family of distributions to represent the priors but be unsure how to parametrize them. This is where we have the option to estimate those hyper-parameters with methods like Empirical Bayes, or to specify a hyper-prior distribution for them. In either case, they are just part of the prior and of how we express the information and ignorance we have before looking at the data. So to answer question (a): we do need to worry about the prior and its form. The prior will impact our inferences and decisions later on, but there are different approaches to exactly how you choose it: Jaynes and the "objective" school (https://bayes.wustl.edu/etj/articles/prior.pdf), priors in the context of the likelihood (https://arxiv.org/pdf/1708.07487.pdf), the aforementioned Empirical Bayes approach, and many more. The prior is a big part of making a Bayesian model.

Now we get to the updating. Often, finding the posterior is referred to as updating. If we let $\theta = (\alpha_y, \beta_a, \beta_x, \sigma)$ be the vector of parameters, the posterior is:

$p(\theta \mid Y, A, X) \propto p(Y \mid A, X, \theta)\, p(\theta)$, where $p(Y \mid A, X, \theta)$ is the normal likelihood above and $p(\theta)$ is the prior.

The way to think of this update is in terms of information. The prior represents the information, or ignorance, we had before; the posterior is the best representation of our knowledge of the parameters after combining what we knew before with what the data, through the likelihood, is telling us. It captures the current state of all of our knowledge in the form of a probability distribution. (In decision-theoretic approaches to Bayesian probability, this can be formalized as, in some sense, an optimal updating of the prior information given the evidence from the data; see Bernardo and Smith (1994) for example.)
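That proportionality can be made concrete on a grid for a single parameter. The sketch below uses a hypothetical Bernoulli success probability rather than the regression model above, just to keep it one-dimensional:

```python
import numpy as np

# Illustration of posterior ∝ likelihood × prior on a discrete grid,
# for a hypothetical Bernoulli success probability theta.
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta)            # flat prior...
prior /= prior.sum()                   # ...normalized over the grid

data = [1, 1, 0, 1, 0, 1, 1]           # 5 successes, 2 failures (made up)
likelihood = theta**sum(data) * (1 - theta)**(len(data) - sum(data))

unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

# With a flat prior this matches the conjugate Beta(6, 3) answer,
# whose mean is 6/9 ≈ 0.667.
post_mean = (theta * posterior).sum()
print(post_mean)
```

The same three lines (multiply, normalize) are the whole update; conjugacy just lets you skip the grid and write the answer down in closed form.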

The posterior is the update.

However, I think I see where your possible confusion lies. When do we stop, I think you are asking. The answer is that we update whenever we get new information (typically this means data).

So say we conduct our experiment on the treatment $A$ and we get our posterior. We could potentially run another study. Ideally, the posterior from the first study becomes our prior for the second study, since it represents everything we know about the parameters before incorporating the knowledge from the second experiment. This kind of thing happens all the time in industry, where data or information might come in batches, leading to iterative updating of our knowledge and thus of the posterior. I believe this is what is meant by sequential estimation. The key is that updates have to occur with new information.
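With a conjugate family, this batch-wise updating gives exactly the same posterior as a single update on all the data pooled together. A sketch using the Gamma-Poisson example from the question (the batch sizes, rate, and prior are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma0, beta0 = 2.0, 1.0                      # hypothetical Gamma prior

batch1 = rng.poisson(3.0, size=50)            # first "study"
batch2 = rng.poisson(3.0, size=70)            # second "study"

# Sequential: update on batch1, then use that posterior as the prior for batch2
g1, b1 = gamma0 + batch1.sum(), beta0 + len(batch1)
g2, b2 = g1 + batch2.sum(), b1 + len(batch2)

# One-shot: update once on all the data
g_all = gamma0 + batch1.sum() + batch2.sum()
b_all = beta0 + len(batch1) + len(batch2)

assert (g2, b2) == (g_all, b_all)             # same posterior either way
```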

They also mention that the posterior becomes complex and requires numerical methods when the priors are not conjugate. In the real world this is usually the case: our information is not always conveniently represented by a conjugate family. Then, to estimate the posterior, we have to rely on numerical methods. This can get very complicated in sequential analyses and may require approximations in order to pass the information from one experiment to the next when the posterior has no closed form or is hard to sample from.
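As a minimal illustration of such numerical methods, here is a random-walk Metropolis sampler for a non-conjugate pair: a Poisson likelihood with a lognormal prior on the rate. The prior choice, data, and tuning constants are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data; lognormal(0, 1) prior on the Poisson rate has no conjugate posterior
data = rng.poisson(4.0, size=200)

def log_post(lam):
    """Log posterior up to an additive constant."""
    if lam <= 0:
        return -np.inf
    log_prior = -0.5 * np.log(lam) ** 2 - np.log(lam)     # lognormal(0, 1)
    log_lik = data.sum() * np.log(lam) - len(data) * lam  # Poisson
    return log_prior + log_lik

# Random-walk Metropolis: propose, accept with prob min(1, ratio of posteriors)
samples = []
lam = 1.0
for _ in range(5000):
    prop = lam + rng.normal(0, 0.2)
    if np.log(rng.uniform()) < log_post(prop) - log_post(lam):
        lam = prop
    samples.append(lam)

posterior_draws = np.array(samples[1000:])  # discard burn-in
print(posterior_draws.mean(), posterior_draws.std())
```

With this much data the draws concentrate near the sample mean; the output is a bag of samples rather than a closed-form distribution, which is exactly what makes passing it on as the prior for a next experiment awkward.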

Answered by Tyrel Stokes on January 3, 2022
