# Do we update a priori distribution somehow?

Cross Validated Asked by P Lrc on January 3, 2022

I’m trying to understand Bayesian statistics. Recently I asked here whether we estimate the parameters of the a priori distribution in Bayesian statistics. I was told that we typically don’t estimate them unless we’re using empirical Bayes, and that this is because we’re going to "update" the a priori distribution anyway.

Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.

I thought that maybe we assume some a priori distribution, get our observations, calculate the a posteriori distribution, treat it as our new a priori distribution, and repeat this procedure until convergence.

Unfortunately I’ve realised that this doesn’t make sense since, for example, for Poisson-Gamma with an a priori distribution with parameters $$\gamma, \beta$$ the a posteriori distribution is again a gamma distribution with parameters
$$\gamma' = \gamma + \sum_{j=1}^n X_j$$
$$\beta' = \beta + n$$
and such parameters cannot be "convergent". So:

(a) why don’t we need to bother ourselves with the exact form of the a priori distribution in pure Bayesian statistics?

(b) how do we "update" the a priori distribution?

(c) what exactly does sequential estimation mean?
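As a quick sanity check on the Poisson-Gamma intuition in the question, here is a minimal sketch of that conjugate update with made-up counts. It shows that while the parameters $$\gamma', \beta'$$ grow without bound, the posterior distribution itself concentrates:

```python
# Sketch of the Poisson-Gamma update from the question, with made-up data.
# Prior: lambda ~ Gamma(shape=gamma, rate=beta); data X_j ~ Poisson(lambda).

def update(shape, rate, xs):
    """Conjugate update: shape' = shape + sum(xs), rate' = rate + len(xs)."""
    return shape + sum(xs), rate + len(xs)

shape, rate = 2.0, 1.0           # a priori Gamma(2, 1)
xs = [3, 4, 2, 5, 3, 4]          # hypothetical Poisson counts
shape, rate = update(shape, rate, xs)

# The parameters themselves grow with n rather than "converging",
# but the posterior *distribution* concentrates: its mean shape/rate
# settles near the sample mean while its variance shape/rate**2
# shrinks roughly like 1/n.
print(shape, rate, shape / rate, shape / rate ** 2)
```

So the thing that stabilizes is not the pair of parameters but the distribution they describe.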

So, a couple of things to clarify:

1. Prior Distribution: This typically represents the information that the modeling entity has about the system before looking at the data, expressed in probabilistic terms. There are many schools of thought on how one should do this exactly, and it is context dependent.

For concreteness, suppose we are medical researchers trying to evaluate the effectiveness of a treatment ($$A$$) on some (continuous) quality of life measure ($$Y$$), controlling for a vector of baseline covariates ($$X$$).

Suppose we model the data generating likelihood as normal:

$$Y \mid A, X \sim N(\alpha_y + A\beta_a + X\beta_x, \sigma)$$

Now our priors are the joint distribution of the parameters, $$p(\alpha_y, \beta_a, \beta_x, \sigma)$$, which we can specify however we want to represent what we know. In a medical context we might be able to bring in information from other studies or theoretical knowledge about how the control variables might impact the outcome. Or we could express some notion of ignorance with these priors.

Sometimes we might write down a family of distributions to represent the priors but be unsure how to parametrize them. This is where we have the option to estimate those hyper-parameters with methods like empirical Bayes, or to specify a hyper-prior distribution for them. In either case, they are just part of the prior and of how we express the information and ignorance that we have before looking at the data. So, to answer question (a): we do need to worry about the prior and its form. The prior will impact our inferences and decisions later on, but there are different approaches to exactly how you specify it. Some examples: Jaynes and the "objective" school (https://bayes.wustl.edu/etj/articles/prior.pdf), priors in the context of the likelihood (https://arxiv.org/pdf/1708.07487.pdf), the aforementioned empirical Bayes approach, and many more. The prior is a big part of making a Bayesian model.
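To make the empirical Bayes option concrete, here is a small sketch of hyper-parameter estimation by moment matching, in a simpler toy setting than the regression above: a Gamma prior over a Poisson rate. The data and the moment-matching recipe are illustrative assumptions, not part of the original answer. Marginally the counts are negative binomial, with mean $$a/b$$ and variance $$a/b + a/b^2$$, so matching the sample mean and variance recovers the hyper-parameters.

```python
# Sketch: empirical-Bayes moment matching for a Gamma(a, b) prior over
# Poisson rates (hypothetical counts, one per unit).
# Marginally E[X] = a/b and Var[X] = a/b + a/b**2, so with sample
# mean m and variance v:  b = m / (v - m),  a = m * b  (needs v > m).

def fit_gamma_hyperparams(counts):
    n = len(counts)
    m = sum(counts) / n
    v = sum((x - m) ** 2 for x in counts) / (n - 1)
    if v <= m:
        raise ValueError("no overdispersion: moment match undefined")
    rate = m / (v - m)
    shape = m * rate
    return shape, rate

counts = [1, 4, 2, 7, 0, 3, 5, 2, 9, 1]   # made-up data
shape, rate = fit_gamma_hyperparams(counts)
print(shape, rate)   # estimated hyper-parameters of the Gamma prior
```

The hyper-prior alternative would instead put a distribution on `shape` and `rate` and integrate them out rather than plugging in point estimates.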

Now we get to the updating. Often, finding the posterior is referred to as updating. If we let $$\theta = (\alpha_y, \beta_a, \beta_x, \sigma)$$ be the vector of parameters, the posterior is:

$$p(\theta \mid Y, A, X) \propto p(Y \mid A, X, \theta)\, p(\theta),$$ where $$p(Y \mid A, X, \theta)$$ is the normal likelihood above and $$p(\theta)$$ is the prior.
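One way to see this proportionality at work is to evaluate it literally on a grid, in a toy one-parameter model rather than the regression above. The data and the Gamma(2, 1) prior here are illustrative assumptions chosen so the exact answer is known:

```python
# Sketch: "posterior is proportional to likelihood times prior",
# computed on a grid for a Poisson rate with made-up counts.
import math

data = [3, 4, 2, 5]                       # hypothetical counts
grid = [i / 100 for i in range(1, 1001)]  # candidate rate values 0.01..10

def poisson_loglik(lam, xs):
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

def log_prior(lam):
    # Gamma(shape=2, rate=1) density up to a constant: lam * exp(-lam)
    return math.log(lam) - lam

unnorm = [math.exp(poisson_loglik(l, data) + log_prior(l)) for l in grid]
z = sum(unnorm)
posterior = [u / z for u in unnorm]       # normalized over the grid

post_mean = sum(l * p for l, p in zip(grid, posterior))
# conjugacy gives Gamma(2 + 14, 1 + 4) exactly, with mean 16/5 = 3.2
print(post_mean)
```

The grid answer matches the closed-form conjugate answer; for the multi-parameter regression model the same idea applies but grids become infeasible, which is where samplers come in.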

The way to think of this update is in terms of information. The prior represents our information (or ignorance) before seeing the data; the posterior is then the best representation of our knowledge of the parameters, combining what we knew before with what the data tell us through the likelihood, all expressed as a probability distribution. (In decision-theoretic approaches to Bayesian probability, this can be formalized as, in some sense, an optimal updating of the prior information given the evidence from the data; see Bernardo and Smith (1994) for example.)

The posterior is the update.

However, I think I see where your confusion lies. When do we stop, I think you are asking. The answer is that we update whenever we get new information (typically, this means new data).

So say we conduct our experiment on the treatment $$(A)$$ and obtain our posterior. We could then run another study. Ideally, the posterior from the first study becomes our prior for the second study, since it represents everything we know about the parameters before incorporating the knowledge from the second experiment. This kind of thing happens all the time in industry, where data or information may come in batches, giving iterative updating of our knowledge and thus of the posterior. I believe this is what is meant by sequential estimation. The key is that updates occur only with more information.
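A minimal sketch of this chaining, using a Beta-Bernoulli model with hypothetical counts, shows that passing the posterior along as the next study's prior gives the same answer as analyzing all the data at once:

```python
# Sketch: study 1's posterior becomes study 2's prior
# (Beta-Bernoulli conjugate model, hypothetical success/failure counts).

def update_beta(a, b, successes, failures):
    """Conjugate Beta update for Bernoulli data."""
    return a + successes, b + failures

# Analyzing all the data in one batch...
a_all, b_all = update_beta(1, 1, 30 + 12, 70 + 8)

# ...gives the same posterior as two sequential studies:
a, b = update_beta(1, 1, 30, 70)    # study 1: 30 successes, 70 failures
a, b = update_beta(a, b, 12, 8)     # study 2: 12 successes, 8 failures

print((a, b) == (a_all, b_all))     # batching order doesn't matter
```

This coherence of batch and sequential updating is exactly what makes the "yesterday's posterior is today's prior" workflow legitimate.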

The quote also mentions that the posterior becomes complex, requiring numerical methods, when the priors are not conjugate. In the real world this is usually the case: our information is not always conveniently represented by a conjugate family. Then, to compute the posterior, we have to rely on numerical methods. This can get very complicated in sequential analyses and may require approximations in order to pass the information from one experiment to the next when the posterior is not in closed form or easy to sample from.
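For flavor, here is a bare-bones sketch of one such numerical method: a random-walk Metropolis sampler for a deliberately non-conjugate pairing (Poisson likelihood with a log-normal prior on the rate). The data, prior, and tuning choices are all made up for illustration:

```python
# Sketch: random-walk Metropolis for a non-conjugate model
# (Poisson likelihood, log-normal(0, 1) prior on the rate; made-up data).
import math
import random

random.seed(0)
data = [3, 4, 2, 5, 3]

def log_post(lam):
    """Unnormalized log posterior: Poisson log-likelihood + log-normal log-prior."""
    if lam <= 0:
        return -math.inf
    loglik = sum(x * math.log(lam) - lam for x in data)
    logprior = -0.5 * math.log(lam) ** 2 - math.log(lam)
    return loglik + logprior

lam, draws = 1.0, []
for _ in range(20000):
    proposal = lam + random.gauss(0, 0.5)
    if math.log(random.random()) < log_post(proposal) - log_post(lam):
        lam = proposal           # accept the move
    draws.append(lam)

posterior_mean = sum(draws[5000:]) / len(draws[5000:])
print(posterior_mean)            # approximate posterior mean of the rate
```

Note the approximation issue the answer raises: the output here is a bag of samples, not a closed-form distribution, so to use it as the prior for a next study you would have to fit some parametric form to the draws or carry the samples forward in some other approximate way.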

Answered by Tyrel Stokes on January 3, 2022
