# Do we update a priori distribution somehow?

Cross Validated Asked by P Lrc on January 3, 2022

I’m trying to understand Bayesian statistics. Recently I asked here whether we estimate the parameters of the prior distribution in Bayesian statistics. The answer was that we typically don’t estimate them unless we’re using Empirical Bayes, and because we’re going to "update" the prior distribution anyway.

Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.

I thought that maybe we assume some prior distribution, get our observations, calculate the posterior distribution, treat it as our new prior, and repeat this procedure until convergence.

Unfortunately I’ve realised that this doesn’t make sense, since for example in the Poisson–Gamma model with a prior with parameters $\gamma, \beta$, the posterior is again a gamma distribution with parameters
$$\gamma' = \gamma + \sum_{j=1}^n X_j$$
$$\beta' = \beta + n$$
and such parameters cannot be "convergent". So:

(a) Why don’t we need to bother ourselves with the exact form of the prior distribution in pure Bayesian statistics?

(b) How do we "update" the prior distribution?

(c) What exactly does sequential estimation mean?

So, a couple of things to clarify:

1. Prior distribution: This typically represents the information that the modelling entity has about the system before looking at the data, expressed in probabilistic terms. There are many schools of thought on how one should do this exactly, and it is context dependent.

For concreteness, suppose we are medical researchers trying to evaluate the effectiveness of a treatment ($A$) on some (continuous) quality-of-life measure ($Y$), controlling for a vector of baseline covariates ($X$).

Suppose we model the data generating likelihood as normal:

$$Y \mid A, X \sim N(\alpha_y + A\beta_a + X\beta_x, \sigma)$$

Now our prior is the joint distribution of the parameters, $p(\alpha_y, \beta_a, \beta_x, \sigma)$, which we can specify however we want in order to represent what we know. In a medical context we might be able to bring in information from other studies, or theoretical knowledge about how the control variables might impact the outcome. Or we could express some notion of ignorance with these priors.

Sometimes we might write down a family of distributions to represent the prior, but be unsure how to parametrize it. This is where we have the option to estimate those hyper-parameters with methods like Empirical Bayes, or to specify a hyper-prior distribution for them. In either case, they are just part of the prior and of how we express the information and ignorance we have before looking at the data. So, to answer question (a): we do need to worry about the prior and its form. The prior will impact our inferences and decisions later on, but there are different approaches to how exactly you do that: Jaynes and the "objective" school (https://bayes.wustl.edu/etj/articles/prior.pdf), priors in the context of the likelihood (https://arxiv.org/pdf/1708.07487.pdf), the aforementioned Empirical Bayes approach, and many more. The prior is a big part of making a Bayesian model.

Now we get to the updating. Finding the posterior is often referred to as updating. If we let $\theta = (\alpha_y, \beta_a, \beta_x, \sigma)$ be the vector of parameters, the posterior is:

$$p(\theta \mid Y, A, X) \propto p(Y \mid A, X, \theta)\, p(\theta),$$ where $p(Y \mid A, X, \theta)$ is the normal likelihood above and $p(\theta)$ is the prior.

The way to think of this update is in terms of information. The prior represents our information or ignorance beforehand; the posterior is now the best representation of our knowledge of the parameters, combining what we knew before with what the data, through the likelihood, is telling us. It represents the current state of all of our knowledge in the form of a probability distribution. (In decision-theoretic approaches to Bayesian probability, this can be formalized as, in some sense, an optimal updating of the prior information given the evidence from the data; see Bernardo and Smith (1994) for example.)

The posterior is the update.
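To make "posterior $\propto$ likelihood $\times$ prior" concrete, here is a minimal numerical sketch using the Poisson–Gamma model from the question; the data and prior values are made up for illustration. We evaluate the right-hand side on a grid of candidate rates, normalize it, and check that it matches the conjugate closed form.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Hypothetical Poisson counts and a Gamma(shape, rate) prior on the rate.
x = np.array([3, 1, 4, 2, 5])
gamma0, beta0 = 2.0, 1.0

# Evaluate prior * likelihood on a grid of candidate rates, then normalize.
grid = np.linspace(0.01, 15.0, 2000)
prior = stats.gamma.pdf(grid, a=gamma0, scale=1.0 / beta0)
likelihood = np.prod(stats.poisson.pmf(x[:, None], grid), axis=0)
unnormalized = prior * likelihood
posterior = unnormalized / trapezoid(unnormalized, grid)

# The conjugate closed form gives the same curve:
# gamma' = gamma + sum(x), beta' = beta + n
closed_form = stats.gamma.pdf(grid, a=gamma0 + x.sum(),
                              scale=1.0 / (beta0 + len(x)))
print(np.max(np.abs(posterior - closed_form)))  # tiny numerical error
```

The grid evaluation never needs to know the normalizing constant of Bayes' theorem; dividing by the numerical integral recovers it, which is exactly what "proportional to" buys us.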

However, I think I see where your possible confusion lies. When do we stop, I think you are asking. The answer is that we update whenever we get new information (typically, this means data).

So say we conduct our experiment on the treatment $(A)$ and we get our posterior. We could potentially run another study. Ideally, the posterior from the first study would become our prior for the second study, since it represents everything we know about the parameters before incorporating the knowledge from the second experiment. This kind of thing happens all the time in industry, where data or information might come in batches, giving iterative updating of our knowledge and thus of the posterior. I believe this is what is meant by sequential estimation. The key is that updates have to occur with more information.
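This batch-by-batch updating can be sketched with the Poisson–Gamma model from the question (the data and prior values below are made up). Each batch's posterior becomes the prior for the next batch, and the end result is identical to updating once on all the data pooled together:

```python
# One conjugate Poisson-Gamma update: gamma' = gamma + sum(batch), beta' = beta + n.
def update(gamma, beta, batch):
    return gamma + sum(batch), beta + len(batch)

gamma, beta = 2.0, 1.0                      # prior before seeing any data
batches = [[3, 1, 4], [2, 5], [0, 2, 3, 1]]  # data arriving in three batches

for batch in batches:                       # sequential updating
    gamma, beta = update(gamma, beta, batch)

# Updating batch by batch gives the same posterior as one update on the pooled data:
all_data = [x for batch in batches for x in batch]
print((gamma, beta))                        # (23.0, 10.0)
print(update(2.0, 1.0, all_data))           # (23.0, 10.0)
```

This order-independence is a feature of conjugate updating: the posterior parameters only accumulate the sufficient statistics (here, the total count and the number of observations), so it does not matter how the data are batched.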

They also mention that the posterior becomes complex and requires numerical methods when the priors are not conjugate. In the real world this is usually the case; our information is not always conveniently represented by a conjugate family. Then, to estimate the posterior, we have to rely on numerical methods. This can get very complicated in sequential analyses and may require approximations in order to pass the information from one experiment to the next when the posterior has no closed form or is not easy to sample from.
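As a small illustration of the non-conjugate case, here is a sketch with a Poisson likelihood and a lognormal prior on the rate (all numbers made up): the posterior has no closed form, so we approximate it on a grid. A one-dimensional grid is only a crude stand-in for the numerical methods used in practice (MCMC, quadrature, variational inference), but it shows the idea.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Hypothetical Poisson counts and a non-conjugate lognormal prior on the rate.
x = np.array([3, 1, 4, 2, 5])
grid = np.linspace(0.01, 15.0, 2000)

prior = stats.lognorm.pdf(grid, s=0.5, scale=np.exp(1.0))
likelihood = np.prod(stats.poisson.pmf(x[:, None], grid), axis=0)
unnormalized = prior * likelihood

# No Gamma(gamma', beta') shortcut here: normalize and summarize numerically.
posterior = unnormalized / trapezoid(unnormalized, grid)
post_mean = trapezoid(grid * posterior, grid)
print(round(post_mean, 2))
```

Passing this posterior on as the prior for a next experiment is exactly where the difficulty mentioned above arises: it exists only as values on a grid (or as samples), so sequential analyses often approximate it, for example by fitting a convenient parametric family to it.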

Answered by Tyrel Stokes on January 3, 2022
