How to impose restrictions on a random matrix via its prior distribution?

Question

I am reading the paper Factor analysis and outliers: A Bayesian approach. The author starts with a factor analysis model given by
$${\bf y}_i = {\bf \Lambda} {\bf z}_i + {\bf e}_i, \quad i = 1, \ldots, n,$$
where each ${\bf y}_i$ is a $p$-dimensional observation vector, each ${\bf z}_i$ is a $K$-dimensional latent factor vector, and ${\bf \Lambda}$ is a $p \times K$ full-rank matrix of factor loadings. The author assumes that the factors and the error term are Normal:
$${\bf z}_i \sim \mathcal{N} ({\bf 0}, {\bf \Phi})$$
$${\bf e}_i \sim \mathcal{N} ({\bf 0}, {\bf \Psi})$$
The author assigns Wishart priors to ${\bf \Phi}^{-1}$ and ${\bf \Psi}^{-1}$:
$${\bf \Phi}^{-1} \sim \mathcal{W}_K \left( {\bf \Phi}_{*}, \nu_{*} \right)$$
$${\bf \Psi}^{-1} \sim \mathcal{W}_p \left( {\bf \Psi}_{*}, n_{*} \right)$$
In the paper the author writes something I found to be quite interesting:

While classical factor analysis sets ${\bf \Phi} = {\bf I}$ and uses a diagonal ${\bf \Psi}$ matrix, we impose these restrictions via the prior information matrices ${\bf \Psi}_{*}$ and ${\bf \Phi}_{*}$.

Question: What should the values of ${\bf \Psi}_{*}$ and ${\bf \Phi}_{*}$ be in order to do what the author is suggesting?
The author does not seem to state exactly how this can be done, but I may have missed it so I will continue reading it. My own research on this matter pointed me to these seemingly similar unanswered questions here and here.

UPDATE: I did some research on the Wishart distribution, and if you specify that ${\bf \Psi}_{*}$ and ${\bf \Phi}_{*}$ are two diagonal matrices, then $\mathbb{E} [{\bf \Psi}]$ and $\mathbb{E} [{\bf \Phi}]$ will be two diagonal mean matrices. Perhaps this is what the author is referring to. Still unsure, though.
UPDATE 2: I set ${\bf \Psi}_{*}$ and ${\bf \Phi}_{*}$ to diagonal matrices and ran simulations in R, but the results aren't what I expected. The simulated matrices I obtained are not diagonal, so I think I misinterpreted the author's statement. I thought that if you formulate the factor analysis model with the prior distributions above, you can recover the classical factor analysis model by choosing certain hyper-parameter values. But it seems that this formulation does not produce the classical factor analysis model.
UPDATE 3: The classical factor analysis model sets ${\bf \Phi} = {\bf I}$ (i.e. non-random), sets ${\bf \Psi}$ to be a diagonal matrix (i.e. a random diagonal matrix) and assigns prior distributions to only the diagonal elements. What I understand the author's statement to mean is that I can do the aforementioned things by using Wishart priors on ${\bf \Phi}$ and ${\bf \Psi}$ with special scale matrices ${\bf \Phi}_{*}$ and ${\bf \Psi}_{*}$.

ping · Accepted Answer

The inverse Wishart distribution (used in the mentioned article in the equivalent form of a Wishart prior on the inverse covariance matrix) is a common prior for the covariance matrix of a multivariate Normal random variable.
This choice is based on the fact that it is a conjugate prior for the covariance matrix in this scenario.
If $\mathbf{X}=(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ with each $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{\Sigma})$, and the prior is $\mathbf{\Sigma} \sim \mathcal{W}^{-1}(\mathbf{\Psi}, \nu)$, then the posterior $\mathbf{\Sigma} \mid \mathbf{X} \sim \mathcal{W}^{-1}(\mathbf{A}+\mathbf{\Psi}, n+\nu)$ is also inverse-Wishart distributed, where $\mathbf{A}=\mathbf{X}\mathbf{X}^t$ and $n$ is the number of observations in $\mathbf{X}$.
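As a minimal sketch of this conjugate update (plain Python, no external libraries; the toy data and the helper names `outer_sum` and `inverse_wishart_posterior` are made up for illustration), the posterior parameters can be computed directly. Here each row of `data` is one observation, so the sum of outer products equals $\mathbf{X}\mathbf{X}^t$:

```python
def outer_sum(rows):
    """Return A = sum_i x_i x_i^T for a list of p-dimensional observations."""
    p = len(rows[0])
    A = [[0.0] * p for _ in range(p)]
    for x in rows:
        for j in range(p):
            for k in range(p):
                A[j][k] += x[j] * x[k]
    return A


def inverse_wishart_posterior(rows, prior_scale, prior_df):
    """Conjugate update for zero-mean Normal data: IW(Psi, nu) -> IW(A + Psi, n + nu)."""
    n, p = len(rows), len(prior_scale)
    A = outer_sum(rows)
    post_scale = [[A[j][k] + prior_scale[j][k] for k in range(p)] for j in range(p)]
    return post_scale, n + prior_df


# Hypothetical toy data: three strongly correlated 2-dimensional observations.
data = [[1.0, 0.9], [-1.0, -1.1], [2.0, 2.1]]
prior_scale = [[1.0, 0.0], [0.0, 1.0]]  # diagonal prior scale matrix Psi
post_scale, post_df = inverse_wishart_posterior(data, prior_scale, prior_df=3)

print(post_scale)  # off-diagonal entries come entirely from the data term A
print(post_df)
```

Even though the prior scale is diagonal, the posterior scale matrix picks up off-diagonal entries from $\mathbf{A}$ whenever the data are correlated.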
That said, one can impose structure on the prior for the covariance matrix by choosing the prior scale matrix $\mathbf{\Psi}$ appropriately. In the article, the authors set $\mathbf{\Psi}=\mathbf{\Psi}^*$ to be diagonal.
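To see what a diagonal scale matrix does and does not impose, here is a rough simulation sketch (plain Python; the scale values, degrees of freedom and the helper `wishart_draw` are arbitrary choices for illustration, not taken from the article). It draws Wishart matrices with a diagonal scale, as in the article's prior on the inverse covariance, and compares a single draw with the Monte Carlo mean:

```python
import random


def wishart_draw(scale_diag, df, rng):
    """Draw W ~ Wishart(diag(scale_diag), df) as a sum of df outer products z z^T,
    where z ~ N(0, diag(scale_diag)). This construction needs an integer df >= p."""
    p = len(scale_diag)
    W = [[0.0] * p for _ in range(p)]
    for _ in range(df):
        z = [rng.gauss(0.0, s ** 0.5) for s in scale_diag]
        for j in range(p):
            for k in range(p):
                W[j][k] += z[j] * z[k]
    return W


rng = random.Random(0)
scale_diag = [1.0, 2.0, 0.5]  # assumed diagonal of the scale matrix
df = 5                        # assumed degrees of freedom
p = len(scale_diag)

# A single draw: its off-diagonal entries are not zero.
one_draw = wishart_draw(scale_diag, df, rng)

# Monte Carlo mean over many draws: approximately df * diag(scale_diag).
n_draws = 20000
mean = [[0.0] * p for _ in range(p)]
for _ in range(n_draws):
    W = wishart_draw(scale_diag, df, rng)
    for j in range(p):
        for k in range(p):
            mean[j][k] += W[j][k] / n_draws

print(one_draw)
print(mean)
```

The mean is close to diagonal, but every individual draw has non-zero off-diagonal entries. A diagonal scale matrix therefore centers the prior on a diagonal matrix without forcing the random matrix itself to be diagonal, which matches what the simulations in the question showed.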
An alternative approach would have been to force the $p$ variables to be independently Normal-distributed. In that case, the conjugate prior for the variance of each dimension would have been the Inverse Gamma.
The limitation of the latter is that it forces the $p$ variables to remain independent in the posterior, while with an Inverse Wishart prior the off-diagonal elements of the covariance matrix can be non-zero.
When the scale matrix $\mathbf{\Psi}^*$ is diagonal and $\nu=p+1$, the correlations in $\mathbf{\Sigma}$ have a marginal uniform distribution (par. 2.1 https://arxiv.org/pdf/1408.4050.pdf). This corresponds to a non-informative prior for the correlations: the prior does not favor any particular correlation value, so non-zero correlations in the posterior must be supported by the data $\mathbf{X}$.
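The uniformity claim can be checked numerically in the simplest case $p=2$, $\nu=3$. The sketch below (plain Python; the helper name and the sample size are assumptions for illustration) uses an identity scale matrix, which loses no generality here because correlations are unchanged by rescaling individual variables:

```python
import random


def inverse_wishart_correlation(df, rng):
    """Correlation implied by one draw Sigma ~ IW(I_2, df) in two dimensions.

    W ~ Wishart(I_2, df) is built as a sum of df outer products and Sigma = W^{-1}.
    Inverting a 2x2 matrix swaps the diagonal entries and negates the off-diagonal
    (up to a common factor), so corr(Sigma) = -w12 / sqrt(w11 * w22).
    """
    w11 = w22 = w12 = 0.0
    for _ in range(df):
        a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        w11 += a * a
        w22 += b * b
        w12 += a * b
    return -w12 / (w11 * w22) ** 0.5


rng = random.Random(1)
p = 2
draws = [inverse_wishart_correlation(p + 1, rng) for _ in range(20000)]

# For a Uniform(-1, 1) variable, P(|r| < 0.5) = 0.5 and E[r^2] = 1/3.
frac_inner = sum(abs(r) < 0.5 for r in draws) / len(draws)
second_moment = sum(r * r for r in draws) / len(draws)

print(frac_inner, second_moment)
```

Both summaries should come out close to the uniform values (0.5 and 1/3), up to Monte Carlo error.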
An interesting alternative, suggested by Gelman, is to use Half-Cauchy priors (the linked article focuses on 1-dimensional hierarchical models):
http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf
