Computing the variance of hypergeometric distribution using indicator functions

Question

I want to compute the variance of a random variable $X$ that has the hypergeometric distribution $\mathrm{Hyp}(n,r,b)$, where $n$ is the total number of balls in the urn and $r$ and $b$ are the numbers of red and black balls, by using the representation

$$X= I_{A_1} + \cdots + I_{A_n}$$

($I_A$ is the indicator function of $A$ and $A_i$ means that we have a red ball in the $i$-th draw).

So for the expected value we have

$$E[X] = E[I_{A_1} + \cdots + I_{A_n}] = E[I_{A_1}] + \cdots +E[I_{A_n}] = P(A_1) + \cdots + P(A_n)$$

But I don't know how to calculate these $P(A_i)$. And what about $E[X^2]$? Can anybody help?

Thanks in advance!

Michael Hardy · Accepted Answer

$\newcommand{\var}{\operatorname{var}}\newcommand{\cov}{\operatorname{cov}}$

If $n$ really is the total number of balls in the urn, the variance of $I_{A_1}+\cdots+I_{A_n}$ is trivially $0$, since the sum is $r$ with probability $1$.

But suppose there had been more than $n$ balls in the urn, so that it would not be certain that every red ball had been drawn after $n$ trials.  Then we would have
\begin{align}
\operatorname{var}(I_{A_1}+\cdots+I_{A_n}) & = \operatorname{var}(I_{A_1})+\cdots+\operatorname{var}(I_{A_n}) + \underbrace{2\operatorname{cov}(I_{A_1},I_{A_2})+\cdots\quad{}}_{n(n-1)/2\text{ terms}} \\[10pt]
& = n\operatorname{var}(I_{A_1}) + n(n-1) \operatorname{cov}(I_{A_1},I_{A_2}).
\end{align}

Next we have
$$
\operatorname{var}(I_{A_1}) = \operatorname{E}(I_{A_1}^2)-(\operatorname{E}I_{A_1})^2
$$
and then use the fact that $I_{A_1}^2=I_{A_1}$ since $0^2=0$ and $1^2=1$.

For the covariance, you have
$$
\operatorname{cov}(I_{A_1},I_{A_2}) = \operatorname{E}(I_{A_1}I_{A_2}) - (\operatorname{E}I_{A_1})(\operatorname{E}I_{A_2})
$$
And $\operatorname{E}(I_{A_1}I_{A_2})=\Pr(I_{A_1}=I_{A_2}=1)=\dfrac{\binom r 2}{\binom{r+b}2}$.
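
As a sanity check on the pieces above, here is a minimal Python sketch. It assumes $n$ draws without replacement from an urn of $r+b \ge 2$ balls, and the function names `indicator_variance` and `closed_form_variance` are illustrative, not from any library. It assembles the $n$ variance terms plus $2\operatorname{cov}(I_{A_1},I_{A_2})$ for each of the $\binom n2$ unordered pairs, using exact fractions, and compares the result with the standard closed form for the hypergeometric variance.

```python
from fractions import Fraction
from math import comb


def indicator_variance(n, r, b):
    """Variance of the number of red balls in n draws without replacement
    from r red and b black balls, assembled from indicator variables."""
    total = r + b
    p = Fraction(r, total)                          # E[I_{A_1}] = P(A_1)
    p_both = Fraction(comb(r, 2), comb(total, 2))   # E[I_{A_1} I_{A_2}]
    var_one = p - p * p                             # uses I^2 = I
    cov_pair = p_both - p * p
    # n variance terms plus 2*cov for each of the comb(n, 2) unordered pairs
    return n * var_one + 2 * comb(n, 2) * cov_pair


def closed_form_variance(n, r, b):
    """Textbook hypergeometric variance n * (r/N) * (b/N) * (N-n)/(N-1), N = r+b."""
    total = r + b
    return Fraction(n * r * b * (total - n), total * total * (total - 1))


if __name__ == "__main__":
    print(indicator_variance(3, 4, 6), closed_form_variance(3, 4, 6))
    print(indicator_variance(10, 4, 6))  # every ball drawn
```

With $n=3$, $r=4$, $b=6$ both functions give $14/25$, and with $n=r+b=10$ the indicator sum has variance $0$, matching the remark that the sum is then $r$ with probability $1$.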

abblaa · Answer

$\newcommand{\var}{\operatorname{var}}\newcommand{\cov}{\operatorname{cov}}$Just a small note on Michael's answer. The number of $2 \cov(I_{A_{1}}, I_{A_{2}})$ terms is $n\choose 2$. Thus, the variance becomes:
\begin{align}
\var(I_{A_1}+\cdots+I_{A_n}) & = \var(I_{A_1})+\cdots+\var(I_{A_n}) + \underbrace{2\cov(I_{A_1},I_{A_2})+\cdots\quad{}}_{{n\choose 2}\text{ terms}} \\[10pt]
& = n\var(I_{A_1}) + {n\choose 2} 2 \cov(I_{A_1},I_{A_2}).
\end{align}
(I wrote it as a separate answer, because it was rejected as an edit, and don't have enough reputation to comment.)
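
The pair count can also be checked by brute force on a small urn. The sketch below is only an illustration under stated assumptions: $2 \le n \le r+b$, an urn small enough to enumerate, and a hypothetical helper name `brute_force_check`. It lists every ordered draw of $n$ individually distinguishable balls, computes the variance of the red count directly, and compares it with $n$ variance terms plus $2\operatorname{cov}$ per unordered index pair.

```python
from fractions import Fraction
from itertools import combinations, permutations


def brute_force_check(n, r, b):
    """Return (direct variance, indicator formula, number of index pairs)."""
    urn = [1] * r + [0] * b            # 1 = red, 0 = black
    # permutations() treats each ball as distinct, so every ordered
    # draw in this list is equally likely
    draws = list(permutations(urn, n))
    count = len(draws)

    def mean(f):
        return Fraction(sum(f(d) for d in draws), count)

    var_x = mean(lambda d: sum(d) ** 2) - mean(sum) ** 2
    e1 = mean(lambda d: d[0])
    e2 = mean(lambda d: d[1])
    var_1 = e1 - e1 ** 2               # I^2 = I
    cov_12 = mean(lambda d: d[0] * d[1]) - e1 * e2
    pairs = len(list(combinations(range(n), 2)))   # unordered pairs {i, j}
    return var_x, n * var_1 + 2 * pairs * cov_12, pairs


if __name__ == "__main__":
    print(brute_force_check(3, 3, 2))
```

For $n=3$, $r=3$, $b=2$ there are $3$ index pairs, and both the direct variance and the formula come out to $9/25$.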

André Nicolas · Answer

Outline: I will change notation, to have fewer subscripts. Let $Y_i=1$ if the $i$-th ball is red, and let $Y_i=0$ otherwise.

We are picking $n$ balls. I will assume that (unlike in the problem as stated) $n$ is not necessarily the total number of balls, since that would make the problem trivial.

Then $E(X)=E(Y_1)+\cdots+E(Y_n)$. Note that $\Pr(Y_i=1)=\frac{r}{r+b}$. This holds because, if the balls have ID numbers (if you like, in invisible ink), then all sequences of balls are equally likely, so the $i$-th ball drawn is equally likely to be any ball in the urn.
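
The "ID numbers" argument can be made concrete with a small enumeration. This is a sketch for tiny urns only, and the function name `red_probability_by_position` is hypothetical: it treats every ball as distinguishable, lists all ordered draws of $n$ balls, and computes the exact fraction of draws with a red ball in each position.

```python
from fractions import Fraction
from itertools import permutations


def red_probability_by_position(n, r, b):
    """Exact P(Y_i = 1) for i = 1..n, by enumerating ordered draws of
    individually labelled balls (all such sequences are equally likely)."""
    urn = [1] * r + [0] * b            # 1 = red, 0 = black
    draws = list(permutations(urn, n))
    return [Fraction(sum(d[i] for d in draws), len(draws)) for i in range(n)]


if __name__ == "__main__":
    print(red_probability_by_position(3, 2, 3))
```

With $r=2$, $b=3$ and $n=3$, every position gives $2/5 = \frac{r}{r+b}$, regardless of $i$.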

For the variance, as you know, it is enough to compute $E(X^2)$. Expand $(Y_1+\cdots+Y_n)^2$ and take the expectation, using the linearity of expectation.

We have terms $Y_i^2$ whose expectation is easy, since $Y_i^2=Y_i$. So we need the expectations of the "mixed" products $Y_iY_j$ with $i\ne j$. We need to find the probability that the $i$-th ball and the $j$-th ball are both red. This is the probability that the $i$-th is red times the probability that the $j$-th is red given that the $i$-th is.

Thus $E(Y_iY_j)=\frac{r}{r+b}\cdot\frac{r-1}{r+b-1}$.

Now it's a matter of putting the pieces together.
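
One way the pieces might fit together, assuming $n \le r+b$ draws and counting the $n(n-1)$ ordered pairs $(i,j)$ with $i\ne j$ in the expansion of $(Y_1+\cdots+Y_n)^2$:
\begin{align}
E(X^2) & = \sum_{i} E(Y_i^2) + \sum_{i\ne j} E(Y_iY_j) = n\cdot\frac{r}{r+b} + n(n-1)\cdot\frac{r}{r+b}\cdot\frac{r-1}{r+b-1}, \\[10pt]
\operatorname{var}(X) & = E(X^2) - \bigl(E(X)\bigr)^2 = \frac{nr}{r+b} + \frac{n(n-1)\,r(r-1)}{(r+b)(r+b-1)} - \frac{n^2r^2}{(r+b)^2} \\[10pt]
& = \frac{n\,r\,b\,(r+b-n)}{(r+b)^2\,(r+b-1)}.
\end{align}
The last step uses $(r+b)(r+b-1) + (r+b)(n-1)(r-1) - nr(r+b-1) = (r+b-n)\,b$. When $n=r+b$ the result is $0$, which agrees with the trivial case in the problem as stated.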
