Estimating number of infected people and getting bounds of its probability based on a few samples.

Question

This question came up on my exam a few weeks ago, and I've been stuck on it ever since.
Say a hospital has received 500 blood samples for COVID testing. We want to estimate how many of the samples are from infected people before testing them all. Also, we can assume that there are no false positives/negatives.
Let m be the total no. of infected samples out of these 500. Now, 20 samples are randomly chosen, and it is found that n of these are from infected people. These are the questions we have to answer:

What is the estimate of m, given the value of n? That is, $E[M|N=n]$, where N denotes the random variable giving the number of infected samples among the 20 randomly chosen ones, and M denotes the number of infected samples out of 500.

Say 4 out of these 20 are from infected people. What is $P[M>E[M] + 1| N = 4]$? This is to find out how reliable the expected value is.

Any other assumptions required can be made.
My approach:
$P(M=m, N=n) = \frac{\binom{m}{n}\binom{500-m}{20-n}}{\binom{500}{20}}$, so we can find $E[M|N=n]$ as $\sum_{m=n}^{500} m\frac{P(M=m, N=n)}{\Pr(N=n)}$, where $\Pr(N=n) = \frac{1}{21}$ (I don't think this part is correct, but since $n \in \{0,1,2,\ldots,20\}$, I wrote this down).
Thanks in advance, this question has really been bugging me!

lonza leggiera · Answer

Your expression $ {500\choose20} $ is the number of subsets of size $20$ that can be drawn from your set of $500$.  The expression $ {m\choose n}{500-m\choose 20-n} $ is, for a given subset of size $ m $, the number of subsets of size $ 20 $ which include exactly $ n $ members of that given subset and exactly $ 20-n $ members of its complement.  So, if each subset of size $ 20 $ is equally likely, the fraction $ \frac{{m\choose n}{500-m\choose 20-n}}{{500\choose20}} $ is the probability that your randomly chosen subset of size $ 20 $ contains exactly $ n $ members of the given subset of size $ m $.
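As a quick sanity check on that counting argument, here is a minimal Python sketch of the fraction. The function name `sample_likelihood` and the value `m = 100` are my own, chosen purely for illustration; exact rational arithmetic is used so that the probabilities can be checked to sum to exactly 1.

```python
from fractions import Fraction
from math import comb


def sample_likelihood(n, m, total=500, drawn=20):
    """P(N = n | M = m): chance that `drawn` samples chosen uniformly at
    random from `total` contain exactly n of the m infected ones."""
    # Impossible combinations get probability zero.
    if n < 0 or n > drawn or n > m or drawn - n > total - m:
        return Fraction(0)
    return Fraction(comb(m, n) * comb(total - m, drawn - n), comb(total, drawn))


m = 100  # hypothetical number of infected samples, for illustration only

# For a fixed m, the probabilities over n = 0, ..., 20 sum to 1.
print(sum(sample_likelihood(n, m) for n in range(21)))  # prints 1
print(float(sample_likelihood(4, m)))
```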
The problem here is that the description of the problem makes $ m $ a non-random number whose value is already determinable, although unknown.  Its exact value could be obtained by testing all $500$ blood samples.  Non-Bayesians would consider it inappropriate to treat it as a random variable and if they had to estimate its value from a random sample of size $20$ they would probably use some sort of significance test.
That your exam question does treat it as a random variable implies, I presume, that you're required to adopt a Bayesian approach, which would entail your assigning a prior distribution to that random variable. For the moment, let's treat this prior, $ \pi $, say, as arbitrary:
$$
\pi_m=P(M=m) .
$$
You can then obtain the posterior distribution of $ M $, given $ N=n $ , from Bayes's theorem:
\begin{align}
P(M=m\,|\,N=n)&=\frac{P(N=n\,|\,M=m)P(M=m)}{P(N=n)}\\
&=\cases{ \frac{{m\choose n}{500-m\choose 20-n}\pi_m}{\sum_{i=n}^{480+n} {i\choose n}{500-i\choose 20-n}\pi_i}& if $ n\le m\le n+ 480$\\
0&otherwise .}
\end{align}
Your formula for $ E[M|N=n] $ is correct, except that
\begin{align}
P(M=m, N=n)&=\frac{{m\choose n}{500-m\choose 20-n}\pi_m}{{500\choose20}} \text{ and}\\
P(N=n)&= \displaystyle\sum_{i=n}^{480+n} \frac{{i\choose n}{500-i\choose 20-n}\pi_i}{{500\choose20}} .
\end{align}
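To make these formulas concrete, here is a rough Python sketch that evaluates the marginal, the posterior and $ E[M|N=n] $ for any prior supplied as a list of $ 501 $ weights. The function names are mine, and the uniform prior at the end is only an illustrative assumption, not something the question specifies.

```python
from fractions import Fraction
from math import comb

TOTAL, DRAWN = 500, 20


def joint_weights(n, prior):
    """Return {m: C(m, n) * C(500 - m, 20 - n) * prior[m]} on the support of m."""
    return {
        m: comb(m, n) * comb(TOTAL - m, DRAWN - n) * prior[m]
        for m in range(n, TOTAL - DRAWN + n + 1)
    }


def marginal(n, prior):
    """P(N = n): the joint weights summed over m, divided by C(500, 20)."""
    return sum(joint_weights(n, prior).values()) / comb(TOTAL, DRAWN)


def posterior(n, prior):
    """Return {m: P(M = m | N = n)}; the factor 1 / C(500, 20) cancels."""
    weights = joint_weights(n, prior)
    norm = sum(weights.values())
    return {m: w / norm for m, w in weights.items()}


def posterior_mean(n, prior):
    """E[M | N = n] under the given prior."""
    return sum(m * p for m, p in posterior(n, prior).items())


# Illustrative assumption: a uniform prior on {0, ..., 500}.
uniform = [Fraction(1, TOTAL + 1)] * (TOTAL + 1)
print(marginal(4, uniform))
print(float(posterior_mean(4, uniform)))
```

Under this particular uniform prior the marginal $ P(N=n) $ comes out as $ \frac{1}{21} $ for every $ n $, so the value guessed in the question corresponds to that specific choice of prior; other priors give other marginals.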
The value of $ P\big(M>E[M]+1\,|\,N=4\big) $ is given by
\begin{align}
P\big(M>E[M]+1\,|\,N=4\big)&=\sum_{m=\max(\lfloor E[M] \rfloor +2, 4)}^{484}P(M=m\,|\,N=4)\\
&=\frac{\sum_{m=\max(\lfloor \sum_{k=0}^{500}k\pi_k \rfloor +2, 4)}^{484} {m\choose 4}{500-m\choose 16}\pi_m}{\sum_{i=4}^{484} {i\choose 4}{500-i\choose 16}\pi_i} .
\end{align}
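The tail sum above can be evaluated numerically along the following lines. This is only a sketch: `tail_probability` is a name I made up, $ E[M] $ is taken to be the prior mean exactly as in the formula, and the uniform prior in the usage lines is an illustrative assumption.

```python
from fractions import Fraction
from math import comb, floor

TOTAL, DRAWN, SEEN = 500, 20, 4


def tail_probability(prior):
    """P(M > E[M] + 1 | N = 4), with E[M] the prior mean, for a prior
    given as a list of 501 weights indexed by m."""
    prior_mean = sum(m * w for m, w in enumerate(prior))
    # M is an integer, so M > E[M] + 1 means M >= floor(E[M]) + 2.
    start = max(floor(prior_mean) + 2, SEEN)
    upper = TOTAL - DRAWN + SEEN  # 484, the largest m compatible with N = 4

    def weight(m):
        return comb(m, SEEN) * comb(TOTAL - m, DRAWN - SEEN) * prior[m]

    numerator = sum(weight(m) for m in range(start, upper + 1))
    denominator = sum(weight(m) for m in range(SEEN, upper + 1))
    return numerator / denominator


# Illustrative assumption: a uniform prior on {0, ..., 500}.
uniform = [Fraction(1, TOTAL + 1)] * (TOTAL + 1)
print(float(tail_probability(uniform)))
```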
Coming to the vexed question of what to choose for $ pi $, the choice of priors in Bayesian statistics is nearly always going to be somewhat subjective.  Since I have no idea how this topic was treated in your course, I also have little idea how your examiners would have expected you to handle it in the exam question you've quoted.  Also, while I have used statistics professionally, I certainly wouldn't claim to have ever been a professional statistician, let alone one with a good working knowledge of Bayesian statistics.  Please bear that in mind while reading the following suggestion.
If your $ 500 $ blood samples were taken somewhat randomly from a large population, it seems to me that a reasonable choice for $ \pi $ would be $ \text{Binomial}(500,p) $:
$$
P(M=m)={500\choose m}p^m(1-p)^{500-m}
$$
for some value of $ p $, which would be the proportion of the large population that are infected.  For COVID-$19$, $ p $ will not be known exactly, but if you know the population from which the sample was drawn you may have a reasonable estimate that you could use for the value of $ p $.  Otherwise, the best you're likely to be able to do is to get expert epidemiologists to suggest a range $ [a,b] $ in which they think $ p $ is $90\%$ (say) likely to lie, and choose a suitable prior distribution $ \Pi $ for $ p $ such that  $ \Pi\big([a,b]\big)=0.9 $. Your prior distribution for $ M $ will then be
$$
P(M=m)={500\choose m}\int_0^1p^m(1-p)^{500-m}\,d\Pi(p) ,
$$
and $ E[M]=500E[p]=500\displaystyle\int_0^1p\,d\Pi(p) $.
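As one possible illustration of such a mixed prior, suppose $ \Pi $ is a Beta distribution. That is my assumption, made only because the integral then has a closed form in terms of the Beta function $ B $; nothing in the question dictates it. The sketch below uses the binomial exponent $ 500-m $ on $ (1-p) $, and the parameter values $ 2 $ and $ 8 $ are hypothetical. It builds the prior for $ M $ and checks numerically that $ E[M]=500E[p] $.

```python
from math import comb, exp, lgamma, log


def log_beta(x, y):
    """Natural log of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def beta_binomial_prior(alpha, beta, total=500):
    """P(M = m) = C(total, m) * B(m + alpha, total - m + beta) / B(alpha, beta),
    which is the mixture integral when Pi is Beta(alpha, beta)."""
    return [
        exp(
            log(comb(total, m))
            + log_beta(m + alpha, total - m + beta)
            - log_beta(alpha, beta)
        )
        for m in range(total + 1)
    ]


alpha, beta = 2.0, 8.0  # hypothetical Beta parameters, so E[p] = 0.2
prior = beta_binomial_prior(alpha, beta)

print(sum(prior))                               # close to 1
print(sum(m * w for m, w in enumerate(prior)))  # close to 500 * E[p]
print(500 * alpha / (alpha + beta))
```

A list produced this way can be used as the weights $ \pi_m $ in the posterior and tail formulas earlier in this answer. Picking Beta parameters so that $ \Pi\big([a,b]\big)=0.9 $ would need a separate numerical fit, which is not attempted here.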
