Why hasn't the USA done a statistical sampling of COVID cases?

Question

In all the media reports I have seen over the past few months with respect to the COVID pandemic, I have never heard about any statistically random testing done at any level (state, federal).
From my perspective, it would make a lot of sense to test a statistically random sample for COVID and build a "probability" model, much as election polling is done. With such a study, statements could be made about the number of actual active COVID cases (within a margin of error).
Instead I have just heard media talking about "the CDC says there could be up to 10x as many cases out there" or something along those lines. Why hasn't that estimate been quantified statistically? Or if it has, where can I see that data?

CDJB · Answer

The 10x figure you mention originally came from a June 25th conference call with reporters. Dr. Robert Redfield, the CDC director, was one of the participants. The figure came in response to this question:

Maggie Fox: Thanks.  Dr. Redfield, I was very intrigued by something you said, that for every case that’s tested positive, there
might be ten that weren’t detected.  Can you expand on that?  And I
think you probably know, the Wall Street Journal has said that the CDC
estimates many millions more cases than has been diagnosed.  Thanks.
Robert Redfield: Yeah.  Thank you for the question.  I mean.  We have one of the realities, because this virus causes so much
asymptomatic infection.  And again, we don’t know the exact number.
There’s ranges between 20%, as high as 80% in different groups.  But
clearly, it causes significant asymptomatic infection.
The traditional approach of looking for symptomatic illness and
diagnosing it obviously underestimated the total amount of infections.
So, now, with the availability of serology, the ability to test for
antibodies, CDC has established surveillance throughout the united
states using a variety of different mechanisms for serology, and that
information now is coming in and will continue as we look at the
range, for example, where you have a different range of percent
infections, say on the west coast, where it may be limited, say 1% or
so, and then you have the northeast, where it may be much more common.
The estimates that we have right now, that I mentioned — and again,
this will continue with more and more surveillance — is that it’s
about ten times more people have antibody in these jurisdictions that
had documented infection.  So that gives you an idea.  What the
ultimate number is going to be — is it 5-1, is it 10-1, is it 12-1?
But I think a good rough estimate right now is 10-1.  And I just
wanted to highlight that, because at the beginning, we were seeing
diagnosis in cases of individuals that presented in hospitals and
emergency rooms and nursing homes.  And we were selecting for
symptomatic or higher-risk groups.
There wasn’t a lot of testing that was done of younger-age symptomatic
individuals.  So, I think it’s important for us to realize that, that
we probably recognized about 10% of the outbreak by the methods that
were used to diagnose it between March, April, and May.  And I think
we are continuing to try to enhance surveillance systems for
individuals that are asymptomatic to be able to start detecting that
asymptomatic infection more in real time.

So according to Redfield, the CDC is now using 'surveillance' methods which include testing of younger-age individuals, whereas at the start of the pandemic there was a lot more focus on older participants or those in particularly at-risk groups. This suggests that although there was little statistical sampling representative of the national population to begin with, these methods have developed during the course of the pandemic.
It should be noted that in this study the CDC is testing for COVID-19 antibodies - that is, whether the individual has had the virus at some point - not whether the individual has the virus at that moment in time. The resulting estimate of past infections can then be compared with the number of reported positive tests to estimate how many cases have escaped detection.
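As a rough illustration of that comparison, here is a minimal sketch in Python. The function name and all of the numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the CDC study.

```python
def underascertainment_ratio(seroprevalence, population, reported_cases):
    """Estimated total infections divided by reported (test-confirmed) cases.

    seroprevalence: estimated fraction of the population with antibodies
    population:     size of the population the estimate refers to
    reported_cases: cumulative confirmed cases in that same population
    """
    if reported_cases <= 0:
        raise ValueError("reported_cases must be positive")
    estimated_infections = seroprevalence * population
    return estimated_infections / reported_cases


# Hypothetical jurisdiction: 1,000,000 residents, 5% with antibodies,
# 5,000 confirmed cases. These figures are made up for illustration.
ratio = underascertainment_ratio(0.05, 1_000_000, 5_000)
print(f"Estimated infections per reported case: {ratio:.1f}")
```

With these made-up inputs the ratio comes out at roughly 10 infections per reported case, which is the kind of figure Redfield was describing.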
Since this call, the study has been published in JAMA Internal Medicine, and the data is available on the CDC's website. This gives us better insight into the sampling methods used to arrive at this figure. Full details are available in the study, but in general, 'convenience samples' were taken from 10 sites in the US. This is a sampling method which, rather than attempting to collect a completely representative sample from the outset, relies on data which is easy to collect - in this case, serum samples that had originally been taken for routine screening tests and were supplied by two commercial diagnostic laboratories. In particular, the study mentions that they aimed to have at least 300 samples per age group.
As this sampling method doesn't obtain data which is representative of the general population, weighting was then performed post-data-collection. In particular, they stratified the data based on sex and age group, as well as geographically based on the location of the sample collection sites. This allowed the CDC to come up with a general approximation which is hopefully representative of the national population.
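The general idea of this kind of weighting can be sketched as follows. This is a simplified illustration of post-stratification, not the CDC's actual procedure: the strata, population shares and sample counts below are all hypothetical, and the study's own weighting was more involved.

```python
from collections import defaultdict


def poststratified_prevalence(samples, population_shares):
    """Weight each stratum's positive rate by its share of the population.

    samples:           iterable of (stratum, is_positive) pairs
    population_shares: dict mapping stratum -> fraction of the target
                       population in that stratum (should sum to 1)
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [positives, total]
    for stratum, is_positive in samples:
        counts[stratum][0] += int(is_positive)
        counts[stratum][1] += 1

    estimate = 0.0
    for stratum, share in population_shares.items():
        positives, total = counts[stratum]
        if total == 0:
            raise ValueError(f"no samples for stratum {stratum!r}")
        estimate += share * positives / total
    return estimate


# Hypothetical convenience sample that over-represents older people.
samples = (
    [("young", True)] * 2 + [("young", False)] * 98
    + [("old", True)] * 30 + [("old", False)] * 270
)
# Hypothetical population shares for the two strata.
shares = {"young": 0.6, "old": 0.4}

raw = sum(positive for _, positive in samples) / len(samples)
weighted = poststratified_prevalence(samples, shares)
print(f"raw: {raw:.3f}, weighted: {weighted:.3f}")
```

In this made-up example the unweighted positive rate is pulled upward by the over-sampled stratum, and weighting by population share corrects for that. Weighting can only correct for the variables it stratifies on, which is why the limitations the study itself lists still apply.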
Although this type of sampling is different to, for example, identifying a representative sample of the population in advance and posting antibody test kits to them directly, it is highly reminiscent of many election polling methods. For example, Gallup publishes information on how its weekly U.S. poll is conducted - the sample is drawn by randomly generating landline and mobile numbers, and the results are then weighted by demographics after the data has been collected.
In fact, the CDC itself acknowledges that its sampling method has limitations - in the 'Discussion' section, the study mentions:

Our study has limitations that are associated with both the samples
and with the tests used. The specimens were collected for clinical
purposes from persons seeking health care and were shared with the CDC
with minimal accompanying data. No data on recent symptomatic illness,
underlying conditions, or possible COVID-19 exposures were available.
It is possible that specimens were drawn from patients seeking care
for suspected COVID-19 symptoms, potentially biasing results,
particularly in settings such as NY where disease incidence was
higher.

Lab B sampled sera from metabolic panels taken at routine
outpatient visits; Lab A sampled randomly with respect to clinical
test type and admission status. Residual clinical specimens from
screening or routine care are more likely to come from persons who
require monitoring for chronic medical conditions despite the ongoing
pandemic. These persons may not be representative of the general
population, including in their health care seeking and social
distancing behavior, immune response to infection, and disease
exposure risk. Representativeness may vary by age group as well.
Therefore, our seroprevalence estimates should be confirmed and
extended by other studies, including serosurveys that use targeted
sampling frames to enroll more representative populations.

So in conclusion - this 10x figure has been quantified statistically, based on a large-scale, geographically diverse sample. There are limitations to the study, arising from its sampling methods and the stage of the pandemic at which it was conducted, but generally the figure can be justified statistically.
The reason why convenience samples were used rather than more targeted sampling is down to the availability of data at that point in the pandemic. The CDC collected just over 16,000 samples using this method, a volume which would have been much harder to achieve using other sampling methods.
