# Comparing a random sample and a non random sample extracted from a finite population

Cross Validated Asked by Alessandro Jacopson on December 1, 2020

I am not able to formulate a more descriptive title.

I have a population of five million people, I code it with R programming language:

pop <- 5e6


41% of them have brown eyes, 42% of them have green eyes.

brown <- ceiling(.41*pop)
green <- ceiling(.42*pop)
other <- pop-(brown+green)


Then I have 7000 people belonging to the population which test positive to a disease, obviously this is not a random sample taken from the population.

sample_size <- 7000


The percentage of eyes color for these 7000 people is the following: 38% have brown eyes, 44% have green eyes.

brown_in_sample <- ceiling(.38*sample_size)
green_in_sample <- ceiling(.44*sample_size)
other_in_sample <- sample_size-(brown_in_sample+green_in_sample)


Now a very naïve conclusion is that green eyed people are more prone to test positive to the disease because 38% is slightly less than 41% and 44% is slightly greater than 42%; that conclusion smells like meaningless to me because I feel that the proportions of brown eyes in the full population (41%) and in the positive-to-the-disease sample (38%) are too close and the same happens for the green eyes.

How can I get some convincing arguments to support my intuitions?

I have a vague idea that I have to use a multivariate hypergeometric distribution and compare the percentage of brown and green eyes found in a true random sample with the ones found in the positive-to-the-disease sample but I do not know how to proceed.

As an additional constraint, I would like to follow a Bayesian approach.

Let me first place here your data:

pop <- 5e6
brown <- ceiling(.41*pop)
green <- ceiling(.42*pop)
other <- pop-(brown+green)

sample_size <- 7000
brown_in_sample <- ceiling(.38*sample_size)
green_in_sample <- ceiling(.44*sample_size)
other_in_sample <- sample_size-(brown_in_sample+green_in_sample)


My first approach would be to use resampling. Let's select an appropriate statistic to check if its measured value for the real sample is significant or not. I choose the absolute value of the difference in the number of people with green and brown eyes (as a percentage of the sample size), that is,

stat <- function(pop1, pop2, sample_size) {
abs( pop1 - pop2 ) / sample_size
}


What we should do now is to sample from the population, selecting new samples of the same size and then computing the same statistic. We will get lots of sample statistics that can be used to compare with the observed effect.

# brown=1, green=2, other=0
universe <- c( rep(1,brown),
rep(2,green),
rep(0,other))

n.sims <- 10000

replicate(n.sims, {
sampling <- sample(universe, sample_size, rep=FALSE)
n.browns <- length( sampling[sampling==1] )
n.greens <- length( sampling[sampling==2] )
stat(n.browns, n.greens, sample_size)
}) -> results


If you plot the results against the observed effect,

observed.effect <- stat(brown_in_sample, green_in_sample, sample_size)


The conclusion from this simulation is that there is strong evidence that the difference in your sample is significant.

This is the vizualization code I used,

compute.p.value <- function(results, observed.effect, precision=3) {
# n = #experiences
n <- length(results)
# r = #replications at least as extreme as observed effect
r <- sum(abs(results) >= observed.effect)
# compute Monte Carlo p-value with correction (Davison & Hinkley, 1997)
list(mc.p.value=round((r+1)/(n+1), precision), r=r, n=n)
}

present_results <- function(results, observed.effect, label="", breaks=50, precision=3) {
lst <- compute.p.value(results, observed.effect, precision=precision)

hist(results, breaks=breaks, prob=T, main=label, col = "dodgerblue",
sub=paste0("MC p-value for H0: ", lst$$mc.p.value), xlim=c(0,max(c(results,observed.effect))), xlab=paste("found", lst$$r, "as extreme effects for", lst$n, "replications")) abline(v=observed.effect, lty=2, lwd=2, col="red") } present_results(results, observed.effect)  If you like a Bayesian approach, one possibility I see is using a model for a Binomial Test with two rates, testing for one eye color at a time. Something like, $$k_i sim text{Binomial}(n_i, theta_i) ~, ~ i=1ldots m$$ $$theta_i sim text{Beta}(1,1) ~, ~ i=1ldots m$$ $$delta_{ij} = theta_i - theta_j$$ herein, $$m=2$$. Ie, we would like to compare if the difference of, say, brown eyes in the population and brown eyes in the sample, is significant. The null hypothesis is $$H_0 : delta_{1,2} = 0$$. If the credible interval for $$delta_{1,2}$$ includes zero, we must retain the null, not reject it. If it does not include zero, we would reject it. Here is a possible Stan model, data { int<lower=2> m; // at least two groups int<lower=1> n[m]; int<lower=0> k[m]; } parameters { real<lower=0,upper=1> theta[m]; } model { theta ~ beta(1, 1); k ~ binomial(n, theta); } generated quantities { real delta[m,m]; for ( i in 1:m ) { for ( j in 1:m ) { if (i==j) delta[i,j] = normal_lpdf(0|0,1); // dummy values else delta[i,j] = theta[i] - theta[j]; } } }  After the fit of this model with the brown eyes data, we get a 99% credible interval of $$[0.016,0.044]$$ which, given the large numbers we are dealing with, would mean a rejection of the null hypothesis despite being close to zero. The code to perform the fit: library(rstan) library(rethinking) library(latex2exp) n <- c(pop, sample_size ) # runs k <- c(brown, brown_in_sample ) # successes m <- length(n) # model1 is a string containing the Stan model above fit1 <- sampling(model1, data = list(n=n, k=k, m=m), iter = 10000, chains = 2, refresh = 0 ) precis(fit1, depth=3, prob=0.99)  Let's also plot the posterior of $$delta_{1,2}$$ samples <- rstan::extract(fit1) samples_theta <- samples$$theta samples_delta <- samples$$delta credible_interval <- PI(samples_delta[,1,2], prob=0.99) h <- hist(samples_delta[,1,2], breaks=40, prob=T, yaxt='n', col=rgb(0,0,1,0.15), border=rgb(1,1,1,0.6), main="", xlab=TeX(paste0('$$\theta_',1,'-\theta_',2,'$$')), ylab="") y.seg <- max(h$density) / 15
segments(credible_interval[1], y.seg, credible_interval[2], y.seg,
lwd=5, col="dodgerblue")
text(credible_interval[1], y.seg*2.5, round(credible_interval[1],3),
col="dodgerblue")
text(credible_interval[2], y.seg*2.5, round(credible_interval[2],3),
col="dodgerblue")


Correct answer by jpneto on December 1, 2020

## Related Questions

### Difference of two non-central chi squared random variables

1  Asked on February 14, 2021 by pol

### Why do the mean and proportion measurements take the spotlight in estimation?

1  Asked on February 13, 2021

### Feature engineering: Measuring proportional changes across two item given two conditions

0  Asked on February 13, 2021 by comte

### Validation of my implementation for trace norm and L1 regularization

0  Asked on February 13, 2021 by rando

### Multivariate Normal Quadratic MGF: Eigendecomposition to Matrix form

0  Asked on February 12, 2021 by itsallpurple

### Sample Clustering by Accounting for Gene Fold Change and p-value

0  Asked on February 12, 2021 by user294496

### How do we construct features to use as input to machine learning algorithms for the purpose of movie recommendations(using collaborative filtering)?

1  Asked on February 11, 2021 by user3676846

### subspace clustering with density threshold

0  Asked on February 11, 2021 by pianobegginer

### In real-life applications, which continuous distributions have NON-CONVERGENT expectations that require Lebesgue integration?

0  Asked on February 11, 2021 by iterator516

### How many combinations and how to find a (standard?) distribution that matches?

2  Asked on February 10, 2021 by d-b

### Does PyMC3’s Timeseries API allow for time varying parameters of any model that would require fixed values under a frequentist approach?

0  Asked on February 10, 2021 by lisa-ann

### How do I modify data to test retest measure as more normal?

1  Asked on February 10, 2021 by user136083

### Anomaly detection in Text Classification

2  Asked on February 9, 2021 by naveen-y

### Which output activation is recommended when predicting a variable with a lower but not an upper bound?

2  Asked on February 9, 2021 by user26067

### Extracting a parameter from a probability problem

1  Asked on February 8, 2021 by mishe-mitasek

### Missing target variable values for a labeled model with just a time feature (timeseries)?

0  Asked on February 8, 2021 by amit-s

### What happens to the Gaussian as $sigma to infty$ at $x to infty$?

1  Asked on February 6, 2021 by rkabra

### Deriving comparable probabilites from continuous and discrete data

0  Asked on February 6, 2021 by pst0102