# Hypothesis test for difference of means when two groups have different population sizes

I want to conduct a hypothesis test to show that there is no difference between the two groups' mean values.

Null hypothesis: μ_group1 − μ_group2 = 0

Alternative hypothesis: μ_group1 − μ_group2 ≠ 0

My first question: since I know everything about each group's population (standard deviation, mean, etc.), can I run a hypothesis test on the whole population?

Second, do the population sizes (if the answer to the first question is "yes") or the sample sizes have to be the same? For example, with a population of 300 in group 1 and 100 in group 2, would I need to sample the same number from each group before running the test?

Cross Validated Asked by Ambleu on January 1, 2021

Illustrating comment, using R:

    set.seed(2020)
    x1 = rnorm(500, 100, 15)
    x2 = rnorm(100, 105, 17)

    summary(x1); length(x1); sd(x1)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      53.32   89.29   98.93   99.18  109.45  148.02
    [1] 500          # size of sample 1
    [1] 15.96929     # sample SD of sample 1
    summary(x2); length(x2); sd(x2)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      59.74   94.62  104.05  104.77  114.88  146.67
    [1] 100          # size of sample 2
    [1] 17.11946     # sample SD of sample 2


The two sample means $$\bar X_1 = 99.18$$ and $$\bar X_2 = 104.77$$ differ. The question is whether, in view of the variability of the data, this difference is large enough to be 'statistically significant' at the 5% level.

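In the boxplots below, boxes are of different widths, as a reminder that sample sizes are quite different. The notches are approximate confidence intervals for the medians, so the fact that the 'notches' in the sides of the boxplots do not overlap is a preliminary clue that the two groups' centers (and, for roughly symmetric data like these, their means) may be significantly different.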

    boxplot(x1, x2, varwidth=TRUE, col="skyblue2", pch=20, notch=TRUE)
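The notch comparison can also be checked numerically, without the plot. The following is a minimal sketch, assuming the same seed and simulated data as above; `boxplot.stats()` returns the notch limits in its `conf` component, and the names `notch1`, `notch2`, and `no_overlap` are only illustrative.

```r
# Recreate the simulated samples used in this answer
set.seed(2020)
x1 <- rnorm(500, 100, 15)
x2 <- rnorm(100, 105, 17)

# Lower and upper notch limits: median +/- 1.58 * (hinge spread) / sqrt(n)
notch1 <- boxplot.stats(x1)$conf
notch2 <- boxplot.stats(x2)$conf
print(rbind(x1 = notch1, x2 = notch2))

# TRUE when the two notch intervals share no common point
no_overlap <- notch1[2] < notch2[1] || notch2[2] < notch1[1]
print(no_overlap)
```

Because the notch half-width shrinks with the square root of the sample size, the larger sample is expected to have the narrower notch.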


In a Welch t test (used because the population variances are not assumed to be equal), the small P-value $$0.003 < 0.05$$ indicates a significant difference at the 5% level. This is not "proof" that the population means differ. However, we would be unlikely to get such different sample means if the population means were the same.

    t.test(x1, x2)

            Welch Two Sample t-test

    data:  x1 and x2
    t = -3.0129, df = 135.64, p-value = 0.003089
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -9.257022 -1.920342
    sample estimates:
    mean of x mean of y
     99.18129 104.76998
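To see why unequal sample sizes are not a problem, it may help to compute the Welch statistic by hand: each sample contributes its own variance divided by its own size, so nothing in the formula requires $$n_1 = n_2$$. The sketch below assumes the same seed and data as above; the variable names (`v1`, `v2`, `t_stat`, `df_w`, `p_two`) are illustrative and not part of the original answer. It should reproduce the t statistic, degrees of freedom, and P-value reported by `t.test`.

```r
# Recreate the simulated samples used in this answer
set.seed(2020)
x1 <- rnorm(500, 100, 15)
x2 <- rnorm(100, 105, 17)

n1 <- length(x1)
n2 <- length(x2)

# Squared standard error of each sample mean (sizes may differ)
v1 <- var(x1) / n1
v2 <- var(x2) / n2

# Welch t statistic: difference in means over its estimated standard error
t_stat <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximate degrees of freedom
df_w <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided P-value from the t distribution
p_two <- 2 * pt(-abs(t_stat), df_w)

print(c(t = t_stat, df = df_w, p.value = p_two))
```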


Addendum per comment. Here is a one-sided test. If $$\bar X_1 < \bar X_2,$$ then the test of $$H_0: \mu_1 = \mu_2$$ against $$H_a: \mu_1 < \mu_2$$ will have a P-value half the size of that of the two-sided test.

    t.test(x1, x2, alternative="less")

            Welch Two Sample t-test

    data:  x1 and x2
    t = -3.0129, df = 135.64, p-value = 0.001544
    alternative hypothesis: true difference in means is less than 0
    95 percent confidence interval:
          -Inf -2.516599
    sample estimates:
    mean of x mean of y
     99.18129 104.76998
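The halving relationship can be checked directly from the two test objects. This is a small sketch, assuming the same seed and data as above; the names `two_sided` and `one_sided` are illustrative. The relationship holds only when the observed difference points in the direction of the alternative, that is, when the t statistic is negative for `alternative = "less"`.

```r
# Recreate the simulated samples used in this answer
set.seed(2020)
x1 <- rnorm(500, 100, 15)
x2 <- rnorm(100, 105, 17)

two_sided <- t.test(x1, x2)                        # H_a: mu_1 != mu_2
one_sided <- t.test(x1, x2, alternative = "less")  # H_a: mu_1 <  mu_2

# Both tests share the same t statistic and degrees of freedom
print(c(t = unname(two_sided$statistic), df = unname(two_sided$parameter)))

# With t < 0, the lower-tail P-value is half the two-sided P-value
print(c(two_sided = two_sided$p.value, one_sided = one_sided$p.value))
print(isTRUE(all.equal(one_sided$p.value, two_sided$p.value / 2)))
```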


Answered by BruceET on January 1, 2021
