How to use the formula model in t.test

Question

I am trying to better understand the formula model for two sample t-tests in R. When I calculate the test in the formula model I get a wrong result.
```r
set.seed(41)
df = data.frame(x1=c(rep(1, 10), rep(0, 10))+ rnorm(20, mean = 0, sd = 0.1),
               x2=c(rep(0, 10), rep(1, 10)))

t.test(x1 ~ x2, data=df)
```

Output
```
    Welch Two Sample t-test

data:  x1 by x2
t = 22.365, df = 17.85, p-value = 1.668e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.9247087 1.1165780
sample estimates:
mean in group 0 mean in group 1 
      1.0530115       0.0323681
```

If I use the variable model, I get the expected result.
```r
t.test(x = df$x1, y = df$x2)
```

Output
```
    Welch Two Sample t-test

data:  df$x1 and df$x2
t = 0.2581, df = 37.945, p-value = 0.7977
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2921655  0.3775450
sample estimates:
mean of x mean of y 
0.5426898 0.5000000
```

Dayne · Accepted Answer

Which result is right or wrong here depends on your objective. You have created two variables (vectors) $x_1, x_2$.
Assume $x_1, x_2$ are samples of i.i.d. random variables $X_1$ and $X_2$, respectively. With some further assumptions, you want to test the null hypothesis $\mathbb E(X_1) = \mathbb E(X_2)$.
For this, your second output is the correct one. However, with the data you have generated the test is not really applicable, because the values within each of your samples $x_1, x_2$ do not come from a single distribution: the mean of the first ten values differs from that of the last ten.
Setting your particular data aside, the same analysis can be done with the formula approach as well. Stack the two vectors $x_1, x_2$ into a single vector $x$ and add another column (say, $y$) that identifies which sample each data point comes from. Call this new data frame df1. Then an equivalent way of running the test above is `t.test(x ~ y, data = df1)`.
The formula approach is helpful when your data is already organized in this format. For example, say you have a data frame with two columns: height ($x$) and gender ($y$). Then running `t.test(x ~ y, data = df1)` will test whether the mean height differs between genders.
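The stacking step described above can be sketched as follows, reusing the data from the question. The names `df1`, `res_formula`, `res_vectors`, and `same_test` are chosen here for illustration only. Under these assumptions, the formula call and the two-vector call should run the same Welch test:

```r
# Recreate the data from the question.
set.seed(41)
df <- data.frame(x1 = c(rep(1, 10), rep(0, 10)) + rnorm(20, mean = 0, sd = 0.1),
                 x2 = c(rep(0, 10), rep(1, 10)))

# Stack both columns into one value column x and label each value
# with the sample it came from (y is the grouping factor).
df1 <- data.frame(x = c(df$x1, df$x2),
                  y = factor(rep(c("x1", "x2"), each = nrow(df))))

res_formula <- t.test(x ~ y, data = df1)        # formula interface
res_vectors <- t.test(x = df$x1, y = df$x2)     # two-vector interface

# Compare the test statistic and p-value of the two calls.
same_test <- isTRUE(all.equal(unname(res_formula$statistic),
                              unname(res_vectors$statistic))) &&
             isTRUE(all.equal(res_formula$p.value, res_vectors$p.value))
print(same_test)
```

Here `y` only labels the groups; the values being compared all live in `x`.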
Your first approach can be considered right only when your $x_2$ is a factor variable which identifies the group or sample of the data point in vector $x_1$.
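To see how the formula call in the question reads its inputs, a small check (a sketch only; `res` and `group_means` are names introduced here) compares the estimates reported by `t.test(x1 ~ x2, data = df)` with the means of `x1` computed separately within each value of `x2`:

```r
# Recreate the data from the question.
set.seed(41)
df <- data.frame(x1 = c(rep(1, 10), rep(0, 10)) + rnorm(20, mean = 0, sd = 0.1),
                 x2 = c(rep(0, 10), rep(1, 10)))

# Formula interface: x2 is used as the grouping variable, not as a sample.
res <- t.test(x1 ~ x2, data = df)

# Means of x1 within the rows where x2 == 0 and where x2 == 1.
group_means <- tapply(df$x1, df$x2, mean)

print(res$estimate)
print(group_means)
```

The two printed sets of means should agree, which shows that the values of `x2` themselves never enter the comparison; they only split `x1` into two groups.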
