How to use the formula model in t.test

Data Science Asked by Molitoris on May 14, 2021

I am trying to better understand the formula model for two sample t-tests in R. When I calculate the test in the formula model I get a wrong result.

```r
set.seed(41)
df = data.frame(x1=c(rep(1, 10), rep(0, 10))+ rnorm(20, mean = 0, sd = 0.1),
               x2=c(rep(0, 10), rep(1, 10)))

t.test(x1 ~ x2, data=df)
```

Output

```
    Welch Two Sample t-test

data:  x1 by x2
t = 22.365, df = 17.85, p-value = 1.668e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.9247087 1.1165780
sample estimates:
mean in group 0 mean in group 1 
      1.0530115       0.0323681 
```

If I use the variable model, I get the expected result.

```r
t.test(x = df$x1, y = df$x2)
```

Output

```
    Welch Two Sample t-test

data:  df$x1 and df$x2
t = 0.2581, df = 37.945, p-value = 0.7977
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2921655  0.3775450
sample estimates:
mean of x mean of y 
0.5426898 0.5000000
```

One Answer

Which result is right or wrong here depends on your objective. You have created two variables (vectors) $x_1, x_2$.

Assume $x_1, x_2$ are two samples of i.i.d. random variables $X_1$ and $X_2$, respectively. With some further assumptions, you want to test the null hypothesis $\mathbb{E}(X_1) = \mathbb{E}(X_2)$.

For this, your second output is the correct one. However, given the data you generated, the test is not applicable: the values within each of your samples $x_1, x_2$ do not come from the same distribution, since the mean of the first ten values differs from that of the last ten.

Setting your particular data aside, the same test can also be run with the formula approach. Stack the two vectors $x_1, x_2$ into a single column $x$ and add a second column (say, $y$) that records which sample each data point comes from. Call this new data frame `df1`. Then `t.test(x ~ y, data = df1)` is an equivalent way of running the test mentioned above.
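A minimal sketch of that equivalence, using simulated samples rather than the data from the question (the seed, the sample sizes, and the labels `sample1`/`sample2` are assumptions chosen for illustration):

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Two hypothetical samples (simulated; not the data from the question)
x1 <- rnorm(10, mean = 1, sd = 0.1)
x2 <- rnorm(10, mean = 0, sd = 0.1)

# Long format: one column of values, one column of group labels
df1 <- data.frame(
  x = c(x1, x2),
  y = factor(rep(c("sample1", "sample2"), each = 10))
)

res_formula <- t.test(x ~ y, data = df1)  # formula interface
res_vectors <- t.test(x = x1, y = x2)     # two-vector interface

# Both calls run the same Welch test, so statistic and p-value agree
all.equal(unname(res_formula$statistic), unname(res_vectors$statistic))
all.equal(res_formula$p.value, res_vectors$p.value)
```

Both interfaces describe the same comparison; only the layout of the input differs.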

The formula approach is helpful when your data is already organized in this format. For example, say you have a data frame with two columns: height ($x$) and gender ($y$). Then running `t.test(x ~ y, data = df1)` tests whether the mean height differs between genders.
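As an illustration of data that already comes in this layout, R's built-in `sleep` data set has a numeric column `extra` and a two-level factor `group`. It stands in here for the height/gender example, which is not real data in this thread:

```r
# Built-in data set: one measurement column, one grouping column
str(sleep[, c("extra", "group")])

# Compare the mean of `extra` between the two levels of `group`
res_sleep <- t.test(extra ~ group, data = sleep)
res_sleep$estimate  # one estimated mean per group
```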

Your first approach can be considered right only when $x_2$ is a factor variable that identifies the group or sample to which each data point in vector $x_1$ belongs.
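To see that this is how the formula call reads the data from the question, here is a short sketch that compares it with an explicit split of `x1` by the value of `x2` (the names `res_formula`, `res_split`, and the `group` column are introduced here for illustration):

```r
set.seed(41)
df <- data.frame(x1 = c(rep(1, 10), rep(0, 10)) + rnorm(20, mean = 0, sd = 0.1),
                 x2 = c(rep(0, 10), rep(1, 10)))

# The formula x1 ~ x2 splits x1 by the value of x2 ...
res_formula <- t.test(x1 ~ x2, data = df)

# ... which is the same as comparing these two subsets of x1 directly
res_split <- t.test(x = df$x1[df$x2 == 0], y = df$x1[df$x2 == 1])
all.equal(unname(res_formula$statistic), unname(res_split$statistic))

# Making the grouping role explicit with a factor gives the same test
df$group <- factor(df$x2, levels = c(0, 1), labels = c("group0", "group1"))
res_factor <- t.test(x1 ~ group, data = df)
all.equal(unname(res_formula$statistic), unname(res_factor$statistic))
```

So the first output in the question compares the first ten values of `x1` with the last ten, rather than comparing `x1` with `x2`.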

Correct answer by Dayne on May 14, 2021
