TransWikia.com

Simulate a variable based on a known correlation and distribution

Cross Validated Asked on December 15, 2021

I have a normally distributed variable (var2) with a mean of 10 and sd of 3:

mean <- 10
sd <- 3    
var2 <- rnorm(n = 1000, mean = mean, sd = sd)

I want to simulate a second variable whose correlation with var2 is known, for example:

r = .83

The second variable (y) is also known to be normally distributed with a mean of 10 and an sd of 3. I found one solution that seemed relevant, "Tool for generating correlated data sets", which suggests using an independent, normally distributed variable with the same variance (var1):

var1 <- rnorm(n = 1000, mean = mean, sd = sd)
y <- scale(var2) * r  +  scale(residuals(lm(var1 ~ var2))) * sqrt(1 - r * r)
y <- mean + (y - 0) * (sd/1) # Convert to mean and sd of original variable

cor(y,var2)
     [,1]
[1,]  0.83
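
The construction above can be wrapped in a small function so that it can be reused. This is only a minimal sketch: the name `make_correlated` and its arguments are hypothetical and not taken from the linked answer, and the seed is arbitrary. It relies on the fact that the residuals of `lm(noise ~ x)` are exactly uncorrelated with `x` in the sample, so the mixture has a sample correlation of `r` with `x`.

```r
# Hypothetical helper (name and arguments are illustrative assumptions).
# Returns a vector whose sample correlation with x is r, rescaled to the
# requested mean and sd.
make_correlated <- function(x, r, noise, target_mean, target_sd) {
  # Part of `noise` that is uncorrelated with x in this sample
  e <- residuals(lm(noise ~ x))
  # Mix standardized x and standardized residuals; the result has mean 0, sd 1
  z <- as.numeric(scale(x)) * r + as.numeric(scale(e)) * sqrt(1 - r^2)
  # Convert to the requested mean and sd
  target_mean + z * target_sd
}

set.seed(1)  # arbitrary seed, only for reproducibility
var2 <- rnorm(n = 1000, mean = 10, sd = 3)
var1 <- rnorm(n = 1000, mean = 10, sd = 3)
y <- make_correlated(var2, r = 0.83, noise = var1,
                     target_mean = 10, target_sd = 3)

cor(y, var2)        # sample correlation with var2
c(mean(y), sd(y))   # sample mean and sd of the simulated variable
```

Returning a plain numeric vector (via `as.numeric`) rather than the one-column matrix produced by `scale()` keeps later calls such as `cor()` and `lm()` free of matrix-shaped output.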

I then want to simulate a third variable (var3) whose correlation with the second variable (y) is known.

r <- .91    
var3 <- scale(y) * r  +  scale(residuals(lm(var1 ~ y))) * sqrt(1 - r * r)
var3 <- mean + (var3 - 0) * (sd/1) # Convert to mean and sd of original variable

cor(var3,y)
     [,1]
[1,]  0.91

A practical example is a score on test 1 (var2), a predicted score on test 2 (y), and a subsequent predicted score on test 3 (var3). I know the correlation between var2 and y and the correlation between y and var3, and I want to use this simulation to find the resulting correlation between var2 (test 1) and var3 (test 3):

cor(var3,var2)
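
One point that may be worth checking here: the two known correlations do not by themselves fix the correlation between var2 and var3. For the three variables to have a valid correlation matrix, that correlation only has to lie within r1*r2 ± sqrt((1 - r1^2) * (1 - r2^2)), where r1 and r2 are the two known values. The sketch below is an illustration under assumptions, not the method from the linked answer: the helper `mix`, the names `r_var2_y` and `r_y_var3`, and the seed are all hypothetical. It compares reusing var1 as the noise source in both steps, as in the code above, with drawing fresh independent noise for the second step.

```r
set.seed(42)       # arbitrary seed, only for reproducibility
n <- 1000
r_var2_y <- 0.83   # known correlation between var2 and y
r_y_var3 <- 0.91   # known correlation between y and var3

# Same construction as in the question, wrapped in a hypothetical helper.
# Returns a standardized vector with sample correlation r to x.
mix <- function(x, r, noise) {
  e <- residuals(lm(noise ~ x))
  as.numeric(scale(x)) * r + as.numeric(scale(e)) * sqrt(1 - r^2)
}

var2 <- rnorm(n, mean = 10, sd = 3)
var1 <- rnorm(n, mean = 10, sd = 3)
y    <- mix(var2, r_var2_y, var1)

# Range of cor(var2, var3) permitted by a valid 3 x 3 correlation matrix
half_width <- sqrt((1 - r_var2_y^2) * (1 - r_y_var3^2))
bounds <- c(lower = r_var2_y * r_y_var3 - half_width,
            upper = r_var2_y * r_y_var3 + half_width)

var3_reused <- mix(y, r_y_var3, var1)      # noise reused, as in the question
var3_fresh  <- mix(y, r_y_var3, rnorm(n))  # new, independent noise

bounds
cor(var3_reused, var2)
cor(var3_fresh, var2)
```

With these values the permitted range runs from about 0.52 to about 0.99. When var1 is reused, var2, y and var3 are all built from the same two sources (var2 and var1), which pushes cor(var3, var2) to the lower end of that range. With fresh noise the result sits close to the product 0.83 * 0.91, which corresponds to assuming that var3 relates to var2 only through y. Which of these is appropriate depends on what is assumed about the tests, so the simulated value should be read as a consequence of that assumption rather than as something implied by the two correlations alone.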

I am unsure whether I have misinterpreted or misapplied the method discussed in "Tool for generating correlated data sets". Or is there a more convenient way to simulate the scores on test 3 that I am overlooking?
