
Can you run intraclass correlations with different raters, and different numbers of raters per participant?

Cross Validated · Asked by Bruce Rawlings on November 16, 2021

I'm trying to run an intraclass correlation (inter-rater agreement) on personality data I have collected. However, I work with animals, so the data were collected over a period of about a year. In some cases I have different numbers of raters per subject, and not all raters have rated all subjects: some raters rated some subjects, other raters rated others, and a few rated all of them. In the end I have 7-9 ratings per subject, with some, but not complete, overlap between raters and subjects.

Is this possible, and if so which ICC should I use?

Any advice would be gratefully received.

2 Answers

My agreement package in R can calculate ICCs in the presence of missing data.

First, I need to set a random number seed (for reproducibility), install, and load the required packages.

set.seed(1)

# Install packages (if necessary)
#install.packages(c("tidyverse", "scales", "devtools"))
#devtools::install_github("jmgirard/agreement")

# Load packages 
library(scales)
library(tidyverse)
library(agreement)

Second, I need to simulate some data that match your description. To do so, I will assume there are 30 total subjects, 20 total raters, and a seven-point ordinal rating scale from 1 to 7. I will simulate the hypothetically complete data where all raters score all subjects. Then I will randomly select a different (and size-varying) subset of raters to retain for each subject.

# Set simulation parameters
n_subjects <- 30
n_raters <- 20

# Simulate hypothetically complete data
complete_data <- 
  tibble(
    subject = rep(1:n_subjects, each = n_raters),
    s_mean = rep(rnorm(n_subjects, mean = 0, sd = 3), each = n_raters),
    rater = rep(1:n_raters, times = n_subjects),
    r_mean = rep(rnorm(n_raters, mean = 0, sd = 2), times = n_subjects)
  ) %>%
  mutate(
    latent = s_mean + r_mean + rnorm(n_subjects * n_raters, mean = 0, sd = 1),
    score = round(rescale(latent, to = c(1, 7)))
  )

# Simulate observed data
observed_data <- 
  complete_data %>% 
  group_by(subject) %>% 
  mutate(
    observed = sample(
      c(TRUE, FALSE), 
      size = n_raters, 
      replace = TRUE, 
      prob = c(6 / n_raters, 1 - (6 / n_raters))
    )
  ) %>% 
  ungroup() %>% 
  filter(observed == TRUE) %>% 
  select(subject, rater, score) %>% 
  print()

[R output showing the simulated data]

Third, I can verify the simulation worked as expected by calculating the number of raters that scored each subject and the number of subjects scored by each rater in our "observed" dataset:

# Count the number of raters per subject
table(observed_data) %>% margin.table(margin = 1)

# Count the number of subjects per rater
table(observed_data) %>% margin.table(margin = 2)

[R output showing the counts]

As desired, each subject was scored by around 6 raters (between 2 and 9) and each rater scored around 9 subjects (between 4 and 17).

Finally, I can provide these data to the dim_icc() function, which will calculate different intraclass correlations for dimensional data.

Since, in this example, each rater scored multiple subjects, I can use a two-way ICC model such as Model 2A to capture variation across subjects and raters. We can calculate the reliability of any single rater's scores and the reliability of the average of six raters' scores. We can even get confidence intervals using bootstrap resampling.

# Single-measures Agreement ICC under Model 2A
icc_2A_A_1 <- 
  dim_icc(
    observed_data, 
    model = "2A", 
    type = "agreement", 
    unit = "single",
    object = subject, 
    rater = rater, 
    score = score, 
    bootstrap = 2000
  )
summary(icc_2A_A_1)

[R output showing the single-measures ICC summary]

# Average-measures Agreement ICC under Model 2A
icc_2A_A_6 <- 
  dim_icc(
    observed_data, 
    model = "2A", 
    type = "agreement", 
    unit = "custom",
    customk = 6,
    object = subject, 
    rater = rater, 
    score = score, 
    bootstrap = 2000
  )
summary(icc_2A_A_6)

[R output showing the average-measures ICC summary]

Note that the variance component estimates are identical in both cases (the confidence intervals vary only slightly due to the stochastic nature of bootstrapping), but the inter-rater ICC estimate is much higher for the average of the six raters' scores. This boost is predicted by the Spearman-Brown prophecy formula.
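To make the Spearman-Brown relationship concrete, here is a small sketch; the spearman_brown() helper and the 0.40 single-rater value are hypothetical illustrations, not part of the agreement package:

# Spearman-Brown prophecy formula: reliability of the average of k raters,
# given the reliability (ICC) of a single rater
spearman_brown <- function(icc1, k) {
  (k * icc1) / (1 + (k - 1) * icc1)
}

# Hypothetical example: a single-rater ICC of 0.40 rises to 0.80 for six raters
spearman_brown(icc1 = 0.40, k = 6)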

If each subject truly was scored by a different group of raters with minimal overlap, we could use a one-way ICC model such as Model 1A.

# Single-measures Agreement ICC under Model 1A
icc_1A_A_1 <- 
  dim_icc(
    observed_data, 
    model = "1A", 
    type = "agreement", 
    unit = "single",
    object = subject, 
    rater = rater, 
    score = score, 
    bootstrap = 2000
  )
summary(icc_1A_A_1)

[R output showing the Model 1A ICC summary]

Note that here a variance component for raters wasn't even estimated: in a one-way model, rater effects cannot be separated from residual error, so rater variance is absorbed into the residual.

For more on how to select an appropriate ICC formulation, I recommend:

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46. https://doi.org/10/br5ffs

Answered by Jeffrey Girard on November 16, 2021

I had the same problem. I think I solved it using the formulas described by Bliese (2000). You basically run a one-way ANOVA with the group (here, the subject) as the independent variable, and take the mean square between (MSB) and mean square within (MSW) from the ANOVA output. Then ICC(1) = (MSB - MSW) / (MSB + (k - 1) * MSW); because you have a varying number of raters, use the average number of raters per subject as k.

https://www.kellogg.northwestern.edu/rc/workshops/mlm/Bliese_2000.pdf
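
A minimal sketch of this approach in R, reusing the simulated observed_data and n_subjects objects from the answer above (an assumption purely for illustration):

fit <- aov(score ~ factor(subject), data = observed_data)
ms  <- summary(fit)[[1]][["Mean Sq"]]
msb <- ms[1]                                # mean square between subjects
msw <- ms[2]                                # mean square within subjects (residual)
k   <- nrow(observed_data) / n_subjects     # average number of raters per subject

# Bliese (2000) one-way ICC(1)
(msb - msw) / (msb + (k - 1) * msw)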

Answered by Gustavo Moreira Tavares on November 16, 2021
