1 Aim

Cluster-randomised experiments are experiments in which entire groups of participants are randomly assigned to conditions, as opposed to experiments in which the participants are assigned randomly but individually to conditions. Cluster randomisation needs to be taken into account in the analysis for reasons explained below. Two methods for analysing cluster-randomised experiments are then discussed.

The following is assumed:

  • There are about 10 to 20 clusters (e.g., classes) in the experiment.1
  • The clusters all have roughly the same size (e.g., sizes typical of school classes). Cluster sizes vary between, say, 10 and 30 observations, but not between 5 and 2,000.2
  • Besides a measure of the participants’ skill post-intervention, some measure of the participants’ skill pre-intervention may be available. This is referred to as a pretest in what follows, but it may have been measured on a different scale than the posttest and it may also be a pretty rough proxy of the pre-intervention skill level.

3 Preliminaries

You’ll need the following packages. See the bottom of this page for the software versions used for this document. See https://github.com/janhove/cannonball for the cannonball package.

library(tidyverse)  # easier-to-read workflow
library(lme4)       # for fitting multilevel models
library(lmerTest)   # for Satterthwaite d.f. approximation
library(cannonball) # for simulating cluster-randomised experiments

4 Cluster randomisation and its consequences

In a text-book experiment, \(n\) participants are allocated randomly and independently of each other to two (or more) conditions. Assuming two conditions to which an equal number of participants (\(\frac{n}{2}\)) are assigned, the number of possible allocations can be computed as follows:

\[\textrm{choose}(n,\frac{n}{2}) = \frac{n!}{\frac{n}{2}! \cdot \frac{n}{2}!}\]

\(n!\) is the factorial of \(n\), i.e., the product of all strictly positive integers up to and including \(n\): \(n \times(n-1) \times (n-2) \times \dots \times 2 \times 1\).
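
In R, the factorial is available directly as factorial():

factorial(6)
## [1] 720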

For instance, for \(n = 6\):

\[\textrm{choose}(6,3) = \frac{6!}{3! \cdot 3!} = \frac{6\cdot5\cdot4\cdot3\cdot2\cdot1}{(3\cdot2\cdot1) \cdot (3\cdot2\cdot1)} = \frac{720}{36} = 20\]

In R:

choose(6, 3)
## [1] 20

For \(n = 80\) participants randomly and independently assigned to two equal-sized groups:

choose(80, 40)
## [1] 107507208733336251924480

The group sizes needn’t be equal. For instance, there are 77,520 ways of splitting up 20 participants into groups of 13 and 7.

choose(20, 13)
## [1] 77520
choose(20, 7)  # same result
## [1] 77520

In cluster-randomised experiments, the participants are allocated randomly but not independently of each other to the different conditions. Specifically, by design (rather than by sheer accident), participants are grouped in clusters, and each cluster is assigned randomly, independently of the other clusters, and in its entirety to one of the conditions. This dramatically changes the combinatorics. For \(n = 80\) participants who form \(k = 8\) clusters with 10 participants each, the number of possible allocations is not \(\textrm{choose}(80, 40) \approx 10^{23}\) but \(\textrm{choose}(8, 4) = 70\) if the same number of clusters is assigned to each condition. This, in turn, affects how the data should be analysed.
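
Again in R:

choose(8, 4)
## [1] 70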

Some vocabulary:

Experimental units: The units on whose level independent random allocation took place.

Observational units: The units on whose level the outcome data were measured.

Example 1: 10 schools with 6,000 pupils in total. Five randomly selected schools are assigned in their entirety to the intervention condition, the rest to the control condition. At the end of the school year, all pupils take a proficiency test. Experimental units = 10 schools. Observational units = 6,000 pupils.

Example 2: 10 schools. In each school, eight classes are selected to participate in an experiment, four of which are randomly assigned to the intervention and four to the control condition. At the end of the school year, all 1,500 pupils in the selected classes take a proficiency test. Experimental units = \(10 \times 8 = 80\) classes. Observational units = 1,500 pupils. In this example, ‘school’ is a blocking factor: The randomisation was restricted in such a way that the number of control and intervention classes was the same in each school. This makes perfect sense, but the blocking factor should be included in the analysis, otherwise the analysis will tend to be too conservative (i.e., it’ll overestimate the actual uncertainty).

4.1 Randomisation as the basis for inference

In experiments, the random assignment of participants to conditions provides a solid basis for statistical inference. If the null hypothesis were true (i.e., if there were no effect of condition), the participants assigned to the intervention group would have had the same outcome value if they had instead been assigned to the control group, and vice versa. That is, any difference between the control and intervention group is due solely to the random assignment of participants to conditions.

One valid way of computing a \(p\)-value is hence by re-randomisation: Take the observed data, randomly shuffle the condition labels, and recompute the difference between the control and intervention group. Do this either for all possible assignments of participants to conditions (feasible only in smallish datasets) or a couple of thousand times (for larger datasets). Then compute the proportion of reshuffled datasets for which the difference between the conditions is at least as large as the difference that was observed in the actual dataset. This is your \(p\)-value.

The following illustrates this process for an experiment without cluster randomisation; i.e., participants were assigned randomly and independently. First generate some data:

# Set random seed for reproducibility
set.seed(2019-11-12, kind = "Mersenne-Twister")

# The function is called "clustered_data", but with clusters of 
# size 1, you effectively don't have any clusters: 
# all data points are independent of each other.
d_no_clusters <- clustered_data(
  n_per_class = rep(1, 50),  # 2*25 participants
  effect = 0.5               # actual intervention effect
)

ggplot(data = d_no_clusters,
       aes(x = condition,
           y = outcome)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(shape = 1,
             position = position_jitter(width = 0.2))

With 25 participants per condition, the number of possible recombinations is too large to evaluate them all, so instead evaluate only 2,000 possibilities. Also compute the actually observed difference in group means.

n_permutations <- 2000
mean_differences <- vector(length = n_permutations)

for (i in 1:n_permutations) {
  mean_differences[[i]] <- d_no_clusters %>% 
    # randomly reshuffle the condition labels
    mutate(permuted_group = sample(condition)) %>% 
    # compute the mean per (shuffled) condition
    group_by(permuted_group) %>% 
    summarise(mean_outcome = mean(outcome)) %>% 
    # now compute the difference between these means
    pivot_wider(names_from = permuted_group,
                values_from = mean_outcome) %>% 
    transmute(mean_difference = intervention - control) %>% 
    # output
    pull(mean_difference)
}

# Compute the actual observed difference as well
observed_difference <- d_no_clusters %>% 
  group_by(condition) %>% 
  summarise(mean_outcome = mean(outcome)) %>% 
    pivot_wider(names_from = condition,
                values_from = mean_outcome) %>% 
    transmute(mean_difference = intervention - control) %>% 
    pull(mean_difference)

Plot the reshuffled mean differences and compute the proportion of shuffled datasets in which the absolute mean difference was at least as large as the actually observed absolute mean difference.

hist(mean_differences, col = "#7570B3",
     xlab = "mean differences when permuting", main = "")
abline(v = observed_difference, lty = 2, col = "#D95F02")
abline(v = -observed_difference, lty = 2, col = "#D95F02")

mean(abs(mean_differences) >= abs(observed_difference))
## [1] 0.447

That is, \(p \approx 0.45\). The same test can be run more quickly with the following function from the coin package; small differences are to be expected since we couldn’t evaluate all possible reassignments.

coin::independence_test(outcome ~ condition, data = d_no_clusters,
                        distribution = coin::approximate())
## 
##  Approximative General Independence Test
## 
## data:  outcome by condition (control, intervention)
## Z = -0.774, p-value = 0.45
## alternative hypothesis: two.sided

The \(t\)-test is a mathematical shortcut to arrive at approximately the same answer:

t.test(outcome ~ condition, data = d_no_clusters, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  outcome by condition
## t = -0.77, df = 48, p-value = 0.44
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.80803  0.36031
## sample estimates:
##      mean in group control mean in group intervention 
##                    0.75681                    0.98068

The \(t\)-test can be written as a linear model:

m <- lm(outcome ~ condition, data = d_no_clusters)
summary(m)$coefficients
##                       Estimate Std. Error t value   Pr(>|t|)
## (Intercept)            0.75681    0.20544 3.68380 0.00058291
## conditionintervention  0.22386    0.29054 0.77049 0.44478346

Such randomisation tests and \(t\)-tests work when the data points are independent of each other, i.e., when the observational units that go into the analysis are also the experimental units. However, they lead to dramatically wrong inferential results if the observational units aren’t the experimental units. In the following, data from a cluster-randomised experiment are generated and analysed incorrectly.

# Data from 10 clusters with 6 observations each.
d_clusters <- clustered_data(
  n_per_class = rep(6, 10),
  effect = 0.5
)

ggplot(data = d_clusters,
       aes(x = reorder(class, outcome, median),
           y = outcome)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(shape = 1,
             position = position_jitter(width = 0.2)) +
  facet_grid(. ~ condition, scales = "free") +
  xlab("class")

If we were to ignore the clusters, we could compute the randomisation-based \(p\)-value as follows:

coin::independence_test(outcome ~ condition, data = d_clusters,
                        distribution = coin::approximate())
## 
##  Approximative General Independence Test
## 
## data:  outcome by condition (control, intervention)
## Z = -1.09, p-value = 0.28
## alternative hypothesis: two.sided

Or a \(t\)-test / linear model:

m <- lm(outcome ~ condition, data = d_clusters)
summary(m)$coefficients
##                       Estimate Std. Error t value  Pr(>|t|)
## (Intercept)            0.59692    0.20189  2.9567 0.0044911
## conditionintervention  0.31032    0.28551  1.0869 0.2815788

From the perspective of randomisation tests, the problem with analyses that don’t take clustering into account is that they hugely overestimate the number of possible assignments of participants to conditions. With 60 participants in total, there seem to have been \(\textrm{choose}(60, 30) \approx 1.2 \times 10^{17}\) possible assignments. However, all participants belonging to the same cluster had to take part in the same condition, so effectively, there were only \(\textrm{choose}(10, 5) = 252\) possible assignments! So the vast majority of rearrangements that were generated for the randomisation test could not have occurred, even if the null hypothesis were true. This invalidates the resulting \(p\)-value. An important principle in analysing experiments is therefore:

The analysis should reflect the experiment’s design.

If randomisation was restricted in some sense (clustering, but also ‘blocking’ on a covariate), this needs to be taken into account in the analysis. If the analysis doesn’t take into account the design, the results may be overly optimistic (e.g., when ignoring clustering) or overly pessimistic (e.g., when ignoring ‘blocking’).

From a different conceptual perspective, the problem with cluster randomisation is that, even without factoring in an intervention effect, participants in the same cluster tend to be more alike than participants in different clusters. As a result, different participants from the same cluster don’t contribute independent information about the performance in the control and intervention conditions. The extent to which participants from the same cluster are more alike than participants from different clusters is expressed in the intra-class correlation coefficient.

4.2 The intra-class correlation coefficient (ICC)

The ICC expresses what proportion of the total variance in the outcome is due to systematic differences between clusters (\(\sigma^2_\textrm{between}\): variance between cluster means; \(\sigma^2_\textrm{within}\): variance within clusters):

\[\textrm{ICC} = \frac{\sigma^2_{\textrm{between}}}{\sigma^2_\textrm{between} + \sigma^2_\textrm{within}}\]

  • If, for all clusters, each observation within a cluster has the same outcome, then there is no variation within clusters, so \(\sigma^2_\textrm{within} = 0\) and \(\textrm{ICC} = \frac{\sigma^2_\textrm{between}}{\sigma^2_\textrm{between}} = 1\). Conceptually, learning about a new participant from a given cluster has no added value whatsoever since you know their outcome value beforehand, just by looking at the other participants in the same cluster.

  • If the variability in the data isn’t systematically related to the clusters, then \(\sigma^2_\textrm{between} = 0\) and \(\textrm{ICC} = 0\). Conceptually, each new participant contributes new information to the study: You couldn’t have guessed the new participant’s outcome data any better by looking at the other participants in the same cluster than by looking at participants from other clusters.

  • Sample estimates of an experiment’s ICC are highly imprecise, so you can’t run a study, estimate its ICC and decide on that basis whether your analysis should take clustering into account.

  • Even low ICCs invalidate \(p\)-values if clustering isn’t taken into account.

  • Illustration of the effects of an ICC of 0.03 on the Type-I error rate. If the null hypothesis is actually correct, then \(p\)-values should be uniformly distributed and only 5% of \(p\)-values should be lower than 0.05. For even very low ICCs, the Type-I error rate is appreciably higher than that when the analysis doesn’t take the clusters into account (about 12% for ICC = 0.03).

# Simulate 10,000 cluster randomised experiments with ICC = 0.03 and NO effect;
# ignore clustering in the analysis.
ps <- replicate(10000, {
  d <- clustered_data(ICC = 0.03, effect = 0, n_per_class = rep(20, 16))
  t.test(outcome ~ condition, data = d, var.equal = TRUE)$p.value
})

hist(ps, breaks = seq(0, 1, 0.05), col = "#8DA0CB",
     xlab = "p-value", main = "Distribution of p-values under H0\n(ICC = 0.03)")

mean(ps < 0.05)
## [1] 0.1137
  • The figure below shows the actual Type-I error rate of cluster-randomised experiments analysed using traditional \(t\)-tests, as a function of the intra-class correlation (ICC) and the number of participants per cluster (m). The number of clusters was fixed at 10, but the Type-I error rate differs only slightly for other numbers of clusters. Reading guide: If you carry out a cluster-randomised experiment with 10 classes of 20 (m) students each and run an ordinary \(t\)-test on the resultant 200 data points, you have about a 25% chance of observing a significant difference even if no real difference exists (for an intra-class correlation of 0.10), rather than the nominal 5% chance. (See Hedges (2007) and Vanhove (2015); a reduced-scale version of this simulation is sketched after this list.)

  • Common ICC values in educational contexts are in the 0.15 to 0.20 bracket (Hedges & Hedberg 2007; Schochet 2008).
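
To get a feel for these numbers yourself, here is a reduced-scale sketch of the kind of simulation behind the figure (10 clusters of 20 participants each; the ICC values and the number of simulation runs are my choices, and the exact proportions will vary a little from run to run):

# Type-I error rate of an ordinary t-test that ignores clustering,
# for a few ICC values and no intervention effect
sapply(c(0.03, 0.10, 0.20), function(icc) {
  ps <- replicate(2000, {
    d <- clustered_data(ICC = icc, effect = 0, n_per_class = rep(20, 10))
    t.test(outcome ~ condition, data = d, var.equal = TRUE)$p.value
  })
  mean(ps < 0.05)
})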

5 Cluster-level analyses

One valid and simple way to analyse a cluster-randomised experiment is to compute a summary of the outcome for each cluster, typically the cluster mean outcome, and run an appropriate analysis on these experimental unit-level summaries rather than on the observational unit-level raw data. This kind of analysis maintains the nominal Type-I error rate, meaning that only 5% of the p-values it produces will be lower than 0.05 if the null hypothesis is true.

5.1 Without covariate adjustment

  1. Compute a summary statistic (typically the mean) for each cluster.
  2. Run a \(t\)-test or a permutation test on the cluster means.

Illustrated here for simulated data that I think are similar to your data:

d_nocovar <- clustered_data(
  n_per_class = sample(10:28, size = 14, replace = TRUE),
  effect = 0.4, ICC = 0.18)
  
ggplot(data = d_nocovar,
       aes(x = reorder(class, outcome, median),
           y = outcome)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(shape = 1,
             position = position_jitter(width = 0.2)) +
  facet_wrap(~ condition, scales = "free_x") +
  xlab("class")

d_per_class <- d_nocovar %>% 
  group_by(class, condition) %>% 
  summarise(mean_class = mean(outcome),
            n = n())
ggplot(data = d_per_class,
       aes(x = mean_class,
           y = reorder(class, mean_class))) +
  geom_point() +
  xlab("mean outcome per class") +
  ylab("class") +
  facet_wrap(~ condition, ncol = 1, scales = "free_y")

m <- lm(mean_class ~ condition, data = d_per_class)
summary(m)$coefficients
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            0.26243    0.21051  1.2466 0.236318
## conditionintervention  0.92814    0.29770  3.1177 0.008893
confint(m)
##                          2.5 %  97.5 %
## (Intercept)           -0.19623 0.72108
## conditionintervention  0.27951 1.57678

Instead of a linear model/t-test, you could also run a randomisation test:

coin::independence_test(mean_class ~ condition, data = d_per_class,
                        distribution = coin::exact())
## 
##  Exact General Independence Test
## 
## data:  mean_class by condition (control, intervention)
## Z = -2.41, p-value = 0.0082
## alternative hypothesis: two.sided

Or a Mann-Whitney (or two-sample Wilcoxon) test. This is equivalent to a randomisation test on the ranked means (ytrafo = rank):

wilcox.test(mean_class ~ condition, data = d_per_class)
## 
##  Wilcoxon rank sum test
## 
## data:  mean_class by condition
## W = 6, p-value = 0.017
## alternative hypothesis: true location shift is not equal to 0
coin::independence_test(mean_class ~ condition, data = d_per_class,
                        distribution = coin::exact(), ytrafo = rank)
## 
##  Exact General Independence Test
## 
## data:  mean_class by condition (control, intervention)
## Z = -2.36, p-value = 0.017
## alternative hypothesis: two.sided

Of course, you wouldn’t try out several different analyses and then just report the one that returned the lowest \(p\)-value! Personally, I would default to the \(t\)-test/linear model as its assumptions seem reasonable enough and as it also produces a confidence interval.

5.2 With covariate adjustment

Ideally, further information about the clusters or the participants is available that helps to account for variability in the outcome. The better you can account for variability in the data that is not due to the intervention, the more precise your estimate of the intervention effect will be.

5.2.1 Cluster-level covariates

Cluster-level covariates are covariates for which all participants in a given cluster have the same value. In educational contexts, these may be variables such as class size, the number of years of experience the teacher has, etc. If such cluster-level variables are believed a priori to account for variability in the mean outcome between clusters, they can be added to the linear model in which the cluster mean outcome is the dependent variable.
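
As a purely illustrative sketch, you could add the cluster size (the n column computed in the cluster-level analysis above) to the model for the cluster means. The model name m_clustercovar is mine, and in these simulated data class size isn’t actually related to the outcome, so don’t expect this particular covariate to help:

# Illustration only: cluster size as a cluster-level covariate
m_clustercovar <- lm(mean_class ~ n + condition, data = d_per_class)
summary(m_clustercovar)$coefficients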

As always when considering covariates: it’s better to have one or two covariates that you’re pretty sure are strongly related to the outcome than a whole plethora of covariates that might each be only slightly related to it. Each covariate requires at least one degree of freedom to be modelled, and with few clusters, sacrificing a degree of freedom on a weak covariate can do more harm than good.

Also, covariates needn’t be interesting: You collect and include them in your statistical model because you’re already pretty sure that they will be related to the outcome; they are not part of your research question.

5.2.2 Individual-level covariates

The values of individual-level covariates can vary between individuals in the same cluster. Again, it’s better to have one or two strong covariates than a plethora of weak ones. In general, the single best covariate you can have is a reliable measure of the participants’ pre-experiment performance.

I have encountered two recommended ways of including individual-level covariates in a cluster-mean analysis:

  1. Residualise the outcome against the covariate, average the residuals per cluster, then analyse these averages (Hayes & Moulton 2009).

  2. Compute a summary score for the covariate per cluster, then include this covariate summary per cluster to the cluster-mean model (Bloom, Richburg-Hayes & Black (2007)’s discussion concerns multilevel models but is favourable to using cluster-averaged covariates).

Let’s start by generating some data:

# expected correlation between pre and posttest performance
# without intervention effect = sqrt(reliability_post*reliability_pre) = 0.8
d_withcovar <- clustered_data(
  n_per_class = sample(10:28, size = 14, replace = TRUE),
  reliability_post = 0.8, reliability_pre = 0.8,
  effect = 0.4, ICC = 0.18)
ggplot(d_withcovar,
       aes(x = pretest,
           y = outcome,
           linetype = condition)) +
  geom_point(aes(colour = class)) +
  geom_smooth(method = "lm", se = FALSE)

ggplot(d_withcovar,
       aes(x = reorder(class, outcome, median),
           y = outcome)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(shape = 1, position = position_jitter(width = 0.2)) +
  facet_wrap(~ condition, scales = "free_x") +
  xlab("class")

5.2.2.1 Method A: Residualise the outcome

  1. Fit a model that ignores both the clustering and the intervention effect:
m_residualise <- lm(outcome ~ pretest, data = d_withcovar)
  2. Extract its residuals:
d_withcovar$resid_outcome <- resid(m_residualise)

ggplot(data = d_withcovar,
       aes(x = reorder(class, resid_outcome, median),
           y = resid_outcome)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(shape = 1, position = position_jitter(width = 0.2)) +
  facet_wrap(~ condition, scales = "free_x") +
  ylab("outcome residualised against pretest") +
  xlab("class")

  3. Summarise the residuals per cluster.
d_per_class <- d_withcovar %>% 
  group_by(class, condition) %>% 
  summarise(
    mean_outcome = mean(outcome),
    mean_resid = mean(resid_outcome),
    n = n())

ggplot(data = d_per_class,
       aes(x = mean_resid,
           y = reorder(class, mean_resid))) +
  geom_point() +
  xlab("mean residualised outcome per class") +
  ylab("class") +
  facet_wrap(~ condition, scales = "free_y",
             ncol = 1)

  4. Run the analysis on the mean residuals:
m <- lm(mean_resid ~ condition, data = d_per_class)
summary(m)$coefficients
##                       Estimate Std. Error t value   Pr(>|t|)
## (Intercept)           -0.18477   0.046913 -3.9387 0.00196753
## conditionintervention  0.37184   0.066344  5.6047 0.00011527
confint(m)
##                          2.5 %    97.5 %
## (Intercept)           -0.28699 -0.082561
## conditionintervention  0.22729  0.516393

Note the decreased standard error for the intervention effect compared to an analysis that doesn’t take the covariate into account. Of course, you wouldn’t run both analyses and then report the one that works best.

m2 <- lm(mean_outcome ~ condition, data = d_per_class)
summary(m2)$coefficients
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            0.42111    0.21390  1.9687 0.072521
## conditionintervention  0.42681    0.30251  1.4109 0.183667

5.2.2.2 Method B: Use the cluster-averaged covariate

  1. Compute the mean covariate value per cluster.
d_per_class <- d_withcovar %>% 
  group_by(class, condition) %>% 
  summarise(
    mean_outcome = mean(outcome),
    mean_pretest = mean(pretest),
    n = n())
ggplot(data = d_per_class,
       aes(x = mean_pretest,
           y = mean_outcome,
           colour = condition,
           linetype = condition)) +
  geom_point() +
  xlab("mean pretest score per class") +
  ylab("mean posttest score per class") +
  geom_smooth(method = "lm", se = FALSE)

  2. Analyse the mean outcome using the mean covariate as a control variable:
m <- lm(mean_outcome ~ mean_pretest + condition, data = d_per_class)
summary(m)$coefficients
##                       Estimate Std. Error t value         Pr(>|t|)
## (Intercept)            0.36614   0.038200  9.5847 0.00000112769585
## mean_pretest           1.03734   0.054120 19.1674 0.00000000084294
## conditionintervention  0.36289   0.053974  6.7235 0.00003271538047
confint(m)
##                         2.5 %  97.5 %
## (Intercept)           0.28206 0.45022
## mean_pretest          0.91822 1.15646
## conditionintervention 0.24410 0.48169

The estimate and significance of the covariate are of no interest whatsoever; note again the decreased standard error for the intervention estimate.

In simulations, I have found this second way (averaging the covariate per cluster) to be slightly more powerful than the first way (residualising the outcome against the covariate) if the number of clusters isn’t very small (say, 4) and if the covariate is informative. For fewer clusters or less informative covariates, the first way may be ever so slightly more powerful. Both approaches have their advertised Type-I error rate, i.e., if the null hypothesis is true, they’ll only return \(p\)-values lower than 0.05 in 5% of cases.
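
For transparency, a rough sketch of how such a comparison can be set up (this is not the original simulation code; the number of runs is kept small for speed, and the settings mirror the simulated data above):

compare_methods <- function(n_runs = 200) {
  p_A <- p_B <- numeric(n_runs)
  for (i in 1:n_runs) {
    d <- clustered_data(n_per_class = rep(20, 14),
                        reliability_post = 0.8, reliability_pre = 0.8,
                        effect = 0.4, ICC = 0.18)
    # Method A: residualise the outcome, then average the residuals per class
    d$resid_outcome <- resid(lm(outcome ~ pretest, data = d))
    per_class <- d %>% 
      group_by(class, condition) %>% 
      summarise(mean_resid = mean(resid_outcome),
                mean_outcome = mean(outcome),
                mean_pretest = mean(pretest))
    p_A[i] <- summary(lm(mean_resid ~ condition,
                         data = per_class))$coefficients[2, 4]
    # Method B: use the cluster-averaged covariate
    p_B[i] <- summary(lm(mean_outcome ~ mean_pretest + condition,
                         data = per_class))$coefficients[3, 4]
  }
  c(power_A = mean(p_A < 0.05), power_B = mean(p_B < 0.05))
}
compare_methods()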

Incidentally, method B may be useful if some standardised measure of the pupils’ pre-intervention performance is available but if teachers or administrators are unwilling to share the results: you could try asking for the average score per class instead!

5.3 Interim summary: Cluster-level analyses

  • Analysing cluster means is a valid and simple way to take clustering into account. It may be all you need.

  • Cluster- and individual-level covariates can be accommodated as well.

  • Based on simulations, I recommend that Method B be used if a fairly strong covariate is available (e.g., a pretest).3

In the two studies that need to be analysed, the cluster sizes will all be approximately the same. Consequently, the precision of each cluster mean will be about the same, too. If the cluster sizes differ markedly, a multilevel approach may be somewhat more powerful. We’ll turn to this approach next.

6 Multilevel analyses

Also called mixed-effects analyses or hierarchical regression. (The latter may, however, mean different things.)

Whether learning multilevel modelling is worth the effort if you just need it for analysing cluster-randomised experiments depends on what you want to find out.

  • If you’re just interested in estimating the intervention effect, possibly adjusted for a covariate, cluster-level analyses are easier to run and report.

  • If you want to estimate how the intervention effect interacts with cluster-level properties (e.g., teacher experience, school type, etc.), again cluster-level analyses are easier to run and report.

  • If you want to estimate how the intervention effect interacts with individual-level properties (e.g., pupil SES or IQ), multilevel models may be more useful.

  • Interpreting the relationship between individual-level covariates and the outcome is more straightforward in multilevel models, but this is typically not of interest in a randomised experiment: The effect of the covariate is taken for granted and not investigated in its own right. Be clear about the questions you want to answer, and ideally choose an analysis that allows you to answer all of your questions in one model.

  • For more complex designs, multilevel models are definitely worth the time and effort. But don’t overcomplicate your analysis! See Murtaugh (2007).

Be forewarned about the following problems that you may encounter in multilevel models (but not in cluster-level analyses):

  • Warnings about convergence issues. These mean that the algorithm hasn’t found what it believes to be a good solution to the problem. Don’t ignore these; instead go fetch help. (If you really need a multilevel model!)

  • Warnings about singular fits. For most of the analyses demonstrated below, these mean that the between-cluster variance has been estimated to be 0. From what I understand, the inferential statistics aren’t invalidated if you encounter this warning. Alternatively, if you included random slopes in the model, it may mean that the by-cluster adjustments for the slope and for the intercept are estimated to be perfectly (positively or negatively) correlated. This indicates that the model has been overparametrised (i.e., it is too complicated for the type and amount of data it’s fitted to); see Bates et al. (2015).

  • There are different ways of computing p-values in multilevel models. In simulations of cluster-randomised experiments, I have found that F- and t-tests using Satterthwaite’s method perform best; see also Luke (2017).

6.1 Without covariate adjustment

Reusing d_nocovar:

mer_nocovar <- lmer(outcome ~ condition + (1|class), data = d_nocovar)

The outcome ~ condition part should be familiar. The (1|class) part specifies that a class effect should be modelled: The mean outcome may vary systematically between classes, even after taking into account the intervention effect (this is what clustering is). With (1|class), the model’s intercept is allowed to vary between classes. Each class’s modelled intercept is a compromise between the overall observed mean and the class’s observed mean: for small classes, the overall mean weighs more heavily in the class’s modelled intercept; for large classes, the class’s observed mean carries more weight. This is the essence of random effects (here: random intercepts):4 When estimating an effect specific to one class, the model borrows information from other classes under the assumption that while different classes (or clusters more generally) may differ in some respects, they will also be alike in others. If little information is available for a specific cluster (i.e., the cluster is small), the borrowed information counts more towards the final estimate.

In contrast, when estimating fixed effects, no information is borrowed. The condition term is a fixed effect: When estimating the mean outcome in the control condition, the outcomes in the intervention condition aren’t considered.
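
To see this compromise between class-specific and overall information in action, you can compare each class’s observed mean outcome with its modelled mean, i.e., the fitted value that mer_nocovar assigns to every pupil in that class (a quick check, not something you’d need to report):

d_nocovar %>% 
  mutate(modelled_mean = fitted(mer_nocovar)) %>% 
  group_by(class, condition) %>% 
  summarise(observed_mean = mean(outcome),
            modelled_mean = mean(modelled_mean),
            n = n())

The modelled means are pulled (‘shrunk’) towards the condition means, most noticeably for the smallest classes.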

Some terms you may encounter:

  • Complete pooling: Cluster-specific information is not taken into account at all; i.e., clustering is ignored.

  • No pooling: Only cluster-specific information is taken into account for the cluster-specific estimate.

  • Partial pooling: The cluster-specific estimate is a compromise between cluster-specific information and information borrowed from other clusters. The larger the cluster, the greater the weight of the cluster-specific information.

On to the output:

summary(mer_nocovar)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ condition + (1 | class)
##    Data: d_nocovar
## 
## REML criterion at convergence: 774.3
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.170 -0.665  0.088  0.608  2.767 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.252    0.502   
##  Residual             1.025    1.013   
## Number of obs: 262, groups:  class, 14
## 
## Fixed effects:
##                       Estimate Std. Error     df t value Pr(>|t|)
## (Intercept)              0.267      0.213 12.477    1.25   0.2333
## conditionintervention    0.917      0.298 11.995    3.07   0.0096
## 
## Correlation of Fixed Effects:
##             (Intr)
## cndtnntrvnt -0.715
  • The variances under Random effects are estimates for the between- and within-class variances. Here, the between-class variance is estimated to be 0.252 and the within-class variance is estimated to be 1.025. An (imprecise) estimate of the ICC is thus:
0.252 / (0.252 + 1.025)
## [1] 0.19734

This estimate is imprecise because it doesn’t take into account the considerable uncertainty in both the numerator and the denominator.

  • The standard deviations under Random effects are just the square roots of the variances. If you need to compute the ICC (and I don’t see why you would if you’re analysing an experiment), be sure to use the variances, not the standard deviations.

Turning to the fixed effects:

  • (Intercept): The estimated mean outcome and its standard error for the reference group. Here control is the reference group for condition, because it occurs first in the alphabet. This isn’t usually of interest.

  • conditionintervention: The difference between the estimated mean of the reference group (control) and the estimated mean of the other group (here: intervention). The estimate and its standard error are not identical to the ones obtained in the cluster-level analysis.
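
If you just want the fixed-effect estimates without the rest of the summary, they can be extracted directly:

fixef(mer_nocovar)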

Approximate confidence intervals can be obtained as follows:

confint(mer_nocovar, method = "Wald")
##                          2.5 %  97.5 %
## .sig01                      NA      NA
## .sigma                      NA      NA
## (Intercept)           -0.15082 0.68507
## conditionintervention  0.33249 1.50198

6.2 With covariate adjustment

Covariates can simply be added to the model as fixed effects. There is no need for prior aggregation. I will, however, centre the covariate at its sample mean. This makes the intercept more readily interpretable, but has otherwise little added value when analysing cluster-randomised experiments.

Reusing d_withcovar:

# Add centred pretest scores:
d_withcovar$c.pretest <- d_withcovar$pretest - mean(d_withcovar$pretest)

# Fit model
mer_withcovar <- lmer(outcome ~ c.pretest + condition + (1|class), data = d_withcovar)
## boundary (singular) fit: see ?isSingular
summary(mer_withcovar)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ c.pretest + condition + (1 | class)
##    Data: d_withcovar
## 
## REML criterion at convergence: 676.6
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.888 -0.604 -0.013  0.624  3.276 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.000    0.000   
##  Residual             0.549    0.741   
## Number of obs: 298, groups:  class, 14
## 
## Fixed effects:
##                       Estimate Std. Error       df t value
## (Intercept)             0.4947     0.0603 295.0000    8.20
## c.pretest               0.8871     0.0338 295.0000   26.25
## conditionintervention   0.3674     0.0859 295.0000    4.28
##                                   Pr(>|t|)
## (Intercept)             0.0000000000000074
## c.pretest             < 0.0000000000000002
## conditionintervention   0.0000256881581972
## 
## Correlation of Fixed Effects:
##             (Intr) c.prts
## c.pretest    0.025       
## cndtnntrvnt -0.703 -0.035
## convergence code: 0
## boundary (singular) fit: see ?isSingular

And done. Note that the intercept represents the mean outcome for participants in the control condition whose performance on the pretest equalled the sample mean (i.e., c.pretest = 0).

Furthermore, if you computed the ICC here, you’d obtain

0.000 / (0.000 + 0.549)
## [1] 0

This isn’t an estimate of the ICC for the outcome; it’s an estimate of the ICC of what remains of the outcome after factoring out its relationship with the pretest scores. This estimate, too, is fairly imprecise. (If you stumble across the term ‘conditional ICC’, this is what is meant; the ICC from the model without the covariate is the unconditional ICC.)

One more thing: The random effect specification for the model above is still just (1 | class). In principle, we could have allowed the relationship between c.pretest and outcome to vary from class to class, using (1 + c.pretest | class). Perhaps it’s theoretically possible that this relationship varies by class (e.g., when participants with low pretest scores are given much more attention in some classes than in others), but I suspect that any such effect will be small enough to be negligible. What is more, Barr et al. (2013) present the ‘working assumption’ that not including random slopes for control variables (such as c.pretest) doesn’t affect the Type-I error rate for the intervention effect (p. 275). That is, as long as you don’t want to interpret the relationship between c.pretest and outcome, you don’t have to include a random slope for c.pretest. As far as I know, firm best-practice guidelines on this topic are yet to be developed.
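
For completeness, this is what the model with a by-class random slope for c.pretest would look like (just a sketch; the model name mer_slope is mine, and with these simulated data the fit may well come out singular):

mer_slope <- lmer(outcome ~ c.pretest + condition + (1 + c.pretest | class),
                  data = d_withcovar)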

6.3 p-values in mixed-effects models

By default, lmer() doesn’t produce p-values. In a nutshell, the reason is that it’s not clear how the degrees of freedom for the significance tests should be computed. See https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html for more information. However, some methods have been developed to squeeze p-values out of lmer() fits:

  1. Treating the t-statistic in the model output as a z-statistic (“Wald test”).
  2. Applying a likelihood ratio test.

Both of these first two methods are demonstrably anti-conservative, meaning that they produce significant p-values too often when the null hypothesis is in fact true. This is particularly the case for small samples. So we won’t go into them.

  3. Recalibrate the results of the likelihood ratio test (option 2) using a parametric bootstrap.

Option 3 works reasonably well, but it is too computationally intensive to try out in simulations. For more information, see Faraway (2006), Section 10.2. Option 3 is implemented in the PBmodcomp() function in the pbkrtest package; a sketch follows at the end of this section.

  4. Estimate the test’s degrees of freedom using the Kenward-Roger or the Satterthwaite approximations. How these approximations work isn’t too important here; what’s important is that in both my own and Luke (2017)’s simulations, the Satterthwaite approximation yields p-values with the appropriate Type-I error rate. (Luke (2017) also found the same to be true for the Kenward-Roger approximation, but it was too computationally intensive to include it in my own simulations.) If you load the lmerTest package, Satterthwaite p-values are automatically included in the lmer() output:
library(lmerTest)
mer_withcovar <- lmer(outcome ~ c.pretest + condition + (1|class), data = d_withcovar)
## boundary (singular) fit: see ?isSingular
summary(mer_withcovar)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ c.pretest + condition + (1 | class)
##    Data: d_withcovar
## 
## REML criterion at convergence: 676.6
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.888 -0.604 -0.013  0.624  3.276 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.000    0.000   
##  Residual             0.549    0.741   
## Number of obs: 298, groups:  class, 14
## 
## Fixed effects:
##                       Estimate Std. Error       df t value
## (Intercept)             0.4947     0.0603 295.0000    8.20
## c.pretest               0.8871     0.0338 295.0000   26.25
## conditionintervention   0.3674     0.0859 295.0000    4.28
##                                   Pr(>|t|)
## (Intercept)             0.0000000000000074
## c.pretest             < 0.0000000000000002
## conditionintervention   0.0000256881581972
## 
## Correlation of Fixed Effects:
##             (Intr) c.prts
## c.pretest    0.025       
## cndtnntrvnt -0.703 -0.035
## convergence code: 0
## boundary (singular) fit: see ?isSingular

Also see Luke (2017) and https://featuredcontent.psychonomic.org/putting-ps-into-lmer-mixed-model-regression-and-statistical-significance/.
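
Returning to option 3 above, here is a sketch of how the parametric bootstrap could be set up for the model with the covariate (the model names full and reduced are mine; the call isn’t run here because it takes a while):

library(pbkrtest)
# Fit the models with ML (REML = FALSE) for the model comparison
full    <- lmer(outcome ~ c.pretest + condition + (1 | class),
                data = d_withcovar, REML = FALSE)
reduced <- lmer(outcome ~ c.pretest + (1 | class),
                data = d_withcovar, REML = FALSE)
PBmodcomp(full, reduced, nsim = 1000)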

Incidentally, if you don’t have a covariate, the \(p\)-values obtained in a multilevel model will be identical to the ones obtained in a cluster-mean analysis, provided all cluster sizes are the same:

d_nocovar_samesize <- clustered_data(n_per_class = rep(16, 7))
summary(lmer(outcome ~ condition + (1|class), data = d_nocovar_samesize))
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ condition + (1 | class)
##    Data: d_nocovar_samesize
## 
## REML criterion at convergence: 322.8
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -2.131 -0.689 -0.111  0.566  2.468 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.161    0.401   
##  Residual             0.965    0.983   
## Number of obs: 112, groups:  class, 7
## 
## Fixed effects:
##                       Estimate Std. Error      df t value Pr(>|t|)
## (Intercept)            -0.0685     0.2715  5.0000   -0.25     0.81
## conditionintervention   0.2233     0.3591  5.0000    0.62     0.56
## 
## Correlation of Fixed Effects:
##             (Intr)
## cndtnntrvnt -0.756
d_nocovar_samesize_sum <- d_nocovar_samesize %>% 
  group_by(class, condition) %>% 
  summarise(mean_outcome = mean(outcome))
summary(lm(mean_outcome ~ condition, data = d_nocovar_samesize_sum))
## 
## Call:
## lm(formula = mean_outcome ~ condition, data = d_nocovar_samesize_sum)
## 
## Residuals:
##      1      2      3      4      5      6      7 
##  0.433  0.119 -0.552  0.460 -0.424 -0.339  0.303 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            -0.0685     0.2715   -0.25     0.81
## conditionintervention   0.2233     0.3591    0.62     0.56
## 
## Residual standard error: 0.47 on 5 degrees of freedom
## Multiple R-squared:  0.0718, Adjusted R-squared:  -0.114 
## F-statistic: 0.386 on 1 and 5 DF,  p-value: 0.561
# Exact same result

6.4 Shrinkage

Shrinkage isn’t too relevant when analysing cluster-randomised experiments. If you’re interested in the phenomenon nonetheless, check out https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixed-models/.5

7 Recommendations

7.1 Taking into account covariates in cluster-randomised experiments

Based on findings by Wright (2015), the following seem to be reasonable recommendations for planning and analysing cluster-randomised experiments in which covariate information is available.

  1. Choose the covariates a priori. Do not include a covariate in the analysis just because its distribution happens to be imbalanced between conditions.6

  2. The most useful covariate is bound to be a measure of the participants’ pre-experiment performance.

  3. If a covariate affected the assignment of clusters to conditions, include it in the analysis. For instance, if you purposefully restricted the randomisation in such a way that there was an equal number of advanced classes and an equal number of beginner classes in each condition (= ‘blocking’ or ‘stratification’), include a variable ‘advanced vs. beginner’ in the analysis – even if you’re not actually interested in this variable.

Add to this the findings from my own simulations:

  1. For typical class-based cluster-randomised experiments in which a covariate is available, compute the mean outcome and the mean covariate value per cluster. Then model the mean outcomes in a linear model with the condition and the mean covariate per cluster as predictors (recapped in code below). This approach only uses introductory-level statistics that most readers may be assumed to be familiar with. (They may not be familiar with the combination of first averaging the data and then modelling the averages in a linear model/ANCOVA. But each step individually is easy enough.)
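
In code, that recommendation is just the Method B workflow from above, recapped here for d_withcovar (the model name m_recommended is mine):

d_per_class <- d_withcovar %>% 
  group_by(class, condition) %>% 
  summarise(mean_outcome = mean(outcome),
            mean_pretest = mean(pretest))
m_recommended <- lm(mean_outcome ~ mean_pretest + condition, data = d_per_class)
summary(m_recommended)$coefficients
confint(m_recommended)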

8 Interactions with intervention effects

I haven’t written a function for simulating datasets with both clusters and interactions yet, so I’ll just add binary cluster-level and individual-level variables to a dataset with clusters to demonstrate the possible approaches.

I have not run any simulations about which approach works best.

# Generate a new dataset
d <- clustered_data(
  n_per_class = sample(15:25, size = 14, replace = TRUE),
  ICC = 0.15,
  reliability_post = 1,
  reliability_pre = 0.68^2,
  effect = 0.35
)

# Generate cluster-level covariate
classes <- data.frame(
  class = factor(1:14),
  teacher_sex = rep(c(rep("man", 3), rep("woman", 4)), 2)
)

# Add to dataset
d <- d %>% 
  left_join(classes, by = "class")

# Add individual-level covariate
d$pupil_sex <- sample(c("girl", "boy"), size = nrow(d), replace = TRUE)

8.1 Interaction with cluster-level properties

Cluster-level properties are variables whose values are the same within each cluster but may vary between clusters. Examples in class-based research would be the sex of the teacher and the proportion of girls in the class.

It’s easy to come up with a host of cluster- as well as individual-level variables that might conceivably affect the size of the intervention effect. But be forewarned that you typically have considerably less statistical power to detect interactions than main effects. Moreover, don’t let the possibility of interactions distract you from gauging the overall intervention effect, if the latter is your main objective. Don’t test for interactions just because you can.

d_per_class <- d %>% 
  group_by(class, teacher_sex, condition) %>% 
  summarise(
    mean_outcome = mean(outcome),
    mean_pretest = mean(pretest)
  )
 class   teacher_sex   condition      mean_outcome   mean_pretest
 1       man           control             1.09092        0.66642
 2       man           control             0.95545        0.36301
 3       man           control            -0.15447       -0.59381
 4       woman         control             0.28428        0.35445
 5       woman         control             0.18650       -0.83637
 6       woman         control             0.31647       -0.40417
 7       woman         control             0.12425       -0.22681
 8       man           intervention        0.89412       -0.18559
 9       man           intervention       -0.05238       -0.48754
 10      man           intervention        0.94568        0.51240
 11      woman         intervention        0.91863        0.17880
 12      woman         intervention        1.20587        0.30287
 13      woman         intervention        0.78100        0.01320
 14      woman         intervention        0.55629       -0.24605
ggplot(data = d_per_class,
       aes(x = condition,
           y = mean_outcome)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(
    shape = 1,
    position = position_jitter(width = 0.2)
  ) +
  facet_wrap(~ teacher_sex)

Without taking the covariate into account:

m1 <- lm(mean_outcome ~ condition*teacher_sex, data = d_per_class)
summary(m1)
## 
## Call:
## lm(formula = mean_outcome ~ condition * teacher_sex, data = d_per_class)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7851 -0.0988  0.0548  0.3182  0.4603 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                              0.6306     0.2456    2.57
## conditionintervention                   -0.0348     0.3474   -0.10
## teacher_sexwoman                        -0.4028     0.3250   -1.24
## conditionintervention:teacher_sexwoman   0.6724     0.4596    1.46
##                                        Pr(>|t|)
## (Intercept)                               0.028
## conditionintervention                     0.922
## teacher_sexwoman                          0.243
## conditionintervention:teacher_sexwoman    0.174
## 
## Residual standard error: 0.425 on 10 degrees of freedom
## Multiple R-squared:  0.314,  Adjusted R-squared:  0.109 
## F-statistic: 1.53 on 3 and 10 DF,  p-value: 0.267

The interaction term isn’t significant here, but the main effects are difficult to interpret because R by default uses treatment coding. Use sum-coding instead, i.e., set the binary factors to -0.5 and 0.5:

d_per_class <- d_per_class %>% 
  mutate(
    n.teacher_sex = ifelse(teacher_sex == "man", -0.5, 0.5),
    n.condition = ifelse(condition == "control", -0.5, 0.5)
  )

Refit:

m1 <- lm(mean_outcome ~ n.condition*n.teacher_sex, data = d_per_class)
summary(m1)
## 
## Call:
## lm(formula = mean_outcome ~ n.condition * n.teacher_sex, data = d_per_class)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7851 -0.0988  0.0548  0.3182  0.4603 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 0.5799     0.1149    5.05   0.0005
## n.condition                 0.3014     0.2298    1.31   0.2190
## n.teacher_sex              -0.0666     0.2298   -0.29   0.7780
## n.condition:n.teacher_sex   0.6724     0.4596    1.46   0.1741
## 
## Residual standard error: 0.425 on 10 degrees of freedom
## Multiple R-squared:  0.314,  Adjusted R-squared:  0.109 
## F-statistic: 1.53 on 3 and 10 DF,  p-value: 0.267

The n.condition effect is now the estimated intervention effect averaged over both male and female teachers. This isn’t the same as the estimated intervention effect in an analysis that doesn’t include teacher sex:

m1_no_interaction <- lm(mean_outcome ~ n.condition, data = d_per_class)
summary(m1_no_interaction)
## 
## Call:
## lm(formula = mean_outcome ~ n.condition, data = d_per_class)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8023 -0.2089 -0.0265  0.1890  0.6904 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.575      0.115    5.01   0.0003
## n.condition    0.349      0.230    1.52   0.1539
## 
## Residual standard error: 0.429 on 12 degrees of freedom
## Multiple R-squared:  0.162,  Adjusted R-squared:  0.092 
## F-statistic: 2.32 on 1 and 12 DF,  p-value: 0.154

Of course, if we have a pretest or a decent covariate at our disposal, we’d be daft not to use it, hence:

m2 <- lm(mean_outcome ~ mean_pretest + n.condition*n.teacher_sex, data = d_per_class)
summary(m2)
## 
## Call:
## lm(formula = mean_outcome ~ mean_pretest + n.condition * n.teacher_sex, 
##     data = d_per_class)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.375 -0.129 -0.031  0.176  0.388 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 0.6011     0.0780    7.71  0.00003
## mean_pretest                0.6813     0.1903    3.58   0.0059
## n.condition                 0.2531     0.1562    1.62   0.1395
## n.teacher_sex               0.0382     0.1583    0.24   0.8145
## n.condition:n.teacher_sex   0.3050     0.3276    0.93   0.3762
## 
## Residual standard error: 0.288 on 9 degrees of freedom
## Multiple R-squared:  0.717,  Adjusted R-squared:  0.591 
## F-statistic:  5.7 on 4 and 9 DF,  p-value: 0.0144

The same can be accomplished in a multilevel model on the individual-level data. Again recode the binary predictors first:

d <- d %>% 
  mutate(
    n.teacher_sex = ifelse(teacher_sex == "man", -0.5, 0.5),
    n.condition = ifelse(condition == "control", -0.5, 0.5)
  )

The random effect specification is still just (1 | class) and not (1 + n.teacher_sex | class): Perhaps the effect of n.teacher_sex differs from class to class, but since n.teacher_sex doesn’t vary within classes, there’s no data to estimate such variation.

# without covariate
m3 <- lmer(outcome ~ n.condition*n.teacher_sex + (1|class), data = d)
summary(m3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ n.condition * n.teacher_sex + (1 | class)
##    Data: d
## 
## REML criterion at convergence: 748.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0897 -0.6721 -0.0489  0.6670  2.8064 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.135    0.368   
##  Residual             0.907    0.953   
## Number of obs: 266, groups:  class, 14
## 
## Fixed effects:
##                           Estimate Std. Error      df t value Pr(>|t|)
## (Intercept)                 0.5779     0.1157 10.1090    4.99  0.00052
## n.condition                 0.3034     0.2314 10.1090    1.31  0.21883
## n.teacher_sex              -0.0632     0.2314 10.1090   -0.27  0.79018
## n.condition:n.teacher_sex   0.6607     0.4628 10.1090    1.43  0.18358
## 
## Correlation of Fixed Effects:
##             (Intr) n.cndt n.tch_
## n.condition  0.005              
## n.teachr_sx -0.136 -0.013       
## n.cndtn:n._ -0.013 -0.136  0.005
# with covariate
m4 <- lmer(outcome ~ pretest + n.condition*n.teacher_sex + (1|class), data = d)
summary(m4)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ pretest + n.condition * n.teacher_sex + (1 | class)
##    Data: d
## 
## REML criterion at convergence: 632.8
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.210 -0.669  0.041  0.648  2.531 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.062    0.249   
##  Residual             0.581    0.762   
## Number of obs: 266, groups:  class, 14
## 
## Fixed effects:
##                            Estimate Std. Error        df t value
## (Intercept)                 0.59024    0.08227  10.06468    7.17
## pretest                     0.41769    0.03382 260.31789   12.35
## n.condition                 0.27891    0.16454  10.06349    1.70
## n.teacher_sex              -0.00704    0.16459  10.07129   -0.04
## n.condition:n.teacher_sex   0.44217    0.32951  10.11122    1.34
##                                       Pr(>|t|)
## (Intercept)                           0.000029
## pretest                   < 0.0000000000000002
## n.condition                               0.12
## n.teacher_sex                             0.97
## n.condition:n.teacher_sex                 0.21
## 
## Correlation of Fixed Effects:
##             (Intr) pretst n.cndt n.tch_
## pretest      0.013                     
## n.condition  0.006 -0.012              
## n.teachr_sx -0.134  0.027 -0.017       
## n.cndtn:n._ -0.017 -0.053 -0.134  0.005

8.2 Interaction with individual-level properties

Multilevel models seem to be most appropriate here. Again recode the binary predictor:

d <- d %>% 
  mutate(
    n.pupil_sex = ifelse(pupil_sex == "boy", -0.5, 0.5)
  )

And fit the models:

# without covariate
m5 <- lmer(outcome ~ n.condition*n.pupil_sex + (1|class), data = d)
summary(m5)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ n.condition * n.pupil_sex + (1 | class)
##    Data: d
## 
## REML criterion at convergence: 752.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.1102 -0.6839 -0.0289  0.6526  2.8175 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.135    0.367   
##  Residual             0.914    0.956   
## Number of obs: 266, groups:  class, 14
## 
## Fixed effects:
##                         Estimate Std. Error       df t value Pr(>|t|)
## (Intercept)               0.5713     0.1148  12.0804    4.98  0.00031
## n.condition               0.3562     0.2296  12.0804    1.55  0.14650
## n.pupil_sex               0.0378     0.1206 256.2519    0.31  0.75434
## n.condition:n.pupil_sex  -0.1065     0.2411 256.2519   -0.44  0.65917
## 
## Correlation of Fixed Effects:
##             (Intr) n.cndt n.ppl_
## n.condition -0.003              
## n.pupil_sex -0.051  0.048       
## n.cndtn:n._  0.048 -0.051  0.016

The influence of individual-level predictors may in principle vary from cluster to cluster, so you could add them as random slopes, too. In fact, based on Barr et al. (2013; also see Jaeger et al. 2011; Schielzeth & Forstmeier 2009), this is probably to be recommended.

m5b <- lmer(outcome ~ n.condition*n.pupil_sex + (1+n.pupil_sex|class), data = d)
## boundary (singular) fit: see ?isSingular
summary(m5b)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ n.condition * n.pupil_sex + (1 + n.pupil_sex | class)
##    Data: d
## 
## REML criterion at convergence: 752.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.1031 -0.6640 -0.0275  0.6475  2.7891 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  class    (Intercept) 0.1395   0.3735        
##           n.pupil_sex 0.0055   0.0741   -1.00
##  Residual             0.9124   0.9552        
## Number of obs: 266, groups:  class, 14
## 
## Fixed effects:
##                         Estimate Std. Error       df t value Pr(>|t|)
## (Intercept)               0.5740     0.1163  11.9052    4.94  0.00035
## n.condition               0.3538     0.2325  11.9052    1.52  0.15422
## n.pupil_sex               0.0352     0.1221 112.5388    0.29  0.77381
## n.condition:n.pupil_sex  -0.1086     0.2442 112.5388   -0.44  0.65736
## 
## Correlation of Fixed Effects:
##             (Intr) n.cndt n.ppl_
## n.condition -0.003              
## n.pupil_sex -0.191  0.045       
## n.cndtn:n._  0.045 -0.191  0.016
## convergence code: 0
## boundary (singular) fit: see ?isSingular
# with covariate
m6 <- lmer(outcome ~ pretest + n.condition*n.pupil_sex + (1|class), data = d)
summary(m6)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ pretest + n.condition * n.pupil_sex + (1 | class)
##    Data: d
## 
## REML criterion at convergence: 635.2
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.275 -0.674  0.025  0.684  2.619 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  class    (Intercept) 0.0581   0.241   
##  Residual             0.5825   0.763   
## Number of obs: 266, groups:  class, 14
## 
## Fixed effects:
##                         Estimate Std. Error       df t value
## (Intercept)               0.5840     0.0801  12.1015    7.29
## pretest                   0.4226     0.0338 260.1357   12.49
## n.condition               0.3233     0.1602  12.1023    2.02
## n.pupil_sex               0.1079     0.0962 256.9706    1.12
## n.condition:n.pupil_sex  -0.1022     0.1920 256.6576   -0.53
##                                     Pr(>|t|)
## (Intercept)                        0.0000091
## pretest                 < 0.0000000000000002
## n.condition                            0.066
## n.pupil_sex                            0.263
## n.condition:n.pupil_sex                0.595
## 
## Correlation of Fixed Effects:
##             (Intr) pretst n.cndt n.ppl_
## pretest      0.013                     
## n.condition -0.005 -0.016              
## n.pupil_sex -0.057  0.055  0.054       
## n.cndtn:n._  0.055  0.003 -0.058  0.013
m6b <- lmer(outcome ~ pretest + n.condition*n.pupil_sex + (1+n.pupil_sex|class), data = d)
summary(m6b)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## outcome ~ pretest + n.condition * n.pupil_sex + (1 + n.pupil_sex |  
##     class)
##    Data: d
## 
## REML criterion at convergence: 631.5
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.222 -0.696  0.060  0.688  2.465 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  class    (Intercept) 0.0708   0.266         
##           n.pupil_sex 0.0519   0.228    -1.00
##  Residual             0.5687   0.754         
## Number of obs: 266, groups:  class, 14
## 
## Fixed effects:
##                         Estimate Std. Error       df t value
## (Intercept)               0.5912     0.0853  11.9967    6.93
## pretest                   0.4245     0.0330 247.8815   12.87
## n.condition               0.3207     0.1707  11.9976    1.88
## n.pupil_sex               0.0946     0.1130  19.2696    0.84
## n.condition:n.pupil_sex  -0.1011     0.2258  19.1872   -0.45
##                                     Pr(>|t|)
## (Intercept)                         0.000016
## pretest                 < 0.0000000000000002
## n.condition                            0.085
## n.pupil_sex                            0.413
## n.condition:n.pupil_sex                0.659
## 
## Correlation of Fixed Effects:
##             (Intr) pretst n.cndt n.ppl_
## pretest      0.008                     
## n.condition -0.002 -0.013              
## n.pupil_sex -0.503  0.043  0.037       
## n.cndtn:n._  0.038  0.004 -0.504  0.011

For more on sum (or simple) coding, see https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/#SIMPLE.
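
If you have a categorical predictor with more than two levels, the simple-coded contrasts needn't be typed in by hand. A minimal sketch, using an invented three-level factor purely for illustration:

# Simple coding for a hypothetical three-level factor: each coefficient
# compares one level to the reference level, while the intercept
# corresponds to the unweighted mean across the levels (analogous to
# the -0.5/0.5 coding used for the binary predictors above).
f <- factor(c("A", "B", "C"))
contrasts(f) <- contr.treatment(3) - 1/3
contrasts(f)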

This type of analysis may also be useful if you want to take into account the possibility that the effectiveness of the intervention depends on the participants’ pre-intervention skills. Since we are now interested in interpreting the effect of c.pretest (especially its interaction with n.condition), it’s advisable to include a by-cluster random slope for it in the model:

# Centre the pretest score at the sample mean
d$c.pretest <- d$pretest - mean(d$pretest)

# Fit c.pretest*n.condition interaction
m7 <- lmer(outcome ~ c.pretest*n.condition + (1+c.pretest|class), data = d)
## boundary (singular) fit: see ?isSingular
summary(m7)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: outcome ~ c.pretest * n.condition + (1 + c.pretest | class)
##    Data: d
## 
## REML criterion at convergence: 635.5
## 
## Scaled residuals: 
##    Min     1Q Median     3Q    Max 
## -3.252 -0.677  0.068  0.664  2.546 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  class    (Intercept) 0.061080 0.2471        
##           c.pretest   0.000686 0.0262   -1.00
##  Residual             0.581244 0.7624        
## Number of obs: 266, groups:  class, 14
## 
## Fixed effects:
##                       Estimate Std. Error      df t value
## (Intercept)             0.5834     0.0811 11.9682    7.19
## c.pretest               0.4200     0.0346 81.2462   12.13
## n.condition             0.3071     0.1622 11.9682    1.89
## c.pretest:n.condition   0.0220     0.0692 81.2462    0.32
##                                   Pr(>|t|)
## (Intercept)                       0.000011
## c.pretest             < 0.0000000000000002
## n.condition                          0.083
## c.pretest:n.condition                0.752
## 
## Correlation of Fixed Effects:
##             (Intr) c.prts n.cndt
## c.pretest   -0.169              
## n.condition  0.003 -0.017       
## c.prtst:n.c -0.017  0.072 -0.169
## convergence code: 0
## boundary (singular) fit: see ?isSingular

The estimate for n.condition captures the intervention effect for participants with an average pretest score (provided the pretest was centred!); the interaction effect captures how this intervention effect varies with the pretest score.

# Compute effect of n.condition for different c.pretest values.
# See https://janhove.github.io/reporting/2017/04/23/visualising-models-1
#     https://janhove.github.io/reporting/2017/05/12/visualising-models-2
predictor_grid <- expand.grid(
  c.pretest = seq(min(d$c.pretest), max(d$c.pretest), length.out = 50),
  n.condition = unique(d$n.condition)
)
predictor_grid$condition <- ifelse(predictor_grid$n.condition == -0.5, 
                                   "control", "intervention")
predictor_grid$pretest <- predictor_grid$c.pretest + mean(d$pretest)

predictor_grid$predicted_outcome <- predict(m7, newdata = predictor_grid,
                                            re.form = NA)

predictor_grid <- predictor_grid %>% 
  select(condition, pretest, predicted_outcome) %>% 
  pivot_wider(
    id_cols = pretest,
    names_from = condition,
    values_from = predicted_outcome) %>% 
  mutate(intervention_effect = intervention - control)

predictor_grid %>% 
  sample_n(5) %>% 
  arrange(pretest)
  pretest   control  intervention  intervention_effect
  -3.8722  -1.14147      -0.91882              0.22265
  -2.9323  -0.75707      -0.51378              0.24330
  -2.5563  -0.60332      -0.35176              0.25156
  -2.3683  -0.52644      -0.27075              0.25569
   1.3913   1.01115       1.34944              0.33828
ggplot(data = predictor_grid,
       aes(x = pretest,
           y = intervention_effect)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = 2) +
  xlab("pretest score") + 
  ylab("intervention effect")

(You could draw some kind of uncertainty interval around this line, I think, but that’s easier to do if you fit the data in a Bayesian multilevel model, which we’re not going to do here.)

9 Analysing accuracy data at the item level

If you haven’t worked with logistic regression yet, see https://janhove.github.io/statintro.html, Chapter 17.

I haven’t found a suitable dataset to illustrate possible analyses with, which makes this discussion fairly abstract. To keep things concrete, a sketch of the assumed data layout is given after the list of assumptions below.

Assumptions:

  1. Cluster-randomised experiment with a fair number of clusters.
  2. Participants respond to a number of items at pretest and at posttest, and their accuracy on these items is recorded.
  3. The same items occur at pretest and at posttest, but it’s not necessarily the case that they’re seen by the same participants. E.g., item 1 may be seen by participant A at pretest and by participant B at posttest. (Cho & Goodwin (2017), see below, assume that the same items are seen by all participants at pretest and at posttest, but by my reckoning, their model would stay the same regardless. It will probably be even more difficult to estimate, though.)
  4. The items may belong to different categories, with a research question being whether the intervention is more effective for some categories than for others. For ease of exposition, only two categories are included here. Use simple-coding if you have more than two categories.
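
To make these assumptions concrete, here is a sketch of the long data format that the models below presuppose: one row per participant–item–time combination. All names and sizes are invented for illustration, and the accuracy column is left empty; only the column names matter.

# Hypothetical skeleton of the assumed data layout
# (this overwrites the earlier d with a placeholder).
participants <- tibble(
  participant = factor(1:140),
  class       = factor(rep(1:14, each = 10)),
  condition   = rep(c("control", "intervention"), each = 70)
)
items <- tibble(
  item     = factor(1:20),
  category = rep(c("cat1", "cat2"), each = 10)
)
d <- crossing(participants, items, time = c("pretest", "posttest")) %>% 
  mutate(accuracy = NA_integer_)  # the observed 0/1 responses would go here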

There is hardly any guidance on how to analyse this kind of data:

  • Twisk & Proper (2004) discuss pretest/posttest designs with binary outcomes but with just one measurement per time per participant (i.e., one item).
  • Localio, Berlin & Ten Have (2005) and Borhan (2012) discuss cluster-randomised designs with binary outcomes, but again with just one measurement per time per participant.
  • Cho & Goodwin (2017) seems promising, but I can’t claim to have understood it all. Based on what I’ve understood, though, they seem to be recommending these model specifications:
mod1 <- glmer(accuracy ~ n.time * n.condition +
                (1 + n.time | participant) +
                (1 + n.time | class) +
                (1 + n.time*n.condition | item),
              data = d,
              family = "binomial")

The predictors are treatment-coded, not sum-coded (a minimal recoding sketch follows this list):

  • n.time is 0 at pretest and 1 at posttest.
  • n.condition is 0 for control and 1 for intervention.
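
A minimal sketch of this recoding, assuming the hypothetical time and condition columns from the data skeleton above:

# Treatment coding by hand: 0 = reference level (pretest / control).
d <- d %>% 
  mutate(
    n.time      = ifelse(time == "pretest", 0, 1),
    n.condition = ifelse(condition == "control", 0, 1)
  )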

In mod1, you’ll obtain the following fixed-effect estimates:

  • (Intercept): Probability (in log-odds!) of accurate response at pretest by control group for average item and average control participant. (= “initial status” for the control group)
  • n.time: Change in probability (in log-odds) of accurate response from pretest to posttest by control group for average item and average control participant. (= “learning” for the control group)
  • n.condition: Difference in probability (in log-odds) of accurate response between intervention group and control group at pretest, again for average items/participants. (= difference in initial status)
  • n.time:n.condition: Difference in change in probability (in log-odds) from pretest to posttest between intervention group and control group, again for average items/participants. (= difference in learning)

The latter estimate will be of particular interest, as it expresses how much more learning the intervention induced relative to the control condition.
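
Because these estimates are on the log-odds scale, it can help to translate them back into probabilities for an ‘average’ participant, class and item (i.e., with all random effects set to zero). A small sketch, assuming mod1 has been fitted:

# Convert mod1's treatment-coded fixed effects to probabilities
# (random effects set to zero).
fe <- fixef(mod1)
plogis(fe["(Intercept)"])                      # control, pretest
plogis(fe["(Intercept)"] + fe["n.time"])       # control, posttest
plogis(fe["(Intercept)"] + fe["n.condition"])  # intervention, pretest
plogis(sum(fe))                                # intervention, posttest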

You’ll also obtain the following random effect estimates; plot them using lattice::dotplot(ranef(mod1)).

  • By-participant intercepts and n.time slopes. The intercepts express differences between participants (conditional on the other variables) in their “initial status” (i.e., skill at pretest). The slopes express differences between participants (conditional on the other variables) in the extent of their learning. The correlation between these expresses to what extent between-participant differences in pretest skill correlate with between-participant differences in learning.
  • By-class intercepts and n.time slopes. Same, but now per class.
  • By-item intercepts, n.time and n.condition slopes as well as interactions between n.time and n.condition. The intercepts express differences between items (conditional on the other variables) in their “initial status” (i.e., ease at pretest). The n.time slopes express differences between items in the extent to which they become easier from pretest to posttest for participants in the control condition. The n.condition slopes express differences between items in the extent to which they are easier for participants in the intervention condition than for those in the control condition at pretest. The interaction slopes express how the difference in learning between the conditions varies per item. In other words, they express how the items vary in their susceptibility to the intervention.

mod2 adds information about item categories. (Cho & Goodwin (2017) don’t discuss this in so many words, though they do model item covariates.)

mod2 <- glmer(accuracy ~ n.time * n.condition * n.category +
                (1 + n.time*n.category | participant) +
                (1 + n.time*n.category | class) +
                (1 + n.time*n.condition | item),
              data = d,
              family = "binomial")

n.category is sum-coded, not treatment-coded (again, a recoding sketch follows this list):

  • n.category is -0.5 for one item category and 0.5 for the other.
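
Again assuming the hypothetical category column from the data skeleton above:

# Sum-/simple-code the item category.
d <- d %>% 
  mutate(n.category = ifelse(category == "cat1", -0.5, 0.5))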

In mod2, you’ll obtain the following fixed-effect estimates:

  • (Intercept): Probability (in log-odds!) of accurate response at pretest by control group for average item and average control participant, aggregated over item categories. (= “initial status” for the control group)
  • n.time: Change in probability (in log-odds) of accurate response from pretest to posttest by control group for average item and average control participant, aggregated over item categories. (= “learning” for the control group)
  • n.condition: Difference in probability (in log-odds) of accurate response between intervention group and control group at pretest, again for average items/participants, aggregated over item categories. (= difference in initial status according to intervention)
  • n.time:n.condition: Difference in change in probability (in log-odds) from pretest to posttest between intervention group and control group, again for average items/participants, aggregated over item categories. (= difference in learning according to intervention)
  • n.category: Difference in probability of accurate response at pretest by control group between the two item categories (for average item/participant). (= difference in initial status for control group according to item category)
  • n.time:n.category. (= difference in learning for control group according to item category)
  • n.time:n.condition:n.category. (= difference in the difference in learning according to intervention by item category, i.e., how much more susceptible is the item category coded as 0.5 to the intervention in terms of learning than the item category coded as -0.5)

The random effect estimates will now also contain by-participant and by-class n.category slopes (how much does the difficulty difference between the two item categories vary from participant to participant and from class to class?) and by-participant and by-class n.time:n.category slopes (how much does the difference in learning between the two item categories vary from participant to participant and from class to class?), as well as the correlations between these. n.category doesn’t vary within items, only between items, so no additional random slopes are needed there.

Incidentally, it’s quite likely that these models won’t converge as specified. Some iterative simplification of the random effect structure will probably be in order; one possibility is sketched below.
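
By way of illustration only (cf. Bates et al. 2015 on parsimonious random effect structures), one possible first simplification is to drop the interaction slopes and the random-effect correlations; the double-bar syntax works here because the predictors are numeric:

# A hypothetical simplification of mod2's random effect structure.
mod2_simpler <- glmer(accuracy ~ n.time * n.condition * n.category +
                        (1 + n.time + n.category || participant) +
                        (1 + n.time + n.category || class) +
                        (1 + n.time || item),
                      data = d,
                      family = "binomial")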

For visualisation possibilities, see https://janhove.github.io/visualise_uncertainty/#47_logistic_mixed_effects_models.

10 Session info

devtools::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.1 (2019-07-05)
##  os       Ubuntu 18.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Brussels             
##  date     2019-11-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package      * version    date       lib
##  assertthat     0.2.1      2019-03-21 [1]
##  backports      1.1.5      2019-10-02 [1]
##  boot           1.3-23     2019-07-05 [4]
##  broom          0.5.2      2019-04-07 [1]
##  callr          3.3.2      2019-09-22 [1]
##  cannonball   * 0.1.0      2019-10-23 [1]
##  cellranger     1.1.0      2016-07-27 [1]
##  cli            1.1.0      2019-03-19 [1]
##  codetools      0.2-16     2018-12-24 [4]
##  coin           1.3-1      2019-08-28 [1]
##  colorspace     1.4-1      2019-03-18 [1]
##  crayon         1.3.4      2017-09-16 [1]
##  desc           1.2.0      2018-05-01 [1]
##  devtools       2.2.1      2019-09-24 [1]
##  digest         0.6.21     2019-09-20 [1]
##  dplyr        * 0.8.3      2019-07-04 [1]
##  ellipsis       0.3.0      2019-09-20 [1]
##  evaluate       0.14       2019-05-28 [1]
##  forcats      * 0.4.0      2019-02-17 [1]
##  fs             1.3.1      2019-05-06 [1]
##  generics       0.0.2      2018-11-29 [1]
##  ggplot2      * 3.2.1      2019-08-10 [1]
##  glue           1.3.1      2019-03-12 [1]
##  gtable         0.3.0      2019-03-25 [1]
##  haven          2.1.1      2019-07-04 [1]
##  highr          0.8        2019-03-20 [1]
##  hms            0.5.1      2019-08-23 [1]
##  htmltools      0.4.0      2019-10-04 [1]
##  httr           1.4.1      2019-08-05 [1]
##  jsonlite       1.6        2018-12-07 [1]
##  knitr          1.25       2019-09-18 [1]
##  labeling       0.3        2014-08-23 [1]
##  lattice        0.20-38    2018-11-04 [4]
##  lazyeval       0.2.2      2019-03-15 [1]
##  libcoin        1.0-5      2019-08-27 [1]
##  lifecycle      0.1.0      2019-08-01 [1]
##  lme4         * 1.1-21     2019-03-05 [1]
##  lmerTest     * 3.1-0      2019-02-11 [1]
##  lubridate      1.7.4      2018-04-11 [1]
##  magrittr       1.5        2014-11-22 [1]
##  MASS           7.3-51.4   2019-04-26 [1]
##  Matrix       * 1.2-17     2019-03-22 [4]
##  matrixStats    0.55.0     2019-09-07 [1]
##  memoise        1.1.0      2017-04-21 [1]
##  minqa          1.2.4      2014-10-09 [1]
##  modelr         0.1.5      2019-08-08 [1]
##  modeltools     0.2-22     2018-07-16 [1]
##  multcomp       1.4-10     2019-03-05 [1]
##  munsell        0.5.0      2018-06-12 [1]
##  mvtnorm        1.0-11     2019-06-19 [1]
##  nlme           3.1-141    2019-08-01 [4]
##  nloptr         1.2.1      2018-10-03 [1]
##  numDeriv       2016.8-1.1 2019-06-06 [1]
##  pillar         1.4.2      2019-06-29 [1]
##  pkgbuild       1.0.6      2019-10-09 [1]
##  pkgconfig      2.0.3      2019-09-22 [1]
##  pkgload        1.0.2      2018-10-29 [1]
##  prettyunits    1.0.2      2015-07-13 [1]
##  processx       3.4.1      2019-07-18 [1]
##  ps             1.3.0      2018-12-21 [1]
##  purrr        * 0.3.2      2019-03-15 [1]
##  R6             2.4.0      2019-02-14 [1]
##  RColorBrewer * 1.1-2      2014-12-07 [1]
##  Rcpp           1.0.2      2019-07-25 [1]
##  readr        * 1.3.1      2018-12-21 [1]
##  readxl         1.3.1      2019-03-13 [1]
##  remotes        2.1.0      2019-06-24 [1]
##  rlang          0.4.0.9005 2019-10-18 [1]
##  rmarkdown      1.16       2019-10-01 [1]
##  rprojroot      1.3-2      2018-01-03 [1]
##  rstudioapi     0.10       2019-03-19 [1]
##  rvest          0.3.4      2019-05-15 [1]
##  sandwich       2.5-1      2019-04-06 [1]
##  scales         1.0.0      2018-08-09 [1]
##  sessioninfo    1.1.1      2018-11-05 [1]
##  stringi        1.4.3      2019-03-12 [1]
##  stringr      * 1.4.0      2019-02-10 [1]
##  survival       2.44-1.1   2019-04-01 [1]
##  testthat       2.2.1.9002 2019-10-18 [1]
##  TH.data        1.0-10     2019-01-21 [1]
##  tibble       * 2.1.3      2019-06-06 [1]
##  tidyr        * 1.0.0      2019-09-11 [1]
##  tidyselect     0.2.5      2018-10-11 [1]
##  tidyverse    * 1.2.1      2017-11-14 [1]
##  usethis        1.5.1      2019-07-04 [1]
##  vctrs          0.2.0      2019-07-05 [1]
##  withr          2.1.2      2018-03-15 [1]
##  xfun           0.10       2019-10-01 [1]
##  xml2           1.2.2      2019-08-09 [1]
##  yaml           2.2.0      2018-07-25 [1]
##  zeallot        0.1.0      2018-01-28 [1]
##  zoo            1.8-6      2019-05-28 [1]
##  source                             
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  Github (janhove/cannonball@b7fb1bf)
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.5.2)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.5.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  Github (r-lib/rlang@4593159)       
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  Github (r-lib/testthat@0f14a42)    
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
##  CRAN (R 3.6.1)                     
## 
## [1] /home/jan/R/x86_64-pc-linux-gnu-library/3.6
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

References

Barr, Dale J., Roger Levy, Christoph Scheepers & Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language 68. 255–278.

Bates, Douglas, Reinhold Kliegl, Shravan Vasishth & R. Harald Baayen. 2015. Parsimonious mixed models. http://arxiv.org/abs/1506.04967.

Bloom, Howard S., Lashawn Richburg-Hayes & Alison Rebeck Black. 2007. Using covariates to improve precision for studies that randomize schools to evaluate educational interventions. Educational Evaluation and Policy Analysis 29(1). 30–59. doi:10.3102/0162373707299550.

Borhan, A. S. M. 2012. Methods for the analysis of pretest-posttest binary outcomes from cluster randomization trials. University of Western Ontario Master’s thesis. https://ir.lib.uwo.ca/etd/825.

Cho, Sun-Joo & Amanda P. Goodwin. 2017. Modeling learning in doubly multilevel binary longitudinal data using generalized linear mixed models: An application to measuring and explaining word learning. Psychometrika 82(3). 846–870. doi:10.1007/s11336-016-9496-y.

Faraway, Julian J. 2006. Extending the linear model with R: Generalized linear, mixed effects and nonparametric regression models. Boca Raton, FL: Chapman & Hall/CRC.

Hayes, Richard J. & Lawrence H. Moulton. 2009. Cluster randomised trials. Boca Raton, FL: Chapman & Hall/CRC.

Hedges, Larry V. 2007. Correcting a significance test for clustering. Journal of Educational and Behavioral Statistics 32(2). 151–179. doi:10.3102/1076998606298040.

Hedges, Larry V. & E. C. Hedberg. 2007. Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis 29(1). 60–87. doi:10.3102/0162373707299706.

Jaeger, T. Florian, Peter Graff, William Croft & Daniel Pontillo. 2011. Mixed effect models for genetic and areal dependencies in linguistic typology. Linguistic Typology 15. 281–320.

Localio, A. Russell, Jesse A. Berlin & Thomas R. Ten Have. 2005. Longitudinal and repeated cross‐sectional cluster‐randomization designs using mixed effects regression for binary outcomes: Bias and coverage of frequentist and Bayesian methods. Statistics in Medicine 25(16). 2720–2736. doi:10.1002/sim.2428.

Luke, Steven G. 2017. Evaluating significance in linear mixed-effects models in R. Behavior Research Methods 49. 1494–1502. doi:10.3758/s13428-016-0809-y.

Murtaugh, Paul A. 2007. Simplicity and complexity in ecological data analysis. Ecology 88(1). 56–62. doi:10.1890/0012-9658(2007)88[56:SACIED]2.0.CO;2.

Schielzeth, Holger & Wolfgang Forstmeier. 2009. Conclusions beyond support: Overconfident estimates in mixed models. Behavioral Ecology 20. 416–420.

Schochet, Peter Z. 2008. Statistical power for random assignment evaluations of education programs. Journal of Educational and Behavioral Statistics 33(1). 62–87. doi:10.3102/1076998607302714.

Twisk, Jos & Karin Proper. 2004. Evaluation of the results of a randomized controlled trial: How to define changes between baseline and follow-up. Journal of Clinical Epidemiology 57(3). 223–228. doi:10.1016/j.jclinepi.2003.07.009.

Vanhove, Jan. 2015. Analyzing randomized controlled interventions: Three notes for applied linguists. Studies in Second Language Learning and Teaching 5. 135–152. doi:10.14746/ssllt.2015.5.1.7.

Wright, Neil D. 2015. Choosing covariates in the analysis of cluster randomised trials. Queen Mary University of London PhD thesis. http://qmro.qmul.ac.uk/xmlui/handle/123456789/9017.


  1. The first assumption is relevant inasmuch as one of the suggested analyses (mixed-effect/multilevel modelling) doesn’t work well for a low number of clusters. In fact, Hayes & Moulton (2009) suggest that multilevel modelling be used only from about 15 clusters per condition onwards (pp. 223-224). If only very few clusters, say 4 or 6, are involved, cluster-level analyses may be more robust than multilevel modelling.

  2. The second assumption is relevant to the other suggested method of analysis, viz., analysing cluster-level summaries. For this analysis, the outcome variable is averaged per cluster and the analysis is run on the cluster averages (e.g., means) in a general linear model (e.g., \(t\)-test). The general linear model assumes that the residuals for all data points are sampled from distributions with the same spread (‘homoskedasticity’). Since \(\textrm{variance of cluster mean} = \frac{\textrm{variance in cluster}}{\textrm{cluster size}}\) and since we assume the variance within clusters to be constant across clusters, approximately equal cluster sizes imply that homoskedasticity is approximately met. Hugely different cluster sizes would be an argument in favour of multilevel modelling.

  3. It’s entirely possible that some of the assumptions in the simulations aren’t realistic in a given setting, however!

  4. Since the topic is cluster-randomised experiments, we won’t go into random slopes: We can’t estimate whether the effect of the intervention varies between classes since we don’t have control and intervention participants from the same classes. Since we can’t estimate this intervention-by-class interaction, fitting random slopes (intervention-by-class) has no added value.

  5. I find it amusing that shrinkage comes about due to pooling: https://www.youtube.com/watch?v=8DoARSlv-HU

  6. Contrary to common belief, covariate adjustment is in fact even more powerful when the covariate distribution is the same between conditions.