Measurement error on control variables undermines statistical control. From what I thought I understood, accounting for the measurement error in a Bayesian model should largely solve this problem. But I couldn’t make it work that way. It seems you also have to account explicitly for the fact that the latent predictors are correlated. Any feedback is greatly appreciated.
In observational studies, it is customary to account for confounding variables by including measurements of them in the statistical model. This practice is referred to as “statistically controlling” for the confounding variables. An underappreciated problem is that if the confounding variables were measured imperfectly, then statistical control will be imperfect as well, and the confound won’t be eradicated entirely (see Berthele & Vanhove 2017; Brunner & Austin 2009; Westfall & Yarkoni 2016) (see also Controlling for confounding variables in correlational research: Four caveats).
Let’s first illustrate the problem using simulated data. That way, we know what goes into the data and what we hope a model should take out of it.
We start by generating two correlated constructs (A and B) that aren’t measured directly. Here, the correlation is \(\rho = 0.8\). Construct A causally influences the directly measured outcome Z. Construct B does not.
# Generate correlated constructs
rho <- 0.8
n <- 300
constructs <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = rbind(c(1, rho), c(rho, 1)))
A <- constructs[, 1]
B <- constructs[, 2]
# A influences Z, B doesn't
Z <- A + rnorm(n, sd = 1)
Because A and B are correlated and A influences Z, B is also correlated with Z:
source("https://janhove.github.io/RCode/scatterplot_matrix.R")
scatterplot_matrix(cbind(Z, A, B))
But if you put A and B in the same regression model, the model correctly estimates that B doesn’t have much of an influence on Z.
summary(lm(Z ~ A + B))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.035 0.058 0.61 5.4e-01
## A 0.926 0.093 9.91 3.4e-20
## B 0.041 0.092 0.45 6.5e-01
But A and B aren’t directly observed; instead we observe obs_A and obs_B. For this example, let’s assume that the reliability of obs_A is 0.6 and that of obs_B is 0.9. Since their corresponding latent variables have a variance of 1, the standard errors of measurement for these observed variables are \(\sqrt{\frac{1}{0.6}-1} \approx 0.82\) and \(\sqrt{\frac{1}{0.9}-1} \approx 0.33\), respectively.1
obs_A <- A + rnorm(n = n, sd = sqrt(1/0.6 - 1))
obs_B <- B + rnorm(n = n, sd = sqrt(1/0.9 - 1))
scatterplot_matrix(cbind(Z, obs_A, obs_B))
If you put obs_A and obs_B in the same regression model, the model now finds significant effects for both variables.
summary(lm(Z ~ obs_A + obs_B))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.077 0.064 1.2 2.4e-01
## obs_A 0.354 0.060 5.9 7.6e-09
## obs_B 0.436 0.073 6.0 5.7e-09
Descriptively, this is perfectly fine. But if you’d conclude that B (or obs_B) causally influences Z because its parameter estimate is significant after controlling for a measure of A (i.e., obs_A), you’d be wrong: in this simulation, only A influences Z.
Westfall & Yarkoni (2016) suggest that researchers use structural equation models to account for imperfectly measured control variables. In reaction to this suggestion, Andrew Gelman wrote that “[y]ou can account for measurement error directly in Bayesian inference by just putting the measurement error model directly into the posterior distribution.”
I wanted to have a go at this suggestion to see if I could make it work. To this end, I use the rethinking package for R, with code adapted from McElreath (2016).
library(rethinking)
## Loading required package: rstan
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.19.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Loading required package: parallel
## Loading required package: dagitty
## rethinking (Version 1.93)
##
## Attaching package: 'rethinking'
## The following object is masked from 'package:stats':
##
## rstudent
Let’s first define the model:
abz_model <- alist(
# Distribution of outcome and conditional mean (mu)
# Conditional mean is modelled in terms of both (estimated) A and B
Z[i] ~ dnorm(mu, sigma),
mu <- intercept + slope_A*A_latent[i] + slope_B*B_latent[i],
# Observation process of A and B: obs_A and obs_B are their
# latent construct scores plus normal measurement error
obs_A[i] ~ dnorm(A_latent, error_A),
obs_B[i] ~ dnorm(B_latent, error_B),
# Prior distribution of A_latent and B_latent
vector[N]:A_latent ~ dnorm(0, 1),
vector[N]:B_latent ~ dnorm(0, 1),
# Other priors
intercept ~ dnorm(0, 1),
slope_A ~ dnorm(0, 1),
slope_B ~ dnorm(0, 1),
sigma ~ dexp(1)
)
Now, for reference, fit a model that assumes both predictors were measured with hardly any error (reliability of 0.999):
data_no_error <- list(
Z = Z,
obs_A = obs_A,
obs_B = obs_B,
error_A = sqrt(1/0.999 - 1), # assume nearly perfect predictors
error_B = sqrt(1/0.999 - 1),
N = n
)
model_no_error <- ulam(flist = abz_model,
data = data_no_error,
chains = 4, cores = 4)
precis(model_no_error)
## 600 vector or matrix parameters hidden. Use depth=2 to show them.
## mean sd 5.5% 94% n_eff Rhat
## intercept 0.076 0.063 -0.026 0.18 5441 1
## slope_A 0.354 0.060 0.258 0.45 3726 1
## slope_B 0.434 0.072 0.319 0.55 3560 1
## sigma 1.113 0.047 1.039 1.19 5166 1
So the estimated slope for B is 0.43 with an 89% credible interval of [0.32, 0.55]. Interpreted at the construct level, this is an overestimate: the true effect of construct B on Z is 0.
Let’s explicitly model measurement error on the confounding variable (A). To make sure this works as intended, I’m going to use the actual standard deviation of the measurement noise; in most real-life settings, you’d have to estimate this.
data_error_A <- list(
Z = Z,
obs_A = obs_A,
obs_B = obs_B,
error_A = sqrt(1/0.6 - 1),
error_B = sqrt(1/0.999 - 1),
N = n
)
model_error_A <- ulam(flist = abz_model,
data = data_error_A,
chains = 4, cores = 4)
precis(model_error_A)
## 600 vector or matrix parameters hidden. Use depth=2 to show them.
## mean sd 5.5% 94% n_eff Rhat
## intercept 0.077 0.065 -0.026 0.18 4236 1
## slope_A 0.586 0.102 0.425 0.75 1195 1
## slope_B 0.437 0.075 0.319 0.55 1841 1
## sigma 1.049 0.055 0.962 1.14 1950 1
While the estimated slope for A is now larger (though estimated with more uncertainty), the slope for B is still the same as in the model that assumed essentially perfectly measured predictors. Based on my clearly incomplete understanding of Gelman’s suggestion, I had expected the slope for predictor B to be much closer to 0.
Let’s also take into account the fact that the variable of interest, B, was measured imperfectly:
data_error_AB <- list(
Z = Z,
obs_A = obs_A,
obs_B = obs_B,
error_A = sqrt(1/0.6 - 1),
error_B = sqrt(1/0.9 - 1),
N = n
)
model_error_AB <- ulam(flist = abz_model,
data = data_error_AB,
chains = 4, cores = 4)
precis(model_error_AB)
## 600 vector or matrix parameters hidden. Use depth=2 to show them.
## mean sd 5.5% 94% n_eff Rhat
## intercept 0.076 0.067 -0.027 0.19 4838 1
## slope_A 0.586 0.101 0.421 0.74 1255 1
## slope_B 0.484 0.084 0.348 0.62 1904 1
## sigma 1.035 0.053 0.951 1.12 1996 1
The estimated slope for B is now 0.48, with an 89% credible interval of [0.35, 0.62], i.e., even farther from the truth than in the naïve model. Clearly, this doesn’t solve the problem either.
There is one part of the data generating process that wasn’t accounted for in the model: the latent constructs A and B were drawn from a bivariate normal distribution with \(\rho = 0.8\). I think (no guarantees) that the following model specification takes this into account:
abz_model_correlated <- alist(
# Distribution of outcome and conditional mean (mu)
# Conditional mean is modelled in terms of both (estimated) A and B
Z[i] ~ dnorm(mu, sigma),
mu <- intercept + slope_A*A_latent[i] + slope_B*B_latent[i],
# Observation process of A and B: obs_A and obs_B are their
# latent construct scores plus normal measurement error
obs_A[i] ~ dnorm(A_latent, error_A),
obs_B[i] ~ dnorm(B_latent, error_B),
# Latent construct scores both depend on further unobserved factors
# (the combination of standard deviations here guarantees that
# A_latent and B_latent are correlated at 0.8)
vector[N]:A_latent ~ dnorm(unobserved, sqrt(1-0.894^2)),
vector[N]:B_latent ~ dnorm(unobserved, sqrt(1-0.894^2)),
# Prior distribution of unobserved factors
vector[N]:unobserved ~ dnorm(0, 0.894),
# Other priors
intercept ~ dnorm(0, 1),
slope_A ~ dnorm(0, 1),
slope_B ~ dnorm(0, 1),
sigma ~ dexp(1)
)
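As a quick sanity check of my own (this isn’t needed to fit the model): since \(0.894 \approx \sqrt{0.8}\), the common unobserved factor contributes a variance of about 0.8 to each latent construct and the remaining \(1 - 0.894^2 \approx 0.2\) is construct-specific, so the latent constructs end up with unit variance and a correlation of 0.8. Simulating from this implied prior confirms as much:
# Sanity check (my own aside, not part of the model): simulate latent scores
# from the implied prior and verify their variance and correlation.
u_check <- rnorm(1e5, mean = 0, sd = 0.894)
A_check <- rnorm(1e5, mean = u_check, sd = sqrt(1 - 0.894^2))
B_check <- rnorm(1e5, mean = u_check, sd = sqrt(1 - 0.894^2))
var(A_check)           # close to 1
cor(A_check, B_check)  # close to 0.8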
model_error_AB_correlated <- ulam(flist = abz_model_correlated,
data = data_error_AB,
iter = 5000, warmup = 1000,
chains = 4, cores = 4)
precis(model_error_AB_correlated)
## 900 vector or matrix parameters hidden. Use depth=2 to show them.
## mean sd 5.5% 94% n_eff Rhat
## intercept 0.076 0.065 -0.027 0.18 16926 1
## slope_A 0.888 0.178 0.605 1.18 3296 1
## slope_B 0.059 0.161 -0.198 0.31 3705 1
## sigma 1.009 0.061 0.908 1.11 4156 1
Lo and behold, the slope for B is now appropriately close to 0, with an 89% credible interval of [-0.2, 0.3]. So accounting for the measurement error alone doesn’t solve the problem that measurement error creates for statistical control; you also have to account for the correlation between the latent predictors.
In the previous model, I specified the standard deviations for the latent predictors and their common cause so that the latent predictors would be correlated at \(\rho = 0.8\). In the next model, I have the model estimate these standard deviations. In fact, it only has to estimate one of them, because the others can be computed from this one estimate, if we assume that the latent predictors have a variance of 1.
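To spell out the algebra (my own aside; below, \(\sigma_u\) stands for the sigma_unobserved parameter): if both latent constructs are to have unit variance, their construct-specific standard deviation must be \(\sqrt{1 - \sigma_u^2}\), and because the constructs share only the common factor, their covariance equals that factor’s variance:
\[
\mathrm{Var}(A_{\mathrm{latent}}) = \sigma_u^2 + (1 - \sigma_u^2) = 1,
\qquad
\mathrm{Cor}(A_{\mathrm{latent}}, B_{\mathrm{latent}}) = \mathrm{Var}(\mathrm{unobserved}) = \sigma_u^2.
\]
So a single estimated parameter pins down both the construct-specific standard deviations and the implied correlation between the latent constructs.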
abz_model_correlated_2 <- alist(
# Distribution of outcome and conditional mean (mu)
# Conditional mean is modelled in terms of both (estimated) A and B
Z[i] ~ dnorm(mu, sigma),
mu <- intercept + slope_A*A_latent[i] + slope_B*B_latent[i],
# Observation process of A and B: obs_A and obs_B are their
# latent construct scores plus normal measurement error
obs_A[i] ~ dnorm(A_latent, error_A),
obs_B[i] ~ dnorm(B_latent, error_B),
# Latent construct scores both depend on further unobserved factors;
# their correlation with the latent factor is estimated
vector[N]:A_latent ~ dnorm(unobserved, sqrt(1-sigma_unobserved^2)),
vector[N]:B_latent ~ dnorm(unobserved, sqrt(1-sigma_unobserved^2)),
# Prior distribution of unobserved factors
vector[N]:unobserved ~ dnorm(0, sigma_unobserved),
# Other priors
intercept ~ dnorm(0, 1),
slope_A ~ dnorm(0, 1),
slope_B ~ dnorm(0, 1),
sigma ~ dexp(1),
sigma_unobserved ~ dunif(0, 1)
)
model_error_AB_correlated_2 <- ulam(flist = abz_model_correlated_2,
data = data_error_AB,
iter = 5000, warmup = 1000,
chains = 4, cores = 4)
precis(model_error_AB_correlated_2)
## 900 vector or matrix parameters hidden. Use depth=2 to show them.
## mean sd 5.5% 94% n_eff Rhat
## intercept 0.076 0.064 -0.026 0.18 19160 1
## slope_A 0.849 0.185 0.569 1.16 2580 1
## slope_B 0.100 0.171 -0.183 0.36 2424 1
## sigma 1.012 0.061 0.914 1.11 6024 1
## sigma_unobserved 0.878 0.026 0.834 0.92 1217 1
This also works pretty well: the slope for B is still near 0.
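As a further check of my own (using rethinking’s extract.samples()): by the algebra above, the implied correlation between the latent constructs is sigma_unobserved squared, so we can read it off the posterior draws. The posterior mean of sigma_unobserved is about 0.88, and \(0.88^2 \approx 0.77\), reasonably close to the true \(\rho = 0.8\).
# Implied correlation between A_latent and B_latent is sigma_unobserved^2;
# summarise it using the posterior draws.
post <- extract.samples(model_error_AB_correlated_2)
quantile(post$sigma_unobserved^2, probs = c(0.055, 0.5, 0.945))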
To make sure this approach also works on slightly different data, let’s generate a new dataset, with a different correlation between the latent predictors and different (known) reliabilities:
# Generate correlated constructs
rho <- 0.4
n <- 500
constructs <- MASS::mvrnorm(n = n, mu = c(0, 0), Sigma = rbind(c(1, rho), c(rho, 1)))
A <- constructs[, 1]
B <- constructs[, 2]
# A influences Z, B doesn't
Z <- A + rnorm(n, sd = 1)
obs_A <- A + rnorm(n = n, sd = sqrt(1/0.45 - 1))
obs_B <- B + rnorm(n = n, sd = sqrt(1/0.95 - 1))
data_error_AB <- list(
Z = Z,
obs_A = obs_A,
obs_B = obs_B,
error_A = sqrt(1/0.45 - 1),
error_B = sqrt(1/0.95 - 1),
N = n
)
model_error_AB_correlated_2 <- ulam(flist = abz_model_correlated_2,
data = data_error_AB,
iter = 5000, warmup = 1000,
chains = 4, cores = 4)
precis(model_error_AB_correlated_2)
## 1500 vector or matrix parameters hidden. Use depth=2 to show them.
## mean sd 5.5% 94% n_eff Rhat
## intercept -0.07030 0.055 -0.16 0.018 12264 1
## slope_A 1.06533 0.112 0.89 1.244 1253 1
## slope_B 0.00024 0.095 -0.16 0.146 1219 1
## sigma 0.96755 0.075 0.85 1.083 1466 1
## sigma_unobserved 0.66743 0.045 0.59 0.737 1132 1
Works fine!
I’m not an expert in these matters, but it seems that just accounting for the measurement error on a measured confound in a Bayesian model doesn’t solve the problem that statistical control with imperfect control variables will be imperfect. You also have to account for the correlation between the latent construct of interest and the latent confound.
Berthele, Raphael & Jan Vanhove. 2017. What would disprove interdependence? Lessons learned from a study on biliteracy in Portuguese heritage language speakers in Switzerland. International Journal of Bilingual Education and Bilingualism. doi:10.1080/13670050.2017.1385590.
Brunner, Jerry & Peter C. Austin. 2009. Inflation of Type I error rate in multiple regression when independent variables are measured with error. Canadian Journal of Statistics 37(1). 33–46. doi:10.1002/cjs.10004.
McElreath, Richard. 2016. Statistical rethinking: A Bayesian course with examples in R and Stan. Boca Raton, FL: CRC Press.
Westfall, Jacob & Tal Yarkoni. 2016. Statistically controlling for confounding constructs is harder than you think. PLOS ONE 11(3). e0152719. doi:10.1371/journal.pone.0152719.
Since the variance of the observed scores \(S^2_X\) equals the sum of the variance of the true scores \(S^2_T\) and the noise variance \(S^2_E\), the standard error of measurement (which is the standard deviation of the noise) equals \(\sqrt{S^2_X - S^2_T}\). Moreover, the reliability of the observed scores \(r_{XX'} = \frac{S^2_T}{S^2_X}\), so \(S^2_X = \frac{S^2_T}{r_{XX'}}\). Hence, \(S_E = \sqrt{\frac{S^2_T}{r_{XX'}} - S^2_T}\). Since in this simulation \(S^2_T = 1\), we have \(S_E = \sqrt{\frac{1}{r_{XX'}} - 1}\).