Imagine you’re studying whether using a new static analysis tool reduces the number of bugs developers introduce. You recruit 30 developers, each of whom completes 5 coding tasks—some with the tool, some without. You count the bugs in each task.
You might think: “I’ll just run a linear regression predicting bug count from tool usage.” But there’s a problem. Each developer contributed 5 data points. Some developers are simply better coders than others. Their data points aren’t independent—they’re clustered by developer.
Ignoring this structure is one of the most common mistakes in empirical SE research. Mixed-effects models are the fix.
Linear regression models the relationship between a response variable \(y\) and one or more predictors \(x\):
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)\]
Let’s simulate a straightforward scenario: does code review time (in minutes) predict the number of defects found?
library(ggplot2)
library(dplyr)
set.seed(42)
# Simulate: longer reviews find more defects
n <- 60
review_time <- runif(n, 10, 90) # 10 to 90 minutes
defects_found <- 1.5 + 0.08 * review_time + rnorm(n, 0, 1.5)
defects_found <- pmax(round(defects_found), 0) # non-negative integers
df_simple <- data.frame(review_time, defects_found)
# Fit a simple linear model
lm_simple <- lm(defects_found ~ review_time, data = df_simple)
summary(lm_simple)
##
## Call:
## lm(formula = defects_found ~ review_time, data = df_simple)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1482 -1.0362 0.3900 0.7861 2.6207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.274665 0.461239 2.764 0.00765 **
## review_time 0.084653 0.007622 11.106 5.51e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.407 on 58 degrees of freedom
## Multiple R-squared: 0.6802, Adjusted R-squared: 0.6747
## F-statistic: 123.3 on 1 and 58 DF, p-value: 5.512e-16
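Rather than reading numbers off the printed table, you can query the fit object directly; a quick sketch using the `lm_simple` fit from above:

```r
# Point estimate of the slope and its 95% confidence interval
coef(lm_simple)["review_time"]               # ~0.085, matching the summary table
confint(lm_simple, "review_time", level = 0.95)
```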
ggplot(df_simple, aes(x = review_time, y = defects_found)) +
geom_point(alpha = 0.6, size = 2.5, color = "#2c7bb6") +
geom_smooth(method = "lm", se = TRUE, color = "#d7191c", linewidth = 1) +
labs(
x = "Code Review Duration (minutes)",
y = "Defects Found",
title = "Longer Code Reviews Find More Defects"
) +
theme_minimal(base_size = 14)
Simple linear regression: review time vs. defects found.
The key thing to look at is the review_time coefficient: the slope. Here, each additional minute of review is associated with finding ~0.08 more defects.
Linear regression assumes every observation is independent. In the example above, each row is one review from one person—that’s fine.
But in SE research, our data almost always has structure: repeated measurements from the same developer, modules nested within projects, tasks attempted by multiple participants.
When observations are grouped, they violate independence. Observations within the same group tend to be more similar to each other than to observations from other groups. This is the problem mixed-effects models solve.
Let’s make our example realistic. Now 8 developers each review 8 code submissions. Some developers are inherently more thorough than others.
set.seed(123)
n_devs <- 8
n_reviews <- 8
# Each developer has a personal "baseline thoroughness"
dev_intercepts <- rnorm(n_devs, mean = 0, sd = 2.5)
sim_data <- expand.grid(
developer = factor(paste0("Dev_", LETTERS[1:n_devs])),
review_id = 1:n_reviews
) %>%
mutate(
dev_idx = as.numeric(developer),
review_time = runif(n(), 10, 90),
# Each dev has their own baseline (intercept), but same slope
defects_found = 2 + dev_intercepts[dev_idx] + 0.06 * review_time +
rnorm(n(), 0, 1),
defects_found = pmax(round(defects_found), 0)
)
ggplot(sim_data, aes(x = review_time, y = defects_found, color = developer)) +
geom_point(alpha = 0.7, size = 2) +
geom_smooth(method = "lm", se = FALSE, linewidth = 0.8, linetype = "dashed") +
labs(
x = "Review Duration (minutes)",
y = "Defects Found",
title = "Same Trend, Different Baselines Per Developer",
color = "Developer"
) +
theme_minimal(base_size = 14) +
theme(legend.position = "right")
The same relationship, but each developer has a different baseline.
Notice how the lines are roughly parallel (similar slopes) but at different heights (different intercepts). Dev_A might be naturally more thorough than Dev_F, regardless of how long they spend.
lm_naive <- lm(defects_found ~ review_time, data = sim_data)
summary(lm_naive)
##
## Call:
## lm(formula = defects_found ~ review_time, data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5284 -1.6478 -0.1315 2.0052 5.7008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.30810 0.76055 4.35 5.18e-05 ***
## review_time 0.04696 0.01427 3.29 0.00165 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.561 on 62 degrees of freedom
## Multiple R-squared: 0.1487, Adjusted R-squared: 0.1349
## F-statistic: 10.83 on 1 and 62 DF, p-value: 0.001653
This model pools all observations as if they came from unrelated individuals. The problems: in this run the slope estimate is pulled toward zero (0.047 vs. the true 0.06), the residual variance is inflated by the unmodeled developer differences, and the standard errors treat 64 correlated observations as if they were 64 independent pieces of information.
You might think: “I’ll add developer as a
predictor.”
lm_fixed <- lm(defects_found ~ review_time + developer, data = sim_data)
summary(lm_fixed)
##
## Call:
## lm(formula = defects_found ~ review_time + developer, data = sim_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6424 -0.6206 -0.1465 0.5309 2.4513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.660206 0.444287 1.486 0.142992
## review_time 0.061023 0.005796 10.529 8.54e-15 ***
## developerDev_B 0.900889 0.499735 1.803 0.076910 .
## developerDev_C 5.071424 0.499327 10.157 3.19e-14 ***
## developerDev_D 2.014455 0.500454 4.025 0.000176 ***
## developerDev_E 1.719502 0.503148 3.417 0.001196 **
## developerDev_F 5.600659 0.504016 11.112 1.12e-15 ***
## developerDev_G 2.537962 0.500306 5.073 4.78e-06 ***
## developerDev_H -2.098738 0.499369 -4.203 9.76e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9986 on 55 degrees of freedom
## Multiple R-squared: 0.8852, Adjusted R-squared: 0.8685
## F-statistic: 53 on 8 and 55 DF, p-value: < 2.2e-16
This is called a fixed-effects approach and it works for this dataset, but it has practical problems:
| Issue | Why It Matters |
|---|---|
| Uses up degrees of freedom | With 50 developers, you burn 49 df on dummy variables |
| Can’t generalize | Results apply only to these specific developers |
| Can’t handle crossed designs | What if tasks are also grouped (e.g., by project)? |
| Unstable with unbalanced data | A developer with only 3 reviews gets a noisy dummy estimate, with no shrinkage toward the mean |
Mixed-effects models solve all of these problems.
A mixed-effects model separates your predictors into two types: fixed effects, the population-level relationships you want to estimate (here, the effect of review time), and random effects, group-level deviations modeled as draws from a distribution (here, each developer’s baseline).
The model for our example:
\[y_{ij} = \underbrace{(\beta_0 + u_{0j})}_{\text{intercept for dev } j} + \beta_1 \cdot \text{review\_time}_{ij} + \epsilon_{ij}\]
where:
- \(y_{ij}\) is the number of defects found in review \(i\) by developer \(j\),
- \(\beta_0\) is the grand mean intercept and \(\beta_1\) the common slope,
- \(u_{0j} \sim N(0, \tau^2)\) is developer \(j\)’s deviation from the grand mean,
- \(\epsilon_{ij} \sim N(0, \sigma^2)\) is the residual error.
The random intercepts \(u_{0j}\) are assumed to come from a normal distribution. This is what makes them “random”—we’re modeling the distribution of developer effects, not estimating each one individually.
library(lme4)
# Random intercept for developer
mixed_model <- lmer(defects_found ~ review_time + (1 | developer),
data = sim_data)
summary(mixed_model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: defects_found ~ review_time + (1 | developer)
## Data: sim_data
##
## REML criterion at convergence: 217.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.6776 -0.6242 -0.1496 0.5107 2.5216
##
## Random effects:
## Groups Name Variance Std.Dev.
## developer (Intercept) 6.2673 2.5034
## Residual 0.9971 0.9986
## Number of obs: 64, groups: developer, 8
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.642825 0.936668 2.822
## review_time 0.060726 0.005791 10.486
##
## Correlation of Fixed Effects:
## (Intr)
## review_time -0.299
Random effects section:
- developer (Intercept) variance: how much developers vary in their baseline thoroughness (\(\tau^2\)).
- Residual variance: leftover noise after accounting for developer differences (\(\sigma^2\)).
vc <- as.data.frame(VarCorr(mixed_model))
tau2 <- vc$vcov[1] # developer variance
sig2 <- vc$vcov[2] # residual variance
icc <- tau2 / (tau2 + sig2)
cat(sprintf("ICC = %.3f\n", icc))
## ICC = 0.863
## 86% of the variance in defects is due to developer differences.
Fixed effects section:
- review_time estimate: same interpretation as linear regression—each additional minute of review is associated with finding more defects.
Interpreting the ICC:
As a rough rule of thumb, an ICC of ~0.40 or higher means that ignoring the grouping would be a serious mistake; ours is more than double that.
The impact of ignoring grouping depends on two things together: the ICC and the cluster size (how many observations per group). The combination determines how badly your standard errors are inflated. The relevant concept is the design effect, which approximates how much the variance of your estimates is inflated by clustering:
\(\text{DEFF} = 1 + (m - 1) \times \text{ICC}\),
where \(m\) is the average cluster size.
With 8 observations per developer and ICC = 0.863:
\(\text{DEFF} = 1 + (8 - 1) \times 0.863 = 1 + 6.04 = 7.04\)
That means the naive model’s variance estimates for the fixed effects are inflated by a factor of about 7. Standard errors are off by \(\sqrt{7.04} \approx 2.65\). So a confidence interval that should be, say, ±0.5 is instead being computed as if it were ±0.19 — the naive model is roughly 2.65 times more confident than it has any right to be.
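The arithmetic can be checked in a few lines of base R, plugging in the variance components from the model output above:

```r
# Variance components read off the mixed-model summary above
tau2 <- 6.2673   # between-developer variance
sig2 <- 0.9971   # residual variance
icc  <- tau2 / (tau2 + sig2)   # intraclass correlation
m    <- 8                      # reviews per developer
deff <- 1 + (m - 1) * icc      # design effect (variance inflation factor)
round(c(icc = icc, deff = deff, se_inflation = sqrt(deff)), 2)
# icc 0.86, deff 7.04, se_inflation 2.65
```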
ranefs <- ranef(mixed_model)$developer
ranefs$developer <- rownames(ranefs)
names(ranefs)[1] <- "intercept_offset"
ggplot(ranefs, aes(x = reorder(developer, intercept_offset),
y = intercept_offset)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
geom_point(size = 3, color = "#2c7bb6") +
geom_segment(aes(xend = developer, y = 0, yend = intercept_offset),
color = "#2c7bb6", linewidth = 0.8) +
coord_flip() +
labs(
x = NULL,
y = "Deviation from Grand Mean",
title = "Random Intercepts: How Each Developer Differs"
) +
theme_minimal(base_size = 14)
Each developer’s estimated deviation from the grand mean intercept.
library(patchwork)
# Get mixed model predictions
sim_data$pred_mixed <- predict(mixed_model)
p1 <- ggplot(sim_data, aes(x = review_time, y = defects_found)) +
geom_point(alpha = 0.5, color = "grey40") +
geom_smooth(method = "lm", color = "#d7191c", linewidth = 1.2, se = FALSE) +
labs(title = "Naive Linear Regression",
subtitle = "One line for all developers",
x = "Review Time", y = "Defects Found") +
theme_minimal(base_size = 13)
p2 <- ggplot(sim_data, aes(x = review_time, y = defects_found,
color = developer)) +
geom_point(alpha = 0.6, size = 1.5) +
geom_line(aes(y = pred_mixed), linewidth = 0.8) +
labs(title = "Mixed-Effects Model",
subtitle = "Each developer gets their own intercept",
x = "Review Time", y = "Defects Found") +
theme_minimal(base_size = 13) +
theme(legend.position = "none")
p1 + p2
Left: one line for everyone (naive). Right: developer-specific intercepts (mixed model).
The naive model reports one big \(\sigma^2\) and calls it “noise,” while the mixed model reveals that a substantial chunk of it was actually systematic developer differences — not noise at all.
# Naive LM: all unexplained variance in one bucket
sigma_naive <- sigma(lm_naive)^2
# Mixed model: variance split into developer + residual
vc <- as.data.frame(VarCorr(mixed_model))
tau2 <- vc$vcov[1] # developer
sig2 <- vc$vcov[2] # residual
cat(sprintf("Naive LM residual variance: %.2f\n", sigma_naive))
cat(sprintf("Mixed model τ² + σ²: %.2f (τ²=%.2f + σ²=%.2f)\n",
            tau2 + sig2, tau2, sig2))
## Naive LM residual variance: 6.56
## Mixed model τ² + σ²: 7.26 (τ²=6.27 + σ²=1.00)
So far we assumed every developer has the same slope (same benefit from spending more time). But what if some developers benefit more from longer reviews than others?
\[y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j}) \cdot \text{review\_time}_{ij} + \epsilon_{ij}\]
Now each developer has their own intercept and their own slope.
# Random intercept AND random slope for review_time
mixed_slopes <- lmer(
defects_found ~ review_time + (1 + review_time | developer),
data = sim_data
)
summary(mixed_slopes)
## Linear mixed model fit by REML ['lmerMod']
## Formula: defects_found ~ review_time + (1 + review_time | developer)
## Data: sim_data
##
## REML criterion at convergence: 216.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8263 -0.6093 -0.1294 0.5092 2.3855
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## developer (Intercept) 7.8798259 2.80710
## review_time 0.0001107 0.01052 -0.70
## Residual 0.9490999 0.97422
## Number of obs: 64, groups: developer, 8
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.631310 1.037221 2.537
## review_time 0.060470 0.006797 8.897
##
## Correlation of Fixed Effects:
## (Intr)
## review_time -0.590
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0496749 (tol = 0.002, component 1)
When to use random slopes: when you have a theoretical reason to expect the effect of a predictor to vary across groups, and enough observations per group to estimate that variation. Here the estimated slope variance is essentially zero (0.0001) and lmer issues a convergence warning; both are signs the data don’t support random slopes, which makes sense since we simulated every developer with the same slope.
SE data often has crossed or nested random effects. For instance, if developers review code from multiple projects:
# Crossed random effects: developers AND projects
lmer(defects ~ review_time + (1 | developer) + (1 | project), data = mydata)
# Nested: developers within teams
lmer(defects ~ review_time + (1 | team/developer), data = mydata)
This is where mixed models really shine over the dummy-variable approach—handling complex grouping structures naturally.
| Situation | Model |
|---|---|
| One observation per subject, no grouping | Linear regression (lm) |
| Repeated measures, few groups (< 5) | Fixed effects / ANOVA may work |
| Repeated measures, groups are a sample from a larger population | Mixed-effects model (lmer) |
| Multiple crossed grouping factors | Mixed-effects model (lmer) |
| You want to generalize beyond the specific groups in your data | Mixed-effects model (lmer) |
| Study Design | Fixed Effects | Random Effects |
|---|---|---|
| Tool evaluation with repeated tasks | tool (on/off), task difficulty | participant, task |
| Mining defects across projects | code metrics (LOC, complexity) | project, module |
| Developer productivity over time | time period, methodology | developer |
| Code review effectiveness | review type, reviewer experience | pull request, repository |
lmer Syntax
| Formula | Meaning |
|---|---|
| (1 \| group) | Random intercept per group |
| (1 + x \| group) | Random intercept + random slope for x |
| (1 \| g1) + (1 \| g2) | Crossed random effects |
| (1 \| g1/g2) | g2 nested within g1 |
| (0 + x \| group) | Random slope only (no random intercept) |
When you publish, report: the fixed-effect estimates with confidence intervals, the random-effect variance components (and the ICC), the number of groups and observations per group, and how you obtained p-values or intervals.
# Use confint for confidence intervals (profile likelihood)
# confint(mixed_model) # can be slow; use method="Wald" for speed
# For p-values, use lmerTest
library(lmerTest)
mixed_with_p <- lmer(defects_found ~ review_time + (1 | developer),
data = sim_data)
summary(mixed_with_p)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: defects_found ~ review_time + (1 | developer)
## Data: sim_data
##
## REML criterion at convergence: 217.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.6776 -0.6242 -0.1496 0.5107 2.5216
##
## Random effects:
## Groups Name Variance Std.Dev.
## developer (Intercept) 6.2673 2.5034
## Residual 0.9971 0.9986
## Number of obs: 64, groups: developer, 8
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 2.642825 0.936668 8.421160 2.822 0.0213 *
## review_time 0.060726 0.005791 55.178496 10.486 9.54e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## review_time -0.299
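For the confidence intervals themselves, confint() on a merMod object supports both methods mentioned in the comments above; a sketch reusing mixed_model (note that Wald intervals are quick but only cover the fixed effects; the variance-component rows come back NA):

```r
library(lme4)
# Fast Wald intervals (fixed effects only; variance components return NA)
confint(mixed_model, method = "Wald")
# Profile-likelihood intervals: slower, but cover the variance components too
# confint(mixed_model, method = "profile")
```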
Linear regression assumes independence. In SE experiments, observations are almost never independent—they’re grouped by developer, project, or task.
Ignoring grouping inflates your confidence. You’ll report effects as significant when they’re not. This is a real problem in the SE literature.
Mixed models account for grouping structure. Random effects absorb the group-level variation so your fixed-effect estimates and standard errors are correct.
Think of random effects as “nuisance variation.” You don’t care which developer is best—you care whether the tool works across developers.
Start simple. Fit a random-intercept model first. Add random slopes only if justified by your design and supported by your data.
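One way to check whether the data support random slopes is a likelihood-ratio test between the two models fit earlier (anova() refits both with ML before comparing; mixed_model and mixed_slopes are the objects from above):

```r
library(lme4)
# LRT: do the two extra parameters (slope variance + intercept-slope
# correlation) improve the fit significantly?
anova(mixed_model, mixed_slopes)
# A large p-value favors the simpler random-intercept model.
```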
This lecture was designed for PhD students in software engineering. The examples use simulated data to make the concepts transparent. In your own research, check model assumptions (normality of residuals, homoscedasticity) and consider whether a generalized linear mixed model (GLMM) is more appropriate for your outcome variable (e.g., Poisson for counts, logistic for binary outcomes).
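Since defects_found is a count, a Poisson GLMM is a natural next step; a minimal sketch with lme4’s glmer, reusing sim_data from the simulation above:

```r
library(lme4)
# Poisson GLMM: log link, random intercept per developer
glmm_pois <- glmer(defects_found ~ review_time + (1 | developer),
                   data = sim_data, family = poisson)
# Fixed effects are on the log scale; exponentiate for rate ratios:
exp(fixef(glmm_pois)["review_time"])  # multiplicative change in expected
                                      # defects per extra minute of review
summary(glmm_pois)
```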