Why This Matters for SE Research

Imagine you’re studying whether using a new static analysis tool reduces the number of bugs developers introduce. You recruit 30 developers, each of whom completes 5 coding tasks—some with the tool, some without. You count the bugs in each task.

You might think: “I’ll just run a linear regression predicting bug count from tool usage.” But there’s a problem. Each developer contributed 5 data points. Some developers are simply better coders than others. Their data points aren’t independent—they’re clustered by developer.

Ignoring this structure is one of the most common mistakes in empirical SE research. Mixed-effects models are the fix.


Part 1: Quick Review of Linear Regression

The Basic Idea

Linear regression models the relationship between a response variable \(y\) and one or more predictors \(x\):

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)\]

  • \(\beta_0\): intercept (baseline value of \(y\))
  • \(\beta_1\): slope (how much \(y\) changes per unit of \(x\))
  • \(\epsilon_i\): residual error (assumed independent and normally distributed)

A Simple SE Example

Let’s simulate a straightforward scenario: does code review time (in minutes) predict the number of defects found?

library(ggplot2)
library(dplyr)

set.seed(42)

# Simulate: longer reviews find more defects
n <- 60
review_time <- runif(n, 10, 90)        # 10 to 90 minutes
defects_found <- 1.5 + 0.08 * review_time + rnorm(n, 0, 1.5)
defects_found <- pmax(round(defects_found), 0)  # non-negative integers

df_simple <- data.frame(review_time, defects_found)

# Fit a simple linear model
lm_simple <- lm(defects_found ~ review_time, data = df_simple)
summary(lm_simple)
## 
## Call:
## lm(formula = defects_found ~ review_time, data = df_simple)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1482 -1.0362  0.3900  0.7861  2.6207 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.274665   0.461239   2.764  0.00765 ** 
## review_time 0.084653   0.007622  11.106 5.51e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.407 on 58 degrees of freedom
## Multiple R-squared:  0.6802, Adjusted R-squared:  0.6747 
## F-statistic: 123.3 on 1 and 58 DF,  p-value: 5.512e-16
ggplot(df_simple, aes(x = review_time, y = defects_found)) +
  geom_point(alpha = 0.6, size = 2.5, color = "#2c7bb6") +
  geom_smooth(method = "lm", se = TRUE, color = "#d7191c", linewidth = 1) +
  labs(
    x = "Code Review Duration (minutes)",
    y = "Defects Found",
    title = "Longer Code Reviews Find More Defects"
  ) +
  theme_minimal(base_size = 14)
Simple linear regression: review time vs. defects found.

Reading the Output

The key things to look at:

  • Estimate for review_time: the slope. Here, each additional minute of review is associated with finding ~0.08 more defects.
  • p-value: is the relationship statistically distinguishable from zero?
  • \(R^2\): how much variance in defects does review time explain?

The Critical Assumption We’re About to Break

Linear regression assumes every observation is independent. In the example above, each row is one review from one person—that’s fine.

But in SE research, our data almost always has structure:

  • Multiple commits from the same developer
  • Multiple releases from the same project
  • Multiple tasks completed by the same participant
  • Multiple files in the same module

When observations are grouped, they violate independence. Observations within the same group tend to be more similar to each other than to observations from other groups. This is the problem mixed-effects models solve.


Part 2: The Problem With Ignoring Groups

A More Realistic SE Scenario

Let’s make our example realistic. Now 8 developers each review 8 code submissions. Some developers are inherently more thorough than others.

set.seed(123)

n_devs <- 8
n_reviews <- 8

# Each developer has a personal "baseline thoroughness"
dev_intercepts <- rnorm(n_devs, mean = 0, sd = 2.5)

sim_data <- expand.grid(
  developer = factor(paste0("Dev_", LETTERS[1:n_devs])),
  review_id = 1:n_reviews
) %>%
  mutate(
    dev_idx = as.numeric(developer),
    review_time = runif(n(), 10, 90),
    # Each dev has their own baseline (intercept), but same slope
    defects_found = 2 + dev_intercepts[dev_idx] + 0.06 * review_time +
                    rnorm(n(), 0, 1),
    defects_found = pmax(round(defects_found), 0)
  )
ggplot(sim_data, aes(x = review_time, y = defects_found, color = developer)) +
  geom_point(alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.8, linetype = "dashed") +
  labs(
    x = "Review Duration (minutes)",
    y = "Defects Found",
    title = "Same Trend, Different Baselines Per Developer",
    color = "Developer"
  ) +
  theme_minimal(base_size = 14) +
  theme(legend.position = "right")
The same relationship, but each developer has a different baseline.

Notice how the lines are roughly parallel (similar slopes) but at different heights (different intercepts). Dev_A might be naturally more thorough than Dev_F, regardless of how long they spend.

What Happens If We Ignore the Grouping?

lm_naive <- lm(defects_found ~ review_time, data = sim_data)
summary(lm_naive)
## 
## Call:
## lm(formula = defects_found ~ review_time, data = sim_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5284 -1.6478 -0.1315  2.0052  5.7008 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.30810    0.76055    4.35 5.18e-05 ***
## review_time  0.04696    0.01427    3.29  0.00165 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.561 on 62 degrees of freedom
## Multiple R-squared:  0.1487, Adjusted R-squared:  0.1349 
## F-statistic: 10.83 on 1 and 62 DF,  p-value: 0.001653

This model pools all observations as if they came from unrelated individuals. The problems:

  1. Standard errors are wrong—ignoring grouping biases standard errors, but the direction depends on the predictor’s structure. For between-cluster predictors the SEs shrink and you get false positives. For within-cluster predictors (our case) the SEs can inflate and you lose power.
  2. p-values become untrustworthy in either direction—too small when the SEs shrink (false positives), too large when they inflate (missed real effects).
  3. Developer-level variance is dumped into the residuals, inflating \(\sigma\).

The “Just Add a Dummy Variable” Approach

You might think: “I’ll add developer as a predictor.”

lm_fixed_dev <- lm(defects_found ~ review_time + developer, data = sim_data)
summary(lm_fixed_dev)
## 
## Call:
## lm(formula = defects_found ~ review_time + developer, data = sim_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6424 -0.6206 -0.1465  0.5309  2.4513 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.660206   0.444287   1.486 0.142992    
## review_time     0.061023   0.005796  10.529 8.54e-15 ***
## developerDev_B  0.900889   0.499735   1.803 0.076910 .  
## developerDev_C  5.071424   0.499327  10.157 3.19e-14 ***
## developerDev_D  2.014455   0.500454   4.025 0.000176 ***
## developerDev_E  1.719502   0.503148   3.417 0.001196 ** 
## developerDev_F  5.600659   0.504016  11.112 1.12e-15 ***
## developerDev_G  2.537962   0.500306   5.073 4.78e-06 ***
## developerDev_H -2.098738   0.499369  -4.203 9.76e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9986 on 55 degrees of freedom
## Multiple R-squared:  0.8852, Adjusted R-squared:  0.8685 
## F-statistic:    53 on 8 and 55 DF,  p-value: < 2.2e-16

This is called a fixed-effects approach and it works for this dataset, but it has practical problems:

  • Uses up degrees of freedom: with 50 developers, you burn 49 df on dummy variables.
  • Can’t generalize: the results apply only to these specific developers.
  • Can’t handle crossed designs: what if tasks are also grouped (e.g., by project)?
  • Struggles with unbalanced data: if some developers did 3 reviews and others did 12, the estimates for sparsely observed developers are noisy, and there is no pooling to stabilize them.

Mixed-effects models solve all of these problems.


Part 3: Mixed-Effects Models

The Key Idea

A mixed-effects model separates your predictors into two types:

  • Fixed effects: the variables you care about and want to estimate (e.g., review time, tool usage). These generalize to the population.
  • Random effects: grouping variables that introduce structured variability (e.g., developer, project). You’re not interested in the specific group values—you want to account for the fact that groups differ.

The model for our example:

\[y_{ij} = \underbrace{(\beta_0 + u_{0j})}_{\text{intercept for dev } j} + \beta_1 \cdot \text{review\_time}_{ij} + \epsilon_{ij}\]

where:

  • \(\beta_0\): grand mean intercept (fixed)
  • \(u_{0j} \sim N(0, \tau^2)\): random intercept for developer \(j\) (how much dev \(j\) deviates from the grand mean)
  • \(\beta_1\): fixed slope for review time
  • \(\epsilon_{ij} \sim N(0, \sigma^2)\): residual error

The random intercepts \(u_{0j}\) are assumed to come from a normal distribution. This is what makes them “random”—we’re modeling the distribution of developer effects, not estimating each one individually.

Fitting the Model in R

library(lme4)

# Random intercept for developer
mixed_model <- lmer(defects_found ~ review_time + (1 | developer),
                    data = sim_data)
summary(mixed_model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: defects_found ~ review_time + (1 | developer)
##    Data: sim_data
## 
## REML criterion at convergence: 217.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.6776 -0.6242 -0.1496  0.5107  2.5216 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  developer (Intercept) 6.2673   2.5034  
##  Residual              0.9971   0.9986  
## Number of obs: 64, groups:  developer, 8
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 2.642825   0.936668   2.822
## review_time 0.060726   0.005791  10.486
## 
## Correlation of Fixed Effects:
##             (Intr)
## review_time -0.299

Reading the Output

Random effects section:

  • developer (Intercept) variance: how much developers vary in their baseline thoroughness (\(\tau^2\)).
  • Residual variance: leftover noise after accounting for developer differences (\(\sigma^2\)).
  • The Intraclass Correlation Coefficient (ICC) tells you what fraction of the variance remaining after the fixed effects is attributable to the grouping:
vc <- as.data.frame(VarCorr(mixed_model))
tau2 <- vc$vcov[1]   # developer variance
sig2 <- vc$vcov[2]   # residual variance

icc <- tau2 / (tau2 + sig2)
cat(sprintf("ICC = %.3f\n", icc))
## ICC = 0.863
cat(sprintf("%.0f%% of the variance not explained by review time is due to developer differences.\n",
            icc * 100))
## 86% of the variance not explained by review time is due to developer differences.

Fixed effects section:

  • review_time estimate: same interpretation as linear regression—each additional minute of review is associated with finding more defects.
  • Standard errors now reflect the fact that observations within a developer are correlated, so inference on the fixed effects is trustworthy.

Interpreting the ICC:

An ICC of ~0.40 or higher (here it is 0.863) means that ignoring the grouping would be a serious mistake.

The impact of ignoring grouping depends on two things together: the ICC and the cluster size (how many observations per group). Their combination determines how far off the naive model’s uncertainty estimates are. The relevant concept is the design effect, which approximates how much the true sampling variance of cluster-level estimates exceeds what the naive model assumes:

\(\text{DEFF} = 1 + (m - 1) \times \text{ICC}\),

where \(m\) is the average cluster size.

With 8 observations per developer and ICC = 0.863:

\(\text{DEFF} = 1 + (8 - 1) \times 0.863 = 1 + 6.04 = 7.04\)

For cluster-level quantities, the true sampling variance is therefore about 7 times what the naive model assumes, and its standard errors are understated by a factor of \(\sqrt{7.04} \approx 2.65\). A confidence interval that should be, say, ±0.5 is instead computed as if it were ±0.19: the naive model is roughly 2.65 times more confident than it has any right to be. (For the within-cluster slope the bias runs the other way; compare the naive SE for review_time, 0.0143, with the mixed model’s 0.0058.)
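The design-effect arithmetic above takes only a couple of lines to verify (a minimal sketch; the ICC value is hard-coded from the model output earlier):

```r
# Design effect: variance inflation for cluster-level estimates
m   <- 8           # observations per developer
icc <- 0.863       # ICC estimated from the fitted model above

deff      <- 1 + (m - 1) * icc  # design effect
se_factor <- sqrt(deff)         # factor by which naive SEs are understated

round(c(DEFF = deff, SE_factor = se_factor), 2)
```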

Visualizing the Random Effects

ranefs <- ranef(mixed_model)$developer
ranefs$developer <- rownames(ranefs)
names(ranefs)[1] <- "intercept_offset"

ggplot(ranefs, aes(x = reorder(developer, intercept_offset),
                   y = intercept_offset)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey50") +
  geom_point(size = 3, color = "#2c7bb6") +
  geom_segment(aes(xend = developer, y = 0, yend = intercept_offset),
               color = "#2c7bb6", linewidth = 0.8) +
  coord_flip() +
  labs(
    x = NULL,
    y = "Deviation from Grand Mean",
    title = "Random Intercepts: How Each Developer Differs"
  ) +
  theme_minimal(base_size = 14)
Each developer’s estimated deviation from the grand mean intercept.

Side-by-Side: Naive LM vs. Mixed Model

library(patchwork)

# Get mixed model predictions
sim_data$pred_mixed <- predict(mixed_model)

p1 <- ggplot(sim_data, aes(x = review_time, y = defects_found)) +
  geom_point(alpha = 0.5, color = "grey40") +
  geom_smooth(method = "lm", color = "#d7191c", linewidth = 1.2, se = FALSE) +
  labs(title = "Naive Linear Regression",
       subtitle = "One line for all developers",
       x = "Review Time", y = "Defects Found") +
  theme_minimal(base_size = 13)

p2 <- ggplot(sim_data, aes(x = review_time, y = defects_found,
                            color = developer)) +
  geom_point(alpha = 0.6, size = 1.5) +
  geom_line(aes(y = pred_mixed), linewidth = 0.8) +
  labs(title = "Mixed-Effects Model",
       subtitle = "Each developer gets their own intercept",
       x = "Review Time", y = "Defects Found") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

p1 + p2
Left: one line for everyone (naive). Right: developer-specific intercepts (mixed model).

The naive model reports one big \(\sigma^2\) and calls it “noise,” while the mixed model reveals that a substantial chunk of it was actually systematic developer differences — not noise at all.

# Naive LM: all unexplained variance in one bucket
sigma_naive <- sigma(lm_naive)^2

# Mixed model: variance split into developer + residual
vc <- as.data.frame(VarCorr(mixed_model))
tau2 <- vc$vcov[1]   # developer
sig2 <- vc$vcov[2]   # residual

cat(sprintf("Naive LM residual variance:   %.2f\n", sigma_naive))
## Naive LM residual variance:   6.56
cat(sprintf("Mixed model τ² + σ²:          %.2f  (τ²=%.2f + σ²=%.2f)\n",
            tau2 + sig2, tau2, sig2))
## Mixed model τ² + σ²:          7.26  (τ²=6.27 + σ²=1.00)

Part 4: Going Further

Random Slopes

So far we assumed every developer has the same slope (same benefit from spending more time). But what if some developers benefit more from longer reviews than others?

\[y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j}) \cdot \text{review\_time}_{ij} + \epsilon_{ij}\]

Now each developer has their own intercept and their own slope.

# Random intercept AND random slope for review_time
mixed_slopes <- lmer(
  defects_found ~ review_time + (1 + review_time | developer),
  data = sim_data
)
summary(mixed_slopes)
## Linear mixed model fit by REML ['lmerMod']
## Formula: defects_found ~ review_time + (1 + review_time | developer)
##    Data: sim_data
## 
## REML criterion at convergence: 216.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8263 -0.6093 -0.1294  0.5092  2.3855 
## 
## Random effects:
##  Groups    Name        Variance  Std.Dev. Corr 
##  developer (Intercept) 7.8798259 2.80710       
##            review_time 0.0001107 0.01052  -0.70
##  Residual              0.9490999 0.97422       
## Number of obs: 64, groups:  developer, 8
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 2.631310   1.037221   2.537
## review_time 0.060470   0.006797   8.897
## 
## Correlation of Fixed Effects:
##             (Intr)
## review_time -0.590
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0496749 (tol = 0.002, component 1)

Note the convergence warning at the bottom of the output: the estimated slope variance is tiny (0.0001), which means the data barely support a random slope here. With only 8 developers this is common, and such a warning is usually a signal to simplify the random-effects structure.

When to use random slopes:

  • When you have theoretical reason to believe the effect of your predictor varies across groups.
  • When you have enough data per group to estimate them (rule of thumb: 5+ observations per group, ideally more).
  • Barr et al. (2013) recommend “maximal” random effects structures justified by the design—but in practice, you often need to simplify for convergence.

Multiple Grouping Factors

SE data often has crossed or nested random effects. For instance, if developers review code from multiple projects:

# Crossed random effects: developers AND projects
lmer(defects ~ review_time + (1 | developer) + (1 | project), data = mydata)

# Nested: developers within teams
lmer(defects ~ review_time + (1 | team/developer), data = mydata)

This is where mixed models really shine over the dummy-variable approach—handling complex grouping structures naturally.


Part 5: The Cheat Sheet

When to Use What

  • One observation per subject, no grouping: linear regression (lm).
  • Repeated measures, few groups (< 5): fixed effects / ANOVA may work.
  • Repeated measures, groups sampled from a larger population: mixed-effects model (lmer).
  • Multiple crossed grouping factors: mixed-effects model (lmer).
  • You want to generalize beyond the specific groups in your data: mixed-effects model (lmer).

Common SE Research Scenarios

  • Tool evaluation with repeated tasks. Fixed: tool (on/off), task difficulty. Random: participant, task.
  • Mining defects across projects. Fixed: code metrics (LOC, complexity). Random: project, module.
  • Developer productivity over time. Fixed: time period, methodology. Random: developer.
  • Code review effectiveness. Fixed: review type, reviewer experience. Random: pull request, repository.

Quick Reference: lmer Syntax

  • (1 | group): random intercept per group.
  • (1 + x | group): random intercept plus random slope for x.
  • (1 | g1) + (1 | g2): crossed random effects.
  • (1 | g1/g2): g2 nested within g1.
  • (0 + x | group): random slope only (no random intercept).

Reporting Results

When you publish, report:

  1. The fixed effects with estimates, standard errors, and confidence intervals.
  2. The random effects variance components (and ICC if relevant).
  3. The number of observations and the number of groups.
  4. How you handled model selection and convergence.
# Use confint for confidence intervals (profile likelihood)
# confint(mixed_model)  # can be slow; use method="Wald" for speed

# For p-values, use lmerTest
library(lmerTest)

mixed_with_p <- lmer(defects_found ~ review_time + (1 | developer),
                     data = sim_data)
summary(mixed_with_p)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: defects_found ~ review_time + (1 | developer)
##    Data: sim_data
## 
## REML criterion at convergence: 217.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.6776 -0.6242 -0.1496  0.5107  2.5216 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  developer (Intercept) 6.2673   2.5034  
##  Residual              0.9971   0.9986  
## Number of obs: 64, groups:  developer, 8
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  2.642825   0.936668  8.421160   2.822   0.0213 *  
## review_time  0.060726   0.005791 55.178496  10.486 9.54e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## review_time -0.299
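For the first item on the checklist, lme4’s confint() produces intervals directly. The sketch below refits a small toy model (data simulated to resemble sim_data) so it runs standalone; parm = "beta_" restricts the output to the fixed effects:

```r
library(lme4)

set.seed(1)
# Toy data shaped like sim_data: 8 developers, 8 reviews each
toy <- data.frame(
  developer   = factor(rep(paste0("Dev_", LETTERS[1:8]), each = 8)),
  review_time = runif(64, 10, 90)
)
toy$defects_found <- 2 + rep(rnorm(8, 0, 2.5), each = 8) +
                     0.06 * toy$review_time + rnorm(64)

toy_model <- lmer(defects_found ~ review_time + (1 | developer), data = toy)

# Wald intervals are fast; the profile-likelihood default is more
# accurate but slower. "beta_" selects only the fixed effects.
confint(toy_model, parm = "beta_", method = "Wald")
```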

Key Takeaways

  1. Linear regression assumes independence. In SE experiments, observations are almost never independent—they’re grouped by developer, project, or task.

  2. Ignoring grouping corrupts your inference. Depending on how the predictor varies across groups, you will either report effects as significant when they are not, or miss effects that are real. Both failure modes appear in the SE literature.

  3. Mixed models account for grouping structure. Random effects absorb the group-level variation so your fixed-effect estimates and standard errors are correct.

  4. Think of random effects as “nuisance variation.” You don’t care which developer is best—you care whether the tool works across developers.

  5. Start simple. Fit a random-intercept model first. Add random slopes only if justified by your design and supported by your data.


Further Reading

  • Winter, B. (2020). Statistics for Linguists: An Introduction Using R. Chapters 10–12 give an excellent accessible introduction to mixed models.
  • Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing. Journal of Memory and Language, 68(3), 255–278.
  • Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

This lecture was designed for PhD students in software engineering. The examples use simulated data to make the concepts transparent. In your own research, check model assumptions (normality of residuals, homoscedasticity) and consider whether a generalized linear mixed model (GLMM) is more appropriate for your outcome variable (e.g., Poisson for counts, logistic for binary outcomes).
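To make that last point concrete, here is a sketch of a Poisson GLMM for the review example, fit with lme4’s glmer() on freshly simulated toy counts (the variable names mirror the earlier examples, but the data and coefficients here are illustrative):

```r
library(lme4)

set.seed(99)
# Toy count data: 8 developers, 8 reviews each, with developer-level
# variation entering on the log scale
toy <- data.frame(
  developer    = factor(rep(paste0("Dev_", LETTERS[1:8]), each = 8)),
  review_hours = runif(64, 10, 90) / 60
)
log_rate <- 0.5 + 0.9 * toy$review_hours + rep(rnorm(8, 0, 0.3), each = 8)
toy$defects_found <- rpois(64, lambda = exp(log_rate))

# Poisson GLMM: same random-intercept structure, count-appropriate link
glmm_fit <- glmer(defects_found ~ review_hours + (1 | developer),
                  family = poisson, data = toy)
exp(fixef(glmm_fit))   # back-transform: multiplicative rate ratios
```

Because the model works on the log scale, exponentiating the slope gives a rate ratio: the multiplicative change in the expected defect count per additional hour of review.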