Interaction Term vs Split Sample

When an analyst wants to show a heterogeneous treatment effect (i.e. a different treatment effect for different groups), should they 1) run one regression with an interaction term, or 2) run multiple regressions, one for each group (aka split sample)? A quick Google search shows how common this question is.

In this post, I will show:

  1. Split sample is analogous to running a regression with an interaction term on every predictor

  2. Using split sample to show a heterogeneous treatment effect is a bad idea

We have outcome \(y\), independent variable of interest \(X_{\text{int}}\), an exogenous covariate \(X_{\text{cov}}\), and group indicator \(G \in \{A, B\}\). We are interested in estimating the effect of \(X_{\text{int}}\) on \(y\), and whether the effect is different across groups \(A\) and \(B\).

Split sample is analogous to a fully interacted regression

In a split sample analysis, we fit one regression for each group, allowing the coefficients for all \(X\)’s (i.e. both \(\beta_{\text{int}}\) and \(\beta_{\text{cov}}\)) to vary. This is equivalent to running a fully interacted regression, as I’ll show below.

Let the true data-generating process (DGP) be a fully interacted model, i.e. both \(X_\text{int}\) and \(X_\text{cov}\) have different effects on \(y\) across groups.

Here’s the DGP in math, where \(G\) is an indicator equal to 1 for group B:

\[y = \beta_0 + \beta_{\text{int}} X_{\text{int}} + \beta_{\text{cov}} X_{\text{cov}} + \beta_G G + \beta_{\text{int}G} X_{\text{int}} G + \beta_{\text{cov}G} X_{\text{cov}} G + \epsilon\]
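Writing the model out separately for each group (\(G = 0\) for group A, \(G = 1\) for group B) makes the equivalence explicit:

\[\begin{align} \text{group A: } y &= \beta_0 + \beta_{\text{int}} X_{\text{int}} + \beta_{\text{cov}} X_{\text{cov}} + \epsilon \\ \text{group B: } y &= (\beta_0 + \beta_G) + (\beta_{\text{int}} + \beta_{\text{int}G}) X_{\text{int}} + (\beta_{\text{cov}} + \beta_{\text{cov}G}) X_{\text{cov}} + \epsilon \end{align}\]

Each group gets its own intercept and its own slopes, which is exactly what fitting a separate regression per group estimates.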

Here’s the DGP simulated in R:

# Generate the Xs and G
n <- 1000
group <- sample(c("A", "B"), size = n, prob = c(0.5, 0.5), replace = TRUE)
X_int <- rnorm(n)
X_cov <- rnorm(n)

# The DGP that generates y
y <- 1 + 2 * X_int + 3 * X_cov + 4 * (group=="B") +
  5 * X_int * (group=="B") + 6 * X_cov * (group=="B") + rnorm(n)

data <- data.frame(y, X_int, X_cov, group)

We then run three regressions: 1) fully interacted, 2) split sample (group A only), and 3) split sample (group B only).

library(stargazer)
stargazer(
  lm(y ~ X_int + X_cov + group + X_int:group + X_cov:group, data = data),
  lm(y ~ X_int + X_cov, data = data[data$group == "A", ]),
  lm(y ~ X_int + X_cov, data = data[data$group == "B", ]),
  column.labels = c("fully interacted", "group A", "group B"),
  keep.stat = c("n", "rsq"), type = "html"
)
===========================================================
                        Dependent variable: y
               --------------------------------------------
               fully interacted     group A      group B
                     (1)              (2)          (3)
-----------------------------------------------------------
X_int              1.914***         1.914***     7.010***
                   (0.044)          (0.044)      (0.050)
X_cov              3.026***         3.026***     8.954***
                   (0.043)          (0.043)      (0.048)
groupB             3.969***
                   (0.063)
X_int:groupB       5.096***
                   (0.066)
X_cov:groupB       5.928***
                   (0.065)
Constant           1.033***         1.033***     5.003***
                   (0.043)          (0.043)      (0.046)
-----------------------------------------------------------
Observations        1,000             533          467
R2                  0.985            0.927        0.991
===========================================================
Note: *p<0.1; **p<0.05; ***p<0.01

We see that all coefficients in (2) group A differ from those in (3) group B, not just \(\beta_{\text{int}}\), the coefficient of \(X_{\text{int}}\). Furthermore, we can recover the split-sample coefficients from the fully interacted model.

\[\begin{align} \beta_{\text{int}A} &= \beta_{\text{int}} \\ \beta_{\text{int}B} &= \beta_{\text{int}} + \beta_{\text{int}G} \\ \beta_{\text{cov}A} &= \beta_{\text{cov}} \\ \beta_{\text{cov}B} &= \beta_{\text{cov}} + \beta_{\text{cov}G} \end{align}\]
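Plugging in the estimates from the table: \(1.914 + 5.096 = 7.010\) and \(3.026 + 5.928 = 8.954\), matching group B’s coefficients exactly. The same check in R (a quick sketch; m_full and m_B are names I introduce here for the two fits above):

m_full <- lm(y ~ X_int + X_cov + group + X_int:group + X_cov:group, data = data)
m_B <- lm(y ~ X_int + X_cov, data = data[data$group == "B", ])

# main effect + interaction from the fully interacted model ...
coef(m_full)["X_int"] + coef(m_full)["X_int:groupB"]
# ... equals the group-B split-sample coefficient
coef(m_B)["X_int"]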

Take-away: Since it’s equivalent to a fully interacted model, split sample analysis shows how all coefficients (not just the coefficient of the variable of interest) differ across groups.

Why split sample is a bad way to show heterogeneous treatment effect

A very common use of split sample analysis is to run separate regressions and, upon observing that the coefficient \(\beta_{\text{int}}\) for \(X_{\text{int}}\) is significant for group \(A\) and insignificant for group \(B\), conclude that its treatment effect is different across groups.

This is wrong because the significance of \(X_{\text{int}}\) depends on the other covariates as well. For example, if there is high multicollinearity in group A between \(X_{\text{int}}\) and \(X_{\text{cov}}\), then \(X_{\text{int}}\) will be statistically insignificant even if its effect in group \(A\) is just as strong as its effect in group \(B\).
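To see how much correlation matters, recall the textbook variance-inflation result for OLS, applied here to our two-regressor setting:

\[\operatorname{Var}(\hat{\beta}_{\text{int}}) \approx \frac{\sigma^2}{n \operatorname{Var}(X_{\text{int}})} \cdot \frac{1}{1 - r^2}\]

where \(r\) is the correlation between \(X_{\text{int}}\) and \(X_{\text{cov}}\). With \(r = 0.8\), as in the simulation below, the standard error of \(\hat{\beta}_{\text{int}}\) is inflated by a factor of \(1/\sqrt{1 - 0.8^2} \approx 1.67\) relative to the uncorrelated case.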

For a substantive example, imagine we want to estimate the effect of income (\(X_{\text{int}}\)) on happiness (\(y\)) across two groups of people, urban and rural dwellers (\(G\)). It happens that the length of commute (\(X_{\text{cov}}\)) matters for happiness too, and commute length is highly correlated with income only for urban dwellers. Thus, if we run a split sample analysis, income will have an insignificant coefficient for urban dwellers because of high multicollinearity, despite having a real impact on happiness.

We simulate that scenario below:

library(mvtnorm)
set.seed(2)

# high multicollinearity in group A (cor = 0.8)
Xs_A <- rmvnorm(100, mean = c(0, 0),
                 sigma = matrix(c(1, 0.8, 0.8, 1), nrow = 2))
X_intA <- Xs_A[, 1] ; X_covA <- Xs_A[, 2]

# low multicollinearity in group B (cor = 0)
Xs_B <- rmvnorm(100, mean = c(0, 0),
                 sigma = matrix(c(1, 0, 0, 1), nrow = 2))
X_intB <- Xs_B[, 1] ; X_covB <- Xs_B[, 2]

# Combine the data from two groups
X_int <- c(X_intA, X_intB)
X_cov <- c(X_covA, X_covB)
group <- c(rep("A", 100), rep("B", 100))

# True DGP has NO INTERACTION EFFECT
# (noise length must match the full sample size of 200;
# rnorm(100) would silently recycle the noise across groups)
y <- 1 + 2 * X_int + 3 * X_cov + 4 * (group=="B") + rnorm(length(X_int), sd = 4)
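As a quick sanity check, the realized correlations should land close to their targets:

cor(X_intA, X_covA)  # close to 0.8 by construction
cor(X_intB, X_covB)  # close to 0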

Notice that the true DGP has no interaction effect. However, if we use split sample analysis, we will see that \(\beta_{\text{int}}\) is insignificant in group \(A\) and significant in group \(B\). Hence, we would wrongly conclude that there is an interaction effect between \(X_{\text{int}}\) and \(G\).

data <- data.frame(y, X_int, X_cov, group)
stargazer(
  lm(y ~ X_int + X_cov, data = data[data$group == "A", ]),
  lm(y ~ X_int + X_cov, data = data[data$group == "B", ]),
  keep.stat = c("n"), column.labels = c("group A", "group B"), type = "html"
)
==============================================
                Dependent variable: y
             ---------------------------------
                 group A         group B
                   (1)             (2)
----------------------------------------------
X_int             0.606          1.801***
                 (0.616)         (0.419)
X_cov            3.691***        2.458***
                 (0.595)         (0.394)
Constant          0.504          4.563***
                 (0.393)         (0.422)
----------------------------------------------
Observations       100             100
==============================================
Note: *p<0.1; **p<0.05; ***p<0.01

What’s worrisome is that we really can’t know how the statistical significance will turn out in a split sample analysis. Below I re-run exactly the same analysis, only using a different random seed, and we now conclude (correctly, but with the wrong methodology) that there’s no interaction effect.

==============================================
                Dependent variable: y
             ---------------------------------
                 group A         group B
                   (1)             (2)
----------------------------------------------
X_int            3.392***        1.663***
                 (0.839)         (0.458)
X_cov             1.143          2.790***
                 (0.824)         (0.480)
Constant          0.813*         4.856***
                 (0.462)         (0.474)
----------------------------------------------
Observations       100             100
==============================================
Note: *p<0.1; **p<0.05; ***p<0.01

Take-away: Don’t use split sample analysis to show a heterogeneous treatment effect for one variable of interest; use a single regression with an interaction term instead, as sketched below.
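On the same simulated data, the correct test is the fully interacted model from the first section: the coefficient on X_int:groupB directly estimates whether the effect of \(X_{\text{int}}\) differs across groups. A minimal sketch (under this DGP the interaction terms should generally come out insignificant, though any single draw can vary):

# X_int:groupB estimates the *difference* in X_int's effect between
# groups, with a standard error that accounts for both subsamples
summary(lm(y ~ X_int * group + X_cov * group, data = data))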

What if you do want to examine heterogeneous treatment effects for all variables?

In this case, I would recommend a multi-level model, which allows co-varying intercepts and coefficients across groups.
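Here is a minimal sketch using lme4 (my package choice for illustration; with only two groups partial pooling buys little, but the same syntax scales to many groups):

library(lme4)

# Random intercept plus random slopes for X_int and X_cov by group:
# each group gets its own partially pooled intercept and coefficients
fit <- lmer(y ~ X_int + X_cov + (1 + X_int + X_cov | group), data = data)
summary(fit)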
