Interaction Term vs Split Sample
When an analyst wants to show heterogeneous treatment effects (i.e. different treatment effects for different groups), should they 1) run one regression with an interaction term, or 2) run multiple regressions, one for each group (aka split sample)? A quick Google search shows how common this question is.
In this post, I will show:
- Split sample is analogous to running a regression with an interaction term for all predictors
- Using split sample to show heterogeneous treatment effects is a bad idea
We have outcome \(y\), independent variable of interest \(X_{\text{int}}\), an exogenous covariate \(X_{\text{cov}}\), and group indicator \(G \in \{A, B\}\). We are interested in estimating the effect of \(X_{\text{int}}\) on \(y\), and whether the effect is different across groups \(A\) and \(B\).
Split sample is analogous to a fully interacted regression
In a split sample analysis, we fit one regression for each group, allowing the coefficients for all \(X\)’s (i.e. both \(\beta_{\text{int}}\) and \(\beta_{\text{cov}}\)) to vary. This is equivalent to running a fully interacted regression, as I’ll show below.
Let the true data-generating process (DGP) be a fully interacted model, i.e. both \(X_\text{int}\) and \(X_\text{cov}\) have different effects on \(y\) across groups.
Here’s the DGP in math:
\[ y = \beta_0 + \beta_{\text{int}} X_{\text{int}} + \beta_{\text{cov}} X_{\text{cov}} + \beta_G G + \beta_{\text{int}G} X_{\text{int}} G + \beta_{\text{cov}G} X_{\text{cov}} G + \epsilon \]
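To see why the two approaches line up, it can help to write the DGP separately for each group. Assuming \(G\) enters the regression as a dummy equal to 0 for group \(A\) and 1 for group \(B\) (a coding choice assumed here for illustration), the equation above reduces to:

\[ \begin{aligned} \text{Group } A\ (G = 0):\quad y &= \beta_0 + \beta_{\text{int}} X_{\text{int}} + \beta_{\text{cov}} X_{\text{cov}} + \epsilon \\ \text{Group } B\ (G = 1):\quad y &= (\beta_0 + \beta_G) + (\beta_{\text{int}} + \beta_{\text{int}G}) X_{\text{int}} + (\beta_{\text{cov}} + \beta_{\text{cov}G}) X_{\text{cov}} + \epsilon \end{aligned} \]

Under that coding, each group gets its own intercept and its own slope on every predictor, which is exactly what fitting a separate regression per group allows.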
Here’s the DGP simulated in R:
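The original code block is not reproduced here, so the following is only a minimal sketch of such a simulation. The seed, the coefficient values (1 through 6), the standard normal noise, and the name `dat` are all assumptions chosen to be roughly in line with the estimates reported in the table, not the author's exact code.

```r
# Minimal sketch of a fully interacted DGP (assumed values, not the author's code).
set.seed(1)  # hypothetical seed; estimates will not match the table exactly

n <- 1000
group <- factor(sample(c("A", "B"), n, replace = TRUE), levels = c("A", "B"))
G <- as.numeric(group == "B")  # dummy: 0 for group A, 1 for group B

X_int <- rnorm(n)  # variable of interest
X_cov <- rnorm(n)  # exogenous covariate

# Assumed true coefficients
b_0 <- 1; b_int <- 2; b_cov <- 3
b_G <- 4; b_intG <- 5; b_covG <- 6

# Both X_int and X_cov have a different effect in group B
y <- b_0 + b_int * X_int + b_cov * X_cov +
  b_G * G + b_intG * X_int * G + b_covG * X_cov * G +
  rnorm(n)  # standard normal noise (assumed)

dat <- data.frame(y, X_int, X_cov, group)
str(dat)
```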
We then run three regressions, 1) full interaction, 2) split sample (group A only), 3) split sample (group B only).
Dependent variable: y

| | fully interacted (1) | group A (2) | group B (3) |
|---|---|---|---|
| X_int | 1.914*** | 1.914*** | 7.010*** |
| | (0.044) | (0.044) | (0.050) |
| X_cov | 3.026*** | 3.026*** | 8.954*** |
| | (0.043) | (0.043) | (0.048) |
| groupB | 3.969*** | | |
| | (0.063) | | |
| X_int:groupB | 5.096*** | | |
| | (0.066) | | |
| X_cov:groupB | 5.928*** | | |
| | (0.065) | | |
| Constant | 1.033*** | 1.033*** | 5.003*** |
| | (0.043) | (0.043) | (0.046) |
| Observations | 1,000 | 533 | 467 |
| R² | 0.985 | 0.927 | 0.991 |

Note: *p<0.1; **p<0.05; ***p<0.01
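We see that all coefficients in (2), group A only, differ from those in (3), group B only, not just the \(\beta_{\text{int}}\) of \(X_{\text{int}}\). Furthermore, we can calculate the split sample coefficients from the fully interacted model. For example, the group B coefficient on X_int in (3) is the sum of the X_int and X_int:groupB coefficients in (1): 1.914 + 5.096 = 7.010.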
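The equivalence can also be checked numerically. The sketch below re-simulates data under assumed values (coefficients 1 through 6, standard normal noise, a hypothetical seed), fits the three regressions, and rebuilds the split sample coefficients from the fully interacted fit. The object names are illustrative.

```r
# Sketch: split sample coefficients recovered from the fully interacted model.
# All simulation settings are assumptions, not the author's values.
set.seed(1)
n <- 1000
group <- factor(sample(c("A", "B"), n, replace = TRUE), levels = c("A", "B"))
G <- as.numeric(group == "B")
X_int <- rnorm(n)
X_cov <- rnorm(n)
y <- 1 + 2 * X_int + 3 * X_cov + 4 * G + 5 * X_int * G + 6 * X_cov * G + rnorm(n)
dat <- data.frame(y, X_int, X_cov, group)

# (1) fully interacted, (2) group A only, (3) group B only
fit_full <- lm(y ~ (X_int + X_cov) * group, data = dat)
fit_A <- lm(y ~ X_int + X_cov, data = dat, subset = group == "A")
fit_B <- lm(y ~ X_int + X_cov, data = dat, subset = group == "B")

b <- coef(fit_full)

# Group A is the reference level, so its coefficients are the main effects
from_full_A <- b[c("(Intercept)", "X_int", "X_cov")]
# Group B coefficients are main effects plus the matching groupB terms
from_full_B <- from_full_A + b[c("groupB", "X_int:groupB", "X_cov:groupB")]

print(rbind(split_A = coef(fit_A), from_full_A = from_full_A))
print(rbind(split_B = coef(fit_B), from_full_B = from_full_B))
```

The point estimates agree up to numerical precision. Standard errors need not match exactly, because the pooled model estimates a single residual variance while each split regression estimates its own.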
Take-away: Since it’s equivalent to a fully interacted model, split sample analysis shows how all coefficients (not just the coefficient of the variable of interest) differ across groups.
Why split sample is a bad way to show heterogeneous treatment effects
A very common use of split sample analysis is to run separate regressions and, upon observing that the coefficient \(\beta_{int}\) for \(X_{\text{int}}\) is significant for group \(A\) and insignificant for group \(B\), conclude that its treatment effect is different across groups.
This is wrong because the significance of \(X_{int}\) depends on more than the size of its effect. It also depends on how precisely that effect is estimated, which in turn depends on the other covariates. For example, if there is high multicollinearity in group \(A\) between \(X_{int}\) and \(X_{cov}\), the standard error of \(\beta_{int}\) in group \(A\) is inflated, so \(X_{int}\) can be statistically insignificant there even if its effect in group \(A\) is just as strong as its effect in group \(B\).
For a substantive example, we could imagine that we want to estimate the effect of income (\(X_{int}\)) on happiness (\(y\)) across two groups of people, urban and rural dwellers (\(G\)). It happens that the length of commute (\(X_{cov}\)) matters for happiness too, and the commute is highly correlated with income only for urban dwellers. Thus, if we run a split sample analysis, income may well have an insignificant coefficient for urban dwellers because of high multicollinearity, despite having a real impact on happiness.
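One way to see the mechanism is the textbook variance formula for an OLS coefficient under homoskedastic errors (a standard result brought in here for illustration, not derived in this post):

\[ \operatorname{Var}(\hat{\beta}_{\text{int}}) = \frac{\sigma^2}{\left(1 - R^2_{\text{int}}\right) \sum_i \left(X_{\text{int},i} - \bar{X}_{\text{int}}\right)^2} \]

Here \(\sigma^2\) is the error variance and \(R^2_{\text{int}}\) is the \(R^2\) from regressing \(X_{\text{int}}\) on the other regressors, which with a single covariate is the squared correlation between \(X_{\text{int}}\) and \(X_{\text{cov}}\). As a hypothetical illustration, a within-group correlation of 0.8 gives \(1 - R^2_{\text{int}} = 0.36\), so the standard error is about \(1/\sqrt{0.36} \approx 1.67\) times larger than with uncorrelated regressors, all else equal.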
We simulate that scenario below:
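The original simulation code is not reproduced here. As a minimal sketch, assuming a correlation of 0.8 between the two predictors in group A, no correlation in group B, a noise standard deviation of 4, and coefficients 1, 2, 3 with a group shift of 4 (all hypothetical values, as are the seed and the helper `sim_group`), the scenario could look like this:

```r
# Sketch: same effect of X_int in both groups, multicollinearity only in group A.
# All numeric settings are assumptions for illustration.
set.seed(1)  # hypothetical seed
n <- 100     # observations per group, as in the tables
rho <- 0.8   # assumed correlation between X_int and X_cov in group A

# Hypothetical helper: simulate one group with correlation rho and intercept shift b_G
sim_group <- function(n, rho, b_G) {
  X_int <- rnorm(n)
  X_cov <- rho * X_int + sqrt(1 - rho^2) * rnorm(n)
  # No interaction: the slopes 2 and 3 are identical in both groups
  y <- 1 + 2 * X_int + 3 * X_cov + b_G + rnorm(n, sd = 4)
  data.frame(y, X_int, X_cov)
}

dat_A <- cbind(sim_group(n, rho = rho, b_G = 0), group = "A")
dat_B <- cbind(sim_group(n, rho = 0, b_G = 4), group = "B")
dat <- rbind(dat_A, dat_B)
dat$group <- factor(dat$group, levels = c("A", "B"))

# Split sample analysis: one regression per group
fit_A <- lm(y ~ X_int + X_cov, data = dat, subset = group == "A")
fit_B <- lm(y ~ X_int + X_cov, data = dat, subset = group == "B")

# Sample correlations between the predictors within each group
print(cor(dat_A$X_int, dat_A$X_cov))
print(cor(dat_B$X_int, dat_B$X_cov))

# Estimates, standard errors, and p-values for each group
print(summary(fit_A)$coefficients)
print(summary(fit_B)$coefficients)
```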
Notice that the true DGP has no interaction effect. However, if we use split sample analysis, we will see that \(\beta_{int}\) is insignificant in group \(A\), and significant in group \(B\). Hence, we would wrongly conclude that there is an interaction effect between \(X_{int}\) and \(G\).
Dependent variable: y

| | group A (1) | group B (2) |
|---|---|---|
| X_int | 0.606 | 1.801*** |
| | (0.616) | (0.419) |
| X_cov | 3.691*** | 2.458*** |
| | (0.595) | (0.394) |
| Constant | 0.504 | 4.563*** |
| | (0.393) | (0.422) |
| Observations | 100 | 100 |

Note: *p<0.1; **p<0.05; ***p<0.01
What’s worrisome is that we really can’t know how the statistical significance will turn out in a split sample analysis. Below I re-run exactly the same analysis, only using a different random seed, and we now conclude (correctly, but with the wrong methodology) that there’s no interaction effect.
Dependent variable: y

| | group A (1) | group B (2) |
|---|---|---|
| X_int | 3.392*** | 1.663*** |
| | (0.839) | (0.458) |
| X_cov | 1.143 | 2.790*** |
| | (0.824) | (0.480) |
| Constant | 0.813* | 4.856*** |
| | (0.462) | (0.474) |
| Observations | 100 | 100 |

Note: *p<0.1; **p<0.05; ***p<0.01
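To get a sense of how often the split sample verdict flips from seed to seed, one option is to repeat the simulation many times and compare it with a single pooled regression that includes an interaction term for \(X_{int}\). The sketch below does this under assumed settings (correlation 0.8 in group A only, noise standard deviation 4, 200 replications, a 5% threshold). The function name `sim_once` and every numeric value are hypothetical.

```r
# Sketch: how often split sample "significance" disagrees across groups,
# versus how often a pooled interaction test rejects. True interaction is zero.
sim_once <- function(n = 100, rho = 0.8) {
  # Simulate one group; slopes on X_int and X_cov are the same in both groups
  make <- function(rho, b_G, label) {
    X_int <- rnorm(n)
    X_cov <- rho * X_int + sqrt(1 - rho^2) * rnorm(n)
    y <- 1 + 2 * X_int + 3 * X_cov + b_G + rnorm(n, sd = 4)
    data.frame(y, X_int, X_cov, group = label)
  }
  dat <- rbind(make(rho, 0, "A"), make(0, 4, "B"))
  dat$group <- factor(dat$group, levels = c("A", "B"))

  # p-value of X_int in each split sample regression
  fit_A <- lm(y ~ X_int + X_cov, data = dat, subset = group == "A")
  fit_B <- lm(y ~ X_int + X_cov, data = dat, subset = group == "B")
  p_A <- summary(fit_A)$coefficients["X_int", "Pr(>|t|)"]
  p_B <- summary(fit_B)$coefficients["X_int", "Pr(>|t|)"]

  # p-value of the interaction term in one pooled regression
  fit_pooled <- lm(y ~ X_int * group + X_cov, data = dat)
  p_inter <- summary(fit_pooled)$coefficients["X_int:groupB", "Pr(>|t|)"]

  c(split_disagree = (p_A < 0.05) != (p_B < 0.05),
    interaction_sig = p_inter < 0.05)
}

set.seed(1)  # hypothetical seed
res <- replicate(200, sim_once())  # 2 x 200 logical matrix
rates <- rowMeans(res)             # share of replications where each event occurs
print(rates)
```

Because the simulated DGP has no interaction, the pooled interaction test should reject at roughly its nominal 5% rate. The split sample comparison, by contrast, reports a "difference" whenever \(X_{int}\) happens to clear the threshold in one group but not the other, which depends on the power in each group rather than on any difference in effects.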
Take-away: Don’t use split sample analysis to show heterogeneous treatment effects for one variable of interest.
What if you do want to examine heterogeneous treatment effects for all variables?
In this case, I would recommend a multi-level model, which allows co-varying intercepts and coefficients across groups.