Getting Started
This notebook walks through a minimal example of an Oaxaca-Blinder decomposition.
import pandas as pd
from oaxaca import Oaxaca
We load a sample dataset of Hispanic workers in the Chicago metropolitan area. The goal is to explain the wage gap between native and foreign-born workers. The data is taken from the oaxaca R package.
df = pd.read_csv("sample_data.csv")
df.head()
 | age | female | foreign_born | LTHS | high_school | some_college | college | advanced_degree | education_level | ln_real_wage |
---|---|---|---|---|---|---|---|---|---|---|
0 | 52 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | high_school | 2.140066 |
1 | 46 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | high_school | NaN |
2 | 31 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | high_school | 2.499795 |
3 | 35 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | high_school | 2.708050 |
4 | 19 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | high_school | 2.079442 |
We fit the Oaxaca model, using an R-style formula to describe the regression.
model = Oaxaca().fit(
    formula="exp(ln_real_wage) ~ -1 + age + female + C(education_level)",
    data=df,
    group_variable="foreign_born",
)
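To make the formula concrete, the sketch below builds by hand the outcome and regressor values that the formula describes for the first row of the sample data. It only illustrates what the formula terms mean and does not show how the oaxaca package parses formulas internally. The helper name design_row and the labels it emits are hypothetical, chosen to mirror the variable labels in the output tables.

```python
import math

# Education categories as they appear in the sample data.
EDUCATION_LEVELS = ["LTHS", "advanced_degree", "college", "high_school", "some_college"]


def design_row(record):
    """Return (outcome, regressors) for one observation.

    Hypothetical helper that mirrors the formula
    exp(ln_real_wage) ~ -1 + age + female + C(education_level).
    """
    # exp(...) converts the log wage back to a wage level.
    outcome = math.exp(record["ln_real_wage"])
    # "-1" drops the intercept, so every education level keeps its own indicator.
    regressors = {"age": record["age"], "female": record["female"]}
    for level in EDUCATION_LEVELS:
        regressors[f"C(education_level)[{level}]"] = int(record["education_level"] == level)
    return outcome, regressors


# First row of the df.head() output above.
row_0 = {"age": 52, "female": 0, "education_level": "high_school", "ln_real_wage": 2.140066}
outcome, regressors = design_row(row_0)
print(round(outcome, 2))  # 8.5
print(regressors)
```

Because the outcome is exponentiated, the decomposition is expressed in wage levels rather than log points.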
From the model fit, we can generate two-fold and three-fold decomposition results.
Two-fold decomposition
This approach decomposes the wage difference into an explained part and an unexplained part.
The weights argument specifies which group is considered non-discriminated. Here, weights={0: 1.0, 1: 0.0} means that native workers (i.e. those with foreign_born = 0) are the non-discriminated group.
We see that differences in covariates explain only 6% of the overall difference. The remaining 94% is unexplained, which suggests discrimination, although the unexplained part also absorbs any factors omitted from the model.
twofold_decomposition = model.two_fold(weights={0: 1.0, 1: 0.0})
twofold_decomposition
Oaxaca-Blinder Decomposition Results
Group Variable: foreign_born | Groups: 0 vs 1 | Direction: 0 - 1 | Weights: Group 0: 1.000, Group 1: 0.000
Mean Outcomes: Group 0: 17.5828 | Group 1: 14.5672 | Difference: 3.0156
Detailed Variable Contributions
Variable | Explained | Explained% | Unexplained | Unexplained% | Total | Total% |
---|---|---|---|---|---|---|
age | -1.7491 | -58.0% | 7.5585 | 250.6% | 5.8094 | 192.6% |
female | -0.5231 | -17.3% | -1.1653 | -38.6% | -1.6883 | -56.0% |
C(education_level) | 2.4545 | 81.4% | -3.5599 | -118.1% | -1.1055 | -36.7% |
C(education_level)[LTHS] | -1.5272 | -50.6% | -1.5892 | -52.7% | -3.1163 | -103.3% |
C(education_level)[advanced_degree] | 0.8993 | 29.8% | -0.4043 | -13.4% | 0.4950 | 16.4% |
C(education_level)[college] | 0.8965 | 29.7% | 0.2337 | 7.8% | 1.1303 | 37.5% |
C(education_level)[high_school] | -0.5832 | -19.3% | -1.2954 | -43.0% | -1.8786 | -62.3% |
C(education_level)[some_college] | 2.7690 | 91.8% | -0.5048 | -16.7% | 2.2642 | 75.1% |
Total | 0.1822 | 6.0% | 2.8333 | 94.0% | 3.0156 | 100.0% |
💡 For programmatic access:
• contributions - aggregated categorical variables
• detailed_contributions - individual categories with hierarchy
• removal_info - per-group removal impact details
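The percentage columns and the aggregated C(education_level) row can be checked by hand from the numbers printed in the table. The sketch below assumes, based only on those printed values, that each percentage is the contribution divided by the total difference and that the aggregated row is the sum of its category rows. The variable names are illustrative.

```python
# Values copied from the two-fold table above (rounded to 4 decimals).
total_difference = 3.0156
explained_total = 0.1822
unexplained_total = 2.8333

# Explained contributions of the individual education categories.
education_explained = {
    "LTHS": -1.5272,
    "advanced_degree": 0.8993,
    "college": 0.8965,
    "high_school": -0.5832,
    "some_college": 2.7690,
}

# Share of the total gap attributed to each part, in percent.
explained_pct = 100 * explained_total / total_difference
unexplained_pct = 100 * unexplained_total / total_difference

# Aggregated categorical contribution as the sum of its categories.
education_sum = sum(education_explained.values())

print(f"Explained%:   {explained_pct:.1f}")    # 6.0
print(f"Unexplained%: {unexplained_pct:.1f}")  # 94.0
print(f"C(education_level) explained: {education_sum:.4f}")  # table reports 2.4545
```

Any small mismatch in the last digit comes from summing values that were already rounded for display.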
It's worth noting that age accounts for a large portion of the difference. We can zoom into how age differs between native and foreign-born workers with print_x(), and the impact of age on wage with print_ols().
twofold_decomposition.print_x()
Difference in X (Predictor Variables) Between Groups
================================================================================
Group Variable: foreign_born
Groups: 0 (Group 0) vs 1 (Group 1)
Difference = Group 0 Mean - Group 1 Mean
Variable                                0 Mean    1 Mean  Difference
----------------------------------------------------------------------------------------
age                                    34.0105   40.6359     -6.6254
female                                  0.4808    0.3958      0.0851
C(education_level)[LTHS]                0.1185    0.3879     -0.2694
C(education_level)[advanced_degree]     0.0697    0.0317      0.0380
C(education_level)[college]             0.1289    0.0818      0.0471
C(education_level)[high_school]         0.2962    0.3641     -0.0679
C(education_level)[some_college]        0.3868    0.1346      0.2522
twofold_decomposition.print_ols()
OLS Regression Results by Group
============================================================
Group: 0
----------------------------------------
Number of observations: 287
R-squared: 0.3268
Mean of dependent variable: 17.5828
Std of dependent variable: 12.0486
Coefficients:
Variable                                 Coeff   Std Err         t     P>|t|
-------------------------------------------------------------
age                                     0.2640    0.0468     5.637     0.000
female                                 -6.1497    1.1929    -5.155     0.000
C(education_level)[LTHS]                5.6689    2.3922     2.370     0.018
C(education_level)[advanced_degree]    23.6506    2.9415     8.040     0.000
C(education_level)[college]            19.0245    2.4208     7.859     0.000
C(education_level)[high_school]         8.5831    1.9820     4.331     0.000
C(education_level)[some_college]       10.9798    1.9020     5.773     0.000
Group: 1
----------------------------------------
Number of observations: 379
R-squared: 0.3195
Mean of dependent variable: 14.5672
Std of dependent variable: 8.9085
Coefficients:
Variable                                 Coeff   Std Err         t     P>|t|
-------------------------------------------------------------
age                                     0.0780    0.0310     2.519     0.012
female                                 -3.2055    0.7868    -4.074     0.000
C(education_level)[LTHS]                9.7661    1.4673     6.656     0.000
C(education_level)[advanced_degree]    36.4205    2.4641    14.781     0.000
C(education_level)[college]            16.1671    1.9121     8.455     0.000
C(education_level)[high_school]        12.1407    1.4121     8.598     0.000
C(education_level)[some_college]       14.7311    1.7540     8.399     0.000
Coefficient Comparison Between Groups
================================================================================
Direction: 0 - 1
Variable                               Group 0   Group 1  Difference
-------------------------------------------------------------------------------
age                                     0.2640    0.0780      0.1860
female                                 -6.1497   -3.2055     -2.9442
C(education_level)[LTHS]                5.6689    9.7661     -4.0972
C(education_level)[advanced_degree]    23.6506   36.4205    -12.7699
C(education_level)[college]            19.0245   16.1671      2.8573
C(education_level)[high_school]         8.5831   12.1407     -3.5576
C(education_level)[some_college]       10.9798   14.7311     -3.7514
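The age row of the two-fold table can be reproduced from the means and coefficients printed above. The sketch below uses the textbook two-fold formula, in which the reference coefficient is a weighted average of the two group coefficients. That the package computes its results in exactly this way is an assumption, although the numbers agree up to rounding. The function name two_fold_term is hypothetical.

```python
def two_fold_term(mean_0, mean_1, beta_0, beta_1, w0=1.0):
    """Explained and unexplained contribution of a single regressor.

    w0 is the weight on group 0's coefficient in the reference
    coefficient; w0=1.0 corresponds to weights={0: 1.0, 1: 0.0}.
    """
    beta_ref = w0 * beta_0 + (1.0 - w0) * beta_1
    # Gap in characteristics, valued at the reference coefficient.
    explained = (mean_0 - mean_1) * beta_ref
    # Gap in coefficients relative to the reference, valued at each group's mean.
    unexplained = mean_0 * (beta_0 - beta_ref) + mean_1 * (beta_ref - beta_1)
    return explained, unexplained


# Age means from print_x() and age coefficients from print_ols().
explained, unexplained = two_fold_term(
    mean_0=34.0105, mean_1=40.6359, beta_0=0.2640, beta_1=0.0780
)
print(f"explained:   {explained:.4f}")    # table reports -1.7491
print(f"unexplained: {unexplained:.4f}")  # table reports 7.5585 (inputs here are rounded)
```

Native workers are younger on average, so with a positive return to age the explained part is negative. The large unexplained part reflects the higher return to age estimated for native workers (0.2640 vs 0.0780).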
Three-fold decomposition
This approach decomposes the wage gap into three parts: Endowment, Coefficient, and Interaction.
threefold_decomposition = model.three_fold()
threefold_decomposition
Oaxaca-Blinder Decomposition Results (Three-fold)
Group Variable: foreign_born | Groups: 0 vs 1 | Direction: 0 - 1
Mean Outcomes: Group 0: 17.5828 | Group 1: 14.5672 | Difference: 3.0156
Detailed Variable Contributions
Variable | Endowment | Endowment% | Coefficient | Coefficient% | Interaction | Interaction% | Total | Total% |
---|---|---|---|---|---|---|---|---|
age | -0.5168 | -17.1% | 7.5585 | 250.6% | -1.2324 | -40.9% | 5.8094 | 192.6% |
female | -0.2727 | -9.0% | -1.1653 | -38.6% | -0.2504 | -8.3% | -1.6883 | -56.0% |
C(education_level) | 2.4060 | 79.8% | -3.5599 | -118.1% | 0.0485 | 1.6% | -1.1055 | -36.7% |
C(education_level)[LTHS] | -2.6310 | -87.2% | -1.5892 | -52.7% | 1.1038 | 36.6% | -3.1163 | -103.3% |
C(education_level)[advanced_degree] | 1.3849 | 45.9% | -0.4043 | -13.4% | -0.4856 | -16.1% | 0.4950 | 16.4% |
C(education_level)[college] | 0.7619 | 25.3% | 0.2337 | 7.8% | 0.1347 | 4.5% | 1.1303 | 37.5% |
C(education_level)[high_school] | -0.8249 | -27.4% | -1.2954 | -43.0% | 0.2417 | 8.0% | -1.8786 | -62.3% |
C(education_level)[some_college] | 3.7151 | 123.2% | -0.5048 | -16.7% | -0.9461 | -31.4% | 2.2642 | 75.1% |
Total | 1.6165 | 53.6% | 2.8333 | 94.0% | -1.4343 | -47.6% | 3.0156 | 100.0% |
💡 For programmatic access:
• contributions - aggregated categorical variables
• detailed_contributions - individual categories with hierarchy
• removal_info - per-group removal impact details
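As a check on how the three parts relate, the age row of the three-fold table can be rebuilt from the age means in the print_x() output and the age coefficients in the print_ols() output. The sketch below uses the standard three-fold formula written from the viewpoint of group 1. Whether the package uses this exact formulation internally is an assumption, though the results match the table up to rounding. The function name three_fold_term is hypothetical.

```python
def three_fold_term(mean_0, mean_1, beta_0, beta_1):
    """Endowment, coefficient and interaction terms for one regressor,
    for the direction group 0 minus group 1."""
    # Gap in characteristics, valued at group 1's coefficient.
    endowment = (mean_0 - mean_1) * beta_1
    # Gap in coefficients, valued at group 1's mean.
    coefficient = mean_1 * (beta_0 - beta_1)
    # Part due to both gaps occurring together.
    interaction = (mean_0 - mean_1) * (beta_0 - beta_1)
    return endowment, coefficient, interaction


# Age: group means and OLS coefficients for groups 0 and 1.
endowment, coefficient, interaction = three_fold_term(
    mean_0=34.0105, mean_1=40.6359, beta_0=0.2640, beta_1=0.0780
)
print(f"endowment:   {endowment:.4f}")    # table reports -0.5168
print(f"coefficient: {coefficient:.4f}")  # table reports 7.5585 (inputs here are rounded)
print(f"interaction: {interaction:.4f}")  # table reports -1.2324
```

The three terms add up to the Total column for age (5.8094, up to rounding), and the coefficient term equals the unexplained part of the two-fold result with weights={0: 1.0, 1: 0.0}.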
# Inspect the direction and total difference reported by the existing decompositions
# First, check the direction and total_difference of the two-fold result
print("Current twofold_decomposition:")
print(f"Direction: {twofold_decomposition.direction}")
print(f"Total difference: {twofold_decomposition.total_difference}")
print(f"Groups: {model.groups_}")
print(f"Group 0 mean_y: {model.group_stats_[model.groups_[0]]['mean_y']}")
print(f"Group 1 mean_y: {model.group_stats_[model.groups_[1]]['mean_y']}")
# Calculate what the difference should be for each direction
group_0_mean = model.group_stats_[model.groups_[0]]["mean_y"]
group_1_mean = model.group_stats_[model.groups_[1]]["mean_y"]
print(f"\nGroup 0 - Group 1: {group_0_mean - group_1_mean:.4f}")
print(f"Group 1 - Group 0: {group_1_mean - group_0_mean:.4f}")
# Test threefold decomposition direction
print("\nThreefold decomposition:")
print(f"Direction: {threefold_decomposition.direction}")
print(f"Total difference: {threefold_decomposition.total_difference}")