This notebook is an exploration of causal inference in python using the famous Lalonde dataset.
The study looked at the effectiveness of a job training program (the treatment) on the real earnings of individuals a couple of years after completing the program.
The data consists of a number of demographic variables (age, race, academic background, and previous real earnings), as well as a treatment indicator, and the real earnings in the year 1978 (the response).
Robert J. LaLonde, "Evaluating the Econometric Evaluations of Training Programs with Experimental Data", American Economic Review, Vol. 76 (1986), pp. 604-620
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In this notebook we'll be using the tools provided by Laurence Wong in the CausalInference package. Comments on what each function does come from the very good package documentation: http://laurence-wong.com/software/
This package relies heavily on the Rubin causal model, and so will this analysis: https://en.wikipedia.org/wiki/Rubin_causal_model
Several models exist because it is impossible to observe the causal effect on a single unit: assumptions must be made to estimate the missing counterfactuals. We'll explain what all that means in this post.
# https://pypi.org/project/CausalInference/
from causalinference import CausalModel
lalonde = pd.read_csv('lalonde.csv', index_col=0)
lalonde.head()
# let's have an overview of the data
lalonde.describe()
Here is the raw difference in earnings between the control group and the treated group:
lalonde.groupby('treat')['re78'].agg(['median','mean'])
The control group has higher earnings than the treatment group - does this mean the treatment had a negative impact?
lalonde.groupby('treat')['re78'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
This dataset is not from a balanced trial: people in the control group are very different from people in the test (treatment) group. Below are plots of the distributions of 1974 earnings, age, and education in each group:
lalonde.groupby('treat')['re74'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
lalonde.groupby('treat')['age'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
lalonde.groupby('treat')['educ'].plot(kind='hist', bins=20, alpha=0.8, legend=True)
Notations. $D$ is the treatment indicator ($D=1$ for treated subjects), $Y$ the observed 1978 earnings, $Y_0$ and $Y_1$ (also written $Y(0)$ and $Y(1)$) the potential outcomes without and with treatment, and $X$ the vector of covariates.
Aims. What we want to know here is the Average Treatment Effect (ATE): $$\Delta = E[Y_{1} - Y_{0}]$$
However, as we saw, if we try to estimate this quantity from the raw observational distribution, we get: $$\Delta_{raw} = E[Y|D=1] - E[Y|D=0] = E[Y_{1}|D=1] - E[Y_{0}|D=0] \\ \neq \Delta = E[Y_{1} - Y_{0}]$$
because: $$E[Y_{i}|D=i] \neq E[Y_{i}]$$
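To make this concrete, here is a small illustrative simulation (not part of the Lalonde data): the true treatment effect is fixed at +1000, but treatment is assigned more often to low earners, so the raw difference in means is badly biased.
# illustrative simulation: the true effect is +1000, but poorer subjects
# are more likely to be treated, so the raw difference in means is biased
rng = np.random.RandomState(0)
n = 10000
baseline = rng.normal(5000, 1000, n)                 # confounder driving earnings
p_treat = 1 / (1 + np.exp((baseline - 5000) / 500))  # poorer -> more likely treated
d = rng.binomial(1, p_treat)
y = baseline + 1000 * d + rng.normal(0, 500, n)
print(y[d == 1].mean() - y[d == 0].mean())  # well below the true +1000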
General problem. If we believe that age, education, race, and marital status all have a likely influence on earnings Y, we need a way to disentangle the effect of D on Y from the confounding effect of X on Y.
Assumptions. The CausalInference package is based on a typical assumption called unconfoundedness or ignorability: $$(Y(0), Y(1)) \perp D \; | \; X$$
Indeed we saw that the treatment assignment is probably not independent of each subject's potential outcomes, e.g. poorer people are more represented in the treatment group than in the control group.
However the treatment is assumed to be unconfounded in the sense that the dependence between the treatment assignment and the outcomes is only through something we observe, namely the covariates X.
What this means is that if we control for X, i.e. look across people with similar levels of X, then the difference between treated and control should be attributable to the treatment itself, just as a randomized experiment would be.
This is the assumption, and if it doesn't hold our results could be completely wrong.
The simplest type of model we can use is a linear model:
$$Y_0 = \alpha + \beta X + \epsilon$$ $$Y_1 = Y_0 + \gamma$$
If this is accurate, fitting the following model to the data using linear regression will give us an estimate of the Average Treatment Effect (ATE): $$Y = \alpha + \beta X + \gamma D + \epsilon$$
$\epsilon$ is the residual (error) term and represents the noise.
covariates = ['age', 'educ', 'black', 'hispan', 'married', 'nodegree', 're74', 're75']
# we use the CausalModel method from the causalinference package
causal = CausalModel(
Y=lalonde['re78'].values,
D=lalonde['treat'].values,
X=lalonde[covariates].values)
causal.est_via_ols(adj=1)
# adj=1 adjusts for the covariates linearly, with no treatment-covariate
# interaction: this is the "constant treatment effect" model described above
print(causal.estimates)
This model predicts that the Average Treatment Effect (ATE, the job training) is $1548 of extra annual earnings. This is very different from our raw comparison, which suggested the job training had a negative effect on earnings!
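As a sanity check, we can fit the same linear model directly; this is a minimal sketch assuming statsmodels is installed (X_ols and ols_fit are our own names), and the coefficient on treat should closely match the ATE above.
import statsmodels.api as sm
# regress re78 on an intercept, the treatment indicator and the covariates;
# under the constant-treatment-effect model, the 'treat' coefficient is the ATE
X_ols = sm.add_constant(lalonde[['treat'] + covariates])
ols_fit = sm.OLS(lalonde['re78'], X_ols).fit()
print(ols_fit.params['treat'], ols_fit.bse['treat'])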
Assuming that our model accurately describes the counterfactual outcomes, CausalModel provides the 95% confidence interval. What this means is that, if we were to repeat this experiment many times and compute an interval each time, about 95% of those intervals would contain the true Average Treatment Effect. That doesn't mean the true value is within this particular interval.
Based on the assumption that the residuals are normally distributed, the 95% confidence interval is calculated as: $$\hat{\Delta} \pm 1.96 \cdot \hat{\sigma} / \sqrt{n}$$
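For instance, the interval printed above can be rebuilt from the point estimate and its standard error; this sketch assumes the dict-like access that the package exposes on causal.estimates.
# reconstruct the 95% confidence interval for the OLS ATE
ate = causal.estimates['ols']['ate']
se = causal.estimates['ols']['ate_se']
print(ate - 1.96 * se, ate + 1.96 * se)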
In practice, the confidence interval is very wide, so my interpretation is that the experiment would have needed more participants to estimate the extra earnings more precisely. Ways to reduce the variance of the estimator could also be explored.
Overall, assuming that we controlled for all the effects and did it well, it seems that the job training had a positive effect on earnings. Indeed, although the standard error is large, the p-value of 0.035 rejects the null hypothesis (no effect) at the 5% significance level. However, the truth is that we don't know if we modelled the counterfactual well, and this could change everything... As we will see later, estimators such as the Ordinary Least Squares (OLS) estimator can behave poorly when there is not enough covariate overlap, because the estimator needs to extrapolate too much from one group to the other.
A more structured approach, described below, can increase our confidence that the covariates are well controlled for. There will be many steps, but one simple underlying idea is the technique of matching: find, for each treated sample, a similar sample in the control group, and compare their outcomes directly.
The analysis proceeds in two phases. Pre-processing phase: assess covariate balance, estimate propensity scores, trim the sample, and stratify it. Estimation phase: estimate the treatment effect via blocking and via matching.
In the pre-processing phase, the data is inspected and manipulated to allow credible analysis to be conducted on it.
As we discussed in the previous section, one key method for disentangling the treatment effect from the covariate effects is the matching technique, in which we compare subjects that have similar covariate values (i.e. same age, race, income, etc.). However, our ability to compare such pairs depends heavily on the degree of overlap of the covariates between the treatment and control groups. This is called covariate balance.
Put differently, to control for the effect of education, one way is to look at people in the treatment group and in the control group who all have the same level of education, say a bachelor's degree. However, if nobody in the treatment group has a bachelor's degree while many in the control group do, this procedure is impossible.
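As a rough illustration of this idea on our data, we can compare mean 1978 earnings across treatment groups within each education level, i.e. exact matching on a single covariate; education levels present in only one group show up as empty cells that cannot be compared.
# mean re78 per education level and treatment status;
# NaN cells are education levels with no subjects on one side
lalonde.groupby(['educ', 'treat'])['re78'].mean().unstack()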
(1) Assess covariate balance, to gauge how easily people can be matched. If there is too much imbalance, direct matching will rarely be possible, and we may need to use more complex techniques, if any apply at all.
lalonde.columns
print(causal.summary_stats)
Raw-diff is the raw difference between the means of the control and treatment groups.
As we saw previously, the treated (trained) group earns $635 less than the control group, which is surprising.
Nor-diff in this package is Imbens and Rubin's (2015) normalized difference in average covariates, defined as: $$\frac{\bar{X}_{k,t}-\bar{X}_{k,c}}{\sqrt{(s_{k,t}^2 + s_{k,c}^2) / 2}}$$
Here $\bar{X}_{k,t}$ and $s_{k,t}$ are the sample mean and sample standard deviation of the kth covariate of the treatment group, and $\bar{X}_{k,c}$ and $s_{k,c}$ are the analogous statistics for the control group.
The aim here is to assess the overlap between the control and treatment groups. It can be seen that X2, X4, and X6 (black, married, earnings in 1974) have a large normalized difference, beyond 0.5, which can be interpreted as an imbalance. Concretely, the treatment group has far more black subjects, fewer married subjects, and lower 1974 earnings than the control group.
Imbalance makes the matching technique harder to apply. We'll see later how we can try to correct for it (ideally, though, the study would have been more balanced!).
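These normalized differences can also be computed by hand from the raw dataframe, as a sketch of the formula above:
# normalized difference of each covariate between treatment and control
treated = lalonde[lalonde['treat'] == 1][covariates]
control = lalonde[lalonde['treat'] == 0][covariates]
print((treated.mean() - control.mean()) / np.sqrt((treated.var() + control.var()) / 2))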
(2) Propensity Score - the probability of receiving the treatment, conditional on the covariates.
The propensity score is useful for assessing and improving covariate balance. Indeed, a theorem by Rosenbaum and Rubin (1983) proves that, for subjects that share the same propensity score (even if their covariate vectors are different), the difference between the treated and the control units identifies a conditional average treatment effect.
Thus, instead of matching on the covariate vectors X themselves, we can match on the one-dimensional propensity score p(X), aggregate across subjects, and still arrive at a valid estimate of the overall average treatment effect:
$$E\big[\,E[Y(1) - Y(0) \mid p(X)]\,\big] = E[Y(1) - Y(0)]$$
Here $p(X) = P(D=1 | X)$ is the propensity score, which the CausalInference package estimates for us using a sequence of likelihood ratio tests.
reference: http://laurence-wong.com/software/propensity-score
#this function estimates the propensity score, so that propensity methods can be employed
causal.est_propensity_s()
print(causal.propensity)
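As a rough cross-check of these scores, we can fit a plain logistic regression of the treatment on the covariates; this is a sketch assuming scikit-learn is available, without the package's term selection (and scikit-learn regularizes by default, so the numbers will differ slightly).
from sklearn.linear_model import LogisticRegression
# plain logistic regression of treatment on covariates (no term selection)
logit = LogisticRegression(max_iter=1000).fit(lalonde[covariates], lalonde['treat'])
pscore_check = logit.predict_proba(lalonde[covariates])[:, 1]
print(pscore_check[:5])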
(3) Trim sample. This excludes subjects with extreme propensity scores. Indeed it will be very hard to match those extreme subjects, so the usual strategy is to focus attention on the remaining units that exhibit a higher degree of covariate balance.
# an extreme propensity score is a very high probability of being in either
# the control group or the treatment group, which makes matching difficult
# by default, causal.cutoff is set to 1 (nothing is trimmed)
# setting it to 0.1 and calling trim() would drop units whose estimated
# propensity score is below 0.1 or above 0.9:
# causal.cutoff = 0.1
# causal.trim()
# instead, trim_s() uses a procedure that tries to select an optimal cutoff
causal.trim_s()
print(causal.summary_stats)
In this new subset, the normalized differences are rather balanced for most variables. Only X2 (black) is still unbalanced.
It is worth noting that the initial sample of 614 people (429 controls, 185 treated) has been drastically trimmed to 297 people (157 controls, 140 treated).
In this more balanced sub-sample, without using any model, the average 1978 earnings are more in line with what we would expect: the population that received training (treated) earns on average $875 more than the control group.
(4) Stratify sample - group similar subjects together. People are grouped in layers of similar propensity scores. These bins should have an improved covariate balance, and we should be able to compare and match samples within those bins.
# the default is 5 bins with equal numbers of samples;
# however, it is possible to split the sample in a more data-driven way:
# the larger the sample, the more bins we can afford, so that subjects become
# increasingly similar within smaller bins
# the limit to dividing further is that bins end up with too few data points
# to be statistically distinguishable (t-test)
causal.stratify_s()
print(causal.strata)
Within bins, the raw difference in outcome should be a good representation of the real treatment effect. For example:
People in group 1 are unlikely to be in the treatment group (better off?). For them, the training improved earnings by $1399 on average.
People in group 4 are likely to be in the treatment group (poorer?). For them, the training improved earnings even more, by $2211 on average for the year 1978.
Something that looks quite bad is that the outcome for group 3 is totally different from that of the other groups. The trend seems to be that the higher the propensity score, the higher the raw difference in outcome for each stratum, but this one shows the opposite... This may be a sign that we haven't controlled for enough factors (or that the propensity calculation is wrong?). It might also be a true reflection of reality: some people may benefit from the job training while others may not. Or it might simply be noise, given that we are working with a relatively small sample (74 elements in bin 3).
Let's see in the analysis phase if regressions within each stratum will be able to control for confounding variables better.
In the estimation phase, treatment effects of the training can be estimated in several ways.
(1) The blocking estimator - although each stratum is pretty balanced and gives reasonable raw results, this estimator goes further and controls for the confounding factors within each stratum. More precisely, it runs a least squares estimate within each propensity bin, and from these produces an overall average treatment effect estimate.
# causal.est_via_blocking()
# print(causal.estimates)
# for some reason I'm getting a singular matrix when calculating this blocking
# estimator on one of the strata
# I've tried changing the strata structure and the set of variables,
# but the singularity persists when calculating the covariance matrix
# this would need a closer look at the dataset, which I haven't taken the time to do yet
# this is one of the issues with the causalinference package:
# it needs to invert large matrices, which can fail
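One way to investigate would be to run the within-bin regressions ourselves, stratum by stratum; this is a sketch using the per-stratum interface shown in the package documentation, and the offending bin should surface as a linear-algebra error.
# attempt the within-bin least squares fit one stratum at a time
for i, stratum in enumerate(causal.strata):
    try:
        stratum.est_via_ols(adj=1)
        print(i, stratum.estimates['ols']['ate'])
    except Exception as exc:  # e.g. a singular matrix in one bin
        print(i, 'failed:', exc)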
(2) The matching estimator - although each stratum is pretty balanced and gives reasonable raw results, this estimator controls for the confounding factors by matching samples even more finely within each stratum. More precisely, the pairing is done via nearest-neighbour matching. If the matching is imperfect, bias correction is recommended.
If issues arise with least squares, such as excessive extrapolation, this matching estimator pushes the unconfoundedness assumption to its limit and nonparametrically matches subjects with similar covariate values together. In other words, if the confounding factors are equal for both elements of a pair, the difference between the two is the real treatment effect. In the causalinference package, each covariate is normalized by its sample standard deviation when computing matching distances.
Where matching discrepancies exist, least squares is used, but only very locally, so large extrapolations should be less of a problem.
causal.est_via_matching(bias_adj=True)
print(causal.estimates)
The model provides estimates of three quantities: the ATE (average treatment effect over the whole sample), the ATT (average treatment effect on the treated), and the ATC (average treatment effect on the controls):
# allowing several matches
causal.est_via_matching(bias_adj=True, matches=4)
print(causal.estimates)
Allowing several matches attributes \$1027 of extra earnings to the treatment, but the associated p-value (around 0.25) is far from conventional significance thresholds; the common interpretation would be to reject this as proof of statistical significance.
The effect of the training is hard to establish firmly. Although the LaLonde sample seems to show positive effects, it is quite possible that the training had no effect at all.
This isn't so far from what LaLonde concluded: http://people.hbs.edu/nashraf/LaLonde_1986.pdf At a glance, LaLonde seemed to know the gender of participants, which does not appear in this dataset, or may be hidden in the NSW vs AFDC distinction.
More work could be done to better estimate the counterfactual. For instance, we could introduce polynomial variables to capture non-linear effects, and/or introduce categorical variables to bin numerical variables such as education.
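For instance, squared terms could be appended before rebuilding the model; this is a hypothetical sketch, and the new variable names are our own.
# add simple polynomial terms to capture non-linear effects of age and education
lalonde_poly = lalonde.assign(age2=lalonde['age'] ** 2, educ2=lalonde['educ'] ** 2)
covariates_poly = covariates + ['age2', 'educ2']
causal_poly = CausalModel(
    Y=lalonde_poly['re78'].values,
    D=lalonde_poly['treat'].values,
    X=lalonde_poly[covariates_poly].values)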
This was an example of how the CausalInference package can be used, and our conclusions are tied to those models. The package relies heavily on propensity matching and its ignorability / unconfoundedness assumption.
Other models exist, e.g. Bayesian models - that will be for another time. Meanwhile, the curious can have a look at this other post: https://engl.is/causal-analysis-introduction-examples-in-python-and-pymc.html