This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

Introduction

These notes examine the incorporation of machine learning methods into classic econometric techniques for estimating causal effects. More specifically, we will focus on estimating treatment effects using matching and instrumental variables. In these estimators (and many others) there is a low-dimensional parameter of interest, such as the average treatment effect, but estimating it also requires estimating a potentially high dimensional nuisance parameter, such as the propensity score. Machine learning methods were developed for prediction with high dimensional data, so it is natural to try to use machine learning for estimating high dimensional nuisance parameters. Care must be taken when doing so, though, because the flexibility and complexity that make machine learning so good at prediction also pose challenges for inference.

Example: partially linear model

  • Interested in $\theta$
  • Assume $\Er[\epsilon|d,x] = 0$
  • Nuisance parameter $f()$
  • E.g. Donohue and Levitt (2001)
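Written out, the model is

$$y_i = \theta d_i + f(x_i) + \epsilon_i$$

with $\Er[\epsilon|d,x] = 0$.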

The simplest example of the setting we will analyze is a partially linear model. We have some regressor of interest, $d$, and we want to estimate the effect of $d$ on $y$. We have a rich enough set of controls that we are willing to believe that $\Er[\epsilon|d,x] = 0$. $d_i$ and $y_i$ are scalars, while $x_i$ is a vector. We are not interested in $x$ per se, but we need to include it to avoid omitted variable bias.

Typical applied econometric practice would be to choose some transform of $x$, say $X = T(x)$, where $X$ could be some subset of $x$, along with interactions, powers, and so on. Then estimate a linear regression of $y$ on $d$ and $X$, and perhaps also report results for a handful of different choices of $T(x)$.
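For concreteness, here is a minimal sketch of this practice in Python, with hypothetical data arrays `y`, `d`, and `x` and one arbitrary choice of $T$:

```python
# Minimal sketch (hypothetical data): pick a transformation X = T(x),
# then regress y on d and X by ordinary least squares.
import numpy as np

def ols_theta(y, d, x):
    X = np.column_stack([x, x**2])               # one arbitrary choice of T(x)
    W = np.column_stack([np.ones(len(y)), d, X]) # constant, d, controls
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    return coef[1]                               # coefficient on d estimates theta
```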

Some downsides to the typical applied econometric practice include:

  • The choice of T is arbitrary, which opens the door to specification searching and p-hacking.

  • If $x$ is high dimensional, and $X$ is low dimensional, a poor choice will lead to omitted variable bias. Even if $x$ is low dimensional, if $f(x)$ is poorly approximated by $X'\beta$, there will be omitted variable bias.

In some sense, machine learning can be thought of as a way to choose $T$ in an automated and data-driven way. There will still be a choice of machine learning method and often tuning parameters for that method, so some arbitrary decisions remain. Hopefully, though, these decisions have less impact.

You may already be familiar with traditional nonparametric econometric methods like series / sieves and kernels. These have much in common with machine learning. What makes machine learning different from traditional nonparametric methods? Machine learning methods appear to have better predictive performance, and arguably more practical, data-driven ways to choose tuning parameters. Machine learning methods can also deal with high dimensional $x$, while traditional nonparametric methods focus on situations with low dimensional $x$.

Example: Effect of abortion on crime

Donohue and Levitt (2001) estimate a regression of state crime rates on crime-type-relevant abortion rates and controls. The abortion rate $a_{it}$ is a weighted average of lagged abortion rates in state $i$, with the weight on the $\ell$th lag equal to the fraction of age-$\ell$ people who commit the given crime type. The covariates $x$ are the log of lagged prisoners per capita, the log of lagged police per capita, the unemployment rate, per-capita income, the poverty rate, AFDC generosity at time $t-15$, a dummy for concealed weapons laws, and beer consumption per capita. Alexandre Belloni, Chernozhukov, and Hansen (2014a) reanalyze this setup using lasso to allow a more flexible specification of controls. They allow for many interactions and quadratic terms, leading to 284 controls.

Example: Matching

  • Binary treatment $d_i \in \{0,1\}$
  • Potential outcomes $y_i(0), y_i(1)$, observe $y_i = y_i(d_i)$
  • Interested in the average treatment effect: $\theta = \Er[y_i(1) - y_i(0)]$
  • Covariates $x_i$
  • Assume unconfoundedness: $d_i \indep y_i(1), y_i(0) | x_i$
  • E.g. Connors et al. (1996)

The partially linear and matching models are closely related. If the conditional mean independence assumption of the partially linear model is strengthened to conditional independence, then the partially linear model is a special case of the matching model with constant treatment effects, $y_i(1) - y_i(0) = \theta$. Thus the matching model can be viewed as a generalization of the partially linear model that allows for treatment effect heterogeneity.

Example: Matching

  • Estimable formulae for the ATE:
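Under unconfoundedness (plus overlap, $0 < \Pr(d=1|x) < 1$), standard identification results give, for example,

$$\theta = \Er\left[\frac{d_i y_i}{\Pr(d_i=1|x_i)} - \frac{(1-d_i) y_i}{1 - \Pr(d_i=1|x_i)}\right],$$

$$\theta = \Er\Big[\Er[y_i|d_i=1,x_i] - \Er[y_i|d_i=0,x_i]\Big],$$

and

$$\theta = \Er\left[\frac{d_i\big(y_i - \Er[y_i|d_i=1,x_i]\big)}{\Pr(d_i=1|x_i)} - \frac{(1-d_i)\big(y_i - \Er[y_i|d_i=0,x_i]\big)}{1-\Pr(d_i=1|x_i)} + \Er[y_i|d_i=1,x_i] - \Er[y_i|d_i=0,x_i]\right].$$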

All the expectations in these three formulae involve observable data. Thus, we can form an estimate of $\theta$ by replacing the expectations and conditional expectations with appropriate estimators. For example, to use the first formula, we could estimate a logit model for the probability of treatment, where, as above, $X$ is some chosen transformation of $x_i$. Then we simply take an average to estimate $\theta$. As in the partially linear model, estimating the parameter of interest, $\theta$, requires estimating a potentially high dimensional nuisance parameter, in this case $\hat{\Pr}(d=1|x)$. Similarly, the second expression would require estimating conditional expectations of $y$ as nuisance parameters. The third expression requires estimating conditional expectations of both $y$ and $d$.
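As an illustration of the first formula, here is a minimal sketch assuming scikit-learn, with hypothetical arrays `y`, `d` (0/1 treatment), and `X` (a chosen transformation of $x$):

```python
# Minimal sketch of the first formula: fit a logit for Pr(d=1|x), then
# average the inverse-probability-weighted outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ate_ipw(y, d, X):
    pscore = LogisticRegression(max_iter=1000).fit(X, d).predict_proba(X)[:, 1]
    return np.mean(d * y / pscore - (1 - d) * y / (1 - pscore))
```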

The third expression might appear needlessly complicated, but we will see later that it has some desirable properties that will make using it essential when very flexible machine learning estimators for the conditional expectations are used.

The origin of the name “matching” can be seen in the second expression. One way to estimate that expression would be to take each person in the treatment group, find someone with the same (or nearly the same) $x$ in the control group, difference the outcomes of this matched pair, and then average over the whole sample. (Actually, this gives the average treatment effect on the treated. For the ATE, you would also have to do the same with the roles of the groups switched and average all the differences.) When $x$ is multi-dimensional, there is some ambiguity about what it means for two $x$ values to be nearly the same. An important insight of Rosenbaum and Rubin (1983) is that it is sufficient to match on the propensity score, $P(d=1|x)$, instead.
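A minimal sketch of this procedure, matching on an estimated propensity score and assuming scikit-learn, with hypothetical arrays `y`, `d` (0/1 treatment), and `x`:

```python
# Minimal sketch of one-to-one propensity score matching for the average
# treatment effect on the treated (ATT).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def att_matching(y, d, x):
    # match on the estimated propensity score rather than on x itself
    pscore = LogisticRegression(max_iter=1000).fit(x, d).predict_proba(x)[:, 1]
    treated, control = d == 1, d == 0
    nn = NearestNeighbors(n_neighbors=1).fit(pscore[control].reshape(-1, 1))
    _, idx = nn.kneighbors(pscore[treated].reshape(-1, 1))
    # difference each treated outcome from its matched control, then average
    return np.mean(y[treated] - y[control][idx.ravel()])
```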

Example: effectiveness of heart catheterization

Connors et al. (1996) use matching to estimate the effectiveness of heart catheterization in critically ill patients. Their dataset contains 5735 patients and 72 covariates. Athey et al. (2017) reanalyze this data using a variety of machine learning methods.

References: Imbens (2004) reviews the traditional econometric literature on matching. Imbens (2015) focuses on practical advice for matching and includes a brief mention of incorporating machine learning.

Both the partially linear model and treatment effects model can be extended to situations with endogeneity and instrumental variables.

Example: IV

  • Interested in $\theta$
  • Assume $\Er[\epsilon|x,z] = 0$, $\Er[u|x,z]=0$
  • Nuisance parameters $f()$, $g()$
  • E.g. Angrist and Krueger (1991)
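Written out, a partially linear IV model consistent with these assumptions is

$$y_i = \theta d_i + f(x_i) + \epsilon_i, \qquad d_i = g(x_i, z_i) + u_i$$

with $\Er[\epsilon|x,z] = 0$ and $\Er[u|x,z] = 0$.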

Most of the remarks about the partially linear model also apply here.

Hartford et al. (2017) estimate a generalization of this model with $y_i = f(d_i, x_i) +\epsilon$ using deep neural networks.

Example: compulsory schooling and earnings

Angrist and Krueger (1991) use quarter of birth as an instrument for years of schooling to estimate the effect of schooling on earnings. Since compulsory schooling laws typically specify a minimum age at which a person can leave school rather than a minimum number of years of schooling, people born at different times of the year can be required to complete one more or one fewer year of schooling. Compulsory schooling laws and their effect on attained schooling can vary with state and year. Hence, Angrist and Krueger (1991) considered specifying $g(x,z)$ as all interactions of quarter of birth, state, and year dummies. Having so many instruments leads to statistical problems with 2SLS.

Example: LATE

  • Binary instrument $z_i \in \{0,1\}$
  • Potential treatments $d_i(0), d_i(1) \in \{0,1\}$, $d_i = d_i(z_i)$
  • Potential outcomes $y_i(0), y_i(1)$, observe $y_i = y_i(d_i)$
  • Covariates $x_i$
  • $(y_i(1), y_i(0), d_i(1), d_i(0)) \indep z_i | x_i$
  • Local average treatment effect:
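    One standard expression for it, conditional on covariates, is

    $$\theta(x) = \frac{\Er[y_i|z_i=1,x_i=x] - \Er[y_i|z_i=0,x_i=x]}{\Er[d_i|z_i=1,x_i=x] - \Er[d_i|z_i=0,x_i=x]},$$

    and an unconditional version averages the numerator and denominator over the distribution of $x$.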

See Abadie (2003).

Belloni et al. (2017) analyze estimation of this model using Lasso and other machine learning methods.

General setup

  • Parameter of interest $\theta \in \R^{d_\theta}$

  • Nuisance parameter $\eta \in T$

  • Moment conditions with $\psi$ known

  • Estimate $\hat{\eta}$ using some machine learning method

  • Estimate $\hat{\theta}$ from the sample version of the moment conditions, as shown below
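In symbols, the moment conditions are

$$\Er[\psi(w_i; \theta_0, \eta_0)] = 0,$$

where $w_i$ denotes the observed data for unit $i$, and $\hat{\theta}$ solves the sample analogue

$$\En[\psi(w_i; \hat{\theta}, \hat{\eta})] = 0.$$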

We are following the setup and notation of Chernozhukov et al. (2018). As in the examples, the dimension of $\theta$ is fixed and small. The dimension of $\eta$ is large and might be increasing with sample size. $T$ is some normed vector space.


Example: partially linear model

  • Compare the estimates from

    1. $\En[d_i(y_i - \tilde{\theta} d_i - \hat{f}(x_i)) ] = 0$

    and

    2. $\En[(d_i - \hat{m}(x_i))(y_i - \hat{\mu}(x_i) - \theta (d_i - \hat{m}(x_i)))] = 0$

    where $m(x) = \Er[d|x]$ and $\mu(x) = \Er[y|x]$

Example: partially linear model

In the partially linear model, we can let $w_i = (y_i, x_i)$ and $\eta = f$. There are a variety of candidates for $\psi$. An obvious (but flawed) one is $\psi(w_i; \theta, \eta) = (y_i - \theta d_i - f(x_i))d_i$. With this choice of $\psi$, solving $\En[\psi(w_i; \tilde{\theta}, \hat{f})] = 0$ and substituting $y_i = \theta_0 d_i + f(x_i) + \epsilon_i$, we have

$$\sqrt{n}(\tilde{\theta} - \theta_0) = \left(\En[d_i^2]\right)^{-1} \sqrt{n}\En[d_i \epsilon_i] + \left(\En[d_i^2]\right)^{-1} \sqrt{n}\En\left[d_i (f(x_i) - \hat{f}(x_i))\right].$$

The first term of this expression is quite promising. $d_i$ and $\epsilon_i$ are both finite dimensional random variables, so a law of large numbers will apply to $\En[d_i^2]$, and a central limit theorem would apply to $\sqrt{n} \En[d_i \epsilon_i]$. Unfortunately, the second term is problematic. To accommodate high dimensional $x$ and allow for flexible $f()$, machine learning estimators must introduce some sort of regularization to control variance. This regularization also introduces some bias. The bias generally vanishes, but at a rate slower than $\sqrt{n}$. Hence $\sqrt{n}\En\left[d_i (f(x_i) - \hat{f}(x_i))\right]$ does not converge to zero, and $\sqrt{n}(\tilde{\theta} - \theta_0)$ is not asymptotically normal.

To get around this problem, we must modify our estimate of $\theta$. Let $m(x) = \Er[d|x]$ and $\mu(x) = \Er[y|x]$, and let $\hat{m}()$ and $\hat{\mu}()$ be some estimates. Then we can estimate $\theta$ by partialling out:

$$\hat{\theta} = \frac{\En[(d_i - \hat{m}(x_i))(y_i - \hat{\mu}(x_i))]}{\En[(d_i - \hat{m}(x_i))^2]}.$$

Substituting $y_i = \theta_0 d_i + f(x_i) + \epsilon_i$ and expanding (one way to write the expansion),

$$\sqrt{n}(\hat{\theta} - \theta_0) = \left(\En[(d_i - \hat{m}(x_i))^2]\right)^{-1}\left(a + b + c + d\right)$$

where

$$a = \sqrt{n}\En[v_i \epsilon_i],$$
$$b = \sqrt{n}\En\left[v_i\left(\mu(x_i) - \hat{\mu}(x_i) - \theta_0(m(x_i) - \hat{m}(x_i))\right)\right],$$
$$c = \sqrt{n}\En\left[(m(x_i) - \hat{m}(x_i))\,\epsilon_i\right],$$
$$d = \sqrt{n}\En\left[(m(x_i) - \hat{m}(x_i))\left(\mu(x_i) - \hat{\mu}(x_i) - \theta_0(m(x_i) - \hat{m}(x_i))\right)\right],$$

with $v_i = d_i - \Er[d_i | x_i]$. The term $a$ is well behaved and $\sqrt{n}a \leadsto N(0,\Sigma)$ under standard conditions. Although terms $b$ and $c$ appear similar to the problematic term in the initial estimator, they are better behaved because $\Er[v|x] = 0$ and $\Er[\epsilon|x] = 0$. This makes it possible, though difficult, to show that $\sqrt{n}b \to_p 0$ and $\sqrt{n} c \to_p 0$, see e.g. Alexandre Belloni, Chernozhukov, and Hansen (2014a). However, the conditions on $\hat{m}$ and $\hat{\mu}$ needed to show this are somewhat restrictive, and appropriate conditions might not be known for all estimators. Chernozhukov et al. (2018) describe a sample splitting modification to $\hat{\theta}$ that allows $\sqrt{n} b$ and $\sqrt{n} c$ to vanish under weaker conditions (essentially the same rate condition as needed for $\sqrt{n} d$ to vanish).

The last term, $d$, is a considerable improvement over the problematic term in the initial estimator. Instead of involving the error in one estimate, it involves the product of the errors in two estimates. By the Cauchy-Schwarz inequality, $|d| \leq \sqrt{n}\sqrt{\En[(m(x_i) - \hat{m}(x_i))^2]}\sqrt{\En[(\mu(x_i) - \hat{\mu}(x_i) - \theta_0(m(x_i) - \hat{m}(x_i)))^2]}$. So if the estimates of $m$ and $\mu$ converge at rates faster than $n^{-1/4}$, then $\sqrt{n} d \to_p 0$. This $n^{-1/4}$ rate is reached by many machine learning estimators.
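To make the partialling out and sample splitting ideas concrete, here is a minimal sketch on simulated data, assuming scikit-learn; the random forest nuisance estimators and the data generating process are illustrative choices only:

```python
# Minimal sketch of the partialled-out (orthogonal) estimator of theta in the
# partially linear model, with cross-fitted nuisance estimates.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p, theta0 = 1000, 50, 1.0
x = rng.normal(size=(n, p))
f = np.sin(x[:, 0]) + x[:, 1] ** 2            # nonlinear f(x)
m = 0.5 * x[:, 0] - 0.25 * x[:, 1]            # E[d|x]
d = m + rng.normal(size=n)
y = theta0 * d + f + rng.normal(size=n)

# Cross-fitting: nuisances are estimated on the complementary folds
v_hat = np.zeros(n)   # d_i - m_hat(x_i)
e_hat = np.zeros(n)   # y_i - mu_hat(x_i)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    m_hat = RandomForestRegressor(n_estimators=200).fit(x[train], d[train])
    mu_hat = RandomForestRegressor(n_estimators=200).fit(x[train], y[train])
    v_hat[test] = d[test] - m_hat.predict(x[test])
    e_hat[test] = y[test] - mu_hat.predict(x[test])

# Partialling-out estimate of theta and a conventional standard error
theta_hat = np.sum(v_hat * e_hat) / np.sum(v_hat ** 2)
resid = e_hat - theta_hat * v_hat
se = np.sqrt(np.mean(v_hat ** 2 * resid ** 2) / n) / np.mean(v_hat ** 2)
print(f"theta_hat = {theta_hat:.3f}, se = {se:.3f}")
```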


Lessons from the example

  • Need an extra condition on moments – Neyman orthogonality

  • Want nuisance estimators that converge faster than $n^{-1/4}$ in the prediction norm, e.g. $\sqrt{\En[(\hat{m}(x_i) - m(x_i))^2]}$

  • Also want estimators that satisfy something like $\sqrt{n}\En[(\hat{m}(x_i) - m(x_i))\epsilon_i] \to_p 0$

    • Sample splitting will make this easier

References by topic

  • Matching
    • Imbens (2004)
    • Imbens (2015)
  • Surveys on machine learning in econometrics
    • Athey and Imbens (2017)
    • Mullainathan and Spiess (2017)
    • Athey and Imbens (2018)
    • Athey et al. (2017)
    • Athey and Imbens (2015), Athey and Imbens (2018)
  • Machine learning
    • Breiman and others (2001)
    • Friedman, Hastie, and Tibshirani (2009)
    • James et al. (2013)
    • Efron and Hastie (2016)
  • Introduction to lasso
    • Belloni and Chernozhukov (2011)
    • Friedman, Hastie, and Tibshirani (2009) section 3.4
    • Chernozhukov, Hansen, and Spindler (2016)
  • Introduction to random forests
    • Friedman, Hastie, and Tibshirani (2009) section 9.2

Bold references are recommended reading. They are generally shorter and less technical than some of the others. Aspiring econometricians should read much more than just the bold references.

  • Neyman orthogonalization
    • Chernozhukov, Chetverikov, et al. (2017)
    • Chernozhukov, Hansen, and Spindler (2015)
    • Chernozhukov et al. (2018)
    • Belloni et al. (2017)
  • Lasso for causal inference
    • Alexandre Belloni, Chernozhukov, and Hansen (2014b)
    • Belloni et al. (2012)
    • Alexandre Belloni, Chernozhukov, and Hansen (2014a)
    • Chernozhukov, Goldman, et al. (2017)
    • Chernozhukov, Hansen, and Spindler (2016) hdm R package
  • Random forests for causal inference
    • Athey, Tibshirani, and Wager (2016)
    • Wager and Athey (2018)
    • Tibshirani et al. (2018) grf R package
    • Athey and Imbens (2016)

There is considerable overlap among these categories. The papers listed under Neyman orthogonalization all include use of lasso and some include random forests. The papers on lasso all involve some use of orthogonalization.

References

Abadie, Alberto. 2003. “Semiparametric Instrumental Variable Estimation of Treatment Response Models.” Journal of Econometrics 113 (2): 231–63. https://doi.org/10.1016/S0304-4076(02)00201-4.

Angrist, Joshua D., and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect Schooling and Earnings?” The Quarterly Journal of Economics 106 (4): 979–1014. http://www.jstor.org/stable/2937954.

Athey, Susan, and Guido Imbens. 2015. “Lectures on Machine Learning.” NBER Summer Institute.

———. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences 113 (27): 7353–60. https://doi.org/10.1073/pnas.1510489113.

———. 2018. “Machine Learning and Econometrics.” AEA Continuing Education. https://www.aeaweb.org/conference/cont-ed/2018-webcasts.

Athey, Susan, Guido Imbens, Thai Pham, and Stefan Wager. 2017. “Estimating Average Treatment Effects: Supplementary Analyses and Remaining Challenges.” American Economic Review 107 (5): 278–81. https://doi.org/10.1257/aer.p20171042.

Athey, Susan, and Guido W. Imbens. 2017. “The State of Applied Econometrics: Causality and Policy Evaluation.” Journal of Economic Perspectives 31 (2): 3–32. https://doi.org/10.1257/jep.31.2.3.

Athey, Susan, Julie Tibshirani, and Stefan Wager. 2016. “Generalized Random Forests.” https://arxiv.org/abs/1610.01271.

Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen. 2012. “Sparse Models and Methods for Optimal Instruments with an Application to Eminent Domain.” Econometrica 80 (6): 2369–2429. https://doi.org/10.3982/ECTA9626.

Belloni, A., V. Chernozhukov, I. Fernández-Val, and C. Hansen. 2017. “Program Evaluation and Causal Inference with High-Dimensional Data.” Econometrica 85 (1): 233–98. https://doi.org/10.3982/ECTA12723.

Belloni, Alexandre, and Victor Chernozhukov. 2011. “High Dimensional Sparse Econometric Models: An Introduction.” In Inverse Problems and High-Dimensional Estimation: Stats in the Château Summer School, August 31 - September 4, 2009, edited by Pierre Alquier, Eric Gautier, and Gilles Stoltz, 121–56. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-19989-9_3.

Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen. 2014a. “Inference on Treatment Effects After Selection Among High-Dimensional Controls.” The Review of Economic Studies 81 (2): 608–50. https://doi.org/10.1093/restud/rdt044.

———. 2014b. “High-Dimensional Methods and Inference on Structural and Treatment Effects.” Journal of Economic Perspectives 28 (2): 29–50. https://doi.org/10.1257/jep.28.2.29.

Breiman, Leo, and others. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3): 199–231. https://projecteuclid.org/euclid.ss/1009213726.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey. 2017. “Double/Debiased/Neyman Machine Learning of Treatment Effects.” American Economic Review 107 (5): 261–65. https://doi.org/10.1257/aer.p20171038.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–C68. https://doi.org/10.1111/ectj.12097.

Chernozhukov, Victor, Matt Goldman, Vira Semenova, and Matt Taddy. 2017. “Orthogonal Machine Learning for Demand Estimation: High Dimensional Causal Inference in Dynamic Panels.” https://arxiv.org/abs/1712.09988v2.

Chernozhukov, Victor, Chris Hansen, and Martin Spindler. 2016. “hdm: High-Dimensional Metrics.” R Journal 8 (2): 185–99. https://journal.r-project.org/archive/2016/RJ-2016-040/index.html.

Chernozhukov, Victor, Christian Hansen, and Martin Spindler. 2015. “Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach.” Annual Review of Economics 7 (1): 649–88. https://doi.org/10.1146/annurev-economics-012315-015826.

Connors, Alfred F., Theodore Speroff, Neal V. Dawson, Charles Thomas, Frank E. Harrell Jr, Douglas Wagner, Norman Desbiens, et al. 1996. “The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients.” JAMA 276 (11): 889–97. https://doi.org/10.1001/jama.1996.03540110043030.

Donohue, John J., III, and Steven D. Levitt. 2001. “The Impact of Legalized Abortion on Crime.” The Quarterly Journal of Economics 116 (2): 379–420. https://doi.org/10.1162/00335530151144050.

Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference. Vol. 5. Cambridge University Press.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2009. The Elements of Statistical Learning. Springer Series in Statistics. Springer.

Hartford, Jason, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. 2017. “Deep IV: A Flexible Approach for Counterfactual Prediction.” In Proceedings of the 34th International Conference on Machine Learning, edited by Doina Precup and Yee Whye Teh, 70:1414–23. Proceedings of Machine Learning Research. International Convention Centre, Sydney, Australia: PMLR. http://proceedings.mlr.press/v70/hartford17a.html.

Imbens, Guido W. 2004. “Nonparametric Estimation of Average Treatment Effects Under Exogeneity: A Review.” The Review of Economics and Statistics 86 (1): 4–29. https://doi.org/10.1162/003465304323023651.

———. 2015. “Matching Methods in Practice: Three Examples.” Journal of Human Resources 50 (2): 373–419. https://doi.org/10.3368/jhr.50.2.373.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

Mullainathan, Sendhil, and Jann Spiess. 2017. “Machine Learning: An Applied Econometric Approach.” Journal of Economic Perspectives 31 (2): 87–106. https://doi.org/10.1257/jep.31.2.87.

Rosenbaum, Paul R., and Donald B. Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55. https://doi.org/10.1093/biomet/70.1.41.

Tibshirani, Julie, Susan Athey, Stefan Wager, Rina Friedberg, Luke Miner, and Marvin Wright. 2018. Grf: Generalized Random Forests (Beta). https://CRAN.R-project.org/package=grf.

Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 0 (0): 1–15. https://doi.org/10.1080/01621459.2017.1319839.