This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

Using machine learning to estimate causal effects

Double debiased machine learning

  • Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins, “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal, 21 (2018a), C1–C68.

  • Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey, “Double/debiased/Neyman machine learning of treatment effects,” American Economic Review, 107 (2017), 261–265.

  • Parameter of interest $\theta \in \R^{d_\theta}$

  • Nuisance parameter $\eta \in T$

  • Moment conditions $\Er[\psi(w;\theta_0,\eta_0)] = 0$ with $\psi$ known

  • Estimate $\hat{\eta}$ using some machine learning method

  • Estimate $\hat{\theta}$ using cross-fitting


Cross-fitting

  • Randomly partition into $K$ subsets $(I_k)_{k=1}^K$
  • $I^c_k = \{1, \ldots, n\} \setminus I_k$
  • $\hat{\eta}_k =$ estimate of $\eta$ using $I^c_k$
  • Estimator: $\hat{\theta}$ solves $\frac{1}{K} \sum_{k=1}^K \En_k[\psi(w_i;\hat{\theta},\hat{\eta}_k)] = 0$, where $\En_k$ averages over $i \in I_k$ (see the sketch below)
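
To make the recipe concrete, here is a minimal sketch of DML2 cross-fitting for the canonical partially linear model $y = d\theta_0 + g_0(x) + \epsilon$, using the orthogonal score $\psi(w;\theta,\eta) = (y - \ell(x) - \theta(d - m(x)))(d - m(x))$ with $\eta = (\ell, m)$, $\ell_0(x) = \Er[y|x]$, $m_0(x) = \Er[d|x]$. The model and the random forest learners are illustrative choices, not part of the slides.

```python
# Minimal DML2 cross-fitting sketch for the partially linear model
#   y = d*theta_0 + g_0(x) + eps,
# with orthogonal score
#   psi(w; theta, eta) = (y - l(x) - theta*(d - m(x))) * (d - m(x)).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr(y, d, X, K=5):
    n = len(y)
    resid_y = np.empty(n)   # y - l_hat(x), predicted out-of-fold
    resid_d = np.empty(n)   # d - m_hat(x), predicted out-of-fold
    for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        # Nuisance estimates use only the complement I_k^c of each fold I_k
        l_hat = RandomForestRegressor().fit(X[train], y[train])
        m_hat = RandomForestRegressor().fit(X[train], d[train])
        resid_y[test] = y[test] - l_hat.predict(X[test])
        resid_d[test] = d[test] - m_hat.predict(X[test])
    # DML2: solve (1/K) sum_k E_{n,k}[psi(w; theta, eta_hat_k)] = 0 for theta;
    # for this linear score that is a residual-on-residual regression
    theta_hat = resid_d @ resid_y / (resid_d @ resid_d)
    # Plug-in standard error from the influence function -J0^{-1} psi
    psi = (resid_y - theta_hat * resid_d) * resid_d
    J = np.mean(resid_d**2)
    se = np.sqrt(np.mean(psi**2) / J**2 / n)
    return theta_hat, se
```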

Assumptions

  • Linear score: $\psi(w;\theta,\eta) = \psi^a(w;\eta) \theta + \psi^b(w;\eta)$
  • Near Neyman orthogonality: $\lambda_n := \sup_{\eta \in \mathcal{T}_n} \norm{\partial_\eta \Er[\psi(w;\theta_0,\eta_0)][\eta - \eta_0]} \leq \delta_n n^{-1/2}$

Assumptions

  • Rate conditions: for $\delta_n \to 0$ and $\Delta_n \to 0$, we have $\Pr(\hat{\eta}_k \in \mathcal{T}_n) \geq 1-\Delta_n$ and
  $$r_n := \sup_{\eta \in \mathcal{T}_n} \norm{\Er[\psi^a(w;\eta) - \psi^a(w;\eta_0)]} \leq \delta_n, \quad r_n' := \sup_{\eta \in \mathcal{T}_n} \Er\left[\norm{\psi(w;\theta_0,\eta) - \psi(w;\theta_0,\eta_0)}^2\right]^{1/2} \leq \delta_n,$$
  $$\lambda_n' := \sup_{r \in (0,1),\, \eta \in \mathcal{T}_n} \norm{\partial_r^2 \Er[\psi(w;\theta_0,\eta_0 + r(\eta - \eta_0))]} \leq \delta_n n^{-1/2}$$
  • Moments exist and other regularity conditions

::: notes We focus on the case of linear scores to simplify proofs, and all of our examples have scores linear in $\theta$. Chernozhukov et al. (2018a) cover nonlinear scores as well.

These rate conditions might look a little strange. They are stated this way because they are exactly what is needed for the result to work. $\Delta_n$ and $\delta_n$ are sequences converging to $0$, and $\mathcal{T}_n$ is a shrinking neighborhood of $\eta_0$. A good exercise would be to show that if $\psi$ is a smooth function of $\eta$ and $\theta$, and $\Er[(\hat{\eta}(x) - \eta_0(x))^2]^{1/2} = O(\epsilon_n) = o(n^{-1/4})$, then we can meet the above conditions with $r_n = r_n' = \epsilon_n$ and $\lambda_n' = \epsilon_n^2$; a sketch follows. :::
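
As a sketch of that exercise (assuming, beyond what the slide states, that $\psi$ is twice continuously differentiable in $\eta$ with bounded derivatives and that exact orthogonality holds at $(\theta_0,\eta_0)$):

```latex
% Heuristic Taylor-expansion argument for the exercise in the notes.
\begin{align*}
r_n &= \sup_{\eta \in \mathcal{T}_n} \norm{\Er[\psi^a(w;\eta) - \psi^a(w;\eta_0)]}
  \lesssim \Er[(\hat{\eta}(x) - \eta_0(x))^2]^{1/2} = O(\epsilon_n), \\
r_n' &= \sup_{\eta \in \mathcal{T}_n} \Er\left[\norm{\psi(w;\theta_0,\eta) - \psi(w;\theta_0,\eta_0)}^2\right]^{1/2}
  \lesssim \Er[(\hat{\eta}(x) - \eta_0(x))^2]^{1/2} = O(\epsilon_n), \\
\lambda_n' &= \sup_{r,\eta} \norm{\partial_r^2 \Er[\psi(w;\theta_0,\eta_0 + r(\eta - \eta_0))]}
  \lesssim \Er[(\hat{\eta}(x) - \eta_0(x))^2] = O(\epsilon_n^2).
\end{align*}
```

The first two lines follow from a first-order Taylor (Lipschitz) bound, and the third because the second derivative is quadratic in $\eta - \eta_0$; then $n^{1/2} \lambda_n' = O(n^{1/2} \epsilon_n^2) \to 0$ exactly when $\epsilon_n = o(n^{-1/4})$.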


Proof outline:

  • Let $J_0 = \Er[\psi^a(w_i;\eta_0)]$ and $R_{n,1} = \frac{1}{K} \sum_{k=1}^K \En_k[\psi^a(w_i;\hat{\eta}_k)] - J_0$

  • Show: $\sqrt{n}(\hat{\theta} - \theta_0) = -J_0^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi(w_i;\theta_0,\eta_0) + O_p(\rho_n)$, with $\rho_n$ as defined below

  • Show $\norm{R_{n,1}} = O_p(n^{-1/2} + r_n)$

  • Show $\norm{R_{n,2}} = O_p(n^{-1/2} r_n' + \lambda_n + \lambda_n')$

::: notes For details, see the appendix of Chernozhukov et al. (2018a). :::


Proof outline: Lemma 6.1

Lemma 6.1

(a) If $\Pr(\norm{X_m} > \epsilon_m | Y_m) \to_p 0$, then $\Pr(\norm{X_m}>\epsilon_m) \to 0$.

(b) If $\Er[\norm{X_m}^q/\epsilon_m^q | Y_m] \to_p 0$ for $q\geq 1$, then $\Pr(\norm{X_m}>\epsilon_m) \to 0$.

(c) If $\norm{X_m} = O_p(A_m)$ conditional on $Y_m$ (i.e. for any $\ell_m \to \infty$, $\Pr(\norm{X_m} > \ell_m A_m | Y_m) \to_p 0$), then $\norm{X_m} = O_p(A_m)$ unconditionally.
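
For instance, part (b) follows by combining the conditional Markov inequality with bounded convergence (a one-line sketch expanding on the notes):

```latex
% Condition on Y_m, apply Markov's inequality, then take expectations;
% the conditional probabilities are bounded by 1, so bounded convergence applies.
\begin{align*}
\Pr(\norm{X_m} > \epsilon_m)
  &= \Er\left[ \Pr(\norm{X_m} > \epsilon_m \mid Y_m) \right] \\
  &\leq \Er\left[ \min\left\{ 1,\, \Er[\norm{X_m}^q / \epsilon_m^q \mid Y_m] \right\} \right] \to 0.
\end{align*}
```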

::: notes (a) follows by dominated convergence; (b) follows from Markov’s inequality and (a); (c) follows from (a). :::


Proof outline: $R_{n,1}$

  • where

Proof outline: $R_{n,2}$

  • $R_{n,2} = \frac{1}{K} \sum_{k=1}^K \En_k\left[ \psi(w_i;\theta_0,\hat{\eta}_k) - \psi(w_i;\theta_0,\eta_0) \right]$
  • where

  • $U_{4,k} = \sqrt{n} \norm{f_k(1)}$ where


Asymptotic normality

  • $\rho_n := n^{-1/2} + r_n + r_n' + n^{1/2} (\lambda_n + \lambda_n') \lesssim \delta_n$

  • Influence function $-J_0^{-1} \psi(w;\theta_0,\eta_0)$, with $\sqrt{n}(\hat{\theta} - \theta_0) \to_d N(0, \Sigma)$ and $\Sigma = J_0^{-1} \Er[\psi(w;\theta_0,\eta_0) \psi(w;\theta_0,\eta_0)'] (J_0^{-1})'$ (see the sketch below)
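
Given the stacked out-of-fold score values, the plug-in variance implied by this influence function is easy to compute. A small sketch, where the array names and the scalar-$\theta$ simplification are mine, not from the slides:

```python
# Plug-in confidence interval for the DML2 estimator, scalar theta case.
# psi:   array of psi(w_i; theta_hat, eta_hat_k), stacked across folds
# psi_a: array of psi^a(w_i; eta_hat_k), the score's Jacobian terms
import numpy as np
from scipy.stats import norm

def dml_confint(psi, psi_a, theta_hat, level=0.95):
    n = len(psi)
    J = np.mean(psi_a)                   # estimate of J_0 = E[psi^a]
    sigma2 = np.mean(psi**2) / J**2      # Sigma = J^{-1} E[psi^2] J^{-1}
    half = norm.ppf(0.5 + level / 2) * np.sqrt(sigma2 / n)
    return theta_hat - half, theta_hat + half
```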

::: notes This is the DML2 case of Theorem 3.1 of Chernozhukov et al. (2018a). :::


Creating orthogonal moments

  • Need (near) Neyman orthogonality: $\partial_\eta \Er[\psi(w;\theta_0,\eta_0)][\eta - \eta_0] \approx 0$

  • Given some model, how do we find a suitable $\psi$?


Orthogonal scores via concentrating-out

  • Original model: $(\theta_0, \beta_0) = \argmax_{\theta,\beta} \Er[\ell(W;\theta,\beta)]$
  • Define $\beta(\theta) = \argmax_\beta \Er[\ell(W;\theta,\beta)]$
  • First order condition from $\max_\theta \Er[\ell(W;\theta,\beta(\theta))]$ is $\Er[\partial_\theta \ell(W;\theta,\beta(\theta))] = 0$; the $\partial_\beta$ term vanishes by the envelope theorem, which is what makes this score orthogonal (worked example below)
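
A standard worked instance of concentrating out is the partially linear model (this specific example is mine, not on the slide). With $\ell_0(x) = \Er[y|x]$ and $m_0(x) = \Er[d|x]$:

```latex
% Concentrating out beta in the partially linear model.
\begin{align*}
\ell(W;\theta,\beta) &= -\tfrac{1}{2}\left(y - d\theta - \beta(x)\right)^2, \\
\beta_\theta(x) &= \argmax_{\beta} \Er[\ell(W;\theta,\beta)]
  = \Er[y - d\theta \mid x] = \ell_0(x) - m_0(x)\theta, \\
0 &= \partial_\theta \Er[\ell(W;\theta,\beta_\theta)]
  = \Er\left[\left(y - d\theta - \beta_\theta(x)\right)\left(d - m_0(x)\right)\right].
\end{align*}
% Substituting beta_theta gives the score
% (y - ell_0(x) - theta*(d - m_0(x)))*(d - m_0(x)),
% the familiar Neyman-orthogonal score for the partially linear model.
```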

Orthogonal scores via projection

  • Original model: $m: \mathcal{W} \times \R^{d_\theta} \times \R^{d_h} \to \R^{d_m}$
  • Let $A(R)$ be a $d_\theta \times d_m$ moment selection matrix, $\Omega(R)$ a $d_m \times d_m$ weighting matrix, and
  • $\eta = (\mu, h)$ and

::: notes Chernozhukov et al. (2018a) show how to construct orthogonal scores in a few examples via concentrating out and projection. Chernozhukov, Hansen, and Spindler (2015) also discuss creating orthogonal scores. :::


Example: average derivative

  • $x,y \in \R^1$, $\Er[y|x] = f_0(x)$, $p(x) =$ density of $x$

  • $\theta_0 = \Er[f_0'(x)]$

  • Joint objective

  • Solve for the minimizing $f$ given $\theta$

  • Concentrated objective:

  • First order condition at $f_\theta = f_0$ gives $\Er\left[f_0'(x) - \frac{p'(x)}{p(x)}\left(y - f_0(x)\right) - \theta\right] = 0$ (see the simulation sketch below)
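
A crude simulation check of this orthogonal moment condition. The data generating process, the polynomial fit for $f$, the kernel density estimate for $p$, and the omission of sample splitting are all illustrative shortcuts, not the recommended procedure:

```python
# Orthogonal score for the average derivative:
#   psi(w; theta, eta) = f'(x) - (p'(x)/p(x)) * (y - f(x)) - theta.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = np.sin(x) + rng.normal(scale=0.5, size=n)   # f_0(x) = sin(x)

# f_hat and f_hat' from a polynomial approximation
coef = np.polynomial.polynomial.polyfit(x, y, 5)
f_hat = np.polynomial.polynomial.polyval(x, coef)
df_hat = np.polynomial.polynomial.polyval(x, np.polynomial.polynomial.polyder(coef))

# p_hat and p_hat' from a kernel density estimate (finite differences)
kde = gaussian_kde(x)
h = 1e-3
p = kde(x)
dp = (kde(x + h) - kde(x - h)) / (2 * h)

theta_hat = np.mean(df_hat - (dp / p) * (y - f_hat))
print(theta_hat, "vs theta_0 =", np.mean(np.cos(x)))  # theta_0 = E[cos(x)]
```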

::: notes We’ll go over this derivation in lecture, but I don’t think I’ll have time to type it here.

See Chernozhukov, Newey, Robins, and Singh (2018c) for an approach to estimating average derivatives (and other models linear in $\theta$) that doesn’t require explicitly calculating an orthogonal moment condition. :::


Example: average derivative with endogeneity

  • $x,y \in \R^1$, $p(x) =$ density of $x$

  • Model: $\Er[y - f(x) | z] = 0$, $\theta_0 = \Er[f_0'(x)]$

  • Joint objective:

  • then

  • where $T:\mathcal{L}^2_{p} \to \mathcal{L}^2_{\mu_z}$ with $(T f)(z) = \Er[f(x) |z]$

  • and $T^\ast :\mathcal{L}^2_{\mu_z} \to \mathcal{L}^2_{p}$ with $(T^\ast g)(x) = \Er[g(z) | x]$

  • Orthogonal moment condition:

::: notes The first order condition for $f$ in the joint objective function is obtained by differentiating in an arbitrary direction $v$. Writing these expectations as integrals, integrating by parts to get rid of $v'(x)$, and switching the order of integration gives a condition that must hold for all $v$. Notice that integrating by parts, $\int f''(x) p(x) dx = -\int f'(x) p'(x) dx$ (assuming the boundary terms vanish), eliminates the terms with $f'$ and $f''$. For the remaining expression to be $0$ for all $v$, we need a condition that can be written compactly using $T$ and $T^\ast$. Note that $T$ and $T^\ast$ are linear, and $T^\ast$ is the adjoint of $T$. Also, identification of $f$ requires that $T$ be one-to-one. Hence, if $f$ is identified, $T^\ast T$ is invertible, and we can solve for $f_\theta$. Plugging $f_\theta(x)$ back into the objective function and then differentiating with respect to $\theta$ gives the orthogonal moment condition on the slide. Verifying that this moment condition is indeed orthogonal is slightly tedious: writing out some of the expectations as integrals, changing the order of integration, and judiciously factoring out terms will eventually lead to the desired conclusion.

Carrasco, Florens, and Renault (2007) is an excellent review of methods for estimating $(T^\ast T)^{-1}$ and the inverses of other linear transformations. :::


Example: average elasticity

  • Demand $D(p)$, quantities $q$, instruments $z$

  • Average elasticity $\theta = \Er[D'(p)/D(p)]$

  • Joint objective:


Example: control function


Treatment heterogeneity

  • Potential outcomes model
  • Treatment $d \in \{0,1\}$
  • Potential outcomes $y(1), y(0)$
  • Covariates $x$
  • Unconfoundedness or instruments
  • Objects of interest:
  • Conditional average treatment effect $s_0(x) = \Er[y(1) - y(0) | x]$
  • Range and other measures of spread of conditional average treatment effect
  • Most and least affected groups

Fixed, finite groups

  • $G_1, \ldots, G_K$ a finite partition of the support of $x$

  • Estimate $\Er[y(1) - y(0) | x \in G_k]$ as above

  • pros: easy inference, reveals some heterogeneity

  • cons: a poorly chosen partition hides some heterogeneity; searching over partitions invalidates inference


Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments

  • Chernozhukov, Victor, Mert Demirer, Esther Duflo, and Iván Fernández-Val, “Generic machine learning inference on heterogenous treatment effects in randomized experiments,” NBER Working Paper, 2018b.

  • Use machine learning to find a partition, with sample splitting to allow easy inference

  • Randomly partition the sample into auxiliary and main samples

  • Use any method on the auxiliary sample to estimate a baseline proxy $B(x) \approx \Er[y(0)|x]$ and an effect proxy $S(x) \approx \Er[y(1) - y(0)|x]$


Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments

  • Define $G_k = 1\{\ell_{k-1} \leq S(x) \leq \ell_k\}$, where the $\ell_k$ are quantiles of $S(x)$

  • Use the main sample to regress $y$ on $B(x)$ and $(d - P(x))1\{G_1\}, \ldots, (d - P(x))1\{G_K\}$, weighted by $(P(x)(1-P(x)))^{-1}$ (sketch after this list)

  • $\hat{\gamma}_k \to_p \Er[y(1) - y(0) | G_k]$
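
A compact sketch of the whole GATES pipeline under a known propensity score. The 50/50 split, the random-forest proxy for $S(x)$, and the by-hand weighted least squares are illustrative choices, not the paper's exact implementation:

```python
# GATES sketch: auxiliary sample fits the CATE proxy S(x); the main sample
# runs the weighted regression of y on group dummies interacted with d - P(x).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def gates(y, d, X, p=0.5, K=5, seed=0):
    rng = np.random.default_rng(seed)
    aux = rng.random(len(y)) < 0.5            # auxiliary/main split
    main = ~aux
    # CATE proxy S(x): difference of fitted outcome models on auxiliary data
    f1 = RandomForestRegressor().fit(X[aux & (d == 1)], y[aux & (d == 1)])
    f0 = RandomForestRegressor().fit(X[aux & (d == 0)], y[aux & (d == 0)])
    S = f1.predict(X[main]) - f0.predict(X[main])
    # Groups G_k from quantiles of S(x)
    cuts = np.quantile(S, np.linspace(0, 1, K + 1))
    G = np.clip(np.searchsorted(cuts, S, side="right") - 1, 0, K - 1)
    # Weighted least squares by hand: intercept plus K interactions
    Z = np.column_stack([np.ones(main.sum())] +
                        [(d[main] - p) * (G == k) for k in range(K)])
    w = np.full(main.sum(), 1.0 / (p * (1 - p)))
    WZ = Z * w[:, None]
    gamma = np.linalg.solve(Z.T @ WZ, WZ.T @ y[main])
    return gamma[1:]   # gamma_k estimates E[y(1) - y(0) | G_k]
```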


Best linear projection of CATE

  • Randomly partition the sample into auxiliary and main samples

  • Use any method on the auxiliary sample to estimate a baseline proxy $B(x) \approx \Er[y(0)|x]$ and an effect proxy $S(x) \approx \Er[y(1) - y(0)|x]$

  • Use the main sample to regress $y$ on $B(x)$, $(d - P(x))$, and $(d - P(x))(S(x) - \Er[S(x)])$, weighted by $(P(x)(1-P(x)))^{-1}$ (sketch below)

  • $\hat{\beta}_0, \hat{\beta}_1 \to_p \argmin_{b_0, b_1} \Er[(s_0(x) - b_0 - b_1 (S(x)-\Er[S(x)]))^2]$
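
A matching sketch of the BLP regression, reusing the main-sample objects `y`, `d`, and `S` from the GATES sketch above. With a constant known propensity $p$ the weights are constant, so WLS reduces to OLS, but the weighting is kept for form's sake:

```python
# BLP sketch: beta_1 is the slope of the best linear predictor of the
# CATE in the proxy S(x); beta_0 estimates the average treatment effect.
import numpy as np

def blp(y, d, S, p=0.5):
    Z = np.column_stack([np.ones_like(y), d - p, (d - p) * (S - S.mean())])
    w = 1.0 / (p * (1 - p))                    # constant weight here
    beta = np.linalg.solve(Z.T @ Z * w, Z.T @ y * w)
    return beta[1], beta[2]                    # (beta_0_hat, beta_1_hat)
```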


Inference on CATE


Random forest asymptotic normality

  • Wager, Stefan, and Susan Athey, “Estimation and inference of heterogeneous treatment effects using random forests,” Journal of the American Statistical Association, 113 (2018), 1228–1242.

  • $\mu(x) = \Er[y|x]$

  • $\hat{\mu}(x)$ estimate from honest random forest

  • honest $=$ trees independent of outcomes being averaged

  • sample-splitting or trees formed using another outcome

  • Then $\frac{\hat{\mu}(x) - \mu(x)}{\hat{\sigma}_n(x)} \to_d N(0,1)$

  • $\hat{\sigma}_n(x) \to 0$ slower than $n^{-1/2}$
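
A simplified illustration of what "honest" means operationally: each tree's splits are chosen on one half-sample while its leaf averages come from the other half. This is only a caricature; it omits the subsampling rates and the infinitesimal-jackknife variance estimator that Wager and Athey use for $\hat{\sigma}_n(x)$ (implemented, e.g., in the grf R package):

```python
# Honest prediction at a point x0: tree structure from one half,
# leaf means re-populated from the other half.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def honest_forest_predict(X, y, x0, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty(B)
    for b in range(B):
        idx = rng.permutation(n)
        split, est = idx[: n // 2], idx[n // 2 :]
        tree = DecisionTreeRegressor(min_samples_leaf=20)
        tree.fit(X[split], y[split])           # structure: split half only
        leaves = tree.apply(X[est])            # leaf membership: est half
        leaf0 = tree.apply(x0.reshape(1, -1))[0]
        in_leaf = leaves == leaf0
        preds[b] = y[est][in_leaf].mean() if in_leaf.any() else y[est].mean()
    # sigma_hat_n(x) would come from the infinitesimal jackknife, omitted here
    return preds.mean()
```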


Random forest asymptotic normality

  • Results are pointwise, but what about:
  • $H_0: \mu(x_1) = \mu(x_2)$
  • ${x: \mu(x) \geq 0 }$
  • $\Pr(\mu(x) \leq 0)$

Uniform inference

Bibliography

Carrasco, Marine, Jean-Pierre Florens, and Eric Renault, “Linear inverse problems in structural econometrics: estimation based on spectral decomposition and regularization,” chapter 77 in Handbook of Econometrics, James J. Heckman and Edward E. Leamer, eds. (Elsevier, 2007).

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney Newey, “Double/debiased/Neyman machine learning of treatment effects,” American Economic Review, 107 (2017), 261–265.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins, “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal, 21 (2018a), C1–C68.

Chernozhukov, Victor, Mert Demirer, Esther Duflo, and Iván Fernández-Val, “Generic machine learning inference on heterogenous treatment effects in randomized experiments,” NBER Working Paper, 2018b.

Chernozhukov, Victor, Christian Hansen, and Martin Spindler, “Valid post-selection and post-regularization inference: An elementary, general approach,” Annual Review of Economics, 7 (2015), 649–688.

Chernozhukov, Victor, Whitney Newey, James Robins, and Rahul Singh, “Double/de-biased machine learning of global and local parameters using regularized Riesz representers,” working paper, 2018c.

Wager, Stefan, and Susan Athey, “Estimation and inference of heterogeneous treatment effects using random forests,” Journal of the American Statistical Association, 113 (2018), 1228–1242.