This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

Using machine learning to estimate causal effects

Double debiased machine learning

  • @chernozhukov2018, @chernozhukov2017

  • Parameter of interest $\theta \in \R^{d_\theta}$

  • Nuisance parameter $\eta \in T$

  • Moment conditions $\Er[\psi(W;\theta_0,\eta_0)] = 0$ with $\psi$ known

  • Estimate $\hat{\eta}$ using some machine learning method

  • Estimate $\hat{\theta}$ using cross-fitting


Cross-fitting

  • Randomly partition $\{1, \dots, n\}$ into $K$ subsets $(I_k)_{k=1}^K$
  • $I^c_k = \{1, \dots, n\} \setminus I_k$
  • $\hat{\eta}_k =$ estimate of $\eta$ using observations in $I^c_k$
  • Estimator: $\hat{\theta}$ solves $\frac{1}{K} \sum_{k=1}^K \En_k\left[\psi(w_i;\hat{\theta},\hat{\eta}_k)\right] = 0$, where $\En_k$ averages over $i \in I_k$ (sketched in code below)
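
To make the recipe concrete, here is a minimal sketch of cross-fitting in code, assuming (purely for illustration) the partially linear model $y = \theta_0 d + g(x) + u$ with the standard partialling-out score $\psi(w;\theta,\eta) = (y - \ell(x) - \theta(d - m(x)))(d - m(x))$, $\eta = (\ell, m)$, $\ell(x) = \Er[y|x]$, $m(x) = \Er[d|x]$. The learners, variable names, and the function name `dml_plr` are arbitrary illustrative choices.

```python
# Minimal cross-fitted DML sketch for the partially linear model (illustrative only).
# Assumes y, d are 1d numpy arrays and x is a 2d numpy array of covariates.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr(y, d, x, K=5):
    """Cross-fitted (DML2) estimate of theta in the partially linear model."""
    yres = np.zeros(len(y))
    dres = np.zeros(len(y))
    for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(x):
        # nuisance estimates eta_hat_k computed on the complement I_k^c ...
        l_hat = RandomForestRegressor().fit(x[train], y[train])
        m_hat = RandomForestRegressor().fit(x[train], d[train])
        # ... and evaluated on the held-out fold I_k
        yres[test] = y[test] - l_hat.predict(x[test])
        dres[test] = d[test] - m_hat.predict(x[test])
    # DML2: solve the pooled moment condition sum_i psi(w_i; theta, eta_hat_k(i)) = 0
    theta = np.sum(dres * yres) / np.sum(dres * dres)
    # plug-in standard error based on the linear score
    J = np.mean(dres ** 2)
    var = np.mean(((yres - theta * dres) * dres) ** 2) / J ** 2
    return theta, np.sqrt(var / len(y))
```

With equal-sized folds, pooling the residuals as above is the same as averaging the fold-wise moment conditions $\En_k[\psi]$.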

Assumptions

  • Linear score
  • Near Neyman orthogonality:
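
Stated explicitly (paraphrasing @chernozhukov2018 up to notation; see the paper for the exact constants and suprema), these two conditions are that the score is linear in $\theta$,

$$\psi(w;\theta,\eta) = \psi^a(w;\eta)\,\theta + \psi^b(w;\eta),$$

and that the moment condition is nearly insensitive to small perturbations of the nuisance parameter around $\eta_0$,

$$\lambda_n := \sup_{\eta \in \mathcal{T}_n} \norm{ \partial_r \Er\left[\psi(W;\theta_0, \eta_0 + r(\eta - \eta_0))\right]\Big|_{r=0} } \leq \delta_n n^{-1/2}.$$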

Assumptions

  • Rate conditions: for $\delta_n \to 0$ and $\Delta_n \to 0$, we have $\Pr(\hat{\eta}_k \in \mathcal{T}_n) \geq 1-\Delta_n$ and
  • Moments exist and other regularity conditions

We focus on the case of linear scores to simplify proofs, and all of our examples have scores linear in $\theta$. @chernozhukov2018 cover nonlinear scores as well.

These rate conditions might look a little strange. The rate conditions are stated this way because they’re exactly what is needed for the result to work. $\Delta_n$ and $\delta_n$ are sequences converging to $0$. $\mathcal{T}_n$ is a shrinking neighborhood of $\eta_0$. A good exercise would be to show that if $\psi$ is a smooth function of $\eta$ and $\theta$, and $\Er[(\hat{\eta}(x) - \eta_0(x))^2]^{1/2} = O(\epsilon_n) = o(n^{-1/4})$, then we can meet the above conditions with $r_n = r_n' = \epsilon_n$ and $\lambda_n' = \epsilon_n^2$.
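
A sketch of that exercise, assuming (consistent with @chernozhukov2018, though the notation here may differ slightly) that $r_n$ bounds $\norm{\Er[\psi^a(W;\hat{\eta}_k)] - \Er[\psi^a(W;\eta_0)]}$, that $r_n'$ bounds $\Er[\norm{\psi(W;\theta_0,\hat{\eta}_k) - \psi(W;\theta_0,\eta_0)}^2]^{1/2}$, and that $\lambda_n'$ bounds $\norm{\partial^2_r \Er[\psi(W;\theta_0,\eta_0 + r(\hat{\eta}_k - \eta_0))]}$: if $\psi$ is smooth in $\eta$ with bounded derivatives, a first order expansion gives

$$\norm{\Er[\psi^a(W;\hat{\eta}_k)] - \Er[\psi^a(W;\eta_0)]} \lesssim \Er[(\hat{\eta}_k(x) - \eta_0(x))^2]^{1/2} = O(\epsilon_n),$$

and similarly for the mean-square difference of the scores, so $r_n$ and $r_n'$ can be taken to be of order $\epsilon_n$. The second derivative along the path from $\eta_0$ to $\hat{\eta}_k$ is quadratic in $\hat{\eta}_k - \eta_0$, so $\lambda_n'$ is of order $\epsilon_n^2$, and $\sqrt{n}\lambda_n' \to 0$ follows from $\epsilon_n = o(n^{-1/4})$.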


Proof outline:

  • Let

  • Show:

  • Show $\norm{R_{n,1}} = O_p(n^{-1/2} + r_n)$

  • Show $\norm{R_{n,2}}= O_p(n^{-1/2} r_n’ + \lambda_n + \lambda_n’)$

For details see the appendix of @chernozhukov2018.


Proof outline: Lemma 6.1

Lemma 6.1

  1. If $\Pr(\norm{X_m} > \epsilon_m | Y_m) \to_p 0$, then $\Pr(\norm{X_m}>\epsilon_m) \to 0$.

  2. If $\Er[\norm{X_m}^q/\epsilon_m^q | Y_m] \to_p 0$ for $q\geq 1$, then $\Pr(\norm{X_m}>\epsilon_m) \to 0$.

  3. If $\norm{X_m} = O_p(A_m)$ conditional on $Y_m$ (i.e. for any $\ell_m \to \infty$, $\Pr(\norm{X_m} > \ell_m A_m | Y_m) \to_p 0$), then $\norm{X_m} = O_p(A_m)$ unconditionally

  1. by dominated convergence
  2. from Markov’s inequality
  3. follows from 1

Proof outline: $R_{n,1}$

  • where

Proof outline: $R_{n,2}$

  • $R_{n,2} = \frac{1}{K} \sum_{k=1}^K \En_k\left[ \psi(w_i;\theta_0,\hat{\eta}_k) - \psi(w_i;\theta_0,\eta_0) \right]$
  • where

  • $U_{4,k} = \sqrt{n} \norm{f_k(1)}$ where


Asymptotic normality

  • $\rho_n := n^{-1/2} + r_n + r_n' + n^{1/2} (\lambda_n + \lambda_n') \lesssim \delta_n$

  • Influence function with

This is the DML2 case of theorem 3.1 of @chernozhukov2018.


Creating orthogonal moments

  • Need

  • Given some model, how do we find a suitable $\psi$?


Orthogonal scores via concentrating-out

  • Original model:
  • Define
  • First order condition from $\max_\theta \Er[\ell(W;\theta,\beta(\theta))]$ is
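
Concretely (a sketch of the construction, following the concentrating-out approach of @chernozhukov2018 up to notation): starting from the original model $\max_{\theta,\beta} \Er[\ell(W;\theta,\beta)]$, define $\beta(\theta) = \arg\max_\beta \Er[\ell(W;\theta,\beta)]$. The first order condition of the concentrated problem is

$$0 = \frac{d}{d\theta} \Er[\ell(W;\theta,\beta(\theta))] = \Er\left[\partial_\theta \ell(W;\theta,\beta(\theta)) + \left(\frac{d\beta}{d\theta}(\theta)\right)' \partial_\beta \ell(W;\theta,\beta(\theta))\right],$$

with nuisance parameter $\eta = \beta(\cdot)$ (and its derivative). Neyman orthogonality follows from the inner first order condition $\Er[\partial_\beta \ell(W;\theta,\beta(\theta))] = 0$ and its derivative in $\theta$, essentially an envelope-theorem argument.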

Orthogonal scores via projection

  • Original model: $m: \mathcal{W} \times \R^{d_\theta} \times \R^{d_h} \to \R^{d_m}$
  • Let $A(R)$ be a $d_\theta \times d_m$ moment selection matrix, $\Omega(R)$ a $d_m \times d_m$ weighting matrix, and
  • $\eta = (\mu, h)$ and

@chernozhukov2018 show how to construct orthogonal scores in a few examples via concentrating out and projection. @chernozhukov2015 also discusses creating orthogonal scores.


Example: average derivative

  • $x,y \in \R^1$, $\Er[y|x] = f_0(x)$, $p(x) =$ density of $x$

  • $\theta_0 = \Er[f_0'(x)]$

  • Joint objective

  • Solve for the minimizing $f$ given $\theta$

  • Concentrated objective:

  • First order condition at $f_\theta = f_0$ gives

We’ll go over this derivation in lecture, but I don’t think I’ll have time to type it here.
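
For reference, a Neyman orthogonal score for the average derivative (the familiar doubly robust form that this type of derivation produces; see e.g. @chernozhukov2018), with nuisance $\eta = (f, p)$, is

$$\psi(w;\theta,\eta) = f'(x) - \frac{p'(x)}{p(x)}\left(y - f(x)\right) - \theta.$$

Orthogonality with respect to $f$ follows from integration by parts, since for any direction $h$ vanishing at the boundary, $\Er\left[h'(x) + \frac{p'(x)}{p(x)} h(x)\right] = \int \left(h(x) p(x)\right)' dx = 0$; orthogonality with respect to $p$ (which enters only through $p'/p$) follows from $\Er[y - f_0(x) | x] = 0$.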

See @cnr2018 for an approach to estimating average derivatives (and other linear in $\theta$ models) that doesn’t require explicitly calculating an orthogonal moment condition.


Example: average derivative with endogeneity

  • $x,y \in \R^1$, $p(x) =$ density of $x$

  • Model: $\Er[y - f(x) | z] = 0$, $\theta_0 = \Er[f_0'(x)]$

  • Joint objective:

  • then

    • where $T:\mathcal{L}^2_{p} \to \mathcal{L}^2_{\mu_z}$ with $(T f)(z) = \Er[f(x) |z]$
    • and $T^\ast :\mathcal{L}^2_{\mu_z} \to \mathcal{L}^2_{p}$ with $(T^\ast g)(x) = \Er[g(z) | x]$
  • Orthogonal moment condition:

Take the first order condition for $f$ in the joint objective function in the direction of an arbitrary test function $v$. Writing the expectations as integrals, integrating by parts to get rid of $v'(x)$, and switching the order of integration gives an expression linear in $v$. A second integration by parts, $\int f''(x) p(x) dx = -\int f'(x) p'(x) dx$ (assuming boundary terms vanish), eliminates the terms with $f'$ and $f''$. For the result to be $0$ for all $v$, the term multiplying $v$ must vanish; this condition can be written compactly using $T$ and $T^\ast$. Note that $T$ and $T^\ast$ are linear, and $T^\ast$ is the adjoint of $T$. Also, identification of $f$ requires that $T$ be one to one. Hence, if $f$ is identified, $T^\ast T$ is invertible, and we can solve the first order condition for $f_\theta$. Plugging $f_\theta(x)$ back into the objective function and then differentiating with respect to $\theta$ gives the orthogonal moment condition on the slide. Verifying that this moment condition is indeed orthogonal is slightly tedious: writing out some of the expectations as integrals, changing the order of integration, and judiciously factoring out terms will eventually lead to the desired conclusion.

@cfr2007 is an excellent review about estimating $(T^\ast T)^{-1}$ and the inverses of other linear transformations.


Example: average elasticity

  • Demand $D(p)$, quantities $q$, instruments $z$

  • Average elasticity $\theta = \Er[D'(p)/D(p) | z ]$

  • Joint objective:


Example: control function


Treatment heterogeneity

  • Potential outcomes model
    • Treatment $d \in \{0,1\}$
    • Potential outcomes $y(1), y(0)$
    • Covariates $x$
    • Unconfoundedness or instruments
  • Objects of interest:
    • Conditional average treatment effect $s_0(x) = \Er[y(1) - y(0) | x]$
    • Range and other measures of spread of conditional average treatment effect
    • Most and least affected groups

Fixed, finite groups

  • $G_1, \dots, G_K$ a finite partition of the support of $x$

  • Estimate $\Er[y(1) - y(0) | x \in G_k]$ as above

  • pros: easy inference, reveals some heterogeneity

  • cons: a poorly chosen partition hides some heterogeneity, and searching over partitions invalidates inference
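
A minimal code sketch of this group-wise approach, assuming a pre-specified partition (here, purely for illustration, quartiles of the first covariate) and reusing the cross-fitted estimator `dml_plr` sketched earlier; the function name `group_effects` and the partition rule are illustrative choices.

```python
# Group-wise treatment effects for a fixed, pre-specified partition of x.
import numpy as np

def group_effects(y, d, x, n_groups=4):
    # illustrative partition: quantile bins of the first covariate
    cuts = np.quantile(x[:, 0], np.linspace(0, 1, n_groups + 1))
    g = np.clip(np.searchsorted(cuts, x[:, 0], side="right") - 1, 0, n_groups - 1)
    # estimate E[y(1) - y(0) | x in G_k] within each group, e.g. with the
    # cross-fitted estimator dml_plr from the earlier sketch
    return [dml_plr(y[g == k], d[g == k], x[g == k]) for k in range(n_groups)]
```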


Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments

  • @cddf2018

  • Use machine learning to find a partition, with sample splitting to allow easy inference

  • Randomly partition the sample into auxiliary and main samples

  • Use any method on the auxiliary sample to estimate proxies $B(x)$ for $\Er[y(0)|x]$ and $S(x)$ for the CATE $\Er[y(1) - y(0)|x]$


Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments

  • Define $G_k = 1\{\ell_{k-1} \leq S(x) \leq \ell_k\}$

  • Use the main sample to run a weighted regression with weights $(P(x)(1-P(x)))^{-1}$ (see the code sketch below)

  • $\hat{\gamma}_k \to_p \Er[y(1) - y(0) | G_k]$
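
A minimal sketch of this procedure, assuming a randomized experiment with known propensity score $P(x) = 0.5$ and random forests for the auxiliary-sample proxies; the function name `gates`, the learners, and the exact set of regressors are illustrative choices, not the specification in @cddf2018 or these notes.

```python
# Sketch of GATES: sample splitting, proxy CATE, grouping, weighted regression.
# Assumes y, d are 1d numpy arrays (d is 0/1) and x is a 2d numpy array.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def gates(y, d, x, n_groups=5, p=0.5, rng=np.random.default_rng(0)):
    n = len(y)
    aux = rng.permutation(n) < n // 2          # auxiliary / main split
    main = ~aux

    # proxies from the auxiliary sample: B(x) baseline, S(x) proxy CATE
    mu1 = RandomForestRegressor().fit(x[aux & (d == 1)], y[aux & (d == 1)])
    mu0 = RandomForestRegressor().fit(x[aux & (d == 0)], y[aux & (d == 0)])
    S = mu1.predict(x[main]) - mu0.predict(x[main])
    B = mu0.predict(x[main])

    # groups G_k from quantiles of S(x) on the main sample
    cuts = np.quantile(S, np.linspace(0, 1, n_groups + 1))
    G = np.clip(np.searchsorted(cuts, S, side="right") - 1, 0, n_groups - 1)

    # weighted regression: y on (1, B(x)) and (d - p)*1{G_k}, weights 1/(p(1-p))
    ym, dm = y[main], d[main]
    Z = np.column_stack([np.ones(main.sum()), B] +
                        [(dm - p) * (G == k) for k in range(n_groups)])
    w = np.full(main.sum(), 1.0 / (p * (1 - p)))
    coef, *_ = np.linalg.lstsq(Z * np.sqrt(w)[:, None], ym * np.sqrt(w), rcond=None)
    return coef[2:]                            # gamma_k, roughly E[y(1)-y(0) | G_k]
```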


Best linear projection of CATE

  • Randomly partition the sample into auxiliary and main samples

  • Use any method on the auxiliary sample to estimate proxies $B(x)$ for $\Er[y(0)|x]$ and $S(x)$ for the CATE $\Er[y(1) - y(0)|x]$

  • Use the main sample to run a weighted regression with weights $(P(x)(1-P(x)))^{-1}$ (of the form sketched below)

  • $\hat{\beta}_0, \hat{\beta}_1 \to_p \argmin_{b_0, b_1} \Er[(s_0(x) - b_0 - b_1 (S(x)-\Er[S(x)]))^2]$
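
Concretely, and up to notation (the exact set of controls below is an illustrative choice, not necessarily the one in @cddf2018), the weighted regression on the main sample takes the form

$$y = \alpha' X_1 + \beta_0 \left(d - P(x)\right) + \beta_1 \left(d - P(x)\right)\left(S(x) - \overline{S}\right) + \varepsilon,$$

with weights $(P(x)(1-P(x)))^{-1}$, where $X_1$ contains a constant and auxiliary-sample proxies such as $B(x)$, and $\overline{S}$ is the main-sample average of $S(x)$.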


Inference on CATE

  • Inference on $\Er[y(1) - y(0) | x] = s_0(x)$ is challenging when $x$ is high dimensional and/or there are few restrictions on $s_0$

  • Pointwise results for random forests: @wager2018, @athey2016

  • Recent review of high dimensional inference: @bcchk2018


Random forest asymptotic normality

  • @wager2018

  • $\mu(x) = \Er[y|x]$

  • $\hat{\mu}(x)$ estimate from honest random forest

    • honest $=$ trees independent of outcomes being averaged

    • sample-splitting or trees formed using another outcome

  • Then $\left(\hat{\mu}(x) - \mu(x)\right)/\hat{\sigma}_n(x) \to_d N(0,1)$, where

    • $\hat{\sigma}_n(x) \to 0$ slower than $n^{-1/2}$

Random forest asymptotic normality

  • Results are pointwise, but what about quantities like:
    • $H_0: \mu(x_1) = \mu(x_2)$
    • $\{x: \mu(x) \geq 0 \}$
    • $\Pr(\mu(x) \leq 0)$

Uniform inference

  • @bcchk2018
  • @bccw2018

Bibliography
