# Orthogonal Machine Learning: Power and Limitations

Lester Mackey¹, Vasilis Syrgkanis¹, Ilias Zadik¹²

¹Microsoft Research New England, USA. ²Operations Research Center, MIT, USA. Correspondence to: Lester Mackey, Vasilis Syrgkanis, Ilias Zadik. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract. Double machine learning provides $\sqrt{n}$-consistent estimates of parameters of interest even when high-dimensional or nonparametric nuisance parameters are estimated at an $n^{-1/4}$ rate. The key is to employ Neyman-orthogonal moment equations which are first-order insensitive to perturbations in the nuisance parameters. We show that the $n^{-1/4}$ requirement can be improved to $n^{-1/(2k+2)}$ by employing a $k$-th order notion of orthogonality that grants robustness to more complex or higher-dimensional nuisance parameters. In the partially linear regression setting, popular in causal inference, we show that we can construct second-order orthogonal moments if and only if the treatment residual is not normally distributed. Our proof relies on Stein's lemma and may be of independent interest. We conclude by demonstrating the robustness benefits of an explicit doubly-orthogonal estimation procedure for treatment effect.

## 1. Introduction

The increased availability of large and complex observational datasets is driving an increasing demand to conduct accurate causal inference of treatment effects in the presence of high-dimensional confounding factors. We take as our running example demand estimation from pricing and purchase data in the digital economy, where many features of the world that simultaneously affect pricing decisions and demand are available in large data stores. One often appeals to modern statistical machine learning (ML) techniques to model and fit the high-dimensional or nonparametric nuisance parameters introduced by these confounders. However, most such techniques introduce bias into their estimates (e.g., via regularization) and hence yield invalid or inaccurate inferences concerning the parameters of interest (the treatment effects).

Several recent lines of work have begun to address the problem of debiasing ML estimators to perform accurate inference on a low-dimensional component of model parameters. Prominent examples include Lasso debiasing (Zhang & Zhang, 2014; van de Geer et al., 2014; Javanmard & Montanari, 2015) and post-selection inference (Belloni et al., 2017; Berk et al., 2013; Tibshirani et al., 2016). The recent double/debiased ML work of Chernozhukov et al. (2017) describes a general-purpose strategy for extracting valid inferences for target parameters from somewhat arbitrary and relatively inaccurate estimates of nuisance parameters. Specifically, Chernozhukov et al. (2017) analyze a two-stage process where in the first stage one estimates nuisance parameters using arbitrary statistical ML techniques on a first-stage data sample and in the second stage estimates the low-dimensional parameters of interest via the generalized method of moments (GMM). Crucially, the moments in the second stage are required to satisfy a Neyman orthogonality condition, granting them first-order robustness to errors in the nuisance parameter estimation. A main conclusion is that the second stage estimates are $\sqrt{n}$-consistent and asymptotically normal whenever the first stage estimates are consistently estimated at an $o(n^{-1/4})$ rate.
To illustrate this result, let us consider the partially linear regression (PLR) model, popular in causal inference. In the PLR model we observe data triplets $Z = (T, Y, X)$, where $T \in \mathbb{R}$ represents a treatment or policy applied, $Y \in \mathbb{R}$ represents an outcome of interest, and $X \in \mathbb{R}^p$ is a vector of associated covariates. These observations are related via the equations

$$Y = \theta_0 T + f_0(X) + \epsilon, \qquad \mathbb{E}[\epsilon \mid X, T] = 0 \ \text{a.s.}$$
$$T = g_0(X) + \eta, \qquad \mathbb{E}[\eta \mid X] = 0 \ \text{a.s.}$$

where $\eta$ and $\epsilon$ represent unobserved disturbances with distributions independent of $(\theta_0, f_0, g_0)$. The first equation features the treatment effect $\theta_0$, our object of inference. The second equation describes the relation between the treatment $T$ and the associated covariates $X$. The covariates $X$ affect the outcome $Y$ through the nuisance function $f_0$ and the treatment $T$ through the nuisance function $g_0$. Using the Neyman-orthogonal moment of (Chernozhukov et al., 2017, Eq. 4.55), the authors show that it suffices to estimate the nuisance $(f_0, g_0)$ at an $o(n^{-1/4})$ rate to construct a $\sqrt{n}$-consistent and asymptotically normal estimator of $\theta_0$.

In this work, we provide a framework for achieving stronger robustness to first stage errors while maintaining second stage validity. In particular, we introduce a notion of higher-order orthogonality and show that if the moment is $k$-th order orthogonal then a first-stage estimation rate of $o(n^{-1/(2k+2)})$ suffices for $\sqrt{n}$-asymptotic normality of the second stage. We then provide a concrete application of our approach to the case of estimating treatment effects in the PLR model. Interestingly, we show an impossibility result when the treatment residual follows a Gaussian distribution: no higher-order orthogonal moments with finite asymptotic variance exist, so first-order Neyman orthogonality appears to be the limit of robustness to first stage errors under Gaussian treatment residual. However, conversely, we also show how to construct appropriate second-order orthogonal moments whenever the treatment residual is not Gaussian. As a result, when the nuisance functions are linear in the high-dimensional confounders, our second-order orthogonal moments provide valid inferences whenever the number of relevant confounders is $o(n^{2/3}/\log p)$; meanwhile the first-order orthogonality analyses of (Chernozhukov et al., 2017) accommodate only $o(\sqrt{n}/\log p)$ relevant confounders.

We apply these techniques in the setting of demand estimation from pricing and purchase data, where highly non-Gaussian treatment residuals are standard. In this setting, the treatment is the price of a product, and commonly, conditional on all observable covariates, the treatment follows a discrete distribution representing random discounts offered to customers over a baseline price linear in the observables. In Figure 1 we portray the results of a synthetic demand estimation problem with dense dependence on observables. Here, the standard orthogonal moment estimation has large bias, comparable to the variance, while our second-order orthogonal moments lead to nearly unbiased estimation.

Figure 1. Distribution of estimates based on (a) orthogonal moments ($\hat\theta = 2.78$, $\hat\sigma = .022$) and (b) second-order orthogonal moments ($\hat\theta = 3.$, $\hat\sigma = .032$). The true treatment effect is $\theta_0 = 3$. Sample size $n = 5000$, dimension of confounders $p = 1000$, support size of sparse linear nuisance functions $s = 100$. The details of this experiment can be found in Section 5.
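To make the data-generating process behind this synthetic demand example concrete, the following is a minimal Python sketch of a sparse linear PLR instance with a discrete, non-Gaussian treatment residual. The coefficient and noise choices here are illustrative stand-ins rather than the exact experimental configuration of Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_plr(n, p, theta0=3.0, s=10):
    """Draw n observations (T, Y, X) from a sparse linear PLR model.

    f0(X) = <X, beta0>, g0(X) = <X, gamma0> with s-sparse coefficients;
    eta is a mean-zero discrete residual (mimicking random discounts),
    epsilon is uniform outcome noise.
    """
    X = rng.standard_normal((n, p))
    support = rng.choice(p, size=s, replace=False)
    beta0 = np.zeros(p)
    gamma0 = np.zeros(p)
    beta0[support] = rng.uniform(0, 5, size=s)
    gamma0[support] = rng.uniform(0, 5, size=s)
    eta = rng.choice([-0.5, 0.0, 1.5, 3.5], p=[0.65, 0.2, 0.1, 0.05], size=n)
    eps = rng.uniform(-1, 1, size=n)
    T = X @ gamma0 + eta                   # T = g0(X) + eta
    Y = theta0 * T + X @ beta0 + eps       # Y = theta0 * T + f0(X) + eps
    return T, Y, X
```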
Notational conventions. For each $n \in \mathbb{N}$, we introduce the shorthand $[n]$ for $\{1, \ldots, n\}$. We let $\to_p$ and $\to_d$ represent convergence in probability and convergence in distribution respectively. When random variables $A$ and $B$ are independent, we use $\mathbb{E}_A[g(A, B)] \triangleq \mathbb{E}[g(A, B) \mid B]$ to represent expectation only over the variable $A$. For a sequence of random vectors $(X_n)_{n=1}^{\infty}$ and a deterministic sequence of scalars $(a_n)_{n=1}^{\infty}$, we write $X_n = O_P(a_n)$ to mean $X_n/a_n$ is stochastically bounded, i.e., for any $\epsilon > 0$ there are $R_\epsilon, N_\epsilon > 0$ with $\Pr(\|X_n\|/a_n > R_\epsilon) \le \epsilon$ for all $n > N_\epsilon$. We let $N(\mu, \Sigma)$ represent a multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$.

## 2. Z-Estimation with Nuisance Functions and Orthogonality

Our aim is to estimate an unknown target parameter $\theta_0 \in \Theta \subseteq \mathbb{R}^d$ given access to independent replicates $(Z_t)_{t=1}^{2n}$ of a random data vector $Z \in \mathbb{R}^\rho$ drawn from a distribution satisfying $d$ moment conditions,

$$\mathbb{E}[m(Z, \theta_0, h_0(X)) \mid X] = 0 \quad \text{a.s.} \qquad (1)$$

Here, $X \in \mathbb{R}^p$ is a sub-vector of the observed data vector $Z$, $h_0 \in \mathcal{H} \subseteq \{h : \mathbb{R}^p \to \mathbb{R}^\ell\}$ is a vector of $\ell$ unknown nuisance functions, and $m : \mathbb{R}^\rho \times \mathbb{R}^d \times \mathbb{R}^\ell \to \mathbb{R}^d$ is a vector of $d$ known moment functions. We assume that these moment conditions exactly identify the parameter $\theta_0$, and we allow for the data to be high-dimensional, with $\rho$ and $p$ potentially growing with the sample size $n$. However, the number of parameters of interest $d$ and the number of nuisance functions $\ell$ are assumed to be constant.

We will analyze a two-stage estimation process where we first estimate the nuisance parameters using half of our sample¹ and then form a Z-estimate of the target parameter $\theta_0$ using the remainder of the sample and our first-stage estimates of the nuisance. This sample-splitting procedure proceeds as follows.

1. First stage. Form an estimate $\hat h \in \mathcal{H}$ of $h_0$ using $(Z_t)_{t=n+1}^{2n}$ (e.g., by running a nonparametric or high-dimensional regression procedure).
2. Second stage. Compute a Z-estimate $\hat\theta_{SS} \in \Theta$ of $\theta_0$ using an empirical version of the moment conditions (1) and $\hat h$ as a plug-in estimate of $h_0$: $\hat\theta_{SS}$ solves $\frac{1}{n}\sum_{t=1}^{n} m(Z_t, \theta, \hat h(X_t)) = 0$. (2)

¹Unequal divisions of the sample can also be used; we focus on an equal division for simplicity of presentation.

Relegating only half of the sample to each stage represents a statistically inefficient use of data and, in many applications, detrimentally impacts the quality of the first-stage estimate $\hat h$. A form of repeated sample splitting called K-fold cross-fitting (see, e.g., Chernozhukov et al., 2017) addresses both of these concerns. K-fold cross-fitting partitions the index set of the datapoints $[2n]$ into $K$ subsets $I_1, \ldots, I_K$ of cardinality $2n/K$ (assuming for simplicity that $K$ divides $2n$) and produces the following two-stage estimate:

1. First stage. For each $k \in [K]$, form an estimate $\hat h_k \in \mathcal{H}$ of $h_0$ using only the datapoints $(Z_t)_{t \in I_k^c}$ corresponding to $I_k^c = [2n] \setminus I_k$.
2. Second stage. Compute a Z-estimate $\hat\theta_{CF} \in \Theta$ of $\theta_0$ using an empirical version of the moment conditions (1) and $(\hat h_k)_{k \in [K]}$ as plug-in estimators of $h_0$: $\hat\theta_{CF}$ solves $\frac{1}{2n}\sum_{k=1}^{K}\sum_{t \in I_k} m(Z_t, \theta, \hat h_k(X_t)) = 0$. (3)

Throughout, we assume $K$ is a constant independent of all problem dimensions. As we will see in Theorem 1, a chief advantage of cross-fitting over sample splitting is improved relative efficiency, with an asymptotic variance that reflects the use of the full dataset in estimating $\theta$.
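For a scalar parameter of interest, the cross-fitted estimator (3) can be sketched in a few lines of Python. The helpers fit_nuisance and moment are user-supplied placeholders (not from the paper's released code), and the root-finding step assumes the averaged moment changes sign over the supplied bracket.

```python
import numpy as np
from scipy.optimize import brentq

def crossfit_estimate(Z, K, fit_nuisance, moment, theta_bracket=(-10.0, 10.0)):
    """K-fold cross-fitting Z-estimator for a scalar theta, in the spirit of Eq. (3).

    Z is a list of data points; fit_nuisance(train_points) returns a fitted
    nuisance function h_hat; moment(z, theta, h_hat) returns the scalar moment m.
    """
    N = len(Z)
    folds = np.array_split(np.random.permutation(N), K)

    # First stage: fit one nuisance estimate per fold, using the other folds only.
    h_hats = []
    for k in range(K):
        train_idx = np.setdiff1d(np.arange(N), folds[k])
        h_hats.append(fit_nuisance([Z[i] for i in train_idx]))

    def avg_moment(theta):
        # Empirical moment (3): each point is evaluated with the nuisance
        # estimate fitted on the folds that exclude it.
        return np.mean([moment(Z[i], theta, h_hats[k])
                        for k in range(K) for i in folds[k]])

    # Second stage: solve the one-dimensional moment equation for theta.
    return brentq(avg_moment, *theta_bracket)
```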
Main Question. Our primary inferential goal is to establish conditions under which the estimators $\hat\theta_{SS}$ in (2) and $\hat\theta_{CF}$ in (3) enjoy $\sqrt{n}$-asymptotic normality, that is,

$$\sqrt{n}(\hat\theta_{SS} - \theta_0) \to_d N(0, \Sigma) \quad \text{and} \quad \sqrt{2n}(\hat\theta_{CF} - \theta_0) \to_d N(0, \Sigma)$$

for some constant covariance matrix $\Sigma$. Coupled with a consistent estimator of $\Sigma$, asymptotic normality enables the construction of asymptotically valid confidence intervals for $\theta$ based on Gaussian or Student's t quantiles and asymptotically valid hypothesis tests, like the Wald test, based on chi-squared limits.

### 2.1. Higher-order Orthogonality

We would like our two-stage procedures to produce accurate estimates of $\theta_0$ even when the first stage nuisance estimates are relatively inaccurate. With this goal in mind, Chernozhukov et al. (2017) defined the notion of Neyman-orthogonal moments, inspired by the early work of Neyman (1979). In our setting, the orthogonality condition of (Chernozhukov et al., 2017) is implied by the following condition, which we will call first-order orthogonality:

Definition 1 (First-order Orthogonal Moments). A vector of moments $m : \mathbb{R}^\rho \times \mathbb{R}^d \times \mathbb{R}^\ell \to \mathbb{R}^d$ is first-order orthogonal with respect to the nuisance $h_0(X)$ if

$$\mathbb{E}\left[\left.\nabla_\gamma m(Z, \theta_0, \gamma)\right|_{\gamma = h_0(X)} \,\middle|\, X\right] = 0.$$

Here, $\nabla_\gamma m(Z, \theta_0, \gamma)$ is the gradient of the vector of moments with respect to its final $\ell$ arguments.

Intuitively, first-order orthogonal moments are insensitive to small perturbations in the nuisance parameters and hence robust to small errors in estimates of these parameters. A main result of (Chernozhukov et al., 2017) is that, if the moments $m$ are first-order orthogonal, then $o(n^{-1/4})$ error rates² in the first stage estimation of $h_0$ are sufficient for $\sqrt{n}$-asymptotic normality of the estimates $\hat\theta_{SS}$ and $\hat\theta_{CF}$.

²In the sense of root mean squared error: $\mathbb{E}[\|h_0(X) - \hat h(X)\|_2^2 \mid \hat h] = o_p(n^{-1/2})$.

Our aim is to accommodate slower rates of convergence in the first stage of estimation by designing moments robust to larger nuisance estimation errors. To achieve this, we will introduce a generalized notion of orthogonality that requires higher-order nuisance derivatives of $m$ to be conditionally mean zero. We will make use of the following higher-order differential notation:

Definition 2 (Higher-order Differentials). Given a vector of moments $m : \mathbb{R}^\rho \times \mathbb{R}^d \times \mathbb{R}^\ell \to \mathbb{R}^d$ and a vector $\alpha \in \mathbb{N}^\ell$, we denote by $D^\alpha m(Z, \theta, \gamma)$ the $\alpha$-differential of $m$ with respect to its final $\ell$ arguments:

$$D^\alpha m(Z, \theta, \gamma) = \partial_{\gamma_1}^{\alpha_1} \partial_{\gamma_2}^{\alpha_2} \cdots \partial_{\gamma_\ell}^{\alpha_\ell} m(Z, \theta, \gamma). \qquad (4)$$

We are now equipped to define our notion of S-orthogonal moments:

Definition 3 (S-Orthogonal Moments). A vector of moments $m : \mathbb{R}^\rho \times \mathbb{R}^d \times \mathbb{R}^\ell \to \mathbb{R}^d$ is S-orthogonal with respect to the nuisance $h_0(X)$ for some orthogonality set $S \subseteq \mathbb{N}^\ell$ if, for any $\alpha \in S$:

$$\mathbb{E}[D^\alpha m(Z, \theta_0, h_0(X)) \mid X] = 0. \qquad (5)$$

We will often be interested in the special case of Definition 3 in which $S$ is comprised of all vectors $\alpha \in \mathbb{N}^\ell$ with $\|\alpha\|_1 \le k$. This implies that all mixed nuisance derivatives of the moment of order $k$ or less are conditionally mean zero. We will refer to this special case as k-orthogonality or k-th order orthogonality.

Definition 4 (k-Orthogonal Moments). A vector of moments $m : \mathbb{R}^\rho \times \mathbb{R}^d \times \mathbb{R}^\ell \to \mathbb{R}^d$ is k-orthogonal if it is $S_k$-orthogonal for the k-orthogonality set $S_k \triangleq \{\alpha \in \mathbb{N}^\ell : \|\alpha\|_1 \le k\}$.

The general notion of S-orthogonality allows for our moments to be more robust to errors in some nuisance functions and less robust to errors in others. This is particularly valuable when some nuisance functions are easier to estimate than others; we will encounter such an example in Section 4.2.
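As a sanity check on Definition 1, the following Monte Carlo sketch (with illustrative nuisance functions, not those of the paper's experiments) verifies that the standard PLR moment $m = (Y - q - \theta(T - g))(T - g)$ has conditionally mean-zero first derivatives in $(q, g)$ at the true nuisances, while its mixed second derivative in $(q, g)$ is identically 1, so it is first-order but not second-order orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta0 = 200_000, 3.0
X = rng.standard_normal(n)
g0 = lambda x: 2 * x                    # illustrative nuisance functions
f0 = lambda x: -x
q0 = lambda x: f0(x) + theta0 * g0(x)   # q0 = f0 + theta0 * g0 = E[Y | X]
eta = rng.choice([-0.5, 0.0, 1.5, 3.5], p=[0.65, 0.2, 0.1, 0.05], size=n)
eps = rng.uniform(-1, 1, size=n)
T = g0(X) + eta
Y = theta0 * T + f0(X) + eps

# Orthogonal PLR moment: m = (Y - q - theta*(T - g)) * (T - g).
# First derivatives in the nuisance arguments, evaluated at the true nuisances:
d_dq = -(T - g0(X))                                                 # dm/dq
d_dg = theta0 * (T - g0(X)) - (Y - q0(X) - theta0 * (T - g0(X)))    # dm/dg
print(d_dq.mean(), d_dg.mean())  # both approximately 0: first-order orthogonality
# The mixed second derivative d^2 m / (dq dg) equals 1 everywhere,
# so this moment is not second-order orthogonal.
```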
## 3. Higher-order Orthogonality and Root-n Consistency

We will now show that S-orthogonality together with appropriate consistency rates for the first stage estimates of the nuisance functions imply $\sqrt{n}$-consistency and asymptotic normality of the two-stage estimates $\hat\theta_{SS}$ and $\hat\theta_{CF}$. Beyond orthogonality and consistency, our main Assumption 1 demands identifiability, non-degeneracy, and regularity of the moments $m$, all of which are standard for establishing the asymptotic normality of Z-estimators.

Assumption 1. For a non-empty orthogonality set $S \subseteq \mathbb{N}^\ell$ and $k \triangleq \max_{\alpha \in S} \|\alpha\|_1$, we assume the following:

1. S-Orthogonality. The moments $m$ are S-orthogonal.
2. Identifiability. $\mathbb{E}[m(Z, \theta, h_0(X))] \ne 0$ when $\theta \ne \theta_0$.
3. Non-degeneracy. The matrix $\mathbb{E}[\nabla_\theta m(Z, \theta_0, h_0(X))]$ is invertible.
4. Smoothness. $\nabla^k m$ exists and is continuous.
5. Consistency of First Stage. The first stage estimates satisfy $\mathbb{E}\left[\prod_{i=1}^\ell |\hat h_i(X) - h_{0,i}(X)|^{4\alpha_i} \mid \hat h\right] \to_p 0$ for all $\alpha \in S$, where the convergence in probability is with respect to the auxiliary data set used to fit $\hat h$.
6. Rate of First Stage. The first stage estimates satisfy $n\,\mathbb{E}\left[\prod_{i=1}^\ell |\hat h_i(X) - h_{0,i}(X)|^{2\alpha_i} \mid \hat h\right] \to_p 0$ for all $\alpha \in \{a \in \mathbb{N}^\ell : \|a\|_1 \le k+1\} \setminus S$, where the convergence in probability is with respect to the auxiliary data set used to fit $\hat h$.
7. Regularity of Moments. There exists an $r > 0$ such that the following regularity conditions hold:
   (a) $\mathbb{E}[\sup_{\theta \in B_{\theta_0,r}} \|\nabla_\theta m(Z, \theta, h_0(X))\|_F] < \infty$ for $B_{\theta_0,r} \triangleq \{\theta \in \Theta : \|\theta - \theta_0\|_2 \le r\}$.
   (b) $\sup_{h \in B_{h_0,r}} \mathbb{E}[\sup_{\theta \in B_{\theta_0,r}} \|\nabla_\gamma \nabla_\theta m(Z, \theta, h(X))\|_2] < \infty$ for $B_{h_0,r} \triangleq \{h \in \mathcal{H} : \max_{\alpha : \|\alpha\|_1 \le k+1} \mathbb{E}[\prod_{i=1}^\ell |h_i(X) - h_{0,i}(X)|^{2\alpha_i}] \le r\}$.
   (c) $\max_{\alpha : \|\alpha\|_1 \le k+1} \sup_{h \in B_{h_0,r}} \mathbb{E}[|D^\alpha m(Z, \theta_0, h(X))|^4] \le \lambda(\theta_0, h_0) < \infty$.
   (d) $\mathbb{E}[\sup_{\theta \in A, h \in B_{h_0,r}} \|m(Z, \theta, h(X))\|_2] < \infty$ for any compact $A \subseteq \Theta$.
   (e) $\sup_{\theta \in A, h \in B_{h_0,r}} \mathbb{E}[\|\nabla_\gamma m(Z, \theta, h(X))\|_2] < \infty$ for any compact $A \subseteq \Theta$.

We are now ready to state our main theorem on the implications of S-orthogonality for second stage $\sqrt{n}$-asymptotic normality. The proof can be found in Section A.

Theorem 1 (Main Theorem). Under Assumption 1, if $\hat\theta_{SS}$ and $\hat\theta_{CF}$ are consistent, then

$$\sqrt{n}(\hat\theta_{SS} - \theta_0) \to_d N(0, \Sigma) \quad \text{and} \quad \sqrt{2n}(\hat\theta_{CF} - \theta_0) \to_d N(0, \Sigma),$$

where $\Sigma = J^{-1} V J^{-1}$ for $J = \mathbb{E}[\nabla_\theta m(Z, \theta_0, h_0(X))]$ and $V = \mathrm{Cov}(m(Z, \theta_0, h_0(X)))$.

A variety of standard sufficient conditions guarantee the consistency of $\hat\theta_{SS}$ and $\hat\theta_{CF}$. Our next result, proved in Section B, establishes consistency under either of two commonly satisfied assumptions.

Assumption 2. One of the following sets of conditions is satisfied:

1. Compactness conditions: $\Theta$ is compact.
2. Convexity conditions: $\Theta$ is convex, $\theta_0$ is in the interior of $\Theta$, and, with probability approaching 1, the mapping $\theta \mapsto \frac{1}{n}\sum_{t=1}^n m(Z_t, \theta, \hat h(X_t))$ is the gradient of a convex function.

Remark. A continuously differentiable vector-valued function $\theta \mapsto F(\theta)$ on a convex domain $\Theta$ is the gradient of a convex function whenever the matrix $\nabla_\theta F(\theta)$ is symmetric and positive semidefinite for all $\theta$.

Theorem 2 (Consistency). If Assumptions 1 and 2 hold, then $\hat\theta_{SS}$ and $\hat\theta_{CF}$ are consistent.

### 3.1. Sufficient Conditions for First Stage Rates

Our assumption on the first stage estimation rates, i.e., that for all $\alpha \in \{a \in \mathbb{N}^\ell : \|a\|_1 \le k+1\} \setminus S$,

$$n\,\mathbb{E}\left[\prod_{i=1}^\ell |\hat h_i(X) - h_{0,i}(X)|^{2\alpha_i} \,\middle|\, \hat h\right] \to_p 0,$$

may seem complex, as it involves the interaction of the errors of multiple nuisance function estimates. In this section we give sufficient conditions that involve only the rates of individual nuisance function estimates and which imply our first stage rate assumptions. In particular, we are interested in formulating consistency rate conditions for each nuisance function $h_i$ with respect to an $L_p$ norm, $\|\hat h_i - h_{0,i}\|_p \triangleq \mathbb{E}[|\hat h_i(X) - h_{0,i}(X)|^p \mid \hat h]^{1/p}$.
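The $L_p$ error norms appearing in these conditions can be estimated by Monte Carlo on a fresh sample. The sketch below (with illustrative dimensions and an off-the-shelf scikit-learn Lasso, not the paper's tuning) computes the $L_{2(k+1)}$ error of a first-stage fit, the quantity whose decay rate Corollary 4 below asks to control.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lp_error(h_hat, h_0, X_eval, p):
    """Monte Carlo estimate of ||h_hat - h_0||_p = E[|h_hat(X) - h_0(X)|^p | h_hat]^(1/p)
    using a fresh evaluation sample X_eval, independent of the data used to fit h_hat."""
    return float(np.mean(np.abs(h_hat(X_eval) - h_0(X_eval)) ** p) ** (1.0 / p))

# Illustrative use: L_{2(k+1)} error of a Lasso fit of g0(X) = <X, gamma0> for k = 2.
rng = np.random.default_rng(4)
n, p_dim, s, k = 5000, 1000, 20, 2
gamma0 = np.zeros(p_dim)
gamma0[:s] = rng.uniform(0, 5, size=s)
X = rng.standard_normal((n, p_dim))
T = X @ gamma0 + rng.normal(size=n)
lasso = Lasso(alpha=2 * np.sqrt(np.log(p_dim) / n), max_iter=10_000).fit(X, T)
X_fresh = rng.standard_normal((20_000, p_dim))
g_hat = lambda x: x @ lasso.coef_ + lasso.intercept_
print(lp_error(g_hat, lambda x: x @ gamma0, X_fresh, p=2 * (k + 1)))
```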
We will make use of these sufficient conditions when applying our main theorem to the partially linear regression model in Section 4.2.

Lemma 3. Let $k = \max_{a \in S} \|a\|_1$. Then (1) Assumption 1.6 holds if either of the following holds for all $\alpha \in \{a \in \mathbb{N}^\ell : \|a\|_1 \le k+1\} \setminus S$:

$$n \prod_{i=1}^\ell \|\hat h_i - h_{0,i}\|_{2\kappa_i\|\alpha\|_1}^{2\alpha_i} \to_p 0 \qquad (7)$$

for some $\kappa_i \in (0, 2]$ where $\frac{1}{\|\alpha\|_1}\sum_{i=1}^\ell \frac{\alpha_i}{\kappa_i} \le 1$, or

$$n^{\frac{1}{\kappa_i\|\alpha\|_1}}\,\|\hat h_i - h_{0,i}\|_{2\|\alpha\|_1} \to_p 0 \ \text{ for all } i \in [\ell] \qquad (8)$$

for some $\kappa_i \in (0, 2]$. (2) Assumption 1.5 holds if, for all $i$, $\|\hat h_i - h_{0,i}\|_{4k} \to_p 0$.

A simpler description of the sufficient conditions arises under k-orthogonality (Definition 4), since the set $\{a \in \mathbb{N}^\ell : \|a\|_1 \le k+1\} \setminus S_k$ contains only vectors $\alpha$ with $\|\alpha\|_1 = k+1$.

Corollary 4. If $S$ is the canonical k-orthogonality set $S_k$ (Definition 4), then Assumption 1.6 holds whenever, for all $i$, $n^{\frac{1}{2(k+1)}}\,\|\hat h_i - h_{0,i}\|_{2(k+1)} \to_p 0$, and Assumption 1.5 holds whenever, for all $i$, $\|\hat h_i - h_{0,i}\|_{4k} \to_p 0$.

In the case of first-order orthogonality, Corollary 4 requires that the first stage nuisance functions be estimated at an $o(n^{-1/4})$ rate with respect to the $L_4$ norm. This is almost but not exactly the same as the condition presented in (Chernozhukov et al., 2017), which requires $o(n^{-1/4})$ consistency rates with respect to the $L_2$ norm. Ignoring the expectation over $X$, the two conditions are equivalent.³ Moreover, in the case of k-orthogonality, Corollary 4 requires $o(n^{-1/(2(k+1))})$ rates with respect to the $L_{2(k+1)}$ norm. More generally, S-orthogonality allows for some functions to be estimated more slowly than others, as we will see in the case of the sparse linear model.

³We would recover the exact condition in (Chernozhukov et al., 2017) if we replaced Assumption 1.7c with the more stringent assumption that $|D^\alpha m(Z, \theta, h(X))| \le \lambda$ a.s.

## 4. Second-order Orthogonality for Partially Linear Regression

When second-order orthogonal moments satisfying Assumption 1 are employed, Corollary 4 implies that an $o(n^{-1/6})$ rate of nuisance parameter estimation is sufficient for $\sqrt{n}$-consistency of $\hat\theta_{SS}$ and $\hat\theta_{CF}$. This asymptotic improvement over first-order orthogonality holds the promise of accommodating more complex and higher-dimensional nuisance parameters. In this section, we detail both the limitations and the power of this approach in the partially linear regression (PLR) model setting popular in causal inference (see, e.g., Chernozhukov et al., 2017).

Definition 5 (Partially Linear Regression (PLR)). In the partially linear regression model of observations $Z = (T, Y, X)$, $T \in \mathbb{R}$ represents a treatment or policy applied, $Y \in \mathbb{R}$ represents an outcome of interest, and $X \in \mathbb{R}^p$ is a vector of associated covariates. These observations are related via the equations

$$Y = \theta_0 T + f_0(X) + \epsilon, \qquad \mathbb{E}[\epsilon \mid X, T] = 0 \ \text{a.s.}$$
$$T = g_0(X) + \eta, \qquad \mathbb{E}[\eta \mid X] = 0 \ \text{a.s.}$$

where $\eta$ and $\epsilon$ represent unobserved noise variables with distributions independent of $(\theta_0, f_0, g_0)$.

### 4.1. Limitations: the Gaussian Treatment Barrier

Our first result shows that, under the PLR model, if the treatment noise $\eta$ is conditionally Gaussian given $X$, then no second-order orthogonal moment can satisfy Assumption 1, because every twice continuously differentiable 2-orthogonal moment has $\mathbb{E}[\nabla_\theta m(Z, \theta_0, h_0(X))] = 0$ (a violation of Assumption 1.3). The proof in Section D relies on Stein's lemma.

Theorem 5. Under the PLR model, suppose that $\eta$ is conditionally Gaussian given $X$ (a.s. $X$). If a twice differentiable moment function $m$ is second-order orthogonal with respect to the nuisance parameters $(f_0(X), g_0(X))$, then it must satisfy $\mathbb{E}[\nabla_\theta m(Z, \theta_0, h_0(X))] = 0$ and hence violate Assumption 1.3. Therefore no second-order orthogonal moment satisfies Assumption 1.
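The quantity at the heart of this dichotomy is the Gaussian moment identity $\mathbb{E}[\eta^{r+1}] = r\,\mathbb{E}[\eta^2]\,\mathbb{E}[\eta^{r-1}]$, which Stein's lemma guarantees for Gaussian residuals and which fails for the discrete, discount-style residuals of our experiments. The following Monte Carlo sketch (with illustrative distributions, and with $\eta$ independent of $X$ so that unconditional moments suffice) checks the gap directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 1_000_000, 3

def stein_gap(eta, r):
    """Monte Carlo estimate of E[eta^(r+1)] - r * E[eta^2] * E[eta^(r-1)].
    Zero (up to simulation noise) for Gaussian eta by Stein's lemma; the
    second-order orthogonal moments of Section 4.2 are non-degenerate
    exactly when this gap is nonzero."""
    return np.mean(eta ** (r + 1)) - r * np.mean(eta ** 2) * np.mean(eta ** (r - 1))

gauss = rng.normal(0.0, 1.3, size=n)                                          # Gaussian residual
disc = rng.choice([-0.5, 0.0, 1.5, 3.5], p=[0.65, 0.2, 0.1, 0.05], size=n)    # discount-style residual
print(stein_gap(gauss, r))   # approximately 0
print(stein_gap(disc, r))    # clearly nonzero (excess kurtosis)
```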
In the following result, proved in Section E, we establish that under mild conditions Assumption 1.3 is necessary for the $\sqrt{n}$-consistency of $\hat\theta_{SS}$ in the PLR model.

Proposition 6. Under the PLR model, suppose that $|\Theta| \ge 2$ and that the conditional distribution of $(\epsilon, \eta)$ given $X$ has full support on $\mathbb{R}^2$ (a.s. $X$). Then no moment function $m$ simultaneously satisfies

1. Assumption 1, except for Assumption 1.3,
2. $\mathbb{E}[\nabla_\theta m(Z, \theta_0, h_0(X))] = 0$, and
3. $\hat\theta_{SS} - \theta_0 = O_P(1/\sqrt{n})$.

### 4.2. Power: Second-order Orthogonality under Non-Gaussian Treatment

We next show that, conversely, second-order orthogonal moments are available whenever the conditional distribution of the treatment noise given $X$ is not a.s. Gaussian. Our proofs rely on a standard characterization of a Gaussian distribution, proved in Section F:

Lemma 7. If $\mathbb{E}[\eta \mid X] = 0$ a.s., the conditional distribution of $\eta$ given $X$ is a.s. Gaussian if and only if for all $r \in \mathbb{N}$, $r \ge 2$, it holds that $\mathbb{E}[\eta^{r+1} \mid X] = r\,\mathbb{E}[\eta^2 \mid X]\,\mathbb{E}[\eta^{r-1} \mid X]$ a.s.

We will focus on estimating the nuisance functions $q_0 = f_0 + \theta_0 g_0$ and $g_0$ instead of the nuisance functions $f_0$ and $g_0$, since the former task is more practical in many applications. This is because estimating $q_0$ can be accomplished by carrying out an arbitrary nonparametric regression of $Y$ onto $X$. In contrast, estimating $f_0$ typically involves regressing $Y$ onto $(X, T)$, where $T$ is constrained to enter linearly. The latter might be cumbersome when using arbitrary ML regression procedures.

Our first result, established in Section G, produces finite-variance 2-orthogonal moments when an appropriate moment of the treatment noise $\eta$ is known.

Theorem 8. Under the PLR model, suppose that we know $\mathbb{E}[\eta^r \mid X]$ and that $\mathbb{E}[\eta^{r+1}] \ne r\,\mathbb{E}[\mathbb{E}[\eta^2 \mid X]\,\mathbb{E}[\eta^{r-1} \mid X]]$ for some $r \in \mathbb{N}$, so that the conditional distribution of $\eta$ given $X$ is not a.s. Gaussian. Then the moments

$$m(Z, \theta, q(X), g(X), \mu_{r-1}(X)) \triangleq (Y - q(X) - \theta(T - g(X)))\,\big((T - g(X))^r - \mathbb{E}[\eta^r \mid X] - r\,(T - g(X))\,\mu_{r-1}(X)\big)$$

satisfy each of the following properties:

- 2-orthogonality with respect to the nuisance $h_0(X) = (q_0(X), g_0(X), \mathbb{E}[\eta^{r-1} \mid X])$,
- Identifiability: when $\theta \ne \theta_0$, $\mathbb{E}[m(Z, \theta, q_0(X), g_0(X), \mathbb{E}[\eta^{r-1} \mid X])] \ne 0$,
- Non-degeneracy: $\mathbb{E}[\nabla_\theta m(Z, \theta_0, q_0(X), g_0(X), \mathbb{E}[\eta^{r-1} \mid X])] \ne 0$,
- Smoothness: $\nabla^k m$ is continuous for all $k \in \mathbb{N}$.

Our next result, proved in Section H, addresses the more realistic setting in which we do not have exact knowledge of $\mathbb{E}[\eta^r \mid X]$. We introduce an additional nuisance parameter and still satisfy an orthogonality condition with respect to these parameters.

Theorem 9. Under the PLR model, suppose that $\mathbb{E}[\eta^{r+1}] \ne r\,\mathbb{E}[\mathbb{E}[\eta^2 \mid X]\,\mathbb{E}[\eta^{r-1} \mid X]]$ for some $r \in \mathbb{N}$, so that the conditional distribution of $\eta$ given $X$ is not a.s. Gaussian. Then, if $S \triangleq \{\alpha \in \mathbb{N}^4 : \|\alpha\|_1 \le 2\} \setminus \{(1, 0, 0, 1), (0, 1, 0, 1)\}$, the moments

$$m(Z, \theta, q(X), g(X), \mu_{r-1}(X), \mu_r(X)) \triangleq (Y - q(X) - \theta(T - g(X)))\,\big((T - g(X))^r - \mu_r(X) - r\,(T - g(X))\,\mu_{r-1}(X)\big)$$

satisfy each of the following properties:

- S-orthogonality with respect to the nuisance $h_0(X) = (q_0(X), g_0(X), \mathbb{E}[\eta^{r-1} \mid X], \mathbb{E}[\eta^r \mid X])$,
- Identifiability: when $\theta \ne \theta_0$, $\mathbb{E}[m(Z, \theta, q_0(X), g_0(X), \mathbb{E}[\eta^{r-1} \mid X], \mathbb{E}[\eta^r \mid X])] \ne 0$,
- Non-degeneracy: $\mathbb{E}[\nabla_\theta m(Z, \theta_0, q_0(X), g_0(X), \mathbb{E}[\eta^{r-1} \mid X], \mathbb{E}[\eta^r \mid X])] \ne 0$,
- Smoothness: $\nabla^k m$ is continuous for all $k \in \mathbb{N}$.

In words, S-orthogonality here means that $m$ satisfies the orthogonality condition for all mixed derivatives of total order at most 2 with respect to the four nuisance parameters, except the mixed derivatives with respect to $(q_0(X), \mathbb{E}[\eta^r \mid X])$ and $(g_0(X), \mathbb{E}[\eta^r \mid X])$.
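For concreteness, here is a small Python sketch of the Theorem 9 moment for $r = 3$ (the excess-kurtosis case), together with a Monte Carlo check on simulated PLR data that its mean vanishes at $\theta = \theta_0$ and is bounded away from zero otherwise. The simulated nuisances and parameter values are illustrative only.

```python
import numpy as np

def second_order_moment(Y, T, theta, q, g, mu_rm1, mu_r, r=3):
    """Theorem 9 moment: (Y - q - theta*(T - g)) *
    ((T - g)^r - mu_r - r*(T - g)*mu_{r-1}), with all nuisances evaluated at X."""
    resid_t = T - g
    psi = resid_t ** r - mu_r - r * resid_t * mu_rm1
    return (Y - q - theta * resid_t) * psi

# Illustrative check on simulated PLR data with a non-Gaussian eta:
rng = np.random.default_rng(3)
n, theta0 = 500_000, 3.0
X = rng.standard_normal(n)
eta = rng.choice([-0.5, 0.0, 1.5, 3.5], p=[0.65, 0.2, 0.1, 0.05], size=n)
eps = rng.uniform(-1, 1, size=n)
T = 2 * X + eta                       # g0(X) = 2X
Y = theta0 * T - X + eps              # f0(X) = -X, so q0(X) = f0 + theta0*g0 = 5X
mu2, mu3 = np.mean(eta ** 2), np.mean(eta ** 3)   # E[eta^2], E[eta^3] (eta indep. of X here)
m_true = second_order_moment(Y, T, theta0, 5 * X, 2 * X, mu2, mu3, r=3)
m_off = second_order_moment(Y, T, theta0 + 1, 5 * X, 2 * X, mu2, mu3, r=3)
print(m_true.mean(), m_off.mean())    # ~ 0 at theta0, bounded away from 0 otherwise
```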
### 4.3. Application to High-dimensional Linear Nuisance Functions

We now consider deploying the PLR model in the high-dimensional linear regression setting, where $f_0(X) = \langle X, \beta_0\rangle$ and $g_0(X) = \langle X, \gamma_0\rangle$ for two s-sparse vectors $\beta_0, \gamma_0 \in \mathbb{R}^p$, $p$ tends to infinity as $n \to \infty$, and $(\eta, \epsilon, X)$ are mutually independent. Define $q_0 = \beta_0 + \theta_0\gamma_0$. In this high-dimensional regression setting, Chernozhukov et al. (2017, Rem. 4.3) showed that two-stage estimation with the first-order orthogonal moments

$$m(Z, \theta, \langle X, q\rangle, \langle X, \gamma\rangle) = (Y - \langle X, q\rangle - \theta(T - \langle X, \gamma\rangle))\,(T - \langle X, \gamma\rangle) \qquad (9)$$

and Lasso estimates of the nuisance provides a $\sqrt{n}$-asymptotically normal estimator of $\theta_0$ when $s = o(n^{1/2}/\log p)$. Our next result, established in Appendix I, shows that we can accommodate $s = o(n^{2/3}/\log p)$ with an explicit set of higher-order orthogonal moments.

Theorem 10. In the high-dimensional linear regression setting, suppose that either $\mathbb{E}[\eta^3] \ne 0$ (non-zero skewness) or $\mathbb{E}[\eta^4] \ne 3\,\mathbb{E}[\eta^2]^2$ (excess kurtosis), that $X$ has i.i.d. mean-zero standard Gaussian entries, that $\epsilon$ and $\eta$ are almost surely bounded by the known value $C$, and that $\theta_0 \in [-M, M]$ for known $M$. If $s = o(n^{2/3}/\log p)$, and in the first stage of estimation we (a) create estimates $\hat q, \hat\gamma$ of $q_0, \gamma_0$ via Lasso regression of $Y$ on $X$ and of $T$ on $X$ respectively, with regularization parameter $\lambda_n = 2CM\sqrt{3\log(p)/n}$, and (b) estimate $\mathbb{E}[\eta^2]$ and $\mathbb{E}[\eta^3]$ using $\hat\eta_t \triangleq T'_t - \langle X'_t, \hat\gamma\rangle$, $\hat\mu_2 = \frac{1}{n}\sum_{t=1}^n \hat\eta_t^2$, and $\hat\mu_3 = \frac{1}{n}\sum_{t=1}^n (\hat\eta_t^3 - 3\hat\mu_2\hat\eta_t)$, for $(T'_t, X'_t)_{t=1}^n$ an i.i.d. sample independent of $\hat\gamma$, then, using the moments $m$ of Theorem 9 with $r = 2$ in the case of non-zero skewness or $r = 3$ in the case of excess kurtosis, $\hat\theta_{SS}$ and $\hat\theta_{CF}$ are $\sqrt{n}$-asymptotically normal estimators of $\theta_0$.
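A simplified, single-split version of this procedure for the excess-kurtosis case ($r = 3$) can be sketched as follows. The Lasso penalty, the split proportions, and the moment estimators mirror the theorem only loosely (scikit-learn's Lasso scales its penalty differently, and no cross-fitting is performed), so this is an illustrative sketch rather than the authors' implementation; their released code is linked in Section 5.

```python
import numpy as np
from sklearn.linear_model import Lasso

def second_order_plr_estimate(Y, T, X, alpha=None, seed=0):
    """Simplified sample-splitting second-order orthogonal estimator of theta0
    for the sparse linear PLR model, using the Theorem 9 moment with r = 3.

    Fold A fits the Lasso nuisances for E[Y|X] and E[T|X]; fold B estimates the
    treatment-residual moments mu_2, mu_3; fold C solves the moment equation."""
    n = len(Y)
    idx = np.random.default_rng(seed).permutation(n)
    A, B, C = idx[: n // 2], idx[n // 2 : 3 * n // 4], idx[3 * n // 4 :]
    if alpha is None:
        alpha = 2 * np.sqrt(np.log(X.shape[1]) / len(A))   # illustrative penalty level
    q_hat = Lasso(alpha=alpha, max_iter=10_000).fit(X[A], Y[A])   # model for E[Y|X]
    g_hat = Lasso(alpha=alpha, max_iter=10_000).fit(X[A], T[A])   # model for E[T|X]

    eta_B = T[B] - g_hat.predict(X[B])            # treatment residuals on fold B
    mu2 = np.mean(eta_B ** 2)                     # estimate of E[eta^2]
    mu3 = np.mean(eta_B ** 3 - 3 * mu2 * eta_B)   # estimate of E[eta^3]

    # Second stage on fold C: the Theorem 9 moment is linear in theta,
    # so the empirical moment equation has a closed-form solution.
    eta_C = T[C] - g_hat.predict(X[C])
    psi = eta_C ** 3 - mu3 - 3 * eta_C * mu2      # (T-g)^3 - mu_3 - 3 (T-g) mu_2
    return np.sum((Y[C] - q_hat.predict(X[C])) * psi) / np.sum(eta_C * psi)
```

In the paper's experiments the roles of the subsamples are additionally rotated (cross-fitting and nested cross-fitting, described in Section 5); the single split above is kept only for readability.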
## 5. Experiments

We perform an experimental analysis of the second-order orthogonal estimator of Theorem 10 with $r = 3$ for the case of estimating treatment effects in the PLR model with high-dimensional sparse linear nuisance functions. We compare our estimator with the double ML estimator (labeled dml in our figures) based on the first-order orthogonal moments (9) of (Chernozhukov et al., 2017). Our experiments are designed to simulate demand estimation from pricing and purchase data, where non-Gaussian treatment residuals are standard. Here, our covariates $X$ correspond to all collected variables that may affect a pricing policy. A typical randomized experiment in a pricing policy takes the form of random discounts from a baseline price, as a company offers random discounts to customers periodically to gauge demand level. In this case, the treatment residual, the unexplained fluctuation in price, is decidedly non-Gaussian and specifically follows a discrete distribution over a small number of price points. Python code recreating all experiments is available at https://github.com/IliasZadik/double_orthogonal_ml.

Experiment Specification. We generated $n$ independent replicates of outcome $Y$, treatment $T$, and confounding covariates $X$. The confounders $X$ have dimension $p$ and have independent components drawn from the $N(0, 1)$ distribution. The treatment is a sparse linear function of $X$, $T = \langle\gamma_0, X\rangle + \eta$, where only $s$ of the $p$ coefficients of $\gamma_0$ are non-zero. The x-axis on each plot is the number of non-zero coefficients $s$. Moreover, $\eta$ is drawn from a discrete distribution with values $\{-0.5, 0, 1.5, 3.5\}$ taken respectively with probabilities $(.65, .2, .1, .05)$. Here, the treatment represents the price of a product or service, and this data generating process simulates random discounts over a baseline price. Finally, the outcome is generated by a linear model, $Y = \theta_0 T + \langle\beta_0, X\rangle + \epsilon$, where $\theta_0 = 3$ is the treatment effect, $\beta_0$ is another sparse vector with only $s$ non-zero entries, and $\epsilon$ is drawn independently from a uniform $U(-\sigma_\epsilon, \sigma_\epsilon)$ distribution. Importantly, the coordinates of the $s$ non-zero entries of the coefficient vector $\beta_0$ are the same as the coordinates of the $s$ non-zero entries of $\gamma_0$. The latter ensures that the variables $X$ create a true endogeneity problem, i.e., that $X$ affects both the treatment and the outcome directly. In such settings, controlling for $X$ is important for unbiased estimation. To generate an instance of the problem, the common support of both $\gamma_0$ and $\beta_0$ was generated uniformly at random from the set of all coordinates, and each non-zero coefficient was generated independently from a uniform $U(0, 5)$ distribution.

The first stage nuisance functions were fitted for both methods by running the Lasso on a subsample of $n/2$ sample points. For the first-order method, all remaining $n/2$ points were used for the second stage estimation of $\theta_0$. For the second-order method, the moments $\mathbb{E}[\eta^2]$ and $\mathbb{E}[\eta^3]$ were estimated using a subsample of $n/4$ points as described in Theorem 10, and the remaining $n/4$ sample points were used for the second stage estimation of $\theta_0$. For each method we performed cross-fitting across the first and second stages, and for the second-order method we performed nested cross-fitting between the $n/4$ subsample used for the $\mathbb{E}[\eta^2]$ and $\mathbb{E}[\eta^3]$ estimation and the $n/4$ subsample used for the second stage estimation. The regularization parameter $\lambda_n$ of each Lasso was chosen proportional to $\sqrt{\log(p)/n}$.

For each instance of the problem, i.e., each random realization of the coefficients, we generated 2000 independent datasets to estimate the bias and standard deviation of each estimator. We repeated this process over 100 randomly generated problem instances, each time with a different draw of the coefficients $\gamma_0$ and $\beta_0$, to evaluate variability across different realizations of the nuisance functions.

Distribution of Errors with Fixed Sparsity. In Figure 1, we display the distribution of estimates based on orthogonal moments and second-order orthogonal moments for a particular sparsity level $s = 100$ and for $n = 5000$ and $p = 1000$. We observe that both estimates are approximately normally distributed, but the orthogonal moment estimation exhibits significant bias, an order of magnitude larger than the variance.

Bias-Variance Tradeoff with Varying Sparsity. Figure 2 portrays the median quantities (solid lines) and the maximum and minimum of these quantities (error bars) across the 100 different nuisance function draws as a function of the support size for $n = 5000$, $p = 1000$, and $\sigma_\epsilon = 1$.

Figure 2. Comparison of estimates $\hat\theta_{CF}$ based on orthogonal moments and second-order orthogonal moments under the PLR model as a function of the number of non-zero coefficients in the nuisance vectors $\gamma_0$ and $\beta_0$. See Section 5 for more details. The parameters used for this figure were $n = 5000$, $p = 1000$, $\sigma_\epsilon = 1$. The fourth panel displays the $\ell_2$ error in the coefficients discovered by the first stage estimates for each of the nuisance functions: model t is the model for $\mathbb{E}[T \mid X]$ and model y is the model for $\mathbb{E}[Y \mid X]$.

Varying n and p. In Figure 3, we display how performance varies with $n$ and $p$. Due to computational considerations, for this parameter exploration, we only used a single problem instance for each $(n, p, s)$ triplet rather than 100 instances as in the exploration above.
We note that for $n = 2000$, $p = 5000$ the breaking point of our method is around $s = 100$, while for $n = 5000$, $p = 2000$ it is around $s = 550$. For $n = 10000$, $p = 1000$ our method performs exceptionally well even up to $s = 800$.

Figure 3. MSE of both estimators as the sparsity varies for different sample size and dimension pairs $(n, p)$: (a) $n = 2000$, $p = 1000$; (b) $n = 2000$, $p = 2000$; (c) $n = 2000$, $p = 5000$; (d) $n = 5000$, $p = 1000$; (e) $n = 5000$, $p = 2000$; (f) $n = 10000$, $p = 1000$. Note that the range of the support sizes is larger for larger $n$. $\sigma_\epsilon = 1$.

Varying σϵ. Finally, Figure 4 displays performance as the noise scale $\sigma_\epsilon$ of $\epsilon$ grows.

Figure 4. MSE of both estimators as the sparsity varies for different noise parameters $\sigma_\epsilon$ (panels include $\sigma_\epsilon = 10$ and $\sigma_\epsilon = 20$). $n = 5000$, $p = 1000$.

## 6. Conclusion

Our aim in this work was to conduct accurate inference for fixed-dimensional target parameters in the presence of high-dimensional or nonparametric nuisance. To achieve this, we introduced a notion of k-th order orthogonal moments for two-stage Z-estimation, generalizing the first-order Neyman orthogonality studied in (Chernozhukov et al., 2017). Given k-th order orthogonal moments, we established that estimating the nuisance at an $o(n^{-1/(2k+2)})$ rate suffices for $\sqrt{n}$-consistent and asymptotically normal estimates of the target parameters. We then studied the PLR model popular in causal inference and showed that a valid second-order orthogonal moment exists if and only if the treatment residual is not normally distributed. In the high-dimensional linear nuisance setting, these explicit second-order orthogonal moments tolerate significantly denser nuisance vectors than those accommodated by (Chernozhukov et al., 2017). We complemented our results with synthetic demand estimation experiments showing the benefits of second-order orthogonal moments over standard Neyman-orthogonal moments.

## References

Belloni, A., Chernozhukov, V., Fernández-Val, I., and Hansen, C. Program evaluation and causal inference with high-dimensional data. Econometrica, 85(1):233–298, 2017.

Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. Valid post-selection inference. Annals of Statistics, 41(2):802–837, 2013. doi: 10.1214/12-AOS1077.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. Double/debiased/Neyman machine learning of treatment effects. American Economic Review, 107(5):261–265, May 2017.

Durrett, R. Probability: Theory and Examples. Cambridge University Press, New York, NY, USA, 4th edition, 2010. ISBN 9780521765398.

Flanders, H. Differentiation under the integral sign. The American Mathematical Monthly, 80(6):615–627, 1973.

Hastie, T., Tibshirani, R., and Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

Javanmard, A. and Montanari, A. De-biasing the Lasso: optimal sample size for Gaussian designs. arXiv e-prints, August 2015.

Newey, W. and McFadden, D. Chapter 36: Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111–2245, 1994. doi: 10.1016/S1573-4412(05)80005-4.

Neyman, J. C(α) tests and their use. Sankhyā: The Indian Journal of Statistics, Series A, 41(1/2):1–21, 1979.

Stein, C. M. Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9(6):1135–1151, 1981.

Tibshirani, R. J., Taylor, J., Lockhart, R., and Tibshirani, R. Exact post-selection inference for sequential regression procedures.
Journal of the American Statistical Association, 111(514):600–620, 2016.

van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. On asymptotically optimal confidence regions and tests for high-dimensional models. Annals of Statistics, 42(3):1166–1202, 2014. doi: 10.1214/14-AOS1221.

van der Vaart, A. W. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998. ISBN 0-521-49603-9.

Zhang, C. H. and Zhang, S. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014. doi: 10.1111/rssb.12026.