Journal of Machine Learning Research 22 (2021) 1-66. Submitted 10/19; Revised 4/21; Published 5/21.

Adaptive Estimation of Nonparametric Functionals

Lin Liu (linliu@sjtu.edu.cn), Institute of Natural Sciences, MOE-LSC, School of Mathematical Sciences, SJTU-Yale Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, 200240, China

Rajarshi Mukherjee (ram521@mail.harvard.edu), Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA

James M. Robins (robins@hsph.harvard.edu), Department of Epidemiology and Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA

Eric Tchetgen Tchetgen (ett@wharton.upenn.edu), Wharton Statistics Department, The University of Pennsylvania, Philadelphia, PA 19104, USA

Editor: Gábor Lugosi

Abstract

We provide general adaptive upper bounds for estimating nonparametric functionals based on second-order U-statistics arising from finite-dimensional approximation of the infinite-dimensional models. We then provide examples of functionals for which the theory produces rate-optimal matching adaptive upper and lower bounds. Our results are automatically adaptive in both the parametric and nonparametric regimes of estimation, and are semiparametric efficient in the regime of parametric convergence rate.

Keywords: adaptive minimax estimation, functional estimation, Lepski's method, U-statistics, wavelets

1. Introduction

Estimation of functionals of the data generating distribution has always been of central interest in statistics. In nonparametric statistics and machine learning problems, where data generating distributions are parametrized by functions in infinite-dimensional spaces, there exists a comprehensive literature addressing such questions. In particular, a large body of research has been devoted to exploring minimax estimation of linear, quadratic, and entropy-type functionals in density and/or white noise models. We do not attempt to survey this extensive literature here. However, the interested reader can find a comprehensive snapshot of the literature in Hall and Marron (1987), Bickel and Ritov (1988), Donoho et al. (1990), Donoho and Nussbaum (1990), Fan (1991), Kerkyacharian and Picard (1996), Laurent (1996), Johnstone (2001), Cai and Low (2003), Cai and Low (2004), Cai and Low (2005b), Kandasamy et al. (2014), Berrett et al. (2019), Han et al. (2020b,a) and other references therein. Although the question of more general nonparametric functionals has received relatively less attention, some fundamental insights regarding estimation of non-linear integral functionals in density and white noise models can be found in Ibragimov and Has'minskii (2013), Kerkyacharian and Picard (1996), Nemirovski (2000), and references therein. A general feature of the results obtained while estimating smooth nonparametric functionals is an elbow effect in the rate of estimation based on the regularity of the underlying function classes.
For example, while estimating quadratic functionals in a $d$-dimensional density model, $\sqrt{n}$-efficient estimation can be achieved as soon as the Hölder exponent $\beta$ of the underlying density exceeds $d/4$, whereas the optimal rate of estimation is $n^{-\frac{4\beta}{4\beta+d}}$ (in the root mean squared error sense) for $\beta \le d/4$. A similar elbow in the rate of estimation exists for estimation of non-linear integral functionals as well. For the density model this was demonstrated by Birgé and Massart (1995), Kerkyacharian and Picard (1996), and Tchetgen Tchetgen et al. (2008). For the signal or white noise model, the problem of general integrated non-linear functionals was studied by Nemirovski (2000), but mostly in the $\sqrt{n}$-regime. However, for more complex nonparametric models, the approach to constructing minimax optimal procedures for general non-linear functionals in non-$\sqrt{n}$ regimes has been rather case specific. Motivated by this, in recent years, Robins et al. (2008, 2017) and Mukherjee et al. (2017) have developed a theory of inference for non-linear functionals in parametric, semiparametric, and nonparametric models based on higher order influence functions. Most minimax rate optimal estimators proposed in the literature, however, depend explicitly on knowledge of the smoothness indices. Thus, it becomes of interest to understand the question of adaptive estimation, i.e., the construction and analysis of estimators without prior knowledge of the smoothness. The question of adaptation for linear and quadratic functionals has been studied in detail in the context of density, white noise, and nonparametric additive Gaussian noise regression models (Low (1992), Efromovich and Low (1994), Efromovich and Low (1996), Tribouley (2000), Johnstone (2001), Efromovich and Samarov (2000), Klemelä and Tsybakov (2001), Laurent and Massart (2000), Cai and Low (2005a), Cai and Low (2006), Giné and Nickl (2008), Breunig and Chen (2021)). Although our theoretical results are very much related to and motivated by the papers on optimal adaptive estimation of quadratic functionals, including Fan (1991); Johnstone (2001); Efromovich and Samarov (2000); Laurent and Massart (2000); Cai and Low (2006); Giné and Nickl (2008), we consider a much wider and more complex class of functionals whose first order influence function depends on multiple nuisance functions. Our paper is the first to consider adaptive estimation in this wider class. In particular, we observe i.i.d. copies of a random vector $O = (W, X) \in \mathbb{R}^{m+d}$ with unknown distribution $P$ on each of $n$ study subjects. The variable $X$ represents a random vector of baseline covariates such as age, height, and weight. Throughout, $X$ is assumed to have compact support and a density with respect to (w.r.t.) Lebesgue measure on $\mathbb{R}^d$. The variable $W \in \mathbb{R}^m$ can be thought of as a combination of outcome and treatment variables in some of our examples. In this setup, we are interested in estimating functionals $\phi(P)$ that are smooth in the sense that, under finite-dimensional parametric submodels, they admit first-order derivatives which can be represented as inner products of first order influence functions with score functions (Bickel et al., 1993). For some classical examples of these functionals, we provide matching upper and lower bounds on the rate of adaptive minimax estimation over a varying class of smoothness of the underlying functions, provided that the unknown marginal design density of $X$ is sufficiently regular.
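As a schematic summary of this elbow in the prototypical case, the quadratic functional $\int f^2$ of a density $f$ on $[0,1]^d$ with Hölder exponent $\beta$, the minimax rate of estimation in root mean squared error behaves as

$$\inf_{\hat T}\ \sup_{f \in H(\beta, M)} \left\{ \mathbb{E}_f\left(\hat T - \int f^2\right)^2 \right\}^{1/2} \asymp
\begin{cases}
n^{-1/2}, & \beta \ge d/4,\\
n^{-\frac{4\beta}{4\beta+d}}, & \beta < d/4,
\end{cases}$$

and the same $d/4$ threshold, with $\beta$ replaced by a problem-specific effective smoothness index, reappears in the functionals studied below.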
The contributions of this paper are as follows. Building on the theory of adaptive estimation of linear and quadratic nonparametric functionals in density and Gaussian white noise models, we explore adaptation theory for non-linear functionals in more complex nonparametric models in both the n (where our adaptive estimators are semi-parametric efficient) and nonn-regimes. The crux of our arguments relies on the observation that when the non-adaptive minimax estimators can be written as a sum of empirical mean type statistics and 2nd-order U-statistics, one can provide a unified theory of selecting the best data driven estimator using Lepski type arguments (Lepski, 1991, 1992). Indeed, under certain assumptions on the data generating mechanism P, the non-adaptive minimax estimators have the desired structure for a large class of problems (Robins et al., 2008). This enables us to produce a class of examples where a single method helps provide a desired answer. Although the basic scheme of the approach is straightforward, the proof is somewhat involved because of the dependence of the U-statistics kernels in question on estimated functions from the sample. This prohibits performing a Hoeffding decomposition w.r.t. the whole sample and we can only proceed by a conditional Hoeffding decomposition restricted on carefully chosen good events. In order to prove a lower bound for the rate of adaptation under the nonn-regime for all the examples covered below (and show that a sharp poly-logarithmic penalty is indeed necessary for adaptation), we adapt the results in Birg e and Massart (1995); Robins et al. (2009b) for the Hellinger distance between two mixtures of suitable product measures, by producing similar results in chi-square divergence (see Appendix A). Our results apply to several common examples in nonparametric analyses of observational studies. These include, but are not limited to, average treatment effect estimation in causal inference, mean estimation in missing data studies, and error variance estimation in general non-parametric problems. To the best of our knowledge, the optimal adaptive results (both in sense of upper and lower bound) which simultaneously describe both n and nonn regimes of estimation, are among the first results in this direction. Moreover, all these examples involve estimating nonparametric nuisance functions (such as a density or regression). Consequently, the proofs of our results shed light on what properties of machine learning algorithms one needs to explore (see e.g. the results in Theorem 2.7 proved for wavelet based estimation) so that a principled use of them in semiparametric theory is possible. The rest of the paper is organized as follows. In Section 2 we provide the main results of the paper in a general form. Section 3 is devoted for applications of the main results in specific examples. Discussions on potential issues and future work are provided in Section 4. In Section 5 we provide a brief discussion on some basic wavelet and function space theory and notations, which we use extensively. Finally Section 6, Appendices A, B, and C are devoted for the proofs of the theorems and collecting useful technical lemmas. Liu, Mukherjee, Robins, Tchetgen Tchetgen 1.1 Notation For data arising from underlying probability distribution P we denote by PP and EP the probability of an event and expectation under P receptively. In the sequel, we often split the whole sample D into M disjoint subsamples {D1, D2, . . . , DM}. For some subset Jm = {j1, j2, . . . 
, jm} {1, 2, . . . , M}, we denote by PP,{j1,j2,...,jm} and EP,{j1,j2,...,jm} the probability and expectation under P over the sample space of j Jm Dj, while treating the subsamples j Jm Dj as fixed. For example, when we divide the whole sample into three disjoint subsamples {D1, D2, D3}, then PP,2,3 PP,{2,3} denotes the conditional probability of an event, while treating the subsample D1 as fixed. For a bivariate function h(O1, O2) let S(h(O1, O2)) = 1 2 [h(O1, O2) + h(O2, O1)] be the symmetrization of h. The results in this paper are mostly asymptotic (in n) in nature and thus require some standard asymptotic notations. If an and bn are two sequences of real numbers then an bn (and an bn) implies that an/bn (and an/bn 0) as n , respectively. Similarly an bn (and an bn) implies that lim infn an/bn = C for some C (0, ] (and lim supn an/bn = C for some C [0, )). Alternatively, an = o(bn) will also imply an bn and an = O(bn) will imply that lim supn an/bn = C for some C [0, )). Finally we comment briefly on the various constants appearing throughout the text and proofs. Given that our primary results concern convergence rates of various estimators, we will not emphasize the role of constants throughout and rely on fairly generic notation for such constants. In particular, for any fixed tuple v of real numbers, C(v) will denote a positive real number which depends on elements of v only. Finally for any linear subspace L L2([0, 1]d), let Π (h|L) denote the orthogonal projection of h onto L under the Lebesgue measure. Also, for a function defined on [0, 1]d, for 1 q < we let h q := ( R [0,1]d |h(x)|qdx)1/q denote the Lq semi-norm of h, h := supx [0,1]d |h(x)| the L semi-norm of h. We say h Lq([0, 1]d) for q [1, ] if h q < . Typical functions arising in this paper will be considered to have memberships in certain H older balls H(β, M) (see Section 5 for details). This will imply that the functions are uniformly bounded by a number depending on M. However, to make the dependence of our results on the uniform upper bound of functions more clear, we will typically assume a bound BU over the function classes, and for the sake of compactness will avoid the notational dependence of BU on M. The necessary notation appeared in Lepski-type adaptation scheme will be introduced in Section 2.1.1 when we first define the procedure. 2. Main Results We divide the main results of the paper into three main parts. First we discuss a general recipe for producing a data-adaptive best estimator from a sequence of estimators based on second-order U-statistics which in turn are constructed from compactly supported wavelet based projection kernels (defined in Section 5). Next we show that the bound on Hellinger distance obtained in Robins et al. (2009b) directly implies the desired bound on chi-square divergence between mixtures of product measures under certain bounded Adaptive Estimation of Nonparametric Functionals density assumptions. In subsequent sections, this then serves as a basis of using a version of constrained risk inequality (Cai and Low, 2011) for producing matching adaptive lower bounds in the context of estimating non-linear functionals considered in this paper. Finally we provide control over estimators of the design density as well as the regression functions in L norm which not only adapt over H older type smoothness classes (defined in Section 5) but also belong to desired regularity classes with probability converging to 1 sufficiently fast. 
2.1 Upper Bound

Consider a sample of i.i.d. data $D := \{O_i = (W_i, X_i) \sim P,\ W_i \in \mathbb{R}^m,\ X_i \in [0,1]^d,\ i = 1, \ldots, n\}$ and a real valued functional of interest $\phi(P)$. In the following, we also split $D$ into two disjoint subsamples $D_1 = \{O_1, \ldots, O_{n_1}\}$ and $D_2 = \{O_{n_1+1}, \ldots, O_{n_1+n_2}\}$, where $n_1 + n_2 = n$. Given this sample of size $n \ge 2$, consider further a sequence of estimators $\{\hat\phi_{n,k}(L): n \in \mathbb{N},\ k = 2^{jd} \text{ for some } j \in \mathbb{N}\}$ of $\phi(P)$ (indexed by the function tuple $L(O) = (L_1(O), L_{2l}(O), L_{2r}(O))$) defined as follows:

$$\hat\phi_{n,k}(L) = \frac{1}{n_1}\sum_{i=1}^{n_1} L_1(O_i) - \frac{1}{n_1(n_1-1)} \sum_{1 \le i_1 \ne i_2 \le n_1} S\left(L_{2l}(O_{i_1})\, K_{V_k}(X_{i_1}, X_{i_2})\, L_{2r}(O_{i_2})\right),$$

where $K_{V_k}(X_{i_1}, X_{i_2}) \equiv K_{V_j}(X_{i_1}, X_{i_2})$ is a resolution $j = \log_2(k)/d$ wavelet projection kernel defined in Section 5. Depending on the context, we sometimes denote $V_j$ by $V_k$ to avoid repeated translation between the resolution $j$ and the number $k$ of wavelet bases used in defining the projection kernels. In all our examples, the non-random functions $L_1, L_{2l}, L_{2r}: \Omega \to \mathbb{R}$ depend on the true underlying distribution $P$ (e.g. $L_1(\cdot) = (Y - E_P(Y|X))(A - E_P(A|X))$ in the example of average treatment effect estimation in Section 3.1, but we silence the dependence on $P$ in the notation for brevity), so we cannot evaluate $\hat\phi_{n,k}(L)$ purely from data. The (random) functions $\tilde L_1$, $\tilde L_{2l}$ and $\tilde L_{2r}$ serve as the corresponding data-adaptive estimators of $L_1$, $L_{2l}$ and $L_{2r}$, constructed from the (training) subsample $D_2$ without knowledge of the data generating mechanism. Then

$$\hat\phi_{n,k} \equiv \hat\phi_{n,k}(\tilde L) = \frac{1}{n_1}\sum_{i=1}^{n_1} \tilde L_1(O_i) - \frac{1}{n_1(n_1-1)} \sum_{1 \le i_1 \ne i_2 \le n_1} S\left(\tilde L_{2l}(O_{i_1})\, K_{V_k}(X_{i_1}, X_{i_2})\, \tilde L_{2r}(O_{i_2})\right),$$

which is now a measurable function w.r.t. the σ-field generated by the observed data sample $D$. We further assume that $\max\{|\tilde L_1(O)|, |\tilde L_{2l}(O)|, |\tilde L_{2r}(O)|\} \le B$, $P$-almost surely, for a known constant $B$. In addition, assume that $|g(x)| \le B_U$ for all $x$, $g$ being the marginal density of $X$ w.r.t. Lebesgue measure. The sequence of estimators $\{\hat\phi_{n,k}: n \in \mathbb{N},\ k = 2^{jd} \text{ for some } j \in \mathbb{N}\}$ of $\phi(P)$ can then be thought of as a bias corrected version of a usual first order estimator arising from standard first order influence function theory for smooth functionals $\phi(P)$ (Bickel et al., 1993; van der Vaart, 2000). In particular, the linear empirical mean type term $\frac{1}{n_1}\sum_{i=1}^{n_1} \tilde L_1(O_i)$ typically derives from the classical influence function of $\phi(P)$, and the quadratic U-statistic term

$$\hat U_{n,k} := \frac{1}{n_1(n_1-1)} \sum_{1 \le i_1 \ne i_2 \le n_1} S\left(\tilde L_{2l}(O_{i_1})\, K_{V_k}(X_{i_1}, X_{i_2})\, \tilde L_{2r}(O_{i_2})\right) \qquad (2.1)$$

corrects for higher order bias terms. While the specific examples in Section 3 will make the structure of the sequence of estimators more clear, the interested reader will be able to find more detailed theory in Robins et al. (2008, 2009a); Li et al. (2011); Robins et al. (2017). The quality of such a sequence of estimators will be judged against models for the data generating mechanism $P$. To this end, assume $P \in \mathcal{P}_\theta$, where $\mathcal{P}_\theta$ is a class of data generating distributions indexed by $\theta$, which in turn can vary over an index set $\Theta$. The choices of such a $\Theta$ will be clear from our specific examples in Section 3, and will typically correspond to smoothness indices of various infinite-dimensional functions parametrizing the data generating mechanism. Further assume that there exist positive real-valued functions $f_1$ and $f_2$ defined on $\Theta$ such that the sequence of estimators $\{\hat\phi_{n,k}\}_{k \ge 1}$ satisfies the following bounds for all $\theta \in \Theta$, with known constants $C_A > 0$, $C_A'$, and $C_B > 0$.

Property (A): Conditional Bias Bound within Good Events of $D_2$: there exists a good event $I_2(n_2)$, a measurable subset w.r.t.
the σ-algebra generated by the sample in $D_2$, such that, for some absolute constants $C_A, C_A'$,

$$\left| E_{P,1}\left[\hat\phi_{n,k}\right] - \phi(P) \right| \mathbf{1}\{I_2(n_2)\} \le C_A \left( k^{-\frac{2 f_1(\theta)}{d}} + n_2^{-f_2(\theta)} \right) \quad P\text{-a.s.},$$

and for some $\eta > 2$,

$$\sup_{P \in \mathcal{P}_\theta} P_{P,2}\left( \bar I_2(n_2) \right) \le C_A' \frac{\log n_2}{n_2^{\eta}},$$

where $\bar I$ stands for the complement of $I$ for any event $I$.

Property (B): Conditional Variance Bound: there exists an absolute constant $C_B > 0$ such that

$$\sup_{P \in \mathcal{P}_\theta} E_{P,1}\left( \hat\phi_{n,k} - E_{P,1}\left[\hat\phi_{n,k}\right] \right)^2 \le C_B \left( \frac{k}{n_1^2} + \frac{1}{n_1} \right).$$

Remark 2.1. Property (A) makes assumptions on certain nuisance functions involved in the problem. In the examples provided in Section 3 these will typically correspond to certain regression and density functions, and we will invoke Theorem 2.7 to show that Property (A) is met if one estimates the nuisance functions with (boundary-corrected) Cohen-Daubechies-Vial type wavelet kernel projections.¹

¹ Throughout this paper, we assume that we can evaluate wavelets without numerical error; we provide the reason for making this simplification in Remark 5.1.

2.1.1 Lepski-type adaptation scheme

Given Properties (A) and (B), we employ a standard Lepski-type adaptation scheme (Lepski, 1991, 1992) to choose an optimal estimator from the collection of estimators $\{\hat\phi_{n,k}: n \in \mathbb{N},\ k = 2^{jd} \text{ for some } j \in \mathbb{N}\}$. Following the notation in Giné and Nickl (2008), for any given $n_1 \in \mathbb{N}$, $n > 1$, and sufficiently small $\delta \in (0,1)$, we define the following discretization set:

$$\mathcal{K} := \left\{ k \in \left[ n_1^{1-\delta},\ \frac{n_1^2}{(\log n_1)^4} \right] :\ k_0 = n_1^{1-\delta},\ k_1 = \frac{n_1}{\log n_1},\ k_2 = \frac{n_1}{\ell(n_1)},\ k_{j+1} = \varrho k_j,\ j = 2, 3, \ldots \right\},$$

where $\varrho > 1$, and $\ell(n_1)$ is any function such that $\ell(n_1) \to 0$ and $\ell(n_1)\log n_1 \to \infty$ as $n_1 \to \infty$, with $\ell^{-1}(n_1) < \log n_1$ for all $n_1$. Let $N$ be the cardinality of $\mathcal{K}$. Then $k_{N-1} = \varrho^{N-3}\, n_1/\ell(n_1) \le n_1^2/(\log n_1)^4$ implies that $N = O(\log n_1)$. Next we define $R(k) = k/n_1^2$ and a threshold $d(k)$ for all $k \in [n_1^{1-\delta}, n_1^2/(\log n_1)^4]$, of order $C_{\mathrm{Lepski}} (\log n_1)^{1/2}$, where $C_{\mathrm{Lepski}}$ will be specified later in the proof of Theorem 2.2 and is chosen to depend only on the known parameters of the data generating mechanism. For example, when $k = \varrho^l n_1/\ell(n_1)$ for some positive integer $l$,

$$d(k)^2 = C^2_{\mathrm{Lepski}} \left( \delta \log n_1 + l \log \varrho - \log \ell(n_1) \right) = O(\log n_1).$$

Now following Lepski's strategy, we define the data-adaptive choice of $k$ as

$$\hat k := \min\left\{ k \in \mathcal{K}:\ \left( \hat\phi_{n,k} - \hat\phi_{n,k'} \right)^2 \le R(k')\, d^2(k') \ \text{ for all } k' \ge k,\ k' \in \mathcal{K} \right\}. \qquad (2.2)$$

Then for each $k \in \mathcal{K}$, we simply take the corresponding resolution $j = \log_2(k)/d$ to construct the estimator $\hat\phi_{n,k}$ with wavelets at resolution $j$. With the notation and data-adaptive estimation scheme described above, we now have the following theorem, the main result of this paper in the direction of adaptive upper bounds; its proof can be found in Section 6.

Theorem 2.2. Given a known interval $(\tau_{\min}, \tau_{\max})$ with $0 \le \tau_{\min} < \tau_{\max}$, for any $\tau \in (\tau_{\min}, \tau_{\max})$, under Properties (A) and (B) with $n_2 = n/\log n$ and $n_1 = n - n_2$, there exists a constant $C$ depending on $(\tau_{\max}, d, C_A, C_A', C_B)$ such that:

1. If $0 < \tau < d/4$, then
$$\sup_{P \in \mathcal{P}_\theta:\ f_1(\theta) = \tau,\ f_2(\theta) > \min\{\frac{4\tau}{4\tau+d}, \frac{1}{2}\}} E_P\left( \hat\phi_{n,\hat k} - \phi(P) \right)^2 \le C \left( \frac{n}{\log n} \right)^{-\frac{8\tau}{4\tau+d}}.$$

2. If $\tau = d/4$, then
$$\sup_{P \in \mathcal{P}_\theta:\ f_1(\theta) = \tau,\ f_2(\theta) > \min\{\frac{4\tau}{4\tau+d}, \frac{1}{2}\}} E_P\left( \hat\phi_{n,\hat k} - \phi(P) \right)^2 \le C \left( n\, \ell(n)^2 \right)^{-1}.$$

3. If $\tau > d/4$ and $\sigma^2 = E_P\left[ \left( L_1(O) - E_P[L_1(O)] \right)^2 \right]$, then for every $P \in \{P \in \mathcal{P}_\theta:\ f_1(\theta) = \tau,\ f_2(\theta) > \min\{4\tau/(4\tau+d), 1/2\}\}$,
$$\sqrt{n}\left( \hat\phi_{n,\hat k} - \phi(P) \right) \xrightarrow{d} Z \sim N(0, \sigma^2),$$
where $N(\mu, \sigma^2)$ stands for the normal distribution with mean $\mu$ and variance $\sigma^2$.
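To make the construction above concrete, the following sketch (a minimal illustration, not the procedure analyzed in the paper) assembles estimators $\hat\phi_{n,k}$ of the form given in Section 2.1 from a projection kernel and applies the selection rule (2.2). For simplicity it takes $d = 1$, uses a Haar projection kernel on $[0,1]$ in place of the boundary-corrected Cohen-Daubechies-Vial wavelets of Section 5, treats the plug-in functions $\tilde L_1, \tilde L_{2l}, \tilde L_{2r}$ (fitted on $D_2$) as given, and uses a placeholder grid and constant in place of the theoretically prescribed $\mathcal{K}$ and $C_{\mathrm{Lepski}}$.

```python
import numpy as np

def haar_projection_kernel(x1, x2, k):
    """Projection kernel K_{V_k}(x1, x2) onto the span of k Haar scaling
    functions at resolution j = log2(k), for points in [0, 1].  This is an
    illustrative stand-in for the boundary-corrected CDV kernel of Section 5."""
    bin1 = np.minimum(np.floor(x1 * k), k - 1)
    bin2 = np.minimum(np.floor(x2 * k), k - 1)
    # K(x, y) = k when x and y fall in the same dyadic cell, 0 otherwise.
    return k * (bin1 == bin2).astype(float)

def phi_hat(W, X, L1, L2l, L2r, k):
    """Second-order estimator: linear term minus the symmetrized U-statistic
    correction of equation (2.1).  (W, X) is the estimation subsample D1;
    L1, L2l, L2r are functions already fitted on the training subsample D2."""
    n1 = len(X)
    l1, l2l, l2r = L1(W, X), L2l(W, X), L2r(W, X)
    K = haar_projection_kernel(X[:, None], X[None, :], k)
    np.fill_diagonal(K, 0.0)            # drop the i1 == i2 terms
    # Summing over both orderings (i1, i2) and (i2, i1) already performs
    # the symmetrization S.
    U = l2l @ K @ l2r / (n1 * (n1 - 1))
    return l1.mean() - U

def lepski_select(W, X, L1, L2l, L2r, grid, c_lepski=1.0):
    """Selection rule (2.2): the smallest k in the grid whose estimate is
    within R(k') d(k')^2 of every estimate at k' >= k.  The grid and the
    constant c_lepski are placeholders, not the prescribed choices."""
    n1 = len(X)
    grid = sorted(grid)
    est = {k: phi_hat(W, X, L1, L2l, L2r, k) for k in grid}
    R = {k: k / n1**2 for k in grid}
    d2 = {k: c_lepski**2 * np.log(max(k / grid[0], np.e)) for k in grid}
    for k in grid:
        if all((est[k] - est[kp])**2 <= R[kp] * d2[kp]
               for kp in grid if kp >= k):
            return k, est[k]
    return grid[-1], est[grid[-1]]
```

In the theory above, the grid is the geometric grid of Section 2.1.1, the sample is split as $n_2 = n/\log n$ and $n_1 = n - n_2$, and the nuisance estimates defining $\tilde L_1, \tilde L_{2l}, \tilde L_{2r}$ come from the adaptive wavelet procedure of Theorem 2.7.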
A few remarks are in order regarding the implications of Theorem 2.2 as well as the subtleties involved in its proof.

Remark 2.3. Provided one has knowledge of the data generating $\theta$, and therefore of $f_1(\theta)$ and $f_2(\theta)$, one can use the bias and variance properties to achieve an optimal trade-off and subsequently obtain an optimal mean squared error in estimating $\phi(P)$ over $P \in \mathcal{P}_\theta$, which scales as $n^{-\frac{8\tau}{4\tau+d}}$ in the low regularity regime when $0 < \tau < d/4$, or is $\sqrt{n}$-consistent and semiparametric efficient when $\tau > d/4$. Theorem 2.2 demonstrates a logarithmic price paid by the estimator $\hat\phi_{n,\hat k}$ in terms of estimating $\phi(P)$ over a class of data generating mechanisms $\{P \in \mathcal{P}_\theta:\ f_1(\theta) = \tau,\ f_2(\theta) > \frac{4\tau}{4\tau+d}\}$, which in all our examples will be non-empty and often infinite. As will be demonstrated in Section 3, the condition $f_1(\theta) = \tau$ usually drives the minimax rate of estimation, whereas $f_2(\theta) > \frac{4\tau}{4\tau+d}$ is a regularity condition which typically relates to the marginal density of the covariates in observational studies. Moreover, in our examples, the range $\tau < d/4$ will necessarily imply the non-existence of $\sqrt{n}$-consistent estimators of $\phi(P)$ over $P \in \mathcal{P}_\theta$ in a minimax sense (Robins et al., 2009b).

Remark 2.4. The proof of Theorem 2.2 is somewhat involved because of the dependence of the U-statistics kernel on estimated nuisance functions (e.g. the functions $\tilde L_1$, $\tilde L_{2l}$ and $\tilde L_{2r}$ are estimated from the data). This prohibits performing a Hoeffding decomposition w.r.t. the whole sample, and we can only proceed by a conditional Hoeffding decomposition allowed by our sample splitting mechanism. Subsequently, the conditioning event needs to be sufficiently well behaved, and corresponds to the high probability requirement of the good event defined in Property (A). As we shall see in Section 3, the good event corresponds to certain estimates of nuisance functions being well behaved both in terms of sup-norm concentration and membership in desired Hölder regularity classes. It is for this reason that we need somewhat precise results on adaptive function estimation, as provided by Theorem 2.7 in Section 2.3.

Remark 2.5. The sample splitting scheme employed in Theorem 2.2 is not the only option. Many other reasonable options exist, e.g. $n_2 = n/\mathrm{polylog}(n)$. One may also choose $n_1 = n_2 = n/2$, which only affects the constants when $0 < \tau \le d/4$. However, a complication arises when $\tau > d/4$: we need to rely on cross-fitting after sample splitting (Chernozhukov et al., 2018) to achieve semiparametric efficiency.

Finally, it therefore remains to be explored whether the logarithmic price paid in Theorem 2.2 is indeed necessary for the regime $\tau < d/4$. Using a chi-square divergence inequality developed in the next section, along with a suitable version of the constrained risk inequality (see Section C), we shall argue that the logarithmic price of Theorem 2.2 is indeed necessary for a class of examples naturally arising in many observational studies.

2.2 Lower Bound

In terms of lower bounds, we shall use the constrained risk inequality of Cai and Low (2011) (see Lemma C.1 in Section C), which in turn requires control of the chi-square divergence between mixtures of suitable product measures. In this section we provide such a bound, which will be used in the examples in Section 3. We closely follow the strategy in Birgé and Massart (1995); Robins et al. (2009b), who already provide useful control of the Hellinger distance between mixtures of suitable product measures.
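The elementary comparison that drives this conversion can be sketched as follows: if densities $p$ and $q$ (w.r.t. a common dominating measure $\mu$) satisfy $B \le p$ and $p, q \le \bar B$, and we write $H^2(P, Q) := \int (\sqrt{p} - \sqrt{q})^2\, d\mu$, then (this is only a sketch; the convention for $H$ and the constant may differ from those used in inequality (2.3) in the proof of Proposition 2.6 below)

$$\chi^2(P, Q) = \int \frac{(q - p)^2}{p}\, d\mu
= \int \frac{(\sqrt{q} - \sqrt{p})^2 (\sqrt{q} + \sqrt{p})^2}{p}\, d\mu
\le \frac{4\bar B}{B}\, H^2(P, Q)
\le \exp\!\left(\frac{4\bar B}{B}\, H^2(P, Q)\right) - 1 .$$

Applied to the mixture densities $\int p_\lambda\, d\pi(\lambda)$ and $\int q_\lambda\, d\pi(\lambda)$, a comparison of this type turns the Hellinger bound of Robins et al. (2009b) into the chi-square bound needed for the constrained risk inequality.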
It turns out that under certain boundedness assumptions on the Radon-Nykodym derivatives of the measures involved in the problem, one can convert the available bounds on the Hellinger distance quite easily to the chi-squared divergence related bounds. More precisely, as in Robins et al. (2009b), let O1, . . . , On be a random sample from a density p w.r.t. measure µ on a sample space (χ, A). For k N, let χ = k j=1χj be a measurable partition of the sample space. Given a vector λ = (λ1, . . . , λk) in some product measurable space Λ = Λ1 Λk let Pλ and Qλ be probability measures on χ such that Pλ(χj) = Qλ(χj) = pj for every λ and some (p1, . . . , pk) in the k-dimensional simplex. The restrictions Pλ and Qλ to χj depends on the jth coordinate λj of λ = (λ1, . . . , λk) only. For pλ and qλ densities of the measures Pλ and Qλ respectively that are jointly measurable in the parameters λ and the observations, and π a probability measure on Λ, define p = R pλdπ(λ) and q = R qλd(π(λ)), and set a = max j sup λ b = max j sup λ ec = max j sup λ δ = max j sup λ With the notations and definitions as above we now have the following proposition which is a direct application of Robins et al. (2009b)[Theorem 2.1]. Liu, Mukherjee, Robins, Tchetgen Tchetgen Proposition 2.6. Suppose that npj(1 a b ec) A for all j and for all λ, B pλ B for positive constants A, B, B. Then there exists a C > 0 that depends only on A, B, B, such that, for any product probability measure π = π1 πk, one has χ2 Z Pλd(π(λ)), Z Qλd(π(λ)) e Cn2(maxj pj)(b2+ab)+Cnδ 1, where χ2(ν1, ν2) = R dν2 dν1 1 2 dν1 is the chi-square divergence between two probability measures ν1 and ν2 with ν2 ν1. Proof. Based on the above construction, Robins et al. (2009b)[Theorem 2.1] showed that, there exists some absolute constant C such that H2 Z Pλd(π(λ)), Z Qλd(π(λ)) C n2(max j pj)(b2 + ab) + C nδ. where H(ν1, ν2) denotes the Hellinger distance between two probability measures ν1 and ν2. As pλ B, so is R Pλd(π(λ)) B. By setting C = 4B 1C , we have χ2 Z Pλd(π(λ)), Z Qλd(π(λ)) exp 4B 1H2 Z Pλd(π(λ)), Z Qλd(π(λ)) 1 (2.3) exp 4B 1(C n2(max j pj)(b2 + ab) + C nδ) 1 = e Cn2(maxj pj)(b2+ab)+Cnδ 1 where the first inequality (2.3) is proved in Appendix A for the sake of completeness. 2.3 Adaptive Estimation of Nuisance Functions As will be evident from our examples in Section 3 that the applications of Theorem 2.2 require certain estimates of nuisance functions to be well behaved both in terms of supnorm concentration and membership to desired H older regularity classes. In this section we provide such estimators of regression and density functions using Lepski-type arguments (Lepski, 1991, 1992). Although the construction and adaptation proof of such estimators are quite standard, we will need a few additional results on sufficiently high probability inclusion of the adaptive estimators in suitable H older regularity classes. We provide the necessary ingredients below. Consider i.i.d. data on Oi = (Wi, Xi) P for a scalar variable W such that |W| BU and EP (W|X) = f(X) almost surely, and X [0, 1]d has density g such that 0 < BL g(x) BU < for all x [0, 1]d. Although to be precise, we should put subscripts P to f, g, we omit this since the context of their use is clear. We assume H older type smoothness (defined in Section 5) on f, g and let P(s, γ) = {P: (f, g) H(s, M) H(γ, M), |f(x)| BU, BL g(x) BU x [0, 1]d} denote classes of data generating mechanisms indexed by the smoothness indices. 
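To indicate the flavor of the construction behind Theorem 2.7 below, the following sketch (illustrative only: univariate, Haar/histogram projections in place of the Cohen-Daubechies-Vial wavelets, placeholder constants) selects a resolution for estimating the design density $g$ by comparing projection estimates in sup norm, in the spirit of Lepski's method; the estimator of the regression function $f$ analyzed in Appendix B is built analogously.

```python
import numpy as np

def density_projection(X, k, grid):
    """Projection of the design density g onto k Haar scaling functions
    (a histogram with k equal dyadic bins), evaluated at the grid points."""
    bins = np.minimum((X * k).astype(int), k - 1)
    heights = np.bincount(bins, minlength=k) * k / len(X)
    return heights[np.minimum((grid * k).astype(int), k - 1)]

def lepski_density(X, resolutions, c=1.0):
    """Sup-norm Lepski selection: pick the smallest resolution k such that,
    for every finer k', the two projections differ in sup norm by at most a
    multiple of sqrt(k' log n / n).  The constant c is a placeholder."""
    n = len(X)
    grid = np.linspace(0.0, 1.0, 1024, endpoint=False)
    fits = {k: density_projection(X, k, grid) for k in resolutions}
    for k in sorted(resolutions):
        if all(np.max(np.abs(fits[k] - fits[kp]))
               <= c * np.sqrt(kp * np.log(n) / n)
               for kp in resolutions if kp > k):
            return k, fits[k]
    kmax = max(resolutions)
    return kmax, fits[kmax]
```

Beyond the sup-norm rate, the theorem also requires that the selected estimate itself belong to a Hölder ball with probability rapidly approaching one, which is what the good event of Property (A) uses; the sketch above makes no attempt to enforce that.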
Then we have the following theorem (the proof of which can Adaptive Estimation of Nonparametric Functionals be found in Appendix B), which considers adaptive estimation of f, g in L norm over (s, γ) (smin, smax) (γmin, γmax), for given positive real numbers smin, smax, γmin, γmax. Theorem 2.7. If γmin > smax, then there exist ˆf and ˆg depending on M, BL, BU, smin, smax, γmin, γmax, and choice of wavelet bases ψ0 0,0, ψ1 0,0 (defined in Section 5) such that the following hold for every (s, γ) (smin, smax) (γmin, γmax) with a large enough C > 0 depending possibly on M, BL, BU, and γmax. sup P P(s,γ) EP ˆg g C d 2γ+d n log n γ 2γ+d , (2.4) sup P P(s,γ) PP ˆg g C d d+2γ n log n sup P P(s,γ) PP (ˆg / H(γ, C)) 1 sup P P(s,γ) EP ˆf f C d 2s+d n log n s 2s+d , (2.7) sup P P(s,γ) PP ˆf f C d d+2s n log n sup P P(s,γ) PP ˆf / H(s, C) 1 | ˆf(x)| 2BU and BL/2 ˆg(x) 2BU x [0, 1]d. (2.10) Remark 2.8. A close look at the proof of Theorem 2.7 shows that the proof continues to hold for smin = 0. Moreover, although we did not keep track of our constants, the purpose of keeping them in the form above is to show that the multiplicative constants are not arbitrarily big when β is large. Theorems of the flavor of Theorem 2.7 are not uncommon in literature (see (Gin e and Nickl, 2016, Chapter 8) for details) and the proof is somewhat standard for most parts of the result. However, proving (2.6) and (2.9) for data driven adaptive estimators requires somewhat more care than standard results in the literature. In particular, results of the kind stating that ˆg H(γ, C) with high probability uniformly over P(β, γ) for a suitably large constant C are often very easy to demonstrate. However, our proof shows that a suitably bounded estimator ˆg, which adapts over smoothness and satisfies ˆg H(γ, C) with probability larger than 1 1 nκ uniformly over P(β, γ), for any κ > 0 and correspondingly large enough C. This result in turn turns out to be crucial for the purpose of controlling suitable bias terms in our functional estimation problems as specified by the good event property in Property (A) in Theorem 2.2. Additionally, the results concerning ˆf are relatively less common in an unknown design density setting. Indeed, adaptive estimation of regression function with random design over Besov type smoothness classes has been obtained by model selection type techniques by Baraud (2002) for the case of Gaussian errors. Our results in contrast hold for any regression model with bounded outcomes and compactly supported covariates having suitable marginal design density. Liu, Mukherjee, Robins, Tchetgen Tchetgen Remark 2.9. Theorem 2.7 estimates the nuisance functions in our problem using standard Cohen-Daubechies-Vial type of wavelet kernel projections. It is an open problem if estimators based on other more general machine learning algorithms such as random forests or deep neural networks (DNNs) still satisfy Property (A). Indeed, recent results in DNNs using Re LU activation functions (Schmidt-Hieber, 2020; Farrell et al., 2021; Chen et al., 2019a,b) are non-adaptive to the underlying regularity of the function classes and little is known about the Fourier coefficients of their outputs a property which we will crucially require in the good events defined in Property (A) accompanying Theorem 2.2. 
Some recent results (Rahaman et al., 2019; Xu et al., 2020; Luo et al., 2020) connecting the training dynamics (Luo et al., 2021) to the behavior of Fourier coefficients in DNNs might be an interesting direction to investigate for future work. 3. Examples In this section we discuss applications of Theorem 2.2 and Proposition 2.6 in producing rate optimal adaptive estimators of certain nonparametric functionals commonly arising in statistical and causal inference literature. These include (i) treatment effect type functional, (ii) mean functional in missing data studies, and (iii) quadratic and variance functional in nonparametric regression. Before proceeding we note that the results of this paper can be extended to include the whole class of doubly robust functionals considered in Robins et al. (2008). However we only provide specific examples here to demonstrate the clear necessity to pay a sharp poly-logarithmic penalty for adaptation in low regularity regimes. The proof of the results in this section can be found in Appendix B. 3.1 Weighted Average Treatment Effect In this subsection, we consider estimating the treatment effect of a treatment on an outcome in presence of multi-dimensional confounding variables (Crump et al., 2009; Robins et al., 1992). To be more specific, we consider a binary treatment A and response Y and d-dimensional covariate vector X, and let τ be the variance weighted average treatment effect defined as τ := E Var(A|X)c(X) E(Var(A|X)) = E(Cov(Y, A|X)) E(Var(A|X)) , c(x) = E(Y |A = 1, X = x) E(Y |A = 0, X = x). (3.1) Under the assumption of no unmeasured confounding, c(x) is often referred to as the average treatment effect among subjects with covariate value X = x. The reason of referring to τ as the average treatment effect can be further understood by considering the semiparametric regression model that assumes c(x) = φ for all x, (3.2) or specifically the partial linear model E(Y |A, X) = φ A + b(X). Adaptive Estimation of Nonparametric Functionals It is clear that under (3.2) τ equals φ . Moreover, the inference on τ is closely related to the estimation of E(Cov(Y, A|X)) (Robins et al., 2008). Specifically, point and interval estimators for τ can be constructed from point and interval estimators of E(Cov(Y, A|X)). To be more specific, for any fixed τ R, one can define Y (τ ) = Y τ A and consider φ(τ ) = E((Y (τ ) E(Y (τ )|X))(A E(A|X))) = E(Cov(Y (τ ), A|X)). It is easy to check that τ is the unique solution of φ(τ ) = 0. Consequently, if we can construct estimator ˆφ(τ ) of φ(τ ), then ˆτ satisfying φ(ˆτ) = 0 is an estimator of τ with desired properties. Moreover, (1 α) confidence set for τ can be constructed as the set of values of τ for which (1 α) interval estimator of φ(τ ) contains the value 0. Finally, since EP (Cov P (Y, A|X)) = EP (EP (Y |X)EP (A|X)) EP (AY ), and EP (AY ) is estimable at a parametric rate, the crucial part of the problem hinges on the estimation of EP (EP (Y |X)EP (A|X)). For the rest of this section, we assume that we observe n i.i.d. copies of O = (Y, A, X) P and we want to estimate φ(P) = EP (Cov P (Y, A|X)). We assume that the marginal distribution of X has a density w.r.t. Lebesgue measure on Rd that has a compact support, which we assume to be [0, 1]d and let g be the marginal density of X (i.e. EP (h(X)) = R [0,1]d h(x)g(x)dx for all P-integrable function h), a(X) := EP (A|X), b(X) := EP (Y |X), and c(X) = EP (Y |A = 1, X) EP (Y |A = 0, X). 
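To fix ideas, the general construction of Section 2.1 can be instantiated here, purely for illustration, with $\tilde L_1(O) = (Y - \hat b(X))(A - \hat a(X))$ (the choice already mentioned in Section 2.1) and with $\tilde L_{2l}(O) = Y - \hat b(X)$, $\tilde L_{2r}(O) = A - \hat a(X)$ as one natural choice for the quadratic term, where $\hat a$ and $\hat b$ are fitted on the training subsample $D_2$; the exact form used in the proofs, including how the projection kernel accounts for the unknown design density $g$, is the one analyzed in Appendix B. This gives

$$\hat\phi_{n,k} = \frac{1}{n_1}\sum_{i=1}^{n_1} (Y_i - \hat b(X_i))(A_i - \hat a(X_i))
- \frac{1}{n_1(n_1 - 1)} \sum_{1 \le i_1 \ne i_2 \le n_1} S\Big[ (Y_{i_1} - \hat b(X_{i_1}))\, K_{V_k}(X_{i_1}, X_{i_2})\, (A_{i_2} - \hat a(X_{i_2})) \Big].$$

The linear term is the plug-in first-order influence function estimator of $E_P[(Y - b(X))(A - a(X))]$, and the U-statistic term removes the leading bias caused by plugging in $\hat a$ and $\hat b$.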
Although to be precise, we should put subscripts P to a, b, g, c, we omit this since the context of their use is clear. To connect this example with the notation employed in Theorem 2.2, we proceed as follows. Let Θ := {θ = (α, β, γ): α+β 2d > 0, γmax γ > γmin 2(1 + ϵ) max{α, β}} for some fixed ϵ > 0, and let Pθ denote all data generating mechanisms P satisfying the following conditions for known positive constants M, BL, BU. (1) max{|Y |, |A|} BU a.s. P. (2) a H(α, M), b H(β, M), and g H(γ, M). (3) 0 < BL < g(x) < BU for all x [0, 1]d. Note that we do not put any assumptions on the function c. Indeed for Y and A binary random variables, the functions a, b, g, c are variation independent. Following our discussion above, we will discuss adaptive estimation of φ(P) = EP (Cov P (Y, A|X)) = EP ((Y b(X))(A a(X))) over P P. In particular, we summarize our results on upper and lower bounds on the adaptive minimax estimation rate of φ(P) in the following theorem. Theorem 3.1. Assume (1) (3) and (α, β, γ) Θ. Then the following hold for positive C, C depending on M, BL, BU, γmax and any sequence ℓ(n) 0 such that ℓ(n) log n . (i) (Upper Bound) There exists an estimator ˆφ, depending only on M, BL, BU, γmax such that sup P P(α,β,γ) EP ˆφ φ(P) 2 Liu, Mukherjee, Robins, Tchetgen Tchetgen 4α+4β 2α+2β+d if 0 < (α + β)/2 < d/4, C nℓ(n)2 1 if (α + β)/2 = d/4, and if (α + β)/2 > d/4, with σ2 = EP h {(Y b(X))(A a(X)) φ(P)}2i , for every P {P P(α,β,γ)}, we have n ˆφn,ˆk φ(P) d Z N(0, σ2). (ii) (Lower Bound) Suppose {A, Y } {0, 1}2 and 0 < (α + β)/2 < d/4. If one has sup P P(α,β,γ) EP ˆφ φ(P) 2 C n log n 4α+4β 2α+2β+d , for an estimator ˆφ of φ(P). Then there exists a class of distributions P(α ,β ,γ ) such that sup P P(α ,β ,γ ) EP ˆφ φ(P ) 2 C n log n 2α +2β +d . Theorem 3.1 describes the adaptive minimax estimation rate of the treatment effect functional in both low regularity regime (α+β 4) i.e. when n-rate estimation is not possible and regular n-regimes ( α+β 4) where our result demonstrates adaptive semiparametric efficiency. Note that unlike the classical adaptation problem regarding quadratic functional estimation in density and white noise models, the adaptation in this problem is w.r.t. to hyperplanes {(α, β): (α + β)/2d = τ}. Finally, we note that the case of α+β 4 also incurs an additional penalty (which can be made to grow arbitrarily slow) over usual n-rate of convergence resulted from adaptation. Although we do not state it explicitly, the proof of the lower bound shows that if the rate of convergence α+β 4 is O(n 1) then one pays a polynomial penalty for α+β 4 which is indeed suboptimal. It is worth noting that if the set of (α, β) only includes the case α+β 4, one can indeed obtain adaptive and even semi-parametric efficient estimation of the functionals studied here without effectively any assumption on g. When α+β 4, the dependence of the smoothness γ of the density g( ) on α and β can also be relaxed by considering estimators using higherorder U-statistics. But we do not further pursue such direction in this paper. The interested reader can find the relevant details in Robins et al. (2017); Mukherjee et al. (2017). Remark 3.2 (Further comments on the dependence of γmin on α β). In Theorem 3.1 (and also in Theorem 3.3 and 3.4 in the next two subsections), we provide a sufficient condition on the dependence of the known smoothness lower bound γmin of the density of X on α β - the maximum between the smoothness of a( ) and b( ) - to obtain matching adaptive minimax upper and lower bound. 
At a first sight, such dependence looks dimension free. However, when α + β d/4, γmin implicitly grows with d. As a matter of fact, one can Adaptive Estimation of Nonparametric Functionals easily obtain sharper dependence of γmin on α, β, and d by looking at the proof of Theorem 3.1 in the appendix. We decide to keep the constraint on γmin in the current form for ease of exposition. As mentioned above, such dependence can even further relaxed by considering estimators based on even higher-order U-statistics. We refer to Section 4 for more discussions on this. 3.2 Mean Response in Missing Data Models Suppose we have n i.i.d observations on O = (Y A, A, X) P, for a response variable Y R which is conditionally independent of the missingness indicator variable A {0, 1} given covariate information X. In literature, this assumption is typically known as the missing at random (MAR) assumption. Under this assumption, our quantity of interest φ(P) = EP (Y ) is identifiable as EP [EP [(Y |A = 1, X)]] from the observed data. This model is a canonical example of a study with missing response variable and to make this assumption reasonable, the covariates must contain the information on possible dependence between response and missingness. We refer the interested readers to Tsiatis (2007) for the history of statistical analysis of MAR and related models. To lay down the mathematical formalism for minimax adaptive estimation of φ(P) in this model, let f be the marginal density of X, a(X) := EP (A|X) = PP (A = 1|X) which is often called the propensity score in the causal inference literature, and b(X) := EP (Y |A = 1, X) = EP (Y |X), and g(X) = f(X)/a(X) (with the convention of the value + when dividing by 0). Although to be precise, we should put subscripts P to a, b, g, we omit this since the context of their use is clear. To connect this example with the notation employed in Theorem 2.2, we proceed as follows. Let Θ := {θ = (α, β, γ): α+β 2d > 0, γmax γ > γmin 2(1 + ϵ) max{α, β}} for some fixed ϵ > 0, and let Pθ denote all data generating mechanisms P satisfying the following conditions for known positive constants M, BL, BU. (1) |Y | BU. (2) a H(α, M), b H(β, M), and g H(γ, M). (3) BL < g(x), a(x) < BU for all x [0, 1]d. We then have the following result. Theorem 3.3. Assume (1) (3) and (α, β, γ) Θ. Then the following hold for positive C, C depending on M, BL, BU, γmax and any sequence ℓ(n) 0 such that ℓ(n) log n . (i) (Upper Bound) There exists an estimator ˆφ, depending only on M,BL,BU, γmax such that sup P P(α,β,γ) EP ˆφ φ(P) 2 4α+4β 2α+2β+d if 0 < (α + β)/2 < d/4, C nℓ(n)2 1 if (α + β)/2 = d/4, Liu, Mukherjee, Robins, Tchetgen Tchetgen and if (α + β)/2 > d/4, with σ2 = EP h {Aa(X)(Y b(X)) + b(X) φ(P)}2i , for every P {P P(α,β,γ)}, we have n ˆφn,ˆk φ(P) d Z N(0, σ2). (ii) (Lower Bound) Suppose {A, Y } {0, 1}2 and 0 < (α + β)/2 < d/4. If one has sup P P(α,β,γ) EP ˆφ φ(P) 2 C n log n 4α+4β 2α+2β+d , for an estimator ˆφ of φ(P). Then there exists a class of distributions P(α ,β ,γ ) such that sup P P(α ,β ,γ ) EP ˆφ φ(P ) 2 C n log n 2α +2β +d . Once again, Theorem 3.3 describes the adaptive minimax estimation rate of the treatment effect functional in both low regularity regime (α+β 4) i.e. when n-rate estimation is not possible and regular n-regimes ( α+β 4) where our result demonstrates adaptive semi-parametric efficiency. The case of α+β 4 also incurs an additional penalty (which can be made to grow arbitrarily slow) over usual n-rate of convergence resulted from adaptation. 
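For orientation, the linear (first-order) part of the estimator sequence in this example is the familiar augmented inverse probability weighted mean, matching the influence function that appears in the expression for $\sigma^2$ in Theorem 3.3; as a sketch, with $\hat a$ and $\hat b$ fitted on $D_2$,

$$\hat\phi^{\,\mathrm{lin}}_{n} = \frac{1}{n_1} \sum_{i=1}^{n_1} \left[ \frac{A_i}{\hat a(X_i)} \big( Y_i - \hat b(X_i) \big) + \hat b(X_i) \right],$$

and, exactly as in Sections 2.1 and 3.1, a second-order U-statistic built from a wavelet projection kernel corrects the bias incurred by the plug-in nuisance estimates; its precise form is given in Appendix B.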
As mentioned in the end of Section 3.1, the dependence on the smoothness γ of the function g( ) can be relaxed using higher-order U-statistics. 3.3 Quadratic and Variance Functionals in Regression Models Consider a sample of n i.i.d copies of O = (Y, X) P and the functional of interest is the expected value of the square of the regression of Y on X. Specifically suppose we want to estimate φ(P) = EP {EP (Y |X)}2 . Assume that distribution of X has a density w.r.t. Lebesgue measure on Rd that has a compact support, which we assume to be [0, 1]d for sake of simplicity. Let g be the marginal density of X, and b(X) := EP (Y |X). To connect this example with the notation employed in Theorem 2.2, we proceed as follows. Let Θ := {P(β, γ): β/d > 0, γmax γ > γmin 2(1+ϵ)β} for some fixed ϵ > 0, and by P(β, γ) we consider all data generating mechanisms P satisfying the following conditions for known positive constants M, BL, BU. (1) max{|Y |} BU. (2) b H(β, M), and g H(γ, M). (3) 0 < BL < g(x) < BU for all x [0, 1]d. Theorem 3.4. Assume (1) (3) and (β, γ) Θ. Then the following hold for positive C, C depending on M, BL, BU, γmax and any sequence ℓ(n) 0 such that ℓ(n) log n . Adaptive Estimation of Nonparametric Functionals (i) (Upper Bound) There exists an estimator ˆφ, depending only on M,BL,BU, γmax such that sup P P(β,γ) EP ˆφ φ(P) 2 8β 4β+d if 0 < β < d/4, C nℓ(n)2 1 if β = d/4, and if β > d/4, with σ2 = EP h ((2Y b(X))b(X) φ(P))2i , for every P {P P(β,γ)}, we have n ˆφn,ˆk φ(P) d Z N(0, σ2). (ii) (Lower Bound) Suppose Y {0, 1}2 and 0 < β < d/4. If one has sup P P(β,γ) EP ˆφ φ(P) 2 C n log n for an estimator ˆφ of φ(P). Then there exists a class of distributions P(β , γ ) such that sup P P(β ,γ ) EP ˆφ φ(P ) 2 C n log n Remark 3.5. Although Theorem 3.4 and the discussion before that are made in the context of estimating a particular quadratic functional in the context of a regression framework, it is worth noting that the result also extends to estimating classical quadratic functionals in density models (Efromovich and Low, 1996; Gin e and Nickl, 2008). One can also consider in the same setup, the estimation of functionals related to the conditional variance of Y under such a regression model, which has been studied in detail by Hall and Carroll (1989); Ruppert et al. (1997); Fan and Yao (1998); Brown and Levine (2007); Cai and Wang (2008). Whereas, the minimax optimal and adaptive results in Brown and Levine (2007); Cai and Wang (2008) are in a equi-spaced fixed design setting, one can use an analogue of Theorem 3.4 to demonstrate a rate adaptive estimator and corresponding matching lower bound, with a mean-squared error of the order of n log n 8β 4β+d for es- timating EP (Var P (Y |X)) adaptively over H older balls of regularity β < d 4. As noted by Robins et al. (2008), this rate is higher than the rate of estimating the conditional variance in mean-squared error for the equi-spaced fixed design (Cai and Wang, 2008). In a similar vein, one can also obtain this type of results for the estimation of conditional variance under the assumption of homoscedasticity i.e. σ2 := Var (Y |X = x) for all x [0, 1]d. In particular, there exists an estimator for σ2 with mean-squared error of order n log n 4 and of order n 1 when β > d 4. As demonstrated recently by Shen et al. 
(2020), the (non-adaptive) minimax lower bound for estimating homoscedastic variance σ2 in any dimension is indeed n 8β 4β+d when β < d 4 and our proof results in an adaptive estimator of the same rate in general dimensions (which also guarantees efficient estimation whenever β > d 4) while assuming sufficient smoothness on the marginal density of X. For instance, Liu, Mukherjee, Robins, Tchetgen Tchetgen a candidate sequence of ˆφn,k s for this purpose can be constructed by taking A = Y in the treatment effect example considered in Section 3.1. The requirement of not needing a smoothness on the marginal density of X can be removed for d = 1 (see e.g. Robins et al. (2008); Shen et al. (2020)) but doing so in higher dimensions remains a challenge. Although we do not pursue it here, it is possible to follow the lines of argument in Shen et al. (2020) to show the requirement of a logarithmic penalty while trying to adapt in the region β < d 4. Discussion In this paper, we have extended the results for adaptive estimation of non-linear integral functionals from density estimation and Gaussian white noise models, to move towards adaptive estimation of non-linear functionals in more complex nonparametric models. Below we make a few comments on some of the assumptions and future research directions. (i) The adaptation considered in this paper is w.r.t. the smoothness of certain nuisance functions that naturally arise in a class of nonparametric problems. For the three concrete examples considered above, the smoothness of the regression functions decides the minimax rates of the problem and consequently our results naturally adapt to this regularity. However, since our estimators are based on second-order U-statistics, we can only perform a second-order bias correction (see e.g. Robins et al. (2008, 2017)) and this results in the requirement of sufficient large smoothness of the marginal density of the covariates. Using higher order U-statistics this dependence can be substantially lowered while considering non-adaptive minimax rates (Robins et al., 2008, 2017; Mukherjee et al., 2017). However, such higher order U-statistics make the adaptation proof substantially more difficult and it is currently beyond the scope of this paper. Moreover, it also remains open to decide the minimum smoothness required from the marginal density of the covariates for even non-adaptive minimax risks to go through in general (fixed) dimensions d > 1 (for d = 1 Shen et al. (2020) shows that one can do without the requirement on density smoothness in one of the special cases considered here). (ii) The examples considered here involve bounded outcomes. This requirement is mostly used while getting suitable tail bounds for the U-statistics under study. Although we do not pursue it here, we believe that this condition can be weakened by assuming sub-Gaussian or other related light tails (by carefully using results in Houdr e and Reynaud-Bouret (2003); Gin e et al. (2000); Chakrabortty and Kuchibhotla (2018)) for the conditional error distribution of the outcome given the covariates. (iii) The sequence of estimators constructed here depends both on the estimation of multivariate regressions and density functions. The requirement of estimating a multivariate density as well as the requirement of smoothness on the marginal density of the covariates can be entirely removed when making the assumptions such that nconsistent estimation is possible. 
More precisely, in all of the three examples above, there exists a sequence of estimators which does not require estimation or any smoothness requirement on the marginal density of the covariates and automatically adapts to the n-consistent and semiparametric efficient estimators in the n-estimable range. Adaptive Estimation of Nonparametric Functionals This sequence of estimators was recently constructed in Mukherjee et al. (2017) but they do not generalize to the nonn-estimable part of the problems. (iv) Adaptive estimation of the functionals is only a first step of inference and provides no guidelines for statistical inference. It remains an extremely interesting question to explore honest adaptive confidence set constructions for the class of nonparametric functionals considered above. (v) Finally, as mentioned in Remark 2.9, it is interesting to investigate if the adaptation theory for nonparametric functional estimation developed in this paper can be generalized when the nuisance functions (including the design densities) are estimated by deep neural networks instead of classical nonparametric regression such as wavelets regression. 5. Wavelets, Projections, and H older Spaces We work with certain Besov-H older type spaces which we define in terms of moduli of wavelet coefficients of continuous functions. For d > 1, consider expansions of functions h L2 [0, 1]d on an orthonormal basis of compactly supported bounded wavelets of the form m Zd h, ψ0 0,m ψ0 0,m(x) + v {0,1}d\{0}d h, ψv l,m ψv l,m(x), where the base functions ψv l,m are orthogonal for different indices (l, m, v) and are scaled and translated versions of the 2d S-regular base functions ψv 0,0 with S > β, i.e., ψv l,m(x) = 2ld/2ψv 0,0(2lx m) = Qd j=1 2 l 2 ψvj 0,0 2lxj mj for m = (m1, . . . , md) Zd and v = (v1, . . . , vd) {0, 1}d with ψ0 0,0 = φ and ψ1 0,0 = ψ being the scaling function and mother wavelet of regularity S respectively as defined in one dimensional case. As our choices of wavelets, we will throughout use compactly supported scaling and wavelet functions of (boundary-corrected) Cohen-Daubechies-Vial type with S first null moments (Cohen et al., 1993). In view of the compact support of the wavelets, for each resolution level l and index v, only O(2ld) base elements ψv l,m are non-zero on [0, 1]d; let us denote the corresponding set of indices m by Zl and obtain the representation, h, ψ0 J0,m ψ0 J0,m(x) + v {0,1}d\{0}d h, ψv l,m ψv l,m(x), (5.1) where J0 = J0(S) 1 is such that 2J0 S (Cohen et al., 1993; Gin e and Nickl, 2016). Thereafter, let for any h L2([0, 1]d), h, ψl , 2 be the vector L2 norm of the vector h, ψv l ,m : m Zl , v {0, 1}d . We will be working with projections onto subspaces defined by truncating expansions as above at certain resolution levels. For example letting Vj := span n ψv l,k, J0 l j, m Zl, v {0, 1}do , j J0, (5.2) Liu, Mukherjee, Robins, Tchetgen Tchetgen one immediately has the following orthogonal projection kernel onto Vj as KVj (x1, x2) = X ψ0 J0,m(x1)ψ0 J0,m(x2) + v {0,1}d\{0}d ψv l,m(x1)ψv l,m(x2).(5.3) As mentioned in Section 2.1, we sometimes also denote Vj by Vk for k = 2jd . Owing to the MRA property of the wavelet bases, it is easy to see that KVj has the equivalent representation as KVj (x1, x2) = X v {0,1}d ψv j,m (x1) ψv j,m (x2) . 
(5.4) Thereafter, using S-regular scaling and wavelet functions of Cohen-Daubechies-Vial type with S > β let h C [0, 1]d : 2J0(β+ d 2 ) h, ψ0 J0 + sup l 0,m Zd,v {0,1}d\{0}d 2l(β+ d 2 )| h, ψv l,m | M with C [0, 1]d being the set of all continuous bounded functions on [0, 1]d. It is standard result in the theory of wavelets that H(β, M) is related to classical H older-Zygmund spaces with equivalent norms (see Gin e and Nickl (2016, Chapter 4) for details). For 0 < β < 1, H(β, M) consists of all functions in C [0, 1]d such that f + sup x1,x2 [0,1]d |f(x1) f(x2)| C(M). For non-integer β > 1, H(β, M) consists of all functions in C [0, 1]d such that f( β ) C [0, 1]d for any partial f( β ) of order β of f and f + sup x1,x2 [0,1]d |f( β )(x1) f( β )(x2)| C(M). Therefore, the functions in H(β, M) are automatically uniformly bounded by a number depending on the radius M. Remark 5.1. Since the Cohen-Daubechies-Vial type wavelets in general do not have closedform expressions, we briefly comment on the numerical issues when in practice we are only able to evaluate wavelets to a certain accuracy. In our paper, it suffices to approximate the scaling and wavelet functions φ and ψ by eφ and eψ to the accuracy such that: for any h L2([0, 1]d) hjmax ehjmax n 1/2 because the finest resolution jmax is chosen so that 2jmaxd n2. Here hjmax is the projection of h onto Vjmax and ehjmax is the projection of h onto e Vjmax, where e Vj := span n eψ v lk, J0 l j, m Zl, v {0, 1}do , j J0, and eψ v lk is defined in the same way as ψv lk but with φ and ψ replaced by eφ and eψ. As a result, it suffices to have the following: P m ZJ0 h, ψ0 J0,m ψ0 J0,m(x) + Pjmax l=J0 P m Zl P v {0,1}d\{0}d h, ψv l,m ψv l,m(x) P m ZJ0 h, eψ 0 J0,m eψ 0 J0,m(x) Pjmax l=J0 P m Zl P v {0,1}d\{0}d h, eψ v l,m eψ v l,m(x) Adaptive Estimation of Nonparametric Functionals |ZJ0| max m ZJ0 h, ψ0 J0,m ψ0 J0,m(x) h, eψ 0 J0,m eψ 0 J0,m(x) | {z } (A) and jmax|Zjmax| max l=J0, ,jmax max m Zl max v {0,1}d\{0}d h, ψv l,m ψv l,m(x) h, eψ v l,m eψ v l,m(x) | {z } (B) Now our goal is to find the appropriate accuracy δ with eφ φ + eψ ψ δ such that (A) n 1/2 and (B) n 1/2 hold. eφ φ + eψ ψ δ leads to the following: ψv l,m eψ v l,m 2ld/2dδ, (5.6) where we use the trivial identity j=1 aj(ai eai) j =i+1 eaj (5.7) for any a1, , ad; ea1, , ead. With such an error bound, for (A) n 1/2 and (B) n 1/2 to hold, it suffices to have the following, where we again use the identity (5.7) together with (5.6): jmax|Zjmax|2jmaxd/2dδ2jmaxd/2 n 1/2 log(n2)2jmaxd2jmaxdδ n 1/2 log(n2)n4δ n 1/2 δ n 9/2/ log(n) where the second 2jmaxd/2 in the first line is due to the product of two ψjmax,m s (and the product of two eψjmax,m s) within the in term (B), in the second line we use |Zjmax| = O(2jmaxd) and in the third line we use 2jmaxd n2. In conclusion, if we make sure that the approximation error of the scaling and wavelet functions satisfies δ n 9/2/ log(n), the numerical error will not affect the statistical rate of convergence. Therefore, without loss of generality, we simply assume that we can exactly evaluate the scaling and wavelet functions in the rest of the paper. In the above calculation we did not try to find the largest δ such that the statistical rate is unaffected so the above requirement on δ may very well not be the weakest possible. 
Finally, we remark that unlike wavelets, numerical accuracy will be an integral part of the analysis if one wants to generalize the theory in this paper to the case in which nuisance functions are estimated by deep neural networks trained with (stochastic) gradient descent or its variants. 6. Proof of Theorem 2.2 Proof. In this proof we repeatedly use the fact that for any fixed m N and a1, . . . , am real numbers, one has by H older s inequality |a1 + . . . + am|p C(m, p) (|a1|p + . . . + |am|p) for p > 1. Fix θ such that {P Pθ: f1(θ) = τ, f2(θ) > min{4τ/(4τ + d), 1/2}} and our result Liu, Mukherjee, Robins, Tchetgen Tchetgen will hold over all τ such that f1(θ) = τ, f2(θ) > min{4τ/(4τ + d), 1/2}. Throughout the proof, we use C, C , C (with subscripts 1, 2, etc.) to denote generic universal constants only depending on the known parameters of the data generating mechanism (and CLepski, which also only depends on the known parameters). When it is clear from the derivation, the same constant notation may change its value from line to line. We will highlight which known parameters a constant depends on when it is necessary for understanding. The corresponding oracle choice of k is defined through balancing bias-variance tradeoff: min k K: k 2τ/d 1 C1/2 Lepski R(k)1/2d(k), k > k2 k2 n1 ℓ(n1) τ/d = 1/4 k1 n1 log n1 τ/d > 1/4 where k is of order n1/ log n1 2/(4τ/d+1) when τ/d < 1/4 up to a constant depending on CLepski. By construction, we have, for n1 large enough C1/2 Lepski R(k )1/2d(k ). A key step in bounding the risk of an adaptive estimator through Lepski s adaptation scheme is to show that the probability of selecting a k greater than necessary (k ) is small. To this end, we first prove the following lemma in order to bound PP ˆk > k , when Properties (A) and (B) are met: Lemma 6.1. Under Properties (A) and (B), given a large enough universal constant CLepski depending on known parameters (τmax, CA, C A, CB, B, BU) of the data generating mechanism, for any k > k , we have PP ˆk = k C log n exp{ Cd(k)2} if k > k2, C exp{ Cd(k2)2} + log n exp{ Cd(k3)2} if k = k2. for some constants C > 0 depending on CLepski. Proof. For any k, let k be the previous element of k in the discretization set K. Then PP ˆk = k = PP k k: ˆφn,k ˆφn,k > R(k )1/2d(k ) k k k N 1 PP ˆφn,k ˆφn,k > R(k )1/2d(k ) . Recall the good event I2(n2) introduced in Property (A). Then we decompose each summand on the RHS of the above display as follows: PP ˆφn,k ˆφn,k > R(k )1/2d(k ) = PP ˆφn,k ˆφn,k > R(k )1/2d(k ), I2(n2) Adaptive Estimation of Nonparametric Functionals + PP ˆφn,k ˆφn,k > R(k )1/2d(k ), I2(n2) PP ˆφn,k ˆφn,k > R(k )1/2d(k ), I2(n2) where B C A log n2 n2 2 by Property (A). We are only left to bound the first term A: A = EP,2 PP,1 ˆφn,k ˆφn,k > R(k )1/2d(k ) 1 {I2(n2)} . We attempt to control PP,1 ˆφn,k ˆφn,k > R(k )1/2d(k ) through the following inequality decomposition: PP,1 ˆφn,k ˆφn,k > R(k )1/2d(k ) = PP,1 ˆUn,k ˆUn,k > R(k )1/2d(k ) ˆUn,k EP,1 ˆUn,k > 1 2R(k )1/2d(k ) ˆUn,k EP,1 ˆUn,k > 1 2R(k )1/2d(k ) EP,1 ˆUn,k EP,1 ˆUn,k where we recall the definition of ˆUn,k in equation (2.1). Now we need to control terms EP,2[A11{I2(n2)}] and EP,2[A21{I2(n2)}] respectively. 
First we apply the exponential concentration inequality in Lemma C.2 for U-statistics of order two to obtain the following upper bound on EP,2[A11{I2(n2)}]: EP,2[A11{I2(n2)}] C1 exp{ C 1d(k )2}, where C1 and C 1 are chosen sufficiently large depending on CA defined in Property (A), CB defined in Property (B), (B, BU) defined in Section 2.1, J0 determined by τmax, a known upper bound on the smoothness indices adapted over, and CLepski to be specified later. Next we analyze EP,2[A21{I2(n2)}]. To do this, we need to first control EP,1 ˆφn,k EP,1 ˆφn,k within the good event I2(n2). EP,1 ˆUn,k EP,1 ˆUn,k 1{I2(n2)} = EP,1 ˆφn,k EP,1 ˆφn,k 1{I2(n2)} n f2(θ) 2 + k 2f1(θ) where the last inequality is simply a consequence of Property (A). Since k > k , k = k 1 k , and thus for some absolute constant C depending on CA, EP,1 ˆUn,k EP,1 ˆUn,k 1{I2(n2)} ( Ck 2f1(θ) d if τ/d 1/4 Cn f2(θ) if τ/d > 1/4 Liu, Mukherjee, Robins, Tchetgen Tchetgen 4R(k )1/2d(k ) 1 4R(k )1/2d(k ) where we use n2 = n/ log n, the definition of k and f2(θ) > 1/2 if τ/d > 1/4 and we choose CLepski large enough depending on CA. Then again Lemma C.2 implies: EP,2[A21{I2(n2)}] C2 exp{ C 2d(k )2} where similarly C2 and C 2 are chosen sufficiently large depending on CA defined in Property (A) (which in turn determines CLepski), CB defined in Property (B), (B, BU) defined in Section 2.1, and J0 determined by τmax, a known upper bound of the smoothness index adapted over. Combining the bounds on EP,2[A11{I2(n2)}] and EP,2[A21{I2(n2)}], we have PP ˆφn,k ˆφn,k > R(k )1/2d(k ) C exp{ C d(k )2} + C log n C exp{ ℓ 1(n1)} C exp{ ℓ 1(n)} k = k2 for some absolute constant C depending on (CA, C A, CB, B, BU, τmax, CLepski). Finally, recall that in the beginning of the proof we bound PP ˆk = k by PP ˆk = k = PP k k: ˆφn,k ˆφn,k > R(k )1/2d(k ) k k k N 1 PP ˆφn,k ˆφn,k > R(k )1/2d(k ) , and there are at most O(log n1) = O(log n) summands in total, therefore when k > k2 PP ˆk = k C log n max k k k N 1 ℓ(n1) nδ 1 C log (n)ℓ(n) whereas when k = k2, PP ˆk = k2 PP ˆφn,k2 ˆφn,k1 > R(k2)1/2d(k2) k3 k k N 1 PP ˆφn,k ˆφn,k1 > R(k )1/2d(k ) C exp{ ℓ 1(n)} + log (n)ℓ(n) We further introduce the notation for the mean zero random variable IF := L1(O) φ(P), which encodes the first-order influence function (van der Vaart, 2000) of φ(P) evaluated at Adaptive Estimation of Nonparametric Functionals the true nuisance functions. Returning to the proof of Theorem 2.2, our goal is to obtain an upper bound on the following risk of our adaptive estimator ˆφn,ˆk: f1(θ)=τ,f2(θ)>min{ 4τ 4τ+d , 1 ˆφn,ˆk φ(P) 1 = sup P Pθ: f1(θ)=τ,f2(θ)>min{ 4τ 4τ+d , 1 ˆφn,ˆk φ(P) 1 We start with controlling the following conditional risk bound: ˆφn,ˆk φ(P) 1 ˆφn,ˆk φ(P) 1 !2 1 n ˆk k o ˆφn,ˆk φ(P) 1 !2 1 n ˆk > k o Below we control the terms T1 and T2 separately. Control of T1 We decompose T1 as follows. 1 n ˆk k o ˆφn,ˆk φ(P) 1 ˆφn,ˆk ˆφn,k 2 + EP,1 ˆφn,k φ(P) 2 | {z } T12 + ˆφn,k EP,1 ˆφn,k 1 n1 Pn1 i=1 IFi 2 EP,1 h 1 n ˆk k o T11 i + T12 + EP,1 ˆφn,k EP,1 ˆφn,k 1 Liu, Mukherjee, Robins, Tchetgen Tchetgen Now we control T11, T12 and T13 separately. First, by definition of ˆk, for some absolute constant C depending on CLepski, T11 R(k )d(k )2 = k n2 1 d(k )2 C n1 log n1 8τ/d 4τ/d+1 C n log n 8τ/d 4τ/d+1 τ/d < 1/4, n τ/d = 1/4, ℓ 1(n1) n1 log n1 ℓ 1(n) n τ/d > 1/4. The above upper bound also applies to EP,1T11 and EP T11. 
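For orientation, the rate exponents appearing in the bound on T11 can be tabulated as follows. This is purely illustrative and ignores the constants and the slowly varying factors ℓ(n) appearing in the display above; the function name is ours.

```python
def t11_rate_exponent(tau, d):
    """Exponent governing the bound on T11 (illustration only).

    For tau/d < 1/4 the bound is of order (n / log n)^{-(8 tau/d)/(4 tau/d + 1)};
    at and above the elbow tau/d >= 1/4 it is of parametric order 1/n, up to the
    slowly varying corrections in the display above.
    """
    r = tau / d
    if r < 0.25:
        return 8 * r / (4 * r + 1)   # exponent of (n / log n)^{-...}
    return 1.0                        # parametric regime: exponent of n^{-1}
```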
In terms of T13, by Lemma C.2 (ii), we have EP,2T13 2EP,2EP,1 ˆUn,k EP,1 ˆUn,k 2 + 1 n1 Pn1 i=1 e L1(Oi) EP,1e L1(Oi) IFi 2 for some absolute constant C depending on (CB, B, BU, τmax), the known parameters of the problem. The fact that i=1 e L1(Oi) EP,1e L1(Oi) IFi scales as o(n 1 1 ) can be seen from the following argument. Within the good event I2(n2), this term is n 1 1 EP,1[(e L1(O) L1(O) EP,1(e L1(O) L1(O)))2] = o(n 1 1 ) because EP,1[(e L1(O) L1(O))2] = o(1) in I2(n2). Outside the good event I2(n2), EP,1 1 n1 Pn1 i=1 e L1(Oi) EP,1e L1(Oi) scales as n 1 1 EP,1[(e L1(O) L1(O) EP,1(e L1(O) L1(O)))2] = O(n 1 1 ) because now we only know EP,1[(e L1(O) L1(O) EP,1(e L1(O) L1(O)))2] is bounded almost surely. However, since the probability outside the good event is o(1) by Property (A), i=1 e L1(Oi) EP,1e L1(Oi) IFi = o(n 1 1 ). Then combining the above two points, we see that EP,2EP,1[( 1 n1 Pn1 i=1 e L1(Oi) EP,1e L1(Oi) IFi)2] scales as o(n 1 1 ). As a result, there exists some universal positive constant C depending only on known parameters such that f1(θ)=τ,f2(θ)>min{ 4τ 4τ+d , 1 EP,2T13 C k Adaptive Estimation of Nonparametric Functionals n2 τ/d 1/4, o(n 1 1 ) = o(n 1) τ/d > 1/4. In terms of T12, we again need to analyze its behavior under two different scenarios: 1. Within the good event I2(n2), for some absolute constant C depending on CA: EP,2T121 {I2(n2)} CA n 2f2(θ) 2 + k 4f1(θ) C n1 log n1 8τ/d 4τ/d+1 C n log n 8τ/d 4τ/d+1 τ/d < 1/4, n τ/d = 1/4, o(n 1) τ/d > 1/4. 2. Outside the good event I2(n2): we still have T12 < BT for some constant BT > 0 depending on B and BU almost surely. Summarizing the above arguments, we have EP,2T12 = EP,2T121 n I2(n2) o + EP,2T121 {I2(n2)} I2(n2) + CA n 2f2(θ) 2 + k 4f1(θ) 8τ/d 4τ/d+1 τ/d < 1/4, n τ/d = 1/4, o(n 1) τ/d > 1/4. Combining the above bounds for EP,2 h 1 n ˆk k o T11 i , EP,2T12 and EP,2T13, we get, for some universal positive constant C > 0, 8τ/d 4τ/d+1 τ/d < 1/4, n τ/d = 1/4, o(n 1) τ/d > 1/4. Control of T2 For some p, q > 1 such that 1 q = 1, we have k 1/4. In the control of T22, we utilize the result in Lemma 6.1. k k implying that k > n1 = n(1 o(1)), and the third inequality is a consequence of Lemma 6.1. Here C (q) absorbs the constants (depending on (B, BU, τmax, CB)) in front of the rate (which is k/n2) of ˆUn,k EP,1 ˆUn,k 2q + i=1 e L1(Oi) EP,1e L1(O) IFi into C(q). Next we divide our analyses into three different scenarios: 1. τ/d < 1/4: k > k2. Then following Lemma 6.1, we have n2 log 1 p (n) X k k > k2, the third line holds by choosing the constant p such that CC2 Lepski/p = 2, the fourth line utilizes the cardinality of K being bounded by log n up to some constant C > 0 and k > k , the fifth line follows from the definition of k when τ/d < 1/4 and in the sixth line we simply use 8τ/d < 2. Here we want to emphasize that the choice of p depends only on the constants C in Lemma 6.1 and CLepski, not on the sample size n or k. 2. τ/d = 1/4: k = k2 = n1/ℓ(n1). Then similar to the calculations above by setting p such that C 1C2 Lepski/p = 2 and choosing 0.5 < δ < 1, we have T22 C (q)C n2(1 δ) C (q)C n 2δ log 3. τ/d > 1/4: k = k1 = n1/ log n1. Then compared to T22 in τ/d 1/4, we need to add one extra term ˆφn,k2 EP,1 ˆφn,k2 1 1 p P,1 ˆk = k2 : n2 log 1 p (n) X k 1/4, the above result implies that " ˆφn,ˆk φ(P) 1 which further implies that n ˆφn,ˆk φ(P) = 1 n1 i=1 IFi + o P (1) d N(0, σ2). 
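For reference, every candidate estimator φ̂_{n,k} to which Theorem 2.2 is applied in this paper has the same generic second-order structure: a first-order sample average of a bounded transformation L̃_1, corrected by a second-order U-statistic built from the projection kernel K_{V_k} sandwiched between bounded transformations L̃_{2l} and L̃_{2r}. A minimal sketch of that structure is given below (constants and symmetrization conventions omitted); the function name and the `sign` argument are ours, and the sign of the correction depends on the example.

```python
import numpy as np

def second_order_candidate(L1, L2l, L2r, K, sign=-1.0):
    """Generic candidate phi_hat_{n,k} covered by Theorem 2.2 (a sketch).

    L1, L2l, L2r: arrays of tilde L_1(O_i), tilde L_2l(O_i), tilde L_2r(O_i) over D_1
    K:            matrix K_{V_k}(X_i, X_j) of the projection kernel onto V_k
    sign:         sign of the bias-correcting U-statistic; it is example-dependent
                  (negative for the covariance-type functional of Theorem 3.1 as
                  reconstructed here).
    """
    n = len(L1)
    first_order = L1.mean()
    M = L2l[:, None] * K * L2r[None, :]
    u_stat = (M.sum() - np.trace(M)) / (n * (n - 1))   # off-diagonal second-order part
    return first_order + sign * u_stat
```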
Acknowledgement

We would like to thank two anonymous referees, the action editor Gábor Lugosi, and Zhi-Qin John Xu (Shanghai Jiao Tong University) for helpful comments on this paper.

Funding Information

Lin Liu and James M. Robins were supported by the U.S. Office of Naval Research (ONR) grant N000141912446, and National Institutes of Health (NIH) awards R01AG057869 and R01AI127271. Lin Liu was also partially sponsored by Shanghai Pujiang Program Research grant 20PJ1408900, Shanghai Municipal Science and Technology Major Project 2021SHZDZX0102 and Major Program of National Natural Science Foundation of China 12090024. Rajarshi Mukherjee's research was partially supported by NSF Grant EAGER1941419. Eric Tchetgen Tchetgen was supported by NIH awards R01GM139926, R01AG065276, R01CA222147 and R01AI27271.

References

Yannick Baraud. Model selection for regression on a random design. ESAIM: Probability and Statistics, 6:127–146, 2002.

Thomas B Berrett, Richard J Samworth, and Ming Yuan. Efficient multivariate entropy estimation via k-nearest neighbour distances. The Annals of Statistics, 47(1):288–318, 2019.

Peter J Bickel and Ya'acov Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A, pages 381–393, 1988.

Peter J Bickel, Chris AJ Klaassen, Jon A Wellner, and Ya'acov Ritov. Efficient and adaptive estimation for semiparametric models, volume 4. Johns Hopkins University Press, Baltimore, 1993.

Lucien Birgé and Pascal Massart. Estimation of integral functionals of a density. The Annals of Statistics, 23(1):11–29, 1995.

Christoph Breunig and Xiaohong Chen. Adaptive estimation of quadratic functionals in nonparametric instrumental variable models. arXiv preprint arXiv:2101.12282, 2021.

Lawrence D Brown and Michael Levine. Variance estimation in nonparametric regression via the difference sequence method. The Annals of Statistics, 35(5):2219–2232, 2007.

Lawrence D Brown and Mark G Low. A constrained risk inequality with applications to nonparametric functional estimation. The Annals of Statistics, 24(6):2524–2535, 1996.

Adam D Bull and Richard Nickl. Adaptive confidence sets in L2. Probability Theory and Related Fields, 156(3-4):889–919, 2013.

T Tony Cai and Mark G Low. A note on nonparametric estimation of linear functionals. The Annals of Statistics, 31(4):1140–1153, 2003.

T Tony Cai and Mark G Low. Minimax estimation of linear functionals over nonconvex parameter spaces. The Annals of Statistics, 32(2):552–576, 2004.

T Tony Cai and Mark G Low. On adaptive estimation of linear functionals. The Annals of Statistics, 33(5):2311–2343, 2005a.

T Tony Cai and Mark G Low. Nonquadratic estimators of a quadratic functional. The Annals of Statistics, 33(6):2930–2956, 2005b.

T Tony Cai and Mark G Low. Optimal adaptive estimation of a quadratic functional. The Annals of Statistics, 34(5):2298–2325, 2006.

T Tony Cai and Mark G Low. Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. The Annals of Statistics, 39(2):1012–1041, 2011.

T Tony Cai and Lie Wang. Adaptive variance function estimation in heteroscedastic nonparametric regression. The Annals of Statistics, 36(5):2025–2054, 2008.

Abhishek Chakrabortty and Arun Kumar Kuchibhotla. Tail bounds for canonical U-statistics and U-processes with unbounded kernels. 2018.
Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Nonparametric regression on low-dimensional manifolds using deep ReLU networks. arXiv preprint arXiv:1908.01842, 2019a.

Minshuo Chen, Haoming Jiang, and Tuo Zhao. Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. Advances in Neural Information Processing Systems, 2019b.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.

Albert Cohen, Ingrid Daubechies, and Pierre Vial. Wavelets on the interval and fast wavelet transforms. Applied and Computational Harmonic Analysis, 1(1):54–81, 1993.

Richard K Crump, V Joseph Hotz, Guido W Imbens, and Oscar A Mitnik. Dealing with limited overlap in estimation of average treatment effects. Biometrika, page asn055, 2009.

David L Donoho and Michael Nussbaum. Minimax quadratic estimation of a quadratic functional. Journal of Complexity, 6(3):290–323, 1990.

David L Donoho, Richard C Liu, and Brenda MacGibbon. Minimax risk over hyperrectangles, and implications. The Annals of Statistics, 18(3):1416–1437, 1990.

Sam Efromovich and Mark Low. On optimal adaptive estimation of a quadratic functional. The Annals of Statistics, 24(3):1106–1125, 1996.

Sam Efromovich and Mark G Low. Adaptive estimates of linear functionals. Probability Theory and Related Fields, 98(2):261–275, 1994.

Sam Efromovich and Alexander Samarov. Adaptive estimation of the integral of squared regression derivatives. Scandinavian Journal of Statistics, 27(2):335–351, 2000.

Jianqing Fan. On the estimation of quadratic functionals. The Annals of Statistics, 19(3):1273–1294, 1991.

Jianqing Fan and Qiwei Yao. Efficient estimation of conditional variance functions in stochastic regression. Biometrika, 85(3):645–660, 1998.

Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. Econometrica, 89(1):181–213, 2021.

Evarist Giné and Richard Nickl. A simple adaptive estimator of the integrated square of a density. Bernoulli, 14(1):47–61, 2008.

Evarist Giné and Richard Nickl. Mathematical foundations of infinite-dimensional statistical models, volume 40. Cambridge University Press, 2016.

Evarist Giné, Rafał Latała, and Joel Zinn. Exponential and moment inequalities for U-statistics. In High Dimensional Probability II, pages 13–38. Springer, 2000.

Peter Hall and Raymond J Carroll. Variance function estimation in regression: the effect of estimating the mean. Journal of the Royal Statistical Society: Series B (Methodological), 51(1):3–14, 1989.

Peter Hall and James Stephen Marron. Estimation of integrated squared density derivatives. Statistics & Probability Letters, 6(2):109–115, 1987.

Yanjun Han, Jiantao Jiao, and Rajarshi Mukherjee. On estimation of Lr-norms in Gaussian white noise models. Probability Theory and Related Fields, 177(3):1243–1294, 2020a.

Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates of entropy estimation over Lipschitz balls. Annals of Statistics, 48(6):3228–3250, 2020b.

Wolfgang Härdle, Gérard Kerkyacharian, Dominique Picard, and Alexander Tsybakov. Wavelets, approximation, and statistical applications, volume 129. Springer Science & Business Media, 1998.

Christian Houdré and Patricia Reynaud-Bouret. Exponential inequalities, with constants, for U-statistics of order two. In Stochastic Inequalities and Applications, pages 55–69.
Springer, 2003.

Il'dar Abdulovich Ibragimov and Rafail Z Has'minskii. Statistical estimation: asymptotic theory, volume 16. Springer Science & Business Media, 2013.

Iain M Johnstone. Chi-square oracle inequalities. In State of the Art in Probability and Statistics, pages 399–418. Institute of Mathematical Statistics, 2001.

Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, and James M Robins. Influence functions for machine learning: Nonparametric estimators for entropies, divergences and mutual informations. arXiv preprint arXiv:1411.4342, 2014.

Gérard Kerkyacharian and Dominique Picard. Estimating nonquadratic functionals of a density using Haar wavelets. The Annals of Statistics, 24(2):485–507, 1996.

Jussi Klemelä and Alexandre B Tsybakov. Sharp adaptive estimation of linear functionals. The Annals of Statistics, 29(6):1567–1600, 2001.

Béatrice Laurent. Efficient estimation of integral functionals of a density. The Annals of Statistics, 24(2):659–681, 1996.

Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

Oleg V Lepski. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability & Its Applications, 35(3):454–466, 1991.

Oleg V Lepski. Asymptotically minimax adaptive estimation. I: Upper bounds. Optimally adaptive estimates. Theory of Probability & Its Applications, 36(4):682–697, 1992.

Lingling Li, Eric Tchetgen Tchetgen, Aad van der Vaart, and James M Robins. Higher order inference on a treatment effect under low regularity conditions. Statistics & Probability Letters, 81(7):821–828, 2011.

Mark G Low. Renormalization and white noise approximation for nonparametric functional estimation problems. The Annals of Statistics, 20(1):545–554, 1992.

Tao Luo, Zheng Ma, Zhiwei Wang, Zhi-Qin John Xu, and Yaoyu Zhang. Fourier-domain variational formulation and its well-posedness for supervised learning. arXiv preprint arXiv:2012.03238, 2020.

Tao Luo, Zhi-Qin John Xu, Zheng Ma, and Yaoyu Zhang. Phase diagram for two-layer ReLU neural networks at infinite-width limit. Journal of Machine Learning Research, 22(71):1–47, 2021.

Rajarshi Mukherjee and Subhabrata Sen. Optimal adaptive inference in random design binary regression. Bernoulli, 24(1):699–739, 2018.

Rajarshi Mukherjee, Whitney K Newey, and James M Robins. Semiparametric efficient empirical higher order influence function estimators. arXiv preprint arXiv:1705.07577, 2017.

Arkadi Nemirovski. Topics in non-parametric statistics. École d'Été de Probabilités de Saint-Flour, 28:85, 2000.

Valentin V Petrov. Limit Theorems of Probability Theory: Sequences of Independent Random Variables, volume 4. Clarendon Press; Oxford University Press, 1995.

Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In International Conference on Machine Learning, pages 5301–5310. PMLR, 2019.

James Robins, Lingling Li, Eric Tchetgen Tchetgen, and Aad van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. In Probability and Statistics: Essays in Honor of David A. Freedman, pages 335–421. Institute of Mathematical Statistics, 2008.

James Robins, Lingling Li, Eric Tchetgen Tchetgen, and Aad W van der Vaart. Quadratic semiparametric von Mises calculus. Metrika, 69(2-3):227–247, 2009a.
James Robins, Eric Tchetgen Tchetgen, Lingling Li, and Aad van der Vaart. Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305 1321, 2009b. James M Robins, Steven D Mark, and Whitney K Newey. Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics, pages 479 495, 1992. Adaptive Estimation of Nonparametric Functionals James M Robins, Lingling Li, Rajarshi Mukherjee, Eric Tchetgen Tchetgen, and Aad van der Vaart. Minimax estimation of a functional on a structured high-dimensional model. The Annals of Statistics, 45(5):1951 1987, 2017. David Ruppert, Matthew P Wand, Ulla Holst, and Ola H o SJER. Local polynomial variancefunction estimation. Technometrics, 39(3):262 273, 1997. Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with Re LU activation function. Annals of Statistics, 48(4):1875 1897, 2020. Yandi Shen, Chao Gao, Daniela Witten, and Fang Han. Optimal estimation of variance in nonparametric regression with random design. Annals of Statistics, 48(6):3589 3618, 2020. Eric Tchetgen Tchetgen, Lingling Li, James Robins, and Aad van der Vaart. Minimax estimation of the integral of a power of a density. Statistics & Probability Letters, 78(18): 3307 3311, 2008. Karine Tribouley. Adaptive estimation of integrated functionals. Mathematical Methods of Statistics, 9(1):19 38, 2000. Anastasios Tsiatis. Semiparametric theory and missing data. Springer Science & Business Media, 2007. Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2009. Aad W van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000. Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency principle: Fourier analysis sheds light on deep neural networks. Communications in Computational Physics, 28(5):1746 1767, 2020. Appendix A. Proof of (2.3) Proof. For any two probability measures ν1 and ν2, denote their total-variation distance and Hellinger distance as TV (ν1, ν2) := 1 2 R |dν1 dν2| and H(ν1, ν2) := R ( dν1 dν2)2 1/2 respectively. Since we assume dν2 is strictly bounded from below by B, we have χ2 (ν1, ν2) = Z (dν1 dν2)2 dν1 B 1 Z (dν1 dν2)2 4B 1TV2 (ν1, ν2) 4B 1H2 (ν1, ν2) where in the last inequality we apply the well-known Le Cam s inequality that the Hellinger distance upper bounds the total variation distance (Tsybakov, 2009, Lemma 2.3). Furthermore, using a simple Taylor expansion argument, we have χ2(ν1, ν2) 4B 1H2 (ν1, ν2) exp{4B 1H2 (ν1, ν2)} 1. Liu, Mukherjee, Robins, Tchetgen Tchetgen Appendix B. Proof of Remaining Theorems Proof of Theorem 2.7 2jmind = n log n 1 2smax/d+1 , 2jmaxd = n log n 1 2smin/d+1 , 2lmind = n log n 1 2γmax/d+1 , 2lmaxd = n log n 1 2γmin/d+1 . Without loss of generality assume that we have data {xi, yi}2n i=1. We split it into two disjoint and equal-sized parts and use the second part to construct the estimator ˆg of the design density g and use the resulting ˆg to construct the adaptive estimates of the regression functions from the first half of the sample. Throughout we choose the regularity of our wavelet bases to be larger than γmax for the desired approximation and moment properties to hold. As a result our constants depend on γmax. Define T1 = [jmin, jmax] N and T2 = [lmin, lmax] N. For l T2, let ˆgl(x) = 1 n P2n i=n+1 KVl (Xi, x). Now, let j T2: ˆgj ˆgl C r n , l T2 s.t. l j where C is a constant (depending on γmax, BU) that can be determined from the proof hereafter. 
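In schematic form, the candidate density estimators ĝ_l and the selection rule just described look as follows. This is a sketch for d = 1 in which a Haar projection kernel stands in for the Cohen-Daubechies-Vial kernel K_{V_l}, and a generic sup-norm threshold of order sqrt(2^l l / n) stands in for the one in the display above; all names and constants here are illustrative.

```python
import numpy as np

def haar_projection_density(x_eval, data, level):
    """Projection of the empirical measure onto the Haar scaling space V_level on [0, 1]
    (a one-dimensional stand-in for g_hat_l(x) = n^{-1} sum_i K_{V_l}(X_i, x));
    data and x_eval are assumed to lie in [0, 1]."""
    cells_eval = np.minimum((x_eval * 2**level).astype(int), 2**level - 1)
    cells_data = np.minimum((data * 2**level).astype(int), 2**level - 1)
    counts = np.bincount(cells_data, minlength=2**level)
    return 2**level * counts[cells_eval] / len(data)

def lepski_density(data, x_eval, levels, c=1.0):
    """Schematic sup-norm Lepski selection of the resolution level (constants illustrative)."""
    n = len(data)
    fits = {l: haar_projection_density(x_eval, data, l) for l in levels}
    thr = {l: c * np.sqrt(2**l * max(l, 1) / n) for l in levels}   # stochastic-error proxy
    for i, l in enumerate(levels):
        if all(np.max(np.abs(fits[l] - fits[lp])) <= thr[lp] for lp in levels[i + 1:]):
            return l, fits[l]
    return levels[-1], fits[levels[-1]]
```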
Thereafter, consider the estimator eg := ˆgˆl. In the following, we first prove inequality (2.4). Fix a P := (f, g) P(s, γ). To analyze the estimator eg, we begin with standard bias variance type analysis for the candidate estimators ˆgl. First note that for any x [0, 1]d, using standard facts about compactly supported wavelet basis having regularity larger than γmax (H ardle et al., 1998), one has, for a constant C1 depending only on q and the wavelet basis used, that |EP (ˆgl(x)) g(x)| = |Π (g|Vl) (x) g(x)| C1M2 ld γ Above we have used the fact that sup h H(γ,M) h Π(h|Vl) C1M2 lγ. (B.2) Also, by standard arguments about compactly supported wavelet basis having regularity larger than γmax (Gin e and Nickl, 2016), one has, for a constant C2 := C(BU, ψ0 0,0, ψ1 0,0, γmax), that EP ( ˆgl(x) EP (ˆgl(x)) ) C2 Therefore, by inequalities (B.1) and (B.3), and triangle inequality, EP,2 ˆgl g C1M2 ld γ Adaptive Estimation of Nonparametric Functionals l T2: C1M2 ld γ The definition of l implies that for n sufficiently large, C2 M 2d 2γ+d n log n d 2γ+d 2l d 2d+1 C1 C2 M 2d 2γ+d n log n The error analysis of eg can now be carried out as follows: EP,2 eg g = EP,2 eg g 1 ˆl l + EP,2 eg g 1 ˆl > l := I + II. (B.5) We first control term I as follows: I = EP,2 eg g I ˆl l EP,2 ˆgˆl ˆgl 1 ˆl l + EP,2 ˆgl g 1 ˆl l n + C1M2 l d γ C2 M 2d 2γ+d n log n The control of term II is easier if one has suitable bounds on ˆgl g . To this end note that, for any fixed x [0, 1]d, there exists a constant C3 := C(ψ0 0,0, ψ1 0,0, γmax) such that v {0,1}d |ψv l,m(Xi)||ψv l,m(x)| C32ld. This along with the fact that g BU, implies that for n sufficiently large, ˆgl g C32ld + BU 2C32ld. In the above display the last inequality follows since l lmin n log n 1 2γmax/d+1 . Therefore, l=l +1 2ld PP,2 ˆl = l . (B.7) We now complete the control over II by suitably bounding P ˆl = l . To this end, note that for any l > l , PP,2 ˆl = l Liu, Mukherjee, Robins, Tchetgen Tchetgen ˆgl ˆgl > C r ˆgl E (ˆgl ) > C n EP,2 (ˆgl ) EP,2 (ˆgl) ˆgl EP,2 (ˆgl) > C ˆgl E (ˆgl ) > C n Π (g|Vl ) Π (g|Vl) +P ˆgl EP,2 (ˆgl) > C ˆgl E (ˆgl ) > C +P ˆgl EP,2 (ˆgl) > C ˆgl E (ˆgl ) > (C +P ˆgl EP,2 (ˆgl) > C l>l 2 exp ( Cld). (B.8) In the fourth and fifth lines of the above series of inequalities, we have used inequality (B.2) and the definition of l respectively. The last inequality in the above display holds for an absolute constant C > 0 depending only on BU, ψ0 0,0, ψ1 0,0 and the inequality follows from Lemma C.6 provided we choose C large enough depending on M, BU, ψ0 0,0, ψ1 0,0, γmax. In particular, this implies that, choosing C large enough will guarantee that there exists a η > 3 such that for large enough n, one has for any l > l P(ˆl = l) n η. (B.9) This along with inequality (B.7) and the choice of lmax implies that l>l 2ldn η = C3 X n n η+1 lmax Finally combining equations (B.6) and (B.10), we have the existence of an estimator eg depending on M, BU, and γmax (once we have fixed our choice of the scaling and wavelet functions; see Section 5 in the main text), such that for every (s, γ) [smin, smax] [γmin, γmax], sup P P(s,γ) EP eg g (C) d 2γ+d n log n with a large enough positive C depending on M, BU, and γmax. We next show that uniformly over P P(s, γ), eg belongs to H(γ, C) with probability at least 1 1/n2, for a large enough constant C depending on M, BU, and γmax. 
Towards Adaptive Estimation of Nonparametric Functionals this end, note that, for any C > 0, l > 0, and h L2([0, 1]d), let h, ψl , 2 be the vector L2 norm of the vector h, ψv l ,m : m Zl , v {0, 1}d \ {0}d . We have, PP,2 2l (γ+ d 2 ) eg, ψl , > C l=lmin PP,2 2l (γ+ d 2 ) ˆgl, ψl , > C, ˆl = l 1 l l l=lmin PP,2 2l (γ+ d 2 ) ˆgl, ψl , > C, ˆl = l 1 l l l=l +1 PP,2 2l (γ+ d 2 ) ˆgl, ψl , > C, ˆl = l 1 l l l=lmin PP,2 2l (γ+ d 2 ) ˆgl, ψl , > C 1 l l l=l +1 PP,2 ˆl = l 1 l l l=lmin PP,2 2l (γ+ d 2 ) ˆgl, ψl , > C 1 l l + X l>l n η, (B.11) where the last inequality follows from (B.9) for some η > 3 provided C is chosen large enough as before. Now, PP,2 2l (γ+ d 2 ) ˆgl, ψl , > C PP,2 2l (γ+ d 2 ) ˆgl, ψl , EP,2 ˆgl, ψl , > C/2 + 1 2l (γ+ d 2 ) EP,2 ˆgl, ψl , > C/2 = PP,2 2l (γ+ d 2 ) ˆgl, ψl , EP,2 ˆgl, ψl , > C/2 if C > 2M (by definition in equation (5.5)). Therefore, from (B.11), one has for any C > 2M, PP,2 2l (γ+ d 2 ) ˆg, ψl , > C l=lmin PP,2 2l (γ+ d 2 ) ˆgl, ψl , EP,2 ˆgl, ψl , > C/2 1 l l l=l n 31 l l . (B.12) Liu, Mukherjee, Robins, Tchetgen Tchetgen Considering the first term of the last summand of the above display, we have l=lmin PP,2 2l (γ+ d 2 ) ˆgl, ψl , EP,2 ˆgl, ψl , > C/2 1 l l v {0,1}d PP,2 ψv l ,m(Xi) EP,2 ψv l ,m(Xi) > C/2 By Bernstein s inequality, for any λ > 0, ψv l ,m(Xi) EP,2 ψv l ,m(Xi) > λ 2 σ2 + ψv l ,m λ/3 where σ2 = EP,2 ψv l ,m(Xi) EP,2 ψv l ,m(Xi) 2 . Indeed, there exists a constant C4 de- pending on ψ0 0,0, ψ1 0,0, γmax such that σ2 C4 and ψv l ,m C42 l d 2 . Therefore, v {0,1}d PP,2 ψv l ,m(Xi) EP,2 ψv l ,m(Xi) > C/2 v {0,1}d exp n2 2l (γ+ d 2 2 l (γ+ d v {0,1}d exp v {0,1}d exp v {0,1}d exp 2l dl d 2(l l )(d+2γ)l d v {0,1}d exp C2C2 2d+3C1(1 + C 2 )C4 2(l l )(d+2γ)l d 1 l l (B.13) v {0,1}d exp C2C2 2d+3C1(1 + C l=lmin C(ψ0 0,0, ψ1 0,0)2l d exp C2C2 2d+3C1(1 + C Adaptive Estimation of Nonparametric Functionals l=lmin C(ψ0 0,0, ψ1 0,0) exp C2C2 2d+3C1(1 + C 2lmax C(ψ0 0,0, ψ1 0,0) exp C2C2 2d+3C1(1 + C if C2C2 2d+3C1(1+ C 2 )C4 1. The above inequality (B.13) uses the definition of l . Indeed choosing C large enough, one can guarantee, C2C2 2d+3C1(1+ C 2 )C4 1 lmind 4 log n. Such a choice of C implies that, v {0,1}d PP,2 ψv l ,m(Xi) EP,2 ψv l ,m(Xi) > C/2 C(ψ0 0,0, ψ1 0,0) n3 1(l lmax), which in turn implies that, for C sufficiently large (depending on M, ψ0 0,0, ψ1 0,0) one has PP,2 2l (γ+ d 2 ) ˆg, ψl , 2 > C C(ψ0 0,0, ψ1 0,0) + 1 n3 1(l lmax). This along with the logarithmic in n size of lmax implies that for sufficiently large n, uniformly over P P(s, γ), eg belongs to H(γ, C) with probability at least 1 1/n2, for a large enough constant C depending on M, BU, and γmax (the choice of ψ0 0,0, ψ1 0,0 being fixed by specifying a regularity S > γmax). However this eg does not satisfy the desired point-wise bounds. To achieve this let φ be a C function such that ψ(x)|[BL,BU] x while BL 2 ψ(x) 2BU for all x. Finally, consider the estimator ˆg(x) = ψ(eg(x)). We note that |g(x) ˆg(x)| |g(x) eg(x)| thus ˆg is adaptive to the smoothness of the design density. The boundedness of the constructed estimator follows from the construction. Finally, we wish to show that almost surely, the constructed estimator belongs to the H older space with the same smoothness, possibly of a different radius. This is captured by the next lemma, the proof of which can be completed by following arguments similar to the proof of Lemma 3.1 in Mukherjee and Sen (2018). In particular, recall the definition of H(s, M) in Section 5, Lemma B.1. 
For all h H(s, M), ψ(h) H(s, C(M, s)), where C(M, s) is a universal constant dependent only on M, s and independent of h H(s, M). The probabilistic tail bound (2.5) follows from essentially the same argument as the moment bound (2.4). For the sake of completeness, we present the proof below. We first control the tail probability of eg g . eg g ( e C) d 2γ+d n log n eg g ( e C) d 2γ+d n log n γ 2γ+d , ˆl l ! eg g ( e C) d 2γ+d n log n γ 2γ+d , ˆl > l ! Liu, Mukherjee, Robins, Tchetgen Tchetgen eg g ( e C) d 2γ+d n log n γ 2γ+d , ˆl l ! + PP,2 ˆl > l . Recall that we have shown that, for some η > 3, PP,2 ˆl > l = l=l +1 PP,2 ˆl = l l=l +1 n η lmax Thus we are left to control the following term: eg g ( e C) d 2γ+d n log n γ 2γ+d , ˆl l ! eg ˆgl + ˆgl g ( e C) d 2γ+d n log n γ 2γ+d , ˆl l ! ˆgl g ( e C) d 2γ+d n log n ˆgl ΠP,2[g|Vl ] + ΠP,2[g|Vl ] g ( e C) d 2γ+d n log n ˆgl ΠP,2[g|Vl ] ( e C) d 2γ+d n log n n C1M2 l γ ! ˆgl ΠP,2[g|Vl ] ( e C) d 2γ+d n log n γ 2γ+d (C + C2) ˆgl ΠP,2[g|Vl ] n (C) d 2γ+d C C2 o r exp{ C l d} n η for some η > 3, where the second and fifth inequalities follow from the definitions of ˆl and l respectively, the fourth inequality follows from (B.1), the sixth inequality is a consequence of (B.4) and the seventh inequality is implied by Lemma C.6. For verification purposes, (B.4) implies 2d 2γ+d n log n 2γ 2γ+d n log n γ 2γ+d C1M2 (l 1)γ C2 2(l 1)d(l 1)d ( e C) d 2γ+d n log n ! d 2γ+d 2 d n C d 2γ+d r Adaptive Estimation of Nonparametric Functionals Thus combining the above calculations, we have eg g ( e C) d 2γ+d n log n Since |g(x) ˆg(x)| |g(x) eg(x)|, the above display immediately implies ˆg g ( e C) d 2γ+d n log n Now, the construction of ˆf satisfying the desired properties of Theorem 2.7 can be done following ideas from the proof of Theorem 1.1 of Mukherjee and Sen (2018). In particular, construct the estimator ˆg of the design density g as above from the second part of the sample and let for j T1, ˆfj(x) = 1 Wi ˆg(Xi)KVj (Xi, x). Now, let j T1: ˆfj ˆfj C r n , j T1 s.t. j j where C depends only on the known parameters of the problem and can be determined from the proof hereafter. Thereafter, consider the estimator ef := ˆfˆj. Now define j T1: 2 jd s EP ef f EP ˆfˆj f 1(ˆj j ) + EP ˆfˆj f 1(ˆj > j ). (B.14) Thereafter using Lemma C.6 and equation (B.2) we have EP ˆfˆj f 1(ˆj j ) EP ˆfˆj ˆfj 1(ˆj j ) + EP ˆfj f n + EP,2 ˆfj EP,1( ˆfj ) + EP,2 EP,1( ˆfj ) f (C + C(BU, BL, ψ0 0,0, ψ1 0,0)) n + EP,2 Π(f(g ˆg 1)|Vj ) + f Π(f|Vj ) (C + C(BU, BL, ψ0 0,0, ψ1 0,0)) n + C(M, ψ0 0,0, ψ1 0,0)2 j s + EP,2 Π(f(g ˆg 1)|Vj ) . Liu, Mukherjee, Robins, Tchetgen Tchetgen Now, by standard computations involving compactly supported wavelet bases and properties of ˆg ˆg 1)|Vj ) C(ψ0 0,0, ψ1 0,0)EP,2 f(g C(BU, BL, ψ0 0,0, ψ1 0,0)EP,2 ˆg g C(BU, BL, M, γmax, ψ0 0,0, ψ1 0,0) n log n γ 2γ+d . (B.16) Combining (B.15), (B.16), definition of j , and the fact that γ > s, we have EP ˆfˆj f 1(ˆj j ) C(BU, BL, M, γmax, ψ0 0,0, ψ1 0,0) n log n s 2s+d , (B.17) provided that C is chosen depending only the known parameters of the problem. Now using arguments similar to those leading to (B.7) we have EP ˆfˆj f 1(ˆj > j ) C(BU, BL, ψ0 0,0, ψ1 0,0) X j>j 2jd PP (ˆj = j). (B.18) We now complete the control over II by suitably bounding PP (ˆj = j). To this end, note that for any j > j , PP (ˆj = j) ˆfj ˆfj > C r ˆfj EP,1 ˆfj > C n EP,1 ˆfj EP,1 ˆfj ˆfj EP,1 ˆfj > C ˆfj EP,1 ˆfj > C ˆg|Vj Π f g ˆfj EP,1 ˆfj > C ˆg|Vj Π f g C(M, ψ0 0,0, ψ1 0,0)2 j s + C(BU, BL, , ψ0 0,0, ψ1 0,0) ˆg g . 
Using the fact that q n for j > j , we have, using the definition of j , that there exist C, C > 0 depending on M, BU, BL, ψ0 0,0, ψ1 0,0 such that PP (ˆj = j) Adaptive Estimation of Nonparametric Functionals ˆfj EP,1 ˆfj > (C ˆfj EP,1 ˆfj > C Now, provided C > 2C is chosen large enough (depending on BU, BL, ψ0 0,0, ψ1 0,0) we have there exists large enough C (depending on BU, BL, ψ0 0,0, ψ1 0,0) such that ˆfj EP,1 ˆfj > (C ˆfj EP,1 ˆfj > C 2e C jd. (B.20) Henceforth, whenever required, C, C , C will be chosen to be large enough depending on the known parameters of the problem, which in turn will imply that C can be chosen large enough depending on the known parameters of the problem as well. First note that, the last term in the above display can be bounded rather crudely using the following lemma. Lemma B.2. Assume γmin > smax. Then for C , C1, C2 > 0 (chosen large enough depending on BU, ψ0 0,0, ψ1 0,0) one has sup P P(s,γ) PP,2 ˆg g > C r C1(lmax lmin)e C2lmind. The proof of Lemma B.2 can be argued as follows. Indeed, ˆg = ψ(eg), where ψ(u) is a C function which is identically equal to u on [BL, BU] and has universally bounded first derivative. Therefore, it is enough to prove Lemma B.2 for eg instead of ˆg and thereby invoking a simple first order Taylor series argument along with the fact that ψ(g) g owing to the bounds on g. The crux of the argument for proving Lemma B.2 is that by Lemma C.6, any ˆgl for l T2 suitably concentrates around g in a radius of the order of q n . The proof of the lemma is therefore very similar to the proof of adaptivity of ˆg (by dividing into cases where the chosen ˆl is larger and smaller than l respectively and thereafter invoking Lemma C.6) and therefore we omit the details. Plugging in the result of Lemma B.2 into (B.19), and thereafter using the facts that γmin > smax, lmax, jmax are both poly logarithmic in nature, along with equations (B.14), (B.17), (B.18), and (B.20), we have the existence of an estimator ef depending on M, BU, BL, smin, smax, γmax, such that for every (s, γ) [smin, smax] [γmin, γmax], sup P P(s,γ) EP ef f C n log n with a large enough positive constant C depending on M, BU, BL, smin, γmax, ψ0 0,0, ψ1 0,0. The proof that ef H(s, C) with probability at least 1 1/n2 follows largely from the same argument used to show eg H(γ, C), by the boundedness of Y and ˆg along with the fact that ˆg H(γ, C) with high probability with γ > s. Liu, Mukherjee, Robins, Tchetgen Tchetgen However this ef does not satisfy the desired point-wise bounds. To achieve this, as before, let φ be a C function such that ψ(x)|[BL,BU] x while BL 2 ψ(x) 2BU for all x. Finally, consider the estimator ˆf(x) = ψ( ef(x)). We note that |f(x) ˆf(x)| |f(x) ef(x)| thus ˆf is adaptive to the smoothness of the design density. The boundedness of the constructed estimator follows from the construction. The proof of the fact that the constructed estimator belongs to the H older space with the same smoothness, possibly of a different radius follows once again from of Lemma B.1. Finally, we obtain the tail bound (2.8) of ˆf f . As in the proof of (2.5), we consider the tail bound of ef f first, and then the desired tail bound of ˆf f follows directly from the inequality |f(x) ˆf(x)| |f(x) ef(x)| almost surely. ef f ( e C ) d 2s+d n log n ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j ! ef f ( e C ) d 2s+d n log n s 2s+d , ˆj > j ! ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j ! + PP ˆj > j . 
As was shown before, for some η > 3, we have PP ˆj > j = j=j +1 PP ˆj = j j=j +1 C exp{ C jd} jmax Therefore we are left to bound the first term PP ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j , proceeded as below. First, we write ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j ! ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j !# ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j ! ˆg g C n log n ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j ! ˆg g > C n log n ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j ! ˆg g C n log n ˆg g > C n log n ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j ! ˆg g C n log n Adaptive Estimation of Nonparametric Functionals where the last inequality is implied by (2.5). Then within the event ˆg g C n log n s 2s+d , we have ef f ( e C ) d 2s+d n log n s 2s+d , ˆj j ! ef ˆfj + ˆfj f ( e C ) d 2s+d n log n s 2s+d , ˆj j ! ˆfj f ( e C ) d 2s+d n log n ˆfj EP,1 ˆfj + EP,1 ˆfj f ( e C ) d 2s+d n log n ˆfj EP,1 ˆfj + Π f g ˆg 1 |Vj ( e C ) d 2s+d n log n s 2s+d C q n C(M, ψ0 0,0, ψ1 0,0)2 j s ˆfj EP,1 ˆfj + Π f g ˆg 1 |Vj ( e C ) d 2s+d n log n s 2s+d (C + C(M, ψ0 0,0, ψ1 0,0)) q ˆfj EP,1 ˆfj + C(BU, BL, ψ0 0,0, ψ1 0,0) ˆg g ( e C ) d 2s+d n log n s 2s+d (C + C(M, ψ0 0,0, ψ1 0,0)) q ˆfj EP,1 ˆfj ( e C ) d 2s+d C (BU, BL, ψ0 0,0, ψ1 0,0) n log n s 2s+d (C + C(M, ψ0 0,0, ψ1 0,0)) q ˆfj EP,1 ˆfj C r C1e C2j d 1 for some η > 3, where the second inequality is due to the definition of ˆj, the fourth inequality follows from the property of wavelet basis of Cohen-Daubechies-Vial type, the fifth inequality is implied by the definition of j , the sixth inequality follows from the properties of ˆg, f, g, the seventh inequality is a result of being restricted to the event ˆg g C n log n s 2s+d , the eighth inequality is again implied by the definition of j , and the ninth inequality follows from the same argument as in (B.20). Finally combining the above analysis, and j < lmin, for sufficiently large C 1, C 2 > 0, we have ef f ( e C ) d 2s+d n log n Liu, Mukherjee, Robins, Tchetgen Tchetgen Remark B.3. For the mean response functional in missing data models studied in Section 3.2, denote W = AY . There we need to estimate b(x) = EP [Y |A = 1, X = x] instead of f(x) = EP [W|X = x]. Note that b(x) can be rewritten as: b(x) = Z yf(y|a = 1, x)dy = Z w g1(x)f(y, a, x)dyda where g1(x) = f(x|A = 1)P(A = 1), where f(x|A = 1) is the conditional density of X given A = 1. We define an adaptive estimator ˆb of b in the same way as we define ˆf as follows. Define Π(b|Vj)(x) := EP h W g1(X)KVj(X, x) i , and ˆbj(x) := 1 n Pn i=1 Wi ˆg1(Xi)KVj(Xi, x), where ˆg1(x) is estimated in the same way as ˆg(x) except that ˆgj(x) is replaced by ˆg1,j(X = x) = i=n+1 Ai KVj(Xi, x). Then the above proof goes through if we replace all the corresponding Proof of Theorem 3.1 Proof. (i) Proof of Upper Bound The general scheme of the proof involves identifying a non-adaptive minimax estimator of φ(P) under the knowledge of P P(α,β,γ), demonstrating suitable conditional bias and variance properties of this sequence of estimators, and thereafter invoking Theorem 2.2 to conclude. This routine can be carried out as follows. Suppose that we observe n i.i.d. samples {Yi, Ai, Xi}n i=1. First divide the samples into two disjoint parts D = D1 D2 with sample sizes n1 = n(1 1/ log n) and n2 = n/ log n respectively. 
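The sample-splitting scheme just introduced (including the further split of D_2 described next) is common to the proofs of Theorems 3.1, 3.3 and 3.4; a minimal sketch of the bookkeeping is given below. The function name and the rounding conventions are ours.

```python
import numpy as np

def split_indices(n, rng):
    """Sample splitting used in the upper-bound proofs (a sketch).

    D_1 (size about n(1 - 1/log n)) is reserved for the second-order estimator itself;
    D_2 (size about n/log n) is split further into two equal halves D_21 and D_22,
    on which the nuisance functions are estimated adaptively as in Theorem 2.7.
    """
    n2 = max(2, int(round(n / np.log(n))))
    perm = rng.permutation(n)
    d2 = perm[:n2]
    return perm[n2:], d2[: n2 // 2], d2[n2 // 2:]   # D_1, D_21, D_22

# Hypothetical usage: d1, d21, d22 = split_indices(n, np.random.default_rng(0))
```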
Further divide the subsample D2 into two disjoint and equal-sized parts D2 = D21 D22 with sample sizes n21 = n22 = n/(2 log n), where we estimate g by ˆg adaptively from D22 (as in Theorem 2.7), and estimate a and b by ˆa and ˆb respectively, adaptively from D21 (as in Theorem 2.7). Let EP,S denote the expectation while samples with indices not in S held fixed, for S {1, 21, 22}. A first order influence function for φ(P) at P is given by (Y b(X))(A a(X)) φ(P) and a resulting first order estimator for φ(P) is 1 n Pn i=1(Yi ˆb(Xi))(Ai ˆa(Xi)), computed from the subsample D1. This estimator has a conditional bias R (b(x) ˆb(x))(a(x) ˆa(x)) g(x)dx. Indeed for α+β 2, this bias turns out to be suboptimal compared to the minimax rate of convergence of n 4α+4β 2α+2β+d in mean squared loss. The most intuitive way to proceed is to estimate and correct for the bias. If there exists a Dirac-kernel K(x1, x2) L2 [0, 1]d [0, 1]d such that R h(x1)K(x1, x2)dx2 = h(x1) almost surely x1 for all h L2([0, 1]d), then one can estimate the bias term by 1 n(n 1) P (Yi1 ˆb(Xi1)) g(Xi1) K(Xi1, Xi2) (Ai2 ˆa(Xi2)) g(Xi2) , provided the marginal density g was known. Indeed there are two concerns with the above suggestion. The first one is the knowledge of g. This can be relatively easy to deal with by plugging in an suitable estimate ˆg although there are some subtleties involved (refer to Section 4 for more on this). The primary concern though is the non-existence of a Dirac-kernel of the above Adaptive Estimation of Nonparametric Functionals sort as an element of L2([0, 1]d) L2([0, 1]d). This necessitates the following modification where one works with projection kernels on suitable finite-dimensional linear subspace L of L2([0, 1]d) which guarantees existence of such kernels when the domain space is restricted to L. In particular, we work with the linear subspace Vk (defined in 5) where the choice of k is guided by the balance between the bias and variance properties of the resulting estimator. In particular, a choice of k is guided by the knowledge of the parameter space P(α,β,γ). For any k K, this implies that our bias corrected second-order estimator of φ(P) is given by i=1 (Yi ˆb(Xi))(Ai ˆa(Xi)) 1 i1 =i2 n S (Yi1 ˆb(Xi1))) p ˆg(Xi1) KVk(Xi1, Xi2)(Ai2 ˆa(Xi2)) p Note that the division by ˆg is permitted by the properties guaranteed by Theorem 2.7. Indeed this sequence of estimators is in the form of those considered by Theorem 2.2 with e L1(O) = (Y ˆb(X))(A ˆa(X)), e L2l(O) = (Y ˆb(X))) p ˆg(X) , e L2r(O) = (A ˆa(X))) p where by Theorem 2.7 max{|e L1(O)|, |e L2l(O)|, |e L2r(O)|} C(BL, BU). Therefore it remains to show that the sequence ˆφn,k satisfies the bias and variance properties (A) and (B) necessary for application of Theorem 2.2. We first verify the conditional bias property. Utilizing the representation of the first order bias as stated above, we have |EP,1 ˆφn,k φ(P) | R (b(x) ˆb(x))(a(x) ˆa(x)) g(x)dx S (Y1 ˆb(X1)) ˆg(X1) KVk(X1, X2)(A2 ˆa(X2)) Now, using the notation δb(x) = b(x) ˆb(x) and δa(x) = a(x) ˆa(x), we have " (Y1 ˆb(X1)) p ˆg(X1) KVk(X1, X2)(A2 ˆa(X2)) p = Z Z δb(x1)g(x1) p ˆg(x1) KVk(x1, x2)δa(x2)g(x2) p ˆg(x2) dx1dx2 = Z δb(x1)g(x1) p ˆg(x1) Π δag ˆg |Vk = Z δb(x1)g(x1) p δa(x1)g(x1) p Liu, Mukherjee, Robins, Tchetgen Tchetgen Z δb(x1)g(x1) p ˆg(x1) Π δag ˆg |V k = Z (b(x) ˆb(x))(a(x) ˆa(x)) g(x)dx + Z δa(x1)δb(x1)g2(x1) 1 ˆg(x1) 1 g(x1) Z δb(x1)g(x1) p ˆg(x1) Π δag ˆg |V k (x1)dx1. 
(B.22) Plugging (B.22) into (B.21), we get, |EP,1 ˆφn,k φ(P) | R δa(x1)δb(x1)g2(x1) 1 ˆg(x1) 1 g(x1) dx1 R δb(x1)g(x1) ˆg(x1) Π δag ˆg |V k (x1)dx1 Now, by repeatedly applying Cauchy-Schwarz inequality and invoking results in Theorem 2.7, we have Z δa(x1)δb(x1)g2(x1) 1 ˆg(x1) 1 g(x1) Z g4(x1) g2(x1)ˆg2(x1) (ˆg(x1) g(x1))2 dx1 Z (ˆa(x1) a(x1))4 dx1 4 Z ˆb(x1) b(x1) 4 dx1 B2 U B2 L ˆg g 2 ˆa a 4 ˆb b 4 B2 U B2 L C d 2γ+d+ d 2α+d+ d 2β+d n2/2 log (n2/2) α 2α+d β 2β+d γ 2γ+d . (B.24) Moreover, when restricted to the good event I2;b,a,g(n2) := O D2: ˆb b C d 2β+d b n2/2 log(n2/2) β 2β+d ˆa a C d 2α+d a n2/2 log (n2/2) α 2α+d ˆg g e C d 2α+d n2/2 log (n2/2) γ 2γ+d ˆg H(γ, C) ˆb H(β, C) ˆa H(α, C) with probability at least 1 C log n2 nη 2 for some absolute constant C and η > 2 by Theorem 2.7, Z δb(x1)g(x1) p ˆg(x1) Π δag ˆg |V k Adaptive Estimation of Nonparametric Functionals Z Π δbg ˆg |V k (x1)Π δag ˆg |V k Π δbg ˆg |V k Π δag ˆg |V k where the last line follows for some constant C (depending on M, BU, BL, γmax) by Theorem 2.7, definition in equation (5.5), and noting that Π (h|Vj) C(BU) if h BU. Therefore, one has combining (B.23), (B.24), and (B.25), that for a constant C (depending on M, BU, BL, γmin, γmax) and γmin(ϵ) := γmin EP,1 ˆφn,k φ(P) " n2/2 log (n2/2) α 2α+d β 2β+d γ 2γ+d + k β+α C (n2/2) α 2α+d β 2β+d γmin(ϵ) 2γmin(ϵ)+d (n2/2) γmin(ϵ) 2γmin(ϵ)+d γ 2γ+d log (n2/2) α 2α+d+ β 2β+d+ γ 2γ+d + k β+α C (n2/2) α 2α+d β 2β+d γmin(ϵ) 2γmin(ϵ)+d (n2/2) γ γmin(ϵ) 2γ+d log3 (n2/2) + k β+α C (n2/2) α 2α+d β 2β+d γmin(ϵ) 2γmin(ϵ)+d (n2/2) ϵγmin (1+ϵ)(2γmax+d) log3 (n2/2) + k β+α n α 2α+d β 2β+d γmin(ϵ) 2γmin(ϵ)+d 2 n ϵγmin (1+ϵ)(2γmax+d) 2 log3 n2 + k β+α C (n/ log n) α 2α+d β 2β+d γmin(ϵ) 2γmin(ϵ)+d (n/ log n) ϵγmin (1+ϵ)(2γmax+d) log3 (n/ log n) + k β+α C n α 2α+d β 2β+d γmin(ϵ) 2γmin(ϵ)+d + k β+α where the constant C changes its value from line to line. Now, letting θ = (α, β, γ), f1(θ) = α+β 2 and f2(θ) = α 2α+d β 2β+d γmin(ϵ) 2γmin(ϵ)+d we have that the conditional bias property (A) corresponding to Theorem 2.2 holds with the given choice of f1 and f2 and a constant C depending on M, BU, BL, γmax for {P Pθ: f1(θ) = α+β 2 , f2(θ) > 2α+2β 2α+2β+d}. The proof of the validity of the conditional variance property corresponding to Theorem 2.2 is easy to derive by standard Hoeffding decomposition of ˆφn,k followed by applications of moment bounds in Lemmas C.2 and C.5. For calculations of similar flavor, refer to proof of Theorem 1.3 in Mukherjee and Sen (2018). Note that this is the step where we have used the fact that α+β 4, since otherwise the linear term dominates resulting in O( 1 n) the variance. Then invoking Theorem 2.2, we have 2 ,f2(θ)> 2α+2β 2α+2β+d EP ˆφn,ˆk φ(P) 2 C log n 4α+4β d+2α+2β . Liu, Mukherjee, Robins, Tchetgen Tchetgen Noting that for θ Θ, since γmin > 2(1 + ϵ) max{α, β}, one has automatically, f2(θ) > 2α+2β 2α+2β+d, which completes the proof of the upper bound. (ii) Proof of Lower Bound To prove a lower bound matching the upper bound above, note that φ(P) = EP (cov P (Y, A|X)) = EP (AY ) EP (a(X)b(X)). Indeed, EP (AY ) can be estimated at n-rate by sample average of Ai Yi. Therefore, it suffices to prove a lower bound for adaptive estimation of EP (a(X)b(X)) (our proof continues to hold for EP (cov P (Y, A|X)) since the perturbations created for the lower bound of EP (AY ) is trivial and we only present the lower bound for estimation of the most important term of its estimation). 
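The perturbations constructed in the next few paragraphs (signed bumps λ_j H((x − x_j) k^{1/d}) planted on k disjoint cubes) take the following schematic form. The `amplitude` argument stands in for the smoothness-dependent scaling spelled out below, and the function name and array conventions are ours.

```python
import numpy as np

def bump_perturbation(x, lam, k, amplitude, H):
    """Composite-hypothesis perturbation of the type used in the lower bound (a sketch).

    x:         (N, d) array of evaluation points in [0, 1]^d
    lam:       k signs in {-1, +1}, one per cube (assumes k = m^d for an integer m)
    amplitude: smoothness-dependent scaling (a power of k), given in the text below
    H:         vectorized C^infinity bump, mapping an (N, d) array to N values and
               supported in [0, 1/2]^d
    """
    d = x.shape[-1]
    m = int(round(k ** (1.0 / d)))
    corners = np.stack(
        np.meshgrid(*[np.arange(m) / m] * d, indexing="ij"), axis=-1
    ).reshape(-1, d)                                  # bottom-left corners x_1, ..., x_k
    out = np.zeros(x.shape[0])
    for corner, s in zip(corners, lam):
        out += s * H((x - corner) * m)                # each rescaled bump stays in its cube
    return amplitude * out
```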
Let c(X) = EP (Y |A = 1, X) EP (Y |A = 0, X), which implies owing to the binary nature of A that EP (Y |A, X) = c(X) (A a(X)) + b(X). For the purpose of lower bound it is convenient to parametrize the data generating mechanism by (a, b, c, g), which implies that φ(P) = R a(x)b(x)g(x)dx. With this parametrization, we show that the same lower bound holds in a smaller class of problems where g 1 on [0, 1]d. Specifically consider P = (a, b, c, g): a H(α, M), b H(β, M), α+β 4, g 1, (a(x), b(x)) [BL, BU]2 x [0, 1]d The likelihood of O P for P Θsub can then be written as a(X)A(1 a(X))1 A (c(X)(1 a(X)) + b(X))Y A (1 c(X)(1 a(X)) b(X))(1 Y )A ( c(X)a(X) + b(X))Y (1 A) (1 + c(X)a(X) b(X))(1 Y )(1 A) . (B.26) Let for some (α, β, γ) tuple in the original problem Θ, one has sup P P(α,β,γ) EP ˆφ φ(P) 2 C log n 4α+4β d+2α+2β . Now, let H: [0, 1]d R be a C function supported on 0, 1 2 d such that R H(x)dx = 0 and R H2(x)dx = 1 and let for k N (to be decided later) Ω1, . . . , Ωk be the translations of the cube k 1 2 d that are disjoint and contained in [0, 1]d. Let x1, . . . , xk denote the bottom left corners of these cubes. Assume first that α < β. We set for λ = (λ1, . . . , λk) { 1, +1}k and α β < β, j=1 λj H (x xj)k 1 d , j=1 λj H (x xj)k 1 d , 1 2 bλ(x) 1 aλ(x). Adaptive Estimation of Nonparametric Functionals A properly chosen H guarantees aλ H(α, M) and bλ H(β , M) for all λ. Let Θ0 = P n: P = aλ, 1 2, 0, 1 , λ { 1, +1}k , Θ1 = n P n: P = (aλ, bλ, cλ, 1), λ { 1, +1}ko . Finally let Θtest = Θ0 Θ1. Let π0 and π1 be uniform priors on Θ0 and Θ1 respectively. It is easy to check that by our choice of H, φ(P) = 1 4 on Θ0 and φ(P) = 1 for P Θ1. Therefore, using notation from Lemma C.1, µ1 = 1 d , and σ1 = σ2 = 0. Since Θ0 P(α, β, γ), we must have that worst case error of estimation over Θ0 is bounded by C log n n 4α+4β d+2α+2β . Therefore, the π0 average bias over Θ0 is also bounded by C log n n 2α+2β d+2α+2β . This implies by Lemma C.1, that the π1 average bias over Θ1 (and hence the worst case bias over Θ1) is bounded below by 2α+2β d+2α+2β C log n 2α+2β d+2α+2β η, (B.27) where η is the chi-square divergence between the probability measures R P ndπ0(P n) and R P ndπ1(P n). We now bound η using Proposition 2.6. To put ourselves in the notation of Proposition 2.6, let for λ { 1, +1}k, Pλ and Qλ be the probability measures identified from Θ0 and Θ1 respectively. Therefore, with χj = {0, 1} {0, 1} Ωj, we indeed have for all j = 1, . . . , k, Pλ(χj) = Qλ(χj) = pj where there exists a constant c such that pj = c k. Letting π be the uniform prior over { 1, +1}k it is immediate that η = χ2 R Pλd(π(λ)), R Qλd(π(λ)) . It now follows by calculations similar to the proof of Theorem 4.1 in Robins et al. (2009b), that for a constant C > 0 χ2 Z Pλd(π(λ)), Z Qλd(π(λ)) exp C n2 d + k 4 α+β Now choosing k = n c log n 2d d+2α+2β , we have χ2 Z Pλd(π(λ)), Z Qλd(π(λ)) n2C c 1. Therefore choosing c such that 2C c + 2α+2β 2α+2β +d < 2α+2β 2α+2β+d, we have the desired result by (B.27). The proof for α > β is similar after changing various quantities to: j=1 λj H (x xj)k 1 d , β α < α, Liu, Mukherjee, Robins, Tchetgen Tchetgen j=1 λj H (x xj)k 1 d , 2 aλ(X))bλ(X) aλ(X)(1 aλ(X)), Θ0 = P n: P = 1 2, bλ, 0, 1 : λ { 1, +1}k , Θ1 = n P n: P = (aλ, bλ, cλ, 1): λ { 1, +1}ko . For the case of α = β, choose α < α and therefore, α < β and thereafter work with j=1 λj H (x xj)k 1 d , j=1 λj H (x xj)k 1 d , 1 2 bλ(x) 1 aλ(x). Θ0 = P n: P = aλ, 1 2, 0, 1 : λ { 1, +1}k , Θ1 = n P n: P = (aλ, bλ, cλ, 1): λ { 1, +1}ko . 
This completes the proof of the lower bound. Proof of Theorem 3.3 Proof. (i) Proof of Upper Bound The general scheme of the proof is the same as that of Theorem 3.1 and involves identifying a non-adaptive minimax estimator of φ(P) under the knowledge of P P(α,β,γ), demonstrating suitable bias and variance properties of this sequence of estimators, and thereafter invoking Theorem 2.2 to conclude. This routine can be carried out as follows. We observe n i.i.d. samples {Yi Ai, Ai, Xi}n i=1. First divide the samples into two disjoint parts D = D1 D2 with sample sizes n1 = n(1 1/ log n) and n2 = n/ log n respectively. Further divide the subsample D2 into two disjoint and equal-sized parts D2 = D21 D22 with sample sizes n21 = n22 = n/(2 log n). We estimate E (A|x) and b by b E (A|x) and Adaptive Estimation of Nonparametric Functionals ˆb(x) := b E (Y |A = 1, x) respectively, adaptively from D21 (as in Theorem 2.7). Note that g(X) = f(X|A = 1)P(A = 1). Therefore, also estimate PP (A = 1) by ˆπ := 1 n22 P i D22 Ai i.e. the sample average of A s from D22 and ˆf1 is estimated as an estimator of f(X|A = 1) also from D22 using density estimation technique among observations with A = 1. Finally, our estimates of a and g are ˆa(x) = 1 b E(A|x) and ˆg = ˆf1ˆπ respectively. In the following, we will freely use Theorem 2.7, for desired properties of ˆa,ˆb, and ˆg. In particular, following the proof of Theorem 2.7, we can actually assume that our choice of ˆg also satisfies the necessary conditions of boundedness away from 0 and , as well as membership in H(γ, C) with high probability for a large enough C > 0. A first order influence function for φ(P) at P is given by Aa(X)(Y b(X)) + b(X) φ(P) and a resulting first order estimator for φ(P) is 1 n1 P i D1 Aiˆa(Xi)(Yi ˆb(Xi)) + ˆb(Xi). This estimator has a bias EP,2 R (b(x) ˆb(x))(a(x) ˆa(x)) g(x)dx. Indeed for α+β 2, this bias turns out to be suboptimal compared to the minimax rate of convergence of n 4α+4β 2α+2β+d in mean squared loss. Similar to the proof of Theorem 3.1 we use a second-order bias corrected estimator as follows. Once again we work with the linear subspace Vk (defined in 5) where the choice of k is guided by the balance between the conditional bias and variance properties of the resulting estimator. In particular, a choice of k is guided by the knowledge of the parameter space P(α,β,γ). For any k K, our bias corrected second-order estimator of φ(P) is given by i=1 Aiˆa(Xi)(Yi ˆb(Xi)) + ˆb(Xi) + 1 n1(n1 1) 1 i1 =i2 n1 S Ai1(Yi1 ˆb(Xi1))) p ˆg(Xi1) KVk(Xi1, Xi2)(Ai2ˆa(Xi2) 1) p Note that division by ˆg is permitted by the properties guaranteed by Theorem 2.7. Indeed this sequence of estimators is in the form of those considered by Theorem 2.2 with e L1(O) = Aˆa(X)(Y ˆb(X)) + ˆb(X), e L2l(O) = A(Y ˆb(X))) p ˆg(X) , e L2r(O) = (Aˆa(X) 1) p where by Theorem 2.7 max{|e L1(O)|, |e L2l(O)|, |e L2r(O)|} C(BL, BU). Therefore it remains to show that the sequence ˆφn,k satisfies the conditional bias and variance properties (A) and (B) necessary for application of Theorem 2.2. 
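For concreteness, the three bounded transformations listed above for this functional can be assembled as follows and then fed into the generic second-order form sketched after the proof of Theorem 2.2. The nuisance estimates b̂, â, ĝ are assumed to have been fitted on D_2 as described, the sign of the U-statistic correction is the one in the display above, and the helper name is ours.

```python
import numpy as np

def nuisance_blocks(Y, A, X, b_hat, a_hat, g_hat):
    """tilde L_1, tilde L_2l, tilde L_2r for the functional treated in this proof (a sketch).

    Y, A, X are the D_1 subsample; b_hat, a_hat, g_hat are callables fitted on D_2
    (with g_hat estimating f(x | A = 1) P(A = 1) as in the text above).
    """
    L1 = A * a_hat(X) * (Y - b_hat(X)) + b_hat(X)
    L2l = A * (Y - b_hat(X)) / np.sqrt(g_hat(X))
    L2r = (A * a_hat(X) - 1.0) / np.sqrt(g_hat(X))
    return L1, L2l, L2r
```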
Using the conditional independence of Y and A given X, one has the following calculations exactly parallel to that in proof of Theorem 3.1, that for a constant C (depending on M, BU, BL, γmin, γmax), within the good event I2;b,a,g(n2) := O D2: ˆb b C d 2βb+d b n2/2 log(n2/2) βb 2βb+d ˆa a C d 2α+d a n2/2 log (n2/2) α 2α+d ˆg g e C d 2α+d n2/2 log (n2/2) γ 2γ+d ˆg H(γ, C) ˆb H(β, C) ˆa H(α, C) Liu, Mukherjee, Robins, Tchetgen Tchetgen with probability at least 1 C log n2 nη 2 for some absolute constant C and η > 2 by Theorem 2.7, EP,1 ˆφn,k φ(P) C n α 2α+d β 2β+d γmin(ϵ) 2γmin(ϵ)+d + k α+β where γmin(ϵ) := γmin 1+ϵ . Now, letting θ = (α, β, γ), f1(θ) = α+β 2 and f2(θ) = α 2α+d β 2β+d γmin(ϵ) 2γmin(ϵ)+d we have the bias property corresponding to Theorem 2.2 holds with the given choice of f1 and f2 and a constant C depending on M, BU, BL, γmax for {P Pθ: f1(θ) = α+β 2 , f2(θ) > 2α+2β 2α+2β+d}. The proof of the validity of the conditional variance property corresponding to Theorem 2.2 is once again easy to derive by standard Hoeffding decomposition of ˆφn,k followed by applications of moment bounds in Lemmas C.2 and C.5. 2 ,f2(θ)> 2α+2β 2α+2β+d EP ˆφn,ˆk φ(P) 2 8C log n 4α+4β d+2α+2β . Noting that for θ Θ, since γmin > 2(1 + ϵ) max{α, β}, one has automatically, f2(θ) > 2α+2β 2α+2β+d, completes the proof of the upper bound. (ii) Proof of Lower Bound First note that we can parametrize our distributions by the tuple of functions (a, b, g). We show that the same lower bound holds in a smaller class of problems where g 1/2 on [0, 1]d. Specifically consider P = (a, b, g): a H(α, M), b H(β, M), α+β 4, g 1/2, (a(x), b(x)) [BL, BU]2 x [0, 1]d The observed data likelihood of O P for P Θsub can then be written as (a(X) 1)1 A b Y (X)(1 b(X))1 Y A . (B.28) Let for some (α, β, γ) tuple in the original problem Θ, one has sup P P(α,β,γ) EP ˆφ φ(P) 2 C log n 4α+4β d+2α+2β . Now, let H: [0, 1]d R be a C function supported on 0, 1 2 d such that R H(x)dx = 0 and R H2(x)dx = 1 and let for k N (to be decided later) Ω1, . . . , Ωk be the translations of the cube k 1 2 d that are disjoint and contained in [0, 1]d. Let x1, . . . , xk denote the bottom left corners of these cubes. Assume first that α < β. We set for λ = (λ1, . . . , λk) { 1, +1}k and α β < β, aλ(x) = 2 + 1 j=1 λj H (x xj)k 1 d , Adaptive Estimation of Nonparametric Functionals j=1 λj H (x xj)k 1 d . A properly chosen H guarantees aλ H(α, M) and bλ H(β , M) for all λ. Let Θ0 = n P n: P = (aλ, 1/2, 1/2) : λ { 1, +1}ko , Θ1 = n P n: P = (aλ, bλ, 1/2): λ { 1, +1}ko . Finally let Θtest = Θ0 Θ1. Let π0 and π1 be uniform priors on Θ0 and Θ1 respectively. It is easy to check that by our choice of H, φ(P) = 1 2 on Θ0 and φ(P) = 1 for P Θ1. Therefore, using notation from Lemma C.1, µ1 = 1 d , and σ1 = σ2 = 0. Since Θ0 P(α, β, γ), we must have that worst case error of estimation over Θ0 is bounded by C log n n 4α+4β d+2α+2β . Therefore, the π0 average bias over Θ0 is also bounded by C log n n 2α+2β d+2α+2β . This implies by Lemma C.1, that the π1 average bias over Θ1 (and hence the worst case bias over Θ1) is bounded below by a constant multiple of 2α+2β d+2α+2β log n 2α+2β d+2α+2β η, (B.29) where η is the chi-square divergence between the probability measures R P ndπ0(P n) and R P ndπ1(P n). We now bound η using Proposition 2.6. To put ourselves in the notation of Proposition 2.6, let for λ { 1, +1}k, Pλ and Qλ be the probability measures identified from Θ0 and Θ1 respectively. Therefore, with χj = {0, 1} {0, 1} Ωj, we indeed have for all j = 1, . . . 
, k, Pλ(χj) = Qλ(χj) = pj where there exists a constant c such that pj = c k. Letting π be the uniform prior over { 1, +1}k it is immediate that η = χ2 R Pλd(π(λ)), R Qλd(π(λ)) . It now follows by calculations similar to the proof of Theorem 4.1 in Robins et al. (2009b), that for a constant C > 0 χ2 Z Pλd(π(λ)), Z Qλd(π(λ)) exp C n2 d + k 4 α+β Now choosing k = n c log n 2d d+2α+2β , we have χ2 Z Pλd(π(λ)), Z Qλd(π(λ)) n2C c 1. Liu, Mukherjee, Robins, Tchetgen Tchetgen Therefore choosing c such that 2C c + 2α+2β 2α+2β +d < 2α+2β 2α+2β+d, we have the desired result by (B.27). The proof for α > β is similar after changing various quantities to: aλ(x) = 2 + 1 j=1 λj H (x xj)k 1 d , β α < α, j=1 λj H (x xj)k 1 d . Θ0 = n P n: P = (2, bλ, 1/2) : λ { 1, +1}ko , Θ1 = n P n: P = (aλ, bλ, 1/2): λ { 1, +1}ko . For the case of α = β, choose α < α and therefore, α < β and thereafter work with aλ(x) = 2 + 1 j=1 λj H (x xj)k 1 d , j=1 λj H (x xj)k 1 d Θ0 = n P n: P = (aλ, 1/2, 1/2) : λ { 1, +1}ko , Θ1 = n P n: P = (aλ, bλ, 1/2): λ { 1, +1}ko . This completes the proof of the lower bound. Proof of Theorem 3.4 Proof. (i) Proof of Upper Bound We observe n i.i.d. samples {Yi, Xi}n i=1. First divide the samples into two disjoint parts D = D1 D2 with sample sizes n1 = n(1 1/ log n) and n2 = n/ log n respectively. Further divide the subsample D2 into two disjoint and equal-sized parts D2 = D21 D22 with sample sizes n21 = n22 = n/(2 log n). We estimate g by ˆg adaptively from D22 (as in Theorem 2.7), and estimate b by ˆb, adaptively from D21 (as in Theorem 2.7). For any k K, consider i=1 (2Yi ˆb(Xi))ˆb(Xi) Adaptive Estimation of Nonparametric Functionals (Yi1 ˆb(Xi1))) p ˆg(Xi1) KVk(Xi1, Xi2)(Yi2 ˆb(Xi2)) p Indeed this sequence of estimators is in the form of those considered by Theorem 2.2 with e L1(O) = (2Y ˆb(X))ˆb(X), e L2l(O) = (Y ˆb(X))) p ˆg(X) , e L2r(O) = (Y ˆb(X))) p where by Theorem 2.7 max{|e L1(O)|, |e L2l(O)|, |e L2r(O)|} C(BL, BU). Therefore it remains to show that the sequence ˆφn,k satisfies the bias and variance properties (A) and (B) necessary for application of Theorem 2.2. We first verify the bias property. Utilizing the representation of the first order bias as stated above, we have |EP,1 ˆφn,k φ(P) | R (b(x) ˆb(x) 2 g(x)dx S (Y1 ˆb(X1))) ˆg(X1) KVj(X1, X2)(Y2 ˆb(X2)) Now, by calculations similar to the proof of Theorem 3.1, one can show that for a constant C (depending on M, BU, BL, γmin, γmax), within the good event I2;b,g(n2) := O D2: ˆb b C d 2βb+d b n2/2 log(n2/2) βb 2βb+d ˆg g e C d 2γ+d n2/2 log (n2/2) γ 2γ+d ˆg H(γ, C) ˆb H(β, C) with probability at least 1 C log n2 nη 2 for some absolute constant C and η > 2 by Theorem 2.7, EP,1 ˆφn,k φ(P) C n 2β 2β+d γmin(ϵ) 2γmin(ϵ)+d + k α+β d , γmin(ϵ) := γmin Now, letting θ = (β, γ), f1(θ) = β and f2(θ) = 2β 2β+d γmin(ϵ) 2γmin(ϵ)+d, the rest of the proof follows along the lines of the proof of Theorem 3.1. (ii) Proof of Lower Bound The proof of the lower bound is very similar to that of the lower bound proof in Theorem 3.1, after identifying Y = A almost surely P, and hence is omitted. Appendix C. Technical Lemmas C.1 Constrained Risk Inequality A main tool for producing adaptive lower bound arguments is a general version of constrained risk inequality due to Cai and Low (2011), obtained as an extension of Brown and Liu, Mukherjee, Robins, Tchetgen Tchetgen Low (1996). For the sake of completeness, we begin with a summary of these results. Suppose Z has distribution Pθ where θ belongs to some parameter space Θ. 
Appendix C. Technical Lemmas

C.1 Constrained Risk Inequality

A main tool for producing adaptive lower bound arguments is a general version of the constrained risk inequality due to Cai and Low (2011), obtained as an extension of Brown and Low (1996). For the sake of completeness, we begin with a summary of these results. Suppose $Z$ has distribution $P_\theta$, where $\theta$ belongs to some parameter space $\Theta$. Let $\hat Q = \hat Q(Z)$ be an estimator of a functional $Q(\theta)$ based on $Z$, with bias $B(\theta) := \mathbb{E}_\theta(\hat Q) - Q(\theta)$. Now suppose that $\Theta_0$ and $\Theta_1$ form a disjoint partition of $\Theta$, with priors $\pi_0$ and $\pi_1$ supported on them respectively. Also, let $\mu_i = \int Q(\theta)\,d\pi_i$ and $\sigma_i^2 = \int (Q(\theta) - \mu_i)^2\,d\pi_i$, $i = 0, 1$, be the mean and variance of $Q(\theta)$ under the two priors $\pi_0$ and $\pi_1$. Letting $\gamma_i$ denote the marginal density of $Z$ under $\pi_i$ (w.r.t. some common dominating measure), $i = 0, 1$, write $\mathbb{E}_{\gamma_0}(g(Z))$ for the expectation of $g(Z)$ w.r.t. the marginal density of $Z$ under the prior $\pi_0$, to be distinguished from $\mathbb{E}_\theta(g(Z))$, the expectation under $P_\theta$. Lastly, denote by $\chi$ the square root of the chi-square divergence between $\gamma_1$ and $\gamma_0$,
\[
\chi = \Big\{\mathbb{E}_{\gamma_0}\Big[\Big(\frac{\gamma_1(Z)}{\gamma_0(Z)} - 1\Big)^2\Big]\Big\}^{1/2}.
\]
Then we have the following result.

Lemma C.1 (Cai and Low (2011)). If $\int \mathbb{E}_\theta\big[(\hat Q(Z) - Q(\theta))^2\big]\,d\pi_0(\theta) \le \epsilon^2$, then
\[
\Big| \int B(\theta)\,d\pi_1(\theta) - \int B(\theta)\,d\pi_0(\theta) \Big| \ge |\mu_1 - \mu_0| - (\epsilon + \sigma_0)\,\chi.
\]
Since the maximum risk is always at least as large as the average risk, this immediately yields a lower bound on the minimax risk.

C.2 Tail and Moment Bounds

The U-statistics appearing in this paper are mostly based on projection kernels sandwiched between arbitrary bounded functions. This necessitates generalizing the U-statistic bounds obtained in Bull and Nickl (2013), as in Mukherjee and Sen (2018).

Lemma C.2. Let $O_1, \ldots, O_n \sim P$ be i.i.d. random vectors of observations such that $X_i \in [0,1]^d$ is a sub-vector of $O_i$ for each $i$, where the $X_i$ are i.i.d. with density $g$ satisfying $g(x) \le B_U$ for all $x \in [0,1]^d$, and let
\[
R(O_1, O_2) = \tilde L_{2l}(O_1)\, K_{V_j}(X_1, X_2)\, \tilde L_{2r}(O_2), \qquad \max\{|\tilde L_{2l}(O)|, |\tilde L_{2r}(O)|\} \le B \ \text{almost surely}.
\]
Then there exists a constant $C := C(B, B_U, J_0) > 0$ such that the following hold:

(i) for all $t > 0$,
\[
\mathbb{P}\Big( \Big| \frac{1}{n(n-1)}\sum_{i_1 \ne i_2} \big[ R(O_{i_1}, O_{i_2}) - \mathbb{E}R(O_1, O_2) \big] \Big| \ge t \Big)
\le e^{-Cnt^2} + e^{-Ct^2/a_1^2} + e^{-Ct/a_2} + e^{-C\sqrt{t/a_3}}; \tag{C.1}
\]
(ii) for every $q \ge 1$,
\[
\mathbb{E}\Big| \frac{1}{n(n-1)}\sum_{i_1 \ne i_2} \big[ R(O_{i_1}, O_{i_2}) - \mathbb{E}R(O_1, O_2) \big] \Big|^{2q}
\le C(B, B_U, J_0, q)\,\big( n^{-q} + a_1^{2q} + a_2^{2q} + a_3^{2q} \big);
\]
where
\[
a_1 = \frac{2^{jd/2}}{n}, \qquad a_2 = \frac{1}{n}\Big(\frac{2^{jd/2}}{\sqrt{n}} + 1\Big), \qquad a_3 = \frac{1}{n}\Big(\frac{2^{jd/2}}{\sqrt{n}} + \frac{2^{jd}}{n}\Big).
\]

Proof. The proof of part (i) can be found in Mukherjee and Sen (2018); for the sake of completeness we provide it here again. We give the argument in the special case $\tilde L_{2l} = \tilde L_{2r} = L$; the details of the argument show that the proof continues to hold for the symmetrized U-statistics defined here. The proof hinges on the following tail bound for second-order degenerate U-statistics (Giné and Nickl, 2016), due to Giné et al. (2000) with constants by Houdré and Reynaud-Bouret (2003), which is crucial for our calculations.

Lemma C.3. Let $U_n$ be a degenerate U-statistic of order 2 with kernel $R$ based on an i.i.d. sample $W_1, \ldots, W_n$. Then there exists a constant $C$, independent of $n$, such that for all $u > 0$,
\[
\mathbb{P}\Big( \Big| \sum_{i \ne j} R(W_i, W_j) \Big| \ge C\big( \Lambda_1 \sqrt{u} + \Lambda_2 u + \Lambda_3 u^{3/2} + \Lambda_4 u^2 \big) \Big) \le 6\exp(-u),
\]
where
\[
\Lambda_1^2 = \frac{n(n-1)}{2}\,\mathbb{E}\big[R^2(W_1, W_2)\big], \qquad
\Lambda_2 = n\,\sup\big\{ \mathbb{E}\big[R(W_1, W_2)\zeta(W_1)\xi(W_2)\big]:\ \mathbb{E}[\zeta^2(W_1)] \le 1,\ \mathbb{E}[\xi^2(W_2)] \le 1 \big\},
\]
\[
\Lambda_3 = \big( n\,\big\| \mathbb{E}\big[R^2(W_1, \cdot)\big] \big\|_\infty \big)^{1/2}, \qquad
\Lambda_4 = \|R\|_\infty.
\]
We use this lemma to establish Lemma C.2. By Hoeffding's decomposition one has
\[
\frac{1}{n(n-1)}\sum_{i_1 \ne i_2} \big[ R(O_{i_1}, O_{i_2}) - \mathbb{E}R(O_1, O_2) \big] = T_1 + T_2,
\]
where
\[
T_1 := \frac{2}{n}\sum_{i_1=1}^n \Big[ \mathbb{E}\big\{R(O_{i_1}, O_{i_2}) \mid O_{i_1}\big\} - \mathbb{E}R(O_{i_1}, O_{i_2}) \Big],
\]
\[
T_2 := \frac{1}{n(n-1)}\sum_{i_1 \ne i_2} \Big[ R(O_{i_1}, O_{i_2}) - \mathbb{E}\big\{R(O_{i_1}, O_{i_2}) \mid O_{i_1}\big\} - \mathbb{E}\big\{R(O_{i_1}, O_{i_2}) \mid O_{i_2}\big\} + \mathbb{E}R(O_{i_1}, O_{i_2}) \Big].
\]
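For concreteness, the following small numerical sketch (with a toy discrete distribution and illustrative names) checks that the centered U-statistic coincides exactly with $T_1 + T_2$; because the distribution is discrete, the conditional expectations entering $T_1$ and $T_2$ can be computed exactly.
\begin{verbatim}
import numpy as np

# Numerical check of the Hoeffding decomposition U - E[U] = T1 + T2 for a
# second-order U-statistic with a bounded symmetric kernel R on a discrete space,
# so that h1(o) = E[R(o, O')] and theta = E[R(O, O')] are computable exactly.

rng = np.random.default_rng(1)

support = np.arange(5)
probs = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
R = rng.uniform(-1.0, 1.0, size=(5, 5))
R = (R + R.T) / 2                      # symmetric kernel

h1 = R @ probs                         # h1[o] = E[R(o, O')]
theta = probs @ R @ probs              # E[R(O, O')] for independent O, O'

n = 200
O = rng.choice(support, size=n, p=probs)

M = R[np.ix_(O, O)]                    # M[i, j] = R(O_i, O_j)
np.fill_diagonal(M, 0.0)
U_centered = M.sum() / (n * (n - 1)) - theta

T1 = 2.0 * np.mean(h1[O] - theta)      # Hajek (linear) projection
D = M - h1[O][:, None] - h1[O][None, :] + theta
np.fill_diagonal(D, 0.0)               # keep only i1 != i2 terms
T2 = D.sum() / (n * (n - 1))           # completely degenerate remainder

print(U_centered, T1 + T2)             # agree up to floating-point rounding
\end{verbatim}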
C.2.1 Analysis of $T_1$

Noting that $T_1 = \frac{2}{n}\sum_{i_1=1}^n H(O_{i_1})$, where $H(O_{i_1}) = \mathbb{E}\big\{R(O_{i_1}, O_{i_2}) \mid O_{i_1}\big\} - \mathbb{E}R(O_{i_1}, O_{i_2})$, we control $T_1$ by the standard Hoeffding inequality. First note that
\[
H(O_{i_1}) = \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \Big[ L(O_{i_1})\psi^v_{jk}(X_{i_1})\,\mathbb{E}\big\{\psi^v_{jk}(X_{i_2})L(O_{i_2})\big\} - \big\{\mathbb{E}\big(\psi^v_{jk}(X_{i_2})L(O_{i_2})\big)\big\}^2 \Big],
\]
so that
\[
|H(O_{i_1})| \le \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \Big[ \big|L(O_{i_1})\psi^v_{jk}(X_{i_1})\big|\,\big|\mathbb{E}\big(\psi^v_{jk}(X_{i_2})L(O_{i_2})\big)\big| + \big\{\mathbb{E}\big(\psi^v_{jk}(X_{i_2})L(O_{i_2})\big)\big\}^2 \Big]. \tag{C.2}
\]
First, by a standard compactness argument for the wavelet basis,
\[
\big|\mathbb{E}\big(\psi^v_{jk}(X)L(O)\big)\big| \le \int \big|\mathbb{E}(L(O)\mid X = x)\big|\, 2^{jd/2}\prod_{l=1}^d \big|\psi^{v_l}_{00}(2^j x_l - k_l)\big|\, g(x)\,dx \le C(B, B_U, J_0)\,2^{-jd/2},
\]
and therefore
\[
\sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \big\{\mathbb{E}\big(\psi^v_{jk}(X_{i_2})L(O_{i_2})\big)\big\}^2 \le C(B, B_U, J_0). \tag{C.3}
\]
Also, for each fixed $x \in [0,1]^d$, the number of indices $k \in Z_j$ such that $x$ belongs to the support of at least one of the $\psi^v_{jk}$ is bounded by a constant depending only on $\psi^0_{00}$ and $\psi^1_{00}$. Therefore, combining (C.2) and (C.3),
\[
\sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \big|L(O_{i_1})\psi^v_{jk}(X_{i_1})\big|\,\big|\mathbb{E}\big(\psi^v_{jk}(X_{i_2})L(O_{i_2})\big)\big| \le C(B, B_U, J_0)\,2^{jd/2}\,2^{-jd/2} = C(B, B_U, J_0), \tag{C.4}
\]
so that $|H(O_{i_1})| \le C(B, B_U, J_0)$ almost surely. Therefore, by (C.4) and Hoeffding's inequality,
\[
\mathbb{P}(|T_1| \ge t) \le 2e^{-C(B, B_U, J_0)\,n t^2}. \tag{C.5}
\]

C.2.2 Analysis of $T_2$

Since $T_2$ is a degenerate U-statistic, its analysis is based on Lemma C.3. In particular,
\[
T_2 = \frac{1}{n(n-1)}\sum_{i_1 \ne i_2} \bar R(O_{i_1}, O_{i_2}),
\]
with
\[
\bar R(O_{i_1}, O_{i_2}) := \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \Big[ L(O_{i_1})\psi^v_{jk}(X_{i_1}) - \mathbb{E}\big\{\psi^v_{jk}(X_{i_1})\,\mathbb{E}(L(O_{i_1}) \mid X_{i_1})\big\} \Big]\Big[ L(O_{i_2})\psi^v_{jk}(X_{i_2}) - \mathbb{E}\big\{\psi^v_{jk}(X_{i_2})\,\mathbb{E}(L(O_{i_2}) \mid X_{i_2})\big\} \Big].
\]
Letting $\Lambda_i$, $i = 1, \ldots, 4$, be the corresponding quantities in Lemma C.3, applied to the kernel $\bar R$, we have the following lemma.

Lemma C.4. There exists a constant $C = C(B, B_U, J_0)$ such that
\[
\Lambda_1^2 \le C\,n(n-1)\,2^{jd}, \qquad \Lambda_2 \le C\,n, \qquad \Lambda_3^2 \le C\,n\,2^{jd}, \qquad \Lambda_4 \le C\,2^{jd}.
\]

Proof. First we control $\Lambda_1$. By simple calculations, using the bounds on $L$ and $g$ and the orthonormality of the $\psi^v_{jk}$, we have
\[
\Lambda_1^2 = \frac{n(n-1)}{2}\,\mathbb{E}\big[\{\bar R(O_1, O_2)\}^2\big] \le 3n(n-1)\,\mathbb{E}\big[R^2(O_1, O_2)\big] = 3n(n-1)\,\mathbb{E}\big[L^2(O_1)\,K^2_{V_j}(X_1, X_2)\,L^2(O_2)\big]
\]
\[
\le 3n(n-1)\,B^4 \int\!\!\int \Big[ \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \psi^v_{jk}(x_1)\psi^v_{jk}(x_2) \Big]^2 g(x_1)g(x_2)\,dx_1\,dx_2
\le 3n(n-1)\,B^4 B_U^2 \int\!\!\int \Big[ \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \psi^v_{jk}(x_1)\psi^v_{jk}(x_2) \Big]^2 dx_1\,dx_2
\]
\[
= 3n(n-1)\,B^4 B_U^2 \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \int \big(\psi^v_{jk}(x_1)\big)^2 dx_1 \int \big(\psi^v_{jk}(x_2)\big)^2 dx_2
\le C(B, B_U, J_0)\,n(n-1)\,2^{jd}.
\]
Next we control
\[
\Lambda_2 = n\,\sup\Big\{ \mathbb{E}\big[\bar R(O_1, O_2)\,\zeta(O_1)\,\xi(O_2)\big]:\ \mathbb{E}\big[\zeta^2(O_1)\big] \le 1,\ \mathbb{E}\big[\xi^2(O_2)\big] \le 1 \Big\}.
\]
To this end, we first control
\[
\big| \mathbb{E}\big[ L(O_1)K_{V_j}(X_1, X_2)L(O_2)\,\zeta(O_1)\xi(O_2) \big] \big|
= \Big| \int\!\!\int \mathbb{E}\big(L(O_1)\zeta(O_1) \mid X_1 = x_1\big)\, K_{V_j}(x_1, x_2)\, \mathbb{E}\big(L(O_2)\xi(O_2) \mid X_2 = x_2\big)\, g(x_1)g(x_2)\,dx_1\,dx_2 \Big|
\]
\[
= \Big| \int \mathbb{E}\big(L(O)\zeta(O) \mid X = x\big)\, \Pi\big( \mathbb{E}(L(O)\xi(O) \mid X = \cdot)\,g(\cdot) \mid V_j \big)(x)\, g(x)\,dx \Big|
\]
\[
\le \Big( \int \mathbb{E}^2\big(L(O)\zeta(O) \mid X = x\big)\,g^2(x)\,dx \Big)^{1/2} \Big( \int \Pi^2\big( \mathbb{E}(L(O)\xi(O) \mid X = \cdot)\,g(\cdot) \mid V_j \big)(x)\,dx \Big)^{1/2}
\]
\[
\le \Big( \int \mathbb{E}\big(L^2(O)\zeta^2(O) \mid X = x\big)\,g^2(x)\,dx \Big)^{1/2} \Big( \int \mathbb{E}\big(L^2(O)\xi^2(O) \mid X = x\big)\,g^2(x)\,dx \Big)^{1/2}
\le B^2 B_U \big( \mathbb{E}(\zeta^2(O_1))\,\mathbb{E}(\xi^2(O_2)) \big)^{1/2} \le B^2 B_U.
\]
Above we have used the Cauchy-Schwarz inequality, Jensen's inequality, and the fact that projections contract norms. Also,
\[
\big| \mathbb{E}\big[ \mathbb{E}\big\{ L(O_1)K_{V_j}(X_1, X_2)L(O_2) \mid O_1 \big\}\,\zeta(O_1)\xi(O_2) \big] \big|
= \big| \mathbb{E}\big[ L(O_1)\,\Pi\big( \mathbb{E}(L(O) \mid X = \cdot)\,g(\cdot) \mid V_j \big)(X_1)\,\zeta(O_1) \big] \big|\; \big| \mathbb{E}(\xi(O_2)) \big|
\]
\[
\le \Big| \int \Pi\big( \mathbb{E}(L(O)\zeta(O) \mid X = \cdot)\,g(\cdot) \mid V_j \big)(x)\, \Pi\big( \mathbb{E}(L(O) \mid X = \cdot)\,g(\cdot) \mid V_j \big)(x)\,dx \Big| \le B^2 B_U,
\]
where the last step once again uses the contraction property of projections, Jensen's inequality, and the bounds on $L$ and $g$; the term with conditioning on $O_2$ is handled symmetrically. Finally, by the Cauchy-Schwarz inequality and (C.3),
\[
\big| \mathbb{E}\big[ L(O_1)K_{V_j}(X_1, X_2)L(O_2) \big]\; \mathbb{E}\big[\zeta(O_1)\xi(O_2)\big] \big| \le \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \mathbb{E}^2\big( L(O)\psi^v_{jk}(X) \big) \le C(B, B_U, J_0).
\]
This completes the proof of $\Lambda_2 \le C(B, B_U, J_0)\,n$. Turning to $\Lambda_3 = \big( n\,\| \mathbb{E}[\bar R^2(O_1, \cdot)] \|_\infty \big)^{1/2}$, we have
\[
\bar R(O_1, o_2)^2 \le 2\big[ R(O_1, o_2) - \mathbb{E}\{R(O_1, O_2) \mid O_1\} \big]^2 + 2\big[ \mathbb{E}\{R(O_1, O_2) \mid O_2 = o_2\} - \mathbb{E}R(O_1, O_2) \big]^2,
\]
and
\[
\mathbb{E}\big[ R(O_1, o_2) - \mathbb{E}\{R(O_1, O_2) \mid O_1\} \big]^2
\le 2\,\mathbb{E}\big[ L^2(O_1)\,K^2_{V_j}(X_1, x_2) \big]\,L^2(o_2)
+ 2\,\mathbb{E}\Big[ \Big( \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} L(O_1)\psi^v_{jk}(X_1)\,\mathbb{E}\big\{\psi^v_{jk}(X_2)L(O_2)\big\} \Big)^2 \Big]
\le C(B, B_U, J_0)\,2^{jd},
\]
where the last inequality follows from arguments along the lines of (C.4). Also, using inequalities (C.3) and (C.4),
\[
\big[ \mathbb{E}\{R(O_1, O_2) \mid O_2 = o_2\} - \mathbb{E}R(O_1, O_2) \big]^2
= \Big( \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \mathbb{E}\big\{ L(O_1)\psi^v_{jk}(X_1) \big\}\Big[ \psi^v_{jk}(x_2)L(o_2) - \mathbb{E}\big\{\psi^v_{jk}(X_2)L(O_2)\big\} \Big] \Big)^2 \le C(B, B_U, J_0).
\]
This completes the control of $\Lambda_3$. Finally, using the compactness of the wavelet basis,
\[
\|R(\cdot, \cdot)\|_\infty \le B^2 \sup_{x_1, x_2} \sum_{k \in Z_j}\sum_{v \in \{0,1\}^d} \big|\psi^v_{jk}(x_1)\big|\,\big|\psi^v_{jk}(x_2)\big| \le C(B, B_U, J_0)\,2^{jd}.
\]
Combining this with arguments similar to those leading to (C.4), we have $\Lambda_4 \le C(B, B_U, J_0)\,2^{jd}$.
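An immediate by-product of Lemma C.4, recorded here as a sketch of how it feeds into the variance side of the argument (the constants are the generic ones from the lemma): since $T_2$ is a completely degenerate U-statistic, kernels evaluated at distinct pairs of indices are uncorrelated, so
\[
\mathrm{Var}(T_2) \le \frac{2}{n(n-1)}\,\mathbb{E}\big[\{\bar R(O_1, O_2)\}^2\big] = \frac{4\,\Lambda_1^2}{n^2(n-1)^2} \le \frac{C(B, B_U, J_0)\,2^{jd}}{n(n-1)},
\]
while, by (C.4),
\[
\mathrm{Var}(T_1) \le \frac{4}{n}\,\mathbb{E}\big[H^2(O_1)\big] \le \frac{C(B, B_U, J_0)}{n}.
\]
These are, up to constants, the orders of the variance bounds entering the verification of the conditional variance property for Theorem 2.2 in the proofs above.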
Therefore, using Lemma C.3 and Lemma C.4, we have
\[
\mathbb{P}\Big( |T_2| \ge \frac{C(B, B_U, J_0)}{n}\Big[ 2^{jd/2}\sqrt{t} + t + \frac{2^{jd/2}}{\sqrt{n}}\,t^{3/2} + \frac{2^{jd}}{n}\,t^2 \Big] \Big) \le 6e^{-t}.
\]
Finally, using $2t^{3/2} \le t + t^2$, we have
\[
\mathbb{P}\big[ |T_2| > a_1\sqrt{t} + a_2 t + a_3 t^2 \big] \le 6e^{-t}, \tag{C.6}
\]
where $a_1 = C(B, B_U, J_0)\,\frac{2^{jd/2}}{n}$, $a_2 = \frac{C(B, B_U, J_0)}{n}\big(\frac{2^{jd/2}}{\sqrt{n}} + 1\big)$, and $a_3 = \frac{C(B, B_U, J_0)}{n}\big(\frac{2^{jd/2}}{\sqrt{n}} + \frac{2^{jd}}{n}\big)$. Now, if $h(t)$ is such that $a_1\sqrt{h(t)} + a_2 h(t) + a_3 h^2(t) \le t$, then one has, by (C.6),
\[
\mathbb{P}\big[ |T_2| \ge t \big] \le \mathbb{P}\Big[ |T_2| \ge a_1\sqrt{h(t)} + a_2 h(t) + a_3 h^2(t) \Big] \le 6e^{-h(t)}.
\]
Indeed, such an $h(t)$ exists with
\[
h(t) = \min\big\{ b_1 t^2,\ b_2 t,\ b_3\sqrt{t} \big\}, \qquad
b_1 = \frac{C(B, B_U, J_0)}{a_1^2}, \quad b_2 = \frac{C(B, B_U, J_0)}{a_2}, \quad b_3 = \frac{C(B, B_U, J_0)}{\sqrt{a_3}}.
\]
Therefore, there exists a $C = C(B, B_U, J_0)$ such that
\[
\mathbb{P}\big[ |T_2| \ge t \big] \le 6\Big( e^{-Ct^2/a_1^2} + e^{-Ct/a_2} + e^{-C\sqrt{t/a_3}} \Big). \tag{C.7}
\]

C.2.3 Combining the Bounds on $T_1$ and $T_2$

Applying a union bound along with (C.5) and (C.7) completes the proof of part (i) of Lemma C.2. For the proof of part (ii), note that, with the notation of the proof of part (i), the Hoeffding decomposition gives
\[
\mathbb{E}\Big| \frac{1}{n(n-1)}\sum_{i_1 \ne i_2} \big[ R(O_{i_1}, O_{i_2}) - \mathbb{E}R(O_1, O_2) \big] \Big|^{2q} \le 2^{2q-1}\big( \mathbb{E}|T_1|^{2q} + \mathbb{E}|T_2|^{2q} \big).
\]
The proof will be completed by controlling the two moments individually.

C.2.4 Analysis of $\mathbb{E}|T_1|^{2q}$

Recall that $T_1 = \frac{2}{n}\sum_{i_1=1}^n H(O_{i_1})$, where $H(O_{i_1}) = \mathbb{E}\big\{R(O_{i_1}, O_{i_2}) \mid O_{i_1}\big\} - \mathbb{E}R(O_{i_1}, O_{i_2})$ and $|H(O)| \le C(B, B_U, J_0)$ almost surely. Therefore, by Rosenthal's inequality (see Lemma C.5),
\[
\mathbb{E}|T_1|^{2q} \le \frac{C(q)}{n^{2q}}\Big[ \sum_{i=1}^n \mathbb{E}|H(O_i)|^{2q} + \Big( \sum_{i=1}^n \mathbb{E}|H(O_i)|^2 \Big)^q \Big]
\le C(B, B_U, J_0, q)\, n^{-2q}\,\big( n + n^q \big) \le C(B, B_U, J_0, q)\, n^{-q}.
\]

C.2.5 Analysis of $\mathbb{E}|T_2|^{2q}$

Recall that
\[
\mathbb{P}\big[ |T_2| \ge t \big] \le \mathbb{P}\Big[ |T_2| \ge a_1\sqrt{h(t)} + a_2 h(t) + a_3 h^2(t) \Big] \le 6e^{-h(t)},
\]
where $h(t) = \min\{ b_1 t^2, b_2 t, b_3\sqrt{t} \}$ with $b_1 = C(B, B_U, J_0)/a_1^2$, $b_2 = C(B, B_U, J_0)/a_2$, and $b_3 = C(B, B_U, J_0)/\sqrt{a_3}$. Therefore,
\[
\mathbb{E}|T_2|^{2q} = 2q\int_0^\infty x^{2q-1}\,\mathbb{P}(|T_2| \ge x)\,dx
\le 2q\int_0^\infty x^{2q-1}\,\mathbb{P}\big(|T_2| \ge a_1\sqrt{h(x)} + a_2 h(x) + a_3 h^2(x)\big)\,dx
\le 12q\int_0^\infty x^{2q-1}\,e^{-h(x)}\,dx
\]
\[
\le 12q\Big[ \int_0^\infty x^{2q-1}e^{-b_1 x^2}\,dx + \int_0^\infty x^{2q-1}e^{-b_2 x}\,dx + \int_0^\infty x^{2q-1}e^{-b_3\sqrt{x}}\,dx \Big]
= 12q\Big[ \frac{\Gamma(q)}{2\,b_1^{q}} + \frac{\Gamma(2q)}{b_2^{2q}} + \frac{2\,\Gamma(4q)}{b_3^{4q}} \Big]
\le C(B, B_U, J_0, q)\,\big( a_1^{2q} + a_2^{2q} + a_3^{2q} \big),
\]
for a constant $C(B, B_U, J_0, q)$, by our choices of $b_1, b_2, b_3$. Since the estimators arising in this paper also have a linear term, we will also need the following standard Bernstein- and Rosenthal-type tail and moment bounds (Petrov, 1995).

Lemma C.5. If $O_1, \ldots, O_n \sim P$ are i.i.d. random vectors such that $|L(O)| \le B$ almost surely under $P$, then for $q \ge 2$ one has, for large enough constants $C(B)$ and $C(B, q)$,
\[
\mathbb{P}\Big( \Big| \frac{1}{n}\sum_{i=1}^n \big( L(O_i) - \mathbb{E}L(O_i) \big) \Big| \ge t \Big) \le 2e^{-nt^2/C(B)},
\]
\[
\mathbb{E}\Big( \Big| \frac{1}{n}\sum_{i=1}^n \big( L(O_i) - \mathbb{E}L(O_i) \big) \Big|^q \Big)
\le \frac{C(B, q)}{n^q}\Big[ \sum_{i=1}^n \mathbb{E}\big( |L(O_i) - \mathbb{E}L(O_i)|^q \big) + \Big( \sum_{i=1}^n \mathbb{E}\big( |L(O_i) - \mathbb{E}L(O_i)|^2 \big) \Big)^{q/2} \Big]
\le C(B, q)\, n^{-q/2}.
\]
We will also need the following concentration inequality for linear estimators based on wavelet projection kernels, whose proof can be carried out along the lines of the proofs of Theorem 5.1.5 and Theorem 5.1.13 of Giné and Nickl (2016).

Lemma C.6. Consider i.i.d. observations $O_i = (Y_i, X_i)$, $i = 1, \ldots, n$, where $X_i \in [0,1]^d$ has marginal density $g$. Let $\hat m(x) = \frac{1}{n}\sum_{i=1}^n L(O_i)\,K_{V_l}(X_i, x)$, and suppose $\max\{\|g\|_\infty, \|L\|_\infty\} \le B_U$. If $2^{ld}\,l\,d/n \le 1$, then there exist constants $C, C_1, C_2 > 0$, depending on $B_U$ and on the scaling functions $\psi^0_{0,0}, \psi^1_{0,0}$ respectively, such that
\[
\mathbb{E}\big( \|\hat m - \mathbb{E}(\hat m)\|_\infty \big) \le C\,\sqrt{\frac{2^{ld}\,l\,d}{n}},
\]
and for any $t > 0$,
\[
\mathbb{P}\Big( n\,\|\hat m - \mathbb{E}(\hat m)\|_\infty > \tfrac{3}{2}\,n\,\mathbb{E}\big( \|\hat m - \mathbb{E}(\hat m)\|_\infty \big) + \sqrt{C_1\, n\, 2^{ld}\, t} + C_2\, 2^{ld}\, t \Big) \le e^{-t}.
\]
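To indicate how Lemma C.6 is typically deployed in this paper (a sketch with an illustrative choice of $t$; the exact constants are immaterial): taking $t = \eta\log n$ with $\eta > 2$ and dividing the deviation inequality by $n$, one obtains that, with probability at least $1 - n^{-\eta}$,
\[
\|\hat m - \mathbb{E}(\hat m)\|_\infty \le \frac{3}{2}\,\mathbb{E}\big(\|\hat m - \mathbb{E}(\hat m)\|_\infty\big) + \sqrt{\frac{C_1\,2^{ld}\,\eta\log n}{n}} + \frac{C_2\,2^{ld}\,\eta\log n}{n} \le C'\sqrt{\frac{2^{ld}\,\log n}{n}},
\]
provided $2^{ld}\log n \le n$, using the expectation bound in the lemma for the first term. Combined with a bias bound for $\mathbb{E}(\hat m)$ of order $2^{-l\gamma}$ (available when the relevant regression function and design density are $\gamma$-Hölder) and a resolution choice $2^{ld} \asymp (n/\log n)^{d/(2\gamma+d)}$, this produces sup-norm errors of order $(n/\log n)^{-\gamma/(2\gamma+d)}$, which is the form of the rates appearing in the good events $I_{2;b,g}(n_2)$ and $I_{2;b,a,g}(n_2)$ above.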