# optimal_estimation_of_derivatives_in_nonparametric_regression__86da6032.pdf

Journal of Machine Learning Research 17 (2016) 1-25 Submitted 12/15; Revised 4/16; Published 8/16

Optimal Estimation of Derivatives in Nonparametric Regression

Wenlin Dai wenlin.dai@kaust.edu.sa CEMSE Division King Abdullah University of Science and Technology Saudi Arabia

Tiejun Tong tongt@hkbu.edu.hk Department of Mathematics Hong Kong Baptist University Hong Kong

Marc G. Genton marc.genton@kaust.edu.sa CEMSE Division King Abdullah University of Science and Technology Saudi Arabia

Editor: Xiaotong Shen

We propose a simple framework for estimating derivatives without ﬁtting the regression function in nonparametric regression. Unlike most existing methods that use the symmetric diﬀerence quotients, our method is constructed as a linear combination of observations. It is hence very ﬂexible and applicable to both interior and boundary points, including most existing methods as special cases of ours. Within this framework, we deﬁne the variance-minimizing estimators for any order derivative of the regression function with a ﬁxed bias-reduction level. For the equidistant design, we derive the asymptotic variance and bias of these estimators. We also show that our new method will, for the ﬁrst time, achieve the asymptotically optimal convergence rate for diﬀerence-based estimators. Finally, we provide an eﬀective criterion for selection of tuning parameters and demonstrate the usefulness of the proposed method through extensive simulation studies of the ﬁrstand second-order derivative estimators. Keywords: Linear combination, Nonparametric derivative estimation, Nonparametric regression, Optimal sequence, Taylor expansion

1. Introduction

Consider the following nonparametric regression model:

Yi = m(xi) + εi, 1 i n, (1)

where xi are the design points satisfying 0 x1 < < xn 1, m(x) is the regression function, Yi are the observations, and εi are independent and identically distributed random errors with E(εi) = 0 and var(εi) = σ2 < . Estimation of m(x) is an important problem in nonparametric regression and has received sustained attention in the literature. Such

c 2016 Wenlin Dai, Tiejun Tong, and Marc G. Genton.

Dai, Tong and Genton

methods include, for example, kernel smoothing (H ardle, 1990), spline smoothing (Wahba, 1990), and local polynomial regression (Fan and Gijbels, 1996). It has been noted that the estimation of the ﬁrstor higher-order derivatives of m(x) is also important for practical implementations including, but not limited to, the modeling of human growth data (Ramsay and Silverman, 2002), kidney function for a lupus nephritis patient (Ramsay, 2006), and Raman spectra of bulk materials (Charnigo et al., 2011). Derivative estimation is also needed in nonparametric regression to construct conﬁdence intervals for regression functions (Eubank and Speckman, 1993), to select kernel bandwidths (Ruppert et al., 1995), and to compare regression curves (Park and Kang, 2008). Most existing methods for pth-order derivative estimation can be expressed as a weighted average of the responses,

i=1 wi(x)Yi,

where wi(x) are weights assigned to each observation Yi. These estimators can be separated into two classes by their ability to directly or indirectly assess the weights, wi(x). In the indirect methods, the regression function is initially estimated as ˆm(x) = Pn i=1 ci(x)Yi by the aforementioned nonparametric smoothing techniques, where ci(x) are smooth functions. Then, wi(x) are estimated as dpci(x)/dxp (Gasser and M uller, 1984; M uller et al., 1987; Fan and Gijbels, 1995; Zhou and Wolfe, 2000; Boente and Rodriguez, 2006; Cao, 2014). We note, however, that the optimal bandwidths may diﬀer for estimating the regression function and for estimating the derivatives, respectively. That is, a good estimate of the regression function may not guarantee the generation of good estimates of the derivatives. Direct methods lead to the second class, which estimate the derivatives directly without ﬁtting the regression function. The two key steps for such methods are constructing point-wise estimates for the derivatives of each design point and determining the amount of smoothing or the bandwidth. To select the bandwidth, one may refer to some classical methods in M uller et al. (1987), H ardle (1990), Fan and Gijbels (1996), Opsomer et al. (2001), Lahiri (2003), and Kim et al. (2009), among others. In contrast, little attention has been paid to the improvement of the point-wise estimation of the derivatives. One simple point-wise estimator for derivatives uses diﬀerence quotients. This method is, however, very noisy. For example, the variance of the ﬁrst-order diﬀerence quotient (Yi Yi 1)/(xi xi 1) is of order O(n2). Charnigo et al. (2011) proposed a variance-reducing linear combination of symmetric diﬀerence quotients, called empirical derivatives, and applied it to their generalized Cp criterion for tuning parameter selection. De Brabanter et al. (2013) established the L1 and L2 convergence rates for the empirical derivatives. Speciﬁcally, they deﬁned the empirical derivatives as

Y (L 1) i+j Y (L 1) i j xi+j xi j

, L = 1, . . . p,

where Y (L) i denotes the estimated Lth-order derivative at xi, Y (0) i = Yi and wj,L are the associated weights. When L = 1, wj,1 are chosen as the optimal weights that minimize the estimation variance. For L 2, wj,L are determined intuitively instead of by optimizing the estimation variance. As a consequence, their higher-order empirical derivatives may not

Optimal Estimation of Derivatives in Nonparametric Regression

be optimally deﬁned. Another attempt was made recently by Wang and Lin (2015). They estimated the derivative as the intercept of a linear regression model through the weighted least squares method. They further showed that their proposed estimators achieve better control of the estimation bias, which makes them superior to empirical derivatives when the signal-to-noise ratio is large. Finally, it is noteworthy that their method only applies to equidistant designs and hence the practical applications are somewhat limited. In this paper, we propose a simple framework for estimating derivatives in model (1) without ﬁtting the regression function. Our method does not rely on symmetric diﬀerence quotients; hence, it is more ﬂexible than existing methods. Within this framework, we deﬁne the variance-minimizing estimators for any order derivative of m(x) with a ﬁxed bias-reduction level. For the equidistant design, we derive the asymptotic variance and bias of these estimators. We also show that the proposed estimators perform well on both interior and boundary points and, more importantly, that they achieve the optimal convergence rate for the mean squared error (MSE). The rest of this paper is organized as follows. In Section 2, we propose a new framework for ﬁrst-order derivative estimation and show that most existing estimators are special cases of ours. We also investigate the theoretical properties of the proposed estimator, including the optimal sequence, the asymptotic variance and bias, the point-wise consistency, and the boundary behavior. In Section 3, we extend the proposed method to higher-order derivative estimation and provide an eﬀective criterion for the selection of tuning parameters. We then report extensive simulation studies in Section 4 that validate the proposed method. We conclude the paper with a discussion in Section 5. Technical proofs of the theoretical results are given in the Appendix.

2. First-order derivative estimation

In this section, we propose a new framework for estimating derivatives in nonparametric regression. Within this framework, we deﬁne the optimal estimator for the ﬁrst-order derivative by minimizing the estimation variance. Theoretical results including the asymptotic variance and bias, and point-wise consistency are derived for the proposed optimal estimators under the equidistant design. We also investigate the performance of the estimators on the boundaries.

2.1 New framework

Recall that most existing methods are weighted average of symmetric diﬀerence quotients, which limits their implementation to some extent. All these estimators can be expressed as a linear combination of the observations for ﬁxed design points. To proceed, we deﬁne

k=0 dk Yi+k, 1 i n r,

where (d0, . . . , dr) is a sequence of real numbers, and r is referred to as the order of the sequence. Assuming that m(x) is a smooth enough function, we have the following Taylor

Dai, Tong and Genton

expansion at xi+l for each m(xi+k),

m(xi+k) = m(xi+l) +

(xi+k xi+l)j

j! m(j)(xi+l), 0 l r.

Note that xi+l can be any design point within [xi, xi+r], which frees our method from the symmetric form restriction. If we further assume that xi are equidistant, then xi = i/n, i = 1, . . . , n. Deﬁne Cj,l = Pr k=0 dk(k l)j/(njj!), j = 0, 1, . . . and l = 0, . . . , r. The expectation of DYi can be expressed as

j=0 Cj,lm(j)(xi+l), 1 i n r. (2)

To estimate the ﬁrst-order derivative at xi+l with DYi, we let C0,l = 0 and C1,l = 1 so that

E(DYi) = m (xi+l) +

j=2 Cj,lm(j)(xi+l),

where the second term on the right side is the estimation bias. When the regression function is oscillating around xi+l, we can alter our model by controlling the estimation bias at a higher level. Speciﬁcally, if we let

C1,l = 1 and Cj,l = 0, 0 j = 1 q 1, (3)

E(DYi) = m (xi+l) +

j=q Cj,lm(j)(xi+l).

When q = 2, condition (3) reduces to C1,l = 1 and C0,l = 0. When q 3, condition (3) eliminates the estimation bias up to order q 1.

2.2 Theoretical results

If we use a sequence with an order r q, an inﬁnite number of choices satisfying (3) is available. Among them, we choose the one(s) minimizing the estimation variance, var(DYi) = σ2 Pr k=0 d2 k, which leads to the following optimization problem,

(d0, . . . , dr)1,q = argmin (d0,...,dr) Rr+1

k=0 d2 k, s.t. condition (3) holds.

We denote this variance-minimizing sequence as (d0, . . . , dr)1,q. For simplicity of notation, the dependence of dk on l is suppressed. In addition, we introduce the following notation:

I(l) i = Pr k=0(k l)i, l = 0, . . . , r and i = 0, 1, . . . ;

U(l) denotes a q q matrix with u(l) ij = I(l) i+j 2;

Optimal Estimation of Derivatives in Nonparametric Regression

V (l) = (U(l)) 1 is the inverse matrix of U(l).

Then, we present the theoretical results for (d0, . . . , dr)1,q in the following proposition.

Proposition 1 Assume that model (1) holds with equidistant design and m(x) has a ﬁnite qth-order derivative on [0, 1]. For 1 i n r and 0 l r, the unique varianceminimizing sequence is

(dk)1,q = n

j=0 V (l) (j+1,2)(k l)j, k = 0, . . . , r,

for estimating m (xi+l) with an order of accuracy up to m(q)(xi+l), q 2. Here, V (l) (i,j) denotes the element in the ith row and the jth column of the matrix V (l). Proof: see Appendix A.

When q is ﬁxed, the optimal sequence depends only on l, which makes it quite convenient for practical implementation. When r is even and l = r/2, we get the symmetric form used in De Brabanter et al. (2013) and Wang and Lin (2015). For this case, it is easy to verify that dk = dr k, which eliminates all the even-order derivatives in (2). The sequence is derived for the equidistant design on [0, 1]. To extend the result to equidistant designs on an arbitrary interval, [a, b] R, we can simply use dk/(b a) instead. We treat the DYi built on (d0, . . . , dr)1,q as the estimator for the ﬁrst-order derivative with a bias-reduction level of q, denoted by ˆm q(xi+l).

Theorem 1 Assume that model (1) holds with equidistant design, m(x) has a ﬁnite qthorder derivative on [0, 1] and r = o(n), r . For 1 i n r and 0 l r, we have

var[ ˆm q(xi+l)] = n2V (l) (2,2)σ2 = O n2

bias[ ˆm q(xi+l)] = 1 q!nq 1

j=0 V (l) (j+1,2)I(l) j+qm(q)(xi+l) + o rq 1

Proof: see Appendix B.

For a larger q, the order of estimation bias is indeed reduced as expected, and the estimation variance surprisingly retains the same order at the same time. Assuming r = nλ

and 2/3 < λ < 1, we can establish the point-wise consistency of our estimator, ˆm q(xi+l) P

m (xi+l), where P means convergence in probability.

Corollary 1 Assume that the conditions in Theorem 1 hold. When r is even and l = r/2,

ˆm 2v(xi+r/2) = ˆm 2v+1(xi+r/2), v = 1, 2, . . . , q 1

where [x] denotes the greatest integer less than or equal to x.

Dai, Tong and Genton

This means that, when we employ a symmetric form for our estimator, the optimal sequence is the same for q = 2v and q = 2v +1. In other words, the symmetric form further reduces the order of estimation bias without any increase in the estimation variance. Hence, it is natural to use the symmetric form (r is even and l = r/2) for the interior points, {xi : 1 + r/2 i n r/2}, when the design points are equidistant. Also, we can show that the two existing estimators for the ﬁrst-order derivative (De Brabanter et al., 2013; Wang and Lin, 2015) are special cases of our method. When q = 2 or q = 3, we get the same sequence as

(dk)1,2 = (dk)1,3 = 6n(2k r) r(r + 1)(r + 2), k = 0, . . . , r.

This results in the empirical estimator in De Brabanter et al. (2013), denoted by ˆm emp. Assuming the regression function has a ﬁnite third-order derivative on [0, 1], the estimation variance and bias are respectively

var[ ˆm 2(xi+r/2)] = 12n2σ2

r(r + 1)(r + 2) and bias[ ˆm 2(xi+r/2)] = r2

40n2 m(3)(xi+r/2) + o r2

When q = 4 or q = 5, we get the same sequence as

(dk)1,4 = (dk)1,5 = n h I(r/2) 6 (k r

2) I(r/2) 4 (k r

I(r/2) 2 I(r/2) 6 I(r/2) 4 2 , k = 0, . . . , r.

This results in the least squares estimator in Wang and Lin (2015), denoted by ˆm lse. Within our framework, it is clear that the least squares estimator can be regarded as a bias-reduction modiﬁcation of the empirical estimator. Figure 1 presents an example of m q(xi) with diﬀerent levels of control for the estimation bias (q = 3 , 5 and 7). We follow the regression function m(x) = p

x(1 x) sin{2.1π/(x + 0.05)} for model (1) from De Brabanter et al. (2013) and Wang and Lin (2015). Five hundred design points are equidistant on [0.25, 1] and the random errors are generated from a Gaussian distribution, N(0, 0.12). Sequence orders are chosen as {50, 100}. We observe that the estimation curves are smoother for smaller q, and the bias in oscillating areas decreases signiﬁcantly for larger q. These results are consistent with our theoretical results. With various levels of bias control, we may achieve a better compromise in the trade-oﬀ between the estimation variance and bias.

2.3 Behavior on the boundaries

If we use a sequence with order r, then the boundary region will be {xi : 1 i [r/2] or n [r/2] + 1 i n.}. Within our framework, we have two types of estimators for estimating derivatives for the boundary area. One choice is to use a sequence with smaller order, so that we can still use the symmetric estimator as suggested for the interior points. This solution is also suggested by both De Brabanter et al. (2013) and Wang and Lin (2015). The other is to hold the sequence order while using an asymmetric form of the estimator instead.

Optimal Estimation of Derivatives in Nonparametric Regression

0.4 0.6 0.8 1.0

20 10 0 10 20 30

First Derivative

0.4 0.6 0.8 1.0

20 10 0 10 20 30

First Derivative

Figure 1: First-order derivative estimates with diﬀerent levels of bias reduction. Red lines (dotted): q = 3; green lines (long dash): q = 5; blue lines (dot dash): q = 7 and black lines (solid): the true ﬁrst-order derivative.

For the symmetric estimator, we can choose an even-order t satisfying 1 t/2 min (i 1, n i, [r/2]). By Theorem 1, we have

var[ ˆm q(xi)] = O n2

and bias[ ˆm q(xi)] = O tq 1

for 2 i [r/2] or n [r/2] + 1 i n 1. The closer xi locates to the endpoints, the smaller the largest order of the chosen sequence, which means that the information we can incorporate into the estimator becomes very limited. As a consequence, the estimation variance will eventually reach an order of O(n2), which is rather noisy. The asymmetric estimator does not require the estimated point to be located at the middle of the interval. We can still use a relatively large sequence order to include as much information as included in the interior points. The theoretical results were provided in Theorem 1:

var[ ˆm q(xi)] = O n2

and bias[ ˆm q(xi)] = O rq 1

With a proper choice of r, we can still get a consistent estimate for the derivatives at the boundary region. Another advantage of this asymmetric form is that it is applicable to all the boundary points including x1 and xn, which can never be handled by the symmetricform estimators. It is noteworthy that Wang and Lin (2015) also proposed left-side and right-side weighted least squares estimators for the boundary points. Their estimators are, however, two special

Dai, Tong and Genton

cases of our asymmetric estimator with q = 2 and l = 0 (right-side) or l = r (left-side). The estimation bias for ˆm 2(xi+l) is

bias[ ˆm 2(xi+l)] = r 2l

2n m (xi+l) + o r

To minimize the estimation bias on these boundary points, we recommend the following criterion:

ˆm 2(xi) = DY1 1 i [r/2], DYn r n [r/2] + 1 i n.

Then, the smallest absolute estimation bias can be simply derived as

|E[ ˆm 2(xi)] m (xi)| = r 2min(i 1, n i)

2n |m (xi)| + o r

In summary, the asymmetric estimator generates a smaller variance, while its estimation bias is of a higher order. Consequently, the sequence order should be selected to achieve the best trade-oﬀbetween the estimation variance and bias. In view of this, we recommend using the asymmetric estimator when the regression function is ﬂat at the boundary region or when σ2 is large; otherwise, the symmetric form should be employed.

3. Higher-order derivative estimation

In this section, we extend our method and propose higher-order derivative estimators for model (1). We further demonstrate that the new estimators possess the optimal estimation variance, which is not achieved by the two aforementioned methods (De Brabanter et al., 2013; Wang and Lin, 2015). Our new estimators also achieve the optimal convergence rate for MSE.

3.1 Theoretical results

To deﬁne an estimator for m(p)(xi+l) with a bias-reduction level up to m(q)(xi+l), we construct the new conditions on the coeﬃcients as

Cp,l = 1 and Cj,l = 0, 0 j = p q 1. (4)

Then, the optimal sequence can be derived as the solution(s) of the following optimization problem:

(d0, . . . , dr)p,q = argmin (d0,...,dr) Rr+1

k=0 d2 k, s.t. condition (4) holds.

We present the result for (d0, . . . , dr)p,q in the following proposition.

Proposition 2 Assume that model (1) holds with equidistant design and m(x) has a ﬁnite qth-order derivative on [0, 1]. For 1 i n r and 0 l r, the unique variance minimizing sequence is

(dk)p,q = p!np q 1 X

j=0 V (l) (j+1,p+1)(k l)j, k = 0, . . . , r,

Optimal Estimation of Derivatives in Nonparametric Regression

for estimating m(p)(xi+l) with an order of accuracy up to m(q)(xi+l), q p + 1. Proof: see Appendix C.

To extend the result to equidistant designs on an arbitrary interval, [a, b] R, we can simply use (dk)p,q/(b a)p instead. We treat the DYi built on (d0, . . . , dr)p,q as the estimator for the pth-order derivative with a bias-reduction level up to m(q)(xi+l), denoted as ˆm(p) q (xi+l).

Theorem 2 Assume that model (1) holds with equidistant design, m(x) has a ﬁnite qthorder derivative on [0, 1] and r = o(n), r . For 1 i n r and 0 l r, we have

var[ ˆm(p) q (xi+l)] = (p!)2n2p V (l) (p+1,p+1)σ2 = O n2p

bias[ ˆm(p) q (xi+l)] = p! q!nq p

j=0 V (l) (j+1,p+1)I(l) j+qm(q)(xi+l) + o rq p

Proof: see Appendix D.

For a ﬁxed p and an increasing q, we can reduce the estimation bias to a lower order while keeping the order of variance unchanged. Whenever we keep the diﬀerence between q and p constant, the convergence rate of the bias is preserved for diﬀerent p. When r is an even number and l = r/2, we can derive that (dk)p,q = ( 1)p(dn k)p,q. Consequently in this case, the optimal sequence remains the same when we increase q from p + 2ν 1 to p + 2ν, ν = 1, 2, . . . , which means ˆm(p) p+2ν 1(xi+r/2) = ˆm(p) p+2ν(xi+r/2). Hence, for this kind of estimator, the symmetric form is also the most favorable choice for the interior points. The optimal MSE of our estimator is of order O(n 2(q p)/(2q+1)), which achieves the asymptotically optimal rate established by Stone (1980). For comparison, we note that the optimal MSE of the empirical estimator in De Brabanter et al. (2013) is of order O(n 4/(2p+5)), that is, their estimator is of the optimal order only when q = p + 2. While for the least squares estimator in Wang and Lin (2015), they provided asymptotic results only for the ﬁrstand second-order derivative estimator. Their optimal MSE is of order O(n 8/11) for p = 1 and O(n 8/13) for p = 2, which corresponds with two special cases, i.e., when (p, q) = (1, 5) or (2, 6). From this point of view, our method has greatly improved the literature in derivative estimation and it achieves the optimal rate of MSE for any (p, q) from Theorem 2. As mentioned at the beginning of this section, the newly deﬁned estimator is optimal for the estimation variance, which is superior to existing estimators. In what follows, we illustrate this advantage in detail with the second-order derivative estimator, which is usually of greatest interest after the ﬁrst-order derivative in practice. A similar analysis can be made for other higher-order derivative estimators. For the estimator without biascontrol, e.g. ˆm 4, we derive the following results:

var[ ˆm 4(xi+r/2)] = 4n4σ2I(r/2) 0

I(r/2) 0 I(r/2) 4 I(r/2) 2 2 ,

bias[ ˆm 4(xi+r/2)] = r2

14n2 m(4)(xi+r/2) + o r2

Dai, Tong and Genton

50 100 150 200

1.5 1.6 1.7 1.8 1.9 2.0

50 100 150 200

1.00 1.05 1.10 1.15

Figure 2: The ratio of estimation variance is plotted against the sequence order, r. Setting: n = 500 and r is chosen as an even integer ranging from 20 to 200. (a), var( ˆm emp)/var( ˆm 4); (b), var( ˆm lse)/var( ˆm 6).

The corresponding method is ˆm emp in De Brabanter et al. (2013) with regard to the accurate level. Instead of minimizing the estimation variance, they intuitively choose the weight sequences for higher-order derivative estimators, which makes it quite diﬃcult to derive analytical asymptotic results. Hence, we make a ﬁnite sample comparison of the variance of the two estimators. We set n = 500 and calculate the corresponding sequences for ˆm 4 with an even order r ranging from 20 to 200 and l = r/2. For ˆm emp, we choose (k1, k2), which achieves the smallest estimation variance from {(k1, k2) : k1 k2, k1 + k2 = r/2}. We do not need a speciﬁed form of the regression function, since it is not related with the estimation variance. We illustrate the ratio, var( ˆm emp)/var( ˆm 4), in the left panel of Figure 2. Obviously, the new estimator improves the estimation variance signiﬁcantly, which results in a smaller MSE for smooth regression functions.

A similar comparison is carried out between ˆm lse and ˆm 6 under the same settings, and the ratio of var( ˆm lse)/var( ˆm 6) is presented in the right panel of Figure 2. Wang and Lin (2015) built a linear model with correlated regressors but employed the weighted least squares regression, rather than the generalized least squares technique, to derive the estimator. It can be shown that our method is equivalent with the generalized least squares estimator for their model. As expected, we ﬁnd that our proposed estimator performs slightly better in terms of the ﬁnite sample than the least squares estimator. In addition their asymptotic variances and biases are equivalent for the ﬁrst-order term. For the boundary points, our second-order estimator also maintains the same advantages over the existing estimators as discussed in Section 2.2 for the ﬁrst-order estimators.

Optimal Estimation of Derivatives in Nonparametric Regression

3.2 Tuning parameter selection

As shown in Figure 1, the order, r, and the bias-reduction level, q, are both critical to the proposed estimators. For practical implementation, (r, q) should be chosen to achieve a better trade-oﬀbetween the estimation variance and bias. By Theorem 2, the approximated MSE of ˆm(p) q (xi+l) is

MSE[ ˆm(p) q (xi+l)] (p!)2n2p V (l) (p+1,p+1)σ2 +

j=0 V (l) (j+1,p+1)I(l) j+qm(q)(xi+l)

We deﬁne the averaged mean squared error (AMSE) as a measure of the goodness of ﬁt for all the design points,

AMSE( ˆm(p) q ) = 1

i=1 MSE[ ˆm(p) q (xi)].

A uniform sequence is preferred for the estimate at most points (all the interior points for example) over diﬀerent sequences for each design point. Hence, we can choose the parameters (r, q) minimizing the AMSE. To achieve this, we replace the unknown quantities, σ2 and m(q)(xi+l), with their consistent estimates. The error variance can be estimated by the method in Tong and Wang (2005) and Tong et al. (2013) and m(q)(xi) can be estimated by the local polynomial regression of order q + 2. For the high-order derivatives at the boundary points, we recommend replacing the AMSE for all the points with the following adjusted form:

AMSEadj( ˆm(p) q ) = 1 n r

i=1+r/2 MSE[ ˆm(p) q (xi)] B1σ2 + B2 n r

i=1+r/2 [m(q)(xi)]2, (5)

where B1 = (p!)2n2p V (r/2) (p+1,p+1) and B2 = h p! q!nq p Pq 1 j=0 V (r/2) (j+1,p+1)I(r/2) j+q i2 . Given all the parameters for a speciﬁc problem, B1 and B2 are available quantities. The adjusted AMSE includes only derivatives at the interior points that share the identical diﬀerence sequence for an even r and l = r/2. Another advantage is that we only need V (r/2) and I(r/2) j+q instead

of V (l) and I(l) j+q for l = 0, . . . , r, which greatly reduces the computation time. For the tuning parameter space of the sequence order, we recommend r O = {2i : 1 i k0}, where k0 < [n/4], to keep a symmetric form (l = r/2) for the interior points and to make sure that the number of boundary points will be less than that of the interior points. For the bias-reduction level of ˆm(p) q , we consider q Q = {p + 2ν : ν = 1, 2, . . . , ν0}, where p + 2ν0 is the highest level chosen by users. Only even diﬀerences are considered for q p, since ˆm(p) p+2ν0 1 = ˆm(p) p+2ν0 when we use the recommended symmetric form.

4. Simulation study

In this section, we conduct simulation studies to assess the ﬁnite sample performance of the proposed estimators, ˆm(p) q , and make comparisons with the empirical estimator, ˆm(p) emp,

Dai, Tong and Genton

in De Brabanter et al. (2013) and the least squares estimator, ˆm(p) lse , in Wang and Lin (2015). We apply the three methods to both interior (Int) and boundary (Bd) areas, where Int = {xi : k0 + 1 i n k0} and Bd = {xi : 1 i k0 or n k0 + 1 i n}. Throughout the simulation, we set k0 = [n/10], which means that we treat ten percent of the design points on both sides of the interval as boundary points. We also tried some other proportions and the results were similar. For the interior part, we keep the symmetric form for ˆm(p) q by setting r as an even number and l = r/2, as suggested in the theoretical results. For the boundary part, we apply the following criterion for the proposed estimators:

ˆm(p) q (xi) = DY1 1 i [r/2], DYn r n [r/2] + 1 i n.

The modiﬁed version of ˆm(p) emp in De Brabanter et al. (2013) and the one-side weighted least squares estimators in Wang and Lin (2015) are investigated for the empirical and least squares estimators, respectively on the boundary points. We consider estimators for both ﬁrstand second-order derivatives, which are of most interest in practice. Similar to De Brabanter et al. (2013) and Wang and Lin (2015), the mean absolute error (MAE) is used as a measure of estimation accuracy. It is deﬁned as follows:

xi A | ˆm(p)(xi) m(p)(xi)|,

where A = Int or Bd and #A denotes the number of elements in set A. We consider the following regression function,

m(x) = 5 sin(wπx),

with ω = 1, 2, 4 corresponding to diﬀerent levels of oscillations. The n = 100 and 500 sample sizes are investigated. We set the design points as xi = i/n and generate the random errors, εi, independently from N(0, σ2). For each regression function, we consider σ = 0.1, 0.5 and 2 to capture the small, moderate and large variances, respectively. In total, we have 18 combinations of simulation settings. Following the deﬁnitions of Int and Bd, we select the sequence order r from O = {2i : 1 i k0}. We choose the bias-reduction level, q, from Q = {p + 2, p + 4, p + 6}, with q = p + 2 and q = p + 4 corresponding to ˆm(p) emp and ˆm(p) lse , respectively, and q = p + 6 as an even higher level. We denote by ˆm(p) opt the estimator with the selected tuning parameters. For ˆm(p) emp and ˆm(p) lse , the parameter k is chosen from {i : 1 i k0}. We investigate two scenarios (for the tuning parameters selection criterion): oracle and plug-in (see below). For each run of the simulation, we compute the MAE of the estimators at both Int and Bd and repeat the procedure 1000 times for each setting. The simulation results for w = 2 are reported as box-plot ﬁgures. Other results are provided in the supplementary materials.

Oracle parameters

Oracle parameters are selected by assuming that we know the true regression (derivative) function, the purpose of which is to illustrate the possible best performance of each

Optimal Estimation of Derivatives in Nonparametric Regression

estimator. Speciﬁcally for ˆm(p) q , the pair of tuning parameters is chosen as

(r, q)opt = argmin r O,q Q

MAE( ˆm(p) q ) .

The bandwidths of ˆm(p) emp and ˆm(p) lse are selected through a similar procedure. For the ﬁrst-order derivative, we investigate ˆm opt, ˆm emp and ˆm lse and report the simulation results in Figure 3. On the interior points, ˆm opt always possesses the same MAE as the smaller one of ˆm emp and ˆm lse , due to the fact that ˆm emp and ˆm lse are two special cases of ˆm q in this area. On the boundary points, ˆm opt is uniformly better than the other two methods. To further explore the reason for the boundary behavior, we use an example from De Brabanter et al. (2013) and Wang and Lin (2015). The ﬁtted results for the three estimators are illustrated in Figure 4, where the red points represent the boundary parts. The empirical estimator suﬀers a lot from the increasing variance when the estimated points get close to the endpoints of the interval. The least squares estimator simply estimates the boundary parts by shifting the estimates of the interior points nearby, which results in very serious estimation bias. Our estimator ﬁts the boundary points very well, resulting from the ﬂexibility brought by the parameter l, the relative location of the estimated point within the interval [xi, xi+r]. For the second-order derivative, we include another two estimators, ˆm 4 and ˆm 6, which have the same bias-reduction level as ˆm emp and ˆm lse, respectively. The sequence order, r, of the two additional estimators is optimally chosen by minimizing MAE as well. The simulation results are presented in Figure 5. The relationships between ˆm opt, ˆm emp and ˆm lse remain the same as those observed for the ﬁrst-order derivative. We also observe that MAE( ˆm 4) is signiﬁcantly smaller than MAE( ˆm emp) and that MAE( ˆm 6) is almost the same as MAE( ˆm lse), consistent with our theoretical results in Section 3.

Plug-in parameters

Plug-in parameters are chosen via minimizing the adjusted AMSE in (5) after replacing all the unknown quantities with their consistent estimates. In this simulation, we estimate σ2 using Tong and Wang s (2005) method with the recommended bandwidth [n1/3]. Here, ˆm(q)(xi) (1 + k0 i n k0) are calculated with the function locpol in the R package locpol (Ojeda Cabrera, 2012) with the parameter deg = q + 2. The bandwidths of ˆm(p) emp and ˆm(p) lse are selected accordingly. We report the simulation results together with those for the oracle parameters in Figures 6 and 7. From the comparison, we observe that the plug-in parameters lead to quite similar results with those for the oracle parameters, especially on the interior points. Since the tuning parameters are selected based on AMSE of derivative estimates for the interior points, the performance on the boundary is not consistent. Nevertheless, the mutual relationship of the three estimators remains the same for most cases on both interior and boundary points. Overall, the proposed plug-in method is quite eﬀective for choosing the optimal tuning parameters. In summary, we have demonstrated the superiority of the proposed estimators over the existing estimators through extensive simulation studies. We have further provided an eﬀective criterion for selection of the tuning parameters for the newly deﬁned estimator.

Dai, Tong and Genton

emp lse opt

Interior (n = 100 w = 2 σ = 2)

emp lse opt

Interior (n = 100 w = 2 σ = 0.5)

emp lse opt

0.4 0.6 0.8 1.0

Interior (n = 100 w = 2 σ = 0.1)

emp lse opt

Interior (n = 500 w = 2 σ = 2)

emp lse opt

1.0 1.5 2.0

Interior (n = 500 w = 2 σ = 0.5)

emp lse opt

0.20 0.30 0.40

Interior (n = 500 w = 2 σ = 0.1)

emp lse opt

0 10 20 30 40 50

Boundary (n = 100 w = 2 σ = 2)

emp lse opt

0 2 4 6 8 10 12 14

Boundary (n = 100 w = 2 σ = 0.5)

emp lse opt

0.5 1.0 1.5 2.0 2.5 3.0 3.5

Boundary (n = 100 w = 2 σ = 0.1)

emp lse opt

0 10 20 30 40 50

Boundary (n = 500 w = 2 σ = 2)

emp lse opt

0 2 4 6 8 10 12

Boundary (n = 500 w = 2 σ = 0.5)

emp lse opt

0.5 1.0 1.5 2.0 2.5

Boundary (n = 500 w = 2 σ = 0.1)

Figure 3: Mean absolute errors of three ﬁrst-order derivative estimators on both interior (2 top rows) and boundary (2 bottom rows) points for various settings. ˆm emp, yellow box; ˆm lse, green box; ˆm opt, red box. m(x) = 5 sin(2πx) and ε N(0, σ2).

Optimal Estimation of Derivatives in Nonparametric Regression

0.0 0.2 0.4 0.6 0.8 1.0

60 40 20 0 20

estimated v.s. ture value (k=49)

First order derivative

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,

, ,, ,,,,,, ,,,,,,,, ,,

0.0 0.2 0.4 0.6 0.8 1.0

60 40 20 0 20

estimated v.s. ture value (k=18)

First order derivative

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,

0.0 0.2 0.4 0.6 0.8 1.0

60 40 20 0 20

estimated v.s. ture value (r=170)

First order derivative

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Figure 4: The ﬁtted point-wise derivatives by the three estimators using oracle tuning parameters. m(x) = 32e 8(1 2x)2(1 2x), εi are independent random errors from N(0, 0.12) and n = 500. Interior points: green points. Boundary points: red points.

5. Conclusion

We proposed a new framework for estimating derivatives without ﬁtting the regression function. Unlike most existing methods using the symmetric diﬀerence quotients, our method is constructed as a linear combination of the observations. It is hence very ﬂexible and applicable to both interior and boundary points. We obtained the variance-minimizing estimators for the ﬁrstand higher-order derivatives with a ﬁxed bias-reduction level. Under the equidistant design, we derived some theoretical results for the proposed estimators including the optimal sequence, asymptotic variance and bias, point-wise consistency, and boundary behavior. We illustrated that the order of the estimation bias can be reduced while the order of variance remains unchanged. We showed that our method achieves the optimal convergence rate for the MSE. Furthermore, we provided an eﬀective selection procedure for the tuning parameters of the proposed estimators. Simulation studies for the ﬁrstand second-order derivative estimators demonstrated the superiority of our proposed method.

The method can be readily extended to unequally spaced designs. In this case, the symmetric form is no longer valid and the choice of l also deserves further consideration. To estimate the point-wise derivatives for unequally spaced designs, we can ﬁrst ﬁnd the r nearest neighbors of the estimated point and construct the variance-minimizing estimator with the linear combination of the r + 1 points, say xi < < xi+l < < xi+r. Assuming that m(x) is smooth enough and that xi+l is the estimated point, we have the expectation of DYi as

E(DYi) = m(xi+l)

j=1 m(j)(xi+l)

k=0 dk(xi+k xi+l)j/j!.

Dai, Tong and Genton

emp lse opt1 opt2 opt

200 400 600 800 1000

Interior (n = 100 w = 2 σ = 2)

emp lse opt1 opt2 opt

50 100 150 200 250 300

Interior (n = 100 w = 2 σ = 0.5)

emp lse opt1 opt2 opt

10 20 30 40 50 60

Interior (n = 100 w = 2 σ = 0.1)

emp lse opt1 opt2 opt

100 200 300 400 500 600

Interior (n = 500 w = 2 σ = 2)

emp lse opt1 opt2 opt

20 40 60 80 100 140

Interior (n = 500 w = 2 σ = 0.5)

emp lse opt1 opt2 opt

5 10 15 20 25

Interior (n = 500 w = 2 σ = 0.1)

emp lse opt

0 1000 2000 3000 4000

Boundary (n = 100 w = 2 σ = 2)

emp lse opt

0 200 400 600 800 1000

Boundary (n = 100 w = 2 σ = 0.5)

emp lse opt

50 100 150 200

Boundary (n = 100 w = 2 σ = 0.1)

emp lse opt

0 5000 10000 15000

Boundary (n = 500 w = 2 σ = 2)

emp lse opt

0 1000 2000 3000 4000

Boundary (n = 500 w = 2 σ = 0.5)

emp lse opt

0 200 400 600 800

Boundary (n = 500 w = 2 σ = 0.1)

Figure 5: Mean absolute errors of three second-order derivative estimators on both interior (2 top rows) and boundary (2 bottom rows) points for various settings. ˆm emp, yellow box; ˆm lse, green box; ˆm q, red box. opt1: ˆm 4; opt2: ˆm 6; opt: ˆm opt. m(x) = 5 sin(2πx) and ε N(0, σ2).

Optimal Estimation of Derivatives in Nonparametric Regression

emp_s emp lse_s lse opt_s opt

Interior (n = 100 w = 2 σ = 2)

emp_s emp lse_s lse opt_s opt

Interior (n = 100 w = 2 σ = 0.5)

emp_s emp lse_s lse opt_s opt

0.4 0.6 0.8 1.0

Interior (n = 100 w = 2 σ = 0.1)

emp_s emp lse_s lse opt_s opt

Interior (n = 500 w = 2 σ = 2)

emp_s emp lse_s lse opt_s opt

1.0 1.5 2.0

Interior (n = 500 w = 2 σ = 0.5)

emp_s emp lse_s lse opt_s opt

0.20 0.30 0.40 0.50

Interior (n = 500 w = 2 σ = 0.1)

emp_s emp lse_s lse opt_s opt

0 10 20 30 40 50

Boundary (n = 100 w = 2 σ = 2)

emp_s emp lse_s lse opt_s opt

Boundary (n = 100 w = 2 σ = 0.5)

emp_s emp lse_s lse opt_s opt

0 2 4 6 8 10 12

Boundary (n = 100 w = 2 σ = 0.1)

emp_s emp lse_s lse opt_s opt

0 10 20 30 40 50

Boundary (n = 500 w = 2 σ = 2)

emp_s emp lse_s lse opt_s opt

Boundary (n = 500 w = 2 σ = 0.5)

emp_s emp lse_s lse opt_s opt

0 2 4 6 8 10 12

Boundary (n = 500 w = 2 σ = 0.1)

Figure 6: Comparison of the mean absolute errors on both interior (2 top rows) and boundary (2 bottom rows) points between the ﬁrst-order derivative estimators with oracle tuning parameters and those with plug-in tuning parameters. ˆm emp, yellow box; ˆm lse, green box; ˆm opt, red box. s denotes estimators using plug-in parameters. m(x) = 5 sin(2πx) and ε N(0, σ2).

Dai, Tong and Genton

emp_s emp lse_s lse opt_s opt

200 400 600 800 1000

Interior (n = 100 w = 2 σ = 2)

emp_s emp lse_s lse opt_s opt

50 100 150 200 250 300

Interior (n = 100 w = 2 σ = 0.5)

emp_s emp lse_s lse opt_s opt

10 20 30 40 50

Interior (n = 100 w = 2 σ = 0.1)

emp_s emp lse_s lse opt_s opt

100 200 300 400 500 600

Interior (n = 500 w = 2 σ = 2)

emp_s emp lse_s lse opt_s opt

20 40 60 80 100 140

Interior (n = 500 w = 2 σ = 0.5)

emp_s emp lse_s lse opt_s opt

5 10 15 20 25

Interior (n = 500 w = 2 σ = 0.1)

emp_s emp lse_s lse opt_s opt

0 1000 2000 3000 4000

Boundary (n = 100 w = 2 σ = 2)

emp_s emp lse_s lse opt_s opt

0 200 400 600 800 1000

Boundary (n = 100 w = 2 σ = 0.5)

emp_s emp lse_s lse opt_s opt

50 100 150 200

Boundary (n = 100 w = 2 σ = 0.1)

emp_s emp lse_s lse opt_s opt

0 5000 10000 15000

Boundary (n = 500 w = 2 σ = 2)

emp_s emp lse_s lse opt_s opt

0 1000 2000 3000 4000

Boundary (n = 500 w = 2 σ = 0.5)

emp_s emp lse_s lse opt_s opt

0 200 400 600 800

Boundary (n = 500 w = 2 σ = 0.1)

Figure 7: Comparison of the mean absolute errors on both interior (2 top rows) and boundary (2 bottom rows) points between the second-order derivative estimators with oracle tuning parameters and those with plug-in tuning parameters. ˆm emp, yellow box; ˆm lse, green box; ˆm opt, red box. s denotes estimators using plug-in parameters. m(x) = 5 sin(2πx) and ε N(0, σ2).

Optimal Estimation of Derivatives in Nonparametric Regression

The optimal sequence for estimating m(p)(xi+l) with a bias-reduction level q can be decided by solving the following optimization problem:

(d0, . . . , dr)p,q = argmin (d0,...,dr) Rr+1

k=0 dk (xi+k xi+l)j

j! = 0, j = 0, . . . , p 1, p + 1, . . . , q 1,

k=0 dk (xi+k xi+l)p

The optimal diﬀerence sequences are adaptively chosen for each estimated point and they are no longer identical for all the interior design points. As a result, the parameter selection becomes more challenging and we leave this for future research. Finally, other models worthy of investigation include, for example, random design models (De Brabanter and Liu, 2015) and multivariate models (Charnigo et al., 2015; Charnigo and Srinivasan, 2015).

Acknowledgments

The authors thank the editor, the associate editor and the two referees for their constructive comments that led to a substantial improvement of the paper. The work of Wenlin Dai and Marc G. Genton was supported by King Abdullah University of Science and Technology (KAUST). Tiejun Tong s research was supported in part by Hong Kong Baptist University FRG grants FRG1/14-15/044, FRG2/15-16/038, FRG2/15-16/019 and FRG2/14-15/084.

Appendix A. Proof of Proposition 1

To ﬁnd the optimal sequence for estimating the ﬁrst-order derivative with qth-order accuracy, we solve the following optimization problem:

(d0, . . . , dr)1,q = argmin (d0,...,dr) Rr+1

k=0 d2 k, s.t. condition (3) holds.

It is easy to check that condition (3) is equivalent to

k=0 dk(k l) = n and

k=0 dk(k l)j = 0, 0 j = 1 q 1.

To apply the Lagrange multipliers method to ﬁnd the optimal sequence, we transform the above problem in the following unconstrained optimization problem:

f(d0, . . . , dr, λ0, . . . , λq 1) =

k=0 d2 k + λ0C0 +

k=0 dk(k l)j + λ1

k=0 dk(k l) n

Dai, Tong and Genton

Taking the partial derivative of f with respect to each parameter and setting it to zero, we have

f dk = 2dk + λ0 +

j=1 λj(k l)j = 0, k = 0, . . . , r, (6)

k=0 dk(k l)j = 0, 0 j = 1 q 1,

k=0 dk(k l) = n.

We further make the following transformation:

k=0 (k l)i f

k=0 dk(k l)i + λ0

k=0 (k l)i +

k=0 (k l)i+j

= I(l) i λ0 +

j=1 I(l) i+jλj = 0, 0 i = 1 q 1.

k=0 (k l) f

k=0 dk(k l) + λ0

k=0 (k l) +

k=0 (k l)1+j

= 2n + I(l) 1 λ0 +

j=1 I(l) 1+jλj = 0,

where I(l) i = Pr k=0(k l)i for i = 1, 2, . . . . These results can be expressed as a matrix equation,

U(l)(λ0, . . . , λq 1) = 2nϵ2,

where U(l) is a q q matrix with u(l) ij = I(l) i+j 2 and ϵ2 is a q 1 vector with the second element equal to 1 and the others equal to zero. Noting that U(l) is an invertible matrix, we have

(λ0, . . . , λq 1) = 2n V (l) (.,2),

where V (l) = (U(l)) 1 and V (l) (.,2) denotes the second column of V (l). This leads to λj =

2p!np V (l) (j+1,2) for j = 0, . . . , q 1. Combining this result with (6), we get

(dk)1,q = n

j=0 V (l) (j+1,2)(k l)j, k = 0, . . . , r.

This completes the proof of Proposition 1.

Appendix B. Proof of Theorem 1

Optimal Estimation of Derivatives in Nonparametric Regression

We can easily derive that

var[ ˆm q(xi+l)] = σ2 r X

k=0 d2 k = σ2 r X

j=0 V (l) (j+1,2)(k l)j

j=0 V (l) (j+1,2)

k=0 dk(k l)j = σ2n V (l) (2,2)

k=0 dk(k l)

= σ2n2V (l) (2,2),

bias[ ˆm q(xi+l)] = Cq,lm(q)(xi+l) + o(rq 1/nq 1),

k=0 dk (k l)q

nqq! = n q!nq

j=0 V (l) (j+1,2)(k l)j

j=0 V (l) (j+1,2)I(l) j+q = O(rq 1/nq 1).

This completes the proof of Theorem 1.

Appendix C. Proof of Proposition 2

To ﬁnd the optimal sequence for estimating the pth-order derivative with qth-order accuracy, we solve the following optimization problem:

(d0, . . . , dr)p,q = argmin (d0,...,dr) Rr+1

k=0 d2 k, s.t. condition (4) holds.

It is easy to check that condition (4) is equivalent to

k=0 dk(k l)p = p!np and

k=0 dk(k l)j = 0, 0 j = p q 1.

To apply the Lagrange multipliers method to ﬁnd the optimal sequence, we transform the above problem in the following unconstrained optimization problem:

f(d0, . . . , dr, λ0, . . . , λq 1) =

k=0 d2 k + λ0C0 +

k=0 dk(k l)j

k=0 dk(k l)p p!np #

Dai, Tong and Genton

Taking the partial derivative of f with respect to each parameter and setting it to zero, we have

f dk = 2dk + λ0 +

j=1 λj(k l)j = 0, k = 0, . . . , r, (7)

k=0 dk(k l)j = 0, 0 j = p q 1,

k=0 dk(k l)p = p!np.

We further make the following transformation:

k=0 (k l)i f

k=0 dk(k l)i + λ0

k=0 (k l)i +

k=0 (k l)i+j

= I(l) i λ0 +

j=1 I(l) i+jλj = 0, 0 i = p q 1.

k=0 (k l)p f

k=0 dk(k l)p + λ0

k=0 (k l)p +

k=0 (k l)p+j

= 2p!np + I(l) p λ0 +

j=1 I(l) p+jλj = 0,

where I(l) i = Pr k=0(k l)i for i = 1, 2, . . . . These results can be expressed as a matrix equation,

U(l)(λ0, . . . , λq 1) = 2p!npϵp+1,

where U(l) is a q q matrix with u(l) ij = I(l) i+j 2 and ϵp+1 is a q 1 vector with the (p + 1)th element equal to 1 and the others equal to zero. Noting that U(l) is an invertible matrix, we have

(λ0, . . . , λq 1) = 2p!np V (l) (.,p+1),

where V (l) = (U(l)) 1 and V (l) (.,p+1) denotes the (p + 1)th column of V (l). This leads to

λj = 2p!np V (l) (j+1,p+1) for j = 0, . . . , q 1. Combining this result with (7), we get

(dk)p,q = p!np q 1 X

j=0 V (l) (j+1,p+1)(k l)j, k = 0, . . . , r.

This completes the proof of Proposition 2.

Appendix D. Proof of Theorem 2

Optimal Estimation of Derivatives in Nonparametric Regression

We can easily derive that

var[ ˆm(p) q (xi+l)] = σ2 r X

k=0 d2 k = σ2 r X

j=0 V (l) (j+1,p+1)(k l)j

= σ2p!np q 1 X

j=0 V (l) (j+1,p+1)

k=0 dk(k l)j = σ2p!np V (l) (p+1,p+1)

k=0 dk(k l)p

= σ2(p!)2n2p V (l) (p+1,p+1),

bias[ ˆm(p) q (xi+l)] = Cq,lm(q)(xi+l) + o(rq p/nq p),

k=0 dk (k l)q

nqq! = p!np

j=0 V (l) (j+1,p+1)(k l)j

= p! q!nq p

j=0 V (l) (j+1,p+1)I(l) j+q = O(rq p/nq p).

This completes the proof of Theorem 2.

Graciela Boente and Daniela Rodriguez. Robust estimators of high order derivatives of regression functions. Statistics and Probability Letters, 76(13):1335 1344, 2006.

Guanqun Cao. Simultaneous conﬁdence bands for derivatives of dependent functional data. Electronic Journal of Statistics, 8(2):2639 2663, 2014.

Richard Charnigo and Cidambi Srinivasan. A multivariate generalized Cp and surface estimation. Biostatistics, 16(2):311 325, 2015.

Richard Charnigo, Benjamin Hall, and Cidambi Srinivasan. A generalized Cp criterion for derivative estimation. Technometrics, 53(3):238 253, 2011.

Richard Charnigo, Limin Feng, and Cidambi Srinivasan. Nonparametric and semiparametric compound estimation in multiple covariates. Journal of Multivariate Analysis, 141: 179 196, 2015.

Kris De Brabanter and Yu Liu. Smoothed nonparametric derivative estimation based on weighted diﬀerence sequences, Stochastic Models, Statistics and Their Applications, A. Steland and E. Rafaj lowicz and K. Szajowski (Eds.), Chapter 4 (31-38). Springer, 2015.

Kris De Brabanter, Jos De Brabanter, Bart De Moor, and Ir ene Gijbels. Derivative estimation with local polynomial ﬁtting. Journal of Machine Learning Research, 14(1):281 301, 2013.

Dai, Tong and Genton

Randall L. Eubank and Paul L. Speckman. Conﬁdence bands in nonparametric regression. Journal of the American Statistical Association, 88(424):1287 1301, 1993.

Jianqing Fan and Ir ene Gijbels. Data-driven bandwidth selection in local polynomial ﬁtting: variable bandwidth and spatial adaptation. Journal of the Royal Statistical Society Series B, 57:371 394, 1995.

Jianqing Fan and Ir ene Gijbels. Local Polynomial Modelling and Its Applications. CRC Press, 1996.

Theo Gasser and Hans G. M uller. Estimating regression functions and their derivatives by the kernel method. Scandinavian Journal of Statistics, 11:171 185, 1984.

Wolfgang H ardle. Applied Nonparametric Regression. Cambridge University Press, 1990.

Tae Y. Kim, Byeong U. Park, Myung S. Moon, and Chiho Kim. Using bimodal kernel for inference in nonparametric regression with correlated errors. Journal of Multivariate Analysis, 100:1487 1497, 2009.

Soumendra Nath Lahiri. Resampling Methods for Dependent Data. Springer, 2003.

Hans G. M uller, Ulrich Stadtm uller, and Thomas Schmitt. Bandwidth choice and conﬁdence intervals for derivatives of noisy data. Biometrika, 74:743 749, 1987.

Jorge L. Ojeda Cabrera. locpol: Kernel local polynomial regression. URL http://mirrors.ustc.edu.cn/CRAN/web/packages/locpol/index.html, 2012.

Jean Opsomer, Yuedong Wang, and Yuhong Yang. Nonparametric regression with correlated errors. Statistical Science, 16:134 153, 2001.

Cheolwoo Park and Kee H. Kang. Si Zer analysis for the comparison of regression curves. Computational Statistics and Data Analysis, 52(8):3954 3970, 2008.

James O. Ramsay. Functional Data Analysis. Wiley, 2006.

James O. Ramsay and Bernard W Silverman. Applied Functional Data Analysis: Methods and Case Studies. Springer, 2002.

David Ruppert, Simon J. Sheather, and Matthew P. Wand. An eﬀective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90(432): 1257 1270, 1995.

Charles J. Stone. Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, 8:1348 1360, 1980.

Tiejun Tong and Yuedong Wang. Estimating residual variance in nonparametric regression using least squares. Biometrika, 92:821 830, 2005.

Tiejun Tong, Yanyuan Ma, and Yuedong Wang. Optimal variance estimation without estimating the mean function. Bernoulli, 19(5A):1839 1854, 2013.

Optimal Estimation of Derivatives in Nonparametric Regression

Grace Wahba. Spline Models for Observational Data. SIAM, 1990.

Wenwu Wang and Lu Lin. Derivative estimation based on diﬀerence sequence via locally weighted least squares regression. Journal of Machine Learning Research, 16:2617 2641, 2015.

Shanggang Zhou and Douglas A Wolfe. On derivative estimation in spline regression. Statistica Sinica, 10(1):93 108, 2000.