# nonparametric_principal_subspace_regression__d87d93af.pdf

Journal of Machine Learning Research 23 (2022) 1-28 Submitted 9/20; Revised 11/21; Published 5/22

Nonparametric Principal Subspace Regression

Yang Zhou YANGZ91@163.COM School of Statistics Beijing Normal University Beijing 100875, China Mark Koudstaal MKOUDSTAAL@GMAIL.COM Dengdeng Yu DENGDENG.YU@UTORONTO.CA Dehan Kong KONGDEHAN@UTSTAT.TORONTO.EDU Department of Statistical Sciences University of Toronto Toronto, ON M5S 3G3, Canada

Fang Yao FYAO@MATH.PKU.EDU.CN Department of Probability & Statistics Center for Statistical Science Peking University Beijing 100871, China

Editor: Ambuj Tewari

In scientiﬁc applications, multivariate observations often come in tandem with temporal or spatial covariates, with which the underlying signals vary smoothly. The standard approaches such as principal component analysis and factor analysis neglect the smoothness of the data, while multivariate linear or nonparametric regression fails to leverage the correlation information among multivariate response variables. We propose a novel approach named nonparametric principal subspace regression to overcome these issues. By decoupling the model discrepancy, a simple two-step estimation procedure is introduced, which takes advantage of the low-rank approximation while keeping smooth dynamics. The theoretical property of the proposed procedure is established under an increasing-dimension framework. We demonstrate the favorable performance of our method in comparison with its counterpart, the conventional nonparametric regression, from both theoretical and numerical perspectives. Keywords: factor analysis, local polynomial smoothing, low-rank approximation, singular value decomposition, smoothness

1. Introduction

In scientiﬁc applications, one is often interested in predicting a multivariate response using one or a few predictor variables. The multivariate response linear regression is a conventional way to model this type of data. The usual procedure is the ordinary least squares, equivalent to performing an individual linear regression of each response variable on predictor variables, which fails to use the correlation information among the response variables. To incorporate the correlation information, Breiman and Friedman (1997) proposed a multivariate shrinkage method to

c 2022 Yang Zhou, Mark Koudstaal, Dengdeng Yu, Dehan Kong and Fang Yao.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-963.html.

ZHOU, KOUDSTAAL, YU, KONG AND YAO

0 0.2 0.4 0.6 0.8 1 Time

EEG signals from 64 electrodes

0 0.2 0.4 0.6 0.8 1 Time

f MRI (Bold signals) of 68 ROIs

Figure 1: Left: EEG signals detected from 64 electrodes of the scalp at 256 Hz per second for a randomly selected subject; Right: f MRI of 68 Regions of Interest (ROIs, divided according to Desikan-Killany atlas) over 205 seconds including 284 frames for a randomly selected subject.

leverage information from the correlation structure, which helps to improve the predictive accuracy compared to the ordinary least squares. Although multivariate response linear regression is a useful tool, it may not work properly in some applications. For example, in the real data examples presented in Section 5, we are interested in modeling the dynamic changes of the electroencephalogram (EEG) signals and functional Magnetic Resonance Imaging (f MRI), shown in Figure 1, where both curves show nonlinear patterns. This indicates that the multivariate response linear regression model may not be adequate to characterize the relationship between the common predictor time and the multivariate signals. A natural rescue is to use nonparametric regression of the multivariate response variables on the common predictors, that is nonparametrically ﬁt model (1) in Section 2.2. However, this solution is unsatisfactory as it is equivalent to performing curve-by-curve individual nonparametric regression for each component of the response, and thus does not capture correlations among the response variables. Motivated by these applications, we propose a new nonparametric principal subspace regression model in which the essence is that the nonparametric function admits a singular value type decomposition with smooth dynamics. The new model allows ﬂexible nonlinear structures of the regression functions, while takes into account the correlation among response variables at the same time. Our proposal is related to the factor models, which have been frequently used to characterize the correlation structure in multivariate data (Gary and Rothschild, 1983; Fan et al., 2013). In factor models, the signal of interest is expressed as a linear combination of a few latent variables, and does not concern additional covariate information that may play a role in estimation or prediction. For instance, factor models are often employed in contexts such as multiple time series or correlated functional data (Engle and Watson, 1981; Huang et al., 2009), where useful information may be hidden in the form of smoothness with respect to some additional covariates,

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

for example, temporal or spatial variable. Neglecting such information in recovery and prediction potentially hinders the quality and performance of the resulting estimators. This has been noticed by Durante et al. (2014), which further proposed a locally adaptive factor process under the Bayesian framework for characterizing multivariate mean-covariance changes in continuous time, allowing locally varying smoothness in both the mean and covariance matrix of multivariate time series. However, theoretical guarantees are lacking for the approach, which may leave practitioners uncertain about the quality of resulting estimates. In this work, we approach the problem from a different perspective that is intuitive and broadly applicable. The contributions are summarized as follows. First, we propose a new nonparametric principal subspace regression model with a diverging model dimension. This not only incorporates the correlation structure among multivariate responses, but also accounts for the nonlinear smooth trend of the data with respect to the covariates. Second, we introduce a simple two-step estimation framework, where the ﬁrst step is to obtain the left singular vectors of the data matrix and the second step is to estimate the nonparametric functions by local polynomial regression. Third, we provide theoretical guarantees. Speciﬁcally, we show that the space spanned by the estimated singular vectors can consistently estimate their underlying space, and obtain a uniform error bound for a diverging number of function estimates, which together ensure the convergence of the nonparametric principal subspace estimate. Lastly, we show that our method outperforms its counterpart, the conventional nonparametric regression, from both theoretical and numerical perspectives. This is not surprising because our approach signiﬁcantly reduces the model complexity and risk of overﬁtting compared to individual nonparametric regressions. The rest of the paper is organized as follows. In Section 2, we propose the nonparametric principal subspace regression methodology with a ﬁtting procedure, and the main theoretical results are presented in Section 3. Favorable ﬁnite-sample performance is illustrated through simulated and real data examples in Section 4 and 5, respectively. Proofs of main results and technical lemmas are contained in the Appendix.

2. Proposed Methodology

In this section, we ﬁrst introduce some notation used throughout the whole paper. Then we propose the nonparametric principal subspace regression methodology. By the end of this section, we describe the two-step ﬁtting procedure and parameters tuning.

2.1 Notation

We begin by listing some notation used throughout the paper. For two vectors a, b Rm, denote the inner product by a, b = a b = Pm i=1 aibi and the corresponding norm . Deﬁne the rescaled inner product a, b m = 1 m a, b and the induced norm m. For two functions f, g L2, the inner product and corresponding norm bear the subscript L2, that is f, g L2 = R

T f(x)g(x)dx and f 2 L2 = R

T f2(x)dx, where T is the domain of x. Let denote the sup norm of vector or function. For a matrix M Rp n, write the singular value decomposition as M = UΣV , where Σ = diag{σ1(M), σ2(M), . . .} with singular values σ1(M) σ2(M) 0. In particular, we use σmin(M) = σmin(p,n)(M), σmax(M) =

ZHOU, KOUDSTAAL, YU, KONG AND YAO

σ1(M) as the smallest and largest nontrivial singular values of M. Denote M = σ1(M) the

spectral norm and M F = q P j σ2 j (M) = q P k,l M2 kl the Frobenius norm, respectively.

Let PM = M(M M) M be the projection matrix onto the column space of M, where ( )

represents the Moore-Penrose pseudo-inverse. For two p q matrices A1 and A2 with p q and the singular values σ1 σ2 σq 0 of A 1 A2, deﬁne their principal angles as Θ(A1, A2) = diag(cos 1(σ1), . . . , cos 1(σq)). A measure of distance between A1 and A2 is given by sin Θ(A1, A2) or sin Θ(A1, A2) F . For any a, b R, let a b = min(a, b) and a b = max(a, b). We use c and C to denote generic positive constants that may vary in the sequel. For two positive sequences an and bn, an bn means an Cbn for large n, an bn if an bn and bn an, and an bn if lim supn an/bn = 0.

2.2 Nonparametric Principal Subspace Regression

Let {(xi, yi)}n i=1 be independent and identically distributed observations, where xi [0, 1]d and yi Rp. We consider the following nonparametric model

yi = F(xi) + zi, (1)

where zi = (zi1, . . . , zip) Rp is independent and identically distributed with mean 0 and covariance Σ, and F : [0, 1]d Rp so that E(yi|xi) = F(xi). The data dimension p is allowed to grow with the sample size n, and our goal is to estimate the function F, which characterizes the relationship between xi and yi, under mild smoothness assumption on the components of F. Motivated by factor analysis and the singular value decomposition as methods of accounting for correlation among variables in yi, as we show in Proposition 1, F can be written as

k=1 ukfk(x), q p, (2)

where uk Rp (1 k q) are orthonormal vectors and {fk, 1 k q} are smooth functions which are orthogonal in L2[0, 1]d. We refer to q as the (underlying) model dimension and allow q to grow with n, reﬂecting its nonparametric nature. If one takes q = p, every function F : [0, 1]d Rp with components in L2[0, 1]d has such a representation; hence the model has the simple interpretation of reducing dimension with smoothness on covariates. Thus, in addition to capturing correlations via factor type analysis, this model also nonparametrically incorporates smoothness information into the covariates. We refer to our model as nonparametric principal subspace regression . With the model in place, we aim to ﬁnd an appropriate approximation to F from an rdimensional function-valued vector subspace deﬁned as

G(x) G(x) =

k=1 vkgk(x), v k vl = δkl, x [0, 1]d, 1 k, l r

where vk Rp are orthonormal vectors, gk L2[0, 1]d are smooth functions and δkl is the Kronecker delta function. Note that the elimination of orthogonality of gk s facilitates our algorithm

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

(see Step 2 below), but does not affect the orthogonality of minimizers in Gr approximating to F. The approximation dimension r serves as a tuning parameter that may vary with q and n, depending on the balance between the approximation and estimation errors. Given G Gr, we use the sample discrepancy

i=1 F(xi) G(xi) 2 (3)

as a reasonable approximation to R(G) = R

[0,1]d F(x) G(x) 2dx to measure the error between G and F. Thus we can estimate gk and vk (1 k r) from the sample pairs {(xi, yi)}n i=1 by optimizing the following problem

min G Gr RD n (G), where RD n (G) = 1

i=1 yi G(xi) 2.

Let V = (v1, . . . , vr) Rp r and g(xi) = (g1(xi), . . . , gr(xi)) Rr. Then for a given i, by orthogonal projection, we have

yi V g(xi) 2 = (Ip PV )yi + PV yi V g(xi) 2

= (Ip PV )yi 2 + PV yi V g(xi) 2,

where Ip PV is the orthogonal complement of the projection matrix PV . Setting c(Y, V ) = 1 n Pn i=1 (Ip PV )yi 2, which contains no information of g, and observing that

PV (yi V g(xi)) 2 = V yi g(xi) 2 =

v k yi gk(xi) 2 ,

we may decompose the objective RD n (G) as

RD n (G) = c(Y, V ) +

k=1 RD n (vk, gk) where RD n (vk, gk) = 1

v k yi gk(xi) 2 .

This decouples the optimization problem of minimizing RD n (G) over Gr into two separate problems: ﬁnding a sequence of orthonormal vectors vk s, and then estimating gk by considering individual optimization of the RD n (vk, gk) along the direction vk. According to model (2), a reasonable choice of vk s is to ﬁnd the empirical counterpart of the singular vectors uk for 1 k r. Let Y = (y1, . . . , yn) Rp n be the response data matrix. A natural estimator of U[r] = (u1, . . . , ur) is the top r left singular vectors of Y . For the estimation of gk s, there exist many standard nonparametric smoothing methods. For simplicity, we adopt the local polynomial regression in the sequel for implementation and theoretical development. This suggests a two-step ﬁtting procedure.

Step 1. For a given r q, let ˆU[r] = (ˆu1, . . . , ˆur) be the top r left singular vectors of data Y = (y1, . . . , yn) Rp n from model (1);

ZHOU, KOUDSTAAL, YU, KONG AND YAO

Step 2. Plug in ˆU[r] into RD n (G) and ﬁnd the corresponding minimizers of the RD n (ˆuk, gk) by applying local polynomial smoothing for k = 1, . . . , r separately, denoted by ˆf1, . . . , ˆfr.

Then the estimate ˆF is given by

k=1 ˆuk ˆfk. (4)

There are two tuning parameters involved in our estimation procedure, the retained dimension r in Step 1 and the bandwidth hk (1 k r) in Step 2. To select these parameters, for each r, we choose hk by the standard ﬁve-fold cross-validation for each function estimate individually. Let ˆF (r) be the corresponding estimator of F using the retained dimension r with selected bandwidths ˆhk,r. In view of the nonparametric approximation nature of ˆF (r), a reasonable choice for selecting r is to minimize

AIC(r) = log n V (r, ˆF (r)) o + 2r

where V (r, ˆF (r)) = (2n) 1 Pn i=1 yi ˆF (r)(xi) 2.

3. Theoretical Guarantees

We shall present the main theoretical result, followed by an explicit rate of convergence when the proposed method is coupled with local polynomial smoothing, while the proofs are deferred to Appendix. We ﬁrst present a proposition that ensures that a reasonable F : [0, 1]d Rp has a singular value type representation and supports the form of function proposed in this paper.

Proposition 1 Suppose that F : [0, 1]d Rp, which can be written as F = (F1, . . . , Fp) , satisﬁes Fj L2[0, 1]d for 1 j p. Then F has a singular value type decomposition

k=1 σkukvk( ) =

k=1 ukfk( ), q p,

where vk s are orthonormal in L2[0, 1]d, uk s orthonormal in Rp and σ1 σ2 σq > 0.

It is noted that Proposition 1 holds under a rather weak condition Fj L2[0, 1]d that is mostly satisﬁed in practice. To make model (2) feasible, we need an additional assumption for F, fk 2 L2 = σ2 k k α, α > 1, 1 k q. (5)

This type of polynomial decay condition on singular values is widely used in the ﬁeld of highdimensional statistics (Wainwright, 2019; Vershynin, 2019). For simplicity, we assume the design points xi s are independent and identically distributed with xi following uniform distribution on [0, 1]d, that is xi U[0, 1]d. This assumption can be relaxed with more technicality, see Remark 3. We also assume zi = (zi1, . . . , zip) Rp are independent and identically distributed as N(0, σ2Ip), which puts us in the regime of p repeated nonparametric experiments. The assumption of uncorrelated Gaussian noise is commonly used

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

to facilitate model exploration and theoretical development (Cai, 2012; Donoho and Johnstone, 1994; Tsybakov, 2009; Johnstone, 2017) in the study of nonparametric experiments. This is in accordance to a common practice in nonparametric regression which incorporates the structural information into the underlying function F contaminated by independent noise. Given the estimate ˆU[r] of U[r] in Step 1, we deﬁne the estimated rotated response as ˆy ik = ˆu k yi and the oracle rotated data {(xi, yo ik)} as

yo ik = u k yi = fk(xi) + ϵik, for 1 k r, 1 i n,

where ϵik = u k zi are independent and identically distributed from N(0, σ2) by the model assumptions. Denote the linear smoother by L and let ˆfo k be the nonparametric estimates by applying L to the oracle rotated data {(xi, yo ik)}. Note that ˆfo k s are not present in the estimation procedure, but serve as an intermediate quantity in theoretical analysis. Therefore, in order to obtain the main result (7) in Theorem 2 below, we need to bound the rotation error between the estimated principal subspace ˆU[r] and its true counterpart U[r] as (6) states. This bound is of interest in its own rights given broad applicability in singular value type problems. It is also required to quantify the smoothing error for estimating fk, where 1 k r, where the difference between ˆfo k and the actual estimates ˆfk is quantiﬁed via the rotation error (6). It is important to note that r is not ﬁxed but may (slowly) diverge with n, which makes the classical theory on nonparametric regression for a ﬁxed number of function estimates invalid. For the ﬂow of exposition, we present here the needed condition and rate from the smoothing step and show their fulﬁllment afterward.

Theorem 2 Assume that fk s are uniformly bounded so that max1 k q fk B and fk 2 L2 s satisfy a polynominal decay condition (5). Let ˆU[r] be the estimate of U[r] in Step 1. If r (n/q2 log n)1/(2α+2), q p and p n, then

E sin Θ( ˆU[r], U[r]) 4 p2r4α+4 log2 n

In addition, assume that the linear smoother L satisﬁes L C and E ˆfo k fk 2 n kτ/nρ

uniformly for 1 k q, where τ and ρ are constants associated with the smoothness of fk. Then the estimator ˆF = Pr k=1 ˆuk ˆfk in (4) satisﬁes

E[Rn( ˆF)] r α+1 + pr2α+3 log n

where Rn( ˆF) = n 1 Pn i=1 Fi(xi) ˆF(xi) 2 is the sample discrepancy deﬁned in (3).

Note that the assumption r (n/q2 log n)1/(2α+2) is to bound the gap between the estimated singular values, and the approximation error Pq k=r+1 σ2 k is bounded by r α+1 regardless of the underlying model dimension q due to the decaying structure (5). The condition q p is reasonable since q is usually much smaller than p in a low-rank approximation problem given the decay rate of singular values. Based on such observations, we see that the ﬁrst term arises from the approximation error, while the second and third terms are due to estimation errors for

ZHOU, KOUDSTAAL, YU, KONG AND YAO

the principal subspace U[r] and the smoothing of fk (1 k r), respectively. The proposed estimator is consistent, provided that

r n p log n

1 2α+3 n ρ τ+1 ,

which reﬂects the essence of dimension reduction achieved by the proposed method. By comparing the order of each error term in (7), one may obtain the optimal convergence rate by choosing r appropriately. Speciﬁcally, if pnρ 1 log n rτ 2α 2 and r nρ/(α+τ), the optimal rate is n ρ(α 1)/(α+τ); if rτ 2α 2 pnρ 1 log n and r (n/p log n)1/(3α+2), the optimal rate changes to (p log n/n)(α 1)/(3α+2).

Remark 3 If xi are on a grid, saying {0, 1/m, . . . , (m 1)/m, 1}. By deﬁning Ef2(xi) = m 1 Pm i=1 f2(i/m) as a Riemann integral approximation to f 2 L2, the result in this section remains valid by considering the additional integral approximation error. If xi is sampled from some distribution π, we consider the transformation yi = F(π 1(wi)) + zi = H(wi) + zi, where wi follows a uniform distribution on [0, 1]d. Let ˆπ be the empirical distribution based on xi, then one can perform the proposed method to estimate H based on the sample pairs {( ˆwi, yi)}, where ˆwi = ˆπ 1(xi). Consequently, the estimation of F is ˆF(x) = ˆH(ˆπ(x)). For the fact that ˆπ enjoys a standard nonparametric rate only depending on n and d, the impact of transformation is negligible when d p, thus the result also holds.

We next turn to demonstrate that the requirements on a diverging number r of smoothing estimates of fk (1 k r) are fulﬁlled. For conciseness, we focus on the common local polynomial regression (Fan and Gijbels, 1996; Tsybakov, 2009) that is implemented for numerical studies, while other smoothing methods can also be investigated with more technicality. To bound the errors of a diverging number of nonparametric function estimators, the key is to extend and combine the smoothness classes to which such q functions belong. Recall that the smoothness class of primary concern in standard nonparametric regression is the H older class H(β, L), consisting of l = β times differentiable functions f, where β represents the largest integer strictly less than β, with the l-th derivative f(l) satisfying |f(l)(x) f(l)(y)| L|x y|β l for x, y in the domain of interest. The idea arises from the fact that, for a sequence of orthonormal basis functions {vk} k=1, the smoothness of vk deteriorates as index k increases. Thus we assume that the orthonormal functions vk s in Proposition 1, hence fk s, belong to different H older classes H(β, Lk) for 1 k q. The H older constants Lk depict the amplitude of its derivatives which characterizes the function s frequency increment. Here are the examples of explicit forms of Lk for some commonly used basis functions with a given β.

Exp.1 Fourier Series: ψ0 = 1, ψ2k 1 = sin(kπt) and ψ2k = cos(kπt), then Lk k β .

Exp.2 B-splines: {Njk}J j= k 1, where Njk is deﬁned in (6.20) in Hsing and Eubank (2015) and k is the order, then Lk k!/(k β )! k β .

Exp.3 Wavelets: ψj,k = 2j/2ψ(2jt k) with a mother wavelet function ψ, then Lk k β .

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

We adopt a similar argument as in Cai and Brown (1999) for the random design under consideration, and derive the minimax rate for the oracle functional estimates ˆfo k (1 k q) in the metric Ef 2 n.

Theorem 4 Suppose that vk H(β, Llk) with l > 1 and L1k = v k , and the kernel K satisﬁes the conditions outlined in Tsybakov (2009). Then the local polynomial smoother L is bounded and the resulting estimator ˆfo k of fk = σkvk satisﬁes

Efk ˆfo k fk 2 n 1 (σ2 k L2 1k)

2β 2β+1 σ2 k L2 lk 1 2β+1 n 2β 2β+1 ,

uniformly for 1 k q, by choosing optimal bandwidth h 1 σ2 k L2 1k /nσ2 k L2 lk 1/(2β+1). Moreover, if Llk kl and L1k k, then

Efk ˆfo k fk 2 n kτ

nρ with τ = 2(2 α)β 0 + 2l α

2β + 1 , ρ = 2β 2β + 1. (8)

We conclude this section by comparing with the curve-by-curve estimator of F without extracting the subspace. By Proposition 1 and Theorem 4, it is seen that Fj H(β, Mlj), where Mlj = Pq k=1 ukjσk Llk and M1j = Pq k=1 ukjσk L1k. By Cauchy-Schwartz inequality, M2 lj q2l α+1 Pq k=1 u2 kj and M2 1j q3 α Pq k=1 u2 kj. Following the same argument as in proof of Theorem 4, we have

EFj ˆFj Fj 2 n 1 M

2 2β+1 lj n 2β 2β+1 ,

where ˆFj is the local polynomial estimator of Fj for j = 1, . . . , p. Then the convergence rate of the curve-by-curve estimator ˆFcbc = ˆF1, . . . , ˆFp is

E[Rn( ˆFcbc)] q 2l α+2

8β 2αβ+2l α+2

2β+1 n 2β 2β+1 .

Note that β l, α > 1 and if p q4 α, it holds that

E[Rn( ˆFcbc)] q p

In comparison with the proposed method, plugging (8) into (7),

E[Rn( ˆF)] r α+1 + p log n

n r2α+3 + r3n 2β 2β+1 .

The essence of the proposed method is dimension reduction based on low-rank approximation, that is, the reduced dimension r is often much smaller than q and p, which implies that E[Rn( ˆF)] E[Rn( ˆFcbc)].

ZHOU, KOUDSTAAL, YU, KONG AND YAO

4. Simulation Study

In this section, we perform simulation study to evaluate the performance of the proposed method. The U = (u1, . . . , uq) is generated by orthonormalizing a p q matrix with all elements being independent and identically distributed standard normal. The xi s are independently generated from uniform distribution on [0, 1]. The functions fk s are independently generated from a zero mean Gaussian process with compactly supported covariance function

Ca,b(s, t) = b max{0, (1 ha)5}(8h2 a + 5ha + 1),

where ha(s, t) = |s t|/a, see Rasmussen and Williams (2006) for details. We set a = 0.5 and b = 15. We orthogonalize and scale fk s by

fk 2 L2 = σ2 k = 5k 2, 1 k 6 and σ2 k = k 2, 6 < k q.

The error zij are independent and identically distributed standard normal N(0, σ2) for 1 i n and 1 j p. The response yi is obtained by yi = Pq k=1 ukfk(xi) + zi. A natural comparison would be conducted against individual nonparametric regression of Y on x in a curve-by-curve manner. In particular, we compare with the method that ﬁts the jth component of Y on x nonparametrically for each 1 j p. We also use the local polynomial regression with a Gaussian kernel for curve-by-curve nonparametric recovery. For the tuning parameters, we use the selection method described in Section 2. For nonparametric principal subspace regression, we report the average values of selected ˆr by AIC and the average estimation errors based on 100 Monte Carlo runs. For curve-by-curve nonparametric recovery, we only report the average estimation errors since the procedure ﬁts each curve individually. We consider different combinations of (n, p, q) at two noise levels σ = 0.5 and σ = 1. We ﬁrst check the performance of two approaches under the typical low-rank scenario that p is much larger than q. The results in Table 1 suggest that our method outperforms the curveby-curve nonparametric regression for all cases. Given the sample size n, the recovery results from nonparametric principal subspace regression tend to improve at a faster rate as p increases or q decreases. We then conduct the simulation when q increases to the extent close to p and the results are reported in Table 2. We see that our method still outperforms. In particular, the estimation error and the selected ˆr become to level off even as q increases substantially. The selection of r plays an important role in our method. As seen from Tables 1 and 2, the

AIC is capable of extracting most of the important signals. Moreover, the selected ˆr becomes smaller with a higher noise level that tends to blind small signals. To demonstrate the effectiveness of AIC, we plot the average estimation errors with increasing r and labeled the averages of selected ˆr in Figure 2. It is observed that the AIC criterion indeed selects models comparable with those of minimal prediction error.

5. Real Data Applications

In this section, we apply our methodology to two data examples introduced in Section 1, which shows a favorable performance in comparison with the conventional nonparametric regression.

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

σ = 0.5 σ = 1 n p q ˆr NPSR CBC ˆr NPSR CBC

50 2 2.14 0.171(0.003) 0.546(0.017) 2.14 0.634(0.013) 1.728(0.024) 4 4.09 0.312(0.003) 0.587(0.011) 4.09 1.269(0.013) 1.827(0.019) 6 6.13 0.471(0.004) 0.611(0.011) 6.34 2.100(0.018) 1.877(0.019)

100 3 3.01 0.372(0.003) 0.972(0.009) 3.01 1.539(0.012) 3.258(0.034) 6 6.01 0.755(0.004) 1.053(0.010) 5.49 3.446(0.024) 3.452(0.033) 9 6 0.798(0.004) 1.066(0.010) 5.48 3.482(0.019) 3.456(0.029)

150 4 4 0.668(0.004) 1.422(0.014) 4 2.936(0.018) 4.780(0.048) 8 6 1.077(0.005) 1.486(0.014) 4.72 4.589(0.021) 4.938(0.046) 12 6.01 1.115(0.005) 1.488(0.012) 4.62 4.624(0.022) 4.932(0.045)

100 3 3.02 0.239(0.002) 0.718(0.007) 3.02 0.978(0.007) 2.472(0.025) 6 6.02 0.491(0.003) 0.768(0.007) 6.05 2.316(0.014) 2.630(0.024) 9 6.05 0.537(0.003) 0.768(0.007) 6.04 2.374(0.015) 2.625(0.025)

200 4 4 0.531(0.002) 1.361(0.013) 4 2.377(0.012) 4.678(0.046) 8 6 0.871(0.003) 1.420(0.013) 5.12 3.875(0.014) 4.851(0.044) 12 6 0.909(0.003) 1.428(0.012) 5.03 3.921(0.017) 4.854(0.044)

300 5 5 0.955(0.004) 1.984(0.018) 4.22 4.313(0.016) 6.790(0.063) 10 6 1.229(0.004) 2.039(0.018) 4.18 5.088(0.018) 6.970(0.064) 15 6 1.262(0.004) 2.052(0.019) 4.27 5.106(0.017) 6.992(0.066)

Table 1: The average estimation errors for our nonparametric principal subspace regression (NPSR) and the curve-by-curve nonparametric regression (CBC) with their associated standard errors in the parentheses, and the average values of selected ˆr by AIC in 100 Monte Carlo runs are reported.

σ = 0.5 σ = 1 (n, p) q ˆr NPSR CBC ˆr NPSR CBC

4 4 0.668(0.004) 1.422(0.014) 4 2.936(0.018) 4.780(0.048) 40 6 1.168(0.005) 1.525(0.011) 4.77 4.660(0.020) 5.003(0.046) 80 6 1.174(0.005) 1.537(0.012) 4.8 4.670(0.021) 5.010(0.049) 120 6.01 1.191(0.005) 1.545(0.013) 4.77 4.735(0.023) 5.033(0.047)

5 5 0.955(0.004) 1.984(0.018) 4.22 4.313(0.016) 6.790(0.063) 50 6 1.307(0.004) 2.078(0.019) 4.25 5.141(0.016) 7.005(0.066) 100 6 1.312(0.004) 2.081(0.019) 4.25 5.138(0.019) 6.981(0.065) 200 6 1.315(0.004) 2.086(0.018) 4.19 5.145(0.017) 6.993(0.063)

Table 2: The average estimation errors for our nonparametric principal subspace regression (NPSR) and the curve-by-curve nonparametric regression (CBC) with their associated standard errors in the parentheses, and the average values of selected ˆr by AIC in 100 Monte Carlo runs are reported.

ZHOU, KOUDSTAAL, YU, KONG AND YAO

2 4 6 8 10 12 r

n=300, p=150, q=40

2 4 6 8 10 12 r

n=500, p=300, q=50

Figure 2: The average estimation errors ˆRn for our nonparametric principal subspace regression with different r under two noise levels. The average values of selected ˆr by AIC in 100 Monte Carlo runs are depicted (vertical dotted lines), suggesting the effectiveness of AIC.

5.1 Application to an EEG Study

We apply the proposed method to an EEG data set, which is available at https://archive. ics.uci.edu/ml/datasets/EEG+Database. The data were collected by the Neurodynamics Laboratory and contain 122 subjects, where researchers measured the voltage values from 64 electrodes placed on each subject s scalps sampled at 256 Hz for 1 second. As EEG data are notoriously noisy while there are known to be strong relations between different electrodes, we model the data from each subject by the nonparametric framework (1). Thus, for each subject, we ﬁt the nonparametric principal subspace regression to the data matrix Y = (y1, . . . , yn) Rp n with p = 64 and n = 256. The average retained dimension selected by the proposed AIC among these 122 subjects is 7.270 with a standard error 0.112. To compare the prediction performance, we also ﬁt curve-by-curve nonparametric regression to the signals obtained from each of the 64 electrodes. For each subject, we randomly reserve 10% of data as the test set: Stest {1, . . . , 256} such that |Stest|/256 10%, while using the rest as the training set, and report the prediction errors |Stest| 1 P i Stest Yi ˆF(xi) 2/64 for both approaches. The average prediction error for nonparametric principal subspace regression over the 122 subjects is 1.153 with a standard error 0.072, while that obtained by the curve-bycurve nonparametric regression is 1.288 with a standard error 0.075.

5.2 Application to an f MRI Study

For another data application, we analyze the motor task-related f MRI data from the Human Connectome Project (HCP) Data https://www.humanconnectome.org/ which includes behavioral and 3 Tesla magnetic resonance imaging data from 970 healthy adult participants collected from 2012 to spring 2015. The block-design motor task used in this study is adapted

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

from experiments by Buckner et al. (2011) and Yeo et al. (2011). The details on the HCP implementation can be referred to Barch et al. (2013). In the motor task, participants are presented with visual cues that ask them to either tap their left or right ﬁngers, or squeeze their left or right toes, or move their tongue to map motor areas. For each subject, there are two runs of phase encoding scans (right-to-left and left-to-right), and we use the left-to-right phase encoding scan in this study. Each run of the motor task lasted for about 205 seconds including 284 frames, and we use the Desikan-Killiany atlas (Desikan et al., 2006) to divide the brain into 68 ROIs.

We use 869 subjects which had the motor task-related f MRI data. For each subject, we obtain the data matrix Y Rp n with p = 68 and n = 284. By ﬁtting the nonparametric principal subspace regression, the average selected retained dimension among these 869 subjects is 8.353 with standard error 0.049. Same as the above example, we randomly select 10% of data as the test set and the rest of the data as the training set for each subject. The average prediction error for the proposed method over 869 subjects is 2.629 with a standard error 0.100, while that obtained by the curve-by-curve nonparametric regression is 2.727 with a standard error 0.103.

Acknowledgments

Yang Zhou is the ﬁrst author, and Fang Yao is the corresponding author. Yang Zhou s research is partially supported by the Postdoctoral Science Foundation of China (Grant no. 2020M680226 and 2020TQ0014) and National Natural Science Foundation of China (Grant no. 11971048). Dengdeng Yu s research is partially supported by the Canadian Statistical Sciences Institute postdoctoral fellowship. Dehan Kong s research is partially supported by the Natural Science and Engineering Research Council of Canada. Fang Yao s research is partially supported by National Natural Science Foundation of China Grants 11931001 and 11871080, the National Key R&D Program of China Grant 2020YFE0204200, the LMAM, and the Key Laboratory of Mathematical Economics and Quantitative Finance (Peking University), Ministry of Education.

Appendix A. Notations

We ﬁrst introduce the notation used in the appendix. Recall we have proposed a nonparametric model yi = Pq k=1 ukfk(xi) + zi Rp, where zi = (zi1, . . . , zip) follows independent and identically distributed N(0, σ2Ip). The design points xi s are independent and identically distributed U[0, 1]d. Let Y = (y1, . . . , yn) Rp n be the response data matrix, F = F(x1), . . . , F(xn) Rp n and Z = (z1, . . . , zn) Rp n, one can write Y = F + Z. Let U[r] = (u1, . . . , ur) Rp r, and ˆU[r] = (ˆu1, . . . , ˆur) Rp r the estimate of U[r]. Deﬁne fk = fk(x1), . . . , fk(xn) Rn and f = ( f1, . . . , fq) Rq n so that F = U f. We further

deﬁne ˆfk = ˆfk(x1), . . . , ˆfk(xn) Rn and ˆf[r] = ( ˆf1, . . . , ˆfr) Rr n so that we may

write ˆF = ˆF(x1), . . . , ˆF(xn) Rp n as ˆF = ˆU[r] ˆf[r].

ZHOU, KOUDSTAAL, YU, KONG AND YAO

Appendix B. Proofs of the Proposition and Main Theorems

Proof. [Proposition 1] Given F : [0, 1]d Rp, we can deﬁne an operator F : L2[0, 1]d Rp

mapping h L2[0, 1]d to

Fh = F, h L2 := ( F1, h L2, . . . , Fp, h L2) Rp.

In this case we can represent the operator F as

j=1 ej Fj and F Fh =

j=1 Fj, h L2Fj.

where is the Kronecker product. Thus F F is of ﬁnite rank and hence compact. As it is also symmetric, it has an eigendecomposition

k=1 σ2 kvk vk

with at most p of the σk = 0 and vk s forming an orthonormal basis of L2[0, 1]d. We order the nonzero σk decreasingly by σ1 σ2 σq > 0 with q p. Note that we may write

k=1 h, vk L2vk.

Setting wk = Fvk we have that wk, wl = vk, F Fvl L2 = σ2 l vk, vl L2 = σ2 l δkl. Hence, the wk s are orthogonal vectors for 1 k q. Letting uk = σ 1 k wk, we may write

k=1 h, vk L2Fvk =

k=1 σk h, vk L2uk = q X

k=1 σkuk vk

Since Fj = P k=1 Fj, vk L2vk, we have

k=1 F, vk L2vk =

k=1 (Fvk) vk =

k=1 σkukvk,

which completes the proof.

Proof. [Theorem 2] As the fk s are not necessarily orthogonal in Rn, we need to consider that the singular value decomposition of F is of the form F = PΛQ , with P possibly spanning a different subspace from U. Let Y admit the singular value decomposition Y = ˆU ˆΛ ˆV . Recall that Y = F + Z = U f + Z = PΛQ + Z and note that

sin Θ( ˆU[r], U[r]) 4 C sin Θ( ˆU[r], P[r]) 4 + sin Θ(P[r], U[r]) 4 ,

we shall bound E sin Θ( ˆU[r], P[r]) 4 and E sin Θ(P[r], U[r]) 4 separately below.

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

Deﬁne the event E as

max 1 k,l q | fk, fl n fk, fl L2| 4B2 r

By Lemma 7, it holds that P Ec q(q + 1)/n2. Denote E( |D) the expectation conditional on the design {x1, . . . , xn}. As sin Θ( ˆU[r], P[r]) 1, we decompose E sin Θ( ˆU[r], P[r]) 4 as

E sin Θ( ˆU[r], P[r]) 4 = E E sin Θ( ˆU[r], P[r]) 4 D

P(Ec) + E E sin Θ( ˆU[r], P[r]) 4 D 1E .

On the event E, Lemma 8 ensures (12) holds. Since q p and p n there exists a constant Cmax such that σ2 max( F) nσ2 max + 4B2p

q2n log n Cmaxn.

Under the assumption that r (n/q2 log n)1/(2α+2), it holds that

3(σ2 r σ2 r+1).

Combing this with Lemma 10 we get

E sin Θ( ˆU[r], P[r]) 4 D Cp2(σ2 r( F) + n)2 σ2r( F) σ2 r+1( F) 4 Cp2(2nσ2 r + n)2

n4(σ2r σ2 r+1)4 Cp2r4α+4

As a result, we obtain that

E sin Θ( ˆU[r], P[r]) 4 Cq2

n2 + Cp2r4α+4

Using Proposition 1 in Cai and Zhang (2018), it holds that

sin Θ(P[r], U[r]) σr( F U[r]) P( F U[r]) F U [r]

σ2r( F U[r]) σ2 r+1( F ) .

Note that σ2 r( F U[r]) = σ2 r( f[r]), σ2 r+1( F ) = σ2 r+1( f) and

P( F U[r]) F U [r] = f [r]( f[r] f [r]) 1 f[r]( f [r]) σ 1 min( f [r]) f[r]( f [r])

where we use (8.18) in Cai and Zhang (2018) in the inequality. Hence,

sin Θ(P[r], U[r]) 4 Cσ4 r( f[r])σ 4 min( f [r]) f[r]( f [r]) 4 σ2r( f[r]) σ2 r+1( f) 4 = C f[r]( f [r]) 4 σ2r( f[r]) σ2 r+1( f) 4 .

On the event E, following the proof of Lemma 8, it is easily seen that

f[r]( f [r]) 4 f[r]( f [r]) 4 F Cq2r2n2 log2 n p2n2 log2 n

ZHOU, KOUDSTAAL, YU, KONG AND YAO

which deduces that

E sin Θ(P[r], U[r]) 4 Cq2

n2 + Cp2r4α+4 log2 n

n2 p2r4α+4 log2 n

Piecing together what has been shown ﬁnishes the proof of (6). To prove (7), we divide the error E[Rn( ˆF)] into two parts

E h Rn( ˆF) i 2E

i=1 F(xi) F[r](xi) 2 #

i=1 F[r](xi) ˆF(xi) 2 #

where F[r] = Pr k=1 ukfk is the ﬁrst r truncation of F. The ﬁrst part is the approximation error and the second part is the estimation error. For the approximation error, By the assumption (5) it is easily seen that

i=1 F(xi) F[r](xi) 2 #

k=r+1 ukfk(xi)

k=r+1 Ef2 k(xi) =

k=r+1 σ2 k r α+1,

where Ef2 k(xi) = fk 2 L2 = σ2 k since xi U[0, 1]d. Now we consider the estimation error. Given the deﬁnitions in the paper and at the outset of the appendix, the estimation error can be rewritten as 1

n E F[r] ˆF 2 F where F[r] = U[r] f[r] and ˆF = ˆU[r] ˆf[r]. Recall that Y o k = Y uk and ˆY k = Y ˆuk and for a linear smoother L,

the estimates are ˆfo k = LY o k and ˆfk = L ˆY k, respectively. Consequently ˆf[r] = ˆU [r]Y L and ˆF = ˆU[r] ˆU [r]Y L . With this notation, we have

E F[r] ˆF 2 F 2E U[r] f[r] U[r] ˆfo [r] 2 F + 2E U[r] ˆfo [r] ˆU[r] ˆf[r] 2 F .

By the assumption that E ˆfo k fk 2 n Ckτ/nρ, we have

1 n E U[r] f[r] U[r] ˆfo [r] 2 F = 1

n E f[r] ˆfo [r] 2 F = E

k=1 fk ˆfo k 2 n

Also notice that

1 n E U[r] ˆfo [r] ˆU[r] ˆf[r] 2 F = 1

n E (U[r]U [r] ˆU[r] ˆU [r])Y L 2 F

n E h U[r]U [r] ˆU[r] ˆU [r] 2 F Y 2 L 2i

E U[r]U [r] ˆU[r] ˆU [r] 4 F p

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

where in the last inequality we use L C. Using (6), Lemma 6 and together with the fact that U[r]U [r] ˆU[r] ˆU [r] F 2 r sin Θ( ˆU[r], U[r]) give that

1 n E U[r] ˆfo [r] ˆU[r] ˆf[r] 2 F Cpr2α+3 log n

n max 1, q n, p

pr2α+3 log n

since q p and p n. Combing (9), (10) and (11), we obtain (7) and conclude the proof.

Proof. [Theorem 4] It is well known that in the ﬁxed design case, where xi = i/n, the local polynomial estimator enjoys the minimax optimal rate Ef ˆf f 2 n n 2β/(1+2β); see Proposition 1.13 and Theorem 1.6 in Tsybakov (2009). However, in the random design case, there seem no results on convergence in the metric Ef 2 n. One remedy is to adopt a similar approach in Cai and Brown (1999), and slightly modify the local polynomial regression strategy. Represent the estimated rotated data {(xi, ˆy ik)} as {(x(i), ˆy (i)k)}, where x(i) is the ith order statistic of xi s and ˆy (i)k is the corresponding response. In the recovery procedure, we perform local polynomial regression on the equispaced data {(δi, ˆy (i)k)} for i = 1, . . . , n,

where δi = Ex(i) = i/(n + 1). Then for a given x, ˆfk(x) can be represented as ˆfk(x) = Pn i=1 Wn,i(x)ˆy (i)k = L ˆY where the Wn,i(x) are deﬁned in (1.67) in Tsybakov (2009) and completely deterministic, satisfying all of the properties derived therein. Similarly, for the oracle equispaced data {(δi, ˆyo (i)k)}, we denote the oracle estimate function by ˆfo k = L ˆY o. Next we turn to the main proof of the theorem. Let b(x) = Efk,D ˆfo k(x) fk(x) denote the bias, conditioned on design, of the estimator ˆfo k(x) at x. Then we ﬁnd that

i=1 fk(x(i))Wn,i(x) fk(x) =

fk(x(i)) fk(x) Wn,i(x)

fk(x(i)) fk(δi) Wn,i(x) +

i=1 {fk(δi) fk(x)} Wn,i(x)

fk(x(i)) fk(δi) Wn,i(x)

| {z } I(x)

i=1 {fk(δi) fk(x)} Wn,i(x)

| {z } II(x)

Starting from the fact that for xi we have

Efk{ ˆfo k(xi) fk(xi)}2 = Efk Efk,D{ ˆfo k(xi) fk(xi)}2

= Efk Efk,D{ ˆfo k(xi) Efk,D ˆfo k(xi) + Efk,D ˆfo k(xi) fk(xi)}2

= Efk h varfk,D{ ˆfo k(xi)} + b2(xi) i .

Conditioned on design D, as the same as the proof in Tsybakov (2009) we can show that

II2(xi) Cσk Llk

2 h2β and varfk,D{ ˆfo k(xi)} Cσ2

ZHOU, KOUDSTAAL, YU, KONG AND YAO

which together give that

Efk{ ˆfo k(xi) fk(xi)}2 Cσk Llk

2 h2β + Cσ2

nh + 2Efk I2(xi).

Since vk H(β, Llk) and l > 1, by Theorem 1.34 in Adams and Fournier (2003), we have L1k < . For each xi, it holds that

I(xi) σk L1k

i=1 |δi x(i)||Wn,i(xi)|.

Applying Cauchy-Schwarz to the right hand side and using the properties of Wn,i(x) from Tsybakov (2009) gives

I2(xi) σ2 k L2 1k

i=1 |δi x(i)|2 n X

i=1 |Wn,i(xi)|2 Cσ2 k L2 1k nh

i=1 |δi x(i)|2.

Efk I2(xi) Cσ2 k L2 1k nh

i=1 var{x(i)} Cσ2 k L2 1k nh

i n2 Cσ2 k L2 1k nh .

Together with these we show that

Efk ˆfo k fk 2 n = 1

i=1 Efk{ ˆfo k(xi) fk(xi)}2

Cσ2 k L2 lk (l!)2 h2β + Cσ2

nh + Cσ2 k L2 1k nh

max{1, (σ2 k L2 1k)

2β 2β+1 }(σ2 k L2 lk) 1 2β+1 n 2β 2β+1

by choosing optimal bandwidth h max{1, σ2 k L2 1k}/(nσ2 k L2 lk) 1/(2β+1). Now we verify the boundness of L. This follows from a result for bounds of eigenvalues of matrices. Let A = (akl)n k,l=1 be an n n matrix and set

l |akl| and Cl = X

Then one can show that the eigenvalues of A, µ(A), are bounded by

µ(A) min max k Rk, max l Cl

Note that the L satisﬁes Lij = Wn,j(xi) and from Tsybakov (2009) we know that

j |Lij| = X

j |Wn,j(xi)| C.

Thus we have that µ(L) C and hence L C.

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

Appendix C. Auxiliary Lemmas for Main Theorems

In the Appendix 5.2, we introduce the auxiliary lemmas for main theorems. Lemma 6 bounds fourth moments of Y in which the proof needs Lemma 5. Lemma 7 quantiﬁes the discrepancy between , n and , L2, which is crucial to the proof of Lemma 8. Lemma 8 shows that the scaled singular values of F are close to the true counterparts of F. Lemma 9 is crucial to the proofs of Lemma 10, which in turn is crucial to the proofs of the main theorems of the paper. Lemma 10 extends the results of Theorem 3 in Cai and Zhang (2018) by considering fourth moment perturbation bounds for the top r singular vectors.

Lemma 5 If X 0 is a positive random variable and for a, b > 0 we have P(X > a + bt) 2 exp( t2) for all t 0, then it follows that E(X4) C max(a4, b4).

Proof. Separating on the value of a we ﬁnd that

E(X4) = E{X41(X a)} + E{X41(X>a)} a4 + E{X41(X>a)}.

Now notice that

E{X41(X>a)} = Z

Ω X4(ω)1{X(ω)>a}d P(ω)

a4 + 4 Z X(ω)

a s31{s<X(ω)}ds d P(ω)

a s3P(X > s)ds

= a4 + 4b Z

0 (a + bt)3P(X > a + bt)dt

a4 + 8b max(a3, b3) Z

0 (1 + t)3 exp( t2)dt

C max(a4, b4),

which concludes the proof of the lemma.

Lemma 6 With Y = F + Z Rp n denoting the data matrix and Y = maxi σi(Y ) the operator norm, or maximum singular value of Y , we have that

E Y 4 C max(nq2, p2, n2)

holds when maxk fk B.

ZHOU, KOUDSTAAL, YU, KONG AND YAO

Proof. Since F = U f as deﬁned above, we have

F 2 F 2 F = tr( f U U f) = tr( f f) =

k=1 f2 k(xi).

Note that Ef2 k(xi) = fk 2 = σ2 k and Pq k=1 f2 k(xi) q B2 by the assumption, one has

k=1 f2 k(xi)

#2 = n(n 1)

k=1 E f2 k(xi) !2 + n E

k=1 f2 k(xi)

Cn(n 1) + B4nq2.

Now, if Z Rp n is composed of independent and identically distributed N(0, σ2) entries, then according to Theorem 4.4.5 in Vershynin (2019) there is a constant C so that for all t > 0,

P n Z > Cσ(p1/2 + n1/2 + t) o 2e t2.

By Lemma 5, this implies that E Z 4 C max(p2, n2) which completes the proof by

E Y 4 C(E F 4 + E Z 4) C max(nq2, p2, n2).

Lemma 7 Suppose that fk s are bounded and orthogonal in L2[0, 1]d, satisfying maxk fk B. Then

P max k,l | fk, fl n fk, fl L2| > 2δB2 q(q + 1)exp nδ2/2

and so with the probability at least 1 q(q + 1)/n2,

max k,l | fk, fl n fk, fl L2| 4B2 r

Proof. Noting that for any 1 k, l q we have

fk, fl n fk, fl L2 = 1

i=1 {fk(xi)fl(xi) fk, fl L2}

guarantees that fk, fl n fk, fl L2 is expressible as the sum of n independent and identically distributed mean 0 random variables, each bounded by 2B2/n. Hoeffding then gives that

P | fk, fl n fk, fl L2| > 2δB2 2 exp nδ2

Symmetry of inner product guarantees that there are q(q + 1)/2 distinct sums | fk, fl n fk, fl L2| as we vary k, l over 1, . . . , q and so the ﬁrst inequality of the theorem follows from a union bound.

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

Lemma 8 Let F be the sampled version of F admitting a singular value type decomposition

k=1 σkukvk =

Under the conditions of Lemma 7, the k-th singular value of F satisﬁes

1 nσk( F) σk

2B q2 log n

with probability at least 1 q(q + 1)/n2.

Proof. Recall that the matrix f Rq n collects the sampled values of the fk in its rows. Note that the matrix f f = (akl)q q with akl = Pn i=1 fk(xi)fl(xi) = n fk, fl n. Furthermore, we may write

k=1 n fk, fk L2eke k + = n

k=1 σ2 keke k + ,

where the matrix is composed of elements kl = n( fk, fl n fk, fl L2). Thus we have

F F = U f f U = n

k=1 σ2 kuku k + U U

and F F , , and thus U U , are both real and symmetric. Therefore a well known perturbation result for matrices (Weyl, 1912) implies that

max k |σ2 k( F) nσ2 k| U U .

Since U U 2 = 2 2 F = Pq k,l 2 k,l, by Lemma 7 it implies that, with probability at least 1 q(q + 1)/n2, U U 2 16B4q2n log n,

which completes the proof by collecting the above results.

Lemma 9 Suppose that the design D = {x1, . . . , xn} is ﬁxed. Let Y = F + Z Rp n denote the data matrix and F admits a singular value decomposition F = PΛQ . Then it holds that

P n σ2 r(Y P[r]) (σ2 r( F) + n)(1 t) o Cexp Cr c(σ2 r( F) + n)t t2 ,

P n σ2 r+1(Y ) (σ2 r+1( F) + n)(1 + t) o Cexp Cp c(σ2 p( F) + n)t t2 .

Moreover, there exists C0 only depending on C and c, such that whenever σ2 r( F) C0p, for any t > 0 we have

P n P(Y P[r])Y P [r] t o Cexp Cp ct2 q

σ2r( F) + nt +Cexp c(σ2 r( F) + n) .

ZHOU, KOUDSTAAL, YU, KONG AND YAO

Proof. The proof mainly uses random matrix theory and follows the lines of proof of Lemma 4 in Cai and Zhang (2018). (A1) Low bounds for σ2 r(Y P[r]) Note that E(Y Y ) = F F + n Ip = PΛ2P + n Ip, we have

E(P [r]Y Y P[r]) = P [r]PΛ2P P[r] + n P [r]Ip P[r] = Λ2 [r] + n Ir,

where Λ[r] is a diagonal matrix consisting the ﬁrst r diagonal elements of Λ. Let M[r] = (Λ2 [r] +

n Ir) 1/2 and then E(M [r]P [r]Y Y P[r]M[r]) = Ir. Since σ2 r(Y P[r]M[r]) σ2 r(Y P[r])σ2 max(M[r]), σ2 r(Y P[r]M[r]) = σr(M [r]P [r]Y Y P[r]M[r]) and

1 σ2 r(Y P[r]M[r]) = σr(Ir M [r]P [r]Y Y P[r]M[r]) Ir M [r]P [r]Y Y P[r]M[r] ,

we have σ2 r(Y P[r]) (σ2 r( F) + n) 1 Ir M [r]P [r]Y Y P[r]M[r] . (13)

In the following we need an upper bound for Ir M [r]P [r]Y Y P[r]M[r] . For any unit vector u Rr,

u M [r]P [r]Y Y P[r]M[r] Ir u = u M [r]P [r] F F P[r]M[r]u E u M [r]P [r] F F P[r]M[r]u

+ 2 u M [r]P [r] FZ P[r]M[r]u E u M [r]P [r] FZ P[r]M[r]u

+ u M [r]P [r]ZZ P[r]M[r]u E u M [r]P [r]ZZ P[r]M[r]u

= 0 + 2u M [r]P [r] FZ P[r]M[r]u

+ u M [r]P [r] ZZ E(ZZ ) P[r]M[r]u.

Due to u M [r]P [r] FZ P[r]M[r]u = tr Z P[r]M[r]u F P[r]M[r]u , using general Hoeffding inequality (see Theorem 2.6.3 in Vershynin, 2019), it holds that

P n tr Z P[r]M[r]u F P[r]M[r]u t o 2exp ct2

P[r]M[r]u F P[r]M[r]u 2 F

M[r] 2 Λ[r]M[r] 2

2exp ct2(σ2 r( F) + n) .

On the other hand, using Hanson-Wright inequality (see Theorem 6.2.1 in Vershynin, 2019), it holds that

P n u M [r]P [r] ZZ E(ZZ ) P[r]M[r]u t o 2exp c t2

P[r]M[r]u 4n t P[r]M[r]u 2

ct2(σ2 r( F) + n)2

n t(σ2 r( F) + n)

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

Together with these we obtain that

P n u M [r]P [r]Y Y P[r]M[r] Ir u t o Cexp c(σ2 r( F) + n)t t2 .

The ϵ-net argument in Lemma 5 in Cai and Zhang (2018) leads to

P n M [r]P [r]Y Y P[r]M[r] Ir t o Cexp Cr c(σ2 r( F) + n)t t2 . (14)

which together with (13) deduces that

P n σ2 r(Y P[r]) (σ2 r( F) + n)(1 t) o 1 Cexp Cr c(σ2 r( F) + n)t t2 .

(A2) Upper bounds for σ2 r+1(Y ) Using the best rank-r approximation of Y (see Eckart-Young-Mirsky Theorem on page 73 in Vershynin, 2019),

σr+1(Y ) = min rank(A) r Y A Y Y P[r]P [r] = Y P [r](P [r]) = Y P [r] ,

which switch our focus from σr+1(Y ) to σmax(Y P [r]). Since E (P [r]) Y Y P [r] = (Λ [r])2 + n Ip r where Λ [r] denotes the diagonal matrix with

eliminating the ﬁrst r diagonal elements of Λ, let M [r] = (Λ [r])2 + n Ip r 1/2 and then

σ2 max(Y P [r]) (P [r]M [r]) Y Y P [r]M [r] σ2 r+1( F) + n

(P [r]M [r]) Y Y P [r]M [r] Ip r + 1 σ2 r+1( F) + n .

Following the same arguments for the proof of (14), we have

P n (P [r]M [r]) Y Y P [r]M [r] Ip r t o Cexp C(p r) c(σ2 p( F) + n)t t2 ,

which deduces that

P n σ2 r+1(Y ) (σ2 r+1( F) + n)(1 + t) o 1 Cexp Cp c(σ2 p( F) + n)t t2 .

(A3) Upper bounds for P(Y P[r])Y P [r] Since

P(Y P[r])Y P [r] = P(Y P[r]M[r])Y P [r]

= (Y P[r]M[r]) (Y P[r]M[r]) (Y P[r]M[r]) 1 (Y P[r]M[r]) Y P [r]

σ 1 min(Y P[r]M[r]) M [r]P [r]Y Y P [r] ,

we shall analyze σmin(Y P[r]M[r]) and M [r]P [r]Y Y P [r] separately below. Similar to (13),

σ2 min(Y P[r]M[r]) 1 Ir M [r]P [r]Y Y P[r]M[r]

ZHOU, KOUDSTAAL, YU, KONG AND YAO

and by (14) it implies that

P n σ2 min(Y P[r]M[r]) 1 t o 1 Cexp Cr c(σ2 r( F) + n)t t2 .

Setting t = 1/2 and choosing C0 large enough such that σ2 r( F) C0p C0r, we have Cr c(σ2 r( F) + n)t t2 c(σ2 r( F) + n)/8, leading to that

P n σ2 min(Y P[r]M[r]) 1/2 o 1 Cexp c(σ2 r( F) + n) . (15)

For M [r]P [r]Y Y P [r] , since P [r] F F P [r] = 0, we have the decomposition

u M [r]P [r]Y Y P [r]v = u M [r]P [r] FZ P [r] + M [r]P [r]Z F P [r] + M [r]P [r]ZZ P [r] v.

Following the same proof of (14) again, we can show that

P n M [r]P [r]Y Y P [r] t o Cexp C(p r) ct2 q

σ2r( F) + nt . (16)

Combing (15) and (16), we obtain

P n P(Y P[r])Y P [r] t o Cexp Cp ct2 q

σ2r( F) + nt +Cexp c(σ2 r( F) + n) .

Then the proof is complete.

Lemma 10 Denote the design D = {x1, . . . , xn}. Assume that there exists a constant Cmax large enough such that σ2 max( F) Cmaxn, then it holds that

E sin Θ( ˆU[r], P[r]) 4 D p2(σ2 r( F) + n)2 σ2r( F) σ2 r+1( F) 4 1.

Proof. Since the left singular vectors of Y are just the right singular vectors of Y , we can apply the same arguments for the right singular vectors of Y to get bounds for estimation of the left singular vectors of Y . Thus, by Proposition 1 in Cai and Zhang (2018),

sin Θ( ˆU[r], P[r]) σr(Y P[r]) P(Y P[r])Y P [r]

σ2r(Y P[r]) σ2 r+1(Y ) .

Since || sin Θ( ˆU[r], P[r])|| 1, to complete the proof we only need focus on the case that σ2 r( F) σ2 r+1( F) 2 C0p n + σ2 r( F) for large C0 only depending on Cmax, C, c. Note that in this case it holds that σ2 r( F) C0p, then by Lemma 9 we have

σ2 r(Y P[r]) 2σ2 r( F)

3 + σ2 r+1( F)

σ2 r( F) σ2 r+1( F) 2

σ2r( F) + n

σ2 r+1(Y ) 2σ2 r+1( F)

3 + σ2 r( F)

σ2 p( F) + n σ2 r( F) σ2 r+1( F) 2

σ2r( F) + n 2

P n P(Y P[r])Y P [r] t o Cexp Cp ct2 q

σ2r( F) + nt + Cexp c(σ2 r( F) + n) .

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

Furthermore,

P P(Y P[r])Y P [r] q

σ2r( F) + n Cexp Cp c(σ2 r( F) + n) .

Denote the event Q as

Q = σ2 r(Y P[r]) 2σ2 r( F)

3 + σ2 r+1( F)

3 + n, σ2 r+1(Y ) 2σ2 r+1( F)

3 + σ2 r( F)

P(Y P[r])Y P [r] q

σ2r( F) + n .

Now, under the event Q, it holds that

sin Θ( ˆU[r], P[r]) 4 σ4 r(Y P[r]) P(Y P[r])Y P [r] 4 σ2r(Y P[r]) σ2 r+1(Y ) 4 C(σ2 r( F) + n)2 P(Y P[r])Y P [r] 4 σ2r( F) σ2 r+1( F) 4 ,

where we use the fact that u2/(u2 v2)2 is a decreasing function of u and increasing function of v when u > v > 0. Thus for ﬁxed D, it gives that

E sin Θ( ˆU[r], P[r]) 4 = E sin Θ( ˆU[r], P[r]) 41Q + E sin Θ( ˆU[r], P[r]) 41Qc

C(σ2 r( F) + n)2 σ2r( F) σ2 r+1( F) 4 E P(Y P[r])Y P [r] 41Q + P Qc .

For large C0, it holds that

Cr Cp c C0p

2Cmax c 2Cmax

σ2 r( F) σ2 r+1( F) 2

σ2r( F) + n .

By the condition σ2 r( F) Cmaxn, note that

σ2 p( F) + n

σ2r( F) + n n σ2r( F) + n 1 Cmax + 1 > 0.

Together with these we can show that

σ2 r( F) σ2 r+1( F) 2

σ2r( F) + n c0

σ2 r( F) σ2 r+1( F) 2

σ2r( F) + n ,

Cp c(σ2 p( F) + n) σ2 r( F) σ2 r+1( F) 2

(σ2r( F) + n)2 c0

σ2 r( F) σ2 r+1( F) 2

σ2r( F) + n ,

Cp c(σ2 r( F) + n) Cp c

σ2 r( F) σ2 r+1( F) 2

σ2r( F) + n c0

σ2 r( F) σ2 r+1( F) 2

σ2r( F) + n ,

ZHOU, KOUDSTAAL, YU, KONG AND YAO

where c0 > 0 only depends on C0, Cmax, C, c. Using the basic property of exponential function, one can see that

P (Qc) Cexp c0

σ2 r( F) σ2 r+1( F) 2

σ2r( F) + n

Cp2(σ2 r( F) + n)2 σ2r( F) σ2 r+1( F) 4 .

Thus for the desired extension, it remains to show that E P(Y P[r])Y P [r] 41Q Cp2.

As in the proof, we let T = P(Y P[r])Y P [r] and apply Lemma 9 again,

ET 41Q ET 41{T 2 σ2r( F)+n} = Z

0 P T 41{T 2 σ2r( F)+n} t dt

δ2p2 + Z (σ2 r( F)+n)2

δ2p2 P T t1/4 dt

δ2p2 + Z (σ2 r( F)+n)2

δ2p2 C e Cp c

t + e c(σ2 r( F)+n) dt

δ2p2 + C(σ2 r( F) + n)2e c(σ2 r( F)+n) + 2C(1 + cδp)

c2 e(C cδ)p

δ2p2 + C + 2C(1 + cδp)

c2 e(C cδ)p.

It is seen that as long as we choose δ large enough, but only depending on C and c, it is guaranteed that ET 41Q δ2p2.

Robert A. Adams and John J. F. Fournier. Sobolev Spaces. Academic Press, 2th edition, 2003.

Deanna M. Barch, Gregory C. Burgess, Michael P. Harms, Steven E. Petersen, and et al. Function in the human connectome: Task-fmri and individual differences in behavior. Neuro Image, 80:169 189, 2013.

Leo Breiman and Jerome H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59 (1):3 54, 1997.

Randy L. Buckner, Fenna M. Krienen, Angela Castellanos, Julio C. Diaz, and B. T. Thomas Yeo. The organization of the human cerebellum estimated by intrinsic functional connectivity. Journal of Neurophysiology, 106(5):2322 2345, 2011.

T. Tony Cai. Minimax and adaptive inference in nonparametric function estimation. Statistical Science, 27(1):31 50, 2012.

T. Tony Cai and Lawrence D. Brown. Wavelet estimation for samples with random uniform design. Statistics & Probability Letters, 42(3):313 321, 1999.

NONPARAMETRIC PRINCIPAL SUBSPACE REGRESSION

T. Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. Annals of Statistics, 46(1):60 89, 2018.

Rahul S. Desikan, Florent S egonne, Bruce Fischl, Brian T. Quinn, and et al. An automated labeling system for subdividing the human cerebral cortex on mri scans into gyral based regions of interest. Neuro Image, 31(3):968 980, 2006.

David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425 455, 1994.

Daniele Durante, Bruno Scarpa, and David B. Dunson. Locally adaptive factor processes for multivariate time series. Journal of Machine Learning Research, 15(1):1493 1522, 2014.

Robert Engle and Mark Watson. A one-factor multivariate time series model of metropolitan wage rates. Journal of the American Statistical Association, 76(376):774 781, 1981.

Jianqing Fan and Irene Gijbels. Local Polynomial Modelling and Its Applications. Chapman and Hall, London, 1st edition, 1996.

Jianqing Fan, Yuan Liao, and Martina Mincheva. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4):603 680, 2013.

Chamberlain Gary and Michael J. Rothschild. Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica, 51:1305 1324, 1983.

Tailen Hsing and Randall L. Eubank. Theoretical Foundations of Functional Data Analysis, with An Introduction to Linear Operators. Wiley, West Sussex, 2015.

Jianhua Z. Huang, Haipeng Shen, and Andreas Buja. The analysis of two-way functional data using two-way regularized singular value decompositions. Journal of the American Statistical Association, 104(488):1609 1620, 2009.

Iain M. Johnstone. Gaussian Estimation: Sequence and Wavelet Models. Unpublished manuscript, 2017.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press Cambridge, MA, 2006.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, 1st edition, 2009.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 1st edition, 2019.

Martin J. Wainwright. High-dimensional Statistics: A Non-asymptotic Viewpoint. Cambridge University Press, 1st edition, 2019.

ZHOU, KOUDSTAAL, YU, KONG AND YAO

Hermann Weyl. Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Mathematische Annalen, 71(4):441 479, 1912.

B. T. Thomas Yeo, Fenna M. Krienen, Jorge Sepulcre, Mert R. Sabuncu, and et al. The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of Neurophysiology, 106(3):1125 1165, 2011.