# minimax_nonparametric_parallelism_test__9a482f7d.pdf

Journal of Machine Learning Research 21 (2020) 1-47 Submitted 9/19; Revised 4/20; Published 5/20

Minimax Nonparametric Parallelism Test

Xin Xing XINXING@VT.EDU

Meimei Liu MEIMEILIU@VT.EDU Department of Statistics Virginia Tech, Blacksburg, VA, 24061, USA

Ping Ma PINGMA@UGA.EDU

Wenxuan Zhong WENXUAN@UGA.EDU Department of Statistics University of Georgia, Athens, GA 30601, USA

Editor: Xiaotong Shen

Testing the hypothesis of parallelism is a fundamental statistical problem arising from many applied sciences. In this paper, we develop a nonparametric parallelism test for inferring whether the trends are parallel in treatment and control groups. In particular, the proposed nonparametric parallelism test is a Wald type test based on a smoothing spline ANOVA (SSANOVA) model which can characterize the complex patterns of the data. We derive that the asymptotic null distribution of the test statistic is a Chi-square distribution, unveiling a new version of Wilks phenomenon. Notably, we establish the minimax sharp lower bound of the distinguishable rate for the nonparametric parallelism test by using the information theory, and further prove that the proposed test is minimax optimal. Simulation studies are conducted to investigate the empirical performance of the proposed test. DNA methylation and neuroimaging studies are presented to illustrate potential applications of the test. The software is available at https://github.com/Bio Algs/Parallelism . Keywords: asymptotic distribution, minimax optimality, nonparametric inference, parallelism test, penalized least squares, smoothing spline ANOVA, Wald test

1. Introduction

The assessment of parallelism is a fundamental problem in statistical inference and arises from many applications. For example, in genomic studies, one of primary interest is to detect genes with nonparallel expression patterns in time course studies (Storey et al., 2005; Ma et al., 2009). Another motivating example is in epigenomics, researchers are interested in testing whether the patterns of DNA methylation intensities along genome in the treatment and control groups are parallel or not (Hansen et al., 2012). The abnormal DNA methylation patterns are associated with changes in many important biological processes such as imprinting, X-chromosome inactivation, and aging (Sch ubeler, 2015). In functional neuroimaging, a common problem is to detect nonparallel signals (Nichols and Holmes, 2002; Orrison et al., 2017) among different brain regions. There is an immense literature focused on the analysis of the parallelism of trends using linear model-based approaches, ranging from simple ANOVA (Sthle and Wold, 1989) to linear mixed models (Vossoughi et al., 2016). However, the linear model-based approaches have a limited ability to parsimoniously represent non-linear structures in complex data. Nonparametric parallelism

c 2020 Xin Xing, Meimei Liu, Ping Ma and Wenxuan Zhong.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-800.html.

XING, LIU, MA AND ZHONG

comparison methods have drawn huge attention due to the modeling ﬂexibility. Munk and Dette (1998) developed a test statistic through a weighted L2 distances for the regression functions based on similar equal-spaced ﬁxed design. Degras et al. (2011) tested the parallelism of multiple time series based on the L2 distances between the local linear estimator of each individual curve and the global one for time series data when the time points are evenly spaced. Wang (1998) proposed a wavelet-based method to measure the changes of curves. Liu and Wang (2004) compared different nonparametric testing methods and showed that the performances of these tests depend on the shape of the true function. Ma et al. (2009) proposed an approximate F-test to detect nonparallel patterns in time course gene expression data with a more ﬂexible random design. However, rigorous testing methods with optimal power guarantees are still lacking in the existing nonparametric parallelism literature. The key cause of such research gap is that distinguishing from the simple/linear/polynomial null hypothesis, the parameter space of the null hypothesis for the nonparametric parallelism testing is a nonparametric function class with inﬁnite dimension. How to conduct a rigorous test for such composite functional null hypothesis is still an open question. A major motivation of this article is on developing a nonparametric parallelism testing approach that detects the signiﬁcance of the nonparallel effect, while guarantees statistical optimality in the sense of minimax testing rate, facilitating the power performance analysis. In this article, we develop a nonparametric parallelism test based on the decomposition of tensor product reproducing Hilbert space (RKHS) (Wahba, 1990; Gu, 2013; Wang, 2011) under both ﬁxed and random design. Tensor product RKHS provides a ﬂexible space for modeling complex functions; see Wahba et al. (1995), Wood (2003) and reference therein. For the simplicity of description, we consider the case that there are two predictors only. Suppose the response variable Yij is the observed value of the jth subject at the ith time or spatial location for i = 1, , n and j = 1, , s. Yij depends on two predictors x 1 i and x 2 j through an unknown bivariate function f( , ) H, the

tensor product RKHS, where x 1 i X1 = [0, 1] is a continuous variable representing the ith time or the i-th spatial location, and x 2 j X2 = {0, 1} is a discrete variable representing the jth subject in

different groups, x 2 j = 1 represents the jth subject in treatment group, otherwise in control group. That is, Yij = f(x 1 i , x 2 j ) + ϵij, i = 1, n, j = 1, , s, (1)

where ϵijs are i.i.d. random noise following a normal distribution with mean zero, variance σ2, and s is the number of subjects. Each subject can be represented by a curve. When s = 2, there are two curves in total and each group has only one curve. When s > 2, we have multiple curves in each group. We assume the i.i.d. random noise since, in many scientiﬁc experiments, the random errors are attributed to environmental factors independent of the time points or spatial location. For example, in the f MRI data analysis in Section 6, the error is mostly attributed to the random movement of the head and imaging noise which are independent with the time. Analogous to the classical ANOVA decomposition, f H has the smoothing spline ANOVA (SSANOVA) decomposition (Wahba, 1990):

f(x 1 i , x 2 j ) = f00 + f10(x 1 i ) + f01(x 2 j ) + f11(x 1 i , x 2 j ), (2)

where f00 is the grand mean, f10 and f01 are the main effects, and f11 is the nonparallel effect. When f11 = 0 (see the left panel in Figure 1), the curves in two groups are parallel. Then f11 = 0 is equivalent to that f(x 1 , 0) and f(x 1 , 1) are parallel. When f11 = 0 (see the right panel in

MINIMAX NONPARAMETRIC PARALLELISM TEST

𝑥𝑥2 = 1 𝑥𝑥2 = 0

Figure 1: An illustration of two scenarios of a bivariate function f(x 1 , x 2 ), where x 1 is continuous, x 2 only takes two values, 0 and 1. Left panel: the scenario with f11 = 0, i.e., f(x 1 , 0) and f(x 1 , 1) are parallel. Right panel: the scenario with f11 = 0, i.e., f(x 1 , 0) and f(x 1 , 1) are nonparallel.

Figure 1), the magnitude of ||f11||2 2 characterizes the signiﬁcance of the non-parallelism between the treatment and control groups, where f11 2 2 = P1 x 2 =0 R 1 0 f2(x 1 , x 2 )dω1, with ω1 as the marginal density of x 1 . Statistically, the hypothesis testing for parallelism can be formulated as

H0 : f11 = 0 vs H1 : f11 = 0. (3)

We introduce two concrete examples which motivate our study. Example 1. DNA methylation in case-control study. DNA methylation is an essential epigenetic mechanism that regulates gene expression. Aberrant DNA methylation contributes to a number of human diseases including cancer (Stach et al., 2003). In a typical case-control study of DNA methylation (Filarsky et al., 2016), the DNA methylation level, denoted as Yij at the ith location x 1 i on the genome for the jth individual in group x 2 j , can be modeled using Equation (1), where f is an unknown function with the SSANOVA decomposition in Equation (2). A primary focus is to infer whether the DNA methylation levels have different proﬁles along the genome between the case and control groups, i.e., testing the presence/absence of nonparallel effect f11 as in Equation (3). Example 2. Neuroimaging using functional magnetic resonance imaging (f MRI). f MRI is a powerful neuroimaging technology for the diagnosis of many brain-related diseases. It measures brain activity by detecting changes associated with blood ﬂow. The primary form of f MRI uses the blood-oxygen-level dependent (BOLD) as signal (Huettel et al., 2004). In many case-control studies, the BOLD signal, Yij, at the ith time x 1 i for the jth subject in group x 2 j is measured for a particular region of interest (ROI), and can be modeled using Equation (1), where f is an unknown function with the SSANOVA decomposition in Equation (2). The goal is to test whether the BOLD signals in two groups have same patterns along the time, i.e., test the signiﬁcance of nonparallel effect f11 in Equation (3). We ﬁrst establish the minimax lower bound for nonparametric parallelism test in Equation (3) for general testing rules with the aid of tensor product decomposition of RKHS and the information theory. The tensor product decomposition in Equation (2) enables us to quantify the magnitude of nonparalelism by ||f11||2, where || ||2 is the L2 norm. Intuitively, the smaller ||f11||2 is, the harder it is to distinguish the alternative hypothesis from the null. In analyzing the power performance, we consider a slightly different alternative hypothesis,

H 1 : ||f11||2 dn, (4)

XING, LIU, MA AND ZHONG

where we remove the neighborhood within the dn distance of f11 = 0 from the original alternative H1. Here the sequence dn is called the distinguishable rate (or separation rate) (Ingster and Suslina, 2012; Gin e and Nickl, 2015). We ﬁrst introduce a geometric interpretation of the testing problem in Equation (3), and then establish a general minimax lower bound for the distinguishable rate for the nonparametric parallelism test using the Bernstein k-width in information theory (Pinkus, 2012). Bernstein k-width provides a geometric measure of the distinguishable rate and is easy to evaluate in the tensor product RKHS. Recently, similar technique was also used in analyzing the testing problems over cones and studied in Gaussian sequence models (Wei and Wainwright, 2020).

In addition, we propose a Wald-type test statistic as the squared empirical norm of the penalized least square estimator of f11. We derive its asymptotic null distribution, which satisﬁes the Wilks phenomenon. The asymptotic distribution of our test statistic is Gaussian, and the testing rule does not depend on any unknown quantities, thus is easy to compute. We can further reduce the computational cost by applying many popular fast computation methods such as fast random kernel methods Alaoui and Mahoney (2015) and subsampling methods such as Ma et al. (2015); Kim and Gu (2004). We note that our proposed Wald-type test distinguishes from the existing nonparametric testing methods as follows. The existing testing procedures mostly consider simple null hypothesis, such as the generalized likelihood ratio test in Fan et al. (2001), the penalized likelihood ratio test in Shang and Cheng (2013), the wavelet based method in Shen et al. (2002), and kernelized Stein method in Liu et al. (2016), whereas we consider a composite null hypothesis. More importantly, there is a nontrivial technical complication in addition to the above model setting difference. The composite null hypothesis H0 : f11 = 0 here deﬁnes a nonparametric function in an inﬁnite-dimensional functional space rather than a parametric function in a ﬁnite-dimensional parameter space as required in Shang and Cheng (2013), because testing H0 : f11 = 0 is equivalent to testing H0 : f {f00 + f10 + f01}. Developing the limiting distribution of the test statistic in an inﬁnite-dimensional null hypothesis space and quantifying the testing difﬁculty are very challenging since the distribution relies on the more delicate tensor product decomposition of the RKHS.

We further prove that the upper bound of the distinguishable rate for the proposed Wald type test matches the established minimax lower bound. Thus the proposed Wald-type test is minimax optimal. To the best of our knowledge, our work is the ﬁrst one in establishing the minimax nonparametric parallelism test. Based on the Wald-type test statistic, we propose a data-adaptive choice of the regularization parameter with testing optimality guarantee.

The rest of the paper is organized as follows. We introduce the background of tensor product RKHS in Section 2. In Section 3, we introduce a minimax principle and a geometric interpretation of the parallelism testing problem. In Section 4, we derive the minimax lower bound of the distinguishable rate for general parallelism test using the information theory. Section 5 presents various simulation studies demonstrating substantial performance of our testing method, and Section 6 applies the methods to genome-wide anomaly of DNA methylation in chronic lymphocytic leukemia patients and brain function change in patients with Alzheimer disease. We conclude with a few remarks in Section 7. All technical proofs are relegated to the Appendix and Supplementary Material.

2. Background

In this section, we introduce some background of the tensor product RKHS, its tensor product decomposition, together with the penalized least square estimation.

MINIMAX NONPARAMETRIC PARALLELISM TEST

2.1. Reproducing Kernel Hilbert Space

Given an RKHS H with an inner product , H, there exists a symmetric and square integrable function K( , ) : X X R such that

f, K(x, ) H = f(x), for all f H and x X.

We call K as the reproducing kernel of H. By Mercer s theorem, any continuous kernel has the following decomposition

ν=0 λνϕν(x)ϕν(y), (5)

where λνs are non-negative descending eigenvalues and ϕνs are eigen-functions. We consider the bivariate function f in Equation (1) on the product domain X1 X2. We assume that f is a function in a tensor product RKHS (Lin, 2000)

H = H 1 H 2 . (6)

Given the Hilbert space H 1 and H 2 , H 1 H 2 is deﬁned as the completion of the class of functions with the form PM i=1 η1i(x)η2i(y), for η1i H 1 , η2i H 2 , and M is any positive integer. We consider H 1 as an mth order homogeneous Sobolev space, i.e.,

H 1 = {η1 L2[0, 1] | η(k) 1 is absolutely continuous and η(k) 1 (0) = η(k) 1 (1)

for k = 0, 1, . . . , m 1, η(m) 1 L2[0, 1]},

and H 2 is a two-dimensional Euclidean space with standard Euclidean norm. Assume that H 1 has the eigenvalue and eigenvector pairs {µi, φi} i=0 and H 2 has the eigenvalue and eigenvector pairs {νj, ψj}2 j=1. Then we have the eigenvalue and eigenvector pairs for the kernel function K in H as

{µiνj, φiψj} for i = 0, . . . , , j = 1, 2, (7)

in the decomposition in Equation (5). We refer Equation (7) as the eigensystem for H. We further denote , H as the product norm induced by the norm on the marginal space H1 and H2 (Lin, 2000). Using the Riesz representation theorem (Sch olkopf et al., 2001), we can easily represent any function f H as in the following Lemma.

Lemma 1 Given the sampling points xij = (x 1 i , x 2 j ), i = 1, , n and j = 1, s, for any f in a reproducing kernel Hilbert space H, there exists a set of reproducing kernels Kxij( , ) such that

f(x 1 , x 2 ) =

j=1 αij Kxij(x 1 , x 2 ) + ρ(x 1 , x 2 ). (8)

Lemma 1 implies that f can be expressed as a sum of a linear expansion of Kxij and a nonlinear function ρ. Notice that when (x 1 , x 2 ) {xij}j=1, ,s i=1, ,n, we have ρ(x 1 , x 2 ) = 0. Thus, ρ( , ) can be considered as a residual that quantiﬁes the unknown information of function f. To get an estimate of f, we only need to specify Kxij( , ) and estimate αij. Next, we provide a way to construct the reproducing kernels Kxij( , ). In order to do that, we need the following two lemmas.

XING, LIU, MA AND ZHONG

Lemma 2 Suppose K 1 is the reproducing kernel of H 1 on X1, and K 2 is the reproducing kernel of H 2 on X2. Then the reproducing kernels of H 1 H 2 on X = X1 X2 is K(x, z) = K 1 (x 1 , z 1 )K 2 (x 2 , z 2 ) with x = (x 1 , x 2 ) and z = (z 1 , z 2 ).

Lemma 3 For every Sobolev space H of functions on X, there corresponds a unique reproducing kernel K, which is non-negative deﬁnite. If K0 and K1 are both non-negative deﬁnite reproducing kernels for H0 and H1, and H0 T H1 = {0}, then H0 H1 has a reproducing kernel K = K0 + K1.

Lemmas 2 and 3 can be easily proved based on Theorems 2.3 to 2.6 in Gu (2013). Lemma 2 states that the reproducing kernel of the tensor product space is the product of the reproducing kernels. Lemma 3 states that the reproducing kernel of a tensor sum space is the sum of the reproducing kernels. Therefore, to construct Kxij( , ), we introduce the decomposition of tensor product space in the following part.

2.2. Decomposition of Tensor Product Space

For any η1 H 1 and η2 H 2 , deﬁne the averaging operators A1 : η1 R 1 0 η1(x)dx and A2 : η2 1

2 P2 k=1 η2(k) where η2(k) = e T k η2, ek is the unit vector with the kth element one and all other

elements zeros. We have H 1 and H 2 with the following tensor sum decomposition H 1 0 H 1 1 and H 2 0 H 2 1 respectively, where H 1 0 = {A1η1 | η1 H 1 }, H 2 0 = {A2η2 | η2 H 2 }, H 1 1 = {(I A1)η1 | η1 H 1 }, H 2 1 = {(I A2)η2 | η2 R2}, and I is the identity operator. Thus H has the following tensor sum decomposition

H = (H 1 0 H 2 0 ) (H 1 1 H 2 0 ) (H 1 0 H 2 1 ) (H 1 1 H 2 1 ), (9)

and for any f H 1 H 2 , we have

f = f00 + f10 + f01 + f11, (10)

where f00 = A1A2f H 1 0 H 2 0 , f10 = (I A1)A2f H 1 1 H 2 0 , f01 = A1(I A2)f H 1 0 H 2 1 and f11 = (I A1)(I A2)f H 1 1 H 2 1 . Thus, any function f H can be decomposed uniquely as : f00 the interception, f10 and f01 the marginal effects and f11 the two-way interaction term. Denote the reproducing kernels of H 1 0 , H 2 0 , H 1 1 , H 2 1 as K 1 0 , K 1 1 , K 2 0 , K 2 1 , respectively. Speciﬁcally, K 1 0 (x 1 , z 1 ) = 1 and K 1 1 (x 1 , z 1 ) is deﬁned as ( 1)m 1k2m(z 1 x 1 ) for the mth order homogeneous subspace where kr( ) is the rth order scaled Bernoulli polynomials (Abramowitz and Stegun, 1964; Gu, 2013) and 1( ) is the indicator function. K 2 0 (x 2 , z 2 ) = 1/2

and K 2 1 (x 2 , z 2 ) = 1(z 2 =x 2 ) 1/2 on X2. Let Hℓℓ = H 1 ℓ H 2 ℓ with reproducing kernel Kℓℓ , where Kℓℓ (xij, xi j ) = K 1 ℓ(x 1 i , x 1 i )K 2 ℓ (x 2 j , x 2 j ),

for ℓ, ℓ {0, 1}. The induced inner product of Hℓℓ is denoted as fℓℓ , gℓℓ ℓℓ , where fℓℓ and gℓℓ are projections of f and g on Hℓℓ respectively, ℓ, ℓ {0, 1}. Notice that the metrics induced by inner products fℓℓ , gℓℓ ℓℓ are not necessarily of the same scale for different ℓℓ . The inner product for H can be deﬁned as f, g H = X

ℓℓ θ 1 ℓℓ fℓℓ , gℓℓ ℓℓ , (11)

MINIMAX NONPARAMETRIC PARALLELISM TEST

where θℓℓ s re-scale the metrics on different Hℓℓ , , ℓℓ is the restricted norm of , H on Hℓℓ . Based on Lemmas 2 and 3, we can easily show that the reproducing kernels associated with Equation (11) is K(xij, xi j ) = P ℓ,ℓ θℓℓ Kℓℓ (xij, xi j ) with ℓ, ℓ = 0, 1. Thus, given the sampling

points xij = (x 1 i , x 2 j ) for i = 1, , n and j = 1, , s, the kernel function in H is a bivariate function depending on xij, i.e.,

Kxij(x 1 , x 2 ) = θ00

2 + θ01(1(x 2 =x 2 j ) 1

2) + θ10 1 2K 1 1 (x 1 , z 1 )

+ θ11 1 2(1(x 2 =x 2 j ) 1

2)K 1 1 (x 1 , z 1 ), (12)

and accordingly f(x 1 , x 2 ) = P ij αij Kxij(x 1 , x 2 ) + ρ(x 1 , x 2 ) by Lemma 1. In the function decomposition in Equation (10), it is easy to verify that f00 H00 = g : g = {(θ00 θ01)/2} P ij αij . As f00 is a constant for any x 1 and x 2 , it is analogous to the ground mean in classical ANOVA models. Similarly, we have f01 H01 = {g : g = θ01 P ij αij1(x 2 =x 2 j )}. Recall that x 2 j can only be either 0 or 1, we can rewrite f01 as 1(x 2 =0)β0+

1(x 2 =1)β1, where β0 = Ps j=1(Pn i=1 αij)1(x 2 j =0) and β1 = Ps j=1(Pn i=1 αij)1(x 2 j =1).

We remark that f00 and f01 are all in a ﬁnite-dimensional space. The space H10 (where f10 lies in) spanned by the third term in the right hand side of Equation (12) is, however, an inﬁnite-dimensional space, because we have uncountable x X1. The function can be expressed as a linear combination of the observed reproducing kernels plus a residual that quantiﬁes the unobserved reproducing kernels, i.e., H10 = {g : g = 1

2 Pn i=1(θ10 Ps j=1 αij)K 1 1 (x 1 , z 1 ) + ρ2}. Notice that function in this space only changes as we change x 1 . Thus, the third term in right hand side of (12) can be used to quantify the effect of the continuous variable such as the temporal effect. The forth term in the right hand side of Equation (12) varies for both continuous variable and the case-control indicator, thus it is the term that can catch different functional patterns between the case and control. Similarly, the space spanned by the last addend is also an inﬁnite-dimensional space because we still have an inﬁnite number of unobserved kernel functions in addition to the n s observed kernel functions. Thus, we have f11 H11 = {g : g = 1

2θ11 P ij αij(1(x 2 j =x 2 ) 1

2)K 1 1 (x 1 , z 1 ) + ρ12}. Clearly, to test

if two functions are parallel to each other, we only need to test if f11 = 0.

2.3. Penalized Least Squares

Here we introduce the penalized least square estimate of f H, and the interaction term f11 in Equation (10). Given the sampling points xij = (x 1 i , x 2 j ) for i = 1, . . . , n and j = 1, . . . , s, consider the model space

Hmodel = {g : g =

j=1 αij Kxij(x 1 , x 2 )},

a closed linear subspace of H. αijs are the regression coefﬁcients, and the bivariate residual function ρ( , ) in Lemma 1 is in Hresidual = H Hmodel. Notice that ρ(x 1 i , x 2 j ) = Kxij(x 1 , x 2 ), ρ = 0 because of the orthogonality constraint between Hmodel and Hresidual. Then, f can be estimated

XING, LIU, MA AND ZHONG

by minimizing the penalized least squares functional as follows:

i j αi j Kxi j (x 1 i , x 2 j ))2 + λJ(f10 + f11), (13)

where the quadratic functional J(f) = J(f10 + f11) = f10 + f11 2 H quantiﬁes the roughness of f10 and f11, the smoothing parameter λ controls the trade-off between the goodness-of-ﬁt and the roughness of f10 and f11. Recall ρ and Kxi j ( , ) are orthogonal to each other. Plugging Equation (8) into J(f), we have

i j αi j (θ10K10 xi j + θ11K11 xi j ), X

i j αi j (θ10K10 xi j + θ11K11 xi j ) H + ρ, ρ H.

Further notice that Kℓℓ xij, Kℓℓ xi j = Kℓℓ xij(x 1 i , x 2 j ) by the reproducible property of reproducing

kernels (Gu, 2013). Thus, substituting K and Kℓℓ by (12) and f in J(f) by Equation (8), Equation (13) can be rewritten as

||y ns Kα||2 2 + nsλαT Qα + nsλ ρ, ρ H, (14)

where y = (Y11, Y21, . . . , Yns)T , K is the ns ns matrix with (i + n(j 1), i + n(j 1))th entry 1 ns Kxij(x 1 i , x 2 j ), Q is the ns ns matrix with (i + n(j 1), i + n(j 1))th entry

1 ns(θ10K10 xij(x 1 i , x 2 j ) + θ11K11 xij(x 1 i , x 2 j )) and α = (α11, α21, . . . , αns)T . Similar to Chapter A3 in Gu (2013), we set the rescale parameter θ10 and θ11 to make θ10K10 and θ11K11 contribute equally in penalty term of Equation (14) (see Appendix A.1 for details) and set θ00 and θ01 as one since H00 and H01 are simply one-dimensional Euclidean space. Since ρ does not rely on α, the optimizer of α in minimizing Equation (14) is equivalent to minimizing

bα = arg min α Rns ||y ns Kα||2 2 + nsλαT Qα. (15)

The penalized least square estimate of f is then bf(x 1 i , x 2 j ) = Pn,s i,j bαi,j Kxi,j(x 1 i , x 2 j ). As n goes to inﬁnity, we have countable number of kernels and f(x 1 , x 2 ) that the minimizer of Equation (13) resides in an inﬁnite dimensional space spanned by a countable number of kernels, i.e.,

H model = {g : g(x 1 , x 2 ) =

ij αij Kxij(x 1 , x 2 )}.

The nonparallel effect f11 also resides in a subspace that is spanned by a countable number of kernels. We denote the subspace by

H 11 = {f11 : f11(x 1 , x 2 ) =

ij αij ( 1)m 1

2 (1(x 2 j =x 2 ) 1

2)k2m(x 1 i x 1 )}.

Here, we did not normalize f11 by the constant scale parameter θ11 for the simplicity of description. The penalized least square estimate of f11 H 11 is

ˆf11(x 1 , x 2 ) =

i,j bαij ( 1)m 1

2 (1(x 2 j =x 2 ) 1

2)k2m(x 1 i x 1 ). (16)

MINIMAX NONPARAMETRIC PARALLELISM TEST

With a little abuse of notation, we use ˆf11 to denote the vector version evaluation of ˆf11 on ns data points from now on. Plugging in bα to (16), we have an explicit expression of ˆf11 as

ˆf11 = K11M 1(Ins S(ST M 1S) 1ST M 1)y, (17)

where Ins is the ns dimensional identity matrix, S, M and K11 are reparametrization of the kernel matrices with explicit forms provided in Appendix A.1 Notation Clariﬁcation . In Section 4, we will construct a Wald type test statistics based on ˆf11 for the parallelism test H0 : f11 = 0, and derive its null asymptotic distribution. Before that, we ﬁrst establish the minimax principle of the parallelism test for general testing rules in the following Section 3.

3. Minimax Principle of the Nonparametric Parallelism Test

Consider the test problem as follows

H0 : f11 = 0 vs H1 : ||f11||2 > 0. (18)

Given a decision rule φn for the testing problem (18), φn = 0 if H0 is preferred and 1 otherwise. Then the zero-one loss function is

( φn if H0 is true, 1 φn if H1 is true. (19)

The minimax principle requires φn to minimize the maximum possible risk, i.e.,

min φn max H E[Loss(φn)] = min φn [max H0 E(φn|H0 is true) + max H1 E(1 φn|H1 is true)]. (20)

Notice E(φn|H0 is true) is the probability of making a type I error and E(1 φn|H1 is true) is the probability of making a type II error. Intuitively, we choose φn to minimize the maximum possible type I error and type II error. Notice that if H0 and H1 are contiguous, we cannot ensure that Equation (20) can be controlled, because there may lie some f11 on the boundary of H0 and H1 for which strikes the balance between acceptance and rejection of the null hypothesis, and an appropriate decision cannot be made. Thus, instead of H1, we consider a slightly different alternative hypothesis (4) and partition the parameter space into three sets: H0 + H 1 + I, of which I designates the indifference zone 0 < ||f11||2 < dn. Because dn clearly separates H0 from H 1, it is referred to as the distinguishable rate (a.k.a the separation rate) (Ingster and Suslina, 2012; Gin e and Nickl, 2015). Let

pseudo.risk(φn, dn) = sup H0 E(φn|H0 is true) + sup H 1 E(1 φn|H 1 is true). (21)

Then pseudo.risk(φn, dn) converges to the risk function E[Loss(φn)] as dn goes to zero. Compared to the risk function, the pseudo.risk is not only a function of a decision rule φn but also a function of the distinguishable rate dn. When φn is given, we have sup H 1 E(1 φn|H 1 is true) sup H1 E(1 φn|H1 is true) because H 1 is a subset of H1. Thus, ﬁnding the largest pseudo.risk on H 1 for a given φn is equivalent to ﬁnding the smallest dn with a tolerable pseudo.risk. In another word, ﬁnding the maximum possible pseudo.risk over the parameter space can be considered as ﬁnding the smallest boundary of H 1 such that an appropriate decision φn can be made and the risk can

XING, LIU, MA AND ZHONG

be controlled. Meanwhile, for an adequately large given dn, we can always ﬁnd a decision rule such that the pseudo.risk can reach its minimum value. Let φ n(dn) = arg minφn pseudo.risk(φn, dn). Then, if dn can reach the smallest value d n, the corresponding φ n(d n) is the minimax decision. Thus, the essential step to ﬁnd the minimax decision of pseudo.risk(φn, dn) is to ﬁnd d n such that

d n = arg min dn φ n(dn). (22)

Because d n is an estimate of the distinguishable rate to obtain the minimax test, it is referred to as the minimax distinguishable rate. Clearly, the corresponding decision rule φ n is the minimax decision rule. We ﬁrst introduce a geometric interpretation of the testing problem (18). Geometrically, we can treat E = {f H : ||f||H < 1/2} as an ellipse with eigenvalues in Equation (7) as axis lengths as shown in Figure 2. For any f E, the projection of f on {f : f E11 := H11 E} is f11. The magnitude of nonparallelism can be qualiﬁed by ||f11||2. The distinguishable rate dn is the radius of the sphere centered at f11 = 0 in H11.

Figure 2: Geometric interpretation of the distinguishable rate of the parallelism test.

Intuitively, the testing will be harder when the projection of f on E11 is closer to the origin f11 = 0. We use the Bernstern width in Pinkus (2012) to characterize the testing difﬁculty. Let Sk+1 be the set of all (k + 1)-dimensional subspaces for any k 1. For a compact set C, the Bernstein k-width is deﬁned as

bk,2(C) := arg max r 0 {Bk+1 2 (r) C S for some subspace S Sk+1}, (23)

where Bk+1 2 (r) is a (k + 1)-dimensional l2-ball with radius r centered at f11 = 0 in E11. The Bernstein width characterizes the largest ball that can be inserted into a (k +1)-dimensional subspace in E11. Based on the Bernstein width, we give an upper bound of the testing radius, i.e., for any f

MINIMAX NONPARAMETRIC PARALLELISM TEST

projected in the ball with radius less than this upper bound, the minimum pseudo.risk is larger than 1/2.

Lemma 4 For any f H, we have

inf φn pseudo.risk(φn, dn) 1/2

dn r B := sup{δ | δ 1 2 nσ(k B(δ))1/4}

where k B(δ) := arg maxk{b2 k 1,2(H11) δ2} is the Bernstein lower critical dimension, and r B is called the Bernstein lower critical radius.

Lemma 4 shows that when dn is less than r B, there has no test can distinguish the alternative hypothesis from the null. In order to achieve a non-trivial power, we need dn to be larger than the Bernstein lower critical radius r B, which is determined by the Bernstein lower critical dimension k B(δ). In the next lemma, we provide the lower bound for k B(δ).

Lemma 5 Let {ρi} i=1 be eigenvalues of H11. We have

k B(δ) > arg max k { ρk δ} (24)

Plugging in the lower bound of k B(δ) derived in Lemma 5 to Lemma 4, we calculate a lower bound for r B based on the decay rate of eigenvalues. r B is served as a minimax lower bound for the distinguishable rate. The following theorem summarizes the minimax distinguishable rate for the testing problem (18).

Theorem 6 (Minimax lower bound for distinguishable rate) In the nonparametric model (1) with SSANOVA (2). Suppose f H, where H = H 1 H 2 with H 1 as the mth order Sobolev space1, and H 2 as a two-dimensional Euclidean space. The minimax distinguishable rate for testing hypotheses (18) is achieved at d n n 2m/(4m+1).

Theorem 6 provides a general guidance for justifying a local minimax test, i.e., there is no test can distinguish the alternative from null if dn n 2m/(4m+1). The proof of Theorem 6 is presented in the Appendix. Essentially, for any test φn that is deﬁned by a family of type I error α = E(φn) and by the supremum of the type II error δ = sup H 1 E(1 φn|H 1 is true), we need φn converges to zero faster than dn to ensure the distinguishability of the null distribution. We further remark that the minimax rate for nonparametric estimation is n m/(2m+1) (Yang et al. (2017)) which is higher than the minimax distinguishable rate n 2m/(4m+1). In the next section, we will introduce a Wald type test for the hypothesis testing (18) with the separation rate dn achieves the lower bound n 2m/(4m+1)

indicating our proposed test is minimax optimal.

1. The mth order Sobolev space is deﬁned as H 1 = {η1 L2[0, 1] | η(k) 1 is absolutely continuous for k = 0, 1, . . . , m 1}.

XING, LIU, MA AND ZHONG

4. Wald Type Parallism Test

In this section, we propose a Wald type test statistics based on the penalized least squares estimate of f11, and derive the asymptotic distribution of the test statistics. We further prove an upper bound of the distinguishable rate of the Wald type test which matches the minimax lower bound established in Theorem 6.

4.1. Wald Type Test and Asymptotic Distribution

The nonparallel effect of the curves between the case group and the control group is measured by the magnitude of ||f11||2 2. The nonparallel test in Equation (3) is equivalent to

H0 : f H model H 11 vs H1 : f H model

or equivalently, H1 : f11 H 11. First, notice that the null hypothesis in Equation (18) is a composite hypothesis as the null hypothesis deﬁnes a class of functions in H model H 11. Second, H0 deﬁnes an inﬁnite dimensional parameter spaces as n , the assumptions of Neyman-Pearson Lemma cannot be satisﬁed. Thus the uniformly most powerful test may not exist in general. To overcome the difﬁculty, we propose a Wald-type test

ns|| ˆf11||2 2 (25)

and show its minimax optimality. Since Yij follows Equation (1) with f satisfying the SSANOVA decomposition in Equation (2), we can replace each element in vector y by f00(x 1 i , x 2 j ) + f10(x 1 i , x 2 j ) + f01(x 1 i , x 2 j ) +

f11(x 1 i , x 2 j ) + ϵij. Then plug in the expression of ˆf11 in Equation (17) to Tn,λ, we have

ns||K11M 1(In S(ST M 1S) 1ST M 1)(f00 + f10 + f01 + f11 + ϵ)||2 2,

where f00, f10, f01 and f11 are ns dimensional vectors with the ijth entry f00(x 1 i , x 2 j ), f10(x 1 i , x 2 j ),

f01(x 1 i , x 2 j ) and f11(x 1 i , x 2 j ) respectively, and ϵ is the ns dimensional stochastic error that follows a normal distribution with mean 0 and variance σ2Ins. Because f00, f10 and f01 are in the space that is orthogonal to the space spanned by K11, and f11 = 0 under the null hypothesis, Tn,λ can be further simpliﬁed as

ns||K11M 1(Ins S(ST M 1S) 1ST M 1)ϵ||2 2. (26)

A detailed discussion of this simpliﬁcation will be provided in Lemma 12 in Appendix. Next, we develop the null limiting distribution of Tn,λ as n goes to inﬁnity. In the derivation, we only require the number of subjects s to be ﬁnite. This requirement is desired in real applications since the number of subjects in an experiment is usually limited. For example, due to the high sequencing cost, there are usually only tens of sample sequenced in the DNA methylation studies. We consider the following two designs.

Quasi-Uniform Design : x 1 1 , x 1 2 , . . . , x 1 n iid ω(x 1 ) where ω is the marginal density of x 1 . For any x 1 [0, 1], there exist two constants c1, c2 > 0 such that c1 ω(x 1 ) c2 (Eggermont and La Riccia, 2001).

MINIMAX NONPARAMETRIC PARALLELISM TEST

Uniform Design: x 1 1 , x 1 2 , . . . , x 1 n are evenly spaced on [0, 1].

The above two designs are commonly used in scientiﬁc investigations. For example, in f MRI experiments, the sampling points on the time domain are usually measured with equal-time intervals. Thus, they are assumed to follow uniform design. On the other hand, the DNA methylation sites are randomly scattered on DNA sequence. Therefore, they are assumed to follow a quasi-uniform design.

Theorem 7 For both the uniform design and the quasi-uniform design, if the smoothing parameter λ = O(nc 1) for any ﬁxed c (0, 1), we have

d N(0, 1) as n ,

where µn,λ = σ2 Tr( )/(ns) and σ2 n,λ = 2σ4 Tr( 2)/(ns)2 with = M 1K2 11M 1.

In practice, we estimate the variance σ2 via bσ2 deﬁned as

bσ2 = y (I A(λ))2y

Tr(I A(λ)) ,

where A(λ) = K(ns K2 + λQ) 1y, and (I A(λ))y is the residual y b f based on the objective function in equation (15). The consistency of the variance estimate bσ2 is established in Theorem 3.4 in Gu (2013). The proof of Theorem 7 is provided in Appendix and sketched below. Notice that Tn,λ = T1 + T2 2T3, where

nsϵT M 1K2 11M 1ϵ,

ns||K11M 1S(ST M 1S) 1ST M 1ϵ||2 2, (27)

nsϵT M 1S(ST M 1S) 1ST M 1K2 11M 1ϵ.

We show that T2 and T3 are higher order small perturbation terms compared to T1. Thus, the null distribution of Tn,λ and the distribution of T1 are asymptotically equivalent. We only need to focus on the distribution of the quadratic form T1 = 1 nsϵT ϵ with ϵ having a mean zero normal distribution. To prove the normality of T1, we show that the log-characteristic function of the standardized T1 is asymptotically σ2t2/2, provided that Tr( 2) diverges as λ 0. Lemma 15 shows that Tr( 2) ˆτλ, where ˆτλ = max{i | ˆµi λ} as the effective dimension (Bartlett et al., 2005; Liu et al., 2019) with ˆµ1 ˆµn the empirical eigenvalues of kernel matrix K 1 1 which is the kernel matrix of H 1 1 with (i, i )th entry as 1

n K 1 1 (x 1 i , x 1 i ). We further show in Lemma 13 and 14 that ˆτλ is of the same order as its population counterpart τλ deﬁned as τλ = max{i | µi λ}, under both the quasi-uniform design and the uniform design, where µ1 0 are a sequence of ordered eigenvalues satisfying K 1 1 (x, x ) = P i=1 µiφi(x)φi(x ). Since µi has a polynomial decay rate i 2m (Gu, 2013), we have Tr( 2) ˆτλ τλ λ 1/(2m) diverges as λ 0. Consequently, the testing consistency in Theorem 7 holds.

XING, LIU, MA AND ZHONG

Theorem 7 characterizes the distribution of the test statistic Tn,λ for f H model H 11. The distribution turns out to be fairly simple and easy to calculate as the test statistic does not depend on any unknown nuisance functions such as f00, f10 and f01. Critical value can be easily found based on the known null distribution N(µn,λ, σ2 n,λ). Consequently, one can make a statistical decision by comparing Tn,λ with the critical value. This nuisance-parameter free property is referred to as the Wilks phenomenon in statistics literature (Fan et al., 2001; Fan and Zhang, 2004).

4.2. Upper Bound of the Distingushiable Rate

Given type I error α, we show that our Wald-type testing rule φn,λ = 1(|Tn,λ µn,λ| zα/2σn,λ) achieves the local minimax distinguishable rate. Without loss of generality, we assume f H 1.

Theorem 8 Let the minimum distinguishable rate of the test φn,λ be dn(φn,λ). Suppose λ = O(nc 1) for any ﬁxed c (0, 1). Then for any δ > 0, there exist positive constants Cδ and Nδ such that, when n Nδ, the tolerable pseudo.risk(φn,λ, dn) = α + δ, with dn(φn,λ) := Cδ pλ + σn,λ.

Theorem 8 shows that for a controlled type I error, Tn,λ can achieve arbitrary small type II error provided that the local alternative is separated from the null by at least an amount of dn(φn,λ). The proof of Theorem 8 is collected in Appendix. Note that d2 n(φn,λ) consists of two components: σn,λ representing the standard variation of the test statistic Tn,λ, and λ representing the squared bias of ˆf1,2 (see the proof of Lemma S.1 in the Supplementary). Through approximating σn,λ by the Rademacher complexity (Bartlett et al., 2005; Liu et al., 2019), we show that σn,λ τλ/n, which is a decreasing function of λ. Hence, the minimum distinguishable rate for φn,λ is achieved by the trade-off between the bias of ˆf1,2 and the standard derivation of Tn,λ, i.e., choosing appropriate λ such that λ σn,λ. Next, we prove that our proposed Wald-type test is minimax under two special design conditions: the quasi-uniform design and the uniform design in the next two corollaries.

Corollary 9 [Quasi-Uniform Design] Let λ n 4m/(4m+1) and suppose x 1 follows a quasiuniform design. We have

P(dn(φn,λ) n 2m/(4m+1)) 1 4 exp( n1/(2m+1)).

Corollary 10 [Uniform Design] Let λ n 4m/(4m+1), and suppose x 1 follows a uniform design, we have dn(φn,λ) n 2m/(4m+1) a.s.

Corollaries 9 and 10 suggest that if λ n 4m/(4m+1), our Wald-type test φn,λ can achieve the minimax distinguishable rate d n n 2m/(4m+1). Thus, we demonstrate that our proposed Wald type test is minimax optimal. We remark that Corollary 9 still holds when extending H 1 as a standard Sobolev space.

4.3. The Choice of Regularization Parameter

Different from the classical bias-variance tradeoff in optimal nonparametric estimation, Theorem 8 states that the optimal nonparametric testing for Equation (3) can be achieved by another type of tradeoff between the squared bias of the estimator and the standard deviation of the test statistic.

MINIMAX NONPARAMETRIC PARALLELISM TEST

Such intrinsic difference further leads to different orders of optimal regularization parameters: as shown in Corollary 9, 10, the optimal λ is chosen as the order of n 4m 4m+1 ; while as the order of n 2m 2m+1 for optimal estimation (Gu, 2013). In practice, cross validation method is often used as a tuning procedure for nonparametric estimation based on penalized loss functions (Golub et al., 1979). Raskutti et al. (2014) proposed another data-dependent algorithmic regularization technique, that is, choosing an early stopping rule for an iterative algorithm to avoid over-ﬁtting in nonparametric estimation. Both of the above approaches are optimal for estimation but suboptimal for testing. There has few theoretically justiﬁed tuning procedure for obtaining optimal testing in nonparametric inference. One related work we are aware currently is Liu and Cheng (2018), under which they developed a data-dependent early stopping regularization rule from an algorithmic perspective for testing f = 0 in nonparametric regression model Y = f(X) + ϵ. The total step size determined via the early stopping rule in gradient descent algorithm plays the same role with 1/λ in the penalized regularization, to avoid over-ﬁtting. However, a data-adaptive choice of the regularization parameter λ is still lacking for nonparametric inference in Equation (3) under the penalization regularization. We propose a data-adaptive method to choose λ with testing optimality guarantee based on Theorem 8. In practice, we can choose the optimal smoothing parameter λ satisfying

λ = min λ | λ < σn,λ , (28)

where σn,λ can be explicitly calculated based on the observed data by the expression deﬁned in Theorem 7, i.e., σ2 n,λ = 2σ4Tr( 2)/(ns)2, with = M 1K2 11M 1. The above criterion in Equation (28) in choosing λ is a data-dependent rule that produces a minimax-optimal nonparametric testing method. Based on the Rademacher complexity, σn,λ σ2 ns p Pn i=1 min{1, bµi/λ}. That is, the rule in Equation (28) depends on the eigenvalues of the kernel matrix, especially the ﬁrst few leading eigenvalues. There are many efﬁcient methods to compute the top eigenvalues fast (Drineas and Mahoney, 2005; Ma and Belkin, 2017). As a future work, we can also introduce the randomly projected kernel methods to accelerate the computing time.

5. Simulation Study

To assess the performance of our proposed test, we carried out extensive analyses on simulated data sets. We compared our approach with F-test (SSF) (Ma et al., 2009), parallelism trend test (PTT) (Degras et al., 2011) and a random permutation test with 500 permutations. In the three methods, permutation test can be used as a benchmark because it can closely approximate null distribution when the number of permutations is adequate. However, the permutation test is computationally intensive, especially for calculating the Kullback-Leibler distance under the null and alternative hypothesis for SSANOVA model (Gu, 2004).

5.1. Empirical Power Analysis

We illustrate the empirical power performance of our proposed test through four well-designed examples. In all four examples, we generated 100 to 1000 observations with an increment of 100 observations in each simulation for both case and control groups in Equation (1), where x 1 i iid

U(0, 1) and ϵij iid N(0, 1). Each example was repeated 500 times for power and other comparisons. To make the simulation more close to the reality, we considered two types of nonparallel patterns

XING, LIU, MA AND ZHONG

between f(x 1 , 1) and f(x 1 , 0): magnitude and frequency. These two kinds of nonparallel patterns are often observed in real applications. For example, the hypermethylated DNA regions, i.e., regions with low methylation levels, are related to transcriptional silencing which plays an important role in cancer development; the frequency differences are often related to different brain functions between the neurodisease and control groups in f MRI studies. In the ﬁrst four examples, we consider the

0 0.2 0.4 0.6 0.8 1

δ1(δ2=δ3=0)

0 0.5 0.75 1

1.5 0.5 0.0 0.5 1.0 1.5 2.0

0 0.2 0.4 0.6 0.8 1

δ2(δ1=δ3=0)

0 0.2 0.3 0.4

(a) Setting 1 (b) Setting 2

2 1 0 1 2 3

0 0.2 0.4 0.6 0.8 1

0 0.2 0.3 0.4

0 0.5 0.75 1

1.0 0.5 0.0 0.5 1.0 1.5 2.0

0 0.2 0.4 0.6 0.8 1

δ3(δ1=δ2=0)

0 0.5 0.75 1

(c) Setting 3 (d) Setting 4

Figure 3: Plotted here are functions of the control group (solid line) and case group (dashed, dotted and dot-dash lines) with four types of nonparallel patterns: magnitude differences only (Setting 1), frequency differences only (Setting 2), both magnitude and frequency differences (Setting 3), and magnitude dynamic differences (Setting 4).

following function in Equation (1),

f(x 1 , x 2 ) =

( 2.5 sin(3πx 1 )(1 x 1 ) if x 2 = 0, i.e., control (2.5 + δ1) sin((3 + δ2)πx 1 )(1 x 1 )(1+δ3) if x 2 = 1, i.e., case (29)

where δ1, δ2 and δ3 control the magnitude of nonparallelism between the null hypothesis and the alternative hypothesis in Equation (18). In general, varying δ1, δ2 and δ3 give rise to different distinguishable rates dns. The larger the δ1, δ2 and δ3 are, the larger the dn is. To illustrate how the testing power is affected by different δ s, as shown in Figure 3, we considered the following four settings. Setting 1:

MINIMAX NONPARAMETRIC PARALLELISM TEST

Case and control have constant magnitude differences (δ1 = 0.50, 0.75, 1.00 and δ2, δ3 = 0.00); Setting 2: Case and control have frequency differences (δ2 = 0.20, 0.30, 0.40 and δ1, δ3 = 0.00); Setting 3: Both magnitude and frequency are different (δ1, δ2 = (0.50, 0.20), (0.75, 0.30), (1.00, 0.40) and δ3 = 0.00); Setting 4: Case and control have non-constant magnitude differences (δ1, δ2 = 0.00 and δ3 = 0.50, 0.75, 1.00). The corresponding functions f(x 1 , 0) and f(x 1 , 1) are shown in Figure 3.

Sample Size

100 200 300 400 500 600 700 800 900 1000 δ1 = 0.50 Proposed 0.17 0.33 0.49 0.59 0.69 0.75 0.86 0.91 0.92 0.96 Permutation 0.19 0.38 0.53 0.60 0.62 0.76 0.80 0.88 0.94 0.97 SSF 0.02 0.09 0.11 0.16 0.26 0.28 0.36 0.54 0.58 0.72 PTT 0.05 0.06 0.05 0.1 0.11 0.1 0.14 0.21 0.11 0.17 δ1 = 0.75 Proposed 0.37 0.67 0.90 0.93 0.97 0.98 1.00 1.00 1.00 1.00 Permutation 0.38 0.66 0.81 0.90 0.96 0.99 0.99 1.00 1.00 1.00 SSF 0.04 0.21 0.37 0.50 0.81 0.86 0.91 0.96 0.97 0.98 PTT 0.09 0.14 0.15 0.33 0.38 0.36 0.47 0.55 0.44 0.54 δ1 = 1.00 Proposed 0.61 0.92 0.97 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Permutation 0.57 0.89 0.95 0.99 1.00 1.00 1.00 1.00 1.00 1.00 SSF 0.14 0.48 0.79 0.90 0.97 0.99 1.00 1.00 1.00 1.00 PTT 0.08 0.23 0.42 0.43 0.54 0.62 0.77 0.77 0.79 0.85

Table 1: Table lists the empirical power of our proposed test and permutation test for Setting 1 with δ1 = 0.50, 0.75, 1.00, δ2 = δ3 = 0.00 and sample size ranging from 100 to 1000.

Sample Size

100 200 300 400 500 600 700 800 900 1000 δ2 = 0.20 Proposed 0.28 0.46 0.66 0.79 0.86 0.95 0.95 0.97 0.98 0.99 Permutation 0.27 0.43 0.59 0.74 0.86 0.94 0.94 0.98 1.00 1.00 SSF 0.02 0.05 0.21 0.32 0.48 0.62 0.79 0.84 0.88 0.95 PTT 0.04 0.03 0.04 0.08 0.11 0.14 0.12 0.09 0.16 0.26 δ2 = 0.30 Proposed 0.40 0.63 0.81 0.94 0.96 0.99 0.99 1.00 1.00 1.00 Permutation 0.36 0.64 0.79 0.89 0.97 0.98 0.99 1.00 1.00 1.00 SSF 0.03 0.13 0.35 0.52 0.72 0.85 0.91 0.97 0.99 1.00 PTT 0.03 0.08 0.09 0.15 0.31 0.23 0.28 0.4 0.35 0.4 δ2 = 0.40 Proposed 0.73 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Permutation 0.78 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 SSF 0.24 0.74 0.98 0.99 1.00 1.00 1.00 1.00 1.00 1.00 PTT 0.11 0.16 0.18 0.38 0.39 0.52 0.56 0.59 0.81 0.89

Table 2: Table lists the empirical power of our proposed test and permutation test for Setting 2 with δ2 = 0.20, 0.30, 0.40, δ1 = δ3 = 0.00 and sample size ranging from 100 to 1000.

The empirical powers of our proposed Wald-type test, permutation test, SSF test and PTT test are summarized in Tables 1-2 for Settings 1-2. For Setting 1, as shown in Table 1, the empirical

XING, LIU, MA AND ZHONG

Sample Size

100 200 300 400 500 600 700 800 900 1000 δ1 = 0.50 Proposed 0.35 0.51 0.74 0.86 0.91 0.95 0.97 0.98 1.00 1.00 δ2 = 0.20 SSF 0.04 0.15 0.29 0.41 0.57 0.72 0.85 0.89 0.91 0.96 PTT 0.03 0.07 0.07 0.08 0.08 0.06 0.15 0.19 0.21 0.2 δ1 = 0.75 Proposed 0.42 0.70 0.86 0.96 0.99 1.00 1.00 1.00 1.00 1.00 δ2 = 0.30 SSF 0.05 0.26 0.46 0.64 0.79 0.93 0.94 0.95 1.00 1.00 PTT 0.04 0.07 0.11 0.15 0.19 0.23 0.31 0.29 0.43 0.46 δ1 = 1.00 Proposed 0.72 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 δ2 = 0.40 SSF 0.25 0.72 0.97 0.99 1.00 1.00 1.00 1.00 1.00 1.00 PTT 0.11 0.19 0.22 0.32 0.52 0.5 0.64 0.61 0.73 0.69

Table 3: Table lists the empirical power of our proposed test and permutation test for Setting 3 with δ1, δ2 = (0.50, 0.20), (0.75, 0.30), (1.00, 0.40), δ3 = 0 and sample size ranging from 100 to 1000.

Sample Size

100 200 300 400 500 600 700 800 900 1000 δ3 = 0.50 Proposed 0.15 0.33 0.47 0.58 0.66 0.75 0.83 0.88 0.89 0.94 SSF 0.01 0.04 0.07 0.16 0.18 0.28 0.35 0.47 0.57 0.64 PTT 0.06 0.03 0.08 0.09 0.09 0.14 0.07 0.13 0.08 0.13 δ3 = 0.75 Proposed 0.35 0.61 0.73 0.84 0.92 0.95 0.99 1.00 1.00 1.00 SSF 0.03 0.12 0.18 0.34 0.56 0.70 0.83 0.86 0.96 0.96 PTT 0.01 0.07 0.06 0.07 0.09 0.12 0.13 0.18 0.18 0.24 δ3 = 1.00 Proposed 0.42 0.70 0.85 0.95 0.99 0.99 1.00 1.00 1.00 1.00 SSF 0.07 0.20 0.52 0.76 0.82 0.92 0.98 0.98 1.00 1.00 PTT 0.09 0.04 0.08 0.10 0.18 0.18 0.21 0.24 0.25 0.28

Table 4: Table lists the empirical power of our proposed test and permutation test for Setting 4 with δ3 = 0.50, 0.75, 1.00, δ1 = δ2 = 0.00 and sample size ranging from 100 to 1000.

power of our test increases rapidly as sample size increases, and approaches to 1 even for the smallest magnitude (δ1 = 0.50). The empirical powers of the proposed test are comparable with that of the permutation test. In contrast, the empirical powers of SSF and PTT increase slower than our proposed test. For the weak signal scenario, i.e., δ1 = 0.50, the proposed test has signiﬁcantly gain of power under different sample sizes. For the strong signal scenario, i.e, δ = 1.00, our proposed test is signiﬁcantly more powerful than SSF and PTT when sample size is less than 500. For Setting 2, as shown in Table 2, the empirical power of our proposed test converges to 1 as the sample size increases for all three cases with δ2 = 0.20, 0.30 and 0.40. In contrast, the empirical power of SSF and PTT converges to 1 slower than the proposed test. For Settings 3 and 4, we only included the empirical results for our proposed test and SSF test due to the extremely high computational cost of the permutation test. As shown in Table 6, it takes more than 150 hours to complete the permutation test for one setting. For Setting 3, we simulated the signal with differences in both scale and frequency across case and control groups. The empirical powers of the simulation with different distinguishable parameters are listed in Table 3. The empirical

MINIMAX NONPARAMETRIC PARALLELISM TEST

powers of our proposed test and SSF increase for all the three cases with δ1, δ2 = 0.20, 0.30, 0.40. The empirical power of PTT also increases, but with a much slower pattern. When the sample size is small and signal strength is weak, our proposed test has signiﬁcant gain of power compared to the SSF and PTT test. For Setting 4, there is a nonlinear magnitude difference along the x 1 between the two groups. As shown in Table 4, the empirical power of SSF test converges to one slower than the proposed test and is lower than 0.65 for the least distinguishable case.

5.2. Empirical Size Analysis

To examine the approximation of signiﬁcance levels, we generated data from a new setting Setting 5. We kept the function form of control group the same as Equation (5.4) and only added a parallel shift over the control function as the function of the case group, i.e., the model does not include the nonparallel patterns. In particular,

f(x 1 , x 2 ) = 2.5 sin(3πx 1 )(1 x 1 ) + δ4I{x 2 =1},

where δ4 was set to be 0, 0.5 and 1 to characterize different level parallel difference in the two groups. We generated data from Equation (1) with function f speciﬁed in Setting 5. The rest of parameters were set the same as before.

Sample Size

100 200 300 400 500 600 700 800 900 1000 δ4 = 0.00 Proposed 0.04 0.07 0.06 0.06 0.05 0.06 0.06 0.07 0.06 0.05 Permutation 0.04 0.08 0.05 0.08 0.06 0.05 0.06 0.04 0.07 0.06 SSF 0.06 0.11 0.03 0.08 0.08 0.03 0.07 0.09 0.07 0.03 PTT 0.03 0.05 0.02 0.02 0.03 0.02 0.12 0.09 0.08 0.06 δ4 = 0.50 Proposed 0.06 0.05 0.05 0.06 0.06 0.05 0.06 0.04 0.05 0.06 Permutation 0.07 0.04 0.05 0.06 0.08 0.09 0.04 0.03 0.05 0.04 SSF 0.06 0.05 0.07 0.06 0.07 0.08 0.07 0.04 0.04 0.07 PTT 0.02 0.02 0.03 0.03 0.07 0.04 0.06 0.07 0.06 0.04 δ4 = 1.00 Proposed 0.07 0.06 0.07 0.06 0.05 0.05 0.06 0.06 0.06 0.05 Permutation 0.04 0.06 0.03 0.05 0.05 0.04 0.03 0.02 0.04 0.04 SSF 0.07 0.07 0.08 0.06 0.04 0.07 0.06 0.09 0.07 0.04 PTT 0.03 0.04 0.03 0.05 0.05 0.04 0.06 0.05 0.05 0.08

Table 5: Table lists the empirical sizes of the proposed test, permutation test, SSF, and PTT for δ4 = 0.00, 0.50, 1.00 and sample size ranging from 100 to 1000.

Table 5 lists the empirical sizes of our proposed test, permutation test, SSF test, and PTT under Setting 5. We varied δ4 from 0.00 to 1.00 to model different magnitudes of the main effect. The empirical size of our proposed test approaches to 0.05 as the sample size increases for different values of δ4. The empirical size of SSF test is ﬂuctuating from 0.03 to 0.1. The inaccurate size of the SSF test may be attributed to the fact that the degrees of freedom of the SSF test is very roughly approximated by the rounding value of the trace of the smoothing matrix. The empirical size of PTT test is ﬂuctuating from 0.02 to 0.12.

XING, LIU, MA AND ZHONG

5.3. Computation Time

As shown in Tables 1 and 2, our purposed test achieves the power similar to the permutation test. Next, we compared the computation time of our proposed test and permutation test for 500 replicated samples. We conducted the comparison on a computer workstation with core Intel i7 8700k CPU and 32 Gb RAM. In Table 6, we reported the computational time in Setting 1 with δ1 = 0.5 and sample size ranging from 100 to 1000. As shown in Table 6, our proposed test is consistently faster than the permutation test. Our proposed test is nearly 263 faster than the permutation test when the sample size is 1000. Note that the computational time is more than 42 hours when the sample size is 1000 for running 500 test. In practice, the huge computational cost limits the application of the permutation test in many large scale studies involving large sample size and multiple tests.

Sample Size

100 200 300 400 500 600 700 800 900 1000 Proposed 0.01 0.03 0.04 0.06 0.07 0.09 0.10 0.12 0.14 0.16 Permutation 3.22 6.14 9.29 13.29 17.93 22.26 26.74 31.26 36.57 42.23

Table 6: Table lists computational time (in hour) of running the simulation with 500 replications for our proposed test and the permutation test.

5.4. Simulation Studies with Correlated Noise

We established Setting 6 to evaluate the performance of the proposed test when the noises are correlated. In this example, we generated 100 to 1000 observations with an increment of 100 observations in each simulation for both case and control groups in Equation (1). We considered x 1 i , i = 1, . . . , n are evenly distributed in [0, 1]. We generated two correlated noise vector (ϵ11, . . . , ϵn1) and (ϵ12, . . . , ϵn2) i.i.d. from N(0, Σ) where Σ is autoregressive, i.e., each of its element σii = ρ|i i | with ρ = 0.5. We generated the signal Yij = f(x 1 i , x 2 j ) + ϵij where f is deﬁned in Equation (5.4) with δ1 = 0.00, 0.50, 0.75, 1.00 and δ2, δ3 = 0.00, that is,

f(x 1 , x 2 ) =

( 2.5 sin(3πx 1 )(1 x 1 ) if x 2 = 0, (2.5 + δ1) sin(3πx 1 )(1 x 1 ) if x 2 = 1.

We set the signiﬁcance level as 0.05 and repeated 500 times for evaluating the empirical size and power. As shown in Table 7, when δ1 = 0.00, the size of our proposed method concentrates around 0.05 0.07, while the sizes of SSF and PTT are ﬂuctuating from 0.02 to 0.16. When δ1 > 0.00, compared with SSF and PTT, the power of our proposed method has the highest performance, and approaches to 1 as δ1 increases.

5.5. Simulation Studies with Non-smooth Cases

We evaluate the robustness of the proposed method when the smoothness assumption is invalid. We established Setting 7 to test the performance of the proposed test for the cases with non-smooth trends. In this setting, we generated 100 to 1000 observations with an increment of 100 observations in each simulation for both case and control groups in model (1). We considered x 1 i , i = 1, . . . , n,

MINIMAX NONPARAMETRIC PARALLELISM TEST

Sample Size

100 200 300 400 500 600 700 800 900 1000 δ1 = 0.00 Proposed 0.08 0.04 0.06 0.06 0.06 0.08 0.10 0.07 0.06 0.07 SSF 0.08 0.06 0.06 0.09 0.10 0.05 0.09 0.14 0.08 0.06 PTT 0.02 0.1 0.04 0.08 0.11 0.05 0.16 0.12 0.13 0.06 δ1 = 0.50 Proposed 0.21 0.33 0.48 0.57 0.73 0.73 0.82 0.91 0.94 0.96 SSF 0.01 0.05 0.10 0.17 0.29 0.32 0.46 0.48 0.63 0.72 PTT 0.13 0.22 0.35 0.50 0.51 0.53 0.72 0.73 0.78 0.86 δ1 = 0.75 Proposed 0.66 0.89 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 SSF 0.13 0.48 0.74 0.89 0.98 1.00 0.99 1.00 1.00 1.00 PTT 0.16 0.32 0.41 0.43 0.66 0.67 0.73 0.85 0.85 0.89 δ1 = 1.00 Proposed 0.93 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 SSF 0.47 0.93 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 PTT 0.16 0.41 0.55 0.62 0.69 0.85 0.86 0.90 0.90 0.95

Table 7: Table lists the empirical size (δ1 = 0) and power (δ1 = 0.50, 0.75, 1.00) of our proposed test, SSF and PTT for Setting 6 with δ2 = δ3 = 0.00 and sample size ranging from 100 to 1000.

are evenly distributed in [0, 1] and ϵij iid N(0, 1). We generated the signal Yij = f(x 1 i , x 2 j ) + ϵij with f deﬁned as

f(x 1 , x 2 ) = 2.5 sin(2πx 1 )I{x 1 (0,0.5)} + (1 + δ5I{x 2 =1})(x 1)I{x 1 [0.5,0)}

which is shown in Figure 4. This curve is non-differentiable at x 1 = 0.5 which is a change point from nonlinear to linear trend. We set the signiﬁcance level as 0.05 and repeated 500 times to evaluate the empirical size and power.

0.0 0.5 1.0 1.5 2.0 2.5

0 0.2 0.4 0.6 0.8 1

Figure 4: Solid line with δ5 = 0: function of the control group; dashed and dotted lines with δ5 = 1, 2: the case group for Setting 7.

XING, LIU, MA AND ZHONG

As shown in Table 8, when δ5 = 0.00, the empirical size of our proposed method concentrates around 0.05. The empirical size of our proposed method is slightly inﬂated compared with SSF and PTT. When δ5 = 1, 2, compared with SSF and PTT, the power of our proposed method has the highest performance, and approaches to 1 as n increases.

Sample Size

100 200 300 400 500 600 700 800 900 1000 δ5 = 0.00 Proposed 0.08 0.04 0.06 0.06 0.06 0.08 0.10 0.07 0.06 0.07 SSF 0.01 0.01 0.00 0.00 0.00 0.01 0.01 0.01 0.00 0.00 PTT 0.02 0.06 0.04 0.08 0.03 0.05 0.03 0.06 0.03 0.03 δ5 = 1.00 Proposed 0.21 0.33 0.48 0.57 0.73 0.73 0.82 0.91 0.94 0.96 SSF 0.03 0.02 0.06 0.07 0.09 0.15 0.23 0.29 0.34 0.37 PTT 0.01 0.05 0.01 0.02 0.04 0.02 0.04 0.04 0.08 0.06 δ5 = 2.00 Proposed 0.66 0.89 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00 SSF 0.07 0.24 0.46 0.63 0.76 0.87 0.91 0.98 0.97 0.99 PTT 0.02 0.04 0.02 0.09 0.03 0.06 0.07 0.10 0.06 0.06

Table 8: Table lists the empirical size (δ5 = 0) and power (δ5 = 1.00, 2.00) of our proposed test, SSF and PTT for Setting 7.

6. Real Data Examples

We apply the technique to analyze two real data sets: DNA methylation in chronic lymphocytic leukemia and neuroimaging of Alzheimer s Disease using f MRI.

6.1. DNA Methylation in Chronic Lymphocytic Leukemia

Recently, Filarsky et al. (2016) reported a DNA methylation study for chronic lymphocytic leukemia (CLL) patients. In the study, the DNA samples were extracted from CD19+ cells from 12 CLL patients and B cells from 6 normal subjects. The DNA methylation is proﬁled by the whole-genome tiling array technique. The goal is to identify differentially methylated regions (DMRs), i.e., the genome regions that have signiﬁcantly different methylation levels, between CLL patients and normal subjects. To achieve this goal, we compiled the DNA methylation intensities within the 3.8 to +1.8 kb of transcription start sites (TSS) for each gene. We used the M-value suggested by Irizarry et al. (2008) as methylation level at each site and as our response variable. In particular, the data consists of (Yij, x 1 i , x 2 j ), where Yij is the methylation level at the ith genome location x 1 i of the jth subject

in group x 2 j , which equals to 1 if the jth subject is in the case group and equals to 0 if the jth subject is in the control group. We ﬁt the model in Equation (1) with SSANOVA decomposition in Equation (2) to the data. We applied the proposed hypothesis testing on 10383 regions. Through controlling FDR < 0.01 using Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), we selected 613 DMRs. We conducted gene ontology analysis on the 613 genes corresponding 613 identiﬁed DMRs using the GSEA (Subramanian et al., 2005). Among these genes, 79 genes participate the lipid metabolic process, which plays an important role in the development of CLL (Pallasch et al., 2008). This

MINIMAX NONPARAMETRIC PARALLELISM TEST

biological process contributes to apoptosis resistance in CLL cells. Furthermore, 78 and 61 genes participate the immune related biological processes: Immune system process and Regulation of immune system process respectively. The observation indicates that the aberrant DNA methylation has the potential impact on the immune system. Our Wald-type test, even after FDR control, yields p-values that are as small as 10 9. Consequently, it is very difﬁcult to compare our test with the permutation test with only hundreds or thousands of permutations. Thus, we only compared our proposed test with permutation test (based on 500 permutation) for regions with p-values larger than 0.05 the averaged difference between our test and permutation test is 0.012.

42573000 42574000 42575000 42576000 42577000

0.5 0.0 0.5 1.0

Genomic locations

Methylation intensities

G G G G G G G G

25243500 25244000 25244500 25245000 25245500

1.0 0.5 0.0 0.5

Genomic locations

Methylation intensities

(a) MTA3 (b) DNMT3A

Figure 5: The promoter regions of two genes, (a) MTA3 and (b) DNMT3A. The horizontal axis is the genomic location and the y axis is the M-value representing the methylation levels. The red and blue lines are the ﬁtted curves for the case and control groups respectively.

We highlighted two DMRs with signiﬁcant nonparallel patterns in Figure 5. The focal hypermethylation at genome locations 42574000 and 42576500 are observed on the promoter region of gene MTA3. It was reported in (Bilban et al., 2006) that MTA3 signaling pathway is a potential bio-marker for CLL and shows signiﬁcantly altered gene expression. Our test also identiﬁed that the methylation levels between CLL patients and normal subjects, of MTA3 gene have signiﬁcant difference, which has potential prognostic value. In the promoter region of DNMT3, we observed signiﬁcant hypomethylation at genome location 25244500. DNMT3 is a family of DNA methyltransferases that could methylate hemimethylated and unmethylated Cp G sites at the same rate (Okano et al., 1998). Since the global hypomethylation is observed, the aberrant methylation levels of this DNA methylatransferase may have inﬂuence on this global trend.

6.2. Neuroimaging of Alzheimer s Disease using f MRI

Alzheimer s disease (AD) is one of the most commonly known neurology disease characterized with neurodegeneration and cognitive decline (Rombouts et al., 2005; Wang et al., 2006). Despite the prevalence of AD, there are no cure or preventive methods available due to the lack of a complete understanding of the mechanisms that contribute to AD pathophysiology. Discovering aberrant neural network of AD will fundamentally advance the scientiﬁc understanding of this disease.

XING, LIU, MA AND ZHONG

In this study, we analyzed the data that was collected by Alzheimer s Disease Neuroimaging Initiative (ADNI) 2, in which the resting-state f MRI signals of 60 normal/early-mild-cognitiveimpairment subjects (control group) and 50 AD/late-mild-cognitive-impairment subjects (AD group) were collected from 256 256 170 voxels for 140 consecutive time points with equal time intervals of 30ms. The f MRI signals for each subject were preprocessed using f MRI Expert Analysis Tool (FEAT) (Smith et al., 2004) for skull-stripping, motion correction, slice timing correction, temporal ﬁltering, spatial smoothing and registration to standard space (MNI152 T1 2mm model) so that signals from all subjects can be considered as from the same engineered brain template. Sixtynine brain-region-of-interests (ROI) that are deﬁned by Harvard-Oxford-Atlas (http://fsl.fmrib.ox .ac.uk/fsl/fslwiki/Atlases) was extracted by automatic regional labeling approach using the reﬁned f MRI data. For each ROI, we consider model (1) with SSANOVA decomposition in Equation (2), where Yij records the average blood-oxygen-level (Huettel et al., 2004) of the brain region for subject j measured at the x 1 i time point. As the blood-oxygen-level can accurately quantify the corresponding brain activity, we can detect abnormal AD related brain activity. Testing problem in Equation (18) is equivalent to testing whether the brain activities of a given ROI have different temporal patterns in case and control groups.

Cingulate Gyrus, posterior division

Parahippocampal Gyrus, posterior division

Figure 6: Plotted here are blood-oxygen-levels of parahippocampal gyrus (left) and cingulate gyrus (right) for control group (blue) and AD group (red) observed at 140 time points. Physical locations of either ROIs on frontal, axial and lateral sides are illustrated on the top of each panel.

2. Data used in preparation of this article were obtained from the Alzheimer s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ ADNI_Acknowledgement_List.pdf

MINIMAX NONPARAMETRIC PARALLELISM TEST

Seven cortical regions parahippocampal gyrus, cingulate gyrus, inferior temporal gyrus, postcentral gyrus, juxtapositional lobule cortex, precuneous cortex, central opercular cortex and one sub-cortical region right thalamus with signiﬁcantly different temporal patterns were identiﬁed using our test with the false discovery rate controlled at 5% using Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995). Among the eight ROIs, parahippocampal gyrus and cingulate gyrus have been shown clinically to be risk factors for AD. As demonstrated in Ech avarri et al. (2011) and Kesslak et al. (1991), parahippocampal gyrus of AD patients have signiﬁcant atrophy. Meanwhile, cingulate gyrus was also found to be AD related (Scheff et al., 2015) due to its extensive connectivity with multiple different cortical areas, especially areas involved with learning and memory. In Figure 6, we plotted frontal, axial, and lateral views and corresponding temporal patterns of parahippocampal gyrus and cingulate gyrus. The temporal regions with signiﬁcant difference between AD/late-mildcognitive-impairment subjects (red line) and normal/early-mild-cognitive-impairment subjects (blue line) are highlighted. As clearly demonstrated in lower left panel of Figure 6, the ﬁrst highlighted area of parahippocampal gyrus has a signiﬁcant reversed pattern between case group and control group. The second highlighted area shows the reduced levels for the AD group. For cingulate gyrus, the highlighted regions in the right panel of Figure 6 show clearly larger magnitude for the AD groups. This difference was also observed via f MRI in a visual encoding memory task (Rami et al., 2012). Both of the two experiments suggest that the difference may change the memory function.

7. Discussion

The hypothesis testing in SSANOVA is a very challenge problem. In this paper, we develop a Wald-type test for testing the signiﬁcance of the nonparallelism in a two-way SSANOVA model. The optimality of the proposed test is justiﬁed by the minimax distinguishable rate. The extensive empirical studies suggest that the proposed test has a superior performance over existing methods. Although we only discuss the test of the signiﬁcance of the nonparallelism in a two-way SSANOVA model, the test on a higher order SSANOVA model can be developed parallel to our framework.

Acknowledgments

PM was funded in part by NSF DMS-1440037, 1438957, 1925066 and NIH 1R01GM122080-01. WZ was funded in part by NSF DMS-1440038, 1903226, 1925066 and NIH 1R01GM113242-01. Data collection and sharing for this project was funded by the Alzheimer s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: Abb Vie, Alzheimer s Association; Alzheimer s Drug Discovery Foundation; Araclon Biotech; Bio Clinica, Inc.; Biogen; Bristol-Myers Squibb Company; Cere Spir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; Euro Immun; F. Hoffmann-La Roche Ltd and its afﬁliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; Neuro Rx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pﬁzer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics.The

XING, LIU, MA AND ZHONG

Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

MINIMAX NONPARAMETRIC PARALLELISM TEST

Appendix A. Proof of Main Results

In this section, we present main proofs of the theorems and lemmas in the main text.

A.1. Notation Clariﬁcation

We rewrite (16) as ˆf11 = K11M 1(Ins S(ST M 1S) 1ST M 1)y,

S = In 1w0 0 0 In 1w1

" K 1 1 K 1 1 K 1 1 K 1 1

M = In 1w0 0 0 In 1w1

" K 1 1 K 1 1 K 1 1 K 1 1

" K 1 1 K 1 1 K 1 1 K 1 1

In 1T w0 0 0 In 1T w1

K 1 1 is the kernel matrix of H 1 1 with (i, i )th entry as 1

n K 1 1 (x 1 i , x 1 i ), w0 is the number of subjects in control group, w1 is the number of subjects in case group, and denotes the Kronecker product. Based on Chapter A.3 in Gu (2013), we set θ 1 10 Tr(K10) and θ 1 11 Tr(K11) with θ10 + θ11 = 1. In the following theoretical derivation, we only focus on the case with s = 2, i.e. w0 = w1 = 1. If we have s > 2 subjects, the proof can be easily generalized to this situation by replacing Equation (15) by the penalized weighted least squares; see Section 3.2.4 in Gu (2013).

A.2. Proofs for Section 3

A.2.1. PRELIMINARY

We identify a sequence model that is equivalent to our nonparametric model (1) with SSANOVA decomposition in Equation (2). Let {ρi, φi} i=1 be pairs of eigenvalue and eigenfunction in H 1

and {νj, ψj}2 j=1 be pairs of eigenvalue and eigenfunction in H 2 . In the tensor product space H = H 1 H 2 , as shown in Lin (2000), eigenvalues and eigenfunctions are {µiνj, φiψj}i=1,..., , j=1,2. Model (1) is equivalent to a sequence model

zij = θij + ωij, (30)

where θij = 1

2 P1 x 2 =0 R

X1 f(x 1 i , x 2 j )φi(x 1 )ψj(x 2 )dω(x 1 ) are the basis expansion coefﬁcients, the random noise ωij is mean zero and variance σ2/n. The space E = {f H : ||f||H < 1}

in Equation (30) is equivalent to E = {P i=1 P2 j=1 θ2 ij (µiνj) 1}. The hypothesis in Equation (18) is equivalent to the hypothesis H0 : θi2 = 0 for i = 2, . . . , n.

Let θ11 = (θ22, θ32, . . . , θn2)T , and E11 = {θ11 | Pn i=2 θ2 i2 (µiν2) 1}. Consider a local alternative H1n : θ11 E11 with θ11 2 dn, where dn represents a generic distinguishable rate. The total

XING, LIU, MA AND ZHONG

error of a generic testing rule φn under distinguishable rate dn can be rewritten as

pseudo.risk(φn, dn) = EH0{φn|H0 is true} + sup θ11 E11 θ11 2 dn

E{1 φn|H1 is true}. (31)

Equation (31) is consistent with the testing error deﬁned by Ingster (1993), Wei and Wainwright (2020). For the simplicity of description, we order the axis length {(µiν2)} i=2 from the smallest to the largest as {ρp} p=1. Next we introduce a lemma to give a low bound of the minimum pseudo risk.

Lemma 11 For every set C and probability measure Q supported on C Bc(dn), we have

inf φn pseudo.risk(φn, dn) 1 1

Eη,η exp( η, η

where Eη,η denotes expectation with respect to an i.i.d. pair η, η Q

The proof of this lemma directly follows Lemma 3 in Wei and Wainwright (2020).

A.2.2. PROOF OF LEMMA 4

Proof As shown in Lemma 11, we have

inf φn pseudo.risk(φn, dn) 1 1

Eη,η exp( η, η

σ2 ) 1 (32)

Next we show that if δ2

4 , we have the last term in Equation (32) larger than 1/2. Let θb = δ

k Pk i=1 biei where ei is the standard basis vector with ith coordinate as one. We consider Q

as the uniform distribution on {θb, b { 1, 1}k}. The expectation in the last term of Equation (32) can be written as

Enη,η exp(n η, η

b,b exp(nθT b θb σ2 ) = 1

b,b exp(nδ2 Pk i=1 bib i kσ2 )

2k (exp(nδ2

kσ2 ) + exp( nδ2

(i) (1 + n2δ4

(ii) exp(n2δ4

where (i) is due to that 1

2(exp(x)+exp( x)) 1+x2 for |x| 1/2 and (ii) is due to that 1+x ex. Thus for any δ4 kσ4

16n2 , we have

inf φn pseudo.risk(φn, dn) 1 1

e1/16 1 1/2.

By the deﬁnition of r B, we have pseudo.risk(φn, dn) > 1/2 for all dn r B .

MINIMAX NONPARAMETRIC PARALLELISM TEST

A.2.3. PROOF OF LEMMA 5

Proof We show that bk,2(E11) is bounded below by ρk+1. It is sufﬁcient to show that E11 contains a l2 ball centered at f11 = 0 with radius ρk+1. For any v E11 with ||v||2 ρk+1, we have

(ii) 1 µk+1

where inequality (i) holds by set the (k + 1)-dimensional subspace spaned by the eigenvectors corresponding to the ﬁrst (k + 1) largest eigenvalues; inequality (ii) holds by the decreasing order of the eigenvalues, i.e., ρ1 ρ2 . . . ρk+1. Recall that the deﬁnition of the Bernstein lower critical dimension is k B(δ) = arg maxk{b2 k 1,2(E11) δ2}, we have k B(δ) arg max k { ρk δ}.

A.2.4. PROOF OF THEOREM 6

Proof By Lemme 4, we have

dn sup{δ : k B(δ) 16n2δ4}.

We plug in the lower bound of k B(δ) in Lemma 5. Then we have

dn sup{δ : arg max k { ρk δ} 16n2δ4}. (33)

The eigenvalues have polynomial decay rate i.e., ρp p 2m, and consequently, arg maxk{ ρk δ} δ 1/m. Plugging this into Equation (33), it is easy to see that the supremum on the right hand side has an order n 2m 4m+1 . Proof is thus completed.

A.3. Proof of Theorem 7

Before deriving the proof of Theorem 7, we ﬁrst state Lemma 12, Lemma 13, Lemma 14, and Lemma 15, which are used in the proof of Theorem 7. The proof of these auxiliary lemmas is referred to the Supplementary.

A.3.1. SOME AUXILIARY LEMMAS

Lemma 12 shows the projection of f10 on H11 Hmodel is zero. This result indicates our test statistic does not depend on the nuisance parameter f10.

Lemma 12 The quantity, K11M 1(In S(ST M 1S) 1ST M 1)f10, equals to zero.

The next two lemmas show the equivalence of τλ and ˆτλ under the quasi-uniform design and uniform design.

XING, LIU, MA AND ZHONG

Lemma 13 If x 1 follows the quasi-uniform random design, for any λ = 1 n1 c , m > 3/2, and any δ, c > 0, we have

P(ˆτλ τλ) 1 (n 2 2m 1 2δ + n 1 2m 1 ) exp{ cn 2m 3 2m 1 +2δ},

where τλ = max{i | µi λ} and ˆτλ = max{i | ˆµi λ}.

Lemma 14 If x 1 follows the uniform ﬁxed design condition, for m > 1/2 and λ > 0, we have

In the following lemma, we bound Tr( ) by a function of ˆτλ. This result is essential in deriving the asymptotic distribution of Tn,λ.

Lemma 15 For = M 1K2 11M 1 deﬁned in Theorem 7, we have

9 Tr( ) 4 (1 θd)2 (ˆτλ + 1

i=ˆτλ+1 ˆµi). (34)

A.3.2. PROOF OF THEOREM 7

Proof For simplicity, we suppose σ2 = 1. We deﬁne the three terms on the right-hand side of Equation (26) as T1, T2 and T3, i.e.,

nϵT M 1S(ST M 1S) 1ST S(ST M 1S) 1ST M 1ϵ,

nϵT M 1S(ST M 1S) 1ST ϵ.

We now show T2 and T3 are in smaller order compared to T1. First, we analyze the second term T2 in Equation (26). We have

n E[ϵT M 1S(ST M 1S) 1ST S(ST M 1S) 1ST M 1ϵ]

n Tr(M 1S(ST M 1S) 1ST S(ST M 1S) 1ST M 1)

nλmax( )λmax(M 1S(ST M 1S) 1ST S(ST M 1S) 1ST M 1)

where λmax denotes the largest eigenvalue. Since all eigenvalues of are less than 1, we have E[T2] 2

n. Analogously, we can derive the variance inequality of T2. Combining the results together and using the Chebyshev inequality, we have

MINIMAX NONPARAMETRIC PARALLELISM TEST

Second, we analyze the third term T3 in Equation (26). We apply the Cauchy-Schwarz inequality and have |T3| p

Finally, we derive the magnitude of T1. We ﬁrst consider the testing consistency of T1 conditional on X. Denote Eϵ as the expectation with respect to ϵ, and deﬁne Varϵ as the variance with respect to ϵ. Note that Eϵ[ϵT ϵ] = Tr( ), Varϵ[ϵT ϵ] = 2 Tr( 2).

Let Z = (ϵT ϵ Tr( ))/ p

2 Tr( 2) and t ( 1/2, 1/2). Then the log-characteristic function of Z can be written as

log Eϵ[exp(it Z)]

= log Eϵ[exp(itϵT ϵ/ p

2 Tr( 2))] it Tr( )/ p

2 log det{I2n 2it / p

2 Tr( 2)} it Tr( )/ p

2 Tr( 2). (37)

Through Taylor expansion, one has

2 log det{I2n 2it / p

=it Tr( ) p

2 Tr( 2) t2 Tr( 2)

2 Tr( 2) + O(t3 Tr( 3) [Tr( 2)]3/2 ). (38)

Combining Equations (37) and (38), we have

log Eϵ[exp(it Z)] = t2

2 + O(t3 Tr( 3) [Tr( 2)]3/2 ). (39)

Since all eigenvalues of are less than 1, we have Tr( 3)

Tr( 2) 1. Analogous to (S.11), we have

81 ˆτλ. (40)

Under the quasi-uniform design, we have Tr( 2) as λ 0 with probability approaching 1 by Lemma 13 and Equation (40). Hence, the second term on the right-hand side of Equation (39) is op(1). We thus conclude that

Eϵ[exp(it Z)] P exp( t2

Next, we show that

E[exp(it Z)] = EX Eϵ[exp(it Z)] exp( t2/2)

2). If not, there exists a subsequence of r.v X 1 nk , such that for ε > 0, |EX 1 nk Eϵ exp(it Z)

exp( t2/2)| > ε. On the other hand, since Eϵ exp(it Z(X 1 nk )) P exp( t2/2), which is bounded, there exists a sub-sub sequence {X 1 nkl}, such that Eϵ exp(it Z(X 1 nkl)) a.s exp( t2/2). Then by dominate convergence theorem, EX 1 nkl Eϵ exp(it Z) exp( t2/2), which is a contradiction. Under

XING, LIU, MA AND ZHONG

the uniform design, we can easily obtain E[exp(it Z)] exp( t2

2 ) by Lemma 14 and Equation (40). Thus Z is asymptotically normally distributed, and

T1 Tr( )/n p

2 Tr( 2)/n2

d N(0, 1). (41)

Combining (35), (36) and (41), the theorem follows.

A.4. Proof of Theorem 8

Proof Under the alternative hypothesis, the statistic Tn,λ in Equation (26) can be decomposed into three terms as follows

n||Hϵ||2 2 + 1

n||Hf11||2 2 + 2

nf T 11HT Hϵ. (42)

where H = θ11K11M 1(I S(ST M 1S) 1ST M 1). Let W1 = 1

n||Hϵ||2 2, W2 = 1

n||Hf11||2 2, and W3 = 2

nf T 11HT Hϵ denote corresponding three terms on the right-hand side of Equation (42). We now derive a lower bound for W2. By Lemma S.1, we have

1 n||Hf11 f11||2 2 1

n||Hf11 f11||2 2 + 1

n||Hf10 f10||2 2

n||Hf10 + Hf11 f10 f11||2 2

= || g g ||2 n cλ. (43)

Let c = c, we consider the distinguishable rate

1 n||f11||2 2 = ||f11||2 n > c 2d2 n = c(λ + σn,λ). (44)

where the inequality is satisﬁed since || ||n dominates || ||2 by Lemma S.2. The lower bound of W2 is thus,

n||Hf11||2 2 = 1

n||f11||2 2 1

n||f11 Hf11||2 2 cd2 n cλ cσn,λ. (45)

where the ﬁrst inequality is obtained by (43) and the second inequality is obtained through plugging in Equation (44). For the third term W3, it is seen that EW3 = 0. It is easy to verify that the eigenvalues of HHT

are all less than 1. Moreover,

n2 E[f T 11HT HϵϵT HT Hf11] = 4

n2 (Hf11)T HHT (Hf11)

n2 (Hf11)T (Hf11) = 4

By the Chebyshev s inequality, for any ϵ > 0, we have

P(|W3| 2ϵ 1

1 2 2 n ) n EW 2 3 4ϵ 1W2 ϵ.

MINIMAX NONPARAMETRIC PARALLELISM TEST

Consequently, there exists an n0, for any n > n0, we have

2W2} P(|W3| 2ϵ 1

1 2 2 n ) ϵ. (46)

Now, we are ready to prove our theorem. By the triangle inequality, we have

σn,λ + W2 + W3

σn,λ | |W2 + W3

σn,λ | |W1 µn,λ

σn,λ | (47)

σn,λ | | W3

σn,λ | |W1 µn,λ

σn,λ |. (48)

If |W1 µn,λ|

σn,λ Cϵ, and |W3| 1

2W2 hold, in view of (47) and Equation (45), we have

σn,λ + W2 + W3

Noting that W1 is identical to Equation (26), by Theorem 7, we have |W1 µn,λ|

σn,λ = Op(1). That is for any Cϵ > 0, there exists an s, for any n > s, we have

P(|W1 µn,λ|

σn,λ > Cϵ) ϵ. (49)

Setting c 2(Cϵ + z1 α

2 ) and N = max(n0, s), for any n > N, we have

P(φn,λ = 1) =P{|W1 + W2 + W3 µn,λ|

P{|W1 µn,λ|

σn,λ Cϵ, |W3| 1

1 P{|W1 µn,λ|

σn,λ > Cϵ} P{|W3| > 1

where the second inequality is due to the Boole s inequality (Casella and Berger, 2002) and the last inequality is obtained by combining Equation (45) and Equation (49). Thus, we have

sup H 1 E(1 φn,λ|H 1 is true) < δ,

where H 1 = {f | f H model and ||f11||2 Cδ pλ + σn,λ dn}.

XING, LIU, MA AND ZHONG

Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables. National Bureau of Standards, Washington, DC., 1964.

Ahmed Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28, pages 775 783. 2015.

Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. Annals of Statistics, 33(4):1497 1537, 2005.

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289 300, 1995.

Martin Bilban, Daniel Heintel, Theresa Scharl, Thomas Woelfel, Michael M Auer, Edit Porpaczy, Birgit Kainz, Alexander Krober, Vincent J Carey, Medhat Shehata, C Zielinski, W Pickl, S Stilgenbauer, A Gaiger, O Wagner, U Jager, and German CLL Study Group. Deregulated expression of fat and muscle genes in b-cell chronic lymphocytic leukemia with high lipoprotein lipase expression. Leukemia, 20(6):1080 1088, 2006.

Mikio L Braun. Accurate error bounds for the eigenvalues of the kernel matrix. Journal of Machine Learning Research, 7(Nov):2303 2328, 2006.

George Casella and Roger L Berger. Statistical inference. Duxbury Paciﬁc Grove, CA, 2nd edition, 2002.

David Degras, Zhiwei Xu, Ting Zhang, and Wei Biao Wu. Testing for parallelism among trends in multiple time series. IEEE Transactions on Signal Processing, 60(3):1087 1097, 2011.

Petros Drineas and Michael W Mahoney. On the nystr om method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153 2175, 2005.

C Ech avarri, P Aalten, HBM Uylings, HIL Jacobs, PJ Visser, EHBM Gronenschild, FRJ Verhey, and S Burgmans. Atrophy in the parahippocampal gyrus as an early biomarker of alzheimer s disease. Brain Structure and Function, 215(3-4):265 271, 2011.

Paulus Petrus Bernardus Eggermont and Vincent N La Riccia. Maximum penalized likelihood estimation, volume II. Springer, 2001.

Jianqing Fan and Jian Zhang. Sieve empirical likelihood ratio tests for nonparametric functions. Ann. Statist., 32(5):1858 1907, 10 2004.

Jianqing Fan, Chunming Zhang, and Jian Zhang. Generalized likelihood ratio statistics and wilks phenomenon. Ann. Statist., 29(1):153 193, 02 2001.

Katharina Filarsky, Angela Garding, Natalia Becker, Christine Wolf, Manuela Zucknick, Rainer Claus, Dieter Weichenhan, Christoph Plass, Hartmut D ohner, Stephan Stilgenbauer, Peter Lichter, and Daniel Mertens. Kr uppel-like factor 4 (klf4) inactivation in chronic lymphocytic leukemia

MINIMAX NONPARAMETRIC PARALLELISM TEST

correlates with promoter dna-methylation and can be reversed by inhibition of notch signaling. Haematologica, 101(6):249, 2016.

Evarist Gin e and Richard Nickl. Mathematical foundations of inﬁnite-dimensional statistical models. Cambridge University Press, 2015.

Gene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215 223, 1979.

Chong Gu. Model diagnostics for smoothing spline ANOVA models. Canadian Journal of Statistics, 32(4):347 358, 2004.

Chong Gu. Smoothing spline ANOVA models. Springer, 2nd edition, 2013.

Kasper D Hansen, Benjamin Langmead, and Rafael A Irizarry. BSmooth: from whole genome bisulﬁte sequencing reads to differentially methylated regions. Genome Biology, 13(10):R83, 2012.

Scott A Huettel, Allen W Song, and Gregory Mc Carthy. Functional magnetic resonance imaging. Sinauer Associates Sunderland, 2004.

Yuri Ingster and Irina A Suslina. Nonparametric goodness-of-ﬁt testing under Gaussian models. Springer Science & Business Media, 2012.

Yuri I Ingster. Asymptotically minimax hypothesis testing for nonparametric alternatives. i, ii, iii. Math. Methods Statist, 2(2):85 114, 1993.

Rafael A Irizarry, Christine Ladd-Acosta, Benilton Carvalho, Hao Wu, Sheri A Brandenburg, Jeffrey A Jeddeloh, Bo Wen, and Andrew P Feinberg. Comprehensive high-throughput arrays for relative methylation (charm). Genome Research, 18(5):780 790, 2008.

J Patrick Kesslak, Orhan Nalcioglu, and Carl W Cotman. Quantiﬁcation of magnetic resonance scans for hippocampal and parahippocampal atrophy in alzheimer s disease. Neurology, 41(1):51 51, 1991.

Young-Ju Kim and Chong Gu. Smoothing spline gaussian regression: more scalable computation via efﬁcient approximation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(2):337 356, 2004.

Yi Lin. Tensor product space ANOVA models. Annals of Statistics, 28(3):734 755, 2000.

Anna Liu and Yuedong Wang. Hypothesis testing in smoothing spline models. Journal of Statistical Computation and Simulation, 74(8):581 597, 2004.

Meimei Liu and Guang Cheng. Early stopping for nonparametric testing. In Advances in Neural Information Processing Systems, pages 3985 3994, 2018.

Meimei Liu, Zuofeng Shang, and Guang Cheng. Sharp theoretical analysis for nonparametric testing under random projection. In Conference on Learning Theory, pages 2175 2209, 2019.

XING, LIU, MA AND ZHONG

Qiang Liu, Jason Lee, and Michael Jordan. A kernelized stein discrepancy for goodness-of-ﬁt tests. In International Conference on Machine Learning, pages 276 284, 2016.

Ping Ma, Wenxuan Zhong, and Jun S Liu. Identifying differentially expressed genes in time course microarray data. Statistics in Biosciences, 1(2):144, 2009.

Ping Ma, Michael W Mahoney, and Bin Yu. A statistical perspective on algorithmic leveraging. Journal of Machine Learning Research, 16(1):861 911, 2015.

Siyuan Ma and Mikhail Belkin. Diving into the shallows: a computational perspective on large-scale shallow learning. In Advances in Neural Information Processing Systems, pages 3778 3787, 2017.

Axel Munk and Holger Dette. Nonparametric comparison of several regression functions: exact and asymptotic theory. Annals of Statistics, 26(6):2339 2368, 1998.

Thomas E Nichols and Andrew P Holmes. Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapping, 15(1):1 25, 2002.

Masaki Okano, Shaoping Xie, and En Li. Cloning and characterization of a family of novel mammalian dna (cytosine-5) methyltransferases. Nature Genetics, 19(3):219, 1998.

William W Orrison, Jeffrey Lewine, John Sanders, and Michael F Hartshorne. Functional brain imaging. Elsevier Health Sciences, 2017.

CP Pallasch, J Schwamb, S K onigs, A Schulz, S Debey, D Koﬂer, JL Schultze, M Hallek, A Ultsch, and CM Wendtner. Targeting lipid metabolism by the lipoprotein lipase inhibitor orlistat results in apoptosis of b-cell chronic lymphocytic leukemia cells. Leukemia, 22(3):585 592, 2008.

Allan Pinkus. N-widths in Approximation Theory, volume 7. Springer Science & Business Media, 2012.

Lorena Rami, Roser Sala-Llonch, Cristina Sol e-Padull es, Juan Fortea, Jaume Olives, Albert Llad o, Cleofe Pe na-G omez, Mircea Balasa, Bea Bosch, Anna Antonell, R. Sanchez-Valle, D. Bartr es-Faz, and J. L. Molinuevo. Distinct functional activity of the precuneus and posterior cingulate cortex during encoding in the preclinical stage of alzheimer s disease. Journal of Alzheimer s Disease, 31 (3):517 526, 2012.

Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research, 15(1):335 366, 2014.

Serge ARB Rombouts, Frederik Barkhof, Rutger Goekoop, Cornelis J Stam, and Philip Scheltens. Altered resting state networks in mild cognitive impairment and mild alzheimer s disease: an fmri study. Human Brain Mapping, 26(4):231 239, 2005.

Stephen W Scheff, Douglas A Price, Mubeen A Ansari, Kelly N Roberts, Frederick A Schmitt, Milos D Ikonomovic, and Elliott J Mufson. Synaptic change in the posterior cingulate gyrus in the progression of alzheimer s disease. Journal of Alzheimer s Disease, 43(3):1073 1090, 2015.

Bernhard Sch olkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416 426. Springer, 2001.

MINIMAX NONPARAMETRIC PARALLELISM TEST

Dirk Sch ubeler. Function and information content of dna methylation. Nature, 517(7534):321, 2015.

Zuofeng Shang and Guang Cheng. Local and global asymptotic inference in smoothing spline models. Annals of Statistics, 41(5):2608 2638, 2013.

Zuofeng Shang and Guang Cheng. Computational limits of a distributed algorithm for smoothing spline. Journal of Machine Learning Research, 18(1):3809 3845, 2017.

Xiaotong Shen, Hsin-Cheng Huang, and Noel Cressie. Nonparametric hypothesis testing for a spatial signal. Journal of the American Statistical Association, 97(460):1122 1140, 2002.

Stephen M Smith, Mark Jenkinson, Mark W Woolrich, Christian F Beckmann, Timothy EJ Behrens, Heidi Johansen-Berg, Peter R Bannister, Marilena De Luca, Ivana Drobnjak, David E Flitney, RK Niazy, J Saunders, J Vickers, Y Zhang, N De Stefano, JM Brady, and PM Matthews. Advances in functional and structural mr image analysis and implementation as fsl. Neuroimage, 23: S208 S219, 2004.

Dirk Stach, Oliver J Schmitz, Stephan Stilgenbauer, Axel Benner, Hartmut Do Ehner, Manfred Wiessler, and Frank Lyko. Capillary electrophoretic analysis of genomic DNA methylation levels. Nucleic Acids Research, 31(2):e2 e2, 2003.

Lars Sthle and Svante Wold. Analysis of variance (anova). Chemometrics and Intelligent Laboratory Systems, 6(4):259 272, 1989.

John D Storey, Wenzhong Xiao, Jeffrey T Leek, Ronald G Tompkins, and Ronald W Davis. Signiﬁcance analysis of time course microarray experiments. Proceedings of the National Academy of Sciences, 102(36):12837 12842, 2005.

Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, , , and Jill P Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression proﬁles. Proceedings of the National Academy of Sciences, 102(43): 15545 15550, 2005.

Mehrdad Vossoughi, SMT Ayatollahi, Mina Towhidi, and Seyyed Taghi Heydari. A distribution-free test of parallelism for two-sample repeated measurements. Statistical Methodology, 30:31 44, 2016.

Grace Wahba. Spline models for observational data. SIAM, 1990.

Grace Wahba, Yuedong Wang, Chong Gu, Ronald Klein, Barbara Klein, et al. Smoothing spline anova for exponential families, with application to the wisconsin epidemiological study of diabetic retinopathy: the 1994 neyman memorial lecture. Annals of Statistics, 23(6):1865 1895, 1995.

Liang Wang, Yufeng Zang, Yong He, Meng Liang, Xinqing Zhang, Lixia Tian, Tao Wu, Tianzi Jiang, and Kuncheng Li. Changes in hippocampal connectivity in the early stages of alzheimer s disease: evidence from resting state fmri. Neuroimage, 31(2):496 504, 2006.

Yazhen Wang. Change curve estimation via wavelets. Journal of the American Statistical Association, 93(441):163 172, 1998.

XING, LIU, MA AND ZHONG

Yuedong Wang. Smoothing splines: methods and applications. CRC Press, 2011.

Yuting Wei and Martin J Wainwright. The local geometry of testing in ellipses: Tight control via localized kolmogorov widths. IEEE Transactions on Information Theory, 2020.

Simon N Wood. Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95 114, 2003.

Yun Yang, Mert Pilanci, Martin J Wainwright, et al. Randomized sketches for kernels: Fast and optimal nonparametric regression. Annals of Statistics, 45(3):991 1023, 2017.

MINIMAX NONPARAMETRIC PARALLELISM TEST

Supplementary

Supplement to Minimax Nonparametric Parallelism Test

This document contains some auxiliary lemmas, the proofs of Corollary 9 and Corollary 10 as well as the proofs of Lemma 12, Lemma 13, Lemma 14, and Lemma 15 stated in Appendix.

Section S.1 includes some auxiliary lemmas in proving Theorem 8.

Section S.2 includes the proof of Corollary 9 and Corollary 10.

Section S.3 includes the proof of Lemma 12, Lemma 13, Lemma 14, Lemma 15, Lemma S.1, Lemma S.2, and Lemma S.4.

S.1. Some Auxiliary Lemmas in Proving Theorem 8

We ﬁrst introduce several notations and lemmas and then start the main proof of Theorem 8. Let g = f10 + f11 and its estimator as

g = R[M 1 M 1S(ST M 1S) 1ST M 1]g . (S.1)

Lemma S.1 If f H < 1 for any f H, as n , λ 0 and λ n 1, we have

|| g g ||2 n cλ,

where c is a constant, || ||n is the empirical norm.

In the following Lemma S.2, we discuss the relationship between the empirical norm and L2 norm. Recall the deﬁnition of empirical norm and L2 norm are as follows:

||f||2 n = 1

i=1 f2(x 1 i , x 2 j ) and ||f||2 2 =

0 f2(x 1 , x 2 )dω1.

Lemma S.2 Under the quasi-uniform design or the uniform design, for f : X1 X2 R and a positive constant c, we have ||f||2 c||f||n,

i.e. the empirical norm of f dominates the L2 norm.

S.2. Proofs of Corollary 9 and Corollary 10

In order to ﬁnd the optimal distinguishable rate, we need to bound the tail sum of the eigenvalues of the empirical kernel matrix. We state the following two lemmas which give upper bounds for the tail sum of the eigenvalues of the empirical kernel matrix under the quasi-uniform design and the uniform design respectively.

XING, LIU, MA AND ZHONG

Lemma S.3 (Liu et al. (2019)) If 1/n < λ 0 and the quasi-uniform design is satisﬁed, then with probability at least 1 4e τλ, n X

i=ˆτλ+1 ˆµi Cτλµτλ,

where C > 0 is an absolute constant.

Lemma S.4 If λ > 0 and the uniform design is satisﬁed, we have

i=ˆτλ+1 ˆµi Cτλµτλ,

where C > 0 is an absolute constant.

Now we start the main proof of Corollary 9 and Corollary 10. The distinguishable rate is

where σ2 n,λ = 2θ4 11σ4 Tr( 2)/n2. We now derive the order of σ2 n,λ. Since the eigenvalues of are less than 1, and by Lemma 15, we have

Tr( 2) Tr( ) 4 (1 θd)2 (ˆτλ + 1

i=ˆτλ+1 ˆµi).

Proof Under the quasi-uniform design, applying Lemma A.1, we have

Tr( 2) 4 (1 θd)2 (ˆτλ + 1

with probability at least 1 4e τλ. Combining the lower bound of Tr( 2) in Equation (40) and Lemma S.3, we have Tr( 2) τλ, (S.2)

with probability at least 1 4e τλ (n 2 2m 1 2ϵ + n 1 2m 1 ) exp{ cn 2m 3 2m 1 +2ϵ}. Similarly, we have Equation (S.2) satisﬁed under the uniform design by applying Lemma 14 and Lemma S.4. Using Equation (S.2), we have

2m n 2 τλn 2. (S.3)

By the Cauchy-Schwartz inequality, the distinguishable rate dn = pλ + σn,λ is minimized when λ σn,λ, i.e., λ n 4m/(4m+1).

Thus we have the minimum distinguishable rate

d n n 2m/(4m+1).

By Lemma S.2, this optimal distinguishable rate is achieved in the sense of L2 norm.

MINIMAX NONPARAMETRIC PARALLELISM TEST

S.3. Proofs of Auxiliary Results

S.3.1. PROOF OF LEMMA 12

Proof Write matrix R as

R = θ01K01 + θ11K11 = 1

" K 1 1 θd K 1 1 θd K 1 1 K 1 1

where θd = θ01 θ11. The inverse of M can be written as

" 1 2K 1 1 + λIn θd

2 K 1 1 1 2K 1 1 + λIn

# 1 A B B A

= A 1 + A 1B(A BA 1B) 1BA 1 A 1B(A BA 1B) 1

A 1B(A BA 1B) 1 (A BA 1B) 1

where A = 1

2K 1 1 + λIn, B = θd

2 K, and In denotes the n n identity matrix. Note that S is a 2n 2 matrix deﬁned as S = (1n, 1n)T . We thus have

ST M 1S = a b b c

a = 1T A 11 + 1T A 1B(A BA 1B) 1BA 11 + 2b c,

b = 1T BA 1(A BA 1B) 11 + 1T (A BA 1B) 11,

c = 1T (A BA 1B) 11.

Consequently,

S(ST M 1S) 1ST = 1 ac b2

c11T (c b)11T

(c b)11T (a + c 2b)11T

c11T (c b)11T

(c b)11T c11T

where the second equality holds by the fact a 2b = 0 and Woodbury matrix identity. Note that f10 = (f10(x 1 1 ), . . . , f10(x 1 n ), f10(x 1 1 ), . . . , f10(x 1 n )). Let

(f10(x 1 1 ), . . . , f10(x 1 n ) h T .

Therefore, we have

K11M 1(In S(ST M 1S) 1ST M 1)f10

" K 1 1 K 1 1 K 1 1 K 1 1

(M 1 1 ac b2 M 1 c11T (c b)11T

(c b)11T c11T

XING, LIU, MA AND ZHONG

Since both M 1 and 1 ac b2 M 1 c11T (c b)11T

(c b)11T c11T

M 1) are symmetric matrices and their

diagnonal entries are identical, we have

(M 1 1 ac b2 M 1 c11T (c b)11T

(c b)11T c11T

Simple algebra yields K11M 1(In S(ST M 1S) 1ST M 1)f10 = 0.

S.3.2. PROOF OF LEMMA 13

Proof Under the quasi-uniform design, X 1 1 ,. . . , X n n are i.i.d with distribution ω 1 . Therefore, by Theorem 3 in Braun (2006), for 1 i n and i r n, simple algebra yields

P(|ˆµi µi| cmµi + µr + Λr) 1 r(r + 1) exp{ nc2 m 2C4r2 },

where Λr = P i=r+1 µi, C is an absolute constant, and cm is a constant depending solely on m. Since the eigenvalue µi has the polynomial decay rate i 2m, we have

i=r+1 i 2m.

For m > 1/2, X

i=r+1 i 2m Z

r x 2mdx = r1 2m

2m 1 = O(r1 2m).

Let r = n1/(2m 1) ϵ, we have Λr + µr = O(n2ϵm 1 ϵ) = o(µi) for i = 1, . . . , n1/2m ϵ. Next, we have, for any i = 1, . . . , n 1 2m ϵ, the empirical eigenvalue ˆµi satisﬁes

|ˆµi µi| cmµi,

with probability at least

1 (n 2 2m 1 2ϵ + n 1 2m 1 ) exp{ cn 2m 3 2m 1 +2ϵ}, (S.4)

where c = c2 m 2C4 , cm is a constant only related to m, and M is an absolute constant. To ensure the probability in Equation (S.4) goes to 1, we further require m > 3/2. Thus, for λ > 1/n and m > 3/2, ˆτλ τλ

with probability at least 1 (n 2 2m 1 2ϵ + n 1 2m 1 ) exp{ cn 2m 3 2m 1 +2ϵ}.

MINIMAX NONPARAMETRIC PARALLELISM TEST

S.3.3. PROOF OF LEMMA 14

Proof Considering H 1 as the homogeneous Sobolev space, the kernel function K 1 1 can be explicitly written as

K 1 1 (x, y) = 2

cos(2πk(x y))

Under the uniform design, we have the X 1 1 , . . . , X 1 n evenly distributed on [0, 1]. Without loss of generality, we assume that X 1 1 < < X 1 n . Therefore, the ii th entry of kernel matrix K 1 1 is K 1 1 (x 1 i , x 1 i ) which is a symmetric circulant matrix of order n (Shang and Cheng, 2017) with eigenvalues

(P k=1 1 [2π(kn i)]2m + P k=0 1 [2π(kn+i)]2m if 1 i n 1

2 P k=1 1 (2πkn)2m if i = n . (S.5)

Note that ˆµ i is a re-arrangement of ˆµi. When m > 1/2, simple calculation yields

1 [2π(n i)]2m + 1 (2πi)2m + 2cm(2πn) 2m ˆµ i

1 [2π(n i)]2m + 1 (2πi)2m + 2 cm(2πn) 2m,

for i = 1, . . . , n 1, and ˆµ n = 2 cm(2πn) 2m,

where cm := P k=1 k 2m, and cm = P k=2 k 2m. By Equation (S.6), we have ˆµ i µi for 1 i n

2 and ˆµ i µn i for n

2 i n. Since {ˆµ}n i=1 are obtained by ordering {ˆµ i }n i=1 decreasingly, we have µi ˆµi, and consequently, τλ ˆτλ

for any λ > 0.

S.3.4. PROOF OF LEMMA 15

Proof Note that the kernel matrix K 1 1 in Equation (A.1) has the spectral decomposition K 1 1 = UDUT , where the eigenvector matrix U is a n n unitary matrix and the eigenvalue matrix D = Diag{ˆµi} is a diagonal matrix with eigenvalues ˆµ1 ˆµ2 ˆµn. Correspondingly, we have the following decomposition,

D + 2λIn θd D θd D D + 2λIn

XING, LIU, MA AND ZHONG

where In is the n n identity matrix, and θd = θ01 θ11. Letting E = D+2λIn = Diag{ˆµi +2λ} and F = θd D = Diag{θdˆµi}, we have

K11M 1 = U 0 0 U

1 UT 0 0 UT

Using the inverse of block matrix, we have D D D D

1 V11 V12 V21 V22

V11 = DE 1 + (D + DE 1F)(E FE 1F) 1FE 1,

V12 = (DE 1F + D)(E FE 1F) 1,

V21 = V12, (S.7)

V22 = V11. (S.8)

Consequently,

= M 1K2 11M 1 = U 0 0 U

V11 V12 V21 V22

T V11 V12 V21 V22

We thus have

Tr( ) = Tr(M 1K2 11M 1) = Tr( V11 V12 V21 V22

T V11 V12 V21 V22

= Tr V T 11V11 + V T 21V21 V T 11V12 + V T 21V22 V T 12V11 + V T 22V21 V T 12V12 + V T 22V22

By Equation (S.7) and Equation (S.8), we have

V T 11V11 + V T 21V21 = V T 12V12 + V T 22V22.

Simple algebra yields V T 11V11 + V T 21V21 = 2V T 11V11.

Therefore, we have Tr( ) = 4 Tr(V T 11V11). (S.10)

Notice that D, E, F are diagonal matrices, we have

Tr( ) = 4 Tr(V T 11V11) 4 Tr (D2E 2).

D2E 2 = Diag{ ˆµ2 i (ˆµi + 2λ)2 },

ˆµ2 i (ˆµi + 2λ)2 4

ˆµ2 i (ˆµi + 2λ)2 , (S.11)

MINIMAX NONPARAMETRIC PARALLELISM TEST

where ˆτλ is the effective dimension for kernel matrix K11. For the any i < ˆτλ, we have ˆµi ˆµi+2λ > 1

3. Thus we have Tr( ) 4

Now we shall prove the upper bound for Tr( ). Since Tr( ) has the expression in Equation (S.10), we expand V11 as

V11 = DE 1 + DE 1(F(E FE 1F) 1FE 1 + (E FE 1) 1F).

The ith diagonal entry of F(E FE 1F) 1FE 1 is

Diagi(F(E FE 1F) 1FE 1) = θ2 dˆµ2 i (ˆµi + 2λ θ2 d ˆµ2 i ˆµi+2λ)(ˆµi + 2λ)

θ2 d 1 θ2 d , (S.12)

and the ith diagonal entry of (E FE 1) 1F is

Diagi((E FE 1) 1F) = θdˆµ

ˆµi + 2λ θ2 d ˆµ2 i ˆµi+2λ θd 1 θ2 d . (S.13)

Combining Equation (S.12) and Equation (S.13), we have the ith diagonal entry of V11

Diagi(V11) (1 + θ2 d 1 θ2 d + θd 1 θ2 d Diagi(DE 1) = 1 1 θd Diagi(DE 1).

Since the lower diagonal block of DE 1 is identical to the upper diagonal block, we only need to bound the trace of DE 1. We have

Tr(D2E 2) =

ˆµ2 i (ˆµi + 2λ)2 +

ˆµ2 i (ˆµi + 2λ)2

ˆµi ˆµi + 2λ +

ˆµi ˆµi + 2λ

i=ˆτλ+1 ˆµi.

Thus we have Tr( ) 4 (1 θd)2 (ˆτλ + 1

2λ Pn i=ˆτλ+1 ˆµi).

S.3.5. PROOF OF LEMMA S.1

Proof By the functional decomposition in Equation (10), we have

||f10 + f11||2 H10 H11 ||f||2 H < 1. (S.14)

XING, LIU, MA AND ZHONG

For any function g in H10 H11, we write g = ξT c + ζ( ), where ζ( ) H10 H11 is orthogonal to ξ. Moreover,

||f10 + f11||2 H10 H11 =||ξT c||2 H10 H11 + ||ζ( )||2 H10 H11

n c T R c = 1

n(n c T R)R 1(n R c)

ng T R 1g . (S.15)

Combining Equation (S.14) and Equation (S.15), we have

1 ng T R 1g < 1. (S.16)

By Equation (S.1), we have

|| g g ||2 n = 1

n||g RM 1g + RM 1S(ST M 1S) 1ST M 1g ||2 2

ng T (I RM 1)2g + 1

n||RM 1S(ST M 1ST ) 1ST M 1g ||2 2.

Noting that M = R + λIn, the eigenvalues of In R(R + λIn) 1 are all smaller than 1, and the rank of RM 1S(ST M 1ST ) 1ST M 1 is 2, we have

|| g g ||2 n 1

ng (I R(R + λI) 1)g + O( 1

where the last inequality holds by applying Woodbury matrix identity,

(R + λIn) 1 = R 1 R 1( 1

λIn + R 1) 1R 1 R 1 λR 2,

and Equation (S.16). The proof is thus completed.

S.3.6. PROOF OF LEMMA S.2

Proof Under the quasi-uniform design, Theorem 3.1 of Eggermont and La Riccia (2001)(page 384, Eggermont and La Riccia (2001)) implies that || ||ω 1 mh norm is equivalent to || ||n for any ﬁxed x 2 . The || ||ω 1 hm, which is deﬁned as ||f(x 1 , 0)||2 ω 1 mh = ||f(x 1 , 0)||2 L2(ω 1 ) +

h2m||f(x 1 , 0)(m)||2, trivially dominates the || ||L2(ω 1 ) norm. Since x 2 can only take the values 0 or 1, we have that || ||n dominates || ||2, i.e. there exists a positive constant c such that ||f||2 c||f||n. Under the uniform design, Lemma 2.27 in Eggermont and La Riccia (2001) states that || ||n dominates || ||2 for x 1 .

MINIMAX NONPARAMETRIC PARALLELISM TEST

S.3.7. PROOF OF LEMMA S.4

Proof Under the uniform design, the empirical eigenvalues could be calculated by Equation (S.5). By the deﬁnition of ˆτλ, we have

i=ˆτλ+1 ˆµi = X

{i|ˆµ i <λ} ˆµ i . (S.17)

Since the population eigenvalues are {(2πi) 2m} i=1, we calculate the population efﬁcient dimension as τλ = (λ) 1/2m/2π. By the inequalities Equation (S.6), we have ˆµ i λ for i = 1, . . . , τλ or i = n τλ, . . . , n. We can bound the term in Equation (S.17)

{i|ˆµ i <λ} ˆµ i

i=τλ ˆµ i .

By the upper bound of ˆµ i given in Equation (S.6), we have

i=τλ ˆµ i Cτλµτλ,

which completes the proof.