Journal of Machine Learning Research 24 (2023) 1-30. Submitted 11/22; Revised 8/23; Published 11/23.

# Group SLOPE Penalized Low-Rank Tensor Regression

**Yang Chen** (yangchen@bjtu.edu.cn) and **Ziyan Luo** (zyluo@bjtu.edu.cn)
School of Mathematics and Statistics, Beijing Jiaotong University, Beijing, P. R. China

**Editor:** Zhihua Zhang

## Abstract

This article seeks a selection and estimation procedure for a class of tensor regression problems with multivariate covariates and matrix responses, one that provides theoretical guarantees for model selection in finite samples. Exploiting the frontal slice sparsity and low-rankness inherited in the coefficient tensor, we formulate the regression procedure as a group SLOPE penalized low-rank tensor optimization problem based on an orthogonal decomposition, namely TgSLOPE. This procedure provably controls the newly introduced tensor group false discovery rate (TgFDR), provided that the predictor matrix is column-orthogonal. Moreover, we establish the asymptotically minimax convergence with respect to the TgSLOPE estimate risk. For efficient problem resolution, we equivalently transform the TgSLOPE problem into a difference-of-convex (DC) program with a level-coercive objective function. This allows us to solve the reformulated problem by an efficient proximal DC algorithm (DCA) with global convergence. Numerical studies conducted on synthetic data and a real human brain connection data set illustrate the efficacy of the proposed TgSLOPE estimation procedure.

**Keywords:** difference-of-convex, false discovery rate, group sparsity, low-rankness, tensor regression

## 1. Introduction

Tensor regression modeling, in which the regression coefficients take the form of a multi-way array or tensor, is an important and prevalent technique for coefficient estimation and/or feature selection in high-dimensional statistical learning, with wide applications in many modern data science problems such as neuroimaging analysis; see, e.g., Zhou et al. (2013); Sun and Li (2017); Raskutti et al. (2019); Ahmed et al. (2020); Han et al. (2022).

In this article, we focus on tensor regression with multivariate covariates and matrix responses. Given $n$ independent and identically distributed (i.i.d.) observations $\{(\boldsymbol{x}_i, Y_i)\}_{i=1}^n$ with $\boldsymbol{x}_i \in \mathbb{R}^p$ the vector of predictors and $Y_i \in \mathbb{R}^{p_1 \times p_2}$ the matrix of responses, the regression model can be expressed as
$$Y_i = \mathcal{B}^\star \times_3 \boldsymbol{x}_i + E_i, \quad i \in [n] := \{1, 2, \ldots, n\}, \tag{1}$$
where $\mathcal{B}^\star \in \mathbb{R}^{p_1 \times p_2 \times p}$ is the unknown coefficient tensor with some inherited structures such as sparsity and/or low-rankness, and $E_i$, $i = 1, \ldots, n$, are i.i.d. noise matrices whose entries are i.i.d. drawn from the Gaussian distribution $N(0, \sigma^2)$. Our goal is to seek a feature selection and tensor estimation procedure for model (1). A straightforward idea to estimate $\mathcal{B}^\star$ is via optimization problems of the form
$$\hat{\mathcal{B}} = \arg\min_{\mathcal{B} \in \Omega} f(\mathcal{B}; D),$$
where $\Omega$ is a constraint set of sparse and/or low-rank tensors, and $f(\mathcal{B}; D)$ can be taken as the least squares loss or any more general loss function, with $D = \{(\boldsymbol{x}_i, Y_i)\}_{i=1}^n$ the random sample set.
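To make the data-generating mechanism of model (1) concrete, the following minimal NumPy sketch simulates it with toy dimensions; the variable names and dimensions are illustrative assumptions of ours, not the authors' code.

```python
import numpy as np

# A hedged sketch of model (1): Y_i = B* x_3 x_i + E_i, where
# (B x_3 x)_{j1,j2} = sum_k B_{j1,j2,k} x_k. Toy dimensions, illustrative names.
rng = np.random.default_rng(0)
p1, p2, p, n, sigma = 4, 3, 6, 5, 0.1
B = rng.standard_normal((p1, p2, p))           # coefficient tensor B*
X = rng.standard_normal((n, p))                # row i is the predictor vector x_i
E = sigma * rng.standard_normal((p1, p2, n))   # i.i.d. N(0, sigma^2) noise entries
Y = np.einsum("jkr,ir->jki", B, X) + E         # stacked responses; Y[:, :, i] = B x_3 x_i + E_i
# the per-observation form of (1) agrees with the stacked mode-3 product:
assert np.allclose(Y[:, :, 0], B @ X[0] + E[:, :, 0])
```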
### 1.1 Related Work

One simple approach to estimating the sparse and/or low-rank coefficient tensor is to use matricization techniques so that model (1) degenerates into a linear matrix regression model, in which the estimated tensor $\mathcal{B}$ and the response matrix $Y_i$ are unfolded into a $p \times p_1p_2$ matrix and a $p_1p_2$-dimensional vector, respectively. A huge body of relevant work on sparse and low-rank matrix methods can therefore be applied to tensor regression problems; see, e.g., variable selection by sparsity penalized methods (Obozinski et al., 2011; Raskutti et al., 2019; Chen et al., 2021); low-rank matrix estimation by reduced-rank regression methods (Bura et al., 2018; Fan et al., 2019; Wei et al., 2021) and nuclear norm penalized methods (Hu et al., 2020, 2021); joint penalized methods for selection and low-rank estimation (Yu et al., 2022); and regularized covariance estimation for matrix-valued data (Zhang et al., 2022). However, matricization not only breaks the sparse and low-rank structures of tensors, making the resulting estimators difficult to interpret, but also leads to a dramatic increase in dimensionality, which is prone to over-fitting.

Keeping the tensor format for regression models, the existing work can be divided into two categories according to how the sparsity and low-rankness of the coefficient tensor are characterized. One is based on tensor decompositions, including the CP decomposition (CPD) (Zhou et al., 2013; Sun and Li, 2017; Hao et al., 2020) and the Tucker decomposition (Li et al., 2018; Zhang et al., 2020; Han et al., 2022; Hu et al., 2022). The other works without tensor decomposition, such as imposing sparsity on elements, fiber vectors, and slice matrices (Raskutti et al., 2019), assuming low-rankness of slice matrices (Chen et al., 2019; Raskutti et al., 2019; Kong et al., 2020), and considering Tucker low-rankness of coefficient tensors (Chen et al., 2019).

Focusing on the matrix response tensor regression model (1), limited work has been done on statistical property analysis and algorithm design. Sun and Li (2017) considered an element-wise sparse tensor regression model based on the CPD, and established a non-asymptotic estimation error bound for the estimator obtained from their alternating updating algorithm, which incorporates a truncation-based sparse tensor decomposition procedure. Considering frontal slice sparsity and low-rankness, Kong et al. (2020) proposed a two-step screening and estimation procedure and showed that it enjoys estimation consistency and rank consistency. Hu et al. (2022) developed a generalized Tucker decomposition model with features on multiple modes, and investigated the statistical convergence of their supervised tensor decomposition algorithm with side information. Nevertheless, these methods may not work well for finite samples, since their feature selection guarantees usually hold only asymptotically (that is, with infinitely many samples), or, unfortunately, some of them are not capable of feature selection at all.

To seek a mechanism that enables inference about the validity of the selected model in finite samples, Bogdan et al. (2015) introduced a new convex penalized method for classical linear regression inspired by the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995), namely Sorted L-One Penalized Estimation (SLOPE).
They showed that SLOPE controls the false discovery rate (FDR) under certain conditions. Following Bogdan et al. (2015), a recent line of research (Su and Candès, 2016; Bellec et al., 2018; Brzyski et al., 2019; Luo et al., 2019; Wei et al., 2021) has studied SLOPE-based methods, including statistical property analysis, algorithm design, etc. One of these works (Brzyski et al., 2019) extended SLOPE to group SLOPE (gSLOPE) to deal with the situation where one aims to select whole groups of regressors instead of single ones. This motivates us to investigate the use of the gSLOPE penalty for feature selection and estimation in the tensor regression framework with finite samples.

### 1.2 Our Contributions

We consider the matrix response tensor regression model (1) and embed the frontal slice sparse and low-rank structures in the estimated tensor. Unlike the sparse and low-rank settings imposed directly on the frontal slices (Kong et al., 2020), we characterize the inherited structures through a low-rank, orthogonal decomposition (LROD). Like the CPD, the orthogonal decomposition reduces the model dimensionality, thereby reducing the computational complexity of the regression procedure. In addition, encouraging sparsity on the factor matrices produced by the decomposition has been shown not only to yield asymptotically consistent estimators in high-dimensional settings (Johnstone and Lu, 2009), but also to simplify visualization and interpretation of data analysis results (Allen, 2012). It is worth mentioning that an optimal LROD approximation always exists for a given tensor (Sørensen et al., 2012), whereas the rank-$k$ approximation problem for fitting a CPD tensor is ill-posed for many values of $k$ in general (Silva and Lim, 2008). This makes the LROD tensor regression procedure more stable than CPD-based methods.

To investigate the selection and estimation properties, we develop a sparse and low-rank tensor regression procedure by formulating it as a gSLOPE penalized LROD tensor optimization problem, namely TgSLOPE. Then, to measure its feature selection performance in finite samples, we define the notion of the tensor group FDR (TgFDR). Under the column-orthogonality assumption on the predictor matrix, TgSLOPE is shown to control TgFDR at any given level $0 < q < 1$ with an appropriate choice of the regularization parameters. Moreover, the tensor estimator produced by TgSLOPE provably achieves the asymptotically minimax rate. Overall, our proposed TgFDR controlling method is also minimax optimal with respect to the estimation risk.

To efficiently solve the proposed TgSLOPE model with its Stiefel manifold constraints, we constructively reformulate this manifold optimization problem as a difference-of-convex (DC) program whose objective function is shown to be level-coercive. This allows us to adopt a globally convergent DC-type algorithm in which the decision variables are updated via the proximal operator of the gSLOPE penalty. Simulations on synthetic data verify the TgFDR control, and test the effects of sparsity, model size and LROD rank on TgSLOPE performance, respectively. In addition, numerical results on both synthetic data and a real human brain connection data set confirm the superiority of our proposed TgSLOPE procedure in comparison with several state-of-the-art approaches.

Figure 1: Slices of an order-3 tensor: (a) horizontal slices $\mathcal{A}_{j_1::}$, (b) lateral slices $\mathcal{A}_{:j_2:}$, (c) frontal slices $\mathcal{A}_{::j_3}$.
### 1.3 Notation and Preliminaries

Throughout the article, we denote scalars, vectors, matrices and tensors by lowercase letters (e.g., $a, b$), boldface lowercase letters (e.g., $\boldsymbol{a}, \boldsymbol{b}$), boldface uppercase letters (e.g., $A, B$), and calligraphic letters (e.g., $\mathcal{A}, \mathcal{B}$), respectively. Zero scalars, vectors, matrices and tensors are denoted by $0$, $\boldsymbol{0}$, $O$ and $\mathcal{O}$, respectively. For a vector $\boldsymbol{a} \in \mathbb{R}^n$, we write $\boldsymbol{a}_{[1:s]} = (a_1, \ldots, a_s)^\top$ with $s \le n$. For a matrix $A \in \mathbb{R}^{m \times n}$, the $i$th row and the $j$th column are denoted by $\boldsymbol{a}_{i:}$ and $\boldsymbol{a}_{:j}$, respectively, and we write $A_{[:,1:s]} = (\boldsymbol{a}_{:1}, \ldots, \boldsymbol{a}_{:s})$ with $s \le n$. Figure 1 shows the horizontal, lateral, and frontal slices of a tensor $\mathcal{A} \in \mathbb{R}^{p_1 \times p_2 \times p_3}$, denoted by $\mathcal{A}_{j_1::} \in \mathbb{R}^{p_2 \times p_3}$, $\mathcal{A}_{:j_2:} \in \mathbb{R}^{p_1 \times p_3}$, and $\mathcal{A}_{::j_3} \in \mathbb{R}^{p_1 \times p_2}$, respectively. For convenience, we simply denote the $i$th row of $A$ as $\boldsymbol{a}_i$ and the $j_3$th frontal slice of $\mathcal{A}$ as $\mathcal{A}_{j_3}$.

Given vectors $\boldsymbol{a} \in \mathbb{R}^m$ and $\boldsymbol{b} \in \mathbb{R}^n$, denote the outer product $\boldsymbol{a} \circ \boldsymbol{b} = \boldsymbol{a}\boldsymbol{b}^\top \in \mathbb{R}^{m \times n}$ and the Kronecker product $\boldsymbol{a} \otimes \boldsymbol{b} = (a_1\boldsymbol{b}^\top, \ldots, a_m\boldsymbol{b}^\top)^\top \in \mathbb{R}^{mn}$. For matrices $A \in \mathbb{R}^{m \times q}$ and $B \in \mathbb{R}^{n \times q}$, the Khatri-Rao product is defined as $A \odot B = (\boldsymbol{a}_{:1} \otimes \boldsymbol{b}_{:1}, \ldots, \boldsymbol{a}_{:q} \otimes \boldsymbol{b}_{:q}) \in \mathbb{R}^{mn \times q}$; if $m = n$, the Hadamard product is defined as $A * B = [a_{ij}b_{ij}] \in \mathbb{R}^{m \times q}$, and the inner product is defined as $\langle A, B \rangle = \sum_{i,j} a_{ij}b_{ij}$. For an order-3 tensor $\mathcal{A} \in \mathbb{R}^{p_1 \times p_2 \times p_3}$, the mode-3 product with a matrix $X \in \mathbb{R}^{n \times p_3}$ is denoted by $\mathcal{A} \times_3 X \in \mathbb{R}^{p_1 \times p_2 \times n}$, with elements $(\mathcal{A} \times_3 X)_{i,j,l} = \sum_{k=1}^{p_3} \mathcal{A}_{i,j,k} X_{l,k}$. We also denote the matricization operator by $M_3(\cdot)$, which unfolds the tensor $\mathcal{A}$ along the third mode into the matrix $M_3(\mathcal{A}) \in \mathbb{R}^{p_3 \times p_1p_2}$. Specifically, $(M_3(\mathcal{A}))_{k,l} = \mathcal{A}_{i,j,k}$ with $l = 1 + (i-1) + (j-1)p_1$. The inverse of the mode-3 unfolding is denoted by $M_3^{-1}(\cdot)$. The tensor $\mathcal{A}$ is called rank-one if it can be written as the outer product of vectors, $\mathcal{A} = \boldsymbol{u} \circ \boldsymbol{v} \circ \boldsymbol{w}$. More about tensor operations can be found in Kolda and Bader (2009).

Next we introduce some norms. For a vector $\boldsymbol{a} \in \mathbb{R}^n$, denote its $\ell_2$-norm by $\|\boldsymbol{a}\| = \sqrt{\sum_i a_i^2}$ and its $\ell_0$-norm by $\|\boldsymbol{a}\|_0 = \#\{i : a_i \neq 0\}$. For a matrix $A \in \mathbb{R}^{m \times n}$, denote its Frobenius norm by $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$ and its trace by $\mathrm{tr}(A) = \sum_i a_{ii}$. Write the singular values of $A$ as $\sigma_1(A) \ge \cdots \ge \sigma_p(A)$ with $p = \min\{m, n\}$. Then the nuclear norm is $\|A\|_* = \sum_k \sigma_k(A)$ and the spectral norm is $\|A\|_2 = \sigma_1(A)$. We also introduce the notation $\|A\|_r = (\|\boldsymbol{a}_1\|, \ldots, \|\boldsymbol{a}_m\|)^\top$ for any matrix $A \in \mathbb{R}^{m \times n}$, and $\|\mathcal{A}\|_f = (\|\mathcal{A}_1\|_F, \ldots, \|\mathcal{A}_{p_3}\|_F)^\top$ for any tensor $\mathcal{A} \in \mathbb{R}^{p_1 \times p_2 \times p_3}$. Moreover, we say that a random variable $X$ has a chi distribution with $n$ degrees of freedom, written $X \sim \chi_n$, if it is the square root of a chi-squared random variable, i.e., $X^2 \sim \chi_n^2$.

For a proper closed convex function $f : \mathbb{R}^{m \times n} \to (-\infty, \infty]$, the subdifferential of $f$ at any given $X \in \mathrm{dom}(f)$, written $\partial f(X)$, is defined by
$$\partial f(X) = \{G \in \mathbb{R}^{m \times n} : f(Y) \ge f(X) + \langle G, Y - X \rangle \text{ for all } Y \in \mathbb{R}^{m \times n}\},$$
where each matrix $G \in \partial f(X)$ is called a subgradient of $f$ at $X$. The proximal operator of $f$ is given by
$$\mathrm{Prox}_f(X) = \arg\min_Z \left\{ f(Z) + \frac{1}{2}\|Z - X\|_F^2 \right\}, \quad X \in \mathbb{R}^{m \times n}.$$
Given positive scalars $a$ and $b$, we write $a \sim b$ if $a/b \to 1$, and $a \asymp b$ if there exist uniform constants $c, C > 0$ such that $ca \le b \le Ca$.
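Since the mode-3 product and the matricization operator $M_3(\cdot)$ are used throughout, a small sketch may help fix the conventions. This is our own illustrative NumPy code, assuming the column ordering $l = 1 + (i-1) + (j-1)p_1$ stated above (zero-based: $l = i + jp_1$); it is not the authors' implementation.

```python
import numpy as np

def mode3_unfold(A):
    """M3(A): p1 x p2 x p3 -> p3 x (p1*p2), with zero-based column index l = i + j*p1."""
    p1, p2, p3 = A.shape
    return A.reshape(p1 * p2, p3, order="F").T

def mode3_fold(M, p1, p2):
    """Inverse of mode3_unfold: p3 x (p1*p2) -> p1 x p2 x p3."""
    return M.T.reshape(p1, p2, M.shape[0], order="F")

def mode3_product(A, X):
    """(A x_3 X)_{i,j,l} = sum_k A_{i,j,k} X_{l,k}, for X of shape n x p3."""
    p1, p2, _ = A.shape
    return mode3_fold(X @ mode3_unfold(A), p1, p2)

# Sanity check against the elementwise definition.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4, 5))
X = rng.standard_normal((6, 5))
assert np.allclose(mode3_product(A, X), np.einsum("ijk,lk->ijl", A, X))
```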
The remainder of this article is organized as follows. In Section 2, we introduce a frontal slice sparse and low-rank tensor regression procedure, which optimizes the gSLOPE penalized LROD tensor optimization problem, and analyze the identifiability of its parameters. Section 3 develops the statistical theory for the TgSLOPE procedure, including the TgFDR control at the prespecified level and the asymptotically minimax convergence with respect to the $\ell_2$-loss. An efficient and globally convergent pDCAe algorithm is proposed in Section 4. Section 5 reports numerical studies that verify the performance of our proposed TgSLOPE approach. Concluding remarks are drawn in Section 6. All proofs are deferred to the Appendix.

## 2. Sparse and Low-Rank Tensor Regression

In this section, we propose a feature selection and tensor estimation method for the tensor regression model (1), which can also be rewritten as
$$\mathcal{Y} = \mathcal{B}^\star \times_3 X + \mathcal{E}, \tag{2}$$
where $\mathcal{Y}, \mathcal{E} \in \mathbb{R}^{p_1 \times p_2 \times n}$ and $X \in \mathbb{R}^{n \times p}$. As discussed above, it is crucial to introduce sparse and low-rank structures in order to facilitate estimation of the ultrahigh-dimensional unknown parameters in finite samples. Thus, for efficient feature selection, we consider the frontal slice sparsity of the coefficient tensor $\mathcal{B}^\star$ based on a low-rank decomposition in the tensor regression framework (2).

### 2.1 TgSLOPE Estimate

Assume that the coefficient tensor is LROD, that is, the tensor $\mathcal{B}^\star$ admits a rank-$K$ CPD with column-orthogonal factor matrices. Specifically, the rank-$K$ CPD models a tensor as a sum of $K$ rank-one tensors (Kolda and Bader, 2009), i.e., $\mathcal{B}^\star = \sum_{k=1}^K \boldsymbol{u}_k^\star \circ \boldsymbol{v}_k^\star \circ \boldsymbol{w}_k^\star$, where the factor vectors are $\boldsymbol{u}_k^\star \in \mathbb{R}^{p_1}$, $\boldsymbol{v}_k^\star \in \mathbb{R}^{p_2}$, $\boldsymbol{w}_k^\star \in \mathbb{R}^p$, $k = 1, \ldots, K$. Figure 2 illustrates the rank-$K$ CPD. Denote the factor matrices composed of the vectors from the rank-one components by $U^\star = (\boldsymbol{u}_1^\star, \boldsymbol{u}_2^\star, \ldots, \boldsymbol{u}_K^\star) \in \mathbb{R}^{p_1 \times K}$, $V^\star \in \mathbb{R}^{p_2 \times K}$ and $W^\star \in \mathbb{R}^{p \times K}$, respectively. The CPD can then be rewritten as $\mathcal{B}^\star = [[U^\star, V^\star, W^\star]]$. The LROD assumes that the factor matrices $U^\star$ and $V^\star$ produced by the CPD are column-orthogonal. We define the rank of an LROD tensor as the CP rank $K$. Moreover, we impose row sparsity on the mode-3 factor matrix $W^\star$, which translates to sparsity of the frontal slices of the tensor $\mathcal{B}^\star$.

Figure 2: Illustration of the rank-$K$ CPD of the tensor $\mathcal{B}^\star$.

To identify significant features and estimate the coefficient tensor $\mathcal{B}^\star$, we develop a penalized sparse and low-rank tensor regression method which solves the following gSLOPE penalized LROD tensor optimization problem (TgSLOPE):
$$\min_{U, V, W} \ \frac{1}{2}\big\|\mathcal{Y} - [[U, V, W]] \times_3 X\big\|_F^2 + P_{\boldsymbol{\lambda}}(\|W\|_r) \quad \text{s.t.} \quad U^\top U = V^\top V = I_K, \tag{3}$$
where $P_{\boldsymbol{\lambda}}(\boldsymbol{x}) = \sum_{j=1}^p \lambda_j |x|_{(j)}$ is the SLOPE penalty with the regularization parameter vector $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_p)^\top$ satisfying $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ and $\lambda_1 > 0$, and $|x|_{(j)}$ the $j$th largest component of $\boldsymbol{x}$ in magnitude.

By the property of the matrix Khatri-Rao product that $(A \odot B)^\top (A \odot B) = (A^\top A) * (B^\top B)$ for any $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{n \times k}$, we have the following lemma.

**Lemma 1** For any two column-orthogonal matrices $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{n \times k}$, the matrix $C = A \odot B \in \mathbb{R}^{mn \times k}$ is column-orthogonal, that is, $C^\top C = I_k$.

Following the matricized form of a tensor, we have $M_3(\mathcal{B}^\star) = W^\star(V^\star \odot U^\star)^\top$. Denote $H^\star = V^\star \odot U^\star \in \mathbb{R}^{p_1p_2 \times K}$. Without loss of generality, we assume that $K \le \min\{p, p_1p_2\}$. It is known from Lemma 1 that $H^{\star\top}H^\star = I_K$. Thus, the TgSLOPE problem (3) can be simplified as
$$\min_{W, H} \ L(W, H) + P_{\boldsymbol{\lambda}}(\|W\|_r) \quad \text{s.t.} \quad H^\top H = I_K, \tag{4}$$
where the loss is $L(W, H) = \frac{1}{2}\|M_3(\mathcal{Y}) - XWH^\top\|_F^2$. The estimator $(\hat{W}, \hat{H})$ produced by (4) and the tensor estimator $\hat{\mathcal{B}}$ are linked via $M_3(\hat{\mathcal{B}}) = \hat{W}\hat{H}^\top$.
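As an aside, the SLOPE penalty $P_{\boldsymbol{\lambda}}$ and the claim of Lemma 1 are easy to check numerically. The sketch below is illustrative code of our own (the helper names are not from the paper); the final assertion verifies that the Khatri-Rao product of two column-orthogonal factors is column-orthogonal.

```python
import numpy as np

def slope_penalty(x, lam):
    """P_lam(x) = sum_j lam_j * |x|_(j), with lam non-increasing and nonnegative."""
    return np.sum(np.asarray(lam) * np.sort(np.abs(x))[::-1])

def khatri_rao(A, B):
    """Columnwise Kronecker product (a_:1 kron b_:1, ..., a_:q kron b_:q)."""
    m, q = A.shape
    n, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(m * n, q)

# Lemma 1 in miniature: Khatri-Rao of column-orthogonal factors is column-orthogonal.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # U^T U = I_3
V, _ = np.linalg.qr(rng.standard_normal((5, 3)))   # V^T V = I_3
H = khatri_rao(V, U)
assert np.allclose(H.T @ H, np.eye(3))
```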
Notably, the LROD tensor decomposition with the additional column-orthogonality on $U$ and $V$ is a special CPD. While the best CP low-rank approximation of a tensor may not exist and the exact CP rank of a tensor is hard to compute, the imposed column-orthogonality on $U$ and $V$ ensures that an optimal rank-$k$ approximation always exists for a given tensor (Sørensen et al., 2012). The resulting approximation tensor admitting such a special CPD has been coined an orthogonally decomposable tensor, which has been studied in the communities of tensor analysis and tensor learning; see, e.g., the perturbation bounds and the applications to the unsupervised learning scenario of tensor SVD and the supervised task of tensor regression by Auddy and Yuan (2023), and the linear convergence of an alternating polar decomposition method for low-rank orthogonal tensor approximations by Hu and Ye (2023). Additionally, Poythress et al. (2022) illustrated that the orthogonally decomposable tensor model has the potential to outperform the CPD model in terms of predictive performance and model interpretation in numerical experiments. Technically, the column-orthogonal constraints on the factor matrices help to fix the indeterminacy and the non-uniqueness of the CPD, thereby improving the identifiability of CPD methods (see, e.g., Zhou et al. (2013)).

### 2.2 Identifiability

The identifiability of the parameters is analyzed in this subsection. Considering the CPD of a tensor $\mathcal{B} = [[U, V, W]]$, Zhou et al. (2013) noted that the parameters in the tensor model are not identifiable due to two complications: the indeterminacy coming from scaling and permutation, and the possible non-uniqueness of the decomposition. For an LROD tensor, the scaling indeterminacy is avoided automatically thanks to the column-orthogonal $U$ and $V$. To fix the permutation indeterminacy, we adopt the convention that the first row of the factor matrix $U$ is arranged in descending order. Moreover, it follows from Kruskal's condition for the uniqueness of the CPD (Theorem 4b, Kruskal (1977)) that the LROD is unique up to permutation if $k(W) \ge 2$, where $k(W)$ is the k-rank of $W$, defined as the largest integer $k$ such that every subset of $k$ columns of $W$ is linearly independent.

Under the assumptions of determinacy and uniqueness of the LROD, the imposed frontal slice sparsity on the coefficient tensor $\mathcal{B}^\star$ can be equivalently translated into row sparsity of the mode-3 factor matrix $W^\star$, which yields the feasible set of problem (4)
$$\mathbb{B}_T = \left\{ \mathcal{B} = [[U, V, W]] : U^\top U = V^\top V = I_K, \ \big\|\,\|W\|_r\,\big\|_0 \le s \right\}, \tag{5}$$
where $s > 0$ is a prescribed parameter that controls the frontal slice sparsity. We give conditions for a global identifiability result in the following proposition, whose proof is presented in Appendix A.

**Proposition 2** Consider the tensor regression model (2) with the coefficient tensor $\mathcal{B}^\star = [[U^\star, V^\star, W^\star]] \in \mathbb{B}_T$. Then $\mathcal{B}^\star$ is globally identifiable up to permutation if $k(W^\star) \ge 2$ and $\big\|\,\|\mathcal{B}^\star\|_f\,\big\|_0 \le k(X)/2$.

**Remark 3** For sparse parameter models, Donoho and Elad (2003) discussed conditions for the uniqueness of the sparse coefficient vector in classical linear regression frameworks. The identifiability result in Proposition 2 can be regarded as an extension of Corollary 1 of Donoho and Elad (2003) to sparse and decomposable tensor regression models. Provided that the LROD tensor decomposition is unique up to permutation, the full coefficient tensor $\mathcal{B}^\star$ with frontal slice sparsity is globally identifiable, and thus the LROD of $\mathcal{B}^\star$ is identifiable.
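For illustration, the k-rank condition in Proposition 2 can be checked by brute force on small instances; the following helper is a hypothetical sketch of ours that simply tests every subset of $k$ columns, which is only feasible for small $K$.

```python
import numpy as np
from itertools import combinations

def k_rank(W, tol=1e-10):
    """k-rank: largest k such that every subset of k columns of W is linearly independent."""
    n_cols = W.shape[1]
    for k in range(n_cols, 0, -1):
        if all(np.linalg.matrix_rank(W[:, list(idx)], tol=tol) == k
               for idx in combinations(range(n_cols), k)):
            return k
    return 0
```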
## 3. Statistical Results

This section is devoted to the TgFDR control and the estimation accuracy of our proposed TgSLOPE procedure.

### 3.1 TgFDR Control

FDR is a commonly used error rate that measures the expected proportion of errors among the rejected hypotheses in multiple testing. In this subsection, we extend the classic FDR notion to the tensor regression setting and show that it can be controlled by our proposed TgSLOPE.

**Definition 4** Consider the tensor regression model (2) and let $(\hat{W}, \hat{H})$ be an estimator given by the optimization problem (4). We define the tensor group false discovery rate (TgFDR) for TgSLOPE as
$$\mathrm{TgFDR} = \mathbb{E}\left[\frac{V}{\max\{R, 1\}}\right], \tag{6}$$
where $V$ and $R$ are defined as
$$V = \#\{j \in [p] : \boldsymbol{w}_j^\star = \boldsymbol{0}, \ \hat{\boldsymbol{w}}_j \neq \boldsymbol{0}\}, \qquad R = \#\{j \in [p] : \hat{\boldsymbol{w}}_j \neq \boldsymbol{0}\},$$
with $\boldsymbol{w}_j^\star$ and $\hat{\boldsymbol{w}}_j$ the $j$th rows of $W^\star$ and $\hat{W}$, respectively.

Define the regularization parameters of the TgSLOPE procedure as
$$\lambda_j = \sigma F_{\chi_K}^{-1}\left(1 - \frac{qj}{p}\right), \quad j \in [p], \tag{7}$$
where $0 < q < 1$ and $F_{\chi_K}^{-1}(\alpha)$ is the $\alpha$th quantile of the $\chi$ distribution with $K$ degrees of freedom. We give an upper bound on TgFDR in the following theorem, whose technical proof is deferred to Appendix B.1.

**Theorem 5** Consider the tensor regression model (2) with the predictor matrix $X$ satisfying $X^\top X = I_p$. Then, for any solution $(\hat{W}, \hat{H})$ of the TgSLOPE problem (4) with the regularization parameters in (7), TgFDR obeys
$$\mathrm{TgFDR} = \mathbb{E}\left[\frac{V}{\max\{R, 1\}}\right] \le \frac{q(p - s)}{p},$$
with $s$ the number of nonzero rows of $W^\star$.

**Remark 6** Under the uniqueness guarantee for the LROD, our row sparsity setting on the matrix $W^\star$ can be equivalently interpreted as sparsity of the frontal slices of the coefficient tensor $\mathcal{B}^\star$; Figure 3 illustrates this equivalence. Therefore, the definition of TgFDR in (6) can be restated in terms of the random variables
$$TV = \#\{j \in [p] : \mathcal{B}_j^\star = O, \ \hat{\mathcal{B}}_j \neq O\}, \qquad TR = \#\{j \in [p] : \hat{\mathcal{B}}_j \neq O\},$$
with $\mathcal{B}_j^\star$ and $\hat{\mathcal{B}}_j$ the $j$th frontal slices of $\mathcal{B}^\star$ and $\hat{\mathcal{B}}$, respectively. This indicates that the estimator $\hat{\mathcal{B}} = M_3^{-1}(\hat{W}\hat{H}^\top)$ given by (4) provably controls TgFDR at any prespecified level $q \in (0, 1)$.

Figure 3: Illustration of the sparsity equivalence relation between the factor matrix $W^\star$ and the tensor $\mathcal{B}^\star$. Here the elements of matrices and tensors are zero (gray) and nonzero (color).

### 3.2 Estimation Accuracy

This subsection shows that TgSLOPE enjoys a minimax optimal estimation property under the assumption that the ground truth $\mathcal{B}^\star$ is bounded, that is, $\max\{\|\mathcal{B}_j^\star\|_F : j \in [p]\} < \infty$. We measure the deviation of the estimator from $\mathcal{B}^\star$ in the following theorem, whose technical proof is given in Appendix B.2.

**Theorem 7** Consider the TgSLOPE optimization problem (4) with the predictor matrix $X$ satisfying $X^\top X = I_p$ and the regularization parameters in (7). Then, as $p \to \infty$ and $s/p \to 0$, the TgSLOPE estimator attains the asymptotically minimax rate over $\mathbb{B}_s = \{\mathcal{B} : \|\,\|\mathcal{B}\|_f\,\|_0 \le s\}$, that is,
$$\sup_{\mathcal{B}^\star \in \mathbb{B}_s} \mathbb{E}\|\hat{\mathcal{B}} - \mathcal{B}^\star\|_F^2 \ \sim \ \inf_{\tilde{\mathcal{B}}} \sup_{\mathcal{B}^\star \in \mathbb{B}_s} \mathbb{E}\|\tilde{\mathcal{B}} - \mathcal{B}^\star\|_F^2,$$
where the infimum is taken over all estimators $\tilde{\mathcal{B}}$ based on the data set $D = \{(\boldsymbol{x}_i, Y_i)\}_{i=1}^n$.

**Remark 8** The asymptotically minimax result in Theorem 7 extends Theorem 1.1 of Su and Candès (2016) (for the SLOPE estimate) and Theorem 2.2 of Brzyski et al. (2019) (for the group SLOPE estimate) from the classical linear regression framework. In addition, we note that the choice of the regularization parameters in Theorem 7 does not depend on the sparsity level; in this sense, TgSLOPE is adaptive to a range of sparsity levels in achieving minimax optimality. It would also be interesting to derive non-asymptotic minimax estimation bounds, such as those in Raskutti et al. (2019) and Hao et al. (2020). We leave this as a topic for future research.
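Before moving to the algorithm, note that the regularization sequence (7) is directly computable from the quantile function of the $\chi_K$ distribution. A minimal sketch, assuming SciPy's `scipy.stats.chi` and an illustrative function name of our own:

```python
import numpy as np
from scipy.stats import chi

def tgslope_lambdas(p, K, q, sigma=1.0):
    """Sequence (7): lambda_j = sigma * F_{chi_K}^{-1}(1 - q*j/p), j = 1, ..., p."""
    j = np.arange(1, p + 1)
    return sigma * chi.ppf(1.0 - q * j / p, df=K)

lam = tgslope_lambdas(p=1000, K=20, q=0.05)
assert np.all(np.diff(lam) <= 0)  # the sequence is non-increasing, as SLOPE requires
```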
## 4. Proximal DCA with Extrapolation

In this section, we propose a proximal difference-of-convex algorithm with extrapolation (pDCAe) to solve the TgSLOPE optimization problem and establish its global convergence.

### 4.1 DC Reformulation

Given an optimal solution $(\hat{W}, \hat{H})$ of problem (4), it is known from the orthogonal Procrustes problem (Gower and Dijksterhuis, 2004) that $\hat{H}$ is determined by $\hat{W}$ via $\hat{H} = \hat{U}_{[:,1:K]}\hat{V}^\top$, where $\hat{U}, \hat{V}$ are obtained from the SVD $M_3(\mathcal{Y})^\top X\hat{W} = \hat{U}D\hat{V}^\top$ with $\hat{U} \in \mathbb{R}^{p_1p_2 \times p_1p_2}$, $D \in \mathbb{R}^{p_1p_2 \times K}$ and $\hat{V} \in \mathbb{R}^{K \times K}$. Plugging $(\hat{W}, \hat{H})$ into the objective function of problem (4), we obtain
$$\begin{aligned}
L(\hat{W}, \hat{H}) &= \tfrac{1}{2}\|M_3(\mathcal{Y})\|_F^2 + \tfrac{1}{2}\|X\hat{W}\hat{H}^\top\|_F^2 - \mathrm{tr}\big(M_3(\mathcal{Y})^\top X\hat{W}\hat{H}^\top\big) \\
&= \tfrac{1}{2}\|M_3(\mathcal{Y})\|_F^2 + \tfrac{1}{2}\|X\hat{W}\|_F^2 - \mathrm{tr}\big(D\hat{U}_{[:,1:K]}^\top\hat{U}\big) \\
&= \tfrac{1}{2}\|M_3(\mathcal{Y})\|_F^2 + \tfrac{1}{2}\|X\hat{W}\|_F^2 - \mathrm{tr}(D) \\
&= \tfrac{1}{2}\|M_3(\mathcal{Y})\|_F^2 + \tfrac{1}{2}\|X\hat{W}\|_F^2 - \|M_3(\mathcal{Y})^\top X\hat{W}\|_*.
\end{aligned}$$
Therefore, the TgSLOPE problem (4) can be equivalently transformed into
$$\hat{W} = \arg\min_W \left\{ F(W) = \frac{1}{2}\|XW\|_F^2 + P_{\boldsymbol\lambda}(\|W\|_r) - \|M_3(\mathcal{Y})^\top XW\|_* \right\}, \quad \hat{H} = \hat{U}_{[:,1:K]}\hat{V}^\top, \tag{8}$$
where $\hat{U}$ and $\hat{V}$ are from the SVD $M_3(\mathcal{Y})^\top X\hat{W} = \hat{U}D\hat{V}^\top$.

Denote $F_1(W) = \frac{1}{2}\|XW\|_F^2$ and $F_2(W) = \frac{L}{2}\|W\|_F^2 - F_1(W)$. Since $F_1$ is gradient Lipschitz continuous with modulus $L = \|X^\top X\|_2$, the convexity of $F_2$ follows. Inspired by the DC-representable functions of Gotoh et al. (2018), we rewrite the optimization problem in (8) as
$$\min_{W \in \mathbb{R}^{p \times K}} \ F(W) = \underbrace{\frac{L}{2}\|W\|_F^2 + P_{\boldsymbol\lambda}(\|W\|_r)}_{C_1(W)} - \underbrace{\left(\frac{L}{2}\|W\|_F^2 - \frac{1}{2}\|XW\|_F^2 + N(W)\right)}_{C_2(W)}, \tag{9}$$
where $N(W) = \|M_3(\mathcal{Y})^\top XW\|_*$. It is easy to verify that both $C_1$ and $C_2$ are convex, so (9) is a DC program. We have thus constructively reformulated the manifold optimization problem (4) as the DC program (9), which allows us to solve the reformulation of TgSLOPE by a DC-type algorithm.

### 4.2 Proximal DCA

For the DC program (9), we first give the subdifferential of $N(W)$ in the following lemma, which follows from the subdifferential of the matrix nuclear norm (Watson, 1992) and the chain rule of Theorem 10.6 in Rockafellar and Wets (2009).

**Lemma 9** The subdifferential of $N(W)$ is given by
$$\partial N(W) = \left\{ X^\top M_3(\mathcal{Y})\big(U_{[:,1:K]}V^\top + \widetilde{W}\big) : U_{[:,1:K]}^\top\widetilde{W} = O, \ \widetilde{W}V = O, \ \|\widetilde{W}\|_2 \le 1 \right\}, \tag{10}$$
where $U, V$ are obtained from the SVD $M_3(\mathcal{Y})^\top XW = UDV^\top$.

The DCA iterative scheme for (9) takes the form
$$\begin{aligned}
W^{(k+1)} &= \arg\min_W \left\{ \frac{L}{2}\|W\|_F^2 - \left\langle W, (LI - X^\top X)W^{(k)} + Q_1^{(k)} \right\rangle + P_{\boldsymbol\lambda}(\|W\|_r) \right\} \\
&= \arg\min_W \left\{ \frac{1}{2}\big\|W - Q^{(k)}\big\|_F^2 + P_{\boldsymbol\lambda/L}(\|W\|_r) \right\},
\end{aligned} \tag{11}$$
where $Q_1^{(k)} \in \partial N(W^{(k)})$. The optimal solution of problem (11) can then be computed via the proximal operator of the penalty $P_{\boldsymbol\lambda/L}$. Specifically,
$$\boldsymbol\eta^{(k+1)} = \arg\min_{\boldsymbol\eta} \left\{ \frac{1}{2}\big\|\|Q^{(k)}\|_r - \boldsymbol\eta\big\|^2 + P_{\boldsymbol\lambda/L}(\boldsymbol\eta) \right\}, \qquad W_j^{(k+1)} = \big(\mathrm{Prox}_{P_{\boldsymbol\lambda/L}}(Q^{(k)})\big)_j = \eta_j^{(k+1)}\frac{\boldsymbol{q}_j^{(k)}}{\|\boldsymbol{q}_j^{(k)}\|}, \quad j \in [p], \tag{12}$$
where $Q^{(k)} = W^{(k)} - \big(X^\top XW^{(k)} - Q_1^{(k)}\big)/L$ with rows $\boldsymbol{q}_j^{(k)}$. The resulting DCA is called the proximal DCA (Gotoh et al., 2018; Wen et al., 2018). Consequently, the $W$-subproblem reduces to evaluating the general SLOPE proximal operator, which can be efficiently computed by FastProxSL1 (Bogdan et al., 2015). Furthermore, to accelerate the algorithm, we incorporate extrapolation techniques (Nesterov, 2013) into the proximal DCA. The algorithmic framework is summarized in Algorithm 1.

**Algorithm 1** pDCAe for solving TgSLOPE (4)

Initialize $W^{(0)} \in \mathbb{R}^{p \times K}$ and $\{\beta^{(k)}\} \subseteq [0, 1)$ with $\sup_k \beta^{(k)} < 1$; set $W^{(-1)} = W^{(0)}$, $k = 0$.
1. Choose $Q_1^{(k)} \in \partial N(W^{(k)})$ by (10);
2. Compute $A^{(k)} = W^{(k)} + \beta^{(k)}(W^{(k)} - W^{(k-1)})$ and update $Q^{(k)} = A^{(k)} - \big(X^\top XA^{(k)} - Q_1^{(k)}\big)/L$;
3. Compute $W^{(k+1)}$ by the two-step form in (12);
4. Set $k = k + 1$;
5. If the stopping criterion is met, perform the SVD $M_3(\mathcal{Y})^\top XW^{(k)} = U^{(k)}D^{(k)}V^{(k)\top}$ to get $H^{(k)} = U^{(k)}_{[:,1:K]}V^{(k)\top}$, then stop and return $\mathcal{B}^{(k)} = M_3^{-1}(W^{(k)}H^{(k)\top})$; otherwise, go to step 1.
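The two-step group proximal map in (12) reduces to the proximal operator of the sorted $\ell_1$ norm applied to the row norms. The sketch below is an illustrative Python rendering of this step, using a stack-based pool-adjacent-violators scheme in the spirit of FastProxSL1 (Bogdan et al., 2015); the function names are ours, and the paper's MATLAB implementation may differ in detail.

```python
import numpy as np

def prox_sorted_l1(y, lam):
    """Prox of P_lam(x) = sum_j lam_j * |x|_(j) (lam non-increasing, >= 0):
    sort |y| decreasingly, run pool-adjacent-violators on |y|_sorted - lam,
    clip at zero, then undo the sort and restore signs."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(-np.abs(y))
    z = np.abs(y)[order] - np.asarray(lam, dtype=float)
    starts, vals, lens = [], [], []
    for i, zi in enumerate(z):
        starts.append(i); vals.append(zi); lens.append(1)
        # merge blocks while the running averages fail to be non-increasing
        while len(vals) > 1 and vals[-2] <= vals[-1]:
            v, l = vals.pop(), lens.pop()
            starts.pop()
            vals[-1] = (vals[-1] * lens[-1] + v * l) / (lens[-1] + l)
            lens[-1] += l
    x = np.zeros_like(z)
    for s0, v, l in zip(starts, vals, lens):
        x[s0:s0 + l] = max(v, 0.0)
    out = np.empty_like(x)
    out[order] = x
    return np.sign(y) * out

def prox_group_slope(Q, lam):
    """Two-step form (12): SLOPE prox on the row norms, then rescale each row."""
    norms = np.linalg.norm(Q, axis=1)
    eta = prox_sorted_l1(norms, lam)
    scale = np.divide(eta, norms, out=np.zeros_like(norms), where=norms > 0)
    return Q * scale[:, None]
```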
Note that the pDCAe algorithm reduces to the general proximal DCA if $\beta^{(k)} = 0$ for all $k$. Moreover, Algorithm 1 can be coupled with many popular choices of extrapolation parameters $\{\beta^{(k)}\}$, including those used in the accelerated proximal gradient (APG) method for solving SLOPE (Bogdan et al., 2015). In our numerical experiments, we follow Bogdan et al. (2015) and set $\theta^{(-1)} = 1$ and
$$\beta^{(k)} = \frac{\theta^{(k-1)} - 1}{\theta^{(k)}} \quad \text{with} \quad \theta^{(k)} = \frac{1 + \sqrt{1 + 4(\theta^{(k-1)})^2}}{2}.$$

### 4.3 Global Convergence

This subsection is dedicated to the global convergence of pDCAe. We start with the level-coercivity of the DC function $F(W)$.

**Lemma 10** Let the predictor matrix $X$ in problem (4) have full column rank. Then the DC function $F(W)$ in (9) is level-coercive in the sense that
$$\liminf_{\|W\|_F \to \infty} \frac{F(W)}{\|W\|_F} > 0.$$

Based on the level-coercivity obtained in Lemma 10, the desired global convergence of pDCAe for solving problem (4) is stated as follows.

**Theorem 11** Suppose that the predictor matrix $X$ has full column rank and let $\{W^{(k)}\}$ be the sequence generated by Algorithm 1 for solving problem (4). Then the following properties hold:
(a) The sequence $\{W^{(k)}\}$ is bounded;
(b) $\lim_{k \to \infty} \|W^{(k+1)} - W^{(k)}\|_F = 0$;
(c) Every limit point $\overline{W}$ of the sequence $\{W^{(k)}\}$ is a stationary point of $F$ in the sense that
$$O \in X^\top X\overline{W} + \partial\big(P_{\boldsymbol\lambda}(\|\cdot\|_r)\big)(\overline{W}) - \partial\big(\|M_3(\mathcal{Y})^\top X\,\cdot\,\|_*\big)(\overline{W}).$$

The proofs of the results in this subsection are provided in Appendix C.

## 5. Numerical Experiments

This section reports experiments on synthetic data and a real human brain connection data set. All numerical experiments are implemented in MATLAB (R2021a), running on a laptop with an Intel Core i5-8265U CPU (1.60GHz) and 16 GB RAM.

### 5.1 Comparative Methods

We verify the performance of the proposed TgSLOPE by comparing it with the following three approaches:

(a) TBMM: the block majorization minimization (BMM) algorithm proposed by Wei et al. (2021) to solve the TgSLOPE problem (4), whose iterative scheme is
$$W^{(k+1)} = \arg\min_W \left\{ \frac{1}{2}\|W - R^{(k)}\|_F^2 + P_{\boldsymbol\lambda/L}(\|W\|_r) \right\}, \qquad H^{(k+1)} = U^{(k+1)}_{[:,1:K]}V^{(k+1)\top},$$
where $R^{(k)} = W^{(k)} - \big(X^\top XW^{(k)} - X^\top M_3(\mathcal{Y})H^{(k)}\big)/L$, and $U^{(k+1)}$ and $V^{(k+1)}$ are from the SVD $M_3(\mathcal{Y})^\top XW^{(k+1)} = U^{(k+1)}D^{(k+1)}V^{(k+1)\top}$;

(b) TgLASSO: the group LASSO penalized LROD tensor regression approach, in which the pDCAe algorithm of Section 4 is used to solve
$$\min_{W, H} \left\{ \frac{1}{2}\|M_3(\mathcal{Y}) - XWH^\top\|_F^2 + \lambda\sum_{j=1}^p \|\boldsymbol{w}_j\| : H^\top H = I_K \right\},$$
where $\lambda > 0$ is the tuning parameter;

(c) TLRR: the LROD tensor regression (without sparse penalty), in which pDCAe is applied to solve
$$\min_{W, H} \left\{ \frac{1}{2}\|M_3(\mathcal{Y}) - XWH^\top\|_F^2 : H^\top H = I_K \right\}.$$

In our numerical experiments, we adopt the stopping criterion proposed by Wen et al. (2018) for all approaches, i.e.,
$$\frac{\|W^{(k)} - W^{(k+1)}\|_F}{\max\{\|W^{(k)}\|_F, 1\}} \le \epsilon$$
for some given tolerance $\epsilon > 0$.

### 5.2 Synthetic Data

In our experiments on synthetic data, we adopt the TgFDR defined in (6) and the tensor power (TP) to evaluate the selection performance of the estimator $\hat{\mathcal{B}}$ generated by a given method. Here TP is defined as $\mathrm{TP} = \mathbb{E}(\#T)/s$, where $T = \{j \in [p] : \mathcal{B}_j^\star \neq O, \ \hat{\mathcal{B}}_j \neq O\}$ and $s = \#\{j \in [p] : \mathcal{B}_j^\star \neq O\}$. For estimation accuracy, we evaluate the performance of $\hat{\mathcal{B}}$ in terms of the relative group estimation error (RgEE) and the mean squared error (MSE), defined respectively as
$$\mathrm{RgEE} = \frac{\big\|\|\hat{\mathcal{B}}\|_f - \|\mathcal{B}^\star\|_f\big\|^2}{\big\|\|\mathcal{B}^\star\|_f\big\|^2}, \qquad \mathrm{MSE} = \frac{\|(\hat{\mathcal{B}} - \mathcal{B}^\star) \times_3 X\|_F^2}{np_1p_2}.$$
Meanwhile, to evaluate the time efficiency of our proposed pDCAe algorithm, the CPU time (Time) is reported for each testing instance.
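The empirical counterparts of these selection and estimation measures are direct to compute; the helpers below are an illustrative sketch of ours (not the paper's code), treating a frontal slice as discovered when its Frobenius norm exceeds a tolerance. Averaging the false discovery proportion and the power over replications estimates TgFDR and TP.

```python
import numpy as np

def selection_metrics(B_true, B_hat, tol=1e-12):
    """Empirical FDP and power of frontal-slice selection for p1 x p2 x p tensors.
    Slice j counts as discovered when ||B_hat[:, :, j]||_F > tol."""
    true_nz = np.linalg.norm(B_true, axis=(0, 1)) > tol
    sel = np.linalg.norm(B_hat, axis=(0, 1)) > tol
    fdp = np.sum(sel & ~true_nz) / max(np.sum(sel), 1)       # V / max{R, 1}
    power = np.sum(sel & true_nz) / max(np.sum(true_nz), 1)  # #T / s
    return fdp, power

def rgee(B_true, B_hat):
    """Relative group estimation error on the frontal-slice norm vectors."""
    f_true = np.linalg.norm(B_true, axis=(0, 1))
    f_hat = np.linalg.norm(B_hat, axis=(0, 1))
    return np.linalg.norm(f_hat - f_true) ** 2 / np.linalg.norm(f_true) ** 2
```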
The ground truth is simulated via $\mathcal{B}^\star = M_3^{-1}(W^\star H^{\star\top}) \in \mathbb{R}^{p_1 \times p_2 \times p}$, in which $W^\star \in \mathbb{R}^{p \times K}$ is generated in a similar manner as in Brzyski et al. (2019) with $s$ nonzero rows. Specifically, each nonzero row of $W^\star$ is generated from the uniform distribution $U[0.1, 1.1]$ and then scaled such that $\|\boldsymbol{w}_j^\star\| = a\sqrt{K}$ with $a = \sqrt{4\ln(p)/(1 - p^{-2/K})} - \sqrt{K}$. The explanation of this particular simulation procedure is presented in Appendix D. The column-orthogonal matrix $H^\star \in \mathbb{R}^{p_1p_2 \times K}$ is simulated as the first $K$ left singular vectors of a $p_1p_2 \times p_1p_2$ matrix with i.i.d. standard normal entries. The response tensor $\mathcal{Y} \in \mathbb{R}^{p_1 \times p_2 \times n}$ is simulated by $\mathcal{Y} = \mathcal{B}^\star \times_3 X + \mathcal{E}$, where the entries of the noise tensor $\mathcal{E}$ are i.i.d. drawn from $N(0, 1)$.

We first verify the TgFDR controlling performance of TgSLOPE under the following two designs of the predictor matrix $X \in \mathbb{R}^{n \times p}$: (a) the orthogonal design, where $X$ is generated by orthogonalizing an $n \times p$ matrix with i.i.d. standard normal entries; (b) the commonly used Gaussian random design, as in Kong et al. (2020), where the predictor vectors $\boldsymbol{x}_i$, $i = 1, \ldots, n$, are i.i.d. drawn from the Gaussian distribution $N(\boldsymbol{0}, C)$, with $C$ a $p \times p$ covariance matrix with elements $C_{j_1j_2} = 0.5^{|j_1 - j_2|}$ for any $j_1, j_2 \in [p]$. The orthogonal design works with the regularization parameter sequence defined in (7). For the Gaussian random design, we select the regularization parameters of TgSLOPE according to Procedure 2 of Brzyski et al. (2019).

Set $n = 2000$, $p = 1000$, $p_1 = p_2 = 10$, $K = 20$ and sparsity $s = 25 : 25 : 250$. We perform 100 independent replications for each sparsity and target TgFDR level ($q = 0.05, 0.1$).

Figure 4: Estimated TgFDR and TP with $n = 2000$, $p = 1000$, $p_1 = p_2 = 10$, $K = 20$ under (a) the orthogonal and (b) the Gaussian random designs of the matrix $X$. Bars correspond to SE (standard error); black dotted lines represent the nominal TgFDR level $q(p - s)/p$.

As shown in Figure 4 (a), TgSLOPE maintains a TgFDR comparable to the nominal level and reports an estimated tensor power TP = 1 for all testing instances. For the Gaussian design, TgSLOPE reports a relatively low TgFDR with strong power, especially in the sparse cases, as shown in Figure 4 (b). The reported TgFDR of TgSLOPE does not exceed the nominal level of the case $q = 0.1$ in all sparsity settings except for $s = 250$.

In the following numerical comparisons, we consider the Gaussian random design in which the entries of $X$ are i.i.d. drawn from $N(0, 1/n)$, and test the effects of the sparsity $s$, the model size $p$ and the LROD rank $K$, respectively, under the target TgFDR level $q = 0.05$. For TgLASSO, the tuning parameter is selected by 5-fold cross-validation.

**Sparsity effect:** set $n = 3000$, $p = 1000$, $p_1 = p_2 = 10$ and $K = 20$. We test the performance of the four approaches under various sparsity levels $s = 25 : 25 : 250$. Simulation results report TP = 1 for all competitors in each testing instance. In addition, average results based on 100 independent replications are shown in Figure 5; a sketch of the data generation follows below.
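The synthetic data generation described above can be sketched as follows. This is our illustrative code under the stated scaling (the function name and RNG choices are assumptions, and the paper's MATLAB generator may differ):

```python
import numpy as np

def simulate(n, p, p1, p2, K, s, sigma=1.0, seed=None):
    """Hedged sketch of the synthetic design: B* = fold(W* H*^T), Y = B* x_3 X + E."""
    rng = np.random.default_rng(seed)
    a = np.sqrt(4 * np.log(p) / (1 - p ** (-2.0 / K))) - np.sqrt(K)
    W = np.zeros((p, K))
    rows = rng.choice(p, size=s, replace=False)
    W[rows] = rng.uniform(0.1, 1.1, size=(s, K))
    W[rows] *= (a * np.sqrt(K)) / np.linalg.norm(W[rows], axis=1, keepdims=True)
    # H*: first K left singular vectors of a Gaussian (p1*p2) x (p1*p2) matrix
    H = np.linalg.svd(rng.standard_normal((p1 * p2, p1 * p2)))[0][:, :K]
    B = (W @ H.T).T.reshape(p1, p2, p, order="F")   # fold M3(B*) = W* H*^T back
    X = rng.standard_normal((n, p)) / np.sqrt(n)    # Gaussian design with N(0, 1/n) entries
    Y = np.einsum("ijk,lk->ijl", B, X) + sigma * rng.standard_normal((p1, p2, n))
    return X, Y, B
```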
(i) Figure 5 (a) illustrates that TgSLOPE and TBMM have a significant advantage in terms of TgFDR, with TgFDR below the nominal level for all testing sparsity levels. (ii) TgLASSO fails to control TgFDR, although it reports relatively small RgEE and MSE, as shown in Figure 5 (b) and (c). (iii) Figure 5 (d) shows that TgSLOPE reports a relatively short CPU time, especially in the cases of small sparsity. (iv) Intuitively, the four methods tend to have comparable performance as the sparsity increases.

Figure 5: Average results of TLRR, TgLASSO, TBMM and TgSLOPE with various sparsity under $n = 3000$, $p = 1000$, $p_1 = p_2 = 10$, $K = 20$ for the Gaussian random design of $X$. Bars correspond to SD (standard deviation). In (a), black dotted lines represent the nominal TgFDR level $q(p - s)/p$.

**Model size effect:** in this example, we test the effect of the model size $p$ on the four approaches. Set $n = 3000$, $p_1 = p_2 = 10$, $K = 20$ and sparsity $s = 0.02p$ with model size $p = 2000, 4000, 6000$. In each testing instance, all competitors report TP = 1. Moreover, Table 1 collects the average results based on 100 independent replications for each model size. (i) As suggested in Table 1, TLRR gives the worst performance. For the other three approaches, all four evaluation metrics tend to become larger as the model size grows. (ii) Similar to the results shown in Figure 5, TgLASSO fails to control TgFDR, while TgSLOPE and TBMM keep TgFDR below the nominal level for all testing model sizes, with a predictable sacrifice in estimation accuracy. (iii) We can also see from Table 1 that TgSLOPE gives a smaller TgFDR than TBMM for the model size $p = 6000$, and it reports the least CPU time in all cases.

| Size | Method | TgFDR | RgEE | MSE | Time(s) |
|---|---|---|---|---|---|
| $p = 2000$ | TLRR | 0.980 | 5.10e-0 (6.24e-2) | 7.77e-4 (1.89e-6) | 3.325 |
| | TgLASSO | 0.891 | 3.44e-2 (3.97e-3) | 1.92e-4 (4.45e-6) | 1.566 |
| | TBMM | 0.006 | 5.59e-2 (3.91e-3) | 2.02e-4 (2.73e-6) | 2.242 |
| | TgSLOPE | 0.007 | 5.59e-2 (3.91e-3) | 2.02e-4 (2.76e-6) | 1.081 |
| $p = 4000$ | TLRR | 0.980 | 3.92e-0 (4.26e-2) | 9.28e-4 (2.45e-6) | 10.163 |
| | TgLASSO | 0.921 | 3.50e-2 (2.16e-3) | 2.40e-4 (4.70e-6) | 3.727 |
| | TBMM | 0.008 | 6.46e-2 (3.28e-3) | 2.70e-4 (3.48e-6) | 6.104 |
| | TgSLOPE | 0.008 | 6.45e-2 (3.27e-3) | 2.70e-4 (3.49e-6) | 2.962 |
| $p = 6000$ | TLRR | 0.980 | 1.69e-0 (1.71e-2) | 9.19e-4 (2.94e-6) | 15.146 |
| | TgLASSO | 0.924 | 3.66e-2 (2.17e-3) | 2.77e-4 (5.16e-6) | 6.179 |
| | TBMM | 0.025 | 6.91e-2 (3.78e-3) | 3.24e-4 (4.02e-6) | 9.108 |
| | TgSLOPE | 0.014 | 6.88e-2 (3.75e-3) | 3.24e-4 (4.01e-6) | 5.666 |

Table 1: Average results with different model sizes under $n = 3000$, $p_1 = p_2 = 10$, $K = 20$, $s = 0.02p$ for the Gaussian random design of $X$. Standard deviations are given in parentheses.

**LROD rank effect:** we now examine the impact of the LROD rank on the four approaches under $p_1 = p_2 = 10$ and $p_1 = p_2 = 20$, respectively. Set $n = 1000$, $p = 2000$ and sparsity $s = 0.02p$. It is known from Kolda and Bader (2009) that the LROD rank satisfies $K \le \min\{p_1p_2, pp_1, pp_2\}$, so we set the LROD rank $K = 5 : 5 : 50$. All testing instances are simulated based on 100 independent replications. Simulation results report TP = 1 for all competitors. Figure 6 depicts the changes of the four evaluation metrics as the LROD rank increases. (i) Our proposed TgSLOPE procedure works well in terms of TgFDR, especially for small LROD ranks; e.g., as shown in Figure 6 (a) and (b), TgSLOPE and TBMM report a TgFDR lower than the nominal level for $K \le 25$. (ii) Figure 6 (g) and (h) illustrate that TgSLOPE reports a relatively short CPU time, especially compared with TBMM.
For example, for $p_1 = p_2 = 20$ and $K \ge 35$, the CPU time reported by TgSLOPE is no more than 1/5 of that of TBMM. (iii) For a fixed tensor size, we can see from Figure 6 (a) and (b) that TgFDR tends to become larger as the LROD rank increases. In addition, as the LROD rank grows, Figure 6 (c) and (d) show that RgEE decreases, while Figure 6 (e) and (f) show that MSE grows. This may offer some guidance for choosing a moderate LROD rank.

Figure 6: Simulation results of TLRR, TgLASSO, TBMM and TgSLOPE with different LROD ranks under $n = 1000$, $p = 2000$, $s = 0.02p$ for the Gaussian random design of $X$. Left column: $p_1 = p_2 = 10$; right column: $p_1 = p_2 = 20$. In (a) and (b), black dotted lines represent the nominal TgFDR level $q(p - s)/p$.

### 5.3 Human Brain Connection Data

In this subsection, we test our proposed TgSLOPE against the other three approaches on a real human brain connection (HBC) data set from the Human Connectome Project (HCP), which aims to build a network map of the anatomical and functional connectivity within healthy human brains (Van Essen et al., 2013). The preprocessed HBC data set is provided by Hu et al. (2022), in which the response is a $68 \times 68$ binary matrix whose entries encode the presence or absence of fiber connections between 68 brain regions-of-interest, and the predictor matrix is collected from different personal features for each observed individual. After removing missing values, the HBC data set consists of 111 individuals and 549 personal features, including gender, age, etc. (The HBC data can be found at https://wiki.humanconnectome.org/display/PublicData/.) For the HBC analysis, the predictor matrix is normalized to have unit column vectors. We choose the tuning parameters via a BIC-type criterion on the whole data set, which minimizes
$$\mathrm{BIC} = \|\mathcal{Y} - \hat{\mathcal{B}} \times_3 X\|_F^2 + \big(\|\,\|\hat{\mathcal{B}}\|_f\,\|_0 + p_1 + p_2\big)K\log(np_1p_2).$$
In addition, the number of discovered features (termed Discovery), the mean squared prediction error (MSPE) and the CPU time are adopted to evaluate the performance of the four competitors. Here, Discovery and MSPE are defined respectively as
$$\mathrm{Discovery} = \#\{j \in [p] : (\hat{\mathcal{B}}_{\mathrm{training}})_j \neq O\}, \qquad \mathrm{MSPE} = \frac{\|\mathcal{Y}_{\mathrm{test}} - \hat{\mathcal{B}}_{\mathrm{training}} \times_3 X_{\mathrm{test}}\|_F^2}{p_1p_2n_{\mathrm{test}}},$$
where $\hat{\mathcal{B}}_{\mathrm{training}}$ is the estimator on the training set, and $X_{\mathrm{test}}$, $\mathcal{Y}_{\mathrm{test}}$ and $n_{\mathrm{test}}$ are the predictor matrix, response tensor and sample size of the testing set, respectively.

Figure 7: Average results with different LROD ranks based on 20 random 8:2 splits of the HBC data set: (a) MSPE, (b) Discovery.

For the HBC data, the experimental results in Zhang et al. (2016) revealed that whole-brain functional magnetic resonance imaging (fMRI) signals can be well characterized by sparse representations, which supports the interpretability of HBC data analyses with fewer significant features.
In this regard, it is reasonable to credit superior performance to a method with a smaller Discovery and a comparable or even smaller MSPE. We will test the performance of our proposed approach in this sense against the other three methods. Notably, the rank of the true coefficient tensor is unknown in real data applications.

We first test the performance of the four approaches under different LROD ranks. Set the tolerance for termination in all algorithms to $\epsilon = 10^{-4}$. For each LROD rank selected from 1 to 12, the HBC data set is randomly divided into 80 percent training set and 20 percent testing set, 20 times. The average numerical results are depicted in Figure 7. We can see from Figure 7 (a) that small LROD ranks suffice for acceptable prediction errors for all methods, which suggests fitting the true rank with a relatively small value. In addition, as shown in Figure 7 (b), TLRR fails at feature selection for all rank testing instances, while the other three methods all report sparse numerical estimates, reflected by relatively small values of Discovery, with smaller MSPE compared to TLRR. This further supports the reasonableness of taking the group sparsity of the coefficient tensor into consideration for the HBC data set.

To compare the four methods further, we next choose the combination of LROD rank and regularization parameters via the BIC criterion. Table 2 collects the average results with different tolerances $\epsilon \in \{10^{-4}, 10^{-5}, 10^{-6}\}$ based on 20 independent replications, which demonstrate the superior performance of TgSLOPE among all comparative methods.

| $\epsilon$ | Method | MSPE | Discovery | Time(s) |
|---|---|---|---|---|
| $10^{-4}$ | TLRR | 0.0837 | 536.30 | 2.766 |
| | TgLASSO | 0.0848 | 28.45 | 9.396 |
| | TBMM | 0.0840 | 242.50 | 7.137 |
| | TgSLOPE | 0.0838 | 47.95 | 8.171 |
| $10^{-5}$ | TLRR | 0.0831 | 536.20 | 9.363 |
| | TgLASSO | 0.0844 | 27.70 | 30.561 |
| | TBMM | 0.0833 | 121.90 | 92.278 |
| | TgSLOPE | 0.0833 | 47.15 | 29.724 |
| $10^{-6}$ | TLRR | 0.0836 | 537.20 | 25.629 |
| | TgLASSO | 0.0861 | 26.70 | 111.271 |
| | TBMM | 0.0838 | 73.65 | 668.040 |
| | TgSLOPE | 0.0838 | 49.60 | 99.964 |

Table 2: Average results with BIC-selected LROD rank and different tolerances $\epsilon$ based on 20 random 8:2 splits of the HBC data set.

Specifically, as shown in Table 2, TLRR reports an MSPE competitive with TgSLOPE, but unsurprisingly fails at feature selection for all tolerance settings. In addition, TgLASSO gives a larger MSPE than TgSLOPE in all testing instances, although it has the fewest discovered features. The TBMM method solves the same optimization model (4) by an alternating minimization scheme, while our proposed TgSLOPE method solves problem (4) by pDCAe based on the DC reformulation (8). We can see from Table 2 that these two algorithms report similar MSPE at different tolerance settings, while the Discovery of TBMM is larger than that of TgSLOPE, and the gap in Discovery shrinks as the tolerance $\epsilon$ of the numerical algorithms decreases, as shown in the $\epsilon$ and Discovery columns of Table 2. In addition, TgSLOPE reports less CPU time than TBMM, especially for the small tolerance $\epsilon = 10^{-6}$. One possible explanation for the superior efficiency of TgSLOPE is that pDCAe is better able to reach numerical approximate coefficient tensor solutions with higher estimation accuracy, leading to better performance in time efficiency and in the group sparsity reflected by Discovery. The comparison results on Discovery, MSPE and CPU time of both algorithms with varying $\epsilon$ are collected in the boxplots in Figure 8.
Figure 8: Boxplots of the results of TBMM and TgSLOPE with BIC-selected LROD rank and different tolerances $\epsilon$ based on 20 random 8:2 splits of the HBC data set.

## 6. Conclusions

In this article we propose a sparse and low-rank tensor regression method, which optimizes a gSLOPE penalized low-rank, orthogonally decomposable tensor minimization problem. Under the column-orthogonality assumption on the predictor matrix, we show that our proposed TgSLOPE procedure controls TgFDR at a pre-set level and achieves asymptotically minimax convergence with respect to the estimation risk. This provides theoretical guarantees for feature selection and coefficient estimation in finite samples. Moreover, a globally convergent pDCAe algorithm is applied to compute the TgSLOPE estimator by constructively reformulating the TgSLOPE problem into a DC program. Numerical experiments verify the superiority of our method in terms of TgFDR control, estimation accuracy and CPU time against three state-of-the-art approaches.

For the TgSLOPE method, it would be interesting to investigate statistical properties, including TgFDR control and estimation accuracy, in more general settings beyond the orthogonal design. Moreover, Luo et al. (2019) proposed a sparse semismooth Newton-based augmented Lagrangian method (Newt-ALM) to solve the SLOPE model in classical linear regression (Bogdan et al., 2015) and showed that Newt-ALM offers a notable computational advantage in high-dimensional settings compared with first-order algorithms such as APG and the alternating direction method of multipliers (ADMM). How to design robust and highly efficient Newton-type algorithms for TgSLOPE-based models is also one of our future research topics.

## Acknowledgments

The authors are very grateful to the editor and two anonymous reviewers for their constructive comments and suggestions, which greatly helped us to improve the quality of our paper. This research was supported by the National Natural Science Foundation of China (Grant No. 12271022).

## Appendix A. Proof of Proposition 2

To check the identifiability of the frontal slice sparse and LROD tensor coefficient $\mathcal{B}^\star \in \mathbb{B}_T$ defined in (5), we need to show that the equation system produced by the tensor regression model (2), i.e., $\bar{\mathcal{Y}} = \mathcal{B} \times_3 X$, has a unique solution $\mathcal{B}^\star$ satisfying $\mathcal{B}^\star \in \mathbb{B}_T$. Under the uniqueness guarantee $k(W^\star) \ge 2$ of the LROD tensor decomposition $\mathcal{B}^\star = [[U^\star, V^\star, W^\star]]$ (Kruskal, 1977, Theorem 4b), it suffices to show that
$$\mathcal{B}^\star = \arg\min_{\mathcal{B} \in \mathbb{R}^{p_1 \times p_2 \times p}} \big\|\,\|\mathcal{B}\|_f\,\big\|_0, \quad \text{s.t.} \quad \bar{\mathcal{Y}} = \mathcal{B} \times_3 X$$
is a unique solution. Following the matricized form of a tensor, the above optimization problem is equivalent to
$$\min_{\bar{B} \in \mathbb{B}_M} \big\|\,\|\bar{B}\|_r\,\big\|_0, \tag{13}$$
where $\mathbb{B}_M = \{\bar{B} \in \mathbb{R}^{p \times p_1p_2} : M_3(\bar{\mathcal{Y}}) = X\bar{B}\}$. Thus, it suffices to prove that problem (13) has a unique solution $\bar{B}^\star = M_3(\mathcal{B}^\star)$ under the condition that $\|\,\|\bar{B}^\star\|_r\,\|_0 \le k/2$, where $k$ is the k-rank of the matrix $X$. Assume on the contrary that $\bar{B} \in \mathbb{B}_M$ is another optimal solution of (13), which gives $\|\,\|\bar{B}\|_r\,\|_0 = \|\,\|\bar{B}^\star\|_r\,\|_0$, $\Delta = \bar{B} - \bar{B}^\star \neq O$ and $X\Delta = O$. This further indicates that the $\|\,\|\Delta\|_r\,\|_0$ columns of $X$ corresponding to the nonzero rows of $\Delta$ are linearly dependent. By virtue of the definition of the k-rank of $X$, it follows that
$$k < \big\|\,\|\Delta\|_r\,\big\|_0 \le \big\|\,\|\bar{B}\|_r\,\big\|_0 + \big\|\,\|\bar{B}^\star\|_r\,\big\|_0 \le k,$$
where the second inequality follows from the triangle inequality for the $\ell_0$-norm, and the last from $\|\,\|\bar{B}\|_r\,\|_0 = \|\,\|\bar{B}^\star\|_r\,\|_0 \le k/2$. A contradiction arrives, and the proof is complete.

## Appendix B. Proofs for Section 3

### B.1 Proof of Theorem 5

Assume that $(\hat{W}, \hat{H})$ is a local minimizer of the TgSLOPE problem (4).
Let $\hat{H}_\perp \in \mathbb{R}^{p_1p_2 \times (p_1p_2 - K)}$ be such that $[\hat{H}, \hat{H}_\perp]$ is orthogonal. Then the loss function of problem (4) satisfies
$$L(W, \hat{H}) = \frac{1}{2}\big\|[M_3(\mathcal{Y}) - XW\hat{H}^\top][\hat{H}, \hat{H}_\perp]\big\|_F^2 = \frac{1}{2}\|M_3(\mathcal{Y})\hat{H} - XW\|_F^2 + \frac{1}{2}\|M_3(\mathcal{Y})\hat{H}_\perp\|_F^2.$$
Together with the column-orthogonality of the predictor matrix $X$, we have
$$\hat{W} = \arg\min_{W \in \mathbb{R}^{p \times K}} \left\{ \frac{1}{2}\|\hat{Y} - W\|_F^2 + P_{\boldsymbol\lambda}(\|W\|_r) \right\}, \tag{14}$$
where $\hat{Y} = X^\top M_3(\mathcal{Y})\hat{H} \in \mathbb{R}^{p \times K}$. It follows from (2) that $\hat{Y} = W^\star H^{\star\top}\hat{H} + X^\top M_3(\mathcal{E})\hat{H}$, which implies that $\hat{Y}$ follows the matrix normal distribution $N(W^\star H^{\star\top}\hat{H}, \sigma^2 I_p \otimes I_K)$. Note that (14) is a convex problem with a strongly convex, piecewise linear-quadratic objective function and hence admits a unique optimal solution. Similar to the proximal operator for gSLOPE (Brzyski et al., 2019), the optimization problem (14) can be solved in two steps:
$$\hat{\boldsymbol\eta} = \arg\min_{\boldsymbol\eta} \left\{ \frac{1}{2}\sum_{j=1}^p \big(\|\hat{\boldsymbol{y}}_j\| - \eta_j\big)^2 + P_{\boldsymbol\lambda}(\boldsymbol\eta) \right\}, \qquad \hat{\boldsymbol{w}}_j = \hat\eta_j\frac{\hat{\boldsymbol{y}}_j}{\|\hat{\boldsymbol{y}}_j\|}, \quad j \in [p]. \tag{15}$$
Define
$$V_\eta = \#\{j \in [p] : \boldsymbol{w}_j^\star = \boldsymbol{0}, \ \hat\eta_j \neq 0\}, \qquad R_\eta = \#\{j \in [p] : \hat\eta_j \neq 0\}.$$
It suffices to show that
$$\mathbb{E}\left[\frac{V_\eta}{\max\{R_\eta, 1\}}\right] \le \frac{q(p - s)}{p}. \tag{16}$$
Without loss of generality, assume that $\boldsymbol{w}_j^\star = \boldsymbol{0}$ for $1 \le j \le p - s$ and $\boldsymbol{w}_j^\star \neq \boldsymbol{0}$ otherwise. By the definition of TgFDR, we derive that
$$\mathbb{E}\left[\frac{V_\eta}{\max\{R_\eta, 1\}}\right] = \sum_{r=1}^p \frac{1}{r}\,\mathbb{E}\big[V_\eta\mathbb{1}_{\{R_\eta = r\}}\big] = \sum_{r=1}^p \frac{1}{r}\sum_{j=1}^{p-s} \mathbb{E}\big[\mathbb{1}_{\{\hat\eta_j \neq 0\}}\mathbb{1}_{\{R_\eta = r\}}\big] = \sum_{r=1}^p \frac{1}{r}\sum_{j=1}^{p-s} \Pr\big(\hat\eta_j \neq 0, \ R_\eta = r\big),$$
where $\mathbb{1}_{\{\cdot\}} = 1$ if the event occurs, and $\mathbb{1}_{\{\cdot\}} = 0$ otherwise. Next we focus on the events $\{\hat\eta_j \neq 0, R_\eta = r\}$, $j = 1, \ldots, p - s$, $r = 1, \ldots, p$. Denote
$$Y' = (\hat{\boldsymbol{y}}_1, \ldots, \hat{\boldsymbol{y}}_{j-1}, \hat{\boldsymbol{y}}_{j+1}, \ldots, \hat{\boldsymbol{y}}_p)^\top \in \mathbb{R}^{(p-1) \times K}, \qquad \boldsymbol\lambda' = (\lambda_2, \ldots, \lambda_p)^\top \in \mathbb{R}^{p-1}.$$
Applying the simplified TgSLOPE problem (14) to $Y'$ with tuning parameter $\boldsymbol\lambda'$, the optimization problem is
$$W' = \arg\min_{W \in \mathbb{R}^{(p-1) \times K}} \left\{ \frac{1}{2}\|Y' - W\|_F^2 + P_{\boldsymbol\lambda'}(\|W\|_r) \right\}.$$
Define $R_{-j} = \#\{i \in [p-1] : \boldsymbol{w}_i' \neq \boldsymbol{0}\}$. It follows from Lemmas E.6 and E.7 in Brzyski et al. (2019) that
$$\big\{\|\hat{Y}\|_r : \hat\eta_j \neq 0, \ R_\eta = r\big\} \subseteq \big\{\|\hat{Y}\|_r : \|\hat{\boldsymbol{y}}_j\| > \lambda_r, \ R_{-j} = r - 1\big\},$$
where $\|\hat{Y}\|_r = (\|\hat{\boldsymbol{y}}_1\|, \ldots, \|\hat{\boldsymbol{y}}_p\|)^\top$ with $\|\hat{\boldsymbol{y}}_j\|^2 \sim \chi_K^2\big(\|\hat{H}^\top H^\star\boldsymbol{w}_j^\star\|^2\big)$, $j = 1, \ldots, p$. Then we have
$$\Pr(\hat\eta_j \neq 0, \ R_\eta = r) \le \Pr\big(\|\hat{\boldsymbol{y}}_j\| > \lambda_r, \ R_{-j} = r - 1\big) = \Pr\big(\|\hat{\boldsymbol{y}}_j\|/\sigma > \lambda_r'\big)\Pr\big(R_{-j} = r - 1\big) \le \frac{qr}{p}\Pr\big(R_{-j} = r - 1\big),$$
where $\lambda_r' = F_{\chi_K}^{-1}(1 - qr/p)$; the equality is due to the independence between $\hat{\boldsymbol{y}}_j$ and $R_{-j}$, and the last inequality follows from the definition of the tuning parameters in (7). Therefore, TgFDR in (16) can be bounded by
$$\mathbb{E}\left[\frac{V_\eta}{\max\{R_\eta, 1\}}\right] \le \sum_{r=1}^p \frac{1}{r}\sum_{j=1}^{p-s} \frac{qr}{p}\Pr\big(R_{-j} = r - 1\big) = \frac{q(p-s)}{p}\sum_{r=1}^p \Pr\big(R_{-j} = r - 1\big) = \frac{q(p-s)}{p}.$$
The proof is completed.

### B.2 Proof of Theorem 7

Let $(\hat{W}, \hat{H})$ be an optimal solution of problem (4). Then $\hat{\mathcal{B}} = M_3^{-1}(\hat{W}\hat{H}^\top)$. Under the column-orthogonality assumption on the predictor matrix $X$, a statistically equivalent model of (2) is
$$\widetilde{Y} = M_3(\mathcal{B}^\star) + X^\top M_3(\mathcal{E}), \tag{17}$$
which has the distribution $N(M_3(\mathcal{B}^\star), \sigma^2 I_p \otimes I_{p_1p_2})$. Denote by $\widetilde{\mathbb{B}}$ the set of all coefficient tensors for which only the elements in the first column of the mode-3 unfolded matrix are possibly nonzero and the rest are fixed to zero. Let $\widetilde{\mathbb{B}}_s = \mathbb{B}_s \cap \widetilde{\mathbb{B}}$. Then, for any $\mathcal{B}^\star \in \widetilde{\mathbb{B}}_s$, (17) reduces to a standard Gaussian sequence model of length $p$ with sparsity at most $s$. As $s/p \to 0$, this sequence model has minimax risk $(1 + o(1))2\sigma^2 s\log(p/s)$ (Donoho and Johnstone, 1994). Thus, we have
$$\sup_{\mathcal{B}^\star \in \widetilde{\mathbb{B}}_s} \mathbb{E}\|M_3(\hat{\mathcal{B}}) - M_3(\mathcal{B}^\star)\|_F^2 \ge (1 + o(1))2\sigma^2 s\log(p/s),$$
which yields that
$$\sup_{\mathcal{B}^\star \in \mathbb{B}_s} \mathbb{E}\|\hat{\mathcal{B}} - \mathcal{B}^\star\|_F^2 \ge (1 + o(1))2\sigma^2 s\log(p/s).$$
We next show that the $\ell_2$-loss measuring the deviation of the TgSLOPE estimator from the ground truth $\mathcal{B}^\star$ is bounded above by $(1 + o(1))2\sigma^2 s\log(p/s)$. For simplicity, we assume that $\boldsymbol{w}_j^\star \neq \boldsymbol{0}$ for $j \le s$ and $\boldsymbol{w}_j^\star = \boldsymbol{0}$ otherwise. Denote $\mu_j = \|\boldsymbol{w}_j^\star\|$ and $\nu_j = \|\hat{H}^\top H^\star\boldsymbol{w}_j^\star\|$. It then follows from the proof of Theorem 5 that $\hat{W}$ can be obtained by the two-step form (15), in which $\|\hat{\boldsymbol{y}}_j\|^2 \sim \chi_K^2(\nu_j^2)$, $j = 1, \ldots, p$.
The $\ell_2$-loss satisfies
$$\begin{aligned}
\mathbb{E}\|\hat{\mathcal{B}} - \mathcal{B}^\star\|_F^2 &\overset{(a)}{=} \mathbb{E}\|\hat{W}\hat{H}^\top - W^\star H^{\star\top}\|_F^2 = \mathbb{E}\Bigg[\sum_{j=1}^p \|\hat{H}\hat{\boldsymbol{w}}_j - H^\star\boldsymbol{w}_j^\star\|^2\Bigg] \overset{(b)}{\le} \mathbb{E}\Bigg[\sum_{j=1}^p \big(\|\hat{H}\hat{\boldsymbol{w}}_j\| + \|H^\star\boldsymbol{w}_j^\star\|\big)^2\Bigg] \\
&\overset{(c)}{=} \mathbb{E}\Bigg[\sum_{j=1}^p \big(\|\hat{\boldsymbol{w}}_j\| + \|\boldsymbol{w}_j^\star\|\big)^2\Bigg] = \underbrace{\mathbb{E}\Bigg[\sum_{j=1}^s (\hat\eta_j + \mu_j)^2\Bigg]}_{E_1} + \underbrace{\mathbb{E}\Bigg[\sum_{j=s+1}^p \hat\eta_j^2\Bigg]}_{E_2},
\end{aligned}$$
where (a) comes from $\hat{\mathcal{B}} = M_3^{-1}(\hat{W}\hat{H}^\top)$, (b) is due to the triangle inequality, and (c) follows from the column-orthogonality of $\hat{H}$ and $H^\star$ together with $\|\hat{\boldsymbol{w}}_j\| = \hat\eta_j$. Thus, it suffices to show that $E_1 \le (1 + o(1))2\sigma^2 s\log(p/s)$ and $E_2 = o(1)\cdot 2\sigma^2 s\log(p/s)$.

To proceed, define the random variables $\psi_j^2 = \psi_{j,1}^2 + \psi_{j,2}^2 + \cdots + \psi_{j,K}^2$ and $\varphi_j^2 = (\psi_{j,1} + \nu_j)^2 + \psi_{j,2}^2 + \cdots + \psi_{j,K}^2$ with i.i.d. $\psi_{j,k} \sim N(0, 1)$, $j \in [p]$, $k \in [K]$. Then $\psi_j^2 \sim \chi_K^2$ and $\varphi_j^2 \sim \chi_K^2(\nu_j^2)$ for all $j \in [p]$. Denoting $\boldsymbol\varphi = (\varphi_1, \varphi_2, \ldots, \varphi_p)^\top$, we have
$$\begin{aligned}
\mathbb{E}\|\hat{\boldsymbol\eta}_{[1:s]} + \boldsymbol\mu_{[1:s]}\|^2 &= \mathbb{E}\|\hat{\boldsymbol\eta}_{[1:s]} - \boldsymbol\varphi_{[1:s]} + \boldsymbol\varphi_{[1:s]} + \boldsymbol\mu_{[1:s]}\|^2 \le \mathbb{E}\big(\|\hat{\boldsymbol\eta}_{[1:s]} - \boldsymbol\varphi_{[1:s]}\| + \|\boldsymbol\varphi_{[1:s]} + \boldsymbol\mu_{[1:s]}\|\big)^2 \\
&\le \|\boldsymbol\lambda_{[1:s]}\|^2 + \mathbb{E}\|\boldsymbol\varphi_{[1:s]} + \boldsymbol\mu_{[1:s]}\|^2 + 2\|\boldsymbol\lambda_{[1:s]}\|\,\mathbb{E}\|\boldsymbol\varphi_{[1:s]} + \boldsymbol\mu_{[1:s]}\|,
\end{aligned} \tag{18}$$
where the last inequality is obtained from $\|\hat{\boldsymbol\eta}_{[1:s]} - \boldsymbol\varphi_{[1:s]}\| \le \|\boldsymbol\lambda_{[1:s]}\|$, owing to Fact 3.3 of Su and Candès (2016) together with the fact that $\varphi_j^2$ and $\|\hat{\boldsymbol{y}}_j\|^2$ are identically distributed for any $j \le s$. Moreover, it follows from Inglot (2010) that as $s/p \to 0$, $F_{\chi_K}^{-1}(1 - qj/p) \sim \sqrt{2\log(p/(qj))}$ for all $j \le s$, which yields
$$\|\boldsymbol\lambda_{[1:s]}\|^2 \sim 2\sigma^2 s\log(p/s). \tag{19}$$
Then, it is easy to see that
$$|\varphi_j + \mu_j|^2 = \Big(\sqrt{(\psi_{j,1} + \nu_j)^2 + \psi_{j,2}^2 + \cdots + \psi_{j,K}^2} + \mu_j\Big)^2 \le \Big(\sqrt{\psi_{j,2}^2 + \cdots + \psi_{j,K}^2} + |\psi_{j,1}| + \nu_j + \mu_j\Big)^2 \le \Big(\sqrt{2(\psi_{j,1}^2 + \psi_{j,2}^2 + \cdots + \psi_{j,K}^2)} + \nu_j + \mu_j\Big)^2 \le 4\psi_j^2 + 2(\nu_j + \mu_j)^2. \tag{20}$$
Plugging $\mu_j = \|\boldsymbol{w}_j^\star\|$ and $\nu_j = \|\hat{H}^\top H^\star\boldsymbol{w}_j^\star\|$ into $\nu_j + \mu_j$ gives
$$\nu_j + \mu_j \overset{(a)}{\le} \big(\|\hat{H}\|_2 + \|H^\star\|_2\big)\big\|(M_3(\mathcal{B}^\star))_j\big\| \overset{(b)}{\le} 2\big\|(M_3(\mathcal{B}^\star))_j\big\|, \tag{21}$$
where (a) follows from $H^\star\boldsymbol{w}_j^\star = (M_3(\mathcal{B}^\star))_j^\top$ together with the Cauchy-Schwarz inequality, and (b) comes from the fact that $\|\hat{H}\|_2 = \|H^\star\|_2 = 1$ for the column-orthogonal matrices $\hat{H}$ and $H^\star$. Setting $\bar\alpha = \max\{\|\mathcal{B}_j^\star\|_F : j \in [p]\}$, we know that $(\nu_j + \mu_j)^2 \le 4\bar\alpha^2$. Combining with (20) and (21), we obtain
$$\mathbb{E}\|\boldsymbol\varphi_{[1:s]} + \boldsymbol\mu_{[1:s]}\|^2 = \mathbb{E}\sum_{j=1}^s (\varphi_j + \mu_j)^2 \le \mathbb{E}\sum_{j=1}^s \big(4\psi_j^2 + 8\bar\alpha^2\big) = 8s\bar\alpha^2 + 4\mathbb{E}(\zeta^2) = 4s(2\bar\alpha^2 + K), \tag{22}$$
where $\zeta^2 = \sum_{j=1}^s \psi_j^2 \sim \chi_{sK}^2$. Therefore, combining (18), (19) and (22) yields
$$E_1 \le (1 + o(1))2\sigma^2 s\log(p/s) + 4s(2\bar\alpha^2 + K) + 2\sqrt{4s(2\bar\alpha^2 + K)}\sqrt{(1 + o(1))2\sigma^2 s\log(p/s)} = (1 + o(1))2\sigma^2 s\log(p/s), \tag{23}$$
where the last step makes use of $s/p \to 0$.

We claim that $E_2 = o(1)\cdot 2\sigma^2 s\log(p/s)$ as follows. Note that $|\varphi_j| = |\psi_j| \sim \chi_K$, since $\nu_j = 0$ for $j > s$. Denote by $|\psi|_{(1)} \ge \cdots \ge |\psi|_{(p-s)}$ the order statistics of $|\psi_{s+1}|, \ldots, |\psi_p|$. It follows from the proof of Lemma 3.3 in Su and Candès (2016) that
$$\mathbb{E}\sum_{j=s+1}^p \hat\eta_j^2 \le \sum_{j=1}^{p-s} \mathbb{E}\big(|\psi|_{(j)} - \lambda_{s+j}\big)_+^2,$$
where $x_+ = \max\{0, x\}$. We can then partition the sum into three parts,
$$\sum_{j=1}^{p-s} \mathbb{E}\big(|\psi|_{(j)} - \lambda_{s+j}\big)_+^2 = \sum_{j=1}^{As} \mathbb{E}\big(|\psi|_{(j)} - \lambda_{s+j}\big)_+^2 + \sum_{j=As+1}^{ap} \mathbb{E}\big(|\psi|_{(j)} - \lambda_{s+j}\big)_+^2 + \sum_{j=ap+1}^{p-s} \mathbb{E}\big(|\psi|_{(j)} - \lambda_{s+j}\big)_+^2,$$
for a sufficiently large constant $A > 0$ and a sufficiently small constant $a > 0$. Lemmas F.1, F.2 and F.3 of Brzyski et al. (2019) show that each part is negligible compared with $2\sigma^2 s\log(p/s)$. This indicates that $E_2 = o(1)\cdot 2\sigma^2 s\log(p/s)$, which together with (23) completes the proof.

## Appendix C. Proofs for Section 4

### C.1 Proof of Lemma 10

Denoting by $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_K$ the singular values of $W$, we obtain from the Cauchy-Schwarz inequality that
$$\|W\|_* = \sum_{i=1}^K \sigma_i \le \sqrt{K\sum_{i=1}^K \sigma_i^2} = \sqrt{K}\,\|W\|_F. \tag{24}$$
In addition, we know from the singular value inequality (Chatelin, 1983) that $\sigma_{i+j-1}(AB) \le \sigma_i(A)\sigma_j(B)$ for $1 \le i, j \le K$, $i + j \le K + 1$, which implies that $\|AB\|_* \le \|A\|_2\|B\|_*$ for any two matrices $A \in \mathbb{R}^{p_1p_2 \times p}$ and $B \in \mathbb{R}^{p \times K}$. Together with (24), we derive that as $\|W\|_F \to \infty$,
$$\frac{F(W)}{\|W\|_F} = \frac{\|XW\|_F^2/2 + P_{\boldsymbol\lambda}(\|W\|_r) - \|M_3(\mathcal{Y})^\top XW\|_*}{\|W\|_F} \ge \frac{\|XW\|_F^2/2 + P_{\boldsymbol\lambda}(\|W\|_r)}{\|W\|_F} - \big\|M_3(\mathcal{Y})^\top X\big\|_2\frac{\|W\|_*}{\|W\|_F} \ge \frac{\|XW\|_F^2/2 + P_{\boldsymbol\lambda}(\|W\|_r)}{\|W\|_F} - \sqrt{K}\big\|M_3(\mathcal{Y})^\top X\big\|_2 \to \infty,$$
where the divergence follows from the fact that $\|XW\|_F^2 \ge \sigma_p^2(X)\|W\|_F^2$ holds with $\sigma_p(X) > 0$ when $X$ has full column rank. This completes the proof.
### C.2 Proof of Theorem 11

It follows from Lemma 10 that the DC function $F$ in (9) is also level-bounded, that is, the level set $L_\alpha = \{W \in \mathbb{R}^{p \times K} : F(W) \le \alpha\}$ is bounded for any $\alpha \in \mathbb{R}$. In addition, we know from Wen et al. (2018) that pDCAe enjoys global convergence if $F$ is level-bounded. Thus, using Lemma 10, the desired properties (a)-(c) are consequences of Theorem 4.1 in Wen et al. (2018).

## Appendix D. Scaling of the Nonzero Ground Truth in Simulations

In the simulations, we generate the row-sparse factor matrix $W^\star \in \mathbb{R}^{p \times K}$ in a similar manner as in Brzyski et al. (2019), where each nonzero row of $W^\star$ is scaled such that $\|\boldsymbol{w}_j^\star\| = a\sqrt{K}$ with $a = \sqrt{4\ln(p)/(1 - p^{-2/K})} - \sqrt{K}$. This specific form makes the generated signals comparable to the maximal noise, so that nonzero signals (i.e., significant variables) can be identified with moderate power. We give the calculation of the scale value $a$ in the following.

Consider the case of a column-orthogonal predictor matrix, where the TgSLOPE estimator $\hat{W}$ can be obtained from (14). We see from (14) that the identification of the $R$ significant variables (where the number $R$ is determined by $\boldsymbol\lambda$) corresponds to discovering the indices of the $R$ largest values among $\|\hat{\boldsymbol{y}}_1\|, \ldots, \|\hat{\boldsymbol{y}}_p\|$. Note that $\hat{\boldsymbol{y}}_1, \ldots, \hat{\boldsymbol{y}}_p$ are generated respectively by the random vectors $\hat{Y}_j = (\hat{Y}_{j1}, \ldots, \hat{Y}_{jK})^\top$, $j = 1, \ldots, p$, where $\hat{Y}_j = \boldsymbol{v}_j + \boldsymbol{E}_j \sim N(\boldsymbol{v}_j, \sigma^2 I_K)$ with $\boldsymbol{v}_j = \hat{H}^\top H^\star\boldsymbol{w}_j^\star$ and i.i.d. random noise vectors $\boldsymbol{E}_j = \hat{H}^\top(M_3(\mathcal{E}))^\top X_{:j} \sim N(\boldsymbol{0}, \sigma^2 I_K)$ for $j = 1, \ldots, p$. Thus $\|\hat{Y}_j\| = \sqrt{\sum_{k=1}^K \hat{Y}_{jk}^2}$ has a $\chi_K$ distribution with noncentrality parameter $\|\boldsymbol{v}_j\|$, and $\|\boldsymbol{E}_j\|$ has a central $\chi_K$ distribution, for $j \in [p]$. Then a nonzero $\boldsymbol{v}_j$ (or a nonzero $\boldsymbol{w}_j^\star$) can be perceived as a strong signal, and thus be identified by TgSLOPE, if with high probability the value $\|\hat{\boldsymbol{y}}_j\|$ generated from the noncentral $\chi$ distribution of $\|\hat{Y}_j\|$ is large compared to the background noise produced by the random disturbances $\boldsymbol{E}_j$ with central $\chi_K$ distributions. Otherwise, the signal could easily be covered by random disturbances.

Theorem H.1 of Brzyski et al. (2019) gives that for independent variables $Z_1, \ldots, Z_p$ with $Z_j \sim \chi_K^2$, $j \in [p]$, one has
$$\mathbb{E}\max_{j \in [p]}\{Z_j\} \le \frac{4\ln(p)}{1 - p^{-2/K}}.$$
Thus the idea is to use the quantity $\sqrt{4\ln(p)/(1 - p^{-2/K})}$ as an upper bound on the expected value of the maximum over $p$ independent $\chi_K$-distributed variables. To investigate the discovery performance of TgSLOPE in simulations, we aim at $\mathbb{E}\|\hat{\boldsymbol{y}}_j\| = \sqrt{4\ln(p)/(1 - p^{-2/K})}$, which yields the setting $\|\boldsymbol{v}_j\| = \sqrt{4\ln(p)/(1 - p^{-2/K})} - \sqrt{K}$, since $\mathbb{E}\|\hat{\boldsymbol{y}}_j\| \le \sqrt{\|\boldsymbol{v}_j\|^2 + K} \le \|\boldsymbol{v}_j\| + \sqrt{K}$. Note that $\|\boldsymbol{v}_j\| = \|\hat{H}^\top H^\star\boldsymbol{w}_j^\star\| \le \|\hat{H}\|_2\|H^\star\|_2\|\boldsymbol{w}_j^\star\| = \|\boldsymbol{w}_j^\star\|$ due to the column-orthogonality of $\hat{H}$ and $H^\star$. In this sense, we use the value $\sqrt{4K\ln(p)/(1 - p^{-2/K})} - K$ (that is, $a\sqrt{K}$) to scale $\|\boldsymbol{w}_j^\star\|$ in our simulations. This yields signals comparable to the maximal noise, so that nonzero signals can be detected with moderate power.

## References

T. Ahmed, H. Raja, and W. U. Bajwa. Tensor regression using low-rank and sparse Tucker decompositions. SIAM Journal on Mathematics of Data Science, 2(4):944-966, 2020.

G. I. Allen. Sparse higher-order principal components analysis. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 27-36, 2012.

A. Auddy and M. Yuan. Perturbation bounds for (nearly) orthogonally decomposable tensors and their applications in high dimensional data analysis. Information and Inference, 12(2):1044-1072, 2023.

P. C. Bellec, G. Lecué, and A. B. Tsybakov. SLOPE meets LASSO: Improved oracle bounds and optimality. The Annals of Statistics, 46(6):3603-3642, 2018.

Y. Benjamini and Y. Hochberg.
References

T. Ahmed, H. Raja, and W. U. Bajwa. Tensor regression using low-rank and sparse Tucker decompositions. SIAM Journal on Mathematics of Data Science, 2(4):944–966, 2020.

G. I. Allen. Sparse higher-order principal components analysis. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 27–36, 2012.

A. Auddy and M. Yuan. Perturbation bounds for (nearly) orthogonally decomposable tensors and their applications in high dimensional data analysis. Information and Inference, 12(2):1044–1072, 2023.

P. C. Bellec, G. Lecué, and A. B. Tsybakov. SLOPE meets LASSO: Improved oracle bounds and optimality. The Annals of Statistics, 46(6):3603–3642, 2018.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300, 1995.

M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. SLOPE-adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3):1103–1140, 2015.

D. Brzyski, A. Gossmann, W. Su, and M. Bogdan. Group SLOPE-adaptive selection of groups of predictors. Journal of the American Statistical Association, 114(525):419–433, 2019.

E. Bura, S. Duarte, L. Forzani, E. Smucler, and M. Sued. Asymptotic theory for maximum likelihood estimates in reduced-rank multivariate generalized linear models. Statistics, 52(5):1005–1024, 2018.

F. Chatelin. Eigenvalues of Matrices. Wiley, New York, 1983.

H. Chen, G. Raskutti, and M. Yuan. Non-convex projected gradient descent for generalized low-rank tensor regression. Journal of Machine Learning Research, 20:1–37, 2019.

Y. Chen, Z. Luo, and L. Kong. ℓ2,0-norm based selection and estimation for multivariate generalized linear models. Journal of Multivariate Analysis, 185:104782, 2021.

D. L. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.

D. L. Donoho and I. M. Johnstone. Minimax risk over ℓp-balls for ℓq-error. Probability Theory and Related Fields, 99(2):277–303, 1994.

J. Fan, W. Gong, and Z. Zhu. Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics, 212(1):177–202, 2019.

J. Gotoh, A. Takeda, and K. Tono. DC formulations and algorithms for sparse optimization problems. Mathematical Programming, 169:141–176, 2018.

J. C. Gower and G. B. Dijksterhuis. Procrustes Problems. Oxford University Press, 2004.

R. Han, R. Willett, and A. Zhang. An optimal statistical and computational framework for generalized tensor estimation. The Annals of Statistics, 50(1):1–29, 2022.

B. Hao, A. Zhang, and G. Cheng. Sparse and low-rank tensor estimation via cubic sketchings. IEEE Transactions on Information Theory, 66(9):5927–5964, 2020.

J. Hu, C. Lee, and M. Wang. Generalized tensor decomposition with features on multiple modes. Journal of Computational and Graphical Statistics, 31(1):204–218, 2022.

S. L. Hu and K. Ye. Linear convergence of an alternating polar decomposition method for low rank orthogonal tensor approximations. Mathematical Programming, 199(1-2):1305–1364, 2023.

W. Hu, W. Shen, H. Zhou, and D. Kong. Matrix linear discriminant analysis. Technometrics, 62(2):196–205, 2020.

W. Hu, T. Pan, D. Kong, and W. Shen. Nonparametric matrix response regression with application to brain imaging data analysis. Biometrics, 77(4):1227–1240, 2021.

T. Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30(2):339–351, 2010.

I. Johnstone and A. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–693, 2009.

T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

D. Kong, B. An, J. Zhang, and H. Zhu. L2RM: Low-rank linear regression models for high-dimensional matrix responses. Journal of the American Statistical Association, 115(529):403–424, 2020.
J. B. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions, with applications to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, 1977.

X. Li, D. Xu, H. Zhou, and L. Li. Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10:520–545, 2018.

Z. Luo, D. Sun, K. C. Toh, and N. Xiu. Solving the OSCAR and SLOPE models using a semismooth Newton-based augmented Lagrangian method. Journal of Machine Learning Research, 20(106):1–25, 2019.

Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140:125–161, 2013.

G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. The Annals of Statistics, 39:1–47, 2011.

J. C. Poythress, J. Ahn, and C. Park. Low-rank, orthogonally decomposable tensor regression with application to visual stimulus decoding of fMRI data. Journal of Computational and Graphical Statistics, 31(1):190–203, 2022.

G. Raskutti, M. Yuan, and H. Chen. Convex regularization for high-dimensional multiresponse tensor regression. The Annals of Statistics, 47(3):1554–1584, 2019.

R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer, 2009.

V. De Silva and L.-H. Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 30(3):1084–1127, 2008.

M. Sørensen, L. De Lathauwer, P. Comon, S. Icart, and L. Deneire. Canonical polyadic decomposition with a columnwise orthonormal factor matrix. SIAM Journal on Matrix Analysis and Applications, 33(4):1190–1213, 2012.

W. Su and E. J. Candès. SLOPE is adaptive to unknown sparsity and asymptotically minimax. The Annals of Statistics, 44(3):1038–1068, 2016.

W. W. Sun and L. Li. STORE: Sparse tensor response regression and neuroimaging analysis. Journal of Machine Learning Research, 18:1–37, 2017.

D. C. Van Essen, S. M. Smith, D. M. Barch, T. E. J. Behrens, E. Yacoub, and K. Ugurbil. The WU-Minn Human Connectome Project: An overview. NeuroImage, 80:62–79, 2013.

G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:33–45, 1992.

Q. Wei, Y. Zhang, and Z. Zhao. Sparse reduced-rank regression with adaptive selection of groups of predictors. In 55th Annual Asilomar Conference on Signals, Systems, and Computers, pages 145–149, 2021.

B. Wen, X. Chen, and T. Pong. A proximal difference-of-convex algorithm with extrapolation. Computational Optimization and Applications, 69:297–324, 2018.

D. Yu, L. Wang, D. Kong, and H. Zhu. Mapping the genetic-imaging-clinical pathway with applications to Alzheimer's disease. Journal of the American Statistical Association, 117(540):1656–1668, 2022.

A. Zhang, Y. Luo, G. Raskutti, and M. Yuan. ISLET: Fast and optimal low-rank tensor regression via importance sketching. SIAM Journal on Mathematics of Data Science, 2(2):444–479, 2020.

S. Zhang, X. Li, J. Lv, X. Jiang, L. Guo, and T. Liu. Characterizing and differentiating task-based and resting state fMRI signals via two-stage sparse representations. Brain Imaging and Behavior, 10(1):21–32, 2016.

Y. Zhang, W. Shen, and D. Kong. Covariance estimation for matrix-valued data. Journal of the American Statistical Association, ahead-of-print:1–12, 2022. doi: 10.1080/01621459.2022.2068419.

H. Zhou, L. Li, and H. Zhou. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.