# Boosted Sparse and Low-Rank Tensor Regression

Lifang He (Weill Cornell Medicine, lifanghescut@gmail.com); Kun Chen (University of Connecticut, kun.chen@uconn.edu); Wanwan Xu (University of Connecticut, wanwan.xu@uconn.edu); Jiayu Zhou (Michigan State University, dearjiayu@gmail.com); Fei Wang (Weill Cornell Medicine, few2001@med.cornell.edu)

We propose a sparse and low-rank tensor regression model to relate a univariate outcome to a feature tensor, in which each unit-rank tensor from the CP decomposition of the coefficient tensor is assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor regression problems. To make the computation efficient and scalable, we propose a stagewise estimation procedure for the unit-rank tensor regression that efficiently traces out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The superior performance of our approach is demonstrated on various real-world and synthetic examples.

## 1 Introduction

Regression analysis is commonly used for modeling the relationship between a predictor vector $\mathbf{x} \in \mathbb{R}^I$ and a scalar response $y$. Typically, a good regression model achieves two goals: accurate prediction of future responses and parsimonious interpretation of the dependence structure between $y$ and $\mathbf{x}$ [Hastie et al., 2009]. As a general setup, it fits $M$ training samples $\{(\mathbf{x}_m, y_m)\}_{m=1}^{M}$ by minimizing a regularized loss, i.e., a loss $L(\cdot)$ plus a regularization term $\Omega(\cdot)$:

$$\min_{\mathbf{w}} \; \sum_{m=1}^{M} L(\langle \mathbf{x}_m, \mathbf{w} \rangle, y_m) + \lambda \Omega(\mathbf{w}), \tag{1}$$

where $\mathbf{w} \in \mathbb{R}^I$ is the regression coefficient vector, $\langle \cdot, \cdot \rangle$ is the standard Euclidean inner product, and $\lambda > 0$ is the regularization parameter. For example, the sum of squared errors with $\ell_1$-norm regularization leads to the celebrated LASSO approach [Tibshirani, 1996], which performs sparse estimation of $\mathbf{w}$ and thus has implicit feature selection embedded therein.

In many modern real-world applications, the predictors/features are represented more naturally as higher-order tensors, such as videos and Magnetic Resonance Imaging (MRI) scans. In this case, if we want to predict a response variable for each tensor, a naive approach is to perform linear regression on the vectorized data (e.g., by stretching the tensor element by element). However, this completely ignores the multidimensional structure of the tensor data, such as the spatial coherence of the voxels. This motivates the tensor regression framework [Yu and Liu, 2016, Zhou et al., 2013], which treats each observation as a tensor $\mathcal{X}$ and learns a tensor coefficient $\mathcal{W}$ via regularized model fitting:

$$\min_{\mathcal{W}} \; \sum_{m=1}^{M} L(\langle \mathcal{X}_m, \mathcal{W} \rangle, y_m) + \lambda \Omega(\mathcal{W}). \tag{2}$$

When $\ell_1$-norm regularization is used, this formulation is essentially equivalent to (1) via vectorization.
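As a concrete illustration (our own minimal numpy sketch, not the authors' code), the following checks that with a squared-error loss and an elementwise $\ell_1$ penalty, the tensor objective in (2) coincides with the vectorized LASSO-type objective in (1); the names and random data are hypothetical:

```python
import numpy as np

def tensor_regression_objective(X_list, y, W, lam):
    """Squared-error instance of (2) with an elementwise l1 penalty on the coefficient tensor W."""
    preds = np.array([np.sum(X * W) for X in X_list])      # <X_m, W> for each sample
    return np.sum((y - preds) ** 2) + lam * np.sum(np.abs(W))

# Equivalence with the vectorized problem (1): flatten every X_m and W the same way.
M, shape = 20, (4, 3, 2)
rng = np.random.default_rng(0)
X_list = [rng.standard_normal(shape) for _ in range(M)]
y, W, lam = rng.standard_normal(M), rng.standard_normal(shape), 0.5
vec_obj = (np.sum((y - np.stack([X.ravel() for X in X_list]) @ W.ravel()) ** 2)
           + lam * np.sum(np.abs(W.ravel())))
assert np.isclose(tensor_regression_objective(X_list, y, W, lam), vec_obj)
```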
To effectively exploit the structural information of $\mathcal{X}_m$, we can impose a low-rank constraint on $\mathcal{W}$ in Problem (2). Some authors achieved this by fixing the CANDECOMP/PARAFAC (CP) rank of $\mathcal{W}$ a priori. For example, Su et al. [2012] assumed $\mathcal{W}$ to be rank-1. Since the rank-1 constraint is too restrictive, Guo et al. [2012] and Zhou et al. [2013] imposed a rank-$R$ constraint in a tensor decomposition model, but none of the above methods considered adding a sparsity constraint as well to enhance model interpretability. Wang et al. [2014] imposed a restrictive rank-1 constraint on $\mathcal{W}$ and also applied an elastic net regularization [Zou and Hastie, 2005]. Tan et al. [2012] imposed a rank-$R$ constraint and applied $\ell_1$-norm regularization to the factor matrices to also promote sparsity in $\mathcal{W}$. Signoretto et al. [2014] applied the trace norm (nuclear norm) for low-rank estimation of $\mathcal{W}$, and Song and Lu [2017] imposed a combination of the trace norm and the $\ell_1$-norm on $\mathcal{W}$. Bengua et al. [2017] showed that the trace norm may not be appropriate for capturing the global correlation of a tensor, as it captures only the correlation between a single mode (rather than a few modes) and the rest of the tensor. In all of the above sparse and low-rank tensor models, the sparsity is imposed on $\mathcal{W}$ itself, which does not necessarily lead to sparsity in the decomposed factors.

In this paper, we propose a sparse and low-rank tensor regression model in which the unit-rank tensors from the CP decomposition of the coefficient tensor are assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor factorization/regression (SURF) problems of the form

$$\min_{\mathcal{W}} \sum_{m=1}^{M} \big( y_m - \langle \mathcal{X}_m, \mathcal{W} \rangle \big)^2 + \lambda \|\mathcal{W}\|_1, \quad \text{s.t. } \operatorname{rank}(\mathcal{W}) \le 1.$$

To make the solution process efficient for the SURF problem, we propose a boosted/stagewise estimation procedure that traces out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The effectiveness and efficiency of our proposed approach are demonstrated on various real-world datasets as well as under various simulation setups.

## 2 Preliminaries on Tensors

We start with a brief review of the necessary preliminaries on tensors; more details can be found in [Kolda and Bader, 2009]. We denote scalars by lowercase letters, e.g., $a$; vectors by boldface lowercase letters, e.g., $\mathbf{a}$; matrices by boldface uppercase letters, e.g., $\mathbf{A}$; and tensors by calligraphic letters, e.g., $\mathcal{A}$. We denote their entries by $a_i$, $a_{i,j}$, $a_{i,j,k}$, etc., depending on the number of dimensions. Indices are denoted by lowercase letters spanning the range from 1 to the uppercase letter of the index, e.g., $n = 1, \dots, N$. Each entry of an $N$th-order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is indexed by $N$ indices $\{i_n\}_{n=1}^{N}$, and each $i_n$ indexes the $n$-mode of $\mathcal{A}$. The notation $-n$ denotes every mode except $n$.

**Definition 1 (Inner Product).** The inner product of two tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is the sum of the products of their entries, $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} a_{i_1,\dots,i_N} b_{i_1,\dots,i_N}$. It follows immediately that the Frobenius norm of $\mathcal{A}$ is $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$. The $\ell_1$ norm of a tensor is $\|\mathcal{A}\|_1 = \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} |a_{i_1,\dots,i_N}|$.

**Definition 2 (Tensor Product).** Let $\mathbf{a}^{(n)} \in \mathbb{R}^{I_n}$ be a length-$I_n$ vector for each $n = 1, 2, \dots, N$. The tensor product of $\{\mathbf{a}^{(n)}\}$, denoted by $\mathcal{A} = \mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(N)}$, is an $(I_1 \times \cdots \times I_N)$-tensor whose entries are $\mathcal{A}_{i_1,\dots,i_N} = a^{(1)}_{i_1} \cdots a^{(N)}_{i_N}$. We call $\mathcal{A}$ a rank-1 tensor or a unit-rank tensor.
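The numpy snippet below illustrates Definitions 1 and 2; the array shapes are arbitrary choices for the example:

```python
import numpy as np
from functools import reduce

A = np.random.randn(4, 3, 2)
B = np.random.randn(4, 3, 2)

inner = np.sum(A * B)              # <A, B>: sum of entrywise products (Definition 1)
fro   = np.sqrt(np.sum(A * A))     # Frobenius norm ||A||_F = sqrt(<A, A>)
l1    = np.sum(np.abs(A))          # elementwise l1 norm ||A||_1

# Rank-1 (unit-rank) tensor a^(1) o a^(2) o a^(3) (Definition 2)
vecs = [np.random.randn(4), np.random.randn(3), np.random.randn(2)]
rank1 = reduce(np.multiply.outer, vecs)   # entry (i1,i2,i3) equals a^(1)_i1 * a^(2)_i2 * a^(3)_i3
print(inner, fro, l1, rank1.shape)
```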
**Definition 3 (CP Decomposition).** Every tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ can be decomposed as a weighted sum of rank-1 tensors with a suitably large $R$:

$$\mathcal{A} = \sum_{r=1}^{R} \sigma_r \, \mathbf{a}^{(1)}_r \circ \cdots \circ \mathbf{a}^{(N)}_r, \tag{3}$$

where $\sigma_r \in \mathbb{R}$, $\mathbf{a}^{(n)}_r \in \mathbb{R}^{I_n}$, and $\|\mathbf{a}^{(n)}_r\|_2 = 1$.

**Definition 4 (Tensor Rank).** The tensor rank of $\mathcal{A}$, denoted by $\operatorname{rank}(\mathcal{A})$, is the smallest number $R$ such that the equality (3) holds.

**Definition 5 ($n$-mode Product).** The $n$-mode product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with a vector $\mathbf{u} \in \mathbb{R}^{I_n}$, denoted by $\mathcal{A} \times_n \mathbf{u}$, is an $(I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N)$-tensor with entries $(\mathcal{A} \times_n \mathbf{u})_{i_1,\dots,i_{n-1},i_{n+1},\dots,i_N} = \sum_{i_n=1}^{I_n} a_{i_1,\dots,i_N} u_{i_n}$.
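To make Definitions 3 and 5 concrete, here is a small numpy sketch (indices are 0-based in code, unlike the 1-based notation above) that rebuilds a tensor from CP factors and applies an $n$-mode product:

```python
import numpy as np
from functools import reduce

def cp_reconstruct(sigmas, factors):
    """Sum of R rank-1 terms, sum_r sigma_r a_r^(1) o ... o a_r^(N) (Definition 3).
    factors[n] is an (I_n x R) matrix whose r-th column is a_r^(n)."""
    return sum(sigmas[r] * reduce(np.multiply.outer, [F[:, r] for F in factors])
               for r in range(len(sigmas)))

def mode_product(A, u, n):
    """n-mode product A x_n u (Definition 5): contract mode n of A with the vector u."""
    return np.tensordot(A, u, axes=(n, 0))

# A 4 x 3 x 2 tensor with CP rank at most 2
factors = [np.random.randn(4, 2), np.random.randn(3, 2), np.random.randn(2, 2)]
A = cp_reconstruct([2.0, 0.5], factors)
print(mode_product(A, np.ones(3), 1).shape)   # contracting the second mode leaves a (4, 2) array
```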
## 3 Sparse and Low-Rank Tensor Regression

### 3.1 Model Formulation

For an $N$th-order predictor tensor $\mathcal{X}_m \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a scalar response $y_m$, $m = 1, \dots, M$, we consider the regression model

$$y_m = \langle \mathcal{X}_m, \mathcal{W} \rangle + \varepsilon_m, \tag{4}$$

where $\mathcal{W} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is an $N$th-order coefficient tensor and $\varepsilon_m$ is a random error term with zero mean. Without loss of generality, the intercept is set to zero by centering the response and standardizing the predictors, i.e., $\sum_{m=1}^{M} y_m = 0$, $\sum_{m=1}^{M} (\mathcal{X}_m)_{i_1,\dots,i_N} = 0$, and $\sum_{m=1}^{M} (\mathcal{X}_m)_{i_1,\dots,i_N}^2 / M = 1$ for $i_n = 1, \dots, I_n$. Our goal is to estimate $\mathcal{W}$ from the $M$ i.i.d. observations $\{(\mathcal{X}_m, y_m)\}_{m=1}^{M}$.

To reduce the complexity of the model and leverage the structural information in $\mathcal{X}_m$, we assume the coefficient tensor $\mathcal{W}$ to be both low-rank and sparse. Specifically, we assume $\mathcal{W}$ admits a CP decomposition $\mathcal{W} = \sum_{r=1}^{R} \mathcal{W}_r = \sum_{r=1}^{R} \sigma_r \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(N)}_r$, where each rank-1 tensor $\mathcal{W}_r$ is possibly sparse, or equivalently, the vectors $\mathbf{w}^{(n)}_r$, $n = 1, \dots, N$, in its representation are possibly sparse, for all $r = 1, \dots, R$. Here we impose sparsity on the rank-1 components of the CP decomposition rather than on $\mathcal{W}$ itself [Chen et al., 2012, Tan et al., 2012]. This adaptation is beneficial in multiple ways: 1) it integrates a finer sparsity structure into the CP decomposition, which enables direct control of component-wise sparsity; 2) it leads to an appealing model interpretation and feature grouping: the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions; 3) it yields a more flexible and parsimonious model, since recovering the within-decomposition sparsity of a tensor requires fewer parameters than imposing sparsity on the tensor itself as in existing methods, which in turn improves generalizability.

A straightforward way of conducting model estimation is to solve the following optimization problem:

$$\min_{\sigma_r, \mathbf{w}^{(n)}_r} \sum_{m=1}^{M} \Big( y_m - \Big\langle \mathcal{X}_m, \sum_{r=1}^{R} \sigma_r \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(N)}_r \Big\rangle \Big)^2 + \sum_{r=1}^{R} \sum_{n=1}^{N} \lambda_{r,n} \|\mathbf{w}^{(n)}_r\|_1, \quad \text{s.t. } \|\mathbf{w}^{(n)}_r\|_1 = 1,\ n = 1,\dots,N,\ r = 1,\dots,R, \tag{5}$$

where $\lambda_{r,n}$, $r = 1,\dots,R$, $n = 1,\dots,N$, are regularization parameters. This problem is very difficult to solve because: 1) the CP rank $R$ needs to be pre-specified; 2) as the CP decomposition may not be unique, the pursuit of its within-decomposition sparsity is highly non-convex and the problem may suffer from parameter identifiability issues [Mishra et al., 2017]; 3) the estimation may involve many regularization parameters, for which tuning becomes very costly.

### 3.2 Divide-and-Conquer: Sequential Pursuit for Sparse Tensor Decomposition

We propose a divide-and-conquer strategy to recover the sparse CP decomposition. Our approach is based on the sequential extraction method (a.k.a. deflation) [Phan et al., 2015, Mishra et al., 2017], which seeks one unit-rank tensor at a time and then deflates the residuals to find further unit-rank tensors. Deflation has proved to be a rapid and effective way of partitioning the tensor decomposition into simpler subproblems. Specifically, we sequentially solve the following sparse unit-rank problem:

$$\widehat{\mathcal{W}}_r = \arg\min_{\mathcal{W}_r} \frac{1}{M} \sum_{m=1}^{M} \big( y^r_m - \langle \mathcal{X}_m, \mathcal{W}_r \rangle \big)^2 + \lambda_r \|\mathcal{W}_r\|_1 + \alpha \|\mathcal{W}_r\|_F^2, \quad \text{s.t. } \operatorname{rank}(\mathcal{W}_r) \le 1, \tag{6}$$

where $r$ is the sequential index of the unit-rank terms and $y^r_m$ is the current response residual,

$$y^r_m = \begin{cases} y_m, & \text{if } r = 1, \\ y^{r-1}_m - \langle \mathcal{X}_m, \widehat{\mathcal{W}}_{r-1} \rangle, & \text{otherwise,} \end{cases}$$

with $\widehat{\mathcal{W}}_{r-1}$ the estimated unit-rank tensor from the $(r-1)$-th step, whose tuning is done by, e.g., cross-validation. The final estimator is then obtained as $\widehat{\mathcal{W}}^{(R)} = \sum_{r=1}^{R} \widehat{\mathcal{W}}_r$. To improve the convexity of the problem and its numerical stability, we use the elastic net penalty [Zou and Hastie, 2005] instead of the LASSO; this is critical to ensure the convergence of the optimization. The accuracy of the solution can be controlled simply by adjusting the values of $\lambda_r$ and $\alpha$. Since we mainly focus on sparse estimation, we fix $\alpha > 0$ at a small constant in the numerical studies.

As each $\mathcal{W}_r$ is of unit rank, it can be decomposed as $\mathcal{W}_r = \sigma_r \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(N)}_r$ with $\sigma_r \ge 0$ and $\|\mathbf{w}^{(n)}_r\|_1 = 1$, $n = 1,\dots,N$. It is then clear that $\|\mathcal{W}_r\|_1 = \|\sigma_r \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(N)}_r\|_1 = \sigma_r \prod_{n=1}^{N} \|\mathbf{w}^{(n)}_r\|_1$. That is, the sparsity of a unit-rank tensor directly translates into the sparsity of its components. This allows us to kill multiple birds with one stone: by simply pursuing elementwise sparsity of the unit-rank coefficient tensor with only one tuning parameter $\lambda$, solving (6) produces a set of sparse factor coefficients $\widehat{\mathbf{w}}^{(n)}_r$, $n = 1,\dots,N$, simultaneously. With this sequential pursuit strategy, the general problem boils down to a set of sparse unit-rank estimation problems, for which we develop a novel stagewise/boosting algorithm.

## 4 Fast Stagewise Unit-Rank Tensor Factorization (SURF)

For simplicity, we drop the index $r$ and write the generic form of the problem in (6) as

$$\widehat{\mathcal{W}} = \arg\min_{\mathcal{W}} \frac{1}{M} \sum_{m=1}^{M} \big( y_m - \langle \mathcal{X}_m, \mathcal{W} \rangle \big)^2 + \lambda \|\mathcal{W}\|_1 + \alpha \|\mathcal{W}\|_F^2, \quad \text{s.t. } \operatorname{rank}(\mathcal{W}) \le 1. \tag{7}$$

Let $\mathcal{W} = \sigma \mathbf{w}^{(1)} \circ \cdots \circ \mathbf{w}^{(N)}$, where $\sigma \ge 0$, $\|\mathbf{w}^{(n)}\|_1 = 1$, and the factors $\mathbf{w}^{(n)}$, $n = 1,\dots,N$, are identifiable up to sign flipping. Let $\mathbf{y} = [y_1, \dots, y_M]^\top \in \mathbb{R}^M$ and $\mathcal{X} = [\mathcal{X}_1, \dots, \mathcal{X}_M] \in \mathbb{R}^{I_1 \times \cdots \times I_N \times M}$. Then (7) can be reformulated as

$$\min_{\sigma, \mathbf{w}^{(n)}} \frac{1}{M} \big\| \mathbf{y} - \sigma \, \mathcal{X} \times_1 \mathbf{w}^{(1)} \times_2 \cdots \times_N \mathbf{w}^{(N)} \big\|_2^2 + \lambda \sigma \prod_{n=1}^{N} \|\mathbf{w}^{(n)}\|_1 + \alpha \sigma^2 \prod_{n=1}^{N} \|\mathbf{w}^{(n)}\|_2^2, \quad \text{s.t. } \sigma \ge 0,\ \|\mathbf{w}^{(n)}\|_1 = 1,\ n = 1,\dots,N. \tag{8}$$

Before diving into the stagewise/boosting algorithm, we first consider an alternating convex search (ACS) approach [Chen et al., 2012, Minasian et al., 2014], which is a natural way to solve (7) for any fixed tuning parameter. Specifically, we alternately optimize with respect to one block of variables $(\sigma, \mathbf{w}^{(n)})$ while keeping the others fixed. For each block $(\sigma, \mathbf{w}^{(n)})$, the relevant constraints are $\sigma \ge 0$ and $\|\mathbf{w}^{(n)}\|_1 = 1$, but the objective in (8) depends on $(\sigma, \mathbf{w}^{(n)})$ only through the product $\sigma \mathbf{w}^{(n)}$, so both constraints can be avoided by optimizing over $\sigma \mathbf{w}^{(n)}$ directly. Let $\widehat{\mathbf{w}}^{(n)} = \sigma \mathbf{w}^{(n)}$ and $\mathbf{Z}^{(-n)} = \mathcal{X} \times_1 \mathbf{w}^{(1)} \cdots \times_{n-1} \mathbf{w}^{(n-1)} \times_{n+1} \mathbf{w}^{(n+1)} \cdots \times_N \mathbf{w}^{(N)}$; the subproblem then boils down to

$$\min_{\widehat{\mathbf{w}}^{(n)}} \frac{1}{M} \big\| \mathbf{y} - \mathbf{Z}^{(-n)\top} \widehat{\mathbf{w}}^{(n)} \big\|_2^2 + \alpha \beta^{(-n)} \|\widehat{\mathbf{w}}^{(n)}\|_2^2 + \lambda \|\widehat{\mathbf{w}}^{(n)}\|_1, \tag{9}$$

where $\beta^{(-n)} = \prod_{l \neq n} \|\mathbf{w}^{(l)}\|_2^2$. Once we obtain the solution $\widehat{\mathbf{w}}^{(n)}$, we set $\sigma = \|\widehat{\mathbf{w}}^{(n)}\|_1$ and $\mathbf{w}^{(n)} = \widehat{\mathbf{w}}^{(n)}/\sigma$ to satisfy the constraints whenever $\widehat{\mathbf{w}}^{(n)} \neq \mathbf{0}$. If $\widehat{\mathbf{w}}^{(n)} = \mathbf{0}$, then $\mathbf{w}^{(n)}$ is no longer identifiable; we set $\sigma = 0$ and terminate the algorithm. Note that when $\alpha > 0$, each subproblem is strongly convex and the generated solution sequence is uniformly bounded, so ACS can be shown to converge to a coordinate-wise minimum point of (8) with a properly chosen initial value [Mishra et al., 2017].
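For intuition about how these block updates look in practice, here is a schematic numpy sketch of ACS for the special case of second-order (matrix) predictors, with each subproblem (9) solved by plain coordinate descent with soft-thresholding. The function names, the simple non-sparse initialization, and the fixed iteration counts are our own illustrative choices, not the authors' MATLAB implementation:

```python
import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def penalized_ls(Z, y, lam, ridge, n_iter=200):
    """Coordinate descent for (1/M)||y - Z.T @ w||^2 + ridge*||w||^2 + lam*||w||_1,
    where Z has shape (I, M): row j holds feature j across the M samples."""
    I, M = Z.shape
    w = np.zeros(I)
    row_sq = np.sum(Z ** 2, axis=1)                  # ||z_j||^2 for each feature row
    for _ in range(n_iter):
        for j in range(I):
            r_j = y - Z.T @ w + Z[j] * w[j]          # partial residual excluding feature j
            a = (2.0 / M) * Z[j] @ r_j
            w[j] = soft_threshold(a, lam) / ((2.0 / M) * row_sq[j] + 2.0 * ridge)
    return w

def acs_unit_rank(X, y, lam, alpha, n_outer=50):
    """Alternating convex search for (7)-(9) when X has shape (I1, I2, M).
    Returns (sigma, w1, w2); the factors are l1-normalized whenever sigma > 0."""
    I1, I2, M = X.shape
    w1, w2 = np.ones(I1) / I1, np.ones(I2) / I2      # simple (non-sparse) starting point
    sigma = 1.0
    for _ in range(n_outer):
        for n in (0, 1):
            if n == 0:   # Z^(-1): contract mode 2 with w2 -> shape (I1, M); beta^(-1) = ||w2||_2^2
                Z, beta = np.einsum('ijm,j->im', X, w2), np.sum(w2 ** 2)
            else:        # Z^(-2): contract mode 1 with w1 -> shape (I2, M); beta^(-2) = ||w1||_2^2
                Z, beta = np.einsum('ijm,i->jm', X, w1), np.sum(w1 ** 2)
            w_hat = penalized_ls(Z, y, lam, alpha * beta)    # block variable sigma * w^(n)
            sigma = np.sum(np.abs(w_hat))
            if sigma == 0:                                   # factor no longer identifiable
                return 0.0, np.zeros(I1), np.zeros(I2)
            if n == 0:
                w1 = w_hat / sigma
            else:
                w2 = w_hat / sigma
    return sigma, w1, w2
```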
The optimization procedure would then need to be repeated over a grid of tuning-parameter values to obtain the solution paths and to locate the optimal sparse solution along the paths. Inspired by the biconvex structure of (7) and the stagewise algorithm for the LASSO [Zhao and Yu, 2007, Vaughan et al., 2017], we instead develop a fast stagewise unit-rank tensor factorization (SURF) algorithm to trace out the entire solution paths of (7) in a single run. The main idea of a stagewise procedure is to build a model from scratch, gradually increasing the model complexity in a sequence of simple learning steps. For instance, in stagewise estimation for linear regression, a forward step searches for the best predictor to explain the current residual and updates its coefficient by a small increment, while a backward step may decrease the coefficient of a selected predictor to correct for greedy selection whenever necessary. Owing to the biconvex/multi-convex structure of our objective function, efficient stagewise estimation remains possible; the only catch is that, when we determine which coefficient to update at each iteration, we always get $N$ competing proposals from the $N$ different tensor modes, rather than the single proposal in the LASSO case.

To simplify notation, the objective in (9) is rearranged into a standard LASSO form,

$$\min_{\widehat{\mathbf{w}}^{(n)}} \frac{1}{M} \big\| \widehat{\mathbf{y}} - \widehat{\mathbf{Z}}^{(-n)} \widehat{\mathbf{w}}^{(n)} \big\|_2^2 + \lambda \|\widehat{\mathbf{w}}^{(n)}\|_1, \tag{10}$$

using the augmented data $\widehat{\mathbf{y}} = (\mathbf{y}^\top, \mathbf{0}^\top)^\top$ and $\widehat{\mathbf{Z}}^{(-n)} = (\mathbf{Z}^{(-n)}, \sqrt{\alpha\beta^{(-n)}M}\,\mathbf{I})^\top$, where $\mathbf{I}$ is the identity matrix of size $I_n$. We write the objective in (10) as $\Gamma(\widehat{\mathbf{w}}^{(n)}; \lambda) = J(\widehat{\mathbf{w}}^{(n)}) + \lambda \Omega(\widehat{\mathbf{w}}^{(n)})$, and use $(\sigma, \{\mathbf{w}^{(n)}\})$ to denote all the variables where necessary.

**Algorithm 1: Fast Stagewise Unit-Rank Tensor Factorization (SURF)**

**Input:** training data $\mathcal{D}$, a small step size $\epsilon > 0$, and a small tolerance parameter $\xi$ (as noted in [Zhao and Yu, 2007], $\xi$ is implementation-specific and not necessarily a user parameter; in all experiments we set $\xi = \epsilon^2/2$).
**Output:** solution paths of $(\sigma, \{\mathbf{w}^{(n)}\})$.

1. Initialization: take a forward step with
$$(\{\hat{i}_1, \dots, \hat{i}_N\}, \hat{s}) = \arg\min_{\{i_1,\dots,i_N\},\, s = \pm\epsilon} J(s\mathbf{1}_{i_1}, \mathbf{1}_{i_2}, \dots, \mathbf{1}_{i_N}),$$
and set $\sigma_0 = \epsilon$, $\mathbf{w}^{(1)}_0 = \operatorname{sign}(\hat{s})\mathbf{1}_{\hat{i}_1}$, $\mathbf{w}^{(n)}_0 = \mathbf{1}_{\hat{i}_n}$ ($n \neq 1$), and $\lambda_0 = \big(J(\{\mathbf{0}\}) - J(\sigma_0, \{\mathbf{w}^{(n)}_0\})\big)/\epsilon$. (11)
Set the active index sets $\mathcal{I}^{(n)}_0 = \{\hat{i}_n\}$ for $n = 1,\dots,N$; $t = 0$.
2. Repeat:
3. Backward step: compute
$$(\hat{n}, \hat{i}_{\hat{n}}) = \arg\min_{n,\; i_n \in \mathcal{I}^{(n)}_t} J(\widehat{\mathbf{w}}^{(n)}_t + s_{i_n}\mathbf{1}_{i_n}), \quad \text{where } s_{i_n} = -\operatorname{sign}\big(\widehat{w}^{(n)}_{t,i_n}\big)\epsilon. \tag{12}$$
If $\Gamma(\widehat{\mathbf{w}}^{(\hat n)}_t + s_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}}; \lambda_t) \le \Gamma(\widehat{\mathbf{w}}^{(\hat n)}_t; \lambda_t) - \xi$, then set $\sigma_{t+1} = \|\widehat{\mathbf{w}}^{(\hat n)}_t + s_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}}\|_1$, $\mathbf{w}^{(\hat n)}_{t+1} = (\widehat{\mathbf{w}}^{(\hat n)}_t + s_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}})/\sigma_{t+1}$, $\mathbf{w}^{(-\hat n)}_{t+1} = \mathbf{w}^{(-\hat n)}_t$, $\lambda_{t+1} = \lambda_t$, and update the active sets as $\mathcal{I}^{(\hat n)}_{t+1} = \mathcal{I}^{(\hat n)}_t \setminus \{\hat{i}_{\hat n}\}$ if $w^{(\hat n)}_{(t+1)\hat{i}_{\hat n}} = 0$, and $\mathcal{I}^{(n)}_{t+1} = \mathcal{I}^{(n)}_t$ otherwise.
4. Else, forward step: compute
$$(\hat{n}, \hat{i}_{\hat{n}}, \hat{s}_{\hat{i}_{\hat n}}) = \arg\min_{n,\; i_n,\; s = \pm\epsilon} J(\widehat{\mathbf{w}}^{(n)}_t + s\mathbf{1}_{i_n}), \tag{13}$$
and set $\sigma_{t+1} = \|\widehat{\mathbf{w}}^{(\hat n)}_t + \hat{s}_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}}\|_1$, $\mathbf{w}^{(\hat n)}_{t+1} = (\widehat{\mathbf{w}}^{(\hat n)}_t + \hat{s}_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}})/\sigma_{t+1}$, $\mathbf{w}^{(-\hat n)}_{t+1} = \mathbf{w}^{(-\hat n)}_t$,
$$\lambda_{t+1} = \min\left[\lambda_t,\ \frac{J(\sigma_t, \{\mathbf{w}^{(n)}_t\}) - J(\sigma_{t+1}, \{\mathbf{w}^{(n)}_{t+1}\}) - \xi}{\Omega(\sigma_{t+1}, \{\mathbf{w}^{(n)}_{t+1}\}) - \Omega(\sigma_t, \{\mathbf{w}^{(n)}_t\})}\right],$$
and $\mathcal{I}^{(n)}_{t+1} = \mathcal{I}^{(n)}_t \cup \{\hat{i}_n\}$ if $n = \hat{n}$, and $\mathcal{I}^{(n)}_{t+1} = \mathcal{I}^{(n)}_t$ otherwise.
5. Set $t = t + 1$.
6. Until $\lambda_t \le 0$.

The structure of our stagewise algorithm is presented in Algorithm 1. It can be viewed as a boosting procedure that builds up the solution gradually through forward steps (step 4) and backward steps (step 3): boosting combines a set of weak learners into one strong learner and is closely connected to stagewise and regularized estimation, and SURF is a boosting method in the sense that each backward/forward step finds a weak learner that incrementally improves the current learner, thereby generating a path of solutions. The initialization step can be solved explicitly (see Lemma 1 below). At each subsequent iteration, the parameter update takes the form $\widehat{\mathbf{w}}^{(n)} \leftarrow \widehat{\mathbf{w}}^{(n)} + s\mathbf{1}_{i_n}$ in either the forward or the backward direction, where $\mathbf{1}_{i_n}$ is a length-$I_n$ vector of all zeros except for a 1 in the $i_n$-th coordinate, $s = \pm\epsilon$, and $\epsilon$ is the pre-specified step size controlling the fineness of the grid. The algorithm also keeps track of the tuning parameter $\lambda$.
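To convey the skeleton of Algorithm 1 without the tensor bookkeeping, the sketch below runs the analogous forward/backward stagewise loop for a single coefficient vector, essentially the stagewise LASSO of [Zhao and Yu, 2007]. It is a simplified illustration only: SURF additionally compares proposals across all $N$ tensor modes, renormalizes $(\sigma, \mathbf{w}^{(n)})$ after every step, and works with the augmented data above; the step cap and variable names are our own choices:

```python
import numpy as np

def stagewise_lasso_path(Z, y, eps=0.1, max_steps=5000):
    """Forward/backward stagewise loop for (1/M)||y - Z @ w||^2 + lam*||w||_1 (Z is M x I)."""
    xi = eps ** 2 / 2.0                                   # tolerance, as used in the experiments
    M, I = Z.shape
    J = lambda w: np.sum((y - Z @ w) ** 2) / M            # unpenalized loss
    l1 = lambda w: np.abs(w).sum()

    def best_forward(w):                                  # best single eps-increment, cf. (13)
        cand = [(J(w + s * np.eye(I)[j]), j, s) for j in range(I) for s in (eps, -eps)]
        return min(cand)

    w = np.zeros(I)
    J_new, j, s = best_forward(w)                         # initialization, cf. (11)
    lam = (J(w) - J_new) / eps
    w[j] += s
    path = [(lam, w.copy())]

    for _ in range(max_steps):
        if lam <= 0:
            break
        gamma = J(w) + lam * l1(w)
        best = None                                       # backward step, cf. (12)
        for j in np.flatnonzero(w):
            wb = w.copy()
            wb[j] -= np.sign(w[j]) * eps
            drop = gamma - (J(wb) + lam * l1(wb))
            if drop > xi and (best is None or drop > best[0]):
                best = (drop, wb)
        if best is not None:
            w = best[1]                                   # accept the backward move, keep lam
        else:
            J_new, j, s = best_forward(w)                 # forward step; relax lam just enough
            wf = w.copy()
            wf[j] += s
            denom = l1(wf) - l1(w)
            if denom > 0:                                 # mirrors the lambda update in Algorithm 1
                lam = min(lam, (J(w) - J_new - xi) / denom)
            w = wf
        path.append((lam, w.copy()))
    return path
```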
Intuitively, the selection of the index $(n, i_n)$ and the increment $s$ is guided by minimizing the penalized loss at the current $\lambda$, subject to a constraint on the step size. Compared with the standard stagewise LASSO, the main difference is that we need to select the best triple $(n, i_n, s)$ over all dimensions across all tensor modes. Problems (12) and (13) can be solved efficiently. Expanding $J(\widehat{\mathbf{w}}^{(n)} + s\mathbf{1}_{i_n})$, we have

$$J(\widehat{\mathbf{w}}^{(n)} + s\mathbf{1}_{i_n}) = \frac{1}{M} \Big( \|\widehat{\mathbf{e}}^{(n)}\|_2^2 - 2 s \, \widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)} \mathbf{1}_{i_n} + \epsilon^2 \|\widehat{\mathbf{Z}}^{(-n)} \mathbf{1}_{i_n}\|_2^2 \Big),$$

where $\widehat{\mathbf{e}}^{(n)} = \widehat{\mathbf{y}} - \widehat{\mathbf{Z}}^{(-n)} \widehat{\mathbf{w}}^{(n)}$ is constant within each iteration. The solution at each forward step is therefore

$$(\hat{n}, \hat{i}_{\hat{n}}) = \arg\max_{n, i_n} \; 2 \big| \widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)} \mathbf{1}_{i_n} \big| - \epsilon \, \mathrm{Diag}\big( \widehat{\mathbf{Z}}^{(-n)\top} \widehat{\mathbf{Z}}^{(-n)} \big)^\top \mathbf{1}_{i_n}, \qquad \hat{s} = \operatorname{sign}\big( \widehat{\mathbf{e}}^{(\hat{n})\top} \widehat{\mathbf{Z}}^{(-\hat{n})} \mathbf{1}_{\hat{i}_{\hat{n}}} \big) \epsilon,$$

and at each backward step

$$(\hat{n}, \hat{i}_{\hat{n}}) = \arg\min_{n,\; i_n \in \mathcal{I}^{(n)}} \; 2 \operatorname{sign}\big( \widehat{w}^{(n)}_{i_n} \big) \widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)} \mathbf{1}_{i_n} + \epsilon \, \mathrm{Diag}\big( \widehat{\mathbf{Z}}^{(-n)\top} \widehat{\mathbf{Z}}^{(-n)} \big)^\top \mathbf{1}_{i_n},$$

where $\mathrm{Diag}(\cdot)$ denotes the vector formed by the diagonal elements of a square matrix and $\mathcal{I}^{(n)} \subseteq \{1, \dots, I_n\}$ is the $n$-mode active index set at the current iteration.

**Computational Analysis.** In Algorithm 1, the most time-consuming parts are the calculations of $\widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)}$ and $\mathrm{Diag}(\widehat{\mathbf{Z}}^{(-n)\top} \widehat{\mathbf{Z}}^{(-n)})$, which appear in both the forward and backward steps. They can be written as

$$\widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)} = \big( \mathbf{Z}^{(-n)} \mathbf{e} - \alpha \beta^{(-n)} M \widehat{\mathbf{w}}^{(n)} \big)^\top, \qquad \mathrm{Diag}\big( \widehat{\mathbf{Z}}^{(-n)\top} \widehat{\mathbf{Z}}^{(-n)} \big) = \mathrm{Diag}\big( \mathbf{Z}^{(-n)} \mathbf{Z}^{(-n)\top} \big) + \alpha \beta^{(-n)} M \mathbf{1}^{(n)},$$

where $\mathbf{e} = \mathbf{y} - \mathbf{Z}^{(-n)\top} \widehat{\mathbf{w}}^{(n)}$ is constant within an iteration but varies from one iteration to the next, and $\mathbf{1}^{(n)}$ is a length-$I_n$ vector of all ones. At each iteration, the computational cost is dominated by the update of $\mathbf{Z}^{(-n)}$ for $n \neq \hat{n}$, which can be obtained incrementally as

$$\mathbf{Z}^{(-n)}_{t+1} = \frac{1}{\sigma_{t+1}} \Big( \sigma_t \mathbf{Z}^{(-n)}_t + \mathbf{Z}^{(-n,\hat{n})}_t \times_{\hat{n}} \hat{s}_{\hat{i}_{\hat{n}}} \mathbf{1}_{\hat{i}_{\hat{n}}} \Big), \tag{14}$$

where $(-n, \hat{n})$ denotes every mode except $n$ and $\hat{n}$; this greatly reduces the computation. Therefore, when $n \neq \hat{n}$, the updates of $\mathbf{Z}^{(-n)}$, $\mathbf{e}$, $\mathbf{Z}^{(-n)}\mathbf{e}$, and $\mathrm{Diag}(\mathbf{Z}^{(-n)}\mathbf{Z}^{(-n)\top})$ require $O(M \prod_{s \neq n, \hat{n}} I_s + 3 M I_n)$, $O(M I_{\hat{n}})$, $O(M I_n)$, and $O(M I_n)$ operations, respectively; when $n = \hat{n}$, we only need to additionally update $\mathbf{Z}^{(-\hat{n})}\mathbf{e}$, which requires $O(M I_{\hat{n}})$ operations. Overall, the computational complexity of our approach is $O\big( M \sum_{n \neq \hat{n}} \big( \prod_{s \neq n, \hat{n}} I_s + 5 I_n \big) + 2 M I_{\hat{n}} \big)$ per iteration. In contrast, the ACS algorithm has to be run for each fixed $\lambda$, and each such problem requires $O(M \prod_{n=1}^{N} I_n)$ operations per iteration [da Silva et al., 2015].

## 5 Convergence Analysis

We provide a convergence analysis for our stagewise algorithm in this section; all detailed proofs are given in Appendix A. Lemmas 1 and 2 below justify the validity of the initialization.

**Lemma 1.** Let $\mathbf{X}$ be the $(N+1)$-mode matricization of $\mathcal{X}$, and write $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_I]$, where each $\mathbf{x}_i$ is a column of $\mathbf{X}$. Then $\lambda_{\max} = \frac{2}{M} \max\{ |\mathbf{x}_i^\top \mathbf{y}|;\; i = 1, \dots, I \}$. Moreover, letting $i^* = \arg\max_i |\mathbf{x}_i^\top \mathbf{y}|$ with $(i^*_1, \dots, i^*_N)$ its corresponding indices in tensor space, the initial non-zero solution of (11), denoted $(\sigma, \{\mathbf{w}^{(n)}\})$, is given by $\sigma = \epsilon$, $\mathbf{w}^{(1)} = \operatorname{sign}(\mathbf{x}_{i^*}^\top \mathbf{y}) \mathbf{1}_{i^*_1}$, and $\mathbf{w}^{(n)} = \mathbf{1}_{i^*_n}$ for $n = 2, \dots, N$, where $\mathbf{1}_{i^*_n}$ is a vector of all zeros except for a 1 in the $i^*_n$-th coordinate.
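Lemma 1 translates directly into a few lines of numpy; in the sketch below the reshape corresponds to one particular matricization ordering (a convention choice on our part), and the function name is hypothetical:

```python
import numpy as np

def surf_initialization(X, y, eps=0.1):
    """Lemma 1: X has shape (I_1, ..., I_N, M), y has shape (M,).
    Returns lambda_max together with the initial (sigma, {w^(n)})."""
    feat_shape, M = X.shape[:-1], X.shape[-1]
    Xmat = X.reshape(-1, M)                         # rows are the columns x_i of the matricization
    scores = Xmat @ y                               # x_i^T y for every feature position i
    flat = np.argmax(np.abs(scores))
    lam_max = 2.0 / M * np.abs(scores[flat])
    i_star = np.unravel_index(flat, feat_shape)     # (i*_1, ..., i*_N)
    w0 = [np.zeros(I_n) for I_n in feat_shape]      # unit vectors 1_{i*_n}; the sign goes on mode 1
    for n, _ in enumerate(feat_shape):
        w0[n][i_star[n]] = np.sign(scores[flat]) if n == 0 else 1.0
    return lam_max, eps, w0
```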
**Lemma 2.** If there exist $s$ and $i_n$ with $|s| = \epsilon$, $n = 1, \dots, N$, such that $\Gamma(s\mathbf{1}_{i_1}, \mathbf{1}_{i_2}, \dots, \mathbf{1}_{i_N}; \lambda) \le \Gamma(\{\mathbf{0}\}; \lambda)$, then it must be true that $\lambda \le \lambda_0$.

Lemma 3 shows that the backward step always performs a coordinate descent update of fixed size $\epsilon$, each time along the steepest coordinate direction within the current active set, until such descent becomes impossible up to a tolerance level $\xi$; the forward step also performs coordinate descent whenever $\lambda_{t+1} = \lambda_t$. Lemma 4 shows that once $\lambda$ changes, the penalized loss for the previous $\lambda$ can no longer be improved by more than the tolerance $\xi$. Thus $\epsilon$ controls the granularity of the paths, and $\xi$ is a convergence threshold for optimizing the penalized loss at any fixed tuning parameter; together they enable a convenient trade-off between computational efficiency and estimation accuracy.

**Lemma 3.** For any $t$ with $\lambda_{t+1} = \lambda_t$, we have $\Gamma(\sigma_{t+1}, \{\mathbf{w}^{(n)}_{t+1}\}; \lambda_{t+1}) \le \Gamma(\sigma_t, \{\mathbf{w}^{(n)}_t\}; \lambda_{t+1}) - \xi$.

**Lemma 4.** For any $t$ with $\lambda_{t+1} < \lambda_t$, we have $\Gamma(\widehat{\mathbf{w}}^{(n)}_t + s_{i_n}\mathbf{1}_{i_n}; \lambda_t) > \Gamma(\widehat{\mathbf{w}}^{(n)}_t; \lambda_t) - \xi$.

Lemmas 3 and 4 together prove the following convergence theorem.

**Theorem 1.** For any $t$ such that $\lambda_{t+1} < \lambda_t$, we have $(\sigma_t, \{\mathbf{w}^{(n)}_t\}) \to (\tilde{\sigma}(\lambda_t), \{\tilde{\mathbf{w}}^{(n)}(\lambda_t)\})$ as $\epsilon, \xi \to 0$, where $(\tilde{\sigma}(\lambda_t), \{\tilde{\mathbf{w}}^{(n)}(\lambda_t)\})$ denotes a coordinate-wise minimum point of Problem (7).

## 6 Experiments

We evaluate the effectiveness and efficiency of our method SURF through numerical experiments on both synthetic and real data, and compare it with various state-of-the-art regression methods, including LASSO, Elastic Net (ENet), Regularized multilinear regression and selection (Remurs) [Song and Lu, 2017], optimal CP-rank Tensor Ridge Regression (orTRR) [Guo et al., 2012], the Generalized Linear Tensor Regression Model (GLTRM) [Zhou et al., 2013], and a variant of our method estimated by Alternating Convex Search (ACS). Table 1 summarizes the properties of all methods.

Table 1: Compared methods. $\alpha$ and $\lambda$ are regularization parameters; $R$ is the CP rank.

| Methods | LASSO | ENet | Remurs | orTRR | GLTRM | ACS | SURF |
|---|---|---|---|---|---|---|---|
| Input Data Type | Vector | Vector | Tensor | Tensor | Tensor | Tensor | Tensor |
| Regularization | $\ell_1$ ($\mathbf{w}$) | $\ell_1/\ell_2$ ($\mathbf{w}$) | Nuclear/$\ell_1$ ($\mathcal{W}$) | $\ell_2$ ($\mathbf{W}^{(n)}$) | $\ell_1/\ell_2$ ($\mathbf{W}^{(n)}$) | $\ell_1/\ell_2$ ($\mathcal{W}_r$) | $\ell_1/\ell_2$ ($\mathcal{W}_r$) |
| Rank | -- | -- | Explored | Optimized | Fixed | Increased | Increased |
| Hyperparameters | $\lambda$ | $\alpha, \lambda$ | $\lambda_1, \lambda_2$ | $\alpha, R$ | $\alpha, \lambda, R$ | $\alpha, \lambda, R$ | $\alpha, \lambda, R$ |

All methods are implemented in MATLAB and executed on a machine with a 3.50GHz CPU and 256GB RAM. For LASSO and ENet we use the MATLAB package glmnet [Friedman et al., 2010]. For GLTRM, we solve the regularized CP tensor regression simultaneously for all $R$ factors based on the TensorReg toolbox [Zhou et al., 2013]. Following [Kampa et al., 2014], we split the data into test and training sets in the ratio 1:5. The hyperparameters of all methods are optimized using 5-fold cross-validation on the training set, with ranges $\alpha \in \{0.1, 0.2, \dots, 1\}$, $\lambda \in \{10^{-3}, 5 \times 10^{-3}, 10^{-2}, 5 \times 10^{-2}, \dots, 5 \times 10^{2}, 10^{3}\}$, and $R \in \{1, 2, \dots, 50\}$. Specifically, for GLTRM, ACS, and SURF we simply set $\alpha = 1$. For LASSO, ENet, and ACS, we generate a sequence of 100 values of $\lambda$ to cover the whole path. For fairness, the number of iterations for all compared methods is fixed to 100. All cases are run 50 times and the average results on the test set are reported. Our code is available at https://github.com/LifangHe/SURF.

**Synthetic Data.** We first use synthetic data to examine the performance of our method in different scenarios, varying the step size, the sparsity level, the number of features, and the sample size. We generate the data as $y = \langle \mathcal{X}, \mathcal{W} \rangle + \varepsilon$, where $\varepsilon$ is random noise drawn from $N(0, 1)$ and $\mathcal{X} \in \mathbb{R}^{I \times I}$. We generate $M$ samples $\{\mathcal{X}_m\}_{m=1}^{M}$ from $N(\mathbf{0}, \Sigma)$, where $\Sigma$ is a covariance matrix in which the correlation coefficient between features $x_{i,j}$ and $x_{p,q}$ is $0.6^{(i-p)^2 + (j-q)^2}$. We generate the true coefficient tensor as $\mathcal{W} = \sum_{r=1}^{R} \sigma_r \mathbf{w}^{(1)}_r \circ \mathbf{w}^{(2)}_r$, where each $\mathbf{w}^{(n)}_r \in \mathbb{R}^{I}$, $n = 1, 2$, is a column vector with i.i.d. $N(0,1)$ entries normalized to unit $\ell_1$ norm, and the scalars are $\sigma_r = 1/r$. To impose sparsity on $\mathcal{W}$, we set $S\%$ of its entries (chosen uniformly at random) to zero. When studying one of these factors, the other factors are fixed to $M = 500$, $I = 16$, $R = 50$, $S = 80$.
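A minimal sketch of this 2-D synthetic design is given below; it reflects our reading of the correlation structure and normalizations described above, the defaults mirror the fixed factor levels, and the function name is illustrative:

```python
import numpy as np

def make_synthetic(M=500, I=16, R=50, S=80, rho=0.6, seed=0):
    """2-D synthetic design: y = <X, W> + eps with correlated Gaussian features."""
    rng = np.random.default_rng(seed)
    # correlation between features x_ij and x_pq: rho ** ((i - p)^2 + (j - q)^2)
    idx = np.array([(i, j) for i in range(I) for j in range(I)])
    d2 = ((idx[:, None, :] - idx[None, :, :]) ** 2).sum(-1)
    Sigma = rho ** d2
    X = rng.multivariate_normal(np.zeros(I * I), Sigma, size=M).reshape(M, I, I)
    # true coefficient tensor: R rank-1 terms with l1-normalized factors and sigma_r = 1/r
    W = np.zeros((I, I))
    for r in range(1, R + 1):
        w1, w2 = rng.standard_normal(I), rng.standard_normal(I)
        W += (1.0 / r) * np.outer(w1 / np.abs(w1).sum(), w2 / np.abs(w2).sum())
    # impose S% sparsity by zeroing entries chosen uniformly at random
    zero_idx = rng.choice(I * I, size=int(round(S / 100.0 * I * I)), replace=False)
    W.flat[zero_idx] = 0.0
    y = np.array([np.sum(Xm * W) for Xm in X]) + rng.standard_normal(M)
    return X, y, W
```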
Figure 1: Comparison of solution paths of SURF (solid lines) and ACS (dashed lines) with different step sizes on synthetic data: (a) $\epsilon = 0.01$, (b) $\epsilon = 0.1$, (c) $\epsilon = 0.5$. The path of estimates $\mathcal{W}$ for each $\lambda$ is plotted as a function of $t = \|\mathcal{W}\|_1$. Note that when the step size is small, the SURF path is almost indiscernible from the ACS path.

We first analyze the critical parameter $\epsilon$ of our method, which controls how closely SURF approximates the ACS paths. Figure 1 shows the solution paths of our method and of ACS under both large and small step sizes. As shown in the plots, a smaller step size leads to a closer approximation of the ACS solutions. In Figure 2, we also plot the averaged prediction error with standard deviation bars (left $y$-axis) and the CPU execution time in minutes (right $y$-axis) over different values of the step size. The choice of step size affects both computational speed and the root mean squared prediction error (RMSE): the smaller the step size, the more accurate the regression result, but the longer it takes to run. In both figures, the moderate step size $\epsilon = 0.1$ offers a good trade-off between accuracy and computation time. In the following we fix $\epsilon = 0.1$.

Figure 2: Prediction error and running time for different values of the step size $\epsilon$.

Next, we examine the performance of our method with varying sparsity levels of $\mathcal{W}$. For this purpose, we compare the prediction error (RMSE) and running time (log min) of all methods on the synthetic data. Figure 3(a)-(b) shows the results for $S = \{60, 80, 90, 95, 98\}$ on synthetic 2-D data, where $S\%$ indicates the sparsity level of the true $\mathcal{W}$. As can be seen from the plots, SURF generates slightly better predictions than the other methods when the true $\mathcal{W}$ is sparser. Moreover, as shown in Figure 3(c), larger step sizes yield much sparser coefficient estimates for SURF, which explains why some curves in Figure 1(c) have no values at larger $\|\mathcal{W}\|_1$.

Furthermore, we compare the prediction error and running time (log min) of all methods with an increasing number of features. Figure 4(a)-(b) shows the results for $I = \{8, 16, 32, 64\}$ on synthetic 2-D data. Overall, SURF gives better predictions at a lower computational cost. In particular, SURF and ACS have very similar prediction quality, which matches our theoretical result on the solutions of SURF versus ACS. SURF achieves better predictions than the other tensor methods, indicating the effectiveness of structured sparsity in the unit-rank tensor decomposition itself. In terms of running time, it is clear that as the number of features increases, SURF is significantly faster than the other methods. Finally, Figure 5 shows the results with an increasing number of samples.
Overall, SURF again gives better predictions at a lower computational cost. Notably, the results of Remurs and orTRR do not change much as the number of samples increases, which may be due to early stopping in their iterative search for the optimal solution.

Figure 3: Results with increasing sparsity level ($S\%$) of the true $\mathcal{W}$ on synthetic 2-D data: (a) test error, (b) running time; (c) sparsity of $\mathcal{W}$ versus step size for SURF.

Figure 4: Results with increasing number of features on synthetic 2-D data: (a) test error, (b) running time; (c) running time on real 3-D MRI data with $240 \times 175 \times 176$ features and fixed hyperparameters (without cross-validation).

Figure 5: Results with increasing number of samples on synthetic 2-D data: (a) test error, (b) running time.

**Real Data.** We also examine the performance of our method on a real medical imaging dataset, including both DTI and MRI images, obtained from the Parkinson's Progression Markers Initiative (PPMI) database (http://www.ppmi-info.org/data) with 656 human subjects. We parcellate the brain into 84 regions and extract four types of connectivity matrices. Our goal is to predict the Montreal Cognitive Assessment (MoCA) score for each subject. Details of the data processing are presented in Appendix B.

Table 2: Performance comparison over the DTI datasets. The metrics are RMSE, sparsity of coefficients (SC), and CPU execution time (in minutes). The results are averaged over 50 random trials and reported as mean ± standard deviation.

| Dataset | Metric | LASSO | ENet | Remurs | orTRR | GLTRM | ACS | SURF |
|---|---|---|---|---|---|---|---|---|
| 1 | RMSE | 2.94 ± 0.34 | 2.92 ± 0.32 | 2.91 ± 0.32 | 3.48 ± 0.21 | 3.09 ± 0.35 | 2.81 ± 0.24 | 2.81 ± 0.23 |
| 1 | SC | 0.99 ± 0.01 | 0.97 ± 0.01 | 0.66 ± 0.13 | 0.00 ± 0.00 | 0.90 ± 0.10 | 0.92 ± 0.02 | 0.95 ± 0.01 |
| 1 | Time | 6.4 ± 0.3 | 46.6 ± 4.6 | 161.3 ± 9.3 | 27.9 ± 5.6 | 874.8 ± 29.6 | 60.8 ± 24.4 | 1.7 ± 0.2 |
| 2 | RMSE | 3.18 ± 0.36 | 3.16 ± 0.42 | 2.97 ± 0.30 | 3.76 ± 0.44 | 3.26 ± 0.46 | 2.90 ± 0.31 | 2.91 ± 0.32 |
| 2 | SC | 0.99 ± 0.01 | 0.95 ± 0.03 | 0.37 ± 0.09 | 0.00 ± 0.00 | 0.91 ± 0.06 | 0.93 ± 0.02 | 0.94 ± 0.01 |
| 2 | Time | 5.7 ± 0.3 | 42.4 ± 2.9 | 155.0 ± 10.7 | 10.2 ± 0.1 | 857.4 ± 22.5 | 63.0 ± 21.6 | 5.2 ± 0.8 |
| 3 | RMSE | 3.06 ± 0.34 | 2.99 ± 0.34 | 2.93 ± 0.27 | 3.56 ± 0.41 | 3.14 ± 0.39 | 2.89 ± 0.38 | 2.87 ± 0.35 |
| 3 | SC | 0.98 ± 0.01 | 0.95 ± 0.01 | 0.43 ± 0.17 | 0.00 ± 0.00 | 0.87 ± 0.03 | 0.90 ± 0.03 | 0.93 ± 0.02 |
| 3 | Time | 5.8 ± 0.3 | 45.0 ± 1.0 | 163.6 ± 9.0 | 7.5 ± 0.9 | 815.4 ± 6.5 | 66.3 ± 44.9 | 1.5 ± 0.1 |
| 4 | RMSE | 3.20 ± 0.40 | 3.21 ± 0.59 | 2.84 ± 0.35 | 3.66 ± 0.35 | 3.12 ± 0.32 | 2.82 ± 0.33 | 2.83 ± 0.32 |
| 4 | SC | 0.99 ± 0.01 | 0.96 ± 0.03 | 0.44 ± 0.13 | 0.00 ± 0.00 | 0.86 ± 0.03 | 0.90 ± 0.02 | 0.91 ± 0.02 |
| 4 | Time | 5.5 ± 0.2 | 42.3 ± 1.4 | 159.6 ± 7.6 | 26.6 ± 3.1 | 835.8 ± 9.9 | 96.7 ± 43.2 | 3.8 ± 0.5 |
| 5 | RMSE | 3.02 ± 0.37 | 2.89 ± 0.41 | 2.81 ± 0.31 | 3.33 ± 0.27 | 3.26 ± 0.45 | 2.79 ± 0.31 | 2.78 ± 0.29 |
| 5 | SC | 0.99 ± 0.00 | 0.97 ± 0.01 | 0.34 ± 0.22 | 0.00 ± 0.00 | 0.91 ± 0.19 | 0.97 ± 0.01 | 0.99 ± 0.00 |
| 5 | Time | 8.8 ± 0.6 | 71.9 ± 2.3 | 443.5 ± 235.7 | 48.4 ± 8.0 | 1093.6 ± 49.7 | 463.2 ± 268.1 | 6.2 ± 0.5 |
We use three metrics to evaluate performance: the root mean squared prediction error (RMSE), which measures the deviation between the ground-truth responses and the predicted values in out-of-sample testing; the sparsity of coefficients (SC), defined as in the synthetic data analysis (i.e., the percentage of zero entries in the estimated coefficient); and the CPU execution time. Table 2 shows the results of all methods on both the individual and the combined datasets. Again, we observe that SURF gives better predictions at a lower computational cost, along with good sparsity. In particular, paired t-tests showed that, on all five real datasets, the RMSE of our approach is significantly lower, and its SC significantly higher, than those of the Remurs and GLTRM methods. This indicates that the performance gain of our approach over the other low-rank-plus-sparse methods is indeed significant. Figure 4(c) reports the running time (log min) of all methods on the PPMI MRI images of $240 \times 175 \times 176$ voxels each, which clearly demonstrates the efficiency of our approach.

## Acknowledgements

This work is supported by NSF grants IIS-1716432 (Wang), IIS-1750326 (Wang), IIS-1718798 (Chen), DMS-1613295 (Chen), IIS-1749940 (Zhou), and IIS-1615597 (Zhou), ONR grants N00014-18-1-2585 (Wang) and N00014-17-1-2265 (Zhou), and Michael J. Fox Foundation grant number 14858 (Wang). Lifang He's research is supported in part by 1R01AI130460. Data used in the preparation of this article were obtained from the Parkinson's Progression Markers Initiative (PPMI) database (http://www.ppmi-info.org/data). For up-to-date information on the study, visit http://www.ppmi-info.org. PPMI, a public-private partnership, is funded by the Michael J. Fox Foundation for Parkinson's Research and funding partners, including AbbVie, Avid, Biogen, Bristol-Myers Squibb, Covance, GE, Genentech, GlaxoSmithKline, Lilly, Lundbeck, Merck, Meso Scale Discovery, Pfizer, Piramal, Roche, Sanofi, Servier, TEVA, UCB, and Golub Capital.

## References

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. 2009.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, pages 267-288, 1996.

Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In International Conference on Machine Learning, pages 373-381, 2016.

Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540-552, 2013.

Ya Su, Xinbo Gao, Xuelong Li, and Dacheng Tao. Multivariate multilinear regression. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(6):1560-1573, 2012.

Weiwei Guo, Irene Kotsia, and Ioannis Patras. Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816-827, 2012.

Fei Wang, Ping Zhang, Buyue Qian, Xiang Wang, and Ian Davidson. Clinical risk prediction with multilinear sparse logistic regression. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 145-154. ACM, 2014.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301-320, 2005.

Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang. Logistic tensor regression for classification. In International Conference on Intelligent Science and Intelligent Data Engineering, pages 573-581. Springer, 2012.
Marco Signoretto, Quoc Tran Dinh, Lieven De Lathauwer, and Johan A. K. Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303-351, 2014.

Xiaonan Song and Haiping Lu. Multilinear regression for embedded feature selection with application to fMRI analysis. In AAAI, pages 2562-2568, 2017.

Johann A. Bengua, Ho N. Phien, Hoang Duong Tuan, and Minh N. Do. Efficient tensor completion for color image and video recovery: Low-rank tensor train. IEEE Transactions on Image Processing, 26(5):2466-2479, 2017.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.

Kun Chen, Kung-Sik Chan, and Nils Chr. Stenseth. Reduced rank stochastic regression with a sparse singular value decomposition. Journal of the Royal Statistical Society: Series B, 74(2):203-221, 2012.

Aditya Mishra, Dipak K. Dey, and Kun Chen. Sequential co-sparse factor regression. Journal of Computational and Graphical Statistics, pages 814-825, 2017.

Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Tensor deflation for CANDECOMP/PARAFAC, part I: Alternating subspace update algorithm. IEEE Transactions on Signal Processing, 63(22):5924-5938, 2015.

Arin Minasian, Shahram ShahbazPanahi, and Raviraj S. Adve. Energy harvesting cooperative communication systems. IEEE Transactions on Wireless Communications, 13(11):6118-6131, 2014.

Peng Zhao and Bin Yu. Stagewise lasso. Journal of Machine Learning Research, 8(Dec):2701-2726, 2007.

Gregory Vaughan, Robert Aseltine, Kun Chen, and Jun Yan. Stagewise generalized estimating equations with grouped variables. Biometrics, 73:1332-1342, 2017.

Alex Pereira da Silva, Pierre Comon, and Andre Lima Ferrer de Almeida. Rank-1 tensor approximation methods and application to deflation. arXiv preprint arXiv:1508.05273, 2015.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.

Kittipat Kampa, S. Mehta, Chun-An Chou, Wanpracha Art Chaovalitwongse, and Thomas J. Grabowski. Sparse optimization in feature selection: application in neuroimaging. Journal of Global Optimization, 59(2-3):439-457, 2014.