# Boosted Sparse and Low-Rank Tensor Regression

Lifang He (Weill Cornell Medicine, lifanghescut@gmail.com); Kun Chen (University of Connecticut, kun.chen@uconn.edu); Wanwan Xu (University of Connecticut, wanwan.xu@uconn.edu); Jiayu Zhou (Michigan State University, dearjiayu@gmail.com); Fei Wang (Weill Cornell Medicine, few2001@med.cornell.edu)

We propose a sparse and low-rank tensor regression model to relate a univariate outcome to a feature tensor, in which each unit-rank tensor from the CP decomposition of the coefficient tensor is assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor regression problems. To make the computation efficient and scalable, we propose a stagewise estimation procedure for the unit-rank tensor regression that efficiently traces out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The superior performance of our approach is demonstrated on various real-world and synthetic examples.

## 1 Introduction

Regression analysis is commonly used for modeling the relationship between a predictor vector $\mathbf{x} \in \mathbb{R}^I$ and a scalar response $y$. Typically, a good regression model achieves two goals: accurate prediction of future responses and parsimonious interpretation of the dependence structure between $y$ and $\mathbf{x}$ [Hastie et al., 2009]. As a general setup, it fits $M$ training samples $\{(\mathbf{x}_m, y_m)\}_{m=1}^{M}$ by minimizing a regularized loss, i.e., a loss $L(\cdot)$ plus a regularization term $\Omega(\cdot)$:

$$\min_{\mathbf{w}} \; \sum_{m=1}^{M} L(\langle \mathbf{x}_m, \mathbf{w} \rangle, y_m) + \lambda \Omega(\mathbf{w}), \tag{1}$$

where $\mathbf{w} \in \mathbb{R}^I$ is the regression coefficient vector, $\langle \cdot, \cdot \rangle$ is the standard Euclidean inner product, and $\lambda > 0$ is the regularization parameter. For example, the sum of squared errors with $\ell_1$-norm regularization leads to the celebrated LASSO approach [Tibshirani, 1996], which performs sparse estimation of $\mathbf{w}$ and thus has implicit feature selection embedded therein.

In many modern real-world applications, the predictors/features are represented more naturally as higher-order tensors, such as videos and Magnetic Resonance Imaging (MRI) scans. In this case, if we want to predict a response variable for each tensor, a naive approach is to perform linear regression on the vectorized data (e.g., by stretching the tensor element by element). However, this completely ignores the multidimensional structure of the tensor data, such as the spatial coherence of the voxels. This motivates the tensor regression framework [Yu and Liu, 2016, Zhou et al., 2013], which treats each observation as a tensor $\mathcal{X}$ and learns a tensor coefficient $\mathcal{W}$ via regularized model fitting:

$$\min_{\mathcal{W}} \; \sum_{m=1}^{M} L(\langle \mathcal{X}_m, \mathcal{W} \rangle, y_m) + \lambda \Omega(\mathcal{W}). \tag{2}$$

When $\ell_1$-norm regularization is used, this formulation is essentially equivalent to (1) via vectorization.
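As a concrete illustration (our own minimal numpy sketch, not the authors' code), the following checks that with a squared-error loss and an elementwise $\ell_1$ penalty, the tensor objective in (2) coincides with the vectorized LASSO-type objective in (1); the names and random data are hypothetical:

```python
import numpy as np

def tensor_regression_objective(X_list, y, W, lam):
    """Squared-error instance of (2) with an elementwise l1 penalty on the coefficient tensor W."""
    preds = np.array([np.sum(X * W) for X in X_list])      # <X_m, W> for each sample
    return np.sum((y - preds) ** 2) + lam * np.sum(np.abs(W))

# Equivalence with the vectorized problem (1): flatten every X_m and W the same way.
M, shape = 20, (4, 3, 2)
rng = np.random.default_rng(0)
X_list = [rng.standard_normal(shape) for _ in range(M)]
y, W, lam = rng.standard_normal(M), rng.standard_normal(shape), 0.5
vec_obj = (np.sum((y - np.stack([X.ravel() for X in X_list]) @ W.ravel()) ** 2)
           + lam * np.sum(np.abs(W.ravel())))
assert np.isclose(tensor_regression_objective(X_list, y, W, lam), vec_obj)
```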
To effectively exploit the structural information of $\mathcal{X}_m$, we can impose a low-rank constraint on $\mathcal{W}$ in Problem (2). Some authors achieved this by fixing the CANDECOMP/PARAFAC (CP) rank of $\mathcal{W}$ a priori. For example, Su et al. [2012] assumed $\mathcal{W}$ to be rank-1. Since the rank-1 constraint is too restrictive, Guo et al. [2012] and Zhou et al. [2013] imposed a rank-$R$ constraint in a tensor decomposition model, but none of the above methods considered adding a sparsity constraint as well to enhance model interpretability. Wang et al. [2014] imposed a restrictive rank-1 constraint on $\mathcal{W}$ and also applied an elastic net regularization [Zou and Hastie, 2005]. Tan et al. [2012] imposed a rank-$R$ constraint and applied $\ell_1$-norm regularization to the factor matrices to also promote sparsity in $\mathcal{W}$. Signoretto et al. [2014] applied the trace norm (nuclear norm) for low-rank estimation of $\mathcal{W}$, and Song and Lu [2017] imposed a combination of the trace norm and the $\ell_1$-norm on $\mathcal{W}$. Bengua et al. [2017] showed that the trace norm may not be appropriate for capturing the global correlation of a tensor, as it captures only the correlation between a single mode (rather than a few modes) and the rest of the tensor. In all of the above sparse and low-rank tensor models, the sparsity is imposed on $\mathcal{W}$ itself, which does not necessarily lead to sparsity in the decomposed factors.

In this paper, we propose a sparse and low-rank tensor regression model in which the unit-rank tensors from the CP decomposition of the coefficient tensor are assumed to be sparse. This structure is both parsimonious and highly interpretable, as it implies that the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions. We take a divide-and-conquer strategy to simplify the task into a set of sparse unit-rank tensor factorization/regression (SURF) problems of the form

$$\min_{\mathcal{W}} \sum_{m=1}^{M} \big( y_m - \langle \mathcal{X}_m, \mathcal{W} \rangle \big)^2 + \lambda \|\mathcal{W}\|_1, \quad \text{s.t. } \operatorname{rank}(\mathcal{W}) \le 1.$$

To make the solution process efficient for the SURF problem, we propose a boosted/stagewise estimation procedure that traces out its entire solution path. We show that as the step size goes to zero, the stagewise solution paths converge exactly to those of the corresponding regularized regression. The effectiveness and efficiency of our proposed approach are demonstrated on various real-world datasets as well as under various simulation setups.

## 2 Preliminaries on Tensors

We start with a brief review of the necessary preliminaries on tensors; more details can be found in [Kolda and Bader, 2009]. We denote scalars by lowercase letters, e.g., $a$; vectors by boldface lowercase letters, e.g., $\mathbf{a}$; matrices by boldface uppercase letters, e.g., $\mathbf{A}$; and tensors by calligraphic letters, e.g., $\mathcal{A}$. We denote their entries by $a_i$, $a_{i,j}$, $a_{i,j,k}$, etc., depending on the number of dimensions. Indices are denoted by lowercase letters spanning the range from 1 to the uppercase letter of the index, e.g., $n = 1, \dots, N$. Each entry of an $N$th-order tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is indexed by $N$ indices $\{i_n\}_{n=1}^{N}$, and each $i_n$ indexes the $n$-mode of $\mathcal{A}$. The notation $-n$ denotes every mode except $n$.

**Definition 1 (Inner Product).** The inner product of two tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is the sum of the products of their entries, $\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} a_{i_1,\dots,i_N} b_{i_1,\dots,i_N}$. It follows immediately that the Frobenius norm of $\mathcal{A}$ is $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$. The $\ell_1$ norm of a tensor is $\|\mathcal{A}\|_1 = \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} |a_{i_1,\dots,i_N}|$.

**Definition 2 (Tensor Product).** Let $\mathbf{a}^{(n)} \in \mathbb{R}^{I_n}$ be a length-$I_n$ vector for each $n = 1, 2, \dots, N$. The tensor product of $\{\mathbf{a}^{(n)}\}$, denoted by $\mathcal{A} = \mathbf{a}^{(1)} \circ \cdots \circ \mathbf{a}^{(N)}$, is an $(I_1 \times \cdots \times I_N)$-tensor whose entries are $\mathcal{A}_{i_1,\dots,i_N} = a^{(1)}_{i_1} \cdots a^{(N)}_{i_N}$. We call $\mathcal{A}$ a rank-1 tensor or a unit-rank tensor.
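The numpy snippet below illustrates Definitions 1 and 2; the array shapes are arbitrary choices for the example:

```python
import numpy as np
from functools import reduce

A = np.random.randn(4, 3, 2)
B = np.random.randn(4, 3, 2)

inner = np.sum(A * B)              # <A, B>: sum of entrywise products (Definition 1)
fro   = np.sqrt(np.sum(A * A))     # Frobenius norm ||A||_F = sqrt(<A, A>)
l1    = np.sum(np.abs(A))          # elementwise l1 norm ||A||_1

# Rank-1 (unit-rank) tensor a^(1) o a^(2) o a^(3) (Definition 2)
vecs = [np.random.randn(4), np.random.randn(3), np.random.randn(2)]
rank1 = reduce(np.multiply.outer, vecs)   # entry (i1,i2,i3) equals a^(1)_i1 * a^(2)_i2 * a^(3)_i3
print(inner, fro, l1, rank1.shape)
```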
**Definition 3 (CP Decomposition).** Every tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ can be decomposed as a weighted sum of rank-1 tensors with a suitably large $R$:

$$\mathcal{A} = \sum_{r=1}^{R} \sigma_r \, \mathbf{a}^{(1)}_r \circ \cdots \circ \mathbf{a}^{(N)}_r, \tag{3}$$

where $\sigma_r \in \mathbb{R}$, $\mathbf{a}^{(n)}_r \in \mathbb{R}^{I_n}$, and $\|\mathbf{a}^{(n)}_r\|_2 = 1$.

**Definition 4 (Tensor Rank).** The tensor rank of $\mathcal{A}$, denoted by $\operatorname{rank}(\mathcal{A})$, is the smallest number $R$ such that the equality (3) holds.

**Definition 5 ($n$-mode Product).** The $n$-mode product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with a vector $\mathbf{u} \in \mathbb{R}^{I_n}$, denoted by $\mathcal{A} \times_n \mathbf{u}$, is an $(I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N)$-tensor with entries $(\mathcal{A} \times_n \mathbf{u})_{i_1,\dots,i_{n-1},i_{n+1},\dots,i_N} = \sum_{i_n=1}^{I_n} a_{i_1,\dots,i_N} u_{i_n}$.
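To make Definitions 3 and 5 concrete, here is a small numpy sketch (indices are 0-based in code, unlike the 1-based notation above) that rebuilds a tensor from CP factors and applies an $n$-mode product:

```python
import numpy as np
from functools import reduce

def cp_reconstruct(sigmas, factors):
    """Sum of R rank-1 terms, sum_r sigma_r a_r^(1) o ... o a_r^(N) (Definition 3).
    factors[n] is an (I_n x R) matrix whose r-th column is a_r^(n)."""
    return sum(sigmas[r] * reduce(np.multiply.outer, [F[:, r] for F in factors])
               for r in range(len(sigmas)))

def mode_product(A, u, n):
    """n-mode product A x_n u (Definition 5): contract mode n of A with the vector u."""
    return np.tensordot(A, u, axes=(n, 0))

# A 4 x 3 x 2 tensor with CP rank at most 2
factors = [np.random.randn(4, 2), np.random.randn(3, 2), np.random.randn(2, 2)]
A = cp_reconstruct([2.0, 0.5], factors)
print(mode_product(A, np.ones(3), 1).shape)   # contracting the second mode leaves a (4, 2) array
```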
## 3 Sparse and Low-Rank Tensor Regression

### 3.1 Model Formulation

For an $N$th-order predictor tensor $\mathcal{X}_m \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a scalar response $y_m$, $m = 1, \dots, M$, we consider the regression model

$$y_m = \langle \mathcal{X}_m, \mathcal{W} \rangle + \varepsilon_m, \tag{4}$$

where $\mathcal{W} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is an $N$th-order coefficient tensor and $\varepsilon_m$ is a random error term with zero mean. Without loss of generality, the intercept is set to zero by centering the response and standardizing the predictors, i.e., $\sum_{m=1}^{M} y_m = 0$, $\sum_{m=1}^{M} (\mathcal{X}_m)_{i_1,\dots,i_N} = 0$, and $\sum_{m=1}^{M} (\mathcal{X}_m)_{i_1,\dots,i_N}^2 / M = 1$ for $i_n = 1, \dots, I_n$. Our goal is to estimate $\mathcal{W}$ from the $M$ i.i.d. observations $\{(\mathcal{X}_m, y_m)\}_{m=1}^{M}$.

To reduce the complexity of the model and leverage the structural information in $\mathcal{X}_m$, we assume the coefficient tensor $\mathcal{W}$ to be both low-rank and sparse. Specifically, we assume $\mathcal{W}$ admits a CP decomposition $\mathcal{W} = \sum_{r=1}^{R} \mathcal{W}_r = \sum_{r=1}^{R} \sigma_r \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(N)}_r$, where each rank-1 tensor $\mathcal{W}_r$ is possibly sparse, or equivalently, the vectors $\mathbf{w}^{(n)}_r$, $n = 1, \dots, N$, in its representation are possibly sparse, for all $r = 1, \dots, R$. Here we impose sparsity on the rank-1 components of the CP decomposition rather than on $\mathcal{W}$ itself [Chen et al., 2012, Tan et al., 2012]. This adaptation is beneficial in multiple ways: 1) it integrates a finer sparsity structure into the CP decomposition, which enables direct control of component-wise sparsity; 2) it leads to an appealing model interpretation and feature grouping: the outcome is related to the features through a few distinct pathways, each of which may only involve subsets of feature dimensions; 3) it yields a more flexible and parsimonious model, since recovering the within-decomposition sparsity of a tensor requires fewer parameters than imposing sparsity on the tensor itself as in existing methods, which in turn improves generalizability.

A straightforward way of conducting model estimation is to solve the following optimization problem:

$$\min_{\sigma_r, \mathbf{w}^{(n)}_r} \sum_{m=1}^{M} \Big( y_m - \Big\langle \mathcal{X}_m, \sum_{r=1}^{R} \sigma_r \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(N)}_r \Big\rangle \Big)^2 + \sum_{r=1}^{R} \sum_{n=1}^{N} \lambda_{r,n} \|\mathbf{w}^{(n)}_r\|_1, \quad \text{s.t. } \|\mathbf{w}^{(n)}_r\|_1 = 1,\ n = 1,\dots,N,\ r = 1,\dots,R, \tag{5}$$

where $\lambda_{r,n}$, $r = 1,\dots,R$, $n = 1,\dots,N$, are regularization parameters. This problem is very difficult to solve because: 1) the CP rank $R$ needs to be pre-specified; 2) as the CP decomposition may not be unique, the pursuit of its within-decomposition sparsity is highly non-convex and the problem may suffer from parameter identifiability issues [Mishra et al., 2017]; 3) the estimation may involve many regularization parameters, for which tuning becomes very costly.

### 3.2 Divide-and-Conquer: Sequential Pursuit for Sparse Tensor Decomposition

We propose a divide-and-conquer strategy to recover the sparse CP decomposition. Our approach is based on the sequential extraction method (a.k.a. deflation) [Phan et al., 2015, Mishra et al., 2017], which seeks one unit-rank tensor at a time and then deflates the residuals to find further unit-rank tensors. Deflation has proved to be a rapid and effective way of partitioning the tensor decomposition into simpler subproblems. Specifically, we sequentially solve the following sparse unit-rank problem:

$$\widehat{\mathcal{W}}_r = \arg\min_{\mathcal{W}_r} \frac{1}{M} \sum_{m=1}^{M} \big( y^r_m - \langle \mathcal{X}_m, \mathcal{W}_r \rangle \big)^2 + \lambda_r \|\mathcal{W}_r\|_1 + \alpha \|\mathcal{W}_r\|_F^2, \quad \text{s.t. } \operatorname{rank}(\mathcal{W}_r) \le 1, \tag{6}$$

where $r$ is the sequential index of the unit-rank terms and $y^r_m$ is the current response residual,

$$y^r_m = \begin{cases} y_m, & \text{if } r = 1, \\ y^{r-1}_m - \langle \mathcal{X}_m, \widehat{\mathcal{W}}_{r-1} \rangle, & \text{otherwise,} \end{cases}$$

with $\widehat{\mathcal{W}}_{r-1}$ the estimated unit-rank tensor from the $(r-1)$-th step, whose tuning is done by, e.g., cross-validation. The final estimator is then obtained as $\widehat{\mathcal{W}}^{(R)} = \sum_{r=1}^{R} \widehat{\mathcal{W}}_r$. To improve the convexity of the problem and its numerical stability, we use the elastic net penalty [Zou and Hastie, 2005] instead of the LASSO; this is critical to ensure the convergence of the optimization. The accuracy of the solution can be controlled simply by adjusting the values of $\lambda_r$ and $\alpha$. Since we mainly focus on sparse estimation, we fix $\alpha > 0$ at a small constant in the numerical studies.

As each $\mathcal{W}_r$ is of unit rank, it can be decomposed as $\mathcal{W}_r = \sigma_r \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(N)}_r$ with $\sigma_r \ge 0$ and $\|\mathbf{w}^{(n)}_r\|_1 = 1$, $n = 1,\dots,N$. It is then clear that $\|\mathcal{W}_r\|_1 = \|\sigma_r \mathbf{w}^{(1)}_r \circ \cdots \circ \mathbf{w}^{(N)}_r\|_1 = \sigma_r \prod_{n=1}^{N} \|\mathbf{w}^{(n)}_r\|_1$. That is, the sparsity of a unit-rank tensor directly translates into the sparsity of its components. This allows us to kill multiple birds with one stone: by simply pursuing elementwise sparsity of the unit-rank coefficient tensor with only one tuning parameter $\lambda$, solving (6) produces a set of sparse factor coefficients $\widehat{\mathbf{w}}^{(n)}_r$, $n = 1,\dots,N$, simultaneously. With this sequential pursuit strategy, the general problem boils down to a set of sparse unit-rank estimation problems, for which we develop a novel stagewise/boosting algorithm.

## 4 Fast Stagewise Unit-Rank Tensor Factorization (SURF)

For simplicity, we drop the index $r$ and write the generic form of the problem in (6) as

$$\widehat{\mathcal{W}} = \arg\min_{\mathcal{W}} \frac{1}{M} \sum_{m=1}^{M} \big( y_m - \langle \mathcal{X}_m, \mathcal{W} \rangle \big)^2 + \lambda \|\mathcal{W}\|_1 + \alpha \|\mathcal{W}\|_F^2, \quad \text{s.t. } \operatorname{rank}(\mathcal{W}) \le 1. \tag{7}$$

Let $\mathcal{W} = \sigma \mathbf{w}^{(1)} \circ \cdots \circ \mathbf{w}^{(N)}$, where $\sigma \ge 0$, $\|\mathbf{w}^{(n)}\|_1 = 1$, and the factors $\mathbf{w}^{(n)}$, $n = 1,\dots,N$, are identifiable up to sign flipping. Let $\mathbf{y} = [y_1, \dots, y_M]^\top \in \mathbb{R}^M$ and $\mathcal{X} = [\mathcal{X}_1, \dots, \mathcal{X}_M] \in \mathbb{R}^{I_1 \times \cdots \times I_N \times M}$. Then (7) can be reformulated as

$$\min_{\sigma, \mathbf{w}^{(n)}} \frac{1}{M} \big\| \mathbf{y} - \sigma \, \mathcal{X} \times_1 \mathbf{w}^{(1)} \times_2 \cdots \times_N \mathbf{w}^{(N)} \big\|_2^2 + \lambda \sigma \prod_{n=1}^{N} \|\mathbf{w}^{(n)}\|_1 + \alpha \sigma^2 \prod_{n=1}^{N} \|\mathbf{w}^{(n)}\|_2^2, \quad \text{s.t. } \sigma \ge 0,\ \|\mathbf{w}^{(n)}\|_1 = 1,\ n = 1,\dots,N. \tag{8}$$

Before diving into the stagewise/boosting algorithm, we first consider an alternating convex search (ACS) approach [Chen et al., 2012, Minasian et al., 2014], which is a natural way to solve (7) for any fixed tuning parameter. Specifically, we alternately optimize with respect to one block of variables $(\sigma, \mathbf{w}^{(n)})$ while keeping the others fixed. For each block $(\sigma, \mathbf{w}^{(n)})$, the relevant constraints are $\sigma \ge 0$ and $\|\mathbf{w}^{(n)}\|_1 = 1$, but the objective in (8) depends on $(\sigma, \mathbf{w}^{(n)})$ only through the product $\sigma \mathbf{w}^{(n)}$, so both constraints can be avoided by optimizing over $\sigma \mathbf{w}^{(n)}$ directly. Let $\widehat{\mathbf{w}}^{(n)} = \sigma \mathbf{w}^{(n)}$ and $\mathbf{Z}^{(-n)} = \mathcal{X} \times_1 \mathbf{w}^{(1)} \cdots \times_{n-1} \mathbf{w}^{(n-1)} \times_{n+1} \mathbf{w}^{(n+1)} \cdots \times_N \mathbf{w}^{(N)}$; the subproblem then boils down to

$$\min_{\widehat{\mathbf{w}}^{(n)}} \frac{1}{M} \big\| \mathbf{y} - \mathbf{Z}^{(-n)\top} \widehat{\mathbf{w}}^{(n)} \big\|_2^2 + \alpha \beta^{(-n)} \|\widehat{\mathbf{w}}^{(n)}\|_2^2 + \lambda \|\widehat{\mathbf{w}}^{(n)}\|_1, \tag{9}$$

where $\beta^{(-n)} = \prod_{l \neq n} \|\mathbf{w}^{(l)}\|_2^2$. Once we obtain the solution $\widehat{\mathbf{w}}^{(n)}$, we set $\sigma = \|\widehat{\mathbf{w}}^{(n)}\|_1$ and $\mathbf{w}^{(n)} = \widehat{\mathbf{w}}^{(n)}/\sigma$ to satisfy the constraints whenever $\widehat{\mathbf{w}}^{(n)} \neq \mathbf{0}$. If $\widehat{\mathbf{w}}^{(n)} = \mathbf{0}$, then $\mathbf{w}^{(n)}$ is no longer identifiable; we set $\sigma = 0$ and terminate the algorithm. Note that when $\alpha > 0$, each subproblem is strongly convex and the generated solution sequence is uniformly bounded, so ACS can be shown to converge to a coordinate-wise minimum point of (8) with a properly chosen initial value [Mishra et al., 2017].
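For intuition about how these block updates look in practice, here is a schematic numpy sketch of ACS for the special case of second-order (matrix) predictors, with each subproblem (9) solved by plain coordinate descent with soft-thresholding. The function names, the simple non-sparse initialization, and the fixed iteration counts are our own illustrative choices, not the authors' MATLAB implementation:

```python
import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def penalized_ls(Z, y, lam, ridge, n_iter=200):
    """Coordinate descent for (1/M)||y - Z.T @ w||^2 + ridge*||w||^2 + lam*||w||_1,
    where Z has shape (I, M): row j holds feature j across the M samples."""
    I, M = Z.shape
    w = np.zeros(I)
    row_sq = np.sum(Z ** 2, axis=1)                  # ||z_j||^2 for each feature row
    for _ in range(n_iter):
        for j in range(I):
            r_j = y - Z.T @ w + Z[j] * w[j]          # partial residual excluding feature j
            a = (2.0 / M) * Z[j] @ r_j
            w[j] = soft_threshold(a, lam) / ((2.0 / M) * row_sq[j] + 2.0 * ridge)
    return w

def acs_unit_rank(X, y, lam, alpha, n_outer=50):
    """Alternating convex search for (7)-(9) when X has shape (I1, I2, M).
    Returns (sigma, w1, w2); the factors are l1-normalized whenever sigma > 0."""
    I1, I2, M = X.shape
    w1, w2 = np.ones(I1) / I1, np.ones(I2) / I2      # simple (non-sparse) starting point
    sigma = 1.0
    for _ in range(n_outer):
        for n in (0, 1):
            if n == 0:   # Z^(-1): contract mode 2 with w2 -> shape (I1, M); beta^(-1) = ||w2||_2^2
                Z, beta = np.einsum('ijm,j->im', X, w2), np.sum(w2 ** 2)
            else:        # Z^(-2): contract mode 1 with w1 -> shape (I2, M); beta^(-2) = ||w1||_2^2
                Z, beta = np.einsum('ijm,i->jm', X, w1), np.sum(w1 ** 2)
            w_hat = penalized_ls(Z, y, lam, alpha * beta)    # block variable sigma * w^(n)
            sigma = np.sum(np.abs(w_hat))
            if sigma == 0:                                   # factor no longer identifiable
                return 0.0, np.zeros(I1), np.zeros(I2)
            if n == 0:
                w1 = w_hat / sigma
            else:
                w2 = w_hat / sigma
    return sigma, w1, w2
```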
The optimization procedure would then need to be repeated over a grid of tuning-parameter values to obtain the solution paths and to locate the optimal sparse solution along the paths. Inspired by the biconvex structure of (7) and the stagewise algorithm for the LASSO [Zhao and Yu, 2007, Vaughan et al., 2017], we instead develop a fast stagewise unit-rank tensor factorization (SURF) algorithm to trace out the entire solution paths of (7) in a single run. The main idea of a stagewise procedure is to build a model from scratch, gradually increasing the model complexity in a sequence of simple learning steps. For instance, in stagewise estimation for linear regression, a forward step searches for the best predictor to explain the current residual and updates its coefficient by a small increment, while a backward step may decrease the coefficient of a selected predictor to correct for greedy selection whenever necessary. Owing to the biconvex/multi-convex structure of our objective function, efficient stagewise estimation remains possible; the only catch is that, when we determine which coefficient to update at each iteration, we always get $N$ competing proposals from the $N$ different tensor modes, rather than the single proposal in the LASSO case.

To simplify notation, the objective in (9) is rearranged into a standard LASSO form,

$$\min_{\widehat{\mathbf{w}}^{(n)}} \frac{1}{M} \big\| \widehat{\mathbf{y}} - \widehat{\mathbf{Z}}^{(-n)} \widehat{\mathbf{w}}^{(n)} \big\|_2^2 + \lambda \|\widehat{\mathbf{w}}^{(n)}\|_1, \tag{10}$$

using the augmented data $\widehat{\mathbf{y}} = (\mathbf{y}^\top, \mathbf{0}^\top)^\top$ and $\widehat{\mathbf{Z}}^{(-n)} = (\mathbf{Z}^{(-n)}, \sqrt{\alpha\beta^{(-n)}M}\,\mathbf{I})^\top$, where $\mathbf{I}$ is the identity matrix of size $I_n$. We write the objective in (10) as $\Gamma(\widehat{\mathbf{w}}^{(n)}; \lambda) = J(\widehat{\mathbf{w}}^{(n)}) + \lambda \Omega(\widehat{\mathbf{w}}^{(n)})$, and use $(\sigma, \{\mathbf{w}^{(n)}\})$ to denote all the variables where necessary.

**Algorithm 1: Fast Stagewise Unit-Rank Tensor Factorization (SURF)**

**Input:** training data $\mathcal{D}$, a small step size $\epsilon > 0$, and a small tolerance parameter $\xi$ (as noted in [Zhao and Yu, 2007], $\xi$ is implementation-specific and not necessarily a user parameter; in all experiments we set $\xi = \epsilon^2/2$).
**Output:** solution paths of $(\sigma, \{\mathbf{w}^{(n)}\})$.

1. Initialization: take a forward step with
$$(\{\hat{i}_1, \dots, \hat{i}_N\}, \hat{s}) = \arg\min_{\{i_1,\dots,i_N\},\, s = \pm\epsilon} J(s\mathbf{1}_{i_1}, \mathbf{1}_{i_2}, \dots, \mathbf{1}_{i_N}),$$
and set $\sigma_0 = \epsilon$, $\mathbf{w}^{(1)}_0 = \operatorname{sign}(\hat{s})\mathbf{1}_{\hat{i}_1}$, $\mathbf{w}^{(n)}_0 = \mathbf{1}_{\hat{i}_n}$ ($n \neq 1$), and $\lambda_0 = \big(J(\{\mathbf{0}\}) - J(\sigma_0, \{\mathbf{w}^{(n)}_0\})\big)/\epsilon$. (11)
Set the active index sets $\mathcal{I}^{(n)}_0 = \{\hat{i}_n\}$ for $n = 1,\dots,N$; $t = 0$.
2. Repeat:
3. Backward step: compute
$$(\hat{n}, \hat{i}_{\hat{n}}) = \arg\min_{n,\; i_n \in \mathcal{I}^{(n)}_t} J(\widehat{\mathbf{w}}^{(n)}_t + s_{i_n}\mathbf{1}_{i_n}), \quad \text{where } s_{i_n} = -\operatorname{sign}\big(\widehat{w}^{(n)}_{t,i_n}\big)\epsilon. \tag{12}$$
If $\Gamma(\widehat{\mathbf{w}}^{(\hat n)}_t + s_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}}; \lambda_t) \le \Gamma(\widehat{\mathbf{w}}^{(\hat n)}_t; \lambda_t) - \xi$, then set $\sigma_{t+1} = \|\widehat{\mathbf{w}}^{(\hat n)}_t + s_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}}\|_1$, $\mathbf{w}^{(\hat n)}_{t+1} = (\widehat{\mathbf{w}}^{(\hat n)}_t + s_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}})/\sigma_{t+1}$, $\mathbf{w}^{(-\hat n)}_{t+1} = \mathbf{w}^{(-\hat n)}_t$, $\lambda_{t+1} = \lambda_t$, and update the active sets as $\mathcal{I}^{(\hat n)}_{t+1} = \mathcal{I}^{(\hat n)}_t \setminus \{\hat{i}_{\hat n}\}$ if $w^{(\hat n)}_{(t+1)\hat{i}_{\hat n}} = 0$, and $\mathcal{I}^{(n)}_{t+1} = \mathcal{I}^{(n)}_t$ otherwise.
4. Else, forward step: compute
$$(\hat{n}, \hat{i}_{\hat{n}}, \hat{s}_{\hat{i}_{\hat n}}) = \arg\min_{n,\; i_n,\; s = \pm\epsilon} J(\widehat{\mathbf{w}}^{(n)}_t + s\mathbf{1}_{i_n}), \tag{13}$$
and set $\sigma_{t+1} = \|\widehat{\mathbf{w}}^{(\hat n)}_t + \hat{s}_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}}\|_1$, $\mathbf{w}^{(\hat n)}_{t+1} = (\widehat{\mathbf{w}}^{(\hat n)}_t + \hat{s}_{\hat{i}_{\hat n}}\mathbf{1}_{\hat{i}_{\hat n}})/\sigma_{t+1}$, $\mathbf{w}^{(-\hat n)}_{t+1} = \mathbf{w}^{(-\hat n)}_t$,
$$\lambda_{t+1} = \min\left[\lambda_t,\ \frac{J(\sigma_t, \{\mathbf{w}^{(n)}_t\}) - J(\sigma_{t+1}, \{\mathbf{w}^{(n)}_{t+1}\}) - \xi}{\Omega(\sigma_{t+1}, \{\mathbf{w}^{(n)}_{t+1}\}) - \Omega(\sigma_t, \{\mathbf{w}^{(n)}_t\})}\right],$$
and $\mathcal{I}^{(n)}_{t+1} = \mathcal{I}^{(n)}_t \cup \{\hat{i}_n\}$ if $n = \hat{n}$, and $\mathcal{I}^{(n)}_{t+1} = \mathcal{I}^{(n)}_t$ otherwise.
5. Set $t = t + 1$.
6. Until $\lambda_t \le 0$.

The structure of our stagewise algorithm is presented in Algorithm 1. It can be viewed as a boosting procedure that builds up the solution gradually through forward steps (step 4) and backward steps (step 3): boosting combines a set of weak learners into one strong learner and is closely connected to stagewise and regularized estimation, and SURF is a boosting method in the sense that each backward/forward step finds a weak learner that incrementally improves the current learner, thereby generating a path of solutions. The initialization step can be solved explicitly (see Lemma 1 below). At each subsequent iteration, the parameter update takes the form $\widehat{\mathbf{w}}^{(n)} \leftarrow \widehat{\mathbf{w}}^{(n)} + s\mathbf{1}_{i_n}$ in either the forward or the backward direction, where $\mathbf{1}_{i_n}$ is a length-$I_n$ vector of all zeros except for a 1 in the $i_n$-th coordinate, $s = \pm\epsilon$, and $\epsilon$ is the pre-specified step size controlling the fineness of the grid. The algorithm also keeps track of the tuning parameter $\lambda$.
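To convey the skeleton of Algorithm 1 without the tensor bookkeeping, the sketch below runs the analogous forward/backward stagewise loop for a single coefficient vector, essentially the stagewise LASSO of [Zhao and Yu, 2007]. It is a simplified illustration only: SURF additionally compares proposals across all $N$ tensor modes, renormalizes $(\sigma, \mathbf{w}^{(n)})$ after every step, and works with the augmented data above; the step cap and variable names are our own choices:

```python
import numpy as np

def stagewise_lasso_path(Z, y, eps=0.1, max_steps=5000):
    """Forward/backward stagewise loop for (1/M)||y - Z @ w||^2 + lam*||w||_1 (Z is M x I)."""
    xi = eps ** 2 / 2.0                                   # tolerance, as used in the experiments
    M, I = Z.shape
    J = lambda w: np.sum((y - Z @ w) ** 2) / M            # unpenalized loss
    l1 = lambda w: np.abs(w).sum()

    def best_forward(w):                                  # best single eps-increment, cf. (13)
        cand = [(J(w + s * np.eye(I)[j]), j, s) for j in range(I) for s in (eps, -eps)]
        return min(cand)

    w = np.zeros(I)
    J_new, j, s = best_forward(w)                         # initialization, cf. (11)
    lam = (J(w) - J_new) / eps
    w[j] += s
    path = [(lam, w.copy())]

    for _ in range(max_steps):
        if lam <= 0:
            break
        gamma = J(w) + lam * l1(w)
        best = None                                       # backward step, cf. (12)
        for j in np.flatnonzero(w):
            wb = w.copy()
            wb[j] -= np.sign(w[j]) * eps
            drop = gamma - (J(wb) + lam * l1(wb))
            if drop > xi and (best is None or drop > best[0]):
                best = (drop, wb)
        if best is not None:
            w = best[1]                                   # accept the backward move, keep lam
        else:
            J_new, j, s = best_forward(w)                 # forward step; relax lam just enough
            wf = w.copy()
            wf[j] += s
            denom = l1(wf) - l1(w)
            if denom > 0:                                 # mirrors the lambda update in Algorithm 1
                lam = min(lam, (J(w) - J_new - xi) / denom)
            w = wf
        path.append((lam, w.copy()))
    return path
```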
Intuitively, the selection of the index $(n, i_n)$ and the increment $s$ is guided by minimizing the penalized loss at the current $\lambda$, subject to a constraint on the step size. Compared with the standard stagewise LASSO, the main difference is that we need to select the best triple $(n, i_n, s)$ over all dimensions across all tensor modes. Problems (12) and (13) can be solved efficiently. Expanding $J(\widehat{\mathbf{w}}^{(n)} + s\mathbf{1}_{i_n})$, we have

$$J(\widehat{\mathbf{w}}^{(n)} + s\mathbf{1}_{i_n}) = \frac{1}{M} \Big( \|\widehat{\mathbf{e}}^{(n)}\|_2^2 - 2 s \, \widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)} \mathbf{1}_{i_n} + \epsilon^2 \|\widehat{\mathbf{Z}}^{(-n)} \mathbf{1}_{i_n}\|_2^2 \Big),$$

where $\widehat{\mathbf{e}}^{(n)} = \widehat{\mathbf{y}} - \widehat{\mathbf{Z}}^{(-n)} \widehat{\mathbf{w}}^{(n)}$ is constant within each iteration. The solution at each forward step is therefore

$$(\hat{n}, \hat{i}_{\hat{n}}) = \arg\max_{n, i_n} \; 2 \big| \widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)} \mathbf{1}_{i_n} \big| - \epsilon \, \mathrm{Diag}\big( \widehat{\mathbf{Z}}^{(-n)\top} \widehat{\mathbf{Z}}^{(-n)} \big)^\top \mathbf{1}_{i_n}, \qquad \hat{s} = \operatorname{sign}\big( \widehat{\mathbf{e}}^{(\hat{n})\top} \widehat{\mathbf{Z}}^{(-\hat{n})} \mathbf{1}_{\hat{i}_{\hat{n}}} \big) \epsilon,$$

and at each backward step

$$(\hat{n}, \hat{i}_{\hat{n}}) = \arg\min_{n,\; i_n \in \mathcal{I}^{(n)}} \; 2 \operatorname{sign}\big( \widehat{w}^{(n)}_{i_n} \big) \widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)} \mathbf{1}_{i_n} + \epsilon \, \mathrm{Diag}\big( \widehat{\mathbf{Z}}^{(-n)\top} \widehat{\mathbf{Z}}^{(-n)} \big)^\top \mathbf{1}_{i_n},$$

where $\mathrm{Diag}(\cdot)$ denotes the vector formed by the diagonal elements of a square matrix and $\mathcal{I}^{(n)} \subseteq \{1, \dots, I_n\}$ is the $n$-mode active index set at the current iteration.

**Computational Analysis.** In Algorithm 1, the most time-consuming parts are the calculations of $\widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)}$ and $\mathrm{Diag}(\widehat{\mathbf{Z}}^{(-n)\top} \widehat{\mathbf{Z}}^{(-n)})$, which appear in both the forward and backward steps. They can be written as

$$\widehat{\mathbf{e}}^{(n)\top} \widehat{\mathbf{Z}}^{(-n)} = \big( \mathbf{Z}^{(-n)} \mathbf{e} - \alpha \beta^{(-n)} M \widehat{\mathbf{w}}^{(n)} \big)^\top, \qquad \mathrm{Diag}\big( \widehat{\mathbf{Z}}^{(-n)\top} \widehat{\mathbf{Z}}^{(-n)} \big) = \mathrm{Diag}\big( \mathbf{Z}^{(-n)} \mathbf{Z}^{(-n)\top} \big) + \alpha \beta^{(-n)} M \mathbf{1}^{(n)},$$

where $\mathbf{e} = \mathbf{y} - \mathbf{Z}^{(-n)\top} \widehat{\mathbf{w}}^{(n)}$ is constant within an iteration but varies from one iteration to the next, and $\mathbf{1}^{(n)}$ is a length-$I_n$ vector of all ones. At each iteration, the computational cost is dominated by the update of $\mathbf{Z}^{(-n)}$ for $n \neq \hat{n}$, which can be obtained incrementally as

$$\mathbf{Z}^{(-n)}_{t+1} = \frac{1}{\sigma_{t+1}} \Big( \sigma_t \mathbf{Z}^{(-n)}_t + \mathbf{Z}^{(-n,\hat{n})}_t \times_{\hat{n}} \hat{s}_{\hat{i}_{\hat{n}}} \mathbf{1}_{\hat{i}_{\hat{n}}} \Big), \tag{14}$$

where $(-n, \hat{n})$ denotes every mode except $n$ and $\hat{n}$; this greatly reduces the computation. Therefore, when $n \neq \hat{n}$, the updates of $\mathbf{Z}^{(-n)}$, $\mathbf{e}$, $\mathbf{Z}^{(-n)}\mathbf{e}$, and $\mathrm{Diag}(\mathbf{Z}^{(-n)}\mathbf{Z}^{(-n)\top})$ require $O(M \prod_{s \neq n, \hat{n}} I_s + 3 M I_n)$, $O(M I_{\hat{n}})$, $O(M I_n)$, and $O(M I_n)$ operations, respectively; when $n = \hat{n}$, we only need to additionally update $\mathbf{Z}^{(-\hat{n})}\mathbf{e}$, which requires $O(M I_{\hat{n}})$ operations. Overall, the computational complexity of our approach is $O\big( M \sum_{n \neq \hat{n}} \big( \prod_{s \neq n, \hat{n}} I_s + 5 I_n \big) + 2 M I_{\hat{n}} \big)$ per iteration. In contrast, the ACS algorithm has to be run for each fixed $\lambda$, and each such problem requires $O(M \prod_{n=1}^{N} I_n)$ operations per iteration [da Silva et al., 2015].

## 5 Convergence Analysis

We provide a convergence analysis for our stagewise algorithm in this section; all detailed proofs are given in Appendix A. Lemmas 1 and 2 below justify the validity of the initialization.

**Lemma 1.** Let $\mathbf{X}$ be the $(N+1)$-mode matricization of $\mathcal{X}$, and write $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_I]$, where each $\mathbf{x}_i$ is a column of $\mathbf{X}$. Then $\lambda_{\max} = \frac{2}{M} \max\{ |\mathbf{x}_i^\top \mathbf{y}|;\; i = 1, \dots, I \}$. Moreover, letting $i^* = \arg\max_i |\mathbf{x}_i^\top \mathbf{y}|$ with $(i^*_1, \dots, i^*_N)$ its corresponding indices in tensor space, the initial non-zero solution of (11), denoted $(\sigma, \{\mathbf{w}^{(n)}\})$, is given by $\sigma = \epsilon$, $\mathbf{w}^{(1)} = \operatorname{sign}(\mathbf{x}_{i^*}^\top \mathbf{y}) \mathbf{1}_{i^*_1}$, and $\mathbf{w}^{(n)} = \mathbf{1}_{i^*_n}$ for $n = 2, \dots, N$, where $\mathbf{1}_{i^*_n}$ is a vector of all zeros except for a 1 in the $i^*_n$-th coordinate.
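Lemma 1 translates directly into a few lines of numpy; in the sketch below the reshape corresponds to one particular matricization ordering (a convention choice on our part), and the function name is hypothetical:

```python
import numpy as np

def surf_initialization(X, y, eps=0.1):
    """Lemma 1: X has shape (I_1, ..., I_N, M), y has shape (M,).
    Returns lambda_max together with the initial (sigma, {w^(n)})."""
    feat_shape, M = X.shape[:-1], X.shape[-1]
    Xmat = X.reshape(-1, M)                         # rows are the columns x_i of the matricization
    scores = Xmat @ y                               # x_i^T y for every feature position i
    flat = np.argmax(np.abs(scores))
    lam_max = 2.0 / M * np.abs(scores[flat])
    i_star = np.unravel_index(flat, feat_shape)     # (i*_1, ..., i*_N)
    w0 = [np.zeros(I_n) for I_n in feat_shape]      # unit vectors 1_{i*_n}; the sign goes on mode 1
    for n, _ in enumerate(feat_shape):
        w0[n][i_star[n]] = np.sign(scores[flat]) if n == 0 else 1.0
    return lam_max, eps, w0
```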
**Lemma 2.** If there exist $s$ and $i_n$ with $|s| = \epsilon$, $n = 1, \dots, N$, such that $\Gamma(s\mathbf{1}_{i_1}, \mathbf{1}_{i_2}, \dots, \mathbf{1}_{i_N}; \lambda) \le \Gamma(\{\mathbf{0}\}; \lambda)$, then it must be true that $\lambda \le \lambda_0$.

Lemma 3 shows that the backward step always performs a coordinate descent update of fixed size $\epsilon$, each time along the steepest coordinate direction within the current active set, until such descent becomes impossible up to a tolerance level $\xi$; the forward step also performs coordinate descent whenever $\lambda_{t+1} = \lambda_t$. Lemma 4 shows that once $\lambda$ changes, the penalized loss for the previous $\lambda$ can no longer be improved by more than the tolerance $\xi$. Thus $\epsilon$ controls the granularity of the paths, and $\xi$ is a convergence threshold for optimizing the penalized loss at any fixed tuning parameter; together they enable a convenient trade-off between computational efficiency and estimation accuracy.

**Lemma 3.** For any $t$ with $\lambda_{t+1} = \lambda_t$, we have $\Gamma(\sigma_{t+1}, \{\mathbf{w}^{(n)}_{t+1}\}; \lambda_{t+1}) \le \Gamma(\sigma_t, \{\mathbf{w}^{(n)}_t\}; \lambda_{t+1}) - \xi$.

**Lemma 4.** For any $t$ with $\lambda_{t+1} < \lambda_t$, we have $\Gamma(\widehat{\mathbf{w}}^{(n)}_t + s_{i_n}\mathbf{1}_{i_n}; \lambda_t) > \Gamma(\widehat{\mathbf{w}}^{(n)}_t; \lambda_t) - \xi$.

Lemmas 3 and 4 together prove the following convergence theorem.

**Theorem 1.** For any $t$ such that $\lambda_{t+1} < \lambda_t$, we have $(\sigma_t, \{\mathbf{w}^{(n)}_t\}) \to (\tilde{\sigma}(\lambda_t), \{\tilde{\mathbf{w}}^{(n)}(\lambda_t)\})$ as $\epsilon, \xi \to 0$, where $(\tilde{\sigma}(\lambda_t), \{\tilde{\mathbf{w}}^{(n)}(\lambda_t)\})$ denotes a coordinate-wise minimum point of Problem (7).

## 6 Experiments

We evaluate the effectiveness and efficiency of our method SURF through numerical experiments on both synthetic and real data, and compare it with various state-of-the-art regression methods, including LASSO, Elastic Net (ENet), Regularized multilinear regression and selection (Remurs) [Song and Lu, 2017], optimal CP-rank Tensor Ridge Regression (orTRR) [Guo et al., 2012], the Generalized Linear Tensor Regression Model (GLTRM) [Zhou et al., 2013], and a variant of our method estimated by Alternating Convex Search (ACS). Table 1 summarizes the properties of all methods.

Table 1: Compared methods. $\alpha$ and $\lambda$ are regularization parameters; $R$ is the CP rank.

| Methods | LASSO | ENet | Remurs | orTRR | GLTRM | ACS | SURF |
|---|---|---|---|---|---|---|---|
| Input Data Type | Vector | Vector | Tensor | Tensor | Tensor | Tensor | Tensor |
| Regularization | $\ell_1$ ($\mathbf{w}$) | $\ell_1/\ell_2$ ($\mathbf{w}$) | Nuclear/$\ell_1$ ($\mathcal{W}$) | $\ell_2$ ($\mathbf{W}^{(n)}$) | $\ell_1/\ell_2$ ($\mathbf{W}^{(n)}$) | $\ell_1/\ell_2$ ($\mathcal{W}_r$) | $\ell_1/\ell_2$ ($\mathcal{W}_r$) |
| Rank | -- | -- | Explored | Optimized | Fixed | Increased | Increased |
| Hyperparameters | $\lambda$ | $\alpha, \lambda$ | $\lambda_1, \lambda_2$ | $\alpha, R$ | $\alpha, \lambda, R$ | $\alpha, \lambda, R$ | $\alpha, \lambda, R$ |

All methods are implemented in MATLAB and executed on a machine with a 3.50GHz CPU and 256GB RAM. For LASSO and ENet we use the MATLAB package glmnet [Friedman et al., 2010]. For GLTRM, we solve the regularized CP tensor regression simultaneously for all $R$ factors based on the TensorReg toolbox [Zhou et al., 2013]. Following [Kampa et al., 2014], we split the data into test and training sets in the ratio 1:5. The hyperparameters of all methods are optimized using 5-fold cross-validation on the training set, with ranges $\alpha \in \{0.1, 0.2, \dots, 1\}$, $\lambda \in \{10^{-3}, 5 \times 10^{-3}, 10^{-2}, 5 \times 10^{-2}, \dots, 5 \times 10^{2}, 10^{3}\}$, and $R \in \{1, 2, \dots, 50\}$. Specifically, for GLTRM, ACS, and SURF we simply set $\alpha = 1$. For LASSO, ENet, and ACS, we generate a sequence of 100 values of $\lambda$ to cover the whole path. For fairness, the number of iterations for all compared methods is fixed to 100. All cases are run 50 times and the average results on the test set are reported. Our code is available at https://github.com/LifangHe/SURF.

**Synthetic Data.** We first use synthetic data to examine the performance of our method in different scenarios, varying the step size, the sparsity level, the number of features, and the sample size. We generate the data as $y = \langle \mathcal{X}, \mathcal{W} \rangle + \varepsilon$, where $\varepsilon$ is random noise drawn from $N(0, 1)$ and $\mathcal{X} \in \mathbb{R}^{I \times I}$. We generate $M$ samples $\{\mathcal{X}_m\}_{m=1}^{M}$ from $N(\mathbf{0}, \Sigma)$, where $\Sigma$ is a covariance matrix in which the correlation coefficient between features $x_{i,j}$ and $x_{p,q}$ is $0.6^{(i-p)^2 + (j-q)^2}$. We generate the true coefficient tensor as $\mathcal{W} = \sum_{r=1}^{R} \sigma_r \mathbf{w}^{(1)}_r \circ \mathbf{w}^{(2)}_r$, where each $\mathbf{w}^{(n)}_r \in \mathbb{R}^{I}$, $n = 1, 2$, is a column vector with i.i.d. $N(0,1)$ entries normalized to unit $\ell_1$ norm, and the scalars are $\sigma_r = 1/r$. To impose sparsity on $\mathcal{W}$, we set $S\%$ of its entries (chosen uniformly at random) to zero. When studying one of these factors, the other factors are fixed to $M = 500$, $I = 16$, $R = 50$, $S = 80$.
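A minimal sketch of this 2-D synthetic design is given below; it reflects our reading of the correlation structure and normalizations described above, the defaults mirror the fixed factor levels, and the function name is illustrative:

```python
import numpy as np

def make_synthetic(M=500, I=16, R=50, S=80, rho=0.6, seed=0):
    """2-D synthetic design: y = <X, W> + eps with correlated Gaussian features."""
    rng = np.random.default_rng(seed)
    # correlation between features x_ij and x_pq: rho ** ((i - p)^2 + (j - q)^2)
    idx = np.array([(i, j) for i in range(I) for j in range(I)])
    d2 = ((idx[:, None, :] - idx[None, :, :]) ** 2).sum(-1)
    Sigma = rho ** d2
    X = rng.multivariate_normal(np.zeros(I * I), Sigma, size=M).reshape(M, I, I)
    # true coefficient tensor: R rank-1 terms with l1-normalized factors and sigma_r = 1/r
    W = np.zeros((I, I))
    for r in range(1, R + 1):
        w1, w2 = rng.standard_normal(I), rng.standard_normal(I)
        W += (1.0 / r) * np.outer(w1 / np.abs(w1).sum(), w2 / np.abs(w2).sum())
    # impose S% sparsity by zeroing entries chosen uniformly at random
    zero_idx = rng.choice(I * I, size=int(round(S / 100.0 * I * I)), replace=False)
    W.flat[zero_idx] = 0.0
    y = np.array([np.sum(Xm * W) for Xm in X]) + rng.standard_normal(M)
    return X, y, W
```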
Figure 1: Comparison of solution paths of SURF (solid lines) and ACS (dashed lines) with different step sizes on synthetic data: (a) $\epsilon = 0.01$, (b) $\epsilon = 0.1$, (c) $\epsilon = 0.5$. The path of estimates $\mathcal{W}$ for each $\lambda$ is plotted as a function of $t = \|\mathcal{W}\|_1$. Note that when the step size is small, the SURF path is almost indiscernible from the ACS path.

We first analyze the critical parameter $\epsilon$ of our method, which controls how closely SURF approximates the ACS paths. Figure 1 shows the solution paths of our method and of ACS under both large and small step sizes. As shown in the plots, a smaller step size leads to a closer approximation of the ACS solutions. In Figure 2, we also plot the averaged prediction error with standard deviation bars (left $y$-axis) and the CPU execution time in minutes (right $y$-axis) over different values of the step size. The choice of step size affects both computational speed and the root mean squared prediction error (RMSE): the smaller the step size, the more accurate the regression result, but the longer it takes to run. In both figures, the moderate step size $\epsilon = 0.1$ offers a good trade-off between accuracy and computation time. In the following we fix $\epsilon = 0.1$.

Figure 2: Prediction error and running time for different values of the step size $\epsilon$.

Next, we examine the performance of our method with varying sparsity levels of $\mathcal{W}$. For this purpose, we compare the prediction error (RMSE) and running time (log min) of all methods on the synthetic data. Figure 3(a)-(b) shows the results for $S = \{60, 80, 90, 95, 98\}$ on synthetic 2-D data, where $S\%$ indicates the sparsity level of the true $\mathcal{W}$. As can be seen from the plots, SURF generates slightly better predictions than the other methods when the true $\mathcal{W}$ is sparser. Moreover, as shown in Figure 3(c), larger step sizes yield much sparser coefficient estimates for SURF, which explains why some curves in Figure 1(c) have no values at larger $\|\mathcal{W}\|_1$.

Furthermore, we compare the prediction error and running time (log min) of all methods with an increasing number of features. Figure 4(a)-(b) shows the results for $I = \{8, 16, 32, 64\}$ on synthetic 2-D data. Overall, SURF gives better predictions at a lower computational cost. In particular, SURF and ACS have very similar prediction quality, which matches our theoretical result on the solutions of SURF versus ACS. SURF achieves better predictions than the other tensor methods, indicating the effectiveness of structured sparsity in the unit-rank tensor decomposition itself. In terms of running time, it is clear that as the number of features increases, SURF is significantly faster than the other methods. Finally, Figure 5 shows the results with an increasing number of samples.
Overall, SURF again gives better predictions at a lower computational cost. Notably, the results of Remurs and orTRR do not change much as the number of samples increases, which may be due to early stopping in their iterative search for the optimal solution.

Figure 3: Results with increasing sparsity level ($S\%$) of the true $\mathcal{W}$ on synthetic 2-D data: (a) test error, (b) running time; (c) sparsity of $\mathcal{W}$ versus step size for SURF.

Figure 4: Results with increasing number of features on synthetic 2-D data: (a) test error, (b) running time; (c) running time on real 3-D MRI data with $240 \times 175 \times 176$ features and fixed hyperparameters (without cross-validation).

Figure 5: Results with increasing number of samples on synthetic 2-D data: (a) test error, (b) running time.

**Real Data.** We also examine the performance of our method on a real medical imaging dataset, including both DTI and MRI images, obtained from the Parkinson's Progression Markers Initiative (PPMI) database (http://www.ppmi-info.org/data) with 656 human subjects. We parcellate the brain into 84 regions and extract four types of connectivity matrices. Our goal is to predict the Montreal Cognitive Assessment (MoCA) score for each subject. Details of the data processing are presented in Appendix B.

Table 2: Performance comparison over the DTI datasets. The metrics are RMSE, sparsity of coefficients (SC), and CPU execution time (in minutes). The results are averaged over 50 random trials and reported as mean ± standard deviation.

| Dataset | Metric | LASSO | ENet | Remurs | orTRR | GLTRM | ACS | SURF |
|---|---|---|---|---|---|---|---|---|
| 1 | RMSE | 2.94 ± 0.34 | 2.92 ± 0.32 | 2.91 ± 0.32 | 3.48 ± 0.21 | 3.09 ± 0.35 | 2.81 ± 0.24 | 2.81 ± 0.23 |
| 1 | SC | 0.99 ± 0.01 | 0.97 ± 0.01 | 0.66 ± 0.13 | 0.00 ± 0.00 | 0.90 ± 0.10 | 0.92 ± 0.02 | 0.95 ± 0.01 |
| 1 | Time | 6.4 ± 0.3 | 46.6 ± 4.6 | 161.3 ± 9.3 | 27.9 ± 5.6 | 874.8 ± 29.6 | 60.8 ± 24.4 | 1.7 ± 0.2 |
| 2 | RMSE | 3.18 ± 0.36 | 3.16 ± 0.42 | 2.97 ± 0.30 | 3.76 ± 0.44 | 3.26 ± 0.46 | 2.90 ± 0.31 | 2.91 ± 0.32 |
| 2 | SC | 0.99 ± 0.01 | 0.95 ± 0.03 | 0.37 ± 0.09 | 0.00 ± 0.00 | 0.91 ± 0.06 | 0.93 ± 0.02 | 0.94 ± 0.01 |
| 2 | Time | 5.7 ± 0.3 | 42.4 ± 2.9 | 155.0 ± 10.7 | 10.2 ± 0.1 | 857.4 ± 22.5 | 63.0 ± 21.6 | 5.2 ± 0.8 |
| 3 | RMSE | 3.06 ± 0.34 | 2.99 ± 0.34 | 2.93 ± 0.27 | 3.56 ± 0.41 | 3.14 ± 0.39 | 2.89 ± 0.38 | 2.87 ± 0.35 |
| 3 | SC | 0.98 ± 0.01 | 0.95 ± 0.01 | 0.43 ± 0.17 | 0.00 ± 0.00 | 0.87 ± 0.03 | 0.90 ± 0.03 | 0.93 ± 0.02 |
| 3 | Time | 5.8 ± 0.3 | 45.0 ± 1.0 | 163.6 ± 9.0 | 7.5 ± 0.9 | 815.4 ± 6.5 | 66.3 ± 44.9 | 1.5 ± 0.1 |
| 4 | RMSE | 3.20 ± 0.40 | 3.21 ± 0.59 | 2.84 ± 0.35 | 3.66 ± 0.35 | 3.12 ± 0.32 | 2.82 ± 0.33 | 2.83 ± 0.32 |
| 4 | SC | 0.99 ± 0.01 | 0.96 ± 0.03 | 0.44 ± 0.13 | 0.00 ± 0.00 | 0.86 ± 0.03 | 0.90 ± 0.02 | 0.91 ± 0.02 |
| 4 | Time | 5.5 ± 0.2 | 42.3 ± 1.4 | 159.6 ± 7.6 | 26.6 ± 3.1 | 835.8 ± 9.9 | 96.7 ± 43.2 | 3.8 ± 0.5 |
| 5 | RMSE | 3.02 ± 0.37 | 2.89 ± 0.41 | 2.81 ± 0.31 | 3.33 ± 0.27 | 3.26 ± 0.45 | 2.79 ± 0.31 | 2.78 ± 0.29 |
| 5 | SC | 0.99 ± 0.00 | 0.97 ± 0.01 | 0.34 ± 0.22 | 0.00 ± 0.00 | 0.91 ± 0.19 | 0.97 ± 0.01 | 0.99 ± 0.00 |
| 5 | Time | 8.8 ± 0.6 | 71.9 ± 2.3 | 443.5 ± 235.7 | 48.4 ± 8.0 | 1093.6 ± 49.7 | 463.2 ± 268.1 | 6.2 ± 0.5 |
We use three metrics to evaluate performance: the root mean squared prediction error (RMSE), which measures the deviation between the ground-truth responses and the predicted values in out-of-sample testing; the sparsity of coefficients (SC), defined as in the synthetic data analysis (i.e., the percentage of zero entries in the estimated coefficient); and the CPU execution time. Table 2 shows the results of all methods on both the individual and the combined datasets. Again, we observe that SURF gives better predictions at a lower computational cost, along with good sparsity. In particular, paired t-tests showed that, on all five real datasets, the RMSE of our approach is significantly lower, and its SC significantly higher, than those of the Remurs and GLTRM methods. This indicates that the performance gain of our approach over the other low-rank-plus-sparse methods is indeed significant. Figure 4(c) reports the running time (log min) of all methods on the PPMI MRI images of $240 \times 175 \times 176$ voxels each, which clearly demonstrates the efficiency of our approach.

## Acknowledgements

This work is supported by NSF grants IIS-1716432 (Wang), IIS-1750326 (Wang), IIS-1718798 (Chen), DMS-1613295 (Chen), IIS-1749940 (Zhou), and IIS-1615597 (Zhou), ONR grants N00014-18-1-2585 (Wang) and N00014-17-1-2265 (Zhou), and Michael J. Fox Foundation grant number 14858 (Wang). Lifang He's research is supported in part by 1R01AI130460. Data used in the preparation of this article were obtained from the Parkinson's Progression Markers Initiative (PPMI) database (http://www.ppmi-info.org/data). For up-to-date information on the study, visit http://www.ppmi-info.org. PPMI, a public-private partnership, is funded by the Michael J. Fox Foundation for Parkinson's Research and funding partners, including AbbVie, Avid, Biogen, Bristol-Myers Squibb, Covance, GE, Genentech, GlaxoSmithKline, Lilly, Lundbeck, Merck, Meso Scale Discovery, Pfizer, Piramal, Roche, Sanofi, Servier, TEVA, UCB, and Golub Capital.

## References

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. 2009.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, pages 267-288, 1996.

Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In International Conference on Machine Learning, pages 373-381, 2016.

Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540-552, 2013.

Ya Su, Xinbo Gao, Xuelong Li, and Dacheng Tao. Multivariate multilinear regression. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(6):1560-1573, 2012.

Weiwei Guo, Irene Kotsia, and Ioannis Patras. Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816-827, 2012.

Fei Wang, Ping Zhang, Buyue Qian, Xiang Wang, and Ian Davidson. Clinical risk prediction with multilinear sparse logistic regression. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 145-154. ACM, 2014.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301-320, 2005.

Xu Tan, Yin Zhang, Siliang Tang, Jian Shao, Fei Wu, and Yueting Zhuang. Logistic tensor regression for classification. In International Conference on Intelligent Science and Intelligent Data Engineering, pages 573-581. Springer, 2012.
Marco Signoretto, Quoc Tran Dinh, Lieven De Lathauwer, and Johan A. K. Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303-351, 2014.

Xiaonan Song and Haiping Lu. Multilinear regression for embedded feature selection with application to fMRI analysis. In AAAI, pages 2562-2568, 2017.

Johann A. Bengua, Ho N. Phien, Hoang Duong Tuan, and Minh N. Do. Efficient tensor completion for color image and video recovery: Low-rank tensor train. IEEE Transactions on Image Processing, 26(5):2466-2479, 2017.

Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.

Kun Chen, Kung-Sik Chan, and Nils Chr. Stenseth. Reduced rank stochastic regression with a sparse singular value decomposition. Journal of the Royal Statistical Society: Series B, 74(2):203-221, 2012.

Aditya Mishra, Dipak K. Dey, and Kun Chen. Sequential co-sparse factor regression. Journal of Computational and Graphical Statistics, pages 814-825, 2017.

Anh-Huy Phan, Petr Tichavský, and Andrzej Cichocki. Tensor deflation for CANDECOMP/PARAFAC, part I: Alternating subspace update algorithm. IEEE Transactions on Signal Processing, 63(22):5924-5938, 2015.

Arin Minasian, Shahram ShahbazPanahi, and Raviraj S. Adve. Energy harvesting cooperative communication systems. IEEE Transactions on Wireless Communications, 13(11):6118-6131, 2014.

Peng Zhao and Bin Yu. Stagewise lasso. Journal of Machine Learning Research, 8(Dec):2701-2726, 2007.

Gregory Vaughan, Robert Aseltine, Kun Chen, and Jun Yan. Stagewise generalized estimating equations with grouped variables. Biometrics, 73:1332-1342, 2017.

Alex Pereira da Silva, Pierre Comon, and Andre Lima Ferrer de Almeida. Rank-1 tensor approximation methods and application to deflation. arXiv preprint arXiv:1508.05273, 2015.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.

Kittipat Kampa, S. Mehta, Chun-An Chou, Wanpracha Art Chaovalitwongse, and Thomas J. Grabowski. Sparse optimization in feature selection: application in neuroimaging. Journal of Global Optimization, 59(2-3):439-457, 2014.