# scalable_interpretable_multiresponse_regression_via_seed__004ac348.pdf Journal of Machine Learning Research 20 (2019) 1-34 Submitted 4/18; Revised 2/19; Published 6/19 Scalable Interpretable Multi-Response Regression via SEED Zemin Zheng zhengzm@ustc.edu.cn School of Management and School of Data Science International Institute of Finance University of Science and Technololgy of China Hefei, Anhui 230026, China M. Taha Bahadori bahadori@gatech.edu School of Computational Science and Engineering Georgia Institute of Technology Atlanta, GA 30332, USA Yan Liu yanliu.cs@usc.edu Computer Science Department Viterbi School of Engineering University of Southern California Los Angeles, CA 90089, USA Jinchi Lv jinchilv@marshall.usc.edu Data Sciences and Operations Department Marshall School of Business University of Southern California Los Angeles, CA 90089, USA Editor: Xiaotong Shen Sparse reduced-rank regression is an important tool for uncovering meaningful dependence structure between large numbers of predictors and responses in many big data applications such as genome-wide association studies and social media analysis. Despite the recent theoretical and algorithmic advances, scalable estimation of sparse reduced-rank regression remains largely unexplored. In this paper, we suggest a scalable procedure called sequential estimation with eigen-decomposition (SEED) which needs only a single top-r sparse singular value decomposition from a generalized eigenvalue problem to find the optimal low-rank and sparse matrix estimate. Our suggested method is not only scalable but also performs simultaneous dimensionality reduction and variable selection. Under some mild regularity conditions, we show that SEED enjoys nice sampling properties including consistency in estimation, rank selection, prediction, and model selection. Moreover, SEED employs only basic matrix operations that can be efficiently parallelized in high performance computing devices. Numerical studies on synthetic and real data sets show that SEED outperforms the state-of-the-art approaches for large-scale matrix estimation problem. Keywords: reduced-rank regression, scalability, high dimensionality, greedy algorithm, sparse eigenvector estimation c 2019 Zemin Zheng, M. Taha Bahadori, Yan Liu and Jinchi Lv. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-200.html. Zheng, Bahadori, Liu and Lv 1. Introduction Identifying complex dependence structures among predictors and responses is an important problem in statistics and machine learning, since these structures reveal hidden domain knowledge about the data. For example, in bioinformatics, identifying gene regulatory networks is crucial for understanding gene regulatory paths and gene functions, which helps disease prediction and diagnosis. Similarly, in social media analysis, inferring the influence networks from user activities, that is, Diffusion Network Inference problem (Leskovec et al., 2008; Zhou et al., 2013; Embar et al., 2014) is an important problem and it has found applications in social media marketing (Gomez-Rodriguez et al., 2012) and crisis management (Starbird and Palen, 2012). In these big data applications, inferring the dependence structures is challenging since the responses and predictors may be related through a few latent pathways and/or associated through only a subset of responses and predictors. Moreover, the curse of dimensionality and massive amounts of data, that is, scalability issues make the dependence structure discovery problem even harder to solve. To recover sparse response-predictor associations and latent predictors, regularization methods such as lasso (Tibshirani, 1996) and group lasso (Yuan and Lin, 2006), and reduced-rank regression approaches (Izenman, 1975; Velu and Reinsel, 2013) have been proposed, respectively. Chen et al. (2012) and Chen and Chan (2015) have proposed sparse reduced-rank regression approaches by combining the regularization and reduced-rank regression techniques to find the complex dependence structures between responses and predictors. Sparse reduced-rank regression works by modeling the associations between the predictor and response variables via a sparse and low-rank representation of the coefficient matrix. It not only enhances the interpretability of the estimated matrix by eliminating irrelevant features (Chen et al., 2012), but also reduces the number of free parameters of the model and thus the number of observations required for desired estimation consistency (Yuan et al., 2007; Bunea et al., 2011; Cand es and Plan, 2011; Negahban and Wainwright, 2011; Chen et al., 2013). Sparse reduced-rank regression has found applications in micro-array biclustering (Chen et al., 2012), subspace clustering (Wang et al., 2013), social network community discovery (Richard et al., 2012; Zhou et al., 2013), and motion segmentation (Feng et al., 2014). In these applications, joint sparsity and low-rankness has been used to enforce a clustered dependence structure among data points. In particular, the key idea is to estimate a similarity matrix among data points that is simultaneously sparse and lowrank and then permute the rows and columns of the matrix to yield approximately blockdiagonal structures, which naturally lead to clustering of data points into several groups. Note that Chandrasekaran et al. (2010), Agarwal et al. (2012), and the references therein have considered estimating matrices with a low-rank plus sparse representation which is different from our work as we are interested in estimating a matrix that is jointly low-rank and sparse. A natural approach to solving the sparse reduced-rank regression problem is to simultaneously penalize the parameter matrix using the L1 and nuclear norm regularizers, as they are convex relaxations to sparsity and low-rankness of a matrix, respectively. The resulting optimization problem is convex and can be solved using the alternating direction method of multipliers (ADMM) (Boyd et al., 2010) as proposed by Richard et al. (2012) and Zhou et al. (2013). In Bunea et al. (2012), an alternative approach, called rank constrained group lasso (RCGL), was proposed which directly penalizes the rank and the number of nonzero rows of the parameter matrix. They showed oracle rates for the estimated matrix and also provided a practical algorithm which iteratively and jointly solves a L1-regularization and low-rank estimation problem. To further improve the estimation accuracy, Chen et al. (2012) borrowed ideas from adaptive Lasso (Zou, 2006) and proposed the iterative exclusive extraction algorithm (IEEA) which finds a locally optimal solution in the neighborhood of the initial value. They also showed model selection consistency and asymptotic normality results along with the improved empirical performance of IEEA on microarray biclustering analysis data. All the above approaches for sparse reduced-rank regression achieve both desirable theoretical properties and strong empirical results. However, they cannot scale to large matrix estimation problems in many big data applications. The ADMM algorithm of Richard et al. (2012) and Zhou et al. (2013) uses iterative singular value thresholding (Cai et al., 2010) for solving the joint L1 and nuclear norm regularization. Iterative singular value thresholding is known to be computationally expensive since it performs a full singular value decomposition of the parameter matrix in each iteration. On the other hand, RCGL (Bunea et al., 2012) is computationally much faster than ADMM since it only performs top-r singular value decomposition for estimating a rank-r matrix in each iteration. Despite a lower computational cost per iteration, it is unclear how many iterations RCGL needs for convergence. IEEA (Chen et al., 2012) performs nested loops of alternating L1-penalized regression for each singular vector which can be expensive, especially on parallel computing devices. The iterative nature of these three approaches makes them not scalable and renders them inefficient for large matrix estimation even on high performance computing devices. To overcome the scalability issues of the previous approaches, we propose a simple and scalable sparse reduced-rank regression procedure called sequential estimation with eigendecomposition (SEED). SEED is designed for high-performing computing platforms. It converts the sparse and low-rank regression problem to a sparse generalized eigenvalue problem and then solves the problem using the recent algorithms for sparse eigenvalue decomposition (Cai et al., 2013; Ma, 2013; Yuan and Zhang, 2013). As a pure learning algorithm, SEED is expected to perform only a single top-r sparse eigenvalue decomposition for estimating a rank-r matrix, which makes it truly scalable and efficient for large matrix estimation problems. The main contributions of our paper are threefold. First, for the sparse reduced-rank regression problem, our proposed procedure SEED provides a scalable approach to uncovering the sparse predictor-response association network while simultaneously achieving dimension reduction and variable selection. Second, for the high-dimensional settings, our theoretical analysis shows that SEED can consistently estimate the singular vectors, latent factors as well as the regression coefficient matrix, identify the correct rank of the matrix, accurately predict the multivariate response vector, and recover the support of the singular vectors under mild conditions. Note that, compared with Chen et al. (2012), we do not make any assumption on the positive definiteness of the design matrix for proving our consistency results. Third, we empirically demonstrate that SEED can not only be efficiently implemented on both central processing units (CPU) and graphics processing units (GPU) for large-scale applications, but it also outperforms the state-of-the-art sparse reduced-rank regression approaches. Zheng, Bahadori, Liu and Lv Recently, Mishra et al. (2017) proposed the sequential co-sparse factor regression in a similar sparse and low-rank model setting, where a unit rank sparse coefficient matrix was recovered at each step such that both the left and the right singular vectors could be estimated with co-sparse pattern (an earlier version of the current paper was cited in theirs). Compared with their method, our approach separates the estimation of the singular vectors at each step with the left ones obtained through a generalized sparse eigenvalue problem and the right ones recovered directly based on the left ones. Moreover, we provide comprehensive theoretical properties to justify the validity of sequential estimation for sparse and low-rank coefficient matrices in high dimensions. The rest of this paper is organized as follows. Section 2 introduces our SEED method. We discuss the implementation details of SEED in Section 3 and present its asymptotic properties in Section 4. We demonstrate the advantages of SEED on both synthetic and real data sets in Section 5, and in Section 6 we discuss some extensions of our SEED method. The proofs of some main results are relegated to the Appendix. Additional technical details are provided in the Supplementary Material. 2. Sequential Estimation with Eigen-Decomposition We will first introduce the model setup and then the proposed methodology. 2.1. Model and Problem Formulation Denote by {(xi, yi)}n i=1 n observations in the fixed design setting, where xi Rp and yi Rq represent the ith predictor and the corresponding response vectors, respectively. Given a predictor vector x, the corresponding response vector y is drawn from the following model y = C T x + ε, where the noise vector ε N(0, Σ) is a q-dimensional multivariate Gaussian random vector with mean zero and covariance matrix Σ, and C Rp q is the regression coefficient matrix.1 We can rewrite the model in the matrix form as follows Y = XC + E, (1) where Y = [y1, . . . , yn]T , X = [x1, . . . , xn]T , and E = [ε1, . . . , εn]T denote the matrices of stacked response, predictor and noise vectors, respectively. Let P = n 1XT X be the Gram matrix of the predictors. We consider model (1) from a latent factor point of view, where the true regression coefficient matrix C is jointly low-rank and sparse, similar to Bunea et al. (2012) and Chen et al. (2012). In particular, rank(C ) = r with the matrix rank r min(p, q). Without loss of generality, we assume that rank(XC ) = rank(C ) since if rank(XC ) < rank(C ), the redundant part of C can be removed such that it reflects the true number of latent factors. Then C will have the following decomposition k=1 u kv T k = k=1 C k, (2) 1. Note that the Gaussianity of noise variables is not essential to either our procedure or the theoretical results and similar results would hold under the sub-Gaussian assumption (B uhlmann and van de Geer, 2011, Chapter 14). where the left singular vectors u k Rp are P-orthogonal with unit length, that is, u T k Pu k = 0 if k = k and u k 2 = 1, while the right singular vectors v k Rq are orthogonal, that is, v T k v k = 0 for k = k , and C k is the layer k unit rank matrix of C . The singular vectors are sorted by the magnitudes of the singular values σk = 1 nq XC k F in descending order, consistent with their contributions to the prediction of Y. We consider the left singular vectors (both the population and estimated ones) in the constraint space u Ker(P), where Ker(P) denotes the null space of P, to guarantee the model identifiability. Otherwise, u would contain certain component eu such that Xeu = 0, which does not contribute to the prediction of Y. It is worth noticing that the decomposition in the form of C = Pr k=1 u kv T k is not unique without orthogonality constraints and the entrywise sparsity in the singular vectors is not invariant to rotations. The decomposition in (2) is a special one since the P-orthogonality of u k arises naturally from the power method and leads to a latent factor analysis interpretation of the reduced rank estimation in which the latent factors Xu k are uncorrelated. Moreover, decomposition (2) coincides with the singular value decomposition of XC except for different scalings on the singular vectors. We defer the discussion on the identifiability of decomposition (2) (existence and uniqueness up to simultaneous sign changes of u k and v k) to Supplementary Material. The aforementioned modeling of the regression coefficient matrix gives a latent factor model with r latent factors, where Xu k is the kth latent factor and v k describes the impacts of the kth factor on the response variables. As illustrated in Yuan et al. (2007), the lowrankness of C renders dimension reduction such that all responses can be predicted by a relatively small set of common factors. Furthermore, the left singular vectors u k are assumed to be sparse, yielding the necessity of predictor selection. Similar sparsity assumptions were made in Bunea et al. (2012) and Chen et al. (2012) to enhance model interpretability by removing irrelevant features in high dimensions. Specifically, Chen et al. (2012) assumed that both the left and right singular vectors are sparse while Bunea et al. (2012) imposed restriction on the number of nonzero rows of the regression coefficient matrix. In this paper, we are interested in two cases: (i) when the right singular vectors are not required to be sparse and (ii) when it is desirable to have sparse right singular vectors, which entails the response selection. We will show that both cases are efficiently accommodated by our procedure. Our goal is to accurately estimate the singular vectors u k and v k, and the true rank r such that we can recover the latent factors, their impacts, and the underlying number of latent factors as well as the significant predictors. As a singular vector can have two opposite directions, we always assume that the estimated left singular vector takes the correct one, that is, the angles between estimated and population left singular vectors are no more than a right angle. Once the estimated rank and singular vectors are obtained, the estimate b C of the true matrix C follows immediately from (2). Unlike most existing sparse and low-rank estimation methods which adopt the regularization framework of minimizing a loss function plus certain penalties, we will show that the proposed procedure SEED is indeed a pure learning algorithm that predicts Y using Xb C. Zheng, Bahadori, Liu and Lv 2.2. Description of SEED The following proposition provides us insight into estimating the top-r left and right singular vectors of C . Proposition 1 Consider the noiseless case Y = XC with C = Pr k=1 u kv T k defined in (2). Then u 1, . . . , u r are the r non-degenerate left singular vectors of C if and only if they are eigenvectors of the following generalized eigenvalue problem XT Y Y T Xu = λXT Xu (3) with respect to the nonzero eigenvalues λ1, , λr , where λk = nqσ2 k is the kth largest eigenvalue of Y Y T with σk = XC k F / nq defined in Section 2.1. Furthermore, given the left singular vector u k, the corresponding right singular vector v k can be written as v k = 1 u T k XT Xu k Y T Xu k. (4) Proposition 1 shows that the problem of estimating the singular vectors can be transformed into the generalized eigenvalue problem in (3), thanks to the P-orthogonality of the left singular vectors. It motivates us to estimate the left singular vectors in the noisy case Y = XC + E by solving the following generalized eigenvalue problem XT YYT Xu = λXT Xu. (5) The estimation consistency can be guaranteed by the matrix perturbation theory (see Section 4 for details). On the other hand, it is not difficult to see that the eigenvectors with respect to different eigenvalues of problem (5) are P-orthogonal, which further gives the orthogonality of the right singular vectors estimated according to (4). It implies that the right and left singular vectors obtained by solving the generalized eigenvalue problem will automatically be orthogonal and P-orthogonal, respectively. Related results of principal component analysis in low dimensions can be found in Baldi and Hornik (1989), Diamantaras and Kung (1996), and De La Torre and Black (2003). Note that in the high-dimensional setting, the regime of interest for this paper, the Gram matrix P can not be invertible and the generalized eigenvalue problem is potentially challenging to solve. We will address the implementation challenges in Section 3. Based on Proposition 1, our proposed procedure SEED performs a two-step estimation for the regression coefficient matrix. It first solves the generalized eigenvalue problem (5) to obtain the estimated left singular vectors bu1, . . . , bur with unit length and then finds the estimated right singular vectors bv1, . . . , bvr according to (4) as follows bvk = 1 bu T k XT Xbuk YT Xbuk. (6) The maximum rank r depends on the magnitude of the estimated singular value bσk = Xb Ck F / nq with b Ck = bukbv T k (whether it is larger than a threshold µ) and the optimal rank can be tuned by cross validation or the information criterion described in Section 4. Algorithm 1: SEED Input: (1) Data Y Rn q and X Rn p (2) Termination accuracy µ and sparsity parameter θ 1 Compute matrices: P 1 n XT X, R 1 n YT X, Q 1 4 (buk, bσk) kth θ-sparse eigenvector and eigenvalue of Qu = λPu 5 if bσk > µ then 6 bvk 1 bu k Pbuk Rbuk #Optional thresholding of bvk for sparsity 7 b Ck bukbv k 8 k k + 1 10 until bσk < µ 11 tune the optimal rank br 12 return b C = Pbr k=1 b Ck The details of the procedure are provided in Algorithm 1. To achieve a sparse solution, we need to find the rank-r sparse matrix via a sparse eigenvalue decomposition procedure in Line 4 of Algorithm 1. This can be done in a sequential way and practical methods will be provided in Section 3. The practical methods need a sparsity parameter θ, such as a threshold (Ma, 2013) or a sparsity size (Yuan and Zhang, 2013). We will show in Section 5 that SEED is robust to the choices of parameters θ and µ. If the right singular vectors are also required to be sparse, we perform a simple element-wise thresholding on bvk after we obtain it in Line 6. 3. Scalable Implementation of SEED Algorithm 1 requires a sparse solution of (5) which is a generalized eigenvalue problem with a rank deficient matrix P. In this section, we propose two different procedures to solve (5) and study multiple practical aspects of SEED. 3.1. Fast Approach The bottleneck in speeding up Algorithm 1 is Line 4 where we need to solve a sparse generalized eigenvalue problem. To overcome this bottleneck, we propose a new solution to estimating the left singular vectors by rewriting equation (5) as XT (YYT λI)Xu = 0. Similar to Proposition 1, when X is of full row rank (which is easy to satisfy in the highdimensional setting), the above equation shares the same nonzero eigenvalues with YYT . Even if X is row rank deficient, we still have nonzero solution u when the perturbation is relatively small, ensured by the perturbation theory in Lemma 6. Thus, we propose the following two-step procedure for Line 4 of Algorithm 1: Zheng, Bahadori, Liu and Lv (1) λ λmax(YYT ). (2) bu1 sparse eigenvector relative to the zero eigenvalue of XT YYT X λXT X. Based on bu1, the corresponding right singular vector bv1 and the unit rank matrix b C1 will be obtained. To continue, we compute the residual Yk = Y X Pk 1 j=1 b Cj in the kth step and replace Y with Yk in the above two-step procedure to obtain the kth left singular vector buk. Overall, the first step requires calculation of the top-r eigenvalues for an n n matrix (or q q if q < n) while the second step finds the corresponding eigenvectors by solving a regular sparse eigenvalue problem. The above procedure significantly accelerates the speed of SEED as it converts a degenerate sparse generalized eigenvalue problem to two simpler regular sparse eigenvalue problems. Existing procedures such as the iterative thresholding method (Ma, 2013) can be used to compute both eigenvalue problems efficiently. Generally speaking, SEED will be extremely efficient in applications with low rank structure such as image processing as it stops early after achieving a few important signals. Moreover, the speed of SEED can be greatly enhanced by parallel implementation on high performance computing devices such as GPU, due to the fact that it employs only basic matrix operations. 3.2. Alternative with Enhanced Stability In cases where the perturbation can be large, for numerical stability purposes, we can also solve (5) by the following modified problem with a very small positive ρ: XT YYT Xu = λ(XT X + ρI)u. (7) Note that XT X + ρI is invertible since the eigenvalues of XT X are nonnegative. Denote by e X Rp p the modified predictor matrix such that e X T e X = XT X + ρI, which can be obtained via the Cholesky decomposition, and e Y = ( e X T ) 1XT Y the modified response matrix. Then the above equation (7) can be rewritten as e X T e Y e Y T e Xu = λ e X T e Xu, (8) which adopts the same form as (5). A computationally efficient technique for solving equation (8) is to solve the sparse eigenvalue problem e P 1 e Qu = λu, where the modified Gram matrix e P = n 1 e X T e X and e Q is defined accordingly as in Algorithm 1. We can compute e P 1 by the Sherman-Morrison Woodbury formula as follows: (ρIp + XT X) 1 = 1 ρ2 XT (In + 1 The above equation requires inversion of an n n matrix instead of a p p matrix, which is significantly faster in the high-dimensional setting when p n. Remark. The formulation of e X and e Y can be regarded as a generalization of the ridge regression to the multivariate response setting. In fact, since C is the minimizer of Y XC 2 F , we can enhance the stability by adding a small Frobenius norm regularization as follows: e C = argmin C n Y XC 2 F + ρ C 2 F o , where the Frobenius norm is defined as C 2 F = P i,j C2 i,j for any matrix C. After completing the squares, we get e C = argmin C n e Y e XC 2 F o , which means that e X and e Y are the corresponding predictor and response matrices that take into account the shrinkage effects (James and Stein, 1961; Zheng et al., 2014). 3.3. Practical Aspects Now we analyze several practical aspects of SEED as follows. Refitting. In Algorithm 1, a refitting can be performed during the eigenvalue decomposition in Line 4 to enhance the stability. The refitting procedure is as follows. In the kth step, we perform the top-k singular value decomposition b C = Ub SVT for b C = Pk j=1 b Cj and refit the solution by finding e S = argmin S Y XUSVT 2 F . The estimate with refitting is defined as e C = Ue SVT . In practice, we find this approach more stable and report the results based on this variation of SEED in numerical studies. Sparse eigenvector estimation. The previous approaches for solving the generalized eigenvalue decomposition problem indicate that we can solve the problem via regular sparse eigenvalue decomposition. This allows us to reuse the existing procedures for sparse eigenvalue decomposition such as Cai et al. (2013), Ma (2013), Yuan and Zhang (2013), and Lei and Vu (2015) to solve the problem in (5). In numerical studies, we use the iterative thresholding method (Ma, 2013), which is detailed in Algorithm 2 for estimating the sparse eigenvector relative to the largest eigenvalue of a given sparse eigenvalue problem Su = λu with sparsity parameter θ (threshold level). It is worth pointing out that although sparse eigenvalue decomposition is a nonconvex problem, Ma (2013) proved that the sparse eigenvectors obtained by the iterative thresholding algorithm will converge to their population counterparts with asymptotic probability one within O(log n) iterations. The initial estimate bu(0) is generated from the diagonal thresholding sparse PCA algorithm proposed in Johnstone and Lu (2009), which is a regular eigenvalue problem after constraining on the significant coordinates (variances). This iterative thresholding algorithm can also be applied to recover several sparse eigenvectors simultaneously. Parallel implementation. In order to scale up the procedures, we often need to utilize the parallel computing tools. Given that SEED only uses basic matrix operations, we can employ parallel implementation of the large matrix operations in high performance computing devices to accelerate SEED. In Section 5, our experiments with GPU which contain thousands of processing units show that the matrix operations in SEED can be efficiently parallelized and it significantly enhances the speed of SEED. Whenever the data can not be loaded into the memory of a single device, efficient distributed algorithms can be used. See for example, Kang et al. (2011) and the references therein. Zheng, Bahadori, Liu and Lv Algorithm 2: Iterative thresholding Input: Matrix S Rp p, threshold level θ, and initial estimate bu(0) Rp. 3 Multiplication: t(k) = (t(k) ij ) = Sbu(k 1) 4 Thresholding: bt (k) = (bt(k) ij ) with bt(k) ij = t(k) ij 1(|tij| θ) 5 QR factorization: bu(k)R(k) = bt (k) with R(k) = bt (k) 2 6 k k + 1. 7 until convergence 8 return bu(k) at convergence 4. Asymptotic Properties of SEED In this section, we will analyze the statistical properties of SEED. Define the maximum sparsity level of the left singular vectors as s = maxr k=1 u k 0 p, which is assumed mainly for theoretical analysis (see Condition 1 below) and unknown in our practical implementation. We consider the estimated left singular vectors with the number of nonzero elements less than certain sparsity level s > s , that is, buk 0 s for k = 1, , r, where r > r is an upper bound of the estimated rank that can be controlled by the algorithm in practice. When the generalized eigenvalue problem (5) does not have an s-sparse solution, the estimated largest s-sparse eigenvalue can be reformulated as bλ = max u =0, u 0 s u T (XT YYT X)u u T (XT X)u due to the extremal property of eigenvalues, and the corresponding maximizer bu will be treated as the estimated left singular vector. Note that there is no sparsity constraint on either the population or the estimated right singular vectors since v k can be recovered from (4) based on accurate estimation of u k. 4.1. Technical Conditions Here we list a few technical conditions and discuss their relevance in detail. Condition 1 (Restricted Isometry) There exists a positive constant φs such that the Gram matrix P satisfies φs min z Rp z 2 : z 0 2s max z Rp z 2 : z 0 2s φ 1 s for some s > s . Condition 2 (Minimum Singular Value Separation) The non-zero singular values σk satisfy σ2 k σ2 k+1 dσ > 0 for some positive constant dσ and k = 1, . . . , r . Condition 3 (Bounded Eigenvalues) The eigenvalues of the population covariance matrix of the noise vector ε satisfy 0 < γ2 l λj(Σ) γ2 u < for j = 1, . . . , q, where γl and γu are positive constants with γu cγσr for some positive constant cγ. Condition 4 (Minimum Signal Strength) There exists some positive constant δ (0, 1) such that the following lower bounds on the magnitudes of the non-zero elements of u k and v k hold for any 1 k r min i supp(u k) |u i | 3Cu min i supp(v k) 1 q|v i | 3Cv where Cu and Cv are constants defined in Theorem 1. Condition 1 imposes bounds on the 2s-sparse eigenvalues of P, which is weaker than the regular bounded eigenvalue assumption since the sparse eigenvalues do not grow as fast as the regular eigenvalues when the dimensionality p increases. As a typical condition in high dimensions, it restricts the correlations between small numbers of features and thus guarantees the identifiability of the true support. See, for instance, Cand es and Tao (2005) and Zhang (2011) for more discussion on it. Recall that σk is the kth largest singular value of XC / nq. Condition 2 requires strict separation among the singular values such that the left singular vectors are distinguishable. For ease of presentation, we assume dσ to be a constant and indicate the roles of dσ clearly in the constants of the theoretical results. In fact, XC adopts a spiked eigen-structure (Johnstone, 2001; Shen et al., 2016) with the spiked singular values allowed to diverge at the rate of nq. This rate is reasonable for an n by q matrix as we do not impose any sparsity on the columns of C . The elements of the unobserved noise vector ε were assumed to be independent and identically distributed (i.i.d.) in Bunea et al. (2012). We relax it a bit in Condition 3 by imposing bounded eigenvalues for the noise covariance matrix for recovering the true rank in Theorem 2. Our technical argument still applies when either γl 0 or γu as long as their rates of convergence can be controlled within certain magnitudes. The two inequalities in Condition 4 are imposed for the model selection consistency of the predictors and responses, respectively. The magnitude of the minimum signal strength is O p s log(pq)/n , which is relatively mild as it converges to zero in our setting. Since u k is assumed to have unit length, the singular values of C are absorbed into v k in view of decomposition (2) so that there is an extra scaling factor 1 q in the second inequality. 4.2. Main Results Denote by P 2 = maxp j=1 Pjj and γ2 = maxq j=1 Σjj the maximum diagonal components of the Gram matrix and noise covariance matrix, respectively. Without loss of generality, we assume that V = maxr k=1 1 q v k 2 is finite for the q-dimensional vectors v k. Moreover, it is clear that under Conditions 1 and 3, P and γ are also finite constants. The estimated regression coefficient matrix is given by b C = P r k=1 b Ck, where r is the optimal rank tuned Zheng, Bahadori, Liu and Lv by information criterion (9) and b Ck = bukbv T k . The following theorem bounds the estimation errors of SEED with the estimated left singular vectors buk taking the correct signs as discussed before. Theorem 1 (Consistency of Estimation and Prediction) Suppose that Conditions 1 and 2 hold, γ is finite, and s log(pq) = o(n), then with probability at least 1 δ for any δ (0, 1) and uniformly over k = 1, . . . , r , we have buk u k 2 Cu 1 q bvk v k 2 Cv X(buk u k) 2 n Cu φs 1 q b Ck C k F (V Cu + Cv) r s X(b Ck C k) F nq (V Cu + Cv) φs where the constants Cu = 4γPσ1 dσφ5/2 s and Cv = 2 2φ 3/2 s (2V φ 1/2 s + σ1)Cu + 2 Theorem 1 is established based on the perturbation theory of sparse generalized eigenvalue problem (Lemma 6). It shows that the uniform estimation error bounds for both top-r singular vectors u k and 1 qv k, top-r latent factors 1 n Xu k and unit rank matrices 1 q C k, and the uniform prediction error bounds of the top-r latent factors are all in the same order of O q δ . The tail probability δ can decay to zero quickly as p and q grow with rates such as δ (pq) α for some positive constant α > 1. This is due to the fact that when δ (pq) α, we have q δ 0 under the assumption that s log(pq) = o(n). Furthermore, the estimation and prediction accuracy would then be within the rate of O p s log(pq)/n , where the factor log(pq) reflects the curse of dimensionality as there are pq parameters in total from the regression coefficient matrix C . If the true rank r can be correctly identified, it is not difficult to see that the estimation accuracy for 1 q C will be within the rate of O{ p r s log(pq)/n} (see Corollary 3 below), which coincides with the minimax error bound for estimating the regression coefficient vector in the univariate response setting (Raskutti et al., 2011) with the dimensionality p and sparsity level s replaced by the overall dimensionality pq and the product r s, respectively. Similarly, with the true rank r , the normalized prediction error X(b C C ) F / nq will be within the rate of O np r s log(pq)/n o , which is similar to the prediction error bound established in Bunea et al. (2012). But we have an extra scaling factor of q since the singular values of XC are allowed to diverge at the rate of nq, larger than the rate n + q in Bunea et al. (2012). Our theoretical results show that when the left singular vectors are consistently estimated, the normalized prediction error will converge to zero asymptotically. Overall speaking, the prediction error bound in Bunea et al. (2012) was derived from the regularization framework, while the prediction accuracy of SEED is obtained from the matrix perturbation theory (Lemmas 5 and 6). Based on the discussion before, a desirable statistical property of any low-rank estimation procedure is accurately recovering the true rank of the parameter matrix. Similar to Lasso, the nuclear norm regularization needs to be enhanced by techniques such as adaptive regularization to accurately recover the rank of the matrix (Bach, 2008; Chen et al., 2013). In contrast, we can directly control the rank of the solution in SEED by limiting the number of steps. In particular, we propose a GIC-type (Fan and Tang, 2013) information criterion that guarantees rank recovery by SEED when the optimal rank is tuned according to it. Theorem 2 (Consistency of Rank Recovery) Suppose Conditions 1-3 hold, s log(pq) = o(n), r = o 1 (log log n) s h n log(pq) i 1 4 , and r = o h n s log(pq) i 1 4 . Define the following in- formation criterion Cn = n log Ln(Y, X, b C) + rank(b C) p log(pq) log log n, (9) where Ln(Y, X, C) = 1 qn Y XC 2 F . Under the above information criterion, with probability at least 1 (pq) α for some positive constant α > 1 and sufficiently large n, SEED will select the true rank, that is, rank(b C) = rank(C ). In the high-dimensional setting where the number of predictors can increase exponentially with the sample size, it is demonstrated in Fan and Tang (2013) that we need some power of the logarithmic factor of dimensionality ( p log(pq) for our setting) in the model complexity penalty of the information criterion to consistently identify the true model, and the slow diverging rate log log n is set to prevent underfitting. The proof of Theorem 2 indeed shows that information criterion (9) will keep decreasing until the estimated rank reaches the true rank r , where in each step the amount of decrease in the objective function Ln(Y, X, b C) equals to the squared singular value obtained by solving the generalized eigenvalue problem (5). After reaching the true rank, the estimated singular value becomes small such that the model complexity penalty will overweight the decrease and then information criterion (9) would start increasing. Therefore, in the sequence of solutions generated by SEED, the estimate b C with rank r will be the minimizer of (9) such that the true rank can be correctly identified. As discussed before, correct identification of the true rank will yield the estimation accuracy of C as well as the prediction accuracy of XC . Therefore, combined with Theorem 2, it is immediate that the results in Theorem 1 give the following corollary. Corollary 3 (Consistency of Overall Estimation and Prediction) Given Conditions 1-3, s log(pq) = o(n), r = o 1 (log log n) s h n log(pq) i 1 4 , and r = o h n s log(pq) i 1 optimal rank is tuned by information criterion (9), then with probability at least 1 (pq) α Zheng, Bahadori, Liu and Lv for any constant α > 0 and sufficiently large n, we have 1 q b C C F (1 + α)(V Cu + Cv) r s log(pq) r s log(pq) X(b C C ) F qn (1 + α)(V Cu + Cv) φs r s log(pq) r s log(pq) Besides estimation consistency and rank recovery, SEED is also able to find the true support of the singular vectors. To achieve this goal, after selecting the optimal rank, we need to further refine the model selection procedure by performing a hard-thresholding. See, for example, Fan and Lv (2013) for applications of thresholding in high-dimensional sparse modeling. Specifically, denote by Tθ(z) the estimator after the hard-thresholding operation on every element of z = (z1, , zp) Rp such that Tθ(zi) = 0 if |zi| < θ zi otherwise , i = 1, . . . , p. Based on the results of Theorems 1 and 2 and the signal strength assumption in Condition 4, we have the following properties for the estimator with a further thresholding. Theorem 4 (Support Recovery of Singular Vectors) Given Conditions 1-4, s log(pq) = o(n), r = o 1 (log log n) s h n log(pq) i 1 4 , and r = o h n s log(pq) i 1 4 , for every pair of sin- gular vectors, (buk, bvk), k = 1, . . . , r , the following results hold. a) If the threshold θ (5 4Tu) with Tu = Cu q δ , then with probability at least 1 δ, we have supp(Tθ(buk)) = supp(u k); b) If the threshold θ (5 4Tv) with Tv = Cv q δ , then with probability at least 1 δ, we have supp(Tθ(bvk)) = supp(v k). Theorem 4 shows that both supports of the left and right singular vectors can be accurately recovered with properly chosen tuning parameter θ. Together with the correctly identified true rank r , the above results indeed yield consistent selection of both predictors and responses. In practice, this threshold θ can be tuned by criteria such as cross-validation. Besides the statistical properties established before, the proposed procedure SEED enjoys great flexibility in the sense that it does not rely on exact eigenvalue decomposition and the perturbation errors in the generalized eigenvalue problem (5) will be linearly incorporated into the estimated singular vectors buk and bvk. Furthermore, our analysis does not reply on the positive definiteness of the Gram matrix (Chen et al., 2012) in high dimensions. 5. Numerical Studies In this section, we conduct experiments on three data sets, including two simulation data sets (one for a medium-scale experiment and one for a large-scale experiment) and one application data set in social media analysis, to examine the empirical performance of SEED. 5.1. Simulation Studies 5.1.1. Simulation Example 1 We generate a medium-scale synthetic data set as follows: the predictors x are drawn from a multivariate Gaussian distribution as x N(0p 1, ΣX), where ΣX is the p p covariance matrix with auto-regressive structure, that is, ΣX,i,j = ρ|i j| for some 0 < ρ < 1 which will be specified later. The responses y are drawn according to conditional distribution y|x N(C x, γΣE) where the noise covariance matrix ΣE is also selected to have the autoregressive structure with ρ = 0.5 and we set γ = 0.3. We generate the parameter matrix C as follows: first we generate a block-sparse matrix e C with 5% non-zero elements. Each non-zero element of e C is drawn from a N(0, 1). To achieve a low-rank structure, we find the top-r singular value decomposition of e C as e C = USVT , and then set the elements of U and V whose magnitude is smaller than 0.01 to zero to obtain U and V. The final parameter matrix is obtained as C = U S VT where the first r diagonal elements of the diagonal matrix S are set to 100, 99, . . . , 101 r. Without loss of generality, we add a few vectors to the design matrix to ensure the orthogonality condition of Section 4.1. In all of the simulation experiments, we generate 100 data sets and report the mean and standard error of performance for different methods. We compare the performance of SEED with two state-of-art methods: (1) RCGL (Bunea et al., 2012) and (2) Penalized regression with simultaneous L1 and nuclear-norm penalization. The optimization problem is solved by the popular alternating direction method of multipliers (Boyd et al., 2010) and we will refer to this baseline as the ADMM algorithm. For a fair comparison, all model parameters are set based on a separate validation set with size nvalid = 500. To tune the parameters in SEED, we created a grid of sparsity thresholds θ and for each value of θ, the validation errors were recorded while increasing the rank of the solution matrices. The robustness of sparsity threshold θ and termination parameter µ will also be analyzed. The quality of the estimator b C is evaluated via four performance metrics listed as follows. (1) Normalized Prediction Error defined as Normalized Prediction Error = Ytest Xtest b C F (2) Normalized Parameter Estimation Error defined as Normalized Parameter Estimation Error = b C C F where C is the true parameter matrix. Zheng, Bahadori, Liu and Lv (3) Rank Recovery Error defined as Rank Recovery Error = |rank(b C) rank(C)|. Since the solution of the nuclear norm always leads to small non-zero singular values (which prevents b C from being low-rank), we threshold the singular values of b C that are more than 100 times smaller than its largest singular value to have a fair comparison. (4) Support Recovery AUC, that is, the area under the receiver operating characteristic (ROC) curve of comparing support of b C with the ground truth, which is always between 0 and 1. It is computed by varying the decision threshold and obtaining the false positive and true positive curve. Then the area under the false positive and true positive curve is reported as AUC. The value of AUC indicates the probability that a procedure assigns a higher value to a randomly chosen non-zero element than a randomly chosen zero element (Hanley and Mc Neil, 1982). It is an appropriate metric for measuring support recovery accuracy because in sparse support recovery we have more zeros than non-zeros which inflates the result of the simple 0-1 accuracy measure. In contrast, AUC is more robust to imbalanced positive/negative prediction labels. Table 1 shows the results of all algorithms on a variety of regimes by varying the dimensionality p and the rank r. We can see that SEED is superior or comparable to the baseline algorithms across all four measures. As the results show and the theory predicts, in most high-dimensional cases, nuclear norm usually overestimates the true rank of the matrix. Furthermore, we find that the iterative SVD procedure in the RCGL algorithm often results in significant underestimation of the true rank, when the true rank is large. Note that in addition to accuracy, SEED also significantly reduces the variance of the estimation. Figure 1a shows the solution path for SEED on one example data set (p = 400, r = 5, q = 200, ρ = 0.5, and n = 100). The corresponding singular values are set to 30, 27, 24, 21, and 18. In the horizontal axis, we show the termination parameter µ normalized by Y 2 F /(nq). We can see that SEED can identify the correct rank with medium values of µ. Figure 1b shows the solution path for the top left singular vector u1 of C on an example data set (p = 200, q = 100, n = 50, ρ = 0.5, and r = 1). Both of the solution paths indicate that SEED is robust to the particular choice of parameters and in a large range of parameters SEED is able to successfully recover the true rank of the matrix and the support of the singular vectors. 5.1.2. Simulation Example 2 In order to study scalability of SEED, we conduct the experiments on two computing environment, including: (1) an off-the-shelf personal computer (PC) and (2) a graphics processing unit (GPU), to demonstrate the runtime efficiency and the parallelization capability of SEED. First, we run our experiments on an off-the-shelf PC with Intel i7 at 3.4GHz and 8GB of memory. The system runs MATLAB R2013b on the Windows operating system. We generate 5 data sets with r = 1, non-zero ratio of 10%, q = 1000, and n = 1000. Figure 2a shows the average CPU runtime of three algorithms as the dimension p increases. We can see that SEED can achieve a speed up of 10-100 times in runtime compared with baseline methods. p Normalized Normalized Rank Support Algorithm Prediction Estimation Recovery Recovery Error Error Error AUC n = 100, q = 200, r = 3, and ρ = 0.5 SEED 0.039 (0.013) 0.012 (0.006) 0.06 (0.445) 0.980 (0.024) ADMM 0.041 (0.012) 0.015 (0.004) 0.01 (0.100) 0.975 (0.026) RCGL 0.053 (0.013) 0.040 (0.009) 0.03 (0.171) 0.979 (0.028) SEED 0.026 (0.005) 0.011 (0.002) 0 (0) 0.970 (0.015) ADMM 0.035 (0.007) 0.035 (0.008) 0.06 (0.278) 0.961 (0.018) RCGL 0.158 (0.050) 0.193 (0.066) 0 (0) 0.934 (0.027) SEED 0.019 (0.003) 0.010 (0.002) 0 (0) 0.970 (0.014) ADMM 0.076 (0.020) 0.127 (0.031) 4.14 (1.614) 0.954 (0.020) RCGL 0.482 (0.072) 0.603 (0.074) 0.01 (0.100) 0.834 (0.030) SEED 0.015 (0.002) 0.010 (0.002) 0 (0) 0.971 (0.011) ADMM 0.072 (0.015) 0.145 (0.024) 3.02 (0.141) 0.968 (0.010) RCGL 0.698 (0.054) 0.820 (0.022) 0.09 (0.379) 0.809 (0.047) SEED 0.014 (0.001) 0.010 (0.002) 0 (0) 0.973 (0.011) ADMM 0.066 (0.010) 0.136 (0.016) 3 (0) 0.970 (0.011) RCGL 0.746 (0.041) 0.857 (0.021) 0.2 (0.603) 0.789 (0.045) n = 100, q = 200, r = 30, and ρ = 0.5 SEED 0.010 (0.001) 0.043 (0.003) 0.29 (0.456) 0.989 (0.003) ADMM 0.015 (0.002) 0.066 (0.006) 48.51 (0.882) 0.964 (0.027) RCGL 0.058 (0.010) 0.220 (0.017) 0.04 (0.243) 0.774 (0.064) SEED 0.035 (0.018) 0.155 (0.082) 1.17 (1.092) 0.770 (0.023) ADMM 0.037 (0.002) 0.139 (0.007) 42.34 (3.085) 0.783 (0.018) RCGL 0.395 (0.029) 0.558 (0.023) 0.02 (0.200) 0.610 (0.010) SEED 0.003 (0.000) 0.005 (0.000) 0.02 (0.141) 0.999 (0.000) ADMM 0.014 (0.003) 0.022 (0.004) 9.34 (2.388) 0.999 (0.001) RCGL 0.286 (0.037) 0.429 (0.020) 0.13 (0.733) 0.966 (0.004) SEED 0.003 (0.000) 0.007 (0.000) 0 (0) 0.999 (0.000) ADMM 0.021 (0.001) 0.088 (0.004) 23.61 (1.348) 0.999 (0.000) RCGL 0.315 (0.024) 0.457 (0.017) 2.31 (3.243) 0.970 (0.002) SEED 0.002 (0.000) 0.007 (0.000) 0 (0) 0.999 (0.000) ADMM 0.020 (0.001) 0.091 (0.003) 22.54 (1.275) 0.999 (0.000) RCGL 0.335 (0.019) 0.472 (0.012) 3.10 (3.538) 0.974 (0.002) Table 1: Simulation Results Zheng, Bahadori, Liu and Lv Singular Value Parameter Value (b) Support Figure 1: (a) Solution path for the singular values of the estimated matrices. The plot show the value of top five singular values of the solution b C as we change the stopping error µ. (b) Solution path for the top left singular vector u of the estimated matrices. Only seven coefficients are non-zero. The range of the parameters are generated as follows: µ = logspace( 5, 1, 5) and θ = logspace( 1, log10(20), 10), where logspace(a, b, n) indicates the minimum value 10a, maximum value 10b, and total number n. 200 400 600 800 1000 1200 1400 10 1 Dimension p Run Time (Seconds) SEED LN ADMM RCGL 500 1000 2000 3000 5000 10 1 Dimension p Run Time (Seconds) SEED LN ADMM RCGL Figure 2: Speedup by SEED on (a) CPU and (b) GPU devices. Note that the vertical axis is in logarithmic scale. Next, in order to test scalability of SEED in extremely large data sets, we use a machine that is equipped with a Tesla K40 GPU which has 2880 processing cores at 745MHz and 12GB of memory. We perform our experiments with MATLAB R2013b on a Debian Linux operating system. GPUs are built to have many less-powerful processing units which makes them ideal for parallel implementation (Bekkerman et al., 2012, Chapter 5). Therefore we apply the two-step fast eigenvalue decomposition described in Section 3, which involves only simple matrix operation and can be paralleled easily. The experiment results shown in 2000 4000 6000 8000 10000 0 Dimension p Run Time (Seconds) 2000 4000 6000 8000 10000 0 Dimension p Parameter Estimation RMSE Figure 3: Scalability experiments on very large data sets on GPU using the fast approach. Accuracy results are normalized. Figures 2b and 3 are obtained under the setting of q = 10000, n = 5000, r = 1 and non-zero ratio of 10%. The results indicate that while SEED is fast on the GPU, it also achieves reasonable accuracy. Note that the results show that SEED is able to estimate a sparse and low-rank matrix with 108 elements in less than a minute, confirming its extreme scalability. 5.2. Real Data Analysis Diffusion Network Inference, that is, the task of inferring influence networks from user activities, is one of the central tasks in social networks analysis (Gomez-Rodriguez et al., 2012) because it helps improve social marketing by finding the influential users in a network. It is a challenging problem because: (i) in many social networks the influence is expressed implicitly (Gomez-Rodriguez et al., 2012) and (ii) empirical studies show that common metrics such as number of friends or followers fail to accurately measure the social influence of the users (Cha et al., 2010). A popular computational approach in estimating social influence among users is to count the number of users activities over a time span (in regularly or irregularly spaced intervals) and analyze the resulting time series data (Truccolo et al., 2005). Many different models have been developed, among which the vector auto-regressive model arises as a simple and robust solution (Trusov et al., 2009; Bahadori and Liu, 2013). That is, every user i is described by a time series xi(t) for t = 1, . . . , T and i = 1, . . . , q, such that j=1 βT i,jxt,Lagged j + εi(t), where βi,j is the vector of coefficients modeling the effects of time series xj, xt,Lagged j = [xj(t L), . . . , xj(t 1)]T is the history of xj up to time t with L the maximal time lag, and εi(t) is the random noise at time t. Denoting by x(t) = [x1(t), . . . , xq(t)]T , we have the following multi-response regression model: x(t) = BT xt,Lagged + ε(t), Zheng, Bahadori, Liu and Lv 1 2 3 4 5 0.45 Graph Recovery AUC SEED RCGL LN-ADMM Figure 4: The graph recovery accuracy of the algorithms as the rank of solution varies. where the predictor vector xt,Lagged = [(xt,Lagged 1 )T , . . . , (xt,Lagged q )T ]T , BT = βT i,j 1 i,j q is the evolution matrix, and ε(t) = [ε1(t), . . . , εq(t)]T . The influence network can be built from the evolution matrix by establishing an edge from node j to node i if βi,j 1 is significantly larger than zero. In this experiment, we gather a Twitter data set with tweets on the Haiti earthquake and apply vector auto-regressive model to identify the potential top influencers on this topic (that is, those Twitter accounts with the largest impact on the others). We divide the 17 days after the Haiti Earthquake on Jan. 12, 2010 into 1000 uniformly spaced intervals and generate a multivariate time series data set by counting the number of tweets on this topic for the top 1000 users who tweeted most about it. For accurate modeling, we remove the users that were highly correlated with each other, most of which were operated by the same users and tweeted exactly the same content. We also remove robot-like user-accounts which tweeted on very regular intervals, which in total led to a subset of 270 users. We analyze this data by a VAR model with the maximal time lag L = 5 based on the intuition about the maximum retweeting delay, which requires estimation of a q = 270 dimensional response vector using p = 1350 predictors while we have only n = 995 observations. Since we do not have access to the true influence network, we use the retweet network as a surrogate of the ground truth following the evaluation convention in the social networks community. The retweet network is constructed by adding an edge from user i to user j if user j has retweeted at least 4 of the tweets of user i. Clearly, the retweet network is not the actual underlying temporal dependency graph, mainly because there are possible implicit influence patterns as well. However, it is the best possible metric that we could obtain for graph estimation accuracy evaluation in our data set (Cha et al., 2010). The retweet network for the 270 selected users is sparse; it has only 0.11% of possible edges. We apply SEED, ADMM, and RCGL algorithms to uncover the influence network in our twitter data set. Figure 4 shows the accuracy of the procedures in uncovering the true influence network in terms of AUC. For every value of the rank parameter, we tune the sparsity by 5-fold cross-validation. Given the fact that exact rank constraint cannot be enforced directly in the ADMM algorithm, we find the best value of the nuclear norm regularization parameter λL by 5-fold cross-validation. Then, we compute the low-rank approximations of the parameter matrix and evaluate the accuracy at each rank. SEED ADMM RCGL 2.83 127.34 256.87 Table 2: Run time (in seconds) of the algorithms on the application data set. The results in Figure 4 show that SEED significantly outperforms the baseline procedures. They also indicate that, in all of the algorithms, as we increase the rank of the solution matrix, the accuracy is improved initially and then quickly saturates. SEED achieves the highest accuracy when the rank is 4. Note that this result also confirms other studies that the social network connections may be strongly influenced by a few unobserved exogenous variables (Myers et al., 2012). The results in Table 2 demonstrate the significant speedup achieved by SEED compared to the baselines. 6. Discussion In this paper, we propose to convert the problem of sparse reduced-rank regression into a sparse generalized eigenvalue problem, which allows us to efficiently employ the recently developed sparse eigenvalue decomposition techniques. After this transformation, the left singular vectors can be estimated in simple steps and the estimation of both sparse and dense right singular vectors is unified in a single framework. As a pure learning algorithm, SEED deviates from traditional regularization frameworks (that is, a loss function plus certain penalties), leading to computational efficiency and scalability. Furthermore, we prove that SEED achieves nice estimation and prediction accuracy similar to the minimax error bound in the univariate regression setting (Raskutti et al., 2011). Some interesting problems for future research include extending the current formulation of the regression coefficient matrix in (2) to the case where the singular values can be repeated such that the left singular vectors (which correspond to latent factors) are not identifiable. Then we will need to estimate the eigenspaces spanned by important singular vectors and characterize the estimation accuracy by some new criterion, such as the one in Cai et al. (2013) and Ma (2013). Another research direction is to explore the theory of random design matrices and this can be addressed by using an extended version of perturbation theory (Lemma 6), where the perturbation in P is also included in the analysis. Moreover, it is computationally straightforward to extend SEED to the generalized linear models by adapting the sequential quadratic programming framework. For this extension, we first approximate the loss function by the quadratic loss function and find the optimal unit rank matrix. Then we can add the unit rank matrix to the solution and re-approximate the loss function with another quadratic function around this new solution. By performing these three steps sequentially, we can efficiently estimate the low-rank coefficient matrix. Statistical properties of such estimator can be analyzed by extending the results in Lozano et al. (2011) for greedy sparse procedures to reduced-rank regression. Acknowledgments Zheng, Bahadori, Liu and Lv We would like to acknowledge support for this project from NSF CAREER Awards DMS0955316 and IIS-1254206, NSFC-11601501, 11671374 and 71731010, Anhui Provincial Natural Science Foundation-1708085QA02, Fundamental Research Funds for Central Universities WK2040160028, Okawa Foundation Research Award, a grant from the Simons Foundation, and Adobe Data Science Research Award. The authors would like to thank Emre Demirkaya and Sanjay Purushotham for helpful discussions. The authors also sincerely thank the Action Editor and referees for their valuable comments that helped improve the article substantially. Appendix A. Proofs of Theorems 1 and 2 We need the following two lemmas in the proofs of Theorems 1 and 2. Lemma 5 Suppose ε N(0q, Σ) and γ2 = maxj Σjj. We have stacked n realization of these random vectors in the rows of n q matrix E. Denote by P 2 = maxj[ 1 n XT X]jj given a deterministic n p matrix X. Then for any δ (0, 1), with probability at least 1 δ, 1 n ET X 2,s Lemma 6 (Perturbation Theory for Sparse Symmetric Generalized Eigenvalue Problems) Suppose Q, P S+ p are two p p semi-definite matrices. For the following generalized eigenvalue problem and its perturbed variant with sparse eigenvectors Qu = λPu, (10) (Q + δQ)bu = bλPbu, (11) where u, bu Ker(P) with u 0 s , bu 0 s and s < s, under Condition 1, we have uniformly over k = 1, , p dim{Ker(P)}, |bλk λk| φ 1 s δQ 2,s. (12) Here λk and bλk are the kth largest eigenvalues of equations (10) and (11), respectively, δQ 2,s denotes the s-sparse largest singular value of δQ, and φs is defined in Condition 1. Furthermore, denote by uk and buk the unit length sparse eigenvectors correspond to λk and bλk, respectively, with buk taking the correct directions. When there exists some positive eigengap dλ which is the minimum difference between non-zero eigenvalues of equation (10) and the perturbation of Q satisfies δQ 2,s = o(dλ), then uniformly over k such that λk = 0, 2φ 2 s d 1 λ δQ 2,s + o(d 1 λ δQ 2,s). (13) A.1. Proof of Theorem 1 The proof is established based on analyzing the impact of perturbation of the matrices in the solution of equation (3). To do so, in the first step, we bound the amount of perturbation in the matrix of the generalized eigenvalue problem in Proposition 1. In the second step, using the eigenvalue perturbation theory (Lemma 6), we derive the error bound for buk. Using the bound for buk, in the third step, the error bound for bvk is calculated. Step 1: Bounding the perturbation of matrix Q. For notational simplicity, define the noise free response variables as Y = XC , Q = 1 qn2 XT Y Y T X and its noisy version b Q = 1 qn2 XT YYT X. Denote by b Q Q 2,s the s-sparse largest singular value of b Q Q. Since Y = Y + E, we can derive the following bound, b Q Q 2,s = 1 qn2 XT YYT X XT Y Y T X 2,s 2 n2q XT Y ET X 2,s + 1 n2q ET X 2 2,s, where the last inequality is due to the expansion of XT YYT X and an application of the triangular inequality. By Condition 1 and Proposition 1, we have X 2,s nφ 1/2 s and XT Y 2 X 2,s Y 2 = X 2,s nqσ1 n qσ1φ 1/2 s , (14) where the first inequality holds since the unit length vector v satisfying XT Y v 2 = XT Y 2 must be one of the v k in (2) due to the strict separation of the singular values by Condition 2, so that XT Y v k = XT XC v k with C v k yielding an s-sparse vector for XT X. Therefore, we get XT Y ET X 2,s 2 n2q XT Y 2 ET X 2,s 2σ1φ 1/2 s n q Let a = σ1φ 1/2 s . It follows that b Q Q 2,s 2a n q ET X 2,s + 1 n2q ET X 2 2,s. (15) Step 2: Error bounds for buk and Xbuk. Using Lemma 5, for any δ (0, 1) with probability at least 1 δ, we have 1 n q ET X 2,s γP δ 0 , substituting the above bound in equation (15) yields b Q Q 2,s γPa r Therefore, under Conditions 1 and 2, applying Lemma 6 with b Q = 1 qn2 XT YYT X, Q = 1 qn2 XT Y Y T X, and P = 1 n XT X gives the following bound for buk, buk u k 2 4γPa Zheng, Bahadori, Liu and Lv where Cu = 4γPa dσφ2s = 4γPσ1 dσφ5/2 s is a constant. Since buk uk 0 2s, further applying Condition 1 n X(buk u k) 2 φ 1/2 s Cu Step 3: Error bound for bvk. With the estimated left singular vector buk, we can further estimate the corresponding right singular vector as bvk = 1 bu k XT Xbuk YT Xbuk, and compare it to v k, which by Proposition 1 can be expressed as v k = 1 u k XT Xu k Y T Xu k. For simplicity of notation, we drop the index k of vk and uk and write bv v 2 = 1 u XT Xu e1 Y T Xu e2 1 u XT Xu Y T Xu 2 , where e1 u XT Xu bu XT Xbu and e2 Y T Xu YT Xbu. Using the Taylor expansion of 1 1 x at x = 0, we can write u XT Xu e1 u XT Xu 2 + e2 u XT Xu = (u XT Xu /n) 1 ( v 2|e1|/n + e2 2/n) + Te φ 1 s ( v 2|e1|/n + e2 2/n) + Te, (17) where Te = v 2|e1|/n u Pu + e2 2/n u Pu P ℓ=1 |e1|/n u Pu ℓ denotes the higher order terms in the Taylor expansion and the last step is by Condition 1. Bounding |e1|: Let bu = u + δu. Since u 0 s , bu 0 s and s < s, we have δu 0 2s. As u is a unit length vector, it yields from Condition 1 that |e1|/n = 1 n(u + δu) XT X(u + δu) 1 = |2u Pδu + δT u Pδu| φ 1 s (2 δu 2 + δu 2 2). (18) Bounding e2 2: Similarly, let YT X = Y T X + ET X. We obtain e2 2/n = (Y T X + ET X)(u + δu) Y T Xu 2/n Y T Xδu 2/n + ET Xu 2/n + ET Xδu 2/n a q δu 2 + ET X 2,s/n + ET X 2,s δu 2/n, (19) where in the last step we used Y T Xδu 2/n a q δu 2 similarly as in (14). Since 1 q v k 2 V , substituting inequalities (18) and (19) in inequality (17) with reorganization yields bvk v k 2 V q δu 2 2 + δu 2 + 1 nφs ET X 2,s(1 + δu 2) + φ 1 s a q δu 2 + Te. When s log(pq/δ) = o(n), we can see that both the events E1 = n δu 2 = O q and E2 = n Te = O q δ o occur, if the event E0 = n 1 n q ET X 2,s = O q holds for some δ (0, 1), where in the second event we have used the results in inequalities (18) and (19) together with an application of geometric series sum. Note that the first term of Te is O q δ and the common ratio is O q Therefore, by applying the result in Lemma 5 for bounding ET X 2,s/n and the result for the estimation error bound of buk u k 2 in Step 2, we can conclude with probability at least 1 δ that bvk v k 2 (2V φ 1 s + a )Cu + γP 2{(2V φ 1 s + a )Cu + γP} where Cv = 2 2φ 1 s n (2V φ 1/2 s + σ1)φ 1/2 s Cu + γP o . Step 4: Error bounds for 1 q b Ck C k F and 1 qn X(b Ck C k) F . With the estimation error bounds of buk and bvk, we write 1 qn X(C k b Ck) 2 F = 1 q trace{(C k b Ck)T P(C k b Ck)} = e1 + e2 + e3, (20) where the last equality follows from the decomposition C k b Ck = u kv T k bukbv T k = (u k buk)v T k + buk(v k bvk)T (21) q trace{v k(u k buk)T P(u k buk)v T k }, q trace{(v k bvk)bu T j P(u k buk)v T k }, q trace{(v k bvk)bu T k Pbuk(v k bvk)T }. Zheng, Bahadori, Liu and Lv We will bound them separately. Since the estimation error bounds of buk and bvk, k = 1, , r , hold with probability at least 1 δ. It yields that q trace{(u k buk)T P(u k buk)v T k v k} = V 2 (u k buk)T P(u k buk) V 2φ 1 s u k buk 2 2 V 2φ 1 s C2 u s where the first inequality follows from Condition 1 and the fact u k buk 0 2s. Similarly, we have e2 2V φ 1 s Cu Cv s e3 φ 1 s C2 v s In view of (20), the above bounds together give 1 qn X(b Ck C k) 2 F φ 1 s (V Cu + Cv)2 s Thus we have 1 qn X(b Ck C k) F (V Cu + Cv) By a similar but simpler argument, it follows from the decomposition (21) that 1 q C k b Ck F 1 q (u k buk)v T k F + 1 q buk(v k bvk)T F = 1 q u k buk 2 v k 2 + 1 q buk 2 v k bvk 2 (V Cu + Cv) r s It concludes the proof. A.2. Proof of Theorem 2 The main idea for proving the rank consistency result is noting that bσ2 k = (bu k b Qbuk)/(bu k Pbuk) which relates to the termination criteria is in fact the kth largest eigenvalue of the problem b Qu = λPu. Thus, we will show that the amount of decrease in the objective function in the kth step equals to the the kth largest eigenvalue of the problem b Qu = λPu. Given the perturbation bounds in Lemma 6, we can bound its difference from the true eigenvalue and show that after r greedy steps, the eigenvalues become almost zero. To prove the result in Theorem 2, we need to bound log Lk 1 log Lk, where Lk = 1 nq Y XCk 2 F is the objective function with Ck = Pk j=1 b Cj the estimated coefficient matrix up to the kth step. By using the fact that 1 1 x log(x) x 1 for x > 0, we can write Lk 1 log Lk 1 Next, we show that the amount of the decrease in the objective function in the kth step is equal to the the kth largest eigenvalue of the problem b Qu = λPu, which satisfies the following equality Lk 1 Lk = 1 nq Y XCk 1 2 F 1 nq Y XCk 2 F nq Y XCk 1 2 F 1 nq Y XCk 1 Xbukbv k 2 F 2 D Y XCk 1, Xbukbv k E Xbukbv k 2 F . Now, using the P-orthogonality of buk s, we have Lk 1 Lk = 1 2 D Y, Xbukbv k E Xbukbv k 2 F 2 bu k XT YYT Xbuk bu k XT Xbuk bu k XT YYT Xbuk bu k XT Xbuk = bu k b Qbuk bu k Pbuk = bσ2 k, where the second step is due to the substitution of the solution for bvk in (6) and a few steps of algebraic rearrangement. Underfitted regime k r . Under Condition 1, using the perturbation bound in Lemma 6 for the eigenvalues σ2 k s with k r , we can write bσ2 k σ2 k φ 1 s Q 2,s. Further applying inequality (16), we know that there exists some positive constant C such that φ 1 s Q 2,s 2a φ 1 s γP Therefore, it follows that with probability at least 1 δ, bσ2 k σ2 k C r s Then we can derive the following lower bound 1 nq Pr j=k XC j F + Pk 1 j=1 X(C j b Cj) F + E F , (23) where C j = u jv T j , b Cj = bujbvj, in the nominator we have used the result obtained in (22) and the denominator is the result of the triangular inequality. We will then bound the three terms in the denominator successively. Zheng, Bahadori, Liu and Lv For the first term, given the definition of C j = u jv T j , v T j v j = qa2 j and u T j Pu j = c2 j, we have XC j F nq = trace(Xu jv T j v ju T j XT ) u T j XT Xu j n = ajcj = σj. It follows that 1 nq Pr j=k XC j F = Pr j=k σj. To bound the second term, by Theorem 1, we have 1 nq X(C j b Cj) 2 F φ 1 s (V Cu + Cv)2 s which gives X(C j b Cj) F nq (k 1)(V Cu + Cv) To bound the last term, as the components of EΣ 1/2 are independent and identically distributed with the standard Gaussian distribution, given the tail bound for the χ2 distribution in (Laurent and Massart, 2000, Lemma 1), we can see that with probability at least 1 δ, EΣ 1/2 2 F /nq 1 + 2 r 1 Moreover, by Condition 3, we have EΣ 1/2 2 F = i=1 Ei:Σ 1/2 2 2 i=1 Ei: 2 2/γ2 u = E 2 F /γ2 u. E F / nq γu Thus, applying the union bound, we can see that with probability at least 1 2δ, the results in Theorem 1 including inequality (16) , inequalities (24) and (25) hold simultaneously. In view of (23), it yields that Lk 1 σk + O q Pr j=k σj + γu + O r q where we used the inequality b for 0 x b in the numerator. Let δ 1 = O{(pq)α} for some positive constant α > 1. Since Pr j=k σj (r k + 1)σk, γu cγσr cγσk by Condition 3 and r n s log(pq) n o1/4 = o(1), we get Lk 1 1 r k + 1 + cγ + O r r s = 1 r k + 1 + cγ + o 1 Overfitted regime k > r . Similar to the previous argument, by Condition 1 and Lemma 6, with probability at least 1 δ, for all k > r we have bσ2 k φ 1 s Q 2,s C r s Then we derive the following upper bound 1 nq E F Pr j=1 X(C j b Cj) F Pk j=r +1 Xb Cj F . (29) Bounds on the three terms in the denominator will be derived successively. First, another application of the tail bound for the χ2 distribution in (Laurent and Massart, 2000, Lemma 1) gives that with probability at least 1 δ, EΣ 1/2 2 F /nq 1 2 r 1 Similarly, together with Condition 3, we have E F / nq γl Moreover, by inequality (24), we get X(C j b Cj) F O r r s For the last term, by the estimation procedure of bvj in equation (6) and the fact that buj is the eigenvector of the generalized eigenvalue problem b Qu = λPu with respect to the jth largest eigenvalue bσ2 j , we have 1 nq Xb Cj 2 F = 1 nqtrace(bvjbu T j XT Xbujbv T j ) = 1 nq(bu T j XT Xbuj)bv T j bvj bu T j XT YYT Xbuj bu T j XT Xbuj = bu T j b Qbuj bu T j Pbuj = bσ2 j . Zheng, Bahadori, Liu and Lv Thus, it follows from inequality (28) that 1 nq Xb Cj F = j=r +1 bσj O (r r ) s In view of inequality (29), using the same argument as in (26), the above bounds yield δ + O n (r r ) s Since δ 1 = O{(pq)α}, r n s log(pq) n o1/4 = o(1) and r n s log(pq) n o1/4 = o(1), we conclude that In view of the bounds in (27) and (30), using the union bound, it is not difficult to see that with probability at least 1 3δ (both bounds are based on the event in Lemma 5 such that the results in Theorem 1 hold), the following bounds hold, For k r : log Lk 1 1 (r k + 1 + cγ)2 + o (r ) 2 , For k > r : log Lk 1 Thus, given the information criterion Cn = rank(b C) p log(pq) log log n + n log Ln and the assumption (r )2p log(pq) = o( n/ log log n), by setting δ = 1 3(pq) α, we can show that the following statements hold with probability at least 1 (pq) α: Cn(rank(b C) = k 1) Cn(rank(b C) = k) > 0, if k r ; Cn(rank(b C) = k 1) Cn(rank(b C) = k) < 0, if k > r . The above equations indicate that Cn will attain its minimum value when rank(b C) = r , which means the algorithm will stop at rank(b C) = r with probability at least 1 (pq) α for sufficiently large n, which concludes the proof of Theorem 2. Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2): 1171 1197, 2012. Francis R. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, 9(2):1019 1048, 2008. M. Taha Bahadori and Yan Liu. An examination of practical granger causality inference. SIAM International Conference on Data Mining, pages 467 475, 2013. Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53 58, 1989. Ron Bekkerman, Mikhail Bilenko, and John Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2012. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1 122, 2010. Peter B uhlmann and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011. Florentina Bunea, Yiyuan She, and Marten H. Wegkamp. Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics, 39(2):1282 1309, 2011. Florentina Bunea, Yiyuan She, and Marten H. Wegkamp. Joint variable and rank selection for parsimonious estimation of high-dimensional matrices. The Annals of Statistics, 40 (5):2359 2388, 2012. Jian-Feng Cai, Emmanuel J. Cand es, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956 1982, 2010. Tony T. Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074 3110, 2013. Emmanuel J. Cand es and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342 2359, 2011. Emmanuel J. Cand es and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203 4215, 2005. Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, and Krishna P. Gummadi. Measuring user influence in twitter: The million follower fallacy. International AAAI Conference on Weblogs and Social, 10:10 17, 2010. Venkat Chandrasekaran, Pablo Parrilo, and Alan S. Willsky. Latent variable graphical model selection via convex optimization. Annual Allerton Conference on Communication, Control, and Computing, pages 1610 1613, 2010. Kun Chen and Kung-Sik Chan. A note on rank reduction in sparse multivariate regression. Journal of Statistical Theory and Practice, 10(1):100 120, 2015. Kun Chen, Kung-Sik Chan, and Nils Chr. Stenseth. Reduced rank stochastic regression with a sparse singular value decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):203 221, 2012. Kun Chen, Hongbo Dong, and Kung-Sik Chan. Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100(4):901 920, 2013. Zheng, Bahadori, Liu and Lv Fernando De La Torre and Michael J. Black. A framework for robust subspace learning. International Journal of Computer Vision, 54(1-3):117 142, 2003. Konstantinos I. Diamantaras and Sun Y. Kung. Principal Component Neural Networks: Theory and Applications. John Wiley & Sons, 1996. Varun R. Embar, Rama K. Pasumarthi, and Indrajit Bhattacharya. A Bayesian framework for estimating properties of network diffusions. International Conference on Knowledge Discovery and Data Mining, pages 1216 1225, 2014. Yingying Fan and Jinchi Lv. Asymptotic equivalence of regularization methods in thresholded parameter space. Journal of the American Statistical Association, 108(503):1044 1061, 2013. Yingying Fan and Chengyong Tang. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):531 552, 2013. Jiashi Feng, Zhouchen Lin, Huan Xu, and Shuicheng Yan. Robust subspace segmentation with block-diagonal prior. IEEE Conference on Computer Vision and Pattern Recognition, pages 3818 3825, 2014. Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data, 5(4):21, 2012. James A. Hanley and Barbara J. Mc Neil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29 36, 1982. Alan J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248 264, 1975. Willard James and Charles Stein. Estimation with quadratic loss. Berkeley Symposium on Mathematical Statistics and Probability, 1:361 379, 1961. Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2):295 327, 2001. Iain M. Johnstone and Arthur Y. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486): 682 693, 2009. U Kang, Brendan Meeder, and Christos Faloutsos. Spectral analysis for billion-scale graphs: discoveries and implementation. Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 13 25, 2011. B eatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302 1338, 2000. Jing Lei and Vincent Q. Vu. Sparsistency and agnostic inference in sparse PCA. The Annals of Statistics, 43(1):299 322, 2015. Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: an approach to modeling networks. Journal of Machine Learning Research, 11(3):985 1042, 2008. Aur elie C. Lozano, Grzegorz Swirszcz, and Naoki Abe. Group orthogonal matching pursuit for logistic regression. International Conference on Artificial Intelligence and Statistics, pages 452 460, 2011. Zongming Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772 801, 2013. Aditya Mishra, Dipak K. Dey, and Kun Chen. Sequential co-sparse factor regression. Journal of Computational and Graphical Statistics, 26(4):814 825, 2017. Seth A. Myers, Chenguang Zhu, and Jure Leskovec. Information diffusion and external influence in networks. International Conference on Knowledge Discovery and Data Mining, pages 33 41, 2012. Sahand Negahban and Martin J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069 1097, 2011. Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for highdimensional linear regression over ℓq balls. IEEE Transactions on Information Theory, 57(10):6976 6994, 2011. Emile Richard, St ephane Ga ıffas, and Nicolas Vayatis. Link prediction in graphs with autoregressive features. Journal of Machine Learning Research, 15(1):489 517, 2012. Dan Shen, Haipeng Shen, and J. S. Marron. A general framework for consistency of principal component analysis. Journal of Machine Learning Research, 17(1):5218 5251, 2016. Kate Starbird and Leysia Palen. (How) will the revolution be retweeted?: information diffusion and the 2011 Egyptian uprising. ACM Conference on Computer Supported Cooperative Work, pages 7 16, 2012. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 58(1):267 288, 1996. Wilson Truccolo, Uri T. Eden, Matthew R. Fellows, John P. Donoghue, and Emery N. Brown. A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93(2):1074 89, 2005. Michael Trusov, Randolph E. Bucklin, and Koen Pauwels. Effects of word-of-mouth versus traditional marketing: findings from an internet social networking site. Journal of Marketing, 73(5):90 102, 2009. Raja Velu and Gregory C. Reinsel. Multivariate Reduced-Rank Regression: Theory and Applications. Springer Science & Business Media, 2013. Zheng, Bahadori, Liu and Lv Yu-Xiang Wang, Huan Xu, and Chenlei Leng. Provable subspace clustering: when LRR meets SSC. International Conference on Neural Information Processing Systems, pages 64 72, 2013. Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49 67, 2006. Ming Yuan, Ali Ekici, Zhaosong Lu, and Renato Monteiro. Dimension reduction and coefficient estimation in multivariate linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3):329 346, 2007. Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(1):899 925, 2013. Tong Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Transactions on Information Theory, 57(7):4689 4708, 2011. Zemin Zheng, Yingying Fan, and Jinchi Lv. High dimensional thresholded regression and shrinkage effect. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3):627 649, 2014. Ke Zhou, Hongyuan Zha, and Le Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. International Conference on Artificial Intelligence and Statistics, 31:641 649, 2013. Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418 1429, 2006.