# Low-Rank Regression with Tensor Responses

Guillaume Rabusseau and Hachem Kadri
Aix Marseille Univ, CNRS, LIF, Marseille, France
{firstname.lastname}@lif.univ-mrs.fr

Abstract

This paper proposes an efficient algorithm (HOLRR) to handle regression tasks where the outputs have a tensor structure. We formulate the regression problem as the minimization of a least squares criterion under a multilinear rank constraint, a difficult non-convex problem. HOLRR efficiently computes an approximate solution of this problem, with solid theoretical guarantees. A kernel extension is also presented. Experiments on synthetic and real data show that HOLRR computes accurate solutions while being computationally very competitive.

1 Introduction

Recently, there has been an increasing interest in adapting machine learning and statistical methods to tensors. Data with a natural tensor structure are encountered in many scientific areas including neuroimaging [30], signal processing [4], spatio-temporal analysis [2] and computer vision [16]. Extending multivariate regression methods to tensors is one of the challenging tasks in this area. Most existing works extend linear models to the multilinear setting and focus on the tensor structure of the input data (e.g. [24]). However, little has been done to investigate learning methods for tensor-structured output data.

We consider a multilinear regression task where outputs are tensors; such a setting can occur in the context of e.g. spatio-temporal forecasting or image reconstruction. In order to leverage the tensor structure of the output data, we formulate the problem as the minimization of a least squares criterion subject to a multilinear rank constraint on the regression tensor. The rank constraint enforces the model to capture low-rank structure in the outputs and to explain dependencies between inputs and outputs in a low-dimensional multilinear subspace. Unlike previous work (e.g. [22, 24, 27]) we do not rely on a convex relaxation of this difficult non-convex optimization problem. Instead we show that it is equivalent to a multilinear subspace identification problem for which we design a fast and efficient approximation algorithm (HOLRR), along with a kernelized version which extends our approach to the nonlinear setting (Section 3). Our theoretical analysis shows that HOLRR provides good approximation guarantees. Furthermore, we derive a generalization bound for the class of tensor-valued regression functions with bounded multilinear rank (Section 3.3). Experiments on synthetic and real data are presented to validate our theoretical findings and show that HOLRR computes accurate solutions while being computationally very competitive (Section 4). Proofs of all results stated in the paper can be found in supplementary material A.

Related work. The problem we consider is a generalization of the reduced-rank regression problem (Section 2.2) to tensor-structured responses. Reduced-rank regression has its roots in statistics [10] but it has also been investigated by the neural network community [3]; non-parametric extensions of this method have been proposed in [18] and [6]. In the context of multi-task learning, a linear model using a tensor-rank penalization of a least squares criterion has been proposed in [22] to take into account the multi-modal interactions between tasks.
They propose an approach relying on a convex relaxation of the multilinear rank constraint using the trace norms of the matricizations, and a non-convex approach based on alternating minimization. Nonparametric low-rank estimation strategies in reproducing kernel Hilbert spaces (RKHS) based on a multilinear spectral regularization have been proposed in [23, 24]. Their method is based on estimating the regression function in the tensor product of RKHSs and is naturally adapted for tensor covariates. A greedy algorithm to solve a low-rank tensor learning problem has been proposed in [2] in the context of multivariate spatio-temporal data analysis. The linear model they assume is different from the one we propose and is specifically designed for spatio-temporal data. A higher-order extension of partial least squares (HOPLS) has been proposed in [28] along with a kernel extension in [29]. While HOPLS has the advantage of taking the tensor structure of the input into account, the questions of approximation and generalization guarantees were not addressed in [28]. The generalization bound we provide is inspired by works on matrix and tensor completion [25, 19].

2 Preliminaries

We begin by introducing some notations. For any integer $k$ we use $[k]$ to denote the set of integers from 1 to $k$. We use lower case bold letters for vectors (e.g. $v \in \mathbb{R}^{d_1}$), upper case bold letters for matrices (e.g. $M \in \mathbb{R}^{d_1 \times d_2}$) and bold calligraphic letters for higher order tensors (e.g. $\mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$). The identity matrix will be written as $I$. The $i$th row (resp. column) of a matrix $M$ will be denoted by $M_{i,:}$ (resp. $M_{:,i}$). This notation is extended to slices of a tensor in the straightforward way. If $v \in \mathbb{R}^{d_1}$ and $v' \in \mathbb{R}^{d_2}$, we use $v \otimes v' \in \mathbb{R}^{d_1 d_2}$ to denote the Kronecker product between vectors, and its straightforward extension to matrices and tensors. Given a matrix $M \in \mathbb{R}^{d_1 \times d_2}$, we use $\mathrm{vec}(M) \in \mathbb{R}^{d_1 d_2}$ to denote the column vector obtained by concatenating the columns of $M$.

2.1 Tensors and Tucker Decomposition

We first recall basic definitions of tensor algebra; more details can be found in [13]. A tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_p}$ can simply be seen as a multidimensional array $(\mathcal{T}_{i_1,\cdots,i_p} : i_n \in [d_n], n \in [p])$. The mode-$n$ fibers of $\mathcal{T}$ are the vectors obtained by fixing all indices except the $n$th one, e.g. $\mathcal{T}_{:,i_2,\cdots,i_p} \in \mathbb{R}^{d_1}$. The $n$th mode matricization of $\mathcal{T}$ is the matrix having the mode-$n$ fibers of $\mathcal{T}$ for columns and is denoted by $\mathcal{T}_{(n)} \in \mathbb{R}^{d_n \times d_1\cdots d_{n-1}d_{n+1}\cdots d_p}$. The vectorization of a tensor is defined by $\mathrm{vec}(\mathcal{T}) = \mathrm{vec}(\mathcal{T}_{(1)})$. The inner product between two tensors $\mathcal{S}$ and $\mathcal{T}$ (of the same size) is defined by $\langle \mathcal{S}, \mathcal{T}\rangle = \langle \mathrm{vec}(\mathcal{S}), \mathrm{vec}(\mathcal{T})\rangle$ and the Frobenius norm is defined by $\|\mathcal{T}\|_F^2 = \langle \mathcal{T}, \mathcal{T}\rangle$. In the following $\mathcal{T}$ always denotes a tensor of size $d_1 \times \cdots \times d_p$.

The mode-$n$ matrix product of the tensor $\mathcal{T}$ and a matrix $X \in \mathbb{R}^{m \times d_n}$ is a tensor denoted by $\mathcal{T} \times_n X$. It is of size $d_1 \times \cdots \times d_{n-1} \times m \times d_{n+1} \times \cdots \times d_p$ and is defined by the relation $\mathcal{Y} = \mathcal{T} \times_n X \Leftrightarrow \mathcal{Y}_{(n)} = X\mathcal{T}_{(n)}$. The mode-$n$ vector product of the tensor $\mathcal{T}$ and a vector $v \in \mathbb{R}^{d_n}$ is a tensor defined by $\mathcal{T} \bullet_n v = \mathcal{T} \times_n v^\top \in \mathbb{R}^{d_1 \times \cdots \times d_{n-1} \times d_{n+1} \times \cdots \times d_p}$. The mode-$n$ rank of $\mathcal{T}$ is the dimension of the space spanned by its mode-$n$ fibers, that is $\mathrm{rank}_n(\mathcal{T}) = \mathrm{rank}(\mathcal{T}_{(n)})$. The multilinear rank of $\mathcal{T}$, denoted by $\mathrm{rank}(\mathcal{T})$, is the tuple of mode-$n$ ranks of $\mathcal{T}$: $\mathrm{rank}(\mathcal{T}) = (R_1, \cdots, R_p)$ where $R_n = \mathrm{rank}_n(\mathcal{T})$ for $n \in [p]$. We will write $\mathrm{rank}(\mathcal{T}) \preceq (S_1, \cdots, S_p)$ whenever $\mathrm{rank}_1(\mathcal{T}) \leq S_1$, $\mathrm{rank}_2(\mathcal{T}) \leq S_2$, $\cdots$, $\mathrm{rank}_p(\mathcal{T}) \leq S_p$.

The Tucker decomposition decomposes a tensor $\mathcal{T}$ into a core tensor $\mathcal{G}$ transformed by an orthogonal matrix along each mode:
$$\mathcal{T} = \mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 \cdots \times_p U_p, \tag{i}$$
where $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_p}$, $U_i \in \mathbb{R}^{d_i \times R_i}$ and $U_i^\top U_i = I$ for all $i \in [p]$.
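To make these operations concrete, here is a minimal numpy sketch of mode-$n$ matricization, the mode-$n$ matrix product and the Tucker reconstruction (i). It is an illustration we add, not code from the paper; the helper names (`unfold`, `mode_n_product`, `tucker`) are our own, and `unfold` orders the fibers following numpy's row-major flattening rather than the convention above, which is harmless as long as unfolding and refolding are done consistently.

```python
import numpy as np

def unfold(T, n):
    """Mode-n matricization: the columns are the mode-n fibers of T (up to ordering)."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def mode_n_product(T, X, n):
    """Mode-n matrix product T x_n X, defined by (T x_n X)_(n) = X T_(n)."""
    rest = [T.shape[i] for i in range(T.ndim) if i != n]
    Y = (X @ unfold(T, n)).reshape([X.shape[0]] + rest)
    return np.moveaxis(Y, 0, n)

def tucker(G, Us):
    """Reconstruct T = G x_1 U_1 x_2 ... x_p U_p from a core tensor and factor matrices."""
    T = G
    for n, U in enumerate(Us):
        T = mode_n_product(T, U, n)
    return T

# Sanity check: the multilinear rank of the reconstruction is bounded by the core size.
rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 2))
Us = [np.linalg.qr(rng.standard_normal((d, r)))[0] for d, r in zip((5, 6, 4), (2, 3, 2))]
T = tucker(G, Us)
print([np.linalg.matrix_rank(unfold(T, n)) for n in range(3)])  # at most [2, 3, 2]
```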
The number of parameters involved in a Tucker decomposition can be considerably smaller than $d_1 d_2 \cdots d_p$. We have the following identities when matricizing and vectorizing a Tucker decomposition: $\mathcal{T}_{(n)} = U_n \mathcal{G}_{(n)} (U_p \otimes \cdots \otimes U_{n+1} \otimes U_{n-1} \otimes \cdots \otimes U_1)^\top$ and $\mathrm{vec}(\mathcal{T}) = (U_p \otimes U_{p-1} \otimes \cdots \otimes U_1)\mathrm{vec}(\mathcal{G})$. It is well known that $\mathcal{T}$ admits the Tucker decomposition (i) if and only if $\mathrm{rank}(\mathcal{T}) \preceq (R_1, \cdots, R_p)$ (see e.g. [13]). Finding an exact Tucker decomposition can be done using the higher-order SVD algorithm (HOSVD) introduced by [5]. Although finding the best approximation of multilinear rank $(R_1, \cdots, R_p)$ of a tensor $\mathcal{T}$ is a difficult problem, the truncated HOSVD algorithm provides good approximation guarantees and often performs well in practice.

2.2 Low-Rank Regression

Multivariate regression is the task of recovering a function $f : \mathbb{R}^d \to \mathbb{R}^p$ from a set of input-output pairs $\{(x^{(n)}, y^{(n)})\}_{n=1}^N$ sampled from the model with an additive noise $y = f(x) + \varepsilon$, where $\varepsilon$ is the error term. To solve this problem, the ordinary least squares (OLS) approach assumes a linear dependence between input and output data and boils down to finding a matrix $W \in \mathbb{R}^{d \times p}$ that minimizes the squared error $\|XW - Y\|_F^2$, where $X \in \mathbb{R}^{N \times d}$ and $Y \in \mathbb{R}^{N \times p}$ denote the input and the output matrices. To prevent overfitting and to avoid numerical instabilities a ridge regularization term (i.e. $\gamma\|W\|_F^2$) is often added to the objective function, leading to the regularized least squares (RLS) method. It is easy to see that the OLS/RLS approach in the multivariate setting is equivalent to performing $p$ linear regressions for each scalar output $\{y_j\}_{j=1}^p$ independently. Thus it performs poorly when the outputs are correlated and the true dimension of the response is less than $p$. Low-rank regression (or reduced-rank regression) addresses this issue by solving the rank penalized problem
$$\min_{W \in \mathbb{R}^{d \times p}} \|XW - Y\|_F^2 + \gamma\|W\|_F^2 \quad \text{s.t. } \mathrm{rank}(W) \leq R$$
for a given integer $R$. The rank constraint was first proposed in [1], whereas the term reduced-rank regression was introduced in [10]. Adding a ridge regularization was proposed in [18]. In the rest of the paper we will refer to this approach as low-rank regression (LRR). For more description and discussion of reduced-rank regression, we refer the reader to the books [21] and [11].

3 Low-Rank Regression for Tensor-Valued Functions

3.1 Problem Formulation

We consider a multivariate regression task where the input is a vector and the response has a tensor structure. Let $f : \mathbb{R}^{d_0} \to \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_p}$ be the function we want to learn from a sample of input-output data $\{(x^{(n)}, \mathcal{Y}^{(n)})\}_{n=1}^N$ drawn from the model $\mathcal{Y} = f(x) + \mathcal{E}$, where $\mathcal{E}$ is an error term. We assume that $f$ is linear, that is $f(x) = \mathcal{W} \bullet_1 x$ for some regression tensor $\mathcal{W} \in \mathbb{R}^{d_0 \times d_1 \times \cdots \times d_p}$. The vectorization of this relation leads to $\mathrm{vec}(f(x)) = \mathcal{W}_{(1)}^\top x$, showing that this model is equivalent to the standard multivariate linear model.

One way to tackle this regression task would be to vectorize each output sample and to perform a standard low-rank regression on the data $\{(x^{(n)}, \mathrm{vec}(\mathcal{Y}^{(n)}))\}_{n=1}^N \subset \mathbb{R}^{d_0} \times \mathbb{R}^{d_1 \cdots d_p}$. A major drawback of this approach is that the tensor structure of the output is lost in the vectorization step. The low-rank model tries to capture linear dependencies between components of the output but it ignores higher level dependencies that could be present in a tensor-structured output. For illustration, suppose the output is a matrix encoding the samples of $d_1$ continuous variables at $d_2$ different time steps; one could then expect structural relations between the $d_1$ time series, e.g. linear dependencies between the rows of the output matrix.
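Before turning to the low-rank formulation, here is a short numerical check, which we add for illustration (it reuses the `unfold` helper from the sketch in Section 2.1), that the tensor-valued linear model $f(x) = \mathcal{W} \bullet_1 x$ and its vectorized form $\mathrm{vec}(f(x)) = \mathcal{W}_{(1)}^\top x$ agree.

```python
import numpy as np
# reuses unfold() from the sketch in Section 2.1

rng = np.random.default_rng(1)
d0, d1, d2 = 4, 3, 5
W = rng.standard_normal((d0, d1, d2))     # regression tensor
x = rng.standard_normal(d0)               # input vector

# mode-1 vector product W •_1 x: contract the first mode of W with x
fx = np.tensordot(W, x, axes=([0], [0]))            # shape (d1, d2)

# vectorized model: W_(1)^T x, folded back into a (d1, d2) matrix
fx_vec = (unfold(W, 0).T @ x).reshape(d1, d2)

print(np.allclose(fx, fx_vec))  # True: both formulations give the same prediction
```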
Low-rank regression for tensor responses. To overcome the limitation described above we propose an extension of the low-rank regression method for tensor-structured responses by enforcing low multilinear rank of the regression tensor $\mathcal{W}$. Let $\{(x^{(n)}, \mathcal{Y}^{(n)})\}_{n=1}^N \subset \mathbb{R}^{d_0} \times \mathbb{R}^{d_1 \times d_2 \times \cdots \times d_p}$ be a training sample of input/output data drawn from the model $f(x) = \mathcal{W} \bullet_1 x + \mathcal{E}$ where $\mathcal{W}$ is assumed to have low multilinear rank. Considering the framework of empirical risk minimization, we want to find a low-rank regression tensor $\mathcal{W}$ minimizing the loss on the training data. To avoid numerical instabilities and to prevent overfitting we add a ridge regularization to the objective function, leading to the minimization of $\sum_{n=1}^N \ell(\mathcal{W} \bullet_1 x^{(n)}, \mathcal{Y}^{(n)}) + \gamma\|\mathcal{W}\|_F^2$ w.r.t. the regression tensor $\mathcal{W}$, subject to the constraint $\mathrm{rank}(\mathcal{W}) \preceq (R_0, R_1, \cdots, R_p)$ for some given integers $R_0, R_1, \cdots, R_p$ and where $\ell$ is a loss function. In this paper, we consider the squared error loss between tensors defined by $L(\mathcal{T}, \hat{\mathcal{T}}) = \|\mathcal{T} - \hat{\mathcal{T}}\|_F^2$. Using this loss we can rewrite the minimization problem as
$$\min_{\mathcal{W} \in \mathbb{R}^{d_0 \times d_1 \times \cdots \times d_p}} \|\mathcal{W} \times_1 X - \mathcal{Y}\|_F^2 + \gamma\|\mathcal{W}\|_F^2 \quad \text{s.t. } \mathrm{rank}(\mathcal{W}) \preceq (R_0, R_1, \cdots, R_p), \tag{1}$$
where the input matrix $X \in \mathbb{R}^{N \times d_0}$ and the output tensor $\mathcal{Y} \in \mathbb{R}^{N \times d_1 \times \cdots \times d_p}$ are defined by $X_{n,:} = (x^{(n)})^\top$ and $\mathcal{Y}_{n,:,\cdots,:} = \mathcal{Y}^{(n)}$ for $n = 1, \cdots, N$ ($\mathcal{Y}$ is the tensor obtained by stacking the output tensors along the first mode).

Figure 1: Image reconstruction from noisy measurements: $\mathcal{Y} = \mathcal{W} \bullet_1 x + \mathcal{E}$ where $\mathcal{W}$ is a color image (RGB). Each image is labeled with the algorithm and the rank parameter.

Low-rank regression function. Let $\mathcal{W}^*$ be a solution of problem (1). It follows from the multilinear rank constraint that $\mathcal{W}^* = \mathcal{G} \times_1 U_0 \times_2 \cdots \times_{p+1} U_p$ for some core tensor $\mathcal{G} \in \mathbb{R}^{R_0 \times \cdots \times R_p}$ and orthogonal matrices $U_i \in \mathbb{R}^{d_i \times R_i}$ for $0 \leq i \leq p$. The regression function $f^* : x \mapsto \mathcal{W}^* \bullet_1 x$ can thus be written as $f^* : x \mapsto \mathcal{G} \times_1 (x^\top U_0) \times_2 U_1 \times_3 \cdots \times_{p+1} U_p$. This implies several interesting properties. First, for any $x \in \mathbb{R}^{d_0}$ we have $f^*(x) = \mathcal{T}_x \times_1 U_1 \times_2 \cdots \times_p U_p$ with $\mathcal{T}_x = \mathcal{G} \bullet_1 U_0^\top x$, which implies $\mathrm{rank}(f^*(x)) \preceq (R_1, \cdots, R_p)$, that is the image of $f^*$ is a set of tensors with low multilinear rank. Second, the relation between $x$ and $\mathcal{Y} = f^*(x)$ is explained in a low-dimensional subspace of size $R_0 \times R_1 \times \cdots \times R_p$. Indeed one can decompose the mapping $f^*$ into the following steps: (i) project $x$ in $\mathbb{R}^{R_0}$ as $\bar{x} = U_0^\top x$, (ii) perform a low-dimensional mapping $\bar{\mathcal{Y}} = \mathcal{G} \bullet_1 \bar{x}$, (iii) project back into the output space to get $\mathcal{Y} = \bar{\mathcal{Y}} \times_1 U_1 \times_2 \cdots \times_p U_p$.

To give an illustrative intuition on the differences between matrix and multilinear rank regularization we present a simple experiment in Figure 1 (an extended version of this experiment is presented in supplementary material B). We generate data from the model $\mathcal{Y} = \mathcal{W} \bullet_1 x + \mathcal{E}$ where the tensor $\mathcal{W} \in \mathbb{R}^{3 \times m \times n}$ is a color image of size $m \times n$ encoded with three color channels (RGB). The components of both $x$ and $\mathcal{E}$ are drawn from $\mathcal{N}(0, 1)$. This experiment allows us to visualize the tensors returned by RLS, LRR and our method HOLRR, which enforces low multilinear rank of the regression function. First, this shows that the function learned by vectorizing the outputs and performing LRR does not enforce any low-rank structure. This is well illustrated in Figure 1, where the regression tensors returned by HOLRR-(3,1,1) are clearly of low rank while the ones returned by LRR-1 are not. This also shows that taking into account the low-rank structure of the model allows one to better eliminate the noise when the true regression tensor is of low rank (Figure 1, left). However, if the ground truth model does not have a low-rank structure, enforcing low multilinear rank leads to underfitting for low values of the rank parameter (Figure 1, right).
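To illustrate the three-step structure of the low-rank regression function, here is a small numpy sketch we add (it reuses the helpers from the sketch in Section 2.1; all dimensions are hypothetical). It builds a regression tensor with multilinear rank $(R_0, R_1, R_2)$ and checks that the prediction obtained via steps (i)–(iii) above matches $\mathcal{W} \bullet_1 x$ and has matrix rank at most $\min(R_1, R_2)$.

```python
import numpy as np
# reuses unfold / mode_n_product / tucker from the sketch in Section 2.1

rng = np.random.default_rng(2)
d0, d1, d2 = 8, 10, 12          # input dimension and output dimensions (hypothetical)
R0, R1, R2 = 3, 2, 4            # multilinear rank of the regression tensor

G = rng.standard_normal((R0, R1, R2))                        # core tensor
U0, U1, U2 = (np.linalg.qr(rng.standard_normal((d, r)))[0]
              for d, r in zip((d0, d1, d2), (R0, R1, R2)))    # factors with orthonormal columns
W = tucker(G, [U0, U1, U2])                                   # W = G x_1 U0 x_2 U1 x_3 U2

x = rng.standard_normal(d0)
# three-step prediction: project the input, map it in the low-dimensional core, project back
x_low = U0.T @ x                                    # (i)   U0^T x          in R^{R0}
Y_low = np.tensordot(G, x_low, axes=([0], [0]))     # (ii)  G •_1 (U0^T x)  in R^{R1 x R2}
Y_hat = U1 @ Y_low @ U2.T                           # (iii) project back    in R^{d1 x d2}

# the same prediction obtained directly from W, and its rank is at most min(R1, R2)
print(np.allclose(Y_hat, np.tensordot(W, x, axes=([0], [0]))))   # True
print(np.linalg.matrix_rank(Y_hat) <= min(R1, R2))               # True
```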
3.2 Higher-Order Low-Rank Regression and its Kernel Extension

We now propose an efficient algorithm to tackle problem (1). We first show that the ridge regularization term in (1) can be incorporated in the data fitting term. Let $\tilde{X} \in \mathbb{R}^{(N+d_0) \times d_0}$ and $\tilde{\mathcal{Y}} \in \mathbb{R}^{(N+d_0) \times d_1 \times \cdots \times d_p}$ be defined by $\tilde{X}^\top = (X^\top \mid \sqrt{\gamma}\, I)$ and $\tilde{\mathcal{Y}}_{(1)}^\top = (\mathcal{Y}_{(1)}^\top \mid 0)$. It is easy to check that the objective function in (1) is equal to $\|\mathcal{W} \times_1 \tilde{X} - \tilde{\mathcal{Y}}\|_F^2$. Minimization problem (1) is then equivalent to
$$\min_{\mathcal{G} \in \mathbb{R}^{R_0 \times R_1 \times \cdots \times R_p},\ U_i \in \mathbb{R}^{d_i \times R_i},\ 0 \leq i \leq p} \|\mathcal{W} \times_1 \tilde{X} - \tilde{\mathcal{Y}}\|_F^2 \quad \text{s.t. } \mathcal{W} = \mathcal{G} \times_1 U_0 \times_2 \cdots \times_{p+1} U_p,\ U_i^\top U_i = I \text{ for all } i. \tag{2}$$
We now show that this minimization problem can be reduced to finding $p + 1$ projection matrices onto subspaces of dimension $R_0, R_1, \cdots, R_p$. We start by showing that the core tensor $\mathcal{G}$ solution of (2) is determined by the factor matrices $U_0, \cdots, U_p$.

Theorem 1. For given orthogonal matrices $U_0, \cdots, U_p$, the tensor $\mathcal{G}$ that minimizes (2) is given by $\mathcal{G} = \tilde{\mathcal{Y}} \times_1 (U_0^\top \tilde{X}^\top \tilde{X} U_0)^{-1} U_0^\top \tilde{X}^\top \times_2 U_1^\top \times_3 \cdots \times_{p+1} U_p^\top$.

It follows from Theorem 1 that problem (1) can be written as
$$\min_{U_i \in \mathbb{R}^{d_i \times R_i},\ 0 \leq i \leq p} \|\tilde{\mathcal{Y}} \times_1 \Pi_0 \times_2 \cdots \times_{p+1} \Pi_p - \tilde{\mathcal{Y}}\|_F^2 \tag{3}$$
subject to $U_i^\top U_i = I$ for all $i$, $\Pi_0 = \tilde{X}U_0(U_0^\top \tilde{X}^\top \tilde{X} U_0)^{-1} U_0^\top \tilde{X}^\top$ and $\Pi_i = U_i U_i^\top$ for $i \geq 1$. Note that $\Pi_0$ is the orthogonal projection onto the space spanned by the columns of $\tilde{X}U_0$ and $\Pi_i$ is the orthogonal projection onto the column space of $U_i$ for $i \geq 1$. Hence solving problem (1) is equivalent to finding $p + 1$ low-dimensional subspaces $U_0, \cdots, U_p$ such that projecting $\tilde{\mathcal{Y}}$ onto the spaces spanned by $\tilde{X}U_0, U_1, \cdots, U_p$ along the corresponding modes is close to $\tilde{\mathcal{Y}}$.

HOLRR algorithm. Since solving problem (3) for the $p + 1$ projections simultaneously is a difficult non-convex optimization problem, we propose to solve it independently for each projection. This approach has the benefits of both being computationally efficient and providing good theoretical approximation guarantees (see Theorem 2). The following proposition gives the analytic solutions of (3) when each projection is considered independently.

Proposition 1. For $0 \leq i \leq p$, using the definition of $\Pi_i$ in (3), the optimal solution of $\min_{U_i \in \mathbb{R}^{d_i \times R_i}} \|\tilde{\mathcal{Y}} \times_{i+1} \Pi_i - \tilde{\mathcal{Y}}\|_F^2$ s.t. $U_i^\top U_i = I$ is given by the top $R_i$ eigenvectors of $(\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{\mathcal{Y}}_{(1)} \tilde{\mathcal{Y}}_{(1)}^\top \tilde{X}$ if $i = 0$, and of $\tilde{\mathcal{Y}}_{(i+1)} \tilde{\mathcal{Y}}_{(i+1)}^\top$ otherwise.

The results from Theorem 1 and Proposition 1 can be rewritten in terms of the original input matrix $X$ and output tensor $\mathcal{Y}$ using the identities $\tilde{X}^\top \tilde{X} = X^\top X + \gamma I$, $\tilde{\mathcal{Y}} \times_1 \tilde{X}^\top = \mathcal{Y} \times_1 X^\top$ and $\tilde{\mathcal{Y}}_{(i)} \tilde{\mathcal{Y}}_{(i)}^\top = \mathcal{Y}_{(i)} \mathcal{Y}_{(i)}^\top$ for any $i > 1$. The overall Higher-Order Low-Rank Regression procedure (HOLRR) is summarized in Algorithm 1.

Algorithm 1 HOLRR
Input: $X \in \mathbb{R}^{N \times d_0}$, $\mathcal{Y} \in \mathbb{R}^{N \times d_1 \times \cdots \times d_p}$, rank $(R_0, R_1, \cdots, R_p)$ and regularization parameter $\gamma$.
1: $U_0 \leftarrow$ top $R_0$ eigenvectors of $(X^\top X + \gamma I)^{-1} X^\top \mathcal{Y}_{(1)} \mathcal{Y}_{(1)}^\top X$
2: for $i = 1$ to $p$ do
3:   $U_i \leftarrow$ top $R_i$ eigenvectors of $\mathcal{Y}_{(i+1)} \mathcal{Y}_{(i+1)}^\top$
4: end for
5: $M \leftarrow (U_0^\top (X^\top X + \gamma I) U_0)^{-1} U_0^\top X^\top$
6: $\mathcal{G} \leftarrow \mathcal{Y} \times_1 M \times_2 U_1^\top \times_3 \cdots \times_{p+1} U_p^\top$
7: return $\mathcal{G} \times_1 U_0 \times_2 \cdots \times_{p+1} U_p$

Note that the Tucker decomposition of the solution returned by HOLRR could be a good initialization point for an Alternating Least Squares method. However, studying the theoretical and experimental properties of this approach is beyond the scope of this paper and is left for future work.
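As a concrete companion to Algorithm 1, here is a minimal numpy sketch of the procedure. It is our own illustration, not the authors' code; it reuses the `unfold`, `mode_n_product` and `tucker` helpers from the sketch in Section 2.1 and makes no attempt at numerical robustness or efficiency.

```python
import numpy as np
# reuses unfold / mode_n_product / tucker from the sketch in Section 2.1

def top_eigvecs(M, k):
    """Top-k eigenvectors of a (possibly non-symmetric) matrix, ordered by eigenvalue real part."""
    vals, vecs = np.linalg.eig(M)
    idx = np.argsort(-vals.real)[:k]
    return np.real(vecs[:, idx])

def holrr(X, Y, ranks, gamma):
    """Sketch of HOLRR (Algorithm 1). X: (N, d0); Y: (N, d1, ..., dp); ranks: (R0, ..., Rp)."""
    N, d0 = X.shape
    p = Y.ndim - 1
    Y1 = unfold(Y, 0)                                   # Y_(1), shape (N, d1*...*dp)
    A = np.linalg.solve(X.T @ X + gamma * np.eye(d0),
                        X.T @ Y1 @ Y1.T @ X)            # (X^T X + g I)^{-1} X^T Y_(1) Y_(1)^T X
    U = [top_eigvecs(A, ranks[0])]                      # U_0
    for i in range(1, p + 1):
        Yi = unfold(Y, i)                               # Y_(i+1) in the paper's numbering
        U.append(top_eigvecs(Yi @ Yi.T, ranks[i]))      # U_i
    M = np.linalg.solve(U[0].T @ (X.T @ X + gamma * np.eye(d0)) @ U[0], U[0].T @ X.T)
    G = mode_n_product(Y, M, 0)                         # G = Y x_1 M x_2 U_1^T ... x_{p+1} U_p^T
    for i in range(1, p + 1):
        G = mode_n_product(G, U[i].T, i)
    return tucker(G, U)                                 # W = G x_1 U_0 x_2 U_1 ... x_{p+1} U_p

# predictions for a new input x are W •_1 x, i.e. np.tensordot(W, x, axes=([0], [0]))
```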
HOLRR kernel extension. We now design a kernelized version of the HOLRR algorithm by analyzing how it would be instantiated in a feature space. We show that all the steps involved can be performed using the Gram matrix of the input data without having to explicitly compute the feature map. Let $\phi : \mathbb{R}^{d_0} \to \mathbb{R}^L$ be a feature map and let $\Phi \in \mathbb{R}^{N \times L}$ be the matrix with rows $\phi(x^{(n)})$ for $n \in [N]$. The higher-order low-rank regression problem in the feature space boils down to the minimization problem
$$\min_{\mathcal{W} \in \mathbb{R}^{L \times d_1 \times \cdots \times d_p}} \|\mathcal{W} \times_1 \Phi - \mathcal{Y}\|_F^2 + \gamma\|\mathcal{W}\|_F^2 \quad \text{s.t. } \mathrm{rank}(\mathcal{W}) \preceq (R_0, R_1, \cdots, R_p). \tag{4}$$
Following the HOLRR algorithm, one needs to compute the top $R_0$ eigenvectors of the $L \times L$ matrix $(\Phi^\top \Phi + \gamma I)^{-1} \Phi^\top \mathcal{Y}_{(1)} \mathcal{Y}_{(1)}^\top \Phi$. The following proposition shows that this can be done using the Gram matrix $K = \Phi\Phi^\top$ without explicitly knowing the feature map $\phi$.

Proposition 2. If $\alpha \in \mathbb{R}^N$ is an eigenvector with eigenvalue $\lambda$ of the matrix $(K + \gamma I)^{-1} \mathcal{Y}_{(1)} \mathcal{Y}_{(1)}^\top K$, then $v = \Phi^\top \alpha \in \mathbb{R}^L$ is an eigenvector with eigenvalue $\lambda$ of the matrix $(\Phi^\top \Phi + \gamma I)^{-1} \Phi^\top \mathcal{Y}_{(1)} \mathcal{Y}_{(1)}^\top \Phi$.

Let $A$ be the matrix whose columns are the top $R_0$ eigenvectors of $(K + \gamma I)^{-1} \mathcal{Y}_{(1)} \mathcal{Y}_{(1)}^\top K$. When working with the feature map $\phi$, it follows from the previous proposition that line 1 in Algorithm 1 is equivalent to choosing $U_0 = \Phi^\top A \in \mathbb{R}^{L \times R_0}$, while the updates in line 3 stay the same. The regression tensor $\mathcal{W} \in \mathbb{R}^{L \times d_1 \times \cdots \times d_p}$ returned by this algorithm is then equal to $\mathcal{W} = \mathcal{Y} \times_1 P \times_2 U_1 U_1^\top \times_3 \cdots \times_{p+1} U_p U_p^\top$, where $P = \Phi^\top A (A^\top \Phi(\Phi^\top \Phi + \gamma I)\Phi^\top A)^{-1} A^\top \Phi\Phi^\top$. It is easy to check that $P$ can be rewritten as $P = \Phi^\top A (A^\top K(K + \gamma I) A)^{-1} A^\top K$.

Suppose now that the feature map $\phi$ is induced by a kernel $k : \mathbb{R}^{d_0} \times \mathbb{R}^{d_0} \to \mathbb{R}$. The prediction for an input vector $x$ is then given by $\mathcal{W} \bullet_1 \phi(x) = \mathcal{C} \bullet_1 k_x$, where the $n$th component of $k_x \in \mathbb{R}^N$ is $\langle \phi(x^{(n)}), \phi(x)\rangle = k(x^{(n)}, x)$ and the tensor $\mathcal{C} \in \mathbb{R}^{N \times d_1 \times \cdots \times d_p}$ is defined by $\mathcal{C} = \mathcal{G} \times_1 A \times_2 U_1 \times_3 \cdots \times_{p+1} U_p$, with $\mathcal{G} = \mathcal{Y} \times_1 (A^\top K(K + \gamma I) A)^{-1} A^\top K \times_2 U_1^\top \times_3 \cdots \times_{p+1} U_p^\top$. Note that $\mathcal{C}$ has multilinear rank $(R_0, \cdots, R_p)$, hence the low multilinear rank constraint on $\mathcal{W}$ in the feature space translates into the low-rank structure of the coefficient tensor $\mathcal{C}$. Let $\mathcal{H}$ be the reproducing kernel Hilbert space associated with the kernel $k$. The overall procedure for kernelized HOLRR is summarized in Algorithm 2. This algorithm returns the tensor $\mathcal{C} \in \mathbb{R}^{N \times d_1 \times \cdots \times d_p}$ defining the regression function $f : x \mapsto \mathcal{C} \bullet_1 k_x = \sum_{n=1}^N k(x, x^{(n)})\, \mathcal{C}^{(n)}$, where $\mathcal{C}^{(n)} = \mathcal{C}_{n,:,\cdots,:} \in \mathbb{R}^{d_1 \times \cdots \times d_p}$.

Algorithm 2 Kernelized HOLRR
Input: Gram matrix $K \in \mathbb{R}^{N \times N}$, $\mathcal{Y} \in \mathbb{R}^{N \times d_1 \times \cdots \times d_p}$, rank $(R_0, R_1, \cdots, R_p)$ and regularization parameter $\gamma$.
1: $A \leftarrow$ top $R_0$ eigenvectors of $(K + \gamma I)^{-1} \mathcal{Y}_{(1)} \mathcal{Y}_{(1)}^\top K$
2: for $i = 1$ to $p$ do
3:   $U_i \leftarrow$ top $R_i$ eigenvectors of $\mathcal{Y}_{(i+1)} \mathcal{Y}_{(i+1)}^\top$
4: end for
5: $M \leftarrow (A^\top K(K + \gamma I) A)^{-1} A^\top K$
6: $\mathcal{G} \leftarrow \mathcal{Y} \times_1 M \times_2 U_1^\top \times_3 \cdots \times_{p+1} U_p^\top$
7: return $\mathcal{C} = \mathcal{G} \times_1 A \times_2 U_1 \times_3 \cdots \times_{p+1} U_p$
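For completeness, here is a matching numpy sketch of Algorithm 2, again our own illustration under the same assumptions as the previous sketches (it reuses `unfold`, `mode_n_product`, `tucker` and `top_eigvecs` from above; building the Gram matrix for a specific kernel is left out).

```python
import numpy as np
# reuses unfold / mode_n_product / tucker / top_eigvecs from the previous sketches

def kernel_holrr(K, Y, ranks, gamma):
    """Sketch of kernelized HOLRR (Algorithm 2).
    K: (N, N) Gram matrix; Y: (N, d1, ..., dp); ranks: (R0, ..., Rp)."""
    N = K.shape[0]
    p = Y.ndim - 1
    Y1 = unfold(Y, 0)
    B = np.linalg.solve(K + gamma * np.eye(N), Y1 @ Y1.T @ K)   # (K + g I)^{-1} Y_(1) Y_(1)^T K
    A = top_eigvecs(B, ranks[0])                                # A, shape (N, R0)
    U = [A]
    for i in range(1, p + 1):
        Yi = unfold(Y, i)
        U.append(top_eigvecs(Yi @ Yi.T, ranks[i]))              # U_i
    M = np.linalg.solve(A.T @ K @ (K + gamma * np.eye(N)) @ A, A.T @ K)
    G = mode_n_product(Y, M, 0)                                 # G = Y x_1 M x_2 U_1^T ...
    for i in range(1, p + 1):
        G = mode_n_product(G, U[i].T, i)
    return tucker(G, U)                                         # C = G x_1 A x_2 U_1 ... x_{p+1} U_p

def predict(C, k_x):
    """Prediction f(x) = C •_1 k_x, where k_x[n] = k(x^{(n)}, x)."""
    return np.tensordot(C, k_x, axes=([0], [0]))
```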
3.3 Theoretical Analysis

Complexity analysis. HOLRR is a polynomial time algorithm; more precisely, it has a time complexity in $O(d_0^3 + N(d_0^2 + d_0 d_1 \cdots d_p) + \max_{i \geq 0} R_i d_i^2 + N d_1 \cdots d_p \max_{i \geq 1} d_i)$. In comparison, LRR has a time complexity in $O(d_0^3 + N(d_0^2 + d_0 d_1 \cdots d_p) + (N + R)(d_1 \cdots d_p)^2)$. Since the complexity of HOLRR only has a linear dependence on the product of the output dimensions, instead of a quadratic one for LRR, we can conclude that HOLRR will be more efficient than LRR when the output dimensions $d_1, \cdots, d_p$ are large. It is worth mentioning that the method proposed in [22] to solve a convex relaxation of problem (2) is an iterative algorithm that needs to compute SVDs of matrices of size $d_i \times d_1 \cdots d_{i-1} d_{i+1} \cdots d_p$ for each $0 \leq i \leq p$ at each iteration; it is thus computationally more expensive than HOLRR. Moreover, since HOLRR only relies on simple linear algebra tools, readily available methods could be used to further improve the speed of the algorithm, e.g. randomized SVD [8] and random feature approximation of the kernel function [12, 20].

Approximation guarantees. It is easy to check that problem (1) is NP-hard since it generalizes the problem of fitting a Tucker decomposition [9]. The following theorem shows that HOLRR is a $(p + 1)$-approximation algorithm for this problem. This result generalizes the approximation guarantees provided by the truncated HOSVD algorithm for the problem of finding the best low multilinear rank approximation of an arbitrary tensor.

Theorem 2. Let $\mathcal{W}^*$ be a solution of problem (1) and let $\mathcal{W}$ be the regression tensor returned by Algorithm 1. If $\mathcal{L} : \mathbb{R}^{d_0 \times \cdots \times d_p} \to \mathbb{R}$ denotes the objective function of (1) w.r.t. $\mathcal{W}$, then $\mathcal{L}(\mathcal{W}) \leq (p + 1)\mathcal{L}(\mathcal{W}^*)$.

Generalization bound. The following theorem gives an upper bound on the excess risk for the function class $\mathcal{F} = \{x \mapsto \mathcal{W} \bullet_1 x : \mathrm{rank}(\mathcal{W}) \preceq (R_0, \cdots, R_p)\}$ of tensor-valued regression functions with bounded multilinear rank. Recall that the expected loss of a hypothesis $h \in \mathcal{F}$ w.r.t. the target function $f^*$ is defined by $\mathcal{R}(h) = \mathbb{E}_x[L(h(x), f^*(x))]$ and its empirical loss by $\hat{\mathcal{R}}(h) = \frac{1}{N}\sum_{n=1}^N L(h(x^{(n)}), f^*(x^{(n)}))$.

Theorem 3. Let $L : \mathbb{R}^{d_1 \times \cdots \times d_p} \times \mathbb{R}^{d_1 \times \cdots \times d_p} \to \mathbb{R}$ be a loss function satisfying $L(\mathcal{A}, \mathcal{B}) = \frac{1}{d_1 \cdots d_p}\sum_{i_1, \cdots, i_p} \ell(\mathcal{A}_{i_1,\cdots,i_p}, \mathcal{B}_{i_1,\cdots,i_p})$ for some loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ bounded by $M$. Then for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample of size $N$, the following inequality holds for all $h \in \mathcal{F}$:
$$\mathcal{R}(h) \leq \hat{\mathcal{R}}(h) + M\sqrt{\frac{2D \log\!\big(\tfrac{4e(p+2)\, d_0 d_1 \cdots d_p}{\delta}\big)}{2N}},$$
where $D = R_0 R_1 \cdots R_p + \sum_{i=0}^p R_i d_i$.

Proof. (Sketch) The complete proof is given in the supplementary material. It relies on bounding the pseudo-dimension of the class of real-valued functions $\tilde{\mathcal{F}} = \{(x, i_1, \cdots, i_p) \mapsto (\mathcal{W} \bullet_1 x)_{i_1,\cdots,i_p} : \mathrm{rank}(\mathcal{W}) \preceq (R_0, \cdots, R_p)\}$. We show that the pseudo-dimension of $\tilde{\mathcal{F}}$ is upper bounded by $(R_0 R_1 \cdots R_p + \sum_{i=0}^p R_i d_i) \log(4e(p+2)d_0 d_1 \cdots d_p)$. This is done by leveraging the following result originally due to [26]: the number of sign patterns of $r$ polynomials, each of degree at most $d$, over $q$ variables is at most $(4edr/q)^q$ for all $r > q > 2$ [25, Theorem 2]. The rest of the proof consists in showing that the risks (resp. empirical risks) of hypotheses in $\mathcal{F}$ and $\tilde{\mathcal{F}}$ are closely related and in invoking standard generalization error bounds in terms of the pseudo-dimension [17, Theorem 10.6].

Note that generalization bounds based on the pseudo-dimension for multivariate regression without a low-rank constraint would involve a term in $O(\sqrt{d_0 d_1 \cdots d_p})$. In contrast, the bound from the previous theorem only depends on the product of the output dimensions through a term bounded by $O(\sqrt{\log(d_1 \cdots d_p)})$. In some sense, taking into account the low multilinear rank of the hypothesis allows us to significantly reduce the dependence on the output dimensions from $O(\sqrt{d_0 \cdots d_p})$ to $O(\sqrt{\sum_i \log d_i})$.

4 Experiments

In this section, we evaluate HOLRR on both synthetic and real-world datasets. Our experimental results are for tensor-structured output regression problems, on which we report root mean squared errors (RMSE) averaged across all the outputs. We compare HOLRR with the following methods: regularized least squares (RLS), low-rank regression (LRR) described in Section 2.2, a multilinear approach based on tensor trace norm regularization (ADMM) [7, 22], a non-convex multilinear multitask learning approach (MLMT-NC) [22], a higher-order extension of partial least squares (HOPLS) [28] and the greedy tensor approach for multivariate spatio-temporal analysis (Greedy) [2]. For experiments with kernel algorithms we use the readily available kernelized RLS and the LRR kernel extension proposed in [18]. Note that ADMM, MLMT-NC and Greedy only consider a linear dependency between inputs and outputs. The greedy tensor algorithm proposed in [2] is developed specifically for spatio-temporal data and the implementation provided by the authors is restricted to third-order tensors.
Although MLMT-NC is perhaps the closest algorithm to ours, we applied it only to simulated data. This is because MLMT-NC is computationally very expensive and becomes intractable for large data sets. Average running times are reported in supplementary material B.

4.1 Synthetic Data

We generate both linear and nonlinear data. Linear data is drawn from the model $\mathcal{Y} = \mathcal{W} \bullet_1 x + \mathcal{E}$ where $\mathcal{W} \in \mathbb{R}^{10 \times 10 \times 10 \times 10}$ is a tensor of multilinear rank $(6, 4, 4, 8)$ drawn at random, $x \in \mathbb{R}^{10}$ is drawn from $\mathcal{N}(0, I)$, and each component of the error tensor $\mathcal{E}$ is drawn from $\mathcal{N}(0, 0.1)$. Nonlinear data is drawn from $\mathcal{Y} = \mathcal{W} \bullet_1 (x \otimes x) + \mathcal{E}$ where $\mathcal{W} \in \mathbb{R}^{25 \times 10 \times 10 \times 10}$ is of multilinear rank $(5, 6, 4, 2)$, and $x \in \mathbb{R}^5$ and $\mathcal{E}$ are generated as above. Hyper-parameters for all algorithms are selected using 3-fold cross-validation on the training data. These experiments have been carried out for different sizes of the training data set, and 20 trials have been executed for each size. The average RMSEs on a test set of size 100 over the 20 trials are reported in Figure 2. We see that the HOLRR algorithm clearly outperforms the other methods on the linear data. The MLMT-NC method achieved the second best performance; it is however much more computationally expensive (see Table 1 in supplementary material B). On the nonlinear data LRR achieves good performances but HOLRR is still significantly more accurate, especially with small training datasets.

Figure 2: Average RMSE as a function of the training set size: (left) linear data, (middle) nonlinear data, (right) for different values of the rank parameter.

Table 1: RMSE on the forecasting task.

Data set     ADMM    Greedy   HOPLS   HOLRR   K-HOLRR (poly)   K-HOLRR (rbf)
CCDS         0.8448  0.8325   0.8147  0.8096  0.8275           0.7913
Foursquare   0.1407  0.1223   0.1224  0.1227  0.1223           0.1226
Meteo-UK     0.6140  –        0.625   0.5971  0.6107           0.5886

To see how sensitive HOLRR is w.r.t. the choice of the multilinear rank, we carried out a similar experiment comparing HOLRR performances for different values of the rank parameter, see Figure 2 (right). In this experiment, the rank of the tensor $\mathcal{W}$ used to generate the data is $(2, 2, 2, 2)$ while the input and output dimensions and the noise level are the same as above.

4.2 Real Data

We evaluate our algorithm on a forecasting task on the following real-world data sets:

CCDS: the comprehensive climate data set is a collection of climate records of North America from [15]. The data set contains monthly observations of 17 variables such as carbon dioxide and temperature spanning from 1990 to 2001 across 125 observation locations.

Foursquare: the Foursquare data set [14] contains users' check-in records in the Pittsburgh area categorized by different venue types such as Art & University. It records the number of check-ins by 121 users in each of the 15 categories of venues over 1200 time intervals.

Meteo-UK: the data set is collected from the meteorological office of the UK (http://www.metoffice.gov.uk/public/weather/climate-historic/). It contains monthly measurements of 5 variables in 16 stations across the UK from 1960 to 2000.

The forecasting task consists in predicting all variables at times $t+1, \ldots, t+k$ from their values at times $t-2$, $t-1$ and $t$. The first two real data sets were used in [2] with $k = 1$ (i.e. outputs are matrices). We consider here the same setting for these two data sets. For the third dataset we consider higher-order output tensors by setting $k = 5$. The output tensors are thus of size $17 \times 125$, $15 \times 121$ and $16 \times 5 \times 5$ respectively for the three datasets. For all the experiments, we use 90% of the available data for training and 10% for testing. All hyper-parameters are chosen by cross-validation.
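As an illustration of this forecasting setup, the following sketch (our own; it is not the authors' preprocessing code, and the array layout and mode ordering are arbitrary choices) builds input/output pairs from a multivariate time series: the input collects the values at times $t-2$, $t-1$, $t$ and the output is the tensor of values at times $t+1, \ldots, t+k$.

```python
import numpy as np

def make_forecasting_data(series, k):
    """Build (x, Y) pairs from a series of shape (T, d1, d2), e.g. variables x locations.
    Input x: values at times t-2, t-1, t (flattened); output Y: values at t+1, ..., t+k."""
    T = series.shape[0]
    X, Y = [], []
    for t in range(2, T - k):
        X.append(series[t - 2:t + 1].ravel())   # lagged values, shape (3 * d1 * d2,)
        Y.append(series[t + 1:t + 1 + k])       # future values, shape (k, d1, d2)
    X, Y = np.stack(X), np.stack(Y)
    if k == 1:                                  # matrix outputs, as for CCDS and Foursquare
        Y = Y[:, 0]
    return X, Y

# e.g. a Meteo-UK-like series: 480 months, 5 variables, 16 stations, horizon k = 5
rng = np.random.default_rng(3)
X, Y = make_forecasting_data(rng.standard_normal((480, 5, 16)), k=5)
print(X.shape, Y.shape)   # (473, 240) (473, 5, 5, 16)
```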
The average test RMSE over 10 runs is reported in Table 1 (running times are reported in Table 1 in supplementary material B). We see that HOLRR and K-HOLRR outperform the other methods on the CCDS dataset while being orders of magnitude faster for the kernelized version (0.61s vs. 75.47s for Greedy and 235.73s for ADMM on average). On the Foursquare dataset HOLRR performs as well as Greedy, and on the Meteo-UK dataset K-HOLRR gets the best results with the RBF kernel while being much faster than ADMM (1.66s vs. 40.23s on average).

5 Conclusion

We proposed a low-rank multilinear regression model for tensor-structured output data. We developed a fast and efficient algorithm to tackle the multilinear rank penalized minimization problem and provided theoretical guarantees. Experimental results showed that capturing low-rank structure in the output data can help to improve tensor regression performance.

Acknowledgments

We thank François Denis and the reviewers for their helpful comments and suggestions. This work was partially supported by ANR JCJC program MAD (ANR-14-CE27-0002).

References

[1] T. W. Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 22:327–351, 1951.
[2] M. T. Bahadori, Q. R. Yu, and Y. Liu. Fast multivariate spatio-temporal analysis via low rank tensor learning. In NIPS, 2014.
[3] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.
[4] A. Cichocki, R. Zdunek, A. H. Phan, and S. I. Amari. Nonnegative Matrix and Tensor Factorizations. Wiley, 2009.
[5] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.
[6] R. Foygel, M. Horrell, M. Drton, and J. D. Lafferty. Nonparametric reduced rank regression. In NIPS, 2012.
[7] S. Gandy, B. Recht, and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2):025010, 2011.
[8] N. Halko, P. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[9] C. J. Hillar and L. Lim. Most tensor problems are NP-hard. Journal of the ACM, 60(6):45, 2013.
[10] A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264, 1975.
[11] A. J. Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer-Verlag, New York, 2008.
[12] P. Kar and H. Karnick. Random feature maps for dot product kernels. In AISTATS, 2012.
[13] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[14] X. Long, L. Jin, and J. Joshi. Exploring trajectory-driven local geographic topics in Foursquare. In UbiComp, 2012.
[15] A. C. Lozano, H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe. Spatial-temporal causal modeling for climate change attribution. In KDD, 2009.
[16] H. Lu, K. N. Plataniotis, and A. Venetsanopoulos. Multilinear Subspace Learning: Dimensionality Reduction of Multidimensional Data. CRC Press, 2013.
[17] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
[18] A. Mukherjee and J. Zhu. Reduced rank ridge regression and its kernel extensions. Statistical Analysis and Data Mining, 4(6):612–622, 2011.
[19] M. Nickel and V. Tresp. An analysis of tensor models for learning on structured data. In Machine Learning and Knowledge Discovery in Databases, pages 272–287. Springer, 2013.
[20] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[21] G. C. Reinsel and R. P. Velu. Multivariate Reduced-Rank Regression: Theory and Applications. Lecture Notes in Statistics. Springer, 1998.
[22] B. Romera-Paredes, M. H. Aung, N. Bianchi-Berthouze, and M. Pontil. Multilinear multitask learning. In ICML, 2013.
[23] M. Signoretto, L. De Lathauwer, and J. K. Suykens. Learning tensors in reproducing kernel Hilbert spaces with multilinear spectral penalties. arXiv preprint arXiv:1310.4977, 2013.
[24] M. Signoretto, Q. T. Dinh, L. De Lathauwer, and J. K. Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 1–49, 2013.
[25] N. Srebro, N. Alon, and T. S. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In NIPS, 2004.
[26] H. E. Warren. Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133(1):167–178, 1968.
[27] K. Wimalawarne, M. Sugiyama, and R. Tomioka. Multitask learning meets tensor factorization: task imputation via convex optimization. In NIPS, 2014.
[28] Q. Zhao, C. F. Caiafa, D. P. Mandic, Z. C. Chao, Y. Nagasaka, N. Fujii, L. Zhang, and A. Cichocki. Higher-order partial least squares (HOPLS). IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1660–1673, 2012.
[29] Q. Zhao, G. Zhou, T. Adalı, L. Zhang, and A. Cichocki. Kernel-based tensor partial least squares for reconstruction of limb movements. In ICASSP, 2013.
[30] H. Zhou, L. Li, and H. Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013.