# highdimensional_quantile_tensor_regression__8be47fe7.pdf Journal of Machine Learning Research 21 (2020) 1-31 Submitted 4/20; Revised 10/20; Published 10/20 High-dimensional Quantile Tensor Regression Wenqi Lu wenqilu4-c@my.cityu.edu.hk School of Management Fudan University Shanghai, China and Department of Mathematics City University of Hong Kong Hong Kong, China Zhongyi Zhu zhuzy@fudan.edu.cn School of Management Fudan University Shanghai, China Heng Lian henglian@cityu.edu.hk Department of Mathematics City University of Hong Kong Hong Kong, China Editor: Animashree Anandkumar Quantile regression is an indispensable tool for statistical learning. Traditional quantile regression methods consider vector-valued covariates and estimate the corresponding coefficient vector. Many modern applications involve data with a tensor structure. In this paper, we propose a quantile regression model which takes tensors as covariates, and present an estimation approach based on Tucker decomposition. It effectively reduces the number of parameters, leading to efficient estimation and feasible computation. We also use a sparse Tucker decomposition, which is a popular approach in the literature, to further reduce the number of parameters when the dimension of the tensor is large. We propose an alternating update algorithm combined with alternating direction method of multipliers (ADMM). The asymptotic properties of the estimators are established under suitable conditions. The numerical performances are demonstrated via simulations and an application to a crowd density estimation problem. Keywords: Multidimensional array, Quantile regression, Sparsity principle, Tensor regression, Tucker decomposition 1. Introduction Quantile regression (Koenker and Bassett, 1978) provides a useful approach to analyse the heterogeneous impact of regressors on different parts of the conditional distribution of the response, exhibits robustness to outliers, comes with well-developed computational algorithms, and thus is widely used in applications (Koenker, 2005). There is a large literature on the computational aspects and the asymptotic theories of quantile regression in both c 2020 Wenqi Lu, Zhongyi Zhu and Heng Lian. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/20-383.html. Lu, Zhu and Lian low-dimensional and high-dimensional settings, see, for example, Koenker (2005); Belloni and Chernozhukov (2011); Yu et al. (2017); Yi and Huang (2017). Most of the existing works focused on scenarios in which the covariates and parameters of interest are vectors. However, in fields like image/video analysis or recommendation systems (Zhou et al., 2018; Rendle and Schmidt-Thieme, 2010; Liu et al., 2013), data often take the form of multidimensional arrays, also known as tensors. For example, in the real data application part, we take a 136 186 3 tensor extracted from surveillance video image in crowd dataset PETS2009 (Ferryman and Shahrokni, 2009) as the covariate. Using it as the predictor in a regression, the number of parameters is 136 186 3 = 75888. Moreover, vectorizing procedure, which destroys the spatial structure of an image, would result in a loss of information. In statistical models involving tensors, different types of tensor decomposition techniques are almost always used to treat tensor variables with a reduced number of parameters (Kolda and Bader, 2009; Chi and Kolda, 2012; Liu et al., 2012; Anandkumar et al., 2014a,b; Sun et al., 2017). Tensor methods can also be applied to deep learning models, aiming to reduce the number of parameters by imposing low-dimensional structures on tensors. For example, by applying tensor decomposition to neural network layers, it can be used for network model compression (Novikov et al., 2015; Kolbeinsson et al., 2019). Castellana and Bacciu (2020) used tensor methods in constructing the aggregation functions in LSTM (long short-term memory). Su et al. (2020) proposed a convolutional tensor-train LSTM for spatio-temporal learning. In some other works, converting input vectors in traditional machine learning and statistical models to tensors leads to new tensor-based methods, such as tensor clustering (Sun and Li, 2019) and vector autoregressive time series (Wang et al., 2019). Coming to the regression problems, there have been a sequence of developments concerning mean regression with tensor predictors, tensor responses and/or tensor parameters. Several papers have studied regression with tensor response, see for example Rabusseau and Kadri (2016); Li and Zhang (2017); Sun and Li (2017). For scalar on tensor regression, there are considerably more works, some dealing with high-dimensional cases, based on different decomposition methods. Guo et al. (2012) proposed tensor ridge regression and support tensor regression based on CANDECOMP/PARAFAC (CP) decomposition. Zhou et al. (2013) proposed an estimation procedure for general linear tensor regression model which also used CP decomposition and studied its asymptotic properties. Li et al. (2018) applied a more flexible decomposition, called Tucker decomposition (Kolda and Bader, 2009), to the same regression model as Zhou et al. (2013). Tucker decomposition decomposes the coefficient tensor into a core tensor multiplied by a factor matrix along each mode. It includes CP as a special case where the core tensor is diagonal and the ranks in different modes are equal. As pointed out in Li et al. (2018), it has advantages in accommodating tensors with skewed dimensions and allows explicit modelling of interactions compared to CP decomposition. Apart from CP and Tucker decomposition, Liu et al. (2020) applied Tensor-Train (TT) decomposition (Oseledets, 2011) to a tensor on tensor regression model. TT decomposition was chosen because it has advantages in high order tensors. Alternative to using tensor decompositions which typically results in non-convex optimization problems, Raskutti et al. (2019) considered convex regularization techniques to exploit low-rank and sparse properties in tensor regression. Tensor regression can also be incorporated as trainable layers in deep neural networks (Cao and Rabusseau, 2017; Kossaifiet al., 2020). High-dimensional Quantile Tensor Regression While all the literature mentioned above tackle mean regression, quantile regression with tensor covariates is rarely studied. Since quantile regression has advantages over mean regression when there are outliers or the distribution of response is skewed, and it can be used to build prediction intervals, many classical machine learning tools have been generalized to conditional quantiles, for example neural networks and random forests (Taylor, 2000; Meinshausen, 2006). Our goal is to fill the void in quantile tensor regression and provide estimation methods for this problem. In this paper, we propose methodologies to estimate the tensor coefficient in quantile regression. We assume the regression coefficient tensor has a low-rank structure, adopting Tucker decomposition to reduce the dimensionality to a manageable level, resulting in a parsimonious model. Moreover, we develop an alternating update algorithm for the proposed estimator and establish its asymptotic normality under some conditions. Furthermore, we introduce a sparse Tucker decomposition for the coefficient tensor. Sparsity assumption is commonly used in high-dimensional statistical problems. When the dimensions are large, it is reasonable to incorporate sparsity to further reduce the number of parameters. For instance, Raskutti et al. (2019) discussed several scenarios in which sparsity occurs at entry-wise, fiber-wise or slice-wise level of a tensor, and presented a general convex regularized optimization approach. Here we assume there exists a Tucker decomposition such that the factor matrices are sparse orthogonal matrices. We induce sparsity by penalizing the Kronecker product of factor matrices using ℓ1 penalty. For the sparse tensor quantile regression, we use alternating update combined with ADMM algorithm to deal with the ℓ1 penalty and orthogonality constraints, and derive an upper bound of statistical estimation error. Although conceptually straightforward, we make some important contributions which are summarized below. We extend the use of Tucker decomposition to quantile regression, which is an important tool in statistical analysis as demonstrated in the large literature. We establish the rates of convergence of both non-sparse and sparse quantile tensor regression estimator. Development of theoretical results for quantile regression using tensor decomposition is challenging, especially in high dimensions. Our proof indeed shows that it contains a lot more technical details compared to, say, quantile linear regression using lasso, or various least squares regression models. Computationally, for the case with sparse penalty, we propose an ADMM based algorithm, which is not needed in Li et al. (2018). Compared with prior works which applied ADMM to tensor decomposition (Zhang et al., 2014; Xie et al., 2018; Huang et al., 2016; Smith et al., 2017) and tensor regression (Wang et al., 2019), our algorithm is more complex due to the simultaneous use of quantile loss, sparsity penalty, and the orthogonality constraint, which requires introducing more auxiliary variables. The rest of the paper is organized as follows. In Section 2, we introduce the tensor quantile regression model based on Tucker decomposition, and present the estimation and implementation details for both nonsparse and sparse scenarios. Section 3 establishes the theoretical properties. In Section 4, we investigate the finite sample properties via simula- Lu, Zhu and Lian tion studies, and Section 5 presents an application of the proposed method in crowd density estimation problem. 2. Models and estimation 2.1 Preliminaries Tensors, or multidimensional arrays, are the fundamental constructs in our study. We start with a brief introduction to tensors and their decomposition. We refer readers to Kolda and Bader (2009) for a more detailed review. Throughout this paper, we denote tensors by boldface script capital letters such as X, Y, matrices by boldface capital letters X, Y , and vectors by small boldface letters x, y. For a vector x, let x 1, x and x denote its ℓ1, ℓ2 and ℓ norms. For a matrix X, its Frobenius norm, spectral norm, vectorization and jth largest singular value are denoted by X F , X op, vec(X) and σj(X), respectively. The ℓ1 norm of X is X 1 = vec(X) 1. The Kronecker product of two matrices X and Y is denoted by X Y . For a tensor X Rp1 p K, its Frobenius norm and inner product are X F = q Pp1 i1=1 Pp2 i2=1 Pp K i K=1 x2 i1i2...ik and X, Y = Pp1 i1=1 Pp2 i2=1 Pp K i K=1 xi1i2...ikyi1i2...ik, respectively. The order of a tensor is the number of dimensions, a.k.a modes a multidimensional array A Rp1 ...,p K is called a Kth-order tensor. The matricization of a tensor links the concepts and properties of matrices to tensors. Mode-k matricization of a tensor A Rp1 p K, denoted by A(k), arranges the mode-k fibers (all pk-dimensional vectors obtained by fixing all indices of the tensor A except for the k-th index) to be the columns of the resulting matrix. For example, the mode-1 matricization of A Rp1 p2 p3 is the p1 (p2p3) matrix A(1) such that {A(1)}i,(j 1)p3+k = Aijk, 1 i p1, 1 j p2, 1 k p3. Then, we can define the vectorization of A by vec(A) = vec(A(1)). Tensors can be multiplied by matrices. The mode-k multiplication of tensor A Rp1 p2 p K with a matrix U Rrk pk is defined as i1 ik 1jik+1 i K = ik=1 Ai1i2 i KUjik. The definition of rank for a tensor A Rp1 ...,p K is not universal and many definitions of rank have been proposed in the literature. In this paper, we consider the multilinear ranks (r1, r2, . . . , r K), where rk is the dimension of vector space spanned by the mode-k fibers. This rank is related to Tucker decomposition. For a given tensor A of rank (r1, r2, . . . , r K), there exists a Tucker decomposition A = G 1 U1 2 U2 K UK, where Uk Rpk rk, k = 1, . . . , K are factor matrices which can be assumed to be orthogonal if so desired, and G Rr1 r K is the core tensor, whose entries are related to the level of interactions between factors. A special Tucker decomposition called higher-order singular value decomposition (HOSVD, Lathauwer et al., 2000) is often adopted. The factor matrix High-dimensional Quantile Tensor Regression Uk of HOSVD can be computed by the leading rk left singular vectors of A(k) for each k = 1, . . . , K, and the core tensor G = A 1U T 1 2U T 2 K U T K. In this way, matrices Uk, k = 1, . . . , K are orthogonal, and G is all orthogonal, meaning the rows of each G(k), k = 1, . . . , K are mutually orthogonal. 2.2 Tucker tensor quantile regression In this paper, we consider quantile regression models with a scalar response y, and tensor covariate X Rp1 p K. The τth conditional quantile of response given covariate X is defined as Qτ(y|X) = inf t; Fy|X (t) τ , where Fy|X (t) is the condition distribution function of y given X. We consider a linear quantile regression model Qτ(y|X) = µ + A, X , where A is a coefficient tensor of the same size as X which captures the effect of each element in X. Since the intercept µ can be more easily dealt with both computationally and theoretically, in the following we will suppress the intercept µ for convenience of notation and explain briefly the minor modifications required in the algorithm and proofs at several places below. In this model, the number of parameters is QK k=1 pk, which is often large compared to the sample size. In order to reduce the number of parameters by exploiting the tensor structure, we impose a low-rank assumption on A. As mentioned before, if A has multilinear rank (r1, r2, . . . , r K), then there exists a Tucker decomposition A = G 1 U1 2 U2 K UK, where G Rr1 r K is the core tensor, and Uk Rpk rk, k = 1, . . . , K are called factor matrices. Based on the decomposition, we obtain the Tucker tensor quantile regression model Qτ(y|X) = G 1 U1 2 U2 K UK, X . (1) As a result of decomposition, the number of parameters in model (1) reduces to QK k=1 rk + PK k=1(pk rk)rk according to Zhang (2019), which is substantially smaller than QK k=1 pk. Besides the Tucker decomposition, the CP decomposition and Tensor-Train (TT) decomposition (Oseledets, 2011) are also frequently used, see for example Guo et al. (2012); Zhou et al. (2013); Liu et al. (2020). In general it is hard to compare the dimension reduction ability of CP decomposition and Tucker decomposition. We choose Tucker decomposition partially due to that it is unique under mild assumptions and given the tensor the multilinear rank is easy to find while it is an NP hard problem to determine the CP rank (H astad, 1990). TT decomposition is an efficient way to tackle tensor with large tensor order K. Furthermore, it may result in different TT format after permuting the modes of a tensor which may cause additional difficulty. For specificity, we only consider Tucker decomposition in the current work, while we acknowledge that investigations of other decompositions may also lead to fruitful results. To estimate the coefficient of quantile regression model, Koenker and Bassett (1978) proposed to replace the ℓ2 loss in mean regression by the check loss function, defined by ( τu, if u 0, (τ 1)u, if u < 0, Lu, Zhu and Lian or equivalently ρτ(u) = u(τ I(u 0)) where I(.) is the indicator function that takes value 1 if the statement inside the bracket is true and 0 otherwise. For a given quantile τ, the estimate of model (1) with ranks (r1, r2, . . . , r K) is obtained by minimizing Pn i=1 ρτ (yi G 1 U1 2 U2 K UK, X i ). Thus the estimator of A can be defined as b A = b G 1 b U1 2 b U2 K b UK arg min G,Uk,k=1,...,K i=1 ρτ (yi G 1 U1 2 U2 K UK, X i ) . (2) Therefore, the problem of estimating A is transformed to estimating the core tensor G and factor matrices Uk, k = 1, . . . , K. Note that the core tensor and factor matrices are not identified since the decomposition is not unique (unless we put more stringent conditions on the tensor). Therefore, the minimizer is not unique. Our theoretical results hold for any minimizer in the set of minimizers. By the fact that A(k) = Uk G(k)(UK Uk+1 Uk 1 U1)T , we can rewrite G 1 U1 2 U2 K UK, X as vec(X)T (UK UK 1 U1)vec(G) (3) vec(X (k))T (UK Uk+1 Uk 1 U1)GT (k) Ipk vec(Uk), (4) where Ipk is the pk pk identity matrix. This implies that the objective function (2) is linear with respect to vec(G) and vec(Uk), k = 1, . . . , K, when the others are fixed. Equations (3) and (4) involve computation of Kronecker products which results in large and unwieldy matrices when the dimension is high. (3) and (4) are used mainly due to their elegant mathematical presentation. In practice, one can organize the computation in other ways. For example, for K = 3, since G 1 U1 2 U2 3 U3, X = s3=1 xd1d2d3u1,d1s1u2,d2s2u3,d3s3gs1s2s3, where xd1d2d3, u1,d1s1, u2,d2s2, u3,d3s3, gs1s2s3 denotes the entries of X, U1, U2, U3, G, respectively, then it is easy to see that vec(X)T (U3 U2 U1) in (3) can be computed via the vectorization of X 1 UT 1 2 UT 2 3 UT 3 , for example, to avoid directly computing the Kronecker product. Similarly, (4) can be dealt with using appropriate matrix products. The following alternating update algorithm is used to find b A. We note that we do not require G, Uk to be identified since we are only interested in the tensor A. In particular, in this part, we do not require Uk to be orthogonal which makes the optimization problem simpler. Such a choice for similar low-rank matrix/tensor models are often adopted in the literature (Liu et al., 2013; Udell et al., 2016). The alternating update algorithm obviously can decrease the objective function value in each step and the function value is guaranteed to converge. High-dimensional Quantile Tensor Regression Algorithm 1 Alternating update algorithm for low dimensional quantile regression Initialized: A(0) = arg min Pn i=1 ρτ(yi vec(X i)T vec(A)) Perform HOSVD: A(0) = G(0) 1 U (0) 1 2 U (0) 2 K U (0) K with predetermined ranks (r1, . . . , r K) Repeat ℓ= 0, 1, 2, . . . For k = 1, . . . , K U (ℓ+1) k = arg min Uk Pn i=1 ρτ yi vec(X i,(k))T (U (ℓ) K U (ℓ) k+1 U (ℓ+1) k 1 U (ℓ+1) 1 ) G(ℓ+1)T (k) Ipk G(ℓ+1) = arg min G Pn i=1 ρτ yi vec(X i,(1))T (U (ℓ+1) K U (ℓ+1) K 1 U (ℓ+1) 1 )vec(G(1)) A(ℓ+1) = G(ℓ+1) 1 U (ℓ+1) 1 2 U (ℓ+1) 2 K U (ℓ+1) K Until convergence Remark 1 Note that we omit the intercept in most parts of the paper for simplicity of exposition. When an intercept is added, its estimation can be combined with the G-update step. In this way, the G-update step becomes (µ(ℓ+1), G(ℓ+1)) = arg minµ,G Pn i=1 ρτ vec(X i,(1))T (U (ℓ) K U (ℓ) K 1 U (ℓ) 1 )vec(G(1)) , and the yi in U-update step is replaced by yi µℓ. The same modification is used in the sparse case in the following section. 2.3 Sparse Tucker tensor quantile regression When the number of variables p1p2 p K is large, it is desirable to incorporate sparsity. In our case study later, sparsity is mainly used for dimension reduction without much interpretability. Assume that there exists a Tucker decomposition for A such that the factor matrices U1, . . . , UK are sparse orthogonal matrices. We impose ℓ1 penalty for variable selection and hence the estimator is b A = b G 1 b U1 2 b U2 K b UK = arg min 1 yi G 1 U1 2 U2 K UK, X i + λ UK U1 1, (5) subject to U T k Uk = Irk, k = 1, . . . , K. By the definition of the ℓ1 norm, we have j1 |u K,i Kj Ku K 1,i K 1j K 1 u1,i1j1| j1=1 |u1,i1j1| j K=1 |u K,i Kj K| = U1 1 U2 1 UK 1. Lu, Zhu and Lian Algorithm 2 Alternating update algorithm for high dimensional quantile regression Initialized: A(0) = arg min Pn i=1 ρτ(yi vec(X i)T vec(A)) Perform HOSVD: A(0) = G(0) 1 U (0) 1 2 U (0) 2 K U (0) K with predetermined ranks (r1, . . . , r K) Repeat ℓ= 0, 1, 2, . . . For k = 1, . . . , K U (ℓ+1) k = arg min UT k Uk=Irk 1 n Pn i=1 ρτ yi vec(X i,(k))T (U (ℓ) K U (ℓ) k+1 U (ℓ+1) k 1 U (ℓ+1) 1 ) G(ℓ+1)T (k) Ipk vec(Uk) + λ U (ℓ+1) 1 1 U (ℓ+1) k 1 1 U (ℓ) k+1 1 U (ℓ) K 1 Uk 1 G(ℓ+1) = arg min G 1 n Pn i=1 ρτ yi vec(X i)T (U (ℓ+1) K U (ℓ+1) K 1 U (ℓ+1) 1 )vec(G) A(ℓ+1) = G(ℓ+1) 1 U (ℓ+1) 1 2 U (ℓ+1) 2 K U (ℓ+1) K Until convergence Therefore, the ℓ1 penalty is imposed jointly on the ℓ1 norm of all the factor matrices. Several remarks are in order. First, we do not penalize the core tensor matrix since usually r1, r2, r3 are small. If one really wants, one can also add a penalty on G. Entries of G represents interactions between latent dimensions and selection of nonzero entries may be of interest in some cases. However, in our application, we simply use sparsity as a way to reduce dimension and the size of G is already very small and thus we do not impose sparsity on the core tensor. Second, putting a penalty on the product UK U1 1 = U1 1 U2 1 UK 1 is somewhat unusual, but is motivated by the form (3) in which the product of factor matrices appears. Using such a penalty makes it much more convenient, to say the least, to perform our theoretical analysis later. In practice, we find it performs very similar to using the more conventional penalty λ( U1 1 + U2 1 + + UK 1). Third, we only use one tuning parameter λ, which is only suitable if all factor matrices are believed to have similar degrees of sparsity. For greater flexibility, one should instead use λ1 U1 1 + λ2 U2 1 + + λK UK 1. In our current numerical examples, we find there is no need to use the more flexible penalty which leads to the computational burden of having to tune multiple parameters. Finally, unlike the non-sparse case, here we would require the orthogonality constraint. The reason is that without the constraint, due to scale indeterminacy (multiplying G by any constant while dividing the same constant on Uk will result in the same A), the penalty will make all Uk arbitrarily close to zero while making G very large. This is in stark contrast with Algorithm 1, where we do not need to use any constraints when there is no penalty and the algorithm is much simpler there. Alternatively, a ridge penalty on G can also be used with an additional tuning parameter. However, in the high-dimensional case, when there is no orthogonal constraint, we often find such estimators are more unstable with larger variances. The estimate b A can be computed by an alternating update algorithm similar to that in Section 2.2, see Algorithm 2. However, the update of Uk, k = 1, . . . , K, is a nonconvex optimization problem due to the orthogonality constraint, thus it can not be solved by ordinary quantile regression as in Section 2.2. In order to separate the orthogonality constraint and ℓ1 regularization, we propose to use an ADMM algorithm motived by Yu et al. (2017). High-dimensional Quantile Tensor Regression More specifically, the update of Uk, k = 1, . . . , K takes the common form min B 1 nρτ y Zvec(B) + λ B 1, s.t. BT B = I, (6) where ρτ y Zvec(B) = Pn i=1 ρτ yi zivec(B) . Rewriting (6) in an equivalent form by introducing auxiliary variables βb, r and P , we have arg min r,βb,β,P 1 nρτ(r) + λ β 1 (7) subject to y Zβb = r, βb = β, Bb = P , P T P = I, where β = vec(B) and βb = vec(Bb). Using augmented Lagrangian, the ADMM algorithm consists of the following steps, β(j+1) = arg min β λ β 1 + γ 2 β(j) b β + 1 r(j+1) = arg min r 1 nρτ(r) + γ 2 y Zβ(j) b r + 1 P (j+1) = arg min P γ 2 B(j) b P + 1 γ E(j) 2 F , subject to P T P = I β(j+1) b = arg min βb γ 2 y Zβb r(j+1) + 1 γ u(j) 2 + γ 2 βb β(j+1) + 1 2 Bb P (j+1) + 1 E(j+1) = E(j)+γ(B(j+1) b P (j+1)) u(j+1) = u(j)+γ(y Zβ(j+1) b r(j+1)) η(j+1) = η(j)+γ(β(j+1) b β(j+1)). All updates for β, βb, r, P above can be obtained in closed form, as detailed in Algorithm 3. We note that for general nonconvex problems the ADMM algorithm does not have satisfying theoretical convergence guarantee, but in our numerical studies we observe good convergence behavior for the algorithm although the results are not presented here. In particular, we always observe that Bb and B are very close to being orthogonal. 3. Theoretical properties In this section, we investigate the statistical properties of the proposed quantile tensor regression method. In particular, we establish the asymptotic normality for the estimate of nonsparse quantile tensor regression under mild general conditions, in particular showing that the estimator is n-consistent. Then we derive an upper bound for the estimation error under the sparse setting, which shows that the estimate converges to the true value at a rate depending on sample size, number of nonzero parameters and logarithm of dimension Lu, Zhu and Lian Algorithm 3 ADMM for sparse and orthogonal regression Initialize: B(0) b = U (ℓ) k , P (0) = U (ℓ) k , E(0) = 0, η(0) = u(0) b = 0 Repeat j = 0, 1, 2, . . . β(j+1) = [β(j) b + η(j)/γ λ/γ]+ [ β(j) b η(j)/γ λ/γ]+ r(j+1) = [y Zβ(j) b + u(j)/γ τ nγ 1n]+ [ y + Zβ(j) b u(j)/γ + τ 1 nγ 1n]+ Compute SVD B(j) b + E(j)/γ = V1DV T 2 P (j+1) = V1V T 2 β(j+1) b = (ZT Z + 2I) 1 ZT (y r(j+1) + u(j)/γ) η(j)/γ + β(j+1) + vec(P (j+1)) vec(E(j))/γ E(j+1) = E(j) + γ(B(j+1) b P (j+1)) u(j+1) = u(j) + γ(y Zβ(j+1) b r(j+1)) η(j+1) = η(j) + γ(β(j+1) b β(j+1)) Until convergence p. The logarithmic dependence on p means the procedure can be applied to very large tensors, although limited by computational efficiency in implementation. We first consider the asymptotic distribution of the estimator for the nonsparse Tucker tensor quantile regression discussed in Section 2.2, assuming the dimensions p1, . . . , p K are fixed. Let φ = (vec(G)T , vec(U1)T , . . . , vec(UK)T )T be the true parameters (the true φ is not unique but any that corresponds to the true A will do). Let h(φ) = vec(A) = vec G 1U1 2U2 KUK considered as a function of φ. Define the QK k=1 pk QK k=1 rk+ PK k=1 pkrk Jacobian matrix H as φ = UK U1, (UK U2)GT (1) Ip1, T21{ (UK U3 U1)GT (2) Ip2}, . . . , TK1{ (UK 1 U1)GT (K) Ip K} , where Tij is the permutation matrix such that vec(A(j)) = Tijvec(A(i)). Let Fε|x(t) and fε|x(t) be the conditional distribution function and conditional density function of random errors ε := y X, A . In addition, let xi = vec(X i,(1)), D0 = E(xix T i ), D1 = E(xix T i fε|x(0)) and denote the smallest eigenvalue of a symmetric matrix D by δmin(D). Finally we define Γ = τ(1 τ)D 1 1 D0D 1 1 . In order to establish the asymptotic properties of the estimator (2), we assume the following conditions. C1. fε|x(0) is bounded away from zero uniformly over the support of x, and both fε|x(.) and its derivative are uniformly bounded. C2. E x 3 < . The matrix D0 is positive definite. C3. The parameter space for h is bounded. Conditions C1 and C2 are mild regularity assumptions commonly used in quantile regression models (Wang et al., 2009; Belloni and Chernozhukov, 2011). Condition C1 imposes smoothness assumptions on the conditional density of random errors, which is satisfied by High-dimensional Quantile Tensor Regression most distributions such as Gaussian and exponential distribution. Condition C2 is an assumption on existence of moments to ensure asymptotic normality of the estimator. It is trivially satisfied if x is bounded. The boundedness of parameter space in Condition C3 is often necessary in nonconvex models in order to apply empirical process theory. In practice, one usually searches within a large but bounded region and thus this assumption is not overly stringent. We establish the asymptotic distribution of h( bφ) in the following theorem, with proofs relegated to the appendix. Theorem 1 Suppose conditions C1-C3 hold, then as n , n h( bφ) h(φ) d N(0, Σ), (8) where Σ = P ΓP T , and P = H(HT D1H) HT D1. Here (.) denotes the pseudo-inverse. Note that Γ is the asymptotic variance of conventional quantile regression. This theorem in particular shows that the estimator is n-consistent. Such asymptotic results can potentially lead to interval estimates which however is not our focus in this paper (we are mainly interested in its prediction performance especially in high dimensions). Next we derive the error bound for the estimation error of sparse Tucker tensor quantile regression (5) in high dimensions. Let r = QK k=1 rk, p = QK k=1 pk, U = UK U1 and g = vec(G). To obtain the upper bound of estimation error, we assume the following conditions. C4. The factor matrices are sparse and satisfy Uk 0 sk for 1 k K. C5. The parameter space for G is Ωg = {G : vec(G) g < } for some g > 0. The parameter space for Uk is Ωk = {Uk : Uk Rpk rk, U T k Uk = I}. We let ΩU = {U = UK U1 Rp r : Uk Ωk}. C6. x = vec(X) is sub-Gaussian in the sense that E exp{a T x} C exp{C a 2} for any vector a where C > 0 is a constant. The eigenvalues of D0 are bounded away from zero and infinity. C7. The maximum and the minimum non-zero singular values of the rank-rk matrix A(k) is given by σmax,k and σmin,k respectively. C8. Let Ω= { Rp : = (U + U)(g + g) Ug, U + U ΩU, g + g Ωg, ( U)Sc 1 3 ( U)S 1 + g 1}, where S is the indices of the nonzero entries of U with Sc its complement, ( U)S, for example, denotes the vector containing the entries of U indexed in S. We assume that there exists some positive constant c1 such that for any Ωwith = t, Q(Ug + ) c1(t2 t), t > 0, where Q(δ) = E[ρτ(y x T δ) ρτ(y x T Ug)]. Condition C4 constrains the sparsity of factor matrices. The condition Uk 0 sk implies U 0 QK k=1 sk =: s where U = UK U1. Note that we do not require an Lu, Zhu and Lian upper bound for sk. Thus, the following Theorem 2 holds for any s, but the upper bound will increase as s goes up. The algorithm can always be applied whether the true matrix is sparse or not. Condition C5 restricts the parameter space for G and Uk, which is related to C3 in the fixed-dimensional case. We use the upper bound g for technical reasons in the proof. In the high-dimensional setting, we require a stronger distribution assumption on the predictors as in C6. It is satisfied if the components of x are bounded and independent, although sub-Gaussianity is much more general than boundedness. C7 assumes A is lowrank. The bounds will be simplified if we assume σmax,k, σmin,k are bounded away from zero and infinity. C8 is a high-level assumption on local strong convexity of the expected loss at the minimizer. Lemma 4 in Belloni and Chernozhukov (2011) showed that C8 holds if the quantity inf Ω, =0 (E|x T |2)3/2 E|x T |3 is bounded away from zero. This quantity is indeed bounded away from zero when, for example, x is Gaussian. Theorem 2 Suppose conditions C1 and C4-C8 hold. Assume λ c2 g p log(p n)/n for some sufficiently large constant c2, Then with probability 1 (p n) c3 for some constant c3 > 0, b A A F Cbnλ/ g, for some constant C > 0 if dnλ/ g = o(1). Here bn, dn are defined as + C( g + 1) r + C( g + 1) r A F σ2 min,k The bound can be simplified significantly if we assume A F , σmax,k, g, r are all bounded, and σmin,k are bounded away from zero. In this case, we obtain b A A F C p s log(p n)/n when we set λ p log(p n)/n. This result gives the rate of convergence of the sparse quantile tensor regression estimator. The upper bound depends on the sample size n and the number of non-zero parameters s, and the dimension p only has a logarithmic effect. It also shows that the low-rank structure (smaller rk) leads to better risk bounds. 4. Simulations In this section, we carry out simulation studies to investigate the finite sample performances of the proposed methods. 4.1 Low-dimensional tensor quantile regression We first consider the low-dimensional non-sparse estimator presented in Section 2.2. We examine the performances of the proposed algorithm under a variety of dimensions, sample sizes and signal strengths. Specifically, the response is generated by yi = A, X i + εi, where X i is a p1 p2 p3 tensor with standard normal entries, and random errors εi are generated independently from normal distribution N( qτ, σ2) with qτ being the τth High-dimensional Quantile Tensor Regression quantile of N(0, 1). Coefficient A is generated by G 1 U1 2 U2 3 U3. The entries of the core tensor G are standard normal values. The factor matrix Uk is obtained from a QR decomposition of a pk rk matrix with independent standard normal entries. We use dimensions (p1, p2, p3) = (5, 5, 5) and (10, 10, 10). Rank (r1, r2, r3) is chosen to be either (2, 2, 2) or (3, 3, 3). In addition, we choose the sample size n = 1000 and 2000, and noise level σ = 0.5 and 1. To select the ranks in the proposed method, we use the following BIC BIC = log 1 i=1 ρτ(yi b A, X ) + df where the degrees of freedom (df) is defined as df = r1r2 r K + PK k=1 rk(pk rk). We compare the performances of the proposed method with lasso quantile regression (using the hqreg package in R, Yi and Huang, 2017). Table 1 presents the estimation errors (EE) measured by b A A F , and the percentage that the true rank is selected, based on 200 replications. It can be seen that the estimation error of the proposed method is smaller than the lasso method in all settings (nothing unexpected here since the true model involves a low-rank tensor), and the percentage of choosing the true rank by BIC is always close to 1. 4.2 High-dimensional sparse tensor quantile regression For this experiment, we take ranks (r1, r2, r3) = (2, 2, 2), and consider larger dimensions (p1, p2, p3) = (10, 10, 10), (20, 20, 20) and (100, 100, 3) and smaller sample sizes n = 200, 400, 800. A is still generated by G 1 U1 2 U2 3 U3. A sparse factor matrix U is generated as follows. First let U = a3 1 03 1 02 1 b2 1 where a, b are vectors of independent standard normal random numbers. When p = 10, we obtain Uk R5 2 by stacking two copies of independently generated U s, one on top of the other. For other dimensions p = 20, 100 similar procedure is followed to generate Uk. Finally, we standardize these constructed matrices into orthonormal matrices (dividing each column by its length). When (p1, p2, p3) = (10, 10, 10) and (20, 20, 20), all Uk s are set to be sparse (and generated independently as described above). When (p1, p2, p3) = (100, 100, 3), U1 and U2 are sparse, while U3 is non-sparse (due to that its size is small) and generated as in Section 4.1. Accordingly, we use the penalty on U1 1 U2 1. The core tensors, the covariates and the random errors are generated in the same way as before. We again compare the performances with lasso. Since the number of free parameters is not straightforward to define for a sparse and orthogonal matrix, we select the ranks by minimizing the out of sample prediction error on independently generated data. Table 2 reports the estimation error and Table 3 reports the average selected ranks when n = 400. As in the nonsparse case, the proposed method outperforms lasso. In particular, lasso fails when (p1, p2, p3) = (20, 20, 20), with the estimation errors almost the same as the Frobenius norm of the true coefficients, but our approach still works. These results demonstrate that for a given finite sample with limited Lu, Zhu and Lian (p1, p2, p3) rank n σ = 0.5 σ = 1 LASSO Proposed LASSO Proposed EE EE rank EE EE rank (2,2,2) 1000 0.23(0.02) 0.11(0.02) 1.00 0.47(0.04) 0.22(0.04) 1.00 2000 0.15(0.01) 0.07(0.01) 1.00 0.32(0.02) 0.16(0.02) 1.00 (3,3,3) 1000 0.24(0.02) 0.14(0.02) 1.00 0.43(0.03) 0.29(0.03) 1.00 2000 0.19(0.01) 0.10(0.01) 1.00 0.29(0.02) 0.20(0.02) 1.00 (2,2,2) 1000 1.70(0.13) 0.16(0.02) 0.97 3.38(0.27) 0.33(0.03) 0.96 2000 0.55(0.02) 0.11(0.01) 1.00 1.18(0.05) 0.23(0.02) 1.00 (3,3,3) 1000 1.93(0.16) 0.21(0.02) 1.00 3.36(0.23) 0.42(0.03) 1.00 2000 0.52(0.02) 0.14(0.01) 1.00 1.11(0.04) 0.29(0.02) 1.00 τ = 0.5 (p1, p2, p3) rank n σ = 0.5 σ = 1 LASSO Proposed LASSO Proposed EE EE rank EE EE rank (2,2,2) 1000 0.21(0.02) 0.10(0.02) 1.00 0.42(0.03) 0.20(0.03) 1.00 2000 0.14(0.01) 0.07(0.01) 1.00 0.30(0.02) 0.14(0.02) 1.00 (3,3,3) 1000 0.19(0.01) 0.13(0.01) 1.00 0.47(0.03) 0.27(0.03) 1.00 2000 0.13(0.09) 0.09(0.01) 1.00 0.33(0.02) 0.19(0.02) 1.00 (2,2,2) 1000 0.94(0.15) 0.15(0.01) 1.00 1.45(0.18) 0.30(0.03) 1.00 2000 0.53(0.02) 0.10(0.01) 1.00 1.11(0.04) 0.21(0.02) 1.00 (3,3,3) 1000 1.88(0.15) 0.19(0.01) 1.00 3.37(0.23) 0.39(0.03) 1.00 2000 0.51(0.02) 0.13(0.01) 1.00 1.06(0.03) 0.27(0.02) 1.00 τ = 0.75 (p1, p2, p3) rank n σ = 0.5 σ = 1 LASSO Proposed LASSO Proposed EE EE rank EE EE rank (2,2,2) 1000 0.21(0.01) 0.11(0.01) 1.00 0.46(0.03) 0.22(0.04) 1.00 2000 0.15(0.01) 0.08(0.01) 1.00 0.31(0.02) 0.16(0.02) 1.00 (3,3,3) 1000 0.20(0.01) 0.14(0.02) 1.00 0.43(0.03) 0.29(0.03) 1.00 2000 0.14(0.01) 0.10(0.01) 1.00 0.30(0.02) 0.21(0.02) 1.00 (2,2,2) 1000 1.70(0.13) 0.17(0.02) 0.99 3.40(0.28) 0.33(0.03) 0.99 2000 0.55(0.02) 0.11(0.01) 1.00 1.18(0.04) 0.23(0.02) 1.00 (3,3,3) 1000 1.94(0.17) 0.21(0.02) 1.00 3.40(0.24) 0.41(0.03) 1.00 2000 0.52(0.02) 0.14(0.01) 1.00 1.11(0.04) 0.29(0.02) 1.00 Table 1: Estimation errors and the percentage of times selecting the true ranks for tensor quantile regression. Numbers in the parentheses denote the standard errors. High-dimensional Quantile Tensor Regression (p1, p2, p3) n σ = 0.5 σ = 1 LASSO Proposed LASSO Proposed (10,10,10) 200 2.40(0.72) 0.56(0.07) 2.52(0.67) 0.93(0.39) 400 1.82(0.56) 0.36(0.04) 2.05(0.55) 0.51(0.06) 800 1.44(0.57) 0.26(0.03) 1.70(0.48) 0.37(0.04) (20,20,20) 200 2.97(0.63) 1.99(0.70) 3.01(0.62) 2.30(0.68) 400 2.88(0.61) 0.62(0.11) 2.93(0.59) 0.94(0.20) 800 2.66(0.54) 0.38(0.03) 2.73(0.54) 0.57(0.10) (100,100,3) 200 3.02(0.73) 3.01(0.73) 3.04(0.73) 3.05(0.72) 400 3.02(0.72) 2.16(0.64) 3.05(0.71) 2.38(0.63) 800 2.98(0.71) 1.80(0.85) 3.01(0.69) 2.15(0.77) τ = 0.5 (p1, p2, p3) n σ = 0.5 σ = 1 LASSO Proposed LASSO Proposed (10,10,10) 200 2.82(0.90) 0.38(0.07) 2.91(0.84) 0.84(0.37) 400 2.10(0.71) 0.24(0.02) 2.32(0.70) 0.48(0.06) 800 1.36(0.44) 0.17(0.02) 1.63(0.41) 0.35(0.04) (20,20,20) 200 3.42(0.70) 1.61(0.70) 3.55(0.74) 2.08(0.66) 400 3.20(0.86) 0.44(0.27) 3.25(0.66) 0.84(0.23) 800 2.89(0.64) 0.25(0.02) 2.78(0.81) 0.52(0.07) (100,100,3) 200 3.52(0.87) 3.03(0.70) 3.58(0.88) 3.10(0.74) 400 3.32(0.87) 2.06(0.68) 3.45(0.86) 2.30(0.65) 800 3.14(0.81) 1.56(0.89) 3.20(0.81) 2.01(0.76) τ = 0.75 (p1, p2, p3) n σ = 0.5 σ = 1 LASSO Proposed LASSO Proposed (10,10,10) 200 2.41(0.69) 0.59(0.10) 2.54(0.66) 0.85(0.17) 400 1.87(0.63) 0.36(0.04) 2.02(0.51) 0.54(0.07) 800 1.54(0.63) 0.26(0.02) 1.70(0.53) 0.37(0.03) (20,20,20) 200 2.97(0.63) 1.88(0.66) 3.00(0.62) 2.21(0.64) 400 2.88(0.61) 0.65(0.22) 2.94(0.59) 0.93(0.20) 800 2.66(0.55) 0.38(0.03) 2.73(0.54) 0.57(0.12) (100,100,3) 200 3.02(0.73) 3.00(0.72) 3.04(0.71) 3.05(0.73) 400 3.02(0.72) 2.17(0.67) 3.04(0.71) 2.40(0.66) 800 2.98(0.70) 1.83(0.83) 3.02(0.70) 2.14(0.80) Table 2: Estimation errors for sparse tensor quantile regression. Numbers in the parentheses denote the standard errors. data, further dimension reduction mechanism brought about by tensor decomposition can further improve estimation efficiency compared to using sparsity alone. Lu, Zhu and Lian (p1, p2, p3) (r1, r2, r3) σ = 0.5 σ = 1 r1 r2 r3 r1 r2 r3 (10,10,10) (2,2,2) 2.02(0.14) 2.02(0.14) 2.06(0.24) 2.00(0.29) 2.00(0.29) 2.04(0.35) (20,20,20) (2,2,2) 1.98(0.38) 2.00(0.35) 2.12(0.39) 1.78(0.42) 1.82(0.39) 1.98(0.51) (100,100,3) (2,2,2) 1.40(0.61) 1.56(0.61) 2.62(0.60) 1.44(0.64) 1.68(0.62) 2.62(0.53) τ = 0.5 (p1, p2, p3) (r1, r2, r3) σ = 0.5 σ = 1 r1 r2 r3 r1 r2 r3 (10,10,10) (2,2,2) 2.02(0.14) 2.02(0.14) 2.08(0.27) 2.02(0.25) 2.02(0.25) 2.10(0.36) (20,20,20) (2,2,2) 1.94(0.24) 1.94(0.24) 2.04(0.28) 1.80(0.40) 1.84(0.37) 2.02(0.47) (100,100,3) (2,2,2) 1.42(0.70) 1.60(0.70) 2.70(0.54) 1.60(0.83) 1.84(0.77) 2.70(0.46) τ = 0.75 (p1, p2, p3) (r1, r2, r3) σ = 0.5 σ = 1 r1 r2 r3 r1 r2 r3 (10,10,10) (2,2,2) 1.98(0.14) 1.98(0.14) 2.12(0.33) 1.94(0.31) 1.94(0.31) 2.02(0.32) (20,20,20) (2,2,2) 1.94(0.31) 1.94(0.31) 2.00(0.40) 1.82(0.48) 1.82(0.48) 1.92(0.49) (100,100,3) (2,2,2) 1.30(0.61) 1.54(0.65) 2.72(0.54) 1.60(0.78) 1.82(0.72) 2.76(0.52) Table 3: Mean and standard error of the estimated ranks for sparse Tucker tensor quantile regression when n = 400. The true rank is (2, 2, 2). Then we consider the situation with unequal ranks for different modes of the tensor, which is set to be (2, 3, 4). The factor matrices U1 Rp1 2, U2 Rp2 3 and U3 Rp3 4 are respectively generated by stacking U (as defined before) and U 2 = a3 2 03 1 02 2 b2 1 R5 3 , U 3 = a3 2 03 2 02 2 b2 2 The orthogonal matrices a3 2 and b2 2 are obtained by QR decomposition as before. Figure 1 and Figure 2 compare the proposed method and lasso quantile regression under different choice of sample sizes and dimensions. As expected, the estimation errors increase as dimensions increases, and as the sample size n decreases, and the errors are smaller than lasso quantile regression. Moreover, as suggested by a reviewer, we consider the alternative approach of simply imposing convex regularizations on the coefficient tensor A to encourage low-dimensional structure. To be specific, we impose the nuclear norm regularization on the matricized coefficient tensors to encourage low-rankness. The objective function is i=1 ρτ (yi A, X i ) + λ1 k=1 A(k) , (9) where λ1 is the tuning parameter, and stands for the matrix nuclear norm. We can also add an entry-wise ℓ1 penalty on A to encourage sparsity, resulting in i=1 ρτ (yi A, X i ) + λ1 k=1 A(k) + λ2 j K=1 |Aj1j2...j K|. (10) High-dimensional Quantile Tensor Regression Figure 1: Estimation errors based on 50 replications under quantile level τ = 0.25, 0.5 and 0.75, when true ranks are (2, 3, 4) and dimension p1 = p2 = p3 = 10. The error bars represent one standard deviation. Figure 2: Estimation error based on 50 replications under quantile levels τ = 0.25, 0.5 and 0.75, when the true rank is (2, 3, 4) and sample size n = 400. The dimension of the coefficient tensor is (p, p, p). The error bars represent one standard deviation. Lu, Zhu and Lian Figure 3: Mean of estimation error based on 50 replications under quantile level τ = 0.25, 0.5 and 0.75, when the dimension p1 = p2 = p3 = 10, ranks (r1, r2, r3) = (2, 2, 2) and noise level σ = 0.5. The error bars represent one standard deviation. Note that the sparsity regularization in (10) is not on the factor matrices Uk and thus represents a different model of sparsity. The minimizer of (9) and (10) are computed by the proximal gradient method. For this experiment, we still generated datasets as before and take ranks (r1, r2, r3) = (2, 2, 2), dimensions (p1, p2, p3) = (10, 10, 10), sample size n = 200, 400, 800 and noise level σ = 0.5. Figure 3 shows the error of the estimate computed by (9), (10), lasso and the proposed method. Although in this experiment, our method performs better than convex regularization, which is certainly as expected, we are not saying the convex penalization approach is always worse. Convex approach as an alternative approach also deserves careful theoretical and computational study in the convex of quantile tensor regression, but a detailed investigation is outside the scope of the current paper. 5. Application to PETS 2009 dataset In this section, we show the effectiveness of the proposed method by performing experiments on PETS 2009 dataset (http://www.cvg.reading.ac.uk/PETS2009). The dataset contains crowd images recorded at Whiteknights Campus, University of Reading, UK. Our goal is to estimate the number of people by fitting a model that takes the color images as tensor covariates (136 186 3 tensor). Several views of the same scene are available and we only use View 1 in this analysis. Figure 4 shows two example images of View 1 in the dataset. The ground-truth count is obtained from Chan et al. (2009). We fit a quantile regression model using 662 images from timestamps 13-57, 13-59 and the first 200 images with timestamp 14-03. The remaining 213 images of 14-03 are used as the test set. The raw count is taken as the response. We also tried some transformations on High-dimensional Quantile Tensor Regression τ = 0.25 Training error Prediction error Lasso Proposed Lasso Proposed Raw 0.18 0.09 2.49 2.18 Square root 0.47 0.07 1.26 0.89 τ = 0.5 Training error Prediction error Lasso Proposed Lasso Proposed Raw 0.21 0.11 1.96 1.09 Square root 0.35 0.19 1.25 0.78 τ = 0.75 Training error Prediction error Lasso Proposed Lasso Proposed Raw 0.39 0.10 1.04 0.91 Square root 0.58 0.15 0.98 0.72 Table 4: Training and prediction error for lasso and the proposed method on the test set, either directly using the count as the response, or using square-root transformation. the response from the Box-Cox family and found that the square-root transformation works well. We examine the performances of the proposed sparse tensor regression approach on predicting the 0.25, 0.5 and 0.75 quantile of the counts on the test set. Ranks are chosen from all combinations of ranks up to 4 (except r3 is of course bounded by 3). Since the dimension of mode-3 is small, we only impose penalty on factor matrices U1 and U2. The tuning parameter and ranks were set by 10-fold cross-validation, resulting in low-rank estimators with rank (3, 3, 3). Using the low-rank structure, the number of parameters is reduced by about 98.7%. Table 4 shows the training errors and prediction errors in terms of quantile loss (for prediction, responses are transformed back to the original scale after model fitting on the transformed response). As in simulations, we compared the proposed method with lasso quantile regression which vectorizes the image. It can be seen that the proposed method leads to smaller errors. Figure 4 shows two examples with the predicted quantile values calculated by the proposed method (with square-root transformation). 6. Conclusion In this paper, we propose a Tucker decomposition-based estimation approach for quantile regression with tensor covariates. The motivation of our work is the increasing demand for analysing tensor valued data such as images, and the lack of studies on quantile regression for this data type. The high dimensionality which often emerges in tensor regression is reduced by imposing a low-rank approximation and then applying Tucker decomposition. In addition, when the dimension is very large, we introduce a sparse Tucker decomposition to further reduce the number of parameters. We establish the asymptotic distribution for the nonsparse scenario and the bound on estimation error for the sparse scenario. Lu, Zhu and Lian (a) The true count in this image is 17. The predicted 0.25, 0.5 and 0.75 quantiles are 14.97, 17.15 and 19.90, respectively. (b) The true count in this image is 11. The predicted 0.25, 0.5 and 0.75 quantiles are 9.07, 10.70 and 12.88, respectively. Figure 4: Examples in the test set with predicted quantiles calculated by the proposed method. In the quantile regression setting, it is typical to consider a scalar response. However, one can also consider the case where the response is a tensor. In this case, the intercept is a tensor of the same size as the response. Our results can be easily extended to this case in a straightforward way to predict each response s quantile. For both nonsparse and sparse Tucker quantile regression, we develop alternating algorithms to estimate the coefficient tensor. The convergence result for the nonconvex problem is generally an open problem with success only in some special problems. Developing efficient and provably convergent algorithms for the model is a challenging issue which can be investigated in the future. Our implementation of the algorithm based on R can be obtained from https://github.com/Wenqi Lu/Quantile Tensor Reg. We note that for probably more efficient implementation in Python for example, the library Tensor Ly (Kossaifiet al., 2019) provides a high-level API, and allows the model to be run on multiple CPU and GPU machines. Acknowledgments The authors sincerely thank the editor Professor Animashree Anandkumar, and four anonymous reviewers for their insightful comments that improved the manuscript. The research of Lian is supported by Hong Kong RGC general research fund 11301718 and 11300519, and by Project 11871411 from NSFC and the Shenzhen Research Institute, City University of Hong Kong. The research of Zhu is supported by National Natural Science Foundation of China (112071087, 11731011). High-dimensional Quantile Tensor Regression Appendix A. In the proofs C denotes a generic constant whose value may change even on the same line. A.1 Proof of Theorem 1 In the main text, the true parameters are simply denoted by A, U, g, etc. For this proof only, we will use subscript 0 to denote the true parameters which will make the presentation more clear. Using our notations, b A = arg min i=1 ρτ (yi G 1 U1 2 U2 K UK, X i ) can be rewritten as bh = h(bφ) = arg min h=h(φ) i=1 ρτ yi x T i h . (11) We first establish consistency. Let h0 = h(φ0) be the true value of h. Define ρτ(yi x T i h) ρτ(yi x T i h0) , Q(h) = E ρτ(y x T h) ρτ(y x T h0) . Since |ρτ(y x T h) ρτ(y x T h )| max(τ, 1 τ) x h h for all h, h , by Example 19.7 in van der Vaart (1998), the class ρτ(y x T h) ρτ(y x T h0) : h H is Glibenko-Cantelli. Then the first condition of Theorem 5.7 in van der Vaart (1998), that is suph H |Qn(h) Q(h)| = op(1), is verified, where H denotes the parameter space for h which is assumed to be bounded by condition C3. Using Knight s identity, we have Q(h) Q(h0) = E Z x T (h h0) 0 (Fε|x(t) Fε|x(0)) dt 2fδmin(D0) h h0 2 1 6f E[ x 3] h h0 3, where f denotes the lower bound for fε|x(0) and f denotes the upper bound for f ε|x(.). Let c = 3δmin(D0)f 2f E x 3 , it follows that when h h0 c, Q(h) Q(h0) 1 4fδmin(D0) h h0 2. (12) When h h0 > c, let h = c h + (1 c )h0 with c = c h h0 , then h h0 = c, and by the convexity of Q, we have Q(h) Q(h0) h h0 c Q(h ) Q(h0) 4fδmin(D0) h h0 . Lu, Zhu and Lian Therefore, the second condition in Theorem 5.7 in van der Vaart (1998) inf h h0 ε Q(h) Q(h0) > 0 is verified. By Theorem 5.7 in van der Vaart (1998), we have the sequence h( bφ) converges in probability to h0. Next we establish the convergence rate. The first condition of Theorem 5.52 in van der Vaart (1998) is verified in (12). For every sufficiently small δ > 0, let F = {ρτ(y x T h) ρτ(y x T h0); h h0 < δ}. This class has an envelop function δm where m = max(τ, 1 τ) x . By Corollary 19.35 and Example 19.7 in van der Vaart (1998), we get E sup f F | n(Pn(f) Pf)| J[ ] mδ p,2, F, L2(P) where m p,2 = (E|m|2)1/2, and p = QK k=1 pk. Change the variables in the integral to see that this is a multiple of δ. Then by Theorem 5.52 in van der Vaart (1998), we have h(bφ) h0 = Op(n 1/2). Furthermore, we can prove the following uniform convergence sup h h0 C/ n | i=1 ρτ(yi x T i h) i=1 ρτ(yi x T i bh QR) 2 (h bh QR)T D1(h bh QR)| = op(1), where bh QR is the vectorized quantile regression estimate vec( b AQR) with b AQR = arg min Pn i=1 ρτ yi x T i vec(A) (without the rank constraint). In fact, by Lemma 19.31 of van der Vaart (1998), we have sup h h0 =C/ n i=1 ρτ(yi x T i h) i=1 ρτ(yi x T i h0) (h h0)T n X i=1 xi τ I(yi x T i h0) n E ρτ(Y x T h) ρτ(Y x T h0) = op(1), i=1 ρτ(yi x T i bh QR) i=1 ρτ(yi x T i h0) (bh QR h0)T n X i=1 xi τ I(yi x T i h0) n E ρτ(Y x T bh QR) ρτ(Y x T h0) = op(1). By Knight s identity, we have E ρτ(Y x T h) ρτ(Y x T h0) = E Z x T (h h0) 0 (Fε|x(t) Fε|x(0)) dt 2(h h0)T D1(h h0) + op( h h0 2). High-dimensional Quantile Tensor Regression It follows that sup h h0 =C/ n n E ρτ(Y x T h) ρτ(Y x T h0) n 2 (h h0)T D1(h h0) = op(1), (17) and n E ρτ(Y x T bh QR) ρτ(Y x T h0) n 2 (bh QR h0)T D1(bh QR h0) = op(1), (18) Combining (14) - (18), we have sup h h0 =C/ n i=1 ρτ(yi x T i h) i=1 ρτ(yi x T i bh QR) (h bh QR)T n X i=1 xi τ I(yi x T i h0) 2 (h h0)T D1(h h0) + n 2 (bh QR h0)T D1(bh QR h0) = op(1). Then equation (13) can be obtained from (19) by applying the Bahadur representation for the classical linear quantile estimator n(bh QR h0) = D 1 1 1 n Pn i=1 xi τ I(yi x T i h0 < 0) + op(1). Define F(bh QR, h) = n 2 (h bh QR)T D1(h bh QR), and denote by h = h( φ) as the minimizer of F(bh QR, h), then by Proposition 4.1 in Shapiro (1986), it has asymptotic normality n( h h0) D N(0, P ΓP T ). To complete the proof, we note (13) implies h and bh are asymptotically equivalent. Remark 2 It is trivial to incorporate the intercept for this fixed-dimensional result. All derivations up to equation (19) has nothing to do with the tensor structure. We only need to add µ to h and add 1 to xi, and we still define D0 = E(xix T i ), D1 = E(xix T i fε|x(0)) and the asymptotic normality result remains the same. A.2 Proof of Theorem 2 Define b U = ( b UK b OK) ( b U1 b O1) and bg = vec(b G 1 b OT 1 K b OT K), where the matrices b Ok are defined in the statement of Lemma 2. Write vec( b A) = b Ubg. Let b = b U bg Ug, b U = b U U and b g = bg g. Then b = b U bg + U b g and b A A F = b . By the optimality of b U and bg, we have i=1 ρτ yi x T i b U bg + λ b U 1 1 i=1 ρτ yi x T i Ug + λ U 1. (20) Lu, Zhu and Lian We first show that b Ω:= { : ( U)Sc 1 3 ( U)S 1 + g 1}, where S is the support of U with size bounded by s1s2 s K =: s. Let ei = τ I(yi x T i Ug 0). By convexity of ρτ, we have Qn( b Ubg) := 1 n(P i ρτ(yi xi b Ubg) P i ρτ(yi xi Ug)) 1 n(Pn i=1 xiei)T b . Combining this with (20) we get, i xiei)( b Ubg) 1 i xiei)U b g Qn( b Ubg) λ U 1 λ b U 1. By Theorem 1 of Belloni and Chernozhukov (2011), we have λ/2 g 1 n P i xiei and n UT xiei . Using 1 n(P i xiei)( b Ubg) 1 n P i xiei b Ubg 1 g 1 n P i xiei b U 1 (λ/2) b U 1 and 1 n(P i xiei)U b g 1 n UT xiei b g 1 (λ/2) b g 1, we get (λ/2) b U 1 (λ/2) b g 1 λ U 1 λ b U 1. (21) Using U 1 = US 1 and b U 1 = b U + U 1 = ( b U + U)S 1 + ( b U)Sc 1 US 1 ( b U)S 1 + ( b U)Sc 1, the above implies (λ/2) b U 1 (λ/2) b g 1 λ ( b U)S 1 λ ( b U)Sc 1. (22) Using b U 1 = ( b U)S 1 + ( b U)Sc 1, the above is seen to be equivalent to b Ω. Furthermore, due to b Ω, we have b 1 g b U 1 + U b g 1 C g ( b U)S 1 + C g b g 1 + r b g C g s b U F + C( g + 1) r b g bn b + dn b 2, (23) by Lemma 2, where + C( g + 1) r + C( g + 1) r A F σ2 min,k Let Ω2 = { : 1 bn + dn 2}. Assume b = t. This means Qn(Ug + ) + λ U + U 1 λ U 1 < 0. (24) Lemma 1 shows that with probability at least 1 (p n) C, |Qn(Ug + ) EQn(Ug + )| C(bnt + dnt2) High-dimensional Quantile Tensor Regression (24) and (25) together means that there exists with = t such that EQn(Ug + ) λ U 1 λ U + U 1 + C(bnt + dnt2) λ ( b U)S 1 λ ( b U)Sc 1 + C(bnt + dnt2) λ ( b U)S 1 + C bnt + dnt2 g (bnt + dnt2) + C(bnt + dnt2) where the second inequality uses the same arguments that leads to (22) starting from (21), and the last step used (23). Using assumption C8, we then have c1(t2 t) C λ g (bnt + dnt2) + C(bnt + dnt2) which implies t Cbnλ/ g if dnλ/ g = o(1). Remark 3 To incorporate the intercept µ, let bδµ = bµ µ. Then using similar arguments with minor modifications, we have (bδµ, b ) Ω:= {(δµ, ) : ( U)Sc 1 3 ( U)S 1 + g 1+|δµ|}}, and (23) is replaced by b 1 bn b +dn b 2+ g|bδµ|. The rest of the proof still can proceed as before using (δµ, ) Ω2 := {(δµ, ) : 1 bn +dn 2+ g|δµ|} and +|δµ| = t instead of = t. Then we can see that the bound b +|bδµ| bnλ/ g holds. Regarding the roles of the following lemmas, Lemmas 1 and 2 are used in the proof of Theorem 2 while Lemma 3 is used in the proof of Lemma 1. Lemma 1 Let Ω2 = { : 1 bn + dn 2} as defined in the proof of Theorem 2. With probability at least 1 (p n) C for some constant C > 0, i=1 ρτ yi x T i (Ug + ) 1 i=1 ρτ yi x T i Ug Eρτ yi x T i (Ug + ) + Eρτ yi x T i Ug C(bnt + dnt2) p log(p n)/n. A(t) = sup Ω2, t n 1/2 Gn ρτ y x T (Ug + ) ρτ y x T Ug , where Gnf(xi) = n(Pnf Pf) is the empirical process. Note that for any with t, by the Lipschitz property of ρτ, we have V ar Gn ρτ y x T (Ug + ) ρτ y x T Ug E(x T )2 t2/f. Lu, Zhu and Lian Then an application of Lemma 2.3.7 of van der Vaart and Wellner (1996) yields P(A(t) M) 2P(B(t) M 4 ) 1 t2/(n M2f), (26) if the denominator above is positive, where B(t) = sup Ω2, t n 1/2 Gs n ρτ y x T (Ug + ) ρτ y x T Ug , Gs nf(x, y) = n 1/2 P i δif(xi, yi), and δi { 1, 1} are independent Rademacher variables. We also have P(B(t) M) e γMEeγB(t) 2γ sup Ω2, t (1/n)| X i δix T i | 2γ sup Ω2, t 1 Cpe γM exp{Cγ2(bnt + dnt2)2/n} p exp C n M2 (bnt + dnt2)2 where the 1st line uses Markov s inequality, the 2nd uses the contraction property of the Rademacher process (see Theorem 2.3 of Koltchinskii, 2011), the 3rd uses the simple inequality |a T b| a b 1 for two vectors a and b, the 4th uses Lemma 3 together with the fact that 1 bnt + dnt2, and the last line is obtained by setting γ n M/(bnt + dnt2)2. Finally, taking M (bnt + dnt2) p log(p n)/n proves the lemma. Lemma 2 There exists rk rk orthogonal matrix b Ok, k = 1, . . . , K, such that ( b UK b OK) ( b U1 b O1) UK U1 F b A A F + C b A A 2 F , b G 1 b OT 1 K b OT K G F C b A A F + C A F σ2 min,k b A A 2 F . Proof. Since b Uk and Uk are the left singular vectors of b A(k) and A(k), respectively, by the Davis-Kahn theorem as stated in Theorem 3 of Yu et al. (2015), we have b Uk b Ok Uk F C (σmax,k + b A A F ) b A A F High-dimensional Quantile Tensor Regression for some orthogonal matrix b Ok. Note that b A = b G 1 b U1 K b UK = (b G 1 b OT 1 K b OT K) 1 b U1 b O1 K b UK b OK. For simplicity of notation, in the following we denote b Uk b Ok simply by b Uk and b G 1 b OT 1 K b OT K by b G. Then b UK b U1 UK U1 F b UK b U1 b UK b U2 U1 F + + b UK UK 1 U1 UK U2 U1 F rk b Uk Uk F b A A F + C b A A 2 F , (27) where the equality is due to A B F = A F B F . Furthermore, b G G F = ( b UK b U1)T vec( b A) UK U1)T vec(A) F ( b UK b U1)T (vec( b A) vec(A)) F + ( b UK b U1 UK U1)T vec(A) F b UK b U1 op b A A F + b UK b U1 UK U1 op A F b A A F + C A F σ2 min,k b A A 2 F , (28) b UK b U1 UK U1 op b UK b U1 b UK b U2 U1 op + + b UK UK 1 U1 UK U2 U1 op k b Uk Uk op b A A F + C b A A 2 F . Lemma 3 For any constant γ > 0, E[exp{γ max 1 j p | X i xijδi|}] 2pe Cnγ2, where δi { 1, 1} are independent Rademacher variables. Proof. We have E[exp{γ max j | X = E[max j exp{γ| X p max j E[exp{γ| X 2p max j E[exp{γ( X i xijδi)}], Lu, Zhu and Lian where the last step used the fact that for any symmetric random variable z, E[e|z|] e[ez + e z] = 2E[ez]. Using xij is sub-Gaussian and thus xijδi is also sub-Gaussian, we get E[exp{γ(P i xijδi)}] = (e Cγ2)n which proved the lemma. Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(80):2773 2832, 2014a. Animashree Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates. ar Xiv preprint ar Xiv:1402.5180, 2014b. Alexandre Belloni and Victor Chernozhukov. l1 -penalized quantile regression in highdimensional sparse models. The Annals of Statistics, 39(1):82 130, 2011. Xingwei Cao and Guillaume Rabusseau. Tensor regression networks with various low-rank tensor approximations. ar Xiv preprint ar Xiv:1712.09520, 2017. Daniele Castellana and Davide Bacciu. Tensor decompositions in recursive neural networks for tree-structured data. ar Xiv preprint ar Xiv:2006.10619, 2020. Antoni B. Chan, Mulloy Morrow, and Nuno Vasconcelos. Analysis of crowded scenes using holistic properties. In IEEE Intl. Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2009), 2009. Eric C. Chi and Tamara G. Kolda. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272 1299, 2012. James Ferryman and Ali Shahrokni. Pets2009: Dataset and challenge. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pages 1 6, 2009. Weiwei Guo, Irene Kotsia, and Ioannis Patras. Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816 827, 2012. Johan H astad. Tensor rank is NP-complete. Algorithms, 11(4):644 654, 1990. Kejun Huang, Nicholas D. Sidiropoulos, and Athanasios P. Liavas. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Transactions on Signal Processing, 64(19):5052 5065, 2016. Roger Koenker. Quantile Regression. Cambridge University Press, New York, 2005. Roger Koenker and Gilbert Bassett. Regression quantiles. Econometrica, 46(1):33 50, 1978. Arinbj orn Kolbeinsson, Jean Kossaifi, Yannis Panagakis, Adrian Bulat, Anima Anandkumar, Ioanna Tzoulaki, and Paul Matthews. Robust deep networks with randomized tensor regression layers. ar Xiv preprint ar Xiv:1902.10758, 2019. High-dimensional Quantile Tensor Regression Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455 500, 2009. Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, New York, 2011. Jean Kossaifi, Yannis Panagakis, Anima Anandkumar, and Maja Pantic. Tensor Ly: Tensor Learning in Python. Journal of Machine Learning Research, 20(1):1 6, 2019. Jean Kossaifi, Zachary C Lipton, Arinbjorn Kolbeinsson, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. Journal of Machine Learning Research, 21:1 21, 2020. Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253 1278, 2000. Lexin Li and Xin Zhang. Parsimonious tensor response regression. Journal of the American Statistical Association, 112(519):1131 1146, 2017. Xiaoshan Li, Da Xu, Hua Zhou, and Lexin Li. Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10:520 545, 2018. Ji Liu, Jun Liu, Peter Wonka, and Jieping Ye. Sparse non-negative tensor factorization using columnwise coordinate descent. Pattern Recognition, 45(1):649 656, 2012. Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:208 220, 2013. Yipeng Liu, Jiani Liu, and Ce Zhu. Low-rank tensor train coefficient array estimation for tensor-on-tensor regression. IEEE Transactions on Neural Networks and Learning Systems, to appear, 2020. Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983 999, 2006. Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In International Conference on Neural Information Processing Systems (NIPS), pages 442 450, 2015. I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33 (5):2295 23, 2011. Guillaume Rabusseau and Hachem Kadri. Low-rank regression with tensor responses. In International Conference on Neural Information Processing Systems (NIPS), pages 1875 1883, 2016. Garvesh Raskutti, Ming Yuan, and Han Chen. Convex regularization for high-dimensional multiresponse tensor regression. The Annals of Statistics, 47(3):1554 1584, 06 2019. Lu, Zhu and Lian Steffen Rendle and Lars Schmidt-Thieme. Pairwise interaction tensor factorization for personalized tag recommendation. In ACM International Conference on Web Search and Data Mining, pages 81 90, 2010. Alexander Shapiro. Asymptotic theory of overparameterized structural models. Journal of the American Statistical Association, 81(393):142 149, 1986. Shaden Smith, Alec Beri, and George Karypis. Constrained tensor factorization with accelerated ao-admm. In International Conference on Parallel Processing (ICPP), pages 111 120, 2017. Jiahao Su, Wonmin Byeon, Furong Huang, Jan Kautz, and Animashree Anandkumar. Convolutional tensor-train lstm for spatio-temporal learning. ar Xiv preprint ar Xiv:2002.09131, 2020. Will Wei Sun and Lexin Li. STORE: sparse tensor response regression and neuroimaging analysis. Journal of Machine Learning Research, 18(1):1 37, 2017. Will Wei Sun and Lexin Li. Dynamic tensor clustering. Journal of the American Statistical Association, 114(528):1894 1907, 2019. Will Wei Sun, Junwei Lu, Han Liu, and Guang Cheng. Provable sparse tensor decomposition. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3): 899 916, 2017. James W. Taylor. A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4):299 311, 2000. Madeleine Udell, Corinne Horn, Reza Zadeh, and Stephen Boyd. Generalized low rank models. Foundations and Trends R in Machine Learning, 9:1 118, 2016. Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, 1998. Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes. Springer, New York, 1996. Di Wang, Heng Lian, Yao Zheng, and Guodong Li. High-dimensional vector autoregressive time series modeling via tensor decomposition. Available at http://dx.doi.org/10.2139/ssrn.3453802, 2019. Huixia Wang, Zhongyi Zhu, and Jianhui Zhou. Quantile regression in partially linear varying coefficient models. The Annals of Statistics, 37(6B):3841 3866, 2009. Qi Xie, Qian Zhao, Deyu Meng, , and Zongben Xu. Kronecker-basis-representation based tensor sparsity and its applications to tensor recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(8):1888 1902, 2018. Congrui Yi and Jian Huang. Semismooth newton coordinate descent algorithm for elasticnet penalized huber loss regression and quantile regression. Journal of Computational and Graphical Statistics, 26(3):547 557, 2017. High-dimensional Quantile Tensor Regression Liqun Yu, Nan Lin, and Lan Wang. A parallel algorithm for large-scale nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 26(4):935 939, 2017. Yi Yu, Tengyao Wang, and Richard J. Samworth. A useful variant of the Davis Kahan theorem for statisticians. Biometrika, 102(2):315 323, 2015. Anru Zhang. Cross: Efficient low-rank tensor completion. Annals of Statistics, 47(2): 936 964, 04 2019. Zemin Zhang, Gregory Ely, Shuchin Aeron, Ning Hao, and Misha Kilmer. Novel methods for multilinear data completion and de-noising based on tensor-svd. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014. Bingyin Zhou, Biao Song, Mohammad Mehedi Hassan, and Atif Alamri. Multilinear rank support tensor machine for crowd density estimation. Engineering Applications of Artificial Intelligence, 72:382 392, 2018. Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540 552, 2013.