# Adaptive Reduced Rank Regression

Qiong Wu (William & Mary), Felix M. F. Wong (Independent Researcher), Yanhua Li (Worcester Polytechnic Institute), Zhenming Liu (William & Mary), Varun Kanade (University of Oxford)

Correspondence to: Qiong Wu (currently at Google). 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

We study the low rank regression problem $y = Mx + \epsilon$, where $x$ and $y$ are $d_1$- and $d_2$-dimensional vectors respectively. We consider the extreme high-dimensional setting where the number of observations $n$ is less than $d_1 + d_2$. Existing algorithms are designed for settings where $n$ is typically as large as $\mathrm{rank}(M)(d_1+d_2)$. This work provides an efficient algorithm that involves only two SVDs, and establishes statistical guarantees on its performance. The algorithm decouples the problem by first estimating the precision matrix of the features, and then solving the matrix denoising problem. To complement the upper bound, we introduce new techniques for establishing lower bounds on the performance of any algorithm for this problem. Our preliminary experiments confirm that our algorithm often outperforms existing baselines, and is always at least competitive.

1 Introduction

We consider the regression problem $y = Mx + \epsilon$ in the high-dimensional setting, where $x \in \mathbb{R}^{d_1}$ is the vector of features, $y \in \mathbb{R}^{d_2}$ is a vector of responses, $M \in \mathbb{R}^{d_2 \times d_1}$ are the learnable parameters, and $\epsilon \sim N(0, \sigma_\epsilon^2 I_{d_2 \times d_2})$ is a noise term. The high-dimensional setting refers to the case where the number of observations $n$ is insufficient for recovery, and hence regularization is necessary for estimation [26, 30, 12]. This high-dimensional model is widely used in practice, for example in identifying biomarkers [48], understanding risks associated with various diseases [18, 7], image recognition [34, 17], forecasting equity returns in financial markets [33, 39, 28, 8], and analyzing social networks [46, 35].

We consider the large feature size setting, in which the number of features $d_1$ is excessively large and can even be larger than the number of observations $n$. This setting frequently arises in practice because it is often straightforward to perform feature engineering and produce a large number of potentially useful features in many machine learning problems. For example, in a typical equity forecasting model, $n$ is around 3,000 (i.e., using 10 years of market data), whereas the number of potentially relevant features can be in the order of thousands [33, 22, 25, 13]. In predicting the popularity of a user in an online social network, $n$ is in the order of hundreds (each day is an observation and a typical dataset contains less than three years of data), whereas the feature size can easily be more than 10k [36, 6, 38].

Existing low-rank regularization techniques (e.g., [3, 23, 26, 30, 27]) are not optimized for the large feature size setting. These results assume that either the features possess the so-called restricted isometry property [10], or their covariance matrix can be accurately estimated [30]. Therefore, their sample complexity $n$ depends on either $d_1$ or the smallest eigenvalue $\lambda_{\min}$ of $x$'s covariance matrix. For example, a mean-squared error (MSE) result that appeared in [30] is of the form $O\!\big(\tfrac{\mathrm{rank}(M)(d_1+d_2)}{\lambda_{\min}^2 n}\big)$. When $n \ll d_1/\lambda_{\min}^2$, this result becomes trivial because the forecast $\hat{y} = 0$ produces a comparable MSE. We design an efficient algorithm for the large feature size setting. Our algorithm is a simple two-stage algorithm.
Let $X \in \mathbb{R}^{n \times d_1}$ be the matrix that stacks together all features and $Y \in \mathbb{R}^{n \times d_2}$ be the one that stacks the responses. In the first stage, we run a principal component analysis (PCA) on $X$ to obtain a set of uncorrelated features $\hat{Z}$. In the second stage, we run another PCA to obtain a low rank approximation of $\hat{Z}^TY$ and use it to construct an output.

While the algorithm is operationally simple, we show a powerful and generic result on using PCA to process features, a widely used practice for dimensionality reduction [11, 21, 19]. PCA is known to be effective at orthogonalizing features by keeping only the subspace that explains large variations, but its performance can only be analyzed under the so-called factor model [40, 39]. We show the efficacy of PCA without the factor model assumption. Instead, PCA should be interpreted as a robust estimator of $x$'s covariance matrix. The empirical estimator $C = \frac{1}{n}X^TX$ cannot be used directly in the high-dimensional setting because $n \ll d_1$, but it exhibits an interesting regularity: the leading eigenvectors of $C$ are closer to the ground truth than the remaining ones. In addition, the number of reliable eigenvectors grows as the sample size grows, so our PCA procedure projects the features along reliable eigenvectors and dynamically adjusts $\hat{Z}$'s rank to maximally utilize the raw features. Under mild conditions on the ground-truth covariance matrix $C^*$ of $x$, we show that it is always possible to decompose $x$ into a set of near-independent features and a set of (discarded) features that have an inconsequential impact on a model's MSE.

When the features $x$ are transformed into uncorrelated ones $z$, our original problem becomes $y = Nz + \epsilon$, which can be reduced to a matrix denoising problem [16] and solved by the second stage. Our algorithm guarantees that we can recover all singular vectors of $N$ whose associated singular values are larger than a certain threshold $\tau$. The performance guarantee can be translated into MSE bounds parametrized by commonly used variables (though these translations usually lead to looser bounds). For example, when $N$'s rank is $r$, our result reduces the MSE from $O\big(\frac{r(d_1+d_2)}{n\lambda_{\min}^2}\big)$ to $O\big(\frac{rd_2}{n} + n^{-c}\big)$ for a suitably small constant $c$. The improvement is most pronounced when $n \ll d_1$.

We also provide a new matching lower bound. Our lower bound asserts that no algorithm can recover a fraction of the singular vectors of $N$ whose associated singular values are smaller than $\rho\tau$, where $\rho$ is a "gap parameter". Our lower bound contribution is twofold. First, we introduce the notion of "local minimax", which enables us to define a lower bound parametrized by the singular values of $N$. This is a stronger lower bound than those delivered by the standard minimax framework, which are often parametrized by the rank $r$ of $N$ [26]. Second, we develop a new probabilistic technique for establishing lower bounds under the new local minimax framework. Roughly speaking, our technique assembles a large collection of matrices that share the same singular values as $N$ but are far from each other, so no algorithm can successfully distinguish these matrices with identical spectra.

2 Preliminaries

Notation. Let $X \in \mathbb{R}^{n \times d_1}$ and $Y \in \mathbb{R}^{n \times d_2}$ be data matrices whose $i$-th rows represent the $i$-th observation. For a matrix $A$, we denote its singular value decomposition as $A = U^A\Sigma^A(V^A)^T$, and $P_r(A) \triangleq U^A_r\Sigma^A_r(V^A_r)^T$ is the rank-$r$ approximation obtained by keeping the top $r$ singular values and the corresponding singular vectors.
When the context is clear, we drop the superscript $A$ and use $U$, $\Sigma$, and $V$ (resp. $U_r$, $\Sigma_r$, and $V_r$) instead. Both $\sigma_i(A)$ and $\sigma^A_i$ are used to refer to the $i$-th singular value of $A$. We use MATLAB notation when we refer to a specific row or column, e.g., $V_{1,:}$ is the first row of $V$ and $V_{:,1}$ is the first column. $\|A\|_F$, $\|A\|_2$, and $\|A\|_*$ are the Frobenius, spectral, and nuclear norms of $A$. In general, we use boldface upper case (e.g., $\mathbf{X}$) to denote data matrices and boldface lower case (e.g., $\mathbf{x}$) to denote one sample. Regular fonts denote other matrices. Let $C^* = \mathbb{E}[xx^T]$, and let $C = \frac{1}{n}X^TX$ be the empirical estimate of $C^*$. Let $C^* = V^*\Lambda^*(V^*)^T$ be the eigen-decomposition of $C^*$, and $\lambda^*_1 \ge \lambda^*_2 \ge \dots \ge \lambda^*_{d_1} \ge 0$ be the diagonal entries of $\Lambda^*$. Let $\{u_1, u_2, \dots, u_\ell\}$ be an arbitrary set of column vectors, and $\mathrm{Span}(\{u_1, u_2, \dots, u_\ell\})$ be the subspace spanned by them. "An event happens with high probability" means that it happens with probability at least $1 - n^{-5}$, where 5 is an arbitrarily chosen large constant and is not optimized.

Our model. We consider the model $y = Mx + \epsilon$, where $x \in \mathbb{R}^{d_1}$ is a multivariate Gaussian, $y \in \mathbb{R}^{d_2}$, $M \in \mathbb{R}^{d_2 \times d_1}$, and $\epsilon \sim N(0, \sigma_\epsilon^2 I_{d_2 \times d_2})$. We can relax the Gaussian assumptions on $x$ and $\epsilon$ for most results we develop. We assume a PAC learning framework, i.e., we observe a sequence $\{(x_i, y_i)\}_{i \le n}$ of independent samples and our goal is to find an $\hat{M}$ that minimizes the test error $\mathbb{E}_{x,y}[\|\hat{M}x - Mx\|_2^2]$. We are specifically interested in the setting in which $d_2 \le n \ll d_1$.

STEP-1-PCA-X(X)
1  $[U, \Sigma, V] = \mathrm{svd}(X)$
2  $\Lambda = \frac{1}{n}\Sigma^2$; $\lambda_i = \Lambda_{i,i}$.
3  // Gap thresholding.
4  // $\delta = n^{-O(1)}$ is a tunable parameter.
5  $k_1 = \max\{k_1 : \lambda_{k_1} - \lambda_{k_1+1} \ge \delta\}$
6  $\Lambda_{k_1}$: diagonal matrix comprised of $\{\lambda_i\}_{i \le k_1}$.
7  $U_{k_1}, V_{k_1}$: $k_1$ leading columns of $U$ and $V$.
8  $\hat{\Pi} = (\Lambda_{k_1})^{-\frac12}V^T_{k_1}$
9  $\hat{Z}_+ = \sqrt{n}\,U_{k_1}\ (= X\hat{\Pi}^T)$.
10 return $\{\hat{Z}_+, \hat{\Pi}\}$

STEP-2-PCA-DENOISE($\hat{Z}_+$, Y)
1  $\hat{N}_+^T = \frac{1}{n}\hat{Z}_+^TY$.
2  // Absolute value thresholding.
3  // $\theta$ is a suitable constant; $\sigma_\epsilon$ is the std. of the noise.
4  $k_2 = \max\{k_2 : \sigma_{k_2}(\hat{N}_+) \ge \theta\sigma_\epsilon\sqrt{d_2/n}\}$
5  return $P_{k_2}(\hat{N}_+)$

ADAPTIVE-RRR(X, Y)
1  $[\hat{Z}_+, \hat{\Pi}]$ = STEP-1-PCA-X(X).
2  $P_{k_2}(\hat{N}_+)$ = STEP-2-PCA-DENOISE($\hat{Z}_+$, Y).
3  return $\hat{M} = P_{k_2}(\hat{N}_+)\hat{\Pi}$

Figure 1: Our algorithm (ADAPTIVE-RRR) for solving the regression $y = Mx + \epsilon$.

The key assumption we make to circumvent the $n \ll d_1$ issue is that the features are correlated. This assumption can be justified for the following reasons: (i) in practice, it is difficult, if not impossible, to construct completely uncorrelated features; (ii) when $n \ll d_1$, it is not even possible to test whether the features are uncorrelated [5]; (iii) when we indeed know that the features are independent, there are significantly simpler methods to design models, e.g., we can build multiple models such that each model regresses on an individual feature of $x$, and then use a boosting/bagging method [19, 37] to consolidate the predictions.

The correlatedness assumption implies that the eigenvalues of $C^*$ decay. The only full-rank positive semidefinite matrices with non-decaying (uniform) eigenvalues are scalings of the identity matrix; in other words, when $C^*$ has uniform eigenvalues, $x$ has to be uncorrelated. We aim to design an algorithm that works even when the decay is slow, such as when $\lambda_i(C^*)$ has a heavy tail. Specifically, our algorithm assumes the $\lambda_i$'s are bounded by a heavy-tail power law series:

Assumption 2.1. The series $\lambda_i(C^*)$ satisfies $\lambda_i(C^*) \le c\cdot i^{-\omega}$ for a constant $c$ and $\omega \ge 2$.

We do not make functional form assumptions on the $\lambda_i$'s. This assumption also covers many benign cases, such as when $C^*$ has low rank or its eigenvalues decay exponentially. Many empirical studies report power law distributions of data covariance matrices [2, 31, 44, 14]. Next, we make standard normalization assumptions: $\mathbb{E}\|x\|_2^2 = 1$, $\|M\|_2 \le \Upsilon = O(1)$, and $\sigma_\epsilon \le 1$. Remark that we assume only the spectral norm of $M$ is bounded, while its Frobenius norm can be unbounded. Also, we assume the noise $\sigma_\epsilon$ is sufficiently large, which is the more important case in practice; the case when $\sigma_\epsilon$ is small can be tackled in a similar fashion. Finally, our studies avoid examining excessively unrealistic cases, so we assume $d_1 \le d_2^3$. We examine the setting where existing algorithms fail to deliver non-trivial MSE, so we assume $n \le rd_1 \le d_2^4$.

3 Upper bound

Our algorithm (see Fig. 1) consists of two steps.

Step 1. Producing uncorrelated features. We run a PCA to obtain a total number of $k_1$ orthogonalized features; see STEP-1-PCA-X in Fig. 1. Let the SVD of $X$ be $X = U\Sigma V^T$. Let $k_1$ be a suitable rank chosen by inspecting the gaps of $X$'s singular values (Line 5 in STEP-1-PCA-X). $\hat{Z}_+ = \sqrt{n}\,U_{k_1}$ is the set of transformed features output by this step. The subscript $+$ in $\hat{Z}_+$ reflects that a dimension reduction happens, so the number of columns in $\hat{Z}_+$ is smaller than that in $X$. Compared to standard PCA dimension reduction, there are two differences: (i) we use the left leading singular vectors of $X$ (with a re-scaling factor $\sqrt{n}$) as the output, whereas PCA reduction outputs $P_{k_1}(X)$, and (ii) we design a specialized rule to choose $k_1$, whereas PCA usually uses hard thresholding or other ad-hoc rules.

Step 2. Matrix denoising. We run a second PCA on the matrix $(\hat{N}_+)^T \triangleq \frac{1}{n}\hat{Z}_+^TY$. The rank $k_2$ is chosen by a hard thresholding rule (Line 4 in STEP-2-PCA-DENOISE). Our final estimator is $P_{k_2}(\hat{N}_+)\hat{\Pi}$, where $\hat{\Pi} = (\Lambda_{k_1})^{-\frac12}V_{k_1}^T$ is computed in STEP-1-PCA-X(X).
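For concreteness, the following is a minimal NumPy sketch of the two steps in Fig. 1. It is only an illustration of the procedure: the values chosen below for the tunable parameters $\delta$, $\theta$, and the noise level $\sigma_\epsilon$ are arbitrary placeholders, not choices prescribed by the analysis.

```python
# Minimal sketch of ADAPTIVE-RRR (Fig. 1) in NumPy. Parameter values are illustrative only.
import numpy as np

def step1_pca_x(X, delta):
    """STEP-1-PCA-X: gap-thresholded PCA on the n x d1 feature matrix X."""
    n = X.shape[0]
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    lam = S ** 2 / n                                   # eigenvalues of C = X^T X / n
    gaps = lam[:-1] - lam[1:]
    big = np.where(gaps >= delta)[0]
    k1 = int(big[-1]) + 1 if big.size else 1           # largest k with lambda_k - lambda_{k+1} >= delta
    Pi_hat = np.diag(lam[:k1] ** -0.5) @ Vt[:k1]       # \hat{Pi} = Lambda_{k1}^{-1/2} V_{k1}^T
    Z_plus = np.sqrt(n) * U[:, :k1]                    # \hat{Z}_+ = sqrt(n) U_{k1} = X \hat{Pi}^T
    return Z_plus, Pi_hat

def step2_pca_denoise(Z_plus, Y, theta, sigma_eps):
    """STEP-2-PCA-DENOISE: hard-threshold the singular values of N_+ = (Z_+^T Y / n)^T."""
    n, d2 = Y.shape
    N_plus = (Z_plus.T @ Y / n).T                      # d2 x k1
    U, S, Vt = np.linalg.svd(N_plus, full_matrices=False)
    k2 = int(np.sum(S >= theta * sigma_eps * np.sqrt(d2 / n)))
    return U[:, :k2] @ np.diag(S[:k2]) @ Vt[:k2]       # P_{k2}(\hat{N}_+)

def adaptive_rrr(X, Y, delta=0.05, theta=2.0, sigma_eps=1.0):
    Z_plus, Pi_hat = step1_pca_x(X, delta)
    N_k2 = step2_pca_denoise(Z_plus, Y, theta, sigma_eps)
    return N_k2 @ Pi_hat                               # \hat{M}; predict with y_hat = M_hat @ x
```

Note that Step 1 returns $\sqrt{n}\,U_{k_1}$ rather than the usual PCA projection $P_{k_1}(X)$: the rescaling makes the empirical covariance of the transformed features exactly the identity, which is what the denoising argument in Step 2 relies on.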
3.1 Intuition of the design

While the algorithm is operationally simple, its design is motivated by carefully unfolding the statistical structure of the problem. We shall see that applying PCA to the features should not be viewed as removing noise from a factor model, or as finding subspaces that maximize the variation explained, as suggested in the standard literature [19, 40, 41]. Instead, it implicitly implements a robust estimator for $x$'s precision matrix, and the design of the estimator needs to be coupled with our objective of forecasting $y$, thus resulting in a new way of choosing the rank.

Design motivation: warm up. We first examine a simplified problem $y = Nz + \epsilon$, where the variables in $z$ are assumed to be uncorrelated. Assume $d = d_1 = d_2$ in this simplified setting. Observe that
$$\frac{1}{n}Z^TY = \frac{1}{n}Z^T(ZN^T + E) = \Big(\frac{1}{n}Z^TZ\Big)N^T + \frac{1}{n}Z^TE \approx I_{d_1\times d_1}N^T + \frac{1}{n}Z^TE = N^T + \mathcal{E}, \qquad (1)$$
where $E$ stacks the noise vectors and $\mathcal{E} = \frac{1}{n}Z^TE$ can be approximated by a matrix with independent zero-mean entries.

Solving the matrix denoising problem. Eq. (1) implies that when we compute $\frac{1}{n}Z^TY$, the problem reduces to an extensively studied matrix denoising problem [16, 20]. We include the intuition for solving this problem for completeness. The signal $N^T$ is overlaid with a noise matrix $\mathcal{E}$, which elevates all the singular values of $N^T$ by an order of $\sigma_\epsilon\sqrt{d/n}$. We run a PCA to extract reliable signals: when the singular value of a subspace is $\gg \sigma_\epsilon\sqrt{d/n}$, the subspace contains significantly more signal than noise and thus we keep it. Similarly, a subspace associated with a singular value $\ll \sigma_\epsilon\sqrt{d/n}$ mostly contains noise. This leads to a hard thresholding algorithm that sets $\hat{N}^T = P_r(N^T + \mathcal{E})$, where $r$ is the maximum index such that $\sigma_r(N^T + \mathcal{E}) \ge c\sqrt{d/n}$ for some constant $c$.
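The toy simulation below, with made-up dimensions, rank, and noise level, illustrates this reduction: $\frac{1}{n}Z^TY$ concentrates around $N^T$, the noise lifts the spectrum by roughly $\sigma_\epsilon\sqrt{d/n}$, and hard thresholding recovers the planted rank.

```python
# Toy illustration of Eq. (1) and singular-value hard thresholding.
# Dimensions, rank, and noise level below are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, r, sigma_eps = 2000, 100, 5, 1.0

N = rng.normal(size=(d, r)) @ rng.normal(size=(r, d)) / np.sqrt(d)  # rank-r signal matrix
Z = rng.normal(size=(n, d))                                         # uncorrelated features, E[zz^T] = I
Y = Z @ N.T + sigma_eps * rng.normal(size=(n, d))                   # y = Nz + eps, stacked row-wise

A = Z.T @ Y / n                      # approx. N^T + noise of spectral norm ~ sigma_eps * sqrt(d/n)
s = np.linalg.svd(A, compute_uv=False)
tau = sigma_eps * np.sqrt(d / n)

print("noise scale sigma_eps*sqrt(d/n):", round(tau, 3))
print("top 8 singular values of Z^T Y / n:", np.round(s[:8], 3))
print("rank kept by a 3*tau hard threshold:", int(np.sum(s >= 3 * tau)))  # c = 3 is an arbitrary constant
```

The printed spectrum shows a sharp drop after the first $r$ values, which is exactly what the thresholding rule in STEP-2-PCA-DENOISE exploits.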
In the general setting $y = Mx + \epsilon$, $x$ may not be uncorrelated. But when we set $z = (\Lambda^*)^{-\frac12}(V^*)^Tx$, we see that $\mathbb{E}[zz^T] = I$. This means that knowing $C^*$ suffices to reduce the original problem to the simplified one. Therefore, our algorithm uses Step 1 to estimate $C^*$ and $Z$, and uses Step 2 to reduce the problem to a matrix denoising one and solve it by standard thresholding techniques.

Relationship between PCA and precision matrix estimation. In Step 1, while we plan to estimate $C^*$, our algorithm runs a PCA on $X$. We observe that the empirical covariance matrix $C = \frac{1}{n}X^TX = \frac{1}{n}V\Sigma^2V^T$, i.e., $C$'s eigenvectors coincide with $X$'s right singular vectors. When we use the empirical estimator to construct $\hat{z}$, we obtain $\hat{z} = \sqrt{n}\,\Sigma^{-1}V^Tx$. When we apply this map to every training point and assemble the new feature matrix, we get exactly $\hat{Z} = \sqrt{n}\,XV\Sigma^{-1} = \sqrt{n}\,U$. It means that using $C$ to construct $\hat{z}$ is the same as running a PCA in STEP-1-PCA-X with $k_1 = d_1$.

Figure 2: The angle matrix between $C$ and $C^*$.

When $k_1 < d_1$, PCA uses a low rank approximation of $C$ as an estimator for $C^*$. We now explain why this is effective. First, note that $C$ is very far from $C^*$ when $n \ll d_1$, so it is dangerous to directly plug in $C$ to find $\hat{z}$. Second, an interesting regularity of $C$ exists and is best explained by a picture. In Fig. 2, we plot the pairwise angles between the eigenvectors of $C$ and those of $C^*$ from a synthetic dataset. Columns are sorted by $C^*$'s eigenvalues in decreasing order. If $C$ and $C^*$ coincided, this plot would look like an identity matrix; if $C$ and $C^*$ were unrelated, the plot would behave like a block of white Gaussian noise. We observe a pronounced pattern: the angle matrix can be roughly divided into two sub-blocks (see the red lines in Fig. 2). The upper left sub-block behaves like an identity matrix, suggesting that the leading eigenvectors of $C$ are close to those of $C^*$. The lower right block behaves like a white noise matrix, suggesting that the small eigenvectors of $C$ are far from those of $C^*$. When $n$ grows, the upper left block becomes larger, and thus the eigenvectors of $C$ get stabilized sequentially: leading eigenvectors are stabilized first, followed by smaller ones. Our algorithm leverages this regularity by keeping only a suitable number of reliable eigenvectors from $C$, while ensuring that not much information is lost when we throw away the small eigenvectors.
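This regularity is easy to reproduce on synthetic data. The sketch below (with an arbitrary dimension, sample size, and power-law exponent; it is not the dataset behind Fig. 2) compares the eigenvectors of the empirical covariance $C$ with those of a power-law $C^*$: the leading pairs are strongly aligned, the alignment degrades down the spectrum, and the trailing pairs are essentially random.

```python
# Synthetic check of the Fig. 2 regularity: compare eigenvectors of the
# empirical covariance C with those of a power-law ground truth C*.
import numpy as np

rng = np.random.default_rng(1)
n, d1, omega = 300, 500, 2.0

lam_star = 1.0 / np.arange(1, d1 + 1) ** omega              # power-law eigenvalues of C*
V_star, _ = np.linalg.qr(rng.normal(size=(d1, d1)))         # random eigenbasis of C*
X = (rng.normal(size=(n, d1)) * np.sqrt(lam_star)) @ V_star.T   # rows ~ N(0, C*)

C = X.T @ X / n
_, V = np.linalg.eigh(C)
V = V[:, ::-1]                                              # eigenvectors of C, decreasing eigenvalue order

angles = np.abs(V.T @ V_star)            # |cos| of pairwise angles between eigenvectors (cf. Fig. 2)
print("alignment of 10 leading pairs :", np.round(np.diag(angles)[:10], 2))
print("mean alignment, trailing 100  :", round(float(np.mean(np.diag(angles)[-100:])), 3))
```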
Implementing the rank selection. We rely on three interacting building blocks:

1. Dimension-free matrix concentration. First, we need a concentration behavior of $C$ that holds for $n \ll d_1$, so as to decouple $d_1$ from the MSE bound. We utilize a dimension-free matrix concentration inequality [32]. Roughly speaking, the concentration behaves as $\|C - C^*\|_2 \lesssim n^{-\frac12}$. This guarantees that $|\lambda_i(C) - \lambda_i(C^*)| \lesssim n^{-\frac12}$ by standard matrix perturbation results [24].

2. Davis-Kahan perturbation result. However, the pairwise closeness of the $\lambda_i$'s does not imply that the eigenvectors are also close. When $\lambda_i(C^*)$ and $\lambda_{i+1}(C^*)$ are close, the corresponding eigenvectors in $C$ can be jammed together. Thus, we need to identify an index $i$ at which $\lambda_i(C^*) - \lambda_{i+1}(C^*)$ exhibits a significant gap, and use a Davis-Kahan result to show that $P_i(C)$ is close to $P_i(C^*)$. On the other hand, the map $\Pi^*$ ($\triangleq (\Lambda^*)^{-\frac12}(V^*)^T$) we aim to find depends on the inverse square root $(\Lambda^*)^{-\frac12}$, so we need additional manipulation to argue that our estimate is close to $(\Lambda^*)^{-\frac12}(V^*)^T$.

3. The connection between gap and tail. Finally, the performance of our procedure is also characterized by the total volume of signal that is discarded, i.e., $\sum_{i > k_1}\lambda_i(C^*)$, where $k_1$ is the location that exhibits the gap. The question becomes whether it is possible to identify a $k_1$ that simultaneously exhibits a large gap and ensures that the tail after it is well-controlled, e.g., the sum of the tail is $O(n^{-c})$ for a constant $c$. We develop a combinatorial analysis to show that it is always possible to find such a gap under the assumption that $\lambda_i(C^*)$ is bounded by a power law distribution with exponent $\omega \ge 2$.

Combining these three building blocks, we have:

Proposition 1. Let $\xi$ and $\delta$ be two tunable parameters such that $\xi = \omega(\log^3 n/\sqrt{n})$ and $\delta^3 = \omega(\xi)$. Assume that $\lambda^*_i \le c\cdot i^{-\omega}$. Consider running STEP-1-PCA-X in Fig. 1. With high probability, we have: (i) Leading eigenvectors/values are close: there exists a unitary matrix $W$ and a constant $c_1$ such that $\|V_{k_1}(\Lambda_{k_1})^{-\frac12}W - V^*_{k_1}(\Lambda^*_{k_1})^{-\frac12}\| \le c_1\xi/\delta^3$. (ii) Small tail: $\sum_{i \ge k_1}\lambda^*_i \le c_2\,\delta^{\frac{\omega-1}{\omega+1}}$ for a constant $c_2$.

Prop. 1 implies that our estimate $\hat{z}_+ = \hat{\Pi}x$ is sufficiently close to $z = \Pi^*x$, up to a unitary transform. We then execute STEP-2-PCA-DENOISE to reduce the problem to a matrix denoising one and solve it by hard thresholding. We refer to $y = Nz + \epsilon$, where $z$ is a standard multivariate Gaussian and $N = MV^*(\Lambda^*)^{\frac12}$, as the orthogonalized form of the problem. While we do not directly observe $z$, our performance is characterized by the spectral structure of $N$.

Theorem 1. Consider running ADAPTIVE-RRR in Fig. 1 on $n$ independent samples $(x, y)$ from the model $y = Mx + \epsilon$, where $x \in \mathbb{R}^{d_1}$ and $y \in \mathbb{R}^{d_2}$. Let $C^* = \mathbb{E}[xx^T]$. Assume that (i) $\|M\|_2 \le \Upsilon = O(1)$; (ii) $x$ is a multivariate Gaussian with $\mathbb{E}\|x\|_2^2 = 1$, and in addition $\lambda_1(C^*) < 1$ and $\lambda_i(C^*) \le c/i^{\omega}$ for all $i$ and a constant $c$; and (iii) $\epsilon \sim N(0, \sigma_\epsilon^2 I_{d_2})$, where $\sigma_\epsilon \le \min\{\Upsilon, 1\}$. Let $\xi = \omega(\log^3 n/\sqrt{n})$, $\delta^3 = \omega(\xi)$, and let $\theta$ be a suitably large constant. Let $y = Nz + \epsilon$ be the orthogonalized form of the problem, and let $\ell^*$ be the largest index such that $\sigma^N_{\ell^*} > \theta\sigma_\epsilon\sqrt{d_2/n}$. Let $\hat{y}$ be our testing forecast. With high probability over the training data,
$$\mathbb{E}[\|\hat{y} - y\|_2^2] \le \sum_{i > \ell^*}(\sigma^N_i)^2 + O\Big(\frac{\ell^* d_2\theta^2\sigma_\epsilon^2}{n}\Big) + O\Big(\big(\tfrac{\xi}{\delta^3}\big)^{\frac14}\Big) + O\Big(\delta^{\frac{\omega-1}{4(\omega+1)}}\Big), \qquad (2)$$
where the expectation is over the randomness of the test data.

Theorem 1 also implies that there is a way to parametrize $\xi$ and $\delta$ such that $\mathbb{E}[\|\hat{y} - y\|_2^2] \le \sum_{i>\ell^*}(\sigma^N_i)^2 + O\big(\frac{\ell^* d_2\theta^2\sigma_\epsilon^2}{n}\big) + O(n^{-c_0})$ for some constant $c_0$. We next interpret each term in (2). The terms $\sum_{i>\ell^*}(\sigma^N_i)^2 + O\big(\frac{\ell^* d_2\theta^2\sigma_\epsilon^2}{n}\big)$ are typical for solving a matrix denoising problem on $\hat{N}^T_+ (\approx N^T + \mathcal{E})$: we can extract the signals associated with the $\ell^*$ leading singular vectors of $N$, so the sum $\sum_{i>\ell^*}(\sigma^N_i)^2$ starts at $i > \ell^*$, and for each direction we extract, we pay a noise term of order $\frac{\theta^2\sigma_\epsilon^2 d_2}{n}$, leading to the term $O\big(\frac{\ell^* d_2\theta^2\sigma_\epsilon^2}{n}\big)$. The terms $O\big((\xi/\delta^3)^{\frac14}\big) + O\big(\delta^{\frac{\omega-1}{4(\omega+1)}}\big)$ come from the estimation error of $\hat{z}_+$ in Prop. 1, consisting of both the estimation error of $C^*$'s leading eigenvectors and the error from cutting out a tail. We pay an exponent of $\frac14$ on both terms (e.g., $\delta^{\frac{\omega-1}{\omega+1}}$ in Prop. 1 becomes $\delta^{\frac{\omega-1}{4(\omega+1)}}$) because we use Cauchy-Schwarz (CS) twice: once when running the matrix denoising algorithm with the inaccurate $\hat{z}_+$, and once to bound the impact of cutting a tail. It remains open whether the two CS steps can be circumvented. Sec. 4 explains how Thm. 1 and the lower bound imply that the algorithm is near-optimal. Sec. 5 compares our result with existing ones under other parametrizations, e.g., $\mathrm{rank}(M)$.
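To make the first two terms of (2) concrete, the toy computation below plugs a made-up spectrum of $N$ and illustrative values of $n$, $d_2$, $\theta$, and $\sigma_\epsilon$ into the bound; it only shows how $\ell^*$ and the two leading terms are determined, not the constants hidden in the $O(\cdot)$.

```python
# Toy reading of the first two terms of the Theorem 1 bound. All values are made up.
import numpy as np

n, d2, theta, sigma_eps = 5000, 200, 2.0, 1.0
sigma_N = 2.0 * 0.5 ** np.arange(20)                  # a hypothetical decaying spectrum of N

thr = theta * sigma_eps * np.sqrt(d2 / n)             # detection threshold in Thm. 1
l_star = int(np.sum(sigma_N > thr))                   # largest index with sigma_i^N > threshold
tail_term = float(np.sum(sigma_N[l_star:] ** 2))      # sum_{i > l*} (sigma_i^N)^2
noise_term = l_star * d2 * theta ** 2 * sigma_eps ** 2 / n

print(f"threshold = {thr:.3f}, l* = {l_star}")
print(f"unrecovered tail = {tail_term:.4f}, per-direction noise cost = {noise_term:.4f}")
```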
4 Lower bound

Figure 3: (a) Major result: signals in $N$ are partitioned into four blocks. All signals in block 1 can be estimated (Thm. 1); all signals in block 3 cannot be estimated (Prop. 2). Our lower bound technique does not handle the small tail in block 4. A gap (block 2) exists between the upper and lower bounds. (b)-(d) Constructing $N$: steps 1 and 2 belong to the first stage; step 3 belongs to the second stage. (b) Step 1: generate a random subset $D^{(i)}$ for each row $i$, representing its non-zero positions. (c) Step 2: randomly sample from $D$, where $D$ is the Cartesian product of the $D^{(i)}$. (d) Step 3: fill in the non-zero entries sequentially from left to right.

Our algorithm accurately estimates the singular vectors of $N$ that correspond to singular values above the threshold $\tau = \theta\sigma_\epsilon\sqrt{d_2/n}$. However, it may well happen that most of the spectral mass of $N$ lies only slightly below this threshold $\tau$. In this section, we establish that no algorithm can do better than ours in a bi-criteria sense, i.e., we show that any algorithm that has a slightly smaller sample than ours can only minimally outperform ours in terms of MSE.

We establish instance-dependent lower bounds: when there is more spectral mass below the threshold, the performance of our algorithm will be worse, and we need to establish that no algorithm can do much better. This departs from the standard minimax framework, in which one examines the entire parameter space of $N$, e.g., all rank-$r$ matrices, and produces a large set of statistically indistinguishable "bad" instances [43]. These lower bounds are not sensitive to instance-specific quantities such as the spectrum of $N$; in particular, if prior knowledge suggests that the unknown parameter $N$ is far from these bad instances, the minimax lower bound cannot be applied.

We introduce the notion of local minimax. We partition the space into parts so that similar matrices are grouped together. Similar matrices are those $N$ that have the same singular values and right singular vectors; we establish strong lower bounds even against algorithms that know the singular values and right singular vectors of $N$. An equivalent view is to assume that the algorithm has oracle access to $C^*$, $M$'s singular values, and $M$'s right singular vectors. Such an algorithm can solve the orthogonalized form, as $N$'s singular values and right singular vectors can easily be deduced. Thus, the only reason the algorithm needs data is to learn the left singular vectors of $N$. The lower bound we establish is the minimax bound for this "unfair" comparison, where the competing algorithm is given more information. In fact, this can be reduced further: even if the algorithm knows that the left singular vectors of $N$ are sparse, identifying the locations of the non-zero entries is the key difficulty that leads to the lower bound.

Definition 1 (Local minimax bound). Consider a model $y = Mx + \epsilon$, where $x$ is a random vector, so $C^*(x) = \mathbb{E}[xx^T]$ represents the covariance matrix of the data distribution, and $M = U^M\Sigma^M(V^M)^T$. The relation $(M, x) \sim (M', x') \Leftrightarrow (\Sigma^M = \Sigma^{M'} \wedge V^M = V^{M'} \wedge C^*(x) = C^*(x'))$ is an equivalence relation, and we let the equivalence class of $(M, x)$ be $\mathcal{R}(M, x) = \{(M', x') : \Sigma^{M'} = \Sigma^M,\ V^{M'} = V^M,\ \text{and } C^*(x') = C^*(x)\}$. The local minimax bound for $y = Mx + \epsilon$ with $n$ independent samples and $\epsilon \sim N(0, \sigma_\epsilon^2 I_{d_2\times d_2})$ is
$$r(x, M, n, \sigma_\epsilon) = \min_{\hat{M}}\ \max_{(M', x') \in \mathcal{R}(M, x)} \mathbb{E}_{X, Y \text{ from } y' = M'x' + \epsilon}\Big[\mathbb{E}_{x'}\big[\|\hat{M}(X, Y)x' - M'x'\|_2^2 \mid X, Y\big]\Big]. \qquad (3)$$

It is worth interpreting (3) in some detail.
For any two $(M, x), (M', x')$ in $\mathcal{R}(M, x)$, the algorithm has the same "prior knowledge", so it can only distinguish between the two instances using the observed data. In particular, $\hat{M}$ is a function only of $X$ and $Y$, and we write it as $\hat{M}(X, Y)$ to emphasize this. Thus, we evaluate the performance of $\hat{M}$ by looking at the worst possible $(M', x')$ and considering the MSE $\mathbb{E}\|\hat{M}(X, Y)x' - M'x'\|^2$.

Proposition 2. Consider the problem $y = Mx + \epsilon$ with orthogonalized form $y = Nz + \epsilon$. Let $\xi$ be a sufficiently small constant. There exists a sufficiently small constant $\rho_0$ (that depends on $\xi$) and a constant $c$ such that for any $\rho \le \rho_0$,
$$r(x, M, n, \sigma_\epsilon) \ge \big(1 - c\rho^{\frac12 - \xi}\big)\sum_{i \ge t}(\sigma^N_i)^2 - O\big(\rho^{\frac12 - \xi}\big),$$
where $t$ is the smallest index such that $\sigma^N_t \le \rho\sigma_\epsilon\sqrt{d_2/n}$.

Proposition 2 gives a lower bound on the MSE in expectation; it can be turned into a high probability result with suitable modifications. The proof of the lower bound uses a trick similar to the one used in the upper bound analysis to cut the tail. This results in the additional $O(\rho^{\frac12-\xi})$ term, which is generally smaller than the $n^{-c_0}$ tail term in Theorem 1 and does not dominate the gap.

Gap requirement and bi-criteria approximation algorithms. Let $\tau = \sigma_\epsilon\sqrt{d_2/n}$. Theorem 1 asserts that any signal above the threshold $\theta\tau$ can be detected, i.e., the MSE is at most $\sum_{\sigma^N_i \le \theta\tau}\sigma_i^2(N)$ (plus inevitable noise), whereas Proposition 2 asserts that any signal below the threshold $\rho\tau$ cannot be detected, i.e., the MSE is approximately at least $\sum_{\sigma^N_i \le \rho\tau}(1 - \mathrm{poly}(\rho))\sigma_i^2(N)$. There is a gap between $\theta\tau$ and $\rho\tau$, as $\theta > 1$ and $\rho < 1$; see Fig. 3(a). This kind of gap is inevitable because both bounds are high probability statements, and the phenomenon appears naturally when the sample size is small, as illustrated by the following simple example. Consider the problem of estimating $\mu$ when we see one sample from $N(\mu, \sigma^2)$. Roughly speaking, when $\mu \gg \sigma$, estimation is feasible, whereas when $\mu \ll \sigma$, estimation is impossible. In the region $\mu \approx \sigma$, algorithms fail with constant probability, and we cannot prove a high probability lower bound either.

While many of the signals can hide in the gap, the inability to detect signals in the gap is a transient phenomenon. When the number of samples $n$ is modestly increased, our detection threshold $\theta\sigma_\epsilon\sqrt{d_2/n}$ shrinks, and this hidden signal can be fully recovered. This observation naturally leads to a notion of bi-criteria optimization that frequently arises in approximation algorithms.

Definition 2. An algorithm for solving the $y = Mx + \epsilon$ problem is $(\alpha, \beta)$-optimal if, when given an i.i.d. sample of size $\alpha n$ as input, it outputs an estimator whose MSE is at most $\beta$ worse than the local minimax bound, i.e., $\mathbb{E}[\|\hat{y} - y\|_2^2] \le r(x, M, n, \sigma_\epsilon) + \beta$.

Corollary 1. Let $\xi$ and $c_0$ be small constants and $\rho$ be a tunable parameter. Our algorithm is $(\alpha, \beta)$-optimal for $\alpha = \theta^2\rho^{-\frac52}$ and $\beta = O(\rho^{\frac12-\xi})\|Mx\|_2^2 + O(n^{-c_0})$.

The error term $\beta$ consists of $O(\rho^{\frac12-\xi})\|Mx\|_2^2$, which is directly characterized by the signal strength, and an additive term $O(n^{-c_0}) = o(1)$. Assuming that $\|Mx\|_2 = \Omega(1)$, i.e., the signal is not too weak, the term $\beta$ becomes a single multiplicative bound $O(\rho^{\frac12-\xi} + n^{-c_0})\|Mx\|_2^2$. This gives an easily interpretable result. For example, when our data size is $n\log n$, the performance gap between our algorithm and any algorithm that uses $n$ samples is at most $o(\|Mx\|_2^2)$. The improvement is significant when other baselines deliver MSEs in an additive form that could be larger than $\|Mx\|_2^2$ in the regime $n \le d_1$.

Preview of techniques. Let $N = U^N\Sigma^N(V^N)^T$ be the instance (in orthogonalized form).
Our goal is to construct a collection $\mathcal{N} = \{N_1, \dots, N_K\}$ of $K$ matrices so that: (i) for any $N_i \in \mathcal{N}$, $\Sigma^{N_i} = \Sigma^N$ and $V^{N_i} = V^N$; (ii) for any two $N_i, N_j \in \mathcal{N}$, $\|N_i - N_j\|_F$ is large; and (iii) $K = \exp(\Omega(\mathrm{poly}(\rho)d_2))$ (cf. [43, Chap. 2]). Condition (i) ensures that it suffices to construct unitary matrices $U^{N_i}$ for $\mathcal{N}$, and that the resulting instances will be in the same equivalence class. Conditions (ii) and (iii) resemble the standard construction of codes in information theory: we need a large "code rate", corresponding to requiring a large $K$, as well as large distances between "codewords", corresponding to requiring that $\|U^{N_i} - U^{N_j}\|_F$ be large.

Standard approaches for constructing such collections run into difficulties. Getting a sufficiently tight concentration bound on the distance between two random unitary matrices is difficult, as the matrix entries, by necessity, are correlated. On the other hand, starting with a large collection of random unit vectors and using its Cartesian product to build matrices does not necessarily yield unitary matrices.

We design a two-stage approach that decouples condition (iii) from (i) and (ii) by only generating sparse matrices $U^{N_i}$; see Fig. 3(b)-(d). In the first stage (Steps 1 & 2 in Fig. 3(b)-(c)), we only specify the non-zero positions (the sparsity pattern) in each $U^{N_i}$. It suffices to guarantee that the sparsity patterns of the matrices $U^{N_i}$ and $U^{N_j}$ have little overlap. The existence of such objects can easily be proved using the probabilistic method, so in the first stage we can build up a large number of sparsity patterns. In the second stage (Step 3 in Fig. 3(d)), we carefully fill in values at the non-zero positions for each $U^{N_i}$. When the number of non-zero entries is not too small, satisfying the unitary constraint is feasible. As the overlap of the sparsity patterns of any two matrices is small, we can argue that the distance between them is large. By carefully trading off the number of non-zero positions and the portion of overlap, we can simultaneously satisfy all three conditions.
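The first stage of this construction is straightforward to simulate. The sketch below draws $K$ random sparsity patterns and verifies that any two of them overlap in only a handful of positions; the sizes and sparsity level here are arbitrary stand-ins, whereas the proof picks them as functions of $\rho$ and $d_2$.

```python
# Stage 1 of the lower-bound construction (Fig. 3(b)-(c)): random sparsity patterns
# with small pairwise overlap, whose existence follows from the probabilistic method.
# All sizes below are illustrative.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d2, s, K = 2000, 40, 100                # ambient dimension, non-zeros per pattern, number of patterns
patterns = [set(rng.choice(d2, size=s, replace=False)) for _ in range(K)]

worst = max(len(a & b) for a, b in combinations(patterns, 2))
print(f"non-zeros per pattern: {s}, worst pairwise overlap: {worst}")   # overlap is a small fraction of s
```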
5 Related work and comparison

In this section, we compare our results to other regression algorithms that place low rank constraints on $M$. Most existing MSE results are parametrized by the rank or spectral properties of $M$; e.g., [30] defined a generalized notion of rank, $B_q(R^A_q) \triangleq \{A \in \mathbb{R}^{d_2\times d_1} : \sum_{i=1}^{d_2}|\sigma^A_i|^q \le R_q\}$, where $q \in [0, 1]$ and $A \in \{N, M\}$, i.e., $R^N_q$ characterizes the generalized rank of $N$ whereas $R^M_q$ characterizes that of $M$. When $q = 0$, $R^N_q = R^M_q$ is the rank of $N$, because $\mathrm{rank}(N) = \mathrm{rank}(M)$ in our setting. In their setting, the MSE is parametrized by $R^M_q$ and is shown to be $O\Big(R^M_q\big(\frac{\sigma_\epsilon^2\lambda^*_1(d_1+d_2)}{(\lambda^*_{\min})^2 n}\big)^{1-q/2}\Big)$. In the special case when $q = 0$, this reduces to $O\big(\frac{\sigma_\epsilon^2\lambda^*_1\mathrm{rank}(M)(d_1+d_2)}{(\lambda^*_{\min})^2 n}\big)$. On the other hand, the MSE in our case is bounded by (cf. Thm. 1) $\mathbb{E}[\|\hat{y} - y\|_2^2] = O\big(R^N_q\big(\frac{\sigma_\epsilon^2 d_2}{n}\big)^{1-q/2} + n^{-c_0}\big)$; when $q = 0$, this becomes $O\big(\frac{\sigma_\epsilon^2\mathrm{rank}(M)d_2}{n} + n^{-c_0}\big)$. The improvement here is twofold. First, our bound is directly characterized by $N$ in orthogonalized form, whereas the result of [30] needs to examine the interaction between $M$ and $C^*$, so their MSE depends on both $R^M_q$ and $\lambda^*_{\min}$. Second, our bound no longer depends on $d_1$ and pays only an additive factor $n^{-c_0}$; thus, when $n < d_1$, our result is significantly better.

Other works use different parametrizations in their upper bounds, but all of these existing results require $n > d_1$ to obtain non-trivial upper bounds [26, 9, 12]. Unlike these prior works, we require a stochastic assumption on $X$ (the rows are i.i.d.) to ensure that the model is identifiable when $n < d_1$; otherwise there could, for example, be two sets of disjoint features that fit the training data equally well. Our algorithm produces an adaptive model whose complexity is controlled by $k_1$ and $k_2$, which are adjusted dynamically depending on the sample size and noise level. [9] and [12] also point out the need for adaptivity; however, they still require $n > d_1$ and make some strong assumptions. For instance, [9] assumes that there is a gap between $\sigma_i(XM^T)$ and $\sigma_{i+1}(XM^T)$ for some $i$. In comparison, our sufficient condition, the decay of $\lambda^*_i$, is more natural. Our work is not directly comparable to standard variable selection techniques such as LASSO [42] because they handle univariate $y$. Column selection algorithms [15] generalize variable selection methods to vector responses, but they cannot address the identifiability concern.

6 Experiments

We apply our algorithm to an equity market dataset and a social network dataset to predict equity returns and user popularity, respectively. Our baselines include ridge regression ("Ridge"), reduced rank ridge regression [29] ("Reduced ridge"), LASSO ("Lasso"), nuclear norm regularized regression ("Nuclear norm"), reduced rank regression [45] ("RRR"), and principal component regression [1] ("PCR").

Predicting equity returns. We use a stock market dataset from an emerging market that consists of approximately 3,600 stocks between 2011 and 2018. We focus on predicting the next 5-day returns. For each asset in the universe, we compute its past 1-day, past 5-day, and past 10-day returns as features. We use a standard approach to translate forecasts into positions [4, 47]. We examine two universes in this market: (i) Universe 1 is equivalent to the S&P 500 and consists of 983 stocks, and (ii) the Full universe consists of all stocks except for illiquid ones.

Table 1: Summary of results for equity return forecasts (left block of metrics) and average results for the Twitter dataset (right block of metrics) over 10 random samples. $R^2$ is measured in basis points (bps); 1 bps = $10^{-4}$. Bold font denotes the best out-of-sample results and the smallest gap. "out − in" denotes $\mathrm{MSE}_{\mathrm{out}} - \mathrm{MSE}_{\mathrm{in}}$.

| Model | $R^2_{\mathrm{out}}$ (bps) | Sharpe | t-stat | $\mathrm{MSE}_{\mathrm{out}}$ | out − in | $\mathrm{corr}_{\mathrm{out}}$ | $\mathrm{MSE}_{\mathrm{out}}$ | out − in |
|---|---|---|---|---|---|---|---|---|
| ADAPTIVE-RRR | **18.576** | **1.623** | **15.413** | **1.005** | **0.1006** | **0.67 ± 0.13** | **9.42 ± 2.31** | **4.417** |
| Lasso | 1.124 | 0.595 | 0.018 | 1.063 | 0.534 | 0.47 ± 0.15 | 14.82 ± 4.81 | 12.452 |
| Ridge | 0.212 | 0.574 | 0.067 | 1.029 | 0.355 | 0.47 ± 0.17 | 13.62 ± 4.39 | 12.227 |
| Reduced ridge | 1.082 | 1.548 | 0.062 | 1.972 | 1.235 | 0.49 ± 0.18 | 12.23 ± 2.70 | 7.708 |
| RRR | 4.580 | -0.477 | 0.640 | 1.087 | 0.474 | 0.38 ± 0.22 | 13.07 ± 2.63 | 8.731 |
| Nuclear norm | 2.210 | -0.370 | -0.899 | 1.109 | 0.955 | 0.48 ± 0.16 | 13.05 ± 4.38 | 8.668 |
| PCR | 5.233 | 1.280 | 0.699 | 1.026 | 0.493 | 0.48 ± 0.15 | 13.08 ± 4.19 | 8.889 |

Results. Table 1 (left) reports the forecasting power and portfolio return for out-of-sample periods in the Full universe (see our full version for Universe 1). We observe that: (i) the data has a low signal-to-noise ratio, and the out-of-sample $R^2$ values of all the methods are close to 0; (ii) ADAPTIVE-RRR has the highest forecasting power; and (iii) ADAPTIVE-RRR has the smallest gap between in-sample and out-of-sample performance (see the "out − in" column), suggesting that our model is better at avoiding spurious signals.

Predicting user popularity in social networks. We collected tweet data on political topics from Oct. 2016 to Dec. 2017. Our goal is to predict a user's next 1-day popularity, defined as the sum of retweets, quotes, and replies received by the user.
There are a total of 19 million distinct users; due to this huge size, we extract the subset of the 2,000 users with the most interactions for evaluation. For each user in this 2,000-user set, we use its past 5 days' popularity as features. We further randomly sample 200 users and make predictions for them, i.e., we set $d_2 = 200$ so that $d_2$ is of the same magnitude as $n$.

Results. We randomly sample users 10 times and report the average MSE and correlation (with standard deviations) for both in-sample and out-of-sample data (see the full version for more results). In Table 1 (right) we see results consistent with the equity return experiment: (i) ADAPTIVE-RRR yields the best out-of-sample MSE and correlation; (ii) ADAPTIVE-RRR achieves the best generalization error, having a much smaller gap between training and test metrics.

7 Conclusion

This paper examines the low-rank regression problem in the high-dimensional setting. We design the first learning algorithm with provable statistical guarantees under a mild condition on the feature covariance matrix. Our algorithm is simple and computationally more efficient than low rank methods based on optimizing nuclear norms. Our theoretical analysis of the upper and lower bounds may be of independent interest. Our preliminary experimental results demonstrate the efficacy of our algorithm. The full version explains why our algorithmic result is unlikely to be known or trivial.

Broader Impact

The main contribution of this work is theoretical. Productionizing the downstream applications stated in the paper may take six months or more, so there is no immediate societal impact from this project.

Acknowledgement

We thank the anonymous reviewers for helpful comments and suggestions. Varun Kanade is supported in part by the Alan Turing Institute under the EPSRC grant EP/N510129/1. Yanhua Li was supported in part by NSF grants IIS-1942680 (CAREER), CNS-1952085, CMMI-1831140, and DGE-2021871. Qiong Wu and Zhenming Liu are supported by NSF grants NSF-2008557, NSF-1835821, and NSF-1755769. The authors acknowledge William & Mary Research Computing for providing computational resources and technical support that have contributed to the results reported within this paper.

References

[1] Anish Agarwal, Devavrat Shah, Dennis Shen, and Dogyoon Song. On robustness of principal component regression. In Advances in Neural Information Processing Systems, pages 9889-9900, 2019.
[2] Gernot Akemann, Jonit Fischmann, and Pierpaolo Vivo. Universal correlations and power-law tails in financial covariance matrices. Physica A: Statistical Mechanics and its Applications, 389(13):2566-2579, 2010.
[3] Theodore Wilbur Anderson et al. Estimating linear restrictions on regression coefficients for multivariate normal distributions. The Annals of Mathematical Statistics, 22(3):327-351, 1951.
[4] Andrew Ang. Asset management: A systematic approach to factor investing. Oxford University Press, 2014.
[5] Zhidong Bai and Jack W. Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
[6] David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135-160, 2014.
[7] Carolina Batis, Michelle A. Mendez, Penny Gordon-Larsen, Daniela Sotres-Alvarez, Linda Adair, and Barry Popkin. Using both principal component analysis and reduced rank regression to study dietary patterns and diabetes in Chinese adults. Public Health Nutrition, 19(2):195-203, 2016.
[8] Jennifer Bender, Remy Briand, Dimitris Melas, and Raman Aylur Subramanian. Foundations of factor investing. Available at SSRN 2543990, 2013.
[9] Florentina Bunea, Yiyuan She, and Marten H. Wegkamp. Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics, pages 1282-1309, 2011.
[10] Emmanuel J. Candes et al. The restricted isometry property and its implications for compressed sensing. Comptes rendus mathematique, 346(9-10):589-592, 2008.
[11] LJ Cao, Kok Seng Chua, WK Chong, HP Lee, and QM Gu. A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine. Neurocomputing, 55(1-2):321-336, 2003.
[12] Kun Chen, Hongbo Dong, and Kung-Sik Chan. Reduced rank regression via adaptive nuclear norm penalization. Biometrika, 100(4):901-920, 2013.
[13] Luyang Chen, Markus Pelger, and Jason Zhu. Deep learning in asset pricing. Available at SSRN 3350138, 2019.
[14] Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661-703, 2009.
[15] Chen Dan, Kristoffer Arnsfelt Hansen, He Jiang, Liwei Wang, and Yuchen Zhou. Low rank approximation of binary matrices: Column subset selection and generalizations. arXiv preprint arXiv:1511.01699, 2015.
[16] David Donoho, Matan Gavish, et al. Minimax risk of matrix denoising by singular value thresholding. The Annals of Statistics, 42(6):2413-2440, 2014.
[17] Fan Fan, Yong Ma, Chang Li, Xiaoguang Mei, Jun Huang, and Jiayi Ma. Hyperspectral image denoising with superpixel segmentation and low-rank representation. Information Sciences, 397:48-68, 2017.
[18] Laura Frank, Franziska Jannasch, Janine Kröger, George Bedu-Addo, Frank Mockenhaupt, Matthias Schulze, and Ina Danquah. A dietary pattern derived by reduced rank regression is associated with type 2 diabetes in an urban Ghanaian population. Nutrients, 7(7):5497-5514, 2015.
[19] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, 2001.
[20] Matan Gavish and David L. Donoho. The optimal hard threshold for singular values is 4/sqrt(3). IEEE Transactions on Information Theory, 60(8):5040-5053, 2014.
[21] Ali Ghodsi. Dimensionality reduction: a short tutorial. Department of Statistics and Actuarial Science, Univ. of Waterloo, Ontario, Canada, 37:38, 2006.
[22] Shihao Gu, Bryan Kelly, and Dacheng Xiu. Empirical asset pricing via machine learning. Technical report, National Bureau of Economic Research, 2018.
[23] Alan Julian Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248-264, 1975.
[24] Tosio Kato. Variation of discrete spectra. Communications in Mathematical Physics, 111(3):501-504, 1987.
[25] Bryan T. Kelly, Seth Pruitt, and Yinan Su. Characteristics are covariances: A unified model of risk and return. Journal of Financial Economics, 134(3):501-524, 2019.
[26] Vladimir Koltchinskii, Karim Lounici, Alexandre B. Tsybakov, et al. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302-2329, 2011.
[27] Zongming Ma and Tingni Sun. Adaptive sparse reduced-rank regression. arXiv, 1403, 2014.
[28] Marcelo C. Medeiros and Eduardo F. Mendes. Estimating high-dimensional time series models. Technical report, Texto para discussão, 2012.
[29] Ashin Mukherjee and Ji Zhu. Reduced rank ridge regression and its kernel extensions. Statistical Analysis and Data Mining: The ASA Data Science Journal, 4(6):612-622, 2011.
[30] Sahand Negahban and Martin J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pages 1069-1097, 2011.
[31] Ashadun Nobi, Seong Eun Maeng, Gyeong Gyun Ha, and Jae Woo Lee. Random matrix theory and cross-correlations in global financial indices and local stock market indices. Journal of the Korean Physical Society, 62(4):569-574, 2013.
[32] Roberto Oliveira et al. Sums of random Hermitian matrices and an inequality by Rudelson. Electronic Communications in Probability, 15:203-212, 2010.
[33] Christopher Polk, Samuel Thompson, and Tuomo Vuolteenaho. Cross-sectional forecasts of the equity premium. Journal of Financial Economics, 81(1):101-141, 2006.
[34] Mehdi Rahim, Bertrand Thirion, and Gaël Varoquaux. Multi-output predictions from neuroimaging: assessing reduced-rank linear models. In 2017 International Workshop on Pattern Recognition in Neuroimaging (PRNI), pages 1-4. IEEE, 2017.
[35] Anca Ralescu, Mojtaba Kohram, et al. Spectral regression with low-rank approximation for dynamic graph link prediction. IEEE Intelligent Systems, 26(4):48-53, 2011.
[36] Pradeep Ravikumar, Martin J. Wainwright, John D. Lafferty, et al. High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287-1319, 2010.
[37] Robert E. Schapire and Yoav Freund. Boosting: Foundations and algorithms. Kybernetes, 2013.
[38] Shiladitya Sinha, Chris Dyer, Kevin Gimpel, and Noah A. Smith. Predicting the NFL using Twitter. arXiv preprint arXiv:1310.6998, 2013.
[39] James H. Stock and Mark Watson. Dynamic factor models. Oxford Handbook on Economic Forecasting, 2011.
[40] James H. Stock and Mark W. Watson. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97(460):1167-1179, 2002.
[41] James H. Stock and Mark W. Watson. Implications of dynamic factor models for VAR analysis. Technical report, National Bureau of Economic Research, 2005.
[42] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.
[43] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
[44] Thomas Chinwe Urama, Patrick Oseloka Ezepue, and Chimezie Peters Nnanwa. Analysis of cross-correlations in emerging markets using random matrix theory. In International Conference on Computational Mathematics, Computational Geometry & Statistics (CMCGS) Proceedings, page 89. Global Science and Technology Forum, 2017.
[45] Raja Velu and Gregory C. Reinsel. Multivariate reduced-rank regression: theory and applications, volume 136. Springer Science & Business Media, 2013.
[46] Yun-Jhong Wu, Elizaveta Levina, and Ji Zhu. Generalized linear models with low rank effects for network data. arXiv preprint arXiv:1705.06772, 2017.
[47] Xinfeng Zhou and Sameer Jain. Active equity management. Self-published, 2014.
[48] Xiaofeng Zhu, Heung-Il Suk, Heng Huang, and Dinggang Shen. Low-rank graph-regularized structured sparse regression for identifying genetic biomarkers. IEEE Transactions on Big Data, 3(4):405-414, 2017.