# provable_metalearning_of_linear_representations__99870f6f.pdf Provable Meta-Learning of Linear Representations Nilesh Tripuraneni 1 Chi Jin 2 Michael I. Jordan 1 Abstract Meta-learning, or learning-to-learn, seeks to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments. Representation learning a key tool for performing meta-learning learns a data representation that can transfer knowledge across multiple tasks, which is essential in regimes where data is scarce. Despite a recent surge of interest in the practice of meta-learning, the theoretical underpinnings of meta-learning algorithms are lacking, especially in the context of learning transferable representations. In this paper, we focus on the problem of multi-task linear regression in which multiple linear regression models share a common, low-dimensional linear representation. Here, we provide provably fast, sample-efficient algorithms to address the dual challenges of (1) learning a common set of features from multiple, related tasks, and (2) transferring this knowledge to new, unseen tasks. Both are central to the general problem of meta-learning. Finally, we complement these results by providing informationtheoretic lower bounds on the sample complexity of learning these linear features. 1. Introduction The ability of a learner to transfer knowledge between tasks is crucial for robust, sample-efficient inference and prediction. One of the most well-known examples of such transfer learning has been in few-shot image classification where the idea is to initialize neural network weights in early layers using Image Net pre-training/features, and subsequently re-train the final layers on a new task (Donahue et al., 2014; Vinyals et al., 2016). However, the need for methods that can learn data representations that generalize to multiple, unseen tasks has also become vital in other applications, 1Department of EECS, University of California, Berkeley 2Department of Electrical Engineering, Princeton University. Correspondence to: Nilesh Tripuraneni . Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s). ranging from deep reinforcement learning (Baevski et al., 2019) to natural language processing (Ando & Zhang, 2005; Liu et al., 2019). Accordingly, researchers have begun to highlight the need to develop (and understand) generic algorithms for transfer (or meta) learning applicable in diverse domains (Finn et al., 2017). Surprisingly, however, despite a long line of work on transfer learning, there is limited theoretical characterization of the underlying problem. Indeed, there are few efficient algorithms for feature learning that provably generalize to new, unseen tasks. Sharp guarantees are even lacking in the linear setting. In this paper, we study the problem of meta-learning of features in a linear model in which multiple tasks share a common set of low-dimensional features. Our aim is twofold. First, we ask: given a set of diverse samples from t different tasks how we can efficiently (and optimally) learn a common feature representation? Second, having learned a common feature representation, how can we use this representation to improve sample efficiency in a new (t + 1)st task where data may be scarce?1 Formally, given an (unobserved) linear feature matrix B = (b1, . . . , br) Rd r with orthonormal columns, our statistical model for data pairs (xi, yi) is: yi = x i Bαt(i) + ϵi ; βt(i) = Bαt(i), (1) where there are t (unobserved) underlying task parameters αj for j {1, . . . , t}. Here t(i) {1, . . . , t} is the index of the task associated with the ith datapoint, xi Rd is a random covariate, and ϵi is additive noise. We assume the sequence {αt(i)} i=1 is independent of all other randomness in the problem. In this framework, the aforementioned questions reduce to recovering B from data from the first {1, . . . , t} tasks, and using this feature representation to recover a better estimate of a new task parameter, βt+1 = Bαt+1, where αt+1 is also unobserved. Our main result targets the problem of learning-to-learn (LTL), and shows how a feature representation ˆB learned from t diverse tasks can improve learning on an unseen (t + 1)st task which shares the same underlying linear representation. We informally state this result below.2 1This is sometimes referred to as learning-to-learn (LTL). 2Theorem 1 follows immediately from combining Theorems 3 and 4; see Theorem 6 for a formal statement. Provable Meta-Learning of Linear Representations Theorem 1 (Informal). Suppose we are given n1 total samples from t diverse and normalized tasks which are used in Algorithm 1 to learn a feature representation ˆB, and n2 samples from a new (t + 1)st task which are used along with ˆB and Algorithm 2 to learn the parameters ˆα of this new (t + 1)st task. Then, the parameter ˆB ˆα has the following excess prediction error on a new test point x drawn from the training data covariate distribution: Ex [ x , ˆB ˆα Bαt+1 2] O dr2 with high probability over the training data. The naive complexity of linear regression which ignores the information from the previous t tasks has complexity O( d n2 ). Theorem 1 suggests that positive transfer from the first {1, . . . , t} tasks to the final (t + 1)st task can dramatically reduce the sample complexity of learning when r d and n1 n2 r2; that is, when (1) the complexity of the shared representation is much smaller than the dimension of the underlying space and (2) when the ratio of the number of samples used for feature learning to the number of samples present for a new unseen task exceeds the complexity of the shared representation. We believe that the LTL bound in Theorem 1 is the first bound, even in the linear setting, to sharply exhibit this phenomenon (see Section 1.1 for a detailed comparison to existing results). Prior work provides rates for which the leading term in (2) decays as 1 t, not as 1 n1 . We identify structural conditions on the design of the covariates and diversity of the tasks that allow our algorithms to take full advantage of all samples available when learning the shared features. Our primary contributions in this paper are to: Establish that all local minimizers of the (regularized) empirical risk induced by (1) are close to the true linear representation up to a small, statistical error. This provides strong evidence that first-order algorithms, such as gradient descent (Jin et al., 2017), can efficiently recover good feature representations (see Section 3.1). Provide a method-of-moments estimator which can efficiently aggregate information across multiple differing tasks to estimate B even when it may be informationtheoretically impossible to learn the parameters of any given task (see Section 3.2). Demonstrate the benefits and pitfalls of transferring learned representations to new, unseen tasks by analyzing the bias-variance trade-offs of the linear regression estimator based on a biased, feature estimate (see Section 4). Develop an information-theoretic lower bound for the problem of feature learning, demonstrating that the aforementioned estimator is a close-to-optimal estimator of B, up to logarithmic and conditioning/eigenvalue factors in the matrix of task parameters (see Assumption 2). To our knowledge, this is the first information-theoretic lower bound for representation learning in the multi-task setting (see Section 5). 1.1. Related Work While there is a vast literature on papers proposing multitask and transfer learning methods, the number of theoretical investigations is much smaller. An important early contribution is due to Baxter (2000), who studied a model where tasks with shared representations are sampled from the same underlying environment. Pontil & Maurer (2013) and Maurer et al. (2016), using tools from empirical process theory, developed a generic and powerful framework to prove generalization bounds in multi-task and learning-to-learn settings that are related to ours. Indeed, the closest guarantee to that in our Theorem 1 that we are aware of is Maurer et al. (2016, Theorem 5). Instantiated in our setting, Maurer et al. (2016, Theorem 5) provides an LTL guarantee showing that the excess risk of the loss function with learned representation on a new datapoint is bounded by O( r r n2 ), with high probability. There are several principal differences between our work and results of this kind. First, we address the algorithmic component (or computational aspect) of metalearning while the previous theoretical literature generally assumes access to a global empirical risk minimizer (ERM). Computing the ERM in these settings requires solving a nonconvex optimization problem that is in general NP hard. Second, in contrast to Maurer et al. (2016), we also provide guarantees for feature recovery in terms of the parameter estimation error measured directly in the distance in the feature space. Third, and most importantly, in Maurer et al. (2016), the leading term capturing the complexity of learning the feature representation decays only in t but not in n1 (which is typically much larger than t). Although, as they remark, the 1/ t scaling they obtain is in general unimprovable in their setting, our results leverage assumptions on the distributional similarity between the underlying covariates x and the potential diversity of tasks to improve this scaling to 1/n1. That is, our algorithms make benefit of all the samples in the feature learning phase. We believe that for many settings (including the linear model that is our focus) such assumptions are natural and that our rates reflect the practical efficacy of meta-learning techniques. Indeed, transfer learning is often successful even when we are presented with only a few training tasks but with each having a significant number of samples per task (e.g., n1 t).3 There has also been a line of recent work providing guaran- 3See Fig. 3 for a numerical simulation relevant to this setting. Provable Meta-Learning of Linear Representations tees for gradient-based meta-learning (MAML) (Finn et al., 2017). Finn et al. (2019); Khodak et al. (2019a;b), and Denevi et al. (2019) work in the framework of online convex optimization (OCO) and use a notion of (a potentially data-dependent) task similarity that assumes closeness of all tasks to a single fixed point in parameter space to provide guarantees. In contrast to this work, we focus on the setting of learning a representation common to all tasks in a generative model. The task model parameters need not be close together in our setting. In concurrent work, Du et al. (2020) obtain results similar to ours for multi-task linear regression and provide comparable guarantees for a two-layer Re LU network using a notion of training task diversity akin to ours. Their generalization bounds use a distributional assumption over meta-test tasks, while our bounds for linear regression are sharp for fixed meta-test tasks. Moreover, their focus is on purely statistical guarantees they assume access to an ERM oracle for nonconvex optimization problems. Our focus is on providing statistical rates for efficient algorithmic procedures (i.e., the method-of-moments and local minima reachable by gradient descent). Finally, we also show a (minimax)-lower bound for the problem of feature recovery (i.e., recovering B). 2. Preliminaries Throughout, we will use bold lower-case letters (e.g., x) to refer to vectors and bold upper-case letters to refer to matrices (e.g., X). We exclusively use B Rd r to refer to a matrix with orthonormal columns spanning an rdimensional feature space, and B to refer a matrix with orthonormal columns spanning the orthogonal subspace of this feature space. The norm appearing on a vector or matrix refers to its ℓ2 norm or spectral norm respectively. The notation F refers to a Frobenius norm. x, y is the Euclidean inner product, while M, N = tr(MN ) is the inner product between matrices. Generically, we will use hatted vectors and matrices (e.g., ˆα and ˆB) to refer to (random) estimators of their underlying population quantities. We will use , , and to denote greater than, less than, and equal to up to a universal constant and use O to denote an expression that hides polylogarithmic factors in all problem parameters. Our use of O, Ω, and Θ is otherwise standard. Formally, an orthonormal feature matrix B is an element of an equivalence class (under right rotation) of a representative lying in Grr,d(R) the Grassmann manifold (Edelman et al., 1998). The Grassmann manifold, which we denote as Grr,d(R), consists of the set of r-dimensional subspaces within an underlying d-dimensional space. To define distance in Grr,d(R) we define the notion of a principal angle between two subspaces p and q. If E is an orthonormal matrix whose columns form an orthonormal basis of p and F is an orthonormal matrix whose columns form an orthonormal basis of q, then a singular value decomposition of E F = UDV defines the principal angles as: D = diag(cos θ1, cos θ2, . . . , cos θk), where 0 θk . . . θ1 π 2 . The distance of interest for us will be the subspace angle distance sin θ1, and for convenience we will use the shorthand sin θ(E, F) to refer to it. With some abuse of notation we will use B to refer to an explicit orthonormal feature matrix and the subspace in Grr,d(R) it represents. We now detail several assumptions we use in our analysis. Assumption 1 (Sub-Gaussian Design and Noise). The i.i.d. design vectors xi are zero mean with covariance E[xx ] = Id and are Id-sub-gaussian, in the sense that E[exp(v xi)] exp v 2 2 for all v. Moreover, the additive noise variables ϵi are i.i.d. sub-gaussian with variance parameter 1 and are independent of xi. Throughout, we work in the setting of random design linear regression, and in this context Assumption 1 is standard. Our results do not critically rely on the identity covariance assumption although its use simplifies several technical arguments. In the following we define the population task diversity matrix as A = (α1, . . . , αt) Rt r, ν = σr( A A the average condition number as κ = tr( A A t ) rν , and the worst-case condition number as κ = σ1( A A t )/ν. Assumption 2 (Task Diversity and Normalization). The t underlying task parameters αj satisfy αj = Θ(1) for all j {1, . . . , t}. Moreover, we assume ν > 0. Recovering the feature matrix B is impossible without structural conditions on A. Consider the extreme case in which α1, . . . , αt are restricted to span only the first r 1 columns of the column space of the feature matrix B. None of the data points (xi, yi) contain any information about the rth column-feature which can be any arbitrary vector in the complementary d r 1 subspace. In this case recovering B accurately is information-theoretically impossible. The parameters ν, κ, and κ capture how spread out the tasks αj are in the column space of B. The condition αj = Θ(1) is also standard in the statistical literature and is equivalent to normalizing the signal-to-noise (snr) ratio to be Θ(1)4. In linear models, the snr is defined as the square of the ℓ2 norm of the underlying parameter divided by the variance of the additive noise. Our overall approach to meta-learning of representations consists of two phases that we term meta-train and metatest . First, in the meta-train phase (see Section 3), we 4Note that for a well-conditioned population task diversity matrix where κ κ O(1), our snr normalization enforces that tr(A A/t) = Θ(1) and ν Ω( 1 Provable Meta-Learning of Linear Representations provide algorithms to learn the underlying linear representation from a set of diverse tasks. Second, in the meta-test phase (see Section 4) we show how to transfer these learned features to a new, unseen task to improve the sample complexity of learning. Detailed proofs of our main results can be found in the Appendix. 3. Meta-Train: Learning Linear Features Here we address both the algorithmic and statistical challenges of provably learning the linear feature representation B. 3.1. Local Minimizers of the Empirical Risk The remarkable, practical success of first-order methods for training nonconvex optimization problems (including meta/multi-task learning objectives) motivates us to study the optimization landscape of the empirical risk induced by the model in (1). We show in this section that all local minimizers of a regularized version of empirical risk recover the true linear representation up to a small statistical error. Jointly learning the population parameters B and (α1, . . . , αt) defined by (1) is reminiscent of a matrix sensing/completion problem. We leverage this connection for our analysis, building in particular on results from Ge et al. (2017). Throughout this section we assume that we are in a uniform task sampling model at each iteration the task t(i) for the ith datapoint is uniformly sampled from the t underlying tasks. We first recast our problem in the language of matrices, by defining the matrix we hope to recover as M = (α1, . . . , αt) B Rt d. Since rank(M ) = r, we let X D (Y ) = SVD(M ), and denote U = X (D )1/2 Rt r, V = (D )1/2Y Rd r. In this notation, the responses of the regression model are written as follows: yi = et(i)x i , M + ϵi. (3) To frame recovery as an optimization problem we consider the Burer-Monteiro factorization of the parameter M = UV where U Rt r and V Rd r. This motivates the following objective: min U Rt r,V Rd r f(U, V) = 2t i=1 (yi et(i)x i , UV )2 2 U U V V 2 F. (4) The second term in (4) functions as a regularization to prevent solutions which send U F 0 while V F or vice versa. If the value of this objective (4) is small we might hope that an estimate of B can be extracted from the column space of the parameter V, since the column space of V spans the same subspace as B. Informally, our main result states that all local minima of the regularized empirical risk are in the neighborhood of the optimal V , and have subspaces that approximate B well. Before stating our result we define the constraint set, which contains incoherent matrices with reasonable scales, as follows: W = { (U, V) | max i [t] e i U 2 C0 κr κν tκν, V 2 C0 for some large constant C0. Under Assumption 2, this set contains the optimal parameters. Note that U and V satisfy the final two constraints by definition and Lemma 16 can be used to show that Assumption 2 actually implies that U is incoherent, which satisfies the first constraint. Our main result follows. Theorem 2. Let Assumptions 1 and 2 hold in the uniform task sampling model. If the number of samples n1 satisfies n1 polylog(n1, d, t)(κr)4 max{t, d}, then, with probability at least 1 1/poly(d), we have that given any local minimum (U, V) int(W) of the objective (4), the column space of V, spanned by the orthonormal feature matrix ˆB, satisfies: sin θ( ˆB, B) O max{t, d}r log n1 We make several comments on this result: The guarantee in Theorem 2 suggests that all local minimizers of the regularized empirical risk (4) will produce a linear representation at a distance at most O( p max{t, d}r/n1) from the true underlying representation. Theorem 5 guarantees that any estimator (including the empirical risk minimizer) must incur error p dr/n1. Therefore, in the regime t O(d), all local minimizers are statistically close-to-optimal, up to logarithmic factors and conditioning/eigenvalue factors in the task diversity matrix. Combined with a recent line of results showing that (noisy) gradient descent can efficiently escape strict saddle points to find local minima (Jin et al., 2017), Theorem 2 provides strong evidence that first-order methods can efficiently meta-learn linear features.5 The proof of Theorem 2 is technical so we only sketch the high-level ideas. The overall strategy is to analyze the Hessian of the objective (4) at a stationary point (U, V) in int(W) to exhibit a direction of negative curvature which 5To formally establish computational efficiency, we need to further verify the smoothness and the strict-saddle properties of the objective function (4) (see, e.g., (Jin et al., 2017)). Provable Meta-Learning of Linear Representations Algorithm 1 Mo M Estimator for Learning Linear Features Input: {(xi, yi)}n1 i=1. ˆBD1 ˆB top-r SVD of 1 n1 Pn1 i=1 y2 i xix i return ˆB can serve as a direction of local improvement pointing towards M (and hence show (U, V) is not a local minimum). Implementing this idea requires surmounting several technical hurdles including (1) establishing various concentration of measure results (e.g., RIP-like conditions) for the sensing matrices et(i)x i unique to our setting and (2) handling the interaction of the optimization analysis with the regularizer and noise terms. Performing this analysis establishes that under the aforementioned conditions all local minima in int(W) satisfy UV M F O( q t max{t,d}r log n1 n1 ) (see Theorem 8). Guaranteeing that this loss is small is not sufficient to ensure recovery of the underlying features. Transferring this guarantee in the Frobenius norm to a result on the subspace angle critically uses the task diversity assumption (see Lemma 15) to give the final result. 3.2. Method-of-Moments Estimator Next, we present a method-of-moments algorithm to recover the feature matrix B with sharper statistical guarantees. An alternative to optimization-based approaches such as maximum likelihood estimation, the method-of-moments is among the oldest statistical techniques (Pearson, 1894) and has recently been used to estimate parameters in latent variable models (Anandkumar et al., 2012). As we will see, the technique is well-suited to our formulation of multi-task feature learning. We present our estimator in Algorithm 1, which simply computes the top-r eigenvectors of the matrix (1/n1) Pn1 i=1 y2 i xix i . Before presenting our result, we define the averaged empirical task matrix as Λ = 1 n Pn i=1 αt(i)α t(i) where ν = σr( Λ), and κ = tr( Λ)/(r ν) in analogy with Assumption 2. Theorem 3. Suppose the n1 data samples {(xi, yi)}n1 i=1 are generated from the model in (1) and that Assumptions 1 and 2 hold, but additionally that xi N(0, Id). Then, if n1 polylog(d, n1)rd κ/ ν, the output ˆB of Algorithm 1 satisfies sin θ( ˆB, B) O with probability at least 1 O(n 100 1 ). Moreover, if the number of samples generated from each task are equal (i.e., Λ = 1 t A A), then the aforementioned guarantee holds with κ = κ and ν = ν. We first make several remarks regarding this result. Algorithm 2 Linear Regression for Learning a New Task with a Feature Estimate Input: ˆB, {(xi, yi)}n2 i=1. ˆα (Pn2 i=1 ˆBxix i ˆB ) ˆB Pn2 i=1 xiyi return ˆα Theorem 3 is flexible the only dependence of the estimator on the distribution of samples across the various tasks is factored into the empirical task diversity parameters ν and κ. Under a uniform observation model the guarantee also immediately translates into an analogous statement which holds with the population task diversity parameters ν and κ. Theorem 3 provides a non-trivial guarantee even in the setting where we only have Θ(1) samples from each task, but t = Θ(dr). In this setting, recovering the parameters of any given task is information-theoretically impossible. However, the method-of-moments estimator can efficiently aggregate information across the tasks to learn B. The estimator does rely on the moment structure implicit in the Gaussian design to extract B. However, Theorem 3 has no explicit dependence on t and is close-to-optimal in the constant-snr regime; see Theorem 5 for our lower bound. We now provide a summary of the proof. Under oracle access to the population mean E[ 1 i y2 i xix i ] = (2 Γ + (1 + tr( Γ))Id), where Γ = 1 n Pn i=1 Bαt(i)α t(i)B (see Lemma 1), we can extract the features B by directly applying PCA to this matrix, under the condition that κ > 0, to extract its column space. In practice, we only have access to the samples {(xi, yi)}n i=1. Algorithm 1 uses the empirical moments 1 i y2 i xix i in lieu of the population mean. Thus, to show the result, we argue that 1 n Pn i=1 y2 i xix i = E[ 1 n Pn i=1 y2 i xix i ] + E where E is a small, stochastic error (see Theorem 7). If this holds, the Davis-Kahan sin θ theorem (Bhatia, 2013) shows that PCA applied to the empirical moments provides an accurate estimate of B under perturbation by a sufficiently small E. The key technical step in this argument is to show sharp concentration (in spectral norm) of the matrix-valued noise terms contained in E which are neither bounded (in spectral norm) nor sub-gaussian/sub-exponential-like; we refer the reader to the Appendix for further details on this argument. 4. Meta-Test: Transfer of Features to New Tasks Having estimated a linear feature representation ˆB shared across related tasks, our second goal is to transfer this representation to a new, unseen task the (t + 1)st task to Provable Meta-Learning of Linear Representations improve learning. In the context of the model in (1), the approach taken in Algorithm 2 uses ˆB as a plug-in surrogate for the unknown B, and attempts to estimate αt+1 Rr. Formally we define our estimator α as follows: ˆα = arg min α y X ˆBα 2, (5) where n2 samples (X, y) are generated from the model in (1) from the (t + 1)st task. Effectively, the feature representation ˆB performs dimension reduction on the input covariates X, allowing us to learn in a lower-dimensional space. Our focus is on understanding the generalization properties of the estimator in Algorithm 2, since (5) is an ordinary least-squares objective which can be analytically solved. Assuming we have produced an estimate ˆB of the true feature matrix B, we can present our main result on the sample complexity of meta-learned linear regression. Theorem 4. Suppose n2 data points, {(xi, yi)}n2 i=1, are generated from the model in (1), where Assumption 1 holds, from a single task satisfying αt+1 2 O(1). Then, if sin θ( ˆB, B) δ and n2 r log n2, the output ˆα from Algorithm 2 satisfies ˆB ˆα Bαt+1 2 O δ2 + r with probability at least 1 O(n 100 2 ). Note that Bαt+1 is simply the underlying parameter in the regression model in (1). We make several remarks about this result: Theorem 4 decomposes the error of transfer learning into two components. The first term, O(δ2), arises from the bias of using an imperfect feature estimate ˆB to transfer knowledge across tasks. The second term, O( r n2 ), arises from the variance of learning in a space of reduced dimensionality. Standard generalization guarantees for random design linear regression ensure that the parameter recovery error is bounded by O( d n2 ) w.h.p. under the same assumptions (Hsu et al., 2012). Meta-learning of the linear representation ˆB can provide a significant reduction in the sample complexity of learning when δ2 d n2 and r d. Conversely, if δ2 d n2 the bounds in (6) imply that the overhead of learning the feature representation may overwhelm the potential benefits of transfer learning (with respect to baseline of learning the (t + 1)st task in isolation). This accords with the well-documented empirical phenomena of negative transfer observed in large-scale deep learning problems where meta/transferlearning techniques actually result in a degradation in performance on new tasks (Wang et al., 2019). For diverse tasks (i.e. κ O(1)), using Algorithm 1 to estimate ˆB suggests that ensuring δ2 d n2 , where δ2 = O( dr νn1 ), requires n1 n2 r/ν. That is, the ratio of the number of samples used for feature learning to the number of samples used for learning the new task should exceed the complexity of the feature representation to achieve positive transfer. In order to obtain the rate in Theorem 4 we use a biasvariance analysis of the estimator error ˆB ˆα Bαt+1 (and do not appeal to uniform convergence arguments). Using the definition of y we can write the error as, ˆB ˆα Bα0 = ˆB( ˆB X X ˆB) 1 ˆBX XBα0 Bα0 + ˆB( ˆB X X ˆB) 1 ˆB X ϵ. The first term contributes the bias term to (6) while the second contributes the variance term. Analyzing the fluctuations of the (mean-zero) variance term can be done by controlling the norm of its square, ϵ Aϵ, where A = X ˆB( ˆB X X ˆB) 2 ˆB X . We can bound this (random) quadratic form by first appealing to the Hanson-Wright inequality to show w.h.p. that ϵ Aϵ tr(A) + O( A F + A ). The remaining randomness in A can be controlled using matrix concentration/perturbation arguments (see Lemma 17). With access to the true feature matrix ˆB (i.e., setting ˆB = B) the term ˆB(B X XB) 1BX XBα0 Bα0 = 0, due to the cancellation in the empirical covariance matrices, (B X XB) 1BX XB = Ir. This cancellation of the empirical covariance is essential to obtaining a tight analysis of the least-squares estimator. We cannot rely on this effect in full since ˆB = B. However, a naive analysis which splits these terms, ( ˆB X X ˆB) 1 and ˆBX XB can lead to a large increase in the variance in the bound. To exploit the fact ˆB B, we project the matrix B in the leading XB term onto the column space of ˆB and its complement which allows a partial cancellation of the empirical covariances in the subspace spanned by ˆB. The remaining variance can be controlled as in the previous term (see Lemma 18). 5. Lower Bounds for Feature Learning To complement the upper bounds provided in the previous section, in this section we derive information-theoretic limits for feature learning in the model (1). To our knowledge, these results provide the first sample-complexity lower bounds for feature learning, with regards to subspace recovery, in the multi-task setting. While there is existing literature on (minimax)-optimal estimation of low-rank matrices (see, for example, Rohde et al. (2011)), that work focuses on Provable Meta-Learning of Linear Representations the (high-dimensional) estimation of matrices, whose only constraint is to be low rank. Moreover, error is measured in the additive prediction norm. In our setting, we must handle the additional difficulties arising from the fact that we are interested in (1) learning a column space (i.e., an element in the Grr,d(R)) and (2) the error between such representatives is measured in the subspace angle distance. We begin by presenting our lower bound for feature recovery. Theorem 5. Suppose a total of n data points are generated from the model in (1) satisfying Assumption 1 with xi N(0, Id), ϵi N(0, 1), with an equal number from each task, and that Assumption 2 holds with αj for each task normalized to αj = 1 2. Then, there are αj for r d 2 and n max 1 8ν , r(d r) so that: inf ˆB sup B Grr,d(R) sin θ( ˆB, B) Ω with probability at least 1 4, where the infimum is taken over the class of estimators that are functions of the n data points. Again we make several comments on the result. The result of Theorem 5 shows that the estimator in Algorithm 1 provides a close-to-optimal estimator of the feature representation parameterized by B up to logarithmic and conditioning factors (i.e. κ, ν)6 in the task diversity matrix that is independent of the task number t. Note that under the normalization for αi, as κ (i.e. the task matrix A becomes ill-conditioned) we have that ν 0. So the first term in Theorem 5 establishes that task diversity is necessary for recovery of the subspace B. The dimension of Grr,d(R) (i.e., the number of free parameters needed to specify a feature set) is r(d r) Ω(dr) for d/2 r; hence the second term in Theorem 5 matches the scaling that we intuit from parameter counting. Obtaining tight dependence of our subspace recovery bounds on conditioning factors in the task diversity matrix (i.e. κ, ν) is an important and challenging research question. We believe the gap between in conditioning/eigenvalue factors between Theorem 3 and Theorem 5 on the p dr/n term is related to a problem that persists for classical estimators in linear regression (i.e. for the Lasso estimator in sparse linear regression). Even in this setting, a gap remains with respect to condition number/eigenvalue factors of the data design matrix X, between existing upper and lower bounds (see Chen et al. (2016, Section 7), Raskutti et al. (2011, Theorem 1, Theorem 2) and Zhang et al. (2014) for example). In our setting the task diversity matrix A enters into the problem in a similar fashion to the data design matrix X in these aforementioned settings. 6Note in the setting that κ O(1), ν 1 The dependency on the task diversity parameter 1 ν (the first term in Theorem 5) is achieved by constructing a pair of feature matrices and an ill-conditioned task matrix A that cannot discern the direction along which they defer. The proof strategy to capture the second term uses a f-divergence based minimax technique from Guntuboyina (2011) (restated in Lemma 20 in the Appendix), similar in spirit to the global Fano (or Yang-Barron). There are two key ingredients to using Lemma 20 and obtaining a tight lower bound. First, we must exhibit a large family of distinct, well-separated feature matrices {Bi}M i=1 (i.e., a packing at scale η). Second, we must argue this set of feature matrices induces a family of distributions over data {(xi, yi)}Bi which are statistically close and fundamentally difficult to distinguish amongst. This is captured by the fact the ϵ-covering number, measured in the space of distributions with divergence measure Df( , ), is small. The standard (global) Fano method, or Yang-Barron method (see Wainwright (2019, Ch. 15)), which uses the KL divergence to measure distance in the space of measures, is known to provide rate-suboptimal lower bounds for parametric estimation problems.7 Our case is no exception. To circumvent this difficulty we use the framework of Guntuboyina (2011), instantiated with the f-divergence chosen as the χ2-divergence, to obtain a tight lower bound. The argument proceeds in two steps. First, although the geometry of Grr,d(R) is complex, we can adapt results from Pajor (1998) to provide sharp upper/lower bounds on the metric entropy (or global entropy) of the Grassmann manifold (see Proposition 9). The second technical step of the argument hinges on the ability to cover the space of distributions parametrized by B in the space of measures {PB : B Grr,d(R)} with distance measured by an appropriate f-divergence. In order to establish a covering in the space of measures parametrized by B, the key step is to bound the distance χ2(PB1, PB2) for two different measures over data generated from the model (1) with two different feature matrices B1 and B2 (see Lemma 21). This control can be achieved in our random design setting by exploiting the Gaussianity of the marginals over data X and the Gaussianity of the conditionals of y|X, B, to ultimately be expressed as a function of the angular distance between B1 and B2. 6. Simulations We complement our theoretical analysis with a series of numerical experiments highlighting the benefits (and lim- 7Even for the simple problems of Gaussian mean estimation the classical Yang-Barron method is suboptimal; see Guntuboyina (2011) for more details. Provable Meta-Learning of Linear Representations its) of meta-learning8. For the purposes of feature learning we compare the performance of the method-of-moments estimator in Algorithm 1 vs. directly optimizing the objective in (4). Additional details on our set-up are provided in Appendix G. We construct problem instances by generating Gaussian covariates and noise as xi N(0, Id), ϵi N(0, 1), and the tasks and features used for the firststage feature estimation as αi 1 r N(0, Ir), with B generated as a (uniform) random r-dimensional subspace of Rd. In all our experiments we generate an equal number of samples nt for each of the t tasks, so n1 = t nt. In the second stage we generate a new, (t + 1)st task instance using the same feature estimate B used in the first stage and otherwise generate n2 samples, with the covariates, noise and αt+1 constructed as before. Throughout this section we refer to features learned via a first-order gradient method as LF-FO and the corresponding meta-learned regression parameter on a new task by meta-LR-FO. We use LF-Mo M and meta-LR-Mo M to refer to the same quantities save with the feature estimate learned via the method-of-moments estimator. We also use LR to refer to the baseline linear regression estimator on a new task which only uses data generated from that task. We begin by considering a challenging setting for feature learning where d = 100, r = 5, but nt = 5 for varying numbers of tasks t. As Fig. 1 demonstrates, the method-of- 0 1000 2000 3000 4000 5000 6000 Number of Tasks LF-Mo M LF-FO 0 1000 2000 3000 4000 5000 6000 Number of Tasks ℓ2 Parameter Error LR meta-LR-Mo M meta-LR-FO Figure 1. Left: LF-FO vs. LF-Mo M estimator with error measured in the subspace angle distance sin θ( ˆB, B). Right: meta-LR-FO and meta-LR-Mo M vs. LR on new task with error measured on new task parameter. Here d = 100, r = 5, and nt = 5 while n2 = 2500 as the number of tasks is varied. moments estimator is able to aggregate information across the tasks as t increases to slowly improve its feature estimate, even though nt d. The loss-based approach struggles to improve its estimate of the feature matrix B in this regime. This accords with the extra t dependence in Theorem 2 relative to Theorem 3. In this setting, we also generated a (t + 1)st test task with d n2 = 2500, to test the effect of meta-learning the linear representation on generalization in a new, unseen task against a baseline which simply performs a regression on this new task in isolation. Fig. 1 also shows 8An open-source Python implementation to reproduce our experiments can be found at https://github.com/ nileshtrip/MTL. that meta-learned regressions perform significantly worse than simply ignoring first t tasks. Theorem 4 indicates the bias from the inability to learn an accurate feature estimate of B overwhelms the benefits of transfer learning. In this regime n2 d so the new task can be efficiently learned in isolation. We believe this simulation represents a simple instance of the empirically observed phenomena of negative transfer (Wang et al., 2019). We now turn to the more interesting use cases where metalearning is a powerful tool. We consider a setting where d = 100, r = 5, and nt = 25 for varying numbers of tasks t. However, now we consider a new, unseen task where data is scarce: n2 = 25 < d. As Fig. 2 shows, in 0 1000 2000 3000 4000 5000 6000 Number of Tasks LF-Mo M LF-FO 0 1000 2000 3000 4000 5000 6000 Number of Tasks ℓ2 Parameter Error LR meta-LR-Mo M meta-LR-FO Figure 2. Left: LF-FO vs. LF-Mo M estimator with error measured in the subspace angle distance sin θ( ˆB, B). Right: meta-LR-FO and meta-LR-Mo M vs. LR on new task with error measured on new task parameter. Here d = 100, r = 5, nt = 25 while n2 = 25 while the number of tasks is varied. this regime both the method-of-moments estimator and the loss-based approach can learn a non-trivial estimate of the feature representation. The benefits of transferring this representation are also evident in the improved generalization performance seen by the meta-regression procedures on the new task. Interestingly, the loss-based approach learns an accurate feature representation ˆB with significantly fewer samples then the method-of-moments estimator, in contrast to the previous experiment. Finally, we consider an instance where d = 100, r = 5, t = 20, and n2 = 50 with varying numbers of training points nt per task. We see in Fig. 3 that meta-learning of representations provides significant value in a new task. Note that these numerical experiments show that as the number of tasks is fixed, but nt increases, the generalization ability of the meta-learned regressions significantly improves as reflected in the bound (2). 7. Conclusions In this paper we show how a shared linear representation may be efficiently learned and transferred between multiple linear regression tasks. We provide both upper and lower bounds on the sample complexity of learning this representation and for the problem of learning-to-learn. We believe our bounds capture important qualitative phenomena observed in real meta-learning applications absent from previous theoretical treatments. Provable Meta-Learning of Linear Representations 0 1000 2000 3000 4000 5000 6000 Number of Training Points (per Task) LF-Mo M LF-FO 0 1000 2000 3000 4000 5000 6000 Number of Training Points (per Task) ℓ2 Parameter Error LR meta-LR-Mo M meta-LR-FO Figure 3. Left: LF-FO vs. LF-Mo M estimator with error measured in the subspace angle distance sin θ( ˆB, B). Right: meta-LR-FO and meta-LR-Mo M vs. LR on new task with error measured on new task parameter. Here d = 100, r = 5, t = 20, and n2 = 50 while the number of training points per task (nt) is varied. Anandkumar, A., Hsu, D., and Kakade, S. M. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, pp. 33 1, 2012. Ando, R. K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817 1853, 2005. Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., and Auli, M. Cloze-driven pretraining of self-attention networks. ar Xiv preprint ar Xiv:1903.07785, 2019. Baxter, J. A model of inductive bias learning. Journal of artificial intelligence research, 12:149 198, 2000. Bhatia, R. Matrix Analysis, volume 169. Springer Science & Business Media, 2013. Candes, E. and Plan, Y. Tight oracle bounds for low-rank matrix recovery from a minimal number of noisy random measurements. ar Xiv preprint ar Xiv:1001.0339, 2010. Chen, X., Guntuboyina, A., and Zhang, Y. On bayes risk lower bounds. The Journal of Machine Learning Research, 17(1):7687 7744, 2016. Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. ar Xiv preprint ar Xiv:1903.10399, 2019. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pp. 647 655, 2014. Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. ar Xiv preprint ar Xiv:2002.09434, 2020. Edelman, A., Arias, T. A., and Smith, S. T. The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303 353, 1998. Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126 1135. JMLR. org, 2017. Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. ar Xiv preprint ar Xiv:1902.08438, 2019. Ge, R., Jin, C., and Zheng, Y. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. ar Xiv preprint ar Xiv:1704.00708, 2017. Guntuboyina, A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Transactions on Information Theory, 57(4):2386 2399, 2011. Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. In Conference on learning theory, pp. 9 1, 2012. Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pp. 1724 1732. JMLR. org, 2017. Khodak, M., Balcan, M.-F., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. ar Xiv preprint ar Xiv:1902.10644, 2019a. Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pp. 5915 5926, 2019b. Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical programming, 45(1-3):503 528, 1989. Liu, X., He, P., Chen, W., and Gao, J. Multi-task deep neural networks for natural language understanding. ar Xiv preprint ar Xiv:1901.11504, 2019. Maclaurin, D., Duvenaud, D., and Adams, R. P. Autograd: Effortless gradients in numpy. In ICML 2015 Auto ML Workshop, volume 238, 2015. Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853 2884, 2016. Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., et al. Ray: A distributed framework for emerging {AI} applications. In 13th {USENIX} Symposium on Provable Meta-Learning of Linear Representations Operating Systems Design and Implementation ({OSDI} 18), pp. 561 577, 2018. Pajor, A. Metric entropy of the Grassmann manifold. Convex Geometric Analysis, 34:181 188, 1998. Pearson, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71 110, 1894. Pontil, M. and Maurer, A. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, pp. 55 76, 2013. Raskutti, G., Wainwright, M. J., and Yu, B. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE transactions on information theory, 57 (10):6976 6994, 2011. Recht, B. A simpler approach to matrix completion. Journal of Machine Learning Research, 12(Dec):3413 3430, 2011. Rohde, A., Tsybakov, A. B., et al. Estimation of highdimensional low-rank matrices. The Annals of Statistics, 39(2):887 930, 2011. Tropp, J. A. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12 (4):389 434, 2012. Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018. Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630 3638, 2016. Wainwright, M. J. High-Dimensional Statistics: A Non Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019. Wang, Z., Dai, Z., P oczos, B., and Carbonell, J. Characterizing and avoiding negative transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11293 11302, 2019. Zhang, Y., Wainwright, M. J., and Jordan, M. I. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pp. 921 948, 2014.