# Towards Sample-efficient Overparameterized Meta-learning

Yue Sun (University of Washington, yuesun@uw.edu), Adhyyan Narang (University of Washington, adhyyan@uw.edu), Halil Ibrahim Gulluk (Bogazici University, hibrahimgulluk@gmail.com), Samet Oymak (University of California, Riverside, oymak@ece.ucr.edu), Maryam Fazel (University of Washington, mfazel@uw.edu)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Abstract. An overarching goal in machine learning is to build a generalizable model with a small number of samples. To this end, overparameterization has been the subject of immense interest as an explanation for the generalization ability of deep nets even when the size of the dataset is smaller than that of the model. While the prior literature focuses on the classical supervised setting, this paper aims to demystify overparameterization for meta-learning. Here we consider a sequence of linear-regression tasks and ask: (1) Given earlier tasks, what is the optimal linear representation of features for a new downstream task? (2) How many samples do we need to build this representation? This work shows that, surprisingly, overparameterization arises as a natural answer to these fundamental meta-learning questions. Specifically, for (1), we first show that learning the optimal representation coincides with the problem of designing a task-aware regularization to promote inductive bias. This inductive bias explains how the downstream task actually benefits from overparameterization, in contrast to prior works on few-shot learning. For (2), we develop a theory to explain how feature covariance can implicitly help reduce the sample complexity well below the degrees of freedom and lead to small estimation error. We then integrate these findings to obtain an overall performance guarantee for our meta-learning algorithm. Numerical experiments on real and synthetic data verify our insights on overparameterized meta-learning.

1 Introduction

In a multitude of machine learning (ML) tasks with limited data, it is crucial to build accurate models in a sample-efficient way. Constructing a simple yet informative representation of features is a critical component of learning a model that generalizes well to an unseen test set. The field of meta-learning dates back to [8, 4] and addresses this challenge by transferring insights across distinct but related tasks. Usually, the meta-learner first (1) learns a feature representation from previously seen tasks and then (2) uses this representation to succeed at an unseen task. The first phase is called representation learning and the second is called few-shot learning. Such information transfer between tasks is the backbone of modern transfer and multitask learning and finds ubiquitous applications in image classification [14], machine translation [6] and reinforcement learning [17].

Recent literature in ML theory has posited that overparameterization can be beneficial to generalization in traditional single-task setups for both regression [28, 38, 3, 32, 29] and classification [31, 30] problems. Empirical literature in deep learning suggests that overparameterization is of interest for both phases of meta-learning as well. Deep networks are stellar representation learners despite containing many more parameters than the sample size. Additionally, overparameterization is observed to be beneficial in the few-shot phase for transfer learning, as shown in Figure 1(a).
In this experiment, a ResNet-50 network pretrained on ImageNet is used to obtain a representation with $R$ features for classification on CIFAR-10. All layers except the final (softmax) layer are frozen and treated as a fixed feature map. We then train the final layer of the network for the downstream task, which yields a linear classifier on pretrained features. The figure plots the effect of increasing $R$ on the test error on CIFAR-10, for different choices of the few-shot training size $n_2$. For each choice of $n_2$, increasing $R$ beyond $n_2$ is seen to reduce the test error. These findings are corroborated by [17] (MAML) and [37], who successfully use a transfer-learning method that adapts a pre-trained model, with 112,980 parameters, to downstream tasks with only 1–5 new training samples.

Figure 1: Illustration of the benefit of overparameterization in the few-shot phase. (a) Double descent in transfer learning: few-shot test error vs. representation dimension for few-shot train sizes 20, 100 and 500; dashed lines indicate where the number of features $R$ exceeds the number of training points, i.e., the transition from under- to over-parameterization. The experimental details are contained in the supplement. (b) Illustration of the benefit of using the weighted min-$\ell_2$ interpolation of Definition 3, comparing theoretical and experimental few-shot test error with optimal weighting and with no weighting. See Remark 1 for details and discussion.
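As a concrete reference for this experiment, the following is a minimal PyTorch sketch of such a linear probe. It assumes the representation dimension $R$ is varied by simply keeping the first $R$ of the 2048 penultimate-layer features; the paper's exact protocol for varying $R$ is described in its supplement, so treat the details below (dataset path, epochs, learning rate) as illustrative.

```python
import torch
import torchvision
import torchvision.transforms as T

# Frozen pretrained ResNet-50 acts as a fixed feature map.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()          # expose 2048-dim penultimate features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

R, n2 = 512, 100                           # representation dim / few-shot train size
head = torch.nn.Linear(R, 10)              # the only trainable layer (linear probe)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

transform = T.Compose([
    T.Resize(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
cifar = torchvision.datasets.CIFAR10("data", train=True, download=True,
                                     transform=transform)
loader = torch.utils.data.DataLoader(torch.utils.data.Subset(cifar, range(n2)),
                                     batch_size=50)

for epoch in range(10):
    for xb, yb in loader:
        with torch.no_grad():
            feats = backbone(xb)[:, :R]    # keep the first R features (assumption)
        loss = torch.nn.functional.cross_entropy(head(feats), yb)
        opt.zero_grad(); loss.backward(); opt.step()
```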
In Figure 1(b), we consider a sequence of linear regression tasks and plot the few-shot error of our proposed projection- and eigen-weighting-based meta-learning algorithm for a fixed few-shot training size but varying feature dimensionality. The resulting curve looks similar to Figure 1(a) and suggests that the observations regarding overparameterization for meta-learning in neural networks can, to a good extent, be captured by linear models, motivating their detailed study. This aligns with trends in recent literature: while deep nets are nonlinear, recent advances show that linearized problems such as kernel regression (e.g., via the neural tangent kernel [20, 16, 23, 34, 12]) provide a good proxy for understanding some of the theoretical properties of practical overparameterized deep nets.

However, existing analyses of subspace-based meta-learning algorithms for both the representation learning and few-shot phases of linear models have typically focused on the classical underparameterized regime. These works (see Paragraphs 2–3 of Sec. 1.2) consider the case where representation learning involves projection onto a lower-dimensional subspace. On the other hand, recent work on double descent shows that an overparameterized interpolator can beat PCA-based methods. This motivates us to build upon these results to develop a theoretical understanding of overparameterized meta-learning.¹

(Footnote 1: The code for this paper is at https://github.com/sunyue93/Rep-Learning.)

1.1 Our contributions

This paper studies meta-learning when each task is a linear regression problem, similar in spirit to [36, 22]. In the representation learning phase, the learner is provided with training data from $T$ distinct tasks, with $n_1$ training samples per task; using this data, it selects a matrix $\Lambda \in \mathbb{R}^{d \times R}$ with arbitrary $R$ to obtain a linear representation of features via the map $x \mapsto \Lambda^\top x$. In the few-shot learning phase, the learner faces a new task with $n_2$ training samples and aims to use the representation $\Lambda^\top x$ to aid prediction performance. We highlight that obtaining the representation consists of two steps: the learner first projects $x$ onto $R$ basis directions, and then performs eigen-weighting of each of these directions, as shown in Figure 2(b). The overarching goal of this paper is to propose a scheme that uses the knowledge gained from earlier tasks to choose the $\Lambda$ that minimizes few-shot risk. This goal enables us to engage with important questions regarding overparameterization:

Q1: What should the size $R$ and the representation $\Lambda$ be to minimize risk at the few-shot phase?

Q2: Can we learn the $Rd$-dimensional representation $\Lambda$ with $N \ll Rd$ samples?

The answers to these questions will shed light on whether overparameterization is beneficial in few-shot learning and representation learning, respectively. Towards this goal, we make several contributions to the finite-sample understanding of linear meta-learning, under the assumptions discussed in Section 2. Our results are obtained for a general data/task model with arbitrary task covariance $\Sigma_T$ and feature covariance $\Sigma_F$, which allows for a rich set of observations.

Optimal representation for few-shot learning. As a stepping stone towards characterizing the few-shot risk for different $\Lambda$, in Section 3 we first consider learning with known covariances $\Sigma_T$ and $\Sigma_F$ (Algorithm 1). Compared to the projection-only representations of previous works (see Paragraphs 2–3 of Sec. 1.2), our scheme applies an eigen-weighting matrix $\Lambda$ that incentivizes the optimizer to place higher weight on promising eigen-directions. In the single-task case, this eigen-weighting procedure has been shown to be crucial for availing the benefits of overparameterization [5, 29, 32]: it captures an inductive bias that promotes certain features and demotes others. We show that the importance of eigen-weighting extends to the multi-task case as well.

Canonical task covariance. Our analysis in Section 3 also reveals that the optimal subspace and representation matrix are closed-form functions of the canonical task covariance $\widetilde\Sigma_T = \Sigma_F^{1/2}\Sigma_T\Sigma_F^{1/2}$, which captures feature saliency by summarizing the feature and task distributions.

| Notation | Meaning |
| --- | --- |
| $\Sigma_F$ | Feature covariance |
| $\Sigma_T$ | Task covariance |
| $\widetilde\Sigma_T$ | Canonical task covariance |
| $n_1$ | Samples per earlier task |
| $T$ | Number of earlier tasks |
| $N$ | Total sample size $T n_1$ |
| $n_2$ | Samples for the new task |
| $\Lambda$ | Eigen-weighting matrix |

Table 1: Main notation.

Representation learning. In practice, the task and feature covariances (and hence the canonical covariance) are rarely known a priori. However, we can estimate the principal subspace of the canonical task covariance $\widetilde\Sigma_T$ (which has $\Omega(Rd)$ degrees of freedom (DoF)) from data. In Section 4 we first present empirical evidence that the feature covariance $\Sigma_F$ is positively correlated with $\widetilde\Sigma_T$. We then propose an efficient algorithm based on the method of moments (MoM) and show that the sample complexity of representation learning falls well below $O(Rd)$ due to this inductive bias. Our sample complexity bound depends on interpretable quantities such as the effective ranks of $\Sigma_F$, $\widetilde\Sigma_T$ and improves over prior art (e.g., [22, 36]), even though the prior works were specialized to low-rank $\Sigma_T$ and identity $\Sigma_F$ (see Table 2).

End-to-end meta-learning guarantee. In Section 5, we consider the generalization of Section 3 where we have only estimates of the covariances instead of perfect knowledge.
This leads to an overall meta-learning guarantee in terms of $\Lambda^*$, $N$ and $n_2$, and uncovers a bias-variance tradeoff: as $N$ decreases, it becomes preferable to use a smaller $R$ (more bias, less variance) due to inaccurate estimates of the weak eigen-directions of $\widetilde\Sigma_T$. In other words, we find that overparameterization is only beneficial for few-shot learning if the quality of representation learning is sufficiently good. This explains why, in practice, increasing the representation dimension may not help reduce few-shot risk beyond a certain point (see Fig. 5).

1.2 Related work

Overparameterized ML and double descent. The phenomenon of double descent was first discovered by [5]. This paper and subsequent works on the topic [3, 32, 31, 29, 10] emphasize the importance of the right prior (sometimes referred to as inductive bias or regularization) for availing the benefits of overparameterization. However, an important question arises: where does this prior come from? Our work shows that the prior can come from the insights learned from related, previously seen tasks. Section 3 extends the ideas in [33, 38] to show how the optimal representation can be learned from imperfect covariance estimates as well.

Theory for representation learning. Recent papers [22, 21, 36, 15] establish theoretical bounds for representation learning when the tasks lie in an exactly $r$-dimensional subspace. [22, 21, 36] discuss method-of-moments estimators, and [36, 15] discuss matrix-factorized formulations. [36] shows that the number of samples that enables meaningful representation learning is $O(dr^2)$. [22, 21, 36] assume the features follow a standard normal distribution. We define a canonical covariance that handles arbitrary feature and task covariances. We also show that our estimator succeeds with $O(dr)$ samples when $n_1 \gtrsim r$, and extend the bound to general covariances via a notion of effective rank.

Figure 2: (a) Steps of the meta-learning algorithm. (b) Our representation-learning algorithm has two steps: projection and eigen-weighting. We focus on overparameterization with a weighting matrix (Def. 3), and compare this with overparameterization with a simple projection (no eigen-weighting) and with underparameterization (for which eigen-weighting has no impact and is equivalent to projection). [36, 22, 21, 15] study underparameterized projections only. To distinguish these from eigen-weighting, we refer to simple projections as subspace-based representations.

Subspace-based meta-learning. With tasks being low rank, [22, 21, 36, 18, 15] perform few-shot learning in a low-dimensional space. [39, 40] study meta-learning for linear bandits. [26] gives information-theoretic lower and upper bounds. [7] proposes subspace-based methods for nonlinear problems such as classification. We investigate a representation of arbitrary dimension, with particular interest in the overparameterized case, and show that it yields a smaller error under general task/feature covariances. The related work [15] provides results on overparameterized representation learning, but requires the number of samples per pre-training task to obey $n_1 \gtrsim d$, whereas our results apply as soon as $n_1 \geq 1$.

Mixed linear regression (MLR). In MLR [41, 24, 11], multiple linear regressions are solved, similar to representation learning. The difference is that the tasks are drawn from a finite set, the number of tasks can be larger than $d$, and the task matrix is not necessarily low rank. [25, 9, 27] propose sample complexity bounds for representation learning in mixed linear regression.
They can be combined with other structures such as binary task vectors [2] and sparse task vectors [1].

2 Problem Setup

The problem we consider consists of two phases:
1. Representation learning: prior tasks are used to learn a suitable representation for processing features.
2. Few-shot learning: a new task is learned with a few samples by using this representation.

This section defines the key notation and describes the data generation procedure for the two phases. In summary, we study linear regression tasks; the tasks and features are generated randomly, i.i.d. from their associated distributions $\mathcal{D}_T$ and $\mathcal{D}_F$, and the two phases share the same feature and task distributions. The setup is summarized in Figure 2(a).

2.1 Data generation

Definition 1 (Task and feature distributions). Throughout, $\mathcal{D}_T$ and $\mathcal{D}_F$ denote the distributions of the tasks $\beta_i$ and the features $x_{ij}$, respectively. These distributions are sub-Gaussian and zero-mean, with corresponding covariance matrices $\Sigma_T$ and $\Sigma_F$.

Definition 2 (Data distribution for a single task). Given a specific realization of the task vector $\beta \sim \mathcal{D}_T$, the corresponding label/input distribution $(y, x) \sim \mathcal{D}_\beta$ is obtained via $y = x^\top\beta + \varepsilon$, where $x \sim \mathcal{D}_F$ and $\varepsilon$ is zero-mean sub-Gaussian noise with variance $\sigma^2$.

Data for representation learning (Phase 1). We have $T$ tasks, each with $n_1$ training examples. The task vectors $(\beta_i)_{i=1}^T \subset \mathbb{R}^d$ are drawn i.i.d. from the distribution $\mathcal{D}_T$. The data for the $i$-th task is $(y_{ij}, x_{ij})_{j=1}^{n_1} \overset{\text{i.i.d.}}{\sim} \mathcal{D}_{\beta_i}$. In total, there are $N = T n_1$ examples.

Data for few-shot learning (Phase 2). Sample a task $\beta \sim \mathcal{D}_T$. The few-shot dataset has $n_2$ examples $(y_j, x_j)_{j=1}^{n_2} \overset{\text{i.i.d.}}{\sim} \mathcal{D}_\beta$. We use the representation learning data to learn a representation of the feature-task distribution, namely the eigen-weighting matrix $\Lambda$ of Definition 3 below. The matrix $\Lambda$ is passed to the few-shot learning stage, helping learn $\beta$ with few data.

2.2 Training in Phase 2

We now define a weighted representation, the eigen-weighting matrix, and show how it is applied in few-shot learning. The matrix is learned during representation learning using the data from the $T$ tasks. Denote by $X \in \mathbb{R}^{n_2 \times d}$ the matrix whose $j$-th row is $x_j$, and let $y = [y_1, \dots, y_{n_2}]^\top$. We are interested in studying the weighted $\ell_2$-norm interpolator defined below in the overparameterized regime $R \geq n_2$.

Definition 3 (Eigen-weighting matrix and weighted $\ell_2$-norm interpolator). Let the representation dimension be $R$, where $R$ is any integer between 1 and $d$. We define an eigen-weighting matrix $\Lambda \in \mathbb{R}^{d \times R}$ and the associated weighted $\ell_2$-norm interpolator
$$\hat\beta_\Lambda = \arg\min_\beta \|\Lambda^\dagger\beta\|_2 \quad \text{s.t.} \quad y = X\beta \ \text{ and } \ \beta \in \mathrm{range}(\Lambda).$$
The solution is equivalent to defining $\hat\alpha_\Lambda = \Lambda^\dagger\hat\beta_\Lambda$ and solving an unweighted minimum-$\ell_2$-norm regression with features $X\Lambda$. This corresponds to our few-shot learning problem
$$\hat\alpha_\Lambda = \arg\min_\alpha \|\alpha\|_2 \quad \text{s.t.} \quad y = X\Lambda\alpha,$$
from which we obtain $\hat\beta_\Lambda = \Lambda\hat\alpha_\Lambda$. When there is no confusion, we write $\hat\beta$ for $\hat\beta_\Lambda$. One can easily see that $\hat\beta = \Lambda(X\Lambda)^\dagger y$.

We note that Definition 3 is a special case of the weighted ridge regression discussed in [38], as stated in Observation 1. An alternative equivalence between min-norm interpolation and ridge regression can be found in [32].

Observation 1. Let $X \in \mathbb{R}^{n_2 \times d}$ and $y \in \mathbb{R}^{n_2}$, and define
$$\hat\beta_1 = \lim_{t \to 0} \ \arg\min_{\beta \,\in\, \mathrm{range}(\Lambda)} \|X\beta - y\|_2^2 + t\,\beta^\top(\Lambda\Lambda^\top)^\dagger\beta. \tag{2.1}$$
We have that $\hat\beta_1 = \hat\beta$.
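For concreteness, here is a minimal numpy sketch of the interpolator of Definition 3 via the closed form $\hat\beta = \Lambda(X\Lambda)^\dagger y$; the function name is ours.

```python
import numpy as np

def weighted_min_norm_interpolator(X, y, Lam):
    """Weighted l2-norm interpolator of Definition 3 (sketch).

    X: (n2, d) few-shot features; y: (n2,) labels; Lam: (d, R) eigen-weighting
    matrix. In the overparameterized regime R >= n2, y = X @ Lam @ alpha has
    interpolating solutions and pinv picks the minimum-norm one.
    """
    alpha = np.linalg.pinv(X @ Lam) @ y   # min-norm alpha with y = X Lam alpha
    return Lam @ alpha                    # beta_hat = Lam alpha in R^d
```

As a sanity check, with $\Lambda = I_d$ (so $R = d$) this reduces to the ordinary minimum-$\ell_2$-norm interpolator.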
3 Canonical Covariance and Optimal Representation

In this section, we ask the simpler question: if the covariances $\Sigma_T$ and $\Sigma_F$ are known, what is the best choice of $\Lambda$ to minimize the risk of the interpolator from Definition 3? In general, the covariances are not known; however, the insights from this section help us study the more general case in Section 5. Define the risk as the expected error of inferring the label on the few-shot task:
$$\mathrm{risk}(\Lambda, \Sigma_T, \Sigma_F) = \mathbb{E}_{x,y,\beta}\big(y - x^\top\hat\beta_\Lambda\big)^2 = \mathbb{E}_\beta\big(\hat\beta_\Lambda - \beta\big)^\top\Sigma_F\big(\hat\beta_\Lambda - \beta\big) + \sigma^2. \tag{3.1}$$
The natural optimization for choosing $\Lambda$ is to pick the weighting that minimizes the eventual risk of the learned interpolator:
$$\Lambda^* = \arg\min_{\Lambda' \in \mathbb{R}^{d \times R}} \mathrm{risk}(\Lambda', \Sigma_T, \Sigma_F). \tag{3.2}$$
Since the label $y$ is bilinear in $x$ and $\beta$, we introduce whitened features $\bar x = \Sigma_F^{-1/2}x$ and the associated task vector $\bar\beta = \Sigma_F^{1/2}\beta$. This change of variables ensures $\bar x^\top\bar\beta = x^\top\beta$; the task covariance in the transformed coordinates takes the form $\widetilde\Sigma_T = \Sigma_F^{1/2}\Sigma_T\Sigma_F^{1/2}$, which we call the canonical task covariance; it captures the joint behavior of the feature and task covariances $\Sigma_F$, $\Sigma_T$. Below, we observe that the risk in Equation (3.1) is invariant to this change of coordinates, i.e., it does not change when $\Sigma_F^{1/2}\Sigma_T\Sigma_F^{1/2}$ is fixed while $\Sigma_F$ and $\Sigma_T$ vary.

Algorithm 1: Constructing the optimal representation.
Require: projection dimension $R$, noise level $\sigma$, canonical covariance $\widetilde\Sigma_T$, feature covariance $\Sigma_F$, few-shot sample size $n_2$.
1: function COMPUTEOPTIMALREP($R, \Sigma_F, \widetilde\Sigma_T, \sigma, n_2$)
2:   $U_1, \Sigma_F^R, \widetilde\Sigma_T^R, \sigma_R \leftarrow$ COMPUTEREDUCTION($R, \Sigma_F, \widetilde\Sigma_T, \sigma$)
3:   Optimization: get $\theta^*$ from (OPT-REP).
4:   Map to eigenvalues: set the diagonal $\Lambda_R^* \in \mathbb{R}^{R \times R}$ with entries $\Lambda^*_{R,i} = (1/\theta^*_i - 1)^{-1/2}$.
5:   Lifting and feature whitening: $\Lambda^* \leftarrow U_1(\Sigma_F^R)^{-1/2}\Lambda_R^*$.
6:   return $\Lambda^*$
7: function COMPUTEREDUCTION($R, \Sigma_F, \widetilde\Sigma_T, \sigma$)
8:   Get the eigen-decomposition $\widetilde\Sigma_T = U\Sigma U^\top$.
9:   Principal eigenspace: $U_1 \in \mathbb{R}^{d \times R}$ = the first $R$ columns of $U$.
10:  Top eigenvalues: set $\widetilde\Sigma_T^R = U_1^\top\widetilde\Sigma_T U_1$, $\Sigma_F^R = U_1^\top\Sigma_F U_1$.
11:  Equivalent noise level: $\sigma_R^2 \leftarrow \sigma^2 + \mathrm{tr}(\widetilde\Sigma_T) - \mathrm{tr}(\widetilde\Sigma_T^R)$.
12:  return $U_1, \Sigma_F^R, \widetilde\Sigma_T^R, \sigma_R$

Observation 2 (Equivalence to a problem with whitened features). Let data be generated as in Phase 1. Denote $\widetilde\Sigma_T = \Sigma_F^{1/2}\Sigma_T\Sigma_F^{1/2}$. Then $\mathrm{risk}(\Sigma_F^{-1/2}\Lambda, \Sigma_T, \Sigma_F) = \mathrm{risk}(\Lambda, \widetilde\Sigma_T, I)$.

This observation can be easily verified by substituting the change of coordinates into Equation (3.1) and evaluating the risk. The risk in (3.1) quantifies the quality of a representation $\Lambda$; however, it is not a manageable function of $\Lambda$ that can be straightforwardly optimized. In this subsection, we show that it is asymptotically equivalent to a different optimization problem, which can be easily solved by analyzing the KKT optimality conditions. Theorem 1 characterizes this equivalence; the COMPUTEREDUCTION subroutine of Algorithm 1 calculates the key quantities used in specifying the reduction, and the COMPUTEOPTIMALREP subroutine uses the solution of the simpler problem to obtain a solution for the original one.

Assumption 1 (Bounded feature covariance). There exist positive constants $\Sigma_{\min}, \Sigma_{\max}$ such that $\Sigma_F$ is lower/upper bounded as $0 \prec \Sigma_{\min} I \preceq \Sigma_F \preceq \Sigma_{\max} I$.

Assumption 2 (Joint diagonalizability). $\Sigma_F$ and $\Sigma_T$ are diagonal matrices. (Footnote 2: This is equivalent to the more general scenario where $\Sigma_F$ and $\Sigma_T$ are jointly diagonalizable.)

Assumption 3 (Double asymptotic regime). We let the dimensions and the sample size grow as $d, R, n_2 \to \infty$ at fixed ratios $\kappa := d/n_2$ and $\bar\kappa := R/n_2$.

Assumption 4. The joint empirical distribution of the eigenvalues of $\Lambda_R$ and $\widetilde\Sigma_T^R$ is given by the average of Dirac deltas $\frac{1}{R}\sum_{i=1}^R \delta_{(\Lambda_{R,i},\, R\,\widetilde\Sigma^R_{T,i})}$; it converges to a fixed distribution as $d \to \infty$.

With these assumptions, we can derive an analytical expression that quantifies the risk of a representation $\Lambda$. We will then optimize this analytic expression to obtain a formula for the optimal representation.
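A direct numpy transcription of the COMPUTEREDUCTION subroutine of Algorithm 1 may help fix ideas; the function name and conventions (eigenvalues sorted in descending order) are ours.

```python
import numpy as np

def compute_reduction(R, Sigma_F, Sigma_T_canon, sigma):
    """Sketch of COMPUTEREDUCTION (Algorithm 1, lines 7-12).

    Projects onto the top-R eigenspace of the canonical task covariance and
    folds the truncated spectral tail into an equivalent noise level.
    """
    eigvals, U = np.linalg.eigh(Sigma_T_canon)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals, U = eigvals[order], U[:, order]
    U1 = U[:, :R]                                # principal eigenspace
    Sigma_T_R = U1.T @ Sigma_T_canon @ U1        # top-R canonical covariance
    Sigma_F_R = U1.T @ Sigma_F @ U1              # restricted feature covariance
    # Truncated task energy acts as extra label noise: sigma_R^2 = sigma^2 + gap
    sigma_R = np.sqrt(sigma**2 + eigvals.sum() - np.trace(Sigma_T_R))
    return U1, Sigma_F_R, Sigma_T_R, sigma_R
```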
Theorem 1 (Asymptotic risk equivalence). Suppose Assumptions 1, 2, 3 and 4 hold. Let $\xi > 0$ be the unique number obeying $n_2 = \sum_{i=1}^R \big(1 + (\xi\Lambda_i^2)^{-1}\big)^{-1}$. Define $\theta \in \mathbb{R}^R$ with entries $\theta_i = \frac{\xi\Lambda_i^2}{1 + \xi\Lambda_i^2}$, and calculate $\widetilde\Sigma_T^R, \sigma_R$ using the COMPUTEREDUCTION procedure of Algorithm 1. Then define the analytic risk formula
$$f(\theta, \widetilde\Sigma_T^R, n_2) = \frac{1}{n_2 - \|\theta\|_2^2}\left[\sum_{i=1}^R (1 - \theta_i)^2\, \widetilde\Sigma^R_{T,i} + \big(\|\theta\|_2^2 + 1\big)\sigma_R^2\right]. \tag{3.3}$$
We have that
$$\lim_{n_2 \to \infty} f(\theta, \widetilde\Sigma_T^R, n_2) = \lim_{n_2 \to \infty} \mathrm{risk}(\Sigma_F^{-1/2}\Lambda, \Sigma_T, \Sigma_F). \tag{3.4}$$

The proof of Theorem 1 applies the convex Gaussian min-max theorem (CGMT) of [35] and can be found in Appendix B.2. We show that as the dimension grows, the distribution of the estimator $\hat\beta$ converges to a Gaussian distribution, which lets us calculate the expectation of the risk.

Theorem 1 provides us with a closed-form risk for any linear representation. One can then solve for the optimal representation by computing (OPT-REP) below. To do so, we propose an algorithm for the optimization problem in Appendix B.5 via a study of its KKT conditions:³
$$\theta^* = \arg\min_\theta f(\theta, \widetilde\Sigma_T^R, n_2), \quad \text{s.t.} \quad 0 \le \theta_i < 1, \ \ \sum_{i=1}^R \theta_i = n_2. \tag{OPT-REP}$$

(Footnote 3: In Sec. 5 the constraint is $\bar\theta \le \theta_i \le 1 - \frac{d - n_2}{n_2}\bar\theta$ for robustness concerns.)

The optimal representation is $\Lambda^*_{R,i} = \big((1/\theta^*_i - 1)\xi\big)^{-1/2}$.⁴ The subroutine COMPUTEOPTIMALREP in Algorithm 1 summarizes this procedure.

(Footnote 4: In the algorithm, $\xi = 1$ and $\Lambda^*_{R,i} = (1/\theta^*_i - 1)^{-1/2}$, because $c\Lambda$ for any constant $c$ gives the same $\hat\beta$.)

Figure 3: Theoretical risk of the optimal representation vs. representation dimension, with optimal weighting and with no weighting, for $\iota \in \{0.5, 0.1, 0.05\}$. Here $\Sigma_F = I_{100}$, $\Sigma_T = \mathrm{diag}(I_{20}, \iota I_{80})$, $n_2 = 40$.

Remark 1. Thm. 1 states that $\mathrm{risk}(\Sigma_F^{-1/2}\Lambda, \Sigma_T, \Sigma_F)$ can be arbitrarily well approximated by $f(\theta, \widetilde\Sigma_T^R, n_2)$ if $n_2$ is sufficiently large. In Fig. 1(b), we set $\Sigma_F = I_{100}$, $\Sigma_T = \mathrm{diag}(I_{20}, 0.1 I_{80})$, $n_2 = 40$. The curves in Fig. 1(b) are the finite-dimensional approximation of $f$ (the LHS of (3.4)); the dots are empirical approximations of the risk (the RHS of (3.4)). We tested two cases: $\Lambda$ equal to the optimal eigen-weighting, and a projection matrix with no weighting. Our theorem is corroborated by the observation that the dots and curves are visibly very close; the approximation is already accurate for the finite-dimensional problem with just $n_2 = 40$.

The benefit of overparameterization. Theorem 1 leads to an optimal eigen-weighting strategy via asymptotic analysis. In Figure 3, we plot the effect on the risk of increasing $R$ for different shapes of the task covariance; the parameter $\iota$ controls how spiked $\Sigma_T$ is, with a smaller $\iota$ indicating increased spikedness. In the underparameterized regime, the weighting has no impact on the risk. In the overparameterized regime, the eigen-weighted learner achieves lower few-shot error than its unweighted ($\Lambda = I$) counterpart, showing that eigen-weighting becomes critical. The eigen-weighting procedure introduces an inductive bias during few-shot learning, and helps explain how the optimal representation minimizing the few-shot risk can be overparameterized with $R > n_2$. We note that an $R$-dimensional representation can be recovered by a $d$-dimensional representation matrix of rank $R$, so the underparameterized case can never beat the $d$-dimensional case in theory. The error with optimal eigen-weighting in the overparameterized regime is smaller than its underparameterized counterpart, and the error is lower for smaller $\iota$: as $\widetilde\Sigma_T$ gets closer to low rank, the excess error caused by choosing a small dimension $R$ (equal to the gap $\sigma_R^2 - \sigma^2$ in Algorithm 1) becomes less significant. Low-dimensional representations zero out features and cause bias; by contrast, when $\widetilde\Sigma_T \in \mathbb{R}^{d \times d}$ is not low rank, every feature contributes to learning, with the importance of the features reflected by the weights. This viewpoint is similar in spirit to that of [19], where the authors devise a misspecified linear regression to demonstrate the benefits of overparameterization. Our algorithm allows an arbitrary representation dimension $R$ and eigen-weighting.
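To make the procedure tangible, below is a numerical sketch that evaluates the analytic risk $f$ of (3.3) and solves (OPT-REP) with a generic constrained solver, a simple but slower stand-in for the KKT-based algorithm of Appendix B.5; the $\theta^* \to \Lambda^*$ map is step 4 of Algorithm 1 with $\xi = 1$. Function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def analytic_risk(theta, s_tilde, sigma_R, n2):
    """Analytic few-shot risk f(theta) of Theorem 1 / Eq. (3.3)."""
    t2 = np.sum(theta ** 2)
    return (np.sum((1.0 - theta) ** 2 * s_tilde)
            + (t2 + 1.0) * sigma_R ** 2) / (n2 - t2)

def solve_opt_rep(s_tilde, sigma_R, n2, eps=1e-6):
    """Solve (OPT-REP) numerically: min f s.t. 0 <= theta_i < 1, sum(theta) = n2.

    s_tilde: eigenvalues of the reduced canonical covariance (length R > n2).
    """
    R = len(s_tilde)
    theta0 = np.full(R, n2 / R)     # feasible start; needs R > n2 so theta0 < 1
    res = minimize(analytic_risk, theta0, args=(s_tilde, sigma_R, n2),
                   method="SLSQP", bounds=[(0.0, 1.0 - eps)] * R,
                   constraints=[{"type": "eq", "fun": lambda t: t.sum() - n2}])
    return res.x

def eigen_weights(theta_star):
    """Step 4 of Algorithm 1 with xi = 1: Lambda*_{R,i} = (1/theta_i - 1)^{-1/2}."""
    return np.diag((1.0 / theta_star - 1.0) ** -0.5)
```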
4 Representation Learning

In this section, we show how to estimate, during the representation learning phase, the quantities that enable us to calculate the eigen-weighting matrix $\Lambda^*$. Note that $\Lambda^*$ depends on the canonical covariance $\widetilde\Sigma_T = \Sigma_F^{1/2}\Sigma_T\Sigma_F^{1/2}$; learning the $R$-dimensional principal subspace of $\widetilde\Sigma_T$ enables us to calculate $\Lambda^*$.⁵ Denote this subspace by $S_T$.

(Footnote 5: We also need to estimate $\Sigma_F$ for whitening. Estimating $\Sigma_F$ is rather easy and incurs smaller error compared to $\widetilde\Sigma_T$; the analysis is provided in the first part of Appendix B.)

Subspace estimation vs. inductive bias. The subspace-based representation $S_T$ has $Rd$ degrees of freedom. When $\widetilde\Sigma_T$ is exactly rank $R$ and the features are whitened, [36] provides a sample-complexity lower bound of $\Omega(Rd)$ examples and gives an algorithm achieving $O(R^2 d)$ samples. However, in practice, deep nets learn good representations despite overparameterization. In this section, recalling our Q2, we argue that the inductive bias of the feature distribution can implicitly accelerate learning the canonical covariance. This differentiates our results from most prior works such as [22, 21, 36] in two respects:
1. Rather than focusing on a low-dimensional subspace and assuming $N \gtrsim Rd$, we can estimate $\widetilde\Sigma_T$ or $S_T$ in the overparameterized regime $N \ll Rd$.
2. Rather than assuming whitened features $\Sigma_F = I$ and achieving a sample complexity of $R^2 d$, our learning guarantee holds for arbitrary covariance matrices $\Sigma_F, \Sigma_T$. The sample complexity depends on effective ranks and can be arbitrarily smaller than the DoF. We showcase our bounds via a spiked covariance setting in Example 1 below.

For learning $\widetilde\Sigma_T$ or its subspace $S_T$, we investigate the method-of-moments (MoM) estimator.

Definition 4 (MoM estimator). Let $n_1$ be even. For $1 \le i \le T$, define
$$\hat b_{i,1} = \frac{2}{n_1}\sum_{j=1}^{n_1/2} y_{ij}x_{ij}, \qquad \hat b_{i,2} = \frac{2}{n_1}\sum_{j=n_1/2+1}^{n_1} y_{ij}x_{ij}, \qquad \hat M = \frac{1}{2T}\sum_{i=1}^T \big(\hat b_{i,1}\hat b_{i,2}^\top + \hat b_{i,2}\hat b_{i,1}^\top\big).$$
The expectation of $\hat M$ is equal to $M = \Sigma_F\Sigma_T\Sigma_F$.
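A numpy sketch of Definition 4 (the function name is ours):

```python
import numpy as np

def mom_estimator(X_tasks, y_tasks):
    """Sketch of the MoM estimator of Definition 4.

    X_tasks: list of (n1, d) arrays; y_tasks: list of (n1,) arrays, one pair
    per task; n1 assumed even. Returns M_hat, whose population counterpart is
    M = Sigma_F Sigma_T Sigma_F.
    """
    T = len(X_tasks)
    d = X_tasks[0].shape[1]
    M_hat = np.zeros((d, d))
    for X, y in zip(X_tasks, y_tasks):
        n1 = X.shape[0]
        h = n1 // 2
        # Moment estimates of Sigma_F * beta_i from two independent halves
        b1 = (2.0 / n1) * X[:h].T @ y[:h]
        b2 = (2.0 / n1) * X[h:].T @ y[h:]
        # Symmetrized cross term of the two halves
        M_hat += np.outer(b1, b2) + np.outer(b2, b1)
    return M_hat / (2 * T)
```

Splitting each task's data into two halves makes $\hat b_{i,1}$ and $\hat b_{i,2}$ independent given $\beta_i$, so the symmetrized cross term is an unbiased estimate of $\Sigma_F\beta_i\beta_i^\top\Sigma_F$.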
Inductive bias in representation learning. Recall that the canonical covariance $\widetilde\Sigma_T = \Sigma_F^{1/2}\Sigma_T\Sigma_F^{1/2}$ is the attribute of interest. However, the feature covariance implicitly modulates the estimation procedure because the population MoM matrix is not $\widetilde\Sigma_T$ but $M = \Sigma_F\Sigma_T\Sigma_F$. For instance, when estimating the principal canonical subspace $S_T$, the degree of alignment between $\Sigma_F$ and $\widetilde\Sigma_T$ can make or break the estimation procedure: if $\Sigma_F$ and $\widetilde\Sigma_T$ have well-aligned principal subspaces, $S_T$ will be easier to estimate, since $\Sigma_F$ amplifies the $S_T$ directions within $M$. We verify this inductive bias on a practical image dataset, reported in Appendix A. We assessed the correlation between the covariances $\widetilde\Sigma_T$, $\Sigma_F$ via the canonical-feature alignment score, defined as the correlation coefficient
$$\rho(\Sigma_F, \widetilde\Sigma_T) := \frac{\langle \Sigma_F, \widetilde\Sigma_T\rangle}{\|\Sigma_F\|_F\,\|\widetilde\Sigma_T\|_F} = \frac{\mathrm{trace}(M)}{\|\Sigma_F\|_F\,\|\widetilde\Sigma_T\|_F}.$$
Observe that the MoM matrix $M$ naturally shows up in the alignment definition because the inner product of $\widetilde\Sigma_T$ and $\Sigma_F$ is equal to $\mathrm{trace}(M)$; this further supports our inductive-bias intuition. As a reference, we compared it to the canonical-identity alignment, defined as $\mathrm{trace}(\widetilde\Sigma_T)/(\sqrt{d}\,\|\widetilde\Sigma_T\|_F)$ (replacing $\Sigma_F$ with $I$). The canonical-feature alignment score is higher than the canonical-identity alignment score. This significant score difference exemplifies how $\Sigma_F$ and $\widetilde\Sigma_T$ can synergistically align with each other (inductive bias). This alignment helps the MoM estimator, as illustrated by Example 1 (spiked covariance) below.

In the following subsections, let $N = n_1 T$ be the total sample size in the representation-learning phase. Let $r_F = \mathrm{tr}(\Sigma_F)$, $r_T = \mathrm{tr}(\Sigma_T)$, and $\tilde r_T = \mathrm{tr}(\widetilde\Sigma_T)$. Define the approximate low-rankness measure of the feature covariance by⁶
$$\bar s_F = \min s_F \quad \text{s.t.} \quad s_F \in \{1, \dots, d\}, \ \ \lambda_{s_F+1}(\Sigma_F) \le s_F/d.$$

(Footnote 6: The $(s_F+1)$-th eigenvalue is smaller than $s_F/d$; note that the top eigenvalue is 1.)

We have two results for this estimator. (1) In general, we can estimate $M$ with $O(r_F \tilde r_T^2)$ samples. (2) If $n_1 \gtrsim \tilde r_T$, we can estimate $M$ with $O(\bar s_F \tilde r_T)$ samples. [36] has sample complexity $O(dr^2)$, where $r$ is the exact rank; our sample complexity is $O(r_F \tilde r_T^2)$, where $r_F, \tilde r_T$ can be seen as effective ranks, so our bounds are always smaller than those of [36] (we discuss this in Example 1). Our second result says that when $n_1 \gtrsim \tilde r_T$, our sample complexity achieves the $O(dr)$ rate proven to be a lower bound in [36].

| estimator | sample $N$ | sample $n_1$ | error | sample $N$ | sample $n_1$ | error |
| --- | --- | --- | --- | --- | --- | --- |
| MoM | $d s_T^2$ | $1$ | $(d s_T^2/N)^{1/2}$ | $r_F r_T^2$ | $1$ | $(r_F r_T^2/N)^{1/2}$ |
| MoM | $d s_T$ | $s_T$ | $(s_T/n_1)^{1/2}$ | $r_F r_T$ | $r_T$ | $(r_T/n_1)^{1/2}$ |

Table 2: Sample complexity and error of the MoM estimator. Left side (columns 2–4): the well-studied setting of identity feature covariance and low-rank task covariance, $\Sigma_F = I$, $\Sigma_T = \mathrm{diag}(I_{s_T}, 0)$. Right side (columns 5–7): the spiked setting $\Sigma_F = \mathrm{diag}(I_{s_F}, \iota_F I_{d-s_F})$, $\Sigma_T = \mathrm{diag}(I_{s_T}, \iota_T I_{d-s_T})$, where $s_F$ ($s_T$) is the dimension of the principal eigenspace of the feature (task) covariance and $r_F = s_F + \iota_F(d - s_F)$, $r_T = s_T + \iota_T(d - s_T)$ are the effective ranks. Our bound in the second row is the first result to achieve the optimal sample complexity of $O(d s_T)$ (cf. [36, 22]).

Theorem 2. Let data be generated as in Phase 1, and assume $\|\Sigma_F\| = \|\widetilde\Sigma_T\| = 1$ for normalization.⁷
1. Let $n_1$ be an even number. Then with probability at least $1 - N^{-100}$,
$$\|\hat M - M\| \lesssim (\tilde r_T + \sigma^2)\sqrt{r_F/N}.$$
2. Assume $T \gtrsim \bar s_F$. If $n_1 \gtrsim \tilde r_T + \sigma^2$, then with probability at least $1 - CT^{-100}$,
$$\|\hat M - M\| \lesssim \big((\tilde r_T + \sigma^2)/n_1\big)^{1/2}.$$

(Footnote 7: This is simply equivalent to scaling $y_{ij}$, which does not affect the normalized error $\|\hat M - M\|/\|M\|$. In the appendix we define $S = \max\{\|\Sigma_F\|, \|\widetilde\Sigma_T\|\}$ and prove the theorem for general $S$.)

Denote the top-$R$ principal subspaces of $M$, $\hat M$ by $M_{\mathrm{top}}$, $\hat M_{\mathrm{top}}$, and assume the eigen-gap condition $\lambda_R(M) - \lambda_{R+1}(M) > 2\|\hat M - M\|$. Then a direct application of the Davis-Kahan theorem [13] bounds the subspace angle as
$$\mathrm{angle}(M_{\mathrm{top}}, \hat M_{\mathrm{top}}) \lesssim \|\hat M - M\|/\big(\lambda_R(M) - \lambda_{R+1}(M)\big).$$
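The subspace angle above can be computed directly: the principal angles between two subspaces are the arccosines of the singular values of $U_1^\top\hat U_1$, where $U_1, \hat U_1$ are orthonormal bases. A small sketch (helper names are ours):

```python
import numpy as np

def top_eigenspace(M, R):
    """Orthonormal basis of the top-R eigenspace of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:R]]

def principal_angle(A_top, B_top):
    """Largest principal angle (radians) between two subspaces, given
    orthonormal basis matrices A_top, B_top of shape (d, R)."""
    # Singular values of A^T B are the cosines of the principal angles
    s = np.linalg.svd(A_top.T @ B_top, compute_uv=False)
    return np.arccos(np.clip(s.min(), -1.0, 1.0))
```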
Estimating the eigenspace of the canonical covariance. Note that if $\Sigma_F$ and $\widetilde\Sigma_T$ are aligned (e.g., Example 1 below with $s_F = s_T = R$), then $M_{\mathrm{top}} = S_T$ is exactly the principal subspace of $\widetilde\Sigma_T$, so Theorem 2 indeed gives an estimation error for the principal subspace of $\widetilde\Sigma_T$. Note that such alignment is a much more general requirement than the whitened features assumed by related works [36, 22].

Example 1 (Spiked $\Sigma_T$, aligned principal subspaces). Suppose the spectra of $\Sigma_F$ and $\Sigma_T$ are bimodal as follows: $\Sigma_F = \mathrm{diag}(I_{s_F}, \iota_F I_{d-s_F})$, $\Sigma_T = \mathrm{diag}(I_{s_T}, \iota_T I_{d-s_T})$. Set the statistical error $\mathrm{Err}_{T,N} := \sqrt{\tilde r_T^2\, r_F/N} + \sqrt{\tilde r_T/T}$. When $\iota_T, \iota_F < 1$ and $s_F \ge s_T$, the recovery errors of $\widetilde\Sigma_T$ and its principal subspace $S_T$ are bounded as
$$\mathrm{angle}(\hat M_{\mathrm{top}}, S_T) \lesssim \mathrm{Err}_{T,N} + \iota_F^2\iota_T \quad \text{and} \quad \|\hat M - \widetilde\Sigma_T\| \lesssim \mathrm{Err}_{T,N} + \iota_F\iota_T.$$

The estimation errors for $\widetilde\Sigma_T$, $S_T$ are controlled in terms of the effective ranks and the spectrum tails $\iota_F, \iota_T$. Typically $s_F s_T \gg n_1$, so the $\sqrt{\tilde r_T^2\, r_F/N}$ term dominates the statistical error in practice. In Fig. 4 we plot the error of estimating $M$ (whose principal subspace coincides with that of $\widetilde\Sigma_T$), with $\Sigma_F = \mathrm{diag}(I_{30}, \iota I_{70})$, $\Sigma_T = \mathrm{diag}(I_{30}, 0_{70})$ and $T = N = 100$; we can see that the error increases with $\iota$.

Figure 4: Error of the MoM estimator as a function of $\iota$.

5 Robustness of Optimal Representation and Overall Meta-Learning Bound

In Section 3, we described the algorithm for computing the optimal representation with known distributions of features and tasks. In Section 4, we proposed the MoM estimator, which estimates the unknown covariance matrices in the representation learning phase. In this section, we study the algorithm's behavior when we calculate $\Lambda^*$ using the estimated canonical covariance, rather than the full-information setting of Section 3. Armed with the provably reliable estimators of Section 4, we can replace $\widetilde\Sigma_T$ and $\Sigma_F$ in Algorithm 1 with our estimators. We then ask: how does the covariance-estimation error of the representation learning stage affect the downstream few-shot risk? That is, we are interested in $\mathrm{risk}(\Lambda, \Sigma_T, \Sigma_F) - \mathrm{risk}(\Lambda^*, \Sigma_T, \Sigma_F)$.⁸

(Footnote 8: Note that Sec. 6 of [38] gives the exact value of $\mathrm{risk}(\Lambda^*, \Sigma_T, \Sigma_F)$, so we have an end-to-end error guarantee.)

Let us replace the constraint in (OPT-REP) by $\bar\theta \le \theta_i \le 1 - \frac{d - n_2}{n_2}\bar\theta$; this changes the "optimization" step in Algorithm 1. Theorem 3 does not require an explicit computation of the optimal representation under the constraint $\bar\theta$. Instead, it uses the robustness of such a representation (due to its well-conditioned nature) to deduce its stability. That said, for practical computation of the optimal representation, we simply use Algorithm 1; one can then evaluate $\bar\theta$ after the fact, as the minimum singular value of this representation, to apply Theorem 3 without assuming an explicit $\bar\theta$.

Let $\Lambda_{\bar\theta}(R) = \mathrm{COMPUTEOPTIMALREP}(R, \Sigma_F, \hat M, \sigma, n_2)$ denote the estimated optimal representation and $\Lambda^*_{\bar\theta}(R) = \mathrm{COMPUTEOPTIMALREP}(R, \Sigma_F, \widetilde\Sigma_T, \sigma, n_2)$ denote the true optimal representation, which cannot be accessed in practice. Below we present the bound for the whole meta-learning algorithm. It shows that a bounded error in representation learning leads to a bounded increase in the downstream few-shot learning risk, thus quantifying the robustness of few-shot learning to errors in the covariance estimates.

Theorem 3. Let $\Lambda_{\bar\theta}(R)$, $\Lambda^*_{\bar\theta}(R)$ be as defined above, and let $r_F = \mathrm{tr}(\Sigma_F)$, $r_T = \mathrm{tr}(\Sigma_T)$, $\tilde r_T = \mathrm{tr}(\widetilde\Sigma_T)$. The risk of the meta-learning algorithm satisfies⁹
$$\mathrm{risk}(\Lambda_{\bar\theta}(R), \Sigma_T, \Sigma_F) - \mathrm{risk}(\Lambda^*_{\bar\theta}(R), \Sigma_T, \Sigma_F) \ \lesssim\ \frac{n_2^2}{d(R - n_2)(2n_2 - R\bar\theta)\bar\theta}\left[(\tilde r_T + \sigma^2)\sqrt{r_F/N}\right].$$

(Footnote 9: The bracketed expression uses the first conclusion of Theorem 2; one can plug in the second as well.)

Figure 5: End-to-end learning guarantees: few-shot test error vs. representation dimension for $n_1 \in \{100, 200, 1000\}$ and for perfect covariance knowledge. Here $d = 100$, $n_2 = 40$, $T = 200$, $\Sigma_T = \mathrm{diag}(I_{20}, 0.05\, I_{80})$, $\Sigma_F = I_{100}$.

Notice that as the number of previous tasks $T$, and hence the total number of representation-learning samples $N$, increases, the risk of the estimated $\Lambda_{\bar\theta}(R)$ approaches that of the optimal $\Lambda^*_{\bar\theta}(R)$, as we expect. The result only applies to the overparameterized regime of interest, $R > n_2$. The expression of the risk in the underparameterized case is different, and is covered by the second case of Equation (4.4) in [38]. We plot it in Fig. 1(b), on the left side of the peak, as a comparison.
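Putting the pieces together, a sketch of the full pipeline composes the helper sketches above (mom_estimator, compute_reduction, solve_opt_rep, eigen_weights, weighted_min_norm_interpolator, all assumed in scope); for simplicity it assumes the features are already whitened, $\Sigma_F = I$, so that $\hat M$ directly estimates $\widetilde\Sigma_T$.

```python
import numpy as np

def psd_inv_sqrt(A):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(np.clip(vals, 1e-12, None) ** -0.5) @ vecs.T

def meta_learn(X_tasks, y_tasks, X_new, y_new, R, sigma):
    """Phase 1 (MoM) -> Algorithm 1 -> Phase 2 (Definition 3), end to end."""
    n2, d = X_new.shape
    M_hat = mom_estimator(X_tasks, y_tasks)      # estimate of Sigma_F Sigma_T Sigma_F
    Sigma_F = np.eye(d)                          # whitened-features assumption
    U1, Sigma_F_R, Sigma_T_R, sigma_R = compute_reduction(R, Sigma_F, M_hat, sigma)
    theta = solve_opt_rep(np.diag(Sigma_T_R), sigma_R, n2)      # (OPT-REP)
    Lam = U1 @ psd_inv_sqrt(Sigma_F_R) @ eigen_weights(theta)   # Algorithm 1, step 5
    return weighted_min_norm_interpolator(X_new, y_new, Lam)    # few-shot estimate
```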
Risk with respect to the PCA level R. In Fig. 5, we plot the error of the whole meta-learning algorithm: we simulate representation learning to get $\hat M$, use it to compute $\Lambda$, and plot the theoretical downstream risk (the experiments match; see Fig. 1(b)). Mainly, we compare the behavior of Theorem 3 for different $R$. When $R$ grows, we search for $\Lambda$ in a larger space; the optimal $\Lambda$ over a feasible subset is never better than that over a larger superset, so this risk decreases as $R$ increases. At the same time, the representation learning error increases with $R$, since we must fit a matrix in a larger space. In essence, this result provides a theoretical justification for a sweet spot in the optimal representation: $d = R$ is optimal when $N = \infty$, i.e., when the representation learning error is 0. As $N$ decreases, there is a tradeoff between the learning error and truncating small eigenvalues. Thus, choosing $R$ adaptively with $N$ can strike the right bias-variance tradeoff between the excess risk (variance) and the risk due to a suboptimal representation.

6 Conclusion

In this paper, we study the sample efficiency of meta-learning with linear representations. We show that the optimal representation is typically overparameterized and outperforms subspace-based representations for general data distributions. We refine the sample complexity analysis for learning arbitrary distributions and show the importance of the inductive bias of features and tasks. Finally, we provide an end-to-end bound for the meta-learning algorithm, showing the tradeoff between choosing a larger representation dimension and robustness against representation learning error.

Acknowledgements

This work is supported in part by the NSF TRIPODS II grant DMS 2023166, NSF TRIPODS CCF 1740551, NSF CCF-2046816, NSF CCF 2007036, Army Research Office grant W911NF-21-1-0312, and the Moorthy Professorship at UW ECE.

References

[1] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
[2] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. Efficient representations for lifelong learning and autoencoding. In Conference on Learning Theory, pages 191–210, 2015.
[3] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
[4] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
[5] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
[6] Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, 2014.
[7] Quentin Bouniot, Ievgen Redko, Romaric Audigier, Angélique Loesch, Yevhenii Zotkin, and Amaury Habrard. Towards better understanding meta-learning methods through multi-task representation learning theory. arXiv preprint arXiv:2010.01992, 2020.
[8] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[9] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. The Journal of Machine Learning Research, 11:2901–2934, 2010.
[10] Xiangyu Chang, Yingcong Li, Samet Oymak, and Christos Thrampoulidis. Provable benefits of overparameterization in model compression: From double descent to pruning neural networks. arXiv preprint arXiv:2012.08749, 2020.
[11] Sitan Chen, Jerry Li, and Zhao Song. Learning mixtures of linear regressions in subexponential time via Fourier moments. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 587–600, 2020.
[12] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
[13] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[15] Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
[16] Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2018.
[17] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.
[18] Halil Ibrahim Gulluk, Yue Sun, Samet Oymak, and Maryam Fazel. Sample efficient subspace-based representations for nonlinear meta-learning. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3685–3689. IEEE, 2021.
[19] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
[20] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
[21] Weihao Kong, Raghav Somani, Sham Kakade, and Sewoong Oh. Robust meta-learning for mixed linear regression with small batches. arXiv preprint arXiv:2006.09702, 2020.
[22] Weihao Kong, Raghav Somani, Zhao Song, Sham Kakade, and Sewoong Oh. Meta-learning for mixed linear regression. arXiv preprint arXiv:2002.08936, 2020.
[23] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in Neural Information Processing Systems, 32:8572–8583, 2019.
[24] Yuanzhi Li and Yingyu Liang. Learning mixtures of linear regressions with nearly optimal complexity. In Conference On Learning Theory, pages 1125–1144, 2018.
[25] Karim Lounici, Massimiliano Pontil, Sara van de Geer, and Alexandre B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164–2204, 2011.
[26] James Lucas, Mengye Ren, Irene Kameni, Toniann Pitassi, and Richard Zemel. Theoretical bounds on estimation error for meta-learning. arXiv preprint arXiv:2010.07140, 2020.
[27] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
[28] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
[29] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime. arXiv preprint arXiv:1911.01544, 2019.
[30] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime, 2020.
[31] Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai. Classification vs regression in overparameterized regimes: Does the loss function matter? arXiv preprint arXiv:2005.08054, 2020.
[32] Vidya Muthukumar, Kailas Vodrahalli, and Anant Sahai. Harmless interpolation of noisy data in regression. CoRR, abs/1903.09139, 2019.
[33] Preetum Nakkiran, Prayaag Venkat, Sham Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. arXiv preprint arXiv:2003.01897, 2020.
[34] Samet Oymak, Zalan Fabian, Mingchen Li, and Mahdi Soltanolkotabi. Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian. ICML Workshop on Understanding and Improving Generalization in Deep Learning, 2019.
[35] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. LASSO with non-linear measurements is equivalent to one with linear measurements. Advances in Neural Information Processing Systems, 28:3420–3428, 2015.
[36] Nilesh Tripuraneni, Chi Jin, and Michael I. Jordan. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020.
[37] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
[38] Denny Wu and Ji Xu. On the optimal weighted ℓ2 regularization in overparameterized linear regression, 2020.
[39] Jiaqi Yang, Wei Hu, Jason D. Lee, and Simon S. Du. Provable benefits of representation learning in linear bandits. arXiv preprint arXiv:2010.06531, 2020.
[40] Jiaqi Yang, Wei Hu, Jason D. Lee, and Simon S. Du. Impact of representation learning in linear bandits. In International Conference on Learning Representations, 2021.
[41] Kai Zhong, Prateek Jain, and Inderjit S. Dhillon. Mixed linear regression with multiple components. In Advances in Neural Information Processing Systems, pages 2190–2198, 2016.