# How Fine-Tuning Allows for Effective Meta-Learning

Kurtland Chua (Princeton University, kchua@princeton.edu), Qi Lei (Princeton University, qilei@princeton.edu), Jason D. Lee (Princeton University, jasonlee@princeton.edu)

Abstract

Representation learning has served as a key tool for meta-learning, enabling rapid learning of new tasks. Recent works like MAML learn task-specific representations by finding an initial representation requiring minimal per-task adaptation (i.e. a fine-tuning-based objective). We present a theoretical framework for analyzing a MAML-like algorithm, assuming all available tasks require approximately the same representation. We then provide risk bounds on predictors found by fine-tuning via gradient descent, demonstrating that the method provably leverages the shared structure. We illustrate these bounds in the logistic regression and neural network settings. In contrast, we establish settings where learning one representation for all tasks (i.e. using a frozen representation objective) fails. Notably, any such algorithm cannot outperform directly learning the target task with no other information, in the worst case. This separation underscores the benefit of fine-tuning-based over frozen representation objectives in few-shot learning.

1 Introduction

Meta-learning (Thrun & Pratt, 2012) has emerged as an essential tool for adapting prior knowledge to new tasks under data and computational constraints. In this context, a meta-learner has access to related source tasks from a shared environment. The learner aims to uncover inductive biases from the source tasks to reduce the sample/computational complexity of learning new tasks from the same environment. A common approach is representation learning (Bengio et al., 2013), i.e. learning a feature extractor from the source tasks. At test time, a learner adapts to a new task by fine-tuning the representation and retraining the final layer(s) (see, e.g., prototype networks (Snell et al., 2017)). Substantial improvements over directly learning a single task have been shown in few-shot learning (Antoniou et al., 2018), a natural setting in many applications including reinforcement learning (Mendonca et al., 2019; Finn et al., 2017), computer vision (Nichol et al., 2018), federated learning (McMahan et al., 2017), and robotics (Al-Shedivat et al., 2017).

The empirical success of representation learning has led to an increased interest in theoretical analyses of the underlying phenomena. Recent work assumes an explicitly shared representation across tasks (Du et al., 2020; Tripuraneni et al., 2020b,a; Saunshi et al., 2020; Balcan et al., 2015). For instance, Du et al. (2020) prove a generalization bound consisting of a representation error term and an estimation error term. However, without fine-tuning the whole network, the representation error propagates to the target task and is irreducible, even with an infinitely large target task dataset. This result is consistent with empirical findings, which suggest that fine-tuning the whole network provides substantial performance gains compared to just learning the final linear layer (Chen et al., 2020; Salman et al., 2020). Due to the lack of (representation) fine-tuning while training on the source tasks, we refer to the methods above as making use of frozen representation objectives. Linear separability of tasks on the same features poses additional problems, being an unrealistic assumption when transferring knowledge to other domains (e.g.
from ImageNet to medical images (Raghu et al., 2019)). Therefore, we consider a more realistic setting where the available tasks only approximately share the same representation. In this setting, we propose a theoretical framework for analyzing the sample complexity of fine-tuning, using an initial representation from a MAML-like algorithm. We show that fine-tuning quickly adapts to new tasks, requiring fewer samples in certain cases compared to methods using frozen representation objectives. To the best of our knowledge, no prior theoretical studies exist beyond fine-tuning a linear model (Denevi et al., 2018; Konobeev et al., 2020; Collins et al., 2020a; Lee et al., 2020) or only the task-specific layers (Du et al., 2020; Tripuraneni et al., 2020b,a; Mu et al., 2020). Tripuraneni et al. (2020b), in particular, acknowledge that their work does not incorporate representation fine-tuning, leaving such analysis to future work. It is this gap which we now seek to address.

The following outlines this paper and its contributions:

- In Section 2, we outline the general setting and our overall assumptions. Additionally, we introduce ADAPTREP, the representative fine-tuning-based algorithm we analyze in this work. As a baseline, we also formally define FROZENREP, which makes use of a frozen representation objective.
- In Section 3, we provide an in-depth analysis of the (d-dimensional) linear representation setting, when learning a k-dimensional representation. First, we show that ADAPTREP achieves rates of r_source = Õ(kd/(n_S T) + k/n_S + δ_0√(tr Σ/n_S)) and r_target = Õ(k/n_T + δ_0√(tr Σ/n_T) + r_source) on the source and target tasks. Here, n_S and n_T are the number of available samples, T is the number of source tasks, δ_0 is the norm of the representation change, and Σ is the input covariance. Thus, fine-tuning can handle the misspecified setting. In contrast, FROZENREP has a minimax rate of Ω(d/n_T) on the target task under certain task distributions, matching standard linear regression. We provide a formal construction and an experimental verification of the gap in Section C.
- In Section 4, we extend the analysis to general function classes. We provide risk bounds of the form ε_OPT + ε_EST + ε_REPR, where ε_OPT, ε_EST, ε_REPR are the optimization, estimation, and representation errors, respectively. The optimization error ε_OPT quantifies the error from using approximate minima found during optimization. To control ε_OPT, our analysis accounts for the nonconvexity introduced by representation fine-tuning, and is presented as a self-contained result in Section H. The estimation error ε_EST stems from fine-tuning to the target task with finite n_T samples. It therefore scales with 1/√n_T and is controlled by the complexity of the target fine-tuning set. The representation error ε_REPR is the error incurred by representation learning on the source tasks while adapting to them. It consists of two terms: one scaling as 1/√(n_S T) for learning an initialization using all T source tasks, and the other scaling as 1/√n_S for learning task-specific adaptations.
- In Section 5, we instantiate our guarantees in the two-layer neural network setting. Additionally, we provide an analysis for logistic regression in Section F. Furthermore, we extend the linear hard case to a nonlinear setting in Section G.
1.1 Related Work

The empirical success of MAML (Finn et al., 2017), and meta-learning in general, has elicited further theoretical analysis from both statistical and optimization perspectives. A flurry of work engages in developing more efficient and theoretically sound optimization algorithms (Antoniou et al., 2018; Nichol et al., 2018; Li et al., 2017) or convergence analyses (Fallah et al., 2020; Zhou et al., 2019; Rajeswaran et al., 2019; Collins et al., 2020b). Inspired by MAML, a subsequent line of gradient-based meta-learning algorithms has been developed and widely used (Nichol et al., 2018; Al-Shedivat et al., 2017; Jerfel et al., 2018). Much follow-up work focused on the online setting, with accompanying regret bounds (Denevi et al., 2018; Finn et al., 2019; Khodak et al., 2019; Balcan et al., 2015; Alquier et al., 2017; Bullins et al., 2019; Pentina & Lampert, 2014).

The statistical analysis of meta-learning traces back to Baxter (2000); Maurer & Jaakkola (2005), which focused on inductive bias learning. Following the same setting, Amit & Meir (2018); Konobeev et al. (2020); Maurer et al. (2016); Pentina & Lampert (2014) fix a meta-distribution for source task sampling, and measure the average generalization error/gap over the meta-distribution. Another line of work connects the target performance to the source using distance measures between distributions (Ben-David & Borbely, 2008; Ben-David et al., 2010; Mohri & Medina, 2012). Finally, another series of works studied the benefits of using additional side information provided with a task for specializing the parameters of an inner algorithm (Denevi et al., 2020, 2021).

The hardness of meta-learning has also been investigated under various settings. Recent work has studied meta-learning performance in worst-case settings (Collins et al., 2020b; Hanneke & Kpotufe, 2020a,b; Kpotufe & Martinet, 2018; Lucas et al., 2020). Hanneke & Kpotufe (2020a) provide a no-free-lunch result with a problem-independent minimax lower bound, while Konobeev et al. (2020) provide a problem-dependent lower bound in a simple linear setting.

2 General Setting

2.1 Notation

Let [n] := {1, . . . , n}. We denote the vector ℓ2-norm as ‖·‖_2, and the matrix Frobenius norm as ‖·‖_F. Additionally, ⟨·, ·⟩ can denote either the Euclidean inner product or the Frobenius inner product between matrices. For a matrix A, we let σ_i(A) denote its i-th largest singular value. Additionally, for positive semidefinite A, we write λ_max(A) and λ_min(A) for its largest and smallest eigenvalues, and A^{1/2} for its principal square root. We write P_A for the projection onto the column span of A, denoted Col A, and P_A^⊥ := I − P_A for the projection onto its complement. We use standard O, Θ, and Ω notation to denote orders of growth. We also use a ≲ b or b ≳ a to indicate that a = O(b). Finally, we write a ≍ b for a = Θ(b).

2.2 Problem Setting

Preliminaries. Fix an input space X ⊆ R^d and label space Y ⊆ R. We assume that the learner has access to T source tasks. Each task t ∈ [T] is associated with a distribution µ_t over the set X × Y of input-label pairs. From each µ_t, the learner observes n_S i.i.d. samples {(x_{i,t}, y_{i,t}) | i ∈ [n_S]}, which we collect into a matrix X_t ∈ R^{n_S × d} and a vector y_t ∈ R^{n_S}. Finally, we measure learner performance using the loss function ℓ : R × Y → R. We aim to find common structure among the source tasks that could be leveraged for future tasks. In this work, such common structure is defined using representations, i.e. mappings φ : X → Z from the input space to a latent space Z ⊆ R^k.
These representations lie in some function class parameterized by a normed space. However, unlike prior work, we do not assume that all tasks use a single representation. Instead, there is a fixed θ_0 such that the predictor for task t can be written as x ↦ ⟨w_t, φ_{θ_t}(x)⟩, with θ_t lying near θ_0 (that is, ‖θ_t − θ_0‖ is small).

ADAPTREP Procedure. Following the discussion above, consider the objective

min_{θ_0} ∑_{t=1}^{T} min_{w_t, ‖θ_t − θ_0‖ ≤ δ_0} (1/(n_S T)) ∑_{i=1}^{n_S} ℓ(⟨w_t, φ_{θ_t}(x_{i,t})⟩, y_{i,t})    (1)

for some fixed δ_0 > 0, which will be the main focus of this work. We refer to (1) as ADAPTREP, as it finds an initialization φ_{θ_0} for which every task has a good representation nearby, ensuring ease of fine-tuning. (1) can be viewed as a constrained form of algorithms found in the literature such as iMAML (Rajeswaran et al., 2019) and Meta-MinibatchProx (Zhou et al., 2019). However, we do not use train-validation splits, as is widespread in practice. This is motivated by results in Bai et al. (2020), which show that data splitting may be undesirable, assuming realizability. Furthermore, empirical evaluations have demonstrated successes despite the lack of such splits (Zhou et al., 2019).

FROZENREP Procedure. As a baseline, we will also be studying the ubiquitous frozen representation objective, which is analyzed in detail in Du et al. (2020); Tripuraneni et al. (2020b). Formally, this objective can be written as follows:

min_{θ_0} ∑_{t=1}^{T} min_{w_t} (1/(n_S T)) ∑_{i=1}^{n_S} ℓ(⟨w_t, φ_{θ_0}(x_{i,t})⟩, y_{i,t}).    (2)

Since φ_{θ_0} is frozen for all source tasks, we refer to the method above as FROZENREP. In Sections 3.4 and G, we will prove that unlike ADAPTREP, there are cases where FROZENREP is unable to leverage the latent structure in the presence of misspecifications.

Learning Target Tasks. Let θ_0 be a solution to (1) or (2). The learner obtains n_T i.i.d. samples {(x_i, y_i)} from a new task with distribution µ, which we collect into X ∈ R^{n_T × d} and y ∈ R^{n_T} as before. We adapt to the target task using θ_0 by solving

min_{w, ‖θ − θ_0‖ ≤ δ} (1/n_T) ∑_{i=1}^{n_T} ℓ(⟨w, φ_θ(x_i)⟩, y_i)    (3)

for some δ ≥ 0. This learned predictor is evaluated via its population loss over µ, which is given by E_{(x,y)∼µ}[ℓ(⟨w, φ_θ(x)⟩, y)]. We focus on the few-shot learning setting for the target task, where n_T is small, and a learner needs to effectively use the source tasks to sufficiently learn the target task.
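To make the difference between the two objectives concrete, the following sketch instantiates (1), (2), and (3) in the linear representation case of Section 3 using NumPy. This is our own illustration and not the paper's implementation: the function names are ours, the constraints ‖Δ_t‖_F ≤ δ_0 are left to an external projection step, and the paper assumes an optimization oracle for the source objectives rather than any particular solver.

```python
# Illustrative sketch (ours) of objectives (1)-(3) in the linear setting phi_B(x) = B^T x.
# The norm constraints on the per-task changes Delta_t are assumed to be enforced
# separately (e.g., by projection); they are not encoded here.
import numpy as np

def adaptrep_source_loss(B0, Deltas, Ws, Xs, ys):
    """Objective (1): each task t may fine-tune the shared initialization B0 by Delta_t."""
    T = len(Xs)
    total = 0.0
    for t in range(T):
        theta_t = (B0 + Deltas[t]) @ Ws[t]            # per-task predictor in R^d
        total += 0.5 * np.mean((ys[t] - Xs[t] @ theta_t) ** 2)
    return total / T

def frozenrep_source_loss(B0, Ws, Xs, ys):
    """Objective (2): all tasks share the frozen representation B0."""
    T = len(Xs)
    total = 0.0
    for t in range(T):
        total += 0.5 * np.mean((ys[t] - Xs[t] @ (B0 @ Ws[t])) ** 2)
    return total / T

def target_loss(B0, Delta, w, X, y):
    """Objective (3): adapt both the representation and the head on the target task."""
    return 0.5 * np.mean((y - X @ ((B0 + Delta) @ w)) ** 2)

# Tiny smoke test with synthetic shapes (d = 20 inputs, k = 3 latent dims, T = 5 tasks).
d, k, T, n = 20, 3, 5, 50
rng = np.random.default_rng(0)
B0 = rng.normal(size=(d, k)) / np.sqrt(d)
Xs = [rng.normal(size=(n, d)) for _ in range(T)]
Ws = [rng.normal(size=k) for _ in range(T)]
Deltas = [np.zeros((d, k)) for _ in range(T)]
ys = [X @ (B0 @ w) + 0.1 * rng.normal(size=n) for X, w in zip(Xs, Ws)]
print(adaptrep_source_loss(B0, Deltas, Ws, Xs, ys),
      frozenrep_source_loss(B0, Ws, Xs, ys))
```

With all Δ_t set to zero the two source losses coincide; the objectives differ precisely in whether the inner minimization is allowed to move the representation.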
Statistical Assumptions and Optimization Oracles. For ADAPTREP to be sensible, we need to ensure that a desirable initialization exists. Accordingly, we assume that for some θ*_0, every task t is associated with (θ*_t, w*_t) where ‖θ*_t − θ*_0‖ ≤ δ_0, so that the associated distribution µ_t is given by x ∼ p, y | x ∼ q(· | ⟨w*_t, φ_{θ*_t}(x)⟩). Here, p is an input distribution (the same for all tasks), while q models label noise. Specifically, in regression, we set E[y | x] = ⟨w*_t, φ_{θ*_t}(x)⟩. With an appropriate choice of loss function ℓ, we can guarantee that the optimal predictor under the population loss is x ↦ ⟨w*_t, φ_{θ*_t}(x)⟩.

Throughout the paper, we assume access to an oracle for solving (1) or (2), similar to Du et al. (2020); Tripuraneni et al. (2020b). For detailed analyses of source training optimization, we refer the reader to Ji et al. (2020); Wang et al. (2020). On the other hand, representation fine-tuning introduces nonconvexity during target time not present in prior work, where one simply optimizes a final linear layer (a convex problem). Thus, our bounds explicitly incorporate optimization performance on (3). We focus on using projected gradient descent (PGD), which applies to a wide variety of settings, under certain loss landscape assumptions. These standalone results are also provided in Section H.

3 ADAPTREP in the Linear Setting

We first examine ADAPTREP in the linear setting to illustrate key ideas. Here, representations are linear transformations R^d → R^k for d > k, so that φ_B(x) = B^⊤x, with distances measured by the Frobenius norm. In this setting, we provide a performance bound for ADAPTREP, and then exhibit a specific instance for which FROZENREP fails, establishing our claimed sample complexity gap.

3.1 Statistical Assumptions

Figure 1: Illustrating the geometry of task parameterization in the linear representation setting.

Data Sampling. We specialize the statistical assumptions of Section 2.2. Assume that the input distribution p is zero-mean and has covariance Σ. Let κ = λ_max(Σ)/λ_min(Σ) be the condition number of Σ. As in Du et al. (2020), we impose the following tail condition on p:

Assumption 3.1. There exists ρ > 0 such that if x ∼ p, then Σ^{−1/2}x is ρ²-sub-Gaussian. (A zero-mean random vector v is ρ²-sub-Gaussian if for any fixed u, E[exp(λ v^⊤u)] ≤ exp(λ² ‖u‖²_2 ρ²/2).)

This assumption guarantees probabilistic tail bounds for our proofs, and can be replaced with other suitable tail conditions. Finally, we define q(· | µ) = N(µ, σ²) for a fixed σ > 0.

Parameterizing the Tasks. Let B* be the initial representation, and Δ*_t the task-specific change for task t, where ‖Δ*_t‖_F ≤ δ_0. Furthermore, let w*_t ∈ R^k be the task-specific weights. The predictor for task t is then given by x ↦ ⟨x, (B* + Δ*_t)w*_t⟩. We define θ*_t = (B* + Δ*_t)w*_t, and combine the (w*_t) into a matrix W* = [w*_1, . . . , w*_T] ∈ R^{k × T}. We illustrate the parameterization in Figure 1.

Assumption 3.2 (Source task diversity). For any t ∈ [T], ‖w*_t‖_2 = Θ(1), and σ²_k(W*) = Ω(T/k).

Since ∑_{i∈[k]} σ²_i(W*) = ‖W*‖²_F = Θ(T) by Assumption 3.2, the bound on σ²_k(W*) implies that the weights cover all directions in R^k roughly evenly. This condition is satisfied with high probability when the w*_t are sampled from a sub-Gaussian distribution with well-conditioned covariance.

Note that θ*_t = B*w*_t + δ*_t, where ‖δ*_t‖_2 = ‖Δ*_t w*_t‖_2 = O(δ_0) by Assumption 3.2. Thus, the predictor weights lie near the rank-k subspace determined by B*, as illustrated by Figure 1. As a convention, since the parameterization is not unique, we require (B*)^⊤ Σ δ*_t = 0 for any t ∈ [T]. This is without loss of generality, as we can always redefine the relevant quantities as needed. Finally, we evaluate the learner on a target task θ* := B*w* + δ* for some w* and ‖δ*‖_2 ≤ δ_0.

3.2 Training Procedure

Source training. We consider the following regularized form of (1):

min_B ∑_{t=1}^{T} min_{Δ_t, w_t} [ (1/(2 n_S T)) ‖y_t − X_t(B + Δ_t)w_t‖²_2 + (λ/2)‖Δ_t‖²_F + (γ/2)‖w_t‖²_2 ].    (4)

In Section B, we show that the regularization is equivalent to regularizing √(λγ)‖Δ_t w_t‖_2, consistent with the intuition that Δ*_t w*_t has small norm. This additional regularization is necessary, since (1) only controls the norm of Δ_t, which is insufficient for controlling Δ_t w_t.

Target training. Let B_0 be the output of (4) after orthonormalizing. We adapt to the target task via

L_β(Δ, w) = (1/(2 n_T)) ‖y − βX(A_{B_0} + Δ)(w_0 + w)‖²_2,    (5)

where A_{B_0} := [B_0, B_0] ∈ R^{d × 2k} and w_0 = [u, −u] for a fixed unit-norm vector u ∈ R^k. This corresponds to training a predictor of the form x ↦ β⟨x, (A_{B_0} + Δ)(w_0 + w)⟩. We optimize (5) by performing T_PGD steps of PGD with stepsize η, using C_β := {(Δ, w) | ‖Δ‖_F ≤ c_1/β, ‖w‖_2 ≤ c_2/β} as the feasible set, where we explicitly define c_1 and c_2 in Section B.

To understand the procedure, note that the predictor has parameter θ = β(A_{B_0}w + Δw_0 + Δw), since A_{B_0}w_0 = 0. (This symmetrization via A_{B_0} is commonly used in the neural tangent kernel literature (Chizat et al., 2019; Zhang et al., 2020), allowing us to set f_{θ_0}(x) ≡ 0 while ensuring A_{B_0}w ∈ Col B_0 for any w.) Within C_β, the first two terms have O(1) norm, while the cross term βΔw has norm O(1/β). Thus, for large β, the cross term Δw is a negligible perturbation, and the predictor is approximately linear in (Δ, w). Consequently, L_β is approximately convex in (Δ, w) within C_β, allowing PGD to find a near-optimal solution in the constraint set.
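A minimal sketch of the target-time procedure follows, assuming the symmetrization convention described above; the step size, iteration count, and the radii c_1/β and c_2/β are illustrative placeholders for the values specified in Section B, and the sketch is ours rather than the paper's reference code.

```python
# Minimal PGD sketch (ours) for the target objective (5) over the feasible set C_beta.
import numpy as np

def project_ball(M, radius):
    """Project onto a Frobenius/L2 ball of the given radius."""
    nrm = np.linalg.norm(M)
    return M if nrm <= radius else M * (radius / nrm)

def finetune_target(B0, X, y, beta, c1, c2, eta=0.1, n_steps=500):
    n, d = X.shape
    k = B0.shape[1]
    A = np.hstack([B0, B0])                      # symmetrized initialization A_{B0} in R^{d x 2k}
    u = np.zeros(k); u[0] = 1.0                  # any fixed unit-norm vector u
    w0 = np.concatenate([u, -u])                 # chosen so that A @ w0 = 0 (predictor starts at 0)
    Delta = np.zeros((d, 2 * k)); w = np.zeros(2 * k)
    for _ in range(n_steps):
        r = beta * X @ ((A + Delta) @ (w0 + w)) - y          # residuals on the target samples
        grad_Delta = (beta / n) * X.T @ np.outer(r, w0 + w)  # dL/dDelta
        grad_w = (beta / n) * (A + Delta).T @ (X.T @ r)      # dL/dw
        Delta = project_ball(Delta - eta * grad_Delta, c1 / beta)
        w = project_ball(w - eta * grad_w, c2 / beta)
    return beta * (A + Delta) @ (w0 + w)         # learned target parameter theta-hat
```

Under the scaling discussed above, a large β keeps the cross term Δw negligible inside C_β, which is what lets plain PGD behave as if the problem were convex over the feasible set.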
3.3 Performance Bound

We now provide a bound on the performance of the algorithm proposed in Section 3.2. We define the following rates of interest, where R_Σ(w*, δ_0) > 0 (log factors and non-dominant terms are suppressed for clarity; the full rates are presented in the appendix):

r_S(n_S, T) := σ²kd/(n_S T) + σδ_0‖Σ^{1/2}‖_2 √(kd/(n_S T)) + σ²k/n_S + σδ_0√(tr Σ)/√n_S,

r_T(n_T) := σ²k/n_T + σδ_0√(tr Σ)/√n_T + min{ k·r_S(n_S, T), R_Σ(w*, δ_0)·√(k·r_S(n_S, T))·σ/√n_T }.

Theorem 3.1 (Performance guarantee, linear representations). Assume that Assumptions 3.1 and 3.2 hold, n_S ≳ ρ⁴(d + log(T/δ)), and n_T ≳ ρ⁴(d + log(1/δ)). Then there are (λ, γ, β, T_PGD, η, c_1, c_2) (specified in Section B) such that the training procedure in Section 3.2, with high probability, finds θ̂ achieving excess risk bounded as

E[(x^⊤θ* − x^⊤θ̂)²] ≲ min{ r_T(n_T), σ²d/n_T }.

We prove this result in Section B. Nevertheless, we give an interpretation of r_S and r_T below.

Source Training. The rate r_S represents the performance of the learner during source training. Despite misspecifications, the algorithm is able to pool the aggregate n_S T samples to learn a common (kd-dimensional) initialization, as shown by the first two terms. The last two terms represent the rate achieved on each task upon inferring the initialization, and therefore decay only with n_S. The first is the cost of learning w*_t, while the second is the cost of learning δ*_t.

Target Training. The rate r_T represents the performance of the learner with fine-tuning. The first two terms and their interpretations are analogous to those of the last two terms in r_S. On the other hand, the minimum indicates two ways to handle the representation error from source training:

Ignore source error. In the most data-starved regime, we ignore the finite-sample source error, restricting fine-tuning to only learn δ*. Thus, we incur an irreducible error in the form of k·r_S(n_S, T). (For average target performance, k·r_S(n_S, T) can be replaced by r_S(n_S, T); see Section F and Du et al. (2020).)

Fix source error. Alternatively, with more data, we can use fine-tuning to fix the source representation error. This additional complexity is captured by the second argument, indicating fine-tuning to a vector with norm R_Σ(w*, δ_0)·√(k·r_S(n_S, T)). Note that the additional norm shrinks to 0 as n_S → ∞.

Finally, by ignoring the representation B_0 completely, we can obtain the alternative trivial σ²d/n_T rate, which matches the minimax lower bound for standard linear regression (see, e.g., Duchi (2016)).
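Since p is zero-mean with covariance Σ, the excess risk appearing in Theorem 3.1 is exactly the squared Σ-norm of the parameter error. Spelling this out (a one-line computation included for reference, as the Σ-norm reappears in Section 3.4):

E_x[(x^⊤θ* − x^⊤θ̂)²] = E_x[(θ* − θ̂)^⊤ x x^⊤ (θ* − θ̂)] = (θ* − θ̂)^⊤ Σ (θ* − θ̂) = ‖θ* − θ̂‖²_Σ.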
3.4 A Hard Case for FROZENREP

In this section, we present a sample complexity separation between ADAPTREP and FROZENREP in the linear setting. We provide the formal construction and an experimental verification of the gap in Section C. Nevertheless, we provide the following informal result:

Theorem 3.2 (FROZENREP–ADAPTREP sample complexity separation). Assume k = Θ(1) ≪ d. There exists a family of task distributions T satisfying the conditions of Section 3.1 such that, with high probability over the draw of n_T ≲ d target samples, we have the minimax bound

min_{ŵ, δ̂} max_{τ ∈ T} max_{θ* ∈ supp τ} E[(1/n_T)‖X(θ* − B̂ŵ − δ̂)‖²_2] ≳ σ²d/n_T,

where FROZENREP is provided with infinitely many source tasks and per-task samples from τ to learn B̂, and n_T samples from θ* to learn (ŵ, δ̂). In contrast, by specializing Theorem 3.1 for the same family, ADAPTREP (with only finitely many source tasks and per-task samples) achieves excess risk bounded as

E[(x^⊤θ* − x^⊤θ̂)²] ≲ min{ σ/√n_T, σ²d/n_T }.

Interpreting the minimax result. Observe that the σ²d/n_T rate is achievable by performing standard linear regression directly on the target samples, with no other information. Thus, FROZENREP fails to capture the shared structure from source tasks, in the worst case. Furthermore, in the high-dimensional setting where n_T ≪ d, there exists a strict separation between ADAPTREP and FROZENREP that widens as d/n_T grows. The minimax bound applies regardless of the target fine-tuning procedure in use, including those used in practice, e.g. iMAML, MetaOptNet (Lee et al., 2019), and R2D2 (Bertinetto et al., 2019). As FROZENREP is provided with infinitely many source tasks and samples, this failure is thus entirely due to the representation learning algorithm. Altogether, we have established a case where incorporating representation fine-tuning is provably sufficient for handling misspecifications, unlike ubiquitous frozen representation objectives.

3.4.1 Intuition

In this section, we provide a brief intuition behind our construction. Recall that in the linear setting, FROZENREP uses the following objective for obtaining a representation:

B̂ = argmin_B ∑_{t∈[T]} min_{w_t} (1/(2 n_S T)) ‖y_t − X_t B w_t‖²_2  →(n_S → ∞)→  argmin_B (1/T) ∑_{t∈[T]} min_{w_t} ‖θ*_t − B w_t‖²_Σ.    (6)

By optimizing with respect to the Σ-norm, FROZENREP seeks a k-dimensional subspace that captures as much of the predictive signal across all source tasks as possible. However, in general, the Σ-norm is not aligned with the parameter norm. In other words, a small ‖θ − Bw‖_Σ does not imply that ‖θ − Bw‖_2 is small, and thus it may be the case that θ − Bw is difficult to learn during fine-tuning.

Figure 2: Prototypical hard case. While Δ*_t w*_t is smaller (left figure), FROZENREP learns Δ*_t w*_t since it has a larger predictive signal (right figure).

A prototypical example of this difference is illustrated by Figure 2, where k = 1. The dashed red vector has small parameter norm, and is thus easily learnable by fine-tuning (as demonstrated by norm-based generalization bounds). Thus, a meta-learner should opt to learn the space captured by the solid green vector. However, if Σ is structured so that ‖Δ*_t w*_t‖_Σ ≫ ‖B* w*_t‖_Σ, then FROZENREP will learn the space spanned by the dashed red vector instead. By appropriately tuning the ratio of the lengths of the two vectors, we obtain the desired result. We provide a formal construction in Section C.1.1, including the extension to general k.
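The mismatch between the Σ-norm and the parameter norm is easy to see numerically. The toy numbers below are our own illustration of the picture in Figure 2, not the formal construction of Section C.1.1.

```python
# Toy illustration (ours) of the Sigma-norm vs. parameter-norm mismatch behind Figure 2.
import numpy as np

Sigma = np.diag([100.0, 1.0])            # input covariance, heavily skewed

v_red = np.array([0.3, 0.0])             # small parameter norm, aligned with the large eigenvalue
v_green = np.array([0.0, 2.0])           # large parameter norm, aligned with the small eigenvalue

def sigma_norm(v):
    return float(np.sqrt(v @ Sigma @ v))

print(np.linalg.norm(v_red), sigma_norm(v_red))      # parameter norm 0.3, Sigma-norm 3.0
print(np.linalg.norm(v_green), sigma_norm(v_green))  # parameter norm 2.0, Sigma-norm 2.0
```

Although v_red is the shorter vector, it carries the larger predictive signal under Σ, so an objective that only measures Σ-norm residuals allocates its k-dimensional subspace to it, leaving the long v_green direction to be learned from scratch at target time.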
4 ADAPTREP in the Nonlinear Setting

We now describe a framework for analyzing fine-tuning in general function classes. To simplify notation, we modify the setting in Section 2.2 so that both the representation φ and the predictor weights w are captured by a single parameter θ ∈ Θ, with corresponding predictor g_θ.

Loss Notation. We denote the population and excess risk of a predictor g with respect to the true predictor g* as

L(g, g*) := E_{x∼p} E_{y∼q(·|g*(x))}[ℓ(g(x), y)]  and  L_ex(g, g*) := L(g, g*) − L(g*, g*),

respectively. Additionally, we let L̂(g, g*) denote the corresponding finite-sample quantity (we omit the samples from the notation for brevity).

Feasible Predictors. For an initialization θ and a fine-tuning set C ⊆ Θ, we define the set

A_C(θ) := { g_{θ'} | θ' − θ ∈ C }.

Intuitively, this set contains all feasible predictors given an initialization θ and feasible fine-tuning set C. For convenience, we also define the set

[A_C(θ)]^T := { (x_t)_{t∈[T]} ↦ (g_t(x_t))_{t∈[T]} | g_1, . . . , g_T ∈ A_C(θ) },

which is a collection of functions mapping X^T → Y^T. This set represents all feasible T-tuples of predictors, for a given initialization θ and fine-tuning set C.

4.1 Training Procedure

Figure 3: An illustration of source training.

Source Training. Fix a set of possible initializations Θ_0 ⊆ Θ, and a fine-tuning set C_S. With n_S samples for each task (with optimal predictor g*_t), we consider the objective

θ_0 ∈ argmin_{θ∈Θ_0} ∑_{t=1}^{T} min_{g_t ∈ A_{C_S}(θ)} L̂(g_t, g*_t).    (7)

Note that this corresponds to the ADAPTREP objective in (1) using the combined parameter θ. We illustrate the objective in Figure 3.

Target Training. Fix a fine-tuning set C_T (not necessarily equal to C_S). Given n_T samples from a new task (with true predictor g*), we run PGD on L̂(δ) := L̂(g_{θ_0+δ}, g*) with feasible set C_T. In this process, PGD is run for T_PGD timesteps with step size η.

4.2 Assumptions

Figure 4: Assumption 4.1. If A_C(θ) has low complexity, then fine-tuning quickly finds good task-specific parameters for all tasks.

We rephrase the statistical assumption from Section 2.2 in terms of the feasible sets A_C(θ) as follows:

Assumption 4.1. There exists θ*_0 ∈ Θ_0 such that for every t ∈ [T], g*_t ∈ A_{C_S}(θ*_0).

We illustrate Assumption 4.1 in Figure 4. We also impose the following standard regularity conditions on the loss function ℓ:

Assumption 4.2. For any y ∈ Y, ℓ(·, y) is 1-Lipschitz and convex, and |ℓ(0, y)| ≤ B. (The Lipschitz constant is not restrictive, as one can simply rescale the loss; it is assumed for simplicity of presentation.)

Task Diversity. To ensure transfer from source to target, we impose the following condition, a specific instance of which was proposed by Du et al. (2020) in the linear setting, and by Tripuraneni et al. (2020b) for general settings:

Assumption 4.3 (Source tasks are (ν, ε)-diverse). There exist constants (ν, ε) such that if ρ is the distribution of target tasks, then for any θ ∈ Θ_0,

E_{g*∼ρ}[ inf_{g ∈ A_{C_T}(θ)} L_ex(g, g*) ] ≤ (1/(νT)) ∑_{t=1}^{T} inf_{g ∈ A_{C_S}(θ)} L_ex(g, g*_t) + ε.

The (ν, ε)-diversity assumption connects the average source task performance (right) to the average target performance (left). Thus, it ensures that optimizing (7) results in controlled target performance. We weakened the condition in Tripuraneni et al. (2020b) to bound average target performance rather than worst-case, as is more suitable for higher-dimensional settings.

Target Loss Landscape. Finally, we impose conditions ensuring that PGD finds a near-optimal solution in C_T. We remark that these assumptions are specific to our choice of fine-tuning algorithm.

Assumption 4.4 (Approximate linearity in fine-tuning). Let x_1, . . . , x_{n_T} be the set of target inputs. Then, there exist β, L such that for all θ ∈ Θ_0 and δ ∈ C_T,

∑_{i=1}^{n_T} ‖∇²_θ g_{θ+δ}(x_i)‖²_2 ≤ β²  and  sup_{θ∈Θ_0} ∑_{i=1}^{n_T} ‖∇_θ g_θ(x_i)‖²_2 ≤ L².

Assumption 4.5. sup_{δ∈C_T} ‖δ‖_2 ≤ R for some R.

4.3 Performance Bound

Recall that for a set H of functions R^d → R^k, its Rademacher complexity R_n(H) on n samples is given by

R_n(H) := E_{ε,X}[ sup_{h∈H} (1/n) ∑_{i=1}^{n} ∑_{j=1}^{k} ε_{ij} h_j(x_i) ],

where the (ε_{ij}) are i.i.d. Rademacher random variables and the (x_i) are i.i.d. samples from some (preset) distribution.
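As a concrete reference point, the empirical Rademacher complexity of a norm-bounded linear class can be estimated directly by Monte Carlo. The sketch below is our own illustration of the definition above (for scalar-valued predictors, k = 1), not part of the paper's analysis.

```python
# Monte Carlo estimate (ours) of the empirical Rademacher complexity of
# {x -> <w, x> : ||w||_2 <= W}. For this class the supremum has the closed form
# W * || (1/n) * sum_i eps_i x_i ||_2, so no inner optimization is needed.
import numpy as np

def rademacher_linear(X, W, n_draws=2000, seed=None):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = []
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=n)
        vals.append(W * np.linalg.norm(X.T @ eps) / n)   # sup_w <w, (1/n) sum_i eps_i x_i>
    return float(np.mean(vals))

rng = np.random.default_rng(0)
for n in [50, 200, 800]:
    X = rng.normal(size=(n, 20))
    print(n, rademacher_linear(X, W=1.0, seed=1))        # decays roughly like sqrt(d/n)
```

This 1/√n decay is the behavior inherited by the complexity terms in Theorem 4.1 below.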
We now proceed with our performance guarantee.

Theorem 4.1 (General performance bound). Assume that all assumptions in Section 4.2 hold. Let (θ_t) be the set of iterates generated by PGD following the procedure in Section 4.1 (step size η specified in Section D). Then, with probability at least 1 − δ over the random draw of samples,

E_{g*∼ρ}[ min_t L_ex(g_{θ_t}, g*) ] ≲ ( βR² + RL/√T_PGD ) + ( sup_{θ∈Θ_0} R_{n_T}(A_{C_T}(θ)) + B√(log(1/δ)/n_T) ) + (1/ν)( R_{n_S}(∪_{θ∈Θ_0} [A_{C_S}(θ)]^T) + B√(log(1/δ)/(n_S T)) ) + ε,

where the three parenthesized groups are the optimization error ε_OPT, the estimation error ε_EST, and the representation error ε_REPR, respectively. Note that the R_{n_T} complexity term samples from p, while the R_{n_S} complexity term samples from p^T, which concatenates T i.i.d. samples from p (one for each task) for every draw. We prove Theorem 4.1 in Section D. The Rademacher complexity terms decay in most settings as

R_{n_S}(∪_{θ∈Θ_0} [A_{C_S}(θ)]^T) = O( C(Θ_0)/√(n_S T) + diam C_S/√n_S )  and  sup_{θ∈Θ_0} R_{n_T}(A_{C_T}(θ)) = O( diam C_T/√n_T ),

where C(Θ_0) measures the complexity of Θ_0, and diam C_S and diam C_T measure the sizes of the fine-tuning sets C_S and C_T.

Understanding the Bound. We briefly outline how the assumptions in Section 4.2 contribute to the final bound. Let θ_OPT denote the best solution found by PGD, and define the parameters

θ_ERM := argmin_{g_θ ∈ A_{C_T}(θ_0)} L̂(g_θ, g*)  and  θ̄ := argmin_{g_θ ∈ A_{C_T}(θ_0)} L(g_θ, g*).

That is, θ_ERM is the minimizer of the empirical risk while θ̄ is the minimizer of the population risk, both within A_{C_T}(θ_0).

Figure 5: Illustrating the excess risk bound in Theorem 4.1.

Optimization error (ε_OPT). The difference in performance between θ_OPT and θ_ERM, i.e. the error due to PGD, is controlled by the assumptions on the target loss landscape.

Estimation error (ε_EST). By uniform convergence, θ_ERM performs similarly to θ̄, with the difference bounded using Rademacher complexity. By extension, θ_OPT performs similarly to θ̄.

Representation/approximation error (ε_REPR). The average best-case excess risk (i.e., the excess risk of θ̄) is bounded via the performance of the initialization θ_0 on the source tasks, using (ν, ε)-diversity. Since (7) is a finite-sample estimate of the desired bound, we again use uniform convergence arguments, which manifest as the second Rademacher complexity term.

We have included an illustration of these three components in Figure 5.

5 Case Study: Two-Layer Neural Networks

We now illustrate our framework by instantiating our bounds in concrete settings. We focus on specializing Theorem 4.1 to two-layer neural networks (this section) and logistic regression (provided in the appendix, Section F). For this section, we fix an activation function σ : R → R.

Assumption 5.1. For any t ∈ [−2, 2], |σ'(t)| ≤ L and |σ''(t)| ≤ µ. Furthermore, σ(0) = 0.

Then, for a constant β and θ = (B, w), where B ∈ R^{d × 2k} and w ∈ R^{2k}, we define the two-layer neural network f^β_θ(x) = βw^⊤σ(B^⊤x), where σ is applied elementwise.
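For concreteness, here is a sketch of f^β_θ together with the activation features φ_B and gradient features ψ_{B,w_0} that appear in the linearization used in Section 5.1 below. The feature formulas follow from differentiating f^β_θ at (B, w_0); the code is our own illustration (tanh is just a concrete choice of smooth activation), not the paper's construction.

```python
# Two-layer network f_theta^beta(x) = beta * w^T sigma(B^T x) and its linearization
# features at (B, w0): activation features phi_B(x) = sigma(B^T x) (gradient in w)
# and gradient features psi_{B,w0}(x) = x (w0 * sigma'(B^T x))^T (gradient in B).
import numpy as np

def sigma(z):       return np.tanh(z)
def sigma_prime(z): return 1.0 - np.tanh(z) ** 2

def f(x, B, w, beta):
    return beta * w @ sigma(B.T @ x)

def activation_features(x, B):
    return sigma(B.T @ x)                               # phi_B(x) in R^{2k}

def gradient_features(x, B, w0):
    return np.outer(x, w0 * sigma_prime(B.T @ x))       # psi_{B,w0}(x) in R^{d x 2k}

def f_linearized(x, B, w0, Delta, w, beta):
    """First-order prediction for parameters (B + Delta, w0 + w) around (B, w0)."""
    return beta * (w @ activation_features(x, B)
                   + np.sum(gradient_features(x, B, w0) * Delta))

d, k = 10, 3
rng = np.random.default_rng(0)
x = rng.normal(size=d)
B_half = rng.normal(size=(d, k)) / np.sqrt(d)
B = np.hstack([B_half, B_half])                 # duplicated columns, so that...
u = np.zeros(k); u[0] = 1.0
w0 = np.concatenate([u, -u])                    # ...f(x; B, w0) = 0 exactly (symmetrization)
Delta = 0.01 * rng.normal(size=(d, 2 * k)); w = 0.01 * rng.normal(size=2 * k)
beta = 10.0
print(f(x, B + Delta, w0 + w, beta))            # true network output
print(f_linearized(x, B, w0, Delta, w, beta))   # close to it when Delta and w are small
```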
5.1 Statistical Assumptions

Data Sampling. For all tasks, the inputs are sampled from a 1-norm-bounded distribution p. Furthermore, q(· | µ) is generated as µ + η for some O(1)-bounded additive noise η, similar to Tripuraneni et al. (2020b).

Parameterizing the Tasks. We fix a representation B* ∈ R^{d × 2k} and a linear predictor w*_0 ∈ R^{2k}. In this setting, we consider representation changes lying in a k-dimensional subspace, determined by a set of matrices Δ_1, . . . , Δ_k in R^{d × 2k} that is orthonormal with respect to the Frobenius (i.e. entrywise) inner product. Then, for each t ∈ [T], we have two unit-norm vectors w*_t, δ*_t so that task t has parameter

( B* + ∑_{i∈[k]} (δ*_t)_i Δ_i ,  w*_0 + (1/β) w*_t ).

We assume a diversity condition on (w*_t, δ*_t) similar in spirit to the linear version in Assumption 3.2; see Assumption E.2. With appropriate assumptions on the initialization and large enough β, the source task predictors behave like their linearizations, i.e. there exist feature maps φ_{B*} and ψ_{B*,w*_0} so that

f_{θ*_t}(x) ≈ φ_{B*}(x)^⊤ w*_t + ⟨ ψ_{B*,w*_0}(x), ∑_{i∈[k]} (δ*_t)_i Δ_i ⟩.

We make use of these assumptions throughout for our result; see Assumption E.1 for the formal statement. Note that φ_{B*} and ψ_{B*,w*_0} correspond to the activation and gradient features, respectively, that are empirically evaluated by Mu et al. (2020).

5.2 Training Procedure

We use the squared error loss to train in this setting. Additionally, we use the training procedure in Section 4.1 with the function class {f^γ_θ}, with a minor caveat: we set γ = β during source training, and γ = O(√n_T) during target training. With the procedure above, we now define the relevant feasible sets. We set Θ_0 to be the set of initializations satisfying Assumption E.1. Finally, we set the constraint sets

C_S = {(Δ, w) | ‖Δ‖_F ≤ 1, ‖w‖_2 ≤ 1/β}  and  C^γ_T = {(Δ, w) | ‖Δ‖²_F + ‖w‖²_2 ≤ O(1/γ²)}.

5.3 Performance Guarantee

With the above assumptions in hand, we proceed with the performance guarantee.

Theorem 5.1 (Neural net performance bound). Assume that Assumptions 5.1, E.1 and E.2 hold. Then, if n_S ≳ n_T, there exists a setting of the training parameters (see Section E) such that with probability at least 1 − δ, the iterates (θ_t) satisfy

E_{f*∼ρ}[ min_t L_ex(f^γ_{θ_t}, f*) ] ≲ L²√(k/n_T) + L(L + µ)·k^{3/2}/√(n_S T) + (µ + L)·√(k/n_S),

where log factors have been suppressed for clarity.

6 Conclusion

We have presented, to the best of our knowledge, the first statistical analysis of fine-tuning-based meta-learning. We demonstrate the success of such algorithms under the assumption of approximately shared representations between available tasks. In contrast, we show that the frozen representation objectives analyzed by prior work fail under this weaker assumption. An interesting line of future work is to determine ways to formulate useful shared structure among MDPs, i.e. formulate settings for which meta-reinforcement learning succeeds and results in improved regret bounds for downstream tasks.

7 Acknowledgements

KC is supported by a National Science Foundation Graduate Research Fellowship, Grant DGE-2039656. QL is supported by NSF #2030859 and the Computing Research Association for the CIFellows Project. JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF 2002272, NSF IIS 2107304, and an ONR Young Investigator Award.

References

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.

Alquier, P., Pontil, M., et al. Regret bounds for lifelong learning. In Artificial Intelligence and Statistics, pp. 261–269. PMLR, 2017.

Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, pp. 205–214. PMLR, 2018.

Antoniou, A., Edwards, H., and Storkey, A. How to train your MAML. arXiv preprint arXiv:1810.09502, 2018.
Bai, Y., Chen, M., Zhou, P., Zhao, T., Lee, J. D., Kakade, S., Wang, H., and Xiong, C. How important is the train-validation split in meta-learning? arXiv preprint arXiv:2010.05843, 2020.

Balcan, M.-F., Blum, A., and Vempala, S. Efficient representations for lifelong learning and autoencoding. In Conference on Learning Theory, pp. 191–210. PMLR, 2015.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

Ben-David, S. and Borbely, R. S. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine Learning, 73(3):273–287, 2008.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Bertinetto, L., Henriques, J. F., Torr, P. H. S., and Vedaldi, A. Meta-learning with differentiable closed-form solvers, 2019.

Bullins, B., Hazan, E., Kalai, A., and Livni, R. Generalize across tasks: Efficient algorithms for linear representation learning. In Algorithmic Learning Theory, pp. 235–246. PMLR, 2019.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.

Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32:2937–2947, 2019.

Collins, L., Mokhtari, A., and Shakkottai, S. Why does MAML outperform ERM? An optimization perspective. arXiv preprint arXiv:2010.14672, 2020a.

Collins, L., Mokhtari, A., and Shakkottai, S. Task-robust model-agnostic meta-learning. Advances in Neural Information Processing Systems, 33, 2020b.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018.

Denevi, G., Pontil, M., and Ciliberto, C. The advantage of conditional meta-learning for biased regularization and fine tuning. Advances in Neural Information Processing Systems, 33, 2020.

Denevi, G., Pontil, M., and Ciliberto, C. Conditional meta-learning of linear representations. arXiv preprint arXiv:2103.16277, 2021.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.

Duchi, J. Lecture notes for Statistics 311/Electrical Engineering 377, Stanford University, 2016. URL: https://stanford.edu/class/stats311/Lectures/full_notes.pdf.

Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 1082–1092. PMLR, 2020.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning, 2017.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.

Hanneke, S. and Kpotufe, S. A no-free-lunch theorem for multitask learning. arXiv preprint arXiv:2006.15785, 2020a.

Hanneke, S. and Kpotufe, S. On the value of target data in transfer learning. arXiv preprint arXiv:2002.04747, 2020b.

Jerfel, G., Grant, E., Griffiths, T. L., and Heller, K. Online gradient-based mixtures for transfer modulation in meta-learning. arXiv preprint arXiv:1812.06080, 2018.
Ji, K., Lee, J. D., Liang, Y., and Poor, H. V. Convergence of meta-learning with task-specific adaptation over partial parameters, 2020.

Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pp. 5917–5928, 2019.

Konobeev, M., Kuzborskij, I., and Szepesvári, C. On optimality of meta-learning in fixed-design regression with weighted biased regularization. arXiv preprint arXiv:2011.00344, 2020.

Kpotufe, S. and Martinet, G. Marginal singularity, and the benefits of labels in covariate-shift. In Conference on Learning Theory, pp. 1882–1886. PMLR, 2018.

Ledoux, M. and Talagrand, M. Probability in Banach spaces: Isoperimetry and processes. 1991.

Lee, J. D., Lei, Q., Saunshi, N., and Zhuo, J. Predicting what you already know helps: Provable self-supervised learning. arXiv preprint arXiv:2008.01064, 2020.

Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Meta-learning with differentiable convex optimization, 2019.

Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Lucas, J., Ren, M., Kameni, I., Pitassi, T., and Zemel, R. Theoretical bounds on estimation error for meta-learning. arXiv preprint arXiv:2010.07140, 2020.

Maurer, A. and Jaakkola, T. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6(6), 2005.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32, 2016.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.

Mendonca, R., Gupta, A., Kralev, R., Abbeel, P., Levine, S., and Finn, C. Guided meta-policy search. In Advances in Neural Information Processing Systems, pp. 9656–9667, 2019.

Mohri, M. and Medina, A. M. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pp. 124–138. Springer, 2012.

Mu, F., Liang, Y., and Li, Y. Gradients as features for deep representation learning. arXiv preprint arXiv:2004.05529, 2020.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Pentina, A. and Lampert, C. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pp. 991–999. PMLR, 2014.

Raghu, M., Zhang, C., Kleinberg, J., and Bengio, S. Transfusion: Understanding transfer learning for medical imaging. arXiv preprint arXiv:1902.07208, 2019.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pp. 113–124, 2019.

Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., and Madry, A. Do adversarially robust ImageNet models transfer better? arXiv preprint arXiv:2007.08489, 2020.

Saunshi, N., Zhang, Y., Khodak, M., and Arora, S. A sample complexity separation between non-convex and convex meta-learning. In International Conference on Machine Learning, pp. 8512–8521. PMLR, 2020.

Snell, J., Swersky, K., and Zemel, R. S. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.

Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.

Tripuraneni, N., Jin, C., and Jordan, M. I. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020a.
Tripuraneni, N., Jordan, M., and Jin, C. On the theory of transfer learning: The importance of task diversity. Advances in Neural Information Processing Systems, 33, 2020b.

Vershynin, R. Four lectures on probabilistic methods for data science. 2017.

Wang, H., Sun, R., and Li, B. Global convergence and generalization bound of gradient-based meta-learning with deep neural nets, 2020.

Zhang, Y., Xu, Z.-Q. J., Luo, T., and Ma, Z. A type of generalization error induced by initialization in deep neural networks. In Mathematical and Scientific Machine Learning, pp. 144–164. PMLR, 2020.

Zhou, P., Yuan, X., Xu, H., Yan, S., and Feng, J. Efficient meta learning via minibatch proximal update. In Advances in Neural Information Processing Systems, pp. 1534–1544, 2019.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] Our introduction outlines where each claim appears in the paper.
   (b) Did you describe the limitations of your work? [Yes] We have discussed extensively the assumptions that we require to establish our results; see Section 3.1, Section 4.2, and Section 5.1, among others.
   (c) Did you discuss any potential negative societal impacts of your work? [No] As our work focuses on analyzing existing algorithms from a theoretical perspective, we do not foresee any negative societal impacts stemming from our work.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] A Jupyter notebook is provided to run the simulation outlined in Section C.3.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section C.3, as well as the provided Jupyter notebook.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Since we are simulating regression problems and can sample arbitrarily many times from the task distribution, we only report the mean in our experiments.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] Our experiments only involve simulations of simple settings that do not require extensive compute.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]