# How Fine-Tuning Allows for Effective Meta-Learning

Kurtland Chua (Princeton University, kchua@princeton.edu), Qi Lei (Princeton University, qilei@princeton.edu), Jason D. Lee (Princeton University, jasonlee@princeton.edu)

Abstract

Representation learning has served as a key tool for meta-learning, enabling rapid learning of new tasks. Recent works like MAML learn task-specific representations by finding an initial representation requiring minimal per-task adaptation (i.e. a fine-tuning-based objective). We present a theoretical framework for analyzing a MAML-like algorithm, assuming all available tasks require approximately the same representation. We then provide risk bounds on predictors found by fine-tuning via gradient descent, demonstrating that the method provably leverages the shared structure. We illustrate these bounds in the logistic regression and neural network settings. In contrast, we establish settings where learning one representation for all tasks (i.e. using a frozen representation objective) fails. Notably, any such algorithm cannot outperform directly learning the target task with no other information, in the worst case. This separation underscores the benefit of fine-tuning-based over frozen representation objectives in few-shot learning.

1 Introduction

Meta-learning (Thrun & Pratt, 2012) has emerged as an essential tool for adapting prior knowledge to new tasks under data and computational constraints. In this context, a meta-learner has access to related source tasks from a shared environment. The learner aims to uncover inductive biases from the source tasks to reduce the sample/computational complexity of learning new tasks from the same environment. A common approach is representation learning (Bengio et al., 2013), i.e. learning a feature extractor from the source tasks. At test time, a learner adapts to a new task by fine-tuning the representation and retraining the final layer(s) (see, e.g., prototype networks (Snell et al., 2017)). Substantial improvements over directly learning a single task have been shown in few-shot learning (Antoniou et al., 2018), a natural setting in many applications including reinforcement learning (Mendonca et al., 2019; Finn et al., 2017), computer vision (Nichol et al., 2018), federated learning (McMahan et al., 2017), and robotics (Al-Shedivat et al., 2017).

The empirical success of representation learning has led to an increased interest in theoretical analyses of the underlying phenomena. Recent work assumes an explicitly shared representation across tasks (Du et al., 2020; Tripuraneni et al., 2020b,a; Saunshi et al., 2020; Balcan et al., 2015). For instance, Du et al. (2020) prove a generalization bound consisting of a representation error term and an estimation error term. However, without fine-tuning the whole network, the representation error propagates to the target task and is irreducible, even with an infinitely large target task dataset. This result is consistent with empirical findings, which suggest that fine-tuning the whole network provides substantial performance gains compared to just learning the final linear layer (Chen et al., 2020; Salman et al., 2020). Due to the lack of (representation) fine-tuning while training on the source tasks, we refer to the methods above as making use of frozen representation objectives. Linear separability of tasks on the same features poses additional problems, being an unrealistic assumption when transferring knowledge to other domains (e.g.
from ImageNet to medical images (Raghu et al., 2019)). Therefore, we consider a more realistic setting where the available tasks only approximately share the same representation. In this setting, we propose a theoretical framework for analyzing the sample complexity of fine-tuning, using an initial representation from a MAML-like algorithm. We show that fine-tuning quickly adapts to new tasks, requiring fewer samples in certain cases compared to methods using frozen representation objectives. To the best of our knowledge, no prior theoretical studies exist beyond fine-tuning a linear model (Denevi et al., 2018; Konobeev et al., 2020; Collins et al., 2020a; Lee et al., 2020) or only the task-specific layers (Du et al., 2020; Tripuraneni et al., 2020b,a; Mu et al., 2020). Tripuraneni et al. (2020b), in particular, acknowledge that their work does not incorporate representation fine-tuning, leaving such analysis to future work. It is this gap which we now seek to address.

The following outlines this paper and its contributions:

- In Section 2, we outline the general setting and our overall assumptions. Additionally, we introduce ADAPTREP, the representative fine-tuning-based algorithm we analyze in this work. As a baseline, we also formally define FROZENREP, which makes use of a frozen representation objective.
- In Section 3, we provide an in-depth analysis of the (d-dimensional) linear representation setting, when learning a k-dimensional representation. First, we show that ADAPTREP achieves rates of r_source = Õ(kd/(n_S T) + k/n_S + δ_0√(tr Σ/n_S)) and r_target = Õ(k/n_T + δ_0√(tr Σ/n_T) + r_source) on the source and target tasks. Here, n_S and n_T are the number of available samples, T is the number of source tasks, δ_0 is the norm of the representation change, and Σ is the input covariance. Thus, fine-tuning can handle the misspecified setting. In contrast, FROZENREP has a minimax rate of Ω(d/n_T) on the target task under certain task distributions, matching standard linear regression. We provide a formal construction and an experimental verification of the gap in Section C.
- In Section 4, we extend the analysis to general function classes. We provide risk bounds of the form ε_OPT + ε_EST + ε_REPR, where ε_OPT, ε_EST, ε_REPR are the optimization, estimation, and representation errors, respectively. The optimization error ε_OPT quantifies the error from using approximate minima found during optimization. To control ε_OPT, our analysis accounts for the nonconvexity introduced by representation fine-tuning, and is presented as a self-contained result in Section H. The estimation error ε_EST stems from fine-tuning to the target task with finite n_T samples. It therefore scales with 1/√n_T and is controlled by the complexity of the target fine-tuning set. The representation error ε_REPR is the error incurred by representation learning on the source tasks while adapting to them. It consists of two terms: one scaling as 1/√(n_S T) for learning an initialization using all T source tasks, and the other scaling as 1/√n_S for learning task-specific adaptations.
- In Section 5, we instantiate our guarantees in the two-layer neural network setting. Additionally, we provide an analysis for logistic regression in Section F. Furthermore, we extend the linear hard case to a nonlinear setting in Section G.
1.1 Related Work

The empirical success of MAML (Finn et al., 2017), and meta-learning in general, has elicited further theoretical analysis from both statistical and optimization perspectives. A flurry of work engages in developing more efficient and theoretically sound optimization algorithms (Antoniou et al., 2018; Nichol et al., 2018; Li et al., 2017) or convergence analyses (Fallah et al., 2020; Zhou et al., 2019; Rajeswaran et al., 2019; Collins et al., 2020b). Inspired by MAML, a subsequent line of gradient-based meta-learning algorithms has been developed and widely used (Nichol et al., 2018; Al-Shedivat et al., 2017; Jerfel et al., 2018). Much follow-up work focused on the online setting, with accompanying regret bounds (Denevi et al., 2018; Finn et al., 2019; Khodak et al., 2019; Balcan et al., 2015; Alquier et al., 2017; Bullins et al., 2019; Pentina & Lampert, 2014).

The statistical analysis of meta-learning traces back to Baxter (2000); Maurer & Jaakkola (2005), which focused on inductive bias learning. Following the same setting, Amit & Meir (2018); Konobeev et al. (2020); Maurer et al. (2016); Pentina & Lampert (2014) fix a meta-distribution for source task sampling, and measure the average generalization error/gap over the meta-distribution. Another line of work connects the target performance to the source using distance measures between distributions (Ben-David & Borbely, 2008; Ben-David et al., 2010; Mohri & Medina, 2012). Finally, another series of works studied the benefits of using additional side information provided with a task for specializing the parameters of an inner algorithm (Denevi et al., 2020, 2021).

The hardness of meta-learning has also been investigated under various settings. Recent work has studied meta-learning performance in worst-case settings (Collins et al., 2020b; Hanneke & Kpotufe, 2020a,b; Kpotufe & Martinet, 2018; Lucas et al., 2020). Hanneke & Kpotufe (2020a) provide a no-free-lunch result with a problem-independent minimax lower bound, while Konobeev et al. (2020) provide a problem-dependent lower bound in a simple linear setting.

2 General Setting

2.1 Notation

Let [n] := {1, . . . , n}. We denote the vector ℓ2-norm as ‖·‖_2, and the matrix Frobenius norm as ‖·‖_F. Additionally, ⟨·, ·⟩ can denote either the Euclidean inner product or the Frobenius inner product between matrices. For a matrix A, we let σ_i(A) denote its i-th largest singular value. Additionally, for positive semidefinite A, we write λ_max(A) and λ_min(A) for its largest and smallest eigenvalues, and A^{1/2} for its principal square root. We write P_A for the projection onto the column span of A, denoted Col A, and P_A^⊥ := I − P_A for the projection onto its complement. We use standard O, Θ, and Ω notation to denote orders of growth. We also use a ≲ b or b ≳ a to indicate that a = O(b). Finally, we write a ≍ b for a = Θ(b).

2.2 Problem Setting

Preliminaries. Fix an input space X ⊆ R^d and label space Y ⊆ R. We assume that the learner has access to T source tasks. Each task t ∈ [T] is associated with a distribution µ_t over the set X × Y of input-label pairs. From each µ_t, the learner observes n_S i.i.d. samples {(x_{i,t}, y_{i,t}) | i ∈ [n_S]}, which we collect into a matrix X_t ∈ R^{n_S × d} and a vector y_t ∈ R^{n_S}. Finally, we measure learner performance using the loss function ℓ : R × Y → R. We aim to find common structure among the source tasks that could be leveraged for future tasks. In this work, such common structure is defined using representations, i.e. mappings φ : X → Z from the input space to a latent space Z ⊆ R^k.
These representations lie in some function class parameterized by a normed space. However, unlike prior work, we do not assume that all tasks use a single representation. Instead, there is a fixed θ_0 such that the predictor for task t can be written as x ↦ ⟨w_t, φ_{θ_t}(x)⟩, with θ_t lying near θ_0 (that is, ‖θ_t − θ_0‖ is small).

ADAPTREP Procedure. Following the discussion above, consider the objective

min_{θ_0} ∑_{t=1}^{T} min_{w_t, ‖θ_t − θ_0‖ ≤ δ_0} (1/(n_S T)) ∑_{i=1}^{n_S} ℓ(⟨w_t, φ_{θ_t}(x_{i,t})⟩, y_{i,t})    (1)

for some fixed δ_0 > 0, which will be the main focus of this work. We refer to (1) as ADAPTREP, as it finds an initialization φ_{θ_0} for which every task has a good representation nearby, ensuring ease of fine-tuning. (1) can be viewed as a constrained form of algorithms found in the literature such as iMAML (Rajeswaran et al., 2019) and Meta-MinibatchProx (Zhou et al., 2019). However, we do not use train-validation splits, as is widespread in practice. This is motivated by results in Bai et al. (2020), which show that data splitting may be undesirable, assuming realizability. Furthermore, empirical evaluations have demonstrated successes despite the lack of such splits (Zhou et al., 2019).

FROZENREP Procedure. As a baseline, we will also be studying the ubiquitous frozen representation objective, which is analyzed in detail in Du et al. (2020); Tripuraneni et al. (2020b). Formally, this objective can be written as follows:

min_{θ_0} ∑_{t=1}^{T} min_{w_t} (1/(n_S T)) ∑_{i=1}^{n_S} ℓ(⟨w_t, φ_{θ_0}(x_{i,t})⟩, y_{i,t}).    (2)

Since φ_{θ_0} is frozen for all source tasks, we refer to the method above as FROZENREP. In Sections 3.4 and G, we will prove that unlike ADAPTREP, there are cases where FROZENREP is unable to leverage the latent structure in the presence of misspecifications.

Learning Target Tasks. Let θ_0 be a solution to (1) or (2). The learner obtains n_T i.i.d. samples {(x_i, y_i)} from a new task with distribution µ, which we collect into X ∈ R^{n_T × d} and y ∈ R^{n_T} as before. We adapt to the target task using θ_0 by solving

min_{w, ‖θ − θ_0‖ ≤ δ} (1/n_T) ∑_{i=1}^{n_T} ℓ(⟨w, φ_θ(x_i)⟩, y_i)    (3)

for some δ ≥ 0. This learned predictor is evaluated via its population loss over µ, which is given by E_{(x,y)∼µ}[ℓ(⟨w, φ_θ(x)⟩, y)]. We focus on the few-shot learning setting for the target task, where n_T is small, and a learner needs to effectively use the source tasks to sufficiently learn the target task.
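To make the difference between the two objectives concrete, the following sketch instantiates (1), (2), and (3) in the linear representation case of Section 3 using NumPy. This is our own illustration and not the paper's implementation: the function names are ours, the constraints ‖Δ_t‖_F ≤ δ_0 are left to an external projection step, and the paper assumes an optimization oracle for the source objectives rather than any particular solver.

```python
# Illustrative sketch (ours) of objectives (1)-(3) in the linear setting phi_B(x) = B^T x.
# The norm constraints on the per-task changes Delta_t are assumed to be enforced
# separately (e.g., by projection); they are not encoded here.
import numpy as np

def adaptrep_source_loss(B0, Deltas, Ws, Xs, ys):
    """Objective (1): each task t may fine-tune the shared initialization B0 by Delta_t."""
    T = len(Xs)
    total = 0.0
    for t in range(T):
        theta_t = (B0 + Deltas[t]) @ Ws[t]            # per-task predictor in R^d
        total += 0.5 * np.mean((ys[t] - Xs[t] @ theta_t) ** 2)
    return total / T

def frozenrep_source_loss(B0, Ws, Xs, ys):
    """Objective (2): all tasks share the frozen representation B0."""
    T = len(Xs)
    total = 0.0
    for t in range(T):
        total += 0.5 * np.mean((ys[t] - Xs[t] @ (B0 @ Ws[t])) ** 2)
    return total / T

def target_loss(B0, Delta, w, X, y):
    """Objective (3): adapt both the representation and the head on the target task."""
    return 0.5 * np.mean((y - X @ ((B0 + Delta) @ w)) ** 2)

# Tiny smoke test with synthetic shapes (d = 20 inputs, k = 3 latent dims, T = 5 tasks).
d, k, T, n = 20, 3, 5, 50
rng = np.random.default_rng(0)
B0 = rng.normal(size=(d, k)) / np.sqrt(d)
Xs = [rng.normal(size=(n, d)) for _ in range(T)]
Ws = [rng.normal(size=k) for _ in range(T)]
Deltas = [np.zeros((d, k)) for _ in range(T)]
ys = [X @ (B0 @ w) + 0.1 * rng.normal(size=n) for X, w in zip(Xs, Ws)]
print(adaptrep_source_loss(B0, Deltas, Ws, Xs, ys),
      frozenrep_source_loss(B0, Ws, Xs, ys))
```

With all Δ_t set to zero the two source losses coincide; the objectives differ precisely in whether the inner minimization is allowed to move the representation.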
Statistical Assumptions and Optimization Oracles. For ADAPTREP to be sensible, we need to ensure that a desirable initialization exists. Accordingly, we assume that for some θ*_0, every task t is associated with (θ*_t, w*_t) where ‖θ*_t − θ*_0‖ ≤ δ_0, so that the associated distribution µ_t is given by x ∼ p, y | x ∼ q(· | ⟨w*_t, φ_{θ*_t}(x)⟩). Here, p is an input distribution (the same for all tasks), while q models label noise. Specifically, in regression, we set E[y | x] = ⟨w*_t, φ_{θ*_t}(x)⟩. With an appropriate choice of loss function ℓ, we can guarantee that the optimal predictor under the population loss is x ↦ ⟨w*_t, φ_{θ*_t}(x)⟩.

Throughout the paper, we assume access to an oracle for solving (1) or (2), similar to Du et al. (2020); Tripuraneni et al. (2020b). For detailed analyses of source training optimization, we refer the reader to Ji et al. (2020); Wang et al. (2020). On the other hand, representation fine-tuning introduces nonconvexity during target time not present in prior work, where one simply optimizes a final linear layer (a convex problem). Thus, our bounds explicitly incorporate optimization performance on (3). We focus on using projected gradient descent (PGD), which applies to a wide variety of settings, under certain loss landscape assumptions. These standalone results are also provided in Section H.

3 ADAPTREP in the Linear Setting

We first examine ADAPTREP in the linear setting to illustrate key ideas. Here, representations are linear transformations R^d → R^k for d > k, so that φ_B(x) = B^⊤x, with distances measured by the Frobenius norm. In this setting, we provide a performance bound for ADAPTREP, and then exhibit a specific instance for which FROZENREP fails, establishing our claimed sample complexity gap.

3.1 Statistical Assumptions

Figure 1: Illustrating the geometry of task parameterization in the linear representation setting.

Data Sampling. We specialize the statistical assumptions of Section 2.2. Assume that the input distribution p is zero-mean and has covariance Σ. Let κ = λ_max(Σ)/λ_min(Σ) be the condition number of Σ. As in Du et al. (2020), we impose the following tail condition on p:

Assumption 3.1. There exists ρ > 0 such that if x ∼ p, then Σ^{−1/2}x is ρ²-sub-Gaussian. (A zero-mean random vector v is ρ²-sub-Gaussian if for any fixed u, E[exp(λ v^⊤u)] ≤ exp(λ² ‖u‖²_2 ρ²/2).)

This assumption guarantees probabilistic tail bounds for our proofs, and can be replaced with other suitable tail conditions. Finally, we define q(· | µ) = N(µ, σ²) for a fixed σ > 0.

Parameterizing the Tasks. Let B* be the initial representation, and Δ*_t the task-specific change for task t, where ‖Δ*_t‖_F ≤ δ_0. Furthermore, let w*_t ∈ R^k be the task-specific weights. The predictor for task t is then given by x ↦ ⟨x, (B* + Δ*_t)w*_t⟩. We define θ*_t = (B* + Δ*_t)w*_t, and combine the (w*_t) into a matrix W* = [w*_1, . . . , w*_T] ∈ R^{k × T}. We illustrate the parameterization in Figure 1.

Assumption 3.2 (Source task diversity). For any t ∈ [T], ‖w*_t‖_2 = Θ(1), and σ²_k(W*) = Ω(T/k).

Since ∑_{i∈[k]} σ²_i(W*) = ‖W*‖²_F = Θ(T) by Assumption 3.2, the bound on σ²_k(W*) implies that the weights cover all directions in R^k roughly evenly. This condition is satisfied with high probability when the w*_t are sampled from a sub-Gaussian distribution with well-conditioned covariance.

Note that θ*_t = B*w*_t + δ*_t, where ‖δ*_t‖_2 = ‖Δ*_t w*_t‖_2 = O(δ_0) by Assumption 3.2. Thus, the predictor weights lie near the rank-k subspace determined by B*, as illustrated by Figure 1. As a convention, since the parameterization is not unique, we require (B*)^⊤ Σ δ*_t = 0 for any t ∈ [T]. This is without loss of generality, as we can always redefine the relevant quantities as needed. Finally, we evaluate the learner on a target task θ* := B*w* + δ* for some w* and ‖δ*‖_2 ≤ δ_0.

3.2 Training Procedure

Source training. We consider the following regularized form of (1):

min_B ∑_{t=1}^{T} min_{Δ_t, w_t} [ (1/(2 n_S T)) ‖y_t − X_t(B + Δ_t)w_t‖²_2 + (λ/2)‖Δ_t‖²_F + (γ/2)‖w_t‖²_2 ].    (4)

In Section B, we show that the regularization is equivalent to regularizing √(λγ)‖Δ_t w_t‖_2, consistent with the intuition that Δ*_t w*_t has small norm. This additional regularization is necessary, since (1) only controls the norm of Δ_t, which is insufficient for controlling Δ_t w_t.

Target training. Let B_0 be the output of (4) after orthonormalizing. We adapt to the target task via

L_β(Δ, w) = (1/(2 n_T)) ‖y − βX(A_{B_0} + Δ)(w_0 + w)‖²_2,    (5)

where A_{B_0} := [B_0, B_0] ∈ R^{d × 2k} and w_0 = [u, −u] for a fixed unit-norm vector u ∈ R^k. This corresponds to training a predictor of the form x ↦ β⟨x, (A_{B_0} + Δ)(w_0 + w)⟩. We optimize (5) by performing T_PGD steps of PGD with stepsize η, using C_β := {(Δ, w) | ‖Δ‖_F ≤ c_1/β, ‖w‖_2 ≤ c_2/β} as the feasible set, where we explicitly define c_1 and c_2 in Section B.

To understand the procedure, note that the predictor has parameter θ = β(A_{B_0}w + Δw_0 + Δw), since A_{B_0}w_0 = 0. (This symmetrization via A_{B_0} is commonly used in the neural tangent kernel literature (Chizat et al., 2019; Zhang et al., 2020), allowing us to set f_{θ_0}(x) ≡ 0 while ensuring A_{B_0}w ∈ Col B_0 for any w.) Within C_β, the first two terms have O(1) norm, while the cross term βΔw has norm O(1/β). Thus, for large β, the cross term Δw is a negligible perturbation, and the predictor is approximately linear in (Δ, w). Consequently, L_β is approximately convex in (Δ, w) within C_β, allowing PGD to find a near-optimal solution in the constraint set.
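A minimal sketch of the target-time procedure follows, assuming the symmetrization convention described above; the step size, iteration count, and the radii c_1/β and c_2/β are illustrative placeholders for the values specified in Section B, and the sketch is ours rather than the paper's reference code.

```python
# Minimal PGD sketch (ours) for the target objective (5) over the feasible set C_beta.
import numpy as np

def project_ball(M, radius):
    """Project onto a Frobenius/L2 ball of the given radius."""
    nrm = np.linalg.norm(M)
    return M if nrm <= radius else M * (radius / nrm)

def finetune_target(B0, X, y, beta, c1, c2, eta=0.1, n_steps=500):
    n, d = X.shape
    k = B0.shape[1]
    A = np.hstack([B0, B0])                      # symmetrized initialization A_{B0} in R^{d x 2k}
    u = np.zeros(k); u[0] = 1.0                  # any fixed unit-norm vector u
    w0 = np.concatenate([u, -u])                 # chosen so that A @ w0 = 0 (predictor starts at 0)
    Delta = np.zeros((d, 2 * k)); w = np.zeros(2 * k)
    for _ in range(n_steps):
        r = beta * X @ ((A + Delta) @ (w0 + w)) - y          # residuals on the target samples
        grad_Delta = (beta / n) * X.T @ np.outer(r, w0 + w)  # dL/dDelta
        grad_w = (beta / n) * (A + Delta).T @ (X.T @ r)      # dL/dw
        Delta = project_ball(Delta - eta * grad_Delta, c1 / beta)
        w = project_ball(w - eta * grad_w, c2 / beta)
    return beta * (A + Delta) @ (w0 + w)         # learned target parameter theta-hat
```

Under the scaling discussed above, a large β keeps the cross term Δw negligible inside C_β, which is what lets plain PGD behave as if the problem were convex over the feasible set.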
3.3 Performance Bound

We now provide a bound on the performance of the algorithm proposed in Section 3.2. We define the following rates of interest, where R_Σ(w*, δ_0) > 0 (log factors and non-dominant terms are suppressed for clarity; the full rates are presented in the appendix):

r_S(n_S, T) := σ²kd/(n_S T) + σδ_0‖Σ^{1/2}‖_2 √(kd/(n_S T)) + σ²k/n_S + σδ_0√(tr Σ)/√n_S,

r_T(n_T) := σ²k/n_T + σδ_0√(tr Σ)/√n_T + min{ k·r_S(n_S, T), R_Σ(w*, δ_0)·√(k·r_S(n_S, T))·σ/√n_T }.

Theorem 3.1 (Performance guarantee, linear representations). Assume that Assumptions 3.1 and 3.2 hold, n_S ≳ ρ⁴(d + log(T/δ)), and n_T ≳ ρ⁴(d + log(1/δ)). Then there are (λ, γ, β, T_PGD, η, c_1, c_2) (specified in Section B) such that the training procedure in Section 3.2, with high probability, finds θ̂ achieving excess risk bounded as

E[(x^⊤θ* − x^⊤θ̂)²] ≲ min{ r_T(n_T), σ²d/n_T }.

We prove this result in Section B. Nevertheless, we give an interpretation of r_S and r_T below.

Source Training. The rate r_S represents the performance of the learner during source training. Despite misspecifications, the algorithm is able to pool the aggregate n_S T samples to learn a common (kd-dimensional) initialization, as shown by the first two terms. The last two terms represent the rate achieved on each task upon inferring the initialization, and therefore decay only with n_S. The first is the cost of learning w*_t, while the second is the cost of learning δ*_t.

Target Training. The rate r_T represents the performance of the learner with fine-tuning. The first two terms and their interpretations are analogous to those of the last two terms in r_S. On the other hand, the minimum indicates two ways to handle the representation error from source training:

Ignore source error. In the most data-starved regime, we ignore the finite-sample source error, restricting fine-tuning to only learn δ*. Thus, we incur an irreducible error in the form of k·r_S(n_S, T). (For average target performance, k·r_S(n_S, T) can be replaced by r_S(n_S, T); see Section F and Du et al. (2020).)

Fix source error. Alternatively, with more data, we can use fine-tuning to fix the source representation error. This additional complexity is captured by the second argument, indicating fine-tuning to a vector with norm R_Σ(w*, δ_0)·√(k·r_S(n_S, T)). Note that the additional norm shrinks to 0 as n_S → ∞.

Finally, by ignoring the representation B_0 completely, we can obtain the alternative trivial σ²d/n_T rate, which matches the minimax lower bound for standard linear regression (see, e.g., Duchi (2016)).
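Since p is zero-mean with covariance Σ, the excess risk appearing in Theorem 3.1 is exactly the squared Σ-norm of the parameter error. Spelling this out (a one-line computation included for reference, as the Σ-norm reappears in Section 3.4):

E_x[(x^⊤θ* − x^⊤θ̂)²] = E_x[(θ* − θ̂)^⊤ x x^⊤ (θ* − θ̂)] = (θ* − θ̂)^⊤ Σ (θ* − θ̂) = ‖θ* − θ̂‖²_Σ.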
3.4 A Hard Case for FROZENREP

In this section, we present a sample complexity separation between ADAPTREP and FROZENREP in the linear setting. We provide the formal construction and an experimental verification of the gap in Section C. Nevertheless, we provide the following informal result:

Theorem 3.2 (FROZENREP–ADAPTREP sample complexity separation). Assume k = Θ(1) ≪ d. There exists a family of task distributions T satisfying the conditions of Section 3.1 such that, with high probability over the draw of n_T ≲ d target samples, we have the minimax bound

min_{ŵ, δ̂} max_{τ ∈ T} max_{θ* ∈ supp τ} E[(1/n_T)‖X(θ* − B̂ŵ − δ̂)‖²_2] ≳ σ²d/n_T,

where FROZENREP is provided with infinitely many source tasks and per-task samples from τ to learn B̂, and n_T samples from θ* to learn (ŵ, δ̂). In contrast, by specializing Theorem 3.1 for the same family, ADAPTREP (with only finitely many source tasks and per-task samples) achieves excess risk bounded as

E[(x^⊤θ* − x^⊤θ̂)²] ≲ min{ σ/√n_T, σ²d/n_T }.

Interpreting the minimax result. Observe that the σ²d/n_T rate is achievable by performing standard linear regression directly on the target samples, with no other information. Thus, FROZENREP fails to capture the shared structure from source tasks, in the worst case. Furthermore, in the high-dimensional setting where n_T ≪ d, there exists a strict separation between ADAPTREP and FROZENREP that widens as d/n_T grows. The minimax bound applies regardless of the target fine-tuning procedure in use, including those used in practice, e.g. iMAML, MetaOptNet (Lee et al., 2019), and R2D2 (Bertinetto et al., 2019). As FROZENREP is provided with infinitely many source tasks and samples, this failure is thus entirely due to the representation learning algorithm. Altogether, we have established a case where incorporating representation fine-tuning is provably sufficient for handling misspecifications, unlike ubiquitous frozen representation objectives.

3.4.1 Intuition

In this section, we provide a brief intuition behind our construction. Recall that in the linear setting, FROZENREP uses the following objective for obtaining a representation:

B̂ = argmin_B ∑_{t∈[T]} min_{w_t} (1/(2 n_S T)) ‖y_t − X_t B w_t‖²_2  →(n_S → ∞)→  argmin_B (1/T) ∑_{t∈[T]} min_{w_t} ‖θ*_t − B w_t‖²_Σ.    (6)

By optimizing with respect to the Σ-norm, FROZENREP seeks a k-dimensional subspace that captures as much of the predictive signal across all source tasks as possible. However, in general, the Σ-norm is not aligned with the parameter norm. In other words, a small ‖θ − Bw‖_Σ does not imply that ‖θ − Bw‖_2 is small, and thus it may be the case that θ − Bw is difficult to learn during fine-tuning.

Figure 2: Prototypical hard case. While Δ*_t w*_t is smaller (left figure), FROZENREP learns Δ*_t w*_t since it has a larger predictive signal (right figure).

A prototypical example of this difference is illustrated by Figure 2, where k = 1. The dashed red vector has small parameter norm, and is thus easily learnable by fine-tuning (as demonstrated by norm-based generalization bounds). Thus, a meta-learner should opt to learn the space captured by the solid green vector. However, if Σ is structured so that ‖Δ*_t w*_t‖_Σ ≫ ‖B* w*_t‖_Σ, then FROZENREP will learn the space spanned by the dashed red vector instead. By appropriately tuning the ratio of the lengths of the two vectors, we obtain the desired result. We provide a formal construction in Section C.1.1, including the extension to general k.
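The mismatch between the Σ-norm and the parameter norm is easy to see numerically. The toy numbers below are our own illustration of the picture in Figure 2, not the formal construction of Section C.1.1.

```python
# Toy illustration (ours) of the Sigma-norm vs. parameter-norm mismatch behind Figure 2.
import numpy as np

Sigma = np.diag([100.0, 1.0])            # input covariance, heavily skewed

v_red = np.array([0.3, 0.0])             # small parameter norm, aligned with the large eigenvalue
v_green = np.array([0.0, 2.0])           # large parameter norm, aligned with the small eigenvalue

def sigma_norm(v):
    return float(np.sqrt(v @ Sigma @ v))

print(np.linalg.norm(v_red), sigma_norm(v_red))      # parameter norm 0.3, Sigma-norm 3.0
print(np.linalg.norm(v_green), sigma_norm(v_green))  # parameter norm 2.0, Sigma-norm 2.0
```

Although v_red is the shorter vector, it carries the larger predictive signal under Σ, so an objective that only measures Σ-norm residuals allocates its k-dimensional subspace to it, leaving the long v_green direction to be learned from scratch at target time.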
4 ADAPTREP in the Nonlinear Setting

We now describe a framework for analyzing fine-tuning in general function classes. To simplify notation, we modify the setting in Section 2.2 so that both the representation φ and the predictor weights w are captured by a single parameter θ ∈ Θ, with corresponding predictor g_θ.

Loss Notation. We denote the population and excess risk of a predictor g with respect to the true predictor g* as

L(g, g*) := E_{x∼p} E_{y∼q(·|g*(x))}[ℓ(g(x), y)]  and  L_ex(g, g*) := L(g, g*) − L(g*, g*),

respectively. Additionally, we let L̂(g, g*) denote the corresponding finite-sample quantity (we omit the samples from the notation for brevity).

Feasible Predictors. For an initialization θ and a fine-tuning set C ⊆ Θ, we define the set

A_C(θ) := { g_{θ'} | θ' − θ ∈ C }.

Intuitively, this set contains all feasible predictors given an initialization θ and feasible fine-tuning set C. For convenience, we also define the set

[A_C(θ)]^T := { (x_t)_{t∈[T]} ↦ (g_t(x_t))_{t∈[T]} | g_1, . . . , g_T ∈ A_C(θ) },

which is a collection of functions mapping X^T → Y^T. This set represents all feasible T-tuples of predictors, for a given initialization θ and fine-tuning set C.

4.1 Training Procedure

Figure 3: An illustration of source training.

Source Training. Fix a set of possible initializations Θ_0 ⊆ Θ, and a fine-tuning set C_S. With n_S samples for each task (with optimal predictor g*_t), we consider the objective

θ_0 ∈ argmin_{θ∈Θ_0} ∑_{t=1}^{T} min_{g_t ∈ A_{C_S}(θ)} L̂(g_t, g*_t).    (7)

Note that this corresponds to the ADAPTREP objective in (1) using the combined parameter θ. We illustrate the objective in Figure 3.

Target Training. Fix a fine-tuning set C_T (not necessarily equal to C_S). Given n_T samples from a new task (with true predictor g*), we run PGD on L̂(δ) := L̂(g_{θ_0+δ}, g*) with feasible set C_T. In this process, PGD is run for T_PGD timesteps with step size η.

4.2 Assumptions

Figure 4: Assumption 4.1. If A_C(θ) has low complexity, then fine-tuning quickly finds good task-specific parameters for all tasks.

We rephrase the statistical assumption from Section 2.2 in terms of the feasible sets A_C(θ) as follows:

Assumption 4.1. There exists θ*_0 ∈ Θ_0 such that for every t ∈ [T], g*_t ∈ A_{C_S}(θ*_0).

We illustrate Assumption 4.1 in Figure 4. We also impose the following standard regularity conditions on the loss function ℓ:

Assumption 4.2. For any y ∈ Y, ℓ(·, y) is 1-Lipschitz and convex, and |ℓ(0, y)| ≤ B. (The Lipschitz constant is not restrictive, as one can simply rescale the loss; it is assumed for simplicity of presentation.)

Task Diversity. To ensure transfer from source to target, we impose the following condition, a specific instance of which was proposed by Du et al. (2020) in the linear setting, and by Tripuraneni et al. (2020b) for general settings:

Assumption 4.3 (Source tasks are (ν, ε)-diverse). There exist constants (ν, ε) such that if ρ is the distribution of target tasks, then for any θ ∈ Θ_0,

E_{g*∼ρ}[ inf_{g ∈ A_{C_T}(θ)} L_ex(g, g*) ] ≤ (1/(νT)) ∑_{t=1}^{T} inf_{g ∈ A_{C_S}(θ)} L_ex(g, g*_t) + ε.

The (ν, ε)-diversity assumption connects the average source task performance (right) to the average target performance (left). Thus, it ensures that optimizing (7) results in controlled target performance. We weakened the condition in Tripuraneni et al. (2020b) to bound average target performance rather than worst-case, as is more suitable for higher-dimensional settings.

Target Loss Landscape. Finally, we impose conditions ensuring that PGD finds a near-optimal solution in C_T. We remark that these assumptions are specific to our choice of fine-tuning algorithm.

Assumption 4.4 (Approximate linearity in fine-tuning). Let x_1, . . . , x_{n_T} be the set of target inputs. Then, there exist β, L such that for all θ ∈ Θ_0 and δ ∈ C_T,

∑_{i=1}^{n_T} ‖∇²_θ g_{θ+δ}(x_i)‖²_2 ≤ β²  and  sup_{θ∈Θ_0} ∑_{i=1}^{n_T} ‖∇_θ g_θ(x_i)‖²_2 ≤ L².

Assumption 4.5. sup_{δ∈C_T} ‖δ‖_2 ≤ R for some R.

4.3 Performance Bound

Recall that for a set H of functions R^d → R^k, its Rademacher complexity R_n(H) on n samples is given by

R_n(H) := E_{ε,X}[ sup_{h∈H} (1/n) ∑_{i=1}^{n} ∑_{j=1}^{k} ε_{ij} h_j(x_i) ],

where the (ε_{ij}) are i.i.d. Rademacher random variables and the (x_i) are i.i.d. samples from some (preset) distribution.
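As a concrete reference point, the empirical Rademacher complexity of a norm-bounded linear class can be estimated directly by Monte Carlo. The sketch below is our own illustration of the definition above (for scalar-valued predictors, k = 1), not part of the paper's analysis.

```python
# Monte Carlo estimate (ours) of the empirical Rademacher complexity of
# {x -> <w, x> : ||w||_2 <= W}. For this class the supremum has the closed form
# W * || (1/n) * sum_i eps_i x_i ||_2, so no inner optimization is needed.
import numpy as np

def rademacher_linear(X, W, n_draws=2000, seed=None):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = []
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=n)
        vals.append(W * np.linalg.norm(X.T @ eps) / n)   # sup_w <w, (1/n) sum_i eps_i x_i>
    return float(np.mean(vals))

rng = np.random.default_rng(0)
for n in [50, 200, 800]:
    X = rng.normal(size=(n, 20))
    print(n, rademacher_linear(X, W=1.0, seed=1))        # decays roughly like sqrt(d/n)
```

This 1/√n decay is the behavior inherited by the complexity terms in Theorem 4.1 below.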
We now proceed with our performance guarantee.

Theorem 4.1 (General performance bound). Assume that all assumptions in Section 4.2 hold. Let (θ_t) be the set of iterates generated by PGD following the procedure in Section 4.1 (step size η specified in Section D). Then, with probability at least 1 − δ over the random draw of samples,

E_{g*∼ρ}[ min_t L_ex(g_{θ_t}, g*) ] ≲ ( βR² + RL/√T_PGD ) + ( sup_{θ∈Θ_0} R_{n_T}(A_{C_T}(θ)) + B√(log(1/δ)/n_T) ) + (1/ν)( R_{n_S}(∪_{θ∈Θ_0} [A_{C_S}(θ)]^T) + B√(log(1/δ)/(n_S T)) ) + ε,

where the three parenthesized groups are the optimization error ε_OPT, the estimation error ε_EST, and the representation error ε_REPR, respectively. Note that the R_{n_T} complexity term samples from p, while the R_{n_S} complexity term samples from p^T, which concatenates T i.i.d. samples from p (one for each task) for every draw. We prove Theorem 4.1 in Section D. The Rademacher complexity terms decay in most settings as

R_{n_S}(∪_{θ∈Θ_0} [A_{C_S}(θ)]^T) = O( C(Θ_0)/√(n_S T) + diam C_S/√n_S )  and  sup_{θ∈Θ_0} R_{n_T}(A_{C_T}(θ)) = O( diam C_T/√n_T ),

where C(Θ_0) measures the complexity of Θ_0, and diam C_S and diam C_T measure the sizes of the fine-tuning sets C_S and C_T.

Understanding the Bound. We briefly outline how the assumptions in Section 4.2 contribute to the final bound. Let θ_OPT denote the best solution found by PGD, and define the parameters

θ_ERM := argmin_{g_θ ∈ A_{C_T}(θ_0)} L̂(g_θ, g*)  and  θ̄ := argmin_{g_θ ∈ A_{C_T}(θ_0)} L(g_θ, g*).

That is, θ_ERM is the minimizer of the empirical risk while θ̄ is the minimizer of the population risk, both within A_{C_T}(θ_0).

Figure 5: Illustrating the excess risk bound in Theorem 4.1.

Optimization error (ε_OPT). The difference in performance between θ_OPT and θ_ERM, i.e. the error due to PGD, is controlled by the assumptions on the target loss landscape.

Estimation error (ε_EST). By uniform convergence, θ_ERM performs similarly to θ̄, with the difference bounded using Rademacher complexity. By extension, θ_OPT performs similarly to θ̄.

Representation/approximation error (ε_REPR). The average best-case excess risk (i.e., the excess risk of θ̄) is bounded via the performance of the initialization θ_0 on the source tasks, using (ν, ε)-diversity. Since (7) is a finite-sample estimate of the desired bound, we again use uniform convergence arguments, which manifest as the second Rademacher complexity term.

We have included an illustration of these three components in Figure 5.

5 Case Study: Two-Layer Neural Networks

We now illustrate our framework by instantiating our bounds in concrete settings. We focus on specializing Theorem 4.1 to two-layer neural networks (this section) and logistic regression (provided in the appendix, Section F). For this section, we fix an activation function σ : R → R.

Assumption 5.1. For any t ∈ [−2, 2], |σ'(t)| ≤ L and |σ''(t)| ≤ µ. Furthermore, σ(0) = 0.

Then, for a constant β and θ = (B, w), where B ∈ R^{d × 2k} and w ∈ R^{2k}, we define the two-layer neural network f^β_θ(x) = βw^⊤σ(B^⊤x), where σ is applied elementwise.
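For concreteness, here is a sketch of f^β_θ together with the activation features φ_B and gradient features ψ_{B,w_0} that appear in the linearization used in Section 5.1 below. The feature formulas follow from differentiating f^β_θ at (B, w_0); the code is our own illustration (tanh is just a concrete choice of smooth activation), not the paper's construction.

```python
# Two-layer network f_theta^beta(x) = beta * w^T sigma(B^T x) and its linearization
# features at (B, w0): activation features phi_B(x) = sigma(B^T x) (gradient in w)
# and gradient features psi_{B,w0}(x) = x (w0 * sigma'(B^T x))^T (gradient in B).
import numpy as np

def sigma(z):       return np.tanh(z)
def sigma_prime(z): return 1.0 - np.tanh(z) ** 2

def f(x, B, w, beta):
    return beta * w @ sigma(B.T @ x)

def activation_features(x, B):
    return sigma(B.T @ x)                               # phi_B(x) in R^{2k}

def gradient_features(x, B, w0):
    return np.outer(x, w0 * sigma_prime(B.T @ x))       # psi_{B,w0}(x) in R^{d x 2k}

def f_linearized(x, B, w0, Delta, w, beta):
    """First-order prediction for parameters (B + Delta, w0 + w) around (B, w0)."""
    return beta * (w @ activation_features(x, B)
                   + np.sum(gradient_features(x, B, w0) * Delta))

d, k = 10, 3
rng = np.random.default_rng(0)
x = rng.normal(size=d)
B_half = rng.normal(size=(d, k)) / np.sqrt(d)
B = np.hstack([B_half, B_half])                 # duplicated columns, so that...
u = np.zeros(k); u[0] = 1.0
w0 = np.concatenate([u, -u])                    # ...f(x; B, w0) = 0 exactly (symmetrization)
Delta = 0.01 * rng.normal(size=(d, 2 * k)); w = 0.01 * rng.normal(size=2 * k)
beta = 10.0
print(f(x, B + Delta, w0 + w, beta))            # true network output
print(f_linearized(x, B, w0, Delta, w, beta))   # close to it when Delta and w are small
```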
5.1 Statistical Assumptions

Data Sampling. For all tasks, the inputs are sampled from a 1-norm-bounded distribution p. Furthermore, q(· | µ) is generated as µ + η for some O(1)-bounded additive noise η, similar to Tripuraneni et al. (2020b).

Parameterizing the Tasks. We fix a representation B* ∈ R^{d × 2k} and a linear predictor w*_0 ∈ R^{2k}. In this setting, we consider representation changes lying in a k-dimensional subspace, determined by a set of matrices Δ_1, . . . , Δ_k in R^{d × 2k} that is orthonormal with respect to the Frobenius (i.e. entrywise) inner product. Then, for each t ∈ [T], we have two unit-norm vectors w*_t, δ*_t so that task t has parameter

( B* + ∑_{i∈[k]} (δ*_t)_i Δ_i ,  w*_0 + (1/β) w*_t ).

We assume a diversity condition on (w*_t, δ*_t) similar in spirit to the linear version in Assumption 3.2; see Assumption E.2. With appropriate assumptions on the initialization and large enough β, the source task predictors behave like their linearizations, i.e. there exist feature maps φ_{B*} and ψ_{B*,w*_0} so that

f_{θ*_t}(x) ≈ φ_{B*}(x)^⊤ w*_t + ⟨ ψ_{B*,w*_0}(x), ∑_{i∈[k]} (δ*_t)_i Δ_i ⟩.

We make use of these assumptions throughout for our result; see Assumption E.1 for the formal statement. Note that φ_{B*} and ψ_{B*,w*_0} correspond to the activation and gradient features, respectively, that are empirically evaluated by Mu et al. (2020).

5.2 Training Procedure

We use the squared error loss to train in this setting. Additionally, we use the training procedure in Section 4.1 with the function class {f^γ_θ}, with a minor caveat: we set γ = β during source training, and γ = O(√n_T) during target training. With the procedure above, we now define the relevant feasible sets. We set Θ_0 to be the set of initializations satisfying Assumption E.1. Finally, we set the constraint sets

C_S = {(Δ, w) | ‖Δ‖_F ≤ 1, ‖w‖_2 ≤ 1/β}  and  C^γ_T = {(Δ, w) | ‖Δ‖²_F + ‖w‖²_2 ≤ O(1/γ²)}.

5.3 Performance Guarantee

With the above assumptions in hand, we proceed with the performance guarantee.

Theorem 5.1 (Neural net performance bound). Assume that Assumptions 5.1, E.1 and E.2 hold. Then, if n_S ≳ n_T, there exists a setting of the training parameters (see Section E) such that with probability at least 1 − δ, the iterates (θ_t) satisfy

E_{f*∼ρ}[ min_t L_ex(f^γ_{θ_t}, f*) ] ≲ L²√(k/n_T) + L(L + µ)·k^{3/2}/√(n_S T) + (µ + L)·√(k/n_S),

where log factors have been suppressed for clarity.

6 Conclusion

We have presented, to the best of our knowledge, the first statistical analysis of fine-tuning-based meta-learning. We demonstrate the success of such algorithms under the assumption of approximately shared representations between available tasks. In contrast, we show that the frozen representation objectives analyzed by prior work fail under this weaker assumption. An interesting line of future work is to determine ways to formulate useful shared structure among MDPs, i.e. formulate settings for which meta-reinforcement learning succeeds and results in improved regret bounds for downstream tasks.

7 Acknowledgements

KC is supported by a National Science Foundation Graduate Research Fellowship, Grant DGE-2039656. QL is supported by NSF #2030859 and the Computing Research Association for the CIFellows Project. JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF 2002272, NSF IIS 2107304, and an ONR Young Investigator Award.

References

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.

Alquier, P., Pontil, M., et al. Regret bounds for lifelong learning. In Artificial Intelligence and Statistics, pp. 261–269. PMLR, 2017.

Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, pp. 205–214. PMLR, 2018.

Antoniou, A., Edwards, H., and Storkey, A. How to train your MAML. arXiv preprint arXiv:1810.09502, 2018.
Bai, Y., Chen, M., Zhou, P., Zhao, T., Lee, J. D., Kakade, S., Wang, H., and Xiong, C. How important is the train-validation split in meta-learning? arXiv preprint arXiv:2010.05843, 2020.

Balcan, M.-F., Blum, A., and Vempala, S. Efficient representations for lifelong learning and autoencoding. In Conference on Learning Theory, pp. 191–210. PMLR, 2015.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

Ben-David, S. and Borbely, R. S. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine Learning, 73(3):273–287, 2008.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Bertinetto, L., Henriques, J. F., Torr, P. H. S., and Vedaldi, A. Meta-learning with differentiable closed-form solvers, 2019.

Bullins, B., Hazan, E., Kalai, A., and Livni, R. Generalize across tasks: Efficient algorithms for linear representation learning. In Algorithmic Learning Theory, pp. 235–246. PMLR, 2019.

Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.

Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32:2937–2947, 2019.

Collins, L., Mokhtari, A., and Shakkottai, S. Why does MAML outperform ERM? An optimization perspective. arXiv preprint arXiv:2010.14672, 2020a.

Collins, L., Mokhtari, A., and Shakkottai, S. Task-robust model-agnostic meta-learning. Advances in Neural Information Processing Systems, 33, 2020b.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Incremental learning-to-learn with statistical guarantees. arXiv preprint arXiv:1803.08089, 2018.

Denevi, G., Pontil, M., and Ciliberto, C. The advantage of conditional meta-learning for biased regularization and fine tuning. Advances in Neural Information Processing Systems, 33, 2020.

Denevi, G., Pontil, M., and Ciliberto, C. Conditional meta-learning of linear representations. arXiv preprint arXiv:2103.16277, 2021.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.

Duchi, J. Lecture notes for Statistics 311/Electrical Engineering 377, Stanford University, 2016. URL: https://stanford.edu/class/stats311/Lectures/full_notes.pdf.

Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics, pp. 1082–1092. PMLR, 2020.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning, 2017.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.

Hanneke, S. and Kpotufe, S. A no-free-lunch theorem for multitask learning. arXiv preprint arXiv:2006.15785, 2020a.

Hanneke, S. and Kpotufe, S. On the value of target data in transfer learning. arXiv preprint arXiv:2002.04747, 2020b.

Jerfel, G., Grant, E., Griffiths, T. L., and Heller, K. Online gradient-based mixtures for transfer modulation in meta-learning. arXiv preprint arXiv:1812.06080, 2018.
Ji, K., Lee, J. D., Liang, Y., and Poor, H. V. Convergence of meta-learning with task-specific adaptation over partial parameters, 2020.

Khodak, M., Balcan, M.-F. F., and Talwalkar, A. S. Adaptive gradient-based meta-learning methods. In Advances in Neural Information Processing Systems, pp. 5917–5928, 2019.

Konobeev, M., Kuzborskij, I., and Szepesvári, C. On optimality of meta-learning in fixed-design regression with weighted biased regularization. arXiv preprint arXiv:2011.00344, 2020.

Kpotufe, S. and Martinet, G. Marginal singularity, and the benefits of labels in covariate-shift. In Conference on Learning Theory, pp. 1882–1886. PMLR, 2018.

Ledoux, M. and Talagrand, M. Probability in Banach spaces: Isoperimetry and processes. 1991.

Lee, J. D., Lei, Q., Saunshi, N., and Zhuo, J. Predicting what you already know helps: Provable self-supervised learning. arXiv preprint arXiv:2008.01064, 2020.

Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Meta-learning with differentiable convex optimization, 2019.

Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Lucas, J., Ren, M., Kameni, I., Pitassi, T., and Zemel, R. Theoretical bounds on estimation error for meta-learning. arXiv preprint arXiv:2010.07140, 2020.

Maurer, A. and Jaakkola, T. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6(6), 2005.

Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32, 2016.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.

Mendonca, R., Gupta, A., Kralev, R., Abbeel, P., Levine, S., and Finn, C. Guided meta-policy search. In Advances in Neural Information Processing Systems, pp. 9656–9667, 2019.

Mohri, M. and Medina, A. M. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pp. 124–138. Springer, 2012.

Mu, F., Liang, Y., and Li, Y. Gradients as features for deep representation learning. arXiv preprint arXiv:2004.05529, 2020.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Pentina, A. and Lampert, C. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pp. 991–999. PMLR, 2014.

Raghu, M., Zhang, C., Kleinberg, J., and Bengio, S. Transfusion: Understanding transfer learning for medical imaging. arXiv preprint arXiv:1902.07208, 2019.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pp. 113–124, 2019.

Salman, H., Ilyas, A., Engstrom, L., Kapoor, A., and Madry, A. Do adversarially robust ImageNet models transfer better? arXiv preprint arXiv:2007.08489, 2020.

Saunshi, N., Zhang, Y., Khodak, M., and Arora, S. A sample complexity separation between non-convex and convex meta-learning. In International Conference on Machine Learning, pp. 8512–8521. PMLR, 2020.

Snell, J., Swersky, K., and Zemel, R. S. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.

Thrun, S. and Pratt, L. Learning to learn. Springer Science & Business Media, 2012.

Tripuraneni, N., Jin, C., and Jordan, M. I. Provable meta-learning of linear representations. arXiv preprint arXiv:2002.11684, 2020a.
Tripuraneni, N., Jordan, M., and Jin, C. On the theory of transfer learning: The importance of task diversity. Advances in Neural Information Processing Systems, 33, 2020b.

Vershynin, R. Four lectures on probabilistic methods for data science. 2017.

Wang, H., Sun, R., and Li, B. Global convergence and generalization bound of gradient-based meta-learning with deep neural nets, 2020.

Zhang, Y., Xu, Z.-Q. J., Luo, T., and Ma, Z. A type of generalization error induced by initialization in deep neural networks. In Mathematical and Scientific Machine Learning, pp. 144–164. PMLR, 2020.

Zhou, P., Yuan, X., Xu, H., Yan, S., and Feng, J. Efficient meta learning via minibatch proximal update. In Advances in Neural Information Processing Systems, pp. 1534–1544, 2019.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] Our introduction outlines where each claim appears in the paper.
   (b) Did you describe the limitations of your work? [Yes] We have discussed extensively the assumptions that we require to establish our results; see Section 3.1, Section 4.2, and Section 5.1, among others.
   (c) Did you discuss any potential negative societal impacts of your work? [No] As our work focuses on analyzing existing algorithms from a theoretical perspective, we do not foresee any negative societal impacts stemming from our work.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] A Jupyter notebook is provided to run the simulation outlined in Section C.3.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section C.3, as well as the provided Jupyter notebook.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] Since we are simulating regression problems and can sample arbitrarily many times from the task distribution, we only report the mean in our experiments.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] Our experiments only involve simulations of simple settings that do not require extensive compute.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]