# Provable Guarantees for Gradient-Based Meta-Learning

Mikhail Khodak¹, Maria-Florina Balcan¹, Ameet Talwalkar¹²

¹Carnegie Mellon University, ²Determined AI. Correspondence to: Mikhail Khodak. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

We study the problem of meta-learning through the lens of online convex optimization, developing a meta-algorithm that bridges the gap between popular gradient-based meta-learning and classical regularization-based multi-task transfer methods. Our method is the first to simultaneously satisfy good sample-efficiency guarantees in the convex setting, with generalization bounds that improve with task-similarity, while also being computationally scalable to modern deep learning architectures and the many-task setting. Despite its simplicity, the algorithm matches, up to a constant factor, a lower bound on the performance of any such parameter-transfer method under natural task-similarity assumptions. We use experiments in both convex and deep learning settings to verify and demonstrate the applicability of our theory.

## 1. Introduction

The goal of meta-learning can be broadly defined as using the data of existing tasks to learn algorithms or representations that enable better or faster performance on unseen tasks. As the modern iteration of learning-to-learn (LTL) (Thrun & Pratt, 1998), research on meta-learning has been largely focused on developing new tools that can exploit the power of the latest neural architectures. Examples include the control of stochastic gradient descent (SGD) itself using a recurrent neural network (Ravi & Larochelle, 2017) and learning deep embeddings that allow simple classification methods to work well (Snell et al., 2017). A particularly simple but successful approach has been parameter-transfer via gradient-based meta-learning, which learns a meta-initialization $\phi$ for a class of parametrized functions $f_\theta : \mathcal{X} \to \mathcal{Y}$ such that one or a few stochastic gradient steps on a few samples from a new task suffice to learn good task-specific model parameters $\hat\theta$. For example, when presented with examples $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ for an unseen task, the popular MAML algorithm (Finn et al., 2017) outputs

$$\hat\theta = \phi - \eta \sum_i \nabla_\phi L(f_\phi(x_i), y_i) \qquad (1)$$

for loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ and learning rate $\eta > 0$; $\hat\theta$ is then used for inference on the task. Despite its simplicity, gradient-based meta-learning is a leading approach for LTL in numerous domains including vision (Li et al., 2017; Nichol et al., 2018; Kim et al., 2018), robotics (Al-Shedivat et al., 2018), and federated learning (Chen et al., 2018).

While meta-initialization is a more recent approach, methods for parameter-transfer have long been studied in the multi-task, transfer, and lifelong learning communities (Evgeniou & Pontil, 2004; Kuzborskij & Orabona, 2013; Pentina & Lampert, 2014). A common classical alternative to (1), which in modern parlance may be called meta-regularization, is to learn a good bias $\phi$ for the following regularized empirical risk minimization (ERM) problem:

$$\hat\theta = \arg\min_\theta \frac{\|\theta - \phi\|_2^2}{2\eta} + \sum_i L(f_\theta(x_i), y_i) \qquad (2)$$

Although there exist statistical guarantees and poly-time algorithms for learning a meta-regularization for simple models (Pentina & Lampert, 2014; Denevi et al., 2018b), such methods are impractical and do not scale to modern settings with deep neural architectures and many tasks.
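To make the contrast between (1) and (2) concrete, the following minimal sketch (ours, not from the paper; the linear model, squared loss, and all names are illustrative stand-ins for $f_\theta$ and $L$) computes the one-step adaptation of Equation (1) and the biased ridge solution of Equation (2) for the same meta-parameter $\phi$:

```python
import numpy as np

def maml_style_update(phi, X, y, lr=0.1):
    """One gradient step from the meta-initialization phi, as in Eq. (1),
    for a linear model f_theta(x) = x @ theta with squared loss."""
    grad = X.T @ (X @ phi - y) / len(y)
    return phi - lr * grad

def meta_regularized_erm(phi, X, y, eta=0.1):
    """Biased ridge regression in the spirit of Eq. (2):
    argmin_theta ||theta - phi||^2 / (2*eta) + 0.5 * ||X @ theta - y||^2."""
    d = X.shape[1]
    A = X.T @ X + np.eye(d) / eta
    b = X.T @ y + phi / eta
    return np.linalg.solve(A, b)

# Illustrative few-shot data for an unseen task (not from the paper).
rng = np.random.default_rng(0)
X, theta_true = rng.normal(size=(10, 5)), rng.normal(size=5)
y = X @ theta_true
phi = np.zeros(5)                       # meta-initialization / bias
print(maml_style_update(phi, X, y))     # Eq. (1): one-step adaptation from phi
print(meta_regularized_erm(phi, X, y))  # Eq. (2): regularized ERM pulled toward phi
```

In both cases $\phi$ is the only object shared across tasks, and it is exactly this shared parameter that the meta-learner must choose well.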
On the other hand, while the theoretically less-studied meta-initialization approach is often compared to meta-regularization (Finn et al., 2017), their connection is not rigorously understood. In this work, we formalize this connection using the theory of online convex optimization (OCO) (Zinkevich, 2003), in which an intimate connection between initialization and regularization is well understood due to the equivalence of online gradient descent (OGD) and follow-the-regularized-leader (FTRL) (Shalev-Shwartz, 2011; Hazan, 2015). In the lifelong setting of an agent solving a sequence of OCO tasks, we use this connection to analyze an algorithm that learns a $\phi$, which can be a meta-initialization for OGD or a meta-regularization for FTRL, such that the within-task regret of these algorithms improves with the similarity of the online tasks; here the similarity is measured by the distance between the optimal actions $\theta_t^*$ of each task and is not known beforehand. This algorithm, which we call Follow-the-Meta-Regularized-Leader (FMRL or Ephemeral), scales well in both computation and memory requirements, and in fact generalizes the gradient-based meta-learning algorithm Reptile (Nichol et al., 2018), thus providing a convex-case theoretical justification for a leading method in practice. More specifically, we make the following contributions:

- Our first result assumes a sequence of OCO tasks $t$ whose optimal actions $\theta_t^*$ are inside a small subset $\Theta^*$ of the action space. We show how Ephemeral can use these $\theta_t^*$ to make the average regret decrease in the diameter of $\Theta^*$ and do no worse on dissimilar tasks. Furthermore, we extend a lower bound of Abernethy et al. (2008) to the multi-task setting to show that one can do no more than a small constant factor better without stronger assumptions.
- Under a realistic assumption on the loss functions, we show that Ephemeral also has low-regret guarantees in the practical setting where the optimal actions $\theta_t^*$ are difficult or impossible to compute and the algorithm only has access to a statistical or numerical approximation. In particular, we show high-probability regret bounds in the case when the approximation uses the gradients observed during within-task training, as is done in practice by Reptile (Nichol et al., 2018).
- We prove an online-to-batch conversion showing that the task parameters learned by a meta-algorithm with low task-averaged regret have low risk, connecting our guarantees to statistical LTL (Baxter, 2000; Maurer, 2005).
- We verify several assumptions and implications of our theory using a new meta-learning dataset we introduce, consisting of text-classification tasks solvable using convex methods. We further study the empirical suggestions of our theory in the deep learning setting.

### 1.1. Related Work

Gradient-Based Meta-Learning: The model-agnostic meta-learning (MAML) algorithm of Finn et al. (2017) pioneered this recent approach to LTL. A great deal of empirical work has studied and extended this approach (Li et al., 2017; Grant et al., 2018; Nichol et al., 2018; Jerfel et al., 2018); in particular, Nichol et al. (2018) develop Reptile, a simple yet equally effective first-order simplification of MAML, for which our analysis shows provable guarantees as a sub-case. Theoretically, Franceschi et al.
(2018) provide computational convergence guarantees for gradient-based meta-learning for strongly-convex functions, while Finn & Levine (2018) show that with infinite data MAML can approximate any function of task samples assuming a specific neural architecture as the model. In contrast to both results, we show finite-sample learning-theoretic guarantees for convex functions under a natural task-similarity assumption.

Online LTL: Learning-to-learn and multi-task learning (MTL) have both been extensively studied in the online setting, although our setting differs significantly from the one usually studied in online MTL (Abernethy et al., 2007; Dekel et al., 2007; Cavallanti et al., 2010). There, in each round an agent is told which of a fixed set of tasks the current loss belongs to, whereas our analysis is in the lifelong setting, in which tasks arrive one at a time. Here there are many theoretical results for learning useful data representations (Ruvolo & Eaton, 2013; Pentina & Lampert, 2014; Balcan et al., 2015; Alquier et al., 2017); the PAC-Bayesian result of Pentina & Lampert (2014) can also be used for regularization-based parameter transfer, which we also consider. Such methods are provable variants of practical shared-representation approaches, e.g. ProtoNets (Snell et al., 2017), but unlike our algorithms they do not scale to deep neural networks. Our work is especially related to Alquier et al. (2017), who also consider a many-task regret. We achieve similar bounds with a significantly more practical algorithm, although within-task their results hold for any low-regret method whereas ours hold only for OCO. Lastly, we note two concurrent works, by Denevi et al. (2019) and Finn et al. (2019), that address LTL via online learning, either directly or through online-to-batch conversion.

Statistical LTL: While we focus on the online setting, our online-to-batch results also imply risk bounds for distributional meta-learning. This setting was formalized by Baxter (2000); Maurer (2005) further extended the hypothesis-space-learning framework to algorithm-learning. Recently, Amit & Meir (2018) showed PAC-Bayesian generalization bounds for this setting, although without implying an efficient algorithm. Also closely related are the regularization-based approaches of Denevi et al. (2018a;b), which provide statistical learning guarantees for ridge regression with a meta-learned kernel or bias. Denevi et al. (2018b) in particular focuses on usefulness relative to single-task learning, showing that their method is better than $\ell_2$-regularized ERM, but neither addresses the connection between loss-regularization and gradient-descent-initialization.

## 2. Meta-Initialization & Meta-Regularization

We study simple methods of the form of Algorithm 1, where we run a within-task online algorithm on each task and then update the initialization or regularization of this algorithm using a meta-update online algorithm. Alquier et al. (2017) study such a method where the meta-update is conducted using exponentially-weighted averaging. Our use of OCO for the meta-update makes this class of algorithms much more practical; for example, in the case of OGD for both the inner and outer loop we recover the Reptile algorithm of Nichol et al. (2018).
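To make this last point concrete, the sketch below (ours; the quadratic toy losses and all hyperparameters are illustrative) runs OGD in both loops: the outer OGD step on the meta-loss $\frac12\|\theta_t - \phi\|_2^2$ has gradient $\phi - \theta_t$, so the meta-update simply moves $\phi$ toward the last within-task iterate, which is Reptile's update rule.

```python
import numpy as np

def ogd_within_task(phi, grad_fn, m, lr):
    """Run m steps of online gradient descent starting from the initialization phi;
    grad_fn(i, theta) returns the gradient of the i-th within-task loss at theta."""
    theta = phi.copy()
    for i in range(m):
        theta -= lr * grad_fn(i, theta)
    return theta

def reptile_outer_step(phi, theta_t, meta_lr):
    """One OGD step on the meta-loss 0.5*||theta_t - phi||^2, whose gradient in phi
    is (phi - theta_t); i.e. move phi toward theta_t, which is Reptile's update."""
    return phi - meta_lr * (phi - theta_t)

# Toy tasks: each optimum is a noisy copy of a shared center (illustrative only).
rng = np.random.default_rng(0)
phi, center = np.zeros(3), np.ones(3)
for t in range(20):
    target = center + 0.1 * rng.normal(size=3)       # this task's optimum
    grad_fn = lambda i, th: th - target              # gradient of 0.5*||th - target||^2
    theta_t = ogd_within_task(phi, grad_fn, m=5, lr=0.3)
    phi = reptile_outer_step(phi, theta_t, meta_lr=0.5)
print(phi)   # phi drifts toward the shared center of the task optima
```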
To analyze Algorithm 1, we first discuss the OCO methods that make up both its inner and outer loop and the inherent connection they provide between initialization and regularization. We then make this connection explicit by formalizing the notion of learning a meta-initialization or meta-regularization as learning a parameterized Bregman regularizer. We conclude this section by proving convex-case upper and lower bounds on the task-averaged regret.

Algorithm 1: The generic online-within-online algorithm we study. First-order gradient-based meta-learning uses OGD in both the inner and outer loop.

Pick a first meta-initialization $\phi_1$.
for task $t \in [T]$ do
  Run a within-task online algorithm (e.g. OGD) on the losses of task $t$ using initialization $\phi_t$.
  Compute (exactly or approximately) the best fixed action in hindsight $\theta_t^*$ for task $t$.
  Update $\phi_t$ using a meta-update online algorithm (e.g. OGD) on the meta-loss $\ell_t(\phi) = \|\theta_t^* - \phi\|^2$.

### 2.1. Online Convex Optimization

In the online learning setting, at each time $t = 1, \dots, T$ an agent chooses an action $\theta_t \in \Theta \subseteq \mathbb{R}^d$ and suffers loss $\ell_t(\theta_t)$ for some adversarially chosen function $\ell_t : \Theta \to \mathbb{R}$ that subsumes the loss, model, and data in $L(f_\theta(x), y)$ into one function of $\theta$. The goal is to minimize the regret, the difference between the total loss and that of the optimal fixed action:

$$R_T = \sum_{t=1}^T \ell_t(\theta_t) - \min_{\theta\in\Theta}\sum_{t=1}^T \ell_t(\theta).$$

When $R_T = o(T)$, then as $T \to \infty$ the average loss of the agent will approach that of an optimal fixed action. For OCO, $\ell_t$ is assumed convex and Lipschitz for all $t$. This setting provides many practically useful algorithms such as online gradient descent (OGD). Parameterized by a starting point $\phi \in \Theta$ and learning rate $\eta > 0$, OGD plays $\theta_{t+1} = \operatorname{Proj}_\Theta(\theta_t - \eta\nabla\ell_t(\theta_t))$ and achieves sublinear regret $O(D\sqrt{T})$ when $\eta \propto \frac{D}{\sqrt{T}}$, where $D$ is the diameter of the action space $\Theta$. Note the similarity between OGD and the meta-initialization update in Equation (1). In fact another fundamental OCO algorithm, follow-the-regularized-leader (FTRL), is a direct analog of the meta-regularization algorithm in Equation (2), with its action at each time being the output of $\ell_2$-regularized ERM over the previous data:

$$\theta_t = \arg\min_{\theta\in\Theta} \frac{1}{2\eta}\|\theta - \phi\|_2^2 + \sum_{s<t}\ell_s(\theta).$$

Note that the best fixed action in hindsight $\theta_t^*$ of task $t$, with ties broken by taking the minimizer of smallest norm, is unique as the minimum of $\|\cdot\|_2^2$, a strongly convex function, over the minima of a convex function. The algorithms in Section 2.4 assume an efficient oracle computing $\theta_t^*$.

### 2.3. Parameterizing Bregman Regularizers

Following the main idea of gradient-based meta-learning, our goal is to learn a $\phi \in \Theta$ such that an online algorithm such as OGD starting from $\phi$ will have low regret. We thus treat regret as our objective and observe that in the regret of FTRL (5), the regularizer $R$ effectively encodes a distance from the initialization to $\theta^*$. This is clear in the Euclidean geometry for $R(\theta) = \frac12\|\theta - \phi\|_2^2$, but can be extended via the Bregman divergence (Bregman, 1967), defined for $f : S \to \mathbb{R}$ everywhere-sub-differentiable and convex as

$$B_f(x\|y) = f(x) - f(y) - \langle \nabla f(y), x - y\rangle.$$

The Bregman divergence has many useful properties (Banerjee et al., 2005) that allow us to use it almost directly as a parameterized regularization function. However, in order to use OCO for the meta-update we also require it to be strictly convex in the second argument, a property that holds for the Bregman divergence of both the $\ell_2$ regularizer and the entropic regularizer $R(\theta) = \langle\theta, \log\theta\rangle$ used for online learning over the probability simplex, e.g. with expert advice.
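The sketch below (ours; the inputs are illustrative) evaluates the Bregman divergence for the two regularizers just mentioned, recovering $\frac12\|x - y\|_2^2$ in the Euclidean case and the KL divergence on the simplex in the entropic case.

```python
import numpy as np

def bregman(R, grad_R, x, y):
    """B_R(x || y) = R(x) - R(y) - <grad R(y), x - y>."""
    return R(x) - R(y) - grad_R(y) @ (x - y)

# Euclidean regularizer R = 0.5*||.||^2 gives B_R(x||y) = 0.5*||x - y||^2.
R_euc, gR_euc = lambda v: 0.5 * v @ v, lambda v: v
# Entropic regularizer R = <v, log v> gives the KL divergence on the simplex
# (the affine term sum(y) - sum(x) vanishes when both arguments sum to one).
R_ent, gR_ent = lambda v: v @ np.log(v), lambda v: np.log(v) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(R_euc, gR_euc, x, y), 0.5 * np.sum((x - y) ** 2))   # equal
print(bregman(R_ent, gR_ent, x, y), np.sum(x * np.log(x / y)))    # equal
```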
Definition 2.2. Let $R : S \to \mathbb{R}$ be 1-strongly-convex w.r.t. a norm $\|\cdot\|$ on a convex set $S \subseteq \mathbb{R}^d$. Then we call the Bregman divergence $B_R(x\|y) : S \times S \to \mathbb{R}_+$ a Bregman regularizer if $B_R(x\|\cdot)$ is strictly convex for any fixed $x \in S$.

Within each task, the regularizer is parameterized by the second argument and acts on the first. More specifically, for $R = \frac12\|\cdot\|_2^2$ we have $B_R(\theta\|\phi) = \frac12\|\theta - \phi\|_2^2$, and so in the case of FTRL and OGD, $\phi$ is a parameterization of the regularization and the initialization, respectively. In the case of the entropic regularizer, the associated Bregman regularizer is the KL-divergence from $\phi$ to $\theta$, and thus meta-learning $\phi$ can very explicitly be seen as learning a prior. Finally, we use Bregman regularizers to formally define our parameterized learning algorithms:

Definition 2.3. $\mathrm{FTRL}_{\eta,\phi}$, for $\eta \in \mathbb{R}_+$ and $\phi \in \Theta$, where $\Theta$ is some bounded convex subset $\Theta \subseteq \mathbb{R}^d$, plays

$$\theta_t = \arg\min_{\theta\in\Theta} B_R(\theta\|\phi) + \eta\sum_{s<t}\ell_s(\theta)$$

for Bregman regularizer $B_R$. Similarly, $\mathrm{OMD}_{\eta,\phi}$ plays

$$\theta_t = \arg\min_{\theta\in\Theta} B_R(\theta\|\phi) + \eta\sum_{s<t}\langle\nabla\ell_s(\theta_s), \theta\rangle.$$

Here FTRL and OMD correspond to the meta-regularization (2) and meta-initialization (1) approaches, respectively. As $B_R(\cdot\|\phi)$ is strongly-convex, both algorithms have the same regret bound (5), allowing us to analyze them jointly.

### 2.4. Follow-the-Meta-Regularized-Leader

We now specify the first variant of our main algorithm, Follow-the-Meta-Regularized-Leader (Ephemeral). First assume the diameter $D^*$ of $\Theta^*$, as measured by the square root of the maximum Bregman divergence between any two points, is known. Starting with $\phi_1 \in \Theta$, run $\mathrm{FTRL}_{\eta,\phi_t}$ or $\mathrm{OMD}_{\eta,\phi_t}$ with $\eta \propto \frac{D^*}{G\sqrt{m}}$ on the losses in each task $t$. After each task, compute $\phi_{t+1}$ using an OCO meta-update algorithm operating on the Bregman divergences $B_R(\theta_t^*\|\cdot)$. For $D^*$ unknown, make an underestimate $\varepsilon > 0$ and multiply it by a factor $\gamma > 1$ each time $B_R(\theta_t^*\|\phi_t)$ exceeds the square of the current guess. The following is a regret bound for this algorithm when the meta-update is either Follow-the-Leader (FTL), which plays the minimizer of all past losses, or OGD with adaptive step size. We call this Ephemeral variant Follow-the-Average-Leader (FAL) because in the case of FTL the algorithm uses the mean of the previous optimal parameters in hindsight as the initialization. Pseudo-code for this and other variants is given in Algorithm 2. For brevity, we state results for constant $G_t = G$ and $m_t = m$ for all $t$; detailed statements are in the supplement together with the full proof.

Theorem 2.1. In Setting 2.1, the FAL variant of Algorithm 2 with task-similarity guess $\varepsilon = D\sqrt{\frac{1+\log T}{T}}$, tuning parameter $\gamma = \frac{1+\log T}{\log T}$, and $B_R$ that is Lipschitz on $\Theta^*$ achieves

$$\bar{R} \le O\left(D^* + D\sqrt{\frac{1+\log T}{T}}\right)G\sqrt{m}$$

for diameter $D^* = \max_{\theta,\phi\in\Theta^*}\sqrt{B_R(\theta\|\phi)}$ of $\Theta^*$.

Proof Sketch. We give a proof for $R(\cdot) = \frac12\|\cdot\|_2^2$ and known task similarity, i.e. $\varepsilon = D^*$, $\gamma = 1$. Denote the divergence to $\theta_t^*$ by $\Delta_t(\phi) = B_R(\theta_t^*\|\phi) = \frac12\|\theta_t^* - \phi\|_2^2$ and let $\phi^* = \frac1T\sum_{t=1}^T\theta_t^*$. Note that $\Delta_t$ is 1-strongly-convex and $\phi^*$ is the minimizer of their sum, with the variance $\bar{D}^2 = \frac1T\sum_{t=1}^T\Delta_t(\phi^*) \le D^{*2}$. Now by Definition 2.1:

$$\bar{R} = \frac1T\sum_{t=1}^T\left(\sum_{i=1}^m \ell_{t,i}(\theta_{t,i}) - \min_{\theta\in\Theta}\sum_{i=1}^m\ell_{t,i}(\theta)\right) \le \frac1T\sum_{t=1}^T\left(\frac{\Delta_t(\phi_t)}{\eta} + \eta G^2 m\right) = \frac1{\eta T}\sum_{t=1}^T\left(\Delta_t(\phi_t) - \Delta_t(\phi^*)\right) + \frac1T\sum_{t=1}^T\left(\frac{\Delta_t(\phi^*)}{\eta} + \eta G^2 m\right)$$

The inequality applies the regret bound (5) of FTRL and OMD. The key step is the final equality, which splits the regret into the loss of the meta-update algorithm on the left and the loss if we had always initialized at the mean $\phi^*$ of the optimal actions $\theta_t^*$ on the right. Since $\Delta_1, \dots, \Delta_T$ are 1-strongly-convex with minimizer $\phi^*$, and since each $\phi_t$ is determined by playing FTL or OGD on these same functions, the left term is the regret of these algorithms on strongly-convex functions, which is known to be $O(\log T)$ (Bartlett et al., 2008; Kakade & Shalev-Shwartz, 2008).
Substituting the definition of $\phi^*$ and $\eta = \frac{D^*}{G\sqrt{m}}$ sets the right term to at most

$$\frac{D^{*2}}{\eta} + \eta G^2 m = GD^*\sqrt{m} + GD^*\sqrt{m}.$$

The full proof uses the doubling trick to tune the task similarity $D^*$, requiring an analysis of the location of the meta-parameter $\phi_t$ to ensure that we only increase the guess when needed. The extension to non-Euclidean geometries uses a novel logarithmic regret bound for FTL over Bregman regularizers.

Remark 2.1. Note that if we know the variance $\bar{D}^2$ of the task parameters from their mean $\phi^*$, setting $\eta_t = \frac{\bar{D}}{G_t\sqrt{m_t}}$ in Algorithm 2 and following the analysis above replaces $D^*$ in Theorem 2.1 with $\bar{D}$, which is better since $\bar{D} \le D^*$ and is furthermore less sensitive to possible outlier tasks.

Algorithm 2: Follow-the-Meta-Regularized-Leader (Ephemeral) meta-algorithm for meta-learning. For the FAL variant we assume $\arg\min_{\theta\in\Theta} L(\theta)$ returns the minimum-norm $\theta$ among all minimizers of $L$ over $\Theta$. For META = OGD we assume $R(\cdot) = \frac12\|\cdot\|_2^2$ and an adaptive step size.

Input: meta-initialization $\phi_1 \in \Theta$, task-similarity guess $\varepsilon > 0$, and tuning parameter $\gamma \ge 1$
// set first-task similarity guess to be the full action space
$D_1 \leftarrow \max_{\theta\in\Theta}\sqrt{B_R(\theta\|\phi_1)} + \varepsilon$; $k \leftarrow 0$
for $t \in [T]$ do
  // set learning rate using task similarity guess; run within-task algorithm
  $\eta_t \leftarrow \frac{D_t}{G_t\sqrt{m_t}}$
  for $i \in [m_t]$ do
    $\theta_{t,i} \leftarrow \mathrm{TASK}_{\eta_t,\phi_t}(\ell_{t,1}, \dots, \ell_{t,i-1})$
    suffer loss $\ell_{t,i}(\theta_{t,i})$
  // compute meta-update vector $\hat\theta_t$ depending on Ephemeral variant
  case FAL do $\hat\theta_t \leftarrow \arg\min_{\theta\in\Theta}\sum_{i=1}^{m_t}\ell_{t,i}(\theta)$
  case FLI-Online do $\hat\theta_t \leftarrow \mathrm{TASK}_{\eta_t,\phi_t}(\ell_{t,1}, \dots, \ell_{t,m_t})$
  case FLI-Batch do $\hat\theta_t \leftarrow \frac1{m_t}\sum_{i=1}^{m_t}\theta_{t,i}$
  // increase task similarity guess if violated; run meta-update
  if $D_t < \sqrt{B_R(\hat\theta_t\|\phi_t)}$ then $k \leftarrow k + 1$
  $D_{t+1} \leftarrow \gamma^k\varepsilon$
  $\phi_{t+1} \leftarrow \mathrm{META}_{\phi_1}(\{B_R(\hat\theta_s\|\cdot)\,G_s\sqrt{m_s}\}_{s=1}^t)$

Theorem 2.1 shows that the TAR of Ephemeral scales with the task similarity $D^*$, and that if tasks are not similar then we only do a constant factor worse than FTRL or OMD. This shows that gradient-based meta-learning is useful in convex settings: under a simple notion of similarity, having more tasks yields better performance than the $O(D\sqrt{m})$ regret of single-task learning. The algorithm also scales well and in the $\ell_2$ setting is similar to Reptile (Nichol et al., 2018).

However, it is easy to see that an even simpler strawman algorithm achieves regret only a constant factor worse: at time $t+1$, simply initialize FTRL or OMD using the optimal parameter $\theta_t^*$ of task $t$. Of course, in the few-shot setting of small $m$, a reduction in the average regret is still practically significant; we observe this empirically in Figure 3. Indeed, in the proof of Theorem 2.1 the regret converges to that obtained by always playing the mean optimal action, which will not occur when playing the strawman algorithm. Furthermore, the following lower bound on the task-averaged regret, a multi-task extension of Abernethy et al. (2008, Theorem 4.2), shows that such constant-factor reductions are the best we can achieve under our task-similarity assumption:

Theorem 2.2. Assume $d \ge 3$ and that for each $t \in [T]$ an adversary must play a sequence of $m$ convex $G$-Lipschitz functions $\ell_{t,i} : \Theta \to \mathbb{R}$ whose optimal actions in hindsight $\arg\min_{\theta\in\Theta}\sum_{i=1}^m\ell_{t,i}(\theta)$ are contained in some fixed $\ell_2$-ball $\Theta^* \subseteq \Theta$ with center $\phi^*$ and diameter $D^*$. Then the adversary can force the agent to have TAR at least $\Omega(GD^*\sqrt{m})$.

More broadly, this lower bound shows that the learning-theoretic benefits of gradient-based meta-learning are inherently limited without stronger assumptions on the tasks. Nevertheless, Ephemeral-style algorithms are very attractive from a practical perspective, as their memory and computation requirements per iteration scale linearly in the dimension and not at all in the number of tasks.
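To illustrate how light-weight the method is, the following sketch (ours, not the authors' implementation; it fixes $R = \frac12\|\cdot\|_2^2$, OGD within-task, FTL as the meta-update, and simple quadratic within-task losses so that the best action in hindsight is available exactly) implements the FAL variant of Algorithm 2, including the multiplicative guessing of the task similarity:

```python
import numpy as np

def ephemeral_fal(task_optima, G=1.0, m=10, eps=0.1, gamma=1.1):
    """FAL variant of Algorithm 2 with R = 0.5*||.||^2, OGD within-task, and
    FTL as the meta-update.  task_optima[t] plays the role of the best action
    in hindsight theta*_t; the within-task losses are the toy quadratics
    0.5*||theta - theta*_t||^2, for which theta*_t is exactly computable."""
    d = task_optima.shape[1]
    phi = np.zeros(d)                 # meta-initialization phi_1
    D_t, k, past = eps, 0, []         # similarity guess, guess counter, FTL memory
    task_regrets = []
    for theta_star in task_optima:
        eta = D_t / (G * np.sqrt(m))  # within-task learning rate from the guess
        theta, regret = phi.copy(), 0.0
        for _ in range(m):            # OGD on the toy within-task losses
            regret += 0.5 * np.sum((theta - theta_star) ** 2)  # comparator loss is 0
            theta -= eta * (theta - theta_star)
        task_regrets.append(regret)
        # FAL meta-update vector: the exact best action in hindsight theta*_t
        if np.linalg.norm(theta_star - phi) > D_t:   # similarity guess violated
            k += 1
        D_t = eps * gamma ** k
        past.append(theta_star)
        phi = np.mean(past, axis=0)   # FTL on the meta-losses 0.5*||theta*_s - phi||^2
    return np.mean(task_regrets), phi

# Similar tasks: optima in a small ball around a shared center (illustrative only).
rng = np.random.default_rng(0)
optima = 1.0 + 0.2 * rng.normal(size=(200, 5))
tar, phi = ephemeral_fal(optima)
print(tar, phi)   # phi approaches the mean optimum and the average regret shrinks
```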
## 3. Provable Guarantees for Practical Gradient-Based Meta-Learning

In the previous section we gave an algorithm with access to the best actions in hindsight $\theta_t^*$ of each task that can learn a good meta-initialization or meta-regularization. While $\theta_t^*$ is efficiently computable in some cases, it is often more practical to use an approximation. This holds in the deep learning setting, e.g. Nichol et al. (2018) use the average within-task gradient. Furthermore, in the batch setting a more natural similarity notion depends on the true risk minimizers and not the optimal actions for a few samples. In this section we first show how two simple variants of Ephemeral handle these settings, one for the adversarial setting, which uses the final action on task $t$ as the meta-update, and one for the stochastic setting, which uses the average iterate. We call these methods FLI-Online and FLI-Batch, respectively, where FLI stands for Follow-the-Last-Iterate. We then provide an online-to-batch conversion result for TAR that implies good generalization guarantees when any of the variants of Ephemeral are run in the distributional LTL setting.

### 3.1. Simple-to-Compute Meta-Updates

To achieve guarantees using approximate meta-updates we need to make some assumptions on the within-task loss functions. This is unavoidable because we need estimates of the optimal actions of different tasks to be nearby; in general, for some $\theta \in \Theta$ a convex function $f : \Theta \to \mathbb{R}$ can have small $f(\theta) - f(\theta^*)$ but large $\|\theta - \theta^*\|$ if $f$ does not increase quickly away from the minimum. This makes it impossible to use guarantees on the loss of an estimate of $\theta_t^*$ to bound its distance from $\theta_t^*$. We therefore assume that some aggregate loss, e.g. the expectation or sum of the within-task losses, satisfies the following growth condition:

Definition 3.1. A function $f : \Theta \to \mathbb{R}$ has $\alpha$-quadratic-growth ($\alpha$-QG) w.r.t. $\|\cdot\|$ for $\alpha > 0$ if for any $\theta \in \Theta$ and $\theta^*$ its closest minimum of $f$ we have

$$\frac{\alpha}{2}\|\theta - \theta^*\|^2 \le f(\theta) - f(\theta^*).$$

Figure 2. Plot of the smallest $L(\theta) - L(\theta^*)$ as $\|\theta - \theta^*\|_2$ increases, for logistic regression over a mixture of four 50-dimensional Gaussians (left) and over a four-class text-classification task with 50-dimensional CBOW features (right). For both, the $\alpha$ factor of the quadratic-growth condition scales linearly with the number of samples $m$.

QG has recently been used to provide fast rates for GD that hold for practical problems such as LASSO and logistic regression under data-dependent assumptions (Karimi et al., 2016; Garber, 2019). It can be shown to hold when $f(\theta) = g(A\theta)$ for $g$ strongly-convex and some $A \in \mathbb{R}^{m\times d}$; in this case $\alpha$ can be bounded using $\sigma_{\min}(A)$ (Karimi et al., 2016). Note that $\alpha$-QG is also a weaker condition than $\alpha$-strong-convexity. To prove FLI guarantees, we require in Setting 3.1 that some notion of average loss on each task grows quadratically away from the optimum, which is shown to hold in both a real and a synthetic setting in Figure 2.
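The following sketch (ours; least squares stands in for the logistic-regression losses used in Figure 2, and all data are synthetic) checks the quadratic-growth condition numerically for an average loss of the form $L(\theta) = g(A\theta)$ with $g$ strongly convex, comparing an empirical growth estimate against $\sigma_{\min}(A)^2/m$, which is the exact constant in this least-squares case:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 20
A = rng.normal(size=(m, d))
b = rng.normal(size=m)

# Average loss L(theta) = g(A @ theta) with g(z) = 0.5*mean((z - b)^2), which is
# (1/m)-strongly convex, so L is alpha-QG with alpha = sigma_min(A)^2 / m here.
L = lambda th: 0.5 * np.mean((A @ th - b) ** 2)
theta_star = np.linalg.lstsq(A, b, rcond=None)[0]   # unique minimizer of L

# Empirical estimate: smallest observed 2*(L(theta)-L(theta*)) / ||theta-theta*||^2.
ratios = []
for _ in range(2000):
    th = theta_star + rng.uniform(0.01, 2.0) * rng.normal(size=d)
    ratios.append(2.0 * (L(th) - L(theta_star)) / np.sum((th - theta_star) ** 2))
print("empirical growth estimate:", min(ratios))
print("sigma_min(A)^2 / m       :", np.linalg.svd(A, compute_uv=False).min() ** 2 / m)
```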
Setting 3.1. In Setting 2.1, for each task $t \in [T]$ define the average loss $L_t$ according to one of the following two cases: (a) $L_t(\theta) = \frac1{m_t}\sum_{i=1}^{m_t}\ell_{t,i}(\theta)$; (b) the losses $\ell_{t,i} : \Theta \to [0,1]$ are i.i.d. from a distribution $\mathcal{P}_t$ s.t. $L_t(\theta) = \mathbb{E}_{\mathcal{P}_t}\ell(\theta)$ has a unique minimum. Assume the corresponding $L_t$ in each case is $\alpha$-QG w.r.t. $\|\cdot\|$ and define $\Theta^* \subseteq \Theta$ s.t. $\Theta^* \ni \arg\min_{\theta\in\Theta} L_t(\theta)$ for all $t \in [T]$.

Here case (b) is the batch-within-online setting, also studied by Alquier et al. (2017). In this case the distance defining the similarity is between the true-risk minimizers and not the optimal parameters in hindsight. Under such data-dependent assumptions we have the following bound when using approximate meta-updates:

Theorem 3.1. In Setting 3.1(a), the FLI-Online variant of Algorithm 2 with $\varepsilon = \Omega(m^{-\frac16})$, tuning parameter $\gamma \ge 1$, and within-task algorithm FTRL with Bregman regularizer $B_R$ for $R$ strongly-smooth w.r.t. $\|\cdot\|$ achieves

$$\bar{R} \le O\left(D^* + D\sqrt{\frac{1+\log T}{T}} + o_m(1)\right)G\sqrt{m}$$

for $D^*$ as in Theorem 2.1 and $o_m(1) = O(m^{-\frac16})$. In Setting 3.1(b) the same bound holds w.p. $1-\delta$, with $o_m(1)$ depending additionally on $\log\frac1\delta$, for both the FAL and FLI-Batch variants and using either FTRL or OMD within-task.

This bound is very similar to Theorem 2.1 apart from a per-task error term due to the use of an estimate of $\theta_t^*$.

Figure 3. TAR of Ephemeral and the strawman method for FTRL (left) and of variants of Ephemeral for OGD (right). Ephemeral is much better than the strawman at low $m$, showing the significance of Theorem 2.1 in the few-shot case. As predicted by Theorem 3.1, FLI regret converges to that of FAL as $m$ increases.

### 3.2. Distributional Learning-to-Learn

While gradient-based LTL methods are largely online, their goals are often statistical. The usual setting due to Baxter (2000) assumes a distribution $\mathcal{Q}$ over task-distributions $\mathcal{P}$ over functions $\ell$, which can correspond to a single-sample loss. Given i.i.d. samples from each of $T$ i.i.d. task-samples $\mathcal{P}_t \sim \mathcal{Q}$, we seek to learn how to do well given $m$ samples from a new distribution $\mathcal{P} \sim \mathcal{Q}$. Here we hope that samples from $\mathcal{Q}$ can reduce the amount needed from $\mathcal{P}$. Theorem 3.2 gives an online-to-batch conversion for which low TAR implies low expected risk on a new task sampled from $\mathcal{Q}$. For Ephemeral, the procedure draws $t \sim \mathcal{U}[T]$, runs $\mathrm{FTRL}_{\eta_t,\phi_t}$ or $\mathrm{OMD}_{\eta_t,\phi_t}$ on samples from $\mathcal{P} \sim \mathcal{Q}$, and outputs the average iterate $\bar\theta$. Such guarantees on random or mean iterates are standard, although in practice the last iterate is used. The proof uses Jensen's inequality to combine two standard conversions (Cesa-Bianchi et al., 2004).

Theorem 3.2. Suppose convex losses $\ell_{t,i} : \Theta \to [0,1]$ are drawn i.i.d. from $\mathcal{P}_t \sim \mathcal{Q}$, i.e. $\{\ell_{t,i}\}_i \sim \mathcal{P}_t^m$, for some distribution $\mathcal{Q}$ over task distributions $\mathcal{P}_t$. Let $A_t$ be the state (e.g. the initialization $\phi_t$ and similarity guess $D_t$) before task $t \in [T]$ of an algorithm $A$ with TAR $\bar{R}$. Then w.p. $1-\delta$, if $m$ loss functions $\{\ell_i\}_i \sim \mathcal{P}^m$ are sampled from a task distribution $\mathcal{P} \sim \mathcal{Q}$, running $A_t$ on these losses will generate $\theta_1, \dots, \theta_m \in \Theta$ s.t. their mean $\bar\theta$ satisfies

$$\mathbb{E}_{t\sim\mathcal{U}[T]}\;\mathbb{E}_{\mathcal{P}\sim\mathcal{Q}}\;\mathbb{E}_{\mathcal{P}^m}\;\mathbb{E}_{\ell\sim\mathcal{P}}\;\ell(\bar\theta) \le \mathbb{E}_{\mathcal{P}\sim\mathcal{Q}}\;\mathbb{E}_{\ell\sim\mathcal{P}}\;\ell(\theta_{\mathcal{P}}^*) + \frac{\bar{R}}{m} + O\left(\sqrt{\frac{\log\frac1\delta}{T}}\right)$$
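The conversion procedure in Theorem 3.2 is straightforward to implement; the sketch below (ours; `run_within_task`, the stored states, and the toy losses are illustrative placeholders for FTRL or OMD with the saved $(\phi_t, D_t)$) samples a task index uniformly, runs the within-task algorithm on the new task's losses, and returns the averaged iterate $\bar\theta$:

```python
import numpy as np

def online_to_batch(states, run_within_task, new_task_losses, rng=None):
    """Theorem 3.2-style conversion: draw t ~ U[T], run the within-task
    algorithm from its stored state (e.g. phi_t and the guess D_t) on the m
    losses of the new task, and return the mean of the iterates theta_1..theta_m."""
    if rng is None:
        rng = np.random.default_rng()
    state = states[rng.integers(len(states))]
    iterates = run_within_task(state, new_task_losses)   # one iterate per loss
    return np.mean(iterates, axis=0)

# Minimal usage with OGD standing in for FTRL/OMD (illustrative only).
def ogd_from_state(state, losses):
    phi, eta = state["phi"], state["eta"]
    theta, iterates = phi.copy(), []
    for grad in losses:          # each "loss" is represented by its gradient oracle
        iterates.append(theta.copy())
        theta = theta - eta * grad(theta)
    return iterates

states = [{"phi": np.zeros(3), "eta": 0.1}, {"phi": np.ones(3), "eta": 0.05}]
losses = [lambda th: th - np.array([1.0, 0.0, -1.0]) for _ in range(10)]
print(online_to_batch(states, ogd_from_state, losses))
```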
## 4. Empirical Results

An important aspect of Ephemeral is its practicality. In particular, FLI-Batch is scalable without modification to high-dimensional, non-convex models. This is demonstrated by the success of Reptile (Nichol et al., 2018), a sub-case of our method that competes with MAML on standard meta-learning benchmarks. Given this evidence, our empirical goal is to validate our theory in the convex setting, although we also examine implications for deep meta-learning.

Figure 4. Meta-test accuracy of MAML and Ephemeral in the batch setting. Both using the average iterate, as recommended by the online-to-batch conversion, and using the last iterate, as done in practice, provide performance comparable to that of MAML.

### 4.1. Convex Setting

We introduce a new dataset of 812 classification tasks, each consisting of sentences from one of four Wikipedia pages, which we use as labels. It is derived from the raw super-set of the Wiki3029 corpus collected by Arora et al. (2019). We call the new dataset Mini-Wiki and make it available in the supplement. Our use of text classification to examine the convex setting is motivated by the well-known effectiveness of linear models over simple representations (Wang & Manning, 2012; Arora et al., 2018). We use logistic regression over 50-dimensional continuous-bag-of-words (CBOW) representations built from GloVe embeddings (Pennington et al., 2014). We verify the similarity of these tasks by checking whether their optimal parameters are close together. As shown in Figure 1, we find when $\Theta$ is the unit ball that even in the 1-shot setting the tasks have non-vacuous similarity; for 32 shots the parameters are contained in a set of radius 0.32.

We next compare Ephemeral to the strawman algorithm from Section 2, which uses the previous optimal action as the initialization. For both algorithms we use similarity guess $\varepsilon = 0.1$ and tune with $\gamma = 1.1$. As expected, in Figure 3 we see that Ephemeral is superior to the strawman algorithm, especially for few-shot learning, demonstrating that our TAR improvement is significant in the low-sample regime. We also see that FLI-Batch, which uses approximate meta-updates, approaches FAL as the number of samples increases and thus its estimate improves.

Finally, we evaluate Ephemeral and (first-order) MAML in the statistical setting. On each task we standardize the data using the mean and deviation of the training features. For Ephemeral we use the FAL variant with OGD as the within-task algorithm, with the learning rate set using the average deviation of the task parameters from the mean parameter, as suggested in Remark 2.1. For MAML, we use grid search to determine the within-task and meta-update learning rates. As shown in Figure 4, despite using no tuning, Ephemeral performs comparably to MAML: slightly better for $m \ge 8$ and slightly worse for $m < 4$.
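A sketch of the kind of task-similarity check behind Figure 1 is given below (ours; the synthetic feature matrices stand in for the CBOW representations, the tasks are binary rather than four-way, and projected gradient descent stands in for an exact constrained solver):

```python
import numpy as np

def fit_logreg_unit_ball(X, y, steps=500, lr=0.1):
    """Binary logistic regression with parameters constrained to the unit ball,
    fit by projected gradient descent (a stand-in for an exact solver)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= lr * X.T @ (p - y) / len(y)
        norm = np.linalg.norm(theta)
        if norm > 1.0:                        # project back onto the unit ball
            theta /= norm
    return theta

# Placeholder tasks: a shared parameter plus small per-task shifts (illustrative).
rng = np.random.default_rng(0)
shared = rng.normal(size=50)
shared /= np.linalg.norm(shared)
optima = []
for _ in range(20):
    w = shared + 0.1 * rng.normal(size=50)
    X = rng.normal(size=(32, 50))             # a 32-shot task with 50-dim features
    y = (X @ w > 0).astype(float)
    optima.append(fit_logreg_unit_ball(X, y))
optima = np.array(optima)
center = optima.mean(axis=0)
print("estimated task radius:", np.max(np.linalg.norm(optima - center, axis=1)))
```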
Figure 5. Performance of Reptile (the FLI variant of Ephemeral using OGD within-task) on 5-shot 5-way Mini-ImageNet (left), 1-shot 5-way Omniglot (center), and 5-shot 20-way Omniglot (right) while varying the number of training samples. Increasing the number of samples per training task improves performance even when using the same number of samples at meta-test time.

Figure 6. Performance of Reptile (the FLI variant of Ephemeral using OGD within-task) on 5-shot 5-way Mini-ImageNet (left), 1-shot 5-way Omniglot (center), and 5-shot 20-way Omniglot (right) while varying the number of training iterations. The benefit of more iterations is not clear for Mini-ImageNet, but an improvement is seen on Omniglot. The number of iterations at meta-test time is 50.

### 4.2. Deep Learning

While our method generalizes Reptile, an effective meta-learning method (Nichol et al., 2018), we can still examine whether our theory can help neural-network LTL. We study modifications of Reptile on 5-way and 20-way Omniglot (Lake et al., 2017) and 5-way Mini-ImageNet classification (Ravi & Larochelle, 2017) using the same networks as Nichol et al. (2018). As in these works, we evaluate in the transductive setting, where test points are evaluated in batch.

Our theory points to the importance of accurately computing the within-task parameter for the meta-update; Theorem 2.1 assumes access to this parameter, whereas Theorem 3.1 allows computational and stochastic approximations that result in an additional error term decaying with the number of task-examples. This becomes relevant in the non-convex setting with many tasks, where it is infeasible to find even a local optimum. Thus we examine whether a better estimate of the within-task parameter for the meta-update leads to higher accuracy. We can attain a better estimate by using more samples to reduce stochastic noise or by running more gradient steps on each task to reduce approximation error. It is not obvious that these changes will improve performance; it may be better to learn using the same settings at meta-train and meta-test time. However, for 5-shot evaluation the Reptile authors do indeed use more than 5 task samples (10 for Omniglot and 15 for Mini-ImageNet). Similarly, they use far fewer within-task gradient steps at meta-train time (5 for Omniglot and 8 for Mini-ImageNet) than the 50 iterations used for evaluation.

We study how these two settings, the number of task samples and the number of within-task iterations, affect meta-test performance. In Figure 5, we see that more task samples provide a significant improvement, with fewer meta-iterations needed for good test performance. Reducing this number is equivalent to reducing task-sample complexity, although for a better approximation each task needs more samples. We also see in Figure 6 that taking more gradient steps, which does not use more samples, can also help performance, especially on 20-way Omniglot. However, on Mini-ImageNet using more than 8 iterations reduces performance; this may be due to overfitting on specific tasks, with task similarity likely holding for the true rather than the empirical risk minimizers, as in Setting 3.1(b). The broad patterns shown above also hold for several other settings, which we discuss in the supplement.

## 5. Conclusion

In this paper we study a broad class of gradient-based meta-learning methods using the theory of OCO, proving their usefulness compared to single-task learning under a closeness assumption on task parameters. The guarantees of our algorithm, Ephemeral, can be extended to approximate meta-updates, the batch-within-online setting, and statistical LTL. Apart from these results, the algorithm's simplicity makes it extensible to settings of practical interest such as federated learning and differential privacy. Future work can consider more sophisticated notions of task-similarity, such as multi-modal or evolving settings, and theory for practical and scalable shared-representation learning.

## Acknowledgments

This work was supported in part by DARPA FA875017C0141, National Science Foundation grants CCF-1535967, IIS-1618714, IIS-1705121, and IIS-1838017, a Microsoft Research Faculty Fellowship, an Okawa Grant, a Google Faculty Award, an Amazon Research Award, an Amazon Web Services Award, and a Carnegie Bosch Institute Research Award. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, the National Science Foundation, or any other funding agency.

## References

Abernethy, J., Bartlett, P., and Rakhlin, A. Multitask learning with expert advice. In Proceedings of the International Conference on Computational Learning Theory, 2007.

Abernethy, J., Bartlett, P. L., Rakhlin, A., and Tewari, A. Optimal strategies and minimax lower bounds for online convex games. Technical report, EECS Department, University of California, Berkeley, 2008.

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P.
Continuous adaptation via meta-learning in nonstationary and competitive environments. In Proceedings of the 6th International Conference on Learning Representations, 2018.

Alquier, P., Mai, T. T., and Pontil, M. Regret bounds for lifelong learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Arora, S., Khodak, M., Saunshi, N., and Vodrahalli, K. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and LSTMs. In Proceedings of the 6th International Conference on Learning Representations, 2018.

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Azuma, K. Weighted sums of certain dependent random variables. Tôhoku Mathematical Journal, 19:357–367, 1967.

Balcan, M.-F., Blum, A., and Vempala, S. Efficient representations for lifelong learning and autoencoding. In Proceedings of the Conference on Learning Theory, 2015.

Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

Bartlett, P. L., Hazan, E., and Rakhlin, A. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, 2008.

Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

Bregman, L. M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200–217, 1967.

Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934, 2010.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

Chen, F., Dong, Z., Li, Z., and He, X. Federated meta-learning for recommendation. arXiv, 2018.

Dekel, O., Long, P. M., and Singer, Y. Online learning of multiple tasks with a shared loss. Journal of Machine Learning Research, 8:2233–2264, 2007.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Incremental learning-to-learn with statistical guarantees. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2018a.

Denevi, G., Ciliberto, C., Stamos, D., and Pontil, M. Learning to learn around a common mean. In Advances in Neural Information Processing Systems, 2018b.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. arXiv, 2019.

Evgeniou, T. and Pontil, M. Regularized multi-task learning. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.

Fellbaum, C. WordNet: An Electronic Lexical Database. MIT Press, 1998.

Finn, C. and Levine, S. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In Proceedings of the 6th International Conference on Learning Representations, 2018.

Finn, C., Abbeel, P., and Levine, S.
Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. arXiv, 2019.

Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3, 1956.

Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, 3(1):100–118, 1975.

Garber, D. Fast rates for online gradient descent without strong convexity via Hoffman's bound. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical Bayes. In Proceedings of the 6th International Conference on Learning Representations, 2018.

Hazan, E. Introduction to online convex optimization. In Foundations and Trends in Optimization, volume 2, pp. 157–325. now Publishers Inc., 2015.

Jerfel, G., Grant, E., Griffiths, T. L., and Heller, K. Online gradient-based mixtures for transfer modulation in meta-learning. arXiv, 2018.

Kakade, S. and Shalev-Shwartz, S. Mind the duality gap: Logarithmic regret algorithms for online optimization. In Advances in Neural Information Processing Systems, 2008.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016.

Kim, J., Lee, S., Kim, S., Cha, M., Lee, J. K., Choi, Y., Choi, Y., Choi, D.-Y., and Kim, J. Auto-Meta: Automated gradient based meta learner search. arXiv, 2018.

Kuzborskij, I. and Orabona, F. Stability and hypothesis transfer learning. In Proceedings of the 30th International Conference on Machine Learning, 2013.

Lake, B. M., Salakhutdinov, R., Gross, J., and Tenenbaum, J. B. One shot learning of simple visual concepts. In Proceedings of the Conference of the Cognitive Science Society (CogSci), 2017.

Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv, 2017.

Maurer, A. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967–994, 2005.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv, 2018.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014.

Pentina, A. and Lampert, C. H. A PAC-Bayesian bound for lifelong learning. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Polyak, B. T. Gradient methods for minimizing functionals. USSR Computational Mathematics and Mathematical Physics, 3(3):864–878, 1963.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the 5th International Conference on Learning Representations, 2017.

Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. In Proceedings of the 30th International Conference on Machine Learning, 2013.

Shalev-Shwartz, S. Online learning and online convex optimization.
Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

Snell, J., Swersky, K., and Zemel, R. S. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.

Thrun, S. and Pratt, L. Learning to Learn. Springer Science & Business Media, 1998.

Wang, S. and Manning, C. D. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 2003.