# Online Meta-Learning

Chelsea Finn*1, Aravind Rajeswaran*2, Sham Kakade2, Sergey Levine1

*Equal contribution. 1UC Berkeley, 2University of Washington. Correspondence to: Chelsea Finn, Aravind Rajeswaran. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

A central capability of intelligent systems is the ability to continuously build upon previous experiences to speed up and enhance learning of new tasks. Two distinct research paradigms have studied this question. Meta-learning views this problem as learning a prior over model parameters that is amenable to fast adaptation on a new task, but typically assumes the tasks are available together as a batch. In contrast, online (regret-based) learning considers a setting where tasks are revealed one after the other, but conventionally trains a single model without task-specific adaptation. This work introduces an online meta-learning setting, which merges ideas from both paradigms to better capture the spirit and practice of continual lifelong learning. We propose the follow the meta leader (FTML) algorithm, which extends the MAML algorithm to this setting. Theoretically, this work provides an $O(\log T)$ regret guarantee with one additional higher-order smoothness assumption (in comparison to the standard online setting). Our experimental evaluation on three different large-scale problems suggests that the proposed algorithm significantly outperforms alternatives based on traditional online learning approaches.

1. Introduction

Two distinct research paradigms have studied how prior tasks or experiences can be used by an agent to inform future learning. Meta-learning (Schmidhuber, 1987; Santoro et al., 2016; Finn et al., 2017) casts this as the problem of learning to learn, where past experience is used to acquire a prior over model parameters or a learning procedure, and typically studies a setting where meta-training tasks are made available together upfront. In contrast, online learning (Hannan, 1957; Cesa-Bianchi & Lugosi, 2006) considers a sequential setting where tasks are revealed one after another, but aims to attain zero-shot generalization without any task-specific adaptation. We argue that neither setting is ideal for studying continual lifelong learning. Meta-learning deals with learning to learn, but neglects the sequential and non-stationary aspects of the problem. Online learning offers an appealing theoretical framework, but does not generally consider how past experience can accelerate adaptation to a new task. In this work, we motivate and present the online meta-learning problem setting, where the agent simultaneously uses past experiences in a sequential setting to learn good priors, and also adapts quickly to the current task at hand.

As an example, Figure 1 shows a family of sinusoids. Imagine that each task is a regression problem from x to y corresponding to one sinusoid. When presented with data from a large collection of such tasks, a naïve approach that does not consider the task structure would collectively use all the data, and learn a prior that corresponds to the model y = 0. An algorithm that understands the underlying structure would recognize that each curve in the family is a sinusoid, and would therefore attempt to identify, for a new batch of data, which sinusoid it corresponds to.
As another example where joint training fails, Figure 1 also shows colored MNIST digits with different backgrounds. Suppose we've seen MNIST digits with various colored backgrounds, and then observe a 7 on a new color. We might conclude from training on all of the data seen so far that all digits with that color must be 7. In fact, this is an optimal conclusion from a purely statistical standpoint. However, if we understand that the data is divided into different tasks, and take note of the fact that each task has a different color, a better conclusion is that the color is irrelevant. Training on all of the data together, or only on the new data, does not achieve this goal.

Figure 1. (left) sinusoid functions and (right) colored MNIST.

Meta-learning offers an appealing solution: by learning how to learn from past tasks, we can make use of task structure and extract information from the data that both allows us to succeed on the current task and adapt to new tasks more quickly. However, typical meta-learning approaches assume that a sufficiently large set of tasks is made available upfront for meta-training. In the real world, tasks are likely available only sequentially, as the agent is learning in the world, and also from a non-stationary distribution. By recasting meta-learning in a sequential or online setting that does not make strong distributional assumptions, we can enable faster learning on new tasks as they are presented.

Our contributions: In this work, we formulate the online meta-learning problem setting and present the follow the meta-leader (FTML) algorithm. This extends the MAML algorithm to the online meta-learning setting, and is analogous to follow the leader in online learning. We analyze FTML and show that it enjoys an $O(\log T)$ regret guarantee when competing with the best meta-learner in hindsight. In this endeavor, we also provide the first set of results (under any assumptions) where MAML-like objective functions can be provably and efficiently optimized. We also develop a practical form of FTML that can be used effectively with deep neural networks on large-scale tasks, and show that it significantly outperforms prior methods. The experiments involve vision-based sequential learning tasks with the MNIST, CIFAR-100, and PASCAL 3D+ datasets.

2. Foundations

Before introducing online meta-learning, we briefly summarize the foundations of meta-learning, the model-agnostic meta-learning (MAML) algorithm, and online learning. To illustrate the differences in setting and algorithms, we will use the running example of few-shot learning, which we describe below first. We emphasize that online learning, MAML, and the online meta-learning formulations have a broader scope than few-shot supervised learning.

2.1. Few-Shot Learning

In few-shot supervised learning (Santoro et al., 2016), we are interested in a family of tasks, where each task $\mathcal{T}$ is associated with a notional and infinite-size population of input-output pairs. Our goal is to learn a new task $\mathcal{T}_i$ while accessing only a small, finite-size labeled dataset $D_i := \{x_i, y_i\}$ corresponding to $\mathcal{T}_i$. If we have a predictive model $h(\cdot; w)$ with parameters $w$, the population risk of the model is $f_i(w) := \mathbb{E}_{(x,y) \sim \mathcal{T}_i}[\ell(x, y, w)]$, where the expectation is defined over the task population and $\ell$ is a loss function, such as the square loss or cross-entropy between the model prediction and the correct label. An example of $\ell$ corresponding to the squared error loss is $\ell(x, y, w) = \lVert y - h(x; w) \rVert^2$. Let $L(D_i, w)$ represent the average loss on the dataset $D_i$. Effectively minimizing $f_i(w)$ is likely hard if we rely only on $D_i$, due to the small size of the dataset. However, meta-learning algorithms aim to perform better by drawing upon data from many tasks in the family, as we discuss next.
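To make the few-shot setup concrete, here is a minimal sketch (in JAX) of the sinusoid regression example from Figure 1: each task is a randomly drawn sinusoid, the learner sees only a tiny dataset $D_i$, and the dataset-average loss stands in for the population risk $f_i(w)$. The feature model, sampling ranges, and function names are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch of the few-shot regression setup (not the paper's code).
import jax
import jax.numpy as jnp

def sample_task(key):
    """Draw one task T_i: the amplitude and phase of a sinusoid (assumed ranges)."""
    k1, k2 = jax.random.split(key)
    amp = jax.random.uniform(k1, (), minval=0.1, maxval=5.0)
    phase = jax.random.uniform(k2, (), minval=0.0, maxval=jnp.pi)
    return amp, phase

def sample_dataset(key, task, n_points=10):
    """Few-shot dataset D_i: a handful of (x, y) pairs drawn from the task."""
    amp, phase = task
    x = jax.random.uniform(key, (n_points, 1), minval=-5.0, maxval=5.0)
    return x, amp * jnp.sin(x + phase)

def predict(w, x):
    """Tiny fixed-feature linear model h(x; w); a stand-in for a neural network."""
    feats = jnp.concatenate([x, jnp.sin(x), jnp.cos(x), jnp.ones_like(x)], axis=1)
    return feats @ w

def dataset_loss(w, dataset):
    """Average squared-error loss on the dataset, i.e. L(D_i, w)."""
    x, y = dataset
    return jnp.mean((y - predict(w, x)) ** 2)

key = jax.random.PRNGKey(0)
task = sample_task(key)
d_i = sample_dataset(jax.random.PRNGKey(1), task)
w = jnp.zeros((4, 1))
print("few-shot loss of the y = 0 prior:", float(dataset_loss(w, d_i)))
```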
2.2. Meta-Learning and MAML

Meta-learning, or learning to learn, aims to bootstrap from a set of tasks to learn faster on a new task. Tasks are assumed to be drawn from a fixed distribution, $\mathcal{T} \sim P(\mathcal{T})$. At meta-training time, $M$ tasks $\{\mathcal{T}_i\}_{i=1}^{M}$ are drawn from this distribution and datasets corresponding to them are made available to the agent. At deployment time, we are faced with a new test task $\mathcal{T}_j \sim P(\mathcal{T})$, for which we are again presented with a small dataset $D_j := \{x_j, y_j\}$. Meta-learning algorithms attempt to find a model using the $M$ training tasks, such that when $D_j$ is revealed from the test task, the model can be quickly updated to minimize $f_j(w)$.

Model-agnostic meta-learning (MAML) (Finn et al., 2017) does so by learning an initial set of parameters $w_{\mathrm{MAML}}$, such that performing a few steps of gradient descent from $w_{\mathrm{MAML}}$ using $D_j$ minimizes $f_j(\cdot)$. To get such an initialization, MAML solves the optimization problem:

$$w_{\mathrm{MAML}} := \arg\min_{w} \; \frac{1}{M} \sum_{i=1}^{M} f_i\big(w - \alpha \nabla \hat{f}_i(w)\big). \qquad (1)$$

The inner gradient $\nabla \hat{f}_i(w)$ is based on a small mini-batch of data from $D_i$. Hence, MAML optimizes for few-shot generalization. Finn et al. (2017) show that gradient-based methods can be used to optimize Eq. 1 with existing automatic differentiation libraries. Stochastic gradient methods are used since the population risk is not known. At meta-test time, the solution is fine-tuned as $w_j \leftarrow w_{\mathrm{MAML}} - \alpha \nabla \hat{f}_j(w_{\mathrm{MAML}})$, with the gradient obtained using $D_j$. MAML and other meta-learning algorithms (see Section 7) are not directly applicable to sequential settings, as they assume a fixed task distribution and have two distinct phases, meta-training and meta-testing. We instead would like to develop algorithms that work in continual learning settings with non-stationary task distributions.

2.3. Online Learning

In online learning, an agent faces a sequence of loss functions $\{f_t\}_{t=1}^{\infty}$, one in each round $t$. These functions need not be drawn from a fixed distribution, and could even be chosen adversarially over time. The learner must sequentially decide on model parameters $\{w_t\}_{t=1}^{\infty}$ that perform well on the loss sequence. In particular, the goal is to minimize some notion of regret, defined as the difference between the learner's cumulative loss, $\sum_{t=1}^{T} f_t(w_t)$, and the best performance achievable by some family of methods (comparator class). The most standard notion of regret is to compare to the cumulative loss of the best fixed model in hindsight:

$$\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(w_t) - \min_{w} \sum_{t=1}^{T} f_t(w).$$

We would like algorithms for which this regret grows with $T$ as slowly as possible. An algorithm whose regret grows sub-linearly in $T$ is non-trivially learning and adapting. One of the simplest algorithms in this setting is follow the leader (FTL) (Hannan, 1957), which computes parameters as:

$$w_{t+1} = \arg\min_{w} \sum_{k=1}^{t} f_k(w).$$

FTL enjoys strong performance guarantees depending on the properties of the loss function. For the few-shot supervised learning example, FTL would consolidate all the data from the prior stream of tasks into a single large dataset and fit a single model to this dataset. As alluded to in Section 1, and as we find in our empirical evaluation in Section 6, such a joint training approach may not learn effective models.
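To make the contrast with meta-learning explicit, the sketch below runs FTL-style joint training on a toy stream of sinusoid tasks: at every round it pools all data seen so far and refits a single model, with no task-specific adaptation. The closed-form least-squares fit, the feature model, and the task distribution are illustrative assumptions.

```python
# Illustrative sketch of follow the leader (FTL) as joint training on the toy
# sinusoid stream: one model fit to the union of all data seen so far.
import jax
import jax.numpy as jnp

def features(x):
    # Fixed feature map for a simple linear model h(x; w) = features(x) @ w.
    return jnp.concatenate([x, jnp.sin(x), jnp.cos(x), jnp.ones_like(x)], axis=1)

def ftl_update(datasets):
    """w_{t+1} = argmin_w sum_k f_k(w), solved in closed form for squared error."""
    X = jnp.concatenate([features(x) for x, _ in datasets], axis=0)
    Y = jnp.concatenate([y for _, y in datasets], axis=0)
    w, *_ = jnp.linalg.lstsq(X, Y, rcond=None)
    return w

key = jax.random.PRNGKey(0)
datasets = []
for t in range(5):                              # a short stream of tasks, one per round
    key, k1, k2, k3 = jax.random.split(key, 4)
    amp = jax.random.uniform(k1, (), minval=0.1, maxval=5.0)
    phase = jax.random.uniform(k2, (), minval=0.0, maxval=jnp.pi)
    x = jax.random.uniform(k3, (10, 1), minval=-5.0, maxval=5.0)
    datasets.append((x, amp * jnp.sin(x + phase)))
    w = ftl_update(datasets)                    # single model, no per-task adaptation
    err = jnp.mean((datasets[-1][1] - features(datasets[-1][0]) @ w) ** 2)
    print(f"round {t}: joint-model error on the newest task = {float(err):.3f}")
```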
To overcome this issue, we may desire a more adaptive notion of a comparator class, and algorithms that have low regret against such a comparator, as we will discuss next.

3. The Online Meta-Learning Problem

We consider a sequential setting where an agent is faced with tasks one after another. Each task corresponds to a round, denoted by $t$. In each round, the goal of the learner is to determine model parameters $w_t$ that perform well for the corresponding task at that round. This is monitored by $f_t : w \in \mathcal{W} \to \mathbb{R}$, which we would like to be minimized. Crucially, we consider a setting where the agent can perform some local task-specific updates to the model before it is deployed and evaluated in each round. This is realized through an update procedure, which at round $t$ is a mapping $U_t : w \in \mathcal{W} \to \tilde{w} \in \mathcal{W}$. This procedure takes as input $w$ and returns $\tilde{w}$ that performs better on $f_t$. One example for $U_t$ is a step of gradient descent (Finn et al., 2017): $U_t(w) = w - \alpha \nabla \hat{f}_t(w)$. As specified in Section 2.2, $\nabla \hat{f}_t$ is potentially an approximate gradient of $f_t$, e.g. obtained using a mini-batch of data from the task at round $t$. The overall protocol is as follows:

1. At round $t$, the agent chooses a model defined by $w_t$.
2. The world simultaneously chooses a task defined by $f_t$.
3. The agent obtains access to the update procedure $U_t$, and uses it to update parameters as $\tilde{w}_t = U_t(w_t)$.
4. The agent incurs loss $f_t(\tilde{w}_t)$. Advance to round $t + 1$.

The goal for the agent is to minimize regret over the rounds. A highly ambitious comparator is the best meta-learned model in hindsight. Let $\{w_t\}_{t=1}^{T}$ be the sequence of models generated by the algorithm. Then, the regret we consider is:

$$\mathrm{Regret}_T = \sum_{t=1}^{T} f_t\big(U_t(w_t)\big) - \min_{w} \sum_{t=1}^{T} f_t\big(U_t(w)\big).$$

Notice that we allow the comparator to adapt locally to each task at hand; thus the comparator has strictly more capabilities than the learning agent, since it is presented with all the task functions in batch mode. Here, again, achieving sublinear regret suggests that the agent is improving over time and is competitive with the best meta-learner in hindsight. As discussed earlier, in the batch setting, when faced with multiple tasks, meta-learning performs significantly better than training a single model for all the tasks. Thus, we may hope that learning sequentially, but still being competitive with the best meta-learner in hindsight, provides a significant leap in continual learning.

4. Algorithm and Analysis

In this section, we outline the follow the meta leader (FTML) algorithm and provide an analysis of its behavior.

4.1. Follow the Meta Leader

Taking inspiration from the form of the follow the leader algorithm (Hannan, 1957; Kalai & Vempala, 2005), we propose the FTML algorithm template, which updates model parameters as:

$$w_{t+1} = \arg\min_{w} \sum_{k=1}^{t} f_k\big(U_k(w)\big). \qquad (4)$$

This can be interpreted as the agent playing the best meta-learner in hindsight if the learning process were to stop at round $t$. In the remainder of this section, we will show that under standard assumptions on the losses, and just one additional assumption on higher-order smoothness, this algorithm has strong regret guarantees. In practice, we may not have full access to $f_k(\cdot)$, such as when it is the population risk and we only have a finite-size dataset. In such cases, we will draw upon stochastic approximation algorithms to solve the optimization problem in Eq. (4).
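Before turning to the assumptions, a small sketch may help ground Eq. (4): at each round the meta-leader approximately minimizes the sum of post-update losses over all tasks seen so far, with gradients flowing through the one-step update $U_k$. The toy sinusoid tasks, the reuse of the same small dataset for the inner update and the outer loss, and all hyperparameters are illustrative assumptions rather than the paper's configuration.

```python
# Illustrative sketch of the FTML template (Eq. 4) on a toy sinusoid stream.
import jax
import jax.numpy as jnp

def predict(w, x):
    feats = jnp.concatenate([x, jnp.sin(x), jnp.cos(x), jnp.ones_like(x)], axis=1)
    return feats @ w

def loss(w, dataset):
    x, y = dataset
    return jnp.mean((y - predict(w, x)) ** 2)

def inner_update(w, dataset, alpha=0.05):
    """U_k(w): one step of gradient descent on task k's data."""
    return w - alpha * jax.grad(loss)(w, dataset)

def ftml_objective(w, datasets):
    """sum_k f_k(U_k(w)); here the same small dataset plays both roles, for brevity."""
    return sum(loss(inner_update(w, d), d) for d in datasets)

def ftml_round(w, datasets, lr=0.01, steps=200):
    """Approximately solve Eq. 4 by gradient descent through the inner update."""
    grad_fn = jax.grad(ftml_objective)
    for _ in range(steps):
        w = w - lr * grad_fn(w, datasets)
    return w

key = jax.random.PRNGKey(0)
datasets, w = [], jnp.zeros((4, 1))
for t in range(3):
    key, k1, k2, k3 = jax.random.split(key, 4)
    amp = jax.random.uniform(k1, (), minval=0.1, maxval=5.0)
    phase = jax.random.uniform(k2, (), minval=0.0, maxval=jnp.pi)
    x = jax.random.uniform(k3, (10, 1), minval=-5.0, maxval=5.0)
    datasets.append((x, amp * jnp.sin(x + phase)))
    w = ftml_round(w, datasets)                 # play the best meta-learner so far
    adapted = inner_update(w, datasets[-1])     # task-specific adaptation at round t
    print(f"round {t}: post-adaptation loss = {float(loss(adapted, datasets[-1])):.3f}")
```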
4.2. Assumptions

We make the following assumptions about each loss function in the learning problem, for all $t$. Let $\theta$ and $\phi$ represent two arbitrary choices of model parameters.

Assumption 1. (C2-smoothness)
1. (Lipschitz in function value) $f$ has gradients bounded by $G$, i.e. $\lVert \nabla f(\theta) \rVert \le G$ for all $\theta$. This is equivalent to $f$ being $G$-Lipschitz.
2. (Lipschitz gradient) $f$ is $\beta$-smooth, i.e. $\lVert \nabla f(\theta) - \nabla f(\phi) \rVert \le \beta \lVert \theta - \phi \rVert$ for all $(\theta, \phi)$.
3. (Lipschitz Hessian) $f$ has $\rho$-Lipschitz Hessians, i.e. $\lVert \nabla^2 f(\theta) - \nabla^2 f(\phi) \rVert \le \rho \lVert \theta - \phi \rVert$ for all $(\theta, \phi)$.

Assumption 2. (Strong convexity) Suppose that $f$ is convex. Furthermore, suppose $f$ is $\mu$-strongly convex, i.e. $\lVert \nabla f(\theta) - \nabla f(\phi) \rVert \ge \mu \lVert \theta - \phi \rVert$.

These assumptions are largely standard in online learning (Cesa-Bianchi & Lugosi, 2006), except 1.3. Examples where these assumptions hold include logistic regression and L2 regression over a bounded domain. Assumption 1.3 is a statement about the higher-order smoothness of functions, which is common in non-convex analysis (Nesterov & Polyak, 2006; Jin et al., 2017). In our setting, it allows us to characterize the landscape of the MAML objective, which has a gradient step embedded within it. Importantly, these assumptions do not trivialize the meta-learning setting. We can observe a clear difference in performance between meta-learning and joint training even when the $f_i$ are quadratic functions, which correspond to the simplest strongly convex setting. See Appendix A for an example illustration.

4.3. Analysis

We analyze the FTML algorithm when the update procedure is a single step of gradient descent, as in the formulation of MAML. Concretely, the update procedure we consider is $U_t(w) = w - \alpha \nabla \hat{f}_t(w)$. For this update rule, we first state our main theorem below.

Theorem 1. Suppose $f$ and $\hat{f} : \mathbb{R}^d \to \mathbb{R}$ satisfy assumptions 1 and 2. Let $\tilde{f}$ be the function evaluated after a one-step gradient update procedure, i.e.

$$\tilde{f}(w) := f\big(w - \alpha \nabla \hat{f}(w)\big).$$

If the step size is selected as $\alpha \le \min\{\frac{1}{2\beta}, \frac{\mu}{8\rho G}\}$, then $\tilde{f}$ is convex. Furthermore, it is also $\tilde{\beta} = 9\beta/8$ smooth and $\tilde{\mu} = \mu/8$ strongly convex.

For the proof, see Appendix B. The following corollary is now immediate.

Corollary 1. (inherited convexity for the MAML objective) If $\{f_i, \hat{f}_i\}_{i=1}^{K}$ satisfy assumptions 1 and 2, then the MAML optimization problem (Eq. 1), with $\alpha \le \min\{\frac{1}{2\beta}, \frac{\mu}{8\rho G}\}$, is convex. Furthermore, it is $9\beta/8$-smooth and $\mu/8$-strongly convex.

Since the objective function is convex, we may expect first-order optimization methods to be effective, since gradients can be efficiently computed with standard automatic differentiation libraries (as discussed in Finn et al. (2017)). In fact, this work provides the first set of results (under any assumptions) under which MAML-like objective functions can be provably and efficiently optimized. Another immediate corollary of our main theorem is that FTML now enjoys the same regret guarantees (up to constant factors) as FTL does in the comparable setting (with strongly convex losses).

Corollary 2. (inherited regret bound for FTML) Suppose that for all $t$, $f_t$ and $\hat{f}_t$ satisfy assumptions 1 and 2. Suppose that the update procedure in FTML (Eq. 4) is chosen as $U_t(w) = w - \alpha \nabla \hat{f}_t(w)$ with $\alpha \le \min\{\frac{1}{2\beta}, \frac{\mu}{8\rho G}\}$. Then, FTML enjoys an $O(\log T)$ regret guarantee. See Appendix C for a proof.

More generally, our main theorem implies that there exists a large family of online meta-learning algorithms that enjoy sub-linear regret, based on the inherited smoothness and strong convexity of $\tilde{f}(\cdot)$. See Hazan (2016); Shalev-Shwartz (2012); Shalev-Shwartz & Kakade (2008) for algorithmic templates to derive sublinear-regret algorithms.
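To spell out the chain of reasoning behind Corollary 2 in one place (a sketch only: the exact constants live in Appendix C, and $\tilde{G}$ below is an assumed bound on the post-update gradient norms rather than a quantity defined in the text):

```latex
% Sketch of the argument; constants are indicative, not the exact ones from Appendix C.
\begin{align*}
\tilde f_t(w) &:= f_t\big(w - \alpha \nabla \hat f_t(w)\big),
  \qquad \alpha \le \min\Big\{\tfrac{1}{2\beta},\, \tfrac{\mu}{8\rho G}\Big\} \\
&\Rightarrow\ \tilde f_t \text{ is convex, } \tfrac{9\beta}{8}\text{-smooth, and }
  \tfrac{\mu}{8}\text{-strongly convex (Theorem 1),} \\
\text{FTML (Eq. 4)}\ &\equiv\ \text{FTL on the transformed sequence } \{\tilde f_t\}_{t=1}^{T}, \\
\mathrm{Regret}_T &= \sum_{t=1}^{T} \tilde f_t(w_t) - \min_{w} \sum_{t=1}^{T} \tilde f_t(w)
  \;=\; O\!\Big(\tfrac{\tilde G^{2}}{\mu/8}\, \log T\Big) \;=\; O(\log T).
\end{align*}
```

The last step invokes the standard logarithmic-regret analysis for follow the leader with strongly convex, Lipschitz losses (Cesa-Bianchi & Lugosi, 2006; Hazan et al., 2006).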
5. Practical Online Meta-Learning Algorithm

In the previous section, we derived a theoretically principled algorithm for convex losses. Many practical problems have non-convex loss landscapes. However, methods developed for convex losses, such as AdaGrad (Duchi et al., 2011), often perform well in non-convex settings. Taking inspiration from these successes, we describe a practical instantiation of FTML, and evaluate its performance in Section 6.

The main considerations when adapting the FTML algorithm to few-shot supervised learning with high-capacity neural network models are: (a) the optimization problem in Eq. (4) has no closed-form solution, and (b) we do not have access to the population risk $f_t$ but only a subset of the data. To overcome both these limitations, we can use iterative stochastic optimization algorithms. Specifically, by adapting the MAML algorithm (Finn et al., 2017), we can use stochastic gradient descent with a minibatch $D^{tr}_k$ as the update rule, and stochastic gradient descent with an independently-sampled minibatch $D^{val}_k$ to optimize the parameters. The gradient computation is described below:

$$g_t(w) = \nabla_w \, \mathbb{E}_{k \sim \nu_t}\Big[ L\big(D^{val}_k, U_k(w)\big) \Big], \quad \text{where} \quad U_k(w) = w - \alpha \nabla_w L\big(D^{tr}_k, w\big). \qquad (5)$$

Here, $\nu_t(\cdot)$ denotes a sampling distribution over the previously seen tasks $\mathcal{T}_1, \ldots, \mathcal{T}_t$. In our experiments, we use the uniform distribution, $\nu_t(k) = 1/t \;\; \forall k \in \{1, 2, \ldots, t\}$, but other sampling distributions can be used if required. Recall that $L(D, w)$ is the loss function (e.g. cross-entropy) averaged over the datapoints $(x, y) \in D$ for the model with parameters $w$. Using independently sampled minibatches $D^{tr}$ and $D^{val}$ minimizes interaction between the inner gradient update $U_t$ and the outer optimization (Eq. 4), which is performed using the gradient above ($g_t$) in conjunction with Adam (Kingma & Ba, 2015). While $U_t$ in Eq. 5 includes only one gradient step, we observed that it is beneficial to take multiple gradient steps in the inner loop (i.e., in $U_t$), which is consistent with prior works (Finn et al., 2017; Grant et al., 2018; Antoniou et al., 2018; Shaban et al., 2018).

Now that we have derived the gradient, the overall algorithm proceeds as follows. We first initialize a task buffer $B = [\,]$. When presented with a new task at round $t$, we add task $\mathcal{T}_t$ to $B$ and initialize a task-specific dataset $D_t = [\,]$, which is appended to as data incrementally arrives for task $\mathcal{T}_t$. As new data arrives for task $\mathcal{T}_t$, we iteratively compute and apply the gradient in Eq. 5, which uses data from all tasks seen so far. Once all of the (finite-size) data has arrived for $\mathcal{T}_t$, we move on to task $\mathcal{T}_{t+1}$. This procedure is detailed in Alg. 1, including the evaluation, which we discuss next.

Algorithm 1 Online Meta-Learning with FTML
1: Input: performance threshold of proficiency, $\gamma$
2: randomly initialize $w_1$
3: initialize the task buffer as empty, $B \leftarrow [\,]$
4: for $t = 1, \ldots$ do
5:   initialize $D_t = \emptyset$
6:   add the new task: $B \leftarrow B + [\mathcal{T}_t]$
7:   while $|D_t| < N$ do
8:     append a batch of $n$ new datapoints $\{(x, y)\}$ to $D_t$
9:     $w_t \leftarrow$ Meta-Update($w_t$, $B$, $t$)
10:    $\tilde{w}_t \leftarrow$ Update-Procedure($w_t$, $D_t$)
11:    if $L(D^{test}_t, \tilde{w}_t) < \gamma$ then
12:      record efficiency for task $\mathcal{T}_t$ as $|D_t|$ datapoints
13:    end if
14:  end while
15:  record final performance of $\tilde{w}_t$ on test set $D^{test}_t$ for task $t$
16:  $w_{t+1} \leftarrow w_t$
17: end for

Algorithm 2 FTML Subroutines
1: Input: hyperparameters $\alpha$, $\eta$
2: function Meta-Update($w$, $B$, $t$)
3:   for $n_m = 1, \ldots, N_{meta}$ steps do
4:     sample task $\mathcal{T}_k$: $k \sim \nu_t(\cdot)$   // (or a minibatch of tasks)
5:     sample minibatches $D^{tr}_k$, $D^{val}_k$ uniformly from $D_k$
6:     compute gradient $g_t$ using $D^{tr}_k$, $D^{val}_k$, and Eq. 5
7:     update parameters $w \leftarrow w - \eta \, g_t$   // (or use Adam)
8:   end for
9:   return $w$
10: end function
11: function Update-Procedure($w$, $D$)
12:   initialize $\tilde{w} \leftarrow w$
13:   for $n_g = 1, \ldots, N_{grad}$ steps do
14:     $\tilde{w} \leftarrow \tilde{w} - \alpha \nabla L(D, \tilde{w})$
15:   end for
16:   return $\tilde{w}$
17: end function
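For concreteness, here is a compact JAX sketch of the Meta-Update subroutine above, computing a stochastic estimate of the Eq. 5 gradient. The buffer layout, the loss signature, the hyperparameter values, the assumption that each task dataset holds at least a minibatch of points, and the plain SGD outer step (in place of Adam) are illustrative choices, not the paper's implementation.

```python
# Illustrative sketch of Meta-Update (Algorithm 2): sample a seen task k ~ nu_t,
# draw independent minibatches D^tr_k and D^val_k, adapt with one inner gradient
# step on D^tr_k, and differentiate the D^val_k loss of the adapted parameters.
import jax
import jax.numpy as jnp

def meta_update(w, buffer, key, loss_fn, alpha=0.01, eta=0.001,
                n_meta_steps=100, batch_size=25):
    """buffer: list of (X_k, Y_k) arrays, one entry per task seen so far."""
    for _ in range(n_meta_steps):
        key, k_task, k_tr, k_val = jax.random.split(key, 4)
        k = int(jax.random.randint(k_task, (), 0, len(buffer)))    # k ~ Uniform{1..t}
        X, Y = buffer[k]
        idx_tr = jax.random.choice(k_tr, X.shape[0], (batch_size,), replace=False)
        idx_val = jax.random.choice(k_val, X.shape[0], (batch_size,), replace=False)
        d_tr, d_val = (X[idx_tr], Y[idx_tr]), (X[idx_val], Y[idx_val])

        def post_update_loss(w_outer):
            w_adapted = w_outer - alpha * jax.grad(loss_fn)(w_outer, d_tr)  # U_k(w)
            return loss_fn(w_adapted, d_val)                                # L(D^val_k, U_k(w))

        g_t = jax.grad(post_update_loss)(w)     # stochastic estimate of Eq. 5
        w = w - eta * g_t                       # outer step (the paper uses Adam)
    return w
```

Sampling $D^{tr}_k$ and $D^{val}_k$ independently mirrors the design choice in the text: the minibatch used inside $U_k$ is decoupled from the minibatch used to score the adapted parameters.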
To evaluate performance of the model at any point within a particular round $t$, we first update the model using all of the data ($D_t$) seen so far within round $t$. This is outlined in the Update-Procedure subroutine of Algorithm 2. Note that this is different from the update $U_t$ used within the meta-optimization, which uses a fixed-size minibatch, since many-shot meta-learning is computationally expensive and memory intensive. In practice, we meta-train with update minibatches of size at most 25, whereas evaluation may use hundreds of datapoints for some tasks. After the model is updated, we measure the performance using held-out data $D^{test}_t$ from task $\mathcal{T}_t$. This data is not revealed to the online meta-learner at any time. Further, we also evaluate task learning efficiency, which corresponds to the size of $D_t$ required to achieve a specified performance threshold $\gamma$, e.g. $\gamma$ = 90% classification accuracy, or $\gamma$ corresponding to a certain loss value. If less data is sufficient to reach the threshold, then the priors learned from previous tasks are useful, and we have achieved positive transfer.
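A short sketch of this evaluation protocol, under the same illustrative assumptions as the earlier sketches (generic loss signature, hypothetical function names, arbitrary hyperparameters):

```python
# Illustrative sketch of within-round evaluation: adapt w_t to all data D_t seen so
# far (Update-Procedure), measure held-out loss on D^test_t, and record how many
# datapoints were needed to cross the proficiency threshold gamma.
import jax

def update_procedure(w, d_t, loss_fn, alpha=0.01, n_grad_steps=5):
    """Algorithm 2's Update-Procedure: a few full-batch gradient steps on D_t."""
    for _ in range(n_grad_steps):
        w = w - alpha * jax.grad(loss_fn)(w, d_t)
    return w

def evaluate_round(w_t, d_t, d_test, loss_fn, gamma, datapoints_seen):
    """Return held-out loss and, if proficient, the datapoints used to get there."""
    w_adapted = update_procedure(w_t, d_t, loss_fn)
    test_loss = loss_fn(w_adapted, d_test)
    efficiency = datapoints_seen if float(test_loss) < gamma else None
    return test_loss, efficiency
```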
6. Experimental Evaluation

Our experimental evaluation studies the practical FTML algorithm (Section 5) in the context of vision-based online learning problems. These problems include synthetic modifications of the MNIST dataset, pose detection with synthetic images based on PASCAL3D+ models (Xiang et al., 2014), and online classification with the CIFAR-100 dataset. The aim of our experimental evaluation is to study the following questions: (1) can online meta-learning (specifically FTML) be successfully applied to multiple non-stationary learning problems? and (2) does online meta-learning (FTML) provide empirical benefits over prior methods?

To this end, we compare to the following: (a) Train on everything (TOE) jointly trains a single predictive model on all available data so far (including $D_t$ at round $t$). (b) Train from scratch initializes $w_t$ randomly and trains using $D_t$. (c) Joint training with fine-tuning, which at round $t$ trains jointly on data up to round $t - 1$, and then fine-tunes the model specifically to round $t$ using only $D_t$. This corresponds to the standard FTL approach, followed by task-specific fine-tuning.

We note that TOE is a very strong point of comparison, capable of reusing representations across tasks, as has been proposed in a number of prior continual learning works (Rusu et al., 2016; Aljundi et al., 2017; Wang et al., 2017). However, unlike FTML, TOE does not explicitly learn the structure across tasks. Thus, it may not be able to fully utilize the information present in the data, and hence may not be able to learn new tasks with only a few examples. Further, the model might incur negative transfer if the new task differs substantially from previous ones, as has been observed in prior work (Parisotto et al., 2016). Training on each task from scratch avoids negative transfer, but also precludes any reuse between tasks. When the amount of data for a given task is large, we may expect training from scratch to perform well, since it can avoid negative transfer and can learn specifically for the particular task. Finally, FTL with fine-tuning represents a natural online learning comparison, which in principle should combine the best parts of learning from scratch and TOE, since this approach adapts specifically to each task and benefits from prior data. However, in contrast to FTML, this method does not explicitly meta-learn and hence may not fully utilize any structure in the tasks.

6.1. Rainbow MNIST

In this experiment, we create a sequence of tasks based on the MNIST dataset. We transform the digits in a number of ways to create different tasks, including 7 colored backgrounds, 2 scales (half size and original size), and 4 rotations at 90 degree intervals. As illustrated in Fig. 2, a task involves correctly classifying digits with a randomly sampled background, scale, and rotation. This leads to 56 total tasks. We partitioned the MNIST training dataset into 56 batches of examples, each with 900 images, and applied the corresponding task transformation to each batch of images. The ordering of tasks was selected at random, and we set 90% classification accuracy as the proficiency threshold.

Figure 2. Illustration of three tasks for Rainbow MNIST (top) and pose prediction (bottom). CIFAR images not shown. Rainbow MNIST includes different rotations, scales, and background colors. For pose prediction, the goal is to predict the global pose of the object on the table. Cross-task variation includes 50 object models within 9 object classes, varying object scales, and different camera viewpoints.

Figure 3. Rainbow MNIST results. Left: amount of data needed to learn each new task. Center: task performance after 100 datapoints on the current task. Right: task performance after all 900 datapoints for the current task have been received. Lower is better for all plots, while shaded regions show standard error over three random seeds. FTML can learn new tasks more and more efficiently as each new task is received, demonstrating effective forward transfer.

Learning curves in Fig. 3 show that FTML learns tasks more and more quickly as each new task is added. We also find that FTML substantially outperforms the alternative approaches in both efficiency and final performance. FTL performs better than TOE since it uses task-specific adaptation, but its performance is still inferior to FTML. We hypothesize that, while the prior methods improve in efficiency over the course of learning as they see more tasks, they struggle to prevent negative transfer on each new task. Lastly, training from scratch does not learn efficiently compared to models that incorporate data from other tasks; but its final performance with 900 datapoints is similar.
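The 56-task construction described at the start of this subsection amounts to enumerating the cross-product of transformations and pairing each specification with a disjoint batch of 900 images; the sketch below is illustrative (the color names and the image transform itself are assumptions, and the actual rendering pipeline is not reproduced):

```python
# Illustrative enumeration of the Rainbow MNIST task specifications:
# 7 background colors x 2 scales x 4 rotations = 56 tasks, presented in random order.
import itertools
import random

COLORS = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]  # 7 (assumed names)
SCALES = [1.0, 0.5]                                                        # original and half size
ROTATIONS = [0, 90, 180, 270]                                              # 90 degree intervals

task_specs = list(itertools.product(COLORS, SCALES, ROTATIONS))            # 56 task specifications
random.seed(0)
random.shuffle(task_specs)                                                 # random task ordering

def make_task(images, labels, spec):
    """Pair one 900-image MNIST batch with a (color, scale, rotation) spec.
    The background-coloring, rescaling, and rotation transforms are omitted here."""
    return {"spec": spec, "images": images, "labels": labels}

print(len(task_specs), "tasks, e.g.", task_specs[0])
```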
6.2. Five-Way CIFAR-100

In this experiment, we create a sequence of 5-way classification tasks based on the CIFAR-100 dataset, which contains more challenging and realistic RGB images than MNIST. Each classification problem involves a newly-introduced class from the 100 classes in CIFAR-100. Thus, different tasks correspond to different label spaces. The ordering of tasks is selected at random, and we measure performance using classification accuracy. Since it is less clear what the proficiency threshold should be for this task, we evaluate the accuracy on each task after varying numbers of datapoints have been seen. Since these tasks are mutually exclusive (as the label space is changing), it makes sense to train the TOE model with a different final layer for each task. An extremely similar approach is to use our meta-learning approach but to only allow the final layer parameters to be adapted to each task. Further, such a meta-learning approach is a more direct comparison to our full FTML method, and the comparison can provide insight into whether online meta-learning is simply learning features and performing training on the last layer, or if it is adapting the features to each task. Thus, we compare to this last-layer online meta-learning approach instead of TOE with multiple heads.

The results (see Figure 4) indicate that FTML learns more efficiently than independent models and a model with a shared feature space. The results on the right indicate that training from scratch achieves good performance with 2000 datapoints, reaching similar performance to FTML. However, the last-layer variant of FTML seems to not have the capacity to reach good performance on all tasks.

Figure 4. Online CIFAR-100 results, evaluating task performance after seeing 50, 250, and 2000 datapoints for each task. FTML learns each task much more efficiently than models trained from scratch, while both achieve similar asymptotic performance after 2000 datapoints. FTML benefits from adapting all layers rather than learning a shared feature space across tasks while adapting only the last layer.

Figure 5. Object pose prediction results. Left: we observe that online meta-learning leads to faster learning as more and more tasks are introduced, learning with only tens of datapoints for many of the tasks. Center and right: we see that meta-learning enables transfer not just for faster learning but also for more effective performance when 60 and 400 datapoints of each task are available. The task order is randomized, leading to spikes when more difficult tasks are introduced. Shaded regions show standard error across three random seeds.

6.3. Sequential Object Pose Prediction

In our final experiment, we study a 3D pose prediction problem. Each task involves learning to predict the global position and orientation of an object in an image. We construct a dataset of synthetic images using 50 object models from 9 different object classes in the PASCAL3D+ dataset (Xiang et al., 2014), rendering the objects on a table using the renderer accompanying the MuJoCo physics engine (Todorov et al., 2012) (see Figure 2). To place an object on the table, we select a random 2D location, as well as a random azimuthal angle. Each task corresponds to a different object with a randomly sampled camera angle. We place a red dot on one corner of the table to provide a global reference point for the position. Using this setup, we construct 90 tasks (with an average of about 2 camera viewpoints per object), with 1000 datapoints per task. All models are trained to regress to the global 2D position and the sine and cosine of the azimuthal angle (the angle of rotation along the z-axis). For the loss function, we use mean squared error, and we set the proficiency threshold to an error of 0.05.

We show the results of this experiment in Figure 5. The results demonstrate that meta-learning can improve both the efficiency and the performance of new tasks over the course of learning, solving many of the tasks with only 10 datapoints. Unlike the previous settings, TOE substantially outperforms training from scratch, indicating that it can effectively make use of the previous data from other tasks, likely due to the greater structural similarity between the pose detection tasks. However, the performance of FTML suggests that even better transfer can be accomplished by explicitly optimizing for the ability to quickly and effectively learn new tasks.
Finally, we find that FTL performs comparably to or worse than TOE, indicating that task-specific fine-tuning can lead to overfitting when the model is not explicitly trained for the ability to fine-tune effectively.

7. Connections to Related Work

Meta-learning: Our work proposes to use meta-learning or learning to learn (Thrun & Pratt, 1998; Schmidhuber, 1987; Naik & Mammone, 1992) in the context of online (regret-based) learning. Prior works have proposed learning update rules, selective copying of weights, or optimizers (Hochreiter et al., 2001; Bengio et al., 1992; Andrychowicz et al., 2016; Li & Malik, 2017; Ravi & Larochelle, 2017; Schmidhuber, 2002), as well as recurrent models that learn by ingesting datasets directly (Santoro et al., 2016; Duan et al., 2016; Wang et al., 2016; Munkhdalai & Yu, 2017; Mishra et al., 2017). Some meta-learning works have considered online learning at meta-test time (Santoro et al., 2016; Al-Shedivat et al., 2017; Nagabandi et al., 2018). However, with the exception of work on online hyperparameter adaptation (Elfwing et al., 2017; Meier et al., 2017; Baydin et al., 2017), nearly all prior meta-learning algorithms assume that the meta-training tasks come from a stationary distribution, and do not consider tasks that are presented as a continuous stream. In contrast, we consider a more flexible approach that allows for adapting all of the model's parameters during online meta-training. Recent work has also considered meta-training with non-stationary distributions using Dirichlet process mixture models over parameters (Grant et al., 2019). In contrast, we introduce a simple extension onto the MAML algorithm without mixtures over parameters, and provide theoretical guarantees.

Continual learning: Our problem setting is related to continual, or lifelong, learning (Thrun, 1998; Zhao & Schmidhuber, 1996). A number of papers in this area have focused on avoiding forgetting or negative backward transfer (Goodfellow et al., 2013; Kirkpatrick et al., 2017; Zenke et al., 2017; Rebuffi et al., 2017; Shin et al., 2017; Shmelkov et al., 2017; Lopez-Paz et al., 2017; Nguyen et al., 2017; Schmidhuber, 2013), and on maintaining a small model capacity as new tasks are added (Lee et al., 2017; Mallya & Lazebnik, 2017). In this paper, we sidestep the problem of catastrophic forgetting by maintaining a buffer of all the observed data (Isele & Cosgun, 2018), and instead focus on maximizing the efficiency of learning new tasks within a non-stationary meta-learning setting. Furthermore, unlike prior works (Ruvolo & Eaton, 2013; Rusu et al., 2016; Aljundi et al., 2017; Wang et al., 2017), we focus on the setting where there are several tens or hundreds of tasks, and therefore more information that can be transferred from previous tasks to enable few-shot acquisition of new concepts.

Online learning: Similar to continual learning, online learning considers a sequential setting with streaming tasks. Much of the prior work in this area has focused on computationally cheap algorithms that do not iterate over past data multiple times (Cesa-Bianchi & Lugosi, 2006; Hazan et al., 2006; Zinkevich, 2003; Shalev-Shwartz, 2012). Again, we sidestep computational considerations to first study the meta-learning analog of FTL. For this, we derived the FTML algorithm, which has low regret when compared to a powerful adaptive comparator class, and demonstrated empirical gains over strong baselines.
Adaptive notions of regret have been considered in prior work to overcome limitations of a fixed comparator. In the dynamic regret setting (Herbster & Warmuth, 1995; Yang et al., 2016; Besbes et al., 2015), the online learner is compared with the sequence of optimal solutions corresponding to each loss function. Unfortunately, lower bounds (Yang et al., 2016) suggest that this comparator class is too powerful and may not provide for any non-trivial learning in the general case. To overcome this challenge, prior work has placed restrictions on how quickly the loss functions or the comparator model can change (Hazan & Comandur, 2009; Hall & Willett, 2015; Herbster & Warmuth, 1995). In contrast, we develop a new notion of adaptive regret where the learner and comparator both have access to an update procedure. The update procedures allow the comparator to produce different models for different loss functions, thereby serving as a powerful comparator class (in comparison to a fixed model in hindsight). In this setting, we derived sublinear-regret algorithms without placing restrictions on the loss functions. Concurrent work has also studied algorithms related to first-order variants of MAML using theoretical tools from online learning (Alquier et al., 2016; Denevi et al., 2019; Khodak et al., 2019). These works also derive regret and generalization bounds, but the algorithms have not yet been empirically studied in large-scale domains or non-stationary settings. We believe that our online meta-learning setting captures the spirit and practice of continual lifelong learning, and also shows promising empirical results.

8. Discussion and Future Work

We introduced the online meta-learning setting, which provides a natural perspective on the ideal real-world learning procedure: an intelligent agent interacting with a constantly changing environment should utilize streaming experience to both master the task at hand and become more proficient at learning new tasks in the future. We proposed and analyzed the FTML algorithm, derived regret bounds for it, and illustrated how FTML can be adapted into a practical algorithm. Our experiments demonstrate that FTML outperforms prior methods, learning new tasks more and more efficiently over time. We next outline avenues for future work.

More powerful update procedures. We analyzed the case where the update $U_t$ is one gradient step. However, in practice, MAML is often used with multiple gradient steps. Analyzing this case, and potentially higher-order update rules, will make for exciting future work.

Memory and computational constraints. In this work, we primarily aimed to discern if it is possible to meta-learn in a sequential setting. As discussed in Section 7, the cost of FTL (and FTML) grows over time as new tasks and loss functions are accumulated. Further, in many practical online learning problems, it is challenging to store all datapoints from previous tasks. While we showed that our method can effectively learn nearly 100 tasks in sequence without significant burdens on compute or memory, scalability remains a concern. Can a more streaming algorithm like mirror descent, which does not store all the past experiences, be successful as well? Our main theoretical result (Section 4.3) suggests that there exists a large family of online meta-learning algorithms that enjoy sublinear regret. Tapping into the large body of work in online learning, particularly mirror descent, to develop computationally cheaper algorithms would make for exciting future work.
Acknowledgements

Aravind Rajeswaran thanks Emo Todorov for valuable discussions on the problem formulation. This work was supported by the National Science Foundation via IIS-1651843, Google, Amazon, and NVIDIA. Sham Kakade acknowledges funding from the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery and the NSF CCF 1740551 award.

References

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. CoRR, abs/1710.03641, 2017.

Aljundi, R., Chakravarty, P., and Tuytelaars, T. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Alquier, P., Mai, T. T., and Pontil, M. Regret bounds for lifelong learning. In AISTATS, 2016.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Neural Information Processing Systems (NIPS), 2016.

Antoniou, A., Edwards, H., and Storkey, A. How to train your MAML. arXiv preprint arXiv:1810.09502, 2018.

Baydin, A. G., Cornish, R., Rubio, D. M., Schmidt, M., and Wood, F. Online learning rate adaptation with hypergradient descent. arXiv:1703.04782, 2017.

Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. On the optimization of a synaptic learning rule. In Optimality in Artificial and Biological Neural Networks, 1992.

Besbes, O., Gur, Y., and Zeevi, A. J. Non-stationary stochastic optimization. Operations Research, 63:1227-1244, 2015.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. 2006.

Denevi, G., Ciliberto, C., Grazzi, R., and Pontil, M. Learning-to-learn stochastic gradient descent with biased regularization. CoRR, abs/1903.10399, 2019.

Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779, 2016.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

Elfwing, S., Uchibe, E., and Doya, K. Online meta-learning by parallel algorithm competition. arXiv:1702.07490, 2017.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning (ICML), 2017.

Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211, 2013.

Grant, E., Finn, C., Levine, S., Darrell, T., and Griffiths, T. Recasting gradient-based meta-learning as hierarchical Bayes. International Conference on Learning Representations (ICLR), 2018.

Grant, E., Jerfel, G., Heller, K., and Griffiths, T. L. Modulating transfer between tasks in gradient-based meta-learning, 2019.

Hall, E. C. and Willett, R. M. Online convex optimization in dynamic environments. IEEE Journal of Selected Topics in Signal Processing, 9:647-662, 2015.

Hannan, J. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 1957.

Hazan, E. Introduction to online convex optimization. 2016.

Hazan, E. and Comandur, S. Efficient learning algorithms for changing environments. In ICML, 2009.

Hazan, E., Kalai, A. T., Kale, S., and Agarwal, A. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169-192, 2006.
Herbster, M. and Warmuth, M. K. Tracking the best expert. Machine Learning, 32:151-178, 1995.

Hochreiter, S., Younger, A. S., and Conwell, P. R. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, 2001.

Isele, D. and Cosgun, A. Selective experience replay for lifelong learning. arXiv preprint arXiv:1802.10269, 2018.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In ICML, 2017.

Kalai, A. T. and Vempala, S. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71:291-307, 2005.

Khodak, M., Balcan, M.-F., and Talwalkar, A. S. Provable guarantees for gradient-based meta-learning. CoRR, abs/1902.10644, 2019.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.

Lee, J., Yun, J., Hwang, S., and Yang, E. Lifelong learning with dynamically expandable networks. arXiv:1708.01547, 2017.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 2016.

Li, K. and Malik, J. Learning to optimize. International Conference on Learning Representations (ICLR), 2017.

Lopez-Paz, D. et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017.

Mallya, A. and Lazebnik, S. PackNet: Adding multiple tasks to a single network by iterative pruning. arXiv:1711.05769, 2017.

Meier, F., Kappler, D., and Schaal, S. Online learning of a memory for learning rates. arXiv:1709.06709, 2017.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. Meta-learning with temporal convolutions. arXiv preprint arXiv:1707.03141, 2017.

Munkhdalai, T. and Yu, H. Meta networks. International Conference on Machine Learning (ICML), 2017.

Nagabandi, A., Finn, C., and Levine, S. Deep online learning via meta-learning: Continual adaptation for model-based RL. arXiv preprint arXiv:1812.07671, 2018.

Naik, D. K. and Mammone, R. Meta-neural networks that learn by learning. In International Joint Conference on Neural Networks (IJCNN), 1992.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Math. Program., 108:177-205, 2006.

Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. Variational continual learning. arXiv:1710.10628, 2017.

Parisotto, E., Ba, J. L., and Salakhutdinov, R. Actor-mimic: Deep multitask and transfer reinforcement learning. International Conference on Learning Representations (ICLR), 2016.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

Rebuffi, S.-A., Kolesnikov, A., and Lampert, C. H. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, 2017.

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive neural networks. arXiv:1606.04671, 2016.

Ruvolo, P. and Eaton, E. ELLA: An efficient lifelong learning algorithm. In International Conference on Machine Learning, pp. 507-515, 2013.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
Schmidhuber, J. Evolutionary principles in self-referential learning. Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.

Schmidhuber, J. Optimal ordered problem solver. Machine Learning, 54:211-254, 2002.

Schmidhuber, J. PowerPlay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. In Front. Psychol., 2013.

Shaban, A., Cheng, C.-A., Hirschey, O., and Boots, B. Truncated back-propagation for bilevel optimization. CoRR, abs/1810.10667, 2018.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012.

Shalev-Shwartz, S. and Kakade, S. M. Mind the duality gap: Logarithmic regret algorithms for online optimization. In NIPS, 2008.

Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, 2017.

Shmelkov, K., Schmid, C., and Alahari, K. Incremental learning of object detectors without catastrophic forgetting. arXiv:1708.06977, 2017.

Singh, A., Yang, L., and Levine, S. GPLAC: Generalizing vision-based robotic skills using weakly labeled images. arXiv:1708.02313, 2017.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Thrun, S. Lifelong learning algorithms. In Learning to Learn. Springer, 1998.

Thrun, S. and Pratt, L. Learning to Learn. Springer Science & Business Media, 1998.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS), 2012.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv:1611.05763, 2016.

Wang, Y.-X., Ramanan, D., and Hebert, M. Growing a brain: Fine-tuning by increasing model capacity. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Xiang, Y., Mottaghi, R., and Savarese, S. Beyond PASCAL: A benchmark for 3D object detection in the wild. In Conference on Applications of Computer Vision (WACV), 2014.

Yang, T., Zhang, L., Jin, R., and Yi, J. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In ICML, 2016.

Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In International Conference on Machine Learning, 2017.

Zhao, J. and Schmidhuber, J. Incremental self-improvement for life-time multi-agent reinforcement learning. In From Animals to Animats 4: International Conference on Simulation of Adaptive Behavior, 1996.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.