# Memory Efficient Online Meta Learning

Durmus Alp Emre Acar¹, Ruizhao Zhu¹, Venkatesh Saligrama¹

We propose a novel algorithm for online meta learning where task instances are sequentially revealed with limited supervision, and a learner is expected to meta learn them in each round, so as to allow the learner to customize a task-specific model rapidly with little task-level supervision. A fundamental concern arising in online meta-learning is the scalability of memory as more tasks are viewed over time. Heretofore, prior works have allowed for perfect recall, leading to a linear increase in memory with time. Different from prior works, in our method, prior task instances are allowed to be deleted. We propose to leverage prior task instances by means of a fixed-size state-vector, which is updated sequentially. Our theoretical analysis demonstrates that our proposed memory-efficient online meta learning (MOML) method suffers sub-linear regret with convex loss functions and sub-linear local regret for nonconvex losses. On benchmark datasets we show that our method can outperform prior works even though they allow for perfect recall.

1. Introduction

Meta Learning (Vinyals et al., 2016; Santoro et al., 2016) is the problem of learning to learn new tasks. This is typically accomplished by training a meta-model on a diverse set of tasks, such that the meta-model can in turn train and output a classifier on a new task using only a few training examples. There has been a burst of activity on meta-learning recently (Finn et al., 2017; Ravi & Larochelle, 2017; Zhou et al., 2019), with much of it devoted to the batch setting. Namely, during training, datasets composed of different tasks are provided, and the goal is to train a meta-learner, which at test time can be rapidly re-purposed for a new task by training on relatively few annotated examples.

¹Boston University, Boston, MA. Correspondence to: Durmus Alp Emre Acar, Ruizhao Zhu, Venkatesh Saligrama. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

While human learning is evidently the inspiration for traditional meta-learning approaches, there are significant differences. Humans learn continually by dynamically and continuously adapting their representations, beliefs and predictions as they encounter new tasks (Lake et al., 2017). Motivated by applications that involve such continual and lifelong learning, Finn et al. (2019) advocate an online meta-learning approach, in order to bridge the seeming gap between online learning and meta-learning. Their point is that online learning (Shalev-Shwartz & Singer, 2007) mirrors the way in which humans sequentially encounter new tasks one after another; but unlike humans, online learning neither allows for task-specific adaptation nor considers how solutions to past tasks can help solve new ones. On the other hand, while conventional meta-learning does incorporate task-specific adaptation and knowledge transfer from past experience, it ignores the episodic, sequential and non-stationary aspects of how tasks are encountered.

In this paper, we develop a novel method for online meta-learning that overcomes drawbacks of Finn et al. (2019). Like Finn et al. (2019), we focus on training deep neural network models, and consider the setting where task instances are revealed to the learner episodically.
We also attempt to train the model's initial parameters so that, on any new task, the model parameters can be rapidly adapted with a small amount of data by means of gradient descent. Subsequently, the learner updates its underlying task-agnostic parameters based on solving the new task.

Linear Memory Scaling. Finn et al. (2019)'s follow-the-meta-leader (FTML) leverages the well-known follow-the-regularized-leader (FTRL) method in online learning. While learning task-agnostic model parameters (the meta-learning step), FTML recalls all task instances that have appeared so far. This leads to a linear increase in memory with the number of observed tasks, which is undesirable and impractical. One option is to update the meta model using only the current task, but this leads to significant current-task bias. Another possibility is to leverage an online gradient descent (OGD) algorithm, based on linearization of the loss functions for prior tasks, but the linearization would be around stale model parameters associated with the previous tasks. While practical, empirically these options do not lead to meaningful meta-learning performance.

Memory-Efficient Online Meta-Learning (MOML). To overcome memory scaling, we introduce a fixed-size state-vector, which is dynamically updated after completion of each episodic task. The state-vector, which serves the purpose of encoding past task experiences, parameterizes the regularizer penalty for the next episode. As a result, model parameters retain prior task experience, and utilize this experience to solve new tasks. Our MOML scheme is not only memory efficient, but also more effective than FTML. We compare MOML with FTML on current tasks as well as past tasks (to evaluate catastrophic forgetting), and show that MOML dominates FTML on several benchmark datasets. We also analyze MOML theoretically and show that for $T$ tasks, we achieve sub-linear $O(\sqrt{T})$ regret on convex losses, and sub-linear local regret for non-convex loss functions.

Our Contributions.
- We present MOML, a new family of online learning algorithms that do not explicitly store loss functions from previous rounds.
- We show that MOML has $O(\sqrt{T})$ regret guarantees.
- We empirically show that MOML achieves a significantly improved memory footprint with no perceptible degradation in performance over existing baselines.

1.1. Related Work

Meta learning is a popular and evolving field, and we only briefly describe the different approaches (see Hospedales et al. (2020); Vanschoren (2018) for surveys). One approach is to learn a black-box optimizer for the test task based on the state of the algorithm (Ravi & Larochelle, 2017; Mishra et al., 2017; Santoro et al., 2016). For instance, Ravi & Larochelle (2017) propose an LSTM model in which the state of the LSTM determines the specific optimization procedure for the current task. More recently, Finn et al. (2017) introduced the Model-Agnostic Meta Learning (MAML) approach. MAML performs one-step gradient descent on the task datasets, and the goal is to learn a good task-agnostic initialization that works well for all tasks. Many works have introduced extensions to MAML, including novel transformations (Li et al., 2017; Zhou et al., 2019) as well as improved training algorithms (Antoniou et al., 2018). There are also works that propose non-parametric meta adaptations (Snell et al., 2017; Vinyals et al., 2016).
For instance, prototypical adaptation (Snell et al., 2017) is a metric-based method where the task-specific model is obtained using k-NN with respect to the class prototypes. Additionally, Chen et al. (2020) show that pretraining meta models before meta learning improves performance. Different from ours, meta learning in these works is viewed as an offline problem where training and test time are separated.

Online learning. In vanilla online learning (Shalev-Shwartz & Singer, 2007), (possibly adversarial) loss functions are sequentially revealed, and the learner is trained as well as tested at each round. The agent aims to minimize cumulative regret, which measures how well the algorithm performs compared to the best possible fixed model in hindsight. Online learning is a well-established field, and we refer to extensive studies for more information (Hazan, 2019; Shalev-Shwartz, 2012). Online gradient descent (OGD) (Zinkevich, 2003) takes a gradient descent step in each round using the current loss. Follow the Regularized Leader (FTRL) (Abernethy et al., 2008) minimizes a regularized version of all seen loss functions. Different from meta learning, the learner is expected to minimize the loss functions directly, and does not leverage insights from past experiences. In contrast, our focus is on the online meta learning problem, where the agent is expected to meta learn the revealed task instances.

Continual learning (a.k.a. lifelong learning) (Thrun & Pratt, 2012) is related to online learning, with a particular focus on catastrophic forgetting. In this context, the agent is expected to do well on the seen tasks as well as efficiently learn new task instances (Chen & Liu, 2018). For instance, the Learn-to-Grow framework (Li et al., 2019) avoids forgetting previous tasks by expanding the network architecture of the learner as new tasks arrive. Variational continual learning (Nguyen et al., 2018) applies variational inference on a set of representative datapoints from the seen tasks. In contrast, our goal is to reduce the memory footprint by allowing the learner to delete all data instances of past tasks. Additionally, our goal is to derive regret guarantees as in online learning.

Online meta learning (Finn et al., 2019) fuses meta learning with online learning: an agent is expected to meta learn tasks that are sequentially revealed. Prior works have proposed to apply OGD and FTRL to the online meta learning problem (Zhuang et al., 2020; Finn et al., 2017). In the FTML (Finn et al., 2019) approach, the goal is to learn initializations of model parameters (the meta-model) so as to allow for quick adaptation to all of the previously viewed task instances by taking a few gradient steps from the meta-model. For this reason, FTML must store data from all seen tasks to update its meta model. This means that the memory complexity of FTML grows linearly as new tasks arrive, which is impractical. We propose a memory-efficient approach, which does not require storing past instances. Subsequent works such as OSML (Yao et al., 2020) and FTML-VS (Yu et al., 2020) propose extensions to FTML. OSML is a pipeline with multiple so-called meta blocks that facilitate the learning of new tasks. FTML-VS aims to decrease the number of datapoints used during meta training as tasks arrive. Nevertheless, these methods still require storing datapoints for all seen tasks.
Different from these works, our goal is to bypass storing datapoints corresponding to seen tasks.

2. Memory Efficient Online Meta Learning

The learner's goal is to train a meta model, chosen from a parameterized model class with parameters $\theta \in \mathbb{R}^d$, in a setting where new task instances are sequentially revealed at each round. Each task has a specific joint distribution $P^t$ from which task features $x \in \mathcal{X}$ and the corresponding labels $y \in \mathcal{Y}$ are drawn, $(x, y) \sim P^t$. The agent has access to a limited supervised dataset $D^t = \{(x^i_t, y^i_t)\}_{i=1}^{N_t}$ for each task in order to obtain a task-specific model. We define the task loss as the expected loss with respect to $P^t$, $f^t(\theta) = \mathbb{E}_{(x,y) \sim P^t}\, L((x, y); \theta)$, where $L$ is the loss function that the model incurs on the data tuple $(x, y)$. Our objective is to obtain a sub-linear rate for the following regret statement:

$$\sum_{t=1}^{T} f^t \circ U^t(\theta_t) \;-\; \min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{T} f^t \circ U^t(\theta), \qquad (1)$$

where $U^t$ is the meta adaptation function that transforms the meta model into a task-specific model using the limited supervised dataset $D^t$, and the agent is compared against the best fixed meta learner that has access to all loss functions $\{f^t\}_{t=1}^{T}$ in hindsight.

Adaptation function. The regret statement depends on the transformation function $U^t$ for task $t$. Many transformations have been proposed in the meta learning field to obtain a task-specific model from the meta model. In this work, we focus on MAML (Finn et al., 2017) adaptation. The MAML transformation is

$$U^t(\theta) = \theta - \eta \, \nabla_\theta \frac{1}{N_t} \sum_{(x,y) \in D^t} L((x, y); \theta),$$

and corresponds to updating the meta model with a single gradient descent step using a meta learning rate $\eta$.
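To make the adaptation concrete, the following is a minimal PyTorch sketch of the one-step MAML transformation $U^t$ above. The functional parameter handling and the `loss_fn` helper are simplifying assumptions for illustration, not the paper's implementation:

```python
import torch

def maml_adapt(meta_params, loss_fn, task_batch, eta=0.01):
    """One-step MAML adaptation U^t(theta): a single gradient step on the
    task's limited data, starting from the meta-model parameters.

    meta_params: list of tensors (the meta model theta) with requires_grad=True.
    loss_fn(params, batch): average loss L of the model with `params` on
    `batch` (a hypothetical helper signature, for illustration only).
    """
    task_loss = loss_fn(meta_params, task_batch)
    # create_graph=True keeps this step differentiable, so a meta update
    # can later backpropagate through the adaptation (as MAML requires).
    grads = torch.autograd.grad(task_loss, meta_params, create_graph=True)
    # theta' = theta - eta * grad of the average task loss: the task-specific model.
    return [p - eta * g for p, g in zip(meta_params, grads)]
```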
MOML Intuition. Before describing the algorithm, we build intuition for our solution by considering two conventional online learning algorithms: FTRL and OGD. Assume that we have losses $\{\ell^t\}_{t=1}^{T}$ in an online setting.

FTRL Algorithm. FTRL updates its model using all seen losses and a regularizer, namely,

$$\theta_{t+1} = \arg\min_{\theta} \; \frac{\mu}{2}\|\theta\|^2 + \sum_{i=1}^{t} \ell^i(\theta), \qquad (2)$$

where $\mu$ is the coefficient on the quadratic regularizer. FTRL must store the history of all past observed loss functions. Since the optimal competitor minimizes the sum of these losses (i.e., $\sum_{t=1}^{T} \ell^t(\theta)$), we can view FTRL, with proper regularization, as converging to the optimal competitor's risk at the expense of storing all seen task information.

OGD Algorithm. Different from FTRL, OGD does not store the seen losses. It applies one gradient descent step using the currently revealed loss,

$$\theta_{t+1} = \theta_t - \beta \nabla \ell^t(\theta_t), \qquad (3)$$

where $\beta$ is the learning rate. Even though it achieves the desired regret guarantees, it is hard to relate OGD to the best competitor that minimizes the sum of losses (i.e., $\sum_{t=1}^{T} \ell^t(\theta)$).

The Bias Problem. OGD updates the model with gradients arising from the loss revealed at that time. Since the best competitor seeks to optimize the sum of losses, one could learn the best competitor by running SGD on the average of all losses,

$$\theta_{k+1} = \theta_k - \beta \, \frac{1}{T} \sum_{t=1}^{T} \nabla \ell^t(\theta_k). \qquad (4)$$

This iteration converges to the optimal solution in hindsight for convex losses and a suitable choice of $\beta$. The difference between the OGD update in Eq. 3 and the competitor update in Eq. 4 is that the gradient directions are not aligned. More explicitly, the minimum of the current loss is not the same as that of the competitor, $\arg\min_\theta \ell^t(\theta) \neq \arg\min_\theta \sum_{s=1}^{T} \ell^s(\theta)$, and as such the gradients point in different directions. We propose a solution based on debiasing the gradient of the current loss, in the hope of taking a step towards the global gradient, while bypassing the need to recall past seen instances.

A Toy Example. We illustrate the bias issue in an online learning setting with $T = 6$ losses, where the parameter space is two dimensional. Figure 1 shows the contour plots (for quadratic loss functions) and the corresponding local optimal solutions for the 5 seen losses, the currently revealed loss $\ell^6(\theta)$, and the sum of all losses $\sum_{t=1}^{6} \ell^t(\theta)$. The learner's parameters at this time are marked in the figure. Using the current loss, we can only move towards the minimum of the current loss, shown with a red arrow as the "biased direction". However, we would like to move towards the global minimum, denoted the "correct direction", since it points to the best competitor. The biased and correct directions are not aligned, which we refer to as the bias problem.

Figure 1. A two-dimensional online learning setting with T = 6 losses. The current loss pulls the model towards its own minimum, but the sum of losses has a different optimum point; hence the model is biased towards the current loss minimum. We propose to debias the gradient so that the model is correctly updated.

The Debiasing Concept. Unfortunately, at a round $t < T$, not all of the losses have been revealed, so we can debias based only on the past losses. We propose to debias the current loss (the "debiasing direction" in Figure 1) by leveraging the past seen empirical loss functions. We then substitute a surrogate direction with the goal of correcting the biased direction. Our objective is to bypass the need to recall previously seen instances in computing this correction. Let $\theta_t$ denote the current model. We start from the current model, $\theta^{t+1}_1 = \theta_t$, and apply $K$ corrected gradient descent steps,

$$\theta^{t+1}_{k+1} = \theta^{t+1}_k - \beta \left( \nabla \ell^t(\theta^{t+1}_k) - d^t + c^t \right), \qquad (5)$$

where $k = 1, 2, \ldots, K$, $d^t$ debiases the current loss and $c^t$ encourages the correct direction. We note that our proposed solution does not require storing the losses; the $d^t$ and $c^t$ terms are updated using only the current loss.

MOML Algorithm. Our proposed method is presented in Algorithm 1. In each round, the current loss $f^t$ along with the adaptation function $U^t$ is revealed to the algorithm, and the objective is modified with a quadratic regularization. A task-specific model is obtained using the transformation and the current meta model $\theta_t$ as $\theta'_t = U^t(\theta_t)$, and the performance of the task-specific model is recorded as $f^t(\theta'_t) = f^t \circ U^t(\theta_t)$. We update the meta model by optimizing the current loss together with a quadratic penalty:

$$R^t(\theta) = -\left\langle \nabla f^{t-1} \circ U^{t-1}(\theta_t), \theta \right\rangle + \frac{\alpha}{2} \left\| \theta - w^t \right\|^2, \qquad (6)$$

where $\nabla f^{t-1} \circ U^{t-1}(\theta_t)$ and $w^t$ are the states the model stores. The linear term debiases the current loss and the quadratic term corrects the direction, playing the roles of the $d^t$ and $c^t$ terms defined in the update of Eq. 5.

Algorithm 1 Memory Efficient Online Meta Learning (MOML)
Input: $T$; $\nabla f^0 \circ U^0(\theta_1) = w^1 = \theta_1 = 0$; $\alpha$; $K$; $\beta$
for $t = 1, 2, \ldots, T$ do
    Output $\theta_t$; reveal $f^t$ and $U^t$; suffer $f^t \circ U^t(\theta_t)$
    $R^t(\theta) = -\langle \nabla f^{t-1} \circ U^{t-1}(\theta_t), \theta \rangle + \frac{\alpha}{2}\|\theta - w^t\|^2$
    $\theta^{t+1}_1 = \theta_t$
    for $k = 1, 2, \ldots, K$ do
        $\theta^{t+1}_{k+1} = \theta^{t+1}_k - \beta \left( \nabla f^t \circ U^t(\theta^{t+1}_k) + \nabla R^t(\theta^{t+1}_k) \right)$
    end for
    $\theta_{t+1} = \theta^{t+1}_{K+1}$
    $w^{t+1} = \frac{1}{2} \left( w^t + \theta_{t+1} - \frac{1}{\alpha} \nabla f^t \circ U^t(\theta_{t+1}) \right)$
end for

Subsequently, the algorithm iteratively optimizes the loss function with gradient corrections,

$$\theta^{t+1}_{k+1} = \theta^{t+1}_k - \beta \left( \nabla f^t \circ U^t(\theta^{t+1}_k) + \nabla R^t(\theta^{t+1}_k) \right). \qquad (7)$$

After $K$ gradient descent updates, the new meta model is obtained as $\theta_{t+1} = \theta^{t+1}_{K+1}$. MOML explicitly stores the states $(\nabla f^{t-1} \circ U^{t-1}(\theta_t), w^t)$ as a summary of previous task instances, and incorporates this information in the constructed regularizer. The $w$ state is recursively updated as

$$w^{t+1} = \frac{1}{2} \left( w^t + \theta_{t+1} - \frac{1}{\alpha} \nabla f^t \circ U^t(\theta_{t+1}) \right). \qquad (8)$$

This completes one round of the update mechanism for MOML.
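To illustrate the mechanics, here is a minimal sketch of one MOML round, following Eqs. 6-8 and Algorithm 1 on a flattened parameter tensor. The `meta_loss` helper, standing in for the composed objective $f^t \circ U^t$, and the flattened-tensor layout are assumptions for illustration only, not the paper's implementation:

```python
import torch

def moml_round(theta, w, prev_grad, meta_loss, alpha, beta, K):
    """One round of MOML (Algorithm 1) on a flattened parameter tensor.

    theta: current meta model theta_t;  w: state w^t.
    prev_grad: stored state grad f^{t-1} o U^{t-1}(theta_t).
    meta_loss(theta): evaluates f^t o U^t(theta) for the current task
    (hypothetical helper standing in for adaptation plus task loss).
    """
    theta_k = theta.detach().clone()
    for _ in range(K):  # K corrected gradient steps (Eq. 7)
        theta_k.requires_grad_(True)
        (grad,) = torch.autograd.grad(meta_loss(theta_k), theta_k)
        # grad R^t(theta) = -prev_grad + alpha * (theta - w^t)   (from Eq. 6)
        reg_grad = -prev_grad + alpha * (theta_k.detach() - w)
        theta_k = theta_k.detach() - beta * (grad + reg_grad)
    theta_next = theta_k

    # Refresh the stored states for the next round.
    theta_g = theta_next.detach().requires_grad_(True)
    (new_grad,) = torch.autograd.grad(meta_loss(theta_g), theta_g)
    # w^{t+1} = (w^t + theta_{t+1} - (1/alpha) grad f^t o U^t(theta_{t+1})) / 2   (Eq. 8)
    w_next = 0.5 * (w + theta_next - new_grad / alpha)
    return theta_next, w_next, new_grad.detach()
```

Note that only two fixed-size tensors (`w` and the previous meta gradient) carry information across rounds, which is what keeps the memory footprint constant.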
Buffered-MOML (B-MOML). MOML can be extended to the setting where a fixed-size buffer of previous task instances is also used. In this variant, we are allowed to store the latest $B$ losses, and the update rule for $\theta$ becomes

$$\theta^{t+1}_{k+1} = \theta^{t+1}_k - \beta \left( \nabla L^t_B(\theta^{t+1}_k) + \nabla R^t_B(\theta^{t+1}_k) \right),$$

where $L^t_B(\theta) = \frac{1}{B} \sum_{i=0}^{B-1} f^{t-i} \circ U^{t-i}(\theta)$ is the average of the last $B$ losses and $R^t_B(\theta) = -\left\langle \frac{1}{B} \sum_{i=0}^{B-1} \nabla f^{t-i-1} \circ U^{t-i-1}(\theta_{t-i}), \theta \right\rangle + \frac{\alpha}{2}\|\theta - w^t\|^2$ is the correspondingly adapted regularizer. We utilize B-MOML in deriving regret bounds for the nonconvex (adversarial) setting.

Random Task Buffer. In experiments we also consider storing random task instances in our buffer. To do so, we treat the buffer as a first-in-first-out (FIFO) queue. For each new task, we flip a biased coin with parameter $p$; on acceptance, we put the new task at the end of our queue and delete the first (oldest) task.
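A minimal sketch of this random task buffer follows, assuming a generic `task` object as the stored unit (the paper does not specify the buffer's data layout):

```python
import random
from collections import deque

def make_task_buffer(B):
    # Fixed-size FIFO queue; deque(maxlen=B) drops the oldest task
    # automatically once the buffer already holds B tasks.
    return deque(maxlen=B)

def maybe_store_task(buffer, task, p=0.5):
    """Random task buffer for B-MOML: flip a biased coin with parameter p.
    On acceptance, append the new task at the end of the queue (the first,
    i.e. oldest, task is evicted when full); otherwise discard the task."""
    if random.random() < p:
        buffer.append(task)
```

With p = 1, every task is accepted and this reduces to the latest-B buffer.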
Memory footprint of MOML does not grow over time. Our proposed method does not require storing all seen task instances. Instead, it accumulates previous task information in the auxiliary state $w$. Different from MOML, FTML must increase its memory usage in each round, since it explicitly stores previous task information. We note that this leads to significant savings in terms of memory requirements.

MOML can leverage any meta adaptation function $U^t$. We can use any adaptation function, such as MAML, or other objectives such as prototypical adaptation. MOML thus leads to a new family of algorithms that can be applied to any online learning setup.

2.1. Analysis of MOML

MOML minimizes the following risk in $K$ gradient steps:

$$\min_{\theta \in \mathbb{R}^d} f^t \circ U^t(\theta) + R^t(\theta). \qquad (9)$$

To simplify the convergence analysis, we assume that MOML reaches a stationary point of Eq. 9, namely,

$$\nabla f^t \circ U^t(\theta_{t+1}) - \nabla f^{t-1} \circ U^{t-1}(\theta_t) + \alpha \left( \theta_{t+1} - w^t \right) = 0. \qquad (10)$$

This assumption is not unrealistic. Indeed, due to our quadratic penalty, in experiments we have found that MOML essentially reaches a stationary point (i.e., Eq. 10) within a small number of gradient steps. In particular, we find that the residual noise is significantly smaller than the model parameters, and can be ignored for the purpose of analysis. Nevertheless, we point out that it is possible to extend our results to include this additional residual noise.

MOML proposes to reach a similar solution without explicitly storing losses from previous rounds. We present an intuitive justification for our viewpoint based on the assumption that MOML is a stable¹ algorithm and the sequential updates $\theta_t$ converge, namely $\lim_{t \to \infty} \theta_t = \theta^\infty$. With this assumption in place, it follows that MOML, if it converges, converges to a stationary point of the desired loss function. We state this as a proposition.

Proposition 1. Suppose $f^t \circ U^t(\cdot)$ is a sequence of smooth functions with uniformly bounded Lipschitz constants. Furthermore, suppose $\theta_t$ is a bounded convergent sequence approaching $\theta^\infty$. Then $\theta^\infty$ is also a stationary point of the competitor objective defined in Eq. 1.

¹Validating this assumption requires additional proof, such as showing that the map $\theta_t \to \theta_{t+1}$ is contractive. In online meta learning, convergence of $\theta_t$ is not required since the losses are non-stochastic; convergence arises in stochastic loss settings.

We sketch the proof below. By a Cesàro² mean argument, if the models converge, the mean model does as well: $\frac{1}{t} \sum_{s \in [t]} \theta_s \to_t \theta^\infty$. If we average Eq. 10 over time, the $\nabla f^t \circ U^t(\theta_{t+1})$ terms telescope and we get

$$\frac{1}{t} \sum_{s \in [t]} \theta_{s+1} - \frac{1}{t} \sum_{s \in [t]} w^s = -\frac{1}{\alpha t} \nabla f^t \circ U^t(\theta_{t+1}).$$

Since the gradients are bounded, $\frac{1}{\alpha t} \nabla f^t \circ U^t(\theta_{t+1}) \to_t 0$. Consequently, the mean model and the mean $w$ state converge to the same point, $\frac{1}{t} \sum_{s \in [t]} w^s \to_t \theta^\infty$. If we average the update rule of $w$ in Eq. 8 over time and plug in these relations, we get $\frac{1}{t} \sum_{s \in [t]} \nabla f^{s-1} \circ U^{s-1}(\theta_s) \to_t 0$. Since we assume $\theta_t \to_t \theta^\infty$, for sufficiently large $t$, $\frac{1}{t} \sum_{s \in [t]} \nabla f^{s-1} \circ U^{s-1}(\theta^\infty) \to_t 0$. This is the stationary point relation of the accumulated losses. Therefore, MOML can be close to the optimal competitor without actually storing the losses from previous rounds.

²If a sequence $\{a_i\}_{i=1}^{\infty}$ converges, $a_i \to_i a^\infty$, its mean also converges: $\frac{1}{i} \sum_{s=1}^{i} a_s \to_i a^\infty$.

As in standard online learning, we assume the losses have bounded gradients, $\|\nabla f^t \circ U^t(\theta)\| \le G$. We first give a regret statement for convex losses.

Theorem 1. For convex, possibly adversarial $\{f^t \circ U^t\}_{t=1}^{T}$ and $\alpha = O(\sqrt{T})$, Algorithm 1 satisfies

$$\sum_{t=1}^{T} f^t \circ U^t(\theta_t) - \sum_{t=1}^{T} f^t \circ U^t(\theta^\star) = O\left( \sqrt{T} \left( \|\theta^\star\|^2 + G^2 \right) \right),$$

where $\theta^\star = \arg\min_{\theta \in \mathbb{R}^d} \sum_{t=1}^{T} f^t \circ U^t(\theta)$.

Theorem 1 gives a sub-linear, $O(\sqrt{T})$, regret rate for convex functions with bounded gradients. The dependency on $T$ is optimal, since there exists a setting where any algorithm suffers $\Omega(\sqrt{T})$ regret for convex adversarial losses (Hazan et al., 2007).

Nonconvex Adversarial. It is well known (see Hazan et al. (2017)) that for adversarial nonconvex losses, sub-linear regret for the formulation in Eq. 1 is generally difficult to achieve. One option is to instead evaluate $\sum_{t=1}^{T} \|\nabla \ell^t(\theta_t)\|^2$, but this is not meaningful, since we want to reach a stationary point of the sum of the losses, not of each individual loss. As a tractable regret notion, Hazan et al. (2017) propose to use

$$\sum_{t=1}^{T} \left\| \frac{1}{B} \sum_{i=0}^{B-1} \nabla \ell^{t-i}(\theta_t) \right\|^2,$$

where we consider a $B$-sized window of the losses. This time window smoothens the gradient of the suffered loss and leads to sub-linear regret. Similar to the Hazan et al. (2017) algorithm, we use B-MOML to tackle this type of regret notion. B-MOML achieves $O\left( \frac{T}{B^2} G^2 \right)$ regret. Hazan et al. (2017) show a lower bound, where a set of losses is constructed in a way that any algorithm suffers $\Omega\left( \frac{T}{B^2} \right)$. Based on this result, our regret rate is optimal in terms of its $T$ and $B$ dependencies. We give the formal statement and the proof in Appendix A.4.

Stochastic Setting with Nonconvex Losses. As another extension, we propose a different regret notion for nonconvex losses where we need not consider a time window, either in the regret statement or in the algorithm. In this notion, at each time, we assume the losses are chosen uniformly at random without replacement from a predetermined set of losses, $i_t \in [K]$, $\{\ell^j\}_{j=1}^{K}$, and we aim to minimize the accumulated expected squared gradient norm below, where the expectation is with respect to the random index $i_t$ of the losses.

Theorem 2. Suppose $\mathcal{T}$ is a collection of tasks and, for each $\tau \in \mathcal{T}$, $f^\tau \circ U^\tau$ is nonconvex and $L$-smooth. We choose tasks $i_t$ from a task distribution $P_{\mathcal{T}}$ in an IID fashion. Then, for $\alpha = O(\sqrt{T})$, Algorithm 1 satisfies

$$\sum_{t=1}^{T} \mathbb{E} \left\| \nabla \mathbb{E}_\tau \left[ f^\tau \circ U^\tau(\theta_t) \right] \right\|^2 = O\left( \sqrt{T} \left( \Delta + G^2 \right) \right),$$

where $\Delta = \mathbb{E}_\tau[f^\tau \circ U^\tau(\theta_1)] - \min_\theta \mathbb{E}_\tau[f^\tau \circ U^\tau(\theta)]$.

Theorem 2 gives a sub-linear regret rate for nonconvex losses; in the stochastic setting, our regret rate is $O(\sqrt{T})$.

Stochastic Setting with Nonconvex Polyak-Łojasiewicz (PL) Losses. The regret statement in Theorem 2 is motivated by the first-order condition of stationary points. A better statement with respect to the optimum competitor can be obtained for PL nonconvex losses in the stochastic setting.
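For reference, the standard form of the Polyak-Łojasiewicz condition with parameter $\mu > 0$ (a textbook definition, not restated in the paper) requires the squared gradient norm to dominate the suboptimality gap at every point:

```latex
% Polyak-Lojasiewicz (PL) condition with parameter \mu > 0:
% the gradient norm controls the suboptimality gap at every theta,
% so every stationary point is a global minimum even without convexity.
\|\nabla f(\theta)\|^2 \;\geq\; 2\mu \left( f(\theta) - \min_{\theta'} f(\theta') \right)
\qquad \forall\, \theta \in \mathbb{R}^d .
```

Under this condition, a bound on accumulated squared gradient norms, as in Theorem 2, translates directly into a bound on accumulated suboptimality, which is what the corollary below states.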
Corollary 1. If each loss in Theorem 2 additionally satisfies the PL condition with parameter $\mu$, it follows that

$$\sum_{t=1}^{T} \mathbb{E} \left[ \mathbb{E}_\tau[f^\tau \circ U^\tau(\theta_t)] - \min_\theta \mathbb{E}_\tau[f^\tau \circ U^\tau(\theta)] \right] = O\left( \frac{\sqrt{T}}{\mu} \left( \Delta + G^2 \right) \right).$$

Corollary 1 gives a sub-linear rate with respect to the best competitor. We note that the PL condition allows us to obtain a regret statement similar to the convex setting, even for nonconvex losses.

3. Experiments

This section presents our empirical comparison of MOML against competing baselines on standard datasets. We highlight the main advantages of our method under various settings. We use the PyTorch framework (Paszke et al., 2019) to train and evaluate our models. MAML meta training is implemented with the Higher (Grefenstette et al., 2019) library. Hyperparameter tuning in an online setting poses challenges since, unlike the batch setting, we typically do not have a validation set. To overcome this issue, we leverage the Hedge algorithm (Freund & Schapire, 1997) for hyperparameter tuning. We start by explaining the datasets, models and baselines used for evaluation, and refer to Appendix A.1 for details of our setup.

Datasets. We evaluate the performance of our approach on three benchmark datasets: MNIST (LeCun et al., 1998), CIFAR-100 (Krizhevsky et al., 2009) and miniImageNet (Vinyals et al., 2016). We state the task generation process for each of the datasets below.

Sequential MNIST (S-MNIST): Similar to Rainbow MNIST (Finn et al., 2019), we construct the S-MNIST dataset consisting of diverse MNIST classification tasks. We generate a large-scale dataset of 1000 tasks, where the train/test sets of each task include transformations such as rotations, axis flips, cropping and scaling, applied to the class instances of each task. We note that unlike Rainbow MNIST (Finn et al., 2019), where a single transformation is used in each task, we allow more diverse tasks by including different transformations among classes within each task.

5-way CIFAR-100: Similar to Finn et al. (2019), we construct a sequence of 5-way classification tasks within the CIFAR-100 dataset. There are 200 tasks overall, and the train/test set of each task contains a 5-class combination from the 100 classes. Since only 5 of the 100 classes are chosen for each task, this generation ensures that the tasks are diverse, and particularly challenging for the online meta learning setting.

5-way miniImageNet: The miniImageNet dataset is collected as a subset of ILSVRC-2012 (Deng et al., 2009), with a total of 100 classes and 600 images per class. miniImageNet has realistic RGB images and is harder than the CIFAR-100 dataset. Similar to 5-way CIFAR-100, we generate 100 tasks where each task has 5 classes from the miniImageNet dataset.

Realistic Test-Bed. We note that the S-MNIST and 5-way CIFAR-100 tasks are harder than the corresponding ones in Finn et al. (2019). Firstly, our setting is large scale: there are 1000 and 200 tasks for S-MNIST and 5-way CIFAR-100 respectively, compared to 60 and 50 tasks in prior work. Secondly, we have fewer training data per task: in particular, 60 and 250 training datapoints for the S-MNIST and 5-way CIFAR-100 datasets respectively, whereas Finn et al. (2019) have 900 and 2000 datapoints.

Figure 2. Experiment results on S-MNIST. Left: CTM versus rounds. MOML adapts well to unseen tasks compared to the baselines. Right: smoothed LTM versus rounds. MOML is robust to catastrophic forgetting.
We note that in online meta learning, we must leverage common knowledge across different tasks, since we do not have sufficient data for any one task. Our task generation thus resembles a more realistic and large-scale experimental test-bed; as a result, FTML performance is significantly lower than that reported in prior work.

Models. We use a fully connected network architecture for the S-MNIST experiments. The model takes the flattened image and passes it through one hidden layer of 200 neurons with ReLU nonlinearity, followed by a softmax layer. For 5-way CIFAR-100, we use a CNN architecture consisting of two convolutional layers with 64 5×5 filters, two max pooling layers, two fully connected layers with hidden sizes 384 and 192, and a final output layer. For 5-way miniImageNet, we use a similar CNN architecture with three convolutional blocks and max pooling layers, followed by two fully connected layers of sizes 400 and 100 and a final softmax layer.

Methods. We compare the MOML algorithm against two types of baselines. The first set of baselines meta learns tasks: FTML (Finn et al., 2019) and Meta OGD (MOGD). As described, FTML is an extension of FTRL to the online meta learning setting, where the algorithm stores all seen task instances. Similarly, we define Meta OGD (MOGD), which extends OGD (Zinkevich, 2003) using the meta losses at each iteration. Different from FTRL, MOGD does not store losses. Our second set of baselines follows Finn et al. (2019) and includes train on everything (TOE) and train from scratch (FS). In the TOE baseline, we do not meta learn a model; instead we store all tasks and learn one predictive model. During inference, we first fine-tune the TOE model using MAML and then record its performance. In the FS baseline, we adapt a randomly initialized model to each task using the limited data.

Performance Metrics. We report performance on three metrics: the Current Task Metric, the Long-Term Task Metric and the Task Learning Efficiency Metric.

Current Task Metric (CTM). We evaluate our models with respect to the currently revealed task instance. We first allow the meta model to adapt to the current task using MAML adaptation and record its performance on the task test data. We note that the meta model is adapted directly, before being trained on the task, mirroring the regret statement (Eq. 1).

Long-Term Task Metric (LTM). Different from CTM, we also look at performance with respect to previous tasks. At each round, we adapt the current meta model to each of the previous tasks using the limited data, record the performance on the task test data, and report the average performance over the seen tasks. This metric measures catastrophic forgetting; our goal is to ensure that past experiences are not forgotten. We note that the meta model is first adapted to the old tasks using the associated training data when evaluating catastrophic forgetting via LTM.

Task Learning Efficiency Metric. Similar to Finn et al. (2019), we record the number of datapoints required to achieve a sufficient performance on the current task instance.

Analysis and Discussions. We report average CTM and LTM accuracy for all methods in Table 1, and present CTM and LTM versus rounds plots in Figures 2, 3 and 4 for the S-MNIST, 5-way CIFAR-100 and 5-way miniImageNet settings respectively. We further test the effect of the number of classes per task on the CIFAR-100 dataset, and report the performances of MOML and FTML when each task has 3, 4 or 5 classes in Table 2.
MOML adapts well to unseen tasks, improving upon competing methods. MOML attains similar or higher CTM accuracy than the baselines in Tables 1 and 2. For instance, in the 5-way CIFAR-100 setting, MOML reaches 5% higher accuracy than FTML. As another example, Figure 2 shows that MOML strictly dominates all baselines after 100 rounds. These findings show that MOML can easily adapt to new task instances.

Figure 3. Experiment results on 5-way CIFAR-100. Left: CTM versus rounds. MOML adapts well to unseen tasks compared to the baselines. Right: smoothed LTM versus rounds. MOML is robust to catastrophic forgetting.

Table 1. Test accuracy for S-MNIST, 5-way CIFAR-100 and 5-way miniImageNet. TOE is not tabulated for 5-way CIFAR-100 and 5-way miniImageNet due to its poor performance.

| Dataset | Method | CTM | LTM |
|---|---|---|---|
| S-MNIST | TOE | 71.22 | 63.73 |
| S-MNIST | FS | 71.15 | 71.14 |
| S-MNIST | MOGD | 74.63 | 74.07 |
| S-MNIST | FTML | 80.62 | 82.49 |
| S-MNIST | MOML | 85.82 | 87.49 |
| 5-way CIFAR-100 | FS | 31.58 | 31.60 |
| 5-way CIFAR-100 | MOGD | 49.92 | 50.21 |
| 5-way CIFAR-100 | FTML | 50.68 | 54.50 |
| 5-way CIFAR-100 | MOML | 55.83 | 60.78 |
| 5-way miniImageNet | FS | 20.90 | 20.81 |
| 5-way miniImageNet | MOGD | 49.08 | 56.86 |
| 5-way miniImageNet | FTML | 56.77 | 63.12 |
| 5-way miniImageNet | MOML³ | 56.23 | 64.27 |

³For miniImageNet, the MOML performance reported in Table 1 is B-MOML with a buffer of 10 tasks.

Table 2. MOML and FTML performances for the 3-, 4- and 5-way CIFAR-100 settings (test accuracy).

| Metric | Method | 3-way | 4-way | 5-way |
|---|---|---|---|---|
| CTM | FTML | 67.07 | 58.19 | 50.68 |
| CTM | MOML | 69.10 | 60.85 | 55.83 |
| LTM | FTML | 61.84 | 54.70 | 54.50 |
| LTM | MOML | 72.15 | 62.79 | 60.78 |

MOML is task efficient. MOML achieves 80% test accuracy on new task instances using less of the limited data than FTML on S-MNIST, as shown in Appendix A.1. These findings show that MOML is more task efficient and can easily adapt to new task instances.

MOML is robust to catastrophic forgetting. MOML implicitly encodes past experience without explicitly storing it. This is evident in Tables 1 and 2. For instance, in the S-MNIST setting, MOML improves LTM by 5% over the FTML baseline. This shows that, unlike its competitors, MOML has superior performance on previously observed tasks.

MOML decreases memory complexity. Among the meta learning methods, MOML and MOGD do not store seen task instances. Different from these methods, FTML remembers all seen task information and minimizes the accumulated sum of losses. Not storing task instances leads to performance degradation for MOGD in comparison to FTML (Table 1). MOML, on the other hand, outperforms both methods without storing previous data.

Meta learning is necessary for good generalization. The TOE baseline learns one predictive model for all tasks. Since the tasks are diverse, its performance is strictly lower than the methods that meta learn tasks, even though we allow fine-tuning at inference time. Similarly, FS does not use meta learning and directly adapts a random model to each task; it performs poorly on both CTM and LTM (see Table 1).

Ablation study on B-MOML. B-MOML, the variant of MOML described in Section 2, allows for storing B tasks in a buffer. We test B-MOML to see the effect of buffer size. Table 3 displays B-MOML performance with buffer sizes 0, 5 and 10, where B = 0 corresponds to the original MOML algorithm in which we do not store previous tasks.

Figure 4. Experiment results on 5-way miniImageNet. Left: CTM versus rounds. MOML adapts well to unseen tasks compared to the baselines. Right: smoothed LTM versus rounds. MOML is robust to catastrophic forgetting.
Table 3. Ablation study of B-MOML with buffer sizes 0, 5 and 10 (test accuracy), where B = 0 is the original MOML.

| Dataset | Metric | B = 0 | B = 5 | B = 10 |
|---|---|---|---|---|
| S-MNIST | CTM | 85.82 | 84.21 | 84.33 |
| S-MNIST | LTM | 87.49 | 87.54 | 87.77 |
| 5-way CIFAR-100 | CTM | 55.83 | 56.37 | 56.70 |
| 5-way CIFAR-100 | LTM | 60.78 | 63.76 | 65.24 |

MOML and B-MOML performances are comparable. We observe in Table 3 that storing seen task instances improves performance on new tasks and reduces catastrophic forgetting; in particular, LTM performance improves with B-MOML. However, the change in CTM performance is marginal: for instance, increasing the buffer from 0 to 10 increases CTM by only 1% on CIFAR-100. We infer that MOML effectively summarizes past task information.

Latest-B buffer is more effective than random-B buffer. As described in Section 2, we consider B-MOML variants with a latest-B buffer and a random buffer of fixed size. Table 4 shows that the latest-B buffer is somewhat more effective than the random variant for MOML.

Table 4. Ablation study of buffer storing schemes for B-MOML (test accuracy).

| Dataset | Metric | Scheme | B = 5 | B = 10 |
|---|---|---|---|---|
| S-MNIST | CTM | Random | 83.30 | 84.07 |
| S-MNIST | CTM | Last | 84.21 | 84.33 |
| S-MNIST | LTM | Random | 86.09 | 86.97 |
| S-MNIST | LTM | Last | 87.54 | 87.77 |
| 5-way CIFAR-100 | CTM | Random | 56.40 | 55.07 |
| 5-way CIFAR-100 | CTM | Last | 56.37 | 56.70 |
| 5-way CIFAR-100 | LTM | Random | 64.37 | 64.37 |
| 5-way CIFAR-100 | LTM | Last | 63.76 | 65.24 |

4. Conclusion

We propose MOML, a novel method for the online meta learning problem, in which an artificial agent is expected to meta learn sequentially arriving tasks. We point out a practical concern: the agent cannot store data from all previously seen tasks, as its memory footprint would grow linearly with time. Different from prior work, MOML stores local states that extract the experience information from the seen tasks. These local states significantly improve the memory complexity of MOML, as we only store the states rather than all accumulated task data. We theoretically prove sub-linear regret rates and empirically show that MOML leads to large memory savings while retaining the same or better performance compared to its competitors.

Acknowledgements. This research was supported by a gift from the ARM corporation, and by Army Research Office Grant W911NF2110246, the National Science Foundation grants CCF-2007350 (VS), CCF-2022446 (VS), CCF-1955981 (VS), and the Office of Naval Research Grant N0014-18-12257.

References

Abernethy, J., Hazan, E., and Rakhlin, A. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory (COLT 2008), pp. 263-274, 2008.

Antoniou, A., Edwards, H., and Storkey, A. How to train your MAML. arXiv preprint arXiv:1810.09502, 2018.

Chen, Y., Wang, X., Liu, Z., Xu, H., and Darrell, T. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390, 2020.

Chen, Z. and Liu, B. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1-207, 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1126-1135. PMLR, 2017. URL http://proceedings.mlr.press/v70/finn17a.html.
Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1920-1930. PMLR, 2019. URL http://proceedings.mlr.press/v97/finn19a.html.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997. doi: 10.1006/jcss.1997.1504. URL https://www.sciencedirect.com/science/article/pii/S002200009791504X.

Grefenstette, E., Amos, B., Yarats, D., Htut, P. M., Molchanov, A., Meier, F., Kiela, D., Cho, K., and Chintala, S. Generalized inner loop meta-learning. arXiv preprint arXiv:1910.01727, 2019.

Hazan, E. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.

Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

Hazan, E., Singh, K., and Zhang, C. Efficient regret minimization in non-convex games. arXiv preprint arXiv:1708.00075, 2017.

Hospedales, T., Antoniou, A., Micaelli, P., and Storkey, A. Meta-learning in neural networks: A survey, 2020.

Krizhevsky, A. et al. Learning multiple layers of features from tiny images. Technical report, 2009.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Li, X., Zhou, Y., Wu, T., Socher, R., and Xiong, C. Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 3925-3934. PMLR, 2019. URL http://proceedings.mlr.press/v97/li19m.html.

Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141, 2017.

Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. Variational continual learning. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BkQqq0gRb.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019.

Ravi, S. and Larochelle, H. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842-1850, 2016.
Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4, 2012. doi: 10.1561/2200000018.

Shalev-Shwartz, S. and Singer, Y. Online learning: Theory, algorithms, and applications. 2007.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077-4087, 2017.

Thrun, S. and Pratt, L. Learning to Learn. Springer Science & Business Media, 2012.

Vanschoren, J. Meta-learning: A survey. arXiv preprint arXiv:1810.03548, 2018.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. Advances in Neural Information Processing Systems, 29:3630-3638, 2016.

Yao, H., Zhou, Y., Mahdavi, M., Li, Z. J., Socher, R., and Xiong, C. Online structured meta-learning. In Advances in Neural Information Processing Systems, volume 33, pp. 6779-6790. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/4b86ca48d90bd5f0978afa3a012503a4-Paper.pdf.

Yu, T., Geng, X., Finn, C., and Levine, S. Variable-shot adaptation for online meta-learning. arXiv preprint arXiv:2012.07769, 2020.

Zhou, P., Yuan, X., Xu, H., Yan, S., and Feng, J. Efficient meta learning via minibatch proximal update. Advances in Neural Information Processing Systems, 32:1534-1544, 2019.

Zhuang, Z., Wang, Y., Yu, K., and Lu, S. No-regret non-convex online meta-learning. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3942-3946. IEEE, 2020.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 928-936. AAAI Press, 2003. URL http://www.aaai.org/Library/ICML/2003/icml03-120.php.