# META-LEARNING WITH WARPED GRADIENT DESCENT

Published as a conference paper at ICLR 2020

Sebastian Flennerhag,1,2,3 Andrei A. Rusu,3 Razvan Pascanu,3 Francesco Visin,3 Hujun Yin,1,2 Raia Hadsell3
1The University of Manchester, 2The Alan Turing Institute, 3DeepMind
{flennerhag,andreirusu,razp,visin,raia}@google.com, hujun.yin@manchester.ac.uk

ABSTRACT

Learning an efficient update rule from data that promotes rapid learning of new tasks from the same distribution remains an open problem in meta-learning. Typically, previous works have approached this issue either by attempting to train a neural network that directly produces updates or by attempting to learn better initialisations or scaling factors for a gradient-based update rule. Both of these approaches pose challenges. On one hand, directly producing an update forgoes a useful inductive bias and can easily lead to non-converging behaviour. On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that cannot scale beyond few-shot task adaptation. In this work, we propose Warped Gradient Descent (WarpGrad), a method that intersects these approaches to mitigate their limitations. WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution. Preconditioning arises by interleaving non-linear layers, referred to as warp-layers, between the layers of a task-learner. Warp-layers are meta-learned without backpropagating through the task training process, in a manner similar to methods that learn to directly produce updates. WarpGrad is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems. We provide a geometrical interpretation of the approach and evaluate its effectiveness in a variety of settings, including few-shot, standard supervised, continual, and reinforcement learning.

1 INTRODUCTION

Learning (how) to learn implies inferring a learning strategy from some set of past experiences via a meta-learner that a task-learner can leverage when learning a new task. One approach is to directly parameterise an update rule via the memory of a recurrent neural network (Andrychowicz et al., 2016; Ravi & Larochelle, 2016; Li & Malik, 2016; Chen et al., 2017). Such memory-based methods can, in principle, represent any learning rule by virtue of being universal function approximators (Cybenko, 1989; Hornik, 1991; Schäfer & Zimmermann, 2007). They can also scale to long learning processes by using truncated backpropagation through time, but they lack an inductive bias as to what constitutes a reasonable learning rule. This renders them hard to train and brittle in generalisation, as their parameter updates have no guarantees of convergence. An alternative family of approaches defines a gradient-based update rule and meta-learns a shared initialisation that facilitates task adaptation across a distribution of tasks (Finn et al., 2017; Nichol et al., 2018; Flennerhag et al., 2019). Such methods are imbued with the strong inductive bias of gradient descent, but restrict knowledge transfer to the initialisation.
Recent work has shown that it is beneficial to more directly control gradient descent by meta-learning a parameterised preconditioning matrix (Li et al., 2017; Lee & Choi, 2018; Park & Oliva, 2019) that preconditions gradients during task training, similarly to second-order and Natural Gradient Descent methods (Nocedal & Wright, 2006; Amari & Nagaoka, 2007). To meta-learn preconditioning, these methods backpropagate through the gradient descent process, limiting them to few-shot learning.

Figure 1: Schematics of WarpGrad. WarpGrad preconditioning is embedded in task-learners f by interleaving warp-layers (ω(1), ω(2)) between the task-learner's layers (h(1), h(2)). WarpGrad achieves preconditioning by modulating layer activations in the forward pass and gradients in the backward pass via backpropagation through warp-layers (Dω), which implicitly preconditions gradients by some matrix (P). Warp parameters (φ) are meta-learned over the joint search space induced by task adaptation (E_θ[J(φ)]) to form a geometry that facilitates task learning.

In this paper, we propose a novel framework called Warped Gradient Descent (WarpGrad)¹ that relies on the inductive bias of gradient-based meta-learners by defining an update rule that preconditions gradients, but that is meta-learned using insights from memory-based methods. In particular, we leverage the fact that gradient preconditioning is defined point-wise in parameter space and can be seen as a recurrent operator of order 1. We use this insight to define a trajectory-agnostic meta-objective over a joint parameter search space where knowledge transfer is encoded in gradient preconditioning.

To achieve a scalable and flexible form of preconditioning, we take inspiration from works that embed preconditioning in task-learners (Desjardins et al., 2015; Lee & Choi, 2018), but we relax the assumption that task-learners are feed-forward and replace their linear projection with a generic neural network ω, referred to as a warp-layer. By introducing non-linearity, preconditioning is rendered data-dependent. This allows WarpGrad to model preconditioning beyond the block-diagonal structure of prior works and enables it to meta-learn over arbitrary adaptation processes.

We empirically validate WarpGrad and show it surpasses baseline gradient-based meta-learners on standard few-shot learning tasks (miniImageNet, tieredImageNet; Vinyals et al., 2016; Ravi & Larochelle, 2016; Ren et al., 2018), while scaling beyond few-shot learning to standard supervised settings on the multi-shot Omniglot benchmark (Flennerhag et al., 2019) and a multi-shot version of tieredImageNet. We further find that WarpGrad outperforms competing methods in a reinforcement learning (RL) setting where previous gradient-based meta-learners fail (maze navigation with recurrent neural networks; Miconi et al., 2019) and can be used to meta-learn an optimiser that prevents catastrophic forgetting in a continual learning setting.

2 WARPED GRADIENT DESCENT

2.1 GRADIENT-BASED META-LEARNING

WarpGrad belongs to the family of optimisation-based meta-learners that parameterise an update rule θ ← U(θ; ξ) with some meta-parameters ξ. Specifically, gradient-based meta-learners define an update rule by relying on gradient descent, U(θ; ξ) := θ − α∇L(θ), for some objective L and learning rate α.
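To make the update-rule notation concrete, below is a minimal PyTorch sketch (not from the paper's released implementation) of the generic rule U(θ; ξ) := θ − α∇L(θ); the names `gd_update`, `params`, and `loss` are illustrative placeholders:

```python
import torch

def gd_update(params, loss, alpha, create_graph=False):
    """One step of U(theta) = theta - alpha * grad L(theta).

    With create_graph=True the step itself stays differentiable, which is
    what MAML-style meta-learners need in order to backpropagate through
    task adaptation; WarpGrad is designed to avoid exactly that.
    """
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return [p - alpha * g for p, g in zip(params, grads)]
```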
A task is defined by a training set D^τ_train and a test set D^τ_test, which define learning objectives L_{D^τ}(θ) := E_{(x,y)∼D^τ}[ℓ(f(x; θ), y)] over the task-learner f for some loss ℓ. MAML (Finn et al., 2017) meta-learns a shared initialisation θ0 by backpropagating through K steps of gradient descent across a given task distribution p(τ),

$$C_{\text{MAML}}(\xi) := \sum_{\tau \sim p(\tau)} \mathcal{L}_{D^\tau_{\text{test}}}\Big( \big(\underbrace{U_{D^\tau_{\text{train}}} \circ \cdots \circ U_{D^\tau_{\text{train}}}}_{K \text{ steps}}\big)(\theta^\tau_0;\ \xi) \Big). \tag{1}$$

¹ Open-source implementation available at https://github.com/flennerhag/warpgrad.

Figure 2: Gradient-based meta-learning. Colours denote different tasks (τ), dashed lines denote backpropagation through the adaptation process, and solid black lines denote optimiser parameter (φ) gradients with respect to one step of task parameter (θ) adaptation. Left: a meta-learned initialisation compresses trajectory information into a single initial point (θ0). Middle: MAML-based optimisers interact with adaptation trajectories at every step and backpropagate through each interaction. Right: WarpGrad is trajectory agnostic. Task adaptation defines an empirical distribution p(τ, θ) over which WarpGrad learns a geometry for adaptation by optimising for steepest-descent directions.

Subsequent works on gradient-based meta-learning primarily differ in the parameterisation of U. Meta-SGD (MSGD; Li et al., 2017) learns a vector of learning rates, Meta-Curvature (MC; Park & Oliva, 2019) defines a block-diagonal preconditioning matrix B, and T-Nets (Lee & Choi, 2018) embed block-diagonal preconditioning in feed-forward learners via linear projections,

$$\begin{aligned} U(\theta_k;\ \theta_0) &:= \theta_k - \alpha \nabla \mathcal{L}(\theta_k) && \text{MAML} && (2)\\ U(\theta_k;\ \theta_0, \phi) &:= \theta_k - \alpha\, \mathrm{diag}(\phi)\, \nabla \mathcal{L}(\theta_k) && \text{MSGD} && (3)\\ U(\theta_k;\ \theta_0, \phi) &:= \theta_k - \alpha\, B(\theta_k; \phi)\, \nabla \mathcal{L}(\theta_k) && \text{MC} && (4)\\ U(\theta_k;\ \theta_0, \phi) &:= \theta_k - \alpha \nabla \mathcal{L}(\theta_k; \phi) && \text{T-Nets.} && (5) \end{aligned}$$

These methods optimise the meta-parameters ξ = {θ0, φ} by backpropagating through the gradient descent process (Eq. 1). This trajectory dependence limits them to few-shot learning, as they become (1) computationally expensive, (2) susceptible to exploding/vanishing gradients, and (3) susceptible to a credit assignment problem (Wu et al., 2018; Antoniou et al., 2019; Liu et al., 2019). Our goal is to develop a meta-learner that overcomes all three limitations. To do so, we depart from the paradigm of backpropagating to the initialisation and exploit the fact that learning to precondition gradients can be seen as a Markov process of order 1 that depends on the state but not the trajectory (Li et al., 2017). To develop this notion, we first establish a general-purpose form of preconditioning (Section 2.2). Based on this, we obtain a canonical meta-objective from a geometrical point of view (Section 2.3), from which we derive a trajectory-agnostic meta-objective (Section 2.4).

2.2 GENERAL-PURPOSE PRECONDITIONING

A preconditioned gradient descent rule, U(θ; φ) := θ − α P(θ; φ)∇L(θ), defines a geometry via P. To disentangle the expressive capacity of this geometry from the expressive capacity of the task-learner f, we take inspiration from T-Nets, which embed linear projections T in feed-forward layers, h = σ(TWx + b). This in itself is not sufficient to achieve disentanglement, since the parameterisation of T is directly linked to that of W, but disentanglement can be achieved under non-linear preconditioning. To this end, we relax the assumption that the task-learner is feed-forward and consider an arbitrary neural network, f = h^(L) ∘ ⋯ ∘ h^(1).
We insert warp-layers, universal function approximators parameterised by neural networks, into the task-learner without restricting their form or how they interact with f.

Figure 3: Left: synthetic experiment illustrating how WarpGrad warps gradients (see Appendix D for full details). Each task f ∼ p(f) defines a distinct loss surface (W, bottom row). Gradient descent (black) on these surfaces struggles to find a minimum. WarpGrad meta-learns a warp ω to produce better update directions (magenta; Section 2.4). In doing so, WarpGrad learns a meta-geometry P in which standard gradient descent is well behaved (top row). Right: gradient descent in P is equivalent to first-order Riemannian descent in W under a meta-learned Riemann metric (Section 2.3).

In the simplest case, we interleave warp-layers between the layers of the task-learner to obtain f̂ = ω^(L) ∘ h^(L) ∘ ⋯ ∘ ω^(1) ∘ h^(1), but other forms of interaction can be beneficial (see Appendix A for practical guidelines). Backpropagation automatically induces gradient preconditioning, as in T-Nets, but in our case via the Jacobians of the warp-layers,

$$\nabla_{\theta^{(i)}} \mathcal{L} = \left[ \Big( \prod_{j=0}^{L-(i+1)} D_x \omega^{(L-j)}\, D_x h^{(L-j)} \Big)\, D_x \omega^{(i)}\, D_\theta h^{(i)} \right]^{\top} \nabla \ell, \tag{6}$$

where D_x and D_θ denote the Jacobian with respect to input and parameters, respectively. In the special case where f is feed-forward and each ω a linear projection, we obtain an instance of WarpGrad that is akin to T-Nets, since preconditioning is given by D_x ω = T. Conversely, by making warp-layers non-linear, we can induce interdependence between warp-layers, allowing WarpGrad to model preconditioning beyond the block-diagonal structure imposed by prior works. Further, this enables a form of task-conditioning by making the Jacobians of warp-layers data-dependent.

As we have made no assumptions on the form of the task-learner or warp-layers, WarpGrad methods can act on any neural network through any form of warping, including recurrence. We show that increasing the capacity of the meta-learner by defining warp-layers as residual networks (He et al., 2016) improves performance on classification tasks (Section 4.1). We also introduce recurrent warp-layers for agents in a gradient-based meta-learner that is the first, to the best of our knowledge, to outperform memory-based meta-learners on a maze navigation task that requires memory (Section 4.3).

Warp-layers imbue WarpGrad with three powerful properties. First, because gradients are preconditioned, WarpGrad inherits the properties of gradient descent, importantly guarantees of convergence. Second, warp-layers form a distributed representation of preconditioning that disentangles the expressiveness of the geometry it encodes from the expressive capacity of the task-learner. Third, warp-layers are meta-learned across tasks and trajectories and can therefore capture properties of the task distribution beyond local information. Figure 3 illustrates these properties in a synthetic scenario, where we construct a family of tasks f : R² → R (see Appendix D for details) and meta-learn across the task distribution. WarpGrad learns to produce warped loss surfaces (illustrated on two tasks τ and τ′) that are smoother and better behaved than their respective native loss surfaces.
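As a concrete illustration of interleaving warp-layers, here is a minimal PyTorch sketch of a warped feed-forward task-learner in the spirit of f̂ = ω^(2) ∘ h^(2) ∘ ω^(1) ∘ h^(1). The layer widths are arbitrary assumptions, and linear warps are used only for brevity; the paper's point is that ω can be any (non-linear) network, which slots in the same way:

```python
import torch.nn as nn

class WarpedMLP(nn.Module):
    """f_hat = w2 . h2 . act . w1 . h1: backpropagating through the
    warp-layers (frozen during task training) implicitly preconditions
    the gradients of the task layers h1, h2."""
    def __init__(self, d_in=4, d_hid=64, d_out=1):
        super().__init__()
        self.h1 = nn.Linear(d_in, d_hid)    # task layer (theta)
        self.w1 = nn.Linear(d_hid, d_hid)   # warp layer (phi)
        self.h2 = nn.Linear(d_hid, d_hid)   # task layer (theta)
        self.w2 = nn.Linear(d_hid, d_out)   # warp layer (phi)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.w1(self.h1(x)))
        return self.w2(self.act(self.h2(x)))

    def task_params(self):   # adapted with L_task during task training
        return list(self.h1.parameters()) + list(self.h2.parameters())

    def warp_params(self):   # meta-learned with L_meta across tasks
        return list(self.w1.parameters()) + list(self.w2.parameters())
```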
2.3 THE GEOMETRY OF WARPED GRADIENT DESCENT

If the preconditioning matrix P is invertible, it defines a valid Riemann metric (Amari, 1998) and therefore enjoys similar convergence guarantees to gradient descent. Thus, if warp-layers represent a valid (meta-learned) Riemann metric, WarpGrad is well-behaved. For T-Nets, it is sufficient to require T to be full rank, since T explicitly defines P as a block-diagonal matrix with block entries TTᵀ. In contrast, non-linearity in warp-layers precludes such an explicit identification. Instead, we must consider the geometry that warp-layers represent. For this, we need a metric tensor G, a positive-definite, smoothly varying matrix that measures curvature on a manifold W. The metric tensor defines the steepest direction of descent by −G⁻¹∇L (Lee, 2003); hence our goal is to establish that warp-layers approximate some G⁻¹. Let Ω represent the effect of warp-layers through the reparameterisation

$$h^{(i)}\big(x;\ \Omega(\theta; \phi)^{(i)}\big) = \omega^{(i)}\big(h^{(i)}(x;\ \theta^{(i)});\ \phi\big) \quad \forall\, x, i,$$

which maps from a space P onto the manifold W with γ = Ω(θ; φ). We induce a metric G on W by push-forward (Figure 3, right):

$$\nabla \theta := \nabla (\mathcal{L} \circ \Omega)(\theta; \phi) = \left[ D_x \Omega(\theta; \phi) \right]^{\top} \nabla \mathcal{L}(\gamma) \qquad \mathcal{P}\text{-space}, \tag{7}$$

$$\nabla \gamma := D_x \Omega(\theta; \phi)\, \nabla \theta = G(\gamma; \phi)^{-1} \nabla \mathcal{L}(\gamma) \qquad \mathcal{W}\text{-space}, \tag{8}$$

where G⁻¹ := [D_x Ω][D_x Ω]ᵀ. Provided Ω is not degenerate (G is non-singular), G⁻¹ is positive-definite, hence a valid Riemann metric. While this is the metric induced on W by warp-layers, it is not the metric used to precondition gradients, since we take gradient steps in P, which introduces an error term (Figure 3, right). We can bound the error by a first-order Taylor series expansion to establish first-order equivalence between the WarpGrad update in P (Eq. 7) and the ideal update in W (Eq. 8),

$$(\mathcal{L} \circ \Omega)(\theta - \alpha \nabla \theta) = \mathcal{L}(\gamma - \alpha \nabla \gamma) + \mathcal{O}(\alpha^2). \tag{9}$$

Consequently, gradient descent under warp-layers (in P-space) is first-order equivalent to warping the native loss surface under a metric G to facilitate task adaptation. Warp parameters φ control the geometry induced by warping, and therefore what task-learners converge to. By meta-learning φ we can accumulate information that is conducive to task adaptation but that may not be available during that process. This suggests that an ideal geometry (in W-space) should yield preconditioning that points in the direction of steepest descent, accounting for global information across tasks,

$$\min_{\phi}\ \mathbb{E}_{\mathcal{L}, \gamma \sim p(\mathcal{L}, \gamma)} \left[ \mathcal{L}\big( \gamma - \alpha\, G(\gamma; \phi)^{-1} \nabla \mathcal{L}(\gamma) \big) \right]. \tag{10}$$

In contrast to MAML-based approaches (Eq. 1), this objective avoids backpropagation through learning processes. Instead, it defines task learning abstractly by introducing a joint distribution over objectives and parameterisations, opening up for general-purpose meta-learning at scale.

2.4 META-LEARNING WARP PARAMETERS

The canonical objective in Eq. 10 describes a meta-objective for learning a geometry from first principles, which we can render into a trajectory-agnostic update rule for warp-layers. To do so, we define a task τ = (h^τ, L^τ_meta, L^τ_task) by a task-learner f̂ that is embedded with a shared WarpGrad optimiser, a meta-training objective L^τ_meta, and a task adaptation objective L^τ_task. We use L^τ_task to adapt task parameters θ and L^τ_meta to adapt warp parameters φ. Note that we allow meta and task objectives to differ in arbitrary ways, but both are expectations over some data, as above. In the simplest case, they differ in terms of validation versus training data, but they may also differ in terms of learning paradigm, as we demonstrate in the continual learning experiment (Section 4.3). To obtain our meta-objective, we recast the canonical objective (Eq. 10) in terms of θ using the first-order equivalence of gradient steps (Eq. 9). Next, we factorise p(τ, θ) into p(θ | τ) p(τ).
Since p(τ) is given, it remains to consider a sampling strategy for p(θ | τ). For meta-learning of warp-layers, we assume this distribution is given. We later show how to incorporate meta-learning of a prior p(θ0 | τ). While any sampling strategy is valid, in this paper we exploit the fact that task learning under stochastic gradient descent can be seen as sampling from an empirical prior p(θ | τ) (Grant et al., 2018); in particular, each iterate θ^τ_k can be seen as a sample from p(θ^τ_k | θ^τ_{k−1}, φ). Thus, K steps of gradient descent form a Monte-Carlo chain θ^τ_0, …, θ^τ_K, and sampling such chains defines an empirical distribution p(θ | τ) around some prior p(θ0 | τ), which we discuss in Section 2.5. The joint distribution p(τ, θ) defines a joint search space across tasks. Meta-learning therefore learns a geometry over this space with the steepest expected direction of descent. This direction, however, is not with respect to the objective that produced the gradient, L^τ_task, but with respect to L^τ_meta,

$$L(\phi) := \mathbb{E}_{\tau \sim p(\tau)}\, \mathbb{E}_{\theta^\tau \sim p(\theta \mid \tau)} \left[ \mathcal{L}^\tau_{\text{meta}}\Big( \theta^\tau - \alpha \nabla \mathcal{L}^\tau_{\text{task}}\big(\theta^\tau;\ \phi\big);\ \phi \Big) \right]. \tag{11}$$

Decoupling the task gradient operator ∇L^τ_task from the geometry learned by L^τ_meta lets us infuse global knowledge in warp-layers, a promising avenue for future research (Metz et al., 2019; Mendonca et al., 2019). For example, in Section 4.3, we meta-learn an update rule that mitigates catastrophic forgetting by defining L^τ_meta over current and previous tasks. In contrast to other gradient-based meta-learners, the WarpGrad meta-objective is an expectation over gradient update steps sampled from the search space induced by task adaptation (for example, K steps of stochastic gradient descent; Figure 2). It is therefore trajectory agnostic and hence compatible with arbitrary task learning processes. Because the meta-gradient is independent of the number of task gradient steps, it avoids vanishing/exploding gradients and the credit assignment problem by design. It does rely on second-order gradients, a requirement we can relax by detaching the task parameter gradients (∇L^τ_task) in Eq. 11,

$$\hat{L}(\phi) := \mathbb{E}_{\tau \sim p(\tau)}\, \mathbb{E}_{\theta^\tau \sim p(\theta \mid \tau)} \left[ \mathcal{L}^\tau_{\text{meta}}\Big( \mathrm{sg}\big[ \theta^\tau - \alpha \nabla \mathcal{L}^\tau_{\text{task}}(\theta^\tau;\ \phi) \big];\ \phi \Big) \right], \tag{12}$$

where sg is the stop-gradient operator. In contrast to the first-order approximation of MAML (Finn et al., 2017), which ignores the entire trajectory except for the final gradient, this approximation retains all gradient terms and discards only local second-order effects, which are typically dominated by first-order effects in long parameter trajectories (Flennerhag et al., 2019). Empirically, we find that this approximation incurs only a minor loss of performance in an ablation study on Omniglot (Appendix F). Interestingly, the approximation is a form of multi-task learning with respect to φ (Li & Hoiem, 2016; Bilen & Vedaldi, 2017; Rebuffi et al., 2017) that marginalises over task parameters θ^τ.
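A minimal PyTorch sketch of the approximate objective (Eq. 12): detaching the task-gradient step implements the stop-gradient sg, so the meta-gradient for φ flows only through the warped forward pass at the updated parameters. The closures `task_loss` and `meta_loss` are hypothetical stand-ins for objectives evaluated under the current warp-layers:

```python
import torch

def approx_meta_objective(theta, task_loss, meta_loss, alpha):
    """L_meta(sg[theta - alpha * grad L_task(theta; phi)]; phi), Eq. 12.

    theta: task parameters sampled from the adaptation trajectory.
    detach() realises sg: no second-order terms are formed and no
    backpropagation through the update step itself takes place.
    """
    grads = torch.autograd.grad(task_loss(theta), theta)
    theta_next = [(p - alpha * g).detach().requires_grad_(True)
                  for p, g in zip(theta, grads)]
    # d(meta)/d(phi) flows through meta_loss's forward pass only
    return meta_loss(theta_next)
```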
Algorithm 1 WarpGrad: online meta-training
Require: p(τ): distribution over tasks
Require: α, β, λ: hyper-parameters
 1: initialise φ and p(θ0 | τ)
 2: while not done do
 3:   sample mini-batch of tasks T from p(τ)
 4:   gφ, gθ0 ← 0
 5:   for all τ ∈ T do
 6:     θ^τ_0 ∼ p(θ0 | τ)
 7:     for all k in 0, …, K_τ − 1 do
 8:       θ^τ_{k+1} ← θ^τ_k − α ∇L^τ_task(θ^τ_k; φ)
 9:       gφ ← gφ + ∇L(φ; θ^τ_k)
10:       gθ0 ← gθ0 + ∇C(θ0; θ^τ_{0:k})
11:     end for
12:   end for
13:   φ ← φ − β gφ
14:   θ0 ← θ0 − λβ gθ0
15: end while

Algorithm 2 WarpGrad: offline meta-training
Require: p(τ): distribution over tasks
Require: α, β, λ, η: hyper-parameters
 1: initialise φ and p(θ0 | τ)
 2: while not done do
 3:   initialise buffer B = {}
 4:   sample mini-batch of tasks T from p(τ)
 5:   for all τ ∈ T do
 6:     θ^τ_0 ∼ p(θ0 | τ)
 7:     B[τ] = [θ^τ_0]
 8:     for all k in 0, …, K_τ − 1 do
 9:       θ^τ_{k+1} ← θ^τ_k − α ∇L^τ_task(θ^τ_k; φ)
10:       B[τ].append(θ^τ_{k+1})
11:     end for
12:   end for
13:   i, gφ, gθ0 ← 0
14:   for all (τ, k) ∈ B do
15:     gφ ← gφ + ∇L(φ; θ^τ_k)
16:     gθ0 ← gθ0 + ∇C(θ^τ_0; θ^τ_{0:k})
17:     i ← i + 1
18:     if i = η then
19:       φ ← φ − β gφ
20:       θ0 ← θ0 − λβ gθ0
21:       i, gφ, gθ0 ← 0
22:     end if
23:   end for
24: end while

2.5 INTEGRATION WITH LEARNED INITIALISATIONS

WarpGrad is a method for learning warp-layer parameters φ over a joint search space defined by p(τ, θ). Because WarpGrad takes this distribution as given, we can integrate WarpGrad with methods that define or learn some form of prior p(θ0 | τ) over θ^τ_0. For instance:

(a) Multi-task solution: in online learning, we can alternate between updating a multi-task solution and tuning warp parameters. We use this approach in our reinforcement learning experiment (Section 4.3).

(b) Meta-learned point-estimate: when task adaptation occurs in batch mode, we can meta-learn a shared initialisation θ0. Our few-shot and supervised learning experiments take this approach (Section 4.1).

(c) Meta-learned prior: WarpGrad can be combined with Bayesian methods that define a full prior (Rusu et al., 2019; Oreshkin et al., 2018; Lacoste et al., 2018; Kim et al., 2018).

We incorporate such methods via some objective C (potentially vacuous) over θ0 that we optimise jointly with WarpGrad,

$$J(\phi, \theta_0) := L(\phi) + \lambda\, C(\theta_0), \tag{13}$$

where L can be substituted by L̂ and λ ∈ [0, ∞) is a hyper-parameter. We train the WarpGrad optimiser via stochastic gradient descent and solve Eq. 13 by alternating between sampling task parameters from p(τ, θ) given the current values of φ and taking meta-gradient steps over these samples to update φ. As such, our method can also be seen as a generalised form of gradient descent, namely Mirror Descent with a meta-learned dual space (Desjardins et al., 2015; Beck & Teboulle, 2003). The details of the sampling procedure may vary depending on the specifics of the tasks (static, sequential), the design of the task-learner (feed-forward, recurrent), and the learning objective (supervised, self-supervised, reinforcement learning). In Algorithm 1, we illustrate a simple online algorithm with constant memory and linear complexity in K, assuming the same holds for C. A drawback of this approach is that it is relatively data inefficient; in Appendix B we detail a more complex offline training algorithm that stores task parameters in a replay buffer for mini-batched training of φ. The gains of the offline variant can be dramatic: in our Omniglot experiment (Section 4.2), offline meta-training allows us to update warp parameters 2000 times with each meta-batch, improving final test accuracy from 76.3% to 84.3% (Appendix F).
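A condensed PyTorch sketch of one meta-step of the joint objective (Eq. 13) in the style of Algorithm 1. The closures `task_loss`, `warp_meta_loss`, and `init_loss`, the two optimisers, and the point-estimate initialisation are illustrative assumptions, not the paper's verified implementation:

```python
import torch

def meta_step(theta0, phi_opt, theta0_opt, tasks, alpha, lam,
              steps, task_loss, warp_meta_loss, init_loss):
    """One online WarpGrad meta-update: J(phi, theta0) = L(phi) + lam*C(theta0)."""
    phi_opt.zero_grad()
    theta0_opt.zero_grad()
    for task in tasks:
        # theta_0^tau ~ p(theta_0 | tau): here, a shared point estimate
        theta = [p.clone().detach().requires_grad_(True) for p in theta0]
        for _ in range(steps):
            # accumulate meta-gradients at the current iterate (lines 9-10)
            warp_meta_loss(theta, task).backward()        # fills phi grads
            (lam * init_loss(theta0, theta)).backward()   # fills theta0 grads
            # task adaptation step under the current warp (line 8)
            g = torch.autograd.grad(task_loss(theta, task), theta)
            theta = [(p - alpha * gi).detach().requires_grad_(True)
                     for p, gi in zip(theta, g)]
    phi_opt.step()    # phi <- phi - beta * g_phi
    theta0_opt.step() # theta0 <- theta0 - lam * beta * g_theta0
```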
3 RELATED WORK

Learning to learn, or meta-learning, has previously been explored in a variety of settings. Early work focused on evolutionary approaches (Schmidhuber, 1987; Bengio et al., 1991; Thrun & Pratt, 1998). Hochreiter et al. (2001) introduced gradient descent methods to meta-learning, specifically for recurrent meta-learning algorithms, later extended to RL by Wang et al. (2016) and Duan et al. (2016). A similar approach was taken by Andrychowicz et al. (2016) and Ravi & Larochelle (2016) to meta-learn a parameterised update rule in the form of a Recurrent Neural Network (RNN). A related idea is to separate parameters into slow and fast weights, where the former capture meta-information and the latter encapsulate rapid adaptation (Hinton & Plaut, 1987; Schmidhuber, 1992; Ba et al., 2016). This can be implemented by embedding a neural network that dynamically adapts the parameters of a main architecture (Ha et al., 2016). WarpGrad can be seen as learning slow warp parameters that precondition adaptation of fast weights.

Recent meta-learning focuses almost exclusively on few-shot learning, where tasks are characterised by severe data scarcity. In this setting, tasks must be sufficiently similar that a new task can be learned from a single or a handful of examples (Lake et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Ren et al., 2018). Several meta-learners have been proposed that directly predict the parameters of the task-learner (Bertinetto et al., 2016; Munkhdalai et al., 2018; Gidaris & Komodakis, 2018; Qiao et al., 2018). To scale, such methods typically pretrain a feature extractor and predict a small subset of the parameters. Closely related to our work are gradient-based few-shot learning methods that extend MAML by sharing some subset of parameters between task-learners that is fixed during task training but meta-learned across tasks, which may reduce overfitting (Mishra et al., 2018; Lee & Choi, 2018; Munkhdalai et al., 2018) or induce more robust convergence (Zintgraf et al., 2019). Such sharing can also be used to model latent variables for concept or task inference, which implicitly induce gradient modulation (Zhou et al., 2018; Oreshkin et al., 2018; Rusu et al., 2019; Lee et al., 2019). Our work is also related to gradient-based meta-learning of a shared initialisation that scales beyond few-shot learning (Nichol et al., 2018; Flennerhag et al., 2019).

Meta-learned preconditioning is closely related to parallel work on second-order optimisation methods for high-dimensional non-convex loss surfaces (Nocedal & Wright, 2006; Saxe et al., 2013; Kingma & Ba, 2015; Arora et al., 2018). In this setting, second-order optimisers typically struggle to improve upon first-order baselines (Sutskever et al., 2013). As second-order curvature is typically intractable to compute, such methods resort to low-rank approximations (Nocedal & Wright, 2006; Martens, 2010; Martens & Grosse, 2015) and suffer from instability (Byrd et al., 2016). In particular, Natural Gradient Descent (Amari, 1998) uses the Fisher Information Matrix as the curvature metric (Amari & Nagaoka, 2007). Several methods have been proposed for amortising the cost of estimating the metric (Pascanu & Bengio, 2014; Martens & Grosse, 2015; Desjardins et al., 2015). As noted by Desjardins et al. (2015), expressing preconditioning through interleaved projections can be seen as a form of Mirror Descent (Beck & Teboulle, 2003). WarpGrad offers a new perspective on gradient preconditioning by introducing a generic form of model-embedded preconditioning that exploits global information beyond the task at hand.
4 EXPERIMENTS

Table 1: Mean test accuracy after task adaptation on held-out evaluation tasks. † Multi-headed. ‡ No meta-training; see Appendix E and Appendix H.

| miniImageNet | 5-way 1-shot | 5-way 5-shot |
|---|---|---|
| Reptile | 50.0 ± 0.3 | 66.0 ± 0.6 |
| Meta-SGD | 50.5 ± 1.9 | 64.0 ± 0.9 |
| (M)T-Net | 51.7 ± 1.8 | - |
| CAVIA (512) | 51.8 ± 0.7 | 65.9 ± 0.6 |
| MAML | 48.7 ± 1.8 | 63.2 ± 0.9 |
| Warp-MAML | 52.3 ± 0.8 | 68.4 ± 0.6 |

| tieredImageNet | 5-way 1-shot | 5-way 5-shot |
|---|---|---|
| MAML | 51.7 ± 1.8 | 70.3 ± 1.8 |
| Warp-MAML | 57.2 ± 0.9 | 74.1 ± 0.7 |

| | tieredImageNet 10-way 640-shot | Omniglot 20-way 100-shot |
|---|---|---|
| SGD‡ | 58.1 ± 1.5 | 51.0 |
| KFAC‡ | - | 56.0 |
| Finetuning† | - | 76.4 ± 2.2 |
| Reptile | 76.52 ± 2.1 | 70.8 ± 1.9 |
| Leap | 73.9 ± 2.2 | 75.5 ± 2.6 |
| Warp-Leap | 80.4 ± 1.6 | 83.6 ± 1.9 |

We evaluate WarpGrad in a set of experiments designed to answer three questions: (1) do WarpGrad methods retain the inductive bias of MAML-based few-shot learners? (2) Can WarpGrad methods scale to problems beyond the reach of such methods? (3) Can WarpGrad generalise to complex meta-learning problems?

4.1 FEW-SHOT LEARNING

For few-shot learning, we test whether WarpGrad retains the inductive bias of gradient-based meta-learners while avoiding backpropagation through the gradient descent process. To isolate the effect of the WarpGrad objective, we use linear warp-layers trained with online meta-training (Algorithm 1) to make WarpGrad as close to T-Nets as possible. For a fair comparison, we meta-learn the initialisation using MAML (Warp-MAML) with J(θ0, φ) := L(φ) + λ C_MAML(θ0). We evaluate the importance of meta-learning the initialisation in Appendix G and find that WarpGrad achieves similar performance under random task parameter initialisation. All task-learners use a convolutional architecture that stacks 4 blocks, each made up of a 3×3 convolution, max-pooling, batch normalisation, and a ReLU activation. We define Warp-MAML by inserting warp-layers in the form of 3×3 convolutions after each block in the baseline task-learner (a code sketch follows Figure 4). All baselines are tuned with identical and independent hyper-parameter searches (including filter sizes; full experimental settings are in Appendix H), and we report the best results from our experiments or the literature. Warp-MAML outperforms all baselines (Table 1), improving 1- and 5-shot accuracy by 3.6 and 5.2 percentage points on miniImageNet (Vinyals et al., 2016; Ravi & Larochelle, 2016) and by 5.5 and 3.8 percentage points on tieredImageNet (Ren et al., 2018), which indicates that WarpGrad retains the inductive bias of MAML-based meta-learners.

Figure 4: Left: Omniglot test accuracies on held-out tasks after meta-training on a varying number of tasks. Shading represents standard deviation across 10 independent runs. We compare Warp-Leap, Leap, and Reptile, multi-headed finetuning, as well as SGD and KFAC, which use random initialisation but a 4x larger batch size and a 10x larger learning rate. Right: mean cumulative return on the RL maze navigation task. Shading represents inter-quartile ranges across 10 independent runs. Simple modulation and retroactive modulation are used (Miconi et al., 2019).
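For concreteness, a PyTorch sketch of the Section 4.1 task-learner with a 3×3 warp convolution after each block. The 64-channel width, 84×84 input resolution, 3-channel input, and the 5-way linear head are assumptions based on the standard few-shot setup, not the paper's verified configuration:

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # task-adaptable block: 3x3 conv, batch-norm, ReLU, max-pool
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class WarpConvNet(nn.Module):
    def __init__(self, n_classes=5, ch=64):
        super().__init__()
        self.task_blocks = nn.ModuleList(
            [conv_block(3, ch)] + [conv_block(ch, ch) for _ in range(3)])
        # warp-layers: 3x3 convs interleaved after each task block,
        # meta-learned across tasks and frozen during task adaptation
        self.warps = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=1) for _ in range(4)])
        self.head = nn.Linear(ch * 5 * 5, n_classes)  # assumes 84x84 input

    def forward(self, x):
        for h, w in zip(self.task_blocks, self.warps):
            x = w(h(x))
        return self.head(x.flatten(1))
```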
4.2 MULTI-SHOT LEARNING

Next, we evaluate whether WarpGrad can scale beyond few-shot adaptation on similar supervised problems. We propose a new protocol for tieredImageNet that increases the number of adaptation steps to 640, and we use 6 convolutional blocks in task-learners, which are otherwise defined as above. Since MAML-based approaches cannot backpropagate through 640 adaptation steps for models of this size, we evaluate WarpGrad against two gradient-based meta-learners that meta-learn an initialisation without such backpropagation, Reptile (Nichol et al., 2018) and Leap (Flennerhag et al., 2019), and we define a Warp-Leap meta-learner by J(θ0, φ) := L(φ) + λ C_Leap(θ0). Leap is an attractive complement as it minimises the expected gradient descent trajectory length across tasks. Under WarpGrad, this becomes a joint search for a geometry in which task adaptation defines geodesics (shortest paths; see Appendix C for details). While Reptile outperforms Leap by 2.6 percentage points on this benchmark, Warp-Leap surpasses both, with a margin of 3.88 percentage points over Reptile (Table 1).

We further evaluate Warp-Leap on the multi-shot Omniglot (Lake et al., 2011) protocol proposed by Flennerhag et al. (2019), where each of the 50 alphabets is a 20-way classification task. Task adaptation involves 100 gradient steps on random samples that are preprocessed by random affine transformations. We report results for Warp-Leap under offline meta-training (Algorithm 2), which updates warp parameters 2000 times per meta-step (see Appendix E for experimental details). Warp-Leap enjoys similar performance on this task as well, improving over Leap and Reptile by 8.1 and 12.8 points respectively (Table 1). We also perform an extensive ablation study varying the number of tasks in the meta-training set. Except for the case of a single task, Warp-Leap substantially outperforms all baselines (Figure 4), achieving a higher rate of convergence and reducing the final test error from ~30% to ~15%. Non-linear warps, which go beyond block-diagonal preconditioning, reach ~11% test error (refer to Appendix F and Table 2 for the full results; a sketch of such a warp-layer follows below). Finally, we find that WarpGrad methods behave distinctly differently from Natural Gradient Descent methods in an ablation study (Appendix G): WarpGrad reduces the final test error from ~42% to ~19% when controlling for initialisation, while its preconditioning matrices differ from what the literature suggests (Desjardins et al., 2015).
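A minimal sketch of a non-linear, residual warp-layer of the kind the ablation refers to; the width and depth are assumptions chosen for illustration:

```python
import torch.nn as nn

class ResidualWarp(nn.Module):
    """Non-linear warp-layer: a small residual block whose Jacobian
    yields data-dependent gradient preconditioning, beyond the
    block-diagonal structure a single linear projection can express."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        # identity skip keeps the warp close to invertible at initialisation
        return x + self.body(x)
```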
4.3 COMPLEX META-LEARNING: REINFORCEMENT AND CONTINUAL LEARNING

(c.1) Reinforcement Learning. To illustrate how WarpGrad may be used both with recurrent neural networks and in meta-reinforcement learning, we evaluate it on a maze navigation task proposed by Miconi et al. (2018). The environment is a fixed maze and a task is defined by randomly choosing a goal location. The agent's objective is to find the goal location as many times as possible, being teleported to a random location each time it finds it. We use advantage actor-critic with a basic recurrent neural network (Wang et al., 2016) as the task-learner, and we design a Warp-RNN as a HyperNetwork (Ha et al., 2016) that uses an LSTM that is fixed during task training. This LSTM modulates the weights of the task-learning RNN (defined in Appendix I), which in turn is trained on mini-batches of 30 episodes for 200 000 steps. We accumulate the gradient of the fixed warp parameters continually (Algorithm 3, Appendix B) at each task parameter update. Warp parameters are updated on every 30th step on task parameters (we control for meta-LSTM capacity in Appendix I). We compare against Learning to Reinforcement Learn (Wang et al., 2016) and Hebbian meta-learning (Miconi et al., 2018; 2019); see Appendix I for details. Notably, linear warps (T-Nets) do worse than the baseline RNN on this task, while the Warp-RNN converges to a mean cumulative reward of ~160 in 60 000 episodes, compared to baselines that reach at most a mean cumulative reward of ~125 after 100 000 episodes (Figure 4) and ~150 after 200 000 episodes (Appendix I).

Figure 5: Continual learning experiment. Average log-loss over 100 randomly sampled tasks, each comprised of 5 sub-tasks. Left: learned sequentially as seen during meta-training. Right: learned in random order [sub-task 1, 3, 4, 2, 0].

(c.2) Continual Learning. We test whether WarpGrad can prevent catastrophic forgetting (French, 1999) in a continual learning scenario. To this end, we design a continual learning version of the sine regression meta-learning experiment in Finn et al. (2017) by splitting the input interval [−5, 5] ⊂ R into 5 consecutive sub-tasks (an alternative protocol was recently proposed independently by Javed & White, 2019). Each sub-task is a regression problem with the target being a mixture of two random sine waves. We train a 4-layer feed-forward task-learner with interleaved warp-layers incrementally on one sub-task at a time (see Appendix J for details). To isolate the behaviour of warp parameters, we use a fixed random initialisation for each task sequence. Warp parameters are meta-learned to prevent catastrophic forgetting by defining L^τ_meta to be the average task loss over current and previous sub-tasks, for each sub-task in a task sequence (sketched below). This forces warp parameters to disentangle the adaptation processes of current and previous sub-tasks. We train on each sub-task for 20 steps, for a total of 100 task adaptation steps. We evaluate WarpGrad on 100 random tasks and find that it learns new sub-tasks well, with mean losses on the order of 10⁻³. When switching sub-task, performance immediately deteriorates to ~10⁻² but is stable for the remainder of training (Figure 5). Our results indicate that WarpGrad can be an effective mechanism against catastrophic forgetting, a promising avenue for further research. For detailed results, see Appendix J.
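The continual learning meta-objective can be sketched in a few lines of Python; `task_loss`, `model`, and the per-sub-task data handles are hypothetical stand-ins:

```python
def continual_meta_loss(model, data_per_subtask, t, task_loss):
    """L_meta at sub-task t: average task loss over sub-tasks 0..t.

    Warp parameters are trained so that adapting to sub-task t does not
    destroy performance on the sub-tasks that came before it.
    """
    losses = [task_loss(model, data_per_subtask[i]) for i in range(t + 1)]
    return sum(losses) / (t + 1)
```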
5 CONCLUSION

We propose WarpGrad, a novel meta-learner that combines the expressive capacity and flexibility of memory-based meta-learners with the inductive bias of gradient-based meta-learners. WarpGrad meta-learns to precondition gradients during task adaptation without backpropagating through the adaptation process, and we find empirically that it retains the inductive bias of MAML-based few-shot learners while being able to scale to complex problems and architectures. Further, by expressing preconditioning through warp-layers that are universal function approximators, WarpGrad can express geometries beyond the block-diagonal structure of prior works.

WarpGrad provides a principled framework for general-purpose meta-learning that integrates learning paradigms, such as continual learning, an exciting avenue for future research. We introduce novel means for preconditioning, for instance with residual and recurrent warp-layers. Understanding how WarpGrad manifolds relate to second-order optimisation methods will further our understanding of gradient-based meta-learning and aid us in designing warp-layers with stronger inductive bias.

In their current form, WarpGrad methods share some of the limitations of many popular meta-learning approaches. While WarpGrad avoids backpropagating through the task training process, as in Warp-Leap, the WarpGrad objective samples from parameter trajectories and therefore has linear computational complexity in the number of adaptation steps, currently an unresolved limitation of gradient-based meta-learning. Algorithm 2 hints at exciting possibilities for overcoming this limitation.

ACKNOWLEDGEMENTS

The authors would like to thank Guillaume Desjardins for helpful discussions on an early draft, as well as the anonymous reviewers for their comments. SF gratefully acknowledges support from the North West Doctoral Training Centre under ESRC grant ES/J500094/1 and from The Alan Turing Institute under EPSRC grant EP/N510129/1.

REFERENCES

Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
Amari, Shun-ichi and Nagaoka, Hiroshi. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.
Andrychowicz, Marcin, Denil, Misha, Gomez, Sergio, Hoffman, Matthew W, Pfau, David, Schaul, Tom, Shillingford, Brendan, and De Freitas, Nando. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.
Antoniou, Antreas, Edwards, Harrison, and Storkey, Amos J. How to train your MAML. In International Conference on Learning Representations, 2019.
Arora, Sanjeev, Cohen, Nadav, and Hazan, Elad. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, 2018.
Ba, Jimmy, Hinton, Geoffrey E, Mnih, Volodymyr, Leibo, Joel Z, and Ionescu, Catalin. Using fast weights to attend to the recent past. In Advances in Neural Information Processing Systems, 2016.
Beck, Amir and Teboulle, Marc. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167-175, 2003.
Bengio, Yoshua, Bengio, Samy, and Cloutier, Jocelyn. Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche opérationnelle, 1991.
Bertinetto, Luca, Henriques, João F, Valmadre, Jack, Torr, Philip, and Vedaldi, Andrea. Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, 2016.
Bilen, Hakan and Vedaldi, Andrea. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv preprint arXiv:1701.07275, 2017.
Byrd, R., Hansen, S., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008-1031, 2016.
Chen, Yutian, Hoffman, Matthew W, Colmenarejo, Sergio Gómez, Denil, Misha, Lillicrap, Timothy P, Botvinick, Matt, and de Freitas, Nando. Learning to learn without gradient descent by gradient descent. In International Conference on Machine Learning, 2017.
Cybenko, George. Approximation by superpositions of a sigmoidal function. MCSS, 1989.
Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In International Conference on Computer Vision and Pattern Recognition, 2009.
Desjardins, Guillaume, Simonyan, Karen, Pascanu, Razvan, and Kavukcuoglu, Koray. Natural neural networks. In Advances in Neural Information Processing Systems, 2015.
Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L., Sutskever, Ilya, and Abbeel, Pieter. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
Finn, Chelsea, Abbeel, Pieter, and Levine, Sergey. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017.
Flennerhag, Sebastian, Yin, Hujun, Keane, John, and Elliot, Mark. Breaking the activation function bottleneck through adaptive parameterization. In Advances in Neural Information Processing Systems, 2018.
Flennerhag, Sebastian, Moreno, Pablo G., Lawrence, Neil D., and Damianou, Andreas. Transferring knowledge across learning processes. In International Conference on Learning Representations, 2019.
French, Robert M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128-135, 1999.
Gidaris, Spyros and Komodakis, Nikos. Dynamic few-shot visual learning without forgetting. In International Conference on Computer Vision and Pattern Recognition, 2018.
Grant, Erin, Finn, Chelsea, Levine, Sergey, Darrell, Trevor, and Griffiths, Thomas L. Recasting gradient-based meta-learning as hierarchical Bayes. In International Conference on Learning Representations, 2018.
Ha, David, Dai, Andrew M., and Le, Quoc V. HyperNetworks. In International Conference on Learning Representations, 2016.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In International Conference on Computer Vision and Pattern Recognition, 2016.
He, Kaiming, Gkioxari, Georgia, Dollár, Piotr, and Girshick, Ross. Mask R-CNN. In International Conference on Computer Vision, pp. 2980-2988, 2017.
Hinton, Geoffrey E and Plaut, David C. Using fast weights to deblur old memories. In 9th Annual Conference of the Cognitive Science Society, 1987.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9:1735-1780, 1997.
Hochreiter, Sepp, Younger, A. Steven, and Conwell, Peter R. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, 2001.
Hornik, Kurt. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.
Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
Javed, Khurram and White, Martha. Meta-learning representations for continual learning. arXiv preprint arXiv:1905.12588, 2019.
Kim, Taesup, Yoon, Jaesik, Dia, Ousmane, Kim, Sungwoong, Bengio, Yoshua, and Ahn, Sungjin. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.
Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
Lacoste, Alexandre, Oreshkin, Boris, Chung, Wonchang, Boquet, Thomas, Rostamzadeh, Negar, and Krueger, David. Uncertainty in multitask transfer learning. In Advances in Neural Information Processing Systems, 2018.
Lake, Brenden, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua. One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, 2011.
Lake, Brenden M., Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
Lee, John M. Introduction to Smooth Manifolds. Springer, 2003.
Lee, Kwonjoon, Maji, Subhransu, Ravichandran, Avinash, and Soatto, Stefano. Meta-learning with differentiable convex optimization. In CVPR, 2019.
Lee, Yoonho and Choi, Seungjin. Meta-learning with adaptive layerwise metric and subspace. In International Conference on Machine Learning, 2018.
Li, Ke and Malik, Jitendra. Learning to optimize. In International Conference on Machine Learning, 2016.
Li, Zhenguo, Zhou, Fengwei, Chen, Fei, and Li, Hang. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
Li, Zhizhong and Hoiem, Derek. Learning without forgetting. In European Conference on Computer Vision, 2016.
Liu, Hao, Socher, Richard, and Xiong, Caiming. Taming MAML: Efficient unbiased meta-reinforcement learning. In International Conference on Machine Learning, 2019.
Martens, James. Deep learning via Hessian-free optimization. In International Conference on Machine Learning, 2010.
Martens, James and Grosse, Roger. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, 2015.
Mendonca, Russell, Gupta, Abhishek, Kralev, Rosen, Abbeel, Pieter, Levine, Sergey, and Finn, Chelsea. Guided meta-policy search. arXiv preprint arXiv:1904.00956, 2019.
Metz, Luke, Maheswaranathan, Niru, Cheung, Brian, and Sohl-Dickstein, Jascha. Meta-learning update rules for unsupervised representation learning. In International Conference on Learning Representations, 2019.
Miconi, Thomas, Clune, Jeff, and Stanley, Kenneth O. Differentiable plasticity: training plastic neural networks with backpropagation. In International Conference on Machine Learning, 2018.
Miconi, Thomas, Clune, Jeff, and Stanley, Kenneth O. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. In International Conference on Learning Representations, 2019.
Mishra, Nikhil, Rohaninejad, Mostafa, Chen, Xi, and Abbeel, Pieter. A simple neural attentive meta-learner. In International Conference on Learning Representations, 2018.
Mujika, Asier, Meier, Florian, and Steger, Angelika. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems, 2017.
Munkhdalai, Tsendsuren, Yuan, Xingdi, Mehri, Soroush, Wang, Tong, and Trischler, Adam. Learning rapid-temporal adaptations. In International Conference on Machine Learning, 2018.
Nichol, Alex, Achiam, Joshua, and Schulman, John. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
Nocedal, Jorge and Wright, Stephen. Numerical Optimization. Springer, 2006.
Oreshkin, Boris N, Lacoste, Alexandre, and Rodriguez, Paul. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, 2018.
Park, Eunbyung and Oliva, Junier B. Meta-curvature. arXiv preprint arXiv:1902.03356, 2019.
Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. In International Conference on Learning Representations, 2014.
Perez, Ethan, Strub, Florian, De Vries, Harm, Dumoulin, Vincent, and Courville, Aaron. FiLM: Visual reasoning with a general conditioning layer. In Association for the Advancement of Artificial Intelligence, 2018.
Qiao, Siyuan, Liu, Chenxi, Shen, Wei, and Yuille, Alan L. Few-shot image recognition by predicting parameters from activations. In International Conference on Computer Vision and Pattern Recognition, 2018.
Ravi, Sachin and Larochelle, Hugo. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2016.
Rebuffi, Sylvestre-Alvise, Bilen, Hakan, and Vedaldi, Andrea. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, 2017.
Ren, Mengye, Triantafillou, Eleni, Ravi, Sachin, Snell, Jake, Swersky, Kevin, Tenenbaum, Joshua B., Larochelle, Hugo, and Zemel, Richard S. Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations, 2018.
Rusu, Andrei A., Rao, Dushyant, Sygnowski, Jakub, Vinyals, Oriol, Pascanu, Razvan, Osindero, Simon, and Hadsell, Raia. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.
Saxe, Andrew M, McClelland, James L, and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
Schäfer, Anton Maximilian and Zimmermann, Hans-Georg. Recurrent neural networks are universal approximators. International Journal of Neural Systems, 17(04):253-263, 2007.
Schmidhuber, Jürgen. Evolutionary Principles in Self-Referential Learning. PhD thesis, Technische Universität München, 1987.
Schmidhuber, Jürgen. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131-139, 1992.
Snell, Jake, Swersky, Kevin, and Zemel, Richard S. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 2017.
Suarez, Joseph. Language modeling with recurrent highway hypernetworks. In Advances in Neural Information Processing Systems, 2017.
Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, 2013.
Thrun, Sebastian and Pratt, Lorien. Learning to learn: Introduction and overview. In Learning to Learn. Springer, 1998.
Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, and Wierstra, Daan. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, 2016.
Wang, Jane X., Kurth-Nelson, Zeb, Tirumala, Dhruva, Soyer, Hubert, Leibo, Joel Z., Munos, Rémi, Blundell, Charles, Kumaran, Dharshan, and Botvinick, Matthew. Learning to reinforcement learn. In Annual Meeting of the Cognitive Science Society, 2016.
Wu, Yuhuai, Ren, Mengye, Liao, Renjie, and Grosse, Roger B. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018.
Zhou, Fengwei, Wu, Bin, and Li, Zhenguo. Deep meta-learning: Learning to learn in the concept space. arXiv preprint arXiv:1802.03596, 2018.
Zintgraf, Luisa M., Shiarlis, Kyriacos, Kurin, Vitaly, Hofmann, Katja, and Whiteson, Shimon. Fast context adaptation via meta-learning. In International Conference on Machine Learning, 2019.

A WARPGRAD DESIGN PRINCIPLES FOR NEURAL NETS

Figure 6: Illustration of possible WarpGrad architectures: (a) Warp-ConvNet, (b) Warp-ResNet, (c) Warp-LSTM, (d) Warp-HyperNetwork. Orange represents task layers and blue represents warp-layers; residual connections and gating mechanisms are marked in the figure. We can obtain warped architectures by interleaving task- and warp-layers (a, c) or by designating some layers in standard architectures as task-adaptable and some as warp-layers (b, d).
WarpGrad is a model-embedded meta-learned optimiser that allows for several implementation strategies. To embed warp-layers given a task-learner architecture, we may either insert new warp-layers into the given architecture or designate some layers as warp-layers and some as task layers. We found that WarpGrad can be used both in a high-capacity mode, where task-learners are relatively weak to avoid overfitting, and in a low-capacity mode, where task-learners are powerful and warp-layers are relatively weak. The best approach depends on the problem at hand. We highlight three approaches to designing WarpGrad optimisers, starting from a given architecture:

(a) Model partitioning. Given a desired architecture, designate some operations as task-adaptable and the rest as warp-layers. Task layers do not have to interleave exactly with warp-layers, as gradient warping arises both through the forward pass and through backpropagation. This was how we approached the tieredImageNet and miniImageNet experiments.

(b) Model augmentation. Given a model, designate all layers as task-adaptable and interleave warp-layers. Warp-layers can be relatively weak, as backpropagation through non-linear activations ensures expressive gradient warping. This was our approach to the Omniglot experiment; our main architecture interleaves linear warp-layers in a standard architecture.

(c) Information compression. Given a model, designate all layers as warp-layers and interleave weak task layers. In this scenario, task-learners are prone to overfitting. Pushing capacity into the warp allows it to encode general information that the task-learner can draw on during task adaptation. This approach is similar to approaches in transfer and meta-learning that restrict the number of free parameters during task training (Rebuffi et al., 2017; Lee & Choi, 2018; Zintgraf et al., 2019).

Note that in each case, once warp-layers have been chosen, standard backpropagation automatically warps gradients for us. Thus, WarpGrad is fully compatible with any architecture, for instance Residual Neural Networks (He et al., 2016) or LSTMs. For convolutional neural networks, we may use any form of convolution, learned normalisation (e.g. Ioffe & Szegedy, 2015), or adaptor module (e.g. Rebuffi et al., 2017; Perez et al., 2018) to design task and warp-layers.
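Whichever design principle is used, training ultimately only needs the model's parameters split into a task group and a warp group. A minimal Python sketch, assuming warp modules are identifiable by a (hypothetical) name prefix; any equivalent tagging scheme works:

```python
import torch

def split_parameters(model, warp_prefixes=("warp",)):
    """Partition a model's parameters into task-adaptable and warp groups,
    identified here by parameter-name prefix."""
    task, warp = [], []
    for name, p in model.named_parameters():
        (warp if name.startswith(warp_prefixes) else task).append(p)
    return task, warp

# Task parameters get the inner (task-adaptation) optimiser; warp
# parameters get the outer (meta) optimiser, e.g.:
#   task_params, warp_params = split_parameters(model)
#   inner_opt = torch.optim.SGD(task_params, lr=1e-1)
#   outer_opt = torch.optim.Adam(warp_params, lr=1e-3)
```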
Hence, we can apply standard mini-batching with respect to the buffer and perform mini-batch gradient descent on warp parameters. This allows us to update warp parameters several times for a given sample of task parameter trajectories, which can greatly improve data efficiency. In our Omniglot experiment, we found offline meta-training to converge faster: in fact, a mini-batch size of 1 (i.e. η = 1 in Algorithm 2 converges rapidly without any instability. Finally, in Algorithm 3, we present a continual meta-training process where meta-training occurs throughout a stream of learning experiences. Here, C represents a multi-task objective, such as the average task loss, Cmulti = P τ p(τ) Lτ task. Meta-learning arises by collecting experiences continuously (across different tasks) and using these to accumulate the meta-gradient online. Warp parameters are updated intermittently with the accumulated meta-gradient. We use this algorithm in our maze navigation experiment, where task adaptation is internalised within the RNN tasklearner. C WARPGRAD OPTIMISERS In this section, we detail Warp Grad methods used in our experiments. Warp-MAML We use this model for few-shot learning (Section 4.1). We use the full warpobjective in Eq. 11 together with the MAML objective (Eq. 1), JWarp-MAML := L(φ) + λCMAML(θ0), (14) where CMAML = LMAML under the constraint P = I. In our experiments, we trained Warp-MAML using the online training algorithm (Algorithm 1). Warp-Leap We use this model for multi-shot meta-learning. It is defined by applying Leap (Flennerhag et al., 2019) to θ0 (Eq. 16), JWarp-Leap := L(φ) + λCLeap(θ0), (15) where the Leap objective is defined by minimising the expected cumulative chordal distance, CLeap(θ0) := X sg [ϑτ k] ϑτ k 1 2, ϑτ k = (θτ k,0, . . . , θτ k,n, Lτ task (θτ k; φ)). (16) Note that the Leap meta-gradient makes a first-order approximation to avoid backpropagating through the adaptation process. It is given by CLeap(θ0) X Lτ task (θτ k; φ) Lτ task θτ k 1; φ + θτ k ϑτ k ϑτ k 1 2 , (17) where Lτ task (θτ k; φ) := Lτ task (θτ k; φ) Lτ task θτ k 1; φ and θτ k := θτ k θτ k 1. In our experiments, we train Warp-Leap using Algorithm 1 in the multi-shot tiered Image Net experiment and Algorithm 2 Published as a conference paper at ICLR 2020 in the Omniglot experiment. We perform an ablation study for training algorithms, comparing exact (Eq. 11) versus approximate (Eq. 12) meta-objectives, and several implementations of the warp-layers on Omniglot in Appendix F. Warp-RNN For our Reinforcement Learning experiment, we define a Warp Grad optimiser by meta-learning an LSTM that modulates the weights of the task-learner (see Appendix I for details). For this algorithm, we face a continuous stream of experiences (episodes) that we meta-learn using our continual meta-training algorithm (Algorithm 3). In our experiment, both Lτ task and Lτ meta are the advantage actor-critic objective (Wang et al., 2016); C is computed on one batch of 30 episodes, whereas L is accumulated over η = 30 such batches, for a total of 900 episodes. As each episode involves 300 steps in the environment, we cannot apply the exact meta objective, but use the approximate meta objective (Eq. 12). Specifically, let Eτ = {s0, a1, r1, s1, . . . , s T , a T , r T , s T +1} denote an episode on task τ, where s denotes state, a action, and r instantaneous reward. Denote a minibatch of randomly sampled task episodes by E = {Eτ}τ p(τ) and an ordered set of k consecutive mini-batches by Ek = {Ek i}k 1 i=0 . 
Warp-RNN For our reinforcement learning experiment, we define a Warp Grad optimiser by meta-learning an LSTM that modulates the weights of the task-learner (see Appendix I for details). For this algorithm, we face a continuous stream of experiences (episodes) that we meta-learn over using our continual meta-training algorithm (Algorithm 3). In our experiment, both L^τ_task and L^τ_meta are the advantage actor-critic objective (Wang et al., 2016); C is computed on one batch of 30 episodes, whereas L is accumulated over η = 30 such batches, for a total of 900 episodes. As each episode involves 300 steps in the environment, we cannot apply the exact meta-objective, but use the approximate meta-objective (Eq. 12). Specifically, let E^τ = {s_0, a_1, r_1, s_1, …, s_T, a_T, r_T, s_{T+1}} denote an episode on task τ, where s denotes state, a action, and r instantaneous reward. Denote a mini-batch of randomly sampled task episodes by E = {E^τ}_{τ∼p(τ)} and an ordered set of k consecutive mini-batches by Ē_k = {E_i}_{i=0}^{k−1}. Then

L̂(φ; Ē_k) = (1/n) ∑_{E^τ_{i,j} ∈ Ē_k} L^τ_meta(φ; θ, E^τ_{i,j})    and    C_multi(θ; E_k) = (1/ñ) ∑_{E^τ_{k,j} ∈ E_k} L^τ_task(θ; φ, E^τ_{k,j}),

where n and ñ are normalising constants. The Warp-RNN objective is defined by

J^{Warp-RNN} := L̂(φ; Ē_k) + λ C_multi(θ; E_k)   if k = η,
J^{Warp-RNN} := λ C_multi(θ; E_k)                otherwise.    (18)

Warp Grad for Continual Learning For this experiment, we focus on meta-learning warp-parameters. Hence, the initialisation for each task sequence is a fixed random initialisation (i.e. we set λC(θ_0) = 0). For the warp meta-objective, we take expectations over N task sequences, where each task sequence is a sequence of T = 5 sub-tasks that the task-learner observes one at a time; thus, while the task loss is defined over the current sub-task, the meta-loss averages over the current and all prior sub-tasks, for each sub-task in the sequence. See Appendix J for detailed definitions. Importantly, because Warp Grad defines task adaptation abstractly by a probability distribution, we can readily implement a continual learning objective by modifying the joint task parameter distribution p(τ, θ) that we use in the meta-objective (Eq. 11). A task defines a sequence of sub-tasks over which we generate parameter trajectories θ^τ. Thus, the only difference from multi-task meta-learning is that parameter trajectories are not generated under a fixed task, but arise as a function of the continual learning algorithm used for adaptation. We define the conditional distribution p(θ | τ) as before by sampling sub-task parameters θ^{τ_t} from a mini-batch of such task trajectories, keeping track of which sub-task t each belongs to and which sub-tasks came before it in the given task sequence τ. The meta-objective is constructed, for any sub-task parameterisation θ^{τ_t}, as L^τ_meta(θ^{τ_t}) = (1/t) ∑_{i=1}^{t} L^τ_task(θ^{τ_t}, D^i; φ), where D^i is data from sub-task i (Appendix J). The full meta-objective is an expectation over task parameterisations,

L^{CL}(φ) := ∑_{θ^{τ_t} ∼ p(θ | τ_t)} L^τ_meta(θ^{τ_t}; φ).    (19)
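For concreteness, here is a condensed PyTorch sketch of one outer iteration of Algorithm 1; task_loss and meta_loss stand in for the L^τ_task and L^τ_meta of the experiment at hand, and the θ_0 update via C (e.g. Leap) is elided.

```python
import torch

def warpgrad_online_step(task_batch, theta0, warp, alpha, beta):
    """One outer-loop iteration of Algorithm 1 (online meta-training).
    theta0: list of tensors (shared initialisation); warp: module
    holding the warp-parameters phi."""
    g_phi = [torch.zeros_like(p) for p in warp.parameters()]
    for task in task_batch:
        # Each task starts from the shared initialisation theta0.
        theta = [p.detach().clone().requires_grad_(True) for p in theta0]
        for k in range(task.num_steps):
            # Inner step: theta <- theta - alpha * grad L_task(theta; phi).
            # Detaching keeps the trajectory out of the autograd graph
            # (the meta-objective is trajectory-agnostic).
            grads = torch.autograd.grad(task.task_loss(theta, warp), theta)
            theta = [(p - alpha * g).detach().requires_grad_(True)
                     for p, g in zip(theta, grads)]
            # Accumulate grad_phi L(phi; theta_k) online, at the current
            # point of the trajectory only.
            meta_grads = torch.autograd.grad(
                task.meta_loss(theta, warp), list(warp.parameters()))
            for buf, g in zip(g_phi, meta_grads):
                buf.add_(g)
    for p, g in zip(warp.parameters(), g_phi):  # phi <- phi - beta * g_phi
        p.data.add_(g, alpha=-beta)
```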
D SYNTHETIC EXPERIMENT

To build intuition for what it means to warp space, we construct a simple 2-D problem over loss surfaces. A learner is faced with the task of minimising an objective function of the form

f^τ(x_1, x_2) = g^τ_1(x_1) exp(g^τ_2(x_2)) − g^τ_3(x_1) exp(g^τ_4(x_1, x_2)) − g^τ_5 exp(g^τ_6(x_1)),

where each task f^τ is defined by scale and rotation functions g^τ that are randomly sampled from a predefined distribution. Specifically, each task is defined by the objective function

f^τ(x_1, x_2) = b^τ_1 (a^τ_1 − x_1)² exp(−x_1² − (x_2 + a^τ_2)²) − b^τ_2 (x_1/s^τ − x_1³ − x_2⁵) exp(−x_1² − x_2²) − b^τ_3 exp(−(x_1 + a^τ_3)² − x_1²),

where a, b and s are randomly sampled parameters,

s^τ ∼ Cat(1, 2, …, 10),    a^τ_i ∼ Cat(−1, 0, 1),    b^τ_i ∼ Cat(−5, −4, …, 4, 5).

The task is to minimise the given objective from a randomly sampled initialisation, x_{i=1,2} ∼ U(−3, 3). During meta-training, we train on a task for 100 steps using a learning rate of 0.1. Each task has a unique loss surface that the learner traverses from the randomly sampled initialisation. While each loss surface is unique, the surfaces share an underlying structure. Thus, by meta-learning a warp over trajectories on randomly sampled loss surfaces, we expect Warp Grad to learn a warp that is close to invariant to spurious descent directions. In particular, Warp Grad should produce a smooth warped space that is quasi-convex for any given task, ensuring that the task-learner finds a minimum as fast as possible regardless of initialisation.

To visualise the geometry, we use an explicit warp Ω defined by a 2-layer feed-forward network with a hidden-state size of 30 and tanh non-linearities. We train warp parameters for 100 meta-training steps; in each meta-step we sample a new task surface and a mini-batch of 10 random initialisations that we train separately. We train to convergence and accumulate the warp meta-gradient online (Algorithm 1). We evaluate against gradient descent on randomly sampled loss surfaces (Figure 7). Both optimisers start from the same initialisation, chosen such that standard gradient descent struggles; we expect the Warp Grad optimiser to have learned a geometry that is robust to the initialisation (Figure 7, top row). This is indeed what we find: the geometry learned by Warp Grad smoothly warps the native loss surface into a well-behaved space where gradient descent converges to a local minimum.

Figure 7: Example trajectories on three task loss surfaces. We start gradient descent (black) and Warp Grad (magenta) from the same initialisation; while SGD struggles with the curvature, the Warp Grad optimiser has learned a warp such that gradient descent in the representation space (top) leads to rapid convergence in model parameter space (bottom).
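A minimal NumPy sketch of the task distribution above; the helper name is ours.

```python
import numpy as np

def sample_task():
    """Sample one synthetic task surface f: scales b, offsets a,
    and the scaling factor s, as defined in Appendix D."""
    s = np.random.randint(1, 11)                    # s ~ Cat(1, ..., 10)
    a = np.random.choice([-1, 0, 1], size=3)        # a_i ~ Cat(-1, 0, 1)
    b = np.random.choice(np.arange(-5, 6), size=3)  # b_i ~ Cat(-5, ..., 5)

    def f(x1, x2):
        return (b[0] * (a[0] - x1) ** 2 * np.exp(-x1 ** 2 - (x2 + a[1]) ** 2)
                - b[1] * (x1 / s - x1 ** 3 - x2 ** 5) * np.exp(-x1 ** 2 - x2 ** 2)
                - b[2] * np.exp(-(x1 + a[2]) ** 2 - x1 ** 2))

    return f

# One task instance: minimise f from a random initialisation in [-3, 3]^2.
f_tau = sample_task()
x1, x2 = np.random.uniform(-3, 3, size=2)
```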
E OMNIGLOT

We follow the protocol of Flennerhag et al. (2019), including the choice of hyper-parameters unless otherwise noted. In this setup, each of the 50 alphabets that comprise the dataset constitutes a distinct task. Each task is treated as a 20-way classification problem. Four alphabets have fewer than 20 characters and are discarded, leaving 46 alphabets in total. 10 alphabets are held out for final meta-testing; which alphabets are held out depends on the seed, to account for variations across alphabets; we train and evaluate all baselines on 10 seeds. For each character in an alphabet, there are 20 raw samples. Of these, 5 are held out for final evaluation on the task, while the remainder is used to construct a training set.

Raw samples are pre-processed by random affine transformations in the form of (a) scaling between [0.8, 1.2], (b) rotation [0, 360), and (c) cropping height and width by a factor of [−0.2, 0.2] in each dimension. This ensures tasks are too hard for few-shot learning. During task adaptation, mini-batches are sampled at random without ensuring class balance (in contrast to few-shot classification protocols (Vinyals et al., 2016)). Note that benchmarks under this protocol are not comparable to few-shot learning benchmarks.

We use the same convolutional neural network architecture and hyper-parameters as Flennerhag et al. (2019). This learner stacks, four times, a convolutional block comprising a 3×3 convolution with 64 filters, followed by 2×2 max-pooling, batch normalisation, and ReLU activation. All images are down-sampled to 28×28, resulting in a 1×1×64 feature map that is passed on to a final linear layer. We create a Warp-Leap meta-learner that inserts warp-layers between convolutional blocks,

W = ω^(4) ∘ h^(4) ∘ ⋯ ∘ ω^(1) ∘ h^(1),

where each h is defined as above. In our main experiment, each ω^(i) is simply a 3×3 convolutional layer with zero padding; in Appendix F we consider both simpler and more sophisticated versions. We find that relatively simple warp-layers do quite well, though adding capacity does improve generalisation performance. We meta-learn the initialisation of task parameters using the Leap objective (Eq. 16), detailed in Appendix C. Both L^τ_meta and L^τ_task are defined as the negative log-likelihood loss; importantly, we evaluate them on different mini-batches of task data to ensure warp-layers encourage generalisation. We found no additional benefit in this experiment from using held-out data to evaluate L^τ_meta. We use the offline meta-training algorithm (Appendix B, Algorithm 2); in particular, during meta-training, we sample mini-batches of 20 tasks and train task-learners for 100 steps to collect 2000 task parameterisations into a replay buffer. Task-learners share a common initialisation and warp parameters, which are held fixed during task adaptation. Once collected, we iterate over the buffer by randomly sampling mini-batches of task parameterisations without replacement. Unless otherwise noted, we use a batch size of η = 1. For each mini-batch, we update φ by applying gradient descent under the canonical meta-objective (Eq. 11), where we evaluate L^τ_meta on a randomly sampled mini-batch of data from the corresponding task. Consequently, for each meta-batch, we take (up to) 2000 meta-gradient steps on warp parameters φ. We find that this form of mini-batching causes the meta-training loop to converge much faster and induces no discernible instability.

Figure 8: Omniglot results. Top: test accuracy on held-out tasks after meta-training on a varying number of tasks. Bottom: AUC under the training accuracy curve on held-out tasks after meta-training on a varying number of tasks. Shading represents standard deviation across 10 independent runs. We compare Warp-Leap, Leap, Reptile, and multi-headed fine-tuning (FT), as well as SGD and KFAC, which use a random initialisation but a 10x larger learning rate.
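A PyTorch sketch of this task-learner with interleaved warp-layers, under our reading of the block ordering; the exact initialisation and padding details are assumptions.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # Task block h: 3x3 conv (64 filters), 2x2 max-pool, batch norm, ReLU.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.MaxPool2d(2), nn.BatchNorm2d(c_out), nn.ReLU())

def warp_layer(c):
    # Warp-layer omega: a single zero-padded 3x3 convolution (main experiment).
    return nn.Conv2d(c, c, 3, padding=1)

# W = omega(4) . h(4) . ... . omega(1) . h(1), then a 20-way linear output.
blocks, c_in = [], 1
for _ in range(4):
    blocks += [conv_block(c_in, 64), warp_layer(64)]
    c_in = 64
# 28x28 input -> four 2x2 pools -> 1x1x64 feature map -> linear layer.
learner = nn.Sequential(*blocks, nn.Flatten(), nn.Linear(64, 20))
```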
We compare Warp-Leap against no meta-learning with standard gradient descent (SGD) or KFAC (Martens & Grosse, 2015). We also benchmark against baselines provided in Flennerhag et al. (2019): Leap, Reptile (Nichol et al., 2018), MAML, and multi-headed fine-tuning. All learners benefit substantially from large batch sizes, as these enable higher learning rates. To render no pretraining a competitive option within a fair computational budget, we allow SGD and KFAC to use 4x larger batch sizes, enabling 10x larger learning rates.

Table 2: Mean test accuracy (%) after 100 training steps on held-out evaluation tasks. Finetuning is multi-headed; KFAC and SGD use no meta-training, but 10x larger learning rates.

No. meta-training tasks | Warp Grad   | Leap        | Reptile     | Finetuning  | MAML        | KFAC | SGD
1                       | 49.5 ± 7.8  | 37.6 ± 4.8  | 40.4 ± 4.0  | 53.8 ± 5.0  | 40.0 ± 2.6  | 56.0 | 51.0
3                       | 68.8 ± 2.8  | 53.4 ± 3.1  | 53.1 ± 4.2  | 64.6 ± 3.3  | 48.6 ± 2.5  | 56.0 | 51.0
5                       | 75.0 ± 3.6  | 59.5 ± 3.7  | 58.3 ± 3.3  | 67.7 ± 2.8  | 51.6 ± 3.8  | 56.0 | 51.0
10                      | 81.2 ± 2.4  | 67.4 ± 2.4  | 65.0 ± 2.1  | 71.3 ± 2.0  | 54.1 ± 2.8  | 56.0 | 51.0
15                      | 82.7 ± 3.3  | 70.0 ± 2.4  | 66.6 ± 2.9  | 73.5 ± 2.4  | 54.8 ± 3.4  | 56.0 | 51.0
20                      | 82.0 ± 2.6  | 73.3 ± 2.3  | 69.4 ± 3.4  | 75.4 ± 3.2  | 56.6 ± 2.0  | 56.0 | 51.0
25                      | 83.8 ± 1.9  | 74.8 ± 2.7  | 70.8 ± 1.9  | 76.4 ± 2.2  | 56.7 ± 2.1  | 56.0 | 51.0

F ABLATION STUDY: WARP LAYERS, META-OBJECTIVE, AND META-TRAINING

Warp Grad provides a principled approach to model-informed meta-learning and offers several degrees of freedom. To evaluate these design choices, we conduct an ablation study on Warp-Leap in which we vary the design of warp-layers as well as the meta-training approach. For the ablation study, we fix the number of pretraining tasks to 25 and report final test accuracy over 4 independent runs. All ablations use the same hyper-parameters, except online meta-training, which uses a learning rate of 0.001.

First, we vary the meta-training protocol by (a) using the approximate objective (Eq. 12), (b) using online meta-training (Algorithm 1), and (c) meta-learning the learning rate used for task adaptation, to test whether this is beneficial in this experiment. We meta-learn a single scalar learning rate (as warp parameters can learn layer-wise scaling). Meta-gradients for the learning rate are clipped at 0.001 and it is meta-trained with a learning rate of 0.001. Note that when using offline meta-training, we store both task parameterisations and the momentum buffer during the collection phase and use them in the update rule when computing the canonical objective (Eq. 11).

Further, we vary the architecture used for warp-layers. We study simpler versions that use channel-wise scaling and more complex versions that use non-linearities and residual connections. We also evaluate a version where each warp-layer has two stacked convolutions, the first outputting 128 filters and the second 64 filters. Finally, in the two-layer warp architecture, we evaluate a version that inserts a FiLM layer between the two warp convolutions. These are adapted during task training from a zero initialisation; they amount to task embeddings that condition gradient warping on task statistics. Full results are reported in Table 3.

G ABLATION STUDY: WARPGRAD AND NATURAL GRADIENT DESCENT
Table 4: Ablation study: mean test accuracy after 100 training steps on held-out evaluation tasks from a random initialisation. Mean and standard deviation over 4 seeds.

Method     | Preconditioning         | Accuracy
SGD        | None                    | 40.1 ± 6.1
KFAC (NGD) | Linear (block-diagonal) | 58.2 ± 3.2
Warp Grad  | Linear (block-diagonal) | 68.0 ± 4.4
Warp Grad  | Non-linear (full)       | 81.3 ± 4.0

Here, we perform ablation studies to compare the geometry that a Warp Grad optimiser learns to the geometry that Natural Gradient Descent (NGD) methods represent (approximately). For consistency, we run the ablation on Omniglot. As computing the true Fisher Information Matrix is intractable, we compare Warp Grad against two common block-diagonal approximations, KFAC (Martens & Grosse, 2015) and Natural Neural Nets (Desjardins et al., 2015).

First, we isolate the effect of warping task loss surfaces by fixing a random initialisation and only meta-learning warp parameters; that is, in this experiment we set λC(θ_0) = 0. We compare against two baselines, stochastic gradient descent (SGD) and KFAC, both trained from a random initialisation. We use task mini-batch sizes of 200 and task learning rates of 1.0; otherwise we use the same hyper-parameters as in the main experiment. For Warp Grad, we meta-train with these hyper-parameters as well.

Table 3: Ablation study: mean test accuracy after 100 training steps on held-out evaluation tasks. Mean and standard deviation over 4 independent runs. Offline refers to offline meta-training (Appendix B) and online to online meta-training (Algorithm 1); full denotes Eq. 11 and approx denotes Eq. 12. BN: batch normalisation (Ioffe & Szegedy, 2015); Scaling: equivalent to FiLM layers (Perez et al., 2018); Res: residual connection (He et al., 2016), which combined with BN is similar to the Residual Adaptor architecture (Rebuffi et al., 2017); TA: FiLM task embeddings.

Architecture                     | Meta-training | Meta-objective      | Accuracy
None (Leap)                      | Online        | None                | 74.8 ± 2.7
3×3 conv (default)               | Offline       | full (L, Eq. 11)    | 84.4 ± 1.7
3×3 conv                         | Offline       | approx (L̂, Eq. 12)  | 83.1 ± 2.7
3×3 conv                         | Online        | full                | 76.3 ± 2.1
3×3 conv                         | Offline       | full, learned α     | 83.1 ± 3.3
Scaling                          | Offline       | full                | 77.5 ± 1.8
1×1 conv                         | Offline       | full                | 79.4 ± 2.2
3×3 conv + ReLU                  | Offline       | full                | 83.4 ± 1.6
3×3 conv + BN                    | Offline       | full                | 84.7 ± 1.7
3×3 conv + BN + ReLU             | Offline       | full                | 85.0 ± 0.9
3×3 conv + BN + Res + ReLU       | Offline       | full                | 86.3 ± 1.1
2-layer 3×3 conv + BN + Res      | Offline       | full                | 88.0 ± 1.0
2-layer 3×3 conv + BN + Res + TA | Offline       | full                | 88.1 ± 1.0

Figure 9: Ablation study. Left: mean activation value E[h(x)] across layers, pre- and post-warping. Right: Schatten-1 norm of Cov(h(x), h(x)) − I across layers, pre- and post-warping. Statistics are gathered on the held-out test set and averaged over tasks and adaptation steps.

We evaluate two Warp Grad architectures: in one, we use linear warp-layers, which yields a block-diagonal preconditioning, as in KFAC; in the other, we use our most expressive warp configuration from the ablation experiment in Appendix F, where each warp-layer is a two-layer convolutional block with residual connections, batch normalisation, and ReLU activation. We find that warped geometries facilitate task adaptation on held-out tasks to a significantly greater degree than either SGD or KFAC (Table 4). We further find that going beyond block-diagonal preconditioning yields a significant improvement in performance.

Second, we explore whether the geometry that we meta-learn under the full Warp-Leap algorithm is approximately Fisher. In this experiment, we use the main Warp-Leap architecture.
We use a meta-learner trained on 25 tasks and evaluate it on 10 held-out tasks. Because warp-layers are linear in this configuration, if the learned geometry were approximately Fisher, post-warp activations should be zero-centred and the layer-wise covariance matrix should satisfy Cov(ω^(i)(h^(i)(x)), ω^(i)(h^(i)(x))) = I, where I is the identity matrix (Desjardins et al., 2015). If true, Warp-Leap would learn a block-diagonal approximation to the inverse Fisher matrix, as Natural Neural Nets do.

To test this, during task adaptation on held-out tasks, we compute the mean activation in each convolutional layer pre- and post-warping. We also compute the Schatten-1 norm of the difference between the layer activation covariance and the identity matrix, pre- and post-warping, as described above. We average statistics over tasks and adaptation steps (we found no significant variation in these dimensions). Figure 9 summarises our results. We find that, in general, Warp-Leap has zero-centred post-warp activations; that pre-warp activations are positive is an artefact of the ReLU activation function. However, the correlation structure is significantly different from what we would expect if Warp-Leap were to represent the Fisher matrix: post-warp covariances are significantly dissimilar from the identity matrix and vary across layers. These results indicate that Warp Grad methods behave distinctly differently from Natural Gradient Descent methods. One possibility is that Warp Grad methods do approximate the Fisher Information Matrix, but with higher accuracy than other methods. A more likely explanation is that Warp Grad methods encode a different geometry, since they can learn to leverage global information beyond the task at hand, which enables them to express geometries that standard Natural Gradient Descent cannot.
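The diagnostics in Figure 9 can be computed with a few lines of PyTorch; this sketch, with names of our choosing, takes a batch of (pre- or post-warp) layer activations and returns the two statistics.

```python
import torch

def activation_stats(h):
    """Figure 9 diagnostics for a batch of layer activations h with
    shape [batch, features]: mean activation, and the Schatten-1
    (nuclear) norm of Cov(h, h) - I."""
    centred = h - h.mean(dim=0, keepdim=True)
    cov = centred.t() @ centred / (h.shape[0] - 1)
    gap = torch.linalg.matrix_norm(cov - torch.eye(cov.shape[0]), ord='nuc')
    return h.mean(), gap
```

If the warp behaved like Natural Neural Nets, both post-warp statistics would be close to zero; the experiment above finds that they are not.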
H miniIMAGENET AND tieredIMAGENET

miniImageNet This dataset is a subset of 100 classes sampled randomly from the 1000 base classes in the ILSVRC-12 training set, with 600 images for each class. Following Ravi & Larochelle (2016), classes are split into non-overlapping meta-training, meta-validation and meta-test sets with 64, 16, and 20 classes respectively.

tieredImageNet As described in Ren et al. (2018), this dataset is a subset of ILSVRC-12 that stratifies 608 classes into 34 higher-level categories from the ImageNet human-curated hierarchy (Deng et al., 2009). In order to increase the separation between meta-train and meta-evaluation splits, 20 of these categories are used for meta-training, while 6 and 8 are used for meta-validation and meta-testing respectively. Slicing the class hierarchy closer to the root creates more similarity within each split, and correspondingly more diversity between splits, rendering the meta-learning problem more challenging. The high-level categories are further divided into 351 classes used for meta-training, 97 for meta-validation and 160 for meta-testing, for a total of 608 base categories. All the training images in ILSVRC-12 for these base classes are used to generate problem instances for tieredImageNet, of which there are a minimum of 732 and a maximum of 1300 images per class.

For all experiments, N-way K-shot classification problem instances were sampled following the standard image classification methodology for meta-learning proposed in Vinyals et al. (2016). A subset of N classes was sampled at random from the corresponding split. For each class, K arbitrary images were chosen without replacement to form the training dataset of that problem instance. As usual, a disjoint set of L images per class was selected for the validation set.

Few-shot classification In these experiments we used the established experimental protocol for evaluation in meta-validation and meta-testing: 600 task instances were selected, all using N = 5, K = 1 or K = 5, as specified, and L = 15. During meta-training we used N = 5, K = 5 or K = 15 respectively, and L = 15. Task-learners used 4 convolutional blocks with 128 filters (or fewer, chosen by hyper-parameter tuning), 3×3 kernels and strides set to 1, followed by batch normalisation with learned scales and offsets, a ReLU non-linearity and 2×2 max-pooling. The output of the convolutional stack (5×5×128) was flattened and mapped, using a linear layer, to the 5 output units. The last 3 convolutional layers were followed by warp-layers with 128 filters each. Only the final 3 task-layer parameters and their corresponding batch-norm scale and offset parameters were adapted during task training, with the corresponding warp-layers and the initial convolutional layer kept fixed and meta-learned using the Warp Grad objective. Note that, with the exception of CAVIA, other baselines do worse with 128 filters as they overfit; MAML and T-Nets achieve 46% and 49% 5-way-1-shot test accuracy with 128 filters, compared to their best reported results (48.7% and 51.7%, respectively).

Hyper-parameters were tuned independently for each condition using random grid search for highest test accuracy on meta-validation left-out tasks. Grid sizes were 50 for all few-shot experiments. We chose the optimal hyper-parameters (using early stopping at the meta-level) in terms of meta-validation test set accuracy for each condition and report test accuracy on the meta-test set of tasks. 60 000 meta-training steps were performed using meta-gradients over a single randomly selected task instance and its entire trajectory of 5 adaptation steps. Task-specific adaptation was done using stochastic gradient descent without momentum. We use Adam (Kingma & Ba, 2015) for meta-updates.

Multi-shot classification For these experiments we used N = 10, K = 640 and L = 50. Task-learners are defined similarly, but stack 6 convolutional blocks defined by 3×3 kernels and strides set to 1, followed by batch normalisation with learned scales and offsets, a ReLU non-linearity and 2×2 max-pooling (in the first 5 layers). The sizes of the convolutional layers were chosen by hyper-parameter tuning to {64, 64, 160, 160, 256, 256}. The output of the convolutional stack (2×2×256) was flattened and mapped, using a linear layer, to the 10 output units. Hyper-parameters were tuned independently for each algorithm, version, and baseline using random grid search for highest test accuracy on meta-validation left-out tasks. Grid sizes were 200 for all multi-shot experiments. We chose the optimal hyper-parameters in terms of mean meta-validation test set accuracy AUC (using early stopping at the meta-level) for each condition and report test accuracy on the meta-test set of tasks. 2000 meta-training steps were performed using meta-gradients averaged over 5 random task instances and their entire trajectories (inner loops) of 100 adaptation steps with batch size 64. Task-specific adaptation was done using stochastic gradient descent with momentum (0.9). Meta-gradients were passed to Adam in the outer loop.
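For reference, a small Python helper that samples one N-way K-shot problem instance under the protocol above; class_to_images is a hypothetical mapping from class label to its list of images.

```python
import random

def sample_episode(class_to_images, n_way=5, k_shot=1, l_val=15):
    """Sample one N-way K-shot problem instance: K training and L
    validation images per class, drawn without replacement."""
    classes = random.sample(list(class_to_images), n_way)
    train, val = [], []
    for label, cls in enumerate(classes):
        images = random.sample(class_to_images[cls], k_shot + l_val)
        train += [(img, label) for img in images[:k_shot]]
        val += [(img, label) for img in images[k_shot:]]
    return train, val
```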
We test Warp Grad against Leap, Reptile, and training from scratch with large batches and tuned momentum. We tune all meta-learners for optimal performance on the validation set. Warp Grad outperforms all baselines both in rate of convergence and in final test performance (Figure 10).

Figure 10: Multi-shot tieredImageNet results. Top: mean learning curves (test classification accuracy) on held-out meta-test tasks. Bottom: mean test classification performance on held-out meta-test tasks during meta-training. Training from scratch is omitted as it is not meta-trained.

I MAZE NAVIGATION

To illustrate how Warp Grad may be used both with recurrent neural networks in an online meta-learning setting and in a reinforcement learning environment, we evaluate it on a maze navigation task proposed by Miconi et al. (2018). The environment is a fixed maze, and a task is defined by randomly choosing a goal location in the maze. During a task episode of length 200, the goal location is fixed but the agent is teleported once it finds it. Thus, during an episode the agent must first locate the goal, then return to it as many times as possible, each time being randomly teleported to a new starting location. We use an identical setup to Miconi et al. (2019), except our grid is of size 11×11 as opposed to 9×9. We compare our Warp-RNN to Learning to Reinforcement Learn (Wang et al., 2016) and Hebbian meta-learners (Miconi et al., 2018; 2019).

The task-learner in all cases is an advantage actor-critic (Wang et al., 2016), where the actor and critic share an underlying basic RNN whose hidden state is projected into a policy and a value function by two separate linear layers. The RNN has a hidden-state size of 100 and tanh non-linearities. Following Miconi et al. (2019), for all benchmarks, we train the task-learner using Adam with a learning rate of 1e-3 for 200 000 steps using batches of 30 episodes, each of length 200. Meta-learning arises in this setting because each episode encodes a different task, as the goal location moves; by learning across episodes, the RNN encodes meta-information in its parameters that it can leverage during task adaptation (via its hidden state (Hochreiter & Schmidhuber, 1997; Wang et al., 2016)). See Miconi et al. (2019) for further details.

We design a Warp-RNN by introducing a warp-layer in the form of an LSTM that is frozen for most of the training process. Following Flennerhag et al. (2018), we use this meta-LSTM to modulate the task RNN. Given an episode with input vector x_t, the task RNN is defined by

h_t = tanh( U^(2)_{h,t} V U^(1)_{h,t} h_{t−1} + U^(2)_{x,t} W U^(1)_{x,t} x_t + U^b_t b ),    (20)

where W, V, b are task-adaptable parameters; each U^(i)_{j,t} is a diagonal warp matrix produced by projecting from the hidden state of the meta-LSTM, U^(i)_{j,t} = diag(tanh(P^(i)_j z_t)), where z_t is the hidden state of the meta-LSTM. See Flennerhag et al. (2018) for details. Our Warp-RNN is thus a form of HyperNetwork (see Figure 6, Appendix A). Because the meta-LSTM is frozen for most of the training process, task-adaptable parameters correspond to those of the baseline RNN. To control for the capacity of the meta-LSTM, we also train a HyperRNN where the LSTM is updated with every task adaptation; we find this model does worse than the Warp-RNN.
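A PyTorch sketch of a cell implementing Eq. 20; feeding x_t to the meta-LSTM, the initialisation scheme, and the dimensions are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpRNNCell(nn.Module):
    """Sketch of the Warp-RNN cell (Eq. 20): a meta-LSTM (frozen for most
    of training) emits diagonal warp matrices U that modulate the task
    RNN's adaptable weights W, V, b."""

    def __init__(self, x_dim, h_dim, z_dim):
        super().__init__()
        self.W = nn.Parameter(0.1 * torch.randn(h_dim, x_dim))  # task theta
        self.V = nn.Parameter(0.1 * torch.randn(h_dim, h_dim))
        self.b = nn.Parameter(torch.zeros(h_dim))
        self.meta_lstm = nn.LSTMCell(x_dim, z_dim)              # warp phi
        # Projections P producing the diagonal warps U from z_t.
        dims = (h_dim, h_dim, x_dim, h_dim, h_dim)
        self.P = nn.ModuleList([nn.Linear(z_dim, d, bias=False) for d in dims])

    def forward(self, x, h, z):
        z = self.meta_lstm(x, z)  # z = (hidden, cell) of the meta-LSTM
        u_h1, u_h2, u_x1, u_x2, u_b = [torch.tanh(p(z[0])) for p in self.P]
        # Elementwise products implement the diagonal warp matrices.
        h = torch.tanh(u_h2 * F.linear(u_h1 * h, self.V)
                       + u_x2 * F.linear(u_x1 * x, self.W)
                       + u_b * self.b)
        return h, z
```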
We also compare the non-linear preconditioning obtained in our Warp-RNN to the linear forms of preconditioning defined in prior works. Specifically, we implement a T-Nets-RNN meta-learner, defined by embedding meta-learned linear projections T_h, T_x and T_b in the task RNN, h_t = tanh(T_h V h_{t−1} + T_x W x_t + T_b b). Note that we cannot backpropagate to these meta-parameters as per the T-Nets (MAML) framework; instead, we train T_h, T_x, T_b with the meta-objective and meta-training algorithm we use for the Warp-RNN. The T-Nets-RNN does worse than the baseline RNN and generally fails to learn.

We meta-train the Warp-RNN using the continual meta-training algorithm (Algorithm 3; see Appendix B for details), which accumulates meta-gradients continuously during training. Because task training is a continuous stream of batches of episodes, we accumulate the meta-gradient using the approximate objective (Eq. 12, where L^τ_task and L^τ_meta are both the advantage actor-critic objective) and update warp-parameters on every 30th task parameter update. We detail the meta-objective in Appendix C (see Eq. 18). Our implementation of the Warp-RNN can be seen as meta-learning slow weights to facilitate learning of fast weights (Schmidhuber, 1992; Mujika et al., 2017). Implementing the Warp-RNN requires four lines of code on top of the standard training script. The task-learner is the same in all experiments, with the same number of learnable parameters and hidden-state size. Compared to all baselines, we find that the Warp-RNN converges faster and achieves a higher cumulative reward (Figure 4 and Figure 11).

Figure 11: Mean cumulative return on the maze navigation task over 200 000 training steps, comparing the Warp-RNN to the HyperRNN, the Hebb-RNNs (simple and retroactive modulation, respectively; Miconi et al., 2019), the T-Nets-RNN, and the baseline RNN. Shading represents inter-quartile ranges across 10 independent runs.

J META-LEARNING FOR CONTINUAL LEARNING

Online SGD and related optimisation methods tend to adapt neural network models to the data distribution encountered last during training, usually leading to what has been termed catastrophic forgetting (French, 1999). In this experiment, we investigate whether Warp Grad optimisers can meta-learn to avoid this problem altogether and directly minimise the joint objective over all tasks with every update, in the fully online learning setting where no past data is retained.

Continual Sine Regression We propose a continual learning version of the sine regression meta-learning experiment in Finn et al. (2017). We split the input interval [−5, 5] ⊂ R evenly into 5 consecutive sub-intervals, corresponding to 5 regression sub-tasks. These are presented one at a time to a task-learner, which adapts to each sub-task using 20 gradient steps on data from the given sub-task only. Batch sizes were set to 5 samples. Sub-tasks thus differ in their input domain. A task sequence is defined by a target function composed of two randomly mixed sine functions of the form f_{a_i,b_i}(x) = a_i sin(x − b_i), each with randomly sampled amplitude a_i ∈ [0.1, 5] and phase b_i ∈ [0, π]. A task τ = (a_1, b_1, a_2, b_2, o) is therefore defined by sampling the parameters that specify this mixture; a task specifies a target function g^τ by

g^τ(x) = α_o(x) f_{a_1,b_1}(x) + (1 − α_o(x)) f_{a_2,b_2}(x),    (21)

where α_o(x) = σ(x + o) for a randomly sampled offset o ∈ [−5, 5], with σ the sigmoid activation function.
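A NumPy sketch of this task distribution (Eq. 21); the helper name and the explicit sub-interval bookkeeping are ours.

```python
import numpy as np

def sample_task_sequence():
    """Sample tau = (a1, b1, a2, b2, o) and return the target g (Eq. 21)
    together with the 5 sub-task input intervals."""
    a = np.random.uniform(0.1, 5.0, size=2)    # amplitudes a_i
    b = np.random.uniform(0.0, np.pi, size=2)  # phases b_i
    o = np.random.uniform(-5.0, 5.0)           # mixing offset o

    def g(x):
        alpha = 1.0 / (1.0 + np.exp(-(x + o)))  # sigma(x + o)
        return (alpha * a[0] * np.sin(x - b[0])
                + (1.0 - alpha) * a[1] * np.sin(x - b[1]))

    # Sub-task t draws inputs from the t-th sub-interval of [-5, 5].
    intervals = [(-5 + 2 * t, -3 + 2 * t) for t in range(5)]
    return g, intervals
```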
Model We define the task-learner as a 4-layer feed-forward network with hidden-layer size 200 and ReLU non-linearities that learns the mapping between inputs and regression targets, f(·; θ, φ). For each task sequence τ, the task-learner is initialised from a fixed random initialisation θ_0 (that is not meta-learned). Each non-linearity is followed by a residual warping block consisting of a 2-layer feed-forward network with 100 hidden units and tanh non-linearities, with meta-learned parameters φ that are fixed during the task adaptation process.

Continual learning as task adaptation The task target function g^τ is partitioned into 5 sets of sub-tasks. The task-learner sees one partition at a time and is given n = 20 gradient steps to adapt, for a total of K = 100 steps of online gradient descent for the full task sequence; recall that every such sequence starts from the fixed random initialisation θ_0. The adaptation is completely online: at each step k = 1, …, K we sample a new mini-batch D^k_task of 5 samples from a single sub-task (sub-interval). The data distribution changes after every n = 20 steps, with inputs x coming from the next sub-interval and targets from the same function g^τ(x). During meta-training we always present sub-tasks in the same order, presenting intervals from left to right. The online (sub-)task loss is defined on the current mini-batch D^k_task at step k:

L^τ_task(θ^τ_k, D^k_task; φ) = 1/(2|D^k_task|) ∑_{x ∈ D^k_task} ( f(x; θ^τ_k, φ) − g^τ(x) )².    (22)

Adaptation to each sub-task uses sub-task data only to form task parameter updates θ^τ_{k+1} ← θ^τ_k − α ∇L^τ_task(θ^τ_k, D^k_task; φ). We used a constant learning rate α = 0.001. Warp-parameters φ are fixed across the full task sequence during adaptation and are meta-learned across random samples of task sequences, which we describe next.

Meta-learning an optimiser for continual learning To investigate the ability of Warp Grad to learn an optimiser for continual learning that mitigates catastrophic forgetting, we fix a random initialisation prior to meta-training that is not meta-learned; every task-learner is initialised with these parameters. To meta-learn an optimiser for continual learning, we need a meta-objective that encourages such behaviour. Here, we take a first step towards a framework for meta-learned continual learning. We define the meta-objective L^τ_meta as an incremental multitask objective that, for each sub-task τ_t in a given task sequence τ, averages the validation sub-task losses (Eq. 22) of the current and every preceding sub-task in the sequence. The task meta-objective is defined by summing over all sub-tasks in the task sequence. For some sub-task parameterisation θ^{τ_t}, we have

L^τ_meta(θ^{τ_t}; φ) = ∑_{i=1}^{t} 1/(n(T − i + 1)) L^τ_task(θ^{τ_t}, D^i_val; φ).    (23)

As before, the full meta-objective is an expectation over the joint task parameter distribution (Eq. 11); for further details on the meta-objective, see Appendix C, Eq. 19. This meta-objective gives equal weight to all sub-tasks in the sequence by averaging the regression step loss over all steps at which a sub-task should be learned or remembered. For example, losses from the first sub-task, defined on the interval [−5, −3], appear nT times in the meta-objective; conversely, the last sub-task in a sequence, defined on the interval [3, 5], is learned only in the last n = 20 steps of task adaptation, and hence appears n times in the meta-objective. Normalising by the number of appearances corrects for this bias.
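A sketch of the resulting online adaptation loop for one task sequence; sgd_step is a hypothetical placeholder for one gradient step on the sub-task loss of Eq. 22.

```python
import numpy as np

def adapt_continually(theta, warp, g, intervals, n=20, batch=5, alpha=1e-3):
    """Online adaptation over one task sequence (Appendix J): n = 20
    gradient steps on each of the 5 sub-tasks, K = 100 steps in total,
    with warp-parameters phi held fixed throughout."""
    trajectory = []
    for t, (lo, hi) in enumerate(intervals):    # sub-tasks, in order
        for _ in range(n):
            x = np.random.uniform(lo, hi, size=batch)
            theta = sgd_step(theta, warp, x, g(x), lr=alpha)
            trajectory.append((t, theta))       # sampled when forming Eq. 23
    return trajectory
```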
We trained warp-parameters using Adam with a meta-learning rate of 0.001, sampling 5 random task sequences to form a meta-batch and repeating the process for 20 000 steps of meta-training.

Results Figure 12 shows a breakdown of the validation loss across the 5 sequentially learned sub-tasks over the 100 steps of online learning during task adaptation. Results are averaged over 100 random regression problem instances. The meta-learned Warp Grad optimiser reduces the loss of the sub-task currently being learned in each interval while largely retaining performance on previous sub-tasks. There is an immediate, relatively minor loss of performance, after which performance on previous sub-tasks is retained. We hypothesise that this is because the meta-objective averages over the full learning curve, as opposed to only the performance once a sub-task has been adapted to; as such, the Warp Grad optimiser may allow for some degree of performance loss. Intriguingly, in all cases, after an initial spike in previous sub-task losses when switching to a new sub-task, the losses revert part of the way back towards optimal performance, suggesting that the Warp Grad optimiser facilitates positive backward transfer without this being explicitly enforced in the meta-objective. Deriving a principled meta-objective for continual learning is an exciting area for future research.

Figure 12: Continual learning regression experiment. Average log-loss over 100 randomly sampled tasks. Each task contains 5 sub-tasks that are learned (a) sequentially, in the order seen during meta-training, or (b) in random order [sub-task 1, 3, 4, 2, 0]. We train on each sub-task for 20 steps, for a total of K = 100 task adaptation steps.

Figure 13: Continual learning regression: evaluation after partial task adaptation. We plot the ground truth (black), task-learner predictions before adaptation (dashed green) and task-learner predictions after adaptation (red). Each row illustrates how task-learner predictions evolve after training on sub-tasks up to and including the current one. (a) Sub-tasks are presented in the order seen during meta-training; (b) sub-tasks are presented in random order at meta-test time, in sub-task order [1, 3, 4, 2, 0].