# Meta-Learning with Implicit Gradients

Aravind Rajeswaran (University of Washington, aravraj@cs.washington.edu) · Chelsea Finn (University of California Berkeley, cbfinn@cs.stanford.edu) · Sham M. Kakade (University of Washington, sham@cs.washington.edu) · Sergey Levine (University of California Berkeley, svlevine@eecs.berkeley.edu)

**Abstract.** A core capability of intelligent systems is the ability to quickly learn new tasks by drawing on prior experience. Gradient (or optimization) based meta-learning has recently emerged as an effective approach for few-shot learning. In this formulation, meta-parameters are learned in the outer loop, while task-specific models are learned in the inner loop, using only a small amount of data from the current task. A key challenge in scaling these approaches is the need to differentiate through the inner-loop learning process, which can impose considerable computational and memory burdens. By drawing upon implicit differentiation, we develop the implicit MAML algorithm, which depends only on the solution to the inner-level optimization and not the path taken by the inner-loop optimizer. This effectively decouples the meta-gradient computation from the choice of inner-loop optimizer. As a result, our approach is agnostic to the choice of inner-loop optimizer and can gracefully handle many gradient steps without vanishing gradients or memory constraints. Theoretically, we prove that implicit MAML can compute accurate meta-gradients with a memory footprint no more than that required to compute a single inner-loop gradient, and at no overall increase in the total computational cost. Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.

## 1 Introduction

A core aspect of intelligence is the ability to quickly learn new tasks by drawing upon prior experience from related tasks. Recent work has studied how meta-learning algorithms [51, 55, 41] can acquire such a capability by learning to efficiently learn a range of tasks, thereby enabling learning of a new task with as little as a single example [50, 57, 15]. Meta-learning algorithms can be framed in terms of recurrent [25, 50, 48] or attention-based [57, 38] models that are trained via a meta-learning objective, to essentially encapsulate the learned learning procedure in the parameters of a neural network. An alternative formulation is to frame meta-learning as a bi-level optimization procedure [35, 15], where the inner optimization represents adaptation to a given task, and the outer objective is the meta-training objective. Such a formulation can be used to learn the initial parameters of a model such that optimizing from this initialization leads to fast adaptation and generalization. In this work, we focus on this class of optimization-based methods, and in particular the model-agnostic meta-learning (MAML) formulation [15]. MAML has been shown to be as expressive as black-box approaches [14], is applicable to a broad range of settings [16, 37, 1, 18], and recovers a convergent and consistent optimization procedure [13].

Equal contributions. Project page: http://sites.google.com/view/imaml

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: To compute the meta-gradient $d_\theta F(\theta)$, the MAML algorithm differentiates through the optimization path, as shown in green, while first-order MAML computes the meta-gradient by approximating $\frac{d\phi_i}{d\theta}$ as $I$.
Our implicit MAML approach derives an analytic expression for the exact meta-gradient, without differentiating through the optimization path, by estimating local curvature.

Despite its appealing properties, meta-learning an initialization requires backpropagation through the inner optimization process. As a result, the meta-learning process requires higher-order derivatives, imposes a non-trivial computational and memory burden, and can suffer from vanishing gradients. These limitations make it harder to scale optimization-based meta-learning methods to tasks involving medium or large datasets, or those that require many inner-loop optimization steps. Our goal is to develop an algorithm that addresses these limitations.

The main contribution of our work is the development of the implicit MAML (iMAML) algorithm, an approach for optimization-based meta-learning with deep neural networks that removes the need for differentiating through the optimization path. Our algorithm aims to learn a set of parameters such that an optimization algorithm that is initialized at and regularized to this parameter vector leads to good generalization for a variety of learning tasks. By leveraging implicit differentiation, we derive an analytical expression for the meta (or outer-level) gradient that depends only on the solution to the inner optimization and not on the path taken by the inner optimization algorithm, as depicted in Figure 1. This decoupling of the meta-gradient computation from the choice of inner-level optimizer has a number of appealing properties. First, the inner optimization path need not be stored nor differentiated through, thereby making iMAML memory efficient and scalable to a large number of inner optimization steps. Second, iMAML is agnostic to the inner optimization method used, as long as it can find an approximate solution to the inner-level optimization problem. This permits the use of higher-order methods, and in principle even non-differentiable optimization methods or components like sample-based optimization, line search, or those provided by proprietary software (e.g., Gurobi). Finally, we also provide the first (to our knowledge) non-asymptotic theoretical analysis of bi-level optimization. We show that an $\epsilon$-approximate meta-gradient can be computed via iMAML using $O(\log(1/\epsilon))$ gradient evaluations and $O(1)$ memory, meaning the memory required does not grow with the number of gradient steps.

## 2 Problem Formulation and Notations

We first present the meta-learning problem in the context of few-shot supervised learning, and then generalize the notation to aid the rest of the exposition in the paper.

### 2.1 Review of Few-Shot Supervised Learning and MAML

In this setting, we have a collection of meta-training tasks $\{\mathcal{T}_i\}_{i=1}^{M}$ drawn from $P(\mathcal{T})$. Each task $\mathcal{T}_i$ is associated with a dataset $\mathcal{D}_i$, from which we can sample two disjoint sets: $\mathcal{D}_i^{tr}$ and $\mathcal{D}_i^{test}$. These datasets each consist of $K$ input-output pairs. Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ denote inputs and outputs, respectively. The datasets take the form $\mathcal{D}_i^{tr} = \{(x_i^k, y_i^k)\}_{k=1}^{K}$, and similarly for $\mathcal{D}_i^{test}$. We are interested in learning models of the form $h_\phi(x): \mathcal{X} \to \mathcal{Y}$, parameterized by $\phi \in \Phi \subseteq \mathbb{R}^d$. Performance on a task is specified by a loss function, such as the cross-entropy or squared-error loss. We write the loss function in the form $\mathcal{L}(\phi, \mathcal{D})$, as a function of a parameter vector and a dataset.
The goal for task $\mathcal{T}_i$ is to learn task-specific parameters $\phi_i$ using $\mathcal{D}_i^{tr}$ such that we minimize the population or test loss of the task, $\mathcal{L}(\phi_i, \mathcal{D}_i^{test})$. In the general bi-level meta-learning setup, we consider a space of algorithms that compute task-specific parameters using a set of meta-parameters $\theta \in \Theta \subseteq \mathbb{R}^d$ and the training dataset from the task, such that $\phi_i = \mathcal{A}lg(\theta, \mathcal{D}_i^{tr})$ for task $\mathcal{T}_i$. The goal of meta-learning is to learn meta-parameters that produce good task-specific parameters after adaptation, as specified below:

$$\theta^*_{ML} := \underbrace{\operatorname*{argmin}_{\theta \in \Theta} F(\theta)}_{\text{outer level}}, \quad \text{where} \quad F(\theta) = \frac{1}{M} \sum_{i=1}^{M} \mathcal{L}\big(\underbrace{\mathcal{A}lg(\theta, \mathcal{D}_i^{tr})}_{\text{inner level}}, \ \mathcal{D}_i^{test}\big) \tag{1}$$

We view this as a bi-level optimization problem since we typically interpret $\mathcal{A}lg(\theta, \mathcal{D}_i^{tr})$ as either explicitly or implicitly solving an underlying optimization problem. At meta-test (deployment) time, when presented with a dataset $\mathcal{D}_j^{tr}$ corresponding to a new task $\mathcal{T}_j \sim P(\mathcal{T})$, we can achieve good generalization performance (i.e., low test error) by using the adaptation procedure with the meta-learned parameters, as $\phi_j = \mathcal{A}lg(\theta^*_{ML}, \mathcal{D}_j^{tr})$. In the case of MAML [15], $\mathcal{A}lg(\theta, \mathcal{D})$ corresponds to one or multiple steps of gradient descent initialized at $\theta$. For example, if one step of gradient descent is used, we have:

$$\phi_i = \mathcal{A}lg(\theta, \mathcal{D}_i^{tr}) = \theta - \alpha \, \nabla_\theta \mathcal{L}(\theta, \mathcal{D}_i^{tr}). \quad \text{(inner level of MAML)} \tag{2}$$

Typically, $\alpha$ is a scalar hyperparameter, but it can also be a learned vector [34]. Hence, for MAML, the meta-learned parameter ($\theta^*_{ML}$) carries a learned inductive bias that is particularly well-suited for fine-tuning on tasks from $P(\mathcal{T})$ using $K$ samples. To solve the outer-level problem with gradient-based methods, we require a way to differentiate through $\mathcal{A}lg$. In the case of MAML, this corresponds to backpropagating through the dynamics of gradient descent.

### 2.2 Proximal Regularization in the Inner Level

To learn sufficiently in the inner level while also avoiding over-fitting, $\mathcal{A}lg$ needs to incorporate some form of regularization. Since MAML uses a small number of gradient steps, this corresponds to early stopping, which can be interpreted as a form of regularization and Bayesian prior [20]. In cases like ill-conditioned optimization landscapes and medium-shot learning, we may want to take many gradient steps, which poses two challenges for MAML. First, we need to store and differentiate through the long optimization path of $\mathcal{A}lg$, which imposes a considerable computation and memory burden. Second, the dependence of the model parameters $\{\phi_i\}$ on the meta-parameters ($\theta$) shrinks and vanishes as the number of gradient steps in $\mathcal{A}lg$ grows, making meta-learning difficult. To overcome these limitations, we consider a more explicitly regularized algorithm:

$$\mathcal{A}lg^\star(\theta, \mathcal{D}_i^{tr}) = \operatorname*{argmin}_{\phi' \in \Phi} \ \mathcal{L}(\phi', \mathcal{D}_i^{tr}) + \frac{\lambda}{2} \|\phi' - \theta\|^2. \tag{3}$$

The proximal regularization term in Eq. 3 encourages $\phi_i$ to remain close to $\theta$, thereby retaining a strong dependence throughout. The regularization strength ($\lambda$) plays a role similar to the learning rate ($\alpha$) in MAML, controlling the strength of the prior ($\theta$) relative to the data ($\mathcal{D}_i^{tr}$). Like $\alpha$, the regularization strength $\lambda$ may also be learned. Furthermore, both $\alpha$ and $\lambda$ can be scalars, vectors, or full matrices. For simplicity, we treat $\lambda$ as a scalar hyperparameter. In Eq. 3, we use $\star$ to denote that the optimization problem is solved exactly. In practice, we use iterative algorithms (denoted by $\mathcal{A}lg$) for finitely many iterations, which return approximate minimizers. We explicitly consider the discrepancy between approximate and exact solutions in our analysis.
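To make Eq. 3 concrete, the following is a minimal PyTorch sketch (ours, not the paper's released code) of an inner loop that approximately solves the proximally regularized problem with plain gradient descent; the flattened parameter vector, the `train_loss_fn` callable, and the default step counts are illustrative assumptions.

```python
import torch

def inner_solve(theta, train_loss_fn, lam=2.0, steps=16, lr=0.1):
    """Approximately solve the proximally regularized inner problem (Eq. 3):
        argmin_phi  L_hat(phi) + (lam / 2) * ||phi - theta||^2
    with plain gradient descent, returning an approximate minimizer."""
    phi = theta.detach().clone().requires_grad_(True)
    for _ in range(steps):
        # Regularized inner objective; the proximal term keeps phi tied to theta.
        obj = train_loss_fn(phi) + 0.5 * lam * (phi - theta.detach()).pow(2).sum()
        grad = torch.autograd.grad(obj, phi)[0]
        with torch.no_grad():
            phi -= lr * grad  # any inner optimizer could be substituted here
    return phi.detach()
```

Because of the proximal term, the returned solution retains a strong dependence on $\theta$ no matter how many inner steps are taken, which is exactly what the implicit differentiation in Section 3 exploits.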
### 2.3 The Bi-Level Optimization Problem

For notational convenience, we will sometimes express the dependence on task $\mathcal{T}_i$ using a subscript instead of arguments, e.g. we write:

$$\mathcal{L}_i(\phi) := \mathcal{L}(\phi, \mathcal{D}_i^{test}), \quad \hat{\mathcal{L}}_i(\phi) := \mathcal{L}(\phi, \mathcal{D}_i^{tr}), \quad \mathcal{A}lg_i(\theta) := \mathcal{A}lg(\theta, \mathcal{D}_i^{tr}).$$

With this notation, the bi-level meta-learning problem can be written more generally as:

$$\theta^*_{ML} := \operatorname*{argmin}_{\theta \in \Theta} F(\theta), \quad \text{where} \quad F(\theta) = \frac{1}{M} \sum_{i=1}^{M} \mathcal{L}_i\big(\mathcal{A}lg_i^\star(\theta)\big), \tag{4}$$
$$\text{and} \quad \mathcal{A}lg_i^\star(\theta) := \operatorname*{argmin}_{\phi' \in \Phi} G_i(\phi', \theta), \quad \text{where} \quad G_i(\phi', \theta) = \hat{\mathcal{L}}_i(\phi') + \frac{\lambda}{2} \|\phi' - \theta\|^2.$$

### 2.4 Total and Partial Derivatives

We use $d$ to denote the total derivative and $\nabla$ to denote the partial derivative. For a nested function of the form $\mathcal{L}_i(\phi_i)$ where $\phi_i = \mathcal{A}lg_i(\theta)$, the chain rule gives

$$d_\theta \mathcal{L}_i(\mathcal{A}lg_i(\theta)) = \frac{d\,\mathcal{A}lg_i(\theta)}{d\theta} \, \nabla_\phi \mathcal{L}_i(\phi) \Big|_{\phi = \mathcal{A}lg_i(\theta)} = \frac{d\,\mathcal{A}lg_i(\theta)}{d\theta} \, \nabla_\phi \mathcal{L}_i(\mathcal{A}lg_i(\theta)).$$

Note the important distinction between $d_\theta \mathcal{L}_i(\mathcal{A}lg_i(\theta))$ and $\nabla_\phi \mathcal{L}_i(\mathcal{A}lg_i(\theta))$. The former passes derivatives through $\mathcal{A}lg_i(\theta)$ while the latter does not: $\nabla_\phi \mathcal{L}_i(\mathcal{A}lg_i(\theta))$ is simply the gradient function $\nabla_\phi \mathcal{L}_i(\phi)$ evaluated at $\phi = \mathcal{A}lg_i(\theta)$. Also note that $d_\theta \mathcal{L}_i(\mathcal{A}lg_i(\theta))$ and $\nabla_\phi \mathcal{L}_i(\mathcal{A}lg_i(\theta))$ are $d$-dimensional vectors, while $\frac{d\,\mathcal{A}lg_i(\theta)}{d\theta}$ is a $(d \times d)$ Jacobian matrix. Throughout this text, we use $d_\theta$ and $\frac{d}{d\theta}$ interchangeably.

## 3 The Implicit MAML Algorithm

Our aim is to solve the bi-level meta-learning problem in Eq. 4 using an iterative gradient-based algorithm of the form $\theta \leftarrow \theta - \eta \, d_\theta F(\theta)$. Although we derive our method based on standard gradient descent for simplicity, any other optimization method, such as quasi-Newton or Newton methods, Adam [28], or gradient descent with momentum, can be used without modification. The gradient descent update can be expanded using the chain rule as

$$\theta \leftarrow \theta - \eta \, \frac{1}{M} \sum_{i=1}^{M} \frac{d\,\mathcal{A}lg_i^\star(\theta)}{d\theta} \, \nabla_\phi \mathcal{L}_i(\mathcal{A}lg_i^\star(\theta)). \tag{5}$$

Here, $\nabla_\phi \mathcal{L}_i(\mathcal{A}lg_i^\star(\theta))$ is simply $\nabla_\phi \mathcal{L}_i(\phi)\,|_{\phi = \mathcal{A}lg_i^\star(\theta)}$, which can be easily obtained in practice via automatic differentiation. For this update rule, we must compute $\frac{d\,\mathcal{A}lg_i^\star(\theta)}{d\theta}$, where $\mathcal{A}lg_i^\star$ is implicitly defined by an optimization problem (Eq. 4); this presents the primary challenge. We now present an algorithm that is efficient in compute and memory for computing the meta-gradient.

### 3.1 Meta-Gradient Computation

If $\mathcal{A}lg_i^\star(\theta)$ is implemented as an iterative algorithm, such as gradient descent, then one way to compute $\frac{d\,\mathcal{A}lg_i^\star(\theta)}{d\theta}$ is to propagate derivatives through the iterative process, either in forward mode or reverse mode. However, this has the drawback of depending explicitly on the path of the optimization, which has to be fully stored in memory, quickly becoming intractable when the number of gradient steps needed is large. Furthermore, for second-order optimization methods, such as Newton's method, third derivatives are needed, which are difficult to obtain. This approach also becomes impossible when non-differentiable operations, such as line searches, are used. However, by recognizing that $\mathcal{A}lg_i^\star$ is implicitly defined as the solution to an optimization problem, we may employ a different strategy that does not need to consider the path of the optimization, but only its final result. This is derived in the following lemma.

**Lemma 1.** (Implicit Jacobian) Consider $\mathcal{A}lg_i^\star(\theta)$ as defined in Eq. 4 for task $\mathcal{T}_i$, and let $\phi_i = \mathcal{A}lg_i^\star(\theta)$ be its result. If $\left(I + \frac{1}{\lambda} \nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right)$ is invertible, then the derivative Jacobian is

$$\frac{d\,\mathcal{A}lg_i^\star(\theta)}{d\theta} = \left(I + \frac{1}{\lambda} \nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right)^{-1}. \tag{6}$$

Note that the derivative (Jacobian) depends only on the final result of the algorithm, and not on the path taken by the algorithm. Thus, in principle, any algorithm can be used to compute $\mathcal{A}lg_i^\star(\theta)$, thereby decoupling the meta-gradient computation from the choice of inner-level optimizer.
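Lemma 1 is easy to check numerically on a small problem where $\mathcal{A}lg^\star$ is available in closed form. The sketch below is our illustration (the quadratic $\hat{\mathcal{L}}$ and all variable names are assumptions, not from the paper): it compares the Jacobian obtained by automatic differentiation through the exact solution map against the expression in Eq. 6.

```python
import torch

torch.manual_seed(0)
d, lam = 5, 2.0
A = torch.randn(d, d)
H = A @ A.T + torch.eye(d)   # Hessian of a strongly convex quadratic L_hat
b = torch.randn(d)           # L_hat(phi) = 0.5 phi^T H phi - b^T phi

def alg_star(theta):
    # Closed-form solution of Eq. 4: (H + lam I) phi = b + lam theta
    return torch.linalg.solve(H + lam * torch.eye(d), b + lam * theta)

theta = torch.randn(d)
J_autodiff = torch.autograd.functional.jacobian(alg_star, theta)  # d Alg*/d theta
J_implicit = torch.linalg.inv(torch.eye(d) + H / lam)             # Eq. 6
print(torch.allclose(J_autodiff, J_implicit, atol=1e-5))          # -> True
```

Here $\mathcal{A}lg^\star(\theta) = (H + \lambda I)^{-1}(b + \lambda\theta)$, so its Jacobian is $\lambda (H + \lambda I)^{-1} = (I + H/\lambda)^{-1}$, matching Eq. 6.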
**Practical Algorithm:** While Lemma 1 provides an idealized way to compute the $\mathcal{A}lg_i^\star$ Jacobians, and thus by extension the meta-gradient, it may be difficult to use directly in practice. Two issues are particularly relevant. First, the meta-gradient requires computation of $\mathcal{A}lg_i^\star(\theta)$, the exact solution to the inner optimization problem; in practice, we may be able to obtain only approximate solutions. Second, explicitly forming and inverting the matrix in Eq. 6 to compute the Jacobian may be intractable for large deep neural networks. To address these difficulties, we consider approximations to the idealized approach that enable a practical algorithm.

**Algorithm 1: Implicit Model-Agnostic Meta-Learning (iMAML)**

1. **Require:** distribution over tasks $P(\mathcal{T})$, outer step size $\eta$, regularization strength $\lambda$
2. **while** not converged **do**
3. &nbsp;&nbsp;Sample a mini-batch of tasks $\{\mathcal{T}_i\}_{i=1}^{B} \sim P(\mathcal{T})$
4. &nbsp;&nbsp;**for** each task $\mathcal{T}_i$ **do**
5. &nbsp;&nbsp;&nbsp;&nbsp;Compute the task meta-gradient $g_i = \text{Implicit-Meta-Gradient}(\mathcal{T}_i, \theta, \lambda)$
6. &nbsp;&nbsp;**end for**
7. &nbsp;&nbsp;Average the above gradients to get $\hat{\nabla} F(\theta) = (1/B) \sum_{i=1}^{B} g_i$
8. &nbsp;&nbsp;Update meta-parameters with gradient descent: $\theta \leftarrow \theta - \eta \, \hat{\nabla} F(\theta)$ (or Adam)
9. **end while**

**Algorithm 2: Implicit Meta-Gradient Computation**

1. **Input:** task $\mathcal{T}_i$, meta-parameters $\theta$, regularization strength $\lambda$
2. **Hyperparameters:** optimization accuracy thresholds $\delta$ and $\delta'$
3. Obtain task parameters $\phi_i$ using an iterative optimization solver such that $\|\phi_i - \mathcal{A}lg_i^\star(\theta)\| \le \delta$
4. Compute the partial outer-level gradient $v_i = \nabla_\phi \mathcal{L}_i(\phi_i)$
5. Use an iterative solver (e.g., CG), along with reverse-mode differentiation to compute Hessian-vector products, to compute $g_i$ such that $\left\|g_i - \left(I + \frac{1}{\lambda} \nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right)^{-1} v_i\right\| \le \delta'$
6. **Return:** $g_i$

First, we consider an approximate solution to the inner optimization problem, which can be obtained with iterative optimization algorithms like gradient descent.

**Definition 1.** ($\delta$-approximate algorithm) Let $\mathcal{A}lg_i(\theta)$ be a $\delta$-accurate approximation of $\mathcal{A}lg_i^\star(\theta)$, i.e.
$$\|\mathcal{A}lg_i(\theta) - \mathcal{A}lg_i^\star(\theta)\| \le \delta.$$

Second, we perform a partial or approximate matrix inversion, given by:

**Definition 2.** ($\delta'$-approximate Jacobian-vector product) Let $g_i$ be a vector such that
$$\left\| g_i - \left(I + \frac{1}{\lambda} \nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right)^{-1} \nabla_\phi \mathcal{L}_i(\phi_i) \right\| \le \delta',$$
where $\phi_i = \mathcal{A}lg_i(\theta)$ and $\mathcal{A}lg_i$ is based on Definition 1.

Note that $g_i$ in Definition 2 is an approximation of the meta-gradient for task $\mathcal{T}_i$. Observe that $g_i$ can be obtained as an approximate solution to the optimization problem:

$$\min_w \ \frac{1}{2} w^\top \left(I + \frac{1}{\lambda} \nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right) w - w^\top \nabla_\phi \mathcal{L}_i(\phi_i). \tag{7}$$

The conjugate gradient (CG) algorithm is particularly well suited to this problem due to its excellent iteration complexity and its requirement of only Hessian-vector products of the form $\nabla^2 \hat{\mathcal{L}}_i(\phi_i) v$. Such Hessian-vector products can be obtained cheaply without explicitly forming or storing the Hessian matrix (as we discuss in Appendix C). This CG-based inversion has been successfully deployed in Hessian-free or Newton-CG methods for deep learning [36, 44] and in trust-region methods in reinforcement learning [52, 47]. Algorithm 1 presents the full practical algorithm. Note that the approximations used to develop a practical algorithm introduce errors into the meta-gradient computation. We analyze the impact of these errors in Section 3.2 and show that they are controllable. See Appendix A for how iMAML generalizes prior gradient-based meta-learning algorithms.
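To make Algorithm 2 concrete, here is a minimal PyTorch sketch (ours, not the authors' released implementation) of the CG-based meta-gradient computation, with Hessian-vector products obtained by double backward; the flat parameter vector, the `train_loss_fn`/`test_loss_fn` callables, and the fixed iteration count are illustrative assumptions.

```python
import torch

def implicit_meta_gradient(phi, train_loss_fn, test_loss_fn, lam=2.0, cg_steps=5):
    """Sketch of Algorithm 2: solve (I + H/lam) g = v_i by conjugate gradient,
    where H is the Hessian of the inner (training) loss at phi and v_i is the
    gradient of the outer (test) loss at phi (steps 4-5 of Algorithm 2)."""
    phi = phi.detach().requires_grad_(True)
    v = torch.autograd.grad(test_loss_fn(phi), phi)[0].detach()           # v_i
    g_tr = torch.autograd.grad(train_loss_fn(phi), phi, create_graph=True)[0]

    def Avp(w):
        # Matrix-vector product (I + H/lam) w; H w is computed by differentiating
        # (grad . w), so the Hessian is never formed or stored explicitly.
        Hw = torch.autograd.grad(g_tr @ w, phi, retain_graph=True)[0]
        return w + Hw / lam

    g = torch.zeros_like(v)        # CG for the quadratic problem in Eq. 7
    r, p = v.clone(), v.clone()    # initial residual r = v - Avp(0) = v
    for _ in range(cg_steps):
        Ap = Avp(p)
        alpha = (r @ r) / (p @ Ap)
        g = g + alpha * p
        r_new = r - alpha * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return g.detach()              # approximate task meta-gradient g_i
```

A practical implementation would typically pass in the $\phi_i$ returned by the inner solver (step 3 of Algorithm 2) and might add damping or warm-start CG across outer iterations; the sketch keeps only the essentials.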
### 3.2 Theoretical Analysis

In Section 3.1, we outlined a practical algorithm that makes approximations to the idealized update rule of Eq. 5. Here, we analyze the impact of these approximations, and also examine the computation and memory requirements of iMAML. We find that iMAML can match the minimax computational complexity of backpropagating through the path of the inner optimizer, but is substantially better in terms of memory usage. To our knowledge, this work also provides the first non-asymptotic result analyzing the approximation error due to implicit gradients.

Table 1: Compute and memory for computing the meta-gradient when using a $\delta$-accurate $\mathcal{A}lg_i$, and the corresponding approximation error. Compute time is measured in terms of the number of $\nabla \hat{\mathcal{L}}_i$ computations. All results are in $O(\cdot)$ notation, which hides additional log factors; the error bound hides additional problem-dependent Lipschitz and smoothness parameters (see the respective theorem statements). $\kappa \ge 1$ is the condition number of the inner objective $G_i$ (see Eq. 4), and $D$ is the diameter of the search space. The notions of error are subtly different: we assume all methods solve the inner optimization to an error level of $\delta$ (as per Definition 1). For our algorithm, the error refers to the $\ell_2$ error in the computation of $d_\theta \mathcal{L}_i(\mathcal{A}lg_i^\star(\theta))$. For the other algorithms, the error refers to the $\ell_2$ error in the computation of $d_\theta \mathcal{L}_i(\mathcal{A}lg_i(\theta))$. We use Prop. 3.1 of Shaban et al. [53] for the guarantee on truncated back-propagation. See Appendix D for additional discussion.

| Algorithm | Compute | Memory | Error |
| --- | --- | --- | --- |
| MAML (GD + full back-prop) | $\kappa \log\frac{D}{\delta}$ | $\kappa \log\frac{D}{\delta} \cdot \text{Mem}(\nabla\hat{\mathcal{L}}_i)$ | $\delta$ |
| MAML (Nesterov's AGD + full back-prop) | $\sqrt{\kappa} \log\frac{D}{\delta}$ | $\sqrt{\kappa} \log\frac{D}{\delta} \cdot \text{Mem}(\nabla\hat{\mathcal{L}}_i)$ | $\delta$ |
| Truncated back-prop [53] (GD) | $\kappa \log\frac{D}{\delta}$ | $\kappa \log\frac{1}{\delta} \cdot \text{Mem}(\nabla\hat{\mathcal{L}}_i)$ | $\delta$ |
| Implicit MAML (this work) | $\kappa \log\frac{D}{\delta}$ | $\text{Mem}(\nabla\hat{\mathcal{L}}_i)$ | $\delta$ |

Theorem 1 provides the computational and memory complexity for obtaining an $\epsilon$-approximate meta-gradient. We assume $\mathcal{L}_i$ is smooth, but do not require it to be convex. We assume that $G_i$ in Eq. 4 is strongly convex, which can be ensured by an appropriate choice of $\lambda$. The key to our analysis is a second-order Lipschitz assumption, i.e. $\hat{\mathcal{L}}_i(\cdot)$ has a $\rho$-Lipschitz Hessian. This assumption and setting have received considerable attention in the recent optimization and deep learning literature [26, 42]. Table 1 summarizes our complexity results and compares with MAML and truncated back-propagation [53] through the path of the inner optimizer. We use $\kappa$ to denote the condition number of the inner problem induced by $G_i$ (see Eq. 4), which can be viewed as a measure of the hardness of the inner optimization problem. $\text{Mem}(\nabla\hat{\mathcal{L}}_i)$ is the memory taken to compute a single derivative $\nabla\hat{\mathcal{L}}_i$. Under the assumption that Hessian-vector products are computed with the reverse mode of auto-differentiation, both the compute time and the memory used for computing a Hessian-vector product are within a (universal) constant factor of the compute time and memory used for computing $\nabla\hat{\mathcal{L}}_i$ itself (see Appendix C). This allows us to measure compute time in terms of the number of $\nabla\hat{\mathcal{L}}_i$ computations. We refer readers to Appendix D for additional discussion of the algorithms and their trade-offs. Our main theorem is as follows:

**Theorem 1.** (Informal statement; approximation error in Algorithm 2) Suppose that $\mathcal{L}_i(\cdot)$ is a $B$-Lipschitz and $L$-smooth function; that $G_i(\cdot, \theta)$ (in Eq. 4) is a $\mu$-strongly convex function with condition number $\kappa$; that $D$ is the diameter of the search space for $\phi'$ in the inner optimization problem (i.e., $\|\mathcal{A}lg_i^\star(\theta)\| \le D$); and that $\hat{\mathcal{L}}_i(\cdot)$ has a $\rho$-Lipschitz Hessian. Let $g_i$ be the task meta-gradient returned by Algorithm 2. For any task $i$ and desired accuracy level $\epsilon$, Algorithm 2 computes an approximate task-specific meta-gradient with the following guarantee:
$$\|g_i - d_\theta \mathcal{L}_i(\mathcal{A}lg_i^\star(\theta))\| \le \epsilon.$$
Furthermore, under the assumption that the Hessian-vector products are computed by the reverse mode of auto-differentiation (Assumption 1), Algorithm 2 can be implemented using at most
$$O\!\left(\sqrt{\kappa} \, \log\!\left(\frac{\operatorname{poly}(\kappa, D, B, L, \rho, \mu, \lambda)}{\epsilon}\right)\right)$$
gradient computations of $\nabla\hat{\mathcal{L}}_i(\cdot)$ and $2 \cdot \text{Mem}(\nabla\hat{\mathcal{L}}_i)$ memory.

The formal statement of the theorem and the proof are provided in the appendix. Importantly, the algorithm's memory requirement is equivalent to the memory needed for Hessian-vector products, which is a small constant factor over the memory required for gradient computations, assuming the reverse mode of auto-differentiation is used. Finally, based on the above, we also present Corollary 1 in the appendix, which shows that iMAML efficiently finds a stationary point of $F(\cdot)$, since iMAML has a controllable exact-solve error.

## 4 Experimental Results and Discussion

In our experimental evaluation, we aim to answer the following questions empirically: (1) Does the iMAML algorithm asymptotically compute the exact meta-gradient? (2) With finite iterations, does iMAML approximate the meta-gradient more accurately than MAML? (3) How do the computation and memory requirements of iMAML compare with those of MAML? (4) Does iMAML lead to better results in realistic meta-learning problems? We have answered (1)-(3) through our theoretical analysis, and now attempt to validate the analysis through numerical simulations. For (1) and (2), we use a simple synthetic example for which we can compute the exact meta-gradient and compare against it (exact-solve error; see Definition 3). For (3) and (4), we use the common few-shot image recognition domains of Omniglot and Mini-ImageNet.

To study the question of meta-gradient accuracy, Figure 2 considers a synthetic regression example where the predictions are linear in the parameters. This provides an analytical expression for $\mathcal{A}lg_i^\star$, allowing us to compute the true meta-gradient. We fix gradient descent (GD) to be the inner optimizer for both MAML and iMAML. The problem is constructed so that the condition number ($\kappa$) is large, thereby necessitating many GD steps. We find that both iMAML and MAML asymptotically match the exact meta-gradient, but iMAML computes a better approximation in finite iterations. We observe that with 2 CG iterations, iMAML incurs a small terminal error. This is consistent with our theoretical analysis: the error of Algorithm 2 is dominated by $\delta'$ when only a small number of CG steps are used. However, the terminal error vanishes with just 5 CG steps. The computational cost of 1 CG step is comparable to 1 inner GD step of the MAML algorithm, since both require 1 Hessian-vector product (see Appendix C for discussion). Thus, both the computational cost and the memory of iMAML with 100 inner GD steps are significantly smaller than those of MAML with 100 GD steps.
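For a linear model with a squared-error inner loss, $\mathcal{A}lg_i^\star$, and hence the exact meta-gradient, has a closed form, which is what enables this kind of comparison. Below is a sketch under that assumed setup (the paper's exact synthetic construction may differ, and all names here are our own):

```python
import torch

def exact_meta_gradient(theta, X_tr, y_tr, X_te, y_te, lam):
    """Closed-form meta-gradient for a linear model with squared-error loss.

    Inner problem (Eq. 3):  argmin_phi 0.5 ||X_tr phi - y_tr||^2 + (lam/2) ||phi - theta||^2
    has the closed form     phi* = (X_tr^T X_tr + lam I)^{-1} (X_tr^T y_tr + lam theta),
    so d phi*/d theta = lam (X_tr^T X_tr + lam I)^{-1}, a special case of Eq. 6.
    """
    d = theta.numel()
    A = X_tr.T @ X_tr + lam * torch.eye(d)
    phi_star = torch.linalg.solve(A, X_tr.T @ y_tr + lam * theta)
    v = X_te.T @ (X_te @ phi_star - y_te)   # grad of 0.5 ||X_te phi - y_te||^2 at phi*
    return lam * torch.linalg.solve(A, v)   # (d phi*/d theta)^T v, with A symmetric
```

A poorly conditioned $X_{tr}^\top X_{tr}$ makes $\kappa$ large, which is how a synthetic problem of this kind can be constructed to require many inner GD steps.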
To study (3), we turn to the Omniglot dataset [30], a popular few-shot image recognition domain. Figure 2 presents the compute and memory trade-offs for MAML and iMAML (on 20-way 5-shot Omniglot). Memory for iMAML is based on Hessian-vector products and is independent of the number of GD steps in the inner loop. The memory use is also independent of the number of CG iterations, since the intermediate computations need not be stored in memory. On the other hand, memory for MAML grows linearly with the number of gradient steps, reaching the capacity of a 12 GB GPU in approximately 16 steps.

First-order MAML (FOMAML) does not back-propagate through the optimization process, and thus its computational cost is only that of performing gradient descent, which is needed by all of the algorithms. The computational cost of iMAML is similar to that of FOMAML, plus a constant overhead for CG that depends on the number of CG steps. Note, however, that FOMAML does not compute an accurate meta-gradient, since it ignores the Jacobian. Compared to FOMAML, the compute cost of MAML grows at a faster rate: FOMAML requires only gradient computations, while backpropagating through GD (as done in MAML) requires a Hessian-vector product at each iteration, which is more expensive.

Finally, we study the empirical performance of iMAML on the Omniglot and Mini-ImageNet domains. Following the few-shot learning protocol of prior work [57], we run the iMAML algorithm on each dataset for different numbers of class labels and shots (in the N-way, K-shot setting), and compare two variants of iMAML with published results of the most closely related algorithms: MAML, FOMAML, and Reptile. While these methods are not state-of-the-art on this benchmark, they provide an apples-to-apples comparison for studying the use of implicit gradients in optimization-based meta-learning. For a fair comparison, we use the identical convolutional architecture as these prior works. Note, however, that architecture tuning can lead to better results for all algorithms [27].

Figure 2: Accuracy, computation, and memory trade-offs of iMAML, MAML, and FOMAML. (a) Meta-gradient accuracy level in the synthetic example; computed gradients are compared against the exact meta-gradient per Definition 3. (b) Computation and memory trade-offs with a 4-layer CNN on the 20-way 5-shot Omniglot task. We implemented iMAML in PyTorch, and for an apples-to-apples comparison, we use a PyTorch implementation of MAML from: https://github.com/dragen1860/MAML-Pytorch

Table 2: Omniglot results. MAML results are taken from the original work of Finn et al. [15], and first-order MAML and Reptile results are from Nichol et al. [43]. iMAML with gradient descent (GD) uses 16 and 25 steps for 5-way and 20-way tasks, respectively. iMAML with Hessian-free optimization uses 5 CG steps to compute the search direction and performs a line search to pick the step size. Both versions of iMAML use $\lambda = 2.0$ for regularization, and 5 CG steps to compute the task meta-gradient.

| Algorithm | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot |
| --- | --- | --- | --- | --- |
| MAML [15] | 98.7 ± 0.4% | 99.9 ± 0.1% | 95.8 ± 0.3% | 98.9 ± 0.2% |
| first-order MAML [15] | 98.3 ± 0.5% | 99.2 ± 0.2% | 89.4 ± 0.5% | 97.9 ± 0.1% |
| Reptile [43] | 97.68 ± 0.04% | 99.48 ± 0.06% | 89.43 ± 0.14% | 97.12 ± 0.32% |
| iMAML, GD (ours) | 99.16 ± 0.35% | 99.67 ± 0.12% | 94.46 ± 0.42% | 98.69 ± 0.1% |
| iMAML, Hessian-Free (ours) | 99.50 ± 0.26% | 99.74 ± 0.11% | 96.18 ± 0.36% | 99.14 ± 0.1% |

The first variant of iMAML we consider solves the inner-level problem (the regularized objective function in Eq. 4) using gradient descent. The meta-gradient is computed using conjugate gradient, and the meta-parameters are updated using Adam. This presents the most straightforward comparison with MAML, which follows a similar procedure but backpropagates through the path of optimization as opposed to invoking implicit differentiation. The second variant of iMAML uses a second-order method for the inner-level problem. In particular, we consider the Hessian-free or Newton-CG [44, 36] method.
This method makes a local quadratic approximation to the objective function (in our case, $G_i(\phi', \theta)$) and approximately computes the Newton search direction using CG. Since CG requires only Hessian-vector products, this way of approximating the Newton search direction is scalable to large deep neural networks. The step size can be computed using regularization, damping, a trust region, or a line search. We use a line search on the training loss in our experiments, which also illustrates how our method can handle non-differentiable inner optimization loops. We refer the reader to Nocedal & Wright [44] and Martens [36] for a more detailed exposition of this optimization algorithm. Similar approaches have also gained prominence in reinforcement learning [52, 47].

Table 3: Mini-ImageNet 5-way 1-shot accuracy.

| Algorithm | 5-way 1-shot |
| --- | --- |
| MAML | 48.70 ± 1.84% |
| first-order MAML | 48.07 ± 1.75% |
| Reptile | 49.97 ± 0.32% |
| iMAML GD (ours) | 48.96 ± 1.84% |
| iMAML HF (ours) | 49.30 ± 1.88% |

Tables 2 and 3 present the results on Omniglot and Mini-ImageNet, respectively. On the Omniglot domain, we find that the GD version of iMAML is competitive with the full MAML algorithm, and substantially better than its approximations (i.e., first-order MAML and Reptile), especially on the harder 20-way tasks. We also find that iMAML with Hessian-free optimization performs substantially better than the other methods, suggesting that powerful optimizers in the inner loop can offer benefits to meta-learning. On the Mini-ImageNet domain, we find that iMAML performs better than MAML and FOMAML. We used $\lambda = 0.5$ and 10 gradient steps in the inner loop, and 5 CG steps to compute the meta-gradient; the Hessian-free version also uses 5 CG steps for the search direction. We did not perform an extensive hyperparameter sweep, and expect that the results can improve with better hyperparameters. Additional experimental details are in Appendix F.

## 5 Related Work

Our work considers the general meta-learning problem [51, 55, 41], including few-shot learning [30, 57]. Meta-learning approaches can generally be categorized into metric-learning approaches that learn an embedding space in which non-parametric nearest neighbors works well [29, 57, 54, 45, 3], black-box approaches that train a recurrent or recursive neural network to take datapoints as input and produce weight updates [25, 5, 33, 48] or predictions for new inputs [50, 12, 58, 40, 38], and optimization-based approaches that use bi-level optimization to embed learning procedures, such as gradient descent, into the meta-optimization problem [15, 13, 8, 60, 34, 17, 59, 23]. Hybrid approaches have also been considered to combine the benefits of different formulations [49, 56]. We build upon optimization-based approaches, particularly the MAML algorithm [15], which meta-learns an initial set of parameters such that gradient-based fine-tuning leads to good generalization. Prior work has considered a number of inner loops, ranging from the very general setting where all parameters are adapted using gradient descent [15], to more structured and specialized settings, such as ridge regression [8], Bayesian linear regression [23], and simulated annealing [2]. The main difference between our work and these approaches is that we show how to analytically derive the gradient of the outer objective without differentiating through the inner learning procedure. Mathematically, we view optimization-based meta-learning as a bi-level optimization problem.
Such problems have been studied in the context of few-shot meta-learning (as discussed previously), gradient-based hyperparameter optimization [35, 46, 19, 11, 10], and a range of other settings [4, 31]. Some prior works have derived implicit gradients for related problems [46, 11, 4], while others propose innovations to aid back-propagation through the optimization path for specific algorithms [35, 19, 24], or approximations like truncation [53]. While the broad idea of implicit differentiation is well known, it has not previously been demonstrated empirically for learning more than a few parameters (e.g., hyperparameters), or outside of highly structured settings such as quadratic programs [4]. In contrast, our method meta-trains deep neural networks with thousands of parameters. Closest to our setting is the recent work of Lee et al. [32], which uses implicit differentiation for quadratic programs in a final SVM layer. In contrast, our formulation allows for adapting the full network for generic objectives (beyond the hinge loss), thereby allowing for wider applications. We also note that prior works involving implicit differentiation make the strong assumption of an exact solution at the inner level, thereby providing only asymptotic guarantees. In contrast, we provide finite-time guarantees, which allow us to analyze the case where the inner level is solved only approximately. In practice, the inner level is likely to be solved using iterative optimization algorithms like gradient descent, which return approximate solutions in finite iterations. Thus, this paper places implicit gradient methods on a strong theoretical footing for practical use.

## 6 Conclusion

In this paper, we developed a method for optimization-based meta-learning that removes the need for differentiating through the inner optimization path, allowing us to decouple the outer meta-gradient computation from the choice of inner optimization algorithm. We showed how this gives significant gains in compute and memory efficiency, and also conceptually allows us to use a variety of inner optimization methods. While we focused on developing the foundations and theoretical analysis of this method, we believe that this work opens up a number of interesting avenues for future study.

**Broader classes of inner-loop procedures.** While we studied different gradient-based optimization methods in the inner loop, iMAML can in principle be used with a variety of inner-loop algorithms, including dynamic programming methods such as Q-learning, two-player adversarial games such as GANs, energy-based models [39], actor-critic RL methods, and higher-order model-based trajectory optimization methods. This significantly expands the kinds of problems to which optimization-based meta-learning can be applied.

**More flexible regularizers.** We explored one very simple regularizer, $\ell_2$ regularization to the parameter initialization, which already increases the expressive power over the implicit regularization that MAML provides through truncated gradient descent. To further allow the model to flexibly regularize the inner optimization, a simple extension of iMAML is to learn a vector- or matrix-valued $\lambda$, which would enable the meta-learner to co-adapt and co-regularize the various parameters of the model. Regularizers that act on parameterized density functions would also enable meta-learning to be effective for few-shot density estimation.
## Acknowledgements

Aravind Rajeswaran thanks Emo Todorov for valuable discussions about implicit gradients and potential application domains; Aravind Rajeswaran also thanks Igor Mordatch and Rahul Kidambi for helpful discussions and feedback. Sham Kakade acknowledges funding from the Washington Research Foundation for Innovation in Data-intensive Discovery; Sham Kakade also graciously acknowledges support from ONR award N00014-18-1-2247, NSF Award CCF-1703574, and NSF Award CCF-1740551.

## References

[1] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. CoRR, abs/1710.03641, 2017.
[2] Ferran Alet, Tomás Lozano-Pérez, and Leslie P. Kaelbling. Modular meta-learning. arXiv preprint arXiv:1806.10166, 2018.
[3] Kelsey R. Allen, Evan Shelhamer, Hanul Shin, and Joshua B. Tenenbaum. Infinite mixture prototypes for few-shot learning. arXiv preprint arXiv:1902.04552, 2019.
[4] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 136–145. JMLR.org, 2017.
[5] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
[6] Walter Baur and Volker Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22:317–330, 1983.
[7] Atilim Gunes Baydin, Barak A. Pearlmutter, and Alexey Radul. Automatic differentiation in machine learning: a survey. CoRR, abs/1502.05767, 2015.
[8] Luca Bertinetto, Joao F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
[9] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 2015.
[10] Chuong B. Do, Chuan-Sheng Foo, and Andrew Y. Ng. Efficient multiple hyperparameter learning for log-linear models. In NIPS, 2007.
[11] Justin Domke. Generic methods for optimization-based modeling. In AISTATS, 2012.
[12] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779, 2016.
[13] Chelsea Finn. Learning to Learn with Gradients. PhD thesis, UC Berkeley, 2018.
[14] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv:1710.11622, 2017.
[15] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning (ICML), 2017.
[16] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.
[17] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 9516–9527, 2018.
[18] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. International Conference on Machine Learning (ICML), 2019.
[19] Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 1165–1173. JMLR.org, 2017.
[20] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. International Conference on Learning Representations (ICLR), 2018.
[21] Andreas Griewank. Some bounds on the complexity of gradients, Jacobians, and Hessians. 1993.
[22] Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2008.
[23] James Harrison, Apoorva Sharma, and Marco Pavone. Meta-learning priors for efficient online Bayesian regression. arXiv preprint arXiv:1807.08912, 2018.
[24] Laurent Hascoët and Mauricio Araya-Polo. Enabling user-driven checkpointing strategies in reverse-mode automatic differentiation. CoRR, abs/cs/0606042, 2006.
[25] Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, 2001.
[26] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In ICML, 2017.
[27] Jaehong Kim, Youngduck Choi, Moonsu Cha, Jung Kwon Lee, Sangyeul Lee, Sungwan Kim, Yongseok Choi, and Jiwon Kim. Auto-Meta: Automated gradient based meta learner search. arXiv preprint arXiv:1806.06927, 2018.
[28] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
[29] Gregory Koch. Siamese neural networks for one-shot image recognition. ICML Deep Learning Workshop, 2015.
[30] Brenden M. Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua B. Tenenbaum. One shot learning of simple visual concepts. In Conference of the Cognitive Science Society (CogSci), 2011.
[31] Benoit Landry, Zachary Manchester, and Marco Pavone. A differentiable augmented Lagrangian method for bilevel nonlinear optimization. arXiv preprint arXiv:1902.03319, 2019.
[32] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. arXiv preprint arXiv:1904.03758, 2019.
[33] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
[34] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
[35] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
[36] James Martens. Deep learning via Hessian-free optimization. In ICML, 2010.
[37] Fei Mi, Minlie Huang, Jiyong Zhang, and Boi Faltings. Meta-learning for low-resource natural language generation in task-oriented dialogue systems. arXiv preprint arXiv:1905.05644, 2019.
[38] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141, 2017.
[39] Igor Mordatch. Concept learning with energy-based models. CoRR, abs/1811.02486, 2018.
[40] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2554–2563. JMLR.org, 2017.
[41] Devang K. Naik and R. J. Mammone. Meta-neural networks that learn by learning. In International Joint Conference on Neural Networks (IJCNN), 1992.
[42] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108:177–205, 2006.
[43] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
[44] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2000.
[45] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 721–731, 2018.
[46] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. arXiv preprint arXiv:1602.02355, 2016.
[47] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham Kakade. Towards generalization and simplicity in continuous control. In NIPS, 2017.
[48] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
[49] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
[50] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
[51] Jürgen Schmidhuber. Evolutionary principles in self-referential learning. Diploma thesis, Institut für Informatik, Technische Universität München, 1987.
[52] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.
[53] Amirreza Shaban, Ching-An Cheng, Olivia Hirschey, and Byron Boots. Truncated back-propagation for bilevel optimization. CoRR, abs/1810.10667, 2018.
[54] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
[55] Sebastian Thrun and Lorien Pratt. Learning to Learn. Springer Science & Business Media, 1998.
[56] Eleni Triantafillou, Tyler Zhu, Vincent Dumoulin, Pascal Lamblin, Kelvin Xu, Ross Goroshin, Carles Gelada, Kevin Swersky, Pierre-Antoine Manzagol, and Hugo Larochelle. Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.
[57] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Neural Information Processing Systems (NIPS), 2016.
[58] Jane X. Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z. Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv:1611.05763, 2016.
[59] Fengwei Zhou, Bin Wu, and Zhenguo Li. Deep meta-learning: Learning to learn in the concept space. arXiv preprint arXiv:1802.03596, 2018.
[60] Luisa M. Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Fast context adaptation via meta-learning. arXiv preprint arXiv:1810.03642, 2018.