# Modeling and Optimization Trade-off in Meta-learning

Katelyn Gao

By searching for shared inductive biases across tasks, meta-learning promises to accelerate learning on novel tasks, but at the cost of solving a complex bilevel optimization problem. We introduce and rigorously define the trade-off between accurate modeling and ease of optimization in meta-learning. At one end, classic meta-learning algorithms account for the structure of meta-learning but solve a complex optimization problem, while at the other end domain randomized search (otherwise known as joint training) ignores the structure of meta-learning and solves a single-level optimization problem. Taking MAML as the representative meta-learning algorithm, we theoretically characterize the trade-off for general nonconvex risk functions as well as for linear regression, for which we are able to provide explicit bounds on the errors associated with modeling and optimization. We also empirically study this trade-off for meta-reinforcement learning benchmarks.

1 Introduction

Arguably, the major bottleneck in applying machine learning to many practical problems is the cost associated with data and/or labeling. While the cost of labeling and data makes supervised learning expensive, the high sample complexity of reinforcement learning makes it downright inapplicable in many practical settings. Meta-learning (or, more generally, multi-task learning) is designed to ease the sample complexity of these methods. It has had success stories on a wide range of problems including image recognition and reinforcement learning [14].

In the classical risk minimization setting, for a task $\gamma$, the learner solves the problem
$$\min_{\theta} \; R(\theta; \gamma) \triangleq \mathbb{E}_{\xi}\big[\hat{R}(\theta, \xi; \gamma)\big], \qquad (1)$$
where $R(\theta; \gamma)$ is the risk function, which the learner can only access via noisy evaluations $\hat{R}(\theta, \xi; \gamma)$. Meta-learning, or learning to learn [24], makes the observation that if the learner has access to a collection of tasks sampled from a distribution $p(\gamma)$, it can utilize an offline meta-training stage to search for shared inductive biases that assist in learning future tasks from $p(\gamma)$. Under the PAC framework, Baxter [2] shows that, given sufficiently many tasks and data per task during meta-training, there are guarantees on the generalization of the learned biases to novel tasks.

Specifically, consider an optimization algorithm $\mathrm{OPT}(\gamma, \theta_{meta})$ which solves the problem of meta-test task $\gamma$ using the meta solution $\theta_{meta}$. This meta solution is typically a policy initialization for reinforcement learning or shared features for supervised learning. However, it can be any useful knowledge which can be learned in the meta-training stage. The family of meta-learning methods solves
$$\min_{\theta_{meta}} \; R_{meta}(\theta_{meta}) \triangleq \mathbb{E}_{\gamma \sim p(\gamma),\, \xi}\big[\hat{R}(\mathrm{OPT}(\gamma, \theta_{meta}), \xi; \gamma)\big], \qquad (2)$$
where in practice $\mathrm{OPT}$ is approximated by $\widehat{\mathrm{OPT}}$, which uses $N$ calls to an oracle.¹

¹ For example, if $\mathrm{OPT}(\gamma, \theta_{meta})$ is gradient descent on the task risk function $R(\theta; \gamma)$, then $\widehat{\mathrm{OPT}}(\gamma, \theta_{meta}, N)$ would be SGD on the usual empirical risk minimization objective (4).

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

This setting is intuitive and theoretically sound. However, it corresponds to a complicated bilevel optimization problem. Bilevel optimization is notoriously hard; even the case of a well-behaved inner problem (e.g., a linear program as $\mathrm{OPT}$) can be NP-hard [15] in general. Hence, one can rightfully ask: is it feasible to solve the meta problem in (2)?
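To make the nested structure of (2) concrete, the following minimal sketch instantiates $\mathrm{OPT}$ as a few noisy gradient steps per task and estimates the meta-objective by Monte Carlo. It is not the paper's implementation; the quadratic task risk, the task distribution, and all names (`sample_task`, `inner_opt`, `K_INNER`) are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, K_INNER, LR_INNER = 5, 3, 0.1

def sample_task():
    """A task gamma: here an assumed quadratic risk R(theta; gamma) = 0.5 * ||theta - c||^2."""
    return rng.normal(size=DIM)  # task-specific minimizer c_gamma

def noisy_grad(theta, c, noise=0.1):
    """Noisy gradient evaluation, standing in for the oracle access to R-hat."""
    return (theta - c) + noise * rng.normal(size=DIM)

def inner_opt(theta_meta, c):
    """OPT(gamma, theta_meta): a few stochastic gradient steps started from theta_meta."""
    theta = theta_meta.copy()
    for _ in range(K_INNER):
        theta -= LR_INNER * noisy_grad(theta, c)
    return theta

def meta_objective(theta_meta, n_tasks=200):
    """Monte-Carlo estimate of R_meta in (2): task risk *after* the inner optimizer has run."""
    total = 0.0
    for _ in range(n_tasks):
        c = sample_task()
        theta_adapted = inner_opt(theta_meta, c)          # inner (lower-level) problem
        total += 0.5 * np.sum((theta_adapted - c) ** 2)   # outer risk at the adapted solution
    return total / n_tasks

print(meta_objective(np.zeros(DIM)))
```

Every evaluation of $R_{meta}$ requires (approximately) solving the inner problem for each sampled task; this is precisely the bilevel structure that makes (2) expensive to optimize.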
This question of feasibility is even more pressing for reinforcement learning, where even solving the single-task risk minimization in (1) has prohibitively high sample complexity. Meta-learning models the structure of the problem accurately, but at the price of a very costly optimization problem. One obvious question is: can we trade off modeling accuracy for computational ease?

Unfortunately, there is no general principled approach for controlling this trade-off, as it requires understanding domain-specific properties of the meta problem. Instead, we focus on the case where the meta-information $\theta_{meta}$ is the initialization of an iterative optimizer for the meta-test task $\gamma$, i.e. $\theta_{meta} = \theta^0_\gamma$, and drop the subscript meta when it is clear from context. This covers many existing algorithms, including MAML [9], which is able to approximate any learning algorithm when combined with deep networks [8]. For this case of meta-learning the initialization, a simple and direct alternative would be solving the pseudo-meta problem
$$\min_{\theta} \; R_{drs}(\theta) \triangleq \mathbb{E}_{\gamma \sim p(\gamma),\, \xi}\big[\hat{R}(\theta, \xi; \gamma)\big]. \qquad (3)$$
We call this domain randomized search (DRS) since it corresponds to the domain randomization method of Tobin et al. [26] and performs direct search over a distribution of domains (tasks).²

It might not be clear to the reader how DRS solves meta-learning. It is important to reiterate that it does so only when the meta-learned information is an initialization. However, we believe an approximate form of meta-learning without bilevel structure can be found in other cases with the help of domain knowledge. In this paper, we rigorously prove that DRS is an effective meta-learning algorithm for learning an initialization, by showing that DRS decreases sample complexity during meta-testing.

These two approaches correspond to the two extremes of the modeling and optimization trade-off. Meta-learning corresponds to accurate modeling and a computationally harder optimization, whereas DRS corresponds to less accurate modeling and a computationally easier optimization. In this paper, we try to understand this trade-off and specifically attempt to answer the following question: given a fixed and finite budget for meta-training and meta-testing, which algorithm is more desirable?

In order to answer this question, we provide a collection of theoretical and empirical answers, taking MAML to be the representative meta-learning algorithm:

- We empirically study this trade-off in meta-reinforcement learning (Section 2).
- We analyze the sample complexity of DRS and MAML for general non-convex risk functions, and illustrate the interplay of the modeling error and the optimization error (Section 3).
- We theoretically analyze the meta-linear regression case, which is fully tractable, and explicitly characterize the trade-off, with simulations that confirm our results (Section 4).

1.1 Formulation, Background and Summary of Results

We are interested in a distribution $p(\gamma)$ of problems with risk functions $R(\theta; \gamma)$ indexed by task ids $\gamma$. Given a specific task, we assume access to $2N$ i.i.d. samples from the risk function. This corresponds to sampling data points and labels for supervised learning, and sampling episodes with their corresponding reward values for reinforcement learning.³ Classical empirical risk minimization solves separately, for each $\gamma$,
$$\min_{\theta} \; \hat{R}(\theta; \gamma) \triangleq \frac{1}{2N} \sum_{i=1}^{2N} \hat{R}(\theta, \xi_i; \gamma), \qquad (4)$$
where the $\xi_i$ are data points for supervised learning and episodes for RL.

² It has also been referred to as joint training in the literature [10].

³ Risk can be seen as the chosen loss function for supervised learning and the negative of the reward for RL.
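As a small illustration, the sketch below contrasts classical per-task empirical risk minimization (4), which fits each task on its own $2N$ samples, with the pooled, single-level problem that DRS solves in (3). The linear-Gaussian task generator and all names are assumptions made for this example, and ordinary least squares stands in for the generic ERM solver.

```python
import numpy as np

rng = np.random.default_rng(1)
P, N2, M = 3, 20, 10          # feature dim, samples per task (2N), number of tasks

# Hypothetical task generator: linear-regression tasks with task-specific parameters.
tasks = [rng.normal(size=P) for _ in range(M)]

def sample_data(theta_task, n):
    X = rng.normal(size=(n, P))
    y = X @ theta_task + 0.1 * rng.normal(size=n)
    return X, y

data = [sample_data(t, N2) for t in tasks]

# Classical ERM (4): solve each task separately on its own 2N samples.
per_task_fits = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in data]

# DRS / joint training (3): one single-level problem over the pooled samples of all tasks.
X_all = np.vstack([X for X, _ in data])
y_all = np.concatenate([y for _, y in data])
theta_drs = np.linalg.lstsq(X_all, y_all, rcond=None)[0]

print("first per-task fit:", per_task_fits[0])
print("DRS (joint) fit:   ", theta_drs)
```

The DRS fit is a single estimate shared across tasks, whereas classical ERM produces one estimate per task.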
Empirical risk minimization for MAML can be defined over $M$ meta-training tasks sampled from $p(\gamma)$ as
$$\min_{\theta} \; \hat{R}_{maml}(\theta) \triangleq \frac{1}{NM}\sum_{j=1}^{M}\sum_{i=N+1}^{2N}\hat{R}\big(\widehat{\mathrm{OPT}}(\gamma_j, \theta, N),\, \xi_{i,j}; \gamma_j\big). \qquad (5)$$
Meta-learning needs samples for both the inner optimization and the outer meta problem; usually half of the samples are used for each. Empirical risk minimization for DRS is
$$\min_{\theta} \; \hat{R}_{drs}(\theta) \triangleq \frac{1}{2NM}\sum_{j=1}^{M}\sum_{i=1}^{2N}\hat{R}(\theta, \xi_{i,j}; \gamma_j). \qquad (6)$$

Denote the solutions of problems (2) and (3) by $\theta^\star_{maml}$ and $\theta^\star_{drs}$; these are the minimum population risk solutions. We call the risks of these solutions, $R_{drs}(\theta^\star_{drs})$ and $R_{maml}(\theta^\star_{maml})$, the modeling error, as they are the best each method can achieve with the best optimizer and infinite data and computation. However, we only have access to the empirical risks in (5) and (6), as well as a finite computation budget for meta-training, $T^{tr}$. Hence, instead of the optimal solutions, we obtain solutions $\theta^{T^{tr}}_{maml}$ and $\theta^{T^{tr}}_{drs}$. We call the difference between these empirical solutions and $\theta^\star_{maml}$, $\theta^\star_{drs}$ the optimization error.

We expect the optimization error of MAML to be significantly higher, as bilevel optimization is much harder. More specifically, for general non-convex risk functions, in order for stochastic gradient descent (SGD) to reach an $\epsilon$-stationary point, DRS needs $O(1/\epsilon^4)$ samples [12] while MAML needs $O(1/\epsilon^6)$ [6]. Moreover, MAML uses half of its samples for the inner optimization; hence, the effective number of samples is reduced, resulting in additional optimization error. The optimization error strongly depends on the sample budget ($N$ and $M$) that can be afforded, since more samples decrease the optimization error of both methods, but at different rates. On the other hand, we know MAML has lower modeling error, as it explicitly uses the problem geometry. The key question is the trade-off of the modeling error versus the optimization error as a function of $N$ and $M$, since it characterizes which algorithm is better for a given problem and budget. We look at this trade-off in two different settings.

General non-convex case: We study this trade-off empirically for meta-reinforcement learning (Section 2) and theoretically for general non-convex risk functions (Section 3). Our empirical results suggest that this trade-off is rather nuanced and problem dependent. Our theoretical results shed light on these peculiarities and describe the trade-off.

Meta-linear regression (Section 4): The empirical risk can be minimized analytically, enabling us to directly compare optimization error and modeling error. Results suggest that there are cut-off values for $N$ and $M$ that determine when the different methods are desirable.
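Before turning to the experiments, the following sketch contrasts one stochastic meta-gradient step on the empirical DRS objective (6) with one on the empirical MAML objective (5). It is a minimal illustration under assumed linear-regression tasks and hypothetical names (`sample_task_batch`, `ALPHA`), not the implementation used in our experiments; automatic differentiation through the inner update plays the role of $\widehat{\mathrm{OPT}}$ with a single gradient step.

```python
import torch

torch.manual_seed(0)
P, N, M, ALPHA = 4, 8, 6, 0.05          # feature dim, N, tasks per batch, inner step size

def sample_task_batch():
    """M assumed linear-regression tasks, each with 2N samples (half inner, half outer)."""
    thetas = torch.randn(M, P)
    X = torch.randn(M, 2 * N, P)
    y = torch.einsum('mnp,mp->mn', X, thetas) + 0.1 * torch.randn(M, 2 * N)
    return X, y

def task_loss(theta, X, y):
    return 0.5 * ((X @ theta - y) ** 2).mean()

X, y = sample_task_batch()
theta = torch.zeros(P, requires_grad=True)

# DRS (6): single-level, pooled empirical risk over all 2N*M samples.
drs_loss = task_loss(theta, X.reshape(-1, P), y.reshape(-1))
g_drs, = torch.autograd.grad(drs_loss, theta)

# MAML (5): inner gradient step per task on the first N samples, outer loss on the rest.
outer_losses = []
for j in range(M):
    inner = task_loss(theta, X[j, :N], y[j, :N])
    g_inner, = torch.autograd.grad(inner, theta, create_graph=True)  # keep graph for outer grad
    theta_j = theta - ALPHA * g_inner                                # one-step OPT-hat(gamma_j, theta, N)
    outer_losses.append(task_loss(theta_j, X[j, N:], y[j, N:]))
maml_loss = torch.stack(outer_losses).mean()
g_maml, = torch.autograd.grad(maml_loss, theta)                      # differentiates through theta_j

print('||g_drs|| =', g_drs.norm().item(), ' ||g_maml|| =', g_maml.norm().item())
```

The `create_graph=True` flag is what makes the MAML gradient a second-order quantity: the outer gradient is differentiated through the inner update, which is the extra cost that DRS avoids.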
2 Trade-offs in Meta-Reinforcement Learning

In this section, we study the trade-off empirically in meta-reinforcement learning. We compare DRS and MAML on a wide range of meta-RL benchmarks used in the literature. We consider 1) four robotic locomotion environments introduced by Finn et al. [9] and Rothfuss et al. [21], with varying reward functions (Half Cheetah Rand Vel, Walker2D Rand Vel) or varying system dynamics (Hopper Rand Params, Walker2D Rand Params), and 2) four manipulation environments from Meta-World [33] with varying reward functions (ML1-Push, ML1-Reach) or changing manipulation tasks and varying reward functions (ML10, ML45).⁴ All environments use the MuJoCo simulator [27]. We make two comparisons: 1) ProMP [21], which combines MAML with PPO [23], vs. DRS combined with PPO (DRS+PPO), and 2) TRPO-MAML [9, 21], which combines MAML with TRPO [22], vs. DRS combined with TRPO (DRS+TRPO).

⁴ We do not include ML1-Pick-and-Place because training was unsuccessful for all algorithms of interest.

Meta-training is controlled such that all algorithms use an equal number of steps generated from the simulator for each environment. During evaluation, we first sample a set of meta-test tasks. For each meta-test task, starting at a trained policy, we repeat the following five times: generate a small number of episodes from the current policy and update the policy using the policy gradient algorithm from the inner optimization of ProMP and TRPO-MAML. We compute the average episodic reward after t updates, for t = 0, 1, ..., 5. These statistics are then averaged over all sampled tasks. We evaluate the policies learned by all algorithms at various checkpoints during meta-training, in total five evaluations with different random seeds. For each checkpoint and test update, we compute the probability of DRS being better than MAML using the one-sided Welch t-test. We plot filled contour plots of these probabilities with respect to the meta-training checkpoint and the number of test updates in Figures 1 and 2. We include details of the experimental protocol, the Welch t-test, and additional plots of the reward in Appendix A.

Figure 1: DRS+PPO vs. ProMP: probability that the first method is better than the second, plotted against the meta-training budget and the number of test updates. Panels: (a) Hopper Rand Params, (b) Walker-2D Rand Params, (c) Half Cheetah Rand Vel, (d) Walker-2D Rand Vel, (e) ML1-Push, (f) ML1-Reach, (g) ML10, (h) ML45. The color scale ranges from "Domain Randomization is Better" to "Meta-Learning is Better".

Figure 2: DRS+TRPO vs. TRPO-MAML: probability that the first method is better than the second, with the same panels, axes, and color scale as Figure 1.
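The probabilities shown in Figures 1 and 2 come from a one-sided Welch t-test over evaluation seeds. The sketch below reproduces that test for a single checkpoint and test update with hypothetical reward values; the exact evaluation protocol and the convention used for the plotted probabilities are described in Appendix A and are not reproduced here.

```python
import numpy as np
from scipy import stats

def welch_one_sided(rewards_drs, rewards_maml):
    """Return the Welch t statistic and the one-sided p-value for H1: DRS > MAML."""
    x, y = np.asarray(rewards_drs, float), np.asarray(rewards_maml, float)
    nx, ny = len(x), len(y)
    vx, vy = x.var(ddof=1) / nx, y.var(ddof=1) / ny
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, stats.t.sf(t, df)

# Hypothetical per-seed average test rewards for one checkpoint and test update (5 seeds each).
drs = [102.3, 98.7, 110.1, 105.4, 99.8]
maml = [95.2, 101.0, 92.4, 97.9, 94.1]
t, p = welch_one_sided(drs, maml)
print(f"t = {t:.2f}, one-sided p-value (H1: DRS > MAML) = {p:.3f}")
```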
For the majority of environments, DRS is either better than or comparable to MAML (see Figure 1-(a,b,d,e,f) and Figure 2-(a,b,d,e,f)). For two environments, Half Cheetah Rand Vel and ML10 (Figure 1-(c,g) and Figure 2-(c,g)), MAML outperforms DRS for larger sample sizes and at least one test update. This confirms our thesis: the higher optimization error of MAML requires a larger sample size for successful learning, and it dominates the trade-off with modeling error at smaller sample sizes.

Another surprising result is the case of ML45 (Figure 1-(h) and Figure 2-(h)), where, as the sample size increases, the outcome alternates between DRS being better and MAML being better. We conjecture that this is caused by the greater variance in meta-training arising from the higher diversity of tasks.

Another interesting observation is that in the majority of environments, MAML fares better when combined with TRPO than when combined with PPO. This validates the importance of optimization, as TRPO and PPO use exactly the same model and are expected to behave similarly from a statistical perspective. We suspect that this behavior is due to the difference in how the trust region is used for optimization. In TRPO-MAML the constraint on the KL divergence between the current policy and the updated policy is strictly enforced, while in ProMP it is transformed into a Lagrangian penalty and thus may not actually be satisfied. Satisfaction of the constraint is more helpful for MAML, whose empirical gradients generally have greater variance than those of DRS.

In conclusion, when the sample complexity is high and the available sample budget is low, DRS is the preferred method. MAML is effective only with large data sets, and only on some problems. It is important for practitioners to understand the sample complexity of their problems before choosing which method to apply. In the next section, we provide theoretical analysis to clarify this phenomenon.

3 Trade-off through the Lens of Optimization Behavior

In this section, we analyze the sample complexity of meta-training and meta-testing for MAML and DRS. In particular, we consider the interplay of meta-training and meta-testing for a complete analysis. Specifically, for MAML, $\mathrm{OPT}(\gamma, \theta)$ is one step of gradient descent on the task risk function $R(\theta; \gamma)$ starting at $\theta$ with learning rate $\alpha$. The objectives of DRS and MAML are
$$R_{drs}(\theta) \triangleq \mathbb{E}_\gamma[R(\theta; \gamma)], \qquad R_{maml}(\theta; \alpha) \triangleq \mathbb{E}_\gamma\big[R\big(\theta - \alpha\nabla R(\theta; \gamma); \gamma\big)\big]. \qquad (7)$$

We analyze the trade-off between the optimization error and the modeling error of DRS and MAML for smooth non-convex risk functions from a sample complexity perspective. We denote the modeling errors by $\Delta$ and, as in Section 1, define them as the expected risks at the globally optimal values of the corresponding objectives: $\Delta_{drs} = R_{drs}(\theta^\star_{drs})$ and $\Delta_{maml}(\alpha) = R_{maml}(\theta^\star_{maml}(\alpha); \alpha)$.

Suppose that we have access to task stochastic gradient oracles and task stochastic Hessian oracles. During meta-training, DRS and MAML optimize $R_{drs}(\theta)$ and $R_{maml}(\theta; \alpha)$ using SGD for $T^{tr}$ steps. During meta-testing on a task $\gamma$, both carry out classical risk minimization (1) using SGD for $T^{te}$ steps, with the meta-training results $\theta^{T^{tr}}_{drs}$ and $\theta^{T^{tr}}_{maml}(\alpha)$ as warm starts. The metric we use is the Euclidean norm of the gradient at meta-testing, since first-order stationarity is the best we can hope for on a non-convex function.

Before we proceed with our analysis, we set the notation and make some mild regularity assumptions. Consider task stochastic gradient oracles $g(\theta, \xi; \gamma)$ such that $\mathbb{E}_\xi[g(\theta, \xi; \gamma)] = \nabla R(\theta; \gamma)$, and task stochastic Hessian oracles $h(\theta, \xi; \gamma)$ such that $\mathbb{E}_\xi[h(\theta, \xi; \gamma)] = \nabla^2 R(\theta; \gamma)$. We assume:

B1: The risk functions are nonnegative and uniformly bounded, $0 \le R(\theta; \gamma)$ for all $\theta$ and $\gamma$, with a common upper bound.

B2: The risk functions are uniformly Lipschitz and smooth: $|R(\theta_1; \gamma) - R(\theta_2; \gamma)| / \|\theta_1 - \theta_2\| \le L$ and $\|\nabla R(\theta_1; \gamma) - \nabla R(\theta_2; \gamma)\| / \|\theta_1 - \theta_2\| \le \mu$ for all $\gamma$.

B3: The task stochastic gradient oracles have bounded variance, $\mathrm{tr}(\mathrm{Var}_\xi(g(\theta, \xi; \gamma))) \le V_d$ for all $\gamma$, and the gradients of the risk functions have bounded variance across tasks, $\mathrm{tr}(\mathrm{Var}_\gamma(\nabla R(\theta; \gamma))) \le V_t$.
B4: The Hessians of the risk functions are Lipschitz: $\|\nabla^2 R(\theta_1; \gamma) - \nabla^2 R(\theta_2; \gamma)\| / \|\theta_1 - \theta_2\| \le \mu_H$ for all $\gamma$.

B5: The task stochastic Hessian oracles have bounded variance: $\mathbb{E}_\xi\big[\|h(\theta, \xi; \gamma) - \nabla^2 R(\theta; \gamma)\|^2\big] \le V_h$ for all $\gamma$.

We state the sample complexity of DRS and MAML in Theorems 1 and 2, respectively. Proofs can be found in Appendix B.

Theorem 1. Suppose that during each iteration of meta-training DRS, $M$ tasks are sampled, each with $2N$ calls to their gradient oracles, and during each iteration of meta-testing, $N$ calls are made to the gradient oracle, with SGD step sizes $\eta^{tr}$ (meta-training) and $\eta^{te}$ (meta-testing). Assume B1-3. Then, in expectation over the meta-test task $\gamma$,
$$\mathbb{E}_\gamma\Big[\min_{t \le T^{te}} \big\|\nabla R(\theta^{\,t+T^{tr}}; \gamma)\big\|^2\Big] \;\le\; O\!\left(\frac{\Delta_{drs}/(\eta^{tr} T^{tr}) + C^{tr}_{drs}}{\eta^{te} T^{te}}\right),$$
where $C^{tr}_{drs}$ is an explicit constant collecting variance terms of order $V_t/M$ and $V_d/(2NM)$.

Before we continue with the analysis of MAML, we first discuss the implications of Theorem 1. Let us examine the bound when $T^{te}$ increases by one and $T^{tr}$, as well as all other variables, is fixed. The change on the left side is the meta-testing gradient after $T^{te}$ iterations, whereas the change on the right side is approximated by its derivative with respect to $T^{te}$, behaving approximately as $O\big((\Delta_{drs}/(\eta^{tr} T^{tr}) + C^{tr}_{drs})/(\eta^{te} T^{te})\big)$. Hence, with more meta-training, we get closer to a stationary point of the meta-test task $\gamma$ with the same number of meta-test iterations. In other words, our result shows that DRS, although it ignores the meta-learning problem structure as discussed in Section 1, provably solves the problem of meta-learning the initialization of an iterative optimizer under sensible assumptions.

Theorem 2. Following Algorithm 1 of Fallah et al. [6], suppose that during each iteration of meta-training MAML, $M$ tasks are sampled, each with $2N$ calls to their gradient oracles, half of which are used in the inner optimization, and $D$ calls to their Hessian oracles. During each iteration of meta-testing, $N$ calls are made to the gradient oracle. Assume B1-5, $\alpha \in [0, 1/(6\mu)]$, and $D \ge 2\alpha^2 V_h$. Then, in expectation over the meta-test task $\gamma$,
$$\mathbb{E}_\gamma\Big[\min_{t \le T^{te}} \big\|\nabla R(\theta^{\,t+T^{tr}}; \gamma)\big\|^2\Big] \;\le\; O\!\left(\frac{\Delta_{maml}(\alpha)/(\eta^{tr} T^{tr}) + C^{tr}_{maml}}{\eta^{te} T^{te}}\right),$$
where the constants hidden in the $O(\cdot)$ depend on $\mu$, $\mu_H$, $L$, and $\alpha$, and $C^{tr}_{maml}$ is an explicit constant involving $\mu$, $\mu_H$, $L$, $\alpha$, and variance terms of order $V_t/M$ and $V_d$.

We keep $\alpha$ as a free parameter since it makes the role of the modeling error explicit. First, consider an argument similar to the discussion of Theorem 1: the meta-testing gradient after $T^{te}$ iterations behaves approximately as $O\big((\Delta_{maml}(\alpha)/(\eta^{tr} T^{tr}) + C^{tr}_{maml})/(\eta^{te} T^{te})\big)$. Hence, unsurprisingly, meta-training with MAML also provably improves the sample complexity of meta-testing. Next, if we ignore the modeling error by assuming $\Delta_{maml}(\alpha) = \Delta_{drs}$, the bound in Theorem 1 is lower than that in Theorem 2 for all $\alpha > 0$. In other words, if the problem has no specific geometric structure that MAML can utilize, DRS will perform better. On the other hand, if the modeling error is dominant, i.e. $\Delta_{maml}(\alpha) \ll \Delta_{drs}$, MAML will perform better. For most practical cases, $\Delta_{maml}(\alpha)$ will be less than $\Delta_{drs}$, but not significantly so; in these cases, the trade-off is governed by the values of $N$ and $M$.

Another important observation is the strong dependence of the sample complexity on the oracle variances as well as the smoothness constants. This partially explains the problem-dependent behavior of DRS and MAML in Section 2. Choosing the right algorithm requires, in addition to the budget for $N$ and $M$, problem- and domain-specific quantities that are often not practical to compute (or estimate). Hence, translating these results into practically relevant decision rules needs future work.
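To see where the Hessian-oracle calls in Theorem 2 come from, note that the meta-gradient of the one-step MAML objective in (7) is $\nabla R_{maml}(\theta;\alpha) = \mathbb{E}_\gamma\big[(I - \alpha\nabla^2 R(\theta;\gamma))\,\nabla R(\theta - \alpha\nabla R(\theta;\gamma);\gamma)\big]$. The sketch below forms stochastic estimates of the DRS and MAML update directions under assumed quadratic task risks; the task generator, noise levels, and names are illustrative assumptions, not the setting of the theorems.

```python
import numpy as np

rng = np.random.default_rng(2)
P, ALPHA = 4, 0.1

def sample_task():
    """Assumed quadratic task risk R(theta; gamma) = 0.5 (theta - c)' Q (theta - c)."""
    A = rng.normal(size=(P, P))
    return A @ A.T / P + np.eye(P), rng.normal(size=P)   # Q_gamma, c_gamma

def grad_oracle(theta, Q, c, noise=0.05):
    """Stochastic gradient oracle g(theta, xi; gamma)."""
    return Q @ (theta - c) + noise * rng.normal(size=P)

def hessian_oracle(Q, noise=0.05):
    """Stochastic Hessian oracle h(theta, xi; gamma)."""
    return Q + noise * rng.normal(size=(P, P))

def drs_step_direction(theta, task):
    Q, c = task
    return grad_oracle(theta, Q, c)                      # one gradient-oracle call

def maml_step_direction(theta, task):
    Q, c = task
    g = grad_oracle(theta, Q, c)                         # inner gradient
    adapted = theta - ALPHA * g
    g_out = grad_oracle(adapted, Q, c)                   # outer gradient at adapted point
    H = hessian_oracle(Q)                                # extra Hessian-oracle call
    return (np.eye(P) - ALPHA * H) @ g_out               # (I - alpha H) grad R(theta - alpha g)

theta = np.zeros(P)
task = sample_task()
print(drs_step_direction(theta, task), maml_step_direction(theta, task), sep='\n')
```

Each MAML direction consumes two gradient-oracle calls and one Hessian-oracle call per task, versus a single gradient-oracle call for DRS, which mirrors the oracle model used in Theorems 1 and 2.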
4 Trade-offs in Meta-Linear Regression

In this section, we study the linear regression case, where we can explicitly characterize the trade-off between optimization error and modeling error. Since the empirical risk minimization problems corresponding to MAML and DRS in linear regression are analytically solvable, the optimization errors consist only of the statistical errors arising from using empirical risk minimization. We first analyze the optimization errors and then discuss the modeling errors of the two approaches. All proofs can be found in Appendix C.

4.1 Formal Setup and Preliminaries

For each task $\gamma$, we assume the following data model:
$$y_\gamma = \theta_\gamma^\top x_\gamma + \epsilon_\gamma, \qquad \epsilon_\gamma \sim (0, \sigma^2_\gamma), \qquad Q_\gamma = \mathbb{E}[x_\gamma x_\gamma^\top \mid \gamma], \qquad (8)$$
where $\epsilon_\gamma$ and $x_\gamma$ are independent and $x_\gamma \in \mathbb{R}^p$. We assume the squared error loss, so that
$$R(\theta; \gamma) = \tfrac{1}{2}\mathbb{E}\big[(y_\gamma - \theta^\top x_\gamma)^2 \mid \gamma\big] = \tfrac{1}{2}(\theta - \theta_\gamma)^\top Q_\gamma (\theta - \theta_\gamma) + \tfrac{1}{2}\sigma^2_\gamma.$$

The globally optimal (minimum risk) solutions of the DRS and MAML objectives (7) are (see Appendix C.2 for derivations)
$$\theta^\star_{drs} = \mathbb{E}_\gamma[Q_\gamma]^{-1}\,\mathbb{E}_\gamma[Q_\gamma\theta_\gamma], \qquad \theta^\star_{maml}(\alpha) = \mathbb{E}_\gamma\big[(I-\alpha Q_\gamma)Q_\gamma(I-\alpha Q_\gamma)\big]^{-1}\,\mathbb{E}_\gamma\big[(I-\alpha Q_\gamma)Q_\gamma(I-\alpha Q_\gamma)\theta_\gamma\big].$$

Both solutions can be viewed as weightings of the task parameters $\theta_\gamma$. Since the Hessian of $R(\theta; \gamma)$ is $Q_\gamma$, the DRS solution gives greater weight to the tasks whose risk functions have higher curvature, i.e. are more sensitive to perturbations in $\theta$. Compared to the DRS solution, the MAML solution puts more weight on the tasks with lower curvature. As the gradient of $R(\theta; \gamma)$ is $Q_\gamma(\theta - \theta_\gamma)$, for tasks with lower curvature one gradient step on the task risk function covers a smaller fraction of the distance from the current point to the stationary point $\theta_\gamma$; thus, starting from the MAML solution enables faster task adaptation overall if the risks are known exactly.

4.2 Bounds on Optimization Error

We consider a finite-sample setting where $M$ tasks, independently sampled from $p(\gamma)$, and $2N$ observations per task, sampled according to the model in (8), are given. We denote the resulting dataset by $D = \{(x_{j,i}, y_{j,i})\}$, $j = 1, \ldots, M$, $i = 1, \ldots, 2N$. During meta-training, following (6), DRS minimizes the average squared error over all data:
$$\hat\theta_{drs} \triangleq \arg\min_\theta \frac{1}{2NM}\sum_{j=1}^{M}\sum_{i=1}^{2N}\big(y_{j,i} - \theta^\top x_{j,i}\big)^2. \qquad (11)$$
Following (5), MAML optimizes for parameters which, when adapted to each task by one gradient step on the average squared error over $N$ observations with learning rate $\alpha$, minimize the average squared error over the remaining $N$ observations of all tasks:
$$\hat\theta_{maml}(\alpha) \triangleq \arg\min_\theta \frac{1}{NM}\sum_{j=1}^{M}\sum_{i=N+1}^{2N}\big(y_{j,i} - \theta_j(\alpha)^\top x_{j,i}\big)^2, \qquad \theta_j(\alpha) = \theta - \frac{\alpha}{N}\sum_{i=1}^{N}\big(x_{j,i}x_{j,i}^\top\theta - x_{j,i}y_{j,i}\big). \qquad (12)$$

As a sub-optimality metric, we characterize the distances between the empirical and globally optimal solutions, $\|\hat\theta_{drs} - \theta^\star_{drs}\|$ and $\|\hat\theta_{maml}(\alpha) - \theta^\star_{maml}(\alpha)\|$, in terms of the finite sample sizes $M$ and $N$ in Theorems 3 and 4. This is the error arising from using empirical samples instead of the population statistics, that is, the statistical error. Before we state the theorems, we summarize the assumptions.

A1: Bounded Hessian of the task loss: $\|Q_\gamma\| \le \beta$.

A2: Bounded parameter, feature, and error space: $\|\theta_\gamma - \theta^\star_{maml}(\alpha)\| \le \rho_0$, and $\|\theta_\gamma\|$, $\|x_{\gamma,i}\|$, and $|\epsilon_{\gamma,i}|$ are finite.

A3: The distribution of $x_{\gamma,i}$ conditional on $\gamma$ is sub-Gaussian with parameter $K$.

In this setting, the following theorems characterize the statistical error for meta-linear regression.

Theorem 3. Suppose that A1-3 hold with probability 1. Let $\omega$ be logarithmic in $\delta^{-1}$, $M$, and $p$, and let $c_1(\omega, r, s, M)$, $c_2(\omega)$, and $c_3(\omega)$ be explicit functions, where $c_2$ depends on $\beta$, the sub-Gaussian parameter $K$, $p$, and an absolute constant $C$, and $c_3$ depends on $\mathrm{tr}(\mathbb{E}_\gamma[\sigma^2_\gamma Q_\gamma])$. If $\lambda_{\min}(\mathbb{E}_\gamma[Q_\gamma]) - o(1) > 0$, then with probability at least $1 - \delta$, ignoring higher-order terms, $\|\hat\theta_{drs} - \theta^\star_{drs}\|$ is bounded above by⁵
$$\big(\lambda_{\min}(\mathbb{E}_\gamma[Q_\gamma]) - o(1)\big)^{-1}\left[c_1\big(\omega, \|\mathrm{Var}_\gamma[Q_\gamma]\|, \mathrm{tr}(\mathrm{Var}_\gamma[Q_\gamma\theta_\gamma]), M\big) + \frac{c_2(\omega) + c_3(\omega)\sqrt{p}}{\sqrt{N}}\right].$$

⁵ Here we abuse notation somewhat by defining the variance of a matrix $B$ to be $\mathrm{Var}(B) = \mathbb{E}\big[(B - \mathbb{E}[B])^2\big]$.

Theorem 4. With the same assumptions as in Theorem 3, let $S_\gamma(\alpha) = (I - \alpha Q_\gamma)Q_\gamma(I - \alpha Q_\gamma)$.
If $\lambda_{\min}(\mathbb{E}_\gamma[S_\gamma(\alpha)]) - o(1) > 0$, then with probability at least $1 - \delta$, ignoring higher-order terms, $\|\hat\theta_{maml}(\alpha) - \theta^\star_{maml}(\alpha)\|$ is bounded above by
$$\big(\lambda_{\min}(\mathbb{E}_\gamma[S_\gamma(\alpha)]) - o(1)\big)^{-1}\left[c_1\big(\omega, \|\mathrm{Var}_\gamma[S_\gamma(\alpha)]\|, \mathrm{tr}(\mathrm{Var}_\gamma[S_\gamma(\alpha)\theta_\gamma]), M\big) + \frac{(1 + 3\alpha\beta)^2\rho_0\, c_2(\omega) + 2(1 + \alpha\beta)^2 c_3(\omega)\sqrt{p}}{\sqrt{N}}\right].$$

Theorems 3 and 4 show that the statistical errors of MAML and DRS scale similarly in terms of rates with respect to $N$ and $M$. However, the constants are significantly different. Compare the coefficients of $1/\sqrt{N}$: the coefficient of $c_2(\omega)$ for DRS is $\lambda_{\min}(\mathbb{E}_\gamma[Q_\gamma])^{-1}$, while for MAML it is $\rho_0(1 + 3\alpha\beta)^2\lambda_{\min}(\mathbb{E}_\gamma[S_\gamma(\alpha)])^{-1}$. When $\alpha\beta$ is of order one, the latter is larger than the former, since we expect $\rho_0$ to be non-negligible and the eigenvalues of $\mathbb{E}_\gamma[S_\gamma(\alpha)]$ to be shrunken compared to those of $\mathbb{E}_\gamma[Q_\gamma]$. A similar observation holds for the coefficient of $c_3(\omega)$. In other words, the convergence of the MAML estimate has a worse dependence on $N$, indicating that the statistical error of DRS behaves more favorably.

4.3 Modeling Error

Modeling error affects the meta-testing performance of the globally optimal solutions. We expect MAML to perform better, as it directly models the meta-learning problem whereas DRS does not. Modeling error by definition depends on the correct model of the world and is thus difficult to characterize without domain knowledge. We study this error in the following theorem, assuming that the distribution of tasks is the same for meta-training and meta-testing, and discuss its implications for practitioners.

Theorem 5. For a meta-test task $\gamma$ and arbitrary $\theta$, let $\theta_\gamma(\alpha)$ be the parameters obtained by one step of SGD using $N$ data points $O_\gamma$ and learning rate $\alpha$, as in (12). Let $A_\gamma(\alpha) = S_\gamma(\alpha) + \alpha^2\big(\mathbb{E}[x_{\gamma,i}x_{\gamma,i}^\top Q_\gamma x_{\gamma,i}x_{\gamma,i}^\top] - Q_\gamma^3\big)/N$, where $S_\gamma(\alpha)$ is defined in Theorem 4. The expected losses before and after optimization, as functions of $\theta$, are
$$\mathbb{E}_\gamma[R(\theta; \gamma)] = \tfrac{1}{2}\mathbb{E}_\gamma\big[(\theta - \theta_\gamma)^\top Q_\gamma(\theta - \theta_\gamma)\big] \quad \text{and} \quad \mathbb{E}_\gamma\big[\mathbb{E}_{O_\gamma}[R(\theta_\gamma(\alpha); \gamma)]\big] = \tfrac{1}{2}\mathbb{E}_\gamma\big[(\theta - \theta_\gamma)^\top A_\gamma(\alpha)(\theta - \theta_\gamma)\big],$$
where we have ignored constants and terms that do not include $\theta$. The former is minimized by $\theta^\star_{drs}$. As $N \to \infty$, the latter approaches $R_{maml}(\theta; \alpha)$, is minimized by $\theta^\star_{maml}(\alpha)$ and, for $0 < \alpha \le 1/\beta$, is at most $R_{drs}(\theta^\star_{drs})$, the minimum expected loss possible before meta-testing optimization.

Theorem 5 shows that the smaller modeling error of meta-learning indeed translates into improved performance. Meta-learning can utilize the geometry implied by the distribution of the tasks to reduce the expected loss, given the ability to optimize at meta-testing. Combining the results of all theorems: the optimization error is worse for MAML when $\alpha\beta$ is of order one (Theorems 3 and 4), while DRS does not model the meta structure and hence has the worse modeling error, i.e. greater expected loss after meta-testing optimization (Theorem 5). As expected, there is no clear winner in practice, as the choice depends on the trade-off between optimization error and modeling error. Next, we show this empirically.
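Since the inner update $\theta_j(\alpha)$ in (12) is affine in $\theta$, both empirical problems (11) and (12) reduce to ordinary least squares, which is what makes the meta-linear-regression case analytically tractable. The sketch below computes both estimators under an assumed task generator; the generator, dimensions, and $\alpha$ are placeholders and do not follow the exact simulation design of Section 4.4.

```python
import numpy as np

rng = np.random.default_rng(3)
P, N, M, ALPHA = 2, 10, 50, 0.1

# Hypothetical task/data generator following the linear model (8); distributions are placeholders.
def sample_task():
    theta = rng.uniform(0, 2, size=P)
    Q = np.diag(rng.uniform(0.5, 2.0, size=P))
    sigma = rng.uniform(0, 1)
    X = rng.multivariate_normal(np.zeros(P), Q, size=2 * N)
    y = X @ theta + sigma * rng.normal(size=2 * N)
    return X, y

tasks = [sample_task() for _ in range(M)]

# DRS estimate (11): ordinary least squares on the pooled data of all tasks.
X_all = np.vstack([X for X, _ in tasks])
y_all = np.concatenate([y for _, y in tasks])
theta_drs = np.linalg.lstsq(X_all, y_all, rcond=None)[0]

# MAML estimate (12): theta_j(alpha) = (I - alpha*Qhat_j) theta + alpha*qhat_j is linear in theta,
# so the outer problem is again least squares, with transformed features and targets.
Xt, yt = [], []
for X, y in tasks:
    Xin, yin, Xout, yout = X[:N], y[:N], X[N:], y[N:]
    Qhat = Xin.T @ Xin / N
    qhat = Xin.T @ yin / N
    Xt.append(Xout @ (np.eye(P) - ALPHA * Qhat))
    yt.append(yout - ALPHA * (Xout @ qhat))
theta_maml = np.linalg.lstsq(np.vstack(Xt), np.concatenate(yt), rcond=None)[0]

print("theta_drs :", theta_drs)
print("theta_maml:", theta_maml)
```

The transformed features $(I - \alpha\hat{Q}_j)x$ and targets $y - \alpha\,\hat{q}_j^\top x$ turn the MAML problem into a standard regression, matching the closed-form view used throughout Section 4.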
4.4 Empirical Results

We carried out simulations to empirically study the trade-off in the linear regression case. We chose the following specific distribution of tasks and data model:
$$y_\gamma = \theta_\gamma^\top x_\gamma + \epsilon_\gamma, \quad \epsilon_\gamma \sim N(0, \sigma^2_\gamma), \quad x_\gamma \sim N(0, Q_\gamma), \quad \theta_\gamma \sim U([0, 2]^p), \quad \sigma^2_\gamma \sim U([0, 2]), \quad Q_\gamma = V\,\mathrm{diag}(\lambda_\gamma)\,V^\top, \quad V \sim U(SO(p)).$$
We present the case $p = 1$; larger $p$ leads to similar qualitative results. We compute approximations of $R_{drs}(\theta)$ (the expected loss before meta-testing optimization) as a function of $\theta$, and of $\mathbb{E}_\gamma[\mathbb{E}_{O_\gamma}[R(\theta_\gamma(\alpha); \gamma)]]$ (the expected loss after meta-testing optimization) as a function of $\theta$ and $\alpha$.

For various $M$ and $N$, we generate a collection of datasets $D$; for each $D$, over a grid of values of $\alpha \in [0, 1]$, we compute the corresponding DRS and MAML estimates and, using the aforementioned functions, determine whether the MAML estimate has lower expected loss than the DRS estimate before and after meta-testing optimization. Figure 3 shows contour plots of the fraction of datasets for which the MAML estimate has lower expected loss before meta-testing optimization (left three panels) and after (right three panels), for several values of $\alpha$. The axes are the number of tasks ($M$) and the dataset size for meta-testing optimization ($N$).

Figure 3: $p = 1$: contour plots of the probability that the MAML estimate has lower expected loss than the DRS estimate before meta-testing optimization (pre) and after (post); $\alpha$ is fixed in each panel. The axes (log scale) are the number of tasks ($M$) and the number of data points used for meta-testing optimization ($N$). The color scale ranges from "Domain Randomization is Better" to "Meta-Learning is Better".

From Figure 3, for very large $M$ and $N$, the DRS estimate has lower expected loss before meta-testing optimization and the MAML estimate has lower expected loss after meta-testing optimization, with very high certainty. This is expected, since asymptotically meta-learning is expected to work well by Theorem 5; the practical question of interest is the finite-sample case. From Figure 3 (post), we observe that for small $M$ and $N$, the probability that the MAML estimate has lower expected loss after meta-testing optimization can be substantially lower than 0.5, i.e. the DRS estimate is superior. We conclude that in this case the increased optimization error of MAML dominates its decreased modeling error. Hence, unless the geometry is strongly skewed, DRS is preferable for smaller datasets in meta-linear regression.

5 Related Work

Recent work on few-shot image classification has shown that features from a deep network classifier learned on a large training set, combined with a simple classifier at meta-testing, may outperform many meta-learning algorithms [32, 4, 25]; a similar observation has been made for few-shot object detection [31]. Packer et al. [17] show that DRS outperforms RL2 [5] on simple reinforcement learning environments where tasks correspond to different system dynamics. Our meta-RL experiments complement these works, and our theoretical studies partially explain them. We argue that there is a larger picture to be considered: the trade-off between modeling accuracy and optimization ease depends on characteristics of the dataset, model, and optimization, and should be studied on a case-by-case basis.

Previous theoretical studies of MAML have primarily focused on the meta-training stage. Finn et al. [10] prove that the MAML objective is smooth and convex if the task risk functions are smooth and convex, and derive the DRS and MAML estimates for linear regression when those functions are known exactly. Fallah et al. [6] characterize the meta-training sample complexity of SGD for MAML in supervised learning; Fallah et al. [7] provide analogous results for reinforcement learning. For regression with overparameterized neural networks, Wang et al. [29] show that gradient descent reaches the global optimum of the empirical MAML objective (5). Franceschi et al. [11] study the meta-learning problem (5) when $\mathrm{OPT}(\gamma, \theta)$ is a minimization operator instead of an iterative optimization procedure, proposing an algorithm with convergence guarantees. Rajeswaran et al.
[20] propose implicit MAML, with an analysis of its training sample complexity, providing an alternative method to estimate the gradient of the MAML objective; Grazzi et al. [13] compare the quality of this estimate to that of the original MAML algorithm. As a counterpart to Fallah et al. [6] and Fallah et al. [7], Wang et al. [30] provide bounds on the meta-testing performance of an $\epsilon$-stationary point of the MAML objective, assuming that the same set of tasks is used for meta-training and meta-testing. In contrast to these previous works, we analyze the meta-testing performance and overall optimization behavior of MAML when the meta-training and meta-testing tasks are not identical but merely drawn from the same distribution, and compare them to those of DRS.

6 Conclusion

This paper introduces an important trade-off in meta-learning: that between accurately modeling the meta-learning problem and the complexity of the resulting optimization problem. Classic meta-learning algorithms account for the structure of the problem space but define complex optimization objectives. Domain randomized search (DRS) does not account for the structure of the meta-learning problem and solves a single-level optimization objective. Taking MAML to be the representative meta-learning algorithm, we study this trade-off empirically and theoretically. On meta-reinforcement learning benchmarks, the optimization complexity appears to be more important; DRS is competitive with and often outperforms MAML, especially for fewer environment steps. Through an analysis of the sample complexity for smooth nonconvex risk functions, we show that DRS and MAML both solve the meta-learning problem, and we delineate the roles of optimization complexity and modeling accuracy. For meta-linear regression, we prove theoretically and verify in simulations that while MAML can utilize the geometry of the distribution of task losses to improve performance through meta-testing optimization, this modeling gain can be counterbalanced by its greater optimization error at small sample sizes. All three studies show that the balance of the trade-off is determined not only by the sample sizes but also by characteristics of the meta-learning problem, such as the smoothness of the task risk functions.

There are several interesting directions for future work. What is the trade-off exhibited by other algorithms, such as Reptile [16] or ANIL [19], that were designed to be more computationally efficient than MAML? Our theory may also be extended to more complex scenarios such as supervised learning with deep networks, the Linear Quadratic Regulator, more than one inner optimization step in MAML, or the use of part of the data to select hyperparameters.

Broader Impact

The massive progress made by machine learning in artificial intelligence has been partly driven by massive amounts of data and compute [1]. By sharing inductive biases across tasks, meta-learning aims to speed up learning on novel tasks, thereby reducing their data and computational burden. In this paper, we have argued that the data and computational cost of the meta-training procedure matters in addition to that of the meta-testing procedure. We have shown that domain randomized search, a computationally cheaper approach than classic meta-learning methods such as MAML, solves the meta-learning problem and is competitive with MAML when the budget of meta-training data/compute is small.
Thus, it can be an effective meta-learning approach in practice when obtaining data is expensive and/or one would like to reduce the carbon footprint of meta-training, at some cost in performance at meta-test time.

Acknowledgments and Disclosure of Funding

We are very grateful to Vladlen Koltun for providing guidance during the project and to Charles Packer for help in setting up the RL experiments. We would like to thank the other members of the Intelligent Systems Lab at Intel and Amir Zamir for feedback on the first draft of this paper, and Sona Jeswani for assistance in running an early version of the RL experiments. There are no external funding or conflicts of interest to disclose.

References

[1] Dario Amodei and Danny Hernandez. AI and compute, 2018. URL https://openai.com/blog/ai-and-compute.

[2] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

[3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[4] Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A new meta-baseline for few-shot learning. arXiv preprint arXiv:2003.04390, 2020.

[5] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

[6] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. arXiv preprint arXiv:1908.10400, 2019.

[7] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Provably convergent policy gradient methods for model-agnostic meta-reinforcement learning. arXiv preprint arXiv:2002.05135, 2020.

[8] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.

[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135. JMLR, 2017.

[10] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.

[11] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. arXiv preprint arXiv:1806.04910, 2018.

[12] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[13] Riccardo Grazzi, Luca Franceschi, Massimiliano Pontil, and Saverio Salzo. On the iteration complexity of hypergradient computation. arXiv preprint arXiv:2006.16218, 2020.

[14] Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.

[15] Patrice Marcotte and Gilles Savard. Bilevel programming: A combinatorial perspective. In Graph Theory and Combinatorial Optimization, pages 191–217. Springer, 2005.

[16] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
[17] Charles Packer, Katelyn Gao, Jernej Kos, Philipp Krähenbühl, Vladlen Koltun, and Dawn Song. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.

[18] Roger Penrose. On best approximate solutions of linear matrix equations. Mathematical Proceedings of the Cambridge Philosophical Society, 52(1):17–19, 1956.

[19] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. arXiv preprint arXiv:1909.09157, 2019.

[20] Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, pages 113–124, 2019.

[21] Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. arXiv preprint arXiv:1810.06784, 2018.

[22] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), pages 1889–1897, 2015.

[23] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[24] Sebastian Thrun and Lorien Pratt. Learning to learn: Introduction and overview. In Learning to Learn, pages 3–17. Springer, 1998.

[25] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B. Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? arXiv preprint arXiv:2003.11539, 2020.

[26] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.

[27] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012.

[28] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

[29] Haoxiang Wang, Ruoyu Sun, and Bo Li. Global convergence and induced kernels of gradient-based meta-learning with neural nets. arXiv preprint arXiv:2006.14606, 2020.

[30] Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. On the global optimality of model-agnostic meta-learning. arXiv preprint arXiv:2006.13182, 2020.

[31] Xin Wang, Thomas E. Huang, Trevor Darrell, Joseph E. Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957, 2020.

[32] Yan Wang, Wei-Lun Chao, Kilian Q. Weinberger, and Laurens van der Maaten. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623, 2019.

[33] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL), 2019.