# On the Global Optimality of Model-Agnostic Meta-Learning: Reinforcement Learning and Supervised Learning

Lingxiao Wang¹, Qi Cai¹, Zhuoyan Yang², Zhaoran Wang¹

¹Department of Industrial Engineering and Management Sciences, Northwestern University, USA. ²Department of Operations Research and Financial Engineering, Princeton University, USA. Correspondence to: Lingxiao Wang. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s). See https://arxiv.org/abs/2006.13182 for the full version of this paper.

**Abstract.** Model-agnostic meta-learning (MAML) formulates meta-learning as a bilevel optimization problem, where the inner level solves each subtask based on a shared prior, while the outer level searches for the optimal shared prior by optimizing its aggregated performance over all the subtasks. Despite its empirical success, MAML remains less understood in theory, especially in terms of its global optimality, due to the nonconvexity of the meta-objective (the outer-level objective). To bridge this gap between theory and practice, we characterize the optimality gap of the stationary points attained by MAML for both reinforcement learning and supervised learning, where the inner-level and outer-level problems are solved via first-order optimization methods. In particular, our characterization connects the optimality gap of such stationary points with (i) the functional geometry of the inner-level objectives and (ii) the representation power of the function approximators, including linear models and neural networks. To the best of our knowledge, our analysis establishes the global optimality of MAML with nonconvex meta-objectives for the first time.

## 1. Introduction

Meta-learning aims to find a prior that efficiently adapts to a new subtask based on past subtasks. One of the most popular meta-learning methods, namely model-agnostic meta-learning (MAML) (Finn et al., 2017a), is based on bilevel optimization, where the inner level solves each subtask based on a shared prior, while the outer level optimizes the aggregated performance of the shared prior over all the subtasks. In particular, MAML associates the solution to each subtask with the shared prior through one step of gradient descent based on the subtask data. Due to its model-agnostic property, MAML is widely adopted in reinforcement learning (Finn et al., 2017a;b; Xu et al., 2018; Nagabandi et al., 2018; Gupta et al., 2018; Yu et al., 2018; Mendonca et al., 2019) and supervised learning (Finn et al., 2017a; Li et al., 2017; Finn et al., 2018; Rakelly et al., 2018; Yoon et al., 2018).

Despite its popularity in empirical studies, MAML is scarcely explored theoretically. In terms of the global optimality of MAML, Finn et al. (2019) show that the meta-objective is strongly convex assuming that the inner-level objective is strongly convex (in its finite-dimensional parameter). However, such an assumption fails to hold for neural function approximators, which leads to a gap between theory and practice. For nonconvex meta-objectives, Fallah et al. (2019) characterize the convergence of MAML to a stationary point under certain regularity conditions. Meanwhile, Rajeswaran et al. (2019) propose a variant of MAML that utilizes implicit gradients, which is also guaranteed to converge to a stationary point. However, the global optimality of such stationary points remains unclear.
On the other hand, Pentina & Lampert (2014) and Amit & Meir (2017) establish PAC-Bayes bounds for the generalization error of two variants of MAML. However, such generalization guarantees only apply to the global optima of the two meta-objectives rather than to their stationary points.

In this work, we characterize the global optimality of the $\epsilon$-stationary points attained by MAML for both reinforcement learning (RL) and supervised learning (SL). For meta-RL, we study a variant of MAML that associates the solution to each subtask with the shared prior, namely $\pi_\theta$, through one step of proximal policy optimization (PPO) (Schulman et al., 2015; 2017) in the inner level of optimization. In the outer level of optimization, we maximize the expected total reward associated with the shared prior aggregated over all the subtasks. We prove that the $\epsilon$-stationary point attained by such an algorithm is (approximately) globally optimal given that the function approximator has sufficient representation power. For example, for the linear function approximator $\pi_\theta(s, a) \propto \exp(\phi(s, a)^\top \theta)$, the optimality gap of the $\epsilon$-stationary point is characterized by the representation power of the linear class $\{\phi(\cdot, \cdot)^\top v : v \in \mathcal{B}\}$, where $\mathcal{B}$ is the parameter space (which is specified later). The core of our analysis is the functional one-point monotonicity (Facchinei & Pang, 2007) of the expected total reward $J(\pi)$ with respect to the policy $\pi$ (Liu et al., 2019) for each subtask. Based on a similar notion of functional geometry in the inner level of optimization, we establish similar results on the optimality gap of meta-SL. Moreover, our analysis of both meta-RL and meta-SL allows for neural function approximators. More specifically, we prove that the optimality gap of the attained $\epsilon$-stationary points is characterized by the representation power of the corresponding classes of overparameterized two-layer neural networks.

**Challenge.** We highlight that the bilevel structure of MAML makes the analysis of its global optimality challenging. In the simple case where the inner-level objective is strongly convex and smooth, Finn et al. (2019) show that the meta-objective is also strongly convex assuming that the stepsize of inner-level optimization is sufficiently small. In practice, however, both the inner-level objective and the meta-objective can be nonconvex, which leads to a gap between theory and practice. For example, the inner-level objective of meta-RL is nonconvex even in the (infinite-dimensional) functional space of policies. Even assuming that the inner-level objective is convex in the (infinite-dimensional) functional space, nonlinear function approximators, such as neural networks, can make the inner-level objective nonconvex in the finite-dimensional space of parameters. Furthermore, even for linear function approximators, the bilevel structure of MAML can make the meta-objective nonconvex in the finite-dimensional space of parameters, especially when the stepsize of inner-level optimization is large. In this work, we tackle all these challenges by analyzing the global optimality of both meta-RL and meta-SL for both linear and neural function approximators.

**Contribution.** Our contribution is three-fold. First, we propose a meta-RL algorithm and characterize the optimality gap of the $\epsilon$-stationary point attained by such an algorithm for linear function approximators.
Second, under an assumption on the functional convexity of the inner-level objective, we characterize the optimality gap of the $\epsilon$-stationary point attained by meta-SL. Finally, we extend our optimality analysis for linear function approximators to handle overparameterized two-layer neural networks. To the best of our knowledge, our analysis establishes the global optimality of MAML with nonconvex meta-objectives for the first time.

**Related Work.** Meta-learning is studied by various communities (Evgeniou & Pontil, 2004; Thrun & Pratt, 2012; Pentina & Lampert, 2014; Amit & Meir, 2017; Nichol et al., 2018; Nichol & Schulman, 2018; Khodak et al., 2019). See Pan & Yang (2009) and Weiss et al. (2016) for surveys of meta-learning and Taylor & Stone (2009) for a survey of meta-RL. Our work focuses on the model-agnostic formulation of meta-learning (MAML) proposed by Finn et al. (2017a). In contrast to the existing empirical studies, the theoretical analysis of MAML is relatively scarce. Fallah et al. (2019) establish the convergence of three variants of MAML for nonconvex meta-objectives. Rajeswaran et al. (2019) propose a variant of MAML that utilizes implicit gradients in the inner level of optimization and establish the convergence of such an algorithm. This line of work characterizes the convergence of MAML to the stationary points of the corresponding meta-objectives. Our work is complementary in the sense that we characterize the global optimality of the stationary points attained by MAML. Meanwhile, Finn et al. (2019) propose an online algorithm for MAML with regret guarantees, which rely on the strong convexity of the meta-objectives. In contrast, our work tackles nonconvex meta-objectives, which allows for neural function approximators, and characterizes the global optimality of MAML. Mendonca et al. (2019) propose a meta-policy search method and characterize its global optimality for solving the subtasks under the assumption that the meta-objective is (approximately) globally optimized. Our work is complementary to theirs in the sense that we characterize the global optimality of MAML in terms of optimizing the meta-objective. See also the concurrent work of Wang et al. (2020). There is a large body of literature that studies the training and generalization of overparameterized neural networks for SL (Daniely, 2017; Jacot et al., 2018; Wu et al., 2018; Allen-Zhu et al., 2018a;b; Du et al., 2018a;b; Zou et al., 2018; Chizat & Bach, 2018; Li & Liang, 2018; Cao & Gu, 2019a;b; Arora et al., 2019; Lee et al., 2019; Bai & Lee, 2019). See Fan et al. (2019) for a survey. In comparison, we study MAML with overparameterized neural networks for both RL and SL. The bilevel structure of MAML makes our analysis significantly more challenging than that of RL and SL.

**Notation.** We denote by $[n] = \{1, 2, \ldots, n\}$ the index set. Also, we denote by $x = ([x]_1^\top, \ldots, [x]_m^\top)^\top \in \mathbb{R}^{md}$ a vector in $\mathbb{R}^{md}$, where $[x]_k \in \mathbb{R}^d$ is the $k$-th block of $x$ for $k \in [m]$. For a real-valued function $f$ defined on $\mathcal{X}$, we denote by $\|f(\cdot)\|_{p,\nu} = \{\int_{\mathcal{X}} |f(x)|^p \,\mathrm{d}\nu(x)\}^{1/p}$ the $L^p(\nu)$-norm of $f$, where $\nu$ is a measure on $\mathcal{X}$. We write $\|f(\cdot)\|_{2,\nu} = \|f(\cdot)\|_{\nu}$ for notational simplicity and $\|f\|_{p,\nu} = \|f(\cdot)\|_{p,\nu}$ when the variable is clear from the context. For a vector $\phi \in \mathbb{R}^n$, we denote by $\|\phi\|_2$ the $\ell_2$-norm of $\phi$.

## 2. Background

In this section, we briefly introduce reinforcement learning and meta-learning.
### 2.1. Reinforcement Learning

We define a Markov decision process (MDP) by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma, \zeta)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively, $P$ is the Markov transition kernel, $r$ is the reward function, which is possibly stochastic, $\gamma \in (0, 1)$ is the discount factor, and $\zeta$ is the initial state distribution over $\mathcal{S}$. In the sequel, we assume that $\mathcal{A}$ is finite. An agent interacts with the environment as follows. At each step $t$, the agent observes the state $s_t$ of the environment, takes the action $a_t$, and receives the reward $r(s_t, a_t)$. The environment then transitions into the next state according to the distribution $P(\cdot \mid s_t, a_t)$ over $\mathcal{S}$. We define a policy $\pi$ as a mapping from $\mathcal{S}$ to distributions over $\mathcal{A}$. Specifically, $\pi(a \mid s)$ gives the probability of taking the action $a$ at the state $s$. Given a policy $\pi$, we define for all $(s, a) \in \mathcal{S} \times \mathcal{A}$ the corresponding state- and action-value functions $V^\pi$ and $Q^\pi$ as follows,

$$V^\pi(s) = (1 - \gamma)\,\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\Big|\, s_0 = s\Big], \qquad (2.1)$$

$$Q^\pi(s, a) = (1 - \gamma)\,\mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\Big|\, s_0 = s,\ a_0 = a\Big], \qquad (2.2)$$

where $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and $a_t \sim \pi(\cdot \mid s_t)$ for all $t \ge 0$. Correspondingly, the advantage function $A^\pi$ is defined as follows,

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s), \qquad \forall (s, a) \in \mathcal{S} \times \mathcal{A}. \qquad (2.3)$$

A policy $\pi$ induces a state visitation measure $\nu_\pi$ on $\mathcal{S}$, which takes the form of

$$\nu_\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}(s_t = s), \qquad (2.4)$$

where $s_0 \sim \zeta$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and $a_t \sim \pi(\cdot \mid s_t)$ for all $t \ge 0$. Correspondingly, we define the state-action visitation measure by $\sigma_\pi(s, a) = \pi(a \mid s) \cdot \nu_\pi(s)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$, which is a probability distribution over $\mathcal{S} \times \mathcal{A}$. The goal of reinforcement learning is to find the optimal policy $\pi^\star$ that maximizes the expected total reward $J(\pi)$, which is defined as

$$J(\pi) = \mathbb{E}_{s \sim \zeta}\big[V^\pi(s)\big] = \mathbb{E}_{(s,a) \sim \sigma_\pi}\big[r(s, a)\big]. \qquad (2.5)$$

When $\mathcal{S}$ is continuous, maximizing $J(\pi)$ over all possible $\pi$ is computationally intractable. A common alternative is to parameterize the policy as $\pi_\theta$ with a parameter $\theta \in \Theta$, where $\Theta$ is the parameter space, and to maximize $J(\pi_\theta)$ over $\theta \in \Theta$.

### 2.2. Meta-Learning

In meta-learning, the meta-learner is given a sample of learning subtasks $\{\mathcal{T}_i\}_{i \in [n]}$ drawn independently from the task distribution $\iota$ and a set of parameterized algorithms $\{A_\theta : \theta \in \Theta\}$, where $\Theta$ is the parameter space. Specifically, given $\theta$, the algorithm $A_\theta$ maps a learning subtask $\mathcal{T}$ to its desired outcome. For example, an algorithm that solves reinforcement learning subtasks maps an MDP $\mathcal{T} = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \zeta)$ to a policy $\pi$, aiming at maximizing the expected total reward $J(\pi)$ defined in (2.5). As another example, given a hypothesis class $\mathcal{H}$, a distribution $\mathcal{D}$ over $\mathcal{Z}$, which is the space of data points, and a loss function $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$, a supervised learning subtask aims at minimizing the risk $\mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$ over $h \in \mathcal{H}$. We denote the supervised learning subtask $\mathcal{T}$ by the tuple $(\mathcal{D}, \ell, \mathcal{H})$. Similarly, an algorithm that solves supervised learning subtasks maps $\mathcal{T} = (\mathcal{D}, \ell, \mathcal{H})$ to a hypothesis $h \in \mathcal{H}$, aiming at minimizing the risk $R(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$ over $h \in \mathcal{H}$. In what follows, we denote by $H_{\mathcal{T}}$ the objective of a learning subtask $\mathcal{T}$: if $\mathcal{T}$ is a reinforcement learning subtask, we have $H_{\mathcal{T}} = J$, and if $\mathcal{T}$ is a supervised learning subtask, we have $H_{\mathcal{T}} = R$. The goal of the meta-learner is to find $\theta \in \Theta$ that optimizes the population version of the meta-objective $L(\theta)$, which is defined as

$$L(\theta) = \mathbb{E}_{\mathcal{T} \sim \iota}\Big[H_{\mathcal{T}}\big(A_\theta(\mathcal{T})\big)\Big]. \qquad (2.6)$$

To approximately optimize $L$ defined in (2.6) based on the sample $\{\mathcal{T}_i\}_{i \in [n]}$ of subtasks, the meta-learner optimizes the following empirical version of the meta-objective,

$$\frac{1}{n} \sum_{i=1}^{n} H_{\mathcal{T}_i}\big(A_\theta(\mathcal{T}_i)\big). \qquad (2.7)$$
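To make the bilevel structure in (2.6)-(2.7) concrete, the following minimal Python sketch evaluates the empirical meta-objective for a toy collection of supervised subtasks with the squared loss and a linear hypothesis class. The function names, the identity feature mapping, and the synthetic data are illustrative choices for this sketch rather than part of the setup above.

```python
import numpy as np

def meta_objective(theta, tasks, algo, task_objective):
    """Empirical meta-objective (2.7): average of H_{T_i}(A_theta(T_i)) over subtasks."""
    return float(np.mean([task_objective(task, algo(theta, task)) for task in tasks]))

# Toy supervised-learning instantiation (all names and data below are illustrative).
# A subtask is a pair (X, y); the hypothesis is linear, h_theta(x) = phi(x)^T theta with phi(x) = x.
def risk(task, theta):
    X, y = task
    return np.mean((X @ theta - y) ** 2)          # R_i(h_theta) with the squared loss

def grad_risk(task, theta):
    X, y = task
    return 2.0 * X.T @ (X @ theta - y) / len(y)   # nabla_theta R_i(h_theta)

def one_step_gd(theta, task, eta=0.1):
    return theta - eta * grad_risk(task, theta)   # A_theta(T_i): adapt the shared prior

rng = np.random.default_rng(0)
tasks = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(5)]
print(meta_objective(np.zeros(3), tasks, one_step_gd, risk))
```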
The algorithm $A_{\theta^\star}$ corresponding to the global optimum $\theta^\star$ of (2.7) incorporates the past experience through the observed learning subtasks $\{\mathcal{T}_i\}_{i \in [n]}$, and therefore facilitates the learning of a new subtask (Pentina & Lampert, 2014; Finn et al., 2017a; Amit & Meir, 2017; Yoon et al., 2018). As an example, in model-agnostic meta-learning (MAML) (Finn et al., 2017a) for supervised learning, the hypothesis class $\mathcal{H}$ is parameterized by $h_\theta$ with $\theta \in \Theta$, and the algorithm $A_\theta$ performs one step of gradient descent with $\theta \in \Theta$ as the starting point. In this setting, MAML aims to find the globally optimal starting point $\theta^\star$ by minimizing the following meta-objective via gradient descent,

$$\frac{1}{n} \sum_{i=1}^{n} R_i\big(h_{\theta - \eta \nabla_\theta R_i(h_\theta)}\big),$$

where $\eta$ is the learning rate of $A_\theta$ and $R_i(h) = \mathbb{E}_{z \sim \mathcal{D}_i}[\ell(h, z)]$ is the risk of the supervised learning subtask $\mathcal{T}_i = (\mathcal{D}_i, \ell, \mathcal{H})$. Similarly, in MAML for reinforcement learning, the algorithm $A_\theta$ performs, e.g., one step of policy gradient with $\theta$ as the starting point. We call $\pi_\theta$ the main effect in the sequel. MAML aims to find the globally optimal main effect $\pi_{\theta^\star}$ by maximizing the following meta-objective via gradient ascent,

$$\frac{1}{n} \sum_{i=1}^{n} J_i\big(\pi_{\theta + \eta \nabla_\theta J_i(\pi_\theta)}\big),$$

where $\eta$ is the learning rate of $A_\theta$ and $J_i$ is the expected total reward of the reinforcement learning subtask $\mathcal{T}_i = (\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)$.

## 3. Meta-Reinforcement Learning

In this section, we present the analysis of meta-reinforcement learning (meta-RL). We first define the detailed problem setup of meta-RL and propose a meta-RL algorithm. We then characterize the global optimality of the stationary point attained by such an algorithm. We defer the analysis of meta-RL with neural network parameterization to Appendix A.2.

### 3.1. Problem Setup and Algorithm

In meta-RL, the meta-learner observes a sample of MDPs $\{(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)\}_{i \in [n]}$ drawn independently from a task distribution $\iota$. We set the algorithm $A_\theta$ that optimizes the policy to be one step of (a variant of) proximal policy optimization (PPO) (Schulman et al., 2015; 2017) starting from the main effect $\pi_\theta$. More specifically, $A_\theta$ solves the following maximization problem,

$$A_\theta(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i) = \underset{\pi}{\mathrm{argmax}}\ \mathbb{E}_{s \sim \nu_{i, \pi_\theta}}\Big[\big\langle Q^{\pi_\theta}_i(s, \cdot),\, \pi(\cdot \mid s)\big\rangle - 1/\eta \cdot D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\Big]. \qquad (3.1)$$

Here $\langle \cdot, \cdot \rangle$ is the inner product over $\mathbb{R}^{|\mathcal{A}|}$, $\eta$ is the tuning parameter of $A_\theta$, and $Q^{\pi_\theta}_i$ and $\nu_{i, \pi_\theta}$ are the action-value function and the state visitation measure, respectively, corresponding to the MDP $(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)$ and the policy $\pi_\theta$. Note that the objective in (3.1) has $D_{\mathrm{KL}}(\pi(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s))$ in place of $D_{\mathrm{KL}}(\pi_\theta(\cdot \mid s) \,\|\, \pi(\cdot \mid s))$ compared with the original version of PPO (Schulman et al., 2015; 2017). As shown by Liu et al. (2019), such a variant of PPO enjoys global optimality and convergence.

We parameterize the main effect $\pi_\theta$ as the following energy-based policy (Haarnoja et al., 2017) for all $(s, a) \in \mathcal{S} \times \mathcal{A}$,

$$\pi_\theta(a \mid s) = \frac{\exp\big(1/\tau \cdot \phi(s, a)^\top \theta\big)}{\sum_{a' \in \mathcal{A}} \exp\big(1/\tau \cdot \phi(s, a')^\top \theta\big)}, \qquad (3.2)$$

where $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is the feature mapping, $\theta \in \mathbb{R}^d$ is the parameter, $\phi(\cdot, \cdot)^\top \theta$ is the energy function, and $\tau$ is the temperature parameter. The maximizer $\pi_{i,\theta} = A_\theta(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)$ defined in (3.1) then takes the following form (Liu et al., 2019, Proposition 3.1) for all $s \in \mathcal{S}$,

$$\pi_{i,\theta}(\cdot \mid s) \propto \exp\big(1/\tau \cdot \phi(s, \cdot)^\top \theta + \eta\, Q^{\pi_\theta}_i(s, \cdot)\big). \qquad (3.3)$$

The goal of meta-RL is to find the globally optimal main effect $\pi_{\theta^\star}$ by maximizing the following meta-objective,

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} J_i(\pi_{i,\theta}), \qquad \text{where } \pi_{i,\theta} = A_\theta(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i). \qquad (3.4)$$

Here $J_i$ is the expected total reward defined in (2.5) corresponding to the MDP $(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)$ for all $i \in [n]$.
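The inner-level update of this algorithm is available in closed form: (3.2) is a softmax over the energies $\phi(s, a)^\top \theta/\tau$, and (3.3) simply shifts those energies by $\eta\, Q^{\pi_\theta}_i(s, \cdot)$. The short Python sketch below spells this out for a finite action space; the feature matrix, the action-value vector, and the dimensions are made up purely for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def main_effect(theta, phi_s, tau):
    """Energy-based main effect (3.2); phi_s has shape (|A|, d), one row phi(s, a) per action."""
    return softmax(phi_s @ theta / tau)

def adapted_policy(theta, phi_s, q_s, tau, eta):
    """Closed-form inner PPO step (3.3), given Q_i^{pi_theta}(s, .) as the vector q_s."""
    return softmax(phi_s @ theta / tau + eta * q_s)

# Illustrative usage with made-up dimensions: d = 4 features, |A| = 3 actions.
rng = np.random.default_rng(0)
theta, phi_s, q_s = rng.normal(size=4), rng.normal(size=(3, 4)), rng.normal(size=3)
print(main_effect(theta, phi_s, tau=1.0))
print(adapted_policy(theta, phi_s, q_s, tau=1.0, eta=0.5))
```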
To maximize $L(\theta)$, we use gradient ascent, which iteratively updates $\theta$ as follows,

$$\theta_{\ell+1} \leftarrow \theta_\ell + \alpha_\ell\, \nabla_\theta L(\theta_\ell), \qquad \text{for } \ell = 0, 1, \ldots, T-1, \qquad (3.5)$$

where $\nabla_\theta L(\theta_\ell)$ is the gradient of the meta-objective at $\theta_\ell$, $\alpha_\ell$ is the learning rate at the $\ell$-th iteration, and $T$ is the number of iterations. It remains to calculate the gradient $\nabla_\theta L(\theta)$. To this end, we first define the state-action visitation measures induced by the main effect $\pi_\theta$, and then calculate $\nabla_\theta L(\theta)$ in closed form based on such state-action visitation measures.

**Definition 3.1** (Visitation Measures of Main Effect). For all $i \in [n]$, given the MDP $(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)$ and the main effect $\pi_\theta$, we denote by $\sigma_{i, \pi_\theta}$ the state-action visitation measure induced by the main effect $\pi_\theta$. We further define the state-action visitation measure $\sigma^{(s,a)}_{i, \pi_\theta}$ initialized at $(s, a) \in \mathcal{S} \times \mathcal{A}$ as follows,

$$\sigma^{(s,a)}_{i, \pi_\theta}(s', a') = (1 - \gamma_i) \sum_{t=0}^{\infty} \gamma_i^t\, \mathbb{P}(s_t = s', a_t = a'),$$

where $(s', a') \in \mathcal{S} \times \mathcal{A}$, $s_0 \sim P_i(\cdot \mid s, a)$, $s_{t+1} \sim P_i(\cdot \mid s_t, a_t)$, and $a_t \sim \pi_\theta(\cdot \mid s_t)$ for all $t \ge 0$.

In other words, given the transition kernel $P_i$ and the discount factor $\gamma_i$, $\sigma^{(s,a)}_{i, \pi_\theta}$ is the state-action visitation measure induced by the main effect $\pi_\theta$ when the initial state distribution is given by $s_0 \sim P_i(\cdot \mid s, a)$. Based on the policy gradient theorem (Sutton & Barto, 2018), the following proposition calculates the gradient of the meta-objective $L$ defined in (3.4) with respect to the parameter $\theta$ of the main effect $\pi_\theta$.

**Proposition 3.2** (Gradient of Meta-Objective). It holds for all $\theta \in \mathbb{R}^d$ that

$$\nabla_\theta L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{(s,a) \sim \sigma_{\pi_{i,\theta}}}\Big[h_{i,\theta}(s, a)\, A^{\pi_{i,\theta}}_i(s, a)\Big], \qquad (3.7)$$

where the auxiliary function $h_{i,\theta}$ takes the form of

$$h_{i,\theta}(s, a) = 1/\tau \cdot \phi(s, a) + \eta \gamma_i/\tau \cdot \mathbb{E}_{(s', a') \sim \sigma^{(s,a)}_{i, \pi_\theta}}\Big[\phi(s', a')\, A^{\pi_\theta}_i(s', a')\Big]. \qquad (3.8)$$

Here $A^{\pi_{i,\theta}}_i$ and $A^{\pi_\theta}_i$ are the advantage functions of the policy $\pi_{i,\theta}$ and the main effect $\pi_\theta$, respectively, both corresponding to the MDP $(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)$. Also, $\sigma^{(s,a)}_{i, \pi_\theta}$ is the state-action visitation measure induced by the main effect $\pi_\theta$ defined in Definition 3.1, and $\sigma_{\pi_{i,\theta}}$ is the state-action visitation measure induced by the policy $\pi_{i,\theta}$, both corresponding to the MDP $(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)$.

*Proof.* See Appendix D.1 for a detailed proof.

In the sequel, we assume without loss of generality that the action-value function $Q^\pi$ is available once we obtain the policy $\pi$, and that the expectations over the state-action visitation measures in (3.7) and (3.8) of Proposition 3.2 are available once we obtain the policies $\{\pi_{i,\theta}\}_{i \in [n]}$ and the main effect $\pi_\theta$. We summarize meta-RL in Algorithm 1. In practice, we can estimate the action-value functions by temporal difference learning (Sutton, 1988) and the expectations over the visitation measures by Monte Carlo sampling (Konda, 2002).

**Algorithm 1** Meta-RL

Require: Sampled MDPs $\{(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)\}_{i \in [n]}$ from the task distribution $\iota$, feature mapping $\phi$, number of iterations $T$, learning rates $\{\alpha_\ell\}$, temperature parameter $\tau$, tuning parameter $\eta$, initial parameter $\theta_0$.

1: Initialization.
2: for $\ell = 0, \ldots, T-1$ do
3: &emsp;for $i \in [n]$ do
4: &emsp;&emsp;Update the policy: $\pi_{i,\theta_\ell}(\cdot \mid s) \propto \exp\big(1/\tau \cdot \phi(s, \cdot)^\top \theta_\ell + \eta\, Q^{\pi_{\theta_\ell}}_i(s, \cdot)\big)$.
5: &emsp;&emsp;Compute the auxiliary function $h_{i,\theta_\ell}(s, a)$ via (3.8).
6: &emsp;end for
7: &emsp;Compute the gradient of the meta-objective $\nabla_\theta L(\theta_\ell)$ based on the policies $\{\pi_{i,\theta_\ell}\}_{i \in [n]}$ and the auxiliary functions $\{h_{i,\theta_\ell}\}_{i \in [n]}$ via (3.7).
8: &emsp;Update the parameter of the main effect: $\theta_{\ell+1} \leftarrow \theta_\ell + \alpha_\ell\, \nabla_\theta L(\theta_\ell)$.
9: &emsp;Update the main effect: $\pi_{\theta_{\ell+1}}(\cdot \mid s) \propto \exp\big(1/\tau \cdot \phi(s, \cdot)^\top \theta_{\ell+1}\big)$.
10: end for
11: Output: $\theta_T$ and $\pi_{\theta_T}$.
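For readability, the following Python skeleton mirrors the outer loop of Algorithm 1 under the oracle assumptions stated above. The helpers `q_oracle` and `grad_oracle` are hypothetical placeholders for the action-value functions $Q^{\pi_{\theta_\ell}}_i$ and for the meta-gradient of (3.7)-(3.8); in practice they would be replaced by temporal-difference learning and Monte Carlo estimation.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def meta_rl(theta0, tasks, phi, tau, eta, lrs, q_oracle, grad_oracle):
    """Sketch of the outer loop of Algorithm 1 (lines 2-10).
    q_oracle(task, theta) -> function s |-> Q_i^{pi_theta}(s, .) over actions;
    grad_oracle(theta, policies) -> estimate of nabla_theta L(theta) via (3.7)-(3.8)."""
    theta = theta0
    for lr in lrs:
        policies = []
        for task in tasks:
            q = q_oracle(task, theta)
            # Inner level: one PPO step per subtask, in closed form (3.3).
            policies.append(lambda s, q=q, th=theta: softmax(phi(s) @ th / tau + eta * q(s)))
        # Outer level: gradient ascent on the meta-objective (3.5).
        theta = theta + lr * grad_oracle(theta, policies)
    return theta
```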
### 3.2. Theoretical Results

In this section, we analyze the global optimality of the $\epsilon$-stationary point attained by meta-RL (Algorithm 1). In the sequel, we assume that the reward functions $\{r_i\}_{i \in [n]}$ are upper bounded by an absolute constant $Q_{\max} > 0$ in absolute value. It then follows from (2.1) and (2.2) that $|V^\pi_i(s)|$ and $|Q^\pi_i(s, a)|$ are upper bounded by $Q_{\max}$ for all $i \in [n]$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$. Here we define $Q^\pi_i$ and $V^\pi_i$ as the action- and state-value functions of the policy $\pi$, respectively, corresponding to the MDP $(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)$. To analyze the global optimality of meta-RL, we define the following meta-visitation measures induced by the main effect $\pi_\theta$.

**Definition 3.3** (Meta-Visitation Measures). We define the joint meta-visitation measure $\rho_{i, \pi_\theta}$ over $(s', a', s, a) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times \mathcal{A}$ induced by the main effect $\pi_\theta$ and the policy $\pi_{i,\theta}$ as follows,

$$\rho_{i, \pi_\theta}(s', a', s, a) = \sigma^{(s,a)}_{i, \pi_\theta}(s', a') \cdot \sigma_{\pi_{i,\theta}}(s, a). \qquad (3.9)$$

We further define the meta-visitation measure $\varsigma_{i, \pi_\theta}$ as the marginal distribution of $(s', a') \in \mathcal{S} \times \mathcal{A}$ under the joint meta-visitation measure $\rho_{i, \pi_\theta}$, that is,

$$\varsigma_{i, \pi_\theta}(s', a') = \mathbb{E}_{(s,a) \sim \sigma_{\pi_{i,\theta}}}\Big[\sigma^{(s,a)}_{i, \pi_\theta}(s', a')\Big]. \qquad (3.10)$$

In addition, for $(s', a') \in \mathcal{S} \times \mathcal{A}$ we define the mixed meta-visitation measure $\varrho_{\pi_\theta}$ over all the subtasks as follows,

$$\varrho_{\pi_\theta}(s', a') = \frac{1}{n} \sum_{i=1}^{n} \varsigma_{i, \pi_\theta}(s', a'). \qquad (3.11)$$

In other words, the meta-visitation measure $\varsigma_{i, \pi_\theta}$ is the state-action visitation measure induced by $\pi_\theta$ given the transition kernel $P_i$, the discount factor $\gamma_i$, and the initial state distribution $s_0 \sim \mathbb{E}_{(s,a) \sim \sigma_{\pi_{i,\theta}}}[P_i(\cdot \mid s, a)]$. In what follows, we impose an assumption on the meta-visitation measures defined in Definition 3.3.

**Assumption 3.4** (Regularity Condition on Meta-Visitation Measures). We assume for all $\theta \in \mathbb{R}^d$ and $i \in [n]$ that

$$\mathbb{E}_{(s',a') \sim \varrho_{\pi_\theta}}\Big[\big|\mathrm{d}\sigma_{\pi_{i,\theta}}/\mathrm{d}\varrho_{\pi_\theta}(s', a')\big|^2\Big] \le C_0^2, \qquad (3.12)$$

$$\mathbb{E}_{(s',a') \sim \varrho_{\pi_\theta}}\Big[\big|\mathrm{d}\varsigma_{i, \pi_\theta}/\mathrm{d}\varrho_{\pi_\theta}(s', a')\big|^2\Big] \le C_0^2, \qquad (3.13)$$

where $C_0 > 0$ is an absolute constant. Here $\varsigma_{i, \pi_\theta}$ and $\varrho_{\pi_\theta}$ are the meta-visitation measure and the mixed meta-visitation measure induced by the main effect $\pi_\theta$, which are defined in (3.10) and (3.11) of Definition 3.3, respectively. Meanwhile, $\sigma_{\pi_{i,\theta}}$ is the state-action visitation measure induced by the policy $\pi_{i,\theta}$, which is defined in (2.4). Here $\mathrm{d}\sigma_{\pi_{i,\theta}}/\mathrm{d}\varrho_{\pi_\theta}$ and $\mathrm{d}\varsigma_{i, \pi_\theta}/\mathrm{d}\varrho_{\pi_\theta}$ are the Radon-Nikodym derivatives.

According to (3.11) of Definition 3.3, the upper bound in (3.12) of Assumption 3.4 holds if the $L^2(\varrho_{\pi_\theta})$-norm of $\mathrm{d}\sigma_{\pi_{i,\theta}}/\mathrm{d}\varsigma_{j, \pi_\theta}$ is upper bounded by $C_0$ for all $i, j \in [n]$. For $i = j$, note that $\pi_{i,\theta}$ is obtained by one step of PPO with $\pi_\theta$ as the starting point. Thus, for a sufficiently small tuning parameter $\eta$ in (3.3), $\pi_{i,\theta}$ is close to $\pi_\theta$. Hence, the assumption that $\mathrm{d}\sigma_{\pi_{i,\theta}}/\mathrm{d}\varsigma_{j, \pi_\theta}$ has an upper bounded $L^2(\varrho_{\pi_\theta})$-norm for $i = j$ is a mild regularity condition. For $i \neq j$, to ensure the upper bound on the $L^2(\varrho_{\pi_\theta})$-norms of $\mathrm{d}\sigma_{\pi_{i,\theta}}/\mathrm{d}\varsigma_{j, \pi_\theta}$ in (3.12), Assumption 3.4 requires the task distribution $\iota$ to generate similar MDPs so that the meta-visitation measures $\{\varsigma_{i, \pi_\theta}\}_{i \in [n]}$ are similar across all the subtasks indexed by $i \in [n]$. Similarly, to ensure the upper bound in (3.13), Assumption 3.4 also requires that the meta-visitation measures $\{\varsigma_{i, \pi_\theta}\}_{i \in [n]}$ are similar across all the subtasks.

The following theorem characterizes the optimality gap of the $\epsilon$-stationary point attained by meta-RL (Algorithm 1). Let $\theta^\star$ be a global maximizer of the meta-objective $L(\theta)$ defined in (3.4). For all $(s', a') \in \mathcal{S} \times \mathcal{A}$ and $\omega \in \mathbb{R}^d$, we define

$$f_\omega(s', a') = \sum_{i=1}^{n} \bigg[\frac{A^{\pi_{i,\omega}}_i(s', a')}{1 - \gamma_i} \cdot \frac{\mathrm{d}\sigma_{\pi_{i,\theta^\star}}}{\mathrm{d}\varrho_{\pi_\omega}}(s', a') - g_{i,\omega}(s', a') \cdot \frac{\mathrm{d}\varsigma_{i, \pi_\omega}}{\mathrm{d}\varrho_{\pi_\omega}}(s', a')\bigg], \qquad (3.14)$$

where we define $g_{i,\omega}$ as follows,

$$g_{i,\omega}(s', a') = 1/\tau \cdot A^{\pi_{i,\omega}}_i(s', a') \cdot \big(\mathrm{d}\sigma_{\pi_{i,\omega}}/\mathrm{d}\varsigma_{i, \pi_\omega}\big)(s', a') + \gamma_i \eta/\tau \cdot G_{i, \pi_\omega}(s', a') \cdot A^{\pi_\omega}_i(s', a').$$
Here $\tau$ is the temperature parameter in (3.2), $\eta$ is the tuning parameter defined in (3.1), $A^{\pi_{i,\omega}}_i$ and $A^{\pi_\omega}_i$ are the advantage functions of the policy $\pi_{i,\omega}$ and the main effect $\pi_\omega$, respectively, corresponding to the MDP $(\mathcal{S}, \mathcal{A}, P_i, r_i, \gamma_i, \zeta_i)$, and $G_{i, \pi_\omega}$ is defined as follows,

$$G_{i, \pi_\omega}(s', a') = \mathbb{E}_{(s', a', s, a) \sim \rho_{i, \pi_\omega}}\Big[A^{\pi_{i,\omega}}_i(s, a)\,\Big|\, s', a'\Big], \qquad (3.15)$$

where $\rho_{i, \pi_\omega}$ is the joint meta-visitation measure defined in (3.9) of Definition 3.3.

**Theorem 3.5** (Optimality Gap of $\epsilon$-Stationary Point). Under Assumption 3.4, for all $R > 0$, $\omega \in \mathbb{R}^d$, and $\epsilon > 0$ such that

$$\nabla_\omega L(\omega)^\top v \le \epsilon, \qquad \forall v \in \mathcal{B} = \{\theta \in \mathbb{R}^d : \|\theta\|_2 \le 1\}, \qquad (3.16)$$

it holds that

$$L(\theta^\star) - L(\omega) \le R\,\epsilon + C \cdot \inf_{v \in \mathcal{B}_R}\big\|f_\omega(\cdot, \cdot) - \phi(\cdot, \cdot)^\top v\big\|_{\varrho_{\pi_\omega}}, \qquad (3.17)$$

where $\mathcal{B}_R = \{\theta \in \mathbb{R}^d : \|\theta\|_2 \le R\}$, $\bar\gamma = (\sum_{i=1}^{n} \gamma_i)/n$, and $C = 2 C_0 Q_{\max}/\tau \cdot (1 + 2 Q_{\max} \bar\gamma \eta)$. Here $C_0$ is defined in Assumption 3.4, $\tau$ is the temperature parameter in (3.2), $\eta$ is the tuning parameter defined in (3.1), and $Q_{\max}$ is the upper bound of the reward functions $\{r_i\}_{i \in [n]}$ in absolute value.

*Proof.* See Appendix C.1 for a detailed proof.

By Theorem 3.5, the global optimality of the $\epsilon$-stationary point $\omega$ hinges on the representation power of the linear class $\{\phi(\cdot, \cdot)^\top \theta : \theta \in \mathcal{B}_R\}$. More specifically, if the function $f_\omega$ defined in (3.14) is well approximated by $\phi(\cdot, \cdot)^\top \theta$ for a parameter $\theta \in \mathcal{B}_R$, then $\omega$ is approximately globally optimal.

## 4. Meta-Supervised Learning

In this section, we present the analysis of meta-supervised learning (meta-SL). We first define the detailed problem setup of meta-SL and present a meta-SL algorithm. We then characterize the global optimality of the stationary point attained by such an algorithm. We defer the analysis of meta-SL with neural network parameterization to Appendix A.3.

### 4.1. Problem Setup and Algorithm

In meta-SL, the meta-learner observes a sample of supervised learning subtasks $\{(\mathcal{D}_i, \ell, \mathcal{H})\}_{i \in [n]}$ drawn independently from a task distribution $\iota$. Specifically, each subtask $(\mathcal{D}_i, \ell, \mathcal{H})$ consists of a distribution $\mathcal{D}_i$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} \subseteq \mathbb{R}$, a loss function $\ell : \mathcal{H} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, and a hypothesis class $\mathcal{H}$. Each hypothesis $h \in \mathcal{H}$ is a mapping from $\mathcal{X}$ to $\mathcal{Y}$. The goal of the supervised learning subtask $(\mathcal{D}_i, \ell, \mathcal{H})$ is to obtain the following hypothesis,

$$h^\star_i = \underset{h \in \mathcal{H}}{\mathrm{argmin}}\ R_i(h) = \underset{h \in \mathcal{H}}{\mathrm{argmin}}\ \mathbb{E}_{z \sim \mathcal{D}_i}\big[\ell(h, z)\big], \qquad (4.1)$$

where $R_i(h) = \mathbb{E}_{z \sim \mathcal{D}_i}[\ell(h, z)]$ is the risk of $h \in \mathcal{H}$. To approximately attain the minimizer defined in (4.1), we parameterize the hypothesis class $\mathcal{H}$ by $\mathcal{H}_\theta$ with a feature mapping $\phi : \mathcal{X} \to \mathbb{R}^d$ as follows,

$$\mathcal{H}_\theta = \big\{h_\theta(\cdot) = \phi(\cdot)^\top \theta : \theta \in \mathbb{R}^d\big\}, \qquad (4.2)$$

and minimize $R_i(h_\theta)$ over $\theta \in \mathbb{R}^d$. We set the algorithm $A_\theta$ in (2.7), which solves $(\mathcal{D}_i, \ell, \mathcal{H})$, to be one step of gradient descent with the starting point $\theta$, that is,

$$A_\theta(\mathcal{D}_i, \ell, \mathcal{H}) = h_{\theta - \eta \nabla_\theta R_i(h_\theta)}. \qquad (4.3)$$

Here $\eta$ is the learning rate of $A_\theta$. The goal of meta-SL is to minimize the following meta-objective,

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} R_i(h_{\theta_i}), \qquad \text{where } h_{\theta_i} = A_\theta(\mathcal{D}_i, \ell, \mathcal{H}). \qquad (4.4)$$

To minimize $L(\theta)$ defined in (4.4), we adopt gradient descent, which iteratively updates $\theta_\ell$ as follows,

$$\theta_{\ell+1} \leftarrow \theta_\ell - \alpha_\ell\, \nabla_\theta L(\theta_\ell), \qquad \text{for } \ell = 0, 1, \ldots, T-1. \qquad (4.5)$$

Here $\nabla_\theta L(\theta_\ell)$ is the gradient of the meta-objective at $\theta_\ell$, $\alpha_\ell$ is the learning rate at the $\ell$-th iteration, and $T$ is the number of iterations. Fallah et al. (2019) show that the update defined in (4.5) converges to an $\epsilon$-stationary point of the meta-objective $L$ under a smoothness assumption on $L$. In what follows, we characterize the optimality gap of such an $\epsilon$-stationary point.
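The updates (4.3)-(4.5) admit a compact implementation. Below is a minimal sketch, assuming the squared loss, the identity feature mapping $\phi(x) = x$, and synthetic data (all of which are illustrative choices): differentiating through the one-step inner adaptation with JAX automatically accounts for the second-order factor through the inner step (cf. the term $I_d - \eta \nabla^2_\omega \ell$ in (4.15) below).

```python
import jax
import jax.numpy as jnp

def risk(theta, task):                 # R_i(h_theta) with the squared loss and phi(x) = x
    X, y = task
    return jnp.mean((X @ theta - y) ** 2)

def adapt(theta, task, eta=0.1):       # A_theta(D_i, l, H): one inner gradient step (4.3)
    return theta - eta * jax.grad(risk)(theta, task)

def meta_objective(theta, tasks):      # L(theta) in (4.4)
    return jnp.mean(jnp.stack([risk(adapt(theta, t), t) for t in tasks]))

# Outer-level gradient descent (4.5); jax.grad differentiates through the inner step,
# so the correction (I - eta * Hessian of the task risk) is handled automatically.
key = jax.random.PRNGKey(0)
tasks = [(jax.random.normal(jax.random.fold_in(key, i), (20, 3)),
          jax.random.normal(jax.random.fold_in(key, 100 + i), (20,))) for i in range(5)]
theta = jnp.zeros(3)
for _ in range(50):
    theta = theta - 0.05 * jax.grad(meta_objective)(theta, tasks)
```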
We first introduce the Fréchet differentiability of the risk $R_i$ in (4.1).

**Definition 4.1** (Fréchet Differentiability). Let $\mathcal{H}$ be a Banach space with the norm $\|\cdot\|_{\mathcal{H}}$. A functional $R : \mathcal{H} \to \mathbb{R}$ is Fréchet differentiable at $h \in \mathcal{H}$ if there exists a bounded linear operator $A : \mathcal{H} \to \mathbb{R}$ such that

$$\lim_{\|h_1\|_{\mathcal{H}} \to 0} \frac{\big|R(h + h_1) - R(h) - A(h_1)\big|}{\|h_1\|_{\mathcal{H}}} = 0. \qquad (4.6)$$

We define $A$ as the Fréchet derivative of $R$ at $h \in \mathcal{H}$ and write

$$D_h R(\cdot) = A(\cdot). \qquad (4.7)$$

In what follows, we assume that the hypothesis class $\mathcal{H}$ equipped with the $L^2(\rho)$-inner product is a Hilbert space, where $\rho$ is a distribution over $\mathcal{X}$. Thus, following from the definition of the Fréchet derivative in Definition 4.1 and the Riesz representation theorem (Rudin, 2006), it holds for some $a_h \in \mathcal{H}$ that

$$D_h R(\cdot) = A(\cdot) = \langle \cdot, a_h \rangle_{\mathcal{H}}, \qquad (4.8)$$

where we denote by $\langle f, g \rangle_{\mathcal{H}} = \int_{\mathcal{X}} f(x)\, g(x)\, \mathrm{d}\rho(x)$ the $L^2(\rho)$-inner product. In what follows, we write

$$(\delta R/\delta h)(x) = a_h(x), \qquad \forall x \in \mathcal{X},\ h \in \mathcal{H}. \qquad (4.9)$$

We refer to Appendix B for an example of the Fréchet derivative defined in (4.9). We assume that $\mathcal{H}$ contains the parameterized hypothesis class $\mathcal{H}_\theta$ defined in (4.2), and impose the following assumption on the convexity and the Fréchet differentiability of the risk $R_i$ in (4.1).

**Assumption 4.2** (Convex and Differentiable Risk). We assume for all $i \in [n]$ that the risk $R_i$ defined in (4.1) is convex and Fréchet differentiable on $\mathcal{H}$.

Assumption 4.2 is a mild regularity condition on the risk $R_i$, which holds for the risks induced by commonly used loss functions, such as the squared loss and the cross-entropy loss. Specifically, the convexity of $R_i$ holds if the loss function $\ell(h, z)$ is convex in $h \in \mathcal{H}$ for all $z \in \mathcal{Z}$ (Rockafellar, 1968). The following proposition holds under Assumption 4.2.

**Proposition 4.3** (Convex and Differentiable Risk (Ekeland & Temam, 1999)). Under Assumption 4.2, it holds for all $i \in [n]$ that

$$R_i(h_1) \ge R_i(h_2) + \langle \delta R_i/\delta h_2,\, h_1 - h_2 \rangle_{\mathcal{H}}, \qquad \forall h_1, h_2 \in \mathcal{H}.$$

*Proof.* See Ekeland & Temam (1999) for a detailed proof.

We highlight that the convexity of the risks over $h \in \mathcal{H}$ does not imply the convexity of the meta-objective defined in (4.4). In contrast, Proposition 4.3 characterizes the functional geometry of the risk $R_i$ in the Hilbert space $\mathcal{H}$ for all $i \in [n]$, which allows us to analyze the global optimality of meta-SL in the sequel.

### 4.2. Theoretical Results

In this section, we characterize the global optimality of the $\epsilon$-stationary point attained by meta-SL defined in (4.5). Let $\theta^\star$ be a global minimizer of the meta-objective $L(\theta)$ defined in (4.4), and let $\omega$ be the $\epsilon$-stationary point attained by meta-SL such that

$$\nabla_\omega L(\omega)^\top v \le \epsilon, \qquad \forall v \in \mathcal{B} = \{\theta \in \mathbb{R}^d : \|\theta\|_2 \le 1\}. \qquad (4.10)$$

Our goal is to upper bound the optimality gap $L(\omega) - L(\theta^\star)$. To this end, we first define the mixed distribution $\mathcal{M}$ of the distributions $\{\mathcal{D}_i\}_{i \in [n]}$ as follows,

$$\mathcal{M}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{D}_i(x, y), \qquad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}. \qquad (4.11)$$

To simplify the notation, we write $\omega_i$ and $\theta^\star_i$ for the parameters that correspond to the outputs of the algorithms $A_\omega(\mathcal{D}_i, \ell, \mathcal{H})$ and $A_{\theta^\star}(\mathcal{D}_i, \ell, \mathcal{H})$, respectively. More specifically, according to (4.3), we have

$$\omega_i = \omega - \eta \nabla_\omega R_i(h_\omega), \qquad \theta^\star_i = \theta^\star - \eta \nabla_{\theta^\star} R_i(h_{\theta^\star}), \qquad \forall i \in [n], \qquad (4.12)$$

where $\eta$ is the learning rate of the algorithms $A_\omega(\mathcal{D}_i, \ell, \mathcal{H})$ and $A_{\theta^\star}(\mathcal{D}_i, \ell, \mathcal{H})$. The following theorem characterizes the optimality gap of the $\epsilon$-stationary point $\omega$ attained by meta-SL. We define for all $(x, y, x') \in \mathcal{X} \times \mathcal{Y} \times \mathcal{X}$ that

$$w(x, y, x') = \frac{1}{n} \sum_{i=1}^{n} \frac{\delta R_i}{\delta h_{\omega_i}}(x') \cdot \frac{\mathrm{d}\mathcal{D}_i}{\mathrm{d}\mathcal{M}}(x, y), \qquad (4.13)$$

$$u(x, y, x') = \frac{\frac{1}{n} \sum_{i=1}^{n} \frac{\delta R_i}{\delta h_{\omega_i}}(x') \cdot \big(h_{\omega_i}(x') - h_{\theta^\star_i}(x')\big)}{w(x, y, x')}, \qquad (4.14)$$

$$\phi_{\ell,\omega}(x, y, x') = \big(I_d - \eta\, \nabla^2_\omega \ell\big(h_\omega(x), (x, y)\big)\big)\, \phi(x'), \qquad (4.15)$$

where $\mathrm{d}\mathcal{D}_i/\mathrm{d}\mathcal{M}$ is the Radon-Nikodym derivative and $\delta R_i/\delta h_{\omega_i}$ is the Fréchet derivative defined in (4.9).

**Theorem 4.4** (Optimality Gap of $\epsilon$-Stationary Point). Let $\theta^\star$ be a global minimizer of $L(\theta)$. Also, let $\omega$ be the $\epsilon$-stationary point defined in (4.10).
Assume that $\ell(h_\theta(x), (x, y))$ is twice differentiable with respect to $\theta$ for all $\theta \in \mathbb{R}^d$ and $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Under Assumption 4.2, it holds for all $R > 0$ that

$$L(\omega) - L(\theta^\star) \le \underbrace{R\,\epsilon}_{\text{(i)}} + \underbrace{\|w\|_{\mathcal{M} \otimes \rho}}_{\text{(ii)}} \cdot \underbrace{\inf_{v \in \mathcal{B}_R}\big\|u(\cdot) - \phi_{\ell,\omega}(\cdot)^\top v\big\|_{\mathcal{M} \otimes \rho}}_{\text{(iii)}}, \qquad (4.16)$$

where we define $\mathcal{B}_R = \{\theta \in \mathbb{R}^d : \|\theta\|_2 \le R\}$ as the ball with radius $R$ and

$$\|w\|_{\mathcal{M} \otimes \rho} = \Big(\int w^2(x, y, x')\, \mathrm{d}\mathcal{M}(x, y)\, \mathrm{d}\rho(x')\Big)^{1/2}$$

as the $L^2(\mathcal{M} \otimes \rho)$-norm of $w$.

*Proof.* See Appendix C.2 for a detailed proof.

By Theorem 4.4, the optimality gap of the $\epsilon$-stationary point $\omega$ hinges on the three terms on the right-hand side of (4.16). Term (i) characterizes the deviation of the $\epsilon$-stationary point $\omega$ from a stationary point. Term (ii) characterizes the difficulty of the subtasks sampled from the task distribution $\iota$. Specifically, given the $\epsilon$-stationary point $\omega$, if the output $h_{\omega_i}$ of $A_\omega(\mathcal{D}_i, \ell, \mathcal{H})$ well approximates the minimizer of the risk $R_i$ in (4.1), then the Fréchet derivative $\delta R_i/\delta h_{\omega_i}$ defined in (4.9) is close to zero. Meanwhile, the Radon-Nikodym derivative $\mathrm{d}\mathcal{D}_i/\mathrm{d}\mathcal{M}$ characterizes the deviation of the distribution $\mathcal{D}_i$ from the mixed distribution $\mathcal{M}$ defined in (4.11), which is upper bounded if $\mathcal{D}_i$ is close to $\mathcal{M}$. Thus, term (ii) is upper bounded if $h_{\omega_i}$ well approximates the minimizer of $R_i$ and $\mathcal{D}_i$ is close to $\mathcal{M}$ for all $i \in [n]$. Term (iii) characterizes the representation power of the feature mapping $\phi_{\ell,\omega}$ defined in (4.15). Specifically, if the function $u$ defined in (4.14) of Theorem 4.4 is well approximated by $\phi_{\ell,\omega}(\cdot)^\top v$ for some $v \in \mathcal{B}_R$, then term (iii) is small. In conclusion, if the subtasks generated by the task distribution $\iota$ are sufficiently regular so that term (ii) is upper bounded, and the linear class $\{\phi_{\ell,\omega}(\cdot)^\top v : v \in \mathcal{B}_R\}$ has sufficient representation power, then $\omega$ is approximately globally optimal. See Appendix B for a corollary of Theorem 4.4 adapted to the squared loss.

## References

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018b.

Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended PAC-Bayes theory. arXiv preprint arXiv:1711.01244, 2017.

Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

Bai, Y. and Lee, J. D. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. arXiv preprint arXiv:1910.01619, 2019.

Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems, 2019.

Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019a.

Cao, Y. and Gu, Q. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019b.

Chizat, L. and Bach, F. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.

Daniely, A. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, 2017.

Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018a.
Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018b.

Ekeland, I. and Temam, R. Convex Analysis and Variational Problems, volume 28. SIAM, 1999.

Evgeniou, T. and Pontil, M. Regularized multi-task learning. In International Conference on Knowledge Discovery and Data Mining, pp. 109-117, 2004.

Facchinei, F. and Pang, J.-S. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media, 2007.

Fallah, A., Mokhtari, A., and Ozdaglar, A. On the convergence theory of gradient-based model-agnostic meta-learning algorithms. arXiv preprint arXiv:1908.10400, 2019.

Fan, J., Ma, C., and Zhong, Y. A selective overview of deep learning. arXiv preprint arXiv:1904.05526, 2019.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017a.

Finn, C., Yu, T., Zhang, T., Abbeel, P., and Levine, S. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017b.

Finn, C., Xu, K., and Levine, S. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. arXiv preprint arXiv:1902.08438, 2019.

Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., and Levine, S. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, 2018.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, 2017.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.

Khodak, M., Balcan, M.-F., and Talwalkar, A. Provable guarantees for gradient-based meta-learning. arXiv preprint arXiv:1902.10644, 2019.

Konda, V. Actor-Critic Algorithms. PhD thesis, Massachusetts Institute of Technology, 2002.

Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, 2018.

Li, Z., Zhou, F., Chen, F., and Li, H. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.

Liu, B., Cai, Q., Yang, Z., and Wang, Z. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.

Mendonca, R., Gupta, A., Kralev, R., Abbeel, P., Levine, S., and Finn, C. Guided meta-policy search. In Advances in Neural Information Processing Systems, 2019.

Nagabandi, A., Finn, C., and Levine, S. Deep online learning via meta-learning: Continual adaptation for model-based RL. arXiv preprint arXiv:1812.07671, 2018.

Nichol, A. and Schulman, J. Reptile: A scalable meta-learning algorithm. arXiv preprint arXiv:1803.02999, 2018.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2009.

Pentina, A. and Lampert, C. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, 2014.

Rajeswaran, A., Finn, C., Kakade, S. M., and Levine, S. Meta-learning with implicit gradients. In Advances in Neural Information Processing Systems, 2019.

Rakelly, K., Shelhamer, E., Darrell, T., Efros, A. A., and Levine, S. Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373, 2018.

Rockafellar, R. Integrals which are convex functionals. Pacific Journal of Mathematics, 24(3):525-539, 1968.

Rudin, W. Real and Complex Analysis. McGraw-Hill, 2006.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633-1685, 2009.

Thrun, S. and Pratt, L. Learning to Learn. Springer Science & Business Media, 2012.

Wang, H., Sun, R., and Li, B. Global convergence and induced kernels of gradient-based meta-learning with neural nets. arXiv preprint arXiv:2006.14606, 2020.

Weiss, K., Khoshgoftaar, T. M., and Wang, D. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.

Wu, L., Ma, C., and Weinan, E. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems, 2018.

Xu, K., Ratner, E., Dragan, A., Levine, S., and Finn, C. Learning a prior over intent via meta-inverse reinforcement learning. arXiv preprint arXiv:1805.12573, 2018.

Yoon, J., Kim, T., Dia, O., Kim, S., Bengio, Y., and Ahn, S. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018.

Yu, T., Abbeel, P., Levine, S., and Finn, C. One-shot hierarchical imitation learning of compound visuomotor tasks. arXiv preprint arXiv:1810.11043, 2018.

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.