# PROMP: PROXIMAL META-POLICY SEARCH

Published as a conference paper at ICLR 2019

Jonas Rothfuss (UC Berkeley, KIT) jonas.rothfuss@gmail.com
Dennis Lee*, Ignasi Clavera* (UC Berkeley) {dennisl88,iclavera}@berkeley.edu
Tamim Asfour (Karlsruhe Institute of Technology, KIT) asfour@kit.edu
Pieter Abbeel (UC Berkeley, Covariant.ai) pabbeel@cs.berkeley.edu

* Authors contributed equally to this work.

ABSTRACT

Credit assignment in meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights, we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm enables efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.

1 INTRODUCTION

A remarkable trait of human intelligence is the ability to adapt to new situations in the face of limited experience. In contrast, our most successful artificial agents struggle in such scenarios. While achieving impressive results, they suffer from high sample complexity in learning even a single task, fail to generalize to new situations, and require large amounts of additional data to successfully adapt to new environments. Meta-learning addresses these shortcomings by learning how to learn. Its objective is to learn an algorithm that allows the artificial agent to succeed in an unseen task when only limited experience is available, aiming to achieve the same fast adaptation that humans possess (Schmidhuber, 1987; Thrun & Pratt, 1998).

Despite recent progress, deep reinforcement learning (RL) still relies heavily on hand-crafted features and reward functions as well as engineered, problem-specific inductive bias. Meta-RL aims to forego such reliance by acquiring inductive bias in a data-driven manner. Recent work proves this approach to be promising, demonstrating that Meta-RL allows agents to obtain a diverse set of skills, attain better exploration strategies, and learn faster through meta-learned dynamics models or synthetic returns (Duan et al., 2016; Xu et al., 2018; Gupta et al., 2018b; Saemundsson et al., 2018).

Meta-RL is a multi-stage process in which the agent, after a few sampled environment interactions, adapts its behavior to the given task. Despite its wide utilization, little work has been done to promote theoretical understanding of this process, leaving Meta-RL grounded on unstable foundations. Although the behavior prior to the adaptation step is instrumental for task identification, the interplay between pre-adaptation sampling and posterior performance of the policy remains poorly understood. In fact, prior work in gradient-based Meta-RL has either entirely neglected credit assignment to the pre-update distribution (Finn et al., 2017) or implemented such credit assignment in a naive way (Al-Shedivat et al., 2018; Stadie et al., 2018).

To our knowledge, we provide the first formal in-depth analysis of credit assignment w.r.t. the
pre-adaptation sampling distribution in Meta-RL. Based on our findings, we develop a novel Meta-RL algorithm. First, we analyze two distinct methods for assigning credit to pre-adaptation behavior. We show that the recent formulation introduced by Al-Shedivat et al. (2018) and Stadie et al. (2018) leads to poor credit assignment, while the MAML formulation (Finn et al., 2017) potentially yields superior meta-policy updates. Second, based on insights from our formal analysis, we highlight both the importance and difficulty of proper meta-policy gradient estimates. In light of this, we propose the low variance curvature (LVC) surrogate objective which yields gradient estimates with a favorable bias-variance trade-off. Finally, building upon the LVC estimator we develop Proximal Meta-Policy Search (ProMP), an efficient and stable meta-learning algorithm for RL. In our experiments, we show that ProMP consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.

2 RELATED WORK

Meta-learning concerns the question of learning to learn, aiming to acquire inductive bias in a data-driven manner so that the learning process in the face of unseen data or new problem settings is accelerated (Schmidhuber, 1987; Schmidhuber et al., 1997; Thrun & Pratt, 1998). This can be achieved in various ways. One category of methods attempts to learn the learning program of a universal Turing machine in the form of a recurrent / memory-augmented model that ingests datasets and either outputs the parameters of the trained model (Hochreiter et al., 2001; Andrychowicz et al., 2016; Chen et al., 2017; Ravi & Larochelle, 2017) or directly outputs predictions for given test inputs (Duan et al., 2016; Santoro et al., 2016; Mishra et al., 2018). Though very flexible and capable of learning very efficient adaptations, such methods lack performance guarantees and are difficult to train on the long sequences that arise in Meta-RL. Another set of methods embeds the structure of a classical learning algorithm in the meta-learning procedure and optimizes the parameters of the embedded learner during meta-training (Hüsken & Goerick, 2000; Finn et al., 2017; Nichol et al., 2018; Miconi et al., 2018).

A particular instance of the latter that has proven to be particularly successful in the context of RL is gradient-based meta-learning (Finn et al., 2017; Al-Shedivat et al., 2018; Stadie et al., 2018). Its objective is to learn an initialization such that after one or few steps of policy gradients the agent attains full performance on a new task. A desirable property of this approach is that even if fast adaptation fails, the agent just falls back on vanilla policy gradients. However, as we show, previous gradient-based Meta-RL methods either neglect or perform poor credit assignment w.r.t. the pre-update sampling distribution.

A diverse set of methods building on Meta-RL has recently been introduced. This includes: learning exploration strategies (Gupta et al., 2018b), synthetic rewards (Sung et al., 2017; Xu et al., 2018), unsupervised policy acquisition (Gupta et al., 2018a), model-based RL (Clavera et al., 2018; Saemundsson et al., 2018), learning in competitive environments (Al-Shedivat et al., 2018), and meta-learning modular policies (Frans et al., 2018; Alet et al., 2018).
Many of the mentioned approaches build on previous gradient-based meta-learning methods that insufficiently account for the pre-update distribution. ProMP overcomes these deficiencies, providing the necessary framework for novel applications of Meta-RL in unsolved problems.

3 BACKGROUND

Reinforcement Learning. A discrete-time finite Markov decision process (MDP), $\mathcal{T}$, is defined by the tuple $(\mathcal{S}, \mathcal{A}, p, p_0, r, H)$. Here, $\mathcal{S}$ is the set of states, $\mathcal{A}$ the action space, $p(s_{t+1}|s_t, a_t)$ the transition distribution, $p_0$ represents the initial state distribution, $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is a reward function, and $H$ the time horizon. We omit the discount factor $\gamma$ in the following elaborations for notational brevity. However, it is straightforward to include it by substituting the reward by $r(s_t, a_t) := \gamma^t r(s_t, a_t)$. We define the return $R(\tau)$ as the sum of rewards along a trajectory $\tau := (s_0, a_0, ..., s_{H-1}, a_{H-1}, s_H)$. The goal of reinforcement learning is to find a policy $\pi(a|s)$ that maximizes the expected return $\mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\pi)}[R(\tau)]$.

Meta-Reinforcement Learning goes one step further, aiming to learn a learning algorithm which is able to quickly learn the optimal policy for a task $\mathcal{T}$ drawn from a distribution of tasks $\rho(\mathcal{T})$. Each task $\mathcal{T}$ corresponds to a different MDP. Typically, it is assumed that the tasks share the action and state space, but may differ in their reward function or their dynamics.

Gradient-based meta-learning aims to solve this problem by learning the parameters $\theta$ of a policy $\pi_\theta$ such that performing a single or few steps of vanilla policy gradient (VPG) with the given task leads to the optimal policy for that task. This meta-learning formulation, also known under the name of MAML, was first introduced by Finn et al. (2017). We refer to it as formulation I, which can be expressed as maximizing the objective

$$J^{I}(\theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\left[\mathbb{E}_{\tau' \sim P_\mathcal{T}(\tau'|\theta')}\left[R(\tau')\right]\right] \quad \text{with} \quad \theta' := U(\theta, \mathcal{T}) = \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}\left[R(\tau)\right]$$

In that, $U$ denotes the update function which depends on the task $\mathcal{T}$ and performs one VPG step towards maximizing the performance of the policy in $\mathcal{T}$. For notational brevity and conciseness we assume a single policy gradient adaptation step. Nonetheless, all presented concepts can easily be extended to multiple adaptation steps.

Later work proposes a slightly different notion of gradient-based Meta-RL, also known as E-MAML, that attempts to circumvent issues with the meta-gradient estimation in MAML (Al-Shedivat et al., 2018; Stadie et al., 2018):

$$J^{II}(\theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\left[\mathbb{E}_{\tau^{1:N} \sim P_\mathcal{T}(\tau^{1:N}|\theta)}\left[\mathbb{E}_{\tau' \sim P_\mathcal{T}(\tau'|\theta')}\left[R(\tau')\right]\right]\right] \quad \text{with} \quad \theta' := U(\theta, \tau^{1:N}) = \theta + \alpha \nabla_\theta \frac{1}{N}\sum_{n=1}^{N} R(\tau^{(n)})$$

Formulation II views $U$ as a deterministic function that depends on $N$ sampled trajectories from a specific task. In contrast to formulation I, the expectation over pre-update trajectories $\tau$ is applied outside of the update function. Throughout this paper we refer to $\pi_\theta$ as the pre-update policy, and $\pi_{\theta'}$ as the post-update policy.
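To make the structure of formulation I concrete, the following is a minimal PyTorch sketch (our own illustration, not the authors' implementation) of one inner VPG adaptation step followed by the outer meta-objective, on a made-up one-step task distribution. The task definition, the helper names, and all hyperparameters are assumptions introduced only for this example. Note that, as discussed in Section 5, naively differentiating a score-function surrogate in this way only captures the $\nabla_\theta J_{\text{post}}$ part of the meta-gradient.

```python
# Minimal, self-contained sketch of meta-learning formulation I on a toy
# 1-D task distribution: each task is a one-step MDP with reward
# r(a) = -(a - goal)^2 and a Gaussian policy with learnable mean theta.
import torch

torch.manual_seed(0)
theta = torch.zeros(1, requires_grad=True)   # pre-update policy mean
sigma, alpha, beta, n_samples = 1.0, 0.1, 0.01, 64

def sample_task():
    # a task is defined by its goal position
    return torch.empty(1).uniform_(-1.0, 1.0)

def sample_actions_and_rewards(mean, goal):
    actions = (mean + sigma * torch.randn(n_samples)).detach()
    rewards = -(actions - goal) ** 2
    return actions, rewards

def reinforce_surrogate(mean, actions, rewards):
    # score-function surrogate: its gradient is the REINFORCE estimate
    # (log-prob up to an additive constant that does not affect gradients)
    log_probs = -0.5 * ((actions - mean) / sigma) ** 2
    return (log_probs * rewards).mean()

meta_optimizer = torch.optim.SGD([theta], lr=beta)
for iteration in range(200):
    meta_objective = 0.0
    for _ in range(5):                                # batch of tasks
        goal = sample_task()
        # inner adaptation: one VPG step, kept differentiable w.r.t. theta
        a_pre, r_pre = sample_actions_and_rewards(theta, goal)
        inner_obj = reinforce_surrogate(theta, a_pre, r_pre)
        grad = torch.autograd.grad(inner_obj, theta, create_graph=True)[0]
        theta_prime = theta + alpha * grad            # theta' = U(theta, T)
        # outer objective: post-update performance, differentiated w.r.t. theta
        a_post, r_post = sample_actions_and_rewards(theta_prime, goal)
        meta_objective = meta_objective + reinforce_surrogate(theta_prime, a_post, r_post)
    meta_optimizer.zero_grad()
    (-meta_objective).backward()                      # ascend J^I
    meta_optimizer.step()
```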
4 SAMPLING DISTRIBUTION CREDIT ASSIGNMENT

This section analyzes the two gradient-based Meta-RL formulations introduced in Section 3. Figure 1 illustrates the stochastic computation graphs (Schulman et al., 2015b) of both formulations. The red arrows depict how credit assignment w.r.t. the pre-update sampling distribution $P_\mathcal{T}(\tau|\theta)$ is propagated. Formulation I (left) propagates the credit assignment through the update step, thereby exploiting the full problem structure. In contrast, formulation II (right) neglects the inherent structure, directly assigning credit from the post-update return $R'$ to the pre-update policy $\pi_\theta$, which leads to noisier, less effective credit assignment.

Figure 1: Stochastic computation graphs of meta-learning formulation I (left) and formulation II (right). The red arrows illustrate the credit assignment from the post-update returns $R'$ to the pre-update policy $\pi_\theta$ through $\nabla_\theta J_{\text{pre}}$. (Deterministic nodes: squares; stochastic nodes: circles.)

Both formulations optimize for the same objective, and are equivalent at the 0th order. However, because of the difference in their formulation and stochastic computation graph, their gradients and the resulting optimization step differ. In the following, we shed light on how and where formulation II loses signal by analyzing the gradients of both formulations, which can be written as (see Appendix A for more details and derivations)

$$\nabla_\theta J(\theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\left[\mathbb{E}_{\substack{\tau \sim P_\mathcal{T}(\tau|\theta) \\ \tau' \sim P_\mathcal{T}(\tau'|\theta')}}\left[\nabla_\theta J_{\text{post}}(\tau, \tau') + \nabla_\theta J_{\text{pre}}(\tau, \tau')\right]\right] \quad (2)$$

The first term $\nabla_\theta J_{\text{post}}(\tau, \tau')$ is equal in both formulations, but the second term, $\nabla_\theta J_{\text{pre}}(\tau, \tau')$, differs between them. In particular, they correspond to

$$\nabla_\theta J_{\text{post}}(\tau, \tau') = \underbrace{\nabla_{\theta'} \log \pi_{\theta'}(\tau') R(\tau')}_{\nabla_{\theta'} J^{\text{outer}}} \underbrace{\left(I + \alpha R(\tau)\, \nabla^2_\theta \log \pi_\theta(\tau)\right)}_{\text{transformation from } \theta' \text{ to } \theta}$$

$$\nabla_\theta J^{II}_{\text{pre}}(\tau, \tau') = \alpha \nabla_\theta \log \pi_\theta(\tau) R(\tau') \quad (3)$$

$$\nabla_\theta J^{I}_{\text{pre}}(\tau, \tau') = \alpha \nabla_\theta \log \pi_\theta(\tau) \left(\underbrace{\left(\nabla_\theta \log \pi_\theta(\tau) R(\tau)\right)^\top}_{\nabla_\theta J^{\text{inner}}} \underbrace{\left(\nabla_{\theta'} \log \pi_{\theta'}(\tau') R(\tau')\right)}_{\nabla_{\theta'} J^{\text{outer}}}\right) \quad (4)$$

$\nabla_\theta J_{\text{post}}(\tau, \tau')$ simply corresponds to a policy gradient step on the post-update policy $\pi_{\theta'}$ w.r.t. $\theta'$, followed by a linear transformation from post- to pre-update parameters. It corresponds to increasing the likelihood of the trajectories $\tau'$ that led to higher returns. However, this term does not optimize for the pre-update sampling distribution, i.e., which trajectories $\tau$ led to better adaptation steps.

The credit assignment w.r.t. the pre-update sampling distribution is carried out by the second term. In formulation II, $\nabla_\theta J^{II}_{\text{pre}}$ can be viewed as standard reinforcement learning on $\pi_\theta$ with $R(\tau')$ as reward signal, treating the update function $U$ as part of the unknown dynamics of the system. This shifts the pre-update sampling distribution towards better adaptation steps.

Formulation I takes the causal dependence of $P_\mathcal{T}(\tau'|\theta')$ on $P_\mathcal{T}(\tau|\theta)$ into account. It does so by maximizing the inner product of pre-update and post-update policy gradients (see Eq. 4). This steers the pre-update policy towards 1) larger post-update returns, 2) larger adaptation steps $\alpha \nabla_\theta J^{\text{inner}}$, and 3) better alignment of pre- and post-update policy gradients (Li et al., 2017; Nichol et al., 2018). When combined, these effects directly optimize for adaptation. As a result, we expect the first meta-policy gradient formulation, $J^{I}$, to yield superior learning properties.
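As a purely numerical illustration of Eqs. (3) and (4) (not taken from the paper), the toy sketch below weights the same pre-update score vector in the two ways the formulations prescribe; all array values are made up.

```python
# Illustrative-only sketch: given per-trajectory score vectors and returns,
# compare the pre-update credit assignment terms of Eq. (3) and Eq. (4).
import numpy as np

alpha = 0.1
score_pre = np.array([0.5, -1.0, 0.2])    # grad_theta  log pi_theta(tau)
score_post = np.array([0.3, 0.4, -0.7])   # grad_theta' log pi_theta'(tau')
R_pre, R_post = 1.2, 2.5                  # R(tau), R(tau')

# Formulation II: weight the pre-update score only by the post-update return.
grad_pre_II = alpha * score_pre * R_post

# Formulation I: weight the pre-update score by the inner product of the
# inner adaptation gradient and the outer (post-update) policy gradient.
inner_grad = score_pre * R_pre            # single-sample grad of J^inner
outer_grad = score_post * R_post          # single-sample grad of J^outer
grad_pre_I = alpha * score_pre * np.dot(inner_grad, outer_grad)

print("formulation II pre-update term:", grad_pre_II)
print("formulation I  pre-update term:", grad_pre_I)
# The inner product is large and positive only when the adaptation step and
# the post-update policy gradient are aligned, so formulation I rewards
# pre-update trajectories that lead to useful adaptation directions rather
# than merely to high post-update returns.
```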
5 LOW VARIANCE CURVATURE ESTIMATOR

In the previous section we showed that the formulation introduced by Finn et al. (2017) results in superior meta-gradient updates, which should in principle lead to improved convergence properties. However, obtaining correct and low-variance estimates of the respective meta-gradients proves challenging. As discussed by Foerster et al. (2018), and shown in Appendix B.3, the score function surrogate objective approach is ill-suited for calculating higher-order derivatives via automatic differentiation toolboxes. This important fact was overlooked in the original RL-MAML implementation (Finn et al., 2017), leading to incorrect meta-gradient estimates.¹ As a result, $\nabla_\theta J_{\text{pre}}$ does not appear in the gradients of the meta-objective (i.e. $\nabla_\theta J = \nabla_\theta J_{\text{post}}$). Hence, MAML does not perform any credit assignment to pre-adaptation behavior. But even when properly implemented, we show that the meta-gradients exhibit high variance. Specifically, the estimation of the hessian of the RL objective, which is inherent in the meta-gradients, requires special consideration. In this section, we motivate and introduce the low variance curvature (LVC) estimator: an improved estimator for the hessian of the RL objective which promotes better meta-policy gradient updates.

¹ Note that MAML is theoretically sound, but does not attend to correctly estimating the meta-policy gradients. As a consequence, the gradients in the corresponding implementation do not comply with the theory.

As we show in Appendix A.1, we can write the gradient of the meta-learning objective as

$$\nabla_\theta J^{I}(\theta) = \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\left[\mathbb{E}_{\tau' \sim P_\mathcal{T}(\tau'|\theta')}\left[\nabla_{\theta'} \log P_\mathcal{T}(\tau'|\theta') R(\tau')\, \nabla_\theta U(\theta, \mathcal{T})\right]\right] \quad (5)$$

Since the update function $U$ resembles a policy gradient step, its gradient $\nabla_\theta U(\theta, \mathcal{T})$ involves computing the hessian of the reinforcement learning objective, i.e., $\nabla^2_\theta \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}[R(\tau)]$. Estimating this hessian has been discussed in Baxter & Bartlett (2001) and Furmston et al. (2016). In the infinite horizon MDP case, Baxter & Bartlett (2001) derived a decomposition of the hessian. We extend their finding to the finite horizon case, showing that the hessian can be decomposed into four matrix terms (see Appendix B.2 for proof):

$$\nabla_\theta U(\theta, \mathcal{T}) = I + \alpha \nabla^2_\theta \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}[R(\tau)] = I + \alpha\left(\mathcal{H}_1 + \mathcal{H}_2 + \mathcal{H}_{12} + \mathcal{H}_{12}^\top\right) \quad (6)$$

with

$$\mathcal{H}_1 = \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}\left[\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)^\top \left(\sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})\right)\right]$$

$$\mathcal{H}_2 = \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}\left[\sum_{t=0}^{H-1} \nabla^2_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})\right)\right]$$

$$\mathcal{H}_{12} = \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}\left[\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \nabla_\theta Q^{\pi_\theta}_t(s_t, a_t)^\top\right]$$

Here $Q^{\pi_\theta}_t(s_t, a_t) = \mathbb{E}_{\tau^{t+1:H-1} \sim P_\mathcal{T}(\cdot|\theta)}\left[\sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}) \mid s_t, a_t\right]$ denotes the expected state-action value function under policy $\pi_\theta$ at time $t$.

Computing the expectation of the RL objective is in general intractable. Typically, its gradients are computed with a Monte Carlo estimate based on the policy gradient theorem (Eq. 82). In practical implementations, such an estimate is obtained by automatically differentiating a surrogate objective (Schulman et al., 2015b). However, this results in a highly biased hessian estimate which just computes $\mathcal{H}_2$, entirely dropping the terms $\mathcal{H}_1$ and $\mathcal{H}_{12} + \mathcal{H}_{12}^\top$. In the notation of the previous section, it leads to neglecting the $\nabla_\theta J_{\text{pre}}$ term, ignoring the influence of the pre-update sampling distribution.

The issue can be overcome using the DiCE formulation, which allows computing unbiased higher-order Monte Carlo estimates of arbitrary stochastic computation graphs (Foerster et al., 2018). The DiCE-RL objective can be rewritten as follows:

$$J^{\text{DiCE}}(\tau) = \sum_{t=0}^{H-1} \left(\prod_{t'=0}^{t} \frac{\pi_\theta(a_{t'}|s_{t'})}{\perp(\pi_\theta(a_{t'}|s_{t'}))}\right) r(s_t, a_t) \qquad \tau \sim P_\mathcal{T}(\tau) \quad (7)$$

$$\mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}\left[\nabla^2_\theta J^{\text{DiCE}}(\tau)\right] = \mathcal{H}_1 + \mathcal{H}_2 + \mathcal{H}_{12} + \mathcal{H}_{12}^\top \quad (8)$$

In that, $\perp$ denotes the stop-gradient operator, i.e., $\perp(f_\theta(x)) \rightarrow f_\theta(x)$ but $\nabla_\theta \perp(f_\theta(x)) \rightarrow 0$. The sequential dependence of $\pi_\theta(a_t|s_t)$ within the trajectory, manifesting itself through the product of importance weights in (7), results in high variance estimates of the hessian $\nabla^2_\theta \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}[R(\tau)]$. As noted by Furmston et al. (2016), $\mathcal{H}_{12}$ is particularly difficult to estimate, since it involves three nested sums along the trajectory. In Section 7.2 we empirically show that the high variance estimates of the DiCE objective lead to noisy meta-policy gradients and poor learning performance.
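As an illustration of Eq. (7), here is a minimal PyTorch sketch (our own, not the authors' or the DiCE authors' implementation) of the DiCE-RL surrogate for a single sampled trajectory; the per-timestep `log_probs` and `rewards` tensors are assumed inputs produced elsewhere.

```python
# Minimal sketch of the DiCE-RL surrogate in Eq. (7). The stop-gradient
# operator is realized with .detach(); differentiating the objective once
# gives a policy gradient estimate, differentiating it twice gives an
# unbiased (but high-variance) hessian estimate.
import torch

def dice_rl_objective(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """log_probs: (H,) differentiable log pi_theta(a_t|s_t); rewards: (H,)."""
    # cumulative log-prob up to step t  <=>  product over t' <= t of
    # pi_theta(a_t'|s_t') / stop_grad(pi_theta(a_t'|s_t'))
    cum_log_probs = torch.cumsum(log_probs, dim=0)
    magic_box = torch.exp(cum_log_probs - cum_log_probs.detach())  # evaluates to 1
    return torch.sum(magic_box * rewards)

# Toy usage with a unit-variance Gaussian policy on a random 5-step "trajectory"
theta = torch.tensor([0.3], requires_grad=True)
actions, rewards = torch.randn(5), torch.randn(5)
log_probs = -0.5 * (actions - theta) ** 2            # up to an additive constant
objective = dice_rl_objective(log_probs, rewards)

grad = torch.autograd.grad(objective, theta, create_graph=True)[0]
hessian = torch.autograd.grad(grad.sum(), theta)[0]  # second-derivative estimate
print(grad.item(), hessian.item())
```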
To facilitate sample-efficient meta-learning, we introduce the low variance curvature (LVC) estimator:

$$J^{\text{LVC}}(\tau) = \sum_{t=0}^{H-1} \frac{\pi_\theta(a_t|s_t)}{\perp(\pi_\theta(a_t|s_t))} \left(\sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})\right) \qquad \tau \sim P_\mathcal{T}(\tau) \quad (9)$$

$$\mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}\left[\nabla^2_\theta J^{\text{LVC}}(\tau)\right] = \mathcal{H}_1 + \mathcal{H}_2 \quad (10)$$

By removing the sequential dependence of $\pi_\theta(a_t|s_t)$ within trajectories, the hessian estimate neglects the term $\mathcal{H}_{12} + \mathcal{H}_{12}^\top$, which leads to a variance reduction but makes the estimate biased. The choice of this objective function is motivated by findings in Furmston et al. (2016): under certain conditions the term $\mathcal{H}_{12} + \mathcal{H}_{12}^\top$ vanishes around local optima $\theta^*$, i.e., $\mathbb{E}_\tau[\nabla^2_\theta J^{\text{LVC}}] \rightarrow \mathbb{E}_\tau[\nabla^2_\theta J^{\text{DiCE}}]$ as $\theta \rightarrow \theta^*$. Hence, the bias of the LVC estimator becomes negligible close to local optima. The experiments in Section 7.2 underpin these theoretical findings, showing that the low variance hessian estimates obtained through $J^{\text{LVC}}$ improve the sample-efficiency of meta-learning by a significant margin when compared to $J^{\text{DiCE}}$. We refer the interested reader to Appendix B for derivations and a more detailed discussion.
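As a companion to the DiCE snippet above, the following sketch (again our own illustration, under the same assumptions about `log_probs` and `rewards`) shows how Eq. (9) replaces the cumulative product of importance weights with a single per-step ratio.

```python
# Minimal sketch of the LVC surrogate in Eq. (9). Each timestep carries only
# one per-step ratio pi_theta(a_t|s_t) / stop_grad(pi_theta(a_t|s_t)), so the
# second derivative estimates only H1 + H2 (lower variance, slightly biased).
import torch

def lvc_objective(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """log_probs: (H,) differentiable log pi_theta(a_t|s_t); rewards: (H,)."""
    # reward-to-go: sum_{t' >= t} r(s_t', a_t')
    rewards_to_go = torch.flip(torch.cumsum(torch.flip(rewards, [0]), dim=0), [0])
    per_step_ratio = torch.exp(log_probs - log_probs.detach())   # evaluates to 1
    return torch.sum(per_step_ratio * rewards_to_go)

# Toy usage mirroring the DiCE example above
theta = torch.tensor([0.3], requires_grad=True)
actions, rewards = torch.randn(5), torch.randn(5)
log_probs = -0.5 * (actions - theta) ** 2
grad = torch.autograd.grad(lvc_objective(log_probs, rewards), theta, create_graph=True)[0]
hessian = torch.autograd.grad(grad.sum(), theta)[0]  # estimates H1 + H2 only
print(grad.item(), hessian.item())
```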
6 PROMP: PROXIMAL META-POLICY SEARCH

Building on the previous sections, we develop a novel meta-policy search method based on the low variance curvature objective which aims to solve the following optimization problem:

$$\max_\theta \ \mathbb{E}_{\mathcal{T} \sim \rho(\mathcal{T})}\left[\mathbb{E}_{\tau' \sim P_\mathcal{T}(\tau'|\theta')}\left[R(\tau')\right]\right] \quad \text{with} \quad \theta' := \theta + \alpha \nabla_\theta \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta)}\left[J^{\text{LVC}}(\tau)\right] \quad (11)$$

Prior work has optimized this objective using either vanilla policy gradient (VPG) or TRPO (Schulman et al., 2015a). TRPO holds the promise of being more data-efficient and stable during the learning process when compared to VPG. However, it requires computing the Fisher information matrix (FIM). Estimating the FIM is particularly problematic in the meta-learning setup. The meta-policy gradients already involve second order derivatives; as a result, the time complexity of the FIM estimate is cubic in the number of policy parameters. Typically, the problem is circumvented using finite difference methods, which introduce further approximation errors.

The recently introduced PPO algorithm (Schulman et al., 2017) achieves comparable results to TRPO with the advantage of being a first order method. PPO uses a surrogate clipping objective which allows it to safely take multiple gradient steps without re-sampling trajectories:

$$J^{\text{CLIP}}_\mathcal{T}(\theta) = \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta_o)}\left[\sum_{t=0}^{H-1} \min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_o}(a_t|s_t)} A^{\pi_{\theta_o}}(s_t, a_t),\ \text{clip}^{1+\epsilon}_{1-\epsilon}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_o}(a_t|s_t)}\right) A^{\pi_{\theta_o}}(s_t, a_t)\right)\right]$$

Algorithm 1 Proximal Meta-Policy Search (ProMP)
Require: Task distribution $\rho$, step sizes $\alpha, \beta$, KL-penalty coefficient $\eta$, clipping range $\epsilon$
1: Randomly initialize $\theta$
2: while $\theta$ not converged do
3:   Sample batch of tasks $\mathcal{T}_i \sim \rho(\mathcal{T})$
4:   for step $n = 0, ..., N-1$ do
5:     if $n = 0$ then
6:       Set $\theta_o \leftarrow \theta$
7:       for all $\mathcal{T}_i \sim \rho(\mathcal{T})$ do
8:         Sample pre-update trajectories $D_i = \{\tau_i\}$ from $\mathcal{T}_i$ using $\pi_\theta$
9:         Compute adapted parameters $\theta'_{o,i} \leftarrow \theta + \alpha \nabla_\theta J^{\text{LR}}_{\mathcal{T}_i}(\theta)$ with $D_i = \{\tau_i\}$
10:        Sample post-update trajectories $D'_i = \{\tau'_i\}$ from $\mathcal{T}_i$ using $\pi_{\theta'_{o,i}}$
11:    Update $\theta \leftarrow \theta + \beta \sum_{\mathcal{T}_i} \nabla_\theta J^{\text{ProMP}}_{\mathcal{T}_i}(\theta)$ using each $D'_i = \{\tau'_i\}$

In case of Meta-RL, it does not suffice to just replace the post-update reward objective with $J^{\text{CLIP}}_\mathcal{T}$. In order to safely perform multiple meta-gradient steps based on the same sampled data from a recent policy $\pi_{\theta_o}$, we also need to 1) account for changes in the pre-update action distribution $\pi_\theta(a_t|s_t)$, and 2) bound changes in the pre-update state visitation distribution (Kakade & Langford, 2002). We propose Proximal Meta-Policy Search (ProMP), which incorporates both the benefits of proximal policy optimization and the low variance curvature objective (see Alg. 1).

In order to comply with requirement 1), ProMP replaces the "stop-gradient" importance weight $\frac{\pi_\theta(a_t|s_t)}{\perp(\pi_\theta(a_t|s_t))}$ by the likelihood ratio $\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_o}(a_t|s_t)}$, which results in the following objective:

$$J^{\text{LR}}_\mathcal{T}(\theta) = \mathbb{E}_{\tau \sim P_\mathcal{T}(\tau|\theta_o)}\left[\sum_{t=0}^{H-1} \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_o}(a_t|s_t)} A^{\pi_{\theta_o}}(s_t, a_t)\right] \quad (12)$$

An important feature of this objective is that its derivatives w.r.t. $\theta$ evaluated at $\theta_o$ are identical to those of the LVC objective, and it additionally accounts for changes in the pre-update action distribution. To satisfy condition 2) we extend the clipped meta-objective with a KL-penalty term between $\pi_\theta$ and $\pi_{\theta_o}$. This KL-penalty term enforces a soft local trust region around $\pi_{\theta_o}$, preventing the shift in state visitation distribution from becoming large during optimization. This enables us to take multiple meta-policy gradient steps without re-sampling. Altogether, ProMP optimizes

$$J^{\text{ProMP}}_\mathcal{T}(\theta) = J^{\text{CLIP}}_\mathcal{T}(\theta') - \eta D_{KL}(\pi_{\theta_o}, \pi_\theta) \quad \text{s.t.} \quad \theta' = \theta + \alpha \nabla_\theta J^{\text{LR}}_\mathcal{T}(\theta), \quad \mathcal{T} \sim \rho(\mathcal{T}) \quad (13)$$

ProMP consolidates the insights developed throughout the course of this paper, while at the same time making maximal use of recently developed policy gradient algorithms. First, its meta-learning formulation exploits the full structural knowledge of gradient-based meta-learning. Second, it incorporates a low variance estimate of the RL-objective hessian. Third, ProMP controls the statistical distance of both pre- and post-adaptation policies, promoting efficient and stable meta-learning. All in all, ProMP consistently outperforms previous gradient-based Meta-RL algorithms in sample complexity, wall-clock time, and asymptotic performance (see Section 7.1).
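To show how Eqs. (12) and (13) fit together in a single update, the following is a toy-data PyTorch sketch (our own, not the authors' implementation) of the ProMP objective for one task and a 1-D Gaussian policy with learnable mean; the advantages and "trajectories" are random stand-ins, and the helper `log_prob` is introduced only for illustration. In practice this computation is repeated over a batch of tasks and several gradient steps per sampled batch, as in Algorithm 1.

```python
# Minimal sketch of the ProMP objective in Eq. (13) for a single task.
import torch

alpha, eta, eps = 0.1, 0.01, 0.2
theta = torch.tensor([0.0], requires_grad=True)   # current meta-parameters
theta_o = theta.detach().clone()                  # parameters used for sampling

def log_prob(mean, actions):                      # unit-variance Gaussian
    return -0.5 * (actions - mean) ** 2

# toy pre-update data sampled under pi_theta_o, and toy post-update data
# sampled under the adapted policy pi_theta'_o; advantages are stand-ins
a_pre, adv_pre = torch.randn(8), torch.randn(8)
a_post, adv_post = torch.randn(8), torch.randn(8)
logp_post_old = log_prob(theta_o, a_post)         # fixed sampling log-probs

# inner step: likelihood-ratio objective J^LR, kept differentiable w.r.t. theta
ratio_pre = torch.exp(log_prob(theta, a_pre) - log_prob(theta_o, a_pre))
J_lr = torch.mean(ratio_pre * adv_pre)
grad = torch.autograd.grad(J_lr, theta, create_graph=True)[0]
theta_prime = theta + alpha * grad                # adapted parameters

# outer step: clipped objective under theta' plus KL penalty on pi_theta
ratio_post = torch.exp(log_prob(theta_prime, a_post) - logp_post_old)
J_clip = torch.mean(torch.min(ratio_post * adv_post,
                              torch.clamp(ratio_post, 1 - eps, 1 + eps) * adv_post))
kl = torch.mean(0.5 * (theta - theta_o) ** 2)     # closed-form KL of unit Gaussians
J_promp = J_clip - eta * kl

(-J_promp).backward()                             # ascend J^ProMP w.r.t. theta
print("meta-gradient on theta:", theta.grad)
```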
7 EXPERIMENTS

In order to empirically validate the theoretical arguments outlined above, this section provides a detailed experimental analysis that aims to answer the following questions: (i) How does ProMP perform against previous Meta-RL algorithms? (ii) How do the lower variance but biased LVC gradient estimates compare to the high variance, unbiased DiCE estimates? (iii) Do the different formulations result in different pre-update exploration properties? (iv) How do formulation I and formulation II differ in their meta-gradient estimates and convergence properties?

To answer the posed questions, we evaluate our approach on six continuous control Meta-RL benchmark environments based on OpenAI Gym and the MuJoCo simulator (Brockman et al., 2016; Todorov et al., 2012). A description of the experimental setup is found in Appendix D. In all experiments, the reported curves are averaged over at least three random seeds. Returns are estimated based on sampled trajectories from the adapted post-update policies and averaged over sampled tasks. The source code and the experiment data are available on our supplementary website.²

² https://sites.google.com/view/pro-mp

Figure 2: Meta-learning curves of ProMP and previous gradient-based meta-learning algorithms (MAML-TRPO, E-MAML-TRPO, MAML-VPG) in six different MuJoCo environments (Half Cheetah Fwd Back, Ant Rand Dir, Hopper Rand Params, Walker Fwd Back, Humanoid Rand Dir, Walker Rand Params). ProMP outperforms previous work in all environments.

7.1 META-GRADIENT BASED COMPARISON

We compare our method, ProMP, in sample complexity and asymptotic performance to the gradient-based meta-learning approaches MAML-TRPO (Finn et al., 2017) and E-MAML-TRPO (see Fig. 2). Note that MAML corresponds to the original implementation of RL-MAML by Finn et al. (2017) where no credit assignment to the pre-adaptation policy is happening (see Appendix B.3 for details).

Moreover, we provide a second study which focuses on the underlying meta-gradient estimator. Specifically, we compare the LVC, DiCE, MAML and E-MAML estimators while optimizing the meta-learning objective with vanilla policy gradient (VPG) ascent. This can be viewed as an ablated version of the algorithms which tries to eliminate the influence of the outer optimizers on the learning performance (see Fig. 3). These algorithms are benchmarked on six different locomotion tasks that require adaptation: the half-cheetah and walker must switch between running forward and backward, the high-dimensional agents ant and humanoid must learn to adapt to run in different directions in the 2D plane, and the hopper and walker have to adapt to different configurations of their dynamics.

Figure 3: Meta-learning curves corresponding to different meta-gradient estimators (LVC-VPG, MAML-VPG, E-MAML-VPG, DiCE-VPG) in conjunction with VPG. The introduced LVC approach consistently outperforms the other estimators.

Figure 4: Top: Relative standard deviation of meta-policy gradients. Bottom: Returns in the respective environments (Ant Rand Dir, Half Cheetah Fwd Back, Walker Rand Params) throughout the learning process. LVC exhibits less variance in its meta-gradients, which may explain its superior performance when compared to DiCE.

The results in Figure 2 highlight the strength of ProMP in terms of sample efficiency and asymptotic performance. In the meta-gradient estimator study in Fig. 3, we demonstrate the positive effect of the LVC objective, as it consistently outperforms the other estimators. In contrast, DiCE learns only slowly when compared to the other approaches. As we have motivated mathematically and substantiate empirically in the following experiment, the poor performance of DiCE may be ascribed to the high variance of its meta-gradient estimates. The fact that the results of MAML and E-MAML are comparable underpins the ineffectiveness of the naive pre-update credit assignment (i.e. formulation II), as discussed in Section 4. Results for four additional environments are displayed in Appendix D along with hyperparameter settings, environment specifications, and a wall-clock time comparison of the algorithms.

7.2 GRADIENT ESTIMATOR VARIANCE AND ITS EFFECT ON META-LEARNING

In Section 5 we discussed how the DiCE formulation yields unbiased but high-variance estimates of the RL-objective hessian, which served as motivation for the low variance curvature (LVC) estimator. Here we investigate the meta-gradient variance of both estimators as well as its implication on the learning performance. Specifically, we report the relative standard deviation of the meta-policy gradients as well as the average return throughout the learning process in three of the meta-environments.
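The exact definition of the reported "relative standard deviation" is not spelled out here; as one plausible reading (an assumption on our part, not the authors' code), it can be computed from several independent minibatch estimates of the same meta-gradient as sketched below.

```python
# Illustrative sketch (our assumed metric): relative standard deviation of
# meta-gradient estimates, computed from repeated minibatch estimates.
import numpy as np

def relative_gradient_std(grad_estimates: np.ndarray) -> float:
    """grad_estimates: (num_estimates, num_params) array of gradient samples."""
    std_per_param = grad_estimates.std(axis=0)
    mean_norm = np.linalg.norm(grad_estimates.mean(axis=0))
    # average per-parameter std, normalized by the mean gradient magnitude
    return float(std_per_param.mean() / (mean_norm + 1e-12))

# toy usage: 10 noisy estimates of a 5-dimensional meta-gradient
rng = np.random.default_rng(0)
true_grad = rng.normal(size=5)
estimates = true_grad + 0.5 * rng.normal(size=(10, 5))
print("relative std:", relative_gradient_std(estimates))
```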
The results, depicted in Figure 4, highlight the advantage of the low variance curvature estimate. The trajectory-level dependencies inherent in the DiCE estimator lead to a meta-gradient standard deviation that is on average 60% higher when compared to LVC. As the learning curves indicate, the noisy gradients may be a driving factor for the poor performance of DiCE, impeding sample-efficient meta-learning. Meta-policy search based on the LVC estimator leads to substantially better sample-efficiency and asymptotic performance. In the case of Half Cheetah Fwd Back, we observe some unstable learning behavior of LVC-VPG which is most likely caused by the bias of LVC in combination with the naive VPG optimizer. However, the mechanisms in ProMP that ensure proximity w.r.t. the policy's KL-divergence seem to counteract these instabilities during training, giving us a stable and efficient meta-learning algorithm.

7.3 COMPARISON OF INITIAL SAMPLING DISTRIBUTIONS

Here we evaluate the effect of the different objectives on the learned pre-update sampling distribution. We compare the low variance curvature (LVC) estimator with TRPO (LVC-TRPO) against MAML (Finn et al., 2017) and E-MAML-TRPO (Stadie et al., 2018) in a 2D environment in which the exploration behavior can be visualized. Each task of this environment corresponds to reaching a different corner location; however, the 2D agent only experiences reward when it is sufficiently close to the corner (translucent regions of Figure 5). Thus, to successfully identify the task, the agent must explore the different regions. We perform three inner adaptation steps on each task, allowing the agent to fully change its behavior from exploration to exploitation.

Figure 5: Exploration patterns of the pre-update policy and exploitation of the post-update policy with different update functions. Through its superior credit assignment, the LVC objective learns a pre-update policy that is able to identify the current task and respectively adapt its policy, successfully reaching the goal (dark green circle).

The different exploration-exploitation strategies are displayed in Figure 5. Since the MAML implementation does not assign credit to the pre-update sampling trajectory, it is unable to learn a sound exploration strategy for task identification and thus fails to accomplish the task. On the other hand, E-MAML, which corresponds to formulation II, learns to explore in long but random paths: because it can only assign credit to batches of pre-update trajectories, there is no notion of which actions in particular facilitate good task adaptation. As a consequence, the adapted policy slightly misses the task-specific target. The LVC estimator, instead, learns a consistent pattern of exploration, visiting each of the four regions, which it harnesses to fully solve the task.

7.4 GRADIENT UPDATE DIRECTIONS OF THE TWO META-RL FORMULATIONS

Figure 6: Meta-gradient updates of policy parameters $\theta_0$ and $\theta_1$ in a 1D environment w.r.t. formulation I (red) and formulation II (green).

To shed more light on the differences of the gradients of formulation I and formulation II, we evaluate the meta-gradient updates and the corresponding convergence to the optimum of both formulations in a simple 1D environment.
In this environment, the agent starts at a random position on the real line and has to reach a goal located at position 1 or −1. In order to visualize the convergence, we parameterize the policy with only two parameters $\theta_0$ and $\theta_1$. We employ formulation I by optimizing the DiCE objective with VPG, and formulation II by optimizing its (E-MAML) objective with VPG. Figure 6 depicts the meta-gradient updates of the parameters $\theta_i$ for both formulations. Formulation I (red) exploits the internal structure of the adaptation update, yielding faster and steadier convergence to the optimum. Due to its inferior credit assignment, formulation II (green) produces noisier gradient estimates, leading to worse convergence properties.

8 CONCLUSION

In this paper we propose a novel Meta-RL algorithm, proximal meta-policy search (ProMP), which fully optimizes for the pre-update sampling distribution, leading to effective task identification. Our method is the result of a theoretical analysis of gradient-based Meta-RL formulations, based on which we develop the low variance curvature (LVC) surrogate objective that produces low variance meta-policy gradient estimates. Experimental results demonstrate that our approach surpasses previous meta-reinforcement learning approaches in a diverse set of continuous control tasks. Finally, we underpin our theoretical contributions with illustrative examples which further justify the soundness and effectiveness of our method.

ACKNOWLEDGMENTS

Ignasi Clavera was supported by the La Caixa Fellowship. The research leading to these results has received funding from the German Research Foundation (DFG: Deutsche Forschungsgemeinschaft) under the Priority Program on Autonomous Learning (SPP 1527) and was supported by Berkeley Deep Drive, Amazon Web Services, and Huawei. We also thank Abhishek Gupta, Chelsea Finn, and Aviv Tamar for their valuable feedback.

REFERENCES

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained Policy Optimization. Technical report, 2017. URL https://arxiv.org/pdf/1705.10528.pdf.

Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. In ICLR, 2018. URL https://goo.gl/tboqaN.

Ferran Alet, Tomás Lozano-Pérez, and Leslie P. Kaelbling. Modular meta-learning. Technical report, 2018. URL http://arxiv.org/abs/1806.10166.

Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. Technical report, 2016. URL https://arxiv.org/pdf/1606.04474.pdf.

Jonathan Baxter and Peter L. Bartlett. Infinite-Horizon Policy-Gradient Estimation. Technical report, 2001. URL https://arxiv.org/pdf/1106.0665.pdf.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. Technical report, 2016. URL http://arxiv.org/abs/1606.01540.

Yutian Chen, Matthew W. Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P. Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to Learn without Gradient Descent by Gradient Descent. In ICML, 2017.

Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-Based Reinforcement Learning via Meta-Policy Optimization. In CoRL, 2018. URL http://arxiv.org/abs/1809.05214.
Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning. CoRR, abs/1611.02779, 2016. URL http://arxiv.org/abs/1611.02779.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, 2017.

Jakob Foerster, Gregory Farquhar, Maruan Al-Shedivat, Tim Rocktäschel, Eric P. Xing, and Shimon Whiteson. DiCE: The Infinitely Differentiable Monte Carlo Estimator. In ICML, 2018. URL https://goo.gl/xkkGxN.

Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, and John Schulman. Meta Learning Shared Hierarchies. In ICLR, 2018. URL http://arxiv.org/abs/1710.09767.

Thomas Furmston, Guy Lever, David Barber, and Joelle Pineau. Approximate Newton Methods for Policy Search in Markov Decision Processes. Technical report, 2016. URL http://jmlr.org/papers/volume17/15-414/15-414.pdf.

Abhishek Gupta, Benjamin Eysenbach, Chelsea Finn, and Sergey Levine. Unsupervised Meta-Learning for Reinforcement Learning. In ICML, 2018a.

Abhishek Gupta, Russell Mendonca, Yuxuan Liu, Pieter Abbeel, and Sergey Levine. Meta-Reinforcement Learning of Structured Exploration Strategies. In ICML, 2018b. URL https://arxiv.org/pdf/1802.07245.pdf.

Sepp Hochreiter, A. Steven Younger, and Peter R. Conwell. Learning To Learn Using Gradient Descent. In ICANN, pp. 87–94, 2001. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.323.

Michael Hüsken and Christian Goerick. Fast learning for problem classes using knowledge based network initialization. In IJCNN. IEEE Computer Society Press, 2000. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.31.9720&rep=rep1&type=pdf.

Sham Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. In ICML, 2002. URL https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf.

Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. Learning to Generalize: Meta-Learning for Domain Generalization. In AAAI, 2017. URL www.aaai.org.

Thomas Miconi, Jeff Clune, and Kenneth O. Stanley. Differentiable plasticity: training plastic neural networks with backpropagation. In ICML, 2018. URL https://arxiv.org/abs/1804.02464.

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A Simple Neural Attentive Meta-Learner. In ICLR, 2018. URL http://arxiv.org/abs/1707.03141.

Alex Nichol, Joshua Achiam, and John Schulman. On First-Order Meta-Learning Algorithms. Technical report, 2018. URL http://arxiv.org/abs/1803.02999.

Jan Peters and Stefan Schaal. Policy Gradient Methods for Robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2219–2225. IEEE, 2006. URL http://ieeexplore.ieee.org/document/4058714/.

Sachin Ravi and Hugo Larochelle. Optimization as a Model for Few-Shot Learning. In ICLR, 2017. URL https://openreview.net/forum?id=rJY0-Kcll.

Steindór Saemundsson, Katja Hofmann, and Marc Peter Deisenroth. Meta Reinforcement Learning with Latent Variable Gaussian Processes. In UAI, 2018. URL https://arxiv.org/pdf/1803.07551.pdf.

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-Learning with Memory-Augmented Neural Networks. In ICML, 2016. URL http://proceedings.mlr.press/v48/santoro16.pdf.

Juergen Schmidhuber.
Evolutionary principles in self-referential learning. On learning how to learn: The meta-meta-... hook. Ph D thesis, Technische Universitaet Munchen, 1987. URL http: //people.idsia.ch/ juergen/diploma.html. Jrgen Schmidhuber, Jieyu Zhao, and Marco Wiering. Shifting Inductive Bias with Success-Story Algorithm, Adaptive Levin Search, and Incremental Self-Improvement. Machine Learning, 28 (1):105 130, 1997. ISSN 08856125. doi: 10.1023/A:1007383707642. URL http://link. springer.com/10.1023/A:1007383707642. John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient Estimation Using Stochastic Computation Graphs. In NIPS, 2015a. URL https://arxiv.org/pdf/1506. 05254.pdf. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust Region Policy Optimization. ICML, 2015b. ISSN 2158-3226. doi: 10.1063/1.4927398. URL http: //arxiv.org/abs/1502.05477. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov Openai. Proximal Policy Optimization Algorithms. Co RR, 2017. URL https://arxiv.org/pdf/1707. 06347.pdf. Bradly C Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, and Ilya Sutskever. Some Considerations on Learning to Explore via Meta-Reinforcement Learning. Technical report, 2018. URL https://arxiv.org/pdf/1803.01118.pdf. Published as a conference paper at ICLR 2019 Flood Sung, Li Zhang, Tao Xiang, Timothy Hospedales, and Yongxin Yang. Learning to Learn: Meta-Critic Networks for Sample Efficient Learning. Technical report, 6 2017. URL http: //arxiv.org/abs/1706.09529. Richard S. Sutton, David Mcallester, Satinder Singh, and Yishay Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NIPS, 2000. ISBN 0-262-19450-3. doi: 10.1.1.37.9714. Sebastian Thrun and Lorien Pratt. Learning to learn. 1998. ISBN 0792380479. URL https: //dl.acm.org/citation.cfm?id=296639. Emanuel Todorov, Tom Erez, and Yuval Tassa. Mu Jo Co: A physics engine for model-based control. In IROS, pp. 5026 5033. IEEE, 10 2012. ISBN 978-1-4673-1736-8. doi: 10.1109/IROS.2012. 6386109. URL http://ieeexplore.ieee.org/document/6386109/. Zhongwen Xu, Hado van Hasselt, and David Silver. Meta-Gradient Reinforcement Learning. Technical report, 5 2018. URL http://arxiv.org/abs/1805.09801. Published as a conference paper at ICLR 2019 A TWO META-POLICY GRADIENT FORMULATIONS In this section we discuss two different gradient-based meta-learning formulations, derive their gradients and analyze the differences between them. A.1 META-POLICY GRADIENT FORMULATION I The first meta-learning formulation, known as MAML (Finn et al., 2017), views the inner update rule U(θ, T ) as a mapping from the pre-update parameter θ and the task T to an adapted policy parameter θ . The update function can be viewed as stand-alone procedure that encapsulates sampling from the task-specific trajectory distribution PT (τ|πθ) and updating the policy parameters. Building on this concept, the meta-objective can be written as JI(θ) = ET ρ(T ) Eτ PT (τ |θ ) [R(τ )] with θ := U(θ, T ) (14) The task-specific gradients follow as θJI T (θ) = θEτ PT (τ |θ ) [R(τ )] (15) = Eτ PT (τ |θ ) [ θ log PT (τ |θ )R(τ )] (16) = Eτ PT (τ |θ ) [ θ log PT (τ |θ )R(τ ) θθ ] (17) In order to derive the gradients of the inner update θθ = θU(θ, T ) it is necessary to know the structure of U. 
The main part of this paper assumes the inner update rule to be a policy gradient descent step θU(θ, T ) = θ θ + α θEτ PT (τ|θ) [R(τ)] (18) = I + α 2 θ Eτ PT (τ|θ) [R(τ)] (19) Thereby the second term in (19) is the local curvature (hessian) of the inner adaptation objective function. The correct hessian of the inner objective can be derived as follows: 2 θ Eτ PT (τ|θ) [R(τ)] = θ Eτ PT (τ|θ) [ θ log πθ(τ)R(τ)] (20) Z PT (τ|θ) θ log πθ(τ)R(τ)dτ (21) = Z PT (τ|θ) θ log πθ(τ) θ log πθ(τ) R(τ)+ (22) PT (τ|θ) 2 θ log πθ(τ)R(τ)dτ (23) = Eτ PT (τ|θ) R(τ) 2 θ log πθ(τ) + θ log πθ(τ) θ log πθ(τ) A.2 META-POLICY GRADIENT FORMULATION II The second meta-reinforcement learning formulation views the the inner update θ = U(θ, τ 1:N) as a deterministic function of the pre-update policy parameters θ and N trajectories τ 1:N PT (τ 1:N|θ) sampled from the pre-update trajectory distribution. This formulation was introduced in Al-Shedivat et al. (2018) and further discussed with respect to its exploration properties in Stadie et al. (2018). Viewing U as a function that adapts the policy parameters θ to a specific task T given policy rollouts in this task, the corresponding meta-learning objective can be written as JII(θ) = ET ρ(T ) Eτ 1:N PT (τ 1:N|θ) h Eτ PT (τ |θ ) R(τ ) i with θ := U(θ, τ 1:N) (25) Since the first part of the gradient derivation is agnostic to the inner update rule U(θ, τ 1:N), we only assume that the inner update function U is differentiable w.r.t. θ. First we rewrite the meta-objective J(θ) as expectation of task specific objectives JII T (θ) under the task distribution. This allows us to express the meta-policy gradients as expectation of task-specific gradients: θJII(θ) = ET ρ(T ) θJII T (θ) (26) Published as a conference paper at ICLR 2019 The task specific gradients can be calculated as follows θJII T (θ) = θEτ PT (τ 1:N|θ) h Eτ PT (τ |θ ) R(τ ) i Z Z R(τ ) PT (τ |θ ) PT (τ 1:N|θ) dτ dτ = Z Z R(τ ) PT (τ |θ ) θ log PT (τ 1:N|θ)PT (τ 1:N|θ)+ R(τ ) θ log PT (τ |θ )PT (τ |θ ) PT (τ 1:N|θ) dτ dτ = Eτ 1:N PT (τ 1:N|θ) τ PT (τ |θ ) θ log PT (τ |θ ) + i=1 θ log PT (τ (n)|θ) = Eτ 1:N PT (τ 1:N|θ) τ PT (τ |θ ) θ log PT (τ |θ ) θθ + n=1 θ log PT (τ (n)|θ) As in A.1 the structure of U(θ, τ 1:N) must be known in order to derive the gradient θθ . Since we assume the inner update to be vanilla policy gradient, the respective gradient follows as U(θ, τ 1:N) = θ+α 1 n=1 θ log πθ(τ (n)))R(τ (n)) with θ log πθ(τ) = t=0 θ log πθ(at|st) The respective gradient of U(θ, τ 1:N) follows as θU(θ, τ 1:N) = θ n=1 θ log πθ(τ (n)))R(τ (n)) n=1 2 θ log πθ(τ (n)))R(τ (n)) (28) A.3 COMPARING THE GRADIENTS OF THE TWO FORMULATIONS In the following we analyze the differences between the gradients derived for the two formulations. To do so, we begin with θJI T (θ) by inserting the gradient of the inner adaptation step (19) into (17): θJI T (θ) = Eτ PT (τ |θ ) θ log PT (τ |θ )R(τ ) I + α 2 θ Eτ PT (τ|θ) [R(τ)] (29) We can substitute the hessian of the inner objective by its derived expression from (24) and then rearrange the terms. Also note that θ log PT (τ|θ) = θ log πθ(τ) = PH 1 t=1 log πθ(at|st) where H is the MDP horizon. θJI T (θ) = Eτ PT (τ |θ ) θ log PT (τ |θ )R(τ ) I + αEτ PT (τ|θ) R(τ) (30) 2 θ log πθ(τ) + θ log πθ(τ) θ log πθ(τ) (31) = E τ PT (τ|θ) τ PT (τ |θ ) θ log πθ (τ )R(τ ) I + αR(τ) 2 θ log πθ(τ) | {z } θJpost(τ,τ ) +α θ log πθ (τ )R(τ )R(τ) θ log πθ(τ) θ log πθ(τ) | {z } θJIpre(τ,τ ) Published as a conference paper at ICLR 2019 Next, we rearrange the gradient of JII into a similar form as θJI T (θ). 
For that, we start by inserting (28) for θθ and replacing the expectation over pre-update trajectories τ 1:N by the expectation over a single trajectory τ. θJI T (θ) = E τ PT (τ|θ) τ PT (τ |θ ) R(τ ) θ log πθ(τ ) I + αR(τ) 2 θ log πθ(τ)) | {z } θJpost(τ,τ ) +R(τ ) θ log πθ(τ) | {z } θJI pre(τ,τ ) While the first part of the gradients match ((32) and (34)), the second part ((33) and (35)) differs. Since the second gradient term can be viewed as responsible for shifting the pre-update sampling distribution PT (τ|θ) towards higher post-update returns, we refer to it as θJpre(τ, τ ) . To further analyze the difference between θJI pre and θJII pre we slightly rearrange (33) and put both gradient terms next to each other: θJI pre(τ, τ ) = α θ log πθ(τ) ( θ log πθ(τ)R(τ)) | {z } θJinner ( θ log πθ (τ )R(τ )) | {z } θ Jouter θJII pre(τ, τ ) = α θ log πθ(τ)R(τ ) (37) In the following we interpret and and compare of the derived gradient terms, aiming to provide intuition for the differences between the formulations: The first gradient term Jpost that matches in both formulations corresponds to a policy gradient step on the post-update policy πθ . Since θ itself is a function of θ, the term I + αR(τ) 2 θ log πθ(τ)) can be seen as linear transformation of the policy gradient update R(τ ) θ log πθ(τ ) from the post-update parameter θ into θ. Although Jpost takes into account the functional relationship between θ and θ, it does not take into account the pre-update sampling distribution PT (τ|θ). This is where θJpre comes into play: θJI pre can be viewed as policy gradient update of the preupdate policy πθ w.r.t. to the post-update return R(τ ). Hence this gradient term aims to shift the pre-update sampling distribution so that higher post-update returns are achieved. However, θJII pre does not take into account the causal dependence of the post-update policy on the pre-update policy. Thus a change in θ due to θJII pre may counteract the change due to θJII post. In contrast, θJI pre takes the dependence of the the post-update policy on the pre-update sampling distribution into account. Instead of simply weighting the gradients of the pre-update policy θ log πθ(τ) with R(τ ) as in θJI post, θJI post weights the gradients with inner product of the pre-update and post-update policy gradients. This inner product can be written as θJinner θ Jouter = || θJinner||2 || θ Jouter||2 cos(δ) (38) wherein δ denotes the angle between the the inner and outer pre-update and post-update policy gradients. Hence, θJI post steers the pre-update policy towards not only towards larger post-updates returns but also towards larger adaptation steps α θJinner, and better alignment of preand postupdate policy gradients. This directly optimizes for maximal improvement / adaptation for the respective task. See Li et al. (2017); Nichol et al. (2018) for a comparable analysis in case of domain generalization and supervised meta-learning. Also note that (38) allows formulation I to perform credit assignment on the trajectory level whereas formulation II can only assign credit to entire batches of N pre-update trajectories τ 1:N. As a result, we expect the first meta-policy gradient formulation to learn faster and more stably since the respective gradients take the dependence of the pre-update returns on the pre-update sampling distribution into account while this causal link is neglected in the second formulation. 
B ESTIMATING THE META-POLICY GRADIENTS When employing formulation I for gradient-based meta-learning, we aim maximize the loss J(θ) = ET ρ(T ) Eτ PT (τ |θ ) [R(τ )] with θ := θ + α θEτ PT (τ|θ) [R(τ)] (39) Published as a conference paper at ICLR 2019 by performing a form of gradient-descent on J(θ). Note that we, from now on, assume J := JI and thus omit the superscript indicating the respective meta-learning formulation. As shown in A.2 the gradient can be derived as θJ(θ) = E(T ) ρ(T )[ θJT (θ)] with θJT (θ) = Eτ PT (τ |θ ) θ log PT (τ |θ )R(τ ) I + α 2 θ Eτ PT (τ|θ) [R(τ)] (40) where 2 θJinner(θ) := 2 θ Eτ PT (τ|θ) [R(τ)] denotes hessian of the inner adaptation objective w.r.t. θ. This section concerns the question of how to properly estimate this hessian. B.1 ESTIMATING GRADIENTS OF THE RL REWARD OBJECTIVE Since the expectation over the trajectory distribution PT (τ|θ) is in general intractable, the score function trick is typically used to used to produce a Monte Carlo estimate of the policy gradients. Although the gradient estimate can be directly defined, when using a automatic-differentiation toolbox it is usually more convenient to use an objective function whose gradients correspond to the policy gradient estimate. Due to the Policy Gradient Theorem (PGT) Sutton et al. (2000) such a surrogate objective can be written as: t=0 log πθ(at|st) t =t r(st , at ) τk PT (τ) (41) t =0 log πθ(at|st) r(st , at ) τk PT (τ) (42) While (41) and (42) are equivalent (Peters & Schaal, 2006), the more popular formulation formulation (41) can be seen as forward looking credit assignment while (42) can be interpreted as backward looking credit assignment (Foerster et al., 2018). A generalized procedure for constructing surrogate objectives for arbitrary stochastic computation graphs can be found in Schulman et al. (2015a). B.2 A DECOMPOSITION OF THE HESSIAN Estimating the the hessian of the reinforcement learning objective has been discussed in Furmston et al. (2016) and Baxter & Bartlett (2001) with focus on second order policy gradient methods. In the infinite horizon MDP case, Baxter & Bartlett (2001) derive a decomposition of the hessian. In the following, we extend their finding to the finite horizon case. Proposition. The hessian of the RL objective can be decomposed into four matrix terms: 2 θJinner(θ) = H1 + H2 + H12 + H 12 (43) H1 = Eτ PT (τ|θ) t=0 θ log πθ(at, st) θ log πθ(at, st) H 1 X t =t r(st , at ) H2 = Eτ PT (τ|θ) t=0 2 θ log πθ(at, st) t =t r(st , at ) H12 = Eτ PT (τ|θ) t=0 θ log πθ(at, st) θQπθ t (st, at) # Here Qπθ t (st, at) = Eτ t+1:H 1 PT ( |θ) h PH 1 t =t r(st , at )|st, at i denotes the expected state-action value function under policy πθ at time t. Published as a conference paper at ICLR 2019 Proof. As derived in (24), the hessian of Jinner(θ) follows as: 2 θJinner = Eτ PT (τ|θ) R(τ) 2 θ log πθ(τ) + θ log πθ(τ) θ log πθ(τ) (47) = Eτ PT (τ|θ) t =0 2 θ log πθ(at , st ) + Eτ PT (τ|θ) t =0 θ log πθ(at , st ) t =0 θ log πθ(at , st ) = Eτ PT (τ|θ) t=0 2 θ log πθ(at, st) t =t r(st , at ) + Eτ PT (τ|θ) h=0 θ log πθ(at , st ) θ log πθ(ah, sh) ! The term in (50) is equal to H2. We continue by showing that the remaining term in (51) is equivalent to H1 + H12 + H 12. For that, we split the inner double sum in (51) into three components: Eτ PT (τ|θ) h=0 θ log πθ(at , st ) θ log πθ(ah, sh) ! = Eτ PT (τ|θ) t =0 θ log πθ(at , st ) θ log πθ(at , st ) ! + Eτ PT (τ|θ) h=0 θ log πθ(at , st ) θ log πθ(ah, sh) + Eτ PT (τ|θ) h=t +1 θ log πθ(at , st ) θ log πθ(ah, sh) ! 
By changing the backward looking summation over outer products into a forward looking summation of rewards, (53) can be shown to be equal to H1: Eτ PT (τ|θ) t =0 θ log πθ(at , st ) θ log πθ(at , st ) ! = Eτ PT (τ|θ) t=0 θ log πθ(at, st) θ log πθ(at, st) H 1 X t =t r(st , at ) By simply exchanging the summation indices t and h in (55) it is straightforward to show that (55) is the transpose of (54). Hence it is sufficient to show that (54) is equivalent to H12. However, instead of following the direction of the previous proof we will now start with the definition of H12 and derive the expression in (54). H12 = Eτ PT (τ|θ) t=0 θ log πθ(at, st) θQπθ t (st, at) # The gradient of Qπθ t can be expressed recursively: θQπθ t (st, at) = θEst+1 at+1 Qπθ t+1(st+1, at+1) (61) = Est+1 at+1 θ log πθ(at+1, st+1)Qπθ t+1(st+1, at+1) + θQπθ t+1(st+1, at+1) (62) Published as a conference paper at ICLR 2019 By induction, it follows that θQπθ t (st, at) = Eτ t+1:H 1 PT ( |θ) t =t+1 θ log πθ(at , st ) h=t r(sh, ah) When inserting (63) into (59) and swapping the summation, we are able to show that H12 is equivalent to (54). H12 = Eτ PT (τ|θ) t =t+1 θ log πθ(at, st) θ log πθ(at , st ) H 1 X h=t r(sh, ah) = Eτ PT (τ|θ) h=0 θ log πθ(at , st ) θ log πθ(ah, sh) This concludes the proof that the hessian of the expected sum of rewards under policy πθ and an MDP with finite time horizon H can be decomposed into H1 + H2 + H12 + H 12. B.3 ESTIMATING THE HESSIAN OF THE RL REWARD OBJECTIVE As pointed out by Al-Shedivat et al. (2018); Stadie et al. (2018) and Foerster et al. (2018), simply differentiating through the gradient of surrogate objective JPGT as done in the original MAML version (Finn et al., 2017) leads to biased hessian estimates. Specifically, when compared with the unbiased estimate, as derived in (24) and decomposed in Appendix B.2, both H1 and H12 +H 12 are missing. Thus, θJpre does not appear in the gradients of the meta-objective (i.e. θJ = θJpost). Only performing gradient descent with θJpost entirely neglects influences of the pre-update sampling distribution. This issue was overseen in the RL-MAML implementation of Finn et al. (2017). As discussed in Stadie et al. (2018) this leads to poor performance in meta-learning problems that require exploration during the pre-update sampling. B.3.1 THE DICE MONTE-CARLO ESTIMATOR Addressing the issue of incorrect higher-order derivatives of monte-carlo estimators, Foerster et al. (2018) propose DICE which mainly builds upon an newly introduced Magic Box( ) operator. This operator allows to formulate monte-carlo estimators with correct higher-order derivatives. A DICE formulation of a policy gradient estimator reads as: t=0 θ({at t})r(st, at) (66) t =0 log πθ(at |st ) (log πθ(at |st ) r(st, at) (67) In that, denotes a stop gradient operator (i.e. (fθ(x)) fθ(x) but θ (fθ(x)) 0). Note that denotes a evaluates to and does not necessarily imply equality w.r.t. to gradients. Hence, JDICE(θ) evaluates to the sum of rewards at 0th order but produces the unbiased gradients n θ JDICE(θ) when differentiated n-times (see Foerster et al. (2018) for proof). To shed more light on the maverick DICE formulation, we rewrite (67) as follows: πθ(at |st ) (πθ(at |st )) r(st, at) (68) Interpreting this novel formulation, the Magic Box operator θ({at t}) can be understood as dry importance sampling weight. 
At 0th order it evaluates to 1 and leaves the objective function unaffected, but when differentiated once it yields an estimator for the marginal rate of return due to a change in the policy-implied trajectory distribution. Published as a conference paper at ICLR 2019 In the following we show that on expectation 1) the gradients of (81) match standard policy gradients and 2) its hessian estimate is equal to the hessian of inner RL objective, derived in B.2. πθ(at |st ) (πθ(at |st )) r(st, at) (69) πθ(at |st ) (πθ(at |st )) t =0 θ log πθ(at |st ) r(st, at) (70) t =0 θ log πθ(at |st ) r(st, at) (71) Here, (71) corresponds to the backward looking credit assignment formulation of policy gradients θJPGT as discussed in B.1. Once again we take the derivative in order to obtain the Hessian of JDICE: πθ(at |st ) (πθ(at |st )) t =0 θ log πθ(at |st ) r(st, at) (72) πθ(at |st ) (πθ(at |st )) t =0 θ log πθ(at |st ) r(st, at) (73) t =0 θ log πθ(at |st ) t =0 θ log πθ(at |st ) r(st, at) (74) t =0 2 θ log πθ(at |st ) r(st, at) (75) In expectation, Eτ PT (τ|θ)[ 2 θJDICE] the DICE monte carlo estimate of the hessian is equivalent to the hessian of the inner objective. To show this, we use the expression of 2 θJinner (49): Eτ PT (τ|θ)[ 2 θJDICE] (76) = Eτ PT (τ|θ) t =0 θ log πθ(at |st ) t =0 θ log πθ(at |st ) r(st, at) + t =0 2 θ log πθ(at |st ) r(st, at) (78) = H1 + H2 + H12 + H 12 (79) = 2 θJinner (80) B.4 BIAS AND VARIANCE OF THE CURVATURE ESTIMATE As shown in the previous section, 2 θJDICE provides an unbiased estimate of the hessian of the inner objective Jinner = Eτ PT (τ|θ) [R(τ)]. However, recall the DICE objective involves a product of importance weights along the trajectory. πθ(at |st ) (πθ(at |st )) r(st, at) (81) Taking the 2nd derivative of this product leads to the outer product of sums in (74) which is of high variance w.r.t to τ. Specifically, this outer product of sums can be decomposed into three terms H1 +H12 +H 12 (see Appendix B.2). As noted by Furmston et al. (2016), H12 +H 12 is particularly difficult to estimate. In section 7.2 we empirically show that the high variance curvature estimates obtained with the DICE objective require large batch sizes and impede sample efficient learning. In the following we develop a low variance curvature (LVC) estimator JLVC which matches JDICE at the gradient level and yields lower-variance estimates of the hessian by neglecting H12 + H 12. Published as a conference paper at ICLR 2019 Before formally introducing JLVC, we motivate such estimator starting with the policy gradient estimate that was originally derived in Sutton et al. (2000), followed by marginalizing the trajectory level distribution PT (τ|θ) over states st and actions at. Note that we omit reward baselines for notational simplicity. θJinner = Eτ PT (τ|θ) t=0 θ log πθ(at|st) t =t r(st , at ) t=0 E st p πθ t (st) at πθ(at|st) θ log πθ(at|st) t =t r(st , at ) In that, pπθ t (st) denotes the state visitation frequency at time step t, i.e. the probability density of being in st after t steps under the policy πθ. In the general case pπθ t (st) is intractable but depends on the policy parameter θ. We make the simplifying assumption that pπθ t (st) is fixed in a local region of θ. Since we make this assumption at the gradient level, this corresponds to a 1st order Taylor expansion of pπθ t (st) in θ. Note that this assumption is also used in the Monotonic Policy Improvement Theory (Kakade & Langford, 2002; Schulman et al., 2015a). 
Before formally introducing $J^{\text{LVC}}$, we motivate this estimator by starting from the policy gradient estimate originally derived in Sutton et al. (2000), which is obtained by marginalizing the trajectory-level distribution $P_T(\tau|\theta)$ over states $s_t$ and actions $a_t$. Note that we omit reward baselines for notational simplicity.

$$
\nabla_\theta J_{\text{inner}} = \mathbb{E}_{\tau \sim P_T(\tau|\theta)}\left[\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})\right] \tag{82}
$$
$$
= \sum_{t=0}^{H-1} \mathbb{E}_{\substack{s_t \sim p_t^{\pi_\theta}(s_t) \\ a_t \sim \pi_\theta(a_t|s_t)}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})\right] \tag{83}
$$

Here, $p_t^{\pi_\theta}(s_t)$ denotes the state visitation frequency at time step $t$, i.e. the probability density of being in $s_t$ after $t$ steps under the policy $\pi_\theta$. In the general case $p_t^{\pi_\theta}(s_t)$ is intractable, but it depends on the policy parameters $\theta$. We make the simplifying assumption that $p_t^{\pi_\theta}(s_t)$ is fixed in a local region of $\theta$. Since we make this assumption at the gradient level, it corresponds to a 1st-order Taylor expansion of $p_t^{\pi_\theta}(s_t)$ in $\theta$. Note that this assumption is also used in monotonic policy improvement theory (Kakade & Langford, 2002; Schulman et al., 2015a).

Based on this condition, the Hessian follows as the derivative of (83), whereby a stop-gradient expression around the state visitation frequency $p_t^{\pi_\theta}(s_t)$ resembles the 1st-order Taylor approximation:

$$
\mathbb{E}_\tau\big[\nabla^2_\theta J^{\text{LVC}}\big] = \nabla_\theta \sum_{t=0}^{H-1} \mathbb{E}_{\substack{s_t \sim \bot(p_t^{\pi_\theta}(s_t)) \\ a_t \sim \pi_\theta(a_t|s_t)}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})\right] \tag{84}
$$
$$
= \sum_{t=0}^{H-1} \mathbb{E}_{\substack{s_t \sim \bot(p_t^{\pi_\theta}(s_t)) \\ a_t \sim \pi_\theta(a_t|s_t)}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)^\top \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}) \right. \tag{85}
$$
$$
\left. \qquad\qquad + \nabla^2_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})\right] \tag{86}
$$

Since the expectation in (84) is intractable, it must be evaluated by a Monte-Carlo estimate. However, simply replacing the expectation with an average over sampled trajectories induces a wrong Hessian that no longer corresponds to (86), since the outer product of log-gradients would be missing when differentiated. To ensure that automatic differentiation still yields the correct Hessian, we add a "dry" importance weight comparable to DICE:

$$
\nabla_\theta J^{\text{LVC}} = \sum_{t=0}^{H-1} \frac{\pi_\theta(a_t|s_t)}{\bot(\pi_\theta(a_t|s_t))} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}), \qquad \tau \sim P_T(\tau|\theta) \tag{87}
$$

Integrating (87) yields the LVC surrogate objective $J^{\text{LVC}}$:

$$
J^{\text{LVC}} = \sum_{t=0}^{H-1} \frac{\pi_\theta(a_t|s_t)}{\bot(\pi_\theta(a_t|s_t))} \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}), \qquad \tau \sim P_T(\tau|\theta) \tag{88}
$$

The gradients of $J^{\text{LVC}}$ match $\nabla_\theta J^{\text{DICE}}$ and resemble an unbiased policy gradient estimate:

$$
\nabla_\theta J^{\text{LVC}} = \sum_{t=0}^{H-1} \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\bot(\pi_\theta(a_t|s_t))} \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}) \tag{89}
$$
$$
= \sum_{t=0}^{H-1} \frac{\pi_\theta(a_t|s_t)}{\bot(\pi_\theta(a_t|s_t))} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}) \tag{90}
$$
$$
\rightarrow \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}) \tag{91}
$$

The respective Hessian can be obtained by differentiating (90):

$$
\nabla^2_\theta J^{\text{LVC}} = \nabla_\theta \sum_{t=0}^{H-1} \frac{\pi_\theta(a_t|s_t)}{\bot(\pi_\theta(a_t|s_t))} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})
$$
$$
= \sum_{t=0}^{H-1} \frac{\pi_\theta(a_t|s_t)}{\bot(\pi_\theta(a_t|s_t))} \left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)^\top + \nabla^2_\theta \log \pi_\theta(a_t|s_t)\right] \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})
$$
$$
\rightarrow \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)^\top \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}) \tag{97}
$$
$$
\quad + \sum_{t=0}^{H-1} \nabla^2_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'}) \tag{98}
$$

In expectation, $\nabla^2_\theta J^{\text{LVC}}$ is equivalent to $H_1 + H_2$:

$$
\mathbb{E}_{\tau \sim P_T(\tau|\theta)}\big[\nabla^2_\theta J^{\text{LVC}}\big] = \mathbb{E}_{\tau \sim P_T(\tau|\theta)}\left[\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)^\top \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})\right] \tag{99}
$$
$$
\quad + \mathbb{E}_{\tau \sim P_T(\tau|\theta)}\left[\sum_{t=0}^{H-1} \nabla^2_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{H-1} r(s_{t'}, a_{t'})\right] \tag{100}
$$
$$
= H_1 + H_2 \tag{101}
$$

The Hessian $\nabla^2_\theta J^{\text{LVC}}$ no longer provides an unbiased estimate of $\nabla^2_\theta J_{\text{inner}}$, since it neglects the matrix term $H_{12} + H_{12}^\top$. This approximation rests on the assumption that the state visitation distribution is locally unaffected by marginal changes in $\theta$, and it leads to a substantial reduction of variance in the Hessian estimate. Furmston et al. (2016) show that under certain conditions (i.e. infinite-horizon MDP, sufficiently rich policy parameterization) the term $H_{12} + H_{12}^\top$ vanishes around a local optimum $\theta^*$. Given that these conditions hold, $\mathbb{E}_\tau[\nabla^2_\theta J^{\text{LVC}}] \rightarrow \mathbb{E}_\tau[\nabla^2_\theta J^{\text{DICE}}]$ as $\theta \rightarrow \theta^*$, i.e. the bias of the LVC estimator becomes negligible close to the local optimum. The experiments in Section 7.2 confirm this theoretical argument empirically and show that the low variance curvature estimates obtained through $J^{\text{LVC}}$ improve the sample-efficiency of meta-learning by a significant margin.
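Side by side, the difference between the two surrogates is a single design choice: DICE weights each reward by the product of per-step ratios along the trajectory prefix, whereas LVC weights each return-to-go by the ratio of the current step only. The sketch below is our own minimal PyTorch illustration under that reading of (68) and (88); function names and tensor shapes are assumptions, not the paper's released code.

```python
import torch

def returns_to_go(rewards):
    # Sum_{t'=t}^{H-1} r(s_t', a_t') for every t, via a reversed cumulative sum.
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])

def lvc_surrogate(log_probs, rewards):
    """LVC surrogate (88) for a single trajectory.

    Unlike the dice_surrogate sketch in B.3.1, each return-to-go is weighted by a
    single per-step ratio pi_theta(a_t|s_t) / stop_grad(pi_theta(a_t|s_t)) instead
    of the product of ratios along the trajectory prefix.
    """
    weights = torch.exp(log_probs - log_probs.detach())
    return (weights * returns_to_go(rewards)).sum()
```

At 0th order the stop-gradient weights evaluate to one; the two surrogates differ only in the derivatives that automatic differentiation propagates through these weights, which is precisely where the bias-variance trade-off between DICE and LVC arises.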
C PROXIMAL POLICY SEARCH METHODS

C.1 MONOTONIC POLICY IMPROVEMENT THEORY

This section provides a brief introduction to policy performance bounds and the theory of monotonic policy improvement in the standard reinforcement learning setting. While Section 6 discusses the extension of this theory to meta-learning, the following explanations assume a standard RL setting in which $\mathcal{T}$ is exogenously given. Hence, we omit the dependence on $\mathcal{T}$ for notational brevity. Since the monotonic policy improvement framework relies on infinite-horizon MDPs, we assume $H \rightarrow \infty$ for the remainder of this section.

In addition to the expected reward $J(\pi)$ under policy $\pi$, we will use the state value function $V^\pi$, the state-action value function $Q^\pi$, and the advantage function $A^\pi$:

$$
V^\pi(s) = \mathbb{E}_{a_0, s_1, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\right]
$$
$$
Q^\pi(s, a) = \mathbb{E}_{s_1, a_1, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s,\, a_0 = a\right] = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s'|s, a)}\left[V^\pi(s')\right]
$$
$$
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)
$$

with $a_t \sim \pi(a_t|s_t)$ and $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$. The expected return under a policy $\tilde{\pi}$ can be expressed as the sum of the expected return of another policy $\pi$ and the expected discounted advantage of $\tilde{\pi}$ over $\pi$ (see Schulman et al. (2015a) for the proof):

$$
J(\tilde{\pi}) = J(\pi) + \mathbb{E}_{\tau \sim P(\tau|\tilde{\pi})}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right]
$$

Let $d^\pi$ denote the discounted state visitation frequency:

$$
d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t\, p(s_t = s \,|\, \pi)
$$

We can use $d^\pi$ to express the expectation over trajectories $\tau \sim p^{\tilde{\pi}}(\tau)$ in terms of states and actions:

$$
J(\tilde{\pi}) = J(\pi) + \mathbb{E}_{\substack{s \sim d^{\tilde{\pi}}(s) \\ a \sim \tilde{\pi}(a|s)}}\left[A^\pi(s, a)\right] \tag{102}
$$

Local policy search aims to find a policy update $\pi \rightarrow \tilde{\pi}$ in the proximity of $\pi$ so that $J(\tilde{\pi})$ is maximized. Since $J(\pi)$ is not affected by the policy update $\pi \rightarrow \tilde{\pi}$, it is sufficient to maximize the expected advantage under $\tilde{\pi}$. However, the complex dependence of $d^{\tilde{\pi}}(s)$ on $\tilde{\pi}$ makes it hard to maximize the objective in (102) directly. Using a local approximation of (102) in which the state visitation frequencies $d^\pi$ and $d^{\tilde{\pi}}$ are assumed to be identical, the optimization can be phrased as

$$
\tilde{J}_\pi(\tilde{\pi}) = J(\pi) + \mathbb{E}_{\substack{s \sim d^{\pi}(s) \\ a \sim \tilde{\pi}(a|s)}}\left[A^\pi(s, a)\right] = J(\pi) + \mathbb{E}_{\substack{s \sim d^{\pi}(s) \\ a \sim \pi(a|s)}}\left[\frac{\tilde{\pi}(a|s)}{\pi(a|s)}\, A^\pi(s, a)\right] \tag{103}
$$

In the following we refer to $\tilde{J}_\pi(\tilde{\pi})$ as the surrogate objective. It can be shown that the surrogate objective $\tilde{J}$ matches $J$ to first order when $\tilde{\pi} = \pi$ (see Kakade & Langford (2002)). If $\pi_\theta$ is a parametric and differentiable function with parameter vector $\theta$, this means that for any $\theta_o$:

$$
\tilde{J}_{\pi_{\theta_o}}(\pi_{\theta_o}) = J(\pi_{\theta_o}) \qquad \text{and} \qquad \nabla_\theta \tilde{J}_{\pi_{\theta_o}}(\pi_\theta)\big|_{\theta_o} = \nabla_\theta J(\pi_\theta)\big|_{\theta_o} \tag{104}
$$

When $\tilde{\pi} \neq \pi$, an approximation error of the surrogate objective $\tilde{J}$ w.r.t. the true objective $J$ is introduced. Achiam et al. (2017) derive a lower bound for the true expected return of $\tilde{\pi}$:

$$
J(\tilde{\pi}) \geq \tilde{J}_\pi(\tilde{\pi}) - C \sqrt{\mathbb{E}_{s \sim d^\pi}\big[D_{\text{KL}}[\tilde{\pi}(\cdot|s)\,\|\,\pi(\cdot|s)]\big]} = \tilde{J}_\pi(\tilde{\pi}) - C \sqrt{\bar{D}_{\text{KL}}[\tilde{\pi}\,\|\,\pi]} \tag{105}
$$

with $C = \frac{2\gamma}{1-\gamma} \max_s \big|\mathbb{E}_{a \sim \tilde{\pi}(a|s)}[A^\pi(s, a)]\big|$.

C.2 TRUST REGION POLICY OPTIMIZATION (TRPO)

Trust region policy optimization (TRPO) (Schulman et al., 2015a) approximates the bound in (105) by phrasing local policy search as a constrained optimization problem:

$$
\arg\max_\theta\; \mathbb{E}_{\substack{s \sim d^{\pi_{\theta_o}}(s) \\ a \sim \pi_{\theta_o}(a|s)}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_o}(a|s)}\, A^{\pi_{\theta_o}}(s, a)\right] \quad \text{s.t.} \quad \bar{D}_{\text{KL}}[\pi_{\theta_o}\,\|\,\pi_\theta] \leq \delta \tag{106}
$$

Thereby the KL-constraint $\delta$ induces a local trust region around the current policy $\pi_{\theta_o}$. A practical implementation of TRPO uses a quadratic approximation of the KL-constraint, which leads to the following update rule:

$$
\theta \leftarrow \theta_o + \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\; F^{-1} g \tag{107}
$$

with $g := \nabla_\theta \mathbb{E}_{\substack{s \sim d^{\pi_{\theta_o}}(s) \\ a \sim \pi_{\theta_o}(a|s)}}\big[\frac{\pi_\theta(a|s)}{\pi_{\theta_o}(a|s)}\, A^{\pi_{\theta_o}}(s, a)\big]$ being the gradient of the objective and $F = \nabla^2_\theta \bar{D}_{\text{KL}}[\pi_{\theta_o}\,\|\,\pi_\theta]$ the Fisher information matrix of the current policy $\pi_{\theta_o}$. To avoid the cubic time complexity of inverting $F$, the Conjugate Gradient (CG) algorithm is typically used to approximate the product $F^{-1}g$, requiring only Fisher-vector products.
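For illustration, the sketch below shows a generic natural-gradient step of this kind: conjugate gradient approximately solves $Fx = g$ using only matrix-vector products, and the result is rescaled to satisfy the quadratic KL constraint. This is our own simplified Python/NumPy illustration of the equations above, not the TRPO reference implementation; it omits the Fisher-vector products via automatic differentiation and the backtracking line search that practical implementations add.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, given only Fisher-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - F x (x is initialized to zero)
    p = g.copy()
    r_dot_r = r @ r
    for _ in range(iters):
        z = fvp(p)
        alpha = r_dot_r / (p @ z)
        x += alpha * p
        r -= alpha * z
        new_r_dot_r = r @ r
        if new_r_dot_r < tol:
            break
        p = r + (new_r_dot_r / r_dot_r) * p
        r_dot_r = new_r_dot_r
    return x

def trpo_step(theta, g, fvp, delta=0.01):
    """One update theta_o -> theta_o + sqrt(2*delta / g^T F^{-1} g) * F^{-1} g, cf. (107)."""
    x = conjugate_gradient(fvp, g)                     # x approximates F^{-1} g
    step_size = np.sqrt(2.0 * delta / (x @ fvp(x) + 1e-8))
    return theta + step_size * x
```

In practice $F$ is never formed explicitly: `fvp` is implemented as a Hessian-vector product of the average KL divergence, and the candidate step is typically accepted only after a line search verifies the actual KL constraint and surrogate improvement.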
C.3 PROXIMAL POLICY OPTIMIZATION (PPO)

While TRPO is framed as a constrained optimization problem, the theory discussed in Appendix C.1 suggests optimizing the lower bound directly. Based on this insight, Schulman et al. (2017) propose adding a KL penalty to the objective and solving the following unconstrained optimization problem:

$$
\arg\max_\theta\; \mathbb{E}_{\substack{s \sim d^{\pi_{\theta_o}}(s) \\ a \sim \pi_{\theta_o}(a|s)}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_o}(a|s)}\, A^{\pi_{\theta_o}}(s, a) - \beta\, D_{\text{KL}}[\pi_{\theta_o}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)]\right] \tag{108}
$$

However, they also show that it is not sufficient to set a fixed penalty coefficient $\beta$ and propose two alternative methods, known as Proximal Policy Optimization (PPO), that aim to alleviate this issue:

1) Adapting the KL coefficient $\beta$ so that a desired target KL-divergence $\bar{D}_{\text{KL}}[\pi_{\theta_o}\,\|\,\pi_\theta]$ between the policy before and after the parameter update is achieved.

2) Clipping the likelihood ratio so that the optimization has no incentive to move the policy $\pi_\theta$ too far away from the original policy $\pi_{\theta_o}$. A corresponding optimization objective reads as:

$$
J^{\text{CLIP}} = \mathbb{E}_{\substack{s \sim d^{\pi_{\theta_o}}(s) \\ a \sim \pi_{\theta_o}(a|s)}}\left[\min\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_o}(a|s)}\, A^{\pi_{\theta_o}}(s, a),\; \text{clip}_{1-\epsilon}^{1+\epsilon}\!\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_o}(a|s)}\right) A^{\pi_{\theta_o}}(s, a)\right)\right] \tag{109}
$$

Empirical results show that the latter approach leads to better learning performance (Schulman et al., 2017). Since the PPO objective keeps $\pi_\theta$ in proximity of $\pi_{\theta_o}$, it allows multiple gradient steps to be performed without re-sampling trajectories from the updated policy. This property substantially improves the data-efficiency of PPO over vanilla policy gradient methods, which need to re-estimate the gradients after each step.

D EXPERIMENTS

D.1 HYPERPARAMETER CHOICE

The optimal hyperparameters for each algorithm were determined using parameter sweeps. Table 1 contains the hyperparameter settings used for the different algorithms. Any environment-specific modifications are noted in the respective paragraph describing the environment.

Table 1: Hyperparameter settings used in each algorithm

All Algorithms
  Policy Hidden Layer Sizes: (64, 64); (128, 128) for Humanoid
  Num Adapt Steps: 1
  Inner Step Size α: 0.01
  Tasks Per Iteration: 40
  Trajectories Per Task: 20

ProMP
  Outer Learning Rate β: 0.001
  Grad Steps Per ProMP Iteration: 5
  Outer Clip Ratio ϵ: 0.3
  KL Penalty Coef. η: 0.0005

MAML-TRPO
  Trust Region Size: 0.01

MAML-VPG
  Outer Learning Rate β: 0.001

D.2 ENVIRONMENT SPECIFICATIONS

Point Env (used in the experiment in 7.3)
  Trajectory Length: 100
  Num Adapt Steps: 3

In this environment, each task corresponds to one corner of the area. The point mass must reach the goal by applying directional forces. The agent only experiences a reward when within a certain radius of the goal, and the magnitude of the reward is equal to the distance to the goal.

Half Cheetah Fwd Back, Ant Fwd Back, Walker Fwd Back, Humanoid Fwd Back
  Trajectory Length: 100 (Half Cheetah, Ant); 200 (Humanoid, Walker)
  Num Adapt Steps: 1

The task is chosen between two directions, forward and backward. Each agent must run along the goal direction as far as possible, with reward equal to average velocity minus control costs.

Ant Rand Direc, Humanoid Rand Direc
  Trajectory Length: 100 (Ant); 200 (Humanoid)
  Num Adapt Steps: 1

Each task corresponds to a random direction in the XY plane. As above, each agent must learn to run in that direction as far as possible, with reward equal to average velocity minus control costs.

Ant Rand Goal
  Trajectory Length: 200
  Num Adapt Steps: 2

In this environment, each task is a location randomly chosen from a circle in the XY plane. The goal is not given to the agent; it must learn to locate, approach, and stop at the target. The agent receives a penalty equal to the distance from the goal.

Hopper Rand Params, Walker Rand Params
  Trajectory Length: 200
  Inner LR: 0.05
  Num Adapt Steps: 1

The agent must move forward as quickly as possible.
Each task is a different randomization of the simulation parameters, including friction, joint mass, and inertia. The agent receives a reward equal to its velocity.

D.3 FURTHER EXPERIMENT RESULTS

In addition to the six environments displayed in Figure 2, we ran experiments on the other four continuous control environments described above. The results are displayed in Figure 7.

Figure 7: Meta-learning curves (average return over timesteps) of ProMP and four other gradient-based meta-learning algorithms (MAML, E-MAML-TRPO, LVC-VPG, E-MAML-VPG) in four additional MuJoCo environments: Ant Fwd Back, Humanoid Fwd Back, Ant Rand Goal, and Walker Rand Vel.

In addition to the improved sample complexity and better asymptotic performance, another advantage of ProMP is its computation time. Figure 8 shows the average time spent per iteration throughout the learning process in the humanoid environment for ProMP, LVC-VPG, and MAML-TRPO. Due to the expensive conjugate gradient steps used in TRPO, MAML takes far longer than either first-order method. Since ProMP takes multiple stochastic gradient descent steps per iteration, it leads to longer outer update times compared to VPG, but in both cases the update time is a fraction of the time spent sampling from the environment.

The difference in sampling time is due to the reset process: resetting the environment when the agent dies is an expensive operation. ProMP acquires better performance more quickly, and as a result the agent experiences longer trajectories and the environment is reset less often. In our setup, instances of the environment are run in parallel, and performing a reset blocks all environments.

Figure 8: Comparison of wall-clock time (seconds per iteration, broken down into sampling, sample processing, inner step, and outer step) for the different algorithms on Humanoid Rand Direc, averaged over all iterations.