# Iterative Amortized Policy Optimization

Joseph Marino (California Institute of Technology), Alexandre Piché (Mila, Université de Montréal), Alessandro Davide Ialongo (University of Cambridge), Yisong Yue (California Institute of Technology)

Now at DeepMind, London, UK. Correspondence to josephmarino@deepmind.com. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Policy networks are a central feature of deep reinforcement learning (RL) algorithms for continuous control, enabling the estimation and sampling of high-value actions. From the variational inference perspective on RL, policy networks, when used with entropy or KL regularization, are a form of amortized optimization, optimizing network parameters rather than the policy distributions directly. However, direct amortized mappings can yield suboptimal policy estimates and restricted distributions, limiting performance and exploration. Given this perspective, we consider the more flexible class of iterative amortized optimizers. We demonstrate that the resulting technique, iterative amortized policy optimization, yields performance improvements over direct amortization on benchmark continuous control tasks. Accompanying code: github.com/joelouismarino/variational_rl.

1 Introduction

Reinforcement learning (RL) algorithms involve policy evaluation and policy optimization [73]. Given a policy, one can estimate the value for each state or state-action pair following that policy, and given a value estimate, one can improve the policy to maximize the value. This latter procedure, policy optimization, can be challenging in continuous control due to instability and poor asymptotic performance. In deep RL, where policies over continuous actions are often parameterized by deep networks, such issues are typically tackled using regularization from previous policies [67, 68] or by maximizing policy entropy [57, 23]. These techniques can be interpreted as variational inference [51], using optimization to infer a policy that yields high expected return while satisfying prior policy constraints. This smooths the optimization landscape, improving stability and performance [3].

However, one subtlety arises: when used with entropy or KL regularization, policy networks perform amortized optimization [26]. That is, rather than optimizing the action distribution, e.g., mean and variance, many deep RL algorithms, such as soft actor-critic (SAC) [31, 32], instead optimize a network to output these parameters, learning to optimize the policy. Typically, this is implemented as a direct mapping from states to action distribution parameters. While such direct amortization schemes have improved the efficiency of variational inference as encoder networks [44, 64, 56], they also suffer from several drawbacks: 1) they tend to provide suboptimal estimates [20, 43, 55], yielding a so-called amortization gap in performance [20], 2) they are restricted to a single estimate [27], thereby limiting exploration, and 3) they cannot generalize to new objectives, unlike, e.g., gradient-based [36] or gradient-free optimizers [66].

Inspired by techniques and improvements from variational inference, we investigate iterative amortized policy optimization. Iterative amortization [55] uses gradients or errors to iteratively update the parameters of a distribution. Unlike direct amortization, which receives gradients only after
outputting the distribution, iterative amortization uses these gradients online, thereby learning to iteratively optimize. In generative modeling settings, iterative amortization empirically outperforms direct amortization [55, 54] and can find multiple modes of the optimization landscape [27].

The contributions of this paper are as follows:
- We propose iterative amortized policy optimization, exploiting a new, fruitful connection between amortized variational inference and policy optimization.
- Using the suite of MuJoCo environments [78, 12], we demonstrate performance improvements over direct amortized policies, as well as more complex flow-based policies.
- We demonstrate novel benefits of this amortization technique: improved accuracy, providing multiple policy estimates, and generalizing to new objectives.

2 Background

2.1 Preliminaries

We consider Markov decision processes (MDPs), where s_t ∈ S and a_t ∈ A are the state and action at time t, resulting in reward r_t = r(s_t, a_t). Environment state transitions are given by s_{t+1} ∼ p_env(s_{t+1}|s_t, a_t), and the agent is defined by a parametric distribution, p_θ(a_t|s_t), with parameters θ.² The discounted sum of rewards is denoted as R(τ) = Σ_t γ^t r_t, where γ ∈ (0, 1] is the discount factor, and τ = (s_1, a_1, . . . ) is a trajectory. The distribution over trajectories is:

$$p(\tau) = \rho(s_1) \prod_{t=1} p_\text{env}(s_{t+1}|s_t, a_t)\, p_\theta(a_t|s_t), \tag{1}$$

where the initial state is drawn from the distribution ρ(s_1). The standard RL objective consists of maximizing the expected discounted return, E_{p(τ)}[R(τ)]. For convenience of presentation, we use the undiscounted setting (γ = 1), though the formulation can be applied with any valid γ.

2.2 KL-Regularized Reinforcement Learning

Various works have formulated RL, planning, and control problems in terms of probabilistic inference [21, 8, 79, 77, 11, 51]. These approaches consider the agent-environment interaction as a graphical model, then convert reward maximization into maximum marginal likelihood estimation, learning and inferring a policy that results in maximal reward. This conversion is accomplished by introducing one or more binary observed variables [19], denoted as O for optimality [51], with p(O = 1|τ) ∝ exp(R(τ)/α), where α is a temperature hyper-parameter. We would like to infer latent variables, τ, and learn parameters, θ, that yield the maximum log-likelihood of optimality, i.e., log p(O = 1). Evaluating this likelihood requires marginalizing the joint distribution, p(O = 1) = ∫ p(τ, O = 1) dτ. This involves averaging over all trajectories, which is intractable in high-dimensional spaces. Instead, we can use variational inference to lower bound this objective, introducing a structured approximate posterior distribution:

$$\pi(\tau|\mathcal{O}) = \rho(s_1) \prod_{t=1} p_\text{env}(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t, \mathcal{O}). \tag{2}$$

This provides the following lower bound on the objective:

$$
\begin{aligned}
\log p(\mathcal{O} = 1) &= \log \int p(\mathcal{O} = 1|\tau)\, p(\tau)\, d\tau &\text{(3)}\\
&\geq \int \pi(\tau|\mathcal{O}) \log \frac{p(\mathcal{O} = 1|\tau)\, p(\tau)}{\pi(\tau|\mathcal{O})}\, d\tau &\text{(4)}\\
&= \mathbb{E}_{\pi}[R(\tau)/\alpha] - D_{\mathrm{KL}}(\pi(\tau|\mathcal{O})\,\|\,p(\tau)). &\text{(5)}
\end{aligned}
$$

² In this paper, we consider the entropy-regularized case, where p_θ(a_t|s_t) = U(−1, 1), i.e., uniform. However, we present the derivation for the KL-regularized case for full generality.

Figure 1: Amortization. Left: Optimization over two dimensions of the policy mean, µ_1 and µ_3, for a particular state (axes: tanh(µ_1) and tanh(µ_3)), comparing a direct policy network, an iterative policy network, and the optimal estimate. A direct amortized policy network outputs a suboptimal estimate, yielding an amortization gap in performance.
An iterative amortized policy network finds an improved estimate. Right: Diagrams of direct and iterative amortization. Larger circles denote distributions, and smaller red circles denote terms in the objective, J (Eq. 8). Dashed arrows denote amortization. Iterative amortization uses gradient feedback during optimization, while direct amortization does not.

Equivalently, we can multiply by α, defining the variational RL objective as:

$$\mathcal{J}(\pi, \theta) \equiv \mathbb{E}_{\pi}[R(\tau)] - \alpha D_{\mathrm{KL}}(\pi(\tau|\mathcal{O})\,\|\,p(\tau)). \tag{6}$$

This objective consists of the expected return (i.e., the standard RL objective) and a KL divergence between π(τ|O) and p(τ). Because π(τ|O) and p(τ) share the initial state distribution and environment dynamics (Eqs. 1 and 2), this trajectory-level KL reduces to a sum of per-timestep terms. In terms of states and actions, the objective is:

$$\mathcal{J}(\pi, \theta) = \mathbb{E}_{s_t, r_t \sim p_\text{env},\, a_t \sim \pi} \left[ \sum_{t=1} \left( r_t - \alpha \log \frac{\pi(a_t|s_t, \mathcal{O})}{p_\theta(a_t|s_t)} \right) \right]. \tag{7}$$

At a given timestep, t, one can optimize this objective by estimating the future terms in the sum using a soft action-value (Q^π) network [30] or model [62]. For instance, sampling s_t ∼ p_env, slightly abusing notation, we can write the objective at time t as:

$$\mathcal{J}(\pi, \theta) = \mathbb{E}_{\pi}[Q^\pi(s_t, a_t)] - \alpha D_{\mathrm{KL}}(\pi(a_t|s_t, \mathcal{O})\,\|\,p_\theta(a_t|s_t)). \tag{8}$$

Policy optimization in the KL-regularized setting corresponds to maximizing J w.r.t. π. We often consider parametric policies, in which π is defined by distribution parameters, λ, e.g., Gaussian mean, µ, and variance, σ². In this case, policy optimization corresponds to maximizing:

$$\lambda \leftarrow \arg\max_{\lambda}\, \mathcal{J}(\pi, \theta). \tag{9}$$

Optionally, we can then also learn the policy prior parameters, θ [1].

2.3 Entropy & KL Regularized Policy Networks Perform Direct Amortization

Policy-based approaches to RL typically do not directly optimize the action distribution parameters, e.g., through gradient-based optimization. Instead, the action distribution parameters are output by a function approximator (deep network), f_φ, which is trained using deterministic [70, 52] or stochastic gradients [83, 35]. When combined with entropy or KL regularization, this policy network is a form of amortized optimization [26], learning to estimate policies. Again, denoting the action distribution parameters, e.g., mean and variance, as λ, for a given state, s, we can express this direct mapping as

$$\lambda \leftarrow f_\phi(s), \quad \text{(direct amortization)} \tag{10}$$

denoting the corresponding policy as π_φ(a|s, O; λ). Thus, f_φ attempts to learn to optimize Eq. 9. This setup is shown in Figure 1 (Right). Without entropy or KL regularization, i.e., π_φ(a|s) = p_θ(a|s), we can instead interpret the network as directly integrating the LHS of Eq. 4, which is less efficient and more challenging. Regularization smooths the optimization landscape, yielding more stable improvement and higher asymptotic performance [3]. Viewing policy networks as a form of direct amortized variational optimizer (Eq. 10) allows us to see that they are similar to encoder networks in variational autoencoders (VAEs) [44, 64]. However, there are several drawbacks to direct amortization. Algorithms 1 and 2 summarize the direct and iterative procedures.

Algorithm 1 Direct Amortization
Initialize φ
for each environment step do
  λ ← f_φ(s_t)
  a_t ∼ π_φ(a_t|s_t, O; λ)
  s_{t+1} ∼ p_env(s_{t+1}|s_t, a_t)
end for
for each training step do
  φ ← φ + η ∇_φ J
end for

Algorithm 2 Iterative Amortization
Initialize φ
for each environment step do
  Initialize λ
  for each policy optimization iteration do
    λ ← f_φ(s_t, λ, ∇_λ J)
  end for
  a_t ∼ π_φ(a_t|s_t, O; λ)
  s_{t+1} ∼ p_env(s_{t+1}|s_t, a_t)
end for
for each training step do
  φ ← φ + η ∇_φ J
end for
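Both procedures are trained with stochastic gradient estimates of the per-step objective in Eq. 8. As a concrete illustration, the following is a minimal sketch of such an estimate using reparameterized sampling, assuming a diagonal Gaussian policy, a Gaussian prior, and a learned Q-network; `q_net`, `prior`, and the other names are hypothetical rather than the paper's implementation, and the entropy-regularized case used in the experiments (tanh-squashed Gaussian policy with a uniform prior) is handled analogously.

```python
import torch
from torch.distributions import Normal

def objective(q_net, prior, state, mu, log_std, alpha=0.1, n_samples=10):
    """Monte Carlo estimate of Eq. 8: J(lambda) = E_pi[Q(s, a)] - alpha * KL(pi || p_theta),
    for a diagonal Gaussian policy with parameters lambda = (mu, log_std).
    Reparameterized sampling (rsample) keeps the estimate differentiable w.r.t. lambda
    (and w.r.t. any network that produced lambda)."""
    pi = Normal(mu, log_std.exp())                           # mu, log_std: [action_dim]
    actions = pi.rsample((n_samples,))                       # [n_samples, action_dim]
    # state: [obs_dim]; q_net is an assumed Q-network taking (states, actions).
    q_values = q_net(state.expand(n_samples, -1), actions)   # [n_samples, 1]
    # Monte Carlo estimate of the KL term: E_pi[log pi(a|s) - log p_theta(a|s)]
    kl = (pi.log_prob(actions) - prior.log_prob(actions)).sum(-1, keepdim=True)
    return (q_values - alpha * kl).mean()
```

Direct amortization produces (mu, log_std) in a single forward pass and trains f_φ by backpropagating ∇_φJ through this estimate; iterative amortization additionally uses ∇_λJ of the same quantity during optimization (Section 3.1).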
Amortization Gap. Direct amortization results in suboptimal approximate posterior estimates, with the resulting gap in the variational bound referred to as the amortization gap [20]. Thus, in the RL setting, an amortized policy, π_φ, results in worse performance than the optimal policy within the parametric policy class, denoted as π̂. The amortization gap is the gap in the following inequality: J(π_φ, θ) ≤ J(π̂, θ). Because J is a variational bound on the RL objective, i.e., expected return, a looser bound, due to amortization, prevents one from more completely optimizing this objective. This is shown in Figure 1 (Left),³ where J is plotted over two dimensions of the policy mean at a particular state in the MuJoCo environment Hopper-v2. The estimate of a direct amortized policy is suboptimal, far from the optimal estimate. While the difference in the objective is relatively small, suboptimal estimates prevent sampling and exploring high-value regions of the action space. That is, suboptimal estimates have only a minor impact on evaluation performance (see Appendix B.6) but hinder effective data collection.

³ Additional 2D plots are shown in Figure 19 in the Appendix.

Single Estimate. Direct amortization is limited to a single, static estimate. In other words, if there are multiple high-value regions of the action space, a uni-modal (e.g., Gaussian) direct amortized policy is restricted to only one region, thereby limiting exploration. Note that this is an additional restriction beyond simply considering uni-modal distributions, as a generic optimization procedure may arrive at multiple uni-modal estimates depending on initialization and stochastic sampling (see Section 3.2). While multi-modal distributions reduce the severity of this restriction [74, 29], the other limitations of direct amortization still persist.

Figure 2: Estimating Multiple Policy Modes. Unlike direct amortization, which is restricted to a single estimate, iterative amortization can effectively sample from multiple high-value action modes. This is shown for a particular state in Ant-v2, showing multiple optimization runs across two action dimensions (Left). Each square denotes an initialization. The optimizer finds both modes, with the densities plotted on the Right. This capability provides increased flexibility in action exploration.

Inability to Generalize Across Objectives. Direct amortization is a feedforward procedure, receiving gradients from the objective only after estimation. This is in contrast to other forms of optimization, which receive gradients (feedback) during estimation. Thus, unlike other optimizers, direct amortization is incapable of generalizing to new objectives, e.g., if Q^π(s, a) or p_θ(a|s) change, which is a desirable capability for adapting to new tasks or environments.

To improve upon this scheme and overcome these drawbacks, in Section 3, we turn to a technique developed in generative modeling, iterative amortization [55], retaining the efficiency of amortization while employing a more flexible iterative estimation procedure.

2.4 Related Work

Previous works have investigated methods for improving policy optimization. QT-Opt [41] uses the cross-entropy method (CEM) [66], an iterative derivative-free optimizer, to optimize a Q-value estimator for robotic grasping. CEM and related methods are also used in model-based RL for performing model-predictive control [60, 14, 62, 33]. Gradient-based policy optimization [36, 71, 10], in contrast, is less common; however, gradient-based optimization can also be combined with CEM [5].
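For reference, the CEM-style optimization mentioned above can be sketched as follows: candidate actions are sampled, scored under the learned Q-function, and the sampling distribution is refit to the top-scoring elites. This is a generic illustration with hypothetical names, not the procedure of QT-Opt or of our comparisons.

```python
import torch

def cem_policy_optimization(q_net, state, action_dim, n_iters=10, n_samples=64, n_elite=6):
    """Generic cross-entropy method (CEM) over actions for a learned Q-function."""
    mu = torch.zeros(action_dim)
    std = torch.ones(action_dim)
    for _ in range(n_iters):
        # Sample candidate actions from the current Gaussian, clipped to the action bounds.
        actions = (mu + std * torch.randn(n_samples, action_dim)).clamp(-1.0, 1.0)
        q_values = q_net(state.expand(n_samples, -1), actions).squeeze(-1)
        # Refit the Gaussian to the top-k (elite) actions by Q-value.
        elites = actions[q_values.topk(n_elite).indices]
        mu, std = elites.mean(dim=0), elites.std(dim=0) + 1e-6
    return mu, std
```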
Most policy-based methods use direct amortization, either using a feedforward [31] or recurrent [28] network. Similar approaches have also been applied to model-based value estimates [13, 16, 4], as well as combining direct amortization with model predictive control [50] and planning [65]. A separate line of work has explored improving the policy distribution, using normalizing flows [29, 74] and latent variables [76]. In principle, iterative amortization can perform policy optimization in each of these settings. Iterative amortized policy optimization is conceptually similar to negative feedback control [7], using errors to update policy estimates. However, while conventional feedback control methods are often restricted in their applicability, e.g., linear systems and quadratic cost, iterative amortization is generally applicable to any differentiable control objective. This is analogous to the generalization of Kalman filtering [42] to amortized filtering [54] for state estimation.

3 Iterative Amortized Policy Optimization

3.1 Formulation

Iterative amortization [55] utilizes errors or gradients to update the approximate posterior distribution parameters. While various forms exist, we consider gradient-encoding models [6] due to their generality. Compared with direct amortization (Eq. 10), we use iterative amortized optimizers of the general form

$$\lambda \leftarrow f_\phi(s, \lambda, \nabla_\lambda \mathcal{J}), \quad \text{(iterative amortization)} \tag{11}$$

also shown in Figure 1 (Right), where f_φ is a deep network and λ are the action distribution parameters. For example, if π = N(a; µ, diag(σ²)), then λ = [µ, σ]. Technically, s is redundant, as the state dependence is already captured in J, but this can empirically improve performance [55]. In practice, the update is carried out using a highway gating operation [38, 72]. Denoting ω_φ ∈ [0, 1] as the gate and δ_φ as the update, both of which are output by f_φ, the gating operation is expressed as

$$\lambda \leftarrow \omega_\phi \odot \lambda + (1 - \omega_\phi) \odot \delta_\phi, \tag{12}$$

where ⊙ denotes element-wise multiplication. This update is typically run for a fixed number of steps, and, as with a direct policy, the iterative optimizer is trained using stochastic gradient estimates of ∇_φJ, obtained through the path-wise derivative estimator [44, 64, 35]. Because the gradients ∇_λJ must be estimated online, i.e., during policy optimization, this scheme requires some way of estimating J online through a parameterized Q-value network [58] or a differentiable model [35]. A minimal sketch of this refinement loop is given below.
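The sketch below illustrates this loop concretely, reusing an objective estimator like the one sketched in Section 2.3; `f_phi`, `objective_fn`, and the four-output interface are hypothetical placeholders, and details such as how the gradients are encoded and normalized follow [55] and Appendix A rather than this sketch.

```python
import torch

def iterative_policy_optimization(f_phi, objective_fn, state, action_dim, n_iters=5):
    """Sketch of iterative amortized policy optimization (Eqs. 11-12): repeatedly feed
    (s, lambda, grad_lambda J) to f_phi and apply the gated (highway) update."""
    # Initialize the policy distribution parameters lambda = (mu, log_std).
    mu = torch.zeros(action_dim, requires_grad=True)
    log_std = torch.zeros(action_dim, requires_grad=True)
    for _ in range(n_iters):
        J = objective_fn(state, mu, log_std)                   # scalar estimate of Eq. 8
        # Gradients of J w.r.t. the current policy parameters (treated as fixed inputs here).
        grad_mu, grad_log_std = torch.autograd.grad(J, (mu, log_std))
        # f_phi outputs a gate in [0, 1] (e.g., via a sigmoid) and an update for each parameter.
        gate_mu, delta_mu, gate_std, delta_std = f_phi(state, mu, log_std, grad_mu, grad_log_std)
        mu = gate_mu * mu + (1.0 - gate_mu) * delta_mu         # Eq. 12
        log_std = gate_std * log_std + (1.0 - gate_std) * delta_std
    return mu, log_std
```

During training, φ is updated with stochastic estimates of ∇_φJ evaluated at the resulting policy (Algorithm 2), just as for a direct policy network.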
3.2 Benefits of Iterative Amortization

Reduced Amortization Gap. Iterative amortized optimizers are more flexible than their direct counterparts, incorporating feedback from the objective during policy optimization (Algorithm 2), rather than only after optimization (Algorithm 1). Increased flexibility improves the accuracy of optimization, thereby tightening the variational bound [55, 54]. We see this flexibility in Figure 1 (Left), where an iterative amortized policy network iteratively refines the policy estimate, quickly arriving near the optimal estimate.

Multiple Estimates. Iterative amortization, by using stochastic gradients and random initialization, can traverse the optimization landscape. As with any iterative optimization scheme, this allows iterative amortization to obtain multiple valid estimates, referred to as multi-stability in the generative modeling literature [27]. We illustrate this capability across two action dimensions in Figure 2 for a state in the Ant-v2 MuJoCo environment. Over multiple policy optimization runs, iterative amortization finds multiple modes, sampling from two high-value regions of the action space. This provides increased flexibility in action exploration, despite only using a uni-modal policy distribution.

Generalization Across Objectives. Iterative amortization uses the gradients of the objective during optimization, i.e., feedback, allowing it to potentially generalize to new or updated objectives. We see this in Figure 1 (Left), where iterative amortization, despite being trained with a different value estimator, is capable of generalizing to this new objective. We demonstrate this capability further in Section 4. This opens the possibility of accurately and efficiently performing policy optimization in new settings, e.g., a rapidly changing model or new tasks.

3.3 Consideration: Mitigating Value Overestimation

Why are more powerful policy optimizers typically not used in practice? Part of the issue stems from value overestimation. Model-free approaches generally estimate Q^π using function approximation and temporal difference learning. However, this has the pitfall of value overestimation, i.e., positive bias in the estimate, Q̂^π [75]. This issue is tied to uncertainty in the value estimate, though it is distinct from optimism under uncertainty. If the policy can exploit regions of high uncertainty, the resulting target values will introduce positive bias into the estimate. More flexible policy optimizers exacerbate the problem, exploiting this uncertainty to a greater degree. Further, a rapidly changing policy increases the difficulty of value estimation [63].

Various techniques have been proposed for mitigating value overestimation in deep RL. The most prominent technique, double deep Q-network [81], maintains two Q-value estimates [80], attempting to decouple policy optimization from value estimation. Fujimoto et al. [25] apply and improve upon this technique for actor-critic settings, estimating the target Q-value as the minimum of two Q-networks, Q_ψ1 and Q_ψ2: Q̂^π(s, a) = min_{i=1,2} Q_{ψ′_i}(s, a), where ψ′_i denotes the target network parameters. As noted by Fujimoto et al. [25], this not only counteracts value overestimation, but also penalizes high-variance value estimates, because the minimum decreases with the variance of the estimate. Ciosek et al. [15] noted that, for a bootstrapped ensemble of two Q-networks, the minimum operation can be interpreted as estimating

$$\hat{Q}^\pi(s, a) = \mu_Q(s, a) - \beta\, \sigma_Q(s, a), \tag{13}$$

with mean µ_Q(s, a) ≡ (1/2) Σ_{i=1,2} Q_{ψ′_i}(s, a), standard deviation σ_Q(s, a) ≡ ((1/2) Σ_{i=1,2} (Q_{ψ′_i}(s, a) − µ_Q(s, a))²)^{1/2}, and β = 1. Thus, to further penalize high-variance value estimates, preventing value overestimation, we can increase β. For large β, however, value estimates become overly pessimistic, negatively impacting training. Thus, β reduces target value variance at the cost of increased bias.

Figure 3: Mitigating Value Overestimation. With β = 1, iterative amortization results in (a) higher value overestimation and (b) a more rapidly changing policy (measured by D_KL(π_new ‖ π_old)) as compared with direct amortization. Increasing β (iterative, β = 2.5) helps to mitigate these issues.

Due to the flexibility of iterative amortization, the default β = 1 results in increased value bias and a more rapidly changing policy as compared with direct amortization (Figure 3). Further penalizing high-variance target values (β = 2.5) reduces value overestimation and improves stability. For details, see Appendix A.2.
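Concretely, Eq. 13 can be computed directly from the outputs of the two target Q-networks; a minimal sketch with hypothetical tensor names:

```python
import torch

def penalized_target_q(q_target_1, q_target_2, beta=2.5):
    """Eq. 13: Q_hat(s, a) = mu_Q(s, a) - beta * sigma_Q(s, a), computed from the two
    target Q-network outputs for a batch of (s, a) pairs."""
    mu_q = 0.5 * (q_target_1 + q_target_2)
    sigma_q = 0.5 * (q_target_1 - q_target_2).abs()  # standard deviation of two values
    return mu_q - beta * sigma_q
```

With beta = 1 this recovers the element-wise minimum of the two estimates; larger beta trades target variance for (pessimistic) bias, as discussed above.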
Recent techniques for mitigating overestimation have been proposed, such as adjusting α [22]. In offline RL, this issue has been tackled through the action prior [24, 48, 84] or by altering Q-network training [2, 49]. While such techniques could be used here, increasing β provides a simple solution with no additional computational overhead. This is a meaningful insight toward applying more powerful policy optimizers.

Figure 4: Policy Optimization. Visualization over time steps of (a) one dimension (action dimension 5) of the policy distribution and (b) the improvement in the objective, J, across policy optimization iterations. (c) Comparison of iterative amortization with Adam [45] (gradient-based) and CEM [66] (gradient-free). Iterative amortization is an order of magnitude more efficient.

4 Experiments

To focus on policy optimization, we implement iterative amortized policy optimization using the soft actor-critic (SAC) setup described by Haarnoja et al. [32]. This uses two Q-networks, a uniform action prior, p_θ(a|s) = U(−1, 1), and a tuning scheme for the temperature, α. In our experiments, "direct" refers to the direct amortization employed in SAC, i.e., a direct policy network, and "iterative" refers to iterative amortization. Both approaches use the same network architecture, adjusting only the number of inputs and outputs to accommodate gradients, current policy estimates, and gated updates (Sec. 3.1). Unless otherwise stated, we use 5 iterations per time step for iterative amortization, following [55]. For details, refer to Appendix A and Haarnoja et al. [31, 32].

4.2 Analysis

4.2.1 Visualizing Policy Optimization

We provide 2D visualizations of iterative amortized policy optimization in Figures 1 & 2, with additional 2D plots in Appendix B.5. In Figure 4, we visualize iterative refinement using a single action dimension from Ant-v2 across time steps. The refinements in Figure 4a give rise to the objective improvements in Figure 4b. We compare with Adam [45] (gradient-based) and CEM [66] (gradient-free) in Figure 4c, where iterative amortization is an order of magnitude more efficient. This trend is consistent across environments, as shown in Appendix B.4.

4.2.2 Performance Comparison

We evaluate iterative amortized policy optimization on the suite of MuJoCo [78] continuous control tasks from OpenAI Gym [12]. In Figure 5, we compare the cumulative reward of direct and iterative amortized policy optimization across environments. Each curve shows the mean and standard deviation of 5 random seeds. In all cases, iterative amortized policy optimization matches or outperforms the baseline direct amortized method, both in sample efficiency and final performance. Iterative amortization also yields more consistent, lower variance performance.

4.2.3 Improved Exploration: Multiple Policy Modes

As described in Section 3.2, iterative amortization is capable of obtaining multiple estimates, i.e., multiple modes of the optimization objective. To confirm that iterative amortization has captured multiple modes, at the end of training, we take an iterative agent trained on Walker2d-v2 and histogram the distances between policy means across separate runs of policy optimization per state (Fig. 7a).
For the state with the largest distance, we plot 2D projections of the optimization objective, J, across action dimensions in Figure 7b, as well as the policy density across 10 optimization runs (Fig. 7c). The multi-modal policy optimization surface shown in Figure 7b results in the multi-modal policy in Figure 7c. Additional results on other environments are presented in Appendix B.7.

To better understand whether the performance benefits of iterative amortization are coming purely from improved exploration via multiple modes, we also compare with direct amortization with a multi-modal policy distribution. This is formed using inverse autoregressive flows [46], a type of normalizing flow (NF). Results are presented in Appendix B.2. Using a multi-modal policy reduces the performance deficiencies on Hopper-v2 and Walker2d-v2, indicating that much of the benefit of iterative amortization is due to lifting direct amortization's restriction to a single, uni-modal policy estimate. Yet, direct + NF still struggles on HalfCheetah-v2 compared with iterative amortization, suggesting that more complex, multi-modal distributions are not the only consideration.

Figure 5: Performance Comparison. Iterative amortized policy optimization performs comparably with or better than direct amortization (SAC) across MuJoCo environments (panels include Reacher-v2, Hopper-v2, HalfCheetah-v2, Swimmer-v2, Walker2d-v2, Humanoid-v2, and HumanoidStandup-v2; cumulative reward vs. million steps). Curves show the mean ± std. dev. of performance over 5 random seeds.

Figure 6: Amortization Gap. Iterative amortization achieves similar or lower amortization gaps than direct amortization (panels include Hopper-v2, HalfCheetah-v2, and Walker2d-v2, comparing direct (SAC) with iterative amortization using 5 and 10 iterations). Gaps are estimated using stochastic gradient-based optimization over 100 random states. Curves show the mean and std. dev. over 5 random seeds.

4.2.4 Improved Optimization: Amortization Gap

To evaluate policy optimization accuracy, we estimate per-step amortization gaps, performing additional iterations of gradient ascent on J w.r.t. the policy parameters, λ = [µ, σ] (see Appendix A.3). To analyze generalization, we also evaluate the iterative agents trained with 5 iterations for an additional 5 amortized iterations. Results are shown in Figure 6. We emphasize that it is challenging to directly compare amortization gaps across optimization schemes, as these involve different value functions, and therefore different objectives. Likewise, we estimate the amortization gap using the learned Q-networks, which may be biased (Figure 3). Nevertheless, we find that iterative amortized policy optimization achieves, on average, lower amortization gaps than direct amortization across all environments. Additional amortized iterations at evaluation yield further estimated improvement, demonstrating generalization beyond the optimization horizon used during training. The amortization gaps are small relative to the objective, playing a negligible role in evaluation performance (see Appendix B.6). Rather, improved policy optimization is helpful for training, allowing the agent to explore states where value estimates are highest.

To probe this further, we train iterative amortized policy optimization while varying the number of iterations per step in {1, 2, 5}, yielding optimizers with varying degrees of accuracy. Note that each optimizer is, in theory, capable of finding multiple modes. In Figure 8, we see that training with additional iterations improves performance and optimization accuracy. We stress that the exact form of this relationship depends on the Q-value estimator and other factors. We present additional results in Appendix B.6.
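The per-state gap estimate described above can be sketched as follows, reusing an objective estimator like the one in Section 2.3 and using Adam here purely for illustration; the exact procedure and hyperparameters are given in Appendix A.3, and all names are hypothetical.

```python
import torch

def estimate_amortization_gap(objective_fn, state, mu, log_std, n_steps=100, lr=1e-2):
    """Refine the amortized policy estimate lambda = (mu, log_std) by direct gradient
    ascent on J and report the resulting improvement in the objective."""
    J_amortized = objective_fn(state, mu, log_std).item()
    mu = mu.detach().clone().requires_grad_(True)
    log_std = log_std.detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([mu, log_std], lr=lr)
    for _ in range(n_steps):
        loss = -objective_fn(state, mu, log_std)  # gradient ascent on J
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    J_optimized = objective_fn(state, mu, log_std).item()
    # The objective estimate is stochastic, so in practice this is averaged over many states.
    return J_optimized - J_amortized
```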
Figure 7: Multiple Policy Modes. (a) Histogram of distances between policy means, ‖tanh(µ^(i)) − tanh(µ^(j))‖₂, across optimization runs (i and j) over seeds and states on Walker2d-v2 at 3 million environment steps. For the state with the largest distance, (b) shows the projected optimization surface on each pair of action dimensions (action dimensions 1–6), and (c) shows the policy density for 10 optimization runs.

Figure 8: Varying Iterations During Training. Performance (Left) and estimated amortization gap (Right) for varying numbers of policy optimization iterations per step (1, 2, and 5) during training on HalfCheetah-v2. Increasing the iterations generally improves performance and decreases the estimated amortization gap.

Figure 9: Zero-shot generalization of iterative amortization from model-free (MF) to model-based (MB) value estimates.

4.2.5 Generalizing to Model-Based Value Estimates

Direct amortization is a purely feedforward process and is therefore incapable of generalizing to new objectives. In contrast, because iterative amortization is formulated through gradient-based feedback, such optimizers may be capable of generalizing to new objective estimators, as shown in Figure 1. To demonstrate this capability further, we apply iterative amortization with model-based value estimators, using a learned deterministic model on HalfCheetah-v2 (see Appendix A.5). We evaluate the generalization capabilities in Figure 9 by transferring the policy optimizer from a model-free agent to a model-based agent. Iterative amortization generalizes to these new value estimates, instantly recovering the performance of the model-based agent. This highlights the opportunity for instantly incorporating new tasks, goals, or model estimates into policy optimization.
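To make the notion of a model-based value estimate concrete, one common form rolls a learned deterministic model forward for a few steps and bootstraps with a terminal Q-value; the sketch below illustrates only this generic form, with hypothetical names, and is not necessarily the estimator detailed in Appendix A.5.

```python
import torch

def model_based_value(model, reward_fn, q_net, policy, state, horizon=5, gamma=0.99):
    """Short-horizon model-based value estimate: predicted rewards plus a Q-value bootstrap."""
    value, discount, s = 0.0, 1.0, state
    for _ in range(horizon):
        a = policy(s)                      # e.g., a sample from the current policy estimate
        value = value + discount * reward_fn(s, a)
        s = model(s, a)                    # learned deterministic dynamics model
        discount = discount * gamma
    return value + discount * q_net(s, policy(s))  # bootstrap with the learned Q-network
```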
5 Discussion

We have introduced iterative amortized policy optimization, a flexible and powerful policy optimization technique. In so doing, we have identified KL-regularized policy networks as a form of direct amortization, highlighting several limitations: 1) limited accuracy, as quantified by the amortization gap, 2) restriction to a single estimate, limiting exploration, and 3) inability to generalize to new objectives, limiting the transfer of these policy optimizers. As shown through our empirical analysis, iterative amortization provides a step toward improving each of these restrictions, with accompanying improvements in performance over direct amortization. Thus, iterative amortization can serve as a drop-in replacement and improvement over direct policy networks in deep RL. This improvement, however, is accompanied by added challenges.

As highlighted in Section 3.3, improved policy optimization can exacerbate issues in Q-value estimation stemming from positive bias. Note that this is not unique to iterative amortization, but applies broadly to any improved optimizer. We have provided a simple solution that involves adjusting a factor, β, to counteract this bias. Yet, we see this as an area for further investigation, perhaps drawing on insights from the offline RL community [49].

In addition to value estimation issues, iterative amortized policy optimization incurs computational costs that scale linearly with the number of iterations. This is in comparison with direct amortization, which has constant computational cost. Fortunately, unlike standard optimizers, iterative amortization adaptively tunes step sizes. Thus, relative improvements rapidly diminish with each additional iteration, enabling accurate optimization with exceedingly few iterations. In practice, even a single iteration per time step can work surprisingly well.

Although we have discussed three separate limitations of direct amortization, these factors are highly interconnected. By broadening policy optimization to an iterative procedure, we automatically obtain a potentially more accurate and general policy optimizer, with the capability of obtaining multiple modes. While our analysis suggests that improved exploration resulting from multiple modes is the primary factor affecting performance, future work could tease out these effects further and assess the relative contributions of these improvements in additional environments. We are hopeful that iterative amortized policy optimization, by providing a more powerful, exploratory, and general optimizer, will enable a range of improved RL algorithms.

Acknowledgments and Disclosure of Funding

JM acknowledges Scott Fujimoto for helpful discussions. This work was funded in part by NSF #1918839 and Beyond Limits. JM is currently employed by Google DeepMind. The authors declare no other competing interests related to this work.

References

[1] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018.
[2] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pages 104–114, 2020.
[3] Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pages 151–160, 2019.
[4] Brandon Amos, Samuel Stanton, Denis Yarats, and Andrew Gordon Wilson. On the model-based stochastic value gradient for continuous reinforcement learning. In Learning for Dynamics and Control, pages 6–20. PMLR, 2021.
[5] Brandon Amos and Denis Yarats. The differentiable cross-entropy method. In International Conference on Machine Learning, 2020.
[6] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
[7] Karl Johan Åström and Richard M Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, 2008.
[8] Hagai Attias. Planning by probabilistic inference. In AISTATS. Citeseer, 2003.
[9] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton.
Layer normalization. NeurIPS Deep Learning Symposium, 2016.
[10] Homanga Bharadhwaj, Kevin Xie, and Florian Shkurti. Model-predictive planning via cross-entropy and gradient-based optimization. In Learning for Dynamics and Control, pages 277–286, 2020.
[11] Matthew Botvinick and Marc Toussaint. Planning as inference. Trends in Cognitive Sciences, 16(10):485–488, 2012.
[12] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
[13] Arunkumar Byravan, Jost Tobias Springenberg, Abbas Abdolmaleki, Roland Hafner, Michael Neunert, Thomas Lampe, Noah Siegel, Nicolas Heess, and Martin Riedmiller. Imagined value gradients: Model-based policy optimization with transferable latent dynamics models. In Conference on Robot Learning, pages 566–589, 2020.
[14] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
[15] Kamil Ciosek, Quan Vuong, Robert Loftin, and Katja Hofmann. Better exploration with optimistic actor critic. In Advances in Neural Information Processing Systems, pages 1787–1798, 2019.
[16] Ignasi Clavera, Yao Fu, and Pieter Abbeel. Model-augmented actor-critic: Backpropagating through paths. In International Conference on Learning Representations, 2020.
[17] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016.
[18] Comet.ML. Comet.ML home page, 2021.
[19] Gregory F Cooper. A method for using belief networks as influence diagrams. In Fourth Workshop on Uncertainty in Artificial Intelligence, 1988.
[20] Chris Cremer, Xuechen Li, and David Duvenaud. Inference suboptimality in variational autoencoders. In International Conference on Machine Learning, pages 1078–1086, 2018.
[21] Peter Dayan and Geoffrey E Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
[22] Roy Fox. Toward provably unbiased temporal-difference value estimation. Optimization Foundations for Reinforcement Learning Workshop at NeurIPS, 2019.
[23] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 202–211. AUAI Press, 2016.
[24] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.
[25] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596, 2018.
[26] Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the Cognitive Science Society, volume 36, 2014.
[27] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In International Conference on Machine Learning, pages 2424–2433, 2019.
[28] Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sebastien Racaniere, Theophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, et al. An investigation of model-free planning. In International Conference on Machine Learning, pages 2464–2473, 2019.
[29] Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. Latent space policies for hierarchical reinforcement learning. In International Conference on Machine Learning, pages 1851–1860, 2018.
[30] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361. JMLR.org, 2017.
[31] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1856–1865, 2018.
[32] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
[33] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pages 2555–2565. PMLR, 2019.
[34] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, September 2020.
[35] Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
[36] Mikael Henaff, William F Whitney, and Yann LeCun. Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177, 2017.
[37] Devon Hjelm, Ruslan R Salakhutdinov, Kyunghyun Cho, Nebojsa Jojic, Vince Calhoun, and Junyoung Chung. Iterative refinement of the approximate posterior for directed belief networks. In Advances in Neural Information Processing Systems, pages 4691–4699, 2016.
[38] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[39] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.
[40] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, pages 12519–12530, 2019.
[41] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, 2018.
[42] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[43] Yoon Kim, Sam Wiseman, Andrew C Miller, David Sontag, and Alexander M Rush.
Semi-amortized variational autoencoders. In International Conference on Machine Learning, 2018.
[44] Diederik P Kingma and Max Welling. Stochastic gradient VB and the variational auto-encoder. In International Conference on Learning Representations, 2014.
[45] Durk P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[46] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
[47] Rahul G Krishnan, Dawen Liang, and Matthew Hoffman. On the challenges of learning with inference networks on sparse, high-dimensional data. In International Conference on Artificial Intelligence and Statistics, pages 143–151, 2018.
[48] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy Q-learning via bootstrapping error reduction. In Advances in Neural Information Processing Systems, pages 11784–11794, 2019.
[49] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, pages 1179–1191, 2020.
[50] Keuntaek Lee, Kamil Saigol, and Evangelos A Theodorou. Safe end-to-end imitation learning for model predictive control. In International Conference on Robotics and Automation, 2019.
[51] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
[52] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
[53] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[54] Joseph Marino, Milan Cvitkovic, and Yisong Yue. A general method for amortizing variational filtering. In Advances in Neural Information Processing Systems, 2018.
[55] Joseph Marino, Yisong Yue, and Stephan Mandt. Iterative amortized inference. In International Conference on Machine Learning, pages 3403–3412, 2018.
[56] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pages 1791–1799, 2014.
[57] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[58] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. In NeurIPS Deep Learning Workshop, 2013.
[59] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
[60] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In International Conference on Robotics and Automation, pages 7559–7566. IEEE, 2018.
[61] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
[62] Alexandre Piché, Valentin Thomas, Cyril Ibrahim, Yoshua Bengio, and Chris Pal. Probabilistic planning with sequential Monte Carlo methods. In International Conference on Learning Representations, 2019.
[63] Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model based reinforcement learning. In International Conference on Machine Learning, pages 7953–7963, 2020.
[64] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
[65] Benjamin Rivière, Wolfgang Hönig, Yisong Yue, and Soon-Jo Chung. GLAS: Global-to-local safe autonomy synthesis for multi-robot motion planning with end-to-end learning. IEEE Robotics and Automation Letters, 5(3):4249–4256, 2020.
[66] Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer Science & Business Media, 2013.
[67] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
[68] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[69] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[70] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395, 2014.
[71] Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks: Learning generalizable representations for visuomotor control. In International Conference on Machine Learning, pages 4732–4741, 2018.
[72] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2377–2385, 2015.
[73] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[74] Yunhao Tang and Shipra Agrawal. Boosting trust region policy optimization by normalizing flows policy. In NeurIPS Deep Reinforcement Learning Workshop, 2018.
[75] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum, 1993.
[76] Dhruva Tirumala, Hyeonwoo Noh, Alexandre Galashov, Leonard Hasenclever, Arun Ahuja, Greg Wayne, Razvan Pascanu, Yee Whye Teh, and Nicolas Heess. Exploiting hierarchy for learning and transfer in KL-regularized RL. arXiv preprint arXiv:1903.07438, 2019.
[77] Emanuel Todorov. General duality between optimal control and estimation. In 47th IEEE Conference on Decision and Control (CDC 2008), pages 4286–4292. IEEE, 2008.
[78] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
[79] Marc Toussaint and Amos Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In International Conference on Machine Learning, pages 945–952, 2006.
[80] Hado Van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems, pages 2613–2621, 2010.
[81] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, 2016.
[82] Michael L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021.
[83] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
[84] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] We demonstrate performance improvements and novel benefits of iterative amortization in Section 4. Each claim is supported by empirical evidence.
(b) Did you describe the limitations of your work? [Yes] We discuss the additional computation requirements and the necessity of mitigating value overestimation.
(c) Did you discuss any potential negative societal impacts of your work? [No] We do not see any immediate societal impacts of this work beyond the general impacts of improvements in machine learning.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We have included code in the supplementary material with an accompanying README file.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] All details and hyperparameters are provided in the Appendix.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We report error bars on each of our main quantitative results comparing performance.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We provide these details in the supplementary material.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] We cite the creators of the environments and software packages used in this paper.
(b) Did you mention the license of the assets? [Yes] We mention or cite the license for each asset.
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating?
[N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]