# Policy Optimization as Wasserstein Gradient Flows

Ruiyi Zhang¹, Changyou Chen², Chunyuan Li¹, Lawrence Carin¹
¹Duke University, ²SUNY at Buffalo. Correspondence to: Changyou Chen.
Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

## Abstract

Policy optimization is a core component of reinforcement learning (RL), and most existing RL methods directly optimize parameters of a policy based on maximizing the expected total reward, or its surrogate. Though often achieving encouraging empirical success, the underlying mathematical principle of these methods as policy-distribution optimization is unclear. We place policy optimization into the space of probability measures, and interpret it as Wasserstein gradient flows. On the probability-measure space, under specified circumstances, policy optimization becomes a convex problem in terms of distribution optimization. To make optimization feasible, we develop efficient algorithms by numerically solving the corresponding discrete gradient flows. Our technique is applicable to several RL settings, and is related to many state-of-the-art policy-optimization algorithms. Empirical results verify the effectiveness of our framework, often obtaining better performance compared to related algorithms.

## 1. Introduction

There is recent renewed interest in reinforcement learning (Sutton & Barto, 1998; Kaelbling et al., 1996), largely as a consequence of the success of deep reinforcement learning (deep RL) (Mnih et al., 2015; Li, 2017), which is applicable to complex environments and has obtained state-of-the-art performance on several challenging problems. In reinforcement learning an agent interacts with the environment, seeking to learn an optimal policy that yields the maximum expected reward during the interaction. Generally speaking, a policy defines a distribution over actions conditioned on the states. Learning an optimal policy corresponds to searching for an element (a conditional action distribution) on the space of distributions that yields the best expected feedback (reward) to the agent as it interacts sequentially with the environment.

A standard technique for policy learning is the policy-gradient (PG) method (Sutton et al., 2000). In PG, a policy is represented in terms of parameters, typically optimized by stochastic gradient descent (SGD) to maximize the expected total reward. A similar idea has been applied for learning deterministic policies, termed deterministic policy gradient (DPG) (Silver et al., 2014). Significant progress has been made on advancing policy learning since the introduction of deep learning techniques. As examples, the deep deterministic policy gradient (DDPG) method combines DPG and Q-learning (Watkins & Dayan, 1992) to jointly learn a policy and a Q-function for continuous control problems (Lillicrap et al., 2016). Trust region policy optimization (TRPO) improves PG by preserving the monotonic policy-improvement property (Schulman et al., 2015), implemented by imposing a trust-region constraint, defined as the Kullback-Leibler (KL) divergence between consecutive policies. Later, Schulman et al. (2017b) proposed proximal policy optimization (PPO) to improve TRPO by optimizing a surrogate objective with an adaptive KL penalty and a reward-clipping mechanism.
Though obtaining encouraging empirical success, many of the aforementioned algorithms optimize parameters directly, and appear to lack a rigorous interpretation in terms of distribution optimization; e.g., it is not mathematically clear how sequentially optimizing policy parameters based on an expected-total-reward objective corresponds to optimizing the distribution of the policy itself. In this paper we introduce gradient flows in the space of probability distributions, called Wasserstein gradient flows (WGF), and formulate policy optimization in RL as a WGF problem. Essentially, a WGF induces a geometric structure (manifold) in the distribution space, characterized by an energy functional. The length between elements on the manifold is defined by the second-order Wasserstein distance. Thus, searching for an optimal distribution corresponds to traveling along a gradient flow on this space until convergence. In the context of deep RL, the energy functional is characterized by the expected reward. The gradient flow corresponds to a sequence of policy distributions converging to an optimal policy during an iterative optimization procedure. From this perspective, the convergence behavior of the optimization can be better understood.

Traditional stochastic policies are limited by their simple representation ability, such as using multinomial (Mnih et al., 2015) or Gaussian policy distributions (Schulman et al., 2015). To overcome this issue, the proposed WGF-based stochastic policies employ general energy-based representations. To optimize the stochastic policy, we define WGFs for RL in two settings: i) indirect-policy learning, defined on a distribution space for parameters; ii) direct-policy learning, defined on a distribution space for policy distributions. These correspond to two variants of our algorithms. The original form of the WGF problem is hard to deal with, as it is generally infeasible to directly optimize over a distribution (an infinite-dimensional object). To overcome this issue, based on the Jordan-Kinderlehrer-Otto (JKO) method (Jordan et al., 1998), we propose a particle-based algorithm that approximates a continuous density function with particles, and derive the corresponding gradient formulas for particle updates. Our method is conceptually simple and practically efficient, and it also provides a theoretically sound way to use trust-region algorithms for RL. Empirical experiments show improved performance over related reinforcement learning algorithms.

## 2. Preliminaries

We review concepts and numerical algorithms for gradient flows. We start from gradient flows on Euclidean space, and then extend them to the space of probability measures.

### 2.1. Gradient flows on Euclidean space

For a smooth function $F: \mathbb{R}^d \to \mathbb{R}$ and a starting point $x_0 \in \mathbb{R}^d$, the gradient flow of $F(x)$ is defined as the solution of the differential equation
$$\frac{\mathrm{d}x(\tau)}{\mathrm{d}\tau} = -\nabla F\big(x(\tau)\big),$$
for time $\tau > 0$ and initial condition $x(0) = x_0$. This is a standard Cauchy problem (Rulla, 1996), endowed with a unique solution if $\nabla F$ is Lipschitz continuous. When $F$ is non-differentiable, the gradient is replaced with its subgradient, which gives a similar definition, omitted for simplicity.

**Numerical solution** An exact solution to the above gradient-flow problem is typically intractable. A standard numerical method, called the Minimizing Movement Scheme (MMS) (Gobbino, 1999), evolves $x$ iteratively in small steps along the gradient of $F$ at the current point. Denoting the current point as $x_k$, the next point is $x_{k+1} = x_k - h\,\nabla F(x_{k+1})$, with stepsize $h$. Note that $x_{k+1}$ is equivalently obtained by solving the optimization problem
$$x_{k+1} = \arg\min_x\; F(x) + \frac{\|x - x_k\|_2^2}{2h},$$
where $\|\cdot\|_2$ denotes the vector 2-norm. Convergence of the sequence $\{x_k\}$ to the exact solution has been established (Ambrosio et al., 2005); refer to Section A.1 of the Supplementary Material (SM) for details. We will focus on the convex case, since this is the case for many gradient flows on the space of probability measures, as detailed subsequently.
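As a concrete illustration of the minimizing-movement scheme, the sketch below (not from the paper; the quadratic energy `F`, the inner-loop solver, and all step sizes are chosen only for this example) performs the proximal update $x_{k+1} = \arg\min_x F(x) + \|x-x_k\|_2^2/(2h)$ by running a few gradient-descent steps on the proximal objective.

```python
import numpy as np

# Minimal sketch of the Minimizing Movement Scheme (implicit Euler / proximal step).
# F is a toy convex energy; the inner loop approximately solves
#   x_{k+1} = argmin_x F(x) + ||x - x_k||^2 / (2h).

def F(x):
    return 0.5 * np.sum(x ** 2)          # illustrative convex energy

def grad_F(x):
    return x

def mms_step(x_k, h=0.1, inner_steps=100, lr=0.05):
    x = x_k.copy()
    for _ in range(inner_steps):
        g = grad_F(x) + (x - x_k) / h    # gradient of the proximal objective
        x = x - lr * g
    return x

x = np.array([2.0, -1.5])
for k in range(50):                      # discrete gradient flow toward the minimizer of F
    x = mms_step(x)
print(x)                                 # approaches the origin as k grows and h shrinks
```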
### 2.2. Gradient flows on the probability-measure space

By placing the optimization onto the space of probability measures, denoted $\mathcal{P}(\Omega)$ with $\Omega \subset \mathbb{R}^d$, we arrive at Wasserstein gradient flows. For a formal definition, we first endow a Riemannian geometry on $\mathcal{P}(\Omega)$. The geometry is characterized by the length between two elements (two distributions), defined by the second-order Wasserstein distance:
$$W_2^2(\mu, \nu) \triangleq \inf_{\gamma}\left\{\int_{\Omega\times\Omega}\|x - y\|_2^2\,\mathrm{d}\gamma(x, y)\;:\;\gamma\in\Gamma(\mu,\nu)\right\},$$
where $\Gamma(\mu,\nu)$ is the set of joint distributions over $(x, y)$ whose two marginals equal $\mu$ and $\nu$, respectively. This is an optimal-transport problem, in which one wants to transform $\mu$ into $\nu$ with minimum cost (Villani, 2008). The term $\|x - y\|_2^2$ represents the cost to transport $x$ in $\mu$ to $y$ in $\nu$, and can be replaced by a general metric $c(x, y)$ in a metric space. If $\mu$ is absolutely continuous w.r.t. the Lebesgue measure, there is a unique optimal transport plan from $\mu$ to $\nu$, i.e., a mapping $T: \mathbb{R}^d\to\mathbb{R}^d$ pushing $\mu$ onto $\nu$ and satisfying $T_{\#}\mu = \nu$. Here $T_{\#}\mu$ denotes the pushforward measure of $\mu$. The Wasserstein distance can then be equivalently reformulated as
$$W_2^2(\mu,\nu) = \inf_{T}\int_{\Omega} c\big(x, T(x)\big)\,\mathrm{d}\mu(x).$$

Let $\{\mu_\tau\}_{\tau\in[0,1]}$ be an absolutely continuous curve in $\mathcal{P}(\Omega)$ with finite second-order moments, and consider $W_2^2(\mu_\tau, \mu_{\tau+h})$. Motivated by the Euclidean-space case, if we define
$$v_\tau(x_\tau) \triangleq \lim_{h\to 0}\frac{T(x_\tau) - x_\tau}{h},$$
with $T$ the optimal transport map from $\mu_\tau$ to $\mu_{\tau+h}$, as the velocity of the particle $x_\tau$, a gradient flow can be defined on $\mathcal{P}(\Omega)$ correspondingly (Ambrosio et al., 2005).

**Lemma 1** Let $\{\mu_\tau\}_{\tau\in[0,1]}$ be an absolutely continuous curve in $\mathcal{P}(\Omega)$ with finite second-order moments. Then for a.e. $\tau\in[0,1]$, the above vector field $v_\tau$ defines a gradient flow on $\mathcal{P}(\Omega)$ as
$$\partial_\tau\mu_\tau + \nabla\cdot(v_\tau\,\mu_\tau) = 0,$$
where $\nabla\cdot\mathbf{a} \triangleq \sum_i \partial_{x_i} a_i$ for a vector $\mathbf{a}$.

The function $F$ of Section 2.1 is lifted to a functional on the space of probability measures, mapping a probability measure $\mu$ to a real value, i.e., $F: \mathcal{P}(\Omega)\to\mathbb{R}$; $F$ is the energy functional of a gradient flow on $\mathcal{P}(\Omega)$. Consequently, it can be shown that $v_\tau$ in Lemma 1 has the form $v_\tau = -\nabla\frac{\delta F}{\delta\mu_\tau}(\mu_\tau)$ (Ambrosio et al., 2005), where $\frac{\delta F}{\delta\mu_\tau}$ is called the first variation of $F$ at $\mu_\tau$. Based on this, gradient flows on $\mathcal{P}(\Omega)$ can be written as
$$\partial_\tau\mu_\tau = -\nabla\cdot(v_\tau\,\mu_\tau) = \nabla\cdot\Big(\mu_\tau\,\nabla\tfrac{\delta F}{\delta\mu_\tau}(\mu_\tau)\Big). \qquad (1)$$

**Remark 2** Intuitively, an energy functional $F$ characterizes the landscape structure (appearance) of the corresponding manifold, and the gradient flow (1) defines a solution path on this manifold. Usually, by choosing an appropriate $F$, the landscape is convex, e.g., in the Itô-diffusion case defined below. This provides a theoretical guarantee of optimal convergence of a gradient flow.

**Itô diffusions as WGFs** An Itô diffusion defines a stochastic mapping $\mathcal{T}: \mathbb{R}^d\times\mathbb{R}\to\mathbb{R}^d$ such that $\mathcal{T}(x, 0) = x$ and $\mathcal{T}(\mathcal{T}(x,\tau), s) = \mathcal{T}(x, s+\tau)$ for all $x\in\mathbb{R}^d$ and $s,\tau\in\mathbb{R}$. A typical example of this family is defined as $\mathcal{T}(x,\tau) = x_\tau$, where $x_\tau$ is driven by a diffusion of the form
$$\mathrm{d}x_\tau = U(x_\tau)\,\mathrm{d}\tau + \sigma(x_\tau)\,\mathrm{d}W. \qquad (2)$$
Here $U: \mathbb{R}^d\to\mathbb{R}^d$ and $\sigma: \mathbb{R}^d\to\mathbb{R}^{d\times d}$ are called the drift and diffusion terms, respectively; $W$ is the standard $d$-dimensional Brownian motion.
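A diffusion of the form (2) can be simulated with the standard Euler-Maruyama discretization. The sketch below is illustrative only (the drift $U(x)=-x$, the constant $\sigma$, and the function names are choices made for this example, yielding an Ornstein-Uhlenbeck process) and is not tied to any RL quantity.

```python
import numpy as np

# Minimal Euler-Maruyama sketch for a diffusion dx = U(x) dtau + sigma dW, as in (2).

def drift(x):
    return -x                                            # illustrative Ornstein-Uhlenbeck drift

def simulate(x0, n_steps=1000, dt=1e-2, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dW = rng.normal(size=x.shape) * np.sqrt(dt)      # Brownian increment
        x = x + drift(x) * dt + sigma * dW               # Euler-Maruyama step
        path.append(x.copy())
    return np.stack(path)

path = simulate(x0=[5.0])
print(path[-1])   # the marginal law of x_tau evolves toward the stationary distribution of the diffusion
```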
In Bayesian inference, one seeks to make the stationary distribution of $x_\tau$ approach a particular distribution $p(x)$, e.g., a posterior distribution. One solution is to set $U(x_\tau) = \frac{1}{2}\nabla\log p(x)$ and $\sigma(x_\tau)$ to be the $d\times d$ identity matrix. The resulting diffusion is called Langevin dynamics (Welling & Teh, 2011). Denoting the distribution of $x_\tau$ as $\mu_\tau$, it is well known (Risken, 1989) that $\mu_\tau$ is characterized by the Fokker-Planck (FP) equation:
$$\partial_\tau\mu_\tau = -\nabla\cdot(\mu_\tau\,U) + \nabla\cdot\big(\nabla\cdot(\mu_\tau\,\sigma\sigma^\top)\big). \qquad (3)$$
Note that (3) is in the gradient-flow form of (1), where the energy functional $F$ is defined as
$$F(\mu) \triangleq \underbrace{\int U(x)\,\mu(x)\,\mathrm{d}x}_{F_1(\mu)} + \underbrace{\int \mu(x)\log\mu(x)\,\mathrm{d}x}_{F_2(\mu)}, \qquad (4)$$
where, with a slight abuse of notation, $U$ in (4) denotes the potential of the target distribution, i.e., $p(x)\propto e^{-U(x)}$. Note that $F_2$ is the energy functional of a pure Brownian motion (e.g., $U(x) = 0$ in (2)). To verify that the FP equation is of the form (1), the first variations of $F_1$ and $F_2$ are calculated as
$$\frac{\delta F_1}{\delta\mu} = U, \qquad \frac{\delta F_2}{\delta\mu} = \log\mu + 1. \qquad (5)$$
Substituting (5) into (1) recovers the FP equation (3).

**Numerical methods** Inspired by the Euclidean-space case, the gradient flow (1) can be approximately solved by discretizing time, leading to an iterative optimization problem, where for iteration $k$:
$$\mu^{(h)}_{k+1} = \arg\min_\mu\; F(\mu) + \frac{W_2^2\big(\mu, \mu^{(h)}_k\big)}{2h}.$$
Specifically, for the Itô diffusion with $F$ defined in (4), the optimization problem becomes
$$\mu^{(h)}_{k+1} = \arg\min_\mu\; \mathrm{KL}\big(\mu \,\|\, p\big) + \frac{W_2^2\big(\mu, \mu^{(h)}_k\big)}{2h}, \qquad (6)$$
where $p(x) \triangleq \frac{1}{Z}e^{-U(x)}$ is the target distribution. This procedure is called the Jordan-Kinderlehrer-Otto (JKO) scheme. Remarkably, the convergence of (6) can be guaranteed (Jordan et al., 1998), as stated in Lemma 3. We assume $\sigma$ to be independent of $x$, which is the case in Langevin dynamics, whose stationary distribution is set to be proportional to $e^{-U(x)}$; as a result, we drop $\sigma$ in the following.

**Lemma 3** Assume that $\log p(x) \le C_1$ is infinitely differentiable, and $\|\nabla\log p(x)\| \le C_2\big(1 + C_1 - \log p(x)\big)$ for all $x$, for some constants $\{C_1, C_2\}$. Let $T = hK$, $\mu_0 \triangleq q_0(x)$, and let $\{\mu^{(h)}_k\}_{k=1}^{K}$ be the solution of the functional optimization problem (6), restricted to the space of probability measures with finite second-order moments. Then i) the problem (6) is convex; and ii) $\mu^{(h)}_K$ converges to $\mu_T$ in the limit of $h\to 0$, i.e., $\lim_{h\to 0}\mu^{(h)}_K = \mu_T$, where $\mu_T$ is the solution of (3) at time $T$.

**Remark 4** Since the stationary distribution of the FP equation (3) is proportional to $e^{-U(x)}$, Lemma 3 suggests that $\lim_{k\to\infty,\,h\to 0}\mu^{(h)}_k = \frac{1}{Z}e^{-U}$, a useful property to guide the design of energy functionals for RL, as discussed in Section 4.
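To make the connection between (4) and the JKO objective (6) explicit, note that with $p(x) = \frac{1}{Z}e^{-U(x)}$ the functional $F$ equals the KL divergence to $p$ up to an additive constant. The following short derivation is standard and not specific to this paper:
$$\begin{aligned}
\mathrm{KL}(\mu\,\|\,p) &= \int \mu(x)\log\frac{\mu(x)}{p(x)}\,\mathrm{d}x
= \int \mu(x)\log\mu(x)\,\mathrm{d}x - \int \mu(x)\log p(x)\,\mathrm{d}x\\
&= \underbrace{\int U(x)\,\mu(x)\,\mathrm{d}x}_{F_1(\mu)} + \underbrace{\int \mu(x)\log\mu(x)\,\mathrm{d}x}_{F_2(\mu)} + \log Z
\;=\; F(\mu) + \log Z,
\end{aligned}$$
since $\log p(x) = -U(x) - \log Z$. Hence minimizing $F(\mu) + \frac{W_2^2(\mu,\mu^{(h)}_k)}{2h}$ over $\mu$ is exactly the JKO step (6).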
## 3. Particle Approximation for WGFs

We focus on solving Itô diffusions with the scheme (6). Directly reformulating gradient flows as the sequential optimization problem (6) is infeasible, because the $\{\mu^{(h)}_k\}$ are infinite-dimensional objects. We propose to use a particle approximation to solve (6), where particles continuously evolve over time. There exist particle-based algorithms for gradient-flow approximations, for example, the stochastic and deterministic particle methods in (Cottet & Koumoutsakos, 2000; Russo, 1990; Carrillo et al., 2017). However, they did not target the JKO scheme, and thus are not applicable to our setting. Another advantage of the JKO scheme is that it allows direct application of gradient-based algorithms once gradients of the particles are available; it is thus particularly useful in deep-learning-based methods, where parameters are updated by backpropagating gradients through a network. Following a similar idea as in (Chen et al., 2018), in the $k$-th iteration of our algorithm, $M$ particles $\{x^{(i)}_k\}_{i=1}^{M}$ are used to approximate $\mu^{(h)}_k$: $\mu^{(h)}_k \approx \frac{1}{M}\sum_{i=1}^{M}\delta_{x^{(i)}_k}$.

Our goal is to evolve $\{x^{(i)}_k\}$ such that the corresponding empirical measure, $\mu^{(h)} \approx \frac{1}{M}\sum_{i=1}^{M}\delta_{x^{(i)}}$, minimizes (6). A standard method is to use gradient descent to update the particles, where the gradients $\big\{\frac{\partial\,\mathrm{KL}(\mu\,\|\,p)}{\partial x^{(i)}},\, \frac{\partial\,W_2^2(\mu,\,\mu^{(h)}_k)}{\partial x^{(i)}}\big\}$ are required according to (6). By assuming that $x^{(i)}_k$ evolves in the form $x^{(i)}_{k+1} = x^{(i)}_k + h\,\phi(x^{(i)}_k)$, with the function $\phi$ restricted to an RKHS with kernel $K(\cdot,\cdot)$, the gradient of the KL term is calculated as (Liu & Wang, 2016):
$$\frac{\partial\,\mathrm{KL}\big(\mu^{(h)}\,\|\,p\big)}{\partial x^{(i)}} \approx -\frac{1}{M}\sum_{j=1}^{M}\Big[K\big(x^{(j)}, x^{(i)}\big)\,\nabla_{x^{(j)}}\log p\big(x^{(j)}\big) + \nabla_{x^{(j)}}K\big(x^{(j)}, x^{(i)}\big)\Big]. \qquad (7)$$

The gradient for $W_2^2(\mu^{(h)}, \mu^{(h)}_k)$ is more involved, as the distance does not have a closed form. The Wasserstein term arises due to the Brownian motion in the diffusion process (2), and the non-differentiability of a sample path from a Brownian motion is translated into the Wasserstein distance. We develop a simple yet effective method to overcome this issue below. First, using a particle approximation, $W_2^2(\mu^{(h)}, \mu^{(h)}_k)$ is simplified as
$$W_2^2\big(\mu, \mu^{(h)}_k\big) = \inf_{\{p_{ij}\}}\sum_{i,j} p_{ij}\,c\big(x^{(i)}, x^{(j)}_k\big), \qquad (8)$$
where $c(x_1, x_2) \triangleq \|x_1 - x_2\|_2^2$ and the $\{p_{ij}\}$ satisfy the marginal constraints $\sum_j p_{ij} = \frac{1}{M}$, $\sum_i p_{ij} = \frac{1}{M}$ and $p_{ij}\ge 0$. Our goal turns to solving for the optimal $\{p_{ij}\}$. Since $W_2$ comes from the Brownian motion, the energy functional in its gradient flow is defined as $F_2$ in (4). Solving the gradient flow with the JKO scheme, at each iteration we minimize $\lambda F_2 + W_2^2(\mu, \mu^{(h)}_k)$, with $\lambda$ a regularization parameter. Substituting $W_2^2$ with (8), introducing Lagrange multipliers $\{\alpha_i, \beta_j\}$ to deal with the constraints, and letting $c_{ij} \triangleq c(x^{(i)}, x^{(j)}_k)$, the Lagrangian is
$$\mathcal{L}\big(\{p_{ij}\}, \{\alpha_i\}, \{\beta_j\}\big) = \sum_{i,j}\big(\lambda\, p_{ij}\log p_{ij} + p_{ij}\,c_{ij}\big) + \sum_i\alpha_i\Big(\sum_j p_{ij} - \tfrac{1}{M}\Big) + \sum_j\beta_j\Big(\sum_i p_{ij} - \tfrac{1}{M}\Big).$$
The optimal $p_{ij}$ have the form $p^{*}_{ij} = u_i\,e^{-c_{ij}/\lambda}\,v_j$, where $u_i = e^{-\frac{1}{2}-\frac{\alpha_i}{\lambda}}$ and $v_j = e^{-\frac{1}{2}-\frac{\beta_j}{\lambda}}$. Assuming $\{u_i\}$ and $\{v_j\}$ to be independent of $\{x^{(i)}\}$ and $\{x^{(j)}_k\}$,
$$\frac{\partial\,W_2^2\big(\mu, \mu^{(h)}_k\big)}{\partial x^{(i)}} \approx \sum_j\Big(\frac{c_{ij}}{\lambda}\,e^{-c_{ij}/\lambda} - e^{-c_{ij}/\lambda}\Big)\big(x^{(i)} - x^{(j)}_k\big). \qquad (9)$$

The gradients of the particles are obtained by combining (7) and (9), and the particles are then optimized using SGD. Intuitively, from (9), the Wasserstein term contributes in two ways: i) when $c_{ij}/\lambda > 1$, $x^{(i)}$ is pulled close to the previous particles $\{x^{(j)}_k\}$, with force proportional to $\big(\frac{c_{ij}}{\lambda} - 1\big)e^{-c_{ij}/\lambda}$; ii) when $x^{(i)}$ is close enough to a previous particle $x^{(j)}_k$, i.e., $c_{ij}/\lambda < 1$, $x^{(i)}$ is pushed away, preventing it from collapsing onto $x^{(j)}_k$.
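The sketch below puts (7) and (9) together into a single gradient step on the JKO objective (6). It is illustrative only: the standard-normal target, the fixed RBF kernel bandwidth, and all hyperparameters are choices made for this example, not the authors' code.

```python
import numpy as np

# One JKO particle update from Section 3: gradient descent on KL(mu || p) + W_2^2(mu, mu_k)/(2h),
# with the SVGD-style estimate (7) for the KL term and the entropic-OT approximation (9)
# for the Wasserstein term. Target p is N(0, I), chosen only for illustration.

def rbf_kernel(X, bandwidth=1.0):
    diff = X[:, None, :] - X[None, :, :]                 # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * bandwidth ** 2))
    gradK = -diff * K[..., None] / bandwidth ** 2        # grad_{x_j} K(x_j, x_i)
    return K, gradK

def grad_log_p(X):                                       # score of the illustrative target N(0, I)
    return -X

def jko_particle_step(X, X_prev, h=0.5, lam=1.0, lr=0.05):
    M = X.shape[0]
    K, gradK = rbf_kernel(X)
    grad_kl = -(K @ grad_log_p(X) + gradK.sum(axis=0)) / M          # eq. (7)
    diff = X[:, None, :] - X_prev[None, :, :]                       # x^(i) - x_k^(j)
    c = np.sum(diff ** 2, axis=-1)                                  # c_ij
    coeff = (c / lam - 1.0) * np.exp(-c / lam)
    grad_w2 = np.sum(coeff[..., None] * diff, axis=1)               # eq. (9)
    return X - lr * (grad_kl + grad_w2 / (2 * h))                   # descend the JKO objective

X = np.random.default_rng(0).normal(size=(50, 2)) + 3.0             # particles start away from the target
for k in range(300):                                                # one gradient step per JKO iteration
    X_prev = X.copy()
    X = jko_particle_step(X, X_prev)
print(X.mean(axis=0))                                               # particle mean drifts toward 0
```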
## 4. Policy Optimization as WGFs

Reinforcement learning is the problem of finding an optimal policy for an agent interacting with an unknown environment, collecting a reward per action. A policy is defined as a conditional distribution $\pi(a\,|\,s)$, specifying the probability of an action $a\in\mathcal{A}$ conditioned on a state variable $s\in\mathcal{S}$. Formally, the problem can be described as a Markov decision process (MDP), $\mathcal{M} = \langle\mathcal{S}, \mathcal{A}, P_s, r, \gamma\rangle$, where $P_s(s'\,|\,s, a)$ is the transition probability from state $s$ to $s'$ given action $a$; $r(s, a)$ is an unknown reward function immediately following the action $a$ performed at state $s$; and $\gamma\in[0,1]$ is a discount factor regularizing future rewards. We denote these variables with a subscript $t$ to indicate their time dependency. At each time step $t$, conditioned on the current state $s_t$, the agent chooses an action $a_t\sim\pi(\cdot\,|\,s_t)$ and receives a reward signal $r(s_t, a_t)$. The environment as seen by the agent then updates its state as $s_{t+1}\sim P_s(\cdot\,|\,s_t, a_t)$. The goal is to learn an optimal policy such that the agent obtains the maximum expected total reward, e.g., by maximizing
$$J(\pi) \triangleq \sum_{t=1}^{\infty}\mathbb{E}_{P_s,\pi}\big[\gamma^{t} r(s_t, a_t)\big] = \mathbb{E}_{s\sim\rho^{\pi},\,a\sim\pi}\big[r(s, a)\big], \qquad (10)$$
where $\rho^{\pi}(s) \triangleq \sum_{t=1}^{\infty}\gamma^{t-1}\Pr(s_t = s)$, and $\Pr(s_t = s)$ denotes the state marginal distribution induced by $\pi$.

Optimizing the objective in (10) with a maximum-entropy constraint provides a framework for training stochastic policies, but specific forms of the policy distribution are required, restricting the representation power. To define a more general class of distributions that can represent more complex and multimodal behavior, we adopt general energy-based policies (Haarnoja et al., 2017) and transform them into the WGF framework. Specifically, in the WGF framework, policies form a Riemannian manifold on the space of probability measures. The manifold structure is determined by the expected total reward (10), and the geodesic length between two elements (policy distributions) is defined by the standard second-order Wasserstein distance. With convex energy functionals (defined below), searching for an optimal policy reduces to running SGD on the manifold of probability measures. In the following, we define gradient flows on both the parameter-distribution space and the policy-distribution space, leading to indirect-policy learning and direct-policy learning, respectively. In indirect-policy learning, a WGF is defined over policy parameters, whereas in direct-policy learning, a WGF is defined over actions. For both settings, different energy functionals are defined based on the expected total reward, as detailed below. We note that most existing deep RL algorithms cannot be included in these two settings without the concept of WGF; however, their specific techniques could be applied as intermediate ingredients in our framework.

### 4.1. Indirect-policy learning

With indirect-policy learning, we do not optimize the stochastic policy $\pi$ directly. Instead, we aim to describe the uncertainty of a policy with parameter distributions (weight uncertainty), and thus we define a gradient flow on the parameters. Let a policy be parameterized by $\theta$, denoted as $\pi_\theta$. If we treat $\theta$ as stochastic and learn its posterior distribution $p(\theta)$ in response to the expected total reward, the policy is implicitly learned, in the sense that uncertainty in the parameters is transferred into the policy distribution in prediction. Following (Houthooft et al., 2016; Liu et al., 2017), the objective function is defined as
$$\max_{p}\;\big\{\mathbb{E}_{p(\theta)}\big[J(\pi_\theta)\big] - \alpha\,\mathrm{KL}\big(p\,\|\,p_0\big)\big\}, \qquad (11)$$
where $p_0(\theta)$ is the prior of $\theta$, and $\alpha\in[0,+\infty)$ is a temperature hyper-parameter balancing exploitation and exploration in the policy. (We assume the reward function to be deterministic for simplicity; stochastic rewards can be addressed similarly.) If we use an uninformative prior, $p_0(\theta) = \mathrm{const}$, the KL term simplifies to the entropy, giving $\max_p\{\mathbb{E}_{p(\theta)}[J(\pi_\theta)] + \alpha H(p)\}$. By taking the derivative of the objective, the optimal distribution is shown to have the simple closed form $p(\theta)\propto\exp\big(J(\pi_\theta)/\alpha\big)$ (Liu et al., 2017). This formulation is equivalent to a Bayesian formulation of the parameter $\theta$, where $p(\theta)$ can be seen as the posterior distribution and $\exp(J(\pi_\theta)/\alpha)$ as the likelihood function. A variational (posterior) distribution for $\theta$, denoted as $\mu(\theta)$, is learned by solving an appropriate gradient-flow problem. We define an energy functional characterizing the similarity between the current parameter distribution and the true distribution induced by the total reward as
$$F(\mu) \triangleq -\frac{1}{\alpha}\int J(\pi_\theta)\,\mu(\theta)\,\mathrm{d}\theta + \int\mu(\theta)\log\mu(\theta)\,\mathrm{d}\theta = \mathrm{KL}\big(\mu\,\|\,p_\theta\big) + \mathrm{const}, \qquad (12)$$
where $p_\theta(\theta)\propto\exp\big(J(\pi_\theta)/\alpha\big)$. The energy functional defines a landscape determined by the expected total reward, and attains its minimum when $\mu = p_\theta$.
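For completeness, the variational argument behind the closed form stated above is standard and not specific to this paper; $\eta$ below is the Lagrange multiplier introduced here for the normalization constraint. For the uninformative-prior case,
$$\begin{aligned}
&\max_{p}\;\int p(\theta)\,J(\pi_\theta)\,\mathrm{d}\theta - \alpha\int p(\theta)\log p(\theta)\,\mathrm{d}\theta
\quad\text{s.t.}\quad \int p(\theta)\,\mathrm{d}\theta = 1.\\
&\text{Setting the first variation of the Lagrangian to zero gives}\quad
J(\pi_\theta) - \alpha\big(\log p(\theta) + 1\big) + \eta = 0,\\
&\text{so}\quad p(\theta) = \exp\big(\eta/\alpha - 1\big)\,\exp\big(J(\pi_\theta)/\alpha\big)\;\propto\;\exp\big(J(\pi_\theta)/\alpha\big).
\end{aligned}$$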
**Proposition 5** For the gradient flow with the energy functional defined in (12), $\mu$ converges to $p_\theta$ in the infinite-time limit.

To solve the above gradient-flow problem, one can apply the JKO scheme with stepsize $h$ (we follow the previous notation, using subscript $k$ for discrete-time solutions and superscript $h$ for the stepsize):
$$\mu^{(h)}_{k+1} = \arg\min_\mu\;\mathrm{KL}\big(\mu\,\|\,p_\theta\big) + \frac{W_2^2\big(\mu, \mu^{(h)}_k\big)}{2h}. \qquad (13)$$
This problem can be solved directly with gradient descent by adopting the particle approximation described in Section 3. Specifically, let the current particles be $(\theta^{(i)})_{i=1}^{M}$. When calculating $\frac{\partial\,\mathrm{KL}(\mu\,\|\,p_\theta)}{\partial\theta^{(i)}}$ as in (7), we need to evaluate $\nabla_{\theta^{(i)}}J(\pi_{\theta^{(i)}})$. This can be approximated with REINFORCE (Williams, 1992) or advantage actor-critic (Schulman et al., 2016). For example, with REINFORCE,
$$\nabla_{\theta^{(i)}}J\big(\pi_{\theta^{(i)}}\big) \approx \sum_{t=1}^{T}\gamma^{t-1}\,\nabla_{\theta^{(i)}}\log\pi_{\theta^{(i)}}\big(a_t\,|\,s_t\big)\,\hat{Q}^{\pi}\big(s_t, a_t\big),$$
averaged over sampled trajectories, where $T$ is a horizon parameter and $\hat{Q}^{\pi}(s_t, a_t)$ is the Q-value function. We call this variant of our framework Indirect Policy learning with WGF (IP-WGF); a toy sketch of this update appears after Remark 6 below.

**Remark 6** Assume that the gradients $\nabla_{\theta^{(i)}}J(\pi_{\theta^{(i)}})$ and $\nabla_{\theta^{(i)}}W_2^2$ are unbiased. In the limit of $M\to\infty$ and $h\to 0$, and based on the fact that $F$ in (12) is convex, Lemma 3 suggests that the particle approximation converges to the global minimum $p_\theta$. The same conclusion applies, in the next section, to the direct-policy-learning case. Existing methods such as TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017b) do not have such an interpretation, and thus understanding their underlying convergence is more challenging. Furthermore, these methods optimize parameters directly as fixed points, deteriorating their ability to explore when policy distributions are inappropriately defined, since stochasticity only comes from the policy distributions.
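The following sketch is illustrative only: the two-armed bandit "environment", the linear softmax policy, and all names and hyperparameters are invented for this example and are not the authors' implementation. It shows the IP-WGF loop, where REINFORCE estimates of $\nabla_\theta J$ serve as the score of $p_\theta\propto\exp(J(\pi_\theta)/\alpha)$ inside the particle update of Section 3.

```python
import numpy as np

# IP-WGF on a toy two-armed bandit: M parameter particles approximate mu(theta),
# and each JKO step combines the SVGD-style KL gradient (7) with the Wasserstein gradient (9).

rng = np.random.default_rng(0)
TRUE_REWARDS = np.array([1.0, 0.0])                     # hypothetical bandit: action 0 is better

def reinforce_grad(theta, n_samples=64):
    """REINFORCE estimate of grad_theta J(pi_theta) for pi_theta = softmax(theta)."""
    logits = theta - theta.max()
    pi = np.exp(logits) / np.exp(logits).sum()
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(2, p=pi)
        r = TRUE_REWARDS[a] + 0.1 * rng.normal()
        grad += (np.eye(2)[a] - pi) * r                 # grad log pi(a) * reward
    return grad / n_samples

def ip_wgf_step(Theta, Theta_prev, alpha=1.0, h=0.5, lam=1.0, lr=0.1, bw=1.0):
    M = Theta.shape[0]
    score = np.stack([reinforce_grad(t) for t in Theta]) / alpha   # grad log p_theta
    diff_kk = Theta[:, None, :] - Theta[None, :, :]
    K = np.exp(-np.sum(diff_kk ** 2, axis=-1) / (2 * bw ** 2))
    gradK = -diff_kk * K[..., None] / bw ** 2
    grad_kl = -(K @ score + gradK.sum(axis=0)) / M                 # eq. (7)
    diff = Theta[:, None, :] - Theta_prev[None, :, :]
    c = np.sum(diff ** 2, axis=-1)
    grad_w2 = np.sum(((c / lam - 1) * np.exp(-c / lam))[..., None] * diff, axis=1)   # eq. (9)
    return Theta - lr * (grad_kl + grad_w2 / (2 * h))

Theta = rng.normal(size=(8, 2))                         # M = 8 parameter particles
for k in range(300):
    Theta = ip_wgf_step(Theta, Theta.copy())
print(np.mean(Theta[:, 0] - Theta[:, 1]))               # the gap tends to grow: particles favour the better arm
```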
### 4.2. Direct-policy learning

When the dimension of the parameter space is high, as is often the case in practice, IP-WGF can suffer from computation and storage inefficiencies. In direct-policy learning, a gradient flow is defined for the distribution of policies, so a policy is directly optimized during learning. This approach appears to be more efficient and flexible, and connects more directly to existing work than indirect-policy learning. Specifically, we consider general energy-based policies of the form $\pi(a\,|\,s)\propto\exp\big(-\varepsilon(s, a)/\alpha\big)$, which are able to model more complex distributions (Haarnoja et al., 2017).

We formulate direct-policy learning as policy-distribution-based gradient flows. The energy functional is defined with respect to the learned policy $\pi$, and thus it should depend on states. To this end, let $\varepsilon_{s,\pi}(a) = -Q(a_t, s_t)$, where
$$Q(a_t, s_t) \triangleq r(a_t = a, s_t = s) + \mathbb{E}_{(s_{t+1}, a_{t+1}, \ldots)\sim(\rho^\pi,\pi)}\Big[\sum_{l=1}^{\infty}\gamma^{l}\,r(s_{t+l}, a_{t+l})\Big].$$
$Q(a_t, s_t)$ is seen to be a functional depending on the current $s_t$ and $a_t$, as well as on the policy $\pi$. Integrating out the action $a$, an energy functional characterizing the similarity between the current policy $\pi$ and the optimal policy, $p_{s,\pi}(a\,|\,s)\propto e^{Q(a,s)}$, is readily defined as
$$F_s(\pi) \triangleq -\int Q(a, s)\,\pi(a\,|\,s)\,\mathrm{d}a + \int\pi(a\,|\,s)\log\pi(a\,|\,s)\,\mathrm{d}a = \mathrm{KL}\big(\pi\,\|\,p_{s,\pi}\big) + \mathrm{const}. \qquad (14)$$

**Remark 7** Soft Q-learning (Haarnoja et al., 2017) adopts $Q(a, s) + H(\pi)$ as the objective function, where the entropy of $\pi$, $H(\pi)\triangleq-\mathbb{E}_\pi[\log\pi]$, is included to add stochasticity into the corresponding Q-function. By treating the problem as a WGF, the stochasticity is modeled directly in the policy distribution; we therefore do not include the entropy term, though it would do no harm to add it.

**Proposition 8** For a WGF with the energy functional defined in (14), $\pi(a\,|\,s)$ converges to $p_{s,\pi}(a)\propto e^{Q(a,s)}$, with $Q(a, s)$ satisfying the following modified Bellman equation:
$$Q(a_t, s_t) = r(a_t, s_t) + \gamma\,\mathbb{E}_{s_{t+1}\sim\rho^\pi}\big[V_\pi(s_{t+1}) - H\big(\pi(\cdot\,|\,s_{t+1})\big)\big],$$
where $V_\pi(s_{t+1}) \triangleq \log\int_{\mathcal{A}}\exp\big(Q(a, s_{t+1})\big)\,\mathrm{d}a$.

To solve the corresponding WGF, we again adopt the JKO scheme as in (13), optimizing the policy $\pi$ by particle approximation, i.e., $\pi \approx \frac{1}{M}\sum_{i=1}^{M}\delta_{a^{(i)}}$. One challenge is that when calculating $\frac{\partial\,\mathrm{KL}(\pi\,\|\,p_{s,\pi})}{\partial a^{(i)}}$ from (7), one needs to evaluate $p_{s,\pi}(a^{(i)})$, which is difficult due to the infinite time horizon and the unknown reward function $r(s, a)$ entering $Q(a^{(i)}, s)$. To address this, we approximate the soft Q-function $Q(\cdot, s)$ with a deep neural network $Q^{\theta}_s(s, a)$ parametrized by $\theta$, i.e., $p_{s,\pi}(a)\propto e^{Q^{\theta}_s(s,a)}$. The neural network $Q^{\theta}_s$ naturally leads to a soft approximation of the standard Q-function, according to Proposition 8. As a result, learning alternates between the following two steps.

**1) Optimizing the policy** Given $Q^{\theta}_s$, we could adopt the particle approximation with the JKO scheme to optimize the policy. However, since a policy is a conditional distribution, one would need to introduce a set of particles for each state; for a large or continuous state space, this becomes intractable. To mitigate this problem, we propose to use a stochastic state-conditioned neural network $f_\phi$, parametrized by $\phi$, to approximate the policy. We call such a network a sampling network. The input to $f_\phi$ is a concatenation of a state $s$ and a random-noise sample $\xi$ drawn from a simple distribution, e.g., the standard normal distribution. To optimize the sampling network, note that the JKO scheme with energy functional (14) is written as (we rewrite $\pi$ as $\pi^\phi$ to explicitly indicate the dependence of $\pi$ on $\phi$):
$$\pi^\phi_{k+1} = \arg\min_{\pi^\phi}\;\underbrace{\mathrm{KL}\big(\pi^\phi\,\|\,p_{s,\pi}\big) + \frac{W_2^2\big(\pi^\phi, \pi^\phi_k\big)}{2h}}_{\triangleq\, J^\phi_\pi}. \qquad (15)$$
The outputs of $f_\phi(\{\xi_i\}; s_t)$ are the particles $(a^{(i)}_t)_{i=1}^{M}$. Using the chain rule, the gradient with respect to $\phi$ is
$$\frac{\partial J^\phi_\pi}{\partial\phi} = \mathbb{E}_{\{\xi_i\}}\Big[\sum_i\frac{\partial J^\phi_\pi}{\partial a^{(i)}_t}\,\frac{\partial a^{(i)}_t}{\partial\phi}\Big].$$
Thus $\phi$ can be updated using standard SGD, where $\partial J^\phi_\pi/\partial a^{(i)}_t$ represents the particle gradients in the WGF, approximated using the techniques of Section 3, and $\partial a^{(i)}_t/\partial\phi$ is calculated by standard backpropagation.

**2) Optimizing the Q-network** We optimize the Q-network using the Bellman error, as in the soft Q-learning setting (Haarnoja et al., 2017). Specifically, in each iteration we optimize the following objective function:
$$J_Q(\theta) \triangleq \mathbb{E}_{s_t\sim q_{s_t},\,a_t\sim q_{a_t}}\Big[\big(\hat{Q}^{\bar\theta}_s(s_t, a_t) - Q^{\theta}_s(s_t, a_t)\big)^2\Big],$$
where $q_{s_t}$ and $q_{a_t}$ are arbitrary distributions with support on $\mathcal{S}$ and $\mathcal{A}$, respectively; $\hat{Q}^{\bar\theta}_s(s_t, a_t) = r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1}\sim\rho^\pi}\big[V^{\bar\theta}_s(s_{t+1})\big]$ is the target Q-value, with
$$V^{\bar\theta}_s(s_{t+1}) = \log\mathbb{E}_{q_{a'}}\Big[\frac{\exp\big(Q^{\bar\theta}_s(s_{t+1}, a')\big)}{q_{a'}(a')}\Big] - H\big(q_{a'}\big),$$
and $\bar\theta$ represents the parameters of the target Q-network, as used in standard deep Q-learning (Mnih et al., 2013). The distribution $q_{a_t}$ can be set to the distribution induced by the sampling network $f_\phi$ (Haarnoja et al., 2017); alternatively, its form can be defined explicitly, e.g., using isotropic Gaussian or mixture-of-Gaussian distributions. The full algorithm is given in Section G of the SM. We call this variant of our framework Direct Policy learning with WGF (DP-WGF); a minimal sketch of the policy step appears below.
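The sketch below is illustrative, not the authors' implementation: the network sizes, names (`f_phi`, `Q_theta`), and hyperparameters are placeholders, and the Q-network is assumed to be trained separately as described above. It shows how the particle gradients of Section 3 (with $\nabla_a\log p_{s,\pi}(a) = \nabla_a Q^{\theta}_s(s, a)$) are backpropagated into the sampling network via the chain rule in (15).

```python
import torch
import torch.nn as nn

state_dim, action_dim, noise_dim, M = 8, 2, 4, 16

f_phi = nn.Sequential(nn.Linear(state_dim + noise_dim, 64), nn.Tanh(),
                      nn.Linear(64, action_dim))                  # sampling network
Q_theta = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                        nn.Linear(64, 1))                         # soft Q-network (trained elsewhere)
opt = torch.optim.Adam(f_phi.parameters(), lr=1e-3)

def particle_grads(a, a_prev, s, h=0.1, lam=1.0, bw=1.0):
    """dJ/da for each action particle: SVGD-style KL gradient (7) plus Wasserstein gradient (9)."""
    a = a.detach().requires_grad_(True)
    q = Q_theta(torch.cat([s.expand(a.shape[0], -1), a], dim=-1)).sum()
    score = torch.autograd.grad(q, a)[0]                          # grad_a Q(s, a) = grad_a log p_{s,pi}(a)
    diff = a.unsqueeze(1) - a.unsqueeze(0)
    K = torch.exp(-diff.pow(2).sum(-1) / (2 * bw ** 2))
    gradK = -diff * K.unsqueeze(-1) / bw ** 2
    grad_kl = -(K @ score + gradK.sum(0)) / a.shape[0]
    d_prev = a.unsqueeze(1) - a_prev.unsqueeze(0)
    c = d_prev.pow(2).sum(-1)
    grad_w2 = (((c / lam - 1) * torch.exp(-c / lam)).unsqueeze(-1) * d_prev).sum(1)
    return (grad_kl + grad_w2 / (2 * h)).detach()

s = torch.randn(1, state_dim)                                     # a single state, for illustration
a_prev = torch.randn(M, action_dim)                               # particles from the previous JKO iterate
xi = torch.randn(M, noise_dim)                                    # noise samples xi
a = f_phi(torch.cat([s.expand(M, -1), xi], dim=-1))               # current action particles
surrogate = (a * particle_grads(a, a_prev, s)).sum()              # chain rule: dJ/dphi = dJ/da * da/dphi
opt.zero_grad()
surrogate.backward()
opt.step()                                                        # one SGD step on phi
```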
**Reducing variance** Note that when optimizing the Q-network, one needs to compute the $V^{\bar\theta}_s$-function. This involves an integration over $q_{a'}$, which incurs the high variance associated with Monte Carlo integration. Consequently, we propose to learn a V-network approximating $V^{\bar\theta}_s$, denoted $V_\psi(s)$ with parameter $\psi$. To learn the V-network, similar to (Haarnoja et al., 2018), we use an explicit policy distribution. As a result, we replace the sampling network $f_\phi$ with a BNN as discussed in Section 4.1, whose induced policy distribution is denoted $\pi_\phi$. Intuitively, $V^{\bar\theta}_s(s)$ can be considered an approximation to the log-normalizer of $\exp(Q^{\theta}_s(s, a))$ over $a$. From the definition, the objective is
$$J_V(\psi) \triangleq \mathbb{E}_{s_t\sim q_{s_t}}\bigg[\Big(V_\psi(s_t) - \log\mathbb{E}_{a_t\sim\pi_\phi(\cdot|s_t)}\Big[\frac{\exp\big(Q^{\theta}_s(s_t, a_t)\big)}{\pi_\phi(a_t\,|\,s_t)}\Big] - \mathbb{E}_{a_t\sim\pi_\phi(\cdot|s_t)}\big[\log\pi_\phi(a_t\,|\,s_t)\big]\Big)^2\bigg].$$
In our implementation, we find that the following approximation works well:
$$J_V(\psi) \triangleq \mathbb{E}_{s_t\sim q_{s_t}}\Big[\big(V_\psi(s_t) - \mathbb{E}_{a_t\sim\pi_\phi(\cdot|s_t)}\big[Q^{\theta}_s(s_t, a_t) - \log\pi_\phi(a_t\,|\,s_t)\big]\big)^2\Big],$$
which is inspired by (Haarnoja et al., 2018). We call this variant Direct Policy learning with WGF and Variance reduction (DP-WGF-V).

## 5. Connections with Related Works

**Soft Q-learning** Though motivated from a different perspective, our DP-WGF results in an algorithm similar to soft Q-learning with energy-based policies (Haarnoja et al., 2017). However, DP-WGF is more general, in that we can define different sampling networks, such as a BNN, which can be optimized with the proposed IP-WGF technique.

**Soft actor-critic (SAC)** SAC (Haarnoja et al., 2018) is an improvement of soft Q-learning, introducing a V-network similar to ours and modeling the policy with a mixture of Gaussians. DP-WGF-V is related to SAC, but arrives at the V-network from a different perspective (variance reduction). Importantly, we define a gradient flow for policy distributions, allowing optimization from a distribution perspective.

**Trust-region methods** Trust-region methods are known to stabilize policy optimization in RL (Schulman et al., 2015). Schulman et al. (2017a) illustrate the equivalence between soft Q-learning and policy gradients. In the original TRPO setting, an objective function is optimized subject to a constraint that the updated policy is not too far from the current policy, in terms of the KL divergence (see Section G of the SM for a more detailed description). The theory of TRPO suggests adding a penalty to the objective instead of adopting a constraint, resulting in a form similar to our framework. However, we use the Wasserstein distance to penalize the discrepancy between the previous and current policies, which is a weaker metric than the KL divergence and potentially leads to more robust solutions, as evidenced by the development of Wasserstein GAN (Arjovsky et al.). As a result, our framework can be regarded as a trust-region-based counterpart of soft Q-learning (Haarnoja et al., 2017) and SVPG (Liu et al., 2017). Similar arguments hold for other trust-region methods such as PPO (Schulman et al., 2017b) and Trust-PCL (Nachum et al., 2017), which improve TRPO with either different objectives or different trust-region constraints.
**Noisy exploration** Adding noise to the parameters for noisy exploration (Fortunato et al., 2018; Plappert et al., 2018) can be interpreted as a special case of our IP-WGF framework with a single particle. Isotropic Gaussian noisy exploration corresponds to the maximum a posteriori (MAP) solution under a Gaussian assumption on the posterior distribution of the parameters, potentially leading to inferior solutions when the assumption is not met. By contrast, our method is endowed with the ability to explore multimodal distributions, by optimizing the parameter distribution directly. More details are provided in Section C of the SM.

## 6. Experiments

We test the proposed WGF framework from two perspectives: i) the effectiveness of the proposed particle-approximation method for WGF, and ii) the advantages of the WGF framework for policy optimization. For i), we use a standard regression model to learn optimal parameter distributions, i.e., posterior distributions. For ii), we test our algorithms on several domains in OpenAI rllab and Gym (Duan et al., 2016). All experiments are conducted on a single Tesla P100. Detailed settings are given in the SM.

### 6.1. Regression

We use a single-layer BNN as a regression model. The parameters of the BNN are treated probabilistically and optimized with our WGF framework. We compare WGF with SVGD (Liu & Wang, 2016), Bayesian Dropout (Gal & Ghahramani, 2016) and PBP (Hernández-Lobato & Adams, 2015). The RMSprop optimizer is employed. Detailed experimental settings and datasets are described in Section D.2 of the SM. We adopt the root-mean-squared error (RMSE) and test log-likelihood as evaluation criteria. The experimental results are shown in Table 1 (complete results are provided in Section D.2 of the SM). We observe that the proposed WGF obtains better results on both metrics, partially due to the flexibility of our particle-approximation algorithm, which solves the original WGF problem effectively.

Table 1. Averaged predictions, with standard deviations, in terms of test log-likelihood.

| Dataset | PBP | SVGD | WGF |
|---|---|---|---|
| Boston | -2.57 ± 0.09 | -2.50 ± 0.03 | -2.40 ± 0.10 |
| Concrete | -3.16 ± 0.02 | -3.08 ± 0.02 | -2.95 ± 0.06 |
| Energy | -2.04 ± 0.02 | -1.77 ± 0.02 | -0.73 ± 0.08 |
| Kin8nm | 0.90 ± 0.01 | 0.98 ± 0.01 | 0.97 ± 0.02 |
| Naval | 3.73 ± 0.01 | 4.09 ± 0.01 | 4.11 ± 0.02 |
| CCPP | -2.80 ± 0.05 | -2.82 ± 0.01 | -2.78 ± 0.01 |
| Winequality | -0.97 ± 0.01 | -0.93 ± 0.01 | -0.87 ± 0.04 |
| Yacht | -1.63 ± 0.02 | -1.23 ± 0.04 | -0.99 ± 0.15 |
| Protein | -2.97 ± 0.00 | -2.95 ± 0.00 | -2.88 ± 0.01 |
| Year Predict | -3.60 ± NA | -3.58 ± NA | -3.57 ± NA |

Figure 1. Learning curves by IP-WGF and SVPG with REINFORCE and A2C.

### 6.2. Indirect-policy learning

For this group of experiments, we compare IP-WGF with SVPG (Liu et al., 2017), a state-of-the-art method for indirect-policy learning, considering three classical continuous-control tasks: Cartpole Swing-Up, Double Pendulum, and Cartpole. Only policy parameters are updated by IP-WGF or SVPG, while the critics are updated with the TD-error. We train our agents for 100 iterations on the easier Cartpole domain and 1000 iterations on the other two domains. Following the settings in (Liu et al., 2017; Houthooft et al., 2016; Zhang et al., 2018), the policy is parameterized as a two-layer (25-16 hidden units) neural network with tanh activation function. The maximum horizon length is set to 500. A sample size of 5000 is used for policy-gradient estimation. We use M = 16 particles to approximate parameter distributions, and h = 0.1 as the discretized stepsize. REINFORCE (Williams, 1992) and advantage actor-critic (A2C) (Schulman et al., 2016) are used as the policy-learning strategies.

Figure 1 plots the mean (dark curves) and standard deviation (light areas) of rewards over 5 runs. It is clear that in all tasks IP-WGF consistently converges faster than SVPG, and finally converges to higher average rewards. The results are comparable to those in (Houthooft et al., 2016). The experiments demonstrate that employing Wasserstein gradient flows for policy optimization improves performance, as suggested by our theory.
Table 2. DP-WGF-V, SAC, TRPO-GAE, and DDPG results, showing the maximum average rewards attained and the number of episodes needed to cross the specified reward thresholds. DP-WGF-V often learns more sample-efficiently than the baselines, and it solves difficult domains such as Humanoid better than DDPG.

| Domain | Threshold | DP-WGF-V Max Return | DP-WGF-V Episodes | SAC Max Return | SAC Episodes | TRPO-GAE Max Return | TRPO-GAE Episodes | DDPG Max Return | DDPG Episodes |
|---|---|---|---|---|---|---|---|---|---|
| Swimmer | 100 | 181.60 | 76 | 180.83 | 112 | 110.58 | 433 | 49.57 | N/A |
| Walker | 3000 | 4978.59 | 2289 | 4255.05 | 2388 | 3497.81 | 3020 | 2138.42 | N/A |
| Hopper | 2000 | 3248.76 | 678 | 3146.51 | 736 | 2604 | 1749 | 1317 | N/A |
| Humanoid | 2000 | 3077.84 | 18740 | 2212.51 | 26476 | 5411.15 | 32261 | 2230.60 | 34652 |

Figure 2. Average return on MuJoCo tasks by Soft-Q, SAC and DP-WGF-V (first row), and by DDPG, TRPO-GAE and DP-WGF-V (second row). From left to right, the tasks are Swimmer, Hopper, Walker and Humanoid, respectively.

### 6.3. Direct-policy learning

We compare our DP-WGF and DP-WGF-V frameworks with existing off-policy and on-policy deep RL algorithms on several MuJoCo tasks, namely SAC, Soft-Q and DDPG (off-policy) and TRPO-GAE (on-policy). Our DP-WGF-V is considered an off-policy actor-critic method. For all methods, the value function and policy are parameterized as two-layer (128-128 hidden units) neural networks with tanh as the activation function. The maximum horizon length is set to 1000 when simulating expected total rewards. Three easier tasks (Swimmer, Hopper and Walker) in MuJoCo can be solved by a wide range of algorithms, while the more complex benchmark, the 21-dimensional Humanoid, is known to be very difficult to solve with off-policy algorithms (Duan et al., 2016). Implementation details of the algorithms are specified in Section E.3 of the SM.

**Effectiveness of the Wasserstein trust region** We evaluate DP-WGF-V against SAC, and DP-WGF against Soft-Q, on four MuJoCo tasks, as these are the methods most closely related to our algorithms. Figure 2 (first row) plots average returns over epochs on these tasks. Our WGF-based methods converge faster and to better solutions than their counterparts, owing to the introduction of WGFs. Furthermore, through variance reduction, DP-WGF-V significantly outperforms DP-WGF on all tasks.

**Comparisons with popular baselines** Finally, we compare DP-WGF-V with TRPO-GAE (Schulman et al., 2016) and DDPG (Lillicrap et al., 2016) on the same MuJoCo tasks. In general, TRPO-GAE has been a state-of-the-art method for policy optimization. Figure 2 (second row) plots average returns over episodes, and DP-WGF-V consistently outperforms the other algorithms. Table 2 summarizes some key statistics, including the best attained average rewards and the episodes needed to reach the reward thresholds. DP-WGF-V consistently outperforms TRPO-GAE and DDPG in terms of sample complexity, and often achieves higher rewards than TRPO-GAE. A particularly notable case is Humanoid, where DP-WGF-V substantially outperforms TRPO-GAE in terms of sample efficiency, while DDPG cannot learn a good policy at all.

## 7. Conclusion

We lift policy optimization to the space of probability distributions, and interpret it as a Wasserstein gradient flow. Two types of WGFs are defined for the task, one on the parameter-distribution space and the other on the policy-distribution space. The WGFs are solved by a new particle-approximation-based algorithm, in which gradients of the particles are calculated in closed form. Under certain circumstances, optimization on the probability-distribution space is convex, and thus easier to deal with than in existing methods.
Experiments are conducted on a number of reinforcementlearning tasks, demonstrating the superiority of the proposed framework compared to related algorithms. Policy Optimization as Wasserstein Gradient Flows Acknowledgements We acknowledge Tuomas Haarnoja et al. for making their code public and thank Ronald Parr for insightful advice. This research was supported in part by ARO, DARPA, DOE, NGA, ONR and NSF. Ambrosio, L., Gigli, N., and Savaré, G. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich, 2005. Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. In NIPS. Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. In ICML, 2015. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym, 2016. Carrillo, J. A., Craig, K., and Patacchini, F. S. A blob method for diffusion. (ar Xiv:1709.09195), 2017. Chen, C., Zhang, R., Wang, W., Li, B., and Chen, L. A unified particle-optimization framework for scalable Bayesian sampling. In UAI, 2018. Cottet, G. H. and Koumoutsakos, P. D. Vortex methods. Cambridge University Press, 2000. Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In ICML, 2016. Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., and Legg, S. Noisy networks for exploration. In ICLR, 2018. Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML, 2016. Gobbino, M. Minimizing movements and evolution problems in Euclidean spaces. Annali di Matematica Pura ed Applicata, 1999. Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-prop: Sample-efficient policy gradient with an off-policy critic. In ICLR, 2017. Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In ICML, 2017. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actorcritic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML, 2018. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and D. Meger, D. Deep reinforcement learning that matters. In AAAI, 2018. Hernández-Lobato, J. M. and Adams, R. P. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015. Hinton, G. E., Srivastava, N., and Swersky, K. Rmsprop: Divide the gradient by a running average of its recent magnitude. Neural Networks for Machine Learning, Coursera, 2012. Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. VIME: Variational information maximizing exploration. In NIPS, 2016. Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1 17, 1998. Kaelbling, L. P., Littman, M. L., and Moore, A. W. Reinforcement learning: A survey. In JAIR, 1996. Kakade, S. M. A natural policy gradient. In NIPS, 2002. Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015. Li, Y. Deep reinforcement learning: An overview. (ar Xiv:1701.07274), 2017. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In ICLR, 2016. Liu, Q. and Wang, D. 
Liu, Y., Ramachandran, P., Liu, Q., and Peng, J. Stein variational policy gradient. In UAI, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv:1312.5602, 2013.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A., Veness, J., Bellemare, M., Graves, A., Riedmiller, M., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In ICML, pp. 1928–1937, 2016.

Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. Trust-PCL: An off-policy trust region method for continuous control. arXiv:1707.01891, 2017.

O'Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. arXiv:1611.01626, 2016.

Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. In ICLR, 2018.

Risken, H. The Fokker-Planck Equation. Springer-Verlag, New York, 1989.

Rulla, J. Error analysis for implicit approximations to solutions to Cauchy problems. SIAM Journal on Numerical Analysis, 33(1):68–87, 1996.

Russo, G. Deterministic diffusion of particles. Communications on Pure and Applied Mathematics, 1990.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In ICML, 2015.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.

Schulman, J., Abbeel, P., and Chen, X. Equivalence between policy gradients and soft Q-learning. arXiv:1704.06440, 2017a.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347, 2017b.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. 1998.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000.

Villani, C. Optimal Transport: Old and New. Springer Science & Business Media, 2008.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 1992.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.

Zhang, R., Li, C., Chen, C., and Carin, L. Learning structural weight uncertainty for sequential decision-making. In AISTATS, 2018.