# Policy Optimization with Stochastic Mirror Descent

Long Yang¹*, Yu Zhang²*, Gang Zheng¹, Qian Zheng¹,³, Pengfei Li¹, Jianghang Huang¹, Gang Pan¹
¹College of Computer Science and Technology, Zhejiang University, China
²Netease Games AI Lab, Hangzhou, China
³School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
¹{yanglong, gang zheng, hzzhangyu, pfl, gpan}@zju.edu.cn, ²zhangyu15@corp.netease.com, ³zhengqian@ntu.edu.sg

*L. Yang and Y. Zhang share equal contributions. G. Pan is the corresponding author.
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method based on stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only O(ϵ^{-3}) sample trajectories to achieve an ϵ-approximate first-order stationary point, which matches the best-known sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms state-of-the-art policy gradient methods in various settings.

## Introduction

Policy gradient (Williams 1992; Sutton et al. 2000) is widely used to search for the optimal policy in reinforcement learning (RL), and it has achieved significant successes in challenging fields such as playing Go (Silver et al. 2016, 2017) and robotics (Duan et al. 2016). However, policy gradient methods suffer from high sample complexity, since many popular methods require collecting a large number of samples at each step to update their parameters (Haarnoja et al. 2018; Yang et al. 2021; Xing et al. 2021; Yang et al. 2022), which partially reduces the effectiveness of the samples. Besides, providing a theoretical analysis of the sample complexity of policy gradient methods, rather than merely improving sample efficiency empirically, remains very challenging.

To improve sample efficiency, this paper addresses how to design an efficient and convergent algorithm with stochastic mirror descent (SMD) (Nemirovskij and Yudin 1983). SMD keeps the advantages of low memory requirement and low computational complexity (Lei and Tang 2018), which implies that SMD needs fewer samples to learn a model. However, applying existing SMD methods to RL faces two significant challenges: 1) The objective of policy-based RL is a typical non-convex function, and Ghadimi et al. (2016) show that updating the parameter of a non-convex objective by SMD via a single sample may cause instability and even divergence. 2) The large variance of the policy gradient estimator is a critical bottleneck for improving the sample efficiency of policy optimization with SMD. The non-stationary sampling process in the environment leads to a large variance of the policy gradient estimator (Papini et al. 2018), which requires more samples to obtain a robust policy gradient and results in poor sample efficiency (Liu et al. 2018).

To address the above challenges, we first provide a theoretical analysis of the dilemma of applying SMD to policy optimization. Result (18) shows that, under Assumption 1, deriving the algorithm directly from SMD cannot guarantee convergence for policy optimization. Furthermore, we propose a new algorithm, MPO, that keeps a provable convergence guarantee (see Theorem 2).
Designing a new gradient estimator from the historical information of the policy gradient is the key to MPO. We then propose a variance-reduced mirror policy optimization algorithm (VRMPO): a sample-efficient method built on a variance-reduced policy gradient estimator. Concretely, we design an efficiently computable policy gradient estimator (see Eq. (26)) that utilizes fresh information and yields a more accurate estimate of the policy gradient, which is the key to improving sample efficiency. Theorem 3 shows that VRMPO needs O(ϵ^{-3}) sample trajectories to achieve an ϵ-approximate first-order stationary point (ϵ-FOSP). To the best of our knowledge, VRMPO matches the best sample complexity in the existing literature. In particular, although SRVR-PG (Xu et al. 2020; Xu 2021) achieves the same sample complexity as VRMPO, our approach needs fewer assumptions than Xu et al. (2020); Xu (2021), and VRMPO unifies SRVR-PG. Besides, empirical results show that VRMPO converges faster than SRVR-PG.

## Background and Stochastic Mirror Descent

Reinforcement learning (RL) is often formulated as a Markov decision process (MDP) M = (S, A, P, R, ρ_0, γ), where S is the state space and A is the action space; P(s′|s, a) is the probability of transitioning from state s to s′ under action a; R(·, ·) : S × A → [−R_max, R_max] is the reward function, where R_max is a positive scalar; ρ_0(·) : S → [0, 1] is the initial state distribution; and γ ∈ (0, 1). A policy π_θ(a|s) is a probability distribution on S × A with parameter θ ∈ R^p. Let τ = {s_t, a_t, r_{t+1}}_{t=0}^{H_τ} be a trajectory, where s_0 ∼ ρ_0(s_0), a_t ∼ π_θ(·|s_t), r_{t+1} = R(s_t, a_t), s_{t+1} ∼ P(·|s_t, a_t), and H_τ is the finite horizon of τ. The expected return J(θ) is defined as

J(θ) := ∫_τ P(τ|θ) R(τ) dτ = E_{τ∼π_θ}[R(τ)],   (1)

where P(τ|θ) = ρ_0(s_0) Π_{t=0}^{H_τ} P(s_{t+1}|s_t, a_t) π_θ(a_t|s_t) is the probability of generating τ, and R(τ) = Σ_{t=0}^{H_τ} γ^t r_{t+1} is the accumulated discounted return. Let 𝒥(θ) := −J(θ); the central problem of policy-based RL is to solve

θ* = arg max_θ J(θ)  ⟺  θ* = arg min_θ 𝒥(θ).   (2)

Computing ∇J(θ) analytically, we have

∇J(θ) = E_{τ∼π_θ}[ Σ_{t≥0} ∇_θ log π_θ(a_t|s_t) R(τ) ].   (3)

Let g(τ|θ) = Σ_{t=0}^{H_τ} ∇_θ log π_θ(a_t|s_t) R(τ), which is an unbiased estimator of ∇J(θ). Vanilla policy gradient (VPG) is a straightforward way to solve problem (2):

θ ← θ + α g(τ|θ),

where α is the step size.

Assumption 1. (Papini et al. 2018) For each pair (s, a) ∈ S × A, each θ ∈ R^p, and all components i, j, there exist positive constants G, F such that

|∂_{θ_i} log π_θ(a|s)| ≤ G,  |∂²_{θ_i θ_j} log π_θ(a|s)| ≤ F.   (4)

Assumption 1 implies that ∇J(θ) is L-Lipschitz (Papini et al. 2018, Lemma B.2), i.e.,

‖∇J(θ_1) − ∇J(θ_2)‖ ≤ L‖θ_1 − θ_2‖,   (5)

where L = R_max H_τ(H_τ G² + F)/(1 − γ). Besides, under Assumption 1, Shen et al. (2019) show the property

‖g(τ|θ) − ∇J(θ)‖²_2 ≤ G² R²_max/(1 − γ)⁴ =: σ².   (6)
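As a concrete reference point for the estimator g(τ|θ) in Eq. (3) and the VPG update, here is a minimal NumPy sketch for a soft-max policy with linear scores L_θ(s, a) = φ(s, a)^⊤θ (the same parameterization used later in the SASC experiment). The trajectory format — a list of (φ_s, a, r) triples — is an assumption of this sketch, not an interface defined in the paper.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """pi_theta(a|s) for linear scores L_theta(s,a) = phi(s,a)^T theta.
    phi_s: (num_actions, p) feature matrix for the current state."""
    logits = phi_s @ theta
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, phi_s, a):
    """grad_theta log pi_theta(a|s) for the soft-max policy above."""
    pi = softmax_policy(theta, phi_s)
    return phi_s[a] - pi @ phi_s           # phi(s,a) - E_{a'~pi}[phi(s,a')]

def vpg_estimate(theta, trajectory, gamma=0.99):
    """Unbiased estimator g(tau|theta) = sum_t grad log pi(a_t|s_t) * R(tau),
    with R(tau) the discounted return of the whole trajectory (Eq. 3)."""
    R = sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
    g = np.zeros_like(theta)
    for phi_s, a, _ in trajectory:
        g += grad_log_pi(theta, phi_s, a)
    return g * R

def vpg_step(theta, trajectory, alpha=0.01):
    """Vanilla policy gradient step: theta <- theta + alpha * g(tau|theta)."""
    return theta + alpha * vpg_estimate(theta, trajectory)
```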
### SMD and Bregman Gradient

We now review some basic concepts of stochastic mirror descent (SMD) and the Bregman gradient. Consider the stochastic optimization problem

min_{θ∈D_θ} { f(θ) = E[F(θ; ξ)] },   (7)

where D_θ ⊆ R^n is a nonempty convex compact set, ξ is a random vector whose probability distribution µ is supported on Ξ ⊆ R^d, and F : D_θ × Ξ → R. We assume that the expectation E[F(θ; ξ)] = ∫_Ξ F(θ; ξ) dµ(ξ) is well defined and finite-valued for every θ ∈ D_θ.

Definition 1 (Proximal Operator). Let T be defined on a closed convex set X, and let α > 0. The proximal operator of T is

M^ψ_{α,T}(z) = arg min_{x∈X} { T(x) + (1/α) D_ψ(x, z) },   (8)

where ψ(·) is a continuously differentiable, ζ-strictly convex function satisfying ⟨x − y, ∇ψ(x) − ∇ψ(y)⟩ ≥ ζ‖x − y‖², ζ > 0, and D_ψ(·, ·) is the Bregman distance: for x, y ∈ X, D_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩.

Stochastic Mirror Descent (SMD). SMD solves (7) by generating iterates

θ_{t+1} = M^ψ_{α_t, ℓ(θ)}(θ_t) = arg min_{θ∈D_θ} { ⟨g_t, θ⟩ + (1/α_t) D_ψ(θ, θ_t) },   (9)

where α_t > 0 is the step size, ℓ(θ) = ⟨g_t, θ⟩ is the first-order approximation of f(θ) at θ_t, g_t = g(θ_t, ξ_t) is a stochastic subgradient such that ḡ(θ_t) = E[g(θ_t, ξ_t)] ∈ ∂f(θ)|_{θ=θ_t}, {ξ_t}_{t≥0} are draws from the distribution µ, and ∂f(θ) = {g | f(θ) − f(ω) ≤ g^⊤(θ − ω), ∀ω ∈ dom(f)}. If we choose ψ(x) = ½‖x‖²_2, then D_ψ(x, y) = ½‖x − y‖²_2, and iteration (9) reduces to stochastic gradient descent (SGD).

Convergence Criterion: Bregman Gradient. Recall that X is a closed convex set in R^n, α > 0, and T(x) is defined on X. The Bregman gradient of T at x ∈ X is defined as

G^ψ_{α,T}(x) = α^{-1}(x − M^ψ_{α,T}(x)),   (10)

where M^ψ_{α,T}(·) is defined in Eq. (8). If ψ(x) = ½‖x‖²_2, then, according to Bauschke, Combettes et al. (2011, Theorem 27.1), x* is a critical point of T if and only if G^ψ_{α,T}(x*) = ∇T(x*) = 0. Thus, the Bregman gradient (10) is a generalization of the standard gradient. Remark 1 provides some insight into the Bregman gradient as a convergence criterion.

Remark 1. Let T(·) be a convex function. According to Bertsekas (2009, Proposition 5.4.7), x* is a stationary point of T(·) if and only if

0 ∈ ∂(T + δ_X)(x*),   (11)

where δ_X(·) is the indicator function of X. Furthermore, if ψ(x) is twice continuously differentiable, let x̄ = M^ψ_{α,T}(x); by the definition (8) of M^ψ_{α,T}(·), we have

0 ∈ ∂(T + δ_X)(x̄) + ∇ψ(x̄) − ∇ψ(x)  ≈(*)  ∂(T + δ_X)(x̄) − α ∇²ψ(x) G^ψ_{α,T}(x),   (12)

where (*) holds by the first-order Taylor expansion of ∇ψ at x together with x̄ − x = −α G^ψ_{α,T}(x). If G^ψ_{α,T}(x) → 0, Eq. (12) implies that the origin 0 is near the set ∂(T + δ_X)(x̄); i.e., according to criterion (11), x̄ is close to a stationary point.

For the iteration (9), we focus on when it drives G^ψ_{α,T}(θ_t) toward the origin 0. Formally, we are satisfied with finding an ϵ-approximate first-order stationary point (ϵ-FOSP) θ_ϵ such that

‖G^ψ_{α,T(θ_ϵ)}(θ_ϵ)‖_2 ≤ ϵ.   (13)

In particular, for policy optimization (2) we choose T(θ) = ⟨−∇J(θ_t), θ⟩.

## Stochastic Mirror Policy Optimization

In this section, we solve problem (2) via SMD. First, we analyze the theoretical dilemma of applying SMD directly to policy optimization; the result shows that under the common Assumption 1, there is still no provable guarantee for solving (2) via SMD directly. Then, we propose a convergent mirror policy optimization algorithm (MPO).

### Theoretical Dilemma

For each k ∈ [1, N − 1], let τ_k = {s_t, a_t, r_{t+1}}_{t=0}^{H_{τ_k}} ∼ π_{θ_k}, and we receive the gradient information

g(τ_k|θ_k) = Σ_{t≥0} ∇_θ log π_θ(a_t|s_t) R(τ_k)|_{θ=θ_k}.   (14)

According to (9), we define the update rule

θ_{k+1} = M^ψ_{α_k, ⟨−g(τ_k|θ_k), θ⟩}(θ_k) = arg min_θ { ⟨−g(τ_k|θ_k), θ⟩ + (1/α_k) D_ψ(θ, θ_k) },   (15)

where α_k is the step size. After (N − 1) episodes, we obtain a collection {θ_k}_{k=1}^N. Since J(θ) is non-convex, following Ghadimi et al. (2016), a standard strategy for analyzing non-convex optimization is to pick the output θ̃_N from the following distribution over {1, 2, …, N}:

P(θ̃_N = θ_k) = (ζα_k − Lα²_k) / Σ_{i=1}^N (ζα_i − Lα²_i),  k ∈ [1, N],   (16)

where the step size α_k ∈ (0, ζ/L).
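To make the mirror update (9)/(15) concrete, the following is a minimal NumPy sketch, assuming an unconstrained domain so that the proximal step (8) with the linearization ℓ(θ) = ⟨g, θ⟩ has the closed form ∇ψ*(∇ψ(θ) − αg), and a probability-simplex domain for the entropy case, where it becomes the familiar multiplicative update. It is an illustration, not the paper's implementation.

```python
import numpy as np

def mirror_step(theta, g, alpha, grad_psi, grad_psi_conj):
    """One SMD step (9) for the linear model l(th) = <g, th> on an
    unconstrained domain:
        theta' = argmin_th <g, th> + (1/alpha) * D_psi(th, theta)
               = grad_psi_conj( grad_psi(theta) - alpha * g )."""
    return grad_psi_conj(grad_psi(theta) - alpha * g)

# Example 1: psi(x) = 0.5 * ||x||_2^2  ->  D_psi is the squared Euclidean
# distance and the mirror step reduces to a plain (stochastic) gradient step.
euclid = dict(grad_psi=lambda x: x, grad_psi_conj=lambda y: y)

# Example 2: psi = negative entropy on the simplex  ->  D_psi is the KL
# divergence and the step becomes a multiplicative (exponentiated) update.
def neg_entropy_step(theta, g, alpha):
    w = theta * np.exp(-alpha * g)
    return w / w.sum()

theta = np.array([0.2, 0.5, 0.3])
g = np.array([1.0, -0.5, 0.2])
print(mirror_step(theta, g, 0.1, **euclid))   # SGD-like step
print(neg_entropy_step(theta, g, 0.1))        # KL / entropic mirror step
```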
Theorem 1 (Ghadimi et al. 2016). Under Assumption 1, consider the sequence {θ_k}_{k=1}^N generated by (15), with the output θ̃_N = θ_k drawn from the distribution (16). Let ℓ(g, u) = ⟨g, u⟩, g_k = g(τ_k|θ_k), and ∆ = J(θ*) − J(θ_1). Then

E[ ‖G^ψ_{α_k, ℓ(−g_k, θ_k)}(θ̃_N)‖²_2 ] ≤ ( ∆ + (σ²/ζ) Σ_{i=1}^N α_i ) / Σ_{i=1}^N (ζα_i − Lα²_i).   (17)

Unfortunately, the right-hand side of (17) is bounded away from zero: since ζα_i − Lα²_i ≤ ζα_i and ∆ ≥ 0,

( ∆ + (σ²/ζ) Σ_{i=1}^N α_i ) / Σ_{i=1}^N (ζα_i − Lα²_i) ≥ σ²/ζ²,   (18)

which cannot guarantee the convergence of (15), no matter how the step size α_k is specified. Thus, under Assumption 1, updating parameters according to (15) with the output drawn from (16) lacks a provable convergence guarantee.

Discussion 1 (Open Problems). Eq. (15) is a general rule that unifies many existing algorithms. If ψ(θ) = ½‖θ‖²_2, then (15) is VPG (Williams 1992). The update (15) is natural policy gradient (Kakade 2002) if we choose ψ(θ) = ½ θ^⊤ F(θ) θ, where F(θ) = E_{τ∼π_θ}[∇_θ log π_θ(s, a) ∇_θ log π_θ(s, a)^⊤] is the Fisher information matrix. If ψ is the Boltzmann–Shannon entropy, then D_ψ is the KL divergence and the update (15) reduces to relative entropy policy search (Peters et al. 2010). Despite extensive work on the above methods, existing results are scattered and fragmented in both theoretical and empirical aspects (Agarwal et al. 2020). Thus, it is of great significance to establish the fundamental theoretical convergence properties of iteration (15): what conditions guarantee the convergence of (15)? This is an open problem. From the previous discussion, intuitively, iteration (15) looks like a convergent scheme, since particular mirror maps ψ turn (15) into popular, empirically effective policy-based RL algorithms, but a complete theoretical convergence analysis of (15) is still missing.

### MPO: A Convergent Implementation

In this section, we propose a convergent mirror policy optimization (MPO) algorithm: for each step k,

θ_{k+1} = M^ψ_{α_k, ⟨−ĝ_k, θ⟩}(θ_k) = arg min_{θ∈Θ} { ⟨−ĝ_k, θ⟩ + (1/α_k) D_ψ(θ, θ_k) },   (19)

where ĝ_k is the arithmetic mean of the gradient estimates of the previous episodes {g(τ_i|θ_i)}_{i=1}^k:

ĝ_k = (1/k) Σ_{i=1}^k g(τ_i|θ_i).   (20)

Algorithm 1: MPO
1: Initialize: parameter θ_1, step sizes α_k > 0, ĝ_0 = 0, parametric policy π_θ(a|s), and mirror map ψ.
2: for k = 1 to N do
3:   Generate a trajectory τ_k = {s_t, a_t, r_{t+1}}_{t=0}^{H_{τ_k}} ∼ π_{θ_k};
     g_k = Σ_{t=0}^{H_{τ_k}} ∇_θ log π_θ(a_t|s_t) R(τ_k)|_{θ=θ_k}   (21)
     ĝ_k = (1/k) g_k + (1 − 1/k) ĝ_{k−1}   (22)
     θ_{k+1} = arg min_ω { ⟨−ĝ_k, ω⟩ + (1/α_k) D_ψ(ω, θ_k) }   (23)
4: end for
5: Output θ̃_N according to (16).

We present the details of an implementation of MPO in Algorithm 1. Eq. (22) is an incremental implementation of the average (20); thus, (22) enjoys a lower storage cost than (20). For a given episode, the gradient flow (20)/(22) of MPO differs slightly from traditional VPG, REINFORCE (Williams 1992), or DPG (Silver et al. 2014), whose gradient estimator (14) follows the current episode: (14) is a natural way to estimate ∇J(θ_t) = E[Σ_{k=0}^{H_{τ_t}} ∇_θ log π_θ(a_k|s_k) R(τ_t)] from the current trajectory alone, while our MPO uses an arithmetic mean of all previous policy gradients.

Theorem 2 (Convergence of Algorithm 1). Suppose Assumption 1 holds, and let the total trajectories be {τ_k}_{k=1}^N. Consider the sequence {θ_k}_{k=1}^N generated by Algorithm 1, with the output θ̃_N = θ_n drawn from the distribution (16). Let 0 < α_k < ζ/L, ℓ(g, u) = ⟨g, u⟩, ĝ_k = (1/k) Σ_{i=1}^k g_i, and ∆ = J(θ*) − J(θ_1), where g_i = Σ_{t=0}^{H_{τ_i}} ∇_θ log π_θ(a_t|s_t) R(τ_i)|_{θ=θ_i}. Then the output θ̃_N = θ_n satisfies

E[ ‖G^ψ_{α_n, ℓ(−ĝ_n, θ_n)}(θ_n)‖²_2 ] ≤ ( ∆ + (σ²/ζ) Σ_{k=1}^N α_k/k ) / Σ_{k=1}^N (ζα_k − Lα²_k).   (24)

For the proof, see Appendix A. Letting α_k = ζ/2L,

E[ ‖G^ψ_{α_n, ℓ(−ĝ_n, θ_n)}(θ_n)‖²_2 ] ≤ ( 4L∆ + 2σ² Σ_{k=1}^N 1/k ) / (Nζ²) = O( ln N / N ).
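Below is a compact Python sketch of Algorithm 1, assuming helper functions `sample_trajectory` and `grad_estimate` (e.g., the `vpg_estimate` sketched earlier) and a pluggable `mirror_step_fn` such as the Euclidean or entropy maps above. The mirror step is taken along −ĝ_k, matching (23), so that J increases; the constants ζ and L only enter the output rule (16).

```python
import numpy as np

def mpo(theta1, sample_trajectory, grad_estimate, mirror_step_fn,
        alphas, zeta, L, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1 (MPO): average all past policy-gradient
    estimates (Eqs. 20/22) and take a mirror step (Eq. 23)."""
    theta = theta1
    g_bar = np.zeros_like(theta1)
    iterates = []
    for k, alpha in enumerate(alphas, start=1):
        iterates.append(theta)                         # theta_k
        tau = sample_trajectory(theta)                 # tau_k ~ pi_{theta_k}
        g_k = grad_estimate(theta, tau)                # Eq. (21): estimates grad J
        g_bar = g_k / k + (1.0 - 1.0 / k) * g_bar      # Eq. (22): incremental mean
        theta = mirror_step_fn(theta, -g_bar, alpha)   # Eq. (23): ascent on J
    # Output rule (16): pick theta_k with probability proportional to
    # zeta * alpha_k - L * alpha_k**2 (uniform when all alpha_k are equal).
    w = np.array([zeta * a - L * a ** 2 for a in alphas])
    return iterates[rng.choice(len(w), p=w / w.sum())]
```

With `mirror_step_fn = lambda th, g, a: th - a * g` (the Euclidean map), each update is simply θ_{k+1} = θ_k + α_k ĝ_k, i.e., a VPG-style step driven by the running average instead of the single-episode estimate.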
## VRMPO: Variance-Reduced Mirror Policy Optimization

In this section, we propose a variance-reduced version of MPO: VRMPO. Inspired by the work of Nguyen et al. (2017a), we provide an efficiently computable policy gradient estimator; we then prove that VRMPO needs O(ϵ^{-3}) sample trajectories to achieve an ϵ-FOSP, which matches the best known sample complexity.

Methodology. For any initial θ_0, let {τ⁰_j}_{j=1}^N ∼ π_{θ_0}; we estimate the initial gradient of 𝒥(θ) = −J(θ) as

G_0 = −∇̂_N J(θ_0) := −(1/N) Σ_{j=1}^N g(τ⁰_j|θ_0).   (25)

Let θ_1 = θ_0 − αG_0. For each step k ∈ N⁺, let {τ^k_j}_{j=1}^N be the trajectories generated by π_{θ_k}; we define the policy gradient estimator G_k and the update rule as

G_k = G_{k−1} + (1/N) Σ_{j=1}^N ( −g(τ^k_j|θ_k) + g(τ^k_j|θ_{k−1}) ),   (26)

θ_{k+1} = arg min_θ { ⟨G_k, θ⟩ + (1/α) D_ψ(θ, θ_k) }.   (27)

In (26), g(τ^k_j|θ_k) and g(τ^k_j|θ_{k−1}) share the same trajectories {τ^k_j}_{j=1}^N, which plays a critical role in reducing the variance of the gradient estimator (Shen et al. 2019). Besides, unlike (20), we adopt a simple recursive formulation to construct the gradient estimator, see (26), which captures the technique of SARAH (Nguyen et al. 2017a). For each step k, the term (1/N) Σ_{j=1}^N ( −g(τ^k_j|θ_k) + g(τ^k_j|θ_{k−1}) ) can be viewed as an additional correction injected into the policy gradient estimate; much practice shows that a gradient estimator with such a correction enjoys lower variance and speeds up convergence (Reddi et al. 2016). More details are shown in Algorithm 2.

Algorithm 2: VRMPO
1: Initialize: policy π_θ(a|s) with parameter θ̃_0, mirror map ψ, step size α > 0, epoch sizes K, m.
2: for k = 1 to K do
3:   θ_{k,0} = θ̃_{k−1}; generate T_k = {τ_i}_{i=1}^{N_1} ∼ π_{θ_{k,0}}
4:   θ_{k,1} = θ_{k,0} − αG_{k,0}, where G_{k,0} = −∇̂_{N_1}J(θ_{k,0}) = −(1/N_1) Σ_{i=1}^{N_1} g(τ_i|θ_{k,0})
5:   for t = 1 to m − 1 do
6:     Generate {τ_j}_{j=1}^{N_2} ∼ π_{θ_{k,t}};
       G_{k,t} = G_{k,t−1} + (1/N_2) Σ_{j=1}^{N_2} ( −g(τ_j|θ_{k,t}) + g(τ_j|θ_{k,t−1}) )   (28)
       θ_{k,t+1} = arg min_ω { ⟨G_{k,t}, ω⟩ + (1/α) D_ψ(ω, θ_{k,t}) }   (29)
7:   end for
8:   θ̃_k = θ_{k,t}, with t chosen uniformly at random from {0, 1, …, m}.
9: end for
10: Output: θ̃_K.

Theorem 3 (Convergence Analysis). Consider {θ̃_k}_{k=1}^K generated by Algorithm 2 under Assumption 1, and let ζ > 5/32. For any positive scalar ϵ, choose the outer-loop batch size

N_1 = ( 1/(8Lζ²) + 1/(2(ζ − 5/32)) ) (1 + 1/(32ζ²)) σ²/ϵ²,

the inner-loop length and batch size

m − 1 = N_2 = √( ( 1/(8Lζ²) + 1/(2(ζ − 5/32)) ) (1 + 1/(32ζ²)) ) σ/ϵ,

the number of outer iterations

K = 8L( E[𝒥(θ̃_0)] − 𝒥(θ*) ) (1 + 1/(16ζ²)) / ( √( (1/(8Lζ²) + 1/(2(ζ − 5/32))) (1 + 1/(32ζ²)) ) (ζ − 5/32) ϵ ),

and the step size α = 1/(4L). Then Algorithm 2 outputs θ̃_K satisfying

E[ ‖G^ψ_{α, ⟨∇𝒥(θ̃_K), θ⟩}(θ̃_K)‖ ] ≤ ϵ.   (30)

For its proof, see Appendix C. Theorem 3 implies that VRMPO needs

K(N_1 + (m − 1)N_2) = O(ϵ^{-3})

random trajectories to achieve an ϵ-FOSP. As far as we know, VRMPO matches the best sample complexity, shared with HAPG (Shen et al. 2019) and SRVR-PG (Xu et al. 2020; Xu 2021). In fact, according to Shen et al. (2019), REINFORCE needs O(ϵ^{-4}) random trajectories to achieve an ϵ-FOSP, and no provable improvement on its complexity has been made so far; the same order of sample complexity for REINFORCE is shown by Xu et al. (2019). With the additional assumptions Var[Π_{h=0}^H π_{θ_0}(a_h|s_h)/π_{θ_t}(a_h|s_h)] < +∞ and Var[g(τ|θ)] < +∞, Papini et al. (2018) show that SVRPG achieves a sample complexity of O(ϵ^{-4}). Later, under the same assumptions as Papini et al. (2018), Xu et al. (2019) reduce the sample complexity of SVRPG to O(ϵ^{-10/3}). We summarize these results in Table 1.
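To complement Algorithm 2, here is a compact Python sketch of VRMPO, written in the ascent convention: the estimator tracks ∇J and the mirror step is taken along its negation, which is the same step as (27), where G_k tracks ∇𝒥 = −∇J. The helpers `sample_trajectories`, `grad_estimate`, and `mirror_step_fn` are the assumed interfaces from the earlier sketches, not the paper's code.

```python
import numpy as np

def vrmpo(theta0, sample_trajectories, grad_estimate, mirror_step_fn,
          alpha, K, m, N1, N2, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 2 (VRMPO). Each outer epoch re-anchors the
    estimator with a large batch (Eq. 25); the inner loop refreshes it with
    the SARAH-style recursion (Eqs. 26/28), evaluating the SAME trajectories
    at the current and previous parameters, then takes a mirror step (27)."""
    theta_tilde = theta0
    for _ in range(K):
        theta_prev = theta_tilde                               # theta_{k,0}
        batch = sample_trajectories(theta_prev, N1)
        G = np.mean([grad_estimate(theta_prev, tau) for tau in batch], axis=0)
        theta = mirror_step_fn(theta_prev, -G, alpha)          # theta_{k,1}, ascent on J
        inner = [theta_prev, theta]
        for _ in range(m - 1):
            batch = sample_trajectories(theta, N2)
            # variance-reduced correction: same batch, two parameter values
            G = G + np.mean([grad_estimate(theta, tau) - grad_estimate(theta_prev, tau)
                             for tau in batch], axis=0)
            theta_prev, theta = theta, mirror_step_fn(theta, -G, alpha)
            inner.append(theta)
        theta_tilde = inner[rng.integers(len(inner))]          # uniform over {0, ..., m}
    return theta_tilde
```

Note the contrast with MPO: instead of averaging all historical estimates, the correction term re-evaluates only the freshest trajectories at two nearby parameters, which is what keeps the variance low without extra storage.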
Remark 2. It is remarkable that, although our VRMPO shares its sample complexity with HAPG, SRVR-PG, and VR-BGPO (Huang et al. 2021), it differs from them in at least three aspects. Firstly, Shen et al. (2019) derive HAPG from the Hessian information of the policy, while our VRMPO uses a simple recursive formulation to construct the gradient estimator. Secondly, if the mirror map ψ is reduced to the ℓ_2-norm, then VRMPO is exactly SRVR-PG, i.e., VRMPO unifies SRVR-PG; from Table 1, we see that VRMPO needs fewer conditions than Xu et al. (2020) to achieve the same sample complexity. Finally, Shen et al. (2019), Xu et al. (2020), and Huang et al. (2021) only provide an off-line (i.e., Monte Carlo) policy gradient estimator, which is limited in complex domains. We provide an on-line version of VRMPO and discuss some practical tricks for applying it to complex domains; please see the experiments on the MuJoCo tasks and Appendix E.1.

| Algorithm | Conditions | Complexity |
|---|---|---|
| VPG, REINFORCE | Assumption 1; Var[g(τ\|θ)] < +∞ | O(ϵ^{-4}) |
| TRPO (Shani et al. 2020) | Assumption 1 | O(ϵ^{-4}) |
| TRPO (Liu et al. 2019) | Assumption 1 | O(ϵ^{-8}) |
| SVRPG (Papini et al. 2018) | Assumption 1; Var[ρ_t] < +∞; Var[g(τ\|θ)] < +∞ | O(ϵ^{-4}) |
| SVRPG (Xu et al. 2019) | Assumption 1; Var[ρ_t] < +∞; Var[g(τ\|θ)] < +∞ | O(ϵ^{-10/3}) |
| HAPG (Shen et al. 2019) | Assumption 1 | O(ϵ^{-3}) |
| SRVR-PG (Xu et al. 2020; Xu 2021) | Assumption 1; Var[ρ_t] < +∞; Var[g(τ\|θ)] < +∞ | O(ϵ^{-3}) |
| VR-BGPO (Huang et al. 2021) | Assumption 1; Var[ρ_t] < +∞; Var[g(τ\|θ)] < +∞ | O(ϵ^{-3}) |
| VRMPO (our work) | Assumption 1 | O(ϵ^{-3}) |

Table 1: Comparison of the sample complexity required to achieve ‖∇J(θ)‖ ≤ ϵ. If ψ(θ) = ½‖θ‖²_2, the result (30) of our VRMPO is also measured by the gradient norm. Besides, ρ_t := Π_{i=0}^H π_{θ_0}(a_i|s_i)/π_{θ_t}(a_i|s_i).

## Related Works

Stochastic Variance-Reduced Gradient in RL. To the best of our knowledge, Du et al. (2017) first introduce SVRG (Johnson and Zhang 2013) to off-policy evaluation (Yang et al. 2018). Du et al. (2017) transform the empirical policy evaluation problem into a convex-concave saddle-point problem, and then solve it via SVRG straightforwardly. Later, to improve sample efficiency for complex RL, Xu et al. (2017) combine SVRG with TRPO (Schulman et al. 2015). Similarly, Yuan et al. (2019) introduce SARAH (Nguyen et al. 2017a) into TRPO to improve sample efficiency. However, the results presented by Xu et al. (2017) and Yuan et al. (2019) are empirical and lack a strong theoretical analysis. Metelli et al. (2018) present a surrogate objective function with Rényi divergence (Rényi 1961) to reduce the variance. Recently, Papini et al. (2018) propose a stochastic variance-reduced version of policy gradient (SVRPG), defining the gradient estimator via importance sampling:

G̃_k = G̃_{k−1} + (1/N) Σ_j ( −g(τ^k_j|θ_k) + Π_i [π_{θ_0}(a_i|s_i)/π_{θ_k}(a_i|s_i)] g(τ^k_j|θ_{k−1}) ),

where G̃_{k−1} is an unbiased estimator based on the trajectories generated by π_{θ_{k−1}}. Although SVRPG is practical empirically, its gradient estimate depends heavily on importance sampling, which partially reduces the effectiveness of the variance reduction. Later, Shen et al. (2019) remove the importance-sampling term and construct a Hessian-aided policy gradient. Our VRMPO is different from Du et al. (2017), Xu et al. (2017), and Papini et al. (2018): it admits a stochastic recursive iteration to estimate the policy gradient. VRMPO exploits fresh information to improve convergence and reduce variance. Besides, VRMPO reduces the storage cost, since it does not require storing the complete historical information.

Figure 1: Convergence comparison between our MPO algorithm and REINFORCE/VPG on the SASC domain (x-axis: episode; y-axis: total reward on the episode; the optimal value G_0 ≈ −11.6).
Baseline Methods. A baseline (also known as a control variate) is a widely used technique to reduce variance (Weaver and Tao 2001; Greensmith et al. 2004). For example, A2C (Sutton and Barto 1998; Mnih et al. 2016) introduces the value function as a baseline, Wu et al. (2018) consider an action-dependent baseline, and Liu et al. (2018) use Stein's identity (Stein 1986) as a baseline. Q-Prop (Gu et al. 2017) makes use of both a linearly dependent baseline and GAE (Schulman et al. 2016) to reduce variance. Cheng et al. (2019) present a predictor-corrector framework that transforms a first-order model-free algorithm into a new hybrid method leveraging predictive models to accelerate policy learning. Mao et al. (2019) derive a bias-free, input-dependent baseline to reduce variance, and analytically show its benefits over state-dependent baselines. Recently, Grathwohl et al. (2018) and Cheng et al. (2019) provide a standard explanation for the benefits of such baseline-based approaches. However, the capacity of all the above methods is limited by their choice of baseline function (Liu et al. 2018); in practice, it is troublesome to design a proper baseline function to reduce the variance of the policy gradient estimate. Our VRMPO avoids the selection of a baseline function, and it uses the current trajectories to construct a novel, efficiently computable gradient estimator to reduce variance and improve sample efficiency.

## Experiments

Our experiments cover the following three aspects:
- We provide a numerical analysis of MPO and compare the convergence rate of MPO with REINFORCE and VPG on the Short Corridor with Switched Actions (SASC) domain (Sutton and Barto 2018).
- We provide a better understanding of how the mirror map affects the performance of VRMPO.
- To demonstrate the stability and efficiency of VRMPO on the MuJoCo continuous control tasks, we provide a comprehensive comparison with state-of-the-art policy optimization algorithms.

Figure 2: Comparison of the empirical performance of VRMPO with different mirror maps (ψ = ℓ_2-norm, i.e., SRVR-PG; ψ = ℓ_4-norm; ψ = ℓ_5-norm; and the best p) against REINFORCE; panels include CartPole and MountainCar (x-axis: episode; y-axis: average return).

### Numerical Analysis of MPO

SASC domain (see Appendix B): the task is to estimate the optimal value of state s_1, V*(s_1) = G_0 ≈ −11.6. Let φ(s, right) = [1, 0]^⊤ and φ(s, left) = [0, 1]^⊤ for all s ∈ S, and let L_θ(s, a) = φ^⊤(s, a) θ for (s, a) ∈ S × A, where A = {right, left}. π_θ(a|s) is the soft-max distribution π_θ(a|s) = exp{L_θ(s, a)} / Σ_{a′∈A} exp{L_θ(s, a′)}. The initial parameter θ_0 ∼ U[−0.5, 0.5], where U is the uniform distribution.

Before reporting the results, it is necessary to explain why we only compare MPO with VPG and REINFORCE. VPG/REINFORCE are among the most fundamental policy gradient methods in RL, and many modern policy-based algorithms are derived from them. Our MPO is a new policy gradient algorithm for learning the parameter, so it is natural to compare it with VPG and REINFORCE. The result in Figure 1 shows that MPO converges significantly faster than both REINFORCE and VPG.

### Effect of the Mirror Map on VRMPO

If ψ(·) is the ℓ_p-norm map, then its conjugate is the ℓ_q-norm map ψ*(y) = (Σ_{i=1}^n |y_i|^q)^{1/q}, where y = (y_1, y_2, …, y_n)^⊤, 1/p + 1/q = 1, and p, q > 1. According to Beck and Teboulle (2003), iteration (27) is then equivalent to

θ_{k+1} = ∇ψ*( ∇ψ(θ_k) − αG_k ),

where [∇ψ(x)]_j = sign(x_j)|x_j|^{p−1}/‖x‖_p^{p−2}, [∇ψ*(y)]_j = sign(y_j)|y_j|^{q−1}/‖y‖_q^{q−2}, and j indexes the coordinates of ∇ψ and ∇ψ*.
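The following small NumPy sketch transcribes the coordinate formulas above into an ℓ_p mirror step; the zero-vector guard is an addition of the sketch, and p = 2 recovers the Euclidean update used by VPG and SRVR-PG.

```python
import numpy as np

def grad_psi_lp(x, p):
    """[grad psi(x)]_j = sign(x_j) * |x_j|**(p-1) / ||x||_p**(p-2)."""
    norm = np.linalg.norm(x, ord=p)
    if norm == 0.0:                       # guard added for the sketch
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (p - 1) / norm ** (p - 2)

def lp_mirror_step(theta, G, alpha, p):
    """theta_{k+1} = grad psi*( grad psi(theta_k) - alpha * G_k ),
    with the conjugate exponent q satisfying 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    return grad_psi_lp(grad_psi_lp(theta, p) - alpha * G, q)

# p = 2 gives theta - alpha * G; p in {3, 4, 5} gives the non-Euclidean
# variants compared in Figure 2.
theta = np.array([0.3, -0.2, 0.8])
G = np.array([0.1, 0.4, -0.2])
print(lp_mirror_step(theta, G, alpha=0.05, p=2))
print(lp_mirror_step(theta, G, alpha=0.05, p=4))
```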
For a fair comparison, we use the same random seed for each domain. The hyper-parameter p runs over the set [P] = {1.1, 1.2, …, 1.9, 2, 3, 4, 5}. For the non-Euclidean case, we only show the results for p = 3, 4, 5 in Figure 2, and "best" denotes the p ∈ [P] that achieves the best performance over [P]. We use a two-layer feedforward neural network with 200 and 100 hidden nodes, respectively, with rectified linear unit (ReLU) activations between layers. We use the discount factor γ = 0.99, and the step size α is chosen by a grid search over the set {0.01, 0.02, 0.04, 0.08, 0.1}.

The results in Figure 2 show that the best performance is produced by a non-Euclidean distance (p ≠ 2), not the Euclidean distance (p = 2). Traditional policy gradient methods such as REINFORCE, VPG, and DPG all update parameters under the Euclidean distance; this experiment suggests that one can build better algorithms from existing approaches via non-Euclidean distances. Additionally, Figure 2 shows that VRMPO converges faster than REINFORCE, i.e., VRMPO needs fewer sampled trajectories to reach a convergent state, which supports the complexity analysis in Table 1. Although SRVR-PG achieves the same sample complexity as VRMPO, Figure 2 also shows that VRMPO converges faster than SRVR-PG.

### Evaluating VRMPO on Continuous Control Tasks

It is noteworthy that the policy gradient estimator (26) of VRMPO is an off-line (Monte Carlo) estimator, like REINFORCE. As pointed out by Sutton and Barto (2018), REINFORCE converges asymptotically to a local optimum, but, like all off-line methods, it is inconvenient for continuous control tasks and limited in its application to complex domains. The same could happen to VRMPO. We therefore introduce some practical tricks for an on-line implementation of VRMPO; the complete update rule of on-line VRMPO is provided in Algorithm 3 (Appendix E.1).

Details of the Implementation. Firstly, we extend Algorithm 2 to an actor-critic structure, i.e., we introduce a critic into Algorithm 2. Concretely, for each step t, we construct a critic network Q_ω(s, a) with parameter ω, sample N transitions {(s_i, a_i, r_{i+1}, s_{i+1}, a_{i+1})}_{i=1}^N from a data memory D, and learn ω by minimizing the critic loss

ℓ(ω) = (1/N) Σ_{i=1}^N ( r_{i+1} + γ Q_{ω_{k−1}}(s_{i+1}, a_{i+1}) − Q_ω(s_i, a_i) )².   (31)

For more details, see Lines 17–20 of Algorithm 3. Then, for each pair (s, a) ∈ D, we use the actor loss L_θ(s, a) = −log π_θ(a|s) Q_{ω_{k−1}}(s, a) in place of 𝒥(θ) to learn the parameter θ. For more details, see Lines 9–16 of Algorithm 3 (Appendix E.1).
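To make the actor-critic extension concrete, here is a minimal PyTorch sketch of the critic loss (31) and the surrogate actor loss, assuming the minimization convention (𝒥 = −J). The network sizes, the transition layout (s, a, r, s′, a′), and the use of the stored next action in the bootstrap target are assumptions of this sketch; the paper's full on-line procedure is Algorithm 3 in Appendix E.1.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """A small critic Q_omega(s, a); the architecture is only illustrative."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def critic_loss(critic, prev_critic, batch, gamma=0.99):
    """TD-style loss of Eq. (31): the previous critic Q_{omega_{k-1}} is
    held fixed as the bootstrap target."""
    s, a, r, s_next, a_next = batch
    with torch.no_grad():
        target = r + gamma * prev_critic(s_next, a_next)
    return ((target - critic(s, a)) ** 2).mean()

def actor_loss(log_prob, q_value):
    """Surrogate in place of J(theta): L_theta(s,a) = -log pi_theta(a|s) * Q_{omega_{k-1}}(s,a)."""
    return -(log_prob * q_value.detach()).mean()
```

In this sketch, one optimizer step on `critic_loss` plays the role of the critic update (Lines 17–20 of Algorithm 3), and the gradient of `actor_loss` with respect to θ replaces g(τ|θ) inside the VRMPO estimator (26).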
Score Performance Comparison. From the results in Figure 3 and Table 2, VRMPO overall outperforms the baseline algorithms in both final performance and the learning process. Our VRMPO also learns considerably faster, with better performance, than the popular TD3 on the Walker2d, HalfCheetah, Hopper, InvertedDoublePendulum (IDP), and Reacher domains. On the InvertedDoublePendulum task, our VRMPO has only a small advantage over the other algorithms; the InvertedPendulum task is relatively easy, and the advantage of VRMPO becomes more pronounced as the tasks become more difficult. It is worth noticing that on the HalfCheetah domain, VRMPO achieves a significant max-average score of 16000+, far exceeding the second-best score of 11781.

Figure 3: Learning curves for the continuous control tasks (x-axis: epoch; y-axis: average return): (a) Walker2d-v2, (b) HalfCheetah-v2, (c) Reacher-v2, (d) Hopper-v2, (e) InvertedDoublePendulum-v2, (f) InvertedPendulum-v2; compared algorithms: VRMPO, DDPG, PPO, TD3, TRPO. The shaded region represents the standard deviation of the score over the best three trials. Curves are smoothed uniformly for visual clarity.

Stability. The stability of an algorithm is also an important topic in RL. Although DDPG exploits off-policy samples, which promotes its efficiency in stable environments, it is unstable on the Reacher task, while our VRMPO learns significantly faster with lower variance. DDPG fails to make any progress on the InvertedDoublePendulum domain, which is corroborated by Dai et al. (2018). Although TD3 takes the minimum of a pair of critics to limit overestimation, its learning fluctuates severely in the InvertedDoublePendulum environment. In contrast, our VRMPO is consistently reliable and effective across tasks.

Variance Comparison. As we can see from Figure 3, VRMPO converges with considerably low variance on Hopper, InvertedDoublePendulum, and Reacher. Although the asymptotic variance of VRMPO is slightly larger than that of the other algorithms on HalfCheetah, its final performance significantly outperforms all the baselines. The results in Figure 3 also imply that a properly constructed gradient estimator not only reduces the variance of the score during learning but also speeds up the convergence of training.

| Environment | VRMPO | TD3 | DDPG | PPO | TRPO |
|---|---|---|---|---|---|
| Walker2d | 5251.83 | 4887.85 | **5795.13** | 3905.99 | 3636.59 |
| HalfCheetah | **16095.51** | 11781.07 | 8616.29 | 3542.60 | 3325.23 |
| Reacher | -0.49 | -1.47 | -1.55 | **-0.44** | -0.66 |
| Hopper | **3751.43** | 3482.06 | 3558.69 | 3609.65 | 3578.06 |
| IDP | **9359.82** | 9248.27 | 6958.42 | 9045.86 | 9151.56 |
| InvertedPendulum | **1000.00** | **1000.00** | 907.81 | **1000.00** | **1000.00** |

Table 2: Max-average return over the final 50 epochs, where we run 5000 iterations per epoch. The maximum value for each task is bolded.

## Conclusion

In this paper, we analyze the theoretical dilemma of applying SMD to policy optimization. We then propose the sample-efficient algorithm VRMPO and prove that its sample complexity is O(ϵ^{-3}), which, to the best of our knowledge, matches the best sample complexity so far. Finally, we conduct extensive experiments showing that our algorithm outperforms state-of-the-art policy gradient methods.

## Acknowledgements

This work is supported by the China Brain Project (2021ZD0200400), the Natural Science Foundation of China (No. 61925603), Zhejiang Lab (No. 2018EB0ZX01), and the Ten Thousand Talent Program of Zhejiang Province (No. 2018R52039).
## References

Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2020. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes. COLT.
Bauschke, H. H.; Combettes, P. L.; et al. 2011. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408.
Beck, A.; and Teboulle, M. 2003. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3): 167–175.
Bertsekas, D. P. 2009. Convex Optimization Theory. Athena Scientific, Belmont.
Cheng, C.-A.; Yan, X.; and Boots, B. 2019. Trajectory-wise Control Variates for Variance Reduction in Policy Gradient Methods. ICRA.
Cheng, C.-A.; Yan, X.; Ratliff, N.; and Boots, B. 2019. Predictor-Corrector Policy Optimization. In ICML.
Dai, B.; Shaw, A.; Li, L.; Xiao, L.; He, N.; Liu, Z.; Chen, J.; and Song, L. 2018. SBEED: Convergent reinforcement learning with nonlinear function approximation. ICML.
Du, S. S.; Chen, J.; Li, L.; Xiao, L.; and Zhou, D. 2017. Stochastic variance reduction methods for policy evaluation. In ICML.
Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; and Abbeel, P. 2016. Benchmarking deep reinforcement learning for continuous control. In ICML.
Ghadimi, S.; Lan, G.; and Zhang, H. 2016. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2): 267–305.
Grathwohl, W.; Choi, D.; Wu, Y.; Roeder, G.; and Duvenaud, D. 2018. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. ICLR.
Greensmith, E.; Bartlett, P. L.; Baxter, J.; et al. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. JMLR, 5(Nov): 1471–1530.
Gu, S.; Lillicrap, T.; Ghahramani, Z.; Turner, R. E.; and Levine, S. 2017. Q-Prop: Sample-efficient policy gradient with an off-policy critic. ICLR.
Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In ICML.
Huang, F.; Gao, S.; Huang, H.; et al. 2021. Bregman gradient policy optimization. arXiv preprint arXiv:2106.12112.
Johnson, R.; and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In NeurIPS, 315–323.
Kakade, S. M. 2002. A natural policy gradient. In NeurIPS, 1531–1538.
Lei, Y.; and Tang, K. 2018. Stochastic composite mirror descent: Optimal bounds with high probabilities. In NeurIPS, 1519–1529.
Liu, B.; Cai, Q.; Yang, Z.; and Wang, Z. 2019. Neural proximal/trust region policy optimization attains globally optimal policy. NeurIPS.
Liu, H.; Feng, Y.; Mao, Y.; Zhou, D.; Peng, J.; and Liu, Q. 2018. Action-dependent control variates for policy optimization via Stein identity. ICLR.
Mao, H.; Venkatakrishnan, S. B.; Schwarzkopf, M.; and Alizadeh, M. 2019. Variance reduction for reinforcement learning in input-driven environments. ICLR.
Metelli, A. M.; Papini, M.; Faccio, F.; and Restelli, M. 2018. Policy optimization via importance sampling. In NeurIPS, 5442–5454.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In ICML, 1928–1937.
Nemirovskij, A. S.; and Yudin, D. B. 1983. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience.
Nguyen, L. M.; Liu, J.; Scheinberg, K.; and Takáč, M. 2017a.
SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML.
Papini, M.; Binaghi, D.; Canonaco, G.; Pirotta, M.; and Restelli, M. 2018. Stochastic Variance-Reduced Policy Gradient. In ICML.
Peters, J.; Mülling, K.; and Altun, Y. 2010. Relative Entropy Policy Search. In AAAI, 1607–1612.
Reddi, S. J.; Hefny, A.; Sra, S.; Poczos, B.; and Smola, A. 2016. Stochastic variance reduction for nonconvex optimization. In ICML, 314–323.
Rényi, A. 1961. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics.
Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In ICML, 1889–1897.
Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2016. High-dimensional continuous control using generalized advantage estimation. ICLR.
Shani, L.; Efroni, Y.; Mannor, S.; et al. 2020. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In AAAI, volume 34, 5668–5675.
Shen, Z.; Ribeiro, A.; Hassani, H.; Qian, H.; and Mi, C. 2019. Hessian Aided Policy Gradient. In ICML, 5729–5738.
Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587): 484–489.
Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In ICML.
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature, 550(7676): 354.
Stein, C. 1986. Approximate computation of expectations. Lecture Notes–Monograph Series, 7: i–164.
Sutton, R. S.; and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 1057–1063.
Weaver, L.; and Tao, N. 2001. The optimal reward baseline for gradient-based reinforcement learning. In UAI, 538–545. Morgan Kaufmann Publishers Inc.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4): 229–256.
Wu, C.; Rajeswaran, A.; Duan, Y.; Kumar, V.; Bayen, A. M.; Kakade, S.; Mordatch, I.; and Abbeel, P. 2018. Variance reduction for policy gradient with action-dependent factorized baselines. ICLR.
Xing, D.; Liu, Q.; Zheng, Q.; and Pan, G. 2021. Learning with Generated Teammates to Achieve Type-Free Ad-Hoc Teamwork. In IJCAI.
Xu, P. 2021. Sample-Efficient Nonconvex Optimization Algorithms in Machine Learning and Reinforcement Learning. Ph.D. thesis, UCLA.
Xu, P.; Gao, F.; Gu, Q.; et al. 2019. An Improved Convergence Analysis of Stochastic Variance-Reduced Policy Gradient. UAI.
Xu, P.; Gao, F.; Gu, Q.; et al. 2020. Sample efficient policy gradient methods with recursive variance reduction. ICLR.
Xu, T.; Liu, Q.; and Peng, J. 2017. Stochastic Variance Reduction for Policy Gradient Estimation. arXiv preprint arXiv:1710.06034.
Yang, L.; Ji, J.; Dai, J.; Zhang, Y.; Li, P.; and Pan, G. 2022.
CUP: A Conservative Update Policy Algorithm for Safe Reinforcement Learning. arXiv preprint arXiv:2202.07565.
Yang, L.; Shi, M.; Zheng, Q.; Meng, W.; and Pan, G. 2018. A unified approach for multi-step temporal-difference learning with eligibility traces in reinforcement learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2984–2990.
Yang, L.; Zheng, Q.; and Pan, G. 2021. Sample complexity of policy gradient finding second-order stationary points. In AAAI.
Yuan, H.; Li, C. J.; Tang, Y.; and Zhou, Y. 2019. Policy optimization via stochastic recursive gradient algorithm. https://openreview.net/forum?id=rJl3S2A9t7.