# Bregman Gradient Policy Optimization

Published as a conference paper at ICLR 2022

Feihu Huang, Shangqian Gao, Heng Huang
Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA
huangfeihu2018@gmail.com, shg84@pitt.edu, heng.huang@pitt.edu
(Feihu and Shangqian contributed equally. Corresponding authors.)

ABSTRACT

In this paper, we design a novel Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques. Specifically, we propose a Bregman gradient policy optimization (BGPO) algorithm based on the basic momentum technique and mirror descent iteration. We further propose an accelerated Bregman gradient policy optimization (VR-BGPO) algorithm based on a variance-reduced technique. Moreover, we provide a convergence analysis framework for Bregman gradient policy optimization under the nonconvex setting. We prove that BGPO achieves a sample complexity of $O(\epsilon^{-4})$ for finding an $\epsilon$-stationary policy while requiring only one trajectory per iteration, and that VR-BGPO reaches the best known sample complexity of $O(\epsilon^{-3})$, also requiring only one trajectory per iteration. In particular, by using different Bregman divergences, our BGPO framework unifies many existing policy optimization algorithms, such as the (variance-reduced) policy gradient algorithms and the natural policy gradient algorithm. Extensive experimental results on multiple reinforcement learning tasks demonstrate the efficiency of our new algorithms.

1 INTRODUCTION

Policy Gradient (PG) methods are a class of popular policy optimization methods for Reinforcement Learning (RL), and have achieved significant successes in many challenging applications (Li, 2017) such as robot manipulation (Deisenroth et al., 2013), the game of Go (Silver et al., 2017), and autonomous driving (Shalev-Shwartz et al., 2016). In general, PG methods directly search for the optimal policy by maximizing the expected total reward of the Markov Decision Process (MDP) underlying the RL problem, in which an agent takes actions dictated by a policy in an unknown dynamic environment over a sequence of time steps. Since PGs are generally estimated by Monte-Carlo sampling, vanilla PG methods usually suffer from very high variance, resulting in slow convergence and instability. Thus, many fast PG methods have recently been proposed to reduce the variance of vanilla stochastic PGs. For example, Sutton et al. (2000) introduced a baseline to reduce the variance of the stochastic PG. Konda & Tsitsiklis (2000) proposed an efficient actor-critic algorithm that estimates the value function to mitigate the effect of large variances. Schulman et al. (2016) proposed generalized advantage estimation (GAE) to control both the bias and the variance of the policy gradient. More recently, faster variance-reduced PG methods (Papini et al., 2018; Xu et al., 2019a; Shen et al., 2019; Liu et al., 2020; Huang et al., 2020) have been developed based on variance-reduction techniques from stochastic optimization. Alternatively, some successful PG algorithms (Schulman et al., 2015; 2017) improve the convergence rate and robustness of vanilla PG methods by using penalties such as the Kullback-Leibler (KL) divergence penalty.
For example, trust-region policy optimization (TRPO) (Schulman et al., 2015) ensures that the newly selected policy is close to the old one by using a KL-divergence constraint, while proximal policy optimization (PPO) (Schulman et al., 2017) clips the weighted likelihood ratio to implicitly reach the same goal. Subsequently, Shani et al. (2020) analyzed the global convergence properties of TRPO in tabular RL based on the convex mirror descent algorithm. Liu et al. (2019) also studied the global convergence properties of PPO and TRPO equipped with overparameterized neural networks based on mirror descent iterations. At the same time, Yang et al. (2019) proposed PG methods based on the mirror descent algorithm. More recently, mirror descent policy optimization (MDPO) (Tomar et al., 2020) iteratively updates the policy beyond tabular RL by approximately solving a trust region problem based on the convex mirror descent algorithm. In addition, Agarwal et al. (2019); Cen et al. (2020) studied natural PG methods for regularized RL. However, Agarwal et al. (2019) mainly focus on tabular policies and log-linear or neural policy classes, and Cen et al. (2020) mainly focus on the softmax policy class.

Table 1: Sample complexities of representative PG algorithms based on the mirror descent algorithm for finding an $\epsilon$-stationary policy of the nonconcave performance function. Although Liu et al. (2019); Shani et al. (2020) provide global convergence of TRPO and PPO under some specific policies based on convex mirror descent, they still obtain a stationary point of the nonconcave performance function. Note that our convergence analysis does not rely on any specific policy.

| Algorithm | Reference | Complexity | Batch Size |
| --- | --- | --- | --- |
| TRPO | Shani et al. (2020) | $O(\epsilon^{-4})$ | $O(\epsilon^{-2})$ |
| Regularized TRPO | Shani et al. (2020) | $O(\epsilon^{-3})$ | $O(\epsilon^{-2})$ |
| TRPO/PPO | Liu et al. (2019) | $O(\epsilon^{-8})$ | $O(\epsilon^{-6})$ |
| VRMPO | Yang et al. (2019) | $O(\epsilon^{-3})$ | $O(\epsilon^{-2})$ |
| MDPO | Tomar et al. (2020) | Unknown | Unknown |
| BGPO | Ours | $O(\epsilon^{-4})$ | $O(1)$ |
| VR-BGPO | Ours | $O(\epsilon^{-3})$ | $O(1)$ |

Although these specific mirror-descent-based PG methods have recently been studied, scattered across empirical and theoretical works, a universal framework for such PG methods that does not rely on specific RL tasks is still missing. In particular, there is still no convergence analysis of PG methods based on the mirror descent algorithm under the nonconvex setting. Since mirror descent iterations adjust gradient updates to fit the problem geometry and are useful in regularized RL (Geist et al., 2019), an important problem remains to be addressed:

*Could we design a universal policy optimization framework based on the mirror descent algorithm, and provide its convergence guarantee under the non-convex setting?*

In this paper, we answer the above challenging question affirmatively and propose an efficient Bregman gradient policy optimization framework based on Bregman divergences and momentum techniques. In particular, we provide a convergence analysis framework for PG methods based on mirror descent iteration under the nonconvex setting. In summary, our main contributions are as follows:

a) We propose an effective Bregman gradient policy optimization (BGPO) algorithm based on the basic momentum technique, which achieves a sample complexity of $O(\epsilon^{-4})$ for finding an $\epsilon$-stationary policy while requiring only one trajectory at each iteration.
b) We propose an accelerated Bregman gradient policy optimization (VR-BGPO) algorithm based on the variance-reduced technique of STORM (Cutkosky & Orabona, 2019). Moreover, we prove that VR-BGPO reaches the best known sample complexity of $O(\epsilon^{-3})$.

c) We design a unified policy optimization framework based on mirror descent iteration and momentum techniques, and provide its convergence analysis under the nonconvex setting.

Table 1 summarizes the sample complexities of representative PG algorithms based on the mirror descent algorithm. Shani et al. (2020); Liu et al. (2019) established global convergence of mirror descent variants of PG under pre-specified settings, such as over-parameterized networks (Liu et al., 2019), by exploiting the hidden convex structure of these specific problems. Without such special structure, global convergence of these methods cannot be achieved. In contrast, our framework does not rely on any specific policy class, and our convergence analysis builds only on the general nonconvex setting; accordingly, we only prove that our methods converge to stationary points. Geist et al. (2019); Jin & Sidford (2020); Lan (2021); Zhan et al. (2021) studied a general theory of regularized MDPs over the policy space, e.g., a discrete probability space that is generally discontinuous. Since the state and action spaces $\mathcal{S}$ and $\mathcal{A}$ are usually very large in practice, the policy space is also large, whereas our methods work in the policy parameter space, which is generally a continuous Euclidean space and relatively small. Hence, our methods and theoretical results are more practical than the results in (Geist et al., 2019; Jin & Sidford, 2020; Lan, 2021; Zhan et al., 2021). Tomar et al. (2020) also propose a mirror descent PG framework over the policy parameter space, but they do not provide any theoretical results and only consider the Bregman divergence in the form of the KL divergence, whereas our framework can work with any form of Bregman divergence.

2 RELATED WORKS

In this section, we review related work on mirror-descent-based algorithms in RL and on variance-reduced PG methods, respectively.

2.1 MIRROR DESCENT ALGORITHM IN RL

Because it easily handles regularization terms, the mirror descent (a.k.a. Bregman gradient) algorithm (Censor & Zenios, 1992; Beck & Teboulle, 2003) has shown significant success in regularized RL; it was first proposed in (Censor & Zenios, 1992) based on the Bregman distance (divergence) (Bregman, 1967; Censor & Lent, 1981). For example, Neu et al. (2017) showed that both the dynamic policy programming (Azar et al., 2012) and TRPO (Schulman et al., 2015) algorithms are approximate variants of the mirror descent algorithm. Subsequently, Geist et al. (2019) introduced a general theory of regularized MDPs based on the convex mirror descent algorithm. More recently, Liu et al. (2019) studied the global convergence properties of PPO and TRPO equipped with overparameterized neural networks based on mirror descent iterations. At the same time, Shani et al. (2020) analyzed the global convergence properties of TRPO for tabular policies based on the convex mirror descent algorithm. Wang et al. (2019) proposed divergence-augmented policy optimization for off-policy learning based on the mirror descent algorithm. MDPO (Tomar et al., 2020) iteratively updates the policy beyond tabular RL by approximately solving a trust region problem based on the convex mirror descent algorithm.
2.2 (VARIANCE-REDUCED) PG METHODS

PG methods have been widely studied due to their stability and incremental nature in policy optimization. For example, the global convergence properties of the vanilla policy gradient method in infinite-horizon MDPs were recently studied in (Zhang et al., 2019). Subsequently, Zhang et al. (2020) studied the asymptotic global convergence of REINFORCE (Williams, 1992), whose policy gradient is approximated with a single trajectory or a fixed-size mini-batch of trajectories under soft-max parametrization and log-barrier regularization. To accelerate these vanilla PG methods, faster variance-reduced PG methods have been proposed based on the variance-reduction techniques of SVRG (Johnson & Zhang, 2013), SPIDER (Fang et al., 2018), and STORM (Cutkosky & Orabona, 2019) from stochastic optimization. For example, the fast SVRPG algorithm (Papini et al., 2018; Xu et al., 2019a) was proposed based on SVRG. The fast HAPG (Shen et al., 2019) and SRVR-PG (Xu et al., 2019a) algorithms were developed using the SPIDER technique. Subsequently, the momentum-based PG methods ProxHSPGA (Pham et al., 2020) and IS-MBPG (Huang et al., 2020) were developed based on the variance-reduced technique of STORM/Hybrid-SGD (Cutkosky & Orabona, 2019; Tran-Dinh et al., 2019). More recently, Ding et al. (2021) studied the global convergence of momentum-based policy gradient methods. Zhang et al. (2021) proposed a truncated stochastic incremental variance-reduced policy gradient (TSIVR-PG) method to relieve the uncheckable importance-weight assumption made in the above variance-reduced PG methods, and provided global convergence of TSIVR-PG under an over-parameterization assumption on the policy.

3 PRELIMINARIES

In this section, we review some preliminaries of Markov decision processes and policy gradients.

3.1 NOTATIONS

Let $[n] = \{1, 2, \ldots, n\}$ for all $n \in \mathbb{N}^+$. For a vector $x \in \mathbb{R}^d$, $\|x\|$ denotes the $\ell_2$ norm of $x$, and $\|x\|_p = \big(\sum_{i=1}^d |x_i|^p\big)^{1/p}$ ($p \geq 1$) denotes the $p$-norm of $x$. For two sequences $\{a_k\}$ and $\{b_k\}$, we write $a_k = O(b_k)$ if $a_k \leq C b_k$ for some constant $C > 0$. $\mathbb{E}[X]$ and $\mathbb{V}[X]$ denote the expectation and variance of a random variable $X$, respectively.

3.2 MARKOV DECISION PROCESS

Reinforcement learning generally involves a discrete-time discounted Markov Decision Process (MDP) defined by a tuple $\{\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma, \rho_0\}$. $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces of the agent, respectively. $\mathcal{P}(s'|s, a): \mathcal{S} \times \mathcal{A} \rightarrow \Delta(\mathcal{S})$ is the Markov kernel that determines the transition probability from state $s$ to $s'$ when taking an action $a \in \mathcal{A}$. $r(s, a): \mathcal{S} \times \mathcal{A} \rightarrow [-R, R]$ ($R > 0$) is the reward function of $s$ and $a$, and $\rho_0 = p(s_0)$ denotes the initial state distribution. $\gamma \in (0, 1)$ is the discount factor. Let $\pi: \mathcal{S} \rightarrow \Delta(\mathcal{A})$ be a stationary policy, where $\Delta(\mathcal{A})$ is the set of probability distributions on $\mathcal{A}$. Given the current state $s_t \in \mathcal{S}$, the agent executes an action $a_t \in \mathcal{A}$ following the conditional probability distribution $\pi(a_t|s_t)$ and obtains a reward $r_t = r(s_t, a_t)$. At each time $t$, we define the state-action value function $Q^\pi(s_t, a_t)$ and the state value function $V^\pi(s_t)$ as

$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\Big], \qquad V^\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\Big]. \tag{1}$$

We also define the advantage function $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$. The goal of the agent is to find the optimal policy by maximizing the expected discounted reward

$$\max_\pi \ J(\pi) := \mathbb{E}_{s_0 \sim \rho_0}[V^\pi(s_0)]. \tag{2}$$

Given a time horizon $H$, the agent collects a trajectory $\tau = \{s_t, a_t\}_{t=0}^{H-1}$ under a stationary policy and obtains the cumulative discounted reward $r(\tau) = \sum_{t=0}^{H-1}\gamma^t r(s_t, a_t)$.
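As a concrete illustration of (1)-(2), the minimal sketch below estimates $V^\pi(s_0)$ (and hence $J(\pi) = \mathbb{E}_{s_0 \sim \rho_0}[V^\pi(s_0)]$) by averaging truncated discounted returns over independent Monte-Carlo rollouts. The `sample_rewards` callable and the toy reward sampler are illustrative stand-ins for an environment rollout under the current policy; this is not the paper's code.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """One Monte-Carlo sample of V^pi(s_0): sum_t gamma^t * r_t over a finite rollout."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_objective(sample_rewards, gamma, num_rollouts=1000):
    """Estimate J(pi) = E[V^pi(s_0)] by averaging discounted returns of independent
    rollouts; `sample_rewards()` must return the reward sequence of one trajectory
    collected under the current policy (truncated at horizon H)."""
    returns = [discounted_return(sample_rewards(), gamma) for _ in range(num_rollouts)]
    return float(np.mean(returns))

# toy usage with a hypothetical reward sampler standing in for an environment
rng = np.random.default_rng(0)
fake_rollout = lambda: rng.uniform(-1.0, 1.0, size=20)
print(estimate_objective(fake_rollout, gamma=0.99))
```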
Since the state and action spaces $\mathcal{S}$ and $\mathcal{A}$ are generally very large, directly solving problem (2) is difficult. Thus, we parametrize the policy $\pi$ as $\pi_\theta$ with parameter $\theta \in \Theta \subseteq \mathbb{R}^d$. Given the initial distribution $\rho_0 = p(s_0)$, the probability distribution over trajectories $\tau$ is

$$p(\tau|\theta) = p(s_0) \prod_{t=0}^{H-1} \mathcal{P}(s_{t+1}|s_t, a_t)\, \pi_\theta(a_t|s_t). \tag{3}$$

Thus, problem (2) is equivalent to maximizing the expected discounted trajectory reward:

$$\max_{\theta \in \Theta} \ J(\theta) := \mathbb{E}_{\tau \sim p(\tau|\theta)}[r(\tau)]. \tag{4}$$

In fact, the objective $J(\theta)$ above has a truncation error of $O(\frac{\gamma^H}{1-\gamma})$ compared to the original infinite-horizon MDP.

3.3 POLICY GRADIENTS

Policy gradient methods (Williams, 1992; Sutton et al., 2000) are a class of effective policy-based methods for solving the RL problem (4). Specifically, the gradient of $J(\theta)$ with respect to $\theta$ is

$$\nabla J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\big[\nabla \log p(\tau|\theta)\, r(\tau)\big]. \tag{5}$$

Given a mini-batch of trajectories $\mathcal{B} = \{\tau_i\}_{i=1}^n$ sampled from the distribution $p(\tau|\theta)$, the standard stochastic policy gradient ascent update at the $(k+1)$-th step is

$$\theta_{k+1} = \theta_k + \eta \nabla J_{\mathcal{B}}(\theta_k), \tag{6}$$

where $\eta > 0$ is the learning rate and $\nabla J_{\mathcal{B}}(\theta_k) = \frac{1}{n}\sum_{i=1}^n g(\tau_i|\theta_k)$ is the stochastic policy gradient. Given $H = O(\frac{1}{1-\gamma})$ as in (Zhang et al., 2019; Shani et al., 2020), $g(\tau|\theta)$ is an unbiased stochastic policy gradient of $J(\theta)$, i.e., $\mathbb{E}[g(\tau|\theta)] = \nabla J(\theta)$, where

$$g(\tau|\theta) = \Big(\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\Big)\Big(\sum_{t=0}^{H-1} \gamma^t r(s_t, a_t)\Big). \tag{7}$$

Based on the gradient estimator in (7), we can obtain the existing well-known policy gradient estimators such as REINFORCE (Williams, 1992) and the policy gradient theorem (PGT) estimator (Sutton et al., 2000). Specifically, REINFORCE obtains a policy gradient estimator by adding a baseline $b$:

$$g(\tau|\theta) = \Big(\sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\Big)\Big(\sum_{t=0}^{H-1} \gamma^t r(s_t, a_t) - b_t\Big).$$

The PGT estimator is an equivalent version of REINFORCE that only credits rewards to earlier actions:

$$g(\tau|\theta) = \sum_{t=0}^{H-1} \Big(\sum_{j=t}^{H-1} \gamma^j r(s_j, a_j) - b_j\Big) \nabla_\theta \log \pi_\theta(a_t|s_t).$$

Algorithm 1 BGPO Algorithm
1: Input: total iterations $K$, tuning parameters $\{\lambda, b, m, c\}$, and mirror mappings $\{\psi_k\}_{k=1}^K$ that are $\nu$-strongly convex functions;
2: Initialize: $\theta_1 \in \Theta$; sample a trajectory $\tau_1$ from $p(\tau|\theta_1)$ and compute $u_1 = g(\tau_1|\theta_1)$;
3: for $k = 1, 2, \ldots, K$ do
4: Update $\tilde{\theta}_{k+1} = \arg\min_{\theta \in \Theta}\big\{\langle u_k, \theta\rangle + \frac{1}{\lambda}D_{\psi_k}(\theta, \theta_k)\big\}$;
5: Update $\theta_{k+1} = \theta_k + \eta_k(\tilde{\theta}_{k+1} - \theta_k)$ with $\eta_k = \frac{b}{(m+k)^{1/2}}$;
6: Sample a trajectory $\tau_{k+1}$ from $p(\tau|\theta_{k+1})$ and compute $u_{k+1} = \beta_{k+1} g(\tau_{k+1}|\theta_{k+1}) + (1-\beta_{k+1})u_k$ with $\beta_{k+1} = c\eta_k$;
7: end for
8: Output: $\theta_\zeta$ chosen uniformly at random from $\{\theta_k\}_{k=1}^K$.

4 BREGMAN GRADIENT POLICY OPTIMIZATION

In this section, we propose a novel Bregman gradient policy optimization framework based on Bregman divergences and momentum techniques. Letting $f(\theta) = -J(\theta)$, the goal of policy-based RL is to solve $\max_{\theta \in \Theta} J(\theta) \Leftrightarrow \min_{\theta \in \Theta} f(\theta)$, so we have $\nabla f(\theta) = -\nabla J(\theta)$. Assume $\psi(x)$ is a continuously differentiable and $\nu$-strongly convex function, i.e., $\langle x - y, \nabla\psi(x) - \nabla\psi(y)\rangle \geq \nu \|x - y\|^2$ with $\nu > 0$. We define the Bregman distance

$$D_\psi(y, x) = \psi(y) - \psi(x) - \langle \nabla\psi(x), y - x\rangle, \quad \forall x, y \in \mathbb{R}^d. \tag{8}$$

Then, given a function $h(x)$ defined on a closed convex set $\mathcal{X}$, we define a proximal operator (a.k.a. mirror descent step):

$$P^\psi_{\lambda, h}(x) = \arg\min_{y \in \mathcal{X}}\Big\{h(y) + \frac{1}{\lambda}D_\psi(y, x)\Big\}, \tag{9}$$

where $\lambda > 0$. Based on this proximal operator $P^\psi_{\lambda, h}$, as in (Ghadimi et al., 2016; Zhang & He, 2018), we define the Bregman gradient of the function $h(x)$ as

$$\mathcal{B}^\psi_{\lambda, h}(x) = \frac{1}{\lambda}\big(x - P^\psi_{\lambda, h}(x)\big). \tag{10}$$

If $\psi(x) = \frac{1}{2}\|x\|^2$ and $\mathcal{X} = \mathbb{R}^d$, then $x^*$ is a stationary point of $h(x)$ if and only if $\mathcal{B}^\psi_{\lambda, h}(x^*) = \nabla h(x^*) = 0$. Thus, this Bregman gradient can be regarded as a generalized gradient.
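To make the update concrete, below is a minimal sketch of Algorithm 1 with the Euclidean mirror map $\psi_k(\theta) = \frac{1}{2}\|\theta\|^2$, for which step 4 has the closed form $\tilde{\theta}_{k+1} = \theta_k - \lambda u_k$ in the unconstrained case. The `grad_estimator` callable and the toy quadratic test are illustrative stand-ins (here $u_k$ tracks a stochastic gradient of the minimization objective $f(\theta) = -J(\theta)$, e.g. the negative single-trajectory PGT estimator); this is not the authors' implementation.

```python
import numpy as np

def bgpo_euclidean(grad_estimator, theta1, K, lam, b, m, c, rng):
    """One-trajectory BGPO (Algorithm 1) with the Euclidean mirror map
    psi_k(theta) = 0.5*||theta||^2, for which the Bregman proximal step
    argmin_theta {<u_k, theta> + D_psi(theta, theta_k)/lam} has the closed
    form theta_tilde = theta_k - lam*u_k (unconstrained case).
    `grad_estimator(theta)` stands in for a single-trajectory stochastic
    gradient of f(theta) = -J(theta)."""
    theta = np.array(theta1, dtype=float)
    u = grad_estimator(theta)                      # u_1 = g(tau_1 | theta_1)
    iterates = [theta.copy()]
    for k in range(1, K + 1):
        theta_tilde = theta - lam * u              # step 4: mirror descent step
        eta = b / np.sqrt(m + k)                   # step 5: step size eta_k
        theta = theta + eta * (theta_tilde - theta)
        beta = min(c * eta, 1.0)                   # step 6: basic momentum weight (clipped at 1)
        u = beta * grad_estimator(theta) + (1.0 - beta) * u
        iterates.append(theta.copy())
    return iterates[rng.integers(len(iterates))]   # step 8: uniform output rule

# toy check: minimize f(theta) = 0.5*||theta||^2 from noisy gradients
rng = np.random.default_rng(0)
noisy_grad = lambda th: th + 0.1 * rng.standard_normal(th.shape)
print(bgpo_euclidean(noisy_grad, np.ones(5), K=5000, lam=0.5, b=1.0, m=2.0, c=5.0, rng=rng))
```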
4.1 BGPO ALGORITHM

In this subsection, we propose the Bregman gradient policy optimization (BGPO) algorithm based on the basic momentum technique. The pseudo code of BGPO is given in Algorithm 1. In Algorithm 1, step 4 uses a stochastic Bregman gradient descent (a.k.a. stochastic mirror descent) step to update the parameter $\theta$. Let $h(\theta) = \langle \theta, u_k\rangle$ be the first-order approximation of the function $f(\theta)$ at $\theta_k$, where $u_k$ is an approximate gradient of $f(\theta)$ at $\theta_k$. By step 4 of Algorithm 1 and equality (10), we have

$$\mathcal{B}^{\psi_k}_{\lambda, h}(\theta_k) = \frac{1}{\lambda}\big(\theta_k - \tilde{\theta}_{k+1}\big), \tag{11}$$

where $\lambda > 0$. Then, by step 5 of Algorithm 1, we have

$$\theta_{k+1} = \theta_k - \lambda\eta_k\, \mathcal{B}^{\psi_k}_{\lambda, h}(\theta_k), \tag{12}$$

where $0 < \eta_k \leq 1$. Due to the convexity of the set $\Theta \subseteq \mathbb{R}^d$ and $\theta_1 \in \Theta$, we choose the parameter $\eta_k \in (0, 1]$ to ensure that the sequence $\{\theta_k\}_{k=1}^K$ stays in $\Theta$.

In fact, our BGPO algorithm unifies many popular policy optimization algorithms. When the mirror mappings are $\psi_k(\theta) = \frac{1}{2}\|\theta\|^2$ for $k \geq 1$, the update (12) is equivalent to a classic policy gradient iteration, and BGPO becomes a momentum version of the policy gradient algorithms (Sutton et al., 2000; Zhang et al., 2019). Given $\psi_k(\theta) = \frac{1}{2}\|\theta\|^2$ and $\beta_k = 1$, i.e., $u_k = g(\tau_k|\theta_k)$, we have

$$\mathcal{B}^{\psi_k}_{\lambda, h}(\theta_k) = -g(\tau_k|\theta_k) \quad \text{and} \quad \theta_{k+1} = \theta_k + \lambda\eta_k\, g(\tau_k|\theta_k). \tag{13}$$

When the mirror mappings are $\psi_k(\theta) = \frac{1}{2}\theta^T F(\theta_k)\theta$ with $F(\theta_k) = \mathbb{E}\big[\nabla_\theta \log\pi_{\theta_k}(a|s)\, \nabla_\theta \log\pi_{\theta_k}(a|s)^T\big]$, the update (12) is equivalent to a natural policy gradient iteration, and BGPO becomes a momentum version of the natural policy gradient algorithms (Kakade, 2001; Liu et al., 2020). Given $\psi_k(\theta) = \frac{1}{2}\theta^T F(\theta_k)\theta$ and $\beta_k = 1$, i.e., $u_k = g(\tau_k|\theta_k)$, we have

$$\mathcal{B}^{\psi_k}_{\lambda, h}(\theta_k) = -F(\theta_k)^{+} g(\tau_k|\theta_k) \quad \text{and} \quad \theta_{k+1} = \theta_k + \lambda\eta_k F(\theta_k)^{+} g(\tau_k|\theta_k), \tag{14}$$

where $F(\theta_k)^{+}$ denotes the Moore-Penrose pseudoinverse of the Fisher information matrix $F(\theta_k)$.

When the mirror mapping is $\psi_k(\theta) = \sum_{s \in \mathcal{S}} \pi_\theta(s)\log \pi_\theta(s)$, i.e., the Boltzmann-Shannon entropy function (Shannon, 1948), and $\Theta = \{\theta \in \mathbb{R}^d \mid \sum_{s \in \mathcal{S}} \pi_\theta(s) = 1\}$, we have $D_{\psi_k}(\theta, \theta_k) = \mathrm{KL}\big(\pi_\theta(s), \pi_{\theta_k}(s)\big) = \sum_{s \in \mathcal{S}} \pi_\theta(s)\log\frac{\pi_\theta(s)}{\pi_{\theta_k}(s)}$, which is the KL divergence. Then our BGPO becomes a momentum version of mirror descent policy optimization (Tomar et al., 2020).

4.2 VR-BGPO ALGORITHM

In this subsection, we propose a faster variance-reduced Bregman gradient policy optimization (VR-BGPO) algorithm. The pseudo code of VR-BGPO is given in Algorithm 2.

Algorithm 2 VR-BGPO Algorithm
1: Input: total iterations $K$, tuning parameters $\{\lambda, b, m, c\}$, and mirror mappings $\{\psi_k\}_{k=1}^K$ that are $\nu$-strongly convex functions;
2: Initialize: $\theta_1 \in \Theta$; sample a trajectory $\tau_1$ from $p(\tau|\theta_1)$ and compute $u_1 = g(\tau_1|\theta_1)$;
3: for $k = 1, 2, \ldots, K$ do
4: Update $\tilde{\theta}_{k+1} = \arg\min_{\theta \in \Theta}\big\{\langle u_k, \theta\rangle + \frac{1}{\lambda}D_{\psi_k}(\theta, \theta_k)\big\}$;
5: Update $\theta_{k+1} = \theta_k + \eta_k(\tilde{\theta}_{k+1} - \theta_k)$ with $\eta_k = \frac{b}{(m+k)^{1/3}}$;
6: Sample a trajectory $\tau_{k+1}$ from $p(\tau|\theta_{k+1})$ and compute $u_{k+1} = \beta_{k+1} g(\tau_{k+1}|\theta_{k+1}) + (1-\beta_{k+1})\big[u_k + g(\tau_{k+1}|\theta_{k+1}) - w(\tau_{k+1}|\theta_k, \theta_{k+1})\, g(\tau_{k+1}|\theta_k)\big]$ with $\beta_{k+1} = c\eta_k^2$;
7: end for
8: Output: $\theta_\zeta$ chosen uniformly at random from $\{\theta_k\}_{k=1}^K$.

Since problem (4) is non-oblivious, i.e., the distribution $p(\tau|\theta)$ depends on the variable $\theta$, which varies throughout the optimization procedure, we apply the importance sampling weight (Papini et al., 2018; Xu et al., 2019a) when estimating the policy gradient $u_{k+1}$, defined as

$$w(\tau_{k+1}|\theta_k, \theta_{k+1}) = \frac{p(\tau_{k+1}|\theta_k)}{p(\tau_{k+1}|\theta_{k+1})} = \prod_{t=0}^{H-1}\frac{\pi_{\theta_k}(a_t|s_t)}{\pi_{\theta_{k+1}}(a_t|s_t)}.$$

Except for the different stochastic policy gradients $\{u_k\}$ and tuning parameters $\{\eta_k, \beta_k\}$ used in Algorithms 1 and 2, steps 4 and 5 for updating the parameter $\theta$ are the same in both algorithms.
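The following is a minimal sketch of step 6 of Algorithm 2, i.e. the STORM-style variance-reduced momentum estimator together with the trajectory-level importance weight; the clipping range $[0.5, 1.5]$ is the one reported for the experiments in Appendix C, and all function names are illustrative.

```python
import numpy as np

def importance_weight(logp_old, logp_new):
    """w(tau | theta_k, theta_{k+1}) = p(tau|theta_k) / p(tau|theta_{k+1}); the
    transition kernel cancels, so only per-step policy log-probabilities matter.
    `logp_old[t]` / `logp_new[t]` are log pi_{theta_k}(a_t|s_t) and
    log pi_{theta_{k+1}}(a_t|s_t) along the trajectory sampled under theta_{k+1}."""
    return float(np.exp(np.sum(logp_old) - np.sum(logp_new)))

def vr_bgpo_momentum(u_k, g_new_at_new, g_new_at_old, w, beta):
    """Variance-reduced (STORM-style) estimator of step 6 in Algorithm 2:
    u_{k+1} = beta*g(tau|theta_{k+1})
              + (1-beta)*[u_k + g(tau|theta_{k+1}) - w*g(tau|theta_k)].
    `g_new_at_new` / `g_new_at_old` are policy-gradient estimates of the same
    freshly sampled trajectory evaluated at theta_{k+1} and theta_k."""
    w = np.clip(w, 0.5, 1.5)   # clipping used in the experiments (Appendix C)
    return beta * g_new_at_new + (1.0 - beta) * (u_k + g_new_at_new - w * g_new_at_old)
```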
Interestingly, when choosing the mirror mapping $\psi_k(\theta) = \frac{1}{2}\|\theta\|^2$, our VR-BGPO algorithm reduces to a non-adaptive version of the IS-MBPG algorithm (Huang et al., 2020).

5 CONVERGENCE ANALYSIS

In this section, we analyze the convergence properties of our BGPO and VR-BGPO algorithms. All related proofs are provided in Appendix A. We use the standard convergence metric $\mathbb{E}\big\|\mathcal{B}^{\psi_k}_{\lambda, \langle\nabla f(\theta_k), \theta\rangle}(\theta_k)\big\|$, used in (Zhang & He, 2018; Yang et al., 2019), to evaluate the convergence of Bregman gradient (a.k.a. mirror descent) algorithms. To give the convergence analysis, we first state some standard assumptions.

Assumption 1. For the function $\log\pi_\theta(a|s)$, its gradient and Hessian matrix are bounded, i.e., there exist constants $C_g, C_h > 0$ such that $\|\nabla_\theta \log\pi_\theta(a|s)\| \leq C_g$ and $\|\nabla^2_\theta \log\pi_\theta(a|s)\| \leq C_h$.

Assumption 2. The variance of the stochastic gradient $g(\tau|\theta)$ is bounded, i.e., there exists a constant $\sigma > 0$ such that, for all $\pi_\theta$, $\mathbb{V}(g(\tau|\theta)) = \mathbb{E}\|g(\tau|\theta) - \nabla J(\theta)\|^2 \leq \sigma^2$.

Assumption 3. For the importance sampling weight $w(\tau|\theta_1, \theta_2) = p(\tau|\theta_1)/p(\tau|\theta_2)$, its variance is bounded, i.e., there exists a constant $W > 0$ such that $\mathbb{V}(w(\tau|\theta_1, \theta_2)) \leq W$ for any $\theta_1, \theta_2 \in \mathbb{R}^d$ and $\tau \sim p(\tau|\theta_2)$.

Assumption 4. The function $J(\theta)$ has an upper bound on $\Theta$, i.e., $J^* = \sup_{\theta \in \Theta} J(\theta) < +\infty$.

Assumptions 1 and 2 are commonly used in PG algorithms (Papini et al., 2018; Xu et al., 2019a;b). Assumption 3 is widely used in the study of variance-reduced PG algorithms (Papini et al., 2018; Xu et al., 2019a). In fact, the bounded importance sampling weight might be violated in some cases, such as when using neural networks as the policy; we can then clip the importance sampling weights to guarantee the effectiveness of our algorithms, as in (Papini et al., 2018). At the same time, the importance weights also have some nice properties, e.g., for a soft-max policy the weight is bounded by $e^{c\|\theta_1 - \theta_2\|_2}$ for all $\theta_1, \theta_2 \in \Theta$. More recently, Zhang et al. (2021) used a simple truncated update to relieve this uncheckable importance-weight assumption. Assumption 4 guarantees the feasibility of problem (4). Note that Assumptions 2 and 4 are satisfied automatically given Assumption 1 and the fact that all the rewards are bounded, i.e., $|r(s, a)| \leq R$ for any $s \in \mathcal{S}$ and $a \in \mathcal{A}$. For example, since $|r(s, a)| \leq R$, we have $|J(\theta)| \leq \frac{R}{1-\gamma}$, and hence $J^* \leq \frac{R}{1-\gamma}$.

5.1 CONVERGENCE ANALYSIS OF THE BGPO ALGORITHM

In this subsection, we provide the convergence properties of the BGPO algorithm. The detailed proof is provided in Appendix A.1.

Theorem 1. Assume the sequence $\{\theta_k\}_{k=1}^K$ is generated by Algorithm 1. Let $\eta_k = \frac{b}{(m+k)^{1/2}}$ for all $k \geq 1$, $0 < \lambda \leq \frac{\nu m^{1/2}}{9Lb}$, $b > 0$, $c \geq \frac{8L\lambda}{\nu}$, and $m \geq \max\{b^2, (cb)^2\}$. Then we have

$$\frac{1}{K}\sum_{k=1}^K \mathbb{E}\big\|\mathcal{B}^{\psi_k}_{\lambda, \langle\nabla f(\theta_k), \theta\rangle}(\theta_k)\big\| \leq \frac{2\sqrt{2M}}{K^{1/4}},$$

where $M = \frac{J^* - J(\theta_1)}{\nu\lambda b} + \frac{\sigma^2}{\nu\lambda L b} + \frac{m\sigma^2}{\nu\lambda L b}\ln(m+K)$.

Remark 1. Without loss of generality, let $b = O(1)$, $m = O(1)$ and $\lambda = O(1)$; then $M = O(\ln(m+K)) = \tilde{O}(1)$. Theorem 1 shows that the BGPO algorithm has a convergence rate of $O(\frac{1}{K^{1/4}})$. Letting $\frac{1}{K^{1/4}} \leq \epsilon$, we have $K = O(\epsilon^{-4})$. Since the BGPO algorithm needs only one trajectory to estimate the stochastic policy gradient at each iteration and runs $K$ iterations, it has a sample complexity of $1\cdot K = O(\epsilon^{-4})$ for finding an $\epsilon$-stationary point.

5.2 CONVERGENCE ANALYSIS OF THE VR-BGPO ALGORITHM

In this subsection, we give the convergence properties of the VR-BGPO algorithm. The detailed proof is provided in Appendix A.2.

Theorem 2. Suppose the sequence $\{\theta_k\}_{k=1}^K$ is generated by Algorithm 2.
Let $\eta_k = \frac{b}{(m+k)^{1/3}}$ for all $k \geq 0$, $0 < \lambda \leq \frac{\nu m^{1/3}}{5\hat{L}b}$, $b > 0$, $c \geq \frac{2}{3b^3} + \frac{20\hat{L}^2\lambda^2}{\nu^2}$, and $m \geq \max\big\{2, b^3, (cb)^3, (\tfrac{5b}{6})^{2/3}\big\}$. Then we have

$$\frac{1}{K}\sum_{k=1}^K \mathbb{E}\big\|\mathcal{B}^{\psi_k}_{\lambda, \langle\nabla f(\theta_k), \theta\rangle}(\theta_k)\big\| \leq \frac{2\sqrt{2M}}{K^{1/3}}, \tag{15}$$

where $M = \frac{J^* - J(\theta_1)}{b\nu\lambda} + \frac{m^{1/3}\sigma^2}{16b^2\hat{L}^2\lambda^2} + \frac{c^2\sigma^2 b^2}{8\hat{L}^2\lambda^2}$, $\hat{L}^2 = L^2 + 2G^2C_w^2$, $G = C_gR/(1-\gamma)^2$, and $C_w = \sqrt{H(2HC_g^2 + C_h)(W+1)}$.

Remark 2. Without loss of generality, let $b = O(1)$, $m = O(1)$ and $\lambda = O(1)$; then $M = O(1)$. Theorem 2 shows that the VR-BGPO algorithm has a convergence rate of $O(\frac{1}{K^{1/3}})$. Letting $\frac{1}{K^{1/3}} \leq \epsilon$, we have $K = O(\epsilon^{-3})$. Since the VR-BGPO algorithm needs only one trajectory to estimate the stochastic policy gradient at each iteration and runs $K$ iterations, it reaches a lower sample complexity of $1\cdot K = O(\epsilon^{-3})$ for finding an $\epsilon$-stationary point.

6 EXPERIMENTS

In this section, we conduct RL tasks to verify the effectiveness of our methods. We first study the effect of different choices of Bregman divergences in our algorithms (BGPO and VR-BGPO), and then compare our VR-BGPO algorithm with other state-of-the-art methods such as TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), ProxHSPGA (Pham et al., 2020), VRMPO (Yang et al., 2019), and MDPO (Tomar et al., 2020). Our code is available at https://github.com/gaosh/BGPO.

Figure 1: Effects of two Bregman divergences, $l_p$-norm and diagonal term (Diag), on (a) Cart Pole-v1, (b) Acrobot-v1, and (c) Mountain Car Continuous.

Figure 2: Comparison between BGPO and VR-BGPO on (a) Cart Pole-v1, (b) Acrobot-v1, and (c) Mountain Car Continuous.

6.1 EFFECTS OF BREGMAN DIVERGENCES

In this subsection, we examine how different Bregman divergences affect the performance of our algorithms. In the first setting, we let the mirror mapping be $\psi_k(x) = \frac{1}{2}\|x\|_p^2$ with different $p$ to test the performance of our algorithms. Let $\psi^*_k(y) = \frac{1}{2}\|y\|_q^2$ be the conjugate mapping of $\psi_k(x)$, where $p^{-1} + q^{-1} = 1$ and $p, q > 1$. According to (Beck & Teboulle, 2003), when $\Theta = \mathbb{R}^d$, the update of $\theta_{k+1}$ in our algorithms can be computed as $\theta_{k+1} = \nabla\psi^*_k\big(\nabla\psi_k(\theta_k) + \lambda u_k\big)$, where $\nabla\psi_k$ and $\nabla\psi^*_k$ are the $p$-norm link functions

$$\nabla\psi_k(x)_j = \frac{\mathrm{sign}(x_j)|x_j|^{p-1}}{\|x\|_p^{p-2}}, \qquad \nabla\psi^*_k(y)_j = \frac{\mathrm{sign}(y_j)|y_j|^{q-1}}{\|y\|_q^{q-2}},$$

and $j$ is the coordinate index of $x$ and $y$.

In the second setting, we apply a diagonal term to the mirror mapping, $\psi_k(x) = \frac{1}{2}x^T H_k x$, where $H_k$ is a diagonal matrix with positive entries. In the experiments, we generate $H_k = \mathrm{diag}(\sqrt{v_k} + \alpha)$ with $v_k = \beta v_{k-1} + (1-\beta)u_k^2$, $\alpha > 0$, and $\beta \in (0, 1)$, as in the Super-Adam algorithm (Kingma & Ba, 2014; Huang et al., 2021). Then $D_{\psi_k}(y, x) = \frac{1}{2}(y - x)^T H_k(y - x)$. Under this setting, the update of $\theta_{k+1}$ can also be solved analytically: $\theta_{k+1} = \theta_k + \lambda H_k^{-1}u_k$.

To test the effectiveness of the two Bregman divergences, we evaluate them on three classic control environments from gym (Brockman et al., 2016): Cart Pole-v1, Acrobot-v1, and Mountain Car Continuous-v0. In the experiments, a categorical policy is used for the Cart Pole and Acrobot environments, and a Gaussian policy is used for Mountain Car. Gaussian value functions are used in all settings. All policies and value functions are parameterized by multilayer perceptrons (MLPs). For a fair comparison, all settings use the same initialization for the policies. We run each setting five times and plot the mean and variance of the average returns. For the $l_p$-norm mapping, we test three values $p \in \{1.5, 2.0, 3.0\}$. For the diagonal mapping, we set $\beta = 0.999$ and $\alpha = 10^{-8}$. We set the hyperparameters $\{b, m, c\}$ to be the same, while $\lambda$ still needs to be tuned for each $p$ to achieve relatively good performance.
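The two mirror-map updates above admit a compact implementation; below is a minimal sketch in which `u` denotes the momentum policy-gradient (ascent) direction, as in this section (with the minimization convention of Section 4, where $u_k \approx \nabla f = -\nabla J$, the signs in front of $\lambda u$ flip). The default `beta` and `alpha` mirror the values reported here; everything else is illustrative.

```python
import numpy as np

def grad_pnorm_map(x, p):
    """Gradient of the p-norm mirror map psi(x) = 0.5*||x||_p^2:
    (grad psi(x))_j = sign(x_j)*|x_j|^{p-1} / ||x||_p^{p-2}."""
    norm = np.linalg.norm(x, ord=p)
    if norm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (p - 1) / norm ** (p - 2)

def pnorm_mirror_update(theta, u, lam, p):
    """theta_{k+1} = grad psi*( grad psi(theta_k) + lam*u_k ), where the conjugate
    map is the q-norm link function with 1/p + 1/q = 1 and u_k is the momentum
    policy-gradient (ascent) direction used in Section 6.1."""
    q = p / (p - 1.0)
    return grad_pnorm_map(grad_pnorm_map(theta, p) + lam * u, q)

def diag_mirror_update(theta, u, v_prev, lam, beta=0.999, alpha=1e-8):
    """Diagonal mapping psi_k(x) = 0.5*x^T H_k x with H_k = diag(sqrt(v_k) + alpha)
    and v_k = beta*v_{k-1} + (1-beta)*u_k^2 (Super-Adam style); the step then has
    the closed form theta_k + lam * H_k^{-1} u_k (element-wise division)."""
    v = beta * v_prev + (1.0 - beta) * u ** 2
    h = np.sqrt(v) + alpha
    return theta + lam * u / h, v
```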
For simplicity, we use BGPO-Diag to denote BGPO with the diagonal mapping and BGPO-lp to denote BGPO with the $l_p$-norm mapping. Details about the setup of the environments and hyperparameters are provided in Appendix C. From Fig. 1, we find that BGPO-Diag largely outperforms BGPO-lp for all choices of $p$. Parameter tuning for BGPO-lp is also much more difficult than for BGPO-Diag, because each $p$ requires an individual $\lambda$ to achieve the desired performance.

6.2 COMPARISON BETWEEN BGPO AND VR-BGPO

To understand the effectiveness of the variance-reduced technique used in our VR-BGPO algorithm, we compare BGPO and VR-BGPO using the same settings introduced in Section 6.1. Both algorithms use the diagonal mapping for $\psi$, since it performs much better than the $l_p$-norm. From Fig. 2 (see also Appendix C), we can see that VR-BGPO outperforms BGPO in all three environments.

Figure 3: Experimental results of our algorithms and the baseline algorithms on six environments: (a) Pendulum-v2, (b) Double Pendulum-v2, (c) Walker2d-v2, (d) Swimmer-v2, (e) Reacher-v2, (f) Half Cheetah-v2.

In Cart Pole, both algorithms converge very fast and have similar performance, and VR-BGPO is more stable than BGPO. The advantage of VR-BGPO becomes larger in the Acrobot and Mountain Car environments, probably because these tasks are more difficult than Cart Pole.

6.3 COMPARISON TO OTHER METHODS

In this subsection, we compare our BGPO and VR-BGPO algorithms with the other methods. For our BGPO and VR-BGPO, we use the diagonal mapping for $\psi$. For VRMPO, we follow the original implementation and use the $l_p$-norm for $\psi$. For MDPO, $\psi$ is the negative Shannon entropy, and the Bregman divergence becomes the KL divergence. To evaluate the performance of these algorithms, we test them on six gym (Brockman et al., 2016) environments with continuous control tasks: Inverted Double Pendulum-v2, Walker2d-v2, Reacher-v2, Swimmer-v2, Inverted Pendulum-v2, and Half Cheetah-v2. We use Gaussian policies and Gaussian value functions for all environments, and both are parameterized by MLPs. To ensure a fair comparison, all policies use the same initialization. For TRPO and PPO, we use the implementations provided by garage (garage contributors, 2019). We carefully implement MDPO and VRMPO following the descriptions in the original papers. All methods, including ours, are implemented with garage (garage contributors, 2019) and PyTorch (Paszke et al., 2019). We run all algorithms ten times on each environment and report the mean and variance of the average returns. Details about the setup of the environments and hyperparameters are also provided in Appendix C.

From Fig. 3, we find that our VR-BGPO method consistently outperforms all the other methods, and our BGPO basically reaches the second-best performance. From the results of BGPO, we find that using a proper Bregman (mirror) distance can improve the performance of PG methods; from the results of VR-BGPO, we find that using a proper variance-reduced technique can further improve the performance of BGPO. ProxHSPGA reaches relatively good performance by using the variance-reduced technique. MDPO achieves good results in some environments, but it cannot outperform PPO or TRPO on Swimmer and Inverted Double Pendulum. VRMPO only outperforms PPO and TRPO on Reacher and Inverted Double Pendulum.
The undesirable performance of VRMPO is probably because it uses the $l_p$-norm for $\psi$, which requires careful tuning of the learning rate.

7 CONCLUSION

In this paper, we proposed a novel Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques. Moreover, we studied the convergence properties of the proposed methods under the nonconvex setting.

ACKNOWLEDGMENT

This work was partially supported by NSF IIS 1845666, 1852606, 1838627, 1837956, 1956002, OIA 2040588.

REFERENCES

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. arXiv preprint arXiv:1908.00261, 2019.

Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J Kappen. Dynamic policy programming. The Journal of Machine Learning Research, 13(1):3207-3245, 2012.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.

Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200-217, 1967.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. arXiv preprint arXiv:2007.06558, 2020.

Yair Censor and Arnold Lent. An iterative row-action method for interval convex programming. Journal of Optimization Theory and Applications, 34(3):321-353, 1981.

Yair Censor and Stavros Andrea Zenios. Proximal minimization algorithm with D-functions. Journal of Optimization Theory and Applications, 73(3):451-464, 1992.

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pp. 442-450, 2010.

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pp. 15210-15219, 2019.

Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1-142, 2013.

Yuhao Ding, Junzi Zhang, and Javad Lavaei. On the global convergence of momentum-based policy gradient. arXiv preprint arXiv:2110.10116, 2021.

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689-699, 2018.

The garage contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.

Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In Thirty-sixth International Conference on Machine Learning, 2019.

Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267-305, 2016.

Feihu Huang, Shangqian Gao, Jian Pei, and Heng Huang. Momentum-based policy gradient methods. In International Conference on Machine Learning, pp. 4422-4433. PMLR, 2020.

Feihu Huang, Junyi Li, and Heng Huang.
Super-Adam: Faster and universal framework of adaptive gradients. Advances in Neural Information Processing Systems, 34, 2021.

Yujia Jin and Aaron Sidford. Efficiently solving MDPs with stochastic mirror descent. In International Conference on Machine Learning, pp. 4890-4900. PMLR, 2020.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315-323, 2013.

Sham M Kakade. A natural policy gradient. Advances in Neural Information Processing Systems, 14:1531-1538, 2001.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008-1014, 2000.

Guanghui Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. arXiv preprint arXiv:2102.00135, 2021.

Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.

Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33, 2020.

Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. In 35th International Conference on Machine Learning, volume 80, pp. 4026-4035, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024-8035, 2019.

Nhan Pham, Lam Nguyen, Dzung Phan, Phuong Ha Nguyen, Marten Dijk, and Quoc Tran-Dinh. A hybrid stochastic policy gradient algorithm for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 374-385. PMLR, 2020.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897, 2015.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), 2016.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.

Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5668-5675, 2020.

Claude E Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379-423, 1948.
Zebang Shen, Alejandro Ribeiro, Hamed Hassani, Hui Qian, and Chao Mi. Hessian aided policy gradient. In International Conference on Machine Learning, pp. 5729-5738, 2019.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057-1063, 2000.

Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. arXiv preprint arXiv:2005.09814, 2020.

Quoc Tran-Dinh, Nhan H Pham, Dzung T Phan, and Lam M Nguyen. Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920, 2019.

Qing Wang, Yingru Li, Jiechao Xiong, and Tong Zhang. Divergence-augmented policy optimization. In Advances in Neural Information Processing Systems, pp. 6099-6110, 2019.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

Pan Xu, Felicia Gao, and Quanquan Gu. An improved convergence analysis of stochastic variance-reduced policy gradient. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 191, 2019a.

Pan Xu, Felicia Gao, and Quanquan Gu. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610, 2019b.

Long Yang, Gang Zheng, Haotian Zhang, Yu Zhang, Qian Zheng, Jun Wen, and Gang Pan. Policy optimization with stochastic mirror descent. arXiv preprint arXiv:1906.10462, 2019.

Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. arXiv preprint arXiv:2105.11066, 2021.

Junyu Zhang, Chengzhuo Ni, Zheng Yu, Csaba Szepesvari, and Mengdi Wang. On the convergence and sample efficiency of variance-reduced policy gradient method. arXiv preprint arXiv:2102.08607, 2021.

Junzi Zhang, Jongho Kim, Brendan O'Donoghue, and Stephen Boyd. Sample efficient reinforcement learning with REINFORCE. arXiv preprint arXiv:2010.11364, 2020.

Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383, 2019.

Siqi Zhang and Niao He. On the convergence rate of stochastic mirror descent for nonsmooth nonconvex optimization. arXiv preprint arXiv:1806.04781, 2018.

A CONVERGENCE ANALYSIS

In this section, we study the convergence properties of our algorithms. We first provide some useful lemmas.

Lemma 1. (Proposition 4.2 in Xu et al. (2019b)) Suppose $g(\tau|\theta)$ is the PGT estimator. Under Assumption 1, we have: 1) $g(\tau|\theta)$ is $L$-Lipschitz, i.e., $\|g(\tau|\theta_1) - g(\tau|\theta_2)\| \leq L\|\theta_1 - \theta_2\|$ for all $\theta_1, \theta_2 \in \Theta$, where $L = C_hR/(1-\gamma)^2$; 2) $J(\theta)$ is $L$-smooth, i.e., $\|\nabla^2 J(\theta)\| \leq L$; 3) $g(\tau|\theta)$ is bounded, i.e., $\|g(\tau|\theta)\| \leq G$ for all $\theta \in \Theta$, with $G = C_gR/(1-\gamma)^2$.

Lemma 2. (Lemma 6.1 in Xu et al. (2019a)) Under Assumptions 1 and 3, let $w(\tau|\theta_{k-1}, \theta_k) = p(\tau|\theta_{k-1})/p(\tau|\theta_k)$. Then we have

$$\mathbb{V}[w(\tau|\theta_{k-1}, \theta_k)] \leq C_w^2\|\theta_k - \theta_{k-1}\|^2, \tag{16}$$

where $C_w = \sqrt{H(2HC_g^2 + C_h)(W+1)}$.

Lemma 3.
(Lemma 1 in (Ghadimi et al., 2016)) Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a closed convex set, let $\varphi: \mathcal{X} \rightarrow \mathbb{R}$ be a convex but possibly nonsmooth function, and let $D_\psi: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ be the Bregman divergence associated with the $\nu$-strongly convex function $\psi$. Define

$$x^+ = \arg\min_{z \in \mathcal{X}}\Big\{\langle g, z\rangle + \frac{1}{\lambda}D_\psi(z, x) + \varphi(z)\Big\}, \quad x \in \mathcal{X}, \tag{17}$$

$$P_{\mathcal{X}}(x, g, \lambda) = \frac{1}{\lambda}(x - x^+), \tag{18}$$

where $g \in \mathbb{R}^d$, $\lambda > 0$, and $D_\psi(z, x) = \psi(z) - \psi(x) - \langle\nabla\psi(x), z - x\rangle$. Then the following statement holds:

$$\langle g, P_{\mathcal{X}}(x, g, \lambda)\rangle \geq \nu\|P_{\mathcal{X}}(x, g, \lambda)\|^2 + \frac{1}{\lambda}\big[\varphi(x^+) - \varphi(x)\big]. \tag{19}$$

Lemma 4. (Proposition 1 in Ghadimi et al. (2016)) Let $x_1^+$ and $x_2^+$ be given by (17) with $g$ replaced by $g_1$ and $g_2$, respectively, and let $P_{\mathcal{X}}(x, g_1, \lambda)$ and $P_{\mathcal{X}}(x, g_2, \lambda)$ be defined by (18) with $x^+$ replaced by $x_1^+$ and $x_2^+$, respectively. Then we have

$$\|P_{\mathcal{X}}(x, g_1, \lambda) - P_{\mathcal{X}}(x, g_2, \lambda)\| \leq \frac{1}{\nu}\|g_1 - g_2\|. \tag{20}$$

Lemma 5. (Lemma 1 in (Cortes et al., 2010)) Let $w(x) = \frac{P(x)}{Q(x)}$ be the importance weight for distributions $P$ and $Q$. The following identities hold for the expectation, second moment, and variance of $w(x)$:

$$\mathbb{E}[w(x)] = 1, \quad \mathbb{E}[w^2(x)] = d_2(P\|Q), \quad \mathbb{V}[w(x)] = d_2(P\|Q) - 1, \tag{21}$$

where $d_2(P\|Q) = 2^{D_2(P\|Q)}$ and $D_2(P\|Q)$ is the Rényi divergence between the distributions $P$ and $Q$.

Lemma 6. Suppose the sequence $\{\theta_k\}_{k=1}^K$ is generated by Algorithm 1 or 2. Let $0 < \eta_k \leq 1$ and $0 < \lambda \leq \frac{\nu}{2L\eta_k}$. Then we have

$$f(\theta_{k+1}) \leq f(\theta_k) + \frac{\eta_k\lambda}{\nu}\|\nabla f(\theta_k) - u_k\|^2 - \frac{\nu\eta_k}{2\lambda}\|\tilde{\theta}_{k+1} - \theta_k\|^2. \tag{22}$$

Proof. According to Assumption 1 and Lemma 1, the function $f(\theta)$ is $L$-smooth. Then we have

$$\begin{aligned} f(\theta_{k+1}) &\leq f(\theta_k) + \langle\nabla f(\theta_k), \theta_{k+1} - \theta_k\rangle + \frac{L}{2}\|\theta_{k+1} - \theta_k\|^2 \\ &= f(\theta_k) + \eta_k\langle\nabla f(\theta_k), \tilde{\theta}_{k+1} - \theta_k\rangle + \frac{L\eta_k^2}{2}\|\tilde{\theta}_{k+1} - \theta_k\|^2 \\ &= f(\theta_k) + \eta_k\langle\nabla f(\theta_k) - u_k, \tilde{\theta}_{k+1} - \theta_k\rangle + \eta_k\langle u_k, \tilde{\theta}_{k+1} - \theta_k\rangle + \frac{L\eta_k^2}{2}\|\tilde{\theta}_{k+1} - \theta_k\|^2, \end{aligned} \tag{23}$$

where the first equality is due to $\theta_{k+1} = \theta_k + \eta_k(\tilde{\theta}_{k+1} - \theta_k)$. By step 4 of Algorithm 1 or 2, we have $\tilde{\theta}_{k+1} = \arg\min_{\theta \in \Theta}\{\langle u_k, \theta\rangle + \frac{1}{\lambda}D_{\psi_k}(\theta, \theta_k)\}$. Using Lemma 3 with $\varphi(\cdot) = 0$, we have

$$\Big\langle u_k, \frac{1}{\lambda}(\theta_k - \tilde{\theta}_{k+1})\Big\rangle \geq \nu\Big\|\frac{1}{\lambda}(\theta_k - \tilde{\theta}_{k+1})\Big\|^2. \tag{24}$$

Thus, we obtain

$$\langle u_k, \tilde{\theta}_{k+1} - \theta_k\rangle \leq -\frac{\nu}{\lambda}\|\tilde{\theta}_{k+1} - \theta_k\|^2. \tag{25}$$

According to the Cauchy-Schwarz inequality and Young's inequality, we have

$$\langle\nabla f(\theta_k) - u_k, \tilde{\theta}_{k+1} - \theta_k\rangle \leq \|\nabla f(\theta_k) - u_k\|\,\|\tilde{\theta}_{k+1} - \theta_k\| \leq \frac{\lambda}{\nu}\|\nabla f(\theta_k) - u_k\|^2 + \frac{\nu}{4\lambda}\|\tilde{\theta}_{k+1} - \theta_k\|^2. \tag{26}$$

Combining the inequalities (23), (25) and (26), we obtain

$$\begin{aligned} f(\theta_{k+1}) &\leq f(\theta_k) + \eta_k\langle\nabla f(\theta_k) - u_k, \tilde{\theta}_{k+1} - \theta_k\rangle + \eta_k\langle u_k, \tilde{\theta}_{k+1} - \theta_k\rangle + \frac{L\eta_k^2}{2}\|\tilde{\theta}_{k+1} - \theta_k\|^2 \\ &\leq f(\theta_k) + \frac{\eta_k\lambda}{\nu}\|\nabla f(\theta_k) - u_k\|^2 + \frac{\nu\eta_k}{4\lambda}\|\tilde{\theta}_{k+1} - \theta_k\|^2 - \frac{\nu\eta_k}{\lambda}\|\tilde{\theta}_{k+1} - \theta_k\|^2 + \frac{L\eta_k^2}{2}\|\tilde{\theta}_{k+1} - \theta_k\|^2 \\ &= f(\theta_k) + \frac{\eta_k\lambda}{\nu}\|\nabla f(\theta_k) - u_k\|^2 - \frac{\nu\eta_k}{2\lambda}\|\tilde{\theta}_{k+1} - \theta_k\|^2 - \Big(\frac{\nu\eta_k}{4\lambda} - \frac{L\eta_k^2}{2}\Big)\|\tilde{\theta}_{k+1} - \theta_k\|^2 \\ &\leq f(\theta_k) + \frac{\eta_k\lambda}{\nu}\|\nabla f(\theta_k) - u_k\|^2 - \frac{\nu\eta_k}{2\lambda}\|\tilde{\theta}_{k+1} - \theta_k\|^2, \end{aligned} \tag{27}$$

where the last inequality is due to $0 < \lambda \leq \frac{\nu}{2L\eta_k}$.

A.1 CONVERGENCE ANALYSIS OF THE BGPO ALGORITHM

In this subsection, we analyze the convergence properties of the BGPO algorithm.

Lemma 7. Assume the stochastic policy gradient $u_{k+1}$ is generated by Algorithm 1. Given $0 < \beta_{k+1} \leq 1$, we have

$$\mathbb{E}\|\nabla f(\theta_{k+1}) - u_{k+1}\|^2 \leq (1 - \beta_{k+1})\mathbb{E}\|\nabla f(\theta_k) - u_k\|^2 + \frac{2}{\beta_{k+1}}L^2\eta_k^2\,\mathbb{E}\|\tilde{\theta}_{k+1} - \theta_k\|^2 + \beta_{k+1}^2\sigma^2.$$

Proof. By the definition of $u_{k+1}$ in Algorithm 1, we have $u_{k+1} - u_k = -\beta_{k+1}u_k + \beta_{k+1}g(\tau_{k+1}|\theta_{k+1})$.
(28) Since f(θk) = J(θk) for all k 1, we have E f(θk+1) uk+1 2 = E J(θk) uk J(θk+1) + J(θk) (uk+1 uk) 2 = E J(θk) uk J(θk+1) + J(θk) + βk+1uk + βk+1g(τk+1|θk+1) 2 = E (1 βk+1)( J(θk) uk) + βk+1( J(θk+1) + g(τk+1|θk+1)) + (1 βk+1) J(θk+1) + J(θk) 2 = (1 βk+1)2E J(θk) + uk + J(θk+1) J(θk) 2 + β2 k+1E J(θk+1) g(τk+1|θk+1) 2 (1 βk+1)2(1 + βk+1)E J(θk) + uk 2 + (1 βk+1)2(1 + 1 βk+1 )E J(θk+1) J(θk) 2 + β2 k+1E J(θk+1) g(τk+1|θk+1) 2 (1 βk+1)E J(θk) + uk 2 + 2 βk+1 E J(θk+1) J(θk) 2 + β2 k+1E J(θk+1) g(τk+1|θk+1) 2 (1 βk+1)E J(θk) + uk 2 + 2 βk+1 L2E θk+1 θk 2 + β2 k+1E J(θk+1) g(τk+1|θk+1) 2 (1 βk+1)E f(θk) uk 2 + 2 βk+1 L2η2 k E θk+1 θk 2 + β2 k+1σ2, (29) Published as a conference paper at ICLR 2022 where the fourth equality holds by Eτk+1 p(τ|θk+1)[g(τk+1|θk+1)] = J(θk+1); the first inequality holds by Young s inequality; the second inequality is due to 0 < βk+1 1 such that (1 βk+1)2(1+ βk+1) = 1 βk+1 β2 k+1 + β3 k+1 1 βk+1 and (1 βk+1)2(1 + 1 βk+1 ) 1 + 1 βk+1 2 βk+1 ; the last inequality holds by Assumption 2. Theorem 3. Assume the sequence {θk}K k=1 be generated from Algorithm 1. Let ηk = b (m+k)1/2 for all k 1, 0 < λ νm1/2 9Lb , b > 0, 8Lλ b , and m max{b2, (cb)2}, we have k=1 E Bψk λ, f(θk),θ (θk) 2 where M = J J(θ1) νλb + σ2 νλLb + mσ2 νλLb ln(m + K). Proof. Since ηk = b (m+k)1/2 is decreasing on k, we have ηk η0 = b m1/2 1 for all k 0. At the same time, let m (cb)2, we have βk+1 = cηk cη0 = cb m1/2 1. Consider m (cb)2, we have c m1/2 b . Since 0 < λ νm1/2 9Lb , we have λ νm1/2 2Lb = ν 2Lη0 ν 2Lηk for all k 0. According to Lemma 7, we have E f(θk+1) uk+1 2 E f(θk) uk 2 βk+1E f(θk) uk 2 + 2 βk+1 L2η2 k E θk+1 θk 2 + β2 k+1σ2 = cηk E f(θk) uk 2 + 2L2 c ηk E θk+1 θk 2 + c2η2 kσ2 ν ηk E f(θk) uk 2 + Lν 4λ ηk E θk+1 θk 2 + mη2 kσ2 where the first equality is due to βk+1 = cηk and the last equality holds by 8Lλ Next we define a Lyapunov function Φk = E f(θk) + 1 L f(θk) uk 2 for any t 1. Then we have Φk+1 Φk = E f(θk+1) f(θk) + 1 L f(θk+1) uk+1 2 f(θk) uk 2 ν E f(θk) uk 2 νηk 2λ E θk+1 θk 2 8ληk ν E f(θk) uk 2 4λ E θk+1 θk 2 + mη2 kσ2 4ν ηk E f(θk) uk 2 ν 4ληk E θk+1 θk 2 + mη2 kσ2 where the first inequality follows by the Lemma 6 and the above inequality (30). Published as a conference paper at ICLR 2022 Summing the above inequality (31) over k from 1 to K, we can obtain 4ν ηk E f(θk) uk 2 + ν 4ληk E θk+1 θk 2] Φ1 ΦK+1 + mσ2 = f(θ1) f(θK+1) + 1 L f(θ1) u1 2 1 L f(θK+1) u K+1 2 + mσ2 J(θK+1) J(θ1) + 1 L J(θ1) g(τ1|θ1) 2 + mσ2 J J(θ1) + σ2 J J(θ1) + σ2 L ln(m + K), (32) where the last second inequality holds by Assumptions 2 and 4. Since ηk is decreasing, we have 4ν2 E f(θk) uk 2 + 1 4λ2 E θk+1 θk 2] KνλLηK + mσ2 νλLKηK ln(m + K) νλLb ln(m + K) (m + K)1/2 Let M = J J(θ1) νλb + σ2 νλLb + mσ2 νλLb ln(m + K), the above inequality (33) reduces to 4ν2 E f(θk) uk 2 + 1 4λ2 E θk+1 θk 2] M K (m + K)1/2. (34) According to Jensen s inequality, we have 2ν f(θk) uk + 1 4ν2 f(θk) uk 2 + 1 4λ2 θk+1 θk 2 1/2 2M K1/2 (m + K)1/4 2M K1/4 , (35) where the last inequality is due to the inequality (a + b)1/4 a1/4 + b1/4 for all a, b 0. Thus we have ν f(θk) uk + 1 λ θk+1 θk 2 2M K1/4 . (36) By the step 4 of Algorithm 1, we have Bψk λ, uk,θ (θk) = PΘ(θk, uk, λ) = 1 λ θk θk+1 . (37) Published as a conference paper at ICLR 2022 At the same time, as in Ghadimi et al. (2016), we define Bψk λ, f(θk),θ (θk) = PΘ(θk, f(θk), λ) = 1 λ θk θ+ k+1 , (38) θ+ k+1 = arg min θ Θ f(θk), θ + 1 λDψk(θ, θk) . (39) According to the above Lemma 4, we have Bψk λ, uk,θ (θk) Bψk λ, f(θk),θ (θk) 1 ν uk f(θk) . 
Then we have Bψk λ, f(θk),θ (θk) Bψk λ, uk,θ (θk) + Bψk λ, uk,θ (θk) Bψk λ, f(θk),θ (θk) Bψk λ, uk,θ (θk) + 1 λ θk+1 θk + 1 ν uk f(θk) . (40) By the above inequalities (36) and (40), we have k=1 E Bψk λ, f(θk),θ (θk) 2 2M K1/4 . (41) A.2 CONVERGENCE ANALYSIS OF VR-BGPO ALGORITHM In this subsection, we will analyze convergence properties of the VR-BGPO algorithm. Lemma 8. Assume that the stochastic policy gradient uk+1 be generated from Algorithm 2, given 0 < βk 1, we have E f(θk+1) uk+1 2 (1 βk+1)E f(θk) uk 2 + 4ˆL2η2 k θk+1 θk 2 + 2β2 k+1σ2, where ˆL2 = L2 + 2G2C2 w and Cw = q H(2HC2g + Ch)(W + 1). Proof. By the definition of uk+1 in Algorithm 2, we have uk+1 uk = βk+1uk βk+1g(τk+1|θk+1) + (1 βk+1) g(τk+1|θk+1) + w(τk+1|θk, θk+1)g(τk+1|θk) . Since f(θk+1) = J(θk+1), we have E f(θk+1) uk+1 2 = E J(θk) + uk + J(θk+1) J(θk) + (uk+1 uk) 2 (42) = E J(θk) + uk + J(θk+1) J(θk) βk+1uk βk+1g(τk+1|θk+1) + (1 βk+1) g(τk+1|θk+1) + w(τk+1|θk, θk+1)g(τk+1|θk) 2 = E (1 βk+1)( J(θk) + uk) + βk+1( J(θk+1) g(τk+1|θk+1)) (1 βk+1) g(τk+1|θk+1) w(τk+1|θk, θk+1)g(τk+1|θk) ( J(θk+1) J(θk)) 2 = (1 βk+1)2E J(θk) + uk 2 + E βk+1( J(θk+1) g(τk+1|θk+1)) (1 βk+1) g(τk+1|θk+1) w(τk+1|θk, θk+1)g(τk+1|θk) ( J(θk+1) J(θk)) 2 (1 βk+1)2E J(θk) + uk 2 + 2β2 k+1E J(θk+1) g(τk+1|θk+1) 2 + 2(1 βk+1)2E g(τk+1|θk+1) w(τk+1|θk, θk+1)g(τk+1|θk) ( J(θk+1) J(θk)) 2 (1 βk+1)E f(θk) uk 2 + 2β2 k+1σ2 + 2 E g(τk+1|θk+1) w(τk+1|θk, θk+1)g(τk+1|θk) 2 | {z } =T1 Published as a conference paper at ICLR 2022 where the forth equality holds by Eτk+1 p(τ|θk+1)[g(τk+1|θk+1)] = J(θk+1) and Eτk+1 p(τ|θk+1)[g(τk+1|θk+1) w(τk+1|θk, θk+1)g(τk+1|θk)] = J(θk+1) J(θk); the second last inequality follows by Young s inequality; and the last inequality holds by Assumption 2, and the inequality E ζ E[ζ] 2 = E ζ 2 (E[ζ])2 E ζ 2, and 0 < βk+1 1. Next, we give an upper bound of the term T1 as follows: T1 = E g(τk+1|θk+1) w(τk+1|θk, θk+1)g(τk+1|θk) 2 = E g(τk+1|θk+1) g(τk+1|θk) + g(τk+1|θk) w(τk+1|θk, θk+1)g(τk+1|θk) 2 2E g(τk+1|θk+1) g(τk+1|θk) 2 + 2E (1 w(τk+1|θk, θk+1))g(τk+1|θk) 2 2L2 θk+1 θk 2 + 2G2E 1 w(τk+1|θk, θk+1) 2 = 2L2 θk+1 θk 2 + 2G2V w(τk+1|θk, θk+1) 2(L2 + 2G2C2 w) θk+1 θk 2, (43) where the second inequality holds by Lemma 1, and the third equality holds by Lemma 5, and the last inequality follows by Lemma 2. Combining the inequalities (42) with (43), let ˆL2 = L2 + 2G2C2 w, we have E f(θk+1) uk+1 2 (1 βk+1)E f(θk) uk 2 + 2β2 k+1σ2 + 4ˆL2 θk+1 θk 2 = (1 βk+1)E f(θk) uk 2 + 2β2 k+1σ2 + 4ˆL2η2 k θk+1 θk 2. Theorem 4. Suppose the sequence {θk}K k=1 be generated from Algorithm 2. Let ηk = b (m+k)1/3 for all k 0, 0 < λ νm1/3 5ˆLb , b > 0, 2 3b3 + 20ˆL2λ2 b2 and m max 2, b3, (cb)3, ( 5 6b)2/3 , we have k=1 E Bψk λ, f(θk),θ (θk) 2 K1/3 , (44) where M = J J(θ1) bνλ + m1/3σ2 16b2 ˆL2λ2 + c2σ2b2 8ˆL2λ2 and ˆL2 = L2 + 2G2C2 w. Proof. Since ηk = b (m+k)1/3 on k is decreasing and m b3, we have ηk η0 = b m1/3 1. Due L2 + 2G2C2w L, we have 0 < λ νm1/3 2Lb = ν 2Lη0 ν 2Lηk for any k 0. Consider 0 < ηk 1 and m (cb)3, we have βk+1 = cη2 k cb2 m2/3 1. At the same time, we have b2 . 
According to Lemma 8, we have 1 ηk E f(θk+1) uk+1 2 1 ηk 1 E f(θk) uk 2 E f(θk) uk 2 + 2β2 k+1σ2 ηk + 4ˆL2ηk θk+1 θk 2 ηk 1 ηk 1 cηk E f(θk) uk 2 + 2c2η3 kσ2 + 4ˆL2ηk θk+1 θk 2 b (m + k) 1 3 (m + k 1) 1 3 cηk E f(θk) uk 2 + 2c2η3 kσ2 + 4ˆL2ηk θk+1 θk 2 3b3 ηk cηk E f(θk) uk 2 + 2c2η3 kσ2 + 4ˆL2ηk θk+1 θk 2, (45) where the last inequality holds by the following inequality (m + k) 1 3 (m + k 1) 1 3 1 3(m + k 1)2/3 1 3 m/2 + k 2/3 3(m + k)2/3 = 22/3 (m + k)2/3 = 22/3 3b2 η2 k 2 3b2 ηk, (46) Published as a conference paper at ICLR 2022 where the first inequality holds by the concavity of function f(x) = x1/3, i.e., (x + y)1/3 x1/3 + y 3x2/3 ; the second inequality is due to m 2, and the last inequality is due to 0 < ηk 1. Let c 2 3b3 + 20ˆL2λ2 ν2 , we have 1 ηk E f(θk+1) uk+1 2 1 ηk 1 E f(θk) uk 2 ν2 ηk E f(θk) uk 2 + 2c2η3 kσ2 + 4ˆL2ηk θk+1 θk 2. (47) Here we simultaneously consider c 2 3b3 + 20ˆL2λ2 ν2 , c m2/3 b2 and 0 < λ νm1/3 5ˆLb , we have 2 3b3 + 20ˆL2λ2 ν2 2 3b3 + 20ˆL2 25ˆL2b2 = 2 3b3 + 4m2/3 Then we have m ( 5 Next we define a Lyapunov function Ωk = E f(θk) + ν 16ˆL2ληk 1 f(θk) uk 2 for any k 1. According to Lemma 6, we have Ωk+1 Ωk = f(θk+1) f(θk) + ν ηk E f(θk+1) uk+1 2 1 ηk 1 E f(θk) uk 2 ν f(θk) uk 2 νηk 2λ θk+1 θk 2 5ληk 4ν E f(θk) uk 2 4λ E θk+1 θk 2 + νc2η3 kσ2 4ν E f(θk) uk 2 νηk 4λ E θk+1 θk 2 + νc2η3 kσ2 8ˆL2λ , (49) where the first inequality is due to the above inequality (47). Thus, we can obtain 4ν E f(θk) uk 2 + νηk 4λ E θk+1 θk 2 Ωk Ωk+1 + νc2η3 kσ2 8ˆL2λ . (50) Taking average over k = 1, 2, , K on both sides of (50), we have 4ν E f(θk) uk 2 + νηk 4λ E θk+1 θk 2 f(θ1) f(θK+1) K + ν f(θ1) u1 2 16ˆL2η0λK ν f(θK+1) u K+1 2 16ˆL2ηKλK + 1 J(θK+1) J(θ1) 16ˆL2η0λK + 1 16ˆL2η0λK + 1 8ˆL2λ , (51) where the second inequality is due to u1 = g(τ1|θ1), f(θ1) = J(θ1) and Assumption 1, and the last inequality holds by Assumption 2. Since ηk is decreasing, i.e., η 1 K η 1 k for any Published as a conference paper at ICLR 2022 0 < k K, we have 4ν2 E f(θk) uk 2 + 1 4λ2 E θk+1 θk 2 16ˆL2ηKη0λ2K + 1 KληK KνληK + m1/3σ2 16bˆL2λ2ηKK + c2σ2 KληK + m1/3σ2 16bˆL2λ2ηKK + c2σ2b3 8ˆL2λ2KηK ln(m + K) bνλ + m1/3σ2 16b2 ˆL2λ2 + c2σ2b2 where the second inequality holds by PK k=1 η3 kdk R K 1 η3 kdk = b3 R K 1 (m + k) 1dk. Let M = J J(θ1) bνλ + m1/3σ2 16b2 ˆL2λ2 + c2σ2b2 8ˆL2λ2 , the above inequality (52) reduces to 4ν2 f(θk) uk 2 + 1 4λ2 θk+1 θk 2 M K (m + K)1/3. (53) According to Jensen s inequality, we have 2ν f(θk) uk + 1 4ν2 f(θk) uk 2 + 1 4λ2 θk+1 θk 2 1/2 K1/2 (m + K)1/6 K1/3 , (54) where the last inequality is due to the inequality (a + b)1/6 a1/6 + b1/6 for all a, b 1. Thus we have ν f(θk) uk + 1 λ θk+1 θk 2 K1/3 . (55) Then by using the above inequality (40), we can obtain k=1 E Bψk λ, f(θk),θ (θk) 2 K1/3 . (56) B ACTOR-CRITIC STYLE BGPO AND VR-BGPO ALGORITHMS In the experiments, we use the advantage-based policy gradient estimator: t=0 log πθ(at|st) ˆAπθ(st, at), (57) Published as a conference paper at ICLR 2022 Algorithm 3 BGPO Algorithm (Actor-Critic Style) 1: Input: Total iteration K, tuning parameters {λ, b, m, c} and mirror mappings ψk K k=1 are ν-strongly convex functions; 2: Initialize: θ1 Θ, θv 1 Θv and sample a trajectory τ1 from p(τ|θ1), and compute u1 = g(τ1|θ1); 3: for k = 1, 2, . . . 
, $K$ do
4: # Update the policy network
5: Update $\tilde{\theta}_{k+1} = \arg\min_{\theta \in \Theta}\big\{\langle u_k, \theta\rangle + \frac{1}{\lambda}D_{\psi_k}(\theta, \theta_k)\big\}$;
6: Update $\theta_{k+1} = \theta_k + \eta_k(\tilde{\theta}_{k+1} - \theta_k)$ with $\eta_k = \frac{b}{(m+k)^{1/2}}$;
7: # Update the value network
8: Update $\theta^v_{k+1}$ by solving the subproblem (58);
9: # Sample a new trajectory and compute policy gradients
10: Sample a trajectory $\tau_{k+1}$ from $p(\tau|\theta_{k+1})$ and compute $u_{k+1} = \beta_{k+1}g(\tau_{k+1}|\theta_{k+1}) + (1-\beta_{k+1})u_k$ with $\beta_{k+1} = c\eta_k$;
11: end for
12: Output: $\theta_\zeta$ chosen uniformly at random from $\{\theta_k\}_{k=1}^K$.

Algorithm 4 VR-BGPO Algorithm (Actor-Critic Style)
1: Input: total iterations $K$, tuning parameters $\{\lambda, b, m, c\}$, and mirror mappings $\{\psi_k\}_{k=1}^K$ that are $\nu$-strongly convex functions;
2: Initialize: $\theta_1 \in \Theta$, $\theta^v_1 \in \Theta_v$; sample a trajectory $\tau_1$ from $p(\tau|\theta_1)$ and compute $u_1 = g(\tau_1|\theta_1)$;
3: for $k = 1, 2, \ldots, K$ do
4: # Update the policy network
5: Update $\tilde{\theta}_{k+1} = \arg\min_{\theta \in \Theta}\big\{\langle u_k, \theta\rangle + \frac{1}{\lambda}D_{\psi_k}(\theta, \theta_k)\big\}$;
6: Update $\theta_{k+1} = \theta_k + \eta_k(\tilde{\theta}_{k+1} - \theta_k)$ with $\eta_k = \frac{b}{(m+k)^{1/3}}$;
7: # Update the value network
8: Update $\theta^v_{k+1}$ by solving the subproblem (58);
9: # Sample a new trajectory and compute policy gradients
10: Sample a trajectory $\tau_{k+1}$ from $p(\tau|\theta_{k+1})$ and compute $u_{k+1} = \beta_{k+1}g(\tau_{k+1}|\theta_{k+1}) + (1-\beta_{k+1})\big[u_k + g(\tau_{k+1}|\theta_{k+1}) - w(\tau_{k+1}|\theta_k, \theta_{k+1})g(\tau_{k+1}|\theta_k)\big]$ with $\beta_{k+1} = c\eta_k^2$;
11: end for
12: Output: $\theta_\zeta$ chosen uniformly at random from $\{\theta_k\}_{k=1}^K$.

In the above estimator (57), $\theta$ ($\in \Theta \subseteq \mathbb{R}^d$) denotes the parameters of the policy network, and $\hat{A}^{\pi_\theta}(s, a)$ is an estimator of the advantage function $A^{\pi_\theta}(s, a)$. When using the advantage-based policy gradient, we also need the state-value function $V^{\pi_\theta}(s)$. Here, we use a value network $V_{\theta_v}(s)$ to approximate the state-value function $V^{\pi_\theta}(s)$. Specifically, we solve the following problem to obtain the value network:

$$\min_{\theta_v \in \Theta_v} \ \mathcal{L}(\theta_v) := \big(V_{\theta_v}(s_t) - \hat{V}^{\pi_\theta}(s_t)\big)^2, \tag{58}$$

where $\theta_v$ ($\in \Theta_v \subseteq \mathbb{R}^{d_v}$) denotes the parameters of the value network and $\hat{V}^{\pi_\theta}(s)$ is an estimator of the state-value function $V^{\pi_\theta}(s)$, which is obtained by GAE (Schulman et al., 2016). We then use GAE to estimate $\hat{A}^{\pi_\theta}$ based on the value network $V_{\theta_v}$. The actor-critic style BGPO and VR-BGPO algorithms are described in Algorithm 3 and Algorithm 4, respectively.
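For concreteness, here is a minimal sketch of how the GAE-based targets $\hat{A}^{\pi_\theta}$ and $\hat{V}^{\pi_\theta}$ used in (57)-(58) can be computed, and how one regression step on (58) looks in PyTorch. The GAE parameter `lam_gae` and the function names are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def gae_advantages(rewards, values, last_value, gamma=0.99, lam_gae=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016):
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    A_hat_t = sum_l (gamma*lam_gae)^l * delta_{t+l}, and V_hat_t = A_hat_t + V(s_t).
    `values` should be computed under torch.no_grad() so the targets are constants."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam_gae * running
        advantages[t] = running
        next_value = values[t]
    return advantages, advantages + values        # (A_hat, V_hat targets)

def value_regression_step(value_net, optimizer, states, v_targets):
    """One gradient step on the least-squares value objective (58), e.g. with ADAM."""
    loss = ((value_net(states).squeeze(-1) - v_targets) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```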
C DETAILED SETUP OF EXPERIMENTAL ENVIRONMENTS AND HYPER-PARAMETERS

In this section, we provide the detailed setup of the experimental environments and hyper-parameters, summarized in Tab. 2 and Tab. 3. We use the ADAM optimizer to optimize the value functions for all methods and settings, which is common practice. The importance sampling weight used for the VR-BGPO algorithm is clipped within $[0.5, 1.5]$. The momentum term $\beta_k$ is set to be less than or equal to one ($\beta_k = \min(\beta_k, 1.0)$) throughout the whole training process. The BGPO and VR-BGPO algorithms involve four hyper-parameters $\{\lambda, b, m, c\}$, which may bring additional effort for hyper-parameter tuning. However, the actual tuning is not hard, and we use only one set of $\{b, m, c\}$ for all nine environments. Our tuning strategy separates the four hyper-parameters into two parts. The first part, $\{b, m, c\}$, mainly decides when the momentum term $\beta_k$ actually affects the updates ($\beta_k < 1.0$). The second part, $\lambda$, only affects how fast the policy learns. To further reduce the complexity of hyper-parameter tuning, we always set $m = 2$. By grouping the hyper-parameters in this way, we only need to consider $\lambda$ and how $\beta_k$ changes, which largely simplifies the process of hyper-parameter tuning.

Table 2: Setup of environments and hyper-parameters for the experiments in Sections 6.1 and 6.2. The learning rate of the value function is the same for all methods.

| | Cart Pole-v1 | Acrobot-v1 | Mountain Car-v0 |
| --- | --- | --- | --- |
| Horizon | 100 | 500 | 500 |
| Value function network sizes | 32 × 32 | 32 × 32 | 32 × 32 |
| Policy network sizes | 8 × 8 | 8 × 8 | 64 × 64 |
| Number of timesteps | $5 \times 10^5$ | $5 \times 10^6$ | $7.5 \times 10^6$ |
| Batch size | 50 | 100 | 100 |
| VR-BGPO/BGPO {b, m, c} | {1.5, 2.0, 25} | {1.5, 2.0, 25} | {1.5, 2.0, 25} |
| BGPO-lp {λ(p=1.5), λ(p=2.0), λ(p=3.0)} | {0.0064, 0.0016, 0.0008} | {0.016, 0.004, 0.001} | {0.016, 0.004, 0.001} |
| BGPO-Diag/VR-BGPO-Diag λ | $1 \times 10^{-3}$ | $1 \times 10^{-3}$ | $1 \times 10^{-3}$ |
| Value function learning rate | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ |

Table 3: Setup of environments and hyper-parameters for the experiments in Section 6.3. The learning rate of the value function is the same for all methods.

| | Pendulum-v2 | Double Pendulum-v2 | Walker2d-v2 | Swimmer-v2 | Reacher-v2 | Half Cheetah-v2 |
| --- | --- | --- | --- | --- | --- | --- |
| Horizon | 500 | 500 | 500 | 500 | 500 | 500 |
| Value function network sizes | 32 × 32 | 32 × 32 | 32 × 32 | 32 × 32 | 32 × 32 | 32 × 32 |
| Policy network sizes | 64 × 64 | 64 × 64 | 64 × 64 | 64 × 64 | 64 × 64 | 64 × 64 |
| Number of timesteps | $5 \times 10^6$ | $5 \times 10^6$ | $1 \times 10^7$ | $1 \times 10^7$ | $1 \times 10^7$ | $1 \times 10^7$ |
| Batch size | 100 | 100 | 100 | 100 | 100 | 100 |
| VR-BGPO {b, m, c} | {1.50, 2.0, 25} | {1.50, 2.0, 25} | {1.50, 2.0, 25} | {1.50, 2.0, 25} | {1.50, 2.0, 25} | {1.50, 2.0, 25} |
| VR-BGPO λ | $1 \times 10^{-2}$ | $1 \times 10^{-2}$ | $1 \times 10^{-2}$ | $5 \times 10^{-4}$ | $5 \times 10^{-4}$ | $5 \times 10^{-4}$ |
| TRPO/PPO learning rate | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ |
| MDPO learning rate | $3 \times 10^{-3}$ | $3 \times 10^{-3}$ | $3 \times 10^{-3}$ | $3 \times 10^{-3}$ | $3 \times 10^{-3}$ | $3 \times 10^{-3}$ |
| VRMPO learning rate | $5 \times 10^{-3}$ | $3 \times 10^{-4}$ | $1 \times 10^{-2}$ | $2 \times 10^{-4}$ | $5 \times 10^{-5}$ | $5 \times 10^{-5}$ |
| Value function learning rate | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ | $2.5 \times 10^{-3}$ |
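To illustrate the schedules described above, the minimal sketch below computes $\eta_k$ and the clipped momentum weight $\beta_{k+1}$ for BGPO ($\beta_{k+1} = c\eta_k$) and VR-BGPO ($\beta_{k+1} = c\eta_k^2$) with the reported defaults $\{b, m, c\} = \{1.5, 2.0, 25\}$; the printed values are purely illustrative of when $\beta_k$ drops below 1.

```python
import numpy as np

def bgpo_schedule(k, b=1.5, m=2.0, c=25.0):
    """BGPO schedules: eta_k = b/(m+k)^{1/2}, beta_{k+1} = min(c*eta_k, 1)."""
    eta = b / np.sqrt(m + k)
    return eta, min(c * eta, 1.0)

def vr_bgpo_schedule(k, b=1.5, m=2.0, c=25.0):
    """VR-BGPO schedules: eta_k = b/(m+k)^{1/3}, beta_{k+1} = min(c*eta_k**2, 1)."""
    eta = b / (m + k) ** (1.0 / 3.0)
    return eta, min(c * eta ** 2, 1.0)

# beta_k stays clipped at 1 early on and only starts shrinking after enough
# iterations, which is the "when the momentum term actually affects updates"
# behavior discussed above.
for k in (1, 10, 100, 1000, 10000):
    print(k, bgpo_schedule(k), vr_bgpo_schedule(k))
```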