# Momentum-Based Policy Gradient Methods

Feihu Huang¹, Shangqian Gao¹, Jian Pei², Heng Huang¹ ³

¹Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, USA. ²School of Computing Science, Simon Fraser University, Vancouver, Canada. ³JD Finance America Corporation, Mountain View, CA, USA. Correspondence to: Heng Huang.

In this paper, we propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning, which use adaptive learning rates and do not require any large batches. Specifically, we propose a fast importance-sampling momentum-based policy gradient (IS-MBPG) method based on a new momentum-based variance-reduction technique and the importance sampling technique. We also propose a fast Hessian-aided momentum-based policy gradient (HA-MBPG) method based on the momentum-based variance-reduction technique and the Hessian-aided technique. Moreover, we prove that both the IS-MBPG and HA-MBPG methods reach the best known sample complexity of $O(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of the non-concave performance function, while requiring only one trajectory at each iteration. In particular, we also present a non-adaptive version of the IS-MBPG method, i.e., IS-MBPG*, which reaches the same best known sample complexity of $O(\epsilon^{-3})$ without any large batches. In the experiments, we use four benchmark tasks to demonstrate the effectiveness of our algorithms.

## 1. Introduction

Reinforcement Learning (RL) has achieved great success in solving many sequential decision-making problems such as autonomous driving (Shalev-Shwartz et al., 2016), robot manipulation (Deisenroth et al., 2013), the game of Go (Silver et al., 2017) and natural language processing (Wang et al., 2018). In general, RL involves a Markov decision process (MDP), in which an agent takes actions dictated by a policy in a stochastic environment over a sequence of time steps, and maximizes the long-term cumulative reward to obtain an optimal policy. Due to its easy implementation and its avoidance of policy degradation, the policy gradient method (Williams, 1992; Sutton et al., 2000) is widely used for finding the optimal policy in MDPs, especially with high-dimensional continuous state and action spaces. To obtain the optimal policy, policy gradient methods directly maximize the expected total reward (also called the performance function $J(\theta)$) by using stochastic first-order gradients of the cumulative reward. Recently, policy gradient methods have achieved significant empirical success in many challenging deep reinforcement learning applications (Li, 2017) such as playing the game of Go and robot manipulation. Thus, policy gradient methods have regained much interest in reinforcement learning, and corresponding algorithms and theory (Fellows et al., 2018; Fujimoto et al., 2018; Papini et al., 2018; Haarnoja et al., 2018; Xu et al., 2019a; Shen et al., 2019; Cheng et al., 2019b;a; Wang et al., 2019a) have been proposed and studied.
Since classic policy gradient methods (e.g., REINFORCE (Williams, 1992), PGT (Sutton et al., 2000), GPOMDP (Baxter & Bartlett, 2001) and TRPO (Schulman et al., 2015a)) approximate the gradient of the expected total reward based on a batch of sampled trajectories, they generally suffer from large variance in the estimated gradients, which results in poor convergence. Following the standard stochastic gradient methods (Robbins & Monro, 1951; Ghadimi & Lan, 2013), these gradient-based policy methods require $O(\epsilon^{-4})$ samples to find an $\epsilon$-stationary point of the non-concave performance function $J(\theta)$ (i.e., $\mathbb{E}\|\nabla J(\theta)\| \le \epsilon$). Thus, many recent works have studied how to reduce the variance in policy gradient methods. For example, the early variance-reduced policy methods (Greensmith et al., 2004; Peters & Schaal, 2008) mainly focused on using unbiased baseline functions to reduce the variance. Schulman et al. (2015b) presented generalized advantage estimation (GAE) to trade off bias and variance of the policy gradient. Then Gu et al. (2016) applied both GAE and a linear baseline function to reduce the variance. Recently, Mao et al. (2018) and Wu et al. (2018) proposed input-dependent and action-dependent baselines to reduce the variance, respectively. More recently, Cheng et al. (2019b) leveraged predictive models to reduce the variance and accelerate policy learning.

Table 1. Convergence properties of representative variance-reduced policy gradient algorithms on the non-oblivious model-free RL problem for finding an $\epsilon$-stationary point of the non-concave performance function $J(\theta)$, i.e., $\mathbb{E}\|\nabla J(\theta)\| \le \epsilon$. Our algorithms (IS-MBPG, IS-MBPG* and HA-MBPG) and REINFORCE are single-loop algorithms, while the other algorithms are double-loop and need both outer-loop and inner-loop mini-batch sizes. Note that Papini et al. (2018) only remarked that the ADAM algorithm (Kingma & Ba, 2014) can be applied to the SVRPG algorithm to obtain an adaptive learning rate, but did not provide any theoretical analysis of this learning rate. In addition, the sample complexity $O(\epsilon^{-4})$ of REINFORCE does not come directly from (Williams, 1992), but follows from the theoretical results for SGD (Ghadimi & Lan, 2013) (a detailed theoretical analysis is given in Appendix A.4).

| Algorithm | Reference | Sample Complexity | Batch Size | Adaptive Learning Rate |
| --- | --- | --- | --- | --- |
| REINFORCE | Williams (1992) | $O(\epsilon^{-4})$ | $O(\epsilon^{-2})$ | No |
| SVRPG | Papini et al. (2018) | $O(\epsilon^{-4})$ | $O(\epsilon^{-2})$ & $O(\epsilon^{-2})$ | ADAM (no analysis) |
| SVRPG | Xu et al. (2019a) | $O(\epsilon^{-10/3})$ | $O(\epsilon^{-4/3})$ & $O(\epsilon^{-2})$ | No |
| HAPG | Shen et al. (2019) | $O(\epsilon^{-3})$ | $O(\epsilon^{-1})$ & $O(\epsilon^{-2})$ | No |
| SRVR-PG | Xu et al. (2019b) | $O(\epsilon^{-3})$ | $O(\epsilon^{-1})$ & $O(\epsilon^{-2})$ | No |
| IS-MBPG | Ours | $O(\epsilon^{-3})$ | $O(1)$ | Yes |
| HA-MBPG | Ours | $O(\epsilon^{-3})$ | $O(1)$ | Yes |
| IS-MBPG* | Ours | $O(\epsilon^{-3})$ | $O(1)$ | No |

Recently, variance-reduced gradient estimators such as SVRG (Johnson & Zhang, 2013; Allen-Zhu & Hazan, 2016; Reddi et al., 2016), SAGA (Defazio et al., 2014), SARAH (Nguyen et al., 2017), SPIDER (Fang et al., 2018), SpiderBoost (Wang et al., 2019b) and SNVRG (Zhou et al., 2018) have been successful in oblivious supervised learning. However, RL optimization problems are non-oblivious, i.e., the distribution of the samples is non-stationary and changes over time. Thus, Du et al. (2017); Xu et al. (2017); Wai et al. (2019) first transform the original non-oblivious policy evaluation problem into oblivious subproblems, and then use existing variance-reduced gradient estimators (such as SVRG and SAGA) to solve these subproblems, thereby reducing the large variance in the original RL problem.
For example, Du et al. (2017) first transform the empirical policy evaluation problem into a quadratic convex-concave saddle-point problem via linear function approximation, and then apply variants of SVRG and SAGA (Palaniappan & Bach, 2016) to solve this oblivious saddle-point problem. More recently, Papini et al. (2018); Xu et al. (2019a;b); Shen et al. (2019) have developed variance-reduced policy gradient estimators that are used directly in the non-oblivious model-free RL setting, based on existing variance-reduction techniques such as SVRG and SPIDER from oblivious supervised learning. Moreover, Xu et al. (2019a;b); Shen et al. (2019) have effectively improved the sample complexity by using these variance-reduced policy gradients. For example, two efficient variance-reduced policy gradient methods, SRVR-PG (Xu et al., 2019b) and HAPG (Shen et al., 2019), have been proposed based on SARAH/SPIDER, and reach a sharp sample complexity of $O(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which improves the vanilla complexity of $O(\epsilon^{-4})$ (Williams, 1992) by a factor of $O(\epsilon^{-1})$. Since a lower bound of $O(\epsilon^{-3})$ for the recently proposed variance-reduction techniques is established in (Arjevani et al., 2019), both SRVR-PG and HAPG obtain a near-optimal sample complexity of $O(\epsilon^{-3})$. However, the practical performance of these variance-reduced policy gradient methods is not consistent with their near-optimal sample complexity, because these methods require large batches and restrictive learning rates to achieve this optimal complexity.

In this paper, we thus propose a class of efficient momentum-based policy gradient methods, which use adaptive learning rates and do not require any large batches. Specifically, our algorithms only need one trajectory at each iteration, and use adaptive learning rates based on the current and historical stochastic gradients. Note that Pirotta et al. (2013) also studied adaptive learning rates for policy gradient methods, but only for Gaussian policies; moreover, Pirotta et al. (2013) did not consider the sample complexity and cannot improve it. In contrast, our algorithms not only provide adaptive learning rates that are suitable for any policy, but also improve the sample complexity.

**Contributions.** Our main contributions are summarized as follows:

1) We propose a fast importance-sampling momentum-based policy gradient (IS-MBPG) method with an adaptive learning rate, which builds on a new momentum-based variance-reduction technique of STORM/Hybrid-SGD (Cutkosky & Orabona, 2019; Tran-Dinh et al., 2019) and the importance sampling technique.

2) We propose a fast Hessian-aided momentum-based policy gradient (HA-MBPG) method with an adaptive learning rate, based on the momentum-based variance-reduction technique and the Hessian-aided technique.

3) We study the sample complexity of our methods, and prove that both the IS-MBPG and HA-MBPG methods reach the best known sample complexity of $O(\epsilon^{-3})$ without any large batches (see Table 1).

4) We propose a non-adaptive version of the IS-MBPG method, i.e., IS-MBPG*, which has a simple monotonically decreasing learning rate. We prove that it also reaches the best known sample complexity of $O(\epsilon^{-3})$ without any large batches.

After our paper was accepted, we found that three related papers (Xiong et al., 2020; Pham et al., 2020; Yuan et al., 2020) were more recently released on arXiv.
Xiong et al. (2020) studied the adaptive Adam-type policy gradient (PG-AMSGrad) method, which still suffers from a high sample complexity of $O(\epsilon^{-4})$. Subsequently, Pham et al. (2020); Yuan et al. (2020) proposed the policy gradient methods ProxHSPGA and STORM-PG, respectively, which also build on the momentum-based variance-reduction technique of STORM/Hybrid-SGD. Although both ProxHSPGA and STORM-PG reach the best known sample complexity of $O(\epsilon^{-3})$, these methods still rely on large batch sizes to obtain this sample complexity and do not provide an effective adaptive learning rate as our methods do.

Let $\|\cdot\|$ denote the vector $\ell_2$ norm and the matrix spectral norm, respectively. We write $a_n = O(b_n)$ if $a_n \le c b_n$ for some constant $c > 0$. $\mathbb{E}[X]$ and $\mathbb{V}[X]$ denote the expectation and variance of a random variable $X$, respectively. $\mathbb{E}_{\tau_t}[\cdot] = \mathbb{E}_{\tau_t}[\cdot \mid \tau_1, \ldots, \tau_{t-1}]$ for any $t \ge 2$.

## 2. Background

In this section, we review some preliminaries of standard reinforcement learning and policy gradient.

### 2.1. Reinforcement Learning

Reinforcement learning is generally modeled as a discrete-time Markov Decision Process (MDP): $M = \{S, A, P, R, \gamma, \rho_0\}$. Here $S$ is the state space, $A$ is the action space, and $\rho_0$ denotes the initial state distribution. $P(s'|s, a)$ denotes the probability that the agent transits from state $s$ to $s'$ after taking action $a \in A$. $R(s, a): S \times A \mapsto [-R, R]$ ($R > 0$) is the bounded reward function, i.e., the agent obtains the reward $R(s, a)$ after taking action $a$ at state $s$, and $\gamma \in (0, 1)$ is the discount factor. The policy $\pi(a|s)$ at state $s$ is represented by a conditional probability distribution $\pi_\theta(a|s)$ associated with the parameter $\theta \in \mathbb{R}^d$. Given a time horizon $H$, the agent can collect a trajectory $\tau = \{s_0, a_0, \ldots, s_{H-1}, a_{H-1}\}$ under any stationary policy. Following the trajectory $\tau$, the cumulative discounted reward is given by

$$R(\tau) = \sum_{h=0}^{H-1} \gamma^h R(s_h, a_h), \qquad (1)$$

where $\gamma$ is the discount factor. Assume that the policy $\pi_\theta$ is parameterized by an unknown parameter $\theta \in \mathbb{R}^d$. Given the initial distribution $\rho_0 = \rho(s_0)$, the probability distribution over the trajectory $\tau$ can be obtained as

$$p(\tau|\theta) = \rho(s_0) \prod_{h=0}^{H-1} P(s_{h+1}|s_h, a_h)\, \pi_\theta(a_h|s_h). \qquad (2)$$

### 2.2. Policy Gradient

The goal of RL is to find an optimal policy $\pi_\theta$, which is equivalent to maximizing the expected discounted trajectory reward:

$$\max_{\theta \in \mathbb{R}^d} J(\theta) := \mathbb{E}_{\tau \sim p(\tau|\theta)}[R(\tau)] = \int R(\tau)\, p(\tau|\theta)\, d\tau. \qquad (3)$$

Since the underlying distribution $p$ depends on the variable $\theta$ and varies throughout the optimization procedure, problem (3) is a non-oblivious learning problem, unlike traditional supervised learning problems in which the underlying distribution $p$ is stationary. To deal with this problem, the policy gradient method (Williams, 1992; Sutton et al., 2000) is a good choice. Specifically, we first compute the gradient of $J(\theta)$ with respect to $\theta$ and obtain

$$\nabla J(\theta) = \int R(\tau)\, \nabla p(\tau|\theta)\, d\tau = \int R(\tau)\, \frac{\nabla p(\tau|\theta)}{p(\tau|\theta)}\, p(\tau|\theta)\, d\tau = \mathbb{E}_{\tau \sim p(\tau|\theta)}\big[\nabla \log p(\tau|\theta)\, R(\tau)\big]. \qquad (4)$$

Since the distribution $p(\tau|\theta)$ is unknown, we cannot compute the exact full gradient in (4). Similar to stochastic gradient descent (SGD), the policy gradient method samples a batch of trajectories $B = \{\tau_i\}_{i=1}^{|B|}$ from the distribution $p(\tau|\theta)$ to obtain the stochastic gradient

$$\hat\nabla_\theta J(\theta) = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \log p(\tau_i|\theta)\, R(\tau_i).$$

At the $t$-th iteration, the parameter $\theta$ is updated as

$$\theta_{t+1} = \theta_t + \eta_t \hat\nabla_\theta J(\theta), \qquad (5)$$

where $\eta_t > 0$ is a learning rate.
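As a concrete illustration of the estimator in (4) and the update (5), the following minimal NumPy sketch computes the score-function gradient over a batch of trajectories; the trajectory layout and the `grad_log_prob(s, a)` callback (returning $\nabla_\theta \log \pi_\theta(a|s)$) are assumptions made for illustration, not code from the paper.

```python
import numpy as np

def vanilla_policy_gradient(trajectories, grad_log_prob, gamma):
    """Monte Carlo estimate of (4): (1/|B|) * sum_i grad log p(tau_i|theta) * R(tau_i).
    The transition terms do not depend on theta, so grad log p(tau_i|theta)
    reduces to the sum of policy score functions along the trajectory."""
    estimates = []
    for states, actions, rewards in trajectories:
        # Discounted return R(tau_i) of the whole trajectory, as in (1).
        discounted_return = sum(gamma ** h * r for h, r in enumerate(rewards))
        # Score function of the trajectory.
        score = sum(grad_log_prob(s, a) for s, a in zip(states, actions))
        estimates.append(score * discounted_return)
    return np.mean(estimates, axis=0)

def policy_gradient_step(theta, trajectories, grad_log_prob, gamma, eta):
    """One ascent step of (5): theta_{t+1} = theta_t + eta_t * grad_hat J(theta_t)."""
    return theta + eta * vanilla_policy_gradient(trajectories, grad_log_prob, gamma)
```

Because the transition probabilities do not depend on $\theta$, only the policy score terms survive in $\nabla \log p(\tau|\theta)$, which is exactly the simplification used in (6) below.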
In addition, since the term $\nabla \log p(\tau_i|\theta)$ is independent of the transition probability $P$, we can rewrite the stochastic gradient $\hat\nabla J(\theta)$ as

$$\hat\nabla_\theta J(\theta) = \frac{1}{|B|} \sum_{i \in B} g(\tau_i, \theta), \qquad g(\tau_i, \theta) = \Big(\sum_{h=0}^{H-1} \nabla_\theta \log \pi_\theta(a_h^i|s_h^i)\Big)\Big(\sum_{h=0}^{H-1} \gamma^h R(s_h^i, a_h^i)\Big), \qquad (6)$$

where $g(\tau_i, \theta)$ is an unbiased stochastic gradient based on the trajectory $\tau_i$, i.e., $\mathbb{E}[g(\tau_i, \theta)] = \nabla J(\theta)$. Based on the gradient estimator in (6), we can obtain the existing well-known policy gradient estimators such as REINFORCE, PGT and GPOMDP. Since $\mathbb{E}[\nabla_\theta \log \pi_\theta(a|s)] = 0$, REINFORCE adds a constant baseline $b$ and obtains the gradient estimator

$$g(\tau_i, \theta) = \Big(\sum_{h=0}^{H-1} \nabla_\theta \log \pi_\theta(a_h^i|s_h^i)\Big)\Big(\sum_{h=0}^{H-1} \gamma^h R(s_h^i, a_h^i) - b\Big).$$

Further, using the fact that the current actions do not depend on the previous rewards, PGT refines REINFORCE and obtains the following gradient estimator:

$$g(\tau_i, \theta) = \sum_{h=0}^{H-1} \nabla_\theta \log \pi_\theta(a_h^i|s_h^i)\Big(\sum_{j=h}^{H-1} \gamma^j R(s_j^i, a_j^i) - b_j\Big).$$

Meanwhile, the PGT estimator is equivalent to the popular GPOMDP estimator, defined as

$$g(\tau_i, \theta) = \sum_{h=0}^{H-1} \Big(\sum_{j=0}^{h} \nabla_\theta \log \pi_\theta(a_j^i|s_j^i)\Big)\big(\gamma^h R(s_h^i, a_h^i) - b_h\big).$$

## 3. Momentum-Based Policy Gradients

In this section, we propose a class of fast momentum-based policy gradient methods based on a new momentum-based variance-reduction method, i.e., STORM (Cutkosky & Orabona, 2019). Although STORM is effective for oblivious learning problems, it is not well suited to the non-oblivious learning problem, where the underlying distribution $p(\cdot)$ depends on the variable $\theta$ and varies throughout the optimization procedure. To deal with this challenge, we apply two effective techniques, importance sampling (Metelli et al., 2018; Papini et al., 2018) and the Hessian-aided technique (Shen et al., 2019), and propose the corresponding policy gradient methods, respectively.

### 3.1. Importance-Sampling Momentum-Based Policy Gradient

In this subsection, we propose a fast importance-sampling momentum-based policy gradient (IS-MBPG) method based on the importance sampling technique. Algorithm 1 describes the algorithmic framework of the IS-MBPG method.

**Algorithm 1** Importance-Sampling Momentum-Based Policy Gradient (IS-MBPG)
1: Input: Total number of iterations $T$, parameters $\{k, m, c\}$ and initial input $\theta_1$;
2: for $t = 1, 2, \ldots, T$ do
3: if $t = 1$ then
4: Sample a trajectory $\tau_1$ from $p(\tau|\theta_1)$, and compute $u_1 = g(\tau_1|\theta_1)$;
5: else
6: Sample a trajectory $\tau_t$ from $p(\tau|\theta_t)$, and compute $u_t = \beta_t g(\tau_t|\theta_t) + (1-\beta_t)\big[u_{t-1} + g(\tau_t|\theta_t) - w(\tau_t|\theta_{t-1}, \theta_t)\, g(\tau_t|\theta_{t-1})\big]$, where the importance sampling weight $w(\tau_t|\theta_{t-1}, \theta_t)$ is computed by (7);
7: end if
8: Compute $G_t = \|g(\tau_t|\theta_t)\|$;
9: Compute $\eta_t = \frac{k}{(m + \sum_{i=1}^{t} G_i^2)^{1/3}}$;
10: Update $\theta_{t+1} = \theta_t + \eta_t u_t$;
11: Update $\beta_{t+1} = c\eta_t^2$;
12: end for
13: Output: $\theta_\zeta$ chosen uniformly at random from $\{\theta_t\}_{t=1}^{T}$.

Since problem (3) is non-oblivious (non-stationary), i.e., the underlying distribution $p(\tau|\theta)$ depends on the variable $\theta$ and varies throughout the optimization procedure, we have $\mathbb{E}_{\tau \sim p(\tau|\theta)}[g(\tau|\theta) - g(\tau|\theta')] \neq \nabla J(\theta) - \nabla J(\theta')$. Given $\tau$ sampled from $p(\tau|\theta)$, we define the importance sampling weight

$$w(\tau|\theta', \theta) = \frac{p(\tau|\theta')}{p(\tau|\theta)} = \prod_{h=0}^{H-1} \frac{\pi_{\theta'}(a_h|s_h)}{\pi_\theta(a_h|s_h)} \qquad (7)$$

to obtain $\mathbb{E}_{\tau \sim p(\tau|\theta)}\big[g(\tau|\theta) - w(\tau|\theta', \theta)\, g(\tau|\theta')\big] = \nabla J(\theta) - \nabla J(\theta')$. In Algorithm 1, we use the following momentum-based variance-reduced stochastic gradient:

$$u_t = (1-\beta_t)\underbrace{\big[u_{t-1} + g(\tau_t|\theta_t) - w(\tau_t|\theta_{t-1}, \theta_t)\, g(\tau_t|\theta_{t-1})\big]}_{\text{SARAH}} + \beta_t \underbrace{g(\tau_t|\theta_t)}_{\text{SGD}},$$

where $\beta_t \in (0, 1]$. When $\beta_t = 1$, $u_t$ reduces to the vanilla stochastic policy gradient used in REINFORCE. When $\beta_t = 0$, it reduces to the SARAH-based stochastic policy gradient used in SRVR-PG.
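For illustration only, one iteration of Algorithm 1 (steps 6-11) could be sketched as follows; `sample_traj`, `grad_fn` (the PGT/GPOMDP estimator $g(\tau|\theta)$) and `weight_fn` (the importance weight in (7)) are hypothetical callbacks standing in for environment and policy code, not the implementation released with the paper.

```python
import numpy as np

def is_mbpg_step(theta, theta_prev, u_prev, beta, sum_G_sq,
                 sample_traj, grad_fn, weight_fn, k, m, c):
    """One IS-MBPG iteration (Algorithm 1, steps 6-11), written as a sketch."""
    tau = sample_traj(theta)                    # one trajectory from p(.|theta_t)
    g_new = grad_fn(tau, theta)                 # g(tau_t | theta_t)
    g_old = grad_fn(tau, theta_prev)            # g(tau_t | theta_{t-1})
    w = weight_fn(tau, theta_prev, theta)       # w(tau_t | theta_{t-1}, theta_t), eq. (7)
    # Step 6: momentum-based variance-reduced gradient estimator.
    u = beta * g_new + (1.0 - beta) * (u_prev + g_new - w * g_old)
    # Steps 8-9: adaptive learning rate from the running sum of squared gradient norms.
    sum_G_sq = sum_G_sq + np.linalg.norm(g_new) ** 2
    eta = k / (m + sum_G_sq) ** (1.0 / 3.0)
    # Step 10: ascent step on J(theta); Step 11: momentum weight for the next iteration.
    theta_next = theta + eta * u
    beta_next = c * eta ** 2
    return theta_next, theta, u, beta_next, sum_G_sq
```

Note how $\beta_{t+1} = c\eta_t^2$ couples the momentum weight to the adaptive step size, so both shrink as the accumulated gradient norms grow.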
Let $e_t = u_t - \nabla J(\theta_t)$. It is easily verified that

$$\mathbb{E}[e_t] = \mathbb{E}\Big[(1-\beta_t)e_{t-1} + \beta_t\underbrace{\big(g(\tau_t|\theta_t) - \nabla J(\theta_t)\big)}_{T_1} + (1-\beta_t)\underbrace{\big(g(\tau_t|\theta_t) - w(\tau_t|\theta_{t-1}, \theta_t)\, g(\tau_t|\theta_{t-1}) - \nabla J(\theta_t) + \nabla J(\theta_{t-1})\big)}_{T_2}\Big] = (1-\beta_t)\mathbb{E}[e_{t-1}], \qquad (8)$$

where the last equality holds since $\mathbb{E}_{\tau_t \sim p(\tau|\theta_t)}[T_1] = 0$ and $\mathbb{E}_{\tau_t \sim p(\tau|\theta_t)}[T_2] = 0$. By the Cauchy-Schwarz inequality, we obtain

$$\mathbb{E}\|e_t\|^2 \le (1-\beta_t)^2 \mathbb{E}\|e_{t-1}\|^2 + 2\beta_t^2 \mathbb{E}\|T_1\|^2 + 2(1-\beta_t)^2 \mathbb{E}\|T_2\|^2. \qquad (9)$$

Since $O(\|T_2\|^2) = O(\|\theta_t - \theta_{t-1}\|^2) = O(\eta_t^2\|u_t\|^2)$, we can choose appropriate $\eta_t$ and $\beta_t$ to reduce the variance of the stochastic gradient $u_t$. From the theoretical results below, our IS-MBPG algorithm can generate an adaptive and monotonically decreasing learning rate $\eta_t \in (0, \frac{1}{2L}]$ and a monotonically decreasing parameter $\beta_t \in (0, 1]$.

### 3.2. Hessian-Aided Momentum-Based Policy Gradient

In this subsection, we propose a fast Hessian-aided momentum-based policy gradient (HA-MBPG) method based on the Hessian-aided technique. Algorithm 2 describes the algorithmic framework of the HA-MBPG method.

**Algorithm 2** Hessian-Aided Momentum-Based Policy Gradient (HA-MBPG)
1: Input: Total number of iterations $T$, parameters $\{k, m, c\}$ and initial input $\theta_1$;
2: for $t = 1, 2, \ldots, T$ do
3: if $t = 1$ then
4: Sample a trajectory $\tau_1$ from $p(\tau|\theta_1)$, and compute $u_1 = g(\tau_1|\theta_1)$;
5: else
6: Choose $\alpha$ uniformly at random from $[0, 1]$, and compute $\theta_t(\alpha) = \alpha\theta_t + (1-\alpha)\theta_{t-1}$;
7: Sample a trajectory $\tau_t$ from $p(\tau|\theta_t(\alpha))$, and compute $u_t = \beta_t w(\tau_t|\theta_t, \theta_t(\alpha))\, g(\tau_t|\theta_t) + (1-\beta_t)\big(u_{t-1} + \Delta_t\big)$, where $w(\tau_t|\theta_t, \theta_t(\alpha))$ and $\Delta_t$ are computed by (7) and (11), respectively;
8: end if
9: Compute $G_t = \|g(\tau_t|\theta_t)\|$;
10: Compute $\eta_t = \frac{k}{(m + \sum_{i=1}^{t} G_i^2)^{1/3}}$;
11: Update $\theta_{t+1} = \theta_t + \eta_t u_t$;
12: Update $\beta_{t+1} = c\eta_t^2$;
13: end for
14: Output: $\theta_\zeta$ chosen uniformly at random from $\{\theta_t\}_{t=1}^{T}$.

In Algorithm 2, at step 7, we use an unbiased term $\Delta_t$, i.e., $\mathbb{E}_{\tau_t \sim p(\tau|\theta_t(\alpha))}[\Delta_t] = \nabla J(\theta_t) - \nabla J(\theta_{t-1})$, instead of the biased term $g(\tau|\theta_t) - g(\tau|\theta_{t-1})$. To construct the term $\Delta_t$, we first assume that the function $J(\theta)$ is twice differentiable, as in (Furmston et al., 2016; Shen et al., 2019). By Taylor's expansion (or the Newton-Leibniz formula), the gradient difference $\nabla J(\theta_t) - \nabla J(\theta_{t-1})$ can be written as

$$\nabla J(\theta_t) - \nabla J(\theta_{t-1}) = \Big[\int_0^1 \nabla^2 J(\theta_t(\alpha))\, d\alpha\Big] v_t, \qquad (10)$$

where $v_t = \theta_t - \theta_{t-1}$ and $\theta_t(\alpha) = \alpha\theta_t + (1-\alpha)\theta_{t-1}$ for some $\alpha \in [0, 1]$. Following (Furmston et al., 2016; Shen et al., 2019), we obtain the policy Hessian $\nabla^2 J(\theta)$ as

$$\nabla^2 J(\theta) = \mathbb{E}_{\tau \sim p(\tau|\theta)}\Big[\big(\nabla \log p(\tau|\theta)\, \nabla \log p(\tau|\theta)^\top + \nabla^2 \log p(\tau|\theta)\big) R(\tau)\Big] = \mathbb{E}_{\tau \sim p(\tau|\theta)}\Big[\nabla \Phi(\tau|\theta)\, \nabla \log p(\tau|\theta)^\top + \nabla^2 \Phi(\tau|\theta)\Big],$$

where $\Phi(\tau|\theta) = \sum_{h=0}^{H-1} \sum_{j=h}^{H-1} \gamma^j r(s_j, a_j) \log \pi_\theta(a_h|s_h)$. Given the random tuple $(\alpha, \tau)$, where $\alpha$ is sampled uniformly from $[0, 1]$ and $\tau$ is sampled from the distribution $p(\tau|\theta_t(\alpha))$, we construct $\Delta_t$ as

$$\Delta_t := \hat\nabla^2(\theta_t(\alpha), \tau)\, v_t, \qquad (11)$$

where $\mathbb{E}_{\tau \sim p(\tau|\theta_t(\alpha))}[\hat\nabla^2(\theta_t(\alpha), \tau)] = \nabla^2 J(\theta_t(\alpha))$ and $\hat\nabla^2(\theta_t(\alpha), \tau) = \nabla\Phi(\tau|\theta_t(\alpha))\, \nabla\log p(\tau|\theta_t(\alpha))^\top + \nabla^2\Phi(\tau|\theta_t(\alpha))$. Note that $\mathbb{E}_{\bar\alpha \sim U[0,1]}[\nabla^2 J(\theta_t(\bar\alpha))] = \int_0^1 \nabla^2 J(\theta_t(\alpha))\, d\alpha$, which implies that $\nabla^2 J(\theta_t(\bar\alpha))$ with $\bar\alpha$ uniformly sampled from $[0, 1]$ is an unbiased estimator of the integrated Hessian. Given $\bar\alpha$, we have $\mathbb{E}_{\tau \sim p(\tau|\theta_t(\bar\alpha))}[\hat\nabla^2(\theta_t(\bar\alpha), \tau)] = \nabla^2 J(\theta_t(\bar\alpha))$. According to equation (10), we thus have $\mathbb{E}_{\alpha \sim U[0,1],\, \tau \sim p(\tau|\theta_t(\alpha))}[\Delta_t] = \nabla J(\theta_t) - \nabla J(\theta_{t-1})$, where $U[0, 1]$ denotes the uniform distribution over $[0, 1]$. Next, we rewrite (11) as

$$\Delta_t = \big(\nabla \log p(\tau|\theta_t(\alpha))^\top v_t\big)\, \nabla\Phi(\tau|\theta_t(\alpha)) + \nabla^2\Phi(\tau|\theta_t(\alpha))\, v_t. \qquad (12)$$

Since the second term in (12) is a time-consuming Hessian-vector product, in practice we can use the finite-difference method to estimate $\nabla^2\Phi(\tau|\theta_t(\alpha)) v_t$ as

$$\nabla^2\Phi(\tau|\theta_t(\alpha))\, v_t \approx \frac{\nabla\Phi(\tau|\theta_t(\alpha) + \delta v_t) - \nabla\Phi(\tau|\theta_t(\alpha) - \delta v_t)}{2\delta} = \nabla^2\Phi(\tau|\tilde\theta_t(\alpha))\, v_t, \qquad (13)$$

where $\delta > 0$ is very small and $\tilde\theta_t(\alpha) \in [\theta_t(\alpha) - \delta v_t,\, \theta_t(\alpha) + \delta v_t]$ is obtained by the mean-value theorem.
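As an illustration of (12) and the finite-difference approximation (13), consider the following sketch; `grad_phi(theta)` (returning $\nabla_\theta\Phi(\tau|\theta)$ for the fixed sampled trajectory), `grad_log_p(theta)` (returning $\nabla_\theta\log p(\tau|\theta)$) and the default value of `delta` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def hvp_finite_difference(grad_phi, theta_alpha, v, delta=1e-5):
    """Central-difference approximation of the Hessian-vector product
    grad^2 Phi(tau|theta(alpha)) v in (13), using only two gradient evaluations."""
    return (grad_phi(theta_alpha + delta * v) - grad_phi(theta_alpha - delta * v)) / (2.0 * delta)

def delta_t(grad_phi, grad_log_p, theta_alpha, v, delta=1e-5):
    """Estimator Delta_t in (12): a rank-one term plus the Hessian-vector product,
    with the latter replaced by the finite-difference approximation above."""
    rank_one_term = float(np.dot(grad_log_p(theta_alpha), v)) * grad_phi(theta_alpha)
    hvp_term = hvp_finite_difference(grad_phi, theta_alpha, v, delta)
    return rank_one_term + hvp_term
```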
Supposing $\Phi(\tau|\theta)$ is $L_2$-second-order smooth, we can upper bound the approximation error:

$$\big\|\nabla^2\Phi(\tau|\theta_t(\alpha))\, v_t - \nabla^2\Phi(\tau|\tilde\theta_t(\alpha))\, v_t\big\| \le L_2 \|v_t\| \delta. \qquad (14)$$

Thus, we can take a sufficiently small $\delta$ to obtain an arbitrarily small approximation error. In Algorithm 2, we use the following momentum-based variance-reduced stochastic gradient:

$$u_t = \beta_t w(\tau|\theta_t, \theta_t(\alpha))\, g(\tau|\theta_t) + (1-\beta_t)\big(u_{t-1} + \Delta_t\big),$$

where $\beta_t \in (0, 1]$. When $\beta_t = 1$, $u_t$ reduces to the vanilla stochastic policy gradient used in REINFORCE. When $\beta_t = 0$, it reduces to the Hessian-aided stochastic policy gradient used in HAPG. Let $e_t = u_t - \nabla J(\theta_t)$. It is also easily verified that

$$\mathbb{E}[e_t] = \mathbb{E}\Big[(1-\beta_t)e_{t-1} + \beta_t\underbrace{\big(w(\tau|\theta_t, \theta_t(\alpha))\, g(\tau|\theta_t) - \nabla J(\theta_t)\big)}_{T_3} + (1-\beta_t)\underbrace{\big(\Delta_t - \nabla J(\theta_t) + \nabla J(\theta_{t-1})\big)}_{T_4}\Big] = (1-\beta_t)\mathbb{E}[e_{t-1}], \qquad (15)$$

where the last equality holds since $\mathbb{E}_{\tau \sim p(\tau|\theta_t(\alpha))}[T_3] = 0$ and $\mathbb{E}_{\tau \sim p(\tau|\theta_t(\alpha))}[T_4] = 0$. Similarly, by the Cauchy-Schwarz inequality, we obtain

$$\mathbb{E}\|e_t\|^2 \le (1-\beta_t)^2 \mathbb{E}\|e_{t-1}\|^2 + 2\beta_t^2 \mathbb{E}\|T_3\|^2 + 2(1-\beta_t)^2 \mathbb{E}\|T_4\|^2. \qquad (16)$$

Since $O(\|T_4\|^2) = O(\|\theta_t - \theta_{t-1}\|^2) = O(\eta_t^2\|u_t\|^2)$, we can choose appropriate $\eta_t$ and $\beta_t$ to reduce the variance of the stochastic gradient $u_t$. From the theoretical results below, our HA-MBPG algorithm can also generate an adaptive and monotonically decreasing learning rate $\eta_t \in (0, \frac{1}{2L}]$ and a monotonically decreasing parameter $\beta_t \in (0, 1]$.

### 3.3. Non-Adaptive IS-MBPG*

In this subsection, we propose a non-adaptive version of the IS-MBPG algorithm, i.e., IS-MBPG*, which is given in Algorithm 3. Specifically, Algorithm 3 applies a simple monotonically decreasing learning rate $\eta_t$ that only depends on the iteration counter $t$.

**Algorithm 3** IS-MBPG*
1: Input: Total number of iterations $T$, parameters $\{k, m, c\}$ and initial input $\theta_1$;
2: for $t = 1, 2, \ldots, T$ do
3: if $t = 1$ then
4: Sample a trajectory $\tau_1$ from $p(\tau|\theta_1)$, and compute $u_1 = g(\tau_1|\theta_1)$;
5: else
6: Sample a trajectory $\tau_t$ from $p(\tau|\theta_t)$, and compute $u_t = \beta_t g(\tau_t|\theta_t) + (1-\beta_t)\big[u_{t-1} + g(\tau_t|\theta_t) - w(\tau_t|\theta_{t-1}, \theta_t)\, g(\tau_t|\theta_{t-1})\big]$;
7: end if
8: Compute $\eta_t = \frac{k}{(m+t)^{1/3}}$;
9: Update $\theta_{t+1} = \theta_t + \eta_t u_t$;
10: Update $\beta_{t+1} = c\eta_t^2$;
11: end for
12: Output: $\theta_\zeta$ chosen uniformly at random from $\{\theta_t\}_{t=1}^{T}$.

## 4. Convergence Analysis

In this section, we study the convergence properties of our algorithms, i.e., IS-MBPG, HA-MBPG and IS-MBPG*. All related proofs are provided in the supplementary material. We first state some assumptions.

**Assumption 1.** The gradient and Hessian of the function $\log\pi_\theta(a|s)$ are bounded, i.e., there exist constants $M_g, M_h > 0$ such that

$$\|\nabla_\theta \log\pi_\theta(a|s)\| \le M_g, \qquad \|\nabla^2_\theta \log\pi_\theta(a|s)\| \le M_h. \qquad (17)$$

**Assumption 2.** The variance of the stochastic gradient $g(\tau|\theta)$ is bounded, i.e., there exists a constant $\sigma > 0$ such that, for all $\pi_\theta$, $\mathbb{V}(g(\tau|\theta)) = \mathbb{E}\|g(\tau|\theta) - \nabla J(\theta)\|^2 \le \sigma^2$.

**Assumption 3.** The variance of the importance sampling weight $w(\tau|\theta_1, \theta_2) = p(\tau|\theta_1)/p(\tau|\theta_2)$ is bounded, i.e., there exists a constant $W > 0$ such that $\mathbb{V}(w(\tau|\theta_1, \theta_2)) \le W$ for any $\theta_1, \theta_2 \in \mathbb{R}^d$ and $\tau \sim p(\tau|\theta_2)$.

Assumptions 1 and 2 are commonly used in the convergence analysis of policy gradient algorithms (Papini et al., 2018; Xu et al., 2019a;b; Shen et al., 2019). Assumption 3 has been used in the study of variance-reduced policy gradient algorithms (Papini et al., 2018; Xu et al., 2019a;b). Note that the bounded importance sampling weight in Assumption 3 might be violated in practice. For example, when a neural network (NN) is used as the policy, small perturbations of $\theta$ can cause a large gap in the pointwise probability due to some activation functions in NNs, which results in very large importance sampling weights. Thus, we generally clip the importance sampling weights to make our algorithms more effective.
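For illustration, the importance weight in (7) together with the clipping just mentioned could be computed as follows; evaluating the product in log space and the particular clipping threshold are our own choices for this sketch and are not prescribed by the paper.

```python
import numpy as np

def clipped_is_weight(logp_target, logp_behavior, clip_max=10.0):
    """Importance sampling weight w(tau|theta', theta) from (7), computed in log
    space for numerical stability and clipped from above to guard against the
    blow-up described in the text. `logp_target[h]` and `logp_behavior[h]` hold
    log pi_{theta'}(a_h|s_h) and log pi_theta(a_h|s_h) along the sampled
    trajectory, where theta is the policy the trajectory was sampled from."""
    log_w = np.sum(logp_target) - np.sum(logp_behavior)
    return min(float(np.exp(log_w)), clip_max)
```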
Based on Assumption 1, we state some useful properties of the stochastic gradient $g(\tau|\theta)$ and of $\hat\nabla^2(\theta, \tau)$, respectively.

**Proposition 1.** (Proposition 4.2 in (Xu et al., 2019b)) Suppose $g(\tau|\theta)$ is the PGT estimator. Under Assumption 1, we have: 1) $g(\tau|\theta)$ is $\hat L$-Lipschitz differentiable, i.e., $\|g(\tau|\theta) - g(\tau|\theta')\| \le \hat L\|\theta - \theta'\|$ with $\hat L = M_h R/(1-\gamma)^2$; 2) $J(\theta)$ is $\hat L$-smooth, i.e., $\|\nabla^2 J(\theta)\| \le \hat L$; 3) $g(\tau|\theta)$ is bounded, i.e., $\|g(\tau|\theta)\| \le G$ for all $\theta \in \mathbb{R}^d$ with $G = M_g R/(1-\gamma)^2$.

Since $\|\nabla J(\theta)\| = \|\mathbb{E}[g(\tau|\theta)]\| \le \mathbb{E}\|g(\tau|\theta)\| \le G$, Proposition 1 implies that $J(\theta)$ is $G$-Lipschitz. Without loss of generality, we use the PGT estimator to generate the gradient $g(\tau|\theta_t)$ in our algorithms, so $G_t = \|g(\tau|\theta_t)\| \le G$.

**Proposition 2.** (Lemma 4.1 in (Shen et al., 2019)) Under Assumption 1, for all $\theta$ we have

$$\|\hat\nabla^2(\theta, \tau)\|^2 \le \frac{H^2 M_g^4 R^2 + M_h^2 R^2}{(1-\gamma)^4} = \tilde L^2. \qquad (18)$$

Since $\|\nabla^2 J(\theta)\| = \|\mathbb{E}[\hat\nabla^2(\theta, \tau)]\| \le \mathbb{E}\|\hat\nabla^2(\theta, \tau)\| \le \tilde L$, Proposition 2 implies that $J(\theta)$ is $\tilde L$-smooth. Let $L = \max(\hat L, \tilde L)$, so $J(\theta)$ is $L$-smooth.

### 4.1. Convergence Analysis of the IS-MBPG Algorithm

In this subsection, we analyze the convergence properties of the IS-MBPG algorithm; the detailed proof is provided in Appendix A.1. For notational simplicity, let $B^2 = L^2 + 2G^2C_w^2$ with $C_w = \sqrt{H(2HM_g^2 + M_h)(W+1)}$.

**Theorem 1.** Assume that the sequence $\{\theta_t\}_{t=1}^T$ is generated by Algorithm 1. Setting $k = O(\frac{G^{2/3}}{L})$, $c = \frac{G^2}{3k^3L} + 104B^2$, $m = \max\{2G^2, (2Lk)^3, (\frac{ck}{2L})^3\}$ and $\eta_0 = \frac{k}{m^{1/3}}$, we have

$$\mathbb{E}\|\nabla J(\theta_\zeta)\| = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla J(\theta_t)\| \le \frac{2\sqrt{\Omega}\, m^{1/6}}{\sqrt{T}} + \frac{2\Omega^{3/4}}{T^{1/3}}, \qquad (19)$$

where $\Omega = \frac{1}{k}\big[16(J^* - J(\theta_1)) + \frac{m^{1/3}\sigma^2}{8B^2 k} + \frac{c^2k^3}{4B^2}\ln(T+2)\big]$ with $J^* = \sup_\theta J(\theta) < +\infty$.

**Remark 1.** Since $\Omega = O(\ln(T))$, Theorem 1 shows that the IS-MBPG algorithm has an $O(\sqrt{\ln(T)}/T^{1/3})$ convergence rate. The IS-MBPG algorithm needs one trajectory to estimate the stochastic policy gradient $u_t$ at each iteration, and needs $T$ iterations. Without loss of generality, we omit the relatively small factor $\sqrt{\ln(T)}$. By $T^{-1/3} \le \epsilon$, we choose $T = \epsilon^{-3}$. Thus, IS-MBPG has a sample complexity of $1 \cdot T = O(\epsilon^{-3})$ for finding an $\epsilon$-stationary point.

### 4.2. Convergence Analysis of the HA-MBPG Algorithm

In this subsection, we study the convergence properties of the HA-MBPG algorithm; the detailed proof is provided in Appendix A.2.

**Theorem 2.** Assume that the sequence $\{\theta_t\}_{t=1}^T$ is generated by Algorithm 2, and let $k = O(\frac{G^{2/3}}{L})$, $c = \frac{G^2}{3k^3L} + 52L^2$, $m = \max\{2G^2, (2Lk)^3, (\frac{ck}{2L})^3\}$ and $\eta_0 = \frac{k}{m^{1/3}}$. Then we have

$$\mathbb{E}\|\nabla J(\theta_\zeta)\| = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla J(\theta_t)\| \le \frac{2\sqrt{\Lambda}\, m^{1/6}}{\sqrt{T}} + \frac{2\Lambda^{3/4}}{T^{1/3}},$$

where $\Lambda = \frac{1}{k}\big[16(J^* - J(\theta_1)) + \frac{m^{1/3}\sigma^2}{4L^2 k} + \frac{(W+1)c^2k^3}{2L^2}\ln(T+2)\big]$ with $J^* = \sup_\theta J(\theta) < +\infty$.

**Remark 2.** Since $\Lambda = O(\ln(T))$, Theorem 2 shows that the HA-MBPG algorithm has an $O(\sqrt{\ln(T)}/T^{1/3})$ convergence rate. The HA-MBPG algorithm needs one trajectory to estimate the stochastic policy gradient $u_t$ at each iteration, and needs $T$ iterations. Without loss of generality, we omit the relatively small factor $\sqrt{\ln(T)}$. By $T^{-1/3} \le \epsilon$, we choose $T = \epsilon^{-3}$. Thus, HA-MBPG has a sample complexity of $1 \cdot T = O(\epsilon^{-3})$ for finding an $\epsilon$-stationary point.

### 4.3. Convergence Analysis of the IS-MBPG* Algorithm

In this subsection, we give the convergence properties of the IS-MBPG* algorithm; the detailed proof is provided in Appendix A.3.
**Theorem 3.** Assume that the sequence $\{\theta_t\}_{t=1}^T$ is generated by Algorithm 3, and let $B^2 = L^2 + 2G^2C_w^2$, $k > 0$, $c = \frac{1}{3k^3L} + 104B^2$, $m = \max\{2, (2Lk)^3, (\frac{ck}{2L})^3\}$ and $\eta_0 = \frac{k}{m^{1/3}}$. Then we have

$$\mathbb{E}\|\nabla J(\theta_\zeta)\| = \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla J(\theta_t)\| \le \frac{2\sqrt{\Gamma}\, m^{1/6}}{\sqrt{T}} + \frac{2\Gamma^{3/4}}{T^{1/3}},$$

where $\Gamma = \frac{1}{k}\big[16(J^* - J(\theta_1)) + \frac{m^{1/3}\sigma^2}{8B^2 k} + \frac{c^2k^3\sigma^2}{4B^2}\ln(T+2)\big]$ with $J^* = \sup_\theta J(\theta) < +\infty$.

**Remark 3.** Since $\Gamma = O(\ln(T))$, Theorem 3 shows that the IS-MBPG* algorithm has an $O(\sqrt{\ln(T)}/T^{1/3})$ convergence rate. The IS-MBPG* algorithm needs one trajectory to estimate the stochastic policy gradient $u_t$ at each iteration, and needs $T$ iterations. Without loss of generality, we omit the relatively small factor $\sqrt{\ln(T)}$. By $T^{-1/3} \le \epsilon$, we choose $T = \epsilon^{-3}$. Thus, IS-MBPG* also has a sample complexity of $1 \cdot T = O(\epsilon^{-3})$ for finding an $\epsilon$-stationary point.

## 5. Experiments

In this section, we demonstrate the performance of our algorithms on four standard reinforcement learning tasks: CartPole, Walker, HalfCheetah and Hopper. The first is a discrete task from classic control, and the latter three are continuous RL tasks from the popular MuJoCo environments (Todorov et al., 2012). A detailed description of these environments is shown in Fig. 1. Our code is publicly available at https://github.com/gaosh/MBPG.

[Figure 1. The four environments we used. (a) CartPole: balance a pole on a cart; (b) Walker: make a 2D robot walk; (c) Hopper: make a 2D robot hop; (d) HalfCheetah: make a 2D cheetah robot run.]

### 5.1. Experimental Setup

In the experiments, we use a categorical policy for CartPole, and a Gaussian policy for all the other environments. All policies are parameterized by fully connected neural networks. The details of the network architectures and activation functions used are given in Appendix A; the network settings are similar to those of the HAPG algorithm (Shen et al., 2019). We implement our algorithms using garage (garage contributors, 2019) and PyTorch (Paszke et al., 2019). Note that previous works mostly use environments implemented by old versions of garage, while the latest version of garage directly uses environments from gym (Brockman et al., 2016). As a result, there might be an inconsistency in the reward calculation between this paper and previous works due to the difference in environment implementation.

[Figure 2. Experimental results of our algorithms (IS-MBPG and HA-MBPG) and the baseline algorithms on the four environments: (a) CartPole, (b) Walker, (c) Hopper, (d) HalfCheetah. Each panel plots the average episode return against the number of system probes.]

In the experiments, we compare our algorithms with the two best existing algorithms, Hessian-Aided Policy Gradient (HAPG) (Shen et al., 2019) and Stochastic Recursive Variance-Reduced Policy Gradient (SRVR-PG) (Xu et al., 2019b), and with a baseline algorithm, REINFORCE (Sutton et al., 2000). For a fair comparison, the policies of all methods use the same initialization, which ensures that they have a similar starting point. Moreover, to reduce the impact of randomness, we run each method 10 times, and plot the mean as well as the variance interval for each of them. In addition, for the purpose of fair comparison, we use the same batch size $|B|$ for all algorithms, although our algorithms do not place any requirement on it. HAPG and SRVR-PG have sub-iterations (an inner loop) and require additional hyperparameters; the inner batch size for HAPG and SRVR-PG is also set to the same value.
For all the other hyperparameters, we try to keep them analogous to the settings in the original papers. One may argue that our algorithms need three hyperparameters $k$, $m$ and $c$ to control the evolution of the learning rate, whereas for the other algorithms a single hyperparameter is enough to control the learning rate. However, it should be noted that our algorithms do not involve any sub-iterations, unlike HAPG and SRVR-PG. Introducing sub-iterations naturally brings more hyperparameters, such as the number of sub-iterations and the inner batch size. From this perspective, the hyperparameter complexity of our algorithms resembles that of HAPG and SRVR-PG. More details of the hyperparameter selection are given in Appendix A.

Similar to the HAPG algorithm, we use the number of system probes (i.e., the number of state transitions) as the measure of sample complexity instead of the number of trajectories. The reason is that each trajectory may have a different number of states due to a failure flag returned by the environment (which often happens at the beginning of training). Moreover, if the number of trajectories were used as the complexity measure and the environment can return a failure flag, a faster algorithm may incur many more system probes for the same number of trajectories. We also report the average episode return, as in HAPG (Shen et al., 2019).

[Figure 3. Different batch sizes for our algorithms in the CartPole environment: (a) IS-MBPG, (b) HA-MBPG.]

[Figure 4. Results of the IS-MBPG and IS-MBPG* algorithms in the CartPole and Walker environments.]

### 5.2. Experimental Results

The results of the experiments are presented in Fig. 2. In the CartPole environment, our IS-MBPG and HA-MBPG algorithms perform better than the other methods. In the Walker environment, our algorithms show even larger advantages. Specifically, the average return of IS-MBPG and HA-MBPG grows rapidly at the beginning of training. Moreover, our IS-MBPG algorithm achieves the best final performance by a clear margin. HA-MBPG performs similarly to SRVR-PG and HAPG, although it has an advantage at the beginning. In the Hopper environment, our IS-MBPG and HA-MBPG algorithms are significantly faster than all other methods, while the final average rewards are similar across algorithms. In the HalfCheetah environment, IS-MBPG, HA-MBPG and SRVR-PG perform similarly at the beginning. At the end of training, IS-MBPG achieves the best performance. We note that HAPG performs poorly on this task, probably because of the normalized gradient and fixed learning rate in that algorithm. Across all tasks, HA-MBPG is consistently inferior to IS-MBPG. One possible reason for this observation is that we use the estimated Hessian-vector product instead of the exact Hessian-vector product in the HA-MBPG algorithm, which introduces additional estimation error.

In Fig. 3, we plot the average reward when changing the batch size in the CartPole environment. From Fig. 3, we find that with 20% of the original batch size, our HA-MBPG and IS-MBPG algorithms still outperform the HAPG and SRVR-PG algorithms, respectively. Even when the batch size is 1, our HA-MBPG and IS-MBPG algorithms still reach a good performance. These results demonstrate that our HA-MBPG and IS-MBPG algorithms are not sensitive to the choice of batch size. Fig. 4 shows that the non-adaptive IS-MBPG* algorithm also has performance similar to that of the adaptive IS-MBPG algorithm.
## 6. Conclusion

In this paper, we proposed a class of efficient momentum-based policy gradient methods (IS-MBPG and HA-MBPG), which use adaptive learning rates and do not require any large batches. Moreover, we proved that both the IS-MBPG and HA-MBPG methods reach the best known sample complexity of $O(\epsilon^{-3})$, while requiring only one trajectory at each iteration. In particular, we also presented a non-adaptive version of the IS-MBPG method (IS-MBPG*), which has a simple monotonically decreasing learning rate. We proved that IS-MBPG* also reaches the best known sample complexity of $O(\epsilon^{-3})$ with only one trajectory required at each iteration.

## Acknowledgements

We thank the anonymous reviewers for their helpful comments. We also thank the IT Help Desk at the University of Pittsburgh. This work was partially supported by U.S. NSF IIS 1836945, IIS 1836938, IIS 1845666, IIS 1852606, IIS 1838627, IIS 1837956.

## References

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In ICML, pp. 699–707, 2016.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365, 2019.

Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.

Cheng, C.-A., Yan, X., and Boots, B. Trajectory-wise control variates for variance reduction in policy gradient methods. arXiv preprint arXiv:1908.03263, 2019a.

Cheng, C.-A., Yan, X., Ratliff, N., and Boots, B. Predictor-corrector policy optimization. In International Conference on Machine Learning, pp. 1151–1161, 2019b.

Cutkosky, A. and Orabona, F. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pp. 15210–15219, 2019.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.

Deisenroth, M. P., Neumann, G., Peters, J., et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.

Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1049–1058. JMLR.org, 2017.

Fang, C., Li, C. J., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pp. 689–699, 2018.

Fellows, M., Ciosek, K., and Whiteson, S. Fourier policy gradients. In International Conference on Machine Learning, pp. 1486–1495, 2018.

Fujimoto, S., Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In ICML, pp. 1587–1596, 2018.

Furmston, T., Lever, G., and Barber, D. Approximate Newton methods for policy search in Markov decision processes. The Journal of Machine Learning Research, 17(1):8055–8105, 2016.

The garage contributors. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage, 2019.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Greensmith, E., Bartlett, P. L., and Baxter, J. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870, 2018.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Li, Y. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

Mao, H., Venkatakrishnan, S. B., Schwarzkopf, M., and Alizadeh, M. Variance reduction for reinforcement learning in input-driven environments. arXiv preprint arXiv:1807.02264, 2018.

Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pp. 5442–5454, 2018.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In ICML, pp. 2613–2621, 2017.

Palaniappan, B. and Bach, F. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pp. 1416–1424, 2016.

Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. Stochastic variance-reduced policy gradient. In 35th International Conference on Machine Learning, volume 80, pp. 4026–4035, 2018.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.

Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.

Pham, N. H., Nguyen, L. M., Phan, D. T., Nguyen, P. H., van Dijk, M., and Tran-Dinh, Q. A hybrid stochastic policy gradient algorithm for reinforcement learning. arXiv preprint arXiv:2003.00430, 2020.

Pirotta, M., Restelli, M., and Bascetta, L. Adaptive step-size for policy gradient methods. In Advances in Neural Information Processing Systems, pp. 1394–1402, 2013.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pp. 314–323, 2016.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

Shalev-Shwartz, S., Shammah, S., and Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.

Shen, Z., Ribeiro, A., Hassani, H., Qian, H., and Mi, C. Hessian aided policy gradient. In International Conference on Machine Learning, pp. 5729–5738, 2019.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012.

Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. A hybrid stochastic optimization framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1907.03793, 2019.

Wai, H.-T., Hong, M., Yang, Z., Wang, Z., and Tang, K. Variance reduced policy evaluation with smooth function approximation. In Advances in Neural Information Processing Systems, pp. 5776–5787, 2019.

Wang, L., Cai, Q., Yang, Z., and Wang, Z. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019a.

Wang, W. Y., Li, J., and He, X. Deep reinforcement learning for NLP. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, pp. 19–21, 2018.

Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. SpiderBoost and momentum: Faster variance reduction algorithms. In Advances in Neural Information Processing Systems, pp. 2403–2413, 2019b.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.

Xiong, H., Xu, T., Liang, Y., and Zhang, W. Non-asymptotic convergence of Adam-type reinforcement learning algorithms under Markovian sampling. arXiv preprint arXiv:2002.06286, 2020.

Xu, P., Gao, F., and Gu, Q. An improved convergence analysis of stochastic variance-reduced policy gradient. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 191, 2019a.

Xu, P., Gao, F., and Gu, Q. Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610, 2019b.

Xu, T., Liu, Q., and Peng, J. Stochastic variance reduction for policy gradient estimation. arXiv preprint arXiv:1710.06034, 2017.

Yuan, H., Lian, X., Liu, J., and Zhou, Y. Stochastic recursive momentum for policy gradient methods. arXiv preprint arXiv:2003.04302, 2020.

Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 3921–3932, 2018.