# Trust Region-Guided Proximal Policy Optimization

Yuhui Wang, Hao He, Xiaoyang Tan, Yaozhong Gan
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics
MIIT Key Laboratory of Pattern Analysis and Machine Intelligence
Collaborative Innovation Center of Novel Software Technology and Industrialization
{y.wang, hugo, x.tan, yzgancn}@nuaa.edu.cn

Proximal policy optimization (PPO) is one of the most popular deep reinforcement learning (RL) methods, achieving state-of-the-art performance across a wide range of challenging tasks. However, as a model-free RL method, the success of PPO relies heavily on the effectiveness of its exploratory policy search. In this paper, we give an in-depth analysis of the exploration behavior of PPO, and show that PPO is prone to suffer from a lack of exploration, especially under bad initialization, which may lead to the failure of training or entrapment in bad local optima. To address these issues, we propose a novel policy optimization method, named Trust Region-Guided PPO (TRGPPO), which adaptively adjusts the clipping range within the trust region. We formally show that this method not only improves the exploration ability within the trust region but also enjoys a better performance bound compared to the original PPO. Extensive experiments verify the advantage of the proposed method.

## 1 Introduction

Deep model-free reinforcement learning has achieved great successes in recent years, notably in video games [11], board games [19], robotics [10], and challenging control tasks [17, 5]. Among others, policy gradient (PG) methods are commonly used model-free policy search algorithms [14]. However, first-order optimization can be inaccurate in highly curved regions of the objective; an overconfident update can take a bad step that ruins the progress of training. Trust region policy optimization (TRPO) [16] and proximal policy optimization (PPO) [18] are two representative methods that address this issue. To ensure stable learning, both methods impose a constraint on the difference between the new policy and the old one, but with different policy metrics. In particular, TRPO uses a divergence between the policy distributions (total variation divergence or KL divergence), whereas PPO uses a probability ratio between the two policies (there is also a variant of PPO which penalizes the KL divergence; in this paper we refer to the variant that clips the probability ratio as PPO by default, as it performs better in practice). The divergence metric is theoretically justified, since optimizing the policy within the divergence constraint (named the trust region) leads to guaranteed monotonic performance improvement. Nevertheless, the complicated second-order optimization involved in TRPO makes it computationally inefficient and difficult to scale up to large-scale problems. PPO significantly reduces the complexity by adopting a clipping mechanism which allows it to use first-order optimization. PPO has proven to be very effective in dealing with a wide range of challenging tasks while being simple to implement and tune. However, how the underlying metric adopted for the policy constraint influences the behavior of the algorithm is not well understood. It is natural to expect that different metrics will yield RL algorithms with different exploration behaviors. In this paper, we give an in-depth analysis of the
exploration behavior of PPO, and show that the ratio-based metric of PPO tends to continuously weaken the likelihood of choosing an action in the future if that action is not preferred by the current policy. As a result, PPO is prone to suffer from a lack of exploration, especially under bad initialization, which may lead to the failure of training or entrapment in bad local optima. To address these issues, we propose an enhanced PPO method, named Trust Region-Guided PPO (TRGPPO), which is theoretically justified by an improved exploration ability and a better performance bound compared to the original PPO. In particular, TRGPPO constructs a connection between the ratio-based metric and the trust region-based one, such that the resulting ratio clipping mechanism relaxes the constraints imposed on the less preferred actions. This effectively encourages the policy to explore more on the potentially valuable actions, no matter whether they were preferred by the previous policies or not. Meanwhile, the ranges of the new ratio-based constraints are kept within the trust region; thus it does not harm the stability of learning. Extensive results on several benchmark tasks show that the proposed method significantly improves both the policy performance and the sample efficiency. Source code is available at https://github.com/wangyuhuix/TRGPPO.

## 2 Related Work

Many researchers have tried to improve proximal policy learning from different perspectives. Chen et al. also presented a so-called "adaptive clipping mechanism" for PPO [3]. Their method adaptively adjusts the scale of the policy gradient according to the significance of the state-action pair. They did not make any alteration to the clipping mechanism of PPO, while our method adopts a new adaptive clipping mechanism. Fakoor et al. used proximal learning with a penalty on the KL divergence to utilize off-policy data, which can effectively reduce the sample complexity [6]. In our previous work, we also introduced trust region-based clipping to improve the boundedness of the policy of PPO [22]. In this work, we instead use the trust region-based criterion to guide the adjustment of the clipping range, which requires additional computation but is more flexible and interpretable. Several methods have been proposed to improve exploration in recent research. Osband et al. tried to conduct consistent exploration using a posterior sampling method [12]. Fortunato et al. presented a method named NoisyNet to improve exploration by generating perturbations of the network weights [7]. Another popular algorithm is the soft actor-critic method (SAC) [9], which maximizes the expected reward and the entropy simultaneously.

## 3 Preliminaries

A Markov Decision Process (MDP) is described by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, c, \rho_1, \gamma)$. $\mathcal{S}$ and $\mathcal{A}$ are the state space and action space; $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution; $c: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function; $\rho_1$ is the distribution of the initial state $s_1$, and $\gamma \in (0, 1)$ is the discount factor. The return is the accumulated discounted reward from timestep $t$ onwards, $R^\gamma_t = \sum_{k=0}^{\infty} \gamma^k c(s_{t+k}, a_{t+k})$. The performance of a policy $\pi$ is defined as $\eta(\pi) = \mathbb{E}_{s \sim \rho^\pi, a \sim \pi}[c(s, a)]$, where $\rho^\pi(s) = (1-\gamma)\sum_{t=1}^{\infty}\gamma^{t-1}\rho^\pi_t(s)$ and $\rho^\pi_t$ is the density function of the state at time $t$.
Policy gradient methods [20] update the policy via the following surrogate performance objective:
$$L_{\pi_{old}}(\pi) = \mathbb{E}_{s \sim \rho^{\pi_{old}},\, a \sim \pi_{old}}\left[\frac{\pi(a|s)}{\pi_{old}(a|s)} A^{\pi_{old}}(s, a)\right] + \eta(\pi_{old}),$$
where $\pi(a|s)/\pi_{old}(a|s)$ is the probability ratio between the new policy $\pi$ and the old policy $\pi_{old}$, and $A^{\pi}(s, a) = \mathbb{E}[R^\gamma_t \,|\, s_t = s, a_t = a; \pi] - \mathbb{E}[R^\gamma_t \,|\, s_t = s; \pi]$ is the advantage value function of policy $\pi$. Let $D^s_{KL}(\pi_{old}, \pi) \triangleq D_{KL}(\pi_{old}(\cdot|s)\,\|\,\pi(\cdot|s))$. Schulman et al. [16] derived the following performance bound:

Theorem 1. Define $C = \max_{s,a}|A^{\pi_{old}}(s, a)| \cdot \frac{4\gamma}{(1-\gamma)^2}$ and $M_{\pi_{old}}(\pi) = L_{\pi_{old}}(\pi) - C \max_{s \in \mathcal{S}} D^s_{KL}(\pi_{old}, \pi)$. We have $\eta(\pi) \geq M_{\pi_{old}}(\pi)$ and $\eta(\pi_{old}) = M_{\pi_{old}}(\pi_{old})$.

This theorem implies that maximizing $M_{\pi_{old}}(\pi)$ guarantees a non-decreasing performance of the new policy $\pi$. To take larger steps in a robust way, TRPO optimizes $L_{\pi_{old}}(\pi)$ under the constraint $\max_{s \in \mathcal{S}} D^s_{KL}(\pi_{old}, \pi) \leq \delta$, which is called the trust region.

## 4 The Exploration Behavior of PPO

In this section we first give a brief review of PPO and then show how PPO suffers from an exploration issue when the initial policy is sufficiently far from the optimal one. PPO imposes the policy constraint through a clipped surrogate objective function:
$$L^{CLIP}_{\pi_{old}}(\pi) = \mathbb{E}\left[\min\left(\frac{\pi(a|s)}{\pi_{old}(a|s)} A^{\pi_{old}}(s, a),\ \mathrm{clip}\left(\frac{\pi(a|s)}{\pi_{old}(a|s)}, l_{s,a}, u_{s,a}\right) A^{\pi_{old}}(s, a)\right)\right],$$
where $l_{s,a} \in (0, 1)$ and $u_{s,a} \in (1, +\infty)$ are called the lower and upper clipping range on state-action $(s, a)$. The probability ratio $\pi(a|s)/\pi_{old}(a|s)$ is clipped once it is out of $(l_{s,a}, u_{s,a})$. Therefore, such a clipping mechanism can be considered as a constraint on the policy with a ratio-based metric, i.e., $l_{s,a} \leq \pi(a|s)/\pi_{old}(a|s) \leq u_{s,a}$, which can be rewritten as
$$-\pi_{old}(a|s)(1 - l_{s,a}) \leq \pi(a|s) - \pi_{old}(a|s) \leq \pi_{old}(a|s)(u_{s,a} - 1).$$
We call $(L^{l}_{\pi_{old}}(s, a), U^{u}_{\pi_{old}}(s, a)) \triangleq (-\pi_{old}(a|s)(1 - l_{s,a}),\ \pi_{old}(a|s)(u_{s,a} - 1))$ the feasible variation range of policy $\pi$ w.r.t. $\pi_{old}$ on state-action $(s, a)$ under the clipping range setting $(l, u)$; it measures the allowable change of policy $\pi$ on state-action $(s, a)$. Note that the original PPO adopts a constant setting of the clipping range, i.e., $l_{s,a} = 1-\epsilon$, $u_{s,a} = 1+\epsilon$ for any $(s, a)$ [18]. The corresponding feasible variation range is $(L^{1-\epsilon}_{\pi_{old}}(s, a), U^{1+\epsilon}_{\pi_{old}}(s, a)) = (-\pi_{old}(a|s)\epsilon,\ \pi_{old}(a|s)\epsilon)$. As can be seen, given an optimal action $a_{opt}$ and a sub-optimal one $a_{subopt}$ on state $s$, if $\pi_{old}(a_{opt}|s) < \pi_{old}(a_{subopt}|s)$, then $|(L^{1-\epsilon}_{\pi_{old}}(s, a_{opt}), U^{1+\epsilon}_{\pi_{old}}(s, a_{opt}))| < |(L^{1-\epsilon}_{\pi_{old}}(s, a_{subopt}), U^{1+\epsilon}_{\pi_{old}}(s, a_{subopt}))|$. This means that the allowable change of the likelihood of the optimal action, i.e., $\pi(a_{opt}|s)$, is smaller than that of $\pi(a_{subopt}|s)$. Since $\pi(a_{opt}|s)$ and $\pi(a_{subopt}|s)$ are in a zero-sum competition, such an unequal restriction may continuously weaken the likelihood of the optimal action and leave the policy trapped in a local optimum. We now give a formal illustration.

Algorithm 1 Simplified Policy Iteration with PPO
1: Initialize a policy $\pi_0$, $t \leftarrow 0$.
2: repeat
3:   Sample an action $\hat{a}_t \sim \pi_t$.
4:   Get the new policy $\hat{\pi}_{t+1}$ by optimizing the empirical surrogate objective function of PPO based on $\hat{a}_t$:
$$\hat{\pi}_{t+1}(a) = \begin{cases} \pi_t(a)\,u_a & a = \hat{a}_t \text{ and } c(a) > 0 \\ \pi_t(a)\,l_a & a = \hat{a}_t \text{ and } c(a) < 0 \\ \pi_t(a) - \dfrac{\pi_t(\hat{a}_t)u_{\hat{a}_t} - \pi_t(\hat{a}_t)}{|\mathcal{A}| - 1} & a \neq \hat{a}_t \text{ and } c(\hat{a}_t) > 0 \\ \pi_t(a) + \dfrac{\pi_t(\hat{a}_t)(1 - l_{\hat{a}_t})}{|\mathcal{A}| - 1} & a \neq \hat{a}_t \text{ and } c(\hat{a}_t) < 0 \\ \pi_t(a) & c(\hat{a}_t) = 0 \end{cases} \tag{2}$$
5:   $\pi_{t+1} = \mathrm{Normalize}(\hat{\pi}_{t+1})$. $t \leftarrow t + 1$. (Note that $\hat{\pi}_{t+1}$ may violate the probability rules, e.g., $\sum_a \hat{\pi}_{t+1}(a) > 1$, so a normalization operation is applied to rectify it; to simplify the analysis, we assume $\pi_{t+1} = \hat{\pi}_{t+1}$.)
6: until $\pi_t$ converges

We investigate the exploration behavior of PPO in the discrete-armed bandit problem, where there are no state transitions and the action space is discrete. The objective function of PPO in this problem is
$$L^{CLIP}_{\pi_{old}}(\pi) = \mathbb{E}\left[\min\left(\frac{\pi(a)}{\pi_{old}(a)}c(a),\ \mathrm{clip}\left(\frac{\pi(a)}{\pi_{old}(a)}, l_a, u_a\right)c(a)\right)\right].$$
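As a concrete reference, the following is a minimal NumPy sketch (not the authors' implementation) of how the empirical clipped surrogate $L^{CLIP}$ with the constant range $(1-\epsilon, 1+\epsilon)$ is typically estimated from a batch of samples; in the bandit case above the state argument simply disappears. The array names are assumptions for illustration.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Empirical PPO clipped surrogate objective (to be maximized).

    logp_new, logp_old: log pi(a_t|s_t) under the new / old policy, shape (T,)
    adv:                estimated advantages (or rewards c(a) in the bandit case)
    eps:                clipping parameter, so l_{s,a} = 1 - eps, u_{s,a} = 1 + eps
    """
    ratio = np.exp(logp_new - logp_old)              # pi(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # constant clipping range
    # element-wise min of the unclipped and clipped terms, averaged over the batch
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```

With this constant range, the feasible variation of $\pi(a|s)$ is $\pm\pi_{old}(a|s)\epsilon$, which is exactly the unequal restriction discussed above.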
Let $\mathcal{A}^+ \triangleq \{a \in \mathcal{A} \,|\, c(a) > 0\}$ and $\mathcal{A}^- \triangleq \{a \in \mathcal{A} \,|\, c(a) < 0\}$ denote the actions which have positive and negative reward respectively, and let $\mathcal{A}_{subopt} = \mathcal{A}^+ \setminus \{a_{opt}\}$ denote the set of sub-optimal actions. Let $a_{opt} = \arg\max_a c(a)$ denote the optimal action (we assume there is only one optimal action) and $a_{subopt} \in \mathcal{A}_{subopt}$ a sub-optimal action. Let us consider a simplified online policy iteration algorithm with PPO. As presented in Algorithm 1, the algorithm iteratively samples an action $\hat{a}_t$ from the current policy $\pi_t$ at each step and obtains a new policy $\pi_{t+1}$. We measure the exploration ability by the expected distance between the learned policy $\pi_t$ and the optimal policy $\pi^*$ after $t$ steps of learning, i.e., $\Delta_{\pi_0,t} \triangleq \mathbb{E}_{\pi_t}[\|\pi_t - \pi^*\|_\infty \,|\, \pi_0]$, where $\pi^*(a_{opt}) = 1$ and $\pi^*(a) = 0$ for $a \neq a_{opt}$, $\pi_0$ is the initial policy, and $\pi_t$ is a stochastic element of the policy space which depends on the previously sampled actions $\{\hat{a}_{t'}\}_{t'=1}^{t-1}$ (see eq. (2)). Note that a smaller $\Delta_{\pi_0,t}$ means a better exploration ability, as the policy is closer to the optimal one. We now derive the exact form of $\Delta_{\pi_0,t}$.

Lemma 1. $\Delta_{\pi_0,t} \triangleq \mathbb{E}_{\pi_t}[\|\pi_t - \pi^*\|_\infty \,|\, \pi_0] = 1 - \mathbb{E}_{\pi_t}[\pi_t(a_{opt}) \,|\, \pi_0]$.

Lemma 2. $\mathbb{E}_{\pi_{t+1}}[\pi_{t+1}(a) \,|\, \pi_0] = \mathbb{E}_{\pi_t}\big[\mathbb{E}_{\pi_{t+1}}[\pi_{t+1}(a) \,|\, \pi_t] \,\big|\, \pi_0\big]$.

We provide all the proofs in Appendix E. Lemma 1 implies that we can obtain the exploration ability $\Delta_{\pi_0,t}$ by computing the expected likelihood of the optimal action $a_{opt}$, i.e., $\mathbb{E}_{\pi_t}[\pi_t(a_{opt})|\pi_0]$. Lemma 2 shows an iterative way to compute it. By eq. (2), for an action $a$ which satisfies $c(a) > 0$, we have
$$\mathbb{E}_{\pi_{t+1}}[\pi_{t+1}(a)\,|\,\pi_t] = \pi_t(a) + \left[\pi_t^2(a)(u_a - 1) - \frac{\sum_{a^+ \in \mathcal{A}^+ \setminus \{a\}} \pi_t^2(a^+)(u_{a^+} - 1) - \sum_{a^- \in \mathcal{A}^-} \pi_t^2(a^-)(1 - l_{a^-})}{|\mathcal{A}| - 1}\right]. \tag{3}$$
This equation gives an explicit description of when the likelihood of an action $a$ decreases: if the second term on the RHS of eq. (3) is negative, then the likelihood of action $a$ decreases. This means that the initialization of the policy $\pi_0$ profoundly affects the future policy $\pi_t$. We now show that if the policy $\pi_0$ is initialized badly, $\pi(a_{opt})$ may decrease continuously. Formally, for PPO, we have the following theorem:

Theorem 2. Given an initial policy $\pi_0$, if $\pi_0^2(a_{opt})\,|\mathcal{A}| < \sum_{a_{subopt} \in \mathcal{A}_{subopt}} \pi_0^2(a_{subopt}) - \sum_{a^- \in \mathcal{A}^-} \pi_0^2(a^-)$, then we have: (i) $\sum_{a_{subopt} \in \mathcal{A}_{subopt}} \pi_0(a_{subopt}) < \sum_{a_{subopt} \in \mathcal{A}_{subopt}} \mathbb{E}_{\pi^{PPO}_1}\big[\pi^{PPO}_1(a_{subopt})|\pi_0\big] < \cdots < \sum_{a_{subopt} \in \mathcal{A}_{subopt}} \mathbb{E}_{\pi^{PPO}_t}\big[\pi^{PPO}_t(a_{subopt})|\pi_0\big]$; (ii) $\pi_0(a_{opt}) > \mathbb{E}_{\pi^{PPO}_1}\big[\pi^{PPO}_1(a_{opt})|\pi_0\big] > \cdots > \mathbb{E}_{\pi^{PPO}_t}\big[\pi^{PPO}_t(a_{opt})|\pi_0\big]$; (iii) $\Delta^{PPO}_{\pi_0,0} < \Delta^{PPO}_{\pi_0,1} < \cdots < \Delta^{PPO}_{\pi_0,t}$.

Conclusions (i) and (ii) imply that if the optimal action $a_{opt}$ is relatively less preferred than the sub-optimal actions by the initial policy, then the preference for choosing the optimal action keeps decreasing while that for the sub-optimal actions keeps increasing. This is because the feasible variation of the probability of the optimal action $\pi(a_{opt})$ is smaller than that of the sub-optimal one $\pi(a_{subopt})$, and increasing the probability of the latter diminishes the former. Conclusion (iii) implies that the policy of PPO is expected to diverge from the optimal one (in terms of the infinity metric). We give a simple example below.

Example 1. Consider a three-armed bandit problem with reward function $c(a_{opt}) = 1$, $c(a_{subopt}) = 0.5$, $c(a_{worst}) = -50$. The initial policy is $\pi_0(a_{opt}) = 0.2$, $\pi_0(a_{subopt}) = 0.6$, $\pi_0(a_{worst}) = 0.2$. The hyperparameter of PPO is $\epsilon = 0.2$. We have $\Delta^{PPO}_{\pi_0,0} = 0.8$, $\Delta^{PPO}_{\pi_0,1} = 0.824$, ..., $\Delta^{PPO}_{\pi_0,6} \approx 0.999$, which means the policy diverges from the optimal one.
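For concreteness, the following is a small Monte Carlo sketch of the simplified policy iteration of Algorithm 1 on the three-armed bandit of Example 1. The function names and the simple renormalization step are assumptions for illustration; the $\Delta$ values quoted above come from the paper's own calculation, not from this script.

```python
import numpy as np

def simplified_ppo_step(pi, reward, eps=0.2, rng=None):
    """One step of Algorithm 1 with the constant clipping range (1-eps, 1+eps),
    following the piecewise update of eq. (2)."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(pi)
    a = rng.choice(n, p=pi)                 # sample an action from pi_t
    new = pi.copy()
    if reward[a] > 0:                       # ratio moves to the upper bound
        new[a] = pi[a] * (1.0 + eps)
        new[np.arange(n) != a] -= pi[a] * eps / (n - 1)
    elif reward[a] < 0:                     # ratio moves to the lower bound
        new[a] = pi[a] * (1.0 - eps)
        new[np.arange(n) != a] += pi[a] * eps / (n - 1)
    # crude normalization back onto the simplex (an assumption)
    new = np.clip(new, 1e-8, 1.0)
    return new / new.sum()

# Example 1: c(a_opt) = 1, c(a_subopt) = 0.5, c(a_worst) = -50
reward = np.array([1.0, 0.5, -50.0])
pi0 = np.array([0.2, 0.6, 0.2])
pi_star = np.array([1.0, 0.0, 0.0])

rng = np.random.default_rng(0)
dists = []
for _ in range(2000):                       # Monte Carlo over sampled actions
    pi = pi0.copy()
    for _ in range(6):
        pi = simplified_ppo_step(pi, reward, rng=rng)
    dists.append(np.max(np.abs(pi - pi_star)))
print("estimated Delta_{pi0,6}:", np.mean(dists))   # grows toward 1, i.e. away from pi*
```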
Note that the case where the optimal action $a_{opt}$ is relatively less preferred by the initial policy may be avoided in a discrete action space, where we can use the uniform distribution as the initial policy. However, such a case can hardly be avoided in high-dimensional action spaces, where the policy may be initialized far from the optimal one. We ran Example 1 and a continuous-armed bandit problem with random initialization for multiple trials; about 30% of the trials were trapped in local optima. See Section 6.1 for more detail. In summary, PPO with a constant clipping range can lead to an exploration issue when the policy is initialized badly. However, eq. (3) suggests a way to address this issue: enlarge the clipping range $(l_a, u_a)$ when the probability $\pi_{old}(a)$ under the old policy is small.

### 5.1 Trust Region-Guided PPO

In the previous section, we concluded that the constant clipping range of PPO can lead to an exploration issue. We now consider how to adaptively adjust the clipping range to improve the exploration behavior of PPO. The new clipping range $(l^\delta_{s,a}, u^\delta_{s,a})$, where $\delta$ is a hyperparameter, is set as follows:
$$l^\delta_{s,a} = \min_\pi\left\{\frac{\pi(a|s)}{\pi_{old}(a|s)} : D^s_{KL}(\pi_{old}, \pi) \leq \delta\right\}, \qquad u^\delta_{s,a} = \max_\pi\left\{\frac{\pi(a|s)}{\pi_{old}(a|s)} : D^s_{KL}(\pi_{old}, \pi) \leq \delta\right\}. \tag{4}$$
To ensure that the new adaptive clipping range is not over-strict, an additional truncation operation is attached: $l^{\delta,\epsilon}_{s,a} = \min(l^\delta_{s,a}, 1-\epsilon)$, $u^{\delta,\epsilon}_{s,a} = \max(u^\delta_{s,a}, 1+\epsilon)$.

This setting of the clipping range can be motivated from the following perspectives. First, the clipping range is related to the policy metric of the constraint. Both TRPO and PPO impose a constraint on the difference between the new policy and the old one. TRPO uses a divergence metric on the distributions, i.e., $D^s_{KL}(\pi_{old}, \pi) = \mathbb{E}_{a \sim \pi_{old}}\left[\log\frac{\pi_{old}(a|s)}{\pi(a|s)}\right] \leq \delta$ for all $s \in \mathcal{S}$, which is more theoretically justified according to Theorem 1. PPO instead uses a ratio-based metric on each action, i.e., $1-\epsilon \leq \pi(a|s)/\pi_{old}(a|s) \leq 1+\epsilon$ for all $a \in \mathcal{A}$ and $s \in \mathcal{S}$. The divergence-based metric is averaged over the action space while the ratio-based one is element-wise on each action point. If the policy is restricted within a region under the ratio-based metric, then it is also constrained within a region under the divergence-based one, but not vice versa. Thus the probability ratio-based constraint is somewhat stricter than the divergence-based one. Our method connects these two underlying metrics: it adopts the probability ratio-based constraint while staying close to the divergence-based one.

Second, a different underlying metric of policy difference may result in different algorithm behavior. In the previous section, we concluded that PPO's metric with a constant clipping range can lead to an exploration issue, because it imposes a relatively strict constraint on actions which are not preferred by the old policy. Therefore, we wish to relax this constraint by enlarging the upper clipping range while reducing the lower clipping range. Fig. 1a shows the clipping range of TRGPPO and PPO. For TRGPPO, as $\pi_{old}(a|s)$ gets smaller, the upper clipping range increases while the lower one decreases, which means the constraint is relaxed as $\pi_{old}(a|s)$ gets smaller. This mechanism encourages the agent to explore more on potentially valuable actions which are not preferred by the old policy. We will theoretically show in Section 5.2 that the exploration behavior with this new clipping range is better than that with the constant one.

Last but not least, although the clipping ranges are enlarged, this does not harm the stability of learning, as the ranges are kept within the trust region. We will show in Section 5.3 that this new setting of the clipping range does not enlarge the policy divergence and enjoys a better performance bound compared to PPO.
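A minimal sketch of how the truncated adaptive ranges $(l^{\delta,\epsilon}_{s,a}, u^{\delta,\epsilon}_{s,a})$ would plug into the clipped objective: instead of a single scalar $\epsilon$, each sample carries its own bounds. The per-sample arrays `l_delta` and `u_delta` are assumed to have been computed beforehand (e.g., by the procedure described next).

```python
import numpy as np

def trgppo_clip_objective(logp_new, logp_old, adv, l_delta, u_delta, eps=0.2):
    """Clipped surrogate with per-sample trust-region-guided ranges.

    l_delta, u_delta: arrays of l^delta_{s,a} and u^delta_{s,a}, shape (T,)
    """
    # truncation: never clip more tightly than the original PPO range
    l = np.minimum(l_delta, 1.0 - eps)
    u = np.maximum(u_delta, 1.0 + eps)
    ratio = np.exp(logp_new - logp_old)     # pi(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, l, u)          # element-wise bounds per (s, a)
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```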
Our TRGPPO adopts the same algorithmic procedure as PPO, except that it requires an additional computation of the adaptive clipping range. We now present methods to compute the adaptive clipping range defined in (4) efficiently. For a discrete action space, by using the KKT conditions, problem (4) is transformed into solving the following equation w.r.t. $X$:
$$g(\pi_{old}(a|s), X) \triangleq (1 - \pi_{old}(a|s))\log\frac{1 - \pi_{old}(a|s)}{1 - \pi_{old}(a|s)X} - \pi_{old}(a|s)\log X = \delta, \tag{5}$$
which has two solutions: one for $l^\delta_{s,a}$, which lies within $(0, 1)$, and one for $u^\delta_{s,a}$, which lies within $(1, +\infty)$. We use MINPACK's HYBRD and HYBRJ routines [15] as the solver. To accelerate this computation, we adopt two additional measures. First, we train a Deep Neural Network (DNN) which takes $\pi_{old}(a|s)$ and $\delta$ as input and approximately outputs the initial solution. Note that the solution of (5) depends only on the probability $\pi_{old}(a|s)$ and the hyperparameter $\delta$; it is not affected by the dimension of the action space. Thus it is possible to train one DNN in advance for all discrete action space tasks. Second, with fixed $\delta$, we discretize the probability space and save all the solutions in advance. With these two acceleration measures, the clipping range computation requires only an additional 4% of the wall-clock computation time of the original policy learning. See Appendix B.3 for more detail. For continuous action space tasks, we apply several transformations to make the problem independent of the dimension of the action space, which makes it tractable to apply the two acceleration measures above. See Appendix B.2 for more detail.

[Figure 1: (a) and (b) plot the clipping range and the feasible variation range under different $\pi_{old}(a|s)$ for a discrete action space task. (c) plots the clipping range under different $a$ for a continuous action space task; the black curve plots the density of $\pi_{old}(a|s) = \mathcal{N}(a|0, 1)$.]
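To illustrate eq. (5): for a given $p = \pi_{old}(a|s)$ and $\delta$, $g(p, \cdot) = \delta$ has one root in $(0, 1)$ (the lower range) and one in $(1, 1/p)$ (the upper range). The sketch below uses SciPy's bracketing solver `brentq` instead of the MINPACK HYBRD/HYBRJ routines mentioned above, purely for brevity; it is a reference implementation of the idea, not the authors' accelerated procedure.

```python
import numpy as np
from scipy.optimize import brentq

def g(p, x):
    """LHS of eq. (5): the KL divergence obtained by moving pi(a|s) from p to
    p*x and rescaling the remaining probability mass proportionally."""
    return (1.0 - p) * np.log((1.0 - p) / (1.0 - p * x)) - p * np.log(x)

def adaptive_clip_range(p, delta, margin=1e-12):
    """Solve g(p, X) = delta for the lower root in (0, 1) and the upper root
    in (1, 1/p)."""
    lo = brentq(lambda x: g(p, x) - delta, margin, 1.0 - margin)
    hi = brentq(lambda x: g(p, x) - delta, 1.0 + margin, 1.0 / p - margin)
    return lo, hi

# A rarely chosen action (small pi_old(a|s)) gets a wider range than a preferred one.
print(adaptive_clip_range(0.05, delta=0.02))
print(adaptive_clip_range(0.50, delta=0.02))
```

With $\delta$ fixed, caching these roots over a discretized grid of probabilities, as described above, turns the per-sample computation into a table lookup.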
### 5.2 Exploration Behavior

In this section, we first give a property of the clipping range of TRGPPO which affects the exploration behavior (as discussed in Section 4). Then a comparison between TRGPPO and PPO on the exploration behavior is provided.

Lemma 3. For TRGPPO with hyperparameter $\delta$, we have $\frac{du^\delta_{s,a}}{d\pi_{old}(a|s)} < 0$ and $\frac{dl^\delta_{s,a}}{d\pi_{old}(a|s)} > 0$.

This result implies that the upper clipping range becomes larger as the preference for the action by the old policy, $\pi_{old}(a|s)$, approaches zero, while the lower clipping range behaves in the opposite way. This means that the constraints are relaxed on the actions which are not preferred by the old policy, which encourages the policy to explore more on potentially valuable actions, no matter whether they were preferred by the previous policies or not. We now give a formal comparison of the exploration behavior. As mentioned in Section 4, we measure the exploration ability by the expected distance between the learned policy $\pi_t$ and the optimal policy $\pi^*$ after $t$ steps of learning, i.e., $\Delta_{\pi_0,t} \triangleq \mathbb{E}_{\pi_t}[\|\pi_t - \pi^*\|_\infty \,|\, \pi_0]$; a smaller $\Delta_{\pi_0,t}$ means a better exploration ability. The exploration ability of TRGPPO is denoted $\Delta^{TRGPPO}_{\pi_0,t}$ and that of PPO is denoted $\Delta^{PPO}_{\pi_0,t}$. By eq. (3) and Lemma 3, we obtain the following conclusion.

Theorem 3. Consider TRGPPO with hyperparameters $(\delta, \epsilon)$ and PPO with the same $\epsilon$. If $\delta \geq g(\max_{a \in \mathcal{A}_{subopt}} \pi_t(a),\ 1 + \epsilon)$ for all $t$, then $\Delta^{TRGPPO}_{\pi_0,t} \leq \Delta^{PPO}_{\pi_0,t}$ for any $t$.

This theorem implies that our TRGPPO has a better exploration ability than PPO, given a proper setting of the hyperparameter $\delta$.

### 5.3 Policy Divergence and Lower Performance Bound

To investigate how TRGPPO and PPO perform in practice, let us consider an empirical version of the lower performance bound: $\hat{M}_{\pi_{old}}(\pi) = \hat{L}_{\pi_{old}}(\pi) - C \max_t D^{s_t}_{KL}(\pi_{old}, \pi)$, where $\hat{L}_{\pi_{old}}(\pi) = \frac{1}{T}\sum_{t=1}^{T}\left[\frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)}A_t\right] + \hat{\eta}_{\pi_{old}}$; $s_t \sim \rho^{\pi_{old}}$, $a_t \sim \pi_{old}(\cdot|s_t)$ are the sampled states and actions (we assume $s_i \neq s_j$ for any $i \neq j$), $A_t$ is the estimated value of $A^{\pi_{old}}(s_t, a_t)$, and $\hat{\eta}_{\pi_{old}}$ is the estimated performance of the old policy $\pi_{old}$.

Let $\Pi^{PPO}_{new}$ denote the set of all optimal solutions of the empirical surrogate objective function of PPO, and let $\pi^{PPO}_{new} \in \Pi^{PPO}_{new}$ denote the optimal solution which achieves the minimum KL divergence over all optimal solutions, i.e., $D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new}) \leq D^{s_t}_{KL}(\pi_{old}, \pi)$ for any $\pi \in \Pi^{PPO}_{new}$ and all $s_t$. This problem can be formalized as $\pi^{PPO}_{new} = \arg\min_{\pi \in \Pi^{PPO}_{new}} \big(D^{s_1}_{KL}(\pi_{old}, \pi), \ldots, D^{s_T}_{KL}(\pi_{old}, \pi)\big)$. Note that $\pi(\cdot|s_t)$ is a conditional probability and the optimal solutions on different states are independent of each other. Thus the problem can be optimized by independently solving $\min_{\pi(\cdot|s_t) \in \{\pi(\cdot|s_t):\,\pi \in \Pi^{PPO}_{new}\}} D_{KL}(\pi_{old}(\cdot|s_t), \pi(\cdot|s_t))$ for each $s_t$. The final $\pi^{PPO}_{new}$ is obtained by integrating these independent optimal solutions $\pi^{PPO}_{new}(\cdot|s_t)$ over the different states $s_t$. Similarly, $\pi^{TRGPPO}_{new}$ is the corresponding solution of TRGPPO, defined analogously to $\pi^{PPO}_{new}$. Please refer to Appendix E for more detail.

To analyse TRGPPO and PPO in a comparable way, we introduce a variant of TRGPPO in which the hyperparameter $\delta$ of eq. (4) is set adaptively by $\epsilon$. That is,
$$\delta = \max\left((1-p^+)\log\frac{1-p^+}{1-p^+(1+\epsilon)} - p^+\log(1+\epsilon),\ \ (1-p^-)\log\frac{1-p^-}{1-p^-(1-\epsilon)} - p^-\log(1-\epsilon)\right),$$
where $p^+ = \max_{t:\,A_t > 0} \pi_{old}(a_t|s_t)$ and $p^- = \max_{t:\,A_t < 0} \pi_{old}(a_t|s_t)$. One may note that this expression has a similar form to eq. (5). In fact, if TRGPPO and PPO share the same $\epsilon$, then theoretically they have the same maximum KL divergence. We summarize the comparison between TRGPPO and PPO in the following theorem.

Theorem 4. Assume that $\max_t D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new}) < +\infty$. If TRGPPO and PPO have the same hyperparameter $\epsilon$, we have: (i) $u^\delta_{s_t,a_t} \geq 1+\epsilon$ and $l^\delta_{s_t,a_t} \leq 1-\epsilon$ for all $(s_t, a_t)$; (ii) $\max_t D^{s_t}_{KL}(\pi_{old}, \pi^{TRGPPO}_{new}) = \max_t D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$; (iii) $\hat{M}_{\pi_{old}}(\pi^{TRGPPO}_{new}) \geq \hat{M}_{\pi_{old}}(\pi^{PPO}_{new})$. In particular, if there exists at least one $(s_t, a_t)$ such that $\pi_{old}(a_t|s_t) \neq \max_{\hat{t}:\,A_{\hat{t}} < 0} \pi_{old}(a_{\hat{t}}|s_{\hat{t}})$ and $\pi_{old}(a_t|s_t) \neq \max_{\hat{t}:\,A_{\hat{t}} > 0} \pi_{old}(a_{\hat{t}}|s_{\hat{t}})$, then $\hat{M}_{\pi_{old}}(\pi^{TRGPPO}_{new}) > \hat{M}_{\pi_{old}}(\pi^{PPO}_{new})$.

Conclusion (i) implies that TRGPPO can enlarge the clipping ranges compared to PPO and accordingly allows larger updates of the policy. Meanwhile, the maximum KL divergence is retained, which means TRGPPO does not harm the stability of PPO theoretically. Conclusion (iii) implies that TRGPPO has a better empirical performance bound.
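The adaptive $\delta$ above is simply the function $g$ of eq. (5) evaluated at the most-preferred positively- and negatively-advantaged samples. A small sketch under that reading (the helper names are assumptions):

```python
import numpy as np

def g(p, x):
    """g from eq. (5)."""
    return (1.0 - p) * np.log((1.0 - p) / (1.0 - p * x)) - p * np.log(x)

def delta_from_epsilon(pi_old_probs, advantages, eps=0.2):
    """delta = max(g(p_plus, 1 + eps), g(p_minus, 1 - eps)), where p_plus and
    p_minus are the largest pi_old(a_t|s_t) among samples with positive and
    negative advantage, respectively."""
    probs = np.asarray(pi_old_probs, dtype=float)
    adv = np.asarray(advantages, dtype=float)
    p_plus = probs[adv > 0].max()
    p_minus = probs[adv < 0].max()
    return max(g(p_plus, 1.0 + eps), g(p_minus, 1.0 - eps))

# e.g. with a batch of old-policy probabilities and advantage estimates
print(delta_from_epsilon([0.1, 0.4, 0.25, 0.05], [1.0, -0.5, 2.0, -1.2], eps=0.2))
```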
## 6 Experiment

We conducted experiments to answer the following questions: (1) Does PPO suffer from a lack of exploration? (2) Does our TRGPPO relieve the exploration issue and improve sample efficiency compared to PPO? (3) Does our TRGPPO maintain the stable learning property of PPO?

To answer these questions, we first evaluate the algorithms on two simple bandit problems and then compare them on high-dimensional benchmark tasks.

### 6.1 Didactic Example: Bandit Problems

We first evaluate the algorithms on the bandit problems. In the continuous-armed bandit problem, the reward is 0.5 for $a \in (1, 2)$, 1 for $a \in (2.5, 5)$, and 0 otherwise. We use a Gaussian policy. The discrete-armed bandit problem is defined in Section 4. We use a Gibbs policy $\pi(a) \propto \exp(\theta_a)$, where the parameters $\theta$ are initialized randomly from $\mathcal{N}(0, 1)$. We also consider the vanilla policy gradient (PG) method as a comparison. Each algorithm was run for 1000 iterations with 10 random seeds.

[Figure 2: The performance on the discrete and continuous-armed bandit problems during the training process.]

Fig. 2 plots the performance during the training process. PPO gets trapped in local optima in 30% and 20% of all trials on the discrete and continuous cases respectively, while our TRGPPO finds the optimal solution in almost all trials. For the continuous-armed problem, we also tried other types of parametrized policies such as Beta and mixture-of-Gaussian policies, and they behave similarly to the Gaussian policy. In the discrete-armed problem, we find that when the policy is initialized near a local optimum, PPO easily gets trapped in it. Notably, since vanilla PG can also find the optimal solution, it can be inferred that the exploration issue derives mainly from the ratio-based clipping with a constant clipping range.

### 6.2 Evaluation on Benchmark Tasks

We evaluate the algorithms on benchmark tasks implemented in OpenAI Gym [2], simulated by MuJoCo [21] and the Arcade Learning Environment [1]. For continuous control, we evaluate the algorithms on 6 benchmark tasks. All tasks were run for 1 million timesteps, except that the Humanoid task was run for 20 million timesteps. The trained policies are evaluated after every 2048 timesteps of sampled data. The experiments on discrete control tasks are detailed in Appendix C.

[Figure 3: Episode rewards during the training process; the shaded area indicates half the standard deviation over 10 random seeds.]

Table 1: Results of timesteps to hit a threshold within 1 million timesteps (except Humanoid with 20 million) and averaged rewards over the last 40% of episodes during the training process.

(a) Timesteps to hit threshold (×10³)

| Task | Threshold | TRGPPO | PPO | PPO-penalty | SAC |
|---|---|---|---|---|---|
| Humanoid | 5000 | 4653 | 7241 | 13096.0 | 343.0 |
| Reacher | -5 | 201 | 178.0 | 301.0 | 265 |
| Swimmer | 90 | 353.0 | 564 | 507.0 | / |
| HalfCheetah | 3000 | 117 | 148 | 220.0 | 53.0 |
| Hopper | 3000 | 168.0 | 267 | 188.0 | 209 |
| Walker2d | 3000 | 269.0 | 454 | 393.0 | 610 |

("/" means that the method did not reach the reward threshold within the required timesteps on all the seeds.)

(b) Averaged rewards

| Task | TRGPPO | PPO | PPO-penalty | SAC |
|---|---|---|---|---|
| Humanoid | 7074.9 | 6620.9 | 3612.3 | 6535.9 |
| Reacher | -7.9 | -6.7 | -6.8 | -17.2 |
| Swimmer | 101.9 | 100.1 | 94.1 | 49 |
| HalfCheetah | 4986.1 | 4600.2 | 4868.3 | 9987.1 |
| Hopper | 3200.5 | 2848.9 | 3018.7 | 3020.7 |
| Walker2d | 3886.8 | 3276.2 | 3524 | 2570 |

For our TRGPPO, the trust region coefficient $\delta$ is adaptively set by tuning $\epsilon$ (see Appendix B.4 for more detail). We set $\epsilon = 0.2$, the same as PPO. The following algorithms were considered in the comparison. (a) PPO: we used $\epsilon = 0.2$ as recommended by [18].
(b) PPO-entropy: PPO with an explicit entropy regularization term $\beta\mathbb{E}_s[H(\pi_{old}(\cdot|s), \pi(\cdot|s))]$, where $\beta = 0.01$. (c) PPO-0.6: PPO with a larger clipping range, $\epsilon = 0.6$. (d) PPO-penalty: a variant of PPO which imposes a penalty on the KL divergence and adaptively adjusts the penalty coefficient [18]. (e) SAC: Soft Actor-Critic, a state-of-the-art off-policy RL algorithm [9].

Both TRGPPO and PPO adopt exactly the same implementation and hyperparameters, based on OpenAI Baselines [4], except for the clipping range. This ensures that the differences are due to algorithmic changes rather than implementations or hyperparameters. For SAC, we adopt the implementation provided in [9].

Sample Efficiency: Table 1(a) lists the timesteps required by each algorithm to hit a prescribed threshold within 1 million timesteps, and Figure 3 shows the episode rewards during the training process. The thresholds for all tasks were chosen according to [23]. As can be seen in Table 1, TRGPPO requires only about 3/5 of the timesteps of PPO on the 4 tasks other than HalfCheetah and Reacher.

Performance/Exploration: Table 1(b) lists the averaged rewards over the last 40% of episodes during the training process. TRGPPO outperforms the original PPO on all tasks except Reacher. Fig. 4a shows the policy entropy during the training process; the policy entropy of TRGPPO is clearly higher than that of PPO. These results imply that our TRGPPO maintains a higher level of policy entropy during learning and encourages the policy to explore more.

[Figure 4: (a) shows the policy entropy during the training process. (b) shows the statistics of the computed upper clipping ranges over all samples. (c) shows the KL divergence during the training process.]

The Clipping Ranges and Policy Divergence: Fig. 4b shows the statistics of the upper clipping ranges of TRGPPO and PPO. Most of the resulting adaptive clipping ranges of TRGPPO are much larger than that of PPO. Nevertheless, our method has KL divergences similar to those of PPO (see Fig. 4c). In contrast, arbitrarily enlarging the clipping range (PPO-0.6) does not enjoy this property and fails on most tasks.

Training Time: Within one million timesteps, the training wall-clock time is 25 min for our TRGPPO, 24 min for PPO, and 182 min for SAC (see Appendix B.3 for the details of this evaluation). TRGPPO does not require much more computation time than PPO.

Comparison with State-of-the-art Methods: TRGPPO achieves a higher reward than SAC on 5 tasks, while it is not as good as SAC on HalfCheetah. TRGPPO is also not as sample efficient as SAC on HalfCheetah and Humanoid. This may be because TRGPPO is an on-policy algorithm while SAC is an off-policy one. However, TRGPPO is much more computationally efficient (25 min vs. 182 min).
In addition, SAC tuned hyperparameters specifically for each task in the implementation of the original authors, whereas our TRGPPO uses the same hyperparameters across different tasks.

## 7 Conclusion

In this paper, we improve the original PPO with an adaptive clipping mechanism driven by a trust region-guided criterion. Our TRGPPO method improves PPO with more exploration and better sample efficiency and is competitive with several state-of-the-art methods, while maintaining the stable learning property and simplicity of PPO. To our knowledge, this is the first work to reveal the effect of the metric of the policy constraint on the exploration behavior of policy learning. Recent works have been devoted to introducing inductive biases to guide policy behavior, e.g., maximum entropy learning [24, 8] and curiosity-driven methods [13]. In this sense, our adaptive clipping mechanism is a novel alternative approach to incorporating prior knowledge to achieve fast and stable policy learning. We hope it will inspire future work on investigating more well-defined policy metrics to guide efficient learning behavior.

## Acknowledgement

This work is partially supported by the National Science Foundation of China (61976115, 61672280, 61732006), the AI+ Project of NUAA (56XZA18009), and the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX19_0195). We would also like to thank Yao Li, Weida Li, Xin Jin, as well as the anonymous reviewers, for offering thoughtful comments and helpful advice on earlier versions of this work.

## References

[1] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
[2] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
[3] Gang Chen, Yiming Peng, and Mengjie Zhang. An adaptive clipping approach for proximal policy optimization. CoRR, abs/1804.06461, 2018.
[4] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.
[5] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning, pages 1329-1338, 2016.
[6] Rasool Fakoor, Pratik Chaudhari, and Alexander J. Smola. P3O: Policy-on policy-off policy optimization. In Uncertainty in Artificial Intelligence, 2019.
[7] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. International Conference on Learning Representations, 2018.
[8] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. International Conference on Machine Learning, pages 1352-1361, 2017.
[9] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, pages 1856-1865, 2018.
[10] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334-1373, 2016.
[11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[12] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. Advances in Neural Information Processing Systems, pages 4026-4034, 2016.
[13] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), 2017.
[14] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008.
[15] Michael J. D. Powell. A hybrid method for nonlinear equations. Numerical Methods for Nonlinear Algebraic Equations, 1970.
[16] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.
[17] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2016.
[18] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[19] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
[20] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, pages 1057-1063, 2000.
[21] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033, 2012.
[22] Yuhui Wang, Hao He, Xiaoyang Tan, and Yaozhong Gan. Truly proximal policy optimization. In Uncertainty in Artificial Intelligence, 2019.
[23] Yuhuai Wu, Elman Mansimov, Roger B. Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. Advances in Neural Information Processing Systems, pages 5279-5288, 2017.
[24] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modeling interaction via the principle of maximum causal entropy. In ICML, 2010.