# Off-Policy Proximal Policy Optimization

Wenjia Meng¹, Qian Zheng²·³, Gang Pan²·³, Yilong Yin¹
¹ School of Software, Shandong University, Jinan, China
² The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou, China
³ College of Computer Science and Technology, Zhejiang University, Hangzhou, China
wjmeng@sdu.edu.cn, qianzheng@zju.edu.cn, gpan@zju.edu.cn, ylyin@sdu.edu.cn
Y. Yin is the corresponding author.

## Abstract

Proximal Policy Optimization (PPO) is an important reinforcement learning method that has achieved great success in sequential decision-making problems. However, PPO suffers from sample inefficiency because it cannot make use of off-policy data. In this paper, we propose an Off-Policy Proximal Policy Optimization method (Off-Policy PPO) that improves the sample efficiency of PPO by utilizing off-policy data. Specifically, we first propose a clipped surrogate objective function that can utilize off-policy data and avoid excessively large policy updates. Next, we theoretically clarify the stability of the optimization process of the proposed surrogate objective by demonstrating that the degree of policy update distance is consistent with that in PPO. We then describe the implementation details of the proposed Off-Policy PPO, which iteratively updates policies by optimizing the proposed clipped surrogate objective. Finally, experimental results on representative continuous control tasks validate that our method outperforms state-of-the-art methods on most tasks.

## Introduction

Off-policy deep reinforcement learning has achieved huge success in domains such as games (Mnih et al. 2015), (Silver et al. 2016), (Silver et al. 2017), (Vinyals et al. 2019), (Schrittwieser et al. 2020), (Xing et al. 2021), (Meng et al. 2019), robotics (Kober, Bagnell, and Peters 2013), and continuous control tasks (Lillicrap et al. 2016), (Haarnoja et al. 2018), (Yang et al. 2022b). These off-policy deep reinforcement learning methods make use of off-policy data collected during the interaction between agent and environment to optimize policies (Degris, White, and Sutton 2012), (Silver et al. 2014), and are therefore more sample efficient than on-policy methods that only use on-policy data (Fujimoto, Hoof, and Meger 2018), (Mnih et al. 2016). By utilizing off-policy data, whose behavior policy differs from the target policy, these off-policy methods avoid the expensive cost of large amounts of on-policy interaction and are suitable for solving complex real-world sequential decision-making problems (Haarnoja et al. 2018), (Yang et al. 2022a), (Lillicrap et al. 2016), (Mnih et al. 2015).

Proximal Policy Optimization (PPO) (Schulman et al. 2017) is one of the most popular deep reinforcement learning methods; it optimizes policies by maximizing a clipped surrogate objective of policy performance. Several works have been proposed to further improve the sample efficiency of PPO from different perspectives. Specifically, Trust Region-Guided Proximal Policy Optimization (TRGPPO) (Wang et al. 2019) improves the sample efficiency of PPO by adaptively adjusting the clipping range within a trust region.
Truly Proximal Policy Optimization (Wang, He, and Tan 2020) improves the sample efficiency of PPO by adopting a new clipping function to restrict the policy ratio and substituting the triggering condition for clipping with a trust region-based one. Separated Trust Regions Policy Optimization (Zou et al. 2019) improves the sample efficiency of PPO by proposing a softer objective with more conservative constraints and building separated trust regions for optimization. However, these methods ignore the perspective of directly utilizing off-policy data to improve the sample efficiency of PPO (Wang et al. 2019), (Wang, He, and Tan 2020), (Zou et al. 2019).

In this paper, we put forward an Off-Policy Proximal Policy Optimization (Off-Policy PPO) method that leverages off-policy data to further improve the sample efficiency of PPO. Specifically, we first propose a clipped surrogate objective that can use off-policy data and avoid excessively large policy updates. Next, we theoretically clarify that the use of off-policy data during the optimization process of this objective does not harm the stability of PPO. We then introduce the implementation details of the proposed Off-Policy PPO, including the whole procedure of the method and its network update procedure. Our contributions are as follows:

- We propose an Off-Policy Proximal Policy Optimization method (Off-Policy PPO) that introduces a clipped surrogate objective using off-policy data and iteratively utilizes off-policy data to optimize policies by maximizing this clipped surrogate objective.
- We theoretically clarify the stability of the proposed Off-Policy PPO by demonstrating that the degree of the policy update distance in our method is the same as that in PPO.
- We conduct experiments on a variety of representative continuous control tasks, and the experimental results demonstrate that our method achieves better performance than state-of-the-art methods on most tasks.

## Background & Notation

In this paper, we study the Markov decision process denoted by the tuple $(\mathcal{S}, \mathcal{A}, P, \rho_0, r)$. $\mathcal{S}$ and $\mathcal{A}$ represent the state space and the action space, respectively; $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ denotes the transition dynamics distribution; $\rho_0 : \mathcal{S} \to \mathbb{R}$ and $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ represent the distribution of the initial state $s_0$ and the reward function, respectively. During the interaction, the agent in a state $s_t$ chooses an action $a_t$ according to the policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$ at timestep $t$; the environment yields a reward $r(s_t, a_t)$ and the next state $s_{t+1}$. With the above interaction, the discounted return from timestep $t$ can be formulated as $R_t = \sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)$, where $\gamma$ is the discount factor. Based on such $R_t$, we next introduce the state value $V_\pi(s_t)$ given $s_t$, the action value $Q_\pi(s_t, a_t)$ given $(s_t, a_t)$, and the corresponding advantage value $A_\pi(s_t, a_t)$ (Schulman et al. 2016):

$$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots \sim \pi}\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)\Big], \quad (1)$$

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots \sim \pi}\Big[\sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)\Big], \quad (2)$$

$$A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t). \quad (3)$$

Standard reinforcement learning learns a policy $\pi$ to maximize the policy performance objective (the discounted return from the start state) (Sutton and Barto 2018):

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}[R_0] = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big], \quad (4)$$

where $s_0 \sim \rho_0$, $a_t \sim \pi(a_t|s_t)$, and $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$.
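To make Eqs. (1)-(4) concrete, the short sketch below estimates the discounted return $R_t$ and a Monte-Carlo advantage $A_\pi(s_t, a_t)$ from one sampled trajectory. It is a minimal illustration rather than the authors' code; the toy reward and value numbers are made up.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards over one finite trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy trajectory: rewards r_t and value estimates V_pi(s_t) (hypothetical numbers).
rewards = np.array([1.0, 0.0, 0.5, 1.0])
values = np.array([2.0, 1.2, 1.4, 1.0])

returns = discounted_returns(rewards)   # Monte-Carlo estimate of Q_pi(s_t, a_t) along the trajectory
advantages = returns - values           # A_pi = Q_pi - V_pi, as in Eq. (3)
print(returns, advantages)
```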
In the following, we first describe Trust Region Policy Optimization (TRPO), which maximizes the policy performance $\eta(\pi)$ by optimizing a surrogate objective with on-policy data. Next, we describe Proximal Policy Optimization (PPO), which proposes a clipped surrogate objective to avoid the excessively large policy updates of TRPO.

### Trust Region Policy Optimization (TRPO)

To maximize the performance objective in Eq. (4), the TRPO method (Schulman et al. 2015) uses on-policy data to optimize policies by maximizing a surrogate objective function subject to a constraint on a Kullback-Leibler (KL) divergence:

$$\max_{\pi} \; \mathbb{E}_{s \sim \rho_{\pi_{old}}, a \sim \pi_{old}}\Big[\frac{\pi(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s, a)\Big] \quad (5)$$

$$\text{subject to} \quad \mathbb{E}_{s \sim \rho_{\pi_{old}}}\big[D_{KL}\big(\pi_{old}(\cdot|s) \,\|\, \pi(\cdot|s)\big)\big] \le \delta, \quad (6)$$

where $\pi_{old}$ is the current policy, $\delta$ denotes the bound, $D_{KL}(\pi_{old}(\cdot|s) \,\|\, \pi(\cdot|s))$ represents the KL divergence between $\pi_{old}(\cdot|s)$ and $\pi(\cdot|s)$, and $\rho_{\pi_{old}}$ denotes the discounted state distribution starting at the initial state $s_0$ and following $\pi_{old}$: $\rho_{\pi_{old}}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi_{old})$ (Sutton et al. 2000). However, without the constraint, maximizing the surrogate objective in Eq. (5) would lead to excessively large policy updates.

### Proximal Policy Optimization (PPO)

In order to avoid such large policy updates, Proximal Policy Optimization (PPO) (Schulman et al. 2017) puts forward a clipped surrogate objective and optimizes policies by maximizing it. The clipped surrogate objective proposed in PPO can be expressed as:

$$L^{CLIP}_{PPO} = \mathbb{E}_{s \sim \rho_{\pi_{old}}, a \sim \pi_{old}}\Big[\min\Big(\frac{\pi(a|s)}{\pi_{old}(a|s)} A_{\pi_{old}}(s, a),\; \mathrm{clip}\Big(\frac{\pi(a|s)}{\pi_{old}(a|s)}, 1-\epsilon, 1+\epsilon\Big) A_{\pi_{old}}(s, a)\Big)\Big], \quad (7)$$

where $\epsilon$ is a hyperparameter. The clipped surrogate objective in Eq. (7) helps PPO avoid large policy updates by penalizing changes to the policy that move $\frac{\pi(a|s)}{\pi_{old}(a|s)}$ away from 1 (Schulman et al. 2017). However, PPO suffers from high sample complexity because it does not utilize off-policy data, which leads to a great demand for on-policy interaction between agent and environment.
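As a reference point for what follows, here is a minimal PyTorch sketch of the standard PPO clipped surrogate objective in Eq. (7). The use of log-probabilities for the ratio and the variable names are our own choices, not part of the paper.

```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Eq. (7): E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)], with r = pi(a|s) / pi_old(a|s)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()

# Toy usage: maximize the objective by minimizing its negative.
logp_old = torch.log(torch.tensor([0.30, 0.10, 0.55]))
logp_new = torch.log(torch.tensor([0.35, 0.08, 0.60]))
advantages = torch.tensor([1.0, -0.5, 0.2])
loss = -ppo_clip_objective(logp_new, logp_old, advantages)
```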
## Off-Policy Proximal Policy Optimization

To tackle the sample inefficiency of PPO, we propose an Off-Policy Proximal Policy Optimization method (Off-Policy PPO) that employs off-policy data for policy optimization, as outlined in this section. Specifically, we first introduce the clipped surrogate objective using off-policy data in the proposed Off-Policy PPO. We next clarify the stability of the proposed Off-Policy PPO by showing that our method makes an update close to the old policy and that the degree of this update distance is the same as that in PPO. Finally, we describe the implementation details of the proposed Off-Policy PPO.

### Clipped Surrogate Objective Using Off-Policy Data

In this section, we describe the proposed clipped surrogate objective that utilizes off-policy data in Off-Policy PPO. To do this, we first present the optimization problem that maximizes the surrogate objective using off-policy data in Off-Policy TRPO (Meng et al. 2021). Using this surrogate objective, we then explain how we derive a clipped surrogate objective that effectively uses off-policy data while avoiding large policy updates.

Specifically, the optimization problem that can use off-policy data in Off-Policy TRPO (Meng et al. 2021) is:

$$\max_{\pi} \; \mathbb{E}_{s \sim \rho_{\mu}, a \sim \mu}\Big[\frac{\pi(a|s)}{\mu(a|s)} A_{\pi_{old}}(s, a)\Big] \quad (8)$$

$$\text{s.t.} \quad D^{\rho_\mu,\mathrm{sqrt}}_{KL}(\mu, \pi_{old})\, D^{\rho_\mu,\mathrm{sqrt}}_{KL}(\pi_{old}, \pi) + D^{\rho_\mu}_{KL}(\pi_{old}, \pi) \le \delta, \quad (9)$$

where $\mu$ represents the behavior policy, $\rho_\mu(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \mu)$, $D^{\rho_\mu}_{KL}(\pi_{old}, \pi) := \mathbb{E}_{s \sim \rho_\mu}[D_{KL}(\pi_{old}(\cdot|s) \,\|\, \pi(\cdot|s))]$, $D^{\rho_\mu,\mathrm{sqrt}}_{KL}(\mu, \pi_{old}) := \mathbb{E}_{s \sim \rho_\mu}[\sqrt{D_{KL}(\mu(\cdot|s) \,\|\, \pi_{old}(\cdot|s))}]$, and $D^{\rho_\mu,\mathrm{sqrt}}_{KL}(\pi_{old}, \pi) := \mathbb{E}_{s \sim \rho_\mu}[\sqrt{D_{KL}(\pi_{old}(\cdot|s) \,\|\, \pi(\cdot|s))}]$.

However, without the constraint in Eq. (9), maximizing the surrogate objective in Eq. (8) suffers from excessively large policy updates. To address this issue, a straightforward solution is to apply the clipping strategy from PPO to the surrogate objective in Eq. (8). To clarify the clipped objective, we first denote the surrogate objective in Eq. (8) as:

$$L_{\mu}(\pi) = \mathbb{E}_{s \sim \rho_{\mu}, a \sim \mu}\Big[\frac{\pi(a|s)}{\mu(a|s)} A_{\pi_{old}}(s, a)\Big]. \quad (10)$$

With $L_{\mu}(\pi)$ in Eq. (10), the corresponding clipped surrogate objective using off-policy data is:

$$L^{CLIP}_{\mu}(\pi) = \mathbb{E}_{s \sim \rho_{\mu}, a \sim \mu}\Big[\min\Big(\frac{\pi(a|s)}{\mu(a|s)} A_{\pi_{old}}(s, a),\; \mathrm{clip}\Big(\frac{\pi(a|s)}{\mu(a|s)}, 1-\epsilon, 1+\epsilon\Big) A_{\pi_{old}}(s, a)\Big)\Big]. \quad (11)$$

Note that in Eq. (11), the policy ratio $\frac{\pi(a|s)}{\mu(a|s)}$ is generally either less than $1-\epsilon$ or greater than $1+\epsilon$. Consequently, the target policy $\pi(a|s)$ often remains unchanged and does not undergo any update during the optimization of the clipped surrogate objective in Eq. (11). To address this issue, we propose a clipped surrogate objective that scales the lower and upper bounds $(1-\epsilon, 1+\epsilon)$ in Eq. (11) by a factor of $\frac{\pi_{old}(a|s)}{\mu(a|s)}$:

$$L^{CLIP}_{\text{Off-Policy PPO}}(\pi) = \mathbb{E}_{s \sim \rho_{\mu}, a \sim \mu}\Big[\min\Big(r_{\pi}(s, a) A_{\pi_{old}}(s, a),\; \mathrm{clip}\big(r_{\pi}(s, a), l_{s,a}, h_{s,a}\big) A_{\pi_{old}}(s, a)\Big)\Big], \quad (12)$$

where $r_{\pi}(s, a) = \frac{\pi(a|s)}{\mu(a|s)}$, $l_{s,a} = \frac{\pi_{old}(a|s)}{\mu(a|s)}(1-\epsilon)$, and $h_{s,a} = \frac{\pi_{old}(a|s)}{\mu(a|s)}(1+\epsilon)$.
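The sketch below renders the proposed objective in Eq. (12) in PyTorch, under the assumption that the log-probabilities of the sampled actions under $\pi$, $\pi_{old}$, and $\mu$ are available; it is an illustrative rendering of the formula, not the authors' implementation.

```python
import torch

def off_policy_ppo_clip_objective(logp_new, logp_old, logp_mu, advantages, eps=0.2):
    """Eq. (12): the ratio is taken w.r.t. the behavior policy mu, and the clipping
    bounds (1 - eps, 1 + eps) are scaled by pi_old(a|s) / mu(a|s)."""
    r_pi = torch.exp(logp_new - logp_mu)    # r_pi(s, a) = pi(a|s) / mu(a|s)
    r_old = torch.exp(logp_old - logp_mu)   # pi_old(a|s) / mu(a|s)
    low = r_old * (1.0 - eps)               # l_{s,a}
    high = r_old * (1.0 + eps)              # h_{s,a}
    unclipped = r_pi * advantages
    clipped = torch.min(torch.max(r_pi, low), high) * advantages  # clip(r_pi, l, h)
    return torch.min(unclipped, clipped).mean()
```

When the behavior policy coincides with the current policy ($\mu = \pi_{old}$), $l_{s,a}$ and $h_{s,a}$ reduce to $1-\epsilon$ and $1+\epsilon$, and the sketch reduces to the PPO objective of Eq. (7).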
### Stability Analysis

In this section, we analyze the stability of the proposed Off-Policy PPO by clarifying that our method makes an update close to the old policy, and that the degree of this update distance is the same as that in PPO. To this end, we first describe the optimal policy set that maximizes the proposed clipped objective in Eq. (12) in Lemma 1. We then clarify in Theorem 1 that the maximum KL divergence between the current policy $\pi_{old}$ and the optimal policy $\pi_{new}$ in the proposed Off-Policy PPO is consistent with that in PPO.

With the surrogate objective in Eq. (12), we denote by $\Pi_{new}$ the set of optimal policies maximizing this objective in Lemma 1. Note that the advantage value $A_{\pi_{old}}(s, a)$ is abbreviated as $A$ in Lemma 1.

**Lemma 1.** $\Pi_{new} = \{\pi \mid$ for all state-action pairs $(s, a)$ with $A < 0$, $\pi(a|s) \le \mu(a|s) l_{s,a}$; for all state-action pairs $(s, a)$ with $A > 0$, $\pi(a|s) \ge \min(\mu(a|s) h_{s,a}, 1)\}$.

*Proof.* Firstly, we prove that a policy $\pi^*$ meeting the conditions in $\Pi_{new}$ is an optimal solution maximizing the objective in Eq. (12). To prove this, we need to show that, for any state-action pair $(s, a)$, a policy $\pi^*$ meeting the conditions in $\Pi_{new}$ satisfies $L^{s,a}_\mu(\pi^*) \ge L^{s,a}_\mu(\pi)$ for any $\pi$, where $L^{s,a}_\mu(\pi)$ denotes the surrogate objective at $(s, a)$ under the policy $\pi$:

$$L^{s,a}_\mu(\pi) = \min\big(r_\pi(s, a) A,\; \mathrm{clip}(r_\pi(s, a), l_{s,a}, h_{s,a}) A\big).$$

If $A < 0$, $L^{s,a}_\mu(\pi)$ can be written as:

$$L^{s,a}_\mu(\pi) = \begin{cases} l_{s,a}\, A, & r_\pi(s, a) \le l_{s,a} \\ r_\pi(s, a)\, A, & r_\pi(s, a) > l_{s,a}. \end{cases} \quad (13)$$

$L^{s,a}_\mu(\pi^*)$ can be written as $L^{s,a}_\mu(\pi^*) = \min\big(r_{\pi^*}(s, a) A, \mathrm{clip}(r_{\pi^*}(s, a), l_{s,a}, h_{s,a}) A\big) = l_{s,a} A$, where $\pi^*$ meeting the conditions in $\Pi_{new}$ satisfies $\pi^*(a|s) \le \mu(a|s) l_{s,a}$ when $A < 0$. Thus, if $A < 0$, $L^{s,a}_\mu(\pi) \le l_{s,a} A = L^{s,a}_\mu(\pi^*)$ for any $\pi$.

If $A > 0$, $L^{s,a}_\mu(\pi)$ can be written as:

$$L^{s,a}_\mu(\pi) = \begin{cases} h_{s,a}\, A, & r_\pi(s, a) \ge h_{s,a} \\ r_\pi(s, a)\, A, & r_\pi(s, a) < h_{s,a}. \end{cases} \quad (14)$$

$L^{s,a}_\mu(\pi^*)$ can be written as $L^{s,a}_\mu(\pi^*) = \min\big(r_{\pi^*}(s, a) A, \mathrm{clip}(r_{\pi^*}(s, a), l_{s,a}, h_{s,a}) A\big) = h_{s,a} A$, where $\pi^*$ satisfies $\pi^*(a|s) \ge \mu(a|s) h_{s,a}$ when $A > 0$, because a policy meeting the conditions in $\Pi_{new}$ satisfies $\pi^*(a|s) \ge \min(\mu(a|s) h_{s,a}, 1)$ and $\pi^*(a|s) \le 1$. Thus, if $A > 0$, $L^{s,a}_\mu(\pi) \le h_{s,a} A = L^{s,a}_\mu(\pi^*)$ for any $\pi$. Based on these facts, we have proven that a policy $\pi^*$ meeting the conditions in $\Pi_{new}$ is an optimal solution.

Secondly, we prove that a policy $\pi_0$ not meeting the conditions in $\Pi_{new}$ is not an optimal solution of maximizing the objective in Eq. (12). To prove this, we construct a policy $\pi^*$ satisfying the conditions in $\Pi_{new}$. Then $L^{s,a}_\mu(\pi_0) \le L^{s,a}_\mu(\pi^*)$ for every state-action pair $(s, a)$, with strict inequality at the pairs where $\pi_0$ violates the conditions. Based on this fact, we have proven that a policy not meeting the conditions in $\Pi_{new}$ is not an optimal solution of maximizing the objective in Eq. (12).

Finally, combining the above results, we have proven that $\Pi_{new}$ described in Lemma 1 contains all the optimal solutions of maximizing the surrogate objective in Eq. (12). $\square$

Based on the optimal policy set $\Pi_{new}$ in Lemma 1, we clarify that the degree of the policy update distance in Off-Policy PPO is the same as that in PPO. Specifically, we demonstrate that the maximum KL divergence between the new policy and the old one in our method is equal to that in PPO, as stated in Theorem 1.

**Theorem 1.** Let $\pi^{\text{Off-Policy PPO}}_{new} \in \Pi_{new}$ denote the optimal policy in Off-Policy PPO that achieves the minimum KL divergence over all optimal policies, i.e., $D_{KL}(\pi_{old}(\cdot|s_t), \pi^{\text{Off-Policy PPO}}_{new}(\cdot|s_t)) \le D_{KL}(\pi_{old}(\cdot|s_t), \pi(\cdot|s_t))$ for all $\pi \in \Pi_{new}$ at any timestep $t$, and let $\pi^{PPO}_{new}$ be defined analogously for PPO. Then $\max_t D_{KL}(\pi_{old}(\cdot|s_t), \pi^{\text{Off-Policy PPO}}_{new}(\cdot|s_t)) = \max_t D_{KL}(\pi_{old}(\cdot|s_t), \pi^{PPO}_{new}(\cdot|s_t))$, where the maximum is taken over all timesteps $t$.

*Proof.* To simplify the expressions in the proof, we denote $D_{KL}(\pi_{old}(\cdot|s_t), \pi^{\text{Off-Policy PPO}}_{new}(\cdot|s_t))$ and $D_{KL}(\pi_{old}(\cdot|s_t), \pi^{PPO}_{new}(\cdot|s_t))$ as $D^{s_t}_{KL}(\pi_{old}, \pi^{\text{Off-Policy PPO}}_{new})$ and $D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$, respectively. We need to prove that $D^{s_t}_{KL}(\pi_{old}, \pi^{\text{Off-Policy PPO}}_{new}) = D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$ for any timestep $t$. We prove this in two cases, $A_{\pi_{old}}(s_t, a_t) \le 0$ and $A_{\pi_{old}}(s_t, a_t) > 0$, for any timestep $t$, and denote $A_{\pi_{old}}(s_t, a_t)$ as $A_t$ for simplicity in the following.

In the case $A_t \le 0$, we prove that $D^{s_t}_{KL}(\pi_{old}, \pi^{\text{Off-Policy PPO}}_{new}) = D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$. If $A_t \le 0$, the optimal policy $\pi^{\text{Off-Policy PPO}}_{new}$ can be derived by solving the following constrained optimization problem according to Lemma 1:

$$\min_{\pi} \; \sum_{a} \pi_{old}(a|s_t) \log\frac{\pi_{old}(a|s_t)}{\pi(a|s_t)} \quad \text{s.t.} \quad \pi(a_t|s_t) \le \mu(a_t|s_t) l_{s_t,a_t}, \quad \sum_a \pi(a|s_t) = 1, \quad \pi(a|s_t) > 0,$$

where $a_t$ denotes the action at timestep $t$. By using the Karush-Kuhn-Tucker (KKT) conditions (Gordon and Tibshirani 2012), we get:

$$\pi^{\text{Off-Policy PPO}}_{new}(a|s_t) = \begin{cases} \pi_{old}(a|s_t)\,\dfrac{1 - \mu(a_t|s_t) l_{s_t,a_t}}{1 - \pi_{old}(a_t|s_t)}, & a \ne a_t \\[4pt] \mu(a_t|s_t)\, l_{s_t,a_t}, & a = a_t. \end{cases} \quad (16)$$

The corresponding KL divergence is

$$D^{s_t}_{KL}(\pi_{old}, \pi^{\text{Off-Policy PPO}}_{new}) = (1 - \pi_{old}(a_t|s_t)) \log\frac{1 - \pi_{old}(a_t|s_t)}{1 - \pi_{old}(a_t|s_t)(1-\epsilon)} - \pi_{old}(a_t|s_t)\log(1-\epsilon),$$

which equals $D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$ described in Eq. (26) of the appendix of (Wang et al. 2019), since the lower bound in PPO is $1-\epsilon$.

In the case $A_t > 0$, we prove that $D^{s_t}_{KL}(\pi_{old}, \pi^{\text{Off-Policy PPO}}_{new}) = D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$. If $A_t > 0$, according to Lemma 1, the constrained optimization problem is:

$$\min_{\pi} \; \sum_{a} \pi_{old}(a|s_t) \log\frac{\pi_{old}(a|s_t)}{\pi(a|s_t)} \quad \text{s.t.} \quad \pi(a_t|s_t) \ge \min(\mu(a_t|s_t) h_{s_t,a_t}, 1), \quad \sum_a \pi(a|s_t) = 1, \quad \pi(a|s_t) > 0.$$

By using the KKT conditions, we get:

$$\pi^{\text{Off-Policy PPO}}_{new}(a|s_t) = \begin{cases} \pi_{old}(a|s_t)\,\dfrac{1 - \min(\mu(a_t|s_t) h_{s_t,a_t}, 1)}{1 - \pi_{old}(a_t|s_t)}, & a \ne a_t \\[4pt] \min(\mu(a_t|s_t)\, h_{s_t,a_t}, 1), & a = a_t. \end{cases} \quad (19)$$

When $A_t > 0$ and $\mu(a_t|s_t) h_{s_t,a_t} \le 1$, the KL divergence is

$$D^{s_t}_{KL}(\pi_{old}, \pi^{\text{Off-Policy PPO}}_{new}) = (1 - \pi_{old}(a_t|s_t)) \log\frac{1 - \pi_{old}(a_t|s_t)}{1 - \pi_{old}(a_t|s_t)(1+\epsilon)} - \pi_{old}(a_t|s_t)\log(1+\epsilon),$$

which equals $D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$ described in Eq. (28) of the appendix of (Wang et al. 2019), since the upper bound in PPO is $1+\epsilon$. When $A_t > 0$ and $\mu(a_t|s_t) h_{s_t,a_t} > 1$, $D^{s_t}_{KL}(\pi_{old}, \pi^{\text{Off-Policy PPO}}_{new}) = +\infty = D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$ (Wang et al. 2019).

Combining the above results for the two cases ($A_t \le 0$ and $A_t > 0$), we have proven that $D^{s_t}_{KL}(\pi_{old}, \pi^{\text{Off-Policy PPO}}_{new}) = D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$ for any timestep $t$. Based on this fact, we conclude that $\max_t D^{s_t}_{KL}(\pi_{old}, \pi^{\text{Off-Policy PPO}}_{new}) = \max_t D^{s_t}_{KL}(\pi_{old}, \pi^{PPO}_{new})$, which proves Theorem 1. $\square$
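The following small numerical check illustrates the $A_t \le 0$ case of Theorem 1 for a discrete action space: it builds the optimal policy of Eq. (16) for two different behavior policies and verifies that the resulting KL divergence from $\pi_{old}$ matches the closed form above and does not depend on $\mu$. The specific distributions are arbitrary test values chosen by us.

```python
import numpy as np

def optimal_policy_neg_adv(pi_old, mu, a_t, eps=0.2):
    """Eq. (16) for A_t <= 0: pi_new(a_t) = mu(a_t) * l_{s_t,a_t} = pi_old(a_t) * (1 - eps);
    the remaining probability mass is rescaled proportionally to pi_old."""
    l = pi_old[a_t] / mu[a_t] * (1.0 - eps)
    pi_new = pi_old * (1.0 - mu[a_t] * l) / (1.0 - pi_old[a_t])
    pi_new[a_t] = mu[a_t] * l
    return pi_new

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

pi_old = np.array([0.5, 0.3, 0.2])
eps, a_t = 0.2, 0
p = pi_old[a_t]
closed_form = (1 - p) * np.log((1 - p) / (1 - p * (1 - eps))) - p * np.log(1 - eps)

# Two different behavior policies give the same KL divergence as PPO's closed form.
for mu in (np.array([0.4, 0.4, 0.2]), np.array([0.7, 0.2, 0.1])):
    assert np.isclose(kl(pi_old, optimal_policy_neg_adv(pi_old, mu, a_t, eps)), closed_form)
```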
### Implementation Details

In this section, we introduce the implementation details of the proposed Off-Policy PPO, which iteratively optimizes policies by maximizing the proposed clipped surrogate objective in Eq. (12). The whole procedure of the proposed Off-Policy PPO is described in Algorithm 1.

**Algorithm 1: Off-Policy PPO**

- **Require:** environment $E$, trace-decay parameter $\lambda$, discount factor $\gamma$, learning rate $\alpha$, trajectory length $K$, replay memory $R$, epoch number $N$, minibatch size $M$.
- Initialize the policy network $\pi_\theta$ and the state value network $V_w$.
- **repeat**
  - *Collect data:* collect $K$ transitions $S_{0:K} = \{s_0, a_0, r_0, \mu(\cdot|s_0), \ldots, s_K, a_K, r_K, \mu(\cdot|s_K)\}$ by interacting with the environment according to the behavior policy $\mu$, and add the collected data to the replay memory $R$.
  - *Update networks:* sample off-policy data $T_{0:K}$ from the replay memory $R$; obtain $\{V_w(s_j)\}_{j=0}^{K}$ from $V_w$; obtain the V-trace targets $v_j = V_w(s_j) + \delta_j^V + \gamma c_j (v_{j+1} - V_w(s_{j+1}))$; obtain the advantage values $\{A(s_j, a_j)\}_{j=0}^{K-1} = \{r_j + \gamma v_{j+1} - V_w(s_j)\}_{j=0}^{K-1}$.
  - **for** epoch $= 1, 2, \ldots, N$ **do**
    - Acquire data $D_{0:K}$ by randomly shuffling $T_{0:K}$.
    - **for** $i = 1, 2, \ldots, K/M$ **do**
      - Obtain the $i$-th minibatch of size $M$ from the shuffled data $D_{0:K}$.
      - Update $\pi_\theta$ by maximizing the objective in Eq. (12): $\theta \leftarrow \theta_{old} + \alpha \nabla_\theta L^{CLIP}_{\text{Off-Policy PPO}}(\pi_\theta)$.
      - Update $V_w$ by minimizing the mean squared error between the V-trace target $v$ and the state value $V_w(s)$.
    - **end for**
  - **end for**
- **until** the policy $\pi_\theta$ converges.

In Algorithm 1, we first initialize the policy network $\pi_\theta$ and the state value network $V_w$. Using these networks, we then collect $K$ transitions $S_{0:K} = \{s_0, a_0, r_0, \mu(\cdot|s_0), \ldots, s_K, a_K, r_K, \mu(\cdot|s_K)\}$ and add them to the memory. Next, we sample off-policy data $T_{0:K}$ from the memory $R$ and use these data to update the policy and state value networks. During the network update procedure, we first use the data $T_{0:K}$ to estimate the state values $\{V_w(s_j)\}_{j=0}^{K}$ and the V-trace targets (Espeholt et al. 2018) $v_j = V_w(s_j) + \delta_j^V + \gamma c_j (v_{j+1} - V_w(s_{j+1}))$, where $\delta_j^V = \rho_j (r_j + \gamma V_w(s_{j+1}) - V_w(s_j))$, $\rho_j = \min\big(1, \frac{\pi_{\theta_{old}}(a_j|s_j)}{\mu(a_j|s_j)}\big)$, and $c_j = \min\big(1, \frac{\pi_{\theta_{old}}(a_j|s_j)}{\mu(a_j|s_j)}\big)$. With these values, we next obtain the advantage values $\{A(s_j, a_j)\}_{j=0}^{K-1} = \{r_j + \gamma v_{j+1} - V_w(s_j)\}_{j=0}^{K-1}$. Finally, we optimize the policy network $\pi_\theta$ and the state value network $V_w$ for $N$ epochs. In every epoch, we shuffle the off-policy data $T_{0:K}$ and draw minibatches from the shuffled data $D_{0:K}$. With each minibatch, we optimize $\pi_\theta$ by maximizing the objective in Eq. (12), $\theta \leftarrow \theta_{old} + \alpha \nabla_\theta L^{CLIP}_{\text{Off-Policy PPO}}(\pi_\theta)$, and optimize $V_w$ by minimizing the mean squared error between the V-trace target $v$ and the state value $V_w(s)$.
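Below is a minimal NumPy sketch of the per-trajectory computations in Algorithm 1: the V-trace targets, the advantage values, and (in comments) the subsequent gradient step on Eq. (12). Array names and the bootstrap choice $v_K = V_w(s_K)$ are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def vtrace_targets_and_advantages(rewards, values, logp_old, logp_mu, gamma=0.99):
    """V-trace targets v_j = V(s_j) + delta_j^V + gamma * c_j * (v_{j+1} - V(s_{j+1}))
    with delta_j^V = rho_j * (r_j + gamma * V(s_{j+1}) - V(s_j)) and
    rho_j = c_j = min(1, pi_old(a_j|s_j) / mu(a_j|s_j)); advantages are
    A(s_j, a_j) = r_j + gamma * v_{j+1} - V(s_j).
    `values` holds V(s_0), ..., V(s_K); the other arrays have length K."""
    K = len(rewards)
    rho = np.minimum(1.0, np.exp(logp_old - logp_mu))
    c = rho                                   # same truncation for rho_j and c_j
    v = values.astype(float).copy()           # v_K defaults to V(s_K)
    for j in reversed(range(K)):
        delta = rho[j] * (rewards[j] + gamma * values[j + 1] - values[j])
        v[j] = values[j] + delta + gamma * c[j] * (v[j + 1] - values[j + 1])
    advantages = rewards + gamma * v[1:] - values[:-1]
    return v, advantages

# Toy usage with made-up numbers (K = 3 transitions).
rewards = np.array([1.0, 0.0, 0.5])
values = np.array([1.2, 1.0, 0.8, 0.6])            # V_w(s_0), ..., V_w(s_3)
logp_old = np.log(np.array([0.5, 0.4, 0.6]))       # log pi_old(a_j|s_j)
logp_mu = np.log(np.array([0.6, 0.3, 0.6]))        # log mu(a_j|s_j)
v_targets, advantages = vtrace_targets_and_advantages(rewards, values, logp_old, logp_mu)
# Each epoch then shuffles the data and, for every minibatch, maximizes Eq. (12)
# (e.g., with the off_policy_ppo_clip_objective sketch above) and regresses V_w
# onto v_targets with a mean-squared-error loss.
```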
## Experiments

In this section, we perform experiments to evaluate the proposed Off-Policy Proximal Policy Optimization (Off-Policy PPO) on a variety of representative continuous control tasks. We first introduce the experimental setup, which comprises the networks, hyperparameters, and experimental tasks. We next compare our method with the state-of-the-art methods, i.e., TRGPPO (Wang et al. 2019), Soft Actor-Critic (SAC) (Haarnoja et al. 2018), DDPG (Lillicrap et al. 2016), SLAC (Lee et al. 2020), Off-Policy TRPO (Meng et al. 2021), TD3 (Fujimoto, Hoof, and Meger 2018), SOP (Wang et al. 2020), and Trust-PCL (Nachum et al. 2018), to validate that our method achieves better or comparable performance. We then study the effectiveness of our method in using off-policy data by comparing it with PPO. Next, we study the KL divergence curves of our method, PPO, and SAC to evaluate the stability of our method in practice. Finally, we study the sample efficiency of our method by comparing it with the other methods in terms of the timesteps required to reach a threshold on the continuous control tasks.

In the experiments, we adopt a policy network and a state value network to approximate the Gaussian policy distribution and the state value, respectively. These networks are multi-layer neural networks comprising two hidden layers with 64 neurons, with Tanh as the activation function. For hyperparameters, the trace-decay parameter $\lambda$ is 0.95 and the discount factor $\gamma$ is 0.99. The length of transitions ($K$) is set to 1024. We use the Adam optimizer with learning rate $\alpha = 3 \times 10^{-4}$. The epoch number $N$ is 10. The minibatch size $M$ is set to 32. The experimental tasks consist of six representative continuous control tasks from OpenAI Gym (Brockman et al. 2016) and MuJoCo (Todorov, Erez, and Tassa 2012), covering both simple and complex tasks: Swimmer, Hopper, HalfCheetah, Walker2d, Ant, and Humanoid. We adopt a commonly used version, i.e., v2, of these tasks. The experiments are performed on a GPU server with four Nvidia RTX 3090 GPUs. The results reported in the experiments are averaged over the top three seeds of ten random seeds. For the implementations of the compared methods, we use the original implementation code released by the authors for most of the state-of-the-art methods (TRGPPO, SAC, SLAC, Off-Policy TRPO, TD3, SOP, and Trust-PCL), and we use the implementation in https://github.com/chainer/chainerrl for DDPG.
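For concreteness, here is a sketch of networks matching the stated setup (two hidden layers of 64 Tanh units, a Gaussian policy, Adam with learning rate 3×10⁻⁴). The diagonal Gaussian head with a state-independent log standard deviation and the example input/output dimensions are our assumptions, since the paper does not specify the output parameterization.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Policy network: two 64-unit Tanh hidden layers, outputting a diagonal Gaussian."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # assumed state-independent

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

class ValueNet(nn.Module):
    """State value network with the same two-hidden-layer Tanh architecture."""
    def __init__(self, obs_dim):
        super().__init__()
        self.v = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, obs):
        return self.v(obs).squeeze(-1)

# Hyperparameters listed above: lambda = 0.95, gamma = 0.99, K = 1024, N = 10, M = 32.
policy = GaussianPolicy(obs_dim=17, act_dim=6)   # example dimensions only
value_net = ValueNet(obs_dim=17)
optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(value_net.parameters()), lr=3e-4)
```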
### Comparison with State-of-the-art Methods

In this section, we compare our method with the state-of-the-art methods, i.e., TRGPPO, SAC, DDPG, SLAC, Off-Policy TRPO, TD3, SOP, and Trust-PCL, to validate that our method can outperform these methods on most tasks. The training curve comparison between our method and the state-of-the-art methods is shown in Figure 1.

[Figure 1: Training curve comparison between the proposed Off-Policy PPO and other state-of-the-art methods.]

We observe that, compared to the other methods, our method requires fewer timesteps to achieve the same return on the majority of tasks, i.e., Swimmer, Hopper, Walker2d, and Ant. On HalfCheetah and Humanoid, the proposed Off-Policy PPO needs fewer timesteps to achieve the same return than most methods, i.e., TRGPPO, DDPG, SLAC, Off-Policy TRPO, and Trust-PCL. It can also be observed that the final return achieved by our method is higher than or comparable to that of the other methods on these tasks. Notice that our method obtains the highest returns among TRGPPO, DDPG, SLAC, Off-Policy TRPO, and Trust-PCL on HalfCheetah and Humanoid. The results in Figure 1 illustrate that our method can surpass these state-of-the-art methods on most tasks.

### Study on Using Off-Policy Data

In this section, we study the effectiveness of our method in using off-policy data by comparing our method, which uses off-policy data, with PPO, which only uses on-policy data. The results are shown in Figure 2.

[Figure 2: Training curve comparison between the proposed Off-Policy PPO and PPO during training.]

Note that our method needs fewer timesteps to achieve the same return than PPO on most tasks, and this phenomenon is particularly evident on the complex tasks (Walker2d, Ant, and Humanoid), because our method can use off-policy data to optimize policies. We can also note that our method, using off-policy data, achieves higher final returns than PPO, which only uses on-policy data, on most tasks. The experimental results in Figure 2 validate the effectiveness of our method in using off-policy data.

### Study on KL Divergence

In this section, we study the KL divergence of our method, PPO, and SAC to evaluate the stability of our method in practice. SAC is chosen for comparison because it is a state-of-the-art off-policy method. The KL divergence comparison is shown in Figure 3. The KL divergence reported in Figure 3 denotes the KL divergence between $\pi_{old}$ and $\pi_{new}$ in each policy update in practice. Note that we choose the representative tasks Hopper, HalfCheetah, and Walker2d due to their popularity (Todorov, Erez, and Tassa 2012).

[Figure 3: KL divergence comparison among the proposed Off-Policy PPO, PPO, and SAC during training.]

From Figure 3, we can observe that the KL divergence of SAC is larger than those of PPO and Off-Policy PPO. It is noticeable that the KL divergence of SAC on Hopper and Walker2d has an increasing trend while that of SAC on HalfCheetah has a decreasing trend, which is most likely because the large policy updates of SAC on Hopper and Walker2d mainly occur in the early training stage, while those of SAC on HalfCheetah mainly occur in the late training stage. From Figure 3, it can also be observed that the proposed Off-Policy PPO has nearly the same KL divergence as PPO throughout training. The similar KL divergence curves of our method and PPO in Figure 3 demonstrate that the proposed Off-Policy PPO does not harm stability in practice.

### Study on Sample Efficiency

In this section, we study the sample efficiency of our method by comparing it with the other methods in terms of the timesteps needed to reach a threshold during training. The comparison is presented in Table 1. We set the threshold values of Swimmer, Hopper, HalfCheetah, Walker2d, Ant, and Humanoid to 90, 3000, 3000, 3000, 3000, and 5000, respectively, following the threshold values in (Wang et al. 2019).

Table 1: Comparison of timesteps (×10³) to reach a threshold within one million timesteps (except Humanoid, with ten million) during training. The thresholds for the tasks (Swimmer, Hopper, HalfCheetah, Walker2d, Ant, and Humanoid) are 90, 3000, 3000, 3000, 3000, and 5000, respectively. We denote Off-Policy TRPO as Off-TRPO for short. For each task, the minimum result is indicated in boldface. "/" indicates that the method did not reach the threshold within the fixed timesteps.

| Tasks | PPO | TRGPPO | SAC | DDPG | SLAC | Off-TRPO | TD3 | SOP | Trust-PCL | Our Method |
|---|---|---|---|---|---|---|---|---|---|---|
| Swimmer | 520 | **340** | / | / | / | 810 | 740 | / | / | **340** |
| Hopper | / | 440 | 920 | / | / | / | 380 | 355 | / | **210** |
| HalfCheetah | 140 | 140 | 70 | / | 160 | / | **45** | **45** | / | 70 |
| Walker2d | 750 | 590 | 700 | / | / | / | 285 | 365 | / | **220** |
| Ant | / | 860 | 680 | / | / | / | 455 | **285** | / | 310 |
| Humanoid | 6800 | 3400 | 1000 | / | / | / | **950** | 1200 | / | 2800 |

As shown in Table 1, compared with the other methods, our method requires fewer or comparable timesteps to reach these thresholds on most tasks. Note that Off-Policy TRPO did not reach the thresholds within the fixed timesteps on most tasks, which is probably because the update interval value in this method is relatively large.
DDPG and Trust-PCL did not reach the thresholds within the fixed timesteps on most tasks, which is probably because these two methods need more timesteps during training to achieve better performance. The results in Table 1 validate that our method achieves higher sample efficiency than the other methods on most tasks.

## Conclusion and Future Work

In this paper, we propose an Off-Policy Proximal Policy Optimization (Off-Policy PPO) method, which improves the sample efficiency of PPO by utilizing off-policy data and thereby provides a novel way to improve the sample efficiency of PPO. Specifically, we propose a clipped surrogate objective using off-policy data, which is based on the surrogate objective in Off-Policy TRPO. Then, we theoretically clarify the stability of optimizing the proposed clipped surrogate objective using off-policy data. Next, we describe in detail the algorithm of the proposed Off-Policy PPO, which iteratively optimizes policies by maximizing the proposed clipped surrogate objective. Finally, experimental results on a variety of domains validate that our method outperforms state-of-the-art methods on most tasks. These experimental results also evaluate our method in terms of using off-policy data and sample efficiency.

Although this algorithm is appealing for sequential decision-making problems in which collecting data is difficult, it would be interesting future work to improve its performance in the rare scenarios where the quality of the off-policy data is poor, e.g., when $\mu(\cdot|s)$ differs substantially from $\pi_{old}(\cdot|s)$.

## Acknowledgments

This work is supported by STI 2030 Major Projects (2021ZD0200400), Natural Science Foundation of China (No. 62206158, No. 61925603), Natural Science Foundation of Shandong Province (No. ZR2022QF097), Major Basic Research Project of Natural Science Foundation of Shandong Province (No. ZR2021ZD15), and The Fundamental Research Funds of Shandong University. We are very grateful to Guoqiang Wu for his great help with this paper.

## References

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Degris, T.; White, M.; and Sutton, R. S. 2012. Off-policy actor-critic. In International Conference on Machine Learning, 179-186.

Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; et al. 2018. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning, 1407-1416.

Fujimoto, S.; Hoof, H.; and Meger, D. 2018. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 1587-1596.

Gordon, G.; and Tibshirani, R. 2012. Karush-Kuhn-Tucker conditions. Optimization, 10(725/36): 725.

Haarnoja, T.; Zhou, A.; Abbeel, P.; and Levine, S. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 1861-1870.

Kober, J.; Bagnell, J. A.; and Peters, J. 2013. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11): 1238-1274.

Lee, A. X.; Nagabandi, A.; Abbeel, P.; and Levine, S. 2020. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. In Advances in Neural Information Processing Systems, 741-752.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous control with deep reinforcement learning. In International Conference on Learning Representations.
Meng, W.; Zheng, Q.; Shi, Y.; and Pan, G. 2021. An off-policy trust region policy optimization method with monotonic improvement guarantee for deep reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 33(5): 2223-2235.

Meng, W.; Zheng, Q.; Yang, L.; Li, P.; and Pan, G. 2019. Qualitative measurements of policy discrepancy for return-based deep Q-network. IEEE Transactions on Neural Networks and Learning Systems, 31(10): 4374-4380.

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928-1937.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540): 529-533.

Nachum, O.; Norouzi, M.; Xu, K.; and Schuurmans, D. 2018. Trust-PCL: An Off-Policy Trust Region Method for Continuous Control. In International Conference on Learning Representations.

Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; et al. 2020. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839): 604-609.

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning, 1889-1897.

Schulman, J.; Moritz, P.; Levine, S.; Jordan, M. I.; and Abbeel, P. 2016. High-Dimensional Continuous Control Using Generalized Advantage Estimation. In International Conference on Learning Representations.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587): 484-489.

Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 387-395.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature, 550(7676): 354-359.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press.

Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057-1063.

Todorov, E.; Erez, T.; and Tassa, Y. 2012. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 5026-5033.

Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782): 350-354.

Wang, C.; Wu, Y.; Vuong, Q.; and Ross, K. 2020. Striving for simplicity and performance in off-policy DRL: Output normalization and non-uniform sampling. In International Conference on Machine Learning, 10070-10080.
Wang, Y.; He, H.; and Tan, X. 2020. Truly proximal policy optimization. In Uncertainty in Artificial Intelligence, 113-122.

Wang, Y.; He, H.; Tan, X.; and Gan, Y. 2019. Trust Region-Guided Proximal Policy Optimization. In Advances in Neural Information Processing Systems, 626-636.

Xing, D.; Liu, Q.; Zheng, Q.; and Pan, G. 2021. Learning with Generated Teammates to Achieve Type-Free Ad-Hoc Teamwork. In International Joint Conference on Artificial Intelligence, 472-478.

Yang, L.; Ji, J.; Dai, J.; Zhang, L.; Zhou, B.; Li, P.; Yang, Y.; and Pan, G. 2022a. Constrained Update Projection Approach to Safe Policy Optimization. In Advances in Neural Information Processing Systems.

Yang, L.; Zhang, Y.; Zheng, G.; Zheng, Q.; Li, P.; Huang, J.; and Pan, G. 2022b. Policy optimization with stochastic mirror descent. In Proceedings of the AAAI Conference on Artificial Intelligence, 8823-8831.

Zou, L.; Zhuang, Z.; Cheng, Y.; Wang, X.; and Zhang, W. 2019. Separated trust regions policy optimization method. In International Conference on Knowledge Discovery & Data Mining, 1471-1479.