Published as a conference paper at ICLR 2023

BEHAVIOR PROXIMAL POLICY OPTIMIZATION

Zifeng Zhuang 1,2, Kun Lei 2, Jinxin Liu 2, Donglin Wang 2,3, Yilang Guo 4
1 Zhejiang University. 2 School of Engineering, Westlake University. 3 Institute of Advanced Technology, Westlake Institute for Advanced Study. 4 School of Software Engineering, Beijing Jiaotong University.
{zhuangzifeng,leikun,liujinxin,wangdonglin}@westlake.edu.cn, 20301130@bjtu.edu.cn

ABSTRACT

Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to the overestimation of out-of-distribution state-action pairs. Thus, various additional augmentations are proposed to keep the learned policy close to the offline dataset (or the behavior policy). In this work, starting from the analysis of offline monotonic policy improvement, we reach a surprising conclusion that online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what the offline RL method needs to overcome overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without any extra constraint or regularization introduced compared to PPO. Extensive experiments on the D4RL benchmark empirically show that this extremely succinct method outperforms state-of-the-art offline RL algorithms. Our implementation is available at https://github.com/Dragon-Zhuang/BPPO.

1 INTRODUCTION

Typically, reinforcement learning (RL) is thought of as a paradigm for online learning, where the agent interacts with the environment to collect experiences and then uses them to improve itself (Sutton et al., 1998). This online process poses one of the biggest obstacles to real-world RL applications because data collection is expensive or even risky in some fields (such as navigation (Mirowski et al., 2018) and healthcare (Yu et al., 2021a)). As an alternative, offline RL eliminates online interaction and learns from a fixed dataset collected by some arbitrary and possibly unknown process (Lange et al., 2012; Fu et al., 2020). The prospect of this data-driven mode (Levine et al., 2020) is quite encouraging, and it carries great expectations for solving real-world RL applications.

Unfortunately, the major advantage of offline RL, the lack of online interaction, also raises another challenge. Although offline RL can be viewed as an extreme off-policy case, classical off-policy iterative algorithms tend to underperform because they overestimate out-of-distribution (abbreviated as OOD) state-action pairs. More specifically, when the Q-function poorly estimates the value of OOD state-action pairs during policy evaluation, the agent tends to take OOD actions with erroneously estimated high values, resulting in low performance after policy improvement (Fujimoto et al., 2019). Thus, to overcome the overestimation issue, some solutions keep the learned policy close to the behavior policy (or the offline dataset) (Fujimoto et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021).

Most offline RL algorithms adopt online interactions to select hyperparameters. This is because offline hyperparameter selection, which selects hyperparameters without online interactions, remains an open problem lacking satisfactory solutions (Paine et al., 2020; Zhang & Jiang, 2021).
Deploying a policy learned by offline RL is potentially risky in certain areas (Mirowski et al., 2018; Yu et al., 2021a) since its performance is unknown. However, the risk during online interaction is greatly reduced if the deployed policy is guaranteed to perform better than the behavior policy. This inspires us to consider how to use the offline dataset to improve the behavior policy with a monotonic performance guarantee. We formulate this problem as offline monotonic policy improvement.

To analyze offline monotonic policy improvement, we introduce the Performance Difference Theorem (Kakade & Langford, 2002). During the analysis, we find that the offline setting does make monotonic policy improvement more complicated, but the way to monotonically improve the policy remains unchanged. This indicates that algorithms derived from online monotonic policy improvement (such as Proximal Policy Optimization) can also achieve offline monotonic policy improvement. In other words, PPO can naturally solve offline RL. Based on this surprising discovery, we propose Behavior Proximal Policy Optimization (BPPO), an offline algorithm that monotonically improves the behavior policy in the manner of PPO. Owing to the inherent conservatism of PPO, BPPO restricts the ratio of the learned policy and the behavior policy within a certain range, similar to the offline RL methods that keep the learned policy close to the behavior policy. As offline algorithms become more and more sophisticated, TD3+BC (Fujimoto & Gu, 2021), which augments TD3 (Fujimoto et al., 2018) with behavior cloning (Pomerleau, 1988), reminds us to revisit simple alternatives with potentially good performance. BPPO is such a simple alternative, introducing no extra constraint or regularization on top of PPO. Extensive experiments on the D4RL benchmark (Fu et al., 2020) empirically show that BPPO outperforms state-of-the-art offline RL algorithms.

2 PRELIMINARIES

2.1 REINFORCEMENT LEARNING

Reinforcement learning (RL) is a framework for sequential decision making. Typically, the problem is formulated as a Markov decision process (MDP) $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, r, p, d_0, \gamma\}$, with state space $\mathcal{S}$, action space $\mathcal{A}$, scalar reward function $r$, transition dynamics $p$, initial state distribution $d_0(s_0)$ and discount factor $\gamma$ (Sutton et al., 1998). The objective of RL is to learn a policy, which defines a distribution over actions conditioned on states, $\pi(a_t|s_t)$ at timestep $t$, where $a_t \in \mathcal{A}$, $s_t \in \mathcal{S}$. Given this definition, the trajectory $\tau = (s_0, a_0, \ldots, s_T, a_T)$ generated by the agent's interaction with the environment $\mathcal{M}$ follows the distribution
$$P_\pi(\tau) = d_0(s_0)\prod_{t=0}^{T}\pi(a_t|s_t)\,p(s_{t+1}|s_t,a_t),$$
where $T$ is the length of the trajectory, which can be infinite. The goal of RL can then be written as an expectation under the trajectory distribution, $J(\pi) = \mathbb{E}_{\tau\sim P_\pi(\tau)}\big[\sum_{t=0}^{T}\gamma^t r(s_t,a_t)\big]$. This objective can also be measured by a state-action value function $Q_\pi(s,a)$, the expected discounted return given the action $a$ in state $s$:
$$Q_\pi(s,a) = \mathbb{E}_{\tau\sim P_\pi(\tau|s,a)}\Big[\sum_{t=0}^{T}\gamma^t r(s_t,a_t)\,\Big|\,s_0=s, a_0=a\Big].$$
Similarly, the value function $V_\pi(s)$ is the expected discounted return from a certain state $s$:
$$V_\pi(s) = \mathbb{E}_{\tau\sim P_\pi(\tau|s)}\Big[\sum_{t=0}^{T}\gamma^t r(s_t,a_t)\,\Big|\,s_0=s\Big].$$
Then, we can define the advantage function: $A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s)$.
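To make these definitions concrete, the following is a minimal sketch (with hypothetical helper names, not taken from the paper's code) of the Monte Carlo discounted return $\sum_t \gamma^t r_t$ on which the objective $J(\pi)$ and the value functions are built; the same per-timestep quantity is later used in Section 5 to fit $V_{\pi_\beta}$ by regression.

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Discounted return G_t = sum_{l>=0} gamma^l * r_{t+l} for every timestep
    of one trajectory, computed with a single backward pass."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With function approximators q_net and v_net, the advantage is simply
# A_pi(s, a) = Q_pi(s, a) - V_pi(s), i.e. q_net(s, a) - v_net(s).
```

Here $G_0$ of a trajectory is a Monte Carlo sample of $J(\pi)$, and the per-timestep values $G_t$ are the regression targets used for value fitting.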
2.2 OFFLINE REINFORCEMENT LEARNING

In offline RL, the agent only has access to a fixed dataset of transitions $\mathcal{D} = \{(s_t, a_t, s_{t+1}, r_t)\}_{t=1}^{N}$ collected by a behavior policy $\pi_\beta$. Without interacting with the environment $\mathcal{M}$, offline RL expects the agent to infer a policy from the dataset. Behavior cloning (BC) (Pomerleau, 1988), an imitation learning approach, can directly imitate the action of each state with supervised learning:
$$\hat{\pi}_\beta = \arg\max_{\pi}\ \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\log\pi(a|s)\right]. \quad (1)$$
Note that the performance of $\hat{\pi}_\beta$ trained by behavior cloning highly depends on the quality of the transitions, and thus on the collection process of the behavior policy $\pi_\beta$. In the rest of this paper, improving the behavior policy actually refers to improving the estimated behavior policy $\hat{\pi}_\beta$, because $\pi_\beta$ is unknown.

2.3 PERFORMANCE DIFFERENCE THEOREM

Theorem 1. (Kakade & Langford, 2002) Define the discounted unnormalized visitation frequencies as $\rho_\pi(s) = \sum_{t=0}^{T}\gamma^t P(s_t = s|\pi)$, where $P(s_t = s|\pi)$ represents the probability that the $t$-th state equals $s$ in trajectories generated by policy $\pi$. For any two policies $\pi$ and $\pi'$, the performance difference $J(\pi',\pi) \triangleq J(\pi') - J(\pi)$ can be measured by the advantage function:
$$J(\pi',\pi) = \mathbb{E}_{\tau\sim P_{\pi'}(\tau)}\Big[\sum_{t=0}^{T}\gamma^t A_\pi(s_t,a_t)\Big] = \mathbb{E}_{s\sim\rho_{\pi'}(\cdot),\,a\sim\pi'(\cdot|s)}\left[A_\pi(s,a)\right]. \quad (2)$$
The derivation details are presented in Appendix A. This theorem implies that improving the policy from $\pi$ to $\pi'$ can be achieved by maximizing (2). From this theorem, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) is derived, which guarantees monotonic improvement of performance. We also apply this theorem to formulate offline monotonic policy improvement.

3 OFFLINE MONOTONIC IMPROVEMENT OVER BEHAVIOR POLICY

In this section, we theoretically analyze offline monotonic policy improvement based on Theorem 1, namely improving the $\hat{\pi}_\beta$ obtained by behavior cloning (1) with the offline dataset $\mathcal{D}$. Applying the Performance Difference Theorem to the estimated behavior policy $\hat{\pi}_\beta$, we get
$$J(\pi, \hat{\pi}_\beta) = \mathbb{E}_{s\sim\rho_\pi(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s,a)\right]. \quad (3)$$
Maximizing this equation yields a policy better than the behavior policy $\hat{\pi}_\beta$. But the above equation is not tractable due to the dependence on the new policy's state distribution $\rho_\pi(s)$. In the standard online method, $\rho_\pi(s)$ is replaced by the old state distribution $\rho_{\hat{\pi}_\beta}(s)$. But in the offline setting, $\rho_{\hat{\pi}_\beta}(s)$ cannot be obtained through interactions with the environment as in the online situation. We instead use the state distribution recovered by the offline dataset, $\rho_{\mathcal{D}}(s)$, where $\rho_{\mathcal{D}}(s) = \sum_{t=0}^{T}\gamma^t P(s_t = s|\mathcal{D})$ and $P(s_t = s|\mathcal{D})$ represents the probability that the $t$-th state equals $s$ in the offline dataset. Therefore, the approximation of $J(\pi, \hat{\pi}_\beta)$ can be written as:
$$\hat{J}(\pi, \hat{\pi}_\beta) = \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s,a)\right]. \quad (4)$$
To measure the difference between $J(\pi, \hat{\pi}_\beta)$ and its approximation $\hat{J}(\pi, \hat{\pi}_\beta)$, we introduce a middle term $\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot),\,a\sim\pi(\cdot|s)}[A_{\hat{\pi}_\beta}(s,a)]$ with the state distribution $\rho_{\hat{\pi}_\beta}(s)$. In the proof, the commonly used total variational divergence $D_{TV}(\pi\|\hat{\pi}_\beta)[s] = \frac{1}{2}\mathbb{E}_a|\pi(a|s) - \hat{\pi}_\beta(a|s)|$ between the policies $\pi$ and $\hat{\pi}_\beta$ at state $s$ is necessary. The total variational divergence between the offline dataset $\mathcal{D}$ and the estimated behavior policy $\hat{\pi}_\beta$ may be less straightforward. We can view the offline dataset $\mathcal{D} = \{(s_t, a_t, s_{t+1}, r_t)\}_{t=1}^{N}$ as a deterministic distribution, and then the distance is:

Proposition 1. For the offline dataset $\mathcal{D} = \{(s_t, a_t, s_{t+1}, r_t)\}_{t=1}^{N}$ and policy $\hat{\pi}_\beta$, the total variational divergence can be expressed as $D_{TV}(\mathcal{D}\|\hat{\pi}_\beta)[s_t] = \frac{1}{2}\left(1 - \hat{\pi}_\beta(a_t|s_t)\right)$.
The detailed derivation is presented in Appendix B. Now we are ready to measure the difference:

Theorem 2. Given the distances $D_{TV}(\pi\|\hat{\pi}_\beta)[s]$ and $D_{TV}(\mathcal{D}\|\hat{\pi}_\beta)[s_t] = \frac{1}{2}(1 - \hat{\pi}_\beta(a_t|s_t))$, we can derive the following bound:
$$\left|J(\pi,\hat{\pi}_\beta) - \hat{J}(\pi,\hat{\pi}_\beta)\right| \le 4\gamma A_{\hat{\pi}_\beta}\max_s D_{TV}(\pi\|\hat{\pi}_\beta)[s]\,\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi\|\hat{\pi}_\beta)[s]\right] + 2\gamma A_{\hat{\pi}_\beta}\max_s D_{TV}(\pi\|\hat{\pi}_\beta)[s]\,\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[1 - \hat{\pi}_\beta(a|s)\right], \quad (5)$$
where $A_{\hat{\pi}_\beta} = \max_{s,a}\left|A_{\hat{\pi}_\beta}(s,a)\right|$.

The proof is presented in Appendix C. Compared to the corresponding theorems in the online setting (Schulman et al., 2015a; Achiam et al., 2017; Queeney et al., 2021), the first term on the right-hand side of Equation (5) is similar, while the second term is unique to the offline setting. $\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}[1 - \hat{\pi}_\beta(a|s)]$ represents the difference caused by the mismatch between the offline dataset $\mathcal{D}$ and $\hat{\pi}_\beta$. Once $\hat{\pi}_\beta$ is determined, this term is a constant. And because the inequality $\max_s D_{TV}(\pi\|\hat{\pi}_\beta)[s] \ge \mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}[D_{TV}(\pi\|\hat{\pi}_\beta)[s]]$ holds, we can claim the following conclusion:

Conclusion 1. To guarantee that the true objective $J(\pi,\hat{\pi}_\beta)$ is non-decreasing, we should simultaneously maximize $\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}[A_{\hat{\pi}_\beta}(s,a)]$ and minimize $\max_s D_{TV}(\pi\|\hat{\pi}_\beta)[s]$, which means the offline dataset $\mathcal{D}$ is capable of monotonically improving the estimated behavior policy $\hat{\pi}_\beta$.

Suppose we have improved the behavior policy $\hat{\pi}_\beta$ and obtained a policy $\pi_k$. The above theorem only guarantees that $\pi_k$ performs better than $\hat{\pi}_\beta$; $\pi_k$ may not be optimal. If the offline dataset $\mathcal{D}$ can still improve the policy $\pi_k$ to get a better policy $\pi_{k+1}$, then $\pi_{k+1}$ must be closer to the optimal policy. Thus, we further analyze monotonic policy improvement over the policy $\pi_k$. Applying the Performance Difference Theorem 1 to the policy $\pi_k$,
$$J(\pi, \pi_k) = \mathbb{E}_{s\sim\rho_\pi(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_k}(s,a)\right]. \quad (6)$$
To approximate the above equation, the common manner is to replace $\rho_\pi$ with the old policy's state distribution $\rho_{\pi_k}$. But in offline RL, $\pi_k$ is forbidden from acting in the environment. As a result, the state distribution $\rho_{\pi_k}$ is impossible to estimate. Thus, the only choice is to replace $\rho_{\pi_k}$ with the state distribution of the offline dataset $\mathcal{D}$:
$$\hat{J}(\pi, \pi_k) = \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_k}(s,a)\right]. \quad (7)$$
Intuitively, this replacement is reasonable if $\pi_k$ and $\hat{\pi}_\beta$ are similar, which means this approximation must be related to the distance $D_{TV}(\pi_k\|\hat{\pi}_\beta)[s]$. Concretely, the gap can be formulated as follows:

Theorem 3. Given the distances $D_{TV}(\pi\|\pi_k)[s]$, $D_{TV}(\pi_k\|\hat{\pi}_\beta)[s]$ and $D_{TV}(\mathcal{D}\|\hat{\pi}_\beta)[s] = \frac{1}{2}(1 - \hat{\pi}_\beta(a|s))$, we can derive the following bound:
$$\left|J(\pi,\pi_k) - \hat{J}(\pi,\pi_k)\right| \le 4\gamma A_{\pi_k}\max_s D_{TV}(\pi\|\pi_k)[s]\,\mathbb{E}_{s\sim\rho_{\pi_k}(\cdot)}\left[D_{TV}(\pi\|\pi_k)[s]\right] + 4\gamma A_{\pi_k}\max_s D_{TV}(\pi\|\pi_k)[s]\,\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\hat{\pi}_\beta)[s]\right] + 2\gamma A_{\pi_k}\max_s D_{TV}(\pi\|\pi_k)[s]\,\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[1 - \hat{\pi}_\beta(a|s)\right], \quad (8)$$
where $A_{\pi_k} = \max_{s,a}|A_{\pi_k}(s,a)|$.

The proof is presented in Appendix D. Compared to Theorem 2, one additional term related to the distance between $\pi_k$ and $\hat{\pi}_\beta$ has been introduced. The distance $\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}[D_{TV}(\pi_k\|\hat{\pi}_\beta)[s]]$ is irrelevant to the target policy $\pi$ and can also be viewed as a constant. Besides, Theorem 2 is a special case of this theorem when $\pi_k = \hat{\pi}_\beta$. Thus, we set $\pi_0 = \hat{\pi}_\beta$ since $\hat{\pi}_\beta$ is the first policy to be improved, and in the following sections we no longer deliberately distinguish $\hat{\pi}_\beta$ and $\pi_k$.
Similarly, we can derive the following conclusion:

Conclusion 2. To guarantee that the true objective $J(\pi,\pi_k)$ is non-decreasing, we should simultaneously maximize $\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}[A_{\pi_k}(s,a)]$ and minimize $\max_s D_{TV}(\pi\|\pi_k)[s]$, which means the offline dataset $\mathcal{D}$ is capable of monotonically improving the policy $\pi_k$, where $k = 0, 1, 2, \ldots$.

4 BEHAVIOR PROXIMAL POLICY OPTIMIZATION

In this section, we derive a practical algorithm based on the theoretical results. Surprisingly, the loss function of this algorithm is the same as that of the online on-policy method Proximal Policy Optimization (PPO) (Schulman et al., 2017). Furthermore, this algorithm highly depends on the behavior policy, so we name it Behavior Proximal Policy Optimization, abbreviated as BPPO. According to Conclusion 2, to monotonically improve the policy $\pi_k$, we should jointly optimize:
$$\underset{\pi}{\text{Maximize}}\ \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_k}(s,a)\right] \quad \&\quad \underset{\pi}{\text{Minimize}}\ \max_s D_{TV}(\pi\|\pi_k)[s], \quad (9)$$
where $k = 0, 1, 2, \ldots$ and $\pi_0 = \hat{\pi}_\beta$. But minimizing the total variational divergence between $\pi$ and $\pi_k$ results in the trivial solution $\pi = \pi_k$, which makes no improvement over $\pi_k$. A more reasonable optimization objective is to maximize $\hat{J}(\pi,\pi_k)$ while constraining the divergence:
$$\underset{\pi}{\text{Maximize}}\ \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_k}(s,a)\right] \quad \text{s.t.}\ \max_s D_{TV}(\pi\|\pi_k)[s] \le \epsilon. \quad (10)$$
For the term to be maximized, we adopt importance sampling so that the expectation depends only on the action distribution of the old policy $\pi_k$ rather than the new policy $\pi$:
$$\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_k}(s,a)\right] = \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi_k(\cdot|s)}\left[\frac{\pi(a|s)}{\pi_k(a|s)}A_{\pi_k}(s,a)\right]. \quad (11)$$
In this way, we can estimate this term by sampling states from the offline dataset, $s\sim\rho_{\mathcal{D}}(\cdot)$, and then sampling actions from the old policy, $a\sim\pi_k(\cdot|s)$. For the total variational divergence, we rewrite it as
$$\max_s D_{TV}(\pi\|\pi_k)[s] = \max_s \frac{1}{2}\int_a\left|\pi(a|s) - \pi_k(a|s)\right|da = \max_s \frac{1}{2}\int_a\pi_k(a|s)\left|\frac{\pi(a|s)}{\pi_k(a|s)} - 1\right|da = \frac{1}{2}\max_s\ \mathbb{E}_{a\sim\pi_k(\cdot|s)}\left|\frac{\pi(a|s)}{\pi_k(a|s)} - 1\right|. \quad (12)$$
In the offline setting, only states $s\sim\rho_{\mathcal{D}}(\cdot)$ are available and other states are inaccessible, so the operation $\max_s$ can also be expressed as $\max_{s\sim\rho_{\mathcal{D}}(\cdot)}$. Comparing Equations (11) and (12), we find that the state distribution, the action distribution and the policy ratio appear in both. Thus we consider how to insert the divergence constraint into Equation (11). The following constraints are equivalent:
$$\max_{s\sim\rho_{\mathcal{D}}(\cdot)} D_{TV}(\pi\|\pi_k)[s] \le \epsilon \;\Leftrightarrow\; \max_{s\sim\rho_{\mathcal{D}}(\cdot)}\ \mathbb{E}_{a\sim\pi_k(\cdot|s)}\left|\frac{\pi(a|s)}{\pi_k(a|s)} - 1\right| \le 2\epsilon \;\Leftrightarrow\; \max_{s\sim\rho_{\mathcal{D}}(\cdot)}\ \mathbb{E}_{a\sim\pi_k(\cdot|s)}\,\mathrm{clip}\!\left(\frac{\pi(a|s)}{\pi_k(a|s)},\,1-2\epsilon,\,1+2\epsilon\right), \quad \mathrm{clip}(x,l,u) = \min(\max(x,l),u). \quad (13)$$
Here the max operation is impractical to solve, so we adopt a heuristic approximation (Schulman et al., 2015a) that replaces the max with an expectation. The divergence constraint (13) can then be inserted:
$$L_k(\pi) = \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi_k(\cdot|s)}\left[\min\left(\frac{\pi(a|s)}{\pi_k(a|s)}A_{\pi_k}(s,a),\ \mathrm{clip}\!\left(\frac{\pi(a|s)}{\pi_k(a|s)},\,1-2\epsilon,\,1+2\epsilon\right)A_{\pi_k}(s,a)\right)\right], \quad (14)$$
where the $\min$ operation makes this objective a lower bound of Equation (11). This loss function is quite similar to PPO (Schulman et al., 2017), and the only difference is the state distribution. Therefore, we claim that online on-policy algorithms are naturally able to solve offline RL.

5 DISCUSSIONS AND IMPLEMENTATION DETAILS

In this section, we first highlight why BPPO can solve offline reinforcement learning, namely, how it overcomes the overestimation issue. Then we discuss some implementation details, especially the approximation of the advantage $A_{\pi_k}(s,a)$. Finally, we analyze the relation between BPPO and previous algorithms, including Onestep RL and iterative methods.
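As a concrete reference for Equation (14), the snippet below is a minimal PyTorch-style sketch of the clipped surrogate objective, written as a loss to minimize. The function name and argument layout are our own illustration (log-probabilities and advantages are assumed to be computed elsewhere), not the paper's released implementation.

```python
import torch

def bppo_loss(logp_new, logp_old, advantage, eps):
    """Clipped surrogate objective of Equation (14), negated so it can be minimized.

    logp_new : log pi(a|s) under the policy being optimized
    logp_old : log pi_k(a|s) under the frozen old policy; actions are sampled a ~ pi_k(.|s)
    advantage: advantage estimate A_{pi_k}(s, a); states are sampled from the offline dataset
    eps      : clip ratio, with the ratio clipped into [1 - 2*eps, 1 + 2*eps]
    """
    ratio = torch.exp(logp_new - logp_old)                        # pi(a|s) / pi_k(a|s)
    surrogate = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - 2.0 * eps, 1.0 + 2.0 * eps) * advantage
    return -torch.min(surrogate, clipped).mean()                  # maximizing L_k(pi) = minimizing this loss
```

Apart from where the states come from, this is the standard PPO-clip loss, which is exactly the point of Section 4.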
Why BPPO can solve offline RL? According to the final loss (14) and Equation (13), BPPO actually constrains the closeness via the expectation of the total variational divergence:
$$\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi_k(\cdot|s)}\left|\frac{\pi(a|s)}{\pi_k(a|s)} - 1\right| \le 2\epsilon. \quad (15)$$
If $k = 0$, this equation ensures the closeness between the learned policy $\pi$ and the behavior policy $\hat{\pi}_\beta$. When $k > 0$, one issue worthy of attention is whether the closeness between $\pi$ and $\pi_k$ can indirectly constrain the closeness between $\pi$ and $\hat{\pi}_\beta$. To achieve this, and also to prevent the learned policy $\pi$ from drifting completely away from $\hat{\pi}_\beta$, we introduce a technique called clip ratio decay. As the policy updates, the clip ratio $\epsilon$ gradually decreases until a certain training step (such as 200 steps):
$$\epsilon_i = \epsilon_0\cdot\sigma^i\ \ \text{if}\ i \le 200, \quad \text{otherwise}\ \epsilon_i = \epsilon_{200}, \quad (16)$$
where $i$ denotes the training step, $\epsilon_0$ denotes the initial clip ratio, and $\sigma \in (0, 1]$ is the decay coefficient.

Figure 1: Visualization of the importance weight between the updated policy $\pi_k$ and the estimated behavior policy $\hat{\pi}_\beta$, with and without clip ratio decay, on (a) hopper-medium and (b) hopper-medium-replay.

From Figures 1(a) and 1(b), we find that without the clip ratio decay technique (i.e., $\sigma = 1$) the ratio $\pi_k/\hat{\pi}_\beta$ may move out of the range $[1 - 2\epsilon, 1 + 2\epsilon]$ (the region surrounded by the dotted pink and purple lines). But the ratio stays stably within the range when the decay is applied, which means Equation (15) can ensure the closeness between the final policy learned by BPPO and the behavior policy.

How to approximate the advantage? When calculating the loss function (14), the only difference from the online situation is the approximation of the advantage $A_{\pi_k}(s,a)$. In online RL, GAE (Generalized Advantage Estimation) (Schulman et al., 2015b) approximates $A_{\pi_k}$ using data collected by the policy $\pi_k$. GAE is inapplicable in the offline situation because it relies on online interaction. As a result, BPPO has to calculate the advantage $A_{\pi_k} = Q_{\pi_k} - V_{\pi_\beta}$ in an off-policy manner, where $Q_{\pi_k}$ is calculated by Q-learning (Watkins & Dayan, 1992) using the offline dataset $\mathcal{D}$ and $V_{\pi_\beta}$ is calculated by fitting the returns $\sum_{t=0}^{T}\gamma^t r(s_t,a_t)$ with an MSE loss. Note that the value function is $V_{\pi_\beta}$ rather than $V_{\pi_k}$, since the state distribution has been changed to $s\sim\rho_{\mathcal{D}}(\cdot)$ in Theorems 2 and 3.

Algorithm 1 Behavior Proximal Policy Optimization (BPPO)
1: Estimate the behavior policy $\hat{\pi}_\beta$ by behavior cloning;
2: Calculate the Q-function $Q_{\pi_\beta}$ by SARSA;
3: Calculate the value function $V_{\pi_\beta}$ by fitting returns;
4: Initialize $k = 0$ and set $\pi_k \leftarrow \hat{\pi}_\beta$ & $Q_{\pi_k} = Q_{\pi_\beta}$;
5: for $i = 0, 1, 2, \ldots, I$ do
6:   $A_{\pi_k} = Q_{\pi_k} - V_{\pi_\beta}$
7:   Update the policy $\pi$ by maximizing $L_k(\pi)$;
8:   if $J(\pi) > J(\pi_k)$ then
9:     Set $k = k + 1$ & $\pi_k \leftarrow \pi$;
10:    if advantage replacement then
11:      $Q_{\pi_k} = Q_{\pi_\beta}$;
12:    else
13:      Calculate $Q_{\pi_k}$ by Q-learning;
14:    end if
15:   end if
16: end for

Besides, we have another simple choice based on the observation that $\pi_k$ stays close to $\pi_\beta$ with the help of clip ratio decay: we can replace all $A_{\pi_k}$ with $A_{\pi_\beta}$. This may introduce some error, but the benefit is that $A_{\pi_\beta}$ is more reliable than $A_{\pi_k}$, since off-policy estimation is potentially dangerous, especially in the offline setting. We conduct a series of experiments in Section 7.2 to compare these two implementations and find that the latter, advantage replacement, is better. Based on the above implementation details, we summarize the whole workflow of BPPO in Algorithm 1.
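To make Algorithm 1 and the clip ratio decay of Equation (16) concrete, here is a minimal skeleton of the policy-improvement loop with advantage replacement, reusing the `bppo_loss` sketch above. The `policy.log_prob`, `batches`, and `evaluate` interfaces are hypothetical stand-ins (the paper does not spell out here how $J(\pi)$ is estimated offline; we assume a helper that scores a policy with the fitted value functions), so treat this as a sketch under those assumptions rather than the paper's implementation.

```python
import torch

def clip_ratio(i, eps0=0.25, sigma=0.96, decay_steps=200):
    """Clip ratio decay of Equation (16): eps_i = eps0 * sigma**i for the first
    `decay_steps` updates, then held constant afterwards."""
    return eps0 * sigma ** min(i, decay_steps)

def bppo_training_loop(policy, old_policy, optimizer, batches, evaluate, num_steps=1000):
    """Skeleton of Algorithm 1 with advantage replacement (Q stays fixed at Q_{pi_beta})."""
    best_score = evaluate(old_policy)                     # offline estimate of J(pi_k)
    for i in range(num_steps):
        states, actions, advantages = next(batches)       # s ~ rho_D, a ~ pi_k(.|s), A = Q - V
        loss = bppo_loss(policy.log_prob(states, actions),
                         old_policy.log_prob(states, actions).detach(),
                         advantages, clip_ratio(i))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        score = evaluate(policy)
        if score > best_score:                            # lines 8-9 of Algorithm 1
            best_score = score
            old_policy.load_state_dict(policy.state_dict())   # pi_k <- pi
    return policy
```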
Figure 2: The difference between Onestep BPPO (left) and BPPO (right), where the decreasing circle corresponds to the $\epsilon$ decay.

What is the relation between BPPO, Onestep RL and iterative methods? Since BPPO is highly related to on-policy algorithms, it is naturally associated with Onestep RL (Brandfonbrener et al., 2021), which solves offline RL without off-policy evaluation. If we remove lines 8-15 in Algorithm 1, we get the Onestep version of BPPO, in which only the behavior policy $\hat{\pi}_\beta$ is improved. In contrast, BPPO also improves $\pi_k$, the policy that has already been improved over $\hat{\pi}_\beta$. Figure 2 illustrates the difference between BPPO and its Onestep version: Onestep strictly requires the new policy to stay close to $\hat{\pi}_\beta$, while BPPO appropriately loosens this restriction. If we calculate the Q-function in an off-policy manner, namely line 13 in Algorithm 1, the method switches to an iterative style. If we adopt advantage replacement, line 11, BPPO estimates the advantage function only once but updates many policies, from $\hat{\pi}_\beta$ to $\pi_k$. Onestep RL estimates the Q-function once and uses it to update the estimated behavior policy; iterative methods estimate the Q-function several times and then update the corresponding policies. Strictly speaking, BPPO is neither an Onestep nor an iterative method; it is a special case between these two types.

6 RELATED WORK

Offline Reinforcement Learning. Most online off-policy methods fail or underperform in offline RL due to extrapolation error (Fujimoto et al., 2019) or distributional shift (Levine et al., 2020). Thus most offline algorithms augment existing off-policy algorithms with a penalty measuring the divergence between the policy and the offline data (or behavior policy). Depending on how this penalty is implemented, a variety of methods have been proposed, such as batch constrained (Fujimoto et al., 2019), KL-control (Jaques et al., 2019; Liu et al., 2022b), behavior-regularized (Wu et al., 2019; Fujimoto & Gu, 2021) and policy constraint (Kumar et al., 2019; Levine et al., 2020; Kostrikov et al., 2021). Other methods augment BC with a weight to make the policy favor high-advantage actions (Wang et al., 2018; Siegel et al., 2020; Peng et al., 2019; Wang et al., 2020). Further methods introduce uncertainty estimation (An et al., 2021b; Bai et al., 2022) or conservative estimation (Kumar et al., 2020; Yu et al., 2021b; Nachum et al., 2019) to overcome overestimation.

Monotonic Policy Improvement. Monotonic policy improvement in online RL was first introduced by Kakade & Langford (2002). On this basis, two classical on-policy methods, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) and Proximal Policy Optimization (PPO) (Schulman et al., 2017), were proposed. Afterwards, monotonic policy improvement was extended to constrained MDPs (Achiam et al., 2017), model-based methods (Luo et al., 2018) and off-policy RL (Queeney et al., 2021; Meng et al., 2021). The main idea behind BPPO is to regularize each policy update by restricting the divergence. Such regularization is often used in unsupervised skill learning (Liu et al., 2021; 2022a; Tian et al., 2021) and imitation learning (Xiao et al., 2019; Kang et al., 2021). Xu et al. (2021) note that offline algorithms lack guaranteed performance improvement over the behavior policy, but we are the first to introduce monotonic policy improvement to solve offline RL.
7 EXPERIMENTS

We conduct a series of experiments on the Gym (v2), Adroit (v1), Kitchen (v0) and Antmaze (v2) suites from D4RL (Fu et al., 2020) to evaluate the performance and analyze the design choices of Behavior Proximal Policy Optimization (BPPO). Specifically, we aim to answer: 1) How does BPPO compare with previous Onestep and iterative methods? 2) What is the superiority of BPPO over its Onestep and iterative versions? 3) What is the influence of the hyperparameters clip ratio $\epsilon$ and clip ratio decay $\sigma$?

Table 1: The normalized results on D4RL Gym, Adroit, and Kitchen. BPPO results are calculated by averaging mean returns over 10 evaluation trajectories and five random seeds. The symbol * specifies that the results are reproduced by running the official open-source code. CQL and TD3+BC are iterative methods; Onestep RL and IQL are Onestep methods.

| Environment | CQL | TD3+BC | Onestep RL | IQL | BC (Ours) | BPPO (Ours) |
|---|---|---|---|---|---|---|
| halfcheetah-medium-v2 | 44.0 | 48.3 | 48.4 | 47.4 | 43.5 ± 0.1 | 44.0 ± 0.2 |
| hopper-medium-v2 | 58.5 | 59.3 | 59.6 | 66.3 | 61.3 ± 3.2 | 93.9 ± 3.9 |
| walker2d-medium-v2 | 72.5 | 83.7 | 81.8 | 78.3 | 74.2 ± 4.6 | 83.6 ± 0.9 |
| halfcheetah-medium-replay-v2 | 45.5 | 44.6 | 38.1 | 44.2 | 40.1 ± 0.1 | 41.0 ± 0.6 |
| hopper-medium-replay-v2 | 95.0 | 60.9 | 97.5 | 94.7 | 66.0 ± 18.3 | 92.5 ± 3.4 |
| walker2d-medium-replay-v2 | 77.2 | 81.8 | 49.5 | 73.9 | 33.4 ± 11.2 | 77.6 ± 7.8 |
| halfcheetah-medium-expert-v2 | 91.6 | 90.7 | 93.4 | 86.7 | 64.4 ± 8.5 | 92.5 ± 1.9 |
| hopper-medium-expert-v2 | 105.4 | 98.0 | 103.3 | 91.5 | 64.9 ± 7.7 | 112.8 ± 1.7 |
| walker2d-medium-expert-v2 | 108.8 | 110.1 | 113.0 | 109.6 | 107.7 ± 3.5 | 113.1 ± 2.4 |
| Gym locomotion-v2 total | 698.5 | 677.4 | 684.6 | 692.4 | 555.5 ± 57.2 | 751.0 ± 21.8 |
| pen-human-v1 | 37.5 | 8.4* | 90.7* | 71.5 | 61.6 ± 9.7 | 117.8 ± 11.9 |
| hammer-human-v1 | 4.4 | 2.0* | 0.2* | 1.4 | 2.0 ± 0.9 | 14.9 ± 3.2 |
| door-human-v1 | 9.9 | 0.5* | -0.1* | 4.3 | 7.8 ± 3.5 | 25.9 ± 7.5 |
| relocate-human-v1 | 0.2 | -0.3* | 2.1* | 0.1 | 0.1 ± 0.0 | 4.8 ± 2.2 |
| pen-cloned-v1 | 39.2 | 41.5* | 60.0 | 37.3 | 58.8 ± 16.0 | 110.8 ± 6.3 |
| hammer-cloned-v1 | 2.1 | 0.8* | 2.0 | 2.1 | 0.5 ± 0.2 | 8.9 ± 5.1 |
| door-cloned-v1 | 0.4 | -0.4* | 0.4 | 1.6 | 0.9 ± 0.8 | 6.2 ± 1.6 |
| relocate-cloned-v1 | -0.1 | -0.3* | -0.1 | -0.2 | -0.1 ± 0.0 | 1.9 ± 1.0 |
| adroit-v1 total | 93.6 | 52.2 | 155.2 | 118.1 | 131.6 ± 31.1 | 291.4 ± 38.8 |
| kitchen-complete-v0 | 43.8 | 0.0* | 2.0* | 62.5 | 55.0 ± 11.5 | 91.5 ± 8.9 |
| kitchen-partial-v0 | 49.8 | 22.5* | 35.5* | 46.3 | 44.0 ± 4.9 | 57.0 ± 2.4 |
| kitchen-mixed-v0 | 51.0 | 25.0* | 28.0* | 51.0 | 45.0 ± 1.6 | 62.5 ± 6.7 |
| kitchen-v0 total | 144.6 | 47.5 | 65.5 | 159.8 | 144.0 ± 18.0 | 211.0 ± 18.0 |
| locomotion+kitchen+adroit | 936.7 | 777.1 | 905.3 | 970.3 | 831.1 ± 106.3 | 1253.4 ± 78.6 |

7.1 RESULTS ON D4RL BENCHMARKS

We first compare BPPO with iterative methods, including CQL (Kumar et al., 2020) and TD3+BC (Fujimoto & Gu, 2021), and Onestep methods, including Onestep RL (Brandfonbrener et al., 2021) and IQL (Kostrikov et al., 2021). Most results of Onestep RL, IQL, CQL, and TD3+BC are taken from the IQL paper, and the results marked with * are reproduced by ourselves. Since BPPO first estimates a behavior policy and then improves it, we list the results of BC to the left of BPPO. From Table 1, we find that BPPO achieves comparable performance on each Gym task and slightly outperforms the baselines in terms of total performance. On Adroit and Kitchen, BPPO prominently outperforms the other methods. Compared to BC, BPPO achieves a 51% performance improvement over all D4RL tasks. Interestingly, our implemented BC on Adroit and Kitchen nearly outperforms the baselines, which may imply that improving the behavior policy rather than learning from scratch is better.

Next, we evaluate whether BPPO can solve more difficult tasks with sparse reward. For the Antmaze tasks, we also compare BPPO with Decision Transformer (DT) (Chen et al., 2021), RvS-G and RvS-R (Emmons et al., 2021).
DT conditions on past trajectories to predict future actions using a Transformer. RvS-G and RvS-R condition on goals or rewards, respectively, to learn a policy via supervised learning.

Table 2: The normalized results on D4RL Antmaze tasks. The results of CQL and IQL are taken from the IQL paper, while the others are taken from the RvS paper. In the BC column, the symbol * specifies Filtered BC (Emmons et al., 2021), which removes the failed trajectories instead of performing standard BC.

| Environment | CQL | TD3+BC | Onestep | IQL | DT | RvS-R | RvS-G | BC (Ours) | BPPO (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| Umaze-v2 | 74.0 | 78.6 | 64.3 | 87.5 | 65.6 | 64.4 | 65.4 | 51.7 ± 20.4 | 95.0 ± 5.5 |
| Umaze-diverse-v2 | 84.0 | 71.4 | 60.7 | 62.2 | 51.2 | 70.1 | 60.9 | 48.3 ± 17.2 | 91.7 ± 4.1 |
| Medium-play-v2 | 61.2 | 10.6 | 0.3 | 71.2 | 1.0 | 4.5 | 58.1 | 16.7 ± 5.2* | 51.7 ± 7.5 |
| Medium-diverse-v2 | 53.7 | 3.0 | 0.0 | 70.0 | 0.6 | 7.7 | 67.3 | 33.3 ± 10.3* | 70.0 ± 6.3 |
| Large-play-v2 | 15.8 | 0.2 | 0.0 | 39.6 | 0.0 | 3.5 | 32.4 | 48.3 ± 11.7* | 86.7 ± 8.2 |
| Large-diverse-v2 | 14.9 | 0.0 | 0.0 | 47.5 | 0.2 | 3.7 | 36.9 | 46.7 ± 20.7* | 88.3 ± 4.1 |
| Total | 303.6 | 163.8 | 61.0 | 378.0 | 118.6 | 153.9 | 321.0 | 245.0 ± 85.5 | 483.3 ± 35.7 |

As shown in Table 2, BPPO outperforms the baselines on most tasks and is significantly better than the other algorithms in terms of the total performance over all tasks. We adopt Filtered BC on the last four tasks, where only the successful trajectories are selected for behavior cloning. The performance of CQL and IQL is very impressive since no additional operations or information is introduced. RvS-G uses goal information to overcome the sparse-reward challenge. The superior performance demonstrates that BPPO can also considerably improve the policy performance over (Filtered) BC on tasks with sparse reward.

7.2 THE SUPERIORITY OF BPPO OVER ONESTEP AND ITERATIVE VERSION

BPPO vs. Onestep BPPO. We choose to improve the policy $\pi_k$ after it has been improved over the behavior policy $\hat{\pi}_\beta$ because Theorem 2 provides no guarantee of optimality. Besides, BPPO and Onestep RL are easily connected because BPPO is based on an online method while Onestep RL solves offline RL without off-policy evaluation. Although Figure 2 gives an intuitive interpretation of the advantage of BPPO over its Onestep version, the soundness is relatively weak, so we further analyze the superiority of BPPO over its Onestep version empirically.

Figure 3: The comparison between BPPO and Onestep BPPO on (a) hopper-medium-v2, (b) hopper-medium-replay-v2, and (c) hopper-medium-expert-v2. The hyperparameters of both methods are tuned through grid search, and we exhibit the learning curves with the best performance.

In Figure 3, we observe that both BPPO and Onestep BPPO outperform BC (the orange dotted line). This indicates that both of them achieve monotonic improvement over the behavior policy $\hat{\pi}_\beta$. Another important result is that BPPO is consistently better than Onestep BPPO, which demonstrates two key points: first, improving $\pi_k$ to fully utilize the information is necessary; second, compared to strictly restricting the learned policy to be close to the behavior policy, appropriate looseness is useful.

BPPO vs. iterative BPPO. When approximating the advantage $A_{\pi_k}$, we have two implementation choices. One is advantage replacement (line 11 in Algorithm 1). The other is off-policy Q-estimation (line 13 in Algorithm 1), corresponding to iterative BPPO. Both of them introduce extra error compared to the true $A_{\pi_k}$.
The error of the former comes from the replacement $A_{\pi_k} \rightarrow A_{\pi_\beta}$, while that of the latter comes from the off-policy estimation itself. We compare BPPO with iterative BPPO in Figure 4 and find that advantage replacement, namely BPPO, is clearly better.

Figure 4: The comparison between BPPO (the green curves) and its iterative versions, in which we update the Q network to approximate $Q_{\pi_k}$ instead of the $Q_{\hat{\pi}_\beta}$ used in BPPO, on (a) halfcheetah-medium-replay, (b) walker2d-medium-replay, (c) halfcheetah-medium-expert, and (d) walker2d-medium-expert. In particular, BPPOoff=5 denotes that we update the Q network for 5 gradient steps per policy training step (similarly for BPPOoff=10, 20, 100).

7.3 ABLATION STUDY OF DIFFERENT HYPERPARAMETERS

In this section, we evaluate the influence of the clip ratio $\epsilon$ and its decay rate $\sigma$. The clip ratio restricts the policy to be close to the behavior policy, and it directly addresses offline overestimation. Since $\epsilon$ also appears in PPO, we can set it properly to avoid catastrophic performance, which is a unique feature of BPPO. $\sigma$ gradually tightens this restriction during policy improvement. We show how these coefficients contribute to the performance of BPPO; more ablations can be found in Appendices G, H, and I.

Figure 5: Ablation study on the clip ratio $\epsilon$ (5(a) hopper-medium-replay, 5(b) hopper-medium-expert) and the clip ratio decay $\sigma$ (5(c) hopper-medium-replay, 5(d) hopper-medium-expert).

First, we analyze five values of the clip coefficient, $\epsilon = (0.05, 0.1, 0.2, 0.25, 0.3)$. In most environments, such as hopper-medium-expert (Figure 5(b)), different $\epsilon$ show no significant difference, so we choose $\epsilon = 0.25$; only for hopper-medium-replay is $\epsilon = 0.1$ clearly better than the others. We then demonstrate how the clip ratio decay ($\sigma = 0.90, 0.94, 0.96, 0.98, 1.00$) affects the performance of BPPO. As shown in Figure 5(c), a low decay rate ($\sigma = 0.90$) or no decay ($\sigma = 1.00$) may cause a crash during training. We use $\sigma = 0.96$ to achieve stable policy improvement in all environments.

8 CONCLUSION

Behavior Proximal Policy Optimization (BPPO) starts from offline monotonic policy improvement and uses the loss function of PPO to elegantly solve offline RL without introducing any extra constraint or regularization. Theoretical derivations and extensive experiments show that the inherent conservatism of the on-policy method PPO is naturally suitable for overcoming overestimation in offline RL. BPPO is simple to implement and achieves superior performance on the D4RL datasets.

9 ACKNOWLEDGEMENTS

This work was supported by the National Science and Technology Innovation 2030 - Major Project (Grant No. 2022ZD0208800) and the NSFC General Program (Grant No. 62176215). We thank Li He and Yachen Kang for helpful discussions and for polishing the writing.

REFERENCES

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pp. 22-31. PMLR, 2017.
Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified Q-ensemble. Advances in Neural Information Processing Systems, 34:7436-7447, 2021a.

Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified Q-ensemble. Advances in Neural Information Processing Systems, 34:7436-7447, 2021b.

Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhihong Deng, Animesh Garg, Peng Liu, and Zhaoran Wang. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. arXiv preprint arXiv:2202.11566, 2022.

David Brandfonbrener, Will Whitney, Rajesh Ranganath, and Joan Bruna. Offline RL without off-policy evaluation. Advances in Neural Information Processing Systems, 34:4933-4946, 2021.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084-15097, 2021.

Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Yuan Gao, Jianhao Wang, Wenzhe Li, Bin Liang, Chelsea Finn, and Chongjie Zhang. Latent-variable advantage-weighted policy optimization for offline RL. arXiv preprint arXiv:2203.08949, 2022.

Ching-An Cheng, Tengyang Xie, Nan Jiang, and Alekh Agarwal. Adversarially trained actor critic for offline reinforcement learning. arXiv preprint arXiv:2202.02446, 2022.

Scott Emmons, Benjamin Eysenbach, Ilya Kostrikov, and Sergey Levine. RvS: What is essential for offline RL via supervised learning? arXiv preprint arXiv:2112.10751, 2021.

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1etN1rtPB.

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132-20145, 2021.

Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587-1596. PMLR, 2018.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pp. 2052-2062. PMLR, 2019.

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456, 2019.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proc. 19th International Conference on Machine Learning. Citeseer, 2002.

Yachen Kang, Jinxin Liu, Xin Cao, and Donglin Wang. Off-dynamics inverse reinforcement learning from hetero-domain. arXiv preprint arXiv:2110.11443, 2021.

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021.

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine.
Stabilizing off-policy Q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179-1191, 2020.

Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pp. 45-73. Springer, 2012.

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

Jinxin Liu, Hao Shen, Donglin Wang, Yachen Kang, and Qiangxing Tian. Unsupervised domain adaptation with dynamics-aware rewards in reinforcement learning. Advances in Neural Information Processing Systems, 34:28784-28797, 2021.

Jinxin Liu, Donglin Wang, Qiangxing Tian, and Zhengyu Chen. Learn goal-conditioned policy with intrinsic motivation for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 7558-7566, 2022a.

Jinxin Liu, Hongyin Zhang, and Donglin Wang. DARA: Dynamics-aware reward augmentation in offline reinforcement learning. arXiv preprint arXiv:2203.06662, 2022b.

Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.

Wenjia Meng, Qian Zheng, Yue Shi, and Gang Pan. An off-policy trust region policy optimization method with monotonic improvement guarantee for deep reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 33(5):2223-2235, 2021.

Piotr Mirowski, Matt Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Andrew Zisserman, Raia Hadsell, et al. Learning to navigate in cities without a map. Advances in Neural Information Processing Systems, 31, 2018.

Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.

Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, and Nando de Freitas. Hyperparameter selection for offline reinforcement learning. arXiv preprint arXiv:2007.09055, 2020.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

Dean A Pomerleau. ALVINN: An autonomous land vehicle in a neural network. Advances in Neural Information Processing Systems, 1, 1988.

James Queeney, Yannis Paschalidis, and Christos G Cassandras. Generalized proximal policy optimization with sample reuse. Advances in Neural Information Processing Systems, 34:11909-11919, 2021.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889-1897. PMLR, 2015a.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel.
High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Noah Y Siegel, Jost Tobias Springenberg, Felix Berkenkamp, Abbas Abdolmaleki, Michael Neunert, Thomas Lampe, Roland Hafner, Nicolas Heess, and Martin Riedmiller. Keep doing what worked: Behavioral modelling priors for offline reinforcement learning. arXiv preprint arXiv:2002.08396, 2020.

Richard S Sutton, Andrew G Barto, et al. Introduction to Reinforcement Learning. 1998.

Qiangxing Tian, Guanchu Wang, Jinxin Liu, Donglin Wang, and Yachen Kang. Independent skill transfer for deep reinforcement learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 2901-2907, 2021.

Qing Wang, Jiechao Xiong, Lei Han, Han Liu, Tong Zhang, et al. Exponentially weighted imitation learning for batched historical data. Advances in Neural Information Processing Systems, 31, 2018.

Ziyu Wang, Alexander Novikov, Konrad Zolna, Josh S Merel, Jost Tobias Springenberg, Scott E Reed, Bobak Shahriari, Noah Siegel, Caglar Gulcehre, Nicolas Heess, et al. Critic regularized regression. Advances in Neural Information Processing Systems, 33:7768-7778, 2020.

Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279-292, 1992.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Huang Xiao, Michael Herman, Joerg Wagner, Sebastian Ziesche, Jalal Etesami, and Thai Hong Linh. Wasserstein adversarial imitation learning. arXiv preprint arXiv:1906.08113, 2019.

Haoran Xu, Xianyuan Zhan, Jianxiong Li, and Honglei Yin. Offline reinforcement learning with soft behavior regularization. arXiv preprint arXiv:2110.07395, 2021.

Rui Yang, Chenjia Bai, Xiaoteng Ma, Zhaoran Wang, Chongjie Zhang, and Lei Han. RORL: Robust offline reinforcement learning via conservative smoothing. arXiv preprint arXiv:2206.02829, 2022.

Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. Reinforcement learning in healthcare: A survey. ACM Computing Surveys (CSUR), 55(1):1-36, 2021a.

Tianhe Yu, Aviral Kumar, Rafael Rafailov, Aravind Rajeswaran, Sergey Levine, and Chelsea Finn. COMBO: Conservative offline model-based policy optimization. Advances in Neural Information Processing Systems, 34:28954-28967, 2021b.

Siyuan Zhang and Nan Jiang. Towards hyperparameter-free policy selection for offline reinforcement learning. Advances in Neural Information Processing Systems, 34:12864-12875, 2021.

A PROOF OF PERFORMANCE DIFFERENCE THEOREM 1

Proof. First note that $A_\pi(s,a) = \mathbb{E}_{s'\sim p(\cdot|s,a)}\left[r(s,a) + \gamma V_\pi(s') - V_\pi(s)\right]$. Therefore,
$$\mathbb{E}_{\tau\sim P_{\pi'}(\tau)}\Big[\sum_{t=0}^{T}\gamma^t A_\pi(s_t,a_t)\Big] = \mathbb{E}_{\tau\sim P_{\pi'}(\tau)}\Big[\sum_{t=0}^{T}\gamma^t\big(r(s_t,a_t) + \gamma V_\pi(s_{t+1}) - V_\pi(s_t)\big)\Big] = \mathbb{E}_{\tau\sim P_{\pi'}(\tau)}\Big[\sum_{t=0}^{T}\gamma^t r(s_t,a_t)\Big] - \mathbb{E}_{s_0}\left[V_\pi(s_0)\right] = J(\pi') - J(\pi) = J(\pi',\pi). \quad (17)$$
Now the first equality in Theorem 1 has been proved. For the proof of the second equality, we decompose the expectation over the trajectory into a sum of expectations over state-action pairs:
$$\mathbb{E}_{\tau\sim P_{\pi'}(\tau)}\Big[\sum_{t=0}^{T}\gamma^t A_\pi(s_t,a_t)\Big] = \sum_{t=0}^{T}\sum_{s}P(s_t = s|\pi')\,\mathbb{E}_{a\sim\pi'(\cdot|s)}\left[\gamma^t A_\pi(s,a)\right] = \sum_{s}\sum_{t=0}^{T}\gamma^t P(s_t = s|\pi')\,\mathbb{E}_{a\sim\pi'(\cdot|s)}\left[A_\pi(s,a)\right] = \sum_{s}\rho_{\pi'}(s)\,\mathbb{E}_{a\sim\pi'(\cdot|s)}\left[A_\pi(s,a)\right] = \mathbb{E}_{s\sim\rho_{\pi'}(s),\,a\sim\pi'(\cdot|s)}\left[A_\pi(s,a)\right]. \quad (18)$$

B PROOF OF PROPOSITION 1

Proof.
For a state-action pair $(s_t, a_t) \in \mathcal{D}$, the dataset can be viewed as a deterministic policy that satisfies $\pi_{\mathcal{D}}(a = a_t|s_t) = 1$ and $\pi_{\mathcal{D}}(a \ne a_t|s_t) = 0$. So
$$D_{TV}(\mathcal{D}\|\hat{\pi}_\beta)[s_t] = D_{TV}(\pi_{\mathcal{D}}\|\hat{\pi}_\beta)[s_t] = \frac{1}{2}\mathbb{E}_a\left|\pi_{\mathcal{D}}(a|s_t) - \hat{\pi}_\beta(a|s_t)\right| = \frac{1}{2}\int\Big[P(a_t)\left|\pi_{\mathcal{D}}(a_t|s_t) - \hat{\pi}_\beta(a_t|s_t)\right| + P(a \ne a_t)\left|\pi_{\mathcal{D}}(a|s_t) - \hat{\pi}_\beta(a|s_t)\right|\Big]da = \frac{1}{2}\int\Big[P(a_t)\big(1 - \hat{\pi}_\beta(a_t|s_t)\big) + P(a \ne a_t)\,\hat{\pi}_\beta(a \ne a_t|s_t)\Big]da = \frac{1}{2}\int\Big[P(a_t)\big(1 - \hat{\pi}_\beta(a_t|s_t)\big) + \big(1 - P(a_t)\big)\big(1 - \hat{\pi}_\beta(a_t|s_t)\big)\Big]da = \frac{1}{2}\big(1 - \hat{\pi}_\beta(a_t|s_t)\big). \quad (19)$$

C PROOF OF THEOREM 2

The definition of $A_{\pi,\hat{\pi}_\beta}(s)$ is as follows:
$$A_{\pi,\hat{\pi}_\beta}(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s,a)\right]. \quad (20)$$
Note that the expectation of the advantage function $A_{\hat{\pi}_\beta}(s,a)$ is taken under another policy $\pi$ rather than $\hat{\pi}_\beta$, so $A_{\pi,\hat{\pi}_\beta}(s) \ne 0$ in general. Furthermore, given $A_{\pi,\hat{\pi}_\beta}(s)$, the performance difference in Theorem 2 can be rewritten as:
$$J(\pi,\hat{\pi}_\beta) = \mathbb{E}_{s\sim\rho_\pi(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s,a)\right] = \mathbb{E}_{s\sim\rho_\pi(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right], \quad (21)$$
$$\hat{J}(\pi,\hat{\pi}_\beta) = \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s,a)\right] = \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right]. \quad (22)$$

Lemma 1. For all states $s$,
$$\left|A_{\pi,\hat{\pi}_\beta}(s)\right| \le 2\max_a\left|A_{\hat{\pi}_\beta}(s,a)\right|D_{TV}(\pi\|\hat{\pi}_\beta)[s]. \quad (23)$$

Proof. The expectation of the advantage function $A_\pi(s,a)$ over its own policy $\pi$ equals zero:
$$\mathbb{E}_{a\sim\pi}\left[A_\pi(s,a)\right] = \mathbb{E}_{a\sim\pi}\left[Q_\pi(s,a) - V_\pi(s)\right] = \mathbb{E}_{a\sim\pi}\left[Q_\pi(s,a)\right] - V_\pi(s) = 0. \quad (24)$$
Thus, with the help of Hölder's inequality, we get
$$\left|A_{\pi,\hat{\pi}_\beta}(s)\right| = \left|\mathbb{E}_{a\sim\pi(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s,a)\right] - \mathbb{E}_{a\sim\hat{\pi}_\beta(\cdot|s)}\left[A_{\hat{\pi}_\beta}(s,a)\right]\right| \le \left\|\pi(\cdot|s) - \hat{\pi}_\beta(\cdot|s)\right\|_1\left\|A_{\hat{\pi}_\beta}(s,\cdot)\right\|_\infty = 2D_{TV}(\pi\|\hat{\pi}_\beta)[s]\max_a\left|A_{\hat{\pi}_\beta}(s,a)\right|, \quad \forall s. \quad (25)$$

Lemma 2. (Achiam et al., 2017) The divergence between two unnormalized visitation frequencies, $\|\rho_\pi(\cdot) - \rho_{\pi'}(\cdot)\|_1$, is bounded by an average total variational divergence of the policies $\pi$ and $\pi'$:
$$\left\|\rho_\pi(\cdot) - \rho_{\pi'}(\cdot)\right\|_1 \le 2\gamma\,\mathbb{E}_{s\sim\rho_{\pi'}(\cdot)}\left[D_{TV}(\pi\|\pi')[s]\right]. \quad (26)$$
Given this powerful lemma and the preparation above, we are now able to derive the bound on $|J(\pi,\hat{\pi}_\beta) - \hat{J}(\pi,\hat{\pi}_\beta)|$:
$$\left|J(\pi,\hat{\pi}_\beta) - \hat{J}(\pi,\hat{\pi}_\beta)\right| = \left|\mathbb{E}_{s\sim\rho_\pi(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right] - \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right]\right| \le \left|\mathbb{E}_{s\sim\rho_\pi(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right] - \mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right]\right| + \left|\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right] - \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right]\right|. \quad (27)$$
Based on Hölder's inequality and Lemma 2, we can bound the first term as follows:
$$\left|\mathbb{E}_{s\sim\rho_\pi(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right] - \mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[A_{\pi,\hat{\pi}_\beta}(s)\right]\right| \le \left\|\rho_\pi(\cdot) - \rho_{\hat{\pi}_\beta}(\cdot)\right\|_1\left\|A_{\pi,\hat{\pi}_\beta}\right\|_\infty \le 2\gamma\,\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi\|\hat{\pi}_\beta)[s]\right]\max_s\left|A_{\pi,\hat{\pi}_\beta}(s)\right|. \quad (28)$$
For the second term, we can derive a similar bound and furthermore use $D_{TV}(\mathcal{D}\|\hat{\pi}_\beta)[s] = \frac{1}{2}(1 - \hat{\pi}_\beta(a|s))$. Finally, using Lemma 1, we get
$$\left|J(\pi,\hat{\pi}_\beta) - \hat{J}(\pi,\hat{\pi}_\beta)\right| \le 2\gamma\max_s\left|A_{\pi,\hat{\pi}_\beta}(s)\right|\Big(\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi\|\hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[D_{TV}(\mathcal{D}\|\hat{\pi}_\beta)[s]\right]\Big) = 2\gamma\max_s\left|A_{\pi,\hat{\pi}_\beta}(s)\right|\Big(\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi\|\hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[\tfrac{1}{2}\big(1 - \hat{\pi}_\beta(a|s)\big)\right]\Big) \le 4\gamma\max_{s,a}\left|A_{\hat{\pi}_\beta}(s,a)\right|\max_s D_{TV}(\pi\|\hat{\pi}_\beta)[s]\Big(\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi\|\hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[\tfrac{1}{2}\big(1 - \hat{\pi}_\beta(a|s)\big)\right]\Big). \quad (29)$$

D PROOF OF THEOREM 3

As an extension of Theorem 2, the proof of Theorem 3 is similar. Based on Equation (28), we can directly derive the final bound:
$$\left|J(\pi,\pi_k) - \hat{J}(\pi,\pi_k)\right| = \left|\mathbb{E}_{s\sim\rho_\pi(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_k}(s,a)\right] - \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_k}(s,a)\right]\right| \le \left|\mathbb{E}_{s\sim\rho_\pi(\cdot)}\left[A_{\pi,\pi_k}(s)\right] - \mathbb{E}_{s\sim\rho_{\pi_k}(\cdot)}\left[A_{\pi,\pi_k}(s)\right]\right| + \left|\mathbb{E}_{s\sim\rho_{\pi_k}(\cdot)}\left[A_{\pi,\pi_k}(s)\right] - \mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[A_{\pi,\pi_k}(s)\right]\right| + \left|\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[A_{\pi,\pi_k}(s)\right] - \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[A_{\pi,\pi_k}(s)\right]\right| \le 2\gamma\max_s\left|A_{\pi,\pi_k}(s)\right|\Big(\mathbb{E}_{s\sim\rho_{\pi_k}(\cdot)}\left[D_{TV}(\pi\|\pi_k)[s]\right] + \mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[\tfrac{1}{2}\big(1 - \hat{\pi}_\beta(a|s)\big)\right]\Big) \le 4\gamma\max_{s,a}\left|A_{\pi_k}(s,a)\right|\max_s D_{TV}(\pi\|\pi_k)[s]\Big(\mathbb{E}_{s\sim\rho_{\pi_k}(\cdot)}\left[D_{TV}(\pi\|\pi_k)[s]\right] + \mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\hat{\pi}_\beta)[s]\right] + \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[\tfrac{1}{2}\big(1 - \hat{\pi}_\beta(a|s)\big)\right]\Big). \quad (30)$$

E WHY GAE IS UNAVAILABLE IN THE OFFLINE SETTING?

In the traditional online situation, the advantage $A_{\pi_k}(s,a)$ is estimated by Generalized Advantage Estimation (GAE) (Schulman et al., 2015b) using data collected by the policy $\pi_k$. But in offline RL, only the offline dataset $\mathcal{D} = \{(s_t, a_t, s_{t+1}, r_t)\}_{t=1}^{N}$ from the true behavior policy $\pi_\beta$ is available.
The advantage of $(s_t, a_t)$ calculated by GAE is as follows:
$$A_{\pi_\beta}(s_t, a_t) = \sum_{l=0}^{\infty}(\gamma\lambda)^l\left(r_{t+l} + \gamma V_{\pi_\beta}(s_{t+l+1}) - V_{\pi_\beta}(s_{t+l})\right). \quad (31)$$
GAE can only calculate the advantage of pairs $(s_t, a_t) \in \mathcal{D}$. For a pair $(s_t, a) \notin \mathcal{D}$, where $a$ is an in-distribution action sample but the pair itself does not appear in $\mathcal{D}$, GAE is unable to give any estimate. This is because the calculation of GAE depends on the trajectory and does not have the ability to generalize to unseen state-action pairs. Therefore, GAE is not a satisfactory choice for offline RL. Offline RL forbids interaction with the environment, so data usage should be more efficient. Concretely, we expect the advantage approximation method to calculate the advantage not only of pairs $(s_t, a_t) \in \mathcal{D}$ but also of pairs $(s_t, a) \notin \mathcal{D}$. As a result, we directly estimate the advantage through its definition $A_{\pi_\beta}(s,a) = Q_{\pi_\beta}(s,a) - V_{\pi_\beta}(s)$, where the Q-function is estimated by SARSA and the value function by fitting the returns $\sum_{t=0}^{T}\gamma^t r(s_t,a_t)$ with an MSE loss. This function approximation method can generalize to the advantage of such unseen pairs.

F THEORETICAL ANALYSIS FOR ADVANTAGE REPLACEMENT

We choose to replace all $A_{\pi_k}$ with the trustworthy $A_{\hat{\pi}_\beta}$ and then theoretically measure the difference, rather than empirically making the $A_{\pi_k}$ learned by Q-learning more accurate. The difference caused by replacing the $A_{\pi_k}$ in $\hat{J}(\pi,\pi_k)$ with $A_{\pi_\beta}(s,a)$ can be measured by the following theorem:

Theorem 4. Given the distance $D_{TV}(\pi_k\|\pi_\beta)[s]$ and assuming the reward function satisfies $|r(s,a)| \le R_{\max}$ for all $s, a$, then
$$\left|\hat{J}(\pi,\pi_k) - \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_\beta}(s,a)\right]\right| \le 2\gamma(\gamma + 1)R_{\max}\,\mathbb{E}_{s\sim\rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\pi_\beta)[s]\right]. \quad (32)$$

Proof. First note that $A_\pi(s,a) = \mathbb{E}_{s'\sim p(\cdot|s,a)}\left[r(s,a) + \gamma V_\pi(s') - V_\pi(s)\right]$. Then we have
$$\left|\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_k}(s,a)\right] - \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_\beta}(s,a)\right]\right| = \left|\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\mathbb{E}_{s'\sim p(\cdot|s,a)}\Big[\gamma\big(V_{\pi_k}(s') - V_{\pi_\beta}(s')\big) - \big(V_{\pi_k}(s) - V_{\pi_\beta}(s)\big)\Big]\right| \le \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\mathbb{E}_{s'\sim p(\cdot|s,a)}\Big[\gamma\big|V_{\pi_k}(s') - V_{\pi_\beta}(s')\big| + \big|V_{\pi_k}(s) - V_{\pi_\beta}(s)\big|\Big]. \quad (33)$$
Similarly to Equation (18), the value function can be rewritten as $V_\pi(s) = \mathbb{E}_{s'\sim\rho_\pi(\cdot)}\left[r(s')\right]$. Then the difference between the two value functions can be measured using Hölder's inequality and Lemma 2:
$$\left|V_{\pi_k}(s) - V_{\pi_\beta}(s)\right| = \left|\mathbb{E}_{s'\sim\rho_{\pi_k}(\cdot)}\left[r(s')\right] - \mathbb{E}_{s'\sim\rho_{\pi_\beta}(\cdot)}\left[r(s')\right]\right| \le \left\|\rho_{\pi_k}(\cdot) - \rho_{\pi_\beta}(\cdot)\right\|_1\left\|r\right\|_\infty \le 2\gamma\,\mathbb{E}_{s\sim\rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\pi_\beta)[s]\right]\max_s|r(s)|. \quad (34)$$
Thus, the final bound is
$$\left|\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_k}(s,a)\right] - \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_\beta}(s,a)\right]\right| \le \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\mathbb{E}_{s'\sim p(\cdot|s,a)}\Big[2\gamma^2\,\mathbb{E}_{s\sim\rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\pi_\beta)[s]\right]\max_s|r(s)| + 2\gamma\,\mathbb{E}_{s\sim\rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\pi_\beta)[s]\right]\max_s|r(s)|\Big] = 2\gamma(\gamma + 1)\max_s|r(s)|\,\mathbb{E}_{s\sim\rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\pi_\beta)[s]\right]. \quad (35)$$
Note that the right-hand term is irrelevant to the policy $\pi$ and can be viewed as a constant when optimizing $\pi$. Combining the results of Theorems 3 and 4, we get the following corollary:

Corollary 1. Given the distances $D_{TV}(\pi\|\pi_k)[s]$, $D_{TV}(\pi_k\|\hat{\pi}_\beta)[s]$ and $D_{TV}(\mathcal{D}\|\hat{\pi}_\beta)[s] = \frac{1}{2}(1 - \hat{\pi}_\beta(a|s))$, we can derive the following bound:
$$\left|J(\pi,\pi_k) - \mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}\left[A_{\pi_\beta}(s,a)\right]\right| \le 4\gamma A_{\pi_k}\max_s D_{TV}(\pi\|\pi_k)[s]\,\mathbb{E}_{s\sim\rho_{\pi_k}(\cdot)}\left[D_{TV}(\pi\|\pi_k)[s]\right] + 4\gamma A_{\pi_k}\max_s D_{TV}(\pi\|\pi_k)[s]\,\mathbb{E}_{s\sim\rho_{\hat{\pi}_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\hat{\pi}_\beta)[s]\right] + 2\gamma A_{\pi_k}\max_s D_{TV}(\pi\|\pi_k)[s]\,\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot)}\left[1 - \hat{\pi}_\beta(a|s)\right] + C_{\pi_k,\pi_\beta}, \quad (36)$$
where $A_{\pi_k} = \max_{s,a}|A_{\pi_k}(s,a)|$ and $C_{\pi_k,\pi_\beta} = 2\gamma(\gamma + 1)\max_{s,a}|r(s,a)|\,\mathbb{E}_{s\sim\rho_{\pi_\beta}(\cdot)}\left[D_{TV}(\pi_k\|\pi_\beta)[s]\right]$.

Conclusion 3. To guarantee that the true objective $J(\pi,\pi_k)$ is non-decreasing, we can also simultaneously maximize $\mathbb{E}_{s\sim\rho_{\mathcal{D}}(\cdot),\,a\sim\pi(\cdot|s)}[A_{\pi_\beta}(s,a)]$ and minimize $\max_s D_{TV}(\pi\|\pi_k)[s]$, where $k = 0, 1, 2, \ldots$.
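As a concrete illustration of the advantage estimation described in Appendix E (SARSA for $Q_{\pi_\beta}$, Monte Carlo return regression for $V_{\pi_\beta}$, and $A = Q - V$), here is a minimal PyTorch-style sketch. The `q_net(s, a)` / `v_net(s)` callable interfaces, the `done` flag handling, and the target-network usage are our assumptions, not details given in the paper.

```python
import torch
import torch.nn.functional as F

def sarsa_q_loss(q_net, target_q_net, batch, gamma=0.99):
    """One SARSA-style regression step for Q_{pi_beta} on offline transitions
    (s, a, r, s', a', done): the bootstrap action a' is the next action stored
    in the dataset, so no out-of-distribution action is ever queried."""
    s, a, r, s_next, a_next, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next, a_next)
    return F.mse_loss(q_net(s, a), target)

def value_fit_loss(v_net, states, mc_returns):
    """Fit V_{pi_beta} by regressing the Monte Carlo returns sum_t gamma^t r_t."""
    return F.mse_loss(v_net(states), mc_returns)

def advantage(q_net, v_net, states, actions):
    """A(s, a) = Q_{pi_beta}(s, a) - V_{pi_beta}(s); unlike GAE, this generalizes
    to actions that never appear in the dataset paired with a given state."""
    with torch.no_grad():
        return q_net(states, actions) - v_net(states)
```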
G ABLATION STUDY ON AN ASYMMETRIC COEFFICIENT

In this section, we give the details of the hyperparameter selections in our experiments. In addition to the aforementioned clip ratio $\epsilon$ and its decay coefficient $\sigma$, we introduce $\omega \in (0, 1)$ as an asymmetric coefficient to adjust the advantage $A_{\pi_\beta}$ based on its sign:
$$\bar{A}_{\pi_\beta} = \left|\omega - \mathbb{1}(A_{\pi_\beta} < 0)\right|A_{\pi_\beta}. \quad (37)$$
For $\omega > 0.5$, this downweights the contributions of state-action values $Q_{\pi_\beta}$ smaller than their expectation, i.e., $V_{\pi_\beta}$, while distributing more weight to larger $Q_{\pi_\beta}$. In other words, the asymmetric coefficient adjusts the weight of the advantage based on the Q value, giving advantages with larger Q values more influence.

We analyze how the asymmetric coefficient affects the performance of BPPO using three values, $\omega = (0.5, 0.7, 0.9)$, in three Gym environments. Figure 6 shows that $\omega = 0.9$ is best for these tasks, especially on hopper-medium-v2 and hopper-medium-replay-v2. With a larger $\omega$, the policy improvement is guided in a better direction, leading to better performance in the Gym environments. Based on the performance of the different coefficient values above, we use the asymmetric advantage coefficient $\omega = 0.9$ for training on the Gym datasets and $\omega = 0.7$ for the Adroit, Antmaze, and Kitchen datasets, respectively.

Figure 6: Ablation study on the coefficient $\omega$ for (a) hopper-medium-v2, (b) hopper-medium-replay-v2, and (c) hopper-medium-expert-v2. We optimize the hyperparameters through grid search, then fix the other coefficients at their best-performing values and vary the asymmetric coefficient to analyze how it affects BPPO. In particular, $\omega = 0.5$ denotes training without the asymmetric coefficient (all advantages contribute equally).

H IMPORTANCE RATIO DURING TRAINING

In this section, we explore whether the importance weight between the improved policy $\pi_k$ and the behavior policy $\pi_\beta$ becomes arbitrarily large. To this end, we quantify this importance weight during the training phase in Figure 7. We observe that the ratio of BPPO with decay always stays in the clipped region (the region surrounded by the dotted yellow and red lines). However, BPPO without decay moves beyond the region in Figures 7(a) and 7(b). This demonstrates that the improved policy without decay drifts farther away from the behavior policy than in the case of BPPO with decay, which may cause unstable performance and even a crash, as shown in Figures 5(c), 5(d) and 10 when $\sigma = 1.00$ (i.e., without decay).

Figure 7: Visualization of the importance weight between the updated policy and the behavior policy trained by BC, on (a) hopper-medium-v2, (b) hopper-medium-replay-v2, and (c) hopper-medium-expert-v2. Whenever the performance of the policy improves, we calculate the importance weight (i.e., the probability ratio) between the improved policy and the behavior policy.
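The diagnostic of Appendix H can be reproduced with a few lines. Below is a minimal sketch (the `log_prob` interface on the policy objects is our assumption) that reports the mean ratio $\pi_k(a|s)/\hat{\pi}_\beta(a|s)$ on dataset samples and the fraction of samples falling outside the clipped region $[1 - 2\epsilon,\ 1 + 2\epsilon]$.

```python
import torch

def importance_ratio_stats(policy_k, behavior_policy, states, actions, eps):
    """Monitor the ratio pi_k(a|s) / pi_beta_hat(a|s) as in Appendix H."""
    with torch.no_grad():
        ratio = torch.exp(policy_k.log_prob(states, actions)
                          - behavior_policy.log_prob(states, actions))
    outside = ((ratio < 1.0 - 2.0 * eps) | (ratio > 1.0 + 2.0 * eps)).float().mean()
    return ratio.mean().item(), outside.item()
```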
I COEFFICIENT PLOTS OF ONESTEP BPPO

In this section, we exhibit the learning curves and coefficient plots of Onestep BPPO. As shown in Figures 8 and 9, $\epsilon = 0.25$ and $\omega = 0.9$ are best for these tasks. Figure 10 shows how the clip coefficient decay affects the performance of Onestep BPPO. We observe that the curves without decay or with a low decay rate are unstable over the three tasks and even crash during training on the hopper-medium-replay-v2 task. Thus, we select $\sigma = 0.96$ to achieve stable policy improvement for Onestep BPPO. We use the coefficients with the best performance when comparing with BPPO in Figure 3.

Figure 8: Ablation study of Onestep BPPO on the coefficient $\epsilon$ for (a) hopper-medium-v2, (b) hopper-medium-replay-v2, and (c) hopper-medium-expert-v2. We optimize the hyperparameters through grid search, then fix the other coefficients at their best-performing values and vary the clip coefficient to analyze how it affects Onestep BPPO.

Figure 9: Ablation study of Onestep BPPO on the coefficient $\omega$ for (a) hopper-medium-v2, (b) hopper-medium-replay-v2, and (c) hopper-medium-expert-v2. We optimize the hyperparameters through grid search, then fix the other coefficients at their best-performing values and vary the asymmetric coefficient to analyze how it affects Onestep BPPO.

Figure 10: Ablation study of Onestep BPPO on the clip coefficient decay and its decay rate for (a) hopper-medium-v2, (b) hopper-medium-replay-v2, and (c) hopper-medium-expert-v2. We optimize the hyperparameters through grid search, then fix the other coefficients at their best-performing values and vary the clip decay coefficient to analyze how it affects Onestep BPPO. In particular, $\sigma = 1.00$ denotes training without the decay coefficient.

J EXTRA COMPARISONS

In this section, we add EDAC (An et al., 2021a), LAPO (Chen et al., 2022), RORL (Yang et al., 2022), and ATAC (Cheng et al., 2022) as comparison baselines to further evaluate the superiority of BPPO. Although BPPO is slightly worse than these state-of-the-art methods on the Gym environments, it significantly outperforms all of them on the Adroit, Kitchen, and Antmaze datasets and has the best overall performance across all datasets.

Table 3: The normalized results of all algorithms on the Gym locomotion and Adroit datasets. The results of EDAC, RORL, and ATAC are extracted from their original articles.
| Environment/method | EDAC | RORL | ATAC | Ours |
|---|---|---|---|---|
| halfcheetah-medium-v2 | 65.9 | 66.8 | 54.3 | 44.0 ± 0.2 |
| hopper-medium-v2 | 101.6 | 104.8 | 102.8 | 93.9 ± 3.9 |
| walker2d-medium-v2 | 92.5 | 102.4 | 91.0 | 83.6 ± 0.9 |
| halfcheetah-medium-replay-v2 | 61.3 | 61.9 | 49.5 | 41.0 ± 0.6 |
| hopper-medium-replay-v2 | 101 | 102.8 | 102.8 | 92.5 ± 3.4 |
| walker2d-medium-replay-v2 | 87.1 | 90.4 | 94.1 | 77.6 ± 7.8 |
| halfcheetah-medium-expert-v2 | 106.3 | 107.8 | 95.5 | 92.5 ± 1.9 |
| hopper-medium-expert-v2 | 110.7 | 112.7 | 112.6 | 112.8 ± 1.7 |
| walker2d-medium-expert-v2 | 114.7 | 121.2 | 116.3 | 113.1 ± 2.4 |
| Gym locomotion-v2 total | 841.1 | 870.8 | 818.9 | 751.0 ± 21.8 |
| pen-human-v1 | 52.1 | 33.7 | 79.3 | 117.8 ± 11.9 |
| hammer-human-v1 | 0.8 | 2.3 | 6.7 | 14.9 ± 3.2 |
| door-human-v1 | 10.7 | 3.8 | 8.7 | 25.9 ± 7.5 |
| relocate-human-v1 | 0.1 | 0 | 0.3 | 4.8 ± 2.2 |
| pen-cloned-v1 | 68.2 | 35.7 | 73.9 | 110.8 ± 6.3 |
| hammer-cloned-v1 | 0.3 | 1.7 | 2.3 | 8.9 ± 5.1 |
| door-cloned-v1 | 9.6 | -0.1 | 8.2 | 6.2 ± 1.6 |
| relocate-cloned-v1 | 0 | 0 | 0.8 | 1.9 ± 1.0 |
| adroit-v1 total | 141.8 | 77.1 | 180.2 | 291.4 ± 38.8 |
| locomotion + adroit total | 982.9 | 947.9 | 999.1 | 1042.4 ± 60.6 |

Table 4: The normalized results of all algorithms on the Kitchen dataset. The results of LAPO are extracted from its original article.

| Environment/method | LAPO | Ours |
|---|---|---|
| kitchen-complete-v0 | 53.2 | 91.5 ± 8.9 |
| kitchen-partial-v0 | 53.7 | 57.0 ± 2.4 |
| kitchen-mixed-v0 | 62.4 | 62.5 ± 6.7 |
| kitchen-v0 total | 169.3 | 211.0 ± 18.0 |

Table 5: The normalized results of all algorithms on the Antmaze dataset. The results of RORL are extracted from its original article.

| Environment/method | RORL | Ours |
|---|---|---|
| Umaze-v2 | 96.7 | 95.0 ± 5.5 |
| Umaze-diverse-v2 | 90.7 | 91.7 ± 4.1 |
| Medium-play-v2 | 76.3 | 51.7 ± 7.5 |
| Medium-diverse-v2 | 69.3 | 70.0 ± 6.3 |
| Large-play-v2 | 16.3 | 86.7 ± 8.2 |
| Large-diverse-v2 | 41.0 | 88.3 ± 4.1 |
| Antmaze-v2 total | 390.3 | 483.3 ± 35.7 |

K IMPLEMENTATION AND EXPERIMENT DETAILS

Following the online PPO method, we use the tricks known as code-level optimizations, including learning rate decay, orthogonal initialization, and normalization of the advantage within each mini-batch, which are considered very important to the success of the online PPO algorithm (Engstrom et al., 2020). We clip the concatenated gradient of all parameters so that the global L2 norm does not exceed 0.5. We use 2-layer MLPs with 1024 hidden units for the Q and policy networks, and a 3-layer MLP with 512 hidden units for the value function V. Our method is implemented in PyTorch (Paszke et al., 2019). Next, we introduce the training details of Q, V, the (estimated) behavior policy $\hat{\pi}_\beta$, and the target policy $\pi$, respectively.

Q and V network training: we run $2 \times 10^6$ steps each for fitting the Q and V functions, using a learning rate of $10^{-4}$.

(Estimated) behavior policy $\hat{\pi}_\beta$ training: we run $5 \times 10^5$ steps of behavior cloning, using a learning rate of $10^{-4}$.

Target policy $\pi$ training: during policy improvement, we use learning rate decay, i.e., the learning rate is decayed at each interval step during the first 200 gradient steps and then kept fixed (decay rate $\sigma = 0.96$). We run 1,000 gradient steps of policy improvement for the Gym, Adroit, and Kitchen tasks and 100 gradient steps for the Antmaze tasks. The selections of the initial policy learning rate, initial clip ratio, and asymmetric coefficient are listed in Table 6.

Table 6: The selections of part of the hyperparameters during the policy improvement phase.

| Hyperparameter | Task | Value |
|---|---|---|
| Initial policy learning rate | Gym locomotion and cloned tasks of Adroit | $10^{-4}$ |
| | Kitchen, Antmaze, and human tasks of Adroit | $10^{-5}$ |
| Initial clip ratio $\epsilon$ | hopper-medium-replay-v2 | 0.1 |
| | Antmaze | 0.5 |
| | Others | 0.25 |
| Asymmetric coefficient $\omega$ | Gym locomotion | 0.9 |
| | Others | 0.7 |
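For reference, the following is a minimal sketch of the network shapes described in Appendix K, assuming that "2-layer" and "3-layer" refer to hidden layers and that the policy uses a diagonal-Gaussian head with a `log_prob` interface; these choices and all names are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

def mlp(sizes, activation=nn.ReLU):
    """Fully connected network with ReLU between hidden layers and a linear output."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(activation())
    return nn.Sequential(*layers)

class GaussianPolicy(nn.Module):
    """Policy network: 2 hidden layers of 1024 units (per Appendix K); the Gaussian
    head and state-independent log-std are our assumption."""
    def __init__(self, obs_dim, act_dim, hidden=1024):
        super().__init__()
        self.mu = mlp([obs_dim, hidden, hidden, act_dim])
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, states, actions):
        dist = torch.distributions.Normal(self.mu(states), self.log_std.exp())
        return dist.log_prob(actions).sum(-1)

def make_q_net(obs_dim, act_dim):
    """Q network: 2 hidden layers of 1024 units, input is the concatenated (s, a)."""
    return mlp([obs_dim + act_dim, 1024, 1024, 1])

def make_v_net(obs_dim):
    """V network: 3 hidden layers of 512 units."""
    return mlp([obs_dim, 512, 512, 512, 1])
```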