# Off-Policy Primal-Dual Safe Reinforcement Learning

Published as a conference paper at ICLR 2024

Zifan Wu¹, Bo Tang²³, Qian Lin¹, Chao Yu¹, Shangqin Mao³, Qianlong Xie³, Xingxing Wang³, Dong Wang³
¹Sun Yat-sen University, Guangzhou, China; {wuzf5,linq67}@mail2.sysu.edu.cn, yuchao3@mail.sysu.edu.cn
²Institute for Advanced Algorithms Research, Shanghai, China; tangb@iaar.ac.cn
³Meituan, Beijing, China; {maoshangqin,xieqianlong,wangxingxing04,wangdong07}@meituan.com
(Equal contribution. Work done when Bo Tang was at Meituan. Corresponding author.)

ABSTRACT

Primal-dual safe RL methods commonly perform iterations between the primal update of the policy and the dual update of the Lagrange multiplier. Such a training paradigm is highly susceptible to the error in cumulative cost estimation, since this estimation serves as the key bond connecting the primal and dual update processes. We show that this problem causes significant underestimation of cost when using off-policy methods, leading to failure to satisfy the safety constraint. To address this issue, we propose conservative policy optimization, which learns a policy in a constraint-satisfying area by considering the uncertainty in cost estimation. This improves constraint satisfaction but also potentially hinders reward maximization. We then introduce local policy convexification to help eliminate such suboptimality by gradually reducing the estimation uncertainty. We provide theoretical interpretations of the coupling effect of these two ingredients and further verify them through extensive experiments. Results on benchmark tasks show that our method not only achieves an asymptotic performance comparable to state-of-the-art on-policy methods while using much fewer samples, but also significantly reduces constraint violation during training. Our code is available at https://github.com/ZifanWu/CAL.

1 INTRODUCTION

Reinforcement learning (RL) has achieved tremendous successes in various domains (Silver et al., 2016; Andrychowicz et al., 2020; Jumper et al., 2021; Afsar et al., 2022). However, in many real-world tasks, interacting with the environment is expensive or even dangerous (Amodei et al., 2016). By encoding safety concerns into RL via the form of constraints, safe RL (Gu et al., 2022) studies the problem of optimizing the traditional reward objective while satisfying certain constraints. In recent years, safe RL has received increasing attention due to the growing demand for the real-world deployment of RL methods (Xu et al., 2022b).

Among the most important safe RL approaches, primal-dual methods convert the original constrained optimization problem to its dual form by introducing a set of Lagrange multipliers (Chow et al., 2017; Tessler et al., 2018; Ray et al., 2019; Ding et al., 2020; Yang et al., 2022). Generally, primal-dual methods alternate between the primal update for policy improvement and the dual update for Lagrange multiplier adjustment. While enjoying theoretical guarantees of converging to constraint-satisfying policies (Tessler et al., 2018; Paternain et al., 2019), primal-dual methods often suffer from severe performance instability during the practical training process (Stooke et al., 2020), which can be largely attributed to the high reliance on an accurate estimation of the cumulative cost (Xu et al., 2021; Liu et al., 2022). An inaccurate cost estimation, even with a small error, can give rise to a wrong Lagrange multiplier that hinders the subsequent policy optimization. For example, when the
true cost value is under the constraint threshold but its estimate is over the threshold (or vice versa), the Lagrange multiplier will be updated towards the complete opposite of the true direction and in turn becomes a misleading penalty weight in the primal policy objective. Therefore, compared to reward estimation, where an inaccurate value estimate may still be useful for policy improvement as long as the relative reward priority among actions is preserved, cost estimation is more subtle, since a minor error can significantly affect the overall primal-dual iteration (Lee et al., 2022b; Liu et al., 2023). This issue is further aggravated when using off-policy methods due to the inherent estimation error caused by the distribution shift (Levine et al., 2020). As a result, the majority of existing primal-dual methods are on-policy by design and thus require a large number of samples before convergence, which is problematic in safety-critical applications where interacting with the environment could be potentially risky.

Figure 1: Visualization of a part of the policy space, where deeper color indicates higher reward in the feasible (constraint-satisfying) area.

In this paper, we propose an off-policy primal-dual safe RL method which consists of two main ingredients. The first ingredient is conservative policy optimization, which utilizes an upper confidence bound of the true cost value in the Lagrangian, in order to address the cost underestimation issue existing in naïve off-policy primal-dual methods. The underestimation of the cost value causes the policy to violate the constraint, leading to an aggressive feasibility boundary that exceeds the feasible policy space, as illustrated in Figure 1(a). In contrast, by considering estimation uncertainty, conservative policy optimization encourages cost overestimation, which ultimately results in a conservative boundary that lies inside the constraint-satisfying policy space. While such conservatism improves constraint satisfaction, it may also hinder reward maximization since it shrinks the policy search space. To tackle this problem, we introduce our second ingredient, namely local policy convexification, which modifies the original objective using the augmented Lagrangian method (Luenberger et al., 1984), in order to convexify the neighborhood area of a locally optimal policy. Through rectifying the policy gradient, local policy convexification can stabilize policy learning in the local convexified area and thus gradually reduce the cost estimation uncertainty in this area. Consequently, the conservative search space expands gradually towards the optimal policy, as depicted by the process from Figure 1(a) to Figure 1(b).

The contributions of this work are as follows: (1) We analyze the problem existing in naïve off-policy primal-dual methods and address it by introducing two critical algorithmic ingredients. Detailed theoretical interpretations and empirical verifications are provided to demonstrate the respective effect of each ingredient and, more importantly, their mutual impact on the overall learning process. (2) We specify a practical algorithm by incorporating the two ingredients into an off-policy backbone.
Empirical results on various continuous control tasks show that our method achieves superior sample efficiency compared to the baselines, i.e., achieving the same reward with fewer samples and constraint violations during training. (3) We evaluate our method on a real-world auto-bidding task under the semi-batch training paradigm (Matsushima et al., 2021), where the behavior policy is not allowed to update within each long-term data collecting process. Results verify the effectiveness of our method in such scenarios by conservatively approaching the optimal policy.

2 BACKGROUND

This paper studies safe RL problems under the formulation of the Constrained Markov Decision Process (CMDP) (Altman, 1999), which is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, r, c, \gamma, d)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ the transition kernel, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ the reward function, $c: \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$ the cost function, $\gamma$ the discount factor, and $d$ the constraint threshold. Without loss of generality, this paper considers CMDPs with only a single scalar cost to simplify the notation. At each time step $t$, the agent selects an action $a_t \in \mathcal{A}$ based on the current state $s_t \in \mathcal{S}$ and the policy $\pi(\cdot|s_t)$. The environment then returns a reward $r_t = r(s_t, a_t)$, a cost $c_t = c(s_t, a_t)$, and the next state $s_{t+1} \in \mathcal{S}$ sampled from the transition probability $P(\cdot|s_t, a_t)$. We denote the reward value function and the cost value function respectively as $Q^\pi_r(s, a) = \mathbb{E}_{\tau \sim \pi}[\sum_{t=0}^{\infty} \gamma^t r_t]$ and $Q^\pi_c(s, a) = \mathbb{E}_{\tau \sim \pi}[\sum_{t=0}^{\infty} \gamma^t c_t]$, where $\tau = \{s_0 = s, a_0 = a, s_1, \ldots\}$ is a trajectory starting from $s$ and $a$. The goal of safe RL methods is to solve the following constrained optimization problem:

$$\max_\pi \ \mathbb{E}_{s \sim \rho^\pi, a \sim \pi(\cdot|s)}\left[Q^\pi_r(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho^\pi, a \sim \pi(\cdot|s)}\left[Q^\pi_c(s, a)\right] \le d, \qquad (1)$$

where $\rho^\pi$ denotes the stationary state distribution induced by $\pi$. The basic primal-dual approach converts Eq. (1) to its dual form, i.e.,

$$\min_{\lambda \ge 0} \max_\pi \ \mathbb{E}_{s \sim \rho^\pi, a \sim \pi(\cdot|s)}\left[Q^\pi_r(s, a) - \lambda \left(Q^\pi_c(s, a) - d\right)\right], \qquad (2)$$

and solves it by alternating between optimizing the policy and updating the Lagrange multiplier. For a large state space, the value functions can be estimated with function approximators, and we denote $\hat{Q}^\pi_r$ and $\hat{Q}^\pi_c$ as the estimates of the reward and the cost value, respectively.
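To make the alternation in Eq. (2) concrete, the snippet below is a minimal sketch of a generic primal-dual loop: gradient ascent on the Lagrangian for the policy, followed by projected gradient ascent on the multiplier. It is an illustration only; the networks, batch sampling, step sizes, and threshold value are placeholders rather than any particular published implementation, and critic training is omitted.

```python
# Minimal sketch of the plain primal-dual alternation in Eq. (2).
import torch
import torch.nn as nn

state_dim, action_dim, d = 8, 2, 25.0          # d: constraint threshold (illustrative)
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
q_r = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
q_c = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))

pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
lam, lam_lr = torch.tensor(0.0), 1e-2          # Lagrange multiplier and its step size

for it in range(1000):
    states = torch.randn(256, state_dim)       # stand-in for a sampled batch of states
    actions = policy(states)
    sa = torch.cat([states, actions], dim=-1)

    # Primal step: ascend E[Q_r - lambda * (Q_c - d)] w.r.t. the policy parameters.
    primal_obj = (q_r(sa) - lam.detach() * (q_c(sa) - d)).mean()
    pi_opt.zero_grad()
    (-primal_obj).backward()
    pi_opt.step()

    # Dual step: push lambda toward enforcing E[Q_c] <= d, projected onto lambda >= 0.
    with torch.no_grad():
        cost_estimate = q_c(torch.cat([states, policy(states)], dim=-1)).mean()
        lam = torch.clamp(lam + lam_lr * (cost_estimate - d), min=0.0)
```

In practice both expectations are estimated from sampled transitions, which is exactly where the cost-estimation error discussed in the next section enters the dual update.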
3 METHODOLOGY

This section introduces two key algorithmic ingredients for improving off-policy primal-dual safe RL methods. In Section 3.1, we reveal the cost underestimation issue existing in naïve off-policy primal-dual methods and address it by applying the first ingredient, i.e., conservative policy optimization. To mitigate the possible reward suboptimality induced by such conservatism, we then introduce the second ingredient, i.e., local policy convexification, in Section 3.2. The interpretation and empirical verification of the joint effect of these two ingredients are provided in Section 3.3.

3.1 CONSERVATIVE POLICY OPTIMIZATION

In the primal-dual objective (2), when the Lagrange multiplier is greater than 0, the policy will be updated to minimize the estimate of the cost value. Assuming that the approximation $\hat{Q}_c(s, a) = Q_c(s, a) + \epsilon$ has a zero-mean noise $\epsilon$, the zero-mean property cannot be preserved in general under the minimization operation, i.e., $\mathbb{E}_\epsilon[\min_\pi \hat{Q}_c(s, \pi(s))] \le \min_\pi Q_c(s, \pi(s)),\ \forall s$ (Thrun & Schwartz, 1993), leading to the underestimation of the cost value. Such underestimation bias may be further accumulated into a large bias in temporal difference learning, where the underestimated cost value becomes a learning target for estimates of other state-action pairs. This issue becomes even more severe when using off-policy methods that inherently suffer from value approximation error (i.e., a larger $\epsilon$) induced by the distribution shift issue (Levine et al., 2020).

We empirically validate the above analysis by showing the results of a naïve off-policy baseline: SAC-Lag, the Lagrangian version of SAC (Haarnoja et al., 2018). Figure 2 provides the reward (left column) and the cost evaluation (middle column), as well as the difference between the estimates of the cost values and their oracle values during training (right column). The oracle value is approximated by testing the current policy in the environment for 100 episodes and averaging the induced discounted cost returns of these episodes. Clearly, we can see that SAC-Lag suffers from severe underestimation of the cost value and also poor constraint satisfaction.

Figure 2: Results of SAC-Lag and its conservative version (denoted as UCB in the legend) on two example tasks. The horizontal dashed line represents the constraint threshold in the cost plot.

To deal with such cost underestimation, we propose to replace the value estimate $\hat{Q}^\pi_c$ in the primal-dual objective by an upper confidence bound (UCB) of the true cost value. Specifically, we maintain $E$ bootstrapped $Q_c$ networks, i.e., $\{\hat{Q}^i_c\}_{i=1}^E$, which share the same architecture and only differ by their initial weights. All the networks are trained on the same dataset. During cost estimation, each $\hat{Q}^i_c$ estimates the cumulative cost in parallel and independently. The UCB value is obtained by computing the mean plus the weighted standard deviation of this bootstrapped value ensemble:

$$\hat{Q}^{\mathrm{UCB}}_c(s, a) = \frac{1}{E}\sum_{i=1}^{E} \hat{Q}^i_c(s, a) + k \sqrt{\frac{1}{E}\sum_{i=1}^{E}\left(\hat{Q}^i_c(s, a) - \frac{1}{E}\sum_{j=1}^{E} \hat{Q}^j_c(s, a)\right)^2}, \qquad (3)$$

where we omit the superscript $\pi$ in $\hat{Q}^{\mathrm{UCB}}_c$ and $\hat{Q}^i_c$ for notational brevity. From the Bayesian perspective, the ensemble forms an estimate of the posterior distribution of $Q_c$ (Osband et al., 2016), with the standard deviation quantifying the epistemic uncertainty induced by a lack of sufficient data (Chua et al., 2018). Despite its simplicity, this technique is widely adopted and has proved to be extremely effective (Ghasemipour et al., 2022; Bai et al., 2022). By taking this uncertainty in $Q_c$ estimation into account, $\hat{Q}^{\mathrm{UCB}}_c$ serves as an approximate upper bound of the true value with high confidence, as empirically verified by Figure 2 [1]. Intuitively, the use of UCB induces safety conservativeness in the policy by encouraging actions that not only result in high reward but also yield low uncertainty in cost estimation.

[1] For presentational convenience, we only show the performance of k = 2 in Figure 2; the ablation study of this hyperparameter is provided in Section 3.3.
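The following is a minimal sketch of the UCB estimate in Eq. (3) over a bootstrapped ensemble of cost critics; the network architecture, ensemble container, and tensor shapes are illustrative assumptions, not the released CAL code.

```python
# Minimal sketch of the UCB cost estimate in Eq. (3): ensemble mean plus k times the
# ensemble standard deviation, used in place of a single cost value estimate.
import torch
import torch.nn as nn

def ucb_cost(ensemble, states, actions, k=0.5):
    """Return mean + k * (population) std of the cost-critic ensemble at (s, a)."""
    sa = torch.cat([states, actions], dim=-1)
    preds = torch.stack([q(sa) for q in ensemble], dim=0)      # shape (E, batch, 1)
    mean = preds.mean(dim=0)
    std = (preds - mean).pow(2).mean(dim=0).sqrt()             # matches the 1/E form in Eq. (3)
    return mean + k * std

# Illustrative ensemble: same architecture, different random initializations,
# all trained on the same replay data (the training loop itself is omitted here).
state_dim, action_dim, E = 8, 2, 4
ensemble = [nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
            for _ in range(E)]

states, actions = torch.randn(32, state_dim), torch.randn(32, action_dim)
q_ucb = ucb_cost(ensemble, states, actions, k=0.5)             # shape (32, 1)
```

The same ensemble statistics also provide the disagreement term that Section 3.3 later uses as a proxy for epistemic uncertainty.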
3.2 LOCAL POLICY CONVEXIFICATION

While conservative policy optimization can improve constraint satisfaction by encouraging cost overestimation, it could also impede reward maximization due to the shrinkage of the policy search space (i.e., $\mathbb{E}[Q^\pi_c] \le \mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c]$ implies $\{\pi \mid \mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c] \le d\} \subseteq \{\pi \mid \mathbb{E}[Q^\pi_c] \le d\}$). As illustrated in Figure 1, by narrowing the gap between the conservative boundary and the true boundary, the algorithm can expand the search space and thus enhance the probability of obtaining a high-reward policy. However, this conservatism gap is co-controlled by the conservatism parameter $k$ and the bootstrap disagreement. Since the bootstrap disagreement quantifies the uncertainty and is susceptible to instability, it is impractical to precisely control the gap only through tuning the conservatism parameter. In this section, we introduce local policy convexification to stabilize the learning of the policy and the Lagrange multiplier. When working in conjunction with conservative policy optimization, such stabilization helps reduce the bootstrap disagreement and thus gradually narrows the conservatism gap, as interpreted in Section 3.3.

In the estimation of the reward value, even if the estimated values are far from their true values, they can still be helpful for policy improvement as long as their relative order is preserved. Nevertheless, the cost value estimation is less tolerant of estimation error, because the dual update process needs to compare its exact value against an absolute threshold. Thus, even a slight estimation error of the cost value can result in a completely wrong $\lambda$ that misleads the subsequent primal optimization. For instance, in objective (2) the optimal value of $\lambda$ is $+\infty$ if the estimated cost value exceeds the threshold, and is 0 otherwise. Therefore, compared to on-policy methods, off-policy methods are less suitable for primal-dual safe RL due to the inherent value estimation error (Levine et al., 2020), which can lead to instability in policy learning and oscillations in the overall performance.

To stabilize the training of off-policy methods, we modify the original objective (2) by applying the augmented Lagrangian method (ALM) (see pages 445-453 of Luenberger et al. (1984)) as follows [2]:

$$\min_{\lambda \ge 0} \max_\pi \ \mathbb{E}\big[\hat{Q}^\pi_r\big] - \lambda\left(\mathbb{E}\big[\hat{Q}^{\mathrm{UCB}}_c\big] - d\right) - \frac{c}{2}\left(\mathbb{E}\big[\hat{Q}^{\mathrm{UCB}}_c\big] - d\right)^2, \quad \text{if } \frac{\lambda}{c} > d - \mathbb{E}\big[\hat{Q}^{\mathrm{UCB}}_c\big]; \qquad (4a)$$

$$\max_\pi \ \mathbb{E}\big[\hat{Q}^\pi_r\big], \quad \text{otherwise}, \qquad (4b)$$

where $c$ is a positive hyperparameter and the expectations are taken over $s \sim \rho^\pi, a \sim \pi(\cdot|s)$. Such a modification does not alter the positions of local optima under a common assumption for inequality constraints (see Appendix A). Note that (4b) does not involve the cost value, which is reasonable since $\frac{\lambda}{c} \le d - \mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c]$ implies that the UCB value is below the threshold. As for objective (4a), intuitively, with a sufficiently large $c$ it will be dominated by the quadratic penalty term and thus will be convex in the neighborhood of a boundary solution (i.e., a policy $\pi$ satisfying $\mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c] = d$). Theoretically, assuming a continuous-time setting, we can define the energy of our learning system as $\mathcal{E} = \|\dot{\pi}\|^2 + \frac{1}{2}\dot{\lambda}^2$, where the dot denotes the gradient w.r.t. time. Adapting the proof in Platt & Barr (1987) to the inequality-constraint case leads to the following proposition (see Appendix B for the complete proof):

Proposition 1. If $c$ is sufficiently large, $\mathcal{E}$ will decrease monotonically during training in the neighborhood of a locally optimal policy.

Proof (Sketch). Let $f(\pi) = \mathbb{E}_{s,a}[\hat{Q}^\pi_r(s, a)]$ and $g(\pi) = \mathbb{E}_{s,a}[\hat{Q}^{\mathrm{UCB}}_c(s, a)] - d$. By transforming the objective into an equivalent form and using the chain rule, the dynamics of the energy $\mathcal{E}$ can be derived as $\dot{\mathcal{E}} = -\dot{\pi}^\top A\, \dot{\pi}$, where $u \ge 0$ and $A = -\nabla^2 f + \big(\lambda + c(g(\pi) + u)\big)\nabla^2 g + c\,\nabla g (\nabla g)^\top$. According to Lemma 1, there exists a $c^* > 0$ such that if $c > c^*$, $A$ is positive definite at the constrained minima. By continuity, $A$ is positive definite in some neighborhoods of local optima; hence, the energy function $\mathcal{E}$ monotonically decreases in such neighborhood areas.

By definition, a relatively smaller $\mathcal{E}$ implies smaller update steps of both $\pi$ and $\lambda$, meaning that the policy will oscillate less when it is close to a local minimum. In other words, there will be an absorbing area surrounding each suboptimal policy, in which the updating policy can be stabilized and thus be attracted by this local optimum.

[2] For the convenience of the following interpretations, here we use the UCB-modified objective obtained by replacing $\hat{Q}^\pi_c$ in the original objective (2) with $\hat{Q}^{\mathrm{UCB}}_c$.
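To make the case split in (4a)/(4b) concrete, here is a small illustrative sketch (not the paper's implementation) of evaluating the ALM-modified objective for given batch-level estimates of $\mathbb{E}[\hat{Q}^\pi_r]$ and $\mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c]$; the numeric values in the example are arbitrary.

```python
# Sketch of the ALM-modified objective (4a)/(4b) for scalar batch estimates.
# q_r_mean, q_ucb_mean stand for E[Q_r] and E[Q_c^UCB] over the sampled batch.
def alm_objective(q_r_mean: float, q_ucb_mean: float, d: float, lam: float, c: float) -> float:
    """Return the (maximization) objective value seen by the policy."""
    violation = q_ucb_mean - d                      # g(pi) in the analysis
    if lam / c > d - q_ucb_mean:                    # case (4a): the cost term is active
        return q_r_mean - lam * violation - 0.5 * c * violation ** 2
    return q_r_mean                                 # case (4b): the cost term drops out

# Example with threshold d = 25: a conservative estimate slightly above the threshold
# triggers the quadratic penalty, pulling the policy back toward the conservative region.
print(alm_objective(q_r_mean=10.0, q_ucb_mean=26.0, d=25.0, lam=0.5, c=10.0))  # case (4a)
print(alm_objective(q_r_mean=10.0, q_ucb_mean=20.0, d=25.0, lam=0.1, c=10.0))  # case (4b)
```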
3.3 JOINT EFFECT OF CONSERVATIVE POLICY OPTIMIZATION AND LOCAL POLICY CONVEXIFICATION

Differentiating the ALM-modified objective (4a) through the estimated cost value yields the policy-gradient term $\mathbb{E}\big[-\tilde{\lambda}\,\nabla_a \hat{Q}^{\mathrm{UCB}}_c(s, a)\big|_{a = \pi_\phi(s)} \nabla_\phi \pi_\phi(s)\big]$, where $\tilde{\lambda} = \max\{0, \lambda - c(d - \mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c(s, a)])\}$. Therefore, the ALM modification can be seen as applying a gradient rectification that replaces $\lambda$ in the original gradient with $\tilde{\lambda}$. As illustrated in Figure 1, the effect of such gradient rectification is manifested in two cases: 1) when the policy crosses the conservative boundary (i.e., $\mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c(s, a)] > d$), the ALM modification rectifies the policy gradient by assigning a larger penalty weight (i.e., $\lambda - c(d - \mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c(s, a)]) > \lambda$), which accelerates dragging the policy back to the conservative area; 2) in contrast, when the policy satisfies the conservative constraint (i.e., $\mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c(s, a)] \le d$) while $\lambda$ is still large and cannot be immediately reduced to 0, we have $\lambda - c(d - \mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c(s, a)]) \le \lambda$, which to some extent prevents unnecessarily over-conservative policy updates aimed at reducing the cost value. Therefore, applying local policy convexification can force the policy to stay close to the local optimum, which is consistent with the theoretical analysis from the energy perspective.

With such a stabilization effect, the collected samples can be focused in line with the distribution of the locally optimal policy, which in turn dissipates the epistemic uncertainty in the local convexified area (Figure 1). As the uncertainty decreases, the value estimates in the ensemble $\{\hat{Q}^i_c\}_{i=1}^E$ in Eq. (3) become more accurate, i.e.,

$$\frac{1}{E}\sum_{i=1}^{E} \hat{Q}^i_c(s, a) \to Q^\pi_c(s, a), \qquad \mathrm{Std}_{i=1,\ldots,E}\,\hat{Q}^i_c(s, a) \to 0, \qquad (5)$$

and the bootstrap disagreement, which quantifies the epistemic uncertainty, will also decrease, i.e.,

$$\sqrt{\frac{1}{E}\sum_{i=1}^{E}\left(\hat{Q}^i_c(s, a) - \frac{1}{E}\sum_{j=1}^{E} \hat{Q}^j_c(s, a)\right)^2} \to 0. \qquad (6)$$

In this way, the conservativeness induced by the bootstrap disagreement is reduced, and thus the conservative boundary can be pushed gradually towards the true boundary.
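A minimal illustration (not the released implementation) of the rectified penalty weight $\tilde{\lambda}$ described above, and of how it replaces $\lambda$ in the policy's cost-penalty term; the numbers and the conceptual loss in the trailing comment are assumptions for exposition.

```python
# Sketch of the gradient rectification in Section 3.3: the policy's cost penalty uses
# lambda_tilde = max(0, lam - c * (d - E[Q_c^UCB])) instead of lam itself.
import torch

def rectified_multiplier(lam: float, c: float, d: float, q_ucb_mean: torch.Tensor) -> torch.Tensor:
    return torch.clamp(lam - c * (d - q_ucb_mean), min=0.0)

d, lam, c = 25.0, 1.0, 10.0                                   # toy numbers
print(rectified_multiplier(lam, c, d, torch.tensor(27.0)))    # boundary crossed -> weight > lam
print(rectified_multiplier(lam, c, d, torch.tensor(24.95)))   # slightly inside  -> weight < lam
print(rectified_multiplier(lam, c, d, torch.tensor(20.0)))    # well inside      -> weight = 0

# In the actor update, the penalty term then reads (conceptually):
#   loss_pi = -( E[Q_r(s, pi(s))] - lambda_tilde.detach() * E[Q_c^UCB(s, pi(s))] )
```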
Figure 3: Results of applying local policy convexification on two example tasks, averaged over 6 random seeds. The first two columns show the reward/cost learning curves, where the solid lines are the mean and the shadowed regions are the standard deviation. The third column shows the training cost over different timestep intervals using box-and-whisker plots, where the boxes show the median and the 75% (q3) and 25% (q1) quartiles of the distributions, the whiskers depict the error bounds computed as 1.5(q3 - q1), and outliers lie outside the whisker intervals. The last column shows the trend of the standard deviation of the cost value ensemble during training, as well as the difference between the UCB cost value and its oracle value during training (a value greater than 0 implies overestimation; otherwise it implies underestimation).

To validate the joint effect of the two introduced ingredients, we incorporate them into the off-policy RL method SAC and denote the final algorithm as Conservative Augmented Lagrangian (CAL) [3]. The results are given in Figure 3, where UCB-k denotes the method using only UCB with k being the value of the conservatism parameter, and k in CAL is set to 0.5 in the two tested tasks. From the first two columns, we can see that, compared to relying solely on conservative policy optimization, applying local policy convexification yields significantly smaller oscillations, indicating that this algorithmic ingredient can stabilize the overall performance both within a single run (the solid line) and across different runs (the shadowed region). Moreover, the third column [4] shows that 1) as the timestep grows, the episodic training cost becomes more and more stable (embodied in the decreasing box length) and closer to the constraint thresholds; and 2) even compared to sole UCB with a much larger k, using local policy convexification leads to far fewer constraint violations and cost oscillations during training. Additionally, we can see in the last column that the overestimated UCB value gradually approaches its oracle value while the standard deviation of the bootstrapped ensemble decays toward 0, which is consistent with our interpretation of the coupling effect of the two ingredients [5].

4 EXPERIMENTS

In this section, we first compare CAL to several strong baselines on benchmarks in Section 4.1, and then study two hyperparameters that respectively control the effect of the two proposed ingredients in Section 4.2. The experiment on a real-world advertising bidding scenario is presented in Section 4.3.

[3] The pseudo code of CAL can be found in Appendix C.
[4] Note that the learning curves are the testing results using extra evaluation episodes after each training iteration, while the box plots present the training cost.
[5] Note that, different from the results on Hopper, the UCB value in the Point Button2 task does not exhibit overestimation at the early stage of training. This is because the sparsity of the cost in the Point Button2 task increases the difficulty of learning the value estimates.
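Putting the two ingredients together, the sketch below illustrates one CAL-style gradient step at a high level (the complete procedure is Algorithm 1 in Appendix C); the network definitions, the omission of the SAC entropy term and double reward critics, and all numeric values are simplifying assumptions made only for illustration.

```python
# Condensed sketch of one CAL-style update: SAC-style actor step with the UCB cost
# estimate and the rectified multiplier, followed by the dual update of lambda.
import torch
import torch.nn as nn

state_dim, action_dim, E = 8, 2, 4
d, k, c, alpha_lam = 25.0, 0.5, 10.0, 1e-2

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim), nn.Tanh())
q_r = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
cost_ensemble = [nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
                 for _ in range(E)]
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
lam = torch.tensor(0.0)

states = torch.randn(256, state_dim)                  # stand-in for a replay batch
actions = actor(states)
sa = torch.cat([states, actions], dim=-1)

preds = torch.stack([q(sa) for q in cost_ensemble])   # (E, batch, 1)
q_ucb = preds.mean(0) + k * preds.std(0)              # UCB estimate as in Eq. (3)

# Rectified multiplier (Section 3.3); detached so it acts as a fixed penalty weight.
lam_tilde = torch.clamp(lam - c * (d - q_ucb.mean().detach()), min=0.0)

# Actor step: maximize the reward value minus the rectified cost penalty
# (entropy term, double reward critics, and critic training are omitted for brevity).
actor_loss = -(q_r(sa).mean() - lam_tilde * q_ucb.mean())
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Dual step on the plain multiplier, projected onto lambda >= 0.
with torch.no_grad():
    lam = torch.clamp(lam + alpha_lam * (q_ucb.mean() - d), min=0.0)
```

In CAL itself, the cost-critic ensemble and the double reward critics are additionally trained with temporal-difference targets and target networks, and lambda has its own learning rate, as listed in Algorithm 1.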
Figure 4: Comparisons with off-policy baselines on Safety-Gym (top half) and velocity-constrained MuJoCo (bottom half).

4.1 COMPARATIVE EVALUATION

We conduct our comparative evaluation on the Safety-Gym benchmark (Ray et al., 2019) and the velocity-constrained MuJoCo benchmark (Zhang et al., 2020). A detailed description of the tested tasks can be found in Appendix D.1. Baselines are categorized into off- and on-policy methods. The off-policy baselines include 1) SAC-PIDLag (Stooke et al., 2020), which builds a stronger version of SAC-Lag by using PID control to update the Lagrange multiplier, 2) WCSAC (Yang et al., 2021), which replaces the expected cost value in SAC-Lag with the Conditional Value-at-Risk (Rockafellar et al., 2000) to achieve controllable conservatism, and 3) CVPO (Liu et al., 2022), which performs constrained policy learning in an Expectation-Maximization fashion. Due to the space limit, we show the comparisons with on-policy baselines in Appendix D.2. Implementation details of CAL and the baselines are given in Appendix D.3. Our code can be found in our supplementary material.

Figure 4 shows the results on the benchmarks averaged over 6 random seeds. The top/middle rows present the tested reward/cost using extra episodes after each training iteration [6]. The bottom row presents the cost induced during training, where the training episodes are quartered in time order and respectively represented by the four boxes for each algorithm. The horizontal dashed lines in the cost plots represent the constraint thresholds.

[6] Note that the smoothness of the learning curves among the baselines varies due to the different testing intervals, which result from the differences in the training paradigms.

Figure 5: Ablation studies of k (top half) and c (bottom half).
On all tested tasks, CAL significantly outperforms all baselines in the low-data regime, and also achieves the same level of reward and constraint satisfaction as state-of-the-art on-policy methods (see Appendix D.2), with much fewer samples and constraint violations during training.

4.2 ABLATION STUDIES

An appealing advantage of CAL is that only a small number of hyperparameters need to be tuned. The ablation studies for two key hyperparameters are presented in Figure 5, i.e., k in Eq. (3) and c in Eq. (4a), which respectively control the conservativeness in conservative policy optimization and the degree of convexity in local policy convexification. As shown in the top half of Figure 5, a larger k generally leads to a lower test/training cost, but an overly conservative k can also hinder reward maximization. As a special case, setting k = 0 leads to significantly more constraint violations, and it even fails to obtain an asymptotically constraint-satisfying policy on Point Button2 and Car Button1, which demonstrates the importance of incorporating conservatism into the algorithm. As for the ablation results of c, increasing c results in fewer performance oscillations but slower reward improvement. In particular, setting c = 0 (i.e., without ALM) results in severe performance oscillations that hinder both reward maximization and constraint satisfaction, while an overly large c imposes a large penalty on policies exceeding the conservative boundary (where $\mathbb{E}[\hat{Q}^{\mathrm{UCB}}_c] > d$), which significantly slows down the expansion of the conservative boundary (as illustrated in Figure 1) and thus impedes reward improvement. In addition, an ablation study for the ensemble size E is shown in Figure 8 in Appendix D.4, where we can see that a larger E generally leads to a lower cost, but the overall performance is not very sensitive to this hyperparameter. In theory, increasing the number of $Q_c$ networks improves the reliability of the uncertainty quantification. However, Figure 8 shows that in simpler tasks (i.e., Hopper and Point Button2) the benefit of doing so is not as significant as in the harder Car Button2. Since training more networks incurs a greater computational burden, users of the CAL algorithm will need to make a trade-off according to the difficulty of the target task and the computational budget.

4.3 SEMI-BATCH SETTING

The semi-batch setting refers to the training paradigm where the agent alternates between 1) collecting a large batch of data with a fixed behavior policy, and 2) updating the target policy on the collected data. Since the policy is not allowed to be adjusted during long-term environment interaction, the online deployment of a policy must ensure a certain level of safety. As a compromise between online and offline training (Levine et al., 2020), such a training paradigm is crucial for real-world applications of RL (Lange et al., 2012; Matsushima et al., 2021). In this regard, CAL may be particularly useful since, by conservatively approaching the optimal policy from the constraint-satisfying area, it can ensure both safe online deployment and high offline sample efficiency. To verify CAL's capability in the semi-batch setting, we conduct experiments in a large-scale advertising bidding scenario w.r.t. oCPM (the optimized cost-per-mille), which is a typical real-world semi-batch application.
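A minimal sketch of the semi-batch paradigm described above: the behavior policy is frozen for each long data-collection phase and only updated offline between deployments. The environment and update routines below are toy stand-ins, not the advertising system or CAL's actual updates.

```python
# Sketch of semi-batch training: collect a large batch with a frozen behavior policy,
# then update the target policy offline before the next deployment.
import random

def collect(behavior_policy, n):
    # Stand-in for one long deployment of the frozen behavior policy.
    data = []
    for _ in range(n):
        s = random.random()
        data.append((s, behavior_policy(s)))
    return data

def offline_update(policy_param, dataset):
    # Stand-in for the offline gradient updates on the accumulated data.
    return policy_param + 0.01 * len(dataset) ** 0.5

policy_param, dataset = 0.0, []
for week in range(4):                                    # e.g., one deployment per week
    behavior = lambda s, p=policy_param: p * s           # frozen snapshot of the current policy
    dataset += collect(behavior, n=1000)                 # policy is NOT updated during collection
    policy_param = offline_update(policy_param, dataset)
```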
In the advertising bidding scenario, advertisers bid for advertisement slots that are exhibited to users browsing the site. After being assigned the target ROI (return on investment) of each advertiser, the auto-bidding policy provided by the platform helps the advertisers bid for the slots, in order to maximize the total payments of all advertisers while satisfying each ROI constraint. A more detailed description of this scenario is provided in Appendix D.1. Our experiment is carried out on a real-world advertising platform with a total of 204.8 million bidding records collected from 47,244 advertisers within 4 weeks. During the online deployment of each week, the platform generates 51.2 million records with a fixed bidding policy. The policy is only allowed to update once a week on data sampled from the historical dataset. We evaluate the algorithms by summarizing the total advertiser payments and the ROI violation rate within each deployment. The experimental results shown in Table 1 demonstrate that CAL significantly outperforms the baselines in terms of payment maximization while maintaining a relatively low ROI violation rate, verifying CAL's potential for the large-scale semi-batch setting in real-world applications.

Table 1: Comparison on the oCPM task, where Pay refers to the total advertiser payments (values normalized for data security) and Vio refers to the ROI violation rate. Each cell reports Pay/k, Vio/%.

|        | WCSAC         | CVPO          | CUP           | SAC-PIDLag    | CAL (ours)    |
|--------|---------------|---------------|---------------|---------------|---------------|
| Week 1 | 270.23, 30.43 | 374.38, 31.66 | 433.33, 31.86 | 367.82, 32.30 | 656.31, 30.55 |
| Week 2 | 96.38, 25.51  | 92.23, 23.44  | 116.89, 22.63 | 105.12, 25.77 | 372.51, 23.04 |
| Week 3 | 0, 22.57      | 25.20, 21.38  | 59.47, 20.88  | 29.71, 23.12  | 208.65, 18.15 |
| Week 4 | 25.61, 22.70  | 51.06, 20.52  | 95.80, 19.50  | 58.22, 23.07  | 226.37, 17.26 |

5 CONCLUSION

This paper proposes an off-policy primal-dual method containing two main ingredients, i.e., conservative policy optimization and local policy convexification. We provide both theoretical interpretation and empirical verification of how these two ingredients work in conjunction with each other. In addition to the benchmark results, which demonstrate a significantly higher sample efficiency compared to the baselines, the proposed method also exhibits promising performance on a large-scale real-world task using a challenging semi-batch training paradigm. A possible direction for future work is to extend our approach to fully offline settings.

ACKNOWLEDGMENTS

We gratefully acknowledge support from the National Natural Science Foundation of China (No. 62076259), the Fundamental and Applicational Research Funds of Guangdong province (No. 2023A1515012946), and the Fundamental Research Funds for the Central Universities-Sun Yat-sen University. This research is also supported by Meituan.

REFERENCES

M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys, 55(7):1-38, 2022.

Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation.
The International Journal of Robotics Research, 39(1):3 20, 2020. Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhihong Deng, Animesh Garg, Peng Liu, and Zhaoran Wang. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. ar Xiv preprint ar Xiv:2202.11566, 2022. Xinyue Chen, Che Wang, Zijian Zhou, and Keith W Ross. Randomized ensembled double qlearning: Learning fast without a model. In International Conference on Learning Representations, 2021. Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18 (1):6070 6120, 2017. Kurtland Chua, Roberto Calandra, Rowan Mc Allister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems, 31, 2018. Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33:8378 8390, 2020. A.M. Duguid. Studies in linear and non-linear programming, by k. j. arrow, l. hurwicz and h. uzawa. stanford university press, 1958. 229 pages. 7.50. Canadian Mathematical Bulletin, 3(3):196 198, 1960. doi: 10.1017/S0008439500025522. Kamyar Ghasemipour, Shixiang Shane Gu, and Ofir Nachum. Why so pessimistic? estimating uncertainties for offline rl through ensembles, and why their independence matters. Advances in Neural Information Processing Systems, 35:18267 18281, 2022. Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, Yaodong Yang, and Alois Knoll. A review of safe reinforcement learning: Methods, theory and applications. ar Xiv preprint ar Xiv:2205.10330, 2022. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861 1870. PMLR, 2018. Jiaming Ji, Jiayi Zhou, Borong Zhang, Juntao Dai, Xuehai Pan, Ruiyang Sun, Weidong Huang, Yiran Geng, Mickel Liu, and Yaodong Yang. Omnisafe: An infrastructure for accelerating safe reinforcement learning research. ar Xiv preprint ar Xiv:2305.09304, 2023. Published as a conference paper at ICLR 2024 John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin ˇZ ıdek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583 589, 2021. Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. Reinforcement learning: State-of-the-art, pp. 45 73, 2012. Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703 3712. PMLR, 2019. Jongmin Lee, Cosmin Paduraru, Daniel J Mankowitz, Nicolas Heess, Doina Precup, Kee-Eung Kim, and Arthur Guez. Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation. ar Xiv preprint ar Xiv:2204.08957, 2022a. Jongmin Lee, Cosmin Paduraru, Daniel J Mankowitz, Nicolas Heess, Doina Precup, Kee-Eung Kim, and Arthur Guez. Coptidice: Offline constrained reinforcement learning via stationary distribution correction estimation. In International Conference on Learning Representations, 2022b. Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 
Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Qian Lin, Bo Tang, Zifan Wu, Chao Yu, Shangqin Mao, Qianlong Xie, Xingxing Wang, and Dong Wang. Safe offline reinforcement learning with real-time budget constraints. In International Conference on Machine Learning, pp. 21127 21152. PMLR, 2023. Yongshuai Liu, Jiaxin Ding, and Xin Liu. Ipo: Interior-point policy optimization under constraints. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 4940 4947, 2020. Zuxin Liu, Zhepeng Cen, Vladislav Isenbaev, Wei Liu, Steven Wu, Bo Li, and Ding Zhao. Constrained variational policy optimization for safe reinforcement learning. In International Conference on Machine Learning, pp. 13644 13668. PMLR, 2022. Zuxin Liu, Zijian Guo, Yihang Yao, Zhepeng Cen, Wenhao Yu, Tingnan Zhang, and Ding Zhao. Constrained decision transformer for offline safe reinforcement learning. In International Conference on Machine Learning, pp. 21611 21630. PMLR, 2023. David G Luenberger, Yinyu Ye, et al. Linear and nonlinear programming, volume 2. Springer, 1984. Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, and Shixiang Gu. Deploymentefficient reinforcement learning via model-based offline optimization. In International Conference on Learning Representations, 2021. Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped dqn. Advances in Neural Information Processing Systems, 29, 2016. Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. Advances in Neural Information Processing Systems, 32:7555 7565, 2019. John Platt and Alan Barr. Constrained differential optimization. In Neural Information Processing Systems, 1987. Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. ar Xiv preprint ar Xiv:1910.01708, 7(1):2, 2019. R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2:21 42, 2000. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ar Xiv preprint ar Xiv:1707.06347, 2017. David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484 489, 2016. Published as a conference paper at ICLR 2024 Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by pid lagrangian methods. In International Conference on Machine Learning, pp. 9133 9143. PMLR, 2020. Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2018. Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the Fourth Connectionist Models Summer School, volume 255, pp. 263. Hillsdale, NJ, 1993. Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp. 5026 5033. IEEE, 2012. Haoran Xu, Xianyuan Zhan, and Xiangyu Zhu. Constraints penalized q-learning for safe offline reinforcement learning. 
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 8753 8760, 2022a. Mengdi Xu, Zuxin Liu, Peide Huang, Wenhao Ding, Zhepeng Cen, Bo Li, and Ding Zhao. Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability. ar Xiv preprint ar Xiv:2209.08025, 2022b. Tengyu Xu, Yingbin Liang, and Guanghui Lan. Crpo: A new approach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning, pp. 11480 11491. PMLR, 2021. Long Yang, Jiaming Ji, Juntao Dai, Linrui Zhang, Binbin Zhou, Pengfei Li, Yaodong Yang, and Gang Pan. Constrained update projection approach to safe policy optimization. In Advances in Neural Information Processing Systems, 2022. Qisong Yang, Thiago D Sim ao, Simon H Tindemans, and Matthijs TJ Spaan. Wcsac: Worst-case soft actor critic for safety-constrained reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 10639 10646, 2021. Yiming Zhang, Quan Vuong, and Keith Ross. First order constrained optimization in policy space. Advances in Neural Information Processing Systems, 33:15338 15349, 2020. Published as a conference paper at ICLR 2024 A ASSUMPTIONS AND PROOF FOR THE ALM MODIFICATION FOR INEQUALITY-CONSTRAINT PROBLEMS DOES NOT ALTER THE POSITIONS OF LOCAL OPTIMA The ALM modification in Section 3.2 does not alter the positions of local optima under the following strict complementarity assumption (Luenberger et al., 1984): Assumption 1 (Strict Complementarity). Eπ [Qπ c (s, a)] = d implies that λ > 0 as well as the second order sufficiency conditions hold for the original constrained optimization problem, where π denotes the optimal solution to the original constrained problem, and λ denotes the optimal solution to the dual problem. Assumption 2 (Second Order Sufficiency). The second order sufficiency conditions for the original constrained optimization problem include: Eπ [Qπ c (s, a)] = d there exists λ 0, s.t. Eπ [Qπ r (s, a)] + λ Eπ [Qπ c (s, a)] = 0 (7) the Hessian matrix 2Eπ [Qπ r (s, a)] + λ 2Eπ [Qπ c (s, a)] (8) is positive definite on the subspace {π| π Eπ [Qπ c (s, a)] = 0}. Here we prove that under the strict complementarity assumption, the ALM modification does not alter the position of local optima. Proof. Let f(π) = Es ρπ,a π( |s) h b Qπ r (s, a) i and g(π) = Es ρπ,a π( |s) h b QUCB c (s, a) i d, the original problem can be written as max π f(π) s.t. g(π) 0. (9) If the strict complementarity assumption is satisfied, then the original problem can be written as an equivalent problem with equality constraints (Luenberger et al., 1984): s.t. g(π) + u = 0, u 0. (10) Since c > 0, the optimization problem (10) is equivalent to: max π f(π) 1 2c|g(π) + u|2 s.t. g(π) + u = 0, u 0, (11) which does not alter the position of local optima since the conditions of stationary points of the Lagrangian of Eq. (11) are not changed compared to the original Lagrangian. Then, we formulate the dual function of problem (11) as: ϕ(λ) = max u 0,π f(π) λT [g(π) + u] 1 2c|g(π) + u|2, (12) where λ is the Lagrangian multiplier. Now we show that the optimization of the dual problem, i.e., minλ 0 ϕ(λ), is equivalent to: min λ 0 max π E h b Qπ r i 1 max{0, λ c(d E h b QUCB c i } 2 λ2 . 
(13) Given λ and π, the analytical optimal solution u j for problem (12) can be obtain by solving the quadratic problem: Pj = max uj 1 2c[gj(π) + uj]2 λj[gj(π) + uj] (14) Published as a conference paper at ICLR 2024 It can be easily computed that the derivative of Eq. (14) is 0 at gj(π) λj/c. Thus, the optimal solution to problem (12) is u j = 0 if gj(π) λj/c < 0, otherwise u j = gj(π) λj/c. In other words, the solution can be written as: u j = max {0, gj(π) λj/c} . (15) Substituting Eq. (15) into Eq. (14), we obtain an explicit expression for Pj: 1) for u j = 0, we have: 2c gj(π)2 λj gj(π) = 1 2c [λj + c gj(π)]2 λ2 j ; (16) and 2) for u j = gj(π) λj/c, we have: Pj = λ2 j/2c. (17) Combining the two cases together we get: 2c [max {0, λj + c gj(π)}]2 λ2 j . (18) Therefore, the dual function can be written as the following form that only depends on λ and π: ϕ(λ) = max u 0,π f(π) λT [g(π) + u] 1 2c|g(π) + u|2 = max π f(π) + X = max π f(π) 1 2c | max {0, λ + c g(π)} |2 |λ|2 =max π E h b Qπ r i 1 max{0, λ c(d E h b QUCB c i } 2 λ2 . According the equality in Eq. (19), the optimization of the dual problem (12) minλ 0 ϕ(λ) is equivalent to Eq. (13). Based on this result, the objective (4a)(4b) in the main paper can be derived, i.e., min λ 0 ϕ(λ) = min λ 0 max π E h b Qπ r i λ E h b QUCB c i d c E h b QUCB c i d 2 , if λ c > d E[ b QUCB c ] min λ 0 max π E h b Qπ r i + λ2 2c , otherwise. Since min λ 0 max π E h b Qπ r i + λ2 2c is equivalent to max π E h b Qπ r i , we have reached our conclusion that the ALM modification does not alter the positions of local optima. B PROOF FOR THE MONOTONIC ENERGY DECREASE AROUND A LOCAL OPTIMUM In Section 3.2, we define the energy of the learning system of Eq. (13) as: E = π 2 + 1 2 λ2, where the dot denotes the gradient w.r.t. time (assuming a continuous-time setting). Now we extend the proof in (Platt & Barr, 1987) to show that: If c is sufficiently large, E will decrease monotonically during training in the neighborhood of a local optimal policy. Proof. Recall that in the proof in Appendix A, the equivalent problem for the equality constrained optimization (10) can be written as: min π f(π) + 1 2c|g(π) + u|2 s.t. g(π) + u = 0, u 0. (21) The Lagrangian of this constrained objective is: L(π, λ) = f(π) + λ(g(π) + u) + 1 2c|g(π) + u|2. (22) Published as a conference paper at ICLR 2024 The continuous learning dynamics of Eq. (22) can be written as: π c(g(π) + u) g π λ = (g(π) + u). (23) During each time interval t, the policy π and the Lagrangian multiplier λ are updated to π + π t and λ + λ t, respectively. The dynamics of the energy E = π 2 + 1 2 λ2 can be derived as: i ( πi)2 + 1 2(g(π) + u)2 # 2( πi)2 + (g(π) + u)( g)T x. Furthermore, since 2( πi)2 = πi d dt πi c(g(π) + u) g 2f πi πj λ 2g πi πj c g g πj c(g(π) + u) 2g πi πj (25) we derive the following expression for E by substituting Eq. (25) into Eq. (24): 2f πi πj λ 2g πi πj c g g πj c(g(π) + u) 2g πi πj πi + (g(π) + u)( g)T x = πT 2f + (λ + c(g(π) + u)) 2g + c g( g)T π (g(π) + u)( g)T x + (g(π) + u)( g)T x = πT A π, (26) where A = 2f + (λ + c(g(π) + u)) 2g + c g( g)T . Before discussing this damping matrix A, we first introduce a Lemma mentioned in (Duguid, 1960) [Page 172]: Lemma 1. If P is a symmetric matrix and if x Px < 0 for all x = 0 satisfying Qx = 0, then for all c sufficiently large, P c QT Q is negative definite. 
Denoting φ(π, µ) = f(π) + µ(g(π) + u) as the Lagrangian function of the original constrained problem (10) and π as an optimal solution for the original constrained problem, we can easily derive that π satisfies: 1) g(π ) + u = 0, and 2) there exists critical point (π , µ ) such that 2φ(π , µ ) = 2f(π ) + µ 2g(π ) is positive definite according to the second order sufficiency conditions. At the point (π , µ ), the damping matrix can be represented by: A = 2f(π ) + µ 2g(π ) + c g(π )( g(π ))T = (P c QT Q), (27) where P = ( 2f(π ) + µ 2g(π )) and Q = g(π ). Since P is negative definite, we can apply Lemma 1 and obtain that there exists a c > 0 such that if c > c , the damping matrix A is positive definite at the constrained minima (π , µ ). Using continuity, the damping matrix A is positive definite in some neighborhoods of the local optima, hence, the energy function E monotonically decreases in such neighborhood areas. C PSEUDO CODE The pseudo code of CAL is presented in Algorithm 1. Published as a conference paper at ICLR 2024 Algorithm 1 CAL 1: Set hyperparameters: k, c, and learning rates απ, αQ, αλ, τ. 2: Initialize policy πϕ, double reward Q-functions b Qθ1 r , b Qθ2 r , cost Q-function ensemble { b Qψi c }E i=1, target networks b Q θ 1 r , b Q θ 2 r , { b Q ψ i c }E i=1, Lagrange multiplier λ, and replay buffer D. 3: for each iteration do 4: for each environment step do 5: at πϕ( |st) 6: st+1 P( |st, at) 7: D D {st, at, r(st, at), c(st, at), st+1} 8: end for 9: for each gradient step do 10: sample experience batch B = {s(k), a(k), s (k)}B k=1 from D 11: ϕ ϕ απEs,a B ϕ log πϕ(a|s) + a log πϕ(a|s) a min i {1,2} b Qθi r (s, a) ϕπϕ(a|s) | {z } SAC objective 12: + λ a b QUCB c (s, a) ϕπϕ(a|s) , where λ = max n 0, λ c d Es,a B h b QUCB c (s, a) i o 13: ψi ψi αQEs,a B ψi c(s, a) + γEa πϕ( |s ) b Q ψ i c (s , a ) b Qψi c (s, a) 2 , i = 1, . . . , E 14: θi θi αQEs,a B θi r(s, a) + γ mini Ea πϕ( |s ) b Q θ i r (s , a ) b Qθi r (s, a) 2 , i = 15: λ max n 0, λ αλEs,a B h d b QUCB c (s, a) io 16: ψ i τψi + (1 τ)ψ i , i = 1, . . . , E 17: θ i τθi + (1 τ)θ i , i = 1, 2 18: end for 19: end for D EXPERIMENT DETAILS D.1 ENVIRONMENT DESCRIPTIONS Safety-Gym In Safety-Gym (Ray et al., 2019), the agent perceives the world through its sensors and interacts with the world through its actuators. In Button tasks, the agent needs to reach the goal while avoiding both static and dynamic obstacles. In Push tasks, the agent needs to move a box to a series of goal positions. Dense positive rewards will be given to the agent for moving towards the goal, and a sparse reward will be given for successfully reaching the goal. The agent will be penalized through the form of a cost for violating safety constraints, i.e., colliding with the static obstacles or pressing the wrong button. To accelerate training, we adopt the modifications on Safety-Gym tasks made by (Liu et al., 2022). The comparison is still valid since all the algorithms are evaluated on the same set of tasks. Velocity-Constrained Mu Jo Co In Mu Jo Co tasks (Todorov et al., 2012), we follow the cost definition and constraint threshold used in (Zhang et al., 2020; Yang et al., 2022) for reproducible comparison. Specifically, the cost is calculated as the velocity of the agent, and the threshold of each task is set to 50% of the velocity attained by an unconstrained PPO (Schulman et al., 2017) agent after training for a million samples. o CPM At the beginning of each day, advertiser s assigns the target ROI Rois and the maximum total pay Ps. 
At time step t, the advertising system bids pays,t for advertiser s. If this bid succeeds, then the advertiser expends pays,t and obtains revenue gmvs,t (the gross merchandise value) from the exhibited advertisements. In this process, the system has to ensure that the average conversion Published as a conference paper at ICLR 2024 is larger than the target Rois, i.e., P Ps > Rois X t gmvs,t < Rois Ps, (28) Thus, the optimization problem in this scenario can be formulated as: t gmvs,t < Rois Ps, s (29) We further formulate this problem as a constrained Markov Decision Process: The state consists of the user context features, ads context features, cumulative revenue, cumulative pay, maximum total pay, target ROI, p CTR (predicted click-through rate), p CV R (predicted conversion rate) and p GMV (predicted gross merchandise volume) of one advertiser. The action is to output a weight k [0.5, 1.5], which controls the bidding by formula bid = k p CT R p CV R p GMV The reward is pays,t for a successful bid and 0 otherwise. The cost is gmvs,t for a successful bid and 0 otherwise. The constraint budget is Rois. Note that by definition the ROI value is computed by P t gmvs,t , thus in this scenario we replace E[Qπ c ] d in the algorithm by E[ Qπ r Qπ c ] + Rois. Furthermore, the frequency of the change in the target ROI is relatively low and does not affect the training. In this task, the cost and the constraint budget are negative, which means that the constraint is violated at the starting stage of bidding and can only be satisfied over time. During evaluation, each advertiser switches among policies trained by different algorithms, in order to eliminate the differences between advertisers and make the online experiment fairer. D.2 COMPARISONS WITH ON-POLICY BASELINES The on-policy baselines include 1) PDO (Chow et al., 2017), which applies Lagrangian relaxation to devise risk-constrained RL algorithm, 2) RCPO (Tessler et al., 2018), which incorporates the constraint as a penalty into the reward function to guide the policy towards a constraint-satisfying one, 3) IPO (Liu et al., 2020), which augments the objective with a logarithmic barrier function as a penalty, 4) CRPO (Xu et al., 2021), which updates the policy alternatingly between reward maximization and constraint satisfaction, and 5) CUP (Yang et al., 2022), which applies a conservative policy projection after policy improvement. Figure 6 shows the comparative evaluation on two benchmark task sets. As shown by the x-axes, the total number of samples used by on-policy methods are set to 5 million in order to present the asymptotic performance of these methods. Note that the curves of CAL look squeezed in these plots. This is because the x-axes are altered to a much larger scale compared to those used in Figure 4. We can see that in each task, CAL achieves a reward performance comparable to the on-policy baseline that yields the highest asymptotic reward in this task. Although IPO attains higher reward in several Safety-Gym tasks, its performance on Mu Jo Co tasks is apparently inferior to CAL. Moreover, it is worth noting that no on-policy baseline is able to achieve constraint satisfaction on all these tested tasks. Most importantly, CAL s advantage in sample efficiency is significant, and this is particularly important in safety-critical applications where interacting with the environment could be potentially risky. 
D.3 IMPLEMENTATION DETAILS & HYPERPARAMETER SETTINGS Baselines We use the implementations in the official codebases, i.e., Published as a conference paper at ICLR 2024 0 1M 2M 3M 4M 5M 0 test reward Point Button1 0 1M 2M 3M 4M 5M Car Button1 0 1M 2M 3M 4M 5M 0 Point Button2 CAL(ours) CUP RCPO PDO IPO CRPO 0 1M 2M 3M 4M 5M Point Push1 0 1M 2M 3M 4M 5M Car Button2 0 1M 2M 3M 4M 5M 0 0 1M 2M 3M 4M 5M 0 0 1M 2M 3M 4M 5M 0 0 1M 2M 3M 4M 5M 0 0 1M 2M 3M 4M 5M 0 0 1M 2M 3M 4M 5M test reward 0 1M 2M 3M 4M 5M Half Cheetah 0 1M 2M 3M 4M 5M 0 1M 2M 3M 4M 5M 6000 Humanoid 0 1M 2M 3M 4M 5M 0 1M 2M 3M 4M 5M 0 0 1M 2M 3M 4M 5M 0 0 1M 2M 3M 4M 5M 0 Figure 6: Comparisons with on-policy baselines on Safety-Gym (top half) and velocity-constrained Mu Jo Co (bottom half). https://github.com/Alg TUDelft/WCSAC for WCSAC (Yang et al., 2021) https://github.com/liuzuxin/cvpo-safe-rl for CVPO (Liu et al., 2022) https://github.com/zmsn-2077/CUP-safe-rl for CUP (Yang et al., 2022) https://github.com/PKU-Alignment/omnisafe for PDO (Chow et al., 2017), RCPO (Tessler et al., 2018), IPO (Liu et al., 2020), CRPO (Xu et al., 2021). Note that this is an infrastructural framework for safe RL algorithms implemented by Ji et al. (2023). For SAC-PIDLag, we use the implementation in the CVPO codebase. We adopt the default hyperparameter settings in the original implementations. Note that the Mu Jo Co results are not presented in the CVPO paper, so the hyperparameters of CVPO are finetuned on Mu Jo Co tasks based on the authors modified codebase 7 (https://github.com/liuzuxin/fsrl). In addition, the risk hyperparameter α in WCSAC is set to the risk-averse 0.1. CAL The network structures of the actor and the reward/cost critics of CAL are all the same with the off-policy baselines, i.e., 256 neurons for two hidden layers with Re LU activation. The discount factor γ is set to 0.99 both in reward and cost value estimations. The optimization of the networks is conducted by Adam with the learning rate 5e 4 for the cost critics, and 3e 4 for the actor and the reward critics. The ensemble size E is set to 4 for Mu Jo Co tasks and 6 for Safety-Gym tasks. The conservatism parameter k is set to 0.5 for all tasks except for Point Push1 (0.8). The convexity parameter c is set to 10 for all tasks except for Ant (100), Half Cheetah (1000) and Humanoid (1000). About the Up-To-Data Ratio To make a more sufficient utilization of each data point and improve the sample efficiency, a larger up-to-data (UTD) ratio (i.e., the number of updates taken by an agent per environment interaction) (Chen et al., 2021) than normal off-policy algorithms is adopted in the implementation of CAL. The UTD ratio of CAL is set to 20 for all tasks except for Humanoid 7Specifically, we set 1) the number of episodes per data collection to 1; 2) the number of updates per environment sample to 0.5; and 3) the constraint of the policy s standard deviation in the M-step to 0.005. Published as a conference paper at ICLR 2024 0 50k 100k 150k test reward 0 50k 100k 150k 0 50k 100k 150k 0 50k 100k 150k 0 50k 100k 150k 0 50k 100k 150k 0 50k 100k 150k 0 50k 100k 150k 0 50k 100k 150k 200k 40 test reward 0 50k 100k 150k 200k 0 50k 100k 150k 200k 0 50k 100k 150k 200k 40 0 50k 100k 150k 200k 50 0 50k 100k 150k 200k 0 50k 100k 150k 200k 0 50k 100k 150k 200k Figure 7: Results of different UTD settings (1, 10 and 20) on Hopper (top half) and Point Button2 (bottom half). (10) and Half Cheetah (40). 
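As a side note on the mechanism itself, the snippet below sketches what a UTD ratio of G means in an off-policy loop (G gradient updates per environment step); the buffer, sampling, and update stand-ins are illustrative placeholders, not the training code of CAL or the baselines.

```python
# Sketch of an up-to-data (UTD) ratio G: perform G gradient updates per environment step.
import random

def train_with_utd(num_env_steps, utd_ratio, batch_size=256):
    replay_buffer = []
    for step in range(num_env_steps):
        transition = random.random()              # stand-in for one environment interaction
        replay_buffer.append(transition)
        for _ in range(utd_ratio):                # UTD ratio = updates per environment step
            batch = random.choices(replay_buffer, k=min(batch_size, len(replay_buffer)))
            _ = sum(batch) / len(batch)           # stand-in for one gradient update on the batch

train_with_utd(num_env_steps=100, utd_ratio=20)   # e.g., a UTD ratio of 20
```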
In the comparative evaluation (Section 4.1), we adopt the default UTD settings for the baselines [8], which are lower than the ratio in CAL. Since the off-policy baselines also have the potential to be improved by using a larger UTD ratio, here we provide results under different UTD settings for these baselines (and also CAL) for a more comprehensive comparison. Clearly, we can see from Figure 7 that, different from CAL, increasing the UTD ratios of the baselines only improves their reward performance but does not improve their constraint satisfaction, indicating that the capability of using a large UTD ratio is also one of the non-trivial advantages of CAL.

[8] Except for SAC-PIDLag in Point Button2 (20) and CVPO in MuJoCo tasks (0.5).

D.4 ABLATION STUDY FOR THE ENSEMBLE SIZE

See Figure 8.

Figure 8: Ablation study for the ensemble size E (E = 2, 4, 6, 8).

E RELATED WORK

In recent years, deep safe RL methods targeting high-dimensional control tasks have been well developed. One of the most commonly adopted approaches for safe RL is the primal-dual framework, due to its simplicity and effectiveness. In this regard, Chow et al. (2017) propose a primal-dual method and prove its convergence to constraint-satisfying policies. Tessler et al. (2018) propose a multi-timescale approach and use an alternative penalty signal to learn constrained policies with local convergence guarantees. Paternain et al. (2019) prove that primal-dual methods have zero duality gap under certain assumptions. Ding et al. (2020) show the convergence rate of a natural policy gradient method based on the primal-dual framework. Ray et al. (2019) make practical contributions by benchmarking safe RL baselines on the Safety-Gym environment. Nevertheless, the majority of existing primal-dual methods are on-policy by design. Despite the convergence guarantees, these methods usually encounter sample efficiency issues, demanding a large amount of environment interaction and hence constraint violations during the training process. This may impede the deployment of such methods in safety-critical applications. An approach to avoid risky environment interaction is training constraint-satisfying policies using only offline data (Le et al., 2019; Lee et al., 2022a; Xu et al., 2022a; Liu et al., 2023; Lin et al., 2023). Instead, this paper focuses on the online setting and attempts to enhance the safety of RL methods by improving their sample efficiency.

The methods most related to ours are as follows. Stooke et al. (2020) apply PID control to the dual variable and show its promising performance when combined with the on-policy PPO method (Schulman et al., 2017). This method is adopted as one of our baselines since, as reported in Stooke et al. (2020), the PID control can stabilize the update of the dual variable and thus has the potential to be adapted to the off-policy setting. WCSAC (Yang et al., 2021) augments the naïve SAC-Lag method with a variance estimator of the cost value to achieve risk control.
Yet, in practice we found that the temporal-difference-style learning of this variance estimator is less stable than directly using the variance of an ensemble, which may be one of the causes of WCSAC's inferior performance compared to CAL. CVPO (Liu et al., 2022) allows off-policy updates by breaking the brittle bond between the primal and the dual update and introducing a variational policy to bridge the two updates. This method shows significant improvement in sample efficiency but suffers from relatively low computational efficiency, due to the frequent sampling operations in the action space when computing the variational policy.