Published as a conference paper at ICLR 2021

CONSERVATIVE SAFETY CRITICS FOR EXPLORATION

Homanga Bharadhwaj1, Aviral Kumar2, Nicholas Rhinehart2, Sergey Levine2, Florian Shkurti1, Animesh Garg1
1University of Toronto, Vector Institute  2University of California, Berkeley
homanga@cs.toronto.edu
(Work done during HB's (virtual) visit to Sergey Levine's lab at UC Berkeley.)

ABSTRACT

Safe exploration presents a major challenge in reinforcement learning (RL): when active data collection requires deploying partially trained policies, we must ensure that these policies avoid catastrophically unsafe regions while still enabling trial-and-error learning. In this paper, we target the problem of safe exploration in RL by learning a conservative safety estimate of environment states through a critic, and provably upper bound the likelihood of catastrophic failures at every training iteration. We theoretically characterize the tradeoff between safety and policy improvement, show that the safety constraints are satisfied with high probability during training, derive provable convergence bounds for our approach, which is asymptotically no worse than standard RL, and demonstrate the efficacy of the proposed approach on a suite of challenging navigation, manipulation, and locomotion tasks. Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are at https://sites.google.com/view/conservative-safety-critics/

1 INTRODUCTION

Reinforcement learning (RL) is a powerful framework for learning-based control because it can enable agents to learn to make decisions automatically through trial and error. However, in the real world, the cost of those trials, and of those errors, can be quite high: a quadruped learning to run as fast as possible might fall down and crash, and then be unable to attempt further trials due to extensive physical damage. At the same time, learning complex skills without any failures at all is likely impossible. Even humans and animals regularly experience failure, but quickly learn from their mistakes and behave cautiously in risky situations. In this paper, our goal is to develop safe exploration methods for RL that similarly exhibit conservative behavior, erring on the side of caution in particularly dangerous settings and limiting the number of catastrophic failures.

A number of previous approaches have tackled this problem of safe exploration, often by formulating the problem as a constrained Markov decision process (CMDP) (García & Fernández, 2015; Altman, 1999). However, most of these approaches either require additional assumptions, such as access to a function that can be queried to check whether a state is safe (Thananjeyan et al., 2020), access to a default safe controller (Koller et al., 2018; Berkenkamp et al., 2017), or knowledge of all the unsafe states (Fisac et al., 2019), or they only obtain safe policies after training converges, while being unsafe during the training process (Tessler et al., 2018; Dalal et al., 2018). In this paper, we propose a general safe RL algorithm with bounds on the probability of failures during training. Our method only assumes access to a sparse (e.g., binary) indicator of catastrophic failure, in the standard RL setting.
We train a conservative safety critic that overestimates the probability of catastrophic failure, building on tools from the recently proposed conservative Q-learning (CQL) framework (Kumar et al., 2020) for offline RL. In order to bound the likelihood of catastrophic failures at every iteration, we impose a KL-divergence constraint on successive policy updates so that the stationary distributions of states induced by the old and the new policies are not arbitrarily different. Based on the safety critic's value, we consider a chance constraint denoting the probability of failure, and optimize the policy through primal-dual gradient descent. Our key contributions in this paper are designing an algorithm that we refer to as Conservative Safety Critics (CSC), which learns a conservative estimate of how safe a state is, using this conservative estimate for safe exploration and policy updates, and theoretically providing upper bounds on the probability of failures throughout training. Through empirical evaluation in five separate simulated robotic control domains spanning manipulation, navigation, and locomotion, we show that CSC is able to learn effective policies while reducing the rate of catastrophic failures by up to 50% over prior safe exploration methods.

2 PRELIMINARIES

We describe the problem setting of a constrained MDP (Altman, 1999) specific to our approach and the conservative Q-learning (Kumar et al., 2020) framework that we build on in our algorithm.

Constrained MDPs. We take a constrained RL view of safety (García & Fernández, 2015; Achiam et al., 2017), and define safe exploration as the process of ensuring that the constraints of the constrained MDP (CMDP) are satisfied while exploring the environment to collect data samples. A CMDP is a tuple (S, A, P, R, γ, µ, C), where S is the state space, A is the action space, P : S × A × S → [0, 1] is a transition kernel, R : S × A → ℝ is a task reward function, γ ∈ (0, 1) is a discount factor, µ is a starting state distribution, and C = {(c_i : S → {0, 1}, χ_i ∈ ℝ) | i ∈ ℤ} is a set of (safety) constraints that the agent must satisfy, with constraint functions c_i taking values either 0 (alive) or 1 (failure) and limits χ_i defining the maximal allowable amount of non-satisfaction, in terms of expected probability of failure. A stochastic policy π : S → P(A) is a mapping from states to action distributions, and the set of all stationary policies is denoted by Π. Without loss of generality, we consider a single constraint, where C denotes the constraint satisfaction function C : S → {0, 1} (C ≡ 1{failure}), analogous to the task reward function, with an upper limit χ. Note that since we assume only a sparse binary indicator of failure C(s) from the environment, in purely online training the agent must fail a few times during training, and hence zero failures are impossible. However, we will discuss how the number of failures can be kept to a small rate for constraint satisfaction. We define the discounted state distribution of a policy π as d^π(s) = (1 − γ) Σ_{t=0}^∞ γ^t P(s_t = s | π), the state value function as V_R^π(s) = E_{τ∼π}[R(τ) | s_0 = s], the state-action value function as Q_R^π(s, a) = E_{τ∼π}[R(τ) | s_0 = s, a_0 = a], and the advantage function as A_R^π(s, a) = Q_R^π(s, a) − V_R^π(s). We define similar quantities for the constraint function: V_C, Q_C, and A_C.
So, we have V_R^π(µ) = E_{τ∼π}[Σ_t R(s_t, a_t)], and V_C^π(µ) denotes the average episodic failures, which can also be interpreted as the expected probability of failure, since V_C^π(µ) = E_{τ∼π}[Σ_t C(s_t)] = E_{τ∼π}[1{failure}] = P(failure | µ). For a policy parameterized as π_φ, we denote d^π(s) by ρ_φ(s). Note that although C : S → {0, 1} takes on binary values in our setting, the function V_C^π(µ) is a continuous function of the policy π.

Conservative Q-Learning. CQL (Kumar et al., 2020) is a method for offline/batch RL (Lange et al., 2012; Levine et al., 2020) that aims to learn a Q-function such that the expected value of a policy under the learned Q-function lower bounds its true value, preventing over-estimation due to out-of-distribution actions. In addition to training Q-functions via the standard Bellman error, CQL minimizes the expected Q-values under a particular distribution of actions, µ(a|s), and maximizes the expected Q-value under the on-policy distribution, π(a|s). CQL in and of itself does not address safe exploration, but in Section 3 we show how the theoretical tool introduced in CQL can be used to devise a safe RL algorithm.

3 THE CONSERVATIVE SAFE-EXPLORATION FRAMEWORK

In this section we describe our safe exploration framework. The safety constraint C(s) defined in Section 2 is an indicator of catastrophic failure: C(s) = 1 when a state s is unsafe and C(s) = 0 when it is not, and we ideally desire C(s) = 0 for all states s ∈ S that the agent visits. Since we do not make any assumptions about the problem structure of RL (for example, a known dynamics model), we cannot guarantee this, but can at best reduce the probability of failure in every episode. So, we formulate the constraint as V_C^π(µ) = E_{τ∼π}[Σ_t C(s_t)] ≤ χ, where χ ∈ [0, 1) denotes the allowed probability of failure. Our approach is motivated by the insight that by being conservative with respect to how safe a state is, and hence by over-estimating this probability of failure, we can effectively ensure constrained exploration. Figure 1 provides an overview of the approach.

Figure 1: CSC (Algorithm 1). env.step(a) steps the simulator to the next state s′ and provides R(s, a) and C(s′) to the agent. If C(s′) = 1 (failure), the episode terminates. Q_C is the learned safety critic.

The key idea of our algorithm is to train a conservative safety critic, denoted Q_C(s, a), that overestimates how unsafe a particular state is, and to modify the exploration strategy to appropriately account for this safety under-estimate (by overestimating the probability of failure). During policy evaluation in the environment, we use the safety critic Q_C(s, a) to reduce the chance of catastrophic failures by checking whether taking action a in state s has Q_C(s, a) less than a threshold ϵ. If not, we re-sample a from the current policy π(a|s). We now discuss our algorithm more formally. We start by describing the procedure for learning the safety critic Q_C, then discuss how we incorporate it in the policy gradient updates, and finally discuss how we perform safe exploration (García & Fernández, 2015) during policy execution in the environment.

Overall objective. Our objective is to learn an optimal policy π* that maximizes task rewards while respecting the constraint on the expected probability of failures:

\pi^* = \arg\max_{\pi \in \Pi_C} V_R^{\pi}(\mu) \quad \text{where} \quad \Pi_C = \{\pi \in \Pi : V_C^{\pi}(\mu) \le \chi\} \qquad (1)
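To make the interaction protocol of Figure 1 concrete, here is a minimal sketch of the assumed data-collection loop: the environment returns a task reward and a binary failure indicator at each step, the episode terminates on failure, and transitions are stored in a replay buffer for training the critics. This is an illustration only; the names (`env`, `policy`, `select_safe_action`, `replay_buffer`) are hypothetical stand-ins, and a sketch of the safe action selection itself appears after Algorithm 1.

```python
# Minimal sketch of the data-collection loop assumed by Figure 1 and Algorithm 1.
# `env`, `policy`, `safety_critic`, `select_safe_action`, and `replay_buffer`
# are hypothetical stand-ins, not the authors' implementation.

def collect_episode(env, policy, safety_critic, replay_buffer, eps, max_steps=500):
    s = env.reset()
    failed = 0
    for _ in range(max_steps):
        # Pick an action the safety critic deems safe (Algorithm 1, line 6).
        a = select_safe_action(policy, safety_critic, s, eps)
        s_next, r, c = env.step(a)      # c = C(s') is the binary failure indicator
        replay_buffer.append((s, a, s_next, r, c))
        if c == 1:                      # catastrophic failure terminates the episode
            failed = 1
            break
        s = s_next
    return failed                       # per-episode failure, averaged to estimate V_C
```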
Learning the safety critic. The safety critic Q_C is used to obtain an estimate of how unsafe a particular state is, by providing an estimate of the probability of failure that will be used to guide exploration. We desire the estimates to be conservative, in the sense that the estimated probability of failure should be an over-estimate of the actual probability, so that the agent can err on the side of caution while exploring. To train such a critic Q_C, we incorporate tools from CQL and estimate Q_C through updates similar to those obtained by reversing the sign of α in Equation 2 of CQL(H) (Kumar et al., 2020). This gives us an upper bound on Q_C instead of the lower bound ensured by CQL. We denote the over-estimated advantage corresponding to this safety critic as Â_C. Formally, the safety critic is trained via the following objective, where the objective inside the arg min is called CQL(ζ), ζ parameterizes Q_C, and k denotes the k-th update iteration:

\hat{Q}_C^{k+1} \leftarrow \arg\min_{Q_C} \; \alpha\Big(-\mathbb{E}_{s \sim \mathcal{D}_{\mathrm{env}},\, a \sim \pi_\phi(a|s)}\big[Q_C(s,a)\big] + \mathbb{E}_{(s,a) \sim \mathcal{D}_{\mathrm{env}}}\big[Q_C(s,a)\big]\Big) + \frac{1}{2}\,\mathbb{E}_{(s,a,s',c) \sim \mathcal{D}_{\mathrm{env}}}\Big[\big(Q_C(s,a) - \hat{\mathcal{B}}^{\pi_\phi}\hat{Q}_C^{k}(s,a)\big)^2\Big] \qquad (2)

Here, \hat{\mathcal{B}}^{\pi_\phi} is the empirical Bellman operator discussed in Section 3.1 and Equation 2 of Kumar et al. (2020), and α is a weight that varies the importance of the first term in Equation 2 and thereby controls the magnitude of value over-estimation. For states sampled from the replay buffer D_env, the first term seeks to maximize the expectation of Q_C over actions sampled from the current policy, while the second term seeks to minimize the expectation of Q_C over actions sampled from the replay buffer. D_env can include off-policy data, and also offline data (if available). We interleave the gradient descent updates for training Q_C with gradient ascent updates for the policy π_φ and gradient descent updates for the Lagrange multiplier λ, which we describe next.

Policy learning. Since we want to learn policies that obey the constraint we set in terms of the safety critic, we can solve the objective in Equation 1 via:

\max_{\pi_\phi} \; \mathbb{E}_{s \sim \rho_\phi,\, a \sim \pi_\phi}\big[A_R^{\pi_\phi}(s,a)\big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_\phi,\, a \sim \pi_\phi}\big[Q_C(s,a)\big] \le \chi \qquad (3)

We can construct a Lagrangian and solve the policy optimization problem through primal-dual gradient descent:

\max_{\pi_\phi} \min_{\lambda \ge 0} \; \mathbb{E}_{s \sim \rho_\phi,\, a \sim \pi_\phi}\big[A_R^{\pi_\phi}(s,a) - \lambda\,(Q_C(s,a) - \chi)\big]

We can apply vanilla policy gradients or an actor-critic style Q-function approximator for this optimization. Here, Q_C is the safety critic trained through CQL as described in Equation 2. We defer specific implementation details for policy learning to the final paragraph of this section.
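To illustrate the update in Equation 2, the following PyTorch-style sketch performs a single gradient step on the conservative safety critic. It is a minimal sketch under stated assumptions (the network and policy interfaces, treating failure states as terminal in the Bellman backup, and clamping the target to [0, 1], in line with the capping of Q_C mentioned in Appendix A.1); it is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def safety_critic_update(q_c, q_c_target, policy, batch, alpha, gamma, optimizer):
    """One gradient step on the conservative safety-critic objective (cf. Equation 2).

    `q_c(s, a)` returns the estimated failure probability, `policy.sample(s)` draws
    actions from the current policy pi_phi, and `batch` holds replay tuples
    (s, a, s_next, c). These interfaces are illustrative assumptions.
    """
    s, a, s_next, c = batch

    # Empirical Bellman backup for the constraint value: c + gamma * Q_C(s', a'),
    # treating failure states (c = 1) as terminal; estimates are kept in [0, 1].
    with torch.no_grad():
        a_next = policy.sample(s_next)
        target = c + gamma * (1.0 - c) * q_c_target(s_next, a_next)
        target = target.clamp(0.0, 1.0)

    # CQL-style regularizer with the sign of alpha flipped relative to CQL(H):
    # push Q_C up on actions from the current policy and down on dataset actions,
    # which yields an over-estimate of the failure probability.
    a_pi = policy.sample(s)
    conservative_term = -q_c(s, a_pi).mean() + q_c(s, a).mean()

    bellman_term = 0.5 * F.mse_loss(q_c(s, a), target)
    loss = alpha * conservative_term + bellman_term

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In Algorithm 1, this critic update (line 13) is interleaved with the policy and dual-variable updates on data drawn from D_env.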
Algorithm 1 CSC: safe exploration with conservative safety critics
1: Initialize V_θ^r (task value function), Q_ζ^s (safety critic), policy π_φ, λ, D_env, thresholds ϵ, δ, χ.
2: Set V̂_C^{π_φ_old}(µ) ← χ. (V̂_C^{π_φ_old}(µ) denotes the average failures in the previous epoch.)
3: for epochs until convergence do    ▷ Execute actions in the environment; collect on-policy samples
4:   for episode e in {1, ..., M} do
5:     Set ϵ ← (1 − γ)(χ − V̂_C^{π_φ_old}(µ))
6:     Sample a ∼ π_φ_old(·|s). Execute a iff Q_C(s, a) ≤ ϵ; else, resample a.
7:     Obtain next state s′, r = R(s, a), c = C(s′).
8:     D_env ← D_env ∪ {(s, a, s′, r, c)}    ▷ If available, D_env can be seeded with off-policy/offline data
9:   end for
10:   Store the average episodic failures V̂_C^{π_φ_old}(µ) ← (1/M) Σ_{e=1}^{M} V̂_C^e
11:   for step t in {1, ..., N} do    ▷ Policy and Q-function updates using D_env
12:     Gradient ascent step on φ, and (optionally) add entropy regularization (Appendix A.2)
13:     Gradient updates for the Q-function: ζ ← ζ − η_Q ∇_ζ CQL(ζ)
14:     Gradient descent step on the Lagrange multiplier λ (Appendix A.2)
15:   end for
16:   φ_old ← φ
17: end for

Executing rollouts (i.e., safe exploration). Since we are interested in minimizing the number of constraint violations while exploring the environment, we do not simply execute the learned policy iterate in the environment for active data collection. Rather, we query the safety critic Q_C to obtain an estimate of how unsafe an action is, and choose a safe action via rejection sampling. Formally, we sample an action a ∼ π_φ_old(·|s) and check whether Q_C(s, a) ≤ ϵ. We keep re-sampling actions a ∼ π_φ_old(·|s) until this condition is met, and once it is, we execute that action in the environment. In practice, we execute this loop for 100 iterations and choose, among all sampled actions a in state s for which Q_C(s, a) ≤ ϵ, the one with the minimum value of Q_C(s, a). If no sampled action satisfies Q_C(s, a) ≤ ϵ, we choose the a for which Q_C(s, a) is minimum (although above the threshold). Here, ϵ is a threshold that varies across iterations and is defined as ϵ = (1 − γ)(χ − V̂_C^{π_φ_old}(µ)), where V̂_C^{π_φ_old}(µ) is the average episodic failures in the previous epoch, a sample estimate of the true V_C^{π_φ_old}(µ). This value of ϵ is chosen so that Lemma 1 holds. In the replay buffer D_env, we store tuples of the form (s, a, s′, r, c), where s is the previous state, a is the action executed, s′ is the next state, r is the task reward from the environment, and c = C(s′) is the constraint value. In our setting, c is binary, with 0 denoting a live agent and 1 denoting failure.

Overall algorithm. Our overall algorithm, shown in Algorithm 1, executes policy rollouts in the environment while respecting the constraint Q_C(s, a) ≤ ϵ, stores the observed data tuples in the replay buffer D_env, and uses the collected tuples to train the safety critic Q_C using Equation 2 and to update the policy and the dual variable λ following the optimization objective in Equation 6.
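The rejection-sampling action selection just described (Algorithm 1, lines 5 and 6) can be sketched as follows. The interfaces `policy.sample` and `q_c` are assumed, the candidate budget of 100 mirrors the implementation detail mentioned above, and the fallback to the least unsafe candidate matches the description under "Executing rollouts"; this is an illustration, not the authors' code.

```python
import torch

def safe_epsilon(chi, gamma, avg_failures_prev_epoch):
    # Threshold from Algorithm 1, line 5: eps = (1 - gamma) * (chi - V_hat_C).
    return (1.0 - gamma) * (chi - avg_failures_prev_epoch)

def select_safe_action(policy, q_c, s, eps, n_candidates=100):
    """Rejection-sampling action selection from the 'Executing rollouts' paragraph.

    Draws candidate actions from the current policy, keeps those with
    Q_C(s, a) <= eps, and returns the candidate with the smallest estimated
    failure probability; if no candidate clears the threshold, the least
    unsafe candidate is returned instead.
    """
    candidates = policy.sample(s, n_candidates)            # (n_candidates, action_dim)
    s_batch = s.unsqueeze(0).expand(n_candidates, -1)      # repeat the state per candidate
    with torch.no_grad():
        risk = q_c(s_batch, candidates).squeeze(-1)        # estimated failure probabilities
    safe = risk <= eps
    if safe.any():
        # Among candidates deemed safe, pick the one with the lowest risk.
        masked = torch.where(safe, risk, torch.full_like(risk, float("inf")))
        idx = torch.argmin(masked)
    else:
        idx = torch.argmin(risk)                           # fall back to the least unsafe action
    return candidates[idx]
```

The selected action is then executed and the resulting transition (s, a, s′, r, c) is logged into D_env, as in the data-collection sketch shown earlier.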
Implementation details. Here, we discuss the specifics of the implementation for policy optimization. We consider the surrogate policy improvement problem (Sutton, 2020):

\max_{\pi_\phi} \; \mathbb{E}_{s \sim \rho_{\phi_{old}},\, a \sim \pi_\phi}\big[A_R^{\pi_{\phi_{old}}}(s,a)\big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\phi_{old}}}\big[D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s))\big] \le \delta \;\;\text{and}\;\; V_C^{\pi_\phi}(\mu) \le \chi \qquad (4)

Here, we have introduced a D_KL constraint to ensure that successive policies are close, which helps us bound the expected failures of the new policy in terms of the expected failures of the old policy in Section 4. We replace the D_KL(π_φ_old(·|s) || π_φ(·|s)) term by its second-order Taylor expansion (expressed in terms of the Fisher information matrix F) and enforce the resulting constraint exactly (Schulman et al., 2015a). Following Equation 22 (Appendix A.2), we have:

\max_{\pi_\phi} \; \mathbb{E}_{s \sim \rho_{\phi_{old}},\, a \sim \pi_\phi}\big[A_R^{\pi_{\phi_{old}}}(s,a)\big] \quad \text{s.t.} \quad V_C^{\pi_{\phi_{old}}}(\mu) + \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_{\phi_{old}},\, a \sim \pi_\phi}\big[A_C(s,a)\big] \le \chi, \quad \mathbb{E}_{s \sim \rho_{\phi_{old}}}\big[D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s))\big] \le \delta \qquad (5)

We replace the true A_C by the learned over-estimate Â_C, and consider the Lagrangian dual of this constrained problem, which we can solve by alternating gradient descent:

\max_{\pi_\phi} \min_{\lambda \ge 0} \; \mathbb{E}_{s \sim \rho_{\phi_{old}},\, a \sim \pi_\phi}\big[A_R^{\pi_{\phi_{old}}}(s,a)\big] - \lambda\Big(V_C^{\pi_{\phi_{old}}}(\mu) + \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_{\phi_{old}},\, a \sim \pi_\phi}\big[\hat{A}_C(s,a)\big] - \chi\Big) \quad \text{s.t.} \quad \frac{1}{2}(\phi - \phi_{old})^T F\,(\phi - \phi_{old}) \le \delta \qquad (6)

Note that although we use the FIM for the updates, we can also apply vanilla policy gradients or an actor-critic style Q-function approximator to optimize Equation 6. Detailed derivations of the gradient updates are in Appendix A.2.

4 THEORETICAL ANALYSIS

In this section, we theoretically analyze our approach, showing that the expected probability of failure is bounded after each policy update throughout the learning process, while ensuring that the convergence rate to the optimal solution is only mildly bottlenecked by the additional safety constraint. Our main result, stated in Theorem 1, bounds the expected probability of failure of the policy that results from Equation 5. To prove it, we first state a Lemma showing that the constraints in Equation 5 are satisfied with high probability during the policy updates. Detailed proofs of all Lemmas and Theorems are in Appendix A.1.

Notation. Let ϵ_C = max_s |E_{a∼π_φ_new}[A_C(s, a)]|, and let ∆ be the amount of overestimation in the expected advantage value generated by the safety critic, E_{s∼ρ_φ_old, a∼π_φ_old}[Â_C(s, a)], as per Equation 2, such that ∆ = E_{s∼ρ_φ_old, a∼π_φ_old}[Â_C(s, a) − A_C(s, a)]. Let ζ denote the sampling error in the estimation of V_C^{π_φ_old}(µ) by its sample estimate V̂_C^{π_φ_old}(µ) (i.e., ζ = |V̂_C^{π_φ_old}(µ) − V_C^{π_φ_old}(µ)|), and let N be the number of samples used in the estimation of V_C. Let Reg_C(T) be the total cumulative failures incurred by running Algorithm 1 until T samples have been collected from the environment. We first show that when using Algorithm 1, we can upper bound the expected probability of failure for each policy iterate π_φ_old.

Lemma 1. If we follow Algorithm 1, then during policy updates via Equation 5 the following is satisfied with high probability 1 − ω:

V_C^{\pi_{\phi_{old}}}(\mu) + \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_{\phi_{old}},\, a \sim \pi_\phi}\big[A_C(s,a)\big] \;\le\; \chi + \zeta - \frac{\Delta}{1-\gamma}

Here, ζ captures the sampling error in the estimation of V_C^{π_φ_old}(µ), and we have ζ ≤ C√(log(1/ω)/|N|), where C is a constant independent of ω obtained from union bounds and concentration inequalities (Kumar et al., 2020) and N is the number of samples used in the estimation of V_C. This lemma intuitively implies that the constraint on the safety critic in Equation 5 is satisfied with high probability, when we note that the right-hand side can be made small as N becomes large.

Lemma 1 gives a bound in terms of V_C^{π_φ_old}(µ) for the old policy π_φ_old, but not for the updated policy π_φ_new. We now show that the expected probability of failure V_C^{π_φ_new}(µ) of the policy π_φ_new resulting from solving Equation 5 is also bounded with high probability.

Theorem 1. Consider policy updates that solve the constrained optimization problem defined in Equation 5. With high probability 1 − ω, we have the following upper bound on the expected probability of failure V_C^{π_φ_new}(µ) of π_φ_new at every policy update iteration:

V_C^{\pi_{\phi_{new}}}(\mu) \;\le\; \chi + \zeta - \frac{\Delta}{1-\gamma} + \frac{\sqrt{2\delta}\,\gamma\,\epsilon_C}{(1-\gamma)^2}, \qquad \text{where } \zeta \le C\sqrt{\frac{\log(1/\omega)}{|N|}}

So far we have shown that, with high probability, we can satisfy the constraint in the objective during policy updates (Lemma 1) and obtain an upper bound on the expected probability of failure of the updated policy π_φ_new (Theorem 1). The key insight from Theorem 1 is that if we execute policy π_φ_new in the environment, the probability of failing is upper-bounded by a small quantity that depends on the specified safety threshold χ.
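For intuition (the full argument is in Appendix A.1), the threshold ϵ = (1 − γ)(χ − V̂_C^{π_φ_old}(µ)) used in line 5 of Algorithm 1 is chosen precisely so that the constraint used in Lemma 1 holds: assuming the critic's value estimates are nonnegative (they model a probability of failure and are capped at 1 in Appendix A.1), the over-estimated advantage satisfies Â_C(s, a) ≤ Q_C(s, a), so enforcing Q_C(s, a) ≤ ϵ on every executed action gives

Q_C(s,a) \le (1-\gamma)\big(\chi - \hat{V}_C^{\pi_{\phi_{old}}}(\mu)\big) \;\;\forall (s,a)
\;\Rightarrow\; \hat{A}_C(s,a) \le (1-\gamma)\big(\chi - \hat{V}_C^{\pi_{\phi_{old}}}(\mu)\big) \;\;\forall (s,a)
\;\Rightarrow\; \hat{V}_C^{\pi_{\phi_{old}}}(\mu) + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_\phi}\big[\hat{A}_C(s,a)\big] \le \chi,

which is the sample-estimate version of the constraint appearing in Equation 5 and Lemma 1.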
Since the probability of failure at each iterate is bounded, executing π_φ_new for multiple episodes keeps the total number of failures bounded as well. We now bound the task performance in terms of policy return, and show that incorporating and satisfying safety constraints during learning does not severely affect the convergence rate to the optimal solution for task performance. Theorem 2 builds on the assumptions in Agarwal et al. (2019) and extends that analysis to our constrained policy updates in Equation 5.

Theorem 2 (Convergence rate for policy gradient updates with the safety constraint). If we run the policy gradient updates through Equation 5 for policy π_φ, with µ as the starting state distribution, φ^{(0)} = 0, learning rate η > 0, and α chosen as in the discussion of Theorem 1, then for all policy update iterations T > 0 we have, with probability 1 − ω,

V_R^{*}(\mu) - V_R^{(T)}(\mu) \;\le\; \frac{\log|\mathcal{A}|}{\eta T} + \frac{1}{(1-\gamma)^2 T} + \frac{K \sum_{t=0}^{T-1}\lambda^{(t)}}{\eta T},

where K ≤ (1 − χ) + 4(…); the full expression for K is given in Appendix A.1.

Since the value of the dual variable λ strictly decreases during the gradient descent updates (Algorithm 1), Σ_{t=0}^{T−1} λ^{(t)} is upper bounded. So the additional term proportional to K introduced in the convergence rate (compared to Agarwal et al. (2019)) due to the safety constraint is upper bounded, and can be made small with high probability by choosing α appropriately. In addition, we note that the safety threshold χ trades off the convergence rate by modifying the magnitude of K (a lower χ means a stricter safety threshold and a higher value of K, implying a larger right-hand side and slower convergence). We discuss some practical considerations of the theoretical results in Appendix A.4.

So far we have demonstrated that the policy iterates resulting from our algorithm all satisfy the desired safety constraint of the CMDP, which allows a maximum safety violation of χ for every intermediate policy. While this result ensures that the probability of failure is bounded, it does not characterize the total number of failures incurred by the algorithm. In our next result, we show that the cumulative failures grow sublinearly in the number of samples T collected by Algorithm 1, provided the safety threshold χ is set appropriately.

Theorem 3 (The number of cumulative safety failures grows sublinearly). Let χ in Algorithm 1 be time-dependent, with χ_t = O(1/√t). Then the total number of cumulative safety violations after T transition samples have been collected by Algorithm 1, Reg_C(T), scales sublinearly with T, i.e., Reg_C(T) = O(√T).

A proof is provided in Appendix A.1. Theorem 3 is in many ways similar to a typical regret bound for exploration (Russo, 2019; Jaksch et al., 2010), though it measures the total number of safety violations. This means that training with Algorithm 1 converges to a safe policy that incurs no failures at a quick O(1/√T) rate.

5 EXPERIMENTS

Through experiments on continuous control environments of varying complexity, we aim to evaluate the agreement between empirical performance and theoretical guidance by answering the following questions: (1) How safe is CSC in terms of constraint satisfaction during training? (2) How does learning safe policies trade off with task performance during training?

5.1 EXPERIMENTAL SETUP

Environments. In each environment, shown in Figure 2, we define a task objective that the agent must achieve and a criterion for catastrophic failure. The goal is to solve the task without dying.
In Point agent/Car navigation avoiding traps, the agent must navigate a maze while avoiding traps. The agent has a health counter that decreases for every timestep it spends within a trap; when the counter hits 0, the agent gets trapped and dies. In Panda push without toppling, a 7-DoF Franka Emika Panda arm must push a vertically placed block across the table to a goal location without the block toppling over. Failure is defined as the block toppling. In Panda push within boundary, the Panda arm must push a block across the table to a goal location without the block going outside a rectangular constraint region. Failure occurs when the block's center of mass (its (x, y) position) moves outside the constraint region. In Laikago walk without falling, an 18-DoF Laikago quadruped robot must walk without falling. The agent is rewarded for walking as fast as possible (or trotting), and failure occurs when the robot falls. Since quadruped walking is an extremely challenging task, for all the baselines we initialize the agent's policy with a controller that has been trained to keep the agent standing while not in motion.

Figure 2: Illustrations of the five environments in our experiments: (a) 2D Point agent navigation avoiding traps. (b) Car navigation avoiding traps. (c) Panda push without toppling. (d) Panda push within boundary. (e) Laikago walk without falling.

Baselines and comparisons. We compare CSC to three prior methods: constrained policy optimization (CPO) (Achiam et al., 2017); a standard unconstrained RL method (Schulman et al., 2015a), which we call Base (a comparison with SAC (Haarnoja et al., 2018) is in Appendix Figure 7); and an algorithm similar to Base, called Base Shaped, that modifies the reward R(s, a) to R(s, a) − P·C(s), where P = 10 and C(s) is 1 when a failure occurs and 0 otherwise. We also consider a method that extends Leave No Trace (Eysenbach et al., 2017) to our setting, which we refer to as Q-ensembles. This last comparison is the most similar to our approach, in that it also implements a safety critic (adapted from LNT's backward critic), but instead of using our conservative updates, the safety critic uses an ensemble for epistemic uncertainty estimation, as proposed by Eysenbach et al. (2017). There are other safe RL approaches we cannot compare against, as they make multiple additional assumptions, such as the availability of a function that can be queried to determine whether a state is safe (Thananjeyan et al., 2020), availability of a default safe policy for the task (Koller et al., 2018; Berkenkamp et al., 2017), and prior knowledge of the location of unsafe states (Fisac et al., 2019). In addition to the baselines (Figure 3), we analyze variants of our algorithm with different safety thresholds through ablation studies (Figure 4). We also analyze CSC and the baselines when seeded with a small amount of offline data in Appendix A.10.

5.2 EMPIRICAL RESULTS

Comparable or better performance with significantly lower failures during training. In Figure 3, we observe that CSC has significantly lower average failures per episode, and hence lower cumulative failures, during the entire training process. Although the failures are significantly lower for our method, task performance and the convergence of average task rewards are comparable to or better than all prior methods, including the Base method, which corresponds to an unconstrained RL algorithm.
While the CPO and Q-ensembles baselines also achieve near 0 average failures eventually, we see that CSC achieves this very early on during training. CSC trades off performance with safety constraint satisfaction, based on the safety-threshold χ. In Figure 4, we plot variants of our method with different safety constraint thresholds χ. Observe that: (a) when the threshold is set to a lower value (stricter constraint), the number of avg. failures per episode decreases in all the environments, and (b) the convergence rate of the task reward is lower when the safety threshold is stricter. These observations empirically complement our theoretical guarantees in Theorems 1 and 2. We note that there are quite a few failures even in the case where χ = 0.0, which is to be expected in practice because in the initial stages of training there is high function approximation error in the learned critic QC. However, we observe that the average episodic failures quickly drop below the specified threshold after about 500 episodes of training. Published as a conference paper at ICLR 2021 Figure 3: Top row: Average task rewards (higher is better). Bottom row: Average catastrophic failures (lower is better). x-axis: Number of episodes (each episode has 500 steps). Results on four of the five environments we consider for our experiments. For each environment, we plot the average task reward, the average episodic failures, and the cumulative episodic failures. The safety threshold is χ = 0.03 for all the baselines in all the environments. Results are over four random seeds. Detailed results including plots of cumulative failures are in Fig. 6 of the Appendix. 6 RELATED WORK We discuss prior safe RL and safe control methods under three subheadings Assuming prior domain knowledge of the problem structure. Prior works have attempted to solve safe exploration in the presence of structural assumptions about the environment or safety structures. For example, Koller et al. (2018); Berkenkamp et al. (2017) assume access to a safe set of environment states, and a default safe policy, while in Fisac et al. (2018); Dean et al. (2019), knowledge of system dynamics is assumed and (Fisac et al., 2019) assume access to a distance metric on the state space. SAVED (Thananjeyan et al., 2020) learns a kernel density estimate over unsafe states, and assumes access to a set of user demonstrations and a user specified function that can be queried to determine whether a state is safe or not. In contrast to these approaches, our method does not assume any prior knowledge from the user, or domain knowledge of the problem setting, except a binary signal from the environment indicating when a catastrophic failure has occurred. Assuming a continuous safety cost function. CPO (Achiam et al., 2017), and (Chow et al., 2019) assume a cost function can be queried from the environment at every time-step and the objective is to keep the cumulative costs within a certain limit. This assumption limits the generality of the method in scenarios where only minimal feedback, such as binary reward feedback is provided (additional details in section A.3). (Stooke et al., 2020) devise a general modification to the Lagrangian by incorporating two additional terms in the optimization of the dual variable. SAMBA (Cowen-Rivers et al., 2020) has a learned GP dynamics model and a continuous constraint cost function that encodes safety. The objective is to minimize task cost function while maintaining the CVARα of cumulative costs below a threshold. 
In the work of Dalal et al. (2018); Paternain et al. (2019b;a); Grbic & Risi (2020), only the optimal policy is learned to be safe, and there are no safety constraint satisfactions during training. In contrast to these approaches, we assume only a binary signal from the environment indicating when a catastrophic failure has occurred. Instead of minimizing expected costs, our constraint formulation directly seeks to constrain the expected probability of failure. Safety through recoverability. Prior works have attempted to devise resetting mechanisms to recover the policy to a base configuration from (near) a potentially unsafe state. LNT (Eysenbach et al., 2017) trains both a forward policy for solving a task, and a reset goal-conditioned policy that kicks in when the agent is in an unsafe state and learns an ensemble of critics, which is substantially more complex than our approach of a learned safety critic, which can give rise to a simple but provable safe exploration algorithm. Concurrently to us, SQRL (Srinivasan et al., 2020) developed an approach also using safety critics such that during the pre-training phase, the agent explores both safe and unsafe states in the environment for training the critic. Published as a conference paper at ICLR 2021 Figure 4: Top row: Average task rewards (higher is better). Bottom row: Average catastrophic failures (lower is better). x-axis: Number of episodes (each episode has 500 steps). Results on four of the five environments we consider for our experiments. For each environment we plot the average task reward, the average episodic failures, and the cumulative episodic failures. All the plots are for our method (CSC) with different safety thresholds χ, specified in the legend. From the plots it is evident that our method can naturally trade-off safety for task performance depending on how strict the safety threshold is set to. Results are over four random seeds. Detailed results including plots of cumulative failures are in Fig. 5 of the Appendix. In control theory, a number of prior works have focused on Hamilton-Jacobi-Isaacs (HJI) reachability analysis (Bansal et al., 2017) for providing safety constraint satisfactions and obtaining control inputs for dynamical systems (Herbert et al., 2019; Bajcsy et al., 2019; Leung et al., 2018). Our method does not require knowledge of the system dynamics or regularity conditions on the statespace, which are crucial for computing unsafe states using HJI reachability. 7 DISCUSSION, LIMITATIONS, AND CONCLUSION We introduced a safe exploration algorithm to learn a conservative safety critic that estimates the probability of failure for each candidate state-action tuple, and uses this to constrain policy evaluation and policy improvement. We provably demonstrated that the probability of failures is bounded throughout training and provided convergence results showing how ensuring safety does not severely bottleneck task performance. We empirically validated our theoretical results and showed that we achieve high task performance while incurring low accidents during training. While our theoretical results demonstrated that the probability of failures is bounded with a high probability, one limitation is that we still observe non-zero failures empirically even when the threshold χ is set to 0. 
This is primarily because of neural network function approximation error in the early stages of training the safety critic, which we cannot account for precisely in the theoretical results, and also due to the fact that we bound the probability of failures, which in practice means that the number of failures is also bounded, but non-zero. We also showed that if we set the constraint threshold in an appropriate time-varying manner, training with CSC incurs cumulative failures that scales at most sub-linearly with the number of transition samples in the environment. Although our approach bounds the probability of failure and is general in the sense that it does not assume access any user-specified constraint function, in situations where the task is difficult to solve, for example due to stability concerns of the agent, our approach will fail without additional assumptions. In such situations, some interesting future work directions would be to develop a curriculum of tasks to start with simple tasks where safety is easier to achieve, and gradually move towards more difficult tasks, such that the learned knowledge from previous tasks is not forgotten. ACKNOWLEDGEMENT We thank Vector Institute, Toronto and the Department of Computer Science, University of Toronto for compute support. We thank Glen Berseth and Kevin Xie for helpful initial discussions about the project, Alexandra Volokhova, Arthur Allshire, Mayank Mittal, Samarth Sinha, and Irene Zhang for feedback on the paper, and other members of the Uof T CS Robotics Group for insightful discussions during internal presentations and reading group sessions. Finally, we are grateful to the anonymous ICLR 2021 reviewers for their feedback in helping improve the paper. Published as a conference paper at ICLR 2021 Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. ar Xiv preprint ar Xiv:1705.10528, 2017. Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. ar Xiv preprint ar Xiv:1908.00261, 2019. Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999. Andrea Bajcsy, Somil Bansal, Eli Bronstein, Varun Tolani, and Claire J Tomlin. An efficient reachability-based framework for provably safe autonomous navigation in unknown environments. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 1758 1765. IEEE, 2019. Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J Tomlin. Hamilton-jacobi reachability: A brief overview and recent advances. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 2242 2253. IEEE, 2017. Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pp. 908 918, 2017. Yinlam Chow, Ofir Nachum, Aleksandra Faust, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. Lyapunov-based safe policy optimization for continuous control. ar Xiv preprint ar Xiv:1901.10031, 2019. Alexander I Cowen-Rivers, Daniel Palenicek, Vincent Moens, Mohammed Abdullah, Aivar Sootla, Jun Wang, and Haitham Ammar. Samba: Safe model-based & active reinforcement learning. ar Xiv preprint ar Xiv:2006.09436, 2020. Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. ar Xiv preprint ar Xiv:1801.08757, 2018. 
Sarah Dean, Stephen Tu, Nikolai Matni, and Benjamin Recht. Safely learning to control the constrained linear quadratic regulator. In 2019 American Control Conference (ACC), pp. 5582 5588. IEEE, 2019. Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. ar Xiv preprint ar Xiv:1711.06782, 2017. Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737 2752, 2018. Jaime F Fisac, Neil F Lugovoy, Vicenc Rubies-Royo, Shromona Ghosh, and Claire J Tomlin. Bridging hamilton-jacobi safety analysis and reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8550 8556. IEEE, 2019. Javier Garcıa and Fernando Fern andez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437 1480, 2015. Djordje Grbic and Sebastian Risi. Safe reinforcement learning through meta-learned instincts. ar Xiv preprint ar Xiv:2005.03233, 2020. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. ar Xiv preprint ar Xiv:1801.01290, 2018. Sylvia L Herbert, Somil Bansal, Shromona Ghosh, and Claire J Tomlin. Reachability-based safety guarantees using efficient initializations. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 4810 4816. IEEE, 2019. Published as a conference paper at ICLR 2021 Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010. Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pp. 267 274, 2002. Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), pp. 6059 6066. IEEE, 2018. Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. ar Xiv preprint ar Xiv:2006.04779, 2020. Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batch reinforcement learning. In Reinforcement learning, pp. 45 73. Springer, 2012. Karen Leung, Edward Schmerling, Mo Chen, John Talbot, J Christian Gerdes, and Marco Pavone. On infusing reachability-based safety assurance within probabilistic planning frameworks for human-robot vehicle interactions. In International Symposium on Experimental Robotics, pp. 561 574. Springer, 2018. Sergey Levine. Deep reinforcement learning course, 2018. URL http://rail.eecs. berkeley.edu/deeprlcourse-fa18/static/slides/. Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. ar Xiv preprint ar Xiv:2005.01643, 2020. Santiago Paternain, Miguel Calvo-Fullana, Luiz FO Chamon, and Alejandro Ribeiro. Safe policies for reinforcement learning via primal-dual methods. ar Xiv preprint ar Xiv:1911.09101, 2019a. Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, pp. 7555 7565, 2019b. Xue Bin Peng, Erwin Coumans, Tingnan Zhang, Tsang-Wei Lee, Jie Tan, and Sergey Levine. 
Learning agile robotic locomotion skills by imitating animals. ar Xiv preprint ar Xiv:2004.00784, 2020. Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. ar Xiv preprint ar Xiv:1910.01708, 2019. Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems, pp. 14433 14443, 2019. John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pp. 1889 1897, 2015a. John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. Highdimensional continuous control using generalized advantage estimation. ar Xiv preprint ar Xiv:1506.02438, 2015b. Krishnan Srinivasan, Benjamin Eysenbach, Sehoon Ha, Jie Tan, and Chelsea Finn. Learning to be safe: Deep rl with a safety critic. ar Xiv preprint ar Xiv:2010.14603, 2020. Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by pid lagrangian methods. ar Xiv preprint ar Xiv:2007.03964, 2020. Rich Sutton. Reinforcement learning book, 2020. URL http://incompleteideas.net/ book/first/ebook/node42.html. Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward constrained policy optimization. ar Xiv preprint ar Xiv:1805.11074, 2018. Brijen Thananjeyan, Ashwin Balakrishna, Ugo Rosolia, Felix Li, Rowan Mc Allister, Joseph E Gonzalez, Sergey Levine, Francesco Borrelli, and Ken Goldberg. Safety augmented value estimation from demonstrations (saved): Safe deep model-based rl for sparse cost robotic tasks. IEEE Robotics and Automation Letters, 5(2):3612 3619, 2020. Published as a conference paper at ICLR 2021 Yuke Zhu, Josiah Wong, Ajay Mandlekar, and Roberto Mart ın-Mart ın. robosuite: A modular simulation framework and benchmark for robot learning. In ar Xiv preprint ar Xiv:2009.12293, 2020. Published as a conference paper at ICLR 2021 A.1 PROOFS OF ALL THEOREMS AND LEMMAS Note. During policy updates via Equation 5, the DKL constraint is satisfied with high probability if we follow Algorithm 1. Following the steps in the Appendix A.2, we can write the gradient ascent step for φ as φ φold + βF 1 φold J(φold) β = βj s 2δ φold J(φold)T F φold J(φold) (8) F can be estimated with samples as F = Es ρφold h Ea πφold φold log πφold( φold log πφold)T i (9) Here βj is the backtracking coefficient and we perform backtracking line search with exponential decay. φold J(φold) is calculated as, φold J(φold) = Es ρφold,a πφold h φold log πφold(a|s) A πφold R i (10) After every update, we check if DKL(φ||φold) δ, and if not we decay βj = βj(1 βj)j, set j j + 1 and repeat for L steps until DKL δ is satisfied. If this is not satisfied after L steps, we backtrack, and do not update φ i.e. set φ φold. Lemma 1. If we follow Algorithm 1, during policy updates via equation 5, the following is satisfied with high probability 1 ω V πφold C (µ) + 1 1 γ Es ρφold,a πφ [AC(s, a)] χ + ζ 1 γ Here, ζ captures sampling error in the estimation of V πφold C (µ) and we have ζ C |N| , where C is a constant and N is the number of samples used in the estimation of VC. Proof. 
Based on line 6 of Algorithm 1, for every rollout {(s, a)}, the following holds: QC(s, a) (1 γ)(χ ˆV πφold C (µ))) (s, a) = ˆAC(s, a) (1 γ)(χ ˆV πφold C (µ))) (s, a) = ˆV πφold C (µ) + 1 1 γ ˆAC(s, a) χ (s, a) = ˆV πφold C (µ) + 1 1 γ Es ρφold,a πφ h ˆAC(s, a) i χ We note that we can only compute a sample estimate ˆV πφold C (µ) instead of the true quantity VC which can introduce sampling error in practice. In order to ensure that ˆV πφold C (µ) is not much lesser than V πφold C (µ), we can obtain a bound on their difference. Note that if ˆV πφold C (µ) V πφold C (µ), the Lemma holds directly, so we only need to consider the less than case. Let ˆV πφold C (µ) = V πφold C (µ) ζ. With high probability 1 ω, we can ensure ζ C log(1/ω) |N| , where C is a constant independent of ω (obtained from union bounds and concentration inequalities) and N is the number of samples used in the estimation of VC. In addition, our estimate of Es ρφold,a πφ h ˆAC(s, a) i is an overestimate of the true Es ρφold,a πφ [AC(s, a)], and we denote their difference by . So, with high probability 1 ω, we have ˆV πφold C (µ) + 1 1 γ Es ρφold,a πφ h ˆAC(s, a) i χ = V πφold C (µ) + 1 1 γ Es ρφold,a πφ [AC(s, a)] χ + ζ 1 γ Published as a conference paper at ICLR 2021 Theorem 1. Consider policy updates that solve the constrained optimization problem defined in equation 5. With high probability 1 ω, we have the following upper bound on expected probability of failure V πφnew C (µ) for πφnew during every policy update iteration V πφnew C (µ) χ + ζ 1 γ + 2δγϵC (1 γ)2 where ζ C p Here, ϵC = maxs |Ea πφnew AC(s, a)| and is the overestimation in Es ρφold ,a πφold[AC(s, a)] due to CQL. Proof. C(s) denotes the value of the constraint function from the environment in state s. This is analogous to the task reward function R(s, a). In our case C(s) is a binary indicator of whether a catastrophic failure has occurred, however the analysis we present holds even when C(s) is a shaped continuous cost function. C(s) = 1, 1{failure} = 1 0, otherwise Let V πφ R (µ) denotes the discounted task rewards obtained in expectation by executing policy πφ for one episode, and let V πφ C (µ) denote the corresponding constraint values. max πφ V πφ R (µ) s.t. V πφ C (µ) χ (14) From the TRPO (Schulman et al., 2015a) and CPO (Achiam et al., 2017) papers, following similar derivations, we obtain the following bounds V πφ R (µ) V πφold R (µ) 1 1 γ Es ρφold,a πφ A πφold R (s, a) 2γϵR 1 γ DT V (πφ||πφold)[s] (15) Here, Aπφ R is the advantage function corresponding to the task rewards and ϵR = maxs |Ea πφAπφ R (s, a)|. DT V is the total variation distance. We also have, V πφ C (µ) V πφold C (µ) 1 1 γ Es ρφold,a πφ A πφold C (s, a) + 2γϵC 1 γ DT V (πφ||πφold)[s] (16) Here, A πφold C is the advantage function corresponding to the costs and ϵC = maxs |Ea πφA πφold C (s, a)|. In our case, AC is defined in terms of the safety Q function QC(s, a), and CQL can bound its expectation directly. To see this, note that, by defini- tion Es ρφold,a πφ h A πφold C (s, a) i = Es ρφold,a πφ [Qζ(s, a)] Es ρφold,a πφold [Qζ(s, a)]. Here, the RHS is precisely the term in equation 2 of (Kumar et al., 2020) that is bounded by CQL. We get an overstimated advantage ˆAC(s, a) from training the safety critic QC through updates in equation 2. . Let denote the expected magnitude of over-estimate Es ρφold,a πφ h ˆAC(s, a) i = Es ρφold,a πφ [AC(s, a)] + , where is positive. Note that replacing AC, by its over-estimate ˆAC, the inequality in 16 above still holds. 
Using Pinsker s inequality, we can convert the bounds in terms of DKL instead of DT V , DT V (p||q) p DKL(p||q)/2 (17) By Jensen s inequality, E[ p DKL(p||q)/2] p E[DKL(p||q)]/2 (18) So, we can replace the E[DT V (p||q)] terms in the bounds by p E[DKL(p||q)]. Then, inequation 16 becomes, V πφ C (µ) V πφold C (µ) 1 1 γ Es ρφold,a πφ h A πφold C (s, a) i + 2γϵC Es ρφold,a πφ [DKL(πφ||πφold)[s]] Published as a conference paper at ICLR 2021 Re-visiting our objective in equation 5, max πφ Es ρφold,a πφ h A πφold R (s, a) i s.t. Es ρφold[DKL(πφold( |s)||πφ( |s))] δ s.t. V πφ C (µ) χ From inequation 19 we note that instead of of constraining V πφ C (µ) we can constrain an upper bound on this. Writing the constraint in terms of the current policy iterate πφold using equation 19, πφnew = max πφ Es ρφold,a πφ h A πφold R (s, a) i s.t. Es ρφold[DKL(πφold( |s)||πφ( |s))] δ s.t. V πφold C (µ) + 1 1 γ Es ρφold,a πφ h A πφold C (s, a) i + β q Es ρφold[DKL(πφold( |s)||πφ( |s))] χ As there is already a bound on DKL(πφold( |s)||πφ( |s))], getting rid of the redundant term, we define the following optimization problem, which we actually optimize for πφnew = max πφ Es ρφold,a πφ h A πφold R (s, a) i s.t. Es ρφold[DKL(πφold( |s)||πφ( |s))] δ s.t. V πφold C (µ) + 1 1 γ Es ρφold,a πφ h A πφold C (s, a) i χ Upper bound on expected probability of failures. If πφnew is updated using equation 5, then we have the following upper bound on V πφnew C (µ) V πφnew C (µ) V πφold C (µ) + 1 1 γ Es ρφold,a πφ h A πφold C i + 2γϵC (1 γ)2 Es ρφold,a πφ [DKL(πφ||πφold)[s]] If we ensure V πφold C (µ) + 1 1 γ Es ρφold,a πφ h A πφold C (s, a) i χ holds by following Algorithm 1,we have the following upper bound on V πφnew C (µ) V πφnew C (µ) χ + 2δγϵC (1 γ)2 (24) Here, ϵC = maxs |Ea πφnew A πφold C (s, a)|. Now, instead of AC(s, a), we have an over-estimated advantage estimate ˆAC(s, a) obtained by training the safety critic QC through CQL as in equation 2. Let denote the expected magnitude of over-estimate Es ρφold,a πφ h ˆAC(s, a) i = Es ρφold,a πφ [AC(s, a)] + , where is positive. From Lemma 1, we are able to ensure the following with high probability 1 ω V πφold C (µ) + 1 1 γ Es ρφold,a πφ [AC(s, a)] χ + ζ 1 γ By combining this with the upper bound on V πφnew C (µ) from inequality 23, we obtain with probability 1 ω V πφnew C (µ) χ + ζ 1 γ + 2δγϵC (1 γ)2 where ζ C p Since ϵC depends on the optimized policy πφnew, it can t be calculated exactly prior to the update. As we cap QC(s, a) to be 1, therefore, the best bound we can construct for ϵC is the trivial bound ϵC 2. Now, in order to have V πφnew C (µ) < χ, we require > 2 2δγ 1 γ + (1 γ)ζ. To guarantee this, replacing by the exact overestimation term from CQL, we have the following condition on Published as a conference paper at ICLR 2021 1 γ max s ρφold 1 |p Dφold | + 2 2δγ + (1 γ)2ζ πφold 1 1 (26) Here, Gc,T is a constant depending on the concentration properties of the safety constraint function C(s, a) and the state transition operator T(s |s, a) (Kumar et al., 2020). φold denotes the parameters of the policy π in the iteration before φold. Now, with probability 1 ω, we have ζ C log(1/ω) |N| . So, if α is chosen as follows 1 γ max s ρφold 1 |p Dφold | + 2 2δγ + (1 γ)2 C log(1/ω) |N| Gc,T (27) Then with probability 1 ω, we will have, V πφnew C (µ) χ (28) In the next theorem, we show that the convergence rate to the optimal solution is not severely affected due to the safety constraint satisfaction guarantee, and gets modified by addition of an extra bounded term. Theorem 2. 
If we run the policy gradient updates through equation 5, for policy πφ, with µ as the starting state distribution, with φ(0) = 0, and learning rate η > 0, then for all policy update iterations T > 0 we have, with probability 1 ω, V R(µ) V (T ) R (µ) log |A| ηT + 1 (1 γ)2T + (1 χ) + 1 2 (1 γ) + 2ζ PT 1 t=0 λ(t) Since the value of the dual variables λ strictly decreases during gradient descent updates (Algorithm 1), PT 1 t=0 λ(t) is upper-bounded. In addition, if we choose α as mentioned in the discussion of Theorem 1, we have > 2 2δγ 1 γ + ζ. Hence, with probability 1 ω, we can ensure that V R(µ) V (T ) R (µ) log |A| ηT + 1 (1 γ)2T + K PT 1 t=0 λ(t) ηT where K (1 χ) + 4 Proof. Let superscript (t) denote the tth policy update iteration. We follow the derivation in Lemma 5.2 of (Agarwal et al., 2019) but replace A(s, a) with our modified advantage estimator ˆA(t)(s, a) = A(t) R (s, a) λ(t)AC(s, a). The quantity log Zt(s) is defined in terms of A(t) R as log Zt(s) = log X a π(t)(a|s) exp (ηA(t)/(1 γ)) a π(t)(a|s) log exp ηA(t)(s, a)/(1 γ)) a π(t)(a|s)A(t)(s, a) We define an equivalent alternate quantity based on ˆA(t) log ˆZt(s) = log X a π(t)(a|s) exp (η ˆA(t)(s, a)/(1 γ)) a π(t)(a|s) exp (η(A(t) R (s, a) λ(t)AC(s, a))/(1 γ)) a π(t)(a|s) log exp (ηA(t) R (s, a)/(1 γ)) λ(t) X a π(t)(a|s) log exp (ηA(t) C (s, a)/(1 γ)) a π(t)(a|s)A(t) C (s, a) Published as a conference paper at ICLR 2021 For simplicity, consider softmax policy parameterization (equivalent results hold under the function approximation regime as shown in (Agarwal et al., 2019)), where we define the policy updates with the modified advantage function ˆA(t) to take the form: φ(t+1) = φ(t) + η 1 γ ˆA(t) and π(t+1)(a|s) = π(t)(a|s)exp(η ˆA(t)(s, a)/(1 γ)) Here, ˆZt(s) = P a A π(t)(a|s) exp(η ˆA(t)(s, a)/(1 γ)). Note that our actual policy updates (with backtracking line search) are almost equivalent to this when η is small. For the sake of notational convenience, we will denote log ˆZt(s)+ λ(t)η a π(t)(a|s)A(t) C (s, a) as Gt(s). We have Gt(s) 0 from equation 30. We consider the performance improvement lemma (Kakade & Langford, 2002) with respect to the task advantage function A(t) R (s, a) and express it in terms of the modified advantage function ˆA(t)(s, a) = A(t) R (s, a) λ(t)AC(s, a). Let µ be the starting state distribution of the MDP, and d(t) denote the stationary distribution of states induced by policy π in the tth iteration. V (t+1) R (µ) V (t) R (µ) = 1 1 γ Es d(t+1) X a π(t+1)(a|s)A(t) R (s, a) = 1 1 γ Es d(t+1) X a π(t+1)(a|s)( ˆA(t)(s, a) + λ(t)A(t) C (s, a)) η Es d(t+1) X a π(t+1)(a|s) log π(t+1)(a|s) ˆZt(s) + 1 1 γ Es d(t+1) X a π(t+1)(a|s)(λ(t)A(t) C (s, a)) η Es d(t+1)DKL(π(t+1) s ||π(t) s ) + 1 η Es d(t+1) log ˆZt(s) + 1 1 γ Es d(t+1) X a π(t+1)(a|s)(λ(t)A(t) C (s, a)) η Es d(t+1) log ˆZt(s) + λ(t) 1 γ Es d(t+1) X a π(t)(a|s)A(t) C (s, a) η Es d(t+1)Gt(s) η Es µGt(s) We note that Gt(s) 0 from equation 30. We now prove a result upper bounding the difference between the optimal task value for any state distribution ρ and the task value at the tth iteration for the same state distribution. Sub-optimality gap. The difference between the optimal value function and the current value function estimate is upper bounded. 
Published as a conference paper at ICLR 2021 V π R (ρ) V (t) R (ρ) = 1 1 γ Es d X a π (a|s)( ˆA(t)(s, a) + λ(t)A(t) C (s, a)) a π (a|s) log π(t+1)(a|s) ˆZt(s) π(t)(a|s) + 1 1 γ Es d X a π (a|s)λ(t)A(t) C (s, a) DKL(π s||π(t) s ) DKL(π s||π(t+1) s ) + X a π (a|s) log ˆZt(s) + 1 1 γ Es d X a π (a|s)λ(t)A(t) C (s, a) η Es d DKL(π s||π(t) s ) DKL(π s||π(t+1) s ) + log ˆZt(s) + 1 1 γ Es d X a π (a|s)λ(t)A(t) C (s, a) η Es d DKL(π s||π(t) s ) DKL(π s||π(t+1) s ) + 1 log ˆZt(s) + λ(t) a π (a|s)A(t) C (s, a) η Es d DKL(π s||π(t) s ) DKL(π s||π(t+1) s ) Gt(s) + λ(t) a π (a|s)A(t) C (s, a) λ(t) a π(t)(a|s)A(t) C (s, a) Using equation 31 with d as the starting state distribution µ, we have: 1 η Es d log Gt(s) 1 1 γ V (t+1)(d ) V (t)(d ) which gives us a bound on Es d log Gt(s). Using the above equation and that V (t+1)(ρ) V (t)(ρ) (as V (t+1)(s) V (t)(s) for all states s), we have: V π R (ρ) V (T 1) R (ρ) 1 t=0 (V π R (ρ) V (t) R (ρ)) t=0 Es d (DKL(π s||π(t) s ) DKL(π s||π(t+1) s )) + 1 t=0 Es d log Gt(s) a π (a|s)A(t) C (s, a) λ(t) a π(t)(a|s)A(t) C (s, a) Es d DKL(π s||π(0)) ηT + 1 (1 γ)T V (t+1) R (d ) V (t) R (d ) t=0 λ(t) 1 1 γ Es d X a π (a|s)A(t) C (s, a) 1 1 γ Es d X a π(t)(a|s)A(t) C (s, a) Es d DKL(π s||π(0)) ηT + V (T ) R (d ) V (0) R (d ) (1 γ)T + 2((1 γ)(χ + ζ) ) PT 1 t=0 λ(t) ηT + 1 (1 γ)2T + 2((1 γ)(χ + ζ) ) PT 1 t=0 λ(t) Here, denotes the CQL overestimation penalty, and we have used the fact that each term of 1 1 γ P a π (a|s)A(t) C (s, a) 1 1 γ P a π(t)(a|s)A(t) C (s, a) is upper bounded by (χ + ζ (1 γ)) from Lemma 1, so the difference is upper-bounded by 2(χ + ζ (1 γ)). Published as a conference paper at ICLR 2021 By choosing α as in equation 26, we have > 2 2δγ 1 γ + (1 γ)ζ. So, < 2 2δγ 1 γ (1 γ)ζ. Hence, we obtain the relation We also observe that 2(χ (1 γ)) + 2ζ = χ + χ 2 (1 γ) + 2ζ 2 χ 2 (1 γ) = (1 χ) + 2ζ + (1 2 (1 γ)) + 2ζ So, we have the following result for convergence rate V R(µ) V (T ) R (µ) log |A| ηT + 1 (1 γ)2T + ((1 χ) + (1 2 (1 γ)) + 2ζ) PT 1 t=0 λ(t) Again, with probability 1 ω, we can ensure ζ C log(1/ω) |N| . Overall, choosing the value of α from equation 27, we have > 2 2δγ 1 γ + (1 γ)ζ. So, < 2 2δγ 1 γ (1 γ)ζ. Hence, with probability 1 ω, we can ensure that V R(µ) V (T ) R (µ) log |A| ηT + 1 (1 γ)2T + K PT 1 t=0 λ(t) K (1 χ) + 4 So far we have demonstrated that the resulting policy iterates from our algorithm all satisfy the desired safety constraint of the CMDP which allows for a maximum safety violation of χ for every intermediate policy. While this guarantee ensures that the probability of failures is bounded, it does not elaborate on the total failures incurred by the algorithm. In our next result, we show that the cumulative failures until a certain number of iterations T of the algorithm grows sublinearly when executing Algorithm 1, provided the safety threshold χ is set in the right way. Theorem 3. Let χ in Algorithm 1 be time-dependent such that χt = O(1/ t). Then, the total number of cumulative safety violations until when T transition samples have been collected by Algorithm 1, Reg C(T), scales sub-linearly with T, i.e., Reg C(T) = O( p Proof. From Theorem 1, with probability 1 ω V πφnew C (µ) χ + ζ 1 γ + 2δγϵC (1 γ)2 where ζ C p Instead of using φnew, for ease of notation in this analysis, let us write this in terms of subscript t such that πt denotes the policy at the tth iteration. 
We can write the bound on $\zeta_t$ more explicitly as
$$\zeta_t \le \mathbb{E}_{s\sim d^{\pi_t},\,a\sim\pi_t}\left[C\sqrt{\frac{\log(1/\omega)}{N_t(s,a)}}\right].$$
Here, $N_t(s,a)$ denotes the total number of times $(s,a)$ has been visited, and can be written as $N_t(s,a) = \sum_{j=1}^{t} n_j(s,a)$, where $n_j(s,a)$ denotes the number of times $(s,a)$ is visited in episode $j$. $T = \sum_{s,a} N_k(s,a)$ denotes the total number of samples collected so far, where $k$ is the number of episodes elapsed when this happens. The safety threshold $\chi$ can be varied at every iteration so that it decreases as $\chi_t \propto 1/\sqrt{t}$. So, with probability $1-\omega$,
$$V^{\pi_t}_C(\mu) \le \chi_t + \frac{\zeta_t}{1-\gamma} + \frac{\sqrt{2\delta}\,\gamma\,\epsilon_C}{(1-\gamma)^2}, \qquad \zeta_t \le \mathbb{E}_{s\sim d^{\pi_t},\,a\sim\pi_t}\left[C\sqrt{\frac{\log(1/\omega)}{N_t(s,a)}}\right].$$
Now, $V^{\pi_t}_C(\mu)$ denotes the expected probability of episodic failure when executing policy $\pi_t$ (as described in Section 2 of the main paper). Let us consider that a policy is rolled out for one episode after every training iteration to collect data (exploration). Hence, the total number of cumulative safety failures after $T$ transition samples have been collected by Algorithm 1 is
$$\mathrm{Reg}_C(T) = \sum_{t=1}^{k} V^{\pi_t}_C(\mu) \le \sum_{t=1}^{k}\chi_t + \frac{1}{1-\gamma}\sum_{t=1}^{k}\zeta_t + \frac{k\,\sqrt{2\delta}\,\gamma\,\epsilon_C}{(1-\gamma)^2}, \qquad \sum_{t=1}^{k}\zeta_t \le \sum_{t=1}^{k}\sum_{s,a} d^{\pi_t}(s)\pi_t(a|s)\,C\sqrt{\frac{\log(1/\omega)}{N_t(s,a)}} = O\Big(\sqrt{\textstyle\sum_{s,a}N_k(s,a)}\Big).$$
Here, we used the concavity of the square-root function to obtain the $\sqrt{\sum_{s,a}N_k(s,a)}$ term and the definition $T = \sum_{s,a}N_k(s,a)$, which, together with $\sum_{t=1}^{k}\chi_t = O(\sqrt{k})$ for $\chi_t = O(1/\sqrt{t})$, gives $\mathrm{Reg}_C(T) = O(\sqrt{T})$.

A.2 DERIVATION OF THE POLICY UPDATE EQUATIONS

Let $a \in \mathcal{A}$ denote an action, $s \in \mathcal{S}$ denote a state, $\pi_\phi(a|s)$ denote a parameterized policy, $r(s,a)$ denote the reward function for the task being solved, and $\tau$ denote a trajectory obtained by following policy $\pi_\phi$ at each state. We want to solve the following constrained optimization problem:
$$\max_{\pi_\phi}\ \mathbb{E}_{\tau\sim\pi_\phi}\Big[\sum_{(s_t,a_t)\in\tau} r(s_t,a_t)\Big] \quad\text{s.t.}\quad \mathbb{E}_{\tau\sim\pi_\phi}\Big[\sum_{\tau}\mathbb{1}\{\text{failure}\}\Big] = 0 \tag{36}$$
Here, $\tau$ is the trajectory corresponding to an episode. The objective is to maximize the cumulative return while satisfying the constraint, which requires that the agent never fail in any episode: $\mathbb{1}\{\text{failure}\} = 1$ if there is a failure and $0$ otherwise, and since the indicator is non-negative, the expectation can be $0$ only if every element is $0$. Let us rewrite the objective more generally as
$$\max_{\pi_\phi}\ V^{\pi_\phi}_R(\mu) \quad\text{s.t.}\quad V^{\pi_\phi}_C(\mu) = 0 \tag{37}$$
We can relax the constraint slightly by introducing a tolerance parameter $\chi \ge 0$; the objective below tolerates at most $\chi$ failures in expectation. Since the agent can fail at most once per episode, $V^{\pi_\phi}_C(\mu)$ can also be interpreted as the probability of failure, and the constraint $V^{\pi_\phi}_C(\mu) \le \chi$ says that the expected probability of failure must be bounded by $\chi$. So our objective has an intuitive and practical interpretation:
$$\max_{\pi_\phi}\ V^{\pi_\phi}_R(\mu) \quad\text{s.t.}\quad V^{\pi_\phi}_C(\mu) \le \chi \tag{38}$$
We learn one state value function, $V_R$ (corresponding to the task reward), parameterized by $\theta$, and one state-action value function, $Q_C$ (corresponding to the sparse failure indicator), parameterized by $\zeta$. The task reward function $r(s,a)$ from the environment is used to learn $V_R$. For learning $Q_C$, we get a signal from the environment indicating whether the agent is dead (1) or alive (0), i.e. $\mathbb{1}\{\text{failure}\}$.

The safety critic $Q_C$ is used to obtain an estimate of how safe a particular state is, by providing an estimate of the probability of failure, which is used to guide exploration. We want these estimates to be conservative, in the sense that the estimated probability of failure should over-estimate the actual probability, so that the agent errs on the side of caution while exploring.
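To make the role of this conservatism concrete, the following is a minimal PyTorch-style sketch (not the authors' implementation) of a safety-critic update in the spirit of the flipped-sign CQL objective derived next (equation 40). The network architecture, the placeholder Gaussian policy, the treatment of a failure as a terminal transition in the Bellman target, and all tensor names are assumptions made only for illustration.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class QCritic(nn.Module):
    """Hypothetical safety critic Q_C(s, a) with outputs in [0, 1]."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def conservative_safety_critic_loss(q_c, q_c_target, sample_action, batch,
                                    alpha=0.5, gamma=0.99):
    """Flipped-sign CQL-style loss for Q_C (cf. equation 40): a Bellman
    regression toward the binary failure indicator plus a penalty that pushes
    Q_C up on actions from the current policy and down on buffer actions, so
    that Q_C tends to over-estimate the probability of failure."""
    s, a, s_next, c = batch["s"], batch["a"], batch["s_next"], batch["c"]
    with torch.no_grad():
        a_next = sample_action(s_next)
        # Assumption of this sketch: a failure (c == 1) terminates the episode,
        # so the target does not bootstrap past it.
        target = c + (1.0 - c) * gamma * q_c_target(s_next, a_next)
    bellman = 0.5 * (q_c(s, a) - target).pow(2).mean()
    penalty = -q_c(s, sample_action(s)).mean() + q_c(s, a).mean()
    return bellman + alpha * penalty

# Tiny usage example with random data and a placeholder Gaussian "policy".
s_dim, a_dim, N = 4, 2, 32
q_c, q_c_tgt = QCritic(s_dim, a_dim), QCritic(s_dim, a_dim)
sample_action = lambda s: Normal(torch.zeros(s.shape[0], a_dim), 1.0).sample()
batch = {"s": torch.randn(N, s_dim), "a": torch.randn(N, a_dim),
         "s_next": torch.randn(N, s_dim), "c": torch.randint(0, 2, (N,)).float()}
conservative_safety_critic_loss(q_c, q_c_tgt, sample_action, batch).backward()
```

In this sketch, minimizing the penalty term raises $Q_C$ on actions proposed by the current policy and lowers it on previously executed actions, which is what biases $Q_C$ toward over-estimating the failure probability.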
To train such a critic $Q_C$, we incorporate theoretical insights from CQL and estimate $Q_C$ through updates similar to those obtained by flipping the sign of $\alpha$ in equation 2 of the CQL paper (Kumar et al., 2020). The motivation is to obtain an upper bound on $Q_C$ instead of the lower bound guaranteed by CQL. We also note that the CQL penalty term (the first two terms of equation 2 of the CQL paper) can be expressed as an estimate of the advantage function of the policy, $\mathbb{E}_{s\sim d^{\pi_{\phi_{old}}},\,a\sim\pi_\phi(a|s)}[A(s,a)]$, where $A(s,a)$ is the advantage function:
$$\mathbb{E}_{s\sim d^{\pi_{\phi_{old}}},\,a\sim\pi_\phi(a|s)}[Q(s,a)] - \mathbb{E}_{s\sim d^{\pi_{\phi_{old}}},\,a\sim\pi_{\phi_{old}}(a|s)}[Q(s,a)] = \mathbb{E}_{s\sim d^{\pi_{\phi_{old}}},\,a\sim\pi_\phi(a|s)}\big[Q(s,a) - \mathbb{E}_{a'\sim\pi_{\phi_{old}}(a'|s)}Q(s,a')\big]$$
$$= \mathbb{E}_{s\sim d^{\pi_{\phi_{old}}},\,a\sim\pi_\phi(a|s)}\big[Q(s,a) - V(s)\big] = \mathbb{E}_{s\sim d^{\pi_{\phi_{old}}},\,a\sim\pi_\phi(a|s)}\big[A(s,a)\big].$$
Hence, CQL can directly provide an upper bound on the advantage function. Although the CQL class of algorithms was proposed for batch RL, the basic bounds on the value function hold even for online training. We denote the objective inside the $\arg\min$ as CQL($\zeta$), where $\zeta$ parameterizes $Q_C$, and $k$ denotes the $k$-th update iteration:
$$\hat Q^{k+1}_C \leftarrow \arg\min_{Q_C}\ \alpha\Big(-\mathbb{E}_{s\sim\mathcal{D}_{env},\,a\sim\pi_\phi(a|s)}[Q_C(s,a)] + \mathbb{E}_{(s,a)\sim\mathcal{D}_{env}}[Q_C(s,a)]\Big) + \frac{1}{2}\,\mathbb{E}_{(s,a,s',c)\sim\mathcal{D}_{env}}\Big[\big(Q_C(s,a) - \hat{\mathcal{B}}^{\pi_\phi}\hat Q^{k}_C(s,a)\big)^2\Big] \tag{40}$$
For states sampled from the replay buffer $\mathcal{D}_{env}$, the first term seeks to maximize the expectation of $Q_C$ over actions sampled from the current policy, while the second term seeks to minimize the expectation of $Q_C$ over actions sampled from the replay buffer. $\mathcal{D}_{env}$ can include off-policy data, and also offline data (if available).

Let the over-estimated advantage corresponding to the over-estimated critic $Q_C$ obtained from CQL be denoted $\hat A_C(s,a)$, where the true advantage is $A_C(s,a)$. Now, let $\rho_\phi(s)$ denote the stationary distribution of states induced by policy $\pi_\phi$. For policy optimization, we have to solve the following constrained optimization problem:
$$\max_{\pi_\phi}\ \mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_\phi}\big[A^{\pi_{\phi_{old}}}_R(s,a)\big] \quad\text{s.t.}\quad \mathbb{E}_{s\sim\rho_{\phi_{old}}}\big[D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s))\big] \le \delta, \qquad V^{\pi_\phi}_C(\mu) \le \chi.$$
This, as per equation 22, can be rewritten as
$$\pi_{\phi_{new}} = \arg\max_{\pi_\phi}\ \mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_\phi}\big[A^{\pi_{\phi_{old}}}_R(s,a)\big] \quad\text{s.t.}\quad \mathbb{E}_{s\sim\rho_{\phi_{old}}}\big[D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s))\big] \le \delta, \qquad V^{\pi_{\phi_{old}}}_C(\mu) + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_\phi}\big[A^{\pi_{\phi_{old}}}_C(s,a)\big] \le \chi.$$
Since we are learning an over-estimate of $A_C$ through the updates in equation 2, we replace $A_C$ with the learned $\hat A_C$ in the constraint above. There are multiple ways to solve this constrained optimization problem through duality. Considering the Lagrangian dual, we obtain the following optimization problem, which we can solve approximately by alternating gradient descent. For now, we keep the KL constraint as is, and later use its second-order Taylor expansion in terms of the Fisher information matrix:
$$\max_{\pi_\phi}\min_{\lambda\ge 0}\ \mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_\phi}\big[A^{\pi_{\phi_{old}}}_R(s,a)\big] - \lambda\Big(V^{\pi_{\phi_{old}}}_C(\mu) + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_\phi}\big[\hat A_C(s,a)\big] - \chi\Big) \quad\text{s.t.}\quad \mathbb{E}_{s\sim\rho_{\phi_{old}}}\big[D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s))\big] \le \delta \tag{43}$$
We replace $V^{\pi_{\phi_{old}}}_C(\mu)$ with its sample estimate $\hat V^{\pi_{\phi_{old}}}_C(\mu)$ and denote $\chi - \hat V^{\pi_{\phi_{old}}}_C(\mu)$ as $\chi'$. Note that $\chi'$ is independent of the parameter $\phi$ being optimized over. So the objective becomes
$$\max_{\pi_\phi}\min_{\lambda\ge 0}\ \mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_\phi}\Big[A^{\pi_{\phi_{old}}}_R(s,a) - \frac{\lambda}{1-\gamma}\hat A_C(s,a)\Big] + \lambda\chi' \quad\text{s.t.}\quad \mathbb{E}_{s\sim\rho_{\phi_{old}}}\big[D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s))\big] \le \delta \tag{44}$$
For notational convenience, let $\lambda'$ denote the fraction $\frac{\lambda}{1-\gamma}$. Also, in the expectation we replace $a\sim\pi_\phi$ by $a\sim\pi_{\phi_{old}}$ and account for it by importance weighting of the objective. Considering the $\max_{\pi_\phi}$ operation, the gradient ascent step on $\phi$ solves
$$\phi \leftarrow \arg\max_\phi\ \mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_{\phi_{old}}}\Big[\frac{\pi_\phi(a|s)}{\pi_{\phi_{old}}(a|s)}\big(A^{\pi_{\phi_{old}}}_R(s,a) - \lambda'\hat A_C(s,a)\big)\Big]$$
subject to
$$\mathbb{E}_{s\sim\rho_{\phi_{old}}}\big[D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s))\big] \le \delta \tag{45}$$
Denoting the above objective by $\mathcal{A}(\phi)$ and linearizing it around $\phi_{old}$, we obtain
$$\phi \leftarrow \arg\max_\phi\ \nabla_{\phi_{old}}\mathcal{A}(\phi_{old})^{T}(\phi - \phi_{old}) \quad\text{s.t.}\quad \mathbb{E}_{s\sim\rho_{\phi_{old}}}\big[D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s))\big] \le \delta \tag{46}$$
Here, using slide 20 of Lecture 9 in (Levine, 2018) and the identity $\nabla_\phi\pi_\phi = \pi_\phi\nabla_\phi\log\pi_\phi$, we have
$$\nabla_\phi\mathcal{A}(\phi) = \mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_{\phi_{old}}}\Big[\frac{\pi_\phi(a|s)}{\pi_{\phi_{old}}(a|s)}\nabla_\phi\log\pi_\phi(a|s)\big(A^{\pi_{\phi_{old}}}_R(s,a) - \lambda'\hat A_C(s,a)\big)\Big] \tag{47}$$
Using slide 24 of Lecture 5 in (Levine, 2018) and evaluating locally at $\phi = \phi_{old}$,
$$\nabla_{\phi_{old}}\mathcal{A}(\phi_{old}) = \mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_{\phi_{old}}}\Big[\nabla_{\phi_{old}}\log\pi_{\phi_{old}}(a|s)\big(A^{\pi_{\phi_{old}}}_R(s,a) - \lambda'\hat A_C(s,a)\big)\Big] \tag{48}$$
We note that $\mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_{\phi_{old}}}\big[\nabla_{\phi_{old}}\log\pi_{\phi_{old}}(a|s)\,A^{\pi_{\phi_{old}}}_R(s,a)\big] = \nabla_{\phi_{old}}J(\phi_{old})$, the original policy gradient corresponding to the task rewards. So we can write equation 48 as
$$\nabla_{\phi_{old}}\mathcal{A}(\phi_{old}) = \nabla_{\phi_{old}}J(\phi_{old}) - \mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_{\phi_{old}}}\big[\nabla_{\phi_{old}}\log\pi_{\phi_{old}}(a|s)\,\lambda'\hat A_C(s,a)\big] \tag{49}$$
In practice, we estimate $A^{\pi_{\phi_{old}}}_R$ through GAE (Schulman et al., 2015b;a; Levine, 2018):
$$\hat A^{\pi_{\phi_{old}}}_R(s_t,a_t) = \sum_{t'=t}^{\infty}\gamma^{\,t'-t}\,\delta_{t'}, \qquad \delta_{t'} = r(s_{t'},a_{t'}) + \gamma V_R(s_{t'+1}) - V_R(s_{t'}) \tag{50}$$
Let $\hat A^{\pi_{\phi_{old}}}(s,a) = A^{\pi_{\phi_{old}}}_R(s,a) - \lambda'\hat A_C(s,a)$ denote the modified advantage function corresponding to equation 48, estimated analogously as
$$\hat A^{\pi_{\phi_{old}}}(s_t,a_t) = \sum_{t'=t}^{\infty}\gamma^{\,t'-t}\,\hat\delta_{t'}, \qquad \hat\delta_{t'} = r(s_{t'},a_{t'}) + \gamma V_R(s_{t'+1}) - V_R(s_{t'}) - \lambda'\hat A_C(s_{t'},a_{t'}) \tag{51}$$
Rewriting equations 48 and 51 in terms of $\hat A^{\pi_{\phi_{old}}}$, we have
$$\nabla_{\phi_{old}}\mathcal{A}(\phi_{old}) = \mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_{\phi_{old}}}\big[\nabla_{\phi_{old}}\log\pi_{\phi_{old}}(a|s)\,\hat A^{\pi_{\phi_{old}}}(s,a)\big] \tag{52}$$
$$\nabla_{\phi_{old}}\mathcal{A}(\phi_{old}) = \nabla_{\phi_{old}}J(\phi_{old}), \tag{53}$$
where $\nabla_{\phi_{old}}J(\phi_{old})$ now denotes the policy gradient computed with the modified advantage $\hat A^{\pi_{\phi_{old}}}$. Substituting in equation 46, we have
$$\phi \leftarrow \arg\max_\phi\ \nabla_{\phi_{old}}J(\phi_{old})^{T}(\phi - \phi_{old}) \quad\text{s.t.}\quad \mathbb{E}_{s\sim\rho_{\phi_{old}}}\big[D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s))\big] \le \delta \tag{54}$$
As shown in slide 20 of Lecture 9 (Levine, 2018) and (Schulman et al., 2015a), we can approximate $D_{KL}$ in terms of the Fisher information matrix $F$ (this is the second-order term in the Taylor expansion of the KL; note that around $\phi = \phi_{old}$, both the KL term and its gradient are 0):
$$D_{KL}(\pi_{\phi_{old}}(\cdot|s)\,\|\,\pi_\phi(\cdot|s)) \approx \frac{1}{2}(\phi-\phi_{old})^{T}F(\phi-\phi_{old}), \tag{55}$$
where $F$ can be estimated with samples as
$$F = \mathbb{E}_{s\sim\rho_{\phi_{old}}}\Big[\mathbb{E}_{a\sim\pi_{\phi_{old}}}\big[\nabla_{\phi_{old}}\log\pi_{\phi_{old}}(a|s)\,\big(\nabla_{\phi_{old}}\log\pi_{\phi_{old}}(a|s)\big)^{T}\big]\Big] \tag{56}$$
So, finally, we can write the gradient ascent step for $\phi$ as (converting to the natural gradient)
$$\phi \leftarrow \phi_{old} + \beta F^{-1}\nabla_{\phi_{old}}J(\phi_{old}), \qquad \beta = \sqrt{\frac{2\delta}{\nabla_{\phi_{old}}J(\phi_{old})^{T}F^{-1}\nabla_{\phi_{old}}J(\phi_{old})}} \tag{57}$$
In practice, we perform a backtracking line search to ensure satisfaction of the $D_{KL}$ constraint, giving the update rule
$$\phi \leftarrow \phi_{old} + \beta F^{-1}\nabla_{\phi_{old}}J(\phi_{old}), \qquad \beta = \beta_j\sqrt{\frac{2\delta}{\nabla_{\phi_{old}}J(\phi_{old})^{T}F^{-1}\nabla_{\phi_{old}}J(\phi_{old})}} \tag{58}$$
After every update, we check whether $D_{KL}(\pi_\phi\,\|\,\pi_{\phi_{old}}) \le \delta$; if not, we decay the step size as $\beta_j \leftarrow \beta_j(1-\beta_j)^j$, set $j \leftarrow j+1$, and repeat for up to $L$ steps until $D_{KL} \le \delta$ is satisfied. If this is not satisfied after $L$ steps, we backtrack and do not update $\phi$, i.e. we set $\phi \leftarrow \phi_{old}$. For the gradient descent step with respect to the Lagrange multiplier $\lambda$, we have (from equation 5)
$$\lambda \leftarrow \lambda - \eta_\lambda\Big(\chi' - \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_{\phi_{old}},\,a\sim\pi_{\phi_{old}}}\big[\hat A_C(s,a)\big]\Big) \tag{59}$$
Note that in the derivations we have omitted the outer sum over timesteps, $\sum_t$, in the expectations, as well as timestep subscripts (e.g., $a_t$, $s_t$), to avoid notational clutter.

A.3 RELATION TO CPO

The CPO paper (Achiam et al., 2017) considers a very similar overall objective for policy gradient updates, with one major difference. CPO approximates the $V^{\pi_\phi}_C(\mu)\le\chi$ constraint by replacing $V^{\pi_\phi}_C(\mu)$ with its first-order Taylor expansion and enforces the resulting simplified constraint exactly in the dual space. We do not make this simplification, and instead use primal-dual optimization to optimize an upper bound on $V_C$ through the CQL-inspired objective in equation 2. Avoiding the linearization allows us to handle sparse (binary) failure indicators from the environment without assuming a continuous safety cost function, as done in CPO (Achiam et al., 2017).
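To summarize the update equations above in executable form, the following is a schematic NumPy sketch (not the authors' implementation) of one primal-dual step: it forms the modified advantage of equation 51, the policy gradient and Fisher estimates of equations 52 and 56, the natural-gradient step with line search of equations 57 and 58, and the dual update of equation 59. The array names are hypothetical, the line search checks the quadratic KL approximation of equation 55 rather than the exact KL, and the step size is simply shrunk by a constant backtracking coefficient, which simplifies the decay schedule described above.

```python
import numpy as np

def csc_policy_and_dual_update(grad_logpi, adv_R, adv_C_hat, lam, chi_prime,
                               delta=0.01, eta_lambda=4e-2, gamma=0.99,
                               backtrack_coef=0.7, max_backtracks=20):
    """One schematic primal-dual step.

    grad_logpi: (N, d) score vectors grad_phi log pi(a|s) at phi_old
    adv_R:      (N,) task advantages (e.g. from GAE, eq. 50)
    adv_C_hat:  (N,) conservative safety advantages from Q_C
    lam:        current dual variable lambda >= 0
    chi_prime:  chi' = chi - sample estimate of V_C^{pi_old}(mu)
    Returns the parameter increment for phi and the updated lambda.
    """
    lam_prime = lam / (1.0 - gamma)                  # lambda' = lambda / (1 - gamma)
    adv_mod = adv_R - lam_prime * adv_C_hat          # modified advantage (eq. 51)
    g = grad_logpi.T @ adv_mod / len(adv_mod)        # policy gradient estimate (eq. 52)
    F = grad_logpi.T @ grad_logpi / len(adv_mod)     # Fisher information estimate (eq. 56)
    F += 1e-6 * np.eye(F.shape[0])                   # damping so F is invertible
    nat_grad = np.linalg.solve(F, g)                 # F^{-1} g
    beta = np.sqrt(2.0 * delta / (nat_grad @ F @ nat_grad + 1e-12))  # step size (eq. 57)

    # Backtracking line search; for simplicity this checks the quadratic KL
    # approximation of eq. 55 and shrinks the step by a constant coefficient.
    d_phi = np.zeros_like(g)
    for _ in range(max_backtracks):
        step = beta * nat_grad
        if 0.5 * step @ F @ step <= delta:           # approximate KL constraint (eq. 58)
            d_phi = step
            break
        beta *= backtrack_coef                       # shrink and retry; if all fail, keep phi_old

    # Dual gradient descent on lambda (eq. 59), projected onto lambda >= 0.
    lam_new = max(0.0, lam - eta_lambda * (chi_prime - adv_C_hat.mean() / (1.0 - gamma)))
    return d_phi, lam_new

# Tiny synthetic usage example.
rng = np.random.default_rng(0)
d_phi, lam = csc_policy_and_dual_update(
    grad_logpi=rng.normal(size=(256, 8)), adv_R=rng.normal(size=256),
    adv_C_hat=rng.uniform(0.0, 0.2, size=256), lam=1.0, chi_prime=0.05 - 0.10)
```

In the full algorithm, this step would be interleaved with environment rollouts, GAE computation for `adv_R`, and conservative critic updates producing `adv_C_hat`.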
A.4 PRACTICAL CONSIDERATIONS

Depending on the value of the KL constraint $\delta$ on successive policies, the RHS of the bound in Theorem 2 can be either lower or higher than the corresponding rate for the problem without the safety constraint. In particular, let the sampling error $\zeta = 0$; then if $\delta \ge \frac{(1-\gamma)^4(2-\chi)^2}{8\gamma^2}$, the third term is negative. If we set $\gamma = 0.99$ and $\chi = 0.05$, then for any $\delta >$ 1e-8, the third term in Theorem 2 will be negative. Also, if $\alpha$ is chosen to be much larger than the value in equation 26, the value of $\epsilon$ can in principle be increased arbitrarily, and we would over-estimate the value of $Q_C$ significantly. While increasing $\epsilon$ significantly would decrease the upper bound on $V^*_R(\mu) - V^{(T)}_R(\mu)$, in practice we would no longer have a practical algorithm: when $Q_C$ is significantly over-estimated, it becomes difficult to guarantee that line 9 of Algorithm 1 is satisfied, and policy execution will stop, resulting in infinite wall-clock time for the algorithm. To ensure this does not happen, in practice we loop over line 6 of Algorithm 1 for a maximum of 100 iterations. As a result, in practice the anytime safety constraint satisfaction of Theorem 2 is violated during the early stages of training, when the function approximation of $Q_C$ is inaccurate. However, as we demonstrate empirically, we are able to ensure that the guarantee holds during the majority of the training process.

A.5 DETAILS ABOUT THE ENVIRONMENTS

In each environment, shown in Figure 2, we define a task objective that the agent must achieve and a criterion for catastrophic failure. The goal is to solve the task without dying. In all the environments, in addition to the task reward, the agent only receives a binary signal indicating whether it is dead, i.e. a catastrophic failure has occurred (1), or alive (0).

Point agent navigation avoiding traps. A point agent with two independent actuators for turning and moving forward/backward must be controlled in a 2D plane to reach a goal (shown in green in Figure 2) while avoiding traps, shown as violet circular regions. The agent has a health counter set to 25 for the episode, which decreases by 1 for every timestep spent inside a trap. The agent is alive while the health counter is positive; a catastrophic failure occurs when the counter reaches 0 and the agent dies.

Car agent navigation avoiding traps. The same environment as above, but the agent is a car with more complex dynamics. It has two independently controllable front wheels and a free-rolling rear wheel. We adapt this environment from (Ray et al., 2019).

Panda push without toppling. A Franka Emika Panda arm must push a vertically placed block across the table to a goal location without the block toppling over. The workspace dimensions of the table are 20cm x 40cm and the dimensions of the block are 5cm x 5cm x 10cm. The environment is based on Robosuite (Zhu et al., 2020) and we use Operational Space Control (OSC) to control the end-effector velocities of the robot arm. A catastrophic failure is said to occur if the block topples.

Panda push within boundary. A Franka Emika Panda arm must be controlled to push a block across the table to a goal location without the block going outside a rectangular constraint region. Catastrophic failure occurs when the block's center of mass ((x, y) position) moves outside the constraint region on the table, whose dimensions are 15cm x 35cm. The dimensions of the block are 5cm x 5cm x 10cm. The environment is based on Robosuite
(Zhu et al., 2020) and we use Operational Space Control (OSC) to control the end-effector velocities of the robot arm.

Laikago walk without falling. A Laikago quadruped robot must walk without falling. The agent is rewarded for walking as fast as possible (or trotting), and failure occurs when the robot falls. Since this is an extremely challenging task, for all the baselines we initialize the agent's policy with a controller that has been trained to keep the agent standing while not in motion. The environment is implemented in PyBullet and is based on (Peng et al., 2020).

A.6 HYPER-PARAMETER DETAILS

We chose the learning rate $\eta_Q$ for the safety critic $Q_C$ to be 2e-4 after experimenting with 1e-4 and 2e-4 and observing slightly better results with the latter. The discount factor $\gamma$ is set to the usual default value of 0.99, the learning rate $\eta_\lambda$ of the dual variable $\lambda$ is set to 4e-2, the value of $\delta$ for the $D_{KL}$ constraint on policy updates is set to 0.01, and the value of $\alpha$ to 0.5. We experimented with three different values of $\alpha$ (0.05, 0.5, and 5) and found nearly the same performance across all three. For policy updates, the backtracking coefficient $\beta^{(0)}$ is set to 0.7 and the maximum number of line search iterations is L = 20. For the Q-ensembles baseline, the ensemble size is chosen to be 20 (as mentioned in the LNT paper), with the rest of the common hyper-parameter values kept consistent with CSC for a fair comparison. All results are over four random seeds.

A.7 COMPLETE RESULTS FOR TRADEOFF BETWEEN SAFETY AND TASK PERFORMANCE

Figure 5 (panels (a)-(o): task rewards, average episodic failures, and cumulative episodic failures for Point 2D Navigation, Panda Topple, Car Navigation, Panda Boundary, and Laikago): Results on the five environments we consider for our experiments. For each environment we plot the average task reward, the average episodic failures, and the cumulative episodic failures. All the plots are for our method with different safety thresholds $\chi$. The plots show that our method can naturally trade off safety for task performance depending on how strict the safety threshold $\chi$ is: for a stricter (i.e. lower) $\chi$, the average failures decrease, and the task reward converges more slowly than with a less strict threshold.

A.8 COMPLETE RESULTS FOR COMPARISON WITH BASELINES

Figure 6 (panels (a)-(o): task rewards, average episodic failures, and cumulative episodic failures for Point 2D Navigation, Panda Topple, Car Navigation, Panda Boundary, and Laikago): Results on the five environments we consider for our experiments. For each environment we plot the average task reward, the average episodic failures, and the cumulative episodic failures. Since Laikago is an extremely challenging task, for all the baselines we initialize the agent's policy with a controller that has been trained to keep the agent standing while not in motion.
The task then is to bootstrap learning so that the agent is able to remain standing while also walking. The safety threshold is $\chi = 0.05$ for all the baselines in all the environments.

A.9 COMPARISON BETWEEN TWO UNCONSTRAINED RL ALGORITHMS

Figure 7 (panels (a)-(c): task rewards, average failures, and cumulative failures for Point 2D Navigation): Comparison between two RL algorithms, TRPO (Schulman et al., 2015a) and SAC (Haarnoja et al., 2018), in the Point agent 2D navigation environment. TRPO has slightly faster convergence in terms of task rewards and also slightly lower average and cumulative failures, so we use TRPO as the Base RL baseline in Figures 3 and 4.

A.10 SEEDING THE REPLAY BUFFER WITH VERY FEW SAMPLES

In order to investigate whether we can leverage some offline, user-specified data to lower the number of failures during training even further, we seed the replay buffer of CSC and the baselines with 1000 tuples in the Car navigation environment. The 1000 tuples are marked as safe or unsafe depending on whether the car is inside a trap location in those states. If our method can leverage such manually marked offline data (in small quantities, since this marking procedure is not cheap), then we have a more practical method that can be deployed in situations where the cost of visiting an unsafe state is significantly prohibitive. Note that this is different from the setting of offline/batch RL, where the entire training data is assumed to be available offline; in this experimental setting we consider very few tuples (only 1000). Figure 8 shows that our method can successfully leverage this small offline dataset to bootstrap the learning of the safety critic and significantly lower the average failures. We attribute this to training the safety critic conservatively through CQL, which is an effective method for handling offline data.

Figure 8 (panels (a)-(c): task rewards, average failures, and cumulative failures for Car navigation): Results on the Car navigation environment after seeding the replay buffer with 1000 tuples. Although all the baselines improve with seeding, in terms of lower failure rates compared to Figure 3, we observe that CSC is particularly able to leverage the offline seeding data and significantly lower the average and cumulative failures during training.

A.11 CAR NAVIGATION WITH TRAPS USING CONTINUOUS SAFETY SIGNAL

In this section we consider the case of a continuous safety signal, to show that CSC can learn constraints in this setting as well and reduce failures significantly more than the baselines. Car navigation with traps provides a natural setting for this: every time the agent enters a trap region, it receives a penalty (and its health counter decreases by 1), and a catastrophic failure occurs when the health counter drops to 0. By training the safety critic on this continuous failure signal instead of on binary failure signals, we can capture a notion of impending failure and hence aim to be safer. Note that this setting is strictly easier than the binary safety signal setting considered previously.

Figure 9 (panels (a)-(c): task rewards, average failures, and cumulative failures for Car navigation): Results on the Car navigation environment using the continuous safety signal.
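As a concrete illustration of the seeding protocol described in A.10 above, the sketch below builds a small buffer of manually labeled tuples of the kind used in that experiment. The trap geometry, the state layout, the placeholder transition, and the in_trap labeling predicate are all hypothetical; the point is only that each seed tuple carries a safety label c that the conservative safety critic can consume from the first training iteration onward.

```python
import numpy as np

rng = np.random.default_rng(0)
traps = np.array([[1.0, 1.0], [-1.0, 0.5]])   # hypothetical trap centers
trap_radius = 0.5

def in_trap(xy):
    """Label a 2D car position as unsafe (1.0) if inside any trap, else safe (0.0)."""
    return float((np.linalg.norm(traps - xy, axis=1) < trap_radius).any())

seed_buffer = []
for _ in range(1000):                          # 1000 manually labeled tuples
    s = rng.uniform(-2, 2, size=4)             # first two dims = (x, y) position
    a = rng.uniform(-1, 1, size=2)
    # Placeholder transition; real seed tuples would come from logged data.
    s_next = s + 0.05 * np.concatenate([a, np.zeros(2)])
    c = in_trap(s[:2])                         # manual safe/unsafe label
    seed_buffer.append((s, a, s_next, c))

# CSC would then start training with Denv initialized to seed_buffer, so the
# CQL-style critic updates (equation 40) see labeled unsafe states from the
# very first iteration.
print(sum(c for *_, c in seed_buffer), "of 1000 seed tuples labeled unsafe")
```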