# Density Constrained Reinforcement Learning

Zengyi Qin¹, Yuxiao Chen², Chuchu Fan¹

¹Massachusetts Institute of Technology, ²California Institute of Technology. Correspondence to: Chuchu Fan.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

**Abstract.** We study constrained reinforcement learning (CRL) from a novel perspective by setting constraints directly on state density functions, rather than on the value functions considered by previous works. State density has a clear physical and mathematical interpretation and is able to express a wide variety of constraints such as resource limits and safety requirements. Density constraints also avoid the time-consuming process of designing and tuning cost functions that value function-based constraints require to encode system specifications. We leverage the duality between density functions and Q functions to develop an effective algorithm that solves the density constrained RL problem optimally while guaranteeing that the constraints are satisfied. We prove that the proposed algorithm converges to a near-optimal solution with a bounded error even when the policy update is imperfect. We use a set of comprehensive experiments to demonstrate the advantages of our approach over state-of-the-art CRL methods, on a wide range of density constrained tasks as well as standard CRL benchmarks such as Safety-Gym.

## 1. Introduction

Constrained reinforcement learning (CRL) (Altman, 1999; Achiam et al., 2017; Dalal et al., 2018; Paternain et al., 2019; Tessler et al., 2019; Yang et al., 2020; Zhang et al., 2020; Stooke et al., 2020) aims to find the optimal policy that maximizes the cumulative reward signal while respecting certain constraints such as safety requirements and resource limits. Existing CRL approaches typically involve constructing suitable cost functions and value functions to take the constraints into account. A crucial step is then to choose appropriate parameters, such as thresholds for the cost and value functions, to encode the constraints.

Figure 1. Autonomous electric vehicle routing in Manhattan: control the electric vehicles to reach the goals while always keeping the remaining energy above a threshold. Vehicles can be charged at the charging stations. The vehicle density at each charging station is constrained by an upper bound due to resource limits. The roads and charging stations are from real-world data (Blahoudek et al., 2020).

However, one significant gap between the use of such methods and solving CRL problems is the correct construction of the cost and value functions, which is typically not done systematically but relies on engineering intuition (Paternain et al., 2019). Simple cost functions may not exhibit satisfactory performance, while sophisticated cost functions may not have clear physical meanings. When cost functions lack clear physical interpretations, it is difficult to formally guarantee the satisfaction of the performance specifications, even if the constraints on the cost functions are fulfilled. Moreover, different environments generally need different cost functions, which makes the tuning process time-consuming. In this work, we study CRL from a novel perspective by imposing constraints on state density functions, which avoids the use of the cost or value function-based constraints found in previous CRL literature.
Density is a measurement of state concentration in the state space and is directly related to the state distribution. A variety of real-world constraints are naturally expressed as density constraints in the state space. For example, safety constraints are a special case of density constraints, where we require the entire density distribution of states to be contained in safe regions. Resource limits can also be encoded by imposing a maximum allowed density on certain states. For instance, consider the electric vehicle routing problem (Blahoudek et al., 2020) in Figure 1. In order to keep the remaining energy above a predefined threshold, vehicles can be charged at the charging stations. Due to resource limits, the vehicle density at each charging station is constrained by an upper bound. In our experiments, we will demonstrate that density constraints can be used in more general examples, which also include the standard value function-based constraints as special cases.

We cannot directly apply existing CRL optimization methods based on constraining cost or value functions to solve density constrained reinforcement learning (DCRL) problems, because the resulting solution cannot guarantee the satisfaction of the required density constraints. Instead, we propose a novel, general model-free algorithm (Algorithm 1) that exploits the duality between density functions and Q functions. This algorithm enables us to incorporate state density constraints straightforwardly into CRL. Our proposed algorithm is capable of handling both discrete and continuous state and action spaces, and can be flexibly combined with off-the-shelf RL methods to update the policy. We also prove that our algorithm converges to a near-optimal solution with a bounded error even when the RL-based policy update is imperfect (Theorem 2). Comprehensive experiments are conducted on various density constrained problems, which demonstrate the advantages of our approach over state-of-the-art CRL methods in terms of satisfying system specifications and improving the cumulative reward. Notably, although our method can handle much more general constraints than safety constraints, our DCRL algorithm still shows consistent improvement over existing CRL approaches on standard CRL benchmarks from MuJoCo and Safety-Gym (Ray et al., 2019), where constraints are primarily about safety.

Our main contributions are:
1) We are the first to introduce the DCRL problem with density constraints as a promising way to encode system specifications, which differs from the value function-based constraints in existing CRL literature.
2) We propose a general algorithm for solving DCRL problems by leveraging the duality between density functions and Q functions, and prove that our algorithm is guaranteed to converge to a near-optimal solution with a bounded error even when the policy update is imperfect.
3) We use an extensive set of experiments to examine the advantages of our method over leading CRL approaches, in a wide variety of density constrained tasks as well as standard CRL benchmarks.
## 2. Related Work

Constrained reinforcement learning (García & Fernández, 2015) primarily focuses on two approaches: modifying the optimality criterion by incorporating a risk factor (Heger, 1994; Nilim & El Ghaoui, 2005; Howard & Matheson, 1972; Borkar, 2002; Basu et al., 2008; Sato et al., 2001; Di Castro et al., 2012; Kadota et al., 2006; Lötjens et al., 2019), and incorporating extra knowledge into the exploration process (Moldovan & Abbeel, 2012; Abbeel et al., 2010; Tang et al., 2010; Geramifard et al., 2013; Clouse & Utgoff, 1992; Thomaz et al., 2006; Chow et al., 2018). Our method falls into the first category by imposing constraints and is closely related to constrained Markov decision processes (CMDPs) (Altman, 1999). CMDPs have been extensively studied in robotics (Gu et al., 2017; Pham et al., 2018), game theory (Altman & Shwartz, 2000), and communication and networks (Hou & Zhao, 2017; Bovopoulos & Lazar, 1992). Most previous works consider constraints on value functions, cost functions and reward functions (Altman, 1999; Paternain et al., 2019; Altman & Shwartz, 2000; Dalal et al., 2018; Achiam et al., 2017; Ding et al., 2020). Instead, we directly impose constraints on the state density functions. Chen & Ames (2019) and Chen et al. (2019) study the duality between density functions and value functions in CMDPs when the full model dynamics are known, while we consider model-free reinforcement learning and take a further step by proving the duality between density functions and Q functions. Geibel & Wysotzki (2005) study density as the probability of entering error states, which has a fundamentally different physical interpretation from ours. Dai et al. (2017) use the duality to boost the actor-critic algorithm, and the duality is also used in the policy evaluation community (Nachum et al., 2019; Nachum & Dai, 2020; Tang et al., 2019). The offline policy evaluation method proposed by Nachum et al. (2019) could also be used to estimate the state density in our paper, but their focus is policy evaluation rather than constrained RL. Therefore, we claim that this paper is the first work to consider density constraints and use the duality property to solve CRL.

## 3. Preliminaries

**Markov Decision Processes (MDPs).** An MDP $M$ is a tuple $\langle S, A, P, R, \gamma \rangle$, where (1) $S$ is the (possibly infinite) set of states; (2) $A$ is the (possibly infinite) set of actions; (3) $P : S \times A \times S \to [0, 1]$ is the transition probability, with $P(s, a, s')$ the probability of transitioning from state $s$ to $s'$ when action $a \in A$ is taken; (4) $R : S \times A \times S \to \mathbb{R}$ is the reward associated with the transition $P$ under the action $a \in A$; (5) $\gamma \in [0, 1]$ is a discount factor. A policy $\pi$ maps states to a probability distribution over actions, where $\pi(a|s)$ denotes the probability of choosing action $a$ at state $s$. Let a function $\phi : S \to \mathbb{R}$ specify the initial state distribution. The objective of an MDP optimization is to find the optimal policy that maximizes the overall discounted reward $J_p = \int_S \phi(s) V^\pi(s)\,ds$, where $V^\pi(s)$ is called the value function and satisfies $V^\pi(s) = r^\pi(s) + \gamma \int_A \int_S \pi(a|s) P(s, a, s') V^\pi(s')\,ds'\,da$, and $r^\pi(s) = \int_A \int_S \pi(a|s) P(s, a, s') R(s, a, s')\,ds'\,da$ is the one-step reward from state $s$ following policy $\pi$. Every state $s$ occurring as an initial state with probability $\phi(s)$ incurs an expected cumulative discounted reward of $V^\pi(s)$; therefore the overall reward is $\int_S \phi(s) V^\pi(s)\,ds$.
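For concreteness, the objective $J_p = \int_S \phi(s) V^\pi(s)\,ds$ can be estimated by Monte Carlo rollouts. The sketch below is illustrative only and assumes a hypothetical gym-style environment `env` whose `reset()` samples $s_0 \sim \phi$ and a stochastic `policy` callable; neither interface is specified in the paper.

```python
import numpy as np

def estimate_objective(env, policy, gamma=0.99, n_episodes=100, horizon=1000):
    """Monte Carlo estimate of J_p = E_{s0 ~ phi}[V^pi(s0)] (illustrative sketch)."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()                  # assumed to sample s0 ~ phi
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                # a ~ pi(. | s)
            s, r, done, _ = env.step(a)  # assumed old-gym (obs, reward, done, info) API
            ret += disc * r              # accumulate gamma^t * r_t
            disc *= gamma
            if done:
                break
        returns.append(ret)
    return float(np.mean(returns))       # approximates the integral of phi(s) V^pi(s)
```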
The formulation of value functions in MDPs typically cannot handle constraints on the state distribution, which motivates density functions.

**Density Functions.** State density functions $\rho : S \to \mathbb{R}_{\geq 0}$ measure the state concentration in the state space (Chen & Ames, 2019; Rantzer, 2001). We will show later that a generic duality relationship exists between density functions and Q functions, which allows us to directly impose density constraints in RL problems. For infinite-horizon MDPs, given a policy $\pi$ and an initial state distribution $\phi$, the stationary density of state $s$ is
$$\rho^\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi, s_0 \sim \phi),$$
which is the discounted sum of the probabilities of visiting $s$. We prove in the supplementary material that the density has an equivalent expression:
$$\rho^\pi(s) = \phi(s) + \gamma \int_S \int_A \pi(a|s')\, P(s', a, s)\, \rho^\pi(s')\, da\, ds'.$$

## 4. Density Constrained Reinforcement Learning

In this section, we first formally define the novel DCRL problem and clarify its benefits compared to value function-based constraints. To solve DCRL, we establish the duality between Q functions and density functions, then propose a general algorithm for DCRL with convergence guarantees.

### 4.1. Problem Statement

Given an MDP $M = \langle S, A, P, R, \gamma \rangle$ and an initial state distribution $\phi$, DCRL finds the optimal policy $\pi^*$ of the following optimization problem:
$$\max_\pi \int_S \phi(s) V^\pi(s)\,ds \quad \text{s.t.} \quad \rho_{\min}(s) \leq \rho^\pi(s) \leq \rho_{\max}(s), \ \forall s \in S, \tag{1}$$
where $\int_S \phi(s) V^\pi(s)\,ds$ is the expected cumulative reward. (Note that $\rho$ is not necessarily a probability density function: $\int_S \rho(s)\,ds = 1$ is not enforced.) The density constraints $\rho_{\min}(s)$ and $\rho_{\max}(s)$ are functions of states, and different states can have different lower and upper bounds on their densities. The formulation in (1) differs from most previous work (Achiam et al., 2017; Tessler et al., 2019; Yang et al., 2020), which did not consider density constraints but instead used the cost (or reward) value as the constraint:
$$\max_\pi \int_S \phi(s) V^\pi(s)\,ds \quad \text{s.t.} \quad \int_S \phi(s) V_C^\pi(s)\,ds \leq \eta, \tag{2}$$
where $V_C^\pi$ is the cost value function and $\eta$ is the threshold on the expected cumulative cost.

**Benefits of the Density Constraint.** The density constraint in (1) brings at least two benefits. First, density has a clear physical and mathematical interpretation as a measurement of state concentration in the state space. A wide range of real-world constraints can be conveniently expressed as density constraints (e.g., the vehicle density in certain areas, or the frequency of agents entering undesirable states). Second, the value function constraint in (2) requires the time-consuming process of designing and tuning cost functions, which is completely avoided by the density constraint since no cost function tuning is needed.

### 4.2. Duality of Density Functions and Q Functions

To solve the DCRL problem in (1), we cannot directly apply existing RL algorithms (Lillicrap et al., 2016; Schulman et al., 2015; 2017), which are designed to optimize the expected cumulative reward or cost, but not the state density function. However, we will show that density functions are actually dual to Q functions, and that the density constraints can be realized by modifying the formulation of the Q function. Off-the-shelf RL algorithms can then be used to optimize the modified Q function so as to enforce the density constraints. We extend the stationary density $\rho^\pi$ to account for the action taken at each state. Let $\rho : S \times A \to \mathbb{R}_{\geq 0}$ be a stationary state-action density function, which represents the amount of state instances taking action $a$ at state $s$.
The state-action density is related to the state density via marginalization: $\rho(s) = \int_A \rho(s, a)\,da$. Under a policy $\pi$, we also have $\rho^\pi(s, a) = \rho^\pi(s)\,\pi(a|s)$. Let $r(s, a) = \int_S P(s, a, s')\, R(s, a, s')\,ds'$. Consider the density function optimization:
$$J_d = \max_{\rho, \pi} \int_S \int_A \rho^\pi(s, a)\, r(s, a)\, da\, ds \tag{3}$$
subject to
$$\rho^\pi(s, a) = \pi(a|s)\Big(\phi(s) + \gamma \int_S \int_A P(s', a', s)\, \rho^\pi(s', a')\, da'\, ds'\Big), \qquad \rho_{\min}(s) \leq \rho^\pi(s) \leq \rho_{\max}(s).$$

**Lemma 1.** The optimization problems in (1) and (3) share the same optimal policy $\pi^*$.

The proof of Lemma 1 is provided in the supplementary material. It shows that solving (1) is equivalent to solving (3). We are now ready to present the duality between density functions and Q functions, which reveals that (3) can be solved via standard RL algorithms by modifying the Q function. We apply the Lagrangian method to (3) and denote the Lagrange multipliers for $\rho^\pi \geq \rho_{\min}$ and $\rho^\pi \leq \rho_{\max}$ as $\sigma_- : S \to \mathbb{R}_{\geq 0}$ and $\sigma_+ : S \to \mathbb{R}_{\geq 0}$, respectively. We then consider the following optimization problem:
$$J_p = \max_{Q, \pi} \int_S \phi(s) \int_A Q^\pi(s, a)\, \pi(a|s)\, da\, ds \tag{4}$$
subject to
$$Q^\pi(s, a) = r(s, a) + \sigma_-(s) - \sigma_+(s) + \gamma \int_S P(s, a, s') \int_A \pi(a'|s')\, Q^\pi(s', a')\, da'\, ds'.$$
The difference between the Q function in (4) and the standard Q function is that the reward $r(s, a)$ is modified to $r(s, a) + \sigma_-(s) - \sigma_+(s)$.

**Theorem 1.** The density constrained optimization objectives $J_d$ in (3) and $J_p$ in (4) are dual to each other. If both are feasible and the KKT conditions are satisfied, then they share the same optimal policy $\pi^*$.

The proof of Theorem 1 is provided in the appendix. Theorem 1 reveals that when the KKT conditions are satisfied, the optimal solution to the adjusted unconstrained primal problem (4) is exactly the optimal solution to the dual problem (3) with density constraints. Such an optimal solution not only satisfies the state density constraints but also maximizes the total reward $J_d$; thus it is the optimal solution to the DCRL problem. Note that (4) is solvable using standard RL algorithms (Lillicrap et al., 2016; Schulman et al., 2015; 2017). However, an update of the policy $\pi$ also changes the state density function $\rho^\pi$, which in turn changes the Lagrange multipliers $\sigma_+$ and $\sigma_-$ required to satisfy the KKT conditions of (3). This motivates us to develop an algorithm that iteratively updates $\pi$, $\rho^\pi$, $\sigma_+$ and $\sigma_-$, which is detailed in Section 4.3.

**Remark 1.** While the duality between density functions and value functions has already been studied (Chen & Ames, 2019), we take a step further and show the duality between density functions and Q functions. We prove and use this duality over continuous state spaces to solve density constrained RL from a novel perspective, which has not been explored or utilized in published RL literature.

**Remark 2.** The dual variables in linear programming are different from ours and do not have a clear physical interpretation. Technically, any non-negative dual variable satisfying the conservation law (the Liouville PDE in the continuous case; see Chen et al. (2019)) is a valid dual. However, among all valid dual variables, the state density is associated with a clear physical interpretation as the concentration of states, and we are able to directly apply constraints on the density in RL.

**Remark 3.** The value function-based constraint in CMDPs can be viewed as a special case of a density constraint: $\int_S \phi(s) V_C^\pi(s)\,ds \leq \eta$ is equivalent to $\int_S \rho^\pi(s)\, r_c(s)\,ds \leq \eta$. Thus value function-based CRL problems can also be solved by the DCRL framework.

**Remark 4.** The general DCRL problem cannot be solved by value function-based RL methods.
One may argue that for a single state $s_k$, its visiting frequency can be constrained by defining a cost function $c_k$ that returns 1 when $s_k$ is visited and 0 otherwise. Value function-based methods can then be used to constrain the expected value $V_{C_k}$ of this cost, as in $\int_S \phi(s) V_{C_k}(s)\,ds \leq \eta_k$ ($*$). However, when every state $s_j$ has its own density constraints $\rho_{\min}(s_j)$ and $\rho_{\max}(s_j)$, every state $s_j$ needs its own value function $V_{C_j}$ (for the cost of visiting $s_j$) and its own inequality ($*$). To cover all states, this would require an extremely large, or even infinite (when the state space is continuous), number of value functions and inequalities, which can make the problem intractable for value function-based methods. Furthermore, as noted in Remark 3, the value function-based constraint in CMDPs is equivalent to a special case of the density constraint, and our general formulation of state-wise density constraints is more expressive and cannot be trivially converted into value function-based constraints.

### 4.3. The DCRL Algorithm

The density function optimization in continuous state space is an infinite-dimensional linear program, and there is no convenient approximation method. We thus turn to the equivalent Q function optimization, where standard RL tools are applicable. By utilizing the duality between density functions and Q functions (Theorem 1) in the density constrained optimization, the DCRL problem can be solved by alternating between the primal problem (4) and the dual problem (3), as illustrated in Figure 2. In the primal domain, we solve the adjusted primal problem (reward adjusted by the Lagrange multipliers) in (4) using off-the-shelf unconstrained RL methods such as TRPO (Schulman et al., 2015) and DDPG (Lillicrap et al., 2016). Note that the density constraints are enforced in the dual domain and the primal domain remains an unconstrained problem, which means we can make use of existing RL methods to solve the primal problem. In the dual domain, the policy is used to evaluate the state density function, as described in detail in Section 4.5. If the KKT conditions $\sigma_+(s)(\rho^\pi(s) - \rho_{\max}(s)) = 0$, $\sigma_-(s)(\rho_{\min}(s) - \rho^\pi(s)) = 0$ and $\rho_{\min} \leq \rho^\pi \leq \rho_{\max}$ are not satisfied, the Lagrange multipliers are updated and the next loop begins.

Figure 2. Illustration of the iterative optimization in DCRL. In the primal domain, we solve the adjusted primal problem in (4) to obtain the policy $\pi$. Then, in the dual domain, $\pi$ is used to evaluate the state density. The Lagrange multipliers are updated as $\sigma_+ \leftarrow \max(0, \sigma_+ + \alpha(\rho^\pi - \rho_{\max}))$ and $\sigma_- \leftarrow \max(0, \sigma_- + \alpha(\rho_{\min} - \rho^\pi))$. In the next loop, since the reward $r(s, a) + \sigma_-(s) - \sigma_+(s)$ has changed, the primal optimization solves for a new $\pi$ under the updated reward. The loop stops when the KKT conditions are satisfied.

The key insight is that the density constraints can be enforced in the dual problem, and we can solve the dual problem by solving the equivalent primal problem using existing algorithms. Alternating between the primal and dual optimization gradually adjusts the Lagrange multipliers until the KKT conditions are satisfied.
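To make this alternation concrete before stating the formal template (Algorithm 1 below), the following is a minimal sketch of one DCRL iteration for a finite MDP. The `collect_episodes` and `rl_update` routines are placeholders for whatever experience collection and off-the-shelf RL update (e.g., a TRPO or DDPG wrapper) one plugs in; they are not specified by the paper. Only the reward adjustment, the discounted-visitation density estimate, and the projected dual ascent follow the description above.

```python
import numpy as np

def dcrl_iteration(collect_episodes, rl_update, policy,
                   sigma_plus, sigma_minus, rho_min, rho_max,
                   gamma=0.99, alpha=0.1):
    """One primal-dual loop of DCRL on a finite state space (illustrative sketch).

    sigma_plus, sigma_minus, rho_min, rho_max: numpy arrays indexed by state id.
    collect_episodes(policy) -> list of trajectories [(s, a, r, s_next), ...].
    rl_update(policy, episodes) -> improved policy (any standard RL method).
    """
    # Primal step: relabel rewards as r + sigma_-(s) - sigma_+(s) and run standard RL.
    episodes = collect_episodes(policy)
    relabeled = [[(s, a, r + sigma_minus[s] - sigma_plus[s], s2)
                  for (s, a, r, s2) in ep] for ep in episodes]
    policy = rl_update(policy, relabeled)

    # Dual step: estimate the stationary density of the new policy by
    # discounted visitation counts, rho(s) += gamma^t / N for each visit.
    episodes = collect_episodes(policy)
    rho = np.zeros_like(rho_max, dtype=float)
    for ep in episodes:
        for t, (s, _, _, _) in enumerate(ep):
            rho[s] += gamma ** t / len(episodes)

    # Projected dual ascent on the multipliers.
    sigma_plus = np.maximum(0.0, sigma_plus + alpha * (rho - rho_max))
    sigma_minus = np.maximum(0.0, sigma_minus + alpha * (rho_min - rho))

    # Check the KKT conditions (complementary slackness and feasibility).
    kkt_ok = (np.allclose(sigma_plus * (rho - rho_max), 0.0)
              and np.allclose(sigma_minus * (rho_min - rho), 0.0)
              and np.all((rho_min <= rho) & (rho <= rho_max)))
    return policy, rho, sigma_plus, sigma_minus, kkt_ok
```

An outer loop would simply call `dcrl_iteration` until `kkt_ok` holds, which matches the stopping rule of the template formalized next.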
**Algorithm 1: Template of the DCRL algorithm**
1: Input: MDP $M$, initial state distribution $\phi$, constraints $\rho_{\max}$ and $\rho_{\min}$ on the state density
2: Initialize $\pi$ randomly, $\sigma_+ \leftarrow 0$, $\sigma_- \leftarrow 0$
3: Generate experience $D_\pi \leftarrow \{(s, a, r, s') \mid s_0 \sim \phi,\ a \sim \pi(s),\ \text{then } r \text{ and } s' \text{ are observed}\}$
4: repeat
5:   $(s, a, r, s') \leftarrow (s, a, r + \sigma_-(s) - \sigma_+(s), s')$ for each $(s, a, r, s')$ in $D_\pi$
6:   Solve for $\pi$ of (4) using $D_\pi$ via standard RL
7:   Generate experience $D_\pi \leftarrow \{(s, a, r, s') \mid s_0 \sim \phi,\ a \sim \pi(s),\ \text{then } r \text{ and } s' \text{ are observed}\}$
8:   Compute the stationary density $\rho^\pi$ using $D_\pi$
9:   $\sigma_+ \leftarrow \max(0, \sigma_+ + \alpha(\rho^\pi - \rho_{\max}))$
10:  $\sigma_- \leftarrow \max(0, \sigma_- + \alpha(\rho_{\min} - \rho^\pi))$
11: until $\sigma_+ \cdot (\rho^\pi - \rho_{\max}) = 0$, $\sigma_- \cdot (\rho_{\min} - \rho^\pi) = 0$ and $\rho_{\min} \leq \rho^\pi \leq \rho_{\max}$
12: Return $\pi$, $\rho^\pi$

A general template of the density constrained policy optimization is provided in Algorithm 1. In Algorithm 1, the Lagrange multipliers $\sigma_+$ and $\sigma_-$ are used to adjust the rewards, which leads to an update of the policy $\pi$. The policy is then used to evaluate the stationary density, and the Lagrange multipliers are updated following dual ascent. The iteration stops when all the KKT conditions are satisfied.

### 4.4. Convergence Guarantee under Imperfect Updates

If every loop of Algorithm 1 gave the perfect (optimal) policy $\pi$ of (4) under the current $\sigma_-$ and $\sigma_+$, convergence would follow from standard subgradient methods (Boyd et al., 2003) and will not be detailed here. In reality, however, the policy update in each loop can be imperfect, which is unavoidable for standard RL algorithms. In light of this, we are interested in the case where the policy solved at each iteration is imperfect, and we provide convergence guarantees for Algorithm 1. We also identify how the error in the policy updates propagates to the solution found by Algorithm 1 and derive an upper bound on its distance to the optimal solution.

Denote the negative objective function of (3) as $f(\rho^\pi) = -\int_S \int_A \rho^\pi(s)\,\pi(a|s)\, r(s, a)\, da\, ds$. We assume $f$ is strongly convex with modulus $\mu$. If this is not the case, one can add a strongly convex regularization term to the (already convex) objective function. Then (3) minimizes $f(\rho^\pi)$ subject to the density constraints. For brevity, we only consider the density upper bound in this convergence analysis; the same result holds for the density lower bound. The Lagrangian dual function is
$$d(\sigma_+) = \min_{\rho^\pi} f(\rho^\pi) + \int_S \sigma_+(s)\big(\rho^\pi(s) - \rho_{\max}(s)\big)\,ds. \tag{5}$$
Denote the feasible set of Lagrange multipliers as $P$. Dual ascent seeks $\max_{\sigma_+ \in P} d(\sigma_+)$. We assume the policy update is imperfect and that the suboptimality of the solution $\hat\rho^\pi$ to (3) is upper bounded as
$$f(\hat\rho^\pi) + \int_S \sigma_+(s)\big(\hat\rho^\pi(s) - \rho_{\max}(s)\big)\,ds - d(\sigma_+) \leq \epsilon,$$
and the imperfect dual ascent takes the form $\sigma_+ \leftarrow \max(0, \sigma_+ + \alpha \hat\nabla d(\sigma_+))$, where $\hat\nabla d(\sigma_+) = \hat\rho^\pi - \rho_{\max}$. Let $\hat g(\sigma_+, \alpha) = \frac{1}{\alpha}\big(\max(0, \sigma_+ + \alpha \hat\nabla d(\sigma_+)) - \sigma_+\big)$, which is the residual of $\sigma_+$ before and after the imperfect update in a single loop of Algorithm 1.

**Lemma 2.** Following the imperfect dual ascent with step size $\alpha \leq \mu$, we have
$$d(\sigma_+^{k+1}) \geq d(\sigma_+^k) + \frac{\alpha}{2}\,\big\|\hat g(\sigma_+^k, \alpha)\big\|^2 - \sqrt{\tfrac{2\epsilon}{\mu}}\,\big\|\hat g(\sigma_+^k, \alpha)\big\|,$$
where the superscript $k$ denotes the $k$th loop of Algorithm 1.

The proof of Lemma 2 is provided in the supplementary material. Let $P^* = \{\sigma_+ \mid d(\sigma_+) = \max_{\sigma_+' \in P} d(\sigma_+')\}$ be the set of optimal solutions to (5).

**Theorem 2.** There exist constants $\lambda > 0$ and $\xi > 0$ such that Algorithm 1 with imperfect policy updates converges to a dual solution $\hat\sigma_+$ that satisfies
$$\min_{\sigma_+^* \in P^*} \|\hat\sigma_+ - \sigma_+^*\| \leq \lambda \sqrt{\epsilon}.$$
The dual function also converges to a bounded neighbourhood of its optimal value:
$$\min_{\sigma_+^* \in P^*} \|d(\hat\sigma_+) - d(\sigma_+^*)\| \leq \xi \epsilon.$$

The proof of Theorem 2 is provided in the supplementary material.
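The quantities appearing in Lemma 2 are easy to monitor in practice. As an illustration only (not part of the paper's algorithm), the residual $\hat g(\sigma_+, \alpha)$ of the imperfect dual-ascent step can be computed alongside the multiplier update, for example with $\sigma_+$ and the density estimate discretized over sampled states, and used as a convergence diagnostic:

```python
import numpy as np

def dual_ascent_step(sigma_plus, rho_hat, rho_max, alpha):
    """One imperfect dual-ascent step on the upper-bound multiplier (sketch).

    rho_hat is the (possibly suboptimal) density estimate, so rho_hat - rho_max
    plays the role of the approximate gradient of d(sigma_+).
    """
    grad_hat = rho_hat - rho_max
    new_sigma = np.maximum(0.0, sigma_plus + alpha * grad_hat)
    residual = (new_sigma - sigma_plus) / alpha   # g_hat(sigma_+, alpha) from Lemma 2
    return new_sigma, residual

# Example: treat a small residual norm as approximate convergence of the multipliers.
# sigma_plus, residual = dual_ascent_step(sigma_plus, rho_hat, rho_max, alpha=0.05)
# converged = np.linalg.norm(residual) < 1e-3
```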
Theorem 2 shows that Algorithm 1 is guaranteed to converge to a near-optimal solution even under imperfect policy updates.

### 4.5. Computational Approaches

Algorithm 1 requires computing the policy $\pi$, the density $\rho^\pi$, and the Lagrange multipliers $\sigma_+$ and $\sigma_-$. For $\pi$, there are well-developed representations such as neural networks and tabular methods, and solving for $\pi$ from the experience $D_\pi$ is straightforward via standard model-free RL. By contrast, the computation of $\rho^\pi$, $\sigma_+$ and $\sigma_-$ needs to be addressed.

**Density Functions.** In the discrete-state case, $\rho^\pi$ is represented by a vector in which each element corresponds to a state. To compute $\rho^\pi$ from the experience $D_\pi$ (line 8 of Algorithm 1), let $D_\pi$ contain $N$ episodes, where episode $i$ ends at time $T_i$, and let $s_{ij}$ denote the state reached at time $j$ in the $i$th episode. Initialize $\rho^\pi \leftarrow 0$. For all $i \in \{1, \dots, N\}$ and $j \in \{0, 1, \dots, T_i\}$, apply the update $\rho^\pi(s_{ij}) \leftarrow \rho^\pi(s_{ij}) + \frac{1}{N}\gamma^j$. The resulting vector approximates the stationary state density. In a continuous state space, $\rho^\pi$ cannot be represented as a vector since there are infinitely many states. We instead use kernel density estimation (Chen, 2017; Chen & Ames, 2019), which computes $\rho^\pi(s)$ at state $s$ from the samples in $D_\pi$ as $\rho^\pi(s) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=0}^{T_i} \gamma^j K_h(s - s_{ij})$, where $K_h$ is a kernel function satisfying $K_h(s) \geq 0$ for all $s \in S$ and $\int_S K_h(s)\,ds = 1$. There are multiple choices of the kernel $K_h$, e.g., Gaussian, spheric, and Epanechnikov kernels (Chen, 2017), and probabilistic accuracy guarantees can be derived (Wasserman, 2019).

**Lagrange Multipliers.** If the state space is discrete, both $\sigma_+$ and $\sigma_-$ are vectors whose length equals the number of states. In each loop of Algorithm 1, after the stationary density is computed, $\sigma_+$ and $\sigma_-$ are updated following lines 9 and 10 of Algorithm 1, respectively. If the state space is continuous, we construct the Lagrange multiplier functions $\sigma_+$ and $\sigma_-$ from samples in the state space using linear interpolation. Let $\bar{s} = [\bar{s}_1, \bar{s}_2, \dots]$ denote the samples in the state space. In every loop of Algorithm 1, denote the Lagrange multiplier functions computed in the previous loop by $\sigma_+^o$ and $\sigma_-^o$. We compute the updated Lagrange multipliers at the sampled states as $\bar\sigma_+ = [\max(0, \sigma_+^o(\bar{s}_1) + \alpha(\rho^\pi(\bar{s}_1) - \rho_{\max}(\bar{s}_1))), \max(0, \sigma_+^o(\bar{s}_2) + \alpha(\rho^\pi(\bar{s}_2) - \rho_{\max}(\bar{s}_2))), \dots]$ and $\bar\sigma_- = [\max(0, \sigma_-^o(\bar{s}_1) + \alpha(\rho_{\min}(\bar{s}_1) - \rho^\pi(\bar{s}_1))), \max(0, \sigma_-^o(\bar{s}_2) + \alpha(\rho_{\min}(\bar{s}_2) - \rho^\pi(\bar{s}_2))), \dots]$. The new $\sigma_+$ and $\sigma_-$ are then obtained by linearly interpolating $\bar\sigma_+$ and $\bar\sigma_-$, respectively.

## 5. Experiments

We consider a wide variety of CRL benchmarks and demonstrate how they can be effectively solved by the proposed DCRL approach. The density constrained benchmarks include autonomous electric vehicle routing (Blahoudek et al., 2020), safe motor control (Traue et al., 2019) and agricultural drone control. The standard CRL benchmarks are from MuJoCo and Safety-Gym (Ray et al., 2019). The definitions of the reward and the constraint vary from task to task and are explained as each task is introduced.

**Baseline Approaches.** Three CRL baselines are compared. PCPO (Yang et al., 2020) first performs an unconstrained policy update and then projects the policy onto the constraint set. CPO (Achiam et al., 2017) maximizes the reward within a small neighbourhood that enforces the constraints. RCPO (Tessler et al., 2019) incorporates the cost terms, weighted by Lagrange multipliers, into the reward function to encourage satisfaction of the constraints. We used the original implementations of CPO and PCPO with the KL-projection, which leads to the best performance.
For RCPO, since the official implementation is not available, we re-implemented it and made sure it matches the original performance. All three baseline approaches and our DCRL have the same number of neural network parameters. Note that the baseline approaches enforce the constraints by restricting value functions, while our DCRL restricts the state density functions. When we train the value function-based CRL methods on density constrained tasks, we convert the density threshold to a value function threshold via the duality between density functions and value functions (Chen & Ames, 2019).

### 5.1. Autonomous Electric Vehicle Routing

Figure 3. Vehicle densities at the charging stations. All results are averaged over 10 independent trials. The 16 charging stations with the highest vehicle densities are kept for visualization.

Figure 4. Average reward and remaining energy. All results are averaged over 10 independent trials.

Our first case study concerns controlling autonomous electric vehicles (EVs) in the middle of Manhattan, New York. It is adopted from Blahoudek et al. (2020) and is shown in Figure 1. While EVs drive to their destinations, they can avoid running out of power by recharging at the fast charging stations along the roads. At the same time, the vehicles should not stay at the charging stations for too long, in order to save resources and avoid congestion. A road intersection is called a node. In each episode, an autonomous EV starts from a random node and drives to the goals. At each node, the EV chooses a direction and reaches the next node along that direction at the next step. The consumed electric energy is assumed to be proportional to the traveling distance. There are 1024 nodes and 137 charging stations in total. Denote the full energy capacity of the vehicle as $c_f$ and the remaining energy as $c_r$. When arriving at a charging station, the EV chooses a charging time $\tau \in [0, 1]$, after which its energy increases to $\min(c_f, c_r + \tau c_f)$. The action space includes the EV direction and the charging time $\tau$. The state space $S \subseteq \mathbb{R}^3$ consists of the current 2D location and the remaining energy $c_r$. The agent receives a negative reward proportional to its traveling distance, to restrict energy consumption, and a +10 reward when it reaches the goal. In one episode, a single agent starts from a random initial location and travels to the goal. The density is accumulated over 2000 episodes, which is equivalent to having 2000 agents.

Two types of constraints are considered: (1) the minimum remaining energy should be kept close to a required threshold, and (2) the vehicle density at the charging stations should be less than a given threshold. If the EV chooses a larger $\tau$, constraint (1) is more likely to be satisfied, while (2) is more likely to be violated, since a larger $\tau$ increases the vehicle density at the charging station. These competing constraints pose a greater challenge to the RL algorithms. Both constraints can be naturally expressed as density constraints: for constraint (1), we can limit the density of low-energy states; for constraint (2), it is straightforward to limit the EV density (a function of $\mathbb{E}[\tau]$) at the charging stations. The density constraint thresholds are transformed into value function thresholds for the baseline methods; the conversion is based on the duality between density functions and value functions (Chen & Ames, 2019).
Figure 3 demonstrates that our DCRL prevents the vehicle densities from exceeding the thresholds, while the baseline methods suffer from constraint violations. This is because DCRL allows us to explicitly set state density thresholds. Figure 4 shows that all the compared methods satisfy the constraint on the remaining energy, and DCRL reaches the highest reward.

### 5.2. MuJoCo Benchmark and Safety-Gym

Experiments are conducted on three tasks adopted from CPO (Achiam et al., 2017) and two tasks from Safety-Gym (Ray et al., 2019), all built on top of the MuJoCo simulator. The tasks include Point-Gather, Point-Circle, Ant-Circle, Car-Goal and Car-Button. In the Point-Gather task, a point agent moves on a plane and tries to gather as many apples as possible while keeping the probability of gathering bombs below the threshold. In the Point-Circle task, a point agent tries to follow a circular path that maximizes the reward, while constraining the probability of exiting a given area. The Ant-Circle task is similar to Point-Circle except that the agent is an ant instead of a point. Detailed configurations of these three benchmarks can be found in Achiam et al. (2017). In the Car-Goal task, a car agent tries to navigate to a goal while avoiding hazards. In the Car-Button task, a car agent tries to press the goal button while avoiding hazards and not pressing the wrong buttons. Detailed descriptions of the Safety-Gym benchmark can be found in Ray et al. (2019). The original value function constraints for these benchmarks are converted to state density constraints for DCRL via the duality between density functions and value functions.

Figure 5 demonstrates the performance of the four methods. In general, DCRL achieves higher reward than the other methods while satisfying the constraint thresholds. In the Point-Gather and Point-Circle environments, all four approaches exhibit stable performance with relatively small variances. In the Ant-Circle environment, the variances of the reward and constraint values are significantly greater than in the Point environments, which is mainly due to the complexity of the ant dynamics. In Ant-Circle, after 600 iterations of policy updates, the constraint values of all four approaches converge to the neighbourhood of the threshold. The reward of DCRL falls behind PCPO in the first 400 iterations of updates but outperforms PCPO thereafter.

### 5.3. Safe Motor Control

The safe motor control environment is adopted from Traue et al. (2019) and shown in Figure 6(a). The objective is to control a direct-current series motor so that its angular velocity follows a random reference trajectory while preventing the motor from overheating. The state space $S \subseteq \mathbb{R}^6$ consists of six variables: angular velocity, torque, current, voltage, temperature and the reference angular velocity.

Figure 5. Performance on the constrained reinforcement learning tasks in the MuJoCo (Todorov et al., 2012) simulator. The first three tasks are from CPO (Achiam et al., 2017) and the last two tasks are from Safety-Gym (Ray et al., 2019). All results are averaged over 10 independent trials. The constraint thresholds are all upper bounds.

Figure 6. Illustration of the safe motor control and drone application environments. (a) Direct-current series motor: control the rotor to follow a reference angular velocity trajectory while limiting the density of high-temperature states.
(b) A farm divided into 5 regions: control the drones to spray pesticide over a farmland divided into five regions, each requiring a different pesticide density.

Figure 7. Reward and state density constraint in the safe motor control task. Results are averaged over 10 independent trials.

The action space $A \subseteq \mathbb{R}$ is the electrical duty cycle that controls the motor power. The agent outputs a duty cycle at each time step to drive the angular velocity close to the reference. When the reference angular velocity increases, the required duty cycle increases as well; as a result, the motor's power and angular velocity increase and cause the motor temperature to grow. The reward is defined as the negative distance between the measured angular velocity and the reference angular velocity. We consider the state density with respect to the motor temperature as the constraint: high-temperature states should have a low density to protect the motor. When we train the baseline methods, the density constraints are converted to equivalent cost value constraints via the duality between density functions and value functions. Figure 7 shows the reward and state density for each method. We find that DCRL reaches the highest reward among the compared methods while respecting the state density constraints. More experiments under different settings can be found in the supplementary material.

### 5.4. Agricultural Spraying Drone

We consider the problem of using drones to spray pesticides over a farmland in simulation. The farmland is shown in Figure 6(b); it is divided into 5 regions, and different regions require different pesticide densities, represented by a lower bound $\rho_{\min}$ and an upper bound $\rho_{\max}$. The drone starts from the top-left corner, flies over the farmland spraying pesticide, and stops at the bottom-right corner. The state space contains the position and velocity of the drone, as well as the required pesticide density of the current region. The action space contains the angular acceleration and the vertical thrust that control the drone. The drone sprays a constant volume of pesticide at each time step and controls the pesticide density on the land by adjusting its position and velocity. A +1 reward is given when the pesticide density of the area below the drone enters the required range, and a -1 reward is given when the density exits the range. A +10 reward is given when the drone reaches the bottom-right corner, which is the destination.

Figure 8. Results of the agricultural spraying problem. (a) Percentage of the entire area that satisfies the pesticide density requirement. (b) Time consumption in steps to reach the destination. Whiskers in the plots denote confidence intervals. Results are calculated over 10 independent trials after each method converges.

Figure 8 shows the experimental results. With the DCRL method, we obtain the highest percentage of area that satisfies the pesticide density constraint. DCRL also requires the fewest steps to reach the destination. Details of the experiment configurations and results under different settings can be found in the supplementary material.

## 6. Conclusion and Future Work

We study a novel class of constrained reinforcement learning problems where the constraints are imposed on state density functions, rather than the value functions considered in previous literature. State densities have clear physical meanings and can express a variety of constraints on the environment and the system.
We prove the duality between density functions and Q functions, then leverage this duality to develop a general algorithm that solves DCRL problems. We also provide a convergence guarantee for our algorithm even under imperfect policy updates. Note that our algorithm does not guarantee the satisfaction of the density constraints during training, and we do observe occasional violations in the early iterations. In the future, we aim to improve the algorithm to enforce the density constraints throughout the entire training process.

## 7. Acknowledgements

The authors acknowledge support from the DARPA Assured Autonomy program under contract FA8750-19-C-0089 and from the Defense Science and Technology Agency in Singapore. The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense, the U.S. Government, DSTA Singapore, or the Singapore Government.

## References

Abbeel, P., Coates, A., and Ng, A. Y. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 29(13):1608–1639, 2010.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International Conference on Machine Learning (ICML), pp. 22–31, 2017.

Altman, E. Constrained Markov Decision Processes, volume 7. CRC Press, 1999.

Altman, E. and Shwartz, A. Constrained Markov games: Nash equilibria. In Advances in Dynamic Games and Applications, pp. 213–221. Springer, 2000.

Basu, A., Bhattacharyya, T., and Borkar, V. S. A learning algorithm for risk-sensitive cost. Mathematics of Operations Research, 33(4):880–898, 2008.

Blahoudek, F., Brázdil, T., Novotný, P., Ornik, M., Thangeda, P., and Topcu, U. Qualitative controller synthesis for consumption Markov decision processes. International Conference on Computer-Aided Verification, 2020.

Borkar, V. S. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294–311, 2002.

Bovopoulos, A. D. and Lazar, A. A. The effect of delayed feedback information on network performance. Annals of Operations Research, 36(1):101–124, 1992.

Boyd, S., Xiao, L., and Mutapcic, A. Subgradient methods. Lecture notes of EE392o, Stanford University, Autumn Quarter 2003–2004, 2003.

Chen, Y. and Ames, A. D. Duality between density function and value function with applications in constrained optimal control and Markov decision process. arXiv preprint arXiv:1902.09583, 2019.

Chen, Y., Singletary, A., and Ames, A. Density functions for guaranteed safety on robotic systems. American Control Conference, 2019.

Chen, Y.-C. A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1(1):161–187, 2017.

Chow, Y., Nachum, O., Duenez-Guzman, E., and Ghavamzadeh, M. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8092–8101, 2018.

Clouse, J. A. and Utgoff, P. E. A teaching method for reinforcement learning. In Machine Learning Proceedings 1992, pp. 92–101. Elsevier, 1992.

Dai, B., Shaw, A., He, N., Li, L., and Song, L. Boosting the actor with dual critic. arXiv preprint arXiv:1712.10282, 2017.

Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., and Tassa, Y. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.

Ding, D., Zhang, K., Basar, T., and Jovanovic, M. Natural policy gradient primal-dual method for constrained Markov decision processes. Advances in Neural Information Processing Systems, 33, 2020.
Di Castro, D., Tamar, A., and Mannor, S. Policy gradients with variance related risk criteria. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012.

García, J. and Fernández, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

Geibel, P. and Wysotzki, F. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24:81–108, 2005.

Geramifard, A., Redding, J., and How, J. P. Intelligent cooperative control architecture: a framework for performance improvement using safe learning. Journal of Intelligent & Robotic Systems, 72(1):83–103, 2013.

Gu, S., Holly, E., Lillicrap, T., and Levine, S. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396. IEEE, 2017.

Heger, M. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994, pp. 105–111. Elsevier, 1994.

Hou, C. and Zhao, Q. Optimization of web service-based control system for balance between network traffic and delay. IEEE Transactions on Automation Science and Engineering, 15(3):1152–1162, 2017.

Howard, R. A. and Matheson, J. E. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.

Kadota, Y., Kurano, M., and Yasuda, M. Discounted Markov decision processes with utility constraints. Computers & Mathematics with Applications, 51(2):279–284, 2006.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning, 2016.

Lötjens, B., Everett, M., and How, J. P. Safe reinforcement learning with model uncertainty estimates. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8662–8668. IEEE, 2019.

Luo, Z.-Q. and Tseng, P. On the convergence rate of dual ascent methods for linearly constrained convex minimization. Mathematics of Operations Research, 18(4):846–867, 1993.

Moldovan, T. M. and Abbeel, P. Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.4810, 2012.

Nachum, O. and Dai, B. Reinforcement learning via Fenchel-Rockafellar duality. arXiv preprint arXiv:2001.01866, 2020.

Nachum, O., Chow, Y., Dai, B., and Li, L. DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems, pp. 2318–2328, 2019.

Nilim, A. and El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

Paternain, S., Chamon, L., Calvo-Fullana, M., and Ribeiro, A. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems (NeurIPS), pp. 7553–7563, 2019.

Pham, T.-H., De Magistris, G., and Tachibana, R. OptLayer: practical constrained optimization for deep reinforcement learning in the real world. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6236–6243. IEEE, 2018.

Rantzer, A. A dual to Lyapunov's stability theorem. Systems & Control Letters, 42(3):161–168, 2001.

Ray, A., Achiam, J., and Amodei, D. Benchmarking safe exploration in deep reinforcement learning. 2019.

Rockafellar, R. T. Convex Analysis. Number 28. Princeton University Press, 1970.
Sato, M., Kimura, H., and Kobayashi, S. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16(3):353–362, 2001.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Stooke, A., Achiam, J., and Abbeel, P. Responsive safety in reinforcement learning by PID Lagrangian methods. In International Conference on Machine Learning, pp. 9133–9143. PMLR, 2020.

Tang, J., Singh, A., Goehausen, N., and Abbeel, P. Parameterized maneuver learning for autonomous helicopter flight. In 2010 IEEE International Conference on Robotics and Automation, pp. 1142–1148. IEEE, 2010.

Tang, Z., Feng, Y., Li, L., Zhou, D., and Liu, Q. Doubly robust bias reduction in infinite horizon off-policy estimation. arXiv preprint arXiv:1910.07186, 2019.

Tessler, C., Mankowitz, D. J., and Mannor, S. Reward constrained policy optimization. In International Conference on Learning Representations (ICLR), 2019.

Thomaz, A. L., Breazeal, C., et al. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In AAAI, volume 6, pp. 1000–1005. Boston, MA, 2006.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.

Traue, A., Book, G., Kirchgässner, W., and Wallscheid, O. Towards a reinforcement learning environment toolbox for intelligent electric motor control, 2019.

Wasserman, L. Lecture notes in statistical methods for machine learning, 2019. URL http://www.stat.cmu.edu/~larry/=sml/densityestimation.pdf. Online material, last visited on 2019/08/25.

Yang, T.-Y., Rosca, J., Narasimhan, K., and Ramadge, P. J. Projection-based constrained policy optimization. In International Conference on Learning Representations, 2020.

Zhang, Y., Vuong, Q., and Ross, K. First order constrained optimization in policy space. Advances in Neural Information Processing Systems, 33, 2020.