# Reward Constrained Policy Optimization

Published as a conference paper at ICLR 2019

Chen Tessler (1), Daniel J. Mankowitz (2), and Shie Mannor (1)

(1) Technion Israel Institute of Technology, Haifa, Israel
(2) DeepMind, London, England

chen.tessler@campus.technion.ac.il, dmankowitz@google.com, shie@ee.technion.ac.il

Solving tasks in Reinforcement Learning is no easy feat. As the goal of the agent is to maximize the accumulated reward, it often learns to exploit loopholes and misspecifications in the reward signal, resulting in unwanted behavior. While constraints may solve this issue, there is no closed-form solution for general constraints. In this work, we present a novel multi-timescale approach for constrained policy optimization, called Reward Constrained Policy Optimization (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one. We prove the convergence of our approach and provide empirical evidence of its ability to train constraint-satisfying policies.

1 Introduction

Applying Reinforcement Learning (RL) is generally a hard problem. At each state, the agent performs an action which produces a reward. The goal is to maximize the accumulated reward; hence the reward signal implicitly defines the behavior of the agent. While in computer games (e.g. Bellemare et al. (2013)) there exists a pre-defined reward signal, this is not the case in many real-world applications. An example is the Mujoco domain (Todorov et al., 2012), in which the goal is to learn to control robotic agents in tasks such as standing up, walking, navigation and more. Consider the Humanoid domain: the agent is a three-dimensional humanoid and the task is to walk forward as far as possible (without falling down) within a fixed amount of time. Naturally, a reward is provided based on the forward velocity in order to encourage a larger distance; however, additional reward signals are provided in order to guide the agent, for instance a bonus for staying alive, a penalty for energy usage and a penalty based on the force of impact between the feet and the floor (which should encourage less erratic behavior). Each signal is multiplied by its own coefficient, which controls the emphasis placed on it.

This approach is a multi-objective problem (Mannor and Shimkin, 2004), in which for each set of penalty coefficients there exists a different optimal solution, also known as Pareto optimality (Van Moffaert and Nowé, 2014). In practice, the exact coefficients are selected through a time-consuming and computationally intensive process of hyper-parameter tuning. As our experiments show, the coefficients are not shared across domains; a coefficient which leads to satisfying behavior in one domain may lead to catastrophic failure in another (issues also seen in Leike et al. (2017) and Mania et al. (2018)).

Constraints are a natural and consistent approach, one which ensures satisfying behavior without the need for manually selecting the penalty coefficients. In constrained optimization, the task is to maximize a target function f(x) while satisfying an inequality constraint g(x) ≤ α. While constraints are a promising solution to ensuring satisfying behavior, existing methods are limited in the type of constraints they are able to handle and the algorithms that they may support: they require a parametrization of the policy (policy gradient methods) and propagation of the constraint-violation signal over the entire trajectory (e.g. Prashanth and Ghavamzadeh (2016)).
This poses an issue, as Q-learning algorithms such as DQN (Mnih et al., 2015) do not learn a parametrization of the policy, and common Actor-Critic methods (e.g. Schulman et al. (2015a); Mnih et al. (2016); Schulman et al. (2017)) build the reward-to-go based on an N-step sample and a bootstrap update from the critic.

Table 1: Comparison between various approaches along four criteria: handles discounted-sum constraints, handles mean-value¹ constraints, requires no prior knowledge, and reward agnostic. The approaches compared are RCPO (this paper)², Dalal et al. (2018), Achiam et al. (2017), reward shaping, and Schulman et al. (2017)³. As discussed under Novelty below, RCPO satisfies all four criteria.

¹ A mean-valued constraint takes the form E[(1/T) Σ_{t=0}^{T−1} c_t] ≤ α, as seen in Section 5.2.
² Under the appropriate assumptions.
³ Algorithms such as PPO are not intended to consider or satisfy constraints.

In this paper, we propose the Reward Constrained Policy Optimization (RCPO) algorithm. RCPO incorporates the constraint as a penalty signal into the reward function. This penalty signal guides the policy towards a constraint-satisfying solution. We prove that RCPO converges almost surely, under mild assumptions, to a constraint-satisfying solution (Theorem 2). In addition, we show empirically, on a toy domain and six robotics domains, that RCPO results in a constraint-satisfying solution while demonstrating faster convergence and improved stability (compared to standard constrained optimization methods).

Related work: Constrained Markov Decision Processes (Altman, 1999) are an active field of research. CMDP applications cover a vast number of topics, such as electric grids (Koutsopoulos and Tassiulas, 2011), networking (Hou and Zhao, 2017), robotics (Chow et al., 2015; Gu et al., 2017; Achiam et al., 2017; Dalal et al., 2018) and finance (Krokhmal et al., 2002; Tamar et al., 2012). The main approaches to solving such problems are (i) Lagrange multipliers (Borkar, 2005; Bhatnagar and Lakshmanan, 2012), (ii) trust regions (Achiam et al., 2017), (iii) integrating prior knowledge (Dalal et al., 2018) and (iv) manual selection of the penalty coefficient (Tamar and Mannor, 2013; Levine and Koltun, 2013; Peng et al., 2018).

Novelty: The novelty of our work lies in the ability to tackle (1) general constraints (both discounted-sum and mean-value constraints), not only constraints which satisfy the recursive Bellman equation (i.e., discounted-sum constraints) as in previous work. The algorithm is (2) reward agnostic, that is, invariant to scaling of the underlying reward signal, and (3) does not require the use of prior knowledge. A comparison with the different approaches is provided in Table 1.

2 Preliminaries

2.1 Markov Decision Process (MDP)

A Markov Decision Process M is defined by the tuple (S, A, R, P, µ, γ) (Sutton and Barto, 1998), where S is the set of states, A the available actions, R : S × A × S → ℝ is the reward function, P : S × A × S → [0, 1] is the transition matrix, where P(s′|s, a) is the probability of transitioning from state s to s′ assuming action a was taken, µ : S → [0, 1] is the initial state distribution and γ ∈ [0, 1) is the discount factor for future rewards. A policy π : S → Δ_A is a probability distribution over actions and π(a|s) denotes the probability of taking action a at state s. For each state s, the value of following policy π is denoted by:

$$V^\pi_R(s) = \mathbb{E}^\pi\Big[\sum_{t} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big] .$$
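As a concrete illustration of the quantities defined above, the value V^π_R(s_0) (and, analogously, the expected constraint J^π_C defined in Section 2.2) can be estimated by Monte-Carlo rollouts. The sketch below is our illustration, not code from the paper; `env` (a Gym-style `reset`/`step` interface) and `policy` are hypothetical stand-ins.

```python
import numpy as np

def mc_value_estimate(env, policy, gamma=0.99, n_episodes=100, max_steps=1000):
    """Monte-Carlo estimate of V^pi_R(s0) = E_pi[sum_t gamma^t r_t | s_0 ~ mu]."""
    returns = []
    for _ in range(n_episodes):
        s = env.reset()                      # s_0 ~ mu
        discounted, discount = 0.0, 1.0
        for _ in range(max_steps):
            a = policy(s)                    # a ~ pi(.|s)
            s, r, done, _ = env.step(a)
            discounted += discount * r       # accumulate gamma^t * r_t
            discount *= gamma
            if done:
                break
        returns.append(discounted)
    return float(np.mean(returns))           # estimate of J^pi_R (Equation 1)
```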
An important property of the value function is that it solves the recursive Bellman equation:

$$V^\pi_R(s) = \mathbb{E}^\pi\big[r(s, a) + \gamma V^\pi_R(s') \,\big|\, s\big] .$$

The goal is then to maximize the expectation of the reward-to-go, given the initial state distribution µ:

$$\max_{\pi \in \Pi} J^\pi_R, \quad \text{where } J^\pi_R = \mathbb{E}^\pi_{s \sim \mu}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big] = \sum_{s \in S} \mu(s) V^\pi_R(s) . \tag{1}$$

2.2 Constrained MDPs

A Constrained Markov Decision Process (CMDP) extends the MDP framework by introducing a penalty c(s, a), a constraint C(s_t) = F(c(s_t, a_t), ..., c(s_N, a_N)) and a threshold α ∈ [0, 1]. A constraint may be a discounted sum (similar to the reward-to-go), the average sum and more (see Altman (1999) for additional examples). Throughout the paper we will refer to the collection of these constraints as general constraints. We denote the expectation over the constraint by:

$$J^\pi_C = \mathbb{E}^\pi_{s \sim \mu}\big[C(s)\big] . \tag{2}$$

The problem thus becomes:

$$\max_{\pi \in \Pi} J^\pi_R, \quad \text{s.t. } J^\pi_C \le \alpha . \tag{3}$$

2.3 Parametrized Policies

In this work we consider parametrized policies, such as neural networks. The parameters of the policy are denoted by θ and a parametrized policy as π_θ. We make the following assumptions in order to ensure convergence to a constraint-satisfying policy:

Assumption 1. The value V^π_R(s) is bounded for all policies π ∈ Π.

Assumption 2. Every local minimum of J^{π_θ}_C is a feasible solution.

Assumption 2 is the minimal requirement in order to ensure convergence, given a general constraint, of a gradient algorithm to a feasible solution. Stricter assumptions, such as convexity, may ensure convergence to the optimal solution; however, in practice constraints are non-convex and such assumptions do not hold.

3 Constrained Policy Optimization

Constrained MDPs are often solved using the Lagrange relaxation technique (Bertsekas, 1999). In Lagrange relaxation, the CMDP is converted into an equivalent unconstrained problem. In addition to the objective, a penalty term is added for infeasibility, thus making infeasible solutions sub-optimal. Given a CMDP (3), the unconstrained problem is

$$\min_{\lambda \ge 0} \max_{\theta} L(\lambda, \theta) = \min_{\lambda \ge 0} \max_{\theta} \big[ J^{\pi_\theta}_R - \lambda \big( J^{\pi_\theta}_C - \alpha \big) \big] , \tag{4}$$

where L is the Lagrangian and λ ≥ 0 is the Lagrange multiplier (a penalty coefficient). Notice that as λ increases, the solution to (4) converges to that of (3). This suggests a two-timescale approach: on the faster timescale, θ is found by solving (4), while on the slower timescale, λ is increased until the constraint is satisfied. The goal is to find a saddle point (θ*(λ*), λ*) of (4), which is a feasible solution.

Definition 1. A feasible solution of the CMDP is a solution which satisfies J^π_C ≤ α.

3.1 Estimating the gradient

We assume there is no access to the MDP itself; rather, samples are obtained via simulation. The simulation-based algorithm for the constrained optimization problem (3) is:

$$\lambda_{k+1} = \Gamma_\lambda \big[ \lambda_k - \eta_1(k) \nabla_\lambda L(\lambda_k, \theta_k) \big] , \tag{5}$$

$$\theta_{k+1} = \Gamma_\theta \big[ \theta_k + \eta_2(k) \nabla_\theta L(\lambda_k, \theta_k) \big] , \tag{6}$$

where Γ_θ is a projection operator, which keeps the iterate θ_k stable by projecting onto a compact and convex set, and Γ_λ projects λ into the range [0, λmax]⁴. ∇_θ L and ∇_λ L are derived from (4), where the formulation for ∇_θ L is derived using the log-likelihood trick (Williams, 1992):

$$\nabla_\theta L(\lambda, \theta) = \nabla_\theta \mathbb{E}^{\pi_\theta}_{s \sim \mu} \big[ \log \pi(s, a; \theta)\, [R(s) - \lambda C(s)] \big] , \tag{7}$$

$$\nabla_\lambda L(\lambda, \theta) = -\big( \mathbb{E}^{\pi_\theta}_{s \sim \mu} [C(s)] - \alpha \big) . \tag{8}$$

η_1(k), η_2(k) are step-sizes which ensure that the policy update is performed on a faster timescale than that of the penalty coefficient λ.

Assumption 3.

$$\sum_{k=0}^{\infty} \eta_1(k) = \sum_{k=0}^{\infty} \eta_2(k) = \infty, \quad \sum_{k=0}^{\infty} \big( \eta_1(k)^2 + \eta_2(k)^2 \big) < \infty \quad \text{and} \quad \frac{\eta_1(k)}{\eta_2(k)} \to 0 .$$
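For intuition, the sampled two-timescale updates (5)-(8) can be written in a few lines of code. This is our illustrative sketch, not the paper's implementation: `sample_trajectory` (returning the visited states, actions, total reward R and constraint value C of a rollout under π_θ) and `grad_log_pi` are hypothetical helpers, and the step-size schedules are one possible choice satisfying Assumption 3.

```python
import numpy as np

# Example schedules satisfying Assumption 3 (our choice, not from the paper):
# both sums diverge, both squared sums converge, and eta1(k)/eta2(k) -> 0.
eta1 = lambda k: 1.0 / (k + 1)            # slow timescale: Lagrange multiplier
eta2 = lambda k: 1.0 / (k + 1) ** 0.75    # fast timescale: policy parameters

def primal_dual_step(k, theta, lam, alpha, sample_trajectory, grad_log_pi, lam_max=100.0):
    """One sampled iteration of (5)-(8) for L(lambda, theta) = J_R - lambda * (J_C - alpha)."""
    states, actions, R, C = sample_trajectory(theta)        # Monte-Carlo rollout under pi_theta
    # (7): score-function (log-likelihood trick) estimate of grad_theta L
    score = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
    theta = theta + eta2(k) * score * (R - lam * C)         # (6); projection Gamma_theta omitted
    # (5) with (8): lambda ascends on the constraint violation, clipped to [0, lam_max]
    lam = float(np.clip(lam + eta1(k) * (C - alpha), 0.0, lam_max))
    return theta, lam
```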
Theorem 1. Under Assumption 3, as well as the standard stability assumption for the iterates and bounded noise (Borkar et al., 2008), the iterates (θ_n, λ_n) converge to a fixed point (a local minimum) almost surely.

Lemma 1. Under Assumptions 1 and 2, the fixed point of Theorem 1 is a feasible solution.

The proof of Theorem 1 is provided in Appendix C and that of Lemma 1 in Appendix D.

4 Reward Constrained Policy Optimization

4.1 Actor-Critic Requirements

Recently there has been a rise in the use of Actor-Critic based approaches, for example A3C (Mnih et al., 2016), TRPO (Schulman et al., 2015a) and PPO (Schulman et al., 2017). The actor learns a policy π, whereas the critic learns the value (using temporal-difference learning, i.e. the recursive Bellman equation). While the original use of the critic was for variance reduction, it also enables training using a finite number of samples (as opposed to Monte-Carlo sampling). Our goal is to tackle general constraints (Section 2.2); as such, they are not ensured to satisfy the recursive property required to train a critic.

4.2 Penalized reward functions

We overcome this issue by training the actor (and critic) using an alternative, guiding penalty: the discounted penalty. The appropriate assumptions under which the process converges to a feasible solution are provided in Theorem 2. It is important to note that, in order to ensure constraint satisfaction, λ is still optimized using Monte-Carlo sampling on the original constraint (8).

Definition 2. The value of the discounted (guiding) penalty is defined as:

$$V^\pi_{C_\gamma}(s) \triangleq \mathbb{E}^\pi\Big[\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t) \,\Big|\, s_0 = s\Big] . \tag{9}$$

⁴ When Assumption 2 holds, λmax can be set to ∞.

Definition 3. The penalized reward functions are defined as:

$$\hat{r}(\lambda, s, a) \triangleq r(s, a) - \lambda c(s, a) , \tag{10}$$

$$\hat{V}^\pi(\lambda, s) \triangleq \mathbb{E}^\pi\Big[\sum_{t=0}^{\infty} \gamma^t \hat{r}(\lambda, s_t, a_t) \,\Big|\, s_0 = s\Big] = \mathbb{E}^\pi\Big[\sum_{t=0}^{\infty} \gamma^t \big( r(s_t, a_t) - \lambda c(s_t, a_t) \big) \,\Big|\, s_0 = s\Big] = V^\pi_R(s) - \lambda V^\pi_{C_\gamma}(s) . \tag{11}$$

As opposed to (4), for a fixed π and λ, the penalized value (11) can be estimated using a TD-learning critic. We denote the three-timescale (Constrained Actor-Critic) process, in which the actor and critic are updated following (11) and λ is updated following (5), as the Reward Constrained Policy Optimization (RCPO) algorithm. Algorithm 1 illustrates such a procedure and a full RCPO Advantage-Actor-Critic algorithm is provided in Appendix A.

Algorithm 1 Template for an RCPO implementation
1: Input: penalty c(·), constraint C(·), threshold α, learning rates η1(k) < η2(k) < η3(k)
2: Initialize actor parameters θ = θ0, critic parameters v = v0, Lagrange multiplier λ = 0
3: for k = 0, 1, ... do
4: Initialize state s0 ∼ µ
5: for t = 0, 1, ..., T − 1 do
6: Sample action at ∼ π, observe next state st+1, reward rt and penalty ct
7: R̂t = rt − λk ct + γ V̂(λ, st+1; vk)    ▷ Equation (10)
8: Critic update: vk+1 ← vk − η3(k) ∂[(R̂t − V̂(λ, st; vk))²]/∂vk    ▷ Equation (11)
9: Actor update: θk+1 ← Γθ[θk + η2(k) ∇θ V̂(λ, s)]    ▷ Equation (6)
10: Lagrange multiplier update: λk+1 ← Γλ[λk + η1(k)(J^{πθ}_C − α)]    ▷ Equation (8)
11: return policy parameters θ

Theorem 2. Denote by Θ = {θ : J^{πθ}_C ≤ α} the set of feasible solutions and by Θγ the set of local minima of J^{πθ}_{Cγ}. Assuming that Θγ ⊆ Θ, the Reward Constrained Policy Optimization (RCPO) algorithm converges almost surely to a fixed point (θ*(λ*, v*), v*(λ*), λ*) which is a feasible solution (i.e. θ* ∈ Θ).

The proof of Theorem 2 is provided in Appendix E. The assumption in Theorem 2 demands a specific correlation between the guiding penalty signal Cγ and the constraint C.
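To make the template concrete, the sketch below shows what one episode of Algorithm 1 could look like for a mean-value constraint. It is our illustration under stated assumptions, not the paper's code: `env.step` is assumed to return the per-state penalty c_t alongside the reward, `policy`/`critic` are hypothetical objects exposing `sample`, `value`, `grad_log` and `grad`, and the constant learning rates are placeholders (the convergence guarantees require decaying schedules as in Assumption 3).

```python
import numpy as np

def rcpo_episode(env, policy, critic, theta, v, lam, alpha, gamma=0.99,
                 eta1=1e-4, eta2=1e-3, eta3=1e-2, lam_max=100.0):
    """One episode of the RCPO template (Algorithm 1): critic on the fastest timescale
    (eta3), actor on the intermediate one (eta2), Lagrange multiplier on the slowest (eta1)."""
    s, done, penalties = env.reset(), False, []
    while not done:
        a = policy.sample(s, theta)                    # a_t ~ pi_theta(.|s_t)
        s_next, r, c, done = env.step(a)               # reward r_t and per-state penalty c_t
        penalties.append(c)
        r_hat = r - lam * c                            # penalized reward (Equation 10)
        target = r_hat + (0.0 if done else gamma * critic.value(s_next, v))
        delta = target - critic.value(s, v)            # TD error of V_hat (Equation 11)
        v = v + eta3 * delta * critic.grad(s, v)       # critic update (fast timescale)
        theta = theta + eta2 * delta * policy.grad_log(s, a, theta)   # actor update
        s = s_next
    # Slow timescale: Monte-Carlo estimate of the mean-value constraint J_C (Equation 8)
    J_C = float(np.mean(penalties))
    lam = float(np.clip(lam + eta1 * (J_C - alpha), 0.0, lam_max))    # Gamma_lambda projection
    return theta, v, lam
```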
Consider a robot with an average torque constraint. A policy which uses 0 torque at each time-step is a feasible solution and in turn is a local minimum of both J_C and J_{Cγ}. If such a policy is reachable from any θ (via gradient descent), this is enough in order to provide a theoretical guarantee that J_{Cγ} may be used as a guiding signal in order to converge to a fixed point which is a feasible solution.

5 Experiments

We test the RCPO algorithm in various domains: a grid-world, and 6 tasks in the Mujoco simulator (Todorov et al., 2012). The grid-world serves as an experiment to show the benefits of RCPO over the standard Primal-Dual approach (solving (4) using Monte-Carlo simulations), whereas in the Mujoco domains we compare RCPO to reward shaping, a simpler (yet common) approach, and show the benefits of an adaptive approach to defining the cost value. While we consider mean-value constraints (robotics experiments) and probabilistic constraints (i.e., Mars Rover), discounted-sum constraints can be immediately incorporated into our setup. We compare our approach with relevant baselines that can support these constraints. Discounted-sum approaches such as Achiam et al. (2017) and per-state constraints such as Dalal et al. (2018) are unsuitable for comparison given the considered constraints. See Table 1 for more details. For clarity, we provide exact details in Appendix B (architecture and simulation specifics).

Figure 1: Mars Rover domain and policy illustration for (a) α = 0.01 and (b) α = 0.5. As α decreases, the agent is required to learn a safer policy.

Figure 2: RCPO vs. Lagrange comparison. The reward is minus the average number of steps it takes to reach the goal. Results are considered valid if and only if they are at or below the threshold.

5.1 Mars Rover

5.1.1 Domain Description

The rover (red square) starts at the top left, a safe region of the grid, and is required to travel to the goal (orange square), which is located in the top right corner. The transition function is stochastic: the rover will move in the selected direction with probability 1 − δ and randomly otherwise. On each step, the agent receives a small negative reward r_step and upon reaching the goal state a reward r_goal. Crashing into a rock (yellow) causes the episode to terminate and provides a negative reward −λ. The domain is inspired by the Mars Rover domain presented in Chow et al. (2015). It is important to note that the domain is built such that a shorter path induces higher risk (more rocks along the path). Given a minimal failure threshold α ∈ (0, 1), the task is to find λ such that, when solving for parameters δ, r_step, r_goal and λ, the policy will induce a path with P^{πθ}_µ(failure) ≤ α; i.e., find the shortest path while ensuring that the probability of failure is less than or equal to α.

5.1.2 Experiment Description

As this domain is characterized by a discrete action space, we solve it using the A2C algorithm (a synchronous version of A3C (Mnih et al., 2016)). We compare RCPO, using the discounted penalty Cγ, with direct optimization of the Lagrange dual form (4).

5.1.3 Experiment Analysis

Figure 1 illustrates the domain and the policies the agent has learned based on different safety requirements. Learning curves are provided in Figure 2.
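For concreteness, a minimal environment sketch consistent with the domain description in Section 5.1.1 is given below. This is our illustration, not the paper's code: the grid layout, API and crash penalty `lam` are assumptions, while r_step, r_goal and δ follow Appendix B.

```python
import numpy as np

class MarsRoverGrid:
    """Minimal sketch of a Mars-Rover-style grid-world (Section 5.1). The layout is
    illustrative only: '.' free cell, 'R' rock (crash, terminal), 'G' goal (terminal).
    r_step, r_goal and delta follow Appendix B; lam is the crash penalty being tuned."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right

    def __init__(self, grid, r_step=-0.01, r_goal=0.0, lam=1.0, delta=0.05, seed=0):
        self.grid = [list(row) for row in grid]
        self.r_step, self.r_goal, self.lam, self.delta = r_step, r_goal, lam, delta
        self.rng = np.random.default_rng(seed)

    def reset(self, start=(0, 0)):
        self.pos = start
        return self.pos

    def step(self, a):
        # With probability 1 - delta move in the chosen direction, otherwise move randomly.
        if self.rng.random() < self.delta:
            a = int(self.rng.integers(len(self.ACTIONS)))
        dr, dc = self.ACTIONS[a]
        r = min(max(self.pos[0] + dr, 0), len(self.grid) - 1)
        c = min(max(self.pos[1] + dc, 0), len(self.grid[0]) - 1)
        self.pos = (r, c)
        cell = self.grid[r][c]
        if cell == 'R':                                    # crash: the failure event
            return self.pos, -self.lam, True, {'failure': 1.0}
        if cell == 'G':                                    # goal reached
            return self.pos, self.r_goal, True, {'failure': 0.0}
        return self.pos, self.r_step, False, {'failure': 0.0}
```

Averaging the terminal `failure` indicator over episodes gives a Monte-Carlo estimate of P^{πθ}_µ(failure), the quantity constrained to be at most α.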
The experiments show that, for both scenarios (α = 0.01 and α = 0.5), RCPO is characterized by faster convergence (improved sample efficiency) and lower variance (a more stable learning regime).

Table 2: Comparison between RCPO and reward shaping with a torque constraint < 25%.

| | Swimmer-v2 Torque | Swimmer-v2 Reward | Walker2d-v2 Torque | Walker2d-v2 Reward | Hopper-v2 Torque | Hopper-v2 Reward |
|---|---|---|---|---|---|---|
| λ = 0 | 30.4% | 94.4 | 24.6% | 3364.1 | 31.5% | 2610.7 |
| λ = 0.00001 | 37.4% | 65.1 | 28.4% | 3198.9 | 31.4% | 1768.2 |
| λ = 0.1 | 32.8% | 16.5 | 13.6% | 823.5 | 15.7% | 865.95 |
| λ = 100 | 2.4% | 11.7 | 17.8% | 266.1 | 14.3% | 329.4 |
| RCPO (ours) | 24% | 72.7 | 25.2% | 591.6 | 26% | 1138.55 |

| | Humanoid-v2 Torque | Humanoid-v2 Reward | HalfCheetah-v2 Torque | HalfCheetah-v2 Reward | Ant-v2 Torque | Ant-v2 Reward |
|---|---|---|---|---|---|---|
| λ = 0 | 28.6% | 617.1 | 37.8% | 2989.5 | 36.7% | 1313.1 |
| λ = 0.00001 | 28.1% | 617.1 | 40.8% | 2462.3 | 35.9% | 1233.5 |
| λ = 0.1 | 28.5% | 1151.8 | 13.87% | -0.4 | 16.6% | 1012.2 |
| λ = 100 | 30.5% | 119.4 | 13.9% | -2.4 | 16.7% | 957.2 |
| RCPO (ours) | 24.3% | 606.1 | 26.7% | 1547.1 | 15.2% | 1031.5 |

5.2 Robotics

5.2.1 Domain Description

Todorov et al. (2012), Brockman et al. (2016) and OpenAI (2017) provide interfaces for training agents in complex control problems. These tasks attempt to imitate scenarios encountered by robots in real life, such as teaching a humanoid robot to stand up, walk, and more. The robot is composed of n joints; the state S ∈ R^{n×5} is composed of the coordinates (x, y, z) and angular velocities (ωθ, ωφ) of each joint. At each step the agent selects the amount of torque to apply to each joint. We chose to use PPO (Schulman et al., 2017) in order to cope with the continuous action space.

5.2.2 Experiment Description

In the following experiments, the aim is to prolong the motor life of the various robots, while still enabling the robot to perform the task at hand. To do so, the robot motors need to be constrained from using high torque values. This is accomplished by defining the constraint C as the average torque the agent has applied to each motor, and the per-state penalty c(s, a) becomes the amount of torque the agent decided to apply at each time step. We compare RCPO to the reward shaping approach, in which the different values of λ are selected a priori and remain constant.

5.2.3 Experiment Analysis

Learning curves are provided in Figure 3 and the final values in Table 2.

Figure 3: Mujoco with torque constraints. The dashed line represents the maximal allowed value. Results are considered valid only if they are at or below the threshold. RCPO is our approach, whereas each λ value is a PPO simulation with a fixed penalty coefficient. The Y axis is the average reward and the X axis represents the number of samples (steps).

It is important to note that by preventing the agent from using high torque levels (limiting the space of admissible policies), the agent may only be able to achieve a sub-optimal policy. RCPO aims to find the best performing policy given the constraints; that is, the policy that achieves maximal value while at the same time satisfying the constraints. Our experiments show that:

1. In all domains, RCPO finds a feasible (or near-feasible) solution and, besides the Walker2d-v2 domain, exhibits superior performance when compared to the relevant reward shaping variants (constant λ values resulting in constraint satisfaction)⁵.

2. Selecting a constant coefficient λ such that the policy satisfies the constraint is not a trivial task, resulting in different results across domains (Achiam et al., 2017).

⁵ In the Hopper-v2 domain, it is not clear which is the superior method, RCPO or the λ = 0.1 reward shaping variant.
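As an illustration of how the mean-value torque constraint of Section 5.2.2 can be computed, the sketch below is ours, not the paper's code: the per-state penalty is taken as the mean absolute torque normalized by an assumed maximum torque, and the trajectory average is checked against the α = 0.25 threshold of Table 2. The random trajectory at the end is purely illustrative.

```python
import numpy as np

def torque_penalty(action, max_torque=1.0):
    """Per-state penalty c(s, a): mean absolute torque, as a fraction of the assumed maximum."""
    return float(np.mean(np.abs(action)) / max_torque)

def constraint_estimate(actions, alpha=0.25):
    """Mean-value constraint J_C = E[(1/T) sum_t c_t]; satisfied if J_C <= alpha."""
    J_C = float(np.mean([torque_penalty(a) for a in actions]))
    return J_C, J_C <= alpha

# Illustrative usage: a trajectory of random torques in [-1, 1] for a 6-joint robot.
rng = np.random.default_rng(0)
trajectory = [rng.uniform(-1, 1, size=6) for _ in range(1000)]
J_C, ok = constraint_estimate(trajectory, alpha=0.25)
print(f"estimated J_C = {J_C:.3f}, constraint satisfied: {ok}")
```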
5.2.4 The Drawbacks of Reward Shaping

When performing reward shaping (selecting a fixed λ value), the experiments show that in domains where the agent attains a high value, the penalty coefficient is required to be larger in order for the solution to satisfy the constraints. However, in domains where the agent attains a relatively low value, the same penalty coefficients can lead to drastically different behavior, often with severely sub-optimal solutions (e.g. Ant-v2 compared to Swimmer-v2).

Additionally, in RL the value J^π_R increases as training progresses. This suggests that a non-adaptive approach is prone to converge to sub-optimal solutions: when the penalty is large, it is plausible that at the beginning of training the agent will only focus on constraint satisfaction and ignore the underlying reward signal, quickly converging to a local minimum.

6 Discussion

We introduced a novel constrained actor-critic approach, named Reward Constrained Policy Optimization (RCPO). RCPO uses a multi-timescale approach: on the fast timescale an alternative, discounted, objective is estimated using a TD-critic; on the intermediate timescale the policy is learned using policy gradient methods; and on the slow timescale the penalty coefficient λ is learned by ascending on the original constraint. We validate our approach using simulations on both grid-world and robotics domains and show that RCPO converges in a stable and sample-efficient manner to a constraint-satisfying policy.

An exciting extension of this work is the combination of RCPO with CPO (Achiam et al., 2017). As they consider the discounted penalty, our guiding signal, it might be possible to combine both approaches. Such an approach would be able to solve complex constraints while enjoying feasibility guarantees during training.

7 Acknowledgements

The authors would like to thank Nadav Merlis for the insightful discussions and helpful remarks during the writing process.

References

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.

Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253–279, 2013.

D. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, Massachusetts, 1999.

Shalabh Bhatnagar and K Lakshmanan. An online actor-critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3):688–708, 2012.

Vivek S Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.

Vivek S Borkar et al. Stochastic approximation. Cambridge Books, 2008.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.

Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.

Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine.
Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3389–3396. IEEE, 2017.

Chen Hou and Qianchuan Zhao. Optimization of web service-based control system for balance between network traffic and delay. IEEE Transactions on Automation Science and Engineering, 2017.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.

Ilya Kostrikov. PyTorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr, 2018.

Iordanis Koutsopoulos and Leandros Tassiulas. Control and optimization meet the smart power grid: Scheduling of power demands for optimal energy management. In Proceedings of the 2nd International Conference on Energy-efficient Computing and Networking, pages 41–50. ACM, 2011.

Pavlo Krokhmal, Jonas Palmquist, and Stanislav Uryasev. Portfolio optimization with conditional value-at-risk objective and constraints. Journal of Risk, 4:43–68, 2002.

Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.

Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.

Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.

Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055, 2018.

Shie Mannor and Nahum Shimkin. A geometric approach to multi-criterion reinforcement learning. Journal of Machine Learning Research, 5(Apr):325–360, 2004.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

OpenAI. Roboschool. https://github.com/openai/roboschool, 2017.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. arXiv preprint arXiv:1804.02717, 2018.

LA Prashanth and Mohammad Ghavamzadeh. Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning, 105(3):367–417, 2016.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015a.

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.

Aviv Tamar and Shie Mannor. Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697, 2013.

Aviv Tamar, Dotan Di Castro, and Shie Mannor. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pages 387–396, 2012.

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

Kristof Van Moffaert and Ann Nowé. Multi-objective reinforcement learning using sets of Pareto dominating policies. The Journal of Machine Learning Research, 15(1):3483–3512, 2014.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.

A RCPO Algorithm

Algorithm 2 RCPO Advantage Actor Critic. The original Advantage Actor Critic algorithm is in gray, whereas our additions are highlighted in black.
1: Input: penalty function C(·), threshold α and learning rates η1, η2, η3
2: Initialize actor π(·|·; θp) and critic V(·; θv) with random weights
3: Initialize λ = 0, t = 0, s0 ∼ µ    ▷ Restart
4: for T = 1, 2, ..., Tmax do
5: Reset gradients dθv ← 0, dθp ← 0 and ∀i: dλi ← 0
6: tstart = t
7: while st not terminal and t − tstart < tmax do
8: Perform at according to policy π(at|st; θp)
9: Receive rt, st+1 and penalty score Ĉt
10: t ← t + 1
11: R = 0 for terminal st, otherwise R = V(st, θv)
12: for τ = t − 1, t − 2, ..., tstart do
13: R ← rτ − λĈτ + γR    ▷ Equation (10)
14: dθp ← dθp + ∇θp log π(aτ|sτ; θp)(R − V(sτ; θv))
15: dθv ← dθv + ∂(R − V(sτ; θv))²/∂θv
16: if st is terminal state then
17: dλ ← −(C − α)    ▷ Equation (8)
18: t ← 0
19: s0 ∼ µ    ▷ Restart
20: Update θv, θp and λ
21: Set λ = max(λ, 0)    ▷ Ensure the Lagrange multiplier is non-negative (Equation 4)

B Experiment details

B.1 Mars Rover

The MDP was defined as follows: r_step = −0.01, r_goal = 0, δ = 0.05, γ = 0.99. In order to avoid the issue of exploration in this domain, we employ a linearly decaying random restart (Kakade and Langford, 2002). µ, the initial state distribution, follows the following rule: s0 is drawn uniformly from S̄ with a probability that decays linearly with the iteration number, and s0 = s* otherwise, where S̄ denotes all the non-terminal states in the state space and s* is the state at the top left corner (red in Figure 1). Initially the agent starts at a random state, effectively improving the exploration and reducing convergence time. As training progresses, with increasing probability, the agent starts at the top left corner, the state which we test against.

The A2C architecture is the standard non-recurrent architecture, where the actor and critic share the internal representation and only hold a separate final projection layer. The input is fully observable, being the whole grid.
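A minimal PyTorch sketch of this shared-representation network is given below. It is our reconstruction from the table that follows, not the released code; the 29x29 grid size is an assumption, chosen only so that the flattened feature size matches the 288 units in the table (the code infers the size for any compatible grid).

```python
import torch
import torch.nn as nn

class MarsRoverA2C(nn.Module):
    """Sketch of the shared-representation A2C network for the Mars Rover domain.
    Layer sizes follow the table below; the grid size is an assumption."""
    def __init__(self, grid_size=29, n_actions=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=3), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=2, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                      # infer the flattened feature size (288 for 29x29)
            n_feat = self.body(torch.zeros(1, 1, grid_size, grid_size)).shape[1]
        self.actor = nn.Sequential(nn.Linear(n_feat, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(n_feat, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, grid):                       # grid: (batch, 1, H, W)
        h = self.body(grid)
        return self.actor(h), self.critic(h)       # action logits and state value
```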
The network is as follows:

| Layer | Actor | Critic |
|---|---|---|
| 1 | CNN (input layers = 1, output layers = 16, kernel size = 5, stride = 3) | shared with actor |
| 2 | CNN (input layers = 16, output layers = 32, kernel size = 3, stride = 2) | shared with actor |
| 3 | CNN (input layers = 32, output layers = 32, kernel size = 2, stride = 1) | shared with actor |
| 4 | Linear(input = 288, output = 64) | Linear(input = 288, output = 64) |
| 5 | Linear(input = 64, output = 4) | Linear(input = 64, output = 1) |
| LR | 1e-3 | 5e-4 |

Between the layers we apply a ReLU non-linearity. As performance is noisy on such risk-sensitive environments, we evaluated the agent every 5120 episodes for a length of 1024 episodes. To reduce the initial convergence time, we start λ at 0.6 and use a learning rate lr_λ = 0.000025.

B.2 Robotics

For these experiments we used a PyTorch (Paszke et al., 2017) implementation of PPO (Kostrikov, 2018). Notice that as in each domain the state represents the location and velocity of each joint, the number of inputs differs between domains. The network is as follows:

| Layer | Actor | Critic |
|---|---|---|
| 1 | Linear(input = x, output = 64) | Linear(input = x, output = 64) |
| 2 | Linear(input = 64, output = 64) | Linear(input = 64, output = 64) |
| 3 | DiagGaussian(input = 64, output = y) | Linear(input = 64, output = 1) |
| LR | 3e-4 | 1.5e-4 |

where DiagGaussian is a multivariate Gaussian distribution layer which learns a mean (as a function of the previous layer's output) and a std per motor, from which the torque is sampled. Between each layer, a Tanh non-linearity is applied.

We report the online performance of the agent and run each test for a total of 1M samples. In these domains we start λ at 0 and use a learning rate lr_λ = 5e-7 which decays at a rate of κ = (1 − 1e-9) in order to avoid oscillations. The simulations were run using Generalized Advantage Estimation (Schulman et al., 2015b) with coefficient τ = 0.95 and discount factor γ = 0.99.

C Proof of Theorem 1

We provide a brief proof for clarity. We refer the reader to Chapter 6 of Borkar et al. (2008) for a full proof of convergence for two-timescale stochastic approximation processes.

Initially, we assume nothing regarding the structure of the constraint and as such λmax is given some finite value. The special case in which Assumption 2 holds is handled in Lemma 1.

The proof of convergence to a local saddle point of the Lagrangian (4) contains the following main steps:

1. Convergence of the θ-recursion: We utilize the fact that, owing to projection, the θ parameter is stable. We show that the θ-recursion tracks an ODE in the asymptotic limit, for any given value of λ on the slowest timescale.

2. Convergence of the λ-recursion: This step is similar to earlier analyses for constrained MDPs. In particular, we show that the λ-recursion in (4) converges and the overall convergence of (θk, λk) is to a local saddle point (θ*(λ*), λ*) of L(λ, θ).

Step 1: Due to the timescale separation, we can assume that the value of λ (updated on the slower timescale) is constant. As such it is clear that the following ODE governs the evolution of θ:

$$\dot{\theta}_t = \Gamma_\theta\big( \nabla_\theta L(\lambda, \theta_t) \big) , \tag{12}$$

where Γ_θ is a projection operator which ensures that the evolution of the ODE stays within the compact and convex set Θ := Π_{i=1}^{k} [θ^i_{min}, θ^i_{max}]. As λ is considered constant, the process over θ is:

$$\theta_{k+1} = \Gamma_\theta\big[\theta_k + \eta_2(k) \nabla_\theta L(\lambda, \theta_k)\big] = \Gamma_\theta\big[\theta_k + \eta_2(k) \nabla_\theta \mathbb{E}^{\pi_\theta}_{s \sim \mu}\big[\log \pi(s, a; \theta)\,[R(s) - \lambda C(s)]\big]\big] .$$

Thus (6) can be seen as a discretization of the ODE (12). Finally, using the standard stochastic approximation arguments from Borkar et al. (2008) concludes Step 1.
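As a concrete illustration (our addition, not part of the original analysis), one admissible choice of step sizes is

$$\eta_1(k) = \frac{1}{1+k}, \qquad \eta_2(k) = \frac{1}{(1+k)^{3/4}}, \qquad \eta_3(k) = \frac{1}{(1+k)^{3/5}} ,$$

for which each sum diverges, each sum of squares converges, and η1(k)/η2(k) = (1+k)^(-1/4) → 0 as well as η2(k)/η3(k) = (1+k)^(-3/20) → 0. This makes explicit the timescale separation used in Steps 1 and 2, and, with η3 added for the critic, the three-timescale ordering of Algorithm 1 and Appendix E.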
Step 2: We start by showing that the λ-recursion converges and then show that the whole process converges to a local saddle point of L(λ, θ). The process governing the evolution of λ,

$$\lambda_{k+1} = \Gamma_\lambda\big[\lambda_k - \eta_1(k) \nabla_\lambda L(\lambda_k, \theta(\lambda_k))\big] = \Gamma_\lambda\big[\lambda_k + \eta_1(k)\big(\mathbb{E}^{\pi_{\theta(\lambda_k)}}_{s \sim \mu}[C(s)] - \alpha\big)\big] ,$$

where θ(λk) is the limiting point of the θ-recursion corresponding to λk, can be seen as a discretization of the following ODE:

$$\dot{\lambda}_t = \Gamma_\lambda\big( -\nabla_\lambda L(\lambda_t, \theta(\lambda_t)) \big) . \tag{13}$$

As shown in Chapter 6 of Borkar et al. (2008), (λn, θn) converges to the internally chain transitive invariant sets of the ODE (13), with θ̇_t = 0. Thus, (λn, θn) → {(λ(θ), θ) : θ ∈ R^k} almost surely. Finally, as seen in Theorem 2 of Chapter 2 of Borkar et al. (2008), if θn → θ* a.s. then λn → λ(θ*) a.s., which completes the proof.

D Proof of Lemma 1

The proof is obtained by a simple extension to that of Theorem 1. Assumption 2 states that any local minimum πθ of (2) satisfies the constraints, i.e. J^{πθ}_C ≤ α; additionally, Lee et al. (2017) show that first-order methods such as gradient descent converge almost surely to a local minimum (avoiding saddle points and local maxima). Hence for λmax = ∞ (an unbounded Lagrange multiplier), the process converges to a fixed point (θ*(λ*), λ*) which is a feasible solution.

E Proof of Theorem 2

As opposed to Theorem 1, in this case we are considering a three-timescale stochastic approximation scheme (the previous theorem considered two timescales). The proof is similar in essence to that of Prashanth and Ghavamzadeh (2016). The full process is described as follows:

$$\lambda_{k+1} = \Gamma_\lambda\big[\lambda_k + \eta_1(k)\big(\mathbb{E}^{\pi_{\theta(\lambda_k)}}_{s \sim \mu}[C(s)] - \alpha\big)\big] ,$$

$$\theta_{k+1} = \Gamma_\theta\big[\theta_k + \eta_2(k) \nabla_\theta \mathbb{E}^{\pi_\theta}_{s \sim \mu}\big[\log \pi(s, a; \theta)\, \hat{V}(\lambda, s_t; v_k)\big]\big] ,$$

$$v_{k+1} = v_k - \eta_3(k)\, \partial\big[\big(\hat{r} + \gamma \hat{V}(\lambda, s'; v_k) - \hat{V}(\lambda, s; v_k)\big)^2\big] / \partial v_k .$$

Step 1: The value vk runs on the fastest timescale, hence it observes θ and λ as static. As the TD operator is a contraction, we conclude that vk → v(λ, θ).

Step 2: For the policy recursion θk, due to the timescale differences, we can assume that the critic v has converged and that λ is static. Thus, as seen in the proof of Theorem 1, θk converges to the fixed point θ(λ, v).

Step 3: As shown previously (and in Prashanth and Ghavamzadeh (2016)), (λn, θn, vn) → (λ(θ*), θ*, v(θ*)) a.s.

Denote by Θ = {θ : J^{πθ}_C ≤ α} the set of feasible solutions and by Θγ the set of local minima of J^{πθ}_{Cγ}. We recall the assumption stated in Theorem 2:

Assumption 4. Θγ ⊆ Θ.

Given that the assumption above holds, we may conclude that for λmax → ∞, the set of stationary points of the process is limited to a subset of the feasible solutions of (4). As such, the process converges a.s. to a feasible solution.

We finish by providing intuition regarding the behavior in case the assumptions do not hold.

1. Assumption 2 does not hold: Gradient descent algorithms descend until reaching a (local) stationary point. In such a scenario, the algorithm is only ensured to converge to some stationary solution, yet said solution is not necessarily a feasible one. As such, we can only treat the constraint as a regularizing term for the policy, in which λmax defines the maximal regularization allowed.

2. Assumption 4 does not hold: In this case, it is not safe to assume that the gradient of (2) may be used as a guide for solving (3). A Monte-Carlo approach may be used (as seen in Section 5.1) to approximate the gradients; however, this does not enjoy the benefits of reduced variance and of training from a finite number of samples (due to the lack of a critic).