Lexicographic Multi-Objective Reinforcement Learning

Joar Skalse, Lewis Hammond, Charlie Griffin and Alessandro Abate
Department of Computer Science, University of Oxford
{joar.skalse, lewis.hammond, charlie.griffin, aabate}@cs.ox.ac.uk

Abstract

In this work we introduce reinforcement learning techniques for solving lexicographic multi-objective problems. These are problems that involve multiple reward signals, and where the goal is to learn a policy that maximises the first reward signal and, subject to this constraint, also maximises the second reward signal, and so on. We present a family of both action-value and policy gradient algorithms that can be used to solve such problems, and prove that they converge to policies that are lexicographically optimal. We evaluate the scalability and performance of these algorithms empirically, demonstrating their practical applicability. As a more specific application, we show how our algorithms can be used to impose safety constraints on the behaviour of an agent, and compare their performance in this context with that of other constrained reinforcement learning algorithms.

1 Introduction

Reinforcement learning (RL) algorithms learn to solve tasks in unknown environments by a process of trial and error, where the task is typically encoded as a scalar reward function. However, there are tasks for which it is difficult (or even infeasible) to create such a function. Consider, for example, Isaac Asimov's Three Laws of Robotics: the task of following these laws involves multiple (possibly conflicting) objectives, some of which are lexicographically (i.e. categorically) more important than others. There is, in general, no straightforward way to write a scalar reward function that encodes such a task without ever incentivising the agent to prioritise less important objectives. In such cases, it is difficult (and often unsuitable) to apply standard RL algorithms.

In this work, we introduce several RL techniques for solving lexicographic multi-objective problems. More precisely, we present both a family of action-value algorithms and a family of policy gradient algorithms that can accept multiple reward functions R_1, . . . , R_m, and that learn a policy π such that π maximises expected discounted R_1-reward and, among all policies that do so, π also maximises expected discounted R_2-reward, and so on. These techniques can easily be combined with a wide range of existing RL algorithms. We also prove the convergence of our algorithms, and benchmark them against state-of-the-art methods for constrained reinforcement learning in a number of environments.

1.1 Related Work

Lexicographic optimisation in multi-objective RL (MORL) has previously been studied by [Gábor et al., 1998], whose algorithm is a special case of one of ours (cf. Footnote 3). Our contribution extends this work to general, state-of-the-art RL algorithms. Unlike [Gábor et al., 1998], we also prove that our algorithms converge to the desired policies, and provide benchmarks against other state-of-the-art algorithms in more complex environments. Other MORL algorithms combine and trade off rewards in different ways; for an overview, see [Roijers et al., 2013; Liu et al., 2015]. Lexicographic optimisation more generally is a long-studied problem; see, e.g., [Mitten, 1974; Rentmeesters et al., 1996; Wray and Zilberstein, 2015].
A natural application of lexicographic RL (LRL) is to learn a policy that maximises a performance metric, subject to satisfying a safety constraint. This setup has been tackled with dynamic programming in [Lesser and Abate, 2018], and has also been studied within RL. For example, [Tessler et al., 2019] introduce an algorithm that maximises a reward subject to the constraint that the expectation of an additional penalty signal should stay below a certain threshold; [Chow et al., 2017] introduce techniques to maximise a reward subject to constraints on the value-at-risk (VaR), or the conditional value-at-risk (CVaR), of a penalty signal; and [Miryoosefi et al., 2019] discuss an algorithm that accepts an arbitrary number of reward signals, and learns a policy whose expected discounted reward vector lies inside a given convex set. Our contributions add to this literature and, unlike the methods above, allow us to encode safety constraints in a principled way without prior knowledge of the level of safety that can be attained in the environment. Note that lexicographic optimisation of two rewards is qualitatively different from maximising one reward subject to a constraint on the second, and thus the limit policies of LRL and the algorithms above will not, in general, be the same. Other methods also emphasise staying safe while learning; see, e.g., [Achiam et al., 2017; Thomas et al., 2013; Polymenakos et al., 2019]. In contrast, our algorithms do not guarantee safety while learning, but rather learn a safe limit policy.

2 Background

Reinforcement Learning. The RL setting is usually formalised as a Markov Decision Process (MDP), which is a tuple ⟨S, A, T, I, R, γ⟩ where S is a set of states, A is a set of actions, T : S × A ⇝ S is a transition function, I is an initial state distribution over S, R : S × A × S → ℝ is a reward function, where R(s, a, s′) is the reward obtained if the agent moves from state s to s′ by taking action a, and γ ∈ [0, 1] is a discount factor. Here, f : X ⇝ Y denotes a probabilistic mapping f from X to Y. A state is terminal if T(s, a) = s and R(s, a, s) = 0 for all a. A (stationary) policy is a mapping π : S ⇝ A that specifies a distribution over the agent's actions in each state. The value function v^π(s) of π is defined as the expected γ-discounted cumulative reward when following π from s, i.e.

$$v^\pi(s) := \mathbb{E}_\pi\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s\Big].$$

When γ = 1, we instead consider the limit-average of this expectation. The objective in RL can then be expressed as maximising J(π) := Σ_s I(s) v^π(s). Given a policy π we may also define the q-function q^π(s, a) := E_{s′∼T(s,a)}[R(s, a, s′) + γ v^π(s′)] and the advantage function a^π(s, a) := q^π(s, a) − v^π(s).

Value-Based Methods. A value-based agent has two main components: a Q-function Q : S × A → ℝ that predicts the expected future discounted reward conditional on taking a particular action in a particular state; and a bandit algorithm that is used to select actions in each state. The Q-function can be represented as a lookup table (in which case the agent is tabular), or as a function approximator. There are many ways to update the Q-function. One popular rule is Q-Learning [Watkins, 1986]:

$$Q(s_t, a_t) \leftarrow \big(1 - \alpha_t(s_t, a_t)\big)\, Q(s_t, a_t) + \alpha_t(s_t, a_t)\,\Big(r_t + \gamma \max_{a} Q(s_{t+1}, a)\Big),$$

where t is the time step and α_t(s_t, a_t) is a learning rate.
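As a concrete illustration of this update rule (a minimal sketch of our own, not the implementation accompanying the paper), the snippet below performs one tabular Q-Learning step; the array layout, the fixed learning rate, and the example dimensions are assumptions made purely for illustration.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-Learning update:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target

# Hypothetical tabular problem with 10 states and 4 actions.
Q = np.zeros((10, 4))
q_learning_step(Q, s=3, a=1, r=0.5, s_next=7, alpha=0.1, gamma=0.99)
```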
One can replace the term max_a Q(s_{t+1}, a) in the rule above with Q(s_{t+1}, a_{t+1}) or E_{a∼π(s_{t+1})}[Q(s_{t+1}, a)] (where π is the policy that describes the agent's current behaviour) to obtain the SARSA [Rummery and Niranjan, 1994] or the Expected SARSA [van Seijen et al., 2009] updates respectively.

Policy-Based Methods. In these methods, the policy π(·; θ) is differentiable with respect to some θ ∈ Θ ⊆ ℝ^x, and θ is updated according to an objective K(θ). If using K^A2C(θ) := J(θ) then we may estimate this using:

$$\hat{K}^{A2C}(\theta) := \hat{\mathbb{E}}_t\big[\log \pi(a_t \mid s_t; \theta)\, A^\theta(s_t, a_t)\big],$$

where A^θ is an estimate of a^θ. One often computes A^θ by approximating v^θ with a function V parameterised by some w ∈ W ⊆ ℝ^y, and using the fact that the expected temporal difference error δ_t := r_t + γ v^θ(s_{t+1}) − v^θ(s_t) (or r_t + v^θ(s_{t+1}) − v^θ(s_t) − J(θ) when γ = 1) equals a^θ(s_t, a_t) [Bhatnagar et al., 2009]. Such algorithms are known as Actor-Critic (AC) algorithms [Konda and Tsitsiklis, 2000].[1]

[1] Due to the choice of baseline [Sutton et al., 1999], we describe here the classic Advantage Actor-Critic (A2C) algorithm.

More recently, other policy gradient algorithms have used surrogate objective functions, which increase stability in training by penalising large steps in policy space, and can be viewed as approximating the natural policy gradient [Amari, 1998; Kakade, 2001]. One common such penalty is the Kullback-Leibler (KL) divergence between new and old policies, as employed in one version of Proximal Policy Optimisation (PPO) [Schulman et al., 2017], leading to:

$$K^{PPO}(\theta) := \hat{\mathbb{E}}_t\left[\frac{\pi(a_t \mid s_t; \theta)}{\pi(a_t \mid s_t; \theta_{\mathrm{old}})}\, A^\theta(s_t, a_t) - \kappa\, D_{\mathrm{KL}}\big(\pi(s_t; \theta) \,\|\, \pi(s_t; \theta_{\mathrm{old}})\big)\right],$$

where κ is a scalar weight. Such algorithms enjoy both state-of-the-art performance and strong convergence guarantees [Hsu et al., 2020; Liu et al., 2019].

Multi-Objective Reinforcement Learning. MORL is concerned with policy synthesis under multiple objectives. This setting can be formalised as a multi-objective MDP (MOMDP), which is a tuple ⟨S, A, T, I, R, γ⟩ that is defined analogously to an MDP, but where R : S × A × S → ℝ^m returns a vector of m rewards, and γ ∈ [0, 1]^m defines m discount rates. We define R_i as (s, a, s′) ↦ R(s, a, s′)_i.

3 Lexicographic Reinforcement Learning

In this section we present a family of value-based and policy-based algorithms that solve lexicographic multi-objective problems by learning a lexicographically optimal policy. Given a MOMDP M with m rewards, we say that a policy π is (globally) lexicographically ϵ-optimal if π ∈ Π^ϵ_m, where Π^ϵ_0 = Π is the set of all policies in M, $\Pi^\epsilon_{i+1} := \{\pi \in \Pi^\epsilon_i \mid \max_{\pi' \in \Pi^\epsilon_i} J_i(\pi') - J_i(\pi) \leq \epsilon_i\}$, and $\mathbb{R}^{m-1} \ni \epsilon \geq 0$. We similarly write $\Theta^\epsilon_{i+1}$ to define global lexicographic ϵ-optima for parametrised policies, but also $\tilde{\Theta}^\epsilon_{i+1} := \{\theta \in \tilde{\Theta}^\epsilon_i \mid \max_{\theta' \in N_i(\theta)} J_i(\theta') - J_i(\theta) \leq \epsilon_i\}$ to define local lexicographic ϵ-optima, where $N_i(\theta) \subseteq \tilde{\Theta}^\epsilon_i$ is a compact local neighbourhood of θ, and $\tilde{\Theta}^\epsilon_0 = \tilde{\Theta}^\epsilon_1 = \Theta$. When ϵ = 0 we drop it from our notation and refer to lexicographic optima and lexicographically optimal policies simpliciter.

3.1 Value-Based Algorithms

We begin by introducing bandit algorithms that take as input multiple Q-functions and converge to taking lexicographically optimal actions.

Definition 1 (Lexicographic Bandit Algorithm). Let S be a set of states, A a set of actions, Q_1, . . . , Q_m : S × A → ℝ a sequence of Q-functions, and t ∈ ℕ a time parameter. A lexicographic bandit algorithm with tolerance τ ∈ ℝ_{>0} is a function B : (S × A → ℝ)^m × S × ℕ → A such that

$$\lim_{t\to\infty} \Pr\big(B(Q_1, \ldots, Q_m, s, t) \in A^\tau_{s,m}\big) = 1,$$

where $A^\tau_{s,0} = A$ and $A^\tau_{s,i+1} := \{a \in A^\tau_{s,i} \mid Q_i(s, a) \geq \max_{a' \in A^\tau_{s,i}} Q_i(s, a') - \tau\}$.
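To make Definition 1 concrete, the sketch below (our own illustration, not the authors' code) filters the action set by each Q-function in priority order with tolerance τ, and wraps the filter in an ϵ-greedy rule in the spirit of Algorithm 1 below; representing each Q_i as a dictionary keyed by state-action pairs is an assumption made for the example.

```python
import random

def lexicographic_filter(Qs, s, actions, tau):
    """Keep only actions within tau of the best value for each Q_i, applied in
    priority order (Q_1 first), as in Definition 1."""
    permitted = list(actions)
    for Q in Qs:
        best = max(Q[(s, a)] for a in permitted)
        permitted = [a for a in permitted if Q[(s, a)] >= best - tau]
    return permitted

def lexicographic_epsilon_greedy(Qs, s, actions, tau, eps):
    """With probability eps explore uniformly; otherwise choose uniformly among
    the (tau-)lexicographically optimal actions."""
    if random.random() < eps:
        return random.choice(list(actions))
    return random.choice(lexicographic_filter(Qs, s, actions, tau))
```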
Intuitively, a lexicographic bandit algorithm will, in the limit, pick an action a such that a maximises Q_1 (with tolerance τ) and, among all actions that do this, a also maximises Q_2 (with tolerance τ), and so on. An example of a lexicographic bandit algorithm is given in Algorithm 1, where the exploration probabilities ϵ_{s,t} should satisfy lim_{t→∞} ϵ_{s,t} = 0 and Σ_{t=0}^∞ ϵ_{s,t} = ∞ for all s ∈ S.

Algorithm 1 Lexicographic ϵ-Greedy
input: Q_1, . . . , Q_m, s, t
1: with probability ϵ_{s,t} do a ← unif(A)
2: else
3:   A′ ← A
4:   for i ∈ {1, . . . , m} do
5:     x ← max_{a′∈A′} Q_i(s, a′)
6:     A′ ← {a ∈ A′ | Q_i(s, a) ≥ x − τ}
7:   a ← unif(A′)
8: return a

We can now introduce Algorithm 2 (VB-LRL), a value-based algorithm for lexicographic multi-objective RL. Here B is any lexicographic bandit algorithm.

Algorithm 2 Value-Based Lexicographic RL
input: M = ⟨S, A, T, I, R, γ⟩
1: initialise Q_1, . . . , Q_m, t ← 0, s ∼ I
2: while Q_1, . . . , Q_m have not converged do
3:   a ← B(Q_1, . . . , Q_m, s, t)    ▷ Algorithm 1
4:   s′ ∼ T(s, a)
5:   for i ∈ {1, . . . , m} do
6:     update Q_i
7:   if s′ is terminal then s ∼ I else s ← s′
8:   t ← t + 1
9: return π = s ↦ lim_{t→∞} B(Q_1, . . . , Q_m, s, t)

The rule for updating the Q-values (on line 6) can be varied. We call the following update rule Lexicographic Q-Learning:

$$Q_i(s, a) \leftarrow \big(1 - \alpha_t(s, a)\big)\, Q_i(s, a) + \alpha_t(s, a)\,\Big(R_i(s, a, s') + \gamma_i \max_{a' \in A^\tau_{s',i}} Q_i(s', a')\Big),$$

where $A^\tau_{s,0} = A$, $A^\tau_{s,i+1} := \{a \in A^\tau_{s,i} \mid Q_i(s, a) \geq \max_{a' \in A^\tau_{s,i}} Q_i(s, a') - \tau\}$, and τ ∈ ℝ_{>0} is the tolerance parameter.[2] This rule is analogous to Q-Learning, but where the max-operator is restricted to range only over actions that (approximately) lexicographically maximise all rewards of higher priority. We can also use SARSA or Expected SARSA.

[2] There are several places where VB-LRL makes use of a tolerance parameter τ. In the main text of this paper, we assume that the same tolerance parameter is used everywhere, and that it is a constant. In the supplementary material, we relax these assumptions.

Alternatively, we can adapt Double Q-Learning [Hasselt, 2010] for VB-LRL. To do this, we let the agent maintain two Q-functions Q^A_i, Q^B_i for each reward. To update the Q-values, with probability 0.5 we set:

$$Q^A_i(s, a) \leftarrow \big(1 - \alpha_t(s, a)\big)\, Q^A_i(s, a) + \alpha_t(s, a)\,\Big(R_i(s, a, s') + \gamma_i\, Q^B_i\big(s', \operatorname*{arg\,max}_{a' \in A^\tau_{s',i}} Q^A_i(s', a')\big)\Big),$$

and otherwise perform the analogous update on Q^B_i, and let Q_i(s, a) := 0.5 (Q^A_i(s, a) + Q^B_i(s, a)) in the bandit algorithm. Varying the bandit algorithm or Q-value update rule in VB-LRL produces a family of algorithms with different properties.[3] We can now give our core result for Algorithm 2. All our proofs are included in the supplementary material.[4]

[3] The LRL algorithm in [Gábor et al., 1998] is equivalent to Algorithm 2 with Algorithm 1, Lexicographic Q-Learning, and τ = 0.
[4] Available at https://github.com/lrhammond/lmorl.

Theorem 1. In any MOMDP M, if VB-LRL uses a lexicographic bandit algorithm and either SARSA, Expected SARSA, or Lexicographic Q-Learning, then it will converge to a policy π that is lexicographically optimal if:
1. S and A are finite,
2. All reward functions are bounded,
3. Either γ_1, . . . , γ_m < 1, or every policy leads to a terminal state with probability one,
4. The learning rates α_t(s, a) ∈ [0, 1] satisfy the conditions Σ_t α_t(s, a) = ∞ and Σ_t α_t(s, a)² < ∞ with probability one, for all s ∈ S, a ∈ A,
5. The tolerance τ satisfies the condition that $0 < \tau < \min_{i, s, a \neq a'} |q_i(s, a) - q_i(s, a')|$.
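For intuition, the following sketch (again ours, reusing the dictionary-based Q-functions above; it is not the authors' implementation) performs one Lexicographic Q-Learning update for all m rewards, restricting each max to actions that approximately lexicographically maximise the higher-priority Q-functions at the next state.

```python
def lexicographic_q_learning_step(Qs, s, a, rewards, s_next, actions,
                                  alpha, gammas, tau):
    """One instance of line 6 of Algorithm 2: update Q_i towards
    R_i + gamma_i * max over actions still permitted by the higher-priority
    Q-functions (within tolerance tau) at s_next."""
    permitted = list(actions)  # actions consistent with higher-priority rewards
    for Q, r, gamma in zip(Qs, rewards, gammas):
        target = r + gamma * max(Q[(s_next, b)] for b in permitted)
        Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
        # Tighten the permitted set before the next (lower-priority) reward.
        best = max(Q[(s_next, b)] for b in permitted)
        permitted = [b for b in permitted if Q[(s_next, b)] >= best - tau]
```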
We also show that VB-LRL with Lexicographic Double Q-Learning converges to a lexicographically optimal policy.

Theorem 2. In any MOMDP, if VB-LRL uses Lexicographic Double Q-Learning then it converges to a lexicographically optimal policy π if conditions 1-5 in Theorem 1 hold.

Condition 4 requires that the agent takes every action in every state infinitely often. Condition 5 is quite strong: the upper bound on this range can in general not be determined a priori. However, we expect VB-LRL to be well-behaved as long as τ is small. We motivate this intuition with a formal guarantee about the behaviour of VB-LRL for arbitrary τ.

Proposition 1. In any MOMDP, if VB-LRL has tolerance τ > 0, uses SARSA, Expected SARSA, or Lexicographic Q-Learning, and conditions 1-4 in Theorem 1 are met, then:
1. $J_1(\pi^*) - J_1(\pi_t) \leq \frac{\tau}{1-\gamma} + \lambda_t$, for some sequence $\{\lambda_t\}_{t \in \mathbb{N}}$ such that $\lim_{t\to\infty} \lambda_t = 0$,
2. $J_2(\pi^*) - J_2(\pi_t) \leq \frac{\tau}{1-\gamma} + \eta_t$, for some sequence $\{\eta_t\}_{t \in \mathbb{N}}$ such that $\lim_{t\to\infty} \eta_t = 0$,
where π_t is the policy at time t and π* is a lexicographically optimal policy.

Proposition 1 shows that we can obtain guarantees about the limit behaviour of VB-LRL without prior knowledge of the MOMDP when we only have two rewards, or are primarily interested in the two most prioritised rewards. We discuss this issue further in the supplementary material. Note also that while Algorithm 2 is tabular, it is straightforward to combine it with function approximators, making it applicable to high-dimensional state spaces.

3.2 Policy-Based Algorithms

We next introduce a family of lexicographic policy gradient algorithms. These algorithms use one objective function K_i for each reward function, and update the parameters of π(·; θ) with a multi-timescale approach whereby we first optimise θ using K_1, then at a slower timescale optimise θ using K_2 while adding the condition that the loss with respect to K_1 remains bounded by its current value, and so on. To solve these problems we use the well-known Lagrangian relaxation technique [Bertsekas, 1999].

Suppose that we have already optimised θ lexicographically with respect to K_1, . . . , K_{i−1} and we wish to now lexicographically optimise θ with respect to K_i. Let k_j := K_j(θ) for each j ∈ {1, . . . , i − 1}. Then we wish to solve the constrained optimisation problem given by:

$$\text{maximise } K_i(\theta), \quad \text{subject to } K_j(\theta) \geq k_j - \tau, \;\; j \in \{1, \ldots, i-1\},$$

where τ > 0 is a small constant tolerance parameter, included such that there exists some θ strictly satisfying the above constraints; in practice, while learning we set τ = τ_t to decay as t → ∞. This constraint qualification (Slater's condition [Slater, 1950]) ensures that we may instead solve the dual of the problem by computing a saddle point $\min_{\lambda \geq 0} \max_\theta L_i(\theta, \lambda)$ of the Lagrangian relaxation [Bertsekas, 1999], where:

$$L_i(\theta, \lambda) := K_i(\theta) + \sum_{j=1}^{i-1} \lambda_j \big(K_j(\theta) - k_j + \tau\big).$$

A natural approach would be to solve each optimisation problem for L_i, where i ∈ {1, . . . , m}, in turn. While this would lead to a correct solution, when the space of lexicographic optima for each objective function is large or diverse, this process may end up being slow and sample-inefficient. Our key observation here is that by instead updating θ at different timescales, we can solve this problem synchronously, guaranteeing that we converge to a lexicographically optimal solution as if done step-by-step under fixed constraints. We set the learning rate η of the Lagrange multiplier to η^i after convergence with respect to the ith objective, and assume that for all learning rates ι ∈ {α, β^1, . . . , β^m, η^0, . . . , η^m} and all i ∈ {1, . . . , m} we have:
$$\sum_{t=0}^{\infty} (\iota_t)^2 < \infty \quad \text{and} \quad \lim_{t\to\infty} \frac{\beta^i_t}{\alpha_t} = \lim_{t\to\infty} \frac{\eta^i_t}{\beta^i_t} = \lim_{t\to\infty} \frac{\beta^i_t}{\eta^{i-1}_t} = 0.$$

We also assume that τ_t = o(β^m_t) in order to make sure that Slater's condition holds in the limit with respect to all learning rates. Using learning rates β^i and η = η^i we may compute a saddle point solution to each Lagrangian L_i via the following (estimated) gradient-based updates:

$$\theta \leftarrow \Gamma_\theta\Big[\theta + \beta^i_t \Big(\nabla_\theta \hat{K}_i(\theta) + \sum_{j=1}^{i-1} \lambda_j \nabla_\theta \hat{K}_j(\theta)\Big)\Big],$$
$$\lambda_j \leftarrow \Gamma_\lambda\Big[\lambda_j + \eta_t \big(\hat{k}_j - \tau_t - \hat{K}_j(\theta)\big)\Big] \quad \text{for } j \in \{1, \ldots, i-1\},$$

where Γ_λ(·) = max(·, 0), Γ_θ projects θ to the nearest point in Θ, and $\hat{\cdot}$ is used to denote a Monte Carlo estimate. We next note that by collecting the terms involved in the updates to θ for each i, at time t we are effectively performing the simple update $\theta \leftarrow \Gamma_\theta[\theta + \nabla_\theta \hat{K}(\theta)]$, where:

$$\hat{K}(\theta) := \sum_{i=1}^{m} c^i_t \hat{K}_i(\theta) \quad \text{and} \quad c^i_t := \beta^i_t + \lambda_i \sum_{j=i+1}^{m} \beta^j_t,$$

and where we assume that $\sum_{j=m+1}^{m} \beta^j_t = 0$. It is therefore computationally simple to formulate a lexicographic optimisation problem from any collection of objective functions by updating a small number of coefficients c^i_t at each timestep and then linearly combining the objective functions.

Finally, in many policy-based algorithms we use a critic V_i (or Q_i) to estimate each $\hat{K}_i$ and so must also update the parameters w_i of each critic. This is typically done on a faster timescale using the learning rate α, for instance via the TD(0) update for V_i given by $w_i \leftarrow w_i + \alpha_t\, \delta^i_t\, \nabla_{w_i} V_i$, where δ^i_t is the TD error for V_i at time t [Sutton and Barto, 2018]. A general scheme for policy-based LRL (PB-LRL) is shown in Algorithm 3, which may be instantiated with a wide range of objective functions K_i and update rules for each w_i.

Algorithm 3 Policy-Based Lexicographic RL
input: M = ⟨S, A, T, I, R, γ⟩
1: initialise θ, w_1, . . . , w_m, λ_1, . . . , λ_m
2: t ← 0, η ← η^0, s ∼ I
3: while θ has not converged do
4:   t ← t + 1, a ∼ π(s), s′ ∼ T(s, a)
5:   for i ∈ {1, . . . , m} do
6:     if $\hat{K}_i(\theta)$ has not converged then $\hat{k}_i \leftarrow \hat{K}_i(\theta)$
7:     else η ← η^i
8:     update w_i (if using a critic) and λ_i
9:   update θ
10:  if s′ is terminal then s ∼ I else s ← s′
11: return θ

Below, we show that PB-LRL inherits the convergence guarantees of whatever (non-lexicographic) algorithm corresponds to the objective function used. Note that when m = 1, PB-LRL reduces to whichever algorithm is defined by the choice of objective function, such as A2C when using K^A2C_1, or PPO when using K^PPO_1. By using a standard stochastic approximation argument [Borkar, 2008] and proceeding by induction, we prove that any such algorithm that obtains a local (or global) ϵ-optimum when m = 1 obtains a lexicographically local (or global) ϵ-optimum when the corresponding objective function is used in PB-LRL.

Theorem 3. Let M be a MOMDP, π a policy that is twice continuously differentiable in its parameters θ, and assume that the same form of objective function is chosen for each K_i and that each reward function R_i is bounded. If using a critic, let V_i (or Q_i) be (action-)value functions that are continuously differentiable in w_i for i ∈ {1, . . . , m}, and suppose that if PB-LRL is run for T steps there exists some limit point $w^*_i(\theta) = \lim_{T\to\infty} \mathbb{E}_t[w_i]$ for each w_i when θ is held fixed, under some set of conditions C on M, π, and each V_i. If $\lim_{T\to\infty} \mathbb{E}_t[\theta] \in \tilde{\Theta}^\epsilon_1$ (respectively $\Theta^\epsilon_1$) under conditions C when m = 1, then for any fixed m ∈ ℕ we have that $\lim_{T\to\infty} \mathbb{E}_t[\theta] \in \tilde{\Theta}^\epsilon_m$ (respectively $\Theta^\epsilon_m$), where each ϵ_i ≥ 0 is a constant that depends on the representational power of the parametrisations of π (and V_i or Q_i, if using a critic).
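To illustrate the synchronous update just described (a sketch of ours rather than the authors' code; the function names and the example schedules are hypothetical), the coefficients c^i_t can be computed directly from the current timescale weights β^i_t and Lagrange multipliers λ_i, after which the per-objective loss estimates are combined linearly and differentiated as a single objective.

```python
def lexicographic_weights(betas_t, lams):
    """c^i_t = beta^i_t + lambda_i * sum_{j > i} beta^j_t: the weight placed on
    each per-objective loss K_i in the combined update to theta."""
    m = len(betas_t)
    return [betas_t[i] + lams[i] * sum(betas_t[i + 1:]) for i in range(m)]

def combined_objective(K_hats, betas_t, lams):
    """Linear combination sum_i c^i_t * K_hat_i, whose gradient gives the
    synchronous multi-timescale update used by PB-LRL."""
    return sum(c * K for c, K in zip(lexicographic_weights(betas_t, lams), K_hats))

# Example with m = 3 objectives: faster (larger) timescales for higher priorities.
print(lexicographic_weights(betas_t=[1e-2, 1e-3, 1e-4], lams=[2.0, 0.5, 0.0]))
```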
In the remainder of the paper, we consider two particular variants of Algorithm 3, in which we use K^A2C_i and K^PPO_i respectively, for each i. We refer to the first as Lexicographic A2C (LA2C) and the second as Lexicographic PPO (LPPO). We conclude this section by combining Theorem 3 with certain conditions C that are sufficient for the local and global convergence of A2C and PPO respectively, in order to obtain the following corollaries. The proofs of these corollaries contain further discussion and references regarding the conditions required in each case.

Corollary 1. Suppose that each critic is linearly parametrised as $V_i(s) = w_i^\top \phi(s)$ for some choice of state features ϕ and is updated using a semi-gradient TD(0) rule, and that:
1. S and A are finite, and each reward function R_i is bounded,
2. For any θ ∈ Θ, the induced Markov chain over S is irreducible,
3. For any s ∈ S and a ∈ A, π(a | s; θ) is twice continuously differentiable,
4. Letting Φ be the |S| × c matrix with rows ϕ(s), Φ has full rank (i.e. the features are independent), c ≤ |S|, and there is no w ∈ W such that Φw = 1.
Then for any MOMDP with discounted or limit-average objectives, LA2C almost surely converges to a policy in $\tilde{\Theta}^\epsilon_m$.

Corollary 2. Let $\pi(a \mid s; \theta, \chi) \propto \exp\big(\chi^{-1} f(s, a; \theta)\big)$ and suppose that both f and the action-value critics Q_i are parametrised using two-layer neural networks (where χ is a temperature parameter), that a semi-gradient TD(0) rule is used to update Q_i, and that Q_i replaces A_i in the standard PPO loss K^PPO, both of which updates use samples from the discounted steady-state distribution. Further, let us assume that:
1. S is compact and A is finite, with S × A ⊆ ℝ^d for some finite d > 0, and each reward function R_i is bounded,
2. The neural networks have widths μ_f and μ_{Q_i} respectively, with ReLU activations, initial input weights drawn from a normal distribution with mean 0 and variance 1/d, and initial output weights drawn from unif([−1, 1]),
3. We have that $q^\pi_i(\cdot, \cdot) \in \{Q_i(\cdot, \cdot;\, w_i) \mid w_i \in \mathbb{R}^y\}$ for any π ∈ Π,
4. There exists c > 0 such that for any z ∈ ℝ^d and ζ > 0 we have that $\mathbb{E}_\pi\big[\mathbb{1}(|z^\top (s, a)| \leq \zeta)\big] \leq \frac{c\zeta}{\|z\|_2}$ for any π ∈ Π.
Then for any MOMDP with discounted objectives, LPPO almost surely converges to a policy in $\Theta^\epsilon_m$. Furthermore, if the coefficient of the KL divergence penalty satisfies κ > 1, then $\lim_{\mu_f, \mu_{Q_i} \to \infty} \epsilon = 0$.

Figure 1: (a) 256 states; (b) 512 states. We plot the learning time of LRL as the number of episodes until convergence (y-axis) against the number of reward signals (x-axis). We use randomly generated MOMDPs with 256 or 512 states, four actions, and a varying number of rewards. For each trial we generate 30 MOMDPs, and record the number of episodes it takes for each agent's long-run average reward per episode to converge to a stable value. The algorithms are Lexicographic Q-Learning (blue), Lexicographic Expected SARSA (orange), Lexicographic Double Q-Learning (green), Lexicographic PPO (red), and Lexicographic A2C (purple).

4 Experiments

In this section we evaluate our algorithms empirically. We first show how the learning time of LRL scales with the number of reward functions. We then compare the performance of VB-LRL and PB-LRL against that of other algorithms for solving constrained RL problems. Further experimental details and additional experiments are described in the supplementary material, and documented in our codebase.[5]

[5] Available at https://github.com/lrhammond/lmorl.

4.1 Scaling with the Number of Rewards

Our first experiment (shown in Figure 1) shows how the learning time of LRL scales in the number of rewards.
The data suggest that the learning time grows sub-linearly as additional reward functions are added, meaning that our algorithms can be used with large numbers of objectives.

4.2 Lexicographic RL for Safety Constraints

Many tasks are naturally expressed in terms of both a performance metric and a safety constraint. Our second experiment compares the performance of LRL against RCPO [Tessler et al., 2019], AproPO [Miryoosefi et al., 2019], and the actor-critic algorithm for VaR-constraints in [Chow et al., 2017], in a number of environments with both a performance metric and a safety constraint. These algorithms synthesise slightly different kinds of policies, but are nonetheless sufficiently similar for a relevant comparison to be made. We use VB-LRL with a neural network and a replay buffer, which we call LDQN, and the PB-LRL algorithms we evaluate are LA2C and LPPO. The results are shown in Figure 2.

The CartSafe environment from gym-safety[6] is a version of the classic CartPole environment. The agent receives more reward the higher up the pole is, whilst incurring a cost if the cart is moved outside a safe region. Here the LRL algorithms, RCPO, and AproPO all learn quite safe policies, but VaR_AC struggles. Of the safer policies, LDQN gets the most reward (roughly matching DQN and A2C), followed by RCPO and AproPO, and then LA2C and LPPO. The latter two minimise cost more aggressively, and thus gain less reward.

[6] Available at https://github.com/jemaw/gym-safety.

Figure 2: (a) CartSafe reward; (b) CartSafe cost; (c) GridNav reward; (d) GridNav cost; (e) Intersection reward; (f) Intersection cost, for DQN, A2C, a random agent, LDQN, LA2C, LPPO, AproPO, RCPO, and VaR_AC. We plot the average reward and cost (y-axis) against the number of environment interactions (x-axis). In each environment, RCPO, AproPO, and VaR_AC were tasked with maximising reward subject to a constraint on the cost, bounds on both the reward and cost, or a bound on the probability of the cost exceeding a certain constant. The LRL algorithms were tasked with minimising cost and, subject to that, maximising reward. Each algorithm was run ten times in each environment.

The GridNav environment, again from gym-safety (based on an environment in [Chow et al., 2018]), is a large gridworld with a goal region and a number of unsafe squares. The agent is rewarded for reaching the goal quickly, and incurs a cost if it enters an unsafe square. Moreover, at each time step, the agent is moved in a random direction with probability 0.1. Here LDQN is the safest algorithm, but it also fails to obtain any reward. LA2C, LPPO, RCPO, and VaR_AC are similar in terms of safety, but LA2C and LPPO obtain the most reward, VaR_AC a fairly high reward, and RCPO a low reward. AproPO has low safety and low reward.

Finally, in the Intersection environment from highway-env[7], the agent must guide a car through an intersection with dense traffic. We give the agent a reward of 10 if it reaches its destination, and a cost of 1 for each collision that occurs (which is slightly different from the environment's original reward structure). This task is challenging, and all the algorithms incur approximately the same cost as a random agent.

[7] Available at https://github.com/eleurent/highway-env.
However, they still manage to increase their reward, with LA2C and RCPO obtaining the most reward out of the constrained algorithms (roughly matching that of DQN and A2C). This shows that if optimising the first objective is too difficult, then the LRL algorithms fail gracefully by optimising the second objective, even if it has lexicographically lower priority.

5 Discussion and Conclusions

We introduced two families of RL algorithms for solving lexicographic multi-objective problems, which are more general than prior work, and are justified both by their favourable theoretical guarantees and their compelling empirical performance against other algorithms for constrained RL. VB-LRL converges to a lexicographically optimal policy in the tabular setting, and PB-LRL inherits convergence guarantees as a function of the objectives used, leading to locally and globally lexicographically ϵ-optimal policies in the case of LA2C and LPPO respectively. The learning time of the algorithms grows sub-linearly as reward functions are added, which is an encouraging result for scalability to larger problems. Further, when used to impose safety constraints, the LRL algorithms generally compare favourably to the state of the art, both in terms of learning speed and final performance.

We conclude by noting that in many situations, LRL may be preferable to constrained RL for reasons beyond its strong performance, as it allows one to solve different kinds of problems. For example, we might want a policy that is as safe as possible, but lack prior knowledge of what level of safety can be attained in the environment. LRL could also be used, e.g., to guide learning by encoding prior knowledge in extra reward signals without the risk of sacrificing optimality with respect to the primary objective(s). These applications, among others, provide possible directions for future work.

Acknowledgments

Hammond acknowledges the support of an EPSRC Doctoral Training Partnership studentship (Reference: 2218880).

References

[Achiam et al., 2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 22–31, 2017.
[Amari, 1998] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
[Bertsekas, 1999] Dimitri Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[Bhatnagar et al., 2009] Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
[Borkar, 2008] Vivek S. Borkar. Stochastic Approximation. Hindustan Book Agency, 2008.
[Chow et al., 2017] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(1):6070–6120, 2017.
[Chow et al., 2018] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8103–8112, 2018.
[Gábor et al., 1998] Zoltán Gábor, Zsolt Kalmár, and Csaba Szepesvári. Multi-criteria reinforcement learning. In Proceedings of the 15th International Conference on Machine Learning, pages 197–205, 1998.
[Hasselt, 2010] Hado V. Hasselt. Double Q-learning. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 2613–2621, 2010.
[Hsu et al., 2020] Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices in proximal policy optimization. arXiv:2009.10897, 2020.
[Kakade, 2001] Sham Kakade. A natural policy gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems, pages 1531–1538, 2001.
[Konda and Tsitsiklis, 2000] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Proceedings of the 13th International Conference on Neural Information Processing Systems, pages 1008–1014, 2000.
[Lesser and Abate, 2018] K. Lesser and A. Abate. Multiobjective optimal control with safety as a priority. IEEE Transactions on Control Systems Technology, 26(3):1015–1027, 2018.
[Liu et al., 2015] C. Liu, X. Xu, and D. Hu. Multiobjective reinforcement learning: A comprehensive overview. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 45(3):385–398, 2015.
[Liu et al., 2019] Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 10564–10575, 2019.
[Miryoosefi et al., 2019] Sobhan Miryoosefi, Kianté Brantley, Hal Daumé III, Miroslav Dudík, and Robert E. Schapire. Reinforcement learning with convex constraints. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 14070–14079, 2019.
[Mitten, 1974] L. G. Mitten. Preference order dynamic programming. Management Science, 21(1):43–46, 1974.
[Polymenakos et al., 2019] K. Polymenakos, A. Abate, and S. Roberts. Safe policy search using Gaussian process models. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, pages 1565–1573, 2019.
[Rentmeesters et al., 1996] M. J. Rentmeesters, W. K. Tsai, and Kwei-Jay Lin. A theory of lexicographic multi-criteria optimization. In Proceedings of the 2nd IEEE International Conference on Engineering of Complex Computer Systems, 1996.
[Roijers et al., 2013] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
[Rummery and Niranjan, 1994] Gavin Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical report, University of Cambridge, 1994.
[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
[Slater, 1950] Morton Slater. Lagrange multipliers revisited. Cowles Commission Discussion Paper No. 403, 1950.
[Sutton and Barto, 2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 2018.
[Sutton et al., 1999] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pages 1057–1063, 1999.
[Tessler et al., 2019] Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. In Proceedings of the 7th International Conference on Learning Representations, 2019.
[Thomas et al., 2013] Philip S. Thomas, William Dabney, Sridhar Mahadevan, and Stephen Giguere. Projected natural actor-critic. In Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 2337–2345, 2013.
[van Seijen et al., 2009] Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of Expected SARSA. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 177–184, 2009.
[Watkins, 1986] Chris Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, 1986.
[Wray and Zilberstein, 2015] Kyle Hollins Wray and Shlomo Zilberstein. Multi-objective POMDPs with lexicographic reward preferences. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 1719–1725, 2015.