# Reinforcement Learning with Convex Constraints

Sobhan Miryoosefi, Princeton University, miryoosefi@cs.princeton.edu
Kianté Brantley, University of Maryland, kdbrant@cs.umd.edu
Hal Daumé III, Microsoft Research and University of Maryland, me@hal3.name
Miroslav Dudík, Microsoft Research, mdudik@microsoft.com
Robert E. Schapire, Microsoft Research, schapire@microsoft.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

In standard reinforcement learning (RL), a learning agent seeks to optimize the overall reward. However, many key aspects of a desired behavior are more naturally expressed as constraints. For instance, the designer may want to limit the use of unsafe actions, increase the diversity of trajectories to enable exploration, or approximate expert trajectories when rewards are sparse. In this paper, we propose an algorithmic scheme that can handle a wide class of constraints in RL tasks: specifically, any constraints that require expected values of some vector measurements (such as the use of an action) to lie in a convex set. This captures previously studied constraints (such as safety and proximity to an expert), but also enables new classes of constraints (such as diversity). Our approach comes with rigorous theoretical guarantees and only relies on the ability to approximately solve standard RL tasks. As a result, it can be easily adapted to work with any model-free or model-based RL algorithm. In our experiments, we show that it matches previous algorithms that enforce safety via constraints, but can also enforce new properties that these algorithms cannot incorporate, such as diversity.

1 Introduction

Reinforcement learning (RL) typically considers the problem of learning to optimize the behavior of an agent in an unknown environment against a single scalar reward function. For simple tasks, this can be sufficient, but for complex tasks, boiling down the learning goal into a single scalar reward can be challenging. Moreover, a scalar reward might not be a natural formalism for stating certain learning objectives, such as safety desires ("avoid dangerous situations") or exploration suggestions ("maintain a distribution over visited states that is as close to uniform as possible"). In these settings, it is much more natural to define the learning goal in terms of a vector of measurements over the behavior of the agent, and to learn a policy whose measurement vector is inside a target set (§2).

We derive an algorithm, approachability-based policy optimization (APPROPO, pronounced like "apropos"), for solving such problems (§3). Given a Markov decision process with vector-valued measurements (§2) and a target constraint set, APPROPO learns a stochastic policy whose expected measurements fall in that target set (akin to Blackwell approachability in single-turn games; Blackwell, 1956). We derive our algorithm from a game-theoretic perspective, leveraging recent results in online convex optimization. APPROPO is implemented as a reduction to any off-the-shelf reinforcement learning algorithm that can return an approximately optimal policy, and so can be used in conjunction with the algorithms that are the most appropriate for any given domain.

Our approach builds on prior work for reinforcement learning under constraints, such as the formalism of constrained Markov decision processes (CMDPs) introduced by Altman (1999).
In CMDPs, the agent's goal is to maximize reward while satisfying some linear constraints over auxiliary costs (akin to our measurements). Altman (1999) gave an LP-based approach when the CMDP is fully known, and more recently, model-free approaches have been developed for CMDPs in high-dimensional settings. For instance, Achiam et al.'s (2017) constrained policy optimization (CPO) focuses on safe exploration and seeks to ensure approximate constraint satisfaction during the learning process. Tessler et al.'s (2019) reward constrained policy optimization (RCPO) follows a two-timescale primal-dual approach, giving guarantees for convergence to a fixed point. Le et al. (2019) describe a batch off-policy algorithm with PAC-style guarantees for CMDPs using a game-theoretic formulation similar to ours. While all of these works are only applicable to orthant constraints, our algorithm can work with arbitrary convex constraints. This enables APPROPO to incorporate previously studied constraint types, such as inequality constraints that represent safety or that keep the policy's behavior close to that of an expert (Syed and Schapire, 2008), as well as constraints like the aforementioned exploration suggestion, implemented as an entropy constraint on the policy's state visitation vector. The entropy of the visitation vector was recently studied as the objective by Hazan et al. (2018), who gave an algorithm capable of maximizing a concave function (e.g., entropy) over such vectors. However, it is not clear whether their approach can be adapted to the convex constraints setting studied here.

Our main contributions are: (1) a new algorithm, APPROPO, for solving reinforcement learning problems with arbitrary convex constraints; (2) a rigorous theoretical analysis demonstrating that it can achieve sublinear regret under mild assumptions (§3); and (3) a preliminary experimental comparison with RCPO (Tessler et al., 2019), showing that our algorithm is competitive with RCPO on orthant constraints, while also handling a diversity constraint (§4).

2 Setup and preliminaries: Defining the feasibility problem

We begin with a description of our learning setting. A vector-valued Markov decision process is a tuple M = (S, A, β, P_s, P_z), where S is the set of states, A is the set of actions, and β is the initial-state distribution. Each episode starts by drawing an initial state s_0 from the distribution β. Then in each step i = 0, 1, 2, …, the agent observes its current state s_i and takes an action a_i ∈ A, causing the environment to move to the next state s_{i+1} ∼ P_s(·|s_i, a_i). The episode ends after a certain number of steps (called the horizon) or when a terminal state is reached. However, in our setting, instead of receiving a scalar reward, the agent observes a d-dimensional measurement vector z_i ∈ ℝ^d, which, like s_{i+1}, depends on both the current state s_i and the action a_i, that is, z_i ∼ P_z(·|s_i, a_i). (Although not explicit in our setting, reward could be incorporated in the measurement vector.)

Typically, actions are selected according to a (stationary) policy π, so that a_i ∼ π(s_i), where π maps states to distributions over actions. We assume we are working with policies from some candidate space Π. For simplicity of presentation, we assume this space is finite, though possibly extremely large. For instance, if S and A are finite, then Π might consist of all deterministic policies. (Our results hold also when Π is infinite, with minor technical adjustments.)
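To make this setup concrete, the following is a minimal sketch, not taken from the paper, of what a vector-valued MDP might look like behind a gym-style reset/step interface; the toy dynamics, the two-dimensional measurement (a per-step cost and an unsafe-event indicator), and all names are illustrative assumptions.

```python
import numpy as np

class VectorMeasurementEnv:
    """Toy gym-style episodic environment that emits a 2-dimensional measurement
    vector z_i = [step cost, unsafe-event indicator] instead of a scalar reward.
    Illustrative sketch only; the interface and dynamics are not from the paper."""

    def __init__(self, horizon=300, seed=0):
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.state = int(self.rng.integers(0, 10))    # s_0 ~ beta (toy initial-state draw)
        return self.state

    def step(self, action):
        self.t += 1
        self.state = int(self.rng.integers(0, 10))    # s_{i+1} ~ P_s(.|s_i, a_i) (toy dynamics)
        z = np.array([1.0, float(self.state == 0)])   # z_i ~ P_z(.|s_i, a_i)
        done = self.t >= self.horizon or self.state == 0
        return self.state, z, done
```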
Our aim is to control the MDP so that the measurements satisfy some constraints. For any policy π, we define the long-term measurement z(π) as the expected sum of discounted measurements,

z(π) ≜ E[ Σ_{i=0}^∞ γ^i z_i ],  (1)

for some discount factor γ ∈ [0, 1), where the expectation is over the random process described above (including the randomness inherent in π). Later, we will also find it useful to consider mixed policies µ, which are distributions over finitely many stationary policies. The space of all such mixed policies over Π is denoted ∆(Π). To execute a mixed policy µ, before taking any actions, a single policy π is randomly selected according to µ; then all actions henceforth are chosen according to π, for the entire episode. The long-term measurement of a mixed policy, z(µ), is defined accordingly:

z(µ) ≜ E_{π∼µ}[ z(π) ] = Σ_{π∈Π} µ(π) z(π).  (2)

Our learning problem, called the feasibility problem, is specified by a convex target set C. The goal is to find a mixed policy µ whose long-term measurements lie in the set C:

Feasibility Problem: Find µ ∈ ∆(Π) such that z(µ) ∈ C.  (3)

For instance, in our experiments (§4) we consider a grid-world environment where the measurements include the distance traveled, an indicator of hitting a rock, and indicators of visiting various locations on the grid. The feasibility goal is to achieve at most a certain trajectory length while keeping the probability of hitting the rock below a threshold for safety reasons, and maintaining a distribution over visited states close to the uniform distribution to enable exploration. We can potentially also handle settings where the goal is to maximize one measurement (e.g., "reward") subject to others by performing a binary search over the maximum attainable value of the reward (see §3.4).

3 Approach, algorithm, and analysis

Before giving details of our approach, we overview the main ideas, which, to a large degree, follow the work of Abernethy et al. (2011), who considered the problem of solving two-player games; we extend these results to solve our feasibility problem (3). Although feasibility is our main focus, we actually solve the stronger problem of finding a mixed policy µ that minimizes the Euclidean distance between z(µ) and C, meaning the Euclidean distance between z(µ) and its closest point in C. That is, we want to solve

min_{µ∈∆(Π)} dist(z(µ), C),  (4)

where dist denotes the Euclidean distance between a point and a set.

Our main idea is to take a game-theoretic approach, formulating this problem as a game and solving it. Specifically, suppose we can express the distance function in Eq. (4) as a maximization of the form

dist(z(µ), C) = max_{λ∈Λ} λ·z(µ)  (5)

for some convex, compact set Λ.¹ Then Eq. (4) becomes

min_{µ∈∆(Π)} max_{λ∈Λ} λ·z(µ).  (6)

This min-max form immediately evokes an interpretation as a two-person zero-sum game: the first player chooses a mixed policy µ, the second player responds with a vector λ, and λ·z(µ) is the amount that the first player is then required to pay to the second player. Assuming this game satisfies certain conditions, the final payout under optimal play, called the value of the game, is the same even when the order of the players is reversed:

max_{λ∈Λ} min_{µ∈∆(Π)} λ·z(µ).  (7)

Note that the policy µ we are seeking is the solution of this game, that is, the policy realizing the minimum in Eq. (6). Therefore, to find that policy, we can apply general techniques for solving a game, namely, to let a no-regret learning algorithm play the game repeatedly against a best-response player.
When played in this way, it can be shown that the averages of their plays converge to the solution of the game (details in §3.1). In our case, we can use a no-regret algorithm for the λ-player, and best response for the µ-player. Importantly, in our context, computing the best response turns out to be an especially convenient task. Given λ, best response means finding the mixed policy µ minimizing λ·z(µ). As we show below, this can be solved by treating the problem as a standard reinforcement learning task where in each step i, the agent accrues a scalar reward r_i = −λ·z_i. We refer to any algorithm for solving this problem of scalar reward maximization as the best-response oracle. During the run of our algorithm, we invoke this oracle for different vectors λ corresponding to different definitions of a scalar reward.

¹Note that the distance between a point and a set is defined as a minimization of the distance function over all points in the set C, but here we require that it be rewritten as a maximization of a linear function over some other set Λ. We will show how to achieve this in §3.2.

Algorithm 1 Solving a game with repeated play
1: input: concave-convex function g : Λ × U → ℝ, online learning algorithm LEARNER
2: for t = 1 to T do
3:   LEARNER makes a decision λ_t ∈ Λ
4:   u_t ← argmin_{u∈U} g(λ_t, u)
5:   LEARNER observes loss function ℓ_t(λ) = −g(λ, u_t)
6: end for
7: return λ̄ = (1/T) Σ_{t=1}^T λ_t and ū = (1/T) Σ_{t=1}^T u_t

Although the oracle is only capable of solving RL tasks with a scalar reward, our algorithm can leverage this capability to solve the multi-dimensional feasibility (or distance minimization) problem. In the remainder of this section, we provide the details of our approach, leading to our main algorithm and its analysis, and conclude with a discussion of steps for making a practical implementation. We begin by discussing game-playing techniques in general, which we then apply to our setting.

3.1 Solving zero-sum games using online learning

At the core of our approach, we use the general technique of Freund and Schapire (1999) for solving a game by repeatedly playing a no-regret online learning algorithm against best response. For this purpose, we first briefly review the framework of online convex optimization, which we will soon use for one of the players: At time t = 1, …, T, the learner makes a decision λ_t ∈ Λ, the environment reveals a convex loss function ℓ_t : Λ → ℝ, and the learner incurs loss ℓ_t(λ_t). The learner seeks to achieve small regret, the gap between its loss and the best in hindsight:

Regret_T ≜ Σ_{t=1}^T ℓ_t(λ_t) − min_{λ∈Λ} Σ_{t=1}^T ℓ_t(λ).  (8)

An online learning algorithm is no-regret if Regret_T = o(T), meaning its average loss approaches the best in hindsight. An example of such an algorithm is online gradient descent (OGD) of Zinkevich (2003) (see Appendix A; a minimal code sketch also appears below). If the Euclidean diameter of Λ is at most D, and ‖∇ℓ_t(λ)‖ ≤ G for any t and λ ∈ Λ, then the regret of OGD is at most DG√T.

Now consider a two-player zero-sum game in which the two players select, respectively, λ ∈ Λ and u ∈ U, resulting in a payout of g(λ, u) from the u-player to the λ-player. The λ-player wants to maximize this quantity and the u-player wants to minimize it. Assuming g is concave in λ and convex in u, if both spaces Λ and U are convex and compact, then the minimax theorem (von Neumann, 1928; Sion, 1958) implies that

max_{λ∈Λ} min_{u∈U} g(λ, u) = min_{u∈U} max_{λ∈Λ} g(λ, u).  (9)

This means that the λ-player has an optimal strategy which realizes the maximum on the left and guarantees a payoff of at least the value of the game, i.e., the value given by this expression; a similar statement holds for the u-player.
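As a concrete instance of such a no-regret learner, used for the λ-player below, here is a minimal sketch of projected online gradient descent; the `project` callback (a Euclidean projection onto the feasible set) and the class interface are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

class OnlineGradientDescent:
    """Projected online gradient descent (Zinkevich, 2003):
    lambda_{t+1} = Project(lambda_t - eta * grad_t).
    With a feasible set of diameter D and gradient norms at most G, the regret
    after T rounds is O(D * G * sqrt(T)).  Minimal sketch, not the paper's code."""

    def __init__(self, project, lam0, eta):
        self.project = project                   # Euclidean projection onto the feasible set
        self.lam = self.project(np.asarray(lam0, dtype=float))
        self.eta = eta                           # step size, e.g. D / (G * sqrt(T))

    def decision(self):
        return self.lam                          # lambda_t played at round t

    def update(self, grad):
        # grad is the gradient of the observed loss l_t at lambda_t
        self.lam = self.project(self.lam - self.eta * np.asarray(grad, dtype=float))
```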
We can solve this game (that is, find these optimal strategies) by playing it repeatedly. We use a no-regret online learner as the λ-player. At each time t = 1, …, T, the learner chooses λ_t ∈ Λ. In response, the u-player, who in this setting is permitted knowledge of λ_t, selects u_t to minimize the payout, that is, u_t = argmin_{u∈U} g(λ_t, u). This is called best response. The online learning algorithm is then updated by setting its loss function to be ℓ_t(λ) = −g(λ, u_t). (See Algorithm 1.) As stated in Theorem 3.1, λ̄ and ū, the averages of the players' decisions, converge to the solution of the game (see Appendix B for the proof).

Theorem 3.1. Let v be the value of the game in Eq. (9) and let Regret_T be the regret of the λ-player. Then for λ̄ and ū we have

min_{u∈U} g(λ̄, u) ≥ v − δ  and  max_{λ∈Λ} g(λ, ū) ≤ v + δ,  where δ = Regret_T / T.  (10)

3.2 Algorithm and main result

We can now apply this game-playing framework to the approach outlined at the beginning of this section. First, we show how to write distance as a maximization, as in Eq. (5). For now, we assume that our target set C is a convex cone, that is, closed under summation and also under multiplication by non-negative scalars (we will remove this assumption in §3.3). With this assumption, we can apply the following lemma (Lemma 13 of Abernethy et al., 2011), in which the distance to a convex cone C ⊆ ℝ^d is written as a maximization over a dual convex cone C°, called the polar cone:

C° ≜ {λ : λ·x ≤ 0 for all x ∈ C}.  (11)

Lemma 3.2. For a convex cone C ⊆ ℝ^d and any point x ∈ ℝ^d,

dist(x, C) = max_{λ∈C°∩B} λ·x,  (12)

where B ≜ {x : ‖x‖ ≤ 1} is the Euclidean ball of radius 1 at the origin.

Thus, Eq. (5) is immediately achieved by setting Λ = C° ∩ B, so the distance minimization problem (4) can be cast as the min-max problem (6). This is a special case of the zero-sum game (9), with U = {z(µ) : µ ∈ ∆(Π)} and g(λ, u) = λ·u, which can be solved with Algorithm 1. Note that the set U is convex and compact, because it is a linear transformation of the convex and compact set ∆(Π). We will see below that the best responses u_t in Algorithm 1 can be expressed as z(π_t) for some π_t ∈ Π, and so Algorithm 1 returns ū = (1/T) Σ_{t=1}^T z(π_t), which is exactly the long-term measurement vector of the mixed policy µ̄ that places probability 1/T on each of π_1, …, π_T. For this mixed policy, Theorem 3.1 immediately implies

dist(z(µ̄), C) ≤ min_{µ∈∆(Π)} dist(z(µ), C) + Regret_T / T.  (13)

If the problem is feasible, then min_{µ∈∆(Π)} dist(z(µ), C) = 0, and since Regret_T = o(T), our long-term measurement z(µ̄) converges to the target set and solves the feasibility problem (3).

It remains to specify how to implement the no-regret learner for the λ-player and best response for the u-player. We discuss these next, beginning with the latter. The best-response player, for a given λ, aims to minimize λ·z(µ) over mixed policies µ, but since this objective is linear in the mixture weights µ(π) (see Eq. (2)), it suffices to minimize λ·z(π) over stationary policies π ∈ Π. The key point, as already mentioned, is that this is the same as finding a policy that maximizes long-term reward in a standard reinforcement learning task if we define the scalar reward to be r_i = −λ·z_i. This is because the reward of a policy is then given by

R(π) ≜ E[ Σ_{i=0}^∞ γ^i r_i ] = −λ·z(π).  (14)

Therefore, maximizing R(π), as in standard RL, is equivalent to minimizing λ·z(π). Thus, best response can be implemented using any one of the many well-studied RL algorithms that maximize a scalar reward. We refer to such an RL algorithm as the best-response oracle. For robustness, we allow this oracle to return an approximately optimal policy.
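In code, such an oracle call might amount to wrapping the environment so that every step returns the scalar reward r_i = −λ·z_i and handing the wrapped environment to any off-the-shelf RL algorithm. The sketch below assumes a gym-style vector-measurement environment (as in the earlier sketch) and a generic `rl_solver` callback; neither is part of the paper.

```python
import numpy as np

class ScalarizedEnv:
    """Wrap a vector-measurement environment so each step yields the scalar
    reward r_i = -lambda . z_i, as in the reduction described above (sketch)."""

    def __init__(self, env, lam):
        self.env = env
        self.lam = np.asarray(lam, dtype=float)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        state, z, done = self.env.step(action)
        return state, float(-self.lam @ z), done   # standard scalar-reward interface

def best_response(env, lam, rl_solver):
    """BestResponse(lambda): hand the scalarized environment to any off-the-shelf
    scalar-reward RL algorithm (`rl_solver` is an assumed callback) and return
    the (approximately optimal) policy it finds."""
    return rl_solver(ScalarizedEnv(env, lam))
```

APPROPO re-invokes a call of this form with a different λ in every outer-loop iteration.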
Best-response oracle: BESTRESPONSE(λ). Given λ ∈ ℝ^d, return a policy π ∈ Π that satisfies R(π) ≥ max_{π′∈Π} R(π′) − ε₀, where R(π) is the long-term reward of policy π with the scalar reward defined as r = −λ·z.

For the λ-player, we do our analysis using online gradient descent (Zinkevich, 2003), an effective no-regret learner. For its update, OGD needs the gradient of the loss function ℓ_t(λ) = −λ·z(π_t), which is just −z(π_t). With access to the MDP, z(π) can be estimated simply by generating multiple trajectories using π and averaging the observed measurements. We formalize this by assuming access to an estimation oracle for estimating z(π).

Algorithm 2 APPROPO
1: input: projection oracle Γ_C(·) for the target set C, which is a convex cone; best-response oracle BESTRESPONSE(·); estimation oracle EST(·); step size η; number of iterations T
2: define Λ ≜ C° ∩ B, and its projection operator Γ_Λ(x) ≜ (x − Γ_C(x)) / max{1, ‖x − Γ_C(x)‖}
3: initialize λ_1 arbitrarily in Λ
4: for t = 1 to T do
5:   compute an approximately optimal policy for standard RL with scalar reward r = −λ_t·z: π_t ← BESTRESPONSE(λ_t)
6:   call the estimation oracle to approximate the long-term measurement of π_t: ẑ_t ← EST(π_t)
7:   update λ_t using online gradient descent with the loss function ℓ_t(λ) = −λ·ẑ_t:
8:     λ_{t+1} ← Γ_Λ(λ_t − η ∇ℓ_t(λ_t)) = Γ_Λ(λ_t + η ẑ_t)
9: return µ̄, a uniform mixture over π_1, …, π_T

Estimation oracle: EST(π). Given a policy π, return ẑ satisfying ‖ẑ − z(π)‖ ≤ ε₁.

OGD also requires projection onto the set Λ = C° ∩ B. In fact, if we can simply project onto the target set C, which is more natural, then it is also possible to project onto Λ. Consider an arbitrary x and denote its projection onto C by Γ_C(x). Then the projection of x onto the polar cone C° is Γ_{C°}(x) = x − Γ_C(x) (Ingram and Marsh, 1991). Given the projection Γ_{C°}(x) and further projecting onto B, we obtain Γ_Λ(x) = (x − Γ_C(x)) / max{1, ‖x − Γ_C(x)‖} (because Dykstra's projection algorithm converges to this point after two steps; Boyle and Dykstra, 1986). Therefore, it suffices to require access to a projection oracle for C:

Projection oracle: Γ_C(x) = argmin_{x′∈C} ‖x − x′‖.

Pulling these ideas together and plugging into Algorithm 1, we obtain our main algorithm, called APPROPO (Algorithm 2), for approachability-based policy optimization. The algorithm provably yields a mixed policy that approximately minimizes the distance to the set C, as shown in Theorem 3.3 (proved in Appendix C).

Theorem 3.3. Assume that C is a convex cone and that all measurements satisfy ‖z‖ ≤ B. Suppose we run Algorithm 2 for T rounds with an appropriately chosen step size η ∝ T^{−1/2}. Then

dist(z(µ̄), C) ≤ min_{µ∈∆(Π)} dist(z(µ), C) + O(T^{−1/2}) + ε₀ + 2ε₁,  (15)

where µ̄ is the mixed policy returned by the algorithm.

When the goal is to solve the feasibility problem (3) rather than the stronger distance minimization (4), we can make use of a weaker reinforcement learning oracle, which only needs to find a policy that is good enough in the sense of providing long-term reward above some threshold:

Positive-response oracle: POSRESPONSE(λ). Given λ ∈ ℝ^d, return π ∈ Π that satisfies R(π) ≥ −ε₀ if max_{π′∈Π} R(π′) ≥ 0 (and an arbitrary π otherwise), where R(π) is the long-term reward of π with scalar reward r = −λ·z.

When the problem is feasible, it can be shown that there must exist π ∈ Π with R(π) ≥ 0, and furthermore, that ℓ_t(λ_t) ≥ −(ε₀ + ε₁) (from Lemma C.1 in Appendix C). This means that, if the goal is feasibility, we can modify Algorithm 2, replacing BESTRESPONSE with POSRESPONSE, and adding a test at the end of each iteration to report infeasibility if ℓ_t(λ_t) < −(ε₀ + ε₁). The pseudocode is provided in Algorithm 4 in Appendix D, along with the proof of the following convergence bound:

Theorem 3.4.
Assume that C is a convex cone and that all measurements satisfy ‖z‖ ≤ B. Suppose we run Algorithm 4 for T rounds with an appropriately chosen step size η ∝ T^{−1/2}. Then either the algorithm reports infeasibility, or it returns µ̄ such that

dist(z(µ̄), C) ≤ O(T^{−1/2}) + ε₀ + 2ε₁.  (16)

3.3 Removing the cone assumption

Our results so far have assumed that the target set C is a convex cone. If instead C is an arbitrary convex, compact set, we can use the technique of Abernethy et al. (2011) and apply our algorithm to a specific convex cone C̃ constructed from C to obtain a solution with provable guarantees.

In more detail, given a compact, convex target set C ⊆ ℝ^d, we augment every vector in C with a new coordinate held fixed at some value κ > 0, and then let C̃ be the conic hull of the resulting set. Thus,

C̃ = cone(C × {κ}),  where cone(X) = {αx | x ∈ X, α ≥ 0}.  (17)

Given our original vector-valued MDP M = (S, A, β, P_s, P_z), we define a new MDP M′ = (S, A, β, P_s, P′_z) with a (d + 1)-dimensional measurement z′_i ∈ ℝ^{d+1}, defined (and generated) by

z′_i = z_i ⊕ ((1 − γ)κ),  where z_i ∼ P_z(·|s_i, a_i),  (18)

and where ⊕ denotes vector concatenation. Writing the long-term measurements of M and M′ as z and z′, respectively, we have z′(π) = z(π) ⊕ κ for any policy π ∈ Π, and similarly for any mixed policy µ. The main idea is to apply the algorithms described above to the modified MDP M′ using the cone C̃ as the target set. For an appropriate choice of κ > 0, we show that the resulting mixed policy will approximately minimize the distance to C for the original MDP M. This is a consequence of the following lemma, an extension of Lemma 14 of Abernethy et al. (2011), which shows that distances are largely preserved, in a controllable way, under this construction. The proof is in Appendix E.

Lemma 3.5. Consider a compact, convex set C in ℝ^d and a point x ∈ ℝ^d. For any δ > 0, let C̃ = cone(C × {κ}), where κ = max_{x∈C} ‖x‖ / √(2δ). Then dist(x, C) ≤ (1 + δ)·dist(x ⊕ κ, C̃).

Corollary 3.6. Assume that C is a convex, compact set and that all measurements satisfy ‖z‖ ≤ B. Then, choosing the step size as in Theorem 3.3 and running Algorithm 2 for T rounds with M′ as the MDP and C̃ as the target set, the mixed policy µ̄ returned by the algorithm satisfies

dist(z(µ̄), C) ≤ (1 + δ)·( min_{µ∈∆(Π)} dist(z(µ), C) + O(T^{−1/2}) + ε₀ + 2ε₁ ),

where κ = max_{x∈C} ‖x‖ / √(2δ) for an arbitrary δ > 0. Similarly, for Algorithm 4, we either have

dist(z(µ̄), C) ≤ (1 + δ)·( O(T^{−1/2}) + ε₀ + 2ε₁ )

or the algorithm reports infeasibility.

3.4 Practical implementation of the positive-response and estimation oracles

We next briefly describe a few techniques for the practical implementation of our algorithm. As discussed in §3.2, when our aim is to solve a feasibility problem, we only need access to a positive-response oracle. In episodic environments, it is straightforward to use any standard iterative RL approach as a positive-response oracle: as the RL algorithm runs, we track its accrued rewards, and when the trailing average of the last n trajectory-level rewards rises above some level α, we return the current policy (possibly specified implicitly as a Q-function).² Furthermore, the average of the measurement vectors z collected over the last n trajectories can serve as the estimate ẑ_t of the long-term measurement required by the algorithm, side-stepping the need for an additional estimation oracle. The hyperparameters α and n influence the oracle quality; specifically, assuming that the rewards are bounded and that the overall number of trajectories until the oracle terminates is at most polynomial in n, we have ε₀ = O(√((log n)/n)) and ε₁ = O(√((log n)/n)).
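A minimal sketch of this stopping rule is given below; the `learner.run_episode` and `learner.current_policy` interface, as well as the default values, are assumptions made for illustration rather than the paper's implementation.

```python
import numpy as np

def positive_response(env, learner, alpha=0.0, n=100, max_episodes=50_000):
    """Use an iterative RL learner as a positive-response oracle: keep training on
    the (scalarized) environment and stop once the trailing average of the last n
    episode rewards exceeds the level alpha.  Returns the current policy together
    with the average of the last n measurement vectors, which doubles as z_hat.
    Sketch only; the learner interface is assumed."""
    rewards, measurements = [], []
    for _ in range(max_episodes):
        ep_reward, ep_z = learner.run_episode(env)   # discounted reward and measurement sums
        rewards.append(ep_reward)
        measurements.append(np.asarray(ep_z, dtype=float))
        rewards, measurements = rewards[-n:], measurements[-n:]
        if len(rewards) == n and np.mean(rewards) >= alpha:
            break
    return learner.current_policy(), np.mean(measurements, axis=0)
```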
In principle, we could use Theorem 3.4 to select a value of T at which to stop; in practice, we run until the running average of the measurements ẑ_t gets within a small distance of the target set C. If the RL algorithm runs for too long without achieving non-negative rewards, we stop and declare that the underlying problem is empirically infeasible. (Actual infeasibility would hold if it is truly not possible to reach non-negative expected reward.)

²This assumes that the last n trajectories accurately estimate the performance of the final iterate. If that is not the case, the oracle can instead return the mixture of the policies corresponding to the last n iterates.

Figure 1: Left: The Mars rover environment. The agent starts in the top-left and needs to reach the goal in the bottom-right while avoiding rocks. Middle, Right: Visitation probabilities of APPROPO (middle) and APPROPO with a diversity constraint (right) at 12k samples. Both plots are based on a single run.

An important mechanism to further speed up our algorithm is to maintain a cache of all the policies returned by the positive-response oracle so far. Each cached policy π is stored with the estimate of its expected measurement vector ẑ(π) ≈ z(π), based on its last n iterations (as above). In each outer-loop iteration of our algorithm, we first check whether the cache contains a policy that already achieves a reward of at least α under the new λ; this can be determined from the cached ẑ(π), since the reward is just a linear function of the measurement vector. If such a policy is found, we return it, alongside ẑ(π), instead of calling the oracle. Otherwise, we pick the policy from the cache with the largest reward (below α by assumption) and use it to warm-start the RL algorithm implementing the oracle. The cache can be initialized with a few random policies (as we do in our experiments), effectively implementing randomized weight initialization.

The cache interacts well with a straightforward binary-search scheme that can be used when the goal is to maximize some reward (possibly subject to additional constraints), rather than only satisfy a set of constraints. The feasibility problems corresponding to the iterates of binary search only differ in the constraint values, but use the same measurements, so the same cache can be reused across all iterations.

Running time. Note that APPROPO spends the bulk of its running time executing the best-response oracle. It additionally performs updates of λ, but these tend to be orders of magnitude cheaper than any per-episode (or per-transition) updates within the oracle. For example, in our experiments (§4), the dimension of λ is either 2 or 66 (without or with the diversity constraint, respectively), whereas the policies trained by the oracle are two-layer networks described by 8,704 floating-point numbers.

4 Experiments

We next evaluate the performance of APPROPO and demonstrate its ability to handle a variety of constraints. For simplicity, we focus on the feasibility version (Algorithm 4 in Appendix D). We compare APPROPO with the RCPO approach of Tessler et al. (2019), which adapts policy gradient, specifically advantage actor-critic (A2C) (Mnih et al., 2016), to find a fixed point of the Lagrangian of the constrained policy optimization problem. RCPO maintains and updates a vector of Lagrange multipliers, which is then used to derive a reward for A2C.
The vector of Lagrange multipliers serves a similar role to our λ, and the overall structure of RCPO is similar to APPROPO, so RCPO is a natural baseline for comparison. Unlike APPROPO, RCPO only allows orthant constraints, and it seeks to maximize reward, whereas APPROPO solves the feasibility problem. For a fair comparison, APPROPO uses A2C as a positive-response oracle, with the same hyperparameters as used in RCPO. Online learning in the outer loop of APPROPO was implemented via online gradient descent with momentum. Both RCPO and APPROPO have an outer-loop learning rate parameter, which we tuned over a grid of values of the form 10^{−i} with integer i (see Appendix F for the details). Here, we report the results with the best learning rate for each method.

We ran our experiments on a small version of the Mars rover grid-world environment, used previously for the evaluation of RCPO (Tessler et al., 2019). In this environment, depicted in Figure 1 (left), the agent must move from the starting position to the goal without crashing into rocks. The episode terminates when a rock or the goal is reached, or after 300 steps. The environment is stochastic: with probability δ = 0.05, the agent's action is perturbed to a random action. The agent receives a small negative reward each time step and zero for terminating, with γ = 0.99. We used the same safety constraint as Tessler et al. (2019): ensure that the (discounted) probability of hitting a rock is at most a fixed threshold (set to 0.2). RCPO seeks to maximize reward subject to this constraint. APPROPO solves a feasibility problem with the same safety constraint and an additional constraint requiring that the reward be at least −0.17 (this is slightly lower than the final reward achieved by RCPO). We also experimented with including the exploration suggestion as a diversity constraint, requiring that the Euclidean distance between our visitation probability vector (across the cells of the grid) and the uniform distribution over the upper-right triangle of cells of the grid (excluding rocks) be at most 0.12.³

Figure 2: Left: The performance of the algorithms as a function of the number of samples (steps in the environment), showing the average and standard deviation over 25 runs. The vertical axes correspond to the three constraints, with thresholds shown as dashed lines; for reward (middle) this is a lower bound, for the others it is an upper bound. Right: Each point in the scatter plot represents the reward and the probability of failure obtained by the policy learnt by the algorithm at the specified number of samples. The grey region is the target set. Different points represent different random runs.

In Figure 2 (left), we show how the probability of failure, the average reward, and the distance to the uniform distribution over the upper triangle vary as a function of the number of samples seen by each algorithm. Both variants of our algorithm are able to satisfy the safety constraints and reach a similar reward as RCPO with a similar number of samples (around 8k samples). Furthermore, including the diversity constraint, which RCPO is not capable of enforcing, allowed our method to reach a more diverse policy, as depicted in both Figure 2 (bottom-left) and Figure 1 (right).

5 Conclusion

In this paper, we introduced APPROPO, an algorithm for solving reinforcement learning problems with arbitrary convex constraints. APPROPO can combine any no-regret online learner with any standard RL algorithm that optimizes a scalar reward.
Theoretically, we showed that for the specific case of online gradient descent, APPROPO learns to approach the constraint set at a rate of 1/√T, with an additive non-vanishing term that measures the optimality gap of the reinforcement learner. Experimentally, we demonstrated that APPROPO can be applied with well-known RL algorithms for discrete domains (like actor-critic), and achieves similar performance to RCPO (Tessler et al., 2019), while being able to satisfy additional types of constraints. In sum, this yields a theoretically justified, practical algorithm for solving the approachability problem in reinforcement learning.

³This number ensures that APPROPO without the diversity constraint does not satisfy it automatically.

References

Abernethy, J., Bartlett, P. L., and Hazan, E. (2011). Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, pages 27–46.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 22–31.

Altman, E. (1999). Constrained Markov Decision Processes, volume 7. CRC Press.

Blackwell, D. (1956). An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8.

Boyle, J. P. and Dykstra, R. L. (1986). A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, pages 28–47. Springer.

Freund, Y. and Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103.

Hazan, E., Kakade, S. M., Singh, K., and Van Soest, A. (2018). Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690.

Ingram, J. M. and Marsh, M. (1991). Projections onto convex cones in Hilbert space. Journal of Approximation Theory, 64(3):343–350.

Le, H. M., Voloshin, C., and Yue, Y. (2019). Batch policy learning under constraints. arXiv preprint arXiv:1903.08738.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.

Sion, M. (1958). On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176.

Syed, U. and Schapire, R. E. (2008). A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems (NeurIPS).

Tessler, C., Mankowitz, D. J., and Mannor, S. (2019). Reward constrained policy optimization. In International Conference on Learning Representations.

von Neumann, J. (1928). Zur Theorie der Gesellschaftsspiele [On the theory of games of strategy]. Mathematische Annalen, 100:295–320.

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning (ICML).