# Convex-Concave 0-Sum Markov Stackelberg Games

Denizalp Goktas, Brown University, Computer Science, denizalp_goktas@brown.edu
Arjun Prakash, Brown University, Computer Science, arjun_prakash@brown.edu
Amy Greenwald, Brown University, Computer Science, amy_greenwald@brown.edu

37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans.

**Abstract.** Zero-sum Markov Stackelberg games can be used to model myriad problems, in domains ranging from economics to human-robot interaction. We develop a policy gradient method which we prove solves these games in continuous state, continuous action settings, using noisy gradient estimates computed from observed trajectories of play. When the games are convex-concave, we prove that our algorithm converges to Stackelberg equilibrium in polynomial time. We also prove that reach-avoid problems are naturally modeled as convex-concave zero-sum Markov Stackelberg games, and show experimentally that Stackelberg equilibrium policies are more effective than their Nash counterparts in these problems.[^1]

[^1]: A full and current version of the paper can be found at: https://arxiv.org/abs/2401.12437

## 1 Introduction

Markov games [28, 65, 70] are a generalization of Markov decision processes (MDPs) comprising multiple players simultaneously making decisions over time, collecting rewards along the way depending on their collective actions. They have been used by practitioners to model many real-world multiagent planning and learning environments, such as autonomous driving [31, 59], cloud computing [77], and telecommunications [3]. Moreover, theoreticians are beginning to formally analyze policy gradient methods, proving polynomial-time convergence to optimal policies in MDPs [2, 16], and to Nash equilibrium policies [53], the canonical solution concept, in zero-sum Markov games [24].

While Markov games are a fruitful way to model some problems (e.g., robotic soccer [46]), others, such as reach-avoid [48], may be more productively modeled as sequential-move games, where some players commit to moves that are observed by others, before the latter make their own moves. To this end, we study two-player zero-sum Markov Stackelberg [74] (i.e., sequential-move) games. While polynomial-time value-iteration (i.e., planning) algorithms are known for these games assuming discrete states [36], we develop a policy gradient method that converges to Stackelberg equilibrium in polynomial time in continuous state, continuous action games, using noisy gradients based only on observed trajectories of play. Furthermore, we demonstrate experimentally that Stackelberg equilibrium policies are more effective than their Nash counterparts in reach-avoid problems.

A (discounted discrete-time) zero-sum Markov Stackelberg game [36] is played over an infinite horizon $t = 0, 1, \ldots$ between two players, a leader and a follower. The game starts at time $t = 0$, at some initial state $S^{(0)} \sim \mu \in \Delta(\mathcal{S})$ drawn randomly from a set of states $\mathcal{S}$. At each time step $t = 0, 1, 2, \ldots$, the players encounter a state $s^{(t)} \in \mathcal{S}$, where the leader takes its action $a^{(t)}$ first, from its action space $\mathcal{A}(s^{(t)})$, after which the follower, having observed the leader's action, makes its own move $b^{(t)}$, chosen from a feasible subset $\mathcal{C}(s^{(t)}, a^{(t)})$ of its action space $\mathcal{B}(s^{(t)})$, determined by the leader's action $a^{(t)}$.[^2] After both players have taken their actions, they receive respective rewards, $-r(s^{(t)}, a^{(t)}, b^{(t)})$ and $r(s^{(t)}, a^{(t)}, b^{(t)})$.
The game then moves to time step $t+1$, and transitions either to a new state $S^{(t+1)} \sim p(\cdot \mid s^{(t)}, a^{(t)}, b^{(t)})$ with probability $\gamma$, called the discount factor, or ends with the remaining probability. Each player's goal is to play a (potentially history-dependent) policy that maximizes its respective expected (cumulative discounted) payoff, $-\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(S^{(t)}, A^{(t)}, B^{(t)})\big]$ and $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r(S^{(t)}, A^{(t)}, B^{(t)})\big]$, respectively.[^3]

In zero-sum Markov Stackelberg games, when the reward function $(a, b) \mapsto r(s, a, b)$ is continuous and bounded, for all $s \in \mathcal{S}$, and the correspondence $a \mapsto \mathcal{C}(s, a)$ is continuous, as well as non-empty- and compact-valued, a recursive (or Markov perfect) [49] Stackelberg equilibrium is guaranteed to exist [36], meaning a stationary policy profile (i.e., a pair of mappings from states to the actions of the leader and the follower, respectively) specifying the actions taken at each state s.t. the leader's policy maximizes its expected payoff assuming the follower best responds, while the follower indeed best responds to the leader's policy. In other words, the aforementioned assumptions guarantee the existence of a policy profile $\pi^* \doteq (\pi_a^*, \pi_b^*)$, with $\pi_a^* : \mathcal{S} \to \mathcal{A}$ and $\pi_b^* : \mathcal{S} \to \mathcal{B}$, that solves the following coupled min-max optimization problem:

$$\min_{\pi_a : \mathcal{S} \to \mathcal{A}} \; \max_{\substack{\pi_b : \mathcal{S} \to \mathcal{B} :\\ \forall s \in \mathcal{S},\; \pi_b(s) \in \mathcal{C}(s, \pi_a(s))}} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r\big(S^{(t)}, \pi_a(S^{(t)}), \pi_b(S^{(t)})\big)\right] \tag{1}$$

where the expectation is with respect to $S^{(0)} \sim \mu$ and $S^{(t+1)} \sim p(\cdot \mid S^{(t)}, \pi_a(S^{(t)}), \pi_b(S^{(t)}))$. The problem is coupled because the players' action sets constrain one another; in particular, the set of actions available to the follower at each state is determined by the leader's choice.

In spite of multiple compelling applications, including autonomous driving [29, 43], reach-avoid problems in human-robot interaction [10], robust optimization in stochastic environments [15], and resource allocation over time [36], very little is known about computing recursive Stackelberg equilibria in zero-sum Markov Stackelberg games. A version of value iteration is known to converge in polynomial time when the state space is discrete [36], but this (planning) method becomes intractable in large or continuous state spaces. Furthermore, nothing is known, to our knowledge, about learning Stackelberg equilibria from observed trajectories of play. We develop an efficient policy gradient method for convex-concave zero-sum Markov Stackelberg games, and we show that reach-avoid problems naturally lie in this class of games.

**Contributions.** Equation (1) reveals that the problem of computing Stackelberg equilibria in zero-sum Markov Stackelberg games is an instance of a coupled min-max optimization problem. Goktas and Greenwald [33] studied coupled min-max optimization problems assuming an exact first-order oracle, meaning one that returns a function's exact value and gradient at any point in its domain. As access to an exact oracle is an unrealistic assumption in Markov games, we develop a first-order method for solving these problems, assuming access to a stochastic first-order oracle, which returns noisy estimates of a function's value and gradient at any point in its domain. We show that our method converges in polynomial time (Theorem 3.1) in a large class of coupled min-max optimization problems, namely those which are convex-concave. We then proceed to study zero-sum Markov Stackelberg games, providing sufficient conditions on the action correspondence $\mathcal{C} : \mathcal{S} \times \mathcal{A} \rightrightarrows \mathcal{B}$, the rewards $r : \mathcal{S} \times \mathcal{A} \times \mathcal{B} \to \mathbb{R}$, and the transition probabilities $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \times \mathcal{B} \to \mathbb{R}_+$ to guarantee that the game is convex-concave.
Furthermore, we develop a policy gradient algorithm that provably converges to Stackelberg equilibrium in polynomial time when such games are convex-concave (Theorem 4.1), the first reinforcement learning algorithm of this kind. Our method specializes to continuous state, continuous action zero-sum Markov games; as such, we provide a provably convergent policy gradient method for these problems as well. Finally, we prove that our framework naturally models reach-avoid problems, and run experiments which show that the Stackelberg equilibrium policies learned by our method exhibit better safety and liveness properties than their Nash counterparts.

[^2]: To simplify notation, we drop the dependency of the action spaces $\mathcal{A}$ and $\mathcal{B}$ on states going forward, but our theory applies in this more general setting.
[^3]: Unlike $a^{(t)}$ and $b^{(t)}$, which are deterministic actions because they depend on a realized history of states and actions encountered, $A^{(t)}$ and $B^{(t)}$ are random variables, because they might depend on a random history.

## 2 Preliminaries

**Notation.** All notation for variable types, e.g., vectors, should be clear from context; if any confusion arises, see Appendix A. Unless otherwise noted, we assume $\|\cdot\|$ is the Euclidean norm $\|\cdot\|_2$. We let $\Delta^n \doteq \{x \in \mathbb{R}_+^n \mid \sum_{i=1}^n x_i = 1\}$ denote the unit simplex in $\mathbb{R}^n$, and $\Delta(A)$ the set of probability distributions on the set $A$. We also define the support of any distribution $f \in \Delta(X)$ as $\mathrm{supp}(f) \doteq \{x \in X : f(x) > 0\}$. We denote the orthogonal projection operator onto a set $C$ by $\Pi_C$, i.e., $\Pi_C(x) = \arg\min_{y \in C} \|x - y\|^2$. We denote by $\mathbb{1}_C(x)$ the indicator function of a set $C$, with value 1 if $x \in C$ and 0 otherwise. Given two vectors $x, y \in \mathbb{R}^n$, we write $x \geq y$ or $x > y$ to mean component-wise $\geq$ or $>$, respectively. For any set $C$, we denote the diameter by $\mathrm{diam}(C) \doteq \max_{c, c' \in C} \|c - c'\|$. Given a tuple consisting of a sequence of iterates and weights $(\{z^{(t)}\}_t, \{\eta^{(t)}\}_t)$, the weighted average of the iterates is given by $\bar{z} \doteq \frac{\sum_t \eta^{(t)} z^{(t)}}{\sum_t \eta^{(t)}}$.

**Mathematical Concepts.** Given $X \subseteq \mathbb{R}^n$, the function $f : X \to Y$ is said to be $\ell_f$-Lipschitz-continuous w.r.t. norm $\|\cdot\|$ iff $\forall x_1, x_2 \in X$, $\|f(x_1) - f(x_2)\| \leq \ell_f \|x_1 - x_2\|$. If $Y = \mathbb{R}$, then $f$ is convex (resp. concave) iff for all $\lambda \in (0, 1)$ and $x, x' \in X$, $f(\lambda x + (1-\lambda)x') \leq$ (resp. $\geq$) $\lambda f(x) + (1-\lambda)f(x')$. For any $Y$, if the relation holds with equality, then $f$ is called affine. A two-sided function $h : X \times Y \to Z$ is biaffine if $x \mapsto h(x, y)$ is affine for all $y \in Y$, and $y \mapsto h(x, y)$ is affine for all $x \in X$. $f$ is $\mu$-strongly convex if $f(x_1) \geq f(x_2) + \langle \nabla_x f(x_2), x_1 - x_2 \rangle + \frac{\mu}{2}\|x_1 - x_2\|^2$. For convenience, we say that an $l$-dimensional vector-valued function $g : X \to Y \subseteq \mathbb{R}^l$ is log-convex, convex, log-concave, or concave, respectively, if $g_k$ is log-convex, convex, log-concave, or concave, for all $k \in [l]$. A correspondence $Z : X \rightrightarrows Y$ is concave if for all $\lambda \in (0, 1)$ and $x, x' \in X$, $Z(\lambda x + (1-\lambda)x') \subseteq \lambda Z(x) + (1-\lambda)Z(x')$, assuming Minkowski summation [22, 57]. We require notions of stochastic convexity related to stochastic dominance of probability distributions [7]. Given non-empty and convex parameter and outcome spaces $\mathcal{W}$ and $\mathcal{O}$, respectively, a conditional probability distribution $w \mapsto p(\cdot \mid w) \in \Delta(\mathcal{O})$ is said to be stochastically convex (resp. stochastically concave) in $w \in \mathcal{W}$ if for all continuous, bounded, and convex (resp. concave) functions $v : \mathcal{O} \to \mathbb{R}$, $\lambda \in (0, 1)$, and $w', w'' \in \mathcal{W}$ s.t. $w = \lambda w' + (1-\lambda)w''$, it holds that $\mathbb{E}_{O \sim p(\cdot \mid w)}[v(O)] \leq$ (resp. $\geq$) $\lambda \mathbb{E}_{O \sim p(\cdot \mid w')}[v(O)] + (1-\lambda)\mathbb{E}_{O \sim p(\cdot \mid w'')}[v(O)]$.
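The projection operator $\Pi_C$ and the weighted-average iterate $\bar{z}$ defined above are the basic building blocks of the projected stochastic gradient methods developed in Section 3. As a concrete illustration, here is a minimal NumPy sketch, ours rather than the authors', of $\Pi_C$ for two simple convex sets: a box and the unit simplex $\Delta^n$; the function names are our own.

```python
import numpy as np

def project_box(v, lo, hi):
    """Euclidean projection of v onto the box [lo, hi] (componentwise clip)."""
    return np.clip(v, lo, hi)

def project_simplex(v):
    """Euclidean projection of v onto the unit simplex
    {x >= 0 : sum(x) = 1}, via the standard sort-based algorithm."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# Example: project an arbitrary point onto the simplex in R^3.
print(project_simplex(np.array([0.8, 1.2, -0.4])))   # -> [0.3, 0.7, 0.], sums to 1
```

Any compact convex set with an efficiently computable projection can play the role of the action or parameter sets appearing in the algorithms below.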
## 3 Coupled Min-Max Optimization Problems

A min-max Stackelberg game, denoted $(X, Y, f, g)$, is a two-player, zero-sum game, where one player, called the leader, first commits to an action $x \in X$ from its action space $X \subset \mathbb{R}^n$, after which the second player, called the follower, takes an action $y \in Z(x) \subseteq Y$ from a subset of its action space $Y \subset \mathbb{R}^m$ determined by the action correspondence $Z : \mathbb{R}^n \rightrightarrows Y$. As is standard in the optimization literature, we assume throughout that the follower's action correspondence can be equivalently represented via a coupling constraint function $g : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^d$ s.t. $Z(x) \doteq \{y \in Y \mid g(x, y) \geq 0\}$. An action profile $(x, y) \in X \times Y$ comprises actions for both players. Once both players have taken their actions, the leader (resp. follower) receives a loss (resp. payoff) $f(x, y)$, defined by an objective function $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$. We define the marginal function $f^*(x) \doteq \max_{y \in Z(x)} f(x, y)$, which, given an action for the leader, outputs its ensuing loss, assuming the follower best responds. The constraints in a min-max Stackelberg game are said to be uncoupled if $Z(x) = Y$, for all $x \in X$. A min-max Stackelberg game is said to be continuous iff 1. the objective function $f$ is continuous; 2. the action spaces $X$ and $Y$ are non-empty and compact; and 3. the action correspondence $Z$ is continuous, non-empty-, and compact-valued.[^4]

**Stackelberg Equilibrium.** The canonical solution concept for min-max Stackelberg games is the $(\varepsilon, \delta)$-Stackelberg equilibrium ($(\varepsilon, \delta)$-SE, or SE if $\varepsilon = \delta = 0$), an action profile $(x^*, y^*) \in X \times Y$ s.t. $\|\Pi_{\mathbb{R}^d_-}[g(x^*, y^*)]\| \leq \delta$ and $\min_{x \in X} f^*(x) + \varepsilon \geq f(x^*, y^*) \geq \max_{y \in Z(x^*)} f(x^*, y) - \delta$, for $\varepsilon, \delta \geq 0$.[^5] As a straightforward corollary of Theorem 3.2 of Goktas and Greenwald [33], a SE is guaranteed to exist in continuous min-max Stackelberg games. Moreover, the set of SE can be characterized as solutions to the following coupled min-max optimization problem: $\min_{x \in X} \max_{y \in Z(x)} f(x, y)$.

[^4]: See Theorem 5.9 and Example 5.10 of Rockafellar and Wets [63] for conditions on $g$ that guarantee the continuity of $Z$, or Section 3 of Goktas and Greenwald [33].
[^5]: For $\delta > 0$, this definition of an $(\varepsilon, \delta)$-SE is more general than the one introduced by Goktas and Greenwald [33], as it allows for the coupling constraints to be satisfied only approximately, which is necessary in this paper, as the coupling constraints can only be accessed via a stochastic oracle.

An alternative but weaker solution concept previously considered for min-max Stackelberg games [71] is the $\varepsilon$-generalized Nash equilibrium ($\varepsilon$-GNE, or GNE if $\varepsilon = 0$), i.e., $(x^*, y^*) \in X \times Z(x^*)$ s.t. $\min_{x \in X} f(x, y^*) + \varepsilon \geq f(x^*, y^*) \geq \max_{y \in Z(x^*)} f(x^*, y) - \varepsilon$, for some $\varepsilon \geq 0$.[^6] In general, the sets of GNE and SE need not intersect; as such, GNE are not necessarily solutions of $\min_{x \in X} \max_{y \in Z(x)} f(x, y)$ (see Appendix A of Goktas and Greenwald [33]). Furthermore, there is no GNE whose value is less than the SE value of a game. When a min-max Stackelberg game's constraints are uncoupled, an ($\varepsilon$-)GNE is called an ($\varepsilon$-)saddle point, or an ($\varepsilon$-)Nash equilibrium, and is also an SE. Finally, a saddle point is guaranteed to exist [67, 73] in continuous min-max Stackelberg games with uncoupled constraints, a convex-concave objective $f$, and convex action spaces $X$ and $Y$, in which case such games have traditionally been referred to as convex-concave min-max (simultaneous-move) games or saddle-point problems [13].

**Convex-Concave Games.** A min-max Stackelberg game is said to be convex-concave if, in addition to being continuous, 1. the marginal function $f^*$ is convex; 2. $y \mapsto f(x, y)$ is concave, for all $x \in X$; 3. $X$ and $Y$ are convex; and 4. $Z$ is convex-valued.
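For concreteness, here is a small worked example, our own illustration rather than one taken from the paper, of a coupled min-max Stackelberg game that is convex-concave in the sense just defined.

```latex
% Toy convex-concave min-max Stackelberg game (our illustration):
%   X = Y = [0,1],   f(x,y) = (x - 1/4)^2 + y,   g(x,y) = 1 - x - y,
% so Z(x) = { y in [0,1] : y <= 1 - x }.
\[
  f^*(x) \;=\; \max_{y \in Z(x)} f(x,y)
         \;=\; \Big(x - \tfrac{1}{4}\Big)^{2} + (1 - x),
  \qquad \text{which is convex on } [0,1].
\]
\[
  \frac{d}{dx} f^*(x) \;=\; 2\Big(x - \tfrac{1}{4}\Big) - 1 \;=\; 0
  \;\Longrightarrow\; x^* = \tfrac{3}{4},
  \qquad
  y^* \;=\; 1 - x^* \;=\; \tfrac{1}{4},
\]
\[
  \text{so } (x^*, y^*) = \big(\tfrac{3}{4}, \tfrac{1}{4}\big)
  \text{ is a Stackelberg equilibrium with value } f(x^*, y^*) = \tfrac{1}{2}.
\]
```

Here $f$ is convex in $(x, y)$, $y \mapsto f(x, y)$ is affine (hence concave), and $Z$ is convex-valued, so every condition of the definition above holds. The coupling matters: with the uncoupled feasible set $Z(x) \equiv [0, 1]$, the follower would always play $y = 1$ and the leader would instead commit to $x = \tfrac{1}{4}$.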
This definition of a convex-concave min-max Stackelberg game generalizes that of a convex-concave min-max (simultaneous-move) game, because in such games, the marginal function $f^*$ is necessarily convex when $f$ is convex, by Danskin's theorem [23]. Assuming access to an exact first-order oracle, an $(\varepsilon, \delta)$-SE of a convex-concave min-max Stackelberg game can be computed in polynomial time when $f$ and $g$ are Lipschitz-continuous [33], while the computation is NP-hard in continuous min-max Stackelberg games, even when $X$ and $Y$ are convex, $f$ is convex-concave, and $g$ is affine [47].

All the conditions that define a convex-concave Stackelberg game depend on the game primitives, namely $(X, Y, f, g)$, and are well understood (see, for instance, Section 5 of Rockafellar and Wets [63]), with the exception of the condition that the marginal function $f^*$ be convex. While it is difficult to obtain necessary and sufficient conditions on the game primitives that ensure the convexity of $f^*$, one possibility is to require $f$ to be convex in $(x, y)$ and $Z$ to be concave.[^7] The following sufficient conditions, which also guarantee concavity, were introduced by Goktas and Greenwald [33].

**Assumption 1 (Convex-Concave Assumptions).** 1. The objective function $f(x, y)$ is convex in $(x, y)$, and concave in $y$, for all $x \in X$; 2. the action correspondence $Z$ is concave; 3. the action spaces $X$ and $Y$ are convex.

As these assumptions are only sufficient, they are not necessarily satisfied in all applications of convex-concave min-max Stackelberg games. Hence, the convexity of the marginal function must sometimes be established by other means. We thus provide the following alternative set of sufficient conditions, which we use to show that the reach-avoid problem we study in Section 5 is convex-concave.

**Assumption 2 (Alternative Convex-Concave Assumptions).** 1. (Convex-concave objective) The objective $f(x, y)$ is convex in $x$, for all $y \in Y$, and concave in $y$, for all $x \in X$; 2. (log-convex-concave coupling) the coupling constraint $g(x, y)$ is log-convex in $x$ for all $y \in Y$, and concave in $y$ for all $x \in X$; and 3. the action spaces $X$ and $Y$ are convex.

**Computation.** We now turn our attention to the computation of $(\varepsilon, \delta)$-SE in convex-concave min-max Stackelberg games, assuming access to an unbiased first-order stochastic oracle $(\hat{F}, \hat{G}, \mathcal{D}_F, \mathcal{D}_G)$ comprising random functions $\hat{F} : \mathbb{R}^n \times \mathbb{R}^m \times \Xi \to \mathbb{R}$ and $\hat{G} : \mathbb{R}^n \times \mathbb{R}^m \times \Phi \to \mathbb{R}^d$ and sampling distributions $\mathcal{D}_F \in \Delta(\Xi)$ and $\mathcal{D}_G \in \Delta(\Phi)$ s.t. $\mathbb{E}_{\xi \sim \mathcal{D}_F}[\hat{F}(x, y; \xi)] = f(x, y)$, $\mathbb{E}_{\varphi \sim \mathcal{D}_G}[\hat{G}(x, y; \varphi)] = g(x, y)$, $\mathbb{E}_{\xi}[\nabla_{(x,y)} \hat{F}(x, y; \xi)] = \nabla f(x, y)$, and $\mathbb{E}_{\varphi}[\nabla_{(x,y)} \hat{G}(x, y; \varphi)] = \nabla g(x, y)$. The following assumptions are required for the convergence of our methods.

**Assumption 3.** 1. (Lipschitz game) $f$ and $g$ are Lipschitz-continuous; 2. (concave representation) the coupling constraint function $y \mapsto g(x, y)$ is concave for all $x \in X$; 3. (Slater's condition) $\forall x \in X$, $\exists \hat{y} \in Y$ s.t. $g(x, \hat{y}) > 0$; and 4. (stochastic oracle) there exists an unbiased first-order stochastic oracle $(\hat{F}, \hat{G}, \mathcal{D}_F, \mathcal{D}_G)$ with bounded variance s.t. $\forall (x, y) \in X \times Y$, $\mathbb{E}[\|\hat{G}(x, y; \varphi)\|^2] \leq \sigma_g$, $\mathbb{E}[\|\nabla_{(x,y)} \hat{F}(x, y; \xi)\|^2] \leq \sigma_{\nabla f}$, and $\mathbb{E}[\|\nabla_{(x,y)} \hat{G}(x, y; \varphi)\|^2] \leq \sigma_{\nabla g}$, for $\sigma_g, \sigma_{\nabla f}, \sigma_{\nabla g} \geq 0$.

In the sequel, we rely on the following notation and definitions. For any action $x \in X$ of the leader, we can re-express the marginal function in terms of the Lagrangian $\mathcal{L}(y, \lambda; x) \doteq f(x, y) + \langle \lambda, g(x, y) \rangle$ (see, for instance, Section 5 of Boyd et al. [17]) as follows: $f^*(x) = \max_{y \in Y} \min_{\lambda \in \mathbb{R}^d_+} \mathcal{L}(y, \lambda; x)$.
Further, we define the follower's best-response correspondence $Y^*(x) \doteq \arg\max_{y \in Y} \min_{\lambda \in \mathbb{R}^d_+} \mathcal{L}(y, \lambda; x)$, and the KKT multiplier correspondence $\Lambda^*(x) \doteq \arg\min_{\lambda \in \mathbb{R}^d_+} \max_{y \in Y} \mathcal{L}(y, \lambda; x)$. With these definitions in hand, under Assumption 3, we can build an unbiased first-order stochastic oracle $\hat{\mathcal{L}}(y, \lambda; x, \xi, \varphi) \doteq \hat{F}(x, y; \xi) + \langle \lambda, \hat{G}(x, y; \varphi) \rangle$ for the Lagrangian, s.t. $\mathbb{E}_{(\xi, \varphi)}[\hat{\mathcal{L}}(y, \lambda; x, \xi, \varphi)] = \mathcal{L}(y, \lambda; x)$, where the expectation is taken over $(\xi, \varphi) \sim \mathcal{D}_F \times \mathcal{D}_G$.

[^6]: A GNE is guaranteed to exist in continuous min-max Stackelberg games when the objective function $f$ is convex-concave, the action spaces $X$ and $Y$ are convex, and the action correspondence $Z$ is convex-valued [4].
[^7]: See Section 2 of Nikodem [57] and Chapter 36 of Czerwik [22] for conditions on $g$ which guarantee that $Z$ is concave and/or continuous and/or convex-valued.

**Algorithms.** Assuming access to an exact first-order oracle $(f, g)$, a natural approach to computing SE in convex-concave min-max Stackelberg games with uncoupled constraints (i.e., saddle-point problems) is to simultaneously run projected gradient descent and projected gradient ascent on the objective function $f$ w.r.t. $x \in X$ and $y \in Y$, i.e., for $t = 0, 1, 2, \ldots$,
$$(x^{(t+1)}, y^{(t+1)}) \leftarrow \Pi_{X \times Y}\big[(x^{(t)}, y^{(t)}) + \eta^{(t)}(-\nabla_x f, \nabla_y f)(x^{(t)}, y^{(t)})\big],$$
a method known under the names Arrow-Hurwicz-Uzawa, primal-dual, and (simultaneous) gradient descent ascent (GDA) [5, 6]. Intuitively, any fixed point of GDA in such games, i.e., $(x^*, y^*) \in X \times Y$ s.t. $\|(x^*, y^*) - \Pi_{X \times Y}[(x^*, y^*) + (-\nabla_x f, \nabla_y f)(x^*, y^*)]\| = 0$, satisfies the necessary and sufficient optimality condition for an action profile to be a SE. More generally, in convex-concave min-max Stackelberg games (with coupled constraints), this approach fails, as the necessary and sufficient optimality condition for an action profile $(x^*, y^*) \in X \times Y$ to be a SE is $\|(x^*, y^*) - \Pi_{X \times Z(x^*)}[(x^*, y^*) + (-\nabla_x f^*(x^*), \nabla_y f(x^*, y^*))]\| = 0$, where, for any leader action $\hat{x} \in X$, $\nabla_x f^*(\hat{x}) \doteq \nabla_x \mathcal{L}(y^*(\hat{x}), \lambda^*(\hat{x}); \hat{x})$, for some $(y^*, \lambda^*)(\hat{x}) \in Y^*(\hat{x}) \times \Lambda^*(\hat{x})$, by the subdifferential envelope theorem [33].

The observation that any subgradient of $f^*$ depends on the optimal KKT multipliers motivates a first-order method based on the gradient of the Lagrangian. A min-max Stackelberg game can be seen as a three-player game $\min_{x \in X} \max_{y \in Z(x)} f(x, y) = \min_{x \in X} \max_{y \in Y} \min_{\lambda \in \mathbb{R}^d_+} \mathcal{L}(y, \lambda; x)$, where the $x$-player moves first, and the $y$- and $\lambda$-players move second, simultaneously, because strong duality holds under Assumption 3 (Slater's condition [68]) for the inner min-max optimization problem, i.e., $\max_{y \in Y} \min_{\lambda \in \mathbb{R}^d_+} \mathcal{L}(y, \lambda; x) = \min_{\lambda \in \mathbb{R}^d_+} \max_{y \in Y} \mathcal{L}(y, \lambda; x)$. The problem of computing an SE can thus be reduced to the min-max optimization $\min_{(x, \lambda) \in X \times \mathbb{R}^d_+} \max_{y \in Y} \mathcal{L}(y, \lambda; x)$, which we might hope to solve by running GDA on $\mathcal{L}(y, \lambda; x)$ w.r.t. $(x, \lambda)$ and $y$ over $X \times \mathbb{R}^d_+$ and $Y$, respectively. Although $y \mapsto \mathcal{L}(y, \lambda; x)$ is concave, $(x, \lambda) \mapsto \mathcal{L}(y, \lambda; x)$ is not convex, and its stationary points (i.e., points $(y^*, \lambda^*; x^*)$ s.t. $\|(y^*, \lambda^*; x^*) - \Pi_{Y \times \mathbb{R}^d_+ \times X}[(y^*, \lambda^*; x^*) + (\nabla_y \mathcal{L}, -\nabla_\lambda \mathcal{L}, -\nabla_x \mathcal{L})(y^*, \lambda^*; x^*)]\| = 0$) do not necessarily coincide with SE, even in simple convex-concave min-max Stackelberg games [35].

**Algorithm 1: Saddle-Point-Oracle SGD / Nested SGDA**

Inputs: $X$, $Y$, $\hat{F}$, $\hat{G}$, $\mathcal{D}_F$, $\mathcal{D}_G$, $x^{(0)}$, $T_x$, $\{\eta_x^{(t)}\}_t$, $\delta$ (+ for nested SGDA: $\Lambda$, $y'^{(0)}$, $\lambda'^{(0)}$, $T_y$, $\{\eta_y^{(s)}\}_s$)
Outputs: $(x^{(t)}, y^{(t)}, \lambda^{(t)})_t$
1: for $t = 0, \ldots, T_x$ do
2: if Saddle-Point-Oracle SGD then
3: Find $(y^{(t)}, \lambda^{(t)}) \in Y \times \mathbb{R}^d_+$ s.t.
4: $\max_{y \in Y} \mathcal{L}(y, \lambda^{(t)}; x^{(t)}) - \min_{\lambda \in \Lambda} \mathcal{L}(y^{(t)}, \lambda; x^{(t)}) \leq \delta$
5: if Nested SGDA then
6: for $s = 0, \ldots, T_y$ do
7: Sample $\xi \sim \mathcal{D}_F$, $\varphi \sim \mathcal{D}_G$
8: $y'^{(s+1)} \leftarrow \Pi_Y\big[y'^{(s)} + \eta_y^{(s)} \nabla_y \hat{\mathcal{L}}(y'^{(s)}, \lambda'^{(s)}; x^{(t)}, \xi, \varphi)\big]$
9: $\lambda'^{(s+1)} \leftarrow \Pi_\Lambda\big[\lambda'^{(s)} - \eta_y^{(s)} \nabla_\lambda \hat{\mathcal{L}}(y'^{(s)}, \lambda'^{(s)}; x^{(t)}, \xi, \varphi)\big]$
10: Average iterates: $(y^{(t)}, \lambda^{(t)}) \leftarrow (\bar{y}', \bar{\lambda}')$
11: Sample $\xi \sim \mathcal{D}_F$, $\varphi \sim \mathcal{D}_G$
12: $x^{(t+1)} \leftarrow \Pi_X\big[x^{(t)} - \eta_x^{(t)} \nabla_x \hat{\mathcal{L}}(y^{(t)}, \lambda^{(t)}; x^{(t)}, \xi, \varphi)\big]$
13: return $(\bar{x}, y^{(T_x)}, \lambda^{(T_x)})$

As GDA fails in these games, Goktas and Greenwald [33] developed nested GDA, a nested first-order method for computing an $(\varepsilon, \delta)$-SE, which solves the inner maximization problem by running GDA on $\mathcal{L}$ w.r.t. $y$ and $\lambda$ over the constraint sets $Y$ and $\mathbb{R}^d_+$ until convergence to a $\delta$-saddle point $(\hat{y}, \hat{\lambda})$. Then, exploiting the convexity of the marginal function $f^*$, the algorithm runs a descent step on $f^*$ w.r.t. $x$, in which, for any leader action $x \in X$, a subgradient of $f^*$ is approximated by $\widehat{\nabla_x f^*}(x) = \nabla_x \mathcal{L}(\hat{y}, \hat{\lambda}; x)$. In this paper, we replace the exact first-order oracle used by nested GDA with a stochastic one, the gradient descent step with a step of stochastic gradient descent (SGD), and GDA with stochastic GDA (SGDA), using in both cases the stochastic Lagrangian oracle $\hat{\mathcal{L}}$. We call our method nested stochastic gradient descent ascent (nested SGDA).

We begin by presenting the saddle-point-oracle stochastic gradient descent algorithm (saddle-point-oracle SGD, Algorithm 1), whose analysis we build on to develop our primary contribution, nested SGDA. Following Goktas and Greenwald's [33] max-oracle gradient descent algorithm, saddle-point-oracle SGD runs SGD on $f^*$, assuming access to an oracle which, for any leader action $x \in X$, returns a $\delta$-saddle point of $(y, \lambda) \mapsto \mathcal{L}(y, \lambda; x)$. Our second algorithm, nested stochastic gradient descent ascent (nested SGDA, Algorithm 1), follows the same logic as saddle-point-oracle SGD, but implements the saddle-point oracle by running SGDA. The following theorem establishes conditions under which both of our algorithms converge to an $(\varepsilon + \delta, \delta)$-SE in a number of oracle calls that is polynomial in $1/\varepsilon$ and $1/\delta$.[^8]

**Theorem 3.1.** Let $(X, Y, f, g)$ be a convex-concave min-max Stackelberg game for which Assumption 3 holds. For any $\varepsilon, \delta \geq 0$, if nested SGDA (resp. saddle-point-oracle SGD) is run with inputs[^9] that satisfy $\eta_x^{(t)} \in \Theta(1/\sqrt{t+1})$ for all $t \in \mathbb{N}_+$, and outputs $(\bar{x}, \bar{y}, \bar{\lambda})$, then in expectation over all runs of the algorithm (i.e., sample paths of $\xi$ and $\varphi$), the action profile $(\bar{x}, \bar{y})$ is an $(\varepsilon + \delta, \delta)$-SE after $O(1/\varepsilon^2\delta^2)$ (resp. $O(1/\varepsilon^2)$) oracle calls. If, in addition, $f$ is $\mu$-strongly-convex, then $(\bar{x}, \bar{y})$ is an $(\varepsilon + \delta, \delta)$-SE after $O(1/\varepsilon\delta^2)$ (resp. $O(1/\varepsilon)$) oracle calls.

[^8]: We include detailed theorem statements and proofs in the full version of the paper.
[^9]: $\Lambda$ should be chosen as a superset of all optimal KKT multipliers, i.e., $\bigcup_{x \in X} \Lambda^*(x)$ (see Appendix C).
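To make Algorithm 1 concrete, here is a self-contained NumPy sketch of nested SGDA run on the toy game from the worked example earlier in this section. It is our own illustration rather than the authors' code: the Gaussian noise standing in for the stochastic oracle, the step-size constants, the hand-picked bound `LAMBDA_MAX` on the multiplier set $\Lambda$, and all function names are choices made for this demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coupled min-max Stackelberg game from the worked example above:
#   f(x, y) = (x - 1/4)^2 + y,   g(x, y) = 1 - x - y,   X = Y = [0, 1].
# A stochastic first-order oracle is simulated by adding Gaussian noise to the
# exact gradients; LAMBDA_MAX is a hand-picked compact superset of the optimal
# KKT multipliers (an assumption of this sketch, not a quantity from the paper).
NOISE, LAMBDA_MAX = 0.1, 5.0

def grad_L_hat(x, y, lam):
    """Noisy gradients of the Lagrangian L(y, lam; x) = f(x, y) + lam * g(x, y)."""
    gx = 2.0 * (x - 0.25) - lam        # dL/dx
    gy = 1.0 - lam                     # dL/dy
    glam = 1.0 - x - y                 # dL/dlam = g(x, y)
    noise = NOISE * rng.standard_normal(3)
    return gx + noise[0], gy + noise[1], glam + noise[2]

def inner_sgda(x, T_y=400):
    """Inner loop: projected SGDA on (y, lam) -> L(y, lam; x) with weighted
    iterate averaging, approximating a delta-saddle point (lines 6-10 of Alg. 1)."""
    y, lam = 0.5, 1.0
    y_sum = lam_sum = w_sum = 0.0
    for s in range(T_y):
        eta = 1.0 / np.sqrt(s + 1)
        _, gy, glam = grad_L_hat(x, y, lam)
        y = np.clip(y + eta * gy, 0.0, 1.0)                # ascent step on y
        lam = np.clip(lam - eta * glam, 0.0, LAMBDA_MAX)   # descent step on lam
        y_sum += eta * y; lam_sum += eta * lam; w_sum += eta
    return y_sum / w_sum, lam_sum / w_sum

def nested_sgda(T_x=300):
    """Outer loop: projected SGD on x, using the inner saddle point (y, lam)
    to form the stochastic subgradient of the marginal function f* (line 12)."""
    x = 0.2
    x_sum = w_sum = 0.0
    for t in range(T_x):
        eta = 0.5 / np.sqrt(t + 1)
        y_bar, lam_bar = inner_sgda(x)
        gx, _, _ = grad_L_hat(x, y_bar, lam_bar)
        x = np.clip(x - eta * gx, 0.0, 1.0)
        x_sum += eta * x; w_sum += eta
    x_bar = x_sum / w_sum
    y_bar, lam_bar = inner_sgda(x_bar)
    return x_bar, y_bar, lam_bar

x_bar, y_bar, lam_bar = nested_sgda()
print(f"x ~ {x_bar:.3f}, y ~ {y_bar:.3f}, lambda ~ {lam_bar:.3f}")
# Expect roughly (x, y, lambda) ~ (0.75, 0.25, 1): the Stackelberg equilibrium
# of the toy game, together with its optimal KKT multiplier.
```

Note the division of labor that mirrors the algorithm: the inner SGDA never touches $x$, and the outer step uses the returned multiplier to form the (stochastic) subgradient $\nabla_x \mathcal{L}(\bar{y}, \bar{\lambda}; x)$ of $f^*$.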
## 4 Policy Gradient in Convex-Concave Zero-Sum Markov Stackelberg Games

In this section, we reduce the computation of Stackelberg equilibrium in zero-sum Markov Stackelberg games to a coupled min-max optimization problem, which enables us to derive a policy gradient method for these games based on our nested SGDA algorithm. We consider zero-sum Markov Stackelberg games $\mathcal{M} \doteq (l, n, m, d, \mathcal{S}, \mathcal{A}, \mathcal{B}, \mu, r, g, p, \gamma)$ with state space $\mathcal{S} \subseteq \mathbb{R}^l$ and action spaces $\mathcal{A} \subseteq \mathbb{R}^n$ and $\mathcal{B} \subseteq \mathbb{R}^m$ for the leader and follower, respectively, where the follower's actions are constrained by the leader's via vector-valued state-dependent coupling constraints $g : \mathcal{S} \times \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^d$ that define a correspondence $\mathcal{C}(s, a) \doteq \{b \in \mathcal{B} \mid g(s, a, b) \geq 0\}$. We define the set of states with non-trivially coupled constraints $\mathcal{N} \doteq \{s \in \mathcal{S} \mid \exists (a, b) \in \mathcal{A} \times \mathcal{B}, g(s, a, b) < 0\}$. A Markov policy for the leader (resp. follower), hereafter policy for short, is one that is history-independent, and thus a mapping from states to actions $\pi_a : \mathcal{S} \to \mathcal{A}$ (resp. $\pi_b : \mathcal{S} \to \mathcal{B}$). A policy profile $\pi \doteq (\pi_a, \pi_b) \in \mathcal{A}^\mathcal{S} \times \mathcal{B}^\mathcal{S}$ is a tuple comprising policies for the leader and follower, respectively.
The follower's feasible policy correspondence is given by $Z(\pi_a) \doteq \{\pi_b : \mathcal{S} \to \mathcal{B} \mid \forall s \in \mathcal{N}, g(s, \pi_a(s), \pi_b(s)) \geq 0\}$. A continuous action zero-sum Markov Stackelberg game is a game where 1. for all states $s \in \mathcal{S}$, the reward function $(a, b) \mapsto r(s, a, b)$ is continuous and bounded, i.e., $\|r(s, \cdot, \cdot)\|_\infty \leq r_{\max} < \infty$, for some $r_{\max} \in \mathbb{R}_+$; 2. the action spaces $\mathcal{A}$ and $\mathcal{B}$ are non-empty and compact; and 3. for all states $s \in \mathcal{S}$, the correspondence $a \mapsto \mathcal{C}(s, a)$ is continuous, non-empty-, and compact-valued. A continuous state zero-sum Markov Stackelberg game is a game where 1. $\mathcal{S}$ is non-empty and compact and 2. the reward function $r$ is continuous and bounded, i.e., $\|r\|_\infty < \infty$.

A history $h \in (\mathcal{S} \times \mathcal{A} \times \mathcal{B})^\tau$ of length $\tau \in \mathbb{N}$ is a sequence of state-action tuples $h = (s^{(t)}, a^{(t)}, b^{(t)})_{t=0}^{\tau - 1}$. Given a policy profile $\pi$ and a history of play $h$ of length $\tau$, we define the discounted history distribution as $\nu_{\pi, \tau}(h) = \mu(s^{(0)}) \prod_{t=0}^{\tau - 1} \gamma\, p(s^{(t+1)} \mid s^{(t)}, a^{(t)}, b^{(t)})\, \pi(s^{(t)})(a^{(t)}, b^{(t)})$. Define the set of all realizable trajectories $\mathcal{H}_{\pi, \tau}$ of length $\tau$ under policy $\pi$ as $\mathcal{H}_{\pi, \tau} \doteq \mathrm{supp}(\nu_{\pi, \tau})$, i.e., the set of all histories that occur with non-zero probability. For convenience, we denote $\nu_\pi \doteq \nu_{\pi, \infty}$, and by $H = (S^{(t)}, A^{(t)}, B^{(t)})_t$ any randomly sampled history from $\nu_\pi$. Finally, we define the discounted state-visitation distribution under any initial state distribution $\mu$ as $\delta_\pi^\mu(s) \doteq \sum_{h \in \mathcal{H}_\pi,\, t : s^{(t)} = s} \mu(s^{(0)}) \prod_{k=1}^{t} \gamma\, p(s^{(k)} \mid s^{(k-1)}, a^{(k-1)}, b^{(k-1)})$.

Given a policy profile $\pi$, the (state-)value function $v_\pi : \mathcal{S} \to \mathbb{R}$ and the action-value function $q_\pi : \mathcal{S} \times \mathcal{A} \times \mathcal{B} \to \mathbb{R}$ are defined in terms of $\pi$ as $v_\pi(s) \doteq \mathbb{E}_{H \sim \nu_\pi}\big[\sum_{t=0}^{\infty} \gamma^t r(S^{(t)}, A^{(t)}, B^{(t)}) \mid S^{(0)} = s\big]$ and $q_\pi(s, a, b) \doteq \mathbb{E}_{H \sim \nu_\pi}\big[\sum_{t=0}^{\infty} \gamma^t r(S^{(t)}, A^{(t)}, B^{(t)}) \mid S^{(0)} = s, A^{(0)} = a, B^{(0)} = b\big]$, respectively. The cumulative payoff function $u : \mathcal{A}^\mathcal{S} \times \mathcal{B}^\mathcal{S} \to \mathbb{R}$ is then defined in terms of the value function as $u(\pi_a, \pi_b) \doteq \mathbb{E}_{S \sim \mu}[v_\pi(S)]$, i.e., the total expected loss (resp. gain) of the leader (resp. follower). Additionally, the marginal action-value function $q^*_\pi(s, a) \doteq \max_{b \in \mathcal{C}(s, a)} q_\pi(s, a, b)$ is the payoff when play initiates at state $s$ with the leader taking action $a$, after which the follower best responds (at state $s$ only), with both players playing according to $\pi$ thereafter. Finally, for any leader policy $\pi_a \in \mathcal{A}^\mathcal{S}$, we define the marginal (state-value) function $u^*(\pi_a) \doteq \max_{\pi_b \in Z(\pi_a)} u(\pi_a, \pi_b)$.

**Recursive Stackelberg Equilibrium.** A policy profile $\pi^* \doteq (\pi_a^*, \pi_b^*) \in \mathcal{A}^\mathcal{S} \times \mathcal{B}^\mathcal{S}$ is called an $(\varepsilon, \delta)$-recursive (or Markov perfect) Stackelberg equilibrium iff $\forall s \in \mathcal{S}$, $\|\Pi_{\mathbb{R}^d_-}[g(s, \pi^*(s))]\| \leq \delta$ and
$$\max_{\pi_b \in Z(\pi_a^*)} v_{(\pi_a^*, \pi_b)}(s) - \delta \;\leq\; v_{\pi^*}(s) \;\leq\; \min_{\pi_a \in \mathcal{A}^\mathcal{S}} \max_{\pi_b \in Z(\pi_a)} v_{(\pi_a, \pi_b)}(s) + \varepsilon.$$
A recursive SE is guaranteed to exist in continuous state, continuous action zero-sum Markov Stackelberg games [36]. A policy profile $\pi^* \doteq (\pi_a^*, \pi_b^*)$ is called an $(\varepsilon, \delta)$-Markov perfect GNE iff $\forall s \in \mathcal{S}$,
$$\max_{\pi_b \in Z(\pi_a^*)} v_{(\pi_a^*, \pi_b)}(s) - \delta \;\leq\; v_{\pi^*}(s) \;\leq\; \min_{\pi_a \in \mathcal{A}^\mathcal{S}} v_{(\pi_a, \pi_b^*)}(s) + \varepsilon.$$
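To make the recursive structure behind these definitions explicit, note that the value and action-value functions defined above satisfy a one-step (Bellman-style) decomposition, and the planning (value-iteration) approach of [36] mentioned in the introduction is, at a high level, organized around a min-max operator of this form. The display below is our paraphrase, written for intuition only; see [36] for the precise statement.

```latex
% One-step decomposition implied by the definitions of q_pi and v_pi, and the
% min-max (Bellman-style) operator behind value iteration for these games [36]
% (our paraphrase, for intuition only).
\[
  q_\pi(s, a, b) \;=\; r(s, a, b)
    + \gamma\, \mathbb{E}_{S' \sim p(\cdot \mid s, a, b)}\big[v_\pi(S')\big],
  \qquad
  v_\pi(s) \;=\; q_\pi\big(s, \pi(s)\big),
\]
\[
  (\mathcal{T} v)(s) \;\doteq\; \min_{a \in \mathcal{A}} \;\max_{b \in \mathcal{C}(s, a)}
  \Big\{\, r(s, a, b) + \gamma\, \mathbb{E}_{S' \sim p(\cdot \mid s, a, b)}\big[v(S')\big] \,\Big\}.
\]
```

Iterating an operator of this form state by state is precisely what becomes intractable over large or continuous state spaces, which motivates the policy gradient approach developed in this section.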
**Convex-Concave Markov Stackelberg Games.** As we have shown (Theorem 3.1), Stackelberg equilibria can be computed in polynomial time in convex-concave min-max Stackelberg games, assuming access to an unbiased first-order stochastic oracle. We now define an analogous class of Markov Stackelberg games, namely zero-sum Markov Stackelberg games in which the min-max Stackelberg game played at each state is convex-concave. A convex-concave zero-sum Markov Stackelberg game is a continuous state, continuous action zero-sum Markov game where, for all policy profiles $\pi \in \mathcal{A}^\mathcal{S} \times \mathcal{B}^\mathcal{S}$, 1. the marginal action-value function $(s, a) \mapsto q^*_\pi(s, a)$ is convex, 2. the action-value function $(s, b) \mapsto q_\pi(s, a, b)$ is concave, for all $a \in \mathcal{A}$, 3. the state and action spaces $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{B}$ are convex, and 4. the action correspondence $\mathcal{C}$ is convex-valued. We note that any continuous state, continuous action convex-concave zero-sum Markov game, i.e., a game where 1. $\mathcal{N} = \emptyset$, 2. $(s, a) \mapsto r(s, a, b)$ is convex, for all $b \in \mathcal{B}$, 3. $(s, b) \mapsto r(s, a, b)$ is concave, for all $a \in \mathcal{A}$, 4. $(s, a) \mapsto p(\cdot \mid s, a, b)$ is stochastically convex, for all $b \in \mathcal{B}$, and 5. $(s, b) \mapsto p(\cdot \mid s, a, b)$ is stochastically concave, for all $a \in \mathcal{A}$, is a convex-concave zero-sum Markov Stackelberg game for which the set of Markov perfect generalized Nash equilibria is a subset of the recursive SE.

As our plan is to use our nested SGDA algorithm to compute recursive Stackelberg equilibria, we begin by showing that zero-sum Markov Stackelberg games are an instance of min-max Stackelberg games. Assume parametric policy classes for the leader and follower, respectively, namely $\mathcal{P}_X \doteq \{\pi_x : \mathcal{S} \to \mathcal{A} \mid x \in X\} \subseteq \mathcal{A}^\mathcal{S}$ and $\mathcal{P}_Y \doteq \{\pi_y : \mathcal{S} \to \mathcal{B} \mid y \in Y\} \subseteq \mathcal{B}^\mathcal{S}$, for parameter spaces $X \subseteq \mathbb{R}^{d_X}$ and $Y \subseteq \mathbb{R}^{d_Y}$. Using these parameterizations, we redefine $v_{(x,y)} \doteq v_{(\pi_x, \pi_y)}$, $q_{(x,y)} \doteq q_{(\pi_x, \pi_y)}$, $u(x, y) \doteq u(\pi_x, \pi_y)$, etc., and thus restate the problem of computing a recursive SE as finding $(x, y) \in X \times Y$ that solves $\min_{x \in X} \max_{y \in Y : \pi_y \in Z(\pi_x)} v_{(x,y)}(s)$, for all states $s \in \mathcal{S}$. As this optimization problem is infinite-dimensional for continuous state games, we optimize the objective and satisfy the constraints, both in expectation over the initial state distribution $\mu$, thereby reducing the problem to the min-max Stackelberg game $\min_{x \in X} \max_{y \in Z(x)} u(x, y)$.

In Appendix D, assuming 1. biaffine parametric policy classes, i.e., $(s, x) \mapsto \pi_x(s)$ and $(s, y) \mapsto \pi_y(s)$ are biaffine, and 2. non-empty, compact, and convex parameter spaces $X$ and $Y$, we show that the min-max Stackelberg game associated with any convex-concave zero-sum Markov Stackelberg game is also convex-concave (Lemma 4). We also provide sufficient conditions on the primitives $\mathcal{M}$ of any zero-sum Markov Stackelberg game to ensure that it is convex-concave (Lemmas 5 and 6). At a high level, our results allow us to conclude that a zero-sum Markov Stackelberg game is convex-concave if 1. the reward (resp. transition probability) function is concave (resp. stochastically concave) in the state and the follower's action; 2. the reward (resp. transition probability) function is convex (resp. stochastically convex) in the state and the leader's and follower's actions; and 3. the follower's action correspondence is concave.

**Computation.** We now turn our attention to the computation of recursive SE in convex-concave zero-sum Markov Stackelberg games. Mirroring the steps by which policy gradient has been shown to converge in other settings [24], we first define an unbiased first-order stochastic oracle for zero-sum Markov Stackelberg games, given access to an unbiased first-order stochastic oracle for the reward and probability transition functions. We then establish convergence of nested SGDA in this setting by invoking Theorem 3.1 under the following assumptions.

**Assumption 4 (Convergence Assumptions).** 1. The parameter spaces $X$ and $Y$ are non-empty, compact, and convex; 2. the policy parameterizations are biaffine, i.e., $(s, x) \mapsto \pi_x(s)$ and $(s, y) \mapsto \pi_y(s)$ are biaffine; 3. the set of non-trivially constrained states $\mathcal{N}$ is finite, i.e., $|\mathcal{N}| < \infty$; 4. (Slater's condition) for all $s \in \mathcal{N}$ and $a \in \mathcal{A}$, there exists $\hat{b} \in \mathcal{B}$ s.t. $g(s, a, \hat{b}) > 0$; and 5. the reward $r$, probability transition $p$, and coupling constraint $g$ functions are Lipschitz-continuous.
Nested SGDA relies on an unbiased first-order stochastic oracle $(\hat{F}, \hat{G}, \mathcal{D}_F, \mathcal{D}_G)$, which we can use to obtain unbiased first-order stochastic estimators of $u$ and $g$. Since the constraints are deterministic, we simply set $\hat{G}(x, y; s) \doteq (g(s', \pi_x(s'), \pi_y(s')))_{s' \in \mathcal{S}}$ and $\mathcal{D}_G(s) \doteq \sigma(s)$, for any distribution $\sigma \in \Delta(\mathcal{S})$ over the state space, to obtain an unbiased first-order stochastic oracle for the constraints $g$. While for simplicity we define $\hat{G}$ as such, $\hat{G}$ is tractable to compute (i.e., finite-dimensional) only when $\mathcal{N}$ is finite. When $\mathcal{N}$ is infinite, our theoretical results generalize by setting $\hat{G}(x, y; s) \doteq (\min_{s' \in \mathcal{N}} g_c(s', \pi_x(s'), \pi_y(s')))_{c \in [d]}$; however, in practice, this estimator might be intractable, in which case one might choose to abandon our theoretical guarantees in favor of the biased estimator $\hat{G}(x, y; s) \doteq g(s, \pi_x(s), \pi_y(s))$. In all cases, the definition of $\nabla_{(x,y)} \hat{G}$ follows directly, since $\hat{G}$ is deterministic.

Now, for any history $h$ of length $\tau$, define the cumulative payoff estimator $\hat{R}(\pi; h) \doteq \sum_{t=0}^{\tau - 1} \mu(s^{(0)}) \prod_{k=0}^{t-1} \gamma\, p(s^{(k+1)} \mid s^{(k)}, \pi(s^{(k)}))\, r(s^{(t)}, \pi(s^{(t)}))$. We then construct an estimator for $u$ using the first-order gradient estimator [69], i.e., we set $\hat{F}(x, y; h) \doteq \hat{R}(\pi_x, \pi_y; h)$ and $\nabla_{(x,y)} \hat{F}(x, y; h) \doteq \nabla_{(x,y)} \hat{R}(\pi_x, \pi_y; h)$.

Regarding the variances of this oracle model, as $\hat{G}$ and $\nabla_{(x,y)} \hat{G}$ are deterministic, they have bounded variance. Moreover, if the policy and the reward and transition probability functions are Lipschitz-continuous, then $\hat{R}$ and $\nabla_{(x,y)} \hat{R}$ are also Lipschitz-continuous if their domains are compact (i.e., if $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{B}$ are compact). Hence $\hat{F}$ and $\nabla \hat{F}$ likewise must be Lipschitz-continuous, which implies that their variances must be bounded, e.g., there exists $\sigma_{\nabla f} \in \mathbb{R}$ s.t. $\mathbb{E}_h[\|\nabla \hat{F}(x, y; h)\|^2] \leq \|\nabla \hat{F}(x, y; \cdot)\|_\infty^2 \doteq \sigma_{\nabla f}$, where the middle expression is well-defined since $\nabla \hat{F}$ is Lipschitz-continuous over its compact domain.

With all of this machinery in place, we can now extend nested SGDA to compute recursive Stackelberg equilibria in zero-sum Markov Stackelberg games (Algorithm 2; Appendix C). In the usual case, when the policy parameterization does not represent the space of all policies $\mathcal{A}^\mathcal{S} \times \mathcal{B}^\mathcal{S}$, this result should be understood as convergence to the recursive Stackelberg equilibria of a game in which the players' action spaces are restricted to the parameterized policies.

**Theorem 4.1.** Let $\mathcal{M}$ be a convex-concave zero-sum Markov Stackelberg game. Under Assumption 4, for any $\varepsilon, \delta \geq 0$, if nested policy gradient descent ascent (Algorithm 2, Appendix C) is run with inputs that satisfy $\eta_x^{(t)} \in \Theta(1/\sqrt{t+1})$ for all $t \in \mathbb{N}_+$, and outputs $(\bar{x}, \bar{y}, \bar{\lambda})$, then in expectation over all runs of the algorithm (i.e., sample paths of $\xi$ and $\varphi$), the policy profile $(\pi_{\bar{x}}, \pi_{\bar{y}})$ is an $(\varepsilon + \delta, \delta)$-recursive SE after $O(1/\varepsilon^2\delta^2)$ oracle calls.
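To make the oracle construction concrete, the following is a minimal JAX sketch of the idea behind $\hat{F}$ and $\nabla_{(x,y)} \hat{F}$: sample an initial state, roll out biaffine (linear) policies through differentiable dynamics and rewards, and differentiate through the rollout, in the spirit of the first-order gradient estimator [69]. Everything in the sketch is our own illustration rather than the paper's code: the toy dynamics, reward, dimensions, and the plain truncated discounted return, which we use as a simplified surrogate for the estimator $\hat{R}$ defined above.

```python
import jax
import jax.numpy as jnp

# Toy continuous-state zero-sum Markov game used only for illustration; the
# dynamics, reward, dimensions, and horizon below are our own choices.
# State s in R^2, leader action a = x @ s, follower action b = y @ s
# (biaffine/linear policies), deterministic dynamics s' = s + 0.1 * (b - a).
GAMMA, HORIZON = 0.9, 50

def reward(s, a, b):
    # The follower gains for states near the origin and for leader effort,
    # and pays for its own effort; the leader's payoff is the negative of this.
    return -jnp.sum(s ** 2) - 0.1 * jnp.sum(b ** 2) + 0.1 * jnp.sum(a ** 2)

def step(s, a, b):
    return s + 0.1 * (b - a)

def payoff_estimate(x, y, s0):
    """F_hat-style estimate: a truncated discounted return of the rollout from
    s0 under linear policies pi_x, pi_y (a simplified Monte Carlo surrogate for
    the estimator R_hat defined in the text)."""
    def body(carry, t):
        s, ret = carry
        a, b = x @ s, y @ s
        ret = ret + jnp.power(GAMMA, t) * reward(s, a, b)
        return (step(s, a, b), ret), None
    (_, ret), _ = jax.lax.scan(body, (s0, jnp.zeros(())), jnp.arange(HORIZON))
    return ret

# Sample an initial state S0 ~ mu (a standard normal here), then differentiate
# through the rollout: the pathwise "first-order gradient estimator" idea [69].
key = jax.random.PRNGKey(0)
s0 = jax.random.normal(key, (2,))
x = 0.1 * jnp.eye(2)         # leader policy parameters
y = -0.2 * jnp.eye(2)        # follower policy parameters
value = payoff_estimate(x, y, s0)
grad_x, grad_y = jax.grad(payoff_estimate, argnums=(0, 1))(x, y, s0)
print(value, grad_x.shape, grad_y.shape)   # scalar payoff estimate, (2, 2) gradients
```

Estimates of this kind, together with the (deterministic) constraint oracle $\hat{G}$ and its gradient, are what nested policy gradient descent ascent consumes in place of the exact oracle used by nested GDA.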
## 5 Application: Reach-Avoid Problems

In this section, we endeavor to apply our algorithms to a real-world application, namely reach-avoid problems. In a reach-avoid problem (e.g., [29, 32]), an agent seeks to reach one of a set of targets (achieving liveness) while avoiding obstacles along the way (ensuring safety). Reach-avoid problems have myriad applications, including network consensus problems [42], motion planning [21, 41], pursuit-evasion games [30, 44], autonomous driving [43], and path planning [80], to name a few. The obstacles in a reach-avoid problem are not necessarily stationary; they may move, either randomly or deliberately, around the environment. When the obstacles' movement is random, the problem can be modeled as an MDP. But when their movement is deliberate, so that they are more like a rational opponent than a stochastic process, the problem is naturally modeled as a zero-sum game, where the agent (the protagonist) aims to reach its target, while an antagonist (representing the obstacles) seeks to prevent the protagonist from doing so. Past work has modeled these games as simultaneous-move (e.g., [29, 32]), imposing what should be a hard constraint (that the agent cannot collide with any of the obstacles) as a soft constraint in the form of large negative rewards. Using the framework of zero-sum Markov Stackelberg games, we model this hard constraint properly, with the leader as the antagonist, whose movements impose constraints on the moves of the follower, the protagonist. We then use nested policy GDA to compute Stackelberg equilibria and simultaneous SGDA to compute GNE, and show experimentally that the protagonist learns stronger policies in the sequential (i.e., Stackelberg) game than in the simultaneous one.

A (discrete-time, discounted, infinite-horizon, continuous state and action) reach-avoid game $(l, \mathcal{S}, T, V, A, B, \mu, r, h)$ comprises two players, the antagonist (or a-player) and the protagonist (or b-player), each of whom occupies a position $s_a, s_b \in \mathcal{S}$ in a state space $\mathcal{S} \subseteq \mathbb{R}^l$, for some $l \in \mathbb{N}$; we write $s = (s_a, s_b) \in \mathcal{S}^2$ for the joint state. The protagonist's goal is to find a path through the safe set $V \subseteq \mathcal{S} \times \mathcal{S}$ that reaches a state in the target set $T \subseteq V$, while steering clear of the avoid set $\bar{V} = (\mathcal{S} \times \mathcal{S}) \setminus V$. This safe and avoid set formulation is intended to represent capture constraints, which have been the focus of the reach-avoid literature [80]. Initially, the players occupy some state $s^{(0)} \sim \mu \in \Delta(V \setminus T)$ drawn from an initial joint distribution $\mu$ over all states, excluding the target and avoid sets. At each subsequent time step $t \in \mathbb{N}_+$, the antagonist (resp. protagonist) chooses $a^{(t)} \in A$ (resp. $b^{(t)} \in B$) from a set of possible directions $A \subseteq \mathbb{R}^l$ (resp. $B \subseteq \mathbb{R}^l$) in which to move. After both the antagonist and the protagonist move, they receive respective rewards $-r(s^{(t)}, a^{(t)}, b^{(t)})$ and $r(s^{(t)}, a^{(t)}, b^{(t)})$. Then, either the game ends, with probability $1 - \gamma$, for some discount rate $\gamma \in (0, 1)$, or the players move to a new state $s^{(t+1)} \doteq h(s^{(t)}, a^{(t)}, b^{(t)}) = (h_a(s_a^{(t)}, a^{(t)}), h_b(s_b^{(t)}, b^{(t)}))$, as determined by their respective displacement functions $h_a : \mathcal{S} \times A \to \mathcal{S}$ and $h_b : \mathcal{S} \times B \to \mathcal{S}$. We can express this deterministic transition via the probability transition function $p(s' \mid s, a, b) \doteq \mathbb{1}_{\{s'\}}(h(s, a, b))$.

We define the feasible action correspondence $\mathcal{C}(s, a) \doteq \{b \in B \mid \zeta(s, a, b) \geq 0\}$ via a vector-valued safety constraint function $\zeta : \mathcal{S}^2 \times A \times B \to \mathbb{R}^d$, so that $\mathcal{C}(s, a)$ is a subset of the protagonist's actions that keep it in the safe set, i.e., for all $(s, a) \in \mathcal{S}^2 \times A$, $\mathcal{C}(s, a) \subseteq \{b \in B \mid h(s, a, b) \in V\}$. Note that we do not require this inclusion to hold with equality; in this way, the protagonist can choose to restrict itself to actions far from the boundaries of the avoid set, thereby increasing its safety, albeit perhaps at the cost of liveness. Overloading notation, we define the feasible policy correspondence $\mathcal{C}(\pi_a) \doteq \{\pi_b : \mathcal{S}^2 \to B \mid \pi_b(s) \in \mathcal{C}(s, \pi_a(s)), \text{ for all } s \in \mathcal{S}^2\}$.

We consider two forms of reward functions. The first, called the reach probability reward, $r(s, a, b) = \mathbb{1}_T(s_b)$, is an indicator function that awards the protagonist a payoff of 1 if it enters the target set, and 0 otherwise. Under this reward function, the cumulative payoff function (i.e., the expected value of these rewards in the long term) represents the probability that the protagonist reaches the target, hence its name.
The second reward function is the reach distance reward function, $r(s, a, b) = -\min_{s' \in T} \|s_b - s'\|_2$, which penalizes the protagonist based on how far away it is from the target set. With all these definitions in hand, we can now cast the reach-avoid game as a zero-sum Markov Stackelberg game $(2l, l, l, d, \mathcal{S}^2, A, B, \mu, r, \zeta, p, \gamma)$.

The next assumption ensures that 1. under the reach probability reward function, a reach-avoid game is a convex-non-concave zero-sum Markov Stackelberg game (i.e., the marginal function $x \mapsto u^*(x)$ is convex, and the cumulative payoff function $y \mapsto u(x, y)$ is non-concave, for all $x \in X$); and 2. under the reach distance reward function, a reach-avoid game is a convex-concave zero-sum Markov Stackelberg game. Furthermore, a Markov perfect GNE is guaranteed to exist under this assumption, assuming the reach distance reward, but not under the reach probability reward.[^10] To state this assumption, for convenience, we model the leader's policy $\pi_a(s) \doteq x s_a$ as parameterized by $x \in X \subseteq \mathbb{R}^{l \times l}$, and the follower's policy $\pi_b(s) \doteq y s_b$ as parameterized by $y \in Y \subseteq \mathbb{R}^{l \times l}$. Note also that we assume decentralized play, meaning the players learn only from their own state and rewards, and maintain their policies independently of one another.

[^10]: The existence of Markov perfect GNE, and hence GNE, is guaranteed by a straightforward generalization of Shapley's [65] result on the existence of Markov perfect Nash equilibria in zero-sum Markov games.

**Assumption 5 (Convex-Concave Reach-Avoid Game).** 1. The state space $\mathcal{S}$ and the target set $T$ are non-empty and convex; 2. the action spaces $A$, $B$ are non-empty, compact, and convex; 3. the displacement functions $h_a$, $h_b$ are affine; 4. $a \mapsto \zeta(s, a, b)$ is log-convex for all $b \in B$, and $b \mapsto \zeta(s, a, b)$ is concave for all $(s, a) \in \mathcal{S}^2 \times A$; 5. the players' parameter spaces $X$ and $Y$ are non-empty, compact, and convex; and 6. the players' policies are biaffine, i.e., $\pi_x(s) \doteq x s_a$ and $\pi_y(s) \doteq y s_b$.

Part 1 is a standard assumption commonly imposed on reach-avoid games (see, for instance, Fisac et al. [29]). Part 3 is satisfied by natural displacement functions of the type $h(s, a, b) = s + \beta(a, b)$, for some $\beta \in \mathbb{R}$, which is a natural characterization of all displacement functions with constant velocity $\beta$, when $A = B \doteq \{z \in \mathbb{R}^l \mid \|z\| = 1\}$. Part 4 is satisfied by various action correspondences, such as $\zeta(s, a, b) \doteq \exp\{\min_{s' \in \bar{V}} \|(h_a(s_a, a), s_b) - s'\|\} - 1 - \|h_b(s_b, b) - s_b\|$, which shrinks the space of actions exponentially as the protagonist approaches the antagonist, and can thus be interpreted as describing a safety-conscious protagonist. The following theorem states the convex-concavity properties of reach-avoid games, and shows polynomial-time computability of recursive SE under Assumption 5. Note that for the reach probability reward function, it is not possible to obtain a polynomial-time convergence result, since the rewards are not even continuous.

**Theorem 5.1.** Under the reach distance (resp. reach probability) reward function, any reach-avoid game for which Assumption 5 holds is convex-concave (resp. convex-non-concave). Moreover, if $\zeta$ is Lipschitz-continuous, then nested SGDA is guaranteed to converge in such games to recursive SE policies in polynomial time.
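As a concrete illustration of these primitives, here is a small NumPy sketch of a reach-distance reward and of a safety constraint in the flavor of the example $\zeta$ above, instantiated for ball-shaped target and avoid sets like those in our experimental setup below. The specific constants (target center, step length) and function names are our own hypothetical choices, and the constraint formula should be read as illustrative rather than as the exact one used in the experiments.

```python
import numpy as np

# Illustrative reach-avoid primitives for ball-shaped target and avoid sets
# (our own sketch; constants and names are hypothetical, and the constraint
# below is one example of the "safety-conscious" flavor described above).
TARGET_CENTER, TARGET_RADIUS = np.array([3.5, 0.0]), 1.0  # target ball T
AVOID_RADIUS = 0.3        # avoid ball around the antagonist
SPEED = 0.25              # assumed constant step length

def h(pos, direction):
    """Affine displacement: take a step of length SPEED along `direction`."""
    d = direction / (np.linalg.norm(direction) + 1e-12)
    return pos + SPEED * d

def reach_distance_reward(s_b):
    """Reach distance reward: minus the protagonist's distance to the target set."""
    return -max(np.linalg.norm(s_b - TARGET_CENTER) - TARGET_RADIUS, 0.0)

def safety_constraint(s_a, s_b, a, b):
    """zeta(s, a, b) >= 0 allows a protagonist move only if its displacement fits
    within a budget that vanishes as its clearance from the moved antagonist
    (i.e., from the avoid set) vanishes."""
    clearance = max(np.linalg.norm(s_b - h(s_a, a)) - AVOID_RADIUS, 0.0)
    budget = np.exp(clearance) - 1.0
    displacement = np.linalg.norm(h(s_b, b) - s_b)
    return budget - displacement

s_a, a = np.array([1.0, 1.0]), np.array([0.0, -1.0])      # antagonist position and move
for s_b in (np.array([2.0, 2.0]), np.array([1.1, 0.8])):  # protagonist far from / near the antagonist
    z = safety_constraint(s_a, s_b, a, np.array([1.0, 0.0]))
    status = "move allowed" if z >= 0 else "move infeasible"
    print(f"protagonist at {s_b}: zeta = {z:+.3f} ({status}), "
          f"reward = {reach_distance_reward(s_b):.3f}")
```

With unit-direction actions and a constant step length, this particular constraint either permits all directions or none at a given state; the latter is the "cornered" situation discussed in the experiments below.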
**Experiments.** We ran a series of experiments on reach-avoid problems, which were designed to assess the efficacy of policies learned in a Stackelberg game formulation as compared to those learned in a simultaneous-move game formulation, assuming complex, i.e., neural, policy parameterizations. We consider a variant of the two-player differential game introduced in Isaacs [38], played by two Dubins cars. A Dubins car is a simplified model of a vehicle with a constant forward speed and a constrained turning rate $\omega$. We model both the protagonist and the antagonist as Dubins cars [38] moving around a 2-dimensional state space. The target set is a select subset of the state space, while the avoid set, which defines the safe set, is a ball around the antagonist. We experiment with only the reach distance, not the reach probability, reward function. In all safe states, the reward is actually a penalty, measuring the protagonist's distance to the target set, while a bonus $\beta$ is awarded upon reaching a target, at which point the game ends. This reward function suffices for our Stackelberg game setup, which enforces the hard constraint that the protagonist cannot move into the avoid set. In our simultaneous-move game setup, we achieve a similar effect by enhancing the aforementioned reward function with a large penalty ($-\beta$) whenever the protagonist touches the avoid set. As in the case of reaching the target, touching the avoid set ends the game. We note that this reach-avoid game is not actually a continuous game, as there is a discontinuity in the reward function when the target is reached. Additionally, it is possible for the protagonist to be cornered, meaning left with an empty set of feasible actions (in which case the game ends). For these reasons, recursive SE are not guaranteed to exist in our setup.

Our experiments were run on a 7x7 square grid, with the target set $T$ a closed ball of radius 1 centered along the lower edge, and the avoid set $\bar{V}$ a closed ball of radius 0.3 around the antagonist. We set the bonus (resp. penalty) for reaching the target (resp. avoid set) to $\beta = 200$, the turning rate to $\omega = 30°$, and the forward speed to 0.25. Using this experimental setup, we train two agents by playing two games, the Stackelberg and simultaneous-move variants of the reach-avoid game, using nested policy GDA and SGDA, respectively. We evaluate the protagonists' policies to assess their safety and liveness characteristics.

| Match-up | Outcome | Mean win length | Loss/draw length |
|---|---|---|---|
| GNE vs. random | 47 W, 18 L, 35 D | 23.23 ± 7.53 | 33.71 ± 19.31 |
| SE vs. random | 95 W, 2 L, 3 D | 18.16 ± 3.69 | 33.0 ± 20.8 |
| GNE vs. chaser | 0 W, 100 L, 0 D | N/A | 8.53 ± 1.90 |
| SE vs. chaser | 63 W, 36 L, 1 D | 21.63 ± 5.04 | 11.06 ± 7.71 |

Table 1: Game results summary for GNE and SE agents.

To assess liveness, we ran our agents against an opponent that plays actions sampled uniformly at random. To assess safety, we ran our agents against an opponent who chases them, always taking actions that minimize their distance. Table 1 reports the number of wins (W), losses (L), and draws (D), and average game lengths, over 100 games against each opponent. An agent, playing the role of the protagonist, wins when it reaches the target set. A GNE agent loses if it enters $\bar{V}$, while a Stackelberg agent loses if it finds itself cornered. The game is a draw if neither player wins or loses within 50 time steps. We find that the SE agent outperforms the GNE agent by a large margin. The SE agent wins almost all of its games against random, and roughly 2/3 of its games against the chaser, while the GNE agent wins only half of its games against random, and none of its games against the chaser. Moreover, even when the SE agent loses or draws, it tends to stay alive longer than the GNE agent. Not only does our Stackelberg approach outperform GNE, it is tractable as well.
Our methods thus seem to offer a promising path to further progress solving the myriad of robotic applications of reach-avoid. 11Our code is found at: https://github.com/arjun-prakash/stackelberg-reach-avoid. 6 Acknowledgments Denizalp Goktas was supported by a JP Morgan AI fellowship. Arjun Prakash was partially supported by ONR N00014-22-1-2592 and the Quad Fellowship. [1] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approxi- mation with policy gradient methods in markov decision processes. In Conference on Learning Theory, pages 64 66. PMLR, 2020. 26 [2] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. 22(1), jan 2021. ISSN 1532-4435. 1, 26 [3] Eitan Altman. Flow control using the theory of zero sum markov games. IEEE transactions on automatic control, 39(4):814 818, 1994. 1 [4] Kenneth Arrow and Gerard Debreu. Existence of an equilibrium for a competitive economy. Econometrica: Journal of the Econometric Society, pages 265 290, 1954. 4 [5] Kenneth J. Arrow and Leonid Hurwicz. On the stability of the competitive equilibrium, i. Econometrica, 26(4):522 552, 1958. ISSN 00129682, 14680262. URL http://www. jstor.org/stable/1907515. 5 [6] Kenneth Joseph Arrow, Leonid Hurwicz, and Hirofumi Uzawa. Studies in linear and non-linear programming. 1958. 5 [7] Alp E. Atakan. Stochastic convexity in dynamic programming. Economic Theory, 22(2): 447 455, 2003. ISSN 09382259, 14320479. URL http://www.jstor.org/stable/ 25055693. 3, 26 [8] Yu Bai, Chi Jin, Huan Wang, and Caiming Xiong. Sample-efficient learning of stackelberg equilibria in general-sum games. Advances in Neural Information Processing Systems, 34: 25799 25811, 2021. 16 [9] Stefan Banach. Sur les operations dans les ensembles abstraits et leur application aux equations integrales. Fund. math, 3(1):133 181, 1922. 25 [10] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J Tomlin. Hamilton-jacobi reachability: A brief overview and recent advances. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2242 2253. IEEE, 2017. 2 [11] Richard Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716, 1952. 24, 25 [12] Alain Bensoussan, Shaokuan Chen, and Suresh P Sethi. The maximum principle for global so- lutions of stochastic stackelberg differential games. SIAM Journal on Control and Optimization, 53(4):1956 1981, 2015. 16 [13] Dimitri Bertsekas. Convex optimization theory, volume 1. Athena Scientific, 2009. 4 [14] Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 4. Athena scientific, 2012. 24 [15] Dimitris Bertsimas, David B Brown, and Constantine Caramanis. Theory and applications of robust optimization. SIAM review, 53(3):464 501, 2011. 2 [16] Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. ar Xiv preprint ar Xiv:1906.01786, 2019. 1 [17] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. 4 [18] Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine learning, 22:33 57, 1996. 24 [19] Yanling Chang, Alan L Erera, and Chelsea C White. A leader follower partially observed, multiobjective markov game. Annals of Operations Research, 235(1):103 128, 2015. 16 [20] Lv Chen and Yang Shen. 
On a new paradigm of optimal reinsurance: a stochastic stackelberg differential game between an insurer and a reinsurer. ASTIN Bulletin: The Journal of the IAA, 48(2):905 960, 2018. 16 [21] Mo Chen, Zhengyuan Zhou, and Claire J Tomlin. Multiplayer reach-avoid games via low dimensional solutions and maximum matching. In 2014 American control conference, pages 1444 1449. IEEE, 2014. 8 [22] Stefan Czerwik. Functional equations and inequalities in several variables. World Scientific, [23] John M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641 664, 1966. ISSN 00361399. URL http://www.jstor.org/ stable/2946123. 4 [24] Constantinos Daskalakis, Dylan J Foster, and Noah Golowich. Independent policy gradient methods for competitive reinforcement learning. Advances in neural information processing systems, 33:5527 5540, 2020. 1, 7 [25] Victor De Miguel and Huifu Xu. A stochastic multiple-leader stackelberg model: analysis, computation, and application. Operations Research, 57(5):1220 1235, 2009. 16 [26] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International conference on machine learning, pages 1467 1476. PMLR, 2018. 24 [27] Anthony V Fiacco and Jerzy Kyparisis. Convexity and concavity properties of the optimal value function in parametric nonlinear programming. Journal of optimization theory and applications, 48(1):95 126, 1986. 19, 25 [28] Arlington M Fink. Equilibrium in a stochastic n-person game. Journal of science of the hiroshima university, series ai (mathematics), 28(1):89 93, 1964. 1 [29] Jaime F. Fisac, Mo Chen, Claire J. Tomlin, and S. Shankar Sastry. Reach-avoid problems with time-varying dynamics, targets and constraints. In Proceedings of the 18th International Conference on Hybrid Systems: Computation and Control, HSCC 15, page 11 20, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450334334. doi: 10.1145/2728606.2728612. URL https://doi.org/10.1145/2728606.2728612. [30] James Flynn. Lion and man: the general case. SIAM Journal on Control, 12(4):581 597, 1974. [31] David Fridovich-Keil, Ellis Ratner, Lasse Peters, Anca D Dragan, and Claire J Tomlin. Efficient iterative linear-quadratic approximations for nonlinear multi-player general-sum differential games. In 2020 IEEE international conference on robotics and automation (ICRA), pages 1475 1481. IEEE, 2020. 1 [32] Yan Gao, John Lygeros, and Marc Quincampoix. On the reachability problem for uncertain hybrid systems. IEEE Transactions on Automatic Control, 52(9):1572 1586, 2007. 8 [33] Denizalp Goktas and Amy Greenwald. Convex-concave min-max stackelberg games. Advances in Neural Information Processing Systems, 34, 2021. 2, 3, 4, 5, 6, 16, 20 [34] Denizalp Goktas and Amy Greenwald. Robust no-regret learning in min-max Stackelberg games, 2022. 16 [35] Denizalp Goktas and Amy Greenwald. Gradient descent ascent in min-max stackelberg games. ar Xiv preprint ar Xiv:2208.09690, 2022. 5 [36] Denizalp Goktas, Sadie Zhao, and Amy Greenwald. Zero-sum stochastic stackelberg games. Advances in Neural Information Processing Systems, 35:11658 11672, 2022. 1, 2, 7, 16 [37] Xiuli He, Ashutosh Prasad, and Suresh P Sethi. Cooperative advertising and pricing in a dynamic stochastic supply chain: Feedback stackelberg strategies. In PICMET 08-2008 Portland International Conference on Management of Engineering and Technology, pages 1634 1649. IEEE, 2008. 16 [38] Rufus Isaacs. 
Differential Games I: Introduction. RAND Corporation, Santa Monica, CA, 1954. [39] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137 2143. PMLR, 2020. 24 [40] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 267 274, 2002. 26 [41] Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for optimal motion planning. The international journal of robotics research, 30(7):846 894, 2011. 8 [42] Ali Khanafer, Behrouz Touri, and Tamer Basar. Robust distributed averaging on networks with adversarial intervention. In 52nd IEEE Conference on Decision and Control, pages 7131 7136. IEEE, 2013. 8 [43] Karen Leung, Sushant Veer, Edward Schmerling, and Marco Pavone. Learning autonomous vehicle safety concepts from demonstrations. ar Xiv preprint ar Xiv:2210.02761, 2022. 2, 8 [44] J Lewin. The lion and man problem revisited. Journal of optimization theory and applications, 49:411 430, 1986. 8 [45] Tao Li and Suresh P Sethi. A review of dynamic stackelberg game models. Discrete and Continuous Dynamical Systems-B, 22(1):125, 2017. 16 [46] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on International Conference on Machine Learning, ICML 94, page 157 163, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc. ISBN 1558603352. 1 [47] Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximation for one-sided non-convex min-max problems: Algorithms and applications. IEEE Transactions on Signal Processing, 68:3676 3691, 2020. doi: 10.1109/TSP.2020.2986363. 4 [48] Kostas Margellos and John Lygeros. Hamilton Jacobi Formulation for Reach Avoid Differential Games. IEEE Transactions on Automatic Control, 56(8):1849 1861, August 2011. ISSN 15582523. doi: 10.1109/TAC.2011.2105730. 1 [49] Eric Maskin and Jean Tirole. Markov perfect equilibrium: I. observable actions. Journal of Economic Theory, 100(2):191 219, 2001. 2 [50] Francisco S Melo and M Isabel Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308 322. Springer, 2007. 24 [51] Ashkan Mohammadi. Penalty methods to compute stationary solutions for constrained opti- mization problems. ar Xiv preprint ar Xiv:2206.04020, 2022. 19 [52] Ashkan Mohammadi, Boris S Mordukhovich, and M Ebrahim Sarabi. Variational analysis of composite models with applications to continuous optimization. Mathematics of Operations Research, 47(1):397 426, 2022. 19 [53] John F. Nash. Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1):48 49, 1950. doi: 10.1073/pnas.36.1.48. URL https://www. pnas.org/doi/abs/10.1073/pnas.36.1.48. 1 [54] Angelia Nedi c and Asuman Ozdaglar. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4):1757 1780, 2009. 19 [55] Angelia Nedic and Asuman Ozdaglar. Subgradient methods for saddle-point problems. Journal of optimization theory and applications, 142(1):205 228, 2009. 19 [56] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4): 1574 1609, 2009. 20 [57] Kazimierz Nikodem. 
K-convex and K-concave set-valued functions. Wydawnictwo Politechniki Lodzkiej, 1989. 3, 4 [58] Bernt Oksendal, Leif Sandal, and Jan Uboe. Stochastic stackelberg equilibria with applications to time-dependent newsvendor models. Journal of Economic Dynamics and Control, 37(7): 1284 1299, 2013. 16 [59] Praveen Palanisamy. Multi-agent connected autonomous driving using deep reinforcement learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1 7. IEEE, 2020. 1 [60] Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. In International conference on machine learning, pages 4026 4035. PMLR, 2018. 26 [61] Giorgia Ramponi and Marcello Restelli. Learning in markov games: can we exploit a general- sum opponent? In The 38th Conference on Uncertainty in Artificial Intelligence, 2022. 16 [62] Sean C Rismiller, Jonathan Cagan, and Christopher Mc Comb. Stochastic stackelberg games for agent-driven robust design. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, volume 84003, page V11AT11A039. American Society of Mechanical Engineers, 2020. 16 [63] R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science and Business Media, 2009. 3, 4 [64] Sailik Sengupta and Subbarao Kambhampati. Multi-agent reinforcement learning in bayesian stackelberg markov games for adaptive moving target defense. ar Xiv preprint ar Xiv:2007.10457, 2020. 16 [65] Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10): 1095 1100, 1953. 1, 9 [66] Zebang Shen, Alejandro Ribeiro, Hamed Hassani, Hui Qian, and Chao Mi. Hessian aided policy gradient. In International conference on machine learning, pages 5729 5738. PMLR, 2019. 26 [67] Maurice Sion. On general minimax theorems. Pacific Journal of mathematics, 8(1):171 176, [68] Morton Slater. Lagrange multipliers revisited. Cowles Foundation Discussion Papers 80, Cowles Foundation for Research in Economics, Yale University, 1959. URL https:// Econ Papers.repec.org/Re PEc:cwl:cwldpp:80. 5 [69] Hyung Ju Suh, Max Simchowitz, Kaiqing Zhang, and Russ Tedrake. Do differentiable simulators give better policy gradients? In International Conference on Machine Learning, pages 20668 20696. PMLR, 2022. 8 [70] Masayuki Takahashi. Equilibrium points of stochastic non-cooperative n-person games. Journal of Science of the Hiroshima University, Series AI (Mathematics), 28(1):95 99, 1964. 1 [71] Ioannis Tsaknakis, Mingyi Hong, and Shuzhong Zhang. Minimax problems with coupled linear constraints: Computational complexity, duality and solution methods. ar Xiv preprint ar Xiv:2110.11210, 2021. 4 [72] Deepanshu Vasal. Stochastic stackelberg games, 2020. 16 [73] John von Neumann. Zur theorie der gesellschaftsspiele. Mathematische annalen, 100(1): 295 320, 1928. 4 [74] Heinrich von Stackelberg. Marktform und gleichgewicht. J. springer, 1934. 1 [75] Yevgeniy Vorobeychik and Satinder Singh. Computing stackelberg equilibria in discounted stochastic games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 1478 1484, 2012. 16 [76] Quoc-Liem Vu, Zane Alumbaugh, Ryan Ching, Quanchen Ding, Arnav Mahajan, Benjamin Chasnov, Sam Burden, and Lillian J Ratliff. Stackelberg policy gradient: Evaluating the performance of leaders and followers. In ICLR 2022 Workshop on Gamification and Multiagent Solutions, 2022. 
16 [77] Yuandou Wang, Hang Liu, Wanbo Zheng, Yunni Xia, Yawen Li, Peng Chen, Kunyin Guo, and Hong Xie. Multi-objective workflow scheduling with deep-q-network-based multi-agent reinforcement learning. IEEE access, 7:39974 39982, 2019. 1 [78] Yu Yuan, Zhibin Liang, and Xia Han. Robust reinsurance contract with asymmetric information in a stochastic stackelberg differential game. Scandinavian Actuarial Journal, pages 1 28, 2021. 16 [79] Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Basar. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58 (6):3586 3612, 2020. 26 [80] Zhengyuan Zhou, Jerry Ding, Haomiao Huang, Ryo Takei, and Claire Tomlin. Efficient path planning algorithms in reach-avoid problems. Automatica, 89:28 36, 2018. 8, 9