# Solving Zero-Sum Convex Markov Games

Fivos Kalogiannis¹, Emmanouil-Vasileios Vlatakis-Gkaragkounis², Ian Gemp³, Georgios Piliouras³

¹Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA. ²Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA and Archimedes, Athena Research Center, Greece. ³Google DeepMind, London, UK. Correspondence to: Fivos Kalogiannis.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

We contribute the first provable guarantees of global convergence to Nash equilibria (NE) in two-player zero-sum convex Markov games (cMGs) by using independent policy gradient methods. Convex Markov games, recently defined by Gemp et al. (2024), extend Markov decision processes to multi-agent settings with preferences that are convex over occupancy measures, offering a broad framework for modeling generic strategic interactions. However, even the fundamental min-max case of cMGs presents significant challenges, including inherent nonconvexity, the absence of Bellman consistency, and the complexity of the infinite horizon. Our results follow a two-step approach. First, leveraging properties of hidden-convex hidden-concave functions, we show that a simple nonconvex regularization transforms the min-max optimization problem into a nonconvex proximal Polyak-Łojasiewicz (NC-pPL) objective. Crucially, this regularization can stabilize the iterates of independent policy gradient methods and ultimately lead them to converge to equilibria. Second, building on this reduction, we address general constrained min-max problems under NC-pPL and two-sided pPL conditions, providing the first global convergence guarantees for stochastic nested and alternating gradient descent-ascent methods, which we believe may be of independent interest.

## 1. Introduction

The field of multi-agent reinforcement learning (MARL), often framed as Markov games (MGs) (Littman, 1994), studies how multiple agents interact within a shared, dynamic environment to optimize their individual cumulative rewards (Silver et al., 2017; Lanctot et al., 2019). However, many real-world applications require a more expressive formulation of agent preferences that do not simply decompose additively over time (Puterman, 2014; Zahavy et al., 2021). To address this limitation, the emerging framework of convex Markov games (cMGs) (Gemp et al., 2024) has been proposed as a principled yet flexible model for capturing complex multi-agent interactions in dynamic environments. Unlike traditional MGs, cMGs allow for convex utility functions over the state-action occupancy measure, enabling a richer set of player preferences that better reflect practical applications. In this context, the occupancy measure of each player reflects the frequency of visiting any particular state of their corresponding Markov decision process (MDP) over a potentially infinite time horizon.
In practice, cMGs are useful for modeling a variety of challenging multi-agent settings, including: (i) fostering creativity in machine gameplay, such as discovering novel strategies in chess (Zahavy et al., 2022; 2023; Bakhtin et al., 2022), (ii) multi-step language model alignment (Wu et al., 2025), (iii) enhancing multi-agent exploration in robotic systems (Burgard et al., 2000; Rogers et al., 2013; Tan et al., 2022), (iv) ensuring safety in autonomous driving (Shalev-Shwartz et al., 2016), (v) enabling imitation learning from expert demonstrations, and (vi) promoting robustness and fairness in multi-agent decision-making (Hughes et al., 2018). While the former (i-iii) are direct instantiations of cMGs, the latter (iv-vi) stand to profit from a cMG formulation.

In general, the desired outcome of a multi-agent scenario is some form of game-theoretic equilibrium. Given the plethora of cMG applications, a natural question arises: are there algorithmic solutions for provable equilibrium computation in these games already? Surprisingly, this is not the case! Notwithstanding the model's appeal, an array of technical challenges arises that impedes a straightforward algorithmic solution. Yet, empirical results suggest that variants of policy gradient methods (Williams, 1992) can actually lead to favorable outcomes. Following this thread, we consider the simplest reasonable setting, competition between two agents, and pose our central question:

Do policy gradient methods converge in zero-sum convex Markov games? (⋆)

We answer this question in the affirmative. More explicitly, we provide an algorithmic framework that is efficient, adheres to the independent-learning desideratum (Daskalakis et al., 2020), and is simultaneously easy to implement. This effectiveness is powered by novel technical insights that address numerous technical challenges native to cMGs.

Getting into the weeds, we enumerate the technical challenges that cMGs pose. As cMGs are a strict generalization of cMDPs, the Bellman consistency of the agents' utility functions fails to hold (Zhang et al., 2020); in short, we cannot define individual value and action-value functions. Remarkably, this is a family of long-horizon strategic interactions where agents cannot use dynamic programming, as opposed to conventional MGs. We note that, even though most policy optimization methods rely on a gradient-based approach, the majority of them implicitly perform an approximate dynamic programming subroutine such as value iteration (Jin et al., 2021; Wei et al., 2021; Ding et al., 2022; Erez et al., 2023). Consequently, the absence of state-wise value functions translates into a failure of the elementary arguments that were used to prove the existence of Nash equilibria (Fink, 1964), even in two-player zero-sum Markov games (Shapley, 1953). In light of the latter, Gemp et al. (2024) prove the existence of Nash equilibria in cMGs by going beyond the fundamental toolbox of the Brouwer and Kakutani fixed point theorems. What renders these conventional theorems inapplicable is the inherent nonconvexity of the individual best-response mappings in terms of players' policies, where each best-response mapping yields the set of an agent's utility-maximizing strategy deviations. The careful reader might already grasp that the failure of Bellman consistency rules out the seamless application of the majority of MARL algorithms; most of them rely on computing value and action-value functions.
We direct our hopes to policy gradient methods, which the theory of cMDPs and the practice of MARL suggest can work. Directly optimizing a policy means facing an optimization landscape that is nonconvex. However, the fact that all local optima are also global in cMGs should bring some hope. Nonetheless, why should policy gradient methods work in cMGs when vanilla gradient-following methods cycle (Bailey & Piliouras, 2018) and exhibit non-convergent chaotic trajectories even in zero-sum two-player normal-form games? Observe that the latter are nothing but cMGs on a single state with individually linear utilities. Further, even if we stabilize gradient dynamics towards Nash equilibria, attaining strong convergence guarantees for an algorithm requires deepening our understanding of the optimization landscape of cMDPs and cMGs. For instance, in the min-max case, the computational complexity of the general problem of computing a saddle-point of a nonconvex-nonconcave smooth function remains far from being settled (Daskalakis et al., 2021), while no algorithm is yet guaranteed to compute one efficiently without additional structural assumptions.

Searching for structure, we observe that the utilities in cMGs exhibit a property known as hidden convexity (Ben-Tal & Teboulle, 1996; Li et al., 2005; Wu et al., 2007; Xia, 2020; Vlatakis-Gkaragkounis et al., 2021; Fatkhullin et al., 2023). However, the existing results on hidden convexity either focus on single-objective minimization or, in the context of continuous games, make assumptions that do not hold in cMGs (separability of the hidden mapping and unconstrained variables).

Having identified the key difficulties of zero-sum cMGs, in the following section we offer a detailed account of how we tackle them. Briefly, we manage to overcome the aforementioned challenges using a distinct combination of the following algorithmic techniques: (i) a conceptually simple but nonconvex regularization of the utility function; (ii) alternating gradient iterations (Tammelin et al., 2015; Chavdarova et al., 2021; 2019; Zhang & Yu, 2019; Chambolle & Pock, 2011); (iii) a careful timescale separation of the individual step-sizes; and, lastly, (iv) nested-gradient iterations.

### 1.1. Technical Overview

The proof of the proposed theoretical guarantees arises from a combination of several key observations, old and new. We believe that exposing them in the following itemized manner will convey the technical level of our work and give an intuitive overview of why our techniques work.

Hidden convex functions are nonconvex functions that can be expressed as a convex function of a reparametrization of their original arguments, termed the hidden (latent) mapping, while the original arguments are called control variables. Yet, in some settings, like reinforcement learning (RL), this mapping is not at all hidden but directly computable or observable. More specifically, in cMGs, players' utilities are hidden concave through the state-action occupancy measure, which is similarly computable (in planning) or observable (in RL). Although analyzing these games via occupancy measures yields a quasi-variational inequality problem (Kinderlehrer & Stampacchia, 2000), unnecessarily increasing the technical challenges (Bernasconi et al., 2024), two crucial facts remain: (i) the utility is concave with respect to the hidden mapping, and (ii) the hidden mapping is accessible.
These properties ensure that, after regularization, the hidden function becomes strongly concave. In turn, the resulting function satisfies the proximal Polyak-Łojasiewicz (pPL) condition with respect to the control variables (i.e., the policies). Consequently, each player's utility remains pPL while the feasibility sets for their control variables remain independent. This observation wraps up the discussion on hidden concavity and naturally shifts our focus to the constrained optimization of min-max nonconvex-pPL objectives. In other words, we reduce solving two-player zero-sum convex Markov games to the problem of computing saddle-points of functions that are nonconvex-pPL over constrained domains.

Leveraging this reduction, our final steps center on developing policy gradient methods that can provably converge to saddle-points of nonconvex-pPL functions over constrained domains, a technical development in optimization theory of independent interest as well. The proofs are deferred to the Appendix for clarity.

### 1.2. Our Contributions

Before proceeding to enumerate all parts of our contributions, we remark that we deliver a definitive answer to question (⋆):

Theorem. There exist decentralized policy gradient methods (Algorithms 1 and 2) that compute an ϵ-approximate Nash equilibrium for any ϵ > 0 in any two-player zero-sum convex Markov game using a number of iterations and samples that is polynomial in 1/ϵ, |S|, |A| + |B|, and 1/(1 − γ), with S denoting the cMG's state space, A and B the two players' action spaces, and γ ∈ [0, 1) the discount factor.

Our contributions span two key areas: constrained nonconvex min-max optimization and equilibrium computation in convex Markov games. In Theorem 4.1 we demonstrate that the best-response mapping for an objective f(x, y), namely x ↦ y*(x) := arg max_{y ∈ Y} f(x, y), is Lipschitz continuous in x for regularized hidden-convex and, more generally, NC-pPL games. The significance of this result lies in guaranteeing the stability of the iterates of policy gradient methods, and it serves as a suggestion to practitioners looking for a regularization technique that is intuitive, simple, and easily implementable. In Section 4, we provide the first provable guarantees of convergence to a Nash equilibrium for two policy gradient algorithms, Nest-PG and Alt-PGDA (Theorem 4.3).

Key ingredients. To establish our results, we leverage a distinct combination of the following non-trivial components: (i) an intuitive, stability-inducing regularization of the utility function, together with (ii) alternating or (iii) nested-gradient iterations. A noteworthy element of our approach is that it ensures convergence even with gradients that are inexact on top of being stochastic. The algorithm's robustness to gradient inexactness preserves each player's autonomy, allowing them to optimize independently without exchanging private policy information. That is, unlike the single-agent setting, where exact gradients can be stochastically estimated, our framework's regularizer depends on both players' policies, making exact estimation infeasible without policy sharing. Consequently, each player must rely solely on an inexact estimate of their own gradient.
Finally, a noteworthy product of our approach is the Lipschitz continuity of the best-response mapping (which returns the optimal strategy given the opponent's choices) in the hidden-convex and, more generally, nonconvex-pPL case, despite the opponent's utility being nonconvex, contrasting with the typical 1/2-Hölder continuity of maximizers in constrained optimization (Li & Li, 2014). Indeed, while Kalogiannis et al. (2024) employ a similar technique of regularizing the value function through the occupancy-measure perspective, their result only establishes a weaker notion of continuity due to the coupling of the individual players' state-action occupancy measures. This claim has been independently supported by (Papadimitriou et al., 2023, Robust Berge Theorem), which considers general nonconvex-strongly-concave functions where the feasibility sets of the two different strategies depend on each other in a Hausdorff-continuous manner. To significantly strengthen the continuity of the best-response mapping, we leverage the pPL condition in the individual policy spaces of the agents, which remain uncoupled. This result was previously known only for the significantly simpler case of unconstrained two-sided PL functions (Nouiehed et al., 2019).

## 2. Preliminaries

Notation. In general, x, y, z, u, v, w and λ, λ_1, λ_2 will denote vectors. Scalars will be denoted using α, β, γ, δ, ϵ, ζ, κ, µ, ν and a, b, c, d. Matrices will be denoted with bold uppercase letters. The probability simplex supported on a finite set M will be denoted by Δ(M). For a compact convex set Z ⊆ R^d, we will denote its Euclidean diameter as D_Z, i.e., D_Z := max_{x,y ∈ Z} ‖x − y‖_2. Lastly, the global optimum of a function f will be denoted as f*.

### 2.1. Convex Markov Games

In this subsection, we define two-player zero-sum convex Markov games (Gemp et al., 2024) and introduce the necessary notation. We then present the occupancy measure and remark on its continuity properties. Subsequently, we review the policy gradient theorem for convex MDPs and describe a stochastic policy gradient estimator. Finally, we define the perturbed utility function U_µ, obtained by adding a regularization term to the original utility function U.

Definition 1 (Two-player zero-sum cMG). An infinite-horizon zero-sum convex Markov game is a tuple Γ = (S, A, B, P, F, γ, ϱ): a maximizing and a minimizing player, a finite state space S, an initial state distribution ϱ ∈ Δ(S), finite action spaces A, B, a state transition function P : S × A × B → Δ(S), a discount factor γ ∈ [0, 1), and two continuous utility functions F_1, F_2 of the players' occupancy measures, i.e., there exist concave functions

$$F_1 : \Delta(S \times A) \times \Delta(S \times B) \to \mathbb{R}; \qquad F_2 : \Delta(S \times A) \times \Delta(S \times B) \to \mathbb{R},$$

with F_1 = −F_2, and we set F := F_2.

With all this in hand, we adopt the following standard assumption, which is widely used in prior work (Zhang et al., 2020):

Assumption 1. For the initial state distribution, it holds that ϱ(s) > 0 for all s.

Additionally, we assume direct policy parametrization and define the Markovian and stationary policies of the minimizing and the maximizing player to be x ∈ Δ(A)^{|S|} =: X and y ∈ Δ(B)^{|S|} =: Y respectively. Throughout, we only consider Markovian stationary policies. After the agents fix their policies x, y, the transition matrix P(x, y) ∈ R^{|S| × |S|} dictates how they traverse the state space. The occupancy measures are defined as:

$$\lambda_1^{s,a} := (1-\gamma)\, \mathbb{E}_{x,y}\Big[\textstyle\sum_{h \ge 0} \gamma^h\, \mathbb{1}\{s^{(h)} = s,\, a^{(h)} = a\} \,\Big|\, \varrho\Big]; \qquad \lambda_2^{s,b} := (1-\gamma)\, \mathbb{E}_{x,y}\Big[\textstyle\sum_{h \ge 0} \gamma^h\, \mathbb{1}\{s^{(h)} = s,\, b^{(h)} = b\} \,\Big|\, \varrho\Big].$$

Further notation. We denote by λ the state-joint-action occupancy measure, λ ∈ Δ(S × A × B).
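Since the policies and the game are finite and tabular, these occupancy measures can be computed in closed form. The following Python sketch is a minimal illustration of the definition under assumed inputs (a hypothetical transition tensor P and policy pair (x, y)); it is not the paper's implementation.

```python
import numpy as np

def occupancy_measures(P, x, y, rho, gamma):
    """Marginal occupancy measures lambda_1, lambda_2 for fixed policies (sketch).

    P:   transition tensor, shape (S, A, B, S), P[s, a, b, s'] = Pr(s' | s, a, b)
    x:   min-player policy, shape (S, A), rows on the simplex
    y:   max-player policy, shape (S, B), rows on the simplex
    rho: initial state distribution, shape (S,)
    """
    S = P.shape[0]
    # State-to-state transition matrix induced by the policy pair (x, y).
    P_xy = np.einsum('sabt,sa,sb->st', P, x, y)
    # Discounted state visitation d solves (I - gamma * P_xy^T) d = (1 - gamma) * rho.
    d = np.linalg.solve(np.eye(S) - gamma * P_xy.T, (1.0 - gamma) * rho)
    # lambda_1(s, a) = d(s) * x(a|s) and lambda_2(s, b) = d(s) * y(b|s).
    return d[:, None] * x, d[:, None] * y
```

Both outputs sum to one, matching the (1 − γ) normalization in the definitions above.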
Overloading notation, λ(x, y) stands for the unique occupancy measure that corresponds to the policy pair x, y. Additionally, λ_1 ∈ Δ(S × A) and λ_2 ∈ Δ(S × B) will signify the marginal occupancy measures of the minimizing and the maximizing player respectively. Again, overloading notation, λ_1(x, y) and λ_2(x, y) are the unique marginal occupancy measures for a policy pair x, y. Finally, we will at times write F_1(λ_1(x, y); y) in place of F_1(λ_1(x, y), λ_2(x, y)), and similarly for F_2.

Crucially, both the occupancy measure and its inverse operators satisfy the following continuity properties:

Lemma 2.1 (Continuity of the occupancy measure). Let λ ∈ Δ(S × A × B) be the occupancy measure in a (convex) Markov game and let λ_1^{-1} : Δ(S × A) → X and λ_2^{-1} : Δ(S × B) → Y be the occupancy-to-policy mappings such that λ_1^{-1}(λ_1(x, y)) = x and λ_2^{-1}(λ_2(x, y)) = y. Then, λ is L_λ-Lipschitz continuous and has an ℓ_λ-Lipschitz continuous gradient with respect to the policy pair (x, y) in X × Y. Specifically, for all (x, y) and (x′, y′),

$$\|\lambda(x, y) - \lambda(x', y')\| \le L_\lambda\, \|(x, y) - (x', y')\|; \qquad \|\nabla\lambda(x, y) - \nabla\lambda(x', y')\| \le \ell_\lambda\, \|(x, y) - (x', y')\|,$$

where $L_\lambda := \frac{|S|^{1/2}(|A|+|B|)}{(1-\gamma)^2}$ and $\ell_\lambda := \frac{2\gamma |S|^{1/2}(|A|+|B|)^{3/2}}{(1-\gamma)^3}$. For any fixed y (respectively, x), the inverse mapping is $L_{\lambda^{-1}}$-Lipschitz continuous with respect to λ_1 (respectively, λ_2), i.e., for all λ_1(x, y), λ_1(x′, y) (respectively, λ_2(x, y), λ_2(x, y′)),

$$\|x - x'\| \le L_{\lambda^{-1}}\, \|\lambda_1(x, y) - \lambda_1(x', y)\|; \qquad \|y - y'\| \le L_{\lambda^{-1}}\, \|\lambda_2(x, y) - \lambda_2(x, y')\|,$$

with $L_{\lambda^{-1}} := \frac{2}{\min_s \varrho(s)\,(1-\gamma)}$.

Next, our solution concept corresponds to the min-max Nash equilibrium of the function U : Δ(A)^{|S|} × Δ(B)^{|S|} → R defined as

$$U(x, y) := F\big(\lambda_1(x, y), \lambda_2(x, y)\big).$$

Definition 2 (ϵ-NE). A policy profile (x*, y*) ∈ X × Y is said to be an ϵ-approximate Nash equilibrium (ϵ-NE) if, for any x ∈ X and any y ∈ Y, it holds that:

$$U(x^\star, y) - \epsilon \ \le\ U(x^\star, y^\star) \ \le\ U(x, y^\star) + \epsilon. \tag{ϵ-NE}$$

Finally, we denote by U_µ the perturbed utility

$$U_\mu(x, y) := U(\lambda(x, y)) - \frac{\mu}{2}\, \|\lambda_2(x, y)\|^2.$$

The following lemma is a direct consequence of hidden strong concavity.

Lemma 2.2. When µ > 0, U_µ(x, ·) has a unique maximizer y*(x), for all x.

### 2.2. Policy Gradient Estimators

As discussed, the nonexistence of value or action-value functions for general-utility MDPs leads us to focus on policy gradient methods; the direct application of vanilla Q-learning methods is out of the question. To compute the policy gradient of a utility that is nonlinear in λ_1, we make use of the policy gradient theorem (Williams, 1992) along with the chain rule of differentiation to write:

$$\nabla_x U(x, y) = \sum_{s, a} \frac{\partial F_1}{\partial \lambda_1^{s,a}}\big(\lambda_1(x, y); y\big)\; \nabla_x \lambda_1^{s,a}(x, y).$$

By sampling trajectories, each agent can stochastically estimate the policy gradient using the following estimator.

Definition 3 (Gradient Estimator). Given a trajectory ξ = (s^{(0)}, a^{(0)}, s^{(1)}, . . . , s^{(H−1)}, a^{(H−1)}) of length H sampled under a policy profile x, y and initial distribution ϱ, the gradient estimator ĝ_x(ξ | x, z) is defined as:

$$\hat g_x(\xi \mid x, z) := \sum_{h=0}^{H-1} \gamma^h\, z\big(s^{(h)}, a^{(h)}\big) \sum_{h'=0}^{h} \nabla_x \log x\big(a^{(h')} \mid s^{(h')}\big),$$

with z = ∇_{λ_1} F_1(λ_1(x, y); y).

Sufficient exploration. In order to ensure that the agents sufficiently explore the environment and ultimately control the variance of the estimators, we assume that both agents follow an ε-greedy direct policy parametrization, i.e., for a given parameter value x ∈ Δ(A)^{|S|}, the agent plays according to π_x = (1 − ε) x + ε 1/|A|, where 1 is an all-ones vector of appropriate dimension.

### 2.3. Optimization Theory

Next, we introduce several key concepts from nonconvex and min-max optimization, focusing on hidden convexity and gradient domination conditions.
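Before developing these optimization notions, the following sketch illustrates the estimator of Definition 3 for the minimizing player, assuming a trajectory of (state, action) index pairs collected under the ε-greedy parametrization described above; the function name, input format, and numerical safeguard are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def policy_gradient_estimate(traj, x, z, gamma):
    """REINFORCE-style estimator g_hat_x(xi | x, z) from Definition 3 (sketch).

    traj:  list of (s_h, a_h) index pairs of length H, sampled under pi_x
    x:     direct policy parameters of the min player, shape (S, A)
    z:     table z(s, a) = [grad_{lambda_1} F_1(lambda_1(x, y); y)](s, a), shape (S, A)
    gamma: discount factor
    """
    grad = np.zeros_like(x)
    score_sum = np.zeros_like(x)  # running sum of grad_x log x(a_{h'} | s_{h'})
    for h, (s, a) in enumerate(traj):
        # d log x(a|s) / d x[s, a] = 1 / x[s, a]; all other entries are zero.
        score_sum[s, a] += 1.0 / max(x[s, a], 1e-8)
        grad += (gamma ** h) * z[s, a] * score_sum
    return grad
```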
We demonstrate how strong hidden convexity implies the proximal Polyak-Łojasiewicz (pPL) condition and the quadratic growth (QG) condition. We then show that when a min-max objective satisfies a two-sided gradient domination condition, it enjoys a zero duality gap.

Definition 4 (Hidden Convex Function). Consider a function f : X → R, where X is a compact convex set, such that f(x) := H(c(x)) for all x ∈ X, for some mapping c and function H. Suppose the following conditions are satisfied:

- the mapping c is invertible and its inverse c^{-1} is (1/µ_c)-Lipschitz continuous;
- the set U := c(X) is convex and the function H : U → R is µ_H-strongly convex.

Then the function is said to be (µ_c, µ_H)-hidden strongly convex (HSC), while for µ_H = 0 it is referred to merely as hidden convex (HC).

Notably, the convergence analysis of our proposed methods begins with the following claim, which connects cMG utilities to hidden convexity.

Claim 2.3. The function U is hidden convex (resp., hidden concave) for the min player (resp., max player), given a fixed strategy of the opponent. Similarly, the perturbed utility function U_µ is (µ, L_{λ^{-1}})-hidden strongly concave for the max player due to the structure of the regularizer.

Proof. Follows from Definitions 1 and 4 and Lemma 2.1.

Proposition 2.4 (HC implies gradient domination; (Fatkhullin et al., 2023)). Let f : X → R be an ℓ-smooth and (µ_c, µ_H)-hidden convex function and let I_X be the indicator function of the set X. Further, let F(·) := f(·) + I_X(·). Also, assume that the map c(·) is continuously differentiable on X.

(i) If the set U = c(X) is bounded with diameter D_U, then

$$\inf_{s_x \in \partial F(x)} \|s_x\| \ \ge\ \frac{\mu_c}{D_U}\, \big(F(x) - F^\star\big), \quad \forall x \in X.$$

(ii) If f(·) is (µ_c, µ_H)-hidden strongly convex, then

$$\inf_{s_x \in \partial F(x)} \|s_x\|^2 \ \ge\ 2\mu_c^2 \mu_H\, \big(F(x) - F^\star\big), \quad \forall x \in X.$$

Definition 5 (pPL). Let f : X → R be an ℓ-smooth function defined over the convex and compact set X ⊆ R^d. Let D_X(·, ℓ) be defined as:

$$D_X(x, \ell) := -2\ell \min_{y \in X}\Big[ \langle \nabla f(x),\, y - x \rangle + \frac{\ell}{2}\, \|y - x\|^2 \Big].$$

Then, f is said to satisfy the proximal Polyak-Łojasiewicz condition with modulus µ if, for all x ∈ X, it holds that:

$$\tfrac{1}{2}\, D_X(x, \ell) \ \ge\ \mu\, \big(f(x) - f^\star\big).$$

Our following lemma establishes a variant of gradient dominance for the case of convex Markov games. Namely, we show that an approximate constrained first-order stationary point ensures approximately optimal policies in our game.

Lemma 2.5 (Gradient Dominance). For a zero-sum convex Markov game, it holds that:

$$\max_{x' \in X}\ \langle \nabla_x U(x, y),\, x - x' \rangle \ \ge\ \mu_x\, \big(U(x, y) - U(x^\star, y)\big); \qquad \max_{y' \in Y}\ \langle \nabla_y U(x, y),\, y' - y \rangle \ \ge\ \mu_y\, \big(U(x, y^\star) - U(x, y)\big),$$

for µ_x, µ_y = (1 − γ) min_s ϱ(s)/√2.

Proof. Fix an arbitrary y ∈ Y. By the hidden convexity of U(·, y) and Proposition 2.4, we have:

$$\frac{(1-\gamma)\min_s \varrho(s)}{\sqrt{2}}\, \Big(U(x, y) - \min_{x' \in X} U(x', y)\Big) \ \le\ \min_{s_x \in \nabla_x U(x,y) + \partial I_X(x)} \|s_x\|,$$

where √2 is the diameter of the state-action occupancy measure. Applying (Rockafellar & Wets, 2009, Proposition 8.32), we obtain:

$$\min_{s_x \in \nabla_x U(x,y) + \partial I_X(x)} \|s_x\| \ =\ \max_{x' \in X,\ \|x' - x\| \le 1} \langle -\nabla_x U(x, y),\, x' - x \rangle \ \le\ \max_{x' \in X} \langle \nabla_x U(x, y),\, x - x' \rangle.$$

Thus, we set µ_x := (1 − γ) min_s ϱ(s)/√2, and by symmetry the same holds for µ_y.

Finally, in the Appendix we show that both HSC and pPL imply QG, which states that the optimality gap at any point x is lower-bounded by a quadratic term in its distance from the minimizer. This ensures that progress toward optimality also bounds the proximity to the solution.

Proposition 2.6 (QG from HSC and pPL). Let f : X → R be ℓ-smooth and satisfy either: (i) (µ_c, µ_H)-hidden strong convexity (HSC), or (ii) the proximal Polyak-Łojasiewicz (pPL) condition with modulus µ.
Then, f satisfies the quadratic growth (QG) condition:

$$f(x) - f(x^\star) \ \ge\ \frac{\mu_{QG}}{4}\, \|x - x_p^\star\|^2, \quad \forall x \in X,$$

where x_p^⋆ is the closest minimizer in X^⋆ = arg min_{x ∈ X} f(x), and µ_QG = µ_c² µ_H under (HSC), µ_QG = µ under (pPL).

Piecing the Framework Together. At a high level, the standard tool for proving convergence to Nash equilibria for gradient-based methods is the construction of a potential or Lyapunov function (see (Bof et al., 2018)). A Lyapunov function needs to be lower-bounded and decreasing along consecutive iterates of the algorithm. However, proving the latter beyond the convex-concave setting requires a careful examination of HSC and pPL. As such, in order to shed light on the structured nonconvex landscape, we note the following. Starting from the taxonomy of structured nonconvex functions, we show in the appendix that HSC and pPL are equivalent given a proper transformation of their moduli. Hence, the stationarity surrogate D_X(x, ℓ) also serves as a surrogate for the optimality gap f(x) − min_{x′ ∈ X} f(x′). More importantly, by Claim 2.3, in the min-max setting, for a fixed opponent strategy, a sufficiently small D_X(x, ℓ) indicates that the player has found an approximate best response. Furthermore, we highlight an additional crucial detail. Since the pPL condition already guarantees quadratic growth (with an equal modulus), the contribution of HSC in this regard may appear redundant. Nonetheless, HSC is translated to the pPL condition through an argument which degrades the HSC modulus µ to a pPL modulus of µ_PL = O(µ²). Hence, directly guaranteeing QG from HSC improves the convergence rates of the following sections by at least an order of O(ϵ^{3/2}).

## 3. New Insights in Structured Nonconvex Min-Max Optimization

As a preliminary step of independent interest, we present our results on optimization schemes for constrained min-max problems under nonconvex-pPL and pPL-pPL conditions. To maintain completeness, we restate our solution concept from the constrained min-max optimization perspective:

Definition 6 (ϵ-SP). Assume a smooth function f : X × Y → R, where X, Y are two compact and convex sets. Then, (x*, y*) ∈ X × Y is an ϵ-saddle-point (ϵ-SP) of f if

$$\max_{x \in X}\ \langle -\nabla_x f(x^\star, y^\star),\, x - x^\star \rangle \le \epsilon; \qquad \max_{y \in Y}\ \langle \nabla_y f(x^\star, y^\star),\, y - y^\star \rangle \le \epsilon. \tag{ϵ-SP}$$

We adopt a context-agnostic approach, assuming each player receives a black-box representation of their gradient vector. These hypotheses remain intentionally abstract, as we impose no specific modeling assumptions on how the players' payoff signals are generated. In this sense, they serve as an inexact gradient oracle (Devolder et al., 2014) that captures a wide range of settings. Reflecting on our MARL scenario, this framework allows each player to independently select and apply their preferred gradient estimator, as in (Zhang et al., 2020; 2021; Barakat et al., 2023).

Model (Stochastic Inexact First-Order Oracle). For any t, the gradient estimators ĝ_x(x_t, y_t) and ĝ_y(x_t, y_t) satisfy:

$$\mathbb{E}[\hat g_x(x_t, y_t)] = g_x(x_t, y_t), \quad \mathbb{E}[\hat g_y(x_t, y_t)] = g_y(x_t, y_t); \qquad \mathbb{E}\|\hat g_x(x_t, y_t)\|^2 \le \sigma_x^2, \quad \mathbb{E}\|\hat g_y(x_t, y_t)\|^2 \le \sigma_y^2.$$

Additionally, scalars δ_x, δ_y > 0 bound the inexactness error:

$$\|g_x(x_t, y_t) - \nabla_x f(x_t, y_t)\| \le \delta_x, \qquad \|g_y(x_t, y_t) - \nabla_y f(x_t, y_t)\| \le \delta_y.$$

Hence, ĝ_x and ĝ_y provide unbiased estimates of g_x and g_y, respectively, which in turn serve as inexact approximations of ∇_x f(x_t, y_t) and ∇_y f(x_t, y_t). As a result, we encounter two types of errors: (i) systematic bias (non-zero mean), bounded by δ_x, δ_y, and (ii) random noise (zero mean), with variance bounded by σ_x², σ_y².
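As one concrete instantiation of this oracle model, an exact gradient oracle can be wrapped with a bounded systematic bias and zero-mean noise as in the sketch below; the fixed bias direction and the Gaussian noise are illustrative choices, not part of the model itself.

```python
import numpy as np

def make_inexact_oracle(exact_grad, delta, sigma, seed=0):
    """Wrap an exact gradient oracle into a biased, noisy estimator (sketch).

    exact_grad: callable (x, y) -> exact gradient, e.g. nabla_x f(x, y)
    delta:      bound on the norm of the systematic bias ||g - nabla f||
    sigma:      scale of the zero-mean noise added on top of the biased gradient
    """
    rng = np.random.default_rng(seed)

    def oracle(x, y):
        g = exact_grad(x, y)
        n = g.size
        bias = delta * np.ones_like(g) / np.sqrt(n)                # fixed perturbation of norm delta
        noise = rng.normal(0.0, sigma / np.sqrt(n), size=g.shape)  # zero mean, E||noise||^2 = sigma^2
        return g + bias + noise

    return oracle
```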
### 3.1. Alternating Methods

A widely used stabilization technique in nonconvex min-max optimization, including adversarial training and GANs, is alternating updates between players (Lu et al., 2019; Nouiehed et al., 2019; Gidel et al., 2019; Bailey et al., 2020; Wibisono et al., 2022; Cevher et al., 2023; Lee et al., 2024). Building on this approach, we analyze nested and alternating gradient iteration schemes.

#### 3.1.1. Nested Gradient Iterations

As a warm-up, we focus on solving the minimax problem (ϵ-SP) under the assumption that we have access to an oracle for approximate inner maximization, similar to (Jin et al., 2020, Section 4). Specifically, for any given x, the oracle provides a y′ such that f(x, y′) ≥ max_y f(x, y) − ϵ.

$$y_{t+1} \leftarrow \mathrm{ARGMAX}\big(f(x_t, \cdot),\, \epsilon_y\big); \qquad x_{t+1} \leftarrow \Pi_X\big(x_t - \eta\, \nabla_x f(x_t, y_{t+1})\big) \tag{GDmax}$$

Theorem 3.1 (NC-pPL; informal version of Theorem D.5). Let f : X × Y → R be an L-Lipschitz and ℓ-smooth function and let X, Y be compact convex sets with diameters D_X, D_Y respectively. Also, assume that f(x, ·) satisfies the pPL condition with modulus µ > 0. Then, the iteration scheme (GDmax), run with step-sizes τ_x = Θ(1/(ℓκ)) and τ_y = Θ(1/ℓ) and batch-sizes M_x = Θ(σ_x²/ϵ²) and M_y = Θ(κσ_y²/ϵ²), where κ := ℓ/µ, outputs an (ϵ + δ_x + δ_y)-saddle-point of f after a total number of outer-loop and inner-loop iterations T that is at most T = O(κ³ L (D_X + D_Y)).

Theorem 3.2 (pPL-pPL; informal version of Theorem D.6). Let f : X × Y → R be an L-Lipschitz and ℓ-smooth function and let X, Y be compact convex sets with diameters D_X, D_Y respectively. Further, assume that f satisfies the two-sided pPL condition with moduli µ_x, µ_y. Then, the iteration scheme (GDmax), run with step-sizes τ_x = Θ(1/(ℓκ)) and τ_y = Θ(1/ℓ) and batch-sizes M_x = Θ(κ_x σ_x²/ϵ) and M_y = Θ(ℓ κ_y σ_y²/ϵ²), outputs an (ϵ + √(ℓ_Φ/µ_x)(δ_x + δ_y))-saddle-point of f after a number of iterations T that is at most

$$T = O\Big(\tfrac{1}{\mu_x \mu_y}\, \log\big(\tfrac{\ell L \kappa_x D_X}{\epsilon}\big)\, \log\big(\tfrac{\ell L \kappa_y D_Y}{\epsilon}\big)\Big),$$

where κ_x := ℓ/µ_x and κ_y := ℓ/µ_y.

#### 3.1.2. Alternating Gradient Descent Ascent

Given the simplicity and practical superiority of single-loop methods over two-loop alternatives, a natural question arises: can we achieve comparable convergence rates without resorting to multi-loop procedures? To this end, Alt-GDA leverages the sequential computation of x_{t+1} and y_{t+1}, ensuring that each update benefits from fresher gradient information.

$$x_t \leftarrow \Pi_X\big(x_{t-1} - \tau_x\, \hat g_x(x_{t-1}, y_{t-1})\big); \qquad y_t \leftarrow \Pi_Y\big(y_{t-1} + \tau_y\, \hat g_y(x_t, y_{t-1})\big) \tag{Alt-GDA}$$

Theorem 3.3 (NC-pPL; informal version of Theorem D.7). Let f : X × Y → R be an L-Lipschitz and ℓ-smooth function and let X, Y be compact convex sets with diameters D_X, D_Y respectively. Also, assume that f(x, ·) satisfies the pPL condition with modulus µ > 0 for all x ∈ X. Then, the iteration scheme (Alt-GDA), run with step-sizes τ_x = Θ(1/(ℓκ²)) and τ_y = Θ(1/ℓ) and batch-sizes M_x = Θ(ℓ²κ²σ_x²/ϵ²) and M_y = Θ(κ²σ_y²/ϵ²), outputs an (ϵ + δ)-saddle-point of f after a number of iterations T that is at most T = O(κ² ℓ L (D_X + D_Y)).

Theorem 3.4 (pPL-pPL; informal version of Theorem D.8). Let f : X × Y → R be an L-Lipschitz and ℓ-smooth function and let X, Y be compact convex sets with diameters D_X, D_Y respectively. Also, assume that f satisfies the pPL condition in x and in y with moduli µ_x, µ_y > 0 respectively. Then, the iteration scheme (Alt-GDA), run with step-sizes τ_x = Θ(µ_y/ℓ³) and τ_y = Θ(1/ℓ) and batch-sizes M_x = Θ(σ_x²/(µ_x ϵ)) and M_y = Θ(σ_y²/(µ_x µ_y² ϵ)), outputs an ϵ-saddle-point of f after a number of iterations T that is at most

$$T = O\Big(\tfrac{1}{\mu_x \mu_y^2}\, \log\big(\tfrac{L (D_X + D_Y)}{\epsilon}\big)\Big).$$
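To make the single-loop template (Alt-GDA) concrete, here is a minimal projected alternating descent-ascent loop over probability simplices, applied to a toy bilinear objective; the projection routine and the toy problem are illustrative stand-ins, not the paper's experimental setup.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def alt_gda(grad_x, grad_y, x, y, tau_x, tau_y, T):
    """Alternating projected GDA: the y-step uses the freshly updated x_t."""
    for _ in range(T):
        x = project_simplex(x - tau_x * grad_x(x, y))
        y = project_simplex(y + tau_y * grad_y(x, y))
    return x, y

# Toy bilinear saddle-point f(x, y) = x^T M y over two simplices (illustrative only).
M = np.array([[0.0, 1.0], [-1.0, 0.0]])
x_T, y_T = alt_gda(lambda x, y: M @ y,     # nabla_x f
                   lambda x, y: M.T @ x,   # nabla_y f
                   np.array([0.9, 0.1]), np.array([0.2, 0.8]),
                   tau_x=0.1, tau_y=0.1, T=500)
```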
## 4. Main Results: Convex Markov Games

In this section we present our main results regarding convex Markov games (cMGs). We demonstrate that a utility function that is hidden concave in the occupancy measure satisfies the proximal-PL condition with respect to the player's policy. Then, we show that the maximizers of U_µ(x, ·) are Lipschitz continuous in the minimizing player's policy x. Finally, we describe two policy gradient algorithms that converge to an approximate Nash equilibrium using a finite number of samples and iterations: (i) nested policy gradient (Nest-PG), and (ii) alternating policy gradient descent-ascent (Alt-PGDA).

Throughout, δ_x is implicitly tuned through the coefficient of the regularizer, µ. Also, δ_y = 0, since the maximizing player has access to unbiased stochastic estimates of their gradients. To see why µ controls δ_x, we note that, by the L_reg-Lipschitz continuity of the regularizer, the norm of its gradient is bounded. As such, δ_x = O(µ L_reg), and since we also pick µ = O(ϵ), the bound on δ_x follows suit.

Theorem 4.1 (Continuity of maximizers). Consider the mapping x ↦ y*(x) := arg max_{y ∈ Y} f(x, y). Then, for any two points x_1, x_2, it holds that:

$$\|y^\star(x_1) - y^\star(x_2)\| \ \le\ L^\star\, \|x_1 - x_2\|, \quad \text{where } L^\star := \frac{\ell}{\mu\, \mu_{QG}}.$$

Corollary 4.2. Define the function Φ_µ : X → R as Φ_µ(x) := max_{y ∈ Y} { U(x, y) − (µ/2)‖λ_2(x, y)‖² }. Then, for any x, x′ ∈ X, the following inequality holds:

$$\|\nabla_x \Phi_\mu(x) - \nabla_x \Phi_\mu(x')\| \ \le\ \ell_{\Phi,\mu}\, \|x - x'\|, \quad \text{with } \ell_{\Phi,\mu} := \ell\,(1 + L^\star).$$

### 4.1. Nested Policy Gradient

Algorithm 1 (Nest-PG: Nested Policy Gradient)

    input: (x_0, y_0), step-sizes τ_x, τ_y, regularization coefficient µ ≥ 0
    for t = 1 to T_out do
        y_{t,0} ← y_{t−1}
        for s = 1 to T_in do
            y_{t,s} ← Π_Y( y_{t,s−1} + τ_y ∇̂_y U_µ(x_{t−1}, y_{t,s−1}) )
        end for
        y_t ← y_{t,T_in}
        x_t ← Π_X( x_{t−1} − τ_x ∇̂_x U(x_{t−1}, y_t) )
    end for
    Pick the index t* ∈ {1, . . . , T_out} of the best iterate.
    output: (x_{t*}, y_{t*+1})

Theorem 4.3. Consider a two-player zero-sum cMG Γ, and let ϵ > 0 be a desired accuracy. Then, Algorithm 1, run with appropriately tuned step-sizes τ_x, τ_y > 0, exploration parameters ε_x, ε_y > 0, and batch-sizes M_x, M_y > 0, outputs an ϵ-approximate Nash equilibrium after a number of iterations that is at most poly(1/min_s ϱ(s), γ, 1/(1 − γ), |S|, |A| + |B|).

Theorem 4.4. Consider a two-player zero-sum cMG Γ with utility functions that are hidden strongly concave with moduli µ_x, µ_y > 0 respectively. Additionally, let ϵ > 0 be a desired accuracy. Then, Algorithm 1, run with appropriate step-sizes τ_x, τ_y > 0, exploration parameters ε_x, ε_y > 0, and batch-sizes M_x, M_y > 0, outputs an ϵ-approximate Nash equilibrium after a number of iterations that is at most (1/ϵ) · poly(1/µ_x, 1/µ_y, 1/min_s ϱ(s), γ, 1/(1 − γ), |S|, |A| + |B|).

### 4.2. Alternating Policy Gradient Descent Ascent

Algorithm 2 (Alt-PGDA: Alternating Policy GDA)

    input: (x_0, y_0), step-sizes τ_x, τ_y, T > 0, regularization coefficient µ ≥ 0
    for t = 1 to T do
        x_t ← Π_X( x_{t−1} − τ_x ∇̂_x U_µ(x_{t−1}, y_{t−1}) )
        y_t ← Π_Y( y_{t−1} + τ_y ∇̂_y U(x_t, y_{t−1}) )
    end for
    Pick the index t* ∈ {1, . . . , T} of the best iterate.
    output: (x_{t*}, y_{t*})

Theorem 4.5 (Informal version of Theorem E.12). Consider a two-player zero-sum cMG Γ, and let ϵ > 0 be a desired accuracy. Then, Algorithm 2, run with appropriately tuned step-sizes τ_x, τ_y > 0, exploration parameters ε_x, ε_y > 0, batch-sizes M_x, M_y > 0, and a regularization coefficient µ = O(ϵ), outputs an ϵ-approximate Nash equilibrium after a number of iterations that is at most (1/ϵ⁶) · poly(1/min_s ϱ(s), γ, 1/(1 − γ), |S|, |A| + |B|).

Theorem 4.6 (Informal version of Theorem E.13). Consider a two-player zero-sum cMG Γ with utility functions that are hidden strongly concave with moduli µ_x, µ_y > 0 respectively.
Additionally, let ϵ > 0 be a desired accuracy. Then, Algorithm 2, run with step-sizes τ_x, τ_y > 0 and batch-sizes M_x, M_y > 0, outputs an ϵ-approximate Nash equilibrium after a number of iterations that is at most (1/ϵ) · poly(1/µ_x, 1/µ_y, 1/min_s ϱ(s), γ, 1/(1 − γ), |S|, |A| + |B|).

We deem noteworthy the fact that Theorem 4.6 guarantees (expected) last-iterate convergence for a class of nonconvex games that take place over a constrained domain.

## 5. Numerical Results

We demonstrate Algorithms 1 and 2 on an iterated version of rock-paper-scissors-dummy, where each player remembers the actions selected in the previous round. Hence, the previous joint action constitutes the state in the Markov game. The dummy action is dominated by all other actions, such that the Nash equilibrium of the stage game is uniform across rock, paper, and scissors with zero mass on dummy. We set the step-sizes τ_x = τ_y = 0.1 and vary the regularization coefficient µ to demonstrate its effect in biasing convergence.

Figure 1. Exploitability decays towards a small but positive value corresponding to the bias introduced by the regularization coefficient µ. Results are averaged over 100 trials, each running the algorithm with a different randomly initialized policy profile. The leftmost plot reports results for Algorithm 2; the right two report results for Algorithm 1 with T_in = 10 and 100 respectively.

Our experiments clearly suggest that (i) there is a pronounced trade-off between the speed of convergence and the exploitability of the solution of the perturbed problem, and (ii) more inner-loop iterations account for a more stable exploitability decrease across consecutive iterates.

## 6. Conclusion and Future Work

Convex Markov games (cMGs) unify the domains of convex MDPs and Markov games. Even the rudimentary form of a cMG, a two-player pure competition, fosters a rich mosaic of applications spanning language model alignment, self-driving cars, and creative chess playing, to name a few. In this work, we present the first algorithmic solution for computing Nash equilibria in such games. To achieve this, we develop the first guarantees of convergence of alternating descent-ascent to a saddle-point for nonconvex functions that satisfy a one-sided or two-sided proximal Polyak-Łojasiewicz condition over constrained domains. We utilize these results to design independent policy gradient algorithms for convex Markov games that provably converge to an approximate Nash equilibrium. In a nutshell, we develop simple-to-use learning dynamics that converge to the optimal game solutions even under realistic and challenging conditions where batch learning only allows noisy gradient estimation.

In terms of a message to practitioners, our work highlights that the regularized (Alt-GDA) algorithm is exceptionally effective and easy to deploy for tackling min-max problems. Its efficacy is achieved by mere (i) regularization, (ii) alternating updates, and (iii) step-size magnitude separation. Moreover, the generality of Theorems 3.3 and 3.4, coupled with the prevalence of the PL condition in modern machine learning objectives, makes a strong case in favor of (Alt-GDA) as the default algorithm for min-max optimization. Furthermore, we believe that it can seamlessly accommodate more elaborate update schemes on top of (i)-(iii), like adaptive learning rates, preconditioning, and, if necessary, higher-order information.
From a theoretical standpoint, we anticipate fascinating explorations of multi-player cMG interactions and a deep investigation of an array of game-theoretic solution concepts. We hope to see the experience of the rich multi-agent cMG applications inform theory and give rise to meaningful and challenging problems like equilibrium learning, equilibrium selection, and notions of equilibrium performance. We reiterate the fact that cMGs render the value-iteration subroutine useless. The latter lies at the heart of the corresponding algorithmic solutions for equilibrium learning or computation in conventional MGs; its inefficacy in cMGs, more than just posing an additional challenge, can stimulate the search for new algorithmic tools.

## Acknowledgements

FK gratefully acknowledges Panayotis Mertikopoulos for his mentoring and insightful discussions during the former's 2024 summer research fellowship at Archimedes, Athena Research Center.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Abel, D., Dabney, W., Harutyunyan, A., Ho, M. K., Littman, M., Precup, D., and Singh, S. On the expressivity of Markov reward. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Afriat, S. Theory of maxima and the method of Lagrange. SIAM Journal on Applied Mathematics, 20(3):343-357, 1971.

Azizian, W., Iutzeler, F., Malick, J., and Mertikopoulos, P. The rate of convergence of Bregman proximal methods: Local geometry versus regularity versus sharpness. SIAM Journal on Optimization, 34(3):2440-2471, 2024. doi: 10.1137/23M1580218. URL https://doi.org/10.1137/23M1580218.

Bai, Y. and Jin, C. Provable self-play algorithms for competitive reinforcement learning. In International Conference on Machine Learning, pp. 551-560. PMLR, 2020.

Bailey, J. P. and Piliouras, G. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 321-338, 2018.

Bailey, J. P., Gidel, G., and Piliouras, G. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Conference on Learning Theory, pp. 391-407. PMLR, 2020.

Bakhtin, A., Wu, D. J., Lerer, A., Gray, J., Jacob, A. P., Farina, G., Miller, A. H., and Brown, N. Mastering the game of no-press Diplomacy via human-regularized reinforcement learning and planning. arXiv preprint arXiv:2210.05492, 2022.

Barakat, A., Fatkhullin, I., and He, N. Reinforcement learning with general utilities: Simpler variance reduction and large state-action space. In International Conference on Machine Learning, pp. 1753-1800. PMLR, 2023.

Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Science & Business Media, 2011.

Bellman, R. Dynamic programming and stochastic control processes. Information and Control, 1(3):228-239, 1958. ISSN 0019-9958. doi: 10.1016/S0019-9958(58)80003-0. URL https://www.sciencedirect.com/science/article/pii/S0019995858800030.

Ben-Tal, A. and Teboulle, M. Hidden convexity in some nonconvex quadratically constrained quadratic programming. Mathematical Programming, 72(1):51-63, 1996.

Bernasconi, M., Castiglioni, M., Celli, A., and Farina, G. On the role of constraints in the complexity of min-max optimization. arXiv preprint arXiv:2411.03248, 2024.
Berner, C., Brockman, G., Chan, B., Cheung, V., Debiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. Dota 2 with large scale deep reinforcement learning. ar Xiv preprint ar Xiv:1912.06680, 2019. Bof, N., Carli, R., and Schenato, L. Lyapunov theory for discrete time systems. ar Xiv preprint ar Xiv:1809.05289, 2018. Bolte, J., Daniilidis, A., Ley, O., and Mazet, L. Characterizations of łojasiewicz inequalities: subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319 3363, 2010. Burgard, W., Moors, M., Fox, D., Simmons, R., and Thrun, S. Collaborative multi-robot exploration. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), volume 1, pp. 476 481. IEEE, 2000. Cai, Y., Luo, H., Wei, C.-Y., and Zheng, W. Uncoupled and convergent learning in two-player zero-sum markov games with bandit feedback. Advances in Neural Information Processing Systems, 36:36364 36406, 2023. Cai, Y., Luo, H., Wei, C.-Y., and Zheng, W. Near-optimal policy optimization for correlated equilibrium in generalsum markov games. In International Conference on Artificial Intelligence and Statistics, pp. 3889 3897. PMLR, 2024. Cen, S., Chi, Y., Du, S. S., and Xiao, L. Faster last-iterate convergence of policy optimization in zero-sum markov games. ar Xiv preprint ar Xiv:2210.01050, 2022. Cevher, V., Cutkosky, A., Kavis, A., Piliouras, G., Skoulakis, S., and Viano, L. Alternation makes the adversary weaker in two-player games. Advances in Neural Information Processing Systems, 36:18263 18290, 2023. Chambolle, A. and Pock, T. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120 145, 2011. URL https: //api.semanticscholar.org/Corpus ID: 261281173. Chatterji, N., Pacchiano, A., Bartlett, P., and Jordan, M. On the theory of reinforcement learning with once-perepisode feedback. In Advances in Neural Information Processing Systems (Neur IPS), 2021. Chavdarova, T., Gidel, G., Fleuret, F., and Lacoste-Julien, S. Reducing noise in gan training with variance reduced extragradient. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips. cc/paper_files/paper/2019/file/ 58a2fc6ed39fd083f55d4182bf88826d-Paper. pdf. Chavdarova, T., Pagliardini, M., Stich, S. U., Fleuret, F., and Jaggi, M. Taming gans with lookahead-minmax, 2021. URL https://arxiv.org/abs/2006.14567. Solving Zero-Sum Convex Markov Games Chen, X., He, N., Hu, Y., and Ye, Z. Efficient algorithms for a class of stochastic hidden convex optimization and its applications in network revenue management. Operations Research, 2024. Cheung, W. C. Exploration-exploitation trade-off in reinforcement learning on online markov decision processes with global concave rewards. ar Xiv preprint ar Xiv:1905.06466, 2019a. Cheung, W. C. Regret minimization for reinforcement learning with vectorial feedback and complex objectives. In Advances in Neural Information Processing Systems (Neur IPS), 2019b. Cui, Q. and Du, S. S. When are offline two-player zero-sum markov games solvable? Advances in Neural Information Processing Systems, 35:25779 25791, 2022. Cui, Q., Zhang, K., and Du, S. 
Breaking the curse of multiagents in a large state space: Rl in markov games with independent linear function approximation. In The Thirty Sixth Annual Conference on Learning Theory, pp. 2651 2652. PMLR, 2023. Daskalakis, C., Foster, D. J., and Golowich, N. Independent policy gradient methods for competitive reinforcement learning. Advances in neural information processing systems, 33:5527 5540, 2020. Daskalakis, C., Skoulakis, S., and Zampetakis, M. The complexity of constrained min-max optimization. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1466 1478, 2021. Daskalakis, C., Golowich, N., and Zhang, K. The complexity of markov equilibrium in stochastic games. In The Thirty Sixth Annual Conference on Learning Theory, pp. 4180 4234. PMLR, 2023. Deng, X., Li, N., Mguni, D., Wang, J., and Yang, Y. On the complexity of computing markov perfect equilibrium in general-sum stochastic games. National Science Review, 10(1):nwac256, 2023. Devolder, O., Glineur, F., and Nesterov, Y. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146:37 75, 2014. Diakonikolas, J., Daskalakis, C., and Jordan, M. I. Efficient methods for structured nonconvex-nonconcave min-max optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2746 2754. PMLR, 2021. Ding, D., Wei, C.-Y., Zhang, K., and Jovanovic, M. Independent policy gradient for large-scale markov potential games: Sharper rates, function approximation, and gameagnostic convergence. In International Conference on Machine Learning, pp. 5166 5220. PMLR, 2022. Drusvyatskiy, D. and Lewis, A. S. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919 948, 2018. Drusvyatskiy, D. and Paquette, C. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178(1):503 558, 2019. Efroni, Y., Merlis, N., and Mannor, S. Reinforcement learning with trajectory feedback. In AAAI Conference on Artificial Intelligence, 2021. Erez, L., Lancewicki, T., Sherman, U., Koren, T., and Mansour, Y. Regret minimization and convergence to equilibria in general-sum markov games. In International Conference on Machine Learning, pp. 9343 9373. PMLR, 2023. Facchinei, F. and Pang, J.-S. Finite-dimensional variational inequalities and complementarity problems. Springer, 2003. Facchinei, F. and Pang, J.-S. Finite-dimensional variational inequalities and complementarity problems. Springer Science & Business Media, 2007. Fatkhullin, I., He, N., and Hu, Y. Stochastic optimization under hidden convexity. ar Xiv preprint ar Xiv:2401.00108, 2023. Feng, Q. and Shanthikumar, J. G. Supply and demand functions in inventory models. Operations Research, 66 (1):77 91, 2018. Fink, A. M. Equilibrium in a stochastic n-person game. Journal of science of the hiroshima university, series ai (mathematics), 28(1):89 93, 1964. Geist, M., P erolat, J., Lauri ere, M., Elie, R., Perrin, S., Bachem, O., Munos, R., and Pietquin, O. Concave utility reinforcement learning: The mean-field game viewpoint. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2022. Gemp, I., Haupt, A., Marris, L., Liu, S., and Piliouras, G. Convex markov games: A framework for fairness, imitation, and creativity in multi-agent learning. ar Xiv preprint ar Xiv:2410.16600, 2024. Ghadimi, S., Lan, G., and Zhang, H. 
Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1): 267 305, 2016. Gidel, G., Hemmat, R. A., Pezeshki, M., Le Priol, R., Huang, G., Lacoste-Julien, S., and Mitliagkas, I. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1802 1811. PMLR, 2019. Solving Zero-Sum Convex Markov Games Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. Gronauer, S. and Diepold, K. Multi-agent deep reinforcement learning: a survey. Artificial Intelligence Review, 55(2):895 943, 2022. Hazan, E., Kakade, S., Singh, K., and Soest, A. V. Provably efficient maximum entropy exploration. In International Conference on Machine Learning (ICML), 2019. Hughes, E., Leibo, J. Z., Phillips, M., Tuyls, K., Due nez Guzman, E., Garc ıa Casta neda, A., Dunning, I., Zhu, T., Mc Kee, K., Koster, R., Roff, H., and Graepel, T. Inequity aversion improves cooperation in intertemporal social dilemmas. Advances in neural information processing systems, 31, 2018. J Reddi, S., Sra, S., Poczos, B., and Smola, A. J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Advances in neural information processing systems, 29, 2016. Jin, C., Netrapalli, P., and Jordan, M. What is local optimality in nonconvex-nonconcave minimax optimization? In International conference on machine learning, pp. 4880 4889. PMLR, 2020. Jin, C., Liu, Q., Wang, Y., and Yu, T. V-learning a simple, efficient, decentralized algorithm for multiagent rl. ar Xiv preprint ar Xiv:2110.14555, 2021. Jin, Y., Muthukumar, V., and Sidford, A. The complexity of infinite-horizon general-sum stochastic games. ar Xiv preprint ar Xiv:2204.04186, 2022. Kalogiannis, F. and Panageas, I. Zero-sum polymatrix markov games: Equilibrium collapse and efficient computation of nash equilibria. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 59996 60020. Curran Associates, Inc., 2023. Kalogiannis, F., Anagnostides, I., Panageas, I., Vlatakis Gkaragkounis, E.-V., Chatziafratis, V., and Stavroulakis, S. Efficiently computing nash equilibria in adversarial team markov games. ar Xiv preprint ar Xiv:2208.02204, 2022. Kalogiannis, F., Yan, J., and Panageas, I. Learning equilibria in adversarial team markov games: A nonconvex-hiddenconcave min-max optimization problem. ar Xiv preprint ar Xiv:2410.05673, 2024. Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp. 795 811. Springer, 2016. Kinderlehrer, D. and Stampacchia, G. An introduction to variational inequalities and their applications. SIAM, 2000. Kobyzev, I., Prince, S. J., and Brubaker, M. A. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964 3979, 2020. Korpelevich, G. M. The extragradient method for finding saddle points and other problems. Ekonomika i Matematicheskie Metody, 12:747 756, 1976. 
Lanctot, M., Lockhart, E., Lespiau, J.-B., Zambaldi, V., Upadhyay, S., P erolat, J., Srinivasan, S., Timbers, F., Tuyls, K., Omidshafiei, S., et al. Openspiel: A framework for reinforcement learning in games. ar Xiv preprint ar Xiv:1908.09453, 2019. Lee, J., Cho, H., and Yun, C. Fundamental benefit of alternating updates in minimax optimization. ar Xiv preprint ar Xiv:2402.10475, 2024. Leonardos, S., Overman, W., Panageas, I., and Piliouras, G. Global convergence of multi-agent policy gradient in markov potential games. ar Xiv preprint ar Xiv:2106.01969, 2021. Li, D., Wu, Z.-Y., Joseph Lee, H.-W., Yang, X.-M., and Zhang, L.-S. Hidden convex minimization. Journal of Global Optimization, 31:211 233, 2005. Li, G. and Pong, T. K. Calculus of the exponent of kurdyka łojasiewicz inequality and its applications to linear convergence of first-order methods. Foundations of computational mathematics, 18(5):1199 1232, 2018. Li, X.-B. and Li, S. H older continuity of perturbed solution set for convex optimization problems. Applied Mathematics and Computation, 232:908 918, 2014. Liao, F.-Y., Ding, L., and Zheng, Y. Error bounds, pl condition, and quadratic growth for weakly convex functions, and linear convergences of proximal point methods. In 6th Annual Learning for Dynamics & Control Conference, pp. 993 1005. PMLR, 2024. Lin, T., Jin, C., and Jordan, M. On gradient descent ascent for nonconvex-concave minimax problems. In International conference on machine learning, pp. 6083 6093. PMLR, 2020. Solving Zero-Sum Convex Markov Games Littman, M. L. Markov games as a framework for multiagent reinforcement learning. In Machine learning proceedings 1994, pp. 157 163. Elsevier, 1994. Liu, M., Rafique, H., Lin, Q., and Yang, T. First-order convergence theory for weakly-convex-weakly-concave min-max problems. Journal of Machine Learning Research, 22(169):1 34, 2021. Lu, S., Singh, R., Chen, X., Chen, Y., and Hong, M. Alternating gradient descent ascent for nonconvex min-max problems in robust learning and gans. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 680 684. IEEE, 2019. Luo, Z.-Q. and Tseng, P. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1):157 178, 1993. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Martinet, B. R egularisation d in equations variationnelles par approximations successives. ESAIM: Mathematical Modelling and Numerical Analysis - Mod elisation Math ematique et Analyse Num erique, 1970. Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. ar Xiv preprint ar Xiv:1807.02629, 2018. Mladenovic, A., Sakos, I., Gidel, G., and Piliouras, G. Generalized natural gradient flows in hidden convex-concave games and gans. In International Conference on Learning Representations, 2021. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. ar Xiv preprint ar Xiv:1312.5602, 2013. Mulvaney-Kemp, J., Park, S., Jin, M., and Lavaei, J. Dynamic regret bounds for constrained online nonconvex optimization based on polyak lojasiewicz regions. IEEE Transactions on Control of Network Systems, 10(2):599 611, 2022. Nemirovski, A. 
Prox-method with rate of convergence o(1/t) for variational inequalities with lipschitz continuous monotone operators. SIAM Journal on Optimization, 15(1):229 251, 2004. Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574 1609, 2009. Nouiehed, M., Sanjabi, M., Huang, T., Lee, J. D., and Razaviyayn, M. Solving a class of non-convex min-max games using iterative first order methods. Advances in Neural Information Processing Systems, 32, 2019. Oikonomidis, K., Laude, E., and Patrinos, P. Forwardbackward splitting under the light of generalized convexity. ar Xiv preprint ar Xiv:2503.18098, 2025. Papadimitriou, C., Vlatakis-Gkaragkounis, E.-V., and Zampetakis, M. The computational complexity of multiplayer concave games and kakutani fixed points. In Proceedings of the 24th ACM Conference on Economics and Computation, 2023. Park, C., Zhang, K., and Ozdaglar, A. Multi-player zerosum markov games with networked separable interactions. Advances in Neural Information Processing Systems, 36: 37354 37369, 2023. Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838 855, 1992. Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Rebjock, Q. and Boumal, N. Fast convergence to nonisolated minima: four equivalent conditions for c 2 functions. Mathematical Programming, pp. 1 49, 2024. Rockafellar, R. T. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877 898, 1976. Rockafellar, R. T. and Wets, R. J.-B. Variational analysis, volume 317. Springer Science & Business Media, 2009. Rogers, J. G., Nieto-Granda, C., and Christensen, H. I. Coordination strategies for multi-robot exploration and mapping. In Experimental Robotics: The 13th International Symposium on Experimental Robotics, pp. 231 243. Springer, 2013. Ruppert, D. Efficient estimations from a slowly convergent robbins-monro process. Technical Report, Cornell University, 1988. Sakos, I., Vlatakis-Gkaragkounis, E.-V., Mertikopoulos, P., and Piliouras, G. Exploiting hidden structures in nonconvex games for convergence to nash equilibrium. Advances in Neural Information Processing Systems, 36, 2024. Sayin, M., Zhang, K., Leslie, D., Basar, T., and Ozdaglar, A. Decentralized q-learning in zero-sum markov games. Advances in Neural Information Processing Systems, 34: 18320 18334, 2021. Solving Zero-Sum Convex Markov Games Schneider, S. and Wagner, D. H. Error detection in redundant systems. In Papers Presented at the February 26-28, 1957, Western Joint Computer Conference: Techniques for Reliability, IRE-AIEE-ACM 57 (Western), pp. 115 121, New York, NY, USA, 1957. Association for Computing Machinery. ISBN 9781450378611. doi: 10.1145/1455567.1455587. URL https://doi. org/10.1145/1455567.1455587. Shalev-Shwartz, S., Shammah, S., and Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. ar Xiv preprint ar Xiv:1610.03295, 2016. Shapley, L. S. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095 1100, 1953. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. nature, 550(7676):354 359, 2017. Singh, B., Kumar, R., and Singh, V. P. 
Singh, B., Kumar, R., and Singh, V. P. Reinforcement learning in robotic applications: a comprehensive survey. Artificial Intelligence Review, 55(2):945–990, 2022.

Song, Z., Mei, S., and Bai, Y. When can we learn general-sum Markov games with a large number of players sample-efficiently? arXiv preprint arXiv:2110.04184, 2021.

Stella, L., Themelis, A., and Patrinos, P. Forward-backward quasi-Newton methods for nonsmooth optimization problems. Computational Optimization and Applications, 67(3):443–487, 2017.

Tammelin, O., Burch, N., Johanson, M., and Bowling, M. Solving heads-up limit Texas hold'em. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pp. 645–652. AAAI Press, 2015. ISBN 9781577357384.

Tan, A. H., Bejarano, F. P., Zhu, Y., Ren, R., and Nejat, G. Deep reinforcement learning for decentralized multi-robot exploration with macro actions. IEEE Robotics and Automation Letters, 8(1):272–279, 2022.

v. Neumann, J. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.

Vlatakis-Gkaragkounis, E.-V., Flokas, L., and Piliouras, G. Solving min-max optimization with hidden structure via gradient descent ascent. Advances in Neural Information Processing Systems, 34:2373–2386, 2021.

Von Stengel, B. and Koller, D. Team-maxmin equilibria. Games and Economic Behavior, 21(1-2):309–321, 1997.

Wang, Y., Lacotte, J., and Pilanci, M. The hidden convex optimization landscape of regularized two-layer ReLU networks: an exact characterization of optimal solutions. In International Conference on Learning Representations, 2021.

Wang, Y., Liu, Q., Bai, Y., and Jin, C. Breaking the curse of multiagency: Provably efficient decentralized multi-agent RL with function approximation. In The Thirty Sixth Annual Conference on Learning Theory, pp. 2793–2848. PMLR, 2023.

Wei, C.-Y., Lee, C.-W., Zhang, M., and Luo, H. Last-iterate convergence of decentralized optimistic gradient descent/ascent in infinite-horizon competitive Markov games. In Conference on Learning Theory, pp. 4259–4299. PMLR, 2021.

Wibisono, A., Tao, M., and Piliouras, G. Alternating mirror descent for constrained min-max games. Advances in Neural Information Processing Systems, 35:35201–35212, 2022.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

Wu, Y., Viano, L., Chen, Y., Zhu, Z., Antonakopoulos, K., Gu, Q., and Cevher, V. Multi-step alignment as Markov games: An optimistic online gradient descent approach with convergence guarantees. arXiv preprint arXiv:2502.12678, 2025.

Wu, Z.-Y., Li, D., Zhang, L.-S., and Yang, X. Peeling off a nonconvex cover of an actual convex problem: hidden convexity. SIAM Journal on Optimization, 18(2):507–536, 2007.

Xia, Y. A survey of hidden convex optimization. Journal of the Operations Research Society of China, 8(1):1–28, 2020.

Yang, J., Kiyavash, N., and He, N. Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33:1153–1165, 2020.

Yang, J., Orvieto, A., Lucchi, A., and He, N. Faster single-loop algorithms for minimax optimization without strong concavity. In International Conference on Artificial Intelligence and Statistics, pp. 5485–5517. PMLR, 2022.

Yang, Y. and Ma, C. $O(T^{-1})$ convergence of optimistic-follow-the-regularized-leader in two-player zero-sum Markov games. arXiv preprint arXiv:2209.12430, 2022.
Ying, D., Guo, M. A., Ding, Y., Lavaei, J., and Shen, Z.-J. Policy-based primal-dual methods for convex constrained Markov decision processes. Proceedings of the AAAI Conference on Artificial Intelligence, 37:10963–10971, 2023.

Zahavy, T., O'Donoghue, B., Desjardins, G., and Singh, S. Reward is enough for convex MDPs. In Advances in Neural Information Processing Systems (NeurIPS), 2021.

Zahavy, T., Schroecker, Y., Behbahani, F., Baumli, K., Flennerhag, S., Hou, S., and Singh, S. Discovering policies with DOMiNO: Diversity optimization maintaining near optimality. arXiv preprint arXiv:2205.13521, 2022.

Zahavy, T., Veeriah, V., Hou, S., Waugh, K., Lai, M., Leurent, E., Tomasev, N., Schut, L., Hassabis, D., and Singh, S. Diversifying AI: Towards creative chess with AlphaZero. arXiv preprint arXiv:2308.09175, 2023.

Zeng, S., Doan, T., and Romberg, J. Regularized gradient descent ascent for two-player zero-sum Markov games. Advances in Neural Information Processing Systems, 35:34546–34558, 2022.

Zhang, G. and Yu, Y. Convergence of gradient methods on bilinear zero-sum games. In International Conference on Learning Representations, 2019. URL https://api.semanticscholar.org/CorpusID:211069144.

Zhang, J., Koppel, A., Bedi, A. S., Szepesvari, C., and Wang, M. Variational policy gradient method for reinforcement learning with general utilities. Advances in Neural Information Processing Systems, 33:4572–4583, 2020.

Zhang, J., Ni, C., Yu, Z., Szepesvari, C., and Wang, M. On the convergence and sample efficiency of variance-reduced policy gradient method. Advances in Neural Information Processing Systems, 34:2228–2240, 2021.

Zhang, R., Liu, Q., Wang, H., Xiong, C., Li, N., and Bai, Y. Policy optimization for Markov games: Unified framework and faster convergence. Advances in Neural Information Processing Systems, 35:21886–21899, 2022a.

Zhang, R., Mei, J., Dai, B., Schuurmans, D., and Li, N. On the global convergence rates of decentralized softmax gradient play in Markov potential games. Advances in Neural Information Processing Systems, 35:1923–1935, 2022b.

Zheng, T., Zhu, L., So, A. M.-C., Blanchet, J., and Li, J. Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization. Advances in Neural Information Processing Systems, 36:54075–54110, 2023.

Table of Contents

A Further Related Work
  A.1 Hidden Convex Optimization
  A.2 Min-Max Optimization
  A.3 Convex Markov Decision Processes
  A.4 Markov Games and Multi-Agent Reinforcement Learning

B Optimization Preliminaries
  B.1 Optimization Definitions & Lemmata
  B.2 Min-Max Optimization Lemmata

C Hidden Convexity, PL, KL, and EB
  C.1 Equivalence between the three conditions
  C.2 pPL from HSC
  C.3 QG directly from HSC, KL, pPL
  C.4 Warm-up: Stochastic Projected Gradient Descent on a proximal-PL Function
D Constrained Min-Max Optimization Under the pPL Condition
  D.1 Key Lemmata
  D.2 Stationarity Proxies and the Gradient Oracle
  D.3 Convergence of Nested Gradient Iterations
  D.4 Convergence of Stochastic Alternating Gradient Descent-Ascent

E Convex Markov Games
  E.1 Properties of the cMG Utility Functions
  E.2 Stochastic Estimators
  E.3 Additional Supporting Claims
  E.4 Nested Policy Gradient
  E.5 Alternating Policy Gradient Descent-Ascent

Roadmap of the Appendix

To help the reader navigate, we give a short summary of the topics covered in each section. In Appendix A, we attempt to present an overview of key research directions, theoretical landmarks, and applications; interested readers can find further details in the literature cited therein, although an exhaustive survey is infeasible. In Appendix B, we establish a series of lemmata concerning gradient descent for smooth functions: (i) Lemmata B.2 to B.4 are standard results regarding the iterates of projected gradient descent (with an exact, or an inexact and stochastic, first-order oracle); (ii) we introduce $D_X$ as the primary proxy of stationarity and derive a descent inequality for inexact gradient estimates based on this proxy (Lemmata B.7 to B.9); (iii) the section concludes with an analysis of the Lipschitz continuity of this proxy in min-max optimization, where we also examine how one player's deviation affects the proxy of the other player. In Appendix C, we study the relationship between Hidden Strong Convexity (HSC) and the equivalent conditions of the proximal Polyak-Łojasiewicz (pPL) and the Kurdyka-Łojasiewicz (KL) inequalities. In Appendix D, we present key results, including: (i) the Lipschitz continuity of maximizers, and (ii) convergence rates for nested and alternating schemes under the general framework of hidden-convex hidden-concave min-max optimization. Finally, in Appendix E, we extend our analysis to convex multi-agent reinforcement learning, combining the results of Appendix D with policy gradient estimators to compute parameter updates.

A. Further Related Work

A.1. Hidden Convex Optimization

Hidden convexity has emerged as an important structural property in nonconvex optimization, enabling global convergence results in settings where traditional convex analysis fails; see (Xia, 2020) for a survey. Fatkhullin et al. (2023) established the convergence of first-order methods even in the nonsmooth and stochastic settings.
Moreover, hidden convexity has been extensively studied across diverse applications, including policy optimization in reinforcement learning (RL) and optimal control (Hazan et al., 2019; Zhang et al., 2020; Ying et al., 2023), generative models (Kobyzev et al., 2020), supply chain and revenue management (Feng & Shanthikumar, 2018; Chen et al., 2024), and neural network training (Wang et al., 2021). Even earlier, instances of implicit convex structure appeared in a number of works (Ben-Tal & Teboulle, 1996; Li et al., 2005; Wu et al., 2007). In a certain sense, hidden convexity falls into the general category of metric regularity conditions (Karimi et al., 2016; Li & Pong, 2018; Drusvyatskiy & Paquette, 2019; Drusvyatskiy & Lewis, 2018; Liao et al., 2024; Rebjock & Boumal, 2024; Luo & Tseng, 1993; Oikonomidis et al., 2025) that have been used to prove convergence of iterative gradient-based methods. In particular, hidden convexity ensures convergence in nonconvex-nonconcave games: Vlatakis-Gkaragkounis et al. (2021) introduced the notion of hidden-convex hidden-concave games and proved global convergence to a NE. Further extending these ideas, Sakos et al. (2024) investigate the impact of hidden structure and show that such properties can be leveraged to enhance the stability of first-order methods. The generalized notion of hidden monotonicity (Mladenovic et al., 2021) guarantees global convergence for multi-player nonconvex games.

A.2. Min-Max Optimization

The literature on min-max optimization is long-standing and intimately connected to game theory (v. Neumann, 1928). In recent years, the exploration of nonconvex-nonconcave min-max problems gained prominence in machine learning, particularly due to developments like generative adversarial networks (GANs) (Goodfellow et al., 2014) and adversarial learning (Madry et al., 2017). Research has focused on defining appropriate solution concepts and on developing methods that mitigate the oscillatory behaviors of optimization algorithms. A min-max optimization problem is often formulated as a variational inequality problem (VIP) (Facchinei & Pang, 2003), and the literature has managed to guarantee provable convergence to solutions of the corresponding VIP only under certain assumptions. Namely, convergence to an approximate solution of the VIP is guaranteed under (i) monotonicity (or a convex-concave objective), (ii) gradient domination, or (iii) other regularity conditions. In particular, under monotonicity, the proximal point methods (Martinet, 1970; Rockafellar, 1976) for VIPs guarantee convergence. When the objective function is Lipschitz and strictly convex-concave, simple forward-backward schemes are known to converge. Moreover, when coupled with Polyak-Ruppert averaging (Ruppert, 1988; Polyak & Juditsky, 1992; Nemirovski et al., 2009), these methods achieve an $O(1/\epsilon^2)$ complexity without requiring strict convex-concavity (Bauschke & Combettes, 2011). If, additionally, the objective function has Lipschitz continuous gradients, the extragradient algorithm (Korpelevich, 1976) ensures trajectory convergence without strict monotonicity assumptions, while the time-averaged iterates converge at a rate of $O(1/\epsilon)$ (Nemirovski, 2004). Furthermore, in strongly convex-concave settings, forward-backward methods compute an $\epsilon$-saddle point in $O(1/\epsilon)$ steps.
If the operator is also Lipschitz continuous, classical results in operator theory establish that simple forward-backward methods suffice for linear convergence (Facchinei & Pang, 2007; Bauschke & Combettes, 2011). Under a single-sided gradient domination condition, convergence to a stationary point of the min-max objective has been proven in (Lin et al., 2020; Nouiehed et al., 2019; Yang et al., 2022). Under a two-sided gradient domination condition, (Daskalakis et al., 2020; Yang et al., 2020; Zheng et al., 2023) have proven convergence to a solution of the corresponding VIP (and, equivalently, to a min-max point). Further, under the Minty condition, (Mertikopoulos et al., 2018; Liu et al., 2021; Diakonikolas et al., 2021) show convergence to a solution of the VIP even under nonconvexity. Moreover, Azizian et al. (2024) show convergence of gradient descent under various regularity conditions and local optimization geometries.

A.3. Convex Markov Decision Processes

The study of convex reinforcement learning emerged as a natural extension of standard RL to handle more expressive, non-linear utility functions. Hazan et al. (2019) introduced the problem of reward-free exploration of an MDP through state-occupancy entropy maximization; to tackle policy optimization, they proposed a provably efficient algorithm based on the Frank-Wolfe method. Further advances were made by Zhang et al. (2020; 2021), who studied convex RL under the framework of RL with general utilities. Their key contribution was a policy gradient estimator for general utilities and the identification of a hidden convexity property within the convex RL objective, enabling statistically efficient policy optimization in the infinite-trials setting. More recently, convex RL was reinterpreted through a game-theoretic perspective in (Zahavy et al., 2021; Geist et al., 2022): the former work views convex RL as a min-max game between a policy player and a cost player, while the latter positions convex RL as a subclass of mean-field games. A related strand of research focuses on the expressivity of scalar (Markovian) rewards. Abel et al. (2021) demonstrated that scalar rewards cannot naturally encode all tasks, such as policy orderings or trajectory rankings. While convex RL extends the expressivity of scalar RL in these respects, it still has inherent limitations. Specifically, infinite-trial convex RL excels at defining policy orderings but lacks full trajectory-ordering capabilities, as it only considers the stationary state distribution. In contrast, the finite-trials convex RL formulation naturally captures trajectory orderings at the cost of reduced expressivity in policy ordering. Another relevant direction concerns RL with trajectory feedback, where learning occurs through entire sequences rather than scalar rewards. Most prior works in this area assume an underlying scalar reward model, which merely delays feedback until the episode's end (Efroni et al., 2021); a notable exception is the once-per-episode feedback model studied by Chatterji et al. (2021). Lastly, related research in multi-objective RL has explored the use of vectorial rewards to encode convex objectives. The works of Cheung (2019a;b) demonstrated that stationary policies are often suboptimal in such settings, necessitating non-stationary strategies.
They provided principled procedures to optimize policies with sub-linear regret, complementing our analysis in infinite-horizon convex RL, where the distinctions between finite and infinite trials diminish.

A.4. Markov Games and Multi-Agent Reinforcement Learning

Markov, or stochastic, games were introduced by Lloyd S. Shapley (Shapley, 1953). Interestingly enough, their introduction coincides with that of single-agent Markov decision processes (Schneider & Wagner, 1957; Bellman, 1958). In an MG, agents interact with both the environment and each other; each agent must balance immediate rewards against the potential future benefits of guiding the system to more advantageous states. Since their inception, MGs have served as the canonical model of MARL (Littman, 1994), which in turn encompasses numerous applications like autonomous vehicles (Shalev-Shwartz et al., 2016), multi-agent robotics (Singh et al., 2022; Gronauer & Diepold, 2022), and general recreational game-playing (Mnih et al., 2013; Silver et al., 2017; Berner et al., 2019). In recent years, theoretical MG research has focused on developing computationally and statistically efficient algorithms. The literature has developed rapidly, making an exhaustive review infeasible in the context of this note. We outline results regarding (i) the computational complexity of equilibrium computation in MGs, (ii) no-regret learning approaches with convergence to the corresponding equilibrium notion, (iii) convergence to a NE when the game's structure permits it, and (iv) offline equilibrium learning from data. A number of works (Deng et al., 2023; Jin et al., 2022; Daskalakis et al., 2023) concurrently proved that (coarse) correlated equilibria are intractable in infinite-horizon MGs when the policies are required to be stationary and Markovian: computing them is PPAD-complete. This is in pronounced contrast to normal-form games, where they are computable in strongly polynomial time. In terms of no-regret approaches, the literature has been quite fruitful, considering finite-horizon games and stationary policies, both Markovian and non-Markovian. The solutions (Jin et al., 2021; Yang & Ma, 2022; Zhang et al., 2022a; Erez et al., 2023; Cai et al., 2024) are diverse and build upon the follow-the-regularized-leader and online-mirror-descent frameworks. Nash equilibrium convergence has mainly been considered in fully competitive two-player zero-sum MGs of infinite horizon, in the Markovian counterpart of potential games, in their unification (adversarial team Markov games), and in games with certain structural properties on the rewards and the dynamics. In two-player zero-sum games there have been results with a mostly stochastic-optimization approach (Daskalakis et al., 2020; Wei et al., 2021; Sayin et al., 2021; Cen et al., 2022; Zeng et al., 2022). In addition, a notable work guarantees convergence to a NE in infinite-horizon games using only bandit feedback (Cai et al., 2023). For Markov potential games, global convergence of policy gradient methods has been presented in (Zhang et al., 2022b; Leonardos et al., 2021; Ding et al., 2022). Moreover, (Kalogiannis et al., 2022; 2024) consider an MG of a team versus an adversary, similar to (Von Stengel & Koller, 1997), and guarantee provable convergence to a NE.
Lastly, (Kalogiannis & Panageas, 2023; Park et al., 2023) generalize zero-sum polymatrix games to their Markovian counterpart, with extra assumptions on the dynamics, and prove the tractability of NE. In terms of learning equilibria from samples, (Bai & Jin, 2020; Cui & Du, 2022) guaranteed sample-efficient learning of NE in zero-sum games, and (Song et al., 2021) went further to consider general-sum games. Finally, (Wang et al., 2023; Cui et al., 2023) consider the sample complexity of learning an equilibrium beyond the tabular setting.

B. Optimization Preliminaries

In this section we go over some rudimentary optimization lemmata, as well as a novel exploration of the min-max optimization landscape for pPL functions. Namely, Appendix B.2 complements the study of (Nouiehed et al., 2019) for the constrained domain. To our knowledge, we offer the first such investigation of the constrained pPL landscape.

B.1. Optimization Definitions & Lemmata

Lemma B.1 (Smoothness inequality). Assume that $f: X \to \mathbb{R}$ is an $\ell$-smooth function. Then, for $x, y \in X$ it holds that:
$$f(y) + \langle\nabla f(y),\, x - y\rangle - \frac{\ell}{2}\|y - x\|^2 \;\le\; f(x) \;\le\; f(y) + \langle\nabla f(y),\, x - y\rangle + \frac{\ell}{2}\|y - x\|^2.$$

Definition 7. A function $f$ is said to be $\ell$-weakly convex if $f(x) + \frac{\ell}{2}\|x\|^2$ is convex.

B.1.1. CONCERNING THE GRADIENT MAPPING

Lemma B.2. Let $X \subseteq \mathbb{R}^d$ be a non-empty, closed, and convex set. Denote by $\Pi_X: \mathbb{R}^d \to X$ the Euclidean projection operator onto $X$. The following inequalities hold; for projected gradient descent:
$$\langle v,\, x - \Pi_X(x - \eta v)\rangle \ge \frac{1}{\eta}\,\|x - \Pi_X(x - \eta v)\|^2;$$
for projected gradient ascent:
$$\langle v,\, \Pi_X(x + \eta v) - x\rangle \ge \frac{1}{\eta}\,\|x - \Pi_X(x + \eta v)\|^2.$$

Proof. Writing the Euclidean projection of the gradient descent step with feedback $v$ as an optimization problem,
$$\min_{x' \in X} \frac{1}{2}\|x' - (x - \eta v)\|^2,$$
we get the first-order optimality conditions,
$$\langle \eta v + \Pi_X(x - \eta v) - x,\; z - \Pi_X(x - \eta v)\rangle \ge 0, \quad \forall z \in X.$$
Setting $z = x$ and re-arranging,
$$\langle v,\, x - \Pi_X(x - \eta v)\rangle \ge \frac{1}{\eta}\,\|\Pi_X(x - \eta v) - x\|^2.$$
Similarly, for gradient ascent with feedback $v$ we write
$$\min_{x' \in X} \frac{1}{2}\|x' - (x + \eta v)\|^2.$$
The first-order optimality condition for the Euclidean projection reads,
$$\langle x' - x - \eta v,\; z - x'\rangle \ge 0, \quad \forall z \in X.$$
Plugging in $x' = \Pi_X(x + \eta v)$ and $z = x$,
$$\langle \Pi_X(x + \eta v) - x - \eta v,\; x - \Pi_X(x + \eta v)\rangle \ge 0.$$
Re-arranging, we conclude
$$\langle v,\, \Pi_X(x + \eta v) - x\rangle \ge \frac{1}{\eta}\,\|\Pi_X(x + \eta v) - x\|^2.$$

Lemma B.3 ((Ghadimi et al., 2016, Lemma 2)). Let $v_1, v_2$ be vectors in $\mathbb{R}^d$, let $X \subseteq \mathbb{R}^d$ be a compact convex set, and let $\eta > 0$ be a scalar. Also, let the points $x_1^+, x_2^+ \in X$ be such that:
$$x_1^+ := \Pi_X(x - \eta v_1); \qquad x_2^+ := \Pi_X(x - \eta v_2).$$
Then, it holds true that $\|x_1^+ - x_2^+\| \le \eta\,\|v_1 - v_2\|$.

Lemma B.4 (Stoch. vs. Det. Grad. Mapping). Assume a stochastic gradient oracle $\hat g_x$ for a differentiable function $f$ such that $\mathbb{E}[\hat g_x(x)] = g_x(x)$, $\mathbb{E}\|\hat g_x(x)\|^2 \le \sigma_x^2$, and $\|g_x(x) - \nabla f(x)\| \le \delta_x$ for all $x \in X$. Also, let $x, x^+, \hat x^+$ be points in $X$ such that:
$$x^+ := \Pi_X\!\left(x - \eta\,\nabla_x f(x)\right); \qquad \hat x^+ := \Pi_X\!\left(x - \eta\,\hat g_x(x)\right).$$
Then, the following inequalities hold:
$$\mathbb{E}\|x - \hat x^+\|^2 \le 2\,\|x - x^+\|^2 + 4\eta^2\sigma_x^2 + 4\eta^2\delta_x^2; \qquad \|x - x^+\|^2 \le 2\,\mathbb{E}\|x - \hat x^+\|^2 + 4\eta^2\sigma_x^2 + 4\eta^2\delta_x^2.$$

Proof. We will prove the first inequality; the second follows from the same arguments.
$$\mathbb{E}\|x - \hat x^+\|^2 \le 2\,\mathbb{E}\|x - x^+\|^2 + 2\,\mathbb{E}\|\hat x^+ - x^+\|^2 \le 2\,\mathbb{E}\|x - x^+\|^2 + 2\eta^2\,\mathbb{E}\|\hat g_x(x) - \nabla_x f(x)\|^2$$
$$\le 2\,\mathbb{E}\|x - x^+\|^2 + 4\eta^2\,\mathbb{E}\|g_x(x) - \hat g_x(x)\|^2 + 4\eta^2\,\mathbb{E}\|g_x(x) - \nabla_x f(x)\|^2 \le 2\,\mathbb{E}\|x - x^+\|^2 + 4\eta^2\sigma_x^2 + 4\eta^2\delta_x^2.$$
The first and third inequalities follow from the fact $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$; the second from Lemma B.3; the last from the properties of the stochastic gradient oracle.
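As a quick numerical sanity check of the two projection lemmata above, the following minimal Python sketch (our own illustration, not part of the analysis; the box domain, step size, and random feedback vectors are assumptions made purely for the example) verifies the descent inequality of Lemma B.2 and the non-expansiveness bound of Lemma B.3:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta = 5, 0.1

def proj_box(z, lo=-1.0, hi=1.0):
    # Euclidean projection onto the (assumed) box X = [lo, hi]^d
    return np.clip(z, lo, hi)

for _ in range(1000):
    x = proj_box(rng.normal(size=d))               # a feasible point
    v1, v2 = rng.normal(size=d), rng.normal(size=d)
    xp1 = proj_box(x - eta * v1)
    xp2 = proj_box(x - eta * v2)
    # Lemma B.2: <v, x - x^+> >= (1/eta) ||x - x^+||^2
    assert v1 @ (x - xp1) >= np.linalg.norm(x - xp1) ** 2 / eta - 1e-9
    # Lemma B.3: ||x_1^+ - x_2^+|| <= eta ||v_1 - v_2||
    assert np.linalg.norm(xp1 - xp2) <= eta * np.linalg.norm(v1 - v2) + 1e-9
print("Lemmata B.2 and B.3 hold on all sampled instances.")
```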
B.1.2. $D_X$: AN ALTERNATIVE PROXY OF STATIONARITY

In unconstrained optimization of differentiable functions, the conventional bounds of optimization algorithms directly guarantee the minimization of $\|\nabla f(x)\|$, i.e., the norm of the gradient of $f$ at the point $x$. In constrained optimization of differentiable functions, guarantees for algorithms like projected gradient descent instead ensure the minimization of the norm of the gradient mapping, i.e., $\frac{1}{\eta}\left\|x - \Pi_X(x - \eta\nabla f(x))\right\|$. It can be shown that the gradient mapping is indeed a good proxy of stationarity. For our work, we will consider the quantity $D_X$, to be defined shortly. This quantity is greater than the squared norm of the gradient mapping, and it turns out to be particularly favorable when handling the constrained optimization of proximal-PL functions. In what follows, we define $D_X$ and discuss some of its useful properties.

Definition 8. Let $X \subseteq \mathbb{R}^d$ be a compact convex set and $f: X \to \mathbb{R}$ be an $\ell$-smooth function. We define $D_X(x, \alpha)$ to be:
$$D_X(x, \alpha) := -2\alpha \min_{y \in \mathbb{R}^d}\left\{\langle\nabla f(x),\, y - x\rangle + \frac{\alpha}{2}\|y - x\|^2 + I_X(y) - I_X(x)\right\},$$
where $I_X$ is the indicator function of the set $X$, with $I_X(x) = 0$ if $x \in X$ and $I_X(x) = +\infty$ otherwise.

Remarkably, this expression takes a closed form. As we show next, the minimizer of the display inside the brackets is exactly the point returned by one step of projected gradient descent from $x$ with a step size equal to $\frac{1}{\alpha}$.

Claim B.5 (Closed form of $D_X$). Let $f$ be an $\ell$-smooth function defined on a compact convex set $X \subseteq \mathbb{R}^d$, a point $x \in X$, and a scalar $\alpha \ge \ell$. Then the following equation holds for $D_X(x, \alpha)$:
$$D_X(x, \alpha) = 2\alpha\left\langle\nabla f(x),\; x - \Pi_X\!\left(x - \tfrac{1}{\alpha}\nabla f(x)\right)\right\rangle - \alpha^2\left\|x - \Pi_X\!\left(x - \tfrac{1}{\alpha}\nabla f(x)\right)\right\|^2.$$

Proof. By the first-order optimality conditions, we see that the inner objective is equivalent to the objective of the Euclidean projection of the point $x - \frac{1}{\alpha}\nabla f(x) \in \mathbb{R}^d$ onto $X$. Then, we just plug the minimizer into the display.

Next, we see that $D_X(\cdot, \alpha)$ is non-decreasing in the scalar $\alpha$.

Lemma B.6 ((Karimi et al., 2016, Lemma 1)). Let $f: X \to \mathbb{R}$ be a differentiable function and $0 < \alpha_1 \le \alpha_2$ positive scalars. Then, the following inequality holds true: $D_X(x, \alpha_1) \le D_X(x, \alpha_2)$.

The following lemma demonstrates the relationship between $D_X$ and the norm of the gradient mapping.

Lemma B.7 ($D_X$ vs. Gradient Mapping Norm). Let $X \subseteq \mathbb{R}^d$ be a non-empty, closed, and convex set. Consider a differentiable function $f: X \to \mathbb{R}$ that has an $\ell$-Lipschitz continuous gradient. Define the gradient mapping at the point $x \in X$ with step size $1/\ell$ as $G(x) := \ell\,(x - x^+)$, where $x^+ := \Pi_X\!\left(x - \frac{1}{\ell}\nabla f(x)\right)$. Then, the following inequality holds:
$$D_X(x, \ell) \ge \ell^2\,\|x^+ - x\|^2 = \|G(x)\|^2.$$

Proof. From the definition of $D_X$, we observe that the first-order optimality conditions for its inner minimization problem read,
$$\langle\nabla f(x) + \ell(y - x),\; z - y\rangle \ge 0, \quad \forall z \in X.$$
On closer inspection, we recognize the first-order optimality of the Euclidean projection of one gradient descent step, i.e.,
$$x^+ = \arg\min_{y \in X}\left\{\langle\nabla f(x),\, y - x\rangle + \frac{\ell}{2}\|y - x\|^2\right\}.$$
Plugging in $x^+$:
$$D_X(x, \ell) = -2\ell\langle\nabla f(x),\, x^+ - x\rangle - \ell^2\|x^+ - x\|^2 = 2\ell\langle\nabla f(x),\, x - x^+\rangle - \ell^2\|x^+ - x\|^2 \ge \ell^2\|x^+ - x\|^2,$$
where the inequality follows from Lemma B.2 (with $v = \nabla f(x)$ and $\eta = 1/\ell$).

Lemma B.8 ((J Reddi et al., 2016, Lemma 6)). Let $f: X \to \mathbb{R}$ be an $\ell$-smooth function, $x \in X \subseteq \mathbb{R}^d$ a point, $v \in \mathbb{R}^d$ a vector, and $y := \Pi_X(x - \eta v)$. Then, for any $z \in X$, the following inequality is true:
$$f(y) \le f(z) + \langle\nabla f(x) - v,\; y - z\rangle + \left(\frac{\ell}{2} - \frac{1}{2\eta}\right)\|y - x\|^2 + \left(\frac{\ell}{2} + \frac{1}{2\eta}\right)\|z - x\|^2 - \frac{1}{2\eta}\|y - z\|^2.$$

B.1.3. A DESCENT LEMMA INVOLVING $D_X$

Lemma B.9. Let $X \subseteq \mathbb{R}^d$ be a closed convex set, and let $f: X \to \mathbb{R}$ be an $\ell$-smooth function for some $\ell > 0$. Suppose $\eta > 0$ with $\eta \le \frac{1}{5\ell}$. For any $x \in X$ and any vector $v \in \mathbb{R}^d$, define $x^+ := \Pi_X(x - \eta v)$. Then the following inequality holds:
$$f(x^+) \le f(x) - \frac{\eta}{6}\,D_X(x, 1/\eta) + \frac{\eta}{2}\,\|\nabla f(x) - v\|^2.$$
Proof. First, we define $\bar x^+ := \Pi_X\!\left(x - \eta\,\nabla f(x)\right)$. Invoking $\ell$-smoothness of $f$ for $x, \bar x^+$, with $\alpha := 1/\eta \ge 5\ell \ge \ell$,
$$f(\bar x^+) \le f(x) + \langle\nabla f(x),\, \bar x^+ - x\rangle + \frac{\ell}{2}\|\bar x^+ - x\|^2 \le f(x) + \langle\nabla f(x),\, \bar x^+ - x\rangle + \frac{\alpha}{2}\|\bar x^+ - x\|^2 = f(x) - \frac{1}{2\alpha}D_X(x, \alpha). \quad (1)$$
Invoking Lemma B.8 with $y = \bar x^+$, $z = x$, and $v = \nabla f(x)$,
$$f(\bar x^+) \le f(x) + \left(\frac{\ell}{2} - \frac{1}{\eta}\right)\|\bar x^+ - x\|^2. \quad (2)$$
Again, invoking Lemma B.8, but with $y = x^+$, $z = \bar x^+$, and the given $v$,
$$f(x^+) \le f(\bar x^+) + \langle\nabla f(x) - v,\; x^+ - \bar x^+\rangle + \left(\frac{\ell}{2} - \frac{1}{2\eta}\right)\|x^+ - x\|^2 + \left(\frac{\ell}{2} + \frac{1}{2\eta}\right)\|\bar x^+ - x\|^2 - \frac{1}{2\eta}\|x^+ - \bar x^+\|^2. \quad (3)$$
Adding $\frac{1}{3}\cdot(1)$ and $\frac{2}{3}\cdot(2)$, and letting $1/\alpha = \eta$,
$$f(\bar x^+) \le f(x) - \frac{\eta}{6}\,D_X(x, 1/\eta) + \left(\frac{\ell}{3} - \frac{2}{3\eta}\right)\|\bar x^+ - x\|^2.$$
Adding (3),
$$f(x^+) \le f(x) - \frac{\eta}{6}\,D_X(x, 1/\eta) + \left(\frac{5\ell}{6} - \frac{1}{6\eta}\right)\|\bar x^+ - x\|^2 + \left(\frac{\ell}{2} - \frac{1}{2\eta}\right)\|x^+ - x\|^2 + \langle\nabla f(x) - v,\; x^+ - \bar x^+\rangle - \frac{1}{2\eta}\|x^+ - \bar x^+\|^2$$
$$\le f(x) - \frac{\eta}{6}\,D_X(x, 1/\eta) + \left(\frac{5\ell}{6} - \frac{1}{6\eta}\right)\|\bar x^+ - x\|^2 + \left(\frac{\ell}{2} - \frac{1}{2\eta}\right)\|x^+ - x\|^2 + \frac{\eta}{2}\|\nabla f(x) - v\|^2 \quad (4)$$
$$\le f(x) - \frac{\eta}{6}\,D_X(x, 1/\eta) + \frac{\eta}{2}\,\|\nabla f(x) - v\|^2. \quad (5)$$
(4) follows from the application of Young's inequality; (5) follows by dropping the non-positive terms, where non-positivity follows from the choice of the step size, $\eta \le \frac{1}{5\ell}$.

B.2. Min-Max Optimization Lemmata

The following claims are novel to our knowledge. They justify the intuition that the constrained min-max pPL landscape should resemble its unconstrained PL counterpart. For example, the trivial bound $\|\nabla_x f(x, y_1) - \nabla_x f(x, y_2)\|^2 \le \ell^2\|y_1 - y_2\|^2$ of the unconstrained setting takes the form of Lemma B.12.

Claim B.10. Let $X$ and $Y$ be convex and compact sets. Suppose that the function $f: X \times Y \to \mathbb{R}$ is $\ell$-smooth. For given points $y_1, y_2 \in Y$, define $x_1$ and $x_1^+$ as follows: $x_1 := \arg\min_{x \in X} f(x, y_1)$, and $x_1^+ := \Pi_X\!\left(x_1 - \eta\,\nabla_x f(x_1, y_2)\right)$. Then, it holds true that:
$$\|x_1 - x_1^+\| \le \eta\,\ell\,\|y_1 - y_2\|.$$
Symmetrically, for given points $x_1, x_2 \in X$, if $y_1 := \arg\max_{y \in Y} f(x_1, y)$ and $y_1^+ := \Pi_Y\!\left(y_1 + \eta\,\nabla_y f(x_2, y_1)\right)$, then
$$\|y_1 - y_1^+\| \le \eta\,\ell\,\|x_1 - x_2\|.$$

Proof. By optimality of $x_1, x_1^+$ for their corresponding optimization problems:
$$\langle\nabla_x f(x_1, y_1),\; x_1^+ - x_1\rangle \ge 0, \quad (6)$$
$$\langle\nabla_x f(x_1, y_2),\; x_1 - x_1^+\rangle \ge \frac{1}{\eta}\|x_1 - x_1^+\|^2. \quad (7)$$
Adding and subtracting $\nabla_x f(x_1, y_2)$ in Equation (6),
$$0 \le \langle\nabla_x f(x_1, y_1),\; x_1^+ - x_1\rangle = \langle\nabla_x f(x_1, y_1) - \nabla_x f(x_1, y_2),\; x_1^+ - x_1\rangle + \langle\nabla_x f(x_1, y_2),\; x_1^+ - x_1\rangle. \quad (8)$$
Hence, combining Equation (8) with Equation (7),
$$\langle\nabla_x f(x_1, y_1) - \nabla_x f(x_1, y_2),\; x_1^+ - x_1\rangle \ge \langle\nabla_x f(x_1, y_2),\; x_1 - x_1^+\rangle \ge \frac{1}{\eta}\|x_1 - x_1^+\|^2.$$
By Cauchy-Schwarz,
$$\|\nabla_x f(x_1, y_1) - \nabla_x f(x_1, y_2)\|\,\|x_1^+ - x_1\| \ge \frac{1}{\eta}\|x_1^+ - x_1\|^2.$$
Finally, by the $\ell$-Lipschitz continuity of $\nabla f$,
$$\|x_1 - x_1^+\| \le \eta\,\ell\,\|y_1 - y_2\|.$$
This completes the proof for a step of projected gradient descent. The arguments for the projected gradient ascent claim are identical.

Lemma B.11. Assume $f: X \times Y \to \mathbb{R}$ is an $\ell$-smooth function, and let $\alpha > 0$ be a scalar, $y, y' \in Y$ two points, and $x := \arg\min_{z \in X} f(z, y)$. It is true that:
$$D_X(x, \alpha; y') \le 2\ell^2\,\|y - y'\|^2.$$

Proof. Let
$$\bar x^+ := \arg\min_{z \in X}\left\{\langle\nabla_x f(x, y'),\; z - x\rangle + \frac{\alpha}{2}\|z - x\|^2\right\} = \Pi_X\!\left(x - \tfrac{1}{\alpha}\nabla_x f(x, y')\right).$$
By Claim B.5,
$$\frac{1}{2\alpha}D_X(x, \alpha; y') = \langle\nabla_x f(x, y'),\; x - \bar x^+\rangle - \frac{\alpha}{2}\|x - \bar x^+\|^2 = \langle\nabla_x f(x, y') - \nabla_x f(x, y),\; x - \bar x^+\rangle - \frac{\alpha}{2}\|x - \bar x^+\|^2 + \langle\nabla_x f(x, y),\; x - \bar x^+\rangle. \quad (9)$$
We have assumed that $x = \arg\min_{x' \in X} f(x', y)$; therefore, $\langle\nabla_x f(x, y),\; z - x\rangle \ge 0$ for all $z \in X$. As such, when $z = \bar x^+$, $\langle\nabla_x f(x, y),\; x - \bar x^+\rangle \le 0$. Dropping this term and the non-positive quadratic term in (9),
$$\frac{1}{2\alpha}D_X(x, \alpha; y') \le \langle\nabla_x f(x, y') - \nabla_x f(x, y),\; x - \bar x^+\rangle.$$
Using Cauchy-Schwarz and the $\ell$-Lipschitz continuity of the gradient,
$$\frac{1}{2\alpha}D_X(x, \alpha; y') \le \ell\,\|y - y'\|\,\|x - \bar x^+\|.$$
Finally, using Claim B.10 with $\eta = 1/\alpha$, we have $\|x - \bar x^+\| \le \frac{\ell}{\alpha}\|y - y'\|$, and therefore $D_X(x, \alpha; y') \le 2\ell^2\|y - y'\|^2$.

Lemma B.12. Let $f: X \times Y \to \mathbb{R}$ be an $\ell$-smooth function, $\alpha > 0$ a scalar, $y, y' \in Y$ two points, and $x \in X$ a point. Then, the following inequality holds:
$$\left|D_X(x, \alpha; y) - D_X(x, \alpha; y')\right| \le 3\,\ell^2\,\|y - y'\|^2.$$

Proof. We define $\bar x, \bar x' \in X$ to be:
$$\bar x := \Pi_X\!\left(x - \tfrac{1}{\alpha}\nabla_x f(x, y)\right); \qquad \bar x' := \Pi_X\!\left(x - \tfrac{1}{\alpha}\nabla_x f(x, y')\right).$$
By the definition of $D_X$ (in its closed form, Claim B.5) we write:
$$\frac{1}{2\alpha}D_X(x, \alpha; y) = \langle\nabla_x f(x, y),\; x - \bar x\rangle - \frac{\alpha}{2}\|x - \bar x\|^2; \qquad \frac{1}{2\alpha}D_X(x, \alpha; y') = \langle\nabla_x f(x, y'),\; x - \bar x'\rangle - \frac{\alpha}{2}\|x - \bar x'\|^2.$$
Considering the difference $D_X(x, \alpha; y) - D_X(x, \alpha; y')$, we see that:
$$\frac{1}{2\alpha}\left|D_X(x, \alpha; y) - D_X(x, \alpha; y')\right| \le \left\|\nabla_x f(x, y) - \nabla_x f(x, y')\right\|\,\|\bar x - \bar x'\| + \frac{\alpha}{2}\,\|\bar x - \bar x'\|^2 \le \frac{1}{\alpha}\left\|\nabla_x f(x, y) - \nabla_x f(x, y')\right\|^2 + \frac{1}{2\alpha}\left\|\nabla_x f(x, y) - \nabla_x f(x, y')\right\|^2.$$
We note that: the first inequality combines the triangle inequality, the reverse triangle inequality, and the Cauchy-Schwarz inequality; the second uses Lemma B.3, i.e., $\|\bar x - \bar x'\| \le \frac{1}{\alpha}\|\nabla_x f(x, y) - \nabla_x f(x, y')\|$. Multiplying by $2\alpha$ and invoking the $\ell$-Lipschitz continuity of the gradient yields the claim.
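To make the $D_X$ machinery concrete, here is a minimal numerical sketch (our own illustration; the quadratic objective and box domain are assumptions made purely for the example) that evaluates $D_X(x, \alpha)$ through the closed form of Claim B.5 and checks the monotonicity of Lemma B.6 and the gradient-mapping bound of Lemma B.7:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Illustrative smooth objective f(x) = 0.5 x^T A x + b^T x over the box [-1, 1]^d.
A = rng.normal(size=(d, d)); A = A @ A.T
b = rng.normal(size=d)
ell = np.linalg.eigvalsh(A).max()       # a smoothness constant for f

def grad(x):
    return A @ x + b

def proj(z):
    return np.clip(z, -1.0, 1.0)        # Euclidean projection onto X = [-1, 1]^d

def D_X(x, alpha):
    # Claim B.5: the inner minimizer is one projected gradient step of size 1/alpha.
    y = proj(x - grad(x) / alpha)
    return 2.0 * alpha * grad(x) @ (x - y) - alpha**2 * np.sum((x - y) ** 2)

x = proj(rng.normal(size=d))
# Lemma B.6: D_X(x, .) is non-decreasing in alpha (tested for alpha >= ell).
assert D_X(x, ell) <= D_X(x, 2 * ell) + 1e-9 <= D_X(x, 4 * ell) + 2e-9
# Lemma B.7: D_X(x, ell) dominates the squared norm of the gradient mapping.
x_plus = proj(x - grad(x) / ell)
assert D_X(x, ell) >= ell**2 * np.sum((x - x_plus) ** 2) - 1e-9
print("Claim B.5 and Lemmata B.6, B.7 check out on this instance.")
```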
Claim B.13. Assume a function $f: X \times Y \to \mathbb{R}$ satisfying the pPL condition with modulus $\mu$ over $f(x, \cdot)$ for any $x \in X$. Also, assume that $\Phi(x) := \max_{y \in Y} f(x, y)$. Additionally, define:
$$D^\Phi_X(x, \alpha) := -2\alpha\min_{z \in X}\left\{\langle\nabla\Phi(x),\; z - x\rangle + \frac{\alpha}{2}\|z - x\|^2\right\}.$$
Then, it holds true that
$$D^\Phi_X(x, \alpha_1) - D_X(x, \alpha_2; y) \le 3\,\kappa^2\,D_Y(y, \alpha_2; x), \qquad \kappa := \frac{\ell}{\sqrt{\mu_{QG}\,\mu}}.$$

Proof. We define $\bar x, \bar x'$ as
$$\bar x := \Pi_X\!\left(x - \tfrac{1}{\alpha_1}\nabla\Phi(x)\right); \qquad \bar x' := \Pi_X\!\left(x - \tfrac{1}{\alpha_1}\nabla_x f(x, y)\right).$$
Arguing as in the proof of Lemma B.12 (triangle and reverse triangle inequalities, Cauchy-Schwarz, and the contractiveness, i.e., $1$-Lipschitzness, of the projection operator),
$$D^\Phi_X(x, \alpha_1) - D_X(x, \alpha_2; y) \le 3\,\|\nabla\Phi(x) - \nabla_x f(x, y)\|^2 \le 3\ell^2\,\|y - y^*(x)\|^2 \le \frac{6\ell^2}{\mu_{QG}}\left(\Phi(x) - f(x, y)\right) \le 3\,\kappa^2\,D_Y(y, \alpha_2; x),$$
where the second inequality uses Danskin's lemma together with the $\ell$-Lipschitz continuity of the gradient, the third the quadratic growth condition, and the last the pPL inequality.

C. Hidden Convexity, PL, KL, and EB

In this section, we offer a brief exposition of the notions of hidden convexity and the other regularity conditions it is related to. We refer the interested reader to (Karimi et al., 2016; Li & Pong, 2018; Drusvyatskiy & Paquette, 2019; Drusvyatskiy & Lewis, 2018; Liao et al., 2024; Rebjock & Boumal, 2024) and references therein. We commence by defining hidden convexity in the manner of (Fatkhullin et al., 2023).

Definition 9 (Hidden Convex Problem). Consider the optimization problem
$$\min_{x \in X} f(x) := H(c(x)).$$
This problem is called hidden convex with moduli $L_{c^{-1}} > 0$ and $\mu_H \ge 0$ (or the function $f$ is hidden convex on $X$) if the following conditions are satisfied:
(HC.1) Convexity of $H$ and the domain: the domain $U = c(X)$ is convex, and the function $H: U \to \mathbb{R}$ satisfies, for all $u, v \in U$ and $0 \le \alpha \le 1$,
$$H((1 - \alpha)u + \alpha v) \le (1 - \alpha)H(u) + \alpha H(v) - \frac{(1 - \alpha)\,\alpha\,\mu_H}{2}\,\|u - v\|^2.$$
The problem admits a solution $u^* \in U$.
(HC.2) Invertibility and Lipschitz continuity of $c^{-1}$: the mapping $c: X \to U$ is invertible, and there exists $\mu_c > 0$ such that for all $x, y \in X$,
$$\|x - y\| \le \frac{1}{\mu_c}\,\|c(x) - c(y)\|.$$
Furthermore, if $\mu_H > 0$, the problem is referred to as $(\mu_c, \mu_H)$-hidden strongly convex.

Assumption 2 (Lipschitzness and smoothness of $c$). Let $c: X \times Y \to U$ be a mapping such that for all $(x, y), (x', y') \in X \times Y$ and all $u, u' \in U$, the following conditions hold:
$$\|c(x, y) - c(x', y')\| \le L_c\,\|(x, y) - (x', y')\|; \qquad \|c^{-1}(u) - c^{-1}(u')\| \le L_{c^{-1}}\,\|u - u'\|;$$
$$\|\nabla c(x, y) - \nabla c(x', y')\| \le \ell_c\,\|(x, y) - (x', y')\|; \qquad \|\nabla c^{-1}(u) - \nabla c^{-1}(u')\| \le \ell_{c^{-1}}\,\|u - u'\|.$$

Fact C.1. A hidden strongly convex function has a unique minimizer.

C.1. Equivalence between the three conditions

Throughout this subsection, we denote $X^* := \arg\min_{x \in X} f(x)$; the projection $x_p$ of a point $x \in X$ onto $X^*$ is defined as some element of $\arg\min_{x^* \in X^*}\|x - x^*\|$. Moreover, we define $F(x) := f(x) + I_X(x)$, with $I_X(x) = 0$ if $x \in X$, and $I_X(x) = +\infty$ otherwise.

Definition 10 (pPL, KL, pEB). Let $f: X \to \mathbb{R}$ be an $L$-Lipschitz continuous function with an $\ell$-Lipschitz continuous gradient. Then,
Proximal Error-Bound (pEB): $f$ is said to satisfy the proximal error-bound condition if there exists $c > 0$ such that
$$\|x - x_p\| \le c\,\left\|x - \Pi_X\!\left(x - \tfrac{1}{\ell}\nabla f(x)\right)\right\|, \quad \forall x \in X.$$
Proximal Polyak-Łojasiewicz (pPL): $f$ is said to satisfy the proximal Polyak-Łojasiewicz condition if there exists $\mu > 0$ such that
$$\frac{1}{2}\,D_X(x, \ell) \ge \mu\left(f(x) - f(x^*)\right), \quad \forall x \in X.$$
Kurdyka-Łojasiewicz (KL): $f$ is said to satisfy the Kurdyka-Łojasiewicz condition if there exists $\mu > 0$ such that
$$\min_{s_x \in \partial(f + I_X)(x)}\|s_x\|^2 \ge 2\mu\left(f(x) - f(x^*)\right), \quad \forall x \in X.$$

Lemma C.1 ((Karimi et al., 2016, Appendix G)). (pPL) $\Leftrightarrow$ (pEB) $\Leftrightarrow$ (KL).

Remark 1. As we will see, (KL) with a modulus $\mu$ implies (pEB) with a constant $c = 1 + \frac{2\ell}{\mu}$. In turn, (pEB) with a constant $c$ implies (pPL) with a modulus $\frac{\ell}{1 + 4c^2}$.
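Chaining the two implications of Remark 1 gives an explicit end-to-end modulus; the following short computation is our own bookkeeping of the constants stated above (no assumptions beyond Remark 1 itself):
$$\text{(KL, } \mu\text{)} \;\Longrightarrow\; \left(\text{pEB, } c = 1 + \tfrac{2\ell}{\mu}\right) \;\Longrightarrow\; \left(\text{pPL, } \tilde\mu\right), \qquad \tilde\mu = \frac{\ell}{1 + 4\left(1 + \frac{2\ell}{\mu}\right)^2} = \frac{\ell\,\mu^2}{5\mu^2 + 16\ell\mu + 16\ell^2} \;\ge\; \frac{\ell\,\mu^2}{9\mu^2 + 32\ell^2},$$
where the final inequality uses $16\ell\mu \le 4\mu^2 + 16\ell^2$ (AM-GM); the last expression matches the modulus recovered at the end of the equivalence calculations that follow.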
Following, we repeat the calculations of (Karimi et al., 2016) by explicitly tracking the dependence between the constants. We will denote X := arg minx X f(x) and NX (x) to be the normal cone of X at x. KL p EB First, let F be F(x) := f(x) + IX (x). By the ℓ-smoothness of f and the convexity of IX , we observe that F is weakly-convex. By (Bolte et al., 2010, Theorem 13), for any x dom F there exists a subgradient curve χx : [0, ) dom F that satisfies the following: χx(t) F(χx(t)), χx(0) = x, d dt F(χx(t)) = χx(t) 2, where t 7 F(χx(t)) is a non-increasing and Lipschitz continuous function for t [η, ] for any η > 0. Further, we define the function r(t) = p F(χx(t)) F . By differentiating r(t), we observe that: dt = F(χx(t)) µ/2 χx(t) , In the second line, we plug in the definition of the subgradient curve. In the third line, we use the KL inequality and the fact that χx(t) F(χx(t)). Solving Zero-Sum Convex Markov Games Taking the integral, we write: r(T) r(0) = Z T 0 χx(t) dt (10) µ/2 dist(χx(T), χx(0)), (11) (10) uses the bound on r(t) (11) uses the fact that any curve connecting two points has a length greater than their Euclidean distance. Now, we will show that lim T r(T) = 0. By (11), we get r(T) r(0) = Z T F(χx(t)) F dt F(χx(t)) F dt where the first inequality follows from the KL condition, and second inequality uses the fact that F(χx(t)) is non-increasing in t. Hence, we get a bound on r(T), 0 r(T) 2r(0) By taking the limit of T , we get lim T r(T) 0. Now we can focus on the term r(0) = p F(x) F . Proceeding, by (11) we write, p µ/2 dist(x, X ), (12) From (12) and the KL condition, we see that: dist(0, F(x)) µ dist(x, X ). (13) Now, let define x+ = prox 1 ℓ f(x) . By the optimality of x+, we have f(x+) ℓ(x+ x) NX (x+). As such, f(x+) f(x) ℓ(x+ x) IX (x+). Letting the particular subgradient of IX (x+) that achieves this be ξ, we write, dist(0, F(x+)) = 0 ξ = f(x+) f(x) ℓ(x+ x) ℓ x+ x + f(x+) f(x) 2ℓ x+ x , (14) Solving Zero-Sum Convex Markov Games where we used the triangle inequality and the ℓ-Lispchitz continuity of f. Finally, we can derive the proximal-EB condition by dist(x, X ) x x+ + dist(x+, X ) µ dist(0, F(x+)) We note that: the first inequality follows by the triangle inequality; the second inequality uses the (13); the third inequality uses (14). Re-iterating: min sx (f+IX )(x) sx 2 2µ (f(x) f(x )) , x X = dist(x, X ) 1 + 2ℓ p EB p PL Before proceeding, we define the forward-backward envelope, F 1 ℓ, of F (Stella et al., 2017, Definition 2.1), as: ℓ:= min y Rd f(x) + f(x), y x + ℓ 2 y x 2 + IX (y) IX (x) . Additionally, we observe that: DX (x, ℓ) = 2ℓ F(x) F 1 ℓ(x) . Moving forward we note that by assumption, it holds that: dist(x, X ) c x ΠX From the latter we write, F(x) F = F(x) F 1 ℓ(x) + 2ℓ x x 2 ℓ(x) + 2ℓc2 x ΠX 1 + 4c2 F(x) F 1 = 1 + 4c2 1 2ℓDX (x, ℓ). Where the second inequality follows from the p EB condition and the two last steps follow from (Stella et al., 2017, Proposition 2.2(i)): ℓ ℓ f(x) 2 F(x) F 1 ℓ(x) and F(x) F 1 ℓ(x) = 1 2ℓDX (x, ℓ). In short, ℓ 1 + 4c2 (F(x) F ) 1 2DX (x, ℓ). Concluding, we see that KL with modulus µ translates to p EB with constant c = 1 + 2ℓ µ . In turn, p EB with constant c translates to p PL with modulus ℓ 1+4c2 . This means that KL with µ is equivalent to p PL with modulus µ, 1 + 4 1 + 2ℓ 9µ2 + 32ℓ2 . Solving Zero-Sum Convex Markov Games C.2. p PL from HSC Proposition C.2. Let f : X R be an ℓ-smooth function that is (µc, µH)-hidden strongly convex with µH > 0. 
Then, $f$ satisfies the proximal Polyak-Łojasiewicz condition with a modulus
$$\frac{\ell}{1 + 4\left(1 + \frac{2\ell}{\mu_c^2\mu_H}\right)^2}.$$

Proof. By (Fatkhullin et al., 2023, Proposition 2(ii)), $(\mu_c, \mu_H)$-hidden strong convexity implies
$$\min_{s_x \in \partial(f + I_X)(x)}\|s_x\|^2 \ge 2\,\mu_c^2\mu_H\left(f(x) - f(x^*)\right),$$
i.e., the KL condition with modulus $\mu_c^2\mu_H$. Then, (Karimi et al., 2016, Appendix G), together with our precise quantification of the constants (Remark 1), yields the pPL condition with the claimed modulus.

C.3. QG directly from HSC, KL, pPL

Proposition C.3 (QG from HSC). Let $f: X \to \mathbb{R}$ be an $\ell$-smooth function that is $(\mu_c, \mu_H)$-hidden strongly convex. Then, for all $x \in X$,
$$f(x) - f(x^*) \ge \frac{\mu_{QG}}{2}\,\|x - x^*\|^2, \qquad \mu_{QG} := \mu_c^2\mu_H,$$
where $x^*$ is the unique minimizer of $f$ over $X$.

Proof. We start by leveraging the strong convexity property of the function $H$. By the definition of strong convexity with modulus $\mu_H$, for any $u, u^* \in U$:
$$H(u) - H(u^*) \ge \langle\nabla_u H(u^*),\; u - u^*\rangle + \frac{\mu_H}{2}\,\|u - u^*\|^2. \quad (15)$$
When $u^*$ is a minimizer of $H$, the gradient at $u^*$ satisfies $\langle\nabla_u H(u^*),\; u - u^*\rangle \ge 0$. Leveraging the latter, (15) simplifies to:
$$H(u) - H(u^*) \ge \frac{\mu_H}{2}\,\|u - u^*\|^2. \quad (16)$$
Because $c^{-1}$ is Lipschitz continuous with modulus $1/\mu_c$, we obtain:
$$\|x - x^*\| = \|c^{-1}(u) - c^{-1}(u^*)\| \le \frac{1}{\mu_c}\,\|u - u^*\|.$$
Substituting this bound into equation (16), we obtain $H(u) - H(u^*) \ge \frac{\mu_c^2\mu_H}{2}\|x - x^*\|^2$. Recognizing that $H(u) = f(x)$ and $H(u^*) = f(x^*)$, we can rewrite the above inequality as:
$$f(x) - f(x^*) \ge \frac{\mu_c^2\mu_H}{2}\,\|x - x^*\|^2.$$
This concludes the proof.

Proposition C.4 (QG from KL). Let an $\ell$-smooth function $f: X \to \mathbb{R}$ satisfy the Kurdyka-Łojasiewicz (KL) condition with modulus $\mu > 0$, i.e.,
$$\mu\left(f(x) - f^*\right) \le \frac{1}{2}\inf_{s_x \in \partial F(x)}\|s_x\|^2, \quad \forall x \in X.$$
Then, it satisfies the quadratic growth (QG) condition with modulus $\mu$:
$$\frac{\mu}{2}\,\|x - x_p\|^2 \le f(x) - f^*, \quad \forall x \in X,$$
where $x_p \in \arg\min_{x^* \in X^*}\|x - x^*\|^2$ and, in turn, $X^* := \arg\min_{x \in X} f(x)$.

Proof. Before proceeding, we note that the PL condition in (Liao et al., 2024) is what we call the KL condition in this manuscript. Let us define $F(\cdot) := f(\cdot) + I_X(\cdot)$, where $I_X$ is the indicator function of the compact convex set $X$. We note that $I_X$ is convex and $f$ is $\ell$-smooth and, as such, $\ell$-weakly convex; hence $F(\cdot) = f(\cdot) + I_X(\cdot)$ is also $\ell$-weakly convex. By (Liao et al., 2024, Proof of Theorem 3.1 (PL $\Rightarrow$ EB)),
$$\mu\,\|x - x_p\| \le \inf_{s_x \in \partial F(x)}\|s_x\|.$$
Then, by (Liao et al., 2024, Proof of Theorem 3.1 (EB $\Rightarrow$ QG)), we know that
$$\frac{\mu}{2}\,\|x - x_p\|^2 \le f(x) - f^*.$$

Proposition C.5 (QG from pPL; (Mulvaney-Kemp et al., 2022, Theorem 2)). Let an $\ell$-smooth function $f: X \to \mathbb{R}$ satisfy the proximal Polyak-Łojasiewicz condition with modulus $\mu > 0$. Then, it satisfies the quadratic growth (QG) condition with the same modulus $\mu$:
$$\frac{\mu}{2}\,\|x - x_p\|^2 \le f(x) - f^*, \quad \forall x \in X,$$
where $x_p \in \arg\min_{x^* \in X^*}\|x - x^*\|^2$ and, in turn, $X^* := \arg\min_{x \in X} f(x)$.

C.4. Warm-up: Stochastic Projected Gradient Descent on a proximal-PL Function

Now consider stochastic projected gradient descent on a nonconvex pPL $\ell$-smooth function $f$. The analysis closely follows (J Reddi et al., 2016). Assume $\mathbb{E}\,\hat g_x(x) = g_x(x)$, $\|g_x(x) - \nabla f(x)\| \le \delta$, and $\mathbb{E}\|\hat g_x(x)\|^2 \le \sigma^2$. Further, define the points $x_{t+1}, \bar x_{t+1}$ to be:
$$x_{t+1} = \Pi_X\!\left(x_t - \eta\,\hat g_x(x_t)\right); \qquad \bar x_{t+1} = \Pi_X\!\left(x_t - \eta\,\nabla_x f(x_t)\right).$$

Theorem C.6. Assume an $\ell$-smooth function $f: X \to \mathbb{R}$ with an inexact stochastic gradient oracle $\hat g_x$ such that $\mathbb{E}\,\hat g_x(x) = g_x(x)$, $\|g_x(x) - \nabla f(x)\| \le \delta_x$, and $\mathbb{E}\|\hat g_x(x)\|^2 \le \sigma_x^2$. Then, running stochastic projected gradient descent with a $\delta_x$-inexact gradient and step size $\eta \le \frac{1}{5\ell}$ for $T > 0$ iterations yields the following inequality:
$$\frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{2}\,\mathbb{E}\,D_X(x_t, 1/\eta) \le \frac{3\left(\mathbb{E}f(x_0) - f(x^*)\right)}{\eta\,T} + 3\delta_x^2 + 3\sigma_x^2.$$
Further, if $f$ is pPL with modulus $\mu$, the following inequality holds true:
$$\mathbb{E}f(x_{T+1}) - f(x^*) \le \exp\left(-\frac{\mu}{3\ell}\,T\right)\left(f(x_0) - f^*\right) + \frac{3\ell\,\delta_x^2}{\mu} + \frac{3\ell\,\sigma_x^2}{\mu}.$$

Proof. We let $\bar x_{t+1} := \Pi_X\!\left(x_t - \eta\,\nabla f(x_t)\right)$.
Invoking ℓ-smoothness of f for xt, xt+1 and requiring that η 1 f(xt+1) f(xt) + f(xt), xt+1 xt + 1 2η xt+1 xt 2 = f(xt) f(xt), xt xt+1 1 2η xt+1 xt 2 2DX (xt, 1/η) (17) Invoking Lemma B.8 with x = xt, y = xt+1, z = xt, v = f(xt) f(xt+1) f(xt) + ℓ xt+1 xt 2 . (18) Again, invoking Lemma B.8 but with x = xt, y = xt+1, z = xt+1, v = ˆg(xt), f(xt+1) f(xt+1) + f(xt) ˆgx(xt), xt+1 xt+1 xt+1 xt 2 + ℓ xt+1 xt 2 1 2η xt+1 xt+1 2 . (19) Adding 1/3 (17) and 2/3 (18) f(xt+1) f(xt) η 6DX (xt, 1/η) + ℓ Adding (19), f(xt+1) f(xt) η 6DX (xt, 1/η) + ℓ + f(xt) ˆgx(xt), xt+1 xt+1 xt+1 xt 2 + ℓ xt+1 xt 2 1 2η xt+1 xt+1 2 6DX (xt, 1/η) + 5ℓ 2 f(xt) ˆgx(xt) 2 (20) 6DX (xt, 1/η) + η 2 f(xt) ˆgx(xt) 2 (21) 6DX (xt, 1/η) + η f(xt) gx(xt) 2 + η gx(xt) ˆgx(xt) 2 (22) We note that we have picked η 1 5ℓ. (20) follows from the application of Young s inequality a, b ρ 2ρ b 2 with ρ = η; (21) follows by dropping the non-positive terms; non-positivity follows from the inequalities: η 2. ℓ 5ℓ 1 Solving Zero-Sum Convex Markov Games (22) follows from a + b 2 2 a 2 + 2 b 2. Taking expectations: Ef(xt+1) Ef(xt) η 6DX (xt, 1/η) + ηδ2 x + ησ2 x (23) where the last inequality follows from the choice of the step-size. Rearranging (23) and summing over T, we get: 1 2DX (xt, 1/η) 3 Ef(xt) Ef(xt+1) + 3δ2 x + 3σ2 x 3 (Ef(x0) f(x )) ηT + 3δ2 x + 3σ2 x. When the p PL condition holds and η = 1 5ℓ, we get from (23), Ef(xt+1) f 1 µ (Ef(xt) f ) + δ2 x + σ2 x, and as such, Ef(x T ) f 1 µ T (f(x0) f ) + δ2 x + σ2 x T X T (f(x0) f ) + δ2 x + σ2 x 1 1 µ T (f(x0) f ) + 3ℓδ2 x µ + 3ℓσ2 x µ 3ℓT (f(x0) f ) + 3ℓδ2 x µ + 3ℓσ2 x µ . D. Constrained Min-Max Optimization Under the p PL Condition D.1. Key Lemmata Theorem D.1 (NC-p PL and cont. of maximizers). Let function f : X Y R with f(x, ) satisfying the proximal-PL condition with parameter µ for all x X. Then, consider points x1, x2 X and y (x1), y (x2) Y with y (x1) := arg maxy Y f(x1, y) and y (x2) := arg maxy Y f(x2, y), it holds true that: y (x1) y (x2) L x1 x2 , where L := ℓ Remark 2. One might compare the last statement to the Robust Berge Maximum Theorem (Papadimitriou et al., 2023, Th. 3.20) which concerns (non)convex strongly-concave functions with coupled feasibility sets. Essentially, the former statement illustrates that hidden-strong-concavity is in some aspect a stronger assumption than strong-concavity; in hiddenstrong-concavity the feasibility sets are only hiddenly coupled. This allows us to decouple the constraint sets and view the problem as a constrained nonconvex-p PL problem. Then, it is quite intuitive that Lispchitz continuity of the maximizers holds in light of the analogous result (Nouiehed et al., 2019, Lemma A.3) which concerns unconstrained min-max optimization over nonconvex-PL functions. Ultimately, this decoupling principle is also the reason why Kalogiannis et al. (2024) can compute the gradient of the maximum function by only invoking Danskin s Theorem and not the more elaborate Envelope Theorem (Afriat, 1971). Proof. Since we have defined the p PL and QG growth conditions for a minimization problem, let us assume g(x, y) = f(x, y). Consequentially, arg miny Y g(x, y) = arg maxy Y f(x, y) y (x). Finally, we define: DY(y, α; x) := 2α min z Y n yg(x, y), z y + α Solving Zero-Sum Convex Markov Games By the proximal-PL condition, it holds that, 1 2DY (y (x1), 1/η; x2) µ (g (x2, y (x1)) g (x2, y (x2))) By the QG condition: g (x2, y (x1)) g (x2, y (x2)) µQG y (x1) y p(x2) 2 Where, y p(x2) := arg miny Y x y (x1) y and Y (x) := arg miny Y g(x, y). 
Finally, by Lemma B.11,
$$D_Y\!\left(y^*(x_1), 1/\eta;\, x_2\right) \le 2\ell^2\,\|x_1 - x_2\|^2.$$
Putting these pieces together,
$$\|y^*(x_1) - y^*_p(x_2)\|^2 \le \frac{2\ell^2}{\mu\,\mu_{QG}}\,\|x_1 - x_2\|^2.$$
Further, we know that pPL with modulus $\mu$ implies QG with the same modulus (Proposition C.5), i.e., $\mu_{QG} = \mu$. This concludes the proof.

Corollary D.2. Let the function $f: X \times Y \to \mathbb{R}$, with $f(x, \cdot)$ satisfying the proximal-PL condition with modulus $\mu$ for all $x \in X$, and let $\Phi(x) := \max_{y \in Y} f(x, y)$. For any two points $x_1, x_2 \in X$ it holds true that:
$$\|\nabla_x\Phi(x_1) - \nabla_x\Phi(x_2)\| \le \ell_\Phi\,\|x_1 - x_2\|, \qquad \ell_\Phi := \ell\,(1 + L^*) = \ell + \frac{\sqrt{2}\,\ell^2}{\mu}.$$

Proof. We write,
$$\|\nabla_x\Phi(x_1) - \nabla_x\Phi(x_2)\| = \|\nabla_x f(x_1, y^*(x_1)) - \nabla_x f(x_2, y^*(x_2))\| \le \ell\,\|(x_1, y^*(x_1)) - (x_2, y^*(x_2))\| \le \ell\,\|x_1 - x_2\| + \ell L^*\,\|x_1 - x_2\|.$$
The first equation holds due to Danskin's lemma, and the first inequality is due to the $\ell$-Lipschitz continuity of the gradient. The second inequality is due to the triangle inequality and the $L^*$-Lipschitz continuity of the maximizers.

Lemma D.3. Let $f: X \times Y \to \mathbb{R}$ be an $L$-Lipschitz continuous and $\ell$-smooth function that satisfies the two-sided pPL condition, for both $f(\cdot, y)$ and $f(x, \cdot)$. Then:
$$\min_{x \in X}\max_{y \in Y} f(x, y) = \max_{y \in Y}\min_{x \in X} f(x, y) =: \Phi^*.$$

Proof. We can invoke (Yang et al., 2020, Lemma 2.1), which holds for two-sided pPL functions with minor modifications.

Lemma D.4. Let the function $f: X \times Y \to \mathbb{R}$ satisfy the pPL condition with moduli $\mu_1, \mu_2 > 0$ respectively for $f(\cdot, y)$, $\forall y \in Y$, and $f(x, \cdot)$, $\forall x \in X$. Then, the function $\Phi: X \to \mathbb{R}$ with $\Phi(x) = \max_{y \in Y} f(x, y)$ satisfies the pPL condition with modulus $\mu_1$.

Proof. For the purposes of this proof, we will enhance the notation of $D_X$ as follows:
$$D^f_X(x, \alpha; y) := -2\alpha\min_{z \in X}\left\{\langle\nabla_x f(x, y),\; z - x\rangle + \frac{\alpha}{2}\|z - x\|^2\right\}; \qquad D^\Phi_X(x, \alpha) := -2\alpha\min_{z \in X}\left\{\langle\nabla\Phi(x),\; z - x\rangle + \frac{\alpha}{2}\|z - x\|^2\right\}.$$
By Danskin's lemma, $\nabla\Phi(x) = \nabla_x f(x, y^*(x))$, and hence $D^\Phi_X(x, \ell_\Phi) = D^f_X(x, \ell_\Phi; y^*(x))$. Since $\ell_\Phi \ge \ell$, Lemma B.6 and the pPL condition of $f(\cdot, y^*(x))$ yield
$$\frac{1}{2}\,D^f_X(x, \ell_\Phi; y^*(x)) \ge \mu_1\left(f(x, y^*(x)) - \min_{x' \in X} f(x', y^*(x))\right).$$
Moreover, $f(x', y^*(x)) \le \max_{y \in Y} f(x', y)$, and minimizing on both sides yields
$$\min_{x' \in X} f(x', y^*(x)) \le \min_{x' \in X}\max_{y \in Y} f(x', y) = \Phi^*.$$
Combining the above with $f(x, y^*(x)) = \Phi(x)$,
$$D^\Phi_X(x, \ell_\Phi) \ge 2\mu_1\left(\Phi(x) - \Phi^*\right).$$

D.2. Stationarity Proxies and the Gradient Oracle

Definition 11. We define $D_Y$ and $D^\Phi_X$ to be the following quantities:
$$D_Y(y, \alpha; x) := -2\alpha\min_{z \in Y}\left\{\langle-\nabla_y f(x, y),\; z - y\rangle + \frac{\alpha}{2}\|z - y\|^2\right\}$$
(the proxy of $g := -f$, in line with the proof of Theorem D.1), and, correspondingly,
$$D^\Phi_X(x, \alpha) := -2\alpha\min_{z \in X}\left\{\langle\nabla\Phi(x),\; z - x\rangle + \frac{\alpha}{2}\|z - x\|^2\right\}.$$

Definition 12. We define the deterministic and stochastic gradient mappings at the points $x_t$ and $y_t$ to be:
$$G^t_{x,\tau_x} := \frac{1}{\tau_x}\left(x_t - \Pi_X\!\left(x_t - \tau_x\,\nabla_x f(x_t, y_t)\right)\right); \qquad \hat G^t_{x,\tau_x} := \frac{1}{\tau_x}\left(x_t - \Pi_X\!\left(x_t - \tau_x\,\hat g_x(x_t, y_t)\right)\right),$$
and, respectively:
$$G^t_{y,\tau_y} := \frac{1}{\tau_y}\left(\Pi_Y\!\left(y_t + \tau_y\,\nabla_y f(x_t, y_t)\right) - y_t\right); \qquad \hat G^t_{y,\tau_y} := \frac{1}{\tau_y}\left(\Pi_Y\!\left(y_t + \tau_y\,\hat g_y(x_t, y_t)\right) - y_t\right).$$

Assumption 3 (Unbiased Inexact Gradient Estimators and Bounded Second Moments). For all iterations $t$, the gradient estimators $\hat g_x(x_t, y_t)$ and $\hat g_y(x_t, y_t)$ satisfy $\mathbb{E}[\hat g_x(x_t, y_t)] = g_x(x_t, y_t)$, $\mathbb{E}[\hat g_y(x_t, y_t)] = g_y(x_t, y_t)$, and
$$\mathbb{E}\left\|\hat g_x(x_t, y_t)\right\|^2 \le \sigma_x^2, \qquad \mathbb{E}\left\|\hat g_y(x_t, y_t)\right\|^2 \le \sigma_y^2.$$
In turn, $\|g_x(x_t, y_t) - \nabla_x f(x_t, y_t)\| \le \delta_x$ and $\|g_y(x_t, y_t) - \nabla_y f(x_t, y_t)\| \le \delta_y$.

Remark 3 (Bound on Second Moment instead of Variance). At first, it might appear a slightly stronger assumption to place a bound on the second moment of the gradient estimator rather than on its variance. Yet, in most relevant works the variance of the relevant estimators is bounded only after bounding the second moment (Daskalakis et al., 2020; Zhang et al., 2020; 2021). As such, this assumption is reasonable and rather standard to satisfy for the aforementioned applications.

D.3. Convergence of Nested Gradient Iterations

We can formulate the nested gradient iterations algorithm using the following template:
$$\begin{cases} y_{t+1} \leftarrow \text{ARGMAX}\left(f(x_t, \cdot),\ \epsilon_y\right); \\ x_{t+1} \leftarrow \Pi_X\!\left(x_t - \eta\,\nabla f(x_t, y_{t+1})\right), \end{cases} \qquad (24)$$
where ARGMAX$(h, \epsilon)$ returns an $\epsilon$-approximate maximizer of the function $h$; a minimal illustrative sketch follows.
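For concreteness, the following minimal Python sketch instantiates template (24) on a toy problem. Everything in it (the strongly convex-concave quadratic objective, the box constraints, the step sizes, and the iteration counts) is our own illustrative assumption rather than the paper's implementation; the inner ARGMAX is realized by projected gradient ascent, as noted next.

```python
import numpy as np

# Illustrative objective f(x, y) = 0.5*mu_x*||x||^2 + x^T B y - 0.5*mu_y*||y||^2
# over the boxes X = Y = [-1, 1]^d; strong convexity-concavity implies the
# (two-sided) pPL structure assumed in the analysis.
rng = np.random.default_rng(0)
d, mu_x, mu_y = 3, 1.0, 1.0
B = rng.normal(size=(d, d)) / d

grad_x = lambda x, y: mu_x * x + B @ y
grad_y = lambda x, y: B.T @ x - mu_y * y
proj = lambda z: np.clip(z, -1.0, 1.0)

def argmax_pga(x, y, eps=1e-6, eta=0.1, max_iter=10_000):
    # ARGMAX(f(x, .), eps): projected gradient ascent until the update is small,
    # a standard proxy for an eps-approximate maximizer.
    for _ in range(max_iter):
        y_new = proj(y + eta * grad_y(x, y))
        if np.linalg.norm(y_new - y) <= eps:
            return y_new
        y = y_new
    return y

x, y = proj(rng.normal(size=d)), proj(rng.normal(size=d))
eta_x = 0.1
for t in range(200):                     # outer loop: projected descent on x
    y = argmax_pga(x, y)                 # inner loop: approximate maximizer
    x = proj(x - eta_x * grad_x(x, y))

print("approximate saddle point:", x, y)  # expected near (0, 0) on this instance
```

On this instance the unique saddle point is the origin, so the iterates give a quick visual check that the nested template drives both players to equilibrium.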
As a function, it can be implemented efficiently by projected gradient ascent. Finally, the outer loop of the process implements projected gradient descent with a stochastic and inexact gradient feedback on Φ( ). Theorem D.5 (NC-p PL). Let f : X Y R be an ℓ-smooth function satisfying the p PL condition with a modulus µ for f(x, ). Then, after T iterations of (24) it holds true that: ˆGΦ,t x,τx 2 6LDX τx T + 6δ2 x + 6σ2 x. Proof. By Theorem C.6 and a tuning of step-size τx 1 5ℓΦ , we get that, 1 2DΦ X (xt+1, 1/τx) 3ℓ(EΦ(x0) Φ(x )) τx T + 3δ2 x + 3σ2 x. In this context, the inexactness of the gradient oracle δx is: δx = max 0 t T 1 E Φ(xt) f(xt, yt) = max 0 t T 1 ℓE yt y (xt) , while by the quadratic growth condition of f(x, ) we know that, 2 yt y (xt) EΦ(xt) Ef(xt, yt) where ϵy is the accuracy of the inner loop which we will defer tuning. We, invoke Lemma B.7 to see that the sum of DΦ X (xt, 1/τx) upper bounds the sum of ˆGΦ,t x,τx 2 : ˆGΦ,t x,τx 2 6 (EΦ(x0) Φ(x )) τx T + 6δ2 x + 6σ2 x. Now, we see that Φ(x0) Φ(x ) is bounded by LDX due to the L-Lipschitz continuity of Φ and the bounded diameter of X. Finally, we can tune τx and ϵy. Mx = σ2 x 18ϵ2 Theorem D.6 (p PL-p PL). Let f : X Y R be an L-Lipschitz, ℓ-smooth function and X, Y be two compact convex sets with Euclidean diameters DX , DY respectively. Further, assume that f( , y) satisfies the p PL condition with modulus µx for all y Y while f(x, ) satisfies the p PL condition with a modulus µy for any x X. Additionally, let ˆgx be an Solving Zero-Sum Convex Markov Games inexact stochastic gradient oracle such that Eˆgx(x) = gx(x), E ˆgx(x) gx(x) σ2 x, and gx(x) δx. Then, after T iterations of (24) with a tuning of step-sizes τx = 1 5ℓΦ : EΦ(x T +1) Φ(x ) exp µ 3ℓΦ T LDX + 3ℓΦδ2 x µx + 3ℓΦσ2 x µx . Proof. By Theorem C.6 with τx = 1 5ℓΦ , EΦ(x T +1) Φ(x ) exp µ 3ℓΦ T (Φ(x0) Φ ) + 3ℓΦδ2 x µx + 3ℓΦσ2 x µx . First, we repeat the fact that LDX Φ(x0) Φ due to Lipschitz continuity. Then, we note that δx has to be tuned as q ϵµx 9ℓΦ and the batch-size needs to be Mx = l 9ℓΦσ2 x ϵµx m where ϵ > 0 is the desired accuracy. Finally, T = l 3ℓΦ D.4. Convergence of Stochastic Alternating Gradient Descent-Ascent In what follows, we will analyze the convergence of projected alternating gradient descent-ascent for nonconvex-p PL and two-sided p PL functions. The convergence proofs closely follow those of (Yang et al., 2022) and (Yang et al., 2020) respectively after carefully modifying the arguments to make them work for the constrained setting. Convergence is proven by showing that an appropriate Lyapunov function diminshes along the trajectories of the algorithm s iterates (Bof et al., 2018). In both scenarios we face, proving that the corresponding Lyapunov function diminishes is proven by first: lower bounding the descent on the maximum function Φ( ) for every update on x; lower bounding the ascent on f(x, ) for every update on y; upper bounding the descent on f( , y) for every update on x. As a reminder, the iteration scheme of alternating gradient descent-ascent is the following, xt+1 = ΠX (xt τxˆgx(xt, yt)) ; yt+1 = ΠY (yt + τyˆgy(xt+1, yt)) . D.4.1. NC-PPL Theorem D.7. Let f : X Y R be an L-Lipschitz, ℓ-smooth function and X, Y be two compact convex sets with Euclidean diameters DX , DY respectively. Further, assume that f(x, ) satisfies the p PL condition with modulus µ. Additionally, let (ˆgx, ˆgy) be an inexact stochastic gradient oracle satisfying assumption 3. 
Then, after T iterations of (Alt-GDA) with a tuning of step-sizes τx = 1 500ℓκ2 and τy = 1 5ℓ, it holds true that E Gt x,τx 2 + κ2EDY(yt, ℓ; xt) κ2ℓL(DX + DY) T + c2σ2 x Mx + c2δ2 x + c3κ2σ2 y My + c3κ2δ2 y, where, c1, c2, c3 O(1). Descent bound on Φ. GΦ,t x,τx := 1 τx (xt Πx (xt τx xΦ(xt))) Due to ℓ-smoothness and the fact that 1 τx 5ℓΦ we can use Equation (23) to get: EΦ(xt+1) EΦ(xt) τx 6 EDΦ X (xt, 1/τx) + τx E xΦ(xt) xf(xt, yt) 2 + τx E xf(xt, yt) ˆgx(xt, yt) 2 Solving Zero-Sum Convex Markov Games 6 E GΦ,t x,τx 2 + τx E xΦ(xt) xf(xt, yt) 2 + τx E xf(xt, yt) ˆgx(xt, yt) 2 6 E GΦ,t x,τx 2 + τx E xΦ(xt) xf(xt, yt) 2 + 2τxσ2 x + 2τxδ2 x Ascent bound on f(xt, ) For a choice of τy 1 5ℓ 1 Ef(xt+1, yt+1) Ef(xt+1, yt) + τy 6 EDY(yt, 1/τy; xt+1) τyδ2 y τyσ2 y. Upper Bound on the descent: f(xt, yt) f(xt+1, yt). f(xt+1, yt) f(xt, yt) + xf(xt, yt), xt+1 xt ℓ 2 xt+1 xt 2 = f(xt, yt) τx D gx(xt, yt), ˆGt x,τx E τx D xf(xt, yt) gx(xt, yt), ˆGt x,τx E = f(xt, yt) τx D ˆgx(xt, yt), ˆGt x,τx E τx D gx(xt, yt) ˆgx(xt, yt), ˆGt x,τx E τx D xf(xt, yt) gx(xt, yt), ˆGt x,τx E = f(xt, yt) τx D ˆgx(xt, yt), ˆGt x,τx E τx gx(xt, yt) ˆgx(xt, yt), Gt x,τx τx D gx(xt, yt) ˆgx(xt, yt), ˆGt x,τx Gt x,τx E τx D xf(xt, yt) gx(xt, yt), ˆGt x,τx E f(xt, yt) τx 2 ˆgx(xt, yt) 2 τx ˆGt x,τx 2 (25) τx gx(xt, yt) ˆgx(xt, yt), Gt x,τx τx D gx(xt, yt) ˆgx(xt, yt), ˆGt x,τx Gt x,τx E 2 xf(xt, yt) gx(xt, yt) 2 τx f(xt, yt) τx + ℓτ 2 x 2 ˆGt x,τx 2 τx 2 ˆgx(xt, yt) 2 τx gx(xt, yt) ˆgx(xt, yt), Gt x,τx τx D gx(xt, yt) ˆgx(xt, yt), ˆGt x,τx Gt x,τx E τxδ2 x 2 (26) f(xt, yt) τx + ℓτ 2 x 2 ˆGt x,τx 2 τx 2 ˆgx(xt, yt) 2 τx gx(xt, yt) ˆgx(xt, yt), Gt x,τx τx gx(xt, yt) ˆgx(xt, yt) 2 τxδ2 x 2 . (27) the initial equations are mere additions-subtractions of terms and plugging-in of definitions; Solving Zero-Sum Convex Markov Games (25) follows from Young s inequality; (26) follows from gathering terms and using the bound on the gradient inexactness error; (27) follows from the Cauchy-Schwarz inequality and the non-expansiveness of the projection. Taking expectations again: Ef(xt+1, yt) Ef(xt, yt) τx + ℓτ 2 x 2 E ˆGt x,τx 2 τxσ2 x 2 τx E gx(xt, yt) ˆgx(xt, yt), Gt x,τx τxσ2 x τxδ2 x 2 = Ef(xt, yt) τx + ℓτ 2 x 2 E ˆGt x,τx 2 3τxσ2 x 2 τxδ2 x 2 Ef(xt, yt) 3τx 4 E ˆGt x,τx 2 3τxσ2 x 2 τxδ2 x 2 Ef(xt, yt) 3τx 2 E Gt x,τx 2 9τxσ2 x 2 σ2 x 7τxδ2 x 2 (28) Where, we used that τx 1 2ℓ, which means that 1 2 τx + ℓτ 2 x 3τx Bounding AGDA iterate difference Ef(xt+1, yt+1) Ef(xt, yt). Ef(xt+1, yt+1) Ef(xt, yt) 3τx 2 E Gt x,τx 9τxσ2 x 2 7τx 6 EDY(yt, 1/τy; xt+1) τyδ2 y τyσ2 y The Lyapunov function. We will define the Lyapunov function, V (xt, yt) := Φ(xt) + α Φ(xt) f(xt, yt) = (1 + α) Φ(xt) αf(xt, yt), with α > 0 to be tuned at the end. Also, we will denote Vt := V (xt, yt). 
EVt EVt+1 = (1 + α) (EΦ(xt) EΦ(xt+1)) + α (Ef(xt+1, yt+1) f(xt, yt)) EVt EVt+1 (1 + α) τx 6 E GΦ,t x,τx 2 τx E xΦ(xt) xf(xt, yt) 2 2τxσ2 x 2τxδ2 x 2 E Gt x,τx 2 9τxσ2 x 2 7τx 6 EDY(yt, 1/τy; xt+1) τyδ2 y τyσ2 y 6 E GΦ,t x,τx 2 (1 + α)τx E xΦ(xt) xf(xt, yt) 2 2 E Gt x,τx 2 + ατy 6 EDY(yt, 1/τy; xt+1) + 2τx(1 + α) 9τx 2 α σ2 x + ( ατy) σ2 y + 2τx(1 + α) 7τx 2 α δ2 x + ( ατy) δ2 y Solving Zero-Sum Convex Markov Games 6 E GΦ,t x,τx 2 (1 + α)τx E xΦ(xt) xf(xt, yt) 2 2 E Gt x,τx 2 + ατy 6 EDY(yt, 1/τy; xt) 3ατyℓ2 6 E xt xt+1 2 + 2τx(1 + α) 9τx 2 α σ2 x + ( ατy) σ2 y + 2τx(1 + α) 7τx 2 α δ2 x + ( ατy) δ2 y (29) 6 E GΦ,t x,τx 2 (1 + α)τx E xΦ(xt) xf(xt, yt) 2 2 E Gt x,τx 2 + ατy 6 EDY(yt, ℓ; xt) 2 E Gt x,τx 2 2ατ 2 xτyℓ2σ2 x 2ατ 2 xτyℓ2δ2 x + 2τx(1 + α) 9τx 2 α σ2 x + ( ατy) σ2 y + 2τx(1 + α) 7τx 2 α δ2 x + ( ατy) δ2 y (30) = (1 + α)τx 6 E GΦ,t x,τx 2 (1 + α)τx E xΦ(xt) xf(xt, yt) 2 2 ατ 2 xτyℓ2 E Gt x,τx 2 + ατy 6 EDY(yt, ℓ; xt) + 2τx(1 + α) 9τx 2 α 2ατ 2 xτyℓ2 σ2 x + ( ατy) σ2 y + 2τx(1 + α) 7τx 2 α 2ατ 2 xτyℓ2 δ2 x + ( ατy) δ2 y 6 3τxα 2ατ 2 xτyℓ2 E GΦ,t x,τx 2 (1 + α)τx E xΦ(xt) xf(xt, yt) 2 6 3τxακ2 2ατ 2 xτyℓ2κ2 EDY(yt, ℓ; xt) + 2τx(1 + α) 9τx 2 α 2ατ 2 xτyℓ2 σ2 x + ( ατy) σ2 y + 2τx(1 + α) 7τx 2 α 2ατ 2 xτyℓ2 δ2 x + ( ατy) δ2 y 6 3τxα 2ατ 2 xτyℓ2 E GΦ,t x,τx 2 (1 + α)τxκ2DY(yt, ℓ; xt) 6 3τxακ2 2ατ 2 xτyℓ2κ2 EDY(yt, ℓ; xt) + 2τx(1 + α) 9τx 2 α 2ατ 2 xτyℓ2 σ2 x + ( ατy) σ2 y + 2τx(1 + α) 7τx 2 α 2ατ 2 xτyℓ2 δ2 x + ( ατy) δ2 y (31) = (1 + α)τx 6 3τxα 2ατ 2 xτyℓ2 E GΦ,t x,τx 2 6 3τxακ2 2ατ 2 xτyℓ2κ2 (1 + α)τxκ2 EDY(yt, ℓ; xt) + 2τx(1 + α) 9τx 2 α 2ατ 2 xτyℓ2 σ2 x + ( ατy) σ2 y Solving Zero-Sum Convex Markov Games + 2τx(1 + α) 7τx 2 α 2ατ 2 xτyℓ2 δ2 x + ( ατy) δ2 y (29) uses Lemma B.12 and the fact that |a b| < c implies a > b c, i.e., DY(yt+1, α; xt) DY(yt, α; xt) 3ℓ2 xt xt+1 2 , (30) uses the definition of ˆGt x,τx, Gt x,τx and Lemma B.4 to replace the term: xt xt+1 2, i.e,: xt xt+1 2 2 xt xt+1 2 + 2 xt+1 xt+1 2 2τ 2 x Gt x,τx 2 + 2τ 2 x ˆgx(xt, yt) xf(xt, yt) 2 2τ 2 x Gt x,τx 2 + 4τ 2 xσ2 x + 4τ 2 xδ2 x. By the fact that c d 2 = c 2 + d 2 2 c, d and Young s inequality, we can write a 2 = (a b) + b 2 (1 + 1/ρ) a 2 + (1 + ρ) b 2. Then we plug in a Gt x,τx, b GΦ,t x,τx, Gt x,τx 2 (1 + ρ) GΦ,t x,τx 2 + 1 + 1 1 τx (ΠX (x τx f(xt, yt)) ΠX (x τx Φ(xt))) (1 + ρ) GΦ,t x,τx 2 + 1 + 1 f(xt, yt) Φ(xt) 2 (1 + ρ) GΦ,t x,τx 2 + 1 + 1 ℓ2 y (xt) yt 2 (1 + ρ) GΦ,t x,τx 2 + 1 + 1 µQG (Φ(xt) f(xt, yt)) (1 + ρ) GΦ,t x,τx 2 + 1 + 1 κ2DY(yt, ℓ; xt) Where the third to last inequality follows from Danskin s lemma, the penultimate is due to the quadratic growth property, and the last is due to the proximal-PL property. We have set κ := ℓ µQGµ. (31) follows from the p PL property. Gt x,τx 2 2 GΦ,t x,τx 2 + 2 Gt x,τx GΦ,t x,τx 2 2 GΦ,t x,τx 2 + 2 f(xt, yt) Φ(xt) 2 2 GΦ,t x,τx 2 + 2ℓ yt y (xt) 2 2 GΦ,t x,τx 2 + 2ℓκ2DY(yt, ℓ; xt) Re-iterating, EVt EVt+1 (1 + α)τx 6 3τxα 2ατ 2 xτyℓ2 E GΦ,t x,τx 2 6 3τxακ2 2ατ 2 xτyℓ2κ2 (1 + α)τxκ2 EDY(yt, ℓ; xt) + 2τx(1 + α) 9τx 2 α 2ατyℓ2 σ2 x + ατ 2 xτy σ2 y + 2τx(1 + α) 7τx 2 α 2ατ 2 xτyℓ2 δ2 x + ( ατy) δ2 y What is left to do is to ensure that the coefficients of GΦ,t x,τx 2 , DY(yt, ℓ; xt) in the last display are positive. We will show that this is indeed the case for a correct tuning of τx, τy and α. Solving Zero-Sum Convex Markov Games Coefficient of GΦ,t x,τx 2. We assume a priori that, τx, τy 1 ℓ. Hence we can write: 6 3τxα 2ατ 2 xτyℓ2 1 + α 18α 12α Where we require that α 1/30 < 1/29. Coefficient of DY. We require that τx τy cκ2 for some c > 1. 
6 3τxακ2 2ατ 2 xτyℓ2κ2 (1 + α)τxκ2 α 18α/c 12α/c2 6(1 + α)/c = (c2 24c 12)α 6c Where in the inequality we assume c = 100, α = 1/30. Coefficients of σ2 x, σ2 y, δx, δy. For σ2 x, δ2 x 2τx(1 + α) + 7τx 2 α + 2ατ 2 xτyℓ2 2τx(1 + α) + 9τx 2 α + 2ατ 2 xτyℓ2 = τx 60 + 2 + 135 30 + 200τ 3 xκ2ℓ2 τx = τy 100κ2 = τx 60 + 2 + 135 30 + 200τ 2 xℓ 30 500 τx = 1 500κ2ℓ τx 60 + 2 + 135 30 500 8τx (τx 1/ℓ= ℓ 1/τx) . For σ2 y, δ2 y: 30 τx 4κ2τx Gt x,τx 2 2 GΦ,t x,τx 2 + 2κ2DY(yt, ℓ; xt) Summarizing: Gt x,τx 2 + 9κ2τx EDY(yt, ℓ; xt) τx Gt x,τx 2 + τy EDY(yt, ℓ; xt) EVt EVt+1 + 8τxσ2 x + 8τxδ2 x + 4κ2τxσ2 y + 4κ2τxδ2 y. Summing over t and dividing by T: Gt x,τx 2 + κ2EDY(yt, ℓ; xt) 720L(DX + DY) τx T + 2880 σ2 x + δ2 x + 1440κ2 σ2 y + δ2 y . Solving Zero-Sum Convex Markov Games D.4.2. TWO-SIDED PPL MIN-MAX OPTIMIZATION Theorem D.8. Let f : X Y R be an L-Lipschitz, ℓ-smooth function and X, Y be two compact convex sets with Euclidean diameters DX , DY respectively. Further, assume that f( , y) satisfies the p PL condition with modulus µx for all y Y while f(x, ) satisfies the p PL condition with a modulus µy for any x X. Additionally, let (ˆgx, ˆgy) be an inexact stochastic gradient oracle satisfying assumption 3. Then, after T iterations of (Alt-GDA) with a tuning of step-sizes τx = µ2 y 160ℓ3 and τy = 1 5ℓ, it holds true that: EΦ(x T ) Φ + 1/10 (EΦ(x T ) Ef(x T , y T )) exp µxµ2 y 160ℓ3 T L(DX + DY) + c1σ2 x µx + c1δ2 x µx + c2ℓ2σ2 y µxµ2y + c2ℓ2δ2 y µxµ2y , where, c1, c2 O(1). Proof. Our goal is to ultimately demonstrate that there exists a Lyapunov function, Vt whose value decreases along any trajectory of the algorithm s iterates. Not only so, but its value contracts, i.e., there exists 0 < ϖ < 1 such that Vt+1 ϖVt, t. To demonstrate this, we first need to lower bound the descent on Φ( ), lower bound the ascent on f(xt, ), and finally upper bound the descent on f( , yt). Descent Bound on Φ( ). By Lemma B.9 and Lemma D.4 we write, EΦ(xt+1) EΦ(xt) τx 6 EDΦ X (xt, 1/τx) + τx E xΦ(xt) xf(xt, yt) 2 + 2τxσ2 x + 2τxδ2 x EΦ(xt) τxµx 3 (Φ(xt) Φ ) + τx E xΦ(xt) xf(xt, yt) 2 + 2τxσ2 x + 2τxδ2 x EΦ(xt+1) Φ 1 τxµx (Φ(xt) Φ ) + τxℓ2 y yt 2 + 2τxσ2 x + 2τxδ2 x (Φ(xt) Φ ) + τxℓ2µQG/2 (Φ(xt) f(xt, yt)) + 2τxσ2 x + 2τxδ2 x, (32) or, we can also write, EΦ(xt+1) EΦ(xt) τxµx 3 (Φ(xt) Φ ) + τxℓ2µQG/2 (Φ(xt) f(xt, yt)) + 2τxσ2 x + 2τxδ2 x (33) Ascent Bound on f(xt, ). A simple application of Lemma B.9 yields, Ef(xt+1, yt+1) Ef(xt+1, yt) + τy 6 EDY(yt, 1/τy; xt+1) τyδ2 y τyσ2 y. From which, we can also write, EΦ(xt+1) Ef(xt+1, yt+1) EΦ(xt+1) Ef(xt+1, yt) τy 6 EDY(yt, 1/τy; xt+1) + τyδ2 y + τyσ2 y Solving Zero-Sum Convex Markov Games (EΦ(xt+1) Ef(xt+1, yt)) + τyδ2 y + τyσ2 y (EΦ(xt) Ef(xt, yt) + Ef(xt, yt) Ef(xt+1, yt) + EΦ(xt+1) EΦ(xt)) + τyδ2 y + τyσ2 y (34) Upper bound on the descent on f( , yt) From (28) we write, Ef(xt+1, yt) Ef(xt, yt) 3τx 2 E Gt x,τx 2 9τx Ef(xt, yt) 3τx 2 DX (xt, 1/τx; yt) 9τx 2 δ2 x (35) The second inequality follows Lemma B.7. Now, (34) by plugging-in (33) and (35) reads, EΦ(xt+1) Ef(xt+1, yt+1) 1 µyτy (EΦ(xt) Ef(xt, yt)) (Ef(xt, yt) Ef(xt+1, yt)) (EΦ(xt+1) EΦ(xt)) + τyδ2 y + τyσ2 y (EΦ(xt) Ef(xt, yt)) 2 DX (xt, 1/τx; yt) + 9τx 2 σ2 x + 7τx 6 EDΦ X (xt, 1/τx) + τx E xΦ(xt) xf(xt, yt) 2 + 2τxσ2 x + 2τxδ2 x + τyσ2 y + τyδ2 y (EΦ(xt) Ef(xt, yt)) 2 DX (xt, 1/τx; yt) 6 EDΦ X (xt, 1/τx) + τx E xΦ(xt) xf(xt, yt) 2 τxσ2 x + 7 1 µyτy τxδ2 x + τyσ2 y + τyδ2 y (36) The Lyapunov Function. 
We will consider the following Lyapunov function where α > 0 is to be defined later along the proof: V (xt, yt) := (Φ(xt) Φ ) + α (Φ(xt) f(xt, yt)) . For convenience, with Ut, Wt we will denote the quantities: Ut := Φ(xt) Φ Wt := Φ(xt) f(xt, yt), and we can then write Vt = Ut + αWt. Piecing together (32), and (36) Ut+1 + αWt+1 Ut τx 6 EDΦ X (xt, 1/τx) + τx E xΦ(xt) xf(xt, yt) 2 + 2τxσ2 x + 2τxδ2 x Solving Zero-Sum Convex Markov Games 2 DX (xt, 1/τx; yt) 6 EDΦ X (xt, 1/τx) + τx E xΦ(xt) xf(xt, yt) 2 + α 7 1 µyτy τxσ2 x + 7 1 µyτy τxδ2 x + τyσ2 y + τyδ2 y EDΦ X (xt, 1/τx) + τx + ατx 1 µyτy E xΦ(xt) xf(xt, yt) 2 DX (xt, 1/τx; yt) DΦ X (xt, 1/τx) + 2 + 7α 1 µyτy τxσ2 x + 2 + 7α 1 µyτy τxδ2 x + ατyσ2 y + ατyδ2 y (37) EDΦ X (xt, 1/τx) + τx + ατx 1 µyτy µQG (EΦ(xt) Ef(xt, yt)) µQG (EΦ(xt) Ef(xt, yt)) + 2 + 7α 1 µyτy τxσ2 x + 2 + 7α 1 µyτy τxδ2 x + ατyσ2 y + ατyδ2 y (38) + τx + ατx 1 µyτy + 2 + 7α 1 µyτy τxσ2 x + 2 + 7α 1 µyτy τxδ2 x + ατyσ2 y + ατyδ2 y (39) In (37) we use the fact a |a b| + b with a = DX (xt, 1/τx; yt), b = DΦ X (xt, 1/τx) in order to remove the term DX (xt, 1/τx; yt) and introduce the terms DX (xt, 1/τx; yt) DΦ X (xt, 1/τx) and DΦ X (xt, 1/τx); In (38), we use the following facts: DX (xt, 1/τx; yt) DΦ X (xt, 1/τx) 3ℓ2 yt y (xt) 2, Lemma B.12; 2. xΦ(xt) xf(xt, yt) 2 ℓ2 yt y (xt) 2; 2 yt y (xt) 2 Φ(xt) f(xt, yt), quadratic growth condition of p PL functions Proposition C.5. In (39) we use the fact that 1 2DΦ X (x, 1/τx) µx Ut (Lemma D.4) and the negativity of the coefficient of EDΦ X (xt, 1/τx), i.e., the display τx 6 + α 1 µyτy 2 . To ensure negativity, we require that α < 1 and utilize the fact that µx ℓand hence τxµx 1 by the choice of step-size. Solving Zero-Sum Convex Markov Games Summarizing: Ut+1 + αWt+1 1 µxτx + α 1 + 11ℓ2τx µQG 11µyℓ2τxτy + 2 + 7α 1 µyτy τxσ2 x + 2 + 7α 1 µyτy τxδ2 x + ατyσ2 y + ατyδ2 y | {z } ξ Tuning the parameters: We need to ensure that 0 < ϖ1, ϖ2 < 1 and then, we can show that the value of the Lyapunov function is contracting, i.e. Ut+1 + αWt+1 max{ϖ1, ϖ2} (Ut + αWt) + ξ. We wiil now upper-bound ϖ1, ϖ2. For ϖ1 we see that: 3 = 1 29µxτx αµQG + τx 1 µyτy µQG + 1 µyτy µ2 yτy τxℓ2 2µy µ2 yτy τxℓ2 2 µ2 yτy τxℓ2 31 where we have used the fact that µQG is equal to µy by Proposition C.5 and we require that µ2 yτy τxℓ2 = 32. As such, we τx = µ2 yτy 32ℓ2 and τy = 1 5ℓ. Further, We observe that 1 µxτx = max n 1 ℓ2τx µy , 1 µxτx o since ℓ µx, µy. Combining the pieces: EVt+1 (1 µxτx) EVt + 3τxσ2 x + 3τxδ2 x + τyσ2 y + τyδ2 y Applying the inequality recursively and using the formula for the sum of the geometric sequence, we get, EVT (1 µxτx)T V0 + 3σ2 x µx + 3δ2 x µx + +32ℓ2σ2 y µxµ2y + 32ℓ2δ2 y µxµ2y exp ( µxτx T) V0 + 3σ2 x µx + 3δ2 x µx + 32ℓ2σ2 y µxµ2y + 32ℓ2δ2 y µxµ2y Further, we note that since f, Φ are L-Lipschitz continuous and their domains have bounded diameters, we can bound V0 as V0 L (DX + DY) . Finally, we see that by the choice of step-sizes µxτx = µxµ2 y 160ℓ3 . Solving Zero-Sum Convex Markov Games E. Convex Markov Games Lemma E.1 (Continuity of the occupancy measure). Let λ (S A B) be the occupancy measure in a Markov game. Then λ is Lλ-Lipschitz continuous and ℓλ-smooth with respect to the policy pair (x, y) X Y. Specifically, for all (x, y) and (x , y ), λ(x, y) λ(x , y ) Lλ (x, y) (x , y ) ; λ(x, y) λ(x , y ) ℓλ (x, y) (x , y ) , where Lλ := |S| 1 2 (|A|+|B|) (1 γ)2 , and ℓλ := 2γ|S| 1 2 (|A|+|B|) 3 2 (1 γ)3 . Moreover, consider the functions λ 1 1 : (S A) X and λ 1 2 : (S B) Y, such that: λ 1 1 (λ1(x, y)) = x; λ 1 2 (λ2(x, y)) = y. 
For any fixed y (respectively, x), λ 1 is Lλ 1-Lipschitz continuous with respect to λ1 (respectively, λ2), i.e., for all λ1(x, y), λ1(x , y) respectively, λ2(x, y), λ2(x, y ), x x Lλ 1 λ 1 1 (λ1(x, y) λ1(x , y)) ; y y Lλ 1 λ 1 2 (λ2(x, y) λ2(x, y )) , with Lλ 1 := 2 mins ϱ(s)(1 γ). Proof. (Kalogiannis et al., 2024, Lemmata C.2 & C.3). Throughout, we will assume ε-greedy parametrization of the policies, i.e., the players play according to policies: πx = (1 ε)x + ε1 |A|, and πy = (1 ε)y + ε1 where 1 is the all-ones vector of appropriate dimension. Solving Zero-Sum Convex Markov Games E.1. Properties of the c MG Utility Fucntions We reinstate that the utility function LU-Lipschitz continuous and ℓU-smooth with, LU = LF |S| 1 2 (|A| + |B|) (1 γ)2 ; ℓU = 2ℓF γ|S| 1 2 (|A| + |B|) E.1.1. PROPERTIES OF CMGS WITH CONVEX UTILITIES For the two-player zero-sum c MG with merely concave utilites, we consider the regularized utility function U µ(x, y) := U(x, y) + µ 2 λ2(x, y) 2. We note that that an upper bound to the Lipschitz smoothness moduli of U µ, U µ is given by Lµ U, ℓµ U: Lµ U = 2LUL2 λ = LF |S| 3 2 (|A| + |B|)3 (1 γ)6 ; ℓµ U = 4ℓF γ2|S| 3 2 (|A| + |B|)4 Let us now compute the moduli of the quadratic growth and p PL condition, µQG = (mins ϱ(s))2(1 γ)2µ 4 ; µPL = (mins ϱ(s))4(1 γ)12µ2 4ℓF γ2|S| 3 2 (|A| + |B|)4 Since the quantity µQGµPL will frequently appear in our calculations, we write: µQGµPL = (mins ϱ(s))6(1 γ)14µ3 4ℓF γ2|S| 3 2 (|A| + |B|)4 The condition number κ is defined as κ := ℓµ U µQGµPL and is equal to: 3 2 F γ3|S| 9 4 (|A| + |B|)6 (mins ϱ(s))3(1 γ)15µ 3 2 Finally, we define function Φµ := maxy Y U µ(x, y) and observe for its Lipschitz modulus, Lµ Φ, and its gradient s Lipschitz modulus, ℓµ Φ: Lµ Φ = Lµ U = LF |S| 3 2 (|A| + |B|)3 (1 γ)6 ; ℓµ Φ = 2ℓµ Uκ = 512ℓ 9 2 F γ8|S| 23 4 (|A| + |B|) (1 γ)23(mins ϱ(s))3µ 3 2 . E.1.2. PROPERTIES OF CMGS WITH STRONGLY CONCAVE UTILITIES We carry over the same calculations for the c MG with strongly concave utilities. Since we do not perturb the utilities, the relevant Lipschitz moduli remain the same. This is not the case for µQG, µPL. We write: µQG = (mins ϱ(s))2(1 γ)2µ 4 ; µPL = (mins ϱ(s))4(1 γ)7µ2 4ℓF γ|S| 1 2 (|A| + |B|) Further, κ := ℓU µQGµPL which is equal to: 3 2 F γ 3 2 |S| 3 4 (|A| + |B|) (mins ϱ(s))3(1 γ) 9 2 µ 3 2 . Finally, the Lipschitz modulus of Φ and its gradient will be: LΦ = LU = LF |S| 1 2 (|A| + |B|) (1 γ)2 ; ℓΦ = 16 5 2 F γ 5 2 |S| 5 4 (|A| + |B|) (mins ϱ(s))3(1 γ) 15 Solving Zero-Sum Convex Markov Games E.2. Stochastic Estimators Using a trajectory ξ := s(0), a(0), s(1), a(1), . . . and ξ := s(0), b(0), s(1), b(1), . . . respectively, we define the estimates of λ1, λ2, ˆλs,a 1,ξ := h=1 γh1{s(h) = s, a(h) = a}; ˆλs,b 2,ξ := h=1 γh1{s(h) = s, b(h) = b}. Additionally, for a pseudo-reward vector z R|S| |A| and ξ := s(0), a(0), s(1), a(1), . . . or z R|S| |B| correspondingly and ξ := s(0), b(0), s(1), b(1), . . . we define gradient estimates ˆ t x,ξ, ˆ t y,ξ: ˆgx(ξ|xt, yt, z) = γhz s(h), a(h) h =0 x log x a(h )|s(h ) !# ˆgy(ξ|xt, yt, z) = γhz s(h), b(h) h =0 y log y b(h )|s(h ) !# At every iteration t, each agent constructs ˆrx,t λ1F(ˆλ1,t) describe how to construct ˆrx, ˆry batch versions of the latter estimators will be, i=1 ˆgx(ξi|xt, yt, ˆrx,t), and ˆ t y := i=1 ˆgy(ξi|xt, yt, ˆry,t). (40) Now, for each one of the two cases we let z λ1F ˆλ1,t; yt and z λ1F ˆλ1,t; yt correspondingly. Remark 4. This stochastic gradient estimator assumes access to a gradient oracle for λ1F. 
Access to an oracle for F or λF is an assumption made in virtually all the references that we have encountered concerning policy gradient methods for MDPs with general utilities (Zhang et al., 2020; 2021; Barakat et al., 2023). Designing a gradient estimator that entirely relies on samples lies well-beyond the scope of our paper. Nevertheless, this assumption is not a strong one as even with a complete knowledge of F, agents cannot control the function s arguments except for affecting them implicitly. The occupancy measure estimator Let a trajectory sample ξ of length Hξ, we define the occupancy estimator for player 1 to be, ˆλs,a 1 (ξ) := h=1 γh1{s(h) = s, a(h) = a} ˆλs,b 2 (ξ) := h=1 γh1{s(h) = s, b(h) = b} Assume policies xt, yt and corresponding sampled trajectories {ξi,t}Mx i=1 and {ξ i,t}Mx i=1 for player 1 and 2 correspondingly. The occupancy batch estimators ˆλ1,t, ˆλ2,t ˆλ1,t := 1 Mx i=1 ˆλ1(ξi,t), ˆλ2,t := 1 Mx i=1 ˆλ2(ξ i,t) λs,a 1,H(x, y) := Ex,y h=1 γh1{s(h) = s, a(h) = a} P s(h+1)|a(h), b(h), s(h) s(0) ϱ Solving Zero-Sum Convex Markov Games λs,b 2,H(x, y) := Ex,y h=1 γh1{s(h) = s, b(h) = b} P s(h+1)|a(h), b(h), s(h) s(0) ϱ Lemma E.2. Let an empirical estimate ˆλ1 of the truncated-horizon occupancy measure λ1,H. Then, the variance of the estimate can be bounded as: E ˆλ1 λ1,H 2 1 Mx(1 γ)2 . The policy gradient estimator Assuming ε-greedy parametrization to control the variance of our estimators, we can go forward and state our main lemmata regarding the gradient policy gradient estimator s variances. Lemma E.3. Let ˆ t x be defined as in (40). The estimators variance can be bounded as: E ˆ t x x F(λ1,H(xt, yt); yt) 2 27L2 F Mx(1 γ)6ε2 . Proof. We essentially carry over the same computation as Step 2 in the proof of (Zhang et al., 2021)[Lemma F.2] using our notation and assumptions. To simplify the excessively busy notation and improve readability, let us introduce some shortcuts, λ1 := λ1,H(xt, yt), ˆλ1 := ˆλ1,t, r := λ1F (λ1,H(xt, yt); yt) = λ1F J := xλ1,H(xt, yt) = xλ1, ˆr := λ1F ˆλ1; yt , g := x F (λ1,H(xt, yt); yt) = x F ˆgi := ˆgx(ξi|xt, yt, ˆr), ˆgi, := ˆgx(ξi|xt, yt, r ). We can finally write: E h ˆ t x J ˆr 2i = E i ˆgi, + 1 Mx i ˆgi, g + g J ˆr For the first term we write, " ˆgi ˆgi, 2 # 4 (1 γ)4ε2 E h ˆr r 2 i 4L2 F (1 γ)4ε2 E h ˆλ1 λ1 2 i (41) 4L2 F Mx(1 γ)6ε2 . Solving Zero-Sum Convex Markov Games Where (41) follows from the following fact: ˆg(ξ|x, r1) ˆg(ξ|x, r2) = t=0 γt r1(st, at) r2(st, at) t X t =0 x log πx(at |st ) 1 εγt(t + 1) r1 r2 The second and third terms can be similarly bounded as: L2 F Mx(1 γ)4 ; 4L2 F Mx(1 γ)6ε2 . Lemma E.4. For any H > 1 ln(1 γ) and a greedy exploration parameter ε, the following inequality holds true for the bias of the policy gradient estimator defined in Equation (40): x F(λH(x, y); y) x F(λ1(x, y), y) 2 256LF (1 γ)6ε2 exp (1 γ)(H 1) . Proof. By (Zhang et al., 2021, Lemma E.3), we have that, x F(λH(x, y); y) x F(λ1(x, y), y) 2 8L2 F (1 γ)6ε2 + 16L2 F ε2 (1 γ)2 + 1 (1 γ)4 Further, we can bound the RHS as: 8L2 F (1 γ)6ε2 + 16L2 F ε2 (1 γ)2 + 1 (1 γ)4 γ2H 256LF (H + 1)2 (1 γ)6ε2 γ2H. Now, in order to simplify the display, compute an H > 0 such that: equivalently, 2 log1/γ(H + 1) H + 1 changing the base of the logarithm, we get, 1 ln(1/ γ) ln(H + 1) H + 1, but the latter is true for any H + 1 > 1/ ln(1/ γ). Hence, since γ < 1 and noting that γ = 1 (1 γ), 256LF (H + 1)2 (1 γ)6ε2 γ2H 256LF (1 γ)6ε2 γH 1 256LF (1 γ)6ε2 exp (1 γ)(H 1) . Solving Zero-Sum Convex Markov Games E.3. 
Additional Supporting Claims In this subsection we bound the errors incurred due to the ε-greedy parametrization, and the regularization. Error due to ε-greedy parametrization. Claim E.5. Assume an L-Lipschitz continuous function f : X R. Further, let X be a concatenation of n m-dimensional probability simplices. Further let w be the mapping w(x) := (1 ε)x + ε/m for some 0 < ε < 1. |f(x) (f w)(x)| n Lε Further, if for some ϵ > 0 and x X, f(x) f ϵ, (f w)(x) f ϵ + n Lε The second follows directly. Proof. The claim follows from the Lipschitz continuity of f. It is the case that (f w)(x) = f((1 ε)x + ε/m), |f(x) f((1 ε)x + ε/m)| L x ((1 ε)x + ε/m) Claim E.6. Let an L-Lipschitz continuous function f : X Y R. Let X, Y be concatenations of nx and ny mxand my-dimensional probability simplices respectively. Further let w be the mapping w(x; ε, m) := (1 ε)x + ε/m for some 0 < ε < 1. Further, for some εx, εy > 0 define fw(x, y) := f (w(x; εx, mx), w(y; εy, my)), Φw(x) := maxy Y fw(x, y), and Φ w := minx X Φw(x). Then, the following inequalities hold true, |Φw(x) Φ(x)| nx Lεx mx + ny Lεy |Φ Φ w| nx Lεx mx + ny Lεy Proof. The proof of the first item directly follows from an application of Claim E.5 and the L-Lipschitz continuity of Φ given that f is L-Lipschitz continuous. For the second item, we use the inequality we just attained and write, Φw(x) Φ(x) + nx Lεx mx + ny Lεy Minimizing over x for the two sides yields, Φw(x w) Φ(x ) + nx Lεx mx + ny Lεy Where, x , x w are such that x arg minx X Φ(x) and x w arg minx X Φw(x). Applying the same trick to the other side of the other direction of the inequality: Φ(x ) Φw(x w) + nx Lεx mx + ny Lεy |Φ(x ) Φw(x w)| nx Lεx mx + ny Lεy Solving Zero-Sum Convex Markov Games Claim E.7. Assume an L-Lipschitz continuous and ℓ-smooth function f : X R that is µ-p PL. Further, let X be a concatenation of n m-dimensional probability simplices. Further let w be the mapping w(x) := (1 ε)x + ε/m for some 0 < ε < 1. Define Df X (x, α), Df w X (x, α) for some α > 0 to be: Df X (x, α) := 2α min y X n f(x), y x + α Df w X (πx, α) := 2α min y X n f w(x), y x + α Then, it is the case that: 1 2Df w X (x, α) µ (f(x) f ) 8 nℓLαDX ε. Proof. We essentially need to prove that in fact: Df w X (x, α) Df X (x, α) 16 nℓLαDX ε. We write Df X (x, α) equivalently as: Df X (x, α) = max y X n 2α f(x), x y α2 x y 2o . We define G( ; y) after we isolate the display inside the max-operator and in place of the gradient put any vector v of the same dimension: G(v; y) := 2α v, x y α2 x y 2 . We see that v G(v; y) = 2α(x y) and x y DX . As such, G is 2αDX -Lipschitz continuous in v. This consequently means that Ψ(v) := maxy Y G(v; y) is also 2αDX -Lipschitz in v. Now, all that is left to do is to bound the distance between x(f w)(x) and xf(x). By the chain rule: x(f w)(x) = (1 ε) xf(z) z=w(x). Further, set xε := w(x) and observe that xf(x) xf(xε) ℓ x xε = ℓ x (1 ε)x ε/m ℓε( n + 1/m). Where in the last inequality we use the fact that for a probability vector y, y 1 and the triangle inequality. Assuming that n, m > 1, we can further simplify into: xf(x) xf(xε) 2 nℓε (42) Now, by the fact that Ψ( xf(x)) = Df x(x, α), Ψ( x(f w)(x)) = Df w x (x, α), and the Lipschitz continuity of Ψ, we write: Df X (x, α) Df w X (x, α) 2αDX x(f w)(x) xf(x) 2αDX xf(x) xf(xε) + 2αDX ε xf(xε) 4 nℓαDX ε + 2αDX εL 16 nℓLαDX ε. The claim follows. Solving Zero-Sum Convex Markov Games Claim E.8. Assume an L-Lipschitz continuous and ℓ-smooth function f : X R such that: max x X, x x 1 xf(x), x x µ (f(x) f ) . 
Further, let X be a concatenation of n m-dimensional probability simplices. Further let w be the mapping w(x) := (1 ε)x + ε/m for some 0 < ε < 1. It is the true that: max x X, x x 1 x(f w)(x), x x µ (f(x) f ) 8 nℓLDX ε. Proof. Similar to the previous case, we need to show that G(v; x ) := v, x x is Lipschitz continuous in v. Indeed, v G(v; x ) := x x and as such v G(v; x ) DX . Consequently, Ψ(v) := maxx X G(v; x ) is also DX -Lipschitz continuous. Further, as previously shown in (42), xf(x) xf(xε) 2 nℓε. We can finally write, max x X, x x 1 x(f w), x x max x X, x x 1 xf, x x 8 nℓLDX ε. Error due to the regularizer. Claim E.9. Let a two-player zero-sum c MG with utility U : X Y R. Assume that the maximizing player employs regularization to their utility of the form: U µreg.(x, y) := U(x, y) µreg. 2 λ2(x, y) 2 , where µreg. > 0. Then, the following inequality is true: x U(x, y) x U µ(x, y) µreg. Lλ . x U(x, y) x U µ(x, y) = µreg. xλ2(x, y) µreg. λ2 xλ2(x, y) E.4. Nested Policy Gradient E.4.1. CMG WITH CONCAVE UTILITIES Theorem E.10. Assume a two-player zero-sum c MG and a desried accuracy ϵ > 0. Running Algorithm 1 a tuning of Algorithm 1 with, µreg. = Θ (1 γ)(mins ϱ(s))ϵ step-sizes, τx = Θ (mins ϱ(s))5(1 γ) 18 ℓ 5 2 F γ 5 2 |S| 5 2 (|A|+|B|) 15 and τy = Θ (1 γ)8 4ℓF γ2|S| 3 2 (|A|+|B|)4 exploration parameters εx = ϵ(mins ϱ(s))6µ3(1 γ)17 512LF |S|4ℓ5 F γ5(|A|+|B|) 17 and εy = Θ ϵ(mins ϱ(s))9µ 9 2 x (1 γ) 21 |S| 5 4 ℓ 3 2 F γ 1 2 (|A|+|B|) 9 4 Solving Zero-Sum Convex Markov Games batch-sizes Mx = Θ L4 F ℓ2 F |S|7(|A|+|B|)11 ϵ4(mins ϱ(s))4(1 γ)32 and My = Θ ℓ10 F L2 F γ8|S|10(|A|+|B|)22 (mins ϱ(s))16(1 γ)52ϵ8 ; an outer-loop number of iterations at least Tx = Θ LF ℓ 9 2 F γ8|S| 29 4 (|A|+|B|) 29 2 (mins ϱ(s)) 13 and inner loop iterations that are at least Ty = Θ ℓUL2 λℓF γ2|S| 3 2 (|A|+|B|)4 (mins ϱ(s))6(1 γ)14ϵ2 ln ℓF LF γ|S|(|A|+|B|) (mins ϱ(s))(1 γ)ϵ ; will output an iterate xt , yt , such that: E U(xt , yt +1) min x X U(x , yt +1) ϵ; E max y Y U(xt, y ) EU(xt , yt ) ϵ. Proof. In order to simplify our arguments, we formalize ε-greedy parametrization as a composition of the utility function U(x, y) with mappings w(x; ε, m) where w(x) := (1 ε)x + ε1/m. In particular, our convergence guarantees are w.r.t. the function U µ w(x, y) := U µ (w(x; εx, |A|), w(y; εy, |B|)) and Φ(x) := maxy Y U µ w(x, y). We will tune Tx, τx, Hx, εx, Mx, by, Theorem D.5. ˆG Φ,t x,τx 2 5ℓ ΦL ΦDX T + 6δ2 x + 1 Mx 4L2 F (1 γ)6ε2x . First, we see that in order to achieve an ϵ-approximate best-response w.r.t. to x, then, x needs to be an (1 γ)mins ϱ(s)ϵapproximate stationary point. In general, then, we need to ensure that, δ2 x = Θ(ϵ2(1 γ)2(min s ϱ(s))2). Hence, by Claim E.9, we can pick the regulizer s coefficient to be: µreg. 
= Θ (1 γ)(mins ϱ(s))ϵ Lemma E.4 dictates the tuning of Hx, Hy, Hx, Hy = Θ 1 1 γ ln ℓF LF γ|S| (|A| + |B|) ϵ(1 γ)mins ϱ(s)µx εx = Θ (1 γ)(mins ϱ(s))ϵ |S|ℓF LF ℓλL4 λ = Θ ϵ mins ϱ(s) (1 γ)12 2LF ℓF |S| 7 2 γ(|A| + |B|) 11 = Θ 1 ℓF ℓ2 λLλ 4ℓF γ2|S| 3 2 (|A| + |B|)4 σ2 x = L4 F ℓ2 F |S|7 (|A| + |B|)11 ϵ2(mins ϱ(s))2(1 γ)30 L4 F ℓ2 F |S|7 (|A| + |B|)11 ϵ4(mins ϱ(s))4(1 γ)32 Solving Zero-Sum Convex Markov Games As for the inner-loop, we need to set ϵy = Θ µQG(1 γ)2(mins ϱ(s))2ϵ2 mins ϱ(s)5(1 γ)11ϵ3 ℓ2 F γ2|S|(|A| + |B|)3 εy = Θ µQG(1 γ)2(mins ϱ(s))2ϵ2 mins ϱ(s)5(1 γ)19ϵ3 |S| 7 2 LF ℓ4 F γ4(|A| + |B|)7 Now, we are ready to tune the number of inner-loop iterations, Ty, ℓUL2 λℓF γ2|S| 3 2 (|A| + |B|)4 (mins ϱ(s))6(1 γ)14ϵ2 ln ℓF LF γ|S| (|A| + |B|) (mins ϱ(s))(1 γ)ϵ and the batch-size, My, to be: ℓ10 F L2 F γ8|S|10 (|A| + |B|)22 (mins ϱ(s))16(1 γ)52ϵ8 9 2 F γ8|S| 29 4 (|A| + |B|) 2 (mins ϱ(s)) 13 E.4.2. CMG WITH STRONGLY-CONCAVE UTILITIES Theorem E.11. Assume a two-player zero-sum c MG with a utilities that are strongly concave with modulie µ1, µ2 > 0. Let ϵ > 0 be given. Then, a tuning of Algorithm 1 with, step-sizes, τx = Θ (mins ϱ(s))3(1 γ) 15 ℓ 5 2 F γ 5 2 |S| 5 2 (|A|+|B|) 15 and τy = Θ (1 γ)3 2ℓF γ|S| 1 2 (|A|+|B|) 3 2 exploration parameters εx = ϵ(mins ϱ(s))6µ3(1 γ)17 512LF |S|4ℓ5 F γ5(|A|+|B|) 17 and εy = Θ ϵ(mins ϱ(s))9µ 9 2 x (1 γ) 21 |S| 5 4 ℓ 3 2 F γ 1 2 (|A|+|B|) 9 4 batch-sizes Mx = Θ L4 F |S| 33 4 D2 X ℓ 27 2 (|A|+|B|) 89 ϵ2(mins ϱ(s))19µ 19 2 x (1 γ) 97 and My = Θ 4L2 F |S| 7 2 ℓ5 F γ3(|A|+|B|) 15 2 ϵ2(mins ϱ(s))22µ9xµ2y(1 γ)37 an outer-loop number of iterations at least Tx = Θ 16 2|S| 7 4 ℓ 7 2 F γ 7 2 (|A|+|B|) 21 mins ϱ(s)7µ 7 2 x (1 γ) 29 and inner loop iterations that are at least Ty = Θ 2ℓ2 F γ2|S|(|A|+|B|)3 (mins ϱ(s))4(1 γ)7µ2y will output an iterate xt , yt , such that: E U(xt , yt +1) min x X U(x , yt +1) ϵ; E max y Y U(xt, y ) EU(xt , yt ) ϵ. Proof. An optimality gap E Φ(x T +1) Φ ϵ for the utility function that is composed with the greedy exploration mapping, translates to an optimality error of the original utility function: EΦ(x T +1) Φ ϵ + |S| 1 2 LΦεx |A| + |S| 1 2 LΦεy |B| + 8|S| 1 2 ℓ2 ΦDX εx. Solving Zero-Sum Convex Markov Games First, we can set the number of iterations to be at least, ℓF ℓΦγ|S| 1 2 (|A| + |B|) (mins ϱ(s))4(1 γ)7µ2x 7 2 F γ 7 2 (|A| + |B|) 21 mins ϱ(s)7µ 7 2x (1 γ) 29 As such, we tune the exploration parameter εx to be: |S| 1 2 LU , ϵ 8|S| 1 2 ℓ2 ΦDX So we can set, εx = ϵ |S| 1 2 LUℓ2 ΦDX . Substituting yields: ϵ(mins ϱ(s))6µ3 (1 γ)17 LF |S| 9 2 ℓ5 F γ5 (A + |B|) ϵ(mins ϱ(s))6µ3 (1 γ)17 512LF |S|4ℓ5 F γ5 (|A| + |B|) L4 F |S| 33 2 (|A| + |B|) ϵ2(mins ϱ(s))19µ Further, we require that εy min |B|ϵ |S| 1 2 LΦ , ϵ 8|S| 1 2 ℓ2 UDY . In order to tune δx, we see that it is a sum of two terms, the bias of the gradient estimator and the expected distance of yt from y (xt) at each iterate t. We see that we need to set, δx c µqg,xϵ ℓΦ for some constant c > 0 sufficiently small. As such, we will control the horizon of the stochastic gradient estimator, Hx, to be, Hx = Θ 1 1 γ ln ℓF LF γ|S| (|A| + |B|) ϵ(1 γ)mins ϱ(s)µx For the inner loop, we need to ensure that E yt y (xt) µpl,xϵ ℓΦ . Hence, by the quadratic growth condition w.r.t. y, the optimality gap of the inner loop, ϵy, needs to be bounded as: ϵy = Θ ϵµpl,xµqg,y ϵ(mins ϱ(s))6(1 γ)14µ2 xµy ℓF γ2|S| 3 2 (|A| + |B|)4 To achieve such an optimality gap at the inner-loop on every outer-loop iteration t, by Claim E.5, we need to set εy = ϵµpl,xµqg,y |S| 1 2 ℓΦℓ2 U . 
ϵ(mins ϱ(s))9µ 3 2 F γ 1 2 (|A| + |B|) the latter leads to the bound on σ2 y, L2 F |S| 5 2 ℓ3 F γ (|A| + |B|) ϵ2(mins ϱ(s))18µ9x (1 γ)27 while, the previous bound calls for a batch-size, 4L2 F |S| 7 2 ℓ5 F γ3 (|A| + |B|) ϵ2(mins ϱ(s))22µ9xµ2y (1 γ)37 Solving Zero-Sum Convex Markov Games Finally, the horizon of the inner-loop gradient estimator will be set to be: Hx = Θ 1 1 γ ln ℓF LF γ|S| (|A| + |B|) ϵ(1 γ)mins ϱ(s)µPLµy while the step-size needs to be: 2ℓF γ|S| 1 2 (|A| + |B|) and the total number of iterations needs to be at least, µpl,y ln γLF ℓF |S| (|A| + |B|) mins ϱ(s)(1 γ)ϵ 2ℓ2 F γ2|S| (|A| + |B|)3 (mins ϱ(s))4(1 γ)7µ2y E.5. Alternating Policy Gradient Descent-Ascent E.5.1. CMG WITH CONCAVE UTILITIES Theorem E.12. Let ϵ > 0 be a desired accuracy. After at most T = Θ ℓ4 F L4 F γ8|S|6(|A|+|B|)19 (1 γ)49(mins ϱ(s))11ϵ6 iterations of Algorithm 2, with step-sizes, τx = Θ (mins ϱ(s))9(1 γ)20ϵ3 ℓ2 F γ5|S|3(|A|+|B|) 17 , and τy = Θ (1 γ)8 4ℓF γ2|S| 3 2 (|A|+|B|)4 exploration parameters εx = Θ mins ϱ(s) (1 γ)12ϵ 2LF ℓF |S| 7 2 γ(|A|+|B|) 11 and εy = Θ mins ϱ(s)(1 γ)23ϵ 16LF |S| 11 2 ℓ2 F γ4(|A|+|B|)11 batch-sizes, Mx = Θ L4 F ℓ2 F |S|7(|A|+|B|)11 (mins ϱ(s))4(1 γ)32ϵ4 , and My = Θ ℓ7 F γ14|S| 31 2 (|A|+|B|)34 (mins ϱ(s))10(1 γ)85ϵ5 sampling horizons Hx, Hy = Θ 1 (1 γ) ln γℓF LF |S|(|A|+|B|) (1 γ)(mins ϱ(s))ϵ , there exists an iterate t , such that EU(xt , yt ) Φ ϵ; EΦ(xt ) EU(xt , yt ) ϵ. Proof. We invoke Theorem D.7 and tune the parameters appropriately using Lemma E.4 and claims E.5 and E.6. We tune the coefficient of the regularizer first. Needing to achieve an (1 γ)(mins ϱ(s))ϵ-first order stationary point, we tune the regularizer as: µreg. = Θ (1 γ)(mins ϱ(s))ϵ We compute the p PL modulus given the regularizer tuning, (mins ϱ(s))6(1 γ)14ϵ2 ℓF L2 F γ2|S| 1 2 (|A| + |B|)4 Further, by claims E.6 to E.8, we see that εx, εy need to be tuned as, εx = Θ (1 γ)(mins ϱ(s))ϵ |S|ℓF LF ℓλL4 λ = Θ ϵ mins ϱ(s) (1 γ)12 2LF ℓF |S| 7 2 γ(|A| + |B|) 11 Solving Zero-Sum Convex Markov Games mins ϱ(s)(1 γ)23ϵ 16LF |S| 11 2 ℓ2 F γ4(|A| + |B|)11 The resulting bounds on the variance are: σ2 x = L4 F ℓ2 F |S|7 (|A| + |B|)11 ϵ2(mins ϱ(s))2(1 γ)30 , σ2 y = Θ 16L4 F |S|11ℓ4 F γ8(|A| + |B|)22 mins ϱ(s)(1 γ)52ϵ2 The latter dictates the batch-sizes, L4 F ℓ2 F |S|7 (|A| + |B|)11 ϵ4(mins ϱ(s))4(1 γ)32 My = Θ κ2σ2 y = Θ ℓ3 F γ6|S| 9 2 (|A| + |B|)12σ2 y (mins ϱ(s))9(1 γ)33ϵ3 ℓ7 F γ14|S| 31 2 (|A| + |B|)34 (mins ϱ(s))10(1 γ)85ϵ5 The step-size is tuned straightforwardly as, (mins ϱ(s))9(1 γ)20ϵ3 ℓ2 F γ5|S|3 (|A| + |B|) By Lemma E.4 we see easily that Hx, Hy need to be, Hx, Hy = Θ 1 (1 γ) ln γℓF LF |S| (|A| + |B|) (1 γ)(mins ϱ(s))ϵ After we have tuned the regularizer, we can compute the upper bound on the number of iterations: ℓ4 F L4 F γ8|S|6 (|A| + |B|)19 (1 γ)49(mins ϱ(s))11ϵ6 Now, tuning ηy is also straightforward, = Θ 1 ℓF ℓ2 λLλ 4ℓF γ2|S| 3 2 (|A| + |B|)4 E.5.2. CMG WITH STRONGLY-CONCAVE UTILITIES Theorem E.13. Assume ϵ > 0 and a two-player zero-sum c MG with utilities that are strongly concave with moduli µ1, µ2. 
Then, if the two players follow Algorithm 2 with exploration parameters εx, εy = Θ( ϵ (min_s ϱ(s))^4 (1−γ)^15 min{µ_x^2, µ_y^2} / (4 L_F |S|^{5/2} ℓ_F^2 γ^2 (|A|+|B|)^4) ), step-sizes τx = Θ( µ_2^4 (min_s ϱ(s))^8 (1−γ)^30 / (32 ℓ_F^2 γ^2 |S| (|A|+|B|)) ) and τy = (1−γ)^3 / (2 ℓ_F γ |S|^{1/2} (|A|+|B|)^{3/2}), and batch-sizes Mx = Θ( 16 L_F^4 |S|^{13/2} ℓ_F^5 γ^6 (|A|+|B|)^{12} / (ϵ^2 (min_s ϱ(s))^{20} (1−γ)^{42} min{µ_x^6, µ_y^6}) ) and My = Θ( 16 L_F^4 |S|^{19/2} ℓ_F^5 γ^{10} (|A|+|B|)^{20} / (ϵ^2 (min_s ϱ(s))^{28} (1−γ)^{50} min{µ_x^{10}, µ_y^{10}}) ), then it is the case that
E U(x_T, y_T) − Φ* ≤ ϵ; E Φ(x_T) − E U(x_T, y_T) ≤ ϵ,
after at most T iterations, with
T = Θ( (4 ℓ_F γ^6 |S|^{9/2} (|A|+|B|)^{12}) / ((min_s ϱ(s))^{12} (1−γ)^{36} µ_1^2 µ_2^4) · log( L_F ℓ_F γ |S| (|A|+|B|) / ((min_s ϱ(s)) (1−γ) µ_1 µ_2) ) ).
Proof. Consider the function U_w(x, y) := U(w(x; εx, |A|), w(y; εy, |B|)), where w(z; ε, m) := (1−ε)z + ε·1/m. Claims E.6 and E.7 ensure that we can bound the optimality gap on the original function U by running Algorithm 2 on U_w. Hence, combining the aforementioned claims with Theorem D.8 and Lemma E.3, we see that we need to set:
εx, εy = Θ( ϵ (min_s ϱ(s))^4 (1−γ)^15 min{µ_x^2, µ_y^2} / (4 L_F |S|^2 ℓ_F^2 γ^2 (|A|+|B|)^4 (D_X + D_Y)) ).
The resulting variances will be:
σ_x^2, σ_y^2 = Θ( 16 L_F^4 |S|^4 ℓ_F^4 γ^4 (|A|+|B|)^8 (D_X + D_Y)^2 / (ϵ^2 (min_s ϱ(s))^8 (1−γ)^{30} min{µ_x^4, µ_y^4}) ).
We will control the resulting variances using batches of the following size, which also counter the µ_1, µ_2 factors:
Mx = Θ( 16 L_F^4 |S|^{13/2} ℓ_F^5 γ^6 (|A|+|B|)^{12} / (ϵ^2 (min_s ϱ(s))^{20} (1−γ)^{42} min{µ_x^6, µ_y^6}) );
My = Θ( 16 L_F^4 |S|^{19/2} ℓ_F^5 γ^{10} (|A|+|B|)^{20} / (ϵ^2 (min_s ϱ(s))^{28} (1−γ)^{50} min{µ_x^{10}, µ_y^{10}}) ).
We can easily see that the sampling horizons need to be:
Hx, Hy = Θ( (1/(1−γ)) ln( ℓ_F L_F γ |S| (|A|+|B|) / (ϵ (1−γ) min_s ϱ(s) µ_x µ_y) ) ).
The step-sizes are tuned to be:
τx = µ_{2,H}^4 (min_s ϱ(s))^8 (1−γ)^{30} / (32 ℓ_F^2 γ^2 |S| (|A|+|B|)); τy = (1−γ)^3 / (2 ℓ_F γ |S|^{1/2} (|A|+|B|)^{3/2}).
Finally, the iteration complexity needs to be at least:
T = Θ( (4 ℓ_F γ^6 |S|^{9/2} (|A|+|B|)^{12}) / ((min_s ϱ(s))^{12} (1−γ)^{36} µ_{1,H}^2 µ_{2,H}^4) · log( L_F ℓ_F γ |S| (|A|+|B|) / ((min_s ϱ(s)) (1−γ) µ_{1,H} µ_{2,H}) ) ).
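To make the sampling scheme of Section E.2 concrete, the following is a minimal sketch of the truncated-horizon occupancy estimate λ̂1 and of the policy gradient estimate ĝx from Equation (40) under the ε-greedy parametrization; the trajectory format, array layout, and helper names are illustrative assumptions, and the pseudo-rewards are taken to be z = ∇_{λ1} F(λ̂_{1,t}) as in the construction above.

```python
import numpy as np

def occupancy_estimate(traj, n_states, n_actions, gamma):
    """Truncated-horizon occupancy estimate built from one sampled
    trajectory traj = [(s_0, a_0), (s_1, a_1), ...]."""
    lam = np.zeros((n_states, n_actions))
    for h, (s, a) in enumerate(traj):
        lam[s, a] += gamma ** h  # discounted indicator count
    return lam

def policy_gradient_estimate(traj, x, pseudo_reward, gamma, eps):
    """REINFORCE-style estimate of Equation (40): pseudo_reward plays the
    role of z (e.g., grad_{lambda_1} F at the occupancy estimate) and the
    executed policy is eps-greedy, pi_x = (1 - eps) x + eps / |A|."""
    n_states, n_actions = x.shape
    pi = (1.0 - eps) * x + eps / n_actions
    grad = np.zeros_like(x)
    score_sum = np.zeros_like(x)  # running sum of grad_x log pi_x(a_h'|s_h')
    for h, (s, a) in enumerate(traj):
        score = np.zeros_like(x)
        score[s, a] = (1.0 - eps) / pi[s, a]  # d log pi_x(a|s) / d x(s, a)
        score_sum += score
        grad += (gamma ** h) * pseudo_reward[s, a] * score_sum
    return grad
```

In a batched implementation, Mx independent trajectories would be sampled and the resulting estimates averaged, as in Equation (40).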