# anytimeconstrained_equilibria_in_polynomial_time__871dad7d.pdf

Anytime-Constrained Equilibria in Polynomial Time

Jeremy Mc Mahan 1

Abstract We extend anytime constraints to the Markov game setting and the corresponding solution concept of anytime-constrained equilibrium (ACE). Then, we present a comprehensive theory of anytime-constrained equilibria that includes (1) a computational characterization of feasible policies, (2) a fixed-parameter tractable algorithm for computing ACE, and (3) a polynomial-time algorithm for approximately computing ACE. Since computing a feasible policy is NP-hard even for two-player zero-sum games, our approximation guarantees are the best possible so long as P = NP. We also develop the first theory of efficient computation for action-constrained Markov games, which may be of independent interest.

1. Introduction

Although multi-agent reinforcement learning (MARL) has made many breakthroughs in game-playing, the literature has long since advocated the importance of constraints in real-world applications (Gu et al., 2023b). Despite their importance, the literature on constrained MARL is far behind the state-of-the-art in the single-agent setting. Most recently, almost-sure (Castellano et al., 2022) and anytime (Mc Mahan and Zhu, 2024; Mc Mahan, 2024) constraints have emerged in the single-agent setting to capture real-world scenarios, such as medical applications (Coronato et al., 2020; Paragliola et al., 2018; Kolesar, 1970), disaster relief scenarios (Fan et al., 2021; Wu et al., 2019; Tsai et al., 2019), and resource management (Mao et al., 2016; Li et al., 2018; Peng and Shen, 2021; Bhatia et al., 2021). However, many of these motivating applications are actually multi-agent problems. Most obviously, anytime-compliant autonomous vehicles (Mc Mahan and Zhu, 2024; Shalev-Shwartz et al., 2016; Wu et al., 2019) must interact with other vehicles, which is a key aspect of MARL (Chu et al., 2020; Dinneweth et al.,

1Department of Computer Science, University of Wisconsin Madison, Wisconsin, USA. Correspondence to: Jeremy Mc Mahan <jmcmahan@wisc.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

2022; Wiering, 2000). Despite their relevance, anytime constraints have yet to be studied in the multi-agent setting, which we remedy in this work.

Formally, we consider a constrained Markov game (c MG) G with a budget vector B. A joint policy π satisfies an anytime constraint if every player i s accumulated cost is at most Bi at all times: Pπ G[ h [H], Ph t=1 ci,t Bi] = 1. Given such a constraint, the natural solution concept is the anytime-constrained equilibrium (ACE). At a high level, an ACE is a feasible joint policy π for which no player can gain a higher value from any feasible deviation. Our main question is as follows:

For what class of c MGs can ACE be computed (approximately) in polynomial time?

Already in the single-agent setting, Mc Mahan and Zhu (2024) showed computing optimal anytime-constrained policies is NP-hard. The situation for games is even worse: we show that even for simple two-player zero-sum anytime-constrained MGs, computing a feasible policy is NP-hard. Furthermore, as shown in (Mc Mahan and Zhu, 2024), expectation-constrained policies can arbitrarily violate anytime constraints, implying that standard expectationconstrained approaches fail. Moreover, even for expectationconstrained MGs, efficient algorithms are unknown outside of regret settings (Chen et al., 2022; Ding et al., 2023), which have different constraint requirements. Lastly, typical distributed learning and self-play approaches fail since the feasibility of a player s action generally depends on the choices of others.

Past Work. The only known efficient algorithms for anytime constraints also fail to generalize to the multi-agent setting. The approach designed in (Mc Mahan, 2024) only applies to one constraint. On the other hand, the approach from (Mc Mahan and Zhu, 2024) requires state augmentation and state-dependent action spaces. Although simple in the single-agent setting, the coupled nature of the players constraints induces state-dependent action spaces that do not form product spaces. Consequently, each stage game becomes a non-normal-form game for which efficient solvers are unknown. Moreover, their approach utilizes a relaxed augmented-state space that relies on to identify infeasible states. However, the non-uniqueness of equilibria in

Anytime-Constrained Equilibria in Polynomial Time

games could allow solutions, which should indicate infeasibility, even when a feasible equilibrium exists.

Our Contributions. We present a comprehensive theory of anytime-constrained MGs, which includes (1) a computational characterization of feasible policies, (2) a fixedparameter tractable (FPT) (Downey and Fellows, 2012) algorithm for computing subgame-perfect ACE, and (3) a polynomial time algorithm for computing approximately feasible subgame-perfect ACE. Notably, our FPT algorithm runs in polynomial time so long as the supported costs require small precision. Similarly, our approximation algorithm runs in polynomial time so long as the maximum supported cost is bounded by a polynomial factor of the budget. Given our hardness results, our algorithmic guarantees are the best possible in the worst case. Along the way, we develop efficient algorithms for constrained games, culminating in a theory of action-constrained MGs, which may be of independent interest.

Each of our main results utilizes a different algorithmic technique. For (1), we view a policy s set of feasibly realizable histories as a directed graph and then construct an AND/OR tree whose TRUE subtree is the union of all feasible policy graphs. For (2), we show that the subgameperfect ACE of a c MG corresponds to the Markov-perfect equilibria of an action-constrained MG. Then, we design efficient algorithms for solving action-constrained games by solving a sequence of linear programs (LP). For (3), we use a combination of cost truncation and rounding to derive an approximate game whose solutions are approximately feasible equilibria for the original c MG. Then, we show that the approximate game s subgame-perfect ACE is computable in polynomial time.

1.1. Related Work

Constrained MARL. Ever since constrained equilibria were introduced (Altman and Shwartz, 2000), most works have focused on learning in the regret setting (Altman and Shwartz, 2000; Chen et al., 2022; Gattami et al., 2021; Ding et al., 2023; Jordan et al., 2024). Outside of these works, the literature has focused on single-agent-constrained Markov Decision Processes (c MDP). It is known that c MDPs can be solved in polynomial time using linear programming (Altman, 1999), and many interesting planning and learning algorithms have been developed for them (Paternain et al., 2019; Vaswani et al., 2022; Borkar, 2005; Hasanzade Zonuzy et al., 2021). Many learning algorithms can even avoid violation during the learning process under certain assumptions (Wei et al., 2022; Bai et al., 2023). Furthermore, Brantley et al. (2020) developed no-regret algorithms for c MDPs and extended their algorithms to the setting with a constraint on the cost accumulated over all episodes, which is called a knapsack constraint (Brantley et al., 2020; Che-

ung, 2019).

Safe MARL. Most works implement safety using some constraints (Garc ıa et al., 2015), and several multi-agent works exist down this line (Shalev-Shwartz et al., 2016; Gu et al., 2023a; Elsayed-Aly et al., 2021). The single agent setting has mainly focused on no-violation learning for c MDPs (Chow et al., 2018; Bossens and Bishop, 2022; Gu et al., 2023b) and solving CCMDPs (Wang et al., 2023; Gu et al., 2023b), which capture the probability of entering unsafe states. Performing learning while avoiding dangerous states has also been studied (Roderick et al., 2021; Thomas et al., 2021; Zhao et al., 2023) under non-trivial assumptions.

2. Equilibria

Constrained Markov Games. A (tabular, finite-horizon) n-player Constrained Markov Game (c MG) is a tuple G = (S, A, P, R, C, H), where (i) S is the finite set of states, (ii) A = A1 An is the finite set of joint actions, (iii) Ph(s, a) (S) is the transition distribution, (iv) Rh(s, a) (Rn) is the reward distribution, (v) Ch(s, a) (Rn) is the cost distribution, and (vi) H is the finite time horizon. To simplify notation, we let rh(s, a) def = E[Rh(s, a)] denote the expected reward, S def = |S| denote the number of states, A def = |A| denote the number of joint actions, [H] def = {1, . . . , H}, and |G| be the total description size of the c MG.

Interaction Protocol. The agents interact with G using a joint policy π = {πh}H h=1. In the fullest generality, πh : Hh (A) is a mapping from the observed history at time h (including costs) to a distribution of actions. Often, researchers study Markovian policies, which take the form πh : S (A), and product policies, which take the form π = {πi}n i=1, where each πi is an independent policy for player i.

The agents start at an initial state s1 S with observed history τ1 = (s1). For any h [H], the agents choose a joint action ah πh(τh). Then, the agents receive immediate reward vector rh Rh(s, a) and cost vector ch Ch(s, a). Lastly, G transitions to state sh+1 Ph(sh, ah) and the agents update their observed history to τh+1 = (τh, ah, ch, sh+1). This process is repeated for H steps; the interaction ends once s H+1 is reached.

Anytime Constraints. Suppose the agents have a budget vector B Rn. We say a joint policy π satisfies anytime constraints if,

Anytime-Constrained Equilibria in Polynomial Time

Here, Pπ G denotes the probability law over histories induced from the interaction of π with G, and all vector operations are performed component-wise. If G only has anytime constraints, which will be the case in this work, we call G an anytime-constrained Markov game (ac MG). We refer to any policy π satisfying (ANY) as feasible for G, and let ΠG denote the set of all feasible policies for G. Remark 2.1 (Extensions). Our results can also handle multiple constraints per agent, infinite discounting, and the weaker class of almost sure constraints. We defer the details to the appendix.

Solution Concepts. Solutions to games traditionally take the form of equilibrium. In the MARL realm, the most popular notions include the Nash equilibrium (NE), correlated equilibrium (CE), and course-correlated equilibrium (CCE). Given constraints, the key difference is a focus on feasible policies. Infeasible policies lead to disastrous outcomes for an agent. Thus, not only should a constrained equilibrium be feasible, but agents should only consider deviating if doing so would be feasible.

Definition 2.2 (Anytime-Constrained Equilibria). We call a joint policy π an anytime-constrained equilibrium (ACE) for an ac MG G if (1) π ΠG and (2) for all players i [n] and potential deviation policies π i, either,

(π i, π i) ΠG OR V π i V π i,π i i . (ACE)

Here, V π i def = Eπ G h PH t=1 ri,t i denotes i s value from interacting with G under π, and Eπ G denotes the expectation defined by the law Pπ G. Lastly, we call π an anytimeconstrained Nash equilibrium (ACNE) for G if π is additionally a product policy.

Remark 2.3 (Correlated Equilibrium). Our definition of ACE in Definition 2.2 technically corresponds to anytimeconstrained course-correlated equilibria (ACCCE), which we simplify to ACE for exposition purposes. Our results apply equally well to anytime-constrained correlated equilibria (ACCE). We delay the definition and discussion of ACCE to the appendix. In the main text, we signal when results specialize to each equilibria type by writing ACE (NE/CE/CCE).

Stage Games. It is often useful to consider refinements of equilibrium notions that are more structured and robust. The classical refinement for sequential games is the subgameperfect equilibrium (SPE). An SPE policy is required to behave optimally under any history, also called subgames, even if those subgames are not realizable. That way, players could still recover if any deviated from the policy s suggestion. In the constrained setting, we only consider feasible subgames, i.e., subgames that are realizable by some feasible policy. This means the players could still adapt and

finish the game whenever a player takes an unsupported but feasible action.

Formally, we let Hπ h def = {τh Hh | Pπ G[τh] > 0} denote the subset of partial histories at time h that are realizable by a policy π, and let Fh def = S

π ΠG Hπ h denote the set of partial histories at time h realizable by some feasible policy. For any feasible subgame τh Fh, we let,

ΠG(τh) def =

t=1 ct B | τh

(SUB) denote the set of feasible policies for the subgame τh. To capture our earlier intuition, we require an anytimeconstrained SPE to be a policy that is feasible for any feasible subgame, and that beats any feasible deviation for that subgame.

Definition 2.4 (Anytime-Constrained Subgame-Perfect Equilibria). We call a joint policy π an anytime-constrained subgame-perfect equilibrium (ACSPE) for an ac MG G if for all times h [H + 1], and all feasibly-realizable histories τh Fh, π satisfies (1) π ΠG(τh) and (2) for all players i [n], and potential deviation policies π i, either,

(π i, π i) ΠG(τh) OR V π i,h(τh) V π i,π i i,h (τh). (ACSPE) Here, V π i,h(τh) def = Eπ G h PH t=h ri,t | τh i denotes i s value from time h onward conditioned under history τh. Lastly, we call π an anytime-constrained subgame-perfect Nash equilibrium (ACSPNE) for G if π is additionally a product policy.

Next, we show that ACSPE exists whenever a feasible policy exists. Here, we require the cost distributions to have finite support. We relax this assumption for our approximation algorithms later in Section 6.

Assumption 2.5 (Finite Cost Support). Throughout Section 2 - Section 5, we assume each Ch(s, a) has finite support.

Proposition 2.6 (Existence). For any ac MG G, the following are equivalent,

1. G admits an ACE (CE/CCE),

2. G admits an ACSPE (CE/CCE), and

3. G admits a feasible policy.

The equivalence also holds for ACNE so long as G admits a feasible product policy.

Although Proposition 2.6 implies equilibria exist under very minimal assumptions, they are generally hard to compute. As shown in (Mc Mahan and Zhu, 2024), feasible policies

Anytime-Constrained Equilibria in Polynomial Time

are generally not Markovian nor product policies. Moreover, just determining whether there exists a feasible policy is NPhard even for the simplest of ac MGs.

Proposition 2.7 (Hardness). Determining if a feasible policy exists for an ac MG is NP-hard even for the restricted class of two-player zero-sum ac MGs for which S = 1, A = 2, and cost functions are deterministic mappings to non-negative integers.

3. Feasibility

Before we can fully understand ACE, we must first understand feasible policies. This section derives characterizations of feasibly realizable histories under anytime constraints. This leads us to design an algorithm that determines if a feasible policy exists, while also producing the set of all feasibly realizable cumulative costs and actions. These sets will be critical to our later equilibria computation in Section 4.

First, observe that for a policy π to be feasible, all histories realizable under π must obey the budget. If τh = (s1, a1, c1, . . . , sh) Hh is any partial history, we let ch def = Ph 1 t=1 ct denote the vector of cumulative costs induced by the history. Given this notation, we see that π ΠG if and only if for all h [H + 1] and all τh Hπ h, it holds that ch B.

History Translation. Consequently, we only need to consider the (state, cost)-pairs induced by a history to determine feasibility. In particular, we can focus on τh def = ((s1, 0), a1, (s2, c1), a2, . . . , (sh, ch)), which denotes τh written in (state, cost)-form. Observe that for any τh, the translation of τh to (state, cost)-form is well-defined and unique. Specifically, given τh, we can infer any immediate cost ck uniquely by ck = ck+1 ck, and given τh, we can infer ck uniquely by ck = Pk 1 t=1 ct. Given this equivalence, we focus on characterizing the following sets.

Definition 3.1 (Feasible Sets). We define the set of state, cumulative cost pairs realizable by a feasible policy at time h by, FSh def = [

τh Hπ h {(sh, ch)} . (1)

We define the set of actions taken by some feasible policy at a pair (s, c) FSh by,

FAh(s, c) def = [

τh+1 Hπ h+1, (sh, ch)=(s, c)

Realizability Graphs. Now, imagine that the game terminates prematurely should (1) the agents ever choose an action that could immediately violate the budget or (2) reach

a point where all available actions lead to immediate violation. Under this interpretation, it is easy to see that a policy is feasible if and only if it always reaches time H + 1 under any realization. We can utilize this intuition constructively through the idea of realizability graphs.

For a given policy π, we define its realizability graph Rπ def = (Vπ, Eπ) to be the directed acyclic multi-graph satisfying (i) Vπ is the set of (time, state, cost)-triples feasibly-realizable under π, and (ii) Eπ is the set of all feasible one-step (time, state, cost)-evolutions under π. We also label each edge by the action responsible for that evolution. Then, any πrealizable feasible history τh corresponds to the labeled path P = ((1, s1, c1), a1, . . . , (h, sh, ch)) in Rπ. Thus, π is feasible if and only if all sink nodes in Rπ are distance H from the source node (1, s1, 0).

Algorithmic Approach. Overall, we can determine if ΠG = by proving the existence of a realizability graph whose sinks are all distance H from the source. Furthermore, we can compute all feasibly realizable histories by computing the union of all feasible policy realizability graphs. We accomplish this by constructing a single graph containing all feasible realizability graphs and then pruning it to match the union.

Our graph is generated by iteratively taking feasible actions. If no sequence of feasible actions ever reaches time H + 1, then naturally no feasible policy exists. However, we must also ensure all branches generated from an action reach time H + 1 to satisfy the anytime constraints. We deal with this difficulty by making each action an AND node. On the other hand, for a (time, state, cost)-triple to be feasible, there need only be a single action that ensures time H + 1 is reached. Thus, we make each triple an OR node. We will later show that a TRUE subgraph corresponds to our desired solution.

Definition 3.2 (Feasibility Tree). We iteratively define an AND/OR tree T (Martelli and Montanari, 1973). We define the root node to be (1, s1, 0). For any time h [H], and node (h, s, c) VT , we call a A a feasible action if no realization under a leads to immediate violation, i.e. Prc Ch(s,a)[ c + c B] = 1. For any feasible action

a, we create a new AND node u def = (h, s, c, a) and edge (h, s, c) (h, s, c, a). For every s Supp(Ph(s, a)) and c Supp(Ch(s, a)), we also create a new OR node w def = (h + 1, s , c + c) and edge (h, s, c, a) (h + 1, s , c + c). We label any leaf node of the form (H + 1, s, c) as TRUE and any leaf node of the form (h, s, c) for h < H + 1 as FALSE.

Since any feasible policy can only take feasible actions by definition, any feasibly realizable history appears in (state, cost)-form as a path in T . Moreover, conditionally feasible histories appear as superpaths in T .

Anytime-Constrained Equilibria in Polynomial Time

Algorithm 1 Feasibility Require: G

1: T Definition 3.2(G) 2: AOSOLVE(T) 3: if (1, s1, 0) is FALSE then 4: return Infeasible 5: end if 6: for h 1 to H do 7: RSh and RAh( ) 8: for TRUE (h, s, c) VT do 9: RSh RSh {(s, c)} 10: for TRUE (h, s, c, a) E+ T (v) do 11: RAh(s, c) RAh(s, c) {a} 12: end for 13: end for 14: end for 15: return ({RSh}h, {RAh( s)}h, s)

Lemma 3.3. For any time h [H + 1], and any feasibly-realizable history τh Fh, there exists a unique path Pτh T satisfying Pτh = ((1, s1, c1), (1, s1, c1, a1), . . . , (h, sh, ch)). Moreover, if τk is any suphistory of τh realizable by some π ΠG(τh), then Pτh Pτk.

Thus, if any feasible history exists, then there exists at least one length H path in T . However, always taking actions that are feasible at the current time is a greedy strategy that may not guarantee reaching time H + 1. Consequently, T contains many infeasible paths as well. On the bright side, we can show that any infeasible path must contain a FALSE node.

Lemma 3.4. If P = ((1, s1, 0), (1, s1, 0, a1), . . . , (h, sh, ch)) T is any path ending at a FALSE node, then τh = (s1, a1, c1, . . . , sh) is not realized by any feasible policy. Moreover, if τk is any feasible subhistory of τh, then τh is not realizable by any π ΠG(τk).

Pruning. Overall, we see the subtree of TRUE nodes of T is exactly the union of all feasible policy realizability graphs. Computing the TRUE nodes for an AND/OR tree can be done in linear time using standard tree recursion (Martelli and Montanari, 1973). Suppose that AOSOLVE is any such AND/OR tree solver. Then, we can compute the FS and FA sets by implicitly pruning the FALSE nodes from T . The full procedure is described in Algorithm 1.

Proposition 3.5. For any ac MG G, Algorithm 1(G) outputs Infeasible if ΠB G = and otherwise outputs ({FSh}h, {FAh( s)}h, s). Moreover, Algo-

rithm 1(G) runs in time O((HSADG)2), where DG def = | S

τh Hh { ch | ch B} |.

4. Reduction

As hinted in the previous section, we can convert the anytime constraint on full histories into a per-time constraint on the available actions. Specifically, if the agents track their cumulative costs, they can identify actions that satisfy the constraint long term. These actions exactly correspond to those in FAh(s, c).

Then, the agents can convert their anytime-constrained MG G into a traditional Markov game G with non-stationary state space FSh and non-stationary, state-dependent action space FAh(s, c). Importantly, FAh(s, c) may induce nonnormal-form subgames because the exclusion of infeasible joint actions can cause FAh(s, c) not to take the form of a product space such as A1 An. Consequently, G is an action-constrained Markov game.

Definition 4.1. We define an action-constrained Markov game G def = ( S, A, P, R, H) where,

1. Sh def = FSh,

2. Ah(s, c) def = FAh(s, c),

3. Ph((s , c + c) | (s, c), a) def = Ch(c | s, a)Ph(s | s, a) and,

4. Rh((s, c), a) def = Rh(s, a) whenever a Ah(s, c) and Rh( | (s, c), a) = 1 otherwise.

In addition, we define G s initial state to be s1 def = (s1, 0).

For typical MGs, the standard solution concept is the Markov-perfect equilibrium. For action-constrained MGs, we use the same solution concept, but add the condition that the policy must only support actions in the constrained action sets.

Definition 4.2 (Markov-Perfect Equilibria). For an actionconstrained MG G, we say a Markovian policy π is allsubgame-feasible or simply feasible in this work if πh( s) ( Ah( s)) for all times h [H] and all states s S. We let ΠG denote the set of all feasible policies for G. Then, a Markov-perfect equilibrium (MPE) for G is a Markovian joint policy π satisfying (1) π ΠG, and (2) for all players i [n], all times h [H], all partial histories τh Hh, and all potential deviation policies π i, either,

(π i, π i) ΠG OR V π i,h( sh) V π i,π i i,h ( τh). (MPE) Here, V π i,h( s) def = Eπ

h PH t=h ri,t | sh = s i denotes i s ex-

pected value in G under π from time h onward conditioned on starting at state s. Lastly, we call π a Markov-perfect Nash equilibrium (MPNE) for G if π is additionally a product policy.

Anytime-Constrained Equilibria in Polynomial Time

Algorithm 2 Reduction Require: (G)

1: x Algorithm 1(G) 2: if x = Infeasible then 3: return Infeasible 4: end if 5: Construct G Definition 4.1(G) 6: π MGSOLVE(G) 7: return π

Augmented Policies. Any Markovian policy π for G can be viewed as a history-dependent policy for G represented in a compact form. In particular, by definition of P, we see that the c part of G s state space always corresponds to the current cumulative cost vector. Thus, π is equivalent to the history-dependent policy π formed by π h(τh) def = πh(sh, ch), and can be used directly in G just by feeding in the pair (sh, ch) to π to generate the next joint action.

Moreover, if π is feasible for G, we see that since Ah(s, c) = FAh(s, c) by definition, it must be the case that π is feasible for G by Proposition 3.5. Although less obvious, we also show that any MPE for G is an ACE for G.

Lemma 4.3 (Equilibria). Any MPE (NE/CE/CCE) for G is an ACSPE (NE/CE/CCE) for G.

Then, we can compute an ACSPE for G or determine that none exists by attempting to find an MPE for G. We summarize the full reduction in Algorithm 2.

Theorem 4.4 (Reduction). For any ac MG G, if MGSOLVE can compute a feasible MPE (NE/CE/CCE) for any feasible action-constrained Markov game, then Algorithm 2(G) correctly outputs Infeasible if ΠG = and outputs an ACSPE (NE/CE/CCE) π, otherwise. Moreover, if MGSOLVE runs in time O(poly(|G|)), then Algorithm 2(G) runs in time O(poly(|G|, DG)), and any output policy can be stored with O(HSADG) space.

5. Computation

In the last section, we showed how to reduce our anytime-constrained game problem to an action-constrained game problem. However, efficient algorithms for actionconstrained games are currently unknown. In this section, we remedy this knowledge gap by designing efficient algorithms for computing MPE of an action-constrained MG. Moreover, we show that feeding our action-constrained method into our reduction yields a polynomial time algorithm for computing ACE so long as the cost precision is logarithmic.

We take a backward induction approach similar to other planning algorithms for Markov games. Unlike traditional

MGs, here, we must iteratively solve action-constrained matrix stage games. Then, we can combine the constructed policies for each stage to solve the full game. We prove the correctness of this algorithm by deriving a novel theory of equilibria in action-constrained Markov games.

Matrix Games. The key bottleneck to this backward induction approach is solving the action-constrained matrix games. Formally, let (A, X, u) denote an action-constrained matrix game, where (i) A is the joint action space, (ii) X A is the set of feasible actions, and (iii) u is the utility function. We tackle this problem by devising a variation of the standard CE/CCE LP. Importantly, we modify the constraint P a A σ(a) = 1, which ensures the total probability mass of all joint actions equals one, into the constraint P

a X σ(a) = 1, which ensures the support of the joint strategy is contained in the valid joint action space. We also define the utility of any infeasible action to be so that infeasible deviations will be appropriately ignored by the LP. The full definition of the LP, which has no objective function, is, X

a X σ(a) (ui(a) ui(a i, a i)) 0, i, a i Ai

a X σ(a) = 1,

σ(a) 0 a X (CLP)

Lemma 5.1. If (A, X, u) is any action-constrained matrix game and σ is any solution to (CLP)(A, X, u), then for any player i [n], and deviation σ i (Ai) for which

σ def = (σ i, σ i) (X), we have that Ea σ[ui(a)] Ea σ [ui(a)]. Moreover, if X = , then there exists a solution to (CLP)(A, X, u).

Markov Games. To use (CLP) in our backward induction, it will be useful to represent stage games with the Q matrix. Formally, for any given partial policy π, any time h [H], and any state s S, we define the stage game to be the matrix game Qh(s) whose utility for any player i [n] under joint action a A is defined by,

Qπ i,h( s, a) def =

( rh( s, a) + P

s Ph( s | s, a) V π i,h+1( s ) if a Ah( s) .

(Q) Then, given some LP feasibility algorithm LPSOLVE, we use these ideas to solve action-constrained MGs in Algorithm 3.

Theorem 5.2 (Constrained Solver). For any feasible actionconstrained MG G, if LPSOLVE is a polynomial-time linearprogram feasibility solver, then Algorithm 3(G) correctly outputs a feasible MPE (CE/CCE) in polynomial time.

Anytime-Constrained Equilibria in Polynomial Time

Algorithm 3 Constrained Solver

1: V π i,H+1( s) 0 for all i [n] and s SH+1 2: for h = H down to 1 do 3: for s Sh do 4: Qπ i,h( s, a) (Q) for each i [n] and a A 5: π LPSOLVE((CLP)(A = A, X = Ah( s), u = Qπ h( s))) 6: V π i,h( s) P

a πh( a | s)Qπ i,h( s, a) 7: end for 8: end for 9: return π

Reduction Analysis. Given our efficient actionconstrained game solver, we can now finish analyzing the running time of our reduction. Since Theorem 5.2 implies that Algorithm 3 runs in time poly(|G|), Theorem 4.4 implies that Algorithm 2 with MGSOLVE = Algorithm 3 runs in time poly(|G|, DG). The main issue is that DG can be exponentially large in the worst case. However, we can show that DG poly(|G|)2O(dn), where d denotes the cost precision, which is the number of significant bits needed to represent any supported cost. By definition, our method is FPT (Downey and Fellows, 2012) in the cost precision.

Theorem 5.3 (FPT). Equipped with any polynomial-time LP solver and MGSOLVE = Algorithm 3, Algorithm 2 is a fixed-parameter tractable algorithm for computing ACSPE (CE/CCE) in the cost-precision d. Consequently, if d = O(log(|G|)) while n is held constant or d = O(1) while n is arbitrary, then Algorithm 2 runs in polynomial time and any output policy can be stored in polynomial space.

Remark 5.4 (Learning). Our methods immediately apply to the learning setting through model-based approaches. Moreover, any new learning algorithm for action-constrained MGs can immediately solve G and thus compute ACE for G.

6. Approximation

In the previous section, we showed our method runs in polynomial time whenever the cost precision is small. However, in cases where the cost precision is large or, even worse infinite, the computation may require exponential time due to the NP-hard nature of finding a feasible solution, Proposition 2.7. To combat this issue, we slightly relax the feasibility condition. This allows us to compute equilibrium policies that only violate the budget by a given ϵ > 0 at the cost of an additional poly(1/ϵ) factor in running time.

In this section, we allow any infinite support cost distributions that are bounded above. We also require that distribution is a product distribution to enable comparisons

between supported costs. Technically, we also need the CDF of the distribution to be efficiently computable for use in computation. Assumption 6.1 (Bounded). We assume that each cmax def = suph,s,a sup Supp(Ch(s, a)) < , and that each Ch(s, a) = {Ci,h(s, a)}i is a product distribution.

Moreover, if Hcmax i B, observe that every policy is feasible for player i, which just leads to a standard unconstrained problem for that player. A similar phenomenon happens if cmax i 0. Thus, we assume WLOG that Hcmax > B and cmax > 0. We can then define our relaxed feasibility notions as follows. Definition 6.2 (Approximate Feasibility). For any ϵ > 0, a joint policy π is ϵ-additive feasible for G if,

t=1 ct B + ϵ

and ϵ-relative feasible for G if,

t=1 ct B(1 + ϵσB)

where σB is the sign of B1. We then define an ϵadditive/relative approximate ACE to satisfy the usual conditions of Definition 2.2 but with the feasibility condition (1) relaxed to an ϵ-additive/relative feasibility condition.

Rounding. The key idea of our approximations is to have players round down any cost vector they receive to the nearest multiple of some ℓ> 0. Doing so allows the players to track far fewer cumulative costs than originally. In addition, we can effectively truncate the lower regime of the distribution using the fact that if player i ever receives an immediate cost smaller than Bi Hcmax i , then it may take any action whatsoever going forward without violating its budget. Thus, the player can treat any smaller cost as if it were Bi Hcmax i . This process of rounding costs then induces a new c MG with finite support cost distributions. Definition 6.3 (Approximate Game). For any ℓ> 0, we define c ℓ def = c

ℓ ℓto be the largest multiple of ℓthat lower bounds c. For any player i [n], let ˆci,1 ˆci,m denote the elements of the finite set of player i s rounded costs { ci ℓ| ci [Bi Hcmax i , cmax i ]} in order, and

let ˆci,0 def = . We discretize the potentially infinite support distribution Ci,h(s, a) into a finite support distribution ˆCi,h(s, a) by defining for each k [m],

ˆCi,h(ˆci,k | s, a) def = Pr c Ci,h(s,a)[c [ˆci,k 1, ˆci,k]]. (5)

1When the costs and budgets are negative, negating the constraint yields PH t=1 ct |B| (1 ϵ), which is the traditional notion of relative approximation for covering objectives.

Anytime-Constrained Equilibria in Polynomial Time

Algorithm 4 Approximation Require: (G)

1: Construct ˆG Definition 6.3(G) 2: Let MGSOLVE = Algorithm 3 3: return Algorithm 2( ˆG)

We then define the approximate ac MG ˆG def = (S, A, P, R, ˆC, H) to be our original game G but with a different cost distribution.

Since outside of extreme cases, we always round costs down, the players always maintain an underestimate of their true cumulative cost. Consequently, the players may be incurring more costs than expected. However, we can show that the true cost accumulated is not much larger than the surrogate cost.

Lemma 6.4. Any feasible policy for ˆG is an Hℓ-additive feasible policy for G.

On the flip side, the fact that players are willing to spend more than before means they have more actions available to them at each stage. Consequently, they can always find strategies that achieve the same value or even higher than any truly feasible policy for the game. It is then easy to see that solutions to ˆG satisfy condition (2) in Definition 2.2, so form approximate ACE for G.

Lemma 6.5. Any ACSPE (NE/CE/CCE) for ˆG are Hℓadditive approximate ACSPE (NE/CE/CCE) for G. Moreover, if ΠG = then Π ˆ G = .

Consequently, we can compute approximate equilibria for G by solving ˆG. The full algorithm is described in Algorithm 4. Since every approximate cost is an integer multiple of ℓ, every approximate cumulative cost will also be an integer multiple of ℓ. In fact, for each i [n], every approximate cumulative cost s multiple must reside in the set n H j Bi Hcmax i ℓ k , . . . , H j cmax i

ℓ ko . Depending on the choice of ℓ, the players may need to track far fewer cumulative costs to behave optimally.

Theorem 6.6 (Approximation). For any ac MG G and ℓ> 0, Algorithm 4(G) correctly outputs Infeasible if no Hℓadditive feasible policies for G exist and outputs an Hℓadditive approximate ACSPE (CE/CCE) π, otherwise. Moreover, Algorithm 4 runs in time O(poly(|G|, cmax B n ℓn )).

Corollary 6.7 (Additive). For any ϵ > 0, if we define ℓ def = ϵ/H, then Algorithm 4 correctly outputs Infeasible or an ϵ-additive approximate ACSPE (CE/CCE) for any ac MG. Moreover, if cmax B poly(|G|), Algorithm 4 runs in time O(poly(|G|, 1

Corollary 6.8 (Relative). For any ϵ > 0, if we define ℓ def = ϵ|B|/H, then Algorithm 4 correctly outputs Infeasible or

an ϵ-relative approximate ACSPE (CE/CCE) for any ac MG. Moreover, if cmax poly(|G|)|B|, Algorithm 4 runs in time O(poly(|G|, 1

Remark 6.9 (Cost Bound). We can efficiently compute approximately feasible solutions so long as cmax poly(|G|) |B|. This condition is very natural. When the supported costs all have the same sign, any feasible policy induces costs with cmax |B| anyway. Importantly, this restriction is not an artifact of our approach; some bound on cmax is necessary for efficient computation as proved in (Mc Mahan and Zhu, 2024).

7. Conclusion

In this work, we introduced anytime-constraints for Markov games and studied the corresponding solution concept of anytime-constrained equilibria. Although finding a feasible policy is NP-hard for simple games, we showed efficient computation is possible so long as the cost precision is constant. The main ingredients to our approach were a graph algorithm to derive all feasibly realizable histories and an efficient algorithm for solving action-constrained MGs. Lastly, we presented approximation algorithms for computing approximately feasible anytime-constrained equilibria running in polynomial time so long as the cost distribution s supremum is no larger than a polynomial factor of the budget. Given the hardness results, our approximation guarantees are best possible under worst-case analysis.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

E. Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC, 1999. doi: 10.1201/9781315140223.

E. Altman and A. Shwartz. Constrained markov games: Nash equilibria. In J. A. Filar, V. Gaitsgory, and K. Mizukami, editors, Advances in Dynamic Games and Applications, pages 213 221, Boston, MA, 2000. Birkh auser Boston. ISBN 978-1-4612-1336-9.

Q. Bai, A. Singh Bedi, and V. Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via conservative natural policy gradient primal-dual algorithm. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6):6737 6744, 6 2023. doi: 10.1609/ aaai.v37i6.25826. URL https://ojs.aaai.org/ index.php/AAAI/article/view/25826.

Anytime-Constrained Equilibria in Polynomial Time

A. Bhatia, P. Varakantham, and A. Kumar. Resource constrained deep reinforcement learning. Proceedings of the International Conference on Automated Planning and Scheduling, 29(1):610 620, 5 2021. doi: 10.1609/icaps.v29i1.3528. URL https://ojs.aaai. org/index.php/ICAPS/article/view/3528.

V. Borkar. An actor-critic algorithm for constrained markov decision processes. Systems & Control Letters, 54(3):207 213, 2005. ISSN 01676911. doi: https://doi.org/10.1016/j.sysconle.2004.08. 007. URL https://www.sciencedirect.com/ science/article/pii/S0167691104001276.

D. M. Bossens and N. Bishop. Explicit explore, exploit, or escape (e4): Near-optimal safety-constrained reinforcement learning in polynomial time. Mach. Learn., 112 (3):817 858, 6 2022. ISSN 0885-6125. doi: 10.1007/ s10994-022-06201-z. URL https://doi.org/10. 1007/s10994-022-06201-z.

K. Brantley, M. Dud ık, T. Lykouris, S. Miryoosefi, M. Simchowitz, A. Slivkins, and W. Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. In Neur IPS, 2020. URL https://proceedings. neurips.cc/paper/2020/hash/ bc6d753857fe3dd4275dff707dedf329-Abstract. html.

A. Castellano, H. Min, E. Mallada, and J. A. Bazerque. Reinforcement learning with almost sure constraints. In R. Firoozi, N. Mehr, E. Yel, R. Antonova, J. Bohg, M. Schwager, and M. Kochenderfer, editors, Proceedings of The 4th Annual Learning for Dynamics and Control Conference, volume 168 of Proceedings of Machine Learning Research, pages 559 570. PMLR, 6 2022. URL https://proceedings.mlr.press/ v168/castellano22a.html.

Z. Chen, S. Ma, and Y. Zhou. Finding correlated equilibrium of constrained markov game: A primal-dual approach. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 25560 25572. Curran Associates, Inc., 2022. URL https://proceedings.neurips. cc/paper_files/paper/2022/file/ a3f8f584febcc88ed8cdeb30b096db34-Paper-Conference. pdf.

W. C. Cheung. Regret minimization for reinforcement learning with vectorial feedback and complex objectives. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings. neurips.cc/paper/2019/file/

a02ffd91ece5e7efeb46db8f10a74059-Paper. pdf.

Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 4fe5149039b52765bde64beb9f674940-Paper. pdf.

T. Chu, J. Wang, L. Codec a, and Z. Li. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 21(3):1086 1095, 2020. doi: 10.1109/TITS. 2019.2901791.

A. Coronato, M. Naeem, G. De Pietro, and G. Paragliola. Reinforcement learning for intelligent healthcare applications: A survey. Artificial Intelligence in Medicine, 109:101964, 2020. ISSN 0933-3657. doi: https://doi.org/10.1016/j.artmed.2020.101964. URL https://www.sciencedirect.com/ science/article/pii/S093336572031229X.

D. Ding, X. Wei, Z. Yang, Z. Wang, and M. Jovanovic. Provably efficient generalized lagrangian policy optimization for safe multi-agent reinforcement learning. In N. Matni, M. Morari, and G. J. Pappas, editors, Proceedings of The 5th Annual Learning for Dynamics and Control Conference, volume 211 of Proceedings of Machine Learning Research, pages 315 332. PMLR, 15 16 Jun 2023. URL https://proceedings.mlr.press/ v211/ding23a.html.

J. Dinneweth, A. Boubezoul, R. Mandiau, and S. Espi e. Multi-agent reinforcement learning for autonomous vehicles: a survey. Autonomous Intelligent Systems, 2(1):27, 2022. doi: 10.1007/s43684-022-00045-z. URL https: //doi.org/10.1007/s43684-022-00045-z.

R. Downey and M. Fellows. Parameterized Complexity. Monographs in Computer Science. Springer New York, 2012. ISBN 9781461205159. URL https://books. google.com/books?id=Hy Tj Bw AAQBAJ.

I. Elsayed-Aly, S. Bharadwaj, C. Amato, R. Ehlers, U. Topcu, and L. Feng. Safe multi-agent reinforcement learning via shielding, 2021. URL https://arxiv. org/abs/2101.11196.

C. Fan, C. Zhang, A. Yahja, and A. Mostafavi. Disaster city digital twin: A vision for integrating

Anytime-Constrained Equilibria in Polynomial Time

artificial and human intelligence for disaster management. International Journal of Information Management, 56:102049, 2021. ISSN 0268-4012. doi: https://doi.org/10.1016/j.ijinfomgt.2019.102049. URL https://www.sciencedirect.com/ science/article/pii/S0268401219302956.

J. Garc ıa, Fern, and o Fern andez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(42):1437 1480, 2015. URL http://jmlr.org/papers/v16/ garcia15a.html.

A. Gattami, Q. Bai, and V. Aggarwal. Reinforcement learning for constrained markov decision processes. In A. Banerjee and K. Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2656 2664. PMLR, 13 15 Apr 2021. URL https://proceedings.mlr. press/v130/gattami21a.html.

S. Gu, J. Grudzien Kuba, Y. Chen, Y. Du, L. Yang, A. Knoll, and Y. Yang. Safe multi-agent reinforcement learning for multi-robot control. Artificial Intelligence, 319:103905, 2023a. ISSN 0004-3702. doi: https://doi.org/10.1016/j.artint.2023.103905. URL https://www.sciencedirect.com/ science/article/pii/S0004370223000516.

S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang, Y. Yang, and A. Knoll. A review of safe reinforcement learning: Methods, theory and applications, 2023b.

A. Hasanzade Zonuzy, A. Bura, D. Kalathil, and S. Shakkottai. Learning with safety constraints: Sample complexity of reinforcement learning for constrained mdps. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):7667 7674, 5 2021. doi: 10.1609/ aaai.v35i9.16937. URL https://ojs.aaai.org/ index.php/AAAI/article/view/16937.

P. Jordan, A. Barakat, and N. He. Independent learning in constrained Markov potential games. In S. Dasgupta, S. Mandt, and Y. Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 4024 4032. PMLR, 02 04 May 2024. URL https://proceedings.mlr.press/ v238/jordan24a.html.

P. Kolesar. A markovian model for hospital admission scheduling. Management Science, 16(6):B384 B396, 1970. ISSN 00251909, 15265501. URL http://www. jstor.org/stable/2628725.

R. Li, Z. Zhao, Q. Sun, C.-L. I, C. Yang, X. Chen, M. Zhao, and H. Zhang. Deep reinforcement learning for resource management in network slicing. IEEE Access, 6:74429 74441, 2018. doi: 10.1109/ACCESS.2018.2881964.

H. Mao, M. Alizadeh, I. Menache, and S. Kandula. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, Hot Nets 16, page 50 56, New York, NY, USA, 2016. Association for Computing Machinery. ISBN 9781450346610. doi: 10.1145/ 3005745.3005750. URL https://doi.org/10. 1145/3005745.3005750.

A. Martelli and U. Montanari. Additive and/or graphs. In Proceedings of the 3rd International Joint Conference on Artificial Intelligence, IJCAI 73, page 1 11, San Francisco, CA, USA, 1973. Morgan Kaufmann Publishers Inc.

J. Mc Mahan. Deterministic policies for constrained reinforcement learning in polynomial-time, 2024. URL https://arxiv.org/abs/2405.14183.

J. Mc Mahan and X. Zhu. Anytime-constrained reinforcement learning. In S. Dasgupta, S. Mandt, and Y. Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 4321 4329. PMLR, 02 04 May 2024. URL https://proceedings.mlr.press/ v238/mcmahan24a.html.

G. Paragliola, A. Coronato, M. Naeem, and G. De Pietro. A reinforcement learning-based approach for the risk management of e-health environments: A case study. In 2018 14th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), pages 711 716, 2018. doi: 10.1109/SITIS.2018.00114.

S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, volume 32, 2019. URL https://proceedings.neurips. cc/paper_files/paper/2019/file/ c1aeb6517a1c7f33514f7ff69047e74e-Paper. pdf.

H. Peng and X. Shen. Multi-agent reinforcement learning based resource management in mecand uav-assisted vehicular networks. IEEE Journal on Selected Areas in Communications, 39(1):131 141, 2021. doi: 10.1109/ JSAC.2020.3036962.

M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994. ISBN 0471619779.

Anytime-Constrained Equilibria in Polynomial Time

M. Roderick, V. Nagarajan, and Z. Kolter. Provably safe pac-mdp exploration using analogies. In A. Banerjee and K. Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 1216 1224. PMLR, 4 2021. URL https://proceedings.mlr.press/ v130/roderick21a.html.

S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for autonomous driving, 2016. URL https://arxiv.org/abs/1610. 03295.

G. Thomas, Y. Luo, and T. Ma. Safe reinforcement learning by imagining the near future. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 13859 13869. Curran Associates, Inc., 2021. URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 73b277c11266681122132d024f53a75b-Paper. pdf.

Y. L. Tsai, A. Phatak, P. K. Kitanidis, and C. B. Field. Deep Reinforcement Learning for Disaster Response: Navigating the Dynamic Emergency Vehicle and Rescue Team Dispatch during a Flood. In AGU Fall Meeting Abstracts, volume 2019, pages NH33B 14, Dec. 2019.

S. Vaswani, L. Yang, and C. Szepesvari. Near-optimal sample complexity bounds for constrained mdps. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 3110 3122. Curran Associates, Inc., 2022. URL https://proceedings.neurips. cc/paper_files/paper/2022/file/ 14a5ebc9cd2e507cd811df78c15bf5d7-Paper-Conference. pdf.

Y. Wang, S. S. Zhan, R. Jiao, Z. Wang, W. Jin, Z. Yang, Z. Wang, C. Huang, and Q. Zhu. Enforcing hard constraints with soft barriers: Safe reinforcement learning in unknown stochastic environments. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 36593 36604. PMLR, 7 2023. URL https://proceedings.mlr.press/ v202/wang23as.html.

H. Wei, X. Liu, and L. Ying. Triple-q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation. In G. Camps Valls, F. J. R. Ruiz, and I. Valera, editors, Proceedings

of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 3274 3307. PMLR, 3 2022. URL https://proceedings.mlr.press/ v151/wei22a.html.

M. Wiering. Multi-agent reinforcement leraning for traffic light control. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML 00, page 1151 1158, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1558607072.

C. Wu, B. Ju, Y. Wu, X. Lin, N. Xiong, G. Xu, H. Li, and X. Liang. Uav autonomous target search based on deep reinforcement learning in complex disaster scene. IEEE Access, 7:117227 117245, 2019. doi: 10.1109/ACCESS. 2019.2933002.

W. Zhao, T. He, R. Chen, T. Wei, and C. Liu. State-wise safe reinforcement learning: a survey. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 23, 2023. ISBN 978-1-95679203-4. doi: 10.24963/ijcai.2023/763. URL https: //doi.org/10.24963/ijcai.2023/763.

Anytime-Constrained Equilibria in Polynomial Time

A. Proofs for Section 2

A.1. Proof of Proposition 2.6

Proof. If ΠG = , then by definition no ACE or ACSPE can exist since they must be feasible. On the other hand, if ΠG = , then Algorithm 2 yields an ACSPE as shown by Proposition 3.5. Thus, an ACSPE exists and so an ACE also exists.

A.2. Proof of Proposition 2.7

Proof. The proof of Theorem 2 in (Mc Mahan and Zhu, 2024) shows computing a feasible anytime-constrained policy is NP-hard for 2 constraints. Since this fact is independent of the reward structures, the result applies to two-player zero-sum c MGs by treating one player as a dummy with no influence on the transitions.

B. Proofs for Section 3

We introduce a few helpful observations here.

Observation 1 (Decomposability). For any policy π, time h [H] and π-realizable partial history τh+1 Hπ h+1, we have that τh+1 = (τh, ah, ch, sh+1) where,

1. ah Supp(πh(τh)),

2. ch Supp(Ch(sh, ah)),

3. sh+1 Supp(Ph(sh, ah)), and

4. Pπ[τh] > 0.

Proof. By the Markov property (Equation (2.1.11) from (Puterman, 1994)), we can decompose τh+1 = (τh, ah, ch, sh+1) so that, Pπ [τh+1] = Pπ[τh]πh(ah | τh)Ch(ch | sh, ah)Ph(sh+1 | sh, ah). (6)

Since Pπ [τh+1] > 0 by assumption, it must be the case that each quantity on the RHS is also positive. In particular, we see that (i) ah Supp(πh(τh)), (ii) ch Supp(Ch(sh, ah)), (iii) sh+1 Supp(Ph(sh, ah)), and (iv) Pπ[τh] > 0.

Observation 2. For any feasible policy π ΠG, time h [H], and π-realizable history τh Hπ h, it must be that Supp(πh(τh)) {a A | Prc Ch(s,a)[ ch + c B] = 1}. The same claim holds for the set ΠG(τh) as well.

Proof. Fix any such π, h, and τh. Let s def = sh and c def = ch. Suppose for the sake of contradiction that there exists some a Supp(πh(τh)) satisfying Prc Ch(s,a)[ c + c > B] > 0. Then, we would have that,

t=1 ct > B] Pπ[

t=1 ct > B]

t=1 ct > B | τh]Pπ[τh]

= Pπ[ c + ch > B | τh]Pπ[τh]

Pπ[τh]πh(a | τh) Pr c Ch(s,a)[ c + c > B]

The penultimate line used the Markov property, and the final line used the fact that each quantity occurs with non-zero probability by assumption. Thus, we see the existence of such an action leads to contradiction.

B.1. Proof of Lemma 3.3

Proof. For the first claim, we proceed by induction on h.

Anytime-Constrained Equilibria in Polynomial Time

Base Case. For the base case, we consider h = 1. In this case, the only realizable history is τ1 = (s1) with c1 = 0. By definition, we have (1, s1, 0) is the source node. Thus, the claim holds.

Inductive Step. For the inductive step, we consider any h 1. Since τh+1 Fh+1 by assumption, there must exist some π ΠG that realizes τh+1. Then, Observation 1 implies we can decompose τh+1 into τh+1 = (τh, ah, ch, sh+1) satisfying conditions (i)-(iv). Since τh is π-realizable according to (iv), the induction hypothesis implies Pτh = ((1, s1, 0), (1, s1, 0, a1), . . . , (h, sh, ch)) is a path in T . Also, Observation 2 implies that ah satisfies Prc Ch(sh,ah)[ ch + c B] = 1, so ah is a feasible action for (h, sh, ch). By definition of T , we then know that (h, sh, ch, ah) is a node in T that is adjacent to (h, sh, ch). Similarly, by (ii) and (iii), we then know that (h + 1, sh+1, ch+1) is a node in T and is adjacent to (h, sh, ch, ah). Thus, Pτh+1 = (Pτh, (h, sh, ch, ah), (h + 1, sh+1, ch+1)) is the desired path in T . This completes the induction.

For the second claim, fix any h and any τh Fh. We show that for any τk that is π-realizable for some π ΠG(τh) that Pτh Pτk. We proceed by induction on k.

Base Case. For the base case, we consider k = h. In this case, the only realizable history conditioned on τh is τh itself. Trivially, we have that Pτh Pτh.

Inductive Step. For the inductive step, we consider any k h. Again, we can decompose τk+1 = (τk, ak, ck, sk+1) by Observation 1 where ak is a feasible action for (k, sk, ck) by Observation 2. By the induction hypothesis, we know that Pτh Pτk. Again, it is easy to see that the path Pτk+1 = (Pτk, (k, sk, ck, ak), (k + 1, sk+1, ck+1)) is a well-defined path in T . It is then immediate that Pτh Pτk Pτk+1. This completes the induction.

B.2. Proof of Lemma 3.4

Proof. We then proceed by backward induction on the number of nodes in P, h.

Base Case. For the base case, we consider h = H + 1. In this case, P ends in a H + 1 node, which is TRUE by definition of T . Thus, the claim vacuously holds.

Inductive Step. For the inductive step, we consider any h H. First suppose that (h, sh, ch) is a sink node. Then, by definition of T , there must be no safe actions for τh. If there were a feasible π ΠG realizing τh at time h, then Observation 2 would imply a feasible action does exist, a contradiction. Thus, τh cannot be feasibly realized.

Now, suppose that (h, sh, ch) has outgoing edges. Since any triple-node is an OR node, we know all out-neighbors of (h, sh, ch) must be FALSE. In other words, for any action a, there exists at least one super-path P = (P, (h, sh, ch, a), (h + 1, sh+1, ch+1)) ending in a FALSE node. By the induction hypothesis, we know that the corresponding history τh+1 is not feasibly realizable. Therefore, any policy π realizing τh+1 must violate the budget by definition. Moreover, by definition of the edges of T , if π realizes τh and πh(a | τh) > 0, then π realizes τh+1. Consequently, if a policy π realizes τh and a, then π is not feasible. Since this holds for any possible action a, we have that τh is not feasibly realizable.

The argument for why any feasible subhistory cannot realize τh is nearly identical: if it could realize τh from some policy, then that policy must be infeasible.

B.3. Proof of Proposition 3.5

Proof. The proof of correctness follows from Lemma 3.3 and Lemma 3.4. Specifically, Lemma 3.3 (along with the fact that no feasible paths are removed due to Lemma 3.4) implies that RSh FSh and RAh( s) FAh( s) for any s since all feasibly-realizable histories appear in T . Then, Lemma 3.4 implies that any paths containing FALSE nodes are not feasibly realizable. Consequently, RSh FSh and RAh( s) FAh( s) for any s.

For the complexity claim, we observe the time taken to construct the tree and perform a bottom-up tree evaluation is linear in the size of T . Moreover, the final loop to construct the feasible sets also requires touching every node and edge one time. Thus, the time complexity is dominated by the size of the feasibility tree. The number of nodes in the tree is at most HSADG since there is at most one tuple per time, state, action, and non-violating cumulative cost. Moreover, there are at

Anytime-Constrained Equilibria in Polynomial Time

most a quadratic number of edges. Hence, the number of edges is at most A times the number of nodes, leading to a total size of O((HSADG)2).

C. Proofs for Section 4

Policy Evaluation. For any given policy π, player i [n], time h [H + 1], and history τh Hh where s def = sh, we can compute player i s value from π under a c MG G recursively using the tabular policy evaluation equations (Equation 4.2.6 (Puterman, 1994)). In the c MG setting, these equations take the following form:

V π i,h(τh) = Ea πh(τh)

ri,h(s, a) + X

c,s Ch(c | s, a)Ph(s | s, a)V π i,h+1(τh, a, c, s )

For a traditional or action-constrained MG G, the policy evaluation equations take the more familiar form:

V π i,h( τh) = E a πh(τh)

ri,h( s, a) + X

s Ph( s | s, a) V π i,h+1( τh, a, s )

When π is Markovian, these equations further simplify to,

V π i,h( s) = E a πh( s)

ri,h( s, a) + X

s Ph( s | s, a) V π i,h+1( s )

Policy Translations. As mentioned in the appendix, we can immediately treat any feasible Markovian policy π for G = Definition 4.1(G, B) as a compact history-dependent policy π for G by simply using the transformation πh(τh) def = πh(sh, ch). Going even further, we can treat history dependent policies for one game as full history dependent policies for the other. The key observation is that any history τh Hh has a unique equivalent history τh Hh and vice versa.

Specifically, the history τh = (s1, a1, c1, s2, . . . , sh) can be effectively permuted into the history τh = ((s1, 0), a1, (s2, c1), . . . , (sh, ch)) and vice versa. Moreover, given τh it is easy to infer any ck since ck = ck+1 ck, and given τh it is easy to infer (sk, ck). We call τh the translation of τh to G and denote it by H(τh). Importantly, this conversion allows us to formally discuss a policies value in both games by simply modifying its input history.

Lemma C.1 (Translations). For any feasible policy π ΠG, time h [H +1], and partial history τh Hπ h, if τh def = Hh(τh) is the translation of τh to Hh, then V π i,h(τh) = V π i,h( τh). Moreover, if π is Markovian in S, then V π i,h(τh) = V π i,h(sh, ch).

Proof. We proceed by induction on h. We first note by Lemma 3.3 that all realizable histories of π have (state, cumulativecost)-pair in FSh and actions in FAh( s). Thus, in the argument below we can always assume histories realized by π lead to a valid history for G.

Base Case. For the base case, we consider h = H + 1. In this case, V π i,H+1(τH+1) = 0 = V π i,H+1( τH+1) by definition of the value function at time H + 1. The second claim also holds since V π i,H+1(s H+1, c H+1) = 0.

Inductive Step. For the inductive step, we consider any h H. Let s def = sh and let s def = (sh, ch). We observe by (CPE) and (PE) that,

Anytime-Constrained Equilibria in Polynomial Time

V π i,h(τh) = Ea πh(τh)

ri,h(s, a) + X

c,s Ch(c | s, a)Ph(s | s, a)V π i,h+1(τh, a, c, s )

= Ea πh(τh)

ri,h(s, a) + X

c,s Ch(c | s, a)Ph(s | s, a) V π i,h+1( τh, a, (s , ch + c))

= Ea πh(τh)

ri,h( s, a) + X

s Ph( s | s, a) V π i,h+1( τh, a, s )

= E a πh( τh)

ri,h( s, a) + X

s Ph( s | s, a) V π i,h+1( τh, a, s )

= V π i,h( τh).

The first line used (CPE). The second line applied the induction hypothesis along with the fact that τh+1 = (τh, a, c, s ) translates to τh+1 = ( τh, a, (s , ch + c)) where τh is the translation of τh. The third line used the definition of r and P from Definition 4.1. The fourth line used the fact that πh( τh) = πh(τh) by definition of the translation. The last line used (PE).

For the second claim, we note if π is Markovian in S, then we can replace πh( τh) by πh( s) and inductively replace V π i,h+1( τh, a, s ) by V π i,h+1( s ). These replacements result in the second to last line exactly matching the RHS of (*PE). Thus, V π i,h(τh) = V π i,h(sh, ch) in this case.

C.1. Proof of Lemma 4.3

We first make the following observation.

Observation 3. For any policy π, if π ΠG, then π ΠG(τh) for any τh Fh and h [H].

Proof. This is immediate from Lemma 3.4 as any policy whose support is contained in Ah( s) = FAh( s) at each stage is an anytime-feasible policy.

We will also show the following stronger claim.

Claim 1. Suppose that π is any MPE for G, and that π def = (π i, π i) ΠG is a feasible deviation for player i. Then, for all times h [H + 1], and all partial histories τh Hπ h , we have that V π i,h(τh) V π i,h(τh).

Proof. Observe that,

V π i,h(τh) = V π i,h(sh, ch) V π i,h( τh) = V π i,h(τh).

The first equality used Lemma C.1 for a Markovian policy in S. The inequality used the fact that π is a MPE for G with τh being the unique translation of τh to an element of Hh. The final equality again used Lemma C.1.

Proof of Lemma. The lemma then follows as the observation yields condition (1) and the claim yields condition (2) of ACSPE.

C.2. Proof of Theorem 4.4

Proof. The correctness of the algorithm is immediate from Proposition 3.5 and Lemma 4.3. For the complexity claim, the time the algorithm takes is O((HSADG)2) time to construct G and poly(G) time to solve the LP. Since the description

Anytime-Constrained Equilibria in Polynomial Time

size of G is also polynomial in HSADG, we then see the running time is bounded by O(poly(|G|, DG)). The number of G s states is at most O(HSDG), and for each state up to A joint actions probabilities must be stored. Hence, the storage claim follows.

D. Proofs for Section 5

D.1. Proof of Lemma 5.1

Proof. Let (A, X, u) be any action-constrained matrix game, σ be any solution to (CLP)(A, X, u), i [n] be any player, and σ i be any deviation strategy satisfying σ = (σ i, σ i) (X). By definition of the constraints, we see that,

Ea σ [ui(a)] = X

a A σ(a)ui(a)

a X σ(a)ui(a)

a i Ai σ i(a i) X

a X σ(a)ui(a)

a i Ai σ i(a i) X

a X σ(a)ui(a i, a i)

a i Ai σ i(a i) X

ai Ai σ(ai, a i)ui(a i, a i)

a i A i σ i(a i)σ i(a i)ui(a i, a i)

a A σ (a )ui(a )

= Ea σ [ui(a )] .

The second line used the second constraint that ensured Supp(σ) X. The fourth line used the first constraint. The sixth line used the definition of marginals.

For the second claim, the fact that X = implies there exist at least one feasible joint action, and so a feasible σ exists. A specific σ satisfying the other constraints is then immediate from classical game theory since it corresponds to the constraint of a normal-form game with possible entries (which can be replaced by the worst possible utility minus 1).

D.2. Proof of Theorem 5.2

We first make the following observation.

Observation 4. For any feasible action-constrained MG G, suppose that π is output from Algorithm 3(G). Then, π ΠG.

Proof. Since G is feasible, at any time h [H] and state s Sh, we know that Ah( s) = by definition. Thus, Lemma 5.1 implies that (CLP) always outputs a solution and that solution is supported on Ah( s) for any stage game (h, s). Since π is exactly the collection of all such stage solutions, we then see that Supp(πh( s)) Ah( s) for all (h, s). Thus, π ΠG.

The following claim will also prove useful.

Claim 2. For any feasible action-constrained MG G, suppose that π is output from Algorithm 3(G). Then, for all players

i [n], times h [H + 1], deviations π i satisfying π def = (π i, π i) ΠG, and histories τh Hπ h , we have that V π i,h( sh) V π i,h( τh).

Proof. We proceed by induction on h.

Base Case. For the base case, we consider h = H + 1. In this case, V π i,h( s) = 0 = V π i,h( s) by definition of the value function of a feasible policy at time H + 1.

Anytime-Constrained Equilibria in Polynomial Time

Inductive Step. For the inductive step, we consider any h H. We observe that,

V π i,h( s) = E a πh( s)

ri,h( s, a) + X

s Ph( s | s, a) V π i,h+1( s )

ri,h( s, a) + X

s Ph( s | s, a) V π i,h+1( τh+1)

= E a πh( s) h Qπ i,h( τh, a) i

E a π h( τh) h Qπ i,h( τh, a) i

= V π i,h( τh).

The first line used (PE). The second line uses the induction hypothesis. The third line used the definition of the Q-function. The fourth line used Lemma 5.1 and the fact that Supp(πh( τh) ) Ah( sh) by assumption that π is feasible. The last line used the relationship between the Q and value functions.

Proof of Theorem. By Observation 4, we know the output policy of our algorithm is feasible, and by Claim 2 we know the output policy satisfies the stage game solution condition. Thus, it is a MPE for G. The running time follows since we run a polynomial time LP solver on a polynomial sized matrix game, O( A), for a polynomial number of times, O(H S).

D.3. Proof of Theorem 5.3

Proof. We follow the same argument as in (Mc Mahan and Zhu, 2024). By ignoring insignificant digits, we can write each number in the form 2 ib i + . . . 2 1b 1 + 20b0 + . . . + 2d i 1bd i for some i. By dividing by 2 i, each number is of the form 20b0 + . . . + 2d 1bd 1. Notice, the largest possible number that can be represented in this form is Pd 1 i=0 2i = 2d 1. Since at each time h, we potentially add the maximum cost, the largest cumulative cost ever achieved is at most 2d H 1. Since that is the largest cost achievable, no more than 2d H can ever be achieved through all H times. Similarly, no cost can be achieved smaller than 2d H.

Thus, each cumulative cost is in the range [ 2d H + 1, 2d H 1] and so at most 2d+1H cumulative costs can ever be created. By multiplying back the 2 i term, we see at most 2d+1H costs are ever generated by numbers with d bits of precision. Since this argument holds for each constraint independently, the total number of cumulative cost vectors that could ever be achieved is (H2d+1)n. Hence, DG Hn2(d+1)n.

Theorem 5.3 then follows immediately from Theorem 4.4, Theorem 5.2, and the definition of fixed-parameter tractability (Downey and Fellows, 2012).

E. Proofs for Section 6

E.1. Proof of Lemma 6.4

For any h we let ˆch+1 := f(τh+1) be a random variable of the history defined inductively by ˆc1 = 0 and ˆck+1 = fk(ˆck, ck) for all k h. Here, f is a function that either rounds the immediate cost or truncates to ˆc1. Notice that since f is a deterministic function, ˆck can be computed from τh+1 for all k [h + 1]. Then, a probability distribution over ˆc is induced by the one over histories.

Proof. The key observation is that for each time h, generally ˆ ch ch ˆ ch + (h 1)ℓholds. The one exception is when a very negative cost is received, in which case the agents truncation may lead to higher cost in ˆG. However, in that case, any action will still be allowed and so overestimated cost does not lead to issues. Formally, we can show

Pπ G ˆ ch ch ˆ ch + (h 1)ℓ ˆ ch, ch B (H h + 1)cmax = 1. (7)

The proof of this claim follows identically to the proof of Lemma 5 in (Mc Mahan and Zhu, 2024).

Anytime-Constrained Equilibria in Polynomial Time

Then, if π is feasible for ˆG, we see that Pπ h [H], ˆ ch B = 1. Thus,

Pπ [ h [H], ch B + (h 1)ℓ] = 1.

In words, π is Hℓ-feasible for G.

E.2. Proof of Lemma 6.5

Proof. Approximate feasibility of any such π from Lemma 6.4. Moreover, we observe that there are more feasible deviations (π i, π i) in ˆG since the cost constraint is easier to satisfy as also shown in the proof of Lemma 6.4. Thus, π must also beat any feasible deviation for G. Hence, it is an approximate equilibrium.

E.3. Proof of Theorem 6.6

Proof. The correctness of the algorithm follows immediately from Lemma 6.4 and Lemma 6.5. For the complexity claim, we first note that to construct ˆG from ˆG, we must loop over each approximate cost while performing the iteration to create G. The number of such immediate costs per agent i is the number integer multiples of ℓwe consider, which consists of the range { j Bi Hcmax i ℓ k , j cmax i

ℓ k }. The number of elements of this set is,

Bi Hcmax i ℓ

+ 1 cmax i (H + 1) Bi

Thus, if we consider the worst case player, the bound becomes cmax(H+1) B

ℓ + 2. Since this holds independently for each player, the total number supported immediate costs in each approximate cost distribution is at most O( cmax(H+1) B n ℓn ). Moreover, since each immediate cost is an integer multiple of ℓ, any cumulative cost is in the range

at widest {H j Bi Hcmax i ℓ k , H j cmax i

ℓ k }. Overall, we see the time needed to construct G and the size of the state set of G

blow up by a factor of O( cmax(H+1) B n ℓn ). The running time claims then follow from the previous running time claims.

E.4. Proof of Corollary 6.7

Proof. The proof is immediate from Theorem 6.6 and the definition of ℓ.

E.5. Proof of Corollary 6.8

Proof. The proof is immediate from Theorem 6.6 and the definition of ℓ. Note, here we are using a vector ℓ. It is easy to see that the proof of Lemma 6.4 easily handles this variation.

F. Extensions

The infinite discounted case, and generalized anytime constraints case follow similarly to in (Mc Mahan and Zhu, 2024). As for almost sure constraints, we note that we can do the same meta-graph construction, but we start by including all possible cumulative costs, since we do not know which may eventually lead to success until we have seen the end. All other results follow similarly.

For ACCE, we observe all of our results follow with the minimal change of considering a strategic modification deviation ϕ π instead of the general deviation (π i, π i) we originally considered. Our LP solution is also easily adapted by replacing the CCE condition with the CE condition.