Polynomial-Time Approximability of Constrained Reinforcement Learning

Jeremy McMahan 1

Abstract

We study the computational complexity of approximating general constrained Markov decision processes. Our primary contribution is the design of a polynomial-time $(0, \epsilon)$-additive bicriteria approximation algorithm for finding optimal constrained policies across a broad class of recursively computable constraints, including almost-sure, chance, expectation, and their anytime variants. Matching lower bounds imply that our approximation guarantees are optimal so long as $P \neq NP$. The generality of our approach yields answers to several long-standing open complexity questions in the constrained reinforcement learning literature. Specifically, we are the first to prove polynomial-time approximability for the following settings: policies under chance constraints, deterministic policies under multiple expectation constraints, policies under non-homogeneous constraints (i.e., constraints of different types), and policies under constraints for continuous-state processes.

1 Department of Computer Science, University of Wisconsin-Madison, Wisconsin, USA. Correspondence to: Jeremy McMahan.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Constrained Reinforcement Learning (CRL) is growing increasingly crucial for managing complex, real-world applications such as medicine (Coronato et al., 2020; Paragliola et al., 2018; Kolesar, 1970), disaster relief (Fan et al., 2021; Wu et al., 2019; Tsai et al., 2019), and resource management (Mao et al., 2016; Li et al., 2018; Peng and Shen, 2021; Bhatia et al., 2021). Various constraints, including expectation (Altman, 1999), chance (Xu and Mannor, 2011), almost-sure (Castellano et al., 2022), and anytime constraints (McMahan and Zhu, 2024b), were each proposed to address new challenges. Despite the richness of the literature, most works focus on stochastic, expectation-constrained policies, leaving many popular settings with long-standing open problems. Even chance constraints, arguably a close second in popularity, still lack any polynomial-time, even approximate, algorithms despite being introduced over a decade ago (Xu and Mannor, 2011). Other settings for which polynomial-time algorithms remain open include deterministic policies under multiple expectation constraints, policies under non-homogeneous constraints (i.e., constraints of different types), and policies under constraints for continuous-state processes. Consequently, we study the computational complexity of general constrained problems to resolve many of these fundamental open questions.

Formally, we study the solution of Constrained Markov Decision Processes (CMDPs). Here, we define a CMDP through three fundamental parts: (1) an MDP $M$ that accumulates both rewards and costs, (2) a general cost criterion $C$, and (3) a budget vector $B$. Additionally, we allow the agent to specify whether they require their policy to be deterministic or stochastic, formalized through a goal policy class $\Pi'$. The agent's goal is to solve $\max_{\pi \in \Pi'} V^{\pi}_{M}$ subject to $C^{\pi}_{M} \leq B$, where $V^{\pi}_{M}$ denotes the agent's value and $C^{\pi}_{M}$ denotes the agent's cost under $\pi$.
This model can capture very general problems, including minimum-time routes for self-driving vehicles that must satisfy (1) an anytime constraint on fuel consumption, (2) an expectation constraint on CO2 emissions, and (3) a chance constraint on vehicle wear and tear. Our main question is the following: Can general CMDPs be approximated in polynomial time?

Hardness. Solving general CMDPs is notoriously challenging. When restricted to deterministic policies, solving a CMDP with just one constraint is NP-hard (Feinberg, 2000; Xu and Mannor, 2011; McMahan and Zhu, 2024b; McMahan, 2024). This difficulty increases with the number of constraints: with at least two constraints, finding a feasible deterministic policy, let alone a near-optimal one, becomes NP-hard (McMahan and Zhu, 2024b). Even if we relax the deterministic requirement, this hardness remains for all constraint types other than expectation. Computational hardness aside, standard RL techniques fail to apply due to the combinatorial nature induced by many constraint types. Adding further constraints with fundamentally different structures complicates the problem even more.

Past Work. A few works have managed to derive provable approximation algorithms for some cases of CRL. McMahan (2024) presented a fully polynomial-time approximation scheme (FPTAS) for the computation of deterministic policies under a general class of constraints, which includes expectation, almost-sure, and anytime constraints. Although powerful, their framework only handles one constraint and fails to capture anytime-expectation constraints along with chance constraints. Similarly, Khonji et al. (2019) achieve an FPTAS for expectation and chance constraints, but only in the constant-horizon setting. In contrast, McMahan and Zhu (2024b) develop a polynomial-time $(0, \epsilon)$-additive bicriteria approximation algorithm for anytime and almost-sure constraints. However, their framework is specialized to those constraint types and thus fails for our purpose. Finally, Xu and Mannor (2011) developed a pseudo-polynomial-time algorithm for finding feasible chance-constrained policies, but their methods do not lead to polynomial-time solutions.

Our Contributions. We design a polynomial-time $(0, \epsilon)$-additive bicriteria approximation algorithm for tabular, SR-criterion CMDPs. An SR criterion is required to satisfy a generalization of the policy evaluation equations and includes expectation, chance, and almost-sure constraints along with their anytime equivalents. Our framework implies the first positive polynomial-time approximability results for (1) policies under chance constraints, (2) deterministic policies under multiple expectation constraints, and (3) policies under non-homogeneous constraints, each of which has been unresolved for over a decade. We then extend our algorithm into a polynomial-time $(\epsilon, \epsilon)$-additive bicriteria approximation algorithm for continuous-state CMDPs under a general class of constraints, which includes expectation, almost-sure, and anytime constraints.

Our Techniques. Our algorithm requires several key techniques. First, we transform a constraint concerning all realizable histories into a simpler per-time constraint. We accomplish this by augmenting the state space with an artificial budget and augmenting the action space to choose future budgets that satisfy the constraint; a minimal sketch of this augmentation follows.
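To make the augmentation concrete, the sketch below enumerates augmented actions for the expectation case ($f = +$, $g = \mathrm{id}$). The names `feasible_augmented_actions`, `P`, `c`, and `budget_grid` are illustrative and not from the paper; note that the enumeration ranges over all assignments of future budgets, which is exactly the blow-up discussed next.

```python
from itertools import product

# Hypothetical sketch: enumerate augmented actions (a, b') at augmented state
# (s, b) for the expectation instance of (SR): f = sum, g = identity.
def feasible_augmented_actions(s, b, h, actions, states, P, c, budget_grid):
    """Yield pairs (a, future) with future[s'] the budget demanded of successor
    s', subject to the per-step constraint
        c_h(s, a) + sum_{s'} P_h(s' | s, a) * future[s'] <= b."""
    for a in actions:
        for bs in product(budget_grid, repeat=len(states)):
            future = dict(zip(states, bs))
            total = c(h, s, a) + sum(P(h, s2, s, a) * future[s2] for s2 in states)
            if total <= b:
                yield a, future
```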
However, Bellman updates then become as hard as the knapsack problem due to the large augmented action space. For tabular CMDPs, we show that the Bellman updates can be approximately computed using dynamic programming and rounding. By strategically rounding the artificial budget space, we achieve a $(0, \epsilon)$-bicriteria approximation for tabular CMDPs. By appropriately discretizing the continuous state space, our method becomes an $(\epsilon, \epsilon)$-bicriteria approximation algorithm for continuous-state CMDPs.

1.1. Related Work

Constrained RL. It is known that stochastic expectation-constrained policies are polynomial-time computable via linear programming (Altman, 1999), and many planning and learning algorithms exist for them (Paternain et al., 2019; Vaswani et al., 2022; Borkar, 2005; Hasanzade Zonuzy et al., 2021). Some learning algorithms can even avoid violation during the learning process under certain assumptions (Wei et al., 2022; Bai et al., 2023). Furthermore, Brantley et al. (2020) developed no-regret algorithms for CMDPs and extended their algorithms to the setting with a constraint on the cost accumulated over all episodes, which is called a knapsack constraint (Brantley et al., 2020; Cheung, 2019).

Safe RL. The safe RL community (García et al., 2015; Gu et al., 2024) has mainly focused on no-violation learning for stochastic expectation-constrained policies (Chow et al., 2018; Bossens and Bishop, 2022; Alshiekh et al., 2018; Cheng et al., 2019; Berkenkamp et al., 2017) and on solving chance constraints (Wang et al., 2023; Zhao et al., 2023), which capture the probability of entering unsafe states. Performing learning while avoiding dangerous states (Zhao et al., 2023), a special case of expectation constraints, has also been studied (Roderick et al., 2021; Thomas et al., 2021) under non-trivial assumptions. In addition, instantaneous constraints, which require the immediate cost to be within budget at all times, have also been studied (Li et al., 2021; Fisac et al., 2019; Gros et al., 2020).

2. Constraints

Cost-Accumulating MDPs. In this work, we consider environments that accumulate both rewards and costs. Formally, we consider a (finite-horizon, tabular) cost-accumulating Markov Decision Process (caMDP) tuple $M = (H, \mathcal{S}, \mathcal{A}, P, R, C, s_0)$, where (i) $H \in \mathbb{Z}_{>0}$ is the finite time horizon, (ii) $\mathcal{S}_h$ is the finite set of states at time $h$, (iii) $\mathcal{A}_h(s)$ is the finite set of available actions, (iv) $P_h(s, a) \in \Delta(\mathcal{S})$ is the transition distribution for a given state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$ (here, $\Delta(\mathcal{S})$ represents the probability simplex over $\mathcal{S}$), (v) $R_h(s, a) \in \Delta(\mathbb{R})$ is the reward distribution, (vi) $C_h(s, a) \in \Delta(\mathbb{R}^m)$ is the cost distribution, and (vii) $s_0 \in \mathcal{S}$ is the initial state.

To simplify notation, we let $r_h(s, a) := \mathbb{E}[R_h(s, a)]$ denote the expected reward, $c_h(s, a)$ the cost function when costs are deterministic (which will be the case throughout the main text), $S := |\mathcal{S}|$ the number of states, $A := |\mathcal{A}|$ the number of joint actions, $[H] := \{1, \ldots, H\}$, $\mathcal{M}$ the set of all caMDPs, and $|M|$ the total description size of the caMDP. We also use the Iverson bracket notation $[P] := \mathbb{1}_{\{P = \mathrm{True}\}}$ and the characteristic function $\chi_P$, which is $-\infty$ when $P$ is False and $0$ otherwise.
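For concreteness, a minimal container for the caMDP tuple just defined might look as follows. This is a sketch only, with the time-dependent state and action sets represented as callables and distributions as dictionaries; it is not an implementation from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = int

@dataclass
class CaMDP:
    """Sketch of the caMDP tuple M = (H, S, A, P, R, C, s0)."""
    H: int                                                 # finite time horizon
    S: Callable[[int], List[State]]                        # h -> S_h, states at time h
    A: Callable[[int, State], List[Action]]                # (h, s) -> A_h(s), available actions
    P: Callable[[int, State, Action], Dict[State, float]]  # (h, s, a) -> P_h(s, a), transitions
    r: Callable[[int, State, Action], float]               # (h, s, a) -> r_h(s, a), expected reward
    c: Callable[[int, State, Action], Tuple[float, ...]]   # (h, s, a) -> c_h(s, a), cost vector
    s0: State                                              # initial state
```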
Agent Interactions. The agent interacts with $M$ using a policy $\pi = \{\pi_h\}_{h=1}^{H}$. In the fullest generality, $\pi_h : \mathcal{H}_h \to \Delta(\mathcal{A})$ is a mapping from the observed history at time $h$ (including costs) to a distribution over actions. Often, researchers study Markovian policies, which take the form $\pi_h : \mathcal{S} \to \Delta(\mathcal{A})$, and deterministic policies, which take the form $\pi_h : \mathcal{H}_h \to \mathcal{A}$. We let $\Pi$ denote the set of all policies and $\Pi_D$ the set of all deterministic policies.

The agent starts in the initial state $s_0$ with observed history $\tau_1 = (s_0)$. For any $h \in [H]$, the agent chooses a joint action $a_h \sim \pi_h(\tau_h)$. Then, the agent receives an immediate reward $r_h \sim R_h(s_h, a_h)$ and cost vector $c_h \sim C_h(s_h, a_h)$. Lastly, $M$ transitions to state $s_{h+1} \sim P_h(s_h, a_h)$, prompting the agent to update its observed history to $\tau_{h+1} = (\tau_h, a_h, c_h, s_{h+1})$. This process is repeated for $H$ steps; the interaction ends once $s_{H+1}$ is reached.

Constrained Processes. Suppose the agent has a cost criterion $C : \mathcal{M} \times \Pi \to \mathbb{R}^m$ and a corresponding budget vector $B \in \mathbb{R}^m$. We refer to the tuple $(M, C, B)$ as a Constrained Markov Decision Process (CMDP). Given a CMDP and a desired policy class $\Pi' \in \{\Pi_D, \Pi\}$, the agent's goal is to solve the constrained optimization problem:

$$\max_{\pi \in \Pi'} V^{\pi}_{M} \quad \text{s.t.} \quad C^{\pi}_{M} \leq B. \tag{CON}$$

In the above, $V^{\pi}_{M} := \mathbb{E}^{\pi}_{M}\left[\sum_{h=1}^{H} r_h(s_h, a_h)\right]$ denotes the value of a policy $\pi$, $\mathbb{E}^{\pi}_{M}$ denotes the expectation defined by $\mathbb{P}^{\pi}_{M}$, and $\mathbb{P}^{\pi}_{M}$ denotes the probability law over histories induced by the interaction of $\pi$ with $M$. Lastly, we let $V^*$ denote the optimal solution value to (CON). In the main paper, we focus on the case where $\Pi' = \Pi_D$.

SR Criteria. We study cost criteria that generalize the standard policy evaluation equations to enable dynamic programming techniques. In particular, we require the cost of a policy to be recursively computable with respect to the time horizon. For our later approximations in Section 5, we will also need the key functions defining the recursion to be short maps, i.e., 1-Lipschitz, with respect to the infinity norm.

Definition 2.1 (SR). We call a cost criterion shortly recursive (SR) if for any caMDP $M$ and any policy $\pi \in \Pi_D$, $\pi$'s cost decomposes recursively into $C^{\pi}_{M} = C^{\pi}_{1}(s_0)$, where $C^{\pi}_{H+1} \equiv 0$ and for all $h \in [H]$ and $\tau_h \in \mathcal{H}_h$, letting $s = s_h(\tau_h)$ and $a = \pi_h(\tau_h)$,

$$C^{\pi}_{h}(\tau_h) = c_h(s, a) + f_{s'}\, g\big(P_h(s' \mid s, a)\big)\, C^{\pi}_{h+1}(\tau_h, a, s'). \tag{SR}$$

Here, $f_{s'}$ is the finite extension of an associative, non-decreasing, binary function $f$ (i.e., $f$ folded over all successor states $s'$), and $g$ is a $[0,1]$-valued function rooted at $0$ (i.e., $g(0) = 0$). Moreover, we require that $f$ is a short map when either of its inputs is fixed, satisfies $f(0, x) = f(x, 0) = x$ for all $x$, and, when combined with $g$, i.e., $x \mapsto f_{s'}\, g(P_h(s' \mid s, a))\, x_{s'}$, is a short map in $x$.

Remark 2.2 (Stochastic Variants). Our results generalize to both stochastic policies and stochastic costs as well. The algorithmic approach is identical, but the definitions and analysis are more complex. Consequently, we focus on the deterministic cases in the main text.

Constraint Formulations. The fundamental constraints considered in the CRL literature are expectation, chance, and almost-sure constraints. Each of these induces a natural anytime variant that stipulates the required constraint must hold for the truncated cost $\sum_{h=1}^{t} c_h$ at all times $t \in [H]$. We give the formal definitions in Figure 1. Importantly, each constraint is equivalent to $C^{\pi}_{M} \leq \bar{B}$ for some appropriately chosen SR criterion; a generic sketch of the (SR) recursion appears below.
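The following sketch evaluates (SR) for a deterministic policy and scalar costs; the instantiations at the bottom preview Proposition 2.3. All names are illustrative, and the fold's identity element $0$ is justified by Definition 2.1's requirement $f(0, x) = x$.

```python
import operator
from functools import reduce

def sr_cost(h, tau, H, pi, c, P, states, f, g):
    """C^pi_h(tau) = c_h(s, a) + f_{s'} g(P_h(s' | s, a)) * C^pi_{h+1}(tau, a, s').

    `tau` is the history as a tuple ending in the current state; the fold uses
    identity 0 since Definition 2.1 requires f(0, x) = x."""
    if h == H + 1:
        return 0.0                     # base case: C^pi_{H+1} = 0
    s = tau[-1]
    a = pi(h, tau)
    children = [
        g(P(h, s2, s, a)) * sr_cost(h + 1, tau + (a, s2), H, pi, c, P, states, f, g)
        for s2 in states
    ]
    return c(h, s, a) + reduce(f, children, 0.0)

# Instantiations from Proposition 2.3 (sketch):
expectation_fg = (operator.add, lambda p: p)               # f = +, g = identity
almost_sure_fg = (max, lambda p: 1.0 if p > 0 else 0.0)    # f = max, g = indicator
```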
Proposition 2.3 (SR Modeling). The classical constraints can be modeled by SR constraints of the form $C^{\pi}_{M} \leq \bar{B}$ as follows:

1. Expectation constraints: $f(x, y) := x + y$, $g(x) := x$, and $\bar{B} := B$.

2. Chance constraints: $(f, g)$ defined as above and $\bar{B} := \delta$. Here, we assume $M$'s states are augmented with cumulative costs and that $\bar{c}_h((s, c), a) := [c_h(s, a) + c > B]$ for the anytime variant and $\bar{c}_h((s, c), a) := [c_h(s, a) + c > B][h = H]$ otherwise.

3. Almost-sure constraints: $f(x, y) := \max(x, y)$, $g(x) := [x > 0]$, and $\bar{B} := B$. For the anytime variant, $f(x, y) := \max(0, \max(x, y))$ while $g$ and $\bar{B}$ remain the same.

General anytime variants, including anytime-expectation constraints, can be modeled by $C^{\pi}_{M,t} \leq B$ for all $t \in [H]$, where $C^{\pi}_{M,t}$ is the original SR criterion but defined for the truncated-horizon process with horizon $t$.

Computational Limitations. It is known that computing feasible policies for CMDPs is NP-hard (McMahan, 2024; McMahan and Zhu, 2024b). As such, we must relax feasibility for any hope of polynomial-time algorithms. Consequently, we focus on bicriteria approximation algorithms.

Definition 2.4 (Bicriteria). A policy $\pi$ is an $(\alpha, \beta)$-additive bicriteria approximation to a CMDP $(M, C, B)$ if $V^{\pi}_{M} \geq V^* - \alpha$ and $C^{\pi}_{M} \leq B + \beta$. We refer to an algorithm as an $(\alpha, \beta)$-bicriteria if, for any CMDP, it outputs an $(\alpha, \beta)$-additive bicriteria approximation or correctly reports that the instance is infeasible.

| Con/Time | Expectation | Chance | Almost-Sure |
|---|---|---|---|
| Classical | $\mathbb{E}^{\pi}_{M}[\sum_{h=1}^{H} c_h] \leq B$ | $\mathbb{P}^{\pi}_{M}[\sum_{h=1}^{H} c_h > B] \leq \delta$ | $\mathbb{P}^{\pi}_{M}[\sum_{h=1}^{H} c_h \leq B] = 1$ |
| Anytime ($\forall t \in [H]$) | $\mathbb{E}^{\pi}_{M}[\sum_{h=1}^{t} c_h] \leq B$ | $\mathbb{P}^{\pi}_{M}[\sum_{h=1}^{t} c_h > B] \leq \delta$ | $\mathbb{P}^{\pi}_{M}[\sum_{h=1}^{t} c_h \leq B] = 1$ |

Figure 1. The Constraint Landscape

The existence of a polynomial-time bicriteria algorithm for our general constrained problem implies brand-new approximability results for many classic problems in the CRL literature. For clarity, we explicitly state the complexity-theoretic implications for each classical setting.

Theorem 2.5 (Implications). A polynomial-time $(\epsilon, \epsilon)$-bicriteria implies that in polynomial time it is possible to compute a policy $\pi \in \Pi'$ satisfying $V^{\pi}_{M} \geq V^* - \epsilon$ and any constant combination of the following constraints:

1. $\mathbb{E}^{\pi}_{M}\left[\sum_{h=1}^{H} c_h\right] \leq B + \epsilon$

2. $\mathbb{P}^{\pi}_{M}\left[\sum_{h=1}^{H} c_h \leq B + \epsilon\right] = 1$

3. $\mathbb{P}^{\pi}_{M}\left[\sum_{h=1}^{H} c_h > B + \epsilon\right] \leq \delta + \epsilon$.

In other words, polynomial-time approximability is possible for each of the settings described in Section 1 when the number of constraints is constant.

Remark 2.6 (Extensions). All of our results hold for Markov games and infinite discounted settings.

3. Reduction

In this section, we present a general solution approach to SR-criterion CMDPs. Our approach revolves around converting the general cost constraint into a per-step action constraint. This leads to the design of an augmented MDP that can be solved with standard RL methods.

Augmentation. State augmentation is the known approach for solving anytime-constrained MDPs (McMahan and Zhu, 2024b). The augmentation permits the problem to be solved by the following dynamic program (a runnable sketch appears below):

$$V^{*}_{h}(s, c) = \max_{a \in \mathcal{A}:\, c + c_h(s,a) \leq B} r_h(s, a) + \sum_{s'} P_h(s' \mid s, a)\, V^{*}_{h+1}(s', c + c_h(s, a)). \tag{1}$$
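The sketch below implements the augmented recursion (1) on a tiny made-up instance (two states, two actions, uniform transitions); all numbers are illustrative. Infeasible augmented states take value $-\infty$, matching the $\chi$ convention.

```python
from functools import lru_cache

NEG_INF = float("-inf")

# Toy anytime-constrained instance: the riskier action gives more reward but
# also more cost. All values are made up for illustration.
H, B = 2, 1.0
states, actions = [0, 1], [0, 1]
P = {(s, a): {s2: 0.5 for s2 in states} for s in states for a in actions}
r = lambda h, s, a: float(a)
c = lambda h, s, a: float(a)

@lru_cache(maxsize=None)
def V(h, s, cum):
    if h == H + 1:
        return 0.0
    best = NEG_INF                    # -inf marks infeasibility (no action fits budget)
    for a in actions:
        if cum + c(h, s, a) <= B:     # per-step anytime feasibility check from (1)
            nxt = cum + c(h, s, a)
            best = max(best, r(h, s, a) +
                       sum(p * V(h + 1, s2, nxt) for s2, p in P[(s, a)].items()))
    return best

print(V(1, 0, 0.0))   # optimal anytime-constrained value from s0 = 0
```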
When moving to other constraints, the cumulative cost may no longer suffice to determine constraint satisfaction. For example, the expected cost depends on the cumulative cost of all realizable branches, not just the current branch.

Expectation Constraints. Instead, we can exploit the recursive nature of the expected cost to find a solution. Suppose at stage $(s, h)$ we impose an artificial budget $b$ on the expected cost of a policy $\pi$ from time $h$ onward: $\mathbb{E}^{\pi}\left[\sum_{t=h}^{H} c_t\right] \leq b$. By the policy evaluation equations, we know this equates to satisfying:

$$C^{\pi}_{h}(s) = c_h(s, a) + \sum_{s'} P_h(s' \mid s, a)\, C^{\pi}_{h+1}(s') \leq b. \tag{2}$$

For this invariant to be satisfied, it suffices for the agent to choose future artificial budgets $b_{s'}$ for each $s' \in \mathcal{S}$ satisfying

$$c_h(s, a) + \sum_{s'} P_h(s' \mid s, a)\, b_{s'} \leq b, \tag{3}$$

and to ensure the future artificial budgets are obeyed inductively: $C^{\pi}_{h+1}(s', b_{s'}) \leq b_{s'}$.

General Approach. We can apply the same reasoning to general recursively computable cost criteria. If $C$ is SR, then we know that $C^{\pi}_{h}(s)$ obeys (SR). Thus, to guarantee that $C^{\pi}_{h}(s) \leq b$, it suffices to choose the $b_{s'}$ satisfying

$$c_h(s, a) + f_{s'}\, g(P_h(s' \mid s, a))\, b_{s'} \leq b, \tag{4}$$

and to inductively guarantee that $C^{\pi}_{h+1}(s') \leq b_{s'}$.

We can view choosing future artificial budgets as part of the agent's augmented actions. Then, at any augmented state $(s, b)$, the agent's augmented action space includes all $(a, \mathbf{b}) \in \mathcal{A} \times \mathbb{R}^{S}$ satisfying (4). When $M$ transitions to $s' \sim P_h(s, a)$, the agent's new augmented state should consist of the environment's new state in addition to its chosen demand for that state, $(s', b_{s'})$. Putting these pieces together yields the definition of the reduced, action-constrained MDP, Definition 3.1.

Definition 3.1 (Reduced MDP). Given any SR-criterion CMDP $(M, C, B)$, we define the reduced MDP $\bar{M} := (H, \bar{\mathcal{S}}, \bar{\mathcal{A}}, \bar{P}, \bar{R}, \bar{s}_0)$ where,

1. $\bar{\mathcal{S}}_h := \mathcal{S}_h \times \mathcal{B}$ where $\mathcal{B} := \bigcup_{\pi, h, \tau_h \in \mathcal{H}_h} \{C^{\pi}_{h}(\tau_h)\}$

2. $\bar{\mathcal{A}}_h(s, b) := \{(a, \mathbf{b}) \in \mathcal{A}_h(s) \times \mathbb{R}^{S} \mid c_h(s, a) + f_{s'}\, g(P_h(s' \mid s, a))\, b_{s'} \leq b\}$

3. $\bar{P}_h((s', b') \mid (s, b), (a, \mathbf{b})) := P_h(s' \mid s, a)\, [b' = b_{s'}]$

4. $\bar{R}_h((s, b), (a, \mathbf{b})) := R_h(s, a)$

5. $\bar{s}_0 := (s_0, B)$

We also re-define the base-case value to $\bar{V}_{H+1}(s, b) := \chi_{\{b \geq 0\}}$. Note that the reduced MDP has a non-stationary state and action set, unlike the base MDP.

Algorithm 1 Reduction
Require: $(M, C, B)$
1: $\bar{M} \leftarrow$ Definition 3.1$(M, C, B)$
2: $\bar{\pi}, \bar{V} \leftarrow \mathrm{SOLVE}(\bar{M})$
3: if $\bar{V} = -\infty$ then
4:   return Infeasible
5: else
6:   return $\bar{\pi}$
7: end if

Algorithm 2 Augmented Interaction
Require: $\bar{\pi}$
1: $\bar{s}_1 = (s_0, B)$
2: for $h \leftarrow 1$ to $H$ do
3:   $(a, \mathbf{b}) \leftarrow \bar{\pi}_h(\bar{s}_h)$
4:   $s_{h+1} \sim P_h(s_h, a)$
5:   $\bar{s}_{h+1} = (s_{h+1}, b_{s_{h+1}})$
6: end for

Reduction. Importantly, $\bar{M}$'s augmented action space ensures constraint satisfaction. Thus, we have effectively reduced a problem involving total-history constraints to one with only standard per-time-step constraints. So long as our cost is SR, $\bar{M}$ can be solved using fast RL methods instead of the brute-force computation required for general CMDPs. These properties ensure our method, Algorithm 1, is correct.

Lemma 3.2 (Value). For any $h \in [H+1]$, $\tau_h \in \mathcal{H}_h$, and $b \in \mathcal{B}$, if $s = s_h(\tau_h)$, then

$$\bar{V}^{*}_{h}(s, b) \geq \sup_{\pi \in \Pi_D} V^{\pi}_{h}(\tau_h) \quad \text{s.t.} \quad C^{\pi}_{h}(\tau_h) \leq b. \tag{5}$$

Lemma 3.3 (Cost). Suppose that $\pi \in \Pi_D$. For all $h \in [H+1]$ and $(s, b) \in \bar{\mathcal{S}}$, if $\bar{V}^{\pi}_{h}(s, b) > -\infty$, then $C^{\pi}_{h}(s, b) \leq b$.

Theorem 3.4 (Reduction). If SOLVE is any finite-time MDP solver, then Algorithm 1 correctly solves (CON) in finite time for any SR-criterion CMDP.

Remark 3.5 (Deployment). Given a budget-augmented policy $\bar{\pi}$ output from Algorithm 1, the agent can execute $\bar{\pi}$ using Algorithm 2; a minimal sketch follows.
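A minimal sketch of Algorithm 2 in code; `pi_bar` and `sample_next` stand in for the planner's output and the environment's dynamics, and are not names from the paper.

```python
def deploy(pi_bar, sample_next, s0, B, H):
    """Execute a budget-augmented policy (Algorithm 2, sketch)."""
    s, b = s0, B                              # augmented start state (s0, B)
    trajectory = []
    for h in range(1, H + 1):
        a, future_budgets = pi_bar(h, (s, b)) # augmented action (a, b-vector)
        s_next = sample_next(h, s, a)         # environment transition
        trajectory.append((s, b, a))
        s, b = s_next, future_budgets[s_next] # carry the budget chosen for s_next
    return trajectory
```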
4. Bellman Updates

In this section, we discuss efficient methods for solving $\bar{M}$. Our approach relies on using (SR) to break down the Bellman update so that it is solvable using dynamic programming. We then use dynamic rounding to achieve an efficient approximation algorithm.

Bellman Hardness. Even if only a small set of artificial budgets, $\mathcal{B}$, needed to be considered, solving $\bar{M}$ would still be challenging due to its exponentially large, constrained action space. Just one Bellman update equates to solving the constrained optimization problem:

$$\bar{V}^{*}_{h}(s, b) = \max_{a, \mathbf{b}}\ r_h(s, a) + \sum_{s'} P_h(s' \mid s, a)\, \bar{V}^{*}_{h+1}(s', b_{s'}) \quad \text{s.t.} \quad c_h(s, a) + f_{s'}\, g(P_h(s' \mid s, a))\, b_{s'} \leq b. \tag{BU}$$

Above, we used the fact that $(s', b') \in \mathrm{Supp}(\bar{P}_h(\cdot \mid (s, b), (a, \mathbf{b})))$ iff $s' \in \mathrm{Supp}(P_h(s, a))$ and $b' = b_{s'}$. In fact, even when each $b_{s'}$ only takes on two possible values, $\{0, w_{s'}\}$, this optimization problem generalizes the knapsack problem, implying that it is NP-hard to solve.

Dynamic Programming. To get around this computational bottleneck, we must fully exploit Definition 2.1. For any fixed $(h, (s, b), a)$, the key idea is to treat choosing the $b_{s'}$ as its own sequential decision-making problem. Suppose we have already chosen $b_1, \ldots, b_{t-1}$, leading to partial cost $F := f_{s'=1}^{t-1}\, g(P_h(s' \mid s, a))\, b_{s'}$. Since $f$ is associative, we can update our partial cost after choosing $b_t$ to $f(F, g(P_h(t \mid s, a))\, b_t)$. Once we have made a choice for each future state, we can verify whether $(a, \mathbf{b}) \in \bar{\mathcal{A}}_h(s, b)$ by checking the condition $c_h(s, a) + F \leq b$. By incorporating the value objective, we design a dynamic program for computing (BU).

Definition 4.1 (DP). For any $h \in [H]$, $(s, b) \in \bar{\mathcal{S}}$, $a \in \mathcal{A}$, and $F \in \mathbb{R}$, we define $V^{s,a}_{h,b}(S+1, F) = \chi_{\{c_h(s,a) + F \leq b\}}$, and for any $t \in [S]$,

$$V^{s,a}_{h,b}(t, F) := \max_{b_t \in \mathcal{B}}\ P_h(t \mid s, a)\, \bar{V}_{h+1}(t, b_t) + V^{s,a}_{h,b}\big(t+1, f(F, g(P_h(t \mid s, a))\, b_t)\big). \tag{6}$$

Lemma 4.2 (DP Correctness). For any $h \in [H]$ and $(s, b) \in \bar{\mathcal{S}}$, we have that $\bar{V}^{*}_{h}(s, b) = \max_{a \in \mathcal{A}} r_h(s, a) + V^{s,a}_{h,b}(1, 0)$.

Dynamic Rounding. Although a step in the right direction, solving Definition 4.1 can still be slow due to the exponential number of considered partial costs. We resolve this issue by rounding each partial cost to an element of some small set $\hat{\mathcal{F}}$. Since $f$ need not be linear, rounding in a preprocessing step does not suffice: we must re-round at each step to ensure inputs are valid elements of our input set.

For any $\ell > 0$, we view $\ell$ as a new unit length. Our rounding function maps any real number to its closest upper bound in the set of integer multiples of $\ell$. We use upper bounds to guarantee that the rounded partial costs are always larger than the true partial costs. Smaller $\ell$ ensures less approximation error, while larger $\ell$ ensures fewer considered partial costs. Thus, $\ell$ directly controls the accuracy-efficiency tradeoff.

Definition 4.3 (Rounding Functions). For any $\ell > 0$ and $x \in \mathbb{R}$, we define $\lceil x \rceil_{\ell} := \lceil x / \ell \rceil\, \ell$ to be the smallest integer multiple of $\ell$ that is at least $x$. We also define $\kappa_{\ell}(x) := x + \ell(S+1)$. Note that when considering vectors, all operations are performed component-wise.

Since we round up the partial costs, the approximate partial cost of a feasible $\mathbf{b}$ could exceed $b$. To ensure all feasible choices of $\mathbf{b}$ are considered, we must also relax the budget comparison. Instead, we compare partial costs to a carefully chosen upper threshold $\kappa(b)$. Putting these pieces together yields our approximate Bellman update method; a sketch of the rounding primitives follows the definitions below.

Definition 4.4 (Approximate Update). Fix any $\ell > 0$ and function $\kappa : \mathbb{R}^m \to \mathbb{R}^m$. For any $h \in [H]$, $(s, b) \in \bar{\mathcal{S}}$, $a \in \mathcal{A}$, and $\hat{F} \in \mathbb{R}^m$, we define $\hat{V}^{s,a}_{h,b}(S+1, \hat{F}) := \chi_{\{c_h(s,a) + \hat{F} \leq \kappa(b)\}}$, and for any $t \in [S]$,

$$\hat{V}^{s,a}_{h,b}(t, \hat{F}) := \max_{b_t \in \mathcal{B}}\ P_h(t \mid s, a)\, \bar{V}_{h+1}(t, b_t) + \hat{V}^{s,a}_{h,b}\Big(t+1, \big\lceil f\big(\hat{F}, g(P_h(t \mid s, a))\, b_t\big) \big\rceil_{\ell}\Big). \tag{ADP}$$

We then define the approximate update by

$$\hat{V}_{h}(s, b) := \max_{a \in \mathcal{A}} r_h(s, a) + \hat{V}^{s,a}_{h,b}(1, 0). \tag{AU}$$

Overall, solving the ADP yields an approximate solution.

Lemma 4.5 (Approximation). For any $h \in [H]$, $(s, b) \in \bar{\mathcal{S}}$, $a \in \mathcal{A}$, $\hat{F} \in \mathbb{R}^m$, and $t \in [S+1]$, we have that

$$\hat{V}^{s,a}_{h,b}(t, \hat{F}) = \max_{\mathbf{b} \in \mathcal{B}^{S-t+1}} \sum_{s'=t}^{S} P_h(s' \mid s, a)\, \bar{V}_{h+1}(s', b_{s'}) \quad \text{s.t.} \quad c_h(s, a) + \hat{f}^{s,a}_{h,\mathbf{b}}(t, \hat{F}) \leq \kappa(b), \tag{7}$$

where $\hat{f}^{s,a}_{h,\mathbf{b}}(t, \hat{F})$ is the dynamic rounding of $f\big(\hat{F}, f_{s'=t}^{S}\, g(P_h(s' \mid s, a))\, b_{s'}\big)$. Moreover, if $\lceil \cdot \rceil_{\ell}$ and $\kappa$ are replaced with the identity function, (AU) is equivalent to (BU).
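Below is a direct sketch of Definition 4.3's primitives and the re-rounding fold used in Definition 4.4; the function names are illustrative.

```python
import math

def round_up(x, ell):
    """⌈x⌉_ℓ: smallest integer multiple of ell that is at least x."""
    return math.ceil(x / ell) * ell

def kappa(b, ell, S):
    """κ_ℓ(b) = b + ℓ(S+1): relaxed threshold compensating for upward rounding."""
    return b + ell * (S + 1)

def fold_rounded(f, g, probs, budgets, ell):
    """Dynamic re-rounding of the partial cost (cf. Definition 4.4): since f
    need not be linear, every intermediate partial cost is re-rounded."""
    F = 0.0
    for p, b in zip(probs, budgets):
        F = round_up(f(F, g(p) * b), ell)
    return F
```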
Algorithm 3 Approximate Backward Induction
Require: $\hat{M}$
1: $\hat{V}_{H+1}(s, b) \leftarrow \chi_{\{b \geq 0\}}$ for all $(s, b) \in \hat{\mathcal{S}}$
2: for $h \leftarrow H$ down to $1$ do
3:   for $(s, b) \in \hat{\mathcal{S}}$ do
4:     $\hat{a}, \hat{V}_{h}(s, b) \leftarrow$ (AU)
5:     $\hat{\pi}_h(s, b) \leftarrow \hat{a}$
6:   end for
7: end for
8: return $\hat{\pi}, \hat{V}$

Remark 4.6 (DP Details). Technically, to turn this recursion into a true dynamic program, we must also precompute the inputs to any subproblem. Unlike in standard RL, this computation must be done with a forward recursion. If we let $\hat{\mathcal{F}}^{s,a}_{h}(t)$ denote the set of possible rounded partial-cost inputs for state $t$, then the set satisfies the inductive relationship $\hat{\mathcal{F}}^{s,a}_{h}(1) := \{0\}$ and, for any $t \in [S]$, $\hat{\mathcal{F}}^{s,a}_{h}(t+1) := \bigcup_{F \in \hat{\mathcal{F}}^{s,a}_{h}(t),\, b_t \in \mathcal{B}} \big\{ \lceil f(F, g(P_h(t \mid s, a))\, b_t) \rceil_{\ell} \big\}$. This relationship translates directly into an iterative algorithm for computing all needed inputs. Using this gives a complete DP algorithm for solving (ADP).¹

Theorem 4.7 (Approx Solve). When $\lceil \cdot \rceil_{\ell}$ and $\kappa$ are replaced with the identity function, Algorithm 3 correctly solves any $\bar{M}$ produced from Definition 3.1. Moreover, Algorithm 3 runs in time $O\big(H^{m+1} S^{m+2} A |\mathcal{B}|^2 (c_{\max} - c_{\min})^m / \ell^m\big)$.

5. Bicriteria

Algorithm 3 allows us to approximately solve $\bar{M}$ in finite cases much faster than traditional methods. However, when $|\mathcal{B}|$ is large, the algorithm still runs in exponential time. Similarly to the partial-cost rounding in Definition 4.4, we can reduce the size of $|\mathcal{B}|$ by considering a smaller approximate set based on rounding. Since we still desire optimistic budgets, we use the same rounding function from Definition 4.3 but with a different choice of $\ell$.

Budget Rounding. Rounding naturally impacts the state space, but has other consequences as well. To avoid complex computation, we consider the approximate set $\hat{\mathcal{B}} := \{ \lceil b \rceil_{\ell} \mid b \in [b_{\min}, b_{\max}] \}$, where $[b_{\min}, b_{\max}] \supseteq \mathcal{B}$ is a superset of all required artificial budgets that we formalize later. As before, rounding the budgets may cause originally feasible choices to now violate the constraint. To ensure all feasible choices are considered, and so that we can use Algorithm 3 to get speed-ups, we define the approximate action space to include all vectors that lead to feasible subproblems of (ADP). From Lemma 4.5, we know this set is exactly the set of $(a, \hat{\mathbf{b}})$ satisfying $c_h(s, a) + \hat{f}^{s,a}_{h,\hat{\mathbf{b}}}(1, 0) \leq \kappa(\hat{b})$. Putting these ideas together yields a new, approximate MDP.

¹ We use the notation $x^*, o^* \leftarrow \min_x z(x)$ to say that $x^*$ is the minimizer and $o^*$ the value of the optimization.

Algorithm 4 Bicriteria
Require: $(M, C, B)$
1: Hyperparameter: $\ell$
2: $\hat{M} \leftarrow$ Definition 5.1$(M, (f, g), B, \ell)$
3: $\hat{\pi}, \hat{V} \leftarrow$ Algorithm 3$(\hat{M}, (f, g), \ell)$
4: if $\hat{V}_1(s_0, \lceil B \rceil_{\ell}) = -\infty$ then
5:   return Infeasible
6: else
7:   return $\hat{\pi}$
8: end if

Definition 5.1 (Approximate MDP). Given any SR-criterion CMDP $(M, C, B)$, we define the approximate MDP $\hat{M} := (H, \hat{\mathcal{S}}, \hat{\mathcal{A}}, \hat{P}, \hat{R}, \hat{s}_0)$ where,

1. $\hat{\mathcal{S}}_h := \mathcal{S}_h \times \hat{\mathcal{B}}$ where $\hat{\mathcal{B}} := \{ \lceil b \rceil_{\ell} \mid b \in [b_{\min}, b_{\max}] \}$.

2. $\hat{\mathcal{A}}_h(s, \hat{b}) := \{(a, \hat{\mathbf{b}}) \in \mathcal{A}_h(s) \times \hat{\mathcal{B}}^{S} \mid c_h(s, a) + \hat{f}^{s,a}_{h,\hat{\mathbf{b}}}(1, 0) \leq \kappa(\hat{b})\}$

3. $\hat{P}_h((s', \hat{b}') \mid (s, \hat{b}), (a, \hat{\mathbf{b}})) := P_h(s' \mid s, a)\, [\hat{b}' = \hat{b}_{s'}]$

4. $\hat{R}_h((s, \hat{b}), a) := R_h(s, a)$

5. $\hat{s}_0 := (s_0, \lceil B \rceil_{\ell})$

We again re-define the base-case value to $\hat{V}_{H+1}(s, \hat{b}) := \chi_{\{\hat{b} \geq 0\}}$. Since we always round budgets up, the agent can make even better choices than originally. It is then easy to see that policies for $\hat{M}$ always achieve optimal constrained value. We formalize this observation in Lemma 5.2.

Lemma 5.2 (Optimal Value). For any $h \in [H+1]$ and $(s, b) \in \bar{\mathcal{S}}$, $\hat{V}_{h}(s, \lceil b \rceil_{\ell}) \geq \bar{V}^{*}_{h}(s, b)$.
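A sketch of the two discretizations just introduced: the rounded budget set $\hat{\mathcal{B}}$ of Definition 5.1 and the forward-computed input sets $\hat{\mathcal{F}}$ of Remark 4.6. Parameter names are illustrative.

```python
import math

def round_up(x, ell):
    return math.ceil(x / ell) * ell

def budget_grid(b_min, b_max, ell):
    """B_hat = { ⌈b⌉_ℓ : b in [b_min, b_max] }, enumerated explicitly."""
    lo, hi = math.ceil(b_min / ell), math.ceil(b_max / ell)
    return [k * ell for k in range(lo, hi + 1)]

def partial_cost_inputs(probs, budgets, f, g, ell):
    """Forward recursion of Remark 4.6: F_hat(1) = {0}, and F_hat(t+1) is the
    set of rounded updates f(F, g(p_t) * b_t) over F in F_hat(t), b_t in B_hat."""
    fronts = [{0.0}]
    for p in probs:
        fronts.append({round_up(f(F, g(p) * b), ell)
                       for F in fronts[-1] for b in budgets})
    return fronts
```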
Time-Space Errors. To assess the violation gap of the policies returned by Algorithm 4, we must first explore the error accumulated by our rounding approach. Rounding each artificial budget naturally accumulates approximation error over time. Rounding the partial costs while running Algorithm 3 accumulates additional error over (state) space. Thus, solving $\hat{M}$ using Algorithm 4 accumulates error over both time and space, unlike standard approximate methods in RL. As a result, our rounding and threshold functions will generally depend on both $H$ and $S$.

Arithmetic Rounding. Our approach is to round each value up to its closest element in an $\ell$-cover. Using the same rounding as in Definition 4.3, we guarantee that $b \leq \lceil b \rceil_{\ell} \leq b + \ell$. Thus, $\lceil b \rceil_{\ell}$ is an overestimate that is not too far from the true value. By setting $\ell$ to be inversely proportional to $SH$, we control the errors over time and space. The threshold must also be a function of $S$ since it controls the error over space.

Lemma 5.3 (Approximate Cost). Suppose that $\pi \in \Pi_D$. For all $h \in [H+1]$ and $(s, \hat{b}) \in \hat{\mathcal{S}}$, if $\hat{V}^{\pi}_{h}(s, \hat{b}) > -\infty$, then $\hat{C}^{\pi}_{h}(s, \hat{b}) \leq \hat{b} + \ell(S+1)(H - h + 1)$.

Theorem 5.4 (Bicriteria). For any SR-criterion CMDP with polynomially-bounded costs and any $\epsilon > 0$, the choice of $\ell := \frac{\epsilon}{1 + (S+1)H}$ ensures Algorithm 4 is a $(0, \epsilon)$-bicriteria running in polynomial time $O\big(H^{6m+1} S^{4m+2} A (c_{\max} - c_{\min})^{3m} / \epsilon^{3m}\big)$.

Corollary 5.5 (Relative). For any $\epsilon > 0$, the choice of $\ell := \frac{\epsilon B}{H(S+1) + 1}$ ensures Algorithm 4 is a polynomial-time $(0, 1+\epsilon)$-relative bicriteria for the class of polynomial-budget-bounded-cost CMDPs with SR cost criteria. This includes all SR-criterion CMDPs with non-negative costs.

Remark 5.6 (Chance Constraints). Technically, for chance constraints, we first create a cost-augmented MDP that is initially passed into the input. This allows us to write chance constraints in the SR form. Consequently, the $S$ term in Theorem 5.4 is really a larger augmented $S$. To achieve $\epsilon$ cost violation, McMahan and Zhu (2024b) showed that an augmented space of size $O(S H^2 (c_{\max} - c_{\min}) / \epsilon)$ is needed, which still results in a polynomial-time complexity.

Remark 5.7 (Approximation Optimality). McMahan and Zhu (2024b) showed that our assumptions on cost bounds are necessary to achieve polynomial-time approximations. Thus, our approximation guarantees are the best possible. Moreover, we can show that our dependency on the number of constraints is also unavoidable. This is formalized in Proposition 5.8.

Proposition 5.8 (Multi-Constraint Hardness). If $m = \Omega(n^{1/d})$ for some constant $d$, then computing an $\epsilon$-feasible policy for a CMDP is NP-hard for any $\epsilon > 0$.

5.1. Continuous MDPs

We also show that approximations are possible in infinite-state settings under certain continuity assumptions.

Assumption 5.9 (Continuity). We assume the caMDP $M$ is Lipschitz continuous. Formally, we require that (1) $\mathcal{S} = [s_{\min}, s_{\max}]$, (2) the reward function is $\lambda_r$-Lipschitz, (3) the cost function is $\lambda_c$-Lipschitz, (4) the transitions are $\lambda_p$-Lipschitz, each with respect to the state input, and (5) each of these quantities is polynomial-sized in the input representation. For SR-criterion CMDPs, we also assume that $f$ has a natural finite equivalent, denoted $\bar{f}$, that $g$ is a sublinear short map, and that $f_{s'}\, z \leq (s_{\max} - s_{\min})\, z$ for any constant $z$.

All we need to do is discretize the state space and run our previous algorithm on the discretized CMDP defined below.
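Before the formal definition, here is a minimal sketch of the grid construction under Assumption 5.9. `density` is a hypothetical transition density, and the midpoint rule is an illustrative stand-in for the cell integral in Definition 5.10.

```python
import math

def grid_states(s_min, s_max, ell_d):
    """Snap the interval [s_min, s_max] to multiples of the grid width ell_d."""
    n = math.ceil((s_max - s_min) / ell_d)
    return [s_min + k * ell_d for k in range(n + 1)]

def discretized_transition(density, h, s, a, grid, ell_d):
    """P_bar(s_bar | s, a) integrates the density over each grid cell
    [s_bar, s_bar + ell_d); approximated here by a midpoint rule."""
    return {s_bar: density(h, s_bar + ell_d / 2, s, a) * ell_d for s_bar in grid}
```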
Definition 5.10 (Discretized CMDP). Given any SR-criterion CMDP $(M, C, B)$, we define the discretized CMDP $(\bar{M}, \bar{C}, B)$, where $\bar{M} = (H, \bar{\mathcal{S}}, \mathcal{A}, \bar{P}, R, \bar{C}, \bar{s}_0)$ is the discretized caMDP defined by,

1. $\bar{\mathcal{S}}_h := \{ \lceil s \rceil_{\ell} \mid s \in \mathcal{S} \}$

2. $\bar{P}_h(\bar{s}' \mid \bar{s}, a) := \int_{s' = \bar{s}'}^{\bar{s}' + \ell} P_h(s' \mid \bar{s}, a)\, ds'$

3. $\bar{s}_0 := (\lceil s_0 \rceil_{\ell}, B)$

and $\bar{C}$ is the cost criterion defined by replacing $f_{s'}$ with its natural finite equivalent $\bar{f}$.

We see that discretization has only a small impact on both the value and the cost, depending on the continuity parameters.

Lemma 5.11 (Discretization). For all $h \in [H+1]$, $\tau_h \in \mathcal{H}_h$, and $\pi \in \Pi_D$, we let $\bar{\tau}_h$ denote $\tau_h$ with each state $s_t$ rounded to $\lceil s_t \rceil_{\ell}$. Then, we have that $V^{\pi}_{h}(\bar{\tau}_h) \geq V^{\pi}_{h}(\tau_h) - \ell(\lambda_r + \lambda_p) H r_{\max} (s_{\max} - s_{\min})(H - h + 1)$ and $C^{\pi}_{h}(\bar{\tau}_h) \leq C^{\pi}_{h}(\tau_h) + \ell(\lambda_c + \lambda_p) H c_{\max} (s_{\max} - s_{\min})(H - h + 1)$. For almost-sure/anytime constraints, the cost incurs an additional factor of $1 / \bar{p}_{\min}$, where $\bar{p}_{\min}$ denotes the smallest non-zero transition probability for $\bar{M}$.

Overall, using our previous bicriteria algorithm on $\bar{M}$ yields our approximation results.

Theorem 5.12 (Continuous Bicriteria). For any SR-criterion CMDP satisfying Assumption 5.9 and any $\epsilon > 0$, the choice of discretization $\ell_d := \frac{\epsilon/2}{(\lambda_r + \lambda_c + \lambda_p) H \max(c_{\max}, r_{\max})(s_{\max} - s_{\min})}$ and approximation $\ell_a := \frac{\epsilon/2}{1 + (\bar{S}+1)H}$ ensures Algorithm 4$(\bar{M})$ is an $(\epsilon, \epsilon)$-bicriteria running in time $O\big(H^{6m+1} \bar{S}^{4m+2} A (c_{\max} - c_{\min})^{3m} / \epsilon^{3m}\big)$, where $\bar{S} = O\big((\lambda_r + \lambda_c + \lambda_p) H \max(c_{\max}, r_{\max})(s_{\max} - s_{\min})^2 / \epsilon\big)$. This time is polynomial so long as $|s_{\max} - s_{\min}| = O(|M|)$. Moreover, almost-sure/anytime constraints enjoy the same guarantee with an additional factor of $\bar{p}_{\min}$ in $\bar{S}$.

Corollary 5.13 (Simplified). For continuous-state SR-criterion CMDPs satisfying Assumption 5.9, there exist polynomial-time $(\epsilon, \epsilon)$-bicriteria solutions for expectation constraints, almost-sure constraints, anytime-almost-sure constraints, and any combinations of these constraints.

6. Conclusion

In this work, we studied the question of whether polynomial-time approximation algorithms exist for many of the classic formulations studied in the CRL literature. We conclude that for the vast majority of constraints, including all the standard constraints, polynomial-time approximability is possible. We demonstrated this phenomenon by developing polynomial-time bicriteria approximations, with the strongest possible guarantees, for a general class of constraints that can be written in a form satisfying general policy evaluation equations. Overall, our work resolves the polynomial-time approximability of many settings, some of which have lacked any polynomial-time algorithm for over a decade. In particular, we are the first to develop a polynomial-time algorithm with any kind of guarantee for chance constraints and non-homogeneous constraints.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References

M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu. Safe reinforcement learning via shielding. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018.

E. Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC, 1999.

Q. Bai, A. Singh Bedi, and V. Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via conservative natural policy gradient primal-dual algorithm. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6):6737-6744, 2023.

F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause. Safe model-based reinforcement learning with stability guarantees. Advances in Neural Information Processing Systems, volume 30, 2017.

A. Bhatia, P. Varakantham, and A. Kumar. Resource constrained deep reinforcement learning. Proceedings of the International Conference on Automated Planning and Scheduling, 29(1):610-620, 2021.

V. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207-213, 2005.

D. M. Bossens and N. Bishop. Explicit explore, exploit, or escape (E4): near-optimal safety-constrained reinforcement learning in polynomial time. Machine Learning, 112(3):817-858, 2022.

K. Brantley, M. Dudík, T. Lykouris, S. Miryoosefi, M. Simchowitz, A. Slivkins, and W. Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. Advances in Neural Information Processing Systems, 2020.

A. Castellano, H. Min, E. Mallada, and J. A. Bazerque. Reinforcement learning with almost sure constraints. Proceedings of The 4th Annual Learning for Dynamics and Control Conference, PMLR 168:559-570, 2022.

R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3387-3395, 2019.

W. C. Cheung. Regret minimization for reinforcement learning with vectorial feedback and complex objectives. Advances in Neural Information Processing Systems, volume 32, 2019.

Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. Advances in Neural Information Processing Systems, volume 31, 2018.

A. Coronato, M. Naeem, G. De Pietro, and G. Paragliola. Reinforcement learning for intelligent healthcare applications: A survey. Artificial Intelligence in Medicine, 109:101964, 2020.

C. Fan, C. Zhang, A. Yahja, and A. Mostafavi. Disaster city digital twin: A vision for integrating artificial and human intelligence for disaster management. International Journal of Information Management, 56:102049, 2021.

E. A. Feinberg. Constrained discounted Markov decision processes and Hamiltonian cycles. Mathematics of Operations Research, 25(1):130-140, 2000.

J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin. Bridging Hamilton-Jacobi safety analysis and reinforcement learning. 2019 International Conference on Robotics and Automation (ICRA), pages 8550-8556, 2019.

J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(42):1437-1480, 2015.

S. Gros, M. Zanon, and A. Bemporad. Safe reinforcement learning via projection on a safe set: How to achieve optimality? IFAC-PapersOnLine, 53(2):8076-8081, 2020.

S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang, and A. Knoll. A review of safe reinforcement learning: Methods, theory and applications, 2024. URL https://arxiv.org/abs/2205.10330.

A. Hasanzade Zonuzy, A. Bura, D. Kalathil, and S. Shakkottai. Learning with safety constraints: Sample complexity of reinforcement learning for constrained MDPs. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):7667-7674, 2021.

M. Khonji, A. Jasour, and B. Williams. Approximability of constant-horizon constrained POMDP. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), pages 5583-5590, 2019.

P. Kolesar. A Markovian model for hospital admission scheduling. Management Science, 16(6):B384-B396, 1970.

J. Li, D. Fridovich-Keil, S. Sojoudi, and C. J. Tomlin. Augmented Lagrangian method for instantaneously constrained reinforcement learning problems. 2021 60th IEEE Conference on Decision and Control (CDC), pages 2982-2989, 2021.

R. Li, Z. Zhao, Q. Sun, C.-L. I, C. Yang, X. Chen, M. Zhao, and H. Zhang. Deep reinforcement learning for resource management in network slicing. IEEE Access, 6:74429-74441, 2018.

H. Mao, M. Alizadeh, I. Menache, and S. Kandula. Resource management with deep reinforcement learning. Proceedings of the 15th ACM Workshop on Hot Topics in Networks (HotNets '16), pages 50-56, 2016.

J. McMahan. Deterministic policies for constrained reinforcement learning in polynomial time, 2024. URL https://arxiv.org/abs/2405.14183.

J. McMahan and X. Zhu. Anytime-constrained multi-agent reinforcement learning, 2024a. URL https://arxiv.org/abs/2410.23637.

J. McMahan and X. Zhu. Anytime-constrained reinforcement learning. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:4321-4329, 2024b.

G. Paragliola, A. Coronato, M. Naeem, and G. De Pietro. A reinforcement learning-based approach for the risk management of e-health environments: A case study. 2018 14th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), pages 711-716, 2018.

S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro. Constrained reinforcement learning has zero duality gap. Advances in Neural Information Processing Systems, volume 32, 2019.

H. Peng and X. Shen. Multi-agent reinforcement learning based resource management in MEC- and UAV-assisted vehicular networks. IEEE Journal on Selected Areas in Communications, 39(1):131-141, 2021.

M. Roderick, V. Nagarajan, and Z. Kolter. Provably safe PAC-MDP exploration using analogies. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:1216-1224, 2021.

G. Thomas, Y. Luo, and T. Ma. Safe reinforcement learning by imagining the near future. Advances in Neural Information Processing Systems, volume 34, pages 13859-13869, 2021.

Y. L. Tsai, A. Phatak, P. K. Kitanidis, and C. B. Field. Deep reinforcement learning for disaster response: Navigating the dynamic emergency vehicle and rescue team dispatch during a flood. AGU Fall Meeting Abstracts, volume 2019, pages NH33B-14, 2019.

S. Vaswani, L. Yang, and C. Szepesvári. Near-optimal sample complexity bounds for constrained MDPs. Advances in Neural Information Processing Systems, volume 35, pages 3110-3122, 2022.

Y. Wang, S. S. Zhan, R. Jiao, Z. Wang, W. Jin, Z. Yang, Z. Wang, C. Huang, and Q. Zhu. Enforcing hard constraints with soft barriers: Safe reinforcement learning in unknown stochastic environments. Proceedings of the 40th International Conference on Machine Learning, PMLR 202:36593-36604, 2023.

H. Wei, X. Liu, and L. Ying. Triple-Q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:3274-3307, 2022.

C. Wu, B. Ju, Y. Wu, X. Lin, N. Xiong, G. Xu, H. Li, and X. Liang. UAV autonomous target search based on deep reinforcement learning in complex disaster scene. IEEE Access, 7:117227-117245, 2019.

H. Xu and S. Mannor. Probabilistic goal Markov decision processes. Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI '11), pages 2046-2052, 2011.

W. Zhao, T. He, R. Chen, T. Wei, and C. Liu. State-wise safe reinforcement learning: A survey. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI '23), 2023.
A. Proofs for Section 2

A.1. Proof of Proposition 2.3

Expectation Constraints. We define $C^{\pi}_{M} := \mathbb{E}^{\pi}_{M}\left[\sum_{h=1}^{H} c_h\right]$. Under this definition, the standard policy evaluation equations imply that

$$C^{\pi}_{h}(\tau_h) = c_h(s, a) + \sum_{s'} P_h(s' \mid s, a)\, C^{\pi}_{h+1}(\tau_{h+1}). \tag{8}$$

It is then clear that this can be written in $(f, g)$-form for $f$ being summation and $g$ being the identity. It is easy to see that these functions have the desired properties.

Chance Constraints. Let $M^0$ denote the initial caMDP. We define $C^{\pi}_{M^0} := \mathbb{P}^{\pi}_{M^0}\left[\sum_{h=1}^{H} c_h > B\right]$. We see that the probability can be recursively decomposed as follows for the anytime variant:

$$C^{\pi}_{h}(\tau_h, c) = [c_h(s, a) + c > B] + \sum_{s'} P_h(s' \mid s, a)\, C^{\pi}_{h+1}(\tau_{h+1}, c_h(s, a) + c). \tag{9}$$

For the general variant, we only include the indicator term at step $H$. To write this in the desired form, we can define a cost-augmented MDP $M$ that keeps track of the cumulative cost at each step, as in (McMahan and Zhu, 2024a). In particular, the anytime variant has the immediate cost defined to be $\bar{c}_h((s, c), a) := [c_h(s, a) + c > B]$. Then, it is clear that the expected cost for the new $M$ exactly corresponds to the probability cost. Thus, the claim holds.

Almost-sure Constraints. We define $C^{\pi}_{M} := \max_{\tau_{H+1}:\, \mathbb{P}^{\pi}_{M}[\tau_{H+1}] > 0} \sum_{h=1}^{H} c_h$ to be the worst-case cost. Under this definition, it is known that the worst-case cost decomposes into

$$C^{\pi}_{h}(\tau_h) = c_h(s, a) + \max_{s'} [P_h(s' \mid s, a) > 0]\, C^{\pi}_{h+1}(\tau_{h+1}). \tag{10}$$

It is then clear that this can be written in $(f, g)$-form for $f$ being maximum and $g$ being the indicator. Properties of the maximum imply that $\max_{s'}(C(s') + \epsilon) \leq \max_{s'} C(s') + \epsilon$. Thus, the total combination is a short map, and the rest of the needed properties can be seen to hold. The anytime variant follows similarly.

A.2. Proof of Theorem 2.5

Proof. The theorem follows immediately by translating the results on the SR criterion into their original forms via the modeling in the proof above.

B. Proofs for Section 3

B.1. Helpful Technical Lemmas

Definition B.1 (Budget Space). For any $s \in \mathcal{S}$, we define $\mathcal{B}_{H+1}(s) := \{0\}$, and for any $h \in [H]$,

$$\mathcal{B}_h(s) := \bigcup_{a,\ \mathbf{b}:\, b_{s'} \in \mathcal{B}_{h+1}(s')} \big\{ c_h(s, a) + f_{s'}\, g(P_h(s' \mid s, a))\, b_{s'} \big\}. \tag{11}$$

We define $\mathcal{B} := \bigcup_{h, s} \mathcal{B}_h(s)$.
Lemma B.2 (Budget Space Intuition). For all $s \in \mathcal{S}$ and $h \in [H+1]$,

$$\mathcal{B}_h(s) = \big\{ b \in \mathbb{R}^m \mid \exists \pi \in \Pi_D,\ \tau_h \in \mathcal{H}_h,\ (s = s_h(\tau_h) \wedge C^{\pi}_{h}(\tau_h) = b) \big\}, \tag{12}$$

and $|\mathcal{B}_h(s)| \leq A^{\sum_{t=h}^{H} S^{H-t}}$. Thus, $\mathcal{B}$ can be computed in finite time using backward induction.

Proof. We proceed by induction on $h$. Let $s \in \mathcal{S}$ be arbitrary.

Base Case. For the base case, we consider $h = H+1$. In this case, we know that for any $\pi \in \Pi_D$ and any $\tau_{H+1} \in \mathcal{H}_{H+1}$, $C^{\pi}_{H+1}(\tau_{H+1}) = 0 \in \{0\} = \mathcal{B}_{H+1}(s)$ by definition. Furthermore, $|\mathcal{B}_{H+1}(s)| = 1 = A^0 = A^{\sum_{t=H+1}^{H} S^{H-t}}$.

Inductive Step. For the inductive step, we consider $h \leq H$. In this case, we know that for any $\pi \in \Pi_D$ and any $\tau_h \in \mathcal{H}_h$, if $s = s_h(\tau_h)$ and $a = \pi_h(\tau_h)$, then the policy evaluation equations imply

$$C^{\pi}_{h}(\tau_h) = c_h(s, a) + f_{s'}\, g(P_h(s' \mid s, a))\, C^{\pi}_{h+1}(\tau_h, a, s').$$

We know by the induction hypothesis that $C^{\pi}_{h+1}(\tau_h, a, s') \in \mathcal{B}_{h+1}(s')$. Thus, by (11), $C^{\pi}_{h}(\tau_h) \in \mathcal{B}_h(s)$. Lastly, we see by (11) and the induction hypothesis that

$$|\mathcal{B}_h(s)| \leq A \prod_{s'} |\mathcal{B}_{h+1}(s')| \leq A \prod_{s'} A^{\sum_{t=h+1}^{H} S^{H-t}} = A^{1 + S \sum_{t=h+1}^{H} S^{H-t}} = A^{\sum_{t=h}^{H} S^{H-t}}.$$

This completes the proof.

B.2. Proof of Lemma 3.2

Proof. First, let $V^{*}_{h}(\tau_h, b)$ denote the supremum in (5). We proceed by induction on $h$.

Base Case. For the base case, we consider $h = H+1$. Definition 2.1 implies that $C^{\pi}_{H+1}(\tau_{H+1}) = 0$ for any $\pi \in \Pi_D$. Thus, there exists a $\pi \in \Pi_D$ satisfying $C^{\pi}_{H+1}(\tau_{H+1}) \leq b$ if and only if $b \geq 0$. We also know by definition that any policy $\pi$ satisfies $V^{\pi}_{H+1}(\tau_{H+1}) = 0$, and if no feasible policy exists, $V^{*}_{H+1}(\tau_{H+1}, b) = -\infty$ by convention. Therefore, we see that $V^{*}_{H+1}(\tau_{H+1}, b) = \chi_{\{b \geq 0\}}$. Then, by definition of $\bar{V}_{H+1}$, it follows that

$$\bar{V}_{H+1}(s, b) = \chi_{\{b \geq 0\}} = V^{*}_{H+1}(\tau_{H+1}, b).$$

Inductive Step. For the inductive step, we consider any $h \leq H$. If $V^{*}_{h}(\tau_h, b) = -\infty$, then trivially $\bar{V}^{*}_{h}(s, b) \geq V^{*}_{h}(\tau_h, b)$. Instead, suppose that $V^{*}_{h}(\tau_h, b) > -\infty$. Then, there must exist a $\pi \in \Pi_D$ satisfying $C^{\pi}_{h}(\tau_h) \leq b$. Let $a^* = \pi_h(\tau_h)$. By (SR), we know that

$$C^{\pi}_{h}(\tau_h) = c_h(s, a^*) + f_{s'}\, g(P_h(s' \mid s, a^*))\, C^{\pi}_{h+1}(\tau_h, a^*, s').$$

For each $s' \in \mathcal{S}$, define $b^*_{s'} := C^{\pi}_{h+1}(\tau_h, a^*, s')$ and observe that $b^*_{s'} \in \mathcal{B}$. Thus, we see that $(a^*, \mathbf{b}^*) \in \mathcal{A} \times \mathcal{B}^{S}$ and $c_h(s, a^*) + f_{s'}\, g(P_h(s' \mid s, a^*))\, b^*_{s'} \leq b$, so $(a^*, \mathbf{b}^*) \in \bar{\mathcal{A}}_h(s, b)$ by definition. Since $\pi$ satisfies $C^{\pi}_{h+1}(\tau_h, a^*, s') \leq b^*_{s'}$, we see that $V^{*}_{h+1}((\tau_h, a^*, s'), b^*_{s'}) \geq V^{\pi}_{h+1}(\tau_h, a^*, s')$. Thus, the induction hypothesis implies $\bar{V}^{*}_{h+1}(s', b^*_{s'}) \geq V^{*}_{h+1}((\tau_h, a^*, s'), b^*_{s'}) \geq V^{\pi}_{h+1}(\tau_h, a^*, s')$. The optimality equations for $\bar{M}$ then give us

$$\begin{aligned} \bar{V}^{*}_{h}(s, b) &= \max_{\bar{a} \in \bar{\mathcal{A}}_h(s, b)} \bar{r}_h((s, b), \bar{a}) + \sum_{\bar{s}'} \bar{P}_h(\bar{s}' \mid (s, b), \bar{a})\, \bar{V}^{*}_{h+1}(\bar{s}') \\ &= \max_{(a, \mathbf{b}) \in \bar{\mathcal{A}}_h(s, b)} r_h(s, a) + \sum_{s'} P_h(s' \mid s, a)\, \bar{V}^{*}_{h+1}(s', b_{s'}) \\ &\geq r_h(s, a^*) + \sum_{s'} P_h(s' \mid s, a^*)\, \bar{V}^{*}_{h+1}(s', b^*_{s'}) \\ &\geq r_h(s, a^*) + \sum_{s'} P_h(s' \mid s, a^*)\, V^{\pi}_{h+1}(\tau_h, a^*, s') \\ &= V^{\pi}_{h}(\tau_h). \end{aligned}$$

The second line used the definition of each quantity in $\bar{M}$. The first inequality used the fact that $(a^*, \mathbf{b}^*) \in \bar{\mathcal{A}}_h(s, b)$. The second inequality used the induction hypothesis. The final equality used the deterministic policy evaluation equations. Since $\pi$ was an arbitrary feasible policy for the optimization defining $V^{*}_{h}(\tau_h, b)$, we see that $\bar{V}^{*}_{h}(s, b) \geq V^{*}_{h}(\tau_h, b)$. This completes the proof.

B.3. Proof of Lemma 3.3

Proof. We proceed by induction on $h$.

Base Case. For the base case, we consider $h = H+1$. By definition and assumption, $\bar{V}^{\pi}_{H+1}(s, b) = \chi_{\{b \geq 0\}} > -\infty$. Thus, it must be the case that $b \geq 0$, and so by Definition 2.1, $C^{\pi}_{H+1}(s, b) = 0 \leq b$.

Inductive Step. For the inductive step, we consider any $h \leq H$. We decompose $\pi_h(s, b) = (a, \mathbf{b})$, where we know $(a, \mathbf{b}) \in \bar{\mathcal{A}}_h(s, b)$ since $\bar{V}^{\pi}_{h}(s, b) > -\infty$.²
Moreover, it must be the case that for any $s' \in \mathcal{S}$ with $P_h(s' \mid s, a) > 0$, we have $\bar{V}^{\pi}_{h+1}(s', b_{s'}) > -\infty$; otherwise the value would be $-\infty$, which would imply a contradiction:

$$\bar{V}^{\pi}_{h}(s, b) = r_h(s, a) + \sum_{s'} P_h(s' \mid s, a)\, \bar{V}^{\pi}_{h+1}(s', b_{s'}) = r_h(s, a) + \ldots + P_h(s' \mid s, a)(-\infty) + \ldots = -\infty.$$

Thus, the induction hypothesis implies that $C^{\pi}_{h+1}(s', b_{s'}) \leq b_{s'}$ for any such $s' \in \mathcal{S}$. By (SR), we see that

$$C^{\pi}_{h}(s, b) = c_h(s, a) + f_{s'}\, g(P_h(s' \mid s, a))\, C^{\pi}_{h+1}(s', b_{s'}) \leq c_h(s, a) + f_{s'}\, g(P_h(s' \mid s, a))\, b_{s'} \leq b.$$

The first inequality used the fact that $f$ is non-decreasing and $g$ is a non-negative scalar. The second inequality used the fact that $(a, \mathbf{b}) \in \bar{\mathcal{A}}_h(s, b)$. This completes the proof.

B.4. Proof of Theorem 3.4

Proof. If $\bar{V}^{*}_{1}(s_0, B) = -\infty$, then we know by Lemma 3.2 that

$$-\infty = \bar{V}^{*}_{1}(s_0, B) \geq \sup_{\pi \in \Pi_D} V^{\pi}_{1}(s_0) \quad \text{s.t.} \quad C^{\pi}_{1}(s_0) \leq B. \tag{13}$$

In other words, no feasible $\pi$ exists, so Algorithm 1 reporting Infeasible is correct. On the other hand, suppose that $\bar{V}^{*}_{1}(s_0, B) > -\infty$ and let $\bar{\pi}$ be any solution to the optimality equations for $\bar{M}$. By Lemma 3.3, we know that $C^{\bar{\pi}}_{1}(s_0, B) \leq B$, implying that $\bar{\pi}$ is a feasible solution. Moreover, Lemma 3.2 again tells us that

$$V^{\bar{\pi}}_{1}(s_0, B) = \bar{V}^{*}_{1}(s_0, B) \geq \sup_{\pi \in \Pi_D} V^{\pi}_{1}(s_0) \quad \text{s.t.} \quad C^{\pi}_{1}(s_0) \leq B. \tag{14}$$

Thus, $\bar{\pi}$ is an optimal solution to (CON), and Algorithm 1 correctly returns it. Therefore, in all cases, Algorithm 1 is correct.

C. Proofs for Section 4

Formally, $\hat{f}^{s,a}_{h,\mathbf{b}}$ can be defined recursively by $\hat{f}^{s,a}_{h,\mathbf{b}}(t, \hat{F}) := \hat{f}^{s,a}_{h,\mathbf{b}}\big(t+1, \lceil f(\hat{F}, g(P_h(t \mid s, a))\, b_t) \rceil_{\ell}\big)$, with base case $\hat{f}^{s,a}_{h,\mathbf{b}}(S+1, \hat{F}) := \hat{F}$.

² By convention, we assume $\max_{\emptyset} = -\infty$.

C.1. Proof of Lemma 4.2

Proof. First, we show that

$$V^{s,a}_{h,b}(t, F) = \max_{\mathbf{b} \in \mathcal{B}^{S-t+1}} \sum_{s'=t}^{S} P_h(s' \mid s, a)\, \bar{V}_{h+1}(s', b_{s'}) \quad \text{s.t.} \quad c_h(s, a) + f^{s,a}_{h,\mathbf{b}}(t, F) \leq b. \tag{15}$$

For notational simplicity, we define $\bar{V}^{s,a}_{h,\mathbf{b}}(t) := \sum_{s'=t}^{S} P_h(s' \mid s, a)\, \bar{V}_{h+1}(s', b_{s'})$. We proceed by induction on $t$.

Base Case. For the base case, we consider $t = S+1$. By definition, we know that $V^{s,a}_{h,b}(t, F) = \chi_{\{c_h(s,a) + F \leq b\}}$. We just need to show that the maximum in (15) also matches this expression. First, observe that the objective is the empty summation, which is $0$. Also, $f^{s,a}_{h,\mathbf{b}}(S+1, F) = F$, so the constraint is satisfied iff $c_h(s, a) + F \leq b$. Thus, the maximum is $0$ when $c_h(s, a) + F \leq b$, and is $-\infty$ due to infeasibility otherwise. In other words, it equals $\chi_{\{c_h(s,a) + F \leq b\}}$, as was to be shown.

Inductive Step. For the inductive step, we consider any $t \leq S$. From (6), we see that

$$\begin{aligned} V^{s,a}_{h,b}(t, F) &= \max_{b_t \in \mathcal{B}} P_h(t \mid s, a)\, \bar{V}_{h+1}(t, b_t) + V^{s,a}_{h,b}\big(t+1, f(F, g(P_h(t \mid s, a))\, b_t)\big) \\ &= \max_{b_t \in \mathcal{B}} P_h(t \mid s, a)\, \bar{V}_{h+1}(t, b_t) + \max_{\substack{\mathbf{b} \in \mathcal{B}^{S-t}:\\ c_h(s,a) + f^{s,a}_{h,\mathbf{b}}(t+1,\, f(F, g(P_h(t \mid s,a))\, b_t)) \leq b}} \bar{V}^{s,a}_{h,\mathbf{b}}(t+1) \\ &= \max_{\substack{\mathbf{b} \in \mathcal{B}^{S-t+1}:\\ c_h(s,a) + f^{s,a}_{h,\mathbf{b}}(t, F) \leq b}} P_h(t \mid s, a)\, \bar{V}_{h+1}(t, b_t) + \bar{V}^{s,a}_{h,\mathbf{b}}(t+1) \\ &= \max_{\substack{\mathbf{b} \in \mathcal{B}^{S-t+1}:\\ c_h(s,a) + f^{s,a}_{h,\mathbf{b}}(t, F) \leq b}} \bar{V}^{s,a}_{h,\mathbf{b}}(t). \end{aligned}$$

The second line used the induction hypothesis. The third line used the fact that the first term is independent of future $\mathbf{b}$ values, the properties of the maximum, and the recursive definition of $f^{s,a}_{h,\mathbf{b}}(t, F)$. The last line used the recursive definition of $\bar{V}^{s,a}_{h,\mathbf{b}}(t)$.
For the second claim, we observe that

$$\begin{aligned} \bar{V}^{*}_{h}(s, b) &= \max_{a, \mathbf{b}:\, c_h(s,a) + f_{s'} g(P_h(s' \mid s,a))\, b_{s'} \leq b} r_h(s, a) + \sum_{s'} P_h(s' \mid s, a)\, \bar{V}_{h+1}(s', b_{s'}) \\ &= \max_{a} \Big[ r_h(s, a) + \max_{\mathbf{b}:\, c_h(s,a) + f^{s,a}_{h,\mathbf{b}}(1, 0) \leq b} \sum_{s'} P_h(s' \mid s, a)\, \bar{V}_{h+1}(s', b_{s'}) \Big] \\ &= \max_{a} r_h(s, a) + V^{s,a}_{h,b}(1, 0). \end{aligned}$$

C.2. Proof of Lemma 4.5

Proof. Recall, as in the proof of Lemma 4.2, we define $\bar{V}^{s,a}_{h,\mathbf{b}}(t) := \sum_{s'=t}^{S} P_h(s' \mid s, a)\, \bar{V}_{h+1}(s', b_{s'})$ to simplify expressions. We proceed by induction on $t$.

Base Case. For the base case, we consider $t = S+1$. By definition, we know that $\hat{V}^{s,a}_{h,b}(t, \hat{F}) = \chi_{\{c_h(s,a) + \hat{F} \leq \kappa(b)\}}$. We just need to show that the maximum in (7) also matches this expression. First, observe that the objective is the empty summation, which is $0$. Also, $\hat{f}^{s,a}_{h,\mathbf{b}}(S+1, \hat{F}) = \hat{F}$, so the constraint is satisfied iff $c_h(s, a) + \hat{F} \leq \kappa(b)$. Thus, the maximum is $0$ when $c_h(s, a) + \hat{F} \leq \kappa(b)$, and is $-\infty$ due to infeasibility otherwise. In other words, it equals $\chi_{\{c_h(s,a) + \hat{F} \leq \kappa(b)\}}$, as was to be shown.

Inductive Step. For the inductive step, we consider any $t \leq S$. From (ADP), we see that

$$\begin{aligned} \hat{V}^{s,a}_{h,b}(t, \hat{F}) &= \max_{b_t \in \mathcal{B}} P_h(t \mid s, a)\, \bar{V}_{h+1}(t, b_t) + \hat{V}^{s,a}_{h,b}\Big(t+1, \big\lceil f(\hat{F}, g(P_h(t \mid s, a))\, b_t) \big\rceil_{\ell}\Big) \\ &= \max_{b_t \in \mathcal{B}} P_h(t \mid s, a)\, \bar{V}_{h+1}(t, b_t) + \max_{\substack{\mathbf{b} \in \mathcal{B}^{S-t}:\\ c_h(s,a) + \hat{f}^{s,a}_{h,\mathbf{b}}(t+1,\, \lceil f(\hat{F}, g(P_h(t \mid s,a))\, b_t) \rceil_{\ell}) \leq \kappa(b)}} \bar{V}^{s,a}_{h,\mathbf{b}}(t+1) \\ &= \max_{\substack{\mathbf{b} \in \mathcal{B}^{S-t+1}:\\ c_h(s,a) + \hat{f}^{s,a}_{h,\mathbf{b}}(t, \hat{F}) \leq \kappa(b)}} P_h(t \mid s, a)\, \bar{V}_{h+1}(t, b_t) + \bar{V}^{s,a}_{h,\mathbf{b}}(t+1) \\ &= \max_{\substack{\mathbf{b} \in \mathcal{B}^{S-t+1}:\\ c_h(s,a) + \hat{f}^{s,a}_{h,\mathbf{b}}(t, \hat{F}) \leq \kappa(b)}} \bar{V}^{s,a}_{h,\mathbf{b}}(t). \end{aligned}$$

The second line used the induction hypothesis. The third line used the fact that the first term is independent of future $\mathbf{b}$ values, the properties of the maximum, and the recursive definition of $\hat{f}^{s,a}_{h,\mathbf{b}}(t, \hat{F})$. The last line used the recursive definition of $\bar{V}^{s,a}_{h,\mathbf{b}}(t)$.

For the second claim, we simply observe that without rounding, (ADP) is the same as (6). Thus, Lemma 4.2 yields the result.
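As a quick sanity check on the decomposition above, the snippet below compares the DP of Definition 4.1 (without rounding) against brute-force enumeration on a tiny made-up expectation instance ($f = +$, $g = \mathrm{id}$); all numbers are illustrative.

```python
from functools import lru_cache
from itertools import product

p = (0.5, 0.3, 0.2)                 # P_h(t | s, a) over S = 3 successors
grid = (0.0, 1.0, 2.0)              # candidate future budgets B
Vnext = lambda t, bt: 2 * bt - t    # stand-in for V_{h+1}(t, b_t)
c0, b = 0.4, 2.0                    # immediate cost and current budget
NEG_INF = float("-inf")

def brute_force():
    """Directly maximize over all budget vectors satisfying (BU)'s constraint."""
    best = NEG_INF
    for bs in product(grid, repeat=len(p)):
        if c0 + sum(pt * bt for pt, bt in zip(p, bs)) <= b:
            best = max(best, sum(pt * Vnext(t, bt)
                                 for t, (pt, bt) in enumerate(zip(p, bs))))
    return best

@lru_cache(maxsize=None)
def V(t, F):
    """DP of Definition 4.1: F is the partial cost accumulated so far."""
    if t == len(p):
        return 0.0 if c0 + F <= b else NEG_INF
    return max(p[t] * Vnext(t, bt) + V(t + 1, F + p[t] * bt) for bt in grid)

assert abs(brute_force() - V(0, 0.0)) < 1e-9
```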
D. Proofs for Section 5

D.1. Time-Space Error Lemmas

Lemma D.1 (Time Error). For any $h \in [H]$, $a \in A$, if $\mathbf{b} \le \mathbf{b}' \le \mathbf{b} + x$, then,
$$f^{s,a}_{h,\mathbf{b}}(1, 0) \le f^{s,a}_{h,\mathbf{b}'}(1, 0) \le f^{s,a}_{h,\mathbf{b}}(1, 0) + x. \quad (16)$$
Here, we translate a scalar $x > 0$ into the vector $(x, \ldots, x)$.

Proof. By definition of $f^{s,a}_{h,\mathbf{b}'}$,
$$\begin{aligned}
f^{s,a}_{h,\mathbf{b}'}(1, 0) &= f\Big(0,\ \mathop{f}_{s'} g(P_h(s' \mid s, a))\, b'_{s'}\Big) \\
&= \mathop{f}_{s'} g(P_h(s' \mid s, a))\, b'_{s'} \\
&\ge \mathop{f}_{s'} g(P_h(s' \mid s, a))\, b_{s'} \\
&= f\Big(0,\ \mathop{f}_{s'} g(P_h(s' \mid s, a))\, b_{s'}\Big) \\
&= f^{s,a}_{h,\mathbf{b}}(1, 0).
\end{aligned}$$
The second and fourth lines used the fact that $f$ is identity preserving. The inequality uses the fact that $f$ is non-decreasing and $g$ is a non-negative scalar, so the total weighted combination is also non-decreasing. Similarly, we see that,
$$\begin{aligned}
f^{s,a}_{h,\mathbf{b}'}(1, 0) &= f\Big(0,\ \mathop{f}_{s'} g(P_h(s' \mid s, a))\, b'_{s'}\Big) \\
&= \mathop{f}_{s'} g(P_h(s' \mid s, a))\, b'_{s'} \\
&\le \mathop{f}_{s'} g(P_h(s' \mid s, a))\, (b_{s'} + x) \\
&\le \mathop{f}_{s'} g(P_h(s' \mid s, a))\, b_{s'} + x \\
&= f\Big(0,\ \mathop{f}_{s'} g(P_h(s' \mid s, a))\, b_{s'}\Big) + x \\
&= f^{s,a}_{h,\mathbf{b}}(1, 0) + x.
\end{aligned}$$
The second and fifth lines used the fact that $f$ is identity preserving. The first inequality again uses the fact that the weighted combination is non-decreasing. The second inequality follows since the weighted combination is a short map with respect to the infinity norm. In particular, since $|\alpha(y) - \alpha(z)| \le \|y - z\|_\infty$ holds for any infinity-norm short map $\alpha$, we see that $|\alpha(y + z) - \alpha(y)| \le \|z\|_\infty$. Moreover, if $\alpha$ is non-decreasing and $z$ is a positive scalar treated as a vector, we further have $\alpha(y + z) - \alpha(y) = |\alpha(y + z) - \alpha(y)| \le \|z\|_\infty = z$. This final inequality immediately implies that $\alpha(y + z) \le \alpha(y) + z$. When $\alpha$ is vector-valued, this inequality holds component-wise.

Since $f$ is associative, we can define $f^{s,a}_{h,\mathbf{b}}(t, F) = f\big(F,\ \mathop{f}_{s'=t}^{S} g(P_h(s' \mid s, a))\, b_{s'}\big)$ either forward recursively or backward recursively.
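The monotone short-map argument above can be checked numerically. The sketch below uses a probability-weighted sum as a stand-in aggregator (a short map rooted at $0$ since the weights sum to at most one) and verifies both directions of (16); the concrete weights and budgets are illustrative assumptions.

```python
# Numerical check of the Lemma D.1 pattern with a probability-weighted sum
# standing in for the abstract aggregator f (an illustrative assumption).
ps = [0.2, 0.5, 0.3]    # g(P_h(s' | s, a)) weights, summing to 1
bs = [0.4, -0.7, 1.2]   # budgets b
x = 0.25                # uniform perturbation, b' = b + x

def weighted(budgets):
    return sum(p * b for p, b in zip(ps, budgets))

lo = weighted(bs)
hi = weighted([b + x for b in bs])
# Non-decreasing and a short map in the infinity norm: the shift is at most x.
assert lo <= hi <= lo + x + 1e-12
print(lo, hi)
```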
Lemma D.2 (Space Error). For any $h \in [H]$, $a \in A$, $\mathbf{b} \in \mathbb{R}^{m \times S}$, $u \in \mathbb{R}^m$, and $t \in [S + 1]$,
$$f^{s,a}_{h,\mathbf{b}}(t, u) \le \hat f^{s,a}_{h,\mathbf{b}}(t, u) \le f^{s,a}_{h,\mathbf{b}}(t, u) + (S - t + 1)\ell. \quad (17)$$

Proof. We proceed by induction on $t$.

Base Case. For the base case, we consider $t = S + 1$. By definition, we have that $\hat f^{s,a}_{h,\mathbf{b}}(S + 1, u) = u = f^{s,a}_{h,\mathbf{b}}(S + 1, u)$. Thus, the claim holds.

Inductive Step. For the inductive step, we consider any $t \le S$. The recursive definition of $\hat f^{s,a}_{h,\mathbf{b}}$ implies,
$$\begin{aligned}
\hat f^{s,a}_{h,\mathbf{b}}(t, u) &= \hat f^{s,a}_{h,\mathbf{b}}\big(t + 1,\ \lceil f(u, g(P_h(t \mid s, a))\, b_t) \rceil_\ell\big) \\
&\ge f^{s,a}_{h,\mathbf{b}}\big(t + 1,\ \lceil f(u, g(P_h(t \mid s, a))\, b_t) \rceil_\ell\big) \\
&= f\Big(\lceil f(u, g(P_h(t \mid s, a))\, b_t) \rceil_\ell,\ \mathop{f}_{s'=t+1}^{S} g(P_h(s' \mid s, a))\, b_{s'}\Big) \\
&\ge f\Big(f(u, g(P_h(t \mid s, a))\, b_t),\ \mathop{f}_{s'=t+1}^{S} g(P_h(s' \mid s, a))\, b_{s'}\Big) \\
&= f\Big(u,\ f\big(g(P_h(t \mid s, a))\, b_t,\ \mathop{f}_{s'=t+1}^{S} g(P_h(s' \mid s, a))\, b_{s'}\big)\Big) \\
&= f^{s,a}_{h,\mathbf{b}}(t, u).
\end{aligned}$$
The first inequality used the induction hypothesis to replace $\hat f$ with $f$, and the second inequality used that $f$ is non-decreasing in either input and $\lceil x \rceil_\ell \ge x$. The other lines use $f$'s associativity. Similarly, we see that,
$$\begin{aligned}
\hat f^{s,a}_{h,\mathbf{b}}(t, u) &= \hat f^{s,a}_{h,\mathbf{b}}\big(t + 1,\ \lceil f(u, g(P_h(t \mid s, a))\, b_t) \rceil_\ell\big) \\
&\le f^{s,a}_{h,\mathbf{b}}\big(t + 1,\ \lceil f(u, g(P_h(t \mid s, a))\, b_t) \rceil_\ell\big) + (S - t)\ell \\
&= f\Big(\lceil f(u, g(P_h(t \mid s, a))\, b_t) \rceil_\ell,\ \mathop{f}_{s'=t+1}^{S} g(P_h(s' \mid s, a))\, b_{s'}\Big) + (S - t)\ell \\
&\le f\Big(f(u, g(P_h(t \mid s, a))\, b_t) + \ell,\ \mathop{f}_{s'=t+1}^{S} g(P_h(s' \mid s, a))\, b_{s'}\Big) + (S - t)\ell \\
&\le f\Big(f(u, g(P_h(t \mid s, a))\, b_t),\ \mathop{f}_{s'=t+1}^{S} g(P_h(s' \mid s, a))\, b_{s'}\Big) + (S - t + 1)\ell \\
&= f\Big(u,\ f\big(g(P_h(t \mid s, a))\, b_t,\ \mathop{f}_{s'=t+1}^{S} g(P_h(s' \mid s, a))\, b_{s'}\big)\Big) + (S - t + 1)\ell \\
&= f^{s,a}_{h,\mathbf{b}}(t, u) + (S - t + 1)\ell.
\end{aligned}$$
The first inequality used the induction hypothesis to replace $\hat f$ with $f$. The second inequality used that $f$ is non-decreasing in either input and $\lceil x \rceil_\ell \le x + \ell$. The third inequality used that $f$ is a short map in the first input. The other lines use $f$'s associativity. This completes the proof.

D.2. Proof of Lemma 5.2

Proof. We proceed by induction on $h$.

Base Case. For the base case, we consider $h = H + 1$. Since $\lceil b \rceil_\ell \ge b$, we immediately see,
$$\hat V^*_{H+1}(s, \lceil b \rceil_\ell) = \chi_{\{\lceil b \rceil_\ell \ge 0\}} \ge \chi_{\{b \ge 0\}} = V^*_{H+1}(s, b). \quad (18)$$

Inductive Step. For the inductive step, we consider any $h \le H$. If $V^*_h(s, b) = -\infty$, then trivially $\hat V^*_h(s, \lceil b \rceil_\ell) \ge V^*_h(s, b)$. Now, suppose that $V^*_h(s, b) > -\infty$. Let $\pi$ be a solution to the optimality equations for $M$. Consequently, we know that $V^\pi_h(s, b) = V^*_h(s, b) > -\infty$, which implies $(a^*, \mathbf{b}^*) = \pi_h(s, b) \in A_h(s, b)$. By definition of $A_h(s, b)$,
$$c_h(s, a^*) + f^{s,a^*}_{h,\mathbf{b}^*}(1, 0) = c_h(s, a^*) + \mathop{f}_{s'} g(P_h(s' \mid s, a^*))\, b^*_{s'} \le b \le \lceil b \rceil_\ell. \quad (19)$$
For each $s' \in S$, define $\hat b^*_{s'} \stackrel{\text{def}}{=} \lceil b^*_{s'} \rceil_\ell$. We show $(a^*, \hat{\mathbf{b}}^*) \in \hat A_h(s, \lceil b \rceil_\ell)$ as follows:
$$\begin{aligned}
c_h(s, a^*) + \hat f^{s,a^*}_{h,\hat{\mathbf{b}}^*}(1, 0) &\le c_h(s, a^*) + f^{s,a^*}_{h,\hat{\mathbf{b}}^*}(1, 0) + \ell S \\
&\le c_h(s, a^*) + f^{s,a^*}_{h,\mathbf{b}^*}(1, 0) + \ell(S + 1) \\
&\le \lceil b \rceil_\ell + \ell(S + 1) \\
&= \kappa(\lceil b \rceil_\ell).
\end{aligned}$$
The first inequality follows from Lemma D.2. The second inequality follows from Lemma D.1 with $\hat{\mathbf{b}}^* \le \mathbf{b}^* + \ell$. The third inequality follows from (19). The equality follows by definition of $\kappa$. Thus, $(a^*, \hat{\mathbf{b}}^*) \in \hat A_h(s, \lceil b \rceil_\ell)$. Since $b^*_{s'} \in B$ by definition, the induction hypothesis implies that $\hat V^*_{h+1}(s', \hat b^*_{s'}) \ge V^*_{h+1}(s', b^*_{s'}) = V^\pi_{h+1}(s', b^*_{s'})$. The optimality equations for $\hat M$ then imply that,
$$\begin{aligned}
\hat V^*_h(s, \lceil b \rceil_\ell) &= \max_{(a, \hat{\mathbf{b}}) \in \hat A_h(s, \lceil b \rceil_\ell)} \Big[ r_h(s, a) + \sum_{s'} P_h(s' \mid s, a)\, \hat V^*_{h+1}(s', \hat b_{s'}) \Big] \\
&\ge r_h(s, a^*) + \sum_{s'} P_h(s' \mid s, a^*)\, \hat V^*_{h+1}(s', \hat b^*_{s'}) \\
&\ge r_h(s, a^*) + \sum_{s'} P_h(s' \mid s, a^*)\, V^\pi_{h+1}(s', b^*_{s'}) \\
&= V^\pi_h(s, b) = V^*_h(s, b).
\end{aligned}$$
The first inequality used the fact that $(a^*, \hat{\mathbf{b}}^*) \in \hat A_h(s, \lceil b \rceil_\ell)$. The second inequality follows from the induction hypothesis. The last two equalities follow from the standard policy evaluation equations and the definition of $\pi$, respectively. This completes the proof.

D.3. Proof of Lemma 5.3

Proof. We proceed by induction on $h$.

Base Case. For the base case, we consider $h = H + 1$. By definition and assumption, $\hat V^\pi_{H+1}(s, \hat b) = \chi_{\{\hat b \ge 0\}} > -\infty$. Thus, it must be the case that $\hat b \ge 0$, and so by definition $\hat C^\pi_{H+1}(s, \hat b) = 0 \le \hat b$.

Inductive Step. For the inductive step, we consider any $h \le H$. As in the proof of Lemma 3.3, we know that $(a, \hat{\mathbf{b}}) = \pi_h(s, \hat b) \in \hat A_h(s, \hat b)$ and, for any $s' \in S$ with $P_h(s' \mid s, a) > 0$, that $\hat V^\pi_{h+1}(s', \hat b_{s'}) > -\infty$. Thus, the induction hypothesis implies that $\hat C^\pi_{h+1}(s', \hat b_{s'}) \le \hat b_{s'} + \ell(S + 1)(H - h)$ for any such $s'$. For any other $s'$, we have $g(P_h(s' \mid s, a)) = g(0) = 0$ by assumption. Thus, the weighted combination of the $\hat C^\pi_{h+1}(s', \hat b_{s'})$ is equal to the weighted combination of $\hat{\mathbf{b}}'$, where $\hat b'_{s'} \stackrel{\text{def}}{=} \hat C^\pi_{h+1}(s', \hat b_{s'})$ if $P_h(s' \mid s, a) > 0$ and $\hat b'_{s'} \stackrel{\text{def}}{=} 0$ otherwise. Moreover, we have $\hat{\mathbf{b}}' \le \hat{\mathbf{b}} + \ell(S + 1)(H - h)$ since $\ell > 0$. Thus, by (SR),
$$\begin{aligned}
\hat C^\pi_h(s, \hat b) &= c_h(s, a) + \mathop{f}_{s'} g(P_h(s' \mid s, a))\, \hat C^\pi_{h+1}(s', \hat b_{s'}) \\
&= c_h(s, a) + f^{s,a}_{h,\hat{\mathbf{b}}'}(1, 0) \\
&\le c_h(s, a) + f^{s,a}_{h,\hat{\mathbf{b}}}(1, 0) + \ell(S + 1)(H - h) \\
&\le \kappa(\hat b) + \ell(S + 1)(H - h) \\
&= \hat b + \ell(S + 1)(H - h + 1).
\end{aligned}$$
The first inequality used Lemma D.1. The second inequality used the fact that $(a, \hat{\mathbf{b}}) \in \hat A_h(s, \hat b)$. The last line used the definition of $\kappa$. This completes the proof.
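To see how the slack parameters interact, a small arithmetic sketch (with illustrative values) checks that the grid width $\ell = \epsilon/(1 + (S + 1)H)$ chosen in Theorem 5.4 below keeps the total budget violation from Lemma 5.3 within $\epsilon$.

```python
def step_size(eps, S, H):
    """Grid width ell = eps / (1 + (S + 1) * H) as chosen in Theorem 5.4."""
    return eps / (1 + (S + 1) * H)

def kappa(b, ell, S):
    """Relaxed budget threshold kappa(b) = b + ell * (S + 1)."""
    return b + ell * (S + 1)

eps, S, H = 0.1, 20, 50
ell = step_size(eps, S, H)
# Rounding B up costs at most ell; Lemma 5.3 accumulates ell*(S+1)*H more.
violation = ell + ell * (S + 1) * H
assert violation <= eps + 1e-12
print(ell, violation)
```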
D.4. Proof of Theorem 5.4

Proof. If (CON) is feasible, then inductively we see that $\hat V^*_1(s_0, \lceil B \rceil_\ell) > -\infty$. The contrapositive then implies that if $\hat V^*_1(s_0, \lceil B \rceil_\ell) = -\infty$, then (CON) is infeasible. Thus, when Algorithm 4 outputs "Infeasible," it is correct. On the other hand, suppose $\hat V^*_1(s_0, \lceil B \rceil_\ell) > -\infty$ and that $\pi$ is an optimal solution to $\hat M$. By Lemma 5.2 and Lemma 3.2, we know that $\hat V^\pi_1(s_0, \lceil B \rceil_\ell) \ge V^*_1(s_0, B) \ge V^*$. Also, by Lemma 5.3, we know that $\hat C^\pi_1(s_0, \lceil B \rceil_\ell) \le \lceil B \rceil_\ell + \ell(S + 1)H \le B + \ell(1 + (S + 1)H)$. Our choice of $\ell = \frac{\epsilon}{1 + (S + 1)H}$ then implies that $\hat C^\pi = \hat C^\pi_1(s_0, \lceil B \rceil_\ell) \le B + \epsilon$. Thus, $\pi$ is a $(0, \epsilon)$-additive bicriteria approximation for (CON). Both cases together imply that Algorithm 4 is a valid $(0, \epsilon)$-bicriteria approximation algorithm.

Time Complexity. We see immediately from Theorem 4.7 that the running time of Algorithm 4 is at most $O\big(H^{2m+1} S^{2m+2} A |\hat B|^2 \|c_{\max} - c_{\min}\|_\infty^m / \epsilon^m\big)$. To complete the analysis, we need to bound $|\hat B|$. First, we note $|\hat B|$ is at most the number of integer multiples of $\ell$ in the range $[b_{\min}, b_{\max}] \subseteq [Hc_{\min}, Hc_{\max}]^m$. For any individual constraint, this number is at most $O(H(c_{\max} - c_{\min})/\ell) \le O(H^2 S (c_{\max} - c_{\min})/\epsilon)$ using the definition of $\ell = \frac{\epsilon}{1 + (S + 1)H}$. Thus, the total number of rounded artificial budgets is at most $O\big((H^2 S \|c_{\max} - c_{\min}\|_\infty/\epsilon)^m\big)$. Squaring this quantity and plugging it back into our original formula yields:
$$O\big(H^{6m+1} S^{4m+2} A \|c_{\max} - c_{\min}\|_\infty^{3m} / \epsilon^{3m}\big).$$

D.5. Proof of Proposition 5.8

Proof. We consider a reduction from the Hamiltonian Path problem. The transitions reflect the graph structure, and the actions determine the edge to follow next. To determine if a Hamiltonian path exists, we can simply make an indicator constraint for each node that signals that the node has been reached. It is then clear that relaxing the budget constraint does not help, since we can always shrink the budget for any given $\epsilon$-slackness. Thus, the claim holds.

D.6. Proof of Lemma 5.11

Proof. We proceed by induction on $h$.

Base Case. For the base case, we consider $h = H + 1$. By definition, we have $V^\pi_{H+1}(\bar\tau_{H+1}) = 0 = V^\pi_{H+1}(\tau_{H+1})$ and $C^\pi_{H+1}(\bar\tau_{H+1}) = 0 = C^\pi_{H+1}(\tau_{H+1})$.

Inductive Step. For the inductive step, we consider any $h \le H$. For simplicity, let $x \stackrel{\text{def}}{=} \ell(\lambda_r + \lambda_p) H r_{\max}(s_{\max} - s_{\min})$. The standard policy evaluation equations imply that,
$$\begin{aligned}
V^\pi_h(\bar\tau_h) &= r_h(\lceil s \rceil_\ell, a) + \sum_{\bar s'} \bar P_h(\bar s' \mid \lceil s \rceil_\ell, a)\, V^\pi_{h+1}(\bar\tau_{h+1}) \\
&= r_h(\lceil s \rceil_\ell, a) + \sum_{\bar s'} \Big( \int_{s' = \bar s'}^{\bar s' + \ell} P_h(s' \mid \lceil s \rceil_\ell, a)\, ds' \Big) V^\pi_{h+1}(\bar\tau_{h+1}) \\
&= r_h(\lceil s \rceil_\ell, a) + \int_{s'} P_h(s' \mid \lceil s \rceil_\ell, a)\, V^\pi_{h+1}(\bar\tau_{h+1})\, ds' \\
&\ge r_h(\lceil s \rceil_\ell, a) + \int_{s'} P_h(s' \mid \lceil s \rceil_\ell, a)\big(V^\pi_{h+1}(\tau_{h+1}) - x(H - h)\big)\, ds' \\
&= r_h(\lceil s \rceil_\ell, a) + \int_{s'} P_h(s' \mid \lceil s \rceil_\ell, a)\, V^\pi_{h+1}(\tau_{h+1})\, ds' - x(H - h) \\
&\ge r_h(s, a) - \ell\lambda_r + \int_{s'} \big(P_h(s' \mid s, a) - \ell\lambda_p\big) V^\pi_{h+1}(\tau_{h+1})\, ds' - x(H - h) \\
&= V^\pi_h(\tau_h) - \ell\lambda_r - \ell\lambda_p \int_{s'} V^\pi_{h+1}(\tau_{h+1})\, ds' - x(H - h) \\
&\ge V^\pi_h(\tau_h) - \ell\lambda_r - \ell\lambda_p H r_{\max}(s_{\max} - s_{\min}) - x(H - h) \\
&\ge V^\pi_h(\tau_h) - \ell(\lambda_r + \lambda_p) H r_{\max}(s_{\max} - s_{\min}) - x(H - h) \\
&= V^\pi_h(\tau_h) - x(H - h + 1).
\end{aligned}$$
If we let $y \stackrel{\text{def}}{=} \ell(\lambda_c + \lambda_p) H c_{\max}(s_{\max} - s_{\min})$, we also see that,
$$\begin{aligned}
C^\pi_h(\bar\tau_h) &= c_h(\lceil s \rceil_\ell, a) + \mathop{f}_{\bar s'} \bar P_h(\bar s' \mid \lceil s \rceil_\ell, a)\, C^\pi_{h+1}(\bar\tau_{h+1}) \\
&= c_h(\lceil s \rceil_\ell, a) + \mathop{f}_{\bar s'} \Big( \int_{s' = \bar s'}^{\bar s' + \ell} P_h(s' \mid \lceil s \rceil_\ell, a)\, ds' \Big) C^\pi_{h+1}(\bar\tau_{h+1}) \\
&= c_h(\lceil s \rceil_\ell, a) + \mathop{f}_{s'} P_h(s' \mid \lceil s \rceil_\ell, a)\, C^\pi_{h+1}(\bar\tau_{h+1}) \\
&\le c_h(\lceil s \rceil_\ell, a) + \mathop{f}_{s'} P_h(s' \mid \lceil s \rceil_\ell, a)\big(C^\pi_{h+1}(\tau_{h+1}) + y(H - h)\big) \\
&\le c_h(\lceil s \rceil_\ell, a) + \mathop{f}_{s'} P_h(s' \mid \lceil s \rceil_\ell, a)\, C^\pi_{h+1}(\tau_{h+1}) + y(H - h) \\
&\le c_h(s, a) + \ell\lambda_c + \mathop{f}_{s'} \big(P_h(s' \mid s, a) + \ell\lambda_p\big)\, C^\pi_{h+1}(\tau_{h+1}) + y(H - h) \\
&= c_h(s, a) + \mathop{f}_{s'} P_h(s' \mid s, a)\, C^\pi_{h+1}(\tau_{h+1}) + \ell\lambda_c + \ell\lambda_p \mathop{f}_{s'} C^\pi_{h+1}(\tau_{h+1}) + y(H - h) \\
&\le C^\pi_h(\tau_h) + \ell\lambda_c + \ell\lambda_p \mathop{f}_{s'} H c_{\max} + y(H - h) \\
&\le C^\pi_h(\tau_h) + \ell\lambda_c + \ell\lambda_p (s_{\max} - s_{\min}) H c_{\max} + y(H - h) \\
&\le C^\pi_h(\tau_h) + \ell(\lambda_c + \lambda_p)(s_{\max} - s_{\min}) H c_{\max} + y(H - h) \\
&= C^\pi_h(\tau_h) + y(H - h + 1).
\end{aligned}$$
We note the above also holds if $P$ is replaced with $g(P)$ for a sublinear short map $g$. For almost-sure constraints, the proof is slightly different since we need to keep the inner integral by definition of the worst-case cost for continuous state spaces. Letting $y \stackrel{\text{def}}{=} \ell(\lambda_c + \lambda_p) H c_{\max}(s_{\max} - s_{\min})/p_{\min}$, the bound then becomes,
$$\begin{aligned}
C^\pi_h(\bar\tau_h) &= c_h(\lceil s \rceil_\ell, a) + \max_{\bar s' :\ \bar P_h(\bar s' \mid \lceil s \rceil_\ell, a) > 0} C^\pi_{h+1}(\bar\tau_{h+1}) \\
&= c_h(\lceil s \rceil_\ell, a) + \max_{\bar s' :\ \int_{s' = \bar s'}^{\bar s' + \ell} P_h(s' \mid \lceil s \rceil_\ell, a)\, ds' > 0} C^\pi_{h+1}(\bar\tau_{h+1}) \\
&= c_h(\lceil s \rceil_\ell, a) + \max_{\bar s'} \frac{\int_{s' = \bar s'}^{\bar s' + \ell} P_h(s' \mid \lceil s \rceil_\ell, a)\, ds'}{p_{\bar s'}}\, C^\pi_{h+1}(\bar\tau_{h+1}) \\
&= c_h(\lceil s \rceil_\ell, a) + \max_{\bar s'} \int_{s' = \bar s'}^{\bar s' + \ell} P_h(s' \mid \lceil s \rceil_\ell, a)\, C^\pi_{h+1}(\bar\tau_{h+1})\, ds' \big/ p_{\bar s'} \\
&\le c_h(\lceil s \rceil_\ell, a) + \max_{\bar s'} \int_{s' = \bar s'}^{\bar s' + \ell} P_h(s' \mid \lceil s \rceil_\ell, a)\big(C^\pi_{h+1}(\tau_{h+1}) + y(H - h)\big)\, ds' \big/ p_{\bar s'} \\
&\le c_h(\lceil s \rceil_\ell, a) + \max_{\bar s'} \int_{s' = \bar s'}^{\bar s' + \ell} P_h(s' \mid \lceil s \rceil_\ell, a)\, C^\pi_{h+1}(\tau_{h+1})\, ds' \big/ p_{\bar s'} + y(H - h) \\
&\le c_h(s, a) + \ell\lambda_c + \max_{\bar s'} \int_{s' = \bar s'}^{\bar s' + \ell} P_h(s' \mid s, a)\, C^\pi_{h+1}(\tau_{h+1})\, ds' \big/ p_{\bar s'} + \ell\lambda_p \max_{\bar s'} \int_{s' = \bar s'}^{\bar s' + \ell} C^\pi_{h+1}(\tau_{h+1})\, ds' \big/ p_{\bar s'} + y(H - h) \\
&\le c_h(s, a) + \max_{\bar s'} \int_{s' = \bar s'}^{\bar s' + \ell} \frac{P_h(s' \mid s, a)}{p_{\bar s'}}\, C^\pi_{h+1}(\tau_{h+1})\, ds' + \ell\lambda_c + \ell^2\lambda_p H c_{\max}/p_{\min} + y(H - h) \\
&= C^\pi_h(\tau_h) + \ell\lambda_c + \ell^2\lambda_p H c_{\max}/p_{\min} + y(H - h) \\
&\le C^\pi_h(\tau_h) + \ell(\lambda_c + \lambda_p)(s_{\max} - s_{\min}) H c_{\max}/p_{\min} + y(H - h) \\
&= C^\pi_h(\tau_h) + y(H - h + 1).
\end{aligned}$$
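The discretization underlying Lemma 5.11 can be sketched as follows for a one-dimensional state space $[s_{\min}, s_{\max}]$; the `density` argument and the Riemann-sum cell mass are illustrative stand-ins for the abstract transition kernel, not the paper's exact construction.

```python
import math

def snap(s, ell, s_min):
    """Map a continuous state to its grid cell representative (bar-s above)."""
    return s_min + math.floor((s - s_min) / ell) * ell

def cell_mass(density, cell_lo, ell, n=1000):
    """Approximate the discretized transition probability, i.e., the integral
    of P_h(s' | ., .) over [cell_lo, cell_lo + ell], by a midpoint sum."""
    width = ell / n
    return sum(density(cell_lo + (i + 0.5) * width) for i in range(n)) * width

# Illustrative usage with a uniform density on [0, 1]:
ell = 0.1
print(snap(0.237, ell, 0.0))               # -> 0.2
print(cell_mass(lambda s: 1.0, 0.2, ell))  # -> ~0.1
```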
Letting y def = ℓ(λc + λp)Hcmax(smax smin)/ pmin, the bound then becomes, Cπ h( τh) = ch( s ℓ, a) + max s [ Ph( s | s ℓ, a) > 0] Cπ h+1( τh+1) = ch( s ℓ, a) + max s [ Z s +ℓ s = s Ph(s | s ℓ, a)ds > 0] Cπ h+1( τh+1) = ch( s ℓ, a) + max s R s +ℓ s = s Ph(s | s ℓ, a)ds p s Cπ h+1( τh+1) = ch( s ℓ, a) + max s s = s Ph(s | s ℓ, a) Cπ h+1( τh+1)ds /p s ch( s ℓ, a) + max s s = s Ph(s | s ℓ, a)(Cπ h+1(τh+1) + y(H h))ds /p s ch( s ℓ, a) + max s s = s Ph(s | s ℓ, a)Cπ h+1(τh+1)ds /p s + y(H h) ch(s, a) + ℓλc + max s s = s Ph(s | s, a)Cπ h+1(τh+1)ds /p s + ℓλp max s s = s Cπ h+1(τh+1)ds /p s + y(H h) ch(s, a) + max S S S Ph(s | s, a) p S Cπ h+1(τh+1)ds + ℓλc + ℓ2λp Hcmax/ pmin = Cπ h(τh) + ℓλc + ℓ2λp Hcmax/ pmin + y(H h) Cπ h(τh) + ℓ(λc + λp)(smax smin)Hcmax/ pmin + y(H h) = Cπ h(τh) + y(H h + 1). D.7. Proof of Theorem 5.12 Proof. The theorem follows immediately from Theorem 5.4 and Lemma 5.11. E. Extensions Markov Games. It is easy to see that our augmented approach works to compute constrained equilibria. For efficient algorithms, using to indicate infeasibility becomes problematic. However, we can still use per-stage LP solutions and add a constraint that the equilibrium value must be larger than some very small constant to rule out invalid solutions. Alternatively, the AND/OR tree approach used in (Mc Mahan and Zhu, 2024a) can be applied here to directly compute all the near-feasible states. Infinite Discounting. Since we focus on approximation algorithms, the infinite discounted case can be immediately handled by using the idea of effective horizon. We can treat the problem as a finite horizon problem where the finite horizon H is defined so that P h=H γh 1cmax ϵ . By choosing ϵ and ϵ small enough, we can get equivalent feasibility approximations. The discounting also ensures the effective horizon H is polynomially sized, implying efficient computation. Stochastic Policies. For stochastic policies, our approximate results follow from simply replacing each maxa and maxbt with a general linear program over a finite distribution, which can be solved in polynomial time. Polynomial-Time Approximability of Constrained Reinforcement Learning Stochastic Costs. For finitely-supported cost distributions, all results remain the same except for almost-sure/anytime constraints, which now must be written in the form: Cπ h(τh) = max c Supp(Ch(s,a)) c + max s [Ph(s | s, a) > 0]Cπ h(τh, a, c, s ). (20) Also, note that histories must now be cost-dependent. Now, we have that future budgets depend on both the next state and the realized cost, so our (ADP) must now be dependent on both states and immediate costs for subproblems. The construction is similar to the approach in (Mc Mahan, 2024).