Learning Adversarial MDPs with Stochastic Hard Constraints

Francesco Emanuele Stradi 1, Matteo Castiglioni 1, Alberto Marchesi 1, Nicola Gatti 1

Abstract

We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first one, we address general CMDPs, where we design an algorithm attaining sublinear regret and sublinear cumulative positive constraints violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that the constraints are satisfied at every episode with high probability. In the last scenario, we only assume the existence of a strictly feasible policy, which is not known to the learner, and we design an algorithm attaining sublinear regret and constant cumulative positive constraints violation. Finally, we show that in the last two scenarios a dependence on Slater's parameter is unavoidable. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with existing ones, enabling their adoption in a much wider range of applications.

1. Introduction

Reinforcement learning (Sutton & Barto, 2018) studies problems where a learner sequentially takes actions in an environment modeled as a Markov decision process (MDP) (Puterman, 2014). Most algorithms for such problems focus on learning policies that prescribe how the learner should take actions so as to minimize losses (equivalently, maximize rewards). However, in many real-world applications, the learner must fulfill additional requirements. For instance, autonomous vehicles must avoid crashing (Wen et al., 2020; Isele et al., 2018), bidding agents in ad auctions must not deplete their budget (Wu et al., 2018; He et al., 2021), and recommendation systems must not present offending items to their users (Singh et al., 2020). A commonly-used model that captures such additional requirements is the constrained MDP (CMDP) (Altman, 1999), where the goal is to learn a loss-minimizing policy while at the same time satisfying some constraints.

1 DEIB, Politecnico di Milano, Milan, Italy. Correspondence to: Francesco Emanuele Stradi. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

We study online learning problems in episodic CMDPs with adversarial losses and stochastic hard constraints, under bandit feedback. In such settings, the goal of the learner is to minimize their regret (the difference between their cumulative loss and what they would have obtained by always selecting a best-in-hindsight policy), while at the same time guaranteeing that the constraints are satisfied during the learning process. We consider three scenarios that differ in the way in which the constraints are satisfied; all of them are usually referred to as hard constraints settings in the literature (Liu et al., 2021; Guo et al., 2022). In the first scenario, the learner attains sublinear cumulative positive constraints violation.
In the second one, the learner satisfies the constraints at every episode, while, in the third one, they achieve constant cumulative positive constraints violation. To the best of our knowledge, our work is the first to study CMDPs that involve both adversarial losses and hard constraints. Indeed, all the works on adversarial CMDPs (see, e.g., (Wei et al., 2018; Qiu et al., 2020)) consider settings with soft constraints. These are much weaker than hard constraints, as they are only concerned with the minimization of the cumulative (both positive and negative) constraints violation. As a result, they allow negative violations to cancel out positive ones across different episodes. Such cancellations are unreasonable in real-world applications. For instance, in autonomous driving, avoiding a collision clearly does not repair a crash that occurred previously. Furthermore, the only few works addressing stochastic hard constraints in CMDPs (Liu et al., 2021; Shi et al., 2023; Müller et al., 2024; Stradi et al., 2025) are restricted to stochastic losses. Thus, their techniques cannot be easily generalized to our setting. Our CMDP settings capture many more applications than theirs, since being able to deal with adversarial losses allows one to tackle general non-stationary environments, which are ubiquitous in the real world. We refer to Appendix A for a complete discussion of related works.

1.1. Original contributions

We start by addressing the first scenario, where we design an algorithm called Sublinear Violation Optimistic Policy Search (SV-OPS) that guarantees both sublinear regret and sublinear cumulative positive constraints violation. SV-OPS builds on top of state-of-the-art learning algorithms for adversarial, unconstrained MDPs, by introducing the tools necessary to deal with constraints violation. Specifically, SV-OPS works by selecting policies that optimistically satisfy the constraints. SV-OPS updates the set of such policies in an online fashion, guaranteeing that it is always non-empty with high probability and that it collapses to the (true) set of constraint-satisfying policies as the number of episodes increases. This allows SV-OPS to attain sublinear violation. Crucially, even though such an optimistic set of policies changes during the execution of the algorithm, it always contains the (true) set of constraint-satisfying policies. This allows SV-OPS to attain sublinear regret. SV-OPS also addresses a problem left open by Qiu et al. (2020), i.e., learning with bandit feedback in CMDPs with adversarial losses and stochastic constraints. Indeed, SV-OPS goes even further, as Qiu et al. (2020) were only concerned with soft constraints, while SV-OPS is capable of managing positive constraints violation.

Next, we switch our attention to the second scenario, where we design a safe algorithm, i.e., one that satisfies the constraints at every episode. To achieve this, we need to assume that the learner has knowledge of a policy strictly satisfying the constraints. Indeed, this is necessary even in simple stochastic multi-armed bandit settings, as shown in (Bernasconi et al., 2022). This scenario begets considerable additional challenges compared to the first one, since ensuring safety severely limits exploration capabilities, rendering techniques for adversarial, unconstrained MDPs inapplicable.
Nevertheless, we design an algorithm called Safe Optimistic Policy Search (S-OPS) that attains sublinear regret while being safe with high probability. S-OPS works by selecting, at each episode, a suitable randomization between the policy that SV-OPS would choose and the (known) policy strictly satisfying the constraints. Crucially, the probability defining the randomization employed by the algorithm is carefully chosen in order to pessimistically account for constraints satisfaction. This guarantees that a sufficient amount of exploration is performed.

Then, in the third scenario, we design an algorithm that attains constant cumulative positive constraints violation and sublinear regret, by simply assuming that a policy strictly satisfying the constraints exists, but is not known to the learner. Our algorithm, called Constant Violation Optimistic Policy Search (CV-OPS), estimates such a policy and its associated constraints violation in a constant number of episodes. This is done by employing two no-regret algorithms: the first with the objective of minimizing violation, and the second with the goal of selecting the most violated constraint. A stopping condition that depends on the guarantees of both no-regret algorithms enforces that the number of episodes used to estimate the desired policy is sufficient, while still being constant. After that, CV-OPS runs S-OPS with the estimated policy, attaining the desired results.

Finally, we provide a lower bound showing that any algorithm attaining $o(\sqrt{T})$ violation cannot avoid a dependence on Slater's parameter in the regret bound. We believe that this result may be of independent interest, since it is not only applicable to our second and third settings, but also to other settings where a larger violation is allowed.

2. Preliminaries

2.1. Constrained Markov decision processes

We study episodic constrained MDPs (CMDPs) (Altman, 1999) with adversarial losses and stochastic constraints. These are tuples $M := (X, A, P, \{\ell_t\}_{t=1}^{T}, \{G_t\}_{t=1}^{T}, \alpha)$. $T$ is the number of episodes.[1] $X$ and $A$ are finite state and action spaces. $P : X \times A \times X \to [0, 1]$ is a transition function, where $P(x'|x, a)$ denotes the probability of moving from state $x \in X$ to $x' \in X$ by taking action $a \in A$.[2] $\{\ell_t\}_{t=1}^{T}$ is the sequence of vectors of losses at each episode, namely $\ell_t \in [0, 1]^{|X \times A|}$. We refer to the loss for a state-action pair $(x, a) \in X \times A$ as $\ell_t(x, a)$. Losses are adversarial, i.e., no statistical assumption is made on how they are selected. $\{G_t\}_{t=1}^{T}$ is the sequence of matrices that define the costs characterizing the $m$ constraints at each episode, namely $G_t \in [0, 1]^{|X \times A| \times m}$. For $i \in [m]$, the $i$-th constraint cost for $(x, a) \in X \times A$ is denoted by $g_{t,i}(x, a)$. Costs are stochastic, i.e., the matrices $G_t$ are i.i.d. random variables distributed according to an (unknown) probability distribution $\mathcal{G}$. Finally, $\alpha = [\alpha_1, \ldots, \alpha_m] \in [0, L]^m$ is the vector of cost thresholds characterizing the $m$ constraints, where $\alpha_i$ denotes the threshold for the $i$-th constraint.

The learner uses a policy $\pi : X \times A \to [0, 1]$, which defines a probability distribution over actions at each state. We denote by $\pi(\cdot|x)$ the distribution at $x \in X$, with $\pi(a|x)$ being the probability of action $a \in A$.

[1] We denote an episode by $t \in [T]$, where $[a \ldots b]$ is the set of all integers from $a$ to $b$ and $[b] := [1 \ldots b]$.

[2] For ease of notation, we focus w.l.o.g. on loop-free CMDPs. This means that $X$ is partitioned into $L+1$ layers $X_0, \ldots, X_L$ with $X_0 = \{x_0\}$ and $X_L = \{x_L\}$. Moreover, the loop-free property requires that $P(x'|x, a) > 0$ only if $x \in X_k$ and $x' \in X_{k+1}$ for some $k \in [0 \ldots L-1]$.
Any (episodic) CMDP with horizon $H$ that is not loop-free can be cast into a loop-free one by duplicating the state space $H$ times. In loop-free CMDPs, we let $k(x) \in [0 \ldots L]$ be the index of the layer in which state $x \in X$ lies.

Algorithm 1 CMDP interaction at episode $t \in [T]$
1: $\ell_t$, $G_t$ chosen adversarially and stochastically, resp.
2: Learner chooses a policy $\pi_t : X \times A \to [0, 1]$
3: Environment is initialized to state $x_0$
4: for $k = 0, \ldots, L-1$ do
5:   Learner takes action $a_k \sim \pi_t(\cdot|x_k)$
6:   Learner sees $\ell_t(x_k, a_k)$ and $g_{t,i}(x_k, a_k)$ for all $i \in [m]$
7:   Environment evolves to $x_{k+1} \sim P(\cdot|x_k, a_k)$
8:   Learner observes the next state $x_{k+1}$
9: end for

Algorithm 1 details the learner-environment interaction at episode $t \in [T]$, where the learner has bandit feedback. Specifically, the learner observes the trajectory of state-action pairs $(x_k, a_k)$, for $k \in [0 \ldots L-1]$, visited during the episode, their losses $\ell_t(x_k, a_k)$, and costs $g_{t,i}(x_k, a_k)$ for $i \in [m]$. The learner knows $X$ and $A$, but does not know anything about the transition function $P$.

We introduce the notion of occupancy measure (Rosenberg & Mansour, 2019a). Given a transition function $P$ and a policy $\pi$, the vector $q^{P,\pi} \in [0, 1]^{|X \times A \times X|}$ is the occupancy measure induced by $P$ and $\pi$. For every $x \in X_k$, $a \in A$, and $x' \in X_{k+1}$ with $k \in [0 \ldots L-1]$, it holds

$$q^{P,\pi}(x, a, x') := \mathbb{P}[x_k = x, a_k = a, x_{k+1} = x' \mid P, \pi].$$

Moreover, we also let $q^{P,\pi}(x, a) := \sum_{x' \in X_{k+1}} q^{P,\pi}(x, a, x')$ and $q^{P,\pi}(x) := \sum_{a \in A} q^{P,\pi}(x, a)$. Next, we characterize valid occupancy measures.

Lemma 2.1 (Rosenberg & Mansour (2019b)). A vector $q \in [0, 1]^{|X \times A \times X|}$ is a valid occupancy measure of an episodic loop-free MDP if and only if it holds:

$$\sum_{x \in X_k} \sum_{a \in A} \sum_{x' \in X_{k+1}} q(x, a, x') = 1 \quad \forall k \in [0 \ldots L-1],$$

$$\sum_{a \in A} \sum_{x' \in X_{k+1}} q(x, a, x') = \sum_{x' \in X_{k-1}} \sum_{a \in A} q(x', a, x) \quad \forall k \in [1 \ldots L-1], \forall x \in X_k,$$

$$P^q = P,$$

where $P$ is the transition function of the MDP and $P^q$ is the one induced by $q$ (see below).

Notice that any valid occupancy measure $q$ induces a transition function $P^q$ and a policy $\pi^q$, with $P^q(x'|x, a) = q(x, a, x')/q(x, a)$ and $\pi^q(a|x) = q(x, a)/q(x)$.

2.2. Online learning with hard constraints

Our baseline for evaluating the performance of the learner is defined through a linear programming formulation of the (offline) learning problem in constrained MDPs. Specifically, given a constrained MDP $M := (X, A, P, \ell, G, \alpha)$ characterized by a loss vector $\ell \in [0, 1]^{|X \times A|}$, a cost matrix $G \in [0, 1]^{|X \times A| \times m}$, and a threshold vector $\alpha \in [0, L]^m$, such a problem consists in finding a policy minimizing the loss while ensuring that all the constraints are satisfied. Thus, our baseline $\text{OPT}_{\ell,G,\alpha}$ is defined as the optimal value of a parametric linear program, which reads as follows:

$$\text{OPT}_{\ell,G,\alpha} := \min_{q \in \Delta(M)} \ \ell^\top q \quad \text{s.t.} \quad G^\top q \le \alpha, \tag{1}$$

where $q \in [0, 1]^{|X \times A|}$ is a vector encoding an occupancy measure, while $\Delta(M)$ is the set of valid occupancy measures. Notice that, given the equivalence between policies and occupancies, the (offline) learning problem can be formulated as a linear program working in the space of occupancy measures $q$, since expected losses and costs are linear in $q$.

As customary in settings with adversarial losses, we measure the performance of an algorithm by comparing it with the best-in-hindsight constraint-satisfying policy. The performance of the learner is evaluated in terms of the (cumulative) regret $R_T := \sum_{t=1}^{T} \ell_t^\top q^{P,\pi_t} - T \cdot \text{OPT}_{\overline{\ell},\overline{G},\alpha}$, where $\overline{\ell} := \frac{1}{T} \sum_{t=1}^{T} \ell_t$ is the average of the adversarial losses over the $T$ episodes and $\overline{G} := \mathbb{E}_{G \sim \mathcal{G}}[G]$ is the expected value of the stochastic cost matrices. We let $q^*$ be a best-in-hindsight constraint-satisfying occupancy measure, i.e., one achieving value $\text{OPT}_{\overline{\ell},\overline{G},\alpha}$, while we let $\pi^*$ be its corresponding policy. Thus, the regret reduces to $R_T := \sum_{t=1}^{T} \ell_t^\top (q^{P,\pi_t} - q^*)$. For ease of notation, we refer to $q^{P,\pi_t}$ by simply using $q_t$, thus omitting the dependency on $P$ and $\pi_t$. Our goal is to design learning algorithms with $R_T = o(T)$, while at the same time satisfying the constraints.
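To make the baseline concrete, the following is a minimal sketch (ours, not the paper's code) of Problem (1) as a linear program, assuming a known transition function $P$ so that $\Delta(M)$ reduces to the flow constraints of Lemma 2.1; all function and variable names are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp_lp(P, ell, G, alpha, layers):
    """Solve Problem (1): min_q ell^T q  s.t.  G^T q <= alpha,  q in Delta(M).

    Sketch with a KNOWN transition function P, so occupancies over
    state-action pairs suffice. `layers[k]` lists the states of X_k for
    k = 0, ..., L-1 (the terminal layer is implicit).
    P[x, a, x2]: transitions; ell[x, a]: losses; G[x, a, i]: costs.
    """
    n, nA = ell.shape
    m = G.shape[2]
    idx = lambda x, a: x * nA + a            # flatten (x, a) into one LP variable
    nvar = n * nA

    A_eq, b_eq = [], []
    # Layer 0: all probability mass starts at the initial state x0.
    row = np.zeros(nvar)
    for a in range(nA):
        row[idx(layers[0][0], a)] = 1.0
    A_eq.append(row); b_eq.append(1.0)
    # Flow conservation at every state of layers 1, ..., L-1 (Lemma 2.1).
    for k in range(1, len(layers)):
        for x in layers[k]:
            row = np.zeros(nvar)
            for a in range(nA):
                row[idx(x, a)] = 1.0          # outflow of x
            for xp in layers[k - 1]:
                for a in range(nA):
                    row[idx(xp, a)] -= P[xp, a, x]   # inflow into x
            A_eq.append(row); b_eq.append(0.0)

    # Cost constraints G^T q <= alpha, one row per constraint i in [m].
    A_ub = np.stack([G[:, :, i].reshape(-1) for i in range(m)])
    res = linprog(ell.reshape(-1), A_ub=A_ub, b_ub=alpha,
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
    return res.fun, res.x.reshape(n, nA)      # OPT value and occupancy q(x, a)
```

In the online problem studied below, $P$, $\overline{\ell}$, and $\overline{G}$ are unknown, so this LP only serves as the benchmark against which regret is measured.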
We consider three different settings, all falling under the umbrella of hard constraints settings (Guo et al., 2022), introduced in the following.

2.2.1. GUARANTEEING SUBLINEAR VIOLATION

In this setting, we consider the (cumulative) positive constraints violation $V_T := \max_{i \in [m]} \sum_{t=1}^{T} \left[ \overline{G}^\top q_t - \alpha \right]_i^+$, where $[x]^+ := \max\{0, x\}$. Our goal is to design algorithms with $V_T = o(T)$. To achieve such a goal, we only need to assume that the problem is well posed, as follows:

Assumption 2.2. There is an occupancy measure $q^\circ$, called a feasible solution, such that $\overline{G}^\top q^\circ \le \alpha$.

2.2.2. GUARANTEEING SAFETY

In this setting, our goal is to design algorithms ensuring that the following safety property is met:

Definition 2.3 (Safe algorithm). An algorithm is safe if and only if $\overline{G}^\top q_t \le \alpha$ for all $t \in [T]$.

As shown by Bernasconi et al. (2022), without further assumptions, it is not possible to achieve $R_T = o(T)$ while at the same time guaranteeing that the safety property holds with high probability, even in simple stochastic multi-armed bandit instances. To design safe learning algorithms, we need the following two assumptions. The first one is about the possibility of strictly satisfying the constraints.

Assumption 2.4 (Slater's condition). There is an occupancy measure $q^\diamond$ such that $\overline{G}^\top q^\diamond < \alpha$. We call $q^\diamond$ a strictly feasible solution, while the policy $\pi^\diamond$ induced by $q^\diamond$ is a strictly feasible policy.

The second assumption is related to the learner's knowledge about a strictly feasible policy.

Assumption 2.5. Both the policy $\pi^\diamond$ and its costs $\beta = [\beta_1, \ldots, \beta_m] := \overline{G}^\top q^\diamond$ are known to the learner.

Intuitively, Assumption 2.5 is needed to guarantee that safety holds during the first episodes, when the learner's uncertainty about the costs is high. Assumptions 2.4 and 2.5 are often employed in CMDPs (see, e.g., (Liu et al., 2021)), as they are usually met in real-world applications of interest, where it is common to have access to a do-nothing policy resulting in no constraint being violated.

2.2.3. GUARANTEEING CONSTANT VIOLATION

In this setting, we relax Assumption 2.5, and we only assume Slater's condition (Assumption 2.4). We show that it is possible to achieve constant violation, namely, $V_T$ is upper bounded by a constant independent of $T$, while attaining regret guarantees similar to those of the second setting. Our results depend on Slater's parameter $\rho \in [0, L]$, which is defined as $\rho := \max_{q \in \Delta(M)} \min_{i \in [m]} \left( \alpha_i - \overline{g}_i^\top q \right)$, with $q^\diamond := \arg\max_{q \in \Delta(M)} \min_{i \in [m]} \left( \alpha_i - \overline{g}_i^\top q \right)$ being its associated occupancy. Given $\beta_i := \overline{g}_i^\top q^\diamond$, we have $\rho = \min_{i \in [m]} (\alpha_i - \beta_i)$. By Assumption 2.4, it holds $\rho > 0$.
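Since $\rho$ is a max-min over a polytope, it can itself be computed by one more LP via a standard epigraph reformulation. A hedged sketch, reusing the flow constraints `A_eq`, `b_eq` built as in the earlier LP sketch:

```python
import numpy as np
from scipy.optimize import linprog

def slater_parameter(A_eq, b_eq, G_bar, alpha):
    """rho = max_{q in Delta(M)} min_i (alpha_i - g_bar_i^T q), via an
    epigraph LP: maximize r subject to G_bar^T q + r <= alpha, q in Delta(M).
    A_eq, b_eq encode the flow constraints; G_bar has shape (n_sa, m)."""
    nvar, m = G_bar.shape
    c = np.zeros(nvar + 1); c[-1] = -1.0                  # minimize -r
    A_ub = np.hstack([G_bar.T, np.ones((m, 1))])          # g_bar_i^T q + r <= alpha_i
    A_eq2 = np.hstack([A_eq, np.zeros((A_eq.shape[0], 1))])
    bounds = [(0, 1)] * nvar + [(None, None)]             # r is unrestricted
    res = linprog(c, A_ub=A_ub, b_ub=alpha, A_eq=A_eq2, b_eq=b_eq, bounds=bounds)
    return res.x[-1]   # rho > 0 certifies Slater's condition (Assumption 2.4)
```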
3. Concentration bounds

In this section, we provide concentration bounds for the estimates of the unknown stochastic parameters of the CMDP. Let $N_t(x, a)$ be the number of episodes up to $t \in [T]$ in which the state-action pair $(x, a) \in X \times A$ is visited. Then,

$$\widehat{g}_{t,i}(x, a) := \frac{\sum_{\tau \in [t]} g_{\tau,i}(x, a) \mathbb{1}_\tau\{x, a\}}{\max\{1, N_t(x, a)\}},$$

with $\mathbb{1}_\tau\{x, a\} = 1$ if and only if $(x, a)$ is visited in episode $\tau$, is an unbiased estimator of the expected cost of constraint $i \in [m]$ for $(x, a)$, which we denote by $\overline{g}_i(x, a) := \mathbb{E}_{G \sim \mathcal{G}}[g_{t,i}(x, a)]$. Thus, by applying Hoeffding's inequality, it holds, with probability at least $1 - \delta$, that $|\widehat{g}_{t,i}(x, a) - \overline{g}_i(x, a)| \le \xi_t(x, a)$, where we let the confidence bound be $\xi_t(x, a) := \min\{1, \sqrt{4 \ln(T|X||A|m/\delta) / \max\{1, N_t(x, a)\}}\}$ (refer to Lemma B.2 for the formal result). For ease of notation, we let $\widehat{G}_t \in [0, 1]^{|X \times A| \times m}$ be the matrix of the estimated costs $\widehat{g}_{t,i}(x, a)$. Moreover, we denote by $\xi_t \in [0, 1]^{|X \times A|}$ the vector whose entries are the bounds $\xi_t(x, a)$, and we let $\Xi_t \in [0, 1]^{|X \times A| \times m}$ be a matrix built by concatenating vectors $\xi_t$, in such a way that the statement of Lemma B.2 becomes: $|\widehat{G}_t - \overline{G}| \le \Xi_t$ holds with probability at least $1 - \delta$, where $|\cdot|$ and $\le$ are applied component-wise. In the following, given any $\delta \in (0, 1)$, we refer to the event defined in Lemma B.2 as $\mathcal{E}_G(\delta)$.

Similarly, we define confidence sets for the transition function of a CMDP, by exploiting suitable concentration bounds for estimated transition probabilities. By letting $M_t(x, a, x')$ be the total number of episodes up to $t \in [T]$ in which $(x, a) \in X \times A$ is visited and the environment transitions to $x' \in X$, the estimated transition probability for $(x, a, x')$ is defined as $\widehat{P}_t(x'|x, a) := \frac{M_t(x, a, x')}{\max\{1, N_t(x, a)\}}$. Then, at episode $t \in [T]$, the confidence set for the transitions is $\mathcal{P}_t := \bigcap_{(x, a, x') \in X \times A \times X} \mathcal{P}_t^{x, a, x'}$, with

$$\mathcal{P}_t^{x, a, x'} := \left\{ P' : \left| P'(x'|x, a) - \widehat{P}_t(x'|x, a) \right| \le \epsilon_t(x, a, x') \right\},$$

where we let

$$\epsilon_t(x, a, x') := 2 \sqrt{\frac{\widehat{P}_t(x'|x, a) \ln(T|X||A|/\delta)}{\max\{1, N_t(x, a) - 1\}}} + \frac{14 \ln(T|X||A|/\delta)}{3 \max\{1, N_t(x, a) - 1\}}$$

for some confidence $\delta \in (0, 1)$. It is well known that, with probability at least $1 - 4\delta$, the transition function $P$ belongs to $\mathcal{P}_t$ for all $t \in [T]$ (see (Jin et al., 2020) and Lemma G.1 for the formal statement). At each $t \in [T]$, given a confidence set $\mathcal{P}_t$, it is possible to efficiently build a set $\Delta(\mathcal{P}_t)$ that comprises all the occupancy measures that are valid with respect to some transition function $P' \in \mathcal{P}_t$. For reasons of space, we defer the formal definition of $\Delta(\mathcal{P}_t)$ to Appendix G. Lemma G.1 implies that, with high probability, the set $\Delta(M)$ of valid occupancy measures is included in all the estimated sets $\Delta(\mathcal{P}_t)$, for every $t \in [T]$. In the following, given any confidence $\delta \in (0, 1)$, we refer to the event that $\Delta(M) \subseteq \bigcap_{t \in [T]} \Delta(\mathcal{P}_t)$ as $\mathcal{E}_\Delta(\delta)$, which holds with probability at least $1 - 4\delta$ thanks to Lemma G.1.

Finally, for ease of presentation, given $\delta \in (0, 1)$, we define a clean event $\mathcal{E}_{G,\Delta}(\delta)$ under which all the concentration bounds for costs and transitions hold. Formally, $\mathcal{E}_{G,\Delta}(\delta) := \mathcal{E}_G(\delta) \cap \mathcal{E}_\Delta(\delta)$, which holds with probability at least $1 - 5\delta$ by a union bound (and Lemmas B.2 and G.1).
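A minimal sketch of the cost estimators and confidence bounds just defined (the transition bounds $\epsilon_t$ are maintained analogously from the counters $M_t$); the class and array layout are our own illustrative choices:

```python
import numpy as np

class CostEstimator:
    """Empirical cost means g_hat_{t,i}(x, a) and confidence bounds xi_t(x, a)
    from Section 3. Arrays are indexed by flattened (x, a) pairs."""
    def __init__(self, n_sa, m, T, n_x, n_a, delta):
        self.N = np.zeros(n_sa)                    # visit counters N_t(x, a)
        self.g_sum = np.zeros((n_sa, m))           # running sums of observed costs
        self.log_term = 4.0 * np.log(T * n_x * n_a * m / delta)

    def update(self, trajectory):
        # trajectory: list of (sa_index, cost_vector) pairs seen in the episode
        for sa, g in trajectory:
            self.N[sa] += 1
            self.g_sum[sa] += g

    def g_hat(self):                               # shape (n_sa, m)
        return self.g_sum / np.maximum(1.0, self.N)[:, None]

    def xi(self):                                  # shape (n_sa,)
        return np.minimum(1.0, np.sqrt(self.log_term / np.maximum(1.0, self.N)))
```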
4. Guaranteeing sublinear violation

We start by designing the SV-OPS algorithm, which guarantees that both the regret $R_T$ and the positive constraints violation $V_T$ are sublinear in $T$. To get this result, we only need the existence of a feasible solution (Assumption 2.2).

Dealing with adversarial losses while limiting positive constraints violation begets considerable challenges, which go beyond the classical exploration-exploitation trade-offs faced in unconstrained settings. On the one hand, using state-of-the-art algorithms for online learning in adversarial, unconstrained MDPs would lead to sublinear regret, but violation would grow linearly. On the other hand, a naïve approach that randomly explores to compute a set of policies satisfying the constraints with high probability can lead to sublinear violation, at the cost of linear regret. Thus, a clever adaptation of the techniques for unconstrained settings is needed.

Algorithm 2 SV-OPS
Require: $X$, $A$, $\alpha$, $T$, $\delta$, $\eta$, $\gamma$
1: for $k \in [0 \ldots L-1]$, $(x, a, x') \in X_k \times A \times X_{k+1}$ do
2:   $N_0(x, a) \gets 0$; $M_0(x, a, x') \gets 0$
3:   $\widehat{q}_1(x, a, x') \gets 1/(|X_k||A||X_{k+1}|)$
4: end for
5: $\pi_1 \gets \pi^{\widehat{q}_1}$
6: for $t \in [T]$ do
7:   Choose $\pi_t$ in Algorithm 1 and receive feedback
8:   Build upper occupancy bounds for $k \in [0 \ldots L-1]$: $u_t(x_k, a_k) \gets \max_{P' \in \mathcal{P}_{t-1}} q^{P',\pi_t}(x_k, a_k)$
9:   Build the optimistic loss estimator for $(x, a) \in X \times A$: $\widehat{\ell}_t(x, a) \gets \frac{\ell_t(x, a)}{u_t(x, a) + \gamma}$ if $\mathbb{1}_t\{x, a\} = 1$, and $\widehat{\ell}_t(x, a) \gets 0$ otherwise
10:  for $k \in [0 \ldots L-1]$ do
11:    $N_t(x_k, a_k) \gets N_{t-1}(x_k, a_k) + 1$
12:    $M_t(x_k, a_k, x_{k+1}) \gets M_{t-1}(x_k, a_k, x_{k+1}) + 1$
13:  end for
14:  Build $\mathcal{P}_t$, $\widehat{G}_t$, and $\Xi_t$ as in Section 3
15:  Build the unconstrained occupancy for all $(x, a, x')$: $\widetilde{q}_{t+1}(x, a, x') \gets \widehat{q}_t(x, a, x') e^{-\eta \widehat{\ell}_t(x, a)}$
16:  if $\text{PROJ}(\widetilde{q}_{t+1}, \widehat{G}_t, \Xi_t, \mathcal{P}_t)$ is feasible then
17:    $\widehat{q}_{t+1} \gets \text{PROJ}(\widetilde{q}_{t+1}, \widehat{G}_t, \Xi_t, \mathcal{P}_t)$
18:  else
19:    $\widehat{q}_{t+1} \gets$ any $q \in \Delta(\mathcal{P}_t)$
20:  end if
21:  $\pi_{t+1} \gets \pi^{\widehat{q}_{t+1}}$
22: end for

Our algorithm, called Sublinear Violation Optimistic Policy Search (SV-OPS), works by selecting policies derived from a set of occupancy measures that optimistically satisfy the cost constraints. This ensures that the set is always non-empty with high probability and that it collapses to the (true) set of constraint-satisfying occupancy measures as the number of episodes increases, enabling SV-OPS to attain sublinear constraints violation. The fundamental property preserved by SV-OPS is that, even though the optimistic set changes during the execution of the algorithm, it always subsumes the (true) set of constraint-satisfying occupancy measures. This crucially allows SV-OPS to employ classical policy-selection methods for unconstrained MDPs.

Algorithm 2 provides the pseudocode of SV-OPS. At each episode $t \in [T]$, SV-OPS plays policy $\pi_t$ and receives feedback as described in Algorithm 1 (Line 7). Then, SV-OPS computes an upper occupancy bound $u_t(x_k, a_k)$ for every state-action pair $(x_k, a_k)$ visited during Algorithm 1, by using the confidence set for the transition function $\mathcal{P}_{t-1}$ computed in the previous episode, namely, it sets $u_t(x_k, a_k) := \max_{P' \in \mathcal{P}_{t-1}} q^{P',\pi_t}(x_k, a_k)$ for every $k \in [0 \ldots L-1]$ (Line 8). Intuitively, $u_t(x_k, a_k)$ represents the maximum probability with which $(x_k, a_k)$ is visited when using policy $\pi_t$, given the confidence set for the transition function built so far. The upper occupancy bounds are combined with the exploration factor $\gamma$ to compute an optimistic loss estimator $\widehat{\ell}_t(x, a)$ for every state-action pair $(x, a) \in X \times A$ (see Line 9). After that, SV-OPS updates all the counters given the path traversed in Algorithm 1 (Lines 11-12), it builds the new confidence set $\mathcal{P}_t$, and it computes the matrices $\widehat{G}_t$ and $\Xi_t$ of estimated costs and their bounds, respectively, by using the feedback (Line 14).

To choose a policy $\pi_{t+1}$, SV-OPS first computes an unconstrained occupancy measure $\widetilde{q}_{t+1}$ according to an unconstrained OMD update (Orabona, 2019) (see Line 15). Then, $\widetilde{q}_{t+1}$ is projected onto a suitably-defined set of occupancy measures that optimistically satisfy the constraints. Formally, the projection (Line 16) is defined as

$$\text{PROJ}(\widetilde{q}_{t+1}, \widehat{G}_t, \Xi_t, \mathcal{P}_t) := \arg\min_{q \in \Delta(\mathcal{P}_t)} D(q \,\|\, \widetilde{q}_{t+1}) \quad \text{s.t.} \quad (\widehat{G}_t - \Xi_t)^\top q \le \alpha, \tag{2}$$

where $D(q \,\|\, \widetilde{q}_{t+1})$ is the unnormalized KL-divergence between $q$ and $\widetilde{q}_{t+1}$. Problem (2) is a linearly-constrained convex mathematical program, and, thus, an arbitrarily-good approximate solution can be computed efficiently, that is, in polynomial time.[3] Intuitively, Problem (2) performs a projection onto the set of $q \in \Delta(\mathcal{P}_t)$ that additionally satisfy $(\widehat{G}_t - \Xi_t)^\top q \le \alpha$, where the lower confidence bounds $\widehat{G}_t - \Xi_t$ for the costs are used in order to take an optimistic approach with respect to constraints satisfaction. Finally, if Problem (2) is feasible, then at the next episode SV-OPS selects the policy $\pi^{\widehat{q}_{t+1}}$ induced by a solution $\widehat{q}_{t+1}$ to Problem (2) (Line 17); otherwise, it chooses a policy induced by any $q \in \Delta(\mathcal{P}_t)$ (Line 19).

[3] As customary in adversarial MDPs, we assume that an optimal solution to Problem (2) can be computed efficiently. Otherwise, we can still derive all of our results up to small approximations.
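The following sketch illustrates Lines 15-17 under the simplifying assumption of a known transition function, so that the projection is onto the flow polytope of Lemma 2.1 rather than onto $\Delta(\mathcal{P}_t)$; a generic convex solver stands in for the efficient solver assumed in footnote 3, and all names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def omd_step(q_hat, ell_hat, eta):
    """Line 15: multiplicative-weights (unconstrained OMD) update."""
    return q_hat * np.exp(-eta * ell_hat)

def kl_projection(q_tilde, A_eq, b_eq, G_lcb, alpha):
    """Problem (2) with known transitions: project q_tilde, in unnormalized
    KL-divergence, onto the flow polytope intersected with the optimistic
    constraints (G_hat - Xi)^T q <= alpha. Returns None when infeasible,
    which triggers the fallback of Line 19."""
    def kl(q):
        q = np.maximum(q, 1e-12)
        return float(np.sum(q * np.log(q / q_tilde) - q + q_tilde))
    cons = [{"type": "eq",   "fun": lambda q: A_eq @ q - b_eq},
            {"type": "ineq", "fun": lambda q: alpha - G_lcb.T @ q}]
    res = minimize(kl, x0=np.clip(q_tilde, 1e-12, 1.0), method="SLSQP",
                   bounds=[(1e-12, 1.0)] * q_tilde.size, constraints=cons)
    return res.x if res.success else None
```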
The optimistic approach adopted in Problem (2) crucially allows us to prove the following lemma.

Lemma 4.1. Given a confidence $\delta \in (0, 1)$, Algorithm 2 ensures that $\text{PROJ}(\widetilde{q}_{t+1}, \widehat{G}_t, \Xi_t, \mathcal{P}_t)$ is feasible at every episode $t \in [T]$ with probability at least $1 - 5\delta$.

Lemma 4.1 holds since, under the event $\mathcal{E}_{G,\Delta}(\delta)$, the projection is performed onto a set subsuming the (true) set of constraint-satisfying occupancies. Lemma 4.1 is fundamental, as it allows us to show that SV-OPS attains sublinear $V_T$ and $R_T$.

4.1. Cumulative positive constraints violation

To prove that the positive constraints violation achieved by SV-OPS is sublinear, we exploit the fact that the concentration bounds for costs and transitions shrink at a rate of $O(1/\sqrt{T})$. This allows us to show the following result.

Theorem 4.2. Given $\delta \in (0, 1)$, Algorithm 2 attains $V_T \le O\big( L|X| \sqrt{|A|T \ln(T|X||A|m/\delta)} \big)$ with probability at least $1 - 8\delta$.

4.2. Cumulative regret

The crucial observation that allows us to prove that the regret attained by SV-OPS grows sublinearly is that the set onto which the algorithm performs its projection step (Problem (2)) always contains the (true) set of occupancy measures that satisfy the constraints, and, thus, it also always contains the best-in-hindsight constraint-satisfying occupancy measure $q^*$. As a result, even though cost estimates may be arbitrarily bad, SV-OPS is still guaranteed to select policies resulting in losses that are smaller than or equal to those incurred by $q^*$. This allows us to show the following:

Theorem 4.3. Given $\delta \in (0, 1)$, by setting $\eta = \gamma = \sqrt{L \ln(L|X||A|/\delta)/(T|X||A|)}$, Algorithm 2 attains $R_T \le O\big( L|X| \sqrt{|A|T \ln(T|X||A|/\delta)} \big)$ with probability at least $1 - 10\delta$.

5. Guaranteeing safety

We design another algorithm, called S-OPS, attaining sublinear regret and enjoying the safety property with high probability. To do this, we work under Assumptions 2.4 and 2.5. Designing safe algorithms raises many additional challenges compared to the case studied in Section 4. Indeed, adapting techniques for adversarial, unconstrained MDPs does not work anymore, and, thus, ad hoc approaches are needed. This is because safety severely limits exploration.

Our algorithm, Safe Optimistic Policy Search (S-OPS), builds on top of the SV-OPS algorithm developed in Section 4. Selecting policies derived from the optimistic set of occupancy measures, as done by SV-OPS, is not sufficient anymore, as it would clearly result in the safety property being unsatisfied during the first episodes. Our new algorithm circumvents such an issue by employing, at each episode, a suitable randomization between the policy derived from the optimistic set (the one SV-OPS would select) and the strictly feasible policy $\pi^\diamond$.
Crucially, as we show next, such a randomization accounts for constraints satisfaction by taking a pessimistic approach, namely, by considering upper confidence bounds on the costs characterizing the constraints. This is needed in order to guarantee the safety property. Moreover, having access to the strictly feasible policy $\pi^\diamond$ and its expected costs $\beta$ (Assumption 2.5) allows S-OPS to always place a sufficiently large probability on the policy derived from the optimistic set, so that a sufficient amount of exploration is guaranteed and, in turn, sublinear regret is attained. Notice that S-OPS effectively selects non-Markovian policies, as it employs a randomization between two Markovian policies at each episode.

Algorithm 3 S-OPS
Require: $X$, $A$, $\alpha$, $T$, $\delta$, $\eta$, $\gamma$, $\pi^\diamond$, $\beta$
1: for $k \in [0 \ldots L-1]$, $(x, a, x') \in X_k \times A \times X_{k+1}$ do
2:   $N_0(x, a) \gets 0$; $M_0(x, a, x') \gets 0$
3:   $\widehat{q}_1(x, a, x') \gets 1/(|X_k||A||X_{k+1}|)$
4: end for
5: $\pi_1 \gets \pi^\diamond$ with probability $\lambda_0 := \max_{i \in [m]} \left\{ \frac{L - \alpha_i}{L - \beta_i} \right\}$, and $\pi_1 \gets \pi^{\widehat{q}_1}$ with probability $1 - \lambda_0$
6: for $t \in [T]$ do
7:   Select $\pi_t$ in Algorithm 1 and receive feedback
8:   Build upper occupancy bounds for $k \in [0 \ldots L-1]$: $u_t(x_k, a_k) \gets \max_{P' \in \mathcal{P}_{t-1}} q^{P',\pi_t}(x_k, a_k)$
9:   Build the optimistic loss estimator for $(x, a) \in X \times A$: $\widehat{\ell}_t(x, a) \gets \frac{\ell_t(x, a)}{u_t(x, a) + \gamma}$ if $\mathbb{1}_t\{x, a\} = 1$, and $\widehat{\ell}_t(x, a) \gets 0$ otherwise
10:  for $k \in [0 \ldots L-1]$ do
11:    $N_t(x_k, a_k) \gets N_{t-1}(x_k, a_k) + 1$
12:    $M_t(x_k, a_k, x_{k+1}) \gets M_{t-1}(x_k, a_k, x_{k+1}) + 1$
13:  end for
14:  Build $\mathcal{P}_t$, $\widehat{G}_t$, and $\Xi_t$ as in Section 3
15:  Build the unconstrained occupancy for all $(x, a, x')$: $\widetilde{q}_{t+1}(x, a, x') \gets \widehat{q}_t(x, a, x') e^{-\eta \widehat{\ell}_t(x, a)}$
16:  if $\text{PROJ}(\widetilde{q}_{t+1}, \widehat{G}_t, \Xi_t, \mathcal{P}_t)$ is feasible then
17:    $\widehat{q}_{t+1} \gets \text{PROJ}(\widetilde{q}_{t+1}, \widehat{G}_t, \Xi_t, \mathcal{P}_t)$
18:    $\widehat{\pi}_{t+1} \gets \pi^{\widehat{q}_{t+1}}$
19:    Build $\widehat{u}_{t+1} \in [0, 1]^{|X \times A|}$ so that, for all $(x, a)$: $\widehat{u}_{t+1}(x, a) \gets \max_{P' \in \mathcal{P}_t} q^{P',\widehat{\pi}_{t+1}}(x, a)$
20:    $\sigma \gets \max_{i \in [m]} \frac{\min\{(\widehat{g}_{t,i} + \xi_t)^\top \widehat{u}_{t+1}, L\} - \alpha_i}{\min\{(\widehat{g}_{t,i} + \xi_t)^\top \widehat{u}_{t+1}, L\} - \beta_i}$
21:    $\lambda_t \gets \sigma$ if $\exists i \in [m] : (\widehat{g}_{t,i} + \xi_t)^\top \widehat{u}_{t+1} > \alpha_i$, and $\lambda_t \gets 0$ if $\forall i \in [m] : (\widehat{g}_{t,i} + \xi_t)^\top \widehat{u}_{t+1} \le \alpha_i$
22:  else
23:    $\widehat{q}_{t+1} \gets$ any $q \in \Delta(\mathcal{P}_t)$; $\lambda_t \gets 1$
24:  end if
25:  $\pi_{t+1} \gets \pi^\diamond$ with probability $\lambda_t$, and $\pi_{t+1} \gets \pi^{\widehat{q}_{t+1}}$ with probability $1 - \lambda_t$
26: end for

Algorithm 3 provides the pseudocode of S-OPS. Differently from SV-OPS, the policy selected at the first episode is obtained by randomizing between a uniform occupancy measure and $\pi^\diamond$ (Line 5). The probability $\lambda_0$ of selecting $\pi^\diamond$ is chosen pessimistically. Intuitively, in the first episode, being pessimistic means that $\lambda_0$ must guarantee that the constraints are satisfied for any possible choice of costs and transitions, and, thus, $\lambda_0 := \max_{i \in [m]} \{(L - \alpha_i)/(L - \beta_i)\}$. Thanks to Assumptions 2.4 and 2.5, it is always the case that $\lambda_0 < 1$. Thus, $\pi_1 = \pi^{\widehat{q}_1}$ with positive probability, and some exploration is performed even in the first episode.

Analogously to SV-OPS, at each $t \in [T]$, S-OPS selects a policy $\pi_t$ and receives feedback as described in Algorithm 1, it computes optimistic loss estimators, it updates the confidence set for the transitions, and it computes the matrices of estimated costs and their bounds. Then, as in SV-OPS, an update step of unconstrained OMD is performed. Although identical to the update done in SV-OPS, the one in S-OPS uses loss estimators computed when playing a randomization between the policy obtained by solving Problem (2) and the strictly feasible policy $\pi^\diamond$. Thus, there is a mismatch between the occupancy measure used to estimate losses and the one computed by the projection step. The projection step performed by S-OPS (Line 16) is the same as the one in SV-OPS. Specifically, the algorithm projects the unconstrained occupancy measure $\widetilde{q}_{t+1}$ onto an optimistic set by solving Problem (2), which, if the problem is feasible, results in occupancy measure $\widehat{q}_{t+1}$. However, differently from SV-OPS, when the problem is feasible, S-OPS does not select the policy $\pi^{\widehat{q}_{t+1}}$ derived from $\widehat{q}_{t+1}$, but rather uses a randomization between such a policy and the strictly feasible policy $\pi^\diamond$ (Line 25). The probability $\lambda_t$ of selecting $\pi^\diamond$ is chosen pessimistically with respect to constraints satisfaction, by using upper confidence bounds for the costs and upper occupancy bounds given policy $\pi^{\widehat{q}_{t+1}}$ (Lines 19 and 21). Such a pessimistic approach ensures that the constraints are satisfied with high probability, thus making the algorithm safe with high probability. If Problem (2) is not feasible, then any occupancy measure in $\Delta(\mathcal{P}_t)$ can be selected (Line 23).
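A sketch of how Lines 20-21 translate into code; the guard on pessimistically-violated constraints mirrors the case distinction of Line 21, and all names are our own.

```python
import numpy as np

def mixing_probability(g_hat, xi, u_hat, alpha, beta, L):
    """Lines 20-21 of Algorithm 3: pessimistic probability of playing the
    strictly feasible policy. g_hat: (n_sa, m) estimated costs; xi: (n_sa,)
    confidence bounds; u_hat: (n_sa,) upper occupancy bounds of the candidate
    policy; beta: known costs of the strictly feasible policy."""
    pess = np.minimum((g_hat + xi[:, None]).T @ u_hat, L)   # pessimistic costs
    binding = pess > alpha
    if not binding.any():
        return 0.0          # the candidate policy is safe even pessimistically
    # For binding i, lambda must satisfy lambda*beta_i + (1-lambda)*pess_i <= alpha_i.
    return float(np.max((pess[binding] - alpha[binding])
                        / (pess[binding] - beta[binding])))
```

The returned value is strictly below 1 because $\beta_i < \alpha_i$ for every $i$, which is exactly what keeps some exploration probability on the optimistic policy.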
5.1. Safety property

We show that S-OPS is safe with high probability.

Theorem 5.1. Given a confidence $\delta \in (0, 1)$, Algorithm 3 is safe with probability at least $1 - 5\delta$.

Intuitively, Theorem 5.1 follows from the way in which the randomization probability $\lambda_t$ is defined. Indeed, $\lambda_t$ relies on two crucial components: (i) a pessimistic estimate of the costs for state-action pairs, namely, the upper confidence bounds $\widehat{g}_{t,i} + \xi_t$, and (ii) a pessimistic choice of transition probabilities, encoded by the upper occupancy bounds defined by the vector $\widehat{u}_t$. Notice that the $\max_{i \in [m]}$ operator allows the algorithm to be conservative with respect to all the constraints.

5.2. Cumulative regret

Proving that S-OPS attains sublinear regret begets challenges that, to the best of our knowledge, have never been addressed before. Specifically, analyzing the estimates of the adversarial losses requires non-standard techniques in our setting, since the policy $\pi_t$ used by the algorithm and determining the feedback is not the one resulting from an OMD-like update, as it is obtained via a non-standard randomization. Nevertheless, the particular shape of the probability $\lambda_t$ can be exploited to overcome such a challenge. Indeed, we show that each $\lambda_t$ can be upper bounded by the initial $\lambda_0$, and, thus, a loss estimator from feedback received by playing a policy computed by an OMD-like update is available with probability at least $1 - \lambda_0$. This observation is crucial in order to prove the following result:

Theorem 5.2. Given $\delta \in (0, 1)$, by setting $\eta = \gamma = \sqrt{L \ln(L|X||A|/\delta)/(T|X||A|)}$, Algorithm 3 attains $R_T \le O\big( \Psi L^3 |X| \sqrt{|A|T \ln(T|X||A|m/\delta)} \big)$ with probability at least $1 - 11\delta$, where $\Psi := \max_{i \in [m]} \{1/\min\{(\alpha_i - \beta_i), (\alpha_i - \beta_i)^2\}\}$.

The regret bound in Theorem 5.2 is in line with the one of SV-OPS in the bounded violation setting, with an additional $\Psi L^2$ factor. Such a factor comes from the mismatch between the loss estimators and the occupancy measure chosen by the OMD-like update. Notice that $\Psi$ depends on the violation gap $\min_{i \in [m]} \{\alpha_i - \beta_i\}$, which represents by how much the strictly feasible solution satisfies the constraints. Such a dependence is expected, since the better the strictly feasible solution (in terms of constraints satisfaction), the larger the exploration performed during the first episodes.
6. Guaranteeing constant violation

In this section, we provide an algorithm that attains constant cumulative positive violation. To achieve this goal, we only need a strictly feasible policy to exist (Assumption 2.4).

Algorithm 4 CV-OPS
Require: Anytime adversarial-MDP regret minimizer $\mathcal{A}^P$, anytime online linear optimizer $\mathcal{A}^D$
1: for $t \in [T]$ do
2:   Select $\pi_t \gets \mathcal{A}^P$
3:   Select $\phi_t \gets \mathcal{A}^D$
4:   Play $\pi_t$ and observe feedback as prescribed in Algorithm 1
5:   Feed $\{x_k, a_k, \sum_{i \in [m]} \phi_{t,i} (g_{t,i}(x_k, a_k) - \frac{\alpha_i}{L})\}_{k=0}^{L-1}$ to $\mathcal{A}^P$
6:   Feed $\{-\sum_{k=0}^{L-1} (g_{t,i}(x_k, a_k) - \frac{\alpha_i}{L})\}_{i \in [m]}$ to $\mathcal{A}^D$
7:   if $\max_{i \in [m]} \sum_{\tau \in [t]} \sum_{k=0}^{L-1} (g_{\tau,i}(x_k, a_k) - \frac{\alpha_i}{L}) \le -\big( 2 C^P_{\mathcal{A}} \sqrt{t \ln(t)} + 2 C^D_{\mathcal{A}} \sqrt{t} + 8L \sqrt{t \ln(1/\delta)} \big)$ then
8:     Go to Line 11
9:   end if
10: end for
11: $\widehat{\rho} \gets -\frac{1}{t} \max_{i \in [m]} \sum_{\tau \in [t]} \sum_{k=0}^{L-1} \big( g_{\tau,i}(x_k, a_k) - \frac{\alpha_i}{L} \big) - 2L \sqrt{\frac{\ln(1/\delta)}{t}}$
12: $\widehat{\pi} \gets \pi_\tau$ with probability $1/t$, for each $\tau \in [t]$
13: Run S-OPS with $\beta_i = \alpha_i - \widehat{\rho}$ for all $i \in [m]$ and $\pi^\diamond = \widehat{\pi}$

Algorithm 4 provides the pseudocode of Constant Violation Optimistic Policy Search (CV-OPS). The key idea of CV-OPS is to estimate, in a constant number of episodes, a strictly feasible policy and its associated violation, and then run S-OPS with such estimates. Algorithm 4 needs access to two anytime no-regret algorithms: one for adversarial MDPs with bandit feedback, which learns an estimated strictly feasible policy, and one for the full-feedback setting on the simplex, which learns the most violated constraint. Specifically, CV-OPS employs an anytime regret minimizer for adversarial MDPs, called the primal algorithm $\mathcal{A}^P$, that attains, with probability at least $1 - C^P_\delta \delta$, for all $\tau \in [T]$, all $q \in \Delta(M)$, and any sequence of loss functions, the regret bound $\sum_{t=1}^{\tau} \ell_t^\top (q_t - q) \le C^P_{\mathcal{A}} \sqrt{\tau \ln(\tau)}$, where $C^P_{\mathcal{A}}$ encompasses constant terms. This kind of guarantee is attained by state-of-the-art algorithms for adversarial MDPs (e.g., (Jin et al., 2020)) after applying a standard doubling trick (Lattimore & Szepesvári, 2020). Algorithm 4 also employs an anytime online linear optimizer, called the dual algorithm $\mathcal{A}^D$, that attains, for all $\tau \in [T]$, all $\phi \in \Delta_m$, and any sequence of loss functions, the regret bound $\sum_{t=1}^{\tau} \ell_t^\top (\phi_t - \phi) \le C^D_{\mathcal{A}} \sqrt{\tau}$, where $C^D_{\mathcal{A}}$ encompasses constant terms. This bound can be easily obtained by an online gradient descent algorithm (Orabona, 2019).

At each episode $t \in [T]$, Algorithm 4 requests a policy and a distribution over the $m$ constraints from $\mathcal{A}^P$ and $\mathcal{A}^D$, respectively (Lines 2-3). Then, the algorithm plays the policy received from $\mathcal{A}^P$ and observes the usual bandit feedback for CMDPs (Line 4). Next, the loss functions for both the primal and the dual algorithm are built. Specifically, $\mathcal{A}^P$ receives the violation attained by the policy $\pi_t$, with each constraint weighted according to $\phi_t$ (Line 5), while $\mathcal{A}^D$ receives the negative of the violation attained for each $i \in [m]$ (Line 6). The estimation phase stops when the margin attained by the algorithm exceeds twice the regret bounds attained by both the primal and the dual algorithm, plus the uncertainty of the estimation (Line 7). This condition is suitably chosen to ensure that the number of episodes is sufficient to estimate an approximation of $\pi^\diamond$, while still being constant. After the estimation, Algorithm 4 pessimistically computes the estimated Slater's parameter $\widehat{\rho}$ as the average margin attained during the estimation phase minus a quantity associated with the uncertainty of the estimation (Line 11). Finally, CV-OPS computes the strictly feasible solution as the uniform mixture of all the policies played in the estimation phase (Line 12), and it runs S-OPS with $\beta_i = \alpha_i - \widehat{\rho}$, for all $i \in [m]$, and $\pi^\diamond = \widehat{\pi}$ as input (Line 13).
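The sketch below illustrates the estimation phase (Lines 1-12). The oracle interface (`next_policy`, `next_point`, `feed`) and the exact stopping constants follow our reconstruction of the pseudocode and should be read as assumptions, not as the paper's implementation.

```python
import numpy as np

def estimation_phase(primal, dual, env, alpha, L, C_P, C_D, delta, T):
    """Estimation phase of CV-OPS (Algorithm 4, Lines 1-12) -- a sketch.
    alpha is an np.array of length m; primal/dual are anytime no-regret oracles."""
    m = len(alpha)
    cum = np.zeros(m)                       # cumulative per-constraint margins
    policies = []
    for t in range(1, T + 1):
        pi, phi = primal.next_policy(), dual.next_point()
        traj = env.play(pi)                 # list of (x_k, a_k, g_k), g_k in [0,1]^m
        viol = sum(g - alpha / L for (_, _, g) in traj)
        primal.feed(traj, weights=phi)      # phi-weighted costs (Line 5)
        dual.feed(-viol)                    # negative violations (Line 6)
        cum += viol
        policies.append(pi)
        thresh = (2 * C_P * np.sqrt(t * np.log(max(t, 2)))
                  + 2 * C_D * np.sqrt(t) + 8 * L * np.sqrt(t * np.log(1 / delta)))
        if cum.max() <= -thresh:            # margin large enough: stop (Line 7)
            break
    rho_hat = -cum.max() / t - 2 * L * np.sqrt(np.log(1 / delta) / t)   # Line 11
    pi_hat = policies[np.random.randint(t)] # uniform mixture over pi_1..pi_t (Line 12)
    return rho_hat, pi_hat
```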
6.1. Cumulative positive constraints violation

First, we show that the stopping condition at Line 7 ensures that the estimation phase runs for no more than a constant number of episodes. This is done in the following lemma.

Lemma 6.1. Given any $\delta \in (0, 1)$, the number of episodes $t^\star$ that Algorithm 4 uses to compute $\widehat{\rho}$ and $\widehat{\pi}$ satisfies $t^\star \le \frac{1}{\rho^4} \big( 3 C^P_{\mathcal{A}} + 10 L \sqrt{\ln(1/\delta)} + 3 C^D_{\mathcal{A}} + L \big)^4$, with probability at least $1 - (C^P_\delta + 2)\delta$.

The bound could be reduced to $t^\star \le \frac{1}{\rho^2} \big( 3 C^P_{\mathcal{A}} + 10 L \sqrt{\ln(1/\delta)} + 3 C^D_{\mathcal{A}} + L \big)^2$ with access to a no-regret algorithm without the logarithmic dependence on $T$ in its bound.

Next, we show that Algorithm 4 estimates a strictly feasible policy whose constraints margin lies in $[\widehat{\rho}, \rho]$, as follows.

Lemma 6.2. Given any $\delta \in (0, 1)$, Algorithm 4 guarantees $\widehat{\rho} \le \min_{i \in [m]} \big( \alpha_i - \overline{g}_i^\top q^{P,\widehat{\pi}} \big) \le \rho$ with probability at least $1 - 2\delta$.

Finally, we provide the result in terms of the cumulative positive constraints violation attained by our algorithm.

Theorem 6.3. Given $\delta \in (0, 1)$, Algorithm 4 attains $V_T \le O\big( \frac{L}{\rho^4} \big( C^P_{\mathcal{A}} + L \sqrt{\ln(1/\delta)} + C^D_{\mathcal{A}} + L \big)^4 \big)$ with probability at least $1 - (C^P_\delta + 7)\delta$.

Theorem 6.3 is proved by employing Lemma 6.1 to bound the number of episodes in the estimation phase and Lemma 6.2 to show that S-OPS run with $\widehat{\rho}$, $\widehat{\pi}$ is safe with high probability.

6.2. Cumulative regret

We provide the theoretical guarantees attained by Algorithm 4 in terms of cumulative regret. To do so, we show that $\widehat{\rho}$ is not too small. This is done in the following lemma.

Lemma 6.4. Given any $\delta \in (0, 1)$, Algorithm 4 guarantees $\widehat{\rho} \ge \rho/2$ with probability at least $1 - (C^P_\delta + 2)\delta$.

Finally, we state the regret attained by CV-OPS.

Theorem 6.5. Given any $\delta \in (0, 1)$, with $\eta = \gamma = \sqrt{L \ln(L|X||A|/\delta)/(T|X||A|)}$, Algorithm 4 attains regret $R_T \le O\big( \Theta L^3 |X| \sqrt{|A|T \ln(T|X||A|m/\delta)} + \frac{L}{\rho^4} \big( C^P_{\mathcal{A}} + L \sqrt{\ln(1/\delta)} + C^D_{\mathcal{A}} + L \big)^4 \big)$ with probability at least $1 - (C^P_\delta + 13)\delta$, where we let $\Theta := 1/\min\{\rho, \rho^2\}$.

As with Theorem 6.3, Theorem 6.5 is proved by employing Lemma 6.1 to bound the number of episodes in the estimation phase. Then, the result follows from the regret guarantees of S-OPS and from Lemma 6.4, which yields $1/\widehat{\rho} \le 2/\rho$. Differently from S-OPS, the bound of CV-OPS scales with the inverse of Slater's parameter $\rho$, not with the (possibly smaller) margin of a generic strictly feasible policy. Thus, the bound of Algorithm 4 is asymptotically smaller than the one of S-OPS.

6.3. Lower bound on the regret

We conclude by showing that a dependence on the feasibility margin of the strictly feasible solution in the regret bound is unavoidable in order to guarantee violation of order $o(\sqrt{T})$, which is the case in both the second and the third setting. This is done by means of the following lower bound.

Theorem 6.6. There exist two instances of CMDPs (with a single state and one constraint) such that, if in the first instance an algorithm suffers a violation $V_T = o(\sqrt{T})$ with probability at least $1 - n\delta$ for any $\delta \in (0, 1)$ and $n > 0$, then, in the second instance, it must suffer a regret $R_T = \Omega\big( \frac{1}{\rho} \sqrt{T} \big)$ with probability $3/4 - n\delta$.

Notice that this lower bound holds for any algorithm attaining a violation bound that is $o(\sqrt{T})$; thus, it is still applicable to settings where the violations are allowed to be much larger than the ones attained by Algorithm 3 and Algorithm 4.

Acknowledgments

This paper is supported by the Italian MIUR PRIN 2022 Project "Targeted Learning Dynamics: Computing Efficient and Fair Equilibria through No-Regret Algorithms", by the FAIR (Future Artificial Intelligence Research) project, funded by the NextGenerationEU program within the PNRR-PE-AI scheme (M4C2, Investment 1.3, Line on Artificial Intelligence), and by the EU Horizon project ELIAS (European Lighthouse of AI for Sustainability, No. 101120237).

Impact Statement

This paper presents theoretical results and has the goal of advancing the field of Machine Learning. There are no potential societal consequences of our work which we feel must be highlighted here.

References

Altman, E. Constrained Markov Decision Processes. Chapman and Hall, 1999.
Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263-272. PMLR, 2017.

Bai, Q., Aggarwal, V., and Gattami, A. Provably sample-efficient model-free algorithm for MDPs with peak constraints. Journal of Machine Learning Research, 24(60):1-25, 2023.

Bernasconi, M., Cacciamani, F., Castiglioni, M., Marchesi, A., Gatti, N., and Trovò, F. Safe learning in tree-form sequential decision making: Handling hard and soft constraints. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 1854-1873. PMLR, 2022.

Bura, A., Hasanzade Zonuzy, A., Kalathil, D., Shakkottai, S., and Chamberland, J.-F. DOPE: Doubly optimistic and pessimistic exploration for safe reinforcement learning. Advances in Neural Information Processing Systems, 35:1047-1059, 2022.

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.

Chen, K., Cai, K., Huang, L., and Lui, J. C. Beyond the click-through rate: Web link selection with multi-level feedback. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3308-3314, 2018.

Ding, Y. and Lavaei, J. Provably efficient primal-dual reinforcement learning for CMDPs with non-stationary objectives and constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 7396-7404, 2023.

Efroni, Y., Mannor, S., and Pirotta, M. Exploration-exploitation in constrained MDPs, 2020. URL https://arxiv.org/abs/2003.02189.

Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726-736, 2009.

Guo, H., Liu, X., Wei, H., and Ying, L. Online convex optimization with hard constraints: Towards the best of two worlds and beyond. In Advances in Neural Information Processing Systems, volume 35, pp. 36426-36439. Curran Associates, Inc., 2022.

He, Y., Chen, X., Wu, D., Pan, J., Tan, Q., Yu, C., Xu, J., and Zhu, X. A unified solution to constrained bidding in online display advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2993-3001, 2021.

Isele, D., Nakhaei, A., and Fujimura, K. Safe reinforcement learning on autonomous vehicles. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1-6. IEEE, 2018.

Jin, C., Jin, T., Luo, H., Sra, S., and Yu, T. Learning adversarial Markov decision processes with bandit feedback and unknown transition. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 4860-4869. PMLR, 2020.

Lattimore, T. and Szepesvári, C. Bandit Algorithms.
Cambridge University Press, 2020.

Liu, T., Zhou, R., Kalathil, D., Kumar, P., and Tian, C. Learning policies with zero or bounded constraint violation for constrained MDPs. Advances in Neural Information Processing Systems, 34:17183-17193, 2021.

Müller, A., Alatur, P., Cevher, V., Ramponi, G., and He, N. Truly no-regret learning in constrained MDPs. In Forty-first International Conference on Machine Learning, 2024.

Neu, G. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Advances in Neural Information Processing Systems, 28, 2015.

Neu, G., Antos, A., György, A., and Szepesvári, C. Online Markov decision processes under bandit feedback. Advances in Neural Information Processing Systems, 23, 2010.

Orabona, F. A modern introduction to online learning. CoRR, abs/1912.13213, 2019. URL http://arxiv.org/abs/1912.13213.

Pacchiano, A., Ghavamzadeh, M., Bartlett, P., and Jiang, H. Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pp. 2827-2835. PMLR, 2021.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Qiu, S., Wei, X., Yang, Z., Ye, J., and Wang, Z. Upper confidence primal-dual reinforcement learning for CMDP with adversarial loss. In Advances in Neural Information Processing Systems, volume 33, pp. 15277-15287. Curran Associates, Inc., 2020.

Rosenberg, A. and Mansour, Y. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019a.

Rosenberg, A. and Mansour, Y. Online convex optimization in adversarial Markov decision processes. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5478-5486. PMLR, 2019b.

Shi, M., Liang, Y., and Shroff, N. A near-optimal algorithm for safe reinforcement learning under instantaneous hard constraints. arXiv preprint arXiv:2302.04375, 2023.

Singh, A., Halpern, Y., Thain, N., Christakopoulou, K., Chi, E., Chen, J., and Beutel, A. Building healthy recommendation sequences for everyone: A safe reinforcement learning approach. In Proceedings of the FAccTRec Workshop, Online, pp. 26-27, 2020.

Stradi, F. E., Germano, J., Genalti, G., Castiglioni, M., Marchesi, A., and Gatti, N. Online learning in CMDPs: Handling stochastic and adversarial constraints. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 46692-46721. PMLR, 2024a.

Stradi, F. E., Lunghi, A., Castiglioni, M., Marchesi, A., and Gatti, N. Learning constrained Markov decision processes with non-stationary rewards and constraints, 2024b.
URL https://arxiv.org/abs/2405.14372.

Stradi, F. E., Castiglioni, M., Marchesi, A., and Gatti, N. Optimal strong regret and violation in constrained MDPs via policy optimization. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, 2025.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Wei, H., Ghosh, A., Shroff, N., Ying, L., and Zhou, X. Provably efficient model-free algorithms for non-stationary CMDPs. In International Conference on Artificial Intelligence and Statistics, pp. 6527-6570. PMLR, 2023.

Wei, X., Yu, H., and Neely, M. J. Online learning in weakly coupled Markov decision processes: A convergence time study. Proc. ACM Meas. Anal. Comput. Syst., 2(1), 2018. doi: 10.1145/3179415.

Wen, L., Duan, J., Li, S. E., Xu, S., and Peng, H. Safe reinforcement learning for autonomous vehicles through parallel constrained policy optimization. In 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1-7. IEEE, 2020.

Wu, D., Chen, X., Yang, X., Wang, H., Tan, Q., Zhang, X., Xu, J., and Gai, K. Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1443-1451, 2018.

Zheng, L. and Ratliff, L. Constrained upper confidence reinforcement learning. In Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pp. 620-629. PMLR, 2020.

The appendix is organized as follows:

- In Appendix A, we provide the complete discussion of related works.
- In Appendix B, we provide the omitted proofs related to the analysis of the clean event.
- In Appendix C, we provide the omitted proofs related to the guarantees attained by Algorithm 2, namely, the one which guarantees sublinear violation.
- In Appendix D, we provide the omitted proofs related to the guarantees attained by Algorithm 3, namely, the one which guarantees the safety property.
- In Appendix E, we provide the omitted proofs related to the guarantees attained by Algorithm 4, namely, the one which guarantees constant violation.
- In Appendix F, we provide the lower bound associated with both the second and the third setting.
- In Appendix G, we provide useful lemmas from existing works.

A. Related Works

Online learning (Cesa-Bianchi & Lugosi, 2006; Orabona, 2019) in MDPs has received growing attention over the last years (see, e.g., (Auer et al., 2008; Even-Dar et al., 2009; Neu et al., 2010)). Two types of feedback are usually investigated: full feedback, where the entire loss function is observed by the learner, and bandit feedback, where the learner only observes the losses of the chosen actions. Azar et al. (2017) study learning in episodic MDPs with unknown transitions and stochastic losses under bandit feedback, achieving $\widetilde{O}(\sqrt{T})$ regret, matching the lower bound for these MDPs. Rosenberg & Mansour (2019b) study learning under full feedback in episodic MDPs with adversarial losses and unknown transitions, achieving $\widetilde{O}(\sqrt{T})$ regret.
The same setting is studied by Rosenberg & Mansour (2019a) under bandit feedback, obtaining a suboptimal $\widetilde{O}(T^{3/4})$ regret. Jin et al. (2020) provide an algorithm with an optimal $\widetilde{O}(\sqrt{T})$ regret in the same setting.

Online learning in CMDPs has generally been studied with stochastic losses and constraints. Zheng & Ratliff (2020) deal with fully-stochastic episodic CMDPs, assuming known transitions and bandit feedback. The regret of their algorithm is $\widetilde{O}(T^{3/4})$, while its cumulative constraints violation is guaranteed to stay below a threshold with a given probability. Bai et al. (2023) provide the first algorithm that achieves sublinear regret with unknown transitions, assuming that the rewards are deterministic and the constraints are stochastic with a particular structure. Efroni et al. (2020) propose two approaches to deal with the exploration-exploitation trade-off in episodic CMDPs. The first one resorts to a linear programming formulation of CMDPs and obtains sublinear regret and cumulative positive constraints violation. The second one relies on a primal-dual formulation of the problem and guarantees sublinear regret and cumulative (positive/negative) constraints violation, when transitions, losses, and constraints are unknown and stochastic, under bandit feedback. Liu et al. (2021); Müller et al. (2024); Stradi et al. (2025) study stochastic hard constraints; however, these works only focus on stochastic losses. Bura et al. (2022) focus on our second scenario and develop a pessimistic algorithm for the stochastic setting only. In their work, the strictly safe policy is played for a certain amount of time, after which a good estimate of the constraints costs and transitions is available. This allows the pessimistic set to be large enough to be used. We underline that optimizing over the pessimistic decision space cannot be done in our setting, since adversarial no-regret algorithms do not work with increasing decision spaces. Recently, Shi et al. (2023) study stochastic hard constraints on both states and actions.

As concerns adversarial settings, Wei et al. (2018); Qiu et al. (2020); Stradi et al. (2024a) address CMDPs with adversarial losses, but they only provide guarantees in terms of soft constraints. Moreover, Wei et al. (2023); Ding & Lavaei (2023); Stradi et al. (2024b) consider non-stationary losses/constraints with bounded variation. Their results do not apply to general adversarial losses. Hard constraints have also been studied in online convex optimization (Guo et al., 2022), and in stochastic settings with simpler structure (Chen et al., 2018; Pacchiano et al., 2021; Bernasconi et al., 2022). Our results are much more general than those, as we jointly consider adversarial losses, bandit feedback, and an MDP structure.

B. Omitted proofs for the clean event

In this section, we report the omitted proofs related to the clean event. We start by stating the following preliminary result.

Lemma B.1. Given any $\delta \in (0, 1)$, fix $i \in [m]$, $t \in [T]$, and $(x, a) \in X \times A$. Then, with probability at least $1 - \delta$, it holds that

$$|\widehat{g}_{t,i}(x, a) - \overline{g}_i(x, a)| \le \zeta_t(x, a),$$

where $\zeta_t(x, a) := \sqrt{\frac{\ln(2/\delta)}{2 N_t(x, a)}}$ and $\overline{g}_i(x, a)$ is the true mean value of the distribution.

Proof. Focus on specific $i \in [m]$, $t \in [T]$, and $(x, a) \in X \times A$. By Hoeffding's inequality, and noticing that the constraints costs are bounded in $[0, 1]$, it holds that:

$$\mathbb{P}\Big[ |\widehat{g}_{t,i}(x, a) - \overline{g}_i(x, a)| \ge c \Big] \le 2 \exp\big( -2 c^2 N_t(x, a) \big).$$

Setting $\delta = 2 \exp(-2 c^2 N_t(x, a))$ and solving to find the proper value of $c$ concludes the proof.
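A quick Monte Carlo sanity check of the bound in Lemma B.1 (entirely our own illustration): with Bernoulli costs and a hypothetical mean, the empirical coverage of the confidence interval should be at least $1 - \delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, delta, trials = 200, 0.05, 10_000
zeta = np.sqrt(np.log(2 / delta) / (2 * N))       # the bound of Lemma B.1
mu = 0.3                                          # hypothetical mean cost
covered = 0
for _ in range(trials):
    g_hat = rng.binomial(1, mu, size=N).mean()    # empirical mean of N samples
    covered += abs(g_hat - mu) <= zeta
print(covered / trials)   # should be at least 1 - delta = 0.95
```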
Now, we generalize the previous result so that it holds for every $i \in [m]$, $t \in [T]$, and $(x, a) \in X \times A$ at the same time.

Lemma B.2. Given a confidence $\delta \in (0, 1)$, with probability at least $1 - \delta$, for every $i \in [m]$, episode $t \in [T]$, and pair $(x, a) \in X \times A$, it holds that $|\widehat{g}_{t,i}(x, a) - \overline{g}_i(x, a)| \le \xi_t(x, a)$, where we let the confidence bound be $\xi_t(x, a) := \min\{1, \sqrt{4 \ln(T|X||A|m/\delta) / \max\{1, N_t(x, a)\}}\}$.

Proof. From Lemma B.1, given $\delta' \in (0, 1)$, we have, for any $i \in [m]$, $t \in [T]$, and $(x, a) \in X \times A$:

$$\mathbb{P}\Big[ |\widehat{g}_{t,i}(x, a) - \overline{g}_i(x, a)| \le \zeta_t(x, a) \Big] \ge 1 - \delta'.$$

Now, we are interested in the intersection of all such events, namely,

$$\mathbb{P}\bigg[ \bigcap_{i, t, (x, a)} \Big\{ |\widehat{g}_{t,i}(x, a) - \overline{g}_i(x, a)| \le \zeta_t(x, a) \Big\} \bigg].$$

Thus, we have:

$$\mathbb{P}\bigg[ \bigcap_{i, t, (x, a)} \Big\{ |\widehat{g}_{t,i}(x, a) - \overline{g}_i(x, a)| \le \zeta_t(x, a) \Big\} \bigg] = 1 - \mathbb{P}\bigg[ \bigcup_{i, t, (x, a)} \Big\{ |\widehat{g}_{t,i}(x, a) - \overline{g}_i(x, a)| \le \zeta_t(x, a) \Big\}^c \bigg] \ge 1 - \sum_{i, t, (x, a)} \mathbb{P}\bigg[ \Big\{ |\widehat{g}_{t,i}(x, a) - \overline{g}_i(x, a)| \le \zeta_t(x, a) \Big\}^c \bigg] \ge 1 - |X||A| m T \delta', \tag{3}$$

where Inequality (3) holds by a union bound. Noticing that $g_{t,i}(x, a) \le 1$, substituting $\delta'$ with $\delta' := \delta / (|X||A| m T)$ in $\zeta_t(x, a)$, and taking an additional union bound over the possible values of $N_t(x, a)$, thus obtaining $\xi_t(x, a)$, concludes the proof.

C. Omitted proofs for sublinear violation

In this section, we report the omitted proofs of the theoretical results for Algorithm 2.

C.1. Feasibility

We start by showing that Program (2) admits a feasible solution with arbitrarily large probability.

Lemma 4.1. Given a confidence $\delta \in (0, 1)$, Algorithm 2 ensures that $\text{PROJ}(\widetilde{q}_{t+1}, \widehat{G}_t, \Xi_t, \mathcal{P}_t)$ is feasible at every episode $t \in [T]$ with probability at least $1 - 5\delta$.

Proof. To prove the lemma, we show that, under the event $\mathcal{E}_{G,\Delta}(\delta)$, which holds with probability at least $1 - 5\delta$, Program (2) admits a feasible solution. Precisely, under the event $\mathcal{E}_\Delta(\delta)$, the true transition function $P$ belongs to $\mathcal{P}_t$ at each episode. Moreover, under the event $\mathcal{E}_G(\delta)$, we have, for any feasible solution $q^\circ$ of the offline optimization problem and any $t \in [T]$:

$$(\widehat{G}_t - \Xi_t)^\top q^\circ \le \overline{G}^\top q^\circ \le \alpha,$$

where the first inequality holds by the definition of the event. The previous inequality shows that, if $q^\circ$ satisfies the constraints with respect to the true mean cost matrix, then it also satisfies the optimistic constraints. Thus, the feasible solutions to the offline problem are all available at every episode. Noticing that the clean event is defined as the intersection of $\mathcal{E}_G(\delta)$ and $\mathcal{E}_\Delta(\delta)$ concludes the proof.

C.2. Violations

We proceed by bounding the cumulative positive violation, as follows.

Theorem 4.2. Given $\delta \in (0, 1)$, Algorithm 2 attains $V_T \le O\big( L|X| \sqrt{|A|T \ln(T|X||A|m/\delta)} \big)$ with probability at least $1 - 8\delta$.

Proof. The key point is to relate constraints satisfaction to the convergence rate of the confidence bounds on both the constraints and the transitions. First, we notice that, under the clean event $\mathcal{E}_{G,\Delta}(\delta)$, all the following steps hold for every constraint $i \in [m]$. Thus, we focus on bounding the violation of a single constraint, with mean cost vector $\overline{g}$ and threshold $\alpha$. By Lemma 4.1, under the clean event $\mathcal{E}_{G,\Delta}(\delta)$, the convex program is feasible and it holds:

$$\overline{g} - 2\xi_t \le \widehat{g}_t - \xi_t.$$

Thus, multiplying by the estimated occupancy measure and by construction of the convex program, we obtain:

$$(\overline{g} - 2\xi_{t-1})^\top \widehat{q}_t \le (\widehat{g}_{t-1} - \xi_{t-1})^\top \widehat{q}_t \le \alpha.$$

Rearranging the inequality, it holds:

$$\overline{g}^\top \widehat{q}_t \le \alpha + 2 \xi_{t-1}^\top \widehat{q}_t.$$

Now, in order to recover the instantaneous violation, we proceed as follows:

$$\overline{g}^\top \widehat{q}_t + \overline{g}^\top q_t - \overline{g}^\top q_t \le \alpha + 2 \xi_{t-1}^\top \widehat{q}_t,$$

from which we obtain:

$$\overline{g}^\top q_t - \alpha \le \overline{g}^\top (q_t - \widehat{q}_t) + 2 \xi_{t-1}^\top \widehat{q}_t \le \|q_t - \widehat{q}_t\|_1 + 2 \xi_{t-1}^\top \widehat{q}_t,$$

where the last step holds by Hölder's inequality. Notice that, since the right-hand side of the previous inequality is nonnegative, it holds:

$$\big[ \overline{g}^\top q_t - \alpha \big]^+ \le \|q_t - \widehat{q}_t\|_1 + 2 \xi_{t-1}^\top \widehat{q}_t,$$
C.2. Violations

We proceed by bounding the cumulative positive violation as follows.

Theorem 4.2. Given $\delta \in (0,1)$, Algorithm 2 attains $V_T \le O\big(L|X|\sqrt{|A|T\ln(T|X||A|m/\delta)}\big)$ with probability at least $1-8\delta$.

Proof. The key point is to relate constraint satisfaction to the convergence rate of the confidence bounds on both the constraints and the transitions. First, we notice that, under the clean event $\mathcal{E}_{G,P}(\delta)$, all the following arguments hold for every constraint $i \in [m]$; thus, we focus on the violation of a single constraint. By Lemma 4.1, under the clean event $\mathcal{E}_{G,P}(\delta)$, the convex program is feasible and it holds that $g - 2\xi_{t-1} \le \widehat{g}_{t-1} - \xi_{t-1}$. Thus, multiplying by the estimated occupancy measure and by construction of the convex program, we obtain:
\[
(g - 2\xi_{t-1})^\top \widehat{q}_t \le (\widehat{g}_{t-1} - \xi_{t-1})^\top \widehat{q}_t \le \alpha.
\]
Rearranging, it holds that $g^\top \widehat{q}_t \le \alpha + 2\xi_{t-1}^\top \widehat{q}_t$. Now, in order to recover the instantaneous violation, we proceed as follows:
\[
g^\top \widehat{q}_t + g^\top q_t - g^\top q_t \le \alpha + 2\xi_{t-1}^\top \widehat{q}_t,
\]
from which we obtain:
\[
g^\top q_t - \alpha \le g^\top (q_t - \widehat{q}_t) + 2\xi_{t-1}^\top \widehat{q}_t \le \|q_t - \widehat{q}_t\|_1 + 2\xi_{t-1}^\top \widehat{q}_t,
\]
where the last step holds by Hölder's inequality. Notice that, since the right-hand side of the previous inequality is non-negative, it holds that:
\[
\big[g^\top q_t - \alpha\big]^+ \le \|q_t - \widehat{q}_t\|_1 + 2\xi_{t-1}^\top \widehat{q}_t,
\]
which leads to $V_T \le \sum_{t=1}^T \|q_t - \widehat{q}_t\|_1 + 2\sum_{t=1}^T \xi_{t-1}^\top \widehat{q}_t$, where the first term accounts for the estimation of the transitions, while the second one accounts for the estimation of the constraints. We bound the two terms separately.

Bound on $\sum_{t=1}^T \|\widehat{q}_t - q_t\|_1$. This term encodes the distance between the estimated occupancy measure and the real one chosen by the algorithm; thus, it depends on the estimation of the true transition function. We proceed as follows:
\[
\sum_{t=1}^T \|\widehat{q}_t - q_t\|_1 = \sum_{t=1}^T \sum_{x,a} |\widehat{q}_t(x,a) - q_t(x,a)| \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right), \tag{4}
\]
where Inequality (4) holds since, by Lemma G.2, under the clean event, with probability at least $1-2\delta$, we have $\sum_{t=1}^T \sum_{x,a} |\widehat{q}_t(x,a) - q_t(x,a)| \le O(L|X|\sqrt{|A|T\ln(T|X||A|/\delta)})$ when $\widehat{q}_t \in \Delta(\mathcal{P}_t)$. Notice that the condition $\widehat{q}_t \in \Delta(\mathcal{P}_t)$ is verified, since the constrained space defined by Program (2) is contained in $\Delta(\mathcal{P}_t)$.

Bound on $\sum_{t=1}^T \xi_{t-1}^\top \widehat{q}_t$. This term encodes the estimation of the constraint functions under the estimated occupancy measure. Nevertheless, since the confidence bounds shrink only along the paths traversed by the learner, it is necessary to relate $\xi_t$ to the real occupancy measure chosen by the algorithm. To do so, we notice that, by Hölder's inequality and since $\xi_t(x,a) \le 1$, it holds:
\[
\sum_{t=1}^T \xi_{t-1}^\top \widehat{q}_t \le \sum_{t=1}^T \xi_{t-1}^\top q_t + \sum_{t=1}^T \xi_{t-1}^\top (\widehat{q}_t - q_t) \le \sum_{t=1}^T \xi_{t-1}^\top q_t + \sum_{t=1}^T \|\widehat{q}_t - q_t\|_1.
\]
The second term of the inequality is bounded by the previous analysis, while for the first term we proceed as follows:
\[
\sum_{t=1}^T \xi_{t-1}^\top q_t = \sum_{t=1}^T \sum_{x,a} \xi_{t-1}(x,a)\, q_t(x,a) \le \sum_{t=1}^T \sum_{x,a} \xi_{t-1}(x,a)\, \mathbb{1}_t\{x,a\} + L\sqrt{T\ln\left(\frac{1}{\delta}\right)} \tag{5}
\]
\[
\le \sum_{t=1}^T \sum_{x,a} \sqrt{\frac{4\ln(T|X||A|m/\delta)}{\max\{1, N_{t-1}(x,a)\}}}\, \mathbb{1}_t\{x,a\} + L\sqrt{T\ln\left(\frac{1}{\delta}\right)} \le \sqrt{4\ln\left(\frac{T|X||A|m}{\delta}\right)} \sum_{x,a} 3\sqrt{N_T(x,a)} + L\sqrt{T\ln\left(\frac{1}{\delta}\right)} \tag{6}
\]
\[
\le O\left(\sqrt{L|X||A|T\ln\left(\frac{T|X||A|m}{\delta}\right)}\right) + L\sqrt{T\ln\left(\frac{1}{\delta}\right)}, \tag{7}
\]
where Inequality (5) follows from the Azuma inequality, noticing that $\sum_{x,a} \xi_{t-1}(x,a)\,q_t(x,a) \le L$ (with probability at least $1-\delta$), Inequality (6) holds since $1 + \sum_{n=1}^{N} 1/\sqrt{n} \le 3\sqrt{N}$, and Inequality (7) follows from the Cauchy–Schwarz inequality, noticing that $\sum_{x,a} N_T(x,a) \le LT$.

We combine the previous bounds as follows:
\[
V_T \le \sum_{t=1}^T \|q_t - \widehat{q}_t\|_1 + 2\sum_{t=1}^T \xi_{t-1}^\top \widehat{q}_t \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|m}{\delta}\right)}\right).
\]
The result holds with probability at least $1-8\delta$ by a union bound over the clean event, Lemma G.2, and the Azuma–Hoeffding inequality. This concludes the proof.
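For concreteness, here is a minimal sketch (our own notation) of the quantity bounded by Theorem 4.2: positive parts are taken per episode and per constraint before summing, so episodes that satisfy the constraints can never offset earlier violations, which is exactly what distinguishes hard from soft constraints.

```python
import numpy as np

def cumulative_positive_violation(G, qs, alpha):
    """V_T from mean costs and played occupancy measures (sketch).

    G     : (dim, m) matrix whose columns are the mean constraint costs g_i
    qs    : list of T occupancy measures q_t, each of shape (dim,)
    alpha : (m,) thresholds
    Constraints are aggregated by summation here; the proof above bounds
    each constraint separately, so the aggregation choice is ours."""
    V = 0.0
    for q in qs:
        V += np.maximum(G.T @ q - alpha, 0.0).sum()  # clip before summing
    return V
```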
C.3. Regret

In this section, we prove the regret bound of Algorithm 2. Precisely, the bound follows from noticing that, under the clean event, the optimal safe solution is included in the decision space at every episode $t \in [T]$.

Theorem 4.3. Given $\delta \in (0,1)$, by setting $\eta = \gamma = \sqrt{L\ln(L|X||A|/\delta)/(T|X||A|)}$, Algorithm 2 attains $R_T \le O\big(L|X|\sqrt{|A|T\ln(T|X||A|/\delta)}\big)$ with probability at least $1-10\delta$.

Proof. We first rewrite the regret definition as follows:
\[
R_T = \underbrace{\sum_{t=1}^T \ell_t^\top (q_t - \widehat{q}_t)}_{①} + \underbrace{\sum_{t=1}^T \widehat{\ell}_t^{\,\top} (\widehat{q}_t - q^*)}_{②} + \underbrace{\sum_{t=1}^T (\ell_t - \widehat{\ell}_t)^\top \widehat{q}_t}_{③} + \underbrace{\sum_{t=1}^T (\widehat{\ell}_t - \ell_t)^\top q^*}_{④}.
\]
Precisely, the first term encompasses the distance between the true transitions and the estimated ones, the second concerns the optimization performed by online mirror descent, and the last two encompass the bias of the estimators.

Bound on ①. We start by bounding the first term, namely, the cumulative distance between the estimated occupancy measure and the real one, as follows:
\[
\sum_{t=1}^T \ell_t^\top (q_t - \widehat{q}_t) = \sum_{t=1}^T \sum_{x,a} \ell_t(x,a)\big(q_t(x,a) - \widehat{q}_t(x,a)\big) \le \sum_{t=1}^T \sum_{x,a} |q_t(x,a) - \widehat{q}_t(x,a)|, \tag{8}
\]
where Inequality (8) holds by Hölder's inequality, noticing that $\|\ell_t\|_\infty \le 1$ for all $t \in [T]$. Then, noticing that the projection of Algorithm 2 is performed over a subset of $\Delta(\mathcal{P}_t)$ and employing Lemma G.2, we obtain:
\[
① \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right), \tag{9}
\]
with probability at least $1-2\delta$, under the clean event.

Bound on ②. To bound the second term, we underline that, under the clean event $\mathcal{E}_{G,P}(\delta)$, the estimated safe occupancy measure $\widehat{q}_t$ belongs to $\Delta(\mathcal{P}_t)$ and the optimal safe solution $q^*$ is included in the constrained decision space for each $t \in [T]$. Moreover, we notice that, for each $t \in [T]$, the constrained space is convex and defined by linear constraints, by construction of Program (2). Thus, following the standard analysis of online mirror descent (Orabona, 2019) and from Lemma G.6, we have, under the clean event:
\[
② \le \frac{L\ln(|X|^2|A|)}{\eta} + \eta \sum_{t,x,a} \widehat{q}_t(x,a)\,\widehat{\ell}_t(x,a)^2.
\]
Thus, to bound the biased estimator, we notice that $\widehat{q}_t(x,a)\,\widehat{\ell}_t(x,a)^2 \le \frac{\widehat{q}_t(x,a)}{u_t(x,a)+\gamma}\,\widehat{\ell}_t(x,a) \le \widehat{\ell}_t(x,a)$. We then apply Lemma G.3 with $\alpha_t(x,a) = 2\gamma$ and obtain $\sum_{t,x,a} \widehat{q}_t(x,a)\,\widehat{\ell}_t(x,a)^2 \le \sum_{t,x,a} \frac{q_t(x,a)}{u_t(x,a)}\,\ell_t(x,a) + \frac{L\ln(L/\delta)}{2\gamma}$. Finally, we notice that, under the clean event, $q_t(x,a) \le u_t(x,a)$, obtaining, with probability at least $1-\delta$:
\[
② \le \frac{L\ln(|X|^2|A|)}{\eta} + \eta|X||A|T + \frac{\eta L\ln(L/\delta)}{2\gamma}.
\]
Setting $\eta = \gamma = \sqrt{L\ln(L|X||A|/\delta)/(T|X||A|)}$, we obtain:
\[
② \le O\left(\sqrt{L|X||A|T\ln\left(\frac{L|X||A|}{\delta}\right)}\right), \tag{10}
\]
with probability at least $1-\delta$, under the clean event.

Bound on ③. The third term follows from Lemma G.5, from which, under the clean event, with probability at least $1-3\delta$ and setting $\gamma = \sqrt{L\ln(L|X||A|/\delta)/(T|X||A|)}$, we obtain:
\[
③ \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right). \tag{11}
\]

Bound on ④. We bound the fourth term employing Corollary G.4, obtaining:
\[
\sum_{t=1}^T (\widehat{\ell}_t - \ell_t)^\top q^* = \sum_{t,x,a} q^*(x,a)\big(\widehat{\ell}_t(x,a) - \ell_t(x,a)\big) \le \sum_{t,x,a} q^*(x,a)\,\ell_t(x,a)\left(\frac{q_t(x,a)}{u_t(x,a)} - 1\right) + \sum_{x,a} q^*(x,a)\,\frac{\ln(|X||A|/\delta)}{2\gamma}
\]
\[
\le \sum_{t,x,a} q^*(x,a)\,\ell_t(x,a)\left(\frac{q_t(x,a)}{u_t(x,a)} - 1\right) + \frac{L\ln(|X||A|/\delta)}{2\gamma}.
\]
Noticing that, under the clean event, $q_t(x,a) \le u_t(x,a)$, so that the first summation is non-positive, and setting $\gamma = \sqrt{L\ln(L|X||A|/\delta)/(T|X||A|)}$, we obtain, with probability at least $1-\delta$:
\[
④ \le O\left(\sqrt{L|X||A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right). \tag{12}
\]

Final result. Finally, combining Equations (9), (10), (11) and (12) and applying a union bound, we obtain, with probability at least $1-10\delta$:
\[
R_T \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right).
\]

D. Omitted proofs when Condition 2.5 holds

In this section, we report the omitted proofs of the theoretical results for Algorithm 3.

D.1. Safety

We start by showing that Algorithm 3 is safe with high probability.

Theorem 5.1. Given a confidence $\delta \in (0,1)$, Algorithm 3 is safe with probability at least $1-5\delta$.

Proof. We show that, under the event $\mathcal{E}_{G,P}(\delta)$, the non-Markovian policy defined by the mixing probability $\lambda_t$ satisfies the constraints. Intuitively, the result follows from the construction of the convex-combination parameter $\lambda_t$. Indeed, $\lambda_t$ is built using a pessimistic estimate of the constraint costs, namely, $\widehat{g}_{t,i} + \xi_t$. Moreover, the upper occupancy bound $\widehat{u}_t$ introduces pessimism in the choice of the transition function. Finally, the $\max_{i \in [m]}$ operator allows the algorithm to be conservative with respect to all the $m$ constraints. We split the analysis into the two possible cases defined by $\lambda_t$, namely, $\lambda_t = 0$ and $\lambda_t \in (0,1)$. Notice that $\lambda_t < 1$ by construction.

Analysis when $\lambda_t = 0$. When $\lambda_t = 0$, it holds, by construction, that $\forall i \in [m]$: $(\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t \le \alpha_i$. Thus, under the event $\mathcal{E}_{G,P}(\delta)$, it holds, $\forall i \in [m]$:
\[
\alpha_i \ge (\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t \ge (\widehat{g}_{t-1,i} + \xi_{t-1})^\top q^{P,\widehat{\pi}_t} \tag{13}
\]
\[
= (\widehat{g}_{t-1,i} + \xi_{t-1})^\top q_t \ge g_i^\top q_t, \tag{14}
\]
where Inequality (13) holds by the definition of $\widehat{u}_t$, the equality holds since $\pi_t = \widehat{\pi}_t$ when $\lambda_{t-1} = 0$, and Inequality (14) holds by the pessimistic definition of the constraints.

Analysis when $\lambda_t \in (0,1)$. We focus on a single constraint $i \in [m]$; then, we generalize the analysis to the entire set of constraints. First, we notice that the constraint cost, for a single constraint $i \in [m]$, attained by the non-Markovian policy $\pi_t$ is equal to $\lambda_{t-1}\, g_i^\top q^{P,\pi^\diamond} + (1-\lambda_{t-1})\, g_i^\top q^{P,\widehat{\pi}_t}$.
Thus, it holds, by definition of the known strictly feasible policy $\pi^\diamond$:
\[
\lambda_{t-1}\, g_i^\top q^{P,\pi^\diamond} + (1-\lambda_{t-1})\, g_i^\top q^{P,\widehat{\pi}_t} = \lambda_{t-1}\beta_i + (1-\lambda_{t-1})\, g_i^\top q^{P,\widehat{\pi}_t}. \tag{15}
\]
Then, we consider both the case in which $L < (\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t$ (first case) and the case in which $L > (\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t$ (second case). If the two quantities are equal, the proof still holds, breaking ties arbitrarily.

First case. In this case, $\lambda_{t-1} = \frac{L - \alpha_i}{L - \beta_i}$, and it holds that:
\[
\lambda_{t-1}\beta_i + (1-\lambda_{t-1})\, g_i^\top q^{P,\widehat{\pi}_t} \le \lambda_{t-1}\beta_i + (1-\lambda_{t-1})L \tag{16}
\]
\[
= L - \lambda_{t-1}(L - \beta_i) = L - \frac{L-\alpha_i}{L-\beta_i}(L - \beta_i) = \alpha_i,
\]
where Inequality (16) holds by the definition of the constraints, whose episodic cost is at most $L$.

Second case. In this case, $\lambda_{t-1} = \frac{(\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t - \alpha_i}{(\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t - \beta_i}$, and it holds that:
\[
\lambda_{t-1}\beta_i + (1-\lambda_{t-1})\, g_i^\top q^{P,\widehat{\pi}_t} \le \lambda_{t-1}\beta_i + (1-\lambda_{t-1})\,(\widehat{g}_{t-1,i} + \xi_{t-1})^\top q^{P,\widehat{\pi}_t} \tag{17}
\]
\[
\le \lambda_{t-1}\beta_i + (1-\lambda_{t-1})\,(\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t \tag{18}
\]
\[
= \lambda_{t-1}\big(\beta_i - (\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t\big) + (\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t
\]
\[
= -\big((\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t - \alpha_i\big) + (\widehat{g}_{t-1,i} + \xi_{t-1})^\top \widehat{u}_t = \alpha_i,
\]
where Inequality (17) holds by the definition of the event and Inequality (18) holds by the definition of $\widehat{u}_t$. To conclude the proof, we underline that $\lambda_t$ is chosen by taking the maximum over the constraints, which implies that the most conservative $\lambda_t$ (the one yielding the combination closest to the strictly feasible solution) is chosen. Thus, all the constraints are satisfied.
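The mixing coefficient just analyzed admits a compact implementation. The following is a sketch, under our own naming, of how $\lambda_t$ could be computed from the pessimistic cost estimates; it matches the closed form used in the two cases above and in the proof of Theorem 5.2 below, with the pessimistic cost capped at $L$ and the maximum taken over constraints.

```python
import numpy as np

def mixing_coefficient(g_hat, xi, u_hat, alpha, beta, L):
    """Safe mixing coefficient lambda_t: the strictly feasible policy is
    played with probability lambda_t, the OMD policy otherwise (sketch,
    our own naming).

    g_hat, xi : (dim, m) estimated costs and confidence bounds
    u_hat     : (dim,) upper occupancy bound of the OMD policy
    alpha     : (m,) thresholds;  beta : (m,) costs of the feasible policy
    L         : horizon, an upper bound on any episodic constraint cost."""
    pess = np.minimum((g_hat + xi).T @ u_hat, L)  # pessimistic cost, capped at L
    lam = (pess - alpha) / (pess - beta)          # per-constraint requirement
    return float(np.clip(lam, 0.0, 1.0).max())    # most conservative constraint wins
```

Note that when every pessimistic cost is already below its threshold, all ratios are negative and the clip returns the case $\lambda_t = 0$ of the proof.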
D.2. Regret

We start with the statement of the following lemma, which generalizes results from (Jin et al., 2020). Intuitively, it states that the distance between the estimated non-safe occupancy measure $\widehat{q}_t$ and the real one shrinks as the number of episodes increases, paying a $1-\lambda_t$ factor. This is reasonable since, by the update of the non-Markovian policy $\pi_t$ (see Algorithm 3), the policy $\widehat{\pi}_t$ associated with $\widehat{q}_t$ is played with probability $1-\lambda_{t-1}$.

Lemma D.1. Under the clean event, with probability at least $1-2\delta$, for any collection of transition functions $\{P_t^x\}_{x \in X}$ such that $P_t^x \in \mathcal{P}_t$, and for any collection $\{\lambda_t\}_{t=0}^{T-1}$ used to select policy $\pi_{t+1}$, we have:
\[
\sum_{t=1}^T (1-\lambda_{t-1}) \sum_{x,a} \big|q^{P_t^x,\widehat{\pi}_t}(x,a) - q^{P,\widehat{\pi}_t}(x,a)\big| \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right).
\]

Proof. We write $q_t^x$ for $q^{P_t^x,\pi_t}$ and $\widehat{q}_t^{\,x}$ for $q^{P_t^x,\widehat{\pi}_t}$. Moreover, we define:
\[
\epsilon_t^\star(x'|x,a) := \sqrt{\frac{P(x'|x,a)\ln(T|X||A|/\delta)}{\max\{1, N_t(x,a)\}}} + \frac{\ln(T|X||A|/\delta)}{\max\{1, N_t(x,a)\}}.
\]
Now, following the standard analysis of Lemma G.2 from (Jin et al., 2020), we have that:
\[
\sum_{t=1}^T (1-\lambda_{t-1}) \sum_{x,a} \big|q^{P_t^x,\widehat{\pi}_t}(x,a) - q^{P,\widehat{\pi}_t}(x,a)\big| \le \sum_{t,w_m} (1-\lambda_{t-1})\,\epsilon_t^\star(x_{m+1}|x_m,a_m)\,q^{P,\widehat{\pi}_t}(x_m,a_m)
\]
\[
+ |X| \sum_{t,w_m,w_h} (1-\lambda_{t-1})\,\epsilon_t^\star(x_{m+1}|x_m,a_m)\,q^{P,\widehat{\pi}_t}(x_m,a_m)\,\epsilon_t^\star(x_{h+1}|x_h,a_h)\,q^{P,\widehat{\pi}_t}(x_h,a_h\,|\,x_{m+1}),
\]
where $w_m = (x_m,a_m,x_{m+1})$.

Bound on the first term. To bound the first term, we notice that, by definition of $q^{P,\widehat{\pi}_t}$, it holds:
\[
\sum_{t,w_m} (1-\lambda_{t-1})\,\epsilon_t^\star(x_{m+1}|x_m,a_m)\,q^{P,\widehat{\pi}_t}(x_m,a_m) = \sum_{t,w_m} \epsilon_t^\star(x_{m+1}|x_m,a_m)\,\big(q^{P,\pi_t}(x_m,a_m) - \lambda_{t-1}\,q^{P,\pi^\diamond}(x_m,a_m)\big)
\]
\[
\le \sum_{t,w_m} \epsilon_t^\star(x_{m+1}|x_m,a_m)\,q^{P,\pi_t}(x_m,a_m) \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right),
\]
where the last step holds following Lemma G.2 from (Jin et al., 2020).

Bound on the second term. Following Lemma G.2 from (Jin et al., 2020), the second term is bounded by (ignoring constants):
\[
\sum_{t,w_m,w_h} (1-\lambda_{t-1}) \sqrt{\frac{P(x_{m+1}|x_m,a_m)\ln(T|X||A|/\delta)}{\max\{1,N_t(x_m,a_m)\}}}\, q^{P,\widehat{\pi}_t}(x_m,a_m) \sqrt{\frac{P(x_{h+1}|x_h,a_h)\ln(T|X||A|/\delta)}{\max\{1,N_t(x_h,a_h)\}}}\, q^{P,\widehat{\pi}_t}(x_h,a_h\,|\,x_{m+1})
\]
\[
+ \sum_{t,w_m,w_h} (1-\lambda_{t-1})\, \frac{q^{P,\widehat{\pi}_t}(x_m,a_m)\ln(T|X||A|/\delta)}{\max\{1,N_t(x_m,a_m)\}} + \sum_{t,w_m,w_h} (1-\lambda_{t-1})\, \frac{q^{P,\widehat{\pi}_t}(x_h,a_h)\ln(T|X||A|/\delta)}{\max\{1,N_t(x_h,a_h)\}}.
\]
The last two terms are bounded logarithmically in $T$, employing the definition of $q^{P,\widehat{\pi}_t}$ and following Lemma G.2 from (Jin et al., 2020), while, by the Cauchy–Schwarz inequality, the first term is bounded by:
\[
\sqrt{|X_{m+1}| \sum_{t,x_m,a_m} \frac{(1-\lambda_{t-1})\,q^{P,\widehat{\pi}_t}(x_m,a_m)}{\max\{1,N_t(x_m,a_m)\}}} \cdot \sqrt{|X_{h+1}| \sum_{t,x_h,a_h} \frac{(1-\lambda_{t-1})\,q^{P,\widehat{\pi}_t}(x_h,a_h)}{\max\{1,N_t(x_h,a_h)\}}},
\]
which is upper bounded by:
\[
\sqrt{|X_{m+1}| \sum_{t,x_m,a_m} \frac{q_t(x_m,a_m)}{\max\{1,N_t(x_m,a_m)\}}} \cdot \sqrt{|X_{h+1}| \sum_{t,x_h,a_h} \frac{q_t(x_h,a_h)}{\max\{1,N_t(x_h,a_h)\}}}.
\]
Employing the same argument as Lemma G.2 from (Jin et al., 2020) shows that the previous term is bounded logarithmically in $T$, which concludes the proof.

We are now ready to prove the regret bound attained by Algorithm 3.

Theorem 5.2. Given $\delta \in (0,1)$, by setting $\eta = \gamma = \sqrt{L\ln(L|X||A|/\delta)/(T|X||A|)}$, Algorithm 3 attains $R_T \le O\big(\Psi L^3 |X| \sqrt{|A|T\ln(T|X||A|m/\delta)}\big)$ with probability at least $1-11\delta$, where $\Psi := \max_{i \in [m]} \{1/\min\{(\alpha_i-\beta_i), (\alpha_i-\beta_i)^2\}\}$.

Proof. We start by decomposing $R_T := \sum_{t=1}^T \ell_t^\top (q_t - q^*)$ as:
\[
R_T = \underbrace{\sum_{t=1}^T \ell_t^\top \big(q_t - q^{\bar{P}_t,\pi_t}\big)}_{①} + \underbrace{\sum_{t=1}^T \widehat{\ell}_t^{\,\top} \big(q^{\bar{P}_t,\widehat{\pi}_t} - q^*\big)}_{②} + \underbrace{\sum_{t=1}^T \ell_t^\top \big(q^{\bar{P}_t,\pi_t} - q^{\bar{P}_t,\widehat{\pi}_t}\big)}_{③} + \underbrace{\sum_{t=1}^T (\ell_t - \widehat{\ell}_t)^\top q^{\bar{P}_t,\widehat{\pi}_t}}_{④} + \underbrace{\sum_{t=1}^T (\widehat{\ell}_t - \ell_t)^\top q^*}_{⑤},
\]
where $\bar{P}_t$ is the transition function chosen by the algorithm at episode $t$. Precisely, the first term encompasses the estimation of the transition functions, the second concerns the optimization performed by the algorithm, the third encompasses the regret accumulated by performing the convex combination of policies, and the last two concern the bias of the optimistic estimators. We proceed by bounding the five terms separately.

Bound on ①. We bound the first term as follows:
\[
\sum_{t=1}^T \ell_t^\top \big(q_t - q^{\bar{P}_t,\pi_t}\big) = \sum_{t=1}^T \sum_{x,a} \ell_t(x,a)\big(q_t(x,a) - q^{\bar{P}_t,\pi_t}(x,a)\big) \le \sum_{t=1}^T \sum_{x,a} \big|q_t(x,a) - q^{\bar{P}_t,\pi_t}(x,a)\big|,
\]
where the last inequality holds by Hölder's inequality, noticing that $\|\ell_t\|_\infty \le 1$ for all $t \in [T]$. Then, we can employ Lemma G.2, since $\pi_t$ is the policy that guides the exploration and $\bar{P}_t \in \mathcal{P}_t$, obtaining:
\[
① \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right),
\]
with probability at least $1-2\delta$, under the clean event.

Bound on ②. The second term is bounded similarly to the second part of Theorem 4.3. Precisely, we notice that, under the clean event $\mathcal{E}_{G,P}(\delta)$, the optimal safe solution $q^*$ is included in the constrained decision space for each $t \in [T]$. Moreover, we notice that, for each $t \in [T]$, the constrained space is convex and defined by linear constraints, by construction of the convex program. Thus, following the standard analysis of online mirror descent (Orabona, 2019) and from Lemma G.6, we have, under the clean event:
\[
② \le \frac{L\ln(|X|^2|A|)}{\eta} + \eta \sum_{t,x,a} q^{\bar{P}_t,\widehat{\pi}_t}(x,a)\,\widehat{\ell}_t(x,a)^2.
\]
Guaranteeing the safety property makes bounding the biased estimator more complex with respect to Theorem 4.3. Thus, noticing that $\lambda_t \le \max_{i \in [m]} \frac{L-\alpha_i}{L-\beta_i}$, so that $\frac{1}{1-\lambda_t} \le \max_{i \in [m]} \frac{L}{\alpha_i-\beta_i}$, and by definition of $\pi_t$, we proceed as follows:
\[
\sum_{t,x,a} q^{\bar{P}_t,\widehat{\pi}_t}(x,a)\,\widehat{\ell}_t(x,a)^2 \le \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} \sum_{t,x,a} (1-\lambda_{t-1})\,q^{\bar{P}_t,\widehat{\pi}_t}(x,a)\,\widehat{\ell}_t(x,a)^2
\]
\[
= \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} \sum_{t,x,a} \big(q^{\bar{P}_t,\pi_t}(x,a) - \lambda_{t-1}\,q^{\bar{P}_t,\pi^\diamond}(x,a)\big)\,\widehat{\ell}_t(x,a)^2 \le \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} \sum_{t,x,a} q^{\bar{P}_t,\pi_t}(x,a)\,\widehat{\ell}_t(x,a)^2.
\]
The previous result is intuitive: paying an additional $\max_{i \in [m]}\{L/(\alpha_i-\beta_i)\}$ factor allows us to relate the loss estimator $\widehat{\ell}_t$ to the policy that guides the exploration, namely, $\pi_t$.
Thus, following the same steps as in Theorem 4.3, we obtain, with probability at least $1-\delta$, under the clean event:
\[
② \le \frac{L\ln(|X|^2|A|)}{\eta} + \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\}\eta|X||A|T + \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\}\frac{\eta L\ln(L/\delta)}{2\gamma}.
\]
Setting $\eta = \gamma = \sqrt{L\ln(L|X||A|/\delta)/(T|X||A|)}$, we obtain:
\[
② \le O\left(\max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\}\sqrt{L|X||A|T\ln\left(\frac{L|X||A|}{\delta}\right)}\right),
\]
with probability at least $1-\delta$, under the clean event.

Bound on ③. In the following, we show how to rewrite the third term so that the dependence on the convex-combination parameter is explicit. Intuitively, the third term is the regret paid to guarantee the safety property. Thus, we rewrite the third term as follows:
\[
\sum_{t=1}^T \ell_t^\top \big(q^{\bar{P}_t,\pi_t} - q^{\bar{P}_t,\widehat{\pi}_t}\big) = \sum_{t=1}^T \ell_t^\top \big(\lambda_{t-1}\,q^{\bar{P}_t,\pi^\diamond} + (1-\lambda_{t-1})\,q^{\bar{P}_t,\widehat{\pi}_t} - q^{\bar{P}_t,\widehat{\pi}_t}\big) = \sum_{t=1}^T \lambda_{t-1}\,\ell_t^\top \big(q^{\bar{P}_t,\pi^\diamond} - q^{\bar{P}_t,\widehat{\pi}_t}\big) \le L\sum_{t=1}^T \lambda_{t-1},
\]
where we used that $\ell_t^\top q^{\bar{P}_t,\pi^\diamond} \le L$ for any $t \in [T]$. Thus, we proceed by bounding $\sum_{t=1}^T \lambda_{t-1}$. We focus on a single episode $t \in [T]$, in which we assume, without loss of generality, that the $i$-th constraint is the hardest to satisfy:
\[
\lambda_t = \frac{\min\{(\widehat{g}_{t,i}+\xi_t)^\top \widehat{u}_{t+1},\, L\} - \alpha_i}{\min\{(\widehat{g}_{t,i}+\xi_t)^\top \widehat{u}_{t+1},\, L\} - \beta_i} \le \frac{(\widehat{g}_{t,i}+\xi_t)^\top \widehat{u}_{t+1} - \alpha_i}{\alpha_i - \beta_i} \tag{21}
\]
\[
= \frac{(\widehat{g}_{t,i}-\xi_t)^\top \widehat{u}_{t+1} + 2\xi_t^\top \widehat{u}_{t+1} - \alpha_i}{\alpha_i - \beta_i} = \frac{(\widehat{g}_{t,i}-\xi_t)^\top \widehat{q}_{t+1} + (\widehat{g}_{t,i}-\xi_t)^\top(\widehat{u}_{t+1}-\widehat{q}_{t+1}) + 2\xi_t^\top \widehat{u}_{t+1} - \alpha_i}{\alpha_i - \beta_i}
\]
\[
\le \frac{(\widehat{g}_{t,i}-\xi_t)^\top \widehat{q}_{t+1} + \widehat{g}_{t,i}^\top(\widehat{u}_{t+1}-\widehat{q}_{t+1}) + 2\xi_t^\top \widehat{u}_{t+1} - \alpha_i}{\alpha_i - \beta_i} \le \frac{\widehat{g}_{t,i}^\top(\widehat{u}_{t+1}-\widehat{q}_{t+1}) + 2\xi_t^\top \widehat{u}_{t+1}}{\alpha_i - \beta_i} \tag{22}
\]
\[
= \frac{\widehat{g}_{t,i}^\top\big(\widehat{u}_{t+1}-q^{P,\widehat{\pi}_{t+1}}\big) + \widehat{g}_{t,i}^\top\big(q^{P,\widehat{\pi}_{t+1}}-q^{\bar{P}_{t+1},\widehat{\pi}_{t+1}}\big) + 2\xi_t^\top \widehat{u}_{t+1}}{\alpha_i - \beta_i} \le \frac{\big\|\widehat{u}_{t+1}-q^{P,\widehat{\pi}_{t+1}}\big\|_1 + \big\|q^{P,\widehat{\pi}_{t+1}}-q^{\bar{P}_{t+1},\widehat{\pi}_{t+1}}\big\|_1 + 2\xi_t^\top \widehat{u}_{t+1}}{\alpha_i - \beta_i}
\]
\[
\le \frac{L(1-\lambda_t)\big\|\widehat{u}_{t+1}-q^{P,\widehat{\pi}_{t+1}}\big\|_1 + L(1-\lambda_t)\big\|q^{P,\widehat{\pi}_{t+1}}-q^{\bar{P}_{t+1},\widehat{\pi}_{t+1}}\big\|_1 + 2L(1-\lambda_t)\,\xi_t^\top \widehat{u}_{t+1}}{\min\{(\alpha_i-\beta_i),(\alpha_i-\beta_i)^2\}}, \tag{23}
\]
where Inequality (21) holds since, for the hardest constraint, when $\lambda_t \ne 0$, it holds $(\widehat{g}_{t,i}+\xi_t)^\top \widehat{u}_{t+1} > \alpha_i$ (so that the denominator is at least $\alpha_i - \beta_i$), Inequality (22) holds since, under the clean event, $(\widehat{g}_{t,i}-\xi_t)^\top \widehat{q}_{t+1} \le \alpha_i$, the Hölder steps use $\|\widehat{g}_{t,i}\|_\infty \le 1$, and Inequality (23) holds since $\lambda_t \le \frac{L-\alpha_i}{L-\beta_i}$ implies $\frac{L(1-\lambda_t)}{\alpha_i-\beta_i} \ge 1$. Intuitively, Inequality (23) shows that, to guarantee the safety property, Algorithm 3 has to pay a factor proportional to the pessimism introduced on the transition and cost functions, plus the constraint-satisfaction gap of the strictly feasible solution given as input to the algorithm.

We now generalize the result by summing over $t$, taking into account that the hardest constraint may vary across episodes. Thus, we bound the summation as follows:
\[
\sum_{t=1}^T \lambda_{t-1} \le \max_{i \in [m]}\left\{\frac{2L}{\min\{(\alpha_i-\beta_i),(\alpha_i-\beta_i)^2\}}\right\} \sum_{t=1}^T (1-\lambda_{t-1})\Big(\big\|\widehat{u}_t - q^{P,\widehat{\pi}_t}\big\|_1 + \big\|q^{P,\widehat{\pi}_t} - q^{\bar{P}_t,\widehat{\pi}_t}\big\|_1 + \xi_{t-1}^\top \widehat{u}_t\Big).
\]
The first two terms are bounded by applying Lemma D.1, which holds with probability at least $1-2\delta$ under the clean event, while, to bound $\sum_{t=1}^T (1-\lambda_{t-1})\,\xi_{t-1}^\top \widehat{u}_t$, we proceed as follows:
\[
\sum_{t=1}^T (1-\lambda_{t-1})\,\xi_{t-1}^\top \widehat{u}_t = \sum_{t=1}^T (1-\lambda_{t-1})\,\xi_{t-1}^\top q^{P,\widehat{\pi}_t} + \sum_{t=1}^T (1-\lambda_{t-1})\,\xi_{t-1}^\top \big(\widehat{u}_t - q^{P,\widehat{\pi}_t}\big),
\]
where the second term is bounded employing Hölder's inequality and Lemma D.1. Next, we focus on the first term, proceeding as follows:
\[
\sum_{t=1}^T (1-\lambda_{t-1})\,\xi_{t-1}^\top q^{P,\widehat{\pi}_t} \le \sum_{t=1}^T \xi_{t-1}^\top q_t \tag{24}
\]
\[
\le \sum_{t=1}^T \sum_{x,a} \xi_{t-1}(x,a)\,\mathbb{1}_t(x,a) + L\sqrt{T\ln\left(\frac{1}{\delta}\right)} \tag{25}
\]
\[
\le \sqrt{4\ln\left(\frac{T|X||A|m}{\delta}\right)}\sum_{x,a} 3\sqrt{N_T(x,a)} + L\sqrt{T\ln\left(\frac{1}{\delta}\right)} \le O\left(\sqrt{L|X||A|T\ln\left(\frac{T|X||A|m}{\delta}\right)}\right) + L\sqrt{T\ln\left(\frac{1}{\delta}\right)}, \tag{26}
\]
where Inequality (24) follows from the definition of $\pi_t$, since $(1-\lambda_{t-1})\,q^{P,\widehat{\pi}_t} \le q_t$ componentwise, Inequality (25) follows from the Azuma–Hoeffding inequality, and Inequality (26) holds since $1 + \sum_{n=1}^N 1/\sqrt{n} \le 3\sqrt{N}$ and by the Cauchy–Schwarz inequality. Thus, we obtain:
\[
③ \le O\left(\max_{i \in [m]}\left\{\frac{1}{\min\{(\alpha_i-\beta_i),(\alpha_i-\beta_i)^2\}}\right\} L^3|X|\sqrt{|A|T\ln\left(\frac{T|X||A|m}{\delta}\right)}\right),
\]
with probability at least $1-3\delta$, under the clean event.

Bound on ④. We first notice that ④ presents an additional challenge with respect to the bounded-violation case.
Indeed, since $\widehat{\pi}_t$ is not the policy that drives the exploration, $\widehat{\ell}_t$ cannot be directly bounded employing results from the unconstrained adversarial MDP literature. First, we rewrite the fourth term as follows:
\[
\sum_{t=1}^T (\ell_t - \widehat{\ell}_t)^\top q^{\bar{P}_t,\widehat{\pi}_t} = \sum_{t=1}^T \big(\mathbb{E}_t[\widehat{\ell}_t] - \widehat{\ell}_t\big)^\top q^{\bar{P}_t,\widehat{\pi}_t} + \sum_{t=1}^T \big(\ell_t - \mathbb{E}_t[\widehat{\ell}_t]\big)^\top q^{\bar{P}_t,\widehat{\pi}_t},
\]
where $\mathbb{E}_t[\cdot]$ is the expectation conditioned on the filtration up to time $t$. To bound the first term, we employ the Azuma–Hoeffding inequality, noticing that the martingale difference sequence is bounded as:
\[
\widehat{\ell}_t^{\,\top} q^{\bar{P}_t,\widehat{\pi}_t} \le \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} \widehat{\ell}_t^{\,\top} (1-\lambda_{t-1})\,q^{\bar{P}_t,\widehat{\pi}_t} = \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} \widehat{\ell}_t^{\,\top} \big(q^{\bar{P}_t,\pi_t} - \lambda_{t-1}\,q^{\bar{P}_t,\pi^\diamond}\big) \le \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} \widehat{\ell}_t^{\,\top} q^{\bar{P}_t,\pi_t},
\]
where the first inequality holds since $\lambda_{t-1} \le \max_{i \in [m]} \frac{L-\alpha_i}{L-\beta_i}$. Thus, the first term is bounded by $\max_{i \in [m]}\{L/(\alpha_i-\beta_i)\}\, O\big(L\sqrt{T\ln(1/\delta)}\big)$ with probability at least $1-\delta$. To bound the second term, we employ the definition of $\pi_t$ and the upper bound on $\lambda_{t-1}$, proceeding as follows:
\[
\sum_{t=1}^T \big(\ell_t - \mathbb{E}_t[\widehat{\ell}_t]\big)^\top q^{\bar{P}_t,\widehat{\pi}_t} = \sum_{t,x,a} q^{\bar{P}_t,\widehat{\pi}_t}(x,a)\,\ell_t(x,a)\left(1 - \frac{\mathbb{E}_t[\mathbb{1}_t(x,a)]}{u_t(x,a)+\gamma}\right) = \sum_{t,x,a} q^{\bar{P}_t,\widehat{\pi}_t}(x,a)\,\ell_t(x,a)\left(1 - \frac{q_t(x,a)}{u_t(x,a)+\gamma}\right)
\]
\[
\le \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} \sum_{t,x,a} (1-\lambda_{t-1})\,q^{\bar{P}_t,\widehat{\pi}_t}(x,a)\,\ell_t(x,a)\left(1 - \frac{q_t(x,a)}{u_t(x,a)+\gamma}\right)
\]
\[
\le \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} \sum_{t,x,a} q^{\bar{P}_t,\pi_t}(x,a)\,\ell_t(x,a)\left(1 - \frac{q_t(x,a)}{u_t(x,a)+\gamma}\right) = \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} \sum_{t,x,a} q^{\bar{P}_t,\pi_t}(x,a)\,\frac{u_t(x,a) - q_t(x,a) + \gamma}{u_t(x,a)+\gamma}
\]
\[
\le \max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)} + \gamma|X||A|T\right),
\]
where the last step holds by Lemma G.2. Thus, combining the previous bounds, we have, with probability at least $1-3\delta$, under the clean event:
\[
④ \le O\left(\max_{i \in [m]}\left\{\frac{L}{\alpha_i-\beta_i}\right\} L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right).
\]

Bound on ⑤. The last term is bounded as in Theorem 4.3. Thus, setting $\gamma = \sqrt{L\ln(L|X||A|/\delta)/(T|X||A|)}$, we obtain, with probability at least $1-\delta$, under the clean event:
\[
⑤ \le O\left(\sqrt{L|X||A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right).
\]

Final result. Finally, we combine the bounds on ①, ②, ③, ④ and ⑤. Applying a union bound, we obtain, with probability at least $1-11\delta$:
\[
R_T \le O\left(\max_{i \in [m]}\left\{\frac{1}{\min\{(\alpha_i-\beta_i),(\alpha_i-\beta_i)^2\}}\right\} L^3|X|\sqrt{|A|T\ln\left(\frac{T|X||A|m}{\delta}\right)}\right),
\]
which concludes the proof.

E. Omitted proofs for constant violation

In this section, we show how it is possible to relax Condition 2.5 while still achieving constant cumulative positive violation.

E.1. Slater's parameter estimation

We start by showing that the number of episodes Algorithm 4 needs to compute the estimates $\widehat{\rho}$ and $\widehat{\pi}^\diamond$ is upper bounded by a constant. This is done by means of the following lemma.

Lemma 6.1. Given any $\delta \in (0,1)$, the number of episodes $\bar{t}$ that Algorithm 4 uses to compute $\widehat{\rho}$ and $\widehat{\pi}^\diamond$ satisfies $\bar{t} \le \frac{1}{\rho^4}\big(3C^P_{\mathcal{A}} + 10L\sqrt{\ln(1/\delta)} + 3C^D_{\mathcal{A}} + L\big)^4$, with probability at least $1-(C^\delta_P+2)\delta$.

Proof. By the no-regret property of the primal algorithm $\mathcal{A}^P$, it holds, with probability at least $1-C^\delta_P\delta$:
\[
\sum_{t=1}^{\bar{t}} \sum_{i \in [m]} \phi_{t,i}\, g_{t,i}^\top \big(q_t - q^\diamond\big) \le C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})}.
\]
Thus, applying the Azuma inequality, we get:
\[
\sum_{t=1}^{\bar{t}} \sum_{i \in [m]} \phi_{t,i}\, (g_{t,i} - \alpha_i/L)^\top q_t \le C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} + \sum_{t=1}^{\bar{t}} \sum_{i \in [m]} \phi_{t,i}\, (g_i - \alpha_i/L)^\top q^\diamond
\]
\[
= C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} - \sum_{t=1}^{\bar{t}} \sum_{i \in [m]} \phi_{t,i}\,(\alpha_i - \beta_i) \le C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} - \bar{t}\min_{i \in [m]}(\alpha_i - \beta_i)
\]
\[
= C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} - \bar{t}\rho,
\]
with probability at least $1-(C^\delta_P+1)\delta$, by a union bound. Similarly, applying the Azuma inequality to the realized costs, it holds:
\[
\sum_{t=1}^{\bar{t}} \sum_{i \in [m]} \phi_{t,i} \sum_{k=1}^{L-1} \big(g_{t,i}(x_k,a_k) - \alpha_i/L\big) \le C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + 2L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} - \bar{t}\rho,
\]
with probability at least $1-(C^\delta_P+2)\delta$. By the no-regret property of the dual algorithm $\mathcal{A}^D$, it holds:
\[
\sum_{t=1}^{\bar{t}} \sum_{i \in [m]} \phi_{t,i} \sum_{k=1}^{L-1} \big(g_{t,i}(x_k,a_k) - \alpha_i/L\big) \ge \sum_{t=1}^{\bar{t}} \sum_{k=1}^{L-1} \big(g_{t,i^\star}(x_k,a_k) - \alpha_{i^\star}/L\big) - C^D_{\mathcal{A}}\sqrt{\bar{t}},
\]
where $i^\star = \arg\max_{i \in [m]} \sum_{t=1}^{\bar{t}} \sum_{k=1}^{L-1} (g_{t,i}(x_k,a_k) - \alpha_i/L)$, from which we obtain, with probability at least $1-(C^\delta_P+2)\delta$:
\[
\max_{i \in [m]} \sum_{t=1}^{\bar{t}} \sum_{k=1}^{L-1} \big(g_{t,i}(x_k,a_k) - \alpha_i/L\big) \le C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + 2L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} + C^D_{\mathcal{A}}\sqrt{\bar{t}} - \bar{t}\rho.
\]
By the stopping condition of Algorithm 4, at the stopping episode $\bar{t}$ it holds:
\[
\max_{i \in [m]} \sum_{t=1}^{\bar{t}} \sum_{k=1}^{L-1} \big(g_{t,i}(x_k,a_k) - \alpha_i/L\big) \ge -2C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} - 8L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} - 2C^D_{\mathcal{A}}\sqrt{\bar{t}} - L,
\]
since the condition was not yet met at episode $\bar{t}-1$ and each episode changes the left-hand side by at most $L$. Combining the latter inequality with the upper bound derived above, we get:
\[
-2C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} - 8L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} - 2C^D_{\mathcal{A}}\sqrt{\bar{t}} - L \le C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + 2L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} + C^D_{\mathcal{A}}\sqrt{\bar{t}} - \bar{t}\rho,
\]
and thus:
\[
\bar{t}\rho \le 3C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + 10L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} + 3C^D_{\mathcal{A}}\sqrt{\bar{t}} + L \le \left(3C^P_{\mathcal{A}} + 10L\sqrt{\ln\left(\frac{1}{\delta}\right)} + 3C^D_{\mathcal{A}} + L\right)\bar{t}^{3/4},
\]
which implies:
\[
\bar{t} \le \frac{1}{\rho^4}\left(3C^P_{\mathcal{A}} + 10L\sqrt{\ln\left(\frac{1}{\delta}\right)} + 3C^D_{\mathcal{A}} + L\right)^4.
\]
This concludes the proof.

Next, we show that the estimate $\widehat{\rho}$ of the Slater's parameter is upper bounded by the true one; a similar reasoning holds for the estimated strictly feasible solution. The result is provided in the following lemma.

Lemma 6.2. Given any $\delta \in (0,1)$, Algorithm 4 guarantees $\widehat{\rho} \le \min_{i \in [m]}\big(\alpha_i - g_i^\top q^{P,\widehat{\pi}^\diamond}\big) \le \rho$ with probability at least $1-2\delta$.

Proof. To prove the second inequality, we first notice that, by definition of $q^\diamond$, which maximizes the feasibility margin:
\[
\min_{i \in [m]}\big(\alpha_i - g_i^\top q^{P,\widehat{\pi}^\diamond}\big) \le \min_{i \in [m]}\big(\alpha_i - g_i^\top q^\diamond\big) = \min_{i \in [m]}(\alpha_i - \beta_i) = \rho.
\]
Furthermore, by definition of $\widehat{\pi}^\diamond$, it holds $q^{P,\widehat{\pi}^\diamond} = \frac{1}{\bar{t}}\sum_{t=1}^{\bar{t}} q^{P,\pi_t}$. Thus:
\[
\min_{i \in [m]}\big(\alpha_i - g_i^\top q^{P,\widehat{\pi}^\diamond}\big) = \min_{i \in [m]} \frac{1}{\bar{t}}\sum_{t=1}^{\bar{t}}\big(\alpha_i - g_i^\top q^{P,\pi_t}\big) \ge -\frac{1}{\bar{t}}\max_{i \in [m]}\sum_{t=1}^{\bar{t}}\sum_{k=1}^{L-1}\big(g_{t,i}(x_k,a_k) - \alpha_i/L\big) - 2L\sqrt{\frac{\ln(1/\delta)}{\bar{t}}} = \widehat{\rho},
\]
where the inequality holds with probability at least $1-2\delta$, employing the Azuma inequality and a union bound. This concludes the proof.

Finally, we show that $\widehat{\rho}$ is not too small, namely, that it is lower bounded by $\rho/2$. The result is provided in the following lemma.

Lemma 6.4. Given any $\delta \in (0,1)$, Algorithm 4 guarantees $\widehat{\rho} \ge \rho/2$ with probability at least $1-(C^\delta_P+2)\delta$.

Proof. Similarly to Lemma 6.1, we get, with probability at least $1-(C^\delta_P+2)\delta$:
\[
\max_{i \in [m]} \sum_{t=1}^{\bar{t}} \sum_{k=1}^{L-1} \big(g_{t,i}(x_k,a_k) - \alpha_i/L\big) \le C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + 2L\sqrt{\bar{t}\ln\left(\frac{1}{\delta}\right)} + C^D_{\mathcal{A}}\sqrt{\bar{t}} - \bar{t}\rho,
\]
so that, dividing by $\bar{t}$ and recalling the definition of $\widehat{\rho}$:
\[
\widehat{\rho} \ge \rho - \frac{C^P_{\mathcal{A}}\sqrt{\bar{t}\ln(\bar{t})} + 4L\sqrt{\bar{t}\ln(1/\delta)} + C^D_{\mathcal{A}}\sqrt{\bar{t}}}{\bar{t}}.
\]
Combining the previous inequality with the stopping condition of Algorithm 4, which guarantees that, at the stopping episode $\bar{t}$, the concentration and no-regret error terms are dominated by half of the accumulated constraint margin, and rearranging, we obtain $\widehat{\rho} \ge \rho/2$. This concludes the proof.
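A schematic of the estimation phase described in this section, in our own naming: primal and dual no-regret procedures are run until the accumulated constraint margin certifies a strictly feasible policy, after which $\widehat{\rho}$ is extracted as in Lemma 6.2. The threshold mirrors the form appearing in Lemma 6.1, but the exact constants live in the pseudocode of Algorithm 4, so they should be read as assumptions.

```python
import math

def estimation_phase(cumulative_margins, C_P, C_D, L, delta):
    """First phase of the constant-violation algorithm (sketch).

    cumulative_margins(t) -> list of m floats: for each constraint i, the
        sum over episodes 1..t and steps k of g_{t,i}(x_k, a_k) - alpha_i/L,
        as produced by the primal/dual no-regret procedures (assumed given).
    Stops at the first t where the largest margin drops below the negative
    threshold, then returns t_bar and the Slater-parameter estimate."""
    t = 0
    while True:
        t += 1
        threshold = (2 * C_P * math.sqrt(t * math.log(t + 1))
                     + 8 * L * math.sqrt(t * math.log(1 / delta))
                     + 2 * C_D * math.sqrt(t))
        worst = max(cumulative_margins(t))
        if worst <= -threshold:
            rho_hat = -worst / t - 2 * L * math.sqrt(math.log(1 / delta) / t)
            return t, rho_hat
```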
E.2. Violations

In this section, we provide the theoretical guarantees attained by Algorithm 4 in terms of cumulative positive violation. This is done by means of the following theorem.

Theorem 6.3. Given $\delta \in (0,1)$, Algorithm 4 attains $V_T \le O\big(\frac{L}{\rho^4}\big(C^P_{\mathcal{A}} + L\sqrt{\ln(1/\delta)} + C^D_{\mathcal{A}} + L\big)^4\big)$ with probability at least $1-(C^\delta_P+7)\delta$.

Proof. We split the violations between the two phases of Algorithm 4 as:
\[
V_T \le \sum_{t=1}^{\bar{t}} \max_{i \in [m]}\big[g_i^\top q_t - \alpha_i\big]^+ + \sum_{t=\bar{t}+1}^T \max_{i \in [m]}\big[g_i^\top q_t - \alpha_i\big]^+ \le L\bar{t} + \sum_{t=\bar{t}+1}^T \max_{i \in [m]}\big[g_i^\top q_t - \alpha_i\big]^+
\]
\[
\le O\left(\frac{L}{\rho^4}\left(C^P_{\mathcal{A}} + L\sqrt{\ln\left(\frac{1}{\delta}\right)} + C^D_{\mathcal{A}} + L\right)^4\right) + \sum_{t=\bar{t}+1}^T \max_{i \in [m]}\big[g_i^\top q_t - \alpha_i\big]^+,
\]
where the last step holds with probability at least $1-(C^\delta_P+2)\delta$ by Lemma 6.1. In the following, we show that, after the first $\bar{t}$ episodes, Algorithm 4 is safe with high probability. Similarly to Theorem 5.1, there are two possible cases defined by $\lambda_t$, namely, $\lambda_t = 0$ and $\lambda_t \in (0,1)$. When $\lambda_t = 0$, applying the same reasoning as in Theorem 5.1 gives the result. In the following analysis, we consider a generic constraint $i \in [m]$. We notice that the constraint cost attained by the non-Markovian policy $\pi_t$ is equal to $\lambda_{t-1}\,g_i^\top q^{P,\widehat{\pi}^\diamond} + (1-\lambda_{t-1})\,g_i^\top q^{P,\widehat{\pi}_t}$. Then, we consider both the case in which $L < (\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t$ (first case) and the case in which $L > (\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t$ (second case). If the two quantities are equal, the proof still holds, breaking ties arbitrarily.

First case. In this case, $\lambda_{t-1} = \frac{L-\alpha_i}{L-\alpha_i+\widehat{\rho}}$, and it holds that:
\[
\lambda_{t-1}\,g_i^\top q^{P,\widehat{\pi}^\diamond} + (1-\lambda_{t-1})\,g_i^\top q^{P,\widehat{\pi}_t} \le \lambda_{t-1}\,g_i^\top q^{P,\widehat{\pi}^\diamond} + (1-\lambda_{t-1})L = L + \lambda_{t-1}\big(g_i^\top q^{P,\widehat{\pi}^\diamond} - L\big)
\]
\[
\le L + \frac{L-\alpha_i}{L-\alpha_i+\widehat{\rho}}\big(\alpha_i - \widehat{\rho} - L\big) = L - (L - \alpha_i) = \alpha_i, \tag{30}
\]
where Inequality (30) holds since $g_i^\top q^{P,\widehat{\pi}^\diamond} \le \alpha_i - \widehat{\rho}$ with probability at least $1-2\delta$, thanks to Lemma 6.2.

Second case. Similarly to the first case, with $\lambda_{t-1} = \frac{(\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t - \alpha_i}{(\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t - \alpha_i + \widehat{\rho}}$, it holds, under the clean event:
\[
\lambda_{t-1}\,g_i^\top q^{P,\widehat{\pi}^\diamond} + (1-\lambda_{t-1})\,g_i^\top q^{P,\widehat{\pi}_t} \le \lambda_{t-1}\,g_i^\top q^{P,\widehat{\pi}^\diamond} + (1-\lambda_{t-1})\,(\widehat{g}_{t-1,i}+\xi_{t-1})^\top q^{P,\widehat{\pi}_t} \tag{31}
\]
\[
\le \lambda_{t-1}\,g_i^\top q^{P,\widehat{\pi}^\diamond} + (1-\lambda_{t-1})\,(\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t \tag{32}
\]
\[
= \lambda_{t-1}\big(g_i^\top q^{P,\widehat{\pi}^\diamond} - (\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t\big) + (\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t
\]
\[
\le \lambda_{t-1}\big(\alpha_i - \widehat{\rho} - (\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t\big) + (\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t = -\big((\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t - \alpha_i\big) + (\widehat{g}_{t-1,i}+\xi_{t-1})^\top \widehat{u}_t = \alpha_i,
\]
where Inequality (31) holds by the definition of the event and Inequality (32) holds by the definition of $\widehat{u}_t$. To conclude the proof, we underline that $\lambda_t$ is chosen by taking the maximum over the constraints, which implies that the most conservative $\lambda_t$ (the one yielding the combination closest to the strictly feasible solution) is chosen. Thus, all the constraints are satisfied, and a final union bound concludes the proof.

E.3. Regret

In this section, we provide the theoretical guarantees attained by Algorithm 4 in terms of cumulative regret. This is done by means of the following theorem.

Theorem 6.5. Given any $\delta \in (0,1)$, with $\eta = \gamma = \sqrt{L\ln(L|X||A|/\delta)/(T|X||A|)}$, Algorithm 4 attains $R_T \le O\big(\Theta L^3|X|\sqrt{|A|T\ln(T|X||A|m/\delta)} + \frac{L}{\rho^4}\big(C^P_{\mathcal{A}} + L\sqrt{\ln(1/\delta)} + C^D_{\mathcal{A}} + L\big)^4\big)$ with probability at least $1-(C^\delta_P+13)\delta$, where we let $\Theta := 1/\min\{\rho, \rho^2\}$.

Proof. We split the regret between the two phases of Algorithm 4 as:
\[
R_T = \sum_{t=1}^{\bar{t}} \ell_t^\top \big(q^{P,\pi_t} - q^*\big) + \sum_{t=\bar{t}+1}^T \ell_t^\top \big(q^{P,\pi_t} - q^*\big) \le L\bar{t} + \sum_{t=\bar{t}+1}^T \ell_t^\top \big(q^{P,\pi_t} - q^*\big)
\]
\[
\le O\left(\frac{L}{\rho^4}\left(C^P_{\mathcal{A}} + L\sqrt{\ln\left(\frac{1}{\delta}\right)} + C^D_{\mathcal{A}} + L\right)^4\right) + \sum_{t=\bar{t}+1}^T \ell_t^\top \big(q^{P,\pi_t} - q^*\big),
\]
where the last step holds with probability at least $1-(C^\delta_P+2)\delta$ by Lemma 6.1.
To bound the second term, we follow the steps of Theorem 5.2, after noticing that, for all $t \in [T]$, $\lambda_t \le \max_{i \in [m]} \frac{L-\alpha_i}{L-\alpha_i+\widehat{\rho}}$ and, assuming without loss of generality that the $i$-th constraint is the hardest to satisfy:
\[
\lambda_t = \frac{\min\{(\widehat{g}_{t,i}+\xi_t)^\top \widehat{u}_{t+1},\, L\} - \alpha_i}{\min\{(\widehat{g}_{t,i}+\xi_t)^\top \widehat{u}_{t+1},\, L\} - \alpha_i + \widehat{\rho}} \le \frac{(\widehat{g}_{t,i}+\xi_t)^\top \widehat{u}_{t+1} - \alpha_i}{\widehat{\rho}} \tag{33}
\]
\[
= \frac{(\widehat{g}_{t,i}-\xi_t)^\top \widehat{u}_{t+1} + 2\xi_t^\top \widehat{u}_{t+1} - \alpha_i}{\widehat{\rho}} = \frac{(\widehat{g}_{t,i}-\xi_t)^\top \widehat{q}_{t+1} + (\widehat{g}_{t,i}-\xi_t)^\top(\widehat{u}_{t+1}-\widehat{q}_{t+1}) + 2\xi_t^\top \widehat{u}_{t+1} - \alpha_i}{\widehat{\rho}}
\]
\[
\le \frac{(\widehat{g}_{t,i}-\xi_t)^\top \widehat{q}_{t+1} + \widehat{g}_{t,i}^\top(\widehat{u}_{t+1}-\widehat{q}_{t+1}) + 2\xi_t^\top \widehat{u}_{t+1} - \alpha_i}{\widehat{\rho}} \le \frac{\widehat{g}_{t,i}^\top(\widehat{u}_{t+1}-\widehat{q}_{t+1}) + 2\xi_t^\top \widehat{u}_{t+1}}{\widehat{\rho}} \tag{34}
\]
\[
= \frac{\widehat{g}_{t,i}^\top\big(\widehat{u}_{t+1}-q^{P,\widehat{\pi}_{t+1}}\big) + \widehat{g}_{t,i}^\top\big(q^{P,\widehat{\pi}_{t+1}}-q^{\bar{P}_{t+1},\widehat{\pi}_{t+1}}\big) + 2\xi_t^\top \widehat{u}_{t+1}}{\widehat{\rho}} \le \frac{\big\|\widehat{u}_{t+1}-q^{P,\widehat{\pi}_{t+1}}\big\|_1 + \big\|q^{P,\widehat{\pi}_{t+1}}-q^{\bar{P}_{t+1},\widehat{\pi}_{t+1}}\big\|_1 + 2\xi_t^\top \widehat{u}_{t+1}}{\widehat{\rho}}
\]
\[
\le \frac{L(1-\lambda_t)\big\|\widehat{u}_{t+1}-q^{P,\widehat{\pi}_{t+1}}\big\|_1 + L(1-\lambda_t)\big\|q^{P,\widehat{\pi}_{t+1}}-q^{\bar{P}_{t+1},\widehat{\pi}_{t+1}}\big\|_1 + 2L(1-\lambda_t)\,\xi_t^\top \widehat{u}_{t+1}}{\min\{\widehat{\rho}, \widehat{\rho}^{\,2}\}} \tag{35}
\]
\[
\le \frac{4L(1-\lambda_t)\big\|\widehat{u}_{t+1}-q^{P,\widehat{\pi}_{t+1}}\big\|_1 + 4L(1-\lambda_t)\big\|q^{P,\widehat{\pi}_{t+1}}-q^{\bar{P}_{t+1},\widehat{\pi}_{t+1}}\big\|_1 + 8L(1-\lambda_t)\,\xi_t^\top \widehat{u}_{t+1}}{\min\{\rho, \rho^2\}}, \tag{36}
\]
where Inequality (33) holds since, for the hardest constraint, when $\lambda_t \ne 0$, it holds $(\widehat{g}_{t,i}+\xi_t)^\top \widehat{u}_{t+1} > \alpha_i$, Inequality (34) holds since, under the clean event, $(\widehat{g}_{t,i}-\xi_t)^\top \widehat{q}_{t+1} \le \alpha_i$, Inequality (35) holds since $\lambda_t \le \frac{L-\alpha_i}{L-\alpha_i+\widehat{\rho}}$, and Inequality (36) holds with probability at least $1-(C^\delta_P+2)\delta$ by Lemma 6.4, since $\widehat{\rho} \ge \rho/2$ implies $1/\min\{\widehat{\rho},\widehat{\rho}^{\,2}\} \le 4/\min\{\rho,\rho^2\}$. A final union bound concludes the proof.

F. Omitted proof of the lower bound

In this section, we provide the lower bound, which holds both in the second setting, that is, when the objective is to guarantee the safety property at each episode, and in the third one, when the objective is to attain constant violation.

Theorem 6.6. There exist two instances of CMDPs (with a single state and one constraint) such that, if, in the first instance, an algorithm suffers a violation $V_T = o(\sqrt{T})$ with probability at least $1-n\delta$ for any $\delta \in (0,1)$ and $n > 0$, then, in the second instance, it must suffer a regret $R_T = \Omega(\sqrt{T})$ with probability $3/4 - n\delta$.

Proof. We consider two instances defined as follows. Both of them are characterized by a CMDP with one state (which is omitted for simplicity), two actions $a_1, a_2$, one constraint, and $\alpha = 1/2$. For the sake of simplicity, we consider CMDPs with rewards in place of losses; this is without loss of generality, since any loss can be converted into an associated reward. We assume that the rewards are deterministic, while the constraint costs are Bernoulli distributions whose means are defined in the following. Specifically, instance $i_1$ and instance $i_2$ are defined as:
\[
i_1:\quad r(a_1) = \tfrac{1}{2},\; g(a_1) = \tfrac{1}{2} + \epsilon,\qquad r(a_2) = 0,\; g(a_2) = \tfrac{1}{2} - \rho,
\]
\[
i_2:\quad r(a_1) = \tfrac{1}{2},\; g(a_1) = \tfrac{1}{2},\qquad r(a_2) = 0,\; g(a_2) = \tfrac{1}{2} - \rho,
\]
where $\epsilon$ is a parameter to be defined later. Thus, since the algorithm must suffer $o(\sqrt{T})$ violation, for any constant $c > 0$, it holds:
\[
\mathbb{P}_1\left[q_t(a_2) \ge \frac{\epsilon}{\epsilon+\rho} - \frac{c}{\sqrt{T}},\ \forall t \in [T]\right] \ge 1 - n\delta,
\]
where $q_t(a_2)$ is the occupancy measure associated with action $a_2$ and $\mathbb{P}_1$ is the probability measure of instance $i_1$, which encompasses the randomness of both the environment and the algorithm. Thus, we can rewrite the inequality above as:
\[
\mathbb{P}_1\left[\sum_{t=1}^T q_t(a_2) \ge T\frac{\epsilon}{\epsilon+\rho} - c\sqrt{T}\right] \ge 1 - n\delta.
\]
By means of Pinsker's inequality, we can relate the probability measures $\mathbb{P}_1$ and $\mathbb{P}_2$ as follows:
\[
\mathbb{P}_2\left[\sum_{t=1}^T q_t(a_2) \ge T\frac{\epsilon}{\epsilon+\rho} - c\sqrt{T}\right] \ge \mathbb{P}_1\left[\sum_{t=1}^T q_t(a_2) \ge T\frac{\epsilon}{\epsilon+\rho} - c\sqrt{T}\right] - \sqrt{\frac{1}{2}\,\text{KL}(i_1,i_2)},
\]
where $\text{KL}(i_1,i_2)$ is the KL-divergence between the probability measures of instances $i_1$ and $i_2$. Noticing that, by a standard KL-decomposition argument, $\text{KL}(i_1,i_2) \le \epsilon^2 T$, we have:
\[
\mathbb{P}_2\left[\sum_{t=1}^T q_t(a_2) \ge T\frac{\epsilon}{\epsilon+\rho} - c\sqrt{T}\right] \ge 1 - n\delta - \epsilon\sqrt{\frac{T}{2}}.
\]
We then notice that, since the rewards are deterministic and $a_1$ is optimal and feasible in instance $i_2$, the regret of the second instance, $R_T^2$, is bounded as:
\[
R_T^2 \ge \frac{1}{2}\left(T\frac{\epsilon}{\epsilon+\rho} - c\sqrt{T}\right) = \Omega(\sqrt{T}),
\]
with probability $\frac{3}{4} - n\delta$, taking $\epsilon = \frac{1}{2\sqrt{2T}}$, $\rho \le \epsilon$, and $c \le \frac{\sqrt{2}}{32}$. This concludes the proof.
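To make the construction tangible, here is a closed-form helper, under the reconstruction above (in particular, the mean $g(a_2) = 1/2 - \rho$ should be read as our assumption), computing the expected violation in $i_1$ and the expected regret in $i_2$ of a stationary strategy that plays $a_1$ with a fixed probability.

```python
def instance_tradeoff(q_a1, T, eps, rho):
    """Expected positive violation (instance 1) and regret (instance 2)
    of the stationary strategy playing a1 with probability q_a1, in the
    two-instance construction of Theorem 6.6 (our reconstruction)."""
    alpha = 0.5
    # instance 1: g(a1) = 1/2 + eps; instance 2: g(a1) = 1/2; both: g(a2) = 1/2 - rho
    mean_cost_1 = q_a1 * (0.5 + eps) + (1 - q_a1) * (0.5 - rho)
    violation_1 = T * max(mean_cost_1 - alpha, 0.0)  # per-episode positive part
    regret_2 = T * 0.5 * (1 - q_a1)                  # a1 is optimal and feasible in i2
    return violation_1, regret_2
```

Setting `q_a1 = rho / (eps + rho)` makes the violation in $i_1$ exactly zero while forcing regret $\frac{T}{2}\cdot\frac{\epsilon}{\epsilon+\rho}$ in $i_2$: this is the trade-off at the heart of the proof.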
G. Auxiliary lemmas from existing works

G.1. Auxiliary lemmas for the transitions estimation

First, we provide the formal result on the transitions confidence set.

Lemma G.1 (Jin et al. (2020)). Given a confidence $\delta \in (0,1)$, with probability at least $1-4\delta$, the transition function $P$ belongs to $\mathcal{P}_t$ for all $t \in [T]$.

Similarly to (Jin et al., 2020), the estimated occupancy measure space $\Delta(\mathcal{P}_t)$ is characterized by the following constraints:
\[
\sum_{x \in X_k,\, a \in A,\, x' \in X_{k+1}} q(x,a,x') = 1 \qquad \forall k,
\]
\[
\sum_{a \in A,\, x' \in X_{k+1}} q(x,a,x') = \sum_{x' \in X_{k-1},\, a \in A} q(x',a,x) \qquad \forall k,\ \forall x \in X_k,
\]
\[
q(x,a,x') \le \big[\widehat{P}_t(x'|x,a) + \epsilon_t(x'|x,a)\big] \sum_{y \in X_{k+1}} q(x,a,y) \qquad \forall k,\ \forall (x,a,x'),
\]
\[
q(x,a,x') \ge \big[\widehat{P}_t(x'|x,a) - \epsilon_t(x'|x,a)\big] \sum_{y \in X_{k+1}} q(x,a,y) \qquad \forall k,\ \forall (x,a,x'),
\]
\[
q(x,a,x') \ge 0 \qquad \forall (x,a,x').
\]
Given the estimation of the occupancy measure space, it is possible to derive the following lemma.

Lemma G.2 (Jin et al., 2020). With probability at least $1-6\delta$, for any collection of transition functions $\{P_t^x\}_{x \in X}$ such that $P_t^x \in \mathcal{P}_t$, we have:
\[
\sum_{t=1}^T \sum_{x,a} \big|q^{P_t^x,\pi_t}(x,a) - q_t(x,a)\big| \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right).
\]
We underline that the constrained space defined by Program (2) is a subset of $\Delta(\mathcal{P}_t)$. This implies that, in Algorithm 2, it holds $\widehat{q}_t \in \Delta(\mathcal{P}_t)$, and Lemma G.2 is valid.

G.2. Auxiliary lemmas for the optimistic loss estimator

We make use of the optimistic biased estimator with implicit exploration factor (see (Neu, 2015)). Precisely, we define the loss estimator, for all $t \in [T]$, as:
\[
\widehat{\ell}_t(x,a) := \frac{\ell_t(x,a)}{u_t(x,a)+\gamma}\,\mathbb{1}_t\{x,a\} \qquad \forall (x,a) \in X \times A,
\]
where $u_t(x,a) := \max_{\widehat{P} \in \mathcal{P}_t} q^{\widehat{P},\pi_t}(x,a)$. Thus, the following lemmas hold.

Lemma G.3 (Jin et al., 2020). For any sequence of functions $\alpha_1, \ldots, \alpha_T$ such that $\alpha_t \in [0, 2\gamma]^{X \times A}$ is $\mathcal{F}_t$-measurable for all $t$, we have, with probability at least $1-\delta$:
\[
\sum_{t=1}^T \sum_{x,a} \alpha_t(x,a)\left(\widehat{\ell}_t(x,a) - \frac{q_t(x,a)}{u_t(x,a)}\,\ell_t(x,a)\right) \le L\ln\left(\frac{L}{\delta}\right).
\]
Following the analysis of Lemma G.3, with $\alpha_t(x,a) = 2\gamma\,\mathbb{1}_t(x,a)$ and a union bound, the following corollary holds.

Corollary G.4 (Jin et al., 2020). With probability at least $1-\delta$, for every $(x,a) \in X \times A$:
\[
\sum_{t=1}^T \left(\widehat{\ell}_t(x,a) - \frac{q_t(x,a)}{u_t(x,a)}\,\ell_t(x,a)\right) \le \frac{1}{2\gamma}\ln\left(\frac{|X||A|}{\delta}\right).
\]
Furthermore, when $\pi_t$ is the policy induced by $\widehat{q}_t$, the following lemma holds.

Lemma G.5 (Jin et al., 2020). With probability at least $1-7\delta$:
\[
\sum_{t=1}^T (\ell_t - \widehat{\ell}_t)^\top \widehat{q}_t \le O\left(L|X|\sqrt{|A|T\ln\left(\frac{T|X||A|}{\delta}\right)}\right).
\]
We notice that $\pi_t$ is induced by $\widehat{q}_t$ only in Algorithm 2, since, in Algorithm 3, the policy induced by $\widehat{q}_t$ is played only with probability $1-\lambda_{t-1}$.

G.3. Auxiliary lemmas for online mirror descent

We employ the following result for OMD (see Orabona (2019)) with uniform initialization over the estimated occupancy measure space.

Lemma G.6 (Jin et al., 2020). The OMD update with $\widehat{q}_1(x,a,x') = \frac{1}{|X_k||A||X_{k+1}|}$ for all $k < L$ and $(x,a,x') \in X_k \times A \times X_{k+1}$, and
\[
\widehat{q}_{t+1} = \arg\min_{q \in \Delta(\mathcal{P}_t)} \widehat{\ell}_t^{\,\top} q + \frac{1}{\eta}\, D(q \,\|\, \widehat{q}_t),
\]
where $D(q \,\|\, q') = \sum_{x,a,x'} q(x,a,x') \ln\frac{q(x,a,x')}{q'(x,a,x')} - \sum_{x,a,x'} \big(q(x,a,x') - q'(x,a,x')\big)$, ensures:
\[
\sum_{t=1}^T \widehat{\ell}_t^{\,\top} (\widehat{q}_t - q) \le \frac{L\ln(|X|^2|A|)}{\eta} + \eta \sum_{t,x,a} \widehat{q}_t(x,a)\,\widehat{\ell}_t(x,a)^2,
\]
for any $q \in \bigcap_t \Delta(\mathcal{P}_t)$, as long as $\widehat{\ell}_t(x,a) \ge 0$ for all $t, x, a$.
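As a final illustration, the sketch below pairs the implicit-exploration estimator of Appendix G.2 with the entropic mirror-descent step of Lemma G.6 on a flattened occupancy vector. It is a simplified stand-in under our own naming: the actual update minimizes over the polytope $\Delta(\mathcal{P}_t)$ characterized in G.1 (a convex program), whereas here the projection is a plain normalization, exact only when the polytope constraints are inactive.

```python
import numpy as np

def ix_loss_estimate(loss, visited, u, gamma):
    """Implicit-exploration estimator: loss(x,a) * 1{(x,a) visited} / (u + gamma)."""
    return np.where(visited, loss / (u + gamma), 0.0)

def omd_step(q, loss_hat, eta):
    """Entropic (unnormalized KL) mirror-descent step; the normalization
    stands in for the exact KL projection onto Delta(P_t)."""
    q_new = q * np.exp(-eta * loss_hat)   # exponential-weights style update
    return q_new / q_new.sum()

# toy usage over flattened (x, a) pairs
rng = np.random.default_rng(0)
dim, eta, gamma = 12, 0.1, 0.1
q = np.full(dim, 1.0 / dim)               # uniform initialization, as in Lemma G.6
for _ in range(5):
    loss = rng.random(dim)                 # adversarial losses (random stand-in)
    visited = rng.random(dim) < q * dim    # stand-in for one trajectory's feedback
    u = q                                  # stand-in for the upper occupancy bound u_t
    q = omd_step(q, ix_loss_estimate(loss, visited, u, gamma), eta)
```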