# truly_noregret_learning_in_constrained_mdps__341c2d63.pdf

Truly No-Regret Learning in Constrained MDPs

Adrian M uller 1 Pragnya Alatur 2 Volkan Cevher 1 Giorgia Ramponi 3 Niao He 2

Abstract Constrained Markov decision processes (CMDPs) are a common way to model safety constraints in reinforcement learning. State-of-the-art methods for efficiently solving CMDPs are based on primal-dual algorithms. For these algorithms, all currently known regret bounds allow for error cancellations one can compensate for a constraint violation in one round with a strict constraint satisfaction in another. This makes the online learning process unsafe since it only guarantees safety for the final (mixture) policy but not during learning. As Efroni et al. (2020) pointed out, it is an open question whether primal-dual algorithms can provably achieve sublinear regret if we do not allow error cancellations. In this paper, we give the first affirmative answer. We first generalize a result on last-iterate convergence of regularized primal-dual schemes to CMDPs with multiple constraints. Building upon this insight, we propose a model-based primal-dual algorithm to learn in an unknown CMDP. We prove that our algorithm achieves sublinear regret without error cancellations.

1. Introduction Classical reinforcement learning (RL, Sutton & Barto, 2018) aims to solve sequential decision-making problems under uncertainty. It involves learning a policy while interacting with an unknown Markov decision process (MDP, Bellman, 1957). However, in many real-world situations, RL algorithms need to solve the task while respecting certain safety constraints. For example, in autonomous driving and drone navigation, we must avoid collisions and adhere to traffic rules to ensure safe behavior (Brunke et al., 2022). Such safety requirements are commonly described by constrained Markov decision processes (CMDPs, Altman, 1999). In CMDPs, the goal is to maximize the expected cumulative reward while subject to multiple safety constraints, each

1EPFL 2ETH Z urich 3University of Z urich. Correspondence to: Adrian M uller <adrian.luis.mueller@gmail.com>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

modeled by a different expected cumulative reward signal that needs to lie above a respective threshold. We consider the finite-horizon setting, in which an algorithm chooses a policy in each episode, plays it for one episode, and observes the random transitions, rewards, and constraint rewards along its trajectory.

In the literature, there are three standard approaches for finding an optimal policy in a known CMDP: linear programming (LP, Altman, 1999), primal-dual (Paternain et al., 2022), and dual algorithms (Paternain et al., 2019). If the CMDP is unknown, a common approach to handle the uncertainty is the classical paradigm of optimism in the face of uncertainty (Auer et al., 2008). In their influential paper, Efroni et al. (2020) established comprehensive regret guarantees for all three types of optimistic algorithms in the online setup. In practice, especially primal-dual algorithms are preferred due to their high computational efficiency and flexibility for policy parameterization, thereby scaling to high-dimensional problems (Chow et al., 2017; Achiam et al., 2017; Tessler et al., 2018). Thus, it is important to rigorously understand the fundamental properties of this algorithm class. Indeed, there has been a large number of studies on primal-dual (and dual) approaches for CMDPs (Ding et al., 2020; 2022b; Liu et al., 2021b; Ding & Jovanovi c, 2022; Ghosh et al., 2022; Ding & Lavaei, 2022; Qiu et al., 2020; Liu et al., 2021a; Bai et al., 2022, to list just a few) since the work of Efroni et al. (2020).

However, unlike for LP-based algorithms, the known bounds for primal-dual (and dual) algorithms suffer from the fundamental limitation pointed out by Efroni et al. (2020, Section 2.2): they concern a weaker, less safe notion of regret. More precisely, the known guarantees bound the sum of the suboptimalities and the sum of the constraint violations across episodes, where one episode corresponds to one round of learning. However, a policy can have a negative constraint violation (by being very safe but obtaining a lower return than an optimal safe policy) or a positive constraint violation (by being unsafe but obtaining a higher return than an optimal safe policy). Thus, terms from these two cases can cancel each other out when summing the violations across episodes, a phenomenon referred to as error cancellations (Efroni et al., 2020). An algorithm with sublinear weak regret may heavily violate safety constraints during learning. For example, if the policies alternate between the two cases

Truly No-Regret Learning in Constrained MDPs

above in every other episode, the algorithm may even obtain zero regret despite being unsafe every second episode. This weak notion of regret falls short of capturing safety in a setup with no simulator and where the algorithm must adhere to constraints during learning. In fact, these cancellations are not a weakness in the analysis but rather due to oscillations of the underlying optimization method, which converges on average but not in the last-iterate (Efroni et al., 2020; Beck, 2017). Indeed, these oscillations are observed in practice (Stooke et al., 2020; Moskovitz et al., 2023).

We thus consider a stronger notion of regret that concerns the sum of the positive parts of the error terms instead. This regret does not allow for error cancellations, and we refer to it as strong regret. The results of this paper address the research question pointed out by Efroni et al. (2020):

Can we design an efficient primal-dual algorithm that

achieves sublinear strong regret in an unknown CMDP?

We provide the first affirmative answer for tabular finitehorizon CMDPs. Specifically, we introduce a regularization framework inspired by the recent work of Ding et al. (2023) and derive guarantees in the online setup for a primal-dual algorithm that arises from this formulation.

Contributions Our main contributions are the following:

We first prove non-asymptotic policy last-iterate convergence (Definition 4.1) of a regularized primal-dual scheme for CMDPs despite the inherent non-concavity, assuming access to a value function oracle (Section 4). Our guarantee generalizes previous results for the strictly easier problem of CMDPs with only a single constraint. This is the first analysis that establishes last-iterate convergence of primal-dual algorithms in arbitrary CMDPs.

Combining this regularized primal-dual scheme with optimistic exploration, we propose an improved modelbased primal-dual algorithm (Algorithm 1) for online learning in CMDPs (Section 5). Our algorithm requires no prior knowledge of the CMDP and maintains valueoptimism for the regularized problem.

Finally, we establish that our algorithm achieves sublinear strong regret when learning an unknown CMDP (Section 5.2). This is the first primal-dual algorithm achieving a sublinear regret guarantee without allowing error cancellations, providing the first answer to the open question posed by Efroni et al. (2020).

The latter is relevant due to the efficiency and practical importance of primal-dual algorithms, which are often preferred over LP-based algorithms in large-scale applications. Additionally, we provide numerical evaluations of our algorithm in simple environments. We illustrate that it exhibits

sublinear regret when safety during learning is concerned, while unregularized algorithms do not. We conclude that error cancellations are not merely a hypothetical issue of existing algorithms but bear practical relevance.

1.1. Related Work Since Efroni et al. (2020) analyzed the vanilla primal-dual (and dual) algorithm, their analysis has been extended in various works, both for the case of an unknown or known CMDP (Ding et al., 2020; 2022b; Liu et al., 2021b; Ding & Jovanovi c, 2022; Ghosh et al., 2022; Ding & Lavaei, 2022; Qiu et al., 2020; Liu et al., 2021a; Bai et al., 2022). As Calvo-Fullana et al. (2023) pointed out, even the works assuming full knowledge of the CMDP only establish convergence of the averaged iterates. Hence, none of the mentioned works provides a guarantee for the strong regret or is easily amendable to obtain one.

Very recently, Ding et al. (2023) were the first to provide a last-iterate convergence analysis for a primal-dual algorithm in a known discounted infinite-horizon CMDP closely related to our algorithm. However, their analysis is limited to the case of a single constraint, which is non-trivial to generalize to multiple constraints (Section 4). Moreover, the authors left it as an open question whether the algorithm can be generalized to achieve last-iterate convergence in the online setup, when the CMDP is unknown (Section 5). In addition, our analysis holds for an algorithm with closed-form updates (Equations (6) and (7)), while their algorithm involves Bregman projections for technical reasons (Lemma 4.1). Prior, Moskovitz et al. (2023) showed last-iterate convergence of a primal-dual scheme, but their analysis concerns a hypothetical algorithm whose implicit updates do not allow efficient implementation. Li et al. (2021) provided a dual (not primal-dual) algorithm based on regularization like ours but only proved convergence for a history-weighted mixture policy1 in a known CMDP. Similarly, Ying et al. (2022) derived a dual algorithm with last-iterate convergence but left it open whether an online version is possible. M uller et al. (2023) were the first to prove a sublinear regret guarantee without error cancellations for a dual algorithm in the online setup. However, their algorithm, which is based on the augmented Lagrangian method, lacks the desired computational efficiency. We refer to Appendix B for further comparison with prior results.

2. Problem Formulation Notation For n N, we use [n] to refer to the set of integers {1, . . . , n}. For a finite set X, we denote the probability simplex over X as (X) = {v [0, 1]X| P

x X vx = 1}. For a R, we set [a]+ := max{0, a} to be the positive

1By mixture policy we refer to a policy randomly drawn from all policy iterates.

Truly No-Regret Learning in Constrained MDPs

part of a. b denotes the ℓ2-norm of a vector b Rn. O-notation refers to asymptotics up to poly-logarithmic factors.

Constrained MDPs A finite-horizon CMDP with state and action spaces S, A (with finite cardinalities S and A) and horizon H > 0 is defined by a tuple M = (S, A, H, p, r, u, c). Every episode consists of H steps and starts from an initial state s1 S.2 At every step h, ph(s |s, a) denotes the probability of transitioning to state s

if the current state and action are s and a. Moreover, rh : S A [0, 1], (s, a) 7 rh(s, a) denotes the reward function at step h [H]. Similarly, uh : S A [0, 1]I, (s, a) 7 uh(s, a) = (u1,h(s, a), . . . , u I,h(s, a))T [0, 1]I refers to the I constraint reward functions, and c [0, H]I are the respective thresholds ci for the i-th constraint (i [I]).

The algorithm interacts with the CMDP by playing a policy π Π, where

Π := (π1, . . . , πH) h s S : πh( |s) (A) .

For any π Π, we consider the Markov process given by ah πh( |sh), sh+1 ph( |sh, ah) for h = 1, . . . , H. For any function r : [H] S A R, (h, s, a) 7 r h(s, a), every (s, h) S [H] and π Π, consider the value functions

V π r ,h(s) :=Eπ

h =h r h (sh , ah ) sh = s

Qπ r ,h(s, a) :=Eπ

h =h r h (sh , ah ) sh = s, ah = a

For notational convenience, we drop the indices for the step and state if we refer to h = 1 and s1 and write V π r = V π r ,1(s1).

In the CMDP setting, we are interested in solving the following optimization problem:

max π Π V π r s.t. V π ui ci ( i [I]), (1)

and we fix an optimal solution π Π for Equation (1). Among all policies that are feasible with respect to the I safety constraints V π ui ci, the goal is to find one that maximizes V π r .

We consider the stochastic reward setting, in which the algorithm observes rewards sampled from random variables Rh(s, a) [0, 1] and Uh(s, a) [0, 1]I such that E[Rh(s, a)] = rh(s, a) and E[Ui,h(s, a)] = ui,h(s, a) for all i [I] when taking action a in state s at step h. Throughout, we make the following assumption, which is standard in

2It is straightforward to extend this to any initial distribution µ.

the context of CMDPs (Altman, 1999; Efroni et al., 2020; Li et al., 2021; Ying et al., 2022; Ding et al., 2022c; Paternain et al., 2022; Ding et al., 2023).

Assumption 2.1 (Slater policy). There exists π Π and ξ RI >0 such that V π ui ci + ξi for all i [I]. Set the Slater gap

Ξ := min i [I] ξi.

This assumption asserts that there exists a policy that strictly satisfies the constraints.

Problem Formulation The algorithm interacts with the unknown CMDP over a fixed number of K > 0 episodes. Prior to every episode k [K], the algorithm selects a policy πk Π and plays it for one run of the CMDP. The goal is to simultaneously minimize its two strong regrets:

R(K; r) := X

h V π r V πk r i

+ , (Objective)

R(K; u) := max i [I]

+ . (Constraints)

Only when a policy has a suboptimal objective or violates the constraints, this counts to the respective regret. All existing works on primal-dual (and dual) algorithms (e.g., Liu et al., 2021a; Efroni et al., 2020; Bai et al., 2022; Ding et al., 2022a;c) only prove sublinear guarantees on a weaker notion:

Rweak(K; r) := X

V π r V πk r ,

Rweak(K; u) := max i [I]

ci V πk ui .

The weak regrets allow for the aforementioned error cancellations as positive and negative terms count toward each of the regrets. Even if they are sublinear in K (in fact, even if they are zero), the algorithm may continue compensating for a constraint violation in one episode with strict constraint satisfaction in another. On the other hand, a sublinear bound on the stronger notion of regret guarantees that the algorithm achieves a low constraint violation in most episodes (see Section 5.3). This is crucial for many practical applications where we do not have access to a simulator, but we have to learn our optimal policy in an online fashion. In the example of navigating an autonomous vehicle or drone, one would want to avoid crossing the boundaries of a specified track in each episode during learning. It is not helpful to compensate for crashing the vehicle into a wall by driving overly safely in the next episode. However, from a theoretical perspective, it is strictly more challenging to provide a guarantee for the strong regret than for the weaker notion.3

3For practical purposes, one may consider the strong regret only

Truly No-Regret Learning in Constrained MDPs

3. Primal-Dual Scheme Vanilla Scheme Primal-dual and dual algorithms arise from the equivalent Lagrangian formulation (Altman, 1999) of Equation (1):

max π Π min λ RI 0 L(π, λ), (2)

L(π, λ) := V π r + X

i [I] λi(V π ui ci) = V π r+λT (u H 1c)

is the Lagrangian. Paternain et al. (2019) showed that CMDPs exhibit strong duality, by which Equation (2) is equivalent to finding a saddle point (π , λ ) of the Lagrangian. Primal-dual algorithms solve this saddle point problem via iterated play between two no-regret dynamics for π and λ. Typically, as considered by Efroni et al. (2020) in the regret minimization setting,

πk+1,h(a|s) πk,h(a|s) exp ηQπk r+λT k u,h(s, a) , (3)

λk+1 =projΛ (λk η(V πk u c)) , (4)

where projΛ refers to the projection onto a predefined Λ = [0, λmax]I, which amounts to truncating the coordinates. We refer to Equations (3) and (4) as vanilla primal-dual scheme. The mixture policy of the iterates is guaranteed to converge to an optimal solution pair of the min-max problem. However, the last iterate is not guaranteed to converge. Instead, the method oscillates around an optimal solution, which results in the weak regret bounds of previous primal-dual algorithms (Section 6).

Regularized Scheme The key idea of the regularization is to induce strict concavity in the primal variable (to be precise, in the state-action occupancy measure dπ h(s, a) := Pπ[sh = s, ah = a] and not in the policy) and strong convexity in the dual variable λ. This enables us to establish convergence to the unique solution of the regularized problem. We then show how to retrieve an error bound for the original, unregularized problem by carefully choosing the amount of regularization.

For τ > 0, we define the regularized Lagrangian Lτ : Π RI R as

Lτ(π, λ) := L(π, λ) + τ H(π) + 1

where H(π) := Eπ[PH h=1 log(πh(ah|sh))] is the entropy of a policy π. Then, consider the following regularized CMDP problem:

max π Π min λ Λ Lτ(π, λ), (5)

for the constraint violations and the weak one for the objective. We refer to Appendix G for a discussion of the differences. However, this relaxation does not improve our theoretical results.

The domain of the dual variable λ is now a compact set Λ := [0, λmax]I, with λmax HΞ 1 to be specified (crucially, we will choose it depending on the number of episodes K). Thanks to strong duality of the unregularized problem, any saddle point (π , λ ) of L satisfies λ HΞ 1 (e.g., (Ying et al., 2022) for infinite horizon), which will allow us to constrain the dual variable as above. We denote the regularized primal and dual optimizers as follows:

π τ = arg max π Π min λ Λ Lτ(π, λ),

λ τ = arg min λ Λ max π Π Lτ(π, λ).

Regularization preserves strong duality (Appendix C), by which we are equivalently looking for a saddle point (π τ, λ τ) of the regularized Lagrangian Lτ. Ding et al. (2023) proposed to perform the ascent-descent scheme in Equations (3) and (4) on Lτ rather than L in the discounted infinite-horizon setting given a value function oracle:

πk+1,h(a|s) πk,h(a|s) exp ηQπk r+λT k u+τψk,h(s, a) ,

λk+1 =projΛ ((1 ητ)λk η(V πk u c)) , (7)

where ψk,h(s, a) := log(πk,h(a|s)). We refer to Equations (6) and (7) as regularized primal-dual scheme. In fact, our scheme above is a simplification of Ding et al. (2023) s algorithm, since their policy update would read πk+1,h( |s) = arg maxπh( |s) b (A) πh( |s), Qπk r+λT k u+τψk,h(s, )

1 ηKL(πh( |s)||πk,h( |s)), where b (A) := {πh( |s) (A) | a A: πh(a|s) ε0/A} for some ε0 > 0 is a restricted probability simplex for technical reasons stemming from the analysis. While this update can be performed via Bregman projections (Orabona, 2019) of the KL divergence onto b (A), this requires solving a convex program at every iteration k of the scheme. In contrast, our scheme admits a closed form of the policy update in Equation (6) due to our modified analysis (see discussion of Lemma 4.1).

4. Last-iterate Convergence In this section, we prove last-iterate convergence of the regularized primal-dual scheme (Equations (6) and (7)) with an exact value function oracle (e.g., via policy evaluation if the true model is known) for an arbitrary number of constraints. We define last-iterate convergence as follows.

Definition 4.1 (Last-iterate convergence). A method producing policy iterates πk Π (k = 1, 2, . . . ) is last-iterate convergent if

V π r V πk r 0 and [ci V πk ui ]+ 0 ( i [I])

Truly No-Regret Learning in Constrained MDPs

The main technical challenges we overcome to show lastiterate convergence are: (a) to prove ascent properties for the primal update (Equation (6)), which optimizes a nonconcave objective with surrogate gradients Qπk r+λT k u+τψk,h(s, a) that are unbounded in general; and (b) to bound all unregularized constraint violations of the last iterate πk in the presence of more than one constraint. We provide all proofs for this section in Appendix D.

Regularized Optimizers Our first step is to show that the iterates (πk, λk) converge to the regularized optimizers (π τ, λ τ). Indeed, we formalize this by showing that the potential function

s,h Pπ τ [sh = s]KLk,h(s) + 1

approaches zero, if we choose the regularization parameter τ and the step size η sufficiently small. Here, KLk,h(s) := KL(π τ,h( |s), πk,h( |s)) refers to the Kullback-Leibler divergence between the optimal and the k-th policy, and Pπ τ refers to the probability distribution under policy π τ.

Lemma 4.1 (Regularized convergence). Let η, τ < 1 and λmax HΞ 1. The iterates in Equations (6) and (7) satisfy

Φk+1 (1 ητ)kΦ1 + O ητ 1Cη,τ,Λ ,

Cη,τ,Λ =λ2 max H3A1/2I2 exp (ηH (1 + λmax I + log(A)))

+ I (H + τλmax)2 .

Despite the exponential term, we can control the factor Cη,τ,Λ to be constant of order poly(A, H, I, Ξ 1) by choosing η < (Hλmax I log(A)) 1. For the remaining part, η and τ need to be traded off to have fast linear convergence (1 ητ)kΦ1 and a small bias term ητ 1Cη,τ,Λ simultaneously. Ding et al. (2023) showed a similar result with a different constant Cη,τ for their update rule that constrains the policies to the restricted probability simplex b (A) := {πh( |s) (A) | a A: πh(a|s) ε0/A} by solving a convex problem in every iteration. They introduce this restriction as their proof requires a uniform bound of Qπk r+λT k u+τψk,h(s, a) and thus of τψk,h = τ log(πk,h(s, a)), which may be unbounded outside of b (A). Our modified proof overcomes this challenge by leveraging a mirror descent (MD) lemma with local norms rather than the standard online MD lemma (Orabona, 2019). While the standard norm of the regularized Q-values may be unbounded outside of b (A), we are able to bound their local norms to arrive at Lemma 4.1, even though our policy updates (Equation (6)) are not restricted to b (A) and thus closed-form. We refer to Appendix D for the proof.

Unregularized Error Bounds While the bound in Lemma 4.1 depends on the choice of η < 1, τ < 1 and λmax HΞ 1, we show that it is possible though not obvious to choose them (depending on the desired approximation) such that Φk decays to zero. Prior to this, we show that this will allow us to upper-bound both the constraint violation and the objective suboptimality in the original problem.

Lemma 4.2 (Error bounds). For any sequence (πk)k [K], h V π r V πk r i

+ H3/2(2Φk)1/2 + τH log(A),

+ H3/2(2Φk)1/2 + τλmax

+ λ 1 max H2Ξ 1 + τH log(A) .

A similar result was provided by Ding et al. (2023) for the case of a single constraint (I = 1). Generalizing this is technically challenging as the standard way of showing that approximate saddle points have small constraint violation (Beck, 2017, Theorem 3.60) does not apply in the case of regularized saddle points. Simultaneously, the technique of Ding et al. (2023, Corollary 1) leverages the fact that only one constraint is present. We overcome this by choosing the domain Λ = [0, λmax]I larger than standard primaldual algorithms, making it possible to extract bounds on the individual constraint violations from the approximate saddle points. See Appendix D for the proof. This novel approach yields the rather uncommon inverse dependency on the diameter of Λ in Lemma 4.2, which needs to be chosen such that both terms τλmax and λ 1 max are O(ε) to obtain an ε-close solution.

Lemma 4.2 tells us that we can bound the objective suboptimality and all constraint violations by controlling the terms (Φk)1/2 via Lemma 4.1, and remaining the terms by appropriately choosing the regularization and domain diameter, in terms of ε.

Last-iterate Convergence We are now ready to establish last-iterate convergence to the unregularized optimal policy of the regularized primal-dual scheme.

Theorem 4.1 (Last-iterate convergence). Let ε (0, 1). Then, with appropriate choices of η ε6, τ ε2, λmax ε 1, we have h V π r V πk r i

+ ε, ci V πk ui

+ ε ( i [I])

for k = Ω(poly(A, H, I, Ξ 1) ε 10).

Here, we only highlight the explicit dependency on the desired approximation. The dependency on the CMDP size is (low-degree) polynomial and detailed in Appendix D. The only problem-dependent constant in this bound is the Slater gap Ξ, which is shared by all primal-dual analyses of our

Truly No-Regret Learning in Constrained MDPs

knowledge. While the provided rate is slow and may be improved in the future, all other known rates of primal-dual algorithms for CMDPs with arbitrary constraints (I > 1) only hold for the averaged and not the last iterate. More importantly, the technique leading to this result will allow us to achieve sublinear strong regret in the following section.

5. Online Setup Recall the regularized primal-dual scheme from Equations (6) and (7). In our online learning setup, the true value functions are not known as we are learning the unknown CMDP. Thus, we are required to explore the CMDP and respect safety during exploration. Replacing the value functions by optimistic estimates (Shani et al., 2020; Auer et al., 2008) allows us to turn the primal-dual scheme into an online learning algorithm for finite-horizon CMDPs (see Algorithm 1). Importantly, we need to be optimistic with respect to the regularization term τH(π) too, rather than just the classical mixture value V π r+λT k u. The main technical challenge is to incorporate the model uncertainty into our primal-dual analysis from Section 4.

5.1. Optimistic Model For all s, a, h and k [K], let nk 1,h(s, a) := Pk 1 l=1 1{sl h=s, al h=a} count the number of times that the state-action pair (s, a) has been visited at step h before episode k. Here, (sl h, al h) denotes the state-action pair visited at step h in episode l. First, we compute the empirical averages of the reward and transition probabilities as follows:

rk 1,h(s, a) :=

Pk 1 l=1 Rl h(s, a)1{sl h=s, al h=a} nk 1,h(s, a) 1 ,

uk 1,i,h(s, a) :=

Pk 1 l=1 U l i,h(s, a)1{sl h=s, al h=a} nk 1,h(s, a) 1 , (8)

pk 1,h(s |s, a) :=

Pk 1 l=1 1{sl h=s, al h=a, sl h+1=s } nk 1,h(s, a) 1 ,

where a b := max{a, b} and 1A is the indicator function of an event A. We consider optimistic estimates

ˆrk,h(s, a) := rk 1,h(s, a) + bk 1,h(s, a),

ˆuk,i,h(s, a) := uk 1,i,h(s, a) + bk 1,h(s, a), (9) ˆψk,h(s, a) := log(πk,h(a|s)) + bp k 1,h(s, a) log(A),

ˆpk,h(s |s, a) := pk 1,h(s |s, a),

where bk 1,h(s, a) = br k 1,h(s, a) + bp k 1,h(s, a), and for any δ (0, 1), we specify the correct values for

br k 1,h(s, a) =O

log (SAHIKδ 1)

nk 1,h(s, a) 1

bp k 1,h(s, a) =O

S + log (SAHKδ 1)

nk 1,h(s, a) 1

in Appendix E to obtain our regret guarantees with probability at least 1 δ. The optimistic model guarantees that, with high probability, the obtained value functions overestimate the true ones and simultaneously allows us to control the estimation error. While optimistic exploration is standard, we here also take the entropy term in the objective into account via ˆψk. Let

ˆzk := ˆrk + λT k ˆuk + τ ˆψk (10)

be the optimistic reward function mimicking the πdependency of the regularized Lagrangian at (πk, λk). Consider the truncated value functions

(h, s, a) 7 ˆQk ˆzk,h(s, a), ˆV k ˆuk = ˆV k ˆuk,1(s1)

that we compute via truncated policy evaluation (by dynamic programming) of πk with respect to the optimistic model. We refer to Algorithm 2 in Appendix E, where we also establish the relevant properties of the model.

Algorithm Combining the truncated policy estimation under our learned model with the regularized primal-dual scheme (Equations (6) and (7)) yields Algorithm 1. The computational cost of the algorithm amounts to evaluating a policy O(I) times per episode, which matches the complexity of standard primal-dual algorithms and is more efficient than running dual or LP-based algorithms. Projecting onto Λ is immediate since Λ is a product of intervals.

5.2. Regret Analysis We now provide the key steps of our regret analysis, showing that Algorithm 1 indeed achieves sublinear strong regret for both the constraint violations and the objective. We defer all proofs for this section to Appendix F.

Lemma 5.1 (Regularized convergence). Let η, τ < 1 and λmax HΞ 1. With probability at least 1 δ, the iterates of Algorithm 1 satisfy

Φk+1 (1 ητ)kΦ1 + O ητ 1Cη,τ,Λ

+ ηλmax ISA1/2H2k1/2 + IS3/2AH2 ,

where Cη,τ,Λ is the same constant as in Lemma 4.1.

Truly No-Regret Learning in Constrained MDPs

Algorithm 1 Regularized Primal-Dual Algorithm with Optimistic Exploration Require: Λ = [0, λmax]I, stepsize η > 0, regularization parameter τ > 0, number of episodes K, initial policy π1,h(a|s) = 1/A ( s, a, h), λ1 := 0 RI

for k = 1, . . . , K do

Update ˆrk, ˆuk, ˆpk, ˆψk via Equation (9)

Truncated policy evaluation (Algorithm 2) w.r.t. ˆzk (Equation (10)) and ˆuk:

ˆQk ˆzk( ), ˆV k ˆuk :=EVAL(πk, λk, ˆrk, ˆuk, ˆψk, ˆpk).

Update primal variables for all h, s, a:

πk+1,h(a|s) πk,h(a|s) exp η ˆQk h,ˆzk(s, a)

Update dual variables:

λk+1 =projΛ (1 ητ)λk η( ˆV k ˆuk c) .

Play πk for one episode, update rk, uk, gk, pk via Equation (8) end for

Here, we use O-notation for asymptotics up to polylogarithmic factors in S, A, H, I, K, Ξ 1, and δ 1. This result is similar to our Lemma 4.1, but now we obtain an additional term corresponding to the model uncertainty (estimation error), which we control when choosing the step size η.

Regret Bound In a final step, we can leverage Lemma 4.2 to turn Lemma 5.1 into a sublinear regret bound for Algorithm 1, when summing up the error terms and choosing η, τ, and λmax HΞ 1 optimally depending on K given our bounds. This yields our main result.

Theorem 5.1 (Regret bound). Let τ = K 1/7, η = (H2I) 1ΞK 5/7, λmax = HΞ 1K1/14. Then with probability at least 1 δ, Algorithm 1 obtains a strong regret of

R(K; r) Cr K0.93, R(K; u) Cu K0.93,

where Cr, Cu = poly(S, A, H, I, Ξ 1, log(1/δ), log(K)) and K is the number of episodes.

Here, we only highlight the leading term in K. The dependency on the CMDP parameters is (low-degree) polynomial and detailed in Appendix F. Again, Ξ is the only problemdependent constant (and unavoidable).

Remark 5.1. We remark that our proof of Theorem 5.1, in fact, shows last-iterate convergence in the online setup, which is strictly stronger than a regret bound in general.

Our strong regret bound of O(K0.93) is less tight than the O(K1/2) that the vanilla primal-dual algorithm achieves for the weak regret, for which there exist well-known lower bounds (Jin et al., 2018; Domingues et al., 2021). Nevertheless, Algorithm 1 is the first primal-dual algorithm for CMDPs provably achieving sublinear strong regret. It is thus the first algorithm of its kind for which we can guarantee that it cannot keep violating constraints indefinitely. While LP-based approaches achieve strong regret of O(K1/2), most modern (deep) safe RL algorithms for CMDPs follow primal-dual schemes (Chow et al., 2017; Tessler et al., 2018; Stooke et al., 2020). We believe that it might be possible to tighten our analysis, although this will require novel ideas. Indeed, our numerical evaluations show that the parameter choices in Theorem 5.1 are overly pessimistic.

5.3. Strong vs. Weak Regret and Safety at any Time We allude to the strong regret several times by saying that a sublinear bound guarantees safety during learning or in most episodes. As our algorithm does not guarantee safety in every episode, one may wonder in which sense safety during learning is formally guaranteed by the strong regret compared to the weak one. Indeed, this is an important theme in CMDPs. In an unknown CMDP, there is no way to explore it without constraint violations unless further assumptions are made (as the constraint rewards ui and transitions P are unknown, we cannot know that an action is unsafe without trying at least once). However, this is a limitation of the CMDP model with exploration rather than our algorithm. We can thus only argue about safety in most episodes.

Approximate Safety in Most Episodes Unlike any previous primal-dual algorithm, our method guarantees that for any fixed ε > 0, the fraction of episodes in which our policy is not ε-safe vanishes to 0 as the number of episodes K grows. Not being ε-safe here means to violate at least one constraint by at least ε. We make this formal in the following remark.

Remark 5.2. Fix ε > 0 and suppose R(K; u) O(Kα) for some α (0, 1). Then there exist at most O(Kα/ε) episodes with a constraint violation of at least ε. In other words, only a small fraction O(Kα 1/ε) = o(1) of the iterates is not ε-safe. In comparison, this is by no means guaranteed by a sublinear bound on Rweak(K; u).

Hence, our algorithm is approximately safe in most episodes, while being safe in every episode is not possible by design. This is a remarkable result since previous works on primaldual algorithms can only guarantee safety of the average policy and not in most of the episodes (such algorithms can be fully unsafe in, e.g., half of the episodes).

Strict Safety in Most Episodes Furthermore, ε-safety can be strengthened by a simple reduction to ensure strict safety in most episodes. This is possible by increasing the true

Truly No-Regret Learning in Constrained MDPs

constraint thresholds by a small shift of ε = O(K(α 1)/2) to be more conservative and applying our results. For the formal details of this reduction, see, e.g., Appendix C.4 in Ding et al. (2023). We thus established that the fraction of unsafe episodes is vanishing (in terms of K).

6. Simulation We perform numerical simulations of our algorithms and compare them to their unregularized counterparts (Efroni et al., 2020). We find that the vanilla primal-dual and dual algorithms can suffer linear strong regret while our regularized counterparts do not, illustrating that error cancellations are not merely a hypothetical issue. We provide further details in Appendix G.

6.1. Baselines and Environment We compare our regularized primal-dual algorithm (Algorithm 1) to the vanilla primal-dual algorithm of Efroni et al. (2020), which corresponds to Equations (3) and (4) with optimistic exploration. We also include the vanilla dual algorithm of Efroni et al. (2020) as a baseline and our regularized dual algorithm (below), which arises from the same regularization framework as Algorithm 1. We test each algorithm for the same total number (6) of hyperparameter configurations and report the best results for each.

Dual Algorithm Leveraging the framework we introduced, it is immediate to also derive a dual algorithm for finitehorizon CMDPs. Dual algorithms amount to performing projected dual descent (Beck, 2017; Paternain et al., 2019) on the Lagrangian, where one can again use the optimistic model to estimate the unknown CMDP. Efroni et al. (2020) proved that this algorithm achieves a sublinear weak regret. Instead, we perform dual descent on the regularized Lagrangian Lτ. Explicitly,

πk = arg max π Π

V ˆpk,π ˆrk+λT k ˆuk + τ ˆHk(π) , (11)

λk+1 = projΛ (1 ητ)λk η(V ˆpk,πk ˆuk c) , (12)

where V ˆpk,π refers to the value functions under transition model ˆpk. These updates are similar to the ones proposed by Li et al. (2021); Ying et al. (2022), yet both assume a value function oracle. We can compute the first update via regularized dynamic programming, and the second one is the same as before. The dual approach has a higher computational complexity as the primal update requires a planning subroutine rather than just policy evaluation, but shows similar numerical performance. See Appendix G.3 for the full description of the regularized dual algorithm.

Environment We consider a randomly generated CMDP with deterministic rewards and unknown transitions. We draw the reward function r, constraint thresholds c, and transitions p uniformly at random. In order for oscillations (and

0 1,000 2,000 3,000 4,000

Iteration k

Constraint violation

0 1,000 2,000 3,000 4,000

Iteration k

V π r V πk r

Suboptimality

Regularized Primal-Dual (ours) Vanilla Primal-Dual

Figure 1. Constraint violation and objective suboptimality of the vanilla primal-dual algorithm (Efroni et al., 2020, cf. Equations (3) and (4)) and our regularized version (Algorithm 1). We present the values of the individual policies in each episode while learning the CMDP.

thus error cancellations) to occur, the objective must be conflicting with the constraints (Moskovitz et al., 2023), as they can otherwise easily be satisfied. However, by concentration of measure, two random vectors in high dimension are nearly orthogonal with high probability (Blum et al., 2020). Uniformly sampling the constraints would thus not yield interesting CMDPs, which is why we invoke a negative correlation between reward and constraint function. We sample the constraint function as (1 r) + βζ, where ζ RHSA

is Gaussian with zero mean and identity covariance matrix. We consider S = A = H = 5, β = 0.1, and focus on the case of one constraint for visualization purposes. We refer to Appendix G for further details.

6.2. Results The constraint violation and suboptimality of the iterates in each episode show the oscillatory behavior of the vanilla primal-dual algorithm as opposed to ours (Figure 1). While the on-average errors across episodes are sublinear, the vanilla algorithm keeps violating the constraints indefinitely as the number of episodes grows. In comparison, the oscillations of the regularized method are dampened, thus allowing it to converge to an optimal safe policy.

With respect to the weak regret, the vanilla algorithms perform better (Figures 2b and 3b, even constant for the suboptimality). However, with respect to the strong regret, the regularized algorithms outperform the unregularized ones, as they achieve sublinear regret without allowing for error cancellations (Figures 2a and 3a). While the strong regrets for the vanilla algorithms may look sublinear, a second look at their iterates (Figure 1) reveals that their regret will indeed grow linearly due to the persisting oscillations. This confirms our key point that a sublinear bound on the weak regret is not informative whenever we do not allow compensating for an unsafe episode with a safe one.

Truly No-Regret Learning in Constrained MDPs

(a) Strong regrets

0 1,000 2,000 3,000 4,000 0

Constraint violation

0 1,000 2,000 3,000 4,000 0

Suboptimality

Regularized Primal-Dual (ours) Vanilla Primal-Dual

(b) Weak regrets

0 1,000 2,000 3,000 4,000

Rweak(k; u)

Constraint violation

0 1,000 2,000 3,000 4,000 60

Rweak(k; r)

Suboptimality

Regularized Primal-Dual (ours) Vanilla Primal-Dual

Figure 2. Vanilla primal-dual algorithm (Efroni et al., 2020, cf. Equations (3) and (4)) and our regularized version (Algorithm 1). Figure 2a shows the strong regret; Figure 2b shows the weak regret. The weak regret regarding the objective can be negative, illustrating that the iterates are superoptimal but unsafe on average. Y-axes differ across plots. All results are averaged over n = 5 independent runs, with plotted confidence intervals.

(a) Strong regrets

0 1,000 2,000 3,000 4,000 0

Constraint violation

0 1,000 2,000 3,000 4,000 0

Suboptimality

Regularized Dual (ours) Vanilla Dual

(b) Weak regrets

0 1,000 2,000 3,000 4,000

Rweak(k; u)

Constraint violation

0 1,000 2,000 3,000 4,000

Rweak(k; r)

Suboptimality

Regularized Dual (ours) Vanilla Dual

Figure 3. Vanilla dual algorithm (Efroni et al., 2020) and our regularized version (Equations (11) and (12)). Figure 3a shows the strong regret; Figure 3b shows the weak regret. Y-axes differ across plots. All results are averaged over n = 5 independent runs, with plotted confidence intervals.

The vanilla algorithms will suffer linear strong regret even with a potentially better learning rate scheduling. We observed that the learning rate influences the oscillation frequency: With a larger learning rate, the vanilla methods oscillate faster. However, changing the learning rate does not dampen the oscillation magnitude. Hence, the strong regret is still linear. Indeed, we observe a change of magnitude only via the regularization parameter rather than the learning rate.

Comparison with Guarantee With the theoretically derived stepsize η, regularization τ, and exploration from Theorem 5.1, we need many episodes to observe a benefit, due to the slowly vanishing gap between regularized and unregularized problem. Setting hyperparameters empirically, we observe a better regret than the theory suggests. Therefore, the plots in this section refer to the empirical choice.

7. Conclusion In this paper, we gave the first answer to the open question of Efroni et al. (2020) whether primal-dual algorithms can achieve sublinear strong regret in finite-horizon CMDPs. While our answer is affirmative, it remains open in how far it is possible to lower the gap to the desired O(K1/2) regret bound. We hope that our first analysis inspires further research on truly no-regret learning in CMDPs, including improvements in the analysis of our algorithm, incorporating function approximation, algorithms for the infinite-horizon average reward setup, and showing provable benefits of related approaches such as optimistic gradients.

Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Truly No-Regret Learning in Constrained MDPs

Acknowledgements We thank the anonymous reviewers for their valuable feedback. P.A. is supported by the AI Center and ETH Foundations of Data Science (ETH-FDS) initiative. V.C. is supported by the Hasler Foundation Program: Hasler Responsible AI (project number 21043). V.C. is supported by the Swiss National Science Foundation (SNSF) under grant number 200021 205011. N.H. is supported by Swiss National Science Foundation Project Funding No. 200021207343 and SNSF Starting Grant.

Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In International conference on machine learning, pp. 22 31. PMLR, 2017.

Agrawal, S. and Devanur, N. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.

Altman, E. Constrained Markov decision processes, volume 7. CRC press, 1999.

Amani, S., Alizadeh, M., and Thrampoulidis, C. Linear stochastic bandits under safety constraints. Advances in Neural Information Processing Systems, 32, 2019.

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21, 2008.

Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. Journal of the ACM (JACM), 65(3):1 55, 2018.

Bai, Q., Bedi, A. S., Agarwal, M., Koppel, A., and Aggarwal, V. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 3682 3689, 2022.

Beck, A. First-order methods in optimization. SIAM, 2017.

Bellman, R. A markovian decision process. Journal of Mathematics and Mechanics, 6(5):679 684, 1957. ISSN 00959057, 19435274. URL http://www.jstor. org/stable/24900506.

Bertsekas, D., Nedic, A., and Ozdaglar, A. Convex analysis and optimization, volume 1. Athena Scientific, 2003.

Blum, A., Hopcroft, J., and Kannan, R. Foundations of data science. Cambridge University Press, 2020.

Borkar, V. S. A convex analytic approach to markov decision processes. Probability Theory and Related Fields, 1988.

Brunke, L., Greeff, M., Hall, A. W., Yuan, Z., Zhou, S., Panerati, J., and Schoellig, A. P. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems, 5:411 444, 2022.

Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pp. 1283 1294. PMLR, 2020.

Calvo-Fullana, M., Paternain, S., Chamon, L. F., and Ribeiro, A. State augmented constrained reinforcement learning: Overcoming the limitations of learning with rewards. IEEE Transactions on Automatic Control, 2023.

Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18(1):6070 6120, 2017.

Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.

Dann, C., Lattimore, T., and Brunskill, E. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.

Ding, D. and Jovanovi c, M. R. Policy gradient primaldual mirror descent for constrained mdps with large state spaces. In 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 4892 4897. IEEE, 2022.

Ding, D., Zhang, K., Basar, T., and Jovanovic, M. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33:8378 8390, 2020.

Ding, D., Zhang, K., Bas ar, T., and Jovanovi c, M. R. Convergence and optimality of policy gradient primal-dual method for constrained markov decision processes. In 2022 American Control Conference (ACC), pp. 2851 2856. IEEE, 2022a.

Ding, D., Zhang, K., Duan, J., Bas ar, T., and Jovanovi c, M. R. Convergence and sample complexity of natural policy gradient primal-dual methods for constrained mdps. ar Xiv preprint ar Xiv:2206.02346, 2022b.

Ding, D., Zhang, K., Duan, J., Bas ar, T., and Jovanovi c, M. R. Convergence and sample complexity of natural policy gradient primal-dual methods for constrained mdps. ar Xiv preprint ar Xiv:2206.02346, 2022c.

Ding, D., Wei, C.-Y., Zhang, K., and Ribeiro, A. Lastiterate convergent policy gradient primal-dual methods for constrained mdps. ar Xiv preprint ar Xiv:2306.11700, 2023.

Truly No-Regret Learning in Constrained MDPs

Ding, Y. and Lavaei, J. Provably efficient primal-dual reinforcement learning for cmdps with non-stationary objectives and constraints. ar Xiv preprint ar Xiv:2201.11965, 2022.

Domingues, O. D., M enard, P., Kaufmann, E., and Valko, M. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pp. 578 598. PMLR, 2021.

Efroni, Y., Merlis, N., Ghavamzadeh, M., and Mannor, S. Tight regret bounds for model-based reinforcement learning with greedy policies. Advances in Neural Information Processing Systems, 32, 2019.

Efroni, Y., Mannor, S., and Pirotta, M. Explorationexploitation in constrained mdps. ar Xiv preprint ar Xiv:2003.02189, 2020.

Ghosh, A., Zhou, X., and Shroff, N. Provably efficient model-free constrained rl with linear function approximation. Advances in Neural Information Processing Systems, 35:13303 13315, 2022.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018.

Kazerouni, A., Ghavamzadeh, M., Abbasi Yadkori, Y., and Van Roy, B. Conservative contextual linear bandits. Advances in Neural Information Processing Systems, 30, 2017.

Li, T., Guan, Z., Zou, S., Xu, T., Liang, Y., and Lan, G. Faster algorithm and sharper analysis for constrained markov decision process. ar Xiv preprint ar Xiv:2110.10351, 2021.

Liu, T., Zhou, R., Kalathil, D., Kumar, P., and Tian, C. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183 17193, 2021a.

Liu, T., Zhou, R., Kalathil, D., Kumar, P. R., and Tian, C. Policy optimization for constrained mdps with provable fast global convergence, 2021b. URL https: //arxiv.org/abs/2111.00552.

Moskovitz, T., O Donoghue, B., Veeriah, V., Flennerhag, S., Singh, S., and Zahavy, T. Reload: Reinforcement learning with optimistic ascent-descent for last-iterate convergence in constrained mdps. ar Xiv preprint ar Xiv:2302.01275, 2023.

M uller, A., Alatur, P., Ramponi, G., and He, N. Cancellationfree regret bounds for lagrangian approaches in constrained markov decision processes. In Sixteenth European Workshop on Reinforcement Learning, 2023.

Orabona, F. A modern introduction to online learning. ar Xiv preprint ar Xiv:1912.13213, 2019.

Pacchiano, A., Ghavamzadeh, M., Bartlett, P., and Jiang, H. Stochastic bandits with linear constraints. In International conference on artificial intelligence and statistics, pp. 2827 2835. PMLR, 2021.

Paternain, S., Chamon, L. F., Calvo-Fullana, M., and Ribeiro, A. Constrained reinforcement learning has zero duality gap. ar Xiv preprint ar Xiv:1910.13393, 2019.

Paternain, S., Calvo-Fullana, M., Chamon, L. F., and Ribeiro, A. Safe policies for reinforcement learning via primal-dual methods. IEEE Transactions on Automatic Control, 2022.

Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

Qiu, S., Wei, X., Yang, Z., Ye, J., and Wang, Z. Upper confidence primal-dual reinforcement learning for cmdp with adversarial loss. Advances in Neural Information Processing Systems, 33:15277 15287, 2020.

Shani, L., Efroni, Y., Rosenberg, A., and Mannor, S. Optimistic policy optimization with bandit feedback. In International Conference on Machine Learning, pp. 8604 8613. PMLR, 2020.

Sion, M. On general minimax theorems. Pacific J. Math, 1958.

Stooke, A., Achiam, J., and Abbeel, P. Responsive safety in reinforcement learning by pid lagrangian methods. In International Conference on Machine Learning, pp. 9133 9143. PMLR, 2020.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.

Tessler, C., Mankowitz, D. J., and Mannor, S. Reward constrained policy optimization. ar Xiv preprint ar Xiv:1805.11074, 2018.

Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. J. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.

Wu, Y., Shariff, R., Lattimore, T., and Szepesv ari, C. Conservative bandits. In International Conference on Machine Learning, pp. 1254 1262. PMLR, 2016.

Ying, D., Ding, Y., and Lavaei, J. A dual approach to constrained markov decision processes with entropy regularization. In International Conference on Artificial Intelligence and Statistics, pp. 1887 1909. PMLR, 2022.

Truly No-Regret Learning in Constrained MDPs

Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304 7312. PMLR, 2019.

Truly No-Regret Learning in Constrained MDPs

A. Summary of Notation

The following table summarizes our general CMDP notation.

State space S, with cardinality S

Action space A, with cardinality A

# of constraints I

Time horizon H

Transition probability ph(s |s, a) = P[sh+1 = s | sh = s, ah = a]

Initial state s1 S

Slater gap of π Ξ = mini [I](V π ui ci)

Number of episodes K

Objective reward Random variable Rh(s, a) [0, 1], with E[Rh(s, a)] = rh(s, a)

Constraint rewards Random variable Ui,h(s, a) with E[Ui,h(s, a)] = ui,h(s, a)

Constraint thresholds c RI, with ci [0, H]

Constraint functions gi,h(s, a) = ui,h(s, a) 1

Policy π Π with (h, s, a) 7 πh(a|s) (non-stationary)

Value functions V π r ,h(s) = Eπ[PH h =h r h (sh , ah ) | sh = s]

(shorthand) V π r = V π r ,1(s1)

(vector-valued) V π u = (V π u 1, . . . , V π u I)T RI

Q-values Qπ r ,h(s, a) = Eπ[PH h =h r h (sh , ah ) | sh = s, ah = a]

Occupancy measures dπ h(s, a) = Pπ[sh = s, ah = a]

dπ h(s) = Pπ[sh = s]

Lagrangian L(π, λ) = V π r + P

i [I] λi(V π ui ci) = V π r+λT g Optimal policy π arg maxπ Π minλ RI 0 L(π, λ)

Dual optimizer λ arg minλ RI 0 maxπ Π L(π, λ)

Confidence level 1 δ

Objective regret R(K; r) = P

k [K] V π r V πk r

+ Constraint regret R(K; u) = maxi [I] P

k [K] ci V πk ui

Truly No-Regret Learning in Constrained MDPs

The following table summarizes the notation specific to the algorithm.

Step size η > 0 (hyperparameter)

Regularization parameter τ > 0 (hyperparameter)

Dual threshold λmax > 0 (hyperparameter)

Dual domain Λ = [0, λmax]I

Entropy H(π) = Eπ h PH h=1 log(πh(ah|sh)) i

Regularized Lagrangian Lτ(π, λ) = V π r + P

i [I] λi(V π ui ci) + τ H(π) + 1

Regularized optimal policy π τ arg maxπ Π minλ Λ Lτ(π, λ)

Regularized dual optimizer λ τ arg minλ Λ maxπ Π Lτ(π, λ)

Auxiliary function ψk,h(s, a) = log(πk,h(a|s))

KL divergence KL(q, q ) = P

a A q(a) log q(a) q (a) (q, q (A))

KLk,h(s) = KL(π τ,h( |s), πk,h( |s))

s dπ τ h (s)KLk,h(s)

Potential function Φk = KLk + 1

2 λ τ λk 2 (k 1)

Visitation counter nk 1,h(s, a) = Pk 1 l=1 1{sl h=s, al h=a}

Averages rk 1,h(s, a), uk 1,h(s, a), pk 1,h(s |s, a)

Exploration bonuses bk 1,h(s, a) = br k 1,h(s, a) + bp k 1,h(s, a)

Optimistic estimates ˆrk, ˆuk, ˆgk = ˆuk 1

H c, ˆψk, ˆpk

Regularized reward function zk = r + λT k u + τψk, ˆzk = ˆrk + λT k ˆuk + τ ˆψk

Success event G

Truncated value functions ˆQk ˆzk,h(s, a) = ˆQk ˆrk,h(s, a) + P i λk,i ˆQk ˆuk,i,h(s, a)

+τ ˆQk ˆ ψk,h(s, a)

ˆV k ˆzk,h(s) = D πk,h( |s), ˆQk ˆzk,h(s, ) E

B. Extended Related Work In this section, we review further related work and provide a technical comparison with prior works.

Constrained MDPs Efroni et al. (2020) provided the first regret analysis for LP-based (OPTLP), primal-dual (OPTPRIMALDUAL), and dual algorithms (OPTDUAL). OPTLP achieves the optimal strong regret of O(K1/2), yet most modern CMDP algorithms are based on primal-dual schemes rather than LP. OPTPRIMALDUAL is akin to our Algorithm 1 but without regularization. It guarantees a weak regret of O(K1/2) but no bound on the strong regret, which is left as an open question that we addressed in Section 5. The same holds regarding the guarantees for OPTDUAL, for which the question about strong regret bounds is still unanswered.

Since Efroni et al. (2020) analyzed the vanilla primal-dual (and dual) algorithm, their analysis has been extended in various works, both for the case of an unknown or known CMDP. Specifically, the algorithms have been extended to natural policy gradient methods with policy parameterization (Ding et al., 2020; 2022b; Liu et al., 2021b), function approximation in the linear MDP setup (Ding & Jovanovi c, 2022; Ghosh et al., 2022), CMDPs with time-varying characteristics (Ding & Lavaei, 2022; Qiu et al., 2020), and have even been shown to achieve bounded on-average constraint violation (Liu et al., 2021a; Bai et al., 2022). However, all these works only established convergence of the averaged iterates or a sublinear weak regret. In practice, recent works do show empirical success (using optimistic gradients (Moskovitz et al., 2023) and PID control (Stooke et al., 2020)) but without the desired theoretical guarantees.

Truly No-Regret Learning in Constrained MDPs

Comparison with Prior Results Ding et al. (2023) analyzed two algorithms, RPG-PD and OPG-PD, for last-iterate convergence assuming a value function oracle. RPG-PD follows the same scheme as our Equations (6) and (7). However, the analysis is tailored for a single constraint. It is not straightforward (as far as we know) how Corollary 1 in (Ding et al., 2023) can be extended to multiple constraints (which we achieve in Section 4). Our Lemma 4.2 generalizes the analysis to deal with multiple constraints. However this extensions leads to a worse iteration complexity in Theorem 4.1 of O(ε 10) rather than O(ε 6). In addition to this, Ding et al. (2023) s policy update differs from ours in that it does not allow a closed-form solution but requires projection onto a restricted probability simplex for technical reasons (see discussion of Lemma 4.1, providing an analysis for our closed-form updates).

Moreover, Ding et al. (2023) left it as an open question whether the algorithm can be generalized to achieve last-iterate convergence in the online setup, when the CMDP is unknown; we addressed this point in Section 5.

The other algorithm of Ding et al. (2023), OPG-PD, is based on optimistic gradient updates and requires the restrictive assumption that the optimal state-visitation distribution (i.e., occupancy measure) is unique and introduces an extra problemdependent constant. Moreover, it assumes a uniform lower bound on the state-visitation frequency in the discounted infinite-horizon setting, an assumption that cannot be guaranteed in the finite-horizon setting.

Moskovitz et al. (2023) showed last-iterate convergence of a primal-dual scheme using optimistic gradient updates given a known CMDP, but their analysis concerns an algorithm operating over occupancy measures rather than policies (different from the practical implementation). Its implicit updates are constrained over the set of occupancy measures (i.e., the Bellman flow polytope), making them at least as computationally expensive as solving the CMDP directly via an LP in the first place.

Calvo-Fullana et al. (2023) considered a rather different approach to overcome the problem that CMDPs cannot be modeled by a single (mixture) reward weighted by Lagrange multipliers (sometimes referred to as scalarization fallacy). They proposed a state-augmentation technique that addresses this related problem without guaranteeing last-iterate convergence.

Dual Algorithms Li et al. (2021) provided a dual (not primal-dual) algorithm based on the same regularization scheme as ours but considered an accelerated dual update and only proved convergence for a history-weighted mixture policy in a known CMDP. Similarly, Ying et al. (2022) derived a dual (not primal-dual) algorithm with last-iterate convergence but left it open if a sample-based version is possible. Moreover, their analysis covers the discounted infinite-horizon setting and requires a uniform lower bound on the state-visitation frequency, an assumption that cannot be guaranteed in the finite-horizon setting.

Constrained Bandits In the simpler bandit setup where there is only a single state, there are mainly three setups in the literature: Knapsack bandits (Agrawal & Devanur, 2016; Badanidiyuru et al., 2018) consider reward maximization over time as long a some global budget is not used up yet. Conservative bandits (Wu et al., 2016; Kazerouni et al., 2017) concern algorithms whose cumulative reward performs sufficiently well relative to some pre-defined baseline policy. Finally, there is a line of research on stage-wise constrained bandits (Amani et al., 2019; Pacchiano et al., 2021), which require algorithms that obtain a reward and a cost associated with an action, where the latter should stay below a threshold in each round. While these settings may inspire related research in CMDPs, they are rather different from ours: They consider hard thresholds in the single-state setup, while exploration in CMDPs is generally stateful and commonly aims at simultaneous minimization of reward and constraint regrets.

C. Properties of the Lagrangian Formulation The results in this section are not novel by themselves, but we re-establish them here for finite-horizon CMDPs for completeness. We refer to Appendix I for the relevant convex optimization background. To view the CMDP as a convex optimization problem, we will express it via the common notion of occupancy measures (Borkar, 1988).

Definition C.1. The state-action occupancy measure dπ of a policy π for a CMDP M is defined as

dπ h(s, a) := E 1{sh=s,ah=a} | s1; p, π = P[sh = s, ah = a | s1; p, π],

for s S, a A, h [H]. We denote the stacked vector of these values as dπ RHSA, with the element at index (h, s, a) being dπ h(s, a). Similarly, we define

dπ h(s) := P[sh = s | s1; p, π] = X

a dπ h(s, a)

Truly No-Regret Learning in Constrained MDPs

We can now define

Q(p) := dπ RHSA | π Π

as the state-action occupancy measure polytope. Note that Q(p) is indeed a polytope (Puterman, 2014). Moreover, we have a surjective map π 7 dπ between Π and Q(p), for which we can explicitly compute an element in the pre-image of d Q(p) via πh(a|s) = dh(s, a)/(P

a dh(s, a )).

We can stack the expected rewards rh(s, a) and constraint rewards ui,h(s, a) in the same way as dπ h(s, a) to obtain vectors r RHSA and ui RHSA. Note that we then have V π r = P

h,s,a dπ h(s, a)rh(s, a) = r T dπ by linearity of expectation. Similarly, for all i [I], we have V π ui = u T i dπ. Moreover, if we stack U = (ui)i [I] RI HSA and c = (ci)i [I] RI as

u T 1 ... u T I

we obtain V π u = Udπ [0, H]I for the vector of the constraint value functions. We can thus write

π arg max π Π V π r s.t. V π ui ci ( i [I])

equivalently as

dπ arg max dπ Q(p) r T dπ s.t. Udπ c, (13)

where is understood element-wise. This is a linear program (LP). In particular, by compactness of the state-action occupancy polytope, there exists an optimal solution π as we assume feasibility.

Lemma C.1 (Strong duality CMDP (Paternain et al., 2019)). We have

max π Π min λ RI 0 L(π, λ) = min λ RI 0 max π Π L(π, λ),

and both optima are attained.

Proof. Note that, under Assumption 2.1, we can view Equation (1) as the convex optimization problem in Equation (13) over Q(p) that satisfies all parts of Assumption I.1 from Appendix I.2. Indeed,

(a) X := Q(p) is a polytope and thus convex

(b) the objective f( ) := r T ( ) is affine and thus convex

(c) the constraints gi( ) := ci u T i ( ) are affine and thus convex

(d) by Assumption 2.1, Equation (13) is feasible, and thus its optimum is attained (since the domain is compact and the objective continuous)

(e) a Slater point exists by Assumption 2.1, namely d π

(f) all dual problems have an optimal solution since the domain X is compact and the objective f( )+λT g( ) is continuous,

where Q(p) RHSA, r RHSA and ui RSAH are defined as above. The claim now readily follows from Theorem I.1.

Lemma C.2 (e.g., Ying et al. (2022)). We have λ 1 H

Proof. As in the proof of Lemma C.1, under Assumption 2.1, we can view the CMDP problem as a convex optimization problem in the occupancy measure, in the same setup as Appendix I.2. Specifically, we have V π r = r T dπ and V π u = Udπ. Then, set X = Q(p), x = d π, f( ) = r T ( ) and gi( ) = ci u T i ( ). Plugging this into Theorem I.3 indeed yields

λ 1 V π r V π r mini [I](V π ui ci) H

Truly No-Regret Learning in Constrained MDPs

Lemma C.3 (Saddle point CMDP). Let π Π and λ RI 0. Then

L(π, λ ) L(π , λ ) L(π , λ).

Proof. By Lemma C.1, this immediately follows from Lemma I.2 in Appendix I.1.

Lemma C.4 (Strong duality regularized CMDP (Ding et al., 2023)). We have

max π Π min λ Λ Lτ(π, λ) = min λ Λ max π Π Lτ(π, λ),

and both primal and dual optimum are attained.

Proof. For all π Π, λ Λ, we have

Lτ(π, λ) =V π r+λT g + τ H(π) + 1

s,a,h (rh(s, a) + X

i λigi,h(s, a))dπ h(s, a)

s,a,h dπ h(s, a) log dπ h(s, a) P

a dπ h(s, a )

=:Locc τ (dπ, λ),

where gi,h(s, a) = ui,h(s, a) 1

H ci and where we used the definition of the occupancy measures and the polytope Q(p). Consider the problem

max d Q(p) min λ Λ Locc τ (d, λ). (14)

For any π Π that is optimal for Equation (5), dπ is also optimal for Equation (14). Conversely, for every d Q(p) that is optimal for Equation (14), we have that any π given by πh(a|s) := dh(s,a) P

a A dh(s,a ) for s with P a A dh(s, a) > 0, and arbitrary otherwise, is optimal for Equation (2).

Note that Locc τ is continuous. We further claim that Locc τ is 1-strongly convex in λ Λ and concave in d Q(p). Indeed, while the former claim is immediate, we can see the latter via the log-sum inequality (e.g., Cover (1999, Theorem 2.7.1)) with n = 2: For non-negative ai, bi,

i=1 ai log ai

and equality if and only if ai/bi is the same for all i. Only considering the nonlinear term in Lτ(λ, ), for d1, d2 Q(p)

Truly No-Regret Learning in Constrained MDPs

and α (0, 1) we have

s,a,h (αd1,h(s, a) + (1 α)d2,h(s, a)) log αd1,h(s, a) + (1 α)d2,h(s, a) P

a (αd1,h(s, a ) + (1 α)d2,h(s, a ))

s,a,h (αd1,h(s, a) + (1 α)d2,h(s, a)) log αd1,h(s, a) + (1 α)d2,h(s, a) P a αd1,h(s, a ) + P a (1 α)d2,h(s, a )

s,a,h αd1,h(s, a) log αd1,h(s, a) P

a αd1,h(s, a )

s,a,h (1 α)d2,h(s, a) log (1 α)d2,h(s, a) P

a (1 α)d2,h(s, a )

s,a,h d1,h(s, a) log d1,h(s, a) P

a d1,h(s, a )

s,a,h d2,h(s, a) log d2,h(s, a) P

a d2,h(s, a )

with equality if and only if d1,h(s,a) P

a d1,h(s,a ) = d2,h(s,a) P

a d2,h(s,a ) for all s, h. By Lemma I.1, we thus have

max d Q(p) min λ Λ Locc τ (d, λ) = min λ Λ max d Q(p) Locc τ (d, λ),

and primal and dual optimizers exist. This implies the same for the original problem Equation (2) by converting the occupancy measures back into policies via πh(a|s) = dh(s, a)/(P

a dh(s, a )).

Lemma C.5 (Saddle point regularized CMDP). Let π Π and λ Λ. Then

Lτ(π, λ τ) Lτ(π τ, λ τ) Lτ(π τ, λ).

Proof. By Lemma C.4, this follows from Lemma I.2.

Lemma C.6. Let π Π and λ Λ. Then

V π r+(λ τ )T g τH(π τ) V π τ r+(λ τ )T g V π τ r+λT g + τ

where g = u 1

Proof. Plugging the definition of Lτ into Lemma C.5 proves the claim, after using that H(π) 0 and λ 2 0.

D. Last-Iterate Convergence In this section, we provide the proofs for all results in Section 4, resulting in the proof of last-iterate convergence of the regularized primal-dual scheme (Equations (6) and (7)).

We first establish the convergence of the aforementioned potential function Φk.

Lemma 4.1 (Regularized convergence). Let η, τ < 1 and λmax HΞ 1. The iterates in Equations (6) and (7) satisfy

Φk+1 (1 ητ)kΦ1 + O ητ 1Cη,τ,Λ ,

Cη,τ,Λ =λ2 max H3A1/2I2 exp (ηH (1 + λmax I + log(A)))

+ I (H + τλmax)2 .

Truly No-Regret Learning in Constrained MDPs

Proof. We first decompose the k-th primal-dual gap as follows:

Lτ(π τ, λk) Lτ(πk, λ τ) = Lτ(π τ, λk) Lτ(πk, λk) | {z } (i)

+ Lτ(πk, λk) Lτ(πk, λ τ) | {z } (ii)

We first bound term (i):

(i) =Lτ(π τ, λk) Lτ(πk, λk)

=V π τ r+λT k g V πk r+λT k g (as dπ h(s, a) = dπ h(s)πh(a|s), and cancel λk 2)

s,a,h dπ τ h (s)π τ,h(a|s) log(π τ,h(a|s)) + τ X

s,a,h dπk h (s)πk,h(a|s) log(πk,h(a|s))

=V π τ r+λT k g+τψk V πk r+λT k g+τψk (since ψk,h(s, a) = log(πk,h(a|s)))

s,a,h dπ τ h (s)π τ,h(a|s) log(πk,h(a|s)) τ X

s,a,h dπk h (s)πk,h(a|s) log(πk,h(a|s))

s,a,h dπ τ h (s)π τ,h(a|s) log(π τ,h(a|s)) + τ X

s,a,h dπk h (s)πk,h(a|s) log(πk,h(a|s))

=V π τ r+λT k g+τψk V πk r+λT k g+τψk

s,a,h dπ τ h (s)π τ,h(a|s) log(πk,h(a|s))

s,a,h dπ τ h (s)π τ,h(a|s) log(π τ,h(a|s))

=V π τ r+λT k g+τψk V πk r+λT k g+τψk

s,h dπ τ h (s) X

a π τ,h(a|s) log π τ,h(a|s) πk,h(a|s)

=V π τ r+λT k g+τψk V πk r+λT k g+τψk τ X

s,h dπ τ h (s)KLk,h(s)

=V π τ r+λT k g+τψk V πk r+λT k g+τψk τKLk

=V π τ r+λT k u+τψk V πk r+λT k u+τψk τKLk (as g = u 1

=V π τ zk V πk zk τKLk

s,h dπ τ h (s) D Qπk zk,h(s, ), π τ,h( |s) πk,h( |s) E τKLk (by Lemma H.1)

Note that for all s, h, D Qπk zk,h(s, ), π τ,h( |s) πk,h( |s) E

KLk,h(s) KLk+1,h(s)

a πk,h(a|s) exp Qπk zk,h(s, a) Qπk zk,h(s, a)2 (Lemma I.6)

KLk,h(s) KLk+1,h(s)

η (Lemma E.11)

2A1/2 exp (ηH (1 + λmax I + τ log(A))) 2H2 (1 + Iλmax + τ log(A))2 + 2τ 2(64/e2)

=KLk,h(s) KLk+1,h(s)

2 1 H Dη,τ,Λ,

Dη,τ,Λ = HA1/2 exp (ηH (1 + λmax I + τ log(A))) 2H2 (1 + Iλmax + τ log(A))2 + 2τ 2(64/e2)

Truly No-Regret Learning in Constrained MDPs

and where we were able to apply Lemma I.6 by Lemma I.9 and since Qπk z,h(s, a) 0. Hence,

s,h dπ τ h (s) D Qπk zk,h(s, ), π τ,h( |s) πk,h( |s) E X

s,h dπ τ h (s) KLk,h(s) KLk+1,h(s)

2 1 H Dη,τ,Λ

Plugging in, we thus find

(i) = V π τ zk V πk zk τKLk KLk KLk+1

2Dη,τ,Λ τKLk = (1 ητ)KLk KLk+1

2Dη,τ,Λ. (16)

We now bound term (ii):

(ii) =Lτ(πk, λk) Lτ(πk, λ τ)

=V πk r+λT k g V πk r+(λ τ )T g + τ

2 λ τ 2 (cancel H(πk))

i (λk,i λ τ,i)V πk gi + τ

i (λk,i λ τ,i)(V πk ui ci + τλk,i) τ

λ τ λk 2 λ τ λk+1 2

V πk uk c + τλk 2 (Lemma I.8)

λ τ λk 2 λ τ λk+1 2

2D τ,Λ, (Lemma E.11)

with D τ,Λ = I(H + τλmax)2 and where we were able to apply Lemma I.8 by Lemma I.9. Plugging in, we find

i (λk,i λ τ,i)(V πk ui ci + τλk,i) τ

λ τ λk 2 λ τ λk+1 2

=(1 ητ) λ τ λk 2 λ τ λk+1 2

2D τ,Λ. (17)

From Lemma C.3 (with π = πk, λ = λk), we have 0 Lτ(π τ, λk) Lτ(πk, λ τ). Moreover, recall Φk = KLk + 1 2 λk λ τ 2, thus by Equations (16) and (17),

Φk+1 =KLk+1 + 1

2 λk+1 λ τ 2

(1 ητ)KLk + η2

2 Dη,τ,Λ η(i) + (1 ητ) λk λ τ 2

2 D τ,Λ η(ii) (Equations (16) and (17))

(1 ητ)Φk + η2(Dη,τ,Λ + D τ,Λ) η ((i) + (ii)) (Def. Φk)

(1 ητ)Φk + η2(Dη,τ,Λ + D τ,Λ) η (Lτ(π τ, λk) Lτ(πk, λ τ)) (Equation (15))

(1 ητ)Φk + η2(Dη,τ,Λ + D τ,Λ). (as Lτ(π τ, λk) Lτ(πk, λ τ) 0)

Finally, the claimed bound follows by noting that

Dη,τ,Λ + D τ,Λ

=HA1/2 exp (ηH (1 + λmax I + τ log(A))) 2H2 (1 + Iλmax + τ log(A))2 + 2τ 2(64/e2) + I(H + τλmax)2

O λ2 max H3A1/2I2 exp (ηH (1 + λmax I + log(A))) + I(H + τλmax)2 ,

as τ 1 and λmax HΞ 1 1.

Truly No-Regret Learning in Constrained MDPs

We can use the following result to turn the convergence of the potential function into an error bound. We will then choose the optimal values for λmax, τ, and η.

Lemma 4.2 (Error bounds). For any sequence (πk)k [K],

h V π r V πk r i

+ H3/2(2Φk)1/2 + τH log(A),

+ H3/2(2Φk)1/2 + τλmax

+ λ 1 max H2Ξ 1 + τH log(A) .

Proof. (1) We bound the objective optimality gap. First, decompose it as

V π r V πk r = V π r V π τ r | {z } (i)

+ V π τ r V πk r | {z } (ii)

We bound (ii) as follows:

(ii) =V π τ r V πk r

s,a,h dπ τ h (s) π τ,h(a|s) πk,h(a|s) Qπk r,h(s, a) (Lemma H.1)

s,h dπ τ h (s) π τ,h( |s) πk,h( |s) 1

s,h dπ τ h (s) q

2KLk,h(s) (by Pinsker s)

1 H dπ τ h (s)KLk,h(s) (by Jensen s)

We next bound term (i). By Lemma C.6 with π = π we have

V π r τH(π τ) V π τ r + X

i λ τ,i V π τ gi V π gi .

By Lemma C.6 with λ = 0 we have X

i λ τ,i V π τ gi 0.

Moreover, V π gi 0 by feasibility and λ τ,i 0. Combing these inequalities, we find

(i) =V π r V π τ r τH(π τ) τH log(A), (19)

which concludes the proof for the objective optimality gap.

(2) Let i [I]. We now bound the i-th constraint violation. First, decompose it as

ci V πk ui = V πk gi = V π τ gi | {z } (iii)

+ V π τ gi V πk gi | {z } (iv)

We first bound (iv). The same calculation as for the objective optimality gap (1) shows

(iv) =V π τ gi V πk gi H3/2p

Truly No-Regret Learning in Constrained MDPs

We next bound term (iii). Recall Λ = [0, λmax]I. Lemma C.6 with π = π and λ Λ as

( 0 (j = i) λmax (j = i)

j λ τ,j V π gj V π τ r + λmax V π τ gi + τ

2λ2 max + τH(π τ)

From Lemma C.3 (with π = π τ) we get

V π τ r V π r X

j λ j V π gj V π τ gj .

Adding the two previous inequalities and canceling terms, we get

j λ τ,j V π gj λmax V π τ gi + τ

2λ2 max + X

j λ j V π gj V π τ gj + τH(π τ),

where the first inequality holds since 0 V π gj by feasibility and λ τ 0. Rearranging this shows

2λmax + 1 λmax

j λ j V π gj V π τ gj + 1 λmax τH(π τ)

2λmax + 1 λmax

j λ j V π uj V π τ uj + 1 λmax τH(π τ) (g = u 1

2λmax + 1 λmax λ 1 H + 1 λmax τH(π τ) (H older s)

2λmax + 1 λmax

Ξ + τH(π τ) (Lemma C.2)

2λmax + 1 λmax

Ξ + τH log(A) .

Finally, we are ready to prove last-iterate convergence by combining the previous two lemmas.

Theorem 4.1 (Last-iterate convergence). Let ε (0, 1). Then, with appropriate choices of η ε6, τ ε2, λmax ε 1, we have h V π r V πk r i

+ ε, ci V πk ui

+ ε ( i [I])

for k = Ω(poly(A, H, I, Ξ 1) ε 10).

Proof. The bound follows from Lemma 4.1 and Lemma 4.2. We choose τ = ε2, η = (H2I log(A)) 1Ξε6, λmax = H

Ξ . Set r(k) := V π r V πk r

+ and gi(k) := V πk gi

We first consider the suboptimality for the reward. Plugging Lemma 4.1 into Lemma 4.2 we find, using

b and 1 + x exp(x),

r(k) H3/2Φ1/2 1 exp ( ητk/2) (a)

1/2 O(C1/2 η,τ,Λ) (b)

+ τH log(A). (c)

Truly No-Regret Learning in Constrained MDPs

For (b), note that, using the definitions of η, τ, λmax (and taking , and τ < 1)

C1/2 η,τ,Λ λmax H3/2A1/4I exp (ηH (1 + λmax I + log(A)) /2) + I1/2(H + τλmax)

λmax H3/2A1/4I exp (2) + I1/2(H + τλmax)

=ε 1 H5/2A1/4IΞ 1 exp (2) + I1/2H + I1/2ε2HΞ 1ε 1)

ε 1 H5/2A1/4IΞ 1.

1/2 = (H2I log(A)) 1/2Ξ1/2ε(6 2)/2 = (H2I log(A)) 1/2Ξ1/2ε2,

we thus have

(b) = H3/2 η

1/2 Cη,τ,Λ H3I1/2A1/4Ξ 1/2ε = poly(A, H, I, Ξ 1) ε.

(c) = τH log(A) = H log(A)ε2.

For (a), using the standard inequality e x 1 x/2 (if 0 x 1) with x := ητ/2, we first find

exp( ητl/2) (1 ητ/4)l

(a) = H3/2Φ1/2 1 exp ( ητk/2) H3/2Φ1/2 1 1

l=1 exp ( ητl/2)

H3/2Φ1/2 1 1

l=1 (1 ητ/4)l

H3/2Φ1/2 1 1

l=1 (1 ητ/4)l

=H3/2Φ1/2 1 1

=H3/2Φ1/2 1 1 k 4 (H2I log(A)) 1Ξε6ε2

=4H7/2IΞ 1 log(A)1

k Φ1/2 1 ε 8.

Furthermore, since π1 plays actions uniformly at random and λ1 = 0, we have Φ1/2 1 (H log(A) + 1

2Iλ2 max)1/2 H1/2 log(A)1/2 + I1/2λmax = H1/2 log(A)1/2 + I1/2HΞ 1ε 1. Hence, the calculation above shows

(a) 4H7/2IΞ 1 log(A)1

k (H1/2 log(A)1/2 + I1/2HΞ 1ε 1)ε 8 poly(A, H, I, Ξ 1)ε

for k = Ω(ε 10). Hence, summing up terms (a) to (c) and choosing k = Ω(poly(A, H, I, Ξ 1)ε 10) yields the bound for the objective.

Next, we consider the regret for the constraints. Plugging Lemma 4.1 into Lemma 4.2 we find, using

ui(k) H3/2Φ1/2 1 exp ( ητk/2) (a )

1/2 O(C1/2 η,τ,Λ) (b )

+ τλmax + 1 λmax H2Ξ 1 (c )

+ 1 λmax τH log(A). (d )

Truly No-Regret Learning in Constrained MDPs

Note that terms (a ), (b ) are identical to (a), (b). Moreover, for (d ) we have

(d ) = 1 λmax τ log(A) =ΞH 1 log(A)ε3.

Finally, for (c ), we have

(c ) =τλmax + 1 λmax H2Ξ 1

=ε2 HΞ 1ε 1 + H 1Ξε H2Ξ 1

=H(1 + Ξ 1)ε.

Thus, summing up (a ) to (d ) and choosing k = Ω(poly(A, H, I, Ξ 1)ε 10) yields the bound for the constraints.

E. Properties of the Optimistic Model In this section, we establish important properties of the model Algorithm 1 builds.

E.1. Building the Model First, we describe the exact model and how we perform policy evaluation.

We follow Shani et al. (2020) for the optimistic exploration, but we also take the I constraint functions u into account rather than just the reward function r. We also need to pay special attention to the auxiliary term ψk.

For all s, a, h and k [K], let nk 1,h(s, a) := Pk 1 l=1 1{sl h=s, al h=a} count the number of times that the state-action pair (s, a) has been visited at step h before episode k. Here, (sl h, al h) denotes the state-action pair visited at step h in episode l. First, we compute the empirical averages of the reward and transition probabilities as follows:

rk 1,h(s, a) :=

Pk 1 l=1 Rl h(s, a)1{sl h=s, al h=a} nk 1,h(s, a) 1 ,

uk 1,i,h(s, a) :=

Pk 1 l=1 U l i,h(s, a)1{sl h=s, al h=a} nk 1,h(s, a) 1 ( i [I]),

pk 1,h(s |s, a) :=

Pk 1 l=1 1{sl h=s, al h=a, sl h+1=s } nk 1,h(s, a) 1 ,

where a b := max{a, b}. We consider optimistic estimates ˆrk, ˆuk, ˆpk:

ˆrk,h(s, a) := rk 1,h(s, a) + bk 1,h(s, a),

ˆuk,i,h(s, a) := uk 1,i,h(s, a) + bk 1,h(s, a) ( i [I]),

ˆpk,h(s |s, a) := pk 1,h(s |s, a),

with the bonuses bk 1,h(s, a) = br k 1,h(s, a) + bp k 1,h(s, a) specified below. For ψk, we take4

ˆψk,h(s, a) := ψk,h(s, a) + bp k 1,h(s, a) log(A).

For notational convenience, we write

zk := r + λT k u + τψk, ˆzk := ˆrk + λT k ˆuk + τ ˆψk

for the reward function mimicking the π-dependency of the regularized Lagrangian at (πk, λk).

For any δ (0, 1), we specify the correct bonuses to obtain our regret guarantees with probability at least 1 δ:

bk 1,h(s, a) :=br k 1,h(s, a) + bp k 1,h(s, a),

4In other words, there is no bonus for the function, only for the transitions. This is because ψk is known in episode k, and so the only uncertainty in the corresponding value function is due to estimating the transitions p. Note that the extra log(A) factor corrects for the fact that ψk is not a function to [0, 1].

Truly No-Regret Learning in Constrained MDPs

br k 1,h(s, a) :=

1 2 log 2SAH(I+1)K

nk 1,h(s, a) 1 , bp k 1,h(s, a) := H

2S + 2 log SAHK

nk 1,h(s, a) 1 .

For ψk, recall

ˆψk,h(s, a) := ψk,h(s, a) + bp k 1,h(s, a) log(A).

We define the truncated value functions5

ˆQk ˆzk,h(s, a) := ˆQk ˆrk,h(s, a) + X

i λk,i ˆQk ˆuk,i,h(s, a) + τ ˆQk ˆ ψk,h(s, a), (22)

ˆV k ˆzk,h(s) := D πk,h( |s), ˆQk ˆzk,h(s, ) E , (23)

where we compute ˆQk ˆrk,h(s, a), ˆQk ˆuk,i,h(s, a), ˆQk ˆ ψk,h(s, a) via truncated policy evaluation with respect to the estimated model, see Algorithm 2.

Algorithm 2 EVAL (Truncated Policy Evaluation)

Initialize ˆV k ˆrk,H+1(s) = ˆV k ˆuk,i,H+1(s) = ˆV k ˆ ψk,H+1(s) = 0 (for s S) for h = H, . . . , 1 do

for (s, a) S A do

Truncated DP step:

ˆQk ˆrk,h(s, a) := min n H h + 1, ˆrk,h(s, a) + D ˆpk,h( |s, a), ˆV k ˆrk,h+1( ) Eo

ˆQk ˆuk,i,h(s, a) := min n H h + 1, ˆuk,i,h(s, a) + D ˆpk,h( |s, a), ˆV k ˆuk,i,h+1( ) Eo ( i [I])

ˆQk ˆ ψk,h(s, a) := min n ψk,h(s, a) + (H h + 1) log(A), ˆψk,h(s, a) + D ˆpk,h( |s, a), ˆV k ˆ ψk,h+1( ) Eo

Retrieve V -function:

ˆV k ˆrk,h(s) := D πk,h( |s), ˆQk ˆrk,h(s, ) E

ˆV k ˆuk,i,h(s) := D πk,h( |s), ˆQk ˆuk,i,h(s, ) E ( i [I])

ˆV k ˆ ψk,h(s) := D πk,h( |s), ˆQk ˆ ψk,h(s, ) E

end for end for output ˆQk ˆzk( ) := ˆQk ˆrk( ) + P

i λk,i ˆQk ˆuk,i( ) + τ ˆQk ˆ ψk( ), and ( ˆV k ˆuk,i(s1, 1))i

The main reason for truncating during the otherwise standard policy evaluation algorithm is the need for a bonus-independent upper bound on the surrogate value functions so that Lemma E.10 holds.6 Clearly, the truncated value functions are all lower bounded by zero and upper bounded by the actual value functions under the estimated model. Finally, note that these truncated value functions need not correspond to the true value function of a policy in some MDP.

Recall the truncated value functions from Algorithm 2 in Appendix E.1. Note that due to the separate definition of ˆQk ˆzk,h(s, a)

5Importantly, note that this is the definition of ˆQk ˆzk,h(s, a), rather than running truncated policy evaluation on zk. 6In fact, truncation is only required for the update of π, and for the update of λ, we can use either truncated or exact values.

Truly No-Regret Learning in Constrained MDPs

and the updates in Algorithm 2, for all s S, h [H],

ˆV k ˆzk,h(s) = D πk,h( |s), ˆQk ˆzk,h(s, ) E (by def.)

πk,h( |s), ˆQk ˆrk,h(s, a) + X

i λk,i ˆQk ˆuk,i,h(s, a) + τ ˆQk ˆ ψk,h(s, a)

= D πk,h( |s), ˆQk ˆrk,h(s, a) E + X

i λk,i D πk,h( |s), ˆQk ˆuk,i,h(s, a) E

+ τ D πk,h( |s), ˆQk ˆ ψk,h(s, a) E

= ˆV k ˆrk,h(s) + X

i λk,i ˆV k ˆuk,i,h(s) + τ ˆV k ˆ ψk,h(s). (24)

Similarly, by linearity of expectation

V πk zk,h(s) = V πk rk,h(s) + X

i λk,i V πk uk,i,h(s) + τV πk ψk,h(s) (25)

for the true value functions.

E.2. Properties of the Model We are now ready to establish the properties of the model. In particular, we will show that it is optimistic with respect to the value function and prove bounds on the estimation error during the learning procedure.

Success Event We will condition our regret analysis on a success event G, which we formally define below. Fix δ > 0, and construct the estimated model as in Appendix E.1. G ensures that (a) the optimistic reward estimates are in fact optimistic and (b) the true transitions are close to the estimated ones, i.e.:

ui ˆuk,i ( i [I]),

ph( |s, a) pk 1,h( |s, a) 1 H bp k 1,h(s, a) ( s, a, h),

for every episode k [K]. Formally, with δ := δ/3, define the failure events

F r k := s, a, h: |rh(s, a) rk 1,h(s, a)| br k 1,h(s, a) ,

F u k := i, s, a, h: |ui,h(s, a) uk 1,i,h(s, a)| br k 1,h(s, a) ,

F p k := n s, a, h: ph( |s, a) pk 1,h( |s, a) 1 H bp k 1,h(s, a) o ,

s, a, h: nk 1,h(s, a) 1

j<k dπj h (s, a) H log SAH

where dπj h (s, a) refers to the occupancy measure (Appendix C), and

k=1 F p k ,

k=1 F n k .

Truly No-Regret Learning in Constrained MDPs

We define the success event G as the complement of F r F p F n, i.e.

G := F r F p F n.

We now show that this event holds with high probability. The proof of this theorem relies on standard concentration bounds (specifically, Hoeffding for the rewards and L1-concentration for the transitions) and a union bound over all involved indices.

Lemma E.1 (Success event). Let δ > 0 and define the bonuses accordingly. Suppose that for all k [K], in episode k, policy πk is played. Then P[G] 1 δ.

Proof. By Hoeffding s for any possible realization of nk 1,h(s, a) (and total probability), we have P[F r] δ after union bound over all indices s, a, h and all episodes k. For nk 1,h(s, a) = 0 the bound holds trivially.

By the L1 concentration bound of Weissman et al. (2003, Theorem 2.1), for any possible realization of nk 1,h(s, a) (and total probability), we have P[F p] δ after union bound over all indices s, a, h and all episodes k. For nk 1,h(s, a) = 0 the bound holds trivially.

By Dann et al. (2017, Corollary E.4), we also have P[F n] δ .

We conclude by union bound over the three events.

Decomposition via Extended Value Difference Lemma The following lemma allows us to decompose the instantaneous regret into three terms that we will bound separately.

Lemma E.2 (Decomposition via simulation lemma). We have the following decomposition:

V π τ zk V πk zk = ˆV k ˆzk V πk zk (a)

h E D ˆQk ˆzk,h(sh, ), π τ,h( |sh) πk,h( |sh) E s1, π τ, p (b)

h E ˆQk ˆzk,h(sh, ah) + zk,h(sh, ah) + D ph( |sh, ah), ˆV k zk,h+1( ) E s1, π τ, p . (c)

Proof. First, expand

V π τ zk V πk zk = ˆV k ˆzk V πk zk + V π τ zk ˆV k ˆzk .

Then apply Lemma H.2 to π = πk, π = π τ and M = (S, A, ˆpk, ˆzk), M = (S, A, p, zk) to the second term (after multiplying both sides by 1).

General On-Policy Bounds The following two results are standard and will allow us to bound the estimation errors during learning. Consider the setup in which policy πk is derived based on the previous episodes 1, . . . , k 1, and then played in episode k. Recall that for all s, a, h and k [K], nk 1,h(s, a) := Pk 1 l=1 1{sl h=s, al h=a} counts the visits of state-action pair (s, a) at step h before episode k. We write for asymptotic inequality up to polylogarithmic terms.

Note that in the following two lemmas, the exponent of H is different from the one in the referenced proofs. This is because the referenced works consider the case of stationary transition probabilities, whereas we consider non-stationary dynamics. See Shani et al. (2020, Lemmas 18, 19).

Lemma E.3 (Lemma 36, Efroni et al. (2020)). Suppose for all s, a, h, k [K], we have

nk 1,h(s, a) > 1

j<k dπj h (s, a) H log SAH

Truly No-Regret Learning in Constrained MDPs

Then for all K K

nk 1,h(sk h , ak h ) | Fk 1

SAH2K + SAH),

where Fk 1 is the σ-algebra induced by all random variables up to and including episode k 1.

Proof. We refer to Efroni et al. (2019, Lemma 38) for a proof of the statement.

Lemma E.4 (Lemma 37, Efroni et al. (2020)). Suppose for all s, a, h, k [K], we have

nk 1,h(s, a) > 1

j<k dπj h (s, a) H log SAH

Then for all K K

h=1 E 1 nk 1,h(sk h , ak h ) | Fk 1

where Fk 1 is the σ-algebra induced by all random variables up to and including episode k 1.

Proof. We refer to Zanette & Brunskill (2019, Lemma 13) for a proof of the statement.

Estimation Error (On-Policy Error Bounds) We next prove bounds on the estimation error obtained while learning the model.

Lemma E.5 (Estimation error ˆr, ˆu). Conditioned on G, for every K [K], we have

ˆV k ˆrk V πk r O

S2AH4K + S3/2AH2 ,

ˆV k ˆuk,i V πk ui O

S2AH4K + S3/2AH2 ( i [I]).

Proof. By Lemma H.2 (with π = π = πk and M = (S, A, ˆpk, ˆzk), M = (S, A, p, zk)), we have according to the truncated policy evaluation (Algorithm 2),

ˆV k ˆrk V πk r

h=1 E ˆQk ˆrk,h(sh, ah) rh(sh, ah) D ph( |sh, ah), ˆV k ˆrk,h+1( ) E s1, πk, p

h=1 E ˆrk,h(sh, ah) rh(sh, ah) s1, πk, p

h=1 E D ˆpk,h( |sh, ah) ph( |sh, ah), ˆV k ˆrk,h+1( ) E s1, πk, p

h=1 E rk 1,h(sh, ah) + br k 1,h(sh, ah) rh(sh, ah) s1, πk, p

h=1 E bp k 1,h(sh, ah) + D ˆpk,h( |sh, ah) ph( |sh, ah), ˆV k ˆrk,h+1( ) E s1, πk, p .

Truly No-Regret Learning in Constrained MDPs

Since G occurs, we have rk 1,h(sh, ah) rh(sh, ah) br k 1,h(sh, ah). Moreover,

D ˆpk,h( |sh, ah) ph( |sh, ah), ˆV k ˆrk,h+1( ) E

ˆpk,h( |sh, ah) ph( |sh, ah) 1 ˆV k ˆrk,h+1( ) ˆpk,h( |sh, ah) ph( |sh, ah) 1 H

bp k 1,h(sh, ah)

by H older s, the truncation in the policy evaluation, and since G occurs. Plugging this into the inequality above,

ˆV k ˆrk V πk r

h=1 E 2br k 1,h(sh, ah) s1, πk, p

h=1 E 2bp k 1,h(sh, ah) s1, πk, p .

Recalling the definition of br k 1,h(s, a) 1

nk 1,h(s,a) 1, bp k 1,h(s, a) H

nk 1,h(s,a) 1 and summing up, we thus find

ˆV k ˆrk V πk r H

nk 1,h(sh, ah) 1

nk 1,h(sk h, ak h) 1

SAH2K + SAH (Lemma E.3)

S2AH4K + S3/2AH2

where we used that we play πk in episode k and Lemma E.3, which applies since G occurs.

The proof for ui (i [I]) is identical.

Lemma E.6 (Estimation error ˆψ). Conditioned on G, for every K [K], we have

ˆV k ˆ ψk V πk ψk

S2AH4K + S3/2AH2 .

Proof. By Lemma H.2 (with π = π = πk and M = (S, A, ˆpk, ˆzk), M = (S, A, p, zk)), we have according to the

Truly No-Regret Learning in Constrained MDPs

truncated policy evaluation,

ˆV k ˆrk V πk r

h=1 E ˆQk ˆ ψk,h(sh, ah) ψk,h(sh, ah) D ph( |sh, ah), ˆV k ˆ ψk,h+1( ) E s1, πk, p

h=1 E ˆψk,h(sh, ah) ψk,h(sh, ah) s1, πk, p

h=1 E D ˆpk,h( |sh, ah) ph( |sh, ah), ˆV k ˆ ψk,h+1( ) E s1, πk, p

h=1 E ψk,h(sh, ah) ψk,h(sh, ah) s1, πk, p

h=1 E bp k 1,h(sh, ah) log(A) + D ˆpk,h( |sh, ah) ph( |sh, ah), ˆV k ˆ ψk,h+1( ) E s1, πk, p .

Since G occurs, we have D ˆpk,h( |sh, ah) ph( |sh, ah), ˆV k ˆ ψk,h+1( ) E

ˆpk,h( |sh, ah) ph( |sh, ah) 1 ˆV k ˆ ψk,h+1( ) ˆpk,h( |sh, ah) ph( |sh, ah) 1 H log(A)

bp k 1,h(sh, ah) log(A)

by H older s, the truncation in the policy evaluation, and since G occurs. Plugging this into the inequality above,

ˆV k ˆ ψk V πk ψk

h=1 E 2bp k 1,h(sh, ah) log(A) s1, πk, p

and the rest of the proof follows exactly as in the proof of Lemma E.5, with an extra log(A) factor.

The following lemma allows us to control the total estimation error (a) from the optimistic model. Roughly speaking, it guarantees that the estimates are not too optimistic.

Lemma E.7 (Estimation error with regularization). Conditioned on G, for every K [K], we have (if τ 1)

ˆV k ˆzk V πk zk (2 + Iλmax)

S2AH4K + S3/2AH2 .

Proof. By Equations (22), (24) and (25), conditioned on G,

ˆV k ˆzk V πk zk

ˆV k ˆrk V πk r +

ˆV k ˆuk,i V πk ui + τ

ˆV k ˆ ψk V πk ψk

(1 + Iλmax + τ)

S2AH4K + S3/2AH2 ,

and τ 1. The last inequality holds due to Lemmas E.5 and E.6.

Truly No-Regret Learning in Constrained MDPs

Per-State Optimism In the following, we show per-state optimism for the optimistic model.

Lemma E.8 (State optimism ˆr, ˆu). Conditioned on the success event G, for all s, a, h, and k [K],

ˆQk ˆrk,h(s, a) + rh(s, a) + D ph( |s, a), ˆV k ˆrk,h+1( ) E 0,

ˆQk ˆuk,i,h(s, a) + uk,i,h(s, a) + D ph( |s, a), ˆV k ˆuk,i,h+1( ) E 0 (i [I])

Proof. For ˆrk, by Algorithm 2 we have

ˆQk ˆrk,h(s, a)

= min n H h + 1, ˆrk,h(s, a) + D ˆpk,h( |s, a), ˆV k ˆrk,h+1( ) Eo

= min H h + 1,

rk 1,h(s, a) + br k 1,h(s, a) + D ˆpk,h( |s, a), ˆV k ˆrk,h+1( ) E + bp k 1,h(s, a)

min 1, rk 1,h(s, a) + br k 1,h(s, a)

+ min n H h, D ˆpk,h( |s, a), ˆV k ˆrk,h+1( ) E + bp k 1,h(s, a) o ,

where we used min {a + b, c + d} min {a, c} + min {b, d}. Thus

ˆQk ˆrk,h(s, a) + rh(s, a) + D ph( |s, a), ˆV k ˆrk,h+1( ) E

min 1, rk 1,h(s, a) + br k 1,h(s, a) + rh(s, a)

min n H h, D ˆpk,h( |s, a), ˆV k ˆrk,h+1( ) E + bp k 1,h(s, a) o + D ph( |s, a), ˆV k ˆrk,h+1( ) E

= min 1 rh(s, a), rk 1,h(s, a) rh(s, a) + br k 1,h(s, a)

min H h D ph( |s, a), ˆV k ˆrk,h+1( ) E ,

D ˆpk,h( |s, a) ph( |s, a), ˆV k ˆrk,h+1( ) E + bp k 1,h(s, a)

= max rh(s, a) 1 | {z } (a)

, rk 1,h(s, a) + rh(s, a) br k 1,h(s, a) | {z } (b)

+ max (H h) + D ph( |s, a), ˆV k ˆrk,h+1( ) E

D ˆpk,h( |s, a) ph( |s, a), ˆV k ˆrk,h+1( ) E bp k 1,h(s, a) | {z } (d)

Now, for each of the four terms appearing in the maxima, conditioned on G, we have

(a) = rh(s, a) 1 0

by boundedness of the reward functions,

(b) = rk,h(s, a) + rh(s, a) br k 1,h(s, a) 0

since G occurs,

(c) = (H h) + D ph( |s, a), ˆV k ˆrk,h+1( ) E (H h) + 1 (H (h + 1) + 1) = 0

Truly No-Regret Learning in Constrained MDPs

by H older s and the truncation of our evaluation procedure, and finally

(d) = D ˆpk,h( |s, a) ph( |s, a), ˆV k ˆrk,h+1( ) E bp k 1,h(s, a)

ˆpk,h( |s, a) ph( |s, a) 1 (H (h + 1) + 1) bp k 1,h(s, a)

ˆpk,h( |s, a) ph( |s, a) 1 H bp k 1,h(s, a) 0

by H older s, the truncation of our evaluation procedure, since ˆpk = pk 1, and since G occurs. We are thus taking the minimum of non-positive terms, which shows

ˆQk ˆrk,h(s, a) + rh(s, a) + D ph( |s, a), ˆV k ˆrk,h+1( ) E 0.

The proof for ˆuk,i (i [I]) is identical.

Lemma E.9 (State optimism ˆψ). Conditioned on the success event G, for all s, a, h, and k [K],

ˆQk ˆ ψk,h(s, a) + ψk,h(s, a) + D ph( |s, a), ˆV k ˆ ψk,h+1( ) E 0.

Proof. For ˆψk,h(s, a) = ψk,h(s, a) + bp k 1,h(s, a) log(A), from Algorithm 2 we have

ˆQk ˆ ψk,h(s, a)

= min n ψk,h(s, a) + (H h + 1) log(A), ˆψk,h(s, a) + D ˆpk,h( |s, a), ˆV k ˆ ψk,h+1( ) Eo

=ψk,h(s, a)

+ min n (H h + 1) log(A), bp k 1,h(s, a) log(A) + D ˆpk,h( |s, a), ˆV k ψk,h+1( ) Eo .

ˆQk ˆ ψk,h(s, a) + ψk,h(s, a) + D ph( |s, a), ˆV k ˆ ψk,h+1( ) E

= min n (H h + 1) log(A), bp k 1,h(s, a) log(A) + D ˆpk,h( |s, a), ˆV k ˆ ψk,h+1( ) Eo

+ D ph( |s, a), ˆV k ˆ ψk,h+1( ) E

= min (H h + 1) log(A) D ph( |s, a), ˆV k ˆ ψk,h+1( ) E ,

bp k 1,h(s, a) log(A) + D ˆpk,h( |s, a) ph( |s, a), ˆV k ˆ ψk,h+1( ) E

= + max (H h + 1) log(A) + D ph( |s, a), ˆV k ˆ ψk,h+1( ) E

| {z } =:(a)

bp k 1,h(s, a) log(A) D ˆpk,h( |s, a) ph( |s, a), ˆV k ˆ ψk,h+1( ) E

| {z } =:(b)

For (a), first note that by the truncation in the policy evaluation (Algorithm 2)

ˆV k ˆ ψk,h+1(s) = D πk,h+1( |s), ˆQk ˆ ψk,h+1(s, ) E (by definition)

a πk,h+1(a|s) ψk,h+1(s, a)

a πk,h+1(a|s) (H (h + 1) + 1) log(A) (truncated update)

a πk,h+1(a|s) (H h) log(A)

(H h + 1) log(A). (26)

Truly No-Regret Learning in Constrained MDPs

(a) = (H h + 1) log(A) + D ph( |s, a), ˆV k ˆ ψk,h+1( ) E

(H h + 1) log(A) + (H h + 1) log(A)

by H older s and Equation (26). For (b),

(b) = bp k 1,h(s, a) log(A) D ˆpk,h( |s, a) ph( |s, a), ˆV k ˆ ψk,h+1( ) E

bp k 1,h(s, a) log(A) + ˆpk,h( |s, a) ph( |s, a) 1 (H h + 1) log(A)

bp k 1,h(s, a) log(A) + ˆpk,h( |s, a) ph( |s, a) 1 H log(A)

by H older s, and since ˆpk = pk 1 and G occurs.

Finally, the following lemma shows that we can discard term (c) in Lemma E.2. It guarantees that for every state, an optimistic Bellman-type inequality holds.

Lemma E.10 (State optimism ˆz). Conditioned on the success event G,

ˆQk ˆzk,h(s, a) + zk,h(s, a) + D ph( |s, a), ˆV k ˆzk,h+1( ) E 0.

Proof. By Equations (22) and (24), conditioned on G,

ˆQk ˆzk,h(s, a) + zk,h(s, a) + D ph( |s, a), ˆV k ˆzk,h+1( ) E

= ˆQk ˆrk,h(s, a) + rk,h(s, a) + D ph( |s, a), ˆV k ˆrk,h+1( ) E

i λk,i ˆQk ˆuk,i,h(s, a) + uk,i,h(s, a) + D ph( |s, a), ˆV k ˆuk,i,h+1( ) E

τ ˆQk ˆ ψk,h(s, a) + ψk,h(s, a) + D ph( |s, a), ˆV k ˆ ψk,h+1( ) E

where the last inequality holds due to Lemmas E.8 and E.9 and since all λk,i 0.

Value Function Bounds In the following, we bound the regularized value functions, which allows us to apply descent properties of the regularized primal-dual algorithm. Recall that g = u 1

Lemma E.11 (Value function bounds). For any s, a, h, we have

0 Qπk r+λT k u+τψk,h(s, a) τ log(πk,h(a|s)) + H(1 + Iλmax + τ log(A)) (27)

Moreover, for any s, a, h,

a πk,h(a|s) exp ηQπk r+λT k u+τψk,h(s, a) Qπk r+λT k u+τψk,h(s, a)2

A exp (ηH (1 + λmax I + τ log(A))) 2H2 (1 + Iλmax + τ log(A))2 + 2τ 2(64/e2)

and V πk g + τλk

I (H + τλmax)

Truly No-Regret Learning in Constrained MDPs

Proof. We prove the first inequality. Non-negativity is immediate. Moreover, for all s, a,

Qπk r+λT k u+τψk,h(s, a) =

Qπk r,h(s, a) + X

i λk,i Qπk ui,h(s, a) + τQπk ψk,h(s, a)

Qπk r,h(s, a) + X

i λk,i Qπk ui,h(s, a) + τ Qπk ψk,h(s, a)

H + Iλmax H + τEπk

h =h log(πk,h (ah |sh )) | sh = s, ah = a

We finish by bounding

h =h log(πk,h (ah |sh )) sh = s, ah = a

= log(πk,h(a|s)) + Eπk

h =h+1 log(πk,h (ah |sh )) ah = a, sh = s

= log(πk,h(a|s)) + Eπk

h =h+1 log(πk,h (ah |sh )) sh+1 ph( |s, a)

= log(πk,h(a|s)) + X

s dπk sh+1 ph( |s,a),h (s ) X

a πk,h (a |s ) log(πk,h (a |s ))

log(πk,h(a|s)) + X

s dπk sh+1 ph( |s,a),h (s ) log(A)

= log(πk,h(a|s)) + X

h =h+1 log(A)

log(πk,h(a|s)) + H log(A),

where we used the standard bound on the entropy in the third to last step, and we are considering unnormalized occupancy measures throughout.

For the second inequality, first note that (using (a + b)2 2a2 + 2b2)

Qπk r+λT k u+τψk,h(s, a)2 2H2 (1 + Iλmax + τ log(A))2 | {z } =:C1

+2τ 2 log2 1 πk,h(a|s)

Moreover, using Equation (27) we have

πk,h(a|s) exp ηQπk r+λT k u+τψk,h(s, a)

πk,h(a|s)1 ητ exp (ηH (1 + λmax I + τ log(A))) | {z } =:C2

We thus find, if ητ 1/4 < 1/2 (which is easily satisfied since we choose η, τ small), X

a πk,h(a|s) exp ηQπk r+λT k u+τψk,h(s, a) C1

a πk,h(a|s)1 ητ C2 C1

a πk,h(a|s)1/2 C2 C1

a πk,h(a|s)

!1/2 AC2 2 C2 1 1/2

Truly No-Regret Learning in Constrained MDPs

where the last inequality is Cauchy-Schwarz, and

a πk,h(a|s) exp ηQπk r+λT k u+τψk,h(s, a) 2τ 2 log2 1 πk,h(a|s)

a πk,h(a|s)1 ητ C2 2τ 2 log2 1 πk,h(a|s)

a πk,h(a|s)1/2 C2 2τ 2πk(a|s, h)1/4 log2 1 πk,h(a|s)

a πk,h(a|s)1/2 C2 2τ 2(64/e2)

AC2 2τ 2(64/e2),

where we used the fact that q1/4 log2(1/q) 64/e2 for q (0, 1), and Cauchy-Schwarz in the same manner as before. Adding up the previous two terms and plugging in the definitions of the constants yields the second inequality.

For the third inequality, we find (recalling that gh(s, a) [ 1, 1]I) V πk g + τλk V πk g + τ λk

concluding the proof.

Lemma E.12 (Truncated value function bounds). For all s, a, h, we have

0 ˆQk ˆzk,h(s, a) τ log(πk,h(a|s)) + H(1 + Iλmax + τ log(A)) (28)

a πk,h(a|s) exp η ˆQk ˆzk,h(s, a) ˆQk ˆzk,h(s, a)2

A exp (ηH (1 + λmax I + τ log(A))) 2H2 (1 + Iλmax + τ log(A))2 + 2τ 2(64/e2)

and ˆV k ˆuk c + τλk

I (H + τλmax)

Proof. Since the truncated value functions are bounded between 0 and the true value functions, the statement follows from Lemma E.11.

F. Regret Analysis In this section, we provide all proofs of the result from Section 5, leading to a regret bound of the regularized primal-dual algorithm (Algorithm 3).

We write for asymptotic inequality up to polylogarithmic terms. First, we note that the primal-dual updates indeed correspond to mirror descent updates.

Observation F.1. The closed-form expressions in Algorithm 1 are solutions to

max πh( |s) (A)

a A πh(a|s) ˆQk ˆzk,h(s, a) 1

η KL(πh( |s), πk,h( |s))

λT ˆV k ˆuk c + τλk + 1

2η λ λk 2 ,

respectively.

Truly No-Regret Learning in Constrained MDPs

This means that both the primal and dual variables are updated via mirror descent with different regularizers. Hence, we can make use of the descent lemmas for online mirror descent (we refer to Lemmas I.5 and I.8 in Appendix I.3). However, the value functions that serve as (surrogate) gradients are only estimates. Thus, the convergence proof for the potential function Φk that measures the distance from (πk, λk) to (π τ, λ τ) needs to take the estimation errors into account.

Lemma 5.1 (Regularized convergence). Let η, τ < 1 and λmax HΞ 1. With probability at least 1 δ, the iterates of Algorithm 1 satisfy

Φk+1 (1 ητ)kΦ1 + O ητ 1Cη,τ,Λ

+ ηλmax ISA1/2H2k1/2 + IS3/2AH2 ,

where Cη,τ,Λ is the same constant as in Lemma 4.1.

Proof. Condition on G, which occurs with probability at least 1 δ by Lemma E.1. We first decompose the k-th primal-dual gap as follows:

Lτ(π τ, λk) Lτ(πk, λ τ) = Lτ(π τ, λk) Lτ(πk, λk) | {z } (i)

+ Lτ(πk, λk) Lτ(πk, λ τ) | {z } (ii)

We first bound term (i):

(i) =Lτ(π τ, λk) Lτ(πk, λk)

=V π τ r+λT k g V πk r+λT k g (as dπ h(s, a) = dπ h(s)πh(a|s), and cancel λk 2)

s,a,h dπ τ h (s)π τ,h(a|s) log(π τ,h(a|s)) + τ X

s,a,h dπk h (s)πk,h(a|s) log(πk,h(a|s))

=V π τ r+λT k g+τψk V πk r+λT k g+τψk (since ψk,h(s, a) = log(πk,h(a|s)))

s,a,h dπ τ h (s)π τ,h(a|s) log(πk,h(a|s)) τ X

s,a,h dπk h (s)πk,h(a|s) log(πk,h(a|s))

s,a,h dπ τ h (s)π τ,h(a|s) log(π τ,h(a|s)) + τ X

s,a,h dπk h (s)πk,h(a|s) log(πk,h(a|s))

=V π τ r+λT k g+τψk V πk r+λT k g+τψk

s,a,h dπ τ h (s)π τ,h(a|s) log(πk,h(a|s))

s,a,h dπ τ h (s)π τ,h(a|s) log(π τ,h(a|s))

=V π τ r+λT k g+τψk V πk r+λT k g+τψk

s,h dπ τ h (s) X

a π τ,h(a|s) log π τ,h(a|s) πk,h(a|s)

=V π τ r+λT k g+τψk V πk r+λT k g+τψk τ X

s,h dπ τ h (s)KLk,h(s)

=V π τ r+λT k g+τψk V πk r+λT k g+τψk τKLk

=V π τ r+λT k u+τψk V πk r+λT k u+τψk τKLk (as g = u 1

=V π τ zk V πk zk τKLk

Truly No-Regret Learning in Constrained MDPs

Now by Lemma E.2, we have

V π τ zk V πk zk =( ˆV k ˆzk V πk zk ) (a)

h E D ˆQk ˆzk,h(sh, ), π τ,h( |sh) πk,h( |sh) E s1, π τ, p (b)

h E ˆQk ˆzk,h(sh, ah) + zk,h(sh, ah) + D ph( |sh, ah), ˆV πk ˆzk,h+1( ) E s1, π τ, p (c)

We leave term (a) as is and will sum over k later. For term (b), note that for all s, h, D ˆQk ˆzk,h(s, ), π τ,h( |s) πk,h( |s) E

KLk,h(s) KLk+1,h(s)

a πk,h(a|s) exp ˆQk ˆzk,h(s, a) ˆQk ˆzk,h(s, a)2 (Lemma I.6)

KLk,h(s) KLk+1,h(s)

2 1 H Dη,τ,Λ, (Lemma E.12)

Dη,τ,Λ = HA1/2 exp (ηH (1 + λmax I + τ log(A))) 2H2 (1 + Iλmax + τ log(A))2 + 2τ 2(64/e2)

and where we were able to apply Lemma I.6 by Observation F.1 and since ˆQk ˆzk,h(s, a) 0. Hence,

h E D ˆQk ˆzk,h(sh, ), π τ,h( |sh) πk,h( |sh) E s1, π τ, p = X

s,h dπ τ h (s) D ˆQk ˆzk,h(s, ), π τ,h( |s) πk,h( |s) E

s,h dπ τ h (s) KLk,h(s) KLk+1,h(s)

2 1 H Dη,τ,Λ

Term (c) is 0 by Lemma E.10, which applies since G occurs. Plugging in, we thus find

(i) =V π τ zk V πk zk τKLk

( ˆV k ˆzk V πk zk ) + KLk KLk+1

2Dη,τ,Λ + 0 τKLk

=( ˆV k ˆzk V πk zk ) + (1 ητ)KLk KLk+1

2Dη,τ,Λ. (30)

We now bound term (ii):

(ii) =Lτ(πk, λk) Lτ(πk, λ τ)

=V πk r+λT k g V πk r+(λ τ )T g + τ

2 λ τ 2 (cancel H(πk))

i (λk,i λ τ,i)V πk gi + τ

i (λk,i λ τ,i)(V πk ui ci + τλk,i) τ

i (λk,i λ τ,i)( ˆV k ˆuk,i ci + τλk,i) (a)

i (λk,i λ τ,i)(V πk ui ˆV k ˆuk,i) (b)

Truly No-Regret Learning in Constrained MDPs

To bound (a), we note that X

i (λk,i λ τ,i)( ˆV k ˆuk,i ci + τλk,i)

λ τ λk 2 λ τ λk+1 2

ˆV k ˆuk c + τλk 2 (Lemma I.8)

λ τ λk 2 λ τ λk+1 2

2D τ,Λ, (Lemma E.12)

with D τ,Λ = I(H + τλmax)2 and where we were able to apply Lemma I.8 by Observation F.1. We can bound (b) via

i (λk,i λ τ,i)(V πk ui ˆV k ˆuk,i) X

i λmax ˆV k ˆuk,i V πk ui .

Plugging in, we find

i (λk,i λ τ,i)(V πk ui ci + τλk,i) τ

λ τ λk 2 λ τ λk+1 2

i λmax ˆV k ˆuk,i V πk ui τ

=(1 ητ) λ τ λk 2 λ τ λk+1 2

i λmax ˆV k ˆuk,i V πk ui . (31)

Before proceeding, note that conditioned on G, we have

ˆV k ˆuk,i V πk ui 0,

ˆV k ˆzk V πk zk 0,

by Lemmas E.8 and E.10, respectively. Hence, we can treat these differences and their absolute values interchangeably in what follows.

From Lemma C.3 (with π = πk, λ = λk), we have 0 Lτ(π τ, λk) Lτ(πk, λ τ). Moreover, recall Φk = KLk + 1 2 λk λ τ 2, thus by Equations (30) and (31),

Φk+1 =KLk+1 + 1

2 λk+1 λ τ 2

(1 ητ)KLk + η2

2 Dη,τ,Λ + η( ˆV k ˆzk V πk zk ) η(i) (Equations (30) and (31))

+ (1 ητ) λk λ τ 2

2 D τ,Λ + η X

i λmax ˆV k ˆuk,i V πk ui η(ii)

(1 ητ)Φk + η2(Dη,τ,Λ + D τ,Λ)

+ η( ˆV k ˆzk V πk zk ) + η X

i λmax ˆV k ˆuk,i V πk ui (Def. Φk)

η ((i) + (ii))

(1 ητ)Φk + η2(Dη,τ,Λ + D τ,Λ)

+ η( ˆV k ˆzk V πk zk ) + η X

i λmax ˆV πk ˆuk,i V πk ui

η (Lτ(π τ, λk) Lτ(πk, λ τ)) (Equation (29))

(1 ητ)Φk + η2(Dη,τ,Λ + D τ,Λ) (as Lτ(π τ, λk) Lτ(πk, λ τ) 0)

+ η( ˆV k ˆzk V πk zk ) + η X

i λmax ˆV k ˆuk,i V πk ui .

Truly No-Regret Learning in Constrained MDPs

By induction and geometric series bound we find

Φk+1 (1 ητ)kΦ1 +

k =1 (1 ητ)k+1 k η2(Dη,τ,Λ + D τ,Λ)

k =1 (1 ητ)k+1 k

η( ˆV k zk V πk zk ) + η X

i λmax ˆV k ˆuk ,i V πk ui

(1 ητ)kΦ1 + 1

ητ η2(Dη,τ,Λ + D τ,Λ)

( ˆV k ˆzk V πk zk ) + X

i λmax ˆV k ˆuk ,i V πk ui

(1 ητ)kΦ1 + η

τ (Dη,τ,Λ + D τ,Λ)

+ O η ((2 + Iλmax) + Iλmax)

S2AH4k + S3/2AH2 ,

where we used Lemmas E.5 and E.7 (which apply since G occurs) in the final step. Finally, noting that Dη,τ,Λ + D τ,Λ O(Cη,τ,Λ) (see Theorem 4.1) and invoking λmax 1 concludes the proof.

Invoking Lemma 4.2, we can leverage the convergence of the potential function to show a sublinear regret bound for Algorithm 3.

Theorem 5.1 (Regret bound). Let τ = K 1/7, η = (H2I) 1ΞK 5/7, λmax = HΞ 1K1/14. Then with probability at least 1 δ, Algorithm 1 obtains a strong regret of

R(K; r) Cr K0.93, R(K; u) Cu K0.93,

where Cr, Cu = poly(S, A, H, I, Ξ 1, log(1/δ), log(K)) and K is the number of episodes.

Proof. The bound follows from Lemma 5.1 and Lemma 4.2.

Set r(k) := V π r V πk r

+ and gi(k) := V πk gi

+. Condition on the success event G, which happens with probability at least 1 δ by Lemma E.1.

We first consider the regret for the reward. Plugging Lemma 5.1 into Lemma 4.2 we find, using

b and 1 + x exp(x),

r(k) H3/2Φ1/2 1 exp ( ητk/2) (a)

1/2 C1/2 η,τ,Λ (b)

+ H3/2 ηλmax ISA1/2H2k1/2 1/2 (c)

+ H3/2 ηλmax S3/2AH2 1/2 (d)

+ τH log(A). (e)

We first show that we can ignore terms (b), (d), and (e) since they are o(K 13/14). For (b), note that, using the definitions of η, τ, λmax (and taking , and τ < 1)

C1/2 η,τ,Λ λmax H3/2A1/4I exp ((ηH (1 + λmax I + log(A)))/2) + I1/2 (H + τλmax)

λmax H3/2A1/4I exp (HI) 1ΞK 5/7 1 + H

Ξ K1/14I + log(A) + I1/2 H + HΞ 1K 1/14

λmax H3/2A5/4I exp(2) + I1/2 H + HΞ 1K 1/14

H5/2A5/4IΞ 1K1/14.

Truly No-Regret Learning in Constrained MDPs

1/2 = (H2I) 1/2Ξ1/2K( 5/7+1/7)/2 = (H2I) 1/2Ξ1/2K 2/7,

we thus have

(b) = H3/2 η

1/2 C1/2 η,τ,Λ poly(A, H, I, Ξ 1)K 3/14.

Hence, when summing (b) over k = 1, . . . , K, it only contributes o(K 13/14) to the regret. Similarly,

(d) =H3/2 ηλmax S3/2AH2 1/2 H3/2 S3/2AH2 1/2 K (9/2)/14,

(e) =τH log(A) = K 1/7H log(A).

Hence, when summing (d) and (e) over k = 1, . . . , K, they only contribute o(K 13/14) to the regret. We now turn to terms (a) and (c). For (a), using the standard inequality e x 1 x/2 (if 0 x 1) with x := ητ/2, we first find

exp( ητk/2) (1 ητ/4)k

and hence, (after summing (a) over k = 1, . . . , K)

k=1 H3/2Φ1/2 1 exp ( ητk/2) H3/2Φ1/2 1

k=1 (1 ητ/4)k

k=1 (1 ητ/4)k

=H3/2Φ1/2 1 4 ητ

=H3/2Φ1/2 1 4 (H2I) 1ΞK 5/7K 1/7

=4H7/2IΞ 1Φ1/2 1 K12/14.

Furthermore, since π1 plays actions uniformly at random and λ1 = 0, we have Φ1/2 1 (H log(A) + 1

2Iλ2 max)1/2 H1/2 log(A)1/2 + I1/2λmax = H1/2 log(A)1/2 + I1/2HΞ 1K1/14. Hence, the calculation above shows

(a) 4H7/2IΞ 1 H1/2 log(A)1/2 + I1/2HΞ 1K1/14 K12/14

4H4IΞ 1 log(A)1/2K12/14 + 4H9/2IΞ 2I1/2K13/14.

For (c), we find (after summing over k = 1, . . . , K)

k=1 H3/2 ηλmax ISA1/2H2k1/2 1/2 =

k=1 H3/2 (H2I) 1ΞK 5/7HΞ 1K1/14ISA1/2H2k1/2 1/2

K H3/2 (H2I) 1ΞK 5/7HΞ 1K1/14ISA1/2H2K1/2 1/2

=H3/2 (H2I) 1ΞHΞ 1ISA1/2H2 1/2 K1+( 5/7+1/14+1/2)/2

=H3/2 SA1/2H 1/2 K13/14

=H2S1/2A1/4K13/14.

Hence, summing up terms (a) to (e) over k = 1, . . . , K indeed yields the bound for the objective (using 13/14 0.93):

R(K; r) H9/2IΞ 2I1/2 + H2S1/2A1/4 K0.93.

Truly No-Regret Learning in Constrained MDPs

Next, we consider the regret for the constraints. Plugging Lemma 5.1 into Lemma 4.2 we find, using

ui(k) H3/2Φ1/2 1 exp ( ητk/2) (a )

1/2 C1/2 η,τ,Λ (b )

+ H3/2 ηλmax ISA1/2H2k1/2 1/2 (c )

+ H3/2 ηλmax S3/2AH2 1/2 (d )

+ τλmax + 1 λmax H2Ξ 1 (e )

+ 1 λmax τH log(A). (f )

Note that terms (a ), (b ), (c ), (d ) are identical to (a), (b), (c), (d). Summed up, they thus correspond to the same regret as the one for the reward. Moreover, for (f ) we have

(f ) = 1 λmax τ log(A) =ΞH 1 log(A)K 3/14,

and thus, we can ignore (f ) when summing from k = 1, . . . , K. Finally, for (e ), we have

(e ) =τλmax + 1 λmax H2Ξ 1

=K 1/7 HΞ 1K1/14 + H 1ΞK 1/14 H2Ξ 1

=H(1 + Ξ 1)K 1/14.

Thus, summing up (a ) to (f ) from k = 1 to K yields the regret for the constraints (using 13/14 0.93):

R(K; u) H9/2IΞ 2I1/2 + H2S1/2A1/4 + H(1 + Ξ 1) K0.93.

G. Examples and Simulation In this section, we expand on the description provided in Section 6.

G.1. Examples In this section, we provide examples to highlight crucial differences between the strong regret R and the weak regret Rweak.

A Minimal Example Consider the case in which an agent repeatedly has the option between three different investment strategies (o1, o2, o3). Each of them yields a respective reward (r1, r2, r3) = (1/5, 1, 3/5). There is an initial cost (u1, u2, u3) = (9/10, 1/10, 5/10) associated with each option, which must not exceed the budget c = 1/2 of the agent. If the budget is exceeded, the agent will be in debt, and a larger debt is associated with a higher risk. Clearly, we can model this scenario as an optimization problem maxx ([3]) P

i [3] xiri, subject to P

i [3] xiri c, where x describes the distribution of investments in the respective strategies. We can equivalently model this problem as a CMDP with one state s1, horizon H = 1 and three actions A = [3] (i.e., a constrained bandit), in which r1(s1, a) = ra, u1(s1, a) = 1 ua (a [3]) and c = 1 c. Here, o1 is highly profitable but too risky, o2 is less profitable but safe, and o3 is a compromise between both. Note that R(K, u) now measures the total amount of debt the algorithm accumulated during K episodes. A strategy A that always plays o3 will have a total debt of 0 since o3 does not violate the constraint. On the other hand, a strategy B that plays o1 and o2 in an alternating fashion will have a total debt of K/5, which is linear in K, despite having a weak regret of 0 due to the aforementioned cancellations. Both strategies, A and B, have the same accumulated objective. This simple example, which does not even require an unknown environment, illustrates why weak regret cannot be the right notion of safety during learning. We point to Calvo-Fullana et al. (2023); Moskovitz et al. (2023); Stooke et al. (2020) for other examples exhibiting similar behavior.

Truly No-Regret Learning in Constrained MDPs

A Slight Relaxation For practical purposes, one may consider the strong regret only for the constraint violations and the weak one for the objective. The reasoning behind this possible relaxation is that we may tolerate superoptimal performance with respect to the reward (when the algorithm violates the constraints) while we still do not tolerate additive negative constraint violations (as discussed). For example, an agent maximizing a wealth function may be allowed to obtain a higher return than any safe method, adding a negative term to the objective regret. In this case, we may want to allow compensating suboptimal returns by superoptimal ones, while we do not want to allow compensating unsafe strategies by strictly safe ones. However, this relaxation does not improve our theoretical results in their current form. It remains open whether, under this relaxation, stronger results are possible to obtain.

We argue that despite the possible relaxation, it is sensible to require the strong regret for both the violation and the objective, as done in this paper, which is also what Efroni et al. (2020) referred to. Indeed, one can think of settings in which one gets paid out the return of an episode only up until the limit that would be attainable subject to the constraints. For instance, if there is an illegal set of options, the controller of the environment may decide to pay out only so much as attainable when they are not being used (but may not know whether the illegal actions have, in fact, been used). In other words, the return is Wk = min{V πk r , V π r }. Hence, per-episode regret would be

V π r Wk = V π r min{V πk r , V π r } = [V π r V πk r ]+,

which is exactly the strong regret of the episode.

G.2. Experiment Details In this section, we report all details of the parameters, environments, and hardware used for our simulations.

Hyperparameters For the vanilla algorithms, we run for K = 4000 episodes for each step size η {0.05, 0.075, 0.1, 0.125, 0.15, 0.2}, which we observed to be a reasonable range across CMDPs when fixing the number of episodes. Similarly, for the regularized algorithms, we perform the same parameter search across all pairs of step size η {0.05, 0.1, 0.2} and regularization parameter τ {0.01, 0.02}, totaling a number of six hyperparameter configurations as well. We always set λmax = 6, which did not play a role in our simulations as long as it was chosen sufficiently large. We use exploration bonuses 0.08 nh(s, a) 1/2. For each hyperparameter configuration, we sample n = 5 independent runs with K episodes, obtain the regret curves and plot their average. For each algorithm, we then report the result for the best hyperparameter configuration in hindsight (with respect to the strong regrets).

Environment As described, we sample the rewards r uniformly at random and the constraints as c = (1 r) + βζ, for a Gaussian vector η RHSA and β = 0.1. We sample an environment with S = A = H = 5 according to the procedure above. Throughout seeds (and other CMDP sizes), this led to CMDPs in which the oscillations of the iterates and error cancellations can be observed. As argued, sampling constraints and rewards fully independently does not provide CMDPs that are interesting test beds. Indeed, unlike in random CMDPs, in real-life situations, we often observe goals that are explicitly conflicting with safety constraints, as otherwise, there is no need to encode them via a CMDP. For instance, consider a vehicle that aims to arrive fast but not go over the speed limit or cross the sidewalk. The latter would be the fastest option, but it conflicts with the constraint. In other words, the constraint and the reward function are negatively correlated.

All simulations were performed on a Mac Book Pro 2.8 GHz Quad-Core Intel Core i7. We provide the code in the supplementary material. For all experiments, we set the seed to 123.

G.3. Dual Approach In this section, we provide the pseudo-code for the regularized dual algorithm. One approach to solving the regularized min-max CMDP problem is to perform dual descent on the regularized Lagrangian. Evaluating the dual function then corresponds to entropy-regularized dynamic programming in an MDP, and the descent step on the dual function corresponds to a mirror descent step of the Lagrange multipliers on the regularized Lagrangian. Replacing the required value functions in this scheme with the optimistically estimated ones yields Algorithm 3. Note that here, we consider the estimated value functions without truncation. Set ˆHk(π) := V π ψπ,k, where ψπ,k,h(s, a) := log(πh(a|s)) + bp k 1,h(s, a) log(A).

Truly No-Regret Learning in Constrained MDPs

Algorithm 3 Regularized Dual Algorithm with Optimistic Exploration Require: Λ = [0, λmax]I, stepsize η > 0, regularization parameter τ > 0, number of episodes K, initial policy π1,h(a|s) = 1/A ( s, a, h), λ1 := 0 RI

for k = 1, . . . , K do

Update primal variable via regularized dynamic programming

πk = arg min π Π

V ˆpk,π ˆrk + λT k V ˆpk,π ˆuk + τ ˆHk(π)

Evaluate V ˆpk,πk ˆuk Update dual variables:

λk+1 = projΛ λk η(V ˆpk,πk ˆuk c + τλk) .

Play πk for one episode, update model estimates: ˆrk+1, ˆuk+1, ˆpk+1, ˆψk+1 end for

H. Difference Lemmas In the following, we recap the well-known performance difference and value difference lemma.

Lemma H.1 (Performance difference lemma). For all π, π Π and all r : S A R, we have

V π r V π r =Eπ

a Qπ r ,h(sh, a) (πh(a|sh) π h(a|sh))

s dπ h(s) X

a Qπ r ,h(s, a) (πh(a|s) π h(a|s))

s dπ h(s) Qπ r ,h(s, ), πh( |s) π h( |s) .

Proof. See Cai et al. (2020, Lemma 3.2) for the first equality. The second equality follows since we consider unnormalized occupancy measures. The third equality holds by definition of the inner product.

From Shani et al. (2020, Lemma 1):

Lemma H.2 (Extended value difference lemma (aka simulation lemma)). Let π, π be policies and M = (S, A, p, r), M = (S, A, p , r ) be MDPs. Let ˆQM h (s, a) be an approximation of the value function Qp,π r,h(s, a). Let ˆV M h (s) := D ˆQM h (s, ), πh( |s) E . Then

ˆV M(s1, 1) V p ,π

h=1 E D ˆQM h (sh, ), πh( |sh) π h( |sh) E s1; p , π

h=1 E ˆQM h (sh, ah) r h(s, a) D p h( |sh, ah), ˆV M h+1( ) E s1; p , π ,

where V p ,π

r ,1 (s1) is the value function of π in M .

Note that ˆQ need not correspond to a true value function under some model.

Truly No-Regret Learning in Constrained MDPs

I. Convex Optimization Background In this section, we review fundamental results from the optimization literature. All of these results are standard, and we include them for completeness. They are not novel by themselves nor specific to the sections in which we make use of them.

I.1. Convex Min-Max Optimization While the following results from min-max optimization are commonly used, we establish them here for our setup (both for completeness and due to the lack of a unifying resource for our case).

Setup Let X Rdx, Y Rdy be (nonempty) compact convex sets and let f : X Y R be a continuous and convex-concave function. Set f : X R, f(x) := maxy Y f(x, y), and f : Y R, f(y) := minx X f(x, y), which both exist by continuity of f on a compact domain.

Lemma I.1 (Existence minimax points). We have

inf x X max y Y f(x, y) = sup y Y min x X f(x, y), (32)

and the maximum and minimum are attained, i.e., there exist x X, y Y such that

f(x ) = inf x X max y Y f(x, y), (33)

f(y ) = sup y Y min x X f(x, y). (34)

Proof. The first equality holds due to Sion s Minimax Theorem (Sion, 1958). Note that the second part of the lemma is not immediate from Sion-like statements.

We shall prove that f is continuous. By compactness of X, this implies the existence of x . By symmetry, this also settles the existence of y (by repeating the argument for f). Thus, let x X and consider a sequence (xk)k in X such that xk x. We aim to show that f(xk) f(x), which would conclude the proof.

Let y arg maxy Y f(x, y), which exists by continuity. Then, for every k we have

f(xk) = max y Y f(xk, y ) f(xk, y) f(x, y).

Taking lim infk on both sides yields

lim inf k f(xk) f(x, y) = f(x). (35)

Assume by contradiction that lim supk f(xk) > f(x). Then we can pick δ > 0 such that lim supk f(xk) f(x) + δ. Thus we can pick a subsequence xn(k) and yn(k) arg maxy Y f(xn(k), y ) such that for all k,

f(xn(k)) f(x) + δ/2 f(x, yn(k)) + δ/2. (36)

Since Y is compact, by further picking a subsequence if needed, we can WLOG assume that there exists y Y such that yn(k) y. Then by Equation (36),

f(xn(k), yn(k)) = f(xn(k)) f(x, yn(k)) + δ/2.

Taking k and using continuity of f yields the contradiction f(x, y) f(x, y) + δ/2. We therefore must have lim supk f(xk) f(x) lim infk f(xk), proving f(xk) f(x). Thus, f is indeed continuous.

General Setup The statements in this paragraph all concern the following more general setup (dropping convex-concavity and boundedness of the domain). As we showed in the previous paragraph, all assertions made here hold in the continuous, convex-concave compactly constrained setup.

Let X Rdx, Y Rdy be (nonempty) closed sets and let f : X Y R be a continuous function. Consider f : X R { }, f(x) := maxy Y f(x, y) and f : Y R { }, f(y) := minx X f(x, y).

Truly No-Regret Learning in Constrained MDPs

Lemma I.2 (Min-max to saddle point). Let

x arg min x X max y Y f(x, y),

y arg max y Y min x X f(x, y),

and assume f(x ) = f(y ). Then (x , y ) is a saddle point, i.e., for all x X, y Y, we have

f(x , y) f(x , y ) f(x, y ).

Proof. We have

f(x , y) max y Y f(x , y ) = f(x ) = f(y ) = min x X f(x , y ) f(x , y ),

proving the first inequality, and for the second, we have

f(x, y ) min x X f(x , y ) = f(y ) = f(x ) = max y Y f(x , y ) f(x , y ).

Note that this proof does not require convexity or compactness.

Lemma I.3 (Saddle point to min-max). Let (x , y ) be a saddle point, i.e., for all x X, y Y, we have

f(x , y) f(x , y ) f(x, y ).

x arg min x X max y Y f(x, y),

y arg max y Y min x X f(x, y).

Proof. We first note that the assertion implies maxy f(x , y ) minx X f(x , y ). Hence

f(x ) = max y f(x , y ) min x X f(x , y ) min x X max y Y f(x , y ),

showing the claim for x . For y , we note that similarly,

f(y ) = min x X f(x , y ) max y Y f(x , y ) max y Y min x X f(x , y ),

concluding the proof.

Note that this proof does not require convexity or compactness.

Lemma I.4 (Invariance of saddle points). Let

x arg min x X max y Y f(x, y),

y arg max y Y min x X f(x, y),

and assume f(x ) = f(y ). Consider closed sets X X, Y Y. If (x , y ) X Y , then

x arg min x X max y Y f(x, y),

y arg max y Y min x X f(x, y).

Truly No-Regret Learning in Constrained MDPs

Proof. By Lemma I.2 (which applies since f(x ) = f(y )), (x , y ) is a saddle point for the minmax problem with domain X Y. Thus, since y Y Y, we have f(x , y ) = maxy Y f(x , y) = maxy Y f(x , y). Moreover, since X X and y Y , we have f(x , y ) = minx X f(x, y ) minx X f(x, y ) minx X maxy Y f(x, y). Hence maxy Y f(x , y) minx X maxy Y f(x, y), proving

x arg min x X max y Y f(x, y).

The proof for y follows by repeating the argument for f.

Note that this proof does not require convexity or compactness.

I.2. Constrained Convex Optimization We state some well-known results from constrained convex optimization that will be useful. The results are standard, and we refer, for example, to the work by Beck (2017).

Consider the (primal) optimization problem

f := min f(x)

s.t. g(x) 0 (37)

with the following assumptions.

Assumption I.1 (Assumption 8.41, Beck (2017)). In Equation (37),

(a) X Rn is convex

(b) f : Rn R is convex

(c) g( ) := (g1( ), . . . , gm( ))T with gi : Rn R convex

(d) Equation (37) has a finite optimal value f , which is attained by exactly the elements of X =

(e) There exists x X such that g( x) < 0

(f) For all λ Rm 0, minx X(f(x) + λT g(x)) has an optimal solution

In this setup, we define the dual objective as

q(λ) := min x X f(x) + λT g(x) ,

where L: Rn Rm R, L(x; λ) := f(x) + λT g(x) is the Lagrangian of the problem in Equation (37). The dual problem is then defined as

q := max q(λ)

In this setup, we have the following results connecting the primal and the dual problem.

Theorem I.1 (Theorem A.1, Beck (2017)). Under Assumption I.1, strong duality holds in the following sense: We have

and the optimal solution of the dual problem is attained, with the set of optimal solutions Λ = .

Proof. Proposition 6.4.4 of Bertsekas et al. (2003) gives a proof of the more general Theorem A.1 of Beck (2017). We remark that if we assume affine constraints g and X being a polytope, then we can drop assumption (e) (Beck, 2017, Theorem A.1).

Truly No-Regret Learning in Constrained MDPs

Theorem I.2. Suppose Assumption I.1 holds. Let x X , λ Λ and x X. Then

f(x) f(x ) + (λ )T g(x) 0.

Proof. We have

f(x) =f(x) + (λ )T g(x) (λ )T g(x)

q(λ ) (λ )T g(x) (definition of q( ))

=f(x ) (λ )T g(x) (since by Theorem I.1, q = f )

and rearranging this proves the claim. Again, we see that we can drop assumption (e) if we consider affine constraints g and a polytope X.

Theorem I.3. Under Assumption I.1, for all λ Λ and x as in (e), we have

λ λ 1 f( x) f

mini [m]( gi( x)).

Proof. The first relation holds since λ 0. We show the second relation as follows (cf. Beck (2017, Theorem 8.42)). We have

f(x ) =q(λ ) (Theorem I.1)

f( x) + (λ )T g( x) (definition of q( ))

f( x) + λ 1 max i [m] gi( x) (since λ 0)

=f( x) λ 1 min i [m]( gi( x))

and rearranging this proves the claim. We remark that for this theorem, we do need assumption (e), even in the affine case.

I.3. Online Mirror Descent Setup In the following, we consider a convex set X Rd and a non-empty closed convex set V X. Let ψ: X R be proper, closed, and strictly convex on V . Let Bψ : X int (X) R be the Bregman divergence associated with ψ. Define x A :=

x T Ax. Assume that

lim λ 0 ( ψ(x + λ(y x)))T (y x) = ( x bdry(X), y int (X)), or (38)

V int (X) . (39)

Consider the following descent lemma using local norms.

Lemma I.5 (MD descent lemma, Orabona (2019) Lemma 6.16). Suppose ψ is twice differentiable, with positive definite Hessian in the interior of its domain. Assume

x arg min x X g T x + 1

η Bψ( x, x), (40)

x arg min x V g T x + 1

η Bψ( x, x) (41)

exist. Then, for all x V , there exist z on the line segment7 between x and x , and z on the line segment between x and x such that

ηg T (x x ) Bψ(x , x) Bψ(x , x ) + η2

2 min n g 2 ( 2ψ(z)) 1 , g 2 ( 2ψ(z)) 1 o .

7The line segment between two vectors is the convex hull of the set containing those two vectors.

Truly No-Regret Learning in Constrained MDPs

We get the following descent lemma for exponentiated Q-ascent.

Lemma I.6. Let V := ([d]), and g Rd 0 =: X. Then x := arg max x X g T x 1

ηKL( x, x) and arg max x V g T x 1 ηKL( x, x) exist and are unique. Moreover, if g only has non-negative entries, then for all x V we have

g T (x x) KL(x , x) KL(x , x )

i=1 xig2 i .

Proof. Note that the negative entropy ψ(x) = P

i xi log(xi) is strictly convex and twice differentiable and satisfies Equation (38), as a short calculation reveals. Moreover, for p, q V , we have Bψ(p, q) = KL(p, q) (Orabona, 2019, Example 6.4). Existence and uniqueness are discussed in Orabona (2019).

Maximizing g T x 1

ηKL( x, x) is equivalent to minimizing ( g)T x + 1

ηKL( x, x), allowing us to apply Lemma I.5. Note that for z RI >0, 2ψ(z) = diag(1/z1, . . . , 1/zd). Thus, g ( 2ψ(z)) 1 = P

i zig2 i . Moreover, we have xi = x exp ( η( gi)) xi (Orabona, 2019), since gi 0, and thus by Lemma I.5, we can pick z x (componentwise) such that

g T (x x) =( g)T (x x )

KL(x , x) KL(x , x )

i=1 z ig2 i

KL(x , x) KL(x , x )

i=1 xig2 i .

Lemma I.7 (MD descent lemma, cf. Orabona (2019) Lemma 6.9). Let x V , g Rd, and η > 0. Assume further that ψ is µ-strongly convex w.r.t. some norm Rd in V . Then,

x = arg min x V g T x + 1

η Bψ( x, x) (42)

exists and is unique. Moreover, for all x V , the following inequality holds:

ηg T (x x ) Bψ(x , x) Bψ(x , x ) + η2

where is the dual norm associated with Rd.

We can deduce the descent lemma for projected gradient descent.

Lemma I.8 (Descent lemma PGD). Let x V , g Rd, and η > 0. Then x := arg min x V x T g + 1 2η x x 2 exists and is unique. Moreover, for all x V we have

g T (x x ) x x 2 x x 2

Proof. Note that ψ(x) = 1

2 x 2 is 1-strongly convex w.r.t. the L2 norm. Moreover, for a, b V , we have Bψ(a, b) = 1 2 a b 2 (Orabona, 2019, Example 6.4). Since the L2 norm is its own dual norm, applying Lemma I.7 to the minimization of g T x + 1 2η x x 2 yields the claim.

Finally, the following lemma shows that the updates of the algorithms indeed fall into the category of (online) mirror descent.

Truly No-Regret Learning in Constrained MDPs

Lemma I.9 ((Orabona, 2019)). Consider a compact set Y RI with y Y , and let x ([d]) for some d Z 1. Then, the closed-form expressions

x i = xi = xi exp (ηgi) P

i [d] xi exp (ηxi ) (i [d]),

y = proj Y (y ηg) ,

are the unique solutions to

max x ([d]) x T g 1

η KL( x, x),

min y Y y T g + 1

respectively.

Proof. For the primal variable, the derivation of exponentiated gradient is standard, see, e.g., Orabona (2019, Section 6.6).

For the dual variable, the derivation of projected gradient descent simply follows from the first-order optimality criterion and convexity of the objective.