# online_learning_in_risk_sensitive_constrained_mdp__2c361f7a.pdf Online Learning in Risk Sensitive constrained MDP Arnob Ghosh * 1 Mehrdad Moharrami * 2 We consider a setting in which the agent aims to maximize the expected cumulative reward, subject to a constraint that the entropic risk of the total utility exceeds a given threshold. Unlike the risk-neutral case, standard primal-dual approaches fail to directly yield regret and violation bounds, as value iteration with respect to a combined state-action value function is not applicable in the risk-sensitive setting. To address this, we adopt the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure and reformulate the problem by augmenting the state space with a continuous budget variable. We then propose a primal-dual algorithm tailored to this augmented formulation. In contrast to the standard approach for risk-neutral CMDPs, our method incorporates a truncated dual update to account for the possible absence of strong duality. We show that the proposed algorithm achieves regret of O Vg,max K3/4 + p H4S2A log(1/δ)K3/4 and constraint violation of O Vg,max p H3S2A log(1/δ)K3/4 with probability at least 1 δ, where S and A denote the cardinalities of the state and action spaces, respectively, H is the episode length, K is the number of episodes, α < 0 is the risk-aversion parameter, and Vg,max = 1 |α|(exp(|α|H) 1). To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem. 1. Introduction Reinforcement Learning (RL) has achieved remarkable breakthroughs across a wide range of domains, in- *Equal contribution 1Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, USA 2Department of Computer Science, University of Iowa, Iowa City, USA. Correspondence to: Arnob Ghosh . Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s). cluding human-level performance on classic Atari 2600 games (Mnih et al., 2015), mastering complex board games such as Go and Chess (Silver et al., 2016; 2017), tackling challenging robotic manipulation tasks (Levine et al., 2016; Haarnoja et al., 2018), optimizing control systems for autonomous vehicles (Sallab et al., 2017), advancing drug discovery by designing molecules with desired properties (Popova et al., 2018), and enhancing the capabilities of Large Language Models to achieve human-level decisionmaking in natural language understanding and generation tasks (Ouyang et al., 2022; Guo et al., 2025). These successes underscore the versatility of RL in high-dimensional, sequential decision-making problems. However, most of these methods focus on maximizing a single reward signal without regard to auxiliary performance or safety constraints, which are often critical in real-world applications. Constrained Markov Decision Processes (CMDPs) provide a principled way to incorporate such additional constraints into the decision-making process (Altman, 1999b). By extending the standard RL objective to include one or more cost functions, CMDPs enable practitioners to balance primary performance goals against other considerations such as safety (Garcıa & Fern andez, 2015), resource allocation (Zhao et al., 2022), or fairness (Jabbari et al., 2017). Various techniques have been proposed to solve CMDPs, including primal-dual methods (Efroni et al., 2020) and Lagrangian relaxation (Altman, 1999b; Borkar, 2005). More recently, it has been shown that CMDPs has zero duality gap under certain conditions (Paternain et al., 2019), and this result has been leveraged in a series of papers to derive regret bounds for model-free CMDPs (Ghosh et al., 2022; Ding et al., 2021). While these methods are effective for constraints that are linear in cost (e.g., requiring that the expected sum of certain costs remains below a threshold), many real-world tasks demand handling nonlinear, risk-sensitive constraints that capture higher-order statistics or worst-case scenarios (Chow et al., 2018). Such constraints are paramount in domains where even a single catastrophic event is unacceptable, such as in healthcare (Gottesman et al., 2019). However, incorporating risk-sensitive constraints into CMDPs introduces significant challenges, as the resulting problem often becomes highly nonlinear, and the strong duality properties that simplify solution methods for CMDPs may fail to hold. Motivated by this, we are Online Learning in Risk Sensitive constrained MDP interested in the following question: Can we achieve a provably efficient learning algorithm for CMDP problem with risk-sensitive constraints? In this work, we study a finite-horizon CMDP with an entropic risk-sensitive constraint. In particular, we consider that the agent seeks to maximize the expected cumulative reward subject to the constraint that the risk-sensitive utility associated with a policy remains above a prescribed threshold in an online learning setup. Solving this problem exactly is notoriously difficult due to its nonlinear nature and the potential failure of strong duality. Our Contributions: We show that the proposed algorithm, with probability at least 1 δ, achieves a regret bound of O Vg,max K3/4 + p H4S2A log(1/δ)K3/4 and a constraint violation bound of O Vg,max p H3S2A log(1/δ)K3/4 , where S and A denote the cardinalities of the state and action spaces, H the episode length, K the number of episodes, α < 0 the risk-aversion parameter, and Vg,max = 1 |α|(exp(|α|H) 1). To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem. The standard primal-dual approach may not apply in this setting, as the exponential Bellman equation for the composite state-action value function is invalid due to the non-linearity of the entropic risk measure where markovian policy is optimal. Moreover, we show that a Markovian policy might be suboptimal unlike the unconstraned entropic risk measure setting. To address this, we introduce an augmented state space by extending the original state-space with a continuous budget variable. We show that, in the risk-averse case (α < 0), the entropic risk associated with a policy π can be expressed as maxτ (τ + E[u(G(π) τ)]), where u(t) = 1 α(exp(αt) 1) and G(π) denotes the total utility under policy π. By reformulating the problem in the augmented state space, we can apply a value-iteration-based approach to compute the composite state-action value function. We then apply a primal-dual-based approach. However, since the entropic risk measure is not linear in the stateaction occupancy measure, strong duality may not hold. As a result, standard techniques used in the risk-neutral CMDP setting for bounding constraint violation (Ghosh et al., 2022; Efroni et al., 2020; Ding et al., 2021) are not directly applicable. To address this, we introduce a truncation variable ξ and show that it is crucial for controlling both regret and constraint violation. 1.1. Related Works In this section, we provide a brief overview of the most relevant related work. Specifically, we focus on risk-sensitive MDPs and constrained MDPs, as the use of entropic risk measures in constrained MDPs remains largely unexplored. Risk Sensitive RL: The study of risk sensitive RL traces its origins to economic theory, where risk was incorporated into utility functions to better capture decision-making under uncertainty (Hardy, 1923; von Neumann & Morgenstern, 1944; Luce & Raiffa, 1957; Arrow, 1965). Risk sensitive Markov decision processes were first introduced in (Howard & Matheson, 1972), where the standard cost function was replaced by an exponential transformation involving a risk sensitive parameter. This framework is closely aligned with the concepts of robustness (Dai Pra et al., 1996; Hernandez Hern andez & Marcus, 1996), as it balances utility maximization while accounting for the variance of returns (Whittle, 2002). The use of exponential utility functions effectively penalizes variability in outcomes, thereby mitigating risk in decision-making. Risk sensitivity entered the control and reinforcement learning literature in the early 2000s (Borkar, 2001; Borkar & Meyn, 2002; Borkar, 2002). Since then, significant efforts have been made to understand the behavior of exponential risk sensitive reinforcement learning, with notable recent contributions including (Fei et al., 2020; Murthy et al., 2023; Fei et al., 2024; Moharrami et al., 2024). Simultaneously, alternative risk measures such as Conditional Value-at-Risk (CVa R) gained traction (Artzner et al., 1999; Rockafellar & Uryasev, 2000; Wang et al., 2023; 2024). CVa R has become particularly valuable in reinforcement learning applications, as it focuses on optimizing expected rewards in worst case scenarios (Tamar et al., 2015; Zhao et al., 2024). In this work, we focus on solving the problem where the objective is to maximize the cumulative reward subject to the constraint the entropic risk measure associated with the utility value function is above a certain threshold. Hence, it is fundamentally different from the above work. Constrained RL: Constrained RL is one of the most active fields at the intersection of constraint optimization, MDPs, and RL. The earliest attempts to integrate constraints into MDP formulations date back to the 1970s (Kolesar, 1970), with more rigorous treatments appearing in the 1990s (Ross, 1989; Altman, 1999a). Optimization techniques for solving constraint problems, such as Lagrangian relaxation (Everett, 1963; Shapiro, 1979), and primal-dual (Efroni et al., 2020) approach have been used to address these challenges, often by converting the constrained problem into an unconstrained one through the use of Lagrange multipliers (Altman, 1998; Bertsekas, 2016). This framework was later adapted to RL (Geibel & Wysotzki, 2005; Uchibe & Doya, 2007; Zheng & Ratliff, 2020; Tessler et al., 2019; Ding Online Learning in Risk Sensitive constrained MDP et al., 2020; Ying et al., 2022). Since then, numerous studies have explored different formulations of constrained objectives within the RL framework, including constraints with auxiliary cost/reward functions (Achiam et al., 2017; Tessler et al., 2019; Ghosh et al., 2022), constraints on the quantile and distribution of returns (Jung et al., 2024), and constraints on risk measures (Borkar & Jain, 2014; Zhang et al., 2024). Compared to the risk-neutral constraints, we consider risk-sensitive constraints. Furthermore, to the best of our knowledge, this is the first work that provides the sub-linear regret and violation bound for risk-sensitive constraints setup. 2. Problem Formulation We consider the risk-sensitive CMDP within an episodic framework. The CMDP is defined as (S, A, P, H, r, g), where S (|S| = S) is the finite state space, A (|A| = A) is the finite action space, and H is the fixed episode length. The transition dynamics are governed by a set of transition probabilities P = {Ph}H h=1, where Ph(s | s, a) represents the probability of transitioning to state s from state s at time step h, upon taking action a. The reward and utility functions are denoted by r = {rh}H h=1 and g = {gh}H h=1, respectively, defined for each time step of the episode and assumed to be deterministic and in [0, 1]. Each episode k 0 starts at a fixed state s1. At each time step h, the agent observes the state sk h S and selects an action ak h A. The agent then receives a reward rh(sk h, ak h) and a utility gh(sk h, ak h). Finally, the MDP evolves to sk h+1 drawn from Ph( |sk h, ak h). The episode terminates at step H + 1. Without loss of generality, we assume that r H+1 = g H+1 = 0. In this paper, we consider the challenging scenario where the agent only observes rh(sk h, ak h) and gh(sk h, ak h) at the visited state-action pairs. The agent s history-dependent policy space is denoted by ΠHD = π = {πh( | )}H h=1 : πh( | sh, {sh , ah }h 1 h =1) (A), h [H] , where (A) denotes the probability simplex over the action space. We consider a total of K episodes. For each k K and h H, let πk h denote the policy at time step h of episode k. The reward value function is the expected total reward V π r,1(s) = Eπ[PH+1 h=1 rh(sh, ah)|s1 = s]. The constraint is based on the risk associated with the utility function. For risk-sensitive RL with the risk factor α, we denote the risk-sensitive state-action value function for the utility as Qπ g,1 : S A R. This function represents the expected cumulative utility, assessed using the entropic risk measure, under a policy π ΠHD, starting from the state-action pair (s, a), i.e., Qπ g,1(s, a) = 1 α log Eπ h eα PH+1 h =1 gh (sh ,ah ) s1 = s, a1 = a i . Similarly, the risk-sensitive value function for the utility associated with policy π, starting from state s is given by V π g,1(s) = 1 α log n Eπ[eα PH+1 h =1 gh (sh ,ah )|s1 = s] o For a Markov policy π, the state-action value function and the utility-based value function satisfy the following dynamic programming relations: Qπ g,h(s, a) = gh(s, a) + 1 α log PheαV π g,h+1(s, a) , V π g,h(s) = 1 α log eαQπ g,h(s, ), π( |s) . Our objective is to solve the following risk-sensitive constrained problem: max π ΠHD V π r,1(s1), subject to V π g,1(s1) B, (1) where B R is the lower bound on acceptable utility. The goal is to ensure that the entropic risk measure associated with the utility value function is above a certain threshold. We consider the setting where α < 0, indicating that the agent is risk-averse. Here, α denotes the agent s risk tolerance. As α 0, the agent approaches risk-neutral behavior and seeks policies whose expected cumulative utility is at least B. Conversely, as α , the agent becomes increasingly risk-averse and favors policies that yield outcomes with lower volatility and a greater certainty of meeting or exceeding the threshold B. Problems of this type are essential in various applications, including autonomous driving, financial investment, and modeling human behavior. Learning Metric: In this setting, we assume the agent does not have access to the transition probabilities and must learn in an online manner. The agent aims to minimize the following performance metrics: Regret(K) = k=1 V π r,1 (s1) V πk r,1 (s1), Violation(K) = k=1 (B V πk g,1 (s1)) where πk ΠHD is the policy deployed at episode k, and π represents the optimal feasible policy. The regret quantifies the sub-optimality gap, while the violation measures the extent of constraint violations. Online Learning in Risk Sensitive constrained MDP 3. Solution Methodology We begin by highlighting the challenges in solving the risksensitive CMDP compared to the standard CMDP. In the standard CMDP, which corresponds to the risk-neutral scenario (α = 0), the Lagrangian is solved to derive a bound on the learning metric (Ghosh et al., 2022): min λ 0 max π V π r,1(x) + λ(V π g,1(x) b) (2) Challenge in applying value iteration: In the standard CMDP setting, for a fixed λ, the optimal Markov policy π can be obtained using a standard dynamic programming approach as the problem reduces to an unconstrained MDP with a modified per-step reward of r + λg. Hence, a valuebased method can be applied to compute the composite value function. This approach is then used to iteratively update both the policy π and the dual variable λ. However, in the risk-sensitive setting, expanding the Lagrangian yields: rh(x, a) + Ph V π r,h+1(x, a) + λgh(x, a)+ λ α log PheαV π g,h+1(s, a) which prevents the use of standard dynamic programming to solve for the optimal policy at a fixed λ. Unlike the risk-neutral case, the Lagrangian here does not reduce to a problem with a simple modified reward r + λg. Lack of strong duality: For the standard CMDP, it has been shown that strong duality holds if a strict feasible policy (a.k.a. Slater s condition) (Paternain et al., 2019) exists. Thus, one can solve in the dual-domain. For standard CMDP problems, (Ghosh et al., 2022) demonstrate that primal-dual based approaches can achieve a O( K) regret and violation using the strong duality result. The key to obtain strong duality result in (Paternain et al., 2019) is that the value function is linear in the state-action occupancy measure. However, the risk-sensitive state-action value function is not linear with respect to the occupancy measure, and the risk-sensitive CMDPs may not satisfy strong duality. Consequently, traditional primal-dual-based algorithms are inadequate for obtaining the desired results. Absence of a Markov Optimal Policy: In contrast to standard CMDPs, the lack of strong duality in the risk-sensitive setting means that one cannot guarantee the existence of a Markov policy that solves the CMDP. The existence of such a policy is crucial for deriving finite-time bounds. In Appendix A, we present an example demonstrating that the optimal policy for a CMDP with a entropic risk constraint may not be even Markovian. 3.1. Certainty-Equivalence Representation To address these issues, we leverage the certaintyequivalence representation of the entropic risk measure (Ben-Tal & Teboulle, 2007). Specifically, for a function u( ), the Optimized Certainty Equivalent (OCE) associated with a π ΠHD is defined as OCEu(π, s) = maxτ n τ + E h u PH h=1 gh(sh, ah) τ s1 = s io In the risk-averse case (α < 0), the entropic risk measure can be recovered by setting u(t) = 1 α(eαt 1). For notational convenience, we use OCE(π) from this point onward, as the function u( ) and the initial state s1 is fixed. Lemma 3.1. For α < 0, we have OCE(π) = V π g,1(s1). Unlike the entropic risk measure, the OCE does not admit a dynamic programming formulation. As a result, the optimal policy that maximizes OCE(π) is generally historydependent. The solution is to consider an augmented MDP that extends the state space by introducing a scalar budget variable, which intuitively tracks the cumulative utility over time. This technique enables a reformulation of the problem into a state-based representation that is amenable to dynamic programming. This approach has been employed in the context of CVa R and spectral risk measures, and has more recently extended to the OCE framework (B auerle & Ott, 2011; Wang et al., 2024). We build on this formulation and extend it to our CMDP state-space. 3.2. Augmented CMDP Consider an augmented CMDP (Saug, A, Paug, H, r, gaug) by appending a budget variable to the state and modifying the underlying utility function (B auerle & Ott, 2011; Wang et al., 2024). More specifically, we augment the state space with a budget variable ch at horizon h defined by ch = τ Ph 1 h =1 gh (sh , ah ), where c1 = τ [0, H] is the initial budget provided by the algorithm, and ch denotes the remaining budget at time step h. Note that ch [ H, H]. The budget transition is known and deterministic: ch+1 = ch gh(sh, ah). We define the augmented utility function gaug h by gaug h (s, c, a) = 0 for h H, and gaug H+1(s, c, a) = u( c), where u(t) = 1 α(eαt 1). The choice of u(t) is dictated by the entropic risk measure. Note that the transition probability for the augmented CMDP problem is given by Paug h ( , c |s, c, a) = Ph( |s, a) for c = c gh(s, a), and Paug h ( , c |s, c, a) = 0, otherwise, as gh is deterministic. The agent focuses on Markov policies defined over the augmented state space, denoted by Πaug M = π = {πh( | )}H h=1 : πh( | sh, ch) (A), h [H] and c [ H, H] . Online Learning in Risk Sensitive constrained MDP For a Markov policy π in the augmented state space, abusing the notation, let Qπ g,h and V π g,h denote the augmented state-action value function and the augmented state-value function, respectively. The distinction between the risksensitive MDP and the augmented MDP is evident from the difference in the size of their state spaces. By definition, Qπ g,h(s, c, a)= h =h gaug h (sh , ch , ah ) sh = s, ch = c, ah = a , V π g,h(sh, ch)=E H+1 X h =h gaug h (sh , ch , ah ) sh = s, ch = c . Note that for a Markov policy π in the augmented state space, the augmented CMDP admits a linear state-action value function with respect to the utility function gaug h , in contrast to the risk-sensitive value functions Qπ g,h(s, a) and V π g,h(s) defined for a Markov policy π in the original state space S. This is achieved by shifting the utility to the terminal time step, i.e., V π g,H+1(s H, c H) = u( c H+1). The resulting additive structure is the key property that enables the analysis of Lagrangian relaxation. Furthermore, the function Vg,h(s, ) is bounded above by Vg,max = 1 |α|(exp(|α|H) 1). Finally, for a Markov policy π in the augmented MDP, the functions Qπ g,h and V π g,h satisfy standard dynamic programming equations: Qπ g,h(sh, ch, ah) = E V π g,h+1(sh+1, ch+1) | sh, ch, ah , V π g,h(sh, ch) = X a A π(a | sh, ch)Qπ g,h(sh, ch, a). Note that OCE selects the optimal trade-off between the initial budget and the value function of the augmented CMDP: OCE(π) = maxτ τ+V π g,1(s1, τ) . Given this relationship, we present the augmented risk-sensitive CMDP formulation, which is equivalent to the original problem (1). Since policies are defined over the augmented state space, we denote the reward value function as V π r,1(s1, τ). max π V π r,1(s1, ˆτ) subject to ˆτ = arg max{(τ + V π g,1(s1, τ))} (ˆτ + V π g,1(s1, ˆτ)) B. (Wang et al., 2024) studied an augmented formulation in the unconstrained setting for various risk measures. However, to the best of our knowledge, our work is the first to extend the augmented CMDP framework to the constrained setting. Unlike the unconstrained case, the constrained setting requires controlling both regret and constraint violation. We make the following assumption, which, in the risk-neutral CMDP setting, follows by Slater s condition (Paternain et al., 2019). Deriving an analogous condition for the augmented CMDP is left for future work. Assumption 3.2. There exists a Markov policy in the augmented state space that solves the augmented CMDP. Note that a complete history-dependent policy will be computationally more challenging, hence, we focus on Markovian policy class on the augmented state space. 4. Algorithm Considering a Lagrangian relaxation of (3), our goal is to solve the following optimization problem: min λ 0 max π max τ V π r,1(s1, τ)+λ (τ + V π g,1(s1, τ)) B . (4) Note that the order of maximization in the above formulation is interchangeable. Since the augmented value function V π g,1(s1, τ) satisfies a linear Bellman equation, the inner maximization, for a fixed τ and λ 0, can be efficiently solved using standard value iteration algorithms. In particular, for a fixed τ and λ 0, one can find the optimal policy that maximizes the composite state-action value function. The above approach addresses the primary challenge faced by the original risk-sensitive CMDP. However, the absence of strong duality remains a fundamental challenge. Specifically, the key issue is determining an appropriate tuning strategy for λ to ensure a bounded trade-off between policy regret and constraint violation. A natural approach to updating the Lagrange multiplier λ 0 is to use gradient descent. However, due to the lack of strong duality, the argument from (Ghosh et al., 2022) cannot be directly applied to establish its boundedness. In particular, the value function is still non-convex in terms of the augmented state-action occupancy measure. Hence, one can not apply the tool used in (Paternain et al., 2019) to prove strong duality. To address this issue, we consider a truncated dual update, capped at a threshold ξ to be specified later. This truncation is crucial for achieving both the regret and constraint violation bounds. Algorithm Description: We now describe our proposed Algorithm 1. The algorithm maintains an empirical estimate of the transition probabilities, which is updated as new data is observed (line 6). At each episode, this estimate is then used to compute an optimistic estimate of the reward state-action value function and the utility state-action value function, using a bonus term for every augmented state-action pair (lines 8 11). This bonus term enforces exploration. Next, the policy is updated via a greedy algorithm with respect to the estimated composite state-action value function given by the Lagrangian relaxation (line 13). We then update the reward and utility value functions using standard dynamic programming equations (line 14), which effectively solves the inner maximization of the Lagrangian relaxation for any budget c in the discretized budget space C and a fixed Online Learning in Risk Sensitive constrained MDP λk > 0. Subsequently, we obtain the optimal initial budget ˆτk using the current estimate of V πk g,1 (s1, ˆτ) and V πk r,1 (s1, ˆτ) to solve the inner maximization of (4) (line 16). Finally, we tune the dual variable via a gradient-descent update on λ: λk+1 =Proj[0,ξ] (λk + η(B (ˆτk + Vg,1(s1, ˆτk)))) where Proj[0,ξ] denotes the projection operator onto the interval [0, ξ]. In line 19, we execute the updated policy πk h using the optimized ˆτk as the initial budget. We track the evolution of the budget while executing the policy. Computational Challenge: One key challenge in optimizing τ is that the objective function is not concave in τ. To address this, we discretize the budget variable with resolution ϵ0, where ϵ0 is chosen to ensure polynomial-time complexity while preserving regret and violation guarantees. We denote the resulting discretized set of feasible budget values by C. Note that C [ H, H] and that |C| = 2H/ϵ0 . 5. Analysis 5.1. Main Result We now present our main theoretical contribution, which establishes high-probability sublinear bounds on both regret and constraint violation for Algorithm 1. Our analysis extends existing frameworks in risk-sensitive and constrained reinforcement learning. Theorem 5.1. With probability 1 δ, Algorithm 1 returns the policies {πk}K k=1 such that Regret(K) O Vg,max K3/4+ p H4S2A log(1/δ)K3/4 , Violation(K) O Vg,max K3/4p H3S2A log(1/δ) , where Vg,max = 1 |α|(exp(|α|H) 1), and O( ) hides logarithmic factors in S, A, H, and K. This is the first sublinear regret and violation result for the risk-sensitive constrained MDP setting. In the unconstrained, risk-neutral setting, the best known regret bound is O( p H3SAK log(1/δ)) (Azar et al., 2017). The additional HS factor in our bound arises from the use of an augmented state space; specifically, we must apply uniform, value-aware concentration bounds over the discretized augmented state space. The additional K1/4 factor arises from carefully balancing the discretization error with the size of the discretized budget set C. For the unconstrained risk-sensitive MDP setting, the regret bound is O(Vg,max p H2S2AK log(1/δ)), which is known to be tight (Fei et al., 2021). In our constrained setting, we achieve a violation bound of O Vg,max p H3S2A log(1/δ)K3/4 . The additional HK1/4 factor arises from the use of a discretized augmented state space and the introduction of the truncation threshold ξ, as strong duality may not hold. 5.2. Proof Outline We first derive the regret and constraint violation associated with the problem formulation in the augmented CMDP. We then discuss the relationship between the regret and violation in the augmented state space and those in the original CMDP. Note that the policy πk depends on the discretized initial budget ˆτk and the discretized utilities observed: Regretaug(K) = X k (V π r,1 (s1, τ ) V πk r,1 (s1, ˆτk)), Violationaug(K) = X k (B (ˆτk + V πk g,1 (s1, ˆτk))), (5) where π is the optimal feasible policy for the original risk-sensitive CMDP, ˆτk is given by Algorithm 1, and τ = arg maxτ(τ + V π g,1(s1, τ)). Note that the value of ˆτk determines the policy πk. Step 1: Let V k g,1 and V k r,1 denote the optimistic estimates of V πk g,1 and V πk r,1 , respectively, as computed by Algorithm 1 on line 12. We begin with the following observation: V π r,1 (x, τ ) V πk r,1 (x, ˆτk))+λ(B (ˆτk+V πk g,1 (x, ˆτk)) k (V k r,1(x, ˆτk) V πk r,1 (x, ˆτk)) k (V π r,1 (x, τ ) V k r,1(x, ˆτk))+ λk((τ +V π g,1(x, τ )) (ˆτk+V k g,1(x, ˆτk))) {z } T2 k (λ λk)(B (ˆτk + V k g,1(x, ˆτk))) k ((ˆτk + V k g,1(x, ˆτk)) (ˆτk + V πk g,1 (x, ˆτk))) where the inequality follows from the fact that (τ + V π g,1)(x) B. Note that when λ = 0, the above expression reduces to the regret. This same expression also suffices for bounding the violation, as we explain later. Step 2: We bound T3 by analyzing the truncated dual update for constraint violation. Online Learning in Risk Sensitive constrained MDP Algorithm 1 Constraint Risk Sensitive Value Iteration Algorithm Input: Number of episodes K Z+, discretized budget space C, confidence level δ (0, 1], risk parameter α < 0, utility lower bound B, learning rate η, initial policy π0, truncation threshold ξ, and Vg,max = 1 |α|(exp(|α|H) 1). 1: λ1 0 2: For all (s, ˆc, a) S C A, initialize Vg,H+1(s, ˆc) u( ˆc) and Vr,H+1(s, ˆc) 0 3: For all (h, s, ˆc, a) [H] S C A, set Qr,h(s, ˆc, a) H, Qg,h(s, ˆc, a) Vg,max 4: For all h [H], initialize dataset Dh 5: for episode k = 1 to K do 6: For each h [H], compute counts and empirical transitions: Nh(s, a, s ) Pk 1 i=1 1[(si h, ai h, si h+1) = (s, a, s )], Nh(s, a) max{1, P s Nh(s, a, s )}, and P k h (s | s, a) Nh(s,a,s ) Nh(s,a) 7: for step h = H to 1 do 8: for all (s, ˆc, a) S C A do 9: Bonr,h(s, a) 9H v u u u u u t S|C| log (HSAK|C|/δ) Nh(s, a) , Bong,h(s, a) 6Vg,max v u u u u u t S|C| log (HSAK|C|Vg,max/δ) 10: Qg,h(s, ˆc, a) min n Es P k h ( |s,a) [Vg,h+1(s , ˆc ϕ(gh(s, a)))] + Bong,h(s, a), Vg,max o 11: Qr,h(s, ˆc, a) min n rh(s, a) + Es P k h ( |s,a) [Vr,h+1(s , ˆc ϕ(gh(s, a))] + Bonr,h(s, a), H o 12: end for 13: For all (s, ˆc) S C, set πk h(a | s, ˆc) = 1 for some a arg maxa A (Qr,h(s, ˆc, a ) + λk Qg,h(s, ˆc, a )). 14: For all (s, ˆc) S C, update Vr,h(s, ˆc) Qr,h(s, ˆc, ), πk h( |s, ˆc) , Vg,h(s, ˆc) Qg,h(s, ˆc, ), πk h( |s, ˆc) . 15: end for 16: ˆτk = arg maxˆc C [Vr,1(s1, ˆc) + λk (ˆc + Vg,1(s1, ˆc))]. 17: λk+1 = Proj[0,ξ] (λk + η(B (ˆτk + Vg,1(s1, ˆτk)))) 18: for step h = 1 to H do 19: Take action ak h according to the greedy policy πk h starting from initial augmented state (s1, ˆτk). 20: Insert (sk h, ak h, sk h+1) into Dh 21: end for 22: end for Lemma 5.2. For any λ [0, ξ], we have k=1 (λ λk)(B (ˆτk + V k g,1(x, ˆτk))) λ2 2η + ηV 2 g,max K Step 3: We use optimistic estimates to bound T2. Lemma 5.3. With probability at least 1 δ, X k (V π r,1 (x, τ ) V k r,1(x, ˆτk))+ λk((τ + V π g,1(x, τ )) (ˆτk + V k g,1(x, ˆτk))) ϵ0Kξ. Proving the optimism is challenging due to the continuity of the initial budget τ. In the augmented MDP for the unconstrained CVa R problem, (Wang et al., 2023) leverages the fact that the greedy policy is optimal for the reward value function. In the unconstrained setting, one can bound the term |(P k h (s, a) Ph(s, a))T V π r,h( , c gh)| by showing that value function class on the greedy policy has small covering, as the Bellman operator is a contraction under the max operator. However, in the constrained setting considered here, the optimal policy is not necessarily greedy with respect to the composite state-action value function. As a result, this approach does not apply. Instead, we bound |(P k h (s, a) Ph(s, a))T V k j,h( , ˆc)| for j {g, r} using the uniform concentration bound from (Jin et al., 2020), and a discretization of the budget variable at resolution ϵ0, denoted by C (see Lemma D.1). The size of C appears in the uniform concentration bound and is reflected in the bonus term in line 8 of Algorithm 1. Specifically, given the boundedness of both the reward and utility value functions, we apply a standard ε-covering number argument to show that with high probability, at episode k, the composite value function of the optimal policy is bounded above by the optimistic estimate, up to a discretization gap of ϵ0λk. Step 4: The upper bound for T1 is established by leveraging the uniform concentration bound, applying Azuma s inequality, and invoking the Elliptical Potential Lemma. Lemma 5.4. With probability at least 1 δ, k (V k r,1(x, ˆτk) V πk r,1 (x, ˆτk)) HAK|C| log(K) log (HSAK|C|/δ). Online Learning in Risk Sensitive constrained MDP Step 5: We bound T4 using a similar argument to the one used for bounding T1. Lemma 5.5. With probability at least 1 δ, k ((ˆτk + V k g,1(x, ˆτk)) (ˆτk + V πk g,1 (x, ˆτk))) 14Vg,max S q HAK|C| log(K) log (HSA|C|Vg,max/δ) + 14Vg,max S log(K) p HAK|C| + Kϵ0He|α|H. Note that the learned policy πk is Markovian with respect to the discretized augmented state space, as it depends on the accumulated discretized utility rather than the exact cumulative utility. Consequently, πk is not Markovian with respect to the augmented state space which lies in the continuous space. This discrepancy is captured by the additional term in the upper bound of Lemma 5.5. Regret Bound: We derive the regret bound by combining the bounds for T1, T2, and T3, and by setting the parameters as λ = 0, η = K 1/4/Vg,max , ϵ0 = K 1/2, and ξ = K1/4. More specifically, the regret associated with the augmented CMDP is bounded by O Vg,max K3/4+ p H4S2A log(1/δ)K3/4 . Violation Bound: To bound the violation, we combine the bounds for T1, T2, and T3 for a fixed λ, by setting λ = K1/4, η = K 1/4/Vg,max , ϵ0 = K 1/2, and ξ = K1/4. In particular, for any fixed λ [0, ξ], we have k (V π r,1 (x, τ ) V πk r,1 (x, ˆτk))+λ(B (ˆτk+V πk g,1 (x, ˆτk))) H3S2A log(1/δ)K Note that P k(V π r,1 (x, τ ) V πk r,1 (x, ˆτk)) is trivially bounded by HK. Setting λ = ξ = K1/4, the violation associated with the augmented CMDP is bounded by O Vg,max p H3S2A log(1/δ)K3/4 . From Augmented CMDP to Original CMDP: As noted earlier, the policy πk, together with the initial budget ˆτk, can be interpreted as a history-dependent policy in the original CMDP. This policy depends on the current state and the accumulated discretized utility. To highlight the dependence of πk on ˆτk, we denote it by πk(ˆτk). Consequently, the reward associated with the policy πk(ˆτk) is given by V πk r,1 (s1, ˆτk) in both the original and the augmented CMDP. However, the entropic risk measure corresponding to this Figure 1. Evolution of the empirical reward V k r,1(s1, ˆτk), the empirical risk measure for utility ˆτk + V k g,1(s1, ˆτk), and the dual variable λ across iterations for α = 0.0001. Here, B = 2.2. All values are averaged over the most recent 100 episodes. policy, by Lemma 3.1, is given by τ + Eπk(ˆτk) h=1 gh(sh, ah) τ s1 ˆτk + Eπk(ˆτk) h=1 gh(sh, ah) ˆτk = ˆτk + V πk g,1 (s1, ˆτk). Therefore, we have Regretaug(K) = Regret(K) and Violation(K) Violationaug(K). 6. Simulation Results We simulate our proposed approach on a 5 5 Grid-World with two actions: and . In all simulations, we use K = 15, 000, ξ = K 1/4, and η = c K 1/4/Vg,max, where the coefficient c is linearly scaled down from 100 to 1 across episodes to improve the convergence rate (i.e., a larger η in earlier episodes). Further details of the simulation setup, including the reward, utility, and transition probabilities, are provided in Appendix I. Figure 1 illustrates the behavior of the algorithm for α = 0.0001 as a function of k. Initially, the policy violates the risk constraint and also underperforms in terms of reward. As the number of iterations increases, the empirical risk associated with the cumulative utility rises and eventually meets the threshold B = 2.2. The dual variable λ increases rapidly in the early stages to penalize constraint violations, then grows more slowly once the policy becomes feasible. Additionally, we observe that the optimistic estimates V k r,1(s1, ˆτk) and V k g,1(s1, ˆτk) are initially inflated due to exploration bonuses. As more data is gathered, these estimates converge to their empirical counterparts, and the gap between the estimated and empirical values gradually narrows. Table 1 presents the empirical reward values V emp r,1 (s1) and empirical risk values V emp g,1 (s1) for the average policy induced by the final 20 episodes after K = 15,000 iterations, Online Learning in Risk Sensitive constrained MDP Table 1. Empirical reward and risk from the average policy over the last 20 episodes after K = 15,000 iterations for various (α, B) pairs. ( 10 2, 2.2) ( 10 4, 2.2) ( 10 2, 2.6) ( 10 4, 2.6) ( 10 2, 2.9) ( 10 4, 2.9) V emp r,1 (s0) 1.95 2.80 1.90 2.24 1.92 1.88 V emp g,1 (s0) 2.76 2.20 2.80 2.57 2.78 2.80 across various choices of α and B. Decreasing B enlarges the feasible set, enabling the algorithm to attain higher rewards while satisfying the constraint. Conversely, for fixed B, decreasing α shrinks the feasible set, resulting in lower achievable rewards. Figure 3 (Appendix I) shows that as α decreases how the policy selects less risky actions compromising the reward. When |α| and B are large, the algorithm may fail to find a feasible policy, leading to constraint violations. Moreover, as |α| increases, the value of Vg,max also grows, which may impact both regret and constraint satisfaction. Thus, larger values of |α| may require more episodes (K) for a reliable performance evaluation. 7. Extension As noted, this is the first work to establish theoretical bounds for online learning in a CMDP with a risk-sensitive constraint. Our analysis naturally extends to several related settings, some of which we outline below. Extension to other risk constraints: In this paper, we consider a constraint requiring the entropic risk measure of the utility value function to exceed a specified threshold. Our analysis can be extended to other risk-sensitive constraints, such as Conditional Value at Risk (CVa R) with a hard threshold, by adopting a similar augmented formulation as in the unconstrained setting (Wang et al., 2024; Liang & Luo, 2024). A complete characterization of regret and constraint violation in these cases is left for future work. Extension to risk-sensitive objective: In this paper, we focus on the objective of maximizing the expected cumulative reward. However, our analysis can be extended to settings where the goal is to maximize the entropic risk measure of the reward value function. This extension requires augmenting the state space with two budget variables: τr for the reward and τg for the utility. The optimization proceeds in two stages: first, for each τr, we optimize over τg to evaluate the entropic risk of the utility; then, we optimize over τr to bound the entropic risk of the reward. A full characterization of this framework is left for future work. Extension to Multiple constraints: Our framework can be extended to accommodate multiple risk-sensitive constraints. Specifically, we augment the state space with multiple budget variables (τ1, . . . , τI), where I denotes the number of constraints. For the resulting augmented state space, we first compute the value functions associated with each budget, and then jointly optimize over the initial budgets (τ1, . . . , τI). However, this extension influences the regret and violation bounds, which will scale with the number of constraints. A complete characterization of this extended framework is left for future work. 8. Conclusion and Future Work In this work, we study the problem of maximizing cumulative reward subject to the constraint that the entropic risk measure of the utility function remains above a specified threshold. This formulation is particularly relevant in safetycritical applications. We consider an online learning setting in which an agent interacts with the environment over K episodes, aiming to minimize both regret and constraint violation. The problem presents significant challenges due to the non-linearity of the entropic risk value, which renders standard primal-dual methods difficult to analyze. To overcome this, we reformulate the problem by augmenting the state space with a budget variable. By optimizing over the initial budget, we show that the value function in the augmented formulation exactly recovers the entropic risk measure, allowing us to recast the problem into a structure amenable to value-based methods. Since strong duality may not hold in this setting, we introduce a truncated dual update. We prove that the resulting algorithm achieves a regret bound of O Vg,max K3/4 + p H4S2A log(1/δ)K3/4 and a constraint violation bound of O Vg,max p H3S2A log(1/δ)K3/4 , with probability at least 1 δ. To the best of our knowledge, this is the first result establishing theoretical bounds for the entropic riskconstrained MDP problem. An important direction for future work is determining whether tighter violation bounds, particularly with improved dependence on K, can be achieved. Recent approaches (Jiang & Ye, 2024; Dalal et al., 2018) to improve the sample complexity bound for risk-neutral CMDP setting can be useful in reducing the dependency. Extending the analysis to settings with infinite state spaces remains an open challenge. Another promising avenue is to consider environments where utilities and rewards are drawn from unknown distributions, introducing additional complexity in both learning and generalization. Another important future direction is to consider a stricter violation bound, such as no cancellation violation bound as considered in risk-neutral CMDP setting (Ghosh et al., 2024). Empirical evaluations for larger state-space, and for a more practical setup have been left for the future. Online Learning in Risk Sensitive constrained MDP Impact Statement This paper is theoretical in nature. The goal of this work is to advance the safe decision making using RL. We do not foresee any immediate or unique ethical concerns that require specific attention in this context. Achiam, J., Held, D., Tamar, A., and Abbeel, P. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML 17, pp. 22 31. JMLR, 2017. Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. Reinforcement learning: Theory and algorithms. https://rltheorybook.github.io/ rltheorybook_AJKS.pdf, 2022. Altman, E. Constrained markov decision processes with total cost criteria: Lagrangian approach and dual linear program. Mathematical methods of operations research, 48:387 417, 1998. Altman, E. Constrained Markov Decision Processes. Chapman & Hall/CRC, 1999a. Altman, E. Constrained Markov Decision Processes. CRC Press, 1999b. Arrow, K. J. Aspects of the theory of risk bearing. Yrjo Hahnsson Foundation, 1965. Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. Coherent measures of risk. Mathematical Finance, 9(3):203 228, 1999. Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International conference on machine learning, pp. 263 272. PMLR, 2017. Ben-Tal, A. and Teboulle, M. An old-new concept of convex risk measures: The optimized certainty equivalent. Mathematical Finance, 17(3):449 476, 2007. Bertsekas, D. P. Nonlinear Programming. Athena Scientific, 3rd edition, 2016. Borkar, V. A sensitivity formula for risk-sensitive cost and the actor critic algorithm. Systems & Control Letters, 44 (5):339 346, 2001. ISSN 0167-6911. Borkar, V. and Jain, R. Risk-constrained markov decision processes. IEEE Transactions on Automatic Control, 59 (9):2574 2579, 2014. Borkar, V. S. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294 311, 2002. Borkar, V. S. An actor-critic algorithm for constrained markov decision processes. Systems & Control Letters, 54(3):207 213, 2005. ISSN 0167-6911. Borkar, V. S. and Meyn, S. P. Risk-sensitive optimal control for markov decision processes with monotone cost. Mathematics of Operations Research, 27(1):192 209, 2002. B auerle, N. and Ott, J. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361 379, 2011. Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18 (167):1 51, 2018. Dai Pra, P., Meneghini, L., and Runggaldier, W. J. Connections between stochastic control and dynamic games. Mathematics of Control, Signals and Systems, 9:303 326, 1996. Dalal, G., Dvijotham, K., Vecerik, M., Hester, T., Paduraru, C., and Tassa, Y. Safe exploration in continuous action spaces. ar Xiv preprint ar Xiv:1801.08757, 2018. Ding, D., Zhang, K., Basar, T., and Jovanovic, M. Natural policy gradient primal-dual method for constrained markov decision processes. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 8378 8390. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ 5f7695debd8cde8db5abcb9f161b49ea-Paper. pdf. Ding, D., Wei, X., Yang, Z., Wang, Z., and Jovanovic, M. Provably efficient safe exploration via primal-dual policy optimization. In International conference on artificial intelligence and statistics, pp. 3304 3312. PMLR, 2021. Efroni, Y., Mannor, S., and Pirotta, M. Explorationexploitation in constrained mdps. ar Xiv preprint ar Xiv:2003.02189, 2020. Everett, H. Generalized lagrange multiplier method for solving problems of optimum allocation of resources. Oper. Res., 11(3):399 417, June 1963. ISSN 0030-364X. Fei, Y., Yang, Z., Chen, Y., Wang, Z., and Xie, Q. Risk-sensitive reinforcement learning: near-optimal risk-sample tradeoff in regret. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS 20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546. Online Learning in Risk Sensitive constrained MDP Fei, Y., Yang, Z., Chen, Y., and Wang, Z. Exponential bellman equation and improved regret bounds for risksensitive reinforcement learning. Advances in neural information processing systems, 34:20436 20446, 2021. Fei, Y., Yang, Z., Chen, Y., and Wang, Z. Exponential bellman equation and improved regret bounds for risk-sensitive reinforcement learning. In Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS 21, Red Hook, NY, USA, 2024. Curran Associates Inc. Garcıa, J. and Fern andez, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437 1480, 2015. Geibel, P. and Wysotzki, F. Risk-sensitive reinforcement learning applied to control under constraints. J. Artif. Int. Res., 24(1):81 108, July 2005. ISSN 1076-9757. Ghosh, A., Zhou, X., and Shroff, N. Provably efficient model-free constrained rl with linear function approximation. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS 22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088. Ghosh, A., Zhou, X., and Shroff, N. Towards achieving sublinear regret and hard constraint violation in model-free rl. In International Conference on Artificial Intelligence and Statistics, pp. 1054 1062. PMLR, 2024. Gottesman, O., Johansson, F., Komorowski, M., Faisal, A., Sontag, D., Doshi-Velez, F., and Celi, L. A. Guidelines for reinforcement learning in healthcare. Nature medicine, 25(1):16 18, 2019. Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. ar Xiv preprint ar Xiv:2501.12948, 2025. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 1861 1870, 2018. Hardy, C. O. Risk and Risk-bearing. University of Chicago Press, 1923. Hernandez-Hern andez, D. and Marcus, S. I. Risk sensitive control of markov processes in countable state space. Systems & Control Letters, 29(3):147 155, 1996. ISSN 0167-6911. Howard, R. A. and Matheson, J. E. Risk-sensitive Markov decision processes. Management Science, 18(7):356 369, 1972. Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1617 1626. PMLR, August 2017. Jiang, J. and Ye, Y. Achieving o(1/ϵ) sample complexity for constrained markov decision process. ar Xiv preprint ar Xiv:2402.16324, 2024. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Abernethy, J. and Agarwal, S. (eds.), Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 2137 2143. PMLR, 09 12 Jul 2020. Jung, W., Cho, M., Park, J., and Sung, Y. Quantile constrained reinforcement learning: a reinforcement learning framework constraining outage probability. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS 22, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9781713871088. Kolesar, P. A markovian model for hospital admission scheduling. Management Science, 16(6):B384 B396, 1970. Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334 1373, 2016. Liang, H. and Luo, Z.-Q. Bridging distributional and risksensitive reinforcement learning with provable regret bounds. Journal of Machine Learning Research, 25(221): 1 56, 2024. Luce, R. D. and Raiffa, H. Games and decisions: Introduction and critical surveys. Wiley New York, 1957. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540): 529 533, 2015. Moharrami, M., Murthy, Y., Roy, A., and Srikant, R. A policy gradient algorithm for the risk-sensitive exponential cost mdp. Mathematics of Operations Research, 2024. Murthy, Y., Moharrami, M., and Srikant, R. Modified policy iteration for exponential cost risk sensitive mdps. In Matni, N., Morari, M., and Pappas, G. J. (eds.), Online Learning in Risk Sensitive constrained MDP Proceedings of The 5th Annual Learning for Dynamics and Control Conference, volume 211 of Proceedings of Machine Learning Research, pp. 395 406. PMLR, 15 16 Jun 2023. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (Neur IPS), 2022. Paternain, S., Chamon, L., Calvo-Fullana, M., and Ribeiro, A. Constrained reinforcement learning has zero duality gap. Advances in Neural Information Processing Systems, 32, 2019. Popova, M., Isayev, O., and Tropsha, A. Deep reinforcement learning for de novo drug design. Science Advances, 4 (7):eaap7885, 2018. Rockafellar, R. T. and Uryasev, S. Optimization of conditional value-at-risk. Journal of risk, 2:21 42, 2000. Ross, K. W. Randomized and past-dependent policies for markov decision processes with multiple constraints. Operations Research, 37(3):474 477, 1989. Sallab, A. E., Abdou, M., Perot, E., and Yogamani, S. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70 76, 2017. Shapiro, J. F. Mathematical programming: structures and algorithms. Wiley, 1979. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484 489, 2016. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of go without human knowledge. Nature, 550(7676):354 359, 2017. Tamar, A., Glassner, Y., and Mannor, S. Optimizing the cvar via sampling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI 15, pp. 2993 2999. AAAI Press, 2015. ISBN 0262511290. Tessler, C., Mankowitz, D. J., and Mannor, S. Reward constrained policy optimization. In International Conference on Learning Representations, 2019. Uchibe, E. and Doya, K. Constrained reinforcement learning from intrinsic and extrinsic rewards. In 2007 IEEE 6th International Conference on Development and Learning, pp. 163 168, 2007. von Neumann, J. and Morgenstern, O. Theory of Games and Economic Behavior (60th Anniversary Commemorative Edition). Princeton University Press, 1944. Wang, K., Kallus, N., and Sun, W. Near-minimaxoptimal risk-sensitive reinforcement learning with cvar. In International Conference on Machine Learning, pp. 35864 35907. PMLR, 2023. Wang, K., Liang, D., Kallus, N., and Sun, W. Risk-sensitive rl with optimized certainty equivalents via reduction to standard rl. ar Xiv preprint ar Xiv:2403.06323, 2024. Whittle, P. Risk sensitivity, a strangely pervasive concept. Macroeconomic Dynamics, 6(1):5 18, 2002. doi: 10. 1017/S1365100502027025. Ying, D., Ding, Y., and Lavaei, J. A dual approach to constrained markov decision processes with entropy regularization. In International Conference on Artificial Intelligence and Statistics, pp. 1887 1909. PMLR, 2022. Zhang, Q., Leng, S., Ma, X., Liu, Q., Wang, X., Liang, B., Liu, Y., and Yang, J. Cvar-constrained policy optimization for safe reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 12, 2024. Zhao, X., Song, W., Li, Q., Shi, H., Kang, Z., and Zhang, C. A deep reinforcement learning approach for resource-constrained project scheduling. In 2022 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1226 1234, 2022. Zhao, Y., Zhan, W., Hu, X., fung Leung, H., Farnia, F., Sun, W., and Lee, J. D. Provably efficient CVar RL in lowrank MDPs. In The Twelfth International Conference on Learning Representations, 2024. Zheng, L. and Ratliff, L. Constrained upper confidence reinforcement learning. In Bayen, A. M., Jadbabaie, A., Pappas, G., Parrilo, P. A., Recht, B., Tomlin, C., and Zeilinger, M. (eds.), Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pp. 620 629. PMLR, 10 11 Jun 2020. Online Learning in Risk Sensitive constrained MDP A. Limitations of Markov Policies under Entropic Risk Constraint Here, we present a simple example that illustrates the limitations of Markov policies in a CMDP with an entropic risk measure. In particular, the optimal policy in this example does not lie within the class of Markov policies. This demonstrates that Markov policies can be suboptimal, underscoring the need to extend the policy search to history-dependent policies in our framework. Consider a CMDP (S, A, P, H, r, g), where the state space is S = {s1, s2, s 2, s3}, the action space is A = {a, b}, and the horizon is H = 3. The transition probabilities are depicted in Figure 2 and are independent of the actions taken at the states. The reward and utility functions are zero everywhere except at the following state-action pairs: r(s2, ) = 1, g(s 2, ) = 1, r(s3, a) = 1, g(s3, b) = 1. Notice that action a maximizes the expected reward, while action b maximizes the entropic risk measure. Let the lower bound on acceptable utility be B = 1 in (1). Consider a history-dependent policy that selects action b if and only if the state at h = 1 is s1. It is easy to verify that this policy is feasible, and the resulting expected reward is 1. Figure 2. Transition diagram for the CMDP. Next, we show that any Markov policy is suboptimal. Let π = (π0, π1, π2) denote a Markov policy. Observe that the only relevant components are π2(a | s2) and π2(b | s2). Define p = π2(a | s2). Then we have: V π r,1(s1) = 1 V π g,1(s1) = 1 where α < 0 is the risk factor, and the last inequality follows by Jensen s inequality. Observe that for any p < 1 2, the resulting policy is suboptimal since V π r,1(s1) < 1. On the other hand, for any p 1 2, we have V π g,1(s1) < 1, and thus the resulting policy is infeasible. B. Proof of Lemma 3.1 From the definition, OCE(π) = maxτ τ + E h u PH h=1 gh(sh, ah) τ i , where u(t) = 1 α(eαt 1). Note that for α < 0, the function u(t) is strictly concave, hence, the unique maximizer can be found using the first order condition: α log E exp α h=1 gh(sh, ah) . The result follows by substituting τ in the definition of OCE(π). Online Learning in Risk Sensitive constrained MDP C. Proof of Lemma 5.2 Note that for λ [0, ξ], we have |λk+1 λ|2 = |Proj[0,ξ](λk + η(B (ˆτk + Vg,1(s1, ˆτk)))) Proj(λ)|2 |λk + η(B (ˆτk + Vg,1(s1, ˆτk))) λ|2 |λ λk|2 2η(B (ˆτk + Vg,1(s1, ˆτk)))(λ λk) + η2V 2 g,max where the first inequality follows from the non-expansiveness of the projection operator, and the second from the fact that |B (ˆτk + Vg,1(s1, ˆτk))| Vg,max. Summing over k then yields 0 |λK+1 λ|2 |λ1 λ|2 X k 2η(B (ˆτk + Vg,1(s1, ˆτk)))(λ λk) + Kη2V 2 g,max Since λ1 = 0, it follows that k=1 (λ λk)(B (ˆτk + V k g,1(x, ˆτk))) λ2 2η + ηV 2 g,max K D. Proof of Lemma 5.3 Recall that π = {π h}H h=1 denotes the optimal feasible policy that solves the optimization problem (1). By assumption, there exists a corresponding Markovian policy in the augmented state space that solves the optimization problem (3). With slight abuse of notation, we also denote this Markovian policy by π = {π h}H h=1, where each π h : S [ H, H] A. As noted earlier, the budget variable ch in the augmented MDP at time h 1 lies in the interval [ H, H]. To improve computational efficiency and address the challenge of optimizing the initial budget ˆτk, we discretize both the initial budget and the utility values received at each state. Consequently, the learned policy πk = {πk h}H h=1 depends on the accumulated discretized utility rather than the exact cumulative utility. As a result, πk is not Markovian with respect to the augmented state space, but instead is Markovian with respect to the discretized augmented state space, where all utility values are replaced by their discretized counterparts. Specifically, the budget variable in the discretized augmented state space is restricted to a finite set C, which is an ϵ0-resolution discretization of the interval [ H, H]. Thus, the cardinality of C is given by Finer discretization improves the approximation accuracy at the cost of increased computational complexity. The discretization operator ϕ : [ H, H] C projects a real-valued budget onto the nearest larger discretized value: ϕ(c) = arg min ˆc C, ˆc c |ˆc c|. For notational consistency, all hatted symbols denote discretized values. In particular, we use ˆgh(s, a) as a shorthand for ϕ(gh(s, a)). Lemma D.1. For all (s, a) S A, ˆc C, h H, k K, and j {r, g} P k h ( |s, a) Ph( |s, a) T V k j,h( , ˆc) Bonk j,h(s, a) with probability at least 1 δ, where Bonk j,h(s, a) = 6Vj,max q S|C| log (HSAK|C|Vj,max/δ)/N k h(s, a), Vr,max = H, and Vg,max = exp(|α|H) 1 Online Learning in Risk Sensitive constrained MDP Lemma D.2. For all (s, a) S A, c [ H, H], ˆc C, h H, and k K, and j {r, g}, the following holds with probability at least 1 δ: Qπ j,h(s, c, a) Qk j,h(s, ˆc, a) (Ph( | s, a))T V π j,h+1 ( , c gh(s, a)) V k j,h+1 ( , ˆc ˆgh(s, a)) . Qπ j,h(s, c, a) Qk j,h(s, ˆc, a) = (Ph( | s, a))T V π j,h+1 ( , c gh(s, a)) Bonk j,h(s, a) P k h ( | s, a) T V k j,h+1 ( , ˆc ˆgh(s, a)) = (Ph( | s, a))T V π j,h+1 ( , c gh(s, a)) V k j,h+1 ( , ˆc ˆgh(s, a)) Bonk j,h(s, a) P k h ( | s, a) Ph( | s, a) T V k j,h+1 ( , ˆc ˆgh(s, a)) (Ph( | s, a))T V π j,h+1 ( , c gh(s, a)) V k j,h+1 ( , ˆc ˆgh(s, a)) Bonk j,h(s, a) + Bonk j,h(s, a), where the last inequality follows by Lemma D.1. Lemma D.3. For all (s, a) S A, c [ H, H], ˆc C with ˆc c, h H, and k K, the following holds with probability at least 1 δ: Qπ r,h(s, c, a) + λk Qπ g,h(s, c, a) Qk r,h(s, ˆc, a) λk Qk g,h(s, ˆc, a) 0 Proof. We prove the claim by induction. The base case holds trivially for h = H + 1, since Qπ r,H+1(s, c, a) = Qk r,H+1(s, ˆc, a) = 0 and Qπ g,H+1(s, c, a) = u( c) u( ˆc) = Qk g,H+1(s, ˆc, a), where the inequality follows from the monotonicity of u( ). For the induction step, assuming the statement holds for horizon h + 1, we will verify it for horizon h: Qπ r,h(s, c, a) + λk Qπ g,h(s, c, a) Qk r,h(s, ˆc, a) λk Qk g,h(s, ˆc, a) (Ph( | s, a))T V π r,h+1( , c gh(s, a)) V k r,h+1( , ˆc ˆgh(s, a)) + λk (Ph( | s, a))T V π g,h+1 ( , c gh(s, a)) V k g,h+1 ( , ˆc ˆgh(s, a)) where the first inequality follows from Lemma D.2, and the last inequality follows from the induction hypothesis, the assumption that ˆc c, and the inequality ˆgh(s, a) gh(s, a). Notice that by the definition of πk, V π r,h+1 (s, c) + λk V π g,h+1 (s, c) (V k r,h+1 (s, ˆc) + λk V k g,h+1 (s, ˆc)) a A π (a|s, c) Qπ r,h+1(s, c, a) + λk Qπ g,h+1(s, c, a) Qk r,h+1(s, ˆc, a) λk Qk g,h+1(s, ˆc, a) . As an immediate corollary of Lemma D.3, the following inequality holds with probability at least 1 δ: V π r,1 (s1, τ ) + λk τ + V π g,1(s1, τ ) V k r,1(s1, ˆτk) λk ˆτk + V k g,1(s1, ˆτk) V π r,1 (s1, τ ) + λk τ + V π g,1(s1, τ ) V k r,1(s1, ϕ(τ ) ϵ0) λk ϕ(τ ) ϵ0 + V k g,1(s1, ϕ(τ ) ϵ0) where we use the fact that the initial budget at episode k is given by ˆτk = arg maxˆc C V k r,1(s1, ˆc) + λk ˆc + V k g,1(s1, ˆc) , and the inequality ϕ(τ ) ϵ0 < τ ϕ(τ ). Finally, observe that λk ξ. Online Learning in Risk Sensitive constrained MDP E. Proof of Lemma 5.4 and Lemma 5.5 As noted earlier, the policy πk = {πk h} is Markovian with respect to the discretized augmented state space, but historydependent with respect to the augmented state space. However, note that V πk g,H+1 depends on the cumulative utility, while V k g,H+1 depends on the cumulative discretized utility. More specifically, πk h( | sh, ch, {sh , ah , ch }h 1 h =1) = πk h h =1 ϕ(gh (sh , ah )) V k g,H+1({sh , ah , ch }H h =1) = u h =1 ϕ(gh (sh , ah )) V πk g,H+1({sh , ah , ch }H h =1) = u h =1 gh (sh , ah ) In particular, the value of V πk g,h depends on both the cumulative discretized utility (since the policy πk depends on it) and the cumulative utility (since V πk g,H+1 depends on it). For notational simplicity, we retain only the cumulative discretized utility in the notation and explicitly acknowledge the dependence on the cumulative utility when referring to V πk g,H+1. In what follows, we adopt the convention of reserving the ˆ symbol exclusively for discretized values. Lemma E.1. The following inequality hold with probability at least 1 δ: V k r,1(s, ˆτk) V πk r,1 (s, ˆτk) 2 h=1 Eπk[Bonk r,h(sh, ah) | s1 = s, ˆc1 = ˆτk, Hk] V k g,1(s, ˆτk) V πk g,1 (s, ˆτk) 2 h=1 Eπk[Bonk g,h(sh, ah) | s1 = s, ˆc1 = ˆτk, Hk] + ϵ0He|α|H Proof. For all s S, ˆc C, h H, k K, and j {r, g} the following holds with probability at least 1 δ: V k j,h(s, ˆc) V πk j,h (s, ˆc) = X a A πk(a|s, ˆc) Qk j,h(s, ˆc, a) Qπk j,h(s, ˆc, a) a A πk(a|s, ˆc) Bonk j,h(s, a) + P k h ( |s, a) T V k j,h+1( , ˆc ˆgh(s, a)) Qπk j,h(s, ˆc, a) = Ea πk( |s,ˆc) h Bonk j,h(s, a) + P k h ( |s, a) Ph( |s, a) T V k j,h+1( , ˆc ˆgh(s, a)) i a A πk(a|s, ˆc) (Ph( |s, a))T V k j,h+1( , ˆc ˆgh(s, a)) V πk j,h+1( , ˆc ˆgh(s, a)) 2Ea πk( |s,ˆc) Bonk j,h(s, a) +Eπk h V k j,h+1(sh+1, ˆch+1) V πk j,h+1(sh+1, ˆch+1) sh = s, ˆch = ˆc i , where we used the fact that πk is Markovian with respect to the discretized augmented state space. Hence, we have V k r,1(s, ˆτk) V πk r,1 (s, ˆτk) 2 h=1 Eπk[Bonk r,h(sh, ah) | s1 = s, c1 = ˆτk, Hk], V k g,1(s, ˆτk) V πk g,1 (s, ˆτk) 2 h=1 Eπk[Bonk g,h(sh, ah) | s1 = s, c1 = ˆτk, Hk] h =1 ˆgh (sh , ah ) h =1 gh (sh , ah ) ! s1 = s, c1 = ˆτk, Hk where Hk is the trajectory from episodes 1, 2, , k 1. The result follows from the fact that u(t), for t [ H, H], is a Lipschitz function with Lipschitz constant e|α|H. Online Learning in Risk Sensitive constrained MDP Combining Lemma E.1 with Elliptical Potential Lemma, we have k=1 V k r,1(s, ˆτk) V πk r,1 (s, ˆτk) 2 h=1 Eπk[Bonk r,h(sh, ah) | s1 = s, ˆc1 = ˆτk] + 2H p HK log(1/δ) v u u u u u t S|C| log (HSAK|C|/δ) Nh(sk h, ak h) s1 = s, ˆc1 = ˆτk HK log(1/δ) HAK|C| log(K) log (HSAK|C|/δ) Using a similar argument, k=1 V k g,1(s, ˆτk) V πk g,1 (s, ˆτk) 2 h=1 Eπk[Bonk g,h(sh, ah) | s1 = s, ˆc1 = ˆτk] + Kϵ0He|α|H + 2Vg,max p HK log(1/δ) 14Vg,max S q HAK|C| log(K) log (HSAK|C|Vg,max/δ) + Kϵ0He|α|H F. Bounding the Regret and Violation Bounding the Regret: We bound the regret by combining Lemmas 5.2, 5.3, and 5.4, and by setting λ = 0, η = K 1/4/Vg,max , ϵ0 = K 1/2, and ξ = K1/4: Regretaug(K) ηV 2 g,max K + ϵ0K5/4 + 20HS p HAK|C| log(K) log (HSAK|C|/δ) = O e|α|H 1 H4S2A log(1/δ)K3/4 . Bounding the Violation: We bound the violation by combining Lemmas 5.2, 5.3, 5.4, and 5.5 and by setting λ = 0, η = K 1/4/Vg,max , ϵ0 = K 1/2, and ξ = K1/4. Note that for a fixed λ [0, ξ], we have V π r,1 (x, τ ) V πk r,1 (x, ˆτk))+λ(B (ˆτk+V πk g,1 (x, ˆτk)) ηV 2 g,max K + ξ2 2η + ϵ0Kξ + 20HS p HAK|C| log(K) log (HSAK|C|/δ) + 14ξVg,max S q HAK|C| log(K) log (HSAK|C|Vg,max/δ) + Kϵ0He|α|H Vg,max K3/4 + Vg,max 2 K3/4 + K3/4 + 60K3/4p S2H4A log(K) log (HSAK/δ) + 42Vg,max K p S2H3A log(K) log (HSAK/δ) + K1/2He|α|H O Vg,max K p S2H3A log (1/δ) . Trivially bounding P k(V πk r,1 (x, ˆτk) V π r,1 (x, τ )) HK, we have B (ˆτk+V πk g,1 (x, ˆτk)) HK + O Vg,max K p S2H3A log (1/δ) Setting λ = ξ = K1/4 and dividing both sides by K1/4, we have B (ˆτk+V πk g,1 (x, ˆτk)) O e|α|H 1 S2H5A log (1/δ) Online Learning in Risk Sensitive constrained MDP G. Proof of Lemma D.1 Consider the value function classes Vj = {V ( , ) | V : S C [0, Vj,max]} for j {r, g}, where Vr,max = H and Vg,max = exp(|α|H) 1 |α| . The ε-covering numbers of Vj under the ℓ norm is bounded by N(Vj, ε, ) 1 + Vj,max where S is the state space, S = |S|, and C is the discretized budget space. We provide a uniform concentration bound. Fix j {r, g}, (s, a) S A, ˆc C, h H, and k K. Pick a function V Vj. Applying Azuma-Hoeffding s inequality, we obtain P P k h ( |s, a) Ph( |s, a) T V ( , ˆc) > ϵ 2 exp N k h(s, a)ϵ2 where N k h(s, a) denotes the number of visits to (s, a) at time h up to episode k. Hence, with probability at least 1 δ, P k h ( |s, a) Ph( |s, a) T V ( , ˆc) < 2Vj,max log(1/δ) N k h(s, a). Let Vε Vj denote an ε-covering set. Applying a union bound together with a standard covering argument, with probability at least 1 δ/2, for all (s, a) S A, ˆc C, h H, k K, and V Vj, we have P k h ( |s, a) Ph( |s, a) T V ( , ˆc) 2ε + 2Vj,max log (2HSAK|C|N(Vj, ε, )/δ) N k h(s, a) 2ε + 2Vj,max S|C| log (2HSAK|C|(1 + Vj,max/ε)/δ) N k h(s, a) S|C| log (HSAK|C|Vj,max/δ) N k h(s, a) , where the last inequality uses ε = 1/K and the fact that N k h(s, a) K. The final result follows by applying a union bound over Vg and Vr. See (Agarwal et al., 2022)[Lemma 7.2] for a similar argument. H. Auxiliary Lemmas We adopt the following results from (Wang et al., 2023). Lemma H.1 (Azuma). Let {Xi}i [N] be a sequence of random variables supported on [0, 1], adapted to the filtration {Fi}i [N]. For any δ (0, 1), we have with probability at least 1 δ, t=1 E[Xt|Ft 1] = N log(2/δ) (6) Lemma H.2 (Elliptical Potential). For any sequence of states and actions {sh,k, ah,k}h [H],k [K], we have 1 Nk(sh,k, ah,k) SA log(K), Nk(sh,k, ah,k) p HSAK log(K). Online Learning in Risk Sensitive constrained MDP Figure 3. Frequency of actions for the average policy over the last 20 episodes after 15,000 iterations. Left: α = 0.01, Right: α = 0.0001, with B = 2.2. I. Simulation Environment The reward function is given in Table 2, utility values in Table 3, and transition probabilities in Table 4. Both rewards and utilities depend on the state. The transition probability p specifies the chance of successfully moving in the intended direction (e.g., choosing the action attempts to move the agent to the right neighboring cell with probability p). If the intended move is invalid (e.g., outside the grid), an alternative action is executed. Consequently, in the fourth row and column, the choice of action becomes inconsequential. The total horizon is H = 9, and a state at position (i, j) is only reachable at time step h = i + j + 1. The environment presents several interesting features. For instance, the state at position (1, 2) offers a high reward. However, it is surrounded by states with transition probabilities close to 0.5 and low utility, making it risky. While it may appear attractive to visit (1, 2), the agent might instead end up in one of the neighboring low-utility states. Consequently, as α becomes more negative, the agent may learn to avoid this region in order to mitigate risk which may lead to a smaller cumulative reward. The environment is deterministic across the terminal rows and columns. The code is available at: https://github.com/mmoharami/Risk-Sensitive-CMDP. Table 2. Reward matrix r(i, j) for state (i, j) Row \ Col 0 1 2 3 4 0 0.0 0.1 0.2 0.2 0.1 1 0.5 0.1 1.5 0.5 0.3 2 0.1 0.1 0.4 0.3 0.2 3 0.1 0.1 0.3 0.1 0.6 4 0.1 0.2 0.3 0.1 0.0 For a faster convergence, we use the bonus terms Bonk r,h(s, a) = 0.5H log(K)/N k h(s, a), and Bonk g,h(s, a) = 0.005Vg,max log(K)/N k h(s, a) instead of the values for which we obtain the regret and the violation bound across all the values of α. We use the discretized budget space with precision K 1/2, and δ = 0.05. The initial policy π0 is uniform across the two actions for every augmented state. Figure 3 illustrates the frequency of selecting each action ( or ) at every state for α = 0.01 and α = 0.0001, respectively, with B = 2.2. The displayed policies are the average policies over the last 20 episodes after running the algorithm for 15,000 iterations. Online Learning in Risk Sensitive constrained MDP Table 3. Utility matrix u(i, j) for state (i, j). Row \ Col 0 1 2 3 4 0 0.1 0.1 0.2 0.1 0.1 1 0.4 0.2 0.1 0.0 0.0 2 0.3 0.4 1.0 0.0 0.1 3 0.2 0.5 0.4 0.2 0.1 4 0.1 0.1 0.4 0.2 0.0 Table 4. Probability matrix p(i, j) representing the likelihood that the action taken in state (i, j) will occur. Row \ Col 0 1 2 3 4 0 0.9 0.9 0.7 0.5 1.0 1 0.9 0.9 0.5 0.5 1.0 2 0.7 0.9 0.9 0.6 1.0 3 0.9 0.8 0.8 0.5 1.0 4 1.0 1.0 1.0 1.0 1.0 Observe that under the more risk-averse setting α = 0.01, the algorithm tends to avoid the risky state (1, 2) to satisfy the constraint. In contrast, in the more risk-neutral scenario α = 0.0001, the algorithm chooses to visit state (1, 2) more frequently by taking appropriate actions, thereby increasing the reward while still adhering to the constraint.