# Risk-averse Total-reward MDPs with ERM and EVaR

Xihong Su1, Marek Petrik1, Julien Grand-Clément2
1 University of New Hampshire, 33 Academic Way, Durham, NH 03824, USA
2 HEC Paris, 1 Rue de la Libération, Jouy-en-Josas, 78350, France
xihong.su@unh.edu, mpetrik@cs.unh.edu, grand-clement@hec.fr

Abstract

Optimizing risk-averse objectives in discounted MDPs is challenging because most models do not admit direct dynamic programming equations and require complex history-dependent policies. In this paper, we show that the risk-averse total reward criterion, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) risk measures, can be optimized by a stationary policy, making it simple to analyze, interpret, and deploy. We propose exponential value iteration, policy iteration, and linear programming to compute optimal policies. Compared with prior work, our results only require the relatively mild condition of transient MDPs and allow for both positive and negative rewards. Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.

1 Introduction

Risk-averse Markov decision processes (MDPs) (Puterman 2005) that use monetary risk measures as their objective have been gaining in popularity in recent years (Kastner, Erdogdu, and Farahmand 2023; Marthe, Garivier, and Vernade 2023; Lam et al. 2022; Li, Zhong, and Brandeau 2022; Bäuerle and Glauner 2022; Hau, Petrik, and Ghavamzadeh 2023; Hau et al. 2023; Su, Petrik, and Grand-Clément 2024a,b). Risk-averse objectives, such as Value at Risk (VaR), Conditional Value at Risk (CVaR), Entropic Risk Measure (ERM), or Entropic Value at Risk (EVaR), penalize the variability of returns (Föllmer and Schied 2016). As a result, these risk measures yield policies with stronger guarantees on the probability of catastrophic losses, which is important in domains like healthcare or finance.

In this paper, we target the total reward criterion (TRC) (Kallenberg 2021; Puterman 2005) instead of the common discounted criterion. The TRC also assumes an infinite horizon but does not discount future rewards. To control for infinite returns, we assume that the MDP is transient, i.e., there is a positive probability that the process terminates after a finite number of steps, an assumption commonly used in the TRC literature (Filar and Vrieze 2012). We consider the TRC with both positive and negative rewards. When the rewards are non-positive, the TRC is equivalent to the stochastic shortest path problem, and when they are non-negative, it is equivalent to the stochastic longest path (Dann, Wei, and Zimmert 2023).

Two reasons motivate our departure from discounted objectives in risk-averse MDPs. First, considering risk significantly affects discounted objectives. It is common to use discounted objectives because they admit optimal stationary policies and value functions that can be computed using dynamic programs. However, most risk-averse discounted objectives, such as VaR, CVaR, or EVaR, require history-dependent optimal policies (Bäuerle and Ott 2011; Hau et al. 2023; Hau, Petrik, and Ghavamzadeh 2023) and do not admit standard dynamic programming optimality equations.
Second, the TRC captures the concept of stochastic termination, which is common in reinforcement learning (Sutton and Barto 2018). In risk-neutral objectives, discounting can serve well to model the probability of termination because it guarantees the same optimal policies (Puterman 2005; Su and Petrik 2023). However, as we show in this work, no such correspondence exists with risk-averse objectives, and the difference between the two may be arbitrarily large. Modeling stochastic termination using a discount factor in risk-averse objectives is therefore inappropriate and leads to dramatically different optimal policies.

As our main contribution, we show that the risk-averse TRC with ERM and EVaR risk measures admits optimal stationary policies and optimal value functions in transient MDPs. We also show that the optimal value function satisfies dynamic programming equations and can be computed with exponential value iteration, policy iteration, or linear programming algorithms. These algorithms are simple and closely resemble the algorithms for solving MDPs.

Our results indicate that EVaR is a particularly interesting risk measure in reinforcement learning. ERM and the closely related exponential utility functions have been popular in sequential decision-making problems because they admit dynamic programming decompositions (Patek and Bertsekas 1999; de Freitas, Freire, and Delgado 2020; Smith and Chapman 2023; Denardo and Rothblum 1979; Hau, Petrik, and Ghavamzadeh 2023; Hau et al. 2023). Unfortunately, ERM is difficult to interpret; it is scale-dependent; and it is incomparable with popular risk measures like VaR and CVaR. Because EVaR reduces to an optimization over ERM, it preserves most of the computational advantages of ERM, and since EVaR closely approximates CVaR and VaR at the same risk level, its value is also much easier to interpret. Finally, EVaR is also a coherent risk measure, unlike ERM (Ahmadi-Javid 2012; Ahmadi-Javid and Pichler 2017).

| Risk measure | Coherent | Law inv. | Disc. | TRC |
|--------------|----------|----------|-------|-----|
| E            | yes      | yes      | S     | S   |
| EVaR         | yes      | yes      | M     | S   |
| ERM          | no       | yes      | M     | S   |
| NCVaR        | yes      | no       | S     | S   |
| VaR          | no       | yes      | H     | H   |
| CVaR         | yes      | yes      | H     | H   |

Table 1: Structure of optimal policies in risk-averse MDPs: S, M, and H refer to Stationary, Markov, and History-dependent policies, respectively. The first two columns describe risk properties and the last two columns describe the structure of optimal policies under the discounted and TRC objectives.

Table 1 puts our contribution in the context of other work on risk-averse MDP objectives. Optimal policies for VaR and CVaR are known to be history-dependent in the discounted objective (Bäuerle and Ott 2011; Hau et al. 2023) and must be history-dependent in the TRC because the TRC generalizes the finite-horizon objective. The TRC with nested risk measures, such as Nested CVaR (NCVaR), applies the risk measure at each level of the dynamic program independently and preserves most of the favorable computational properties of risk-neutral MDPs (Ahmadi et al. 2021a). Unfortunately, nested risk measures are difficult to interpret; their value depends on the sequence in which the rewards are obtained in a complex and unpredictable way (Kupper and Schachermayer 2006) and may be unbounded even when the MDP is transient. While we are unaware of prior work on the TRC objective with ERM or EVaR risk-aversion allowing both positive and negative rewards, the ERM risk measure is closely related to exponential utility functions.
Prior work on the TRC with exponential utility functions also imposes constraints on the sign of the instantaneous rewards, such as all positive rewards (Blackwell 1967) or all negative rewards (Bertsekas and Tsitsiklis 1991; Freire and Delgado 2016; Carpin, Chow, and Pavone 2016; de Freitas, Freire, and Delgado 2020; Fei et al. 2021; Fei, Yang, and Wang 2021; Ahmadi et al. 2021a; Cohen et al. 2021; Meggendorfer 2022). Disallowing a mix of positive and negative rewards limits the modeling power of prior work because it requires that either all states are more desirable or all states are less desirable than the terminal state. Allowing rewards with mixed signs raises some technical challenges, which we address by employing a squeeze argument that takes advantage of the MDP's transience.

Notation. We use a tilde to mark random variables, e.g., $\tilde{x}$. Bold lower-case letters represent vectors, and upper-case bold letters represent matrices. Sets are either calligraphic or upper-case Greek letters. The symbol $\mathbb{X}$ represents the space of real-valued random variables. When a function is defined over an index set, such as $z \colon \{1, 2, \dots, n\} \to \mathbb{R}$, we also treat it interchangeably as a vector $z \in \mathbb{R}^n$ such that $z_i = z(i)$, $i = 1, \dots, n$. Finally, $\mathbb{R}$, $\mathbb{R}_+$, and $\mathbb{R}_{++}$ denote the real, non-negative real, and positive real numbers, respectively, and $\bar{\mathbb{R}} = \mathbb{R} \cup \{-\infty, \infty\}$. Given a finite set $\mathcal{Y}$, the probability simplex is $\Delta_{\mathcal{Y}} := \{x \in \mathbb{R}^{\mathcal{Y}}_{+} \mid \mathbf{1}^{\mathsf{T}} x = 1\}$.

2 Background on Risk-averse MDPs

Markov Decision Processes. We focus on solving Markov decision processes (MDPs) (Puterman 2005), modeled by a tuple $(\bar{\mathcal{S}}, \mathcal{A}, p, r, \bar{\mu})$, where $\bar{\mathcal{S}} = \{1, 2, \dots, S, S+1\}$ is the finite set of states and $\mathcal{A} = \{1, 2, \dots, A\}$ is the finite set of actions. The transition function $p \colon \bar{\mathcal{S}} \times \mathcal{A} \times \bar{\mathcal{S}} \to [0, 1]$ represents the probability $p(s, a, s')$ of transitioning to $s' \in \bar{\mathcal{S}}$ after taking $a \in \mathcal{A}$ in $s \in \bar{\mathcal{S}}$, and $\bar{p}_{sa} \in \Delta_{\bar{\mathcal{S}}}$ is such that $(\bar{p}_{sa})_{s'} = p(s, a, s')$. The function $r \colon \bar{\mathcal{S}} \times \mathcal{A} \times \bar{\mathcal{S}} \to \mathbb{R}$ represents the reward $r(s, a, s') \in \mathbb{R}$ associated with transitioning from $s \in \bar{\mathcal{S}}$ and $a \in \mathcal{A}$ to $s' \in \bar{\mathcal{S}}$. The vector $\bar{\mu} \in \Delta_{\bar{\mathcal{S}}}$ is the initial state distribution. We designate the state $e := S + 1$ as a sink state and use $\mathcal{S} = \{1, \dots, S\}$ to denote the set of all non-sink states. The sink state $e$ must satisfy $p(e, a, e) = 1$ and $r(e, a, e) = 0$ for each $a \in \mathcal{A}$, and $\bar{\mu}_e = 0$. Throughout the paper, we use a bar to indicate whether the quantity involves the sink state $e$. Note that the sink state can indicate a goal when all rewards are negative and an undesirable terminal state when all rewards are positive.

The following technical assumption is needed to simplify the derivation. To lift the assumption, one needs to carefully account for infinite values, which adds complexity to the results and distracts from the main ideas.

Assumption 2.1. The initial distribution $\bar{\mu}$ satisfies that

The solution to an MDP is a policy. Given a horizon $t \in \mathbb{N}$, a history-dependent policy in the set $\Pi^t_{\mathrm{HR}}$ maps the history of states and actions to a distribution over actions. A Markov policy $\pi \in \Pi^t_{\mathrm{MR}}$ is a sequence of decision rules $\pi = (d_0, d_1, \dots, d_{t-1})$ with $d_k \colon \mathcal{S} \to \Delta_{\mathcal{A}}$ the decision rule for taking actions at time $k$. The set of all randomized decision rules is $\mathcal{D} = (\Delta_{\mathcal{A}})^{\mathcal{S}}$. Stationary policies $\Pi_{\mathrm{SR}}$ are Markov policies $\pi := (d) := (d, d, \dots)$ with the identical decision rule in every time step. We treat decision rules and stationary policies interchangeably. The sets of deterministic Markov and stationary policies are denoted by $\Pi^t_{\mathrm{MD}}$ and $\Pi_{\mathrm{SD}}$. Finally, we omit the superscript $t$ to indicate infinite-horizon definitions of policies.
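The tuple $(\bar{\mathcal{S}}, \mathcal{A}, p, r, \bar{\mu})$ maps directly onto dense arrays in code. As a running illustration, the following Julia sketch encodes the single-state transient MDP that reappears in Figure 1 and Section 5; the array layout, the index convention for the sink state, and the variable names are our own assumptions rather than the paper's released implementation.

```julia
# Dense-array encoding of an MDP tuple (S̄, A, p, r, μ̄): P[s, a, s′] = p(s, a, s′),
# R[s, a, s′] = r(s, a, s′), with the sink state e = S + 1 stored as the last index.
S, A = 1, 1                         # one non-sink state and one action
e = S + 1                           # sink state
P = zeros(S + 1, A, S + 1)
R = zeros(S + 1, A, S + 1)
ϵ, reward = 0.9, 0.2                # parameters of the transient MDP in Figure 1 / Section 5
P[1, 1, 1] = ϵ                      # remain in the non-sink state with probability ϵ
P[1, 1, e] = 1 - ϵ                  # terminate with probability 1 − ϵ
R[1, 1, 1] = R[1, 1, e] = reward    # reward r earned on every transition
P[e, 1, e] = 1.0                    # sink requirements: p(e, a, e) = 1, r(e, a, e) = 0
μ = [1.0, 0.0]                      # initial distribution with μ_e = 0
```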
The risk-neutral Total Reward Criterion (TRC) objective is
$$\sup_{\pi \in \Pi_{\mathrm{HR}}} \; \liminf_{t \to \infty} \; \mathbb{E}^{\pi, \bar{\mu}} \left[ \sum_{k=0}^{t-1} r(\tilde{s}_k, \tilde{a}_k, \tilde{s}_{k+1}) \right], \qquad (1)$$
where the random variables are denoted by a tilde and $\tilde{s}_k$ and $\tilde{a}_k$ represent the state and the action at time $k$. The superscript $\pi$ denotes the policy that governs the actions $\tilde{a}_k$ when visiting $\tilde{s}_k$, and $\bar{\mu}$ denotes the initial distribution. Finally, note that the limit inferior gives a conservative estimate of a policy's return since the limit does not necessarily exist for non-stationary policies. Unlike the discounted criterion, the risk-neutral TRC may be unbounded, optimal policies may not exist, or they may be non-stationary (Bertsekas and Yu 2013; James and Collins 2006).

Figure 1: left: a discounted MDP; right: a transient MDP.

To circumvent these issues, we assume that all policies have a positive probability of eventually transitioning to the sink state.

Assumption 2.2. The MDP is transient: for every $\pi \in \Pi_{\mathrm{SD}}$,
$$\sum_{t=0}^{\infty} \mathbb{P}^{\pi, s}\left[\tilde{s}_t = s'\right] < \infty, \qquad \forall s, s' \in \mathcal{S}. \qquad (2)$$

Assumption 2.2 underlies most of our results. Transient MDPs are important because their optimal policies exist and can be chosen to be stationary deterministic (Kallenberg 2021, theorem 4.12). Transient MDPs are also common in stochastic games (Filar and Vrieze 2012) and generalize the stochastic shortest path problem (Bertsekas and Yu 2013). An important tool in their analysis is the spectral radius $\rho \colon \mathbb{R}^{n \times n} \to \mathbb{R}$, defined for each $A \in \mathbb{R}^{n \times n}$ as the maximum absolute eigenvalue $\rho(A) := \max_{i=1,\dots,n} |\lambda_i|$, where $\lambda_i$ is the $i$-th eigenvalue (Horn and Johnson 2013).

Lemma 2.3 (Theorem 4.8 in Kallenberg (2021)). An MDP is transient if and only if $\rho(P_\pi) < 1$ for all $\pi \in \Pi_{\mathrm{SR}}$, where $P_\pi$ is the transition matrix that $\pi$ induces on the non-sink states $\mathcal{S}$.

Now, let us understand the differences between a discounted MDP and a transient MDP, which are useful in demonstrating the behavior of risk-averse objectives. Consider the MDPs in Figure 1. There is one non-sink state $s$ and one action $a$, and each edge is labeled by a triple of an action, a transition probability, and a reward. Note that every discounted MDP can be converted to a transient MDP as described in Su, Grand-Clément, and Petrik (2024, appendix B). For the discounted MDP, the discount factor is $\gamma$. For the transient MDP, $e$ is the sink state, and there is a positive probability $1 - \epsilon$ of transitioning from state $s$ to state $e$. Once the agent reaches the state $e$, it stays in $e$. For the risk-neutral objective, if $\gamma$ equals $\epsilon$, the two value functions are identical. However, for risk-averse objectives, such as ERM, we show in Section 5 that the value functions in a discounted MDP can diverge from those in a transient MDP.

Monetary risk measures. Monetary risk measures aim to generalize the expectation operator to account for the spread of the random variable. The entropic risk measure (ERM) is a popular risk measure, defined for any risk level $\beta > 0$ and $\tilde{x} \in \mathbb{X}$ as (Föllmer and Schied 2016)
$$\mathrm{ERM}_\beta[\tilde{x}] = -\beta^{-1} \log \mathbb{E}\left[\exp(-\beta \tilde{x})\right], \qquad (3)$$
and extended to $\beta \in [0, \infty]$ as $\mathrm{ERM}_0[\tilde{x}] = \lim_{\beta \to 0^+} \mathrm{ERM}_\beta[\tilde{x}] = \mathbb{E}[\tilde{x}]$ and $\mathrm{ERM}_\infty[\tilde{x}] = \lim_{\beta \to \infty} \mathrm{ERM}_\beta[\tilde{x}] = \operatorname{ess\,inf}[\tilde{x}]$. ERM plays a unique role in sequential decision-making because it is the only law-invariant risk measure that satisfies the tower property (e.g., Su, Grand-Clément, and Petrik (2024, proposition A.1)), which is essential in constructing dynamic programs (Hau, Petrik, and Ghavamzadeh 2023).
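As a quick numerical illustration of definition (3) and its limits, the following Julia snippet evaluates $\mathrm{ERM}_\beta$ for a discrete random variable; the example outcomes and probabilities are hypothetical and chosen only to show how the value interpolates between the expectation and the essential infimum.

```julia
# ERM_β[x̃] = -log(E[exp(-β x̃)]) / β for a discrete random variable with
# outcomes `x` and probabilities `p`, as in equation (3); β = 0 returns E[x̃].
erm(x, p, β) = β == 0 ? sum(p .* x) : -log(sum(p .* exp.(-β .* x))) / β

x = [-1.0, 7.0]      # hypothetical outcomes
p = [0.3, 0.7]       # their probabilities

erm(x, p, 0.0)       # ≈ 4.6, the risk-neutral limit ERM_0 = E[x̃]
erm(x, p, 1.0)       # ≈ 0.20, larger β penalizes the spread of outcomes
erm(x, p, 50.0)      # ≈ -0.98, approaches ess inf[x̃] = -1 as β → ∞
```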
Unfortunately, two significant limitations of ERM hinder its practical applications. First, ERM is not positively homogeneous; therefore, the risk value depends on the scale of the rewards, and ERM is not coherent (Föllmer and Schied 2016; Hau, Petrik, and Ghavamzadeh 2023; Ahmadi-Javid 2012). Second, the risk parameter $\beta$ is challenging to interpret and does not relate well to other standard risk measures, like VaR or CVaR. For these reasons, we focus on the Entropic Value at Risk (EVaR), defined for a given $\alpha \in (0, 1)$ as
$$\mathrm{EVaR}_\alpha[\tilde{x}] = \sup_{\beta > 0} \; -\beta^{-1} \log\left(\alpha^{-1}\, \mathbb{E}\left[\exp(-\beta \tilde{x})\right]\right) = \sup_{\beta > 0} \; \mathrm{ERM}_\beta[\tilde{x}] + \beta^{-1} \log \alpha, \qquad (4)$$
and extended to $\mathrm{EVaR}_0[\tilde{x}] = \operatorname{ess\,inf}[\tilde{x}]$ and $\mathrm{EVaR}_1[\tilde{x}] = \mathbb{E}[\tilde{x}]$ (Ahmadi-Javid 2012). It is important to note that the supremum in (4) may not be attained even when $\tilde{x}$ is a finite discrete random variable (Ahmadi-Javid and Pichler 2017). EVaR addresses the limitations of ERM while preserving its benefits: EVaR is coherent and positively homogeneous. EVaR is also a good approximation to interpretable quantile-based risk measures, like VaR and CVaR (Ahmadi-Javid 2012; Hau, Petrik, and Ghavamzadeh 2023).

Risk-averse MDPs. Risk-averse MDPs with static VaR and CVaR risk measures under the discounted criterion have received abundant attention (Hau et al. 2023; Bäuerle and Ott 2011; Bäuerle and Glauner 2022; Pflug and Pichler 2016; Li, Zhong, and Brandeau 2022), showing that these objectives require history-dependent optimal policies. In contrast, nested risk measures under the TRC may admit stationary policies that can be computed using dynamic programming (Ahmadi et al. 2021a; Meggendorfer 2022; de Freitas, Freire, and Delgado 2020; Gavriel, Hanasusanto, and Kuhn 2012). However, the TRC with nested CVaR can be unbounded (Su, Grand-Clément, and Petrik 2024, proposition C.1). Recent work has shown that optimal Markov policies exist for EVaR discounted objectives, and they can be computed via dynamic programming (Hau, Petrik, and Ghavamzadeh 2023), building upon similar results established for ERM (Chung and Sobel 1987). However, in the TRC with ERM, the value functions may also be unbounded (Su, Grand-Clément, and Petrik 2024, proposition D.1).

3 Solving ERM Total Reward Criterion

This section shows that an optimal stationary policy exists for the ERM-TRC and that the value function satisfies dynamic programming equations. We then outline algorithms for computing it. Our objective in this section is to maximize the ERM-TRC objective for some given $\beta > 0$, defined as
$$\sup_{\pi \in \Pi_{\mathrm{HR}}} \; \liminf_{t \to \infty} \; \mathrm{ERM}^{\pi, \bar{\mu}}_{\beta}\left[\sum_{k=0}^{t-1} r(\tilde{s}_k, \tilde{a}_k, \tilde{s}_{k+1})\right]. \qquad (5)$$
The definition employs the limit inferior because the limit may not exist for non-stationary policies. Return functions $g_t \colon \Pi_{\mathrm{HR}} \times \mathbb{R}_{++} \to \bar{\mathbb{R}}$ and $g^\star_t \colon \mathbb{R}_{++} \to \bar{\mathbb{R}}$ for a horizon $t \in \mathbb{N}$, and the infinite-horizon versions $g_\infty \colon \Pi_{\mathrm{HR}} \times \mathbb{R}_{++} \to \bar{\mathbb{R}}$ and $g^\star_\infty \colon \mathbb{R}_{++} \to \bar{\mathbb{R}}$, are defined as
$$g_t(\pi, \beta) := \mathrm{ERM}^{\pi, \bar{\mu}}_{\beta}\left[\sum_{k=0}^{t-1} r(\tilde{s}_k, \tilde{a}_k, \tilde{s}_{k+1})\right], \qquad g^\star_t(\beta) := \sup_{\pi \in \Pi_{\mathrm{HR}}} g_t(\pi, \beta), \qquad (6)$$
$$g_\infty(\pi, \beta) := \liminf_{t \to \infty} g_t(\pi, \beta), \qquad g^\star_\infty(\beta) := \liminf_{t \to \infty} g^\star_t(\beta).$$
Note that the functions $g_\infty$ and $g^\star_\infty$ can return infinite values and that (5) differs from $g^\star_\infty$ in the order of the limit and the supremum. Finally, when $\beta = 0$, we assume that all $g$ functions are defined as the expectation. In the remainder of the section, we assume that the risk level $\beta > 0$ is fixed and omit it from the notation when its value is unambiguous from the context.
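Before developing the dynamic program, it may help to see what the return function $g_t(\pi, \beta)$ in (6) measures operationally. The Julia sketch below estimates it by Monte Carlo rollouts of a fixed stationary deterministic policy; the dense-array inputs, helper names, and default parameters are our own assumptions, and the sketch serves only as a sanity check for the exact algorithms that follow.

```julia
using Random

# Monte-Carlo estimate of the finite-horizon ERM return g_t(π, β) from (6) for a
# deterministic stationary policy π (π[s] is the action taken in state s).
# P[s, a, s′] and R[s, a, s′] are dense arrays over all states including the
# sink e stored as the last index; μ is the initial distribution.
sample_cat(p) = something(findfirst(cumsum(p) .>= rand()), length(p))

function mc_erm_return(P, R, μ, π, β; horizon=1_000, episodes=100_000)
    e = length(μ)                       # sink state index (S + 1)
    total = 0.0                         # running sum of exp(-β · episode return)
    for _ in 1:episodes
        s, ret = sample_cat(μ), 0.0
        for _ in 1:horizon
            s == e && break             # the sink accumulates no further reward
            a = π[s]
            s′ = sample_cat(P[s, a, :])
            ret += R[s, a, s′]
            s = s′
        end
        total += exp(-β * ret)
    end
    return -log(total / episodes) / β   # ERM_β of the sampled returns, as in (3)
end
```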
3.1 Finite Horizon

We commence the analysis with definitions and basic properties for the finite-horizon criterion. To the best of our knowledge, this analysis is original in the context of the ERM but builds on similar approaches employed in the study of exponential utility functions.

Finite-horizon value functions $v^t(\pi) \in \mathbb{R}^S$ and $v^{t,\star} \in \mathbb{R}^S$ are defined for each horizon $t \in \mathbb{N}$, policy $\pi \in \Pi^t_{\mathrm{MD}}$, and $s \in \mathcal{S}$ as
$$v^t_s(\pi) := \mathrm{ERM}^{\pi, s}_{\beta}\left[\sum_{k=0}^{t-1} r(\tilde{s}_k, \tilde{a}_k, \tilde{s}_{k+1})\right], \qquad v^{t,\star}_s := \max_{\pi \in \Pi^t_{\mathrm{MD}}} v^t_s(\pi), \qquad (7)$$
and $v^t_e(\pi) := 0$. Because the nonlinearity of ERM complicates the analysis, it will be convenient to instead rely on the exponential value function $w^t(\pi) \in \mathbb{R}^S$ for $\pi \in \Pi^t_{\mathrm{MD}}$, $t \in \mathbb{N}$, and $s \in \mathcal{S}$ that satisfies
$$w^t_s(\pi) := -\exp\left(-\beta\, v^t_s(\pi)\right), \qquad (8)$$
$$v^t_s(\pi) = -\beta^{-1} \log\left(-w^t_s(\pi)\right). \qquad (9)$$
The optimal $w^{t,\star} \in \mathbb{R}^S$ is defined analogously from $v^{t,\star}$. Note that $w^t < 0$ (componentwise) and $w^0(\pi) = w^{0,\star} = -\mathbf{1}$ for any $\pi \in \Pi^t_{\mathrm{MD}}$. Similar exponential value functions have been used previously in exponential utility function objectives (Denardo and Rothblum 1979; Patek 2001), in the analysis of robust MDPs, and even in regularized MDPs (see Grand-Clément and Petrik (2022) and references therein).

One can define a corresponding exponential Bellman operator for any $w \in \mathbb{R}^S$ as
$$L_d w := B_d w - b_d, \qquad L^\star w := \max_{d \in \mathcal{D}} L_d w = \max_{d \in \operatorname{ext} \mathcal{D}} L_d w, \qquad (10)$$
where $\operatorname{ext} \mathcal{D}$ is the set of extreme points of $\mathcal{D}$ corresponding to deterministic decision rules, and $B_d \in \mathbb{R}^{S \times S}_+$ and $b_d \in \mathbb{R}^S_+$ are defined for $s, s' \in \mathcal{S}$ and $d \in \mathcal{D}$ as
$$(B_d)_{s,s'} := \sum_{a \in \mathcal{A}} p(s, a, s')\, d_a(s)\, e^{-\beta\, r(s, a, s')}, \qquad (11a)$$
$$(b_d)_s := \sum_{a \in \mathcal{A}} p(s, a, e)\, d_a(s)\, e^{-\beta\, r(s, a, e)}. \qquad (11b)$$
The following theorem shows that $L^\star$ can be used to compute $w$. We use the shorthand notation $\pi_{1:t-1} = (d_1, \dots, d_{t-1}) \in \Pi^{t-1}_{\mathrm{MR}}$ to denote the tail of $\pi$ that starts with $d_1$ instead of $d_0$.

Theorem 3.1. For each $t = 1, 2, \dots$ and $\pi = (d_0, \dots, d_{t-1}) \in \Pi^t_{\mathrm{MR}}$, the exponential values satisfy
$$w^t(\pi) = L_{d_0} w^{t-1}(\pi_{1:t-1}), \qquad w^0(\pi) = -\mathbf{1},$$
$$w^{t,\star} = L^\star w^{t-1,\star} = w^t(\pi^\star) \ge w^t(\pi), \qquad w^{0,\star} = -\mathbf{1},$$
for some $\pi^\star \in \Pi^t_{\mathrm{MD}}$.

The proof of Theorem 3.1 is standard and has been established both in the context of ERMs (Hau, Petrik, and Ghavamzadeh 2023) and utility functions (Patek 1997). The following corollary follows directly from Theorem 3.1 by algebraic manipulation and by the monotonicity of the exponential value function transformation and the ERM.

Corollary 3.2. We have that
$$g_t(\pi, \beta) = \mathrm{ERM}^{\bar{\mu}}_{\beta}\left[v^t_{\tilde{s}_0}(\pi)\right], \qquad g^\star_t(\beta) = \mathrm{ERM}^{\bar{\mu}}_{\beta}\left[v^{t,\star}_{\tilde{s}_0}\right] = \max_{\pi \in \Pi^t_{\mathrm{MD}}} \mathrm{ERM}^{\bar{\mu}}_{\beta}\left[v^t_{\tilde{s}_0}(\pi)\right].$$

3.2 Infinite Horizon

We now turn to constructing infinite-horizon optimal policies as a limiting case of the finite horizon. An important quantity is the infinite-horizon exponential value function defined for each $\pi \in \Pi_{\mathrm{HR}}$ as
$$w^\infty(\pi) := \liminf_{t \to \infty} w^t(\pi), \qquad w^{\infty,\star} := \liminf_{t \to \infty} w^{t,\star}.$$
Note again that we use the limit inferior because the limit may not be defined for non-stationary policies. The limiting infinite-horizon value functions $v^\infty(\pi)$ and $v^{\infty,\star}$ are defined analogously from $v^t(\pi)$ and $v^{t,\star}$ using the limit inferior.

The following theorem is the main result of this section. It shows that for an infinite horizon, the optimal exponential value function is attained by a stationary deterministic policy and is a fixed point of the exponential Bellman operator.

Theorem 3.3. Whenever $w^{\infty,\star} > -\infty$, there exists $\pi^\star = (d^\star) \in \Pi_{\mathrm{SD}}$ such that
$$w^{\infty,\star} = w^\infty(\pi^\star) = L_{d^\star} w^{\infty,\star},$$
and $w^{\infty,\star}$ is the unique value that satisfies this equation.

Corollary 3.4. Assuming the hypothesis of Theorem 3.3, we have that $v^{\infty,\star} = v^\infty(\pi^\star)$ and
$$g^\star_\infty(\beta) = \mathrm{ERM}^{\bar{\mu}}_{\beta}\left[v^{\infty,\star}_{\tilde{s}_0}\right] = \max_{\pi \in \Pi_{\mathrm{SD}}} \mathrm{ERM}^{\bar{\mu}}_{\beta}\left[v^\infty_{\tilde{s}_0}(\pi)\right].$$
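Theorem 3.3 characterizes the optimal exponential value function as a fixed point of the exponential Bellman operator. The analogous computation for a single fixed stationary deterministic policy, solving $w = L_d w$ with the matrices from (11), is sketched below in Julia; the dense-array layout and the function name are our own conventions, not the paper's implementation.

```julia
using LinearAlgebra

# Exponential value of a fixed stationary deterministic policy d (d[s] is the
# action in state s): build B_d and b_d from (11a)-(11b) and solve the fixed
# point w = B_d w - b_d of the operator L_d, which exists and is unique when
# the spectral radius of B_d is below one (see Lemma 3.5 below).
# P[s, a, s′] and R[s, a, s′] include the sink e as the last state index.
function evaluate_stationary(P, R, d, β)
    S = size(P, 1) - 1                         # number of non-sink states
    e = S + 1                                  # sink state index
    B = [P[s, d[s], s′] * exp(-β * R[s, d[s], s′]) for s in 1:S, s′ in 1:S]
    b = [P[s, d[s], e] * exp(-β * R[s, d[s], e]) for s in 1:S]
    @assert maximum(abs.(eigvals(B))) < 1 "exponential return of this policy is unbounded"
    w = -(I - B) \ b                           # w(d) = -(I - B_d)^{-1} b_d
    v = -log.(-w) ./ β                         # recover v from w via (9)
    return w, v
end
```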
We now outline the proof of Theorem 3.3; see Su, Grand-Clément, and Petrik (2024, appendix D) for details. To establish Theorem 3.3, we show that $w^{t,\star}$ converges to a fixed point as $t \to \infty$. Standard arguments do not apply to our setting (Puterman 2005; Kallenberg 2021; Patek 2001) because the ERM-TRC Bellman operator is not an $L_\infty$-contraction, it is not linear, and the values in value iteration do not increase or decrease monotonically. Although the exponential Bellman operator $L_d$ is linear, it may not be a contraction. The main idea of the proof is to show that whenever the exponential value functions are bounded, the exponential Bellman operator must be a weighted-norm contraction with a unique fixed point.

To facilitate the analysis, we define $w^t \colon \Pi^t_{\mathrm{SR}} \times \mathbb{R}^S \to \mathbb{R}^S$, $t \in \mathbb{N}$, for $z \in \mathbb{R}^S$ and $\pi = (d) \in \Pi^t_{\mathrm{SR}}$ as
$$w^t(\pi, z) = L_d w^{t-1}(\pi_{1:t-1}, z) = (\underbrace{L_d \circ \cdots \circ L_d}_{t})(z) = (B_d)^t z - \sum_{k=0}^{t-1} (B_d)^k b_d. \qquad (12)$$
The value $z$ can be interpreted as the exponential value function at the termination of the process following $\pi$ for $t$ periods. Note that $w^t(\pi) = w^t(\pi, -\mathbf{1})$ for $\pi \in \Pi_{\mathrm{MR}}$ and $t \in \mathbb{N}$.

An important technical result we show is that the only way a stationary policy's return can be bounded is if the policy's matrix has a spectral radius strictly less than 1.

Lemma 3.5. For each $\pi = (d) \in \Pi_{\mathrm{SR}}$ and $z \le 0$:
$$w^\infty(\pi, z) > -\infty \;\implies\; \rho(B_d) < 1.$$

Lemma 3.5 uses the transience property to show that the Perron vector (with the maximum absolute eigenvalue) $f$ of $B_d$ satisfies $f^{\mathsf{T}} b_d > 0$. Therefore, $\rho(B_d) < 1$ is necessary for the series in (12) to be bounded. The limitation of Lemma 3.5 is that it only applies to stationary policies. The lemma does not preclude the possibility that all stationary policies have unbounded returns, but a Markov policy with a bounded return exists. To show that this is impossible, we construct an upper bound on $w^{t,\star}$ that decreases monotonically in $t$ and converges. The proof then concludes by squeezing $w^{t,\star}$ between a lower bound and the upper bound with the same limit. This technique allows us to relax the limiting assumptions from prior work (Patek 2001; de Freitas, Freire, and Delgado 2020).

Finally, our results imply that an optimal stationary policy exists whenever the planning horizon $T$ is sufficiently large. Because the set $\Pi_{\mathrm{SD}}$ is finite, one policy must be optimal for a sufficiently large $T$. This property suggests behavior similar to turnpikes in discounted MDPs (Puterman 2005).

3.3 Algorithms

We now briefly describe the algorithms we use to compute the optimal ERM-TRC policies. Surprisingly, the main algorithms for discounted MDPs, including value iteration, policy iteration, and linear programming, can be adapted to this risk-averse setting with only minor modifications.

Value iteration is the most direct method for computing the optimal value function (Puterman 2005). It computes a sequence $w_k$, $k = 0, 1, \dots$, such that
$$w_{k+1} = L^\star w_k, \qquad w_0 = 0.$$
The initialization $w_0 = 0$ is essential and guarantees convergence directly from the monotonicity argument used to prove Theorem 3.3.

Policy iteration (PI) starts by initializing with a stationary policy $\pi_0 = (d_0) \in \Pi_{\mathrm{SD}}$. Then, for each iteration $k = 0, 1, \dots$, PI alternates between the policy evaluation step and the policy improvement step:
$$w_k = -(I - B_{d_k})^{-1} b_{d_k}, \qquad d_{k+1} \in \operatorname*{argmax}_{d \in \mathcal{D}} \; B_d w_k - b_d.$$
PI converges because it monotonically improves the value functions when initialized with a policy $d_0$ with a bounded return (Patek 2001). However, we lack a practical approach to finding such an initial policy.

Finally, linear programming is a fast and convenient method for computing optimal exponential value functions:
$$\min\left\{ \mathbf{1}^{\mathsf{T}} w \;\middle|\; w \in \mathbb{R}^S, \; w \ge -b^a + B^a w, \; a \in \mathcal{A} \right\}. \qquad (13)$$
Here, $B^a_{s,\cdot} = (B^a_{s,s_1}, \dots, B^a_{s,s_S})$, and $B^a_{s,s'}$ and $b^a_s$ are constructed as in (11). We use the shorthand $B^a = B_d$ and $b^a = b_d$, where $d$ is the deterministic decision rule with $d_{a'}(s) = 1$ if $a' = a$ for each $s \in \mathcal{S}$ and $a' \in \mathcal{A}$.
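All three algorithms operate on the per-action matrices $B^a$ and vectors $b^a$ from (11). The following Julia sketch of exponential value iteration (the function names, stopping tolerance, and input format are our own assumptions, not the released implementation) applies the update $w_{k+1} = L^\star w_k$ with the required initialization $w_0 = 0$ and extracts a greedy decision rule as used in the policy improvement step.

```julia
using LinearAlgebra

# Exponential value iteration: w_{k+1} = L* w_k with w_0 = 0, where
# L* w = max_a (B^a w - b^a) componentwise; B and b are vectors indexed by
# action, holding the matrices and vectors from (11a)-(11b).
function exponential_value_iteration(B::Vector{Matrix{Float64}},
                                     b::Vector{Vector{Float64}};
                                     tol = 1e-10, maxiter = 100_000)
    w = zeros(length(b[1]))                      # w_0 = 0 is required for convergence
    for _ in 1:maxiter
        w_new = reduce((u, v) -> max.(u, v), [B[a] * w .- b[a] for a in eachindex(b)])
        norm(w_new - w, Inf) ≤ tol && return w_new
        w = w_new
    end
    return w
end

# Greedy deterministic decision rule extracted from w, as in the policy
# improvement step of policy iteration: d(s) ∈ argmax_a (B^a w - b^a)_s.
greedy_rule(B, b, w) =
    [argmax([(B[a] * w .- b[a])[s] for a in eachindex(b)]) for s in 1:length(w)]
```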
It is important to note that the value functions, as well as the coefficients of $B_d$, may be irrational. It is, therefore, essential to study the sensitivity of the algorithms to errors in the input. However, this question is beyond the scope of the present paper, and we leave it for future work.

4 Solving EVaR Total Reward Criterion

This section shows that the EVaR-TRC objective can be reduced to a sequence of ERM-TRC problems, similarly to the discounted case (Hau, Petrik, and Ghavamzadeh 2023). As a result, an optimal stationary EVaR-TRC policy exists and can be computed using the methods described in Section 3. Formally, we aim to compute a policy that maximizes the EVaR of the random return at some given fixed risk level $\alpha \in (0, 1)$, defined as
$$\sup_{\pi \in \Pi_{\mathrm{HR}}} \; \liminf_{t \to \infty} \; \mathrm{EVaR}^{\pi, \bar{\mu}}_{\alpha}\left[ \sum_{k=0}^{t-1} r(\tilde{s}_k, \tilde{a}_k, \tilde{s}_{k+1}) \right]. \qquad (14)$$
In contrast with Ahmadi et al. (2021b), the objective in (14) optimizes EVaR rather than Nested EVaR.

4.1 Reduction to ERM-TRC

To solve (14), we exploit that EVaR can be defined in terms of ERM, as shown in (4). To that end, define a function $h_t \colon \Pi_{\mathrm{HR}} \times \mathbb{R}_{++} \to \bar{\mathbb{R}}$ for $t \in \mathbb{N}$ as
$$h_t(\pi, \beta) := g_t(\pi, \beta) + \beta^{-1} \log(\alpha), \qquad (15)$$
where $g_t$ is the ERM value of the policy defined in (6). Also, $h^\star_t$, $h_\infty$, and $h^\star_\infty$ are defined analogously in terms of $g^\star_t$, $g_\infty$, and $g^\star_\infty$, respectively. The functions $h$ are useful because, by (4),
$$\mathrm{EVaR}^{\pi, \bar{\mu}}_{\alpha}\left[ \sum_{k=0}^{t-1} r(\tilde{s}_k, \tilde{a}_k, \tilde{s}_{k+1}) \right] = \sup_{\beta > 0} h_t(\pi, \beta), \qquad (16)$$
for each $\pi \in \Pi_{\mathrm{HR}}$ and $t \in \mathbb{N}$. However, note that the limit in the definition of $\sup_{\beta > 0} h_\infty(\pi, \beta)$ is inside the supremum, unlike in the objective in (14).

There are two challenges with solving (14) by reducing it to (16). First, the supremum in the definition of EVaR in (4) may not be attained, as mentioned previously. Second, the functions $g^\star_t$ and $h^\star_t$ may not converge uniformly to $g^\star_\infty$ and $h^\star_\infty$. Note that Theorem 3.3 only shows pointwise convergence when the functions are bounded.

To circumvent the challenges described above, we replace the supremum in (16) with a maximum over a finite set $B(\beta_0, \delta)$ of discretized $\beta$ values:
$$B(\beta_0, \delta) := \{\beta_0, \beta_1, \dots, \beta_K\}, \qquad (17a)$$
where $\delta > 0$, $0 < \beta_0 < \beta_1 < \dots < \beta_K$, and
$$\beta_{k+1} := \frac{\beta_k \log\frac{1}{\alpha}}{\log\frac{1}{\alpha} - \beta_k \delta}, \qquad \beta_K \ge \frac{\log\frac{1}{\alpha}}{\delta}, \qquad (17b)$$
for an appropriately chosen value $K$ for each $\beta_0$ and $\delta$. We assume that the denominator in the expression for $\beta_{k+1}$ in Equation (17b) is positive; otherwise $\beta_{k+1} = \infty$ and $\beta_k$ is sufficiently large. The construction in (17) resembles equations (19) and (20) in Hau, Petrik, and Ghavamzadeh (2023) but differs in the choice of $\beta_0$ because Hoeffding's lemma does not readily bound the TRC criterion. The following proposition upper-bounds the value of $K$; see (Hau, Petrik, and Ghavamzadeh 2023, theorem 4.3) for a proof that $K$ is polynomial in $\delta$.

Proposition 4.1. Assume a given $\beta_0 > 0$ and $\delta \in (0, 1)$ such that $\beta_0 \delta < \log\frac{1}{\alpha}$. Then, to satisfy the condition in (17b), it is sufficient to choose $K$ as
$$K := \left\lceil \frac{\log z}{\log(1 - z)} \right\rceil, \qquad \text{where } z := \frac{\beta_0 \delta}{\log\frac{1}{\alpha}}.$$

The following theorem shows that one can obtain an optimal ERM policy for an appropriately chosen $\beta$ that approximates an optimal EVaR policy arbitrarily closely.

Theorem 4.2. For any $\delta > 0$, let
$$(\pi^\star, \beta^\star) \in \operatorname*{argmax}_{(\pi, \beta) \in \Pi_{\mathrm{SD}} \times B(\beta_0, \delta)} h_\infty(\pi, \beta),$$
where $\beta_0 > 0$ is chosen such that $g^\star_\infty(0) - g^\star_\infty(\beta_0) \le \delta$. Then the limits below exist and satisfy
$$\lim_{t \to \infty} \mathrm{EVaR}^{\pi^\star, \bar{\mu}}_{\alpha}\left[ \sum_{k=0}^{t-1} r(\tilde{s}_k, \tilde{a}_k, \tilde{s}_{k+1}) \right] \;\ge\; \sup_{\pi \in \Pi_{\mathrm{HR}}} \; \lim_{t \to \infty} \; \sup_{\beta > 0} h_t(\pi, \beta) \; - \; \delta. \qquad (19)$$

Note that the right-hand side in (19) is the $\delta$-optimal objective in (14). The first implication of Theorem 4.2 is that there exists an optimal stationary deterministic policy.

Corollary 4.3. There exists an optimal stationary deterministic policy $\pi^\star \in \Pi_{\mathrm{SD}}$ that attains the supremum in (14).

The second implication of Theorem 4.2 is that it suggests an algorithm for computing the optimal, or near-optimal, stationary policy. We summarize it in Section 4.2.
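Algorithm 1 below repeatedly constructs the discretized set $B(\beta_0, \delta)$. The following Julia sketch transcribes the recursion in (17); the function name and the sample parameter values on the last line are our own.

```julia
# Grid B(β0, δ) from (17): β_{k+1} = β_k · log(1/α) / (log(1/α) − β_k · δ),
# adding points until β_K ≥ log(1/α)/δ, at which point the denominator would
# become non-positive and β_{k+1} is treated as infinite.
function beta_grid(α, β0, δ)
    L = log(1 / α)
    grid = [β0]
    while last(grid) * δ < L
        push!(grid, last(grid) * L / (L - last(grid) * δ))
    end
    return grid
end

beta_grid(0.2, 0.1, 0.2)    # hypothetical α, β0, δ; returns β0 < β1 < … < βK
```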
Algorithm 1: Simple EVaR algorithm
Data: MDP and desired precision $\delta > 0$
Result: $\delta$-optimal policy $\pi^\star \in \Pi_{\mathrm{SD}}$
while $g^\star_\infty(0) - g^\star_\infty(\beta_0) > \delta$ do $\beta_0 \leftarrow \beta_0 / 2$;
Construct $B(\beta_0, \delta)$ as described in (17a);
Compute $\pi^\star \in \operatorname{argmax}_{\pi \in \Pi_{\mathrm{SD}}} \max_{\beta \in B(\beta_0, \delta)} h_\infty(\pi, \beta)$ by solving the linear program in (13).

4.2 Algorithms

We now propose a simple algorithm for computing a $\delta$-optimal EVaR policy, described in Algorithm 1. The algorithm reduces finding optimal EVaR-TRC policies to solving a sequence of ERM-TRC problems in (5). As Theorem 4.2 shows, there exists a $\delta$-optimal policy that is ERM-TRC optimal for some $\beta \in B(\beta_0, \delta)$. It is, therefore, sufficient to compute an ERM-TRC optimal policy for one of those $\beta$ values. The analysis above shows that Algorithm 1 is correct.

Corollary 4.4. Algorithm 1 computes a $\delta$-optimal policy $\pi^\star \in \Pi_{\mathrm{SD}}$ that satisfies condition (19).

Corollary 4.4 follows directly from Theorem 4.2 and from the existence of a sufficiently small $\beta_0$, which is guaranteed by the continuity of $g^\star_\infty(\beta)$ for positive $\beta$ around $0$. Algorithm 1 prioritizes simplicity over computational complexity and could be accelerated significantly. Evaluating each $h^\star_\infty(\beta)$ requires computing an optimal ERM-TRC solution, which involves solving a linear program. One could reduce the number of evaluations of $h^\star_\infty$ by employing a branch-and-bound strategy that takes advantage of the monotonicity of $g^\star_\infty$. An additional advantage of Algorithm 1 is that the overhead of computing optimal solutions for multiple risk levels $\alpha$ can be small if one selects an appropriate set $B$.

5 Numerical Evaluation

In this section, we illustrate our algorithms and formulations on tabular MDPs that include positive and negative rewards.

The ERM returns for the discounted and transient MDPs in Figure 1 with parameters $r = 0.2$, $\gamma = 0.9$, and $\epsilon = 0.9$ are shown in Figure 2. The figure shows that, as expected, the returns are identical in the risk-neutral objective (when $\beta = 0$). However, for $\beta > 0$, the discounted and TRC returns differ significantly: the discounted return is unaffected by $\beta$, while the ERM-TRC return decreases with an increasing $\beta$. Please see Su, Grand-Clément, and Petrik (2024, appendix B) for more details.

Figure 2: ERM values with TRC and discounted criteria.

To evaluate the effect of risk-aversion on the structure of the optimal policy, we use the gambler's ruin problem (Hau, Petrik, and Ghavamzadeh 2023; Bäuerle and Ott 2011). In this problem, a gambler starts with a given amount of capital and seeks to increase it up to a cap $K$. In each turn, the gambler decides how much capital to bet. The bet doubles or is lost with probability $q$ and $1 - q$, respectively. The gambler can quit and keep the current wealth; the game also ends when the gambler goes broke or achieves the cap $K$. The reward equals the final capital, except it is $-1$ when the gambler is broke. The initial state is chosen uniformly. In the formulation, we use $q = 0.68$, and the cap is $K = 7$. The algorithm was implemented in Julia 1.10 and is available at https://github.com/suxh2019/ERMLP. Please see Su, Grand-Clément, and Petrik (2024, appendix F) for more details.
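For concreteness, one way to encode the gambler's ruin instance as a transient MDP with a sink state is sketched below in Julia. The state and action indexing, the handling of bets that would exceed the cap, and the support of the uniform initial distribution are our own modeling assumptions and may differ from the encoding used in the released implementation.

```julia
# A possible encoding of the gambler's-ruin instance (q = 0.68, cap K = 7) as a
# transient MDP; the terminal reward is granted on the transition into the sink.
function gamblers_ruin(; q = 0.68, K = 7)
    S = K + 1                     # states 1..S encode capital levels 0..K
    e = S + 1                     # sink state
    A = K + 1                     # action a encodes betting a − 1 units; a = 1 means quit
    P = zeros(S + 1, A, S + 1)
    R = zeros(S + 1, A, S + 1)
    for s in 1:S
        capital = s - 1
        for a in 1:A
            bet = a - 1
            if capital == 0 || capital == K || bet == 0 || bet > capital
                P[s, a, e] = 1.0                       # quit, broke, or cap reached: game ends
                R[s, a, e] = capital == 0 ? -1.0 : capital
            else
                win, lose = min(capital + bet, K) + 1, capital - bet + 1
                P[s, a, win] += q                      # the bet doubles with probability q
                P[s, a, lose] += 1 - q                 # the bet is lost otherwise
            end
        end
    end
    for a in 1:A
        P[e, a, e] = 1.0                               # sink is absorbing with zero reward
    end
    μ = vcat(fill(1.0 / S, S), 0.0)                    # uniform initial capital (an assumption)
    return P, R, μ
end
```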
Note that there is only one action when the -1 0 1 2 3 4 5 6 7 0.0 Probability α = 0.9 α = 0.7 α = 0.4 α = 0.2 Figure 4: Distribution of the final capital for EVa R optimal policies. capital is 0 and 7 for all policies so that action is neglected in Figure 3. Because the optimal policy is stationary, we can interpret and analyze it. The policies become notably less risk-averse as α increases. For example, when α = 0.2, the gambler is very risk-averse and always quits with the current capital. When α = 0.4, the gambler invests 1 when capital is greater than 1 and quits otherwise to avoid losing it all. When α = 0.9, the gambler makes bigger bets, increasing the probability of reaching the cap and losing all capital. To understand the impact of risk-aversion on the distribution of returns, we simulate the resulting policies over 7,000 episodes and show the distribution of capitals in Figure 4. When α = 0.2, the return follows a uniform distribution on [1, 7]. When α = 0.4, the returns are 1 and 7. When α = 0.7 or 0.9, the returns are 1 and 7. Overall, the figure shows that for lower values of α, the gambler gives up some probability of reaching the cap in exchange for a lower probability of losing all capital. 6 Conclusion and Future Work We analyze transient MDPs with two risk measures: ERM and EVa R. We establish the existence of stationary deterministic optimal policies without any assumptions on the sign of the rewards, a significant departure from past work. Our results also provide algorithms based on value iteration, policy iteration, and linear programming for computing optimal policies. Future directions include extensions to infinite-state TRC problems, risk-averse MDPs with average rewards, and partial-state observations. Acknowledgments We thank the anonymous reviewers for their detailed reviews and thoughtful comments, which significantly improved the paper s clarity. This work was supported, in part, by NSF grants 2144601 and 2218063. Julien Grand-Cl ement was supported by Hi! Paris and Agence Nationale de la Recherche (Grant 11-LABX-0047). References Ahmadi, M.; Dixit, A.; Burdick, J. W.; and Ames, A. D. 2021a. Risk-averse stochastic shortest path planning. In IEEE Conference on Decision and Control (CDC), 5199 5204. Ahmadi, M.; Rosolia, U.; Ingham, M. D.; Murray, R. M.; and Ames, A. D. 2021b. Constrained risk-averse Markov decision processes. In AAAI Conference on Artificial Intelligence, volume 35, 11718 11725. Ahmadi-Javid, A. 2012. Entropic Value-at-Risk: A New Coherent Risk Measure. Journal of Optimization Theory and Applications, 155(3): 1105 1123. Ahmadi-Javid, A.; and Pichler, A. 2017. An analytical study of norms and Banach spaces induced by the entropic value-atrisk. Mathematics and Financial Economics, 11(4): 527 550. B auerle, N.; and Glauner, A. 2022. Markov decision processes with recursive risk measures. European Journal of Operational Research, 296(3): 953 966. B auerle, N.; and Ott, J. 2011. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74: 361 379. Bertsekas, D. P.; and Tsitsiklis, J. N. 1991. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3): 580 595. Bertsekas, D. P.; and Yu, H. 2013. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT. Blackwell, D. 1967. Positive dynamic programming. In Berkeley symposium on Mathematical Statistics and Probability, volume 1, 415 418. 
Carpin, S.; Chow, Y.-L.; and Pavone, M. 2016. Risk aversion in finite Markov Decision Processes using total cost criteria and average value at risk. In IEEE International Conference on Robotics and Automation (ICRA), 335–342.

Chung, K.-J.; and Sobel, M. J. 1987. Discounted MDP's: Distribution Functions and Exponential Utility Maximization. SIAM Journal on Control and Optimization, 25(1): 49–62.

Cohen, A.; Efroni, Y.; Mansour, Y.; and Rosenberg, A. 2021. Minimax regret for stochastic shortest path. Advances in Neural Information Processing Systems, 34: 28350–28361.

Dann, C.; Wei, C.-Y.; and Zimmert, J. 2023. A Unified Algorithm for Stochastic Path Problems. In International Conference on Learning Theory.

de Freitas, E. M.; Freire, V.; and Delgado, K. V. 2020. Risk Sensitive Stochastic Shortest Path and LogSumExp: From Theory to Practice. In Cerri, R.; and Prati, R. C., eds., Intelligent Systems (BRACIS), Lecture Notes in Computer Science, 123–139. Springer.

Denardo, E. V.; and Rothblum, U. G. 1979. Optimal stopping, exponential utility, and linear programming. Mathematical Programming, 16(1): 228–244.

Fei, Y.; Yang, Z.; Chen, Y.; and Wang, Z. 2021. Exponential Bellman equation and improved regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems, 34: 20436–20446.

Fei, Y.; Yang, Z.; and Wang, Z. 2021. Risk-sensitive reinforcement learning with function approximation: A debiasing approach. In International Conference on Machine Learning, 3198–3207. PMLR.

Filar, J.; and Vrieze, K. 2012. Competitive Markov Decision Processes. Springer Science & Business Media.

Föllmer, H.; and Schied, A. 2016. Stochastic Finance: An Introduction in Discrete Time. De Gruyter Graduate, 4th edition.

Freire, V.; and Delgado, K. V. 2016. Extreme risk averse policy for goal-directed risk-sensitive Markov decision process. In Brazilian Conference on Intelligent Systems (BRACIS), 79–84.

Gavriel, C.; Hanasusanto, G.; and Kuhn, D. 2012. Risk-averse shortest path problems. In IEEE Conference on Decision and Control (CDC), 2533–2538.

Grand-Clément, J.; and Petrik, M. 2022. Towards Convex Optimization Formulations for Robust MDPs.

Hau, J. L.; Delage, E.; Ghavamzadeh, M.; and Petrik, M. 2023. On Dynamic Programming Decompositions of Static Risk Measures in Markov Decision Processes. In Neural Information Processing Systems (NeurIPS).

Hau, J. L.; Petrik, M.; and Ghavamzadeh, M. 2023. Entropic risk optimization in discounted MDPs. In International Conference on Artificial Intelligence and Statistics, 47–76. PMLR.

Horn, R. A.; and Johnson, C. A. 2013. Matrix Analysis. Cambridge University Press, 2nd edition.

James, H. W.; and Collins, E. 2006. An analysis of transient Markov decision processes. Journal of Applied Probability, 43(3): 603–621.

Kallenberg, L. 2021. Markov Decision Processes. Lecture Notes, University of Leiden.

Kastner, T.; Erdogdu, M. A.; and Farahmand, A.-m. 2023. Distributional Model Equivalence for Risk-Sensitive Reinforcement Learning. In Conference on Neural Information Processing Systems.

Kupper, M.; and Schachermayer, W. 2006. Representation Results for Law Invariant Time Consistent Functions. Mathematics and Financial Economics, 16(2): 419–441.

Lam, T.; Verma, A.; Low, B. K. H.; and Jaillet, P. 2022. Risk-Aware Reinforcement Learning with Coherent Risk Measures and Non-linear Function Approximation. In International Conference on Learning Representations (ICLR).
Li, X.; Zhong, H.; and Brandeau, M. L. 2022. Quantile Markov decision processes. Operations Research, 70(3): 1428–1447.

Marthe, A.; Garivier, A.; and Vernade, C. 2023. Beyond Average Return in Markov Decision Processes. In Conference on Neural Information Processing Systems.

Meggendorfer, T. 2022. Risk-aware stochastic shortest path. In AAAI Conference on Artificial Intelligence, volume 36, 9858–9867.

Patek, S. D. 1997. Stochastic and shortest path games: theory and algorithms. Ph.D. thesis, Massachusetts Institute of Technology.

Patek, S. D. 2001. On terminating Markov decision processes with a risk-averse objective function. Automatica, 37(9): 1379–1386.

Patek, S. D.; and Bertsekas, D. P. 1999. Stochastic shortest path games. SIAM Journal on Control and Optimization, 37(3): 804–824.

Pflug, G. C.; and Pichler, A. 2016. Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research, 41(2): 682–699.

Puterman, M. L. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Smith, K. M.; and Chapman, M. P. 2023. On Exponential Utility and Conditional Value-at-Risk as risk-averse performance criteria. IEEE Transactions on Control Systems Technology.

Su, X.; Grand-Clément, J.; and Petrik, M. 2024. Risk-averse Total-reward MDPs with ERM and EVaR. arXiv preprint arXiv:2408.17286.

Su, X.; and Petrik, M. 2023. Solving multi-model MDPs by coordinate ascent and dynamic programming. In Uncertainty in Artificial Intelligence, 2016–2025. PMLR.

Su, X.; Petrik, M.; and Grand-Clément, J. 2024a. EVaR Optimization in MDPs with Total Reward Criterion. In Seventeenth European Workshop on Reinforcement Learning.

Su, X.; Petrik, M.; and Grand-Clément, J. 2024b. Optimality of Stationary Policies in Risk-averse Total-reward MDPs with EVaR. In ICML 2024 Workshop: Foundations of Reinforcement Learning and Control, Connections and Perspectives.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. The MIT Press, 2nd edition.