# Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR

Kaiwen Wang 1,2, Nathan Kallus 2, Wen Sun 1

1 Computer Science, Cornell University. 2 Operations Research and Information Engineering, Cornell Tech. Correspondence to: Kaiwen Wang. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract: In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes, and that it is achieved by an Upper Confidence Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov Decision Processes (MDPs), we show a minimax regret lower bound of $\Omega(\sqrt{\tau^{-1}SAK})$ (with normalized cumulative rewards), where $S$ is the number of states, and we propose a novel bonus-driven Value Iteration procedure. We show that our algorithm achieves the optimal regret of $\tilde{O}(\sqrt{\tau^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\tilde{O}(\tau^{-1}\sqrt{SAK})$, which is minimax-optimal for constant $\tau$. This improves on the best available bounds. By discretizing rewards appropriately, our algorithms are computationally efficient.

1. Introduction

Reinforcement Learning (RL) (Sutton & Barto, 2018) is the canonical framework for sequential decision making under uncertainty, with applications in personalizing recommendations (Bottou et al., 2013), robotics (Rajeswaran et al., 2017), healthcare (Murphy, 2003) and education (Singla et al., 2021). In vanilla RL, the objective is to maximize the average of returns, the cumulative rewards collected by the policy. As RL is increasingly applied in consequential settings, it is often necessary to account for risk beyond solely optimizing for the average.

Conditional Value at Risk (CVaR) is a popular coherent measure of risk (Rockafellar & Uryasev, 2000; Filippi et al., 2020). For a random return $X$ (where higher is better), the CVaR with risk tolerance $\tau \in (0, 1]$ is defined as
$\mathrm{CVaR}_\tau(X) := \sup_{b \in \mathbb{R}} \big( b - \tau^{-1}\mathbb{E}[(b - X)_+] \big)$,   (1)
where $x_+ = \max(x, 0)$. $\mathrm{CVaR}_\tau(X)$ is the average outcome among the worst $\tau$-fraction of cases, and when $X$ is continuous this exactly corresponds to the outcomes less than or equal to the $\tau$-th quantile (Acerbi & Tasche, 2002), i.e.,
$\mathrm{CVaR}_\tau(X) = \mathbb{E}[X \mid X \leq F_X^{-1}(\tau)]$,   (2)
where $F_X^{-1}(\tau) = \inf\{x : F_X(x) \geq \tau\}$ is the $\tau$-th quantile of $X$, a.k.a. the Value at Risk (VaR). A high risk tolerance $\tau = 1$ recovers the risk-neutral expectation, i.e., $\mathrm{CVaR}_1(X) = \mathbb{E}[X]$. As $\tau$ decreases, $\mathrm{CVaR}_\tau$ models the worst-case outcome, i.e., $\lim_{\tau \to 0} \mathrm{CVaR}_\tau(X) = \operatorname{ess\,inf} X$. In the CVaR RL model we consider, $X$ is the return of a policy, so our objective captures the tail risk of the returns distribution. Another motivating perspective is that CVaR RL is equivalent to the robust MDP model, i.e., expected value under worst-case perturbation of the transition kernel (Chow et al., 2015). Thus, CVaR RL is an attractive alternative to vanilla RL in safety-critical applications.
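To make the two equivalent views of CVaR in Eqs. (1) and (2) concrete, here is a short numerical sketch of our own (not part of the paper); the Beta-distributed sample and the grid search over $b$ are illustrative assumptions. It estimates $\mathrm{CVaR}_\tau$ both by maximizing the Rockafellar-Uryasev objective over $b$ and by averaging the worst $\tau$-fraction of outcomes; for a continuous distribution the two agree up to sampling error, and $\tau = 1$ recovers the plain mean.

```python
import numpy as np

def cvar_ru(x, tau, grid_size=401):
    """CVaR_tau via the Rockafellar-Uryasev form (Eq. 1):
    sup_b { b - E[(b - X)_+] / tau }, with b searched on a finite grid."""
    bs = np.linspace(x.min(), x.max(), grid_size)
    vals = bs - np.mean(np.maximum(bs[:, None] - x[None, :], 0.0), axis=1) / tau
    return vals.max()

def cvar_tail(x, tau):
    """CVaR_tau via the tail-average form (Eq. 2): mean of X below its tau-th quantile."""
    var_tau = np.quantile(x, tau)          # Value at Risk (the tau-th quantile)
    return x[x <= var_tau].mean()

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=20_000)        # a continuous return distribution on [0, 1]
for tau in (0.05, 0.25, 1.0):
    print(f"tau={tau}: RU form {cvar_ru(x, tau):.4f}, tail form {cvar_tail(x, tau):.4f}, mean {x.mean():.4f}")
```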
In this paper, we provide algorithms with state-of-the-art regret guarantees for tabular, online decision making with the CVaR objective. To start, we prove tight lower bounds on the expected CVaR regret (formalized in Section 2) for both multi-arm bandit (MAB) and RL problems. We then propose BERNSTEIN-UCB, an Upper Confidence Bound (UCB) algorithm with a novel bonus constructed using Bernstein's inequality, and we prove it is minimax-optimal.¹ Compared to Brown-UCB (Tamkin et al., 2019), BERNSTEIN-UCB is minimax-optimal in general, without requiring reward distributions to be continuous.

¹Following Azar et al. (2017), we say an algorithm is minimax-optimal if its regret matches (up to log terms) our novel minimax lower bound in all problem parameters. Sometimes, this is also referred to as nearly-minimax-optimal (Zhou et al., 2021a).

We then turn to tabular RL with the CVaR objective. We propose CVaR-UCBVI, a novel bonus-driven Value Iteration (VI) algorithm in an augmented MDP. The augmented MDP framework of Bäuerle & Ott (2011) reveals Bellman equations for CVaR RL that served as the initial catalyst for the VI approach. We provide two choices of bonuses for CVaR-UCBVI, based on Hoeffding's and Bernstein's inequalities. With the Bernstein bonus, we guarantee a CVaR regret of $\tilde{O}(\tau^{-1}\sqrt{SAK})$, where $K$ is the number of episodes and $S, A$ are the sizes of the state and action spaces, respectively. This improves over the previous bound of $\tilde{O}(\tau^{-1}\sqrt{S^3AHK})$ (Bastani et al., 2022) in $S$ and in $H$, the horizon length. Note that we work under the normalized-returns model, so there should be no $H$ scaling in the bound (Jiang & Agarwal, 2018). If $\tau$ is a constant, our result is already minimax-optimal. Surprisingly, however, our lower bound actually scales as $\tau^{-1/2}$. Under an assumption that the returns of all policies are continuously distributed with lower-bounded density, we improve the upper bound on the regret of CVaR-UCBVI with the Bernstein bonus to $\tilde{O}(\sqrt{\tau^{-1}SAK})$. This establishes CVaR-UCBVI as the first algorithm with minimax-optimal regret for risk-sensitive CVaR RL, under the continuity assumption. Our key technical innovation is decomposing the regret using a novel simulation lemma for CVaR RL and precisely bounding the sum of variance bonuses with the Law of Total Variance (Azar et al., 2017).

1.1. Related Literature

CVaR MAB: Kagrecha et al. (2019) proposed a successive-rejects algorithm for best CVaR arm identification, but it does not have regret guarantees. Tamkin et al. (2019) proposed two algorithms for CVaR MAB and analyzed upper bounds on their regret, but not lower bounds. Their "CVaR-UCB" builds a confidence band for the reward distribution of each arm via the Dvoretzky-Kiefer-Wolfowitz inequality, resulting in an optimistic estimate of CVaR. This leads to a suboptimal $\tau^{-1}$ dependence in the regret but may empirically work better if $\tau$ is not approaching 0. Their "Brown-UCB" is structurally similar to our BERNSTEIN-UCB, but they use a Hoeffding bonus that ensures optimism only if all arms have continuously distributed rewards (Brown, 2007, Theorem 4.2). We propose a Bernstein bonus that attains the minimax-optimal rate $\sqrt{\tau^{-1}AK}$ without any assumptions on the reward distribution.

Regret bounds for CVaR RL: To the best of our knowledge, Bastani et al. (2022) is the first and only work with regret bounds for CVaR RL (formalized in Section 2). Their algorithm iteratively constructs optimistic MDPs by routing unexplored states to a sink state with the maximum reward. This approach leads to a CVaR regret bound of $\tilde{O}(\tau^{-1}\sqrt{S^3AHK})$ (Bastani et al., 2022, Theorem 4.1), which is sub-optimal. The authors conjectured that bonus-based optimism could improve the bound by an $S\sqrt{H}$ factor.
Our proposed CVaR-UCBVI indeed enjoys these improvements, leading to a $\tilde{O}(\tau^{-1}\sqrt{SAK})$ regret guarantee in Theorem 5.3. If returns are continuously distributed, we further improve the $\tau$ dependence, leading to the minimax-optimal result in Theorem 5.5.

CVaR RL without regret guarantees: Keramati et al. (2020) proposed a distributional RL approach (Bellemare et al., 2017) for RL with the CVaR objective. A key difference is that Keramati et al. (2020) focuses on the easier task of identifying a policy with high CVaR. On the other hand, Bastani et al. (2022) and our work focus on algorithms with low CVaR regret, which guarantees safe exploration. Note that low-regret methods can be converted into probably approximately correct (PAC) CVaR RL by taking the uniform mixture of policies from the low-regret algorithm. Tamar et al. (2015) derived the policy gradient for the CVaR RL objective and showed asymptotic convergence to a local optimum. Chow & Ghavamzadeh (2014) developed actor-critic algorithms for the mean-CVaR objective, i.e., maximizing expected returns subject to a CVaR constraint. Another motivating perspective for CVaR RL is its close ties to robust MDPs (Wiesemann et al., 2013). Specifically, Chow et al. (2015, Proposition 1) showed that the CVaR of returns is equivalent to the expected returns under the worst-case perturbation of the transition kernel in some uncertainty set. While the uncertainty set is not rectangular, Chow et al. (2015) derived tractable robust Bellman equations and proved convergence to a globally optimal CVaR policy. However, these methods for CVaR RL do not lead to low-regret algorithms, which is our focus.

Risk-sensitive RL with different risk measures: Prior works have also proved risk-sensitive RL regret bounds for other risk measures that are not directly comparable to the CVaR RL setting we consider. Fei et al. (2020; 2021) and Liang & Luo (2022) showed Bellman equations and regret guarantees for the entropic risk measure based on an exponential utility function. Du et al. (2022) and Lam et al. (2023) studied the more conservative Iterated CVaR objective, which considers the risk of the reward-to-go at every step along the trajectory. In contrast, our setup aims to holistically maximize the CVaR of the total returns.

Risk-sensitive regret lower bounds: Fei et al. (2020) and Liang & Luo (2022) showed regret lower bounds for risk-sensitive RL with the entropic risk measure. We show tight lower bounds for risk-sensitive MAB and RL with the CVaR objective, which to the best of our knowledge are the first lower bounds for this problem.

Safety in offline RL: While our focus is online RL, risk aversion has also been studied in offline RL. Some past works include offline learning with risk measures (Urpí et al., 2021) and distributional robustness (Panaganti et al., 2022; Si et al., 2020; Kallus et al., 2022; Zhou et al., 2021b).

2. Problem Setup

As a warmup, we consider $\mathrm{CVaR}_\tau$ regret in a multi-arm bandit (MAB) problem with $K$ episodes. At each episode $k \in [K]$, the learner selects an arm $a_k \in \mathcal{A}$ and observes reward $r_k \sim \nu(a_k)$, where $\nu(a)$ is the reward distribution of arm $a$. The learner's goal is to compete with the arm with the highest $\mathrm{CVaR}_\tau$ value, i.e., $a^\star = \arg\max_{a \in \mathcal{A}} \mathrm{CVaR}_\tau(\nu(a))$, and minimize the regret, defined as
$\mathrm{Regret}^{\mathrm{MAB}}_\tau(K) = \sum_{k=1}^K \big( \mathrm{CVaR}_\tau(\nu(a^\star)) - \mathrm{CVaR}_\tau(\nu(a_k)) \big)$.
The focus of this paper is online RL, which generalizes the MAB (where $S = H = 1$).
The statistical model is tabular Markov Decision Processes (MDPs) (Agarwal et al., 2021), with finite state space $\mathcal{S}$ of size $S$, finite action space $\mathcal{A}$ of size $A$, and horizon $H$. Let $\Pi_{\mathrm{H}}$ be the set of history-dependent policies, so each policy is $\pi = (\pi_h : \mathcal{S} \times \mathcal{H}_h \to \Delta(\mathcal{A}))_{h \in [H]}$, where $\mathcal{H}_h = \{(s_i, a_i, r_i)\}_{i \in [h-1]}$ is the history up to time $h$. At each episode $k \in [K]$, the learner plays a history-dependent policy $\pi^k \in \Pi_{\mathrm{H}}$, which induces a distribution over trajectories as follows. First, start at the fixed initial state $s_1 \in \mathcal{S}$. Then, for each time $h = 1, 2, \ldots, H$, sample an action $a_h \sim \pi^k_h(s_h, \mathcal{H}_h)$, which leads to a reward $r_h \sim R(s_h, a_h)$ and the next state $s_{h+1} \sim P^\star(s_h, a_h)$. Here, $P^\star : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the unknown Markov transition kernel and $R : \mathcal{S} \times \mathcal{A} \to \Delta([0, 1])$ is the known reward distribution. The return is the sum of rewards from this process, $R(\pi) = \sum_{h=1}^H r_h$, which is a random variable. We posit the return is almost surely normalized in that $R(\pi) \in [0, 1]$ w.p. 1 (as in Jiang & Agarwal, 2018, Section 2.1). We note that the normalized-returns model allows for sparsity in the rewards and is thus strictly more general for regret upper bounds. Many prior works do not normalize, so their returns may scale with $H$. When comparing to their bounds, we make the scaling consistent by dividing them by $H$.

We focus on the setting we call CVaR RL, in which the learner's goal is to compete with a $\mathrm{CVaR}_\tau$-optimal policy, i.e., $\pi^\star \in \arg\max_{\pi \in \Pi_{\mathrm{H}}} \mathrm{CVaR}_\tau(R(\pi))$. Toward this end, we define the regret as
$\mathrm{Regret}^{\mathrm{RL}}_\tau(K) = \sum_{k=1}^K \big( \mathrm{CVaR}^\star_\tau - \mathrm{CVaR}_\tau(R(\pi^k)) \big)$, where $\mathrm{CVaR}^\star_\tau = \mathrm{CVaR}_\tau(R(\pi^\star))$.
CVaR RL captures vanilla risk-neutral RL when $\tau = 1$ and Worst-Path RL (Du et al., 2022) when $\tau \to 0$. We prove lower bounds on the expected regret. For upper bounds, we give high-probability regret bounds, which imply upper bounds (with the same dependencies on problem parameters) on the expected regret by integrating over the failure probability $\delta \in (0, 1)$.

Notation: $[i : j] = \{i, i+1, \ldots, j\}$, $[n] = [1 : n]$, and $\Delta(\mathcal{S})$ is the set of distributions on $\mathcal{S}$. We set $L = \log(HSAK/\delta)$ (for MAB, $L = \log(AK/\delta)$), where $\delta$ is the desired failure probability provided to the algorithm. Please see Table 1 for a comprehensive list of notations.

3. Lower Bounds

We start with the minimax lower bound for $\mathrm{CVaR}_\tau$ MAB.

Theorem 3.1. Fix any $\tau \in (0, 1/2)$ and $A \in \mathbb{N}$. For any algorithm, there is a MAB problem with Bernoulli rewards s.t. if $K \geq \frac{A-1}{8\tau}$, then $\mathbb{E}\big[\mathrm{Regret}^{\mathrm{MAB}}_\tau(K)\big] \geq \frac{1}{24e}\sqrt{\frac{(A-1)K}{\tau}}$.

Proof Sketch: Our proof is inspired by the lower bound construction for the vanilla MAB (Lattimore & Szepesvári, 2020, Theorem 15.2). The key idea is to fix any learner and construct two MAB instances that appear similar to the learner but in reality have very different CVaR values. Specifically, for any $\epsilon \in (0, 1)$, we need two reward distributions such that their KL-divergence is $O(\epsilon^2\tau^{-1})$ while their CVaRs differ by $\Omega(\tau^{-1}\epsilon)$. We show that $\mathrm{Ber}(1-\tau)$ and $\mathrm{Ber}(1-\tau+\epsilon)$ satisfy this.

Compared to the vanilla MAB minimax lower bound of $\Omega(\sqrt{AK})$, our result for $\mathrm{CVaR}_\tau$ MAB has an extra $\sqrt{\tau^{-1}}$ factor. This proves that it is information-theoretically harder to be more risk-averse with $\mathrm{CVaR}_\tau$. While Brown-UCB of Tamkin et al. (2019) appears to match this lower bound, their proof hinges on the continuity of reward distributions, which is invalid for Bernoulli rewards. In Theorem 4.1, we show that BERNSTEIN-UCB is minimax-optimal over all reward distributions.
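The two hard instances in this construction can be checked numerically. The snippet below is our own illustration (the choice $\tau = 0.1$, $\epsilon = 0.05$ is arbitrary): it evaluates the CVaR gap and the KL divergence of $\mathrm{Ber}(1-\tau)$ versus $\mathrm{Ber}(1-\tau+\epsilon)$, matching the claims made precise in Lemmas D.3 and D.4 of the appendix.

```python
import numpy as np

def cvar_bernoulli(p, tau):
    """CVaR_tau of Ber(p) via Eq. (1); the sup over b is attained at b in {0, 1}."""
    candidates = np.array([0.0, 1.0])
    # E[(b - X)_+] for X ~ Ber(p): equals b*(1-p) + (b-1)_+ * p for b in {0, 1}
    exp_plus = candidates * (1 - p) + np.maximum(candidates - 1, 0) * p
    return np.max(candidates - exp_plus / tau)

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

tau, eps = 0.1, 0.05                      # the lemmas require eps <= tau
lo, hi = 1 - tau, 1 - tau + eps           # the two hard instances
gap = cvar_bernoulli(hi, tau) - cvar_bernoulli(lo, tau)
kl = kl_bernoulli(lo, hi)
print(f"CVaR gap = {gap:.4f}  (Lemma D.3 predicts eps/tau = {eps/tau:.4f})")
print(f"KL       = {kl:.6f}  (Lemma D.4 bound 2*eps^2/tau = {2*eps**2/tau:.6f})")
```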
We next extend the above result to the RL setting.

Corollary 3.2. Fix any $\tau \in (0, 1/2)$ and $A, H \in \mathbb{N}$. For any algorithm, there is an MDP (with $S = \Theta(A^{H-1})$) s.t. if $K \geq \frac{(A-1)S}{8\tau}$, then $\mathbb{E}\big[\mathrm{Regret}^{\mathrm{RL}}_\tau(K)\big] \geq \frac{1}{24e}\sqrt{\frac{(A-1)SK}{\tau}}$.

The argument is to show that a class of MDPs with rewards only at the last layer essentially reduces to a MAB with exponentially many actions. Thus, the hardest CVaR RL problems are actually very big CVaR MAB problems. The bound does not scale with $H$ as we have assumed returns to be normalized in $[0, 1]$ (Jiang & Agarwal, 2018).

4. Risk-Sensitive MAB

In this section, we propose a simple modification to the classic Upper Confidence Bound (UCB) algorithm with a novel Bernstein bonus that enjoys minimax-optimal regret. In the classic UCB algorithm (Auer et al., 2002), the bonus quantifies the confidence band from Hoeffding's inequality. Instead, we propose to build a confidence band of $\mu(b, a) = \mathbb{E}_{R \sim \nu(a)}[(b - R)_+]$ using a Bernstein-based bonus (Eq. (3)). The standard-deviation term $\sqrt{\tau}$ in our bonus is crucial for obtaining the minimax-optimal regret.

Theorem 4.1. For any $\delta \in (0, 1)$, w.p. at least $1 - \delta$, BERNSTEIN-UCB with $\epsilon \leq \sqrt{A/(2\tau K)}$ enjoys
$\mathrm{Regret}^{\mathrm{MAB}}_\tau(K) \leq 4\sqrt{\tau^{-1}AKL} + 16\tau^{-1}AL^2$.

Algorithm 1 BERNSTEIN-UCB
1: Input: risk tolerance $\tau$, number of episodes $K$, failure probability $\delta$, approximation parameter $\epsilon$.
2: for episode $k = 1, 2, \ldots, K$ do
3: Compute counts $N_k(a) = 1 \vee \sum_{i=1}^{k-1} \mathbb{I}[a_i = a]$.
4: Define the pessimistic estimate of $\mu(b, a) = \mathbb{E}_{R \sim \nu(a)}[(b - R)_+]$, i.e., for all $b, a$,
$\hat\mu_k(b, a) = \frac{1}{N_k(a)} \sum_{i=1}^{k-1} (b - r_i)_+ \mathbb{I}[a_i = a]$,  $\mathrm{BON}_k(a) = \sqrt{\frac{2\tau\log(AK/\delta)}{N_k(a)}} + \frac{\log(AK/\delta)}{N_k(a)}$.   (3)
5: Compute $\epsilon$-optimal solutions $\hat b_{a,k}$, i.e., for all $a$, $\hat f_k(\hat b_{a,k}, a) \geq \max_{b \in [0,1]} \hat f_k(b, a) - \epsilon$, where $\hat f_k(b, a) = b - \tau^{-1}(\hat\mu_k(b, a) - \mathrm{BON}_k(a))$.
6: Compute and pull the action for this episode, $a_k = \arg\max_{a \in \mathcal{A}} \hat f_k(\hat b_{a,k}, a)$. Receive reward $r_k \sim \nu(a_k)$.
7: end for

Proof Sketch: First, we use Bernstein's inequality to build a confidence band of $\mu(b, a)$ at $b^\star_a = \arg\max_{b \in [0,1]} \big( b - \tau^{-1}\mu(b, a) \big)$. Conveniently, $b^\star_a$ is the $\tau$-th quantile of $\nu(a)$, hence $\mathrm{Var}_{R \sim \nu(a)}((b^\star_a - R)_+) \leq \tau$. This proves pessimism with our Bernstein-based bonus, i.e., $\hat\mu_k(b^\star_a, a) - \mathrm{BON}_k(a) \leq \mu(b^\star_a, a)$. Pessimism in turn implies optimism in CVaR, i.e., $\mathrm{CVaR}^\star_\tau \leq \widehat{\mathrm{CVaR}}{}^k_\tau := \max_{b \in [0,1]} \big( b - \tau^{-1}(\hat\mu_k(b, a_k) - \mathrm{BON}_k(a_k)) \big)$. This allows us to decompose the regret into (1) the sum of bonuses plus (2) the difference between the empirical and true CVaR of $\nu(a_k)$. Term (1) is handled using a standard pigeonhole argument. To bound (2), we prove a new concentration inequality for CVaR that holds for any bounded random variable (Theorem C.6), which may be of independent interest.

Up to log terms, the resulting bound matches our lower bound in Theorem 3.1, proving that BERNSTEIN-UCB is minimax-optimal. As noted earlier, under the assumption that rewards are continuous, Brown-UCB (Tamkin et al., 2019) also matches our novel lower bound. When working with continuous distributions, CVaR takes the convenient form in Eq. (2), which roughly suggests that Hoeffding's inequality on the lower $\tau N$ data points suffices for a CVaR concentration bound (Brown, 2007, Theorem 4.2). This is why the Brown-UCB bonus, which is $\sqrt{5\tau\log(3/\delta)/N_k(a)}$, does not yield optimism when continuity fails. In general, the $1/N_k(a)$ term in our bonus from Bernstein's inequality is needed for proving optimism.

Computational Efficiency: In Line 5, the objective function $\hat f_k(\cdot, a)$ is concave and unimodal. So, its optimal value can be efficiently approximated, e.g., by golden-section search (Kiefer, 1953) or by gradient ascent in $1/\epsilon^2 = O(\tau K/A)$ steps (Boyd et al., 2004). Thus, BERNSTEIN-UCB is both minimax-optimal for regret and computationally efficient.
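As a concrete companion to Algorithm 1, here is a minimal NumPy sketch of BERNSTEIN-UCB written by us for illustration; the two toy arms, the grid search standing in for the $\epsilon$-optimal inner maximization of Line 5, and the seed are our own assumptions, not part of the paper.

```python
import numpy as np

def bernstein_ucb(arms, tau, K, delta=0.05, n_grid=201, seed=0):
    """Sketch of BERNSTEIN-UCB (Algorithm 1).
    `arms` is a list of callables; arms[a](rng) draws one reward in [0, 1]."""
    rng = np.random.default_rng(seed)
    A = len(arms)
    log_term = np.log(A * K / delta)
    rewards = [[] for _ in range(A)]            # observed rewards per arm
    bs = np.linspace(0.0, 1.0, n_grid)          # grid standing in for the eps-optimal search (Line 5)
    for k in range(K):
        scores = np.empty(A)
        for a in range(A):
            n = max(1, len(rewards[a]))
            r = np.array(rewards[a]) if rewards[a] else np.zeros(1)
            # pessimistic estimate of mu(b, a) = E[(b - R)_+] minus the Bernstein bonus (Eq. 3)
            mu_hat = np.mean(np.maximum(bs[:, None] - r[None, :], 0.0), axis=1)
            bonus = np.sqrt(2 * tau * log_term / n) + log_term / n
            # optimistic CVaR index: max_b { b - (mu_hat(b) - bonus) / tau }
            scores[a] = np.max(bs - (mu_hat - bonus) / tau)
        a_k = int(np.argmax(scores))
        rewards[a_k].append(arms[a_k](rng))
    return [len(r) for r in rewards]

# toy usage: arm 0 ~ Beta(8, 2) has the better 10%-tail; arm 1 ~ Ber(0.93) has the higher mean
arms = [lambda g: g.beta(8.0, 2.0), lambda g: float(g.random() < 0.93)]
print("pull counts:", bernstein_ucb(arms, tau=0.1, K=1000))
```

Because the bonus keeps the theoretical constants, it is quite conservative for small $K$; the pull counts shift toward the arm with the better $\tau$-tail (arm 0 here) as $K$ grows, even though that arm has the lower mean.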
5. Risk-Sensitive RL

We now shift gears to CVaR RL. First, we review the augmented MDP framework due to Bäuerle & Ott (2011) and derive Bellman equations for our problem (Bellemare et al., 2023). Using this perspective, we propose CVaR-UCBVI, a bonus-driven Value Iteration algorithm in the augmented MDP, which we show enjoys strong regret guarantees.

5.1. Augmented MDP and Bellman Equations

For any history-dependent $\pi \in \Pi_{\mathrm{H}}$, timestep $h \in [H]$, state $s_h \in \mathcal{S}$, budget $b_h \in [0, 1]$, and history $\mathcal{H}_h$, define
$V^\pi_h(s_h, b_h; \mathcal{H}_h) = \mathbb{E}_\pi\big[ \big( b_h - \sum_{t=h}^H r_t \big)_+ \mid s_h, \mathcal{H}_h \big]$.
Then, the CVaR RL objective can be formulated as
$\mathrm{CVaR}^\star_\tau = \max_{\pi \in \Pi_{\mathrm{H}}} \max_{b \in [0,1]} \big( b - \tau^{-1} V^\pi_1(s_1, b) \big) = \max_{b \in [0,1]} \big( b - \tau^{-1} \min_{\pi \in \Pi_{\mathrm{H}}} V^\pi_1(s_1, b) \big)$.   (4)

Bäuerle & Ott (2011) showed a remarkable fact about $\min_{\pi \in \Pi_{\mathrm{H}}} V^\pi_1(s_1, b_1)$: there exists an optimal policy $\rho^\star = (\rho^\star_h : \mathcal{S}_{\mathrm{Aug}} \to \mathcal{A})_{h \in [H]}$ that is deterministic and Markov in an augmented MDP, which we now describe. The augmented state is $(s, b) \in \mathcal{S}_{\mathrm{Aug}} := \mathcal{S} \times [0, 1]$. Given any $b_1 \in [0, 1]$, the initial state is $(s_1, b_1)$. Then, for each $h = 1, 2, \ldots, H$,
$a_h = \rho^\star_h(s_h, b_h)$,  $r_h \sim R(s_h, a_h)$,  $s_{h+1} \sim P^\star(s_h, a_h)$,  $b_{h+1} = b_h - r_h$.
Intuitively, the extra state $b_h$ is the amount of budget left from the initial $b_1$, and it is a sufficient statistic of the history for the CVaR RL problem. Let $\Pi_{\mathrm{Aug}}$ denote the set of deterministic, Markov policies in the augmented MDP. Then, we may optimize over this simpler policy class without losing optimality!

Before we formalize the optimality result, we first derive Bellman equations (as in Bellemare et al., 2023, Chapter 7.8). For any $\rho \in \Pi_{\mathrm{Aug}}$, overload notation and define
$V^\rho_h(s_h, b_h) = \mathbb{E}_\rho\big[ \big( b_h - \sum_{t=h}^H r_t \big)_+ \mid s_h, b_h \big]$,
where $r_h, \ldots, r_H$ are the rewards generated by executing $\rho$ in the augmented MDP starting from $(s_h, b_h)$ at time step $h$. Observe that $V^\rho_h$ satisfies the Bellman equations
$V^\rho_h(s_h, b_h) = \mathbb{E}_{a_h \sim \rho_h(s_h, b_h)}\big[ U^\rho_h(s_h, b_h, a_h) \big]$,  $U^\rho_h(s_h, b_h, a_h) = \mathbb{E}_{s_{h+1}, r_h}\big[ V^\rho_{h+1}(s_{h+1}, b_{h+1}) \big]$,
where $s_{h+1} \sim P^\star(s_h, a_h)$, $r_h \sim R(s_h, a_h)$, $b_{h+1} = b_h - r_h$, and $V^\rho_{H+1}(s, b) = b_+$. Analogously, define $V^\star_h$ and $\rho^\star$ inductively with the Bellman optimality equations
$V^\star_h(s_h, b_h) = \min_{a \in \mathcal{A}} U^\star_h(s_h, b_h, a)$,  $\rho^\star_h(s_h, b_h) = \arg\min_{a \in \mathcal{A}} U^\star_h(s_h, b_h, a)$,  $U^\star_h(s_h, b_h, a_h) = \mathbb{E}_{s_{h+1}, r_h}\big[ V^\star_{h+1}(s_{h+1}, b_{h+1}) \big]$,
where $s_{h+1} \sim P^\star(s_h, a_h)$, $r_h \sim R(s_h, a_h)$, $b_{h+1} = b_h - r_h$, and $V^\star_{H+1}(s, b) = b_+$. Armed with these definitions, we formalize the optimality result in the following theorem.

Theorem 5.1 (Optimality of $\Pi_{\mathrm{Aug}}$). For any $b \in [0, 1]$, $V^\star_1(s_1, b) = V^{\rho^\star}_1(s_1, b) = \inf_{\pi \in \Pi_{\mathrm{H}}} V^\pi_1(s_1, b)$.

This is a known result in the infinite-horizon, discounted setting (Bäuerle & Ott, 2011; Bellemare et al., 2023). We provide a proof from first principles for the finite-horizon setting in Appendix F, by inductively unravelling the Bellman optimality equations. As a technical remark, we show optimality over history-dependent policies in the augmented MDP with memory, a class larger than the history-dependent class defined here. These facts imply that we could compute $V^\star_1$ and $\rho^\star$ using dynamic programming (DP) if we knew the true transitions $P^\star$, and the procedure is similar to the classic Value Iteration procedure in vanilla RL. Based on Theorem 5.1 and Eq. (4), by executing $\rho^\star$ starting from $(s_1, b^\star)$ with $b^\star := \arg\max_{b \in [0,1]} \big( b - \tau^{-1}V^\star_1(s_1, b) \big)$, we achieve the maximum CVaR value in the original MDP. Below we leverage this DP perspective on the augmented MDP to design exploration algorithms to solve CVaR RL.
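To make the DP perspective concrete, the following sketch (our own illustration; the tiny MDP, the grid of reward values, and all array layouts are assumptions) runs backward induction for $V^\star_h(s, b)$ over augmented states when the true $P^\star$ and $R$ are known and rewards lie on a finite grid (which Section 6 justifies), then reads off $\mathrm{CVaR}^\star_\tau = \max_b \{ b - \tau^{-1}V^\star_1(s_1, b) \}$ as in Eq. (4).

```python
import numpy as np

def cvar_value_iteration(P, R_probs, r_grid, tau, s1=0):
    """Backward induction for V*_h(s, b) in the augmented MDP (Section 5.1),
    assuming the true transitions P and reward pmfs R_probs are known.
    P[h, s, a, s'] and R_probs[h, s, a, i] = Pr(r = r_grid[i]); r_grid are multiples of eta."""
    H, S, A, _ = P.shape
    eta = r_grid[1] - r_grid[0] if len(r_grid) > 1 else 1.0
    budgets = np.arange(0.0, 1.0 + 1e-9, eta)        # budget grid; closed under b -> b - r
    B = len(budgets)
    V = budgets[None, :].repeat(S, axis=0)           # V_{H+1}(s, b) = b_+ = b on the grid
    for h in range(H - 1, -1, -1):
        U = np.zeros((S, B, A))
        for a in range(A):
            for i, r in enumerate(r_grid):
                shift = int(round(r / eta))
                # V_{h+1}(s', b - r); budgets below zero contribute value 0
                Vshift = np.zeros((S, B))
                Vshift[:, shift:] = V[:, :B - shift] if shift > 0 else V
                U[:, :, a] += R_probs[h, :, a, i, None] * (P[h, :, a, :] @ Vshift)
        V = U.min(axis=2)                            # Bellman optimality: minimize over actions
    objective = budgets - V[s1] / tau                # b - V*_1(s1, b) / tau, as in Eq. (4)
    return objective.max(), budgets[np.argmax(objective)]

# tiny example: H=2, S=2, A=2, rewards on the grid {0, 0.5} (a hypothetical MDP of ours)
H, S, A = 2, 2, 2
r_grid = np.array([0.0, 0.5])
P = np.full((H, S, A, S), 0.5)
P[:, :, 1, :] = [[0.9, 0.1]]         # action 1 steers toward state 0
R_probs = np.zeros((H, S, A, 2))
R_probs[:, 0, :, :] = [0.2, 0.8]     # state 0: reward 0.5 w.p. 0.8
R_probs[:, 1, :, :] = [0.6, 0.4]     # state 1: reward 0.5 w.p. 0.4 only
cvar_star, b_star = cvar_value_iteration(P, R_probs, r_grid, tau=0.25)
print("CVaR* =", round(cvar_star, 3), " optimal initial budget b* =", b_star)
```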
5.2. CVaR-UCBVI

In this section, we introduce our algorithm CVaR-UCBVI (Algorithm 2), an extension of the classic UCBVI algorithm of Azar et al. (2017), which attained the minimax-optimal regret for vanilla RL. Our contribution is showing that bonus-driven pessimism, which guarantees that the learned $\hat V_{h,k}$ is a high-probability lower confidence bound (LCB) on the optimal $V^\star_h$, is sufficient and in fact optimal for CVaR RL. This remarkably shows that the bonus-driven exploration paradigm from vanilla RL, i.e., optimism/pessimism under uncertainty, can be used to optimally conduct safe exploration for CVaR RL. (Throughout, $\hat U_{h,k}, \hat V_{h,k}$ denote the pessimistic estimates and $\bar U_{h,k}, \bar V_{h,k}$ the optimistic ones.)

CVaR-UCBVI iterates over $K$ episodes, where the $k$-th episode proceeds as follows. First, in Line 3, we compute an empirical estimate $\hat P_k$ of the transition dynamics $P^\star$ using the previous episodes' data. Then, in Line 6, we inductively compute $\hat U_{h,k}$ from $h = H$ down to $h = 1$ by subtracting a bonus that accounts for the error from using $\hat P_k$ instead of $P^\star$. Next, $\hat V_{h,k}$ and $\hat\rho^k$ are computed greedily w.r.t. $\hat U_{h,k}$ to mimic the Bellman optimality equations (Section 5.1). Subtracting the bonus is key to showing $\hat U_{h,k}$ (resp. $\hat V_{h,k}$) is a high-probability lower bound of $U^\star_h$ (resp. $V^\star_h$). Next, in Line 9, we compute $\hat b_k$ using the pessimistic $\hat V_{1,k}$. Similar to our MAB algorithm, this guarantees that $\widehat{\mathrm{CVaR}}{}^k_\tau := \hat b_k - \tau^{-1}\hat V_{1,k}(s_1, \hat b_k)$ is an optimistic estimate of $\mathrm{CVaR}^\star_\tau$. Finally, in Line 10, we roll in with the learned, augmented policy $\hat\rho^k$ starting from $\hat b_k$ in the augmented MDP to collect data for the next iterate. We highlight that in Line 10 the algorithm is still only interacting with the original MDP described in Section 2. To roll in with an augmented policy, the algorithm can imagine this augmented MDP by keeping track of $b_h$ via the update $b_{h+1} = b_h - r_h$. There is virtually no overhead, as $b_h$ is only a scalar with known transitions.

Algorithm 2 CVaR-UCBVI
1: Input: risk tolerance $\tau$, number of episodes $K$, failure probability $\delta$, bonus function $\mathrm{BON}_{h,k}(s, b, a)$.
2: for episode $k = 1, 2, \ldots, K$ do
3: Compute counts and the empirical transition estimate,
$N_k(s, a, s') = \sum_{h \in [H]} \sum_{i=1}^{k-1} \mathbb{I}[(s_{h,i}, a_{h,i}, s_{h+1,i}) = (s, a, s')]$,  $N_k(s, a) = 1 \vee \sum_{s' \in \mathcal{S}} N_k(s, a, s')$,  $\hat P_k(s' \mid s, a) = \frac{N_k(s, a, s')}{N_k(s, a)}$.
4: For all $s \in \mathcal{S}$, $b \in [0, 1]$, set $\hat V_{H+1,k}(s, b) = \bar V_{H+1,k}(s, b) = b_+$.
5: for $h = H, H-1, \ldots, 1$ do
6: Define pessimistic estimates of $V^\star$ and the policy $\hat\rho^k$, i.e., for all $s, b, a$,
$\hat U_{h,k}(s, b, a) = \hat P_k(s, a)^\top \mathbb{E}_{r_h \sim R(s,a)}\big[ \hat V_{h+1,k}(\cdot, b - r_h) \big] - \mathrm{BON}_{h,k}(s, b, a)$,
$\hat\rho^k_h(s, b) = \arg\min_a \hat U_{h,k}(s, b, a)$,  $\hat V_{h,k}(s, b) = \max\big\{ \hat U_{h,k}(s, b, \hat\rho^k_h(s, b)), 0 \big\}$.
7: If using the Bernstein bonus (Section 5.4), also define optimistic estimates of $V^\star$, i.e., for all $s, b, a$,
$\bar U_{h,k}(s, b, a) = \hat P_k(s, a)^\top \mathbb{E}_{r_h \sim R(s,a)}\big[ \bar V_{h+1,k}(\cdot, b - r_h) \big] + \mathrm{BON}_{h,k}(s, b, a)$,  $\bar V_{h,k}(s, b) = \min\big\{ \bar U_{h,k}(s, b, \hat\rho^k_h(s, b)), 1 \big\}$.
8: end for
9: Calculate $\hat b_k = \arg\max_{b \in [0,1]} \big\{ b - \tau^{-1}\hat V_{1,k}(s_1, b) \big\}$.
10: Collect $\{(s_{h,k}, a_{h,k}, r_{h,k})\}_{h \in [H]}$ by rolling in $\hat\rho^k$ starting from $(s_1, \hat b_k)$ in the augmented MDP.
11: end for

5.3. The Hoeffding Bonus

Two types of bonuses may be used in CVaR-UCBVI: Hoeffding (Eq. (5)) and Bernstein (Eq. (6)). We now show that a simple Hoeffding bonus, defined below, already provides the best CVaR regret bounds in the current literature:
$\mathrm{BON}^{\mathrm{HOEFF}}_{h,k}(s, a) = \sqrt{\frac{L}{N_k(s, a)}}$,   (5)
where $L = \log(HSAK/\delta)$.

Theorem 5.2. For any $\delta \in (0, 1)$, w.p. at least $1 - \delta$, CVaR-UCBVI with the Hoeffding bonus (Eq. (5)) enjoys
$\mathrm{Regret}^{\mathrm{RL}}_\tau(K) \leq 4e\tau^{-1}\sqrt{SAHKL} + 10e\tau^{-1}S^2AHL^2$.

Proof Sketch: The first step is to establish pessimism, i.e., $\hat V_{1,k} \leq V^\star_1$, which implies optimism, $\widehat{\mathrm{CVaR}}{}^k_\tau \geq \mathrm{CVaR}^\star_\tau$. At this point, we cannot apply CVaR concentration as we did for the MAB, since $\hat V_{1,k}$ is not the empirical CVaR. Instead, we show that the simulation lemma (Lemma G.4) extends to the augmented MDP, which gives
$V^{\hat\rho^k}_1(s_1, \hat b_k) - \hat V_{1,k}(s_1, \hat b_k) \leq \mathbb{E}\big[ \sum_{h=1}^H \big( 2\,\mathrm{BON}^{\mathrm{HOEFF}}_{h,k}(s_h, a_h) + \xi_{h,k}(s_h, a_h) \big) \big]$.
The expectation is over the distribution of rolling in $\hat\rho^k$ from $\hat b_k$, which is exactly how we explore and collect $s_{h,k}, a_{h,k}$. Thus, we can apply Azuma's inequality and the elliptical potential argument to conclude the proof, as in the usual UCBVI analysis.

The leading term of the Hoeffding bound is $\tilde{O}(\tau^{-1}\sqrt{SAHK})$, which is optimal in $S, A, K$. Notably, it has a factor-$S$ improvement over the current best bound $\tau^{-1}\sqrt{S^3AHKL}$ from Bastani et al. (2022) (we have divided their bound by $H$ to make the returns scaling consistent). While Theorem 5.2 is already the tightest in the literature, our lower bound suggests the possibility of removing another $\sqrt{H/\tau}$ factor.
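For concreteness, here is a compact, self-contained sketch of CVaR-UCBVI with the Hoeffding bonus of Eq. (5), written by us as an illustration rather than a reference implementation; the toy environment, the budget grid (which anticipates the discretization of Section 6), the pooling of counts across timesteps, and the estimation of the (in the paper, known) reward distribution are our own simplifications.

```python
import numpy as np

def cvar_ucbvi(env, S, A, H, r_grid, tau, K, delta=0.05, seed=0):
    """Sketch of CVaR-UCBVI (Algorithm 2) with the Hoeffding bonus (Eq. 5).
    env.reset() -> s1;  env.step(s, a) -> (reward on r_grid, next state)."""
    rng = np.random.default_rng(seed)
    eta = r_grid[1] - r_grid[0]
    budgets = np.arange(0.0, 1.0 + 1e-9, eta)
    B = len(budgets)
    L = np.log(H * S * A * K / delta)
    N = np.zeros((S, A))
    Nss = np.zeros((S, A, S))
    Rcnt = np.zeros((S, A, len(r_grid)))   # the paper assumes R is known; we estimate its pmf
                                           # here to keep the sketch self-contained
    cvar_hats = []
    for k in range(K):
        # Line 3: empirical model and Hoeffding bonus
        P_hat = np.where(N[:, :, None] > 0, Nss / np.maximum(N, 1)[:, :, None], 1.0 / S)
        R_hat = np.where(N[:, :, None] > 0, Rcnt / np.maximum(N, 1)[:, :, None], 1.0 / len(r_grid))
        bonus = np.sqrt(L / np.maximum(N, 1))
        # Lines 4-8: pessimistic value iteration on the budget grid
        V = np.tile(budgets, (S, 1))                      # V_{H+1}(s, b) = b_+
        policy = np.zeros((H, S, B), dtype=int)
        for h in range(H - 1, -1, -1):
            EV = np.zeros((S, A, S, B))                   # E_r[ V_{h+1}(s', b - r) ], indexed (s, a, s', b)
            for i in range(len(r_grid)):
                shift = int(round(r_grid[i] / eta))
                Vs = np.zeros((S, B))
                Vs[:, shift:] = V[:, :B - shift] if shift else V
                EV += R_hat[:, :, i, None, None] * Vs[None, None, :, :]
            U = np.einsum('sap,sapb->sba', P_hat, EV) - bonus[:, None, :]
            policy[h] = U.argmin(axis=2)                  # greedy (minimizing) action
            V = np.maximum(U.min(axis=2), 0.0)            # clip at 0, as in Line 6
        # Line 9: budget maximizing the optimistic CVaR estimate
        obj = budgets - V[env.s1] / tau
        b_idx = int(np.argmax(obj))
        cvar_hats.append(obj[b_idx])
        # Line 10: roll in the augmented policy, tracking b_{h+1} = b_h - r_h
        s, bi = env.reset(), b_idx
        for h in range(H):
            a = int(policy[h, s, bi]) if bi >= 0 else int(rng.integers(A))
            r, s_next = env.step(s, a)
            N[s, a] += 1
            Nss[s, a, s_next] += 1
            Rcnt[s, a, int(round(r / eta))] += 1
            bi -= int(round(r / eta))
            s = s_next
    return cvar_hats

class ToyEnv:
    """A hypothetical 2-state, 2-action environment with rewards on the grid {0, 0.5}."""
    def __init__(self, seed=1):
        self.rng = np.random.default_rng(seed)
        self.s1 = 0
    def reset(self):
        return self.s1
    def step(self, s, a):
        p_half = 0.9 if (s == 0 and a == 0) else 0.5       # chance of reward 0.5
        r = 0.5 * float(self.rng.random() < p_half)
        s_next = int(self.rng.random() < (0.3 if a == 0 else 0.7))
        return r, s_next

est = cvar_ucbvi(ToyEnv(), S=2, A=2, H=2, r_grid=np.array([0.0, 0.5]), tau=0.25, K=300)
print("optimistic CVaR estimate: first episode", round(est[0], 3), "-> last episode", round(est[-1], 3))
```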
5.4. Improved Bounds with the Bernstein Bonus

Precise design of the exploration bonus is critical to enabling tighter performance bounds, even in vanilla RL (Azar et al., 2017; Zanette & Brunskill, 2019). In this subsection, we propose the Bernstein bonus and prove two tighter regret bounds. The bonus depends on the sample variance, which, recall, for any function $f$ is defined as $\mathrm{Var}_{s' \sim \hat P_k(s,a)}(f(s')) = \hat P_k(s, a)^\top \big( f(\cdot) - \bar f \big)^2$ with $\bar f = \hat P_k(s, a)^\top f$ being the sample mean (Maurer & Pontil, 2009). We define the Bernstein bonus as follows,
$\mathrm{BON}^{\mathrm{BERN}}_{h,k}(s, b, a) = \sqrt{\frac{2\,\mathrm{Var}_{s' \sim \hat P_k(s,a)}\big( \mathbb{E}_{r_h}[ \hat V_{h+1,k}(s', b') ] \big)\, L}{N_k(s, a)}} + \sqrt{\frac{2\,\mathbb{E}_{s' \sim \hat P_k(s,a),\, r_h}\big[ \big( \bar V_{h+1,k}(s', b') - \hat V_{h+1,k}(s', b') \big)^2 \big]\, L}{N_k(s, a)}} + \frac{L}{N_k(s, a)}$,
where $b' = b - r_h$ and $r_h \sim R(s, a)$.   (6)

When using the Bernstein bonus, Line 7 should be activated to compute the optimistic estimates $\bar V_{h,k}$ by adding the bonus. Together, $(\hat V_{h,k}, \bar V_{h,k})$ forms a tight confidence band around $V^\star_h$, which we use to inductively prove pessimism and optimism at all $h \in [H]$ (as in Zanette & Brunskill, 2019). Compared to Zanette & Brunskill (2019), our Bernstein bonus also depends on the state augmentation $b$, since our UCBVI procedure is running in the augmented MDP. We now prove the first Bernstein bound, which tightens the Hoeffding bound by a $\sqrt{H}$ factor.

Theorem 5.3. For any $\delta \in (0, 1)$, w.p. at least $1 - \delta$, CVaR-UCBVI with the Bernstein bonus (Eq. (6)) enjoys a regret guarantee of
$\mathrm{Regret}^{\mathrm{RL}}_\tau(K) \leq 10e\tau^{-1}\sqrt{SAKL} + \tau^{-1}\xi$,
where $\xi \leq \tilde{O}(SAHK^{1/4} + S^2AH)$ is a lower-order term.

Proof Sketch: We first establish pessimism and optimism of $\hat V_{h,k}$ and $\bar V_{h,k}$, similar to Zanette & Brunskill (2019). Then, we apply the simulation lemma as in Theorem 5.2. The key observation is that the $H$ sample variances from the bonus, i.e., $\sum_{h=1}^H \mathrm{Var}_{s' \sim \hat P_k(s,a)}\big( \mathbb{E}_{r_h}[ \hat V_{h+1,k}(s', b - r_h) ] \big)$, can be reduced to a single variance $\mathrm{Var}_{\hat\rho^k, \hat b_k}\big( (\hat b_k - \sum_{h=1}^H r_h)_+ \big)$, which we bound by 1. To do so, we show that Azar et al. (2017)'s Law of Total Variance technique also applies to our $V^{\hat\rho^k}$, which, unlike the value function of vanilla RL, depends on the state augmentation $b$.

The leading term of Theorem 5.3 is $\tau^{-1}\sqrt{SAK}$, and it improves over Bastani et al. (2022) by an $S\sqrt{H}$ factor. Up to log terms, this matches our lower bound in all parameters except $\tau$, which implies CVaR-UCBVI is minimax-optimal for constant $\tau$. In particular, $\tau = 1$ recovers the risk-neutral vanilla RL setting, where CVaR-UCBVI matches the minimax result (Azar et al., 2017).
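The helper below makes the structure of Eq. (6), as reconstructed above, concrete: it is our own sketch (the array layout and argument names are assumptions), computing, for a single $(s, b, a)$, the variance term over next states, the expected squared gap between the optimistic and pessimistic tables, and the $1/N$ correction.

```python
import numpy as np

def bernstein_bonus(P_hat_sa, R_hat_sa, r_shifts, V_lo, V_hi, b_idx, n_sa, L):
    """Bernstein bonus (Eq. 6) at a single (s, b, a).
    P_hat_sa[s']: empirical transition row;  R_hat_sa[i]: reward pmf on the grid;
    r_shifts[i]: reward i in budget-grid steps;  V_lo / V_hi: pessimistic / optimistic
    next-step value tables of shape (S, B);  b_idx, n_sa, L: budget index, visit count, log term."""
    S, B = V_lo.shape
    def val(V, s_next, shift):                  # V_{h+1}(s', b - r); zero once the budget is negative
        j = b_idx - shift
        return V[s_next, j] if j >= 0 else 0.0
    # E_r[ V_lo(s', b - r) ] for every next state s'
    ev_lo = np.array([sum(R_hat_sa[i] * val(V_lo, sp, r_shifts[i]) for i in range(len(R_hat_sa)))
                      for sp in range(S)])
    # variance over s' of that conditional expectation (first square-root term)
    var_term = P_hat_sa @ (ev_lo - P_hat_sa @ ev_lo) ** 2
    # expected squared gap between the optimistic and pessimistic tables (second term)
    gap_sq = sum(P_hat_sa[sp] * R_hat_sa[i]
                 * (val(V_hi, sp, r_shifts[i]) - val(V_lo, sp, r_shifts[i])) ** 2
                 for sp in range(S) for i in range(len(R_hat_sa)))
    return np.sqrt(2 * var_term * L / n_sa) + np.sqrt(2 * gap_sq * L / n_sa) + L / n_sa
```

The second square-root term shrinks as the pessimistic and optimistic tables close in on each other, which is what drives the inductive pessimism/optimism argument described above.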
To get the optimal $\tau^{-1/2}$ scaling (Corollary 3.2), we cannot loosely bound each variance term by 1, as the variance should scale as $\tau$ when $\hat b_k$ approximates the $\tau$-th quantile of $R(\hat\rho^k, \hat b_k)$. We show this is indeed the case under a continuity assumption.

Assumption 5.4. For all $\rho \in \Pi_{\mathrm{Aug}}$ and $b_1 \in [0, 1]$, the return of rolling in $\rho$ from $b_1$, i.e., $R(\rho, b_1)$, is continuously distributed with a density lower bounded by $p_{\min}$.

Theorem 5.5. Under Assumption 5.4, the bound in Theorem 5.3 can be refined to
$\mathrm{Regret}^{\mathrm{RL}}_\tau(K) \leq 12e\sqrt{\tau^{-1}SAKL} + \tau^{-1}p_{\min}^{-1/2}\xi$.

Proof Sketch: The only divergence from Theorem 5.3 is how we bound $\mathrm{Var}_{\hat\rho^k, \hat b_k}\big( (\hat b_k - \sum_{h=1}^H r_h)_+ \big)$. Since the density of $R(\hat\rho^k, \hat b_k)$ is lower bounded, the CVaR objective $f(b) = b - \tau^{-1}\mathbb{E}_{\hat\rho^k, \hat b_k}\big[ (b - \sum_{h=1}^H r_h)_+ \big]$ is strongly concave. This implies that $\hat b_k$ approximates the true $\tau$-th quantile $b^\star_k = \arg\max_{b \in [0,1]} f(b)$, i.e.,
$\frac{p_{\min}}{2}(\hat b_k - b^\star_k)^2 \leq \tau\big( f(b^\star_k) - f(\hat b_k) \big) \leq V^{\hat\rho^k}_1(s_1, \hat b_k) - \hat V_{1,k}(s_1, \hat b_k)$.
Leveraging this fact, we show $\mathrm{Var}_{\hat\rho^k, \hat b_k}\big( (\hat b_k - \sum_{h=1}^H r_h)_+ \big) \leq 2\tau + 4p_{\min}^{-1}\big( V^{\hat\rho^k}_1(s_1, \hat b_k) - \hat V_{1,k}(s_1, \hat b_k) \big)$, which notably scales with $\tau$. We conclude the proof by showing the error term, i.e., $\sum_{k=1}^K \big( V^{\hat\rho^k}_1(s_1, \hat b_k) - \hat V_{1,k}(s_1, \hat b_k) \big) \leq \tilde{O}(\sqrt{SAHK})$, is lower order.

The leading term of Theorem 5.5 is $\sqrt{\tau^{-1}SAK}$, which matches our lower bound in Corollary 3.2 and establishes the optimality of CVaR-UCBVI for return distributions satisfying Assumption 5.4. Notably, $p_{\min}$ only multiplies the lower-order term $\xi$. This result highlights the importance of the Bernstein bonus for CVaR RL: it improves the regret bound of the Hoeffding bonus by $\sqrt{\tau^{-1}H}$, whereas in vanilla RL the improvement is only $\sqrt{H}$.

With regards to Assumption 5.4, lower-bounded densities, i.e., strong monotonicity of the CDF, are standard for identifying the quantile (Ma et al., 2021). In fact, PAC results for quantile identification are not possible without some control of the mass at the quantile. As a simple example, consider estimating the 0.5-th quantile using $N$ i.i.d. datapoints sampled from $\mathrm{Ber}(0.5)$. The correct answer is 0, but by symmetry, the sample median is always distributed as $\mathrm{Ber}(0.5)$ for any $N$. So we always have a 0.5 probability of being incorrect. We provide an information-theoretic lower bound that rules out all estimators, not just the sample median, in Theorem I.1. It nonetheless remains an open question whether Assumption 5.4 can be removed by eschewing identifying the quantile. In MABs (Theorem 4.1), we circumvented the need to identify quantiles by decomposing the regret into (1) the sum of bonuses, plus (2) the difference between the empirical and true CVaRs, both of which can be shown to have the correct $\tau^{-1/2}$ scaling. An analogous approach for RL is to decompose the regret into (1) $\sum_{h,k} \mathbb{E}_{\hat\rho^k, \hat b_k, \hat P_k}\big[ \mathrm{BON}^{\mathrm{BERN}}_{h,k}(s_{h,k}, b_{h,k}, a_{h,k}) \big]$, plus (2) $\sum_k \big( \mathrm{CVaR}_\tau(\hat\rho^k, \hat b_k; \hat P_k) - \mathrm{CVaR}_\tau(\hat\rho^k, \hat b_k) \big)$. However, it is unclear if both terms can be unconditionally bounded by $\tilde{O}(\sqrt{\tau^{-1}SAK})$.
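The $\mathrm{Ber}(0.5)$ example above is easy to verify by simulation (our own illustration; the sample sizes are arbitrary): for odd $N$ the sample median of $N$ fair-coin draws is itself a fair coin, so declaring it as the 0.5-quantile is wrong about half the time no matter how large $N$ is.

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (11, 101, 1001):                          # odd N so the sample median is 0 or 1
    flips = rng.integers(0, 2, size=(10_000, N))   # 10k experiments of N draws from Ber(0.5)
    medians = np.median(flips, axis=1)
    # the true 0.5-quantile of Ber(0.5) is 0 (inf{x : F(x) >= 0.5}), so median = 1 is an error
    print(f"N={N}: error rate {np.mean(medians == 1.0):.3f}")
```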
Remark on $b$-dependence: Although CVaR-UCBVI operates in the augmented MDP, our Hoeffding bonus has no dependence on the budget state $b$ and matches the Hoeffding bonus of UCBVI (from vanilla RL; Azar et al., 2017). Intuitively, this is possible since the dynamics of $b$ are known (we assume a known reward distribution), so there is no need to explore in the $b$-dimension. In contrast to the Bernstein bonus of UCBVI, our Bernstein bonus depends on $b$ and captures the variance of $(b - R(\hat\rho^k, \hat b_k))_+$. This is crucial for obtaining the correct scaling in $\tau$.

6. Computational Efficiency via Discretization

Previously, we assumed each line of Algorithm 2 was computed exactly. This is not computationally feasible, since the dynamic programming (DP) step (Lines 6 and 7) needs to be done over all $b \in [0, 1]$ and the calculation of $\hat b_k$ (Line 9) involves maximizing a non-concave function. Following Bastani et al. (2022), we propose to discretize the rewards so the aforementioned steps need only be performed over a finite grid. Thus, we gain computational efficiency while maintaining the same statistical guarantees.

Fix a precision $\eta \in (0, 1)$ and define $\phi(r) = \eta\lceil r/\eta \rceil \wedge 1$, which rounds up $r \in [0, 1]$ to an $\eta$-net of $[0, 1]$, henceforth referred to as the "grid". The discretized MDP $\mathrm{disc}(M)$ is an exact replica of the true MDP $M$ with one exception: its rewards are post-processed with $\phi$, i.e., $R(s, a; \mathrm{disc}(M)) = R(s, a; M) \circ \phi^{-1}$, the pushforward of the reward distribution under $\phi$, where $\circ$ denotes composition. In $\mathrm{disc}(M)$, the $\tau$-th quantile of the returns distribution (the arg max of the CVaR objective) is a multiple of $\eta$, so it suffices to compute $\hat V_{1,k}(s_1, b)$ and maximize Line 9 over the grid. Since $b$ transitions by subtracting rewards, which are multiples of $\eta$, $b_h$ always stays on the grid. Hence, the entire DP procedure (Lines 6 and 7) only needs to occur on the grid. In Appendix H.3, we show CVaR-UCBVI has a runtime of $O(S^2\eta^{-2}AHK)$ in the discretized MDP. It is worth clarifying that CVaR-UCBVI is still interacting with $M$, except that it internally discretizes the received rewards to simulate running in $\mathrm{disc}(M)$ for computation's sake. Thus, we still want to compete with the strongest CVaR policy in the true MDP; we have just made our algorithm weaker by restricting it to run in an imagined $\mathrm{disc}(M)$. Now, we show that the true regret only increases by $O(K\eta)$, which can be made lower order by setting $\eta = K^{-1/2}$.

Theorems 5.2 and 5.3 made no assumptions on the reward distribution, so they immediately apply to bound the $\mathrm{disc}(M)$ regret, i.e., $\mathrm{Regret}^{\mathrm{RL}}_\tau(K; \mathrm{disc}(M)) = \sum_{k=1}^K \big( \mathrm{CVaR}^\star_\tau(\mathrm{disc}(M)) - \mathrm{CVaR}_\tau(\hat\rho^k, \hat b_k; \mathrm{disc}(M)) \big)$. We translate regret in $\mathrm{disc}(M)$ to regret in $M$ via a coupling argument, inspired by Bastani et al. (2022). Let $Z_{\pi,M}$ denote the returns from running $\pi$ in $M$. For random variables $X, Y$, we say $Y$ stochastically dominates $X$, denoted $X \preceq Y$, if $\forall t \in \mathbb{R}: \Pr(Y \geq t) \geq \Pr(X \geq t)$. Then, for any $\rho \in \Pi_{\mathrm{Aug}}$ and $b_1 \in [0, 1]$, we show two facts:

F1. Running $(\rho, b_1)$ in the imagined $\mathrm{disc}(M)$ is equivalent to running a reward-history-dependent policy, $\mathrm{adapted}(\rho, b_1)_h(s_h, r_{1:h-1}) = \rho_h(s_h, b_1 - \phi(r_1) - \ldots - \phi(r_{h-1}))$. Also, $Z_{\rho,b_1,\mathrm{disc}(M)} - H\eta \preceq Z_{\mathrm{adapted}(\rho,b_1),M} \preceq Z_{\rho,b_1,\mathrm{disc}(M)}$.

F2. There exists a memory-history-dependent² policy $\mathrm{disc}(\rho, b)$ such that $Z_{\rho,b,M} \preceq Z_{\mathrm{disc}(\rho,b),\mathrm{disc}(M)}$. Intuitively, when running in $\mathrm{disc}(M)$, once the discretized reward $r_h$ is seen, a memory $m_h$ is generated from the conditional distribution of rewards that get rounded up to $r_h$. Thus, $m_h$ is essentially sampled from the unconditional reward distribution. The memory-dependent policy $\mathrm{disc}(\rho, b)$ makes use of these samples to mimic running $(\rho, b)$ in $M$.

²The memory-MDP model is novel and key to our coupling argument. In Appendix F, we define this model and show that $\Pi_{\mathrm{Aug}}$ still contains the optimal policy over this seemingly larger class of memory-history-dependent policies, i.e., Theorem 5.1 holds.

F1 implies $\mathrm{CVaR}_\tau(\mathrm{adapted}(\hat\rho^k, \hat b_k); M) \geq \mathrm{CVaR}_\tau(\hat\rho^k, \hat b_k; \mathrm{disc}(M)) - \tau^{-1}H\eta$. F2 implies $\mathrm{CVaR}^\star_\tau(M) \leq \mathrm{CVaR}^\star_\tau(\mathrm{disc}(M))$. Combining these two facts, we have $\mathrm{Regret}^{\mathrm{RL}}_\tau(K; M) \leq \mathrm{Regret}^{\mathrm{RL}}_\tau(K; \mathrm{disc}(M)) + K\tau^{-1}H\eta$.
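Below is a small sketch of our own for the rounding map $\phi$ and the grid-closure property used in this section (the value of $\eta$ and the sampled rewards are arbitrary): rounded rewards are multiples of $\eta$, each rounding error is at most $\eta$ per step, and budgets of the form $b - \phi(r_1) - \ldots - \phi(r_h)$ never leave the grid.

```python
import numpy as np

def phi(r, eta):
    """Round r in [0, 1] up to the eta-grid and cap at 1, as in Section 6."""
    return min(np.ceil(r / eta) * eta, 1.0)

eta = 0.1
rng = np.random.default_rng(0)
rewards = 0.2 * rng.random(5)                    # raw rewards (kept small so the return stays in [0, 1])
rounded = [phi(r, eta) for r in rewards]
assert all(r <= rd <= r + eta for r, rd in zip(rewards, rounded))   # per-step rounding error <= eta
b = 1.0                                          # budget update b <- b - phi(r) stays on the grid
for rd in rounded:
    b -= rd
    assert abs(round(b / eta) * eta - b) < 1e-9
print("rounded rewards:", np.round(rounded, 1), " final budget:", round(b, 1))
```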
Translating Theorem 5.5 requires more care, as its proof relied on continuously distributed returns (Assumption 5.4), which is untrue in $\mathrm{disc}(M)$. We show that we only need the true returns distribution to be continuous.

Assumption 6.1. For all $\rho \in \Pi_{\mathrm{Aug}}$ and $b_1 \in [0, 1]$, the returns distribution of $\mathrm{adapted}(\rho, b_1)$ in $M$ is continuous, with a density lower bounded by $p_{\min}$.

With this premise, we can prove Theorem 5.5 for $\mathrm{disc}(M)$, with an extra term of $\tilde{O}(\tau^{-1}\sqrt{p_{\min}^{-1}SAHK\eta})$. In sum, setting $\eta = 1/\sqrt{K}$ ensures that CVaR-UCBVI is both near-minimax-optimal for regret and computationally efficient, with a runtime of $O(S^2AHK^2)$. We note that the superlinear-in-$K$ runtime from discretization is not avoidable even in Lipschitz bandits (Wang et al., 2020), and we leave developing more scalable methods to future work.

7. Concluding Remarks

In this paper, we presented a more complete picture of risk-sensitive MAB and RL with CVaR by providing not only novel lower bounds but also procedures and analyses that both improve on the state of the art and match our lower bounds. One exception where a gap remains is CVaR RL with discontinuous returns and a risk tolerance that is not constant (or not lower bounded); in this case, our lower and upper bounds differ by a factor of $\sqrt{1/\tau}$. We discuss the feasibility of closing this gap in Section 5.4. A direction for future work is to develop algorithms with optimal regret guarantees for more general risk measures, e.g., the optimized certainty equivalent (OCE) (Ben-Tal & Teboulle, 2007). Another orthogonal direction is to extend our results beyond tabular MDPs. We believe that our techniques in this work are already enough for linear MDPs (Jin et al., 2020), where the transition kernel is linear in some known feature space. However, extending the results beyond linear models, such as to low-rank MDPs (Agarwal et al., 2020; Uehara et al., 2022) and block MDPs (Misra et al., 2020; Zhang et al., 2022), remains a challenge due to the fact that achieving point-wise optimism is harder when nonlinear function approximation is used.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant Nos. IIS-1846210 and IIS-2154711.

References

Acerbi, C. and Tasche, D. On the coherence of expected shortfall. Journal of Banking & Finance, 26(7):1487–1503, 2002.

Agarwal, A., Kakade, S., Krishnamurthy, A., and Sun, W. FLAMBE: Structural complexity and representation learning of low rank MDPs. Advances in Neural Information Processing Systems, 33:20095–20107, 2020.

Agarwal, A., Jiang, N., Kakade, S. M., and Sun, W. Reinforcement learning: Theory and algorithms. 2021. https://rltheorybook.github.io/.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.

Bastani, O., Ma, Y. J., Shen, E., and Xu, W. Regret bounds for risk-sensitive reinforcement learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=yJEUDfzsTX7.

Bäuerle, N. and Ott, J. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.

Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning.
In International Conference on Machine Learning, pp. 449 458. PMLR, 2017. Bellemare, M. G., Dabney, W., and Rowland, M. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org. Ben-Tal, A. and Teboulle, M. An old-new concept of convex risk measures: the optimized certainty equivalent. Mathematical Finance, 17(3):449 476, 2007. Bibaut, A., Kallus, N., Dimakopoulou, M., Chambaz, A., and van der Laan, M. Risk minimization from adaptively collected data: Guarantees for supervised and policy learning. Advances in Neural Information Processing Systems, 34:19261 19273, 2021. Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., Ray, D., Simard, P., and Snelson, E. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(11), 2013. Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013. Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge university press, 2004. Brown, D. B. Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 35 (6):722 730, 2007. Chen, H. Chapter 2. order statistics. http:// www.math.ntu.edu.tw/~hchen/teaching/ Large Sample/notes/noteorder.pdf. Chow, Y. and Ghavamzadeh, M. Algorithms for cvar optimization in mdps. Advances in neural information processing systems, 27, 2014. Chow, Y., Tamar, A., Mannor, S., and Pavone, M. Risksensitive and robust decision-making: a cvar optimization approach. Advances in neural information processing systems, 28, 2015. Du, Y., Wang, S., and Huang, L. Risk-sensitive reinforcement learning: Iterated cvar and the worst path. ar Xiv preprint ar Xiv:2206.02678, 2022. Fei, Y., Yang, Z., Chen, Y., Wang, Z., and Xie, Q. Risksensitive reinforcement learning: Near-optimal risksample tradeoff in regret. Advances in Neural Information Processing Systems, 33:22384 22395, 2020. Fei, Y., Yang, Z., Chen, Y., and Wang, Z. Exponential bellman equation and improved regret bounds for risksensitive reinforcement learning. Advances in Neural Information Processing Systems, 34:20436 20446, 2021. Filippi, C., Guastaroba, G., and Speranza, M. G. Conditional value-at-risk beyond finance: a survey. International Transactions in Operational Research, 27(3): 1277 1319, 2020. Jiang, N. and Agarwal, A. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pp. 3395 3398. PMLR, 2018. Near-Minimax-Optimal Risk-Sensitive RL with CVa R Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137 2143. PMLR, 2020. Kagrecha, A., Nair, J., and Jagannathan, K. Distribution oblivious, risk-aware algorithms for multiarmed bandits with unbounded rewards. ar Xiv preprint ar Xiv:1906.00569, 2019. Kallus, N., Mao, X., Wang, K., and Zhou, Z. Doubly robust distributionally robust off-policy evaluation and learning. In International Conference on Machine Learning, 2022. Keramati, R., Dann, C., Tamkin, A., and Brunskill, E. Being optimistic to be conservative: Quickly learning a cvar policy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 4436 4443, 2020. Kiefer, J. Sequential minimax search for a maximum. Proceedings of the American mathematical society, 4(3): 502 506, 1953. Kisiala, J. 
Conditional value-at-risk: Theory and applications. ar Xiv preprint ar Xiv:1511.00140, 2015. Lam, T., Verma, A., Low, B. K. H., and Jaillet, P. Riskaware reinforcement learning with coherent risk measures and non-linear function approximation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=-Rw ZOVybbj. Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020. Liang, H. and Luo, Z.-Q. Bridging distributional and risk-sensitive reinforcement learning with provable regret bounds. ar Xiv preprint ar Xiv:2210.14051, 2022. Ma, Y., Jayaraman, D., and Bastani, O. Conservative offline distributional reinforcement learning. Advances in Neural Information Processing Systems, 34:19235 19247, 2021. Maurer, A. and Pontil, M. Empirical bernstein bounds and sample variance penalization. ar Xiv preprint ar Xiv:0907.3740, 2009. Misra, D., Henaff, M., Krishnamurthy, A., and Langford, J. Kinematic state abstraction and provably efficient richobservation reinforcement learning. In International conference on machine learning, pp. 6961 6971. PMLR, 2020. Murphy, S. A. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331 355, 2003. Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M. Robust reinforcement learning using offline data. In Advances in neural information processing systems, 2022. Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Rajeswaran, A., Ghotra, S., Ravindran, B., and Levine, S. EPOpt: Learning robust neural network policies using model ensembles. In International Conference on Learning Representations, 2017. URL https:// openreview.net/forum?id=Sy Wvg P5el. Rockafellar, R. T. and Uryasev, S. Optimization of conditional value-at-risk. Journal of risk, 2:21 42, 2000. Si, N., Zhang, F., Zhou, Z., and Blanchet, J. Distributional robust batch contextual bandits. ar Xiv preprint ar Xiv:2006.05630, 2020. Singla, A., Rafferty, A. N., Radanovic, G., and Heffernan, N. T. Reinforcement learning for education: Opportunities and challenges. ar Xiv preprint ar Xiv:2107.08828, 2021. Slivkins, A. et al. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(12):1 286, 2019. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018. Tamar, A., Glassner, Y., and Mannor, S. Optimizing the cvar via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. Tamkin, A., Keramati, R., Dann, C., and Brunskill, E. Distributionally-aware exploration for cvar bandits. In Neur IPS 2019 Workshop on Safety and Robustness on Decision Making, 2019. Uehara, M., Zhang, X., and Sun, W. Representation learning for online and offline RL in low-rank MDPs. In ICLR, 2022. URL https://openreview.net/ forum?id=J4i SIR9fh Y0. Urpí, N. A., Curi, S., and Krause, A. Risk-averse offline reinforcement learning. 2021. Van de Geer, S. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000. Wang, T., Ye, W., Geng, D., and Rudin, C. Towards practical lipschitz bandits. In Proceedings of the 2020 ACMIMS on Foundations of Data Science Conference, pp. 129 138, 2020. Near-Minimax-Optimal Risk-Sensitive RL with CVa R Wang, Y. and Gao, F. Deviation inequalities for an estimator of the conditional value-at-risk. Operations Research Letters, 38(3):236 239, 2010. Wiesemann, W., Kuhn, D., and Rustem, B. 
Robust markov decision processes. Mathematics of Operations Research, 38(1):153 183, 2013. Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp. 7304 7312. PMLR, 2019. Zhang, T. Mathematical Analysis of Machine Learning Algorithms. 2023. http://www.tongzhang-ml. org/lt-book.html. Zhang, X., Song, Y., Uehara, M., Wang, M., Agarwal, A., and Sun, W. Efficient reinforcement learning in block mdps: A model-free representation learning approach. In International Conference on Machine Learning, pp. 26517 26547. PMLR, 2022. Zhou, D., Gu, Q., and Szepesvari, C. Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pp. 4532 4576. PMLR, 2021a. Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J., and Glynn, P. Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 3331 3339. PMLR, 2021b. Near-Minimax-Optimal Risk-Sensitive RL with CVa R A. Notations Table 1. List of Notations x+ max(x, 0), i.e., the Re LU function. S, A, S, A State and action spaces, with sizes S = |S| and A = |A|. In MAB, S = 1. SAug Augmented state space S [0, 1] for CVa R RL. H N Horizon of the RL problem. In MAB, H = 1. K N Number of episodes. δ (0, 1) Failure probability. L log(SAHK/δ). (S) The set of distributions supported by S. R(a) ([0, 1]) Reward distribution of arm a (for MAB). P (s, a) (S) Ground truth transition kernel (for RL). R(s, a) ([0, 1]) Known reward distribution (for RL). R(π) Returns distribution of history-dependent policy π (for RL). R(ρ, b) Returns distribution of augmented policy ρ ΠAug starting from b (for RL). F (t) for t [0, 1] The t-th quantile function of X with CDF F, i.e., inf{x : F(x) t}. Ih,k(s, a) Indices of prior visits of s, a at h, i.e., {i [k 1] : (sh,i, ah,i) = (s, a)}. Nh,k(s, a) Number of prior visits of s, a at h, i.e., |Ih,k(s, a)|. ξh,k(s, a) min n 2, 2HSL Nh,k(s,a) o . Ek Trajectories from episodes 1, 2, ..., k 1. Hh(Hh,k) History up to and not including time h (in episode k). ΠH Set of history-dependent policies. ΠAug Set of Markov, deterministic policies in the augmented MDP. (ρ, b) The policy obtained from rolling in ρ starting from (s1, b) in the augmented MDP. disc(M) The discretized MDP obtained by discretizing rewards, Section 6. adapted(ρ, b1) The policy from adapting (ρ, b1) in disc(M) to M. disc(ρ, b1) The policy from discretizing (ρ, b1) in M to disc(M). B. Concentration Lemmas B.1. Uniform Hoeffding and Bernstein via Lipschitzness Recall the classic Hoeffding and Bernstein inequalities (Theorems 2.8 and 2.10 in Boucheron et al., 2013). Let X1:N be i.i.d. random variables in [0, 1], with mean µ and variance σ2. Then, for any δ, w.p. at least 1 δ, we have 1 N N , (Hoeffding) 2σ2 log(4/δ) N + log(4/δ) N . (Bernstein) Now we consider uniform inequalities for a function class. Specifically, let X1:N be i.i.d. copies of X X and F is a (potentially infinite) set of functions f : X [0, 1]. Suppose Gε F is a finite ℓ -cover, a.k.a. ε-net, of F in the sense that: for any f F, there exists g Gε such that supx X |f(x) g(x)| ε. Near-Minimax-Optimal Risk-Sensitive RL with CVa R Lemma B.1. Let δ (0, 1). We have w.p. at least 1 δ, i=1 f(Xi) Ef(X) log(4|G1/N|/δ) N . 
(Uniform Hoeffding) If N 2 log(4|G1/N|/δ), we also have i=1 f(Xi) Ef(X) 2 Var(f(X)) log(4|G1/N|/δ) N + 3 log(4|G1/N|/δ) N . (Uniform Bernstein) Proof. Apply a union bound over the elements of Gε. Then for any f F, 1 N i=1 f(Xi) Ef(X) i=1 g(Xi) Eg(X) log(4|Gε|/δ) Setting ε = 1/N gives the Uniform Hoeffding result. We also have 1 N i=1 f(Xi) Ef(X) 2 Var(g(X)) log(4|Gε|/δ) N + log(4|Gε|/δ) 2 Var(f(X)) log(4|Gε|/δ) N + log(4|Gε|/δ) 2 log(4|Gε|/δ) Var(g(X)) p Var(f(X)) p Var(f(X) g(X)) ε. By assumption, q 2 log(4|Gε|/δ) N 1, so the total error is at most 3ε. Thus, setting ε = 1/N gives the Uniform Bernstein result. A particularly important application of this for us is that F will be an finite set of functions fb parameterized by a continuous parameter b [0, 1]. These functions are C-Lipschitz in the b parameter, so to construct Gε, it suffices to take a grid over [0, 1] such that any element is ε/C close to the grid. This grid requires C/ε atoms, and so log(|G1/N|) log(CN). Empirical Bernstein: By Theorems 4 and 6 of Maurer & Pontil (2009), we also have an empirical version of the uniform Bernstein, where we may replace Var(f(X)) with 1 N(N 1) PN i,j=1(f(Xi) f(Xj))2, i.e., the empirical variance. Another useful result of Maurer & Pontil (2009) is their Theorem 10, which proves a fast convergence of empirical variance to the true variance: w.p. 1 δ, d Varf(X) Var f(X) where d Varf(X) = 1 N(N 1) PN i,j=1(f(Xi) f(Xj))2 is the sample variance of N datapoints. Note that d Varf(X) is the variance under the empirical distribution of these N datapoints, and hence behaves like a variance. Since p Var(X + Y ) p Var(Y ) by Cauchy-Schwartz, this can also be extended to be uniform by the above argument, i.e., d Varf(X) p log(2|G1/N|/δ) Proof. For any f, let g be its neighbor in the net. Using the triangle inequality of variance, p Var(X + Y ) p Var(Y ), we have d Varf(X) p d Varg(X) p Var g(X) + q d Var((f g)(X)) + p Var((f g)(X)) log(2|G1/N|/δ) Setting ε = 1/N completes the proof. Near-Minimax-Optimal Risk-Sensitive RL with CVa R B.2. Tape Method for Tabular MAB and RL In this section, we describe how we are able to prove claims about MAB and RL using uniform concentration inequalities over i.i.d. data, i.e., without needing to use complicated uniform martingale inequalities, e.g., Bibaut et al. (2021); Van de Geer (2000). We construct a probability space using a tape method inspired by Slivkins et al. (2019, Section 1.3.1). Compared to using black-box uniform martingale inequalities, our approach is potentially loose in log terms. However, our approach is much cleaner as we only need uniform concentrations for i.i.d. data. Thus, we prove everything from first principles, so that concentration inequalities do not distract from the main ideas. First, consider the MAB problem. Before the protocol starts, nature constructs an (one-indexed) array with AK cells. For each a A, k [K], nature fills the index [a, k] with an independent sample from R(a). Whenever the learner pulls arm a on the k-th episode, it receives the contents of (a, Nk(a) + 1) where that Nk(a) is the number of times that a has been pulled up until now. Notice that this formulation will never run out of rewards since we ve seeded each arm with K cells. Also, this is equivalent to drawing a sample whenever the learner pulls arm a. Crucially, all the rewards are independent and so we can obtain concentration inequalities for Nk(a) for any learner, even before it is executed. 
In particular, for any function f : [0, 1] [0, 1] of the rewards, we can union bound Hoeffding/Bernstein over the cells [a, 1 : k] for all a, k to get: for any δ, w.p. at least 1 δ, we have for all a, k, i Ik(a) f(ri) Ef(R(a)) i Ik(a) f(ri) Ef(R(a)) 2 Var(f(R(a))) log(4AK/δ)) Nk(a) + log(4AK/δ) Here Ik(a) is the indices where the learner has pulled arm a. We now do something similar for tabular RL. Before the RL algorithm starts, nature constructs an (one-indexed) array with SAHK cells. For each s S, a A, h [H], k [K], nature fills the index [s, a, h, k] with an independent sample from P (s, a). Whenever, the learner takes action a at state s and step h on episode k, it receives the next state via the content of (s, a, h, Nh,k(s, a) + 1) where recall Nh,k(s, a) is the number of times the learner has taken action a at state s and step h before the current episode. Then, for any function f : S [0, 1] of the states, we can union bound Hoeffding/Bernstein over the cells [s, a, 1 : H, 1 : k] for all s, a, k to get: for any δ, w.p. at least 1 δ, we have for all s, a, h, k, h,i Ik(s,a) f(sh+1,i) Es P (s,a)f(s ) log(4SAHK/δ) h,i Ik(s,a) f(sh+1,i) Es P (s,a)f(s ) 2 Vars P (s,a)(f(s )) log(4SAHK/δ)) Nk(s, a) + log(4SAHK/δ) where Ik(s, a) are the (h, i) pairs where the learner has visited (s, a), and Nk(s, a) is the size of Ik(s, a). Since these are standard Hoeffding/Bernstein results over i.i.d. data, the uniform concentration results from the previous section applies. Near-Minimax-Optimal Risk-Sensitive RL with CVa R C. Concentration of CVa R In this section, we derive general concentration results for the empirical CVa R, which may be of independent interest. The significance of our result is that it applies to any bounded random variable X, which may be continuous, discrete or neither. Prior concentration results from Brown (2007) require X to be continuous. Some later works (Wang & Gao, 2010; Kagrecha et al., 2019) did not explicitly mention their dependence on the continuity of X, but their proof appears to require it as well and is complicated by casework. We provide a simple new proof of this concentration based on the Acerbi integral formula for CVa R, Lemma C.3. For any random variable X in [0, 1] with CDF F, the quantile function is defined as, F (t) = inf{x [0, 1] : F(x) t} = sup{x [0, 1] : F(x) < t}. The quantile has many useful properties (Chen, Lemma 1), which we recall here. Lemma C.1. For t (0, 1), F (t) is nondecreasing and left-continuous, and satisfies 1. For all x R, F (F(x)) x. 2. For all t (0, 1), F(F (t)) t. 3. F(x) t x F (t). The quantile is always a maximizer of the CVa R objective in Eq. (1). This is true for any random variable, discrete, continuous or neither. Lemma C.2. For any random variable X in [0, 1] with CDF F, we have F (τ) arg max b [0,1] b τ 1E (b X)+ . Proof. Recall the objective in Eq. (1), f(b) = b + τ 1E[(b X)+]. It has a subgradient of f(b) = 1 + τ 1(Pr(X < b) + [0, 1] Pr(X = b)). We want to show that 0 f(F (τ)), which is equivalent to showing 0 (a) τ Pr(X < F (τ)) (b) Pr(X = F (τ)). For (b), observe that Pr(X < F (τ)) + Pr(X = F (τ)) = F(F (τ)) τ (Lemma C.1). Hence, τ Pr(X < F (τ)) Pr(X = F (τ)). For (a), recall that Pr(X < F (τ)) = limn Pr(X F (τ) n 1), since X F (τ) 1 X F (τ) 1/2 S n N X F (τ) n 1 = X < F (τ) and continuity of probability measures. If for any n N, we had Pr(X F (τ) n 1) τ, i.e., F(F (τ) n 1) τ, then by definition of F (τ) = inf{x [0, 1] : F(x) t}, we have F (τ) F (τ) n 1, which is a contradiction. 
Therefore, it must be that for all n N, we have Pr(X F (τ) n 1) < τ, so Pr(X < F (τ)) = limn Pr(X F (τ) n 1) τ. The following interpretation of CVa Rτ due to Acerbi & Tasche (2002) will be very useful. An alternative proof was given in Kisiala (2015, Proposition 2.2). Lemma C.3 (Acerbi s Integral Formula). For any non-negative random variable X with CDF F, we have CVa Rτ(X) = τ 1 Z τ 0 F (y)dy = E F (U) | U τ , where U Unif([0, 1]). Near-Minimax-Optimal Risk-Sensitive RL with CVa R Now suppose X1:N are i.i.d. copies of X [0, 1]. Define the empirical CVa R as the CVa R of the empirical distribution b FN(x) = 1 N PN i=1 I [Xi x]. \ CVa Rτ(X1:N) = max b [0,1] i=1 (b Xi)+ ) Let X(i) denote the i-th increasing order statistic. Lemma C.4. The maximum for the empirical CVa R is attained at the empirical quantile X Nτ . Hence, \ CVa Rτ(X1:N) = 1 Nτ X Nτ + 1 Nτ Proof. By Lemma C.2, the maximum is attained at the τ-th quantile of the empirical distribution, i.e., b F N(τ) = inf n x : b FN(x) τ o . Let k [N] be the largest X(k) such that such that b FN(X(k)) = k N < τ k+1 N = b FN(X(k+1)). This implies that b F N(τ) = X(k+1). Note that k < Nτ k + 1, so k + 1 = Nτ . Thus, \ CVa Rτ(X1:N) = X( Nτ ) 1 Nτ X( Nτ ) Xi + = X( Nτ ) 1 Nτ X( Nτ ) X(i) X( Nτ ) + 1 Nτ Lemma C.5. Let U1:N be i.i.d. copies of Unif([0, 1]). Let p (0, 1). For any δ (0, 1), w.p. at least 1 δ, we have 3p(1 p) log(2/δ) N + 5 log(2/δ) provided that N 25 log(2/δ). Proof. Let F be the distribution function of Unif([0, 1]), and b FN the empirical distribution of U1:N. Note that U( Np ) = b F N(p), by reasoning in the proof of Lemma C.4. So the left hand side is p b F N(p) . Now consider any error ε (0, 1). We have b F N(p) p + ε p b FN(p + ε) p + ε b FN(p + ε) ε, b F N(p) > p ε p > b FN(p ε) b FN(p ε) (p ε) < ε. In both cases, we can use Bernstein on I [U p ε] or I [U p + ε] to obtain a bound depending on the variance. In the first case, p Var(I [U p ε]) p Var(I [U p]) + p Var(I [U p ε] I [U p]) p(1 p) + ε. Similarly, p Var(I [U p + ε]) p Var(I [U p]) + p Var(I [U p + ε] I [U p]) p(1 p) + ε. Thus, we have w.p. at least 1 δ, (p + ε) b FN(p + ε) 2p(1 p) log(2/δ) N + log(2/δ) 2ε2 log(2/δ) Near-Minimax-Optimal Risk-Sensitive RL with CVa R b FN(p ε) (p ε) < 3p(1 p) log(2/δ) N + log(2/δ) 3ε2 log(2/δ) So we can set ε = q 3p(1 p) log(2/δ) N + 5 log(2/δ) N , as the third error term with this setting of ε is at most 3 log(2/δ) 5 log(2/δ)1.5 N1.5 4 log(2/δ) N when N 25 log(2/δ). Thus w.p. at least 1 δ, we have b F N(p) p ε. Theorem C.6. Let X1:N be N i.i.d. copies of a random variable X [0, 1]. Then for any δ (0, 1), w.p. at least 1 δ, if N 25 log(2/δ), then we have \ CVa Rτ(X1:N) CVa Rτ(X) Nτ + 15 log(2/δ) Proof. We use the interpretation of the empirical CVa R in Lemma C.4. The first term is lower order since 1 Nτ Now recall that by the inverse CDF trick, we have Xi = F (Ui) where Ui are i.i.d. copies of Unif([0, 1]). Since F is non-decreasing, we have X(i) = F (U(i)). Thus, the second term of Lemma C.4 is i=1 X(i) = 1 Nτ i=1 F (U(i)) = 1 Nτ i=1 F (Ui)I Ui U( Nτ ) , which we want to show is close to CVa Rτ(X) = τ 1E F (U)I [U τ] by Lemma C.3. If U( Nτ ) were replaced by τ, we can simply invoke Bernstein and note that Var F (U)I [U τ] Pr(U τ) = τ, which gives 1 Nτ i=1 F (Ui)I [Ui τ] τ 1E F (U)I [U τ] τ 1 r 2τ log(2/δ) N + log(2/δ) Thus, we just need to bound the difference term, 1 Nτ i=1 F (Ui) I Ui U( Nτ ) I [Ui τ] . By Lemma C.5, we have U Nτ τ ε w.p. 1 δ, where ε = q 3τ(1 τ) log(2/δ) N + 5 log(2/δ) N . 
So, for any Ui we have I Ui U( Nτ ) I [Ui τ] I [τ Ui τ + ε] and I [Ui τ] I [τ ε Ui τ], so the difference term is at most, i=1 F (Ui)I [τ ε Ui τ] , 1 i=1 F (Ui)I [τ Ui τ + ε] By applying another Bernstein, and noting that p Var(F (U)I [τ ε U τ]) ε, p Var(F (U)I [τ U τ + ε]) ε, we have this is at most max E F (U)I [τ ε U τ] , E F (U)I [τ U τ + ε] + 2ε2 log(2/δ) N + log(2/δ) 2ε2 log(2/δ) N + log(2/δ) Nτ + 5 log(2/δ) Nτ + 3 log(2/δ) N τ + 4 log1.5(2/δ) N 1.5τ + log(2/δ) Nτ + 15 log(2/δ) Nτ , (when N 16 log(2/δ)) where the bound on ε occurs when N 25 log(2/δ), which also implies the last inequality. Near-Minimax-Optimal Risk-Sensitive RL with CVa R D. Proofs for Lower Bounds D.1. CVa R MAB Lower Bound Let us first define some MAB notations that make explicit the dependence on the current MAB problem instance ν and the learner Alg. Recall that ν is a vector of A reward distributions, and in the k-th episode, Alg picks an action ak based on the historical actions and rewards. Let a(ν) = CVa R τ(ν) CVa Rτ(ν(a)) where CVa R τ(ν) = maxa A CVa Rτ(ν(a)). Let Regret MAB τ (K, ν, Alg) denote the regret of running Alg in MAB ν for K episodes. Let Tk(a) denote the number of times an arm a has been pulled up to time K. For two distributions P, Q, recall the KL-divergence is defined as DKL(P, Q) = ( R log d P d Q(ω)d P(ω) , if P Q, , otherwise. A key inequality for lower bounds is the Bretagnolle-Huber inequality, cf. (Lattimore & Szepesvári, 2020, Theorem 14.2), Lemma D.1 (Bretagnolle-Huber). Let P, Q be probability measures on the same measurable space (Ω, F) and A F be any event. Then P(A) + Q(AC) 1 2 exp( DKL(P, Q)). Lemma D.2 (Regret Decomposition). For any MAB instance ν and learner Alg, we have E Regret MAB τ (K, ν, Alg) = X a A a(ν)E[Ta(K)], where the expectations are with respect to the trajectory of running Alg in ν. E Regret MAB τ (K, ν, Alg) k=1 CVa R τ(ν) E[CVa Rτ(ν(ak))] (CVa R τ(ν) CVa Rτ(ν(ak)))) X a A I [ak = a] k=1 E[(CVa R τ(ν) CVa Rτ(ν(ak)))I [ak = a]]. Notice that if once we condition on ak, if ak = a, the difference CVa R τ(ν) CVa Rτ(ν(ak)) is simply a(ν). If ak = a, then we get 0. So, by the tower rule, E[(CVa R τ(ν) CVa Rτ(ν(ak)))I [ak = a]] = E[I [ak = a] a(ν)]. Therefore, continuing from before, k=1 E[I [ak = a] a(ν)] k=1 E[I [ak = a]] a A a(ν)E[Ta(K)]. Near-Minimax-Optimal Risk-Sensitive RL with CVa R Theorem 3.1. Fix any τ (0, 1/2), A N. For any algorithm, there is a MAB problem with Bernoulli rewards s.t. if 8τ , then E Regret MAB τ (K) 1 24e Proof of Theorem 3.1. Fix any τ (0, 1/2) and any MAB algorithm Alg. WLOG suppose A = [A]. Define the shorthand, βc = Ber(1 τ + cε), i.e., larger c implies ε more likelihood of pulling 1. Construct two MAB instances as follows, ν = (β1, β0, . . . , β0) ν = (β1, β0, . . . , β0, β2 |{z} index i , β0, . . . , β0), where i = arg min a>1 Eν,Alg[Ta(K)]. For the first MAB instance ν, the optimal action is a (ν) = 1, and a(ν) = τ 1ε for all a > 1. By Lemma D.2, Eν,Alg Regret MAB τ (K, ν, Alg) = X a A a(ν)Eν,Alg[Ta(K)] = τ 1ε(K Eν,Alg[T1(K)]) τ 1ε Pr ν,Alg 2 (Markov s inequality) = τ 1ε Pr ν,Alg For the second MAB instance ν , the optimal action is a (ν ), and 1(ν) = τ 1ε. By Lemma D.2, Eν ,Alg Regret MAB τ (K, ν , Alg) = X a A a(ν )Eν ,Alg[Ta(K)] ν1(ν )Eν ,Alg[T1(K)] > τ 1ε Pr ν ,Alg 2 . (Markov s inequality) Let Pν,Alg denote the trajectory distribution from running Alg in MAB ν. 
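The two facts driving this construction, namely that each suboptimal arm has CVaR gap τ⁻¹ε and that the two instances are hard to distinguish in KL, are formalized below in Lemmas D.3 and D.4. The following quick numerical sketch (ours) checks both for the Bernoulli family β_c = Ber(1 − τ + cε).

```python
import numpy as np

def cvar_bernoulli(p, tau, grid=100_001):
    """CVaR_tau of Ber(p) via the definition in Eq. (1), maximizing over b in [0, 1]."""
    b = np.linspace(0.0, 1.0, grid)
    expected_pos_part = (1 - p) * b + p * np.maximum(b - 1.0, 0.0)   # X is 0 or 1
    return float(np.max(b - expected_pos_part / tau))

def kl_bernoulli(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

tau = 0.1
for c, eps in [(0, 0.02), (1, 0.02), (2, 0.02), (1, 0.05)]:
    p = 1 - tau + c * eps
    print(c, eps,
          cvar_bernoulli(p, tau), c * eps / tau,               # gap formula (Lemma D.3)
          kl_bernoulli(1 - tau, p), 2 * (c * eps) ** 2 / tau)  # KL bound (Lemma D.4)
```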
Therefore, Eν,Alg Regret MAB τ (K, ν, Alg) + Eν ,Alg Regret MAB τ (K, ν , Alg) + Pr ν ,Alg 4τ exp( DKL(Pν,Alg, Pν ,Alg)) (Bretagnolle-Huber Lemma D.1) a A Eν,Alg[Ta(K)]DKL(ν(a), ν (a)) (Lattimore & Szepesvári, 2020, Lemma 15.1) 4τ exp( Eν,Alg[Ti(K)]DKL(ν(i), ν (i))) (other arms are the same for ν, ν ) 4τ exp( Eν,Alg[Ti(K)]DKL(ν(i), ν (i))) 4τ exp 8Kε2 The last inequality uses two facts. By definition of i = arg mina>1 Eν,Alg[Ta(K)], Eν,Alg[Ti(K)] K A 1. Also, by Lemma D.4, DKL(ν(i), ν (i)) 8ε2τ 1. Setting ε2 = (A 1)τ 8K and noting 2 max{a, b} a + b gives the desired lower bound. Lemma D.3. For any τ (0, 1/2) and ε [0, τ], we have CVa Rτ(Ber(1 τ + ε)) = τ 1ε. Near-Minimax-Optimal Risk-Sensitive RL with CVa R Proof. The CDF of X Ber(1 τ + ε) is as follows, 0, if x < 0, τ ε, if x [0, 1), 1, if x 1. Therefore, F (τ) = inf{x : F(x) τ} = 1 for any ε > 0, and it is 0 when ε = 0. By Lemma C.2, we have CVa Rτ(Ber(1 τ)) = 0 τ 1E (0 X)+ = 0, CVa Rτ(Ber(1 τ + ε)) = 1 τ 1E (1 X)+ = 1 τ 1(τ ε) = τ 1ε. Lemma D.4. For any τ (0, 1/2) and ε [0, τ], we have DKL(Ber(1 τ), Ber(1 τ + ε)) 2ε2τ 1. Proof. By explicit computation, we have DKL(Ber(1 τ), Ber(1 τ + ε)) = τ log τ τ ε + (1 τ) log 1 τ 1 τ + ε τ log τ τ ε + τ log τ τ + ε = τ log 1 ε2 The first inequality is because f(x) = x log x x+ε is a decreasing function and 1 τ τ. The second inequality is because log(1 x) 2x for x [0, 1]. D.2. Lower bound for CVa R RL Corollary 3.2. Fix any τ (0, 1/2), A, H N. For any algorithm, there is an MDP (with S = Θ(AH 1)) s.t. if 8τ , then E Regret RL τ (K) 1 24e Proof of Corollary 3.2. Fix any τ, A, H. Consider an MDP where the states are represented by an A-balanced tree with depth H (each node of the tree is a state). The initial state s1 is the root, and based on the action a1, transits to the a1-th node in the next layer. The process repeats until we ve reached one of the AH 1 leaves, where a reward is given (which also depends on the action taken at the leaf). There are no rewards until the last step. The number of states is S = 1 + A + ... + AH 1, since the h-th layer of the tree has Ah 1 states. Since there are no rewards until the last step, running in this MDP reduces to a MAB with AH arms where the arms are the sequences of actions a1:H. So, by Theorem 3.1, for any RL algorithm, there is an MDP constructed this way (with Bernoulli rewards at the end) such that if K q 8τ , then E Regret RL τ (K) 1 24e τ . The key observation is that AH 1 = (A 1) AH 1 + AH 2 + + A + 1 = (A 1)S. This concludes the proof. Near-Minimax-Optimal Risk-Sensitive RL with CVa R E. Proofs for BERNSTEIN-UCB For any arm a A, let b a denote the τ-th quantile of R(a), so b a = arg max b [0,1] b τ 1ER ν(a) (b R)+ CVa Rτ(R(a)) = b a τ 1ER ν(a) (b a R)+ . Let us denote µ(b, a) = ER ν(a) (b R)+ , bµk(b, a) = 1 Nk(a) i=1 (b ri)+I [ai = a] . Recall that a is the arm with the highest CVa Rτ. For any δ (0, 1), w.p. at least 1 δ, uniform Bernstein implies that for all b, a, |bµk(b, a) µ(b, a)| 2τ log(2AK/δ) Nk(a) + log(2AK/δ) Nk(a) . (7) Note that our bonus Eq. (3) is constructed to match the upper bound. This implies that bµk BONk is a pessimistic estimate of µ. Lemma E.1 (Pessimism). For all k [K] min a A{bµk(b a, a) BONk(a)} µ(b a , a ). Proof. Fix any k [K]. By Eq. (7), for all a A, bµk(b a, a) BONk(a) µ(b a, a). min a A{bµk(b a, a) BONk(a)} bµk(b a , a ) BONk(a ) µ(b a , a ). Theorem 4.1. For any δ (0, 1), w.p. at least 1 δ, BERNSTEIN-UCB with ε p A/2τK enjoys Regret MAB τ (K) 4 τ 1AKL + 16τ 1AL2. 
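Before the proof, the following schematic sketch (ours) shows the arm-selection rule being analyzed: maintain, per arm, the empirical estimate of μ(b, a), subtract a Bernstein-shaped bonus of the same form as the right-hand side of Eq. (7) (the paper's actual bonus is Eq. (3) in the main text, so the constants below are placeholders), and pull the arm whose optimistic CVaR index b − τ⁻¹(μ̂ − BON) is largest.

```python
import numpy as np

def select_arm(samples_per_arm, tau, delta, K, b_grid_size=201):
    """Pick argmax_a max_b { b - tau^{-1} (mu_hat_k(b, a) - BON_k(a)) } over a grid of b.
    Assumes every arm has been pulled at least once."""
    A = len(samples_per_arm)
    L = np.log(2 * A * K / delta)
    b_grid = np.linspace(0.0, 1.0, b_grid_size)
    best_arm, best_val = 0, -np.inf
    for a, ri in enumerate(samples_per_arm):
        ri = np.asarray(ri, dtype=float)
        N = len(ri)
        bonus = np.sqrt(2 * tau * L / N) + L / N                      # Bernstein-shaped stand-in
        mu_hat = np.array([np.mean(np.maximum(b - ri, 0.0)) for b in b_grid])
        val = float(np.max(b_grid - (mu_hat - bonus) / tau))          # optimistic CVaR index
        if val > best_val:
            best_arm, best_val = a, val
    return best_arm

# toy usage with three arms
rng = np.random.default_rng(3)
samples = [rng.beta(5, 2, 30), rng.beta(2, 2, 30), rng.beta(8, 1, 30)]
print(select_arm(samples, tau=0.1, delta=0.05, K=1000))
```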
Near-Minimax-Optimal Risk-Sensitive RL with CVa R Proof of Theorem 4.1. Regret MAB τ (K) k=1 CVa R τ CVa Rτ(R(ak)) b a τ 1µ(b a , a ) CVa Rτ(ν(ak)) b a τ 1 min a A(bµk(b a , a) BONk(a)) CVa Rτ(ν(ak)) (pessimism Lemma E.1) k=1 max a A b a τ 1(bµk(b a , a) BONk(a)) CVa Rτ(ν(ak)) k=1 max a A n bba,k τ 1 bµk(bba,k, a) BONk(a) o CVa Rτ(ν(ak)) (bba,k is ε-optimal) n bbak,k τ 1 bµk(bbak,k, ak) BONk(ak) o CVa Rτ(ν(ak)) (defn. of ak) k=1 τ 1BONk(ak) + max b [0,1] b τ 1bµk(b, ak) CVa Rτ(ν(ak)) k=1 τ 1BONk(ak) + \ CVa Rτ({ri}i Ik(ak)) CVa Rτ(ν(ak)) 2L Nk(a)τ + L Nk(a)τ + 3L Nk(a)τ + 15L Nk(a)τ (CVa Rτ concentration Theorem C.6) 10L Nk(a)τ + 16L Nk(a)τ AKL + 16Lτ 1 A log(K). (elliptical potential Lemma G.12) A technical detail is that CVa Rτ concentration only applies when Nk(a) 25L. We can trivially bound the total regret of the episodes when Nk(a) < 25L by 25AL. Also, we remark the concentration step applies since {ri}i Ik(a) are i.i.d. via the tape framework, so we do not need to generalize Theorem C.6 to martingale sequences. Finally, setting ε = p τ 1A/2K renders it lower order. Near-Minimax-Optimal Risk-Sensitive RL with CVa R F. Proofs for Augmented MDP We first define the memory-MDP model, where the MDP is also equipped with a memory generator Mh, which generates mh Mh(sh, ah, rh, Hh). These memories are stored into the history Hh = (st, at, rt, mt)t [h 1] and may be used by history dependent policies in future time steps. Concretely, rolling out π proceeds as follows: for any h [H], ah πh(sh, Hh), sh+1 P (sh, ah), rh R(sh, ah) and mh Mh(sh, ah, rh, Hh). We can also extend the above formulation to the augmented MDP, where the state is augmented with b as in Section 5.1. Here, the history is HAug h = (st, bt, at, rt, mt)t [h 1]. Let ΠAug H represent the set of history dependent policies in this augmented MDP with memory. Also, recall that ΠAug is the set of Markov, deterministic policies in the augmented MDP. The V function is defined for these multiple types of policies: π ΠH : V π h (sh, bh; Hh) = Eπ | sh, bh, Hh ρ ΠAug : V ρ h (sh, bh) = Eρ ρ ΠAug H : V ρ h (sh, bh; HAug h ) = Eρ | sh, bh, HAug h Notice that rolling out ρ, b in the augmented MDP is equivalent to rolling out πρ,b in the original MDP, where πρ,b h (sh, Hh) = ρh(sh, b r1 ... rh 1). Thus, it s evident that their V functions should match. Lemma F.1. Fix any ρ ΠAug, h [H], augmented state (sh, bh) and history Hh. Then, we have V ρ h (sh, bh) = V πρ,b h (sh, bh; Hh) for b = bh + r1 + ... + rh 1. In particular, we have V ρ 1 (s1, ) = V πρ,b 1 (s1, ). Proof. The setting of b in the lemma satisfies bh = b r1 ... rh 1. Therefore, the trajectories of (ρ, b) and πρ,b are exactly coupled. We now show the key result of this section. The theorem shows that the V , U functions defined via the Bellman optimality equations (from Section 5.1) correspond to the V, U functions of ρ . Furthermore, the Markov (in augmented state) and deterministic ρ is in fact an optimal policy amongst all history-dependent policies in the augmented MDP with memory! This result and our proof is analogous to the Markov optimality theorem of vanilla RL, e.g., (Puterman, 2014), (Agarwal et al., 2021, Theorem 1.7). Theorem F.2. For all h we have U h = U ρ h and V h = V ρ h . Furthermore, for all sh, bh, HAug h , we have V h (sh, bh) = inf ρ ΠAug H V ρ h (sh, bh; HAug h ). In particular, V 1 (s1, b) = infρ ΠAug H V ρ 1 (s1, b) for all b. Proof. We first prove the claim that U h = U ρ h and V h = V ρ h . The base case of H + 1 is trivial since VH+1(s, b) = b+ everywhere. 
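As an aside, the Bellman optimality equations used in this proof translate directly into a dynamic program. The sketch below (ours; array names are illustrative) assumes a known tabular model whose rewards already lie on a grid of spacing η, runs the backward recursion V*_{H+1}(s, b) = b₊, U*_h(s, b, a) = E_{s', r}[V*_{h+1}(s', b − r)], V*_h = min_a U*_h over a discretized budget, and reads off CVaR* = max_b {b − τ⁻¹ V*_1(s_1, b)}.

```python
import numpy as np

def cvar_value_iteration(P, R_probs, R_vals, H, tau, eta, s1=0):
    """
    P:       (S, A, S) transition kernel;  R_probs: (S, A, M) reward pmf over the
    grid values R_vals (assumed multiples of eta).  Returns CVaR*_tau at s1 and
    the greedy augmented-state policy rho*[h, s, b_index].
    """
    S, A, _ = P.shape
    B = int(round(1.0 / eta)) + 1
    b_grid = np.arange(B) * eta
    V = np.tile(b_grid, (S, 1))                 # V*_{H+1}(s, b) = b_+  (b >= 0 on the grid)
    policy = np.zeros((H, S, B), dtype=int)
    for h in range(H - 1, -1, -1):
        U = np.zeros((S, A, B))
        for m, r in enumerate(R_vals):
            idx = np.clip(np.round((b_grid - r) / eta).astype(int), 0, B - 1)
            # when b - r < 0 the remaining value (b - total return)_+ is 0
            shifted = np.where(b_grid - r < 0, 0.0, V[:, idx])          # (S, B)
            U += R_probs[:, :, m][:, :, None] * np.einsum("xay,yb->xab", P, shifted)
        V = U.min(axis=1)                       # V*_h(s, b) = min_a U*_h(s, b, a)
        policy[h] = U.argmin(axis=1)
    cvar = float(np.max(b_grid - V[s1] / tau))  # max_b { b - tau^{-1} V*_1(s1, b) }
    return cvar, policy
```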
For the inductive step, fix any h [H] and suppose the claim is true for h + 1. Then, h (sh, bh, ah) = Esh+1 P (sh,ah),rh R(sh,ah) h V ρ h+1(sh+1, bh rh) i (Bellman Eqns) = Esh+1 P (sh,ah),rh R(sh,ah) V h+1(sh+1, bh rh) (IH) = U h(sh, bh, ah). (Bellman Opt. Eqns) Near-Minimax-Optimal Risk-Sensitive RL with CVa R This proves that U h = U ρ h . For V , h (sh, bh) = Eah ρ h(sh,bh) h U ρ h (sh, bh, ah) i (Bellman Eqns) = Eah ρ h(sh,bh)[U h(sh, bh, ah)] (above claim) = min ah A U h(sh, bh, ah) (defn. of ρ h) = V h (sh, bh). (Bellman Opt. Eqns) Therefore, we ve shown that V h = V ρ We also prove the second claim inductively. The base case is again trivial since VH+1(s, b) = b+ everywhere. For the inductive step, fix any h [H] and suppose the claim is true for h + 1. Now fix any sh, bh and HAug h , inf ρ ΠAug H V ρ h (sh, bh; HAug h ) = inf ρ ΠAug H Eρ | sh, bh, HAug h inf ρ ΠAug H Eah,sh+1,rh,mh inf ρ ΠAug H Eρ | sh+1, bh+1, HAug h+1 = inf ρ ΠAug H Eah ρh(sh,bh,HAug h ) Esh+1 P (sh,ah),rh R(sh,ah) V h+1(sh+1, bh rh) (IH) = min a A Esh+1 P (sh,ah),rh R(sh,ah) V h+1(sh+1, bh rh) ( ) = V h (sh, bh). (by defn.) There are three key steps. First, the inequality is due to expanding out one step, where ah ρh(sh, bh, HAug h ), sh+1 P (sh, ah), rh R(sh, ah), mh Mh(sh, ah, rh, Hh), then push the inf for future steps inside the expectation. Second, the IH invocation is significant as it essentially removes dependence of the memory hallucinations mh. Third, the step marked with is significant since, regardless of the history, the current best action is just to minimize the inner function (which is independent of the history). We also have V h (sh, bh) infρ ΠAug H V ρ h (sh, bh; HAug h ) since by the first part of the claim, V h is the value of ρ ΠAug H . Thus, we ve shown V h (sh, bh) = infρ ΠAug H V ρ h (sh, bh; HAug h ). As a corollary of the above theorem, we can restrict the policy class to history-dependent policies on the non-augmented MDP (and without history). Theorem 5.1 (Optimality of ΠAug). For any b [0, 1], V 1 (s1, b) = V ρ 1 (s1, b) = infπ ΠH V π 1 (s1, b). Proof of Theorem 5.1. The first equality is directly from Theorem F.2. We now prove the second equality. For any b, min ρ ΠAug V πρ,b 1 (s1, b) = min ρ ΠAug V ρ 1 (s1, b) (Lemma F.1) = min π ΠAug H V π 1 (s1, b) (Theorem F.2) min π ΠH V π 1 (s1, b) min ρ ΠAug V πρ,b 1 (s1, b). The last two inequalities is due to considering strictly smaller sets of policies. Therefore, we have equality throughout, which proves the claim. Near-Minimax-Optimal Risk-Sensitive RL with CVa R G. Proofs for CVa R-UCBVI G.1. The high probability good event In this section, we derive all the high probability results needed in the remainder of the proof. Fix any failure probability δ (0, 1). Then, w.p. at least 1 δ, for all h [H], k [K], s S, a A, we have, for all b [0, 1], s S, b Pk(s, a) P (s, a) Erh V h+1( , b rh) L Nk(s, a), (8) b Pk(s, a) P (s, a) Erh V h+1( , b rh) 2 Vars b Pk(s,a) Erh V h+1(s , b rh) L Nk(s, a) + L Nk(s, a), (9) b Pk(s | s, a) P (s | s, a) 2P (s | s, a)L Nk(s, a) + L Nk(s, a). (10) where rh R(s, a) in the expectations. Proof. Eq. (8) is due to uniform Hoeffding applied to Esh+1,rh V h+1(sh+1, b rh) , which is 1-Lipschitz in b by Lemma G.1. Eq. (10) is due to standard Bernstein s inequality on the indicator random variable on (s, a, s ), i.e., I [(sh,k, ah,k, sh+1,k) = (s, a, s )]. Eq. (9) is due to uniform empirical Bernstein applied to Esh+1,rh V h+1(sh+1, b rh) . In Appendix B, we derive and review these uniform results. Lemma G.1. 
For any h [H] and s S, V h (s, ) is 1-Lipschitz. Proof. We proceed by induction. Let b, b [0, 1] be arbitrary. At h = H + 1, V H+1(s, b) V H+1(s, b ) = |b+ (b )+| |b b | since the Re LU is 1-Lipschitz. For the inductive step, fix any h and suppose the claim is true at h + 1. Then |V h (s, b) V h (s, b )| = mina Esh+1,rh V h+1(sh+1, b rh) mina Esh+1,rh V h+1(sh+1, b rh) maxa Esh+1,rh V h+1(sh+1, b rh) V h+1(sh+1, b rh) |b b |, by the IH. The expectations are over sh+1 P (s, a) and rh R(s, a). We now show that the projected error between b Pk(s, a) and P can be bounded in two ways. Lemma G.2. For any δ (0, 1), w.p. at least 1 δ, we have for all f : S [0, 1], b Pk(s, a) P (s, a) f min SL Nk(s, a), Es P (s,a)[f(s )] H + ξk(s, a) where ξk(s, a) := min 1, 2HSL Proof. Fix any f : S [0, 1]. The first bound of q SL Nk(s,a) follows from applying Hoeffding on an ε-net of the space of f s, i.e., for each g in the net, we have b Pk(s, a) P (s, a) g q L Nk(s,a). This ε-net has ℓ2 bounded by This gives the metric entropy log(1 + 2 S/ε)S S log(S/ε). Setting ε = 1 HK makes the error lower order, i.e., 1 HK 1 Nk(s,a), which gives the uniform result over all f s. The detailed proof is in (Agarwal et al., 2021, Lemma 7.2). Near-Minimax-Optimal Risk-Sensitive RL with CVa R The second bound also appears in Agarwal et al. (2021) as Lemma 7.8. We prove its proof for completeness: b Pk(s, a) P (s, a) f X b Pk(s | s, a) P (s | s, a) f(s ) 2P (s | s, a)L Nk(s, a) + f(s )L Nk(s, a) (Eq. (10)) s 2P (s | s, a)f 2(s )L Nk(s, a) + SL Nk(s, a) (C-S) SHL Nk(s, a) + P s P (s | s, a)f(s ) H + SL Nk(s, a). (AM-GM) Finally, since b Pk(s, a) f and P (s, a) f are both in [0, 1], a trivial bound is 1, which is why ξh,k can be truncated. Finally, we also have consequences of Azuma s inequality Lemma G.11. W.p. at least 1 δ, for all h [H], k=1 Ebρk,bbk[2BONh,k(sh, bh, ah) + ξh,k(sh, ah) | Ek] 6L + 2 k=1 2BONh,k(sh,k, bh,k, ah,k) + ξh,k(sh,k, ah,k), where we used the fact that WLOG we truncated the bonus to be at most 1 (by sentence below Eq. (BON )), so 2BONh,k + ξh,k 3. Ek denotes the complete trajectories from episodes 1, 2, ..., k 1 For the Bernstein bonus proofs, we ll also need, k=1 Es P (sh,k,ah,k),r R(sh,k,ah,k) b V h+1,k(s , bh,k r) b V h+1,k(s , bh,k r) 2 | Ek, Hh,k b V h+1,k(sh+1,k, bh+1,k) b V h+1,k(sh+1,k, bh+1,k) 2 , (Azuma 2) k=1 Es P (sh,k,ah,k),r R(sh,k,ah,k) h+1(s , bh,k r) b V h+1,k(s , bh,k r) 2 | Ek, Hh,k h+1(sh+1,k, bh+1,k) b V h+1,k(sh+1,k, bh+1,k) 2 , (Azuma 3) where we ve used that the envelope is at most 1 and bh+1,k = bh,k rh,k. Here, Hh,k = (st,k, at,k, rt,k)t [h 1] denotes the history before h for the k-th episode. Also, for all h [H], k=1 Vars P (sh,k,ah,k) Er R(sh,k,ah,k) h V bρk h+1(s , bh,k r) i k=1 Ebρk,bbk h Vars P (sh,ah) Er R(sh,ah) h V bρk h+1(s , bh,k r) i | Ek i . (Azuma 4) Also, for all h, t [H] where t h, k=1 Ebρk,sh=sh,k,bh=bh,k 2BONBERN t,k (st, bt, at) + ξt,k(st, at) | Ek 6L + 2 k=1 2BONBERN t,k (st,k, bt,k, at,k) + ξt,k(st,k, at,k). Near-Minimax-Optimal Risk-Sensitive RL with CVa R Finally a standard Azuma also gives, for all h [H], k=1 Ebρk,bbk 2BONBERN h,k (sh, bh, ah) + ξh,k(sh, ah) | Ek 3 k=1 2BONBERN h,k (sh,k, bh,k, ah,k) + ξh,k(sh,k, ah,k) Henceforth, we always condition on the union of these high probability statements to be true. G.2. Key lemmas for CVa R-UCBVI In general, the bonus should be designed to satisfy for all h [H], k [K], s, b, a : b Pk(s, a) P (s, a) Erh R(s,a) V h+1( , b rh) BONh,k(s, b, a). 
(BON ) The bonus only needs to satisfy this inequality for our proofs to work. WLOG, since the left hand side is the difference between two numbers in [0, 1], we can always assume bonus to be truncated by 1, i.e. has envelope 1. We say that pessimism is satisfied at h [H], k [K] if s, b : b V h,k(s, b) V h (s, b). (Pessimism (V )) Lemma G.3. For any k [K], h [H], suppose Pessimism (V ) holds at (h + 1, k) and BON holds at (h, k). Then Pessimism (V ) holds at (h, k). Proof. First, we prove pessimism for b U h,k. For any s, b, a, we have b U h,k(s, b, a) U h(s, b, a) = b Pk(s, a) Erh R(s,a) h b V h+1,k( , b rh) i BONh,k(s, b, a) P (s, a) Erh R(s,a) V h+1( , b rh) b Pk(s, a) P (s, a) Erh R(s,a) V h+1( , b rh) BONh,k(s, b, a) (IH) To complete the proof, if b V h,k(s, b) = 0, it is trivially pessimistic, and if not, b V h,k(s, b) V h (s, b) = min a n b U h,k(s, b, a) o min a {U h(s, b, a)} n b U h,k(s, b, a) U h(s, b, a) o Remarkably, we show Simulation lemma also holds for CVa R-UCBVI. Here, it is also required that the bonus satisfies BON . As for notation, recall Ek represents the episodes before and not including k. Lemma G.4 (Simulation Lemma). Fix any k [K], t [H]. Then, for all st, bt, we have V bρk t (st, bt) b V t,k(st, bt) h=t Ebρk,st,bt BONh,k(sh, bh, ah) + P (sh, ah) b Pk(sh, ah) b V h+1,k( , bh+1) | Ek Furthermore, if we assume that BON holds, then for all st, bt, V bρk t (st, bt) b V t,k(st, bt) h=t (1 + 1/H)h t Ebρk,st,bt[2BONh,k(sh, bh, ah) + ξh,k(sh, ah) | Ek]. (12) Near-Minimax-Optimal Risk-Sensitive RL with CVa R In particular, V bρk 1 (s1, b) b V 1,k(s1, b) e h=1 Ebρk[2BONh,k(sh, bh, ah) + ξh,k(sh, ah) | Ek]. Proof. Fix any k and t. All expectations in the proof will condition on Ek; this way, the randomness is only from rolling in the policy bρk and not over any of the prior episodes. First claim: Let s first show Eq. (11) by induction. The base case is t = H+1, we have V bρk H+1(s, b) = b V H+1,k(s, b) = b+, H+1 b VH+1,k = 0. For the inductive step, fix any t H and suppose Eq. (11) is true for t + 1. Let us denote at = bρk t (st, bt) = arg mina b Ut,k(st, bt, a), so b V t,k(st, bt) = max n b U t,k(st, bt, at), 0 o b U t,k(st, bt, at) b Pt,k(st, at) Ert h b V t+1,k( , bt+1) i BONt,k(st, bt, at), where bt+1 = bt rt is the random next budget. So, we have V bρk t (st, bt) b V t,k(st, bt) U bρk t (st, bt, at) b U t,k(st, bt, at) = BONt,k(st, bt, at) b Pt,k(st, at) Ert h b V t+1,k( , bt+1) i + P t (st, at) Ert h V bρk t+1,k( , bt+1) i = BONt,k(st, bt, at) + P t (st, at) b Pt,k(st, at) Ert h b V t+1,k( , bt+1) i + P t (st, at) Ert h V bρk t+1( , bt+1) b V t+1,k( , bt+1) i BONt,k(st, bt, at) + P t (st, at) b Pt,k(st, at) Ert h b V t+1,k( , bt+1) i + Est+1 P t (st,at) h=t+1 Ebρk,st+1,bt+1 BONh,k(sh, bh, ah) + P (sh, ah) b Pk(sh, ah) b V h+1,k( , bh+1) # h=t Ebρk,st,bt BONh,k(sh, bh, ah) + P (sh, ah) b Pk(sh, ah) b V h+1,k( , bh+1) . This concludes the proof for the first claim. Second claim: Now let us show Eq. (12) by induction. The base case at t = H + 1 is same as the first claim. For the inductive step, fix any t H and suppose Eq. (12) is true for t + 1. 
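For concreteness, one backward-induction step of CVaR-UCBVI consistent with the (BON) requirement can be sketched as follows (ours; the paper's actual Hoeffding bonus is Eq. (5) in the main text, so the sqrt(L / N_k(s, a)) shape below, matching Eq. (8), is a stand-in). It mirrors the exact recursion sketched after Theorem F.2, but uses the empirical model, subtracts the bonus, and truncates to [0, 1].

```python
import numpy as np

def pessimistic_backup(P_hat, R_hat_probs, R_vals, N, V_next, eta, L):
    """
    P_hat:       (S, A, S) empirical transitions;  R_hat_probs: (S, A, M) empirical
    reward pmf over grid values R_vals;  N: (S, A) visit counts;  V_next: (S, B)
    pessimistic value table at step h + 1.  Returns (V_hat at step h, greedy policy).
    """
    S, A, _ = P_hat.shape
    B = V_next.shape[1]
    b_grid = np.arange(B) * eta
    U = np.zeros((S, A, B))
    for m, r in enumerate(R_vals):
        idx = np.clip(np.round((b_grid - r) / eta).astype(int), 0, B - 1)
        shifted = np.where(b_grid - r < 0, 0.0, V_next[:, idx])
        U += R_hat_probs[:, :, m][:, :, None] * np.einsum("xay,yb->xab", P_hat, shifted)
    bonus = np.sqrt(L / np.maximum(N, 1))                    # Hoeffding-shaped stand-in
    U = np.clip(U - bonus[:, :, None], 0.0, 1.0)             # subtract bonus, keep in [0, 1]
    return U.min(axis=1), U.argmin(axis=1)
```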
Then, continuing from the line before invoking the IH Near-Minimax-Optimal Risk-Sensitive RL with CVa R of the first claim, we have V bρk t (st, bt) b V t,k(st, bt) BONt,k(st, bt, at) + P t (st, at) b Pt,k(st, at) Ert h b V t+1,k( , bt+1) i + P t (st, at) Ert h V bρk t+1( , bt+1) b V t+1,k( , bt+1) i = BONt,k(st, bt, at) + P t (st, at) b Pt,k(st, at) Ert h b V t+1,k( , bt+1) V t+1( , bt+1) i + P t (st, at) b Pt,k(st, at) Ert V t+1( , bt+1) + P t (st, at) Ert h V bρk t+1( , bt+1) b V t+1,k( , bt+1) i BONt,k(st, bt, at) + ξt,k(st, at) + 1 H P t (st, at) Ert h V t+1( , bt+1) b V t+1,k( , bt+1) i (Lemma G.2) + BONt,k(st, bt, at) + P t (st, at) Ert h V bρk t+1( , bt+1) b V t+1,k( , bt+1) i (premise (BON )) 2BONt,k(st, bt, at) + ξt,k(st, at) + (1 + 1/H)P t (st, at) Ert h V bρk t+1( , bt+1) b V t+1,k( , bt+1) i (V V bρk) 2BONt,k(st, bt, at) + ξt,k(st, at) + (1 + 1/H) h=t+1 (1 + 1/H)h t 1Ebρk,st,bt[2BONh,k(sh, bh, ah) + ξh,k(sh, ah)]. (IH) This completes the inductive proof. Observing that (1 + 1/H)H exp(1/H)H = e gives the corollary. G.3. CVa R-UCBVI with Hoeffding Bonus The Hoeffding bonus BONHOEFF h,k (s, a) defined in 5 satisfies the crucial bonus requirement BON by the uniform Hoeffding s inequality result of Eq. (8). Thus, we have pessimism for all k, h with the Hoeffding bonus. Theorem 5.2. For any δ (0, 1), w.p. at least 1 δ, CVa R-UCBVI with the Hoeffding bonus (Eq. (5)) enjoys Regret RL τ (K) 4eτ 1 SAHKL + 10eτ 1S2AHL2. Proof of Theorem 5.2. Let R(ρk,bbk) denote the distribution of returns from rolling in bρk starting at bbk. For any k, we have CVa Rτ(R(bρk,bbk)) = max b [0,1] b τ 1Ebρk,bbk bbk τ 1Ebρk,bbk = bbk τ 1V bρk 1 (s1,bbk). (13) Near-Minimax-Optimal Risk-Sensitive RL with CVa R Regret RL τ (K) k=1 CVa R τ CVa Rτ(R(bρk,bbk)) b τ 1V 1 (s1, b ) CVa Rτ(R(bρk,bbk)) n b τ 1 b V 1,k(s1, b ) o CVa Rτ(R(bρk,bbk)) (Pessimism (V )) n bbk τ 1 b V 1,k(s1,bbk) o n bbk τ 1V bρk 1 (s1,bbk) o (defn. of bbk and Eq. (13)) V bρk 1 (s1,bbk) b V 1,k(s1,bbk) (h,k) [H] [K] Ebρk,bbk 2BONHOEFF h,k (sh, ah) + ξh,k(sh, ah) | Ek (Simulation Lemma G.4) 6eτ 1HL + 2eτ 1 X (h,k) [H] [K] 2BONHOEFF h,k (sh,k, ah,k) + ξh,k(sh,k, ah,k) (Eq. (Azuma 1)) 6eτ 1HL + 2eτ 1 2 SAHKL + 2HSL SA log(K) (elliptical potential Lemma G.12) SAHKL + 10eτ 1S2AHL2, which concludes the proof. G.4. CVa R-UCBVI with Bernstein Bonus When running Algorithm 2 with the Bernstein bonus BONBERN h,k (s, b, a) (Eq. (6)), we need to also show that b V h,k are optimistic for V h . We say that optimism is satisfied at (h, k) [H] [K] if s, b : V h (s, b) b V h,k(s, b). (Optimism (V )) Lemma G.5. For any k [K], h [H], suppose Optimism (V ) holds at (h + 1, k) and BON holds at (h, k). Then Optimism (V ) holds at (h, k). Proof. First, we prove optimism for b U h,k. For any s, b, a we have U h(s, b, a) b U h,k(s, b, a) = P (s, a) Erh V h+1( , b rh) b Pk(s, a) Erh h b V h+1,k( , b rh) i BONh,k(s, b, a) P (s, a) b Pk(s, a) Erh V h+1( , b rh) BONh,k(s, b, a) (IH) 0. by Eq. (BON ) To complete the proof, if b V (s, b) = 1, then it is trivially optimistic, and if not V h (s, b) b V h,k(s, b) = min a U h(s, b, a) b U h,k(s, b, bρk(s, b)) U h(s, b, bρk(s, b)) b U h,k(s, b, bρk(s, b)) Near-Minimax-Optimal Risk-Sensitive RL with CVa R Lemma G.6. For any (h, k) [H] [K], if Pessimism (V ) and Optimism (V ) both hold at (h + 1, k), then BON holds at (h, k) for the Bernstein bonus BONBERN. Proof. Recall that uniform empirical Bernstein Eq. 
(9) gave us the following inequality: for all s, a and b, b Pk(s, a) P (s, a) Erh V h+1( , b rh) 2 Vars b Pk(s,a) Erh V h+1(s , b rh) L Nk(s, a) + L Nk(s, a). Eq. (9) revisited. Now apply the useful triangle inequality of variances p Var(Y ) + p Var(X Y ) (Zanette & Brunskill, 2019, Eqn. 51), Vars b Pk(s,a) Erh V h+1(s , b rh) Vars b Pk(s,a) Erh h b V h+1(s , b rh) i Vars b Pk(s,a) Erh h V h+1(s , b rh) b V h+1,k(s , b rh) i . The first term is in the bonus. The second term is bounded by the correction term of the bonus as follows, Vars b Pk(s,a) Erh h V h+1(s , b rh) b V h+1,k(s , b rh) i Es b Pk(s,a) Erh h V h+1(s , b rh) b V h+1,k(s , b rh) i 2 Es b Pk(s,a),r R(s,a) V h+1(s , b r) b V h+1,k(s , b r) 2 (Jensen) Es b Pk(s,a),r R(s,a) b V h+1,k(s , b r) b V h+1,k(s , b r) 2 . (premise: b V h+1,k V h+1 b V h+1,k) This upper bound is a part of the Bernstein bonus. Thus, we ve shown that BONBERN dominates the error, and so BON is satisfied at (h, k). A key corollary of Lemmas G.3, G.5 and G.6 is that we have Pessimism (V ) and Optimism (V ) for all (h, k) [H] [K] with the Bernstein bonus. Indeed, for any k, we first apply Lemma G.6 to get that BON is satisfied at (H, k) (as optimism/pessimism trivially holds at H + 1). Then, apply Lemmas G.3 and G.5 to get Pessimism (V ) and Optimism (V ) at (H, k). Then, apply Lemma G.6 to get that BON is satisfied at (H 1, k). Continue in this fashion until we ve shown BON , Pessimism (V ) and Optimism (V ) for all h [H]. Theorem G.7. The Bernstein bonus satisfies BON , Pessimism (V ) and Optimism (V ) at all (h, k) [H] [K]. We now prove the regret guarantee for the Bernstein bonus. The main body of the proof for Theorem 5.3 and Theorem 5.5 will be the same. The proofs will only diverge at the end when bounding the sum of variances, where we invoke Assumption 5.4. Theorem 5.3. For any δ (0, 1), w.p. at least 1 δ, CVa R-UCBVI with the Bernstein bonus (Eq. (6)) enjoys a regret guarantee of Regret RL τ (K) 10eτ 1 SAKL + τ 1ξ, where ξ e O(SAHK1/4 + S2AH) is a lower order term. Theorem 5.5. Under Assumption 5.4, the bound in Theorem 5.3 can be refined to, Regret RL τ (K) 12e τ 1SAKL + τ 1p 1/2 min ξ. Proof of Theorem 5.3 and Theorem 5.5. Following the same initial steps as Theorem 5.2, we have Regret RL τ (K) 6eτ 1HL + 2eτ 1 X (h,k) [H] [K] 2BONBERN h,k (sh,k, bh,k, ah,k) + ξh,k(sh,k, ah,k). Near-Minimax-Optimal Risk-Sensitive RL with CVa R The proof boils down to bounding the sum. Logarithmic-in-K terms: First, note that any O(1/Nk(s, a)) term will scale logarithmically in K. This includes the ξh,k(s, a) term, as well as a 2L Nk(s,a) from the bonus. Thus, (h,k) [H] [K] 2L + 2HSL Nk(sh,k, ah,k) 4HSL SA log(K) = 4S2AHL2. (Lemma G.12) The variance correction term of the bonus: (h,k) [H] [K] v u u u t Es b Pk(sh,k,ah,k),r R(sh,k,ah,k) b V h+1,k(s , bh,k r) b V h+1,k(s , bh,k r) 2 Nk(sh,k, ah,k) . (14) First, apply a Cauchy Schwarz. The P h,k 1 Nk(sh,k,ah,k) term is at most SAL by elliptical potential Lemma G.12. Then, translate b P to P via Lemma G.2 to get v u u u u u u u u t (h,k) [H] [K] 8 SL Nk(sh,k, ah,k) (h,k) [H] [K] Es P (sh,k,ah,k),r R(sh,k,ah,k) b V h+1,k(s , bh,k r) b V h+1,k(s , bh,k r) | Ek, Hh,k 2 ! Then, switch to the empirical s, b using Eq. (Azuma 2), v u u u t SAL (h,k) [H] [K] b V h+1,k(sh+1,k, bh+1,k) b V h+1,k(sh+1,k, bh+1,k) 2 For the sum term, since each term is at most 1, we have b V h,k(sh,k, bh,k) b V h,k(sh,k, bh,k) 2 b V h,k(sh,k, bh,k) b V h,k(sh,k, bh,k) . 
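The structure of the Bernstein bonus can be made explicit as follows. The sketch below (ours; the exact constants of Eq. (6) live in the main text and may differ) combines the empirical variance of the reward-averaged optimistic value with the correction term, built from the gap between the optimistic and pessimistic tables, that appears in Lemma G.6 and Eq. (14).

```python
import numpy as np

def bernstein_bonus(P_hat_sa, R_probs_sa, V_up, V_lo, b_idx_shift, N_sa, L):
    """
    P_hat_sa:    (S,) empirical next-state distribution at a fixed (s, a).
    R_probs_sa:  (M,) empirical reward pmf at (s, a).
    V_up, V_lo:  (S, B) optimistic / pessimistic next-step value tables.
    b_idx_shift: (M, B) grid index of b - r_m; negative entries mean b - r_m < 0.
    Returns a length-B bonus vector (one entry per budget grid point).
    """
    S, B = V_up.shape
    g_up = np.zeros((S, B))    # E_r[ V_up(s', b - r) ] for each next state s'
    gap_sq = np.zeros(B)       # E_{s', r}[ (V_up - V_lo)^2 evaluated at (s', b - r) ]
    for m, pr in enumerate(R_probs_sa):
        idx = b_idx_shift[m]
        vu = np.where(idx < 0, 0.0, V_up[:, np.maximum(idx, 0)])
        vl = np.where(idx < 0, 0.0, V_lo[:, np.maximum(idx, 0)])
        g_up += pr * vu
        gap_sq += pr * (P_hat_sa @ (vu - vl) ** 2)
    mean_up = P_hat_sa @ g_up
    var_hat = np.maximum(P_hat_sa @ g_up**2 - mean_up**2, 0.0)   # Var_{s'~P_hat}( E_r[V_up] )
    bonus = np.sqrt(2 * var_hat * L / N_sa) + np.sqrt(gap_sq * L / N_sa) + L / N_sa
    return np.minimum(bonus, 1.0)                                # truncate by the envelope
```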
Then, applying a simulation-like Lemma G.8 to each summand bounds the sum by, t=h Ebρk,sh=sh,k,bh=bh,k 2BONBERN t,k (st, bt, at) + ξt,k(st, at) | Ek (simulation-like Lemma G.8) k=1 2BONBERN t,k (st,k, bt,k, at,k) + ξt,k(st, at) (Eq. (Azuma 5)) k=1 2BONBERN t,k (st,k, bt,k, at,k) + ξt,k(st, at). Now, we can loosely bound each Bernstein bonus by 2 q 2L Nk(s,a) + L Nk(s,a), so by elliptical potential Lemma G.12, k=1 2BONBERN t,k (st,k, bt,k, at,k) + ξt,k(st, at) 4 SAHKL + (L + 2HSL) SA log(K) SAHKL + 3S2AHL2. (15) Therefore, we ve shown that t=h Ebρk,sh=sh,k,bh=bh,k 2BONBERN t,k (st, bt, at) + ξt,k(st, at) | Ek 12 SAH3KL + 12S2AH2L2. (16) Near-Minimax-Optimal Risk-Sensitive RL with CVa R Combining everything together, we have SAH3KL + 12S2AH2L2 5SAHK1/4L + 4S3/2AHL2, which is lower order in K (the dominant term should scale as K1/2). Bounding the empirical variance term: We now shift our focus to the variance term of the bonus, (h,k) [H] [K] v u u t Vars b Pk(sh,k,ah,k) Er R(sh,k,ah,k) h b V h+1,k(s , bh,k r) i Nk(sh,k, ah,k) . (17) The key idea here is to apply a sequential Law of Total Variance Lemma G.13, but to do so, we must first con- Vars b Pk(sh,k,ah,k) Erh h b V h+1,k(s , bh,k rh) i to r Vars P h (sh,k,ah,k) Erh h V bρk h+1(s , bh,k rh) i | Ek . So we need to bound the difference term, i.e., (h,k) [H 1] [K] v u u t Vars b Pk(sh,k,ah,k) Erh h b V h+1,k(s , bh,k rh) i Nk(sh,k, ah,k) v u u t Vars P h (sh,k,ah,k) Erh h V bρk h+1(s , bh,k rh) i | Ek Nk(sh,k, ah,k) Switch the empirical variance to the (conditional) population one, which incurs a P L Nk(sh,k,ah,k) term (Appendix B). Then, use p Var(X + Y ) p Var(Y ) (Eqn 51 of (Zanette & Brunskill, 2019)) to get, (h,k) [H 1] [K] L Nk(sh,k, ah,k) + X (h,k) [H 1] [K] v u u t Vars P (sh,k,ah,k) Erh h b V h+1,k(s , bh+1,k) V bρk h+1(s , bh+1,k) i | Ek Nk(sh,k, ah,k) L SA log(K) + s (h,k) [H 1] [K] Vars P (sh,k,ah,k) Erh h V bρk h+1(s , bh,k rh) b V h+1,k(s , bh,k rh) i | Ek , where the second inequality is due to elliptical potential Lemma G.12 and Cauchy-Schwarz. Focusing on the sum inside the square root, X (h,k) [H 1] [K] Vars P (sh,k,ah,k) Erh h V bρk h+1(s , bh,k rh) b V h+1,k(s , bh,k rh) i | Ek (h,k) [H 1] [K] Es ,rh (P R)(sh,k,ah,k) h+1(s , bh,k rh) b V h+1,k(s , bh,k rh) 2 | Ek (h,k) [H 1] [K] h+1(sh+1,k, bh+1,k) b V h+1,k(sh+1,k, bh+1,k) 2 (Eq. (Azuma 3)) (h,k) [H 1] [K] h+1(sh+1,k, bh+1,k) b V h+1,k(sh+1,k, bh+1,k) (r.v. is in [0, 1]) (h,k) [H 1] [K] t=h Ebρk,sh=sh,k,bh=bh,k 2BONBERN t,k (st, bt, at) + ξt,k(st, at) (simulation lemma Lemma G.4) SAH3KL + 12S2AH2L2. (Eq. (16)) Combining the steps, we have shown that the switching cost to the population variance is at most Eq. (18) 2SAL2 + SAH3KL + 12S2AH2L2 4SAHK1/4L + 6S3/2AHL2 Near-Minimax-Optimal Risk-Sensitive RL with CVa R which is again lower order. (Key step) Bounding the dominant (population) variance term: (h,k) [H] [K] v u u t Vars P h (sh,k,ah,k) Er R(sh,k,ah,k) h V bρk h+1(s , bh,k r) i | Ek Nk(sh,k, ah,k) , (19) First apply a Cauchy-Schwarz (as usual) and then the law of total variance, (h,k) [H] [K] Vars P h (sh,k,ah,k) Er R(sh,k,ah,k) h V bρk h+1(s , bh,k r) i v u u u t SAL (h,k) [H] [K] Ebρk,bbk h Vars P h (sh,ah) Er R(sh,ah) h V bρk h+1(s , bh r) i | Ek i (Eq. 
(Azuma 4)) v u u u t SAL (h,k) [H] [K] Ebρk,bbk h Vars P h (sh,ah),r R(sh,ah) V bρk h+1(s , bh r) i | Ek (joint variance is larger) v u u u t SAL k=1 Varbρk,bbk , (Law of Total Variance Lemma G.9) Below, we give two ways to bound the sum, k=1 Varbρk,bbk The first way is to simply bound it by a probability, which is trivially at most 1. This results in Theorem 5.3. To prove Theorem 5.5, we show a second more refined approach, which uses Assumption 5.4 to bound each variance by τ, plus a lower order term. Before doing so, we first prepare to conclude the proof by recapping all the terms in the regret decomposition. First, we collected 4S2AHL2 from the 1/Nk(s, a) terms. Eq. (14) is the correction term inside the Bernstein bonus. Eq. (18) is the switching cost from empirical variance (in the Bernstein bonus) to the population variance that we want to bound now, which is Eq. (19). We also multiply back the 2 L factor we omitted from above. So, 4S2AHL2 + 3 L(Eq. (14) + Eq. (18) + Eq. (19)) 4S2AHL2 + 3 L 5SAHK1/4L + 4S3/2AHL2 + 4SAHK1/4L + 6S3/2AHL2 + Eq. (19) = 27SAHK1/4L2 + 34S2AHL3 + 3 L Eq. (19). Thus, the regret is at most Regret RL τ (K) 6eτ 1HL + 2eτ 1 27SAHK1/4L2 + 34S2AHL3 + 3 L Eq. (19) + 54eτ 1SAHK1/4L2 + 70eτ 1S2AHL3. First bound for Eq. (19) (for Theorem 5.3): Since x+ = x I [x 0], for any random variable X [0, 1], we have Var(X+) E X2I [X 0] Pr(X 0). Therefore, k=1 Varbρk,bbk k=1 Pr bρk,bbk h=1 rh bbk | Ek Near-Minimax-Optimal Risk-Sensitive RL with CVa R Certainly, each probability is bounded by 1. Hence, SAL(2HL + 2K) Combining everything together, we get Regret RL τ (K) 6 SAKL + 54eτ 1SAHK1/4L2 + 82eτ 1S2AHL3, which concludes the proof for Theorem 5.3. Second bound for Eq. (19) (for Theorem 5.5): Define f(b) = b τ 1Ebρk,bbk , b k = arg max b [0,1] f(b), bf(b) = b τ 1 b V 1,k(s1, b), bbk = arg max b [0,1] bf(b). So, b k as the τ-th quantile of R(bρk,bbk). By pessimism, we have b V 1,k(s1, b) V 1 (s1, b) Ebρk,bbk b PH t=1 rt + , since V 1 is the minimum amongst all history-dependent policies, including (bρk,bbk). Thus, we have bf(b) f(b) for all b [0, 1]. In particular, we have f(b k) f(bbk) = f(b k) bf(bbk) + bf(bbk) f(bbk) f(b k) bf(b k) + bf(bbk) f(bbk) (bbk is argmax of bf) bf(bbk) f(bbk) (f(b k) bf(b k) by pessimism) τ 1 V bρk 1 (s1,bbk) b V 1,k(s1,bbk) . Now we invoke Assumption 5.4 with Lemma G.10 which implies b k bbk 2 f(b k) f(bbk) τ 1 V bρk 1 (s1,bbk) b V 1,k(s1,bbk) = b k bbk 2 2p 1 min V bρk 1 (s1,bbk) b V 1,k(s1,bbk) . Using Var(X) 2(Var(Y ) + Var(X Y )), k=1 Varbρk,bbk k=1 Varbρk,bbk k=1 Varbρk,bbk k=1 Pr bρk,bbk t=1 rt b k | Ek bbk b k 2 (Re LU is 1-Lipschitz) 2Kτ + 4p 1 min V bρk 1 (s1,bbk) b V 1,k(s1,bbk) 2Kτ + 4p 1 min h=1 Ebρk,bbk 2BONBERN h,k (sh, bh, ah) + ξh,k(sh, ah) | Ek (simulation Lemma G.4) 2Kτ + 4p 1 min 3 SAHKL + 3S2AHL2 . (Eq. (Azuma 6) and the loose bound in Eq. (15)) Near-Minimax-Optimal Risk-Sensitive RL with CVa R SAL 2HL + 4Kτ + 72p 1 min SAHKL + 24p 1 min S2AHL2 τSAKL + 9p 1/2 min SAHK1/4L + 5p 1/2 min S3/2AHL2. Combining everything together, we get Regret RL τ (K) 12e τ 1SAKL + 54 + 9p 1/2 min eτ 1SAHK1/4L2 + 70 + 5p 1/2 min eτ 1S2AHL3. This concludes the proof for Theorem 5.5. Lemma G.8. Fix any k [K]. Then for all t [H], for all st, bt, we have b V t,k(st, bt) b V t,k(st, bt) h=t (1 1/H)t h Ebρk,st,bt[2BONh,k(sh, bh, ah) + ξh,k(sh, ah) | Ek]. Proof. In this proof, all expectations are conditioned on Ek. We proceed by induction. The base case at t = H + 1 trivially holds since b V H+1,k(s, b) b V H+1,k(s, b) = b+ b+ = 0. 
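As a brief aside, the strong-convexity step just invoked can be checked in closed form. In the sketch below (ours), X ~ Unif(0, 1) so that p_min = 1 and E[(b − X)_+] = b²/2, and the quadratic-growth inequality f(b*) − f(b) ≥ (p_min / 2τ)(b* − b)² implied by Lemma G.10 holds (here with equality) around the maximizer b* = τ.

```python
import numpy as np

tau, p_min = 0.25, 1.0
f = lambda b: b - (b**2 / 2.0) / tau     # f(b) = b - tau^{-1} E[(b - U)_+] for U ~ Unif(0, 1)
b_star = tau                             # the tau-quantile of Unif(0, 1) maximizes f
for b in np.linspace(0.0, 1.0, 11):
    assert f(b_star) - f(b) >= (p_min / (2 * tau)) * (b_star - b) ** 2 - 1e-12
print("quadratic growth around b* verified on the grid")
```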
Now suppose t H and suppose the claim holds for t+1. Then setting at = bρk(st, bt), we have b V t,k(st, bt) b V t,k(st, bt) b U t,k(st, bt, at) b U t,k(st, bt, at) = b Pt,k(st, at) Ert h b V t+1,k( , bt rt) b V t+1,k( , bt rt) i + 2BONt,k(st, bt, at) = b Pt,k(st, at) P t (st, at) Ert h b V t+1,k( , bt rt) b V t+1,k( , bt rt) i + 2BONt,k(st, bt, at) + P t (st, at) Ert h b V t+1,k( , bt rt) b V t+1,k( , bt rt) i 2BONt,k(st, bt, at) + ξt,k(st, at) + (1 1/H)P t (st, at) Ert h b V t+1,k( , bt rt) b V t+1,k( , bt rt) i (Lemma G.2) 2BONt,k(st, bt, at) + ξt,k(st, at) + h=t+1 (1 1/H)1+t (h+1)Ebρk,st,bt[2BONh,k(sh, bh, ah) + ξh,k(sh, ah)], where Ert is short for Ert Rt(st,at). Lemma G.9. For any k [K], we have h=1 Ebρk,bbk h Vars P (sh,ah),rh R(s,a) V bρk h+1(s , bh rh) | Ek i . Proof. Apply Law of Total Variance Lemma G.13 with Y = bbk PH h=1 rh + , Xh = (sh, ah, rh 1) for h [H] (when h = 1, r0 is omitted), and H = Ek being the trajectories from the past episodes 1, 2, ..., k 1. Here, sh, ah, rh are collected from rolling in with bρk starting from bbk, and note that bbk is a constant conditioned on Ek. Lemma G.13 gives = E[Var(Y | X1:H, Ek) | Ek] + h=1 E[Var(E[Y | X1:h, Ek] | X1:h 1, Ek) | Ek] h=1 E[Var(E[Y | X1:h, Ek] | X1:h 1, Ek) | Ek]. Near-Minimax-Optimal Risk-Sensitive RL with CVa R The first term is zero because once we condition on X1:H, Ek, the term bbk P t rt + is a constant, and variance of constants is zero. Now consider each summand. The outer expectation is taken over s1:h 1, a1:h 1, r1:h 2 from rolling in bρk. The variance is taken over sh P (sh 1, ah 1), rh 1 Rh 1(sh 1, ah 1), and deterministically picking ah = bρk h(sh, bh) where bh = bbk Ph 1 t=1 rt. The inner expectation is over the remainder of the trajectory, which is sh+1:H, ah+1:H, rh:H. Therefore, E[Var(E[Y | X1:h, Ek] | X1:h 1, Ek) | Ek] | X1:h 1, Ek h Var U bρk h (sh, bh, ah) | X1:h 1, Ek | Ek i (bh 1 = bbk r1 ... rh 1) h Varsh P h 1(sh 1,ah 1),rh 1 Rh 1(sh 1,ah 1) U bρk h (sh, bh, ah) | Ek i h Varsh P h 1(sh 1,ah 1),rh 1 Rh 1(sh 1,ah 1) V bρk h (sh, bh) | Ek i . (ah = bρk(sh, bh)) Note that in the special case of h = 1, we have s1 is not random and r0 is omitted. So, the variance is taken over a constant, which is zero. Lemma G.10. Let π be a history-dependent policy such that its return distribution R(π) is continuously distributed and has a density p lower bounded by pmin. Then, the function f(b) = τ 1Eπ is τ 1pmin strongly convex. Proof. Observe that f (b) = τ 1 Pr π Since we ve assumed that R(π) is continuously distributed with density p, so f (b) = τ 1p(b). Since f (b) τ 1pmin for all b, we have f is strongly convex with that parameter. G.5. Auxiliary Lemmas Lemma G.11 (Azuma). Let {Xi}i [N] be a sequence of random variables supported on [0, 1], adapted to filtration {Fi}i [N]. For any δ (0, 1), we have w.p. at least 1 δ, t=1 E[Xt | Ft 1] N log(2/δ), (Standard Azuma) t=1 E[Xt | Ft 1] 2 t=1 Xt + 2 log(1/δ). (Multiplicative Azuma) Proof. For standard Azuma, see Zhang (2023, Theorem 13.4). For multiplicative Azuma, apply (Zhang, 2023, Theorem 13.5) with λ = 1. The claim follows, since 1 1 exp( λ) 2. Near-Minimax-Optimal Risk-Sensitive RL with CVa R Below we recall the standard elliptical potential lemma (Lattimore & Szepesvári, 2020; Agarwal et al., 2021). Regarding terminology, we remark that this lemma is also known as the pigeonhole argument in the tabular RL literature (Azar et al., 2017; Zanette & Brunskill, 2019). 
The term elliptical potential is more commonly used in the linear MDP setting (Jin et al., 2020), of which tabular RL is a special case. Lemma G.12 (Elliptical Potential). For any sequence of states and actions {sh,k, ah,k}h [H],k [K], we have 1 Nk(sh,k, ah,k) SA log(K), Nk(sh,k, ah,k) p HSAK log(K). Proof. For the first claim, observe that in the sum, 1 3, . . . can appear at most SA times. And since we run for K episodes, the maximum denominator is K. Therefore, we have 1 Nk(sh,k, ah,k) SA 1 k SA log(K). For the second claim, Nk(sh,k, ah,k) v u u u t KH 1 Nk(sh,k, ah,k) SAHK log(K). Lemma G.13 (Sequential Law of Total Conditional Variance). For any random variables Y, X1, X2, ..., XN, H, we have Var(Y | H) = E[Var(Y | X1:N, H) | H] + t=1 E[Var(E[Y | X1:t, H] | X1:t 1, H) | H]. Notice, for each summand, the inner expectation is taken over Y , the variance is taken over Xt, and outer expectation is taken over X1:t 1. Proof. Recall the Law of Total Conditional Variance (LTCV): for any random variables Y, Z1, Z2, Var(Y | Z1) = E[Var(Y | Z1, Z2) | Z1] + Var(E[Y | Z1, Z2] | Z1). We now inductively prove the desired claim by recursively applying the (LTCV). The base case is N = 1, which follows immediately from LTCV applied to Z1 = H, Z2 = X1. For the inductive case, fix any N and suppose the claim is true for N. Now consider N + 1, where we have Y, X1, X2, ..., XN+1, H. By the IH, we have Var(Y | H) = E[Var(Y | X1:N, H) | H] + t=1 E[Var(E[Y | X1:t, H] | X1:t 1, H) | H]. Now applying LTCV on the first term with Z1 = (X1:N, H), Z2 = XN+1, we have Var(Y | X1:N, H) = E[Var(Y | X1:N+1, H) | X1:N, H] + Var(E[Y | X1:N+1, H] | X1:N, H), and therefore, E[Var(Y | X1:N, H) | H] = E[Var(Y | X1:N+1, H) | H] + E[Var(E[Y | X1:N+1, H] | X1:N, H) | H], which concludes the proof. Near-Minimax-Optimal Risk-Sensitive RL with CVa R H. Proofs for CVa R-UCBVI with discretized rewards Recall the discretized MDP disc(M), as introduced in Section 6. It is a copy of the true MDP M except its rewards are rounded to an ε-net. I.e., let ϕ(r) = η r/η be the rounding up operator of r onto the net, so that 0 ϕ(r) r η. Concretely, the reward distribution is R(s, a; disc(M)) = R(s, a; M) ϕ 1. Our proofs are inspired by Bastani et al. (2022, Lemma B.1,Lemma B.5). From disc(M) to M: Fix any ρ ΠAug and b [0, 1] (which we ll run in disc(M)). Then define an adapted policy, which is a history-dependent policy in M, as follows, adapted(ρ, b)h(sh, r1:h 1) = ρh(sh, b1 ϕ(r1) ... ϕ(rh 1)). Intuitively, as adapted(ρ, b) runs M, it uses the history to emulate the evolution of b in disc(M). Let Zρ,b,disc(M) be the returns from running ρ, b in disc(M). Let Zadapted(ρ,b),M be the returns from running adapted(ρ, b) in M. We show that Zρ,b,disc(M) Hη Zadapted(ρ,b),M Zρ,b,disc(M) w.p. 1 via a coupling argument. Lemma H.1. We almost surely have Zρ,b,disc(M) Hη Zadapted(ρ,b),M Zρ,b,disc(M). Therefore, if Fρ,b,disc(M) is the CDF of Zρ,b,disc(M) and Fadapted(ρ,b),M is the CDF of Zadapted(ρ,b),M, we have x R : Fρ,b,disc(M)(x) Fadapted(ρ,b),M(x) Fρ,b,disc(M)(x + Hη). Proof. Let s1, a1, r1, s2, a2, r2, . . . be the trajectory of running adapted(ρ, b) in M. Let bs1, ba1, br1, bs2, ba2, br2, . . . be the trajectory of running ρ, b in disc(M). We couple these two trajectories by making adapted(ρ, b) in M follow ρ, b in disc(M). Set bs1 = s1. By definition of adapted(ρ, b), ba1 = a1. By definition of disc(M), br1 = ϕ(r1). Continuing in this fashion, we have bst = st, bat = at, brt = ϕ(rt) for all t [H]. 
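Lemma G.13 can be verified by brute force on a tiny discrete example. The sketch below (ours) drops the outer conditioning on H for simplicity and checks, for N = 2, that Var(Y) equals E[Var(Y | X_{1:2})] plus the two summands of the sequential decomposition.

```python
import itertools
import numpy as np

# a small joint distribution over (x1, x2, y)
vals = list(itertools.product([0, 1], [0, 1, 2], [0.0, 0.5, 1.0]))
p = np.random.default_rng(4).dirichlet(np.ones(len(vals)))

def E(f, cond=lambda v: True):
    mask = np.array([float(cond(v)) for v in vals])
    return float(p @ (mask * np.array([f(v) for v in vals]))) / float(p @ mask)

def Var(f, cond=lambda v: True):
    return E(lambda v: f(v) ** 2, cond) - E(f, cond) ** 2

y = lambda v: v[2]
x1s, x2s = [0, 1], [0, 1, 2]
lhs = Var(y)
# E[ Var(Y | X1, X2) ]
t0 = sum(E(lambda v, a=a, b=b: float(v[:2] == (a, b))) * Var(y, lambda v, a=a, b=b: v[:2] == (a, b))
         for a in x1s for b in x2s)
# t = 1 summand: Var over X1 of E[Y | X1]  (the history X_{1:0} is empty)
m1 = {a: E(y, lambda v, a=a: v[0] == a) for a in x1s}
t1 = Var(lambda v: m1[v[0]])
# t = 2 summand: E over X1 of Var over X2 of E[Y | X1, X2]
m2 = {(a, b): E(y, lambda v, a=a, b=b: v[:2] == (a, b)) for a in x1s for b in x2s}
t2 = sum(E(lambda v, a=a: float(v[0] == a)) * Var(lambda v: m2[v[:2]], lambda v, a=a: v[0] == a)
         for a in x1s)
print(lhs, t0 + t1 + t2)   # the two sides agree up to floating-point error
```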
This is a valid coupling since the actions are sampled from the exact same distribution, i.e., ah ρh(b ϕ(r1) ... ϕ(rh 1)), and by the transitions of bbh in disc(M), we have bbh = b br1 ... brh 1 which is exactly what was inputted into ρh by adapted(ρ, b). Since r ϕ(r) for all r, we have Zadapted(ρ,b),M = t=1 ϕ(rt) = t=1 brt = Zρ,b,disc(M). Since ϕ(r) η r for all r, we have Zρ,b,disc(M) = t=1 ϕ(rt) Hη + t=1 rt = Zadapted(ρ,b),M Hη. To conclude the proof, recall a basic fact about couplings and stochastic comparisons. For two random variables X, Y in the same probability space and a constant c, if P(X Y + c) = 1, we have FY (t) FX(t + c) for all x. This is because FY (x) FX(x + c) = P(Y t X > t + c) P(X Y > c) = 0. From M to disc(M): Fix any ρ ΠAug and b [0, 1] (which we ll run in M). Then define a discretized policy, which is a history-dependent policy in the discretized MDP disc(M) with memory (as in Appendix F), as follows, disc(ρ, b)h(sh, m1:h 1) = ρh(sh, b m1 mh 1), mh R(sh, ah) | ϕ(R(sh, ah)) = rh. With this definition, although we receive reward brh (on the discrete grid) when running in disc(M), the memory element mh exactly imitates a random reward that would have been received in the true MDP M. Then, the discretized policy disc(ρ, b) will instead follow these exact rewards mh rather than what has been received. Let Zρ,b,M be the returns from running ρ, b in M. Let Zdisc(ρ,b),disc(M) be the returns from running disc(ρ, b) in disc(M) (which memory as described above). We show that Zρ,b,M Zdisc(ρ,b),disc(M) w.p. 1 via a coupling argument. Lemma H.2. We almost surely have Zρ,b,M Zdisc(ρ,b),disc(M). Therefore, if Fρ,b,M is the CDF of Zρ,b,M and Fdisc(ρ,b),disc(M) is the CDF of Zdisc(ρ,b),disc(M), we have x R : Fdisc(ρ,b),disc(M)(x) Fρ,b,M. Near-Minimax-Optimal Risk-Sensitive RL with CVa R Proof. Let s1, a1, r1, s2, a2, r2, ... be the trajectory of running ρ, b in M. Let bs1, ba1, br1, bm1, bs2, ba2, br2, bm2, ... be the trajectory of running disc(ρ, b) in disc(M) with memory. We couple these two trajectories by making ρ, b in M follow disc(ρ, b) in disc(M). Set s1 = bs1. By definition of disc(ρ, b), a1 = ba1. Then, set r1 = bm1. Note that r1 is sampled by first sampling a discrete br1, then sampling bm1 from the conditional reward distribution of the interval that rounds to br1. By law of total probability, this is indeed equivalent to sampling directly from the unconditional reward distribution. Continuing in this fashion, we have bst = st, bat = at, brt = ϕ(rt), bmt = mt for all t [H]. Importantly, the policies actions match because b bm1 ... bmh 1 = b r1 ... rh 1. Therefore, we always have t=1 ϕ(rt) = t=1 brt = Zdisc(ρ,b),disc(M). As with the previous proof, stochastic dominance implies the claim on CDFs. Now we show two useful consequences of the above coupling results. Theorem H.3. Fix any ρ ΠAug and b1 [0, 1]. Then, b : 0 Eadapted(ρ,b1),M Eρ,b1,disc(M) τ 1Hη CVa Rτ(adapted(ρ, b1), M) CVa Rτ(ρ, b, disc(M)) 0. Proof. Let f(b) = Eadapted(ρ,b1),M b PH h=1 rh + and let F = Fadapted(ρ,b1),M. Similarly, let disc(f)(b) = Eρ,b1,disc(M) b PH h=1 rh + and disc(F) = Fρ,b1,disc(M). Both f(0) = disc(f)(0) = 0. Also, their derivatives are F and disc(F) respectively. By Lemma H.1, disc(F)(t) F(t) disc(F)(t + Hη). First, we show disc(f)(b) f(b) 0. By the fundamental theorem of Calculus, disc(f)(b) f(b) = Z b 0 disc(F)(t) F(t)dt 0. Next, we show f(b) disc(f)(b) Hη. 
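The coupling behind Lemma H.1 is easy to simulate. The following sketch (ours) rounds each realized reward up to the η-grid via ϕ(r) = η⌈r/η⌉ and checks the path-wise sandwich Z_disc − Hη ≤ Z_adapted ≤ Z_disc on every coupled episode.

```python
import numpy as np

rng = np.random.default_rng(30)
H, eta, episodes = 10, 0.05, 10_000
phi = lambda r: eta * np.ceil(r / eta)          # rounding-up operator onto the eta-grid

# per-step rewards scaled so that each episode's return lies in [0, 1]
rewards = rng.uniform(0.0, 1.0 / H, size=(episodes, H))
Z_adapted = rewards.sum(axis=1)                 # returns in the true MDP M
Z_disc = phi(rewards).sum(axis=1)               # returns in disc(M) under the coupling
assert np.all(Z_adapted <= Z_disc + 1e-12)
assert np.all(Z_disc - H * eta <= Z_adapted + 1e-12)
print("coupling sandwich holds on", episodes, "coupled episodes")
```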
By the fundamental theorem of Calculus (FTC), f(b) disc(f)(b) Z b 0 F(t)dt Z b 0 disc(F)(t)dt Hη disc(F)(t)dt Z b 0 disc(F)(t)dt (Lemma H.1) b disc(F)(t)dt = disc(f)(b + Hη) disc(f)(b) (FTC) Hη, (Re LU is 1-Lipschitz) Altogether, we ve shown 0 f(b) disc(f)(b) Hη for all b. This immediately implies the claims about CVa R: CVa Rτ(adapted(ρ, b1), M) CVa Rτ(ρ, b1, disc(M)) = max b b τ 1f(b) max b b τ 1 disc(f)(b) τ 1 max b (disc(f)(b) f(b)) Near-Minimax-Optimal Risk-Sensitive RL with CVa R and similarly, CVa Rτ(ρ, b1, disc(M)) CVa Rτ(adapted(ρ, b1), M) τ 1 max b (f(b) disc(f)(b)) Theorem H.4. We have, b [0, 1] : V 1 (s1, b; disc(M)) V 1 (s1, b; M), which implies CVa R τ(disc(M)) CVa R τ(M). Proof. Fix any ρ ΠAug, b [0, 1]. By Lemma H.2, we have Fdisc(ρ,b),disc(M)(x) Fρ,b,M. Applying the same FTC-style arguments in Theorem H.3, we have b : Edisc(ρ,b),disc(M) Setting b = b and using the definition of V functions, we have V disc(ρ,b)(s1, b; disc(M)) V ρ 1 (s1, b; M). Then, since V 1 is the minimum history-dependent policy in this memory-MDP (Theorem F.2) implies that V 1 (s1, b; disc(M)) V disc(ρ,b) 1 (s1, b; disc(M)) V ρ 1 (s1, b; M). Since ρ ΠAug was arbitrary and the minimum is attained by ρ ΠAug (Theorem F.2) this implies that V 1 (s1, b; disc(M)) V 1 (s1, b; M), as needed. For the CVa R claim, CVa R τ(M) CVa R τ(disc(M)) = max b b τ 1V 1 (s1, b; M) max b b τ 1V 1 (s1, b; disc(M)) τ 1 max b (V 1 (s1, b; disc(M)) V 1 (s1, b; M)) 0. In the proof above, we highlight that disc(M) is the MDP with memory. In the proof of Bastani et al. (2022, Lemma B.6), this detail was glossed over as their history-dependent policy in Bastani et al. (2022, Lemma B.5) does not exactly fit into the vanilla history-dependent policy framework (as in Section 2); their policies are coupled through time with the α parameter, which is disallowed a priori by the history-dependent policy framework. Our formalism with the memory-MDP resolves this ambiguity. H.1. Amendment of Bernstein proof in the discretized MDP In this section, we amend the proof of the Second bound for Eq. (19) in the Bernstein regret bound. Theorem H.5. Suppose we re running CVa R-UCBVI in the discretized MDP and assume Assumption 6.1 holds. For any δ (0, 1), w.p. at least δ, we have the regret of Theorem 5.5 plus an additional term, p 1 min SAHKηL. Thus, setting η = 1/ K makes this an lower order term. Near-Minimax-Optimal Risk-Sensitive RL with CVa R Proof of Theorem H.5. Recall that v u u u t SAL k=1 Varbρk,bbk,disc(M) where note the randomness is taken over trajectories from disc(M), as that s the MDP we re working in. Define f(b) = b τ 1Eadapted(bρk,bbk),M , b k = arg max b [0,1] f(b), bf(b) = b τ 1 b V 1,k(s1, b), bbk = arg max b [0,1] bf(b). A priori, the pessimism argument a priori only applies to policies in the discretized MDP. But thanks to Theorem H.4, it also holds here, b V 1,k(s1, b) V 1 (s1, b; disc(M)) V 1 (s1, b; M) Eadapted(bρk,bbk),M Thus, we also have bf(b) f(b) for all b, and the same argument as before gives f(b k) f(bbk) τ 1 Eadapted(bρk,bbk),M b V 1,k(s1,bbk) τ 1 V bρk 1 (s1,bbk; disc(M)) b V 1,k(s1,bbk) + τ 1Hη. (Theorem H.3) By Assumption 6.1 (which applies to adapted(bρk,bbk)), Lemma G.10 applies and we have b k bbk 2 2p 1 min V bρk 1 (s1,bbk; disc(M)) b V 1,k(s1,bbk) + Hη . 
Using Var(X) 2(Var(Y ) + Var(X Y )), k=1 Varbρk,bbk,disc(M) k=1 Varbρk,bbk,disc(M) k=1 Varbρk,bbk,disc(M) k=1 Pr bρk,bbk,disc(M) bbk b k 2 (Re LU is 1-Lipschitz) k=1 Pr adapted(bρk,bbk),M V bρk 1 (s1,bbk; disc(M)) b V 1,k(s1,bbk) + Hη (Lemma H.1) 2Kτ + 4p 1 min HKη + 4p 1 min V bρk 1 (s1,bbk; disc(M)) b V 1,k(s1,bbk) . Previously in Theorem 5.5, we also had the 2Kτ + 4p 1 min PK k=1 V bρk 1 (s1,bbk; disc(M)) b V 1,k(s1,bbk) term. This can be bounded as before. We ve only incurred an extra 4p 1 min HKη term. So, Eq. (19) incurs an extra q SAL(2 4p 1 min HKη). Finally, Eq. (19) gets multiplied by 6eτ 1 L, which gives a final extra regret of 18eτ 1 q p 1 min SAHKηL. Near-Minimax-Optimal Risk-Sensitive RL with CVa R H.2. Translating regret from discretized MDP to true MDP In this section, we prove that running our algorithms in the imagined discretized MDP also has low regret in the true MDP. Let us first recall the setup. When running CVa R-UCBVI, we will provide the discretization parameter η. The algorithm will discretize the rewards received from the environment when updating b in this way, it is running in its own hallucinated discretized MDP, while the regret we care about is in the true MDP. Since the algorithm is essentially running in the hallucinated discretized MDP, our regret bound applies in the discretized MDP, i.e., roll-outs in expectations are with respect to disc(M), k=1 CVa R τ(disc(M)) CVa Rτ(bρk,bbk; disc(M)) C. When rolling out bρk from bbk in the hallucinated discretized MDP, we are essentially running adapted(bρk) as described in Theorem H.3. So the true regret in the real MDP is, k=1 CVa R τ(M) CVa Rτ(adapted(bρk,bbk); M) k=1 CVa R τ(disc(M)) CVa Rτ(bρk,bbk; disc(M)) + τ 1Hη (Theorems H.3 and H.4) C + Kτ 1Hη. In other words, when discretizing our algorithm, we pay an extra regret of at most Kτ 1Hη, where η is the discretization parameter. Setting η = 1/K renders this term lower order. H.3. Computational Complexity In this section, we compute the running time complexity of CVa R-UCBVI under discretization of η. There are two places where discretization comes in, 1. At each h, we only compute b U h,k(s, b, a) for all s, a and b in the grid. So assuming each step takes Tstep, the total run time of DP is O(SAHη 1Tstep). 2. When computingbbk, we only need to search over gridη([0, 1]), since we know that the returns distribution is supported on the gridη([0, 1]). Thus, the optimal solution, which is the τ-th quantile, lives on the grid. This computation costs O(η 1), which is lower order. So the total runtime is O(K SAHη 1Tstep). For running with the Hoeffding bonus, each step is dominated by computing the expectation b Pk(s, a) Erh R(s,a) h b V h+1,k( , b rh) i , as the bonus term is a constant. In the discretized MDP, this expectation can be computed using only grid elements, so Tstep = O(Sη 1). When running with the Bernstein bonus, we also need to consider the complexity of computing the bonus term. In the bonus term (Eq. (6)), the expectation term Es b Pk(s,a),rh R(s,a) b V h+1,k(s , b rh) b V h+1,k(s , b rh) 2 can be computed in O(Sη 1). Notably, the variance term Vars b Pk(s,a) Erh R(s,a) h b V h+1,k(s , b ) i can also be computed in O(Sη 1) by first computing the empirical mean (which takes O(Sη 1)). So for the Berstein bonus, we also have Tstep = O(Sη 1). So the total running time of CVa R-UCBVI with discretized rewards is O(S2AHKη 2). 
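A back-of-the-envelope cost model (ours; constants suppressed) reproduces this accounting: each (s, b, a) backup costs T_step = O(Sη⁻¹), one episode's DP touches O(SAHη⁻¹) cells, and K episodes cost O(S²AHKη⁻²).

```python
def dp_cost(S, A, H, K, eta):
    t_step = S * int(1 / eta)                    # expectation over next states and grid rewards
    per_episode = S * A * H * int(1 / eta) * t_step
    return per_episode * K

# e.g. the scale with eta = 1 / sqrt(K), the choice suggested in Theorem H.5
S, A, H, K = 20, 5, 10, 10_000
eta = 1 / K**0.5
print(f"approx. elementary operations: {dp_cost(S, A, H, K, eta):.3e}")
```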
As remarked by (Auer et al., 2008; Azar et al., 2017), we can also reduce the computational cost by selectively recomputing the DP after sufficiently many observations have passed. Near-Minimax-Optimal Risk-Sensitive RL with CVa R I. Minimax Lower Bounds for Quantile Estimation Theorem I.1. Let m(p) = I[p > 1/2] be the median of Ber(p). For any n, inf f:{0,1}n [0,1] sup p [0,1] EY1,...,Yn iid Ber(p), ˆmn Ber(f(Y1,...,Yn))[( ˆmn m(p))2] 1 That is, the minimax mean-squared error of any (potentially randomized) estimator for the median of a Bernoulli based on n observations thereon is bounded away from 0 for all n. Proof. Let g(f, p) denote the value of the game (the objective of the above inf-sup). Let Pp denote the measure of (Y1, . . . , Yn) Ber(p)n. Fix any ϵ (0, e 1 inf f:{0,1}n [0,1] sup p [0,1] g(f, p) inf f:{0,1}n [0,1] 2g(f, (1 + ϵ)/2) + 1 2g(f, (1 ϵ)/2) inf f:{0,1}n {0,1} 2g(f, (1 + ϵ)/2) + 1 2g(f, (1 ϵ)/2) = inf f:{0,1}n {0,1} 2P(1+ϵ)/2(f(Y1, . . . , Yn) = 1) + 1 2P(1 ϵ)/2(f(Y1, . . . , Yn) = 0) = inf f:{0,1}n {0,1} 1 2 1 2P(1+ϵ)/2(f(Y1, . . . , Yn) = 1) 1 2P(1 ϵ)/2(f(Y1, . . . , Yn) = 1) 1 2 KL(P(1+ϵ)/2, P(1 ϵ)/2) 1 2nϵ log((1 + ϵ)/(1 ϵ)) e + 1 e 1ϵ. The first line is because worst-case risk upper bounds any Bayesian risk. The second because Bayesian risk is optimized by non-randomized estimators. The third by writing the expectation of a 0-1 variable as a probability. The fourth by total probability. The fifth by Pinsker s inequality. The sixth by evaluating the divergence. And the last by convexity of log((1 + ϵ)/(1 ϵ)). Since ϵ (0, e 1 e+1] was arbitrary (for fixed n), the conclusion is reached.
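To illustrate the phenomenon behind Theorem I.1 (for one concrete estimator rather than the infimum over all of them), the Monte Carlo sketch below (ours) uses the two-point prior p ∈ {(1 ± ε)/2} with ε = 1/√n and shows that the majority-vote estimator's Bayes mean-squared error, a lower bound on its worst-case risk, does not vanish as n grows.

```python
import numpy as np

rng = np.random.default_rng(33)
trials = 100_000
for n in [10, 100, 1000, 10000]:
    eps = 1.0 / np.sqrt(n)
    risks = []
    for p in [(1 + eps) / 2, (1 - eps) / 2]:
        counts = rng.binomial(n, p, size=trials)
        m_hat = (counts > n / 2).astype(float)   # majority vote for the median
        m_true = float(p > 0.5)                  # m(p) = I[p > 1/2]
        risks.append(np.mean((m_hat - m_true) ** 2))
    print(n, 0.5 * sum(risks))                   # stays near a constant, does not go to 0
```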