# Risk-Sensitive Reward-Free Reinforcement Learning with CVaR

Xinyi Ni¹, Guanlin Liu¹, Lifeng Lai¹

Exploration is a crucial phase in reinforcement learning (RL). The reward-free RL paradigm, as proposed by (Jin et al., 2020), offers an efficient method to design exploration algorithms for risk-neutral RL across various reward functions with a single exploration phase. However, as RL applications in safety-critical settings grow, there is an increasing need for risk-sensitive RL, which takes potential risks into consideration for decision-making. Yet, efficient exploration strategies for risk-sensitive RL remain underdeveloped. This study presents a novel risk-sensitive reward-free framework based on Conditional Value-at-Risk (CVaR), designed to effectively address CVaR RL for any given reward function through a single exploration phase. We introduce an efficient algorithm named CVaR-RF-UCRL, which is shown to be $(\epsilon, \delta)$-PAC, with a sample complexity upper bounded by $O\!\left(\frac{S^2AH^4}{\epsilon^2\tau^2}\right)$, where $\tau$ is the risk tolerance parameter. We also prove an $\Omega\!\left(\frac{S^2AH^2}{\tau\epsilon^2}\right)$ lower bound for any CVaR-RF exploration algorithm, demonstrating the near-optimality of our algorithm. Additionally, we propose the planning algorithms CVaR-VI and its more practical variant, CVaR-VI-DISC. The effectiveness and practicality of our CVaR reward-free approach are further validated through numerical experiments.

¹Department of Electrical and Computer Engineering, University of California, Davis, Davis, USA. Correspondence to: Xinyi Ni.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## 1. Introduction

In reinforcement learning (RL), agents learn optimal actions by iteratively interacting with the environment and leveraging feedback from reward signals. A critical part of this learning process is exploration, where agents navigate through states to effectively gather environment information. Despite exploration being widely recognized as a vital aspect of RL, simple randomized exploration strategies often fail due to high sample complexity (Li, 2012). While research by (Dann & Brunskill, 2015; Dann et al., 2017; Azar et al., 2017; Jin et al., 2018) demonstrates that stochastic exploration can be sample-efficient, applying these algorithms across different reward functions can lead to inefficiencies. To address this, (Jin et al., 2020) introduces the concept of reward-free RL, in which the goal is to approximate the near-optimal policy under any reward function after a single phase of exploration, enhancing the efficiency and adaptability of the learning process. (Jin et al., 2020) also derives upper and lower bounds on the sample complexity of the reward-free approach. Building on these insights, subsequent studies such as (Wang et al., 2020; Zhang et al., 2021; Kaufmann et al., 2021; Ménard et al., 2021; Chen et al., 2022; Miryoosefi & Jin, 2022) have sought tighter upper bounds and more practical algorithms. The focus of this existing reward-free RL research has been predominantly on the risk-neutral approach, in which the goal is to maximize the average total (discounted) reward. In practical scenarios, however, especially safety-critical ones, decision-makers often have risk preferences that aim to mitigate low-probability but high-impact outcomes.
Thus, the need to consider risks beyond solely optimizing for the average becomes important, leading to the development of risk-sensitive RL. In risk-sensitive RL, the objective function is shaped by applying risk measures to reward functions (Delage & Mannor, 2010; Bäuerle & Ott, 2011; La & Ghavamzadeh, 2013; Shen et al., 2014; Fei et al., 2020; Prashanth et al., 2022; Ying et al., 2022), and is thus also significantly dependent on the exploration phase. Various risk measures have been extensively analyzed and adopted in RL frameworks (Chow & Ghavamzadeh, 2014; Chow et al., 2015; Tamar et al., 2015b;a; Chow et al., 2017; Keramati et al., 2020; Ni & Lai, 2022a;b; Hau et al., 2023). One widely used class of risk measures is coherent risk measures, which satisfy a set of natural and desirable properties: 1) monotonicity, 2) translation invariance, 3) subadditivity, and 4) positive homogeneity, ensuring rationality and reliability in capturing risk preferences (Artzner et al., 1999; Tamar et al., 2015a).

Despite these advancements, efficient exploration, particularly in contexts without a predefined reward function, remains an under-explored area. Existing studies on sample complexity and algorithm performance in risk-sensitive RL typically target specific reward functions, potentially limiting their effectiveness in varied reward settings (Fei et al., 2021; Bastani et al., 2022; Du et al., 2022; Wang et al., 2023; Ding et al., 2023). This situation underscores the urgency of developing efficient exploration methods in risk-sensitive RL, crucial for its practical deployment and success in diverse stochastic environments. In this paper, we study risk-sensitive RL in the reward-free setting, and aim to answer the following question: Is it possible to design a provably efficient risk-sensitive reward-free RL algorithm? We answer this question affirmatively by designing an algorithm with near-optimal sample complexity.

Contribution: This paper introduces a CVaR-based risk-sensitive reward-free RL framework (CVaR-RF RL). For the exploration phase, we propose CVaR-RF-UCRL to efficiently explore environments with unknown reward functions. The number of trajectories collected in the exploration phase is upper bounded by $O\!\left(\frac{S^2AH^4}{\epsilon^2\tau^2}\right)$, where $S$ is the number of states, $A$ is the action count, $H$ is the horizon length, $\epsilon$ is the targeted accuracy, and $\tau$ is the risk tolerance level for CVaR. We also prove a lower bound of $\Omega\!\left(\frac{S^2AH^2}{\tau\epsilon^2}\right)$ for any CVaR-RF exploration algorithm. Subsequently, we introduce the CVaR-RF-planning algorithm equipped with CVaR-VI, which is able to solve CVaR RL for a given reward function without interacting with the environment. We also propose CVaR-VI-DISC, a discretized version of CVaR-VI for direct implementation in real-world settings while maintaining an optimization error within $\epsilon/3$. These developments ensure the efficiency and applicability of our CVaR-RF framework in advancing the field of risk-sensitive RL.

Challenges: 1) Compared to risk-neutral reward-free RL (Jin et al., 2020), CVaR-RF RL focuses only on the tail of the return distribution determined by the risk tolerance parameter $\tau$. However, in a reward-free setup, we cannot access reward information, including the reward distribution. Therefore, we must adjust our exploration strategy based on $\tau$. To address this, we define an adaptive stopping rule for different $\tau$ values during the exploration phase.
Moreover, while the optimal policy in risk-neutral RL is Markovian, the optimal policy for risk-sensitive RL is history-dependent, which makes it more complex. To simplify this, we propose a planning algorithm with CVaR-VI that can construct a Markovian policy (on an augmented state space) as the optimal policy for CVaR RL, reducing the added complexity. 2) Compared with CVaR RL (Chow & Ghavamzadeh, 2014; Chow et al., 2015; Tamar et al., 2015b;a; Chow et al., 2017; Keramati et al., 2020; Ni & Lai, 2022a;b; Hau et al., 2023), CVaR-RF RL faces challenges due to the absence of immediate feedback on the risks associated with actions during the exploration phase. In CVaR RL, with rewards given, the agent does not need to explore every state or action, as it can immediately adjust its strategy based on the reward. However, in CVaR-RF RL, where rewards are unknown during exploration, it is necessary to thoroughly explore the environment by visiting all possible states and actions. This extensive exploration gathers enough information for the planning phase, allowing the agent to adjust its strategy effectively. To facilitate this, we introduce CVaR-RF-UCRL, a method that efficiently explores all states.

Outline: In Section 2, we introduce the preliminaries essential for understanding CVaR-RF RL. Section 3 presents the formal problem statement of CVaR-RF RL. In Section 4, we present the CVaR-RF-UCRL exploration and CVaR-RF-planning algorithms, and present the upper bound on the sample complexity. Section 5 provides our analysis of the lower bound on the sample complexity specifically for CVaR-RF exploration. Section 6 provides numerical examples. Section 7 offers concluding remarks.

## 2. Preliminaries

We explore a tabular Markov decision process (MDP) represented as $M = (\mathcal{S}, \mathcal{A}, H, P, r)$. Here, $\mathcal{S}$ and $\mathcal{A}$ are state and action spaces with sizes $S$ and $A$ respectively, $H$ is the number of steps per episode, $P_h(\cdot \mid s, a)$ is the state transition probability at step $h$ for action $a$ in state $s$, and $r_h(s, a)$ is a deterministic reward function mapping state-action pairs to rewards between 0 and 1. Both the transition probabilities and the reward function can vary with each step $h$. We define $\Pi^{\mathrm{H}}$ as the set of history-dependent policies, where each policy $\pi$ consists of $H$ functions $\{\pi_h : \mathcal{S} \times \mathcal{H}_h \to \Delta_{\mathcal{A}}\}$. Here, $\mathcal{H}_h$ is the history up to step $h$, and $\Delta_{\mathcal{A}}$ is the probability simplex over $\mathcal{A}$. The probability of reaching state $s$ under policy $\pi$ is $P^\pi(s)$. In each episode of the MDP, the process starts by choosing an initial state $s_1$ from an unknown initial distribution $P_1(\cdot)$. At every step $h$, the agent observes the current state $s_h$ from the state space $\mathcal{S}$, selects an action $a_h$ based on the distribution $\pi_h(s_h; \mathcal{H}_h)$, earns a reward $r_h(s_h, a_h)$, and then moves to the next state $s_{h+1}$ according to the transition probability $P_h(\cdot \mid s_h, a_h)$. The episode ends when the agent reaches the state $s_{H+1}$.

We now introduce CVaR, a widely used coherent risk measure in RL (Rockafellar et al., 2000). For a random variable $X$, CVaR at a given risk tolerance $\tau \in (0, 1]$ is defined as:
$$\mathrm{CVaR}_\tau(X) := \sup_{b \in \mathbb{R}} \left\{ b - \tau^{-1}\mathbb{E}\left[(b - X)^+\right] \right\},$$
where the notation $x^+ = \max(x, 0)$. CVaR effectively quantifies the average outcome in the worst $\tau$-fraction of scenarios. It is noteworthy that for a continuous variable $X$, this definition coincides exactly with the expectation of outcomes less than or equal to the $\tau$-th quantile, as elucidated by (Acerbi & Tasche, 2002). This $\tau$-th quantile is also identified as Value-at-Risk (VaR), another well-recognized risk measure.
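As a concrete illustration of this definition, the following is a minimal sketch of ours (not from the paper; the function names and the NumPy dependency are our choices). It estimates $\mathrm{CVaR}_\tau(X)$ from samples both via the variational formula above and via direct averaging of the worst $\tau$-fraction of outcomes; for a continuous distribution the two estimates should nearly coincide.

```python
import numpy as np

def cvar_variational(x, tau, grid_size=501):
    """Estimate CVaR_tau(X) = sup_b { b - E[(b - X)^+] / tau } on a grid of b."""
    b_grid = np.linspace(x.min(), x.max(), grid_size)
    shortfall = np.maximum(b_grid[:, None] - x[None, :], 0.0).mean(axis=1)
    return np.max(b_grid - shortfall / tau)

def cvar_tail_average(x, tau):
    """Estimate CVaR_tau(X) as the mean of the worst tau-fraction of outcomes."""
    var = np.quantile(x, tau)            # tau-th quantile, i.e., VaR_tau(X)
    return x[x <= var].mean()

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=20_000)
print(cvar_variational(samples, tau=0.05))   # both are close to E[X | X <= VaR_0.05(X)]
print(cvar_tail_average(samples, tau=0.05))
```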
However, VaR lacks the coherence property, distinguishing it from CVaR. It is important to note that the reward function in this context is deterministic, with the cumulative reward ranging in $[0, H]$. Given this constraint, acknowledging that the optimal $b$ aligns with the VaR (see Lemma D.2), and considering $\mathrm{VaR}_\tau \in [0, H]$, we can appropriately restrict the range of $b$ as follows:
$$\mathrm{CVaR}_\tau(X) := \sup_{b \in [0, H]} \left\{ b - \tau^{-1}\mathbb{E}\left[(b - X)^+\right] \right\}. \tag{1}$$

Reward-Free RL: The RF-RL framework, as proposed by (Jin et al., 2020), is structured into two distinct phases: exploration and planning. In the exploration phase, the goal is to design algorithms that can efficiently explore the environment without reward information. Formally, in the exploration phase, each episode commences with an exploration policy $\pi^t$, based solely on data from previous episodes. An episode $\xi^t$ captures a sequence of states and actions $(s^t_1, a^t_1, \ldots, s^t_H, a^t_H)$, starting at initial state $s^t_1$. Actions are chosen as $a^t_h = \pi^t_h(s^t_h)$, with subsequent states determined as $s^t_{h+1} \sim P_h(\cdot \mid s^t_h, a^t_h)$. Each trajectory $\xi^t$ is added to the dataset $\mathcal{D}_t$. Data collection ends at a random stopping time $t_{\mathrm{stop}}$, resulting in the dataset $\mathcal{D}_{t_{\mathrm{stop}}}$. Based on this dataset, we are able to compute the empirical transition kernel $\hat{P}$. In the planning phase, the agent's exploration strategy is critically assessed. During this phase, the agent is no longer permitted to interact with the environment. Instead, a specific reward function $r$ is given, and the primary goal is to derive a near-optimal policy tailored to this $r$ using the dataset $\mathcal{D}_{t_{\mathrm{stop}}}$ gathered during the exploration phase. The efficiency of the exploration approach is quantified by the number of trajectories needed to consistently reach this objective, effectively measuring the algorithm's ability to prepare the agent for diverse reward scenarios without direct interaction with the MDP.

Our Goal: This paper focuses on establishing an efficient CVaR-based reward-free RL framework, including: 1) Develop a CVaR-RF exploration algorithm that efficiently explores the environment without requiring any reward function and is adaptive to different $\tau$. 2) Propose a CVaR-RF planning algorithm, which computes near-optimal policies based on the dataset acquired during the exploration phase and a specified reward function, without further interaction with the environment. 3) Ensure efficiency and reliability by analyzing the sample complexity of the exploration algorithm and the optimization error of the planning algorithm.

## 3. Problem Statement

To address the inner objective of CVaR outlined in (1), which depends on the variable $b$, we consider an augmented MDP, in which an augmented state is defined as $(s, b) \in \mathcal{S}^{\mathrm{Aug}} := \mathcal{S} \times [0, H]$. The initial state for a given $b_1 \in [0, H]$ is set to $(s_1, b_1)$. Then, for each timestep $h = 1, \ldots, H$, the agent selects action $a_h$ based on policy $\pi_h$, and updates $b_{h+1}$ to $b_h - r_h$. For any history-dependent policy $\pi \in \Pi^{\mathrm{H}}$, timestep $h \in [H]$, state $s_h \in \mathcal{S}$, budget $b_h \in [0, H]$, and history $\mathcal{H}_h$, we define the value function as:
$$V^\pi_h(s_h, b_h; \mathcal{H}_h) = \mathbb{E}_\pi\!\left[\left(b_h - \sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'})\right)^{\!+} \,\middle|\, s_h, b_h, \mathcal{H}_h\right].$$
The CVaR objective following policy $\pi$, starting at $s_1$, is then expressed as:
$$\mathrm{CVaR}^\pi_\tau(s_1) = \max_{b_1 \in [0, H]} \left\{ b_1 - \tau^{-1} V^\pi_1(s_1, b_1) \right\},$$
and the optimal CVaR objective is formulated as:
$$\mathrm{CVaR}^*_\tau(s_1) = \max_{\pi \in \Pi^{\mathrm{H}}} \max_{b_1 \in [0, H]} \left\{ b_1 - \tau^{-1} V^\pi_1(s_1, b_1) \right\} = \max_{b_1 \in [0, H]} \left\{ b_1 - \tau^{-1} \min_{\pi \in \Pi^{\mathrm{H}}} V^\pi_1(s_1, b_1) \right\}. \tag{2}$$
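To see how this shortfall value function encodes CVaR, consider a small worked example of ours (not from the paper). Suppose a fixed policy induces a total return $Z = \sum_{h=1}^{H} r_h$ with $\mathbb{P}(Z = H) = 1 - \tau$, $\mathbb{P}(Z = 1) = \tau/2$, and $\mathbb{P}(Z = 0) = \tau/2$, where $H \ge 1$. Then
$$V^\pi_1(s_1, b_1) = \mathbb{E}\left[(b_1 - Z)^+\right] =
\begin{cases}
\tfrac{\tau}{2}\, b_1, & b_1 \in [0, 1],\\
\tau b_1 - \tfrac{\tau}{2}, & b_1 \in [1, H],
\end{cases}
\qquad\text{so}\qquad
b_1 - \tau^{-1} V^\pi_1(s_1, b_1) =
\begin{cases}
b_1/2, & b_1 \in [0, 1],\\
1/2, & b_1 \in [1, H].
\end{cases}$$
The maximum over $b_1 \in [0, H]$ is $\mathrm{CVaR}^\pi_\tau(s_1) = 1/2$, attained at $b_1 = \mathrm{VaR}_\tau(Z) = 1$ (among other maximizers), which matches the direct tail average: the worst $\tau$-fraction of outcomes consists of the returns $0$ and $1$ in equal proportion.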
The work of (Bäuerle & Ott, 2011) significantly advances our understanding by establishing the existence of an optimal policy $\rho^* : \mathcal{S}^{\mathrm{Aug}} \to \mathcal{A}$ that is deterministic and Markovian within the augmented MDP, where $\mathcal{S}^{\mathrm{Aug}} = \mathcal{S} \times [0, H]$. With a starting point of $b_1 \in [0, H]$ and initial state $(s_1, b_1)$, the process unfolds as follows: for each $h = 1, 2, \ldots, H$, the action $a_h$ is determined as $\rho^*(s_h, b_h)$, the reward $r_h$ as $r_h(s_h, a_h)$, the next state $s_{h+1}$ evolves according to $P_h(\cdot \mid s_h, a_h)$, and the budget $b_{h+1}$ is updated to $b_h - r_h$. The additional state $b_h$ effectively tracks the residual budget from $b_1$, serving as a comprehensive summary of historical decisions for the CVaR RL problem. The adoption of deterministic Markovian policies simplifies the decision-making process in MDPs, directly associating states with actions, thereby facilitating implementation and analysis. Consequently, without loss of optimality, the optimization problem in (2) simplifies to:
$$\mathrm{CVaR}^*_\tau(s_1) = \max_{b_1 \in [0, H]} \left\{ b_1 - \tau^{-1} \min_{\rho \in \Pi^{\mathrm{Aug}}} V^\rho_1(s_1, b_1) \right\}, \tag{3}$$
where $\Pi^{\mathrm{Aug}}$ is the class of deterministic Markovian policies. We now introduce the function definitions and the Bellman equations for the augmented MDP proposed in (Bellemare et al., 2023; Wang et al., 2023). For any policy $\rho \in \Pi^{\mathrm{Aug}}$,
$$V^\rho_h(s_h, b_h) = \mathbb{E}_\rho\!\left[\left(b_h - \sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'})\right)^{\!+} \,\middle|\, s_h, b_h\right], \qquad Q^\rho_h(s_h, b_h, a_h) = \mathbb{E}_\rho\!\left[\left(b_h - \sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'})\right)^{\!+} \,\middle|\, s_h, b_h, a_h\right].$$
For notational convenience, we introduce the following definition:
$$[P_h V_{h+1}](s_h, b_h, a_h) = \mathbb{E}_{s_{h+1} \sim P_h(\cdot \mid s_h, a_h)}\left[V_{h+1}(s_{h+1}, b_{h+1})\right].$$
These functions adhere to the following Bellman equations:
$$V^\rho_h(s_h, b_h) = \mathbb{E}_{a_h \sim \rho_h(s_h, b_h)}\left[Q^\rho_h(s_h, b_h, a_h)\right], \qquad Q^\rho_h(s_h, b_h, a_h) = [P_h V^\rho_{h+1}](s_h, b_h, a_h), \tag{6}$$
where $b_{h+1} = b_h - r_h$ and $V^\rho_{H+1}(s, b) = (b)^+ := \max(0, b)$. Similarly, we define the optimality conditions as:
$$V^*_h(s_h, b_h) = \min_{a \in \mathcal{A}} Q^*_h(s_h, b_h, a), \qquad \rho^*_h(s_h, b_h) = \mathop{\mathrm{argmin}}_{a \in \mathcal{A}} Q^*_h(s_h, b_h, a), \qquad Q^*_h(s_h, b_h, a_h) = [P_h V^*_{h+1}](s_h, b_h, a_h),$$
where $b_{h+1} = b_h - r_h$ and $V^*_{H+1}(s, b) = (b)^+ = \max(0, b)$. Here we introduce a key fact from (Wang et al., 2020), which establishes the optimality of $\Pi^{\mathrm{Aug}}$.

Theorem 3.1. (Optimality) For any $b_1 \in [0, H]$, $V^*_1(s_1, b_1) = V^{\rho^*}_1(s_1, b_1) = \inf_{\pi \in \Pi^{\mathrm{H}}} V^\pi_1(s_1, b_1)$.

Theorem 3.1 suggests that we could compute $V^*_1$ and $\rho^*$ using dynamic programming (DP) if the true transitions $P$ were known, following the classical value iteration procedure in standard RL. By executing $\rho^*$ starting from $(s_1, b^*_1)$, where $b^*_1 := \arg\max_{b_1 \in [0, H]}\{b_1 - \tau^{-1}V^*_1(s_1, b_1)\}$, we can attain the optimal CVaR value. Based on these arguments, the goal of our paper is to identify a probably approximately correct (PAC) algorithm for CVaR-RF RL, characterized by specific performance and accuracy parameters $(\epsilon, \delta)$, defined as follows:

Definition 3.2. (PAC algorithm for CVaR-RF) A CVaR-RF exploration algorithm is $(\epsilon, \delta)$-PAC with a given risk tolerance $\tau$ if for any reward function $r$,
$$\mathbb{P}\left( \mathbb{E}_{s_1 \sim P_1}\left[ \mathrm{CVaR}^*_\tau(s_1; r) - \mathrm{CVaR}^{\hat\rho}_\tau(s_1; r) \right] \le \epsilon \right) \ge 1 - \delta.$$
Here, $\mathrm{CVaR}^*_\tau(s_1; r)$ is derived by executing the optimal policy $\rho^*$ starting from $(s_1, b^*_1)$ under the true transition probabilities $P$ and the reward function $r$, with optimal initial budget $b^*_1 := \arg\max_{b_1 \in [0, H]}\{b_1 - \tau^{-1}V^*_1(s_1, b_1; r)\}$. Conversely, $\mathrm{CVaR}^{\hat\rho}_\tau(s_1; r)$ is derived by executing the policy $\hat\rho$ output by the planning phase, starting from $(s_1, \hat b_1)$, under the true transition probabilities $P$ and the reward function $r$, where $\hat b_1$ is the near-optimal initial budget obtained in the planning phase.
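The budget bookkeeping described above is all that is needed to execute an augmented-MDP policy in the original MDP. The following is a minimal Python sketch of ours (the interfaces `rho`, `P`, and `r` are assumptions, not the paper's notation). Averaging the returned shortfall over many episodes gives a Monte Carlo estimate of $V^\rho_1(s_1, b_1)$, and hence of the objective $b_1 - \tau^{-1}V^\rho_1(s_1, b_1)$ for a fixed $b_1$.

```python
import numpy as np

def rollout_augmented_policy(rho, b1, P, r, s1, H, rng):
    """Execute a deterministic augmented-MDP policy rho(h, s, b) -> a in the
    original MDP by tracking the residual budget b_{h+1} = b_h - r_h.
    Returns the episode return and the terminal shortfall (b_{H+1})^+."""
    s, b, ep_return = s1, b1, 0.0
    for h in range(H):
        a = rho(h, s, b)                 # Markovian in the augmented state (s, b)
        reward = r[h][s][a]
        ep_return += reward
        b -= reward                      # the budget summarizes the reward history
        s = rng.choice(len(P[h][s][a]), p=P[h][s][a])
    return ep_return, max(b, 0.0)        # (b_{H+1})^+ is the realized shortfall
```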
## 4. Main Results

In this section, we first analyze the exploration phase by assuming that the optimization error during the planning phase is bounded. Inspired by (Fiechter, 1994; Kaufmann et al., 2021), we propose CVaR-RF-UCRL, which is an $(\epsilon, \delta)$-PAC algorithm for CVaR-RF exploration, with the sample complexity upper bounded by $O(S^2AH^4/(\epsilon^2\tau^2))$. Then, for the planning phase, we propose a CVaR-RF-planning algorithm, adopting CVaR-VI and CVaR-VI-DISC, which satisfies the optimization error assumption.

Notation: Consider $(s^i_h, a^i_h, s^i_{h+1})$ as the state, action, and next state observed by an algorithm at step $h$ of episode $i$. For any step $h \in [H]$ and any state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$, we define $n^t_h(s, a) = \sum_{i=1}^{t} \mathbb{I}\{(s^i_h, a^i_h) = (s, a)\}$ as the count of visits to the state-action pair $(s, a)$ at step $h$ in the first $t$ episodes, and $n^t_h(s, a, s') = \sum_{i=1}^{t} \mathbb{I}\{(s^i_h, a^i_h, s^i_{h+1}) = (s, a, s')\}$. The empirical transitions are defined as:
$$\hat{P}^t_h(s' \mid s, a) = \begin{cases} \dfrac{n^t_h(s, a, s')}{n^t_h(s, a)}, & \text{if } n^t_h(s, a) > 0, \\ \dfrac{1}{S}, & \text{otherwise}. \end{cases}$$
We denote by $\hat V^{t,\rho}_h(s_h, b_h; r)$ and $\hat Q^{t,\rho}_h(s_h, b_h, a_h; r)$ the value and action-value functions of a policy $\rho$ in the MDP with transition kernels $\hat P^t$ and reward function $r$.

### 4.1. Exploration Phase

We first present a lemma that will be useful for the study of the objective within the CVaR-RF exploration context. Prior to delving into this lemma, we make an assumption regarding the planning phase.

Assumption 4.1. The optimization error during the planning phase is bounded, i.e.,
$$\widehat{\mathrm{CVaR}}^{\hat\rho^*}_\tau(s_1; r) - \widehat{\mathrm{CVaR}}^{\hat\rho}_\tau(s_1; r) \le \epsilon/3,$$
where $\widehat{\mathrm{CVaR}}^{\hat\rho^*}_\tau(s_1; r)$ is derived by executing the optimal policy $\hat\rho^*$ starting from $(s_1, \hat b^*_1)$ under the empirical transition probabilities $\hat P$ and the reward function $r$, with optimal initial budget $\hat b^*_1 := \arg\max_{b_1 \in [0, H]}\{b_1 - \tau^{-1}\hat V^*_1(s_1, b_1; r)\}$.

Notice that Assumption 4.1 concerns the optimization error under the same empirical transition probability $\hat P$ and the given $r$. This assumption is not about the error with respect to the optimal policy for the true underlying MDP. Theorem 3.1 justifies the existence of an optimal policy $\hat\rho^*$ for the MDP specified by $\hat P$ and the given reward function (more technical details can be found in Appendix A.1). Furthermore, there exist many CVaR RL works capable of generating a near-optimal policy $\hat\rho$ that satisfies this assumption, such as (Chow et al., 2015; Tamar et al., 2015b; Wang et al., 2023). We also propose algorithms for the planning phase that satisfy this assumption. Therefore, Assumption 4.1 can be readily satisfied. The following lemma is useful for subsequent discussions and analyses related to our primary objective.

Lemma 4.2. An algorithm is $(\epsilon, \delta)$-PAC for CVaR-RF exploration with a given risk tolerance $\tau$ if for any reward function $r$, any policy $\rho \in \Pi^{\mathrm{Aug}}$, and any $b_1 \in [0, H]$,
$$\left| V^\rho_1(s_1, b_1; r) - \hat V^\rho_1(s_1, b_1; r) \right| \le \epsilon\tau/3.$$

Proof. Please refer to Appendix A.1 for more details.

To simplify the exposition of our algorithm, we posit that the initial state distribution $P_0$ is supported solely on a single state $s_1$. This assumption incurs no loss of generality (Fiechter, 1994). If this is not the case, one might consider an augmented MDP that includes an additional initial state $s_0$, paired with a unique action $a_0$ yielding a null reward, so that $b_0 = b_1$. In this scenario, the transitions from $s_0$ using $a_0$ are defined as $P_0(\cdot \mid s_0, a_0) = P_0$.
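Before turning to the error bounds, note that the counts and the empirical kernel just defined translate directly into code. Below is a minimal sketch of ours (the array layout and class/function names are assumptions) that maintains $n^t_h(s, a)$ and $n^t_h(s, a, s')$ and forms $\hat P^t_h$ with the uniform $1/S$ fallback for unvisited pairs.

```python
import numpy as np

class EmpiricalModel:
    """Running visit counts and empirical transition kernel for a tabular MDP."""

    def __init__(self, S, A, H):
        self.S, self.A, self.H = S, A, H
        self.n_sa = np.zeros((H, S, A))          # n_h^t(s, a)
        self.n_sas = np.zeros((H, S, A, S))      # n_h^t(s, a, s')

    def update(self, trajectory):
        """trajectory: list of (h, s, a, s_next) transitions from one episode."""
        for h, s, a, s_next in trajectory:
            self.n_sa[h, s, a] += 1
            self.n_sas[h, s, a, s_next] += 1

    def p_hat(self):
        """Return P_hat[h, s, a, :], using 1/S for state-action pairs never visited."""
        p = np.full((self.H, self.S, self.A, self.S), 1.0 / self.S)
        visited = self.n_sa > 0
        p[visited] = self.n_sas[visited] / self.n_sa[visited][:, None]
        return p
```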
The error upper bounds in the CVaR-RF-UCRL algorithm are derived from an upper bound on the estimation error for each policy $\rho$, each initial budget $b \in [0, H]$, and each reward function $r$. The complete procedure is outlined in Algorithm 1. Before discussing the details of this algorithm, we introduce the definitions of the estimation error and its upper confidence bound.

Definition 4.3. For a given policy $\rho$, reward function $r$, and episode $t$, we define the error for any $(s_h, b_h, a_h) \in \mathcal{S}^{\mathrm{Aug}} \times \mathcal{A}$ as
$$\hat e^{t,\rho}_h(s_h, b_h, a_h; r) := \left| \hat Q^{t,\rho}_h(s_h, b_h, a_h; r) - Q^\rho_h(s_h, b_h, a_h; r) \right|.$$

Definition 4.4. The upper confidence bound $E^t_h(s, a)$ on the error is defined recursively as follows: $E^t_{H+1}(s, a) = 0$ for all $(s, a)$, and for all $h \in [H]$,
$$E^t_h(s, a) = \min\left\{ H,\; H\sqrt{\frac{2\beta(n^t_h(s, a), \delta)}{n^t_h(s, a)}} + \sum_{s'} \hat P^t_h(s' \mid s, a)\, \max_{a'} E^t_{h+1}(s', a') \right\}, \tag{8}$$
with the convention $1/0 = +\infty$, where $\beta(n, \delta)$ is a threshold function, an input to the algorithm, the choice of which will be discussed later.

It is important to note that the error upper bound depends only on the state $s$ and action $a$, and is independent of the policy $\rho$, initial budget $b_1$, and reward function $r$. Lemma 4.5 establishes that $E^t_h(s, a)$ serves as an upper bound on the error $\hat e^{t,\rho}_h(s, b, a; r)$ for any $\rho$, $b$, $r$ with high probability.

Lemma 4.5. With the Kullback-Leibler divergence between two distributions over $\mathcal{S}$ defined as $\mathrm{KL}(p \| q) = \sum_{s \in \mathcal{S}} p(s)\log\frac{p(s)}{q(s)}$, consider the event
$$\mathcal{E} = \left\{ \forall t \in \mathbb{N},\, \forall h \in [H],\, \forall (s, a): \mathrm{KL}\!\left(\hat P^t_h(\cdot \mid s, a),\, P_h(\cdot \mid s, a)\right) \le \frac{\beta(n^t_h(s, a), \delta)}{n^t_h(s, a)} \right\}.$$
Then for any policy $\rho$, any reward function $r$, and any $b$, $\hat e^{t,\rho}_h(s, b, a; r) \le E^t_h(s, a)$ holds on the event $\mathcal{E}$.

Proof. Please refer to Appendix A.2 for more details.

We now introduce the proposed CVaR-RF-UCRL algorithm, which operates on the principle of uniformly reducing the estimation error across all budgets, policies, and potential reward functions by adopting a greedy approach based on the upper bounds $E^t$ on these errors. The stopping criterion for CVaR-RF-UCRL is defined as reaching an error at step 1 that is smaller than $\epsilon\tau/3$:

Sampling rule: the exploration policy $\pi^{t+1}$ is the greedy policy with respect to $E^t(s, a)$, such that for all $s \in \mathcal{S}$ and $h \in [H]$:
$$\pi^{t+1}_h(s) = \mathop{\mathrm{argmax}}_{a} E^t_h(s, a). \tag{9}$$

Stopping rule: the algorithm stops at $t_{\mathrm{stop}} = \inf\{t : E^t_1(s_1, \pi^{t+1}_1(s_1)) \le \epsilon\tau/3\}$.

Algorithm 1 CVaR-RF-UCRL
1: Given: risk tolerance $\tau \in (0, 1]$
2: Initialization: $t = 1$, $\mathcal{D}_0 = \emptyset$, initialize $E^0$ with (8) and $\pi^1_h$ with (9)
3: while $E^{t-1}_1(s_1, \pi^t_1(s_1)) > \epsilon\tau/3$ do
4:   Observe the initial state $s^t_1 \sim P_0$
5:   for $h = 1, \ldots, H-1, H$ do
6:     Play $a^t_h \sim \pi^t_h(s^t_h)$
7:     Observe the next state $s^t_{h+1}$
8:   end for
9:   Compute $E^t$ according to (8) and $\pi^{t+1}$ according to (9)
10:  $\mathcal{D}_t = \mathcal{D}_{t-1} \cup (s^t_1, a^t_1, \ldots, s^t_H, a^t_H)$
11:  $t = t + 1$
12: end while
13: Return the dataset $\mathcal{D}_{t_{\mathrm{stop}}}$

We now have the following lemma showing that CVaR-RF-UCRL is $(\epsilon, \delta)$-PAC.

Lemma 4.6. (Correctness) On the event $\mathcal{E}$, given $\tau$, for any $r$, $\rho$, and $b_1$,
$$\left| V^{\rho}_1(s_1, b_1; r) - \hat V^{t_{\mathrm{stop}},\rho}_1(s_1, b_1; r) \right| \le \epsilon\tau/3,$$
which implies $\mathrm{CVaR}^*_\tau(s_1; r) - \mathrm{CVaR}^{\hat\rho}_\tau(s_1; r) \le \epsilon$.

Proof. By definition of the stopping rule and the sampling rule, we have for all $a \in \mathcal{A}$, $E^{t_{\mathrm{stop}}}_1(s_1, a) \le \epsilon\tau/3$. Hence, by Lemma 4.5, on the event $\mathcal{E}$, for all $\rho$, $b_1$, $r$, and all $a$, $\hat e^{t_{\mathrm{stop}},\rho}_1(s_1, b_1, a; r) \le \epsilon\tau/3$. In particular, for all $\rho$, $b_1$, and $r$, $|V^{\rho}_1(s_1, b_1; r) - \hat V^{t_{\mathrm{stop}},\rho}_1(s_1, b_1; r)| \le \epsilon\tau/3$. The conclusion follows from Lemma 4.2 by choosing $\rho$ to be $\hat\rho$.
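To make the exploration loop concrete, here is a compact Python sketch of ours (not the authors' implementation) of the recursion in (8), the greedy sampling rule (9), and the stopping rule. It reuses the `EmpiricalModel` sketch above and assumes a threshold function `beta` that accepts an array of counts, plus a minimal `env.reset()`/`env.step()` interface.

```python
import numpy as np

def error_bounds(model, beta, delta):
    """Compute E^t_h(s, a) by backward recursion (Definition 4.4)."""
    H, S, A = model.H, model.S, model.A
    p_hat, n = model.p_hat(), model.n_sa
    E = np.zeros((H + 1, S, A))                     # E_{H+1} = 0
    for h in range(H - 1, -1, -1):
        with np.errstate(divide="ignore"):
            bonus = H * np.sqrt(2.0 * beta(n[h], delta) / n[h])   # +inf when n = 0
        future = p_hat[h] @ E[h + 1].max(axis=1)    # sum_s' P_hat(s'|s,a) max_a' E_{h+1}(s',a')
        E[h] = np.minimum(H, bonus + future)
    return E[:H]

def cvar_rf_ucrl(env, model, beta, eps, tau, delta, max_episodes=100_000):
    """Greedy exploration w.r.t. E^t until E^t_1(s_1, pi(s_1)) <= eps*tau/3."""
    for _ in range(max_episodes):
        E = error_bounds(model, beta, delta)
        pi = E.argmax(axis=2)                       # sampling rule (9): greedy in E
        s = env.reset()
        if E[0, s, pi[0, s]] <= eps * tau / 3:      # stopping rule
            break
        traj = []
        for h in range(model.H):
            a = pi[h, s]
            s_next = env.step(a)                    # rewards are ignored (reward-free)
            traj.append((h, s, a, s_next))
            s = s_next
        model.update(traj)
    return model
```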
We are now able to state our main result for CVaR-RF-UCRL, which shows that with a properly chosen threshold $\beta(n, \delta)$, CVaR-RF-UCRL achieves the $(\epsilon, \delta)$-PAC guarantee for CVaR RL. Furthermore, an upper bound on its sample complexity can be established under these conditions.

Theorem 4.7. (Upper Bound for Sample Complexity) Using the threshold $\beta(n, \delta) = \log(2SAH/\delta) + (S-1)\log\!\left(e\left(1 + \frac{n}{S-1}\right)\right)$, CVaR-RF-UCRL is $(\epsilon, \delta)$-PAC for CVaR-RF exploration. The number of trajectories collected in the exploration phase is bounded by $O\!\left(\frac{S^2AH^4}{\epsilon^2\tau^2}\right)$.

Proof. Please refer to Appendix A.3 for more details.

Compared with the risk-neutral reward-free approaches (Jin et al., 2020; Kaufmann et al., 2021; Ménard et al., 2021), the denominator of the bound we obtain involves the risk tolerance parameter $\tau$. This is expected: since CVaR is the mean of the tail of the return distribution with probability mass $\tau$, more trajectories are required for smaller $\tau$ values and fewer for larger $\tau$ values.

### 4.2. Planning Phase

In the planning phase, the reward function is provided, and the goal is to find a near-optimal policy based on the given reward function and the dataset generated during the exploration phase. Following an approach similar to (Jin et al., 2020), we now introduce our planning algorithm, as outlined in Algorithm 2.

Algorithm 2 CVaR-RF-Planning
1: Input: a dataset of transitions $\mathcal{D}_{t_{\mathrm{stop}}}$, reward function $r$, accuracy $\epsilon$, risk tolerance $\tau$.
2: for all $(s, a, s', h) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S} \times [H]$ do
3:   $N_h(s, a, s') \leftarrow \sum_{(s_h, a_h, s_{h+1}) \in \mathcal{D}} \mathbb{I}[s_h = s, a_h = a, s_{h+1} = s']$
4:   $N_h(s, a) \leftarrow \sum_{s'} N_h(s, a, s')$
5:   $\hat P_h(s' \mid s, a) = N_h(s, a, s')/N_h(s, a)$
6: end for
7: $\hat\rho, \hat b \leftarrow$ APPROXIMATE-CVaR-SOLVER$(\hat P, r, \epsilon, \tau)$
8: return policy $\hat\rho$ and initial budget $\hat b$

In Algorithm 2, we first compute the empirical transition matrix $\hat P$ based on the collected dataset $\mathcal{D}_{t_{\mathrm{stop}}}$. Then, for each reward function $r$, we find a near-optimal policy by employing an APPROXIMATE-CVaR-SOLVER that utilizes the transitions $\hat P$, the given reward function $r$, an accuracy parameter $\epsilon$, and the given risk tolerance $\tau$. It is worth noting that the solver can be any algorithm designed to find an $\epsilon/3$-suboptimal policy $\hat\rho$ for CVaR RL when both the transition matrix and the reward are known. One straightforward approach is the value iteration algorithm, which iteratively solves the Bellman optimality equation (6) in a dynamic programming manner. The greedy policy induced by the resulting Q-function yields the optimal policy without error. We present Algorithm 3, which generates an optimal policy exactly according to Theorem 3.1 (Wang et al., 2023). This algorithm satisfies our Assumption 4.1 about the optimization error.

Algorithm 3 CVaR-VI
1: Input: transition matrix $P$, reward function $r$, risk tolerance $\tau$
2: for all $s \in \mathcal{S}$, $b \in [0, H]$ do
3:   Set $V_{H+1}(s, b) = (b)^+$
4:   for $h = H, H-1, \ldots, 1$ do
5:     $Q_h(s_h, b_h, a_h) = [P_h V_{h+1}](s_h, b_h, a_h)$, where $b_{h+1} = b_h - r_h$
6:     $\rho_h(s_h, b_h) = \mathop{\mathrm{argmin}}_{a} Q_h(s_h, b_h, a)$
7:     $V_h(s_h, b_h) = \min_{a} Q_h(s_h, b_h, a)$
8:   end for
9: end for
10: Calculate $b^* = \mathop{\mathrm{argmax}}_{b_1 \in [0, H]} \left\{ b_1 - \tau^{-1} V_1(s_1, b_1) \right\}$
11: return policy $\rho$ and $b^*$

#### 4.2.1. Discretization

Algorithm 3 faces computational challenges due to the dynamic programming step, which requires optimization over all $b \in [0, H]$ and involves the maximization of a non-concave function (Wang et al., 2023).
Following the approach proposed in (Bastani et al., 2022; Wang et al., 2023), we introduce a discretization of the rewards, which allows the mentioned steps to be performed over a finite grid. This offers computational efficiency while preserving the same statistical guarantees. We fix a precision $\eta \in (0, 1)$ and define $\phi(r) = \eta\lceil r/\eta \rceil \wedge 1$. This rounding function maps $r \in [0, 1]$ to an $\eta$-net of the interval $[0, 1]$, commonly referred to as the grid. The discretized MDP $\mathrm{disc}(M)$ is an exact replica of the true MDP $M$ with one exception: its rewards are post-processed using $\phi$, i.e., $r(s, a; \mathrm{disc}(M)) = \phi(r(s, a; M))$. We now introduce the CVaR value iteration with discretization algorithm.

Algorithm 4 CVaR-VI-DISC
1: Input: transition matrix $P$, reward function $r$, precision parameter $\eta$, risk tolerance $\tau$.
2: Discretize the reward function $r$ by $\hat r = \phi(r) = \eta\lceil r/\eta \rceil \wedge 1$
3: for all $s \in \mathcal{S}$, $\hat b = n\eta$, $n = 0, 1, \ldots$ do
4:   Set $\hat V_{H+1}(s, \hat b) = (\hat b)^+ := \max(0, \hat b)$
5:   for $h = H, H-1, \ldots, 1$ do
6:     $\hat Q_h(s_h, \hat b_h, a_h) = [P_h \hat V_{h+1}](s_h, \hat b_h, a_h)$, where $\hat b_{h+1} = \hat b_h - \hat r_h$
7:     $\hat\rho_h(s_h, \hat b_h) = \mathop{\mathrm{argmin}}_{a} \hat Q_h(s_h, \hat b_h, a)$
8:     $\hat V_h(s_h, \hat b_h) = \min_{a} \hat Q_h(s_h, \hat b_h, a)$
9:   end for
10: end for
11: Calculate $\hat b^* = \mathop{\mathrm{argmax}}_{\hat b} \left\{ \hat b - \tau^{-1} \hat V_1(s_1, \hat b) \right\}$
12: return policy $\hat\rho$ and $\hat b^*$

#### 4.2.2. Computational Complexity

In $\mathrm{disc}(M)$, the $\tau$-th quantile of the returns distribution (the argmax of the CVaR objective) will be a multiple of $\eta$. Therefore, it suffices to compute $V_1(s_1, b_1)$ and to carry out the outer maximization (line 11 of Algorithm 4) over the grid. Since $b_1$ transitions by subtracting rewards, which are multiples of $\eta$, $b_h$ will always stay on the grid. Hence, the entire dynamic programming procedure only needs to occur on the grid. This shows that CVaR value iteration via discretization is computationally tractable.

Theorem 4.8. CVaR-VI-DISC has a run time of $O(S^2AH\eta^{-2})$ in the discretized MDP. Setting $\eta = \epsilon\tau/3H$, as suggested in Theorem 4.9, the run time is $O\!\left(\frac{S^2AH^3}{\epsilon^2\tau^2}\right)$.

Proof. Please refer to Appendix B.1 for more details.

#### 4.2.3. Discretization Error

Next, we evaluate the impact of errors resulting from the discretization step. Following a method similar to previous work (Wang et al., 2023), we can relate the errors within $\mathrm{disc}(M)$ to equivalent errors within $M$ using a coupling argument. This leads us to the CVaR-VI-DISC algorithm, which is tailored for practical applications. The following theorem guarantees that the optimization error assumption is met when Algorithm 4 is employed.

Theorem 4.9. By selecting $\eta \le \epsilon\tau/3H$, we ensure that
$$\left| \mathrm{CVaR}^{\rho^*}_\tau(s_1; r) - \mathrm{CVaR}^{\hat\rho}_\tau(s_1; r) \right| \le \epsilon/3, \tag{10}$$
where $\rho^*$ represents the policy generated by Algorithm 3 and $\hat\rho$ is the output of Algorithm 4. Consequently, the optimization error is bounded by $\epsilon/3$, which satisfies Assumption 4.1.

Proof. Please refer to Appendix B.2 for more details.
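The discretized dynamic program admits a direct tabular implementation. The following sketch of ours (not the authors' code) assumes stage-wise kernels `P[h]` and a deterministic reward table `r[h]` whose entries are already rounded to the $\eta$-grid, so that all encountered budgets stay on the grid; a depleted (negative) budget contributes zero shortfall because rewards are nonnegative.

```python
import numpy as np

def cvar_vi_disc(P, r, H, eta, tau, s1=0):
    """Tabular CVaR value iteration on the budget grid {0, eta, 2*eta, ..., H}.
    P: (H, S, A, S) stage-wise transition kernels; r: (H, S, A) rewards already
    rounded to multiples of eta. Returns the greedy policy rho[h, s, j]
    (j indexes the budget b = j*eta), the chosen initial budget b_hat,
    and the resulting CVaR estimate at s1."""
    S, A = P.shape[1], P.shape[2]
    m = int(round(H / eta)) + 1
    grid = eta * np.arange(m)                  # budget grid
    V = np.tile(grid, (S, 1))                  # V_{H+1}(s, b) = (b)^+ = b on the grid
    rho = np.zeros((H, S, m), dtype=int)
    r_idx = np.rint(r / eta).astype(int)       # rewards in units of eta
    for h in range(H - 1, -1, -1):
        Q = np.empty((S, A, m))
        for s in range(S):
            for a in range(A):
                j_next = np.arange(m) - r_idx[h, s, a]   # budget update b' = b - r_h(s, a)
                W = np.zeros((S, m))                     # V_{h+1}(s', b'); depleted budgets are worth 0
                W[:, j_next >= 0] = V[:, j_next[j_next >= 0]]
                Q[s, a] = P[h, s, a] @ W                 # [P_h V_{h+1}](s, b, a)
        rho[h] = Q.argmin(axis=1)              # Bellman-optimal (minimizing) policy
        V = Q.min(axis=1)
    j_hat = int(np.argmax(grid - V[s1] / tau)) # outer maximization over the grid
    return rho, grid[j_hat], grid[j_hat] - V[s1, j_hat] / tau
```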
### 4.3. Adaptability to Varying Risk Tolerances

We further introduce an important proposition that underscores the adaptability of our exploration process to different levels of risk tolerance $\tau$:

Proposition 4.10. For any $\tau' \ge \tau$, the exploration dataset obtained through Algorithm 1 at risk tolerance $\tau$ contains the requisite information for conducting CVaR-RF RL with any higher risk tolerance $\tau'$. Consequently, the planning phase is also compatible with any given $\tau' \ge \tau$.

Proof. Utilizing Lemma 4.2, we observe that since $\epsilon\tau/3 \le \epsilon\tau'/3$, the CVaR-RF exploration algorithm configured with a risk tolerance of $\tau$ also satisfies the $(\epsilon, \delta)$-PAC criterion for CVaR-RF RL when operating under a higher risk tolerance $\tau' \ge \tau$. Furthermore, invoking Theorem 4.9, the stipulated optimization error condition is met since $\eta \le \eta'$. This implies that the planning phase remains valid under these adjusted parameters.

## 5. Lower Bound

In this section, we develop a lower bound on the sample complexity of CVaR-RF exploration. We present a theorem that delineates this lower bound, applicable to any algorithm operating within the CVaR-RF exploration framework.

Theorem 5.1. Consider a universal constant $C > 0$. For a given risk tolerance $\tau \in (0, 1]$, if the number of actions $A \ge 2$, the number of states $S \ge C\log_2 A + 2$, the horizon $H \ge C\log_2 S + 1$, and the accuracy parameter $\epsilon \le \min\{1/(4\tau), H/(48\tau)\}$, then any CVaR-RF exploration algorithm that can output $\epsilon$-optimal policies for an arbitrary number of adaptively chosen reward functions with a success probability $\delta = 1/2$ must collect at least $\Omega(S^2AH^2/(\tau\epsilon^2))$ trajectories in expectation.

Proof Sketch. Here we highlight the main idea of our lower bound proof; the detailed proof can be found in Appendix C. Our proof is inspired by the lower bound construction for reward-free RL in (Jin et al., 2020). The key idea is that any reward-free risk-neutral problem can be transformed into a CVaR-RF RL problem. If a CVaR-RF exploration algorithm can output $\epsilon$-optimal policies for the transformed CVaR-RF RL problem, it can also solve the original reward-free risk-neutral problem. Specifically, for an MDP $M$ with initial state $s_1$, we consider a new MDP $M'$ with an initial state $s_0$ and an absorbing state $s^\dagger$. For any action $a$, $P(s_1 \mid s_0, a) = \tau$, $P(s^\dagger \mid s_0, a) = 1 - \tau$, $P(s^\dagger \mid s^\dagger, a) = 1$, and $r(s^\dagger, a) = 1$. For any adaptively chosen reward function for $M$ and any policy $\pi$, the CVaR with tolerance $\tau$ following $\pi$ in the new MDP $M'$ is equal to the expected cumulative reward in the original MDP $M$. (Jin et al., 2020) shows that any reward-free exploration algorithm that outputs $\epsilon$-optimal policies from the initial state $s_1$ must collect at least $\Omega(S^2AH^2/\epsilon^2)$ trajectories in expectation. Thus, since $s_1$ is reached from $s_0$ only with probability $\tau$, the CVaR-RF exploration algorithm must collect at least $\Omega(S^2AH^2/(\tau\epsilon^2))$ trajectories in expectation from the initial state $s_0$.

This theorem illustrates that, compared with the lower bound, the upper bound established in Theorem 4.7 carries an additional factor of $H^2$ and $1/\tau$, while being tight with respect to the parameters $S$, $A$, and $\epsilon$. If $\tau$ is a constant, our result is nearly minimax-optimal up to an additional factor of $H^2$. An interesting direction for future work is to utilize the empirical Bernstein inequality to further improve the sample complexity. The $H$ factor can potentially be improved by adopting an approach similar to (Ménard et al., 2021), introducing an empirical Bernstein inequality derived from a control of the transition probabilities. As shown in (Wang et al., 2023), the Bernstein inequality could also potentially improve the dependence on $\tau$ under a continuity assumption. Furthermore, compared with risk-neutral reward-free RL, our derived lower bound for any CVaR-RF exploration algorithm includes an additional $\tau$ in the denominator. This is because CVaR focuses on the worst $\tau$-fraction of outcomes. Additionally, the CVaR setting poses challenges due to non-Markovianity, requiring more effort to achieve a minimax-optimal sample complexity bound.
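The reduction in the proof sketch can be written out explicitly. The following sketch of ours builds the transformed MDP $M'$ from $M$ exactly as described above; the array layout, the placement of the two new states, and the handling of unreachable rows are our illustration choices.

```python
import numpy as np

def embed_reward_free_instance(P, r, tau):
    """Build the transformed MDP M' used in the lower-bound reduction:
    a new initial state s0 reaches the original initial state s1 (index 0) with
    probability tau and an absorbing reward-1 state with probability 1 - tau.
    P: (H, S, A, S), r: (H, S, A); returns (P', r') with S + 2 states and horizon H + 1."""
    H, S, A, _ = P.shape
    s0, s_dag = S, S + 1                     # indices of the two new states
    P_new = np.zeros((H + 1, S + 2, A, S + 2))
    r_new = np.zeros((H + 1, S + 2, A))
    # Step 0: from s0, any action reaches s1 w.p. tau and s_dag w.p. 1 - tau, with reward 0.
    P_new[0, s0, :, 0] = tau
    P_new[0, s0, :, s_dag] = 1.0 - tau
    # Steps 1..H: original dynamics and rewards, shifted by one step.
    P_new[1:, :S, :, :S] = P
    r_new[1:, :S, :] = r
    # The absorbing branch keeps paying reward 1, so its total return is exactly H.
    P_new[:, s_dag, :, s_dag] = 1.0
    r_new[1:, s_dag, :] = 1.0
    # (Rows for states that cannot be visited at a given step are left at zero for brevity.)
    return P_new, r_new
```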
## 6. Experiments

In this section, we provide numerical examples to evaluate the proposed CVaR-RF RL framework. In these examples, we use an experimental setup similar to (Kaufmann et al., 2021). Our environment is configured as a grid-world consisting of $21 \times 21$ states, where each state offers four possible actions (up, down, left, right), and actions leading into the boundary result in remaining in the current state. The agent moves to the intended state with probability 0.95; with equal probability $0.05/3$, it moves in one of the other three directions. Initially, the exploration algorithm CVaR-RF-UCRL runs without reward information, collecting $n = 30{,}000$ transitions. The empirical transition probability $\hat P$ is then estimated. We use the $\beta(n, \delta)$ threshold from Theorem 4.7 with $\delta = 0.1$ and set a time horizon $H$ of 20. Using the obtained dataset and $\hat P$, the planning algorithm derives near-optimal policies, employing CVaR-VI-DISC as the solver.

Reward Setup 1: The first setup is similar to (Kaufmann et al., 2021), where the agent starts at position (10, 10). The reward is 0 for most states, except at (16, 16) where it is 1.0. Here we choose $\epsilon = 0.1$. We then execute the output policy of CVaR-VI-DISC in the same grid-world for $K = 10{,}000$ trajectories and plot the number of state visits under the policy. For comparison, we also generate the optimal policy using the true transition probabilities. Figure 1a displays the number of visits to each state following the policy generated from $P$, while Figure 1b shows the same for $\hat P$. Additionally, Table 1 presents the CVaR values under both the true and empirical transition probabilities.

Figure 1. Number of state visits following policies generated under $P$ (a, optimal policy) and $\hat P$ (b, CVaR-VI-DISC) in reward setup 1 with risk tolerance $\tau = 0.05$.

| $\epsilon$, $\tau$ | $\mathrm{CVaR}_P$ | $\mathrm{CVaR}_{\hat P}$ | Error |
| --- | --- | --- | --- |
| 0.1, 0.05 | 4.308 | 4.258 | 0.05 |
| 0.1, 0.95 | 4.960 | 4.954 | 0.006 |

Table 1. CVaR values under reward setup 1 with different $\tau$.

These visitation patterns, shown in Figures 1a and 1b, are notably similar, indicating that the agent tends to favor states with higher rewards. This behavior is consistent with the objective of maximizing CVaR. The similarity in patterns under both true and empirical transition probabilities underscores the reliability of the data collected during the exploration phase. Moreover, with $\epsilon = 0.1$ and $\tau = 0.05$, the difference between the true and empirical CVaR is 0.05, which is below the anticipated error threshold of $\epsilon = 0.1$. Similarly, with $\epsilon = 0.1$ and $\tau = 0.95$, the error is only 0.006, again less than the threshold of 0.1. These results align with our theoretical analysis.

Reward Setup 2: We consider a more complex case in which the reward is 0.5 for most states, except at (16, 16) where it is 1.0, with a zero-reward zone (marked "x") spanning from (12, 10) to (12, 16).

Figure 2. Number of state visits following policies generated under $P$ (a, optimal policy) and $\hat P$ (b, CVaR-VI-DISC) in reward setup 2 with risk tolerance $\tau = 0.05$.

| $\epsilon$, $\tau$ | $\mathrm{CVaR}_P$ | $\mathrm{CVaR}_{\hat P}$ | Error |
| --- | --- | --- | --- |
| 0.1, 0.05 | 1.852 | 1.829 | 0.023 |
| 0.1, 0.95 | 1.993 | 1.990 | 0.003 |

Table 2. CVaR values under reward setup 2 with different $\tau$.

Figure 2 and Table 2 illustrate that CVaR-RF RL effectively avoids traversing zero-reward regions, and the observed errors remain within the pre-defined thresholds. These outcomes are also consistent with the properties of CVaR, as the agent is more risk-averse compared to the risk-neutral case.
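For reference, here is a minimal sketch of the grid-world dynamics used in these experiments (ours, not the authors' environment code; the zero-based indexing and the interface are assumptions).

```python
import numpy as np

class GridWorld:
    """21x21 grid-world with slip dynamics as described in Section 6:
    the intended move succeeds w.p. 0.95; each of the other three directions
    occurs w.p. 0.05/3; moves into the boundary leave the state unchanged."""
    MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right

    def __init__(self, size=21, start=(10, 10), seed=0):
        self.size, self.start = size, start
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        probs = np.full(4, 0.05 / 3)
        probs[action] = 0.95
        di, dj = self.MOVES[self.rng.choice(4, p=probs)]
        i = min(max(self.pos[0] + di, 0), self.size - 1)   # boundary: stay in the grid
        j = min(max(self.pos[1] + dj, 0), self.size - 1)
        self.pos = (i, j)
        return self.pos                                    # reward-free: no reward returned
```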
## 7. Conclusion

In this paper, we have introduced a novel risk-sensitive reward-free RL framework based on CVaR (CVaR-RF RL), which is able to solve CVaR RL for any given reward function after a single reward-free exploration phase. We have proposed CVaR-RF-UCRL as the exploration algorithm and established upper and lower bounds on the sample complexity. We have developed a CVaR-RF-planning algorithm, equipped with CVaR-VI and CVaR-VI-DISC, to generate near-optimal Markov policies based solely on the exploration dataset and a given reward function. Through our numerical experiments, we have validated the effectiveness and practicality of the CVaR-RF framework.

## Acknowledgement

This work was supported by the National Science Foundation under Grants CCF-21-12504 and CCF-22-32907.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Acerbi, C. and Tasche, D. On the Coherence of Expected Shortfall. Journal of Banking & Finance, 26(7):1487–1503, 2002.

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. Coherent Measures of Risk. Mathematical Finance, 9(3):203–228, 1999.

Auer, P., Jaksch, T., and Ortner, R. Near-Optimal Regret Bounds for Reinforcement Learning. Advances in Neural Information Processing Systems, 21, 2008.

Azar, M. G., Osband, I., and Munos, R. Minimax Regret Bounds for Reinforcement Learning. In Proc. International Conference on Machine Learning, pp. 263–272, Sydney, Australia, 2017. PMLR.

Bastani, O., Ma, J. Y., Shen, E., and Xu, W. Regret Bounds for Risk-Sensitive Reinforcement Learning. Advances in Neural Information Processing Systems, 35:36259–36269, 2022.

Bäuerle, N. and Ott, J. Markov Decision Processes with Average-Value-at-Risk Criteria. Mathematical Methods of Operations Research, 74:361–379, 2011.

Bellemare, M. G., Dabney, W., and Rowland, M. Distributional Reinforcement Learning. MIT Press, 2023.

Chen, J., Modi, A., Krishnamurthy, A., Jiang, N., and Agarwal, A. On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL. Advances in Neural Information Processing Systems, 35:20960–20973, 2022.

Chow, Y. and Ghavamzadeh, M. Algorithms for CVaR Optimization in MDPs. Advances in Neural Information Processing Systems, 27, 2014.

Chow, Y., Tamar, A., Mannor, S., and Pavone, M. Risk Sensitive and Robust Decision-Making: a CVaR Optimization Approach. Advances in Neural Information Processing Systems, 28, 2015.

Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. Risk-Constrained Reinforcement Learning with Percentile Risk Criteria. The Journal of Machine Learning Research, 18(1):6070–6120, 2017.

Dann, C. and Brunskill, E. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning. Advances in Neural Information Processing Systems, 28, 2015.

Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. Advances in Neural Information Processing Systems, 30, 2017.

Delage, E. and Mannor, S. Percentile Optimization for Markov Decision Processes with Parameter Uncertainty. Operations Research, 58(1):203–213, 2010.

Ding, Y., Jin, M., and Lavaei, J. Non-Stationary Risk Sensitive Reinforcement Learning: Near-Optimal Dynamic Regret, Adaptive Detection, and Separation Design. In Proc. the AAAI Conference on Artificial Intelligence, volume 37, pp. 7405–7413, Washington, DC, 2023.
Du, Y., Wang, S., and Huang, L. Provably Efficient Risk Sensitive Reinforcement Learning: Iterated CVaR and Worst Path. In Proc. The Eleventh International Conference on Learning Representations, 2022.

Fei, Y., Yang, Z., Chen, Y., Wang, Z., and Xie, Q. Risk Sensitive Reinforcement Learning: Near-Optimal Risk Sample Tradeoff in Regret. Advances in Neural Information Processing Systems, 33:22384–22395, 2020.

Fei, Y., Yang, Z., and Wang, Z. Risk-Sensitive Reinforcement Learning with Function Approximation: A Debiasing Approach. In Proc. International Conference on Machine Learning, pp. 3198–3207. PMLR, 2021.

Fiechter, C.-N. Efficient Reinforcement Learning. In Proc. The Seventh Annual Conference on Computational Learning Theory, pp. 88–97, New Brunswick, NJ, 1994.

Hau, J. L., Petrik, M., and Ghavamzadeh, M. Entropic Risk Optimization in Discounted MDPs. In Proc. International Conference on Artificial Intelligence and Statistics, pp. 47–76, Valencia, Spain, 2023. PMLR.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-Learning Provably Efficient? Advances in Neural Information Processing Systems, 31, 2018.

Jin, C., Krishnamurthy, A., Simchowitz, M., and Yu, T. Reward-Free Exploration for Reinforcement Learning. In Proc. International Conference on Machine Learning, pp. 4870–4879, Stockholm, Sweden, 2020. PMLR.

Kaufmann, E., Ménard, P., Domingues, O. D., Jonsson, A., Leurent, E., and Valko, M. Adaptive Reward-Free Exploration. In Proc. Algorithmic Learning Theory, pp. 865–891, Paris, France, 2021. PMLR.

Keramati, R., Dann, C., Tamkin, A., and Brunskill, E. Being Optimistic to Be Conservative: Quickly Learning a CVaR Policy. In Proc. the AAAI Conference on Artificial Intelligence, volume 34, pp. 4436–4443, New York, NY, 2020.

La, P. and Ghavamzadeh, M. Actor-Critic Algorithms for Risk-Sensitive MDPs. Advances in Neural Information Processing Systems, 26, 2013.

Li, L. Sample Complexity Bounds of Exploration. In Reinforcement Learning: State-of-the-Art, pp. 175–204. Springer, 2012.

Ménard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast Active Learning for Pure Exploration in Reinforcement Learning. In Proc. International Conference on Machine Learning, pp. 7599–7608. PMLR, 2021.

Miryoosefi, S. and Jin, C. A Simple Reward-Free Approach to Constrained Reinforcement Learning. In Proc. International Conference on Machine Learning, pp. 15666–15698, Baltimore, MD, 2022. PMLR.

Ni, X. and Lai, L. Policy Gradient Based Entropic-VaR Optimization in Risk-Sensitive Reinforcement Learning. In Proc. 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1–6, Allerton, IL, 2022a. IEEE.

Ni, X. and Lai, L. Risk-Sensitive Reinforcement Learning via Entropic-VaR Optimization. In Proc. 2022 56th Asilomar Conference on Signals, Systems, and Computers, pp. 953–959, Pacific Grove, CA, 2022b. IEEE.

Prashanth, L., Fu, M. C., et al. Risk-Sensitive Reinforcement Learning via Policy Gradient Search. Foundations and Trends in Machine Learning, 15(5):537–693, 2022.

Rockafellar, R. T., Uryasev, S., et al. Optimization of Conditional Value-at-Risk. Journal of Risk, 2:21–42, 2000.

Shen, Y., Tobia, M. J., Sommer, T., and Obermayer, K. Risk Sensitive Reinforcement Learning. Neural Computation, 26(7):1298–1328, 2014.

Tamar, A., Chow, Y., Ghavamzadeh, M., and Mannor, S. Policy Gradient for Coherent Risk Measures. Advances in Neural Information Processing Systems, 28, 2015a.

Tamar, A., Glassner, Y., and Mannor, S.
Optimizing the CVa R via Sampling. In Proc. the AAAI Conference on Artificial Intelligence, volume 29, Austin, TX, 2015b. Wang, K., Kallus, N., and Sun, W. Near-minimax-optimal risk-sensitive reinforcement learning with cvar. ar Xiv preprint ar Xiv:2302.03201, 2023. Wang, R., Du, S. S., Yang, L., and Salakhutdinov, R. R. On Reward-Free Reinforcement Learning with Linear Function Approximation. Advances in Neural Information Processing Systems, 33:17816 17826, 2020. Risk-Sensitive Reward-Free Reinforcement Learning with CVa R Ying, C., Zhou, X., Su, H., Yan, D., Chen, N., and Zhu, J. Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk. ar Xiv preprint ar Xiv:2206.04436, 2022. Zhang, Z., Du, S., and Ji, X. Near Optimal Reward-Free Reinforcement Learning. In Proc. International Conference on Machine Learning, pp. 12402 12412. PMLR, 2021. Risk-Sensitive Reward-Free Reinforcement Learning with CVa R A. Proof of Exploration Phase A.1. Proof of Lemma 4.2 Recall the definition of value function V for various policy types: π ΠH : V π h (sh, bh; Hh) = Eπ !+ sh, bh, Hh ρ ΠAug : V ρ h (sh, bh) = Eρ Notice that executing ρ, b in the augmented MDP is equivalent to executing policy πρ,b in the original MDP, where πρ,b h (sh, Hh) = ρh(sh, b r1 . . . rh 1). Consequently, their V functions should be equivalent. Therefore, by Lemma D.3, we have CVa R τ(s1; r) CVa Rˆρr τ (s1; r) = CVa Rπρ ,b 1 τ (s1; r) CVa Rπ ˆ ρ,ˆb1 τ (s1; r) = CVa Rπρ ,b 1 τ (s1; r) \ CVa R πρ ,b 1 τ (s1; r) | {z } Evaluation error I + \ CVa R πρ ,b 1 τ (s1; r) \ CVa R π ˆ ρ ,ˆb 1 τ (s1; r) | {z } 0 by definition + \ CVa R π ˆ ρ ,ˆb 1 τ (s1; r) \ CVa R π ˆ ρ,ˆb1 τ (s1; r) | {z } optimization error ϵ/3 by Assumption 4.1 + \ CVa R π ˆ ρ,ˆb1 τ (s1; r) CVa Rπ ˆ ρ,ˆb1 τ (s1; r) | {z } Evaluation error II By the triangle inequality, we have CVa R τ(s1; r) CVa Rˆρ r τ (s1; r) τ (s1; r) \ CVa R πρ ,b τ (s1; r) + \ CVa R π ˆ ρ ,ˆb τ (s1; r) CVa Rπ ˆ ρ ,ˆb τ (s1; r) . For the evaluation errors, by the definition of CVa R, we have CVa Rπρ ,b 1 τ (s1; r) \ CVa R πρ ,b 1 τ (s1; r) = b 1 τ 1V πρ ,b 1 1 (s1, b 1; r) max b1 [0,H] n b1 τ 1 ˆV πρ ,b 1 1 (s1, b1; r) o b 1 τ 1V πρ ,b 1 1 (s1, b 1; r) b 1 τ 1 ˆV πρ ,b 1 1 (s1, b 1; r) τ 1 V πρ ,b 1 1 (s1, b 1; r) ˆV πρ ,b 1 1 (s1, b 1; r) , and similarly, \ CVa R π ˆ ρ,ˆb1 τ (s1; r) CVa Rπ ˆ ρ,ˆb1 τ (s1; r) τ 1 V π ˆ ρ,ˆb1 1 (s1,ˆb1; r) ˆV π ˆ ρ,ˆb1 1 (s1,ˆb1; r) . Therefore, if an exploration algorithm that satisfies V ρ 1 (s1, b1; r) ˆV ρ 1 (s1, b1; r) ϵτ/3, ρ ΠAug, b1 [0, H], or equivalently, Qρ 1(s1, b1, ρ(s1, b1); r) ˆQρ 1(s1, b1, ρ(s1, b1); r) ϵτ/3, ρ ΠAug, b1 [0, H], it further ensures CVa R τ(s1; r) CVa Rˆρ r τ (s1; r) ϵ, which completes the proof. Risk-Sensitive Reward-Free Reinforcement Learning with CVa R A.2. Proof of Lemma 4.5 We first consider the case where the initial budget b1 is fixed and for convenience, we omit the index h + 1 by using (s , b ). Referring to the Bellman equations in both the empirical augmented MDP and the true augmented MDP, ˆQt,ρ h (sh, bh, ah; r) = X s ˆPt h(s |s, a) ˆQt,ρ h+1(s , b , ρ(s , b ); r), and Qρ h(sh, bh, ah; r) = X s Ph(s |s, a)Qρ h+1(s , b , ρ(s , b ); r), ˆQt,ρ h (sh, bh, ah; r) Qρ h(sh, bh, ah; r) s ˆPt h(s |s, a) ˆQt,ρ h+1(s , b , ρ(s , b ); r) X s Ph(s |s, a)Qρ h+1(s , b , ρ(s , b ); r) ˆPt h(s |s, a) Ph(s |s, a) Qρ h+1(s , b , ρ(s , b ); r) s ˆPt h(s |s, a) ˆQt,ρ h+1(s , b , ρ(s , b ); r) Qρ h+1(s , b , ρ(s , b ); r) . 
Thus, for nt h(s, a) 0, we obtain ˆet,ρ h (sh, bh, ah; r) = | ˆQt,ρ h (sh, bh, ah; r) Qρ h(sh, bh, ah; r)| ˆPt h(s |s, a) Ph(s |s, a) Qρ h+1(s , b , ρ(s , b ); r) s ˆPt h(s |s, a) ˆQt,ρ h+1(s , b , ρ(s , b ); r) Qρ h+1(s , b , ρ(s , b ); r) (2) b1 ˆPt h( |s, a) Ph( |s, a) 1 + X s ˆPt h(s |s, a)ˆet,ρ h+1(s , b , a ; r) 2β(nt h(s, a), δ) nt h(s, a) + X s ˆPt h(s |s, a)ˆet,ρ h+1(s , b , a ; r), where (1) is due to the Pinsker s inequality; (2) is due to the fact that Qρ h(sh, bh, ah; r) b1 (Qρ h(sh, bh, ah; r) (bh)+ b1 as bh+1 = bh rh) for all s, a, b, r and the definition of L1 norm; (3) is due to the fact that TV(P, Q) = 1 2 P( ) Q( ) 1 q 1 2KL(P, Q) and the definition of E. Notice that ˆet,ρ h (sh, bh, ah; r) b1, then for all nt h(s, a) 0, we have ˆet,ρ h (sh, ah, bh; r) min 2β(nt h(s, a), δ) nt h(s, a) + X s ˆPt h(s |s, a)ˆet,ρ h+1(s , a , b ; r) Notice that b1 [0, H], in order to find the upper bound of the estimation error over all the initial budgets, we extend the inequality to ˆet,ρ h (sh, ah, bh; r) max b1 [0,H] 2β(nt h(s, a), δ) nt h(s, a) + X s ˆPt h(s |s, a)ˆet,ρ h+1(s , a , b ; r) 2β(nt h(s, a), δ) nt h(s, a) + X s ˆPt h(s |s, a)ˆet,ρ h+1(s , a , b ; r) Risk-Sensitive Reward-Free Reinforcement Learning with CVa R Now we prove Lemma 4.5 by induction. For H + 1, since ˆet,ρ H+1(s, a, b; r) = | ˆQt,ρ H+1(s, a, b; r) Qρ H+1(s, a, b; r)| = max{0, b1} max{0, b1} = 0 and Et H+1(s, a) = 0 for all (s, a), the result is true. Assume the result holds for h+1, i.e., ˆet,ρ h+1(s, a, b; r) Et h+1(s, a; b1) for all (s, a), we have ˆet,ρ h (s, a, b; r) min 2β(nt h(s, a), δ) nt h(s, a) + X s ˆPt h(s |s, a)ˆet,ρ h+1(s , a , b ; r) 2β(nt h(s, a), δ) nt h(s, a) + X s ˆPt h(s |s, a) max a A Eh+1(s , a) = Et h(s, a) holds for h, which complete the proof. A.3. Proof of Theorem 4.7 Notice that in the exploration phase, we follow the exploration policy π rather than ρ. We begin by introducing some notations. Let Pπ h(s, a) represent the probability that the state-action pair (s, a) is reached at the h-th step of a trajectory under the exploraion policy π. We use the shorthand ph t (s, a) = ph πt(s, a) for simplicity. The pseudo-counts nt h(s, a) are defined as Pt i=1 Pi h(s, a), and we define the event Ecnt = t N , h [H], (s, a) S A : nt h(s, a) 1 2 nt h(s, a) βcnt(δ) , where βcnt(δ) = log(2SAH/δ). Recalling the event E defined in Lemma 4.5, we let F = E Ecnt and introduce the following lemma. By Lemma D.4 and the principle of inclusion-exclusion, we have P(F) = P(E Ecnt) = P(E) + P(Ecnt) P(E Ecnt) P(E) + P(Ecnt) 1 = 1 δ. From Lemma 4.6, on the event F, it is shown that CVa R τ(s1; r) CVa Rˆρ τ (s1; r) ϵ for all reward functions r, thereby proving that CVa R-RF-UCRL is (ε, δ)-PAC. We now proceed to upper bound the sample complexity of CVa R-RF-UCRL on the event F. The first step involves introducing an average upper bound on the error at step h under policy πt+1, defined as (s,a) Pt+1 h (s, a)Et h(s, a). By Lemma D.1, the average errors can be related as follows: (s,a) Pt+1 h (s, a) β(nt h(s, a), δ) nt h(s, a) 1 (s ,a ) Pt+1 h (s, a)Ph(s |s, a)I(a = πt+1(s ))Et h+1(s , a ) (s,a) Pt+1 h (s, a) β(nt h(s, a), δ) nt h(s, a) 1 For h = 1, observe that Pt+1 1 (s1, a)Et 1(s1, a) = Et 1(s1, πt+1 1 (s1))I(πt+1 1 (s1) = a), as the policy is deterministic. Now, if t < tstop, Et 1(s1, πt+1 1 (s1)) ϵ/3 by definition of the stopping rule, hence Q1 t = P a Pt+1 1 (s1, a)Et 1(s1, a) (ϵτ/3) P a A I(πt+1 1 (s1) = a) = ϵτ/3. Thus, we have (s,a) HPt+1 h (s, a) β(nt h(s, a), δ) nt h(s, a) 1 for t < tstop. 
Summing these inequalities for t {0, . . . , T} where T < tstop gives: (T + 1)ϵτ 9 t=0 Pt+1 h (s, a) β(nt h(s, a), δ) nt h(s, a) 1 Risk-Sensitive Reward-Free Reinforcement Learning with CVa R The next step involves relating the counts to the pseudo-counts, taking into account that the event Ecnt holds. Using Lemma D.5, it can be stated that, on the event F, for T < tstop, the inequality (T + 1)ϵτ 18 t=0 Pt+1 h (s, a) β(nt h(s, a), δ) nt h(s, a) 1 β(T + 1, δ) nt+1 h (s, a) nt h(s, a) p nt h(s, a) 1 , is derived, where the relation Pt+1 h (s, a) = nt+1 h (s, a) nt h(s, a), as per the definition of pseudo-counts, is used. Applying Lemma D.6 to bound the sum over t, we get: (T + 1)ϵτ 18(1 + β(T + 1, δ) n T +1 h (s, a) β(T + 1, δ) s,a n T +1 h (s, a). Given that P s,a n T +1 h (s, a) = T + 1, the inequality simplifies to: T + 1ϵτ 18(1 + β(T + 1, δ). For sufficiently large T, this inequality cannot hold, as the left-hand side grows with T, while the right-hand side is logarithmic. Therefore, tstop is finite and satisfies (applying the inequality to T = tstop 1): tstop O H4S2A The conclusion follows from Lemma D.7. B. Proof of Planning Phase B.1. Proof of Theorem 4.8 The utilization of discretization in the algorithm significantly impacts its computational tractability, and it is applied in two main areas: 1. In the dynamic programming step at each timestep h, the algorithm exclusively computes Qh(sh, bh, ah) for all sh, ah and bh within the grid. This leads to a total runtime of O(SAHη 1Tstep), where Tstep represents the time required for each step. The time complexity here arises from discretization and is a function of the state space size, action space size, and the horizon length. 2. When computing ˆb, the algorithm searches over the grid to find the solution. Since the returns distribution is supported on the grid, the τ-quantile of the return distribution (the optimal solution) exists on the grid. This computation has a time complexity of O(η 1), which is considered a lower-order term compared to the first part. It s important to note that the most time-consuming part of the algorithm is the computation of expectations, specifically the term: [Ph Vh+1] (sh, bh, ah) = Esh+1 P( |sh,ah)[V h+1(sh+1, bh+1)]. In the discretized MDP, this expectation can be computed using only grid elements, implying Tstep = O(Sη 1). As a result, the overall time complexity of this algorithm is approximately O(SAHη 1Tstep) = O(S2AHη 2). B.2. Proof of Theorem 4.9 The proof draws inspiration from (Bastani et al., 2022; Wang et al., 2023). To facilitate the discussion, we introduce the following notation. Let Zρ,M represent the returns from executing ρ in the MDP M. For random variables X, Y , Risk-Sensitive Reward-Free Reinforcement Learning with CVa R we say Y stochastically dominates X, which is denoted X Y . This dominance implies that for any real value t, the probability that Y is less than or equal to t is greater than or equal to the probability of X being less than or equal to t, i.e., t R : Pr(Y t) Pr(X t). 1) From disc(M) to M: Consider any policy ρ ΠAug and b [0, 1] (which we use in disc(M)). Define an adapted policy for use in M as follows: adapted(ρ, b1)h(sh, r1:h 1) = ρh(sh, b1 ϕ(r1) ϕ(rh 1)). The adapted policy simulates the evolution of b in disc(M) by using the history. Let Zρ,b,disc(M) be the returns from running ρ, b in disc(M). Let Zadopted(ρ,b),M be the returns from running adopted(ρ, b) in M. 
According to Lemma H.1 in (Wang et al., 2023), we almost surely have Zρ,b,disc(M) Hη Zadopted(ρ,b),M Zρ,b,disc(M). Thus, for any x R, it follows that Fρ,b,disc(M)(x) Fadapted(ρ,b),M(x) Fρ,b,disc(M)(x + Hη) where Fρ,b,disc(M) is the CDF of Zρ,b,disc(M) and Fadapted(ρ,b),disc(M) is the CDF of Zadapted(ρ,b),M. Based on these arguments and Theorem H.3 in (Wang et al., 2023), we conclude: CVa Rτ(adapted(ρ, b); M) CVa Rτ(ρ, b; disc(M)) τ 1Hη. (11) 2) From M to disc(M): Let s introduce the memory-MDP model as defined in (Wang et al., 2023) first. The memory-MDP mode augments a standard MDP with a memory generator Mh, which produces memory items mh Mh(sh, ah, rh, Hh) at each timestep. These memories are stored into the history Hh = (st, at, rt, mt)t [h 1]. The process of executing π in this memory-MDP is as follows: for any h [H], ah πh(sh, Hh), sh+1 P( |sh, ah), rh = r(sh, ah) and mh Mh(sh, ah, rh, Hh). As a result of this process, the augmented MDP with memory has a history HAug h = (st, bt, at, rt, mt)t [h 1]. This memory-MDP model allows us to capture and model dependencies on past experiences through the memory items. Building on the framework presented in (Wang et al., 2023), consider a scenario where we have a policy ρ ΠAug and an initial budget b [0, 1], which we intend to use in the original MDP M. To adapt this policy to run in disc(M), we introduce a discretized policy, which is history-dependent and incorporates memory. This policy operates in the discretized MDP disc(M) and is defined as follows: disc(ρ, b)h(sh, m1:h 1) = ρh(sh, b m1 mh 1). Indeed, this definition of the discretized policy disc(ρ, b) is designed to ensure that, despite receiving discrete rewards ˆrh in the discretized MDP disc(M), the memory element mh is carefully generated to imitate the reward that would have been received in the true MDP M. By applying Lemma H.2 in (Wang et al., 2023), we almost surely have Zρ,b,M Zdisc(ρ,b),disc(M). Consequently, if we define Fρ,b,M as the CDF of Zρ,b,M and Fdisc(ρ,b),disc(M) as the CDF of Zdisc(ρ,b),disc(M), we can establish that, x R : Fdisc(ρ,b),disc(M) Fρ,b,M. Based on these observations and utilizing Theorem H.4 in (Wang et al., 2023), we obtain CVa R τ(disc(M)) CVa R τ(M). (12) Combining Eq. (11) and Eq. (12), we have |CVa Rρ τ (s1; r) CVa Rˆρ τ(s1; r)| τ 1Hη. (13) We can satisfy the assumption about the optimization error by selecting η ϵτ/3H to ensure |CVa Rρ τ (s1; r) CVa Rˆρ τ(s1; r)| ϵ/3. Risk-Sensitive Reward-Free Reinforcement Learning with CVa R C. Proof of Lower Bound In this section, we prove our lower bound presented in Theorem 5.1. First, we develop the connection between the reward-free problem and the CVa R-reward-free RL problem. Lemma C.1. For any MDP M = (S, A, H, P, r) with initial state s1 and any policy π, there exists another MDP M = (S , A, H + 1, P , r ) with initial state s0, we have CVa Rπ,M τ (s0) = Eπ h =1 rh (sh , ah ) s1, M Proof. We set horizon h starting at 0 in M . We can build such a M = (S , A, H + 1, P , r ), where S = S s0, s , P ( |s, a) = P( |s, a) for any s S and a A, P (s1|s0, a) = τ for any a A, P (s |s0, a) = 1 τ for any a A, P (s |s , a) = 1 for any a A, r (s, a) = r(s, a) for any s S and a A, r(s0, a) = 0 for any a A, and r(s1, a) = 1 for any a A. For any policy π, PH h =1 rh (sh , ah ) equals to H with probability at least 1 τ. Thus, the τ-Va R following by any policy π in the transferred MDP M is H. 
We have CVa Rπ,M τ (s0) = max b0 [0,H]{b0 τ 1V π,M 0 (s0, b0)} h =0 r h (sh , ah ) h =1 r h (sh , ah ) h =1 r h (sh , ah ) h =1 r h (sh , ah ) s1, M # h =1 rh (sh , ah )|s1, M Now we can prove our lower bound, Theorem 5.1. Here, we restated Theorem 4.1 in (Jin et al., 2020), which show that any reward-free exploration algorithm that output ϵ-optimal policy must collect at least Ω(S2AH2/τϵ2) trajectories in expectation. Theorem C.2. (Theorem 4.1 in (Jin et al., 2020)) Consider a universal constant C > 0. For a given risk tolerance τ (0, 1], if the number of actions A 2, the number of states S C log2 A, the horizon H C log2 S, and the accuracy parameter ϵ min{1/4τ, H/48τ}, then any reward-free exploration algorithm that can output ϵ-optimal policies for an arbitrary number of adaptively chosen reward functions with a success probability δ = 1/2 must collect at least Ω(S2AH2/τϵ2) trajectories in expectation. Thus, any CVa R-RF exploration algorithm must collect at least Ω(S2AH2/ϵ2) trajectories from the state s1, in expectation, and then collect at least Ω(S2AH2/τϵ2) trajectories from the initial state s0. D. Technical Lemmas D.1. An Essential Lemma for Upper Bound The following crucial lemma establishes a relationship between the errors at step h and those at step h + 1. Risk-Sensitive Reward-Free Reinforcement Learning with CVa R Lemma D.1. On the event E, for all h [H] and (s, a) S A, Et h(s, a) 3H β(nt h(s, a), δ) nt h(s, a) 1 s S Ph(s |s, a)Eh+1(s , ρt+1(s )). Proof. By the definition of Et h(s, a) and the greedy policy ρt+1, if nt h(s, a) > 0, Et h(s, a) H 2β(nt h(s, a), δ) nt h(s, a) + X s S ˆPh(s |s, a)Eh+1(s , ρt+1(s )). By the definition of E and Pinsker s inequality, we further have X s S ˆPh(s |s, a)Et h+1(s , ρt+1(s )) s S Ph(s |s, a)Et h+1(s , ρt+1(s )) + X ˆPh(s |s, a) Pt h(s |s, a) Et h+1(s , ρt+1(s )) s S Ph(s |s, a)Et h+1(s , ρt+1(s )) + (ˆPh( |s, a) Pt h( |s, a) H s S Ph(s |s, a)Et h+1(s , ρt+1(s )) + H 2β(nt h(s, a), δ) nt h(s, a) , where we use the fact that Et h+1(s , ρt+1(s ) H. Therefore, plugging in this inequality and using 2 2 3, we have Et h(s, a) X s S Ph(s |s, a)Et h+1(s , ρt+1(s )) + 3H β(nt h(s, a), δ) nt h(s, a) . Notice that Et h(s, a) H 3H 3H + X s S Ph(s |s, a)Et h+1(s , ρt+1(s )), and this is also true for nt h(s, a) = 0 with 1/0 = + , which leads to the conclusion. D.2. Auxiliary Lemmas Lemma D.2. Va Rα = b := arg maxb R(b τ 1E[(b X)+]). Proof. Recall the definitions of CVa R and Va R, we have CVa Rτ(X) = supb b 1 τ E[(b X)+] , Va Rτ(X) = inf{x R : P(X x) τ}. By Theorem 6.2 in (Acerbi & Tasche, 2002), we have CVa Rτ(X) = E[X|X Va Rτ(X)]. Firstly, we define f(b) = b 1 τ E[(b X)+], thus the derivative of f(b) with respect to b is: f (b) = 1 1 By setting the derivative equal to zero, we have P(X b) = 1 τ. According to the definition of Va R, b is the τ-th quantile of the distribution of X, which means b = Va Rτ(X). Therefore, the critical point b that maximizes f(b) is equal to Va Rτ(X). Now we prove f(b ) = CVa Rτ(X). f(b ) = Va Rτ(X) 1 τ E[(Va Rτ(X) X)+] = Va Rτ(X) 1 Z Va Rτ (X) (Va Rτ(X) x)d F(x) Va Rτ (X) xd F(x) = E[X|X Va Rτ(X)] = CVa Rτ(X). Risk-Sensitive Reward-Free Reinforcement Learning with CVa R Lemma D.3. (Lemma F.1 in (Wang et al., 2023)) Given any ρ ΠAug, h [H], augmented state (sh, bh), and history Hh, we have V ρ h (sh, bh) = V πρ,b h (sh, bh; Hh) for b = bh + r1 + . . . + rh 1. Particularly, V ρ 1 (s1, ) = V πρ,b 1 (s1, ). Lemma D.4. 
(Lemma 10 in (Kaufmann et al., 2021)) Given β(n, δ) = log(2SAH/δ) + (S 1) log e 1 + n S 1 , it holds that P(E) 1 δ 2. Furthermore, P(Ecnt) 1 δ Lemma D.5. (Lemma 7 in (Kaufmann et al., 2021)) On the event Ecnt, for all h [H] and (s, a) S A, t N , β(nt h(s, a), δ) nt h(s, a) 1 4β( nt h(s, a), δ) nt h(s, a) 1 . Lemma D.6. (Lemma 19 in (Auer et al., 2008)) For any sequence of numbers z1, . . . , zn with 0 zk Zk 1 = max n 1, Pk 1 i=1 zi o , Lemma D.7. (Lemma 15 in (Kaufmann et al., 2021).) Let n 1 and a, b, c, d > 0. If n 2 a + b log(c + dn) then a + b log c + d 4 (a + b( c +