Published as a conference paper at ICLR 2024

PROVABLY EFFICIENT CVAR RL IN LOW-RANK MDPS

Yulai Zhao, Princeton University, yulaiz@princeton.edu
Wenhao Zhan, Princeton University, wenhao.zhan@princeton.edu
Xiaoyan Hu, The Chinese University of Hong Kong, xyhu21@cse.cuhk.edu.hk
Ho-fung Leung, Independent Researcher, ho-fung.leung@outlook.com
Farzan Farnia, The Chinese University of Hong Kong, farnia@cse.cuhk.edu.hk
Wen Sun, Cornell University, ws455@cornell.edu
Jason D. Lee, Princeton University, jasonlee@princeton.edu

(Equal contribution. Listing order is random.)

We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance τ. Prior theoretical work studying risk-sensitive RL focuses on the tabular Markov Decision Process (MDP) setting. To extend CVaR RL to settings where the state space is large, function approximation must be deployed. We study CVaR RL in low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume the underlying transition kernel admits a low-rank decomposition, but unlike prior linear models, low-rank MDPs do not assume the feature or state-action representation is known. We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to carefully balance the interplay between exploration, exploitation, and representation learning in CVaR RL. We prove that our algorithm achieves a sample complexity of $\widetilde{O}\!\left(\frac{H^7 A^2 d^4}{\tau^2\epsilon^2}\right)$ to yield an ϵ-optimal CVaR, where H is the length of each episode, A is the cardinality of the action space, and d is the dimension of the representations. On the computational side, we design a novel discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that we can find a near-optimal policy in polynomial running time with a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in low-rank MDPs.

1 INTRODUCTION

As a widely adopted framework for challenging sequential decision-making problems, Reinforcement Learning (RL) (Sutton & Barto, 2018) has demonstrated many empirical successes, e.g., in popular strategy games (Silver et al., 2016; 2017; Vinyals et al., 2019). However, the classical RL framework often prioritizes maximizing the expected cumulative reward alone, which renders it unsuitable for many real-world applications that face possible failure or safety concerns. In high-stakes scenarios such as autonomous driving (Isele et al., 2018; Wen et al., 2020), finance (Davis & Lleo, 2008; Wang et al., 2021; Filippi et al., 2020), and healthcare (Ernst et al., 2006; Coronato et al., 2020), optimizing only the expected cumulative reward can produce policies that underestimate the risks associated with rare but catastrophic events. To mitigate this limitation, risk-sensitive decision-making has recently grown more popular (Hu & Leung, 2023). To this end, risk measures like Conditional Value-at-Risk (CVaR) (Rockafellar et al., 2000) have been integrated into RL-based systems as a performance criterion, yielding a more balanced approach that encourages policies to avoid high-risk outcomes. As a popular tool for managing risk, the CVaR metric is widely used in robust RL (Hiraoka et al., 2019), distributional RL (Achab et al., 2022), and portfolio optimization (Krokhmal et al., 2002).
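To make the contrast between the risk-neutral objective and CVaR concrete before the formal definition given in the next paragraph, the following minimal Python sketch (ours, purely illustrative and not from the paper) compares the empirical mean with the empirical CVaR of two toy return distributions that have roughly the same expectation but very different tails; the distributions, sample sizes, and the helper `empirical_cvar` are our own assumptions.

```python
import numpy as np

def empirical_cvar(returns, tau):
    """Average of the worst tau-fraction of sampled returns
    (the reward-maximizing CVaR convention used throughout this paper)."""
    returns = np.sort(np.asarray(returns))        # ascending, so the worst returns come first
    k = max(1, int(np.ceil(tau * len(returns))))  # size of the worst tau-fraction
    return returns[:k].mean()

rng = np.random.default_rng(0)
# Two toy return distributions with (roughly) the same mean but very different tails.
safe = rng.normal(loc=1.0, scale=0.1, size=100_000)
risky = np.where(rng.random(100_000) < 0.95, 1.05, 0.0)  # rare catastrophic zero return

for name, x in [("safe", safe), ("risky", risky)]:
    print(name,
          "mean:", round(float(x.mean()), 3),
          "CVaR_0.05:", round(float(empirical_cvar(x, 0.05)), 3))
```

A risk-neutral learner sees the two distributions as nearly identical, whereas the CVaR criterion at τ = 0.05 sharply separates them, which is exactly the failure mode the risk-sensitive objective is designed to avoid.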
CVaR quantifies the expected return in the worst-case scenarios.¹ For a random variable X with distribution P and a confidence level τ ∈ (0, 1), the CVaR at confidence level τ is defined as $\mathrm{CVaR}_\tau(X) = \sup_{b\in\mathbb{R}}\left\{b - \frac{1}{\tau}\,\mathbb{E}_{X\sim P}\left[(b - X)_+\right]\right\}$, where $(x)_+ = \max(x, 0)$. In this paper, we consider the random variable X to be the cumulative reward of a specific policy, where the randomness comes from the MDP transitions and the policy itself. We aim to maximize this objective to capture the average worst-case values of the return distribution at a given risk tolerance level τ.

The works most related to this paper are Bastani et al. (2022) and Wang et al. (2023). Specifically, Bastani et al. (2022) proved the first regret bounds for risk-sensitive RL with the CVaR metric (CVaR RL), and Wang et al. (2023) improved the results to achieve the minimax-optimal regret. However, their algorithms are restricted to the tabular setting, so they are inefficient when the state space is large. To overcome this limitation, we introduce function approximation into the MDP structure and consider low-rank MDPs (Jiang et al., 2017) in this paper. In low-rank MDPs, the key idea is to exploit the (high-dimensional) structure in the model transitions of the MDP by representing them with a lower-dimensional structure. This is often implemented by assuming the MDP transition kernels admit a low-rank factorization, i.e., there exist two unknown mappings $\psi^*(s')$ and $\phi^*(s, a)$ such that the probability of transiting to the next state s' under the current state-action pair (s, a) is $P(s'|s,a) = \langle\psi^*(s'), \phi^*(s,a)\rangle$. Low-rank MDPs generalize popular RL models including tabular MDPs, linear MDPs (Jin et al., 2020), and block MDPs (Du et al., 2019; Efroni et al., 2021), and are widely applied in practice (Zhang et al., 2020; 2021; Sodhani et al., 2021; 2022).

Unlike in linear MDPs, since the underlying ψ* and φ* are unknown, we need to carefully balance representation learning, exploration, and worst-case failure stakes in low-rank MDPs. The application of non-linear function approximation further complicates the problem, since controlling model errors state- and action-wise becomes much harder (Wang et al., 2023). Moreover, even within a given MDP, planning in CVaR RL is not a straightforward task, as the CVaR metric exhibits a non-linear structure² (in stark contrast to standard risk-neutral RL), which results in extra computational overhead. All of these challenges call for a more comprehensive algorithm design for risk-sensitive RL in low-rank MDPs. In this paper, we aim to fill a gap in the existing body of knowledge by exploring the interplay between risk-averse RL and low-rank MDPs. We summarize our contributions as follows.

Contributions. First, we design an oracle-efficient representation learning algorithm called ELA (REpresentation Learning for CVAR), which optimizes the CVaR metric in low-rank MDPs. ELA leverages an MLE oracle to learn the model dynamics while simultaneously constructing Upper Confidence Bound (UCB)-type bonuses to encourage exploration of the unknown environment. We provide a comprehensive theoretical analysis of the algorithm, demonstrating that ELA yields an ϵ-optimal CVaR with $O(1/\epsilon^2)$ samples. To the best of our knowledge, ELA is the first provably sample-efficient algorithm for CVaR RL in low-rank MDPs. Second, to improve the computational complexity of planning in ELA, we introduce a computationally efficient planning oracle, which, when combined with ELA, leads to the ELLA (REpresentation Learning with LSVI for CVAR) algorithm.
This algorithm leverages least-squares value iteration with discretized rewards to find near-optimal policies via optimistic simulations within the learned model. Importantly, the computational cost depends only on the dimension of the representations rather than the size of the state space. We show that ELLA requires polynomial running time in addition to polynomially many calls to the MLE oracle.

¹The standard CVaR definition considers the average worst-case loss, so a lower value is more desirable. However, in this work, we aim to obtain a higher CVaR in the RL context as we are maximizing rewards.
²In RL, planning refers to determining the optimal policy and optimal value function within a given MDP.

2 RELATED WORK

Low-rank MDPs. The theoretical benefits of low-rank structure in MDPs have been broadly explored in various works (Jiang et al., 2017; Sun et al., 2019; Du et al., 2021; Sekhari et al., 2021; Huang et al., 2023). In contextual decision processes (a generic RL model with rich observations and function approximation) with low Bellman rank, OLIVE (Jiang et al., 2017) yields a near-optimal policy with polynomially many samples. Additionally, Sun et al. (2019) introduced provably efficient algorithms based on a structural parameter called the witness rank, demonstrating that the witness rank is never larger than the Bellman rank (Jiang et al., 2017). Despite their provable efficiency, these algorithms lack computational efficiency. Leveraging Maximum Likelihood Estimation (MLE) as its computation oracle, FLAMBE (Agarwal et al., 2020) is the first computationally efficient algorithm that is also provably efficient in low-rank MDPs. Following a similar setup, Rep-UCB (Uehara et al., 2022) improves the sample complexity dependencies of FLAMBE. The key to the improvement is a careful tradeoff between exploration and exploitation, obtained by combining the reward signal with an exploration bonus.

CVaR RL. There is a long line of work studying (static) CVaR in RL, which refers to the CVaR of the cumulative reward beyond a certain risk threshold (Chow & Ghavamzadeh, 2014; Chow et al., 2015; Tamar et al., 2015; Bastani et al., 2022; Wang et al., 2023). For tabular MDPs, much work has been done on value-based algorithms (Chow et al., 2015; Stanko & Macek, 2019). However, these results often require the planner to know the model transitions of the MDP, which is generally infeasible. Recently, there has been growing interest in the more general setting where the unknown model transitions are learned through online interactions (Yu et al., 2018; Bastani et al., 2022; Wang et al., 2023). However, their results are restricted to the tabular setting and cannot be combined with function approximation, where the state space is often enormous.

Prior works have also studied risk-sensitive RL with other risk measures. Specifically, Du et al. (2022) proposed Iterated CVaR RL, also known as dynamic CVaR, and Chen et al. (2023) established regret guarantees for Iterated CVaR RL with general function approximation. However, iterated CVaR is intrinsically different from the static CVaR in our setting. Iterated CVaR quantifies the worst τ-fraction performance at each step of decision-making. Such a definition allows the agent to tightly control the risk throughout the decision process, whereas our setting aims to maximize the CVaR of the total reward. Therefore, their algorithm designs and analysis techniques do not apply to our tasks.
Our work also focuses on the static CVaR in RL. Specifically, we present the first sample-efficient algorithm for optimizing the static CVaR metric that carefully balances the interplay between risk-averse RL and the low-rank MDP structure. Furthermore, we design a computationally efficient planning oracle that lets our algorithm run in polynomial time with an MLE oracle.

3 PRELIMINARIES

Notations. We frequently use [H] to denote the set {1, . . . , H}. Denote by Δ(S) the set of distributions over a space S. Let U(A) be the uniform distribution over the action space. In this work, we use the standard O(·), Ω(·), and Θ(·) notations to hide universal constant factors, and use $\widetilde{O}(\cdot)$ to hide logarithmic factors. Please see Table 1 for a comprehensive list of the notations used in this work.

3.1 LOW-RANK EPISODIC MDP

We consider an episodic MDP M with episode length H ∈ ℕ, state space S, and a finite action space A. At each episode k ∈ [K], a trajectory τ = (s₁, a₁, s₂, a₂, . . . , s_H, a_H) is generated by an agent, where (a) s₁ ∈ S is a fixed starting state,³ (b) at step h, the agent chooses an action according to a history-dependent policy $a_h\sim\pi(\cdot|\tau_{h-1}, s_h)$, where $\tau_{h-1} := (s_1, a_1, \ldots, a_{h-1})$ denotes the history, and (c) the model transits to the next state $s_{h+1}\sim P^*_h(\cdot|s_h, a_h)$. The agent repeats these steps until the end of the episode. For each time step, the operator $P^*_h: S\times A\to\Delta(S)$ denotes the (non-stationary) transition dynamics, whereas $r_h: S\times A\to\Delta([0,1])$ is the (immediate) reward distribution the agent could receive from deploying a certain action at a specific state. We use Π to denote the class of all history-dependent policies. Below, we proceed to the low-rank MDP definition.

³Note that any H-length episodic MDP with a stochastic initial state is equivalent to an (H + 1)-length MDP with a dummy initial state s₀.

Definition 3.1 (Low-rank episodic MDP). The transition kernel $P^* = \{P^*_h: S\times A\to\Delta(S)\}_{h\in[H]}$ admits a low-rank decomposition with rank d if there exist two embedding functions $\phi^* := \{\phi^*_h: S\times A\to\mathbb{R}^d\}_{h\in[H]}$ and $\psi^* := \{\psi^*_h: S\to\mathbb{R}^d\}_{h\in[H]}$ such that
$$P^*_h(s'|s,a) = \langle\psi^*_h(s'), \phi^*_h(s,a)\rangle, \qquad (1)$$
where $\|\phi^*_h(s,a)\|_2 \le 1$ for all $(h, s, a)\in[H]\times S\times A$, and for any function $g: S\to[0,1]$ and $h\in[H]$ it holds that $\left\|\int_{s\in S}\psi^*_h(s)\,g(s)\,ds\right\|_2 \le \sqrt{d}$. A low-rank episodic MDP admits such a decomposition of $P^*$.

We study the function approximation setting where the state space S can be enormous and even infinite. To generalize across states, we assume access to two embedding classes $\Psi\subseteq\{\psi: S\to\mathbb{R}^d\}$ and $\Phi\subseteq\{\phi: S\times A\to\mathbb{R}^d\}$, which are used to identify the true embeddings $(\psi^*, \phi^*)$. Formally, we need the following realizability assumption, which is known to be essential for obtaining performance guarantees independent of the state space size in low-rank MDPs (Agarwal et al., 2020).

Assumption 3.2. There exists a model class $F = (\Psi, \Phi)$ such that $\psi^*_h\in\Psi$ and $\phi^*_h\in\Phi$ for all $h\in[H]$.

We expect to obtain a sample complexity that scales logarithmically with the (finite) cardinality of the model class F. Extensions to infinite function classes via complexity measures are possible; see further discussion in Agarwal et al. (2020).

3.2 RISK-SENSITIVE RL AND AUGMENTED MDP

We study risk-sensitive RL with the CVaR metric. Throughout the paper, let τ ∈ (0, 1] be a fixed risk tolerance. First, we recall the classical definition: for a random variable X ∈ [0, 1] drawn from a distribution P, the conditional value-at-risk (CVaR) corresponding to the risk tolerance τ is defined as
$$\mathrm{CVaR}_\tau(X) := \sup_{c\in[0,1]}\left\{c - \tau^{-1}\,\mathbb{E}_{X\sim P}\left[(c - X)_+\right]\right\}, \qquad (2)$$
where $(x)_+ := \max(x, 0)$ for any $x\in\mathbb{R}$.
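To illustrate definition (2), the following minimal Python sketch (ours, purely illustrative) estimates $\mathrm{CVaR}_\tau$ from i.i.d. samples in two equivalent ways: by averaging the worst τ-fraction of samples, and by a brute-force search over the budget c on a grid, mirroring how the discretized planning oracle of Section 5 treats the supremum over c. The grid resolution, sample distribution, and function names are our own assumptions.

```python
import numpy as np

def cvar_sorted(x, tau):
    """CVaR_tau(X): mean of the worst tau-fraction of samples (reward-maximizing convention)."""
    x = np.sort(np.asarray(x))
    k = max(1, int(np.ceil(tau * len(x))))
    return x[:k].mean()

def cvar_sup_c(x, tau, grid_size=501):
    """CVaR_tau(X) = sup_{c in [0,1]} { c - E[(c - X)_+] / tau }, approximated on a grid of c."""
    x = np.asarray(x)
    best = -np.inf
    for c in np.linspace(0.0, 1.0, grid_size):
        best = max(best, c - np.maximum(c - x, 0.0).mean() / tau)
    return best  # the maximizing c is (approximately) the tau-quantile (VaR) of X

rng = np.random.default_rng(0)
x = rng.beta(5.0, 2.0, size=20_000)              # toy bounded returns in [0, 1]
print(cvar_sorted(x, 0.1), cvar_sup_c(x, 0.1))   # the two estimates nearly coincide
```

The agreement of the two estimates reflects the fact, used repeatedly below, that the supremum in (2) is attained at the τ-quantile of X.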
Interestingly, the supremum in this expression is achieved when c is set to the τ-th quantile (unknown before planning), also known as the value-at-risk (VaR), i.e., $x_\tau = \inf\{x\in\mathbb{R}: P(X\le x)\ge\tau\}$. In risk-sensitive RL, X represents the stochastic cumulative reward accumulated over H successive actions and state transitions: $X = \sum_{h=1}^H r_h$. This multi-step nature of risk-sensitive RL brings a dynamic planning structure to the setting and makes it difficult. We may gain an intuitive understanding of c by viewing it as an initial budget, which affects the agent's action selection and needs to be carefully managed during planning. In particular, after receiving a random reward $r_h$ at each timestep $h\in[H]$, the learner deducts it from the current budget, i.e., $c_{h+1} = c_h - r_h$, where $c_h$ is the remaining budget and $c_1 = c$. The main task of risk-sensitive RL is to carefully manage the interplay between policy planning, exploration, and budget control. Inspired by this observation, we incorporate the available budget as an additional state variable that can impact the agent's choice of action. A formal description of the augmented MDP is given below.

Augmented MDP. We introduce the augmented MDP framework (Bäuerle & Ott, 2011) to study CVaR RL, which augments the state space S of the classic episodic MDP to $S_{\mathrm{Aug}} = S\times[0,H]$, including the budget variable as an additional state. The transition kernel of the state and the immediate reward distribution are the same as in the original MDP M. Inspired by the observation above, we consider policies defined on the augmented state space, $\Pi_{\mathrm{Aug}} := \{\pi: S_{\mathrm{Aug}}\to\Delta(A)\}$. In the augmented MDP, the agent rolls out an augmented policy $\pi\in\Pi_{\mathrm{Aug}}$ with initial budget $c_1\in[0,H]$ as follows. At the beginning of an episode, the agent observes the augmented state $(s_1, c_1)$, selects action $a_1\sim\pi(\cdot|s_1, c_1)$, receives reward $r_1\sim r_1(s_1, a_1)$, and transits to $s_2\sim P^*_1(\cdot|s_1, a_1)$. Most importantly, the available budget is also updated: $c_2 = c_1 - r_1$. The agent then chooses an action based on the pair $(s_2, c_2)$. The procedure repeats H times until the end of the episode. For any $\pi\in\Pi_{\mathrm{Aug}}$, the (augmented) Q-function is defined as
$$Q^\pi_{h,P^*}(s, c, a) := \mathbb{E}_{\pi, P^*}\left[\left(c - \sum_{t=h}^{H} r_t(s_t, a_t)\right)_+\,\middle|\, s_h = s,\ c_h = c,\ a_h = a\right]$$
for any $(h, s, c, a)\in[H]\times S\times[0,H]\times A$, and the (augmented) value function is defined as
$$V^\pi_{h,P^*}(s, c) := \mathbb{E}_{a\sim\pi_h(\cdot|s,c)}\left[Q^\pi_{h,P^*}(s, c, a)\right] \qquad (3)$$
for any $(h, s, c)\in[H]\times S\times[0,H]$. The Bellman equation is given by $Q^\pi_{h,P^*}(s, c, a) = \mathbb{E}_{s'\sim P^*_h(\cdot|s,a),\ r\sim r_h(s,a)}\left[V^\pi_{h+1,P^*}(s', c - r)\right]$.

Goal metric. In this paper, we aim to find the optimal history-dependent policy maximizing the CVaR objective, i.e.,
$$\mathrm{CVaR}^*_\tau := \max_{\pi\in\Pi}\mathrm{CVaR}_\tau(R(\pi)) = \sup_{c\in[0,H]}\left\{c - \tau^{-1}\min_{\pi\in\Pi}\mathbb{E}\left[(c - R(\pi))_+\right]\right\},$$
where $R(\pi)$ is the random cumulative reward of policy π in M. Nevertheless, it is known that $\min_{\pi\in\Pi}\{\tau^{-1}\mathbb{E}[(c - R(\pi))_+]\}$ can be attained by an augmented policy $\pi\in\Pi_{\mathrm{Aug}}$ with initial budget c (Wang et al., 2023). Thus we can indeed restrict the search to $\pi\in\Pi_{\mathrm{Aug}}$:
$$\mathrm{CVaR}^*_\tau = \max_{c\in[0,H]}\left\{c - \tau^{-1}\min_{\pi\in\Pi_{\mathrm{Aug}}} V^\pi_{1,P^*}(s_1, c)\right\} := \mathrm{CVaR}_\tau(R(\pi^*, c^*)), \qquad (4)$$
where $\pi^*\in\Pi_{\mathrm{Aug}}$ is the optimal augmented policy and $c^*\in[0,H]$ is the optimal initial budget, and we overload $R(\pi, c)$ to denote the stochastic cumulative reward when the agent rolls out the augmented policy π with initial budget c in the augmented MDP.
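The augmented-MDP construction above is mechanically simple: the only change to a standard rollout is the budget bookkeeping $c_{h+1} = c_h - r_h$. The hedged Python sketch below (ours, not the paper's code) rolls out an augmented policy and returns $(c_1 - \sum_h r_h)_+$, whose Monte Carlo average estimates $V^\pi_{1,P^*}(s_1, c_1)$; the interfaces `env_step` and `policy` are assumptions for illustration.

```python
import numpy as np

def rollout_augmented(env_step, policy, s1, c1, H, rng):
    """Roll out an augmented policy pi(s, c) for H steps, tracking the budget c_{h+1} = c_h - r_h.
    Returns (c_1 - sum_h r_h)_+, whose expectation over rollouts is V^pi_1(s_1, c_1).
    Assumed interfaces: env_step(s, a, h, rng) -> (r, s_next); policy(s, c, h, rng) -> a."""
    s, c, total_reward = s1, c1, 0.0
    for h in range(1, H + 1):
        a = policy(s, c, h, rng)        # action depends on the augmented state (s, c)
        r, s = env_step(s, a, h, rng)   # sample reward and next state
        total_reward += r
        c = c - r                       # budget update: c_{h+1} = c_h - r_h
    return max(c1 - total_reward, 0.0)  # (c_1 - R)_+

def estimate_value(env_step, policy, s1, c1, H, n_rollouts=10_000, seed=0):
    """Monte Carlo estimate of V^pi_1(s_1, c_1) in the augmented MDP."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout_augmented(env_step, policy, s1, c1, H, rng)
                    for _ in range(n_rollouts)])
```

Minimizing this quantity over policies for each candidate budget c, and then maximizing $c - \tau^{-1}V^\pi_1(s_1, c)$ over c, recovers objective (4).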
4 ALGORITHM

In this section, we present the ELA algorithm for risk-sensitive RL in low-rank MDPs. The pseudocode is listed in Algorithm 1. The algorithm is iterative in nature, where the k-th iteration proceeds in three phases: (1) we collect new data by rolling in with the exploration policy $\pi^{k-1}$ starting from the initial budget $c^{k-1}$. (2) All transitions collected so far are then used in two ways. First, we pass all transition tuples to the MLE oracle (Line 11); the MLE oracle returns embedding functions $(\hat\psi_h, \hat\phi_h)$ for each h, which determine the model. Second, we compute the exploration bonus using the latest learned representation $\hat\phi$. (3) The algorithm performs value iteration on the learned model with the bonus-enhanced reward signal to obtain the exploration policy-budget pair $(\pi^k, c^k)$ used in the next iteration. After K iterations, we output the current model and all policy-budget pairs. Next, we illustrate these steps in more detail.

4.1 DATA COLLECTION

During training, we collect two (disjoint) sets of transition tuples to compute the bonus terms and to estimate the transition kernels. These two datasets differ in their (marginalized) distributions and facilitate the regret analysis in Section 4.3. Next, we clarify how data are collected and added to the two sets. To collect a new transition tuple for each dataset $D_h$ and $\widetilde D_h$ (for every $h\in[H-1]$), at the k-th iteration the algorithm rolls out policy $\pi^{k-1}$ with initial budget $c^{k-1}$ to obtain two trajectories. The difference occurs at the (h-1)-th timestep. In particular, to obtain $(s_h, a_h, s_{h+1})$ for $D_h$, the algorithm executes policy $\pi^{k-1}$ for one more step and observes state $s_h$. Then, a uniform action $a_h\sim U(A)$ is taken to receive the next state $s_{h+1}\sim P^*_h(\cdot|s_h, a_h)$. In contrast, to obtain $(\bar s_h, \bar a_h, s'_{h+1})$ for $\widetilde D_h$, the algorithm takes two consecutive uniform actions at timesteps h-1 and h, i.e., $\bar a_{h-1}\sim U(A)$, $\bar s_h\sim P^*_{h-1}(\cdot|s_{h-1}, \bar a_{h-1})$, $\bar a_h\sim U(A)$, and receives state $s'_{h+1}$. Intuitively, dataset $D_h$ reveals the behavior of the exploration policy-budget pair $(\pi^k, c^k)$ up to the h-th timestep, which facilitates the design of the bonus terms (cf. Lemma C.2). Meanwhile, dataset $\widetilde D_h$ enhances exploration and leads to improved estimates of the transition kernels (cf. Lemma E.3). At first sight, the data collection procedure requires 2(H-1) trajectories per iteration (i.e., one trajectory for $D_h$ and one trajectory for $\widetilde D_h$). To save sample (trajectory) complexity, we collect the transition tuples for $D_h$ and $\widetilde D_{h+1}$ at steps h and h+1 of the same trajectory (Lines 8-10). Therefore, the algorithm only requires H trajectories per iteration. For both sets, new transition tuples are concatenated with the existing data to perform representation learning, i.e., learning a factorization and a representation by MLE (Line 11).

4.2 REPRESENTATION LEARNING AND BONUS-DRIVEN VALUE ITERATION

MLE oracle. In the function approximation setting, the agent must estimate the model structure as accurately as possible. The collected transition tuples are used to estimate the model transitions through the MLE oracle. As a general approach for estimating the parameters of a probability model, MLE has also gained traction in low-rank MDPs (Agarwal et al., 2020; Uehara et al., 2022); a schematic sketch of this step is given below.
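The following minimal Python sketch (ours, illustrative only) mirrors the spirit of the MLE step in Line 11 of Algorithm 1: among a finite set of candidate embedding pairs, pick the $(\psi, \phi)$ maximizing $\sum\log\langle\psi(s_{h+1}), \phi(s_h, a_h)\rangle$ over the collected transitions. The candidate representation as a list of callables is an assumption made purely for illustration; in practice this oracle is implemented with parametric function classes and gradient-based optimization.

```python
import numpy as np

def mle_oracle(transitions, candidates, eps=1e-12):
    """Pick the (psi, phi) pair maximizing the log-likelihood of observed transitions:
    argmax_(psi, phi) sum_{(s, a, s')} log <psi(s'), phi(s, a)>.
    `transitions` is a list of (s, a, s_next); `candidates` is a finite list of (psi, phi),
    where psi(s') and phi(s, a) return d-dimensional numpy arrays (assumed interface)."""
    best, best_ll = None, -np.inf
    for psi, phi in candidates:
        ll = 0.0
        for s, a, s_next in transitions:
            p = float(np.dot(psi(s_next), phi(s, a)))  # model probability P_hat(s'|s, a)
            ll += np.log(max(p, eps))                  # guard against log(0)
        if ll > best_ll:
            best, best_ll = (psi, phi), ll
    return best
```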
Value Iteration. Based on the learned model, the algorithm runs value iteration (VI) with the exploration bonus term. In risk-sensitive RL, for any $\pi\in\Pi_{\mathrm{Aug}}$ and $(h, s, c)\in[H]\times S\times[0,H]$, we define the value function enhanced with an exploration bonus $b_h: S\times A\to\mathbb{R}$ as
$$V^\pi_{h,P,b}(s, c) := \mathbb{E}_{\pi, P}\left[\left(c - \sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'})\right)_+ - \sum_{h'=h}^{H} b_{h'}(s_{h'}, a_{h'})\,\middle|\, s_h = s,\ c_h = c\right], \qquad (5)$$
where we deduct the exploration bonus term because the agent seeks to minimize the value function in (4) in risk-sensitive RL. Such a value function is used to perform VI and update the policy on the learned model $\widehat P$, since the learner has no prior knowledge of the true model transitions. Therefore, obtaining an accurate estimate of the model determines the quality of the output policy. In Algorithm 1, we assume the learner has access to exact VI (Line 15). However, we remark that this step is not computationally efficient due to the continuity of c and the potentially large state space S. To overcome the computational barrier arising from the nature of controlling risk while planning, in Section 5 we provide a computationally efficient planning oracle that performs LSVI with a discretized reward function (with sufficiently high precision). We rigorously prove that we only need polynomial running time with the MLE oracle to output a near-optimal CVaR.

4.3 MAIN RESULTS

In this subsection, we present our theoretical guarantees, which characterize the regret and sample complexity bounds.

Theorem 4.1. Fix δ ∈ (0, 1). Set the parameters in Algorithm 1 as
$$\alpha_k = O\!\left(\sqrt{H^2(|A| + d^2)\log\frac{|F|Hk}{\delta}}\right), \qquad \lambda_k = O\!\left(d\log\frac{|F|Hk}{\delta}\right).$$
We have two equivalent interpretations of the theoretical results. In terms of the regret bound, with probability at least 1 - δ, the regret is bounded by
$$\sum_{k=1}^{K}\left[\mathrm{CVaR}^*_\tau - \mathrm{CVaR}_\tau(R(\pi^k, c^k))\right] = \widetilde{O}\!\left(\tau^{-1} H^3 A d^2\sqrt{K\log(|F|/\delta)}\right).$$
Alternatively, in terms of sample complexity: with probability at least 1 - δ, to output an ϵ-optimal policy and budget pair such that $\mathrm{CVaR}^*_\tau - \mathrm{CVaR}_\tau(R(\hat\pi, \hat c)) \le \epsilon$, the total number of trajectories required is upper bounded by
$$\widetilde{O}\!\left(\frac{H^7 A^2 d^4\log(|F|/\delta)}{\tau^2\epsilon^2}\right).$$

Proof. The proof can be found in Appendix C.

In Theorem 4.1, we present the first regret/sample-complexity bounds for CVaR RL with function approximation, in which exploring the unknown state/action spaces poses extra difficulty. We explicitly characterize the number of samples required to output an ϵ-optimal CVaR in terms of the episode length H, the action space size A, the representation dimension d, and the confidence level τ. The theorem has no explicit dependence on the state space S, giving meaningful guarantees even in the infinite-state setting.

Algorithm 1 ELA
Require: Risk tolerance τ ∈ (0, 1], number of iterations K, parameters $\{\lambda_k\}_{k\in[K]}$ and $\{\alpha_k\}_{k\in[K]}$, model class F = {Ψ, Φ}, failure probability δ ∈ (0, 1).
1: Set datasets $D_h, \widetilde D_h \leftarrow \emptyset$ for each $h\in[H-1]$.
2: Initialize the exploration policy $\pi^0 \leftarrow \{\pi^0_h(s, c) = U(A) \text{ for any } (s, c)\in S\times[0, H]\}_{h\in[H]}$.
3: Initialize the budget $c^0 \leftarrow 1$.
4: for iteration k = 1, . . . , K do
5:   Collect a tuple $(\bar s_1, \bar a_1, s'_2)$ by taking $\bar a_1\sim U(A)$, $s'_2\sim P^*_1(\cdot|\bar s_1, \bar a_1)$.
6:   Update $\widetilde D_1 \leftarrow \widetilde D_1\cup\{(\bar s_1, \bar a_1, s'_2)\}$.
7:   for h = 1, . . . , H - 1 do
8:     Collect two transition tuples $(s_h, a_h, s_{h+1})$ and $(s_{h+1}, \bar a_{h+1}, s'_{h+2})$ by first rolling out $\pi^{k-1}$ starting from $(s_1, c^{k-1})$ into state $s_h$, taking $a_h\sim U(A)$, and receiving $s_{h+1}\sim P^*_h(\cdot|s_h, a_h)$; then taking $\bar a_{h+1}\sim U(A)$ and receiving $s'_{h+2}\sim P^*_{h+1}(\cdot|s_{h+1}, \bar a_{h+1})$.
9:     Update $D_h \leftarrow D_h\cup\{(s_h, a_h, s_{h+1})\}$.
10:    Update $\widetilde D_{h+1} \leftarrow \widetilde D_{h+1}\cup\{(s_{h+1}, \bar a_{h+1}, s'_{h+2})\}$ if $h\le H-2$.
11:    Learn representations via MLE:
       $\widehat P_h := (\hat\psi_h, \hat\phi_h) \leftarrow \arg\max_{(\psi,\phi)\in F}\sum_{(s_h, a_h, s_{h+1})\in D_h\cup\widetilde D_h}\log\langle\psi(s_{h+1}), \phi(s_h, a_h)\rangle.$
12:    Update the empirical covariance matrix $\widehat\Sigma_h = \sum_{(s,a)\in D_h}\hat\phi_h(s, a)\hat\phi_h(s, a)^\top + \lambda_k I_d$.
13:    Set the exploration bonus: $\hat b_h(s, a) \leftarrow \min\left\{\alpha_k\sqrt{\hat\phi_h(s, a)^\top\widehat\Sigma_h^{-1}\hat\phi_h(s, a)},\ 2\right\}$.
14:  end for
15:  Run value iteration (VI) and obtain $c^k \leftarrow \arg\max_{c\in[0,H]}\left\{c - \tau^{-1}\min_\pi V^\pi_{1,\widehat P,\hat b}(s_1, c)\right\}$.
16:  Set $\pi^k \leftarrow \arg\min_\pi V^\pi_{1,\widehat P,\hat b}(s_1, c^k)$.
17: end for
Ensure: Uniformly sample k from [K]; return $(\hat\pi, \hat c) = (\pi^k, c^k)$.

As far as we are aware, the only existing theoretical guarantees for risk-averse CVaR RL were provided in the tabular setting (Bastani et al., 2022; Wang et al., 2023), where the representations are known. Moreover, these results explicitly depend on |S| and thus do not apply to the function approximation setting with an enormous state space. Given the above, our work takes a significant leap by incorporating exploration in the general function approximation setting, which aligns better with real-world applications than the tabular and/or linear MDP settings. As for the sample complexity, the dependencies on A and d match the rates from the analysis of risk-neutral RL in low-rank MDPs (Uehara et al., 2022), showing that our algorithm overcomes the extra hardness of balancing exploration and budget control even with a large action space. Our sample complexity matches the dependency on τ of the CVaR-UCBVI algorithm with the Hoeffding bonus (Wang et al., 2023) and the UCB algorithm of Bastani et al. (2022). On the other hand, the sample complexity is subject to an $\Omega\left(\frac{1}{\tau\epsilon^2}\right)$ lower bound (Wang et al., 2023, Theorem 3.1). Tightening the dependency on τ is left as future work. Our theoretical guarantee has a slightly worse dependency of $H^7$, while Rep-UCB (Uehara et al., 2022) scales as $H^5$, because we do not assume the trajectory reward is normalized to 1. If we adopt this reward normalization, our PAC bound can be improved to $\widetilde O(H^5)$, thus matching that rate. In addition, we point out that it might be possible to shave another $H^2$ factor by using a Bernstein-type bonus in the algorithm instead of the Hoeffding-type bonus, which is left for future work.

5 PLANNING ORACLE: CVaR-LSVI

In Algorithm 1, we need to calculate $c^k \leftarrow \arg\max_{c\in[0,H]}\left\{c - \tau^{-1}\min_\pi V^\pi_{1,\widehat P,\hat b}(s_1, c)\right\}$ (Line 15), which is not naively computationally efficient since the objective $c - \tau^{-1}\min_\pi V^\pi_{1,\widehat P,\hat b}(s_1, c)$ is not concave. In this section, we introduce a feasible planning oracle for this step and provide the corresponding theoretical guarantees. For simplicity, we assume the reward distribution r is discrete and $r_h(s, a)$ only takes values $i\upsilon$ where $\upsilon > 0$ and $0\le i\le\lceil 1/\upsilon\rceil$. For a continuous r, we can discretize it with sufficiently high precision so that all of the analysis still applies; the details are deferred to Appendix B. Since the supremum of the CVaR objective (2) is attained at the τ-th quantile of the cumulative reward distribution, which is also discrete when the reward $r_h$ is discrete, we only need to search for $c^k$ within the grid in Line 15 of Algorithm 1:
$$c^k \leftarrow \upsilon\cdot\arg\max_{0\le i\le\lceil H/\upsilon\rceil}\left\{i\upsilon - \tau^{-1}\min_{\pi\in\Pi_{\mathrm{Aug}}} V^\pi_{1,\widehat P,\hat b}(s_1, i\upsilon)\right\}.$$
Nevertheless, if we run standard value iteration to calculate $\min_{\pi\in\Pi_{\mathrm{Aug}}} V^\pi_{1,\widehat P,\hat b}(s_1, i\upsilon)$, the complexity will scale with the size of the state space |S|, which can be infinite in low-rank MDPs. To circumvent this dependency on |S|, we introduce a novel LSVI-UCB-style algorithm for the CVaR objective in this section, called CVaR-LSVI. In the following discussion, we fix $i_1$ with $0\le i_1\le\lceil H/\upsilon\rceil$ and aim at calculating $\min_{\pi\in\Pi_{\mathrm{Aug}}} V^\pi_{1,\widehat P,\hat b}(s_1, i_1\upsilon)$. We drop the subscripts $\widehat P$ and $\hat b$ when they are clear from the context.

We use $V^*_h$ to denote $\min_{\pi\in\Pi_{\mathrm{Aug}}} V^\pi_{h,\widehat P,\hat b}$ and $(\widehat P, r)$ to denote the MDP model whose transition is $\widehat P$ and whose reward distribution is r. First, recall that the discrete reward distribution $r_h(s, a)$ only takes values $i\upsilon$ with $0\le i\le\lceil 1/\upsilon\rceil$. This means that it is always linear with respect to a $(\lceil 1/\upsilon\rceil + 1)$-dimensional vector: $r_h(i\upsilon|s, a) = \langle\phi_{h,r}(s, a), \psi_{h,r}(i\upsilon)\rangle$, where $(\phi_{h,r}(s, a))_i = r_h(i\upsilon|s, a)$ and $\psi_{h,r}(i\upsilon) = e_i$ for all $0\le i\le\lceil 1/\upsilon\rceil$, $s\in S$, $a\in A$, $h\in[H]$. Since $\widehat P_h(s'|s, a) = \langle\hat\phi_h(s, a), \hat\psi_h(s')\rangle$, this implies that for all $s, s'\in S$, $a\in A$, $h\in[H]$, $0\le i\le\lceil 1/\upsilon\rceil$, we have $\widehat P_h(s'|s, a)\, r_h(i\upsilon|s, a) = \langle\bar\phi_h(s, a), \bar\psi_h(s', i\upsilon)\rangle$, where $\bar\phi_h(s, a) = \hat\phi_h(s, a)\otimes\phi_{h,r}(s, a)$ and $\bar\psi_h(s', i\upsilon) = \hat\psi_h(s')\otimes\psi_{h,r}(i\upsilon)$. The linearity of the transition and reward implies that we can utilize LSVI to compute $Q^\pi_h(s, c, a)$. More specifically, we propose an iterative algorithm consisting of the following steps; a short code sketch of the ridge-regression step is given after Step 1.

Step 1: Ridge Regression. In the t-th iteration, denote the trajectories collected before the t-th iteration by $\{(s^j_h, a^j_h, r^j_h)_{h=1}^H\}_{j=1}^{t-1}$. Let $V^t_{H+1}(s, i\upsilon) = i\upsilon$ for all $s\in S$ and $0\le i\le\lceil H/\upsilon\rceil$. From h = H to h = 1, for each $0\le i\le\lceil H/\upsilon\rceil$, we first compute
$$w^t_h(i\upsilon) \leftarrow (\Lambda^t_h)^{-1}\sum_{j=1}^{t-1}\bar\phi_h(s^j_h, a^j_h)\left[V^t_{h+1}(s^j_{h+1}, i\upsilon - r^j_h)\right], \quad\text{where } \Lambda^t_h = \lambda I + \sum_{j=1}^{t-1}\bar\phi_h(s^j_h, a^j_h)\bar\phi_h(s^j_h, a^j_h)^\top.$$
Then for any $j\in[t-1]$, $a\in A$, and $0\le i\le\lceil H/\upsilon\rceil$, we estimate the value function $V^t_h(s^j_h, i\upsilon)$ as
$$Q^t_h(s^j_h, i\upsilon, a) = \mathrm{Clip}_{[-H,H]}\left(-\hat b_h(s^j_h, a) + \bar\phi_h(s^j_h, a)^\top w^t_h(i\upsilon) - \beta\|\bar\phi_h(s^j_h, a)\|_{(\Lambda^t_h)^{-1}}\right), \qquad V^t_h(s^j_h, i\upsilon) = \min_{a\in A} Q^t_h(s^j_h, i\upsilon, a).$$
Note that although we do not compute $Q^t_h(s, i\upsilon, a)$ for all $s\in S$ (which would incur a computational cost scaling with |S|), these values are implicitly expressed via $w^t_h(i\upsilon)$: for all $s\in S$, $a\in A$, $0\le i\le\lceil H/\upsilon\rceil$,
$$Q^t_h(s, i\upsilon, a) = \mathrm{Clip}_{[-H,H]}\left(-\hat b_h(s, a) + \bar\phi_h(s, a)^\top w^t_h(i\upsilon) - \beta\|\bar\phi_h(s, a)\|_{(\Lambda^t_h)^{-1}}\right).$$
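The ridge-regression update in Step 1 is a standard least-squares computation. Below is a minimal, illustrative Python sketch (ours, not the paper's implementation) of forming $\Lambda^t_h$ and $w^t_h(i\upsilon)$ from simulated transitions for a single step h and a single budget level iυ; `phi_bar` rows stand for the combined features $\bar\phi_h$, the signs follow the pessimistic (minimizing) convention described above, and all names are assumptions.

```python
import numpy as np

def ridge_weights(features, targets, lam=1.0):
    """Ridge regression for one (h, i*upsilon) pair of CVaR-LSVI Step 1.
    features: array (t-1, d_bar), rows are phi_bar_h(s_h^j, a_h^j)
    targets:  array (t-1,), entries are V^t_{h+1}(s_{h+1}^j, i*upsilon - r_h^j)
    Returns (w, Lambda) with Lambda = lam*I + sum_j phi phi^T and w = Lambda^{-1} sum_j phi * target."""
    _, d_bar = features.shape
    Lambda = lam * np.eye(d_bar) + features.T @ features
    w = np.linalg.solve(Lambda, features.T @ targets)  # avoids forming Lambda^{-1} explicitly
    return w, Lambda

def q_estimate(phi_bar_sa, w, Lambda, bonus_sa, beta, H):
    """Q^t_h(s, i*upsilon, a) = Clip_[-H, H](-b_hat(s,a) + phi_bar^T w - beta * ||phi_bar||_{Lambda^{-1}})."""
    elliptic = float(np.sqrt(phi_bar_sa @ np.linalg.solve(Lambda, phi_bar_sa)))
    return float(np.clip(-bonus_sa + phi_bar_sa @ w - beta * elliptic, -H, H))
```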
Step 2: Sample Collection. In the t-th iteration, simulate the greedy policy $\tilde\pi^t$ (with respect to the estimated Q-function $Q^t_h(s, i\upsilon, a)$) with initial budget $c_1 = i_1\upsilon$ in the MDP model $(\widehat P, r)$ and collect a trajectory $(s^t_h, a^t_h, r^t_h)_{h=1}^H$. Then, go back to the first step.

Step 3: Policy Evaluation. After repeating the above two steps for $T_1$ iterations, we simulate each policy $\tilde\pi^t$ with initial budget $c_1 = i_1\upsilon$ in $(\widehat P, r)$ for $T_2$ episodes. Suppose the collected trajectories are $\{(s^{t,j}_h, a^{t,j}_h, r^{t,j}_h)_{h=1}^H\}_{j=1}^{T_2}$; we estimate the empirical value function of $\tilde\pi^t$ as
$$\widehat V^{\tilde\pi^t}_1(s_1, i_1\upsilon) = \frac{1}{T_2}\sum_{j=1}^{T_2}\left[\left(i_1\upsilon - \sum_{h=1}^H r^{t,j}_h(s^{t,j}_h, a^{t,j}_h)\right)_+ - \sum_{h=1}^H\hat b_h(s^{t,j}_h, a^{t,j}_h)\right].$$
Then we simply use $\min_{t\in[T_1]}\widehat V^{\tilde\pi^t}_1(s_1, i_1\upsilon)$ as a surrogate for $V^*_1(s_1, i_1\upsilon)$.

The details of CVaR-LSVI are stated in Algorithm 2 (cf. Appendix B). Note that in Line 16 of Algorithm 1, we can also use the above CVaR-LSVI algorithm to compute $\pi^k$. Combining Algorithms 1 and 2, we derive a computationally efficient algorithm for the CVaR objective in low-rank MDPs, called ELLA, which is shown in Algorithm 3 (cf. Appendix B). Now we are ready to present the computational complexity of ELLA. In particular, the following theorem characterizes the computational cost of finding an ϵ-optimal policy.

Theorem 5.1 (Informal). Let the parameters in Algorithms 1 and 2 take appropriate values. Then, with probability at least 1 - δ, we have $\mathrm{CVaR}^*_\tau - \mathrm{CVaR}_\tau(R(\hat\pi, \hat c)) \le \epsilon$, where $(\hat\pi, \hat c)$ is the policy and initial budget returned by Algorithm 3. In total, the sample complexity is upper bounded by $\widetilde O\!\left(\frac{H^7 A^2 d^4\log|F|}{\tau^2\epsilon^2}\right)$.
The MLE oracle is called $\widetilde O\!\left(\frac{H^7 A^2 d^4\log|F|}{\tau^2\epsilon^2}\right)$ times, and the rest of the computation cost is $\widetilde O\!\left(H^{19} A^3 d^{12}\log|F|\right)$, up to polynomial factors in $1/\tau$, $1/\epsilon$, and $1/\upsilon$.

Theorem 5.1 is a special case of the continuous-reward result; its formal statement is Theorem B.3, and the proof is deferred to Appendix D. Theorem 5.1 indicates that Algorithm 3 is able to find a near-optimal policy with polynomial sample complexity and polynomial computational complexity given an MLE oracle. Note that calling CVaR-LSVI in Lines 17 and 20 of Algorithm 3 does not increase the sample complexity, because we only simulate within the known model $(\widehat P, r)$ and do not need to interact with the ground-truth environment.

6 CONCLUDING REMARKS

This paper proposes ELA, the first provably efficient algorithm for risk-sensitive reinforcement learning with the CVaR objective in low-rank MDPs. ELA achieves a sample complexity of $O(1/\epsilon^2)$ to find an ϵ-optimal policy. To improve computational efficiency, we propose the ELLA algorithm, which leverages least-squares value iteration over discretized rewards as the planning oracle. Given the MLE oracle, we prove this algorithm requires only polynomial computational complexity. Regarding future work, we believe establishing lower bounds for CVaR RL in low-rank MDPs is an exciting and challenging direction due to the complexity of the risk landscape and the non-linearity of the function approximation methods used. We point out that even for standard episodic RL, which is a well-studied area, we have only recently seen results on sample-complexity lower bounds in the context of low-rank MDPs (Cheng et al., 2023; Zhao et al., 2023). However, these results do not directly apply to the CVaR risk landscape. We conjecture that the sample complexity lower bound may take the form $\Omega\!\left(\frac{HAd}{\tau\epsilon^2}\right)$, following the lower bounds for low-rank MDPs (Cheng et al., 2023; Zhao et al., 2023) and tabular CVaR RL (Wang et al., 2023), which would imply that our algorithm has a loose dependency on H and τ. Such a gap might be alleviated if we used a Bernstein-type bonus instead of the Hoeffding-type bonus in the bonus-driven exploration. We leave tightening these dependencies as future work.

REFERENCES

Mastane Achab, Reda Alami, Yasser Abdelaziz Dahou Djilali, Kirill Fedyanin, Eric Moulines, and Maxim Panov. Distributional deep Q-learning with CVaR regression. In Deep Reinforcement Learning Workshop NeurIPS 2022, 2022.

Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. FLAMBE: Structural complexity and representation learning of low rank MDPs. Advances in Neural Information Processing Systems, 33:20095-20107, 2020.

Osbert Bastani, Jason Yecheng Ma, Estelle Shen, and Wanqiao Xu. Regret bounds for risk-sensitive reinforcement learning. Advances in Neural Information Processing Systems, 35:36259-36269, 2022.

Nicole Bäuerle and Jonathan Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74:361-379, 2011.

Yu Chen, Yihan Du, Pihe Hu, Siwei Wang, Desheng Wu, and Longbo Huang. Provably efficient iterated CVaR reinforcement learning with function approximation. arXiv preprint arXiv:2307.02842, 2023.

Yuan Cheng, Ruiquan Huang, Jing Yang, and Yingbin Liang. Improved sample complexity for reward-free reinforcement learning under low-rank MDPs. arXiv preprint arXiv:2303.10859, 2023.

Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. Advances in Neural Information Processing Systems, 27, 2014.

Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone.
Risk-sensitive and robust decision-making: a CVaR optimization approach. Advances in Neural Information Processing Systems, 28, 2015.

Antonio Coronato, Muddasar Naeem, Giuseppe De Pietro, and Giovanni Paragliola. Reinforcement learning for intelligent healthcare applications: A survey. Artificial Intelligence in Medicine, 109:101964, 2020.

Mark Davis and Sébastien Lleo. Risk-sensitive benchmarked asset management. Quantitative Finance, 8(4):415-426, 2008.

Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pp. 1665-1674. PMLR, 2019.

Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. In International Conference on Machine Learning, pp. 2826-2836. PMLR, 2021.

Yihan Du, Siwei Wang, and Longbo Huang. Provably efficient risk-sensitive reinforcement learning: Iterated CVaR and worst path. In The Eleventh International Conference on Learning Representations, 2022.

Yonathan Efroni, Dipendra Misra, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Provable RL with exogenous distractors via multistep inverse dynamics. arXiv preprint arXiv:2110.08847, 2021.

Damien Ernst, Guy-Bart Stan, Jorge Goncalves, and Louis Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, pp. 667-672. IEEE, 2006.

Carlo Filippi, Gianfranco Guastaroba, and Maria Grazia Speranza. Conditional value-at-risk beyond finance: a survey. International Transactions in Operational Research, 27(3):1277-1319, 2020.

Takuya Hiraoka, Takahisa Imagawa, Tatsuya Mori, Takashi Onishi, and Yoshimasa Tsuruoka. Learning robust options by conditional value at risk optimization. Advances in Neural Information Processing Systems, 32, 2019.

Xiaoyan Hu and Ho-Fung Leung. A tighter problem-dependent regret bound for risk-sensitive reinforcement learning. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 5411-5437. PMLR, 2023. URL https://proceedings.mlr.press/v206/hu23b.html.

Audrey Huang, Jinglin Chen, and Nan Jiang. Reinforcement learning in low-rank MDPs with density features. arXiv preprint arXiv:2302.02252, 2023.

David Isele, Alireza Nakhaei, and Kikuo Fujimura. Safe reinforcement learning on autonomous vehicles. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1-6. IEEE, 2018.

Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, pp. 1704-1713. PMLR, 2017.

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137-2143. PMLR, 2020.

Pavlo Krokhmal, Jonas Palmquist, and Stanislav Uryasev. Portfolio optimization with conditional value-at-risk objective and constraints. Journal of Risk, 4:43-68, 2002.

R. Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21-42, 2000.
Ayush Sekhari, Christoph Dann, Mehryar Mohri, Yishay Mansour, and Karthik Sridharan. Agnostic reinforcement learning with low-rank MDPs and rich observations. Advances in Neural Information Processing Systems, 34:19033-19045, 2021.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017.

Shagun Sodhani, Amy Zhang, and Joelle Pineau. Multi-task reinforcement learning with context-based representations. In International Conference on Machine Learning, pp. 9767-9779. PMLR, 2021.

Shagun Sodhani, Franziska Meier, Joelle Pineau, and Amy Zhang. Block contextual MDPs for continual learning. In Learning for Dynamics and Control Conference, pp. 608-623. PMLR, 2022.

Silvestr Stanko and Karel Macek. Risk-averse distributional reinforcement learning: A CVaR optimization approach. In IJCCI, pp. 412-423, 2019.

Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pp. 2898-2933. PMLR, 2019.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Aviv Tamar, Yonatan Glassner, and Shie Mannor. Optimizing the CVaR via sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.

Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline RL in low-rank MDPs. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=J4iSIR9fhY0.

Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350-354, 2019.

Kaiwen Wang, Nathan Kallus, and Wen Sun. Near-minimax-optimal risk-sensitive reinforcement learning with CVaR. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 35864-35907. PMLR, 2023. URL https://proceedings.mlr.press/v202/wang23m.html.

Zhicheng Wang, Biwei Huang, Shikui Tu, Kun Zhang, and Lei Xu. DeepTrader: a deep reinforcement learning approach for risk-return balanced portfolio management with market conditions embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 643-650, 2021.

Lu Wen, Jingliang Duan, Shengbo Eben Li, Shaobing Xu, and Huei Peng. Safe reinforcement learning for autonomous vehicles through parallel constrained policy optimization. In 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pp. 1-7. IEEE, 2020.

Pengqian Yu, William B. Haskell, and Huan Xu. Approximate value iteration for risk-aware Markov decision processes. IEEE Transactions on Automatic Control, 63(9):3135-3142, 2018.
Amy Zhang, Clare Lyle, Shagun Sodhani, Angelos Filos, Marta Kwiatkowska, Joelle Pineau, Yarin Gal, and Doina Precup. Invariant causal prediction for block MDPs. In International Conference on Machine Learning, pp. 11214-11224. PMLR, 2020.

Amy Zhang, Rowan Thomas McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=-2FCwDKRREu.

Canzhe Zhao, Ruofeng Yang, Baoxiang Wang, Xuezhou Zhang, and Shuai Li. Learning adversarial low-rank Markov decision processes with unknown transition and full-information feedback. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

A NOTATIONS

Table 1: List of Notations

$A = |\mathcal{A}|$ : cardinality of the action space
$[H] = \{1, \ldots, H\}$ : set of timesteps
$\Delta(\cdot)$ : probability simplex
$Z^\pi_{P,c_1}$ : return distribution of rolling out π with initial budget $c_1$ in an MDP with transition P
$V^\pi_{h,P,b}: S\times[0,H]\to[0,H]$ : defined in (5)
$d^{(\pi,c_1)}_{h,P}(s),\ d^{(\pi,c_1)}_{h,P}(s,a)$ : occupancy of s and (s, a) at step h when rolling out π with initial budget $c_1$ in an MDP with transition P
$d^{(\pi,c_1)}_h(s) = d^{(\pi,c_1)}_{h,P^*}(s),\ d^{(\pi,c_1)}_h(s,a) = d^{(\pi,c_1)}_{h,P^*}(s,a)$ : occupancy of s and (s, a) at step h when rolling out π with initial budget $c_1$ in the (true) environment

In-Algorithm Initialization
$F = \{\Psi, \Phi\}$ : model class
$\lambda_k = O(d\log(|F|Hk/\delta))$ : regularizer at iteration k
$\alpha_k = \sqrt{H^2(A + d^2)\log(|F|Hk/\delta)}$ : bonus parameter at iteration k

Datasets (at the end of the k-th episode)
$D^k_h = \{(s^i_h, a^i_h, s^i_{h+1})\}_{i\in[k]}$ : $s^i_h\sim d^{(\pi^{i-1},c^{i-1})}_h$, $a^i_h\sim U(A)$
$\widetilde D^k_h = \{(\bar s^i_h, \bar a^i_h, s'^i_{h+1})\}_{i\in[k]}$ : $\bar s^i_h\sim d^{(\pi^{i-1},c^{i-1})}_{h-1}\circ U(A)\circ P^*_{h-1}$, $\bar a^i_h\sim U(A)$

Estimations (at the end of the k-th episode)
$\{(\hat\psi^k_h, \hat\phi^k_h)\}_{h\in[H]}$ : learned representations
$\widehat P^k = \{\widehat P^k_h\}_{h\in[H]}$ : empirical transition kernel
$\widehat\Sigma^k_h = \sum_{(s,a)\in D^k_h}\hat\phi^k_h(\hat\phi^k_h)^\top + \lambda_k I$ : empirical covariance matrix
$\hat b^k = \{\hat b^k_h\}_{h\in[H]}$ : bonus term

In-Analysis
$\rho^k_h(s) = \frac{1}{k}\sum_{i=0}^{k-1} d^{(\pi^i,c^i)}_h(s)$ : occupancy of s in dataset $D^k_h$
$\rho^k_h(s,a) = \frac{1}{k}\sum_{i=0}^{k-1} d^{(\pi^i,c^i)}_h(s,a)$
$\eta^k_h(s) = \sum_{s',a'}\rho^k_{h-1}(s')\,U(a')\,P^*_{h-1}(s|s',a')$ : occupancy of s in dataset $\widetilde D^k_h$
$\Sigma_{\rho^k_h\otimes U(A),\phi} = k\,\mathbb{E}_{s\sim\rho^k_h, a\sim U(A)}[\phi\phi^\top] + \lambda_k I$
$\Sigma_{\rho^k_h,\phi} = k\,\mathbb{E}_{(s,a)\sim\rho^k_h}[\phi\phi^\top] + \lambda_k I$
$\widehat\Sigma^k_{h,\phi} = k\,\mathbb{E}_{(s,a)\sim D^k_h}[\phi\phi^\top] + \lambda_k I$ : unbiased estimate of $\Sigma_{\rho^k_h\otimes U(A),\phi}$
$\zeta_k = \log(|F|Hk/\delta)/k$
$f^k_h(s,a) = \|\widehat P^k_h(\cdot|s,a) - P^*_h(\cdot|s,a)\|_1$ : estimation error in $L_1$ norm
$\omega^{(\pi,c_1)}_{h,P}(\cdot|s,a)$ : distribution of the remaining budget at step h for any (s, a) when rolling out (π, c₁) in an MDP with transition P
$\omega^{(\pi,c_1)}_h(\cdot|s,a) = \omega^{(\pi,c_1)}_{h,P^*}(\cdot|s,a)$ : distribution of the remaining budget at step h for any (s, a) when rolling out (π, c₁) in the true MDP

B ELLA FOR CONTINUOUS REWARD

Now we extend the analysis in Section 5 to continuous reward distributions.

B.1 DISCRETIZED REWARD

Inspired by Wang et al. (2023), we discretize the reward $r_h$ and the budget $c_h$ at each step. In this way, we only need to plan over a finite grid. More specifically, suppose the precision is $\upsilon > 0$; then we round up the reward r to $U(r) := \lceil r/\upsilon\rceil\,\upsilon$. In the following discussion, we use $\bar M$ to denote the MDP which shares the same transition as M while its reward $\bar r$ is discretized from the original reward distribution r of M, i.e., $\bar r = U(r)$.
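As a small illustration of the rounding operator U and of the budget grid it induces, here is a hedged Python sketch (ours, assuming rewards in [0, 1]); it checks that after rounding rewards up to multiples of υ, every reachable remaining budget stays on the grid $\{i\upsilon: 0\le i\le\lceil H/\upsilon\rceil\}$, which is why the planning oracle only needs to consider finitely many budget levels.

```python
import numpy as np

def round_up(r, upsilon):
    """U(r) := ceil(r / upsilon) * upsilon, the rounding used to discretize rewards."""
    return np.ceil(np.asarray(r) / upsilon) * upsilon

def budget_grid(H, upsilon):
    """All candidate initial budgets {i * upsilon : 0 <= i <= ceil(H / upsilon)}."""
    return upsilon * np.arange(int(np.ceil(H / upsilon)) + 1)

# Sanity check (illustrative): starting from a grid budget and subtracting rounded rewards,
# the remaining budget is always an integer multiple of upsilon.
upsilon, H = 0.25, 4
c = budget_grid(H, upsilon)[-1]        # start from the largest grid budget
for r in [0.13, 0.9, 0.5]:             # arbitrary continuous rewards in [0, 1]
    c -= float(round_up(r, upsilon))
    assert abs(round(c / upsilon) - c / upsilon) < 1e-9
print("remaining budget:", c)
```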
Since the supremum of the CVa R objective (2) is attained at τ-th quantile of the return distribution, which is also discretized in ( b P, r), we can only search ck within the grid for ( b P, r) in Line 15 of Algorithm 1: ck υ arg max 0 i H/υ iυ τ 1 min π ΠAug V π 1, b P,bb(s1, iυ) where ΠAug is the augmented policy class of augmented M and V π h,P,b is the value function of (P, r) with bonus b: V π h,P,b(s, c) := Eπ,P h =h U (rh ) h =h bh (sh , ah ) sh = s, ch = c Similarly, the Q function of (P, r) with bonus b is defined as: Q π h,P,b(s, c, a) := Eπ,P h =h U (rh ) h =h bh (sh , ah ) sh = s, ch = c, ah = a To stay consistent with Line 15, we also derive the optimal augmented policy of ( b P, r) with bonus bb in Line 16 of Algorithm 1, i.e., πk arg min π ΠAug V π 1, b P,bb(s1, ck). Note that in Line 8 of Algorithm 1 we need to roll out πk in M to collect samples. Since πk ΠAug works only within the grid, we discretize the reward we observe when executing πk in M, which is equivalent to playing the following augmented policy πk ΠAug in M: , h [H]. (8) Planning within a discretized grid via (7) will inevitably incur errors compared to the original objective. Nevertheless, we can show that if the precision υ is sufficiently small, the discretized MDP M will be an excellent approximation to M and thus the induced error of planning via (7) will be negligible. Formally, let R(π, c) denote the return of executing π ΠAug with initial budget c = iυ (0 i H/υ ) in M, then we have the following properties from the literature (Wang et al., 2023): Proposition B.1. For any 0 < τ < 1,υ > 0, policy π ΠAug and 0 i H/υ , we have (1) CVa R τ CVa R τ := max π ΠAug,0 i H/υ CVa Rτ(R(π, iυ)), (2) 0 CVa Rτ(R(π, iυ)) CVa Rτ(R(π, iυ)) Hυ Published as a conference paper at ICLR 2024 B.2 CVa R-LSVI With discretization, we only need to search within the grid of c in each iteration. For each discrete value iυ, we apply the CVa R-LSVI algorithm as planning oracle as introduced in Section 5. The only difference is that now we simulate with the discretized reward model r. The full algorithm is shown in Algorithm 2. In particular, in Line 12 we can express Q t h with wt h: Q t h(s, iυ, a) = Clip[ H,H] bbh(s, a) + ϕh(s, a) wt h(iυ) β ϕh(s, a) (Λt h) 1 Moreover, note that we can bound the norm of the newly-constructed feature vectors as follows: ϕh(s, a) 2 1, ψh(s , iυ)ds This bound will be useful in our analysis. Algorithm 2 CVa R-LSVI Require: MDP transition model and reward ( b P = (bϕ, bψ), r), bonus bb, initial budget i1υ, number of iterations T1, parameters λ and β, number of policy evaluation episodes T2. 1: Compute ϕh(s, a) = bϕh(s, a) ϕh,r(s, a) for all h [H], s S, a A where ϕh,r(s, a) R 1/υ +1 and (ϕh,r(s, a))i = rh(iυ|s, a) for all 0 i 1/υ . 2: for t = 1, , T1 do 3: Initialize V t H+1(s, iυ) iυ for all s S and 0 i H/υ . 4: for h = H, , 1 do 5: Compute Λt h λI + Pt 1 j=1 ϕh(sj h, aj h)(ϕh(sj h, aj h)) . 6: Calculate wt h(iυ) (Λt h) 1 Pt 1 j=1 ϕh(sj h, aj h) h V t h+1(sj h+1, iυ rj h) i , 0 i H/υ . 7: for j = 1, , t 1 do 8: Compute for all a A, 0 i H/υ : Q t h(sj h, iυ, a) Clip[ H,H] bbh(sj h, a) + ϕh(sj h, a) wt h(iυ) β ϕh(sj h, a) (Λt h) 1 9: V t h(sj h, iυ) mina A Q t h(sj h, iυ, a). 10: end for 11: end for 12: Simulate the greedy policy eπt (w.r.t. Q t h(s,iυ, a) defined in (9)) with the initial budget c1 = i1υ in the MDP ( b P, r) and collect a trajectory (st h, at h, rt h)H h=1. 
13: end for 14: for t = 1, , T1 do 15: Simulate eπt with initial budget c1 = i1υ in ( b P, r) for T2 episodes and collect trajectories st,j h , at,j h , rt,j h H 16: Compute b V eπt 1 (s1, i1υ) 1 T2 PT2 j=1 i1υ PH h=1 rt,j h (st,j h , at,j h ) + PH h=1 bbh(st,j h , at,j h ). 17: end for 18: Return: value estimate mint [T1] b V eπt 1 (s1, i1υ) and policy arg mineπt b V eπt 1 (s1, i1υ) Now let V h, b P,bb denote minπ ΠAug V π h, b P,bb. Then we have the following theorem indicating that Algorithm 2 can do planning in ( b P, r) accurately with appropriate T1 and T2: Theorem B.2. Let λ = 1, β = O H 3 2 dι 1 4 υ , T1 = O H5d3ι Published as a conference paper at ICLR 2024 where ι = log2 Hd T1 υδ , we have with probability at least 1 δ that 1, b P,bb(s1, i1υ) V 1, b P,bb(s1, i1υ) + 3 1, b P,bb(s1, i1υ) V beπ 1, b P,bb(s1, i1υ) 1 where beπ = arg mineπt b V eπt 1 (s1, i1υ) and b V 1, b P,bb(s1, i1υ) := mineπt b V eπt 1 (s1, i1υ) are the returned values of CVa R-LSVI. The proof is deferred to Appendix D.1. B.3 COMPUTATIONAL COMPLEXITY Equipping ELA with CVa R-LSVI, we can derive ELLA, which is shown in Algorithm 3. Based on the above discussions about discretization and CVa R-LSVI, the computational complexity of ELLA for finding an ϵ-policy can be characterized as follows: Theorem B.3. Let H2(|A| + d2) log |F|Hk , λk = O d log |F|Hk H6A2d4 log |F| 3H , λ = 1, β = O H 3 2 dι 1 4 υ T1 = O H5d3ι Then we have with probability at least 1 δ, CVa R τ CVa Rτ(R(bπ, bc)) ϵ. In total, the sample complexity is upper bounded by O H7A2d4 log |F| . The MLE oracle is called O H7A2d4 log |F| times and the rest of the computation cost is O H29A3d12 log |F| The proof is deferred to Appendix D.2. Theorem B.3 indicates that even when the reward is continuous, Algorithm 3 can still achieve polynomial sample complexity and polynomial computational complexity given an MLE oracle. C PROOFS FOR SECTION 4 Proof Sketch Recall that πk and ck are the exploration policy and initial budget output from Line 16 of Algorithm 1 at the end of the k-th iteration, respectively. In the proof, we show that, with probability at least 1 δ, it holds that Regret(K) τ 1H3Ad2 which is formalized in Lemma C.1. Once it is established, the suboptimality of the uniform mixture of {(ck, πk)} satisfies that k=1 CVa Rτ(R(πk, ck)) τ 1H3Ad2K 1 Let the RHS smaller than ϵ, we have that when τ 2ϵ2 log 1 + K Published as a conference paper at ICLR 2024 Algorithm 3 ELLA Require: Risk tolerance τ (0, 1], number of iterations K, parameters {λk}k [K] and {αk}k [K], models F = {Ψ, Φ}, failure probability δ (0, 1), discretization precision υ, CVa RLSVIparameters λ, β, T1, T2. 1: Set datasets Dh, e Dh for each h [H 1]. 2: Calculate the discretized reward distribution r. 3: Initialize the exploration policy π0 {π0 h(s, iυ) = U(A), for any s S, 0 i H/υ }h [H]. 4: Initialize the budget i0 H/υ . 5: for iteration k = 1, . . . , K do 6: Collect a tuple ( s1, a1, s 2) by taking a1 U(A), s 2 P 1 ( | s1, a1). 7: Update e D1 e D1 {( s1, a1, s 2)}. 8: for h = 1, , H 1 do 9: Collect two transition tuples (sh, ah, sh+1) and ( sh+1, ah+1, s h+2) by first rolling out πk 1 (defined in (8)) starting from (s1, ik 1υ) into state sh, taking ah U(A), and receiving sh+1 P h( |sh, ah), then taking ah+1 U(A) and receiving s h+2 P h+1( | sh+1, ah+1). 10: Update Dh Dh {(sh, ah, sh+1)}. 11: Update e Dh+1 e Dh+1 {( sh+1, ah+1, s h+2)} if h H 2. 
12: Learn representations via MLE b Ph := ( bψh, bϕh) arg max (ψ,ϕ) F (sh,ah,sh+1) {Dh+ e Dh} log ψ(sh+1), ϕ(sh, ah) 13: Update empirical covariance matrix bΣh = P (s,a) Dh bϕh(s, a)bϕh(s, a) + λk Id. 14: Set the exploration bonus: bϕh(s, a)bΣ 1 h bϕh(s, a) , 2 h H 2 15: end for 16: for i = 0, 1, , H/υ do 17: Run CVa R-LSVI (Algorithm 2) with MDP model ( b P, r), bonus bb, initial budget iυ and parameters (λ, β, T1, T2) and let the returned value estimate and policy be b V 1(s1, iυ) and beπ(i). 18: end for 19: Obtain ik arg max0 i H/υ 1(s1, iυ) and πk beπ(ik). 20: end for Ensure: Uniformly sample k from [K], return (bπ, bc) = (πk, ikυ). the uniform mixture of {(πk, ck)} is an ϵ-optimal policy. Finally, noting that H trajectories are collected per iteration, we conclude the proof of Theorem 4.1. To establish (10), we decompose the suboptimality of the k-th iteration into CVa R τ CVa Rτ(R(πk, ck)) 1, b Pk,bbk(s1, c ) CVa Rτ(R(πk, ck)) + CVa Rτ(R(π , c )) c τ 1V π 1, b Pk,bbk(s1, c ) 1, b Pk,bbk(s1, ck) CVa Rτ(R(πk, ck)) + c τ 1V π 1,P ,0(s1, c ) c τ 1V π 1, b Pk,bbk(s1, c ) τ 1 V πk 1,P ,0(s1, ck) V πk 1, b Pk,bbk(s1, ck) 1, b Pk,bbk(s1, c ) V π 1,P ,0(s1, c ) | {z } (ii) Published as a conference paper at ICLR 2024 where the first inequality holds by the fact that πk is greedy (Line 16 of Algorithm 1) and the last inequality holds by CVa Rτ(R(πk, ck)) = sup t R t τ 1ER R(πk,ck)[(t R)+] ck τ 1ER R(πk,ck)[(ck R)+] Utilizing simulation lemma C.3 for risk-sensitive RL, we further upper bound terms (i) and (ii) by the error in estimating the transition kernel and the bonus terms, i.e., term (i) E(s,a) d(πk,ck) h h bbk h(s, a) i + 3H E(s,a) d(πk,ck) h f k h(s, a) term (ii) E(s,a) d(π ,c ) h H f k h(s, a) bbk h(s, a) i where f k h(s, a) := b P k h ( |s, a) P h( |s, a) 1. By the analysis in Lemma C.1 and C.2, we prove the inequality (10). Before proceeding to the detailed proofs, we first introduce the essential regret decomposition. Lemma C.1 (Regret). With probability 1 δ, we have that k=1 CVa R τ CVa Rτ(R(πk, ck)) τ 1H3Ad2 Proof. Letting f k h(s, a) = b P k h ( |s, a) P h( |s, a) , we condition on the event that for all (k, h) [K] [H], the following inequalities hold Es ρk h,a U(A) h f k h(s, a) 2i ζk, Es ηk h,a U(A) h f k h(s, a) 2i ζk ϕ(s, a) (bΣk h,ϕ) 1 = Θ( ϕ(s, a) (Σρk h U(A),ϕ) 1) By Lemmas E.1 and E.2, this event happens with probability 1 δ. For any iteration k, we have that For a fixed iteration k, we have that CVa R τ CVa Rτ(R(πk, ck)) 1, b Pk,bbk(s1, c ) CVa Rτ(R(πk, ck)) + CVa Rτ(R(π , c )) c τ 1V π 1, b Pk,bbk(s1, c ) 1, b Pk,bbk(s1, ck) CVa Rτ(R(πk, ck)) + c τ 1V π 1,P ,0(s1, c ) c τ 1V π 1, b Pk,bbk(s1, c ) τ 1 V πk 1,P ,0(s1, ck) V πk 1, b Pk,bbk(s1, ck) + τ 1 V π 1, b Pk,bbk(s1, c ) V π 1,P ,0(s1, c ) H2Aζk by Lemma C.2 where the first inequality holds by the fact that πk is greedy (Line 15 of Algorithm 1) and the last inequality holds by CVa Rτ(R(πk, ck)) = sup t R t τ 1ER R(πk,ck)[(t R)+] ck τ 1ER R(πk,ck)[(ck R)+] Therefore, it remains to bound the first term, which by simulation lemma C.3, can be further written as V πk 1,P ,0(s1, ck) V πk 1, b Pk,bbk(s1, ck) h=1 rh(sh, ah) h=1 rh(sh, ah) h=1 bbk h(sh, ah) h=1 Eπk, b Pk h bbk h(sh, ah) c1 = cki + H h=1 E(s,a) d(πk,ck) h f k h(s, a) h=1 E(s,a) d(πk,ck) h h bbk h(s, a) i h=1 E(s,a) d(πk,ck) h f k h(s, a) | {z } (ii) Published as a conference paper at ICLR 2024 where the last inequality holds by the simulation Lemma E.5 for risk-neutral RL and the fact that bbk h 2. where the last inequality holds by V πk h, b P k,r+bbk 2. 
We first consider term (i). By Lemma E.4 and noting that the bonus term bbk h is O(1), we have h=1 E(s,a) d(πk,ck) h h bbk h(s, a) i h=1 E(s,a) d(πk,ck) h min αk bϕk h(s, a) Σ 1 ρk h U(A), b ϕk h , 2 v u u t A (αk)2 Es ρk 1,a U(A) bϕk 1(s, a) 2 Σ 1 ρk 1 U(A), b ϕk 1 h=1 E(s,a) d(πk,ck) h+1 αk bϕk h+1(s, a) Σ 1 ρk h+1 U(A), b ϕk h+1 h=1 E(s,a) d(πk,ck) h ϕ h(s, a) Σ 1 ρk h,ϕ v u u tk(αk)2A Es ρk h+1,a U(A) bϕk h+1(s, a) 2 Σ 1 ρk h+1 U(A), b ϕk h+1 v u u t A (αk)2 Es ρk 1,a U(A) bϕk 1(s, a) 2 Σ 1 ρk 1 U(A), b ϕk 1 d A(αk)2 + 4λkd h=1 E(s,a) d(πk,ck) h ϕ h(s, a) Σ 1 ρk h,ϕ where the last inequality holds by k Es ρk h,a U(A) bϕk h(s, a) 2 Σ 1 ρk h U(A), b ϕk h = k Tr Eρk h U(A)[bϕk h(bϕk h) ] n k Eρk h U(A)[bϕk h(bϕk h) ] + λko 1 d Similarly, for term (ii), we have that h=1 E(s,a) d(πk,ck) h f k h(s, a) A Es ρk 1,a U(A) h f k 1 (s, a) 2i h=1 E(s,a) d(πk,ck) h ϕ h(s, a) Σ 1 ρk h,ϕ k A Es ρk h+1,a U(A) h f k h+1(s, a) 2i + 4λkd h=1 E(s,a) d(πk,ck) h ϕ h(s, a) Σ 1 ρk h,ϕ where the last step uses H2Ad2 log |F|Hk Combining (13), (14), (15), and (16) we obtain CVa R τ CVa Rτ(R(πk, ck)) 2H αk E(s,a) d(πk,ck) h ϕ h(s, a) Σ 1 ρk h,ϕ Published as a conference paper at ICLR 2024 d A(αk)2 + 4λkd h=1 E(s,a) d(πk,ck) h ϕ h(s, a) Σ 1 ρk h,ϕ h=1 E(s,a) d(πk,ck) h ϕ h(s, a) Σ 1 ρk h,ϕ where the last inequality holds by (Uehara et al., 2022, Lemma 18), i.e., k=1 E(s,a) d(πk,ck) h ϕ h(s, a) Σ 1 ρk h,ϕ d K log 1 + K Therefore, we conclude the proof. Lemma C.2 (Almost Optimism at the Initial Distribution). Consider an episode k [K] and set H2(A + d2) log |F|Hk , λk = O d log |F|Hk k log |F|Hk (17) with probability 1 δ, we have that 1, b Pk,bbk(s1, c ) V π 1,P ,0(s1, c ) p Proof. Similar to the proof of Lemma C.1, letting f k h(s, a) = b P k h ( |s, a) P h( |s, a) 1, we condition on the event that for all (k, h) [K] [H], the following inequalities hold Es ρk h,a U(A) h f k h(s, a) 2i ζk, Es ηk h,a U(A) h f k h(s, a) 2i ζk ϕ(s, a) (bΣk h,ϕ) 1 = Θ( ϕ(s, a) (Σρk h U(A),ϕ) 1) From Lemmas E.1 and E.2, this event happens with probability 1 δ. Note that bk H(s, a) = f k H(s, a) := 0 for any (s, a) S A. Then, for any policy π, from the simulation Lemma C.3, we have that 1, b Pk,bbk(s1, c ) V π 1,P ,0(s1, c ) h=1 rh(sh, ah) h=1 bbk h(sh, ah) h=1 rh(sh, ah) h=1 E(s,a) d(π ,c ) h H f k h(s, a) bbk h(s, a) i (19) For any h + 1 {2, , H 1}, by Lemma E.3 and noting that fh+1 2, we have that E(s ,a ) d(π ,c ) h+1, b Pk [f k h+1(s , a )] E(s,a) d(π ,c ) bϕk h(s, a) Σ 1 ρk h U(A), b ϕk h k A Es ηk h+1,a U(A) h f k h+1(s , a ) 2i + 4λkd + 4kζk # E(s ,a ) d(π ,c ) h+1, b Pk [f k h+1(s , a )] p βk E(s,a) d(π ,c ) bϕk h(s, a) Σ 1 ρk h U(A), b ϕk h where βk := k Aζk + λkd + kζk (A + d2) log(|F|Hk/δ) Published as a conference paper at ICLR 2024 Combining (19) and (20), we further derive 1, b Pk,bbk(s1, c ) V π 1,P ,0(s1, c ) h=1 E(s,a) d(π ,c ) h bbk h(s, a) i + p i=1 E(s,a) d(π ,c ) i+1, b Pk f k i+1(s, a) h=1 E(s,a) d(π ,c ) h bbk h(s, a) i + h=1 E(s,a) d(π ,c ) min αk bϕk h(s, a) Σ 1 ρk h U(A), b ϕk h , 2 + p Therefore, we conclude the proof. Lemma C.3 (Simulation lemma for risk-sensitive RL). Given two episodic MDPs (H, P = {Ph}h [H], r) and (H, b P = { b Ph}h [H], r), for any fixed c [0, 1] and policy π = {πh : S [0, H] 7 (A)}h [H], we have that V π 1,P,0(s1, c) V π 1, b P,0(s1, c) H h=1 E(s,a) d(π,c) h,P [fh(s, a)] (21) where fh(s, a) := Ph( |s, a) b Ph( |s, a) 1 for any (h, s, a) [H] S A. Proof. 
By definition, we derive that V π 1,P,0(s1, c) V π 1, b P,0(s1, c) =Ea1 π1( |s1,c),r1 n Es P1( |s1,a1) V π 2,P,0(s , c r1) Es b P1( |s1,a1) h V π 2, b P,0(s , c r1) io =Ea1 π1( |s1,c),r1 Es b P1( |s1,a1) h V π 2,P,0(s , c r1) V π 2, b P,0(s , c r1) i + Es P1( |s1,a1) V π 2,P,0(s , c r1) Es b P1( |s1,a1) V π 2,P,0(s , c r1) ) Ea1 π1( |s1,c),r1Es b P1( |s1,a1) h V π 2,P,0(s , c r1) V π 2, b P,0(s , c r1) i + H E(s,a) dπ 1,P[f1(s, a)] h=1 E(s,a) d(π,c) h,P [fh(s, a)] which concludes the proof. D PROOFS FOR APPENDIX B D.1 PROOF OF THEOREM B.2 In the following discussion we use ϕ j h to denote ϕh(sj h, aj h) and ( b Ph V )(s, iυ, a) to denote Er rh( |s,a),s b Ph( |s,a)[V (s , iυ r)]. We also use Gt,h denote the filtration generated by {(sj h , aj h , rj h )H h =1}t 1 j=1 (st h , at h , rt h )h 1 h =1 (st h, at h). First, note that we have the following concentration lemma: Lemma D.1. For all h [H], t [T1], 0 i H/υ , with probability at 1 δ/4, we have V t h+1(sj h+1, iυ rj h) b Ph V t h+1(sj h, iυ, aj h) i (Λt h) 1 O Published as a conference paper at ICLR 2024 The proof is deferred to Appendix D.3. Let E1 denote the event in Lemma D.1. On the other hand, we can further bound the difference between Q π h(s, iυ, a) and bbh(s, a) + ϕh(s, a) wt h(iυ) as follows: Lemma D.2. For any policy π, conditioned on E1, we have for all s S, 0 i H/υ , a A, h [H], t [T1] that bbh(s, a) + ϕh(s, a) wt h(iυ) Q π h(s, iυ, a) = b Ph V t h+1 V π h+1 (s, iυ, a) + ξt h(s, iυ, a), where |ξt h(s, iυ, a)| β ϕh(s, a) (Λt h) 1. The proof of Lemma D.2 follows the same arguments in the proof of Jin et al. (2020)[Lemma B.4] and thus is omitted here. With Lemma D.2, we can prove that the estimated Q function Q t is optimistic: Lemma D.3. Conditioned on event E1, we have for all s S, 0 i H/υ , h [H], t [T1] that V t h(s, iυ) V h(s, iυ) where V h(s, iυ) = supπ V π h(s, iυ). The proof is deferred to Appendix D.4. Combining Lemma D.2 and Lemma D.3,we can bound the regret of Algorithm 2 as follows: Lemma D.4. With probability at least 1 δ/2, we have 1 (s1, i1υ) V 1(s1, i1υ) O where ι = log2 Hd T1 The proof is deferred to Appendix D.5. Denote the event in Lemma D.4 by E2. Lemma D.4 implies that by setting T1 = O H5d3ι υ3ε2 , we have conditioned on event E2 that min t [T1] V eπt 1 (s1, i1υ) V 1(s1, i1υ) ε/2. Let t1 denote arg mint [T1] V eπt 1 (s1, i1υ). Then on the other hand, by setting T2 = O H2 log T1 with probability at least 1 δ/2 we have that for all t [T1], b V eπt 1 (s1, i1υ) V eπt 1 (s1, i1υ) ε/4. Denote the above event by E3 and let beπ denote arg mineπt b V eπt 1 (s1, i1υ). Then conditioned on event E2 E3, we have 1(s1, i1υ) V 1(s1, i1υ) b V eπt1 1 (s1, i1υ) V 1(s1, i1υ) V eπt1 1 (s1, i1υ) V 1(s1, i1υ) + ε D.2 PROOF OF THEOREM B.3 First, we show that Algorithm 3 can achieve sublinear regret in the discretized MDP M. Let (π dis, i ) := arg maxπ ΠAug,0 i H/υ CVa Rτ(R(π, iυ)). Note that from (4) we have CVa R τ = i υ τ 1V π dis 1,P ,0(s1, i υ). This implies that CVa R τ CVa Rτ(R(πk, ikυ)) Published as a conference paper at ICLR 2024 = i υ τ 1V π dis 1, b Pk,bbk(s1, i υ) CVa Rτ(R(πk, ikυ)) + i υ τ 1V π dis 1,P ,0(s1, i υ) i υ τ 1V π dis 1, b Pk,bbk(s1, i υ) = i υ τ 1V π dis 1, b Pk,bbk(s1, i υ) CVa Rτ(R(πk, ikυ)) + τ 1 V π dis 1, b Pk,bbk(s1, i υ) V π dis 1,P ,0(s1, i υ) Suppose in k-th iteration, CVa R-LSVI returns b V 1, b Pk,bbk(s1, i υ) for the initial budget i υ. 
Then from Theorem B.2, we know V π dis 1, b Pk,bbk(s1, i υ) b V 1, b Pk,bbk(s1, i υ) τ This implies that i υ τ 1V π dis 1, b Pk,bbk(s1, i υ) i υ τ 1 b V 1, b Pk,bbk(s1, i υ) + ϵ 1, b Pk,bbk(s1, ikυ) + ϵ 1, b Pk,bbk(s1, ikυ) + ϵ where the last step is due to Theorem 5.1. On the other hand, from (2) we know CVa Rτ(R(πk, ikυ)) ik V πk 1,P ,0(s1, ikυ). (24) Plug Equation (23) and (24) into (22), we have CVa R τ CVa Rτ(R(πk, ikυ)) τ 1 1,P ,0(s1, ikυ) V πk 1, b Pk,bbk(s1, ikυ) V π dis 1, b Pk,bbk(s1, i υ) V π dis 1,P ,0(s1, i υ) + ϵ Then, following the same arguments in the proof of Theorem 4.1, we can obtain CVa R τ CVa Rτ(R(πk, ikυ)) O τ 1H3Ad2 log (|F|/δ) + Kϵ Next we bridge CVa Rτ and CVa Rτ via Proposition B.1. Note that from Proposition B.1 we have k=1 CVa R τ CVa Rτ(R(πk, ikυ)) CVa R τ CVa Rτ(R(πk, ikυ)) + KHυ Combining (25) and (26), we have k=1 CVa R τ CVa Rτ(R(πk, ikυ)) O τ 1H3Ad2 log (|F|/δ) + Kϵ Substituting the values of the parameters, we will obtain k=1 CVa R τ CVa Rτ(R(πk, ikυ)) ϵ. This implies that the uniformly mixed policy of {(πk, ikυ)}K k=1 is ϵ-optimal. The total number of collected trajectories is KH = O H7A2d4 log |F| Published as a conference paper at ICLR 2024 Now, we discuss the computational complexity. The MLE oracle is called KH times. For the rest of the computation, the cost is dominated by running CVa R-LSVI in Line 17 of Algorithm 3. In CVa RLSVI, we compute (Λt h) 1 by the Sherman-Morrison formula, then the computational complexity of CVa R-LSVI is dominated by Line 6 of Algorithm 2, which requires O( d2AT1 υ2 ) time per step and leads to a total computational complexity of O( d2AH2T 2 1 υ3 ) for each call of CVa R-LSVI. Note that CVa RLSVI is called totally KH υ times in Algorithm 3 and thus the total computation cost of Algorithm 3 is O Kd2AH3T 2 1 υ4 H29A3d12 log |F| D.3 PROOF OF LEMMA D.1 The proof largely follows the proof of Jin et al. (2020)[Lemma B.3]. We first bound wt h(iυ). Note that we have for any vector v Rd where d := d(1 + 1/υ ) that v (Λt h) 1ϕ j h H j=1 v (Λt h) 1v ϕ j h (Λt h) 1ϕ j h ϕ j h (Λt h) 1ϕ j h. Note that with Lemma E.6 we have Pt 1 j=1 ϕ j h (Λt h) 1ϕ j h d and therefore we can obtain that for all v Rd v wt h(iυ) H v which implies that wt h(iυ) H q To utilize Lemma E.7, we further need to bound the covering number of the class of estimated value functions. Fix h [H], it can be observed that all V t h( , ) where t [T1] belongs to the following function class V parametrized by w and A: n V | V (s, iυ) = min a Clip[ H,H] ϕ h (s, a)w(iυ) bbh(s, a) ϕh(s, a) A o , where w(iυ) H q λ and A β2/λ. Then for any V1 (parametrized by w1, A1) and V2 (parametrized by w2, A2), we know sup s S,0 i H/υ |V1(s, iυ) V2(s, iυ)| sup s S,a A,0 i H/υ ϕ h (s, a)(w1(iυ) w2(iυ)) ( ϕh(s, a) A1 ϕh(s, a) A2) sup ϕ 1,0 i H/υ ϕ (w1(iυ) w2(iυ)) + sup ϕ 1 ϕ A1 A2 sup 0 i H/υ w1(iυ) w2(iυ) + p This indicates that the covering number of V with respect to ℓ norm can be upper bounded by the covering number of w and A. More specifically, let N(ϵ) denote the covering number of V with Published as a conference paper at ICLR 2024 respect to ℓ norm, then we have υ ) log HT1d υ2 log HT1d Substituting the above bound into Lemma E.7 concludes the proof. D.4 PROOF OF LEMMA D.3 The proof is conducted via induction. For the base case, when h = H + 1, we have V t H+1(s, iυ) = V H+1(s, iυ) = iυ for all s S and 0 i H/υ . Then suppose we have V t h+1(s, iυ) V H+1(s, iυ) for all s S and 0 i H/υ . 
For any s S, a A and 0 i H/υ , if Q t h(s, iυ, a) = H, then Q t h(s, iυ, a) Q h(s, iυ, a) naturally holds. Otherwise, from Lemma D.2 we know Q t h(s, iυ, a) Q h(s, iυ, a) bbh(s, a) + ϕh(s, a) wt h(iυ) β ϕh(s, a) (Λt h) 1 Q h(s, iυ, a) V t h+1 V π h+1 (s, iυ, a) 0. Therefore we have Q t h(s, iυ, a) Q h(s, iυ, a) for all s S, a A and 0 i H/υ , which leads to V t h(s, iυ) V h(s, iυ) for all s S and 0 i H/υ as well. This concludes our proof. D.5 PROOF OF LEMMA D.4 First note that from Lemma D.3 we have V 1(s1, i1υ) V t 1(s1, i1υ) for all t [T1]. This indicates that conditioned on E1, 1 (s1, i1υ) V 1(s1, i1υ) 1 (s1, i1υ) V t 1(s1, i1υ). (27) Fix t [T1] and then we know 1 (s1, i1υ) V t 1(s1, i1υ) = Q eπt 1 (st 1, i1υ, at 1) Q t 1(s1, i1υ, at 1). If Q t 1(st 1, i1υ, at 1) = H, then we know Q eπt 1 (st 1, i1υ, at 1) Q t 1(s1, i1υ, at 1) 0. Otherwise, we have 1 (st 1, i1υ, at 1) Q t 1(s1, i1υ, at 1) Q eπt 1 (st 1, i1υ, at 1) + bbh(st 1, at 1) D ϕ t 1, wt h(i1υ) E + β ϕ t 1 (Λt 1) 1. Then with Lemma D.2, we know 1 (st 1, i1υ, at 1) Q t 1(s1, i1υ, at 1) b Ph 2 V t 2 (st 1, i1υ, at 1) + 2β ϕ t 1 (Λt 1) 1. (28) Let it hυ denote i1υ Ph 1 h =1 rt h and let {(ζt h)H h=1}T1 t=1 denote the following random variable: 0, if there exists some h h s.t. Q t h (st h , it h υ, at h ) = H, b Ph h+1 V t h+1 (st h, it hυ, at h) h+1(st h+1, it h+1υ) V t 2(st h+1, it h+1υ) , otherwise. It can be easily observed that ζt h Gt,h+1 and E[ζt h|Gt,h] = 0, which means that {(ζt h)H h=1}T1 t=1 is a martingale with respect to {Gt,h}. Then when Q t 1(st 1, i1υ, at 1) = H, Equation (28) is equivalent to 1 (st 1, i1υ, at 1) Q t 1(s1, i1υ, at 1) V eπt 2 (st 2, it 2υ) V t 2(st 2, it 2υ) + ζt h + 2β ϕ t 1 (Λt 1) 1. Repeat such expansion till a step ht such that Q t ht(st ht, it htυ, at ht) = H or ht = H + 1. Then we have 1 (s1, i1υ) V t 1(s1, i1υ) h=1 ζt h + 2β h=1 ϕ t h (Λt h) 1 h=1 ζt h + 2β h=1 ϕ t h (Λt h) 1. Published as a conference paper at ICLR 2024 Therefore combining (28) and (29), we know 1 (s1, i1υ) V 1(s1, i1υ) h=1 ζt h + 2β h=1 ϕ t h (Λt h) 1. (30) For the first term of the RHS of (30), from Azuma-Hoeffding s inequality, we have with probability at least 1 δ/2 that h=1 ζt h 2H q For the second term, we can utilize the elliptical potential lemma (Lemma E.8 in Appendix E), which yields h=1 ϕ t h (Λt h) 1 h=1 (ϕ t h) (Λt h) 1ϕ t h H 2d T1ι 1 2 υ . Plug the above two inequalities into (30) leads to the result in Lemma D.4. E AUXILIARY LEMMAS Lemma E.1 (MLE guarantees). For any fixed (h, k) [H] [K], with probability at least 1 δ, we have that Es {0.5ρk h+0.5ηk h},a U(A)[ b P k h ( |s, a) P h( |s, a) 2 1] ζ, ζ := log(|F|/δ) Using union bound, a direct corollary is: with probability at least 1 δ the following holds for all h [H], k [K] Es {0.5ρk h+0.5ηk h},a U(A)[ b P k h ( |s, a) P h( |s, a) 2 1] 0.5ζk, ζk := log(|F|Hk/δ) Lemma E.2 (Concentration of the bonus term). Set λk = Θ(d log(k H|F|/δ)) at the k-th episode. Define Σρk h,ϕ = k Es ρk h,a U(A)[ϕ(s, a)ϕ (s, a)] + λk I, bΣk,ϕ = X s,a Dh ϕ(s, a)ϕ (s, a) + λk I. With probability 1 δ, we have for any k N+, h [H], ϕ Φ, ϕ(s, a) (bΣk h,ϕ) 1 = Θ ϕ(s, a) Σ 1 ρk h,ϕ Lemma E.3 (One-step back inequality for the learned model). Let π = {πh : S [0, H] 7 (A)}h [H] denote any policy on the augmented state. Fix an initial budget c1 [0, 1]. Take any g : S A 7 R such that g B. We condition on the event where the MLE guarantee (32): Es ρk h,a U(A)[f k h(s, a)] ζk holds for any h [H]. 
For h = 1, we have that E(s,a) d(π,c1) 1, b Pk [g(s, a)] q A Es ρk 1,a U(A)[g2(s, a)] For any h + 1 {2, , H} and policy π, we have that E(s ,a ) d(π,c1) h+1, b Pk [g(s , a )] E(s,a) d(π,c1) bϕk h(s, a) Σ 1 ρk h U(A), b ϕk h k A Es ηk h+1,a U(A)[g2(s , a )] + B2λkd + k B2ζk where Σρk h U(A),bϕk h = k Es ρk h,a U(A)[bϕk h(bϕk h) ] + λk I. Published as a conference paper at ICLR 2024 Proof. For h = 1, we have that E(s,a) d(π,c1) 1, b Pk [g(s, a)] max s,a d1(s)π1(a|s, c1) ρk 1(s)U(a) Es ρk 1,a U(A)[g2(s, a)] A Es ρk 1,a U(A)[g2(s, a)] where we use the fact that d1 = ρk 1. Recall that ρk h(s) = 1 k Pk 1 i=0 dπi h,ci(s) is the (expected) occu- pancy of any s S in dataset Dh . Let ω(π,c1) h,P : S A 7 ([0, 1]) denote the distribution of the remaining budget ch at timestep h conditioned on any (s, a) S A when rolling policy π in an MDP with transition kernel P and the initial budget c1. For any h {1, , H 1}, we have that E(s ,a ) d(π,c1) h+1, b Pk [g(s , a )] =E(s,a) d(π,c1) h, b Pk Es b P k h ( |s,a)Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )] =E(s,a) d(π,c1) bϕk h(s, a) Z S bψk h(s ) Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )]ds E(s,a) d(π,c1) bϕk h(s, a) Σ 1 ρk h U(A), b ϕk h S bψk h(s ) Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )]ds Σρk h U(A), b ϕk h where the last inequality is because of CS inequality. For any (s, a) S A, we have that S bψk h(s )Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )]ds Σρk h U(A), b ϕk h S bψk h(s )Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )]ds n k E s ρk h, a U(A)[bϕk h(bϕk h) ] + λk I o S bψk h(s )Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )]ds k E s ρk h, a U(A) S bψk h(s ) bϕk h( s, a)Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )]ds 2# where we use the assumption a Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )] B, S bψk h(s )y(s )ds 2 for any y : S 7 [0, 1]. Note that bψk h(s )bϕk h( s, a) = b P k h (s | s, a). Further, we derive that k E s ρk h, a U(A) S b P k h (s | s, a)Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )]ds 2# k E s ρk h, a U(A) " Es P h ( | s, a)Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r)[g(s , a )] 2# + B2λkd + k B2ζk k E s ρk h, a U(A),s P h ( | s, a)Ec ω(π,c1) h, b Pk ( |s,a),r,a πh+1( |s ,c r) [g(s , a )]2 + B2λkd + k B2ζk k A E s ρk h, a U(A),s P h ( | s, a),a U(A)[g2(s , a )] + B2λkd + k B2ζk Published as a conference paper at ICLR 2024 Note that s ρk h, a U(A), s P h( | s, a) is equivalent to s ηk h+1. Therefore, E(s ,a ) d(π,c1) h+1, b Pk [g(s , a )] E(s,a) d(π,c1) bϕk h(s, a) Σ 1 ρk h U(A), b ϕk h a bψk h(s )Ec,rh [πh+1(a |s , c rh)] g(s , a )ds Σρk h U(A), b ϕk h E(s,a) d(π,c1) bϕk h(s, a) Σ 1 ρk h U(A), b ϕk h k A Es ηk h+1,a U(A)[g2(s , a )] + B2λkd + k B2ζk which concludes the proof. Lemma E.4 (One-step back inequality for the true model). Let π = {πh : S [0, H] 7 (A)}h [H] denote any policy on the augmented state. Fix an initial budget c1 [0, 1]. Take any g : S A 7 R such that g B. Then for h = 1, we have that E(s,a) d(π,c1) 1 [g(s, a)] q A Es ρk 1,a U(A)[g2(s, a)] For h + 1 {2, , H}, we have that E(s ,a ) d(π,c1) h+1 [g(s , a )] E(s,a) d(π,c1) h ϕ h(s, a) Σ 1 ρk h,ϕ k A Es ρk h+1,a U(A)[g2(s , a )] + λkd B2 Proof. 
For $h=1$, we have that
$$
\mathbb E_{(s,a)\sim d^{(\pi,c_1)}_1}[g(s,a)] \le \sqrt{\max_{s,a}\frac{d_1(s)\pi_1(a|s,c_1)}{\rho^k_1(s)U(a)}\cdot \mathbb E_{s\sim\rho^k_1,\, a\sim U(\mathcal A)}\big[g^2(s,a)\big]} \le \sqrt{A\,\mathbb E_{s\sim\rho^k_1,\, a\sim U(\mathcal A)}\big[g^2(s,a)\big]},
$$
where we use the fact that $d_1 = \rho^k_1$. Let $\omega^{(\pi,c_1)}_h := \omega^{(\pi,c_1)}_{h,P^*}$ denote the distribution of the remaining budget $c_h$ at timestep $h$ conditioned on any $(s,a)\in\mathcal S\times\mathcal A$ when rolling policy $\pi$ in the (true) environment with the initial budget $c_1$. For $h+1\in\{2,\dots,H\}$, by the CS inequality, we derive that
$$
\begin{aligned}
\mathbb E_{(s',a')\sim d^{(\pi,c_1)}_{h+1}}[g(s',a')]
&= \mathbb E_{(s,a)\sim d^{(\pi,c_1)}_h,\, s'\sim P^*_h(\cdot|s,a)}\,\mathbb E_{c\sim\omega^{(\pi,c_1)}_h,\, r,\, a'\sim\pi_{h+1}(\cdot|s',c-r)}[g(s',a')] \\
&\le \mathbb E_{(s,a)\sim d^{(\pi,c_1)}_h}\Big[\|\phi^*_h(s,a)\|_{\Sigma^{-1}_{\rho^k_h,\phi^*}}\Big]\cdot \Big\|\int_{\mathcal S}\psi^*_h(s')\,y(s')\,ds'\Big\|_{\Sigma_{\rho^k_h,\phi^*}},
\end{aligned}
$$
where we abbreviate $y(s') := \mathbb E_{c\sim\omega^{(\pi,c_1)}_h,\, r,\, a'\sim\pi_{h+1}(\cdot|s',c-r)}[g(s',a')]$. For any $(s,a)\in\mathcal S\times\mathcal A$, we obtain that
$$
\begin{aligned}
\Big\|\int_{\mathcal S}\psi^*_h(s')\,y(s')\,ds'\Big\|^2_{\Sigma_{\rho^k_h,\phi^*}}
&= \Big(\int_{\mathcal S}\psi^*_h(s')\,y(s')\,ds'\Big)^{\!\top}\Big\{k\,\mathbb E_{(\bar s,\bar a)\sim\rho^k_h}\big[\phi^*_h(\phi^*_h)^\top\big]+\lambda_k I\Big\}\Big(\int_{\mathcal S}\psi^*_h(s')\,y(s')\,ds'\Big) \\
&\le k\,\mathbb E_{(\bar s,\bar a)\sim\rho^k_h,\, s'\sim P^*_h(\cdot|\bar s,\bar a)}\,\mathbb E_{c\sim\omega^{(\pi,c_1)}_h,\, r,\, a'\sim\pi_{h+1}(\cdot|s',c-r)}\big[g^2(s',a')\big] + \lambda_k d\,B^2 \\
&\le k\,A\,\mathbb E_{s'\sim\rho^k_{h+1},\, a'\sim U(\mathcal A)}\big[g^2(s',a')\big] + \lambda_k d\,B^2,
\end{aligned}
$$
where the last inequality holds by the fact that $(\bar s,\bar a)\sim\rho^k_h,\ s'\sim P^*_h(\cdot|\bar s,\bar a)$ is equivalent to $s'\sim\rho^k_{h+1}$, together with importance sampling. Therefore, we conclude the proof.

Lemma E.5 (Simulation lemma for risk-neutral RL). Let $\widehat{\mathcal M}$ and $\mathcal M$ be two MDPs with the same state and action spaces but different reward and transition functions, i.e., $(\widehat r_h,\widehat P_h)_{h=1}^H$ and $(r_h,P_h)_{h=1}^H$, respectively. Consider any policy $\pi=\{\pi_h:\mathcal S\to\Delta(\mathcal A)\}_{h\in[H]}$ and let $\{V^\pi_{h,\widehat{\mathcal M}}\}_{h=1}^H$ and $\{V^\pi_h\}_{h=1}^H$ denote the corresponding value functions. Then we have that
$$
V^\pi_{1,\widehat{\mathcal M}}(s_1) - V^\pi_1(s_1) = \sum_{h=1}^H \mathbb E_{(s,a)\sim d^\pi_{h,\widehat{\mathcal M}}}\Big[\widehat r_h(s,a) - r_h(s,a) + \big\langle\widehat P_h(\cdot|s,a) - P_h(\cdot|s,a),\, V^\pi_{h+1}(\cdot)\big\rangle\Big].
$$
Equivalently, we have that
$$
V^\pi_{1,\widehat{\mathcal M}}(s_1) - V^\pi_1(s_1) = \sum_{h=1}^H \mathbb E_{(s,a)\sim d^\pi_h}\Big[\widehat r_h(s,a) - r_h(s,a) + \big\langle\widehat P_h(\cdot|s,a) - P_h(\cdot|s,a),\, V^\pi_{h+1,\widehat{\mathcal M}}(\cdot)\big\rangle\Big],
$$
where $V^\pi_{H+1} = V^\pi_{H+1,\widehat{\mathcal M}} = 0$ (as the episode ends at $H$).

Proof. Recall that $d^\pi_h(s,a)$ and $d^\pi_{h,\widehat{\mathcal M}}(s,a)$ are the occupancy measures of any $(s,a)\in\mathcal S\times\mathcal A$ at timestep $h\in[H]$ when executing policy $\pi$ in $\mathcal M$ and $\widehat{\mathcal M}$, respectively. By definition, we have that
$$
\begin{aligned}
V^\pi_{1,\widehat{\mathcal M}}(s_1) - V^\pi_1(s_1)
&= \sum_a \pi_1(a|s_1)\Big(\widehat r_1(s_1,a) - r_1(s_1,a) + \sum_{s'}\big(\widehat P_1(s'|s_1,a)\,V^\pi_{2,\widehat{\mathcal M}}(s') - P_1(s'|s_1,a)\,V^\pi_2(s')\big)\Big) \\
&= \mathbb E_{(s,a)\sim d^\pi_{1,\widehat{\mathcal M}}}\Big[\widehat r_1(s,a) - r_1(s,a) + \big\langle\widehat P_1(\cdot|s,a),\, V^\pi_{2,\widehat{\mathcal M}}(\cdot) - V^\pi_2(\cdot)\big\rangle + \big\langle\widehat P_1(\cdot|s,a) - P_1(\cdot|s,a),\, V^\pi_2(\cdot)\big\rangle\Big] \\
&= \cdots = \sum_{h=1}^H \mathbb E_{(s,a)\sim d^\pi_{h,\widehat{\mathcal M}}}\Big[\widehat r_h(s,a) - r_h(s,a) + \big\langle\widehat P_h(\cdot|s,a) - P_h(\cdot|s,a),\, V^\pi_{h+1}(\cdot)\big\rangle\Big].
\end{aligned}
$$
Equivalently, we have that
$$
\begin{aligned}
V^\pi_{1,\widehat{\mathcal M}}(s_1) - V^\pi_1(s_1)
&= \mathbb E_{(s,a)\sim d^\pi_1}\Big[\widehat r_1(s,a) - r_1(s,a) + \big\langle P_1(\cdot|s,a),\, V^\pi_{2,\widehat{\mathcal M}}(\cdot) - V^\pi_2(\cdot)\big\rangle + \big\langle\widehat P_1(\cdot|s,a) - P_1(\cdot|s,a),\, V^\pi_{2,\widehat{\mathcal M}}(\cdot)\big\rangle\Big] \\
&= \cdots = \sum_{h=1}^H \mathbb E_{(s,a)\sim d^\pi_h}\Big[\widehat r_h(s,a) - r_h(s,a) + \big\langle\widehat P_h(\cdot|s,a) - P_h(\cdot|s,a),\, V^\pi_{h+1,\widehat{\mathcal M}}(\cdot)\big\rangle\Big],
\end{aligned}
$$
which concludes the proof.

Lemma E.6. Let $\Lambda_t = \lambda I + \sum_{j=1}^t \phi_j\phi_j^\top$, where $\phi_j\in\mathbb R^d$, $\lambda>0$, and $d>0$. Then we have
$$
\sum_{j=1}^t \phi_j^\top\Lambda_t^{-1}\phi_j \le d.
$$
Proof. Please refer to (Jin et al., 2020, Lemma D.1).

Lemma E.7. Let $\{x_j\}_{j=1}^\infty$ be a stochastic process on the space $\mathcal S\times\{i\upsilon\}_{i=0}^{\lfloor H/\upsilon\rfloor}$ with corresponding filtration $\{\mathcal G_j\}_{j=0}^\infty$. Let $\{\phi_j\}_{j=0}^\infty$ be an $\mathbb R^d$-valued stochastic process where $\phi_j\in\mathcal G_{j-1}$ and $\|\phi_j\|\le 1$. Let $\Lambda_t = \lambda I + \sum_{j=1}^{t-1}\phi_j\phi_j^\top$. Then for any $\delta>0$, with probability at least $1-\delta$, for all $t\ge 0$ and any $V\in\mathcal V$ such that $\sup_x|V(x)|\le H$, we have
$$
\Big\|\sum_{j=1}^{t-1}\phi_j\big(V(x_j) - \mathbb E[V(x_j)\,|\,\mathcal G_{j-1}]\big)\Big\|^2_{\Lambda_t^{-1}} \le 4H^2\Big[\frac d2\log\frac{t+\lambda}{\lambda} + \log\frac{N_\epsilon}{\delta}\Big] + \frac{8t^2\epsilon^2}{\lambda},
$$
where $N_\epsilon$ is the $\epsilon$-covering number of $\mathcal V$ with respect to the $\ell_\infty$-norm $\sup_x|V(x)-V'(x)|$.
Proof. Please refer to (Jin et al., 2020, Lemma D.4).

Lemma E.8. Suppose $\phi_t\in\mathbb R^d$ and $\|\phi_t\|\le 1$ for all $t\ge 0$. For any $t\ge 0$, we define $\Lambda_t = I + \sum_{j=1}^t\phi_j\phi_j^\top$. Then we have
$$
\sum_{j=1}^t \phi_j^\top\Lambda_{j-1}^{-1}\phi_j \le 2\log\det(\Lambda_t).
$$
Proof. Please refer to (Jin et al., 2020, Lemma D.2).
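Remark (numerical illustration). The arguments above repeatedly evaluate the discretized CVaR dual $\max_{0\le i\le\lfloor H/\upsilon\rfloor}\{i\upsilon - \tau^{-1}\mathbb E[(i\upsilon - R)_+]\}$, with Proposition B.1 controlling the discretization error (the $KH\upsilon$ term in the proof of Theorem B.3). The short Python sketch below is our own illustration of this grid evaluation on Monte-Carlo return samples; it is not part of the algorithms above, and the sampled returns, the grid resolution, and the function name `cvar_dual_on_grid` are assumptions made only for this example.

```python
import numpy as np

def cvar_dual_on_grid(returns, tau, upsilon, H):
    """Estimate CVaR_tau(X) = sup_b { b - E[(b - X)_+] / tau } over the budget grid {i * upsilon}.

    returns : 1-D array of sampled cumulative rewards R, assumed here to lie in [0, H].
    """
    returns = np.asarray(returns, dtype=float)
    grid = upsilon * np.arange(int(np.floor(H / upsilon)) + 1)       # candidate budgets i * upsilon
    # b - tau^{-1} * E[(b - X)_+] for every candidate budget b on the grid
    objective = grid - np.mean(np.maximum(grid[:, None] - returns[None, :], 0.0), axis=1) / tau
    i_star = int(np.argmax(objective))
    return grid[i_star], objective[i_star]

# Toy usage: Monte-Carlo returns of some fixed policy (hypothetical data with H = 1).
rng = np.random.default_rng(1)
R = np.clip(rng.normal(loc=0.6, scale=0.2, size=10_000), 0.0, 1.0)
budget, cvar_estimate = cvar_dual_on_grid(R, tau=0.1, upsilon=0.01, H=1.0)
print(budget, cvar_estimate)   # maximizing grid point and the CVaR_0.1 estimate
```

On a sample of returns, the maximizing grid point plays a role analogous to the budget $i^k\upsilon$ selected in the outer loop of Algorithm 3.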