# Eluder-based Regret for Stochastic Contextual MDPs

Orin Levy *1, Asaf Cassel *1, Alon Cohen 2 3, Yishay Mansour *1 3

We present the E-UC3RL algorithm for regret minimization in Stochastic Contextual Markov Decision Processes (CMDPs). The algorithm operates under the minimal assumptions of a realizable function class and access to offline least squares and log loss regression oracles. Our algorithm is efficient (assuming efficient offline regression oracles) and enjoys a regret guarantee of $\widetilde{O}\big(H^3 \sqrt{T |S| |A| \, d_E(\mathcal{P}) \log(|\mathcal{F}||\mathcal{P}|/\delta)}\big)$, with $T$ being the number of episodes, $S$ the state space, $A$ the action space, $H$ the horizon, $\mathcal{P}$ and $\mathcal{F}$ finite function classes used to approximate the context-dependent dynamics and rewards, respectively, and $d_E(\mathcal{P})$ the Eluder dimension of $\mathcal{P}$ w.r.t. the Hellinger distance. To the best of our knowledge, our algorithm is the first efficient and rate-optimal regret minimization algorithm for CMDPs that operates under the general offline function approximation setting. In addition, we extend the Eluder dimension to general bounded metrics, which may be of independent interest.

*Equal contribution. 1 Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel. 2 School of Electrical Engineering, Tel Aviv University, Tel Aviv, Israel. 3 Google Research, Tel Aviv, Israel. Correspondence to: Orin Levy, Asaf Cassel, Alon Cohen, Yishay Mansour. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Reinforcement Learning (RL) is a field of machine learning that pertains to sequential decision making under uncertainty. At the heart of RL is the Markov Decision Process (MDP), a fundamental mathematical model that has been studied extensively. An agent repeatedly interacts with an MDP by observing its state $s \in S$ and selecting an action $a \in A$, which leads to a new state $s'$ and an instantaneous reward that reflects the quality of the action taken. The agent's goal is to maximize her return during each episode of interaction with the MDP. MDPs can be applied to a wide range of real-life scenarios, including advertising, healthcare, games, and robotics (see, e.g., Sutton & Barto, 2018; Mannor et al., 2022).

Many modern applications involve the presence of additional side information, or context, that impacts the environment. A naive approach to handling the context is to extend the state space of the environment to include it. However, this method increases the complexity of learning and policy representation. Contextual MDPs (CMDPs) offer a more efficient solution by keeping the state space small and treating the context as additional side information that the agent observes at the start of each episode. Furthermore, there exists a mapping from each context to an MDP, and the optimal policy for a given context is the optimal policy of the corresponding MDP (Hallak et al., 2015). An example of a context is user information that remains constant throughout the episode. Such information may include the user's age and interests, and can deeply impact decision making. This feature makes CMDPs an excellent model for recommendation systems. As is common in recent works, the aforementioned mapping from context to MDP is assumed to be taken from a known function class, and access to the function class is provided via an optimization oracle.
A distinctive feature between works is whether they assume access to online or offline oracles. Intuitively, in both settings we have a function class $\mathcal{F} = \{f : \mathcal{X} \to \mathcal{Y}\}$, a loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, and a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ (think of $\mathcal{X}$ as the context and $\mathcal{Y}$ as the MDP). An offline oracle observes the entire dataset and needs to find $\hat{f} \in \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} \ell(f(x_i), y_i)$. An online oracle makes a sequence of predictions $f_1, \ldots, f_n$, where $f_i$ can depend only on the data up to index $i-1$, and its goal is to minimize the regret, given by $\sum_{i=1}^{n} \ell(f_i(x_i), y_i) - \ell(\hat{f}(x_i), y_i)$. The offline problem can potentially be easier to solve than the online problem. Moreover, practical deep RL applications typically work in the offline regime.

Previously, Modi & Tewari (2020) obtained $\widetilde{O}(\sqrt{T})$ regret for a generalized linear model (GLM). Foster et al. (2021) obtain $\widetilde{O}(\sqrt{T})$ regret for general function approximation and adversarially chosen contexts, assuming access to a much stronger online estimation oracle. However, they noted the challenge of implementing their methodology using offline oracles. It thus remained open whether, for stochastic contexts, we can restrict the access to offline oracles. Recently, Levy & Mansour (2023) gave an $\widetilde{O}(\sqrt{T/p_{\min}})$ regret algorithm for stochastic contexts using offline least squares regression, where $p_{\min}$ is a minimum reachability parameter of the CMDP. This parameter can be arbitrarily small and in general CMDPs leads to an $\widetilde{O}(T^{3/4})$ regret guarantee. The question of whether the minimum reachability assumption can be obviated, or replaced by a less restrictive assumption on the function class, remained open. In this work, we give the first $\widetilde{O}(\sqrt{T})$ regret algorithm for stochastic contexts using standard offline oracles, under the bounded Eluder dimension (Russo & Van Roy, 2013) assumption (more details in Section 2.3).

Contributions. We present the E-UC3RL algorithm for stochastic CMDPs with offline regression oracles. Our algorithm is efficient (assuming efficient oracles) and enjoys an $\widetilde{O}\big(H^3 \sqrt{T |S| |A| \, d_E(\mathcal{P}) \log(|\mathcal{F}||\mathcal{P}|/\delta)}\big)$ regret bound with probability at least $1 - \delta$, where $S$ is the state space, $A$ the action space, $H$ the horizon, $\mathcal{P}$ and $\mathcal{F}$ are finite function classes used to approximate the context-dependent dynamics and rewards, respectively, and $d_E(\mathcal{P})$ is the Eluder dimension, with respect to the Hellinger distance, of the context-dependent dynamics function class $\mathcal{P}$. The algorithm builds on the optimistic-in-expectation approach of Levy & Mansour (2023) but modifies it with a log-loss oracle for the dynamics approximation and carefully chosen counterfactual reward bonuses. To that end, we present an extension of the Eluder dimension to general bounded metrics (rather than the $\ell_2$ norm considered by Russo & Van Roy (2013) and Osband & Van Roy (2014)). An additional key technical tool enabling our result is a multiplicative change of measure inequality for the value function. Both tools may be of separate interest.

Comparison with Levy & Mansour (2023). This work is most closely related to ours. It relies on a minimum reachability assumption and provides a regret bound of $\widetilde{O}(\sqrt{T/p_{\min}})$, where $p_{\min}$ is the reachability parameter of the CMDP. The assumption implies that, for any context $c$, any policy $\pi$ reaches any state $s$ with probability at least $p_{\min}$; hence $p_{\min} \le 1/|S|$. As such, any policy inherently explores with probability $p_{\min}$, significantly simplifying the exploration task.
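As a brief aside, the offline oracle abstraction described at the start of this section can be made concrete with a minimal sketch for a toy finite class under the squared loss. The helper names (`offline_oracle`, `online_oracle_regret`) and the follow-the-leader online predictor below are our own illustration, not part of E-UC3RL or of any cited oracle interface.

```python
import numpy as np

def offline_oracle(function_class, data, loss):
    """Offline (ERM) oracle: sees the whole dataset and returns an empirical
    cumulative-loss minimizer over the finite class."""
    return min(function_class,
               key=lambda f: sum(loss(f(x), y) for x, y in data))

def online_oracle_regret(function_class, data, loss):
    """Online protocol: before seeing (x_i, y_i), a predictor f_i is chosen using
    only the first i-1 examples (here: ERM on the prefix, i.e. follow-the-leader,
    one possible online oracle). Returns its regret against the best fixed
    function in hindsight."""
    online_loss = 0.0
    for i, (x, y) in enumerate(data):
        f_i = offline_oracle(function_class, data[:i], loss) if i > 0 else function_class[0]
        online_loss += loss(f_i(x), y)
    f_hat = offline_oracle(function_class, data, loss)
    best_loss = sum(loss(f_hat(x), y) for x, y in data)
    return online_loss - best_loss

# Toy example: a finite class of constant predictors and the squared loss.
sq_loss = lambda p, y: (p - y) ** 2
F = [lambda x, c=c: c for c in np.linspace(0.0, 1.0, 11)]
rng = np.random.default_rng(0)
data = [(None, 0.3 + 0.1 * rng.standard_normal()) for _ in range(50)]
print(offline_oracle(F, data, sq_loss)(None))   # constant close to 0.3
print(online_oracle_regret(F, data, sq_loss))   # regret of the FTL predictions (small here)
```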
While the notion of minimum reachability is intuitive, it fails even for deterministic transition functions where pmin = 0. Moreover, it is impossible to estimate it online as we typically observe each context only once. The primary focus of our work is to replace minimum reachability, which is a structural assumption about the true CMDP, with an assumption about the dynamics function class, which is chosen by the learner. This makes learning an effective exploratory policy non-trivial, necessitating in- novative confidence bounds that capture the intricacies of the function class learning complexity. In our approach, we employ the Eluder dimension as the complexity measure. One can show that minimum reachability implies a bound on the Eluder dimension, but, the Eluder dimension can be much smaller. The general function approximation literature. We stress that the role of the Eluder dimension in this work is to avoid direct dependence on the size of the context space, which could be prohibitively large, while also maintaining computational efficiency. This is unlike previous works on function approximation in RL (see, e.g., Jiang et al. (2017); Jin et al. (2021); Wu et al. (2023); Chen et al. (2022); Wang et al. (2020b); Dann et al. (2021); Liu et al. (2022)) that use an Eluder dimension to avoid dependence on |S|, |A|. These works are often computationally inefficient and require additional structural assumptions regarding the MDP, such as low Bellman-rank or Bellman completeness, or a much stronger optimization oracle. Additional Related Work. Hallak et al. (2015) were the first to study regret guarantees in the the CMDP model. However, they assume a small context space, and their regret is linear in its size. Jiang et al. (2017) present OLIVE, a sample efficient algorithm for learning Contextual Decision Processes (CDP) under the low Bellman rank assumption. In contrast, we do not make any assumptions on the Bellman rank. Sun et al. (2019) use the Witness Rank to derive PAC bounds for model based learning of CDPs. Modi et al. (2018) present generalization bounds for learning smooth CMDPs and finite contextual linear combinations of MDPs. Modi & Tewari (2020) present a regret bound of e O( T) for CMDPs under a Generalized Linear Model (GLM) assumption. Our function approximation framework is more general than smooth CMDPs or GLM. Foster et al. (2021) present the Estimation to Decision (E2D) meta algorithm and apply it to obtain e O( T) regret for adversarial Contextual RL. Later, Xie et al. (2022) show sample complexity bounds for online reinforcement learning using online oracle, that can be also applied to CMDPs. Levy et al. (2023) obtained similar results using their OMGCMDP! algorithm. However, these works assume access to online estimation oracles and their bounds scale with the oracle s regret. In contrast, we use substantially weaker offline regression oracles. It is not clear to us whether Foster et al. s Inverse Gap Minimization (IGM) technique or Levy et al. s convex optimization with log-barrier method can be applied to CMDPs with offline regression oracles. Levy & Mansour (2022) study the sample complexity of learning CMDPs using function approximation and provide the first general and efficient reduction from CMDPs to offline supervised learning. However, their sample complexity scales as ϵ 8, and thus they cannot achieve the optimal Eluder-based Regret for Stochastic Contextual MDPs rate for the regret. 
Levy & Mansour (2023), previously mentioned here in relation to upper bounds, also showed an Ω( p TH|S||A| log(|G|/|S|)/ log(|A|)) regret lower bound for the general setting of offline function approximation with |G|, the size of the function class used to approximate the rewards in each state. More broadly, CMDPs are a natural extension of the extensively studied Contextual Multi-Armed Bandit (CMAB) model. CMABs augment the Multi-Arm Bandit (MAB) model with a context that determines the rewards (Lattimore & Szepesv ari, 2020; Slivkins, 2019). Langford & Zhang (2007); Agarwal et al. (2014) use an optimization oracle to derive an optimal regret bound that depends on the size of the policy class they compete against. Regression based approaches were presented in Agarwal et al. (2012); Foster & Rakhlin (2020); Foster et al. (2018); Foster & Krishnamurthy (2021); Simchi-Levi & Xu (2021); Zhang (2022). Most closely related to our work, Xu & Zeevi (2020) present the first optimistic algorithm for CMAB. They assume access to a least-squares regression oracle and achieve e O( p T|A| log |F|) regret, where F is a finite and realizable function class, used to approximate the rewards. Extending their techniques to CMDPs necessitates accounting for the context-dependent dynamics whose interplay with the rewards significantly complicates decision making. This is the main challenge both in our work and in Levy & Mansour (2023). The Eluder dimension was introduced by Russo & Van Roy (2013) and applied to derive sublinear regret for MABs and CMABs. Osband & Van Roy (2014) showed an application of the Eluder dimension to derive a regret bound for modelbased reinforcement learning and Wen & Van Roy (2017) for deterministic systems. Wang et al. (2020a) use it to derive a regret bound for value function approximation. Jin et al. (2021) present the Bellman-Eluder dimension and use it to develop sample-efficient algorithms for a family of RL problems where both the Bellman rank and the Eluder dimension are low. Ayoub et al. (2020) apply the Eluder dimension to derive a regret bound for tabular episodic RL using targeted value regression. We, on the other hand, extend the Eluder dimension to general bounded metrics and apply it to contextual RL. 2. Preliminaries 2.1. Episodic Loop-Free Markov Decision Process (MDP) An MDP is defined by a tuple (S, A, P, r, s0, H), where S and A are finite sets describing the state and action spaces, respectively; s0 S is the unique start state; H N is the horizon; P : S A S [0, 1] defines the probability of transitioning to state s given that we start at state s and perform action a; and r(s, a) is the expected reward of performing action a at state s. An episode is a sequence of H interactions where at step h, if the environment is at state sh and the agent performs action ah then (regardless of past history) the environment transitions to state sh+1 P( | sh, ah) and the agent receives reward R(sh, ah) [0, 1], sampled independently from a distribution Dsh,ah that satisfies r(sh, ah) = EDsh,ah[R(sh, ah)]. For technical convenience and without loss of generality, we assume that the state space and accompanying transition probabilities have a loop-free (or layered) structure. Concretely, we assume that the state space can be decomposed into H + 1 disjoint subsets (layers) S0, S1, . . . , SH 1, SH such that transitions are only possible between consecutive layers, i.e., for h = h + 1 we have P(sh |sh, a) = 0 for all sh Sh , sh Sh, a A. 
In addition, $S_H = \{s_H\}$, meaning there is a unique final state with reward 0. We note that this assumption can always be satisfied by increasing the size of the state space by a factor of $H$.

A deterministic stationary policy $\pi : S \to A$ is a mapping from states to actions. Given a policy $\pi$ and an MDP $M = (S, A, P, r, s_0, H)$, the stage-$h$ value function of a state $s \in S_h$, for $h \in [H-1]$, is defined as
$$V^{\pi}_{M,h}(s) = \mathbb{E}_{\pi, M}\Bigg[\sum_{k=h}^{H-1} r(s_k, a_k) \,\Bigg|\, s_h = s\Bigg].$$
For brevity, when $h = 0$ we denote $V^{\pi}_{M,0}(s_0) := V^{\pi}_{M}(s_0)$, which is the expected cumulative reward under policy $\pi$ and its measure of performance. A policy $\pi^{\star}_{M}$ is optimal for MDP $M$ if it satisfies $\pi^{\star}_{M} \in \arg\max_{\pi : S \to A} \{V^{\pi}_{M}(s_0)\}$. It is well known that such a policy is optimal even among the class of stochastic and history-dependent policies (see, e.g., Puterman, 2014; Sutton & Barto, 2018; Mannor et al., 2022).

2.2. Problem Setup: Stochastic Contextual Markov Decision Process (CMDP)

A CMDP is defined by a tuple $(C, S, A, \mathcal{M})$ where $C$ is the context space, $S$ the state space and $A$ the action space. The mapping $\mathcal{M}$ maps a context $c \in C$ to an MDP $\mathcal{M}(c) = (S, A, P^c, r^c, s_0, H)$, where $r^c(s, a) = \mathbb{E}[R^c(s, a) \mid c, s, a]$ and $R^c(s, a) \sim D_{c,s,a}$. We assume that $R^c(s, a) \in [0, 1]$. We consider a stochastic CMDP, meaning the context is stochastic. Formally, we assume that there is an unknown distribution $D$ over the context space $C$, and for each episode a context is sampled i.i.d. from $D$. For mathematical convenience, we assume that the context space $C$ is finite but potentially very large. Our results do not depend on the size of the context space and can be further extended to infinite context spaces.

A deterministic context-dependent policy $\pi : C \times S \to A$ maps a context $c \in C$ to a policy $\pi(c; \cdot) : S \to A$. Let $\Pi_C$ denote the class of all deterministic context-dependent policies.

Interaction protocol. The interaction between the agent and the environment is defined as follows. In each episode $t = 1, 2, \ldots, T$ the agent: (i) observes a context $c_t \sim D$; (ii) chooses a policy $\pi_t$ (based on $c_t$ and the observed history); (iii) observes the trajectory $(c_t, s^t_0, a^t_0, r^t_0, \ldots, s^t_H)$ generated by playing $\pi_t$ in $\mathcal{M}(c_t)$. Our goal is to minimize the regret, defined as
$$R_T = \sum_{t=1}^{T} V^{\pi^{\star}(c_t; \cdot)}_{\mathcal{M}(c_t)}(s_0) - V^{\pi_t(c_t; \cdot)}_{\mathcal{M}(c_t)}(s_0),$$
where $\pi^{\star} \in \Pi_C$ is an optimal context-dependent policy. We aim to derive regret bounds that are independent of the context space size $|C|$. For that purpose, we make function approximation assumptions in Section 2.4, which rely on the following definition of the Eluder dimension.

2.3. Metric Eluder Dimension

We extend the notion of Eluder dimension, given by Osband & Van Roy (2014), to general bounded metrics. Let $\mathcal{X}$ be a set and $(\mathcal{Y}, D)$ a bounded metric space. Let $\mathcal{P} \subseteq \{\mathcal{X} \to \mathcal{Y}\}$ be a set of functions from $\mathcal{X}$ to $\mathcal{Y}$. We say that $x \in \mathcal{X}$ is $(D, \epsilon)$-dependent on $x_1, \ldots, x_n$ if and only if for any $P, P' \in \mathcal{P}$ it holds that
$$\sum_{i=1}^{n} D^2(P(x_i), P'(x_i)) \le \epsilon^2 \implies D^2(P(x), P'(x)) \le \epsilon^2.$$
We say that $x \in \mathcal{X}$ is $(D, \epsilon)$-independent of $x_1, \ldots, x_n$ if it is not $(D, \epsilon)$-dependent.

Definition 2.1 (Metric-Eluder Dimension). We say that $d := d_E(\mathcal{P}, D, \epsilon)$ is the $(D, \epsilon)$-Eluder dimension of a class $\mathcal{P}$ if $d$ is the maximum length of sequences $x_1, \ldots, x_d$ and $\epsilon'_1, \ldots, \epsilon'_d$ such that for all $1 \le i \le d$, $x_i$ is $(D, \epsilon'_i)$-independent of its prefix $x_1, \ldots, x_{i-1}$ and $\epsilon'_i \ge \epsilon$.

This quantity roughly corresponds to the number of queries required to $\epsilon$-identify a function in $\mathcal{P}$. The utility of this definition is summarized in the following result, which is a straightforward adaptation of Proposition 6 in Osband & Van Roy (2014) (proof in Appendix A).
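To build intuition for Definition 2.1, the following brute-force sketch computes a lower bound on the metric Eluder dimension of a tiny finite class: it fixes every per-element scale to $\epsilon$ (the definition allows any $\epsilon'_i \ge \epsilon$), and the domain, class, and metric are illustrative toys rather than anything used by the algorithm.

```python
from itertools import permutations

def is_independent(x, prefix, P, D, eps):
    """x is (D, eps)-independent of `prefix` iff some pair P1, P2 in the class is
    eps-close on the prefix (in cumulative squared distance) yet eps-far at x."""
    for P1 in P:
        for P2 in P:
            close_on_prefix = sum(D(P1(z), P2(z)) ** 2 for z in prefix) <= eps ** 2
            if close_on_prefix and D(P1(x), P2(x)) ** 2 > eps ** 2:
                return True
    return False

def eluder_dimension_lower_bound(X, P, D, eps):
    """Length of the longest sequence (without repetition -- a repeated element is
    always dependent on a prefix containing it) in which every element is
    (D, eps)-independent of its prefix."""
    best = 0
    for order in permutations(X):
        seq = []
        for x in order:
            if is_independent(x, seq, P, D, eps):
                seq.append(x)
        best = max(best, len(seq))
    return best

# Toy instance: X = {0, 1, 2}, P = threshold functions into [0, 1], D = absolute value.
X = [0, 1, 2]
P = [lambda x, t=t: float(x >= t) for t in range(4)]
D = lambda a, b: abs(a - b)
print(eluder_dimension_lower_bound(X, P, D, eps=0.5))   # prints 3 for this toy class
```

In this toy instance every point can be made informative, so the bound equals $|\mathcal{X}|$; Lemma 2.2 below is what turns a bounded Eluder dimension into control of how often large prediction widths can occur.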
For any P P, define its radius at x X as w P (x) = sup P,P P D(P(x), P (x)). Lemma 2.2. For any t [T], h [H] let xt h X and Pt P be arbitrary. Define the confidence sets with parameter β h=0 D2(P(xi h), Pt(xi h)) β We have that h=0 (w Pt(xt h))2 6d E(P, D, T 1/2)β log T. 2.4. Function Class Assumptions We note that, without further assumptions, the regret may scale linearly in the size of the context space (Hallak et al., 2015). Even worse, if the context space contains more than T contexts, and the distribution over the contexts is uniform, the regret may scale linearly in T. We overcome this limitation by imposing the following function approximation assumptions, that extend similar notions in the Contextual Multi-Armed Bandits literature (Agarwal et al., 2012; Russo & Van Roy, 2013; Foster et al., 2018; Foster & Krishnamurthy, 2021; Simchi-Levi & Xu, 2021) to CMDPs. Realizable reward function approximation. Our algorithm gets as input a finite function class F C S A [0, 1] such that there exists f F that satisfies f (c, s, a) = rc (s, a) for all c C and (s, a) S A. Realizable dynamics function approximation. Our algorithm gets as input a finite function class P S (S A C) [0, 1] such that P P, and every function P P represents valid transition probabilities, i.e., satisfies P s S P(s | s, a, c) = 1 for all c C and (s, a) S A. For convenience, we denote P(s | s, a, c) = P c(s | s, a), for all P P. Offline regression oracles. Given a data set D = {(ci, si, ai, s i, ri)}n i=1, we assume access to offline oracles that solve the optimization problems: bf arg min f F i=1 (f(ci, si, ai) ri)2, (Least Squares Regression (LSR)) b P arg min P P i=1 log 1 P ci(s i | si, ai). (Log Loss Regression (LLR)) Notice that the above problems can always be solved by iterating over the function class. However, since we consider strongly convex loss functions, there are function classes where these optimization problems can be solved efficiently. One particular example is the class of linear functions. Eluder Dimension (w.r.t the Squared Hellinger distance). As shown by Foster et al. (2021), the log-loss oracle provides generalization guarantees with respect to the squared Helligner distance. Eluder-based Regret for Stochastic Contextual MDPs Definition 2.3 (Squared Hellinger Distance). For any two distributions P, Q over a discrete support X, the Squared Hellinger Distance is defined as D2 H(P, Q) := X The Hellinger distance is a bounded metric. Thus, we assume a known upper bound d P of d E(P, DH, T 1/2), the Eluder dimension of P with respect to Hellinger distance. Clearly, for a finite class the Eluder dimension is at most the number of functions. For classes of discrete distributions, where the minimum probability is p > 0, one can bound the Eluder dimension w.r.t. the Hellinger distance by d2/p, where d2 is the (standard) Eluder dimension w.r.t. ℓ2. (See Lemma A.4). 3. Algorithm and Main Result We present Eluder Upper Counterfactual Confidence for Contextual Reinforcement Learning (E-UC3RL), given in Algorithm 1. At each episode t, the algorithm estimates the reward and dynamics using the regression oracles. It then constructs an optimistic CMDP using reward bonuses and plays its optimal policy. The reward bonuses are inspired by the notion of counterfactual confidence, suggested by Xu & Zeevi (2020) for CMABs. The original idea was to calculate the confidence bounds using the counterfactual actions of past policies given the current context. 
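As a brief aside before continuing with the algorithm, here is a minimal sketch of the two offline oracles of Section 2.4 and of the squared Hellinger distance of Definition 2.3, implemented by enumerating a finite class (which, as noted above, is always possible). The `(c, s, a, s', r)` tuple layout, the representation of each dynamics function as a map from `(c, s, a)` to a dictionary of next-state probabilities, and the 1/2 normalization of the Hellinger distance are our own illustrative choices.

```python
import math

def lsr_oracle(F, data):
    """Least Squares Regression: argmin_f sum_i (f(c_i, s_i, a_i) - r_i)^2."""
    return min(F, key=lambda f: sum((f(c, s, a) - r) ** 2
                                    for (c, s, a, s_next, r) in data))

def llr_oracle(P, data, floor=1e-12):
    """Log Loss Regression: argmin_P sum_i log(1 / P^{c_i}(s'_i | s_i, a_i)).
    `floor` only guards against log(0) in this sketch."""
    return min(P, key=lambda p: sum(math.log(1.0 / max(p(c, s, a)[s_next], floor))
                                    for (c, s, a, s_next, r) in data))

def squared_hellinger(p, q):
    """D_H^2(p, q) = (1/2) * sum_x (sqrt(p(x)) - sqrt(q(x)))^2 over a shared
    discrete support (one common normalization; bounded, as required)."""
    return 0.5 * sum((math.sqrt(p[x]) - math.sqrt(q[x])) ** 2 for x in p)
```

For structured classes such as linear functions, the same minimizations can be carried out with convex optimization instead of enumeration, as the text notes.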
Levy & Mansour (2023) adapted this approach to CMDPs using the minimum reachability assumption, without which, it becomes crucial to also consider counterfactual states. Notice that the states are stochastically generated by the MDP in response to the agent s played actions. This makes counterfactual state computation impossible without access to the true dynamics. Instead, we consider the counterfactual probabilities of a state-action pair and evaluate this quantity using the estimated dynamics. These probabilities are typically referred to as occupancy measures (Zimin & Neu, 2013). Concretely, for any non-contextual policy π and dynamics P, let qh(s, a | π, P) denote the probability of reaching state s S and performing action a A at time h [H] of an episode generated using policy π and dynamics P. Note that, given π and P, the occupancy measure of any state-action pair can be computed efficiently using a standard planning algorithm. At round t and (s, a, h, c) tuple the cumulative occupancy measure of past policies, i.e., Pt 1 i=1 qh(s, a|πc i , P c ) is a good indicator for the quality of the estimated dynamics and rewards bft and b Pt. Thus we would ideally choose bonuses inversely proportional to this quantity. Since P is unknown, it is natural to replace it in qh( ) with the most recent estimate b Pt. However, the instability of the oracle estimates b Pt means that qh(s, a, |πc i , b P c t ) can change arbitrarily with t, which may lead to overly large bonuses. We resolve this instability using the Eluder dimension assumption, which allows us to replace b Pt with b Pi in qh( ), thus stabilizing the occupancy measure estimate of πi as qh(s, a, |πc i , b P c i ) for all t. Finally, since our bonuses are based on past contextdependent policies, we first have to compute πk(ct; ) for all k [t 1], which is the purpose of our internal loop (Algorithm 1). Algorithm 1 Eluder Upper Counterfactual Confidence for Contextual RL (E-UC3RL) 1: inputs: MDP parameters: S, A, s0, H; tuning parameters βr, βP . 2: for round t = 1, . . . , T do 3: compute using the LSR oracle: bft arg min f F h=0 (f(ci, si h, ai h) ri h)2 4: computed using the LLR oracle: b Pt arg min P P 1 P ci(si h+1|si h, ai h) 5: observe a fresh context ct D. 6: for k = 1, 2, . . . , t do 7: compute for all h [H] and (s, a) Sh A: brct k (s, a) = bfk(ct, s, a) + bβr k (ct, s, a, h) + HbβP k (ct, s, a, h) bβ k(c, s, a, h) = 1 + Pk 1 i=1 qh(s, a|πi(c; ), b P c i ) 8: define c Mk(ct) = (S, A, b P ct k , brct k , s0, H). 9: compute using a planning algorithm: πk(ct; ) arg max π:S A V π c Mk(ct)(s0). 10: play πt(ct; ) and observe a trajectory σt = (ct, st 0, at 0, rt 0, st 1, . . . , st H 1, at H 1, rt H 1, st H). The following is our main result for Algorithm 1. We sketch its proof in Section 4, and defer the complete proof to Appendix B.5. Theorem 3.1 (E-UC3RL regret bound). For any T > 1 and Eluder-based Regret for Stochastic Contextual MDPs δ (0, 1), suppose we run Algorithm 1 with parameters 504TH2d P log2(64T 4H|F||P|/δ2) |S||A| log(T + 1) , 2029TH2d P log2(8TH|P|/δ) |S||A| log(T + 1) , and d P d E(P, DH, T 1/2). Then, with probability at least 1 δ it holds that RT (E-UC3RL) e O(H3p T|S||A|d P (log(|F||P|/δ))). We remark that using covering numbers analysis (Shalev Shwartz & Ben-David, 2014), our result naturally generalizes to infinite function classes as well as context spaces. In addition, when comparing our regret upper bound to the lower bound of Levy & Mansour (2023), there is an apparent gap of H2.5 and d P factors. 
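Both computations that Algorithm 1 delegates to standard dynamic programming, namely the planning step of Line 9 and the occupancy measures $q_h(s, a \mid \pi, P)$ entering the bonuses of Line 7, can be sketched as follows for a loop-free tabular MDP. The array layout (`P[h][s, a, s']` for layer-to-layer transitions, `rewards[h, s, a]`, a deterministic `pi[h][s]`, layers of equal size, start state indexed 0) and all names are ours, for illustration only.

```python
import numpy as np

def plan(rewards, P, n_states, n_actions, H):
    """Finite-horizon backward induction: returns a deterministic policy pi[h][s]
    maximizing the expected cumulative reward in the layered MDP defined by
    rewards[h, s, a] and P[h][s, a, s']."""
    V = np.zeros(n_states)                    # the terminal layer has value 0
    pi = [[0] * n_states for _ in range(H)]
    for h in reversed(range(H)):
        Q = rewards[h] + P[h] @ V             # Bellman backup: (s, a) action values
        pi[h] = list(np.argmax(Q, axis=1))
        V = np.max(Q, axis=1)
    return pi

def occupancy_measures(pi, P, n_states, n_actions, H):
    """Forward dynamic programming: q[h, s, a] = Pr[s_h = s, a_h = a] under the
    deterministic policy pi and dynamics P, starting from state s_0 = 0."""
    q = np.zeros((H, n_states, n_actions))
    state_dist = np.zeros(n_states)
    state_dist[0] = 1.0
    for h in range(H):
        next_dist = np.zeros(n_states)
        for s in range(n_states):
            a = pi[h][s]
            q[h, s, a] = state_dist[s]
            next_dist += state_dist[s] * P[h][s, a, :]
        state_dist = next_dist
    return q
```

The cumulative occupancies that normalize the bonuses in Line 7 are then running sums of such `q` arrays, one per past policy, each computed under the corresponding estimated dynamics.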
We leave this gap for future research. Computational efficiency of E-UC3RL. The algorithm calls each oracle T times, making it oracle-efficient (since it s oracle-call complexity is in poly(T)). Aside from simple arithmetic operations, each of the T(T + 1)/2 iterations of the internal loop call one MDP planning procedure and calculate the related occupancy measure. Both of these can be implemented efficiently using dynamic programming. Overall, excluding the oracle s computation time, the runtime complexity of our algorithm is in poly(T, |S|, |A|, H). Hence, if both the LSR and LLR oracles are computationally efficient then E-UC3RL is also computationally efficient. 4. Analysis Our analysis consists of four main steps: (i) Establish an upper bound on the expected regret of the square and log loss regression oracles; (ii) Construct confidence bounds over the expected value of any context-dependent policy for both dynamics and rewards; (iii) Define the optimistic approximated CMDP and establish optimism lemmas; (iv) Combine the above to derive a high probability regret bound. In what follows, we present the main claims of our analysis, deferring the proofs to Appendix B. Before beginning, we discuss some of the challenges and present a key technical result, the value change of measure lemma (Lemma 4.1). A Key Technical Challenge Our goal is to derive computable and reliable confidence bounds over the expected value of any policy. The difficulty is that the offline regression oracles have regret guarantees only with respect to the trajectories distributions, which are related to the true context-dependent dynamics P . Hence, a main technical challenge is to translate the oracle s regret to a guarantee with respect to the estimated contextdependent dynamics b Pt. Notice that the confidence bounds are computable only if stated in terms of b Pt. Following ideas from Foster et al. (2021), we solve this issue using a multiplicative value change of measure that is based on the Hellinger distance. Concretely, the following change of measure lemma allows us to measure the value difference caused by the use of approximated transition probabilities in terms of the expected cumulative Hellinger distance (proof in Appendix B.1). Lemma 4.1 (Value change of measure). Let r : S A [0, 1] be a bounded expected rewards function. Let P and b P denote two dynamics and consider the MDPs M = (S, A, P , r, s0, H) and c M = (S, A, b P, r, s0, H). Then, for any policy π it holds that V π c M(s) 3V π M(s)+ h=0 D2 H( b P( |sh, ah), P ( |sh, ah)) s0 = s Notice that this bound is loose when the reward function is not small. However, it is significantly tighter than standard results when the rewards are small. For instance, later in the analysis we consider the reward rc(s, a) = ( bft(c, s, a) f (c, s, a))2 that is the squared reward approximation error. Letting c M = (S, A, b P, rc, s0, H) and M = (S, A, P , rc, s0, H), Lemma 4.1 implies that the expected reward approximation error with respect to b P is at most a constant multiple of the expected reward and dynamics approximation errors with respect to P . In contrast, a standard change of measure replaces the squared Hellinger distance with Total Variation (TV) whose cumulative error scales as Step 1: Establishing Oracle Guarantees The regret guarantees of the least-squares oracle were established in Levy & Mansour (2023, Lemma B.10), stated in the Appendix as Lemma B.5. 
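For intuition about the quantity that Lemma 4.1 (and Step 1 below) revolves around, the expected cumulative squared Hellinger error along trajectories of the true dynamics, here is a Monte-Carlo sketch. The analysis itself never needs to estimate this quantity; the array layout follows the earlier planning sketch, and the 1/2 normalization of the Hellinger distance is again one common convention.

```python
import numpy as np

def squared_hellinger(p, q):
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)   # one common normalization

def expected_cumulative_hellinger(pi, P_true, P_hat, H, n_rollouts=1000, seed=0):
    """Monte-Carlo estimate of E_{P*, pi}[ sum_h D_H^2( Phat(.|s_h,a_h), P*(.|s_h,a_h) ) ]
    for a layered MDP with deterministic policy pi[h][s] and dynamics P[h][s, a, s']."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_rollouts):
        s = 0                                              # start state s_0
        for h in range(H):
            a = pi[h][s]
            total += squared_hellinger(P_hat[h][s, a, :], P_true[h][s, a, :])
            s = rng.choice(len(P_true[h][s, a, :]), p=P_true[h][s, a, :])
    return total / n_rollouts
```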
The following corollary bounds the cumulative expected least-squares loss of the sequence of the oracle s predictions (proof in Appendix B.2). In the following, we denote the expected squared error at round t over a trajectory generated by π where the context is c as Et sq(π, c). Formally, Et sq(π, c) := E π(c; ),P c bft(c, sh, ah) f (c, sh, ah) 2 s0 Eluder-based Regret for Stochastic Contextual MDPs Corollary 4.2 (Reward approximation bound). Let bft F be the least squares minimizer of Line 3 in Algorithm 1. For any δ (0, 1) it holds with probability at least 1 δ, i=1 Et sq(πi, c) 68H log(2T 3|F|/δ), t 1. Next, we analyze the expected regret of the dynamic s logloss oracle in terms of the Hellinger distance. The following result is a straightforward application of Lemma A.14 in Foster et al. (2021) (proof in Appendix B.2). To that end, we denote the expected squared Hellinger distance at round t over a trajectory generated by π where the context is c as Et H(π, c). Formally, Et H(π, c) := E π(c; ),P c h=0 D2 H(P c ( |sh, ah), b P c t ( |sh, ah)) Corollary 4.3 (Dynamics approximation bound). Let b Pt P be the log loss minimizer of Line 4 in Algorithm 1. For any δ (0, 1) it holds that with probability at least 1 δ, i=1 Et H(πi, c) 2H log(TH|P|/δ), t 1. Lastly, we apply a variant of Corollary 4.3 together with the shrinking confidence bound guarantee in Lemma 2.2 to derive the following bound (proof in Appendix B.2). Lemma 4.4 (Stability error of log-loss oracle). Let b Pi P denote the log-loss minimizer at round i [T]. For any δ (0, 1) it holds with probability at least 1 δ that i=1 Ei H(πi, c) 112Hd P log2(2TH|P|/δ), t 1, where d P d E(P, DH, T 1/2), the Eluder dimension of P at scale T 1/2. One of the novel contributions of our work is the use of the Eluder dimension to bound the stability error due to the log-loss oracle. This allows us to choose stable bonuses that yield valid and computable confidence bounds. Step 2: Constructing Confidence Bounds Our main goal in this subsection is to upper bound w.h.p the expected value difference between the true and the empirical CMDPs, for any context-dependent policy π (Corollary 4.7). For that purpose, we derive confidence bounds over the rewards and dynamics approximation. Let πt denote the context-dependent policy selected at round t. For any h [H] and state-action pair (s, a) Sh A, context c C, and round t 1, we define the reward bonuses b R t,h(c, s, a) := bβr t (c, s, a, h), b P t,h(c, s, a) := HbβP t (c, s, a, h), (1) where bβ t (c, s, a, h) is defined in Line 7. b R t,h is the bonus related to the rewards approximation error, and b P t,h is the bonus related to that of the dynamics. We remark that these bonuses differ only in constant terms (HβP versus βr), and are identical to the bonus terms defined in Algorithm 1 (Line 7). We use these bonuses in our optimistic construction to account for the approximation errors in the rewards and dynamics, respectively. Next, for any context c C and functions f F, P P we define the MDP M(f,P )(c) = (S, A, P c, f(c, , ), s0, H). The following results derive confidence bounds for the dynamics and rewards approximation in terms of the reward bonuses and approximation errors (proofs in Appendix B.3). Lemma 4.5 (Confidence bound for rewards approximation w.r.t the approximated dynamics). Let P and f be the true context-dependent dynamics and rewards. Let b Pt and bft be the approximated context-dependent dynamics and rewards at round t. 
Then, for any t 1, and context-dependent policy π ΠC, Ec h V π(c; ) M(f , b Pt)(c)(s0) i Ec h V π(c; ) M( b ft, b Pt)(c)(s0) i H i=1 Et sq(πi, c) i=1 Ei H(πi, c) a A qh(s, a|π(c; ), b P c t ) b R t,h(c, s, a) Lemma 4.6 (Confidence bound for dynamics approximation w.r.t the true rewards f ). Let P and f be the true context-dependent dynamics and rewards. Let b Pt be the approximated context-dependent dynamics at round t. Then, for any t 1, and context-dependent policy π ΠC, Ec h V π(c; ) M(f ,P )(c)(s0) i Ec h V π(c; ) M(f , b Pt)(c)(s0) i H2 a A qh(s, a|π(c; ), b P c t ) b P t,h(c, s, a) i=1 Et H(πi, c) i=1 Ei H(πi, c) The proofs of Lemmas 4.5 and 4.6 are similar and can be found in Appendix B.3. By applying the high probability approximation bounds in Corollaries 4.2 and 4.3 Eluder-based Regret for Stochastic Contextual MDPs and Lemma 4.4 to Lemmas 4.5 and 4.6, we obtain the desired high probability confidence bound on the expected value approximation (proof in Appendix B.3). Corollary 4.7. Under the terms of Lemmas 4.5 and 4.6, the following holds with probability at least 1 3δ/4 simultaneously for all t 1 and π ΠC. Ec h V π(c; ) M(f ,P )(c)(s0) i Ec h V π(c; ) M( b ft, b Pt)(c)(s0) i ah A qh(sh, ah|π(c; ), b P c t )b R t,h(c, sh, ah) ah A qh(sh, ah|π(c; ), b P c t )b P t,h(c, sh, ah) + 2029H4d P βP log2(8TH|P|/δ) βr log2(64T 4H|F||P|/δ2). Step 3: Establishing Optimism Lemmas In this subsection, we use the results of step 2 (Corollary 4.7) to establish the properties of the optimistic approximated CMDP c Mt. Namely, that its optimal value is higher than that of the true CMDP M (Lemma 4.8), but also that the optimistic value of πt is not significantly higher than its true value (Lemma 4.9), in expectation over the context. Combining both results and applying Azuma s inequality yields the regret bound. We begin by defining the optimistic-in-expectation contextdependent reward function at every round t 1 as brc t(s, a) := bft(c, s, a) + b R t,h(c, s, a) + b P t,h(c, s, a) , where the bonuses are defined in Equation (1), and we note that brc t(s, a) [0, H + 2] for all h [H] and (c, s, a) C Sh A. The approximated optimistic-in-expectation CMDP at round t is defined as (C, S, A, c Mt) where for any context c C we define c Mt(c) := (S, A, b P c, brc, s0, H). We also recall that M(c) = M(f ,P )(c) is the true CMDP. The next two Lemmas establish the properties of the optimistic CMDP (proofs in Appendix B.4). Lemma 4.8 (Optimism in expectation). Let π be an optimal context-dependent policy for M. Under the good event of Corollary 4.7, for any t 1 it holds that Ec h V π (c; ) M(c) (s0) i Ec h V πt(c; ) c Mt(c) (s0) i + 2029H4d P βP log2(8TH|P|/δ) βr log2 64T 4H|F||P|/δ2 . Lemma 4.9 (The cost of approximation). Under the good event of Corollary 4.7, we have that for every t 1 Ec h V πt(c; ) c Mt(c) (s0) i Ec h V πt(c; ) M(c) (s0) i E πt(c; ), b P c t b R t,h(c, sh, ah) + b R t,h(c, sh, ah) # + 2029H4d P βP log2(8TH|P|/δ) βr log2(64T 4H|F||P|/δ2). Step 4: Deriving the Regret Bound Using the above results, we derive Theorem 3.1 as follows. Summing Lemmas 4.8 and 4.9 over 1 t T bounds a notion of expected regret. Next, we use a standard algebraic argument (Lemma B.18) to bound the expected cumulative bonuses as E πt(c; ), b P c t b R t,h(c, sh, ah) + b R t,h(c, sh, ah) # H|S||A|(βr + HβP ) log(T + 1). By plugging in our choice of βP and βr we bound the expected regret by T|S||A|d P log(T + 1) log2(18T 4H|F||P|/δ2). 
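For concreteness, the optimistic reward construction whose cumulative bonuses are bounded here can be assembled as in the sketch below. The exact bonus normalization is only partially legible in this copy, so the form $\min\{1, \beta/(1 + \text{cumulative occupancy})\}$ is our reconstruction from Line 7 of Algorithm 1, Equation (1), and the remark that the optimistic reward lies in $[0, H + 2]$; treat it as an assumption rather than the paper's exact formula.

```python
import numpy as np

def optimistic_rewards(f_hat, cum_occ, beta_r, beta_p, H):
    """f_hat[h, s, a]: estimated reward for the observed context;
    cum_occ[h, s, a]: 1 + sum_{i<t} q_h(s, a | pi_i(c;.), Phat_i^c);
    returns r_opt[h, s, a] = f_hat + b^R + b^P, the optimistic reward."""
    b_r = np.minimum(1.0, beta_r / cum_occ)        # reward-approximation bonus (assumed form)
    b_p = H * np.minimum(1.0, beta_p / cum_occ)    # dynamics-approximation bonus (assumed form)
    return f_hat + b_r + b_p
```

With these rewards, a planning step as sketched earlier produces the policy $\pi_t(c_t; \cdot)$ that the algorithm plays.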
We derive the high probability result by using Corollary 4.7 and applying Azuma s inequality. The complete proof is in Appendix B.5. 5. Discussion and Conclusion In this paper, we make a step forward in understanding RL with offline function approximation. We consider the tabular CMDP setting, under the offline function approximation assumption, and obtain a rate-optimal regret bound. To obtain our result, we extend the Eluder dimension presented by Russo & Van Roy (2013) to a general bounded metric, rather than only the ℓ2 norm. This result may be of separate interest. Further, by applying the Metric Eluder dimension with Hellinger distance, we obtain our algorithm EUC3RL, the first efficient algorithm for regret minimization in Contextual MDPs that uses offline regression oracles. We note that our algorithm requires a known bound on the Eluder dimension and its regret depends on it. Obtaining an efficient algorithm that has rate-optimal regret using offline oracles but without dependence on the Eluder dimension is an important open question for future research. Extending our technique to RL with rich observations is also an interesting direction for future research. Eluder-based Regret for Stochastic Contextual MDPs Acknowledgements We would like to thank the reviewers for their helpful comments. This project has received funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation program (grant agreement No. 882396 and grant agreement No. 101078075). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. This work received additional support from the Israel Science Foundation (ISF, grant numbers 993/17 and 2549/19), Tel Aviv University Center for AI and Data Science (TAD), the Yandex Initiative for Machine Learning at Tel Aviv University, the Len Blavatnik and the Blavatnik Family Foundation, and by the Israeli VATAT data science scholarship. AC is supported by the Israeli Science Foundation (ISF) grant no. 2250/22. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here. Agarwal, A., Dud ık, M., Kale, S., Langford, J., and Schapire, R. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pp. 19 26. PMLR, 2012. Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638 1646. PMLR, 2014. Ayoub, A., Jia, Z., Szepesvari, C., Wang, M., and Yang, L. Model-based reinforcement learning with value-targeted regression. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 463 474. PMLR, 13 18 Jul 2020. Chen, Z., Li, C. J., Yuan, A., Gu, Q., and Jordan, M. I. A general framework for sample-efficient function approximation in reinforcement learning. ar Xiv preprint ar Xiv:2209.15634, 2022. Dann, C., Mohri, M., Zhang, T., and Zimmert, J. A provably efficient model-free posterior sampling method for episodic reinforcement learning. Advances in Neural Information Processing Systems, 34:12040 12051, 2021. 
Foster, D. and Rakhlin, A. Beyond ucb: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199 3210. PMLR, 2020. Foster, D., Agarwal, A., Dudik, M., Luo, H., and Schapire, R. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 1539 1548. PMLR, 2018. Foster, D. J. and Krishnamurthy, A. Efficient first-order contextual bandits: Prediction, allocation, and triangular discrimination. Advances in Neural Information Processing Systems, 34:18907 18919, 2021. Foster, D. J., Kakade, S. M., Qian, J., and Rakhlin, A. The statistical complexity of interactive decision making. ar Xiv preprint ar Xiv:2112.13487, 2021. Hallak, A., Di Castro, D., and Mannor, S. Contextual markov decision processes. ar Xiv preprint ar Xiv:1502.02259, 2015. Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., and Schapire, R. E. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning, pp. 1704 1713. PMLR, 2017. Jin, C., Liu, Q., and Miryoosefi, S. Bellman eluder dimension: New rich classes of rl problems, and sampleefficient algorithms. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 13406 13418. Curran Associates, Inc., 2021. Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. In Platt, J., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007. Lattimore, T. and Szepesv ari, C. Bandit Algorithms. Cambridge University Press, 2020. Levy, O. and Mansour, Y. Learning efficiently function approximation for contextual MDP. ar Xiv preprint ar Xiv:2203.00995, 2022. Levy, O. and Mansour, Y. Optimism in face of a context: Regret guarantees for stochastic contextual mdp. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 8510 8517, 2023. Eluder-based Regret for Stochastic Contextual MDPs Levy, O., Cohen, A., Cassel, A. B., and Mansour, Y. Efficient rate optimal regret for adversarial contextual mdps using online function approximation. In ICML, 2023. Liu, Q., Chung, A., Szepesv ari, C., and Jin, C. When is partially observable reinforcement learning not scary? In Conference on Learning Theory, pp. 5175 5220. PMLR, 2022. Mannor, S., Mansour, Y., and Tamar, A. Reinforcement Learning: Foundations. Online manuscript; https://sites.google.com/ view/rlfoundations/home, 2022. accessed March-05-2023. Modi, A. and Tewari, A. No-regret exploration in contextual reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pp. 829 838. PMLR, 2020. Modi, A., Jiang, N., Singh, S., and Tewari, A. Markov decision processes with continuous side information. In Algorithmic Learning Theory, pp. 597 618. PMLR, 2018. Osband, I. and Van Roy, B. Model-based reinforcement learning and the eluder dimension. Advances in Neural Information Processing Systems, 27, 2014. Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Rosenberg, A., Cohen, A., Mansour, Y., and Kaplan, H. Near-optimal regret bounds for stochastic shortest path. In International Conference on Machine Learning, pp. 8210 8219. PMLR, 2020. Russo, D. and Van Roy, B. Eluder dimension and the sample complexity of optimistic exploration. 
Advances in Neural Information Processing Systems, 26, 2013. Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. Shani, L., Efroni, Y., Rosenberg, A., and Mannor, S. Optimistic policy optimization with bandit feedback. In ICML, 2020. Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 2021. Slivkins, A. Introduction to multi-armed bandits. Found. Trends Mach. Learn., 12(1-2):1 286, 2019. Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. In Conference on learning theory, pp. 2898 2933. PMLR, 2019. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018. Wang, R., Salakhutdinov, R. R., and Yang, L. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6123 6135. Curran Associates, Inc., 2020a. Wang, R., Salakhutdinov, R. R., and Yang, L. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33:6123 6135, 2020b. Wen, Z. and Van Roy, B. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3): 762 782, 2017. Wu, Y., He, J., and Gu, Q. Uniform-PAC guarantees for model-based RL with bounded eluder dimension. In Evans, R. J. and Shpitser, I. (eds.), Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, volume 216 of Proceedings of Machine Learning Research, pp. 2304 2313. PMLR, 31 Jul 04 Aug 2023. Xie, T., Foster, D. J., Bai, Y., Jiang, N., and Kakade, S. M. The role of coverage in online reinforcement learning. ar Xiv preprint ar Xiv:2210.04157, 2022. Xu, Y. and Zeevi, A. Upper counterfactual confidence bounds: a new optimism principle for contextual bandits. ar Xiv preprint ar Xiv:2007.07876, 2020. Zhang, T. Feel-good thompson sampling for contextual bandits and reinforcement learning. SIAM Journal on Mathematics of Data Science, 4(2):834 857, 2022. Zimin, A. and Neu, G. Online learning in episodic markovian decision processes by relative entropy policy search. Advances in neural information processing systems, 26, 2013. Eluder-based Regret for Stochastic Contextual MDPs A. Metric Eluder Dimension Our goal is to prove Lemma 2.2, a variant of Proposition 6 in Osband & Van Roy (2014). Recall that for any P P, its radius at x X is w P (x) = sup P,P P D(P(x), P (x)). Now, for any t [T], h [H] let xt h X and Pt P be arbitrary. Define the confidence sets with parameter β as h=0 D2(P(xi h), Pt(xi h)) β Lemma A.1 (Restatement of Lemma 2.2). Suppose that β H. We have that h=0 (w Pt(xt h))2 6d E(P, D, T 1/2)β log T. We first need the following result, which is a direct adaptation of Lemma 1 in (Osband & Van Roy, 2014) (see proof below for completeness). Lemma A.2. We have that for any ϵ > 0 h=0 I[w Pt(xt h) > ϵ] 4β ϵ2 + H d E(P, D, ϵ). Proof of Lemma 2.2. To reduce notation, write wt,h = w Pt(xt h), and d = d E(P, D, T 1/2). Next, reorder the sequence (w1,0, . . . , w1,H 1, . . . , w T,0, . . 
. , w T,H 1) (wi1, . . . , wi T H) where wi1 wi2 . . . wi T H. Then we have that h=0 (w Pt(xt h))2 = s=1 w2 is = s=1 w2 is I[wis T 1/2] + s=1 w2 is I[wis > T 1/2] s=1 w2 is I[wis > T 1/2] s=1 w2 is I[wis > T 1/2]. Now, let ϵ T 1/2 and suppose that wis > ϵ. Since wis are ordered in descending order, this implies that PT t=1 PH 1 h=0 I(w Pt(xt h) > ϵ) s. On the other hand, by Lemma A.2, we have h=0 I(w Pt(xt h) > ϵ) 4β ϵ2 + H d E(P, D, ϵ) 4β where the second transition used that d E(P, D, ϵ ) is non-increasing in ϵ . We conclude that s 4β/ϵ2 + H d, and changing sides gives that ϵ2 4β/(s d H). This implies that wis > T 1/2 = w2 is 4βd s d H . Eluder-based Regret for Stochastic Contextual MDPs We thus have s=1 w2 is I(wis > T 1/2) 1 + d H + s=d H+2 w2 is I(wis > T 1/2) (wis 1) 1 + d H + 4βd = 1 + d H + 4βd log T 6dβ log T. (β H) Proof of Lemma A.2. The proof follows similarly to Lemma 1 in Osband & Van Roy (2014). We begin by showing that if w Pt(xt h) > ϵ then xt h is (D, ϵ) dependent on fewer than 4β/ϵ2 disjoint sub-sequences of (x1 0, . . . , x1 H 1, . . . , xt 1 0 , . . . , xt 1 H 1). To see this, note that if w Pt(xt h) > ϵ there are P, P Pt such that D(P(xt h), P (xt h)) > ϵ. Now, suppose that xt h is (D, ϵ) dependent on a sub-sequence (xt1 h1, . . . , xtn hn) of (x1 0, . . . , x1 H 1, . . . , xt 1 0 , . . . , xt 1 H 1). Since D(P(xt h), P (xt h)) ϵ, the dependence implies that j=1 D2(P(xtj hj), P (xtj hj)) > ϵ2. We conclude that, if xt h is ϵ dependent on K disjoint sub-sequences of (x1 0, . . . , x1 H 1, . . . , xt 1 0 , . . . , xt 1 H 1) then h=0 D2(P(xi h), P (xi h)) > Kϵ2. On the other hand, since P, P Pt, we have h=0 D2(P(xi h), P (xi h)) D(P(xi h), Pt(xi h)) + D(Pt(xi h), P (xi h)) 2 (triangle inequality for D) h=0 D2(P(xi h), Pt(xi h)) + 2 h=0 D2(Pt(xi h), P (xi h)) ((a + b)2 2(a2 + b2)) It follows that K < 4β/ϵ2. Next, denote d := d E(P, D, ϵ). We show that in any action sequence (y1, . . . , yτ), there is some element yj that is (D, ϵ) dependent on at least τ/d 1 disjoint sub-sequences of (y1, . . . , yj 1). To show this, we will construct K = (τ/d) 1 disjoint sub-sequences B1, . . . , BK. First, let Bm = (ym) for m = 1, . . . , K. If y K+1 is (D, ϵ) dependent on each sub-sequence B1, . . . , BK, the claim is established. Otherwise, append y K+1 to a sub-sequence Bm that y K+1 is (D, ϵ) independent of. Repeat this process for elements with indices i > K + 1 until yi is ϵ dependent on all subsequences. Suppose in contradiction that the process terminated without finding the desired yi. By the definition of the Eluder dimension, each Bm must satisfy |Bm| d and thus PK m=1|Bm| Kd. On the other hand Kd = (τ/d) 1 d < τ = m=1 |Bm| Kd, Eluder-based Regret for Stochastic Contextual MDPs where the last equality follows since all elements of (y1, . . . , yτ) where placed. This is a contradiction, and thus the process must terminate successfully. h=0 I[w Pt(xt h) > ϵ], and take (y1, . . . , yτ) to be the sub-sequence (xt1 h1, . . . , xtτ hτ ) of elements xt h for which w Pt(xt h) > ϵ. Then there exists xtj hj that depends on at least τ/d 1 disjoint sub-sequences (xt1 h1, . . . , xtτ hτ ). Notice that at most H 1 of these sub-sequences are not sub-sequences of (x1 0, . . . , x1 H 1, . . . , xtj 1 0 , . . . , xtj 1 H 1). We conclude that xtj hj depends on at least (τ/d) H disjoint sub-sequences of (x1 0, . . . , x1 H 1, . . . , xtj 1 0 , . . . xtj 1 H 1). 
On the other hand, since w Ptj (xtj hj) > ϵ then by the first part of the proof it depends on fewer than 4β/ϵ2 such disjoint sub-sequences, thus so τ (4β/ϵ2 + H)d as desired. A.1. Upper bound using the ℓ2 Eluder dimension Proposition A.3. Assume that P, Q are two distributions over finite domain X, such that for all x X it holds that P(x), Q(x) p. Then, P Q 2 2 4D2 H(P, Q) 1 Lemma A.4. Let function class P, and assume P(x) p holds for some p > 0 for every P P and x in the domain. Let ϵ (0, 1] and assume the the Eluder dimension w.r.t the ℓ2 at scale pϵ is d. Then, the Eluder dimension w.r.t the Squared Hellinger Distance at scale ϵ is upper bounded by d/p. Proof. Let d be the Eluder dimension w.r.t the ℓ2-norm at scale pϵ of the class P. Let k > 1/p and m = dk. Also let x1, x2, . . . , xm+1 be an arbitrary sequence. We show that there must exist j [m + 1] such that xj is ϵ dependent on its prefix in Hellinger distance, thus the Hellinger Eluder dimension at scale ϵ is at most m. As in Lemma A.2, there exists n and {Ij}j [k] disjoint subsequences such that j [k]Ij = [n 1], and xn is pϵ dependent in ℓ2 on each Ij, j [k]. Now, let P, P P such that D2 H(P(xn), P (xn)) ϵ2. Then, by Proposition A.3 we have that P(xn) P (xn) 2 2 4pϵ2. Because of the dependence, for all j [k] we have that P τ Ij P(xτ) P (xτ) 2 2 4pϵ2. We conclude, using Proposition A.3 again, that τ=1 D2 H(P(xτ), P (xτ)) 1 τ=1 P(xτ) P (xτ) 2 2 = 1 τ Ij P(xτ) P (xτ) 2 2 X j [k] pϵ2 = pkϵ2 > ϵ2, thus xn is ϵ dependent on its prefix in Hellinger distance. We conclude that no sequence of length m + 1 can have all elements ϵ independent of their prefix in Hellinger distance. Thus, the Hellinger Eluder at scale ϵ is at most m d/p. B.1. Multiplicative Value Change of Measure First, we give the following Bernstein type tail bound (see e.g., Rosenberg et al., 2020, Lemma D.4). Lemma B.1. Let {Xt}t 1 be a sequence of random variables with expectation adapted to a filtration Ft. Suppose that 0 Xt 1 almost surely. Then with probability at least 1 δ t=1 E[Xt | Ft 1] 2 t=1 Xt + 4 log 2 Eluder-based Regret for Stochastic Contextual MDPs Recall the Helligner distance given in Definition 2.3. The following change of measure result is due to (Foster et al., 2021). Lemma B.2 (Lemma A.11 in Foster et al., 2021). Let P and Q be two probability measures on (X, F). For all h : X R with 0 h(X) R almost surely under P and Q, we have |EP[h(X)] EQ[h(X)]| q 2R(EP[h(X)] + EQ[h(X)]) D2 H(P, Q). In particular, EP[h(X)] 3EQ[h(X)] + 4RD2 H(P, Q). Next, we need the following refinement of the previous result. Corollary B.3. For any β 1, EP[h(X)] (1 + 1/β)EQ[h(X)] + 3βRD2 H(P, Q). Proof. Let η (0, 1). Consider the following derivation. EP[h(X)] EQ[h(X)] q 2R(EP[h(X)] + EQ[h(X)]) D2 H(P, Q) η(EP[h(X)] + EQ[h(X)]) + R 2η D2 H(P, Q). The above implies EP[h(X)] 1 + η 1 η EQ[h(X)] + R 2η(1 η)D2 H(P, Q) EQ[h(X)] + 3R(2β + 1)2 2β D2 H(P, Q) (Plug η = 1 2β+1 for all β (0, ).) EQ[h(X)] + 3RβD2 H(P, Q). (For any β 1) Lemma B.4 (restatement of Lemma 4.1). Let r : S A [0, 1] be a bounded expected rewards function. Let P and b P denote two dynamics and consider the MDPs M = (S, A, P , r, s0, H) and c M = (S, A, b P, r, s0, H). Then, for any policy π we have V π c M(s) 3V π M(s) + 9H2 E P ,π h=0 D2 H( b P( |sh, ah), P ( |sh, ah)) Proof. We first prove by backwards induction that for all h [H 1] the following holds. V π c M,h(s) 1 + 1 V π M,h(s) + E P ,π h =h 3H2D2 H( b P( |sh , ah ), P ( |sh , ah )) The base case, h = H 1 is immediate since V π c M,h(s) = V π M,h(s). 
Now, we assume that the above holds for h + 1 and prove that it holds for h. To see this, we have that V π c M,h(s) = E a π( |s) h r(s, a) + Es b P ( |s,a) h V π c M,h+1(s ) ii (By Bellman s equations) r(s, a) + 1 + 1 Es P ( |s,a) h V π c M,h+1(s ) i + 3H2D2 H( b P( |s, a), P ( |s, a)) (Corollary B.3) h r(s, a) + 3H2D2 H( b P( |s, a), P ( |s, a)) i (Induction hypothesis) Eluder-based Regret for Stochastic Contextual MDPs + E a π( |s) H h E s P ( |s,a) V π M,h+1(s ) # + E a π( |s) H h E s P ( |s,a) h =h+1 3H2D2 H( b P( |sh , ah ), P ( |sh , ah )) sh+1 = s ### H h E a π( |s) r(s, a) + E s P ( |s,a) V π M,h+1(s ) (r, D2 H 0) h =h 3H2D2 H( b P( |sh , ah ), P ( |sh , ah )) V π M,h(s) + E P ,π h =h 3H2D2 H( b P( |sh , ah ), P ( |sh , ah )) , (By Bellman s equations) as desired. Plugging in h = 0 and using that 1 + 1 H H 3 concludes the proof. B.2. Oracle Bounds (Step 1) Reward oracle. Lemma B.5 (Lemma B.10 in Levy & Mansour, 2023). For any δ (0, 1), with probability at least 1 δ we have h=0 (ft(ci, si h, ai h) f (ci, si h, ai h))2 Hi 1 h=0 Eci,si h,ai h((ft(ci, si h, ai h) f (ci, si h, ai h))2 | Hi 1) 68H log(2|F|t3/δ) + 2 h=0 (ft(ci, si h, ai h) ri h)2 (f (ci, si h, ai h) ri h)2 simultaneously, for all t 2 and any fixed sequence of functions f1, f2, . . . F. Corollary B.6 (restatement of Corollary 4.2). Let bft F be the least squares minimizer in Algorithm 1. For any δ (0, 1) it holds that with probability at least 1 δ we have i=1 E πi(c; ),P c bft(c, sh, ah) f (c, sh, ah) 2 s0 68H log(2T 3|F|/δ), simultaneously, for all t 1. Proof. Recall that for all t 2, bft is the least square minimizer at round t. Hence, by our assumption that f F h=0 ( bft(ci, si h, ai h) ri h)2 (f (ci, si h, ai h) ri h)2 0. Thus the corollary immediately follows by Lemma B.5. Dynamics oracle. Recall the Hellinger distance given in Definition 2.3. The following lemma by (Foster et al., 2021) upper bounds the expected cumulative Hellinger Distance in terms of the log-loss. Let X be a set and Y be a finite set. Let x(t), y(t), t 1, be a sequence of random variables that satisfy y(t) g ( |x(t)), where g : Y X R+ maps x X to the density of y Y. Define H(t) = (x(1), y(1), . . . , x(t), y(t)) and let G(t) = σ(H(t)). Next, for a random variable Z, we define Et Z := E[Z|Gt]. Eluder-based Regret for Stochastic Contextual MDPs Lemma B.7 (Lemma A.14 from Foster et al., 2021). Let g : Y X R+ be a mapping from X to densities over Y. Consider a sequence of {0, 1}-valued random variables (It)t T where It is F (t 1)-measurable. For any δ (0, 1) we have that with probability at least 1 δ, t=1 Et 1 h D2 H(g( |x(t)), g ( |x(t))) i It log 1 g(y(t)|x(t)) log 1 g (y(t))|x(t) It + 2 log(1/δ). Additionally, with probability at least 1 δ, t=1 D2 H(g( |x(t)), g ( |x(t))) log 1 g(y(t)|x(t)) log 1 g (y(t))|x(t) + 2 log(1/δ). Using the above lemma, we bound the realized and expected cumulative Hellinger distance between the approximated and true dynamics, by the actual regret of the log-loss regression oracle (and constant terms), with high probability. Lemma B.8 (Concentration of log-loss oracle). For any δ (0, 1) it holds that with probability at least 1 δ we have i=1 E πi(c; ),P c h=0 D2 H(P c ( |sh, ah), P c( |sh, ah)) 1 P ci(si h+1|si h, ai h) 1 P ci (si h+1|si h, ai h) + 2H log(TH|P|/δ). simultaneously, for all t 1 and P P. Proof. Fix some t 1 and P P. 
We have with probability at least 1 δ T |P| that i=1 E πi(c; ),P c h=0 D2 H(P c ( |sh, ah), P c( |sh, ah)) E πi(c; ),P c D2 H(P c ( |sh, ah), P c( |sh, ah)) D2 H(P ci ( |si h, ai h), P ci( |si h, ai h)) P ci (si h+1|si h, ai h) P ci(si h+1|si h, ai h) + 2 log(HT|P|/δ) (Foster et al., 2021, Lemma A.14) 1 P ci(si h+1|si h, ai h) 1 P ci (si h+1|si h, ai h) + 2H log(HT|P|/δ) The filtration used in (i) is over the history up to time t, Ht 1 = (σ1, . . . , σt 1). Now, by taking a union bound over every t = 1. . . . , T and P P, we obtain the lemma. Lemma B.9 (Realized log-loss error). For any δ (0, 1) it holds that with probability at least 1 δ we have h=0 D2 H(P ci ( |si h, ai h), P ci( |si h, ai h)) 1 P ci(si h+1|si h, ai h) 1 P ci (si h+1|si h, ai h) + 2H log(TH|P|/δ). simultaneously, for all t 1 and P P. Eluder-based Regret for Stochastic Contextual MDPs Proof. Fix some t 1 and P P. We have with probability at least 1 δ T |P| that h=0 D2 H(P ci ( |si h, ai h), P ci( |si h, ai h)) i=1 D2 H(P ci ( |si h, ai h), P ci( |si h, ai h)) P ci (si h+1|si h, ai h) P ci(si h+1|si h, ai h) + 2 log(HT|P|/δ) (Foster et al., 2021, Lemma A.14) 1 P ci(si h+1|si h, ai h) 1 P ci (si h+1|si h, ai h) + 2H log(HT|P|/δ). The filtration used in (i) is over the history up to time t, Ht 1 = (σ1, . . . , σt 1). Now, by taking a union bound over every t = 1. . . . , T and P P, we obtain the lemma. Corollary B.10 (restatement of Corollary 4.3). Let b Pt P be the log loss minimizer in Algorithm 1. For any δ (0, 1) it holds that with probability at least 1 δ we have i=1 E πi(c; ),P c h=0 D2 H(P c ( |sh, ah), b P c t ( |sh, ah)) 2H log(TH|P|/δ), 1 t T. Proof. By our assumption that P P, and b Pt is the log loss minimizer at time t, it holds that 1 b P ci t (si h+1|si h, ai h) 1 P ci (si h+1|si h, ai h) Thus, the corollary immediately follows by Lemma B.8. Lemma B.11 (Stability error of log-loss oracle, restatement of Lemma 4.4). Let b Pi P denote the log-loss minimizer at round i [T]. For any δ (0, 1) it holds with probability at least 1 δ that i=1 Eπi(c; ),P c h=0 D2 H(P c ( |sh, ah), b P c i (sh, ah)) 112Hd P log2(2TH|P|/δ) simultaneously, for all t 1, where d P d E(P, DH, T 1/2), the Eluder dimension of P at scale T 1/2. Proof. Let β = 2H log(2TH|P|/δ) and define h=0 D2 H(P ci( |si h, ai h), b P ci t ( |si h, ai h)) β Now, suppose that Lemma B.9 holds with δ/2. Then, since b Pt is the log-loss minimizer, we have that P Pt for all 1 t T. Next, recalling that w Pt(s, a, c) = sup P,P Pt DH(P c( |s, a), P c( |s, a)), thus, we have that h=0 D2 H(P ci ( |si h, ai h), b P ci i ( |si h, ai h) h=0 (w Pi(si h, ai h, ci))2. Eluder-based Regret for Stochastic Contextual MDPs Applying Lemma 2.2 we get that h=0 D2 H(P ci ( |si h, ai h), b P ci i ( |si h, ai h) 24d PHβ log T 48Hd P log2(2TH|P|/δ). Finally, we apply Lemma B.1 to get that with probability at least 1 δ/2 i=1 Eπi(c; ),P c h=0 D2 H(P c ( |sh, ah), b P c i (sh, ah)) s0 h=0 E h D2 H(P ci ( |si h, ai h), b P ci i ( |si h, ai h) πi, s0 i h=0 D2 H(P ci ( |si h, ai h), b P ci i ( |si h, ai h) + 16H log 2TH Taking a union bound and combining the last two inequalities concludes the proof. B.3. Confidence Bounds (Step 2) In the following analysis, we use an occupancy measures-based representation of the value function. Recall the definition of the occupancy measures (Zimin & Neu, 2013). 
B.3. Confidence Bounds (Step 2)

In the following analysis, we use an occupancy-measure representation of the value function. Recall the definition of occupancy measures (Zimin & Neu, 2013). For any non-contextual policy $\pi$ and dynamics $P$, let $q_h(s,a \mid \pi, P)$ denote the probability of reaching state $s \in S$ and performing action $a \in A$ at time $h \in [H]$ of an episode generated using policy $\pi$ and dynamics $P$. Using this notation, the value function of any policy $\pi$ with respect to the MDP $(S, A, P, r, s_0, H)$ can be represented as follows:
\[
V^\pi_M(s_0) \;=\; \sum_{h=0}^{H-1} \sum_{s \in S}\sum_{a \in A} q_h(s,a \mid \pi, P)\, r(s,a). \tag{2}
\]
Thus, the following is an immediate corollary of Lemma 4.1.

Corollary B.12. For any (non-contextual) policy $\pi$, two dynamics $P$ and $\widehat P$, and a rewards function $r$ that is bounded in $[0,1]$, it holds that
\[
\sum_{h=0}^{H-1} \sum_{s \in S}\sum_{a \in A} q_h(s,a \mid \pi, \widehat P)\, r(s,a)
\;\le\; 3 \sum_{h=0}^{H-1} \sum_{s \in S}\sum_{a \in A} q_h(s,a \mid \pi, P)\, r(s,a)
\;+\; 9H^2 \sum_{h=0}^{H-1} \sum_{s \in S}\sum_{a \in A} q_h(s,a \mid \pi, P)\, D^2_H\big(P(\cdot|s,a), \widehat P(\cdot|s,a)\big).
\]
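Equation (2) is straightforward to verify numerically in a small tabular MDP. The sketch below (array shapes and function names are my own choices, not the paper's notation) computes the occupancy measures $q_h(s,a\mid\pi,P)$ by a forward recursion and evaluates the value function as in Equation (2).

```python
import numpy as np

def occupancy_measures(pi, P, s0):
    """Forward recursion for q_h(s, a | pi, P) in a finite horizon MDP.

    pi : array of shape (H, S, A); pi[h, s, a] = prob. of action a in state s at step h.
    P  : array of shape (S, A, S); P[s, a, s'] = transition probability.
    s0 : initial state index.
    Returns q of shape (H, S, A) with q[h, s, a] = Pr[s_h = s, a_h = a].
    """
    H, S, A = pi.shape
    q = np.zeros((H, S, A))
    state_dist = np.zeros(S)
    state_dist[s0] = 1.0
    for h in range(H):
        q[h] = state_dist[:, None] * pi[h]            # joint law of (s_h, a_h)
        state_dist = np.einsum("sa,sat->t", q[h], P)  # push forward through the dynamics
    return q

def value_from_occupancy(q, r):
    """Value at s_0 via Equation (2): V = sum_{h,s,a} q_h(s,a) r(s,a), with r of shape (S, A)."""
    return float(np.sum(q * r[None, :, :]))
```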
We are now ready to prove the confidence bounds. Recall the reward bonuses $b^R_{t,h}, b^P_{t,h}$ defined in Equation (1).

Lemma B.13 (restatement of Lemma 4.5). Let $P^\star$ and $f^\star$ be the true context-dependent dynamics and rewards. Let $\widehat P_t$ and $\widehat f_t$ be the approximated context-dependent dynamics and rewards at round $t$. Then, for any $t \ge 1$, and context-dependent policy $\pi \in \Pi_{\mathcal{C}}$ the following holds:
\[
\Big| \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, \widehat P_t)(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(\widehat f_t, \widehat P_t)(c)}(s_0) \big] \Big|
\;\le\; \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s_h \in S}\sum_{a_h \in A} q_h\big(s_h, a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t\big)\, b^R_{t,h}(c, s_h, a_h) \Big]
\]
\[
+\; \frac{3}{2\beta_r} \sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} \big(f^\star(c, s_h, a_h) - \widehat f_t(c, s_h, a_h)\big)^2 \,\Big|\, s_0 \Big]
+ \frac{9H^2}{2\beta_r} \sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} D^2_H\big(P^\star_c(\cdot|s_h,a_h), \widehat P^{\,c}_i(\cdot|s_h,a_h)\big) \,\Big|\, s_0 \Big].
\]

Proof. We have that
\[
\Big| \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, \widehat P_t)(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(\widehat f_t, \widehat P_t)(c)}(s_0) \big] \Big|
= \Big| \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, \widehat P_t)(c)}(s_0) - V^{\pi(c;\cdot)}_{\mathcal{M}(\widehat f_t, \widehat P_t)(c)}(s_0) \big] \Big|
\]
(by linearity of expectation)
\[
= \Big| \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s_h,a_h} q_h\big(s_h,a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t\big)\big(\widehat f_t(c,s_h,a_h) - f^\star(c,s_h,a_h)\big) \Big] \Big|
\le \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s_h,a_h} q_h\big(s_h,a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t\big)\big|\widehat f_t(c,s_h,a_h) - f^\star(c,s_h,a_h)\big| \Big].
\]
(Equation (2), triangle inequality)

Applying the AM-GM inequality with parameter $\beta_r$ to each summand splits it into two terms: the first, since $|\widehat f_t - f^\star| \le 1$ and $q_h(s_h,a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t) \le 1$, is at most $q_h(s_h,a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t)\, b^R_{t,h}(c,s_h,a_h)$ by the definition of $b^R_{t,h}$ in Equation (1); the second is controlled by $\frac{1}{2\beta_r}$ times the squared reward error weighted by the cumulative occupancy measures $\sum_{i=1}^{t-1} q_h(\cdot \mid \pi_i(c;\cdot), \widehat P^{\,c}_i)$. Taking the expectation over $c$ and applying Corollary B.12 (with $r(s,a) := (\widehat f_t(c,s,a) - f^\star(c,s,a))^2 \in [0,1]$) to replace each occupancy measure under $\widehat P^{\,c}_i$ by the corresponding one under $P^\star_c$ bounds the second term by
\[
\frac{3}{2\beta_r} \sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} \big(f^\star - \widehat f_t\big)^2(c,s_h,a_h) \,\Big|\, s_0 \Big]
+ \frac{9H^2}{2\beta_r} \sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} D^2_H\big(P^\star_c(\cdot|s_h,a_h), \widehat P^{\,c}_i(\cdot|s_h,a_h)\big) \,\Big|\, s_0 \Big],
\]
and the lemma follows by the definition of $b^R_{t,h}$.

Lemma B.14 (restatement of Lemma 4.6). Let $P^\star$ and $f^\star$ be the true context-dependent dynamics and rewards. Let $\widehat P_t$ be the approximated context-dependent dynamics at round $t$. Then, for any $t \ge 1$, and context-dependent policy $\pi \in \Pi_{\mathcal{C}}$ we have
\[
\Big| \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, P^\star)(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, \widehat P_t)(c)}(s_0) \big] \Big|
\;\le\; \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s \in S}\sum_{a \in A} q_h\big(s, a \mid \pi(c;\cdot), \widehat P^{\,c}_t\big)\, b^P_{t,h}(c, s, a) \Big]
\]
\[
+\; \frac{6H}{\beta_P} \sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} D^2_H\big(P^\star_c(\cdot|s_h,a_h), \widehat P^{\,c}_t(\cdot|s_h,a_h)\big) \,\Big|\, s_0 \Big]
+ \frac{18H^3}{\beta_P} \sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} D^2_H\big(P^\star_c(\cdot|s_h,a_h), \widehat P^{\,c}_i(\cdot|s_h,a_h)\big) \,\Big|\, s_0 \Big].
\]

Proof. The following holds for any $t \ge 1$ and context-dependent policy $\pi \in \Pi_{\mathcal{C}}$:
\[
\Big| \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, P^\star)(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, \widehat P_t)(c)}(s_0) \big] \Big|
= \Big| \mathbb{E}_c\Big[ \mathbb{E}_{\pi(c;\cdot), \widehat P^{\,c}_t}\Big[ \sum_{h=0}^{H-1}\sum_{s' \in S} \big(P^\star_c(s'|s_h,a_h) - \widehat P^{\,c}_t(s'|s_h,a_h)\big)\, V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, P^\star)(c),\, h+1}(s') \Big]\Big]\Big|
\]
(by linearity of expectation and Lemma C.2)
\[
\le\; \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s_h,a_h} q_h\big(s_h,a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t\big)\, H \sum_{s' \in S} \big|P^\star_c(s'|s_h,a_h) - \widehat P^{\,c}_t(s'|s_h,a_h)\big| \Big].
\]
(Equation (2), triangle inequality, and $f^\star \in [0,1]$, so $V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star,P^\star)(c),h}(s') \in [0,H]$ for all $h \in [H]$ and $s' \in S$)

Applying the AM-GM inequality with parameter $\beta_P$ to each summand again splits it into two terms: the first is at most $q_h(s_h,a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t)\, b^P_{t,h}(c,s_h,a_h)$ by the definition of $b^P_{t,h}$ in Equation (1); the second is controlled by a multiple of $\frac{H}{\beta_P}$ times the squared total-variation error weighted by the cumulative occupancy measures $\sum_{i=1}^{t-1} q_h(\cdot \mid \pi_i(c;\cdot), \widehat P^{\,c}_i)$. Using $\mathrm{TV}^2 \le 4 D^2_H$ and applying Corollary B.12 to replace each occupancy measure under $\widehat P^{\,c}_i$ by the corresponding one under $P^\star_c$ bounds the second term by
\[
\frac{6H}{\beta_P} \sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} D^2_H\big(P^\star_c(\cdot|s_h,a_h), \widehat P^{\,c}_t(\cdot|s_h,a_h)\big) \,\Big|\, s_0 \Big]
+ \frac{18H^3}{\beta_P} \sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} D^2_H\big(P^\star_c(\cdot|s_h,a_h), \widehat P^{\,c}_i(\cdot|s_h,a_h)\big) \,\Big|\, s_0 \Big],
\]
which completes the proof.
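The step marked $\mathrm{TV}^2 \le 4 D^2_H$ in the proof above is a standard comparison of total variation and Hellinger distances. The snippet below checks it numerically on random distribution pairs under the convention $D^2_H(p,q) = \frac{1}{2}\sum_y(\sqrt{p(y)}-\sqrt{q(y)})^2$; this convention is an assumption (Definition 2.3 is not reproduced in this excerpt), and the constant $4$ leaves slack for the common alternative normalizations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_hellinger(p, q):
    # Convention: D_H^2(p, q) = 0.5 * sum (sqrt(p) - sqrt(q))^2, bounded in [0, 1].
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def total_variation(p, q):
    return 0.5 * np.sum(np.abs(p - q))

for _ in range(10_000):
    n = int(rng.integers(2, 8))
    p = rng.dirichlet(np.ones(n))
    q = rng.dirichlet(np.ones(n))
    assert total_variation(p, q) ** 2 <= 4 * sq_hellinger(p, q) + 1e-12
print("TV^2 <= 4 * D_H^2 held on all sampled pairs")
```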
Corollary B.15 (restatement of Corollary 4.7). Under the terms of Lemmas 4.5 and 4.6, the following holds with probability at least $1-3\delta/4$, simultaneously for all $t \ge 1$ and $\pi \in \Pi_{\mathcal{C}}$:
\[
\Big| \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, P^\star)(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(\widehat f_t, \widehat P_t)(c)}(s_0) \big] \Big|
\le \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s_h,a_h} q_h\big(s_h, a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t\big)\big( b^R_{t,h}(c, s_h, a_h) + b^P_{t,h}(c, s_h, a_h)\big) \Big]
+ \frac{2029 H^4 d_{\mathcal{P}}}{\beta_P} \log^2\big(8TH|\mathcal{P}|/\delta\big) + \frac{504 H^3 d_{\mathcal{P}}}{\beta_r} \log^2\big(64 T^4 H |\mathcal{F}||\mathcal{P}|/\delta^2\big).
\]

Proof. We begin by taking a union bound on the events of Corollaries 4.2 and 4.3 and Lemma 4.4 to get that with probability at least $1-3\delta/4$, simultaneously for all $t \ge 1$,
\[
\sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1}\big(\widehat f_t(c,s_h,a_h) - f^\star(c,s_h,a_h)\big)^2 \,\Big|\, s_0 \Big] \le 68 H \log\big(8T^3|\mathcal{F}|/\delta\big),
\]
\[
\sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} D^2_H\big(P^\star_c(\cdot|s_h,a_h), \widehat P^{\,c}_t(\cdot|s_h,a_h)\big) \,\Big|\, s_0 \Big] \le 2H \log\big(4TH|\mathcal{P}|/\delta\big), \tag{3}
\]
\[
\sum_{i=1}^{t-1} \mathbb{E}_{\pi_i(c;\cdot), P^\star_c}\Big[ \sum_{h=0}^{H-1} D^2_H\big(P^\star_c(\cdot|s_h,a_h), \widehat P^{\,c}_i(\cdot|s_h,a_h)\big) \,\Big|\, s_0 \Big] \le 112 H d_{\mathcal{P}} \log^2\big(8TH|\mathcal{P}|/\delta\big).
\]
Assuming this event holds, we get that for all $t \ge 1$ and context-dependent policies $\pi \in \Pi_{\mathcal{C}}$,
\[
\Big| \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, P^\star)(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(\widehat f_t, \widehat P_t)(c)}(s_0) \big] \Big|
\le \Big| \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, P^\star)(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, \widehat P_t)(c)}(s_0) \big] \Big|
+ \Big| \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(f^\star, \widehat P_t)(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi(c;\cdot)}_{\mathcal{M}(\widehat f_t, \widehat P_t)(c)}(s_0) \big] \Big|
\]
(by the triangle inequality)
\[
\le\; \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s_h,a_h} q_h\big(s_h,a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t\big)\big( b^R_{t,h} + b^P_{t,h}\big)(c,s_h,a_h) \Big]
+ \frac{3}{2\beta_r}\cdot 68 H \log\big(8T^3|\mathcal{F}|/\delta\big)
+ \frac{6H}{\beta_P}\cdot 2H \log\big(4TH|\mathcal{P}|/\delta\big)
+ \Big(\frac{9H^2}{2\beta_r} + \frac{18H^3}{\beta_P}\Big)\cdot 112 H d_{\mathcal{P}} \log^2\big(8TH|\mathcal{P}|/\delta\big)
\]
(Lemmas 4.5 and 4.6, Equation (3))
\[
\le\; \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s_h,a_h} q_h\big(s_h,a_h \mid \pi(c;\cdot), \widehat P^{\,c}_t\big)\big( b^R_{t,h} + b^P_{t,h}\big)(c,s_h,a_h) \Big]
+ \frac{2029 H^4 d_{\mathcal{P}}}{\beta_P} \log^2\big(8TH|\mathcal{P}|/\delta\big) + \frac{504 H^3 d_{\mathcal{P}}}{\beta_r} \log^2\big(64 T^4 H |\mathcal{F}||\mathcal{P}|/\delta^2\big),
\]
which completes the proof.

B.4. Establishing Optimism Lemmas (Step 3)

Lemma B.16 (restatement of Lemma 4.8). Let $\pi^\star$ be an optimal context-dependent policy for $\mathcal{M}$. Under the good event of Corollary 4.7, we have that for any $t \ge 1$,
\[
\mathbb{E}_c\Big[ V^{\pi^\star(c;\cdot)}_{\mathcal{M}(c)}(s_0) \Big]
\le \mathbb{E}_c\Big[ V^{\pi_t(c;\cdot)}_{\widehat{\mathcal{M}}_t(c)}(s_0) \Big]
+ \frac{2029 H^4 d_{\mathcal{P}}}{\beta_P} \log^2\big(8TH|\mathcal{P}|/\delta\big) + \frac{504 H^3 d_{\mathcal{P}}}{\beta_r} \log^2\big(64 T^4 H |\mathcal{F}||\mathcal{P}|/\delta^2\big).
\]

Proof. Fix any round $t \ge 1$ and consider the following derivation:
\[
\mathbb{E}_c\Big[ V^{\pi^\star(c;\cdot)}_{\mathcal{M}(c)}(s_0) \Big]
= \mathbb{E}_c\Big[ V^{\pi^\star(c;\cdot)}_{\mathcal{M}(f^\star, P^\star)(c)}(s_0) \Big]
\le \mathbb{E}_c\Big[ V^{\pi^\star(c;\cdot)}_{\mathcal{M}(\widehat f_t, \widehat P_t)(c)}(s_0) \Big]
+ \mathbb{E}_c\Big[ \sum_{h,s_h,a_h} q_h\big(s_h,a_h \mid \pi^\star(c;\cdot), \widehat P^{\,c}_t\big)\big( b^R_{t,h} + b^P_{t,h}\big)(c,s_h,a_h) \Big]
+ \frac{2029 H^4 d_{\mathcal{P}}}{\beta_P} \log^2\big(8TH|\mathcal{P}|/\delta\big) + \frac{504 H^3 d_{\mathcal{P}}}{\beta_r} \log^2\big(64 T^4 H |\mathcal{F}||\mathcal{P}|/\delta^2\big)
\]
(by Corollary 4.7)
\[
= \mathbb{E}_c\Big[ V^{\pi^\star(c;\cdot)}_{\widehat{\mathcal{M}}_t(c)}(s_0) \Big]
+ \frac{2029 H^4 d_{\mathcal{P}}}{\beta_P} \log^2\big(8TH|\mathcal{P}|/\delta\big) + \frac{504 H^3 d_{\mathcal{P}}}{\beta_r} \log^2\big(64 T^4 H |\mathcal{F}||\mathcal{P}|/\delta^2\big)
\]
(Equation (2))
\[
\le \mathbb{E}_c\Big[ V^{\pi_t(c;\cdot)}_{\widehat{\mathcal{M}}_t(c)}(s_0) \Big]
+ \frac{2029 H^4 d_{\mathcal{P}}}{\beta_P} \log^2\big(8TH|\mathcal{P}|/\delta\big) + \frac{504 H^3 d_{\mathcal{P}}}{\beta_r} \log^2\big(64 T^4 H |\mathcal{F}||\mathcal{P}|/\delta^2\big),
\]
($\pi_t$ is optimal in $\widehat{\mathcal{M}}_t$)

as the lemma states.

Lemma B.17 (restatement of Lemma 4.9). Under the good event of Corollary 4.7, we have that for every $t \ge 1$,
\[
\mathbb{E}_c\Big[ V^{\pi_t(c;\cdot)}_{\widehat{\mathcal{M}}_t(c)}(s_0) \Big]
\le \mathbb{E}_c\Big[ V^{\pi_t(c;\cdot)}_{\mathcal{M}(c)}(s_0) \Big]
+ 2\, \mathbb{E}_c\Big[ \mathbb{E}_{\pi_t(c;\cdot), \widehat P^{\,c}_t}\Big[ \sum_{h=0}^{H-1} \big( b^R_{t,h}(c,s_h,a_h) + b^P_{t,h}(c,s_h,a_h) \big) \Big]\Big]
+ \frac{2029 H^4 d_{\mathcal{P}}}{\beta_P} \log^2\big(8TH|\mathcal{P}|/\delta\big) + \frac{504 H^3 d_{\mathcal{P}}}{\beta_r} \log^2\big(64 T^4 H |\mathcal{F}||\mathcal{P}|/\delta^2\big).
\]

Proof. For all $t \ge 1$ the following holds:
\[
\mathbb{E}_c\Big[ V^{\pi_t(c;\cdot)}_{\widehat{\mathcal{M}}_t(c)}(s_0) \Big]
= \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s,a} q_h\big(s,a \mid \pi_t(c;\cdot), \widehat P^{\,c}_t\big)\Big( \widehat f_t(c,s,a) + b^R_{t,h}(c,s,a) + b^P_{t,h}(c,s,a) \Big) \Big]
\]
(Equation (2))
\[
= \mathbb{E}_c\Big[ V^{\pi_t(c;\cdot)}_{\mathcal{M}(\widehat f_t, \widehat P_t)(c)}(s_0) \Big]
+ \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s,a} q_h\big(s,a \mid \pi_t(c;\cdot), \widehat P^{\,c}_t\big)\big( b^R_{t,h} + b^P_{t,h}\big)(c,s,a) \Big]
\]
\[
\le \mathbb{E}_c\Big[ V^{\pi_t(c;\cdot)}_{\mathcal{M}(f^\star, P^\star)(c)}(s_0) \Big]
+ 2\,\mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s,a} q_h\big(s,a \mid \pi_t(c;\cdot), \widehat P^{\,c}_t\big)\big( b^R_{t,h} + b^P_{t,h}\big)(c,s,a) \Big]
+ \frac{2029 H^4 d_{\mathcal{P}}}{\beta_P} \log^2\big(8TH|\mathcal{P}|/\delta\big) + \frac{504 H^3 d_{\mathcal{P}}}{\beta_r} \log^2\big(64 T^4 H |\mathcal{F}||\mathcal{P}|/\delta^2\big),
\]
(Corollary 4.7)

and the proof follows by writing the second term as an expectation over $(s_h, a_h)$ when playing $\pi_t(c;\cdot)$ on the dynamics $\widehat P^{\,c}_t$.

B.5. Regret Bound

We begin with a technical result that bounds the expected cumulative bonuses.

Lemma B.18. Let $b^R_{t,h}(c,s,a)$ and $b^P_{t,h}(c,s,a)$ be the reward bonuses in Equation (1). We have that
\[
\sum_{t=1}^{T}\sum_{h=0}^{H-1} \mathbb{E}_c\Big[ \mathbb{E}_{\pi_t(c;\cdot), \widehat P^{\,c}_t}\big[ b^R_{t,h}(c,s_h,a_h) + b^P_{t,h}(c,s_h,a_h) \big] \Big] \;\le\; H|S||A|\,(\beta_r + H\beta_P)\log(T+1).
\]

Proof. First we bound $b^R_{t,h}$ and $b^P_{t,h}$ by the second term in the minimum to get that
\[
\sum_{t=1}^{T}\sum_{h=0}^{H-1} \mathbb{E}_c\Big[ \mathbb{E}_{\pi_t(c;\cdot), \widehat P^{\,c}_t}\big[ b^R_{t,h} + b^P_{t,h} \big] \Big]
= \sum_{t=1}^{T} \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s,a} q_h\big(s,a \mid \pi_t(c;\cdot), \widehat P^{\,c}_t\big)\big( b^R_{t,h} + b^P_{t,h}\big)(c,s,a) \Big]
\]
\[
\le \frac{\beta_r + \beta_P H}{2}\, \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s,a} \sum_{t=1}^{T} \frac{q_h\big(s,a \mid \pi_t(c;\cdot), \widehat P^{\,c}_t\big)}{1 + \sum_{i=1}^{t-1} q_h\big(s,a \mid \pi_i(c;\cdot), \widehat P^{\,c}_i\big)} \Big]
\le \frac{\beta_r + \beta_P H}{2}\, \mathbb{E}_c\Big[ \sum_{h=0}^{H-1}\sum_{s,a} 2\log(T+1) \Big]
= H|S||A|\,(\beta_r + \beta_P H)\log(T+1),
\]
where the last inequality used Lemma C.1 with $x_t = q_h(s,a \mid \pi_t(c;\cdot), \widehat P^{\,c}_t)$ and $\lambda = 1$.
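Equation (1) is not reproduced in this excerpt, so the sketch below assumes bonuses of the form $\min\{\text{cap},\ (\beta/2)/(1+\sum_{i<t} q^i_h(s,a))\}$, which is consistent with the "second term in the minimum" used in the proof of Lemma B.18. Under that assumption it checks numerically, for a single $(h,s,a)$ and arbitrary occupancy sequences in $[0,1]$, that the cumulative bonus respects the $\beta\log(T+1)$ potential bound delivered by Lemma C.1.

```python
import numpy as np

rng = np.random.default_rng(1)

def cumulative_bonus(q_seq, beta, cap):
    """Sum_t q_t * min{cap, (beta/2) / (1 + sum_{i<t} q_i)} for one fixed (h, s, a)."""
    total, past = 0.0, 0.0
    for q in q_seq:
        bonus = min(cap, 0.5 * beta / (1.0 + past))
        total += q * bonus
        past += q
    return total

T, beta, cap = 2000, 7.3, 1.0
for _ in range(100):
    q_seq = rng.uniform(0.0, 1.0, size=T)   # arbitrary occupancy values in [0, 1]
    assert cumulative_bonus(q_seq, beta, cap) <= beta * np.log(T + 1) + 1e-9
print("per-(h,s,a) cumulative bonus respected the beta*log(T+1) potential bound")
```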
Using all the above, we derive our main result, stated in the following theorem.

Theorem B.19 (E-UC3RL regret bound, restatement of Theorem 3.1). For any $T > 1$ and $\delta \in (0,1)$, suppose we run Algorithm 1 with parameters
\[
\beta_r = \sqrt{\frac{504\, T H^2 d_{\mathcal{P}} \log^2\big(64 T^4 H |\mathcal{F}||\mathcal{P}|/\delta^2\big)}{|S||A|\log(T+1)}},
\qquad
\beta_P = \sqrt{\frac{2029\, T H^2 d_{\mathcal{P}} \log^2\big(8 T H |\mathcal{P}|/\delta\big)}{|S||A|\log(T+1)}},
\]
and $d_{\mathcal{P}} := d_E(\mathcal{P}, D_H, T^{-1/2})$. Then, with probability at least $1-\delta$,
\[
R_T(\text{E-UC}^3\text{RL}) \;=\; \widetilde{O}\Big( H^3 \sqrt{T|S||A|\, d_{\mathcal{P}} \big(\log(|\mathcal{F}|/\delta) + \log(|\mathcal{P}|/\delta)\big)} \Big).
\]

Proof of Theorem 3.1. We prove a regret bound under the following good events. The first event is that of Corollary 4.7, which occurs with probability at least $1 - 3\delta/4$. The second event is that
\[
\sum_{t=1}^{T} \Big( V^{\pi^\star(c_t;\cdot)}_{\mathcal{M}(c_t)}(s_0) - V^{\pi_t(c_t;\cdot)}_{\mathcal{M}(c_t)}(s_0) \Big)
\le \sum_{t=1}^{T} \Big( \mathbb{E}_c\big[ V^{\pi^\star(c;\cdot)}_{\mathcal{M}(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi_t(c;\cdot)}_{\mathcal{M}(c)}(s_0) \big] \Big) + H\sqrt{T\log(8/\delta)}. \tag{4}
\]
By Azuma's inequality (where the filtration is the histories $\{\mathcal{H}_t\}_{t=1}^{T}$), the above holds with probability at least $1-\delta/4$. Taking a union bound, the good event holds with probability at least $1-\delta$.

Hence, assume the good events hold, and consider the following derivation:
\[
\sum_{t=1}^{T} \Big( \mathbb{E}_c\big[ V^{\pi^\star(c;\cdot)}_{\mathcal{M}(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi_t(c;\cdot)}_{\mathcal{M}(c)}(s_0) \big] \Big)
= \sum_{t=1}^{T} \Big( \mathbb{E}_c\big[ V^{\pi^\star(c;\cdot)}_{\mathcal{M}(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi_t(c;\cdot)}_{\widehat{\mathcal{M}}_t(c)}(s_0) \big] \Big)
+ \sum_{t=1}^{T} \Big( \mathbb{E}_c\big[ V^{\pi_t(c;\cdot)}_{\widehat{\mathcal{M}}_t(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi_t(c;\cdot)}_{\mathcal{M}(c)}(s_0) \big] \Big)
\]
\[
\le\; 2 \sum_{t=1}^{T}\sum_{h=0}^{H-1} \mathbb{E}_c\Big[ \mathbb{E}_{\pi_t(c;\cdot), \widehat P^{\,c}_t}\big[ b^R_{t,h}(c,s_h,a_h) + b^P_{t,h}(c,s_h,a_h) \big]\Big]
+ 2T\,\frac{2029 H^4 d_{\mathcal{P}}}{\beta_P}\log^2\big(8TH|\mathcal{P}|/\delta\big)
+ 2T\,\frac{504 H^3 d_{\mathcal{P}}}{\beta_r}\log^2\big(64T^4H|\mathcal{F}||\mathcal{P}|/\delta^2\big)
\]
(Lemmas 4.8 and 4.9)
\[
\le\; 2H|S||A|\,(\beta_r + H\beta_P)\log(T+1)
+ 2T\,\frac{2029 H^4 d_{\mathcal{P}}}{\beta_P}\log^2\big(8TH|\mathcal{P}|/\delta\big)
+ 2T\,\frac{504 H^3 d_{\mathcal{P}}}{\beta_r}\log^2\big(64T^4H|\mathcal{F}||\mathcal{P}|/\delta^2\big)
\]
(Lemma B.18)
\[
=\; 4\sqrt{504\, T|S||A| H^4 d_{\mathcal{P}} \log(T+1)\log^2\big(64T^4H|\mathcal{F}||\mathcal{P}|/\delta^2\big)}
+ 4\sqrt{2029\, T|S||A| H^6 d_{\mathcal{P}} \log(T+1)\log^2\big(8TH|\mathcal{P}|/\delta\big)}
\]
(plugging in $\beta_P$, $\beta_r$)
\[
\le\; 8\sqrt{2029\, T|S||A| H^6 d_{\mathcal{P}} \log(T+1)}\;\log\big(64T^4H|\mathcal{F}||\mathcal{P}|/\delta^2\big).
\]
Finally, we get that
\[
R_T(\text{E-UC}^3\text{RL}) = \sum_{t=1}^{T} \Big( V^{\pi^\star(c_t;\cdot)}_{\mathcal{M}(c_t)}(s_0) - V^{\pi_t(c_t;\cdot)}_{\mathcal{M}(c_t)}(s_0) \Big)
\le \sum_{t=1}^{T} \Big( \mathbb{E}_c\big[ V^{\pi^\star(c;\cdot)}_{\mathcal{M}(c)}(s_0) \big] - \mathbb{E}_c\big[ V^{\pi_t(c;\cdot)}_{\mathcal{M}(c)}(s_0) \big] \Big) + H\sqrt{T\log(8/\delta)}
\]
(Equation (4))
\[
\le\; 8\sqrt{2029\, T|S||A| H^6 d_{\mathcal{P}} \log(T+1)}\;\log\big(64T^4H|\mathcal{F}||\mathcal{P}|/\delta^2\big) + H\sqrt{T\log(8/\delta)}
\;=\; \widetilde{O}\Big( H^3\sqrt{T|S||A|\, d_{\mathcal{P}}\,\big(\log(|\mathcal{F}|/\delta) + \log(|\mathcal{P}|/\delta)\big)} \Big).
\]

C. Auxiliary lemmas

Lemma C.1. Let $S_t = \lambda + \sum_{k=1}^{t-1} x_k$ and suppose $x_t \in [0, \lambda]$ for all $t$. Then
\[
\sum_{t=1}^{T} \frac{x_t}{S_t} \;\le\; 2\log(T+1).
\]
Proof. The following holds:
\[
\sum_{t=1}^{T} \frac{x_t}{S_t}
\;\le\; \sum_{t=1}^{T} 2\log\frac{S_{t+1}}{S_t}
\]
(since $x_t \le \lambda \le S_t$, hence $\frac{x_t}{S_t} \le 2\log\big(1 + \frac{x_t}{S_t}\big) = 2\log\frac{S_{t+1}}{S_t}$)
\[
= 2\sum_{t=1}^{T} \big(\log S_{t+1} - \log S_t\big)
= 2\log\frac{S_{T+1}}{S_1}
\]
(telescoping sum)
\[
\le\; 2\log(T+1),
\]
where the last inequality uses $S_{T+1} \le \lambda + T\lambda = (T+1)\lambda$ and $S_1 = \lambda$.

Lemma C.2 (value-difference, Corollary 1 in Shani et al., 2020). Let $M, M'$ be any $H$-finite horizon MDPs. Then, for any two policies $\pi, \pi'$ the following holds:
\[
V^{\pi, M}_0(s) - V^{\pi', M'}_0(s)
= \sum_{h=0}^{H-1} \mathbb{E}\Big[ \big\langle Q^{\pi, M}_h(s_h, \cdot),\, \pi_h(\cdot|s_h) - \pi'_h(\cdot|s_h) \big\rangle \,\Big|\, s_0 = s,\, \pi',\, M' \Big]
+ \sum_{h=0}^{H-1} \mathbb{E}\Big[ r_h(s_h,a_h) - r'_h(s_h,a_h) + \big(p_h(\cdot|s_h,a_h) - p'_h(\cdot|s_h,a_h)\big)^{\!\top} V^{\pi, M}_{h+1} \,\Big|\, s_0 = s,\, \pi',\, M' \Big].
\]
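Lemma C.1 is easy to sanity-check numerically. The snippet below is my own test harness (random $\lambda$ and sequences $x_t \in [0,\lambda]$), not part of the paper, and verifies the stated potential bound.

```python
import numpy as np

rng = np.random.default_rng(2)

def potential_sum(x, lam):
    """Compute sum_t x_t / S_t with S_t = lam + sum_{k<t} x_k (Lemma C.1)."""
    total, S = 0.0, lam
    for x_t in x:
        total += x_t / S
        S += x_t
    return total

for _ in range(1_000):
    T = int(rng.integers(1, 500))
    lam = float(rng.uniform(0.1, 5.0))
    x = rng.uniform(0.0, lam, size=T)   # x_t in [0, lambda]
    assert potential_sum(x, lam) <= 2.0 * np.log(T + 1) + 1e-9
print("Lemma C.1 bound held on all sampled sequences")
```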