# Reinforcement Learning with Lookahead Information

Nadav Merlis
Fair Play Joint Team, CREST, ENSAE Paris
nadav.merlis@ensae.fr

Abstract. We study reinforcement learning (RL) problems in which agents observe the reward or transition realizations at their current state before deciding which action to take. Such observations are available in many applications, including transactions, navigation and more. When the environment is known, previous work shows that this lookahead information can drastically increase the collected reward. However, outside of specific applications, existing approaches for interacting with unknown environments are not well-adapted to these observations. In this work, we close this gap and design provably-efficient learning algorithms able to incorporate lookahead information. To achieve this, we perform planning using the empirical distribution of the reward and transition observations, in contrast to vanilla approaches that only rely on estimated expectations. We prove that our algorithms achieve tight regret versus a baseline that also has access to lookahead information, which can linearly increase the amount of collected reward compared to agents that cannot handle lookahead information.

1 Introduction

In reinforcement learning (RL), agents sequentially interact with a changing environment, aiming to collect as much reward as possible. While performing actions that yield immediate rewards is enticing, agents must also bear in mind that actions influence the state of the environment, affecting the potential reward that could be collected in future steps. When the environment is unknown, agents also need to balance reward maximization based on previous data with exploration, that is, gathering data that might improve future reward collection.

In the standard interaction model, at each timestep, agents first choose an action and only then observe its effect on the rewards and state dynamics. As such, agents can only maximize the expected rewards, collected through the expected dynamics. Yet, in many applications, some information on the immediate outcome of actions is known before actions are performed. For example, when agents interact through transactions, prices and traded goods are usually agreed upon before performing any exchange (reward information). Alternatively, in navigation problems, nearby traffic information is known to the agent before choosing which path to take (transition information).

In a recent work, Merlis et al. [2024] show that even for agents with full statistical knowledge of the environment, such lookahead information can drastically increase the reward collected by agents, by a multiplicative factor of up to $AH$ when immediate rewards are revealed in advance and $A^{H/2}$ when observing the immediate future transitions.[1] Intuitively, agents do not only gain from instantaneously using this information; they can also adapt their planning to account for lookahead information being revealed in subsequent states, significantly increasing their future values. However, the work of Merlis et al. [2024] only tackles planning settings in which the model is known and does not provide algorithms or guarantees when interacting with unknown environments.

[1] $A$ is the size of the action space, $S$ is the size of the state space and $H$ is the interaction length.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).
In this work, we aim to design provably-efficient agents that learn how to interact when given immediate (one-step lookahead) reward or transition information before choosing an action, under the episodic tabular Markov Decision Process model. While such information can always be embedded into the state of the environment, the state space then becomes exponential at best and continuous at worst, rendering most theoretically-guaranteed approaches both computationally and statistically intractable. To alleviate this, we start by deriving dynamic programming (Bellman) equations in the original state space that characterize the optimal lookahead policies. Inspired by these update rules, we present two variants of the MVP algorithm [Zhang et al., 2021b] that allow incorporating either reward or transition lookahead. In particular, we suggest a planning procedure that uses the empirical distribution of the reward/transition observations (instead of the estimated expectations), which might also be applied to other complex settings. We prove that these algorithms achieve tight regret bounds of $\tilde{O}(\sqrt{H^3 S A K})$ and $\tilde{O}(\sqrt{H^2 S A K} + \sqrt{H^3 S K})$ after $K$ episodes (for reward and transition lookahead, respectively), compared to a stronger baseline that also has access to lookahead information. As such, they can collect significantly more reward than vanilla RL algorithms.

Outline. We formally define RL problems with reward/transition lookahead in Section 2 and further discuss the differences between our setting and standard RL problems in Section 3. Then, we present our results in two complementary sections: Section 4 analyzes reward lookahead while Section 5 analyzes transition lookahead. We end with conclusions and future directions in Section 6.

Related Work. Problems with varying lookahead information have been extensively studied in control, with model predictive control [MPC, Camacho et al., 2007] as the most notable example. Conceptually, when interacting with an environment that might be too complex or hard to model, it is oftentimes convenient to use a simpler model that allows accurately predicting its behavior just in the near future. MPC uses such models to repeatedly update its policy using short-term planning. In some cases, the utilized future predictions consist of additive perturbations to the dynamics [Yu et al., 2020], while other cases involve more general future predictions on the model behavior [Li et al., 2019, Zhang et al., 2021a, Lin et al., 2021, 2022]. To the best of our knowledge, these studies focus on comparing the performance of the controller to one with full future information (so linear regret is inevitable), sometimes also considering prediction errors. They do not, however, attempt to learn the predictions. In contrast, we estimate the reward/transition distributions and leverage them to better plan, thus increasing the value gained by the agent. In addition, these works focus on continuous (mostly linear) control problems, whereas we study tabular settings; results from either setting cannot be directly applied to the other.

In RL, lookahead is mostly used as a planning tool; namely, agents test the possible outcomes after performing multiple steps to decide which actions to take or to better estimate the value [Tamar et al., 2017, Efroni et al., 2019a, 2020, Moerland et al., 2020, Rosenberg et al., 2023, El Shar and Jiang, 2020, Biedenkapp et al., 2021, Huang et al., 2019].
Specifically, the future value at the end of the lookahead is often estimated using rollouts, and a longer lookahead is more robust to suboptimality of the rollout policy [Bertsekas, 2023]. However, when agents actually interact with the environment, no additional lookahead information is observed. One notable exception is [Merlis et al., 2024], which analyzes the potential value increase due to multi-step reward lookahead information (and briefly mentions transition lookahead). However, they only tackle planning settings, where the model is known, and do not study learning.

In this work, we continue a long line of literature on regret analysis for tabular RL [Jaksch et al., 2010, Jin et al., 2018, Dann et al., 2019, Zanette and Brunskill, 2019, Efroni et al., 2019b, 2021, Simchowitz and Jamieson, 2019, Zhang et al., 2021b, 2023]. Yet, we are not aware of any existing results on regret minimization with reward or transition lookahead information.

Finally, various applications that involve one-step lookahead information have been previously studied. The most notable ones are prophet problems [Correa et al., 2019], where one-step reward lookahead is obtained, and the Canadian traveler problem with resampling [Nikolova and Karger, 2008], which can be formulated through one-step transition lookahead. We discuss the relation to these problems and the relevant existing results when analyzing each type of feedback, and also discuss the relation between transition lookahead and stochastic action sets [Boutilier et al., 2018].

2 Setting and Notations

We study episodic tabular Markov Decision Processes (MDPs), defined by the tuple $M = (S, A, H, P, R)$, where $S$ is the state space (of size $S$), $A$ is the action space (of size $A$) and $H$ is the interaction horizon. At each timestep $h \in \{1, \ldots, H\} \triangleq [H]$ of an episode $k \in [K]$, an agent, located in state $s^k_h \in S$, chooses an action $a^k_h \in A$ and obtains a reward $R^k_h = R^k_h(s^k_h, a^k_h) \sim R_h(s^k_h, a^k_h)$. We assume that the rewards are supported on $[0, 1]$ with expectations $r_h(s, a)$. Afterward, the environment transitions to a state $s^k_{h+1} \sim P_h(\cdot|s^k_h, a^k_h)$ and the interaction continues until the end of the episode. We use the notation $R \sim R_h(s)$ (or $s' \sim P_h(s)$) to denote reward (next-state) samples for all actions simultaneously at step $h$ and state $s$, and assume independence between different timesteps.[2] On the other hand, samples from different actions at a specific state/timestep are not necessarily independent.

Reward Lookahead. With one-step reward lookahead, at timestep $h$ and state $s$, agents first observe the rewards for all actions, $R_h(s) \triangleq \{R_h(s, a)\}_{a \in A}$, and only then choose an action to perform. Formally, we define the set of reward lookahead policies as $\Pi_R = \{\pi : [H] \times S \times [0, 1]^A \to \Delta_A\}$, where $\Delta_A$ is the probability simplex over actions, and denote $a_h = \pi_h(s_h, R_h)$. The value of a reward lookahead agent is the cumulative reward it gathers starting at timestep $h$ and state $s$, denoted by
$$V^{R,\pi}_h(s) = \mathbb{E}\Bigg[\sum_{t=h}^{H} R_t\big(s_t, \pi_t(s_t, R_t(s_t))\big) \,\Big|\, s_h = s\Bigg].$$
We also define the optimal reward lookahead value to be $V^{R,*}_h(s) = \max_{\pi \in \Pi_R} V^{R,\pi}_h(s)$. When interacting with an unknown environment for $K$ episodes, agents sequentially choose reward lookahead policies $\pi^k \in \Pi_R$ based on all historical information and are measured by their regret,
$$\mathrm{Reg}_R(K) = \sum_{k=1}^{K} \Big( V^{R,*}_1(s^k_1) - V^{R,\pi^k}_1(s^k_1) \Big).$$
We allow the initial state of each episode $s^k_1$ to be arbitrarily chosen.
Transition Lookahead. Denoting by $s'_{h+1}(s, a)$ the future state when playing action $a$ at step $h$ and state $s$, one-step transition lookahead agents observe $s'_{h+1}(s) \triangleq \{s'_{h+1}(s, a)\}_{a \in A}$ before acting. The set of transition lookahead policies is denoted by $\Pi_T = \{\pi : [H] \times S \times S^A \to \Delta_A\}$, with values
$$V^{T,\pi}_h(s) = \mathbb{E}\Bigg[\sum_{t=h}^{H} R_t\big(s_t, \pi_t(s_t, s'_{t+1}(s_t))\big) \,\Big|\, s_h = s\Bigg].$$
The optimal value is $V^{T,*}_h(s) = \max_{\pi \in \Pi_T} V^{T,\pi}_h(s)$, and we similarly define the regret versus optimal transition lookahead agents as $\mathrm{Reg}_T(K) = \sum_{k=1}^{K} \big( V^{T,*}_1(s^k_1) - V^{T,\pi^k}_1(s^k_1) \big)$. When the type of lookahead is clear from the context, we sometimes denote values by $V^{\pi}_h$ and $V^{*}_h$.

Other Notations. For any $p \in \Delta_n$ and $V \in \mathbb{R}^n$, we define $\mathrm{Var}_p(V) = \sum_{i=1}^n p_i V_i^2 - \big(\sum_{i=1}^n p_i V_i\big)^2$. Also, given a transition kernel $P$ and a vector $V \in \mathbb{R}^S$, we let $PV(s, a) = \sum_{s' \in S} P(s'|s, a) V(s')$ and similarly define it for value or transition kernel differences. We denote by $n^k_h(s, a)$ the number of times the pair $(s, a)$ was visited at timestep $h$ up to episode $k$ (inclusive) and similarly denote $n^k_h(s) = \sum_{a \in A} n^k_h(s, a)$. We also let
$$\hat r^k_h(s, a) = \frac{1}{n^k_h(s, a)} \sum_{k'=1}^{k} \mathbb{1}\big\{ s^{k'}_h = s, a^{k'}_h = a \big\} R^{k'}_h \quad\text{and}\quad \hat P^k_h(s'|s, a) = \frac{1}{n^k_h(s, a)} \sum_{k'=1}^{k} \mathbb{1}\big\{ s^{k'}_h = s, a^{k'}_h = a, s^{k'}_{h+1} = s' \big\}$$
be the empirical expected rewards and transition kernel at $(s_h, a_h) = (s, a)$ using data up to episode $k$, and assume both are initialized to zero. Finally, we denote by $\hat R^k_h(s)$ the empirical reward distribution across all actions, and use $\hat P^k_h(s)$ to denote the empirical joint next-state distribution for all actions. In particular, if $k_i$ is the $i$th episode where $s$ was visited at step $h$, then to sample $R \sim \hat R^k_h(s)$, we uniformly sample $i \sim U(n^k_h(s))$ and return $R = \{R^{k_i}_h(s, a)\}_{a \in A}$. A sample $s' \sim \hat P^k_h(s)$ similarly returns $s' = \{s^{k_i}_{h+1}(s, a)\}_{a \in A}$.

When we want to indicate the distribution used to calculate an expectation, we sometimes state it in a subscript, e.g., write $\mathbb{E}_{R_h(s)}[R(a)]$ to indicate that $R(a) \sim R_h(s, a)$, or use $\mathbb{E}_M$ to emphasize that all distributions are according to an environment $M$. In this paper, $O$-notation only hides absolute constants, while $\tilde{O}$ also hides factors of $\mathrm{polylog}(S, A, H, K, \delta)$. We also use the notation $a \vee b = \max\{a, b\}$.

[2] This assumption is not used by our algorithms; it is only needed to ensure that the optimal policy is Markovian.
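To make the empirical joint distributions $\hat R^k_h(s)$ and $\hat P^k_h(s)$ concrete, the following minimal Python sketch (our own illustration; the class name `EmpiricalLookaheadBuffer` and method names are not from the paper) stores one observation vector per visit and resamples whole vectors uniformly, exactly as in the definition above. Because entire vectors are resampled, any correlations between actions observed at the same visit are preserved.

```python
import random
from collections import defaultdict

class EmpiricalLookaheadBuffer:
    """Stores joint lookahead observations: one vector over all actions per visit of (h, s)."""

    def __init__(self):
        # (h, s) -> list of observation vectors (rewards or next states), each of length A
        self._samples = defaultdict(list)

    def add(self, h, s, obs_vector):
        """Record the lookahead observation {R_h(s, a)}_a (or {s'_{h+1}(s, a)}_a) at one visit."""
        self._samples[(h, s)].append(tuple(obs_vector))

    def sample(self, h, s):
        """Draw from the empirical joint distribution: pick a past visit uniformly, return its vector."""
        visits = self._samples[(h, s)]
        return random.choice(visits) if visits else None

    def expected_max(self, h, s, shift):
        """Empirical expectation of max_a {R(a) + shift[a]}, averaged over all stored visits."""
        visits = self._samples[(h, s)]
        if not visits:
            return 0.0
        return sum(max(r + c for r, c in zip(vec, shift)) for vec in visits) / len(visits)
```

With `shift[a]` set to a (bonus-augmented) estimate of the future value of action $a$, `expected_max` is the kind of quantity the planning procedures of Sections 4 and 5 evaluate over the empirical distributions.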
3 Comparing the Values of Lookahead Agents and Vanilla RL Agents

In the classic RL formulation [e.g., Azar et al., 2017], agents only observe the reward and transition after performing an action and aim to maximize the no-lookahead value, defined by
$$V^{\pi}_h(s) = \mathbb{E}\Bigg[\sum_{t=h}^{H} r_t(s_t, \pi_t(s_t)) \,\Big|\, s_h = s\Bigg],$$
where $\pi \in \Pi_M = \{\pi : [H] \times S \to \Delta_A\}$ is a Markovian policy. The optimal value is $V^{no}_h(s) = \max_{\pi \in \Pi_M} V^{\pi}_h(s)$ and the regret is classically defined as $\mathrm{Reg}(K) = \sum_{k=1}^{K} \big( V^{no}_1(s^k_1) - V^{\pi^k}_1(s^k_1) \big)$.

By definition, the set of lookahead policies also includes all Markovian policies (since agents are not obliged to use reward/transition information), so the optimal lookahead values are always larger than their no-lookahead counterparts. In other words, denoting the value gain due to lookahead information by $G_R(s) = V^{R,*}_1(s) - V^{no}_1(s)$ and $G_T(s) = V^{T,*}_1(s) - V^{no}_1(s)$, it holds that $G_R(s), G_T(s) \ge 0$. In terms of regret, for any fixed algorithm, we can also write
$$\mathrm{Reg}(K) = \mathrm{Reg}_R(K) - \sum_{k=1}^{K} G_R(s^k_1) = \mathrm{Reg}_T(K) - \sum_{k=1}^{K} G_T(s^k_1).$$
As the value gains are non-negative, this directly implies that any regret bound w.r.t. the lookahead value also leads to the same bound for the standard regret. Even more so, in most cases, lookahead information leads to a strict improvement in the value, that is, $G_R(s), G_T(s) \ge G_0 > 0$. When this happens, any algorithm with sub-linear lookahead regret enjoys a negative linear standard regret: if $\mathrm{Reg}_R(K) = o(K)$ and $G_R(s^k_1) \ge G_0$ for all $k \in [K]$, then $\mathrm{Reg}(K) \le -G_0 K + o(K)$. The same also holds for transition lookahead. Conversely, any agent that suffers non-negative standard regret will suffer linear regret compared to the best lookahead agent, i.e., if $\mathrm{Reg}(K) \ge 0$ and $G_R(s^k_1) \ge G_0$ for all $k \in [K]$, then $\mathrm{Reg}_R(K) \ge G_0 K$. Notably, any agent that does not use lookahead information will suffer linear lookahead regret in any such environment. We now present two illustrative examples of environments where the lookahead value gain is significant, one for reward lookahead and another for transition lookahead.

Figure 1: Two-state prophet-like problem.

Reward lookahead. Consider a simple 2-state environment, depicted in Figure 1. Starting at $s_i$, agents can either stay there by playing $a_1$, earning no reward, or play any other action and move to the absorbing state $s_f$, obtaining a Bernoulli reward $\mathrm{Ber}\big(\tfrac{1}{(A-1)H}\big)$. Actions in the terminal state $s_f$ yield no reward. Without observing the rewards, agents will arbitrarily move from $s_i$ to $s_f$, obtaining a reward of $V^{no} = \tfrac{1}{(A-1)H}$ in expectation. On the other hand, when agents observe the rewards before acting, they should move from $s_i$ to $s_f$ only if a reward was realized for some action (and otherwise stay in $s_i$ by playing $a_1$). Such agents will have $(A-1)H$ opportunities to observe a unit reward across all timesteps and actions, collecting in expectation
$$V^{R,*} = 1 - \Big(1 - \tfrac{1}{(A-1)H}\Big)^{(A-1)H} \ge 1 - 1/e.$$
In other words, just by observing the rewards before acting, the agent's value multiplicatively increases by almost $V^{R,*}/V^{no} \approx AH$. Moreover, the additive value gain is $G_R \approx 1 - 1/e$, so sub-linear lookahead regret with reward information results in a negative linear standard regret of $\mathrm{Reg}(K) \approx -(1 - 1/e)K$.

Transition lookahead. Consider a chain of $H/2$ states (also described in further detail in Appendix C.9 and depicted in Figure 2). In each state, one action deterministically keeps the agent in its current state, while all other actions move the agent one state forward w.p. $1/A$, but lead to a terminal non-rewarding state otherwise. If the reward is located at the end of the chain, any standard RL agent can collect it only with exponentially low probability. On the other hand, transition lookahead agents would move forward only if there is an action that allows it, while staying at their current state otherwise; such agents will collect the rewards at the end of the chain with constant probability. More specifically, any no-lookahead agent can collect at most $V^{no} = O(H A^{-H/2})$ reward, while transition lookahead agents can collect $V^{T,*} = \Omega(H)$; as such, lookahead agents achieve an exponential increase in value, and sublinear regret versus the best lookahead agent yields a negative standard regret of order $-\Omega(HK)$. In the following sections, we will present agents that are guaranteed to always achieve sublinear regret compared to the best lookahead agent.
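As a quick numerical sanity check of the reward-lookahead example above, the following Monte-Carlo sketch (our own illustration, not from the paper; it assumes reward realizations are independent across actions, which the general setting does not require) estimates the no-lookahead and lookahead values of the two-state environment.

```python
import random

def two_state_example(A=5, H=10, episodes=200_000, lookahead=True, seed=0):
    """Monte-Carlo estimate of the episode return in the two-state prophet-like example.

    At s_i, action a_1 stays (no reward); any of the other A-1 actions moves to the
    absorbing state s_f and pays Ber(1/((A-1)H)). A lookahead agent leaves only once
    some action's reward is realized; a no-lookahead agent leaves immediately.
    """
    rng = random.Random(seed)
    p = 1.0 / ((A - 1) * H)
    total = 0.0
    for _ in range(episodes):
        for _ in range(H):
            rewards = [1.0 if rng.random() < p else 0.0 for _ in range(A - 1)]
            if lookahead:
                if max(rewards) > 0:   # a unit reward was realized: take it and absorb
                    total += 1.0
                    break
            else:
                total += rewards[0]    # leave immediately with an arbitrary moving action
                break
    return total / episodes

if __name__ == "__main__":
    v_no = two_state_example(lookahead=False)  # about 1/((A-1)H) = 0.025
    v_la = two_state_example(lookahead=True)   # about 1 - (1 - 1/((A-1)H))^((A-1)H), roughly 0.64
    print(f"no-lookahead value ~ {v_no:.3f}, reward-lookahead value ~ {v_la:.3f}")
```

With $A = 5$ and $H = 10$, the simulated values are close to $1/40$ and $1 - (1 - 1/40)^{40} \approx 0.64$, matching the multiplicative gap of order $AH$ discussed above.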
4 Planning and Learning with One-Step Reward Lookahead

In this section, we analyze RL settings with one-step reward lookahead, in which immediate rewards are observed before choosing an action. One well-known example of this situation is the prophet problem [Correa et al., 2019], where an agent sequentially observes values from known distributions. Upon observing a value, the agent decides whether to take it as a reward and stop the interaction, or discard it and continue to observe more values. This problem has numerous applications and extensions concerning auctions and posted-price mechanisms [Correa et al., 2017]. As shown in [Merlis et al., 2024], it is critical to observe the realized values before taking a decision; otherwise, the agent's revenue can decrease by a factor of $H$. Notably, the example presented in Figure 1 is a small variant of the prophet problem, where the agent can either take one of $A-1$ values and finish the interaction, or discard them and continue playing by staying at $s_i$; we showed that for this example, the lookahead information increases the value by a factor of $V^{R,*}/V^{no} \approx AH$.

The most natural way to tackle this setting is to extend (augment) the state space to contain the observed rewards; this way, we transition from a state and reward observations to a new state with new reward observations, and return to the vanilla MDP formulation. However, this comes at a great cost. Even for Bernoulli rewards, there are $2^A$ possible reward combinations at any given state, and the augmentation increases the state space by this factor, leading to an exponentially-large state space. Even worse, for continuous rewards, the augmented state space becomes continuous, and any performance guarantees that depend on the size of the state space immediately become vacuous. Hence, algorithms that naïvely use this reduction are expected to be both computationally and statistically intractable. We refer to Appendix B.2 for further details on one such augmentation. We take a different approach and derive Bellman equations for this setting in the original state space.

Proposition 1. The optimal value of one-step reward lookahead agents satisfies
$$V^{R,*}_{H+1}(s) = 0, \;\forall s \in S, \qquad V^{R,*}_h(s) = \mathbb{E}_{R \sim R_h(s)}\bigg[\max_{a \in A}\bigg\{ R(a) + \sum_{s' \in S} P_h(s'|s, a) V^{R,*}_{h+1}(s') \bigg\}\bigg], \;\forall s \in S, h \in [H].$$
Also, given reward observations $R = \{R(a)\}_{a \in A}$ at state $s$ and step $h$, the optimal policy is
$$\pi^*_h(s, R) \in \arg\max_{a \in A}\bigg\{ R(a) + \sum_{s' \in S} P_h(s'|s, a) V^{R,*}_{h+1}(s') \bigg\}.$$

We prove Proposition 1 in Appendix B.2, where we present an equivalent environment with an extended state space in which one could apply the standard Bellman equations [Puterman, 2014] to calculate the value with reward lookahead. In contrast to the previously discussed augmentation approach, we find it more convenient to divide the augmentation into two steps: at odd steps $2h-1$, the augmented environment is in a state $(s_h, \mathbf{0})$, while at even steps $2h$, the state is $(s_h, R_h)$. Doing so creates an overlap between the values of the original and augmented environments at odd steps, simplifying the proofs. We also use this augmentation to prove a variant of the law of total variance [LTV, e.g., Azar et al., 2017] and a value-difference lemma [e.g., Efroni et al., 2019b]. We remark that calculating the exact value is not always tractable: even for $S = H = 1$ (bandit problems) and Gaussian rewards, Proposition 1 requires calculating the expectation of the maximum of Gaussian random variables, which does not admit any simple closed-form solution. On the other hand, these equations allow approximating the value using reward samples; in the following, we show that this can be used to achieve tight regret bounds when the environment is unknown.

4.1 Regret-Minimization with Reward Lookahead

We now present a tractable algorithm that achieves tight regret bounds with one-step reward lookahead. Specifically, we modify the Monotonic Value Propagation (MVP) algorithm [Zhang et al., 2021b] to perform planning using the empirical reward distributions instead of the empirical reward expectations. To compensate for transition uncertainty, we add a transition bonus that uses the variance of the optimistic next-state values (w.r.t. the empirical transition kernel), designed to be monotone in the future value. Such a construction permits using the variance of optimistic values for the bonus calculation while being able to later replace it with the variance of the optimal value (see discussion in Zhang et al. 2021b). A reward bonus is used for the value calculation, but does not affect the action choice in the current state. Intuitively, this is because we get the same amount of information for all the actions of a state, so they have the same level of uncertainty; there is no need for bonuses to encourage reward exploration at the action level. A high-level description of the algorithm is presented in Algorithm 1, while the full algorithm and its bonuses are stated in Appendix B.3.

Algorithm 1 Monotonic Value Propagation with Reward Lookahead (MVP-RL)
1: Require: $\delta \in (0, 1)$, bonuses $b^r_{k,h}(s)$, $b^p_{k,h}(s, a)$
2: for $k = 1, 2, \ldots$ do
3:   Initialize $\bar V^k_{H+1}(s) = 0$
4:   for $h = H, H-1, \ldots, 1$ do
5:     Calculate the truncated values for all $s \in S$:
       $\bar V^k_h(s) = \min\Big\{ \mathbb{E}_{R \sim \hat R^{k-1}_h(s)}\Big[\max_{a \in A}\big\{ R(a) + b^p_{k,h}(s, a) + \hat P^{k-1}_h \bar V^k_{h+1}(s, a) \big\}\Big] + b^r_{k,h}(s), \; H \Big\}$
6:   end for
7:   for $h = 1, 2, \ldots, H$ do
8:     Observe $s^k_h$ and $R^k_h(s^k_h, a)$ for all $a \in A$
9:     Play an action $a^k_h \in \arg\max_{a \in A}\big\{ R^k_h(s^k_h, a) + b^p_{k,h}(s^k_h, a) + \hat P^{k-1}_h \bar V^k_{h+1}(s^k_h, a) \big\}$
10:    Collect the reward $R^k_h(s^k_h, a^k_h)$ and transition to the next state $s^k_{h+1} \sim P_h(\cdot|s^k_h, a^k_h)$
11:  end for
12: end for

Notice that the planning requires calculating the expected maximum using the empirical distribution, whose support always contains at most $K$ elements, so both the memory and the computations are polynomial. The algorithm ensures the following guarantees:

Theorem 1. When running MVP-RL, with probability at least $1 - \delta$, uniformly for all $K \ge 1$, it holds that
$$\mathrm{Reg}_R(K) \le O\bigg( \sqrt{H^3 S A K \ln\tfrac{SAHK}{\delta}} + H^3 S^2 A \ln\tfrac{SAHK}{\delta} \bigg).$$

See proof in Appendix B.7. Remarkably, our upper bound matches the standard lower bound for episodic RL of $\Omega(\sqrt{H^3 S A K})$ [Domingues et al., 2021] up to log-factors; this lower bound is proved for known deterministic rewards, so in particular, it also holds for problems with reward lookahead. To our knowledge, the only comparable bounds in settings with reward lookahead were proven for prophet problems; as agents observe (up to) $n$ distributions in a fixed order, the prophet problem can be formulated as a deterministic chain-like MDP, with $H = n$, $S = n + 1$ and $A = 2$. Agents start at the head of the chain and can either advance without collecting a reward, or collect the observed reward and move to a terminal non-rewarding state (for more details, see Merlis et al. 2024). For this problem, Gatmiry et al. [2024] proved a regret bound of $\tilde O(n^3\sqrt{K})$ (albeit requiring a weaker form of feedback), and Agarwal et al. [2023] proved a bound of $\tilde O(n\sqrt{T})$; the latter is slightly better than ours, but heavily relies on the ability to control which distributions to observe, which is a specific instance of deterministic transitions. We are unaware of any previous results that cover general Markovian dynamics.
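To illustrate the empirical-distribution planning step of MVP-RL (line 5 of Algorithm 1), here is a minimal Python sketch. The helper name and the simplified Hoeffding-style bonuses are our own; the actual variance-based bonuses are given in Appendix B.3, so this is an illustration of the structure rather than the exact rule.

```python
import math
import numpy as np

def mvp_rl_backup(reward_samples, P_hat, V_next, n_s, n_sa, H, delta=0.05):
    """One backward planning step of MVP-RL at a single state (sketch with simplified bonuses).

    reward_samples: (n_visits, A) array, one stored reward vector per past visit of the state.
    P_hat:          (A, S) empirical transition kernel at the state.
    V_next:         (S,) optimistic value at step h+1.
    n_s, n_sa:      visit counts of the state and of each (state, action) pair.
    Returns the truncated optimistic value and the per-action planning scores.
    """
    A = P_hat.shape[0]
    # Simplified bonuses (NOT the exact variance-based bonuses of Appendix B.3).
    b_r = H * math.sqrt(A * math.log(1.0 / delta) / max(n_s, 1))
    b_p = np.array([min(H * math.sqrt(math.log(1.0 / delta) / max(n_sa[a], 1)), H)
                    for a in range(A)])
    # Bonus-shifted expected future value, one entry per action.
    future = P_hat @ V_next + b_p
    if len(reward_samples) == 0:
        return H, future
    # Empirical expectation over stored reward vectors of max_a {R(a) + future(a)}.
    exp_max = float(np.mean(np.max(reward_samples + future, axis=1)))
    return min(exp_max + b_r, H), future

# Acting (line 9): given the observed lookahead rewards R_obs at the current state,
# play argmax_a {R_obs[a] + future[a]}; the reward bonus b_r does not affect this choice.
```

The key departure from standard MVP is that the expectation of the maximum is taken over whole stored reward vectors, so correlations between the rewards of different actions at the same state are preserved instead of being collapsed into per-action means.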
4.2 Proof Concepts

When analyzing the regret of RL algorithms, a key step usually involves bounding the difference between the value of a policy in two different environments (a value-difference lemma). In particular, for a given policy $\pi^k$, many algorithms maintain a confidence interval on the value, $V^{\pi^k}_h(s) \in \big[\underline V^k_h(s), \bar V^k_h(s)\big]$, calculated based on optimistic and pessimistic MDPs that use the empirical model with bonuses/penalties [Dann et al., 2019, Zanette and Brunskill, 2019, Efroni et al., 2021]. Then, the instantaneous regret (without lookahead) is bounded using the optimistic values by
$$\bar V^k_h(s_h) - V^{\pi^k}_h(s_h) = \hat r^{k-1}_h(s_h, a_h) - r_h(s_h, a_h) + \big(\hat P^{k-1}_h - P_h\big)\bar V^k_{h+1}(s_h, a_h) + P_h\big(\bar V^k_{h+1} - V^{\pi^k}_{h+1}\big)(s_h, a_h) + \text{bonuses},$$
while the pessimistic values are used either as part of the bonuses or while bounding them. However, when trying to perform a similar decomposition with reward lookahead, we do not get the difference of expected rewards, but rather terms of the form
$$\mathbb{E}_{R \sim \hat R^{k-1}_h(s_h)}\big[R(\pi^k_h(s_h, R))\big] - \mathbb{E}_{R \sim R_h(s_h)}\big[R(\pi^k_h(s_h, R))\big]$$
(see, e.g., the last term of Lemma 4 in the appendix). As the action can be an arbitrary function of the reward realization, this term is extremely challenging to bound. For example, one could couple both distributions while trying to relate this error term to a Wasserstein distance between the empirical and real reward distributions; however, such distances exhibit much slower error rates than standard mean estimation [Fournier and Guillin, 2015]. Instead, we follow a different approach and show that uniformly for all possible expected next-state values $\hat P V \in [0, H]^A$ (as a function of the action at a given state), it holds w.h.p. that
$$\bigg| \mathbb{E}_{R \sim \hat R^{k-1}_h(s)}\Big[\max_a\big\{R(a) + \hat P V(s, a)\big\}\Big] - \mathbb{E}_{R \sim R_h(s)}\Big[\max_a\big\{R(a) + \hat P V(s, a)\big\}\Big] \bigg| \lesssim \sqrt{\frac{A \ln\tfrac{SAHK}{\delta}}{n^{k-1}_h(s) \vee 1}}. \tag{1}$$
Throughout the proof, whenever we face an expectation w.r.t. the empirical rewards, we reformulate the expression to fit the form of Equation (1) and use it as a change-of-measure tool. We remark that while this confidence interval admits an extra $A$-factor compared to standard bounds, the counts only depend on the visits to the state (and not to the state-action pair), which compensates for this factor.

The choice of MVP for the bonus is similarly motivated: unlike some other bonuses (e.g., Zanette and Brunskill 2019), MVP does not require pessimistic values, either in the bonus itself or in its analysis. In contrast to the optimistic values, the pessimistic values are not calculated via value iteration, but rather by following the policy $\pi^k$ in the pessimistic environment. As such, they cannot be easily manipulated to fit the form of Equation (1). The analysis of the transitions adapts the techniques of [Efroni et al., 2021], while requiring extra care in handling the dependence of the actions on the rewards.
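To build intuition for the uniform concentration in Equation (1), the following small numerical experiment (our own illustration; it assumes independent Bernoulli rewards with arbitrary means, which is only one instance of the general model) probes the change-of-measure error over many random shift vectors $\hat P V \in [0, H]^A$ and shows that it shrinks at roughly a $1/\sqrt{n}$ rate. Random shifts only approximate the supremum over $[0, H]^A$, so this is a heuristic check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
A, H, n_trials = 4, 10, 2000
p_true = rng.uniform(0.1, 0.9, size=A)  # assumed Bernoulli reward means (illustration only)

def expected_max(reward_matrix, shift):
    """Average over rows of max_a {R(a) + shift[a]}."""
    return np.max(reward_matrix + shift, axis=1).mean()

for n in (20, 80, 320, 1280):
    # "True" expectation approximated with a large independent sample.
    big = (rng.uniform(size=(200_000, A)) < p_true).astype(float)
    emp = (rng.uniform(size=(n, A)) < p_true).astype(float)  # empirical distribution with n visits
    worst = 0.0
    for _ in range(n_trials):  # random shifts stand in for the supremum over [0, H]^A
        shift = rng.uniform(0, H, size=A)
        worst = max(worst, abs(expected_max(emp, shift) - expected_max(big, shift)))
    print(f"n={n:5d}  max observed gap ~ {worst:.3f}   (1/sqrt(n) = {1 / np.sqrt(n):.3f})")
```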
5 Reinforcement Learning with One-Step Transition Lookahead

We now move to analyzing problems with one-step transition lookahead, where the resulting next state of playing any of the actions is revealed before deciding which action to play. For example, consider the stochastic Canadian traveler problem with resampling [Nikolova and Karger, 2008, Boutilier et al., 2018]. In this problem, an agent wants to navigate on a graph as fast as possible from a source to a target, but observes which edges at a node are available only upon reaching this node. When edge availability is stochastic and resampled every time a node is visited, this is a clear case of one-step transition lookahead, as the information on the availability of edges is given before trying to traverse them. The example in Section 3 and Appendix C.9 is one possible formulation of this problem on a chain: agents are rewarded for arriving at the end of the chain as fast as possible, but trying to use a non-existing edge results in termination. We showed that in this particular instance, the lookahead value is exponentially larger than the standard value, and any lookahead agent with low regret would greatly surpass no-lookahead agents.

As with reward lookahead, the future states for all actions can be embedded into the state, but doing so increases the size of the state space by a factor of $S^A$, again making this approach intractable (see Appendix C.2 for an example of such an extension). We once more show that this is not necessary; the transition-lookahead optimal values can be calculated using the following Bellman equations:

Proposition 2. The optimal value of one-step transition lookahead agents satisfies
$$V^{T,*}_{H+1}(s) = 0, \;\forall s \in S, \qquad V^{T,*}_h(s) = \mathbb{E}_{s' \sim P_h(s)}\Big[\max_{a \in A}\big\{ r_h(s, a) + V^{T,*}_{h+1}(s'(s, a)) \big\}\Big], \;\forall s \in S, h \in [H].$$
Also, given next-state observations $s' = \{s'(a)\}_{a \in A}$ at state $s$ and step $h$, the optimal policy is
$$\pi^*_h(s, s') \in \arg\max_{a \in A}\big\{ r_h(s, a) + V^{T,*}_{h+1}(s'(a)) \big\}.$$

The proof can be found in Appendix C.2 and again relies on augmenting the state space to incorporate the transitions; this time, we divide the episode into odd steps, whose extended state is $(s_h, s'_0)$ (for an arbitrary fixed $s'_0 \in S^A$), and even steps, with the state $(s_h, s'_{h+1})$. Beyond planning, this again allows proving a variant of the LTV and of a value-difference lemma.

One important insight is that the policy $\pi^*_h(s, s')$ admits the form of a list. Namely, consider the values $V^*_h(s, s', a) = r_h(s, a) + V^{T,*}_{h+1}(s')$ and assume some ordering of next-state-action pairs $\{(s'_i, a_i)\}_{i=1}^{SA}$ such that $V^*_h(s, s'_1, a_1) \ge \cdots \ge V^*_h(s, s'_{SA}, a_{SA})$. Then, an optimal policy would look at all realized pairs $(s'(a), a)$ and play the action with the highest location in this list. We refer the readers to Appendix C.4 for an additional discussion on list representations in transition lookahead.

Similar results could be achieved through a reduction to RL problems with stochastic action sets [Boutilier et al., 2018]. There, at every round, a subset of base actions is sampled, and only these actions are available to the agent. In particular, one could sample $A$ actions of the form $(s', a) \in S \times A$ and impose a deterministic transition to $s'$ given this extended action. However, since every original action must be sampled exactly once, this sampling procedure creates a dependence between pairs even when the next-states of different actions are independent, adding unnecessary complications. We show that when the next-state realizations are independent across actions, the expectation in Proposition 2 can be efficiently calculated (see Appendix C.4.1 for details); otherwise, it can be approximated through sampling, as we do in learning settings.
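To make Proposition 2 and the list representation concrete, here is a minimal sketch (our own illustration; the function names are not from the paper). It builds the ranked list of next-state-action pairs from $r_h(s, \cdot)$ and $V^{T,*}_{h+1}$, selects the highest-ranked realized pair given an observed next-state tuple, and evaluates the backup as an average over sampled joint next-state tuples, mirroring how the learning algorithm uses the empirical distribution.

```python
import numpy as np

def ranked_list(r_s, V_next):
    """All (value, next_state, action) triples sorted by r_h(s,a) + V_{h+1}(s'), descending."""
    A, S = len(r_s), len(V_next)
    return sorted(((r_s[a] + V_next[sp], sp, a) for a in range(A) for sp in range(S)),
                  reverse=True)

def lookahead_action(r_s, V_next, observed_next):
    """Optimal one-step transition-lookahead action: the realized pair ranked highest in the list."""
    for _, sp, a in ranked_list(r_s, V_next):
        if observed_next[a] == sp:
            return a

def backup_from_samples(r_s, V_next, next_state_samples):
    """Empirical version of Proposition 2: average of max_a {r_h(s,a) + V_{h+1}(s'(a))}
    over stored joint next-state tuples (one tuple per past visit)."""
    vals = [max(r_s[a] + V_next[tup[a]] for a in range(len(r_s)))
            for tup in next_state_samples]
    return float(np.mean(vals)) if vals else 0.0

# Example with 3 actions and 4 states; each sample gives one next state per action.
r_s, V_next = np.array([0.2, 0.0, 0.5]), np.array([1.0, 0.0, 2.0, 0.3])
samples = [(0, 2, 1), (3, 3, 0), (2, 1, 1)]
print(lookahead_action(r_s, V_next, samples[0]), backup_from_samples(r_s, V_next, samples))
```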
5.1 Regret-Minimization with Transition Lookahead

Relying on similar principles as with reward lookahead, we now present MVP-TL, an adaptation of MVP to settings with one-step transition lookahead (summarized in Algorithm 2; the full details can be found in Appendix C.3). This time, we estimate the empirical expected reward and add a standard Hoeffding-like reward bonus, while performing planning using samples from the empirical joint distribution of the next-states for all actions simultaneously. A variance-based transition bonus is added to the values, though this time, the variance also incorporates the rewards, namely
$$b^p_{k,h}(s) \propto \sqrt{\frac{\mathrm{Var}_{s' \sim \hat P^{k-1}_h(s)}\big(\bar V^k_h(s, s')\big)}{n^{k-1}_h(s) \vee 1}}, \qquad \bar V^k_h(s, s') = \max_{a \in A}\Big\{ \hat r^{k-1}_h(s, a) + b^r_{k,h}(s, a) + \bar V^k_{h+1}(s'(a)) \Big\}.$$
The motivation for this modification is the technical challenges described in Section 4.2, in the context of reward lookahead. For reward lookahead, we analyzed a value term that included both the rewards and the next-state values, and used concentration arguments to move from the empirical reward distribution to the real one. For transition lookahead, similar values are analyzed, but we require variance-based concentration to obtain tighter regret bounds [Azar et al., 2017], so this variance naturally arises. The bonus is again designed to be monotone, as in the original MVP algorithm, and does not affect the immediate action choice, only the optimistic lookahead value. As before, the planning relies on sampling the next-state observations of previous episodes, and so it is polynomial, even if the precise joint distribution is complex.

Algorithm 2 Monotonic Value Propagation with Transition Lookahead (MVP-TL)
1: Require: $\delta \in (0, 1)$, bonuses $b^r_{k,h}(s, a)$, $b^p_{k,h}(s)$
2: for $k = 1, 2, \ldots$ do
3:   Initialize $\bar V^k_{H+1}(s) = 0$
4:   for $h = H, H-1, \ldots, 1$ do
5:     Calculate the truncated values for all $s \in S$:
       $\bar V^k_h(s) = \min\Big\{ \mathbb{E}_{s' \sim \hat P^{k-1}_h(s)}\Big[\max_{a \in A}\big\{ \hat r^{k-1}_h(s, a) + b^r_{k,h}(s, a) + \bar V^k_{h+1}(s'(a)) \big\}\Big] + b^p_{k,h}(s), \; H \Big\}$
6:   end for
7:   for $h = 1, 2, \ldots, H$ do
8:     Observe $s^k_h$ and $s'^k_{h+1}(s^k_h, a)$ for all $a \in A$
9:     Play an action $a^k_h \in \arg\max_{a \in A}\big\{ \hat r^{k-1}_h(s^k_h, a) + b^r_{k,h}(s^k_h, a) + \bar V^k_{h+1}(s'^k_{h+1}(s^k_h, a)) \big\}$
10:    Collect the reward $R^k_h \sim R_h(s^k_h, a^k_h)$ and transition to the next state $s^k_{h+1} = s'^k_{h+1}(s^k_h, a^k_h)$
11:  end for
12: end for

The algorithm enjoys the following regret bounds:

Theorem 2. When running MVP-TL, with probability at least $1 - \delta$, uniformly for all $K \ge 1$, it holds that
$$\mathrm{Reg}_T(K) \le O\bigg( \Big(\sqrt{H^2 S A K} + \sqrt{H^3 S K}\Big)\sqrt{\ln\tfrac{SAHK}{\delta}} + H^3 S^4 A^3 \ln\tfrac{SAHK}{\delta} \bigg).$$

See proof in Appendix C.8.
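For concreteness, the following sketch implements the planning step of MVP-TL (line 5 of Algorithm 2) with a simplified bonus scaling; the constants and logarithmic factors of Appendix C.3 are omitted, so this is an illustration of the structure rather than the exact rule.

```python
import math
import numpy as np

def mvp_tl_backup(r_hat, b_r, V_next, next_state_samples, n_s, H):
    """One backward planning step of MVP-TL at a single state (sketch; exact bonuses omitted).

    r_hat, b_r:         (A,) empirical mean rewards and reward bonuses at the state.
    V_next:             (S,) optimistic values at step h+1.
    next_state_samples: list of tuples, one sampled next state per action for each past visit
                        (whole tuples are stored, preserving correlations across actions).
    n_s:                number of visits to the state at this step.
    """
    if not next_state_samples:
        return H
    # Per-sample optimistic lookahead value: max_a {r_hat(a) + b_r(a) + V_{h+1}(s'(a))}.
    per_sample = np.array([
        max(r_hat[a] + b_r[a] + V_next[tup[a]] for a in range(len(r_hat)))
        for tup in next_state_samples
    ])
    # Variance-based transition bonus (simplified: no constants or log factors).
    b_p = math.sqrt(per_sample.var() / max(n_s, 1))
    return min(float(per_sample.mean()) + b_p, H)
```

At interaction time (line 9 of Algorithm 2), the same per-action scores are evaluated on the actually observed next-state tuple rather than on stored samples, and the transition bonus plays no role in the action choice.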
For transition lookahead, the regret bounds we provide exhibit two rates, both corresponding to a natural adaptation of known lower bounds to transition lookahead.

1. Bandit rate $O(\sqrt{H^2 S A K})$: this is the rate due to reward stochasticity. Consider a problem where, at odd timesteps $2h-1$ and across all states, all actions have rewards of mean $1/2 - \epsilon$, except for one action of mean $1/2$. Assuming that the state distribution is uniform, each such timestep forms a hard instance of a contextual bandit problem with $S$ contexts, exhibiting a regret of $\Omega(\sqrt{SAK})$ [Auer et al., 2002, Bubeck et al., 2012]. Since there are $H/2$ odd steps and we can design each step independently, the total regret would be $\Omega(H\sqrt{SAK})$. The even steps can be used to remove the lookahead and create a uniform state distribution. To do so, we set that when taking an action at odd steps, we always transition to a fixed state $s_d$. From this state, one action $a_1$ leads uniformly to all states, while the rest of the actions lead to an absorbing non-rewarding state, rendering them strictly suboptimal. Thus, no-regret agents will only play $a_1$, regardless of the lookahead information, and the state distribution at odd timesteps will be uniform.

2. Transition learning rate $O(\sqrt{H^3 S K})$: recall that the vanilla RL lower bound designs a tree with $\Omega(S)$ leaves, to which agents need to navigate at the right timing (with $\Omega(H)$ options) and take the right action (out of $A$). While all leaves might transition agents to a rewarding state, one combination of state-action-timing has a slightly higher probability of doing so [Domingues et al., 2021]. This roughly creates a bandit problem with $SAH$ arms, constructed such that the maximal reward is $\Omega(H)$, yielding a total regret of $H\sqrt{HSAK}$. Now consider the following simple modification where, in each leaf, only one action can lead to a reward (and the rest of the actions are useless and never lead to rewards). Thus, the agent still needs to test all leaves at all timings, so there are still $SH$ arms, with a corresponding regret of $\sqrt{H^3 S K}$. Moreover, to test a leaf at a certain timing, we must navigate to it, and since the agent is going to play the single useful action at the leaf, transition lookahead does not provide any additional information.

As discussed before, transition lookahead can be formulated as an RL instance with stochastic action sets. While Boutilier et al. [2018] prove that with stochastic action sets, Q-learning asymptotically converges, they provide no learning algorithm nor regret bounds. Therefore, to our knowledge, our result is the first to achieve sublinear regret with transition lookahead.

5.2 Proof Concepts

Transition lookahead causes similar issues as reward lookahead. Hence, it is natural to apply a similar analysis approach: first, formulate the value as the expectation, w.r.t. the next-state observations, of the maximum of action-observation-dependent values; then, use uniform concentration as a change-of-measure tool between the empirical and real next-state distributions. In particular, if $V(s, s', a)$ represents the value starting from state $s$, performing $a$ and transitioning to $s'$, one can show that for all $V(s, \cdot, \cdot) \in [0, H]^{S \times A}$ (see Lemma 19),
$$\bigg| \mathbb{E}_{s' \sim \hat P^{k-1}_h(s)}\Big[\max_a V(s, s'(a), a)\Big] - \mathbb{E}_{s' \sim P_h(s)}\Big[\max_a V(s, s'(a), a)\Big] \bigg| \lesssim \sqrt{\frac{SA \ln\tfrac{1}{\delta}\, \mathrm{Var}_{s' \sim \hat P^{k-1}_h(s)}\big(\max_a V(s, s'(a), a)\big)}{n^{k-1}_h(s) \vee 1}}, \tag{2}$$
where the variance term stems from using a Bernstein-like concentration bound. However, in contrast to the reward lookahead, the $SA$-factor propagates to the dominant term of the regret, so pursuing this approach would lead to a regret bound whose dominant term is worse by a $\sqrt{SA}$ factor. To avoid this, we pinpoint the two locations where this change of measure is needed, namely the proof that $\bar V^k_h$ is optimistic and the regret decomposition, and make sure to perform this change of measure only on the single value $V^*_h(s, s', a) = r_h(s, a) + V^*_{h+1}(s')$, mitigating the need to cover all possible values and removing the additional $SA$-factor.

However, doing so leaves us with a residual term. Defining $V^*_h(s, s') = \max_{a \in A}\{V^*_h(s, s'(a), a)\}$ and assuming a similar optimistic value $\bar V^k_h(s, s')$, this term is of the form
$$\mathbb{E}_{s' \sim \hat P^{k-1}_h(s)}\Big[\bar V^k_h(s, s') - V^*_h(s, s')\Big] - \mathbb{E}_{s' \sim P_h(s)}\Big[\bar V^k_h(s, s') - V^*_h(s, s')\Big].$$
While similar terms have been analyzed before [e.g., Zanette and Brunskill, 2019, Efroni et al., 2021], the analysis leads to a constant regret term that depends on the support of the distribution in question; in our case, this is the distribution over all possible next-state tuples, of cardinality $S^A$. Therefore, following the same derivation would lead to an exponential additive regret term. We overcome it by utilizing the fact that both the optimistic policy and the optimal one decide which action to take according to a list of next-state-action pairs $(s', a)$.
In other words, instead of looking at the next-state s (with SA possible values) to determine a value, we look at the highest-ranked realized pair (s , a) in the list that corresponds to the policy that induces the value (with SA possible rankings). Since we have two values, we need to calculate the probability of being at a certain list location for both πk and π , but the cardinality of this space is (SA)2: polynomial and not exponential. 6 Conclusions and Future Work In this work, we presented an RL setting in which immediate rewards or transitions are observed before actions are chosen. We showed how to design provably and computationally efficient algorithms for this setting that achieve tight regret bounds versus a strong baseline that also uses lookahead information. Our algorithms rely on estimating the distribution of the reward or transition observations, a concept that might be utilized in other settings. In particular, we believe that our techniques for transition lookahead could be extended to RL problems with stochastic action sets [Boutilier et al., 2018], but leave this for future work. One natural extension to our work would be to consider multi-step lookahead information observing the transition/rewards L steps in advance. We conjecture that from a statistical point of view, a similar algorithmic approach that samples from the empirical observation distribution would be efficient. However, it is not clear how to perform efficient planning with such feedback. Another possible direction would be to derive model-free algorithms [Jin et al., 2018], with the aim to improve the computation efficiency of the solutions; our model-based algorithms require at most O(KS2AH) computations per episode due to the planning stage, while model-free algorithms might potentially allow just O(AH) computations per episode. On the practical side, previous works presented RL algorithms that utilize/estimate a world model with multi-step lookahead to perform planning and learning [Schrittwieser et al., 2020, Chung et al., 2024], aiming to achieve the optimal no-lookahead value. For some of these approaches, it is quite natural to replace the simulated world behavior with lookahead information on the real future realization. We leave this adaptation and evaluation to future studies. Finally, the notion of lookahead could be studied in various other decision-making settings (e.g., linear MDPs Jin et al. 2020) and can also be generalized to situations where lookahead information can be queried under some budget constraints [Efroni et al., 2021] or when agents only observe noisy lookahead predictions; we leave these problems for future research. Acknowledgements We thank Alon Cohen and Austin Stromme for the helpful discussions. This project has received funding from the European Union s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 101034255. Arpit Agarwal, Rohan Ghuge, and Viswanath Nagarajan. Semi-bandit learning for monotone stochastic optimization. ar Xiv preprint ar Xiv:2312.15427, 2023. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002. Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263 272. PMLR, 2017. Dimitri Bertsekas. A course in reinforcement learning. Athena Scientific, 2023. 
André Biedenkapp, Raghu Rajan, Frank Hutter, and Marius Lindauer. Temporl: Learning when to act. In International Conference on Machine Learning, pages 914 924. PMLR, 2021. Craig Boutilier, Alon Cohen, Avinatan Hassidim, Yishay Mansour, Ofer Meshi, Martin Mladenov, and Dale Schuurmans. Planning and learning with stochastic action sets. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4674 4682, 2018. Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012. Eduardo F Camacho, Carlos Bordons, Eduardo F Camacho, and Carlos Bordons. Model predictive control. Springer, 2007. Stephen Chung, Ivan Anokhin, and David Krueger. Thinker: learning to plan and act. Advances in Neural Information Processing Systems, 36, 2024. José Correa, Patricio Foncea, Ruben Hoeksma, Tim Oosterwijk, and Tjark Vredeveld. Posted price mechanisms for a random stream of customers. In Proceedings of the 2017 ACM Conference on Economics and Computation, pages 169 186, 2017. Jose Correa, Patricio Foncea, Ruben Hoeksma, Tim Oosterwijk, and Tjark Vredeveld. Recent developments in prophet inequalities. ACM SIGecom Exchanges, 17(1):61 70, 2019. Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507 1516, 2019. Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578 598. PMLR, 2021. Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. How to combine tree-search methods in reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3494 3501, 2019a. Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In Advances in Neural Information Processing Systems, pages 12224 12234, 2019b. Yonathan Efroni, Mohammad Ghavamzadeh, and Shie Mannor. Online planning with lookahead policies. Advances in Neural Information Processing Systems, 33:14024 14033, 2020. Yonathan Efroni, Nadav Merlis, Aadirupa Saha, and Shie Mannor. Confidence-budget matching for sequential budgeted learning. In International Conference on Machine Learning, pages 2937 2947. PMLR, 2021. Ibrahim El Shar and Daniel Jiang. Lookahead-bounded q-learning. In International Conference on Machine Learning, pages 8665 8675. PMLR, 2020. Nicolas Fournier and Arnaud Guillin. On the rate of convergence in wasserstein distance of the empirical measure. Probability theory and related fields, 162(3):707 738, 2015. Khashayar Gatmiry, Thomas Kesselheim, Sahil Singla, and Yifan Wang. Bandit algorithms for prophet inequality and pandora s box. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 462 500. SIAM, 2024. Yunhan Huang, Veeraruna Kavitha, and Quanyan Zhu. Continuous-time markov decision processes with controlled observations. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 32 39. IEEE, 2019. Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563 1600, 2010. Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. 
Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018. Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on learning theory, pages 2137 2143. PMLR, 2020. Yingying Li, Xin Chen, and Na Li. Online optimal control with linear dynamics and predictions: Algorithms and regret analysis. Advances in Neural Information Processing Systems, 32, 2019. Yiheng Lin, Yang Hu, Guanya Shi, Haoyuan Sun, Guannan Qu, and Adam Wierman. Perturbationbased regret analysis of predictive control in linear time varying systems. Advances in Neural Information Processing Systems, 34:5174 5185, 2021. Yiheng Lin, Yang Hu, Guannan Qu, Tongxin Li, and Adam Wierman. Bounded-regret mpc via perturbation analysis: Prediction error, constraints, and nonlinearity. Advances in Neural Information Processing Systems, 35:36174 36187, 2022. Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. In Conference on learning theory, 2009. Nadav Merlis, Dorian Baudry, and Vianney Perchet. The value of reward lookahead in reinforcement learning. ar Xiv preprint ar Xiv:2403.11637, 2024. Thomas M Moerland, Anna Deichler, Simone Baldi, Joost Broekens, and Catholijn M Jonker. Think neither too fast nor too slow: The computational trade-off between planning and reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS), Nancy, France, pages 16 20, 2020. Evdokia Nikolova and David R Karger. Route planning under uncertainty: The canadian traveller problem. In AAAI, pages 969 974, 2008. Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Aviv Rosenberg, Assaf Hallak, Shie Mannor, Gal Chechik, and Gal Dalal. Planning and learning with adaptive lookahead. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9606 9613, 2023. Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(7839):604 609, 2020. Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. In Advances in Neural Information Processing Systems, pages 1153 1162, 2019. Aviv Tamar, Garrett Thomas, Tianhao Zhang, Sergey Levine, and Pieter Abbeel. Learning from the hindsight plan episodic mpc improvement. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 336 343. IEEE, 2017. Chenkai Yu, Guanya Shi, Soon-Jo Chung, Yisong Yue, and Adam Wierman. The power of predictions in online control. Advances in Neural Information Processing Systems, 33:1994 2004, 2020. Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304 7312. PMLR, 2019. Runyu Zhang, Yingying Li, and Na Li. On the regret analysis of online lqr control with predictions. In 2021 American Control Conference (ACC), pages 697 703. IEEE, 2021a. Zihan Zhang, Xiangyang Ji, and Simon Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. In Conference on Learning Theory, pages 4528 4531. PMLR, 2021b. 
Zihan Zhang, Yuxin Chen, Jason D Lee, and Simon S Du. Settling the sample complexity of online reinforcement learning. arXiv preprint arXiv:2307.13586, 2023.

Table of Contents

A Structure of the Appendix
B Proofs for Reward Lookahead
  B.1 Data Generation Process
  B.2 Extended MDP for Reward Lookahead
  B.3 Full Algorithm Description for Reward Lookahead
  B.4 The First Good Event: Concentration
  B.5 Optimism of the Upper Confidence Value Functions
  B.6 The Second Good Event: Martingale Concentration
  B.7 Regret Analysis
C Proofs for Transition Lookahead
  C.1 Data Generation Process
  C.2 Extended MDP for Transition Lookahead
  C.3 Full Algorithm Description for Transition Lookahead
  C.4 Additional Notations and List Representation
  C.5 The First Good Event: Concentration
  C.6 Optimism of the Upper Confidence Value Functions
  C.7 The Second Good Event: Martingale Concentration
  C.8 Regret Analysis
  C.9 Example: Value Gain due to Transition Lookahead
D Auxiliary Lemmas
  D.1 Concentration Results
  D.2 Count-Related Lemmas
  D.3 Analysis of Variance Terms
E Existing Results

A Structure of the Appendix

Both the reward and transition lookahead appendices share the following structure. First, we describe our assumption on the data generation process and analyze general properties of reward and transition lookahead. This is done by looking at an extended MDP that incorporates the lookahead information into the state. Then, we present the full algorithm and describe the relevant probabilistic events that ensure the concentration of all the empirical quantities. For transition lookahead, we require some additional notions for the event definitions (including the list representation of values and policies), which are explained in a separate subsection. Given the concentration-related good event, we can prove that the planning procedure of the algorithm is optimistic, which we do in the subsequent subsection. Then, we define an additional good event that allows adding and removing conditional expectations in a way that will be needed for the proof. At this point, we have provided (almost) all the results required for the regret analysis, and the proofs of the main theorems are stated. These proofs also require some additional analysis of the bonuses (and especially of the variance terms), which is located at the end of the regret analysis. For transition lookahead, the appendix includes one more part that further analyzes the example presented in Section 3. At the end of the appendix, we state and prove several lemmas that will be used throughout our analysis, while also stating several existing results that will be of use.
B Proofs for Reward Lookahead B.1 Data Generation Process To simplify the proofs, we assume the following tabular data-generation process: Before the game starts, a set of K samples from the transition probabilities and rewards is generated for all (s, a, h). Once a state s at step h is visited for the ith time, the ith sample from the reward distribution Rh(s) is the reward realization for all action a A. When a state-action pair is visited for the ith time, the ith sample from the transition kernel Ph( |s, a) determines the next-state realization. In particular, it implies that the reward samples from the first i visits to a state are i.i.d., and the same for the next-states samples and state-action visitations. Throughout this appendix, we use the notation Rk h = Rk h(sk h, a a A to denote the reward observation at episode k and timestep h for all the actions. For the proof, we define the following three filtrations. Let Fk,h = σ s1 t, a1 t, R1 t t [H], . . . , n sk 1 t , ak 1 t , Rk 1 t o t [H], n sk t , ak t , Rk t o t [h], sk h+1 F R k,h = σ s1 t, a1 t, R1 t t [H], . . . , n sk 1 t , ak 1 t , Rk 1 t o t [H], n sk t , ak t , Rk t o the filtrations that contains all information until episode k and step h, as well as the state at timestep h + 1, or all information of time h + 1, respectively. We make this distinction so that Fk,h 1 contains only sk h, while F R k,h 1 also contains ak h. We also define Fk = σ s1 t, a1 t, R1 t t [H], . . . , n sk t , ak t , Rk t o t [H], sk+1 1 which contains all information up to the end of the kth episode, as well as the initial state at episode k + 1. B.2 Extended MDP for Reward Lookahead In this appendix, we present an alternative formulation of the one-step reward lookahead that falls under the vanilla (no-lookahead) model and would be helpful for the analysis. Throughout the section, we study the relations between MDPs with and without reward lookahead, and between different MDPs with lookahead. Therefore, for clarity, we state the concerning MDP in the value, e.g. V R,π(s|M). Specifically in this subsection, we distinguish between values without lookahead (denoted V π) and values with lookahead (denoted V R,π). In the following subsections, unless stated otherwise, we will only consider lookahead values; for brevity, and with some abuse of notations, we will then omit the R in the value notation. For any MDP M = (S, A, H, P, R), define an equivalent extended MDP MR of horizon 2H that separates the state transition and reward generation as follows: 1. Assume w.l.o.g. that M starts at some initial state s1. The extended environment starts at a state s1 0, where 0 RA is the zeros vector. 2. For any h [H], at timestep 2h 1, the environment MR transitions from state sh 0 to sh R, where R Rh(s) is a vector containing the rewards for all actions a A. This transition occurs regardless of the action that was played. At timestep 2h, given an action ah the environment transitions from sh R to sh+1 0, where sh+1 Ph( |sh, ah). 3. The reward at a state s R when playing an action a is R(a), namely, the reward is deterministic and only obtained on even timesteps. We emphasize that throughout the section, we assume that M and MR are coupled; that is, assume that under a policy π in M, the agent visits a state sh, observes Rh, plays an action ah and transitions to sh+1. Then, in MR, the agent starts from sh 0, transitions to sh R (regardless of the action it played), takes the action ah and finally transitions to sh+1 0. 
Since the reward is embedded into the state, any state-dependent policy in MR is a one-step reward lookahead policy in the original MDP. Moreover, the policy at the odd steps of M does not affect the value, and assuming that the policy at the even steps in MR is the same as the policy in M, we trivially get the following relation between the values V π 2h(s, R|MR) = E t=h Rt(st, at)|sh = s, Rh(s, ) = R, π V R,π h (s, R|M), V π 2h 1(s, 0|MR) = E t=h Rt(st, at)|sh = s, π = V R,π h (s|M). (3) While MR has a continuous state space, which generally makes algorithm design impractical, this representation permits applying classic results on MDPs to environments with one-step lookahead. As a remark, rewards could be directly embedded into the state without separating the state and reward updates. However, this creates unnecessary complications when analyzing the relations between similar environments. This is because we are mainly interested in the value given the state in expectation over the realized rewards. In particular, value-difference are analyzed assuming a shared initial state, but in our case, we do not want to assume the same reward realization, but rather also account for the distance between reward distributions, which the step separation enables. For similar reasons, this representation also simplifies the proof of the law of total variance [Azar et al., 2017]. Proposition 1. The optimal value of one-step reward lookahead agents satisfies V R, H+1(s) = 0, s S, V R, h (s) = ER Rh(s) Rh(s, a) + X s S Ph(s |s, a)V R, h+1(s ) , s S, h [H]. Also, given reward observations R = {R(a)}a A at state s and step h, the optimal policy is π h(s, R) arg max a A s S Ph(s |s, a)V R, h+1(s ) Proof. We prove the result in the extended MDP MR and remind the reader that in this formulation, the policy only uses state information, as in the standard RL formulation. In particular, it implies that there exists a Markovian optimal policy that uniformly maximizes the value (in the extended state space), and the optimal value is given through the dynamic-programming equations [Puterman, 2014] V 2H+1(s, R|MR) = 0, s S, R RA, V 2h(s, R|MR) = max a s S Ph(s |s, a)V 2h+1(s , 0|MR) , h [H], s S, R RA, V 2h 1(s, 0|MR) = ERh(s) V 2h(s, R|MR) , h [H], s S. (4) By the equivalence between M and MR for all policies, this is also the optimal value in M. Specifically, combining both recursion equations and substituting the relation between the original and extended values of Equation (3), we get the desired value recursion for any h [H] and s S: V R, h (s|M) = V 2h 1(s, 0|MR) = ERh(s) V 2h(s, R|MR) s S Ph(s |s, a)V 2h+1(s , 0|MR) s S Ph(s |s, a)V R, h+1(s|M) Similarly, for any h [H], s S and R RA, the optimal policy at the even stages of the extended MDP is π 2h(s, R) arg max a A s S Ph(s |s, a)V 2h+1(s , 0|MR) alongside arbitrary actions at odd steps. Playing this policy in the original MDP will lead to an optimal one-step reward lookahead policy, as it achieves the optimal value of the original MDP. This policy directly translates to the optimal policy in the statement, by the equivalence between the original and extended MDPs and the relation V 2h+1(s , 0|MR) = V R, h+1(s |M). Remark 1. As in Equation (4), one could also write the dynamic programming equations for any policy π ΠR, namely V π 2h(s, R|MR) = R(πh(s, R)) + X s S Ph(s |s, πh(s, R))V π 2h+1(s , 0|MR), h [H], s S, R RA, V π 2h 1(s, 0|MR) = ERh(s) V π 2h(s, R|MR) , h [H], s S. 
In particular, following the notation of Equation (3), one can also write V R,π h (s, R|M) = R(πh(s, R)) + X s S Ph(s |s, πh(s, R))V R,π h+1(s |M), and, V R,π h (s|M) = ERh(s) h V R,π h (s, R|M) i R(πh(s, R)) + X s S Ph(s |s, πh(s, R))V R,π h+1(s |M)) We will use this notation in some of the proofs. Another useful application of the extended MDP is a variation of the law of total variance (LTV), which will be useful in our analysis Lemma 3. For any deterministic one-step reward lookahead policy π ΠR, it holds that h=1 Var Ph( |sh,ah)(V R,π h+1(sh+1))|π, s1 h=1 Rh(sh, ah) V R,π 1 (s1) Proof. We apply the law of total variance (Lemma 27) in the extended MDP; there, the rewards are deterministic and equal to either 0 (at odd steps) or Rh(sh, ah) (at even steps), so the total expected rewards are PH h=1 Rh(sh, ah). h=1 Rh(sh, ah) V π 1 (s1, 0|MR) h=1 Var(V π 2h(sh, Rh(sh)|MR)|(sh, 0)) | {z } Odd steps h=1 Var(V π 2h+1(sh+1, 0|MR)|(sh, Rh(sh))) | {z } Even steps h=1 Var(V π 2h+1(sh+1, 0|MR)|(sh, Rh(sh)))|π, s1 h=1 Var Ph( |sh,ah)(V π 2h+1(sh+1, 0|MR))|π, s1 h=1 Var Ph( |sh,ah)(V R,π h+1(sh+1|M))|π, s1 Noting that V π 1 (s1, 0|MR) = V R,π 1 (s1|M) concludes the proof. Finally, though not needed in our analysis, we use the extended MDP to prove the following valuedifference lemma, which could be of further use in follow-up works. While we prove decomposition just using the next-step values, one could recursively apply the formula until the end of the episode to immediately get another formula that does not depend on the next value. Lemma 4 (Value-Difference Lemma with Reward Lookahead). Let M1 = (S, A, H, P 1, R1) and M2 = (S, A, H, P 2, R2) be two environments. For any deterministic one-step reward lookahead policy π ΠR, any h [H] and s S, it holds that V R,π h (s|M1) V R,π h (s|M2) = EM1 h V R,π h+1(sh+1|M1) V R,π h+1(sh+1|M2)|sh = s i P 1 h(s |sh, πh(sh, Rh)) P 2 h(s |sh, πh(sh, Rh)) V R,π h+1(s |M2)|sh = s + EM1 h ER1 h(s) h V R,π h (sh, R|M2) i ER2 h(s) h V R,π h (sh, R|M2) i |sh = s i , where V R,π h (s, R|M) is the value at a state given the reward realization, defined in Equation (3) and given in Remark 1. Proof. We again work with the extended MDPs MR 1 , MR 2 . Since under the extension, both the environments and the policy are Markovian, all values obey the following Bellman equations: V π 2h(s, R|MR) = R(πh(s, R)) + X s S Ph(s |s, π(s, R))V π 2h+1(s , 0|MR), h [H], s S, R RA V π 2h 1(s, 0|MR) = ERh(s) V π 2h(s, R|MR) , h [H], s S. Using the relation between the value of the original and extended MDP (eq. (3)) and the Bellman equations of the extended MDP, for any h [H], we have V R,π h (s|M1) V R,π h (s|M2) = V π 2h 1(s, 0|MR 1 ) V π 2h 1(s, 0|MR 2 ) = ER1 h(s) V π 2h(s, R|MR 1 ) ER2 h(s) V π 2h(s, R|MR 2 ) = ER1 h(s) V π 2h(s, R|MR 1 ) V π 2h(s, R|MR 2 ) + ER1 h(s) V π 2h(s, R|MR 2 ) ER2 h(s) V π 2h(s, R|MR 2 ) = ER1 h(s) V π 2h(s, R|MR 1 ) V π 2h(s, R|MR 2 ) + ER1 h(s) h V R,π h (s, R|M2) i ER2 h(s) h V R,π h (s, R|M2) i = EM1 V π 2h(sh, Rh|MR 1 ) V π 2h(sh, Rh|MR 2 )|sh = s + ER1 h(s) h V R,π h (s, R|M2) i ER2 h(s) h V R,π h (s, R|M2) i . (5) We now focus on the first term. 
Denoting ah = πh(sh, Rh) the action taken by the agent at environment M1, We have V π 2h(sh, Rh|MR 1 ) V π 2h(sh, Rh|MR 2 ) s S P 1 h(s |sh, ah)V π 2h+1(s , 0|MR 1 ) s S P 2 h(s |sh, ah)V π 2h+1(s , 0|MR 2 ) s S P 1 h(s |sh, ah)V R,π h+1(s |M1) X s S P 2 h(s |sh, ah)V R,π h+1(s |M2) s S P 1 h(s |sh, ah) V R,π h+1(s |M1) V R,π h+1(s |M2) P 1 h(s |sh, ah) P 2 h(s |sh, ah) V R,π h+1(s |M2) = EM1 h V R,π h+1(sh+1|M1) V R,π h+1(sh+1|M2)|sh, ah i P 1 h(s |sh, ah) P 2 h(s |sh, ah) V R,π h+1(s |M2). Substituting this back into Equation (5), we have V π h (s|M1) V π h (s|M2) = EM1 h EM1 h V R,π h+1(sh+1|M1) V R,π h+1(sh+1|M2)|sh, ah i |sh = s i P 1 h(s |sh, ah) P 2 h(s |sh, ah) V R,π h+1(s |M2)|sh = s + ER1 h(s) h V R,π h (s, R|M2) i ER2 h(s) h V R,π h (s, R|M2) i = EM1 h V R,π h+1(sh+1|M1) V R,π h+1(sh+1|M2)|sh = s i P 1 h(s |sh, πh(sh, Rh)) P 2 h(s |sh, πh(sh, Rh)) V R,π h+1(s |M2)|sh = s + EM1 h ER1 h(s) h V R,π h (sh, R|M2) i ER2 h(s) h V R,π h (sh, R|M2) i |sh = s i . B.3 Full Algorithm Description for Reward Lookahead Algorithm 3 Monotonic Value Propagation with Reward Lookahead (MVP-RL) 1: Require: δ (0, 1), bonuses br k,h(s), bp k,h(s, a) 2: for k = 1, 2, ... do 3: Initialize V k H+1(s) = 0 4: for h = H, H 1, .., 1 do 5: for s S do 6: if nk 1 h (s) = 0 then 7: V k h (s) = H 8: else 9: Calculate the truncated values V k h (s) = min 1 nk 1 h (s) nk 1 h (s) X t=1 max a A n R kt h(s) h (s, a) + bp k,h(s, a) + ˆP k 1 h V k h+1(s, a) o + br k,h(s), H 10: end if 11: For any vector R RA, define the policy πk πk h(s, R) arg max a A n R(a) + bp k,h(s, a) + ˆP k 1 h V k h+1(s, a) o 12: end for 13: end for 14: for h = 1, 2, . . . H do 15: Observe sk h and Rk h = Rk h(sk h, a) a A 16: Play an action ak h = πk h(sk h, Rk h) 17: Collect the reward Rk h(sk h, ak h) and transition to the next state sk h+1 Ph( |sk h, ak h) 18: end for 19: Update the empirical estimators and counts for all visited state-actions 20: end for We use a variant of the MVP algorithm [Zhang et al., 2021b] while adapting their proof and the one from [Efroni et al., 2021]. The algorithm is described in Algorithm 3 and uses the following bonuses: br k,h(s) = 3 ALk δ 2(nk 1 h (s) 1) , bp k,h(s, a) = min v u u t Var ˆ P k 1 h ( |s,a)( V k h+1)Lk δ nk 1 h (s, a) 1 + 400 9 HLk δ nk 1 h (s, a) 1 , H where Lk δ = ln 144S2AH2k3(k+1) δ , and for brevity, we shorten Var ˆ P k 1 h ( |s,a)( V k h+1(s )) to Var ˆ P k 1 h ( |s,a)( V k h+1) (omitting the state from the value). For the optimistic value iteration, we use the notation kt h(s) to represent the tth episode where the state s was visited at the hth timestep. Thus, line 9 of Algorithm 3 is the expectation w.r.t. the empirical reward distribution ˆRk 1 h (s) (when defining its realization to be zero when nk 1 h (s) = 0). Since the bonuses are larger than H when nk 1 h (s) = 0, one could write the update in more concisely as V k h (s) = min ER ˆ Rk 1 h (s) n R(a) + bp k,h(s, a) + ˆP k 1 h V k h+1(s, a) o + br k,h(s), H . We will often use this representation in our analysis. B.4 The First Good Event Concentration We now define the first good event, which ensures that all empirical quantities are well-concentrated. For the transitions, we require each element to concentrate well, as well as both the inner product and the variance w.r.t. the optimal value function. For the reward, we make sure that the maximum of the rewards to concentrate well (with any possible bias, that will later correspond with the next-state values). 
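Before formalizing the good events, the following minimal sketch illustrates the planning step of Algorithm 3 above (lines 9 and 11): the backup is an empirical expectation over stored reward observations rather than over an estimated mean reward. The array layout and the fact that the bonuses $b^r, b^p$ are passed as inputs are implementation assumptions; their exact definitions are the ones given in Appendix B.3.

```python
import numpy as np

def mvp_rl_planning(reward_obs, counts_s, P_hat, b_r, b_p, H, S):
    """Sketch of the optimistic backward induction of Algorithm 3 (MVP-RL), line 9.

    reward_obs[h][s]: list of past reward-vector observations (length-A arrays) at (s, h);
    P_hat[h, s, a]: empirical next-state distribution; b_r[h, s], b_p[h, s, a]: bonuses
    (their exact form is given above; here they are simply inputs). Returns the optimistic
    values V[h, s] and the greedy lookahead policy of line 11 as a closure."""
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            if counts_s[h, s] == 0:
                V[h, s] = H
                continue
            # Q_next[a] = b^p_{k,h}(s, a) + \hat P^{k-1}_h V_{h+1}(s, a), independent of the reward realization.
            Q_next = b_p[h, s] + P_hat[h, s] @ V[h + 1]
            # Empirical expectation, over stored reward observations, of max_a {R(a) + Q_next[a]}.
            emp = np.mean([np.max(R + Q_next) for R in reward_obs[h][s]])
            V[h, s] = min(emp + b_r[h, s], H)

    def policy(h, s, R):
        # Line 11: given the observed reward vector R at (s, h), act greedily w.r.t. the optimistic continuation.
        return int(np.argmax(R + b_p[h, s] + P_hat[h, s] @ V[h + 1]))

    return V, policy
```

The only difference from a standard optimistic backup is that the expectation over the reward vector is taken with respect to the empirical distribution of past observations, which is what lets the maximum over actions sit inside the expectation.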
Formally, for any fixed vector u RA, denote mh(s, u) = ER Rh(s) h max a {Rh(a) + u(a)} i , ˆmk h(s, u) = ER ˆ Rk h(s) h max a {Rh(a) + u(a)} i with the convention that ˆmk h(s, u) = maxa u(a) if nk h(s) = 0. We define the following good events: s, s , a, h : |Ph(s |s, a) ˆP k 1 h (s |s, a)| 2P(s |s, a)Lk δ nk 1 h (s, a) 1 + Lk δ nk 1 h (s, a) 1 s, a, h : ˆP k 1 h Ph V h+1(s, a) 2Var Ph( |s,a)(V h+1)Lk δ nk 1 h (s, a) 1 + HLk δ nk 1 h (s, a) 1 s, a, h : q Var Ph( |s,a)(V h+1) q Var ˆ P k 1 h ( |s,a)(V h+1) 4H Lk δ nk 1 h (s, a) 1 s, h, u [0, 2H]A : mh(s, u) ˆmk 1 h (s, u) 3 ALk δ 2(nk 1 h (s) 1) where we again use Lk δ = ln 144S2AH2k3(k+1) δ . Then, we define the first good event as k 1 Er(k) \ k 1 Ep(k) \ k 1 Epv1(k) \ k 1 Epv2(k), for which, the following holds: Lemma 5 (The First Good Event). The good event G1 holds w.p. Pr(G1) 1 δ/2. Proof. The proof of the first three events uses standard concentration arguments (see, e.g., Efroni et al. 2021) and is stated for completeness. For any fixed k 1, s, a, h and number of visits n [k], we utilize Lemma 16 w.r.t. the transition kernel Ph( |s, a), the value V h+1 [0, H] and probability δ = δ 8SAHk2(k+1); notice that by the assumption that samples are generated i.i.d. before the game starts, given the number of visits, all samples are i.i.d., so standard concentration could be applied. By taking the union bound over all n [k] and slightly increasing the constants to ensure that n = 0 trivially holds, we get that the events also hold for any number of visit nk 1 h (s, a) {0 . . . , k}, and taking another union bound over all k 1, s, a, h ensures that each of the events k 1Ep(k), k 1Epv1(k) and k 1Epv2(k) holds w.p. at least 1 δ 8 We now focus on bounding the probability of the event k Er(k). For any fixed k, h and s, observe that the event trivially holds if nk h = 0, then the event trivially holds, since for all u [0, 2H]A, mh(s, u) ˆmk 1 h (s, u) = ER Rh(s) h max a {Rh(s, a) + u(a)} i max a {u(a)} ( ) 1 3 where ( ) uses the boundedness of the rewards in [0, 1]. Next, recall that for any fixed nk 1 h = n [k], the rewards samples at state s and step h are i.i.d. vectors on [0, 1]A. Therefore, by Lemma 18, nk 1 h (s) = n, u [0, 2H]A : mh(s, u) ˆmk 1 h (s, u) > 3 ALk δ 2(nk 1 h (s) 1) δ 8SAHk2(k + 1). Taking a union bound on all possible values of n [k], s and h, we get Pr{Er(k)} 1 SAk δ 8SAHk2(k + 1) 1 δ 8k(k + 1). By summing over all k 1, the event k Er(k) holds with a probability of at least 1 δ/8. Finally, taking the union bound with the other three events leads to the desired result of Pr(G1) 1 δ/2. B.5 Optimism of the Upper Confidence Value Functions In this subsection, we prove that under the good event G1, the values V k that MVP-RL produces are optimistic. Lemma 6 (Optimism). Under the first good event G1, for all k [K], h [H] and s S, it holds that V h (s) V k h (s). Proof. The proof follows by backward induction on H; see that the claim trivially holds for h = H+1, where both values are defined to be zero. Now assume by induction that for some k [K] and h [H], the desired inequalities hold at timestep h + 1 for all s S; we will show that this implies that they also hold at timestep h. At this point, we also assume w.l.o.g. that V k h (s) < H, and in particular, the value is not truncated; otherwise, by the boundedness of the rewards, V h (s) H = V k h (s). For similar reasons, we assume w.l.o.g. that bp k,h(s, a) < H, so that it is also not truncated. 
By the optimism of the value at step h + 1 due to the induction hypothesis and the monotonicity of the bonus (Lemma 23), under the good event, we have for all s S and a A that ˆP k 1 h V k h+1(s, a) + bp k,h(s, a) ˆP k 1 h V k h+1(s, a) + max v u u t Var ˆ P k 1 h ( |s,a)( V k h+1)Lk δ nk 1 h (s, a) 1 , 400 9 HLk δ nk 1 h (s, a) 1 ˆP k 1 h V h+1(s, a) + max v u u t Var ˆ P k 1 h ( |s,a)(V h+1)Lk δ nk 1 h (s, a) 1 , 400 9 HLk δ nk 1 h (s, a) 1 ˆP k 1 h V h+1(s, a) + 10 v u u t Var ˆ P k 1 h ( |s,a)(V h+1)Lk δ nk 1 h (s, a) 1 + 200 9 HLk δ nk 1 h (s, a) 1 ˆP k 1 h V h+1(s, a) + 10 Var Ph( |s,a)(V h+1)Lk δ nk 1 h (s, a) 1 + 8HLk δ nk 1 h (s, a) 1 (Under Epv2(k)) Ph V h+1(s, a). (Under Epv1(k)) Thus, under the good event and the induction hypothesis, we have that V k h (s) = ER ˆ Rh(s) n R(a) + bp k,h(s, a) + ˆP k 1 h V k h+1(s, a) o + br k,h(s) max a A R(a) + Ph V h+1(s, a) + br k,h(s). In particular, using Proposition 1, we get V k h (s) V h (s) ER ˆ Rh(s) max a A R(a) + Ph V h+1(s, a) + br k,h(s) max a A R(a) + Ph V h+1(s, a) where the last inequality holds under the event Er(k) with u(a) = Ph V h+1(s, a) [0, H]A. B.6 The Second Good Event Martingale Concentration In this subsection, we present four good events that will allow us to replace the expectation over the randomizations inside each episode with their realization. Define the following bonus-like term that will later appear in the proof due to value concentration: bpv1 k,h(s, a) = min 2Var Ph( |s,a)(V h+1)Lk δ nk 1 h (s, a) 1 + 4H2SLk δ nk 1 h (s, a) 1 , H Y k 1,h := V k h+1(sk h+1) V πk h+1(sk h+1), Y k 2,h = Var Ph( |st,h,at,h)(V πk h+1), Y k 3,h = bp k,h(sk h, ak h) + bpv1 k,h(sk h, ak h). The second good event is the intersection of the events G2 = Ediff1 Ediff2 EVar Ebp defined as follows. h [H], K 1 : k=1 E[Y k 1,h|Fk,h 1] 1 + 1 k=1 Y k 1,h + 18H2 ln 8HK(K + 1) h [H], K 1 : k=1 E[Y k 1,h|F R k,h 1] 1 + 1 k=1 Y k 1,h + 18H2 ln 8HK(K + 1) h=1 Y k 2,h 2 h=1 E[Y k 2,h|Fk 1] + 4H3 ln 8HK(K + 1) h [H], K 1 : k=1 E[Y k 3,h|Fk,h 1] 2 k=1 Y k 3,h + 50H2 ln 8HK(K + 1) We define the good event G = G1 G2. Lemma 7. The good event G holds with a probability of at least 1 δ. Proof. The proof follows similarly to Lemmas 15 and 21 of [Efroni et al., 2021]. First, define the random process Wk = 1 n V k h (s) V πk h (s) [0, H], h [H], s S o and define Y k 1,h = Wk Y k 1,h, which is bounded in [0, H]. Also observe that Wk is Fk 1 measurable, since both values and policies are calculated based on data up to the episode k 1, and in particular, it is Fk,h 1 measurable and Y k 1,h is Fk,h measurable. thus, by Lemma 25, for any k [K] and h [H], we have w.p. at least 1 δ 8HK(K+1) that k=1 E[ Y k 1,h|Fk,h 1] 1 + 1 k=1 Y k 1,h + 18H2 ln 8HK(K + 1) Since Wk is Fk,h 1 measurable, we can write the event as k=1 Wk E[Y k 1,h|Fk,h 1] 1 + 1 k=1 Wk Y k 1,h + 18H2 ln 8HK(K + 1) and taking the union bound over all h [H] and K 1, we get w.p. at least 1 δ 8 that the event h [H], K 1 : k=1 Wk E[Y k 1,h|Fk,h 1] 1 + 1 k=1 Wk Y k 1,h + 18H2 ln 8HK(K + 1) Importantly, by optimism (Lemma 6), under G1, it holds that Wk = 1 for all k 1, so we immediately get that G1 Ediff1 = G1 Ediff1. Following the exact same proof just with the filtration F R k,h and defining the equivalent Ediff2, we get that this event also holds w.p. 1 δ 8 and is the desired event when G1 holds. Next, we prove that the other two events also hold w.p. at least 1 δ By the assumptions of our setting, we know that V πk h (s) [0, H], and so h=1 Y k 2,h = h=1 Var Ph( |st,h,at,h)(V πk h+1) [0, H3]. 
In particular, applying Lemma 25 (w.r.t. the filtration Fk) with C = H3 and any fixed K, we get w.p. 1 δ 8HK(K+1) that h=1 Y k 2,h 2 h=1 E[Y k 2,h|Fk 1] + 4H3 ln 8HK(K + 1) Taking the union bound on all possible values of K 1 proves that EVar holds w.p. at least 1 δ Similarly, by definition, we have that Y k 3,h = bp k,h(sk h, ak h) + bpv1 k,h(sk h, ak h) [0, 2H] and is Fk,h measurable. Thus, for any fixed k 1 and h [H], using Lemma 25, we have w.p. 1 δ 8HK(K+1) that k=1 E[Y k 3,h|Fk,h 1] 1 + 1 k=1 Y k 3,h + 50H2 ln 8HK(K + 1) k=1 Y k 3,h + 50H2 ln 8HK(K + 1) applying the union bound on all K 1, the event Ebp holds w.p. 1 δ To summarize, we have that the event G1 holds w.p. 1 δ 2 (Lemma 5), and we proved that the events Ediff1, Ediff2, EVar, Ebp hold each w.p. 1 δ 8, so we also have that the event G = G1 G2 = G1 Ediff1 Ediff2 EVar Ebp = G1 Ediff1 Ediff2 EVar Ebp holds w.p. at least 1 δ. B.7 Regret Analysis We finally analyze the regret of the algorithm Theorem 1. When running MVP-RL, with probability at least 1 δ uniformly for all K 1, it holds that Reg R(K) O H3SAK ln SAHK δ + H3S2A ln SAHK Proof. Assume that the good events G holds, which by Lemma 7, happens with probability at least 1 δ. Then, by optimism (Lemma 6), for any k [K], h [H] and s S, it holds that V h (s) V k h (s). Moreover, we can lower bound the value of the policy πk as follows (see Remark 1): V πk h (s) = ER Rh(s) h R(πk h(s, R)) + Ph V πk h+1(s, πk h(s, R)) i = ER Rh(s) h R(πk h(s, R)) + ˆP k 1 h V k h+1(s, πk h(s, R)) + bp k,h(s, πk h(s, R)) i + ER Rh(s) h Ph V πk h+1(s, πk h(s, R)) ˆP k 1 h V k h+1(s, πk h(s, R)) bp k,h(s, πk h(s, R)) i (1) = ER Rh(s) n R(a) + ˆP k 1 h V k h+1(s, a) + bp k,h(s, a) o + ER Rh(s) h Ph V πk h+1(s, πk h(s, R)) ˆP k 1 h V k h+1(s, πk h(s, R)) bp k,h(s, πk h(s, R)) i (2) ER ˆ Rk 1 h (s) n R(a) + ˆP k 1 h V k h+1(s, a) + bp k,h(s, a) o br k,h(s) + ER Rh(s) h Ph V πk h+1(s, πk h(s, R)) ˆP k 1 h V k h+1(s, πk h(s, R)) bp k,h(s, πk h(s, R)) i (3) V k h (s) 2br k,h(s) + ER Rh(s) h Ph V πk h+1(s, πk h(s, R)) ˆP k 1 h V k h+1(s, πk h(s, R)) bp k,h(s, πk h(s, R)) i . Relation (1) is by the definition of πk (see Algorithm 3), while (2) holds under the good event Er(k) with u(a) = ˆP k 1 h V k h+1(s, a) + bp k,h(s, a) [0, 2H] (due to the value and bonus truncation). Finally, (3) is by the definition of V k h (s), where the inequality also accounts for its possible truncation. To further bound this, we need to bound ˆP k 1 h V k h+1(s, a) Ph V πk h+1(s, a) = Ph V k h+1 V πk h+1 (s, a) + ˆP k 1 h Ph V k h+1(s, a) = Ph V k h+1 V πk h+1 (s, a) + ˆP k 1 h Ph V h+1(s, a) + ˆP k 1 h Ph V k h+1 V h+1 (s, a). The first error term can be bounded under the good event, while the second using Lemma 24. More formally, under the good event Epv1(k), we have ˆP k 1 h Ph V h+1(s, a) 2Var Ph( |s,a)(V h+1)Lk δ nk 1 h (s, a) 1 + HLk δ nk 1 h (s, a) 1 , and by Lemma 24 with α = 4H (using and P1 = Ph, P2 = ˆP k 1 h , under Ep(k)), ˆP k 1 h Ph V k h+1 V h+1 (s, a) 1 4H EPh( |s,a) V k h+1(s ) V h+1(s ) + HSLk δ(1 + 4H 2/4) nk 1 h (s, a) 1 1 4H EPh( |s,a) h V k h+1(s ) V πk h+1(s ) i + 3H2SLk δ nk 1 h (s, a) 1 = 1 4H Ph V k h+1 V πk h+1 (s, a) + 3H2SLk δ nk 1 h (s, a) 1 , where the second inequality is since the value of πk cannot exceed the optimal value. 
Since under the good event by Lemma 6, we have 0 V πk h+1(s ) V h+1(s ) V k h+1(s ) H, we can trivially bound the error by H and bound ˆP k 1 h V k h+1(s, a) Ph V πk h+1(s, a) Ph V k h+1 V πk h+1 (s, a) | {z } 0 + 3H2SLk δ nk 1 h (s, a) 1 + 2Var Ph( |s,a)(V h+1)Lk δ nk 1 h (s, a) 1 + HLk δ nk 1 h (s, a) 1 , H Ph V k h+1 V πk h+1 (s, a) + min 2Var Ph( |s,a)(V h+1)Lk δ nk 1 h (s, a) 1 + 4H2SLk δ nk 1 h (s, a) 1 , H Ph V k h+1 V πk h+1 (s, a) + bpv1 k,h(s, a). Substituting back to Equation (6) while writing the linear operation Ph V (s, a) as an expectation and letting the action be ah = πk h(s, R), we get under G for all k [K], h [H] and s S that V k h (s) V πk h (s) ER Rh(s) h ˆP k 1 h V k h+1(s, πk h(s, R)) Ph V πk h+1(s, πk h(s, R)) + bp k,h(s, πk h(s, R)) i + 2br k,h(s) E h V k h+1(sh+1) V πk h+1(sh+1)|sh = s, ah i (s, a) + bpv1 k,h(s, ah) + bp k,h(s, ah) + 2br k,h(s) V k h+1(sh+1) V πk h+1(sh+1) + bp k,h(sh, ah) + bpv1 k,h(sh, ah)|sh = s, πk + 2br k,h(s). Next, taking s = sk h, the action ah = πk h(s, R) becomes ak h, and summing on all k, we can rewrite k=1 V k h (sk h) V πk h (sk h) k=1 E 1 + 1 V k h+1(sk h+1) V πk h+1(sk h+1) + bp k,h(sk h, ak h) + bpv1 k,h(sk h, ak h)|Fk,h 1 k=1 br k,h(sk h) V k h+1(sk h+1) V πk h+1(sk h+1) bp k,h(sk h, ak h) + bpv1 k,h(sk h, ak h) + 2 k=1 br k,h(sk h) + 68H2 ln 8HK(K + 1) V k h+1(sk h+1) V πk h+1(sk h+1) + 1 V k h+1(sk h+1) V πk h+1(sk h+1) v u u t Var Ph( |sk h,ak h)(V πk h+1)Lk δ nk 1 h (sk h, ak h) 1 + 1620H2SLk δ nk 1 h (sk h, ak h) 1 + 68H2 ln 8HK(K + 1) k=1 br k,h(sk h) V k h+1(sk h+1) V πk h+1(sk h+1) + 18 Lk δVar Ph( |sk h,ak h)(V πk h+1) q nk 1 h (sk h, ak h) 1 1700H2SLk δ nk 1 h (sk h, ak h) 1 + 6 ALk δ 2nk 1 h (s) 1 where inequality (1) holds when both Ediff1 and Ebp occur and inequality (2) is by Lemma 8. In the last inequality, we also substituted the definition of the reward bonus. Recursively applying this inequality up to h = H + 1 (where both values are zero), w.p. at least 1 δ, we get V 1 (sk 1) V πk 1 (sk 1) V k 1 (sk 1) V πk 1 (sk 1) (Lemma 6) Lk δVar Ph( |sk h,ak h)(V πk h+1) q nk 1 h (sk h, ak h) 1 + 1 + 1 1700H2SLk δ nk 1 h (sk h, ak h) 1 ALk δ 2nk 1 h (s) 1 H3SAKLK δ + 50 2SAH2 LK δ 1.5 + 5000H2SLK δ SAH(2 + ln(K)) + 12 q ALK δ SH + 2 H3SAKLK δ + H3S2A(LK δ )2 . Relation ( ) is by Lemma 9 and Lemma 20. B.7.1 Lemmas for Bounding Bonus Terms Lemma 8. Conditioned on the good event G, for any h [H], it holds that bp k,h(sk h, ak h) + bpv1 k,h(sk h, ak h) 1 8H V k h+1(sk h+1) V πk h+1(sk h+1) v u u t Var Ph( |sk h,ak h)(V πk h+1)Lk δ nk 1 h (sk h, ak h) 1 + 810H2SLk δ nk 1 h (sk h, ak h) 1 . Proof. We start by analyzing each of the terms separately. 
First, we apply Lemma 22 with α = 20 3 32HLk δ, noting that under the good event (by Lemma 6), 0 V πk h+1(s) V h+1(s) V k h+1(s) H and using the event Epv; doing so yields bp k,h(s, a) 20 v u u t Var ˆ P k 1 h ( |s,a)( V k h+1)Lk δ nk 1 h (s, a) 1 + 400 9 HLk δ nk 1 h (s, a) 1 Lk δVar Ph( |s,a)(V πk h+1) nk 1 h (s, a) 1 + 1 32H Ph V k h+1 V πk h+1 (s, a) + 1 32H ˆP k 1 h V k h+1 V πk h+1 (s, a) + 6400H2Lk δ 9nk 1 h (s, a) 1 + 20 3 4HLk δ nk 1 h (s, a) 1 + 400 9 HLk δ nk 1 h (s, a) 1 Using Lemma 24 with α = 1, under the good event Ep(k) and for any s, a, we can further bound ˆP k 1 h V k h+1 V πk h+1 (s, a) = Ph V k h+1 V πk h+1 (s, a) + ˆP k 1 h Ph V k h+1(s ) V πk h+1 (s, a) Ph V k h+1 V πk h+1 (s, a) + Ph V k h+1 V πk h+1 (s, a) + HSLk δ(1 + 2 1/4) nk 1 h (s, a) 1 (Lemma 24) 2Ph V k h+1 V πk h+1 (s, a) + 1.5HSLk δ nk 1 h (s, a) 1 Thus, we get the overall bound bp k,h(s, a) 20 q Lk δVar Ph( |s,a)(V πk h+1) nk 1 h (s, a) 1 + 3 32H Ph V k h+1 V πk h+1 (s, a) + 785H2SLk δ nk 1 h (s, a) 1 For the second bonus, we apply Lemma 21 w.r.t. V πk h+1(s) V h+1(s) and α = 32 q 2Lk δH and get bpv1 k,h(s, a) 2Var Ph( |s,a)(V h+1)Lk δ nk 1 h (s, a) 1 + 4H2SLk δ nk 1 h (s, a) 1 2Var Ph( |s,a)(V πk h+1)Lk δ nk 1 h (s, a) 1 + 1 32H Ph V h+1 V πk h+1 (s, a) + 16HLk δ nk 1 h (s, a) + 4H2SLk δ nk 1 h (s, a) 1 2Var Ph( |s,a)(V πk h+1)Lk δ nk 1 h (s, a) 1 + 1 32H Ph V k h+1 V πk h+1 (s, a) + 20H2SLk δ nk 1 h (s, a) 1 where we again used the optimism. Combining both and summing over all k, we get bp k,h(sk h, ak h) + bpv1 k,h(sk h, ak h) 9 v u u t Var Ph( |sk h,ak h)(V πk h+1)Lk δ nk 1 h (sk h, ak h) 1 + 1 k=1 Ph V k h+1 V πk h+1 (sk h, ak h) 805H2SLk δ nk 1 h (sk h, ak h) 1 Finally, under the good event Ediff2, it holds that k=1 Ph V k h+1 V πk h+1 (sk h, ak h) = k=1 E h V k h+1(sk h+1) V πk h+1(sk h+1)|F R k,h 1 i V k h+1(sk h+1) V πk h+1(sk h+1) + 18H2 ln 8HK(K + 1) Substituting this relation back concludes the proof. Lemma 9. Under the event EVar it holds that Var Ph( |sk h,ak h)(V πk h+1) q nk 1 h (sk h, ak h) 1 2 q H3SAKLK δ + 8SAH2LK δ . Proof. Following Lemma 24 of [Efroni et al., 2021], by Cauchy-Schwartz inequality, it holds that Var Ph( |sk h,ak h)(V πk h+1) q nk 1 h (sk h, ak h) 1 h=1 Var Ph( |sk h,ak h)(V πk h+1) 1 nk 1 h (sk h, ak h) 1 . The second term can be bounded by Lemma 20, namely, 1 nk 1 h (sk h, ak h) 1 SAH(2 + ln(K)). We further focus on bounding the first term. Under EVar, we have h=1 Var Ph( |sk h,ak h)(V πk h+1) h=1 Var Ph( |sk h,ak h)(V πk h+1)|Fk 1 + 4H3 ln 8HK(K + 1) δ (Under EVar) h=1 Rh(sk h, ak h) V πk 1 (sk 1) + 4H3 ln 8HK(K + 1) δ (By Lemma 3) 2H2K + 4H3 ln 8HK(K + 1) where the last inequality is since both the values and cumulative rewards are bounded in [0, H]. Combining both, we get Var Ph( |sk h,ak h)(V πk h+1) q nk 1 h (sk h, ak h) 1 2H2K + 4H3 ln 8HK(K + 1) SAH(2 + ln(K)) 2H2K + 4H3 ln 8HK(K + 1) 2SAH ln 8HK(K + 1) H3SAKLK δ + 8SAH2LK δ . C Proofs for Transition Lookahead C.1 Data Generation Process As for the reward transition, we also assume that all data was generated before the game starts for all state-action-timesteps, and it is given to the agent when the relevant (s, a, h) is visited. Thus, the rewards and next-state from the first ith visits at a state (or a state-action pair) at a certain timestep are i.i.d. 
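As in Appendix B.1, this process can be sketched directly; the difference is that the lookahead observation at a visited state is now the vector of next states for all actions. The sketch below uses a small synthetic MDP with illustrative sizes and a hypothetical illustration policy, and is not part of the analysis.

```python
import numpy as np

# A sketch of the transition-lookahead data generation of Appendix C.1: on the i-th visit
# to (s, h), a full vector of next states (one per action) is revealed before acting.
# The MDP is a small synthetic example with hypothetical sizes.
rng = np.random.default_rng(1)
S, A, H, K = 4, 3, 5, 100
r = rng.uniform(size=(H, S, A))                  # expected rewards r_h(s, a)
P = rng.dirichlet(np.ones(S), size=(H, S, A))    # P_h(s' | s, a)

next_state_obs = np.array([[[rng.choice(S, size=K, p=P[h, s, a])
                             for a in range(A)] for s in range(S)] for h in range(H)])
# next_state_obs[h, s, a, i]: next state of action a on the i-th visit to (s, h).

def play_episode(policy, s1=0):
    visits = np.zeros((H, S), dtype=int)
    s, total = s1, 0.0
    for h in range(H):
        i = visits[h, s]; visits[h, s] += 1
        s_next = next_state_obs[h, s, :, i]      # lookahead: the realized next state of every action
        a = policy(h, s, s_next)
        total += r[h, s, a]                      # mean reward, for brevity
        s = int(s_next[a])
    return total

# Illustration only: a myopic policy that moves to the next state with the largest index.
print(play_episode(lambda h, s, s_next: int(np.argmax(s_next))))
```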
Throughout this appendix, we use the notation s k h+1 = s k h+1(sk h, a) a A to denote the next-state observations at episode k and timestep h for all the actions, and use the equivalent filtrations to the ones defined at Appendix B.1, namely Fk,h = σ s1 t, a1 t, s 1 t+1, R1 t t [H], . . . , sk 1 t , ak 1 t , s k 1 t+1 , Rk 1 t t [H], sk t , ak t , s k t+1, Rk t Fk = σ s1 t, a1 t, s 1 t+1 t [H], . . . , sk t , ak t , s k t+1, Rk t t [H], sk+1 1 . In particular, notice that since both s k h+1 and ak h are Fk,h measurable, then so does sk h+1. C.2 Extended MDP for Transition Lookahead In this appendix, we present an equivalent extended MDP that embeds the lookahead into the state to fall under the vanilla MDP model, similarly to Appendix B.2. We use this equivalence to apply various existing results on MDPs without the need to reprove them. We follow the same conventions as Appendix B.2 while denoting transition lookahead values by V T,π(s|M) (and again, the superscript T will be omitted in subsequent subsections). For any MDP M = (S, A, H, P, R), let MT be an MDP of horizon 2H and state space SA+1 that separates the state transition and next-state generation as follows: 1. Assume w.l.o.g. that M starts at some initial state s1. The extended environment starts at a state s1 s 0, where s 0 SA is a vector of A copies of some arbitrary state s0 S. 2. For any h [H], at timestep 2h 1, the environment MT transitions from state sh s 0 to sh s h+1, where s h+1 Ph(s) is a vector containing the next state for all actions a A; this transition happens regardless of the action that the agent played. At timestep 2h, given an action ah, the environment transitions from sh s h+1 to s h+1(a) s 0. 3. The rewards at odd steps 2h 1 are zero, while the rewards at even steps 2h are Rh(sh, ah) Rh(sh, ah) of expectation rh(sh, ah). As before, since the next state is embedded into the extended state space, any state-dependent policy in MT is a one-step transition lookahead policy in the original MDP. Also, the policy at even timesteps does not affect either the rewards or transitions, so it does not affect the value in any way. We again couple the two environments to have the exact same randomness, so assuming that the policy at the even steps in MT is the same as the policy in M, we trivially get the following relation between the values V π 2h(s, s |MT ) = E t=h Rt(st, at)|sh = s, s h+1(s, ) = s , π V T,π h (s, s |M), V π 2h 1(s, s 0|MT ) = E t=h Rt(st, at)|sh = s, π = V T,π h (s|M). (7) While MT is finite, it is exponential in size, so applying any standard algorithm in this environment would lead to exponentially-bad performance bounds. Nonetheless, as with the extended-reward environment, we use this representation to prove useful results on one-step transition lookahead. Proposition 2. The optimal value of one-step transition lookahead agents satisfies V T, H+1(s) = 0, s S, V T, h (s) = Es Ph(s) n rh(s, a) + V T, h+1(s (s, a)) o , s S, h [H]. Also, given next-state observations s = {s (a)}a A at state s and step h, the optimal policy is π h(s, s ) arg max a A n rh(s, a) + V T, h+1(s (a)) o . Proof. We prove the result in the extended MDP MT , in which (as with reward lookahead) the optimal value can be calculated using the Bellman equations as follows [Puterman, 2014] V T 2H+1(s, s |MT ) = 0, s S, s SA, V 2h(s, s |MT ) = max a rh(s, a) + V 2h+1(s (a), s 0|MT ) , h [H], s S, s SA, V 2h 1(s, s 0|MT ) = Es Ph(s) V 2h(s, s |MT ) , h [H], s S. 
(8) By the equivalence between M and MT for all policies, this is also the optimal value in M. Combining both recursion equations and substituting Equation (7) leads to the stated value calculation for all h [H] and s S: V T, h (s|M) = V 2h 1(s, s 0|MT ) = Es Ph(s) V 2h(s, s h+1|MT ) = Es Ph(s) h max a rh(s, a) + V 2h+1(s h+1(a), s 0|MT ) i = Es Ph(s) h max a n rh(s, a) + V T, h+1(s h+1(a)|M) oi . In addition, a given state s and next-state observations s , the optimal policy at the even stages of the extended MDP is π 2h(s, s ) arg max a A rh(s, a) + V 2h+1(s (a)) , alongside arbitrary actions at odd steps. Playing this policy in the original MDP will lead to the optimal one-step transition lookahead policy, as it achieves the optimal value of the original MDP. By the value relations between the two environments (V 2h+1(s, s 0|MT ) = V T, h+1(s|M)), this is equivalent to the stated policy. Remark 2. As in Remark 1, one could write the dynamic programming equations for any policy π ΠT , and not just to the optimal one, namely V π 2h(s, s |MT ) = rh(s, π(s, s )) + V 2h+1(s (πh(s, s )), s 0|MT ), h [H], s S, s SA, V π 2h 1(s, s 0|MT ) = Es Ph(s) V π 2h(s, s |MT ) , h [H], s S. In particular, following the notation of Equation (7), we can write V T,π h (s, s |M) = rh(s, πh(s, s )) + V T,π h+1(s (πh(s, s ))|M), and, V T,π h (s|M) = Es Ph(s) h V T,π h (s, s |M) i = Es Ph(s) h rh(s, πh(s, s )) + V T,π h+1(s (πh(s, s ))|M) i , a notation that will be extensively used for transition lookahead. We also prove a variation of the law of total variance (LTV) for transition lookahead: Lemma 10. For any one-step transition lookahead policy π ΠT , it holds that h=1 Vars Ph(sh)(V T,π h (sh, s ))|π, s1 h=1 rh(sh, ah) V T,π 1 (s1) Proof. We apply the law of total variance in the extended MDP; there, the expected rewards are either 0 (at odd steps) or rh(sh, ah) (at even steps), so the total expected rewards are PH h=1 rh(sh, ah). Hence, by Lemma 27, h=1 rh(sh, ah) V π 1 (s1, s 0|MT ) h=1 Var(V π 2h(sh, s h+1|MT )|(sh, s 0)) | {z } Odd steps h=1 Var(V π 2h+1(sh+1, s 0|MT )|(sh, s h+1)) | {z } Even steps h=1 Var(V π 2h(sh, sh+1|MT )|(sh, s 0))|π, s1 h=1 Vars Ph(sh)(V π 2h(sh, s |MT ))|π, s1 h=1 Vars Ph(sh)(V T,π h (sh, s |M))|π, s1 Using again the identity V π 1 (s1, s 0|MT ) = V T,π 1 (s1|M) leads to the desired result. Finally, prove a value-difference lemma also for transition lookahead Lemma 11 (Value-Difference Lemma with Transition Lookahead). Let M1 = (S, A, H, P 1, R1) and M2 = (S, A, H, P 2, R2) be two environments. For any deterministic one-step transition lookahead policy π ΠT , any h [H] and s S, it holds that V T,π h (s|M1) V T,π h (s|M2) = EM1 r1 h(sh, πh(sh, s h+1)) r2 h(sh, πh(sh, s h+1))|sh = s + EM1 h V T,π h+1(sh+1|M1) V T,π h+1(sh+1|M2)|sh = s i + EM1 h Es P 1 h(sh) h V T,π h (sh, s |M2) i Es P 2 h(sh) h V T,π h (sh, s |M2) i |sh = s i . where V T,π h (s, s |M) is the value at a state given the reward realization, defined in Equation (7) and given in Remark 2. Proof. We again work with the extended MDPs MT 1 , MT 2 and use their Bellman equations, namely, V π 2h(s, s |MT ) = rh(s, π(s, s )) + V 2h+1(s (πh(s, s )), s 0|MT ), h [H], s S, s SA, V π 2h 1(s, s 0|MT ) = Es Ph(s) V π 2h(s, s |MT ) , h [H], s S. Using the relation between the value of the original and extended MDP (eq. 
(7)) and the Bellman equations of the extended MDP, for any h [H], we have V T,π h (s|M1) V T,π h (s|M2) = V π 2h 1(s, s 0|MT 1 ) V π 2h 1(s, s 0|MT 2 ) = Es P 1 h(s) V π 2h(s, s |MT 1 ) Es P 2 h(s) V π 2h(s, s |MT 2 ) = Es P 1 h(s) V π 2h(s, s |MT 1 ) V π 2h(s, s |MT 2 ) + Es P 1 h(s) V π 2h(s, s |MT 2 ) Es P 2 h(s) V π 2h(s, s |MT 2 ) = Es P 1 h(s) V π 2h(s, s |MT 1 ) V π 2h(s, s |MT 2 ) + Es P 1 h(s) h V T,π h (s, s |M2) i Es P 2 h(s) h V T,π h (s, s |M2) i = EM1 V π 2h(sh, s h+1|MT 1 ) V π 2h(sh, s h+1|MT 2 )|sh = s + Es P 1 h(s) h V T,π h (s, s |M2) i Es P 2 h(s) h V T,π h (s, s |M2) i . (9) Denoting ah = πh(sh, s h+1) the action taken by the agent at environment M1, We have V π 2h(sh, s h+1|MT 1 ) V π 2h(sh, s h+1|MT 2 ) = r1 h(sh, ah) + V π 2h+1(s h+1(ah), s 0|MT 1 ) r2 h(sh, ah) + V π 2h+1(s h+1(ah), s 0|MT 2 ) = r1 h(sh, ah) r2 h(sh, ah) + V T,π h+1(s h+1(ah)|M1) V T,π h+1(s h+1(ah)|M2), when taking the expectation w.r.t. M1, it holds that s h+1(ah) = sh+1; substituting this back into Equation (9), we get V π h (s|M1) V π h (s|M2) = EM1 h r1 h(sh, ah) r2 h(sh, ah) + V T,π h+1(s h+1(ah)|M1) V T,π h+1(s h+1(ah)|M2)|sh = s i + Es P 1 h(s) h V T,π h (s, s |M2) i Es P 2 h(s) h V T,π h (s, s |M2) i = EM1 r1 h(sh, πh(sh, s h+1)) r2 h(sh, πh(sh, s h+1))|sh = s + EM1 h V T,π h+1(sh+1|M1) V T,π h+1(sh+1|M2)|sh = s i + EM1 h Es P 1 h(sh) h V T,π h (sh, s |M2) i Es P 2 h(sh) h V T,π h (sh, s |M2) i |sh = s i . C.3 Full Algorithm Description for Transition Lookahead Algorithm 4 Monotonic Value Propagation with Transition Lookahead (MVP-TL) 1: Require: δ (0, 1), bonuses br k,h(s, a), bp k,h(s) 2: for k = 1, 2, ... do 3: Initialize V k H+1(s) = 0 4: for h = H, H 1, .., 1 do 5: for s S do 6: if nk 1 h (s) = 0 then 7: V k h (s) = H 8: else 9: Calculate the truncated values V k h (s) = min 1 nk 1 h (s) nk 1 h (s) X t=1 max a A n ˆrk 1 h (s, a) + br k,h(s, a) + V k h+1(s kt h(s) h+1 (s, a)) o + bp k,h(s), H 10: end if 11: For any set of next-states s SA, define the policy πk πk h(s, s ) arg max a A ˆrk 1 h (s, a) + br k,h(s, a) + V k h+1(s (a)) 12: end for 13: end for 14: for h = 1, 2, . . . H do 15: Observe sk h and s k h+1 = s k h+1(sk h, a) a A 16: Play an action ak h = πk h(sk h, s k h ) 17: Collect the reward Rk h Rh(sk h, ak h) and transition to the next state sk h+1 = s k h+1(sk h, ak h) 18: end for 19: Update the empirical estimators and counts for all visited state-actions 20: end for As with reward lookahead, we again use a variant of the MVP algorithm [Zhang et al., 2021b], described in Algorithm 4. For the bonuses, we use the notation V k h (s, s ) = max a A ˆrk 1 h (s, a) + br k,h(s, a) + V k h+1(s (a) and define the following bonuses: br k,h(s, a) = min Lk δ nk 1 h (s, a) 1 , 1 bp k,h(s) = 20 v u u t Vars ˆ P k 1 h (s)( V k h (s, s ))Lk δ nk 1 h (s) 1 + 400 3 HLk δ nk 1 h (s) 1 , where Lk δ = ln 16S3A2Hk2(k+1) Vars ˆ P k 1 h (s)( V k h (s, s )) = Es ˆ P k 1 h (s) V k h (s, s )2 Es ˆ P k 1 h (s) V k h (s, s ) 2 . The notation kt h(s) again represents the tth episode where the state s was visited at the hth timestep; in particular, line 9 of the algorithm is the expectation w.r.t. the empirical reward distribution ˆP k 1 h (s). Since the transition bonus is larger than H when nk 1 h (s) = 0, we can arbitrarily define the expectation w.r.t. ˆP k 1 h (s) when nk 1 h (s) = 0 to be 0, and one could write the update in a more concise way as V k h (s) = min n Es ˆ P k 1 h (s) V k h (s, s ) + bp k,h(s), H o . 
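As with reward lookahead, the update in line 9 of Algorithm 4 averages over stored observation vectors instead of backing up through an estimated transition kernel. The following minimal sketch illustrates lines 9 and 11, assuming the empirical quantities are stored as arrays and treating the bonuses of Appendix C.3 as given inputs.

```python
import numpy as np

def mvp_tl_planning(next_state_obs, r_hat, counts_s, b_r, b_p, H, S):
    """Sketch of the optimistic backward induction of Algorithm 4 (MVP-TL), line 9.

    next_state_obs[h][s]: list of past next-state-vector observations (one state per action)
    at (s, h); r_hat[h, s, a]: empirical mean reward; b_r[h, s, a], b_p[h, s]: the bonuses of
    Appendix C.3 (taken here as inputs). Returns the optimistic values V[h, s] and the
    greedy lookahead policy of line 11 as a closure."""
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            if counts_s[h, s] == 0:
                V[h, s] = H
                continue
            # Empirical expectation, over stored observations s' = (s'(a))_a, of
            # max_a { \hat r_h(s,a) + b^r_{k,h}(s,a) + V_{h+1}(s'(a)) }.
            emp = np.mean([np.max(r_hat[h, s] + b_r[h, s] + V[h + 1][s_vec])
                           for s_vec in next_state_obs[h][s]])
            V[h, s] = min(emp + b_p[h, s], H)

    def policy(h, s, s_vec):
        # Line 11: given the observed next states s_vec (one per action), pick the best continuation.
        return int(np.argmax(r_hat[h, s] + b_r[h, s] + V[h + 1][s_vec]))

    return V, policy
```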
C.4 Additional Notations and List Representation In this subsection, we present additional notations for both values and transition distributions that will be helpful in the analysis. In particular, we show that instead of looking at the distribution over all combinations of next state s SA, we can look at a ranking of all the next-state-actions and represent important quantities using the effective distribution on these ranks this moves the problem from being SA-dimensional to a dimension of SA. We start by defining the values starting from state s S, playing a A and transitioning to s S, denoted by V π h (s, s , a) = rh(s, a) + V π h+1(s ), V h (s, s , a) = rh(s, a) + V h+1(s ), V k h (s, s , a) = ˆrk 1 h (s, a) + br k,h(s, a) + V k h+1(s ), We similarly define (consistently with Remark 2) V π h (s, s ) = V π h (s, s (πh(s, s )), πh(s, s )), V h (s, s ) = max a V h (s, s (a), a), and , V k h (s, s ) = max a V k h (s, s (a), a). List representation. We now move to defining lists of next-state-actions and distributions with respect to such lists. Let ℓbe a list that orders all next-state-action pairs from (s ℓ(1), aℓ(1)) to (s ℓ(SA), aℓ(SA)) and define the set of all possible lists to be L (with |L| = (SA)!). Also, define ℓu, the list induced by a function u : S A 7 R such that u(s ℓu(1), aℓu(1)) u(s ℓu(SA), aℓu(SA)), where ties are broken in any fixed arbitrary way. From this point forward, for brevity and when clear from the context, we omit the list from the indexing, e.g., write the list ℓby (s 1, a1), . . . , (s SA, a SA). We now define the probability of list elements. Denote by Eℓ i the event that the highest-ranked realized element in the list is element i, namely Eℓ i = s SA : s (ai) = s i and j < i, s (aj) = s j . (10) Then, for a probability measure P on SA, define µ(i|ℓ, P) = P(s Eℓ i ). Notably, when the list is induced by u and element i is the realized highest-ranked elements, we can write maxa u(s (a), a) = u(s i, ai), so we have that (e.g. by Lemma 17 with f(s ) = maxa u(s (a), a)) Es Ph(s) h max a {u(s (a), a)} i = Ei µ( |ℓ,Ph(s))[u(s i, ai)] We also denote by ˆµk h(i|s; ℓ) = 1 nk h(s) 1 PK t=1 1 st h = s, s t h+1 Eℓ i , the empirical probability for a list location i to be the highest-realized ranking according to a list ℓat state s and step h, based on samples up to episode k; We have by Lemma 17 that ˆµk h(i|s; ℓ) = ˆP k h (Eℓ i |s) and Es ˆ P k 1 h (s) h max a {u(s (a), a)} i = Ei ˆµk 1 h ( |s;ℓu)[u(s i, ai)]. Similarly, we will require the distribution probability w.r.t. two lists the probability that the top element w.r.t. list ℓis i and the top element w.r.t. list ℓ is j; we denote the real and empirical probability distributions by µ(i, j|ℓ, ℓ , P) and ˆµk h(i, j|s; ℓ, ℓ ), respectively. This allows, for example, using Lemma 17 to write for any u, v : S A 7 R, Es Ph(s) h max a {u(s (a), a)} max a {v(s (a), a)} i = Ei,j µ( |ℓu,ℓv,Ph(s)) h u(s ℓu(i), aℓu(i)) v(s ℓv(j), aℓv(j)) i , Es ˆ P k 1 h (s) h max a {u(s (a), a)} max a {v(s (a), a)} i = Ei,j ˆµk h( |s;ℓu,ℓv) h u(s ℓu(i), aℓu(i)) v(s ℓv(j), aℓv(j)) i . (11) Finally, we say that a policy πh(s, s ) is induced by lists ℓh(s) if it chooses an action a such that its next-state s (a) is ranked higher in ℓthan all other realized next-state-action pairs. In particular, the policy πk and the optimal policy π (defined in Proposition 2) are such policies w.r.t. the lists ℓk h(s) and ℓ h(s) induced by V k h (s, s , a) and V h (s, s , a), respectively. 
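The list representation can be made concrete with a short sketch: it instantiates the events $E^\ell_i$ of Equation (10) by enumeration and, under the additional assumption that next states are generated independently across actions, the closed-form product formula derived in Appendix C.4.1 below, verifying that the two computations agree and that expectations of maxima can be taken over ranks (cf. Lemma 17). All sizes are illustrative.

```python
import itertools
import numpy as np

# A sketch of the rank distribution mu(i | l, P) of Appendix C.4, assuming next states are
# generated independently across actions (so the product formula of Appendix C.4.1 applies).
rng = np.random.default_rng(2)
S, A = 3, 2
P_sa = rng.dirichlet(np.ones(S), size=A)      # P_h(s' | s, a) for a fixed (s, h), one row per action

# A list l ranking all next-state/action pairs, here induced by a random score u(s', a).
u = rng.uniform(size=(S, A))
pairs = sorted(((s, a) for s in range(S) for a in range(A)), key=lambda sa: -u[sa])

def mu_bruteforce(i):
    """mu(i | l, P) = P(s' in E^l_i): element i is the highest-ranked realized pair (eq. (10))."""
    si, ai = pairs[i]
    total = 0.0
    for s_vec in itertools.product(range(S), repeat=A):   # enumerate all joint realizations
        if s_vec[ai] != si or any(s_vec[aj] == sj for (sj, aj) in pairs[:i]):
            continue
        total += np.prod([P_sa[a][s_vec[a]] for a in range(A)])
    return total

def mu_product(i):
    """Closed form under independence: P(s'(a_i) = s'_i) * prod_{a != a_i} P(no higher-ranked pair of action a)."""
    si, ai = pairs[i]
    out = P_sa[ai][si]
    for a in range(A):
        if a != ai:
            out *= 1.0 - sum(P_sa[a][sj] for (sj, aj) in pairs[:i] if aj == a)
    return out

mu = np.array([mu_bruteforce(i) for i in range(S * A)])
assert np.allclose(mu, [mu_product(i) for i in range(S * A)])
assert np.isclose(mu.sum(), 1.0)                            # the events E^l_i partition S^A

# E_{s' ~ P}[max_a u(s'(a), a)] equals E_{i ~ mu}[u(s'_i, a_i)] (cf. Lemma 17).
direct = sum(np.prod([P_sa[a][sv[a]] for a in range(A)]) * max(u[sv[a], a] for a in range(A))
             for sv in itertools.product(range(S), repeat=A))
assert np.isclose(direct, sum(mu[i] * u[pairs[i]] for i in range(S * A)))
```

For larger $S$ and $A$ the brute-force enumeration is of course intractable; the point of the list representation is that only the $SA$ rank probabilities are needed, and these can either be estimated from samples (as in Algorithm 4) or, under independence, computed in closed form.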
As such, for any probability measure Ph(s), function u : S S A 7 R and a policy π induced by a list ℓ, it holds that Es Ph(s)[u(s, s (π(a)), π(a))] = Ei µ( |ℓh(s),Ph(s))[u(s, s i, ai)]. (12) C.4.1 Planning with Transition Lookahead We have already seen the optimal policy is induced by a list ℓ h(s), and in particular, we can write the dynamic programming equations of Proposition 2 as V h (s) = Es Ph(s) n rh(s, a) + V T, h (s (a)) o = Ei µ( |ℓ h(s),Ph(s)) rh(s, ai) + V h+1(s (ai)) . Therefore, one way to perform the planning is to build a list ℓ h(s) of (s , a) s.t. the values V h (s, s , a) = rh(s, a) + V h+1(s ) are sorted in a non-increasing order and calculate the probability of any pair in the list to be the highest-realized pair: µ(i|ℓ, Ph(s)) = Ph(Eℓ i ) = Pr s h+1(ai) = s i and j < i, s h+1(aj) = s j|sh = s . In general, calculating this distribution is intractable, and one must resort to approximating it by sampling (as done in Algorithm 4. Nonetheless, if next states are generated independently between actions, this distribution could be efficiently calculated as follows: µ(i|ℓ, Ph(s)) = Pr s h+1(ai) = s i and j < i, s h+1(aj) = s j|sh = s (1) = Pr s (ai) = s i and j < i s.t. aj = ai, s (aj) = s j|sh = s (2) = Pr{s (ai) = s i|sh = s} Y a =ai Pr j < i s.t. aj = a, s (a) = s j|sh = s (3) = Ph(s i|s, ai) Y j=1 1{aj = a}Ph(s j|s, a) Relation (1) holds since if s (ai) = s i, it cannot get any previous value of the same action in the list, so these events can be removed. Relation (2) is by the independence and (3) directly calculates the probabilities. C.5 The First Good Event Concentration Next, we define the events that ensure the concentration of all empirical measures. For rewards, an event handles the convergence of the empirical rewards to their mean. For the transitions, we want the Bellman operator, applied on the optimal value with the empirical model, to concentrate well, and we require the variance of values w.r.t. the empirical and real model to be close. Finally, the empirical measure ˆµk h(i, j|s; ℓ, ℓ h(s)) must concentrate well around its mean for any list ℓ this will allow the change-of-measure argument described in the proof sketch. Formally, define the following good events: s, a, h : |rh(s, a) ˆrk 1 h (s, a)| Lk δ nk 1 h (s, a) 1 Eℓ(k) = n s, h, ℓ L, i, j [SA] : ˆµk 1 h (i, j|s; ℓ, ℓ h(s)) µ(i, j|ℓ, ℓ h(s); Ph(s)) 4SALk δµ(i, j|s; ℓ, ℓ h(s); Ph(s)) nk 1 h (s) 1 + 2SALk δ nk 1 h (s) 1 s, h : Es Ph(s) V h (s, s ) Es ˆ P k 1 h (s) V h (s, s ) 2Vars Ph(s)(V h (s, s ))Lk δ nk 1 h (s) 1 + HLk δ nk 1 h (s) 1 Vars Ph(s)(V h (s, s )) q Vars ˆ P k 1 h (s)(V h (s, s )) 4H Lk δ nk 1 h (s) 1 where we again use Lk δ = ln 16S3A2Hk2(k+1) δ . We define the first good event as k 1 Er(k) \ k 1 Eℓ(k) \ k 1 Epv1(k) \ k 1 Epv2(k), for which the following holds: Lemma 12 (The First Good Event). It holds that Pr(G1) 1 δ/2. Proof. We prove that each of the events holds w.p. at least 1 δ/8. The result then directly follows by the union bound. We also remark that due to the domain of the variables and their estimators (e.g., [0, 1] for the rewards), all bounds trivially hold when the counts equal zero, so w.l.o.g., we only prove the results for cases in which states/state-actions were already previously visited. Event k 1Er(k). Fix k 1, s, a, h and visits n 1. Given all of these, the reward observations are i.i.d. random variables supported by [0, 1]. Denoting the empirical mean based on these n samples by ˆrh(s, a, n), by Hoeffding s inequality, it holds w.p. 
1 δ 8SAHk2(k+1) that |rh(s, a) ˆrh(s, a, n)| ln 16SAHk2(k+1) Taking the union bound over all n [k] at timestep k, we get that w.p. 1 δ 8SAHk(k+1) |rh(s, a) ˆrk 1 h (s, a)| Lk δ nk 1 h (s, a) 1 , and another union bound over all possible values of s, a, h and k 1 implies that k 1Er(k) holds w.p. at least 1 δ/8. The event k 1Eℓ(k). For any fixed k 1, s, h, a list ℓ L and number of visits n [k], we utilize Lemma 16 (event Ep) w.r.t. the distribution µ(i, j|ℓ, ℓ h(s), P) (whose support is of size M = (SA)2). When applying the lemma, notice that given the number of visits n 1, the empirical distribution ˆµk 1 h (i, j|s; ℓ, ℓ h(s)) is the average of n = nk 1 h (s) i.i.d samples, so that for all i, j [SA], ˆµk 1 h (i, j|s; ℓ, ℓ h(s)) µ(i, j|ℓ, ℓ h(s); Ph(s)) 2µ(i, j|ℓ, ℓ h(s); Ph(s)) ln 2(SA)2 δ n + 2 ln 2(SA)2 4µ(i, j|ℓ, ℓ h(s); Ph(s)) ln 2SA δ n + 2 ln 2SA w.p. 1 δ . Choosing δ = δ 8|L|SHk2(k+1) (such that ln 2SA δ SA ln 16S3A2Hk2(k+1) δ since |L| (SA)SA), while taking the union bound on all n [k], all s, h and all lists ℓ L implies that k 1Eℓ(k) holds w.p. at least 1 δ Events k 1Epv1(k) and k 1Epv2(k). We repeat the arguments stated in Lemma 5. For any fixed k 1, s, h and number of visits n [k] , we utilize Lemma 16 w.r.t. the next-state distribution for all actions Ph(s), the value V h (s, s ) [0, H] and probability δ = δ 8SHk2(k+1); we yet again remind that given the number of visits, samples are i.i.d. As before, the events k 1Epv1(k) and k 1Epv2(k) hold w.p. at least 1 δ 8 through the union bound first on n [k] (to get the empirical quantities) and then on s, h and k 1. This proves that each of the events in G1 holds w.p. at least 1 δ 8, so G1 holds w.p. at least 1 δ C.6 Optimism of the Upper Confidence Value Functions We now prove that under the event G1, the values that MVP-TL outputs are optimistic. Lemma 13 (Optimism). Under the first good event G1, for all k [K], h [H], a A and s, s S, it holds that V h (s, s , a) V k h (s, s , a). Moreover, for all s SA, V h (s, s ) V k h (s, s ) and also V h (s) V k h (s). Proof. The proof of all claims follows by backward induction on H; the base case naturally holds for h = H + 1, where all values are defined to be zero. Assume by induction that for some k [K] and h [H], the inequality V h+1(s) V k h+1(s) holds for all s S; we will show that this implies that all stated inequalities also hold at timestep h. At this point, we also assume w.l.o.g. that V k h (s) < H (namely, not truncated), since otherwise, by the boundedness of the rewards, V h (s) H = V k h (s). In particular, under the good event Er(k), for all s and a , it holds that ˆrk 1 h (s, a) + br k,h(s, a) rh(s, a), so for all s, a and s , we have V k h (s, s , a) = ˆrk 1 h (s, a) + br k,h(s, a) + V k h+1(s ) rh(s, a) + V h+1(s ) = V h (s, s , a). where the inequality also uses the induction hypothesis. This proves the first part of the lemma. Moreover, it implies that V k h (s, s ) = max a A V k h (s, s (a), a) max a A{V h (s, s (a), a)} = V h (s, s ), (13) and proves the second part of the statement. To prove the last claim of the lemma, we use the monotonicity of the bonus, relying on Lemma 23. 
This lemma can be used when applied to the empirical distribution of all possible next-states ˆP k 1 h (s); indeed, the non-truncated optimistic value can be written as V k h (s) = Es ˆ P k 1 h (s) max a A ˆrk 1 h (s, a) + br k,h(s, a) + V k h+1(s (a)) + bp k,h(s) Es ˆ P k 1 h (s) V k h (s, s ) + max v u u t Vars ˆ P k 1 h (s)( V k h (s, s ))Lk δ nk 1 h (s) 1 , 400 9 3HLk δ nk 1 h (s) 1 which is exactly the required form in Lemma 23, w.r.t. the distribution ˆP k 1 h (s) and the values V k h (s, s ) (while noticing that due to the truncation of the values and bonuses, V k h (s, s ) [0, 3H]). Thus, the lemma guarantees monotonicity in the value, so by Equation (13), V k h (s) Es ˆ P k 1 h (s) V h (s, s ) + max v u u t Vars ˆ P k 1 h (s)(V h (s, s ))Lk δ nk 1 h (s) 1 , 400 9 3HLk δ nk 1 h (s) 1 Es ˆ P k 1 h (s) V h (s, s ) + 10 v u u t Vars ˆ P k 1 h (s)(V h (s, s ))Lk δ nk 1 h (s) 1 + 200 3 HLk δ nk 1 h (s) 1 Es ˆ P k 1 h (s) V h (s, s ) + 10 Vars Ph(s)(V h (s, s ))Lk δ nk 1 h (s) 1 + 50HLk δ nk 1 h (s) 1 (Under Epv2(k)) Es Ph(s) V h (s, s ) (Under Epv1(k)) C.7 The Second Good Event Martingale Concentration In this subsection, we present three good events that allow replacing the expectation over the randomizations inside each episode by their realization. Let Y k 1,h := V k h+1(sk h+1) V πk h+1(sk h+1) Y k 2,h = Vars Ph(sk h)(V πk h (sk h, s )) Y k 3,h = br k,h(sk h, ak h). The second good event is the intersection of the events G2 = Ediff EVar Ebr defined as follows. h [H], K 1 : k=1 E[Y k 1,h|Fk,h 1] 1 + 1 k=1 Y k 1,h + 18H2 ln 6HK(K + 1) h=1 Y k 2,h 2 h=1 E[Y k 2,h|Fk 1] + 4H3 ln 6HK(K + 1) h [H], K 1 : k=1 E[Y k 3,h|Fk,h 1] 2 k=1 Y k 3,h + 18 ln 6HK(K + 1) We define the good event G = G1 G2. Lemma 14. The good event G holds with a probability of at least 1 δ. Proof. The analysis of the first event follows Ediff exactly as the one of Ediff1 in Lemma 7: define Wk = 1 n V k h (s) V πk h (s) [0, H], h [H], s S o (which happens a.s. under G1 due to the optimism in Lemma 13 and truncation) and Y k 1,h = Wk Y k 1,h, which is bounded in [0, H] and Fk,hmeasurable. The corresponding event w.r.t. this modified variables Ediff then holds w.p. 1 δ 6 by Lemma 25, and as in Lemma 7, we can use the fact that G1 Ediff = G1 Ediff to conclude this part of the proof. Moving to the second event, since V πk h (s, s ) [0, H], then PH h=1 Y k 2,h [0, H3]. Therefore, by Lemma 25 (w.r.t. the filtration Fk) with C = H3 and any fixed K, we get w.p. 1 δ 6HK(K+1) that h=1 Y k 2,h 2 h=1 E[Y k 2,h|Fk 1] + 4H3 ln 6HK(K + 1) Taking the union bound on all possible values of K 1 proves that EVar holds w.p. at least 1 δ Finally, by definition, we have that Y k 3,h = br k,h(sk h, ak h) [0, 1] and is Fk,h-measurable. Thus, for any fixed k 1 and h [H], using Lemma 25, we have w.p. 1 δ 6HK(K+1) that k=1 E[Y k 3,h|Fk,h 1] 1 + 1 k=1 Y k 3,h + 18 ln 6HK(K + 1) k=1 Y k 3,h + 18 ln 6HK(K + 1) so that due to the union bound, Ebr holds w.p. 1 δ To conclude, G1 holds w.p. 1 δ 2 (Lemma 5) and the events Ediff, EVar, Ebr each hold w.p. 1 δ 6. As before, when accounting to the fact that Ediff and Ediff are identical under G1, the event G = G1 G2 holds w.p. at least 1 δ. C.8 Regret Analysis Theorem 2. When running MVP-TL, with probability at least 1 δ uniformly for all K 1, it holds that Reg T (K) O δ + H3S4A3 ln SAHK Proof. Assume that the event G holds, which by Lemma 14, happens with probability at least 1 δ. 
In particular, throughout the proof, we use optimism (Lemma 13), which implies that 0 V πk h (s, s ) V h (s, s ) V k h (s, s ) 3H (the upper bound is also by the truncation), as well as 0 V πk h (s) V h (s) V k h (s) H. We first focus on lower-bounding the value of the policy πk: by Remark 2, we have V πk h (s) = Es Ph(s) h rh(s, πk h(s, s )) + V πk h+1(s (πk h(s, s ))) i = Es Ph(s) ˆrk 1 h (s, πk h(s, s)) + V k h+1(s (πk h(s, s ))) + br k,h(s, πk h(s, s )) + Es Ph(s) rh(s, πk h(s, s )) ˆrk 1 h (s, πk h(s, s )) br k,h(s, πk h(s, s )) + Es Ph(s) h V πk h+1(s (πk h(s, s ))) V k h+1(s (πk h(s, s ))) i (1) = Es Ph(s) max a A ˆrk 1 h (s, a) + V k h+1(s (a)) + br k,h(s, a) + Es Ph(s) rh(s, πk h(s, s )) ˆrk 1 h (s, πk h(s, s)) br k,h(s, πk h(s, s )) + Es Ph(s) h V πk h+1(s (πk h(s, s ))) V k h+1(s (πk h(s, s ))) i (2) Es Ph(s) V k h (s, s ) 2Es Ph(s) br k,h(s, πk h(s, s )) Es Ph(s) h V k h+1(s (πk h(s, s ))) V πk h+1(s (πk h(s, s ))) i where (1) is by the definition of πk and (2) uses the reward concentration event. Thus, we can write V k h (s) V πk h (s) Es ˆ P k 1 h (s) V k h (s, s ) Es Ph(s) V k h (s, s ) + 2Es Ph(s) br k,h(s, πk h(s, s )) + Es Ph(s) h V k h+1(s (πk h(s, s ))) V πk h+1(s (πk h(s, s ))) i + bp k,h(s) = Es ˆ P k 1 h (s) V k h (s, s ) V h (s, s ) Es Ph(s) V k h (s, s ) V h (s, s ) + bp k,h(s) | {z } (i) + Es Ph(s)[V h (s, s )] Es ˆ P k 1 h (s)[V h (s, s )] | {z } (ii) +2Es Ph(s) br k,h(s, πk h(s, s )) + Es Ph(s) h V k h+1(s (πk h(s, s ))) V πk h+1(s (πk h(s, s ))) i (14) Bounding term (ii): using the concentration event Epv1(k), we have 2Vars Ph(s)(V h (s, s ))Lk δ nk 1 h (s) 1 + HLk δ nk 1 h (s) 1 2Vars Ph(s)(V πk h (s, s ))Lk δ nk 1 h (s) 1 + 1 8H Es Ph(s) h V πk h (s, s ) V πk h (s, s ) i + 4H2Lk δ nk 1 h (s) 1 + HLk δ nk 1 h (s) 1 2Vars Ph(s)(V πk h (s, s ))Lk δ nk 1 h (s) 1 + 1 8H Es Ph(s) V k h (s, s ) V πk h (s, s ) + 5H2Lk δ nk 1 h (s) 1 . Relation (1) uses Lemma 21 with the values 0 V πk h (s, s ) V h (s, s ) H with α = 8H q 2Lk δ and (2) is by optimism. Bounding term (i): We first focus on the transition bonus; to bound it, we apply Lemma 22 w.r.t. ˆP k 1 h (s |s), Ph(s |s), the values 0 V πk h (s, s ) V h (s, s ) V k h (s, s ) 3H (by optimism), under the event Epv2(k) and with α = 8H 20 bp k,h(s) = 20 v u u t Vars ˆ P k 1 h (s)( V k h (s, s ))Lk δ nk 1 h (s) 1 + 400 3 HLk δ nk 1 h (s) 1 1 8H Es ˆ P k 1 h (s) V k h (s, s ) V h (s, s ) + 1 8H Es Ph(s)[V h (s, s ) V πk h (s, s )] Vars Ph(s)(V πk h (s, s ))Lk δ nk 1 h (s) 1 + 1600H2 3nk 1 h (s) 1 + 20 3 4HLk δ nk 1 h (s) 1 + 400 3 HLk δ nk 1 h (s) 1 Es ˆ P k 1 h (s) V k h (s, s ) V h (s, s ) Es Ph(s) V k h (s, s ) V h (s, s ) 8H Es Ph(s) V k h (s, s ) V πk h (s, s ) + 20 Vars Ph(s)(V πk h (s, s ))Lk δ nk 1 h (s) 1 + 700H2 nk 1 h (s) 1 . Substituting back to term (i), we now have Es ˆ P k 1 h (s) V k h (s, s ) V h (s, s ) Es Ph(s) V k h (s, s ) V h (s, s ) 8H Es Ph(s) V k h (s, s ) V πk h (s, s ) + 20 Vars Ph(s)(V πk h (s, s ))Lk δ nk 1 h (s) 1 + 700H2Lk δ nk 1 h (s) 1 . The next step in the proof involves bounding the first term of (i). 
At this point, we remind that both values can be written as V k h (s, s ) = maxa V k h (s, s (a), a) and V h (s, s ) = maxa V h (s, s (a), a), inducing the lists ℓ= ℓk h(s) and ℓ = ℓ h(s), respectively; thus the expectations can be written as (see Appendix C.4 for further details on the list representation, and in particular, Equation (11)): Es ˆ P k 1 h (s) V k h (s, s ) V h (s, s ) Es Ph(s) V k h (s, s ) V h (s, s ) (1) = Ei,j ˆµk h( |s; ℓ,ℓ ) h V k h (s, s ℓ(i), a ℓ(i)) V h (s, s ℓ (j), aℓ (j)) i Ei,j µ( | ℓ,ℓ ,Ph(s)) h V k h (s, s ℓ(i), a ℓ(i)) V h (s, s ℓ (j), aℓ (j)) i (2) 1 8H Ei,j µ( | ℓ,ℓ ,Ph(s)) h V k h (s, s ℓ(i), a ℓ(i)) V h (s, s ℓ (j), aℓ (j)) i + 3H(SA)2Lk δ(2SA + 8H 4SA/4) nk 1 h (s) 1 (1) 1 8H Es Ph(s) V k h (s, s ) V h (s, s ) + 30H2(SA)3Lk δ nk 1 h (s) 1 1 8H Es Ph(s) h V k h (s, s ) V πk h (s, s ) i + 30H2(SA)3Lk δ nk 1 h (s) 1 Relations (1) formulate the expectation using the list representations and backward, as done in Equation (11). For inequality (2) we rely on Lemma 24 with α = 8H under the event Eℓ(k) and the optimism, which ensures that the value difference is bounded in [0, 3H]. We also remark that the support of the distributions is of size (SA)2; were we to use the same result on the distributions ˆP k 1 h (s) and Ph(s), the support would be of size SA, which would lead to an exponential additive factor. And so, we finally have a bound of (i) 3 8H Es Ph(s) V k h (s, s ) V πk h (s, s ) + 20 Vars Ph(s)(V πk h (s, s ))Lk δ nk 1 h (s) 1 + 735H2(SA)3Lk δ nk 1 h (s) 1 . Combining both terms. Substituting this and Equation (15) into Equation (14), we have V k h (s) V πk h (s) 1 2H Es Ph(s) V k h (s, s ) V πk h (s, s ) + 9 Vars Ph(s)(V πk h (s, s ))Lk δ nk 1 h (s) 1 + 750H2(SA)3Lk δ nk 1 h (s) 1 + 2Es Ph(s) br k,h(s, πk h(s, s )) + Es Ph(s) h V k h+1(s (πk h(s, s ))) V πk h+1(s (πk h(s, s ))) i . and further bounding (using the concentration event Er(k) V k h (s, s )) V πk h (s, s ) = ˆrk 1 h (s, πk h(s, s )) + br k,h(s, πk h(s, s )) + V k h+1(s (πk h(s, s ))) rk 1 h (s, πk h(s, s )) V πk h+1(s (πk h(s, s ))) V k h+1(s (πk h(s, s ))) V πk h+1(s (πk h(s, s ))) + 2br k,h(s, πk h(s, s )), we finally get the decomposition V k h (s) V πk h (s) 1 + 1 Es Ph(s) h V k h+1(s (πk h(s, s ))) V πk h+1(s (πk h(s, s ))) i Vars Ph(s)(V πk h (s, s ))Lk δ nk 1 h (s) 1 + 750H2(SA)3Lk δ nk 1 h (s) 1 + 3Es Ph(s) br k,h(s, πk h(s, s )) . At this point, we choose to take s = sk h and sum over all k [K]; specifically, for s = s k h+1, the action becomes πk h(s, s ) = ak h and s (πk h(s, s )) = sk h+1. Formally, we can write the bound as k=1 V k h (sk h) V πk h (sk h) 1 + 1 k=1 E h V k h+1(sk h+1) V πk h+1(sk h+1)|Fk,h 1 i k=1 E br k,h(sk h, ak h)|Fk,h 1 + 9 v u u t Vars Ph(sk h)(V πk h (sk h, s ))Lk δ nk 1 h (sk h) 1 750H2(SA)3Lk δ nk 1 h (sk h) 1 . and, in particular, under the events Ediff and Ebr, it holds that k=1 V k h (sk h) V πk h (sk h) 1 + 1 V k h+1(sk h+1)) V πk h+1(sk h+1) + 36H2 ln 6HK(K + 1) k=1 br k,h(sk h, ak h) + 54 ln 6HK(K + 1) v u u t Vars Ph(sk h)(V πk h (sk h, s ))Lk δ nk 1 h (sk h) 1 + 750H2(SA)3Lk δ nk 1 h (sk h) 1 . To conclude the proof, we recursively apply this formula from h = 1 to h = H + 1 (where the values are zero) and use the optimism. 
This yields Reg T (K) = k=1 V 1 (sk h) V πk 1 (sk h) k=1 V k 1 (sk h) V πk 1 (sk h) (Optimism) (1) 9 1 + 1 Vars Ph(sk h)(V πk h (sk h, s ))Lk δ q nk 1 h (sk h) 1 Lk δ nk 1 h (sk h, ak h) 1 750H2(SA)3Lk δ nk 1 h (sk h) 1 + 90H3 1 + 1 2H ln 6HK(K + 1) H3SKLK δ + 50 2SH2 LK δ 1.5 LK δ SAH + 2 SAH2K + 2050H3S4A3LK δ (2 + ln(K)) + 250H3LK δ A LK δ + H3S4A3 LK δ 2 . Relation (1) is the recursive application of the difference alongside substitution of the reward bonuses, while relation (2) is by Lemma 15 and Lemma 20. C.8.1 Lemmas for Bounding Bonus Terms Lemma 15. Under the event EVar it holds that Vars Ph(sk h)(V πk h (sk h, s )) q nk 1 h (sk h) 1 2 q Proof. Similar to Lemma 9, we again rely on the lookahead version of the law of total variation to prove this bound. First, by Cauchy-Schwartz inequality, it holds that Vars Ph(sk h)(V πk h (sk h, s )) q nk 1 h (sk h) 1 h=1 Vars Ph(sk h)(V πk h (sk h, s )) 1 nk 1 h (sk h) 1 . We use Lemma 20 to bound the second term by 1 nk 1 h (sk h) 1 SH(2 + ln(K)) and focus on bounding the first term. Under EVar, we have h=1 Vars Ph(sk h)(V πk h (sk h, s )) h=1 Vars Ph(sk h)(V πk h (sk h, s ))|Fk 1 + 4H3 ln 6HK(K + 1) δ (Under EVar) h=1 rh(sk h, ak h) V πk 1 (sk 1) + 4H3 ln 6HK(K + 1) δ (By Lemma 10) 2H2K + 4H3 ln 6HK(K + 1) where the last inequality is since both the values and cumulative rewards are bounded in [0, H]. Combining both, we get Vars Ph(sk h)(V πk h (sk h, s )) q nk 1 h (sk h) 1 2H2K + 4H3 ln 6HK(K + 1) SH(2 + ln(K)) 2H2K + 4H3 ln 6HK(K + 1) 2SH ln 6HK(K + 1) C.9 Example: Value Gain due to Transition Lookahead Figure 2: Random chain: agents start at the left side and must reach its right side to collect a reward. We now present in further detail the example described at Section 3. This example is inspired by the one in Appendix C.3 in [Merlis et al., 2024], greatly simplifying it and achieving similar behavior for a much smaller environment. Agents start at the left side of a chain of length H/2 (depicted in Figure 2) and have two options: 1. Play a safe action a1 that leaves the agent in the same state (in green), or, 2. play one of the A 1 risky actions a2, . . . , a A (in red). Each of these actions moves the agent forward in the chain w.p. 1 A 1, but leads to a terminal non-rewarding state w.p. 1 1 A 1. At the end of the chain, the last state is an absorbing state with a unit reward. Without lookahead, all agents can do is try to randomly reach the end of the chain, succeeding with probability (A 1) H/2. In particular, such agents cannot collect more than V no H(A 1) H/2. On the other hand, with transition lookahead, agents observe whether the risky actions allow moving forward in the chain or lead to the bad terminal state. If one action allows progressing in the chain (which happens w.p. p = 1 1 1 A 1 A 1 1 1/e), a lookahead agent would take it, and otherwise, they will use a1 to remain in the same state. In other words, optimal lookahead agents reach the reward after H/2 1 successful progression steps with probability p each. The probability of reaching the end of the chain using less than 5H/6 steps is at least c0, for some absolute c0 > 0. Under this event, the agent collects H 6 rewards, so the lookahead value is at least V T, c0H To summarize, for this example, no lookahead optimal value is at most HA H/2, while transition lookahead agents can collect a value of H: transition lookahead increases the value by an exponential multiplicative factor. 
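The gap in this example can also be visualized with a quick Monte-Carlo sketch of the chain of Figure 2, contrasting the best no-lookahead behavior (gambling on a risky action at every step) with the lookahead policy described above. The parameters and the per-step reward accounting are illustrative and are not meant to reproduce the exact constants.

```python
import numpy as np

# A Monte-Carlo sketch of the chain of Figure 2 (illustrative parameters; not meant to
# reproduce exact constants). One unit of reward is collected per step spent at the end of the chain.
rng = np.random.default_rng(3)
H, A, n_runs = 12, 4, 20000
goal = H // 2 - 1                        # H/2 - 1 successful forward moves reach the rewarding state

def run(lookahead: bool) -> float:
    pos, total = 0, 0.0
    for _ in range(H):
        if pos == goal:                  # absorbing rewarding state
            total += 1.0
            continue
        if pos < 0:                      # terminal non-rewarding state
            continue
        # Each of the A-1 risky actions moves forward w.p. 1/(A-1) and otherwise terminates the episode.
        forward = rng.uniform(size=A - 1) < 1.0 / (A - 1)
        if lookahead:
            # Observed outcomes: progress if some risky action moves forward, otherwise play the safe action.
            pos = pos + 1 if forward.any() else pos
        else:
            # Without lookahead, the only way to make progress is to gamble on a risky action.
            pos = pos + 1 if forward[0] else -1
    return total

for mode in (False, True):
    print("lookahead" if mode else "no lookahead",
          np.mean([run(mode) for _ in range(n_runs)]))
```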
The difference between the two values is GT = Ω(H), and following the discussion in Section 3, a sublinear transition lookahead regret would imply a negatively linear standard regret of Reg(K) HK. Remark 3. The chain length was chosen to be H/2 for simplicity similar conclusions can be achieved for a length of 1 1/e. Then, the multiplicative increase in value due to transition lookahead would be (A 1)(1 1 e)H, matching Proposition 2 in [Merlis et al., 2024]. In fact, setting the transition from the last state of the chain to the terminal state (rendering it possible to earn only one unit of reward), the analysis coincides with the one in [Merlis et al., 2024]. Following their exact derivation, the value with lookahead information is multiplicatively larger than its no-lookahead factor by an exponential factor of Θ (A 1)min{(1 1 e)H 1,S} 2 . This significantly improves the result in [Merlis et al., 2024], that only holds if S A(1 1 D Auxiliary Lemmas In this appendix, we prove various auxiliary lemma that will be used throughout our proofs. D.1 Concentration results We first present and reprove a set of well-known concentration results. Lemma 16. Let P be a distribution over a discrete set X of size |X| = M and let X, X1, . . . , Xn be independent samples from this distribution. Also, let U : X 7 [0, C] for some C > 0 and define the empirical distribution ˆPn(x) = 1 n Pn i=1 1{xi = x}. Then, for any δ (0, 1), each of the following events hold w.p. at least 1 δ: x X, |P(x) ˆPn(x)| 2P(x) ln 2M δ n + 2 ln 2M ˆPn(x) P(x) U(x) 2Var P (U(X)) ln 2 δ n + 2C ln 2 Var ˆ Pn(U(X)) p Var P (U(X)) 4C where Var P (U(X)) = P x X P(x)U(x)2 P x X P(x)U(x) 2. Proof. All the results require standard probability arguments and are stated for completeness. For the first event Ep, notice that each of the components ˆPn(x) is the empirical mean of independent Bernoulli random variables Xi(x) of mean P(x). Therefore, by Bernstein s inequality, recalling that the variance of the variable Ber(p) is p(1 p), we get w.p. at least 1 δ M that |P(x) ˆPn(x)| 2P(x)(1 P(x)) ln 2M δ n + 2 ln 2M 2P(x) ln 2M δ n + 2 ln 2M Taking the union bound over all x X implies that Ep holds w.p. at least 1 δ. For the second event Epv1, we apply Bernstein s inequality on the variables Yi = U(Xi). The empirical mean is given by ˆYn = 1 n P i U(Xi) = P x X ˆPn(x)U(x) and its average is E[Y ] = P x X P(x)U(x). Similarly, the variance of the random variables is Var(Y ) = Var P (U(X)). Thus, by Bernstein s inequality, w.p. at least 1 δ, 2Var(Y ) ln 2 δ n + 2C ln 2 Stating the bounds in terms of Xi leads to the second event. For the last event, we follow the analysis of [Efroni et al., 2021, Lemma 19], which in turn, relies on [Maurer and Pontil, 2009, Theorem 10]. Define Vn = 1 2n(n 1) Pn i,j=1(U(Xi) U(Xj))2. This is a well-known unbiased variance estimator, namely, E[Vn] = Var P (U(X)), and by [Maurer and Pontil, 2009, Theorem 10], for any δ > 0 it holds w.p. at least 1 δ that Var P (U(X)) C where we scaled the bound by C to account for the values being in [0, C]. Next, we relate Vn to the empirical variance. By elementary algebra, we have Vn = 1 2n(n 1) i,j=1 (U(Xi) U(Xj))2 i=1 U(Xi)2 1 n(n 1) i =j U(Xi)U(Xj) i=1 U(Xi)2 n (n 1) x X ˆPn(x)U(x)2 x X ˆPn(x)U(x) i=1 U(Xi)2 1 n2(n 1) The first two terms are exactly the variance w.r.t. 
Next, we present a short lemma that allows moving between different spaces of probabilities.

Lemma 17. Let $\mathcal{X}$ be a finite set and let $X_1,\dots,X_n \in \mathcal{X}$. Also, let $E_1,\dots,E_m \subseteq \mathcal{X}$ be a partition of the set $\mathcal{X}$, namely, for all $i \ne j$, $E_i \cap E_j = \emptyset$ and $\cup_{i=1}^m E_i = \mathcal{X}$. Finally, let $f : \mathcal{X} \to \mathbb{R}$ be such that for all $i \in [m]$ and $x \in E_i$, it holds that $f(x) = f(i)$, and define

$$
\hat P_n(x) = \frac{1}{n}\sum_{\ell=1}^n \mathbb{1}\{X_\ell = x\}, \quad\text{and}\quad \hat Q_n(i) = \frac{1}{n}\sum_{\ell=1}^n \mathbb{1}\{X_\ell \in E_i\}.
$$

Then, the following hold:

1. $\hat Q_n(i) = \hat P_n(E_i) \triangleq \sum_{x\in E_i} \hat P_n(x)$ and, in particular, $\mathbb{E}_{i\sim\hat Q_n}[f(i)] = \mathbb{E}_{x\sim\hat P_n}[f(x)]$.

2. If $P$ is a distribution over $\mathcal{X}$ and $X_1,\dots,X_n \in \mathcal{X}$ are i.i.d. samples from $P$, then $\mathbb{E}[\hat Q_n(i)] = P(E_i) \triangleq Q(i)$. It also holds that $\mathbb{E}_{x\sim P}[f(x)] = \mathbb{E}_{i\sim Q}[f(i)]$.

Proof. For the first part, we have by definition that

$$
\hat Q_n(i) = \frac{1}{n}\sum_{\ell=1}^n \mathbb{1}\{X_\ell \in E_i\} = \sum_{x\in E_i} \frac{1}{n}\sum_{\ell=1}^n \mathbb{1}\{X_\ell = x\} = \sum_{x\in E_i} \hat P_n(x) = \hat P_n(E_i).
$$

In particular, it holds that

$$
\mathbb{E}_{i\sim\hat Q_n}[f(i)] = \sum_{i=1}^m \hat Q_n(i)f(i) = \sum_{i=1}^m \sum_{x\in E_i} \hat P_n(x)f(i) \overset{(1)}{=} \sum_{i=1}^m \sum_{x\in E_i} \hat P_n(x)f(x) \overset{(2)}{=} \sum_{x\in\mathcal{X}} \hat P_n(x)f(x) = \mathbb{E}_{x\sim\hat P_n}[f(x)],
$$

where (1) is since $f$ is constant inside $E_i$ and (2) is since $\{E_i\}_{i=1}^m$ partition $\mathcal{X}$. For the second part of the statement, notice that since the samples are i.i.d., it holds that $\mathbb{E}\big[\hat P_n(x)\big] = P(x)$, and therefore,

$$
\mathbb{E}\big[\hat Q_n(i)\big] = \sum_{x\in E_i} P(x) = P(E_i) = Q(i).
$$

Finally, as in the first part of the statement, it holds that

$$
\mathbb{E}_{i\sim Q}[f(i)] = \sum_{i=1}^m Q(i)f(i) = \sum_{i=1}^m \sum_{x\in E_i} P(x)f(i) = \sum_{i=1}^m \sum_{x\in E_i} P(x)f(x) = \sum_{x\in\mathcal{X}} P(x)f(x) = \mathbb{E}_{x\sim P}[f(x)].
$$

Finally, we present two specialized concentration results that are needed for reward and transition lookahead, respectively.

Lemma 18. Let $X, X_1,\dots,X_n \in \mathbb{R}^d$ be i.i.d. random vectors with components in $[0,1]$ and let $C \ge 1$ be some constant. Then, for any $\delta \in (0,1)$, with probability at least $1-\delta$, uniformly over all $u\in[0,C]^d$,

$$
\bigg| \mathbb{E}\Big[\max_{i\in[d]}\{X(i) + u(i)\}\Big] - \frac{1}{n}\sum_{\ell=1}^n \max_{i\in[d]}\{X_\ell(i) + u(i)\} \bigg| \le 3\sqrt{\frac{d\ln\frac{6Cn}{\delta}}{2n}} .
$$

Proof. Denote $m(u) = \mathbb{E}\big[\max_{i\in[d]}\{X(i)+u(i)\}\big]$ and $\hat m(u) = \frac{1}{n}\sum_{\ell=1}^n \max_{i\in[d]}\{X_\ell(i)+u(i)\}$, and fix any $u\in[0,C]^d$. Since the variables are bounded in $[0,1]$, their maximum is bounded almost surely in $[\max_i u(i), \max_i u(i) + 1]$, namely, an interval of unit length. Therefore, by Hoeffding's inequality, for any $\delta\in(0,1)$, w.p. $1-\delta$,

$$
|m(u) - \hat m(u)| \le \sqrt{\frac{\ln\frac{2}{\delta}}{2n}}.
$$

Now, for some $\epsilon\in(0,C]$, let $u_\epsilon$ be the closest vector to $u$ on the grid $\{0,\epsilon,2\epsilon,\dots,C\}^d$. Then, it clearly holds that $|m(u)-\hat m(u)| \le |m(u_\epsilon)-\hat m(u_\epsilon)| + 2\epsilon$. Taking the union bound over all $\big(\frac{C}{\epsilon}+1\big)^d$ possible choices for $u_\epsilon$ and fixing $\delta' = \delta\big/\big(\frac{C}{\epsilon}+1\big)^d$, we get w.p. $1-\delta$, for all $u$,

$$
|m(u)-\hat m(u)| \le \sqrt{\frac{\ln\frac{2\big(\frac{C}{\epsilon}+1\big)^d}{\delta}}{2n}} + 2\epsilon.
$$

Now, fixing $\epsilon = \sqrt{\frac{\ln\frac{2}{\delta}}{2n}}$ and noting that $\frac{C}{\epsilon}+1 \le 3Cn$ for $C \ge 1$, we get

$$
|m(u)-\hat m(u)| \le \sqrt{\frac{d\ln\frac{6Cn}{\delta}}{2n}} + 2\sqrt{\frac{\ln\frac{2}{\delta}}{2n}} \le 3\sqrt{\frac{d\ln\frac{6Cn}{\delta}}{2n}} .
$$
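The covering argument behind Lemma 18 can also be illustrated numerically. The snippet below (with toy dimensions, sample sizes and grid resolution that are our own choices, not the paper's) checks the 1-Lipschitz discretization step and estimates the worst-case deviation of the empirical expected maximum over many shift vectors $u$, using a large independent sample as a stand-in for the true expectation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy illustration of the uniform concentration in Lemma 18 (all sizes are assumptions).
d, C, n, n_ref = 3, 1.0, 2_000, 100_000
X_ref = rng.random((n_ref, d))   # large independent sample, used as a proxy for the expectation
X_n = rng.random((n, d))         # the n samples available to the empirical estimate

def m_hat(u, X):
    """(1/len(X)) * sum_l max_i {X_l(i) + u(i)}."""
    return np.max(X + u, axis=1).mean()

# Key smoothness step: |max_i{x(i)+u(i)} - max_i{x(i)+u_eps(i)}| <= max_i |u(i) - u_eps(i)|.
u = rng.uniform(0, C, d)
u_eps = np.round(u / 0.1) * 0.1       # snap u to a grid of resolution eps = 0.1
x = rng.random(d)
assert abs(np.max(x + u) - np.max(x + u_eps)) <= np.abs(u - u_eps).max() + 1e-12

# The deviation stays small simultaneously over many shift vectors u, as the lemma asserts.
devs = [abs(m_hat(u, X_n) - m_hat(u, X_ref)) for u in rng.uniform(0, C, size=(200, d))]
print(f"worst deviation over 200 random u's: {max(devs):.4f}")
```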
Lemma 19. Let $X, X_1,\dots,X_n$ be i.i.d. random vectors with $d$ components supported over the discrete set $[m]$, and let $C \ge 1$ be some constant. Then, uniformly over all $u\in[0,C]^{dm}$, w.p. $1-\delta$,

$$
\bigg| \mathbb{E}\Big[\max_{i}\{u(X(i), i)\}\Big] - \frac{1}{n}\sum_{\ell=1}^n \max_{i}\{u(X_\ell(i), i)\} \bigg| \le \sqrt{\frac{2md\,\mathrm{Var}\big(\max_i\{u(X(i),i)\}\big)\ln\frac{6n}{\delta}}{n}} + \frac{8Cmd\ln\frac{6n}{\delta}}{n} .
$$

Proof. We follow a similar path to Lemma 18 and use a covering argument. Denoting $w(u) = \mathbb{E}[\max_i\{u(X(i),i)\}]$ and $\hat w(u) = \frac{1}{n}\sum_{\ell=1}^n \max_i\{u(X_\ell(i),i)\}$, by Bernstein's inequality, for any $\delta\in(0,1)$ and fixed $u\in[0,C]^{dm}$, it holds w.p. $1-\delta$ that

$$
|w(u)-\hat w(u)| \le \sqrt{\frac{2\,\mathrm{Var}\big(\max_i\{u(X(i),i)\}\big)\ln\frac{2}{\delta}}{n}} + \frac{2C\ln\frac{2}{\delta}}{3n}. \qquad (17)
$$

Now, for some $\epsilon\in(0,C]$, let $u_\epsilon$ be the closest matrix to $u$ on the grid $\{0,\epsilon,2\epsilon,\dots,C\}^{md}$ and denote $Z(u) = \max_i\{u(X(i),i)\}$ with samples $Z_\ell(u)$. By the smoothness of the max function, it holds that $|Z(u)-Z(u_\epsilon)| \le \epsilon$. In particular, we also have that $\big|\mathbb{E}[Z(u)^2] - \mathbb{E}[Z(u_\epsilon)^2]\big| \le \epsilon^2 + 2C\epsilon$ and $\big|\mathbb{E}[Z(u)]^2 - \mathbb{E}[Z(u_\epsilon)]^2\big| \le \epsilon^2 + 2C\epsilon$, so we have

$$
\Big|\mathrm{Var}\big(\max_i\{u(X(i),i)\}\big) - \mathrm{Var}\big(\max_i\{u_\epsilon(X(i),i)\}\big)\Big| = \big|\mathrm{Var}(Z(u)) - \mathrm{Var}(Z(u_\epsilon))\big| \le 2\epsilon^2 + 4C\epsilon.
$$

Similarly, it holds that $|w(u)-\hat w(u)| \le |w(u_\epsilon)-\hat w(u_\epsilon)| + 2\epsilon$. Taking the union bound over all $\big(\frac{C}{\epsilon}+1\big)^{md}$ possible choices for $u_\epsilon$ and fixing $\delta' = \delta\big/\big(\frac{C}{\epsilon}+1\big)^{md}$, we get w.p. $1-\delta$, for all $u$,

$$
|w(u)-\hat w(u)| \le \sqrt{\frac{2md\,\mathrm{Var}\big(\max_i\{u_\epsilon(X(i),i)\}\big)\ln\frac{6C}{\epsilon\delta}}{n}} + \frac{2Cmd\ln\frac{6C}{\epsilon\delta}}{3n} + 2\epsilon
$$

$$
\le \sqrt{\frac{2md\Big(\mathrm{Var}\big(\max_i\{u(X(i),i)\}\big) + 2\epsilon^2 + 4C\epsilon\Big)\ln\frac{6C}{\epsilon\delta}}{n}} + \frac{2Cmd\ln\frac{6C}{\epsilon\delta}}{3n} + 2\epsilon
$$

$$
\le \sqrt{\frac{2md\,\mathrm{Var}\big(\max_i\{u(X(i),i)\}\big)\ln\frac{6C}{\epsilon\delta}}{n}} + \sqrt{\frac{8md\,C\epsilon\ln\frac{6C}{\epsilon\delta}}{n}} + \sqrt{\frac{4md\,\epsilon^2\ln\frac{6C}{\epsilon\delta}}{n}} + \frac{2Cmd\ln\frac{6C}{\epsilon\delta}}{3n} + 2\epsilon.
$$

Now, fixing $\epsilon = \frac{C\ln\frac{6n}{\delta}}{n}$ and noticing that $\frac{6C}{\epsilon\delta} \le \frac{6n}{\delta}$, we get

$$
|w(u)-\hat w(u)| \le \sqrt{\frac{2md\,\mathrm{Var}\big(\max_i\{u(X(i),i)\}\big)\ln\frac{6n}{\delta}}{n}} + \frac{8Cmd\ln\frac{6n}{\delta}}{n} .
$$

D.2 Count-Related Lemmas

Lemma 20. The following bounds hold:

$$
\sum_{k=1}^K\sum_{h=1}^H \frac{1}{\sqrt{n_h^{k-1}(s_h^k,a_h^k)\vee 1}} \le SAH + 2\sqrt{SAH^2K}, \qquad
\sum_{k=1}^K\sum_{h=1}^H \frac{1}{n_h^{k-1}(s_h^k,a_h^k)\vee 1} \le SAH(2+\ln K),
$$

$$
\sum_{k=1}^K\sum_{h=1}^H \frac{1}{\sqrt{n_h^{k-1}(s_h^k)\vee 1}} \le SH + 2\sqrt{SH^2K}, \qquad
\sum_{k=1}^K\sum_{h=1}^H \frac{1}{n_h^{k-1}(s_h^k)\vee 1} \le SH(2+\ln K).
$$

Proof. Recall that every time a state (or state-action pair) is visited, its visitation count is increased by 1, up to $n_h^{K-1}(s,a)$ at the last episode. Therefore, we can write

$$
\sum_{k=1}^K\sum_{h=1}^H \frac{1}{\sqrt{n_h^{k-1}(s_h^k,a_h^k)\vee 1}}
= \sum_{h=1}^H\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}} \sum_{k=1}^K \frac{\mathbb{1}\big\{s_h^k=s,\,a_h^k=a\big\}}{\sqrt{n_h^{k-1}(s,a)\vee 1}}
\le \sum_{h=1}^H\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}} \Bigg(1 + \sum_{t=1}^{n_h^{K-1}(s,a)} \frac{1}{\sqrt{t}}\Bigg)
$$

$$
\le SAH + 2\sum_{h=1}^H\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}} \sqrt{n_h^{K-1}(s,a)}
\le SAH + 2\sqrt{SAH\sum_{h,s,a} n_h^{K-1}(s,a)} \qquad (\text{Jensen's inequality})
$$

$$
\le SAH + 2\sqrt{SAH^2K},
$$

where we bounded the total number of visits by the number of steps $HK$. Similarly, we also have

$$
\sum_{k=1}^K\sum_{h=1}^H \frac{1}{n_h^{k-1}(s_h^k,a_h^k)\vee 1}
\le \sum_{h=1}^H\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}} \Bigg(1 + \sum_{t=1}^{n_h^{K-1}(s,a)} \frac{1}{t}\Bigg)
\le \sum_{h,s,a} \Big(2 + \ln n_h^{K-1}(s,a)\Big)
\le SAH(2+\ln K).
$$

We can likewise prove the inequalities for the state counts as follows:

$$
\sum_{k=1}^K\sum_{h=1}^H \frac{1}{\sqrt{n_h^{k-1}(s_h^k)\vee 1}}
= \sum_{h=1}^H\sum_{s\in\mathcal{S}} \sum_{k=1}^K \frac{\mathbb{1}\big\{s_h^k=s\big\}}{\sqrt{n_h^{k-1}(s)\vee 1}}
\le \sum_{h=1}^H\sum_{s\in\mathcal{S}} \Bigg(1 + \sum_{t=1}^{n_h^{K-1}(s)} \frac{1}{\sqrt{t}}\Bigg)
\le SH + 2\sqrt{SH\sum_{h,s} n_h^{K-1}(s)}
\le SH + 2\sqrt{SH^2K}, \qquad (\text{Jensen's inequality})
$$

$$
\sum_{k=1}^K\sum_{h=1}^H \frac{1}{n_h^{k-1}(s_h^k)\vee 1}
\le \sum_{h=1}^H\sum_{s\in\mathcal{S}} \Big(2 + \ln n_h^{K-1}(s)\Big)
\le SH(2+\ln K).
$$

D.3 Analysis of Variance terms

Lemma 21. Let $P$ be a distribution over a finite set $\mathcal{X}$ and let $X\sim P$. Also, let $V_1, V_2 : \mathcal{X}\to[0,C]$ for some $C>0$ be such that $V_1(x) \le V_2(x)$ for all $x\in\mathcal{X}$. Then, for any $\alpha, n > 0$, it holds that

$$
\sqrt{\frac{\mathrm{Var}_P(V_2(X))}{n}} \le \sqrt{\frac{\mathrm{Var}_P(V_1(X))}{n}} + \frac{1}{\alpha}\,\mathbb{E}_P[V_2(X)-V_1(X)] + \frac{C\alpha}{4n}.
$$

Proof. By Lemma 26, we have

$$
\sqrt{\mathrm{Var}_P(V_2(X))} - \sqrt{\mathrm{Var}_P(V_1(X))} \le \sqrt{\mathrm{Var}_P(V_2(X)-V_1(X))} \le \sqrt{\mathbb{E}_P\big[(V_2(X)-V_1(X))^2\big]} \le \sqrt{C\,\mathbb{E}_P[V_2(X)-V_1(X)]},
$$

where the last inequality is by the boundedness and since $V_1(x) \le V_2(x)$. Thus, we can bound

$$
\sqrt{\frac{\mathrm{Var}_P(V_2(X))}{n}} - \sqrt{\frac{\mathrm{Var}_P(V_1(X))}{n}} \le \sqrt{\frac{C\,\mathbb{E}_P[V_2(X)-V_1(X)]}{n}} = \sqrt{\mathbb{E}_P[V_2(X)-V_1(X)]}\cdot\sqrt{\frac{C}{n}} \le \frac{1}{\alpha}\,\mathbb{E}_P[V_2(X)-V_1(X)] + \frac{C\alpha}{4n},
$$

where the last inequality is due to Young's inequality ($ab \le \frac{1}{\alpha}a^2 + \frac{\alpha}{4}b^2$ for all $\alpha > 0$).

Lemma 22. Let $P, \bar P$ be distributions over a finite set $\mathcal{X}$ and let $X\sim P$. Also, let $V_1, V_2, V_3 : \mathcal{X}\to[0,C]$ for some $C>0$ be such that $V_1(x) \le V_2(x) \le V_3(x)$ for all $x\in\mathcal{X}$. Finally, assume that $\big|\sqrt{\mathrm{Var}_{\bar P}(V_2(X))} - \sqrt{\mathrm{Var}_P(V_2(X))}\big| \le \beta$ for some $\beta > 0$. Then, for any $\alpha, n > 0$, it holds that

$$
\sqrt{\frac{\mathrm{Var}_{\bar P}(V_3(X))}{n}} \le \sqrt{\frac{\mathrm{Var}_P(V_1(X))}{n}} + \frac{1}{\alpha}\,\mathbb{E}_{\bar P}[V_3(X)-V_2(X)] + \frac{1}{\alpha}\,\mathbb{E}_P[V_2(X)-V_1(X)] + \frac{C\alpha}{2n} + \frac{\beta}{\sqrt{n}}
$$

$$
\le \sqrt{\frac{\mathrm{Var}_P(V_1(X))}{n}} + \frac{1}{\alpha}\,\mathbb{E}_{\bar P}[V_3(X)-V_1(X)] + \frac{1}{\alpha}\,\mathbb{E}_P[V_3(X)-V_1(X)] + \frac{C\alpha}{2n} + \frac{\beta}{\sqrt{n}}.
$$

Proof. We decompose the l.h.s. as follows:

$$
\sqrt{\frac{\mathrm{Var}_{\bar P}(V_3(X))}{n}}
= \Bigg(\sqrt{\frac{\mathrm{Var}_{\bar P}(V_3(X))}{n}} - \sqrt{\frac{\mathrm{Var}_{\bar P}(V_2(X))}{n}}\Bigg)
+ \Bigg(\sqrt{\frac{\mathrm{Var}_{\bar P}(V_2(X))}{n}} - \sqrt{\frac{\mathrm{Var}_{P}(V_2(X))}{n}}\Bigg)
+ \Bigg(\sqrt{\frac{\mathrm{Var}_{P}(V_2(X))}{n}} - \sqrt{\frac{\mathrm{Var}_{P}(V_1(X))}{n}}\Bigg)
+ \sqrt{\frac{\mathrm{Var}_{P}(V_1(X))}{n}}.
$$

We bound the first and third terms using Lemma 21, and bound the second term using the assumption, to get

$$
\sqrt{\frac{\mathrm{Var}_{\bar P}(V_3(X))}{n}}
\le \sqrt{\frac{\mathrm{Var}_P(V_1(X))}{n}} + \frac{1}{\alpha}\,\mathbb{E}_{\bar P}[V_3(X)-V_2(X)] + \frac{1}{\alpha}\,\mathbb{E}_P[V_2(X)-V_1(X)] + \frac{C\alpha}{2n} + \frac{\beta}{\sqrt{n}}
$$

$$
\le \sqrt{\frac{\mathrm{Var}_P(V_1(X))}{n}} + \frac{1}{\alpha}\,\mathbb{E}_{\bar P}[V_3(X)-V_1(X)] + \frac{1}{\alpha}\,\mathbb{E}_P[V_3(X)-V_1(X)] + \frac{C\alpha}{2n} + \frac{\beta}{\sqrt{n}},
$$

where the last inequality uses the fact that $V_1(x) \le V_2(x) \le V_3(x)$ for all $x\in\mathcal{X}$. The two bounds are the desired results.
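The inequality in Lemma 21 (which also drives the decomposition in Lemma 22) is easy to stress-test numerically. The snippet below draws random instances and verifies the bound on each; the support sizes, value range and number of trials are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def var_p(P, V):
    """Variance of V(X) when X ~ P over a finite set."""
    return (P * V**2).sum() - (P * V).sum() ** 2

# Random-instance check of Lemma 21:
# sqrt(Var_P(V2)/n) <= sqrt(Var_P(V1)/n) + E_P[V2 - V1]/alpha + C*alpha/(4n), for V1 <= V2 in [0, C].
C = 5.0
for _ in range(10_000):
    d = rng.integers(2, 8)
    P = rng.dirichlet(np.ones(d))
    V1 = rng.uniform(0, C, d)
    V2 = np.minimum(V1 + rng.uniform(0, C, d), C)     # ensures V1 <= V2 <= C pointwise
    n, alpha = rng.integers(1, 100), rng.uniform(0.1, 10.0)

    lhs = np.sqrt(var_p(P, V2) / n)
    rhs = np.sqrt(var_p(P, V1) / n) + (P * (V2 - V1)).sum() / alpha + C * alpha / (4 * n)
    assert lhs <= rhs + 1e-9
print("Lemma 21's bound held on 10,000 random instances")
```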
E Existing Results

Lemma 23 (Monotonic Bonuses, [Zhang et al., 2023], Appendix C.1). For any $p\in\Delta_S$, $v\in\mathbb{R}_+^S$ s.t. $\|v\|_\infty \le H$, $\delta\in(0,1)$ and positive integer $n$, define the function

$$
f(p, v, n) = p^\top v + \max\Bigg\{ \sqrt{\frac{\mathrm{Var}_p(v)\ln\frac{1}{\delta}}{n}},\; \frac{H\ln\frac{1}{\delta}}{n} \Bigg\}.
$$

Then, the function $f(p, v, n)$ is non-decreasing in each entry of $v$.

Lemma 24 (Efroni et al. 2021, Lemma 28). Let $Y\in\mathbb{R}^S$ be a vector such that $0 \le Y(s) \le H$ for all $s\in\mathcal{S}$. Let $P_1$ and $P_2$ be two transition models and $n\in\mathbb{R}_+^{SA}$. If, for all $(s,a,s')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$ and $h\in[H]$,

$$
\big|P_{2,h}(s'\mid s,a) - P_{1,h}(s'\mid s,a)\big| \le C_1\sqrt{\frac{L_\delta^k\, P_{1,h}(s'\mid s,a)}{n(s,a)\vee 1}} + \frac{C_2 L_\delta^k}{n(s,a)\vee 1}
$$

for some $C_1, C_2 > 0$, then, for any $\alpha>0$,

$$
\big|(P_{1,h} - P_{2,h})Y(s,a)\big| \le \frac{1}{\alpha}\,\mathbb{E}_{s'\sim P_{1,h}(\cdot\mid s,a)}[Y(s')] + \frac{HSL_\delta^k(C_2 + \alpha C_1/4)}{n(s,a)\vee 1}.
$$

Lemma 25 (Efroni et al. 2021, Lemma 27). Let $\{Y_t\}_{t\ge1}$ be a real-valued sequence of random variables adapted to a filtration $\{\mathcal{F}_t\}_{t\ge0}$. Assume that for all $t\ge1$, it holds that $0\le Y_t\le C$ a.s., and let $T\in\mathbb{N}$. Then each of the following inequalities holds with probability greater than $1-\delta$:

$$
\sum_{t=1}^T \mathbb{E}[Y_t\mid\mathcal{F}_{t-1}] \le \Big(1+\frac{1}{2C}\Big)\sum_{t=1}^T Y_t + 2(2C+1)^2\ln\frac{1}{\delta}, \qquad
\sum_{t=1}^T Y_t \le 2\sum_{t=1}^T \mathbb{E}[Y_t\mid\mathcal{F}_{t-1}] + 4C\ln\frac{1}{\delta}.
$$

Lemma 26 (Standard Deviation Differences, e.g., Zanette and Brunskill 2019, lines 48-51). Let $P\in\Delta_d$ be some distribution over $[d]$ and let $V_1, V_2\in\mathbb{R}^d$. Then, it holds that

$$
\sqrt{\mathrm{Var}_P(V_1)} - \sqrt{\mathrm{Var}_P(V_2)} \le \sqrt{\mathrm{Var}_P(V_1 - V_2)}.
$$

Lemma 27 (Law of Total Variance, e.g., Zanette and Brunskill 2019, Lemma 15). For any no-lookahead policy $\pi$, it holds that

$$
\mathbb{E}\Bigg[\sum_{h=1}^H \mathrm{Var}\big(V_{h+1}^\pi(s_{h+1})\mid s_h\big)\,\Bigg|\,\pi, s_1\Bigg]
= \mathbb{E}\Bigg[\bigg(\sum_{h=1}^H r_h(s_h,a_h) - V_1^\pi(s_1)\bigg)^{2}\,\Bigg|\,\pi, s_1\Bigg],
$$

where $\mathrm{Var}(V_{h+1}^\pi(s_{h+1})\mid s_h)$ is the variance of the value at step $h+1$ given the state $s_h$ and under the policy $\pi$, due to the policy randomization and the next-state transition probabilities.

NeurIPS Paper Checklist 1. Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? Answer: [Yes] Justification: In the abstract, we accurately present the setting and its motivation, as well as a summary of the results, all of which are proved in the appendix. Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: The main limitations in this work are a result of the studied setup; some possible extensions and improvements are discussed in the future work section. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Proofs for all the stated results are provided in the appendix. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [NA] Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. 
releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: The paper does not include experiments. Guidelines: The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted. 6. Experimental Setting/Details Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results? Answer: [NA] Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. 
The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material. 7. Experiment Statistical Significance Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [NA] Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text. 8. Experiments Compute Resources Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [NA] Justification: The paper does not include experiments. Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper). 9. Code Of Ethics Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: The paper is purely theoretical and studies a fundamental decision-making model; any ethical issue that might arise would be a core issue in the ethics of applying machine learning, and not tied specifically to this work. Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. 
The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction). 10. Broader Impacts Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Due to the theoretical nature of the paper and the generality of the model, it has no direct societal impact. Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML). 11. Safeguards Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: No data or models are released with this paper. Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort. 12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [NA] Justification: The paper does not use existing assets. Guidelines: The answer NA means that the paper does not use existing assets.
The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: The paper does not release new assets. Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. 
For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.