# Optimistic Planning by Regularized Dynamic Programming

Antoine Moulin 1, Gergely Neu 1

1 Universitat Pompeu Fabra, Barcelona, Spain. Correspondence to: Antoine Moulin, Gergely Neu. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract. We propose a new method for optimistic planning in infinite-horizon discounted Markov decision processes based on the idea of adding regularization to the updates of an otherwise standard approximate value iteration procedure. This technique allows us to avoid contraction and monotonicity arguments typically required by existing analyses of approximate dynamic programming methods, and in particular to use approximate transition functions estimated via least-squares procedures in MDPs with linear function approximation. We use our method to recover known guarantees in tabular MDPs and to provide a computationally efficient algorithm for learning near-optimal policies in discounted linear mixture MDPs from a single stream of experience, and show that it achieves near-optimal statistical guarantees.

1. Introduction

The idea of constructing a confidence set of statistically plausible models and picking a policy that maximizes the expected return in the best of these models can be traced back to the pioneering work of Lai & Robbins (1985) in the context of multi-armed bandit problems, and has been successfully extended to address the exploration-exploitation dilemma in reinforcement learning (RL, Sutton & Barto, 2018). This popular design principle came to be known as optimism in the face of uncertainty, and the associated optimization task as the problem of optimistic planning.

The optimistic principle has driven the development of statistically efficient RL algorithms for a variety of problem settings. Following the work of Brafman & Tennenholtz (2002) and Strehl et al. (2009) on optimistic exploration methods for RL in Markov decision processes (MDPs), a breakthrough was achieved by Jaksch, Ortner, and Auer (2010), whose UCRL2 algorithm was shown to achieve near-optimal regret guarantees in a broad class of tabular MDPs. In subsequent years, their work inspired an impressive amount of follow-up work, leading to a variety of extensions, improvements, and other mutations.

The computational efficiency of such optimistic methods crucially hinges on the implementation of the optimistic planning subroutine. In the work of Jaksch et al. (2010), this was addressed by a procedure called extended value iteration (EVI), which performs dynamic programming (DP) in an auxiliary MDP where the confidence set of models is projected to the space of actions, allowing the realization of arbitrary transitions that are statistically plausible given all past experience. After mild adjustments, the EVI procedure can be shown to give near-optimal solutions to the optimistic planning problem in a computationally efficient manner (cf. Fruit et al., 2018 and Section 38.5.2 in Lattimore & Szepesvári, 2020). Other, even more effective optimistic dynamic programming procedures have been proposed and analyzed (Fruit et al., 2018; Qian et al., 2018). However, these computational developments have been largely restricted to the relatively simple tabular setting.
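To make the EVI subroutine discussed above concrete, the following sketch implements its core step for a discounted tabular MDP: the inner maximization of the expected value over transition vectors lying in an ℓ1 confidence ball around the empirical estimate, followed by a value-iteration sweep. This is background material in the spirit of Jaksch et al. (2010), not the method proposed in this paper; the discounted formulation and all function names are our own illustrative choices.

```python
import numpy as np

def evi_inner_max(p_hat, radius, V):
    """Inner step of extended value iteration (UCRL2-style): maximize <p, V>
    over distributions p with ||p - p_hat||_1 <= radius, by moving mass
    toward the highest-value state and away from the lowest-value ones."""
    p = p_hat.copy()
    best = np.argmax(V)
    p[best] = min(1.0, p_hat[best] + radius / 2.0)
    excess = p.sum() - 1.0
    for s in np.argsort(V):            # lowest-value states first
        if excess <= 0:
            break
        removed = min(p[s], excess) if s != best else 0.0
        p[s] -= removed
        excess -= removed
    return p

def extended_value_iteration(r, P_hat, radii, gamma, n_iter=200):
    """Optimistic value iteration for a tabular MDP, given per-(x, a)
    l1 confidence radii around the empirical transition estimates."""
    nX, nA = r.shape
    V = np.zeros(nX)
    for _ in range(n_iter):
        Q = np.empty((nX, nA))
        for x in range(nX):
            for a in range(nA):
                p_opt = evi_inner_max(P_hat[x, a], radii[x, a], V)
                Q[x, a] = r[x, a] + gamma * p_opt @ V
        V = Q.max(axis=1)
    return V, Q
```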
In recent years, the RL theory literature has seen a massive revival largely due to the breakthrough achieved by Jin, Yang, Wang, and Jordan (2020), who successfully extended the idea of optimistic exploration to a class of largescale MDPs using linear function approximation. While extremely influential, their approach (and virtually all of its numerous follow-ups) are limited to the relatively simple setting of finite-horizon MDPs. The reason for this limitation is inherent in their algorithm design that crucially uses the fact that optimistic planning in finite-horizon MDPs can be solved via a simple backward recursion over the time indices within each episode (Neu & Pike-Burke, 2020). This idea completely fails for infinite-horizon problems where dynamic programming methods should aim to approximate the solution of a fixed-point equation. Solving such fixedpoint equations is possible in the tabular case but no known efficient method exists for linear function approximation, the short reason being that the least-squares transition estimator used in the construction of Jin et al. (2020) cannot be straightforwardly used to build an approximate Bellman operator that satisfies the necessary contraction properties. Optimistic Planning by Regularized Dynamic Programming The best attempt at attacking the infinite-horizon setting under function approximation we are aware of is by Wei, Jahromi, Luo, and Jain (2021), who propose a set of algorithms that are either statistically or computationally efficient, but eventually fall short of providing an algorithm with both of these desired properties. Another good contribution was made by Vial, Parulekar, Shakkottai, and Srikant (2022), who provided approximate DP methods for stochastic shortest path problems with linear transition functions, and analyzed them via studying the concentration properties of the empirical transition operator. This technique did allow them to prove regret bounds, but the guarantees did not reach optimality in terms of scaling with the time horizon unless strong assumptions are made. Notably, Vial et al. (2022) only managed to perform a tight analysis in the special case where the features are orthogonal, which allowed them to reason about contraction properties of the empirical Bellman operator. Lacking a general contraction argument, or another idea that would enable computationally efficient optimistic planning, efficient exploration-exploitation in infinite-horizon MDPs under function approximation has remained unsolved so far. This is the problem we address in this paper in the context of discounted infinite-horizon MDPs. Instead of relying on a contraction argument (or an approximate version thereof), we propose to solve the optimistic planning problem using regularized dynamic programming. In particular, we consider a variant of the Mirror-Descent Modified Policy Iteration (MD-MPI) algorithm of Geist, Scherrer, and Pietquin (2019) that uses a least-squares estimator of the transition kernel and an exploration bonus to define an optimistic regularized Bellman operator. Using arguments from the classic analysis of mirror descent methods, we show that each application of this optimistic operator improves the quality of the policy up to an additive error term that telescopes over the iterations. In other words, we show that each iteration improves over the last one in an average sense. 
This is in stark contrast to arguments used for analyzing previous optimistic planning methods that relied on contraction arguments which guarantee strict improvements to the policy in each iteration. The advantage is that it remains applicable even when the approximate dynamic programming operator is not contractive or monotone (even approximately). Our concrete contribution is applying the above scheme to discounted linear mixture MDPs and showing that it achieves a near-optimal regret bound of order p (B2d H + d2H3 + log |A| H4) T, where d is the feature dimension, B is a bound on the norm of the features, and H = 1 1 γ is the effective horizon. This result implies that our algorithm produces an ε-optimal policy after about B2d H + d2H3 + log |A| H4 /ε2 iterations. Each policy update takes poly(d, H, T) iterations of regularized dynamic programming, each consisting of poly(d, H, T) elementary operations. This is to be contrasted with previous contributions on a similar1 setting by Zhou, He, and Gu (2021), whose policy updates rely on a version of EVI adapted to linear function approximation. Their EVI variants require globally constraining the model parameters in a way that the model is a valid transition kernel. While this last constraint allowed them to reason about contractive properties of the EVI iterations, it is practically impossible to enforce without making strong assumptions on the feature maps and the MDP itself. The difficulty remains even when the property is only required to hold locally in each state. In this sense, our method is the first to obtain near-optimal statistical rates while also being entirely computationally feasible. The rest of this paper is organized as follows. After presenting the notation at the end of this section and the background in Section 2, we introduce our algorithmic framework in Section 3. We provide generic performance guarantees and explain the key steps of the analysis in Section 4. The guarantees are instantiated in the context of tabular and linear mixture MDPs in Section 5. We conclude in Section 6 with a discussion of our contribution along with its limitations. Notation. For a natural number N > 0, we denote [N] = {1, 2, . . . , N}. For a real number M, we define the truncation operator ΠM that acts on functions f defined on a domain A via ΠMf : x 7 max (min [f (x) , M] , 0). For a measurable space (A, F), we define the set of all probability distributions (A), and for any two distributions P, Q (A) such that P Q, we define the relative entropy as DKL (P Q) = Ea P h ln d P d Q(a) i . For a dis- tribution P (A) and a bounded function f RA, we write P, f = Ea P [f(a)] to denote the expectation of f under P, and we will use the same notation for finitedimensional vector spaces to denote inner products. For a finite-dimensional vector v Rd and a square matrix Z Rd d, we will use the notation v Z = p 2. Preliminaries We consider a discounted MDP M = (X, A, r, P, γ, ν0), where X is the finite state space2, A is the finite action space, r : X A [0, 1] is the deterministic reward function assumed to be known3, P : X A (X) is the transition probability distribution, γ (0, 1) is the discount factor, and ν0 (X) is the initial state distribution. The 1We provide a detailed discussion about the differences between our settings in Section 6.1. 2Our results extend to the case where X is a measurable space. The precise definitions require measure-theoretic concepts (Bertsekas & Shreve, 1996). 
For the sake of readability and because they are well understood, we only consider finite state spaces here. 3It is a standard assumption, and removing it only costs a constant factor in the regret (Jaksch et al., 2010). Optimistic Planning by Regularized Dynamic Programming model describes a sequential interaction scheme between a decision-making agent and its environment, where the following steps are repeated for a sequence of rounds t = 1, 2, . . . after the initial state is drawn as X0 ν0: the agent observes the state Xt X, selects an action At A, obtains reward r (Xt, At), and the environment generates the next state Xt+1 P ( |Xt, At). The goal of the agent is to pick its sequence of actions in a way that the total discounted return P t=0 γtr (Xt, At) is as large as possible. Below we describe the most fundamental objects relevant to our work, and refer the reader to the classic book of Puterman (2014) for more context and details. A (stationary) policy is a mapping π : X (A) from a state to a probability measure over actions. The value function and action-value function of a policy π are respectively defined as the functions V π RX and Qπ RX A mapping each state x and state-action pair x, a to V π(x) = Eπ t=0 γtr (Xt, At) Qπ(x, a) = Eπ t=0 γtr (Xt, At) X0 = x, A0 = a where Eπ denotes the expectation with respect to the probability measure Pπ, generated by the interaction between the environment and the policy π. With some abuse of notation, we define the conditional expectation operator P : RX RX A as (Pf) (x, a) = P x X P (x |x, a) f (x ), for f RX . Its adjoint P T acts on distributions µ (X A) as (P Tµ) (x ) = P x,a X A P (x |x, a) µ (x, a). It returns the state distribution realized after starting from the stateaction distribution µ and then taking a step forward in the MDP dynamics. With these, we can simply state the Bellman equations tying together the value functions as V π (x) = Ea π( |x) [Qπ (x, a)] , Qπ = r + γPV π. We also introduce the operator E : RX RX A acting on functions f RX via the assignment (Ef) (x, a) = f (x), and its adjoint via its action ETµ (x) = P a µ (x, a) on distributions µ (X A). The operator E can be thought of as a padding operator over the action space that allows us to use vector notation, while ET applied to a state-action distribution returns the corresponding marginal distribution of states. The adjoint P T (resp. ET) is the operator such that, for any f, g, Pf, g = f, P Tg (resp. E, ET). In a discounted MDP, a policy π induces a unique normalized discounted occupancy measure over the state space, defined for any state x X as νπ (x) = (1 γ) t=0 γt Pπ [Xt = x] . The normalization term (1 γ) guarantees νπ is a probability measure over X. We call the inverse of this normalization constant the effective horizon and denote it by H = 1 1 γ . We also define the associated state-action occupancy measure µπ, defined as µπ (x, a) = νπ (x) π (a|x). State-action occupancy measures are known to satisfy the following recurrence relation that is sometimes called the system of Bellman flow constraints: E Tµπ = γP Tµπ + (1 γ) ν0. (1) Using the state-action occupancy measure, the discounted return of a policy can be written as Rπ γ = 1 1 γ µπ, r . We will use µ to denote an occupancy measure with maximal return and ν = ETµ to denote the associated stateoccupancy measure. Finally, given two policies π, π , we denote DKL (π π ) = (DKL (π ( |x) π ( |x)))x X , and we define H (π π ) = νπ, DKL (π π ) , the conditional relative entropy4. 
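To fix ideas, the following minimal numerical sketch (our own illustration; all variable names are ours) computes the objects defined above for a small randomly generated tabular MDP: the value functions via the Bellman equations, the normalized discounted occupancy measures, the Bellman flow constraint (1), and the identity relating the discounted return to the occupancy measure.

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nA, gamma = 4, 3, 0.9

# random tabular MDP and a random stationary policy
P = rng.random((nX, nA, nX)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nX, nA))
nu0 = np.ones(nX) / nX
pi = rng.random((nX, nA)); pi /= pi.sum(axis=1, keepdims=True)

# state-to-state kernel and expected reward under pi
P_pi = np.einsum('xa,xay->xy', pi, P)
r_pi = (pi * r).sum(axis=1)

# V^pi solves (I - gamma P_pi) V = r_pi; Q^pi = r + gamma P V^pi
V = np.linalg.solve(np.eye(nX) - gamma * P_pi, r_pi)
Q = r + gamma * np.einsum('xay,y->xa', P, V)

# normalized discounted state occupancy and state-action occupancy
nu = (1 - gamma) * np.linalg.solve(np.eye(nX) - gamma * P_pi.T, nu0)
mu = nu[:, None] * pi

# Bellman flow constraint (1): E^T mu = gamma P^T mu + (1 - gamma) nu0
lhs = mu.sum(axis=1)
rhs = gamma * np.einsum('xay,xa->y', P, mu) + (1 - gamma) * nu0
assert np.allclose(lhs, rhs)

# discounted return computed two ways: <nu0, V^pi> and <mu^pi, r> / (1 - gamma)
assert np.isclose(nu0 @ V, (mu * r).sum() / (1 - gamma))
```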
In this paper, we will consider the setting of online learning in discounted MDPs, where the agent aims to produce an ε-optimal policy πout satisfying µ µπout, r ε based on a single stream of experience in the MDP. We will assume that the learner has access to a reset action that drops the agent back to a state randomly drawn from the initial-state distribution ν0, and that the learner follows a stationary policy πt in each round t. We will measure the performance in terms of the number of samples necessary to guarantee that the output policy is ε-optimal. As an auxiliary performance measure, we will also consider the expected regret (or simply, regret)5 of the learner defined as It is easy to see that a regret bound can be converted into sample complexity guarantees. In particular, selecting a time index I uniformly at random from 1, . . . , T and returning πout = πI guarantees that E µ µπout, r = RT which can be made arbitrarily small if RT grows sublinearly and T is set large enough. We note here that, while superficially similar to the discounted regret criterion considered in earlier works like Liu & Su (2020); He 4Technically, this is the conditional relative entropy between the occupancy measures µπ and µπ , but we will keep referring to it in terms of the policies to keep our notations light. We refer to Neu et al. (2017) for further discussion. 5In the related literature, it is more common to define regret as a random variable and bound it with high probability. Our algorithm is only suitable for bounding the expected regret, and thus we only define this quantity here; we defer further discussion to Section 6. Optimistic Planning by Regularized Dynamic Programming et al. (2021) or Zhou et al. (2021), there are some major differences between our objectives. We only point out here that we consider the complexity of producing a good policy to execute from the initial state distribution, whereas theirs measures the suboptimality of the policies along the trajectory traversed by the learner. We defer a further discussion of the two settings to Section 6.1. 3. Algorithm Our approach implements the principle of optimism in the face of uncertainty in discounted MDPs. Instead of aiming to solve an optimistic version of the Bellman optimality equations via extended value iteration as done by Jaksch et al. (2010), our method draws on techniques from convex optimization aiming at average policy improvement. In particular, our planning procedure is based on a regularized version of approximate value iteration and incorporates an optimistic estimate of the associated Bellman operator. Consequently, we refer to our algorithm as RAVI-UCB, standing for Regularized Approximate Value Iteration with Upper Confidence Bounds. RAVI-UCB performs a sequence of regularized Q-function and policy updates as follows. Starting with an initial estimate V0 = 0 and an initial policy π0, it calculates a sequence of updates for k = 1, . . . , K as Qk+1(x, a) = ΠH h r(x, a) + CBk(x, a) + γ b PVk (x, a) i , Vk+1(x) = 1 a πk(a|x)eηQk+1(x,a) ! πk+1(a|x) = πk(a|x)eηQk+1(x,a) P a πk (a |x) eηQk+1(x,a ) . Here, b P is a nominal transition model and CBk is an exploration bonus defined to be large enough to ensure that γ b PVk + CBk γPVk and so that Qk+1 is an upper bound on the regularized Bellman update r + γPVk. The Qfunctions are truncated to the range [0, H] to make sure that the optimistic property above can be ensured by setting a reasonably sized exploration bonus CBk. 
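In the tabular case, a single update of the above form can be sketched as follows, assuming a nominal model and a valid bonus are given (their construction is the subject of Section 5). The helper names are ours, and the snippet is a minimal illustration rather than an exact specification of Algorithm 1.

```python
import numpy as np

def softmax_value(Q, pi_prev, eta):
    """V(x) = (1/eta) * log sum_a pi_prev(a|x) exp(eta * Q(x, a)),
    computed in a numerically stable way."""
    m = Q.max(axis=1, keepdims=True)
    return (m + np.log((pi_prev * np.exp(eta * (Q - m))).sum(axis=1, keepdims=True)) / eta).ravel()

def ravi_ucb_update(Q_k, pi_prev, r, P_hat, CB, gamma, H, eta):
    """One epoch-k update of RAVI-UCB in tabular representation:
    softmax value, multiplicative-weights policy, optimistic truncated Q."""
    V_k = softmax_value(Q_k, pi_prev, eta)                   # shape (nX,)
    pi_k = pi_prev * np.exp(eta * (Q_k - V_k[:, None]))      # new policy
    pi_k /= pi_k.sum(axis=1, keepdims=True)                  # guard against round-off
    PV = np.einsum('xay,y->xa', P_hat, V_k)                  # (P_hat V_k)(x, a)
    Q_next = np.clip(r + CB + gamma * PV, 0.0, H)            # truncation Pi_H
    return V_k, pi_k, Q_next
```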
It is important to note that Qk+1 does not directly attempt to approximate the optimal action-value function Q in the true MDP, which marks a clear departure from previously known optimismbased regret analyses. Instead, our analysis will show that (1 γ) ν0, Vk acts as an optimistic estimate of the optimal return (1 γ) ν0, V in an average sense, and that the total reward of our algorithm can also be bounded in terms of the same quantity. The overall procedure is presented as Algorithm 1. The algorithm proceeds in a sequence of epochs k = 1, 2, . . . , where a new epoch is started by taking the reset action with probability 1 γ in each round, which results in epochs of Algorithm 1 RAVI-UCB. Inputs: Horizon T, learning rate η > 0, initial value V0, initial policy π0. Initialize: t = 1, Q1 = EV0, D1 = . for k = 1, . . . do b Pk = TRANSITION-ESTIMATE (DTk). Vk (x) = 1 a πk 1 (a|x) eηQk(x,a) . πk (a|x) = πk 1 (a|x) eη(Qk(x,a) Vk(x)). CBk = BONUS (DTk). Qk+1 = ΠH h r + CBk + γ b Pk Vk i . repeat Play at πk ( |xt). Observe xt+1. Update Dt+1 = ADD (Dt, {(xt, at, xt+1)}). t = t + 1. With probability 1 γ, reset to initial distribution: xt ν0 and break. until t = T end for average length H = 1 1 γ . At the beginning of each epoch, we update the model estimate b Pk and perform one step of online mirror descent to produce the new policy πk and the associated softmax value function Vk. We then update the exploration bonuses CBk such that they satisfy, for all x, a γP ( |x, a) γ b Pk ( |x, a) , Vk CBk (x, a) . (2) We will refer to exploration bonuses satisfying the above condition as valid. As we will see explicitly in Section 5, the model estimate and bonuses are computed using the data gathered so far, DTk, where Tk denotes the first time index of epoch k. Finally, we apply an optimistic Bellman update to produce a state-action value estimate Qk+1. We highlight that the assignments in Algorithm 1 are only made symbolically for all x, a, and a practical implementation will not necessarily need to loop over the entire stateaction space. Rather, all quantities of interest can be computed on demand while executing the policy in runtime. Finally, to make some of the arguments in Section 4 more convenient to state, we introduce some notation. We let Tk = {Tk, Tk + 1, . . . , Tk+1 1} denote the set of time indices belonging to epoch k, and K (T) denote the total number of epochs. For the sake of analysis, it will be useful to pad out the trajectory of states and actions with the artificial observations (XT +1, AT +1, . . . , XT +, AT +), where T + is the first time that a reset would have occurred had the algorithm been executed beyond time step T. These observations are well-defined random variables, and are introduced to make sure that the last epoch does not require special treatment. Optimistic Planning by Regularized Dynamic Programming 4. Main Result & Analysis Our main technical result regarding the performance of RAVI-UCB is the following regret bound. Theorem 4.1. Let {πk}k and {CBk}k be the policies and exploration bonuses produced by RAVI-UCB over T timesteps, where the input is η = p 2 log |A| / (H2T), V0 = 0 and any policy π0. Suppose that the sequence of bonuses {CBk}k is valid in the sense of Equation (2). Then the policies {πk}k satisfy the following bound: t=1 CBt (Xt, At) 2H4 log |A| T + 2H2. We present the proof of Theorem 4.1 below. In particular, we state a sequence of lemmas whose combination will yield the complete proof. 
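Before turning to the analysis, it may help to see the interaction loop of Algorithm 1 in schematic code, reusing the update sketch given earlier. The environment interface (a reset method sampling from the initial distribution and a step method returning the next state), the known reward table, and the subroutine signatures are assumptions of this sketch rather than part of the formal algorithm specification.

```python
import numpy as np

def run_ravi_ucb(env, T, gamma, eta, nX, nA, H,
                 transition_estimate, bonus, rng=None):
    """Schematic main loop of Algorithm 1 (RAVI-UCB). `env` is assumed to
    expose reset() -> x and step(x, a) -> x_next, and a reward table env.r;
    `transition_estimate` and `bonus` build P_hat and CB from the data so far.
    Assumes ravi_ucb_update from the earlier sketch is in scope."""
    rng = rng or np.random.default_rng()
    data = []                                   # transition tuples (x, a, x')
    pi = np.ones((nX, nA)) / nA                 # pi_0: uniform policy
    Q = np.zeros((nX, nA))                      # Q_1 = E V_0 with V_0 = 0
    t, x = 0, env.reset()
    while t < T:
        # ---- start of a new epoch k: refresh model, bonus, policy ----
        P_hat = transition_estimate(data)       # nominal model from D_{T_k}
        CB = bonus(data)                        # valid exploration bonus
        V, pi, Q = ravi_ucb_update(Q, pi, env.r, P_hat, CB, gamma, H, eta)
        # ---- execute pi_k until the geometric reset fires ----
        while t < T:
            a = rng.choice(nA, p=pi[x])
            x_next = env.step(x, a)
            data.append((x, a, x_next))
            t, x = t + 1, x_next
            if rng.random() < 1 - gamma:        # reset with probability 1 - gamma
                x = env.reset()
                break
    return pi
```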
We will provide the proofs that we believe to be most insightful in the main text, and relegate the more technical ones to Appendix A. The analysis will be split into two main parts: one pertaining to the general properties of our optimistic planning procedure and to the eventual regret bound that can be derived from it, and one concerning the specifics of the setting considered. In particular, we first analyze RAVI-UCB using a generic exploration bonus that we will suppose to be valid , and then show in Section 5 how to derive such valid exploration bonuses in the concrete settings of tabular MDPs and linear mixture MDPs. 4.1. Optimistic Planning We first study the properties of our optimistic planning procedure, without making explicit references to the setting. For this general analysis, we will fix an epoch index k, assume that b Pk is some estimator of the transition kernel P and that the exploration bonus CBk is valid in the sense of Equation (2). We provide the following inequality that will be useful for bounding the suboptimality gaps. Lemma 4.2. Let Qk+1 be the state-action value estimate produced by RAVI-UCB in epoch k, with any input, and assume the bonuses CBk are valid in the sense of Equation (2). Then, r + γPVk Qk+1 r + 2CBk + γPVk, where Vk is the value estimate defined in Algorithm 1. Proof. We start by proving the lower-bound. For each stateaction pair (x, a), we need to handle two separate cases corresponding to whether or not Qk+1 (x, a) is truncated from above. In the first case, we have Qk+1 (x, a) = H, which implies Qk+1 (x, a) = H = 1 + γH r (x, a) + γ PVk (x, a) . (3) Here, we have crucially used the condition Vk H in the inequality, which was made possible by truncating the Q-functions to the range [0, H]. In the other case where Qk+1 (x, a) H, we use the validity of CBk to show the following inequality: Qk+1 (x, a) r (x, a) + CBk (x, a) + γ b Pk Vk (x, a) r (x, a) + γ (PVk) (x, a) , where the first inequality is valid even when a truncation from below happens. For the upper-bound, we proceed similarly and consider the two cases corresponding to whether or not Qk+1 (x, a) is truncated from below in each state-action pair. First considering the case where Qk+1 (x, a) = 0, we observe that Qk+1 (x, a) = 0 r (x, a) + γ (PVk) (x, a) , from which the claim follows due to non-negativity of CBk. As for the other case, we have Qk+1 (x, a) r (x, a) + CBk (x, a) + γ b Pk Vk (x, a) r (x, a) + 2CBk (x, a) + γ PVk (x, a) , where the last step follows from the validity condition on CBk. Our key result regarding the quality of the policies produced by RAVI-UCB is the following. Lemma 4.3. Let K be a fixed number of epochs, and let πk and CBk be the policy and exploration bonus produced by RAVI-UCB in epoch k, where the input is V0 = 0, any policy π0, and any η > 0. Suppose that {CBk}k is a sequence of valid exploration bonuses in the sense of Equation (2). Then, the sequence {πk}k satisfies the following bound: k=1 ( µ , r µπk, r ) 2 k=1 µπk, CBk + 2H η H (π π0) + ηH3K Proof. The main idea of the proof is to show that, under the validity condition of the exploration bonuses, (1 γ) ν0, Vk acts as an approximate upper bound on the optimal return µ , r , up to some additional terms resulting from the use of incremental updates. Thanks to the use of regularization, we can show that these additional terms are small on average, and that the gap between the optimistic value and the return of πk can be bounded in terms of µπk, CBk . 
With this in mind, we begin by rewriting the performance gap of the output policy as follows: k=1 ( µ , r µπk, r ) = k=1 ( k + k) , Optimistic Planning by Regularized Dynamic Programming where we defined k = µ , r (1 γ) ν0, Vk and k = (1 γ) ν0, Vk µπk, r for all k. Let us now fix some k and consider the first term, k. We start by observing that (1 γ) ν0 = ETµ γP Tµ , which allows us to write k = µ , r (1 γ) ν0, Vk = µ , r + γPVk µ , EVk . (4) In order to treat the first term in Equation (4), we use the lower-bound from Lemma 4.2 to obtain k µ , Qk+1 EVk = µ , Qk+1 EVk+1 + µ , EVk+1 EVk . Summing up for all k = 1, . . . , K, we get k=1 k µ , QK+1 EV K+1 + µ , E (VK+1 V1) , where we defined Qk = Pk i=1 Qi and V k = Pk i=1 Vi for any k. By a classic telescoping argument (presented in Lemma C.1), one can show that, for all k, V k (x) = max p (A) p, Qk (x, ) 1 η DKL (p π0 ( |x)) π ( |x) , Qk (x, ) 1 η DKL π ( |x) π0( |x) . Combining this with the previous inequality, we get η H (π π0) + µ , EVK+1 , (5) by definition of the conditional entropy and V1 = 0. We now move on to bounding k. Then, using the upperbound of Lemma 4.2 to lower-bound r, we bound k as follows: k = (1 γ) ν0, Vk µπk, r (1 γ) ν0, Vk µπk, Qk+1 2CBk γPVk = E Tµπk γP Tµπk, Vk µπk, Qk+1 γPVk + 2 µπk, CBk , where we have used (1 γ) ν0 = ETµπk γP Tµπk in the third line. We can then rewrite the current upper-bound as k µπk, EVk Qk+1 + 2 µπk, CBk = µπk, EVk µπk+1, Qk+1 + µπk+1 µπk, Qk+1 + 2 µπk, CBk . To proceed, we use Lemma C.1 to note that µπk+1, Qk+1 = E Tµπk+1, Vk+1 + 1 η DKL (πk+1 πk) , which allows us to continue as k µπk, EVk µπk+1, EVk+1 + µπk+1 µπk, Qk+1 1 η H (πk+1 πk) (6) + 2 µπk, CBk . The last remaining difficulty is to control the second difference in the last inequality. This can be done thanks to the regularization, that makes the occupancy measures change slowly enough . To proceed, we use Pinsker s inequality and the boundedness of Qk+1 to show µπk+1 µπk, Qk+1 H p 2DKL (µπk+1 µπk). Appealing to Lemma A.1, we can bound the last term as DKL (µπk+1 µπk) H H (πk+1 πk) . Using these results, we obtain µπk+1 µπk, Qk+1 1 η H (πk+1 πk) 2H3H (πk+1 πk) 1 η H (πk+1 πk) where the last step follows from the Fenchel Young inequality applied to the convex function f (z) = z2/2. Then, summing up both sides of Equation (6) for all k = 1, . . . , K, k=1 k µπK+1, EVK+1 + ηH3 k=1 µπk, CBk , (7) where we used V1 = 0. Combining Equations (5) and (7), k=1 ( µ , r µπk, r ) 2 k=1 µπk, CBk + 2H η H (π π0) + ηH3 where we used µ µπK+1, EVK+1 2H. 4.2. The Epoch Schedule The final part is to account for the effects of the randomized epoch schedule. Provided that the exploration bonuses are valid, we need to control the sum PT t=1 µπt, CBt . We relate it to a more easily tractable sum in the next lemma. Optimistic Planning by Regularized Dynamic Programming Lemma 4.4. The sequence of policies selected by RAVI-UCB satisfies t=1 µπt, CBt t=1 CBt (Xt, At) The proof is in Appendix A.3. This bound is guaranteed by the epoch schedule used by RAVI-UCB that ensures that within each epoch k of geometric length, the sequence of realized state-action trajectory is distributed according to the occupancy measure of πk. 4.3. Putting Everything Together The proof of Theorem 4.1 concludes by combining the above claims. In anticipation of Section 5, for our main assumption to be satisfied we let δ = 1/T and define the exploration bonuses as in Lemma 5.1 or Lemma 5.4. 
This implies the resulting exploration bonuses are valid with probability at least 1 δ, so on this event we can use Lemma 4.3 to bound the expected regret of RAVI-UCB. Setting π0 as the uniform policy, we get t=1 µπt, CBt η log |A| + ηH3 2 K (T) + 2H , where we used that the expected epoch length is H and H (π π0) log |A|. Noticing that E [K] = (1 γ) T and setting the learning rate η = p 2 log |A| / (H2T), the expected optimization error becomes η log |A| + ηH3K 2H2T log |A|. The remaining terms in the regret bound corresponding to the sum of exploration bonuses can be bounded by appealing to Lemma 4.4. This concludes the proof. 5. Applications We now consider two classes of MDPs and show how to implement our algorithm and derive a regret bound. 5.1. Tabular MDPs For didactic purposes, we first instantiate RAVI-UCB in the setting of tabular MDPs with small state and action spaces. As we will see, a simple application of our framework allows us to recover known guarantees in this setting. The algorithm can be found in Appendix B.1. Let N1(x, a) = 1 and N 1(x, a, x ) = 0 denote the initial counts6 for the tuples (x, a) and 6We initialize N1 at 1 to avoid divisions by zero. (x, a, x ). At epoch k, for t Tk, we update Dt+1 = Nt+1, N t+1 as Nt+1(x, a) = Nt(x, a) + I{Xt=x,At=a} and N t+1(x, a, x ) = N t(x, a, x )+I{Xt=x,At=a,Xt+1=x }. We use b Pk (x |x, a) = NTk(x, a, x )/NTk(x, a) as a model estimate, and given β > 0, the exploration bonuses are defined as CBk(x, a) = β p NTk(x, a) . (8) The following lemma shows that an appropriate choice of the scaling parameter β ensures the validity of the exploration bonuses. Lemma 5.1. Let δ (0, 1). Then, setting the coefficient β = 8H p |X| log (|X| |A| T/δ) guarantees that, with probability 1 δ, the validity condition (2) is satisfied by CBk as defined in Equation (8) for all k. Then, we can bound the bonuses as follows. Lemma 5.2. The sum of exploration bonuses generated by RAVI-UCB satisfies t=1 CBt (Xt, At) |X| |A| T . We refer the reader to previous works for the proofs of the above two lemmas (see, e.g., Jaksch et al., 2010; Fruit et al., 2018). Combining the above two results gives a regret bound of order |X| H p |A| T, as expected. 5.2. Linear Mixture MDPs We now focus on a class of MDPs commonly referred to as linear mixture MDPs (Modi et al., 2020; Ayoub et al., 2020) formally defined as follows. Assumption 5.3 (Linear mixture MDP). There exist a known feature map ψ : X A X Rd, and an unknown θ Rd with θ 2 B such that P(x |x, a) = Pd i=1 θiψi(x, a, x ). Furthermore, for any (x, a) X A, V [0, H]X , x X ψ(x, a, x )V (x ) Here, we suppose M satisfies Assumption 5.3. While remotely related to the notion of linear MDPs (Jin et al., 2020; Yang & Wang, 2019), linear mixture MDPs are a distinct class of models that cannot be captured in that framework, and have been widely studied in the past few years as linear MDPs we refer to Zhou et al. (2021) for further discussion. As often assumed in the related literature, we assume the map φk(x, a) = P x ψ(x, a, x )Vk (x ) can be computed (or approximated) efficiently. We provide a detailed discussion of all such computational matters in Section 6.2. Optimistic Planning by Regularized Dynamic Programming The algorithm is in Appendix B.2. Let λ > 0 be a regularization parameter, Λ1 = λI, and b1 = 0. At epoch k, for t Tk, the data is stored as Dt+1 = (Λt+1, bt+1) where Λt+1 = Λt + φk (xt, at) φk (xt, at) T and bt+1 = bt + φk (xt, at) Vk (xt+1). 
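As an aside, the count-based instantiation of Section 5.1 translates directly into the two subroutines required by Algorithm 1; a minimal sketch (our own, using the N = 1 initialization noted above and recomputing the counts from scratch for clarity) is given below, before we return to the linear mixture construction.

```python
import numpy as np

def make_tabular_subroutines(nX, nA, beta):
    """Count-based TRANSITION-ESTIMATE and BONUS for the tabular setting of
    Section 5.1. Visit counts start at N = 1 to avoid division by zero."""
    def transition_estimate(data):
        N = np.ones((nX, nA))
        Np = np.zeros((nX, nA, nX))
        for (x, a, x_next) in data:
            N[x, a] += 1
            Np[x, a, x_next] += 1
        # note: each row sums to count/(count + 1) due to the N = 1 initialization
        return Np / N[:, :, None]

    def bonus(data):
        N = np.ones((nX, nA))
        for (x, a, _) in data:
            N[x, a] += 1
        return beta / np.sqrt(N)        # CB_k(x, a) = beta / sqrt(N_{T_k}(x, a))

    return transition_estimate, bonus
```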
b Pk = P i bθk,iψi is computed via a least-squares regression, where bθk = Λ 1 Tk b Tk. Given β > 0, the exploration bonuses are defined as CBk(x, a) = β φk(x, a) Λ 1 Tk . (9) We now turn to the validity condition required by Lemma 4.3. Lemma 5.4. Let δ (0, 1). Then, setting the coefficient 2 log 1 + T B2H2 λB guarantees that, with probability 1 δ, the validity condition (2) is satisfied by CBk as defined in Equation (9) for all k. The proof is in Appendix A.2. It relies on standard techniques regarding linear mixture MDPs (Zhou et al., 2021; Cai et al., 2020). One important property required is the boundedness of each Vk that is guaranteed by the truncation. Then, we can bound the sum of the exploration bonuses with the following lemma. Lemma 5.5. The sum of exploration bonuses generated by RAVI-UCB satisfies t=1 CBt (Xt, At) d HT log (T) . The proof (Appendix A.4) follows from a series of small (but somewhat tedious) adjustments of a classic result often referred to as the elliptical potential lemma , the main challenge being dealing with the randomized epoch schedule. Our main technical result regarding the performance of RAVI-UCB is the following. Theorem 5.6. Suppose that RAVI-UCB is run with the uniform policy as π0, V0 = 0, λ = 1, a learning rate η = p 2 log |A| / (H2T), and an exploration parameter 2 log 1 + T B2H2 d + log T + B. Then, the expected regret of RAVI-UCB satisfies (d2H3 + B2d H + H4 log |A|) T . e O ( ) hides logarithmic factors of T, B, d, and H. A perhaps more useful result is the following, derived from an onlineto-batch conversion. Suppose RAVI-UCB returns a policy πout = πU with U being an epoch index chosen uniformly at random from the range of epochs. The following corollary provides a guarantee on the quality of this policy. Corollary 5.7. Let ε > 0. Then, RAVI-UCB run with the same parameters as before outputs a policy πout satisfying E [ µ µπout, r ] ε after Tε steps, with Tε = e O B2d H + d2H3 + H4 log |A| ε 2 . The expectation appearing in the first statement is with respect to the random transitions in the MDP and the epoch scheduling, whereas the expectation in the second one is also with respect to the random choice of the policy. It is possible to remove the former expectation, but the latter is inherent to the online-to-batch conversion process used by our analysis. We will return to this point in Section 6.2. 6. Discussion We now discuss the merits and limitations of our results, and point out directions for future research. 6.1. Results and Comparisons There are many differences between our approach and previously proposed optimistic exploration methods that we are aware of. Perhaps the most interesting novelty in our method is that it radically relaxes the optimistic properties that previous methods strive for: instead of calculating estimates of the value function or the MDP model that are strictly optimistic, we only guarantee that our value estimates are optimistic in an average sense. Thus, during its runtime, our algorithm may execute several policies that do not individually satisfy any optimistic properties, even approximately. We find this property to be curious and believe that the ideas we develop to tackle such notions of average optimism may find other applications. We note though that our planning procedure can be used to produce optimistic policies in a stricter sense by executing several regularized value iteration steps per policy update, until the resulting optimization error vanishes. 
Doing so results in an improved dependency on H by a factor H but comes at the cost of a major computational overhead. While our algorithm is closely related to the MD-MPI method of Geist, Scherrer, and Pietquin (2019) and our proofs feature several similar steps, we remark that the purpose of our analysis is quite different from theirs, even when disregarding the optimistic adjustment we make to the Bellman operators. Taking a close look at their proofs for the special case of zero approximation errors, one can deduce bounds on our quantity of interest that are of the order (H + H(π πK)) /K after K iterations. This is faster than what our analysis provides for approximate DP, which is due to the monotonicity of the exact Bellman operator which allows fast last-iterate convergence. The same rate appears in the analysis of regularized policy iteration methods by Agarwal et al. (2021) (see Theorem 16). Either way, all of Optimistic Planning by Regularized Dynamic Programming these analyses use tools from the analysis of mirror descent first developed by Martinet (1970), Rockafellar (1976), and Nemirovski & Yudin (1983) (see also Beck & Teboulle, 2003). Note that, as the guarantees of these regularizationbased methods hold on arbitrary data sequences, our regret guarantees trivially generalize to the case where the rewards change over time in a potentially adversarial fashion (as in, e.g., Even-Dar et al., 2009; Cai et al., 2020). Another line of work that our contribution seemingly fits into is the one initiated by Liu & Su (2020) on the topic of regret minimization for discounted MDPs (see also He et al., 2021; Zhou et al., 2021). A closer look reveals that their objective is quite different from ours, in that they aim to upper bound PT t=1 (V (Xt) V πt (Xt)) along the trajectory traversed by the learning agent. This notion of regret has been motivated by a formerly popular notion of sample complexity of exploration in discounted MDPs we highlight Kakade (2003); Strehl et al. (2009) out of the abundant PAC-MDP literature on this subject. This performance measure is in fact not comparable to ours in almost any possible sense. In fact, it is easy to see that this notion may fail to capture the sample complexity of learning a good policy in a meaningful way: a policy that immediately enters a trap state that yields zero reward until the end of time will only incur a constant regret of order 1 1 γ , even if there is a policy that yields a steady stream of +1 rewards in each round. Thus, without making stringent assumptions about the MDP that rule out such undesirable scenarios, the value of minimizing this notion of discounted regret may be questionable. 6.2. Limitations and Future Directions On a related note, our method suffers from the limitation of requiring access to a reset action taking the agent back to the initial distribution ν0 at any time. In general, this is necessary to achieve our objectives. Indeed, in MDPs where all states around the initial distribution are transient, it is impossible to learn a good policy from a single stream of experience without resets since the agent only gets to visit the relevant part of the state space once. We thus believe these issues are inherent to learning in discounted MDPs. Another limitation is that our guarantees only hold on expectation as opposed to high probability. In fact, several of our results can be strengthened to hold in this stronger sense, albeit at the cost of a more involved analysis. 
In particular, the only parts of our analysis that need to be changed are Lemmas 4.4 and 5.5, to deal with the randomized epoch schedule. The first of these can be handled via a martingale argument and the second by bounding the number and length of the epochs with high probability. Both of these changes are conceptually simple, but practically tedious so we omit them for clarity. On the other hand, Corollary 5.7 relies on a randomized online-to-batch conversion, and the result is stated on expectation with respect to the randomization step. Once again, this result can be strengthened to hold with high probability by running a best-policy-selection subroutine on the sequence of policies produced by the algorithm. This post-processing step is standard in the related literature and we omit details here to preserve clarity. Based on our current results, generalizing our techniques to the infinite-horizon average-reward setting seems to be challenging but not impossible. The key step in our proof that requires discounting is setting the truncation level at H = 1 1 γ , which serves the purpose of guaranteeing that our approximate Bellman operator is optimistic. In particular, the truncation level needs to be set large enough so that the inequality of Equation 3 goes through. We see no natural way to extend this condition to the undiscounted setup. We remain hopeful that this challenge can be overcome with more effort (but may potentially need some significant new ideas). Finally, let us remark on the linear mixture MDP assumption that we have used. While arguably well-studied in the past years, this model for linear function approximation has limitations that make it rather difficult to adapt to practical scenarios. The biggest is that learning algorithms in this model need access to an oracle to evaluate sums of the form P x ψ (x, a, x ) V (x ), which can only be performed efficiently in special cases. Options include assuming that ψ (x, a, ) is sparse or the integral can be approximated effectively via Monte Carlo sampling. A major inconvenience that this causes in the implementation of our method is that Q-functions (and policies) cannot be represented effectively with a single low-dimensional object, so these values have to be recalculated on the fly while executing the policy, requiring excessive Monte Carlo integration in runtime. We thus wish to extend our analysis to more tractable MDP models like the model of Jin et al. (2020). While it is straightforward to implement our algorithm for linear MDPs, unfortunately the covering number of the value function class used by our algorithm appears to be too large to allow proving strong performance bounds. On a more positive note, we wish to point out that linear mixture MDPs are still a rich family of models that in general is incomparable to linear MDPs, and can subsume many interesting models we refer to (Ayoub et al., 2020) for further discussion. We are optimistic that the limitations of our current analysis can be eventually removed and our method can be adapted to a much broader class of infinite-horizon MDPs. Acknowledgements G. Neu was supported by the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (Grant agreement No. 950180). Part of this work was done while the second author was visiting the Simons Institute for the Theory of Computing. Optimistic Planning by Regularized Dynamic Programming Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. 
Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. J. Mach. Learn. Res., 22(98):1 76, 2021. Ayoub, A., Jia, Z., Szepesvari, C., Wang, M., and Yang, L. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pp. 463 474, 2020. Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167 175, 2003. Bertsekas, D. and Shreve, S. E. Stochastic optimal control: the discrete-time case, volume 5. Athena Scientific, 1996. Brafman, R. I. and Tennenholtz, M. R-max: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct): 213 231, 2002. Cai, Q., Yang, Z., Jin, C., and Wang, Z. Provably efficient exploration in policy optimization. In International Conference on Machine Learning, pp. 1283 1294. PMLR, 2020. Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006. de la Pe na, Victor, H., Lai, T. L., and Shao, Q.-M. Selfnormalized processes: Limit theory and Statistical Applications, volume 204. Springer, 2009. Self-normalized tail bound appears in Thm. 14.7. Even-Dar, E., Kakade, S. M., and Mansour, Y. Online Markov decision processes. Math. Oper. Res., 34(3): 726 736, 2009. Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pp. 1578 1586. PMLR, 2018. Geist, M., Scherrer, B., and Pietquin, O. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pp. 2160 2169. PMLR, 2019. He, J., Zhou, D., and Gu, Q. Nearly minimax optimal reinforcement learning for discounted MDPs. Advances in Neural Information Processing Systems, 34:22288 22300, 2021. Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563 1600, 2010. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pp. 2137 2143. PMLR, 2020. Kakade, S. On the sample complexity of reinforcement learning. Ph D thesis, Gatsby Computational Neuroscience Unit, University College London, 2003. Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1): 4 22, 1985. Lai, T. L. and Wei, C. Z. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154 166, 1982. Lai, T. L., Robbins, H., and Wei, C. Z. Strong consistency of least squares estimates in multiple regression ii. Journal of multivariate analysis, 9(3):343 361, 1979. Lattimore, T. and Szepesv ari, Cs. Bandit algorithms. Cambridge University Press, 2020. Liu, S. and Su, H. Regret bounds for discounted MDPs. ar Xiv preprint ar Xiv:2002.05138, 2020. Martinet, B. R egularisation d in equations variationnelles par approximations successives. ESAIM: Mathematical Modelling and Numerical Analysis - Mod elisation Math ematique et Analyse Num erique, 4(R3):154 158, 1970. Modi, A., Jiang, N., Tewari, A., and Singh, S. 
Sample complexity of reinforcement learning using linearly combined model ensembles. In International Conference on Artificial Intelligence and Statistics, pp. 2010 2020, 2020. Nemirovski, A. and Yudin, D. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983. Neu, G. and Pike-Burke, C. A unifying view of optimism in episodic reinforcement learning. Advances in Neural Information Processing Systems, 33:1392 1403, 2020. Neu, G., Jonsson, A., and G omez, V. A unified view of entropy-regularized Markov decision processes. ar Xiv preprint ar Xiv:1705.07798, 2017. Ouhamma, R., Basu, D., and Maillard, O.-A. Bilinear exponential family of mdps: Frequentist regret bound with tractable exploration and planning. ar Xiv preprint ar Xiv:2210.02087, 2022. Optimistic Planning by Regularized Dynamic Programming Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014. Qian, J., Fruit, R., Pirotta, M., and Lazaric, A. Exploration bonus for regret minimization in undiscounted discrete and continuous markov decision processes. ar Xiv preprint ar Xiv:1812.04363, 2018. Rockafellar, R. T. Monotone Operators and the Proximal Point Algorithm. SIAM Journal on Control and Optimization, 14(5):877 898, 1976. Strehl, A. L., Li, L., and Littman, M. L. Reinforcement learning in finite MDPs: PAC analysis. The Journal of Machine Learning Research, 10:2413 2444, 2009. Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. 2nd edition. 2018. Vial, D., Parulekar, A., Shakkottai, S., and Srikant, R. Regret bounds for stochastic shortest path problems with linear function approximation. In International Conference on Machine Learning, pp. 22203 22233, 2022. Wei, C.-Y., Jahromi, M. J., Luo, H., and Jain, R. Learning infinite-horizon average-reward MDPs with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp. 3007 3015. PMLR, 2021. Yang, L. and Wang, M. Sample-optimal parametric qlearning using linearly additive features. In International Conference on Machine Learning, pp. 6995 7004. PMLR, 2019. Zhou, D., He, J., and Gu, Q. Provably efficient reinforcement learning for discounted MDPs with feature mapping. In International Conference on Machine Learning, pp. 12793 12802. PMLR, 2021. Optimistic Planning by Regularized Dynamic Programming A. Omitted Proofs A.1. Technical Tools for the Proof of Lemma 4.3 Lemma A.1. Let π and π be two policies, with their corresponding state-action occupancy measures being µπ and µπ , and their state occupancy measures being νπ and νπ . Then, DKL µπ µπ 1 1 γ H (π π ) . Proof. Using the chain rule of the relative entropy, we write DKL µπ µπ = DKL νπ νπ + H (π π ) . By the Bellman flow constraints in Equation (1) and the joint convexity of the relative entropy, we bound the second term as DKL νπ νπ = DKL γP Tµπ + (1 γ) ν0 γP Tµπ + (1 γ) ν0 (1 γ) DKL (ν0 ν0) + γDKL P Tµπ P Tµπ = γDKL P Tµπ P Tµπ γDKL µπ µπ , where we also used the data-processing inequality in the last step. The proof is concluded by reordering the terms. A.2. Proof of Lemma 5.4 Let us fix k [K], t {Tk, Tk + 1, . . . , Tk+1 1}, δ (0, 1). We start by recalling the definition of the nominal transition model b Pk acting on functions V as b Pk V (x, a) = φV (x, a) , bθk , where we denoted the state-action feature map φV (x, a) = P x X ψ (x, a, x ) V (x ), and the parameter bθk can be written out as bθk = Λ 1 Tk b Tk = j=Ti φi (xj, aj) φi (xj, aj) T + λI j=Ti φi (xj, aj) Vi (xj+1) . 
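In code, this least-squares computation and the resulting elliptical bonus of Equation (9) take the following form. This is a schematic sketch with our own variable names; the confidence width follows the self-normalized bound of Theorem C.2 in Appendix C.2 and may differ from the exact constant appearing in Lemma 5.4.

```python
import numpy as np

def linear_mixture_estimate(features, targets, lam):
    """Ridge regression: theta_hat = (sum_j phi_j phi_j^T + lam I)^{-1} sum_j phi_j V_j.
    `features` is an (n, d) array of phi_i(x_j, a_j); `targets` holds the
    realized values V_i(x_{j+1})."""
    d = features.shape[1]
    Lambda = lam * np.eye(d) + features.T @ features
    b = features.T @ targets
    theta_hat = np.linalg.solve(Lambda, b)
    return theta_hat, Lambda

def elliptical_bonus(phi, Lambda, beta):
    """CB(x, a) = beta * ||phi(x, a)||_{Lambda^{-1}}, as in Equation (9)."""
    return beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))

def confidence_width(Lambda, lam, H, B, delta):
    """A width of the form suggested by Theorem C.2 and the proof of Lemma 5.4:
    beta = H * sqrt(2 log(det(Lambda)^{1/2} det(lam I)^{-1/2} / delta)) + sqrt(lam) * B.
    (Schematic: the precise constant in Lemma 5.4 may differ.)"""
    d = Lambda.shape[0]
    _, logdet = np.linalg.slogdet(Lambda)
    log_ratio = 0.5 * (logdet - d * np.log(lam))
    return H * np.sqrt(2 * (log_ratio + np.log(1 / delta))) + np.sqrt(lam) * B
```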
To proceed, we notice that the true transition operator acting on V can be written in a similar form as (PV ) (x, a) = X x X P (x |x, a) V (x ) (by definition of P) x X θ, ψ (x, a, x ) V (x ) (by Assumption 5.3) x X ψ (x, a, x ) V (x ) = θ, φV (x, a) , where we used the definition of φV in the last line. Proceeding further with the same expression, we write (PV ) (x, a) = φV (x, a) , Λ 1 Tk ΛTkθ φV (x, a) , Λ 1 Tk j=Ti φi (xj, aj) φi (xj, aj) T θ + λΛ 1 Tk θ φV (x, a) , Λ 1 Tk j=Ti φi (xj, aj) (PVi) (xj, aj) + λΛ 1 Tk θ Optimistic Planning by Regularized Dynamic Programming where we used the definition of ΛTk and Assumption 5.3 in the last line. Comparing the expressions for PV and b Pk V , we obtain b Pk V (x, a) PV (x, a) = φV (x, a) , Λ 1 Tk j=Ti φi (xj, aj) [Vi (xj+1) (PVi) (xj, aj)] λΛ 1 Tk θ Using the Cauchy Schwartz inequality and taking V = Vk, we get b Pk Vk (x, a) PVk (x, a) φk (x, a) Λ 1 Tk (|ξk| + |bk|) , where ξk = Pk 1 i=0 PTi+1 1 j=Ti φi (xj, aj) [Vi (xj+1) (PVi) (xj, aj)] Λ 1 Tk , and bk = λ θ Λ 1 Tk . The second term can be easily bounded as |bk| λB, using that λmin (ΛTk) λ and the boundedness of the features. For the first term, observe that Vi (xj+1) (PVi) (xj, aj) forms a martingale difference sequence, with increments bounded in [ H, H] by the truncation made in the algorithm. Additionally, the feature vectors are bounded as φi (xj, aj) 2 BH and the true parameter as θ 2 B by Assumption 5.3. Therefore, we can apply the self-normalized concentration result in Theorem C.2 (stated in Appendix C.2), which guarantees that with probability at least 1 δ, the following bound holds for all k [K]: v u u t2 log " det (ΛTk)1/2 det (λI) 1/2 The determinants appearing in the bound can be further upper bounded by using that det (λI) = λd and det (ΛTk) tr (ΛTk) d (by the trace-determinant inequality) j=Ti φi (xj, aj) 2 2 (by the definition of ΛTk) λ + Tk B2H2 d (by the boundedness of the features) where in the last step we used Tk T. We plug this back in the upper-bound on ξk to obtain the bound 2 log 1 + TB2H2 Putting everything together, we have verified that, for all k [K], D P ( |x, a) b Pk ( |x, a) , Vk E φk (x, a) Λ 1 Tk 2 log 1 + TB2H2 = β φk (x, a) Λ 1 Tk . holds with probability at least 1 δ, where we have defined β as 2 log 1 + TB2H2 This concludes the proof. Optimistic Planning by Regularized Dynamic Programming A.3. Proof of Lemma 4.4 For the sake of this proof, we slightly update our notation for Tk by setting TK(T ) = TK(T ), TK(T ) + 1, . . . , T + . We will use Fk 1 to denote the filtration generated by the observations up to the end of epoch k 1, and Lk to denote the length of epoch k. We start by rewriting the sum of exploration bonuses up to step T + as t=1 CBt (Xt, At) = t Tk CBt (Xt, At) . (11) By virtue of the definition of T +, all epochs are of geometric length with mean 1 1 γ . Now, let us consider a fixed epoch k and define the auxiliary infinite sequence of state-action pairs Xk,0, Ak,0, Xk,1, Ak,1, . . . that is generated independently from the realized sample trajectory (Xt, At)t Tk given Fk 1 as follows. The initial state Xk,0 is drawn from ν0, and then subsequently for each i = 0, 1, . . . , the actions are drawn as Ak,i πk ( |Xk,i) and follow-up states are drawn as Xk,i+1 P ( |Xk,i, Ak,i). 
Recalling the notational convention that CBt = CBk for all t Tk, we observe that for any k, we have t Tk CBt (Xt, At) i=0 CBk (Xk,i, Ak,i) i=0 I{i i (since each Zi is nonnegative) i=k P [Z1 > i] (upper bounding the first k > 0 terms by 1) i=k γi (using that Z1 is geometric with parameter 1 γ) where we have used the formula for the geometric sum in the last step. Now, setting k = l log T 1 γ m , we get E max k |Tk| = k + T γk 1 γ 1 + log T 1 γ + T exp log γ 1 γ + T exp ( log T) 1 γ = 2 + log T where in the second line we have used the inequality log γ 1 γ 1 that holds for all γ (0, 1). The proof is concluded by using that log 2 > 1 2 and combining the above bound with the bound relating Ck to |Tk|. Optimistic Planning by Regularized Dynamic Programming B. Applications: Algorithm Specifications For clarity, we provide the complete algorithms in the context of tabular and linear mixture MDPs. The highlighted parts correspond to the instantiations of the functions TRANSITION-ESTIMATE, BONUS, and ADD from Algorithm 1. B.1. Tabular MDPs In tabular MDPs, we use the maximum likelihood estimates to compute b Pk and the classical count-based bonuses. Therefore, we only need to store and update the counts Nt and N t when interacting with the environment. Algorithm 2 RAVI-UCB for tabular MDPs. Inputs: Horizon T, learning rate η > 0, confidence parameter β > 0, value V0, policy π0. Initialize: t = 1, N 1 = 0, N1 = 1, Q1 = EV0. for k = 1, . . . do b Pk (x |x, a) = N Tk (x, a, x ) /NTk (x, a). Vk (x) = 1 a πk 1 (a|x) eηQk(x,a) . πk (a|x) = πk 1 (a|x) eη(Qk(x,a) Vk(x)). CBk (x, a) = β/ p NTk (x, a). Qk+1 = ΠH h r + CBk + γ b Pk Vk i . repeat Play at πk ( |xt), and observe xt+1. Update N t+1 (xt, at, xt+1) = N t (xt, at, xt+1) + 1, and Nt+1 (xt, at) = Nt (xt, at) + 1. t = t + 1. With probability 1 γ, reset to initial distribution: xt ν0 and break. until t = T end for B.2. Linear Mixture MDPs In linear mixture MDPs, b Pk is computed via a least-squares regression, and we use elliptical bonuses for CBk. Thus, we only need to store and update the empirical covariance matrix Λt and the vector bt when interacting with the environment. Algorithm 3 RAVI-UCB for linear mixture MDPs. Inputs: Horizon T, learning rate η > 0, confidence parameter β > 0, regularization parameter λ, value V0, policy π0. Initialize: t = 1, Λ1 = λI, b1 = 0, Q1 = EV0. for k = 1, . . . do bθk = Λ 1 Tk b Tk. b Pk = P i bθk,iψi. Vk (x) = 1 a πk 1 (a|x) eηQk(x,a) . πk (a|x) = πk 1 (a|x) eη(Qk(x,a) Vk(x)). φk (x, a) = P x ψ (x, a, x ) Vk (x ). CBk (x, a) = β φk (x, a) Λ 1 Tk . Qk+1 = ΠH h r + CBk + γ b Pk Vk i . repeat Play at πk ( |xt), and observe xt+1. Update Λt+1 = Λt + φk (xt, at) φk (xt, at) T, and bt+1 = bt + φk (xt, at) Vk (xt+1). t = t + 1. With probability 1 γ, reset to initial distribution: xt ν0 and break. until t = T end for Optimistic Planning by Regularized Dynamic Programming C. Standard Results C.1. Softmax Policies and Value Functions In this section, we recall a range of standard facts relating the softmax policies our algorithm uses and the associated value functions. These can be found in numerous papers, textbooks, and lecture notes for concreteness, see Section 28.1 in Lattimore & Szepesv ari, 2020 as an example. Lemma C.1. Let {Vk}k [K], {πk}k [K], and {Qk}k [K] be the sequences of functions defined in Algorithm 1. Then, the following equalities are satisfied for all k [K] and x X: Vk (x) = max p (A) p, Qk (x, ) 1 η DKL (p πk 1 ( |x)) πk ( |x) = arg max p (A) p, Qk (x, ) 1 η DKL (p πk 1 ( |x)) . 
Furthermore, for all k [K] and x X, we have i=1 Vi (x) = max p (A) i=1 Qi (x, ) η DKL (p π0 ( |x)) Proof. First, we show that the maximum indeed takes the form claimed in the main paper and that the maximizer is given by a softmax policy. For simplicity, we drop the indices for now and consider the optimization problem η DKL (p p ) , where Q RA, and p (A). As the probability simplex is compact and p 7 p, Q 1 ηDKL (p p ) is continuous, the supremum is attained at some p (A). The Lagrangian function of this optimization problem is given for all p RA + and α R as L (p, α) = p, Q 1 η DKL (p p ) + α ( p, 1 1) . Its partial derivative with respect to the primal variable p(a) is p(a) = Q(a) 1 Setting it to zero gives us the expression p (a) = p (a) exp η (Q(a) + α) 1 . Then, we use the constraint on p to find the value of α. In particular, p , 1 = 1 implies X a A p (a) exp (ηQ (a)) = exp (1 ηα) , from which we deduce that a A p (a) exp (ηQ (a)) Denoting V = 1 a A p (a) exp [ηQ (a)] , we plug back the expression of α into p : p (a) = p (a) exp η (Q(a) V ) . From this, we can directly express the relative entropy between p and p as DKL (p p ) = X a p (a) log p (a) a p (a) Q(a) V = p , Q V 1 , Optimistic Planning by Regularized Dynamic Programming so that we can write p , Q 1 η DKL (p p ) = p , Q p , Q V 1 = V . The first statement of the lemma then follows from applying this result to Q = Qk (x, ) and p = πk 1 ( |x), for k [K], x X. That is, for any state-action pair (x, a) X A and k 1, denoting the maximum Vk and the maximizer πk, we have that the following expressions are equivalent to the ones given in the statement of the lemma: a A πk 1 (a|x) eηQk(x,a) ! πk (a|x) = πk 1 (a|x) exp (η [Qk (x, a) Vk (x)]) . For the second statement, we start by denoting Vk = Pk i=1 Vi and Qk = Pk i=1 Qi, for k 1, and show by induction that, for x X, the following holds πk ( |x) = π0 ( |x) exp η Qk (x, ) Vk (x) 1 . Let x X. The case k = 1 follows immediately from the previous statement with Q = Q1 (x, ) and p = π0 ( |x). Assume the previous equation holds at k. Using the first statement with Q = Qk+1 (x, ) and p = πk ( |x) we have, for a A, πk+1 (a|x) = πk (a|x) eη[Qk+1(x,a) Vk+1(x)]. Applying the inductive hypothesis, it gives πk+1 (a|x) = π0 (a|x) exp η Qk (x, a) Vk (x) exp (η [Qk+1 (x, a) Vk+1 (x)]) = π0 (a|x) exp η Qk+1 (x, a) Vk+1 (x) , which finishes the induction. Then, we move on to the actual statement. We have a A πk 1 (a|x) eηQk(x,a) ! (by the first statement) a A π0 (a|x) eη(Qk(x,a)+ Qk 1(x,a) Vk 1(x)) ! (by induction) a A π0 (a|x) eη( Qk(x,a)) ! Therefore, by definition of Vk 1, i=1 Vi (x) = 1 a A π0 (a|x) eη( Qk(x,a)) ! = max p (A) i=1 Qi (x, ) η DKL (p π0 ( |x)) which concludes the proof. C.2. A Self-Normalized Tail Inequality Theorem C.2 (Theorem 14.7 in de la Pe na et al. (2009), Theorem 2 in Abbasi-Yadkori et al. (2011)). Let {ηt} t=1 be a real-valued stochastic process with corresponding filtration {Ft} t=0. Let ηt|Ft 1 be zero-mean and σ-sub Gaussian; i.e. E [ηt|Ft 1] = 0, and λ R, E eληt Ft 1 e λ2σ2 Let {φt} t=0 be an Rd-valued stochastic process where φt is Ft 1-measurable. Assume Λ0 is a d d positive definite matrix, and let Λt = Λ0 + Pt s=1 φsφT s. Then, for any δ > 0, with probability at least 1 δ, we have for all t 0, " det (Λt)1/2 det (Λ0) 1/2
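As a sanity check of Lemma C.1, the following snippet verifies numerically that, for a fixed state, the softmax value equals the KL-regularized maximum over the simplex and that the softmax policy attains it. This is an illustrative check of our own, not part of the original appendix.

```python
import numpy as np

rng = np.random.default_rng(1)
nA, eta = 5, 0.7
Q = rng.random(nA)                                  # Q_k(x, .) for a fixed state x
p_prev = rng.random(nA); p_prev /= p_prev.sum()     # pi_{k-1}(.|x)

def objective(p):
    # <p, Q> - (1/eta) * D_KL(p || p_prev)
    kl = np.sum(np.where(p > 0, p * np.log(p / p_prev), 0.0))
    return p @ Q - kl / eta

# closed forms from Lemma C.1
V = np.log(p_prev @ np.exp(eta * Q)) / eta          # softmax value
p_star = p_prev * np.exp(eta * (Q - V))             # softmax policy

assert np.isclose(p_star.sum(), 1.0)                # p_star is a distribution
assert np.isclose(objective(p_star), V)             # it attains the value V
# the softmax policy should dominate random points of the simplex
samples = rng.dirichlet(np.ones(nA), size=2000)
vals = np.array([objective(p) for p in samples])
assert (vals <= objective(p_star) + 1e-9).all()
```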