# Model-Based Reinforcement Learning with Value-Targeted Regression

Alex Ayoub¹, Zeyu Jia², Csaba Szepesvári¹,³, Mengdi Wang⁴,³, Lin F. Yang⁵

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_\theta = \sum_{i=1}^{d} \theta_i P_i$. We propose a model-based RL algorithm that is based on the optimism principle: in each episode, the set of models that are consistent with the data collected so far is constructed. The criterion of consistency is based on the total squared error that the model incurs on the task of predicting state values, as determined by the last value estimate, along the observed transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret which, in the special case of linear mixtures, takes the form $\tilde{O}(d\sqrt{H^3T})$, where $H$, $T$ and $d$ are the horizon, the total number of steps and the dimension of $\theta$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound of $\Omega(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived based on the Eluder dimension.

¹Amii and Department of Computing Science, University of Alberta; ²School of Mathematical Science, Peking University; ³DeepMind; ⁴Department of Electrical Engineering, Princeton University; ⁵Department of Electrical and Computer Engineering, University of California, Los Angeles. Correspondence to: Csaba Szepesvári, Mengdi Wang. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

1. Introduction

In reinforcement learning (RL), a core problem in artificial intelligence (Russel & Norvig, 2003; Sutton & Barto, 2018), an agent learns to control a possibly complex, initially unknown environment in a sequential trial-and-error process. The application of RL algorithms to various domains, such as games, robotics and science, has witnessed phenomenal empirical advances during the last few years (e.g., Mnih et al. 2015; Silver et al. 2017; Al Quraishi 2019; Arulkumaran et al. 2019). In online RL, an agent has to learn to act in an unknown environment from scratch, collect data as she acts, and adapt her policy to maximize the reward collected or, equivalently, to minimize her regret. Designing RL algorithms that provably achieve sublinear regret in some class of environments has been the subject of much research, mainly focusing on the so-called tabular and linear-factored MDP settings (e.g., Jaksch et al. 2010; Osband et al. 2014; Azar et al. 2017; Dann et al. 2017; 2018; Agrawal & Jia 2017; Osband et al. 2017; Jin et al. 2018; Yang & Wang 2019a; Jin et al. 2019). An appealing alternative to studying these structured cases is to consider learning and acting when the environment is described by a general model class, the topic of the current paper. Despite its appeal, it appears that prior work considered this option exclusively in a Bayesian setting. In particular, Strens (2000) introduced posterior sampling to RL, which was later analyzed by Osband & Van Roy (2014); Abbasi-Yadkori & Szepesvári (2015); Theocharous et al. (2017) (for a more in-depth discussion of related work, the reader is referred to Section 5).
As opposed to these works, in the present paper we are interested in developing algorithms for bounding the worst-case expected regret. The specific setting that we adopt is that of episodic reinforcement learning in an environment where the unknown transition probability model that describes the environment s stochastic dynamics belongs to a family of models that is given to the learner. The model family P is a general set of models, and it may be either finitely parametrized or nonparametric. In particular, our approach accommodates working with smoothly parameterized models (e.g., Abbasi-Yadkori & Szepesv ari, 2015), and can find use in both robotics (Kober et al., 2013) and queueing systems (Kovalenko, 1968). An illuminating special case is when elements of P take the form P = P i i Pi where P1, P2, . . . , Pd are fixed, known basis models and = ( 1, . . . , d) are unknown, real-valued parameters. Model P can be viewed as a linear mixture model that aggregates a finite family of known basic dynamical models (Modi et al., 2019). As an important special case, linear mixture models include the linear-factor MDP model of Model-Based Reinforcement Learning with Value-Targeted Regression (Yang & Wang, 2019a). The main contribution of this paper is a new model-based upper confidence RL algorithm. The main novelty is the criterion to select models that are deemed consistent with past data. As opposed to the most common approach where the models are selected based on their ability to predict next states or raw observations (cf. Jaksch et al. 2010; Yang & Wang 2019a or Strens 2000; Osband & Van Roy 2014; Abbasi-Yadkori & Szepesv ari 2015; Ouyang et al. 2017; Agrawal & Jia 2017 in a Bayesian setting), we propose to evaluate models based on their ability to predict the values of a value function at next states, where the value function used is an estimate of the optimal that our algorithm produces based on past information. In essence, our algorithm selects models based on their ability to produce small prediction errors in an appropriately constructed value-targeted regression (VTR) problem. Our algorithm combines VTR for constructing sets of plausible models with (standard) optimistic planning. The idea of using a value function estimate to fit models has been explored in the context of batch RL by Farahmand (2018). VTR is attractive for multiple reasons: (i) Firstly, VTR permits model learning to focus on task-relevant aspects of the transition dynamics. This is important as the dynamics can be quite complicated and in a resource bounded setting, modelling irrelevant aspects of the dynamics can draw valuable resources away from modelling task-relevant aspects. (ii) Secondly, VTR poses model learning as a realvalued regression problem, which should be easier than the usual approaches to build probabilistic models. In particular, when state-representation available to the agent takes values in a high-dimensional space then building a faithful probability model can be highly demanding. (iii) Thirdly, VTR aims to control directly what matters in terms of controlling the regret. Specifically the objective used in value-targeted is obtained from an expression that upper bounds the regret, hence it is natural to expect that minimizing value prediction errors will lead to a small regret. An additional attractive feature of our algorithm is its modular structure. 
As a result, advances on the components (faster optimistic planning, tighter confidence sets for VTR) are directly translated into an improved algorithm. On the skeptical side, one may question whether VTR is going too far in ignoring details of the dynamics. In particular, since the value functions used in defining the regression targets are derived based on imperfect knowledge, the model may never get sufficiently refined in a way that would keep the regret small. Similarly, one may be worried about that by ignoring details of the observations (i.e., the identity of states), the approach advocated is ignoring information available in the data, which may slow down learning. This leads to central question of our article: Is value-targeted regression sufficient and efficient for model-based online RL? The main contribution of this paper is a positive answer to this question. In particular, the regret bounds we derive conclusively show that the despite the imperfection and non-stationarity of the value targets, our algorithm will not get stuck (i.e., it enjoys sublinear regret). Our results further suggest that in the worst case sense, for common settings, there may be no performance penalty associated with using value-targeted regression. We are careful here as this conclusion is based on comparing worst-case upper bounds, which cannot provide a definitive answer. Finally, it is worth noting that the regret bound does not depend on the size of either the state or the action space. To complement the theoretical findings, results from a number of small-scale, synthetic experiments confirm that our algorithm is competitive in terms of its regret. The experiments also allow us to conclude that it is value-targeted regression together with optimistic planning that is effective. In particular, if optimism is taken away (i.e., -greedy for exploration), value-targeted regression performs worse than using a canonical approach to estimate the model. Similarly, if value-targeted regression is taken away, optimism together with the canonical model-estimation approach is less effective. Note that our results do not rule out that certain combinations of value-targeted regression and canonical model building are more effective than value-targeted regression. In fact, given the vast number of possibilities, we find this to be quite probable. While our proofs can be adjusted to deal with adding simultaneous alternative targets, sadly, our current theoretical tools are unable to capture the tradeoffs that one expects to arise as a result of such modifications. It is interesting to note that, in an independent and concurrent work, value-targeted regression has also been suggested as the main model building tool of the Mu Zero algorithm (Schrittwieser et al., 2019), which was empirically evaluated on a number of RL benchmarks, such as the 57 Atari games , the game of Go , chess and shogi, and was found to be highly competitive with its state-of-the-art alternatives. This reinforces the conclusion that training models using value-targeted regression is indeed a good approach to build effective model-based RL algorithms. Since Mu Zero does not implement optimistic planning and our results show that optimism is not optional in a worst case sense, the good results of Mu Zero on these benchmark may seem to contradict our experimental findings that value-targeted regression is ineffective without an appropriate, smart exploration component. 
However, there is no contradiction: smart exploration may be optional in some environments; our experiments show that it is not optional in others. In short, for robust performance across a wide range of environments, smart exploration is necessary, even though it may be optional in some of them.

2. Problem Formulation

We study learning to control episodic Markov decision processes (MDPs, for short), described by a tuple $M = (S, A, P, r, H, s_1)$. Here, $S$ is the state space, $A$ is the action space, $P$ is the transition kernel, $r$ is a reward function, $H > 0$ is the episode length, or horizon, and $s_1 \in S$ is the initial state. In the online RL problem, the learning agent is given $S$, $A$, $H$ and $r$ but does not know $P$.¹ The agent interacts with its environment in a number of episodes. Each episode begins at state $s_1$ and ends after the agent has made $H$ decisions. Upon observing the state $s \in S$, the agent can choose an action $a \in A$. As a result, the immediate reward $r(s, a)$ is incurred. Then the process transitions to a random next state $s' \in S$ according to the transition law $P(\cdot\,|\,s, a)$.²

The agent's goal is to maximize the total expected reward received over time. If $P$ is known, the behavior that achieves maximum expected reward over any number of episodes can be described by applying a deterministic policy $\pi$. Such a policy is a mapping from $S \times [H]$ into $A$ (note that for a natural number $n$, $[n] = \{1, \dots, n\}$). Following the policy $\pi$ means that the agent, upon encountering state $s$ in stage $h$, will choose action $\pi(s, h)$. In what follows, we will use $\pi_h(s)$ as an alternate notation, as this makes some of the formulae more readable. (We will follow the same convention of moving $h$ to the subindex position when it comes to other functions whose domain is $S \times [H]$.) The value function $V^\pi : S \times [H] \to \mathbb{R}$ of a policy $\pi$ is defined via

$$V^\pi_h(s) = \mathbb{E}_\pi\!\left[\sum_{i=h}^{H} r(s_i, \pi_i(s_i)) \,\middle|\, s_h = s\right],$$

where $s_i$ is the state encountered at stage $i \in [H]$ and the subscript $\pi$ (which we will often suppress) signifies that the probabilities underlying the expectation are jointly governed by $\pi$ and $P$ ($P$ is suppressed for clarity). An optimal policy $\pi^*$ and the optimal value function $V^*$ are defined to be a policy and its value function such that $V^\pi_h(s)$ with $\pi = \pi^*$ achieves the maximum value among all possible policies for any $s \in S$ and $h \in [H]$. Note that this is well defined and, in fact, as noted above, there is no loss of generality in restricting the search for optimal policies to deterministic policies.

¹Our results are easy to extend to the case when $r$ is not known.
²The precise definitions require measure-theoretic concepts (Bertsekas & Shreve, 1978), i.e., $P$ is a Markov kernel, mapping from $S \times A$ to distributions over $S$, hence all these spaces need to be properly equipped with a measurability structure. For the sake of readability, and also because they are well understood, we omit these technical details.

In online RL, a learning agent will use all past observations to come up with its decisions. The performance of such an agent is measured by its regret, which is the total reward the agent misses because she did not follow the optimal policy from the beginning. In particular, the total expected regret of an agent $A$ across $K$ episodes is given by

$$R(T) = \mathbb{E}\!\left[\sum_{k=1}^{K}\left(V^*_1(s^k_1) - \sum_{h=1}^{H} r(s^k_h, a^k_h)\right)\right], \tag{1}$$

where $T = KH$ is the total number of time steps that the agent interacts with its environment, $s^k_1 = s_1$ is the initial state at the start of the $k$-th episode, and $(s^1_1, a^1_1), \dots, (s^1_H, a^1_H), (s^2_1, a^2_1), \dots, (s^K_H, a^K_H)$ are the $T = KH$ state-action pairs in the order that they are encountered by the agent.
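To make the interaction protocol and the regret in (1) concrete, here is a minimal sketch in Python. The environment interface (`reset`, `step`) and the agent hooks (`act`, `update`) are illustrative assumptions introduced only for this sketch, not notation from the paper.

```python
def run_episodes(env, agent, K, H, V_star_1):
    """Run K episodes of H steps each and accumulate the pseudo-regret
    sum_k [ V*_1(s_1^k) - sum_h r(s_h^k, a_h^k) ], as in Eq. (1).

    env.reset(), env.step(a), agent.act(s, h) and agent.update(...) are
    assumed interfaces used only for illustration."""
    regret = 0.0
    for k in range(K):
        s = env.reset()                      # every episode starts at the fixed initial state
        episode_return = 0.0
        trajectory = []
        for h in range(H):
            a = agent.act(s, h)              # e.g. greedy w.r.t. the optimistic Q of episode k
            s_next, r = env.step(a)          # reward r(s, a), next state s' ~ P(.|s, a)
            episode_return += r
            trajectory.append((s, a, s_next))
            s = s_next
        agent.update(trajectory)             # e.g. refit the model and the confidence set
        regret += V_star_1 - episode_return  # V*_1(s_1) is the optimal H-step value
    return regret
```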
The regret is sublinear if $R(T)/T \to 0$ as $T \to \infty$. As is well known, the worst-case value of $R(T)$ over a sufficiently large model class grows at least as fast as $\sqrt{T}$, regardless of the algorithm used (e.g., Jaksch et al., 2010). In this paper, we aim to design a general model-based reinforcement learning algorithm, with a guaranteed sublinear regret, for any (not too large) family of transition models:

Assumption 1 (Known Transition Model Family). The unknown transition model $P$ belongs to a family of models $\mathcal{P}$ which is available to the learning agent. The elements of $\mathcal{P}$ are transition kernels mapping state-action pairs to signed distributions over $S$.

That we allow signed distributions only increases generality; this may be important when one is given a model class that can be compactly represented, but only when it also includes non-probability kernels (see Pires & Szepesvári 2016 for a discussion of this). Parametric and nonparametric transition models are common in modelling complex stochastic controlled systems. For example, robotic systems are often smoothly parameterized by unknown mechanical parameters such as friction and parameters that describe the geometry of the robot. An important special case is the class of linear mixture models:

Definition 1 (Linear Mixture Models). We say that $\mathcal{P}$ is the class of linear mixture models with component models $P_1, \dots, P_d$ if $P_1, \dots, P_d$ are transition kernels that map state-action pairs to signed measures and $P \in \mathcal{P}$ if and only if there exists $\theta \in \mathbb{R}^d$ such that for all $(s, a) \in S \times A$,

$$P(ds'|s, a) = \sum_{j=1}^{d} \theta_j\, P_j(ds'|s, a) = \boldsymbol{P}(ds'|s, a)^\top \theta, \tag{2}$$

where $\boldsymbol{P} = (P_1, \dots, P_d)$.

The linear mixture model can be viewed as a way of aggregating a number of known basis models, as considered by Modi et al. (2019). We can view each $P_j(\cdot\,|\,\cdot)$ as a basis "latent mode". When $\theta$ is restricted to lie in the $(d-1)$-simplex, the actual transition is a probabilistic mixture of these latent modes. As an example of when mixture models arise, consider large-scale queueing networks where the arrival rate and job processing speed for each queue is not known. By using a discrete-time Bernoulli approximation, the transition probability matrix from time $t$ to $t + \Delta t$ becomes increasingly close to linear with respect to the unknown arrival/processing rates as $\Delta t \to 0$. In this case, it is common to model the discrete-time state transition as a linear aggregation of arrival/processing processes with unknown parameters (Kovalenko, 1968).

Another interesting special case is the linear-factored MDP model of Yang & Wang (2019a) where, assuming a discrete state space for a moment, $P$ takes the form

$$P(s'|s, a) = \phi(s, a)^\top M\, \psi(s') = \sum_{i,j} M_{ij}\, \psi_j(s')\, \phi_i(s, a),$$

where $\phi(s, a) \in \mathbb{R}^{d_1}$ and $\psi(s') \in \mathbb{R}^{d_2}$ are given features for every $s, s' \in S$ and $a \in A$ (when the state space is continuous, $\psi$ becomes an $\mathbb{R}^{d_2}$-valued measure over $S$). The matrix $M \in \mathbb{R}^{d_1 \times d_2}$ is unknown and is to be learned. It is easy to see that the factored MDP model is a special case of the linear mixture model (2), with each $\psi_j(s')\phi_i(s, a)$ being a basis model (this should be replaced by $\psi_j(ds')\phi_i(s, a)$ when the state space is continuous). In this case, the number of unknown parameters in the transition model is $d = d_1 d_2$. In this setting, without additional assumptions, our regret bound will match that of Yang & Wang (2019a).
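As a concrete illustration of Definition 1 in the finite state-action case, the following sketch builds a linear mixture kernel from basis kernels stored as arrays; the array-based representation is an illustrative assumption. The second function shows how tabular MDPs are recovered with indicator basis kernels, as used later in the experiments.

```python
import numpy as np

def mixture_kernel(basis, theta):
    """Form P_theta(s'|s,a) = sum_j theta_j * P_j(s'|s,a) from d basis kernels.

    basis: array of shape (d, S, A, S), basis[j, s, a, s2] = P_j(s2 | s, a)
    theta: array of shape (d,)
    Returns an array of shape (S, A, S)."""
    return np.tensordot(theta, basis, axes=1)

def tabular_basis(S, A):
    """Indicator basis for the tabular special case: with d = S^2 * A and a
    bijection f(s, a, s2) onto [d], P_j(s2|s, a) = 1{j = f(s, a, s2)}, so that
    theta_j is simply the corresponding transition probability."""
    d = S * A * S
    basis = np.zeros((d, S, A, S))
    j = 0
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                basis[j, s, a, s2] = 1.0
                j += 1
    return basis
```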
3. Upper Confidence RL with Value-Targeted Model Regression

Our algorithm (Algorithm 1) can be viewed as a generalization of UCRL (Jaksch et al., 2010), following ideas of Osband & Van Roy (2014). In particular, at the beginning of episode $k = 1, 2, \dots, K$, the algorithm first computes a subset $B_k$ of the model class $\mathcal{P}$ that contains the set of models that are deemed to be consistent with all the data that has been collected in the past. The new idea, value-targeted regression, is used in the construction of $B_k$. The details of how this is done are postponed to a later section. Given $B_k$, the algorithm needs to find the model that maximizes the optimal value, and the corresponding optimal policy. Denoting by $V^*_P$ the optimal value function under a model $P$, this amounts to finding the model $P \in B_k$ that maximizes the value $V^*_{P,1}(s^k_1)$. Given the model $P_k$ that maximizes this value, an optimal policy is extracted from the model as in standard dynamic programming, detailed in the next section.

Algorithm 1 UCRL-VTR
1: Input: family of MDP models $\mathcal{P}$, $d$, $H$, $T = KH$, sequence $\{\beta_k\}_{k=1,2,\dots}$
2: $B_1 = \mathcal{P}$
3: for $k = 1, 2, \dots, K$ do
4:   Observe the initial state $s^k_1$ of episode $k$
5:   Optimistic planning: $P_k = \arg\max_{P' \in B_k} V^*_{P',1}(s^k_1)$; compute $Q_{1,k}, \dots, Q_{H,k}$ for $P_k$ using (3)
6:   for $h = 1, 2, \dots, H$ do
7:     Choose the next action greedily with respect to $Q_{h,k}$: $a^k_h = \arg\max_{a \in A} Q_{h,k}(s^k_h, a)$
8:     Observe state $s^k_{h+1}$
9:     Compute and store the value prediction: $y_{h,k} = V_{h+1,k}(s^k_{h+1})$
10:  end for
11:  Construct the confidence set using value-targeted regression as described in Section 3.2: $B_{k+1} = \{P' \in \mathcal{P} \,:\, L_{k+1}(P', \hat{P}_{k+1}) \le \beta_k\}$
12: end for

At the end of the episode, the data collected is used to refine the confidence set $B_k$.

3.1. Model-Based Optimistic Planning

Upper confidence methods are prominent in sequential online learning. As noted before, we let $P_k = \arg\max_{P' \in B_k} V^*_{P',1}(s^k_1)$. Given model $P_k$, the optimal policy for $P_k$ can be computed using dynamic programming. In particular, for $1 \le h \le H + 1$, define

$$Q_{H+1,k}(s, a) = 0, \qquad V_{h,k}(s) = \max_{a \in A} Q_{h,k}(s, a), \qquad Q_{h,k}(s, a) = r(s, a) + \langle P_k(\cdot\,|\,s, a),\, V_{h+1,k}\rangle, \tag{3}$$

where, for a measure $\mu$ and a function $f$ that share a common domain, $\langle \mu, f\rangle$ denotes the integral of $f$ with respect to $\mu$. It follows that taking the action at state $s$ and stage $h$ that maximizes $Q_{h,k}(s, \cdot)$ gives an optimal policy for model $P_k$. As long as $P \in B_k$ with high probability, the preceding calculation gives an optimistic (that is, upper) estimate of the value of an episode. Next, we show how to construct the confidence set $B_k$.
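Before turning to the confidence sets, here is a minimal sketch of the dynamic-programming step (3) for a single fixed candidate model; the outer optimistic maximization over $B_k$ is not shown. The array layout of the model and reward is an assumption made for illustration.

```python
import numpy as np

def backward_induction(P, r, H):
    """Compute Q_h and V_h for a fixed model P via the recursion in Eq. (3).

    P: array of shape (S, A, S) with P[s, a, s2] = P(s2 | s, a)
    r: array of shape (S, A)
    Returns Q of shape (H+1, S, A) and V of shape (H+2, S); the boundary
    condition Q_{H+1} = 0 is encoded by V[H+1] = 0."""
    S, A = r.shape
    V = np.zeros((H + 2, S))
    Q = np.zeros((H + 1, S, A))
    for h in range(H, 0, -1):
        # Q_h(s, a) = r(s, a) + <P(.|s, a), V_{h+1}>
        Q[h] = r + P @ V[h + 1]
        V[h] = Q[h].max(axis=1)
    return Q, V
```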
3.2. Value-Targeted Regression for Confidence Set Construction

Every time we observe a transition $(s, a, s')$ with $s' \sim P(\cdot\,|\,s, a)$, we receive information about the model $P$. A standard approach to using this information would be either a maximum likelihood approach, or regressing onto $s'$. As our goal is not to find the best model, we propose an alternative approach where we set up a regression problem in which the model is used to predict the value assigned to $s'$ by our most recent value function estimate:

$$\hat{P}_{k+1} = \arg\min_{P' \in \mathcal{P}} L'_k(P'), \qquad L'_k(P') = \sum_{k'=1}^{k}\sum_{h=1}^{H}\left(\langle P'(\cdot\,|\,s^{k'}_h, a^{k'}_h),\, V_{h+1,k'}\rangle - y_{h,k'}\right)^2, \tag{4}$$

$$y_{h,k'} = V_{h+1,k'}(s^{k'}_{h+1}), \qquad h \in [H],\ k' \in [k].$$

In the above regression procedure, the regression target keeps changing as the algorithm refines the value estimates. This is in contrast to typical supervised learning for building models, where the regression targets are often fixed objects, such as raw observations, features or keypoints (e.g., Jaksch et al. 2010; Osband & Van Roy 2014; Abbasi-Yadkori & Szepesvári 2015; Xie et al. 2016; Agrawal & Jia 2017; Yang & Wang 2019a; Kaiser et al. 2019).

For the confidence set construction, we take inspiration from Proposition 5 in the paper of Osband & Van Roy (2014). The set is centered at $\hat{P}_{k+1}$. Defining

$$L_{k+1}(P, \hat{P}_{k+1}) = \sum_{k'=1}^{k}\sum_{h=1}^{H}\left(\langle P(\cdot\,|\,s^{k'}_h, a^{k'}_h) - \hat{P}_{k+1}(\cdot\,|\,s^{k'}_h, a^{k'}_h),\, V_{h+1,k'}\rangle\right)^2,$$

we let

$$B_{k+1} = \{P' \in \mathcal{P} \,:\, L_{k+1}(P', \hat{P}_{k+1}) \le \beta_{k+1}\},$$

where the value of $\beta_{k+1}$ is obtained using a calculation similar to that done in Proposition 5 of the paper of Osband & Van Roy (2014). In turn, this calculation is based on the nonlinear least-squares confidence set construction of Russo & Van Roy (2014), which we describe and refine in the appendix. It is not hard to see that the confidence set can also be written in the alternative form $B_{k+1} = \{P' \in \mathcal{P} \,:\, L_{k+1}(P') \le \beta_{k+1}\}$ with a suitably defined $\beta_{k+1}$ and where

$$L_{k+1}(P') = \sum_{k'=1}^{k}\sum_{h=1}^{H}\left(\langle P'(\cdot\,|\,s^{k'}_h, a^{k'}_h),\, V_{h+1,k'}\rangle - y_{h,k'}\right)^2.$$

Note that the above formulation strongly exploits that the MDP is time-homogeneous: the same transition model is used at all stages of an episode. When the MDP is time-inhomogeneous, the construction can easily be modified to accommodate this.

3.3. Implementation of UCRL-VTR

Algorithm 1 gives a general and modular template for model-based RL that is compatible with various regression methods and optimistic planners. While the algorithm is conceptually simple, and the optimization and evaluation of the loss in value-targeted regression appears to be at an advantage in terms of computation as compared to the standard approaches typically used in model-based RL, the implementation of UCRL-VTR is nontrivial in general and, for now, requires a case-by-case design. The computational efficiency of the algorithm depends on the specific family of models chosen. For the linear-factored MDP model considered by Yang & Wang (2019a), the regression is linear and admits an efficient implementation; further, optimistic planning for this model can be implemented in poly(d) time by using Monte-Carlo simulation and sketching, as argued in the cited paper. Other ideas include loosening the confidence set to come up with computationally tractable methods, or relaxing the requirement that the same model is used in all stages. This latter idea is what we use in our experiments. In the general case, optimistic planning is computationally intractable. However, we expect that randomized methods (e.g., Osband et al. 2017; 2014; Lu & Van Roy 2017) and approximate dynamic programming methods (tree search, rollouts; see, e.g., Bertsekas & Tsitsiklis 1996) will often lead to tractable and good approximations. As was mentioned above, in some special cases these have been rigorously shown to work. In similar settings, the approximation errors are known to have only a mild impact on the regret (Abbasi-Yadkori & Szepesvári, 2015) and we expect the same to hold in our setting. If we look beyond methods with rigorous guarantees, there are practical deep RL algorithms that implement parts of UCRL-VTR. As mentioned earlier, the MuZero algorithm of Schrittwieser et al. (2019) is a state-of-the-art algorithm on multiple domains, and this algorithm implements value-targeted regression to learn a model which is fed to a planner that uses Monte Carlo tree search, although the planner does not implement optimistic planning.
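In the linear mixture case, the value-targeted regression (4) reduces to least squares in $\theta$, since $\langle P_\theta(\cdot|s,a), V\rangle = \sum_j \theta_j \langle P_j(\cdot|s,a), V\rangle$ is linear in $\theta$. The sketch below solves a regularized version of (4) in closed form and tests membership in an ellipsoidal confidence region; the ridge regularizer `lam` and the ellipsoidal form are simplifying assumptions made for illustration rather than the exact construction used in the analysis.

```python
import numpy as np

def fit_theta_vtr(X, y, lam=1.0):
    """Value-targeted regression for a linear mixture model.

    X: array (n, d); row i holds the basis-model value predictions
       X[i, j] = <P_j(.|s_i, a_i), V_i> for one observed transition.
    y: array (n,);  y[i] = V_i(s'_i), the realized value target.
    Returns the ridge least-squares estimate of theta and the Gram matrix
    that defines the (assumed) ellipsoidal confidence region."""
    n, d = X.shape
    gram = X.T @ X + lam * np.eye(d)
    theta_hat = np.linalg.solve(gram, X.T @ y)
    return theta_hat, gram

def in_confidence_set(theta, theta_hat, gram, beta):
    """Check (theta - theta_hat)^T Gram (theta - theta_hat) <= beta,
    an ellipsoidal stand-in for the set B_{k+1} of Section 3.2."""
    diff = theta - theta_hat
    return float(diff @ gram @ diff) <= beta
```

With this parameterization, each optimistic Bellman backup can, for example, add an exploration bonus proportional to $\sqrt{\beta}\,\lVert x\rVert_{G^{-1}}$ to the predicted value of the feature vector $x$, which is one common way to realize the optimistic planning step for linear mixtures.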
4. Theoretical Analysis

We will need the concept of the Eluder dimension due to Russo & Van Roy (2014). Let $\mathcal{F}$ be a set of real-valued functions with domain $\mathcal{X}$. For $f \in \mathcal{F}$ and $x_1, \dots, x_t \in \mathcal{X}$, introduce the notation $f|_{(x_1,\dots,x_t)} = (f(x_1), \dots, f(x_t))$. We say that $x \in \mathcal{X}$ is $\alpha$-independent of $x_1, \dots, x_t \in \mathcal{X}$ given $\mathcal{F}$ if there exist $f, f' \in \mathcal{F}$ such that $\|(f - f')|_{(x_1,\dots,x_t)}\|_2 \le \alpha$ while $|f(x) - f'(x)| > \alpha$.

Definition 2 (Eluder dimension, Russo & Van Roy 2014). The Eluder dimension $\dim_E(\mathcal{F}, \alpha)$ of $\mathcal{F}$ at scale $\alpha$ is the length of the longest sequence $(x_1, \dots, x_n)$ in $\mathcal{X}$ such that, for some $\alpha' \ge \alpha$ and for any $2 \le t \le n$, $x_t$ is $\alpha'$-independent of $(x_1, \dots, x_{t-1})$ given $\mathcal{F}$.

Let $\mathcal{V}$ be the set of optimal value functions under some model in $\mathcal{P}$: $\mathcal{V} = \{V^*_{P'} : P' \in \mathcal{P}\}$. Note that $\mathcal{V} \subset B(S, H)$, where $B(S, H)$ denotes the set of real-valued measurable functions with domain $S$ that are bounded by $H$. We let $\mathcal{X} = S \times A \times \mathcal{V}$ and choose $\mathcal{F}$ to be the collection of functions $f : \mathcal{X} \to \mathbb{R}$ such that there exists $P \in \mathcal{P}$ with

$$f(s, a, v) = \int_S P(ds'|s, a)\, v(s') \quad \text{for all } (s, a, v) \in S \times A \times \mathcal{V}. \tag{5}$$

Note that $\mathcal{F} \subset B(\mathcal{X}, H)$. For a norm $\|\cdot\|$ on $\mathcal{F}$ and $\alpha > 0$, let $N(\mathcal{F}, \alpha, \|\cdot\|)$ denote the $(\alpha, \|\cdot\|)$-covering number of $\mathcal{F}$. That is, if $m = N(\mathcal{F}, \alpha, \|\cdot\|)$ then one can find $m$ elements of $\mathcal{F}$ such that any element of $\mathcal{F}$ is at most $\alpha$ away from one of these elements in norm $\|\cdot\|$. Denote by $\|\cdot\|_\infty$ the supremum norm: $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. Define the $K$-episode pseudo-regret as

$$R_K = \sum_{k=1}^{K}\left(V^*_1(s^k_1) - \sum_{h=1}^{H} r(s^k_h, a^k_h)\right).$$

Clearly, $R(KH) = \mathbb{E}[R_K]$ holds for any $K > 0$, where $R(T)$ is the expected regret after $T$ steps of interaction as defined in (1). Thus, to study the expected regret, it suffices to study $R_K$. Our main result is as follows.

Theorem 1 (Regret of Algorithm 1). Let Assumption 1 hold and let $\delta \in (0, 1)$. For $k > 0$ let $\beta_k$ be

$$\beta_k = 2H^2 \log\!\left(\frac{2N(\mathcal{F}, \alpha, \|\cdot\|_\infty)}{\delta}\right) + 2\alpha H(kH - 1)\left(2 + \sqrt{2\log\!\left(\frac{4kH(kH-1)}{\delta}\right)}\right). \tag{6}$$

Then, for any $K > 0$, with probability $1 - 2\delta$,

$$R_K \le C\,H\sqrt{d\,\beta_K\,K} + H\bigl(d \wedge K(H-1)\bigr) + 4\sqrt{2K(H-1)\log(1/\delta)},$$

where $C > 0$ is an absolute constant and $d = \dim_E(\mathcal{F}, \alpha)$ is the Eluder dimension of the class $\mathcal{F}$ given by (5). A typical choice of $\alpha$ is $\alpha = 1/(KH)$. In the special case of a linear transition model, Theorem 1 implies a worst-case regret bound that depends linearly on the number of parameters.

Corollary 2 (Regret of Algorithm 1 for Linearly Parametrized Transition Models). Let $P_1, \dots, P_d$ be $d$ transition models, $\Theta \subset \mathbb{R}^d$ a nonempty set with diameter $R$ measured in $\|\cdot\|_1$, and let $\mathcal{P} = \{P_\theta = \sum_j \theta_j P_j : \theta \in \Theta\}$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the pseudo-regret $R_K$ of Algorithm 1 when it uses the confidence sets given in Theorem 1 satisfies

$$R_K = \tilde{O}\!\left(d\sqrt{H^3T} + \sqrt{H^3K\log(1/\delta)}\right).$$

We also provide a lower bound for the regret in our model. The proof is by reduction to a known lower bound and is left to Appendix B.

Theorem 3 (Regret Lower Bound). For any $H \ge 1$ and $d \ge 8$, there exist a state space $S$, an action set $A$, a reward function $r : S \times A \to [0, 1]$, $d$ transition models $P_1, \dots, P_d$ and a set $\Theta \subset \mathbb{R}^d$ of diameter at most one, such that for any algorithm there exists $\theta \in \Theta$ such that, for a sufficiently large number of episodes $K$, the expected regret of the algorithm on the $H$-horizon MDP with reward $r$ and transition model $P = \sum_j \theta_j P_j$ is at least $\Omega(\sqrt{HdT})$.

Rusmevichientong & Tsitsiklis (2010) gave a regret lower bound of $\Omega(d\sqrt{T})$ for linearly parameterized bandits with actions on the unit sphere (see also Section 24.2 of Lattimore & Szepesvári 2020). Our regret upper bound matches this bandit lower bound in $d$ and $T$. Whether the upper or the lower bound is tight (or neither) remains to be seen. The theorems validate that, in the setting we consider, it is sufficient to use the predicted value functions as regression targets. That, for the special case of linear mixture models, the lower bound is close to the upper bound appears to suggest that little benefit, if any, can be derived from fitting the transition model to predict future observations.
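To see roughly how the linear mixture rate in Corollary 2 emerges from Theorem 1, one can plug in the standard complexity estimates for $d$-dimensional linear function classes (Russo & Van Roy, 2014). The following back-of-the-envelope calculation is indicative only, with logarithmic factors and absolute constants suppressed:

$$\log N(\mathcal{F}, \alpha, \|\cdot\|_\infty) \lesssim d\,\log\!\left(1 + \frac{RH}{\alpha}\right), \qquad \dim_E(\mathcal{F}, \alpha) \lesssim d\,\log\!\left(\frac{RH}{\alpha}\right),$$

so that, with the typical choice $\alpha = 1/(KH)$, the confidence width satisfies $\beta_K = \tilde{O}(dH^2)$ and the leading term of the regret bound becomes

$$H\sqrt{d\,\beta_K\,K} = \tilde{O}\!\left(H\sqrt{d \cdot dH^2 \cdot K}\right) = \tilde{O}\!\left(dH^2\sqrt{K}\right) = \tilde{O}\!\left(d\sqrt{H^3T}\right),$$

using $T = KH$. This supports the reading that, in the worst case, fitting the model to predict values loses little compared to fitting it to predict next-state distributions directly.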
We conjecture that this is in fact true when considering the worst-case regret. Of course, a conclusion that is concerned with the worst-case regret has no implication for the behavior of the respective methods on particular MDP instances. We note in passing that by appropriately increasing βk, the regret upper bounds can be extended to the so-called misspecified case when P can be outside of P (for related results, see, e.g., Jin et al. 2019; Lattimore & Szepesv ari 2019). However, the details of this are left for future work. Further, our method applies to handle the case where the linearly parameterized transition model is sparse. Suppose that model parameter is known to have at most s nonzero entries. In this case, the class of sparse linear models has a much smaller covering number, and the regret would improve and depend on both d,s. Details of this are left for future work. 4.1. Regret Bound with Model Misspecification Next, we consider the case where the model family P does not exactly realize the true transition model P: Assumption 2 (Model with misspecification error). The model family P "-approximates P in the sense that there exists P 2 P such that s,a k P( |s, a) P ( |s, a)k T V ", (7) Model-Based Reinforcement Learning with Value-Targeted Regression where k k T V denotes the total variation distance. This assumption indicates that the true transition model P of the MDP is close to the family P, and " measures the worst-case deviation. We handle the misspecification error by slightly enlarging the confidence set. This allows us to obtain a regret bound similar to our previous result with an additional linear term that is proportional to the misspecification error and slightly larger constants: Theorem 4. Let Assumption 2 hold, , δ 2 (0, 1). We choose βk be βk = 8H2 log 4N(F, , k k1) + 4H(k H 1) 8k H(k H 1) Then, for any K > 0, with probability 1 2δ, the K-episode pseudo-regret RK of Algorithm 1 satisfies RK + H(d K(H 1)) + 4 2K(H 1) log(1/δ) + H2K", where d = dim E(F, ) with F given by (5). The proof of this theorem is given in Appendix D. 5. Related Work A number of prior efforts have established efficient RL methods with provable regret bounds. For tabular H-horizon MDP with S states and A actions, there have been results on model-based methods (e.g., Jaksch et al. 2010; Osband et al. 2014; Azar et al. 2017; Dann et al. 2017; Agrawal & Jia 2017; Dann et al. 2018; Kakade et al. 2018), and on model-free methods (e.g., Osband et al. 2017; Jin et al. 2018; Zhang et al. 2020). Both model-based and model-free methods are known to achieve a regret of O( H2SAT), where O( ) hides log factors, T denotes the total number of timesteps and the bound applies to the non-stationary setting (when the transition models are not shared across stages). Moreover, apart from logarithmic factors, this bound is known to be unimprovable Jaksch et al. (2010); Osband et al. (2017); Jin et al. (2018); Zhang et al. (2020). There have been significant theoretical and empirical advances on RL with function approximation, including but not limited to (Baird, 1995; Tsitsiklis & Van Roy, 1997; Parr et al., 2008; Mnih et al., 2013; 2015; Silver et al., 2017; Yang & Wang, 2019b; Bradtke & Barto, 1996). Under the assumption that the optimal action-value function is captured by linear features, Zanette et al. (2019) considers the case when the features are extrapolation friendly and a simulation oracle is available, Wen & Van Roy (2013; 2017) tackle problems where the transition model is deterministic, Du et al. 
(2019) deals with a relaxation of the deterministic case when the transition model has low variance. Yang & Wang (2019b) considers the case of linear factor models, while Lattimore & Szepesv ari (2019) considers the case when all the action-value functions of all deterministic policies are well-approximated using a linear function approximator. These latter works handle problems when the algorithm has access to a simulation oracle of the MDP. As for regret minimization in RL using linear function approximation, Yang & Wang (2019a) assumed the transition model admits a matrix embedding of the form P(s0|s, a) = φ(s, a)>M (s0), and proposed a model-based Matrix RL method with regret bounds O(H2d T) with stronger assumptions and O(H2d2p T) in general, where d is the dimension of state representation φ(s, a). Jin et al. (2019) studied the setting of linear-factor MDP and constructed a model-free least-squares action-value iteration algorithm, which was proved to achieve the regret bound O( H3d3T). Modi et al. (2019) considered a related setting where the transition model is an ensemble involving state-action-dependent features and basis models and proved a sample complexity 2 where d is the feature dimension, K is the number of basis models and d K is their total model complexity. As for RL with a general model class, in their seminal work, Osband & Van Roy (2014) provided a general posterior sampling RL method that works for any given classes of reward and transition functions. It established a Bayesian regret upper bound O(pd Kd ET), where d K and d E are the Kolmogorov and the Eluder dimensions of the model class. In the case of linearly parametrized transition model (Assumption 1 of this paper), this Bayesian regret becomes O(d T), and our worst-case regret result matches with the Bayesian one. Abbasi-Yadkori & Szepesv ari (2015); Theocharous et al. (2017) also considered the Bayesian regret and, in particular, Abbasi-Yadkori & Szepesv ari (2015) considered a smooth parameterization. To the authors best knowledge, there are no prior works addressing the problem of designing low-regret algorithms for MDPs with linearly or non-linearly parameterized transition models. Model based PAC RL algorithms have been studied by Sun et al. (2019), who essentially adopt the value-aware loss of Farahmand et al. (2017), who considered this loss in the batch setting. Farahmand (2018) refines the work of Farahmand et al. (2017) by changing the algorithm to be similar to what we use here: In every iteration of fitted Q-iteration, first a model is obtained by minimizing the value prediction loss measured with the last value function, after which this model is used to obtain the next action-value function. The main result bounds the suboptimality of the policy that is greedy with respect to the last action-value function. A preliminary version of our paper was presented at L4DC. Model-Based Reinforcement Learning with Value-Targeted Regression Exploration/ Targets Optimism Dithering Next states UC-Matrix RL EG-Freq Values UCRL-VTR EG-VTR Mixed UCRL-Mixed EG-Mixed Table 1. Legend to the algorithms compared. Note that UCMatrix RL of Yang & Wang (2019a) in the tabular case essentially becomes UCRL of Jaksch et al. (2010). 6. Numerical Experiments The goal of our experiments is to provide insight into the benefits and/or pitfalls of using value-targets for fitting models, both with and without optimistic planning. 
We run our experiments in the tabular setting as it is easy to keep aspects of the test environments under control while avoiding approximate computations. Note that tabular environments are a special case of the linear model where $P_j(s'|s, a) = \mathbb{I}(j = f(s, a, s'))$, where $j \in [S^2A]$ and $f$ is a bijection that maps its arguments to the set $[S^2A]$, making $d = S^2A$. The objective is either to minimize the mean-squared error of predicting next states (alternatively, to maximize the log-likelihood of the observed data), which leads to frequency-based model estimates, or to minimize the value-targeted loss as proposed in our paper. The other component of the algorithms is whether they implement optimistic planning, or planning with the nominal model and then following an ε-greedy policy with respect to the estimated model ("dithering"). In the case of optimistic planning, the algorithm that uses mixed targets uses a union bound and takes the smallest of the value upper confidence bounds obtained with the two model-estimation methods. This leads to six algorithms, as shown in Table 1. Results for the mixed variants are very similar to the variant that uses VTR and can be found in Appendix G. In the experiments we use confidence bounds that are specialized to the linear case. For details of these, see Appendix F. For ε-greedy, we optimize the value of ε in each environment to get the best results. This gives ε-greedy an unfair advantage. As we shall soon see, despite this advantage, ε-greedy does not fare particularly well in our experiments.

6.1. Measurements

We report the cumulative regret as a function of the number of episodes, and the weighted model error to indicate how well the model is learned. The results are obtained from 30 independent runs for the ε-greedy algorithms and 10 independent runs for the UC algorithms. The weighted model error is computed as follows. Given the model estimate $\hat{P}$, we compute

$$\sum_{s,a,s'} \frac{N(s, a, s')}{N'(s, a)}\,\bigl|\hat{P}(s' \,|\, s, a) - P^*(s' \,|\, s, a)\bigr|, \tag{8}$$

where $N'(s, a)$ is the observation count of the state-action pair $(s, a)$, $N(s, a, s')$ is the count of transitions to $s'$ from $(s, a)$, and $P^*$ is the true dynamics model. The weighting is introduced so that an algorithm that discards a state-action pair is not (unduly) penalized.

Figure 1. The RiverSwim environment with 6 states. State s1 has a small associated reward, state s6 has a large associated reward. The action whose effect is shown with a dashed arrow deterministically moves the agent left. The other action is stochastic, and with relatively high probability moves the agent towards state s6: this represents swimming "against the current".

6.2. Results for RiverSwim

The schematic diagram of the RiverSwim environment is shown in Figure 1. RiverSwim consists of $S$ states arranged in a chain. The agent begins on the far left and has the choice of swimming left or right at each state. There is a current that makes swimming left much easier than swimming right. Swimming left, with the current, always succeeds in moving the agent left, while swimming right, against the current, sometimes moves the agent right (with a small probability of moving left as well), but more often than not leaves the agent in the current state. Thus smart exploration is a necessity to learn a good policy in this environment. We experiment with small environments with $S = 5$ and set the horizon to $H = 20$. The optimal value of the initial state is 5.6 for our five-state RiverSwim. The initial state is the leftmost state (s1 in the diagram).
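For concreteness, here is a hedged sketch of the RiverSwim dynamics described above together with the weighted model error (8). The exact transition probabilities for the "swim right" action are an assumption chosen to match the qualitative description in the text (the paper itself does not list them).

```python
import numpy as np

def riverswim_kernel(S=5, p_right=0.35, p_stay=0.6, p_left=0.05):
    """Transition kernel P[s, a, s2] for an S-state RiverSwim chain.

    Action 0 ("swim left", with the current) moves left deterministically.
    Action 1 ("swim right", against the current) succeeds only sometimes;
    the probabilities are illustrative: a small success probability, mostly
    staying put, and an occasional slip to the left."""
    P = np.zeros((S, 2, S))
    for s in range(S):
        P[s, 0, max(s - 1, 0)] = 1.0
        P[s, 1, min(s + 1, S - 1)] += p_right
        P[s, 1, s] += p_stay
        P[s, 1, max(s - 1, 0)] += p_left
    return P

def weighted_model_error(P_hat, P_true, N, N_sa):
    """Weighted L1 model error of Eq. (8):
    sum_{s,a,s'} N(s,a,s')/N'(s,a) * |P_hat(s'|s,a) - P_true(s'|s,a)|.

    N:    visit counts of (s, a, s'), shape (S, A, S)
    N_sa: visit counts of (s, a), shape (S, A)"""
    weights = N / np.maximum(N_sa[:, :, None], 1)   # avoid division by zero
    return float(np.sum(weights * np.abs(P_hat - P_true)))
```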
The value of ε that we found to work best for EGRL-VTR is ε = 0.2 and the value that we found to work best for EG-Freq is ε = 0.12. The results are shown in Figure 2. The regret per episode for an algorithm that does not learn is expected to be in the same range as the respective optimal values. Based on this, we see that 10⁵ episodes is barely sufficient for the algorithms other than UCRL-VTR to learn a good policy. Looking at the model errors, we see that EGRL-VTR is doing quite poorly and EG-Freq is also lacking, while the others are doing reasonably well. However, this is because EG-Freq visits the various state-action pairs more uniformly than the other methods. The results clearly indicate that (i) fitting to the state-value function alone provides enough of a signal for learning, as evidenced by UCRL-VTR obtaining low regret as predicted by our theoretical results, and that (ii) optimism is necessary when using VTR to achieve good results, as evidenced by UCRL-VTR achieving significantly better regret than EGRL-VTR, even in the smaller RiverSwim environment. It is also promising that value-targeted regression with optimistic exploration outperformed optimism based on the canonical model-estimation procedure. We attribute this to the fact that value-targeted regression learns a model that predicts the optimal values well faster than the canonical, frequency-based approach. That value-targeted regression also learns a model with small weighted error appears to be an accidental feature of this environment. Our next experiments are targeted at further exploring whether VTR can be effective without learning a good model.

Figure 2. Cumulative regret (left) and weighted L1-norm model error (right) on the 5-state RiverSwim for UCRL-VTR, UC-MatrixRL, EGRL-VTR and EG-Freq. The results for the ε-greedy algorithms are averaged over thirty runs and error bars are reported for the regret plots.

6.3. WideTree

We introduce a novel tabular MDP we call WideTree. The WideTree environment has a fixed horizon H = 2 and S = 11 states. A visualization of the WideTree environment is shown in Figure 3. In WideTree, an agent starts at the initial state s1. The agent then progresses to one of the many bottom terminal states and collects a reward of either 0 or 1. The only significant action is whether to transition from s1 to either s2 or s3.

Figure 3. An eleven-state WideTree MDP with terminal states e1, ..., e8. The agent starts in the initial state s1. From s1 the agent deterministically transitions to either state s2 or state s3. From s2 or s3 the agent picks one of two possible actions and, with probability p = 0.5 each, transitions to one of the terminal states ei; only the terminal states reached from s3 carry reward r = 1.0. The choice of the initial action determines the delayed reward the agent will observe.

Figure 4. Cumulative regret (left) and weighted L1-norm model error (right) on the 11-state WideTree for UCRL-VTR, UC-MatrixRL, EGRL-VTR and EG-Freq. The results for the ε-greedy algorithms are averaged over thirty runs and error bars are reported for the regret plots.

Note that the model in the second layer is
irrelevant for making a good decision: Once in s3, all actions lead to a reward of one, and once in s2, all actions lead to a reward of zero. We vary the number of bottom states reachable from states s2 and s3 while still maintaining a reward structure but the results here are shown with S = 11. We set = 0.1 in this environment, as this allows the model error of EG-Freq to match that of UC-Matrix RL. The results are shown in Figure 4. Both UCRL-VTR and EG-VTR learn equally poor models (their graphs are on the top of each other ). Yet, UCRL-VTR manages to quickly learn a good policy, as attested by its low regret. EG-Freq and EG-VTR perform equally poorly and UC-Matrix RL is even slower as it keeps exploring the environment. These experiments clearly illustrate that UCRL-VTR is able to achieve good results without learning a good model its focus on values makes pays off swiftly in this well-chosen environment. 7. Conclusions We considered online learning in episodic MDPs and proposed an optimistic model-based reinforcement learning method (UCRL-VTR) with the unique characteristic to evaluate and select models based on their ability to predict value functions that the algorithm constructs during learning. The regret of the algorithm was shown to be bounded by a quantity that relates to the richness of the model class through the Eluder dimension and the metric entropy of an appropriately constructed function space. For the case of linear mixture models, the regret bound simplifies to O(d H3T) where d is the number of model parameters, H is the horizon, and T is the total number of interaction steps. Our experiments confirmed that the value-targeted regression objective is not only theoretically sound, but also yields a competitive method which allows task-focused model-tuning: In a carefully chosen environment we demonstrated that the algorithm achieves low regret despite that it ignores modeling a major part of the environment. Model-Based Reinforcement Learning with Value-Targeted Regression 8. Acknowledgements Csaba Szepesv ari gratefully acknowledges funding from the Canada CIFAR AI Chairs Program, Amii and NSERC. Mengdi Wang gratefully acknowledges funding from the U.S. National Science Foundation (NSF) grant CMMI1653435, Air Force Office of Scientific Research (AFOSR) grant FA9550-19-1-020, and C3.ai DTI. Abbasi-Yadkori, Y. and Szepesv ari, C. Bayesian optimal control of smoothly parameterized systems. In UAI, pp. 1 11, 2015. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312 2320, 2011. Agrawal, S. and Jia, R. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pp. 1184 1194, 2017. Al Quraishi, M. Alpha Fold at CASP13. Bioinformatics, 35 (22):4862 4865, 2019. Arulkumaran, K., Cully, A., and Togelius, J. Alphastar: An evolutionary computation perspective. ar Xiv preprint ar Xiv:1902.01724, 2019. Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning Volume 70, pp. 263 272. JMLR. org, 2017. Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30 37. Elsevier, 1995. Bertsekas, D. P. and Shreve, S. Stochastic optimal control: the discrete-time case. Academic Press, 1978. Bertsekas, D. P. and Tsitsiklis, J. N. 
Neuro-dynamic pro- gramming. Athena Scientific, 1996. Bradtke, S. J. and Barto, A. G. Linear least-squares algo- rithms for temporal difference learning. Machine learning, 22(1-3):33 57, 1996. Dann, C., Lattimore, T., and Brunskill, E. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp. 5713 5723, 2017. Dann, C., Li, L., Wei, W., and Brunskill, E. Policy cer- tificates: Towards accountable reinforcement learning. ar Xiv preprint ar Xiv:1811.03056, 2018. Du, S. S., Luo, Y., Wang, R., and Zhang, H. Provably efficient Q-learning with function approximation via distribution shift error checking oracle. ar Xiv preprint ar Xiv:1906.06321, 2019. Farahmand, A.-M. Iterative value-aware model learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 9090 9101, 2018. Farahmand, A.-M., Barreto, A., and Nikovski, D. Value- aware loss function for model-based reinforcement learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1486 1494. PMLR, 2017. Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563 1600, 2010. Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863 4873, 2018. Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. ar Xiv preprint ar Xiv:1907.05388, 2019. Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Camp- bell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., and Michalewski, H. Model-based reinforcement learning for Atari. In ICLR, 2019. Kakade, S., Wang, M., and Yang, L. F. Variance reduc- tion methods for sublinear reinforcement learning. ar Xiv preprint ar Xiv:1802.09184, 2018. Kober, J., Bagnell, J. A., and Peters, J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238 1274, 2013. Kovalenko, B. G. I. N. Introduction to queueing theory. Is- rael Program for Scientific Translation, Jerusalem, 1968. Lattimore, T. and Szepesv ari, C. Learning with good feature representations in bandits and in RL with a generative model, 2019. Lattimore, T. and Szepesv ari, C. Bandit Algorithms. Cam- bridge University Press, 2020. (to appear). Lu, X. and Van Roy, B. Ensemble sampling. In Advances in neural information processing systems, pp. 3258 3266, 2017. Model-Based Reinforcement Learning with Value-Targeted Regression Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. ar Xiv preprint ar Xiv:1312.5602, 2013. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529, 2015. Modi, A., Jiang, N., Tewari, A., and Singh, S. Sample com- plexity of reinforcement learning using linearly combined model ensembles. ar Xiv preprint ar Xiv:1910.10597, 2019. Osband, I. and Van Roy, B. Model-based reinforcement learning and the Eluder dimension. In Advances in Neural Information Processing Systems, pp. 1466 1474, 2014. Osband, I. 
and Van Roy, B. On lower bounds for regret in reinforcement learning. ar Xiv preprint ar Xiv:1608.02732, 2016. Osband, I., Van Roy, B., and Wen, Z. Generalization and ex- ploration via randomized value functions. ar Xiv preprint ar Xiv:1402.0635, 2014. Osband, I., Van Roy, B., Russo, D., and Wen, Z. Deep ex- ploration via randomized value functions. ar Xiv preprint ar Xiv:1703.07608, 2017. Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learn- ing unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pp. 1333 1342, 2017. Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pp. 752 759. ACM, 2008. Pires, B. and Szepesv ari, C. Policy error bounds for model- based reinforcement learning with factored linear models. In COLT, pp. 121 151, 2016. Rusmevichientong, P. and Tsitsiklis, J. N. Linearly parame- terized bandits. Mathematics of Operations Research, 35 (2):395 411, 2010. Russel, S. and Norvig, P. Artificial Intelligence a modern approach. Prentice Hall, 2003. Russo, D. and Van Roy, B. Learning to optimize via poste- rior sampling. Mathematics of Operations Research, 39 (4):1221 1243, 2014. Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., , Lillicrap, T., and Silver, D. Mastering Atari, Go, chess and shogi by planning with a learned model. ar Xiv preprint ar Xiv:1911.08265, 2019. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Driessche, G. v. d., Graepel, T., and Hassabis, D. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017. Strens, M. J. A. A Bayesian framework for reinforcement learning. In ICML, pp. 943 950, 2000. Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., and Langford, J. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory (COLT), pp. 2898 2933, 2019. Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, 2 edition, 2018. Theocharous, G., Wen, Z., Abbasi-Yadkori, Y., and Vlas- sis, N. Posterior sampling for large scale reinforcement learning. ar Xiv preprint ar Xiv:1711.07979, 2017. Tsitsiklis, J. N. and Van Roy, B. Analysis of temporal- diffference learning with function approximation. In Advances in neural information processing systems, pp. 1075 1081, 1997. Wen, Z. and Van Roy, B. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pp. 3021 3029, 2013. Wen, Z. and Van Roy, B. Efficient reinforcement learn- ing in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3): 762 782, 2017. Xie, C., Patil, S., Moldovan, T., Levine, S., and Abbeel, P. Model-based reinforcement learning with parametrized physical models and optimism-driven exploration. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 504 511. IEEE, 2016. Yang, L. F. and Wang, M. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. ar Xiv preprint ar Xiv:1905.10389, 2019a. Yang, L. F. and Wang, M. 
Sample-optimal parametric Q- learning with linear transition models. International Conference on Machine Learning, 2019b. Model-Based Reinforcement Learning with Value-Targeted Regression Zanette, A., Lazaric, A., Kochenderfer, M. J., and Brunskill, E. Limiting extrapolation in linear approximate value iteration. In Advances in Neural Information Processing Systems 32, pp. 5616 5625. Curran Associates, Inc., 2019. Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. ar Xiv preprint ar Xiv:2004.10019, April 2020.