# Scalar Posterior Sampling with Applications

Georgios Theocharous (Adobe Research, theochar@adobe.com), Zheng Wen (Adobe Research, zwen@adobe.com), Yasin Abbasi-Yadkori (Adobe Research, abbasiya@adobe.com), Nikos Vlassis (Netflix, nvlassis@netflix.com)

## Abstract

We propose a practical non-episodic PSRL algorithm that, unlike recent state-of-the-art PSRL algorithms, uses a deterministic, model-independent episode switching schedule. Our algorithm, termed deterministic schedule PSRL (DS-PSRL), is efficient in terms of time, sample, and space complexity. We prove a Bayesian regret bound under mild assumptions. Our result is more generally applicable to multiple parameters and continuous state-action problems. We compare our algorithm with state-of-the-art PSRL algorithms on standard discrete and continuous problems from the literature. Finally, we show how the assumptions of our algorithm satisfy a sensible parametrization for a large class of problems in sequential recommendations.

## 1 Introduction

Thompson sampling [Thompson, 1933], or posterior sampling for reinforcement learning (PSRL), is a conceptually simple approach for dealing with unknown MDPs [Strens, 2000; Osband et al., 2013]. PSRL begins with a prior distribution over the MDP model parameters (transitions and/or rewards) and typically works in episodes. At the start of each episode, an MDP model is sampled from the posterior belief, and the agent follows the policy that is optimal for that sampled MDP until the end of the episode. The posterior is updated at the end of every episode based on the observed actions, states, and rewards.

A special case of MDP under which PSRL has recently been studied extensively is the MDP with state resetting, either explicit or implicit. Specifically, in [Osband et al., 2013; Osband and Van Roy, 2014] the considered MDPs are assumed to have fixed-length episodes, and at the end of each episode the MDP's state is reset according to a fixed state distribution. In [Gopalan and Mannor, 2015], there is an assumption that the environment is ergodic and that there exists a recurrent state under any policy. Both lines of work have developed variants of PSRL algorithms under their respective assumptions, as well as state-of-the-art regret bounds, Bayesian in [Osband et al., 2013; Osband and Van Roy, 2014] and frequentist in [Gopalan and Mannor, 2015].

However, many real-world problems are of a continuing, non-resetting nature. These include sequential recommendations and other common examples found in controlled mechanical systems (e.g., control of manufacturing robots) and process optimization (e.g., controlling a queuing system), where resets are rare or unnatural. Many of these real-world examples can easily be parametrized with a scalar parameter, where each value of the parameter specifies a complete model. Such domains do not have the luxury of state resetting, and the agent needs to learn to act without necessarily revisiting states. Extensions of PSRL algorithms to general MDPs without state resetting have so far produced impractical algorithms and, in some cases, buggy theoretical analysis. This is due to the difficulty of analyzing regret under policy switching schedules that depend on various dynamic statistics produced by the true underlying model (e.g., doubling of the visitation counts of state-action pairs and uncertainty reduction of the parameters).
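As background for the discussion that follows, the sketch below summarizes the generic episodic PSRL loop described at the start of this section. It is an illustrative outline only, under the assumption of state resetting; `sample_posterior`, `solve_mdp`, `update_posterior`, and `env` are hypothetical placeholders, not components of this paper.

```python
# Minimal sketch of generic episodic PSRL (illustration only).
# `sample_posterior`, `solve_mdp`, `update_posterior`, and `env` are
# hypothetical placeholders, not components from this paper.

def episodic_psrl(env, prior, num_episodes, horizon):
    posterior = prior
    for _ in range(num_episodes):
        theta = sample_posterior(posterior)        # sample an MDP from the posterior
        policy = solve_mdp(theta)                  # plan as if the sample were true
        episode = []
        state = env.reset()                        # state resetting at episode start
        for _ in range(horizon):
            action = policy(state)
            next_state, reward = env.step(action)
            episode.append((state, action, reward, next_state))
            state = next_state
        posterior = update_posterior(posterior, episode)   # update once per episode
    return posterior
```

In the continuing setting studied in this paper there is no `env.reset()`, which is exactly why the choice of episode-switching schedule becomes the central design question.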
Next we summarize the literature on this general-case PSRL. The earliest such general case was analyzed in terms of Bayes regret for the lazy PSRL algorithm [Abbasi-Yadkori and Szepesvári, 2015]. In this approach a new model is sampled, and a new policy is computed from it, every time the uncertainty over the underlying model is sufficiently reduced; however, the corresponding analysis was shown to contain a gap [Osband and Van Roy, 2016].

A recent general-case PSRL algorithm with a Bayes regret analysis was proposed in [Ouyang et al., 2017b]. At the beginning of each episode, the algorithm generates a sample from the posterior distribution over the unknown model parameters. It then follows the optimal stationary policy for the sampled model for the rest of the episode. The duration of each episode is dynamically determined by two stopping criteria: a new episode starts either when the length of the current episode exceeds the previous length by one, or when the number of visits to any state-action pair is doubled. The authors establish an $\tilde{O}(HS\sqrt{AT})$ bound on expected regret in a Bayesian setting, where $S$ and $A$ are the sizes of the state and action spaces, $T$ is time, $H$ is the bound on the span, and the $\tilde{O}$ notation hides logarithmic factors. However, despite the state-of-the-art regret analysis, the algorithm is not well suited for large and continuous state and action spaces, due to the requirement to count visitations for all state-action pairs.

In another recent work [Agrawal and Jia, 2017], the authors present a general-case PSRL algorithm that achieves near-optimal worst-case regret bounds when the underlying Markov decision process is communicating with a finite, though unknown, diameter. Their main result is a high-probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions, and diameter $D$, when $T \geq S^5 A$. Despite the nice form of the regret bound, this algorithm suffers from practicality issues similar to those of the algorithm in [Ouyang et al., 2017b]. The epochs are computed based on doubling the visitation counts of state-action pairs, which implies tabular representations. In addition, it employs a stricter assumption than previous work, namely a fully communicating MDP with some unknown diameter. Finally, for the bound to hold, $T \geq S^5 A$ is required, which would be impractical for large-scale problems.

Neither of the above two recent state-of-the-art algorithms [Ouyang et al., 2017b; Agrawal and Jia, 2017] uses generalization, in that they learn separate parameters for each state-action pair. In such a non-parametrized case, there are several other modern reinforcement learning algorithms, such as UCRL2 [Jaksch et al., 2010], REGAL [Bartlett and Tewari, 2009], and R-max [Brafman and Tennenholtz, 2002], which learn MDPs using the well-known optimism-under-uncertainty principle. In these approaches a confidence interval is maintained for each state-action pair, and observing a particular state transition and reward provides information for only that state and action. Such approaches are inefficient in cases where the whole structure of the MDP can be determined by a scalar parameter. Despite the elegant regret bounds for the general-case PSRL algorithms developed in [Ouyang et al., 2017b; Agrawal and Jia, 2017], both of them focus on tabular reinforcement learning and hence are sample inefficient for many practical problems with exponentially large or even continuous state/action spaces.
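The two TSDE episode-stopping criteria described above are simple to state in code. The following sketch is an illustrative paraphrase of those criteria (all variable names are hypothetical); it is included only to make explicit why a table of visitation counts for every state-action pair is required, which is what ties the approach to tabular problems.

```python
# Illustrative paraphrase of the two TSDE episode-stopping criteria
# (Ouyang et al., 2017b); variable names are hypothetical.

def tsde_new_episode(current_len, prev_episode_len, visit_counts, counts_at_start):
    """Return True if TSDE would start a new episode now."""
    # Criterion 1: the current episode has grown longer than the previous one.
    if current_len > prev_episode_len:
        return True
    # Criterion 2: some state-action pair's visitation count has doubled since
    # the episode started (a first visit to a previously unseen pair counts).
    for sa, n in visit_counts.items():
        if n > 2 * counts_at_start.get(sa, 0):
            return True
    return False
```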
On the other hand, in many practical RL problems the MDPs are parametrized, in the sense that the system dynamics and reward/loss functions are assumed to lie on a known low-dimensional manifold [Gopalan and Mannor, 2015]. Such model parametrization (i.e., model generalization) allows researchers to develop sample-efficient algorithms for large-scale RL problems. Our paper belongs to this line of research. Specifically, we propose a novel general-case PSRL algorithm, referred to as DS-PSRL, that exploits model parametrization (generalization). We prove an $\tilde{O}(\sqrt{T})$ Bayes regret bound for DS-PSRL, assuming we can model every MDP with a single smooth parameter. DS-PSRL also has lower computational and space complexity than the algorithms proposed in [Ouyang et al., 2017b; Agrawal and Jia, 2017]. In the case of [Ouyang et al., 2017b] the number of policy switches in the first $T$ steps is $K_T = O\big(\sqrt{2SAT\log T}\big)$; on the other hand, DS-PSRL adopts a deterministic schedule and its number of policy switches is $K_T \leq \log_2(T)$. Since the major computational burden of PSRL algorithms is to solve a sampled MDP at each policy switch, DS-PSRL is computationally more efficient than the algorithm proposed in [Ouyang et al., 2017b]. As to space complexity, both algorithms proposed in [Ouyang et al., 2017b; Agrawal and Jia, 2017] need to store counts of state-action visitations. In contrast, DS-PSRL uses a model-independent schedule and as a result does not need to store such statistics.

In the rest of the paper we describe the DS-PSRL algorithm and derive a state-of-the-art Bayes regret analysis. We demonstrate and compare our algorithm with the state-of-the-art on standard problems from the literature. Finally, we show how the assumptions of our algorithm satisfy a sensible parametrization for a large class of problems in sequential recommendations.

## 2 Problem Formulation

We consider the reinforcement learning problem in a parametrized Markov decision process (MDP) $(\mathcal{X}, \mathcal{A}, \ell, P_{\theta^*})$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $\ell : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is the instantaneous loss function, and $P_{\theta^*}$ is an MDP transition model parametrized by $\theta^*$. We assume that the learner knows $\mathcal{X}$, $\mathcal{A}$, $\ell$, and the mapping from the parameter $\theta$ to the transition model $P_{\theta}$, but does not know $\theta^*$. Instead, the learner has a prior belief $P_0$ on $\theta^*$ at time $t = 0$, before it starts to interact with the MDP. We also use $\Theta$ to denote the support of the prior belief $P_0$. Note that in this paper we do not assume $\mathcal{X}$ or $\mathcal{A}$ to be finite; they can be infinite or even continuous.

For any time $t = 1, 2, \ldots$, let $x_t \in \mathcal{X}$ be the state at time $t$ and $a_t \in \mathcal{A}$ be the action at time $t$. Our goal is to develop an algorithm (controller) that adaptively selects an action $a_t$ at every time step $t$, based on prior information and past observations, to minimize the long-run Bayes average loss $\limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\big[\sum_{t=1}^{T} \ell(x_t, a_t)\big]$. Similarly to the existing literature [Osband et al., 2013; Ouyang et al., 2017b], we measure the performance of such an algorithm using the Bayes regret
$$R_T = \mathbb{E}\left[\sum_{t=1}^{T} \big( \ell(x_t, a_t) - J(\theta^*) \big)\right], \qquad (1)$$
where $J(\theta^*)$ is the average loss of running the optimal policy under the true model $\theta^*$. Note that under the mild weakly communicating assumption, $J(\theta^*)$ is independent of the initial state.
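The Bayes regret in Equation (1) averages over both the prior and the trajectories. As a minimal Monte-Carlo sketch of how it would be estimated empirically, the snippet below draws the true parameter from the prior, runs the learner, and subtracts the optimal average loss; `sample_prior`, `run_algorithm`, and `optimal_average_loss` are hypothetical placeholders, not routines from this paper.

```python
# Monte-Carlo sketch of the Bayes regret in Equation (1) (illustration only).
# `sample_prior`, `run_algorithm`, and `optimal_average_loss` are hypothetical:
# run_algorithm returns the losses l(x_t, a_t) incurred over T steps, and
# optimal_average_loss returns J(theta) for the given parameter.

def estimate_bayes_regret(T, num_runs):
    total = 0.0
    for _ in range(num_runs):
        theta_star = sample_prior()                  # theta* ~ P_0
        losses = run_algorithm(theta_star, T)        # l(x_1, a_1), ..., l(x_T, a_T)
        total += sum(losses) - T * optimal_average_loss(theta_star)
    return total / num_runs                          # approximates R_T
```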
The Bayes regret analysis of PSRL relies on the key observation that at each stopping time the true MDP model $\theta^*$ and the sampled model $\tilde{\theta}$ are identically distributed [Ouyang et al., 2017b]. This fact allows us to relate quantities that depend on the true, but unknown, MDP $\theta^*$ to those of the sampled MDP $\tilde{\theta}$, which is fully observed by the agent. This is formalized by the following lemma.

**Lemma 1 (Posterior Sampling [Ouyang et al., 2017b])** Let $(\mathcal{F}_s)_{s=1}^{\infty}$ be a filtration ($\mathcal{F}_s$ can be thought of as the historic information up to the current time $s$) and let $\tau$ be an almost surely finite $\mathcal{F}_s$-stopping time. Then, for any measurable function $g$,
$$\mathbb{E}\big[g(\theta^*) \mid \mathcal{F}_{\tau}\big] = \mathbb{E}\big[g(\tilde{\theta}_{\tau}) \mid \mathcal{F}_{\tau}\big] \,.$$
Additionally, the above implies that $\mathbb{E}[g(\theta^*)] = \mathbb{E}[g(\tilde{\theta}_{\tau})]$ through the tower property.

## 3 The Proposed Algorithm: Deterministic Schedule PSRL

In this section we propose a PSRL algorithm with a deterministic policy update schedule, shown in Figure 1. The algorithm changes the policy in an exponentially rare fashion: if the length of the current episode is $L$, the next episode has length $2L$. This switching schedule ensures that the total number of switches is $O(\log T)$. We also note that, when sampling a new parameter $\tilde{\theta}_t$, the algorithm finds the optimal policy assuming that the sampled parameter is the true parameter of the system. Any planning algorithm can be used to compute this optimal policy [Sutton and Barto, 1998]. In our analysis we assume that we have access to the exact optimal policy, although it can be shown that this computation need not be exact and a near-optimal policy suffices (see [Abbasi-Yadkori and Szepesvári, 2015]).

- **Inputs:** $P_1$, the prior distribution of $\theta^*$. Set $L \leftarrow 1$.
- **for** $t = 1, 2, \ldots$ **do**
  - **if** $t = L$ **then** sample $\tilde{\theta}_t \sim P_t$ and set $L \leftarrow 2L$; **else** set $\tilde{\theta}_t \leftarrow \tilde{\theta}_{t-1}$.
  - Calculate the near-optimal action $a_t \leftarrow \pi^*(x_t, \tilde{\theta}_t)$.
  - Execute action $a_t$ and observe the new state $x_{t+1}$.
  - Update $P_t$ with $(x_t, a_t, x_{t+1})$ to obtain $P_{t+1}$.

Figure 1: The DS-PSRL algorithm with a deterministic schedule of policy updates.

To measure the performance of our algorithm we use the Bayes regret $R_T$ defined in Equation (1). The slower the regret grows, the closer the performance of the learner is to that of an optimal policy. If the growth rate of $R_T$ is sublinear ($R_T = o(T)$), the average loss per time step converges to the optimal average loss as $T$ gets large, and in this sense the algorithm is asymptotically optimal. Our main result shows that, under certain conditions, the construction of such asymptotically optimal policies can be reduced to efficiently sampling from the posterior of $\theta^*$ and solving classical (non-Bayesian) optimal control problems.

First we state our assumptions. We assume that the MDP is weakly communicating. This is a standard assumption, and under it the optimal average loss satisfies the Bellman equation. Further, we assume that the dynamics are parametrized by a scalar parameter and satisfy a smoothness condition.

**Assumption A1 (Lipschitz Dynamics)** There exists a constant $C$ such that for any state $x$, action $a$, and parameters $\theta, \theta' \in \Re$,
$$\| P(\cdot \mid x, a, \theta) - P(\cdot \mid x, a, \theta') \|_1 \leq C\, |\theta - \theta'| \,.$$

We also make a concentrating-posterior assumption, which states that the variance of the difference between the true parameter and the sampled parameter gets smaller as more samples are gathered.

**Assumption A2 (Concentrating Posterior)** Let $N_j$ be one plus the number of steps in the first $j$ episodes, and let $\tilde{\theta}_j$ be sampled from the posterior at episode $j$. Then there exists a constant $C'$ such that
$$\mathbb{E}\left[\max_{j}\, N_j \,\big(\theta^* - \tilde{\theta}_j\big)^2\right] \leq C' \,.$$

Assumption A2 simply says that the variance of the posterior decreases as more data are gathered; in other words, we assume that the problem is learnable and not a degenerate case. A2 was shown to hold for two general categories of problems, finite MDPs and linearly parametrized problems with Gaussian noise [Abbasi-Yadkori and Szepesvári, 2015].
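As a toy numerical illustration of the concentrating-posterior idea behind Assumption A2, the snippet below uses a Beta-Bernoulli model rather than an MDP; it is not part of the paper's analysis and its episode indexing is simplified, but it shows why a quantity of the form $N_j(\theta^* - \tilde{\theta}_j)^2$ can stay bounded as episodes grow.

```python
import numpy as np

# Toy illustration of a concentrating posterior (Assumption A2 flavor) with a
# Beta-Bernoulli model instead of an MDP: after N observations the posterior
# standard deviation shrinks roughly like 1/sqrt(N), so (1 + N) * error^2
# stays roughly bounded even as episode lengths double.

rng = np.random.default_rng(0)
theta_star = 0.3
alpha, beta = 1.0, 1.0           # uniform prior
N = 0                            # steps observed before the current episode's sample

for episode, length in enumerate([2 ** j for j in range(12)], start=1):
    theta_sample = rng.beta(alpha, beta)          # posterior sample at episode start
    print(f"episode {episode:2d}  steps so far {N:5d}  "
          f"(1+N)*err^2 = {(N + 1) * (theta_star - theta_sample) ** 2:.3f}")
    obs = rng.random(length) < theta_star         # Bernoulli observations this episode
    alpha += obs.sum()
    beta += length - obs.sum()
    N += length
```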
In addition, in this paper we prove that this assumption (A2) is satisfied for a large class of practical problems, such as the smoothly parametrized sequential recommendation systems of Section 6. Now we are ready to state the main theorem; we give a sketch of the analysis in the next section, and more details are in the appendix.

**Theorem 1** Under Assumptions A1 and A2, the regret of the DS-PSRL algorithm is bounded as
$$R_T = \tilde{O}\big(C H \sqrt{C' T}\big) \,,$$
where the $\tilde{O}$ notation hides logarithmic factors.

Notice that the regret bound in Theorem 1 does not directly depend on $S$ or $A$. Moreover, the regret bound is smaller if the Lipschitz constant $C$ is smaller or if the posterior concentrates faster (i.e., $C'$ is smaller).

## 4 Sketch of Analysis

To analyze the algorithm shown in Figure 1, we first decompose the regret into a number of terms, which are then bounded one by one. Let $\tilde{x}^a_{t+1} \sim P(\cdot \mid x_t, a, \tilde{\theta}_t)$, i.e., an imaginary next-state sample assuming we take action $a$ in state $x_t$ when the parameter is $\tilde{\theta}_t$. Also let $\tilde{x}_{t+1} \sim P(\cdot \mid x_t, a_t, \tilde{\theta}_t)$ and $x_{t+1} \sim P(\cdot \mid x_t, a_t, \theta^*)$. By the average-cost Bellman optimality equation [Bertsekas, 1995], for a system parametrized by $\tilde{\theta}_t$ we can write
$$J(\tilde{\theta}_t) + h_t(x_t) = \min_a \Big\{ \ell(x_t, a) + \mathbb{E}\big[h_t(\tilde{x}^a_{t+1}) \mid \mathcal{F}_t, \tilde{\theta}_t\big] \Big\} \,.$$
Here $h_t(x) = h(x, \tilde{\theta}_t)$ is the differential value function for a system with parameter $\tilde{\theta}_t$. We assume there exists $H > 0$ such that $h_t(x) \in [0, H]$ for any $x \in \mathcal{X}$. Because the algorithm takes the optimal action with respect to parameter $\tilde{\theta}_t$ and $a_t$ is the action at time $t$, the right-hand side of the above equation is minimized and thus
$$J(\tilde{\theta}_t) + h_t(x_t) = \ell(x_t, a_t) + \mathbb{E}\big[h_t(\tilde{x}_{t+1}) \mid \mathcal{F}_t, \tilde{\theta}_t\big] \,.$$
The regret decomposes into two terms, as shown in Lemma 2.

**Lemma 2** We can decompose the regret as follows:
$$\sum_{t=1}^{T} \mathbb{E}\big[\ell(x_t, a_t) - J(\theta^*)\big] \;\leq\; H \sum_{t=1}^{T} \mathbb{E}\big[\mathbf{1}\{A_t\}\big] \;+\; \sum_{t=1}^{T} \mathbb{E}\big[h_t(x_{t+1}) - h_t(\tilde{x}_{t+1})\big] \;+\; H \,,$$
where $A_t$ denotes the event that the algorithm has changed its policy at time $t$.

The first term, $H \sum_{t=1}^{T} \mathbb{E}[\mathbf{1}\{A_t\}]$, is related to the sequential changes in the differential value functions, $h_{t+1} - h_t$. We control this term by keeping the number of switches small; $h_{t+1} = h_t$ as long as the same parameter $\tilde{\theta}_t$ is used. Notice that under DS-PSRL, $\sum_{t=1}^{T} \mathbf{1}\{A_t\} \leq \log_2(T)$ always holds. Thus, the first term can be bounded by $H \sum_{t=1}^{T} \mathbb{E}[\mathbf{1}\{A_t\}] \leq H \log_2(T)$.

The second term, $\sum_{t=1}^{T} \mathbb{E}[h_t(x_{t+1}) - h_t(\tilde{x}_{t+1})]$, is related to how fast the posterior concentrates around the true parameter. To simplify the exposition, we define
$$\Delta_t = \int_x h_t(x) \big( P(x \mid x_t, a_t, \theta^*) - P(x \mid x_t, a_t, \tilde{\theta}_t) \big)\, dx = \mathbb{E}\big[h_t(x_{t+1}) - h_t(\tilde{x}_{t+1}) \mid x_t, a_t\big] \,.$$
Recall that $\tilde{x}_{t+1} \sim P(\cdot \mid x_t, a_t, \tilde{\theta}_t)$ while $x_{t+1} \sim P(\cdot \mid x_t, a_t, \theta^*)$; thus, from the tower rule, we have $\mathbb{E}[\Delta_t] = \mathbb{E}[h_t(x_{t+1}) - h_t(\tilde{x}_{t+1})]$. The following two lemmas bound $\sum_{t=1}^{T} \mathbb{E}[\Delta_t]$ under Assumptions A1 and A2.

**Lemma 3** Under Assumption A1, let $m$ be the number of schedules (policy switches) up to time $T$. Then
$$\sum_{t=1}^{T} \mathbb{E}[\Delta_t] \;\leq\; C H \sqrt{T \, \mathbb{E}\Big[\sum_{j=1}^{m} M_j \big(\theta^* - \tilde{\theta}_j\big)^2\Big]} \,,$$
where $M_j$ is the number of steps in the $j$th episode.

**Lemma 4** Given Assumption A2, we can show
$$\mathbb{E}\Big[\sum_{j=1}^{m} M_j \big(\theta^* - \tilde{\theta}_j\big)^2\Big] \;\leq\; 2 C' \log_2 T \,,$$
and therefore $\sum_{t=1}^{T} \mathbb{E}[\Delta_t] \leq C H \sqrt{2 C' T \log_2 T} = O\big(C H \sqrt{C' T \log T}\big)$.

Combining the above results, we have
$$R_T \;\leq\; H \log_2(T) + C H \sqrt{2 C' T \log_2 T} + H \;=\; O\big(C H \sqrt{C' T \log T}\big) \,.$$
This concludes the proof.
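Before turning to the experiments, the sketch below transcribes the DS-PSRL loop of Figure 1 into Python. Only the deterministic doubling schedule is taken from the figure; `sample_posterior`, `solve_for_policy`, `update_posterior`, and `env` are hypothetical placeholders standing in for the posterior-sampling, planning, and environment components.

```python
# Sketch of the DS-PSRL loop from Figure 1 (deterministic doubling schedule).
# `sample_posterior`, `solve_for_policy`, `update_posterior`, and `env` are
# hypothetical placeholders; only the schedule logic is taken from Figure 1.

def ds_psrl(env, prior, T):
    posterior = prior
    L = 1                      # next time step at which the policy is resampled
    policy = None
    x = env.current_state()
    for t in range(1, T + 1):
        if t == L:             # deterministic, model-independent switch
            theta = sample_posterior(posterior)
            policy = solve_for_policy(theta)   # plan as if theta were true
            L = 2 * L          # episode lengths double: 1, 2, 4, 8, ...
        a = policy(x)
        x_next = env.step(a)
        posterior = update_posterior(posterior, (x, a, x_next))
        x = x_next
    return posterior
```

For $T = 10000$ this schedule resamples at $t = 1, 2, 4, \ldots, 8192$, i.e., 14 policy switches, which matches the switch counts reported for DS-PSRL in Figures 2 and 3.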
## 5 Experiments

In this section we compare through simulations the performance of the DS-PSRL algorithm with the latest PSRL algorithm, Thompson Sampling with Dynamic Episodes (TSDE) [Ouyang et al., 2017b]. We experimented with the RiverSwim environment [Strehl and Littman, 2008], which was the domain used in [Ouyang et al., 2017b] to show how TSDE outperforms all known existing algorithms. The RiverSwim example models an agent swimming in a river who can choose to swim either left or right. The MDP consists of $K$ states arranged in a chain, with the agent starting in the leftmost state ($s = 1$). If the agent decides to move left, i.e., with the river current, it always succeeds, but if it decides to move right it might fail with some probability. The reward function is given by: $r(s, a) = 5$ if $s = 1, a = \text{left}$; $r(s, a) = 10000$ if $s = K, a = \text{right}$; and $r(s, a) = 0$ otherwise.

### 5.1 Scalar Parametrization

For scalar parametrization, a single scalar value defines the transition dynamics of the whole MDP. We ran two types of experiments. In the first experiment, the transition dynamics (or fail probability) were the same for all states for a given scalar value. In the second experiment, we allowed a single scalar value to define different fail probabilities for different states. We assumed two probabilities of failure, a high probability $P_1$ and a low probability $P_2$, and two scalar values $\{\theta_1, \theta_2\}$. We compared an algorithm that switches every time step, which we call t-mod-1, with the TSDE and DS-PSRL algorithms. We assumed the true model of the world was $\theta^* = \theta_2$ and that the agent starts in the leftmost state.

In the first experiment, $\theta_1$ sets $P_1$ to be the fail probability for all states and $\theta_2$ sets $P_2$ to be the fail probability for all states. For $\theta_1$ the optimal policy is to go left in the states closer to the left end and right in the states closer to the right end. For $\theta_2$ the optimal policy is to always go right. The results are shown in Figure 2(a), where all schedules quickly learn to optimize the reward.

In the second experiment, $\theta_1$ sets $P_1$ to be the fail probability for all states, while $\theta_2$ sets $P_1$ for the first few states on the left end and $P_2$ for the remaining states. The optimal policies are similar to those in the first experiment. However, the transition dynamics are the same for states closer to the left end, while the policies contradict each other there: for $\theta_1$ the optimal policy is to go left, and for $\theta_2$ the optimal policy is to go right, in the states closer to the left end. This leads to oscillating behavior when uncertainty about the true $\theta$ is high and policy switching is done frequently. The results are shown in Figure 2(b), where t-mod-1 and TSDE underperform significantly. Nonetheless, when the policy is switched only after multiple interactions, the agent is likely to end up in parts of the state space where it becomes easy to identify the true model of the world. The second experiment is an example where multi-step exploration is necessary.

[Figure 2: When multi-step exploration is necessary DS-PSRL outperforms. Average reward versus time (10000 steps) on a 50-state RiverSwim with a single scalar parameter, comparing the optimal policy, DS-PSRL, TSDE, and t-mod-1; DS-PSRL uses 14 policy switches versus 159 (experiment 1) and 315 (experiment 2) for TSDE.]
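A minimal sketch of the RiverSwim chain described at the start of this section is given below, with the fail probability of the right action supplied as the scalar parameter. Treating a failed right action as staying in place is a simplifying assumption for illustration; the standard RiverSwim dynamics also allow slipping backwards, and the exact probabilities used in the experiments are not reproduced here.

```python
import numpy as np

# Minimal sketch of the RiverSwim chain described above. The scalar parameter
# is the fail probability of the "right" action. A failed "right" action is
# modeled here as staying in place (a simplifying assumption).

class RiverSwim:
    LEFT, RIGHT = 0, 1

    def __init__(self, num_states=50, fail_prob=0.3, seed=0):
        self.K = num_states
        self.fail_prob = fail_prob
        self.rng = np.random.default_rng(seed)
        self.state = 1                      # agent starts in the leftmost state

    def reward(self, s, a):
        if s == 1 and a == self.LEFT:
            return 5.0
        if s == self.K and a == self.RIGHT:
            return 10000.0
        return 0.0

    def step(self, a):
        r = self.reward(self.state, a)
        if a == self.LEFT:                  # swimming with the current always succeeds
            self.state = max(1, self.state - 1)
        elif self.rng.random() >= self.fail_prob:
            self.state = min(self.K, self.state + 1)
        return self.state, r
```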
### 5.2 Multiple Parameters

Even though our theoretical analysis does not account for the case of multiple parameters, we tested our algorithm empirically with multiple parameters. We assumed a Dirichlet prior for every state-action pair. The initial parameters of the priors were set to one (uniform) for the non-zero transition probabilities of the RiverSwim problem and zero otherwise. Updating the posterior in this case amounts to updating the Dirichlet parameters after every transition. We did not compare with the t-mod-1 schedule, due to the computational cost of sampling and solving an MDP at every time step. Unlike the scalar case, we cannot define a small finite number of parameter values for which the MDP policies can be pre-computed. The ground-truth model was $\theta_2$ from the second scalar experiment. Our results are shown in Figures 3(a) and 3(b). DS-PSRL performs better than TSDE as we increase the number of parameters.

[Figure 3: Multiple parameters (a, b) and continuous domain (c). Panels (a) and (b) plot average reward over 10000 time steps for the optimal policy, DS-PSRL, and TSDE on RiverSwim with 15 states (43 parameters) and 20 states (58 parameters); DS-PSRL uses 14 policy switches versus 274 and 301 for TSDE. Panel (c) plots average cost over 1000 time steps on the LQ problem for the optimal controller, t-mod-1, DS-PSRL, and TSDE; DS-PSRL uses 10 switches versus 60 for TSDE.]

### 5.3 Continuous Domains

In a final experiment we tested the ability of the DS-PSRL algorithm in continuous state and action domains. Specifically, we implemented the discrete-time infinite-horizon linear quadratic (LQ) problem of Abbasi-Yadkori and Szepesvári [2015, 2011]:
$$x_{t+1} = A_* x_t + B_* u_t + w_{t+1}, \qquad c_t = x_t^\top Q x_t + u_t^\top R u_t \,,$$
where $t = 0, 1, \ldots$, $u_t \in \mathbb{R}^d$ is the control at time $t$, $x_t \in \mathbb{R}^n$ is the state at time $t$, $c_t \in \mathbb{R}$ is the cost at time $t$, $w_{t+1}$ is the noise, $A_* \in \mathbb{R}^{n \times n}$ and $B_* \in \mathbb{R}^{n \times d}$ are unknown matrices, and $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{d \times d}$ are known (positive definite) matrices. The problem is to design a controller based on past observations to minimize the average expected cost. Uncertainty is modeled as a multivariate normal distribution. In our experiment we set $n = 2$ and $d = 2$. We compared DS-PSRL with t-mod-1 and a recent TSDE algorithm for learning-based control of unknown linear systems with Thompson sampling [Ouyang et al., 2017a]. This version of TSDE uses two dynamic conditions: the first is the same as in the discrete case, activating when the episode length increases by one over the previous episode; the second activates when the determinant of the sample covariance matrix falls below half of its previous value. All algorithms quickly learn $A_*$ and $B_*$, as shown in Figure 3(c). The fact that switching every time step works well indicates that this problem does not require multi-step exploration.

## 6 Application to Sequential Recommendations

By sequential recommendations we refer to the problem where a system recommends various items to a person over time in order to achieve a long-term objective. One example is a recommendation system at a website that recommends various offers. Another example is a tutorial recommendation system, where the sequence of tutorials is important in advancing the user from novice to expert over time. Finally, consider a points-of-interest (POI) recommendation system, where the system recommends various locations for a person to visit in a city, or attractions in a theme park. Personalized sequential recommendations are not sufficiently discussed in the literature and are practically nonexistent in industry. This is due to the increased difficulty of accurately modeling long-term user behaviors and of non-myopic decision making. Part of the difficulty arises from the fact that there may not be a previous sequential recommendation system deployed for data collection, otherwise known as the cold-start problem. Fortunately, there is an abundance of sequential data in the real world. These data are usually passive, in that they do not include past recommendations. A practical approach that learns from passive data was proposed in Theocharous et al. [2017]. The idea is to first learn a model from passive data that predicts the next activity given the history of activities. This can be thought of as the no-recommendation, or passive, model.
To create actions for recommending the various activities, the authors perturb the passive model. Each perturbed model increases the probability of following the recommendations by a different amount. This leads to a set of models, each with a different "propensity to listen". In effect, they used the single propensity-to-listen parameter to turn a passive model into a set of active models. When there are multiple models, one can use online algorithms, such as posterior sampling for reinforcement learning (PSRL), to identify the best model for a new user [Strens, 2000; Osband et al., 2013]. In fact, the algorithm used in Theocharous et al. [2017] was a deterministic-schedule PSRL algorithm; however, there was no theoretical analysis. The perturbation function used was the following:
$$P(s \mid X, a, \theta) = \begin{cases} P(s \mid X)^{1/\theta}, & \text{if } a = s \\ P(s \mid X)/z(\theta), & \text{otherwise} \end{cases} \qquad (5)$$
where $s$ is a POI, $X = (s_1, s_2, \ldots, s_t)$ is a history of POIs, and $z(\theta) = \frac{\sum_{s \neq a} P(s \mid X)}{1 - P(s = a \mid X)^{1/\theta}}$ is a normalizing factor.
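A minimal sketch of the perturbation in Equation (5) is given below: the recommended item $a$ has its passive probability raised to the power $1/\theta$, and the remaining probability mass is rescaled by $z(\theta)$ so that the distribution still sums to one. The array `passive` is a hypothetical toy passive model used only to exercise the formula.

```python
import numpy as np

# Sketch of the propensity-to-listen perturbation in Equation (5).
# Assumes theta >= 1 and 0 < P(a|X) < 1 so that z(theta) is well defined.

def perturbed_model(passive_probs, a, theta):
    p = np.asarray(passive_probs, dtype=float)
    boosted = p[a] ** (1.0 / theta)                 # P(s|X)^(1/theta) for s = a
    z = (1.0 - p[a]) / (1.0 - boosted)              # normalizer z(theta) from Eq. (5)
    out = p / z                                     # P(s|X)/z(theta) for s != a
    out[a] = boosted
    return out

passive = np.array([0.5, 0.3, 0.2])                 # toy passive model P(s|X)
for theta in (1.0, 2.0, 5.0):
    active = perturbed_model(passive, a=1, theta=theta)
    print(theta, active, active.sum())              # item 1's probability grows with theta
```

Note that $\theta = 1$ recovers the passive model exactly, and larger $\theta$ shifts more probability mass onto the recommended item, which is the "propensity to listen" interpretation.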
Here we show how this model satisfies both assumptions of our regret analysis.

**Lipschitz Dynamics.** We first prove that the dynamics are Lipschitz continuous.

**Lemma 5 (Lipschitz Continuity)** Assume the dynamics are given by Equation (5). Then for all $\theta, \theta' \geq 1$ and all $X$ and $a$, we have
$$\| P(\cdot \mid X, a, \theta) - P(\cdot \mid X, a, \theta') \|_1 \leq 2\, |\theta - \theta'| \,.$$
Please refer to Appendix D for the proof of this lemma.

**Concentrating Posterior.** As detailed in Appendix E (see Lemma 6), we can also show that, under mild technical conditions, Assumption A2 holds in this POI recommendation example.

## 7 Summary and Conclusions

We proposed a practical general-case PSRL algorithm, called DS-PSRL, with provable guarantees. The algorithm has regret similar to the state-of-the-art. However, our result is more generally applicable to continuous state-action problems: when the dynamics of the system are parametrized by a scalar, our regret is independent of the number of states. In addition, our algorithm is practical: it provides for generalization and uses a deterministic policy-switching schedule of logarithmic order, which is independent of the true model of the world. This leads to efficiency in sample, space, and time complexity. We demonstrated empirically how the algorithm outperforms state-of-the-art PSRL algorithms. Finally, we showed how the assumptions satisfy a sensible parametrization for a large class of problems in sequential recommendations.

## References

Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In COLT, 2011.

Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian optimal control of smoothly parameterized systems. In UAI, pages 1-11, 2015.

Shipra Agrawal and Randy Jia. Posterior sampling for reinforcement learning: worst-case regret bounds. In NIPS, 2017.

Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI, pages 35-42, 2009.

Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA, 1995.

Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213-231, 2002.

Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In COLT, pages 861-898, 2015.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563-1600, 2010.

Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In NIPS, pages 1466-1474, 2014.

Ian Osband and Benjamin Van Roy. Posterior sampling for reinforcement learning without episodes. arXiv preprint arXiv:1608.02731, 2016.

Ian Osband, Dan Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003-3011, 2013.

Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based control of unknown linear systems with Thompson sampling. arXiv preprint arXiv:1709.04047, 2017a.

Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, 2017b.

Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309-1331, 2008.

Malcolm Strens. A Bayesian framework for reinforcement learning. In ICML, pages 943-950, 2000.

Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.

Georgios Theocharous, Nikos Vlassis, and Zheng Wen. An interactive points of interest guidance system. In IUI Companion, pages 49-52. ACM, 2017.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285-294, 1933.