Journal of Machine Learning Research 24 (2023) 1-35. Submitted 10/22; Revised 5/23; Published 9/23

Continuous-in-time Limit for Bayesian Bandits

Yuhua Zhu (yuz244@ucsd.edu)
Department of Mathematics and Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, CA 92093, USA

Zachary Izzo (zizzo@stanford.edu)
Department of Mathematics, Stanford University, Stanford, CA 94305, USA

Lexing Ying (lexing@stanford.edu)
Department of Mathematics, Stanford University, Stanford, CA 94305, USA

Editor: Qiang Liu

Abstract

This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computation of the optimal policy is often intractable, especially when the length of the problem horizon or the number of arms is large. In this paper, we first show that under a suitable rescaling, the Bayesian bandit problem converges toward a continuous Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be explicitly obtained for several common bandit problems, and we give numerical methods to solve the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for solving Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not increase as the horizon increases.

Keywords: Bayesian bandit, Hamilton-Jacobi-Bellman equation, optimal control, dynamic view, diffusion approximation, continuous limit

1. Introduction

Bandit problems were first introduced by Thompson (1933), with later pioneering work due to Robbins (1952) and Wald (2004).
In more recent years, bandit algorithms have become widely adopted for automated decision-making tasks such as dynamic pricing (Ferreira et al., 2018), mobile health (Tewari and Murphy, 2017), AlphaGo (Silver et al., 2016), etc. The bandit problem can be considered from one of two perspectives: Bayesian or frequentist. The Bayesian approach dominated bandit research from 1960 to 1980 (Bradt et al., 1956; Gittins, 1979). The objective is to minimize an average cumulative regret with respect to the Bayesian prior measure of the problem environment. It formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. In the frequentist setting (Lai and Robbins, 1985), the cumulative regret is viewed as an unknown deterministic quantity, and the goal is to design policies to achieve the best environment-dependent performance. The main difficulty with Bayesian bandits is that computation of the optimal policy is often intractable, especially when the number of arms or the horizon is large. The Gittins index (Gittins, 1979) reduces the computational cost in the discounted infinite-horizon setting but does not apply to undiscounted cases (Berry and Fristedt, 1985). Although computing the Bayes-optimal policy is challenging, there is a significant payoff: the performance of the policy is not only optimal in the Bayesian setting (by definition) but also has favorable frequentist regret guarantees (Lattimore, 2016). In addition, Bayesian bandits have been widely employed in economics (Bergemann and Välimäki, 2006), including in contract theory (Gur et al., 2022), dynamic pricing (Leloup and Deveaux, 2001), portfolio management (El Karoui et al., 2005), etc.

©2023 Yuhua Zhu, Zachary Izzo, and Lexing Ying. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1181.html.
In this paper, we revisit the Bayesian perspective for the multi-armed bandit problem and analyze it using tools from PDEs. A continuous-in-time limiting HJB equation is derived as the horizon $n$ goes to infinity for a range of bandit problems. Based on the limiting equation, a regularized Bayes-optimal policy is proposed, where regularization is employed to increase exploration and stability. Numerical schemes can be used to approximate the optimal policy, leading to improved computational efficiency when the horizon is large. In addition, the exact optimal policy for the limiting HJB equation can be obtained for certain types of bandit problems, including the classical Bernoulli and Gaussian arm reward cases, resulting in an efficient algorithm to approximate the optimal policy even if the number of arms is large. In summary, our contributions are as follows.

- We derive a continuous-in-time limit, an HJB equation, for the Bayesian multi-armed bandit problem.
- We propose a regularized version of the Bayesian multi-armed bandit problem, which encourages exploration and smooths the optimal policy.
- Based on the limiting PDE, we give an efficient algorithm for approximating the optimal policy.

Recently, the use of differential equations to analyze machine learning algorithms has received growing interest, especially for optimization algorithms (Su et al., 2014; Li et al., 2017), sampling algorithms (Welling and Teh, 2011; Liu, 2017), and neural networks (Mei et al., 2018; Chen et al., 2018; Han et al., 2019; Chen et al., 2020). However, few connections had been built between multi-armed bandits and differential equations until the past few years. For bandit problems, Fan and Glynn (2021), Wager and Xu (2021), and Kobzar and Kohn (2022) model several policies in the frequentist setting via a continuous SDE or PDE, and this continuous analysis is used to provide insights into the properties of the algorithms studied.
In the Bayesian setting, Araman and Caldentey (2022) and Che and Hörner (2018) give differential equation approximations, but their work can only be applied to settings with two possible environments. In this paper, we consider Bayesian bandits with general environments. There are also related works using differential equations in online learning settings, including contextual bandits (Kapralov and Panigrahy, 2011), drifting games (Wang and Kohn, 2022), etc.

2. Bayesian Bandits

Throughout the paper, we focus on $K$-armed stochastic bandits played over $n$ rounds, where $n \in \mathbb{Z}_+$ is a positive integer called the horizon. At each round $i$, the learner chooses an arm (also called an action) $A_i$ from the action space $\mathcal{A} = \{a_k\}_{k=1}^K$ according to a policy $\pi_i$, and the environment reveals a reward $X_i \in \mathbb{R}$. The underlying environment $\nu$ belongs to an environment class $\mathcal{E}$ and defines the arm reward distributions. More precisely, given an environment $\nu = (P_a^\nu,\ a \in \mathcal{A}) \in \mathcal{E}$, the reward $X_i$ follows the distribution $P_{A_i}^\nu$. The policy $\pi_i$ at round $i$ is a function which maps the history $H_i = (A_1, X_1, \dots, A_{i-1}, X_{i-1})$ to a probability distribution over the action space $\mathcal{A}$. More precisely, we denote the set of possible histories at the beginning of round $i$ by $\mathcal{H}_i = (\mathcal{A} \times \mathbb{R})^{i-1}$, with $\mathcal{H}_1 = \emptyset$. We denote by $\mathcal{P}(\mathcal{A})$ the set of probability measures on $\mathcal{A}$, so that the policy $\pi_i$ is a mapping from $\mathcal{H}_i$ to $\mathcal{P}(\mathcal{A})$. We denote by $\Pi$ the set of policies $\pi = \{\pi_i\}_{i=1}^n$ which are measurable with respect to the filtration associated with the process $\{H_i\}_{i=1}^n$, and we call $\Pi$ the competitor class. Define the expectation of arm $a$ in environment $\nu$ as

  $\mu_a(\nu) = \int_{\mathbb{R}} x \, P_a^\nu(x) \, dx.$  (1)

The expected cumulative reward $c_n(\pi, \nu)$ measures the performance of policy $\pi$ in environment $\nu$,

  $c_n(\pi, \nu) = \mathbb{E}\left[\sum_{i=1}^n \mu_{A_i}(\nu)\right],$

where the expectation is taken over the probability measure induced by the interaction of the policy and the environment.
The goal in the $K$-armed bandit setting is to design a policy $\pi$ that leads to the largest expected cumulative reward among all policies in the competitor class $\Pi$. The main difficulty arises because the environment is unknown, and the policy can only depend on the history sequences $\{H_i\}_{i=1}^n$. One way to measure the performance of a policy $\pi$ is to find functions $C : \mathcal{E} \to [0, \infty)$, $f : \mathbb{Z}_+ \to [0, \infty)$ that upper bound the regret:

  $R_n(\pi, \nu) = n\mu^*(\nu) - c_n(\pi, \nu) \le C(\nu) f(n),$

where $\mu^*(\nu) = \max_a \mu_a(\nu)$ is the expected reward of the optimal arm. This is the frequentist regret, which is environment-dependent (Lattimore and Szepesvári, 2020). Another way to measure the performance of a policy is via the averaged cumulative reward with respect to a probability measure $\rho(\nu)$ on the environment class $\mathcal{E}$,

  $c_n(\pi, \rho) = \mathbb{E}\left[\sum_{i=1}^n \int_{\mathcal{E}} \mu_{A_i}(\nu) \rho(\nu) \, d\nu\right].$

In the Bayesian setting, the environment is viewed as a random variable. According to Bayes' rule, the probability $\rho(\nu)$ is updated conditional on the history sequence. Given a horizon $n$, we assume that at each round $i$ the environment is distributed according to a measure $\rho_i(\nu)$ over the environment class $\mathcal{E}$. After pulling arm $A_i$ and obtaining the reward $X_i \sim P_{A_i}^\nu$, we update $\rho_{i+1}(\nu)$ to be the posterior distribution of the environment. Given an initial prior measure $\rho_1(\nu)$, the goal is to find the optimal policy that maximizes the averaged cumulative reward,

  $\max_{\pi \in \Pi} c_n(\pi) = \mathbb{E}\left[\sum_{i=1}^n \int_{\mathcal{E}} \mu_{A_i}(\nu) \rho_i(\nu) \, d\nu\right].$

The bandit problem with the above objective function is called a Bayesian bandit (Chapter 35 of Lattimore and Szepesvári (2020)).

3. Continuous Limits of Bayesian Bandits

3.1 An illustrative example

Consider the one-armed bandit problem in which the reward of the first arm follows a Bernoulli($\nu$) distribution (with $\nu$ unknown) and the second arm gives a deterministic reward $\mu_2 = \frac{1}{2}$. We assume an initial prior distribution $\nu \sim \mathrm{Beta}(\alpha, \beta)$. Then the posterior measure of $\nu$ at round $i$ depends on two quantities.
The first quantity is $q^i$, the number of pulls of the unknown arm before round $i$. The second quantity is $s^i$, the cumulative reward of the unknown arm before round $i$. The posterior distribution of $\nu$ is then $\mathrm{Beta}(\alpha + s^i, \beta + q^i - s^i)$. Let $w^i(s, q)$ be the optimal cumulative reward starting from round $i$ with $s^i = s$, $q^i = q$. Similarly, let $w_k^i(s, q)$ be the optimal cumulative reward in this same setting, assuming that the $k$-th arm is pulled at round $i$. Formally, we have

  $w^i(s, q) = \max_{\pi \in \Pi} \mathbb{E}\left[\sum_{j=i}^n \int_{\mathcal{E}} \mu_{A_j}(\nu) \rho_j(\nu) \, d\nu \,\middle|\, s^i = s,\ q^i = q\right],$
  $w_k^i(s, q) = \max_{\pi \in \Pi} \mathbb{E}\left[\sum_{j=i}^n \int_{\mathcal{E}} \mu_{A_j}(\nu) \rho_j(\nu) \, d\nu \,\middle|\, s^i = s,\ q^i = q,\ A_i = k\right].$

As before, the dependence of the expectation on the policy $\pi$ is through the actions $A_j$. If the second arm is chosen at round $i$, then

  $w_2^i(s, q) = \mu_2 + w^{i+1}(s, q).$

If the first arm is chosen at round $i$, then

  $w_1^i(s, q) = p(s, q) + p(s, q)\, w^{i+1}(s+1, q+1) + (1 - p(s, q))\, w^{i+1}(s, q+1), \quad \text{where } p(s, q) = \frac{\alpha + s}{\alpha + \beta + q}.$

The first term $p(s, q)$ is the expected reward if the first arm is pulled at round $i$. The second and third terms hold because after pulling the first arm, one has $q^{i+1} = q^i + 1$ and

  $\mathbb{P}(s^{i+1} = s^i + 1 \mid s^i) = p(s^i, q^i), \quad \mathbb{P}(s^{i+1} = s^i \mid s^i) = 1 - p(s^i, q^i).$

Since $w^i$ is the optimal reward, one has

  $w^i(s, q) = \max\{w_1^i(s, q),\ w_2^i(s, q)\}.$  (2)

Note that for horizon $n$, $w^{n+1}(s, q) = 0$ for all $(s, q)$ by the definition of $w$. Therefore, one can compute $w^i(s, q)$ for all $s \in \{0, \dots, i-1\}$, $q \in \{0, \dots, i-1\}$ via backward induction. To derive a continuous limit in this setting, we rescale the reward and the number of arm pulls by $1/n$: one can define the rescaled optimal reward function $v^i(\hat{s}, \hat{q}) = \frac{1}{n} w^i(n\hat{s}, n\hat{q})$, which then satisfies the equation

  $v^i(\hat{s}, \hat{q}) = \max\left\{\frac{1}{n}\mu_2 + v^{i+1}(\hat{s}, \hat{q}),\ \frac{1}{n} p(\hat{s}, \hat{q}) + p(\hat{s}, \hat{q})\, v^{i+1}\!\left(\hat{s} + \tfrac{1}{n}, \hat{q} + \tfrac{1}{n}\right) + (1 - p(\hat{s}, \hat{q}))\, v^{i+1}\!\left(\hat{s}, \hat{q} + \tfrac{1}{n}\right)\right\},$

with

  $p(\hat{s}, \hat{q}) = \frac{n^{-1}\alpha + \hat{s}}{n^{-1}(\alpha + \beta) + \hat{q}}.$
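As a concrete illustration, the backward induction (2) can be implemented directly. The sketch below is our own illustration (not code from the paper): it computes $w^1(s, q)$ for the one-armed Bernoulli example with a $\mathrm{Beta}(\alpha, \beta)$ prior and deterministic second-arm reward $\mu_2$.

```python
import numpy as np

def bellman_values(n, alpha=1.0, beta=1.0, mu2=0.5):
    """Backward induction for the one-armed Bernoulli bandit of Section 3.1.

    Returns the array w[s, q] = w^1(s, q), the optimal expected cumulative
    reward over n rounds starting from s successes out of q pulls of arm 1.
    """
    # w_next[s, q] holds w^{i+1}(s, q); at i = n the terminal value w^{n+1} is 0.
    w_next = np.zeros((n + 2, n + 2))
    for i in range(n, 0, -1):
        w = np.zeros((n + 2, n + 2))
        for q in range(i):          # before round i, q^i <= i - 1
            for s in range(q + 1):  # successes cannot exceed pulls
                p = (alpha + s) / (alpha + beta + q)
                # pull arm 1: immediate expected reward p, then s, q update
                w1 = p + p * w_next[s + 1, q + 1] + (1 - p) * w_next[s, q + 1]
                # pull arm 2: deterministic reward mu2, state unchanged
                w2 = mu2 + w_next[s, q]
                w[s, q] = max(w1, w2)  # equation (2)
        w_next = w
    return w_next
```

The cost is $O(n^3)$ in this one-armed case ($n$ rounds times an $O(n^2)$ state space), which already hints at why backward induction becomes intractable for large horizons.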
By setting $\delta t = \delta s = \delta q = n^{-1}$, the above equation can be equivalently written as

  $\frac{v^{i+1}(\hat{s}, \hat{q}) - v^i(\hat{s}, \hat{q})}{\delta t} + \max\left\{\mu_2,\ p(\hat{s}, \hat{q}) + p(\hat{s}, \hat{q}) \frac{v^{i+1}(\hat{s} + \delta s, \hat{q} + \delta q) - v^{i+1}(\hat{s}, \hat{q} + \delta q)}{\delta s} + \frac{v^{i+1}(\hat{s}, \hat{q} + \delta q) - v^{i+1}(\hat{s}, \hat{q})}{\delta q}\right\} = 0.$

From the above equation, one sees that the rescaled value function $v^i(\hat{s}, \hat{q})$ is the numerical discretization of the following PDE:

  $\partial_t v(t, \hat{s}, \hat{q}) + \max\{\mu_2,\ \hat{\mu}(\hat{s}, \hat{q}) + \hat{\mu}(\hat{s}, \hat{q})\, \partial_{\hat{s}} v(t, \hat{s}, \hat{q}) + \partial_{\hat{q}} v(t, \hat{s}, \hat{q})\} = 0, \quad v(1, \hat{s}, \hat{q}) = 0,$  (3)

where $\hat{\mu}(\hat{s}, \hat{q}) = \lim_{n \to \infty} p(\hat{s}, \hat{q})$. Furthermore, by moving the constant $\mu_2$ outside of the maximum operator and introducing a control parameter $\hat{\pi} \in [0, 1]$, one arrives at a Hamilton-Jacobi-Bellman equation for $v$:

  $\partial_t v + \max_{\hat{\pi}(t, \hat{s}, \hat{q}) \in [0, 1]} \left(\hat{\mu} + \hat{\mu}\, \partial_{\hat{s}} v + \partial_{\hat{q}} v - \mu_2\right) \hat{\pi} + \mu_2 = 0, \quad v(1, \hat{s}, \hat{q}) = 0.$  (4)

Equations (3) and (4) are equivalent because when $\hat{\mu} + \hat{\mu}\, \partial_{\hat{s}} v + \partial_{\hat{q}} v > \mu_2$ (resp. $< \mu_2$), the optimal control is $\hat{\pi}^* = 1$ (resp. $\hat{\pi}^* = 0$). In other words, as the horizon $n \to \infty$, the rescaled value function $v^i(\hat{s}, \hat{q})$ satisfies the above HJB equation. We plot the convergence of $\frac{1}{n} w^i(s, q)$ as $n$ increases in Figure 1.

[Figure 1: Log-log plot of the error $|\frac{1}{n} w^i(s, q) - v(t, \hat{s}, \hat{q})|$ against the horizon $n$ (from $10^1$ to $10^3$), showing its decay as $n \to \infty$; the initial hyperparameters $(\alpha, \beta)$ are scaled with $n$.]

Classical results from optimal control (see, e.g., Chapter 10.3.3 of Evans (2010)) imply that $v(t, \hat{s}, \hat{q})$ in (4) solves the control problem

  $\max_{\hat{\pi}(\tau) \in [0, 1]} \int_t^1 \hat{\mu}(\hat{s}(\tau), \hat{q}(\tau))\, \hat{\pi}(\tau) + (1 - \hat{\pi}(\tau))\, \mu_2 \, d\tau$  (5)
  s.t. $d\hat{q}(\tau) = \hat{\pi}(\tau)\, d\tau$, $\quad d\hat{s}(\tau) = \hat{\mu}(\hat{s}(\tau), \hat{q}(\tau))\, \hat{\pi}(\tau)\, d\tau$, $\quad \hat{s}(t) = \hat{s}$, $\hat{q}(t) = \hat{q}$.

The preceding example illustrates that the Bayesian bandit recursion (2) can be viewed as the discretization of an HJB equation (4), which solves the control problem (5). In other words, as the horizon $n \to \infty$, the Bayesian bandit problem converges to a continuous control problem that can be solved via the HJB equation. In the next section, we extend this formulation to a more general setting.
3.2 Formal derivation from Bayesian bandits to the HJB equation

We return to the original $K$-armed bandit setting with horizon $n$ and environments parameterized by $\nu \in \mathcal{E} \subseteq \mathbb{R}^d$. The reward of the $k$-th arm $a_k$ follows the distribution $P_k^\nu$. We assume a prior measure $\rho(\nu)$ on the environment $\nu$ at the beginning of the first round. We also assume that the updated measure $\rho_i(\nu)$ at the beginning of round $i$ depends only on $(s^i, q^i)$. Here $s^i = (s_k^i)_{k=1}^K \in \mathbb{R}^K$ is a $K$-dimensional vector representing the cumulative reward of each arm up to round $i - 1$, and $q^i = (q_k^i)_{k=1}^K \in \{0, \dots, i-1\}^K$ is a $K$-dimensional vector representing the number of pulls of each arm up to round $i - 1$. We note that the assumption that the state space can be reduced to $(s^i, q^i)$ covers many, but not all, bandit algorithms. For instance, many stochastic bandit algorithms, where the arm rewards are drawn from a stationary probability distribution, can be represented using this state space, but the Exp3 algorithm (Auer et al., 2002) cannot. For an extensive discussion, refer to Section 2.1 of Wager and Xu (2021). Under the above assumptions, the posterior distribution of the environment at round $i$ can be written as a function of $(s^i, q^i)$, i.e., $\rho(\nu \mid s^i, q^i)$. If arm $k$ is pulled at round $i$, then the reward $X_i$ follows the distribution $P_k^\nu$, and

  $s_k^{i+1} = s_k^i + X_i, \quad q_k^{i+1} = q_k^i + 1.$  (6)

Let $w^i(s, q)$ be the optimal expected cumulative reward starting from round $i$ with $(s^i, q^i) = (s, q)$. Then it satisfies the following equation:

  $w^i(s, q) = \max_k \left\{\int_{\mathcal{E}} \mu_k(\nu)\, \rho(\nu \mid s, q)\, d\nu + \int_{\mathcal{E}} \int_{\mathbb{R}} w^{i+1}(s + x e_k, q + e_k)\, P_k^\nu(x)\, \rho(\nu \mid s, q)\, dx\, d\nu\right\},$  (7)

where $\mu_k(\nu)$ is the expected reward of the $k$-th arm defined in (1), and $e_k$ is the $K$-dimensional vector with $k$-th element equal to $1$ and all other elements $0$. The first term in the max operator is the expected reward if the $k$-th arm is pulled. The second term is due to the fact that $s_k^{i+1}$ and $q_k^{i+1}$ follow (6) if the $k$-th arm is pulled at round $i$.
Next, we rescale the parameters to derive the continuous-in-time limit. Let

  $t = \frac{i - 1}{n}, \quad \hat{q} = \frac{1}{n} q^i, \quad \hat{s} = \frac{1}{f(n)} s^i, \quad v\!\left(\frac{i-1}{n}, \frac{s}{f(n)}, \frac{q}{n}\right) = \frac{1}{f(n)} w^i(s, q).$  (8)

We shrink the $n$ rounds to the time interval $[0, 1]$, so that as $n \to \infty$, the discrete round $i$ corresponds to a continuous time $t = (i-1)/n \in [0, 1]$. We also rescale the number of pulls to $[0, 1]$, so that it is on the same scale as the rescaled rounds. The cumulative reward $s^i$ and the value $w^i$ are rescaled by $\frac{1}{f(n)}$, where $f(n)$ will be determined later. Accordingly, the rescaled expected reward $v(t, \hat{s}, \hat{q})$ becomes a function of the continuous time $t$ and the rescaled history $(\hat{s}, \hat{q})$. We refer to the function $f(n)$ as the scaling factor. For different scaling factors, the limiting rescaled cumulative reward $v(t, \hat{s}, \hat{q})$ will follow different dynamics. Define the moments of the $k$-th arm w.r.t. the probability measure $P_k^\nu(x)$ and the Bayesian measure $\rho(\nu \mid s, q)$ by

  $\mu_k(s, q) = \int_{\mathcal{E}} \int_{\mathbb{R}} x\, P_k^\nu(x)\, \rho(\nu \mid s, q)\, dx\, d\nu, \quad \sigma_k^2(s, q) = \int_{\mathcal{E}} \int_{\mathbb{R}} x^2\, P_k^\nu(x)\, \rho(\nu \mid s, q)\, dx\, d\nu, \quad E_k^p(s, q) = \int_{\mathcal{E}} \int_{\mathbb{R}} x^p\, P_k^\nu(x)\, \rho(\nu \mid s, q)\, dx\, d\nu.$  (9)

We assume that these moments exist and are finite. Inserting the Taylor expansion

  $w^{i+1}(s + x e_k, q + e_k) = \sum_{p \ge 0} \frac{1}{p!}\, \partial_{s_k}^p w^{i+1}(s, q + e_k)\, x^p$

into (7) yields

  $w^i(s, q) = \max_k \Big\{\mu_k(s, q) + w^{i+1}(s, q + e_k) + \mu_k(s, q)\, \partial_{s_k} w^{i+1}(s, q + e_k) + \frac{1}{2} \sigma_k^2(s, q)\, \partial_{s_k}^2 w^{i+1}(s, q + e_k) + \sum_{p \ge 3} \frac{1}{p!} E_k^p(s, q)\, \partial_{s_k}^p w^{i+1}(s, q + e_k)\Big\}.$

Therefore, the rescaled reward $v\!\left(\frac{i-1}{n}, \frac{s}{f(n)}, \frac{q}{n}\right) = \frac{1}{f(n)} w^i(s, q)$ satisfies

  $v\!\left(\tfrac{i-1}{n}, \tfrac{s}{f(n)}, \tfrac{q}{n}\right) = \max_k \Big\{\frac{1}{f(n)} \mu_k(s, q) + v\!\left(\tfrac{i}{n}, \tfrac{s}{f(n)}, \tfrac{q + e_k}{n}\right) + \frac{1}{f(n)} \mu_k(s, q)\, \partial_{\hat{s}_k} v\!\left(\tfrac{i}{n}, \tfrac{s}{f(n)}, \tfrac{q + e_k}{n}\right) + \frac{1}{2} \frac{1}{f^2(n)} \sigma_k^2(s, q)\, \partial_{\hat{s}_k}^2 v\!\left(\tfrac{i}{n}, \tfrac{s}{f(n)}, \tfrac{q + e_k}{n}\right) + \sum_{p \ge 3} \frac{1}{p!} \frac{1}{f^p(n)} E_k^p(s, q)\, \partial_{\hat{s}_k}^p v\!\left(\tfrac{i}{n}, \tfrac{s}{f(n)}, \tfrac{q + e_k}{n}\right)\Big\}.$

After reorganizing the terms and setting $\delta t = \delta q = \frac{1}{n}$, one obtains

  $0 = \frac{v(t + \delta t, \hat{s}, \hat{q}) - v(t, \hat{s}, \hat{q})}{\delta t} + \max_k \Big\{\frac{1}{\delta t f(n)} \mu_k(f(n)\hat{s}, n\hat{q}) + \frac{v(t + \delta t, \hat{s}, \hat{q} + \delta q\, e_k) - v(t + \delta t, \hat{s}, \hat{q})}{\delta t} + \frac{1}{\delta t f(n)} \mu_k(f(n)\hat{s}, n\hat{q})\, \partial_{\hat{s}_k} v(t + \delta t, \hat{s}, \hat{q} + \delta q\, e_k) + \frac{1}{2} \frac{1}{\delta t f^2(n)} \sigma_k^2(f(n)\hat{s}, n\hat{q})\, \partial_{\hat{s}_k}^2 v(t + \delta t, \hat{s}, \hat{q} + \delta q\, e_k) + \sum_{p \ge 3} \frac{1}{p!} \frac{1}{\delta t f^p(n)} E_k^p(f(n)\hat{s}, n\hat{q})\, \partial_{\hat{s}_k}^p v(t + \delta t, \hat{s}, \hat{q} + \delta q\, e_k)\Big\}.$

If one assumes that there exist functions $\{\hat{\mu}_k(\hat{s}, \hat{q})\}_{k=1}^K$, $\{\hat{\sigma}_k(\hat{s}, \hat{q})\}_{k=1}^K$ such that for all $\hat{s} \in \mathbb{R}^K$, $\hat{q} \in [0, 1]^K$,

  $\lim_{n \to \infty} \frac{n}{f(n)} \mu_k(f(n)\hat{s}, n\hat{q}) = \hat{\mu}_k(\hat{s}, \hat{q}); \quad \lim_{n \to \infty} \frac{n}{f^2(n)} \sigma_k^2(f(n)\hat{s}, n\hat{q}) = \hat{\sigma}_k^2(\hat{s}, \hat{q}); \quad \lim_{n \to \infty} \frac{n}{f^p(n)} E_k^p(f(n)\hat{s}, n\hat{q}) = \hat{E}_k^p \equiv 0 \ \text{for } p \ge 3,$  (10)

then as the horizon $n \to \infty$, i.e., $\delta t, \delta q \to 0$, the rescaled expected cumulative reward $v(t, \hat{s}, \hat{q}) = \frac{1}{f(n)} w^{nt+1}(f(n)\hat{s}, n\hat{q})$ satisfies the PDE

  $\partial_t v + \max_k \left\{\hat{\mu}_k + \partial_{\hat{q}_k} v + \hat{\mu}_k\, \partial_{\hat{s}_k} v + \frac{1}{2} \hat{\sigma}_k^2\, \partial_{\hat{s}_k}^2 v\right\} = 0,$

which can be equivalently written as the following HJB equation:

  $\partial_t v + \max_{\hat{\pi}(t, \hat{s}, \hat{q}) \in \Delta_K} \sum_{k=1}^K \left[\hat{\mu}_k(\hat{s}, \hat{q}) + \partial_{\hat{q}_k} v + \hat{\mu}_k(\hat{s}, \hat{q})\, \partial_{\hat{s}_k} v + \frac{1}{2} \hat{\sigma}_k^2(\hat{s}, \hat{q})\, \partial_{\hat{s}_k}^2 v\right] \hat{\pi}_k = 0.$  (11)

Here we introduce $\hat{\pi}(t, \hat{s}, \hat{q})$ as the feedback control, which corresponds to the policy in the bandit problem. Since the policy is a mapping from the history $\mathcal{H}_i$ to the probability measures $\mathcal{P}(\mathcal{A})$ on the action space, the policy at round $i$ is described by a $K$-dimensional vector-valued function $\pi^i(s^i, q^i)$ that satisfies $\sum_k \pi_k^i(s^i, q^i) = 1$. In the limit, the policy $\hat{\pi}(t, \hat{s}, \hat{q}) = \lim_{n \to \infty} \pi^{nt+1}(f(n)\hat{s}, n\hat{q})$ is a mapping from the rescaled history $(\hat{s}, \hat{q})$ to the simplex $\Delta_K$, satisfying $\sum_k \hat{\pi}_k(t, \hat{s}, \hat{q}) = 1$ for all $(t, \hat{s}, \hat{q})$. We remark briefly that the selection (or even the existence) of $f(n)$ may not be obvious in the general setting described above. In general, $f(n)$ should be thought of as describing the order or asymptotic size of the unscaled rewards in the original bandit problem with $n$ rounds. For instance, if the rewards are of constant size and observed with at least constant probability, then we expect the cumulative reward of the original bandit problem to be linear in the time horizon, and we will have $f(n) = \Theta(n)$.
Rather than describing necessary and sufficient technical conditions relating $f(n)$ to the arm reward distributions $P_k^\nu$ and the posterior $\rho(\nu \mid s, q)$ (which may be intractable given the generality of the framework), in the remainder of the paper we show that, for a wide range of concrete examples, our framework provides useful insights into the problem. See Remark 2 for further discussion of the scaling factor $f(n)$. Note that the solution to (11) is not necessarily differentiable, so we search for a viscosity solution instead of a classical solution (Evans, 2010). One has the following guarantee on the well-posedness of the solution.

Proposition 1 If $\{\hat{\mu}_k(s, q)\}_k$, $\{\hat{\sigma}_k(s, q)\}_k$ are bounded and Lipschitz continuous in $(s, q)$, then the value function defined in (13) is the unique viscosity solution to the HJB equation (11).

See, e.g., Nisio (2015) for the proof. Finding necessary and sufficient conditions on the problem primitives (specifically, the arm reward distributions $P_k^\nu$, the prior $\rho$, and the scaling factor $f(n)$) under which the boundedness and Lipschitz assumptions in Proposition 1 hold is an interesting question for future work. In this paper, our goal is to demonstrate the useful insights which can be derived from our framework for specific bandit problems, once it has been determined that they meet these conditions.

Summary If the rescaled moments of all the arms satisfy (10) for some scaling factor $f(n)$, then the rescaled optimal expected cumulative reward $\frac{1}{f(n)} w^{nt+1}(f(n)\hat{s}, n\hat{q})$ converges to $v(t, \hat{s}, \hat{q})$ as $n \to \infty$, where $v$ satisfies the HJB equation (11) with boundary condition $v(1, \hat{s}, \hat{q}) = 0$. In addition, the rescaled optimal policy $\pi^{*, nt+1}(f(n)\hat{s}, n\hat{q})$ converges to $\hat{\pi}^*(t, \hat{s}, \hat{q})$ as $n \to \infty$, where $\hat{\pi}^*(t, \hat{s}, \hat{q})$ is given by

  $\hat{\pi}_k^*(t, \hat{s}, \hat{q}) = \begin{cases} 1, & k = \operatorname{argmax}_{k'} \left\{\hat{\mu}_{k'}(\hat{s}, \hat{q}) + \partial_{\hat{q}_{k'}} v + \hat{\mu}_{k'}(\hat{s}, \hat{q})\, \partial_{\hat{s}_{k'}} v + \frac{1}{2} \hat{\sigma}_{k'}^2(\hat{s}, \hat{q})\, \partial_{\hat{s}_{k'}}^2 v\right\}, \\ 0, & \text{otherwise}. \end{cases}$  (12)

These results form the foundation for the rest of the paper.
3.3 A formal derivation from Bayesian bandits to the optimal control problems

If one views $v(t, \hat{s}, \hat{q})$ as the optimal cumulative reward starting from time $t$,

  $v(t, \hat{s}, \hat{q}) = \max_{\hat{\pi}(\tau) \in \Delta_K} \mathbb{E}\left[\int_t^1 \hat{\mu}(\hat{s}(\tau), \hat{q}(\tau)) \cdot \hat{\pi}(\tau)\, d\tau\right],$  (13)

then $v(t, \hat{s}, \hat{q})$ is the solution to the optimal control problem (Evans, 2010)

  $\max_{\hat{\pi}(\tau) \in \Delta_K} \mathbb{E}\left[\int_t^1 \hat{\mu}(\hat{s}(\tau), \hat{q}(\tau)) \cdot \hat{\pi}(\tau)\, d\tau\right]$  (14)
  s.t. $d\hat{q}_k(\tau) = \hat{\pi}_k(\tau)\, d\tau$, $\quad 1 \le k \le K$;
  $d\hat{s}_k(\tau) = \hat{\mu}_k(\hat{s}(\tau), \hat{q}(\tau))\, \hat{\pi}_k(\tau)\, d\tau + \hat{\sigma}_k(\hat{s}(\tau), \hat{q}(\tau))\, \sqrt{\hat{\pi}_k(\tau)}\, dB_\tau$, $\quad 1 \le k \le K$;
  $\hat{s}(t) = \hat{s}$, $\hat{q}(t) = \hat{q}$.

Therefore, as the horizon $n \to \infty$, the Bayesian bandit problem also converges to the above continuous control problem. In fact, one can derive the optimal control formulation above directly from the definition of the Bayesian bandit problem. Given a policy $\{\pi^i\}_i$, at each round $i$, the environment $\nu$ is sampled with probability $\rho(\nu \mid s^i, q^i)$, the $k$-th arm is pulled with probability $\pi_k^i$, and the reward of the $k$-th arm follows the distribution $P_k^\nu$. Assume that at the beginning of round $I$, $(s^I, q^I) = (s, q)$. The goal of the Bayesian bandit is to find the optimal policy $\{\pi^{*,i}\}_i$ that maximizes the expected cumulative reward. More precisely, our goal is to find $\{\pi^i\}_i$ that solves the following optimization problem:

  $\max_{\{\pi^i\}_i \subset \Delta_K} c_n(\{\pi^i\}) = \mathbb{E}\left[\sum_{i=I}^n \int_{\mathcal{E}} \mu_{A_i}(\nu)\, \rho(\nu \mid s^i, q^i)\, d\nu\right]$  (15)

where

  $A_i = k$ w.p. $\pi_k^i(s^i, q^i)$,
  $s_k^{i+1} - s_k^i = \begin{cases} 0, & \text{if } A_i \ne k, \\ X_i \sim P_k^\nu,\ \nu \sim \rho(\nu \mid s^i, q^i), & \text{if } A_i = k, \end{cases} \qquad q_k^{i+1} - q_k^i = \begin{cases} 0, & \text{if } A_i \ne k, \\ 1, & \text{if } A_i = k, \end{cases} \qquad (s^I, q^I) = (s, q),$

and $\mu_a(\nu)$ is the expected reward of arm $a$ defined in (1).
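To make the controlled dynamics concrete, the state equations in (14) can be simulated with a plain Euler–Maruyama discretization. The sketch below is our own illustration (the drift, diffusion, and policy functions in the usage are placeholders, not quantities from the paper):

```python
import numpy as np

def simulate(policy, mu_hat, sigma_hat, s0, q0, t0=0.0, T=1.0, m=100, seed=None):
    """Euler-Maruyama discretization of the controlled dynamics in (14):
        dq_k = pi_k dt,   ds_k = mu_k pi_k dt + sigma_k sqrt(pi_k) dB.
    Returns the terminal state and the accumulated objective integral."""
    rng = np.random.default_rng(seed)
    s, q = np.array(s0, float), np.array(q0, float)
    dt = (T - t0) / m
    reward = 0.0
    for j in range(m):
        t = t0 + j * dt
        pi = policy(t, s, q)                   # a point on the simplex Delta_K
        mu, sig = mu_hat(s, q), sigma_hat(s, q)
        reward += float(pi @ mu) * dt          # integrand of the objective in (14)
        dB = rng.normal(0.0, np.sqrt(dt), size=len(pi))
        s = s + mu * pi * dt + sig * np.sqrt(pi) * dB
        q = q + pi * dt
    return s, q, reward

# Placeholder usage: two arms, constant drifts, no diffusion, uniform policy.
uniform = lambda t, s, q: np.array([0.5, 0.5])
const_mu = lambda s, q: np.array([1.0, 0.0])
no_noise = lambda s, q: np.array([0.0, 0.0])
s_end, q_end, obj = simulate(uniform, const_mu, no_noise, [0.0, 0.0], [0.0, 0.0], seed=0)
```

With zero diffusion the dynamics are deterministic, which matches the discussion of the Bernoulli case in Remark 2 below, where the valid scaling forces $\hat\sigma \equiv 0$.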
Using the same rescaling as in (8) and viewing the history $(s^i, q^i)$ at the discrete rounds as a function $(\hat{s}(t), \hat{q}(t))$ of the continuous time $t$, the increments of the rescaled cumulative reward $\hat{s}(t)$ and the rescaled number of pulls $\hat{q}(t)$ after one round become

  $\mathbb{E}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)] = \delta t\, \frac{n}{f(n)}\, \pi_k^i(f(n)\hat{s}, n\hat{q})\, \mu_k(f(n)\hat{s}, n\hat{q}),$
  $\mathbb{V}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)] = \delta t\, \frac{n}{f^2(n)}\, \pi_k^i(f(n)\hat{s}, n\hat{q})\, \sigma_k^2(f(n)\hat{s}, n\hat{q}) - \left(\mathbb{E}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)]\right)^2,$
  $\mathbb{E}[\hat{q}_k(t + \delta t) - \hat{q}_k(t)] = \delta t\, \pi_k^i(f(n)\hat{s}, n\hat{q}),$
  $\mathbb{V}[\hat{q}_k(t + \delta t) - \hat{q}_k(t)] = (\delta t)^2 \pi_k^i(f(n)\hat{s}, n\hat{q}) - (\delta t)^2 \left(\pi_k^i(f(n)\hat{s}, n\hat{q})\right)^2,$  (16)

where $\delta t = \frac{1}{n}$ and $\mu_k, \sigma_k$ are defined in (9). Accordingly, the rescaled objective function becomes

  $\hat{c}_n(\{\pi^i\}) = \frac{1}{f(n)}\, c_n(\{\pi^i\}).$  (17)

Formally, based on (16) and (17), under the assumptions (10) on the moments, the following equations hold:

  $\lim_{n \to \infty} \mathbb{E}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)] / \delta t = \hat{\pi}_k(t, \hat{s}, \hat{q})\, \hat{\mu}_k(\hat{s}, \hat{q}),$
  $\lim_{n \to \infty} \mathbb{V}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)] / \delta t = \hat{\pi}_k(t, \hat{s}, \hat{q})\, \hat{\sigma}_k^2(\hat{s}, \hat{q}),$
  $\lim_{n \to \infty} \mathbb{E}[\hat{q}_k(t + \delta t) - \hat{q}_k(t)] / \delta t = \hat{\pi}_k(t, \hat{s}, \hat{q}),$
  $\lim_{n \to \infty} \mathbb{V}[\hat{q}_k(t + \delta t) - \hat{q}_k(t)] / \delta t = 0,$
  $\lim_{n \to \infty} \hat{c}_n(\{\pi^i\}) = \mathbb{E}\left[\int_t^1 \sum_{k=1}^K \hat{\pi}_k(\tau, \hat{s}(\tau), \hat{q}(\tau))\, \hat{\mu}_k(\hat{s}(\tau), \hat{q}(\tau))\, d\tau\right].$

This implies that the rescaled version of the Bayesian bandit problem (15) converges to the continuous optimal control problem (14).

Remark 2 Assumption (10) is critical for determining which scaling factor $f(n)$ one should use to derive a meaningful limiting HJB equation. Take the Bernoulli reward introduced in Section 3.1 as an example. Since the rewards take values in $\{0, 1\}$, all moments coincide:

  $\mu(s, q) = \sigma^2(s, q) = E^p(s, q) = \frac{\alpha + s}{\alpha + \beta + q},$

so that

  $\hat{\mu}(\hat{s}, \hat{q}) = \lim_{n \to \infty} \frac{n}{f(n)} \cdot \frac{\alpha + f(n)\hat{s}}{\alpha + \beta + n\hat{q}}, \quad \hat{\sigma}^2(\hat{s}, \hat{q}) = \lim_{n \to \infty} \frac{n}{f(n)^2} \cdot \frac{\alpha + f(n)\hat{s}}{\alpha + \beta + n\hat{q}}, \quad \hat{E}^p(\hat{s}, \hat{q}) = \lim_{n \to \infty} \frac{n}{f(n)^p} \cdot \frac{\alpha + f(n)\hat{s}}{\alpha + \beta + n\hat{q}}.$

If the initial hyperparameters $(\alpha, \beta)$ are set such that $\lim_{n \to \infty} \left(\frac{\alpha}{f(n)}, \frac{\alpha + \beta}{n}\right) = (\hat{\alpha}, \hat{\beta})$, then the only scaling factor that induces a meaningful HJB equation is $f(n) = \Theta(n)$. For $f(n) = O(n^b)$ with $b < 1$, $\hat{\mu}$ diverges. For $b > 1$, one has $\hat{\mu} \to 0$, and consequently the limiting HJB equation does not give any useful information about the dynamics.
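The scaling argument in Remark 2 can be checked numerically. The snippet below is our own illustration with hypothetical values $\hat\alpha = 0.2$, $\hat\beta = 1$: choosing $f(n) = n$ and hyperparameters that match $(\hat\alpha, \hat\beta)$ exactly, the rescaled drift equals $(\hat\alpha + \hat s)/(\hat\beta + \hat q)$ while the rescaled diffusion vanishes like $1/n$.

```python
def rescaled_moments(n, shat, qhat, ahat=0.2, bhat=1.0):
    """Rescaled drift and diffusion of the Bernoulli arm from Remark 2."""
    f = n                          # scaling factor f(n) = n
    alpha = ahat * f               # chosen so that alpha / f(n) = ahat exactly
    beta = bhat * n - alpha        # chosen so that (alpha + beta) / n = bhat exactly
    mu = (alpha + f * shat) / (alpha + beta + n * qhat)  # posterior mean mu(s, q)
    drift = (n / f) * mu           # (n / f(n)) * mu(f(n) shat, n qhat)
    diff2 = (n / f**2) * mu        # (n / f(n)^2) * sigma^2; sigma^2 = mu for {0,1} rewards
    return drift, diff2
```

The drift is independent of $n$ in this construction, and the diffusion term is the drift divided by $n$, confirming that the Bernoulli limit is a first-order (deterministic) HJB equation.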
In this particular case, one always ends up with an HJB equation without a diffusion term ($\hat{\sigma} \equiv 0$), and this is consistent with the nature of Bernoulli bandits: $\hat{s}(t)$ is non-decreasing by definition, but when $\hat{\sigma} > 0$, there is a chance that $\hat{s}(t)$ decreases due to the nonzero $dB_\tau$ term. Therefore, any valid scaling factor must induce a deterministic optimal control problem. In the general case, however, depending on the choice of scaling factor $f(n)$, the limiting optimal control problem can be stochastic or deterministic.

3.4 Specialized limiting equations for structured and unstructured bandits

In this section, we derive the general limiting HJB equation for both unstructured and structured bandits. Then, in Section 3.4.1, we derive the continuous limit for some specific bandit problems which are common in the literature.

Unstructured bandits In the unstructured bandit problem, an action $a \in \mathcal{A}$ is completely uninformative about all other actions $b \ne a$. That is, when $a$ is played, the learner gains no information about the reward distributions of the other arms (Chapter 4.3 of Lattimore and Szepesvári (2020)). Here we mainly discuss a specific kind of unstructured bandit problem in which all the arms belong to the same parametric family of distributions with different unknown parameters. For instance, the rewards of each arm may be normally distributed with an unknown mean. One will see that, in this case, the drift and diffusion terms of each arm in the limiting HJB equation have a similar form.
Provided that the initial prior measure is determined by a hyperparameter $\beta$, we claim that under certain conditions, the Bayesian bandit problem converges to the following HJB equation as the horizon $n \to \infty$:

  $\partial_t v + \max_{\hat{\pi}(t, \hat{s}, \hat{q}) \in \Delta_K} \sum_{k=1}^K \left[\hat{\mu}(\hat{s}_k, \hat{q}_k, \hat{\beta}_k) + \partial_{\hat{q}_k} v + \hat{\mu}(\hat{s}_k, \hat{q}_k, \hat{\beta}_k)\, \partial_{\hat{s}_k} v + \frac{1}{2} \hat{\sigma}(\hat{s}_k, \hat{q}_k, \hat{\beta}_k)^2\, \partial_{\hat{s}_k}^2 v\right] \hat{\pi}_k(t, \hat{s}, \hat{q}) = 0, \quad v(1, \hat{s}, \hat{q}) = 0,$  (18)

where the functions $\hat{\mu}$ and $\hat{\sigma}$ depend on the problem setting, and $\hat{\beta}$ is the rescaled hyperparameter. The main difference between the general form (11) and the unstructured version (18) is that the drift and diffusion functions $(\hat{\mu}_k, \hat{\sigma}_k)$ have the same form for each arm. Thus, the difference in drift and diffusion between arms is due only to differences in the arm histories $(\hat{s}_k, \hat{q}_k)$ at time $t$ and in the initial prior measure represented by $\hat{\beta}$.

Structured bandits In a structured bandit problem, the learner can obtain information about other actions even if those actions are never played. That is, playing a particular arm $a \in \mathcal{A}$ may be informative about the reward distributions of other arms $b \ne a$ (Chapter 4.3 of Lattimore and Szepesvári (2020)). Specifically, we consider the following setting. Let the action space $\mathcal{A} = \{a_k\}_{k=1}^K \subset \mathbb{R}^d$ be a set of real vectors, and assume that the reward $x \in \Omega$ of the arm $a_k$ has density $p(x \mid a_k, \nu)$ with an unknown parameter $\nu \in \mathcal{E} \subseteq \mathbb{R}^d$. Then we claim that under certain conditions, the Bayesian bandit problem described above converges to the following HJB equation as the horizon $n \to \infty$:

  $\partial_t v + \max_{\hat{\pi}(t, \hat{s}, \hat{q}) \in \Delta_K} \sum_{k=1}^K \left[\hat{\mu}(\hat{s}, \hat{q}, a_k) + \partial_{\hat{q}_k} v + \hat{\mu}(\hat{s}, \hat{q}, a_k)\, \partial_{\hat{s}_k} v + \frac{1}{2} \hat{\sigma}(\hat{s}, \hat{q}, a_k)^2\, \partial_{\hat{s}_k}^2 v\right] \hat{\pi}_k(t, \hat{s}, \hat{q}) = 0, \quad v(1, \hat{s}, \hat{q}) = 0.$  (19)

Similar to the unstructured case, the main difference between (11) and (19) is that the drift and diffusion terms $(\hat{\mu}_k, \hat{\sigma}_k)$ have the same form for each arm. In the structured case, the two terms depend on the history of all the arms $(\hat{s}, \hat{q})$ and on the position of the arm $a_k$, so the difference between arms is due only to the position $a_k$.
3.4.1 Three common bandits

Here we derive the HJB equation for three common examples: unstructured bandits with Bernoulli and normal arm rewards, and a structured linear bandit with normal rewards.

Unstructured Bernoulli rewards Consider the environment class $\mathcal{E} = [0, 1]^K$ of Bernoulli-type reward distributions with horizon $n$. For an environment $\nu \in \mathcal{E} = [0, 1]^K$, the $k$-th arm yields reward $\gamma(n)$ with probability $\nu_k$ and $-\gamma(n)$ with probability $1 - \nu_k$, where $\nu_k$ is the $k$-th component of $\nu$. We set the initial prior measure for the $k$-th arm to be $\nu_k^1 \sim \mathrm{Beta}(\alpha_k(n), \beta_k(n))$ for $1 \le k \le K$.

Lemma 3 Let $f(n)$ be such that there exist real numbers $(\hat{\alpha}, \hat{\beta}, \hat{\sigma})$ with

  $\lim_{n \to \infty} \frac{\gamma(n)(\alpha_k(n) - \beta_k(n))}{f(n)} = \hat{\alpha}_k, \quad \lim_{n \to \infty} \frac{\alpha_k(n) + \beta_k(n)}{n} = \hat{\beta}_k, \quad \lim_{n \to \infty} \frac{\sqrt{n}\, \gamma(n)}{f(n)} = \hat{\sigma}.$

Then as $n \to \infty$, the rescaled Bayesian bandit problem with scaling factor $f(n)$ converges to the HJB equation (18) with

  $\hat{\mu}(s, q, \hat{\alpha}_k, \hat{\beta}_k) = \frac{\hat{\alpha}_k + s}{\hat{\beta}_k + q}, \quad \hat{\sigma}(s, q, \hat{\alpha}_k, \hat{\beta}_k) \equiv \hat{\sigma}.$

See Appendix A for the proof of the above lemma.

Unstructured normal rewards Consider the environment class $\mathcal{E} = \mathbb{R}^K$ of normally distributed arm rewards with horizon $n$. For an environment $\nu \in \mathcal{E} = \mathbb{R}^K$, the rewards of the $k$-th arm follow the normal distribution $N(\nu_k, \sigma^2(n))$, where $\nu_k$ is the $k$-th component of $\nu$. We set the initial prior measure to be $\nu_k^1 \sim N(\alpha_k(n), \beta_k(n)^2)$ for $1 \le k \le K$.

Lemma 4 Let $f(n)$ be such that there exist real numbers $(\hat{\alpha}, \hat{\beta}, \hat{\sigma})$ with

  $\lim_{n \to \infty} \frac{\sigma^2(n)\, \alpha_k(n)}{f(n)\, \beta_k^2(n)} = \hat{\alpha}_k, \quad \lim_{n \to \infty} \frac{\sigma^2(n)}{\beta_k^2(n)\, n} = \hat{\beta}_k, \quad \lim_{n \to \infty} \frac{\sqrt{n}\, \sigma(n)}{f(n)} = \hat{\sigma}.$

Then as $n \to \infty$, the rescaled Bayesian bandit problem with scaling factor $f(n)$ converges to the HJB equation (18) with

  $\hat{\mu}(s, q, \hat{\alpha}_k, \hat{\beta}_k) = \frac{\hat{\alpha}_k + s}{\hat{\beta}_k + q}, \quad \hat{\sigma}(s, q, \hat{\alpha}_k, \hat{\beta}_k) \equiv \hat{\sigma}.$

See Appendix B for the proof of the above lemma.

Linear bandits with normal rewards Consider the case of stochastic linear bandits with horizon $n$, where the environment is encoded by a vector $\nu \in \mathbb{R}^d$.
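Both lemmas yield drifts of the form $(\hat\alpha_k + \hat s_k)/(\hat\beta_k + \hat q_k)$ because the underlying conjugate posterior means share this structure. As a hedged illustration (these are the standard conjugate-prior updates, not code from the paper), the Gaussian posterior mean can be written so that the correspondence with the Beta-Bernoulli case is visible:

```python
def gaussian_posterior_mean(alpha, beta2, sigma2, s, q):
    """Posterior mean of nu for a N(nu, sigma2) arm with prior nu ~ N(alpha, beta2),
    given q pulls with cumulative reward s. Multiplying the usual precision-weighted
    average through by sigma2 exposes the (alpha' + s)/(beta' + q) form of Lemma 4."""
    return (sigma2 * alpha / beta2 + s) / (sigma2 / beta2 + q)

def beta_posterior_mean(alpha, beta, s, q):
    """Posterior mean of nu for a Bernoulli(nu) arm with prior nu ~ Beta(alpha, beta),
    given q pulls with s successes: (alpha + s) / (alpha + beta + q)."""
    return (alpha + s) / (alpha + beta + q)
```

In both cases the numerator is "prior pseudo-reward plus observed reward" and the denominator is "prior pseudo-count plus observed pulls", which is exactly what survives the rescaling in Lemmas 3 and 4.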
For an environment $\nu \in \mathcal{E} = \mathbb{R}^d$, the reward $X_i$ at round $i$ depends linearly on the chosen action $A_i \in \mathcal{A} \subset \mathbb{R}^d$ in the following sense:

  $X_i = \langle A_i, \nu \rangle + \eta_i,$  (20)

where $(\eta_i)_{i=1}^n$ is a sequence of independent and identically distributed normal random variables $N(0, \sigma^2(n))$ with given $\sigma(n)$. We set the prior measure of $\nu$ to be the normal distribution $N(\alpha(n), \Sigma(n))$.

Lemma 5 Let $f(n)$ be such that there exist real constants $(\hat{\alpha}, \hat{\Sigma}, \hat{\sigma})$ with

  $\lim_{n \to \infty} \frac{\sigma^2(n)}{n}\, \Sigma^{-1}(n) = \hat{\Sigma}^{-1}, \quad \lim_{n \to \infty} \frac{\sigma^2(n)}{f(n)}\, \Sigma^{-1}(n)\, \alpha(n) = \hat{\alpha}, \quad \lim_{n \to \infty} \frac{\sqrt{n}\, \sigma(n)}{f(n)} = \hat{\sigma}.$

Then as $n \to \infty$, the rescaled Bayesian bandit problem with scaling factor $f(n)$ converges to the continuous control problem (19) with

  $\hat{\mu}(s, q, b) = b^\top \left(\hat{\Sigma}^{-1} + \sum_k q_k\, a_k a_k^\top\right)^{-1} \left(\hat{\alpha} + \sum_k s_k\, a_k\right), \quad \hat{\sigma}(s, q) \equiv \hat{\sigma}.$

See Appendix C for the proof of the above lemma.

4. Approximate Bayes-optimal policy

What can one do with the limiting HJB equations or optimal control problems? In Section 4.1, we propose a regularized version of the Bayesian bandit by adding a regularizer term to the objective function of the optimal control limit. In this way, the resulting optimal policy is a stochastic policy instead of a deterministic one, which encourages exploration and makes the solution less sensitive to perturbations. In Section 4.2, we propose an approximate Bayes-optimal policy algorithm based on the solution to the (regularized or unregularized) HJB equation.

4.1 Regularized Bayesian bandits

The optimal policy from the optimal control problem (14) is deterministic and can be sensitive to small perturbations of the problem. To encourage robustness, one can add regularization to the objective function in (14), resulting in a stochastic optimal policy and encouraging more exploration and stability. For example, the entropy regularization $-\sum_k \hat{\pi}_k \log \hat{\pi}_k$ can be added to (14):

  $\max_{\hat{\pi}(\tau) \in \Delta_K} \mathbb{E}\left[\int_t^1 \left(\hat{\mu}(\hat{s}(\tau), \hat{q}(\tau)) - \lambda \log \hat{\pi}(\tau)\right) \cdot \hat{\pi}(\tau)\, d\tau\right]$  (21)
  s.t. $d\hat{q}_k(\tau) = \hat{\pi}_k(\tau)\, d\tau$, $\quad 1 \le k \le K$;
  $d\hat{s}_k(\tau) = \hat{\mu}_k(\hat{s}(\tau), \hat{q}(\tau))\, \hat{\pi}_k(\tau)\, d\tau + \hat{\sigma}_k(\hat{s}(\tau), \hat{q}(\tau))\, \sqrt{\hat{\pi}_k(\tau)}\, dB_\tau$, $\quad 1 \le k \le K$;
  $\hat{s}(t) = \hat{s}$, $\hat{q}(t) = \hat{q}$.
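In the linear case, the drift in Lemma 5 is the posterior mean of $\langle b, \nu \rangle$ under the rescaled Gaussian posterior, which is the familiar ridge-regression form. A minimal numpy sketch (our own illustration, built from the standard Bayesian linear-regression update; the function name is ours):

```python
import numpy as np

def linear_drift(b, s, q, arms, Sigma_inv_hat, alpha_hat):
    """Rescaled drift of Lemma 5:
       mu_hat(s, q, b) = b^T (Sigma_inv_hat + sum_k q_k a_k a_k^T)^{-1}
                             (alpha_hat + sum_k s_k a_k)."""
    A = Sigma_inv_hat + sum(qk * np.outer(ak, ak) for qk, ak in zip(q, arms))
    rhs = alpha_hat + sum(sk * ak for sk, ak in zip(s, arms))
    # Solve the linear system instead of forming the inverse explicitly.
    return float(b @ np.linalg.solve(A, rhs))

# Illustrative usage: two arms along the coordinate axes of R^2.
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
val = linear_drift(np.array([1.0, 0.0]), [3.0, 0.0], [1.0, 0.0],
                   arms, np.eye(2), np.zeros(2))
```

Note that, unlike the unstructured case, the drift of arm $b$ depends on the histories $(s, q)$ of all arms through the shared posterior over $\nu$, which is precisely the structured-bandit information sharing described above (19).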
The resulting HJB equation is (Evans, 2010)

  $\partial_t v + \max_{\hat{\pi} \in \Delta_K} \sum_{k=1}^K \left[\hat{\mu}_k(\hat{s}, \hat{q})\, \partial_{\hat{s}_k} v + \partial_{\hat{q}_k} v + \frac{1}{2} \hat{\sigma}_k^2(\hat{s}, \hat{q})\, \partial_{\hat{s}_k}^2 v + \hat{\mu}_k(\hat{s}, \hat{q}) - \lambda \log \hat{\pi}_k(t, \hat{s}, \hat{q})\right] \hat{\pi}_k(t, \hat{s}, \hat{q}) = 0.$  (22)

The maximum in the above equation can be computed explicitly (Ying and Zhu, 2022). Let

  $H_k(p, m, h) = \hat{\mu}_k(s, q)\, p + m + \frac{1}{2} \hat{\sigma}_k^2(s, q)\, h + \hat{\mu}_k(s, q);$

then the optimal policy is

  $\hat{\pi}_k^*(t, \hat{s}, \hat{q}) = \frac{1}{Z} \exp\left(\frac{1}{\lambda} H_k(\partial_{\hat{s}_k} v, \partial_{\hat{q}_k} v, \partial_{\hat{s}_k}^2 v)\right),$  (23)

with normalizing constant $Z = \sum_k \exp\left(\frac{1}{\lambda} H_k(\partial_{\hat{s}_k} v, \partial_{\hat{q}_k} v, \partial_{\hat{s}_k}^2 v)\right)$. Hence, (22) can be equivalently written as

  $\partial_t v + \lambda \log \sum_{k=1}^K \exp\left(\frac{1}{\lambda} H_k(\partial_{\hat{s}_k} v, \partial_{\hat{q}_k} v, \partial_{\hat{s}_k}^2 v)\right) = 0.$  (24)

There are several potential advantages of the regularized version. First, the resulting optimal policy is always stochastic for $\lambda > 0$, which makes it less sensitive to perturbations than the deterministic optimal policy. Second, regularization encourages more exploration, which helps performance when the initial prior is significantly different from the underlying truth. Third, regularization usually leads to a smoother solution with a differentiable policy and value function, making it easier to approximate the solution numerically. For a more comprehensive introduction to the use of regularization in bandit and reinforcement learning problems, see, e.g., the tutorial by Geist et al. (2019).

4.2 Approximating the Bayes-optimal policy

Based on the limiting equation, if one can obtain the optimal policy for the HJB equation, then one can approximate the optimal Bayesian bandit policy by rescaling $(t, \hat{s}, \hat{q})$ back to $(i, s, q)$. This is summarized by the pseudocode in Algorithm 1.

Algorithm 1 Approximate Bayes-optimal policy
  for i = 1, ..., K do
    Pull arm $A_i = i$; get reward $X_i$; set $s_i \leftarrow X_i$, $q_i \leftarrow 1$
  end for
  for i = K + 1, ..., n do
    Pull $A_i \sim \pi^i(s, q) = \hat{\pi}^*\!\left(\frac{i-1}{n}, \frac{s}{f(n)}, \frac{q}{n}\right)$, with $\hat{\pi}^*(t, \hat{s}, \hat{q})$ given in (12) (unregularized) or (23) (regularized)
    Get reward $X_i$
    $s_{A_i} \leftarrow s_{A_i} + X_i$, $q_{A_i} \leftarrow q_{A_i} + 1$
  end for

5. Solving the limiting HJB equation

One of the difficulties of the Bayesian bandit problem is its large computational cost.
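Algorithm 1 is straightforward to implement once $\hat\pi^*$ is available in closed form. The sketch below is our own illustration, not code from the paper: it instantiates Algorithm 1 for Bernoulli arms with $\mathrm{Beta}(1, 1)$ priors and $f(n) = n$, using the explicit regularized policy derived in Section 5.1, whose softmax exponent reduces to the posterior mean scaled by $1/\lambda$ in this instantiation.

```python
import numpy as np

def softmax_policy(s, q, n, f, ahat, bhat, lam):
    """Regularized approximate Bayes-optimal policy of the form (28):
       pi_k  propto  exp( (n / (lam * f)) * (s_k + f * ahat_k) / (q_k + n * bhat_k) )."""
    z = (n / (lam * f)) * (s + f * ahat) / (q + n * bhat)
    z = z - z.max()                      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def run_bandit(nu, n, lam=0.5, seed=0):
    """Algorithm 1 for Bernoulli(nu_k) arms, Beta(1,1) priors, f(n) = n."""
    rng = np.random.default_rng(seed)
    K = len(nu)
    ahat = np.full(K, 1.0 / n)           # alpha / f(n) with alpha = 1, f(n) = n
    bhat = np.full(K, 2.0 / n)           # (alpha + beta) / n with alpha = beta = 1
    s, q = np.zeros(K), np.zeros(K)
    total = 0.0
    for i in range(n):
        if i < K:                        # initialization loop: pull each arm once
            a = i
        else:
            pi = softmax_policy(s, q, n, n, ahat, bhat, lam)
            a = rng.choice(K, p=pi)
        x = float(rng.random() < nu[a])  # Bernoulli reward
        s[a] += x; q[a] += 1; total += x
    return total
```

Note that the per-round cost is $O(K)$ and independent of the horizon $n$, which is the computational benefit highlighted in the abstract.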
The computational complexity for solving a K-armed bandit with horizon n via backward induction is O(n^{2K}), which is intractable when n or K is large. If one can obtain the exact optimal policy for the limiting HJB equation, then one can use it to approximate the Bayes-optimal policy for the finite-horizon problem with almost no additional computational cost. Section 5.1 shows one of the cases where the exact solution can be obtained. Even if the exact solution cannot be obtained directly, Section 5.2 gives a numerical scheme to approximate the solution. The computational cost of numerically solving the HJB equation is O(N^{2K}), where N depends on the mesh of the scheme. This can be much more efficient than the discrete Bayesian bandit algorithm when n is large and K is small.

5.1 Exact solution

Although solving the HJB equation can also be challenging in general, it turns out that if μ̂_k(ŝ, q̂) = (ŝ_k + α̂_k)/(q̂_k + β̂_k), one can obtain the exact solution. By Lemmas 3 and 4, two common bandit problems are exactly of this form.

Theorem 6  If the drift term in the HJB equation (11) is μ̂_k(ŝ, q̂) = (ŝ_k + α̂_k)/(q̂_k + β̂_k) for constants (α̂_k, β̂_k), then the optimal policy for the unregularized HJB equation (11) is

    π̂*_k(t, ŝ, q̂) = 1 if k = argmax_k (ŝ_k + α̂_k)/(q̂_k + β̂_k), and 0 otherwise;    (25)

and the optimal policy for the regularized HJB equation (22) is

    π̂*_k(t, ŝ, q̂) ∝ exp( (1/λ) (ŝ_k + α̂_k)/(q̂_k + β̂_k) ).    (26)

The proof is given in Appendix D. Based on the above theorem, the approximate optimal policy for the unregularized Bayesian bandit problem is given by

    π^{*,i}_k(s, q) = π̂*_k( (i−1)/n, s/f(n), q/n ) = 1 if k = argmax_k [ s_k/(q_k + nβ̂_k) + f(n)α̂_k/(q_k + nβ̂_k) ], and 0 otherwise,    (27)

and the approximate optimal policy for the regularized Bayesian bandit problem is given by

    π^{*,i}_k(s, q) = π̂*_k( (i−1)/n, s/f(n), q/n ) ∝ exp( (n/(λ f(n))) (s_k + f(n)α̂_k)/(q_k + nβ̂_k) ).    (28)

Furthermore, note that the approximate optimal policy (27) for the unregularized Bayesian bandit is similar in form to UCB: the first term is an approximation to the empirical mean, and the second term measures the degree to which the arm has been explored. The approximate optimal policy (28) for the regularized Bayesian bandit has a form similar to the tempered greedy algorithm, where the term (s_k + f(n)α̂_k)/(q_k + nβ̂_k) is an approximation to the empirical mean and n/(λf(n)) adjusts the exploration rate. When n/(λf(n)) is smaller (i.e., when the regularization constant λ is larger), there is more exploration.

5.2 Numerical solution

In the general case, an exact solution to the HJB equation (11) is not available, so we present a numerical scheme to approximate the solution in this section. In certain cases, the numerical scheme yields the exact optimal policy and value function (see Lemma 7). In general, when the horizon is large, the computational cost of numerically solving the PDE will be much less than that of classical Bayesian bandit algorithms while still yielding a good approximation to the optimal policy.

First, observe that one can directly compute the maximum in the HJB equation (11). Let

    k*(t, ŝ, q̂) = argmax_k { ∂_{q̂_k} v + μ̂_k(ŝ, q̂) ∂_{ŝ_k} v + ½ σ̂²_k(ŝ, q̂) ∂²_{ŝ_k} v + μ̂_k(ŝ, q̂) };

then the optimal policy is π*_k(t, ŝ, q̂) = 1 if k = k*(t, ŝ, q̂) and 0 otherwise, and the optimal value function satisfies

    ∂_t v + ∂_{q̂_{k*}} v + μ̂_{k*}(ŝ, q̂) ∂_{ŝ_{k*}} v + ½ σ̂²_{k*}(ŝ, q̂) ∂²_{ŝ_{k*}} v + μ̂_{k*}(ŝ, q̂) = 0.    (29)

All of the above results also hold in the deterministic case, where σ̂_k ≡ 0 for all k. Based on (29), we present a finite difference method for solving the HJB equation (29).

HJB equation with diffusion  First, consider the case where σ̂ ≢ 0.
We discretize the time interval [0, 1] via the grid points 0 = t_0 < t_1 < ⋯ < t_{N_t} = 1, where t_l = lδt and δt = 1/N_t. We impose a cutoff on the cumulative reward in R^K so that it lies within [−S, S]^K, then discretize this clipped interval into grid points −S ≤ s^i_k ≤ S. Here i ∈ {i ∈ Z^K : −N_s ≤ i_k ≤ N_s} is a K-dimensional index vector, and s^i = iδs with δs = S/N_s. Finally, we discretize the number of pulls to 0 ≤ q^j_k ≤ 1, where j ∈ {j ∈ Z^K : 0 ≤ j_k ≤ N_q} is a K-dimensional index vector, and q^j = jδq with δq = 1/N_q. Observe that at each time t_l, one has 0 ≤ q^j_k ≤ t_l and Σ_{k=1}^K q^j_k = t_l. We will typically set N_q = N_t and δq = δt.

Let v^l_{i,j} be the approximation for v(t_l, s^i, q^j), where i, j ∈ Z^K. The numerical approximation v^l_{i,j} satisfies the following equation:

    (1/δt)( v^{l+1}_{i,j} − v^l_{i,j} ) + max_k { (μ̂^{i,j}_k/(2δs))( v^{l+1}_{i+e_k, j+e_k} − v^{l+1}_{i−e_k, j+e_k} ) + (1/δq)( v^{l+1}_{i,j+e_k} − v^{l+1}_{i,j} ) + ((σ̂^{i,j}_k)²/(2δs²))( v^{l+1}_{i+e_k, j+e_k} − 2v^{l+1}_{i,j+e_k} + v^{l+1}_{i−e_k, j+e_k} ) + μ̂^{i,j}_k } = 0,    (30)

where e_k is the k-th standard basis vector, μ̂^{i,j}_k = μ̂_k(s^i, q^j), and σ̂^{i,j}_k = σ̂_k(s^i, q^j). Since the boundary conditions at the terminal time specify that v^{N_t}_{i,j} = 0, one can use the preceding equations to solve backward in time and compute all of the values v^l_{i,j}. In fact, if one defines g^l_{i,j} as a vector with k-th element given by

    (g^l_{i,j})_k = v^{l+1}_{i,j} + δt [ (μ̂^{i,j}_k/(2δs))( v^{l+1}_{i+e_k, j+e_k} − v^{l+1}_{i−e_k, j+e_k} ) + (1/δq)( v^{l+1}_{i,j+e_k} − v^{l+1}_{i,j} ) + ((σ̂^{i,j}_k)²/(2δs²))( v^{l+1}_{i+e_k, j+e_k} − 2v^{l+1}_{i,j+e_k} + v^{l+1}_{i−e_k, j+e_k} ) + μ̂^{i,j}_k ],    (31)

then one gets an equivalent form for the numerical scheme (30), with the correspondence given by

    k* = argmax_k (g^l_{i,j})_k,    v^l_{i,j} = (g^l_{i,j})_{k*},    (π^l_{i,j})_k = 1 if k = k* and 0 otherwise.    (32)

Note that for this scheme to be numerically stable, δt and δs must satisfy the inequality δt ≤ min_{i,j,k} (σ̂^{i,j}_k)^{−2} δs². Since v^l_{i,j} is only defined on grid points, the continuous approximated solution v̄(t, ŝ, q̂) is defined as

    v̄(t, ŝ, q̂) = v^l_{i,j}  for  lδt ≤ t < (l+1)δt,  i_kδs ≤ ŝ_k < (i_k+1)δs,  j_kδq ≤ q̂_k < (j_k+1)δq.    (33)
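As a concrete sanity check, the explicit backward march (31)-(32) can be coded in a few lines for a single arm (K = 1) with constant coefficients, for which the exact value is v(t, ŝ) = μ̂(1 − t). The drift, diffusion, and grid sizes below are illustrative assumptions, not the paper's settings; with one arm, q = t, so the q-advection term folds into the time march.

```python
import numpy as np

# Minimal sketch of the explicit backward scheme (31)-(32) for K = 1 with
# constant drift mu and diffusion sig (placeholder values).
def solve_hjb(mu=0.5, sig=1.0, S=2.0, Ns=40, Nt=4000):
    dt, ds = 1.0 / Nt, S / Ns
    assert dt <= ds**2 / sig**2        # stability condition on the time step
    s = np.linspace(-S, S, 2 * Ns + 1)
    v = np.zeros_like(s)               # terminal condition v(1, .) = 0
    for _ in range(Nt):                # march backward from t = 1 to t = 0
        g = v.copy()
        g[1:-1] = v[1:-1] + dt * (
            mu * (v[2:] - v[:-2]) / (2 * ds)                         # drift
            + 0.5 * sig**2 * (v[2:] - 2 * v[1:-1] + v[:-2]) / ds**2  # diffusion
            + mu)                                                    # reward
        g[0], g[-1] = g[1], g[-2]      # crude zero-gradient boundary
        v = g
    return s, v
```

For constant μ̂ the computed v(0, ·) reproduces μ̂(1 − 0) = μ̂ on the whole grid, since all spatial differences vanish.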
HJB equation without diffusion  Second, we consider the case where σ̂ ≡ 0. Since there is no longer a diffusion term, one should use an upwind scheme for the transport term. Let μ̂^{i,j}_{k,+} = max(μ̂^{i,j}_k, 0) and μ̂^{i,j}_{k,−} = min(μ̂^{i,j}_k, 0), and set

    (g^l_{i,j})_k = v^{l+1}_{i,j} + δt [ (μ̂^{i,j}_{k,+}/δs)( v^{l+1}_{i+e_k, j+e_k} − v^{l+1}_{i,j+e_k} ) + (μ̂^{i,j}_{k,−}/δs)( v^{l+1}_{i,j+e_k} − v^{l+1}_{i−e_k, j+e_k} ) + (1/δq)( v^{l+1}_{i,j+e_k} − v^{l+1}_{i,j} ) + μ̂^{i,j}_k ],    (34)

    k* = argmax_k (g^l_{i,j})_k,    v^l_{i,j} = (g^l_{i,j})_{k*},    (π^l_{i,j})_k = 1 if k = k* and 0 otherwise.    (35)

In this case, the stability conditions require max_{i,j,k} |μ̂^{i,j}_k| δt ≤ δs.

Connection to the Bayesian bandit algorithm  In certain cases, the numerical schemes (31)-(32) and (34)-(35) give the exact optimal value function for the finite-horizon problem.

Lemma 7  For the Bernoulli bandits introduced in Section 3.1, when the initial hyperparameters are (α_k, β_k) = (c₁n, c₂n) for some constants (c₁, c₂), the numerical scheme (34)-(35) for the limiting HJB equation based on the scaling factor f(n) = n gives the exact optimal value function when δt = δq = δs = 1/n. Similarly, for the binomial bandits described in Section 3.4.1 with γ a constant independent of n, when the initial hyperparameters are (α_k, β_k) = (c₁n + c₂√n, c₁n − c₂√n) for some constants (c₁, c₂), the numerical scheme (31)-(32) for the limiting HJB equation based on the scaling factor f(n) = √n gives the exact optimal value function when δt = δq = 1/n and δs = γ/√n.

See Appendix E for the proof of the above lemma.

6. Numerical experiments

6.1 Convergence to the HJB equation

In this section, we show the convergence of the Bayes-optimal solution to the HJB solution as the horizon goes to infinity.
Namely, we would like to show that the differences in the optimal policy and the rescaled optimal cumulative reward,

    | π^{i,n}(s, q) − π̂*( (i−1)/n, s/f(n), q/n ) |  and  | (1/f(n)) w^{i,n}(s, q) − v( (i−1)/n, s/f(n), q/n ) |,

decay as the horizon n → ∞, where π^{i,n}(s, q), w^{i,n}(s, q) are obtained by backward induction and π̂*(t, ŝ, q̂), v(t, ŝ, q̂) are obtained by solving the corresponding HJB equation. We show the convergence result in Figures 2 and 3. Below are the details of the plots.

Consider the one-armed Bernoulli bandit problem, where the first arm has reward 1 with probability ν and −1 with probability 1 − ν, while the second arm has a deterministic reward μ₂. In this case, given a prior measure ν ∼ Beta(α, β), one can obtain the exact optimal policy and cumulative reward π^{i,n}(s, q) and w^{i,n}(s, q) via the equations

    π^{i,n}_k(s, q) = 1 for k = argmax_k ŵ^{i,n}_k(s, q) and 0 otherwise;    w^{i,n}(s, q) = max_k ŵ^{i,n}_k(s, q),
    ŵ^{i,n}_1(s, q) = p(s, q)( 1 + w^{i+1,n}(s + 1, q + 1) ) + (1 − p(s, q))( −1 + w^{i+1,n}(s − 1, q + 1) ),
    ŵ^{i,n}_2(s, q) = w^{i+1,n}(s, q) + μ₂,

with w^{n+1,n}(s, q) = 0 for all (s, q) and p(s, q) = (α + s/2 + q/2)/(α + β + q).

The limiting HJB equation depends on the scaling factor f(n). By Lemma 3, one arrives at a stochastic optimal control problem if f(n) = √n and a deterministic one if f(n) = n. We compare π^{i,n}, w^{i,n} with the limiting HJB solution for both of these scenarios. In Figures 2 and 3, we set μ₂ = 1/(3√n). The hyperparameters (α, β) for the initial prior measure are set to (n, n − √n), which implies that the limiting HJB equation is

    ∂_t v + max_{π ∈ [0,1]} { ( μ̂₁ ∂_ŝ v + ∂_q̂ v + ½ σ̂² ∂²_ŝ v + μ̂₁ ) π + μ̂₂ (1 − π) } = 0,

with μ̂₁ = (ŝ + 1)/(2 + q̂), μ̂₂ = 1/3, σ̂ = 1 for f(n) = √n, and μ̂₁ = ŝ/(2 + q̂), μ̂₂ = 0, σ̂ = 0 for f(n) = n. By Theorem 6, one can obtain the exact optimal policy for the limiting HJB equation.

In Figure 2, we plot the average differences over i, s, q. That is, we plot

    e^n_π = (1/Z) Σ | π^{i,n}(s, q) − π̂*( (i−1)/n, s/f(n), q/n ) |,    e^n_w = (1/Z) Σ | (1/f(n)) w^{i,n}(s, q) − v( (i−1)/n, s/f(n), q/n ) |,    (36)

where the summation is over i ∈ {1, …, n}, s ∈ {−(i−1), …, i−1}, q ∈ {0, …, i−1}, and Z is the number of terms in the summation.
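The backward induction above can be coded directly; below is a minimal Python sketch, where the horizon, prior hyperparameters, and μ₂ are illustrative choices (not the paper's settings) and the ±1 immediate rewards of arm 1 are written out explicitly:

```python
from functools import lru_cache

def one_armed_bernoulli(n=30, alpha=1.0, beta=1.0, mu2=0.1):
    # Exact backward induction for the one-armed Bernoulli bandit:
    # arm 1 pays +1 w.p. nu and -1 w.p. 1 - nu, with prior nu ~ Beta(alpha, beta);
    # arm 2 deterministically pays mu2. Returns w^{1,n}(0, 0).
    @lru_cache(maxsize=None)
    def w(i, s, q):
        if i > n:
            return 0.0                 # terminal condition w^{n+1,n} = 0
        p = (alpha + s / 2 + q / 2) / (alpha + beta + q)  # posterior P(+1)
        w1 = p * (1 + w(i + 1, s + 1, q + 1)) \
             + (1 - p) * (-1 + w(i + 1, s - 1, q + 1))    # pull arm 1
        w2 = mu2 + w(i + 1, s, q)                         # pull arm 2
        return max(w1, w2)
    return w(1, 0, 0)
```

By optimality, the returned value is at least n·μ₂, the payoff of always pulling the deterministic arm; when μ₂ exceeds the best possible reward of arm 1, pulling arm 2 every round is exactly optimal.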
Next, we test the difference between the Bayes-optimal solution and the numerical solution to the HJB equation presented in Section 5.2. The following averaged differences are plotted in Figure 3:

    e^{n,N}_π = (1/Z) Σ | π^{i,n}(s, q) − π̄^N( (i−1)/n, s/f(n), q/n ) |,    e^{n,N}_w = (1/Z) Σ | (1/f(n)) w^{i,n}(s, q) − v̄^N( (i−1)/n, s/f(n), q/n ) |,    (37)

where v̄^N is obtained according to the scheme (31)-(32) with δt = δq = N^{−1}, δs = N^{−1/2} when f(n) = √n, and according to the scheme (34)-(35) with δt = δq = δs = N^{−1} when f(n) = n. The difference e^{n,N}_w based on f(n) = √n is rescaled by 1/√n so that it is on the same scale as that based on f(n) = n. Due to the different discretizations, ((i−1)/n, s/f(n), q/n) is not necessarily a grid point, so we define the continuous approximation solution v̄(t, ŝ, q̂) as in (33), which implies that

    π̄^N( (i−1)/n, s/f(n), q/n ) = π^l_{m,j},    v̄^N( (i−1)/n, s/f(n), q/n ) = v^l_{m,j},    (38)

with l = ⌊(i−1)/(nδt)⌋, m = ⌊s/(f(n)δs)⌋, j = ⌊q/(nδq)⌋, where ⌊x⌋ is the largest integer less than or equal to x.

Figure 2 shows that the difference e^n_π decays as n increases for both scaling factors. The stochastic limit corresponding to the scaling factor f(n) = √n is closer to the optimal Bayesian solution than the deterministic limit. Figure 3 shows that the difference e^{n,N} decays as n and N increase. Note that e^{n,N} has two components: the model error, which decreases as the horizon n increases, and the numerical error, which decreases as the number of grid points N increases. We can see from Figure 3 that when both the horizon n and the number of grid points N increase, the differences decrease. We observe that when N = 50, the difference e^{n,N}_w in the value functions decreases more slowly or stops decreasing after n reaches some threshold. This is because the numerical error dominates the model error in this regime.

6.2 The performance of the approximate Bayes-optimal policy

We compare the performance of the approximate Bayes-optimal policy (Algorithm 1) with Thompson sampling and UCB in terms of the expected regret.
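Such comparisons estimate the expected regret by averaging the cumulative regret over repeated simulated runs; a minimal Monte Carlo sketch follows, where the environment, the placeholder policy, and the sample counts are illustrative assumptions rather than the paper's experimental settings:

```python
import numpy as np

def expected_regret(policy, means, n=1000, n_sims=100, seed=0):
    # Monte Carlo estimate of the expected cumulative regret of `policy` on a
    # K-armed normal bandit with unit reward variance and mean vector `means`.
    # `policy(s, q)` maps cumulative rewards s and pull counts q to an arm.
    rng = np.random.default_rng(seed)
    K, best = len(means), max(means)
    total = 0.0
    for _ in range(n_sims):
        s, q = np.zeros(K), np.zeros(K)
        for i in range(n):
            k = i if i < K else policy(s, q)   # pull each arm once first
            s[k] += means[k] + rng.standard_normal()
            q[k] += 1
            total += best - means[k]
    return total / n_sims

# Placeholder policy: greedy on the empirical means
greedy = lambda s, q: int(np.argmax(s / q))
```

Any of the three policies compared in this section can be plugged in for `greedy`, as long as it exposes the same (s, q) interface.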
(See Appendix F for the details of Thompson sampling and UCB.) For unstructured bandits, we consider normal arm rewards, in which case the exact policy for the limiting HJB equation can be directly obtained. The performance of the three algorithms is shown in Figure 4. Below are the details of the plots.

Consider the K-armed normal bandit problem with K ≥ 2. Assume that the first arm's reward follows r₁ ∼ N(μ₁, 1), while the k-th arm's reward follows r_k ∼ N(μ, 1) for 2 ≤ k ≤ K; the means are unknown to the learner. We define

    Δ = μ − μ₁    (39)

to be the arm gap.

Figure 2: The plot shows the decay of the difference between the Bayes-optimal solution and the solution to the HJB equation as n increases, i.e., e^n_π and e^n_w defined in (36). Here f(n) is the scaling factor. When f(n) = √n, the resulting limit is a stochastic optimal control problem, while when f(n) = n, the resulting limit is a deterministic one.

Figure 3: The plot shows the decay of the difference between the Bayes-optimal solution and the numerical solution to the HJB equation as n and N increase, i.e., e^{n,N}_π and e^{n,N}_w defined in (37). Here N is the number of grid points when numerically solving the HJB equation, and f(n) is the scaling factor. When f(n) = √n, the resulting limit is a stochastic optimal control problem, while when f(n) = n, the resulting limit is a deterministic one. The curve for N = 50, f(n) = n is indistinguishable from the curve for N = 500, f(n) = n in e^{n,N}_π, so it is not shown on the right plot.

The horizon is set to n = 10³. For the proposed method (Algorithm 1), we set the scaling factor f(n) = √n; that is, the limiting optimal control problem is stochastic. The exact solution to the limiting HJB equation can be obtained by Theorem 6. The initial prior measure for both the Bayes-optimal policy and Thompson sampling is ν_k ∼ N(1/√n, 1/n) for all k.
This implies that the limiting HJB equation is (11) with μ̂_k(ŝ, q̂) = (ŝ_k + 1)/(q̂_k + 1) and σ̂ ≡ 1. In addition, we set δ = n^{−2} for the UCB algorithm. Figure 4 shows the expected regret of the three algorithms for Δ ∈ [−1, 1] and K = 5, 10, 20. The expected regret is averaged over 10³ simulations. We can see from Figure 4 that the overall performance of the approximate Bayes-optimal policy is better than that of the other two algorithms, especially when the prior guess is close to the underlying environment. When Δ approaches 1, UCB is a bit better than the approximate Bayes-optimal policy because the prior guess is significantly different from the underlying truth. However, note that as the number of arms increases, the performance of the two becomes almost the same, even around Δ = 1.

Figure 4: The plot shows the expected regret in the K-armed normal bandit problem for the approximate Bayes-optimal policy, Thompson sampling, and UCB. The arm gap Δ ∈ [−1, 1] is defined in (39). The left, middle, and right plots correspond to K = 5, 10, 20.

For structured bandits, we consider the linear bandits described in Section 3.4.1. Assume there are two arms, and the reward for arm a_i follows a_i ν + η with unknown ν and η ∼ N(0, 1). We set the initial measure for ν to N(0, 1/n) and take the scaling factor f(n) = √n; then μ̂, σ̂ for the limiting HJB equation can be obtained according to Lemma 5. We solve the limiting HJB equation by the numerical scheme (31)-(32) with δt = δq = 1/N, δs = 1/√N, and N = 100. The performance in terms of the expected regret is plotted in Figure 5, where we test three different action positions (a₁, a₂) = (0.1, 0.1), (0.1, 0.2), (0.1, 0.2).

Figure 5 shows that the overall performance of the approximate Bayes-optimal policy is more robust than that of the other two methods. First, the approximate Bayes-optimal policy always outperforms TS. When (a₁, a₂) = (0.1, 0.1), the performance of UCB and the approximate Bayes-optimal policy are similar.
However, for the other two cases, where (a₁, a₂) = (0.1, 0.2), (0.1, 0.2), UCB performs much worse on one side, while the approximate Bayes-optimal policy has more even regret on both sides. If one measures the performance by the worst-case regret or by the regret averaged over the possible environments, the approximate Bayes-optimal policy outperforms the other two.

Figure 5: The plot shows the expected regret in the 2-armed linear bandit problem for the approximate Bayes-optimal policy, Thompson sampling, and UCB. The arm gap of the normal bandits Δ ∈ [−1, 1] is defined in (39). The environment of the linear bandits ν ∈ [−1/2, 1/2] is defined in (20).

We also show the performance of the regularized approximate Bayes-optimal policy in Figure 6. We compare the regularized Bayes-optimal policy with the unregularized version for normal bandits and linear bandits. The settings of the two bandit problems are the same as in Figures 4 and 5, but the initial Bayesian priors of the two bandits are worse (farther from the ground truth). We set ν_k ∼ N(0.01n, 1) for the normal bandits and ν ∼ N(n, 1) for the linear bandits. One can see from Figure 6 that the regularized version performs similarly to or better than the unregularized version when the initial prior measure is bad. We remark that since the solution to the regularized HJB equation is always smooth when λ > 0, it is also potentially easier to break the curse of dimensionality. However, we leave the high-dimensional problem for future study.

7. Discussion

In this paper, we derived a continuous-in-time limit for the Bayesian bandit problem. We showed that the rescaled optimal cumulative reward converges to the solution of an HJB equation. We derive several benefits from the limiting PDE:

A single recipe for many Bayesian bandits.
For most multi-armed bandit problems, the classical Bayesian bandit algorithms yield a formulation that cannot be solved accurately, and different recipes are required to approximate the value function in different settings. On the other hand, the limiting PDE gives a single, unified recipe for obtaining the Bayes-optimal policy.

Figure 6: The plot shows the expected regret of a 5-armed normal bandit problem and a 2-armed linear bandit problem for the regularized approximate Bayes-optimal policy. The environment ν ∈ [−1/2, 1/2] is defined in (20). The left, middle, and right plots correspond to (a₁, a₂) = (0.1, 0.1), (0.1, 0.2), (0.1, 0.2).

For example, one way to solve the one-armed bandit problem with normal distributions using a Bayesian bandit algorithm is to solve the following equation backward:

    W^i(μ, q) = max { W^{i+1}(μ, q) + μ₂,  μ + ∫ (2πσ²_i)^{−1/2} e^{−x²/(2σ²_i)} W^{i+1}(μ + x, q + 1) dx }

for all μ ∈ R and i = 1, …, n, with σ²_i = (q_i + σ^{−2})^{−1}. Since there is no closed-form solution for the integral in the above equation, one must approximate W^i using piecewise quadratic functions (see, e.g., Section 35.3.2 of Lattimore and Szepesvári (2020) for details). This results in a completely different algorithm for solving this problem compared to the Bernoulli reward case. However, if one instead uses the HJB equation to solve the one-armed bandit problem with normally distributed rewards, the same formulation (18) which applies for Bernoulli rewards also applies for normal rewards. The only modification is the different form of μ̂ according to Lemma 4.

Improved efficiency for large n.

The classical Bayesian bandit algorithm requires a computational cost of O(n^{2K}) to calculate the optimal policy, which can be prohibitive for large n. On the other hand, as n → ∞, the Bayesian bandit problem converges to the continuous HJB equation.
The computational cost of solving the HJB equation is independent of the horizon and depends only on the numerical discretization of the PDE, which is O(N^{2K}). When N ≪ n, one obtains huge computational savings by solving for the HJB value function instead. Since the accuracy of the approximate solution is ‖v̄ − v‖ = O(N^{−1}), one retains an accurate approximation of the solution at much lower computational cost.

Improved efficiency for large K.

For the case where the exact solution of the limiting HJB equation can be obtained, as stated in Theorem 6, there is no computational cost for solving the limiting equation. In this case, even if K is large, one can approximate the Bayes-optimal policy efficiently. For the case where the exact solution to the HJB equation cannot be obtained, it may be possible to break the curse of dimensionality numerically by using a nonlinear function approximation, such as a deep neural network. We leave the high-dimensional problem for future study.

One can also extend the current framework from finite arms to infinite arms, for instance, with a continuous action space. The policy π(t, s, q, a) is a probability density function such that ∫ π(t, s, q, a) da = 1 for all t, s, q. The general limiting control problem takes the form

    max_{π̂: ∫_A π̂ da = 1}  E ∫_t^1 ∫_A μ̂(ŝ, q̂, a) π̂(τ, a) da dτ
    s.t.  dq̂(τ, a) = π̂(τ, a) dτ,  a ∈ A;
          dŝ(τ, a) = μ̂(ŝ, q̂, a) π̂(τ, a) dτ + σ̂(ŝ, q̂, a) √(π̂(τ, a)) dB_τ,  a ∈ A;
          ŝ(t, a) = s,  q̂(t, a) = q.

We leave the study of the above case to future research.

Victor F. Araman and René A. Caldentey. Diffusion approximations for a class of sequential experimentation problems. Management Science, 68(8):5958-5979, 2022.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
Dirk Bergemann and Juuso Välimäki. Bandit problems. 2006.
Donald A. Berry and Bert Fristedt.
Bandit Problems: Sequential Allocation of Experiments (Monographs on Statistics and Applied Probability). Springer, 1985.
Russell N. Bradt, S. M. Johnson, and Samuel Karlin. On sequential designs for maximizing the sum of n observations. The Annals of Mathematical Statistics, 27(4):1060-1074, 1956.
Yeon-Koo Che and Johannes Hörner. Recommender systems as mechanisms for social learning. The Quarterly Journal of Economics, 133(2):871-925, 2018.
Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.
Zhengdao Chen, Grant Rotskoff, Joan Bruna, and Eric Vanden-Eijnden. A dynamical central limit theorem for shallow neural networks. Advances in Neural Information Processing Systems, 33:22217-22230, 2020.
Nicole El Karoui, Monique Jeanblanc, and Vincent Lacoste. Optimal portfolio management with American capital guarantee. Journal of Economic Dynamics and Control, 29(3):449-468, 2005.
Lawrence C. Evans. Partial Differential Equations, volume 19. American Mathematical Society, 2010.
Lin Fan and Peter W. Glynn. Diffusion approximations for Thompson sampling. arXiv preprint arXiv:2105.09232, 2021.
Kris Johnson Ferreira, David Simchi-Levi, and He Wang. Online network revenue management using Thompson sampling. Operations Research, 66(6):1586-1602, 2018.
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160-2169. PMLR, 2019.
John C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):148-164, 1979.
Yonatan Gur, Gregory Macnamara, and Daniela Saban. Sequential procurement with contractual and experimental learning. Management Science, 68(4):2714-2731, 2022.
Jiequn Han, Qianxiao Li, et al. A mean-field optimal control formulation of deep learning.
Research in the Mathematical Sciences, 6(1):1-41, 2019.
Michael Kapralov and Rina Panigrahy. Prediction strategies without loss. Advances in Neural Information Processing Systems, 24, 2011.
Vladimir A. Kobzar and Robert V. Kohn. A PDE-based analysis of the symmetric two-armed Bernoulli bandit. arXiv preprint arXiv:2202.05767, 2022.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
Tor Lattimore. Regret analysis of the finite-horizon Gittins index strategy for multi-armed bandits. In Conference on Learning Theory, pages 1214-1245. PMLR, 2016.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
Benoît Leloup and Laurent Deveaux. Dynamic pricing on the internet: Theory and simulations. Electronic Commerce Research, 1(3):265-276, 2001.
Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101-2110. PMLR, 2017.
Qiang Liu. Stein variational gradient descent as gradient flow. Advances in Neural Information Processing Systems, 30, 2017.
Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665-E7671, 2018.
Makiko Nisio. Stochastic Control Theory. ISI Lecture Notes, 9, 2015.
Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
Weijie Su, Stephen Boyd, and Emmanuel Candès.
A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Advances in Neural Information Processing Systems, 27, 2014.
Ambuj Tewari and Susan A. Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495-517. Springer, 2017.
William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285-294, 1933.
Stefan Wager and Kuang Xu. Diffusion asymptotics for sequential experiments. arXiv preprint arXiv:2101.09855, 2021.
Abraham Wald. Sequential Analysis. Courier Corporation, 2004.
Zhilei Wang and Robert V. Kohn. A new approach to drifting games, based on asymptotically optimal potentials. arXiv preprint arXiv:2207.11405, 2022.
Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688, 2011.
Lexing Ying and Yuhua Zhu. A note on optimization formulations of Markov decision processes. Communications in Mathematical Sciences, 20(3):727-745, 2022.

A Proof of Lemma 3

The expectation of the k-th arm in environment ν is μ_k(ν) = γ(2ν_k − 1). At the start of the i-th round, the posterior distribution of ν_k is uniquely determined by the cumulative reward s^i_k = Σ_{j=1}^{i−1} X_j 1_{A_j=k} and the number of pulls q^i_k = Σ_{j=1}^{i−1} 1_{A_j=k} through round i − 1 according to

    ν^i_k ∼ Beta(α^i_k, β^i_k),    α^i_k = α_k + s^i_k/(2γ) + q^i_k/2,    β^i_k = β_k − s^i_k/(2γ) + q^i_k/2.

Hence the joint posterior ρ^i(ν) at round i is

    ρ(ν | s^i, q^i) = (1/Z) Π_{k=1}^K ν_k^{α^i_k − 1} (1 − ν_k)^{β^i_k − 1},    (A.1)

where Z is a normalizing constant.
This allows us to compute the posterior mean and variance of each arm:

    μ_k(s^i, q^i) = ∫_{[0,1]^K} γ(2ν_k − 1) ρ(ν | s^i, q^i) dν = γ ( 2α^i_k/(α^i_k + β^i_k) − 1 ) = γ (α_k − β_k + s^i_k/γ)/(α_k + β_k + q^i_k),
    σ²_k(s^i, q^i) = ∫_{[0,1]^K} γ² ρ(ν | s^i, q^i) dν = γ²,    (A.2)

and the higher-order moments are

    E^p_k(s^i, q^i) = ∫ [ γ^p ν_k + (−γ)^p (1 − ν_k) ] ρ(ν | s^i, q^i) dν = γ^p if p is even, and γ^p (α_k − β_k + s^i_k/γ)/(α_k + β_k + q^i_k) if p is odd.

By the definition of μ̂, σ̂ in (10), one has

    μ̂_k(ŝ, q̂) = lim_{n→∞} (n/f(n)) μ_k(f(n)ŝ, nq̂) = lim_{n→∞} (n/f(n)) γ (α_k − β_k + f(n)ŝ_k/γ)/(α_k + β_k + nq̂_k),
    σ̂²_k(ŝ, q̂) = lim_{n→∞} (n/f(n)²) γ²,    Ê^p_k = lim_{n→∞} (n/f(n)^p) E^p_k(f(n)ŝ, nq̂).

By the assumptions given in Lemma 3, one ends up with

    μ̂_k(ŝ, q̂) = (α̂_k + ŝ_k)/(β̂_k + q̂_k),    σ̂(ŝ, q̂) ≡ σ̂,    Ê^p_k ≡ 0 for p ≥ 3.

The last equation holds because lim_{n→∞} (√n/f(n)) γ = σ̂ implies lim_{n→∞} (n/f(n)^p) γ^p = lim_{n→∞} σ̂^p n^{1−p/2} = 0 for p ≥ 3.

Different limiting equations due to different scaling factors  Let us consider the case where the reward values γ and −γ are independent of the horizon n. We set γ = 1 and the prior hyperparameters to (α, β) = (n/2, n/2), i.e., the prior measure of ν_k is Beta(n/2, n/2). In this case, one can rescale the cumulative reward s by √n, i.e., take f(n) = √n. By Lemma 3, one has

    μ̂(ŝ, q̂) = ŝ/(1 + q̂),    σ̂(ŝ, q̂) ≡ 1.

One can also rescale the cumulative reward s by n, i.e., take f(n) = n. By Lemma 3, one then has

    μ̂(ŝ, q̂) = ŝ/(1 + q̂),    σ̂(ŝ, q̂) ≡ 0.

In this case, the Bayesian bandit problem converges to a deterministic control problem of the form (18) with

    μ̂(ŝ, q̂) = ŝ/(1 + q̂),    σ̂ ≡ 0.    (A.3)

One can see from the above examples that the same Bayesian bandit problem may converge toward different control problems based on the chosen scaling.

B Proof of Lemma 4

The probability density function for the k-th arm in environment ν is

    P^ν_k(x) = (1/√(2πσ²)) e^{−|x − ν_k|²/(2σ²)}.

(Note that here π denotes the constant 3.14159… rather than the policy.) Thus the expected reward for the k-th arm in environment ν is μ_k(ν) = ν_k.
At the start of the i-th round, the posterior distribution of ν_k is uniquely determined by the cumulative reward s^i_k and the number of pulls q^i_k through round i − 1 according to

    ν^i_k ∼ N(α^i_k, (β^i_k)²),    α^i_k = (αβ^{−2} + s^i_k σ^{−2})/(q^i_k σ^{−2} + β^{−2}),    (β^i_k)² = 1/(q^i_k σ^{−2} + β^{−2}).

Hence the joint posterior ρ^i(ν) at round i is

    ρ^i(ν) = ρ(ν | s^i, q^i) = Π_{k=1}^K (1/(√(2π) β^i_k)) e^{−|ν_k − α^i_k|²/(2(β^i_k)²)}.

From this, it follows that

    μ_k(s^i, q^i) = ∫_{R^K} ν_k ρ^i(ν) dν = α^i_k = (αβ^{−2} + s^i_k σ^{−2})/(q^i_k σ^{−2} + β^{−2}),
    σ²_k(s^i, q^i) = ∫_{R^K} (σ² + ν²_k) ρ^i(ν) dν = σ² + (β^i_k)² + (α^i_k)² = σ² + 1/(q^i_k σ^{−2} + β^{−2}) + (μ_k)².

Inserting μ_k and σ_k into (10) yields

    μ̂_k(ŝ, q̂) = lim_{n→∞} (n/f(n)) (αβ^{−2} + f(n)ŝ_k σ^{−2})/(nq̂_k σ^{−2} + β^{−2}) = lim_{n→∞} ( ŝ_k + σ²α/(f(n)β²) ) / ( q̂_k + σ²/(nβ²) ),
    (σ̂_k(ŝ, q̂))² = lim_{n→∞} [ (n/f(n)²) σ² + n/( f(n)² (nσ^{−2}q̂_k + β^{−2}) ) ] = lim_{n→∞} (n/f(n)²) σ².

By the conditions in Lemma 4, one ends up with

    μ̂_k(ŝ, q̂) = (α̂_k + ŝ_k)/(β̂_k^{−2} + q̂_k),    σ̂(ŝ, q̂) ≡ σ̂.

For the higher-order moments, one can write the moments of the normal distribution in the form

    ∫ x^p P^ν_k(x) dx = Σ_{j} C(p, j) ν_k^{p−2j} σ^{2j},

where C(p, j) is a constant depending on p and j. Then one has

    E^p_k(s^i, q^i) = ∫∫ x^p P^ν_k(x) dx ρ(ν_k) dν_k = Σ_j C(p, j) σ^{2j} ∫ ν_k^{p−2j} ρ(ν_k) dν_k = Σ_{j,l} C(p, j) C(p−2j, l) σ^{2j} (β^i_k)^{2l} (α^i_k)^{p−2j−2l}.

Therefore, after rescaling,

    Ê^p_k(ŝ, q̂) = lim_{n→∞} (n/f(n)^p) Σ_{j,l} C(p, j) C(p−2j, l) σ^{2j} (β^i_k)^{2l} (α^i_k)^{p−2j−2l}.

Since lim_{n→∞} (√n/f(n)) σ = σ̂, (β^i_k)² = O(f(n)²/n²), and α^i_k = O(f(n)/n), each term above is of order n^{1+j−p}. As j ≤ p/2, one has 1 + j − p ≤ 1 − p/2 < 0 for p ≥ 3, and hence Ê^p_k(ŝ, q̂) ≡ 0.

C Proof of Lemma 5

The expectation of the reward at round i in environment ν is ⟨A_i, ν⟩, and the probability density function for the k-th arm is

    P^ν_k(x) = (1/√(2πσ²)) e^{−|x − (a^k)ᵀν|²/(2σ²)}.

Then the posterior distribution of ν at round i is uniquely determined by the cumulative rewards and pull counts (s^i, q^i) up to round i − 1:

    ν^i ∼ N(α^i, Σ^i),    Σ^i = ( Σ^{−1} + σ^{−2} Σ_{k=1}^K q^i_k a^k (a^k)ᵀ )^{−1},    α^i = Σ^i ( Σ^{−1}α + σ^{−2} Σ_{k=1}^K s^i_k a^k ).

Note that a^k ∈ R^d in the above equation denotes the action value of the k-th arm.
Hence the posterior measure of ν^i at round i is

    ρ^i(ν) = (2π)^{−d/2} |Σ^i|^{−1/2} exp( −½ (ν − α^i)ᵀ (Σ^i)^{−1} (ν − α^i) ).

Therefore, one obtains

    μ_k(s^i, q^i) = ∫_{R^d} (a^k)ᵀν ρ^i(ν) dν = (a^k)ᵀ α^i = (a^k)ᵀ ( Σ^{−1} + σ^{−2} Σ_j q^i_j a^j (a^j)ᵀ )^{−1} ( Σ^{−1}α + σ^{−2} Σ_j s^i_j a^j ),
    σ²_k(s^i, q^i) = ∫_{R^d} ( σ² + (a^k)ᵀ ννᵀ a^k ) ρ^i(ν) dν = σ² + (a^k)ᵀ ( Σ^i + α^i(α^i)ᵀ ) a^k = σ² + (a^k)ᵀ ( Σ^{−1} + σ^{−2} Σ_j q^i_j a^j (a^j)ᵀ )^{−1} a^k + (μ_k)².

Inserting μ_k and σ_k into (10) yields

    μ̂_k(ŝ, q̂) = lim_{n→∞} (n/f(n)) (a^k)ᵀ ( Σ^{−1} + nσ^{−2} Σ_j q̂_j a^j (a^j)ᵀ )^{−1} ( Σ^{−1}α + f(n)σ^{−2} Σ_j ŝ_j a^j ),
    (σ̂_k(ŝ, q̂))² = lim_{n→∞} (n/f(n)²) [ σ² + (a^k)ᵀ ( Σ^{−1} + nσ^{−2} Σ_j q̂_j a^j (a^j)ᵀ )^{−1} a^k ].

By the conditions in Lemma 5, one ends up with

    μ̂(ŝ, q̂, b) = bᵀ ( Σ̂^{−1} + Σ_k q̂_k a^k (a^k)ᵀ )^{−1} ( α̂ + Σ_k ŝ_k a^k ),    σ̂(ŝ, q̂) ≡ σ̂.

For the higher-order moments, since

    Σ^i(ŝ, q̂) = (σ²/n) ( σ²n^{−1}Σ^{−1} + Σ_k q̂_k a^k (a^k)ᵀ )^{−1} = O(σ²/n)  and  α^i(ŝ, q̂) = O(f(n)/n),

the higher-order moments satisfy, as in the proof of Lemma 4,

    Ê^p_k(ŝ, q̂) = lim_{n→∞} (n/f(n)^p) Σ_{j,l} O(σ^{2j}) O(‖Σ^i‖^l) O(‖α^i‖^{p−2j−2l}) = lim_{n→∞} O(n^{1+j−p}) = 0,

where the last equality holds because 1 + j − p ≤ 1 − p/2 < 0 for all p ≥ 3 and lim_{n→∞} (√n/f(n)) σ(n) = σ̂.

D Proof of Theorem 6

Consider the optimal control problem (21) with σ̂ ≡ 0. When μ̂_k = (ŝ_k + α̂_k)/(q̂_k + β̂_k), note that

    dμ̂_k/dτ = (1/(q̂_k + β̂_k)) dŝ_k/dτ − ( (ŝ_k + α̂_k)/(q̂_k + β̂_k)² ) dq̂_k/dτ = μ̂_k π̂_k/(q̂_k + β̂_k) − μ̂_k π̂_k/(q̂_k + β̂_k) = 0

for any policy π̂(τ). This implies that μ̂_k(τ) ≡ μ̂_k(t) for all k. Therefore, the objective function becomes

    ∫_t^1 ( μ̂(t) − λ log π̂(τ) ) · π̂(τ) dτ.    (D.1)

Since μ̂(t) is a constant, the above objective function is maximized by π̂* given in (25) for the unregularized version (λ = 0) and in (26) for the regularized version (λ > 0). For the stochastic case, note that

    d E[μ̂_k] = E[ (1/(q̂_k + β̂_k)) dŝ_k − ( (ŝ_k + α̂_k)/(q̂_k + β̂_k)² ) dq̂_k ] = 0 + E[ σ̂_k(ŝ, q̂) √(π̂_k) dB_τ ]/(q̂_k + β̂_k) = 0    (D.2)

for any π̂(τ), so the objective function for the stochastic case is the same as (D.1), which results in the same optimal policy.

E Proof of Lemma 7

Consider the K-armed Bernoulli bandit, where the k-th arm gives reward 1 with probability ν_k and 0 with probability 1 − ν_k. The initial prior measure of ν_k follows Beta(α_k, β_k).
Then the rescaled optimal cumulative reward $v^i(\hat s, \hat q) = \frac{1}{n}w^i(s, q)$ with scaling factor $f(n) = n$ satisfies
\[
v^i(\hat s, \hat q) = \max_k \Big\{\frac{1}{n}p_k(\hat s, \hat q) + p_k(\hat s, \hat q)\,v^{i+1}\big(\hat s + \tfrac{1}{n}e_k, \hat q + \tfrac{1}{n}e_k\big) + \big(1 - p_k(\hat s, \hat q)\big)\,v^{i+1}\big(\hat s, \hat q + \tfrac{1}{n}e_k\big)\Big\}, \tag{E.1}
\]
where
\[
p_k(\hat s, \hat q) = \frac{n^{-1}\alpha_k + \hat s_k}{n^{-1}(\alpha_k + \beta_k) + \hat q_k}.
\]
The corresponding HJB equation under the scaling factor $f(n) = n$ is
\[
\partial_t v + \max_{\hat\pi(t,\hat s,\hat q)} \sum_{k=1}^K \big(\hat\mu_k + \hat\mu_k\,\partial_{\hat s_k}v + \partial_{\hat q_k}v\big)\pi_k = 0, \qquad v(1, \hat s, \hat q) = 0,
\]
where
\[
\hat\mu_k = \frac{\hat\alpha_k + \hat s_k}{\hat\beta_k + \hat q_k}, \qquad \text{with } \hat\alpha_k = \lim_{n\to\infty} n^{-1}\alpha_k, \quad \hat\beta_k = \lim_{n\to\infty} n^{-1}(\alpha_k + \beta_k).
\]
Since $\hat\sigma \equiv 0$, applying the numerical scheme (34)-(35) with $\delta t = \delta q = \delta s = 1/n$ gives
\[
v\big(\tfrac{l}{n}, \hat s, \hat q\big) = \max_k \Big\{\frac{1}{n}\hat\mu_k(\hat s, \hat q) + \hat\mu_k(\hat s, \hat q)\,v\big(\tfrac{l+1}{n}, \hat s + \tfrac{1}{n}e_k, \hat q + \tfrac{1}{n}e_k\big) + \big(1 - \hat\mu_k(\hat s, \hat q)\big)\,v\big(\tfrac{l+1}{n}, \hat s, \hat q + \tfrac{1}{n}e_k\big)\Big\}. \tag{E.2}
\]
By comparing (E.1) and (E.2), one can see that the numerical scheme is equivalent to the exact Bayes-optimal algorithm if and only if $\hat\mu_k(\hat s, \hat q) = p_k(\hat s, \hat q)$. The above condition holds if and only if $(\alpha_k, \beta_k) = (c_1 n, c_2 n)$ for some constants $(c_1, c_2)$, which completes the proof of the first part of Lemma 7.

Next, consider the binomial bandits described in Section 3.4.1. First, the optimal cumulative reward satisfies
\[
w^i(s, q) = \max_k \Big\{\gamma\big(2p_k(s, q) - 1\big) + p_k(s, q)\,w^{i+1}(s + \gamma e_k, q + e_k) + \big(1 - p_k(s, q)\big)\,w^{i+1}(s - \gamma e_k, q + e_k)\Big\},
\]
with
\[
p_k(s, q) = \frac{\alpha_k + s_k/(2\gamma) + q_k/2}{\alpha_k + \beta_k + q_k}.
\]
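For small horizons, the Bayes-optimal recursion underlying (E.1) can be solved exactly by memoized backward induction on the unrescaled counts $(s, q)$; this brute-force computation is precisely what becomes intractable for large $n$ and what the continuous limit avoids. A minimal sketch (the function name is ours):

```python
from functools import lru_cache

def bayes_optimal_value(n, alphas, betas):
    """Exact Bayes-optimal expected cumulative reward for a K-armed
    Bernoulli bandit with Beta(alpha_k, beta_k) priors over n rounds,
    computed by backward induction over success/pull counts."""
    K = len(alphas)

    @lru_cache(maxsize=None)
    def value(i, state):
        # state[k] = (successes s_k, pulls q_k); i = rounds already played
        if i == n:
            return 0.0
        best = 0.0
        for k in range(K):
            s_k, q_k = state[k]
            p_k = (alphas[k] + s_k) / (alphas[k] + betas[k] + q_k)
            win = list(state); win[k] = (s_k + 1, q_k + 1)
            lose = list(state); lose[k] = (s_k, q_k + 1)
            v = p_k * (1.0 + value(i + 1, tuple(win))) \
                + (1.0 - p_k) * value(i + 1, tuple(lose))
            best = max(best, v)
        return best

    return value(0, tuple((0, 0) for _ in range(K)))
```

For a single arm with a uniform Beta(1, 1) prior, the optimal value over $n$ rounds is $n/2$, which the recursion reproduces; the state space grows quickly with $K$ and $n$, reflecting the computational burden discussed in the introduction.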
Then the rescaled optimal cumulative reward $v^i(\hat s, \hat q) = \frac{1}{\sqrt{n}}w^i(s, q)$ with scaling factor $f(n) = \sqrt{n}$ satisfies
\[
v^i(\hat s, \hat q) = \max_k \Big\{n^{-1/2}\gamma\big(2\hat p_k(\hat s, \hat q) - 1\big) + \hat p_k(\hat s, \hat q)\,v^{i+1}\big(\hat s + \gamma n^{-1/2}e_k, \hat q + n^{-1}e_k\big) + \big(1 - \hat p_k(\hat s, \hat q)\big)\,v^{i+1}\big(\hat s - \gamma n^{-1/2}e_k, \hat q + n^{-1}e_k\big)\Big\},
\]
where
\[
2\hat p_k(\hat s, \hat q) - 1 = \frac{1}{\sqrt{n}\,\gamma}\cdot\frac{\gamma(\alpha_k - \beta_k)/\sqrt{n} + \hat s_k}{(\alpha_k + \beta_k)/n + \hat q_k}.
\]
By letting $\mu_k(\hat s, \hat q) = \sqrt{n}\,\gamma\big(2\hat p_k(\hat s, \hat q) - 1\big)$, one has
\[
v^i(\hat s, \hat q) = \max_k \Big\{\frac{1}{n}\mu_k(\hat s, \hat q) + \frac{1}{2}\Big(1 + \frac{\mu_k(\hat s, \hat q)}{\sqrt{n}\,\gamma}\Big)v^{i+1}\big(\hat s + \gamma n^{-1/2}e_k, \hat q + n^{-1}e_k\big) + \frac{1}{2}\Big(1 - \frac{\mu_k(\hat s, \hat q)}{\sqrt{n}\,\gamma}\Big)v^{i+1}\big(\hat s - \gamma n^{-1/2}e_k, \hat q + n^{-1}e_k\big)\Big\}. \tag{E.3}
\]
On the other hand, the limiting HJB equation for $v^i(\hat s, \hat q)$ is
\[
\partial_t v + \max_{\hat\pi(t,\hat s,\hat q)} \sum_{k=1}^K \Big(\hat\mu_k + \hat\mu_k\,\partial_{\hat s_k}v + \partial_{\hat q_k}v + \frac{1}{2}\hat\sigma^2\,\partial^2_{\hat s_k}v\Big)\pi_k = 0, \qquad v(1, \hat s, \hat q) = 0,
\]
with
\[
\hat\mu_k(\hat s, \hat q) = \frac{\hat\alpha_k + \hat s_k}{\hat\beta_k + \hat q_k}, \qquad \hat\sigma = \gamma, \qquad
\hat\alpha_k = \lim_{n\to\infty} \frac{\gamma(\alpha_k - \beta_k)}{\sqrt{n}}, \quad \hat\beta_k = \lim_{n\to\infty} \frac{\alpha_k + \beta_k}{n}.
\]
By comparing (E.3) with the numerical scheme (34)-(35) applied to this HJB equation, one can see that the scheme is equivalent to the exact Bayes-optimal algorithm if and only if $\hat\mu_k(\hat s, \hat q) = \mu_k(\hat s, \hat q)$. The above condition holds if and only if $(\alpha_k, \beta_k) = (c_1 n + c_2\sqrt{n},\; c_1 n - c_2\sqrt{n})$ for some constants $(c_1, c_2)$, which completes the proof of the second part of Lemma 7.

F Detailed algorithms

Algorithm 2 Thompson Sampling for unstructured bandits
Require: Input: $n$, $(s, q) = 0$, prior $\rho(\nu|\beta)$
for $i = 1, \dots, K$ do
  $A_i \leftarrow i$, $s_i \leftarrow X_i$, $q_i \leftarrow 1$
end for
for $i = K + 1, \dots, n$ do
  Update $\rho(\nu|s, q)$ according to Bayes' rule
  Sample $\nu^i$ according to the probability distribution $\rho(\nu|s, q)$
  $A_i \leftarrow \arg\max_k \mu_k(\nu^i)$, $s_{A_i} \leftarrow s_{A_i} + X_i$, $q_{A_i} \leftarrow q_{A_i} + 1$
end for

Algorithm 3 UCB for unstructured bandits
Require: Input: $n$, $(s, q) = 0$, $\delta$
for $i = 1, \dots, K$ do
  $A_i \leftarrow i$, $s_i \leftarrow X_i$, $q_i \leftarrow 1$
end for
for $i = K + 1, \dots, n$ do
  Update $\hat\mu_k \leftarrow s_k/q_k + \sqrt{2\log(1/\delta)/q_k}$
  $A_i \leftarrow \arg\max_k \hat\mu_k$, $s_{A_i} \leftarrow s_{A_i} + X_i$, $q_{A_i} \leftarrow q_{A_i} + 1$
end for

Algorithm 4 Thompson Sampling for linear bandits
Require: Input: $n$, $a^k$, $(s, q) = 0$, $\rho(\nu) = \mathcal{N}(\mu, \Sigma)$
for $i = 1, \dots, K$ do
  $A_i \leftarrow i$, $s_i \leftarrow X_i$, $q_i \leftarrow 1$
end for
for $i = K + 1, \dots$
$, n$ do
  Sample $\nu$ according to the probability distribution $\rho(\nu)$
  $A_i \leftarrow \arg\max_k (a^k)^\top\nu$
  $x \leftarrow a^{A_i}$, $y \leftarrow \nu^\top x + \eta$, where $\eta \sim \mathcal{N}(0, \sigma^2)$
  $\mu \leftarrow (\Sigma^{-1} + \sigma^{-2}xx^\top)^{-1}(\Sigma^{-1}\mu + \sigma^{-2}yx)$, $\Sigma \leftarrow (\Sigma^{-1} + \sigma^{-2}xx^\top)^{-1}$
end for

Algorithm 5 UCB for linear bandits
Require: Input: $V = v_0$, $W = 0$, $\hat\nu = 0$, $\lambda = 0.1$
for $i = 1, \dots, K$ do
  $A_i \leftarrow i$, $s_i \leftarrow X_i$, $q_i \leftarrow 1$
end for
for $i = K + 1, \dots, n$ do
  $\beta \leftarrow \sqrt{2\log(n^2) + \log(1 + (i-1)/\lambda)}$
  $A_i \leftarrow \arg\max_k (a^k)^\top\hat\nu + \beta\sqrt{(a^k)^\top V^{-1}a^k}$
  $x \leftarrow a^{A_i}$, $y \leftarrow \nu^\top x + \eta$, where $\eta \sim \mathcal{N}(0, \sigma^2)$
  $V \leftarrow V + xx^\top$, $W \leftarrow W + yx$, $\hat\nu \leftarrow V^{-1}W$
end for
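For completeness, the ridge-regression UCB baseline of Algorithm 5 can be sketched in Python. This is an illustrative sketch only: we use a fixed bonus coefficient `beta` in place of the round-dependent one, and the function and parameter names are ours.

```python
import numpy as np

def linucb(A, reward, n, lam=0.1, beta=2.0):
    """UCB for the linear bandit in the spirit of Algorithm 5: maintain
    V = lam*I + sum x x^T and W = sum y x, estimate nu_hat = V^{-1} W,
    and pull the arm maximizing a^T nu_hat + beta * sqrt(a^T V^{-1} a).
    A: (K, d) array of arm vectors; reward(x): noisy observation of
    nu^T x. Returns the total reward collected over n rounds."""
    K, d = A.shape
    V = lam * np.eye(d)
    W = np.zeros(d)
    total = 0.0
    for i in range(n):
        if i < K:
            k = i                              # pull each arm once first
        else:
            Vinv = np.linalg.inv(V)
            nu_hat = Vinv @ W
            bonus = np.sqrt(np.einsum('kd,de,ke->k', A, Vinv, A))
            k = int(np.argmax(A @ nu_hat + beta * bonus))
        x = A[k]
        y = reward(x)
        V += np.outer(x, x)                    # V <- V + x x^T
        W += y * x                             # W <- W + y x
        total += y
    return total
```

On a noiseless two-arm instance with $\nu = (1, 0)$ and arms $e_1, e_2$, the optimistic index quickly concentrates on the rewarding arm, so nearly all of the $n$ pulls go to $e_1$.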