Journal of Machine Learning Research 24 (2023) 1-35. Submitted 10/22; Revised 5/23; Published 9/23

Continuous-in-time Limit for Bayesian Bandits

Yuhua Zhu (yuz244@ucsd.edu)
Department of Mathematics and Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, CA 92093, USA

Zachary Izzo (zizzo@stanford.edu)
Department of Mathematics, Stanford University, Stanford, CA 94305, USA

Lexing Ying (lexing@stanford.edu)
Department of Mathematics, Stanford University, Stanford, CA 94305, USA

Editor: Qiang Liu

Abstract

This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computation of the optimal policy is often intractable, especially when the length of the problem horizon or the number of arms is large. In this paper, we first show that under a suitable rescaling, the Bayesian bandit problem converges toward a continuous Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be explicitly obtained for several common bandit problems, and we give numerical methods to solve the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for solving Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not increase as the horizon increases.

Keywords: Bayesian bandit, Hamilton-Jacobi-Bellman equation, optimal control, dynamic view, diffusion approximation, continuous limit

1. Introduction

Bandit problems were first introduced by Thompson (1933), with later pioneering work due to Robbins (1952) and Wald (2004).
In more recent years, bandit algorithms have become widely adopted for automated decision-making tasks such as dynamic pricing (Ferreira et al., 2018), mobile health (Tewari and Murphy, 2017), AlphaGo (Silver et al., 2016), etc. The bandit problem can be considered from one of two perspectives: Bayesian or frequentist. The Bayesian approach dominated bandit research from 1960 to 1980 (Bradt et al., 1956; Gittins, 1979). The objective is to minimize an average cumulative regret with respect to the Bayesian prior measure of the problem environment. It formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. In the frequentist setting (Lai and Robbins, 1985), the cumulative regret is viewed as an unknown deterministic quantity, and the goal is to design policies to achieve the best environment-dependent performance. The main difficulty with Bayesian bandits is that computation of the optimal policy is often intractable, especially when the number of arms or the horizon is large. The Gittins index (Gittins, 1979) reduces the computational cost in the discounted infinite-horizon setting but does not apply to undiscounted cases (Berry and Fristedt, 1985). Although computing the Bayes-optimal policy is challenging, there is a significant payoff: the performance of the policy is not only optimal in the Bayesian setting (by definition) but also has favorable frequentist regret guarantees (Lattimore, 2016). In addition, Bayesian bandits have been widely employed in economics (Bergemann and Välimäki, 2006), including in contract theory (Gur et al., 2022), dynamic pricing (Leloup and Deveaux, 2001), portfolio management (El Karoui et al., 2005), etc.

©2023 Yuhua Zhu, Zachary Izzo, and Lexing Ying. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1181.html.
In this paper, we revisit the Bayesian perspective for the multi-armed bandit problem and analyze it using tools from PDEs. A continuous-in-time limiting HJB equation is derived as the horizon $n$ goes to infinity for a range of bandit problems. Based on the limiting equation, a regularized Bayes-optimal policy is proposed, where regularization is employed to increase exploration and stability. Numerical schemes can be used to approximate the optimal policy, leading to improved computational efficiency when the horizon is large. In addition, the exact optimal policy for the limiting HJB equation can be obtained for certain types of bandit problems, including the classical Bernoulli and Gaussian arm reward cases, resulting in an efficient algorithm to approximate the optimal policy even if the number of arms is large. In summary, our contributions are as follows.

- We derive a continuous-in-time limit, an HJB equation, for the Bayesian multi-armed bandit problem.
- We propose a regularized version of the Bayesian multi-armed bandit problem, which encourages exploration and smooths the optimal policy.
- Based on the limiting PDE, we give an efficient algorithm for approximating the optimal policy.

Recently, the use of differential equations to analyze machine learning algorithms has received growing interest, especially for optimization algorithms (Su et al., 2014; Li et al., 2017), sampling algorithms (Welling and Teh, 2011; Liu, 2017), and neural networks (Mei et al., 2018; Chen et al., 2018; Han et al., 2019; Chen et al., 2020). However, few connections had been built between multi-armed bandits and differential equations until the past few years. For bandit problems, Fan and Glynn (2021), Wager and Xu (2021), and Kobzar and Kohn (2022) model several policies in the frequentist setting via a continuous SDE or PDE, and this continuous analysis is used to provide insights into the properties of the algorithms studied.
In the Bayesian setting, Araman and Caldentey (2022) and Che and Hörner (2018) give differential equation approximations, but their work can only be applied to settings with two possible environments. In this paper, we consider Bayesian bandits with general environments. There are also related works using differential equations in online learning settings, including contextual bandits (Kapralov and Panigrahy, 2011), drifting games (Wang and Kohn, 2022), etc.

2. Bayesian Bandits

Throughout the paper, we focus on $K$-armed stochastic bandits played over $n$ rounds, where $n \in \mathbb{Z}_+$ is a positive integer called the horizon. At each round $i$, the learner chooses an arm (also called an action) $A_i$ from the action space $\mathcal{A} = \{a_k\}_{k=1}^K$ according to a policy $\pi_i$, and the environment reveals a reward $X_i \in \mathbb{R}$. The underlying environment $\nu$ belongs to an environment class $\mathcal{E}$ and defines the arm reward distributions. More precisely, given an environment $\nu = (P_a^\nu,\ a \in \mathcal{A}) \in \mathcal{E}$, the reward $X_i$ follows the distribution $P_{A_i}^\nu$. The policy $\pi_i$ at round $i$ is a function which maps the history $H_i = (A_1, X_1, \dots, A_{i-1}, X_{i-1})$ to a probability distribution over the action space $\mathcal{A}$. More precisely, we denote the set of possible histories at the beginning of round $i$ by $\mathcal{H}_i = (\mathcal{A} \times \mathbb{R})^{i-1}$, with $\mathcal{H}_1 = \emptyset$. We denote by $\mathcal{P}(\mathcal{A})$ the set of probability measures on $\mathcal{A}$, so that the policy $\pi_i$ is a mapping from $\mathcal{H}_i$ to $\mathcal{P}(\mathcal{A})$. We denote by $\Pi$ the set of policies $\pi = \{\pi_i\}_{i=1}^n$ which are measurable with respect to the filtration associated with the process $\{H_i\}_{i=1}^n$, and we call $\Pi$ the competitor class. Define the expectation of arm $a$ in environment $\nu$ as

  $\mu_a(\nu) = \int_{\mathbb{R}} x \, P_a^\nu(x) \, dx.$  (1)

The expected cumulative reward $c_n(\pi, \nu)$ measures the performance of policy $\pi$ in environment $\nu$,

  $c_n(\pi, \nu) = \mathbb{E}\left[\sum_{i=1}^n \mu_{A_i}(\nu)\right],$

where the expectation is taken over the probability measure induced by the interaction of the policy and the environment.
The goal in the $K$-armed bandit setting is to design a policy $\pi$ that leads to the largest expected cumulative reward among all policies in the competitor class $\Pi$. The main difficulty arises because the environment is unknown, and the policy can only depend on the history sequences $\{H_i\}_{i=1}^n$. One way to measure the performance of a policy $\pi$ is to find functions $C : \mathcal{E} \to [0, \infty)$, $f : \mathbb{Z}_+ \to [0, \infty)$ that upper bound the regret:

  $R_n(\pi, \nu) = n\mu^*(\nu) - c_n(\pi, \nu) \le C(\nu) f(n),$

where $\mu^*(\nu) = \max_a \mu_a(\nu)$ is the expected reward of the optimal arm. This is the frequentist regret, which is environment-dependent (Lattimore and Szepesvári, 2020). Another way to measure the performance of a policy is via the averaged cumulative reward with respect to a probability measure $\rho(\nu)$ on the environment class $\mathcal{E}$,

  $c_n(\pi, \rho) = \mathbb{E}\left[\sum_{i=1}^n \int_{\mathcal{E}} \mu_{A_i}(\nu) \rho(\nu) \, d\nu\right].$

In the Bayesian setting, the environment is viewed as a random variable. According to Bayes' rule, the probability $\rho(\nu)$ is updated conditional on the history sequence. Given a horizon $n$, we assume that at each round $i$ the environment is distributed according to a measure $\rho_i(\nu)$ over the environment class $\mathcal{E}$. After pulling arm $A_i$ and obtaining the reward $X_i \sim P_{A_i}^\nu$, we update $\rho_{i+1}(\nu)$ to be the posterior distribution of the environment. Given an initial prior measure $\rho_1(\nu)$, the goal is to find the optimal policy that maximizes the averaged cumulative reward,

  $\max_{\pi \in \Pi} c_n(\pi) = \mathbb{E}\left[\sum_{i=1}^n \int_{\mathcal{E}} \mu_{A_i}(\nu) \rho_i(\nu) \, d\nu\right].$

The bandit problem with the above objective function is called a Bayesian bandit (Chapter 35 of Lattimore and Szepesvári (2020)).

3. Continuous Limits of Bayesian Bandits

3.1 An illustrative example

Consider the one-armed bandit problem in which the reward of the first arm follows a Bernoulli($\nu$) distribution (with $\nu$ unknown) and the second arm gives a deterministic reward $\mu_2 = \frac{1}{2}$. We assume an initial prior distribution $\nu \sim \mathrm{Beta}(\alpha, \beta)$. Then the posterior measure of $\nu$ at round $i$ depends on two quantities.
The first quantity is $q^i$, the number of pulls of the unknown arm before round $i$. The second quantity is $s^i$, the cumulative reward of the unknown arm before round $i$. The posterior distribution of $\nu$ is then $\mathrm{Beta}(\alpha + s^i, \beta + q^i - s^i)$. Let $w^i(s, q)$ be the optimal cumulative reward starting from round $i$ with $s^i = s$, $q^i = q$. Similarly, let $w_k^i(s, q)$ be the optimal cumulative reward in this same setting, assuming that the $k$-th arm is pulled at round $i$. Formally, we have

  $w^i(s, q) = \max_{\pi \in \Pi} \mathbb{E}\left[\sum_{j=i}^n \int_{\mathcal{E}} \mu_{A_j}(\nu) \rho_j(\nu) \, d\nu \,\middle|\, s^i = s,\ q^i = q\right],$
  $w_k^i(s, q) = \max_{\pi \in \Pi} \mathbb{E}\left[\sum_{j=i}^n \int_{\mathcal{E}} \mu_{A_j}(\nu) \rho_j(\nu) \, d\nu \,\middle|\, s^i = s,\ q^i = q,\ A_i = k\right].$

As before, the dependence of the expectation on the policy $\pi$ is through the actions $A_j$. If the second arm is chosen at round $i$, then

  $w_2^i(s, q) = \mu_2 + w^{i+1}(s, q).$

If the first arm is chosen at round $i$, then

  $w_1^i(s, q) = p(s, q) + p(s, q)\, w^{i+1}(s+1, q+1) + (1 - p(s, q))\, w^{i+1}(s, q+1), \quad \text{where } p(s, q) = \frac{\alpha + s}{\alpha + \beta + q}.$

The first term $p(s, q)$ is the expected reward if the first arm is pulled at round $i$. The second and third terms hold because after pulling the first arm, one has $q^{i+1} = q^i + 1$ and

  $\mathbb{P}(s^{i+1} = s^i + 1 \mid s^i) = p(s^i, q^i), \quad \mathbb{P}(s^{i+1} = s^i \mid s^i) = 1 - p(s^i, q^i).$

Since $w^i$ is the optimal reward, one has

  $w^i(s, q) = \max\{w_1^i(s, q),\ w_2^i(s, q)\}.$  (2)

Note that for horizon $n$, $w^{n+1}(s, q) = 0$ for all $(s, q)$ by the definition of $w$. Therefore, one can compute $w^i(s, q)$ for all $s \in \{0, \dots, i-1\}$, $q \in \{0, \dots, i-1\}$ via backward induction. To derive a continuous limit in this setting, we rescale the reward and the number of arm pulls by $1/n$: one can define the rescaled optimal reward function $v^i(\hat{s}, \hat{q}) = \frac{1}{n} w^i(n\hat{s}, n\hat{q})$, which then satisfies the equation

  $v^i(\hat{s}, \hat{q}) = \max\left\{\frac{1}{n}\mu_2 + v^{i+1}(\hat{s}, \hat{q}),\ \frac{1}{n} p(\hat{s}, \hat{q}) + p(\hat{s}, \hat{q})\, v^{i+1}\!\left(\hat{s} + \tfrac{1}{n}, \hat{q} + \tfrac{1}{n}\right) + (1 - p(\hat{s}, \hat{q}))\, v^{i+1}\!\left(\hat{s}, \hat{q} + \tfrac{1}{n}\right)\right\},$

with

  $p(\hat{s}, \hat{q}) = \frac{n^{-1}\alpha + \hat{s}}{n^{-1}(\alpha + \beta) + \hat{q}}.$
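As a concrete illustration, the backward induction (2) can be implemented directly. The sketch below is our own illustration (not code from the paper): it computes $w^1(s, q)$ for the one-armed Bernoulli example with a $\mathrm{Beta}(\alpha, \beta)$ prior and deterministic second-arm reward $\mu_2$.

```python
import numpy as np

def bellman_values(n, alpha=1.0, beta=1.0, mu2=0.5):
    """Backward induction for the one-armed Bernoulli bandit of Section 3.1.

    Returns the array w[s, q] = w^1(s, q), the optimal expected cumulative
    reward over n rounds starting from s successes out of q pulls of arm 1.
    """
    # w_next[s, q] holds w^{i+1}(s, q); at i = n the terminal value w^{n+1} is 0.
    w_next = np.zeros((n + 2, n + 2))
    for i in range(n, 0, -1):
        w = np.zeros((n + 2, n + 2))
        for q in range(i):          # before round i, q^i <= i - 1
            for s in range(q + 1):  # successes cannot exceed pulls
                p = (alpha + s) / (alpha + beta + q)
                # pull arm 1: immediate expected reward p, then s, q update
                w1 = p + p * w_next[s + 1, q + 1] + (1 - p) * w_next[s, q + 1]
                # pull arm 2: deterministic reward mu2, state unchanged
                w2 = mu2 + w_next[s, q]
                w[s, q] = max(w1, w2)  # equation (2)
        w_next = w
    return w_next
```

The cost is $O(n^3)$ in this one-armed case ($n$ rounds times an $O(n^2)$ state space), which already hints at why backward induction becomes intractable for large horizons.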
By setting $\delta t = \delta s = \delta q = n^{-1}$, the above equation can be equivalently written as

  $\frac{v^{i+1}(\hat{s}, \hat{q}) - v^i(\hat{s}, \hat{q})}{\delta t} + \max\left\{\mu_2,\ p(\hat{s}, \hat{q}) + p(\hat{s}, \hat{q}) \frac{v^{i+1}(\hat{s} + \delta s, \hat{q} + \delta q) - v^{i+1}(\hat{s}, \hat{q} + \delta q)}{\delta s} + \frac{v^{i+1}(\hat{s}, \hat{q} + \delta q) - v^{i+1}(\hat{s}, \hat{q})}{\delta q}\right\} = 0.$

From the above equation, one sees that the rescaled value function $v^i(\hat{s}, \hat{q})$ is the numerical discretization of the following PDE:

  $\partial_t v(t, \hat{s}, \hat{q}) + \max\{\mu_2,\ \hat{\mu}(\hat{s}, \hat{q}) + \hat{\mu}(\hat{s}, \hat{q})\, \partial_{\hat{s}} v(t, \hat{s}, \hat{q}) + \partial_{\hat{q}} v(t, \hat{s}, \hat{q})\} = 0, \quad v(1, \hat{s}, \hat{q}) = 0,$  (3)

where $\hat{\mu}(\hat{s}, \hat{q}) = \lim_{n \to \infty} p(\hat{s}, \hat{q})$. Furthermore, by moving the constant $\mu_2$ outside of the maximum operator and introducing a control parameter $\hat{\pi} \in [0, 1]$, one arrives at a Hamilton-Jacobi-Bellman equation for $v$:

  $\partial_t v + \max_{\hat{\pi}(t, \hat{s}, \hat{q}) \in [0, 1]} \left(\hat{\mu} + \hat{\mu}\, \partial_{\hat{s}} v + \partial_{\hat{q}} v - \mu_2\right) \hat{\pi} + \mu_2 = 0, \quad v(1, \hat{s}, \hat{q}) = 0.$  (4)

Equations (3) and (4) are equivalent because when $\hat{\mu} + \hat{\mu}\, \partial_{\hat{s}} v + \partial_{\hat{q}} v > \mu_2$ (resp. $< \mu_2$), the optimal control is $\hat{\pi}^* = 1$ (resp. $\hat{\pi}^* = 0$). In other words, as the horizon $n \to \infty$, the rescaled value function $v^i(\hat{s}, \hat{q})$ satisfies the above HJB equation. We plot the convergence of $\frac{1}{n} w^i(s, q)$ as $n$ increases in Figure 1.

[Figure 1: Log-log plot of the error $|\frac{1}{n} w^i(s, q) - v(t, \hat{s}, \hat{q})|$ against the horizon $n$ (from $10^1$ to $10^3$), showing its decay as $n \to \infty$; the initial hyperparameters $(\alpha, \beta)$ are scaled with $n$.]

Classical results from optimal control (see, e.g., Chapter 10.3.3 of Evans (2010)) imply that $v(t, \hat{s}, \hat{q})$ in (4) solves the control problem

  $\max_{\hat{\pi}(\tau) \in [0, 1]} \int_t^1 \hat{\mu}(\hat{s}(\tau), \hat{q}(\tau))\, \hat{\pi}(\tau) + (1 - \hat{\pi}(\tau))\, \mu_2 \, d\tau$  (5)
  s.t. $d\hat{q}(\tau) = \hat{\pi}(\tau)\, d\tau$, $\quad d\hat{s}(\tau) = \hat{\mu}(\hat{s}(\tau), \hat{q}(\tau))\, \hat{\pi}(\tau)\, d\tau$, $\quad \hat{s}(t) = \hat{s}$, $\hat{q}(t) = \hat{q}$.

The preceding example illustrates that the Bayesian bandit recursion (2) can be viewed as the discretization of an HJB equation (4), which solves the control problem (5). In other words, as the horizon $n \to \infty$, the Bayesian bandit problem converges to a continuous control problem that can be solved via the HJB equation. In the next section, we extend this formulation to a more general setting.
3.2 Formal derivation from Bayesian bandits to the HJB equation

We return to the original $K$-armed bandit setting with horizon $n$ and environments parameterized by $\nu \in \mathcal{E} \subseteq \mathbb{R}^d$. The reward of the $k$-th arm $a_k$ follows the distribution $P_k^\nu$. We assume a prior measure $\rho(\nu)$ on the environment $\nu$ at the beginning of the first round. We also assume that the updated measure $\rho_i(\nu)$ at the beginning of round $i$ depends only on $(s^i, q^i)$. Here $s^i = (s_k^i)_{k=1}^K \in \mathbb{R}^K$ is a $K$-dimensional vector representing the cumulative reward of each arm up to round $i - 1$, and $q^i = (q_k^i)_{k=1}^K \in \{0, \dots, i-1\}^K$ is a $K$-dimensional vector representing the number of pulls of each arm up to round $i - 1$. We note that the assumption that the state space can be reduced to $(s^i, q^i)$ covers many, but not all, bandit algorithms. For instance, many stochastic bandit algorithms, where the arm rewards are drawn from a stationary probability distribution, can be represented using this state space, but the Exp3 algorithm (Auer et al., 2002) cannot. For an extensive discussion, refer to Section 2.1 of Wager and Xu (2021). Under the above assumptions, the posterior distribution of the environment at round $i$ can be written as a function of $(s^i, q^i)$, i.e., $\rho(\nu \mid s^i, q^i)$. If arm $k$ is pulled at round $i$, then the reward $X_i$ follows the distribution $P_k^\nu$, and

  $s_k^{i+1} = s_k^i + X_i, \quad q_k^{i+1} = q_k^i + 1.$  (6)

Let $w^i(s, q)$ be the optimal expected cumulative reward starting from round $i$ with $(s^i, q^i) = (s, q)$. Then it satisfies the following equation:

  $w^i(s, q) = \max_k \left\{\int_{\mathcal{E}} \mu_k(\nu)\, \rho(\nu \mid s, q)\, d\nu + \int_{\mathcal{E}} \int_{\mathbb{R}} w^{i+1}(s + x e_k, q + e_k)\, P_k^\nu(x)\, \rho(\nu \mid s, q)\, dx\, d\nu\right\},$  (7)

where $\mu_k(\nu)$ is the expected reward of the $k$-th arm defined in (1), and $e_k$ is the $K$-dimensional vector with $k$-th element equal to $1$ and all other elements $0$. The first term in the max operator is the expected reward if the $k$-th arm is pulled. The second term is due to the fact that $s_k^{i+1}$ and $q_k^{i+1}$ follow (6) if the $k$-th arm is pulled at round $i$.
Next, we rescale the parameters to derive the continuous-in-time limit. Let

  $t = \frac{i - 1}{n}, \quad \hat{q} = \frac{1}{n} q^i, \quad \hat{s} = \frac{1}{f(n)} s^i, \quad v\!\left(\frac{i-1}{n}, \frac{s}{f(n)}, \frac{q}{n}\right) = \frac{1}{f(n)} w^i(s, q).$  (8)

We shrink the $n$ rounds to the time interval $[0, 1]$, so that as $n \to \infty$, the discrete round $i$ corresponds to a continuous time $t = (i-1)/n \in [0, 1]$. We also rescale the number of pulls to $[0, 1]$, so that it is on the same scale as the rescaled rounds. The cumulative reward $s^i$ and the value $w^i$ are rescaled by $\frac{1}{f(n)}$, where $f(n)$ will be determined later. Accordingly, the rescaled expected reward $v(t, \hat{s}, \hat{q})$ becomes a function of the continuous time $t$ and the rescaled history $(\hat{s}, \hat{q})$. We refer to the function $f(n)$ as the scaling factor. For different scaling factors, the limiting rescaled cumulative reward $v(t, \hat{s}, \hat{q})$ will follow different dynamics. Define the moments of the $k$-th arm w.r.t. the probability measure $P_k^\nu(x)$ and the Bayesian measure $\rho(\nu \mid s, q)$ by

  $\mu_k(s, q) = \int_{\mathcal{E}} \int_{\mathbb{R}} x\, P_k^\nu(x)\, \rho(\nu \mid s, q)\, dx\, d\nu, \quad \sigma_k^2(s, q) = \int_{\mathcal{E}} \int_{\mathbb{R}} x^2\, P_k^\nu(x)\, \rho(\nu \mid s, q)\, dx\, d\nu, \quad E_k^p(s, q) = \int_{\mathcal{E}} \int_{\mathbb{R}} x^p\, P_k^\nu(x)\, \rho(\nu \mid s, q)\, dx\, d\nu.$  (9)

We assume that these moments exist and are finite. Inserting the Taylor expansion

  $w^{i+1}(s + x e_k, q + e_k) = \sum_{p \ge 0} \frac{1}{p!}\, \partial_{s_k}^p w^{i+1}(s, q + e_k)\, x^p$

into (7) yields

  $w^i(s, q) = \max_k \Big\{\mu_k(s, q) + w^{i+1}(s, q + e_k) + \mu_k(s, q)\, \partial_{s_k} w^{i+1}(s, q + e_k) + \frac{1}{2} \sigma_k^2(s, q)\, \partial_{s_k}^2 w^{i+1}(s, q + e_k) + \sum_{p \ge 3} \frac{1}{p!} E_k^p(s, q)\, \partial_{s_k}^p w^{i+1}(s, q + e_k)\Big\}.$

Therefore, the rescaled reward $v\!\left(\frac{i-1}{n}, \frac{s}{f(n)}, \frac{q}{n}\right) = \frac{1}{f(n)} w^i(s, q)$ satisfies

  $v\!\left(\tfrac{i-1}{n}, \tfrac{s}{f(n)}, \tfrac{q}{n}\right) = \max_k \Big\{\frac{1}{f(n)} \mu_k(s, q) + v\!\left(\tfrac{i}{n}, \tfrac{s}{f(n)}, \tfrac{q + e_k}{n}\right) + \frac{1}{f(n)} \mu_k(s, q)\, \partial_{\hat{s}_k} v\!\left(\tfrac{i}{n}, \tfrac{s}{f(n)}, \tfrac{q + e_k}{n}\right) + \frac{1}{2} \frac{1}{f^2(n)} \sigma_k^2(s, q)\, \partial_{\hat{s}_k}^2 v\!\left(\tfrac{i}{n}, \tfrac{s}{f(n)}, \tfrac{q + e_k}{n}\right) + \sum_{p \ge 3} \frac{1}{p!} \frac{1}{f^p(n)} E_k^p(s, q)\, \partial_{\hat{s}_k}^p v\!\left(\tfrac{i}{n}, \tfrac{s}{f(n)}, \tfrac{q + e_k}{n}\right)\Big\}.$

After reorganizing the terms and setting $\delta t = \delta q = \frac{1}{n}$, one obtains

  $0 = \frac{v(t + \delta t, \hat{s}, \hat{q}) - v(t, \hat{s}, \hat{q})}{\delta t} + \max_k \Big\{\frac{1}{\delta t f(n)} \mu_k(f(n)\hat{s}, n\hat{q}) + \frac{v(t + \delta t, \hat{s}, \hat{q} + \delta q\, e_k) - v(t + \delta t, \hat{s}, \hat{q})}{\delta t} + \frac{1}{\delta t f(n)} \mu_k(f(n)\hat{s}, n\hat{q})\, \partial_{\hat{s}_k} v(t + \delta t, \hat{s}, \hat{q} + \delta q\, e_k) + \frac{1}{2} \frac{1}{\delta t f^2(n)} \sigma_k^2(f(n)\hat{s}, n\hat{q})\, \partial_{\hat{s}_k}^2 v(t + \delta t, \hat{s}, \hat{q} + \delta q\, e_k) + \sum_{p \ge 3} \frac{1}{p!} \frac{1}{\delta t f^p(n)} E_k^p(f(n)\hat{s}, n\hat{q})\, \partial_{\hat{s}_k}^p v(t + \delta t, \hat{s}, \hat{q} + \delta q\, e_k)\Big\}.$

If one assumes that there exist functions $\{\hat{\mu}_k(\hat{s}, \hat{q})\}_{k=1}^K$, $\{\hat{\sigma}_k(\hat{s}, \hat{q})\}_{k=1}^K$ such that for all $\hat{s} \in \mathbb{R}^K$, $\hat{q} \in [0, 1]^K$,

  $\lim_{n \to \infty} \frac{n}{f(n)} \mu_k(f(n)\hat{s}, n\hat{q}) = \hat{\mu}_k(\hat{s}, \hat{q}); \quad \lim_{n \to \infty} \frac{n}{f^2(n)} \sigma_k^2(f(n)\hat{s}, n\hat{q}) = \hat{\sigma}_k^2(\hat{s}, \hat{q}); \quad \lim_{n \to \infty} \frac{n}{f^p(n)} E_k^p(f(n)\hat{s}, n\hat{q}) = \hat{E}_k^p \equiv 0 \ \text{for } p \ge 3,$  (10)

then as the horizon $n \to \infty$, i.e., $\delta t, \delta q \to 0$, the rescaled expected cumulative reward $v(t, \hat{s}, \hat{q}) = \frac{1}{f(n)} w^{nt+1}(f(n)\hat{s}, n\hat{q})$ satisfies the PDE

  $\partial_t v + \max_k \left\{\hat{\mu}_k + \partial_{\hat{q}_k} v + \hat{\mu}_k\, \partial_{\hat{s}_k} v + \frac{1}{2} \hat{\sigma}_k^2\, \partial_{\hat{s}_k}^2 v\right\} = 0,$

which can be equivalently written as the following HJB equation:

  $\partial_t v + \max_{\hat{\pi}(t, \hat{s}, \hat{q}) \in \Delta_K} \sum_{k=1}^K \left[\hat{\mu}_k(\hat{s}, \hat{q}) + \partial_{\hat{q}_k} v + \hat{\mu}_k(\hat{s}, \hat{q})\, \partial_{\hat{s}_k} v + \frac{1}{2} \hat{\sigma}_k^2(\hat{s}, \hat{q})\, \partial_{\hat{s}_k}^2 v\right] \hat{\pi}_k = 0.$  (11)

Here we introduce $\hat{\pi}(t, \hat{s}, \hat{q})$ as the feedback control, which corresponds to the policy in the bandit problem. Since the policy is a mapping from the history $\mathcal{H}_i$ to the probability measures $\mathcal{P}(\mathcal{A})$ on the action space, the policy at round $i$ is described by a $K$-dimensional vector-valued function $\pi^i(s^i, q^i)$ that satisfies $\sum_k \pi_k^i(s^i, q^i) = 1$. In the limit, the policy $\hat{\pi}(t, \hat{s}, \hat{q}) = \lim_{n \to \infty} \pi^{nt+1}(f(n)\hat{s}, n\hat{q})$ is a mapping from the rescaled history $(\hat{s}, \hat{q})$ to the simplex $\Delta_K$, satisfying $\sum_k \hat{\pi}_k(t, \hat{s}, \hat{q}) = 1$ for all $(t, \hat{s}, \hat{q})$. We remark briefly that the selection (or even the existence) of $f(n)$ may not be obvious in the general setting described above. In general, $f(n)$ should be thought of as describing the order or asymptotic size of the unscaled rewards in the original bandit problem with $n$ rounds. For instance, if the rewards are of constant size and observed with at least constant probability, then we expect the cumulative reward of the original bandit problem to be linear in the time horizon, and we will have $f(n) = \Theta(n)$.
Rather than describing necessary and sufficient technical conditions relating $f(n)$ to the arm reward distributions $P_k^\nu$ and the posterior $\rho(\nu \mid s, q)$ (which may be intractable given the generality of the framework), in the remainder of the paper we show that, for a wide range of concrete examples, our framework provides useful insights into the problem. See Remark 2 for further discussion of the scaling factor $f(n)$. Note that the solution to (11) is not necessarily differentiable, so we search for a viscosity solution instead of a classical solution (Evans, 2010). One has the following guarantee on the well-posedness of the solution.

Proposition 1 If $\{\hat{\mu}_k(s, q)\}_k$, $\{\hat{\sigma}_k(s, q)\}_k$ are bounded and Lipschitz continuous in $(s, q)$, then the value function defined in (13) is the unique viscosity solution to the HJB equation (11).

See, e.g., Nisio (2015) for the proof. Finding necessary and sufficient conditions on the problem primitives (specifically, the arm reward distributions $P_k^\nu$, the prior $\rho$, and the scaling factor $f(n)$) under which the boundedness and Lipschitz assumptions in Proposition 1 hold is an interesting question for future work. In this paper, our goal is to demonstrate the useful insights which can be derived from our framework for specific bandit problems, once it has been determined that they meet these conditions.

Summary If the rescaled moments of all the arms satisfy (10) for some scaling factor $f(n)$, then the rescaled optimal expected cumulative reward $\frac{1}{f(n)} w^{nt+1}(f(n)\hat{s}, n\hat{q})$ converges to $v(t, \hat{s}, \hat{q})$ as $n \to \infty$, where $v$ satisfies the HJB equation (11) with boundary condition $v(1, \hat{s}, \hat{q}) = 0$. In addition, the rescaled optimal policy $\pi^{*, nt+1}(f(n)\hat{s}, n\hat{q})$ converges to $\hat{\pi}^*(t, \hat{s}, \hat{q})$ as $n \to \infty$, where $\hat{\pi}^*(t, \hat{s}, \hat{q})$ is given by

  $\hat{\pi}_k^*(t, \hat{s}, \hat{q}) = \begin{cases} 1, & k = \operatorname{argmax}_{k'} \left\{\hat{\mu}_{k'}(\hat{s}, \hat{q}) + \partial_{\hat{q}_{k'}} v + \hat{\mu}_{k'}(\hat{s}, \hat{q})\, \partial_{\hat{s}_{k'}} v + \frac{1}{2} \hat{\sigma}_{k'}^2(\hat{s}, \hat{q})\, \partial_{\hat{s}_{k'}}^2 v\right\}, \\ 0, & \text{otherwise}. \end{cases}$  (12)

These results form the foundation for the rest of the paper.
3.3 A formal derivation from Bayesian bandits to the optimal control problems

If one views $v(t, \hat{s}, \hat{q})$ as the optimal cumulative reward starting from time $t$,

  $v(t, \hat{s}, \hat{q}) = \max_{\hat{\pi}(\tau) \in \Delta_K} \mathbb{E}\left[\int_t^1 \hat{\mu}(\hat{s}(\tau), \hat{q}(\tau)) \cdot \hat{\pi}(\tau)\, d\tau\right],$  (13)

then $v(t, \hat{s}, \hat{q})$ is the solution to the optimal control problem (Evans, 2010)

  $\max_{\hat{\pi}(\tau) \in \Delta_K} \mathbb{E}\left[\int_t^1 \hat{\mu}(\hat{s}(\tau), \hat{q}(\tau)) \cdot \hat{\pi}(\tau)\, d\tau\right]$  (14)
  s.t. $d\hat{q}_k(\tau) = \hat{\pi}_k(\tau)\, d\tau$, $\quad 1 \le k \le K$;
  $d\hat{s}_k(\tau) = \hat{\mu}_k(\hat{s}(\tau), \hat{q}(\tau))\, \hat{\pi}_k(\tau)\, d\tau + \hat{\sigma}_k(\hat{s}(\tau), \hat{q}(\tau))\, \sqrt{\hat{\pi}_k(\tau)}\, dB_\tau$, $\quad 1 \le k \le K$;
  $\hat{s}(t) = \hat{s}$, $\hat{q}(t) = \hat{q}$.

Therefore, as the horizon $n \to \infty$, the Bayesian bandit problem also converges to the above continuous control problem. In fact, one can derive the optimal control formulation above directly from the definition of the Bayesian bandit problem. Given a policy $\{\pi^i\}_i$, at each round $i$, the environment $\nu$ is sampled with probability $\rho(\nu \mid s^i, q^i)$, the $k$-th arm is pulled with probability $\pi_k^i$, and the reward of the $k$-th arm follows the distribution $P_k^\nu$. Assume that at the beginning of round $I$, $(s^I, q^I) = (s, q)$. The goal of the Bayesian bandit is to find the optimal policy $\{\pi^{*,i}\}_i$ that maximizes the expected cumulative reward. More precisely, our goal is to find $\{\pi^i\}_i$ that solves the following optimization problem:

  $\max_{\{\pi^i\}_i \subset \Delta_K} c_n(\{\pi^i\}) = \mathbb{E}\left[\sum_{i=I}^n \int_{\mathcal{E}} \mu_{A_i}(\nu)\, \rho(\nu \mid s^i, q^i)\, d\nu\right]$  (15)

where

  $A_i = k$ w.p. $\pi_k^i(s^i, q^i)$,
  $s_k^{i+1} - s_k^i = \begin{cases} 0, & \text{if } A_i \ne k, \\ X_i \sim P_k^\nu,\ \nu \sim \rho(\nu \mid s^i, q^i), & \text{if } A_i = k, \end{cases} \qquad q_k^{i+1} - q_k^i = \begin{cases} 0, & \text{if } A_i \ne k, \\ 1, & \text{if } A_i = k, \end{cases} \qquad (s^I, q^I) = (s, q),$

and $\mu_a(\nu)$ is the expected reward of arm $a$ defined in (1).
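To make the controlled dynamics concrete, the state equations in (14) can be simulated with a plain Euler–Maruyama discretization. The sketch below is our own illustration (the drift, diffusion, and policy functions in the usage are placeholders, not quantities from the paper):

```python
import numpy as np

def simulate(policy, mu_hat, sigma_hat, s0, q0, t0=0.0, T=1.0, m=100, seed=None):
    """Euler-Maruyama discretization of the controlled dynamics in (14):
        dq_k = pi_k dt,   ds_k = mu_k pi_k dt + sigma_k sqrt(pi_k) dB.
    Returns the terminal state and the accumulated objective integral."""
    rng = np.random.default_rng(seed)
    s, q = np.array(s0, float), np.array(q0, float)
    dt = (T - t0) / m
    reward = 0.0
    for j in range(m):
        t = t0 + j * dt
        pi = policy(t, s, q)                   # a point on the simplex Delta_K
        mu, sig = mu_hat(s, q), sigma_hat(s, q)
        reward += float(pi @ mu) * dt          # integrand of the objective in (14)
        dB = rng.normal(0.0, np.sqrt(dt), size=len(pi))
        s = s + mu * pi * dt + sig * np.sqrt(pi) * dB
        q = q + pi * dt
    return s, q, reward

# Placeholder usage: two arms, constant drifts, no diffusion, uniform policy.
uniform = lambda t, s, q: np.array([0.5, 0.5])
const_mu = lambda s, q: np.array([1.0, 0.0])
no_noise = lambda s, q: np.array([0.0, 0.0])
s_end, q_end, obj = simulate(uniform, const_mu, no_noise, [0.0, 0.0], [0.0, 0.0], seed=0)
```

With zero diffusion the dynamics are deterministic, which matches the discussion of the Bernoulli case in Remark 2 below, where the valid scaling forces $\hat\sigma \equiv 0$.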
Using the same rescaling as in (8) and viewing the history $(s^i, q^i)$ at the discrete rounds as a function $(\hat{s}(t), \hat{q}(t))$ of the continuous time $t$, the increments of the rescaled cumulative reward $\hat{s}(t)$ and the rescaled number of pulls $\hat{q}(t)$ after one round become

  $\mathbb{E}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)] = \delta t\, \frac{n}{f(n)}\, \pi_k^i(f(n)\hat{s}, n\hat{q})\, \mu_k(f(n)\hat{s}, n\hat{q}),$
  $\mathbb{V}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)] = \delta t\, \frac{n}{f^2(n)}\, \pi_k^i(f(n)\hat{s}, n\hat{q})\, \sigma_k^2(f(n)\hat{s}, n\hat{q}) - \left(\mathbb{E}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)]\right)^2,$
  $\mathbb{E}[\hat{q}_k(t + \delta t) - \hat{q}_k(t)] = \delta t\, \pi_k^i(f(n)\hat{s}, n\hat{q}),$
  $\mathbb{V}[\hat{q}_k(t + \delta t) - \hat{q}_k(t)] = (\delta t)^2 \pi_k^i(f(n)\hat{s}, n\hat{q}) - (\delta t)^2 \left(\pi_k^i(f(n)\hat{s}, n\hat{q})\right)^2,$  (16)

where $\delta t = \frac{1}{n}$ and $\mu_k, \sigma_k$ are defined in (9). Accordingly, the rescaled objective function becomes

  $\hat{c}_n(\{\pi^i\}) = \frac{1}{f(n)}\, c_n(\{\pi^i\}).$  (17)

Formally, based on (16) and (17), under the assumptions (10) on the moments, the following equations hold:

  $\lim_{n \to \infty} \mathbb{E}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)] / \delta t = \hat{\pi}_k(t, \hat{s}, \hat{q})\, \hat{\mu}_k(\hat{s}, \hat{q}),$
  $\lim_{n \to \infty} \mathbb{V}[\hat{s}_k(t + \delta t) - \hat{s}_k(t)] / \delta t = \hat{\pi}_k(t, \hat{s}, \hat{q})\, \hat{\sigma}_k^2(\hat{s}, \hat{q}),$
  $\lim_{n \to \infty} \mathbb{E}[\hat{q}_k(t + \delta t) - \hat{q}_k(t)] / \delta t = \hat{\pi}_k(t, \hat{s}, \hat{q}),$
  $\lim_{n \to \infty} \mathbb{V}[\hat{q}_k(t + \delta t) - \hat{q}_k(t)] / \delta t = 0,$
  $\lim_{n \to \infty} \hat{c}_n(\{\pi^i\}) = \mathbb{E}\left[\int_t^1 \sum_{k=1}^K \hat{\pi}_k(\tau, \hat{s}(\tau), \hat{q}(\tau))\, \hat{\mu}_k(\hat{s}(\tau), \hat{q}(\tau))\, d\tau\right].$

This implies that the rescaled version of the Bayesian bandit problem (15) converges to the continuous optimal control problem (14).

Remark 2 Assumption (10) is critical for determining which scaling factor $f(n)$ one should use to derive a meaningful limiting HJB equation. Take the Bernoulli reward introduced in Section 3.1 as an example. Since the rewards take values in $\{0, 1\}$, all moments coincide:

  $\mu(s, q) = \sigma^2(s, q) = E^p(s, q) = \frac{\alpha + s}{\alpha + \beta + q},$

so that

  $\hat{\mu}(\hat{s}, \hat{q}) = \lim_{n \to \infty} \frac{n}{f(n)} \cdot \frac{\alpha + f(n)\hat{s}}{\alpha + \beta + n\hat{q}}, \quad \hat{\sigma}^2(\hat{s}, \hat{q}) = \lim_{n \to \infty} \frac{n}{f(n)^2} \cdot \frac{\alpha + f(n)\hat{s}}{\alpha + \beta + n\hat{q}}, \quad \hat{E}^p(\hat{s}, \hat{q}) = \lim_{n \to \infty} \frac{n}{f(n)^p} \cdot \frac{\alpha + f(n)\hat{s}}{\alpha + \beta + n\hat{q}}.$

If the initial hyperparameters $(\alpha, \beta)$ are set such that $\lim_{n \to \infty} \left(\frac{\alpha}{f(n)}, \frac{\alpha + \beta}{n}\right) = (\hat{\alpha}, \hat{\beta})$, then the only scaling factor that induces a meaningful HJB equation is $f(n) = \Theta(n)$. For $f(n) = O(n^b)$ with $b < 1$, $\hat{\mu}$ diverges. For $b > 1$, one has $\hat{\mu} \to 0$, and consequently the limiting HJB equation does not give any useful information about the dynamics.
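The scaling argument in Remark 2 can be checked numerically. The snippet below is our own illustration with hypothetical values $\hat\alpha = 0.2$, $\hat\beta = 1$: choosing $f(n) = n$ and hyperparameters that match $(\hat\alpha, \hat\beta)$ exactly, the rescaled drift equals $(\hat\alpha + \hat s)/(\hat\beta + \hat q)$ while the rescaled diffusion vanishes like $1/n$.

```python
def rescaled_moments(n, shat, qhat, ahat=0.2, bhat=1.0):
    """Rescaled drift and diffusion of the Bernoulli arm from Remark 2."""
    f = n                          # scaling factor f(n) = n
    alpha = ahat * f               # chosen so that alpha / f(n) = ahat exactly
    beta = bhat * n - alpha        # chosen so that (alpha + beta) / n = bhat exactly
    mu = (alpha + f * shat) / (alpha + beta + n * qhat)  # posterior mean mu(s, q)
    drift = (n / f) * mu           # (n / f(n)) * mu(f(n) shat, n qhat)
    diff2 = (n / f**2) * mu        # (n / f(n)^2) * sigma^2; sigma^2 = mu for {0,1} rewards
    return drift, diff2
```

The drift is independent of $n$ in this construction, and the diffusion term is the drift divided by $n$, confirming that the Bernoulli limit is a first-order (deterministic) HJB equation.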
In this particular case, one always ends up with an HJB equation without a diffusion term ($\hat{\sigma} \equiv 0$), and this is consistent with the nature of Bernoulli bandits: $\hat{s}(t)$ is non-decreasing by definition, but when $\hat{\sigma} > 0$, there is a chance that $\hat{s}(t)$ decreases due to the nonzero $dB_\tau$ term. Therefore, any valid scaling factor must induce a deterministic optimal control problem. In the general case, however, depending on the choice of scaling factor $f(n)$, the limiting optimal control problem can be stochastic or deterministic.

3.4 Specialized limiting equations for structured and unstructured bandits

In this section, we derive the general limiting HJB equation for both unstructured and structured bandits. Then, in Section 3.4.1, we derive the continuous limit for some specific bandit problems which are common in the literature.

Unstructured bandits In the unstructured bandit problem, an action $a \in \mathcal{A}$ is completely uninformative about all other actions $b \ne a$. That is, when $a$ is played, the learner gains no information about the reward distributions of the other arms (Chapter 4.3 of Lattimore and Szepesvári (2020)). Here we mainly discuss a specific kind of unstructured bandit problem in which all the arms belong to the same parametric family of distributions with different unknown parameters. For instance, the rewards of each arm may be normally distributed with an unknown mean. One will see that, in this case, the drift and diffusion terms of each arm in the limiting HJB equation have a similar form.
Provided that the initial prior measure is determined by a hyperparameter $\beta$, we claim that under certain conditions, the Bayesian bandit problem converges to the following HJB equation as the horizon $n \to \infty$:

  $\partial_t v + \max_{\hat{\pi}(t, \hat{s}, \hat{q}) \in \Delta_K} \sum_{k=1}^K \left[\hat{\mu}(\hat{s}_k, \hat{q}_k, \hat{\beta}_k) + \partial_{\hat{q}_k} v + \hat{\mu}(\hat{s}_k, \hat{q}_k, \hat{\beta}_k)\, \partial_{\hat{s}_k} v + \frac{1}{2} \hat{\sigma}(\hat{s}_k, \hat{q}_k, \hat{\beta}_k)^2\, \partial_{\hat{s}_k}^2 v\right] \hat{\pi}_k(t, \hat{s}, \hat{q}) = 0, \quad v(1, \hat{s}, \hat{q}) = 0,$  (18)

where the functions $\hat{\mu}$ and $\hat{\sigma}$ depend on the problem setting, and $\hat{\beta}$ is the rescaled hyperparameter. The main difference between the general form (11) and the unstructured version (18) is that the drift and diffusion functions $(\hat{\mu}_k, \hat{\sigma}_k)$ have the same form for each arm. Thus, the difference in drift and diffusion between arms is due only to differences in the arm histories $(\hat{s}_k, \hat{q}_k)$ at time $t$ and in the initial prior measure represented by $\hat{\beta}$.

Structured bandits In a structured bandit problem, the learner can obtain information about other actions even if those actions are never played. That is, playing a particular arm $a \in \mathcal{A}$ may be informative about the reward distributions of other arms $b \ne a$ (Chapter 4.3 of Lattimore and Szepesvári (2020)). Specifically, we consider the following setting. Let the action space $\mathcal{A} = \{a_k\}_{k=1}^K \subset \mathbb{R}^d$ be a set of real vectors, and assume that the reward $x \in \Omega$ of the arm $a_k$ has density $p(x \mid a_k, \nu)$ with an unknown parameter $\nu \in \mathcal{E} \subseteq \mathbb{R}^d$. Then we claim that under certain conditions, the Bayesian bandit problem described above converges to the following HJB equation as the horizon $n \to \infty$:

  $\partial_t v + \max_{\hat{\pi}(t, \hat{s}, \hat{q}) \in \Delta_K} \sum_{k=1}^K \left[\hat{\mu}(\hat{s}, \hat{q}, a_k) + \partial_{\hat{q}_k} v + \hat{\mu}(\hat{s}, \hat{q}, a_k)\, \partial_{\hat{s}_k} v + \frac{1}{2} \hat{\sigma}(\hat{s}, \hat{q}, a_k)^2\, \partial_{\hat{s}_k}^2 v\right] \hat{\pi}_k(t, \hat{s}, \hat{q}) = 0, \quad v(1, \hat{s}, \hat{q}) = 0.$  (19)

Similar to the unstructured case, the main difference between (11) and (19) is that the drift and diffusion terms $(\hat{\mu}_k, \hat{\sigma}_k)$ have the same form for each arm. In the structured case, the two terms depend on the history of all the arms $(\hat{s}, \hat{q})$ and on the position of the arm $a_k$, so the difference between arms is due only to the position $a_k$.
3.4.1 Three common bandits

Here we derive the HJB equation for three common examples: unstructured bandits with Bernoulli and normal arm rewards, and a structured linear bandit with normal rewards.

Unstructured Bernoulli rewards Consider the environment class $\mathcal{E} = [0, 1]^K$ of Bernoulli-type reward distributions with horizon $n$. For an environment $\nu \in \mathcal{E} = [0, 1]^K$, the $k$-th arm yields reward $\gamma(n)$ with probability $\nu_k$ and $-\gamma(n)$ with probability $1 - \nu_k$, where $\nu_k$ is the $k$-th component of $\nu$. We set the initial prior measure for the $k$-th arm to be $\nu_k^1 \sim \mathrm{Beta}(\alpha_k(n), \beta_k(n))$ for $1 \le k \le K$.

Lemma 3 Let $f(n)$ be such that there exist real numbers $(\hat{\alpha}, \hat{\beta}, \hat{\sigma})$ with

  $\lim_{n \to \infty} \frac{\gamma(n)(\alpha_k(n) - \beta_k(n))}{f(n)} = \hat{\alpha}_k, \quad \lim_{n \to \infty} \frac{\alpha_k(n) + \beta_k(n)}{n} = \hat{\beta}_k, \quad \lim_{n \to \infty} \frac{\sqrt{n}\, \gamma(n)}{f(n)} = \hat{\sigma}.$

Then as $n \to \infty$, the rescaled Bayesian bandit problem with scaling factor $f(n)$ converges to the HJB equation (18) with

  $\hat{\mu}(s, q, \hat{\alpha}_k, \hat{\beta}_k) = \frac{\hat{\alpha}_k + s}{\hat{\beta}_k + q}, \quad \hat{\sigma}(s, q, \hat{\alpha}_k, \hat{\beta}_k) \equiv \hat{\sigma}.$

See Appendix A for the proof of the above lemma.

Unstructured normal rewards Consider the environment class $\mathcal{E} = \mathbb{R}^K$ of normally distributed arm rewards with horizon $n$. For an environment $\nu \in \mathcal{E} = \mathbb{R}^K$, the rewards of the $k$-th arm follow the normal distribution $N(\nu_k, \sigma^2(n))$, where $\nu_k$ is the $k$-th component of $\nu$. We set the initial prior measure to be $\nu_k^1 \sim N(\alpha_k(n), \beta_k(n)^2)$ for $1 \le k \le K$.

Lemma 4 Let $f(n)$ be such that there exist real numbers $(\hat{\alpha}, \hat{\beta}, \hat{\sigma})$ with

  $\lim_{n \to \infty} \frac{\sigma^2(n)\, \alpha_k(n)}{f(n)\, \beta_k^2(n)} = \hat{\alpha}_k, \quad \lim_{n \to \infty} \frac{\sigma^2(n)}{\beta_k^2(n)\, n} = \hat{\beta}_k, \quad \lim_{n \to \infty} \frac{\sqrt{n}\, \sigma(n)}{f(n)} = \hat{\sigma}.$

Then as $n \to \infty$, the rescaled Bayesian bandit problem with scaling factor $f(n)$ converges to the HJB equation (18) with

  $\hat{\mu}(s, q, \hat{\alpha}_k, \hat{\beta}_k) = \frac{\hat{\alpha}_k + s}{\hat{\beta}_k + q}, \quad \hat{\sigma}(s, q, \hat{\alpha}_k, \hat{\beta}_k) \equiv \hat{\sigma}.$

See Appendix B for the proof of the above lemma.

Linear bandits with normal rewards Consider the case of stochastic linear bandits with horizon $n$, where the environment is encoded by a vector $\nu \in \mathbb{R}^d$.
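Both lemmas yield drifts of the form $(\hat\alpha_k + \hat s_k)/(\hat\beta_k + \hat q_k)$ because the underlying conjugate posterior means share this structure. As a hedged illustration (these are the standard conjugate-prior updates, not code from the paper), the Gaussian posterior mean can be written so that the correspondence with the Beta-Bernoulli case is visible:

```python
def gaussian_posterior_mean(alpha, beta2, sigma2, s, q):
    """Posterior mean of nu for a N(nu, sigma2) arm with prior nu ~ N(alpha, beta2),
    given q pulls with cumulative reward s. Multiplying the usual precision-weighted
    average through by sigma2 exposes the (alpha' + s)/(beta' + q) form of Lemma 4."""
    return (sigma2 * alpha / beta2 + s) / (sigma2 / beta2 + q)

def beta_posterior_mean(alpha, beta, s, q):
    """Posterior mean of nu for a Bernoulli(nu) arm with prior nu ~ Beta(alpha, beta),
    given q pulls with s successes: (alpha + s) / (alpha + beta + q)."""
    return (alpha + s) / (alpha + beta + q)
```

In both cases the numerator is "prior pseudo-reward plus observed reward" and the denominator is "prior pseudo-count plus observed pulls", which is exactly what survives the rescaling in Lemmas 3 and 4.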
For an environment $\nu \in \mathcal{E} = \mathbb{R}^d$, the reward $X_i$ at round $i$ depends linearly on the chosen action $A_i \in \mathcal{A} \subset \mathbb{R}^d$ in the following sense:

  $X_i = \langle A_i, \nu \rangle + \eta_i,$  (20)

where $(\eta_i)_{i=1}^n$ is a sequence of independent and identically distributed normal random variables $N(0, \sigma^2(n))$ with given $\sigma(n)$. We set the prior measure of $\nu$ to be the normal distribution $N(\alpha(n), \Sigma(n))$.

Lemma 5 Let $f(n)$ be such that there exist real constants $(\hat{\alpha}, \hat{\Sigma}, \hat{\sigma})$ with

  $\lim_{n \to \infty} \frac{\sigma^2(n)}{n}\, \Sigma^{-1}(n) = \hat{\Sigma}^{-1}, \quad \lim_{n \to \infty} \frac{\sigma^2(n)}{f(n)}\, \Sigma^{-1}(n)\, \alpha(n) = \hat{\alpha}, \quad \lim_{n \to \infty} \frac{\sqrt{n}\, \sigma(n)}{f(n)} = \hat{\sigma}.$

Then as $n \to \infty$, the rescaled Bayesian bandit problem with scaling factor $f(n)$ converges to the continuous control problem (19) with

  $\hat{\mu}(s, q, b) = b^\top \left(\hat{\Sigma}^{-1} + \sum_k q_k\, a_k a_k^\top\right)^{-1} \left(\hat{\alpha} + \sum_k s_k\, a_k\right), \quad \hat{\sigma}(s, q) \equiv \hat{\sigma}.$

See Appendix C for the proof of the above lemma.

4. Approximate Bayes-optimal policy

What can one do with the limiting HJB equations or optimal control problems? In Section 4.1, we propose a regularized version of the Bayesian bandit by adding a regularizer term to the objective function of the optimal control limit. In this way, the resulting optimal policy is a stochastic policy instead of a deterministic one, which encourages exploration and makes the solution less sensitive to perturbations. In Section 4.2, we propose an approximate Bayes-optimal policy algorithm based on the solution to the (regularized or unregularized) HJB equation.

4.1 Regularized Bayesian bandits

The optimal policy from the optimal control problem (14) is deterministic and can be sensitive to small perturbations of the problem. To encourage robustness, one can add regularization to the objective function in (14), resulting in a stochastic optimal policy and encouraging more exploration and stability. For example, the entropy regularization $-\sum_k \hat{\pi}_k \log \hat{\pi}_k$ can be added to (14):

  $\max_{\hat{\pi}(\tau) \in \Delta_K} \mathbb{E}\left[\int_t^1 \left(\hat{\mu}(\hat{s}(\tau), \hat{q}(\tau)) - \lambda \log \hat{\pi}(\tau)\right) \cdot \hat{\pi}(\tau)\, d\tau\right]$  (21)
  s.t. $d\hat{q}_k(\tau) = \hat{\pi}_k(\tau)\, d\tau$, $\quad 1 \le k \le K$;
  $d\hat{s}_k(\tau) = \hat{\mu}_k(\hat{s}(\tau), \hat{q}(\tau))\, \hat{\pi}_k(\tau)\, d\tau + \hat{\sigma}_k(\hat{s}(\tau), \hat{q}(\tau))\, \sqrt{\hat{\pi}_k(\tau)}\, dB_\tau$, $\quad 1 \le k \le K$;
  $\hat{s}(t) = \hat{s}$, $\hat{q}(t) = \hat{q}$.
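In the linear case, the drift in Lemma 5 is the posterior mean of $\langle b, \nu \rangle$ under the rescaled Gaussian posterior, which is the familiar ridge-regression form. A minimal numpy sketch (our own illustration, built from the standard Bayesian linear-regression update; the function name is ours):

```python
import numpy as np

def linear_drift(b, s, q, arms, Sigma_inv_hat, alpha_hat):
    """Rescaled drift of Lemma 5:
       mu_hat(s, q, b) = b^T (Sigma_inv_hat + sum_k q_k a_k a_k^T)^{-1}
                             (alpha_hat + sum_k s_k a_k)."""
    A = Sigma_inv_hat + sum(qk * np.outer(ak, ak) for qk, ak in zip(q, arms))
    rhs = alpha_hat + sum(sk * ak for sk, ak in zip(s, arms))
    # Solve the linear system instead of forming the inverse explicitly.
    return float(b @ np.linalg.solve(A, rhs))

# Illustrative usage: two arms along the coordinate axes of R^2.
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
val = linear_drift(np.array([1.0, 0.0]), [3.0, 0.0], [1.0, 0.0],
                   arms, np.eye(2), np.zeros(2))
```

Note that, unlike the unstructured case, the drift of arm $b$ depends on the histories $(s, q)$ of all arms through the shared posterior over $\nu$, which is precisely the structured-bandit information sharing described above (19).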
The resulting HJB equation is (Evans, 2010)

  $\partial_t v + \max_{\hat{\pi} \in \Delta_K} \sum_{k=1}^K \left[\hat{\mu}_k(\hat{s}, \hat{q})\, \partial_{\hat{s}_k} v + \partial_{\hat{q}_k} v + \frac{1}{2} \hat{\sigma}_k^2(\hat{s}, \hat{q})\, \partial_{\hat{s}_k}^2 v + \hat{\mu}_k(\hat{s}, \hat{q}) - \lambda \log \hat{\pi}_k(t, \hat{s}, \hat{q})\right] \hat{\pi}_k(t, \hat{s}, \hat{q}) = 0.$  (22)

The maximum in the above equation can be computed explicitly (Ying and Zhu, 2022). Let

  $H_k(p, m, h) = \hat{\mu}_k(s, q)\, p + m + \frac{1}{2} \hat{\sigma}_k^2(s, q)\, h + \hat{\mu}_k(s, q);$

then the optimal policy is

  $\hat{\pi}_k^*(t, \hat{s}, \hat{q}) = \frac{1}{Z} \exp\left(\frac{1}{\lambda} H_k(\partial_{\hat{s}_k} v, \partial_{\hat{q}_k} v, \partial_{\hat{s}_k}^2 v)\right),$  (23)

with normalizing constant $Z = \sum_k \exp\left(\frac{1}{\lambda} H_k(\partial_{\hat{s}_k} v, \partial_{\hat{q}_k} v, \partial_{\hat{s}_k}^2 v)\right)$. Hence, (22) can be equivalently written as

  $\partial_t v + \lambda \log \sum_{k=1}^K \exp\left(\frac{1}{\lambda} H_k(\partial_{\hat{s}_k} v, \partial_{\hat{q}_k} v, \partial_{\hat{s}_k}^2 v)\right) = 0.$  (24)

There are several potential advantages of the regularized version. First, the resulting optimal policy is always stochastic for $\lambda > 0$, which makes it less sensitive to perturbations than the deterministic optimal policy. Second, regularization encourages more exploration, which helps performance when the initial prior is significantly different from the underlying truth. Third, regularization usually leads to a smoother solution with a differentiable policy and value function, making it easier to approximate the solution numerically. For a more comprehensive introduction to the use of regularization in bandit and reinforcement learning problems, see, e.g., the tutorial by Geist et al. (2019).

4.2 Approximating the Bayes-optimal policy

Based on the limiting equation, if one can obtain the optimal policy for the HJB equation, then one can approximate the optimal Bayesian bandit policy by rescaling $(t, \hat{s}, \hat{q})$ back to $(i, s, q)$. This is summarized by the pseudocode in Algorithm 1.

Algorithm 1 Approximate Bayes-optimal policy
  for i = 1, ..., K do
    Pull arm $A_i = i$; get reward $X_i$; set $s_i \leftarrow X_i$, $q_i \leftarrow 1$
  end for
  for i = K + 1, ..., n do
    Pull $A_i \sim \pi^i(s, q) = \hat{\pi}^*\!\left(\frac{i-1}{n}, \frac{s}{f(n)}, \frac{q}{n}\right)$, with $\hat{\pi}^*(t, \hat{s}, \hat{q})$ given in (12) (unregularized) or (23) (regularized)
    Get reward $X_i$
    $s_{A_i} \leftarrow s_{A_i} + X_i$, $q_{A_i} \leftarrow q_{A_i} + 1$
  end for

5. Solving the limiting HJB equation

One of the difficulties of the Bayesian bandit problem is its large computational cost.
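Algorithm 1 is straightforward to implement once $\hat\pi^*$ is available in closed form. The sketch below is our own illustration, not code from the paper: it instantiates Algorithm 1 for Bernoulli arms with $\mathrm{Beta}(1, 1)$ priors and $f(n) = n$, using the explicit regularized policy derived in Section 5.1, whose softmax exponent reduces to the posterior mean scaled by $1/\lambda$ in this instantiation.

```python
import numpy as np

def softmax_policy(s, q, n, f, ahat, bhat, lam):
    """Regularized approximate Bayes-optimal policy of the form (28):
       pi_k  propto  exp( (n / (lam * f)) * (s_k + f * ahat_k) / (q_k + n * bhat_k) )."""
    z = (n / (lam * f)) * (s + f * ahat) / (q + n * bhat)
    z = z - z.max()                      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def run_bandit(nu, n, lam=0.5, seed=0):
    """Algorithm 1 for Bernoulli(nu_k) arms, Beta(1,1) priors, f(n) = n."""
    rng = np.random.default_rng(seed)
    K = len(nu)
    ahat = np.full(K, 1.0 / n)           # alpha / f(n) with alpha = 1, f(n) = n
    bhat = np.full(K, 2.0 / n)           # (alpha + beta) / n with alpha = beta = 1
    s, q = np.zeros(K), np.zeros(K)
    total = 0.0
    for i in range(n):
        if i < K:                        # initialization loop: pull each arm once
            a = i
        else:
            pi = softmax_policy(s, q, n, n, ahat, bhat, lam)
            a = rng.choice(K, p=pi)
        x = float(rng.random() < nu[a])  # Bernoulli reward
        s[a] += x; q[a] += 1; total += x
    return total
```

Note that the per-round cost is $O(K)$ and independent of the horizon $n$, which is the computational benefit highlighted in the abstract.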
The computational complexity for solving a K-armed bandit with horizon n via backward induction is O(n^{2K}), which is intractable when n or K is large. If one can obtain the exact optimal policy for the limiting HJB equation, then one can use it to approximate the Bayes-optimal policy for the finite-horizon problem with almost no additional computational cost. Section 5.1 shows one of the cases where the exact solution can be obtained. Even if the exact solution cannot be obtained directly, Section 5.2 gives a numerical scheme to approximate the solution. The computational cost of numerically solving the HJB equation is O(N^{2K}), where N depends on the mesh of the scheme. This can be much more efficient than the discrete Bayesian bandit algorithm when n is large and K is small.

5.1 Exact solution

Although solving the HJB equation can also be challenging in general, it turns out that if μ̂_k(ŝ, q̂) = (ŝ_k + α̂_k)/(q̂_k + β̂_k), one can obtain the exact solution. By Lemmas 3 and 4, two common bandit problems are exactly of this form.

Theorem 6  If the drift term in the HJB equation (11) is μ̂_k(ŝ, q̂) = (ŝ_k + α̂_k)/(q̂_k + β̂_k) for constants (α̂_k, β̂_k), then the optimal policy for the unregularized HJB equation (11) is

    π̂*_k(t, ŝ, q̂) = 1 if k = argmax_k (ŝ_k + α̂_k)/(q̂_k + β̂_k), and 0 otherwise;    (25)

and the optimal policy for the regularized HJB equation (22) is

    π̂*_k(t, ŝ, q̂) ∝ exp( (1/λ) (ŝ_k + α̂_k)/(q̂_k + β̂_k) ).    (26)

The proof is given in Appendix D. Based on the above theorem, the approximate optimal policy for the unregularized Bayesian bandit problem is given by

    π^{*,i}_k(s, q) = π̂*_k( (i−1)/n, s/f(n), q/n ) = 1 if k = argmax_k [ s_k/(q_k + nβ̂_k) + f(n)α̂_k/(q_k + nβ̂_k) ], and 0 otherwise,    (27)

and the approximate optimal policy for the regularized Bayesian bandit problem is given by

    π^{*,i}_k(s, q) = π̂*_k( (i−1)/n, s/f(n), q/n ) ∝ exp( (n/(λ f(n))) (s_k + f(n)α̂_k)/(q_k + nβ̂_k) ).    (28)

Furthermore, note that the approximate optimal policy (27) for the unregularized Bayesian bandit is similar in form to UCB: the first term is an approximation to the empirical mean, and the second term measures the degree to which the arm has been explored. The approximate optimal policy (28) for the regularized Bayesian bandit has a form similar to the tempered greedy algorithm, where the term (s_k + f(n)α̂_k)/(q_k + nβ̂_k) is an approximation to the empirical mean and n/(λf(n)) adjusts the exploration rate. When n/(λf(n)) is smaller (i.e., when the regularization constant λ is larger), there is more exploration.

5.2 Numerical solution

In the general case, an exact solution to the HJB equation (11) is not available, so we present a numerical scheme to approximate the solution in this section. In certain cases, the numerical scheme yields the exact optimal policy and value function (see Lemma 7). In general, when the horizon is large, the computational cost of numerically solving the PDE will be much less than that of classical Bayesian bandit algorithms while still yielding a good approximation to the optimal policy.

First, observe that one can directly compute the maximum in the HJB equation (11). Let

    k*(t, ŝ, q̂) = argmax_k { ∂_{q̂_k} v + μ̂_k(ŝ, q̂) ∂_{ŝ_k} v + ½ σ̂²_k(ŝ, q̂) ∂²_{ŝ_k} v + μ̂_k(ŝ, q̂) };

then the optimal policy is π*_k(t, ŝ, q̂) = 1 if k = k*(t, ŝ, q̂) and 0 otherwise, and the optimal value function satisfies

    ∂_t v + ∂_{q̂_{k*}} v + μ̂_{k*}(ŝ, q̂) ∂_{ŝ_{k*}} v + ½ σ̂²_{k*}(ŝ, q̂) ∂²_{ŝ_{k*}} v + μ̂_{k*}(ŝ, q̂) = 0.    (29)

All of the above results also hold in the deterministic case, where σ̂_k ≡ 0 for all k. Based on (29), we present a finite difference method for solving the HJB equation (29).

HJB equation with diffusion  First, consider the case where σ̂ ≢ 0.
We discretize the time interval [0, 1] via the grid points 0 = t_0 < t_1 < ⋯ < t_{N_t} = 1, where t_l = lδt and δt = 1/N_t. We impose a cutoff on the cumulative reward in R^K so that it lies within [−S, S]^K, then discretize this clipped interval into grid points −S ≤ s^i_k ≤ S. Here i ∈ {i ∈ Z^K : −N_s ≤ i_k ≤ N_s} is a K-dimensional index vector, and s^i = iδs with δs = S/N_s. Finally, we discretize the number of pulls to 0 ≤ q^j_k ≤ 1, where j ∈ {j ∈ Z^K : 0 ≤ j_k ≤ N_q} is a K-dimensional index vector, and q^j = jδq with δq = 1/N_q. Observe that at each time t_l, one has 0 ≤ q^j_k ≤ t_l and Σ_{k=1}^K q^j_k = t_l. We will typically set N_q = N_t and δq = δt.

Let v^l_{i,j} be the approximation for v(t_l, s^i, q^j), where i, j ∈ Z^K. The numerical approximation v^l_{i,j} satisfies the following equation:

    (1/δt)( v^{l+1}_{i,j} − v^l_{i,j} ) + max_k { (μ̂^{i,j}_k/(2δs))( v^{l+1}_{i+e_k, j+e_k} − v^{l+1}_{i−e_k, j+e_k} ) + (1/δq)( v^{l+1}_{i,j+e_k} − v^{l+1}_{i,j} ) + ((σ̂^{i,j}_k)²/(2δs²))( v^{l+1}_{i+e_k, j+e_k} − 2v^{l+1}_{i,j+e_k} + v^{l+1}_{i−e_k, j+e_k} ) + μ̂^{i,j}_k } = 0,    (30)

where e_k is the k-th standard basis vector, μ̂^{i,j}_k = μ̂_k(s^i, q^j), and σ̂^{i,j}_k = σ̂_k(s^i, q^j). Since the boundary conditions at the terminal time specify that v^{N_t}_{i,j} = 0, one can use the preceding equations to solve backward in time and compute all of the values v^l_{i,j}. In fact, if one defines g^l_{i,j} as a vector with k-th element given by

    (g^l_{i,j})_k = v^{l+1}_{i,j} + δt [ (μ̂^{i,j}_k/(2δs))( v^{l+1}_{i+e_k, j+e_k} − v^{l+1}_{i−e_k, j+e_k} ) + (1/δq)( v^{l+1}_{i,j+e_k} − v^{l+1}_{i,j} ) + ((σ̂^{i,j}_k)²/(2δs²))( v^{l+1}_{i+e_k, j+e_k} − 2v^{l+1}_{i,j+e_k} + v^{l+1}_{i−e_k, j+e_k} ) + μ̂^{i,j}_k ],    (31)

then one gets an equivalent form for the numerical scheme (30), with the correspondence given by

    k* = argmax_k (g^l_{i,j})_k,    v^l_{i,j} = (g^l_{i,j})_{k*},    (π^l_{i,j})_k = 1 if k = k* and 0 otherwise.    (32)

Note that for this scheme to be numerically stable, δt and δs must satisfy the inequality δt ≤ min_{i,j,k} (σ̂^{i,j}_k)^{−2} δs². Since v^l_{i,j} is only defined on grid points, the continuous approximated solution v̄(t, ŝ, q̂) is defined as

    v̄(t, ŝ, q̂) = v^l_{i,j}  for  lδt ≤ t < (l+1)δt,  i_kδs ≤ ŝ_k < (i_k+1)δs,  j_kδq ≤ q̂_k < (j_k+1)δq.    (33)
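As a concrete sanity check, the explicit backward march (31)-(32) can be coded in a few lines for a single arm (K = 1) with constant coefficients, for which the exact value is v(t, ŝ) = μ̂(1 − t). The drift, diffusion, and grid sizes below are illustrative assumptions, not the paper's settings; with one arm, q = t, so the q-advection term folds into the time march.

```python
import numpy as np

# Minimal sketch of the explicit backward scheme (31)-(32) for K = 1 with
# constant drift mu and diffusion sig (placeholder values).
def solve_hjb(mu=0.5, sig=1.0, S=2.0, Ns=40, Nt=4000):
    dt, ds = 1.0 / Nt, S / Ns
    assert dt <= ds**2 / sig**2        # stability condition on the time step
    s = np.linspace(-S, S, 2 * Ns + 1)
    v = np.zeros_like(s)               # terminal condition v(1, .) = 0
    for _ in range(Nt):                # march backward from t = 1 to t = 0
        g = v.copy()
        g[1:-1] = v[1:-1] + dt * (
            mu * (v[2:] - v[:-2]) / (2 * ds)                         # drift
            + 0.5 * sig**2 * (v[2:] - 2 * v[1:-1] + v[:-2]) / ds**2  # diffusion
            + mu)                                                    # reward
        g[0], g[-1] = g[1], g[-2]      # crude zero-gradient boundary
        v = g
    return s, v
```

For constant μ̂ the computed v(0, ·) reproduces μ̂(1 − 0) = μ̂ on the whole grid, since all spatial differences vanish.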
HJB equation without diffusion  Second, we consider the case where σ̂ ≡ 0. Since there is no longer a diffusion term, one should use an upwind scheme for the transport term. Let μ̂^{i,j}_{k,+} = max(μ̂^{i,j}_k, 0) and μ̂^{i,j}_{k,−} = min(μ̂^{i,j}_k, 0), and set

    (g^l_{i,j})_k = v^{l+1}_{i,j} + δt [ (μ̂^{i,j}_{k,+}/δs)( v^{l+1}_{i+e_k, j+e_k} − v^{l+1}_{i,j+e_k} ) + (μ̂^{i,j}_{k,−}/δs)( v^{l+1}_{i,j+e_k} − v^{l+1}_{i−e_k, j+e_k} ) + (1/δq)( v^{l+1}_{i,j+e_k} − v^{l+1}_{i,j} ) + μ̂^{i,j}_k ],    (34)

    k* = argmax_k (g^l_{i,j})_k,    v^l_{i,j} = (g^l_{i,j})_{k*},    (π^l_{i,j})_k = 1 if k = k* and 0 otherwise.    (35)

In this case, the stability conditions require max_{i,j,k} |μ̂^{i,j}_k| δt ≤ δs.

Connection to the Bayesian bandit algorithm  In certain cases, the numerical schemes (31)-(32) and (34)-(35) give the exact optimal value function for the finite-horizon problem.

Lemma 7  For the Bernoulli bandits introduced in Section 3.1, when the initial hyperparameters are (α_k, β_k) = (c₁n, c₂n) for some constants (c₁, c₂), the numerical scheme (34)-(35) for the limiting HJB equation based on the scaling factor f(n) = n gives the exact optimal value function when δt = δq = δs = 1/n. Similarly, for the binomial bandits described in Section 3.4.1 with γ a constant independent of n, when the initial hyperparameters are (α_k, β_k) = (c₁n + c₂√n, c₁n − c₂√n) for some constants (c₁, c₂), the numerical scheme (31)-(32) for the limiting HJB equation based on the scaling factor f(n) = √n gives the exact optimal value function when δt = δq = 1/n and δs = γ/√n.

See Appendix E for the proof of the above lemma.

6. Numerical experiments

6.1 Convergence to the HJB equation

In this section, we show the convergence of the Bayes-optimal solution to the HJB solution as the horizon goes to infinity.
Namely, we would like to show that the differences in the optimal policy and the rescaled optimal cumulative reward,

    | π^{i,n}(s, q) − π̂*( (i−1)/n, s/f(n), q/n ) |  and  | (1/f(n)) w^{i,n}(s, q) − v( (i−1)/n, s/f(n), q/n ) |,

decay as the horizon n → ∞, where π^{i,n}(s, q), w^{i,n}(s, q) are obtained by backward induction and π̂*(t, ŝ, q̂), v(t, ŝ, q̂) are obtained by solving the corresponding HJB equation. We show the convergence result in Figures 2 and 3. Below are the details of the plots.

Consider the one-armed Bernoulli bandit problem, where the first arm has reward 1 with probability ν and −1 with probability 1 − ν, while the second arm has a deterministic reward μ₂. In this case, given a prior measure ν ∼ Beta(α, β), one can obtain the exact optimal policy and cumulative reward π^{i,n}(s, q) and w^{i,n}(s, q) via the equations

    π^{i,n}_k(s, q) = 1 for k = argmax_k ŵ^{i,n}_k(s, q) and 0 otherwise;    w^{i,n}(s, q) = max_k ŵ^{i,n}_k(s, q),
    ŵ^{i,n}_1(s, q) = p(s, q)( 1 + w^{i+1,n}(s + 1, q + 1) ) + (1 − p(s, q))( −1 + w^{i+1,n}(s − 1, q + 1) ),
    ŵ^{i,n}_2(s, q) = w^{i+1,n}(s, q) + μ₂,

with w^{n+1,n}(s, q) = 0 for all (s, q) and p(s, q) = (α + s/2 + q/2)/(α + β + q).

The limiting HJB equation depends on the scaling factor f(n). By Lemma 3, one arrives at a stochastic optimal control problem if f(n) = √n and a deterministic one if f(n) = n. We compare π^{i,n}, w^{i,n} with the limiting HJB solution for both of these scenarios. In Figures 2 and 3, we set μ₂ = 1/(3√n). The hyperparameters (α, β) for the initial prior measure are set to (n, n − √n), which implies that the limiting HJB equation is

    ∂_t v + max_{π ∈ [0,1]} { ( μ̂₁ ∂_ŝ v + ∂_q̂ v + ½ σ̂² ∂²_ŝ v + μ̂₁ ) π + μ̂₂ (1 − π) } = 0,

with μ̂₁ = (ŝ + 1)/(2 + q̂), μ̂₂ = 1/3, σ̂ = 1 for f(n) = √n, and μ̂₁ = ŝ/(2 + q̂), μ̂₂ = 0, σ̂ = 0 for f(n) = n. By Theorem 6, one can obtain the exact optimal policy for the limiting HJB equation.

In Figure 2, we plot the average differences over i, s, q. That is, we plot

    e^n_π = (1/Z) Σ | π^{i,n}(s, q) − π̂*( (i−1)/n, s/f(n), q/n ) |,    e^n_w = (1/Z) Σ | (1/f(n)) w^{i,n}(s, q) − v( (i−1)/n, s/f(n), q/n ) |,    (36)

where the summation is over i ∈ {1, …, n}, s ∈ {−(i−1), …, i−1}, q ∈ {0, …, i−1}, and Z is the number of terms in the summation.
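The backward induction above can be coded directly; below is a minimal Python sketch, where the horizon, prior hyperparameters, and μ₂ are illustrative choices (not the paper's settings) and the ±1 immediate rewards of arm 1 are written out explicitly:

```python
from functools import lru_cache

def one_armed_bernoulli(n=30, alpha=1.0, beta=1.0, mu2=0.1):
    # Exact backward induction for the one-armed Bernoulli bandit:
    # arm 1 pays +1 w.p. nu and -1 w.p. 1 - nu, with prior nu ~ Beta(alpha, beta);
    # arm 2 deterministically pays mu2. Returns w^{1,n}(0, 0).
    @lru_cache(maxsize=None)
    def w(i, s, q):
        if i > n:
            return 0.0                 # terminal condition w^{n+1,n} = 0
        p = (alpha + s / 2 + q / 2) / (alpha + beta + q)  # posterior P(+1)
        w1 = p * (1 + w(i + 1, s + 1, q + 1)) \
             + (1 - p) * (-1 + w(i + 1, s - 1, q + 1))    # pull arm 1
        w2 = mu2 + w(i + 1, s, q)                         # pull arm 2
        return max(w1, w2)
    return w(1, 0, 0)
```

By optimality, the returned value is at least n·μ₂, the payoff of always pulling the deterministic arm; when μ₂ exceeds the best possible reward of arm 1, pulling arm 2 every round is exactly optimal.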
Next, we test the difference between the Bayes-optimal solution and the numerical solution to the HJB equation presented in Section 5.2. The following averaged differences are plotted in Figure 3:

    e^{n,N}_π = (1/Z) Σ | π^{i,n}(s, q) − π̄^N( (i−1)/n, s/f(n), q/n ) |,    e^{n,N}_w = (1/Z) Σ | (1/f(n)) w^{i,n}(s, q) − v̄^N( (i−1)/n, s/f(n), q/n ) |,    (37)

where v̄^N is obtained according to the scheme (31)-(32) with δt = δq = N^{−1}, δs = N^{−1/2} when f(n) = √n, and according to the scheme (34)-(35) with δt = δq = δs = N^{−1} when f(n) = n. The difference e^{n,N}_w based on f(n) = √n is rescaled by 1/√n so that it is on the same scale as that based on f(n) = n. Due to the different discretizations, ((i−1)/n, s/f(n), q/n) is not necessarily a grid point, so we define the continuous approximation solution v̄(t, ŝ, q̂) as in (33), which implies that

    π̄^N( (i−1)/n, s/f(n), q/n ) = π^l_{m,j},    v̄^N( (i−1)/n, s/f(n), q/n ) = v^l_{m,j},    (38)

with l = ⌊(i−1)/(nδt)⌋, m = ⌊s/(f(n)δs)⌋, j = ⌊q/(nδq)⌋, where ⌊x⌋ is the largest integer less than or equal to x.

Figure 2 shows that the difference e^n_π decays as n increases for both scaling factors. The stochastic limit corresponding to the scaling factor f(n) = √n is closer to the optimal Bayesian solution than the deterministic limit. Figure 3 shows that the difference e^{n,N} decays as n and N increase. Note that e^{n,N} has two components: the model error, which decreases as the horizon n increases, and the numerical error, which decreases as the number of grid points N increases. We can see from Figure 3 that when both the horizon n and the number of grid points N increase, the differences decrease. We observe that when N = 50, the difference e^{n,N}_w in the value functions decreases more slowly or stops decreasing after n reaches some threshold. This is because the numerical error dominates the model error in this regime.

6.2 The performance of the approximate Bayes-optimal policy

We compare the performance of the approximate Bayes-optimal policy (Algorithm 1) with Thompson sampling and UCB in terms of the expected regret.
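Such comparisons estimate the expected regret by averaging the cumulative regret over repeated simulated runs; a minimal Monte Carlo sketch follows, where the environment, the placeholder policy, and the sample counts are illustrative assumptions rather than the paper's experimental settings:

```python
import numpy as np

def expected_regret(policy, means, n=1000, n_sims=100, seed=0):
    # Monte Carlo estimate of the expected cumulative regret of `policy` on a
    # K-armed normal bandit with unit reward variance and mean vector `means`.
    # `policy(s, q)` maps cumulative rewards s and pull counts q to an arm.
    rng = np.random.default_rng(seed)
    K, best = len(means), max(means)
    total = 0.0
    for _ in range(n_sims):
        s, q = np.zeros(K), np.zeros(K)
        for i in range(n):
            k = i if i < K else policy(s, q)   # pull each arm once first
            s[k] += means[k] + rng.standard_normal()
            q[k] += 1
            total += best - means[k]
    return total / n_sims

# Placeholder policy: greedy on the empirical means
greedy = lambda s, q: int(np.argmax(s / q))
```

Any of the three policies compared in this section can be plugged in for `greedy`, as long as it exposes the same (s, q) interface.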
(See Appendix F for the details of Thompson sampling and UCB.) For unstructured bandits, we consider normal arm rewards, in which case the exact policy for the limiting HJB equation can be directly obtained. The performance of the three algorithms is shown in Figure 4. Below are the details of the plots.

Consider the K-armed normal bandit problem with K ≥ 2. Assume that the first arm's reward follows r₁ ∼ N(μ₁, 1), while the k-th arm's reward follows r_k ∼ N(μ, 1) for 2 ≤ k ≤ K; the means are unknown to the learner. We define

    Δ = μ − μ₁    (39)

to be the arm gap.

Figure 2: The plot shows the decay of the difference between the Bayes-optimal solution and the solution to the HJB equation as n increases, i.e., e^n_π and e^n_w defined in (36). Here f(n) is the scaling factor. When f(n) = √n, the resulting limit is a stochastic optimal control problem, while when f(n) = n, the resulting limit is a deterministic one.

Figure 3: The plot shows the decay of the difference between the Bayes-optimal solution and the numerical solution to the HJB equation as n and N increase, i.e., e^{n,N}_π and e^{n,N}_w defined in (37). Here N is the number of grid points when numerically solving the HJB equation, and f(n) is the scaling factor. When f(n) = √n, the resulting limit is a stochastic optimal control problem, while when f(n) = n, the resulting limit is a deterministic one. The curve for N = 50, f(n) = n is indistinguishable from the curve for N = 500, f(n) = n in e^{n,N}_π, so it is not shown on the right plot.

The horizon is set to n = 10³. For the proposed method (Algorithm 1), we set the scaling factor f(n) = √n; that is, the limiting optimal control problem is stochastic. The exact solution to the limiting HJB equation can be obtained by Theorem 6. The initial prior measure for both the Bayes-optimal policy and Thompson sampling is ν_k ∼ N(1/√n, 1/n) for all k.
This implies that the limiting HJB equation is (11) with μ̂_k(ŝ, q̂) = (ŝ_k + 1)/(q̂_k + 1) and σ̂ ≡ 1. In addition, we set δ = n^{−2} for the UCB algorithm. Figure 4 shows the expected regret of the three algorithms for Δ ∈ [−1, 1] and K = 5, 10, 20. The expected regret is averaged over 10³ simulations. We can see from Figure 4 that the overall performance of the approximate Bayes-optimal policy is better than that of the other two algorithms, especially when the prior guess is close to the underlying environment. When Δ approaches 1, UCB is a bit better than the approximate Bayes-optimal policy because the prior guess is significantly different from the underlying truth. However, note that as the number of arms increases, the performance of the two becomes almost the same, even around Δ = 1.

Figure 4: The plot shows the expected regret in the K-armed normal bandit problem for the approximate Bayes-optimal policy, Thompson sampling, and UCB. The arm gap Δ ∈ [−1, 1] is defined in (39). The left, middle, and right plots correspond to K = 5, 10, 20.

For structured bandits, we consider the linear bandits described in Section 3.4.1. Assume there are two arms, and the reward for arm a_i follows a_i ν + η with unknown ν and η ∼ N(0, 1). We set the initial measure for ν to N(0, 1/n) and take the scaling factor f(n) = √n; then μ̂, σ̂ for the limiting HJB equation can be obtained according to Lemma 5. We solve the limiting HJB equation by the numerical scheme (31)-(32) with δt = δq = 1/N, δs = 1/√N, and N = 100. The performance in terms of the expected regret is plotted in Figure 5, where we test three different action positions (a₁, a₂) = (0.1, 0.1), (0.1, 0.2), (0.1, 0.2).

Figure 5 shows that the overall performance of the approximate Bayes-optimal policy is more robust than that of the other two methods. First, the approximate Bayes-optimal policy always outperforms TS. When (a₁, a₂) = (0.1, 0.1), the performance of UCB and the approximate Bayes-optimal policy are similar.
However, for the other two cases, where (a₁, a₂) = (0.1, 0.2), (0.1, 0.2), UCB performs much worse on one side, while the approximate Bayes-optimal policy has more even regret on both sides. If one measures the performance by the worst-case regret or by the regret averaged over the possible environments, the approximate Bayes-optimal policy outperforms the other two.

Figure 5: The plot shows the expected regret in the 2-armed linear bandit problem for the approximate Bayes-optimal policy, Thompson sampling, and UCB. The arm gap of the normal bandits Δ ∈ [−1, 1] is defined in (39). The environment of the linear bandits ν ∈ [−1/2, 1/2] is defined in (20).

We also show the performance of the regularized approximate Bayes-optimal policy in Figure 6. We compare the regularized Bayes-optimal policy with the unregularized version for normal bandits and linear bandits. The settings of the two bandit problems are the same as in Figures 4 and 5, but the initial Bayesian priors of the two bandits are worse (farther from the ground truth). We set ν_k ∼ N(0.01n, 1) for the normal bandits and ν ∼ N(n, 1) for the linear bandits. One can see from Figure 6 that the regularized version performs similarly to or better than the unregularized version when the initial prior measure is bad. We remark that since the solution to the regularized HJB equation is always smooth when λ > 0, it is also potentially easier to break the curse of dimensionality. However, we leave the high-dimensional problem for future study.

7. Discussion

In this paper, we derived a continuous-in-time limit for the Bayesian bandit problem. We showed that the rescaled optimal cumulative reward converges to the solution of an HJB equation. We derive several benefits from the limiting PDE:

A single recipe for many Bayesian bandits.
For most multi-armed bandit problems, the classical Bayesian bandit algorithms yield a formulation that cannot be solved accurately, and different recipes are required to approximate the value function in different settings. On the other hand, the limiting PDE gives a single, unified recipe for obtaining the Bayes-optimal policy.

Figure 6: The plot shows the expected regret of a 5-armed normal bandit problem and a 2-armed linear bandit problem for the regularized approximate Bayes-optimal policy. The environment ν ∈ [−1/2, 1/2] is defined in (20). The left, middle, and right plots correspond to (a₁, a₂) = (0.1, 0.1), (0.1, 0.2), (0.1, 0.2).

For example, one way to solve the one-armed bandit problem with normal distributions using a Bayesian bandit algorithm is to solve the following equation backward:

    W^i(μ, q) = max { W^{i+1}(μ, q) + μ₂,  μ + ∫ (2πσ²_i)^{−1/2} e^{−x²/(2σ²_i)} W^{i+1}(μ + x, q + 1) dx }

for all μ ∈ R and i = 1, …, n, with σ²_i = (q_i + σ^{−2})^{−1}. Since there is no closed-form solution for the integral in the above equation, one must approximate W^i using piecewise quadratic functions (see, e.g., Section 35.3.2 of Lattimore and Szepesvári (2020) for details). This results in a completely different algorithm for solving this problem compared to the Bernoulli reward case. However, if one instead uses the HJB equation to solve the one-armed bandit problem with normally distributed rewards, the same formulation (18) which applies for Bernoulli rewards also applies for normal rewards. The only modification is the different form of μ̂ according to Lemma 4.

Improved efficiency for large n.

The classical Bayesian bandit algorithm requires a computational cost of O(n^{2K}) to calculate the optimal policy, which can be prohibitive for large n. On the other hand, as n → ∞, the Bayesian bandit problem converges to the continuous HJB equation.
The computational cost of solving the HJB equation is independent of the horizon and depends only on the numerical discretization of the PDE, which is O(N^{2K}). When N ≪ n, one obtains huge computational savings by solving for the HJB value function instead. Since the accuracy of the approximate solution is ‖v̄ − v‖ = O(N^{−1}), one retains an accurate approximation of the solution at much lower computational cost.

Improved efficiency for large K.

For the case where the exact solution of the limiting HJB equation can be obtained, as stated in Theorem 6, there is no computational cost for solving the limiting equation. In this case, even if K is large, one can approximate the Bayes-optimal policy efficiently. For the case where the exact solution to the HJB equation cannot be obtained, it may be possible to break the curse of dimensionality numerically by using a nonlinear function approximation, such as a deep neural network. We leave the high-dimensional problem for future study.

One can also extend the current framework from finite arms to infinite arms, for instance, with a continuous action space. The policy π(t, s, q, a) is a probability density function such that ∫ π(t, s, q, a) da = 1 for all t, s, q. The general limiting control problem takes the form

    max_{π̂: ∫_A π̂ da = 1}  E ∫_t^1 ∫_A μ̂(ŝ, q̂, a) π̂(τ, a) da dτ
    s.t.  dq̂(τ, a) = π̂(τ, a) dτ,  a ∈ A;
          dŝ(τ, a) = μ̂(ŝ, q̂, a) π̂(τ, a) dτ + σ̂(ŝ, q̂, a) √(π̂(τ, a)) dB_τ,  a ∈ A;
          ŝ(t, a) = s,  q̂(t, a) = q.

We leave the study of the above case to future research.

Victor F. Araman and René A. Caldentey. Diffusion approximations for a class of sequential experimentation problems. Management Science, 68(8):5958-5979, 2022.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
Dirk Bergemann and Juuso Välimäki. Bandit problems. 2006.
Donald A. Berry and Bert Fristedt.
Bandit Problems: Sequential Allocation of Experiments (Monographs on Statistics and Applied Probability). Springer, 1985.
Russell N. Bradt, S. M. Johnson, and Samuel Karlin. On sequential designs for maximizing the sum of n observations. The Annals of Mathematical Statistics, 27(4):1060-1074, 1956.
Yeon-Koo Che and Johannes Hörner. Recommender systems as mechanisms for social learning. The Quarterly Journal of Economics, 133(2):871-925, 2018.
Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.
Zhengdao Chen, Grant Rotskoff, Joan Bruna, and Eric Vanden-Eijnden. A dynamical central limit theorem for shallow neural networks. Advances in Neural Information Processing Systems, 33:22217-22230, 2020.
Nicole El Karoui, Monique Jeanblanc, and Vincent Lacoste. Optimal portfolio management with American capital guarantee. Journal of Economic Dynamics and Control, 29(3):449-468, 2005.
Lawrence C. Evans. Partial Differential Equations, volume 19. American Mathematical Society, 2010.
Lin Fan and Peter W. Glynn. Diffusion approximations for Thompson sampling. arXiv preprint arXiv:2105.09232, 2021.
Kris Johnson Ferreira, David Simchi-Levi, and He Wang. Online network revenue management using Thompson sampling. Operations Research, 66(6):1586-1602, 2018.
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160-2169. PMLR, 2019.
John C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):148-164, 1979.
Yonatan Gur, Gregory Macnamara, and Daniela Saban. Sequential procurement with contractual and experimental learning. Management Science, 68(4):2714-2731, 2022.
Jiequn Han, Qianxiao Li, et al. A mean-field optimal control formulation of deep learning.
Research in the Mathematical Sciences, 6(1):1-41, 2019.
Michael Kapralov and Rina Panigrahy. Prediction strategies without loss. Advances in Neural Information Processing Systems, 24, 2011.
Vladimir A. Kobzar and Robert V. Kohn. A PDE-based analysis of the symmetric two-armed Bernoulli bandit. arXiv preprint arXiv:2202.05767, 2022.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
Tor Lattimore. Regret analysis of the finite-horizon Gittins index strategy for multi-armed bandits. In Conference on Learning Theory, pages 1214-1245. PMLR, 2016.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
Benoît Leloup and Laurent Deveaux. Dynamic pricing on the internet: Theory and simulations. Electronic Commerce Research, 1(3):265-276, 2001.
Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101-2110. PMLR, 2017.
Qiang Liu. Stein variational gradient descent as gradient flow. Advances in Neural Information Processing Systems, 30, 2017.
Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665-E7671, 2018.
Makiko Nisio. Stochastic Control Theory. ISI Lecture Notes, 9, 2015.
Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
Weijie Su, Stephen Boyd, and Emmanuel Candès.
A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. Advances in Neural Information Processing Systems, 27, 2014.
Ambuj Tewari and Susan A. Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495-517. Springer, 2017.
William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285-294, 1933.
Stefan Wager and Kuang Xu. Diffusion asymptotics for sequential experiments. arXiv preprint arXiv:2101.09855, 2021.
Abraham Wald. Sequential Analysis. Courier Corporation, 2004.
Zhilei Wang and Robert V. Kohn. A new approach to drifting games, based on asymptotically optimal potentials. arXiv preprint arXiv:2207.11405, 2022.
Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688, 2011.
Lexing Ying and Yuhua Zhu. A note on optimization formulations of Markov decision processes. Communications in Mathematical Sciences, 20(3):727-745, 2022.

A Proof of Lemma 3

The expectation of the k-th arm in environment ν is μ_k(ν) = γ(2ν_k − 1). At the start of the i-th round, the posterior distribution of ν_k is uniquely determined by the cumulative reward s^i_k = Σ_{j=1}^{i−1} X_j 1_{A_j=k} and the number of pulls q^i_k = Σ_{j=1}^{i−1} 1_{A_j=k} through round i − 1 according to

    ν^i_k ∼ Beta(α^i_k, β^i_k),    α^i_k = α_k + s^i_k/(2γ) + q^i_k/2,    β^i_k = β_k − s^i_k/(2γ) + q^i_k/2.

Hence the joint posterior ρ^i(ν) at round i is

    ρ(ν | s^i, q^i) = (1/Z) Π_{k=1}^K ν_k^{α^i_k − 1} (1 − ν_k)^{β^i_k − 1},    (A.1)

where Z is a normalizing constant.
This allows us to compute the posterior mean and variance of each arm:

    μ_k(s^i, q^i) = ∫_{[0,1]^K} γ(2ν_k − 1) ρ(ν | s^i, q^i) dν = γ ( 2α^i_k/(α^i_k + β^i_k) − 1 ) = γ (α_k − β_k + s^i_k/γ)/(α_k + β_k + q^i_k),
    σ²_k(s^i, q^i) = ∫_{[0,1]^K} γ² ρ(ν | s^i, q^i) dν = γ²,    (A.2)

and the higher-order moments are

    E^p_k(s^i, q^i) = ∫ [ γ^p ν_k + (−γ)^p (1 − ν_k) ] ρ(ν | s^i, q^i) dν = γ^p if p is even, and γ^p (α_k − β_k + s^i_k/γ)/(α_k + β_k + q^i_k) if p is odd.

By the definition of μ̂, σ̂ in (10), one has

    μ̂_k(ŝ, q̂) = lim_{n→∞} (n/f(n)) μ_k(f(n)ŝ, nq̂) = lim_{n→∞} (n/f(n)) γ (α_k − β_k + f(n)ŝ_k/γ)/(α_k + β_k + nq̂_k),
    σ̂²_k(ŝ, q̂) = lim_{n→∞} (n/f(n)²) γ²,    Ê^p_k = lim_{n→∞} (n/f(n)^p) E^p_k(f(n)ŝ, nq̂).

By the assumptions given in Lemma 3, one ends up with

    μ̂_k(ŝ, q̂) = (α̂_k + ŝ_k)/(β̂_k + q̂_k),    σ̂(ŝ, q̂) ≡ σ̂,    Ê^p_k ≡ 0 for p ≥ 3.

The last equation holds because lim_{n→∞} (√n/f(n)) γ = σ̂ implies lim_{n→∞} (n/f(n)^p) γ^p = lim_{n→∞} σ̂^p n^{1−p/2} = 0 for p ≥ 3.

Different limiting equations due to different scaling factors  Let us consider the case where the reward values γ and −γ are independent of the horizon n. We set γ = 1 and the prior hyperparameters to (α, β) = (n/2, n/2), i.e., the prior measure of ν_k is Beta(n/2, n/2). In this case, one can rescale the cumulative reward s by √n, i.e., take f(n) = √n. By Lemma 3, one has

    μ̂(ŝ, q̂) = ŝ/(1 + q̂),    σ̂(ŝ, q̂) ≡ 1.

One can also rescale the cumulative reward s by n, i.e., take f(n) = n. By Lemma 3, one then has

    μ̂(ŝ, q̂) = ŝ/(1 + q̂),    σ̂(ŝ, q̂) ≡ 0.

In this case, the Bayesian bandit problem converges to a deterministic control problem of the form (18) with

    μ̂(ŝ, q̂) = ŝ/(1 + q̂),    σ̂ ≡ 0.    (A.3)

One can see from the above examples that the same Bayesian bandit problem may converge toward different control problems based on the chosen scaling.

B Proof of Lemma 4

The probability density function for the k-th arm in environment ν is

    P^ν_k(x) = (1/√(2πσ²)) e^{−|x − ν_k|²/(2σ²)}.

(Note that here π denotes the constant 3.14159… rather than the policy.) Thus the expected reward for the k-th arm in environment ν is μ_k(ν) = ν_k.
At the start of the i-th round, the posterior distribution of ν_k is uniquely determined by the cumulative reward s^i_k and the number of pulls q^i_k through round i − 1 according to

    ν^i_k ∼ N(α^i_k, (β^i_k)²),    α^i_k = (αβ^{−2} + s^i_k σ^{−2})/(q^i_k σ^{−2} + β^{−2}),    (β^i_k)² = 1/(q^i_k σ^{−2} + β^{−2}).

Hence the joint posterior ρ^i(ν) at round i is

    ρ^i(ν) = ρ(ν | s^i, q^i) = Π_{k=1}^K (1/(√(2π) β^i_k)) e^{−|ν_k − α^i_k|²/(2(β^i_k)²)}.

From this, it follows that

    μ_k(s^i, q^i) = ∫_{R^K} ν_k ρ^i(ν) dν = α^i_k = (αβ^{−2} + s^i_k σ^{−2})/(q^i_k σ^{−2} + β^{−2}),
    σ²_k(s^i, q^i) = ∫_{R^K} (σ² + ν²_k) ρ^i(ν) dν = σ² + (β^i_k)² + (α^i_k)² = σ² + 1/(q^i_k σ^{−2} + β^{−2}) + (μ_k)².

Inserting μ_k and σ_k into (10) yields

    μ̂_k(ŝ, q̂) = lim_{n→∞} (n/f(n)) (αβ^{−2} + f(n)ŝ_k σ^{−2})/(nq̂_k σ^{−2} + β^{−2}) = lim_{n→∞} ( ŝ_k + σ²α/(f(n)β²) ) / ( q̂_k + σ²/(nβ²) ),
    (σ̂_k(ŝ, q̂))² = lim_{n→∞} [ (n/f(n)²) σ² + n/( f(n)² (nσ^{−2}q̂_k + β^{−2}) ) ] = lim_{n→∞} (n/f(n)²) σ².

By the conditions in Lemma 4, one ends up with

    μ̂_k(ŝ, q̂) = (α̂_k + ŝ_k)/(β̂_k^{−2} + q̂_k),    σ̂(ŝ, q̂) ≡ σ̂.

For the higher-order moments, one can write the moments of the normal distribution in the form

    ∫ x^p P^ν_k(x) dx = Σ_{j} C(p, j) ν_k^{p−2j} σ^{2j},

where C(p, j) is a constant depending on p and j. Then one has

    E^p_k(s^i, q^i) = ∫∫ x^p P^ν_k(x) dx ρ(ν_k) dν_k = Σ_j C(p, j) σ^{2j} ∫ ν_k^{p−2j} ρ(ν_k) dν_k = Σ_{j,l} C(p, j) C(p−2j, l) σ^{2j} (β^i_k)^{2l} (α^i_k)^{p−2j−2l}.

Therefore, after rescaling,

    Ê^p_k(ŝ, q̂) = lim_{n→∞} (n/f(n)^p) Σ_{j,l} C(p, j) C(p−2j, l) σ^{2j} (β^i_k)^{2l} (α^i_k)^{p−2j−2l}.

Since lim_{n→∞} (√n/f(n)) σ = σ̂, (β^i_k)² = O(f(n)²/n²), and α^i_k = O(f(n)/n), each term above is of order n^{1+j−p}. As j ≤ p/2, one has 1 + j − p ≤ 1 − p/2 < 0 for p ≥ 3, and hence Ê^p_k(ŝ, q̂) ≡ 0.

C Proof of Lemma 5

The expectation of the reward at round i in environment ν is ⟨A_i, ν⟩, and the probability density function for the k-th arm is

    P^ν_k(x) = (1/√(2πσ²)) e^{−|x − (a^k)ᵀν|²/(2σ²)}.

Then the posterior distribution of ν at round i is uniquely determined by the cumulative rewards and pull counts (s^i, q^i) up to round i − 1:

    ν^i ∼ N(α^i, Σ^i),    Σ^i = ( Σ^{−1} + σ^{−2} Σ_{k=1}^K q^i_k a^k (a^k)ᵀ )^{−1},    α^i = Σ^i ( Σ^{−1}α + σ^{−2} Σ_{k=1}^K s^i_k a^k ).

Note that a^k ∈ R^d in the above equation denotes the action value of the k-th arm.
Hence the posterior measure of ν^i at round i is

    ρ^i(ν) = (2π)^{−d/2} |Σ^i|^{−1/2} exp( −½ (ν − α^i)ᵀ (Σ^i)^{−1} (ν − α^i) ).

Therefore, one obtains

    μ_k(s^i, q^i) = ∫_{R^d} (a^k)ᵀν ρ^i(ν) dν = (a^k)ᵀ α^i = (a^k)ᵀ ( Σ^{−1} + σ^{−2} Σ_j q^i_j a^j (a^j)ᵀ )^{−1} ( Σ^{−1}α + σ^{−2} Σ_j s^i_j a^j ),
    σ²_k(s^i, q^i) = ∫_{R^d} ( σ² + (a^k)ᵀ ννᵀ a^k ) ρ^i(ν) dν = σ² + (a^k)ᵀ ( Σ^i + α^i(α^i)ᵀ ) a^k = σ² + (a^k)ᵀ ( Σ^{−1} + σ^{−2} Σ_j q^i_j a^j (a^j)ᵀ )^{−1} a^k + (μ_k)².

Inserting μ_k and σ_k into (10) yields

    μ̂_k(ŝ, q̂) = lim_{n→∞} (n/f(n)) (a^k)ᵀ ( Σ^{−1} + nσ^{−2} Σ_j q̂_j a^j (a^j)ᵀ )^{−1} ( Σ^{−1}α + f(n)σ^{−2} Σ_j ŝ_j a^j ),
    (σ̂_k(ŝ, q̂))² = lim_{n→∞} (n/f(n)²) [ σ² + (a^k)ᵀ ( Σ^{−1} + nσ^{−2} Σ_j q̂_j a^j (a^j)ᵀ )^{−1} a^k ].

By the conditions in Lemma 5, one ends up with

    μ̂(ŝ, q̂, b) = bᵀ ( Σ̂^{−1} + Σ_k q̂_k a^k (a^k)ᵀ )^{−1} ( α̂ + Σ_k ŝ_k a^k ),    σ̂(ŝ, q̂) ≡ σ̂.

For the higher-order moments, since

    Σ^i(ŝ, q̂) = (σ²/n) ( σ²n^{−1}Σ^{−1} + Σ_k q̂_k a^k (a^k)ᵀ )^{−1} = O(σ²/n)  and  α^i(ŝ, q̂) = O(f(n)/n),

the higher-order moments satisfy, as in the proof of Lemma 4,

    Ê^p_k(ŝ, q̂) = lim_{n→∞} (n/f(n)^p) Σ_{j,l} O(σ^{2j}) O(‖Σ^i‖^l) O(‖α^i‖^{p−2j−2l}) = lim_{n→∞} O(n^{1+j−p}) = 0,

where the last equality holds because 1 + j − p ≤ 1 − p/2 < 0 for all p ≥ 3 and lim_{n→∞} (√n/f(n)) σ(n) = σ̂.

D Proof of Theorem 6

Consider the optimal control problem (21) with σ̂ ≡ 0. When μ̂_k = (ŝ_k + α̂_k)/(q̂_k + β̂_k), note that

    dμ̂_k/dτ = (1/(q̂_k + β̂_k)) dŝ_k/dτ − ( (ŝ_k + α̂_k)/(q̂_k + β̂_k)² ) dq̂_k/dτ = μ̂_k π̂_k/(q̂_k + β̂_k) − μ̂_k π̂_k/(q̂_k + β̂_k) = 0

for any policy π̂(τ). This implies that μ̂_k(τ) ≡ μ̂_k(t) for all k. Therefore, the objective function becomes

    ∫_t^1 ( μ̂(t) − λ log π̂(τ) ) · π̂(τ) dτ.    (D.1)

Since μ̂(t) is a constant, the above objective function is maximized by π̂* given in (25) for the unregularized version (λ = 0) and in (26) for the regularized version (λ > 0). For the stochastic case, note that

    d E[μ̂_k] = E[ (1/(q̂_k + β̂_k)) dŝ_k − ( (ŝ_k + α̂_k)/(q̂_k + β̂_k)² ) dq̂_k ] = 0 + E[ σ̂_k(ŝ, q̂) √(π̂_k) dB_τ ]/(q̂_k + β̂_k) = 0    (D.2)

for any π̂(τ), so the objective function for the stochastic case is the same as (D.1), which results in the same optimal policy.

E Proof of Lemma 7

Consider the K-armed Bernoulli bandit, where the k-th arm gives reward 1 with probability ν_k and 0 with probability 1 − ν_k. The initial prior measure of ν_k follows Beta(α_k, β_k).
Then the rescaled optimal cumulative reward $v^i(\hat s, \hat q) = \frac{1}{n}w^i(s, q)$ with scaling factor $f(n) = n$ satisfies
\[
v^i(\hat s, \hat q) = \max_k \Big\{\frac{1}{n}p_k(\hat s, \hat q) + p_k(\hat s, \hat q)\,v^{i+1}\big(\hat s + \tfrac{1}{n}e_k, \hat q + \tfrac{1}{n}e_k\big) + \big(1 - p_k(\hat s, \hat q)\big)\,v^{i+1}\big(\hat s, \hat q + \tfrac{1}{n}e_k\big)\Big\}, \tag{E.1}
\]
where
\[
p_k(\hat s, \hat q) = \frac{n^{-1}\alpha_k + \hat s_k}{n^{-1}(\alpha_k + \beta_k) + \hat q_k}.
\]
The corresponding HJB equation under the scaling factor $f(n) = n$ is
\[
\partial_t v + \max_{\hat\pi(t,\hat s,\hat q)} \sum_{k=1}^K \big(\hat\mu_k + \hat\mu_k\,\partial_{\hat s_k}v + \partial_{\hat q_k}v\big)\pi_k = 0, \qquad v(1, \hat s, \hat q) = 0,
\]
where
\[
\hat\mu_k = \frac{\hat\alpha_k + \hat s_k}{\hat\beta_k + \hat q_k}, \qquad \text{with } \hat\alpha_k = \lim_{n\to\infty} n^{-1}\alpha_k, \quad \hat\beta_k = \lim_{n\to\infty} n^{-1}(\alpha_k + \beta_k).
\]
Since $\hat\sigma \equiv 0$, applying the numerical scheme (34)-(35) with $\delta t = \delta q = \delta s = 1/n$ gives
\[
v\big(\tfrac{l}{n}, \hat s, \hat q\big) = \max_k \Big\{\frac{1}{n}\hat\mu_k(\hat s, \hat q) + \hat\mu_k(\hat s, \hat q)\,v\big(\tfrac{l+1}{n}, \hat s + \tfrac{1}{n}e_k, \hat q + \tfrac{1}{n}e_k\big) + \big(1 - \hat\mu_k(\hat s, \hat q)\big)\,v\big(\tfrac{l+1}{n}, \hat s, \hat q + \tfrac{1}{n}e_k\big)\Big\}. \tag{E.2}
\]
By comparing (E.1) and (E.2), one can see that the numerical scheme is equivalent to the exact Bayes-optimal algorithm if and only if $\hat\mu_k(\hat s, \hat q) = p_k(\hat s, \hat q)$. The above condition holds if and only if $(\alpha_k, \beta_k) = (c_1 n, c_2 n)$ for some constants $(c_1, c_2)$, which completes the proof of the first part of Lemma 7.

Next, consider the binomial bandits described in Section 3.4.1. First, the optimal cumulative reward satisfies
\[
w^i(s, q) = \max_k \Big\{\gamma\big(2p_k(s, q) - 1\big) + p_k(s, q)\,w^{i+1}(s + \gamma e_k, q + e_k) + \big(1 - p_k(s, q)\big)\,w^{i+1}(s - \gamma e_k, q + e_k)\Big\},
\]
with
\[
p_k(s, q) = \frac{\alpha_k + s_k/(2\gamma) + q_k/2}{\alpha_k + \beta_k + q_k}.
\]
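For small horizons, the Bayes-optimal recursion underlying (E.1) can be solved exactly by memoized backward induction on the unrescaled counts $(s, q)$; this brute-force computation is precisely what becomes intractable for large $n$ and what the continuous limit avoids. A minimal sketch (the function name is ours):

```python
from functools import lru_cache

def bayes_optimal_value(n, alphas, betas):
    """Exact Bayes-optimal expected cumulative reward for a K-armed
    Bernoulli bandit with Beta(alpha_k, beta_k) priors over n rounds,
    computed by backward induction over success/pull counts."""
    K = len(alphas)

    @lru_cache(maxsize=None)
    def value(i, state):
        # state[k] = (successes s_k, pulls q_k); i = rounds already played
        if i == n:
            return 0.0
        best = 0.0
        for k in range(K):
            s_k, q_k = state[k]
            p_k = (alphas[k] + s_k) / (alphas[k] + betas[k] + q_k)
            win = list(state); win[k] = (s_k + 1, q_k + 1)
            lose = list(state); lose[k] = (s_k, q_k + 1)
            v = p_k * (1.0 + value(i + 1, tuple(win))) \
                + (1.0 - p_k) * value(i + 1, tuple(lose))
            best = max(best, v)
        return best

    return value(0, tuple((0, 0) for _ in range(K)))
```

For a single arm with a uniform Beta(1, 1) prior, the optimal value over $n$ rounds is $n/2$, which the recursion reproduces; the state space grows quickly with $K$ and $n$, reflecting the computational burden discussed in the introduction.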
Then the rescaled optimal cumulative reward $v^i(\hat s, \hat q) = \frac{1}{\sqrt{n}}w^i(s, q)$ with scaling factor $f(n) = \sqrt{n}$ satisfies
\[
v^i(\hat s, \hat q) = \max_k \Big\{n^{-1/2}\gamma\big(2\hat p_k(\hat s, \hat q) - 1\big) + \hat p_k(\hat s, \hat q)\,v^{i+1}\big(\hat s + \gamma n^{-1/2}e_k, \hat q + n^{-1}e_k\big) + \big(1 - \hat p_k(\hat s, \hat q)\big)\,v^{i+1}\big(\hat s - \gamma n^{-1/2}e_k, \hat q + n^{-1}e_k\big)\Big\},
\]
where
\[
2\hat p_k(\hat s, \hat q) - 1 = \frac{1}{\sqrt{n}\,\gamma}\cdot\frac{\gamma(\alpha_k - \beta_k)/\sqrt{n} + \hat s_k}{(\alpha_k + \beta_k)/n + \hat q_k}.
\]
By letting $\mu_k(\hat s, \hat q) = \sqrt{n}\,\gamma\big(2\hat p_k(\hat s, \hat q) - 1\big)$, one has
\[
v^i(\hat s, \hat q) = \max_k \Big\{\frac{1}{n}\mu_k(\hat s, \hat q) + \frac{1}{2}\Big(1 + \frac{\mu_k(\hat s, \hat q)}{\sqrt{n}\,\gamma}\Big)v^{i+1}\big(\hat s + \gamma n^{-1/2}e_k, \hat q + n^{-1}e_k\big) + \frac{1}{2}\Big(1 - \frac{\mu_k(\hat s, \hat q)}{\sqrt{n}\,\gamma}\Big)v^{i+1}\big(\hat s - \gamma n^{-1/2}e_k, \hat q + n^{-1}e_k\big)\Big\}. \tag{E.3}
\]
On the other hand, the limiting HJB equation for $v^i(\hat s, \hat q)$ is
\[
\partial_t v + \max_{\hat\pi(t,\hat s,\hat q)} \sum_{k=1}^K \Big(\hat\mu_k + \hat\mu_k\,\partial_{\hat s_k}v + \partial_{\hat q_k}v + \frac{1}{2}\hat\sigma^2\,\partial^2_{\hat s_k}v\Big)\pi_k = 0, \qquad v(1, \hat s, \hat q) = 0,
\]
with
\[
\hat\mu_k(\hat s, \hat q) = \frac{\hat\alpha_k + \hat s_k}{\hat\beta_k + \hat q_k}, \qquad \hat\sigma = \gamma, \qquad
\hat\alpha_k = \lim_{n\to\infty} \frac{\gamma(\alpha_k - \beta_k)}{\sqrt{n}}, \quad \hat\beta_k = \lim_{n\to\infty} \frac{\alpha_k + \beta_k}{n}.
\]
By comparing (E.3) with the numerical scheme (34)-(35) applied to this HJB equation, one can see that the scheme is equivalent to the exact Bayes-optimal algorithm if and only if $\hat\mu_k(\hat s, \hat q) = \mu_k(\hat s, \hat q)$. The above condition holds if and only if $(\alpha_k, \beta_k) = (c_1 n + c_2\sqrt{n},\; c_1 n - c_2\sqrt{n})$ for some constants $(c_1, c_2)$, which completes the proof of the second part of Lemma 7.

F Detailed algorithms

Algorithm 2 Thompson Sampling for unstructured bandits
Require: Input: $n$, $(s, q) = 0$, prior $\rho(\nu|\beta)$
for $i = 1, \dots, K$ do
  $A_i \leftarrow i$, $s_i \leftarrow X_i$, $q_i \leftarrow 1$
end for
for $i = K + 1, \dots, n$ do
  Update $\rho(\nu|s, q)$ according to Bayes' rule
  Sample $\nu^i$ according to the probability distribution $\rho(\nu|s, q)$
  $A_i \leftarrow \arg\max_k \mu_k(\nu^i)$, $s_{A_i} \leftarrow s_{A_i} + X_i$, $q_{A_i} \leftarrow q_{A_i} + 1$
end for

Algorithm 3 UCB for unstructured bandits
Require: Input: $n$, $(s, q) = 0$, $\delta$
for $i = 1, \dots, K$ do
  $A_i \leftarrow i$, $s_i \leftarrow X_i$, $q_i \leftarrow 1$
end for
for $i = K + 1, \dots, n$ do
  Update $\hat\mu_k \leftarrow s_k/q_k + \sqrt{2\log(1/\delta)/q_k}$
  $A_i \leftarrow \arg\max_k \hat\mu_k$, $s_{A_i} \leftarrow s_{A_i} + X_i$, $q_{A_i} \leftarrow q_{A_i} + 1$
end for

Algorithm 4 Thompson Sampling for linear bandits
Require: Input: $n$, $a^k$, $(s, q) = 0$, $\rho(\nu) = \mathcal{N}(\mu, \Sigma)$
for $i = 1, \dots, K$ do
  $A_i \leftarrow i$, $s_i \leftarrow X_i$, $q_i \leftarrow 1$
end for
for $i = K + 1, \dots$
$, n$ do
  Sample $\nu$ according to the probability distribution $\rho(\nu)$
  $A_i \leftarrow \arg\max_k (a^k)^\top\nu$
  $x \leftarrow a^{A_i}$, $y \leftarrow \nu^\top x + \eta$, where $\eta \sim \mathcal{N}(0, \sigma^2)$
  $\mu \leftarrow (\Sigma^{-1} + \sigma^{-2}xx^\top)^{-1}(\Sigma^{-1}\mu + \sigma^{-2}yx)$, $\Sigma \leftarrow (\Sigma^{-1} + \sigma^{-2}xx^\top)^{-1}$
end for

Algorithm 5 UCB for linear bandits
Require: Input: $V = v_0$, $W = 0$, $\hat\nu = 0$, $\lambda = 0.1$
for $i = 1, \dots, K$ do
  $A_i \leftarrow i$, $s_i \leftarrow X_i$, $q_i \leftarrow 1$
end for
for $i = K + 1, \dots, n$ do
  $\beta \leftarrow \sqrt{2\log(n^2) + \log(1 + (i-1)/\lambda)}$
  $A_i \leftarrow \arg\max_k (a^k)^\top\hat\nu + \beta\sqrt{(a^k)^\top V^{-1}a^k}$
  $x \leftarrow a^{A_i}$, $y \leftarrow \nu^\top x + \eta$, where $\eta \sim \mathcal{N}(0, \sigma^2)$
  $V \leftarrow V + xx^\top$, $W \leftarrow W + yx$, $\hat\nu \leftarrow V^{-1}W$
end for
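For completeness, the ridge-regression UCB baseline of Algorithm 5 can be sketched in Python. This is an illustrative sketch only: we use a fixed bonus coefficient `beta` in place of the round-dependent one, and the function and parameter names are ours.

```python
import numpy as np

def linucb(A, reward, n, lam=0.1, beta=2.0):
    """UCB for the linear bandit in the spirit of Algorithm 5: maintain
    V = lam*I + sum x x^T and W = sum y x, estimate nu_hat = V^{-1} W,
    and pull the arm maximizing a^T nu_hat + beta * sqrt(a^T V^{-1} a).
    A: (K, d) array of arm vectors; reward(x): noisy observation of
    nu^T x. Returns the total reward collected over n rounds."""
    K, d = A.shape
    V = lam * np.eye(d)
    W = np.zeros(d)
    total = 0.0
    for i in range(n):
        if i < K:
            k = i                              # pull each arm once first
        else:
            Vinv = np.linalg.inv(V)
            nu_hat = Vinv @ W
            bonus = np.sqrt(np.einsum('kd,de,ke->k', A, Vinv, A))
            k = int(np.argmax(A @ nu_hat + beta * bonus))
        x = A[k]
        y = reward(x)
        V += np.outer(x, x)                    # V <- V + x x^T
        W += y * x                             # W <- W + y x
        total += y
    return total
```

On a noiseless two-arm instance with $\nu = (1, 0)$ and arms $e_1, e_2$, the optimistic index quickly concentrates on the rewarding arm, so nearly all of the $n$ pulls go to $e_1$.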