# Parallelizing Thompson Sampling

Amin Karbasi (Yale University, amin.karbasi@yale.edu), Vahab Mirrokni (Google Research, mirrokni@google.com), Mohammad Shadravan (Yale University, mohammad.shadravan@yale.edu)

Abstract

How can we make use of information parallelism in online decision making problems while efficiently balancing the exploration-exploitation trade-off? In this paper, we introduce a batch Thompson Sampling framework for two canonical online decision making problems, namely, the stochastic multi-armed bandit and the linear contextual bandit with finitely many arms. Over a time horizon T, our batch Thompson Sampling policy achieves the same (asymptotic) regret bound as a fully sequential one while carrying out only O(log T) batch queries. To achieve this exponential reduction, i.e., reducing the number of interactions from T to O(log T), our batch policy dynamically determines the duration of each batch in order to balance the exploration-exploitation trade-off. We also demonstrate experimentally that dynamic batch allocation dramatically outperforms natural baselines such as static batch allocations.

1 Introduction

Many problems in machine learning and artificial intelligence are sequential in nature and require making decisions over a long period of time and under uncertainty. Examples include A/B testing [Graepel et al., 2010], hyper-parameter tuning [Kandasamy et al., 2018], adaptive experimental design [Berry and Fristedt, 1985], ad placement [Schwartz et al., 2017], clinical trials [Villar et al., 2015], and recommender systems [Kawale et al., 2015], to name a few. Bandit problems provide a simple yet expressive view of sequential decision making under uncertainty. In such problems, a repeated game is played between a learner and the environment: at each round the learner selects an action, a so-called arm, and then the environment reveals the reward. The goal of the learner is to maximize the accumulated reward over a horizon T. The main challenge faced by the learner is that the environment is unknown, and thus the learner has to follow a policy that strikes an efficient trade-off between exploration (i.e., trying new actions) and exploitation (i.e., choosing among the known actions). A common way to measure the performance of a policy is through regret, a game-theoretic notion defined as the difference between the reward accumulated by the policy and that of the best fixed action in hindsight. We say that a policy has no regret if its regret grows sub-linearly as a function of T. There has been a large body of work aiming to develop no-regret policies for a wide range of bandit problems (for a comprehensive overview, see [Lattimore and Szepesvári, 2020, Bubeck and Cesa-Bianchi, 2012, Slivkins, 2019]). However, almost all existing policies are fully sequential in nature, meaning that once an action is executed the reward is immediately observed by the learner and can be incorporated to make subsequent decisions. In practice, however, it is often preferable (and sometimes the only option) to explore many actions in parallel, a so-called batch of actions, in order to gain more information about the environment in a timely fashion. For instance, in clinical trials, a phase of medical treatment is often carried out on a group of individuals and the results are gathered for the entire group at the end of the phase.
Based on the collected information, the treatment for the subsequent phases is devised [Perchet et al., 2016]. Similarly, in a marketing campaign, the response to a line of products is not collected in a fully sequential manner; instead, a batch of products is mailed to a subset of customers and their feedback is gathered collectively [Schwartz et al., 2017]. Note that developing a no-regret policy is impossible without any information exchange about the carried-out actions and obtained rewards. Thus, the main challenge in developing a batch policy is to balance how many actions to run in parallel (i.e., the batch size) against how frequently to share information (i.e., the number of batches). At one end of the spectrum lie the fully sequential no-regret bandit policies, where the batch size is 1 and the number of batches is T. At the other end of the spectrum lie the fully parallel policies, where the batch size is T and all the actions are determined a priori without any information exchange (such policies clearly suffer a linear regret). In this paper, we investigate the sweet spot between the batch size and the corresponding regret in the context of Thompson Sampling (TS). More precisely:

- For the stochastic N-armed bandit, we develop Batch Thompson Sampling (B-TS), a batch version of the vanilla Thompson Sampling policy, that achieves the problem-dependent asymptotically optimal regret with O(N log T) batches. With the same number of batches, B-TS also achieves the problem-independent regret bound of $O(\sqrt{NT \log T})$ with Beta priors, and a slightly improved bound of $O(\sqrt{NT \log N})$ with Gaussian priors.
- For the stochastic N-armed bandit, we develop Batch Minimax Optimal Thompson Sampling (B-MOTS), a batch Thompson Sampling policy that achieves the optimal minimax problem-independent regret bound of $O(\sqrt{NT})$ with O(N log T) batches. We also present B-MOTS-J, a variant of B-MOTS designed for Gaussian rewards, which achieves both minimax and asymptotic optimality with O(N log T) batches.
- Finally, for the linear contextual bandit with N arms, we develop Batch Thompson Sampling for Contextual Bandits (B-TS-C), which achieves the problem-independent regret bound of $O(d^{3/2}\sqrt{T})$ with O(N log T) batches.

The main idea that allows our batch policies to achieve near-optimal regret guarantees while reducing the number of sequential interactions with the environment from T to O(log T) is a novel dynamic batching mechanism that determines the duration of each batch based on an offline estimate of the regret accumulated during that phase. We also observe empirically that batch Thompson Sampling methods with a fixed batch size, but the same number of batches, incur higher regret.

2 Related Work

In this paper, we mainly focus on Thompson Sampling (also known as posterior sampling and probability matching), the earliest principled way of managing the exploration-exploitation trade-off in sequential decision making problems [Thompson, 1933, Russo et al., 2017]. There has been a recent surge in understanding the theoretical guarantees of Thompson Sampling due to its strong empirical performance and simple implementation [Chapelle and Li, 2011]. In particular, for the stochastic multi-armed bandit problem, Agrawal and Goyal [2012] proved a problem-dependent logarithmic bound on the expected regret of Thompson Sampling, which was later shown to be asymptotically optimal [Kaufmann et al., 2012].
Subsequently, Agrawal and Goyal [2017] provided a problem-independent (i.e., worst-case) bound of $O(\sqrt{NT \log T})$ on the expected regret when using Beta priors. Interestingly, the expected regret can be improved to $O(\sqrt{NT \log N})$ by using Gaussian priors. Very recently, Jin et al. [2020] developed Minimax Optimal Thompson Sampling (MOTS), a variant of Thompson Sampling that achieves the minimax optimal regret of $O(\sqrt{NT})$. Agrawal and Goyal [2013b] also extended the analysis of multi-armed Thompson Sampling to the linear contextual setting and proved a regret bound of $O(d^{3/2}\sqrt{T})$, where d is the dimension of the context vectors. In this paper, we develop the first variants of Batch Thompson Sampling that achieve the aforementioned regret bounds (problem-dependent and problem-independent versions) while reducing the sequential interaction with the environment from T to O(N log T), thus increasing the efficiency of running Thompson Sampling by an exponential factor (for a fixed N). There has been a large body of work and numerous algorithms for regret minimization in multi-armed bandit problems, including upper confidence bound (UCB), ϵ-greedy, and explore-then-commit, among many others. We refer the interested reader to recent surveys for more details [Lattimore and Szepesvári, 2020, Slivkins, 2019]. The closest line of work to our paper is the batch UCB algorithm [Gao et al., 2019], for which Esfandiari et al. [2021a] showed an asymptotically optimal regret bound with O(log T) batches. Very recently, Esfandiari et al. [2021a] and Ruan et al. [2021] also addressed batch linear bandits and batch linear contextual bandits, respectively. Our work extends those results to the case of Thompson Sampling for the stochastic multi-armed bandit as well as the linear contextual bandit problems. As we have highlighted in our proofs, our work builds on previous art, especially Agrawal and Goyal [2012, 2017, 2013b] (we believe that giving due credit to previous work is a virtue, not a vice). However, we build on it in a non-trivial way. As is clear from their analysis (and more generally for randomized probability matching strategies), breaking the sequential nature of distribution updates is non-trivial. We show that, by a careful batch-mode strategy, one can reduce the number of sequential updates from T to O(log T). We are unaware of any previous work that obtains such a result for Thompson Sampling. In contrast, UCB strategies are much more amenable to parallelization (and the analysis is simple), as one can simply use the arm elimination method proposed by Esfandiari et al. [2021a] and Gu et al. [2021]. There is no clear way to use the arm elimination strategy for batch TS. Moreover, batch TS clearly outperforms fully sequential UCB in all of our empirical results. The benefits of batch-mode optimization have been considered in other machine learning settings, including convex optimization [Balkanski and Singer, 2018b, Chen et al., 2020], reinforcement learning [Zhang et al., 2020], submodular optimization [Chen et al., 2019, Fahrbach et al., 2019, Balkanski and Singer, 2018a], Gaussian processes [Desautels et al., 2014, Kathuria et al., 2016, Contal et al., 2013], stochastic sequential optimization [Esfandiari et al., 2021b, Agarwal et al., 2019, Chen and Krause, 2013], and Bayesian optimization [Wang et al., 2018, Rolland et al., 2018], to name a few.
Finally, we would also like to mention a concurrent work by Kalkanli and Ozgur [2021] on the very same subject as our paper, also accepted to NeurIPS 2021. Clearly, there is a significant overlap between the two works. For instance, both papers obtain the problem-dependent asymptotically optimal regret with O(log T) batches. However, there are also a few differences. In particular, Kalkanli and Ozgur [2021] provide a tighter problem-dependent bound on the expected number of batches for the stochastic multi-armed bandit. In contrast, we generalize the batch Thompson Sampling strategy to the linear contextual setting.

3 Preliminaries and Problem Formulation

As stated earlier, a standard bandit problem is a repeated sequential game between a learner and the environment where, at each round $t = 1, 2, \ldots, T$, the learner selects an action $a(t)$ from the set of actions $A$ and then the environment reveals the reward $r_{a(t)} \in \mathbb{R}$. Different structures on the set of actions and rewards define different bandit problems. In this paper, we mainly consider two canonical variants, namely, the stochastic multi-armed bandit and the stochastic linear contextual bandit.

Stochastic Multi-Armed Bandit. In this setting, the set of actions $A$ is finite, namely, $A = [N]$, and each action $a \in [N]$ is associated with a sub-Gaussian distribution $P_a$ (e.g., a Bernoulli distribution, distributions supported on [0, 1], etc.). When the player selects an action a, a reward $r_a$ is sampled independently from $P_a$. We denote by $\mu_a = \mathbb{E}_{r_a \sim P_a}[r_a]$ the average reward of an action a and by $\mu^* = \max_{a \in A} \mu_a$ the maximum average reward. Suppose the player selects actions $a(1), \ldots, a(T)$ and receives the stochastic rewards $r_{a(1)}, \ldots, r_{a(T)}$. Then the (expected) regret is defined as

$R(T) = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} r_{a(t)}\right].$

We say that a policy achieves no regret if $\mathbb{E}[R(T)]/T \to 0$ as the horizon T tends to infinity. In order to compare the regret of algorithms, there are multiple choices in the literature. Once we fully specify the horizon T, the class of the bandit problem (e.g., multi-armed bandit with N arms), and the specific instance we encounter within the class (e.g., $\mu_1, \ldots, \mu_N$ in the stochastic multi-armed problem), we can consider problem-dependent regret bounds for each specific instance. In contrast, problem-independent bounds (also called worst-case bounds) depend only on the horizon T and the class of bandits for which the algorithm is designed (which is the number of arms N in the stochastic multi-armed bandit problem), and not on the specific instance within that class. (There is a related notion of regret, called Bayesian regret, considered in the Thompson Sampling literature [Russo and Van Roy, 2014, Bubeck and Liu, 2013], where a known prior on the environment is assumed. The frequentist regret bounds considered in this paper immediately imply a regret bound on the Bayesian regret, but the opposite is not generally possible [Lattimore and Szepesvári, 2020].) For the problem-dependent regret bound, it is known that UCB-like algorithms [Auer, 2002, Garivier and Cappé, 2011, Maillard et al., 2011] and Thompson Sampling [Agrawal and Goyal, 2013a, Kaufmann et al., 2012] achieve the asymptotic regret of $O(\log T \sum_{a:\Delta_a>0} 1/\Delta_a)$, where $\Delta_a = \mu^* - \mu_a > 0$. It is also known that no algorithm can achieve a better asymptotic regret bound [Lai and Robbins, 1985], implying that UCB and TS are both asymptotically optimal. In contrast, for the stochastic multi-armed bandit, UCB achieves the minimax problem-independent regret bound of $\sqrt{NT}$ [Auer, 2002], whereas TS (with Beta priors) achieves a slightly worse regret of $\sqrt{NT \log T}$ [Agrawal and Goyal, 2017]. Very recently, Jin et al. [2020] developed Minimax Optimal Thompson Sampling (MOTS), which achieves the minimax optimal regret of $O(\sqrt{NT})$.
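To make the interaction protocol and the regret definition above concrete, the following is a minimal simulation sketch (ours, not the paper's code): a Bernoulli N-armed environment together with a fully sequential Thompson Sampling learner using Beta(1, 1) priors, whose realized regret is measured against the best fixed arm. All function and variable names are illustrative.

```python
import numpy as np

def thompson_sampling_regret(means, T, seed=0):
    """Run fully sequential Thompson Sampling with Beta(1, 1) priors on a
    Bernoulli bandit with the given arm means; return the realized regret
    R(T) = T * mu_star - (accumulated reward)."""
    rng = np.random.default_rng(seed)
    N = len(means)
    alpha = np.ones(N)   # Beta posterior parameters (successes + 1)
    beta = np.ones(N)    # Beta posterior parameters (failures + 1)
    total_reward = 0.0
    for t in range(T):
        theta = rng.beta(alpha, beta)          # one posterior sample per arm
        a = int(np.argmax(theta))              # play the arm with the largest sample
        r = float(rng.random() < means[a])     # Bernoulli reward
        alpha[a] += r
        beta[a] += 1.0 - r
        total_reward += r
    return T * max(means) - total_reward

# Example: 10 arms, horizon 10^4
print(thompson_sampling_regret(means=np.linspace(0.1, 0.9, 10), T=10_000))
```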
Contextual Linear Bandit. Contextual linear bandits generalise the multi-armed setting by allowing the learner to make use of side information. More specifically, each arm a is associated with a feature/context vector $b_a \in \mathbb{R}^d$. At the beginning of each round $t \in [T]$, the learner first observes the contexts $b_a(t)$ for all $a \in A$, and then she chooses an action $a(t) \in A$. We assume that a feature vector $b_a$ affects the reward in a linear fashion, namely, $r_a(t) = \langle b_a(t), \mu \rangle + \eta_{a,t}$. Here, the parameter $\mu$ is unknown to the learner, and $\eta_{a,t}$ is an independent zero-mean sub-Gaussian noise given all the actions and rewards up to time t. Therefore, $\mathbb{E}[r_a(t) \mid b_a(t)] = \langle b_a(t), \mu \rangle$. The learner is trying to guess the correlation between $\mu$ and the contexts $b_a(t)$. For the sequence of actions $a(1), \ldots, a(T)$, the regret is defined as

$R(T) = \sum_{t=1}^{T} r_{a^*(t)}(t) - \sum_{t=1}^{T} r_{a(t)}(t)$, where $a^*(t) = \arg\max_{a} \langle b_a(t), \mu \rangle$.

The context vectors at time t are generally chosen by an adversary after observing the actions played and the rewards received up to time t−1. In order to obtain scale-free regret bounds, it is commonly assumed that $\|\mu\|_2 \le 1$ and $\|b_a(t)\|_2 \le 1$ for all arms $a \in A$. By applying UCB to the linear bandit, it is possible to achieve $R(T) = O(d\sqrt{T})$ with high probability [Auer, 2002, Dani et al., 2008, Rusmevichientong and Tsitsiklis, 2010, Abbasi-Yadkori et al., 2011]. In contrast, Agrawal and Goyal [2013b] showed that the regret of Thompson Sampling can be bounded by $O(d^{3/2}\sqrt{T})$.

Batch Bandit. The focus of this paper is to parallelize sequential decision making. In contrast to the fully sequential setting, where the learner selects an action and immediately receives the reward, in the batch-mode setting the learner selects a batch of actions and receives the rewards of all of them simultaneously (or only after the last action is executed). More formally, let the history $H_t$ consist of all the actions and rewards up to time t, namely, $\{a(s)\}_{s \in [t-1]}$ and $\{r_{a(s)}(s)\}_{s \in [t-1]}$, respectively. We also denote the observed set of contexts up to and including time t by $C_t = \{b_a(s)\}_{a \in A, s \in [t]}$. Note that in the multi-armed bandit problem $C_t = \emptyset$. A fully sequential policy $\pi$ at round $t \in [T]$ maps the history and contexts to an action, namely, $\pi_t : H_t \times C_t \to A$. In contrast, a batch policy $\pi$ only interacts with the environment at rounds $0 = t_0 < t_1 < t_2 < \cdots < t_m = T$. The l-th batch, of duration $t_l - t_{l-1}$, contains the time units $\{t_{l-1}+1, t_{l-1}+2, \ldots, t_l\}$, which we denote by the shorthand $(t_{l-1}, t_l]$. To select the actions in the l-th batch, the policy is only allowed to use the history of actions/rewards observed in the previous batches, in addition to the contexts received so far. Therefore, a batch policy at time $t \in (t_{l-1}, t_l]$ is the map $\pi_t : H_{t_{l-1}} \times C_t \to A$. Moreover, a batch policy with a predetermined fixed batch size is called static, and one with a dynamic batch size is called dynamic.
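The batch protocol just described can be summarized by a short skeleton (an illustrative sketch, not the paper's code): the `policy` object with `select`/`close_batch`/`update` methods is an assumed interface; the key point is that rewards are revealed, and the policy state updated, only when a batch is closed.

```python
import numpy as np

class BernoulliEnv:
    """Stochastic environment: pulling arm a returns a Bernoulli(means[a]) reward."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        return float(self.rng.random() < self.means[a])

def run_batch_policy(policy, env, T):
    """Generic batch-bandit loop: actions may be computed at every round, but
    rewards are revealed (and the policy updated) only when the policy closes
    the current batch."""
    pending = []                                  # actions awaiting feedback
    for t in range(1, T + 1):
        a = policy.select(t)                      # may use only previously revealed rewards
        pending.append(a)
        if policy.close_batch(t):                 # dynamic rule, e.g. the doubling trick of Algorithm 1
            rewards = [env.pull(a) for a in pending]
            policy.update(pending, rewards)       # one posterior update per batch
            pending = []
    if pending:                                   # flush the last, possibly unfinished batch
        policy.update(pending, [env.pull(a) for a in pending])
```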
4 Batch Thompson Sampling for Stochastic Multi-armed Bandit

In the classic Thompson Sampling (TS), at any time $t \in [T]$, we consider a prior distribution $D_a(t)$ on the underlying parameters of the reward distribution of every arm $a \in [N]$. TS works by first sampling $\theta_a(t) \sim D_a(t)$, independently for each $a \in [N]$, and then choosing the arm with the highest value, namely, $a_t = \arg\max_{a \in [N]} \theta_a(t)$. Once the action $a_t$ is played, we receive the reward $r_t$, based on which the prior distributions are updated as follows. If an arm a is not selected, its distribution does not change, i.e., $D_a(t+1) = D_a(t)$. However, if $a = a_t$, then we update $D_a(t+1)$ given the information $(a_t, r_t)$ using the Bayes rule. By instantiating TS with different prior distributions (e.g., Beta, Gaussian) for which the Bayes update is simple to compute, it is possible to show that one can achieve an asymptotically optimal regret [Agrawal and Goyal, 2012, 2017].

Algorithm 1: Batch Thompson Sampling
    Initialize: k_a ← 0 and l_a ← 0 for all a ∈ [N]; batch ← ∅
    for t = 1, 2, ..., T do
        sample θ_a(t) ∼ D_a(t) for all a ∈ [N]
        a(t) := argmax_{a ∈ [N]} θ_a(t)
        k_{a(t)} ← k_{a(t)} + 1
        if k_{a(t)} < 2^{l_{a(t)}} then
            batch ← batch ∪ {a(t)}
        else
            l_{a(t)} ← l_{a(t)} + 1
            Query(batch) and receive rewards
            update D_a(t) for all a ∈ batch
            batch ← ∅
        end if
    end for

Algorithm 2: Batch TS for Contextual Bandits
    Initialize: k_a ← 0 and l_a ← 0 for all a ∈ [N]; batch ← ∅; B = I_d; µ̂ = 0_d
    for t = 1, 2, ..., T do
        sample µ(t) ∼ N(µ̂(B(t)), v² B(B(t))⁻¹)
        a(t) := argmax_a b_a(t)ᵀ µ(t)
        k_{a(t)} ← k_{a(t)} + 1
        if k_{a(t)} < 2^{l_{a(t)}} then
            batch ← batch ∪ {a(t)}
        else
            l_{a(t)} ← l_{a(t)} + 1
            Query(batch) and receive rewards
            update µ̂
            batch ← ∅
        end if
    end for

The main idea behind Batch Thompson Sampling (B-TS), outlined in Algorithm 1, is as follows. For each arm $a \in [N]$, B-TS keeps track of $k_a$, the number of times arm a has been selected so far. Initially, all $k_a$'s are set to 1. For each arm a and at the beginning of a batch, necessarily $2^{l_a - 1} \le k_a < 2^{l_a}$ for some integer $l_a \ge 1$. Now consider a new batch that starts at time t. Within this batch, B-TS samples arms according to the prior distributions at time t−1, namely $[D_a(t-1)]_{a \in [N]}$, and selects the one with the highest value. B-TS keeps selecting arms until, for one of the arms, say a, the count reaches $k_a = 2^{l_a}$. At this point, B-TS queries all the arms selected during this batch. Based on the received rewards, B-TS updates $\{D_a\}_{a \in [N]}$ and starts a new batch. We should note that the doubling trick used in Algorithm 1 is reminiscent of the halving trick used in the pure exploration setting [Jun et al., 2021, Fiez et al., 2019].

Regret Bounds with Beta Priors. For ease of presentation, we first consider the Bernoulli multi-armed bandit where $r_a \in \{0, 1\}$ and $\mu_a = \Pr[r_a = 1]$. In this setting, we can instantiate TS with Beta priors as follows. TS assumes an independent Beta-distributed prior, with parameters $(\alpha_a, \beta_a)$, over each $\mu_a$. Due to the nice conjugacy property of Beta distributions, it is very easy to update the posterior distribution given the observations. In particular, the Bayes update can be performed as follows:

$(\alpha_a, \beta_a) \leftarrow (\alpha_a, \beta_a)$ if $a(t) \neq a$, and $(\alpha_a, \beta_a) \leftarrow (\alpha_a, \beta_a) + (r(t), 1 - r(t))$ if $a(t) = a$.

TS initially sets $\alpha_a = \beta_a = 1$ for all arms $a \in [N]$, which corresponds to the uniform distribution over [0, 1]. The update rule of B-TS in Algorithm 1 is very similar. Let $B(t)$ be the last time $t' \le t-1$ at which B-TS carried out a batch. Moreover, for each arm a, let $S_a(t)$ be the number of instances where arm a was selected by time t−1 and $r_a = 1$, and let $F_a(t)$ be the number of instances where arm a was selected by time t−1 and $r_a = 0$. We also denote by $k_a(t) = S_a(t) + F_a(t)$ the total number of instances arm a was selected by time t−1. Initially, B-TS starts with the uniform distribution over [0, 1], i.e., $D_a(1) = \mathrm{Beta}(1, 1)$ for all $a \in [N]$. Inspired by the update rule of TS, at any time t, B-TS sets the distribution $D_a(t)$ to $\mathrm{Beta}(S_a(B(t)) + 1, F_a(B(t)) + 1)$. Note that during a batch, while arms are being selected, the distributions $\{D_a\}_{a \in [N]}$ do not change; the updates only take place once the batch is carried out and the rewards are observed.
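Putting Algorithm 1 and the Beta-prior updates together, here is a compact sketch for the Bernoulli case (a hedged illustration with our own variable names; we include the arm that triggers the doubling condition in the batch it closes, which is one natural reading of Algorithm 1).

```python
import numpy as np

def batch_ts_bernoulli(means, T, seed=0):
    """Sketch of B-TS (Algorithm 1) with Beta(1, 1) priors on a Bernoulli bandit.
    Posterior parameters are refreshed only at batch boundaries; a batch is closed
    as soon as some arm's pull count reaches the next power of two."""
    rng = np.random.default_rng(seed)
    N = len(means)
    k = np.zeros(N, dtype=int)        # pulls of each arm (including the current batch)
    l = np.zeros(N, dtype=int)        # doubling counters
    S = np.zeros(N)                   # successes observed in *completed* batches
    F = np.zeros(N)                   # failures observed in *completed* batches
    batch, n_batches, total_reward = [], 0, 0.0
    for t in range(1, T + 1):
        theta = rng.beta(S + 1, F + 1)            # sample from D_a, frozen within a batch
        a = int(np.argmax(theta))
        k[a] += 1
        batch.append(a)
        if k[a] >= 2 ** l[a]:                     # doubling condition: close the batch
            l[a] += 1
            rewards = [float(rng.random() < means[b]) for b in batch]
            for b, r in zip(batch, rewards):      # Bayes update, once per batch
                S[b] += r
                F[b] += 1.0 - r
            total_reward += sum(rewards)
            batch, n_batches = [], n_batches + 1
    if batch:                                     # rewards of the unfinished last batch
        total_reward += sum(float(rng.random() < means[b]) for b in batch)
    return T * max(means) - total_reward, n_batches

regret, n_batches = batch_ts_bernoulli(means=np.linspace(0.1, 0.9, 10), T=10_000)
print(regret, n_batches)   # the number of batches grows as O(N log T)
```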
First, we bound the number of batch queries.

Theorem 4.1. The total number of batches carried out by B-TS is at most O(N log T).

The proof is given in Appendix A.2.

Remark 4.2. One might be tempted to show a sublinear dependency on N. However, simple empirical results show that the number of batches carried out by B-TS indeed scales logarithmically in T but linearly in N. Please see Figures 1a and 1b for more details.

Figure 1: (a) and (b) show the number of batch queries versus the number of arms and the horizon, respectively, for B-TS with Beta and Gaussian priors. We consider a synthetic Bernoulli setting where the horizon is set to T = 10^3 and the number of arms varies from N = 1 to N = 100. We report the average over 100 experiments. As we clearly see in Figure 1a, the number of batch queries increases linearly in N, which rules out the possibility that the number of batches of B-TS may depend sub-linearly on N. For Figure 1b, we also consider the Bernoulli setting, set N = 10, and vary the horizon from T = 1 to T = 10^4. Again, as our theory suggests, the number of batch queries increases logarithmically in T.

What is more challenging to show is that this simple batch strategy achieves the same asymptotic regret as a fully sequential one.

Theorem 4.3. Without loss of generality, assume that the first arm has the highest mean value, i.e., $\mu^* = \mu_1$. Then, the expected regret of B-TS, outlined in Algorithm 1 and instantiated with Beta priors, can be bounded as

$R(T) = (1 + \epsilon)\, O\!\left(\sum_{a \neq 1} \frac{\ln T}{d(\mu_a, \mu_1)}\, \Delta_a\right)$,

where $d(\mu_a, \mu_1) := \mu_a \log\frac{\mu_a}{\mu_1} + (1 - \mu_a)\log\frac{1 - \mu_a}{1 - \mu_1}$ and $\Delta_a = \mu_1 - \mu_a$.

The complete proof is given in Appendix A.2.

Remark 4.4. Even though we only provided the details for the Bernoulli setting, B-TS can easily be extended to general reward distributions supported on [0, 1]. To do so, once a reward $r_t \in [0, 1]$ is observed, we flip a coin with bias $r_t$ and update the Beta distribution according to the outcome of the coin. It is easy to see that Theorem 4.3 holds for this extension as well.

Remark 4.5. Gao et al. [2019] proved that for the B-batched N-armed bandit problem with time horizon T, it is necessary to have $B = \Omega(\log T / \log\log T)$ batches to achieve the problem-dependent asymptotically optimal regret. This lower bound implies that B-TS uses almost the minimum number of batches needed (i.e., O(log T) versus $\Omega(\log T / \log\log T)$) to achieve the optimal regret.

Now we present the problem-independent regret bound for B-TS.

Theorem 4.6. Batch Thompson Sampling, outlined in Algorithm 1 and instantiated with Beta priors, achieves $R(T) = O(\sqrt{NT \ln T})$ with O(N log T) batch queries.

The proof is given in Appendix A.3.

Regret Bounds with Gaussian Priors. In order to obtain a better regret bound, we can instantiate TS with Gaussian distributions. To do so, let us define the empirical mean estimator for each arm a as follows:

$\hat{\mu}_a(t) = \frac{\sum_{\tau=1}^{t-1} r_a(\tau)\, \mathbb{I}(a(\tau) = a)}{k_a(t) + 1}$,

where $k_a(t)$ denotes the number of instances that arm $a \in [N]$ has been selected up to time t−1 and $\mathbb{I}(\cdot)$ is the indicator function. Then, assuming that the prior distribution of arm a is $\mathcal{N}(\hat{\mu}_a(t), \frac{1}{k_a(t)+1})$ and that the likelihood of $r_{a_t}$ given $\mu_a$ is $\mathcal{N}(\mu_a, 1)$, the posterior is also a Gaussian distribution with parameters $\mathcal{N}(\hat{\mu}_a(t+1), \frac{1}{k_a(t+1)+1})$. In B-TS, we need to slightly change the way we estimate $\hat{\mu}_a(t)$, as the algorithm only has access to the information received in the previous batches. Recall that B(t) denotes the last time $t' \le t-1$ at which B-TS carried out a batch query. For each arm $a \in [N]$, we assume the prior distribution $D_a(t) = \mathcal{N}(\hat{\mu}_a(B(t)), \frac{1}{k_a(B(t))+1})$. We also update the empirical mean estimator as follows:

$\hat{\mu}_a(t) = \frac{\sum_{\tau=1}^{B(t)} r_a(\tau)\, \mathbb{I}(a(\tau) = a)}{k_a(B(t)+1) + 1}$.   (1)

Note that at any time t during the l-th batch, i.e., $t \in (t_{l-1}, t_l]$, the distribution $D_a(t)$ remains unchanged. Once the arms $\{a_t\}_{t \in (t_{l-1}, t_l]}$ are carried out and the rewards $\{r_t\}_{t \in (t_{l-1}, t_l]}$ are observed, $\hat{\mu}_a$ changes and the posterior is computed accordingly, namely, $D_a(t_l + 1) = \mathcal{N}(\hat{\mu}_a(t_l + 1), \frac{1}{k_a(t_l+1)+1})$. As we instantiate B-TS with Gaussian priors, the regret bound slightly improves.
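For the Gaussian-prior instantiation, only the sampling distribution and the batch-level update change relative to the Beta sketch above. The following sketch (illustrative, assuming the estimator in (1)) keeps per-arm reward sums as observed at the last batch boundary and samples from the corresponding Gaussian.

```python
import numpy as np

class BatchGaussianTS:
    """Sketch of B-TS with Gaussian priors: the sampling distribution is frozen at
    the last batch boundary and refreshed only when a batch of rewards arrives."""
    def __init__(self, n_arms, seed=0):
        self.rng = np.random.default_rng(seed)
        self.k_snap = np.zeros(n_arms)       # pulls observed in completed batches
        self.sum_snap = np.zeros(n_arms)     # reward sums observed in completed batches

    def select(self):
        mu_hat = self.sum_snap / (self.k_snap + 1.0)                 # estimator (1)
        theta = self.rng.normal(mu_hat, 1.0 / np.sqrt(self.k_snap + 1.0))
        return int(np.argmax(theta))

    def end_of_batch_update(self, arms, rewards):
        """Called once per batch with the arms pulled in the batch and their rewards."""
        for a, r in zip(arms, rewards):
            self.k_snap[a] += 1
            self.sum_snap[a] += r
```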
Theorem 4.7. Batch Thompson Sampling, outlined in Algorithm 1 and instantiated with Gaussian priors, achieves $\mathbb{E}[R(T)] = O(\sqrt{NT \ln N})$ with O(N log T) batch queries.

The proof is given in Appendix A.4.

5 Batch Minimax Optimal Thompson Sampling

So far, we have considered the parallelization of vanilla Thompson Sampling, which does not achieve the optimal minimax regret. In this section, we introduce Batch Minimax Optimal Thompson Sampling (B-MOTS), which achieves the optimal minimax bound of $O(\sqrt{NT})$, as well as the asymptotically optimal regret bound for Gaussian rewards. In contrast to the fully sequential MOTS developed by Jin et al. [2020], B-MOTS requires only O(N log T) batches.

Algorithm 3: Batch Minimax Optimal Thompson Sampling (B-MOTS)
    Initialize: k_a ← 0 and l_a ← 0 for all a ∈ [N]; batch ← ∅
    Play each arm a once and initialize D_a(t)
    for t = N + 1, ..., T do
        for all arms a ∈ [N], sample θ̃_a(t) ∼ N(µ̂_a(B(t)), 1/(ρ k_a(B(t)))) and set θ_a(t) = min{θ̃_a(t), τ_a(t)}
        a(t) := argmax_a θ_a(t)
        k_{a(t)} ← k_{a(t)} + 1
        if k_{a(t)} < 2^{l_{a(t)}} then
            batch ← batch ∪ {a(t)}
        else
            l_{a(t)} ← l_{a(t)} + 1
            Query(batch) and observe rewards
            update D_a(t) for all a ∈ [N]
            batch ← ∅
        end if
    end for

The crucial difference between B-MOTS and B-TS is that, instead of choosing Gaussian or Beta distributions, B-MOTS uses a clipped Gaussian distribution. To run B-MOTS, we need to slightly change the way $D_a(t)$ is updated. First, to initialize $D_a(t)$, B-MOTS plays each arm once in the beginning and sets $k_a(N+1)$ to 1 and $\hat{\mu}_a(N+1)$ to the observed reward of each arm $a \in [N]$. To determine $D_a(t)$ for the subsequent batches, let us first define a confidence range $(-\infty, \tau_a(t))$ for each arm $a \in [N]$ as follows:

$\tau_a(t) = \hat{\mu}_a(B(t)) + \sqrt{\frac{\alpha}{k_a(B(t))}\, \log^{+}\!\left(\frac{T}{N k_a(B(t))}\right)}$,   (2)

where $\log^{+}(x) = \max\{0, \log(x)\}$ and the empirical mean of each arm a is estimated as

$\hat{\mu}_a(t) = \frac{\sum_{\tau=1}^{B(t)} r_a(\tau)\, \mathbb{I}(a(\tau) = a)}{k_a(B(t)+1)}$.   (3)

Note that the estimators in (3) and (1) slightly differ due to the initialization step of B-MOTS. For each arm $a \in [N]$, B-MOTS first samples $\tilde{\theta}_a(t)$ from a Gaussian distribution, $\tilde{\theta}_a(t) \sim \mathcal{N}(\hat{\mu}_a(B(t)), 1/(\rho\, k_a(B(t))))$, where $\rho \in (1/2, 1)$ is a tuning parameter. Then, the sample is clipped by the confidence range as follows:

$\theta_a(t) = \min\{\tilde{\theta}_a(t), \tau_a(t)\}$.   (4)

The rest is exactly as in Algorithm 1 for B-TS; the full outline of the B-MOTS algorithm is given in Algorithm 3 (see also Appendix B for details).
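The B-MOTS sampling step, i.e., the clipped Gaussian in (2)-(4), can be sketched as follows (illustrative, not the authors' code; the per-arm statistics are assumed to be the batch-boundary snapshots $\hat{\mu}_a(B(t))$ and $k_a(B(t))$, and the defaults ρ = 0.9999 and α = 2 follow the experimental settings reported in Section 7).

```python
import numpy as np

def bmots_sample(mu_hat, k, T, rho=0.9999, alpha=2.0, seed=0):
    """One B-MOTS sampling step: draw theta_tilde ~ N(mu_hat, 1/(rho*k)) per arm
    and clip it at the confidence limit tau = mu_hat + sqrt(alpha/k * log+(T/(N*k)))."""
    rng = np.random.default_rng(seed)
    mu_hat = np.asarray(mu_hat, dtype=float)    # \hat{mu}_a(B(t)) for each arm
    k = np.asarray(k, dtype=float)              # k_a(B(t)) for each arm (>= 1 after warm-up)
    N = len(mu_hat)
    log_plus = np.maximum(0.0, np.log(T / (N * k)))
    tau = mu_hat + np.sqrt(alpha / k * log_plus)            # confidence limit, eq. (2)
    theta_tilde = rng.normal(mu_hat, 1.0 / np.sqrt(rho * k))
    theta = np.minimum(theta_tilde, tau)                    # clipping step, eq. (4)
    return int(np.argmax(theta))                            # arm played at round t

# Example with 3 arms after the warm-up phase (values are illustrative)
print(bmots_sample(mu_hat=[0.40, 0.50, 0.45], k=[8, 16, 4], T=10_000))
```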
B-MOTS for Sub-Gaussian Rewards. We state the regret bounds in the most general form, i.e., when the rewards follow a sub-Gaussian distribution. As a reminder, a random variable X is σ-sub-Gaussian if $\mathbb{E}[\exp(\lambda X - \lambda \mathbb{E}[X])] \le \exp(\sigma^2 \lambda^2 / 2)$ for all $\lambda \in \mathbb{R}$. The following theorem shows that B-MOTS is minimax optimal.

Theorem 5.1. If the reward of each arm is 1-sub-Gaussian, then the regret of B-MOTS is bounded by $R(T) = O(\sqrt{NT} + \sum_{a:\Delta_a>0} \Delta_a)$. Moreover, the number of batches is bounded by O(N log T).

The proof is given in Appendix B.2. The next theorem presents the asymptotic regret bound of B-MOTS for sub-Gaussian rewards.

Theorem 5.2. Assume that the reward of each arm $a \in [N]$ is 1-sub-Gaussian with mean $\mu_a$. For any fixed $\rho \in (1/2, 1)$, the regret of B-MOTS can be bounded as $R(T) = O\!\left(\log T \sum_{a:\Delta_a>0} \frac{1}{\rho\, \Delta_a}\right)$.

The proof is given in Appendix B.3. The asymptotic regret rate of B-MOTS matches the existing lower bound $\log T \sum_{a:\Delta_a>0} 1/\Delta_a$ [Lai and Robbins, 1985] up to a multiplicative factor $1/\rho$. Therefore, similar to the analysis of the fully sequential setting [Jin et al., 2020], B-MOTS reaches the exact lower bound only at the cost of minimax optimality. Next, we show that, at least in the Gaussian reward setting, minimax and asymptotic optimality can be achieved simultaneously.

B-MOTS-J for Gaussian Rewards. In this part we present a batch version of Minimax Optimal Thompson Sampling for Gaussian rewards [Jin et al., 2020], called B-MOTS-J, which achieves both minimax and asymptotic optimality when the reward distribution is Gaussian. The only difference between B-MOTS-J and B-MOTS is the way the $\theta_a(t)$ are sampled. In particular, B-MOTS-J samples $\theta_a(t)$ according to the distribution $\mathcal{J}(\mu, \sigma^2)$ (instead of a Gaussian distribution), whose PDF is defined as

$\Phi_{\mathcal{J}}(x) = \frac{|x - \mu|}{2\sigma^2} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$.

Note that when x is restricted to $x \ge 0$, $\mathcal{J}$ becomes a Rayleigh distribution. More precisely, to sample $\theta_a(t)$, we set the parameters of $\mathcal{J}$ as $\theta_a(t) \sim \mathcal{J}(\hat{\mu}_a(B(t)), \frac{1}{k_a(B(t))})$, where $\hat{\mu}_a(t)$ is estimated according to (3). The rest of the algorithm runs exactly like B-MOTS.

Theorem 5.3. Assume that the reward of each arm a is sampled from a Gaussian distribution $\mathcal{N}(\mu_a, 1)$ and $\alpha > 2$. Then, the regret of B-MOTS-J can be bounded as follows: $R(T) = O(\sqrt{NT} + \sum_{a=2}^{N} \Delta_a)$, and $\lim_{T\to\infty} \frac{R(T)}{\log T} = \sum_{a=2}^{N} \frac{2}{\Delta_a}$.

The proof is given in Appendix B.4.
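Since the density of $\mathcal{J}(\mu, \sigma^2)$ stated above is symmetric around µ and $|X - \mu|$ is Rayleigh-distributed with scale σ, one way to draw from it is to sample a Rayleigh magnitude and attach a random sign. A minimal sketch of this sampling step (ours, not the paper's code):

```python
import numpy as np

def sample_J(mu, sigma2, rng):
    """Draw one sample from J(mu, sigma^2) with density
    |x - mu| / (2 sigma^2) * exp(-(x - mu)^2 / (2 sigma^2)):
    |X - mu| is Rayleigh(scale=sigma) and the sign is symmetric."""
    magnitude = rng.rayleigh(scale=np.sqrt(sigma2))
    sign = rng.choice([-1.0, 1.0])
    return mu + sign * magnitude

# B-MOTS-J sampling step for one arm (illustrative values):
# theta_a ~ J(mu_hat_a(B(t)), 1 / k_a(B(t)))
rng = np.random.default_rng(0)
mu_hat_a, k_a = 0.4, 16
print(sample_J(mu_hat_a, 1.0 / k_a, rng))
```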
6 Batch Thompson Sampling for Contextual Bandits

In this section, we propose Batch Thompson Sampling for Contextual Bandits (B-TS-C), outlined in Algorithm 2. As in the fully sequential TS proposed by Agrawal and Goyal [2013b], we assume Gaussian priors and Gaussian likelihood functions. However, we should highlight that the analysis of B-TS-C and the corresponding regret bound hold irrespective of whether or not the reward distribution matches the Gaussian priors and Gaussian likelihood functions (similar to the multi-armed bandit setting discussed in Section 4). More formally, given a context $b_a(t)$ and parameter $\mu$, we assume that the likelihood of the reward $r_a(t)$ is given by $\mathcal{N}(b_a(t)^{\top}\mu, v^2)$, where $v = \sigma\sqrt{9d\ln(T/\delta)}$ and $\delta \in (0, 1)$ (if the horizon T is unknown, we can use $v_t = \sigma\sqrt{9d\ln(t/\delta)}$). Let us define the matrix B(t) as

$B(t) = I_d + \sum_{\tau=1}^{t-1} b_{a(\tau)}(\tau)\, b_{a(\tau)}(\tau)^{\top}$.

Note that the matrix B(t) depends on all the contexts observed up to time t−1. We consider the prior $\mathcal{N}(\hat{\mu}(B(t)), v^2 B(B(t))^{-1})$ for $\mu$ and update the empirical mean estimator as follows:

$\hat{\mu}(t) = B(B(t))^{-1} \sum_{\tau=1}^{B(t)} b_{a(\tau)}(\tau)\, r_{a(\tau)}(\tau)$.

Note that in order to estimate $\hat{\mu}(t)$, we only consider the rewards received up to time B(t), namely, the rewards of arms pulled in the previous batches. At each time step t, B-TS-C generates a sample $\mu(t)$ from $\mathcal{N}(\hat{\mu}(B(t)), v^2 B(B(t))^{-1})$ and plays the arm a that maximizes $b_a(t)^{\top}\mu(t)$. The posterior distribution for $\mu$ at time t+1 is $\mathcal{N}(\hat{\mu}(B(t+1)), v^2 B(B(t+1))^{-1})$.

Theorem 6.1. The B-TS-C algorithm (Algorithm 2) achieves a total regret of $R(T) = O\!\left(d^{3/2}\sqrt{T}\,\big(\ln T + \sqrt{\ln T \ln(1/\delta)}\big)\right)$ with probability $1 - \delta$. Moreover, B-TS-C carries out O(N log T) batch queries.

The proof is given in Appendix C.2.
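A minimal sketch of the B-TS-C sampling and batch-update steps described above (illustrative; the class and method names are ours). Both the design matrix and the reward-weighted context sum are snapshots taken at batch boundaries, so the Gaussian used for sampling is $\mathcal{N}(\hat{\mu}(B(t)), v^2 B(B(t))^{-1})$.

```python
import numpy as np

class BatchLinTS:
    """Sketch of the B-TS-C sampling/update steps: the Gaussian used for sampling
    only incorporates contexts and rewards from completed batches."""
    def __init__(self, d, v, seed=0):
        self.rng = np.random.default_rng(seed)
        self.v = v
        self.B = np.eye(d)            # snapshot of I_d + sum b b^T over completed batches
        self.f = np.zeros(d)          # snapshot of sum b * r over completed batches

    def select(self, contexts):
        """contexts: array of shape (N, d) with one feature vector per arm."""
        B_inv = np.linalg.inv(self.B)
        mu_hat = B_inv @ self.f
        mu_sample = self.rng.multivariate_normal(mu_hat, self.v ** 2 * B_inv)
        return int(np.argmax(contexts @ mu_sample))

    def end_of_batch_update(self, pulled_contexts, rewards):
        """Called once per batch with the contexts of the pulled arms and their rewards."""
        for b, r in zip(pulled_contexts, rewards):
            self.B += np.outer(b, b)
            self.f += r * np.asarray(b)
```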
7 Experimental Results

In this section, we compare the performance of our proposed batch Thompson Sampling policies (B-TS, B-MOTS, B-MOTS-J, and B-TS-C) with their fully sequential counterparts. We also include several baselines such as UCB and Thompson Sampling with a static batch design (Static-TS). In particular, for Static-TS, Static-TS-2, and Static-TS-4, we set the total number of batches to that of B-TS, twice that of B-TS, and four times that of B-TS, respectively. However, in the static batch design, we use equal-sized batches.

Batch Thompson Sampling. In Figure 2a, we compare the performance of UCB against TS and its batch variants in a synthetic Bernoulli setting. We vary T from 1 to 10^4 and set N = 10. We run all the experiments 1000 times. Figure 2a compares the average regret, i.e., R(T)/T, versus the horizon T. As expected, TS outperforms UCB. Moreover, TS and B-TS follow the same trajectory and have practically the same regret. Note that for the static variant of TS, namely Static-TS, we see that in the first few hundred iterations its performance is even worse than UCB, and only then does it catch up with TS. Figure 2b more clearly shows the trade-off between the regret obtained by the different baselines and the number of batch queries (bottom-left is the desirable location). In this figure, we set T = 10^3. We see that the lowest regret is achieved by TS and B-TS, but TS carries out many more queries. We should also highlight that while the static versions make fewer queries than TS, they do not achieve a similar regret. Notably, even Static-TS-4, which carries out 4 times more queries than B-TS, has a much higher regret. This shows the importance of a dynamic batch design.

Figure 2: (a) and (b) compare the regret of UCB against TS and its batch variants. (c) and (d) compare the batch variants of TS and MOTS. (e) shows the sensitivity of TS-C and its batch variants to the tuning parameters. (f) shows the performance of TS-C and its batch variants on real data.

Batch Minimax Optimal Thompson Sampling. In Figures 2c and 2d we compare the regret of MOTS [Jin et al., 2020] and its batch versions. The synthetic setting is similar to Jin et al. [2020]. We set N = 50, T = 10^6, and the total number of runs to 2000. The reward of each arm is sampled from an independent Gaussian distribution. More precisely, the optimal arm has expected reward and variance equal to 1, while the other N−1 arms have expected reward 1−ϵ and variance 1 (we set ϵ = 0.2). For MOTS, we set ρ = 0.9999 and α = 2 as suggested by Jin et al. [2020]. As we can see in Figure 2c, the batch variants of TS and MOTS achieve practically the same regret. Also, as our theory suggests, B-MOTS (along with MOTS) has the lowest regret, while B-MOTS drastically reduces the number of batches with respect to MOTS. Moreover, the static batch designs, namely Static-TS and Static-MOTS, show the highest regret while carrying out the same number of batch queries as B-TS and B-MOTS. Therefore, the dynamic batch design of B-TS and B-MOTS seems crucial for obtaining good performance. A similar trend is shown in Figure 2d, where we run MOTS-J (with α = 2) and its batch variants. Again, B-MOTS-J and MOTS-J are practically indistinguishable while achieving the lowest regret.

Contextual Bandit. For the contextual bandit, we perform a synthetic and a real-data experiment. Figure 2e shows the performance of sequential Thompson Sampling, namely TS-C, and its batch variant, namely B-TS-C, as we change the different parameters ϵ, δ, and σ from 0 to 1. Here, the context dimension is 5 and we set the horizon to T = 10^4. We run all the experiments 1000 times. As we see in Figure 2e, TS-C and B-TS-C follow practically the same curves. For the experiment on real data, we use the MovieLens data set, where the dimension of the context is 20, the horizon is T = 10^5, and we run each experiment 100 times. For the parameters, we set δ = 0.61, σ = 0.01, and ϵ = 0.71 as suggested by Beygelzimer et al. [2011]. We see that Static-TS-C performs a bit worse than TS-C and B-TS-C, again suggesting that it is crucial to use dynamic batch sizes.

8 Conclusion

In this paper, we revisited the classic Thompson Sampling procedure and developed the first batch variants for the stochastic multi-armed bandit and the linear contextual bandit. We proved that our proposed batch policies achieve similar regret bounds (up to constant factors) with a significantly smaller number of interactions with the environment. We also demonstrated experimentally that our batch policies achieve practically the same regret on both synthetic and real data.

Funding Transparency Statement

This research is supported by NSF CAREER Award (IIS-1845032), ONR YIA (N00014-19-1-2406), and TATA.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312-2320, 2011.
Arpit Agarwal, Sepehr Assadi, and Sanjeev Khanna. Stochastic submodular cover with limited adaptivity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 323-342. SIAM, 2019.
Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1-39.26, 2012.
Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99-107, 2013a.
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127-135, 2013b.
Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM), 64(5):1-24, 2017.
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397-422, 2002.
Eric Balkanski and Yaron Singer. The adaptive complexity of maximizing a submodular function. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1138-1151, 2018a.
Eric Balkanski and Yaron Singer. Parallelization does not accelerate convex optimization: Adaptivity lower bounds for non-smooth convex minimization. arXiv preprint arXiv:1808.03880, 2018b.
Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments (Monographs on Statistics and Applied Probability). London: Chapman and Hall, 1985.
Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19-26. JMLR Workshop and Conference Proceedings, 2011.
Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
Sébastien Bubeck and Che-Yu Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, pages 638-646, 2013.
Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems, 24:2249-2257, 2011.
Lin Chen, Moran Feldman, and Amin Karbasi. Unconstrained submodular maximization with constant adaptive complexity. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 102-113, 2019.
Lin Chen, Qian Yu, Hannah Lawrence, and Amin Karbasi. Minimax regret of switching-constrained online convex optimization: No phase transition. NeurIPS, 2020.
Yuxin Chen and Andreas Krause. Near-optimal batch mode active learning and adaptive submodular optimization. In International Conference on Machine Learning, 2013.
Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208-214, 2011.
Emile Contal, David Buffoni, Alexandre Robicquet, and Nicolas Vayatis. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 225-240. Springer, 2013.
Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. 2008.
Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research, 15:3873-3923, 2014.
Hossein Esfandiari, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. Regret bounds for batched bandits. AAAI, 2021a.
Hossein Esfandiari, Amin Karbasi, and Vahab Mirrokni. Adaptivity in adaptive submodularity. In COLT, 2021b.
Matthew Fahrbach, Vahab Mirrokni, and Morteza Zadimoghaddam. Submodular maximization with nearly optimal approximation, adaptivity and query complexity. In Proceedings of the 2019 Annual ACM-SIAM Symposium on Discrete Algorithms, pages 255-273, 2019. doi: 10.1137/1.9781611975482.17. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611975482.17.
Tanner Fiez, Lalit Jain, Kevin G. Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. Advances in Neural Information Processing Systems, 32:10667-10677, 2019.
Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. In Advances in Neural Information Processing Systems, pages 503-513, 2019.
Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359-376, 2011.
Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML, 2010.
Quanquan Gu, Amin Karbasi, Khashayar Khosravi, Vahab Mirrokni, and Dongruo Zhou. Batched neural bandits, 2021.
Tianyuan Jin, Pan Xu, Jieming Shi, Xiaokui Xiao, and Quanquan Gu. MOTS: Minimax optimal Thompson sampling. arXiv preprint arXiv:2003.01803, 2020.
Kwang-Sung Jun, Lalit Jain, Houssam Nassif, and Blake Mason. Improved confidence bounds for the linear logistic model and applications to bandits. In International Conference on Machine Learning, pages 5148-5157. PMLR, 2021.
Cem Kalkanli and Ayfer Ozgur. Batched Thompson sampling, 2021.
Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Parallelised Bayesian optimisation via Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 133-142. PMLR, 2018.
Tarun Kathuria, Amit Deshpande, and Pushmeet Kohli. Batched Gaussian process bandit optimization via determinantal point processes. In Advances in Neural Information Processing Systems, pages 4206-4214, 2016.
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199-213. Springer, 2012.
Jaya Kawale, Hung H. Bui, Branislav Kveton, Long Tran-Thanh, and Sanjay Chawla. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems, pages 1297-1305, 2015.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Proceedings of the 24th Annual Conference On Learning Theory, pages 497-514, 2011.
Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. Annals of Statistics, 44(2):660-681, 2016. doi: 10.1214/15-AOS1381. URL https://doi.org/10.1214/15-AOS1381. Extended abstract in COLT 2015.
Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. arXiv preprint arXiv:1802.07028, 2018.
Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. arXiv preprint arXiv:2007.01980, 2021.
Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, 2010.
Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.
Daniel Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. arXiv preprint arXiv:1707.02038, 2017.
Eric M. Schwartz, Eric T. Bradlow, and Peter S. Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500-522, 2017.
Aleksandrs Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 1933.
Sofía S. Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2):199, 2015.
Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scale Bayesian optimization in high-dimensional spaces. In International Conference on Artificial Intelligence and Statistics, pages 745-754. PMLR, 2018.
Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33, 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [N/A]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [N/A]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [N/A]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]