# Batched Multi-armed Bandits Problem

Zijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou
Department of {Statistics, Electrical Engineering, Statistics, Mathematics}
Stanford University
{zijungao,yjhan,zren,zqzhou}@stanford.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we study the multi-armed bandit problem in the batched setting, where the employed policy must split data into a small number of batches. While the minimax regret for two-armed stochastic bandits has been completely characterized in [PRCS16], the effect of the number of arms on the regret in the multi-armed case is still open. Moreover, the question of whether adaptively chosen batch sizes help to reduce the regret also remains underexplored. In this paper, we propose the BaSE (batched successive elimination) policy, which achieves the rate-optimal regrets (within logarithmic factors) for batched multi-armed bandits, with matching lower bounds even if the batch sizes are determined in an adaptive manner.

1 Introduction and Main Results

Batch learning and online learning are two important paradigms of machine learning: in batch learning the learner is a passive observer of a given collection of data, while in online learning the learner actively determines the data collection process. Recently, combinations of these two procedures have arisen in a growing number of applications where active querying of data is possible but limited to a fixed number of rounds of interaction. For example, in clinical trials [Tho33, Rob52], data come in batches, where groups of patients are treated simultaneously and the outcomes are used to design the next trial. In crowdsourcing [KCS08], it takes the crowd some time to answer the current queries, so the total time constraint restricts the number of rounds of interaction. Similar problems also arise in marketing [BM07] and simulations [CG09].

In this paper we study the influence of round constraints on the learning performance via the following batched multi-armed bandit problem. Let $I = \{1, 2, \ldots, K\}$ be a given set of $K \ge 2$ arms of a stochastic bandit, where successive pulls of an arm $i \in I$ yield rewards which are i.i.d. samples from a distribution $\nu^{(i)}$ with mean $\mu^{(i)}$. Throughout this paper we assume that the rewards follow a Gaussian distribution, i.e., $\nu^{(i)} = \mathcal{N}(\mu^{(i)}, 1)$, where generalizations to general sub-Gaussian rewards and variances are straightforward. Let $\mu^\star = \max_{i \in [K]} \mu^{(i)}$ be the expected reward of the best arm, and $\Delta_i = \mu^\star - \mu^{(i)} \ge 0$ be the gap between arm $i$ and the best arm. The entire time horizon $T$ is split into $M$ batches represented by a grid $\mathcal{T} = \{t_1, \ldots, t_M\}$ with $1 \le t_1 < t_2 < \cdots < t_M = T$, where the grid belongs to one of the following categories:

1. Static grid: the grid $\mathcal{T} = \{t_1, \ldots, t_M\}$ is fixed ahead of time, before sampling any arms;
2. Adaptive grid: for $j \in [M]$, the grid value $t_j$ may be determined after observing the rewards up to time $t_{j-1}$, possibly using some external randomness.

Note that the adaptive grid is more powerful and practical than the static one, and we recover batch learning and online learning by setting $M = 1$ and $M = T$, respectively. A sampling policy $\pi = (\pi_t)_{t=1}^T$ is a sequence of random variables $\pi_t \in [K]$ indicating which arm to pull at time $t \in [T]$, where for $t_{j-1} < t \le t_j$, the policy $\pi_t$ depends only on observations up to time $t_{j-1}$. In other words, the policy $\pi_t$ only depends on observations strictly anterior to the current batch of $t$.
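To make the batch constraint concrete, here is a minimal simulation sketch of the interaction protocol (our illustration, not code from the paper; `run_batched_policy` and the `choose_batch_arms` callback are hypothetical names). All code examples in this document are Python/NumPy sketches under the Gaussian reward assumption above.

```python
import numpy as np

# A minimal sketch of the batched protocol: when planning batch j, the policy
# may use only the rewards observed up to the previous grid point t_{j-1}.
def run_batched_policy(mu, grid, choose_batch_arms, rng):
    """mu: true means; grid: [t_1, ..., t_M] with t_M = T;
    choose_batch_arms(history, batch_size) -> arms to pull in the next batch."""
    history = []                  # (arm, reward) pairs revealed so far
    regret, t_prev = 0.0, 0
    for t_j in grid:
        # the policy sees only data up to t_{j-1} when planning this batch
        arms = choose_batch_arms(history, t_j - t_prev)
        rewards = [rng.normal(mu[a], 1.0) for a in arms]   # N(mu_i, 1) rewards
        regret += sum(max(mu) - mu[a] for a in arms)
        history.extend(zip(arms, rewards))   # revealed only after the batch ends
        t_prev = t_j
    return regret
```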
The ultimate goal is to devise a sampling policy $\pi$ that minimizes the expected cumulative regret (or pseudo-regret, or simply regret), i.e., to minimize $\mathbb{E}[R_T(\pi)]$, where
$$R_T(\pi) := \sum_{t=1}^T \big(\mu^\star - \mu^{(\pi_t)}\big) = T\mu^\star - \sum_{t=1}^T \mu^{(\pi_t)}.$$
Let $\Pi_{M,T}$ be the set of policies with $M$ batches and time horizon $T$. Our objective is to characterize the following minimax regret and problem-dependent regret under the batched setting:
$$R^{\text{min-max}}(K, M, T) := \inf_{\pi \in \Pi_{M,T}} \sup_{\{\mu^{(i)}\}_{i=1}^K:\ \Delta_i \le \sqrt{K}} \mathbb{E}[R_T(\pi)], \qquad (1)$$
$$R^{\text{pro-dep}}(K, M, T) := \inf_{\pi \in \Pi_{M,T}} \sup_{\Delta > 0}\ \Delta \cdot \sup_{\{\mu^{(i)}\}_{i=1}^K:\ \Delta_i \in \{0\} \cup [\Delta, \sqrt{K}]} \mathbb{E}[R_T(\pi)]. \qquad (2)$$
Note that the gaps between arms can be arbitrary in the definition of the minimax regret, while a lower bound $\Delta$ on the minimum nonzero gap is imposed in the problem-dependent regret. The constraint $\Delta_i \le \sqrt{K}$ is a technical condition in both scenarios, which is more relaxed than the usual condition $\Delta_i \in [0, 1]$. These quantities are motivated by the fact that, when $M = T$, the upper bounds on the regret for multi-armed bandits usually take the form [Vog60, LR85, AB09, BPR13, PR13]
$$\mathbb{E}[R_T(\pi_1)] \le C\sqrt{KT}, \qquad \mathbb{E}[R_T(\pi_2)] \le C \sum_{i:\ \Delta_i > 0} \frac{\max\{1, \log(T\Delta_i^2)\}}{\Delta_i},$$
where $\pi_1, \pi_2$ are some policies and $C > 0$ is an absolute constant. These bounds are also tight in the minimax sense [LR85, AB09]. As a result, in the fully adaptive setting (i.e., when $M = T$), we have the optimal regrets $R^{\text{min-max}}(K, T, T) = \Theta(\sqrt{KT})$ and $R^{\text{pro-dep}}(K, T, T) = \Theta(K \log T)$. The target is to find the dependence of these quantities on the number of batches $M$.

Our first result tackles the upper bounds on the minimax regret and problem-dependent regret.

Theorem 1. For any $K \ge 2$, $T \ge 1$ and $1 \le M \le T$, there exist two policies $\pi_1$ and $\pi_2$ under static grids (explicitly defined in Section 2) such that if $\max_{i \in [K]} \Delta_i \le \sqrt{K}$,
$$\mathbb{E}[R_T(\pi_1)] \le \mathrm{polylog}(K, T) \cdot \sqrt{K}\, T^{\frac{1}{2 - 2^{1-M}}}, \qquad \mathbb{E}[R_T(\pi_2)] \le \mathrm{polylog}(K, T) \cdot \frac{K T^{1/M}}{\Delta},$$
where $\Delta := \min_{i:\ \Delta_i > 0} \Delta_i$ and $\mathrm{polylog}(K, T)$ hides poly-logarithmic factors in $(K, T)$.

The following corollary is immediate.

Corollary 1. For the $M$-batched $K$-armed bandit problem with time horizon $T$, it is sufficient to have $M = O(\log \log T)$ batches to achieve the optimal minimax regret $\Theta(\sqrt{KT})$, and $M = O(\log T)$ batches to achieve the optimal problem-dependent regret $\Theta(K \log T)$, where both optimal regrets are within logarithmic factors. (Indeed, once $M \ge 1 + \log_2 \log_2 T$ we have $2^{1-M} \le 1/\log_2 T$ and hence $T^{1/(2 - 2^{1-M})} \le \sqrt{2T}$; similarly $T^{1/M} \le 2$ once $M \ge \log_2 T$.)

For the lower bounds on the regret, we treat the static grid and the adaptive grid separately. The next theorem presents the lower bounds under any static grid.

Theorem 2. For the $M$-batched $K$-armed bandit problem with time horizon $T$ and any static grid, the minimax and problem-dependent regrets are lower bounded as
$$R^{\text{min-max}}(K, M, T) \ge c\sqrt{K}\, T^{\frac{1}{2 - 2^{1-M}}}, \qquad R^{\text{pro-dep}}(K, M, T) \ge c\, K T^{1/M},$$
where $c > 0$ is a numerical constant independent of $K, M, T$.

We observe that for any static grid, the lower bounds in Theorem 2 match those in Theorem 1 within poly-logarithmic factors. For general adaptive grids, the following theorem shows regret lower bounds which are slightly weaker than Theorem 2.

Theorem 3. For the $M$-batched $K$-armed bandit problem with time horizon $T$ and any adaptive grid, the minimax and problem-dependent regrets are lower bounded as
$$R^{\text{min-max}}(K, M, T) \ge \frac{c}{M^2} \sqrt{K}\, T^{\frac{1}{2 - 2^{1-M}}}, \qquad R^{\text{pro-dep}}(K, M, T) \ge \frac{c}{M^2}\, K T^{1/M},$$
where $c > 0$ is a numerical constant independent of $K, M, T$.

Compared with Theorem 2, the lower bounds in Theorem 3 lose a polynomial factor in $M$ due to a larger policy space. However, since the number of batches $M$ of interest is at most $O(\log T)$ (otherwise, by Corollary 1, we effectively arrive at the fully adaptive case with $M = T$), this penalty is at most poly-logarithmic in $T$. Consequently, Theorem 3 shows that any adaptive grid, albeit conceptually more powerful, performs essentially no better than the best static grid. Specifically, we have the following corollary.

Corollary 2. For the $M$-batched $K$-armed bandit problem with time horizon $T$, under either static or adaptive grids, it is necessary to have $M = \Omega(\log \log T)$ batches to achieve the optimal minimax regret $\Theta(\sqrt{KT})$, and $M = \Omega(\log T / \log \log T)$ batches to achieve the optimal problem-dependent regret $\Theta(K \log T)$, where both optimal regrets are within logarithmic factors.
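To see the scale of these thresholds, the following snippet (our illustration) tabulates the minimax exponent $1/(2 - 2^{1-M})$: it approaches $1/2$ doubly exponentially fast in $M$, which is why $\Theta(\log \log T)$ batches are both sufficient and necessary for the minimax rate.

```python
# A quick numeric illustration (ours, not from the paper) of Corollaries 1-2:
# the minimax exponent 1/(2 - 2^(1-M)) decays to 1/2 doubly exponentially in M,
# so a handful of batches already brings T^{1/(2 - 2^{1-M})} close to sqrt(T).
T = 5 * 10**4
for M in range(1, 8):
    e = 1.0 / (2.0 - 2.0 ** (1 - M))
    print(f"M={M}: exponent={e:.4f}, T^exponent={T**e:.0f}, sqrt(T)={T**0.5:.0f}")
```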
In summary, the above results completely characterize the minimax and problem-dependent regrets for batched multi-armed bandit problems, within logarithmic factors. It remains an interesting open question whether the $M^2$ term in Theorem 3 can be removed using more refined arguments.

1.1 Related works

The multi-armed bandit problem is an important class of sequential optimization problems which has been extensively studied in fields such as statistics, operations research, engineering, computer science and economics in recent years [BCB12]. In the fully adaptive scenario, regret analyses for stochastic bandits can be found in [Vog60, LR85, BK97, ACBF02, AB09, AMS09, AB10, AO10, GC11, BPR13, PR13]. Less attention has been paid to the batched setting with limited rounds of interaction. A related setting is studied in [CBDS13] under the name of switching costs, where it is shown that $O(\log \log T)$ batches are sufficient to achieve the optimal minimax regret. For a small number of batches $M$, the batched two-armed bandit problem is studied in [PRCS16], where the results of Theorems 1 and 2 are obtained for $K = 2$. However, the generalization to the multi-armed case is not straightforward, and more importantly, the practical scenario where the grid is chosen adaptively based on historical data is excluded in [PRCS16]. For the multi-armed case, the different problem of finding the best $k$ arms in the batched setting has been studied in [JJNZ16, AAAK17], where the goal is pure exploration and the error dependence on the time horizon decays super-polynomially. We also refer to [DRY18] for a similar setting with convex bandits and best arm identification. The regret analysis for batched stochastic multi-armed bandits still remains underexplored.

We also review some literature on general computation with limited rounds of adaptivity, and in particular on the analysis of lower bounds. In theoretical computer science, this problem has been studied under the name of parallel algorithms for certain tasks (e.g., sorting and selection), given either deterministic [Val75, BT83, AA88] or noisy outcomes [FRPU94, DKMR14, BMW16]. In (stochastic) convex optimization, the information-theoretic limits are typically derived under an oracle model where the oracle can be queried adaptively [NY83, AWBR09, Sha13, DRY18]. However, in these previous works one usually optimizes the sampling distribution over a fixed sample size at each step, while it is more challenging to prove lower bounds for policies which can also determine the sample sizes. One exception is [AAAK17], whose proof relies on a complicated decomposition of near-uniform distributions. Hence, our technique for proving Theorem 3 is also expected to be a contribution to this line of work.

1.2 Organization

The rest of this paper is organized as follows. In Section 2, we introduce the BaSE policy for general batched multi-armed bandit problems and show that it attains the upper bounds in Theorem 1 under two specific grids.
Section 3 presents the proofs of the lower bounds for both the minimax and problem-dependent regrets, where Section 3.1 deals with static grids and Section 3.2 tackles adaptive grids. Experimental results are presented in Section 4. Auxiliary lemmas and the proofs of the main lemmas are deferred to the supplementary materials.

1.3 Notations

For a positive integer $n$, let $[n] := \{1, \ldots, n\}$. For any finite set $A$, let $|A|$ denote its cardinality. We adopt the standard asymptotic notations: for two non-negative sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ iff $\limsup_{n \to \infty} a_n / b_n < \infty$, $a_n = \Omega(b_n)$ iff $b_n = O(a_n)$, and $a_n = \Theta(b_n)$ iff both $a_n = O(b_n)$ and $b_n = O(a_n)$. For probability measures $P$ and $Q$, let $P \otimes Q$ be the product measure with marginals $P$ and $Q$. If $P$ and $Q$ are defined on the same probability space, we denote by $\mathrm{TV}(P, Q) = \frac{1}{2}\int |dP - dQ|$ and $D_{\mathrm{KL}}(P \| Q) = \int dP \log \frac{dP}{dQ}$ the total variation distance and the Kullback–Leibler (KL) divergence between $P$ and $Q$, respectively.

2 The BaSE Policy

In this section, we propose the BaSE policy for the batched multi-armed bandit problem based on successive elimination, together with two choices of static grids used to prove Theorem 1.

2.1 Description of the policy

The policy that achieves the optimal regrets is essentially adapted from Successive Elimination (SE). The original version of SE was introduced in [EDMM06], and [PR13] shows that in the $M = T$ case SE achieves both the optimal minimax and problem-dependent rates. Here we introduce a batched version of SE, called Batched Successive Elimination (BaSE), to handle the general case $M \le T$. Given a pre-specified grid $\mathcal{T} = \{t_1, \ldots, t_M\}$, the idea of the BaSE policy is simply to explore in the first $M - 1$ batches and then commit to the best arm in the last batch. At the end of each exploration batch, we remove arms which are probably bad based on past observations. Specifically, let $A \subseteq I$ denote the set of active arms that are candidates for the optimal arm, where we initialize $A = I$ and sequentially drop arms which are significantly worse than the best one. In each of the first $M - 1$ batches, we pull each active arm the same number of times (neglecting rounding issues¹) and eliminate some arms from $A$ at the end of the batch. In the last batch, we commit to the arm in $A$ with the maximum average reward.

Before stating the exact algorithm, we introduce some notation. Let
$$\bar{Y}^i(t) = \frac{1}{|\{s \le t : \text{arm } i \text{ is pulled at time } s\}|} \sum_{s=1}^t Y_s \mathbb{1}\{\text{arm } i \text{ is pulled at time } s\}$$
denote the average reward of arm $i$ up to time $t$, and let $\gamma > 0$ be a tuning parameter associated with the confidence bound used for elimination. The algorithm is described in detail in Algorithm 1.

Note that the BaSE algorithm is not fully specified unless the grid $\mathcal{T}$ is determined. Here we provide two choices of static grids, similar to [PRCS16], as follows: let
$$u_1 = a, \quad u_m = a\sqrt{u_{m-1}}, \quad m = 2, \ldots, M, \qquad t_m = \lfloor u_m \rfloor, \quad m \in [M],$$
$$u_1' = b, \quad u_m' = b\, u_{m-1}', \quad m = 2, \ldots, M, \qquad t_m' = \lfloor u_m' \rfloor, \quad m \in [M],$$
where the parameters $a, b$ are chosen appropriately such that $t_M = t_M' = T$, i.e.,
$$a = \Theta\big(T^{\frac{1}{2 - 2^{1-M}}}\big), \qquad b = \Theta\big(T^{\frac{1}{M}}\big). \qquad (3)$$
For minimizing the minimax regret, we use the "minimax" grid defined by $\mathcal{T}_{\text{minimax}} = \{t_1, \ldots, t_M\}$; for the problem-dependent regret, we use the "geometric" grid defined by $\mathcal{T}_{\text{geometric}} = \{t_1', \ldots, t_M'\}$. We will denote by $\pi^1_{\text{BaSE}}$ and $\pi^2_{\text{BaSE}}$ the respective policies under these grids.

¹There might be some rounding issues here, and some arms may be pulled once more than others. In this case, the additional pull is not counted towards the computation of the average reward $\bar{Y}^i(t)$, which ensures that all active arms are evaluated using the same number of pulls at the end of any batch. Note that in this way the number of pulls of each arm is underestimated by at most a factor of two, so the regret analysis in Theorem 4 gives the same rates in the presence of rounding issues.
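The recursions above pin down both grids once $a$ and $b$ are set by the terminal condition $t_M = t_M' = T$; the following sketch (ours, with hypothetical function names) makes the construction and the closed forms in (3) explicit.

```python
import numpy as np

# A sketch of the two static grids of Section 2.1. The scaling parameters a
# and b are pinned down by the terminal condition t_M = T, matching
# a = Theta(T^{1/(2 - 2^{1-M})}) and b = Theta(T^{1/M}) in (3).
def minimax_grid(T, M):
    a = T ** (1.0 / (2.0 - 2.0 ** (1 - M)))  # solves u_M = T for u_m = a*sqrt(u_{m-1})
    u, grid = a, []
    for _ in range(M):
        grid.append(min(T, int(np.floor(u))))
        u = a * np.sqrt(u)
    grid[-1] = T  # enforce t_M = T despite rounding
    return grid

def geometric_grid(T, M):
    b = T ** (1.0 / M)  # solves u'_M = T for u'_m = b * u'_{m-1}
    grid = [min(T, int(np.floor(b ** m))) for m in range(1, M + 1)]
    grid[-1] = T
    return grid
```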
Algorithm 1: Batched Successive Elimination (BaSE)

Input: arms $I = [K]$; time horizon $T$; number of batches $M$; grid $\mathcal{T} = \{t_1, \ldots, t_M\}$; tuning parameter $\gamma$.
Initialization: $A \leftarrow I$.
for $m = 1$ to $M - 1$ do
  (a) During the period $[t_{m-1} + 1, t_m]$, pull each arm in $A$ the same number of times.
  (b) At time $t_m$: let $\bar{Y}^{\max}(t_m) = \max_{j \in A} \bar{Y}^j(t_m)$, and let $\tau_m$ be the total number of pulls of each arm in $A$.
    for $i \in A$ do
      if $\bar{Y}^{\max}(t_m) - \bar{Y}^i(t_m) \ge \sqrt{\gamma \log(TK)/\tau_m}$ then $A \leftarrow A \setminus \{i\}$.
    end
end
for $t = t_{M-1} + 1$ to $T$ do
  pull arm $i_0 \in \arg\max_{j \in A} \bar{Y}^j(t_{M-1})$ (break ties arbitrarily).
end
Output: resulting policy $\pi$.
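For concreteness, here is a compact simulation sketch of Algorithm 1 (our illustration run against synthetic Gaussian rewards, not the authors' released code; `base_policy` is a hypothetical name). Rounding of the equal split within a batch is handled naively, in the spirit of footnote 1.

```python
import numpy as np

# A sketch (ours) of Algorithm 1 against synthetic N(mu_i, 1) rewards.
def base_policy(mu, grid, gamma, rng):
    K, T = len(mu), grid[-1]
    active = list(range(K))
    sums, pulls = np.zeros(K), np.zeros(K)   # cumulative rewards / pull counts
    regret, t_prev = 0.0, 0
    for t_m in grid[:-1]:                    # exploration batches 1, ..., M-1
        per_arm = max(1, (t_m - t_prev) // len(active))
        for a in active:                     # pull each active arm equally
            r = rng.normal(mu[a], 1.0, size=per_arm)
            sums[a] += r.sum()
            pulls[a] += per_arm
            regret += per_arm * (max(mu) - mu[a])
        t_prev = t_m
        # elimination step at time t_m: drop arms far below the current best
        avg = sums[active] / pulls[active]
        tau = pulls[active[0]]               # equal pull count for active arms
        thresh = np.sqrt(gamma * np.log(T * K) / tau)
        active = [a for a, m in zip(active, avg) if avg.max() - m < thresh]
    best = max(active, key=lambda a: sums[a] / pulls[a])
    regret += (T - t_prev) * (max(mu) - mu[best])   # commit in the last batch
    return regret
```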
2.2 Regret analysis

The performance of the BaSE policy is summarized in the following theorem.

Theorem 4. Consider an $M$-batched, $K$-armed bandit problem with time horizon $T$. Let $\pi^1_{\text{BaSE}}$ be the BaSE policy equipped with the grid $\mathcal{T}_{\text{minimax}}$, and $\pi^2_{\text{BaSE}}$ be the BaSE policy equipped with the grid $\mathcal{T}_{\text{geometric}}$. For $\gamma \ge 12$ and $\max_{i \in [K]} \Delta_i = O(\sqrt{K})$, we have
$$\mathbb{E}[R_T(\pi^1_{\text{BaSE}})] \le C \log K \sqrt{K \log(KT)}\, \cdot T^{\frac{1}{2 - 2^{1-M}}}, \qquad (4)$$
$$\mathbb{E}[R_T(\pi^2_{\text{BaSE}})] \le C \log K \log(KT) \cdot \frac{K T^{1/M}}{\Delta}, \qquad \Delta := \min_{i:\ \Delta_i > 0} \Delta_i, \qquad (5)$$
where $C > 0$ is a numerical constant independent of $K$, $M$ and $T$.

Note that Theorem 4 implies Theorem 1. In the sequel we sketch the proof of Theorem 4, where the main technical difficulty is to appropriately control the number of pulls of each arm under the batch constraints, since the number of active arms in $A$ is random starting from the second batch. We also refer to a recent work [EKMM19] for a tighter bound on the problem-dependent regret with an adaptive grid.

Proof of Theorem 4. For notational simplicity we assume that there are $K + 1$ arms, where arm $\star$ is the arm with the highest expected reward $\mu^\star$, and $\Delta_i = \mu^\star - \mu_i \ge 0$ for $i \in [K]$. Define the following events: for $i \in [K]$, let $A_i$ be the event that arm $i$ is eliminated before time $t_{m_i}$, where
$$m_i = \min\Big\{ j \in [M] : \text{arm } i \text{ has been pulled at least } \tau_i := \frac{4\gamma \log(TK)}{\Delta_i^2} \text{ times before time } t_j \Big\},$$
with the understanding that if the minimum does not exist, we set $m_i = M$ and let the event $A_i$ occur. Let $B$ be the event that arm $\star$ is not eliminated throughout the time horizon $T$. The final "good event" $E$ is defined as $E = (\cap_{i=1}^K A_i) \cap B$. We remark that $m_i$ is a random variable depending on the order in which the arms are eliminated. The following lemma shows that with our choice of $\gamma \ge 12$, the good event $E$ occurs with high probability.

Lemma 1. The event $E$ happens with probability at least $1 - \frac{2}{TK}$.

The proof of Lemma 1 is postponed to the supplementary materials. By Lemma 1, the expected regret of $\pi$ (with $\pi = \pi^1_{\text{BaSE}}$ or $\pi^2_{\text{BaSE}}$) on the event that $E$ does not occur is at most
$$\mathbb{E}[R_T(\pi)\mathbb{1}(E^c)] \le T \max_{i \in [K]} \Delta_i \cdot \mathbb{P}(E^c) = O(1). \qquad (6)$$
Next we condition on the event $E$ and upper bound the regret $\mathbb{E}[R_T(\pi^1_{\text{BaSE}})\mathbb{1}(E)]$ for the minimax grid $\mathcal{T}_{\text{minimax}}$. The analysis for the geometric grid $\mathcal{T}_{\text{geometric}}$ is entirely analogous and is deferred to the supplementary materials. For the policy $\pi^1_{\text{BaSE}}$, let $I_0 \subseteq I$ be the (random) set of arms which are eliminated at the end of the first batch, $I_1 \subseteq I$ be the (random) set of remaining arms which are eliminated before the last batch, and $I_2 = I \setminus (I_0 \cup I_1)$ be the (random) set of arms which remain in the last batch.

It is clear that the total regret incurred by arms in $I_0$ is at most $t_1 \max_{i \in [K]} \Delta_i = O(\sqrt{K}\, a)$, and it remains to deal with the sets $I_1$ and $I_2$ separately. For an arm $i \in I_1$, let $\sigma_i$ be the (random) number of arms which are eliminated before arm $i$. Observe that the fraction of pulls of arm $i$ is at most $\frac{1}{K - \sigma_i}$ before arm $i$ is eliminated. Moreover, by the definition of $t_{m_i}$, we must have
$$\tau_i > (\text{number of pulls of arm } i \text{ before } t_{m_i - 1}) \ge \frac{t_{m_i - 1}}{K}, \qquad \text{and hence} \qquad \Delta_i \le \sqrt{\frac{4\gamma K \log(TK)}{t_{m_i - 1}}}.$$
Hence, the total regret incurred by pulling arm $i \in I_1$ is at most (note that $t_j \le 2a\sqrt{t_{j-1}}$ for any $j = 2, 3, \ldots, M$ by the choice of the grid)
$$\Delta_i \cdot \frac{t_{m_i}}{K - \sigma_i} \le \Delta_i \cdot \frac{2a\sqrt{t_{m_i - 1}}}{K - \sigma_i} \le \frac{2a}{K - \sigma_i}\sqrt{4\gamma K \log(TK)}.$$
Since there are at most $t$ elements of $(\sigma_i : i \in I_1)$ which are at least $K - t$ for any $t = 2, \ldots, K$, the total regret incurred by pulling arms in $I_1$ is at most
$$\sum_{i \in I_1} \frac{2a}{K - \sigma_i}\sqrt{4\gamma K \log(TK)} \le 2a\sqrt{4\gamma K \log(TK)} \sum_{t=1}^K \frac{1}{t} \le 2a\,(1 + \log K)\sqrt{4\gamma K \log(TK)}. \qquad (7)$$
For any arm $i \in I_2$, the previous analysis gives $\Delta_i \le \sqrt{4\gamma K \log(TK)/t_{M-1}}$. Hence, letting $T_i$ be the number of pulls of arm $i$, the total regret incurred by pulling arm $i \in I_2$ is at most
$$T_i \Delta_i \le \frac{T_i}{\sqrt{t_{M-1}}}\sqrt{4\gamma K \log(TK)} \le \frac{2a\, T_i}{T}\sqrt{4\gamma K \log(TK)},$$
where in the last step we used that $T = t_M \le 2a\sqrt{t_{M-1}}$ for the minimax grid $\mathcal{T}_{\text{minimax}}$. Since $\sum_{i \in I_2} T_i \le T$, the total regret incurred by pulling arms in $I_2$ is at most
$$\sum_{i \in I_2} \frac{2a\, T_i}{T}\sqrt{4\gamma K \log(TK)} \le 2a\sqrt{4\gamma K \log(TK)}. \qquad (8)$$
By (7) and (8), the inequality
$$R_T(\pi^1_{\text{BaSE}})\mathbb{1}(E) \le 2a\sqrt{4\gamma K \log(TK)}(\log K + 2) + O(\sqrt{K}\, a)$$
holds almost surely. Combined with (6) and the choice of $a$ in (3), this yields the desired upper bound (4).

3 Lower Bounds

This section presents lower bounds for the batched multi-armed bandit problem. In Section 3.1 we design a fixed multiple hypothesis testing problem to establish the lower bound for all policies under static grids, while in Section 3.2 we construct different hypotheses for different policies under general adaptive grids.

3.1 Static grid

The proof of Theorem 2 relies on the following lemma.

Lemma 2. For any static grid $0 = t_0 < t_1 < \cdots < t_M = T$ and any smallest gap $\Delta \in (0, \sqrt{K}]$, the following minimax lower bound holds for any policy $\pi$ under this grid:
$$\sup_{\{\mu^{(i)}\}_{i=1}^K:\ \Delta_i \in \{0\} \cup [\Delta, \sqrt{K}]} \mathbb{E}[R_T(\pi)] \ge \frac{\Delta}{4} \sum_{j=1}^M (t_j - t_{j-1}) \exp\Big(-\frac{2\Delta^2 t_{j-1}}{K - 1}\Big).$$

We first show that Lemma 2 implies Theorem 2 by choosing the smallest gap $\Delta > 0$ appropriately. By the definitions of the minimax regret $R^{\text{min-max}}$ and the problem-dependent regret $R^{\text{pro-dep}}$, choosing $\Delta = \Delta_j = \sqrt{(K-1)/(t_{j-1}+1)} \in (0, \sqrt{K}]$ in Lemma 2 yields
$$R^{\text{min-max}}(K, M, T) \ge c_0 \sqrt{K} \max_{j \in [M]} \frac{t_j}{\sqrt{t_{j-1} + 1}}, \qquad R^{\text{pro-dep}}(K, M, T) \ge c_0 K \max_{j \in [M]} \frac{t_j}{t_{j-1} + 1},$$
for some numerical constant $c_0 > 0$. Since $t_0 = 0$ and $t_M = T$, the lower bounds in Theorem 2 follow (indeed, if every ratio $t_j/\sqrt{t_{j-1}+1}$ were at most $B$, then $T = t_M = O(B^{2 - 2^{1-M}})$, and analogously for the second maximum).

Next we employ the general idea of multiple hypothesis testing to prove Lemma 2. Consider the following $K$ candidate reward distributions:
$$P_1 = \mathcal{N}(\Delta, 1) \otimes \mathcal{N}(0, 1) \otimes \mathcal{N}(0, 1) \otimes \cdots \otimes \mathcal{N}(0, 1),$$
$$P_2 = \mathcal{N}(\Delta, 1) \otimes \mathcal{N}(2\Delta, 1) \otimes \mathcal{N}(0, 1) \otimes \cdots \otimes \mathcal{N}(0, 1),$$
$$P_3 = \mathcal{N}(\Delta, 1) \otimes \mathcal{N}(0, 1) \otimes \mathcal{N}(2\Delta, 1) \otimes \cdots \otimes \mathcal{N}(0, 1),$$
$$\vdots$$
$$P_K = \mathcal{N}(\Delta, 1) \otimes \mathcal{N}(0, 1) \otimes \mathcal{N}(0, 1) \otimes \cdots \otimes \mathcal{N}(2\Delta, 1).$$
We remark that this construction is not entirely symmetric: the reward distribution of the first arm is always $\mathcal{N}(\Delta, 1)$. The key properties of this construction are as follows:

1. For any $i \in [K]$, arm $i$ is the optimal arm under the reward distribution $P_i$;
2. For any $i \in [K]$, pulling a wrong arm incurs a regret of at least $\Delta$ under the reward distribution $P_i$.

As a result, since the average regret serves as a lower bound for the worst-case regret, we have
$$\sup_{\{\mu^{(i)}\}_{i=1}^K:\ \Delta_i \in \{0\} \cup [\Delta, \sqrt{K}]} \mathbb{E}[R_T(\pi)] \ge \frac{1}{K} \sum_{i=1}^K \sum_{t=1}^T \mathbb{E}_{P_i^t}[R_t(\pi)] \ge \Delta \sum_{t=1}^T \frac{1}{K} \sum_{i=1}^K P_i^t(\pi_t \ne i), \qquad (9)$$
where $P_i^t$ denotes the distribution of the observations available at time $t$ under $P_i$, and $R_t(\pi)$ denotes the instantaneous regret incurred by the policy $\pi_t$ at time $t$.
Hence, it remains to lower bound the quantity $\frac{1}{K}\sum_{i=1}^K P_i^t(\pi_t \ne i)$ for each $t \in [T]$, which is the subject of the following lemma.

Lemma 3. Let $Q_1, \ldots, Q_n$ be probability measures on a common probability space $(\Omega, \mathcal{F})$, and let $\Psi : \Omega \to [n]$ be any measurable function (i.e., test). Then for any tree $T = ([n], E)$ with vertex set $[n]$ and edge set $E$, we have
$$\frac{1}{n}\sum_{i=1}^n Q_i(\Psi \ne i) \ge \frac{1}{2n} \sum_{(i,j) \in E} \exp\big(-D_{\mathrm{KL}}(Q_i \| Q_j)\big).$$

The proof of Lemma 3 is deferred to the supplementary materials; we make some remarks below.

Remark 1. A more well-known lower bound for $\frac{1}{n}\sum_{i=1}^n Q_i(\Psi \ne i)$ is Fano's inequality [CT06], which involves the mutual information $I(U; X)$ with $U \sim \mathrm{Uniform}([n])$ and $P_{X|U=i} = Q_i$. However, since $I(U; X) = \mathbb{E}_{P_U}[D_{\mathrm{KL}}(P_{X|U} \| P_X)]$, Fano's inequality gives a lower bound which depends linearly rather than exponentially on the pairwise KL divergences, and is thus too loose for our purpose.

Remark 2. An alternative lower bound is $\frac{1}{2n^2}\sum_{i \ne j} \exp(-D_{\mathrm{KL}}(Q_i \| Q_j))$, i.e., with the summation taken over all pairs $(i, j)$ instead of only the edges of a tree. However, this bound is weaker than Lemma 3: in the case where $Q_i = \mathcal{N}(i\Delta, 1)$ for some large $\Delta > 0$, Lemma 3 with the path $T = ([n], \{(1, 2), (2, 3), \ldots, (n-1, n)\})$ is tight (giving the rate $\exp(-O(\Delta^2))$), while the alternative bound loses a factor of $n$ (giving the rate $\exp(-O(\Delta^2))/n$).

To lower bound (9), we apply Lemma 3 with $n = K$ and the star tree $T = ([K], \{(1, i) : 2 \le i \le K\})$. For $i \in [K]$, denote by $T_i(t)$ the number of pulls of arm $i$ anterior to the current batch of $t$; hence $\sum_{i=1}^K T_i(t) = t_{j-1}$ if $t \in (t_{j-1}, t_j]$. Moreover, since $D_{\mathrm{KL}}(P_1^t \| P_i^t) = 2\Delta^2\, \mathbb{E}_{P_1^t}[T_i(t)]$, we have
$$\frac{1}{K}\sum_{i=1}^K P_i^t(\pi_t \ne i) \ge \frac{1}{2K}\sum_{i=2}^K \exp\big(-D_{\mathrm{KL}}(P_1^t \| P_i^t)\big) = \frac{1}{2K}\sum_{i=2}^K \exp\big(-2\Delta^2\, \mathbb{E}_{P_1^t}[T_i(t)]\big) \ge \frac{1}{4}\exp\Big(-\frac{2\Delta^2 t_{j-1}}{K-1}\Big), \qquad (10)$$
where the last step follows from Jensen's inequality. Now combining (9) and (10) completes the proof of Lemma 2.
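The proof above makes the role of the grid transparent: the regret of any static grid is driven by $\max_{j} t_j/\sqrt{t_{j-1}+1}$, so a good grid should equalize these ratios across $j$, which is exactly what the minimax grid of Section 2.1 does. A quick numeric check (our illustration, reusing `minimax_grid` from the earlier sketch):

```python
import numpy as np

# Evaluate the terms t_j / sqrt(t_{j-1} + 1) that drive the static lower bound:
# the minimax grid roughly equalizes them, while the arithmetic grid does not.
def lb_terms(grid):
    ts = [0] + list(grid)
    return [ts[j] / np.sqrt(ts[j - 1] + 1) for j in range(1, len(ts))]

T, M = 5 * 10**4, 3
print(lb_terms(minimax_grid(T, M)))                      # roughly equal, ~T^{4/7}
print(lb_terms([j * T // M for j in range(1, M + 1)]))   # arithmetic: first term dominates
```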
3.2 Adaptive grid

Now we investigate the case where the grid may be randomized and generated sequentially in an adaptive manner. Recall that in the previous section we constructed multiple fixed hypotheses and showed that no policy under a static grid can achieve a uniformly small regret under all hypotheses. However, this argument breaks down even if the grid is only randomized but not adaptive, due to the non-convex (in $(t_1, \ldots, t_M)$) nature of the lower bound in Lemma 2. In other words, we cannot hope for a single fixed multiple hypothesis testing problem to work for all policies. To overcome this difficulty, a subroutine in the proof of Theorem 3 constructs appropriate hypotheses after the policy is given (cf. the proof of Lemma 4). We sketch the proof below.

We shall only prove the lower bound for the minimax regret; the analysis for the problem-dependent regret is entirely analogous. Consider the following time points $T_1, \ldots, T_M \in [1, T]$ and gaps $\Delta_1, \ldots, \Delta_M \in (0, \sqrt{K}]$:
$$T_j = \Big\lceil T^{\frac{2 - 2^{1-j}}{2 - 2^{1-M}}} \Big\rceil, \qquad \Delta_j = \frac{\sqrt{K}}{36M}\, T^{-\frac{1 - 2^{1-j}}{2 - 2^{1-M}}}, \qquad j \in [M]. \qquad (11)$$
Let $\mathcal{T} = \{t_1, \ldots, t_M\}$ be any adaptive grid, and let $\pi$ be any policy under the grid $\mathcal{T}$. For each $j \in [M]$, define the event $A_j = \{t_{j-1} < T_{j-1},\ t_j \ge T_j\}$ under the policy $\pi$, with the convention that $t_0 = 0$ and $t_M = T$. Note that the events $A_1, \ldots, A_M$ form a partition of the entire probability space. We also define the following family of reward distributions: for $j \in [M-1]$ and $k \in [K-1]$, let
$$P_{j,k} = \mathcal{N}(0, 1) \otimes \cdots \otimes \mathcal{N}(0, 1) \otimes \mathcal{N}(\Delta_j + \Delta_M, 1) \otimes \mathcal{N}(0, 1) \otimes \cdots \otimes \mathcal{N}(0, 1) \otimes \mathcal{N}(\Delta_M, 1),$$
where the $k$-th component of $P_{j,k}$ has the non-zero mean $\Delta_j + \Delta_M$. For $j = M$, we define
$$P_M = \mathcal{N}(0, 1) \otimes \cdots \otimes \mathcal{N}(0, 1) \otimes \mathcal{N}(\Delta_M, 1).$$
Note that this construction ensures that $P_{j,k}$ and $P_M$ differ only in the $k$-th component, which is crucial for the indistinguishability results in Lemma 5.

We will be interested in the following quantities:
$$p_j = \frac{1}{K-1}\sum_{k=1}^{K-1} P_{j,k}(A_j), \quad j \in [M-1], \qquad p_M = P_M(A_M),$$
where $P_{j,k}(A)$ denotes the probability of the event $A$ given the true reward distribution $P_{j,k}$ and the policy $\pi$. The importance of these quantities lies in the following lemmas.

Lemma 4. If $p_j \ge \frac{1}{2M}$ for some $j \in [M]$, then
$$\sup_{\{\mu^{(i)}\}_{i=1}^K:\ \Delta_i \le \sqrt{K}} \mathbb{E}[R_T(\pi)] \ge \frac{c}{M^2}\sqrt{K}\, T^{\frac{1}{2 - 2^{1-M}}},$$
where $c > 0$ is a numerical constant independent of $(K, M, T)$ and $(\pi, \mathcal{T})$.

Lemma 5. The following inequality holds: $\sum_{j=1}^M p_j \ge \frac{1}{2}$.

The detailed proofs of Lemma 4 and Lemma 5 are deferred to the supplementary materials; we only sketch the ideas here. Lemma 4 states that if any event $A_j$ occurs with non-negligible probability in the respective $j$-th world (i.e., under the mixture of $(P_{j,k} : k \in [K-1])$ or under $P_M$), then the policy $\pi$ incurs a large regret in the worst case. The intuition behind Lemma 4 is that, if the event $t_{j-1} < T_{j-1}$ occurs under the reward distribution $P_{j,k}$, then the observations in the first $j - 1$ batches are not sufficient to distinguish $P_{j,k}$ from its (carefully designed) perturbed versions with perturbations of size $\Delta_j$. Furthermore, if in addition $t_j \ge T_j$ holds, then the total regret is at least $\Omega(T_j \Delta_j)$ due to the indistinguishability of the $\Delta_j$-perturbations within the first $j$ batches. Hence, if $A_j$ occurs with a fairly large probability, the resulting total regret will be large as well. Lemma 5 complements Lemma 4 by stating that at least one $p_j$ must be large. Note that if all $p_j$ were defined under the same reward distribution, the partition structure of $A_1, \ldots, A_M$ would imply $\sum_{j \in [M]} p_j \ge 1$. Since the occurrence of $A_j$ cannot really help to distinguish the $j$-th world from the later ones, Lemma 5 shows that we may still operate almost as in a single world, arriving at a slightly smaller constant than 1.

Finally, we show how Lemma 4 and Lemma 5 imply Theorem 3. By Lemma 5, there exists some $j \in [M]$ such that $p_j \ge (2M)^{-1}$. Then Theorem 3 follows from Lemma 4 and the arbitrariness of $\pi$.

4 Experiments

This section contains experimental results on the performance of the BaSE policy under different grids. The default parameters are $T = 5 \times 10^4$, $K = 3$, $M = 3$ and $\gamma = 1$, and the mean reward is $\mu = 0.6$ for the optimal arm and $\mu = 0.5$ for all other arms. In addition to the minimax and geometric grids, we also experiment with the arithmetic grid $t_j = jT/M$ for $j \in [M]$.

Figure 1 (a)-(c) displays the empirical average regret of BaSE under the different grids, together with a comparison to the centralized UCB1 algorithm [ACBF02] run without any batch constraints. We observe that the minimax grid typically results in the smallest regret among all grids, and that $M = 4$ batches already suffice for the BaSE performance to approach the centralized performance. We also compare BaSE with the ETC algorithm of [PRCS16] in the two-armed case; Figure 1 (d) shows that BaSE achieves lower regret than ETC. The source code for the experiments can be found at https://github.com/Mathegineer/batched-bandit.

[Figure 1: Empirical regret performance of the BaSE policy. Panels: (a) average regret vs. the number of batches $M$; (b) average regret vs. the number of arms $K$; (c) average regret vs. the time horizon $T$; (d) comparison of BaSE and ETC. Each panel plots the minimax, geometric, and arithmetic grids against UCB1 (and ETC in panel (d)).]
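For readers who wish to reproduce the flavor of Figure 1(a), the following minimal driver (ours, reusing `minimax_grid`, `geometric_grid`, and `base_policy` from the sketches above; not the released experiment code from the repository) mirrors the default parameters of this section:

```python
import numpy as np

# A minimal driver (ours) in the spirit of Figure 1(a); fewer repetitions
# than a real experiment, parameters mirror the defaults of Section 4.
T, K, gamma, reps = 5 * 10**4, 3, 1.0, 50
mu = [0.6] + [0.5] * (K - 1)
rng = np.random.default_rng(0)
for M in range(2, 8):
    for name, grid in [("minimax", minimax_grid(T, M)),
                       ("geometric", geometric_grid(T, M)),
                       ("arithmetic", [j * T // M for j in range(1, M + 1)])]:
        avg = np.mean([base_policy(mu, grid, gamma, rng) for _ in range(reps)])
        print(f"M={M:d}  {name:10s} average regret {avg:8.1f}")
```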
References

[AA88] Noga Alon and Yossi Azar. Sorting, approximate sorting, and searching in rounds. SIAM Journal on Discrete Mathematics, 1(3):269–280, 1988.
[AAAK17] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pages 39–75, 2017.
[AB09] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
[AB10] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.
[ACBF02] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[AMS09] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
[AO10] Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.
[AWBR09] Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
[BCB12] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[BK97] Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
[BM07] Dimitris Bertsimas and Adam J Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135, 2007.
[BMW16] Mark Braverman, Jieming Mao, and S Matthew Weinberg. Parallel algorithms for select and partition with noisy comparisons. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 851–862. ACM, 2016.
[BPR13] Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Annual Conference on Learning Theory, pages 122–134, 2013.
[BT83] Béla Bollobás and Andrew Thomason. Parallel sorting. Discrete Applied Mathematics, 6(1):1–11, 1983.
[CBDS13] Nicolo Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, pages 1160–1168, 2013.
[CG09] Stephen E Chick and Noah Gans. Economic analysis of simulation selection problems. Management Science, 55(3):421–437, 2009.
[CT06] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, New York, second edition, 2006.
[DKMR14] Susan Davidson, Sanjeev Khanna, Tova Milo, and Sudeepa Roy. Top-k and clustering with noisy comparisons. ACM Transactions on Database Systems (TODS), 39(4):35, 2014.
[DRY18] John Duchi, Feng Ruan, and Chulhee Yun. Minimax bounds on stochastic batched convex optimization. In Conference On Learning Theory, pages 3065–3162, 2018.
[EDMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
[EKMM19] Hossein Esfandiari, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. Batched multi-armed bandits with optimal regret. arXiv preprint arXiv:1910.04959, 2019.
[FRPU94] Uriel Feige, Prabhakar Raghavan, David Peleg, and Eli Upfal. Computing with noisy information. SIAM Journal on Computing, 23(5):1001–1018, 1994.
[GC11] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.
[JJNZ16] Kwang-Sung Jun, Kevin G Jamieson, Robert D Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pages 139–148, 2016.
[KCS08] Aniket Kittur, Ed H Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 453–456. ACM, 2008.
[LR85] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[NY83] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[PR13] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, pages 693–721, 2013.
[PRCS16] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
[Rob52] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
[Sha13] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3–24, 2013.
[Tho33] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
[Tsy08] A. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2008.
[Val75] Leslie G Valiant. Parallelism in comparison problems. SIAM Journal on Computing, 4(3):348–355, 1975.
[Vog60] Walter Vogel. A sequential design for the two armed bandit. The Annals of Mathematical Statistics, 31(2):430–443, 1960.