# optimal_batched_best_arm_identification__fa42386b.pdf

Optimal Batched Best Arm Identification

Tianyuan Jin1, Yu Yang3, Jing Tang2, Xiaokui Xiao1, Pan Xu3

1National University of Singapore 2The Hong Kong University of Science and Technology (Guangzhou) 3Duke University {tianyuan,xkxiao}@nus.edu.sg, jingtang@ust.hk, {yu.yang,pan.xu}@duke.edu

We study the batched best arm identification (BBAI) problem, where the learner s goal is to identify the best arm while switching the policy as less as possible. In particular, we aim to find the best arm with probability 1 δ for some small constant δ > 0 while minimizing both the sample complexity (total number of arm pulls) and the batch complexity (total number of batches). We propose the three-batch best arm identification (Tri-BBAI) algorithm, which is the first batched algorithm that achieves the optimal sample complexity in the asymptotic setting (i.e., δ 0) and runs in 3 batches in expectation. Based on Tri-BBAI, we further propose the almost optimal batched best arm identification (Opt-BBAI) algorithm, which is the first algorithm that achieves the near-optimal sample and batch complexity in the non-asymptotic setting (i.e., δ is finite), while enjoying the same batch and sample complexity as Tri-BBAI when δ tends to zero. Moreover, in the non-asymptotic setting, the complexity of previous batch algorithms is usually conditioned on the event that the best arm is returned (with a probability of at least 1 δ), which is potentially unbounded in cases where a sub-optimal arm is returned. In contrast, the complexity of Opt-BBAI does not rely on such an event. This is achieved through a novel procedure that we design for checking whether the best arm is eliminated, which is of independent interest.

1 Introduction

Multi-armed bandit (MAB) is a fundamental model in various online decision-making problems, including medical trials [40], online advertisement [5], and crowdsourcing [43]. These problems typically involve a bandit with multiple arms, where each arm follows an unknown distribution with a mean value. At each time step, the learner selects an arm, and receives a reward sample drawn from the chosen arm s distribution. Best arm identification (BAI) aims to identify the arm with the highest mean reward, which can be approached with a fixed budget or a fixed confidence level [14, 4, 19].

In this paper, we assume that there is a unique best arm and study BAI with fixed confidence. Specifically, we consider a set [n] = {1, 2, . . . , n} of n arms, where each arm i is associated with a reward distribution having a mean value µi. Without loss of generality, for a bandit instance denoted by µ = {µ1, µ2, . . . , µn}, we assume that µ1 > µ2 µn. At each time step t, the learner selects an arm and observes a sample drawn independently from the chosen arm s distribution. In the fixed confidence setting, the learner aims to correctly identify the best arm (arm 1 in our context) with a probability of at least 1 δ, where δ > 0 is a pre-specified confidence parameter. Meanwhile, the learner seeks to minimize the total number of arm pulls, also known as the sample complexity.

Denote by a (λ) = arg maxi λi the best arm for an arbitrary bandit instance λ. Let Alt(µ) = {λ: a (λ) = 1} be a set of models that have a different best arm from the model µ, and Pk = {w Rn + : Pn i=1 wi = 1} be the probability simplex. Garivier and Kaufmann [17] showed that for bandits with reward distributions that are continuously parameterized by their means, the number Nδ of pulls

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

by any algorithm that returns the best arm with a probability of at least 1 δ is bounded by

lim inf δ 0 Eµ[Nδ] log(1/δ) T (µ), (1.1)

where T (µ) is defined according to

T (µ) 1 := sup w Pk

inf λ Alt(µ)

i=1 wi d(µi, λi)

and d(µi, λi) is the Kullback-Leibler (KL) divergence of two arms distributions with means µi and λi, respectively. We say that an algorithm achieves the asymptotically optimal sample complexity if it satisfies

lim sup δ 0

Eµ[Nδ] log(1/δ) T (µ). (1.3)

The well-known Track-and-Stop algorithm [17] solves the BAI problem with asymptotically optimal sample complexity. However, it is a fully sequential algorithm, which is hard to be implemented in parallel. The learner in such an algorithm receives immediate feedback for each arm pull, and adjusts the strategy for the next arm selection based on the previous observations. Unfortunately, this sequential approach may not be feasible in many real-world applications. For instance, in medical trials, there is typically a waiting time before the efficacy of drugs becomes observable, making it impossible to conduct all tests sequentially. Instead, the learner needs to group the drugs into batches and test them in parallel. Similarly, in crowdsourcing, the goal is to identify the most reliable worker for a specific task by testing candidates with a sequence of questions. Again, there is often a waiting time for workers to finish all the questions, necessitating the grouping of workers into batches and conducting parallel tests. In such scenarios, the results of pulling arms are only available at the end of each batch [22, 1, 39, 37, 15, 24, 25, 38].

Motivated by these applications, we study the problem of batched best arm identification (BBAI). In BBAI, we are allowed to pull multiple arms in a single batch, but the results of these pulls are revealed only after the completion of the batch. The objective is to output the best arm with a high probability of at least 1 δ, while minimizing both the sample complexity (total number of pulls) and the batch complexity (total number of batches). This leads to the following natural question:

Can we solve the BBAI problem with an asymptotically optimal sample complexity and only using a constant number of batches?

Furthermore, the aforementioned results hold only in the limit as the confidence parameter δ approaches zero, which may provide limited practical guidance since we often specify a fixed confidence level parameter δ > 0. To address this, some algorithms [31, 21] have been proposed to solve the BAI problem with finite confidence. Specifically, for some universal constant C, these algorithms satisfy that with probability 1 δ,

1 2 i log log 1 i

where i = µ1 µi. In addition, Jamieson et al. [21] demonstrated that for two-armed bandits, the term log log 1 2 2 2 is optimal as 2 0, where 2 is assumed to be the gap between the best arm and the second best arm. In the context of the batched setting, Jin et al. [22] proposed an algorithm that achieves the sample complexity in (1.4) within O(log (n) log(1/ 2)) batches, where log (n) is the iterated logarithm function1. Furthermore, Tao et al. [39] proved that for certain bandit instances, any

algorithm that achieves the sample complexity bound shown in (1.4) requires at least Ω log 1 2 log log 1 2

batches. It should be noted that the batched lower bound proposed by Tao et al. [39] assumes δ as a constant, making it inapplicable in the asymptotic setting. Therefore, an additional question that arises is:

Can we achieve the optimal sample complexity in (1.3) and (1.4) adaptively, taking into account the specified confidence level parameter δ, while minimizing the number of batches?

1Specifically, log (n) denotes the number of times the function log( ) needs to be applied to n until the result is less than 1.

Table 1: Comparison of sample and batch complexity of different algorithms. In the asymptotic setting (i.e., δ 0), the sample complexity of an algorithm is optimal if it satisfies the definition in (1.3). The field marked with indicates that the result is not provided. The sample complexity presented for [31, 22] is conditioned on the event that the algorithm returns the best arm, which can be unbounded when it returns a sub-optimal arm with certain (non-zero) probability (see Remark 4.4 for more details). In contrast, the sample complexity presented for [17, 26] and our algorithms is the total expected number of pulls that will be executed.

Algorithm Asymptotic behavior (δ 0) Finite-confidence behavior

Sample complexity Batch complexity Sample complexity Batch complexity

Karnin et al. [31] Not optimal O(log n log n 2 ) O P

i>1 log(log 1 i ) 2 i

O(log n log n 2 )

Jin et al. [22] Not optimal O(log 1 2 ) O P i>1 log(log 1 i ) 2 i

O(log (n) log 1 2 )

Wang et al. [42] Optimal O( p

log(1/δ)) O(n H(µ)4/w2 min) - Agrawal et al. [2] Optimal T (µ) log δ 1

m (m = o(log δ 1))

Karpov et al. [32] O P

i>1 log(n log 1 i ) 2 i O(log 1 2 )

Lower bound [39] O P i>1 log(log 1 i ) 2 i

Ω(log(1/ 2)) Tri-BBAI (Our Algorithm 1) Optimal 3

Opt-BBAI (Our Algorithm 2) Optimal 3 O P

i>1 log(n log 1 i ) 2 i O(log 1 2 )

In this work, we provide positive answers to both of the aforementioned questions. Specifically, the main contributions of our paper can be summarized as follows:

We propose Tri-BBAI (Algorithm 1) that returns the best arm with a probability of at least 1 δ by pulling all arms for a total number of at most T (µ) log(1/δ) times when δ 0. Tri-BBAI employs three batches in expectation when δ 0. Therefore, Tri-BBAI achieves the optimal sample within constant batches. As a comparison, Track-and-Stop [17] also achieves the optimal sample complexity but requires solving the right-hand side of (1.2) after each arm pull, resulting in a batch complexity of the same order as the sample complexity, which is a significant computational overhead in practice. Built upon Tri-BBAI, we further propose Opt-BBAI (Algorithm 2) that runs in O(log(1/ 2)) expected batches and pulls at most O P

i>1 2 i log(n log 1 i ) expected number of arms for finite confidence δ. It is also important to note that Opt-BBAI achieves the same sample and expected batch complexity as Tri-BBAI asymptotically when δ 0. Moreover, for the finite confidence case, this sample complexity matches (1.4) within a log( ) factor and matches the optimal batched complexity within a log log( ) factor. To the best of our knowledge, Opt-BBAI is the first batched bandit algorithm that can achieve the optimal asymptotic sample complexity and the near-optimal non-asymptotic2 sample complexity adaptively based on the assumption on δ. Notably, in the non-asymptotic setting, the complexity of earlier batch algorithms [22, 32, 39, 24] typically depends on the event of returning the best arm, which occurs with a probability of at least 1 δ. However, this complexity could potentially become unbounded if a sub-optimal arm is returned instead. Unlike these algorithms, the complexity of Opt-BBAI is not contingent on such an event. This is made by employing an innovative procedure to verify if the best arm is eliminated, a method that holds its independent interest. To the best of our knowledge, Opt-BBAI is the first algorithm that achieves the optimal asymptotic sample complexity while providing the optimal non-asymptotic sample complexity within logarithm factors. We also conduct numerical experiments3 to compare our proposed algorithms with the optimal sequential algorithm Track-and-Stop [17], and the batched algorithm Top-k δ-Elimination [22] on various problem instances. The results indicate that our algorithm significantly outperforms the Track and Stop method in terms of batch complexity, while its sample complexity is not much worse than that of Track and Stop. Additionally, our algorithm demonstrates a notable improvement in sample complexity compared to [22], while exhibiting similar batch complexity.

For ease of reference, we compare our results with existing work on batched bandits in Table 1.

2Non-asymptotic here refers to finite confidence. In this paper, we use names non-asymptotic and finite confidence interchangeably. 3Due to the space limit, we put the experimental results in Appendix E.

Table 2: Comparison of sample complexity of different algorithms.

Algorithm Asymptotic behavior (δ 0) Finite-confidence behavior

Kalyanakrishnan et al. [30] O(H(µ) log(1/δ)) O(H(µ) log(H(µ)))

Karnin et al. [31] O(H(µ) log(1/δ)) O P

i>1 log(log 1 i ) 2 i

Garivier and Kaufmann [17] T (µ) log(1/δ)+o(1/δ) -

Jamieson et al. [21] O(H(µ) log(1/δ)) O P

i>1 log(log 1 i ) 2 i

Jourdan and Degenne [26] T β (µ) log(1/δ)+o(1/δ) O ((H(µ) log H(µ))α), α > 1 Degenne et al. [12] T (µ) log(1/δ)+o(1/δ) O(n T (µ)2) Katz-Samuels et al. [33] O(T (µ) log(1/δ)) O(H(µ) log(n/ min)) Wang et al. [42] T (µ) log(1/δ)+o(1/δ) O(n H(µ)4/w2 min)

Opt-BBAI (Our Algorithm 2) T (µ) log(1/δ)+o(1/δ) O P

i>1 log(n log 1 i ) 2 i

2 Related Work

The BAI problem with fixed confidence is first studied for [0, 1] bounded rewards by Even-Dar et al. [14]. The sample complexity of their algorithm scales with the sum of the squared inverse gap, i.e., H(µ) = P

i>1 1/ 2 i . Garivier and Kaufmann [17] showed that H(µ) < T (µ) 2H(µ) with T (µ) defined in (1.2) and proposed the Track-and-Stop algorithm, which is the first one in the literature proved to be asymptotically optimal. Later, Degenne et al. [12] viewed T (µ) as a min-max game and provided an efficient algorithm to solve it. Jourdan et al. [27] studied the asymptotically optimal sample complexity for any reward distributions with bounded support. Degenne et al. [13] studied pure exploration in linear bandits. Their proposed algorithm proved to be both asymptotically optimal and empirically efficient.

There are also many studies [30, 10, 8, 21, 31] that focus on providing non-asymptotic optimal sample complexity. The best non-asymptotic sample complexity was achieved by Chen et al. [10], which replaces the term log log 1 i in (1.4) with a much smaller term. Furthermore, when we allow a loss ϵ of the quality of the returned arm, the problem is known as (ϵ, δ)-PAC BAI, for which various algorithms [29, 14, 6, 9] are proposed, achieving the worst-case optimal sample complexity.

There are a few attempts to achieve both the asymptotic and non-asymptotic optimal sample complexity. Degenne et al. [12] provided a non-asymptotic sample complexity O n T (µ)2 , which could be n T (µ) larger than the optimal sample complexity. Recently, Jourdan and Degenne [26] managed to achieve a sample complexity that is β-asymptotically optimal (with w1 fixed at 1/β in (1.2)), rendering it asymptotically optimal up to a factor of 1/β. Meanwhile, they also reached a non-asymptotic sample complexity of O (H(µ) log H(µ))α 4 for some α > 1, where H(µ) = P

i>1 1/ 2 i . Wang et al. [42] explored both asymptotic and non-asymptotic sample complexities. Their algorithm achieves asymptotic optimality and shows a non-asymptotic sample complexity of O(n H(µ)4/w2 min). However, this non-asymptotic sample complexity is n H(µ)3/w2 min away from being optimal. Jourdan et al. [28] studied (ϵ, δ)-PAC BAI, proposing an asymptotically optimal algorithm and providing non-asymptotic sample complexity. When ϵ = 0, it aligns with our setting, our non-asymptotic sample complexity is better scaled. Specifically, Jourdan et al. [28] offered a non-asymptotic sample complexity scale of n/ 2 2 log(1/ 2), whereas ours is more instance-sensitive, as our sample complexity is related to all gaps, not just 2. Additionally, Jourdan et al. [28] considered a practical scenario where the algorithm can return a result at any time while still ensuring a good guarantee on the returned arm. For ease of reference, we summarize the sample complexity of different algorithms in Table 2. As shown in Table 2, our algorithm is the only one that achieves both the asymptotic optimality and non-asymptotic optimality within logarithm factors.

Another line of research [31, 4, 7] investigated BAI with a fixed budget, where the objective is to determine the best arm with the highest probability within T pulls. Audibert et al. [4] and Karnin et al. [31] offered finite-time bounds for this problem, while Carpentier and Locatelli [7] presented a tight lower bound, demonstrating that such finite-time bounds[4, 31] are optimal for certain bandit

4To derive the near-optimal non-asymptotic sample complexity, α should be 1. However, As explained in their original paper, the algorithm will be sub-optimal if we set α close to 1.

instances. However, the asymptotic optimality for this problem remains unknown. Interested readers are referred to recent advancements[11, 35] in the asymptotic results of BAI with a fixed budget.

In addition, some recent works [1, 22, 39, 32] also focused on batched BAI in non-asymptotic setting. Agarwal et al. [1] studied the batched BAI problem under the assumption that 2 is known. They proposed an algorithm that has the worst-case optimal sample complexity of O n 2 2 and runs in log (n) batches. Later, Jin et al. [22] provided the algorithms that achieves the sample complexity given in (1.4) within O(log(1/ 2)) batches, where O hides the log log( ) factors. Tao et al. [39] studied the BAI problem in the general collaborative setting and showed that no algorithm can achieve (1.4) with o((log 1 2 )/ log log 1 2 ) batches. Karpov et al. [32] further proposed an algorithm which has the sample complexity P

i>1 log(n log 1)

2 i and the batch complexity O(log(1/ 2)). We note that the lower bound of batch complexity given by Tao et al. [39] can only be applied to a constant δ. In other words, the lower bound of complexity for the asymptotic setting remains unknown. Agrawal et al. [2] studied the optimal batch size for keeping the asymptotic optimality. They showed an algorithm with batch size m = o(log(1/δ)) achieves the asymptotic optimality. The batch complexity of the algorithm is O(T (µ) log(1/δ)/m). Wang et al. [42] provided an algorithm that uses O( p

log δ 1) batches and retains asymptotic optimality for linear bandits. However, for the asymptotic setting, such batch size is still too large as it grows to infinity as δ decreases to 0.

3 Achieving Asymptotic Optimality with at Most Three Batches

3.1 Reward Distribution

We assume the reward distributions belong to a known one-parameter exponential family that is commonly considered in the literature [17, 16, 36]. In particular, the measure νθ of such probability distributions with respect the model parameter θ satisfies dνθ

dρ (x) = exp(xθ b(θ)), for some measure ρ and b(θ) = log( R exθdρ(x)). For one-parameter exponential distributions, it is known that b (θ) = E[νθ] and the mapping b (θ) 7 θ is one-to-one. Moreover, given any two mean values µ and µ , we define d(µ, µ ) to be the Kullback-Leibler divergence between two distributions with mean values µ and µ .

3.2 The Proposed Three-Batch Algorithm

Algorithm 1 shows the pseudo-code of our proposed Tri-BBAI algorithm. In particular, Tri-BBAI has four stages. In what follows, we elaborate on the details of each stage.

Stage I: Initial exploration. In this stage, we pull each arm for L1 times. Denote by i (t) the arm with the largest empirical mean at time t (i.e., after we pull all arms for a total number of t times), i.e., i (t) = maxi [n] ˆµi(t). Let τ0 = n L1. Fix t = τq 1 τ0, we let bq i = ˆµi(t) + ϵ for i = i (t) and bq i = ˆµi(t) ϵ for i = i (t). Let w (µ) = arg maxw Pk infλ Alt(µ) (Pn i=1wi d(µi, λi)). Then, for the aforementioned bq = {bq 1, bq 2, . . . , bq n}, we calculate w (bq) according to Lemma A.1 and T (bq) according to Lemma A.2. We note that arm 1 is assumed to be the arm with the highest mean in these two lemmas. However, in the context of bq, the index of i (t) might be different. To align with the standard practice, we can rearrange the indices of the arms in bq so that i (t) corresponds to index 1.

Purpose. To achieve the asymptotic optimality, we attempt to pull each arm i for around w i (µ)T (µ) times. We can show that with a high probability, w i (bq)T (bq) is close to w i (µ)T (µ), which implies that pulling arm i for a number of times proportional to w i (bq)T (bq) is likely to ensure the asymptotic optimal sample complexity.

Stage II: Exploration using w (bq) and T (bq). Stage II operates in batches with the maximum number of batches determined by log(1/δ). At batch q, each arm i is pulled maxp:p N,p [1,q] T q i times in total. Here T q i := min αw i (bq)T (bq) log δ 1, L2 , (3.1)

where the definition of w i (bq) and T (bq) could be found in Stage I. We then evaluate the stage switching condition, |w i (bq) w i (bq 1)| 1/ n for all i [n]. If this condition is met, we go to the next stage; otherwise, we proceed to the next batch within Stage II.

Algorithm 1: Three-Batch Best Arm Identification (Tri-BBAI) Input: Parameters ϵ, δ, L1, L2, L3, α and function β(t, δ). Output: The estimated optimal arm.

1 Stage I: Round 1 exploration

2 for i 1 to n do

3 Pull arm i for L1 times;

4 t n L1 and τ0 t;

5 Stage II:Round 2 exploration

6 Let w 0(b0) = (1/n, 1/n, , 1/n) and T 0 i = L1 for all i [n];

7 for q = 1, 2 , log(1/δ) do

8 i (t) maxi [n] ˆµi(t);

9 for each i [n] \ {i (t)} do

10 bq i ˆµi(t) + ϵ;

11 bq i (t) ˆµi (t)(t) ϵ;

12 for i 1 to n do

13 Pull arm i for {0, T q i maxp:p N,p [0,q) T p i } times; /**/ Note that T q i = min αw i (bq)T (bq) log δ 1, L2 by (3.1)

14 t t + {0, T q i maxp:p N,p [0,q) T p i };

15 if |w i (bq) w i (bq 1)| 1/ n for all i [n] then

18 τ t, and i (τ) = maxi [n] ˆµi(τ);

19 Stage III: Statistical test with Chernoff s stopping rule;

20 Compute Zj(τ) according to (3.3) ;

21 if minj [n]\{i (τ)} Zj(τ) β(τ, δ/2) then

22 return i (τ);

23 Stage IV: Round 3 exploration for i 1 to n do

24 Pull arm i for total max{0, L3 L1 Ti} times;

25 t n L3, and i (t) maxi [n] ˆµi(t);

26 return i (t) ;

Purpose. Pulling arm i proportional to w i (bq)T (bq) provides statistical evidence for the reward distributions without sacrificing sample complexity compared to the optimality per our above discussion. Meanwhile, we also set a threshold L2 to avoid over-exploration due to sampling errors from Stage I.

The rationale for running Stage II in multiple batches is based on empirical considerations. In experiments, δ is always finite. Consequently, the error of arm i, |ˆµi(t) µi|, remains constant since the number of pulls of arms is limited. Given that the stopping rule in Stage III is highly dependent on the error of arm i and the number of pulls Ti := maxp:p N,p 1 T p i , there is a constant probability that the stopping rule may not be met, leading to significant sample costs in Stage IV. Adding the condition in Line 15 ensures that w (bq) converges and that the sample size T q i , as defined by w (bq) and T (bq), closely approximates αw (µ)T (µ) log δ 1. This alignment significantly increases the probability that the stopping rule will be satisfied in experiments.

Moreover, such modification doesn t hurt any theoretical results. To explain, in our analysis for Tri-BBAI, we demonstrate that as δ approaches 0, w (bq) will be very close to w (µ) and the probability that Line 15 is not satisfied could be bounded by 1/ log2 δ 1, which means with high probability Stage II costs 2 batches. Besides, q log(1/δ), which implies that even if Line 15 is not satisfied, the number of batches required for Stage II is at most log(1/δ). Therefore, the expected number of batches required for Stage II is 2 as δ approaches 0.

Stage III: Statistical test with Chernoff s stopping rule. Denote by Ni(t) the number of pulls of arm i at time t. For each pair of arms i and j, define their weighted empirical mean as

ˆµij(t) := Ni(t) ˆµi(t) Ni(t) + Nj(t) + Nj(t) ˆµj(t)

Ni(t) + Nj(t), (3.2)

where ˆµi(t) and ˆµj(t) are the empirical means of arms i and j at time t. For ˆµi(t) ˆµj(t), define

Zij(t) := d(ˆµi(t), ˆµij(t))Ni(t) + d(ˆµj(t), ˆµij(t))Nj(t),

Zj(t) := Zi (t)j(t). (3.3)

We test whether Chernoff s stopping condition is met (Line 21). If so, we return the arm with the largest empirical mean, i.e., i (τ), where τ is the total number of pulls examined after Stage II.

Purpose. The intuition of using Chernoff s stopping rule for the statistical test is two-fold. Firstly, if Chernoff s stopping condition is met, with a probability of at least 1 δ/2, the returned arm i (τ) in Line 22 is the optimal arm (see Lemma B.1). Secondly, when δ is sufficiently small, with high probability, Chernoff s stopping condition holds (see Lemma B.4). As a consequence, our algorithm identifies the best arm successfully with a high probability of meeting the requirement.

Stage IV: Re-exploration. If the previous Chernoff s stopping condition is not met, we pull each arm until the total number of pulls of each arm eqauls L3 taking into account the pulls in the previous stages (Line 24). Finally, the arm with the largest empirical mean is returned (Line 26).

Purpose. If Chernoff s stopping condition is not met, i (τ) might not be the optimal arm. In addition, when each arm is pulled for L3 times, we are sufficiently confident that i (t) is the best arm. Since the probability of Stage IV happening is very small, its impact on the sample complexity is negligible.

3.3 Theoretical Guarantees of Tri-BBAI

In the following, we present the theoretical results for Algorithm 1. Theorem 3.1 (Asymptotic Sample Complexity). Given any δ > 0, let ϵ = 1 log log(δ 1), L1 = p

log δ 1, L2 = log δ 1 log log δ 1

n , and L3 = (log δ 1)2. Meanwhile, for any given α (1, e/2], define function β(t, δ) as β(t, δ) = log(log(1/δ)tα/δ).5 Then, for any bandit instance µ, Algorithm 1 satisfies

lim sup δ 0

Eµ[Nδ] log(1/δ) αT (µ).

By letting α in Theorem 3.1 approach 1 (e.g., α = 1 + 1/ log δ 1), we obtain the asymptotic optimal sample complexity. Theorem 3.2 (Correctness). Let ϵ, L1, L2, L3, α, and β(t, δ) be the same as in Theorem 3.1. Then, for sufficiently small δ > 0, Algorithm 1 satisfies Pµ(i (Nδ) = 1) δ. Theorem 3.3 (Asumptotic Batch Complexity). Let ϵ, L1, L2, L3, α, and β(t, δ) be the same as in Theorem 3.1. For sufficiently small δ > 0, Algorithm 1 runs within 3 + o(1) batches in expectation. Besides, Algorithm 1 runs within 3 batches with probability 1 1/ log(1/δ2).

To the best of our knowledge, all previous works in the BAI literature [18, 12, 2, 42] that achieve the asymptotic optimal sample complexity require unbounded batches as δ 0. In contrast, Tri-BBAI achieves the asymptotic optimal sample complexity and runs within 3 batches in expectation, which is a significant improvement in the batched bandit setting where switching to new policies is expensive. Remark 3.4. Apart from best arm identification, regret minimization is another popular task in bandits, where the aim is to maximize the total reward in T rounds. Jin et al. [25] proposed an algorithm that achieves a constant batch complexity for regret minimization and showed that their algorithm is optimal when T goes to infinity. In regret minimization, the cost of pulling the optimal arm is 0, indicating that the allocation wi (i.e., the proportion of pulling the optimal arm) is close to 1. In the BAI problem, the main hardness is to find the allocation wi for each arm since even pulling arm 1 will increase the sample complexity of the algorithm. Therefore, the strategy proposed by Jin et al. [25] cannot be applied to the BAI problem.

5Recent work by Kaufmann and Koolen [34] offers improved deviation inequalities allowing for a smaller selection of β(t, δ) without sacrificing the asymptotic optimality. However, it remains uncertain whether this refined parameter choice is applicable to our batched bandit problem. For ease of presentation, we use the parameter choice of β in Garivier and Kaufmann [17].

Algorithm 2: (Almost) Optimal Batched Best Arm Identification (Opt-BBAI) Input: Parameters δ, ϵ, L1, L2, L3, α and function β(t, δ). Output: The estimated optimal arm.

1 Stage I, II, and III: the same as that in Algorithm 1

2 Stage IV: Exploration with Exponential Gap Elimination

3 Sr [n], B0 0, r 1;

4 while |Sr| > 1 do

5 Let ϵr 2 r/4, δr δ/(40π2n r2), ℓr 0, and γr δr;

6 Successive elimination /**/ Eliminate arms whose means are lower than µ1 by at least ϵr 7 for each arm i Sr do

8 Pull arm i for dr 32

ϵ2r log(2/δr) times;

9 Let ˆpr i be the empirical mean of arm i;

10 Let maxi Sr ˆpr i ;

11 Set Sr+1 Sr \ {i Sr : ˆpr i < ˆpr

12 Checking for best arm elimination /**/ Reduce the risk of the best arm being eliminated

13 Let Br Br 1 + dr|Sr|;

14 for j < r do

15 for each arm i Sj \ Sj+1 do

16 if Brγj 2ℓj > Bj then

17 γj (γj)2;

18 Repull arm i for total 32

ϵ2 j log(2/γj) times;

19 Let ˆpj i be the empirical mean of arm i in Sj;

20 ℓj ℓj + 1;

21 if i Sj, ˆpj i > ˆpr

22 return Randomly return an arm in Sr;

23 r r + 1;

24 return The arm in Sr;

4 Best of Both Worlds: Achieving Asymptotic and Non-asymptotic Optimalities

The Tri-BBAI algorithm is shown to enjoy the optimal sample complexity with only three batches in the asymptotic setting. However, in practice, we are limited to a finite number of samples and thus δ cannot go to zero, which is a critical concern in real-world applications. Consequently, obtaining the optimal sample and batch complexity in a non-asymptotic setting becomes the ultimate objective of a practical bandit strategy in BBAI. In this section, we introduce Opt-BBAI, which can attain the optimal sample and batch complexity in asymptotic settings and near-optimal sample and batch complexity in non-asymptotic settings.

We assume a bounded reward distribution within [0, 1], which aligns with the same setting in the literature [31, 21]. Again, we consider that the reward distribution belongs to a one-parameter exponential family. By refining Stage IV of Algorithm 1, we can achieve asymptotic optimality and near non-asymptotic optimality adaptively based on the assumption on δ in various settings.

The pseudo-code for the algorithm is provided in Algorithm 2. The main modification from Algorithm 1 occurs in Stage IV. Intuitively, if the algorithm cannot return at Stage III, then the value of log δ 1 may be comparable to other problem parameters, such as 1/ 2. Therefore, we aim to achieve the best possible sample and batch complexity for the non-asymptotic scenario. Stage IV operates in rounds, progressively eliminating sub-optimal arms until a result is obtained. Each round consists of two components: Successive Elimination and Checking for Best Arm Elimination.

Successive Elimination. In the r-th round, we maintain a set Sr, a potential set for the best arm. Each arm in Sr is then pulled dr = 32/ϵ2 r log(2/δr) times. At Line 11, all possible sub-optimal arms are eliminated. This first component of Stage IV borrows its idea from successive elimination [14].

Purpose. After dr number of pulls, arms are likely to be concentrated on their true means within a distance of ϵr/4 with a high probability. Hence, with high probability, the best arm is never eliminated at every round of Line 11, and the final remaining arm is the best arm.

Issue of Best Arm Elimination. Due to successive elimination, there is a small probability ( δr) that the best arm will be eliminated. The following example illustrates that, conditioning on this small-probability event, the sample and batch complexity of the algorithm could become infinite.

Example 4.1. Consider a bandit instance where µ1 > µ2 = µ3 > µ4 µn. If the best arm is eliminated at a certain round and never pulled again, the algorithm tasked with distinguishing between the 2nd and 3rd best arms will likely never terminate due to their equal means, leading to unbounded sample and batch complexity.

To address this issue, we introduce a Check for Best Arm Elimination component into Stage IV.

Checking for Best Arm Elimination. In the r-th round, we represent the total sample complexity used up to the r-th round as Br. We employ γj as an upper bound for the probability that the best arm is eliminated in Sj. If Line 16 is true (Brγj 2ℓj > Bj), we adjust γj to γ2 j and pull each arm in Sj for 32/ϵ2 j log(2/γj) times (Line 18), subsequently updating their empirical mean (Line 19). Finally, we return a random arm in Sr if the condition at Line 21 holds.

Purpose. If Line 16 is satisfied, it indicates that the sample costs, based on the event of the best arm being eliminated in the r-th round, exceed Bj/2ℓj. In this case, we increase the number of samples for arms in Sj and re-evaluate if their empirical mean is lower than that of the current best arm . This ensures that the expected total sample costs, assuming the best arm is eliminated at Sj, are bounded by P ℓj=1 Bj/2ℓj Bj. If Line 21 holds, we randomly return an arm from Sr. Since this only happens with a small probability, we still guarantee that Algorithm 2 will return the best arm with a probability of at least 1 δ.

In what follows, we provide the theoretical results for Algorithm 2.

Theorem 4.2. Let ϵ, L1, L2, L3, α, and β(t, δ) be the same as in Theorem 3.1. For finite δ (0, 1), Algorithm 2 identifies the optimal arm with probability at least 1 δ and there exists some universal constant C such that E[Nδ] C P

i>1 2 i log n log 1 i , and the algorithm runs in C log(1/ 2) expected batches.

When δ is allowed to go to zero, we also have the following result.

Theorem 4.3. Let ϵ, L1, L2, L3, α, and β(t, δ) be the same as in Theorem 3.1. Algorithm 2 identifies the optimal with probability at least 1 δ and its sample complexity satisfies lim supδ 0 Eµ[Nδ]/log(1/δ) αT (µ), and the expected batch complexity of Algorithm 2 converges to 3 when δ approaches 0.

The results in Theorems 4.2 and 4.3 state that Algorithm 2 achieves both the asymptotic optimal sample complexity and a constant batch complexity. Moreover, it also demonstrates near-optimal performance in both non-asymptotic sample complexity and batch complexity. Notably, this is the first algorithm that successfully attains the optimal or near-optimal sample and batch complexity, adaptively, in asymptotic and non-asymptotic settings.

Remark 4.4. Jin et al. [22] achieved near-optimal sample and batch complexity in a non-asymptotic setting. However, their results are contingent on the event that the algorithm can find the best arm (with probability 1 δ). Consequently, with a probability of δ there is no guarantee for its batch and sample complexity to be bounded. As demonstrated in Example 4.1, the batch and sample complexity in [22] could even be infinite.

In contrast, the batch and sample complexity introduced in Theorem 4.2 is near-optimal and does not rely on any specific event due to the procedure checking for best arm elimination we proposed. Our technique could be of independent interest and could be further applied to existing eliminationbased BAI algorithms [14, 21, 22, 31] to ensure that the sample and batch complexity is independent of the low-probability event that the best arm is eliminated.

5 Conclusion, Limitations, and Future Work

In this paper, we studied the BAI problem in the batched bandit setting. We proposed a novel algorithm, Tri-BBAI, which only needs 3 batches in expectation to find the best arm with probability 1 δ and achieves the asymptotic optimal sample complexity. We further proposed Opt-BBAI and theoretically showed that Opt-BBAI has a near-optimal non-asymptotic sample and batch complexity while still maintaining the asymptotic optimality as Tri-BBAI does.

In our experiments, although Tri-BBAI utilizes a limited number of batches, its sample complexity does not match that of Garivier and Kaufmann [17]. Designing a batched algorithm with sample complexity comparable to Garivier and Kaufmann [17], while maintaining a constant number of batches, presents an intriguing challenge.

As for future work, an interesting direction is to investigate whether our checking for best arm elimination could be beneficial to other elimination-based algorithms. Additionally, some research [3, 23] implied a strong correlation between batch complexity in batched bandit, and the memory complexity and pass complexity in streaming bandit. Thus, it could be valuable to assess if our techniques could enhance the results in the field of streaming bandit.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. This research is funded by a Singapore Ministry of Education Ac RF Tier 2 grant (A-8000423-00-00), by the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG-Ph D/202101004[T]), by the Ministry of Education, Singapore, under its MOE Ac RF TIER 3 Grant (MOEMOET32022-0001), by National Key R&D Program of China under Grant No. 2023YFF0725100, by the National Natural Science Foundation of China (NSFC) under Grant No. 62402410 and U22B2060, by Guangdong Basic and Applied Basic Research Foundation under Grant No. 2023A1515110131, and by Guangzhou Municipal Science and Technology Bureau under Grant No. 2023A03J0667 and 2024A04J4454.

In particular, T. Jin was supported by a Singapore Ministry of Education Ac RF Tier 2 grant (A8000423-00-00) and by the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG-Ph D/2021-01004[T]). X. Xiao was supported by the Ministry of Education, Singapore, under its MOE Ac RF TIER 3 Grant (MOE-MOET32022-0001). J. Tang was partially supported by National Key R&D Program of China under Grant No. 2023YFF0725100, by the National Natural Science Foundation of China (NSFC) under Grant No. 62402410 and U22B2060, by Guangdong Basic and Applied Basic Research Foundation under Grant No. 2023A1515110131, and by Guangzhou Municipal Science and Technology Bureau under Grant No. 2023A03J0667 and 2024A04J4454. P. Xu was supported in part by the Whitehead Scholars Program at the Duke University School of Medicine. The views and conclusions in this paper are those of the authors and should not be interpreted as representing any funding agencies.

[1] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Proc. COLT, pages 39 75, 2017. (pp. 2 and 5.)

[2] Shubhada Agrawal, Sandeep Juneja, and Peter Glynn. Optimal δ-Correct Best-Arm Selection for Heavy-Tailed Distributions. In Algorithmic Learning Theory, pages 61 110. PMLR, 2020. (pp. 3, 5, and 7.)

[3] Sepehr Assadi and Chen Wang. Exploration with limited memory: streaming algorithms for coin tossing, noisy comparisons, and multi-armed bandits. In Proceedings of the 52nd Annual ACM SIGACT Symposium on theory of computing, pages 1237 1250, 2020. (p. 10.)

[4] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT, pages 41 53, 2010. (pp. 1 and 4.)

[5] Dimitris Bertsimas and Adam J Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120 1135, 2007. (p. 1.)

[6] Wei Cao, Jian Li, Yufei Tao, and Zhize Li. On top-k selection in multi-armed bandits and hidden bipartite graphs. In Proc. Neur IPS, pages 1036 1044, 2015. (p. 4.)

[7] Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pages 590 604. PMLR, 2016. (p. 4.)

[8] Lijie Chen and Jian Li. On the optimal sample complexity for best arm identification. ar Xiv preprint ar Xiv:1511.03774, 2015. (p. 4.)

[9] Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. In Proc. COLT, pages 647 669, 2016. (p. 4.)

[10] Lijie Chen, Jian Li, and Mingda Qiao. Towards Instance Optimal Bounds for Best Arm Identification. In Proc. COLT, pages 535 592, 2017. (p. 4.)

[11] Rémy Degenne. On the Existence of a Complexity in Fixed Budget Bandit Identification. ar Xiv preprint ar Xiv:2303.09468, 2023. (p. 5.)

[12] Rémy Degenne, Wouter M Koolen, and Pierre Ménard. Non-asymptotic pure exploration by solving games. Advances in Neural Information Processing Systems, 32, 2019. (pp. 4 and 7.)

[13] Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pages 2432 2442. PMLR, 2020. (p. 4.)

[14] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Proc. COLT, pages 255 270, 2002. (pp. 1, 4, 8, and 9.)

[15] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched Multi-armed Bandits Problem. In Proc. Neur IPS, pages 501 511, 2019. (p. 2.)

[16] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proc. COLT, pages 359 376, 2011. (p. 5.)

[17] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proc. COLT, pages 998 1027, 2016. (pp. 1, 2, 3, 4, 5, 7, 10, 13, 25, and 26.)

[18] Aurélien Garivier, Tor Lattimore, and Emilie Kaufmann. On explore-then-commit strategies. In Proc. Neur IPS, pages 784 792, 2016. (p. 7.)

[19] Aditya Grover, Todor Markov, Peter Attia, Norman Jin, Nicolas Perkins, Bryan Cheong, Michael Chen, Zi Yang, Stephen Harris, William Chueh, et al. Best arm identification in multi-armed bandits with delayed feedback. In Proc. AISTATS, pages 833 842, 2018. (p. 1.)

[20] Peter Harremoës. Bounds on tail probabilities in exponential families. ar Xiv preprint ar Xiv:1601.05179, 2016. (p. 18.)

[21] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil ucb: An optimal exploration algorithm for multi-armed bandits. In Proc. COLT, pages 423 439, 2014. (pp. 2, 4, 8, and 9.)

[22] Tianyuan Jin, Shi Jieming, Xiaokui Xiao, and Enhong Chen. Efficient Pure Exploration in Adaptive Round model. In Proc. Neur IPS, pages 6605 6614, 2019. (pp. 2, 3, 5, 9, and 25.)

[23] Tianyuan Jin, Keke Huang, Jing Tang, and Xiaokui Xiao. Optimal streaming algorithms for multi-armed bandits. In International Conference on Machine Learning, pages 5045 5054. PMLR, 2021. (pp. 10, 25, and 26.)

[24] Tianyuan Jin, Jing Tang, Pan Xu, Keke Huang, Xiaokui Xiao, and Quanquan Gu. Almost optimal anytime algorithm for batched multi-armed bandits. In International Conference on Machine Learning, pages 5065 5073. PMLR, 2021. (pp. 2, 3, and 15.)

[25] Tianyuan Jin, Pan Xu, Xiaokui Xiao, and Quanquan Gu. Double explore-then-commit: Asymptotic optimality and beyond. In Conference on Learning Theory, pages 2584 2633. PMLR, 2021. (pp. 2 and 7.)

[26] Marc Jourdan and Rémy Degenne. Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. ar Xiv preprint ar Xiv:2210.05431, 2022. (pp. 3 and 4.)

[27] Marc Jourdan, Rémy Degenne, Dorian Baudry, Rianne de Heide, and Emilie Kaufmann. Top two algorithms revisited. ar Xiv preprint ar Xiv:2206.05979, 2022. (p. 4.)

[28] Marc Jourdan, Rémy Degenne, and Emilie Kaufmann. An varepsilon-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond. Advances in Neural Information Processing Systems, 36:16578 16649, 2023. (p. 4.)

[29] Shivaram Kalyanakrishnan and Peter Stone. Efficient selection of multiple bandit arms: theory and practice. In Proc. ICML, pages 511 518, 2010. (p. 4.)

[30] Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC Subset Selection in Stochastic Multi-armed Bandits. In Proc. ICML, pages 655 662, 2012. (p. 4.)

[31] Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Proc. ICML, pages 1238 1246, 2013. (pp. 2, 3, 4, 8, and 9.)

[32] Nikolai Karpov, Qin Zhang, and Yuan Zhou. Collaborative top distribution identifications with limited interaction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 160 171. IEEE, 2020. (pp. 3, 5, 25, and 26.)

[33] Julian Katz-Samuels, Lalit Jain, Kevin G Jamieson, et al. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. Advances in Neural Information Processing Systems, 33:10371 10382, 2020. (p. 4.)

[34] Emilie Kaufmann and Wouter M Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. The Journal of Machine Learning Research, 22(1): 11140 11183, 2021. (p. 7.)

[35] Junpei Komiyama, Taira Tsuchiya, and Junya Honda. Minimax optimal algorithms for fixedbudget best arm identification. Advances in Neural Information Processing Systems, 35: 10393 10404, 2022. (p. 5.)

[36] Pierre Ménard and Aurélien Garivier. A minimax and asymptotically optimal algorithm for stochastic bandits. In Proc. ALT, pages 223 237, 2017. (pp. 5, 15, and 18.)

[37] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. The Annals of Statistics, 44(2):660 681, 2016. (p. 2.)

[38] Xuanfei Ren, Tianyuan Jin, and Pan Xu. Optimal Batched Linear Bandits. In Forty-first International Conference on Machine Learning, volume 235, pages 42391 42416. PMLR, 2024. (p. 2.)

[39] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126 146. IEEE, 2019. (pp. 2, 3, and 5.)

[40] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285 294, 1933. (p. 1.)

[41] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. (p. 20.)

[42] Po-An Wang, Ruo-Chun Tzeng, and Alexandre Proutiere. Fast pure exploration via frank-wolfe. Advances in Neural Information Processing Systems, 34:5810 5821, 2021. (pp. 3, 4, 5, and 7.)

[43] Yuan Zhou, Xi Chen, and Jian Li. Optimal PAC multiple arm identification with applications to crowdsourcing. In Proc. ICML, pages 217 225, 2014. (p. 1.)

Denote by ˆµs i the empirical mean of arm i after its s-th pull and by ˆµi(t) the empirical mean of arm i at time t (i.e., after a total number of t pulls are examined for all arms). We use i = µ1 µi for the gap between the optimal arm and the i-th arm. Throughout the paper, we use the following asymptotic notation. Specifically we use f(δ) g(δ) to denote that there exists some δ0, such that for any δ < δ0, f(δ) g(δ). Similarly, we use f(δ) g(δ) to denote that there exists some δ0, such that for any δ < δ0, f(δ) g(δ).

A Computing w (µ) and T (µ)

In this section, we introduce two useful lemmas that help us to calculate w (b) and T (b) in Algorithm 1 and Algorithm 2, which are also used in [17]. For ease of presentation, we first define some useful notations. For every c [0, 1], we define

Ic(µ, µ ) := cd(µ, cµ + (1 c)µ ) + (1 c)d(µ , cµ + (1 c)µ ).

For any arm i [n] \ {1}, we define

gi(x) := (1 + x)I 1 1+x (µ1, µi).

Lemma A.1 (Garivier and Kaufmann 17, Lemma 3). For every w Pn,

inf λ Alt(µ)

i=1 wid(µi, λi)

= min i =1 (w1 + wi)I w1 w1+wi (µ1, µi).

T (µ) 1 = sup w Pn min i =1 (w1 + wi)I w1 w1+wi (µ1, µi), (A.1)

w (µ) = arg max w Pn min i =1 (w1 + wi)I w1 w1+wi (µ1, µi). (A.2)

Lemma A.2 (Garivier and Kaufmann 17, Theorem 5). For i [n], let

w i (µ) = xi(y ) P

i [n] xi(y ),

where y is the unique solution of the equation Fµ(y) = 1, and where

d µ1, µ1+xi(y)µi

d µi, µ1+xi(y)µi

in a continuous, increasing function on [0, d(µ1, µ2)) such that Fµ(y) when y d(µ1, µ2).

B Proof of Theorems in Section 3

B.1 Proof of Theorem 3.2

Proof of Theorem 3.2. The proof of Theorem 3.2 requires the following two Lemmas, which guarantees the error returned by Line 22 and Line 26 of Algorithm 1 respectively. Lemma B.1. [17, Proposition 12] Let µ be the exponential bandit model. Let δ (0, 1). For sufficiently small δ > 0, the Chernoff s stopping rule Line 21 of Algorithm 1 with the threshold

β(t, δ/2) = log 2 log(2/δ)tα

ensures that Pµ(i (Nδ) = 1) δ/2.

Recall ˆµs i is the empirical mean of arm i after its s-th pull. Let E1 = { s L3 and i [n] : ˆµs i [µi ϵ, µi + ϵ]}. The following lemma shows that P(Ec 1) δ/2, where Ec denotes the complement event of E.

Lemma B.2. For sufficiently small δ > 06, we have

P(Ec 1) δ/2.

If E1 happens, we have that at Line 26 of Algorithm 1, ˆµ1(t) µ1 ϵ µi + ϵ ˆµi(t) for all i > 1, which means i (t) = 1. Combining Lemma B.1 and Lemma B.2, Theorem 3.1 follows immediately.

B.2 Proof of Theorem 3.1

Proof of Theorem 3.1. Recall in Algorithm 1, bq 1 = ˆµi (t)(t) ϵ and bq i = ˆµi(t) + ϵ for all i [n] \ {i (t)}, where t = τq. Let E0 be the intersection of the following four events:

E1 0 = q 1{bq 1 [µ1 3ϵ/2, µ1 ϵ/2]}, (B.1a)

E2 0 = q 1{for all i [n] \ {1}, bq i [µi + ϵ/2, µi + 3ϵ/2]}, (B.1b)

E3 0 = q 0{ˆµ1(τq) µ1 ϵ/2}, (B.1c)

E4 0 = q 0{for all i [n] \ {1}, ˆµi(τq) µi + ϵ/2}. (B.1d)

Here (B.1a) and (B.1b) ensure that the estimation of w (bq) and T (bq) is close to w (µ) and T (µ), and (B.1c) and (B.1d) ensure that arm 1 has the largest average reward at time τq. Lemma B.3. For sufficiently small δ > 0, we have

P((Ec 0) 1 (log δ 1)2 .

Lemma B.4. If E0 holds, then for sufficiently small δ > 0, we have

1. Ti = αw i (µ)T (µ) log(1/δ) + o(log(1/δ));

2. |w i (b2) w i (b1)| 1/ n for all i [n];

3. minj / [n]\{i (τ)} Zj(τ) β(τ, δ/2).

From Lemma B.4, if δ is sufficiently small and E0 holds, then Algorithm 1 will return at Line 22. Besides, we note that the total number of pulls of any arm is no more than L3 = (log(δ 1))2 times. Therefore,

E[Nδ] n L1 + 1{E0} (αT (µ) log δ 1 + o(n log(1/δ))) + 1{Ec 0} (L2 + n L3)

1{E0} αT (µ) log δ 1 + o(n log(1/δ)) + 1{Ec 0}n L3 1{E0} αT (µ) log δ 1 + n

αT (µ) log δ 1 + n. (B.2)

We further have

lim δ 0 E[Nδ] log δ 1 αT (µ).

This completes the proof.

B.3 Proof of Theorem 3.3

Proof of Theorem 3.3. Stage I costs one batch. For the Stage II, note that P(Ec 0) 1 log(δ 1)2 and if E0 occurs, then |w i (b2) w i (b1)| 1/ n for all i [n] (from the second statement of Lemma B.4), which means Stage II costs 2 batches. Moreover, from Lemma B.4, if δ is sufficiently small and E0 holds, then Algorithm 1 will return at Line 22. Otherwise, the algorithm goes to Stage IV and costs one batch. Therefore, for sufficiently small δ, the expected number of batches

1 + P(Ec 0) log(1/δ) + 2 + P(Ec 0) = 3 + o(1).

Finally, if E0 is true, the algorithm costs 3 batches. Therefore, with probability 1 1/ log(1/δ2), Algorithm 1 costs 3 batches.

6If event E occur for a sufficiently small δ > 0, it signifies that there exists a δ0 (0, 1) such that for all δ δ0, event E consistently holds.

B.4 Proof of Supporting Lemmas

The proof of Lemma B.3 requires the following useful inequalities.

Lemma B.5 (Maximal Inequality). Let N and M be two positive integers, let γ > 0, and ˆµn be the empirical mean of n random variables i.i.d. according to the arm s distribution with mean µ. Then, for x µ,

P( N n M, ˆµn x) e N(x µ)2/(2V0), (B.3)

and for every x µ,

P( N n M, ˆµn x) e N(x µ)2/(2V1), (B.4)

where V0 is the maximum variance of arm s distribution with mean µ [x, µ] and V1 is the maximum variance of arm s distribution with mean µ [µ, x].

Note that Lemma B.5 is an improved version of Lemma 4 of [36], where we use a much smaller variance upper bound to tighten the inequalities for the case x µ and the case x µ respectively.

Lemma B.6. [24, Proposition 1] For ϵ > 0 and µ µ ϵ,

d(µ, µ ) d(µ, µ ϵ), and d(µ, µ ) d(µ ϵ, µ ). (B.5)

Now, we are ready to prove Lemma B.3.

Proof of Lemma B.3. Let V be the maximum variance of reward distribution with mean µ [µn, µ1]. From Lemma B.5, we have that for s L1,

P( s L1 : ˆµs 1 / [µ1 ϵ/2, µ1 + ϵ/2]) P( s L1 : ˆµs 1 µ1 + ϵ/2) + P( s L1 : ˆµs 1 µ1 ϵ/2)

2e L1(ϵ/2)2/(2V )

1 n(log δ 1)2 , (B.6)

where the second inequality is from Lemma B.5, and the last inequality is due to that for sufficiently small δ > 0, L1 = p

log δ 1 8V/ϵ2 log(n(log δ 1)2). Similarly, for i [n] \ {1},

P( s L1 : ˆµs i / [µi ϵ/2, µi + ϵ/2]) 1 n(log δ 1)2 . (B.7)

Define events: A1 = { s L1 : ˆµs 1 [µ1 ϵ/2, µ1 + ϵ/2]}

and A2 = {for all i [n] \ {1}, s L1 : ˆµs i [µi ϵ/2, µi + ϵ/2]}.

Assume that both events A1 and A2 hold. Then, we have

ˆµL1 1 µ1 ϵ/2 µi + ϵ/2 + i ϵ ˆµL1 i ,

where in the last inequality we assumed ϵ mini =1 i. This further implies i (τq) = 1 for any q 0 and thus bq 1 1 = ˆµ1(τq) ϵ. Therefore, we have for any q 1

1. bq 1 = ˆµ1(τq 1) ϵ [µ1 3ϵ/2, µ1 ϵ/2];

2. i [n] \ {1}, bq i = ˆµi(τq 1) + ϵ [µi ϵ/2, µ1 + ϵ/2];

3. ˆµ1(τq 1) µ1 ϵ/2;

4. and i [n] \ {1}, ˆµi(τq 1) µi + ϵ/2,

which means E0 defined in (B.1) occurs. Therefore, we have

P(E0) P(A1 A2)

= P { s L1 : ˆµs 1 [µ1 ϵ/2, µ1 + ϵ/2]} \

i:i [n]\{1} { s L1 : ˆµs i [µi ϵ/2, µi + ϵ/2]}

= P { s L1 : ˆµs 1 / [µ1 ϵ/2, µ1 + ϵ/2]}c \

i:i [n]\{1} { s L1 : ˆµs i / [µi ϵ/2, µi + ϵ/2]}c

1 P s L1 : ˆµs 1 / µ1 ϵ

i:i [n]\{1} P s L1 : ˆµs i / µi ϵ

1 (log δ 1) 2,

where the last inequality is due to (B.6) and (B.7).

Proof of Lemma B.4. Assume E0 is true. Recall that 2 = mini [n]\{1} µ1 µi. For sufficiently small δ > 0 such that ϵ < 2/8, we have bq 1 [µ1 2/4, µ1] and bq i [µi, µi + 2/4]. Let L(b) = {λ S : λ1 [µ1 2/4, µ1] and for any i [n] \ {1}, λi [µi, µi + 2/4]}. Therefore

αw i (bq)T (bq) max b L(b) αw i (b )T (b ) < .

Since maxb L(b) αw i (b )T (b ) < is independent of δ , we have

αw i (bq)T (bq) max b L(b) αw i (b )T (b ) log log δ 1/n = L2,

which implies T q i = min{αw i (bq)T (bq) log δ 1, L2} = αw i (bq)T (bq) log δ 1. Besides, as w (bq) and T (bq) are continuous on bq, we have that as bq µ, w (bq) w (µ) and T (bq) T (µ). Note that |bq i µi| 2ϵ and ϵ approaches 0 as δ approaches 0. Recall τ is the stopping time of Stage II and Ti is the number of pulls of arm i at time τ. Therefore, for sufficiently small δ,

Ti := max p:p N,p 1 T p i = αw i (µ)T (µ) log δ 1 + o(log(1/δ)).

Moreover, for sufficient smaller δ and i [n], |w i (bq) w i (µ)| = o(1) and thus

|w i (b1) w i (b2)| < 1/ n,

which means the condition in Line 15 is satisfied and the Algorithm goes to Stage III.

Assume τ = τq . We have

Z1i(τ) = N1(τ) d(ˆµ1(τ), ˆµ1i(τ)) + Ni(τ) d(ˆµi(τ), ˆµ1i(τ)) = N1(τq ) d(ˆµ1(τ), ˆµ1i(τ)) + Ni(τq ) d(ˆµi(τ), ˆµ1i(τ))

α log δ 1 w 1(bq )d(ˆµ1(τ), ˆµ1i(τ))

T (bq ) 1 + w i (bq )d(ˆµ1(τ), ˆµ1i(τ))

where the last inequality is due to Ti T q

i = αw i (bq )T (bq ) log(1/δ). In what follows, we will show

w 1(bq ) d(ˆµ1(τ), ˆµ1i(τ))

T (bq ) 1 + w i (bq ) d(ˆµi(τ), ˆµ1i(τ))

T (bq ) 1 1.

The following minimization problem is a convex optimization problem

min λ1w:λ1w λiw w 1(bq ) d(ˆµ1(τ), λ1w) + w i (bq ) d(ˆµi(τ), λiw),

which is solved when we have λ1w = λiw = ˆµ1i(τ). The proof of this lemma is based on the assumption that E0 occurs. Note that E0 occurs, then bq 1 = ˆµ1(τq 1) ϵ µ1 ϵ/2 ˆµ1(τ) and bq

i = ˆµ1(τq 1) + ϵ > ˆµi(τ) for all i [n] \ {1}. Therefore, ˆµ1(τ) bq 1 bq

i ˆµi(τ). Let

λw (bq )(bq 1 , bq

i ) = w 1(bq ) w 1(bq ) + w i (bq ) bq 1 + w i (bq ) w 1(bq ) + w i (bq ) bq i .

Then, λw (b)(bq 1 , bq

i ) is the solution to minimizing w 1(bq ) d(bq 1 , x) + w i (bq ) d(bq

i , bq 1 ).

Case 1: if ˆµ1i(τ) (bq 1 , ˆµ1(τ)), then

w 1(bq ) d(ˆµ1(τ), ˆµ1i(τ)) + w i (bq ) d(ˆµi(τ), ˆµ1i(τ))

w 1(bq ) d(ˆµ1(τ), ˆµ1(τ)) + w i (bq ) d(ˆµi(τ), bq 1 )

w 1(bq ) d(bq 1 , bq 1 ) + w i (bq ) d(bq

w 1(bq ) d(bq 1 , λw (bq )(bq 1 , bq

i )) + w i (bq ) d(bq

i , λw (bq )(bq 1 , bq

i )), (B.9)

where the first and second inequalities are due to (B.5) and the last inequality is due to the fact that w 1(bq ) d(bq 1 , x) + w i (bq ) d(bq

i , x) achieves it minimum at x = λw (bq )(bq 1 , bq

Case 2: if ˆµ1i(τ) (bq

i , bq 1 ), we have

w 1(bq ) d(ˆµ1(τ), ˆµ1i(τ)) + w i (bq ) d(ˆµi(τ), ˆµ1i(τ))

w 1(bq ) d(bq 1 , ˆµ1i(τ)) + w i (bq ) d(bq

i , ˆµ1i(τ))

w 1(bq ) d(bq 1 , λw (bq )(bq 1 , bq

i )) + w i (bq ) d(bq

i , λw (bq )(bq 1 , bq

i )), (B.10)

where the first inequality is due to (B.5) and the last inequality is due to the fact that for x (bq

i , bq 1 ), w 1(bq ) d(bq 1 , x) + w i (bq ) d(bq

i , x) achieves its minimum at x = λw (bq )(b1......q , bq

Case 3: if ˆµ1i(τ) (ˆµi(τ), bq

i ), similar to (B.9) and (B.10), we have

w 1(bq ) d(ˆµ1(τ), ˆµ1i(τ)) + w i (bq ) d(ˆµi(τ), ˆµ1i(τ))

w 1(bq ) d(bq 1 , λw (bq )(bq 1 , bq

i )) + w i (bq ) d(bq

i , λw (bq )(bq 1 , bq

i )). (B.11)

Combine all three cases, we obtain

w 1(bq ) d(ˆµ1(τ), ˆµ1i(τ)) + w i (bq ) d(ˆµi(τ), ˆµ1i(τ))

w 1(bq ) d(b1, λw (bq )(bq 1 , bq

i )) + w i (bq ) d(bq

i , λw (bq )(bq 1 , bq

i )). (B.12)

Note that ˆµ1i(τ) = λw (bq )(ˆµ1(τ), ˆµi(τ)), we have

w 1(bq ) d(ˆµ1(τ), ˆµ1i(τ))

T (bq ) 1 + w i (bq ) d(ˆµ1(τ), ˆµ1i(τ))

w 1(bq ) d(bq 1 , λw (bq )(bq 1 , bq

T (bq ) 1 + w i (bq ) d(bq

i , λw (bq )(bq 1 , bq

= 1, (B.13)

where the first inequality is due to (B.12) and the last inequality is due to (A.1). Note that conditioned on event E0, for sufficiently small δ > 0, we have 1 = i (τ). Therefore, for any i [n] \ {1},

Zi(τ) = Z1i(τ)

= α log δ 1 w 1(bq ) d(ˆµ1(τ), ˆµ1i(τ))

T (bq ) 1 + w i (bq ) d(ˆµ1(τ), ˆµ1i(τ))

where the first equality is from (3.3), the second equality is from (B.8), and the last inequality is from (B.13). Finally, for α 1 + 6 log log δ 1/ log δ 1, we obtain

Zi(τ) log (log δ 1)6

log 2(log(2/δ))) (n L2 + n L1)α

= β(τ, δ/2),

which completes the proof.

B.5 Proof of Maximum Inequality

Proof of Lemma B.5. Recall that the result of Lemma 4 of Ménard and Garivier [36] is: for ˆµn µ,

P( N n M, d(ˆµn, µ) γ) e Nγ. (B.14)

By Lemma 1 of Harremoës [20], we have

d(x, µ) = Z µ

V (y) dy (x µ)2

2V0 , (B.15)

where V (y) is the variance of the distribution with mean y. As a simple consequence of (B.15) and (B.14), we have

P( N n M, ˆµn x) e N(x µ)2/(2V0). (B.16)

For the case when ˆµn µ, we can directly follow the idea of Lemma 4 of Ménard and Garivier [36]. For the case ˆµn µ, we aim to show

P( N n M, d(ˆµn, µ) γ) e Nγ. (B.17)

We let M(λ) be the log-moment generating function of νθ. Recall that dνθ

dρ (x) = exp(xθ b(θ)). We use the following properties of one exponential family.

1. M(λ) = b(θ + λ) b(θ);

2. KL(νθ1, νθ) = b(θ) b(θ1) + b (θ1)(θ θ1);

3. E[νθ] = b (θ).

Let E[νθ] = µ. Let λ = θ1 θ, z = b (θ1) > µ, and γ = d(z, µ), we have

γ = d(z, µ) = KL(νθ1, νθ) = λz M(λ).

Since d(z, µ) is monotone increasing for z > µ, we obtain λ > 0. If event { N n M, d(ˆµn, µ) γ} and ˆµn µ, one have

ˆµn µ, λˆµn M(λ) λz M(λ) = γ, λnˆµn n M(λ) Nγ.

By Doob s maximal inequality for the exponential martingale exp(λnˆµn n M(λ)),

P( N n M, d(ˆµn, µ) γ) P( N n M, λnˆµn n M(λ) Nγ)

e Nγ. (B.18)

Combining above inequality and (B.15), we obtain for ˆµn µ

P( N n M, d(ˆµn, µ) γ) e N(x µ)2/(2V1). (B.19)

This completes the proof.

C Proof of Theorems in Section 4

In this proof, we define one round" as a single iteration of the While Loop of Algorithm 2.

Proof of Theorem 4.2. Proof of Correctness. We first show that the best arm is not eliminated by Line 11 for any round r. For any l, let El be defined as follows.

El = {1 Sr, r l}. (C.1)

Lemma C.1. Line 11 of Algorithm 2 satisfies X

r 1 P ˆpr 1 ˆpr ϵr

To streamline our presentation, we define ℓjs = s, γjs = (δj)2s, and ˆpjs i as the sth updates of parameters ℓj, γj, and ˆpj i, respectively, within the second For Loop. For any fixed j, we define Uj(s) to be the following set of rounds, where ℓj remains to be the same value s:

Uj(s) := {r = 1, 2, . . . : ℓj = s at round r of Algorithm 2}. (C.2)

Then the condition of Line 21 could be represented as

i Sj and r Uj(s), ˆpjs i ˆpr ϵj

The following lemma shows that with high probability, Algorithm 2 will not return at Line 22. Lemma C.2. Line 21 of Algorithm 2 satisfies

max i Sj ˆpjs i < ˆpr ϵj

Note that Algorithm 2 will return arm 1 at Line 24 if 1: Arm 1 is always maintained in Sr, and 2: Algorithm 2 never return at Line 22.

According to Lemma C.1, there is a probability of at least 1 δ/8 that Arm 1 will never be eliminated at Line 11. Given that Arm 1 is never eliminated at Line 11, Lemma C.2 suggests that there is a probability of at least 1 3δ/16 that we will never loop back at Line 22. Hence, with a probability of 1 δ/2, Stage IV will identify the optimal arm. By incorporating the results of Lemma B.1, we find that with a probability exceeding 1 δ/2 δ/2, which is greater than 1 δ, Algorithm 2 will return the optimal arm. Proof of Sample Complexity. Let N be the sample complexity of Stage IV. Let E = r>1Er. The following lemma shows the sample complexity conditioned on event E. Lemma C.3. The expected sample complexity N conditioned on event E has the order

E[N | E] = O X

log n/δ log( 1 i )

Now, we consider the expected sample complexity if the best arm is eliminated at some round r. We prove the following lemma. Lemma C.4. The sum of expected sample complexity N at each loop has the order

r 1 E[N 1{Er, (Er+1)c}] = O X

log n/δ log( 1 i )

The number of pulls of all arms in the first three stages is no more than max{n p

log δ 1, L2}. By combining the above results, Lemma C.3 and Lemma C.4 together, we can obtain the following total sample complexity

(log δ 1) log log δ 1 + X

log nδ 1 log 1 i

Since in the non-asymptotic setting, δ is finite. Then

log n log 1 i

Proof of Batch Complexity. Just as we discussed in the proof of sample complexity, we first demonstrate the batch complexity conditioned on the event of E.

Lemma C.5. Conditional on event E, Algorithm 2 conducts O(log(1/ 1 2 )) batches in expectation.

Let B be the number of batches we used in Stage IV. We consider the expected batch complexity if the best arm is eliminated at some round r. We prove the following lemma. Lemma C.6. The sum of expected batch complexity B at each loop has the order X

r 1 E[B 1{Er, (Er+1)c}] = O(log(1/ 1 2 )).

By combining Lemma C.5, Lemma C.6, the fact that the first three stages use O(log(1/δ)) batches, and δ is a constant in non-asymptotic setting, the batch complexity of Algorithm 2 is O(log(1/ 1 2 )).

Proof of Theorem 4.3. Let E0 be the event defined in (B.1) in the proof of Theorem 3.1. From Lemma B.4, if δ is sufficiently small and E0 holds, then Algorithm 2 will return at Stage III. Note that the expected number of pulls in Stage IV is (log δ 1)2. Hence, the expected number of pulls of all arms is no more than 2(log δ 1)2 times. Therefore, for any ϵ > 0

E[Nδ] αT (b) log δ 1 + o(log(1/δ)) + 1{Ec 0}2n(log δ 1)2

αT (µ) log δ 1 + 2n. (C.3)

lim δ 0 Eµ[Nδ] log δ 1 αT (µ),

Moreover, there exists some universal constant C, such that the expected number of batches used in Stage IV is C log(1/ 2) 1{Ec 0} o(1), where the last inequality is due to Lemma B.3. Form Theorem 3.3, the batch complexity of the first three Stage is 3 + o(1). Therefore, the total expected batch complexity is 3 + o(1).

D Proof of Supporting Lemmas

The proof of the supporting lemmas requires the following concentration inequalities. Lemma D.1. [41, Theorem 2.2.6] Let X1, . . . , Xk [0, 1] be independent bounded random variables with mean µ. Then for any ϵ > 0

P(ˆµ µ + ϵ) exp kϵ2

P(ˆµ µ ϵ) exp kϵ2

where ˆµ = 1/k Pk t=1 Xt.

D.1 Proof of Lemma C.1

From Lemma D.1,

P |ˆpr i µi| ϵr

2 exp 2drϵ2 r 64

Applying union bound, we have

n |ˆpr i µi| ϵr

The aforementioned inequality suggests that, with a probability of at least 1 δ/8, the condition |ˆpr i µi| ϵr

8 holds true for any i [n] and r 1. Then, for any r > 1,

ˆpr 1 µ1 ϵr

r 1 P ˆpr 1 ˆpr ϵr

D.2 Proof of Lemma C.2

From Lemma D.1,

P |ˆpjs i µi| ϵj

γjs = (δj)2s.

Applying union bound, we have

n |ˆpjs i µi| ϵj

i Sj (δj)2s

Besides, from (D.3), we obtain

|ˆpr i µi| ϵr

Define events E3 = r 1 i [n] {|ˆpr i µi| ϵr/8},

and E4 = j 1 s>1 i Sj{|ˆpjs i µi| < ϵj/8}.

If E3 and E4 truly hold, we have that (1): for any j and arm i Sj,

µi ˆpj i + ϵj

where the second inequality is due to Line 11 of Algorithm 2; (2): for any fixed j, s, and any r Uj(s), where Uj(s) is defined in (C.2) and represents the set of all rounds in which parameter ℓj is updated for exactly s times,

ˆpjs i < µi + ϵj

Therefore, if E3 and E4 truly hold, the event ˆpjs i < ˆpr ϵj

2 consistently holds for all j, s, i Sj, and r Uj(s). It s noteworthy that, according to (D.4), P E4 1 δ/16, and from (D.5), P(E3) 1 δ/8. Therefore,

max i Sj ˆpjs i < ˆpr ϵj

D.3 Proof of Lemma C.3

We first focus on bounding the number of pulls of arm i within the first For Loop. Define

r(i) = min r : ϵr < i

From Lemma D.1 and the union bound, for r r(i),

P |ˆpr i µi| ϵr(i)

|ˆpr 1 µ1| ϵr(i)

4 exp 2drϵ2 r(i) 64

4(δr)r r(i)

10r(i) r. (D.6)

Let Er i be the event |ˆpr i µi| ϵr(i)/8 and |ˆpr 1 µ1| ϵr(i)/8 truly hold at r-th round. Conditioned on Er i and E,

ˆpr i µi + ϵr

8 µ1 i + ϵr

2 ˆpr 1 ϵr ˆpr ϵr,

which mean arm i will be eliminated. Therefore, the total sample cost of arm i in the first For Loop of Algorithm 2 is X

ϵ2r log(2/δr) + X

r>r(i) 10r r(i) 1 32

ϵ2r log(2/δr) =O log(1/δr(i))

=O log n/δ log( 1 i )

Consequently, if we define H to be the total sample complexity in the first For Loop of Algorithm 2. Then we have

log n/δ log( 1 i )

Now, we bound the sample complexity within the second For Loop. We let Lj be the total number of pulls of arms in Sj \ Sj+1 within the second For Loop. We let

nj = min s {0, 1, 2, } : Bj (δj)2s 2s H .

As per Line 16, nj represents the total count of arm re-pulls in the set Sj \ Sj+1. Consequently, we can establish the following bound for Lj.

Lj (Bj Bj 1)

s=0 2s Bj Bj 1 + δj H.

where the first inequality is because γj(s+1) = γ2 js and we pull the arms in Sj/Sj+1 for total X

= (Bj Bj 1) 2s

times at Line 18 and the last inequality is because for nj > 1,

δj H δj Bj (δj)2nj 1 2nj 1 Bj2nj+1,

where the first inequality is due to the definition of nj. Therefore, X

j>1 Bj + H X

Now, we conclude that conditioned on event E, the total expected sample complexity is

log n/δ log( 1 i )

D.4 Proof of Lemma C.4

From Lemma D.1,

P |ˆpr i µi| ϵr

2 exp 2drϵ2 r 4

Applying union bound, we have

|ˆpr i µi| ϵr

(δr)3. (D.7)

The aforementioned inequality suggests that, with a probability of at least 1 (δr)3, the condition |ˆpr i µi| ϵr/4 consistently holds true for any i Sr. Then,

ˆpr 1 µ1 ϵr

P(Er, (Er+1)c) (δr)3. (D.8)

For any fixed j, recall the definition of Uj(s) in (C.2). Assume Er, (Er+1)c truly hold and denote Ur(s) = {r1 s, r2 s, , rs s } as the rounds where ℓr = s. We note that at rl s-th round, each arm within the first For Loop of Algorithm 2 is pulled at least

times. Besides, each arm within the second For Loop is pulled at least

times. Then, similar to (D.7), for rl s-th round, we have

|ˆprl s i µi| ϵr

exp 2ϵ2 r 16 32 4s+l

(δ4 r)4s+l.

Then, we apply a union bound over all rounds in Ur(s), we obtain

P l [s ] i Srls

|ˆprl s i µi| ϵ

(δ3 r)2s/2.

Moreover, if Er, (Er+1)c happens, the best arm is eliminated at r-th round. Then for any rounds in Ur(s), we have pulled arm 1 for at least

times in the second For Loop. Recall that ˆpjs i is the sth updates of parameter ˆpj i within the second For Loop. From Lemma D.1, we have that

P ˆprs 1 µ1 ϵr

exp 2ϵ2 r 16 32 2s

(δ3 r)2s/2.

Therefore, if Er, (Er+1)c happens, with probability

1 γrs = 1 (δ3 r)2s 1, (D.9)

it holds that

l [s ] i Srls

|ˆprl s i µi| ϵ

, and ˆprs 1 > µ1 ϵ

and thus for all l [s ]

ˆprs 1 µ1 ϵ

which means Algorithm 2 returns at Line 22. Before we continue, we will first show that the number of pulls in the second For Loop is lower than the first For Loop. Assume the algorithm stops at r -th round. Let s j = max{s : Br γjs 2s > Bj}. The number of pulls for Sj \ Sj+1 at second For Loop is at most

s=1 (Bj Bj 1) 2s

= 22s j+1γjs j Bj 2s jγjs j

Br 22s j+1(δj)2 s j

Therefore, the total number of pulls within the second For Loop is at most

s=1 (Bj Bj 1) 2s Br X

j>1 δj Br ,

which means the number of for the second For Loop is lower than the first For Loop. Finally, we obtain

E[N 1{Er, (Er+1)c}] E O X

s=1 ((δr)3)2s 1 Br (δr)2s2s

= E[O(δr Br)]

δr log n/δ log( 1 i )

In the first inequality, we used the fact that if the algorithm stops in some rounds in Ur(s), the total number of pulls of all arms is at most Br/(γjs 2s) + Br/(γjs 2s), whcih comes from the first and second For Loop respectively. Moreover, the factor ((δr)3)2s 1 is because from (D.9), if Er, (Er+1)c holds, then Algorithm 2 returns in some rounds in Ur(s) (at Line 22) with probability at least 1 ((δr)3)2s 1. The last equality is because from Lemma C.3, if Er, (Er+1)c holds, then

log n/δ log( 1 i )

r 1 E[N 1{Er, (Er+1)c}] = O X

log n/δ log( 1 i )

D.5 Proof of Lemma C.5

We note that each round within the While Loop costs one batch. Let

r(2) = min r : ϵr < 2

Let Er be the event \

|ˆpr i µi| ϵr(2)

\ |ˆpr 1 µ1| ϵr(2)

Conditioned on Er and E,

ˆpr i µi + ϵr(2)

8 µ1 i ϵr(2)

8 ˆpr 1 ϵr(2) ˆpr ϵr(2),

which means all sub-optimal arms have been eliminated and the algorithm returns. From Lemma D.1 and the union bound, for r r(2),

P((Er)c) 4n exp 2drϵ2 r(i) 64

4n(δr)r r(i) 10r(i) r. (D.10)

Therefore, the total batch cost is

1 10r r2 = O(r2) = O log 1

This completes the proof.

D.6 Proof of Lemma C.6

The proof of this lemma is similar to that of Lemma C.4. we have the following results. 1: From (D.8), we have

P(Er, (Er+1)c) (δr)3.

2: From (D.9), we have if Er and (Er+1)c truly hold, then with fixed ℓs, Algorithm 2 returns at Line 22 with probability

1 (δ3 r)2s 1. (D.11)

We first compute the size of Ur(s). From Algorithm 2, we know that the arm kept in the set Sr will be pulled 4 times larger compared to (r 1)-th round. Besides, we update ℓr to s + 1, if the number of pulls exceeds Br/((δr)2s 2s) (Line 16 of Algorithm 2). It is easy to see after log4 n rounds, the number of pulls of any arm exceeds Br and then after ln 1 (δr)2s 2s rounds, the number of pulls of

single arm exceeds Br/((δr)2s 2s). Therefore, the size of Ur(s) is at most ln 1 (δr)2s 2s + ln n. We let U be the total number of rounds used. Then, we obtain

E[U 1{Er, (Er+1)c}] r P(Er, (Er+1)c) + E O X

s=1 ((δr)3)2s 1 ln n + ln 1 (δr)2s 2s

r P(Er, (Er+1)c) + δr.

We have shown in Lemma C.5, P

r>1 r P(Er, (Er+1)c) = O(log(1/ 2)). Therefore, the total number of rounds is X

r 1 E[U 1{Er, (Er+1)c}] = O(log(1/ 2)),

which completes the proof.

E Experiments

In this section, we compare our algorithms Tri-BBAI and Opt-BBAI with Track-and-Stop [17], Top-k δ-Elimination [22], ID-BAI [23] and Collab Top M [32] under bandit instances with Bernoulli rewards. All the experiments are repeated in 1000 trials. We perform all computations in Python on R9 5900HX for all our experiments. The implementation of this work can be found at https://github.com/panxulab/Optimal-Batched-Best-Arm-Identification

Data generation. For all experiments in this section, we set the number of arms n = 10, where each arm has Bernoulli reward distribution with mean µi for i [10]. More specifically, the mean rewards are generated by the following two cases.

Uniform: The best arm has µ1 = 0.5, and the mean rewards of the rest of the arms follow uniform distribution over [0.2, 0.4], i.e., µi is uniformly generated from [0.2, 0.4] for i [n] \ {1}. Normal: The best arm has µ1 = 0.6, and the mean rewards of the rest of the arms are first generated from normal distribution N(0.2, 0.2) and then projected to the interval [0, 0.4].

Implementation details. The hyperparameters of all methods are chosen as follows.

Track-and-Stop [17] is a fully sequential algorithm and thus the only parameter that needs to be set is the β(t) function in the Chernoff s stopping condition (similar to Stage III of Algorithm 1). Note that the theoretical value of β(t) in Track-and-Stop [17] is of the same order as presented in our Theorem 3.1. However, they found that a smaller value works better in practice. Therefore, we follow their experiments to set β = log ((log(t) + 1)/δ). Top-k δ-Elimination is a batched algorithm that eliminates the arms in batches. It has two parameters ϵ and δ. In our experiments, we fix ϵ = 0.1. ID-BAI [23] is designed to identify the best arm among a set of bandit arms with optimal instancedependent sample complexity. We use the same algorithm setting of the original paper in our experiments. Collab Top M [32] is the algorithm to identify the Top-m arms within a multi-agent setting. We set the m as 1 and the agents K as 1. For Tri-BBAI and Opt-BBAI, we set α = 1.0017, and ϵ = 0.01. We use the same β(t) function for Chernoff s stopping condition as in Track-and-Stop. Moreover, for the lengths of the batches, we set L1, L2 and L3 to be the value calculated by Theorem 3.1.

Results. We present a comprehensive comparison on the sample complexities and batch complexities of our algorithms and baseline algorithms in Tables 3 and 4. Notably, our algorithms Tri-BBAI and Opt-BBAI, also including Top-K δ Elimination and Collab Top M, require significantly fewer batches than Track-and-Stop. Furthermore, the sample complexity of Tri-BBAI and Opt-BBAI is significantly lower than that of Top-K δ Elimination, ID-BAI and Collab Top M. Additionally, the sample complexity of Tri-BBAI and Opt-BBAI is comparable to Track and Stop when δ is large, and it is at most 3.6 times greater than Track and Stop when δ is very small. Additionally, we also provide the runtime comparison in Tables 3 and 4. Our algorithms have a significantly reduced runtime compared to Track-and-Stop, achieving nearly 1000 speedup.

7From Theorem 3.1, we can select α to be any constant within the range (1, e/2). To optimize the convergence rate of E[Nδ]/(log(1/δ)T (µ)), we set α slightly above 1, specifically we choose α = 1.001.

Table 3: Experimental results in terms of sample complexity, batch complexity and runtime under the uniform mean rewards. The number of arms is n = 10. The experiments are averaged over 1000 repetitions.

Dataset δ Algorithm Sample Complexity Batch Complexity Runtime (s) Recall

Track-and-Stop 668.49 298.62 668.49 298.62 275.94 135.67 100% Top-k δ-Elimination 32472 0 2 0.01 0.04 0.01 100% ID-BAI 120427 0 - 0.136 0.003 100% Collab Top M 27290.205 4806.435 3.001 0.031 0.032 0.006 100% Tri-BBAI 365.232 21.85 3.99 0.11 1.02 0.23 90.1% Opt-BBAI 1220.30 586.92 4.89 1.01 0.95 0.22 100%

Track-and-Stop 977.23 371.61 977.23 371.61 381.85 146.10 100% Top-k δ-Elimination 52734 0 2 0.01 0.07 0.01 100% ID-BAI 146948 0 - 0.167 0.003 100% Collab Top M 35449.181 5530.938 3.0 0 0.042 0.006 100% Tri-BBAI 1114.41 98.68 3.96 0.19 1.05 0.19 99.3% Opt-BBAI 1929.65 604.11 4.64 1.00 1.02 0.18 100%

Track-and-Stop 1468.15 448.88 1468.15 448.88 624.82 219.14 100% Top-k δ-Elimination 72996 0 2 0.01 0.10 0.01 100% ID-BAI 173478 0 - 0.197 0.006 100% Collab Top M 43457.489 5519.169 3.0 0 0.051 0.006 100% Tri-BBAI 2005.07 175.13 3.88 0.33 1.09 0.13 100% Opt-BBAI 2781.41 658.40 4.37 0.98 1.01 0.14 100%

Track-and-Stop 1769.72 467.36 1769.72 467.36 743.96 196.39 100% Top-k δ-Elimination 93258 0 2 0.01 0.12 0.01 100% ID-BAI 199999 0 - 0.257 0.005 100% Collab Top M 51823.644 6379.803 3.0 0 0.061 0.007 100% Tri-BBAI 2989.96 274.36 3.81 0.39 1.10 0.13 100% Opt-BBAI 3752.31 739.60 4.20 0.99 1.00 0.12 100%

Track-and-Stop 2135.09 535.01 2135.09 535.01 821.59 137.83 100% Top-k δ-Elimination 113520 0 2 0.01 0.15 0.01 100% ID-BAI 226528 0 - 0.257 0.005 100% Collab Top M 60263.887 6367.285 3.0 0 0.071 0.007 100% Tri-BBAI 4066.34 381.41 3.74 0.44 1.11 0.13 100% Opt-BBAI 4799.49 852.00 4.04 0.96 1.03 0.12 100%

Track-and-Stop 2517.93 561.65 2517.93 561.65 1085.30 202.72 100% Top-k δ-Elimination 133782 0 2 0.01 0.18 0.01 100% ID-BAI 253049 0 - 0.288 0.006 100% Collab Top M 67958.078 6730.415 3.0 0 0.081 0.008 100% Tri-BBAI 5185.75 485.82 3.65 0.48 1.12 0.12 100% Opt-BBAI 5871.33 951.60 3.90 0.93 1.07 0.14 100%

Track-and-Stop 2942.67 598.77 2942.67 598.77 1232.91 192.82 100% Top-k δ-Elimination 154044 0 2 0.01 0.21 0.01 100% ID-BAI 279579 0 - 0.317 0.006 100% Collab Top M 76627.155 7000.211 3.0 0 0.090 0.008 100% Tri-BBAI 6334.49 613.88 3.58 0.49 1.12 0.12 100% Opt-BBAI 7055.16 1059.50 3.82 0.93 1.11 0.16 100%

Track-and-Stop 3347.02 535.16 3347.02 535.16 1464.84 359.57 100% Top-k δ-Elimination 174306 0 2 0.01 0.23 0.01 100% ID-BAI 306108 0 - 0.347 0.006 100% Collab Top M 84995.542 7270.987 3.0 0 0.100 0.009 100% Tri-BBAI 7486.57 764.36 3.49 0.50 1.13 0.11 100% Opt-BBAI 8136.01 1094.05 3.64 0.85 1.10 0.12 100%

Track-and-Stop 3609.64 638.68 3609.64 638.68 1661.96 266.15 100% Top-k δ-Elimination 194568 0 2 0.01 0.26 0.01 100% ID-BAI 332628 0 - 0.376 0.006 100% Collab Top M 92705.663 7639.167 3.0 0 0.109 0.009 100% Tri-BBAI 8759.03 844.59 3.41 0.49 1.12 0.10 100% Opt-BBAI 9315.58 1264.28 3.56 0.83 1.01 0.10 100%

Track-and-Stop 4136.94 665.10 4136.94 665.10 1714.68 257.00 100% Top-k δ-Elimination 214830 0 2 0.01 0.28 0.01 100% ID-BAI 359158 0 - 0.407 0.008 100% Collab Top M 100894.264 7704.855 3.0 0 0.119 0.009 100% Tri-BBAI 10083.60 1005.54 3.30 0.45 1.26 0.18 100% Opt-BBAI 10548.19 1277.15 3.48 0.78 0.99 0.09 100%

Table 4: Experimental results in terms of sample complexity, batch complexity and runtime under the normal mean rewards. The number of arms is n = 10. The experiments are averaged over 1000 repetitions.

Dataset δ Algorithm Sample Complexity Batch Complexity Runtime (s) Recall

Track-and-Stop 305.21 183.51 305.21 183.51 154.42 64.49 100% Top-k δ-Elimination 32472.0 0 2 0.01 0.04 0.01 100% ID-BAI 120427 0 - 0.145 0.012 100% Collab Top M 11834.089 2208.691 2.888 0.315 0.015 0.003 100% Tri-BBAI 334.49 32.80 3.89 0.32 1.08 0.16 98.3% Opt-BBAI 793.23 358.15 4.18 0.79 1.05 0.16 100%

Track-and-Stop 490.95 206.28 490.95 206.28 182.42 81.61 100% Top-k δ-Elimination 52734 0 2 0.01 0.07 0.01 100% ID-BAI 146948 0 - 0.171 0.011 100% Collab Top M 16145.218 2255.120 2.933 0.250 0.021 0.004 100% Tri-BBAI 893.05 121.47 3.62 0.48 1.11 0.11 100% Opt-BBAI 1236.33 381.17 3.74 0.71 1.11 0.13 100%

Track-and-Stop 699.18 198.64 699.18 198.64 264.66 116.82 100% Top-k δ-Elimination 72996 0 2 0.01 0.10 0.01 100% ID-BAI 173478 0 - 0.204 0.013 100% Collab Top M 20414.549 2277.714 2.960 0.195 0.027 0.004 100% Tri-BBAI 1532.76 201.18 3.37 0.48 1.13 0.10 100% Opt-BBAI 1747.15 389.65 3.43 0.58 1.07 0.10 100%

Track-and-Stop 833.08 234.48 833.08 231.90 315.10 91.15 100% Top-k δ-Elimination 93258 0 2 0.01 0.12 0.01 100% ID-BAI 199999 0 - 0.236 0.015 100% Collab Top M 24673.111 2584.898 2.962 0.191 0.032 0.005 100% Tri-BBAI 2141.11 282.37 3.20 0.40 1.13 0.08 100% Opt-BBAI 2263.174 405.72 3.19 0.42 1.06 0.08 100%

Track-and-Stop 972.75 245.18 972.75 245.18 336.78 74.03 100% Top-k δ-Elimination 113520 0 2 0.01 0.15 0.01 100% ID-BAI 226528 0 - 0.269 0.044 100% Collab Top M 29081.141 2323.105 2.982 0.132 0.039 0.005 100% Tri-BBAI 2838.36 353.12 3.09 0.28 1.14 0.08 100% Opt-BBAI 2881.61 430.05 3.08 0.27 1.09 0.09 100%

Track-and-Stop 1122.53 308.73 1122.53 308.73 468.14 138.87 100% Top-k δ-Elimination 133782 0 2 0.01 0.17 0.01 100% ID-BAI 253049 0 - 0.312 0.038 100% Collab Top M 33386.52 2133.062 2.987 0.113 0.043 0.005 100% Tri-BBAI 3516.52 467.48 3.03 0.18 1.14 0.08 100% Opt-BBAI 3556.59 477.73 3.04 0.20 1.15 0.15 100%

Track-and-Stop 1256.29 308.88 1256.29 308.88 544.76 74.00 100% Top-k δ-Elimination 154044 0 2 0.01 0.20 0.02 100% ID-BAI 279579 0 - 0.336 0.019 100% Collab Top M 37499.802 2413.418 2.986 0.117 0.048 0.005 100% Tri-BBAI 4218.39 523.49 3.01 0.09 1.15 0.08 100% Opt-BBAI 4220.78 523.11 3.02 0.13 1.16 0.13 100%

Track-and-Stop 1438.20 355.44 1438.20 355.44 566.44 101.01 100% Top-k δ-Elimination 174306 0 2 0.01 0.23 0.01 100% ID-BAI 306108 0 - 0.362 0.018 100% Collab Top M 41924.698 2206.559 2.992 0.089 0.055 0.005 100% Tri-BBAI 4912.64 613.69 3.01 0.08 1.15 0.07 100% Opt-BBAI 4940.68 650.75 3.00 0.06 1.12 0.10 100%

Track-and-Stop 1632.12 348.14 1632.12 348.14 590.42 56.25 100% Top-k δ-Elimination 194568 0 2 0.01 0.26 0.01 100% ID-BAI 332628 0 - 0.393 0.017 100% Collab Top M 46244.241 1816.041 2.997 0.054 0.059 0.005 100% Tri-BBAI 5637.36 743.53 3.00 0.05 1.14 0.08 100% Opt-BBAI 5661.51 740.74 3.00 0.05 1.03 0.06 100%

Track-and-Stop 1747.29 318.85 1747.29 318.85 790.89 119.52 100% Top-k δ-Elimination 214830 0 2 0.01 0.28 0.01 100% ID-BAI 359158 0 - 0.423 0.015 100% Collab Top M 50556.41 1822.998 2.997 0.054 0.067 0.007 100% Tri-BBAI 6356.78 787.53 3.00 0.03 1.30 0.16 100% Opt-BBAI 6355.28 793.45 3.00 0.03 1.03 0.05 100%

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes]

Justification: The main claims presented in the abstract and introduction accurately represent the contributions and scope of the paper. Guidelines:

The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: See the discussion in Section 5. Guidelines:

The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes]

Justification: We provide the full set of assumptions and complete proofs. Guidelines:

The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer:[Yes] Justification: We include all information needed to reproduce the main experimental results in Appendix E. Guidelines:

The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer:[Yes]

Justification: We have uploaded my source and will make the full code public if this work gets accepted.

Guidelines:

The answer NA means that paper does not include experiments requiring code. Please see the Neur IPS code and data submission guidelines (https://nips.cc/ public/guides/Code Submission Policy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https: //nips.cc/public/guides/Code Submission Policy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We have included all the test details in Appendix E.

Guidelines:

The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: We report the error range in Tables 3 and 4.

Guidelines:

The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.) The assumptions made should be given (e.g., Normally distributed errors).

It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: We state all information on the computer resources in Appendix E.

Guidelines:

The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn t make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines?

Answer: [Yes]

Justification: The research conducted in the paper conforms, in every respect, to the Neur IPS Code of Ethics.

Guidelines:

The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA] .

Justification: There is no societal impact of the work performed.

Guidelines:

The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper poses no such risks.

Guidelines:

The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [NA] .

Justification: The paper does not use existing assets.

Guidelines:

The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

If this information is not available online, the authors are encouraged to reach out to the asset s creators. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: Our paper does not release new assets. Guidelines:

The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file. 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: Our paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] . Justification: Our paper does not involve crowdsourcing nor research with human subjects. Guidelines:

The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.