# optimal_batched_linear_bandits__5a38d585.pdf Optimal Batched Linear Bandits Xuanfei Ren 1 Tianyuan Jin 2 Pan Xu 3 Abstract We introduce the E4 algorithm for the batched linear bandit problem, incorporating an Explore Estimate-Eliminate-Exploit framework. With a proper choice of exploration rate, we prove E4 achieves the finite-time minimax optimal regret with only O(log log T) batches, and the asymptotically optimal regret with only 3 batches as T , where T is the time horizon. We further prove a lower bound on the batch complexity of linear contextual bandits showing that any asymptotically optimal algorithm must require at least 3 batches in expectation as T , which indicates E4 achieves the asymptotic optimality in regret and batch complexity simultaneously. To the best of our knowledge, E4 is the first algorithm for linear bandits that simultaneously achieves the minimax and asymptotic optimality in regret with the corresponding optimal batch complexities. In addition, we show that with another choice of exploration rate E4 achieves an instance-dependent regret bound requiring at most O(log T) batches, and maintains the minimax optimality and asymptotic optimality. We conduct thorough experiments to evaluate our algorithm on randomly generated instances and the challenging End of Optimism instances (Lattimore & Szepesvari, 2017) which were shown to be hard to learn for optimism based algorithms. Empirical results show that E4 consistently outperforms baseline algorithms with respect to regret minimization, batch complexity, and computational efficiency. 1. Introduction Sequential decision-making problems including bandits and reinforcement learning (Bubeck et al., 2012; Slivkins et al., 2019; Lattimore & Szepesv ari, 2020) have become pivotal 1University of Science and Technology of China 2National University of Singapore 3Duke University. Correspondence to: Pan Xu . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). in modeling decision processes where an agent actively interacts with an uncertain environment to achieve a longterm goal. In particular, we study the linear contextual bandit problems (Langford & Zhang, 2007; Li et al., 2010; Abbasi-Yadkori et al., 2011; Chu et al., 2011; Agrawal & Goyal, 2013; Kirschner et al., 2021; Xu et al., 2022), where the agent can choose from K arms to play at each step, and each arm is associated with a feature vector in the Rd space. Each arm when played emits a noisy reward with its mean assumed to be a linear function of the feature vector with the unknown weight parameter shared across different arms. These bandit problems are prevalent in numerous real-world applications, from online advertising (Abe et al., 2003) to recommendation systems (Agarwal et al., 2008; Li et al., 2010), where abundant side information is available and thus provides powerful feature representation of each arm. The most critical dilemma in designing algorithms for learning bandit problems is the trade-off between exploitation and exploration, where the agent must decide whether to exploit the currently known best option or to explore unknown possibly suboptimal options for maximizing the long-run gains. In conventional bandit setting, the agent adjusts its strategy or policy step-by-step, by first choosing an arm to play, observing a reward for this arm, and adjusting its strategy accordingly in the immediate next step. 
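To make this step-by-step protocol concrete, the following minimal simulation sketch plays a linear bandit in the fully sequential fashion just described. The toy instance, the Gaussian noise, and the greedy-after-uniform policy are illustrative assumptions only, not the algorithm proposed in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 2, 3, 1000
theta_star = np.array([1.0, 0.0])                  # unknown to the learner
arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.99, 0.02]])

H, b, regret = np.eye(d) * 1e-6, np.zeros(d), 0.0
for t in range(T):
    # a toy policy: explore uniformly for the first 100 steps, then act greedily
    if t < 100:
        a = t % K
    else:
        theta_hat = np.linalg.solve(H, b)          # least squares estimate
        a = int(np.argmax(arms @ theta_hat))
    x = arms[a]
    r = x @ theta_star + rng.standard_normal()     # noisy reward, observed immediately
    H += np.outer(x, x); b += r * x                # the estimate is refreshed every step
    regret += (arms @ theta_star).max() - x @ theta_star
print(f"cumulative regret after T={T} steps: {regret:.1f}")
```

The defining feature of this protocol is that the estimate of the unknown parameter, and hence the arm choice, can be revised after every single observation.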
This ideal setting is also referred to as the fully-sequential setting, where immediate outcome can be obtained and switch polices is cost-efficient. In real-world problems, fully sequential learning strategies are usually expensive or infeasible in critical applications where the interaction with the environment takes months to observe the rewards like in clinical trials or medical treatment (Robbins, 1952), or switching between different policies is expensive in e-commerce (Bertsimas & Mersereau, 2007). Therefore, the batched bandit setting is more realistic and feasible, where an agent selects a batch of actions rather than a single action at one decision point to play, which significantly minimizes the number of policy switching and expedites the learning process by enabling parallel experiments (Perchet et al., 2015; Gao et al., 2019; Esfandiari et al., 2021; Han et al., 2020; Jin et al., 2021b;a; Ruan et al., 2021; Jin et al., 2023). To evaluate the effectiveness of bandit algorithms, a crucial metric is the regret, which represents the gap between Optimal Batched Linear Bandits the expected cumulative rewards of the learning agent and those of an oracle who knows the optimal action at each step in hindsight. We denote the regret as RT , where T is the learning time horizon. The goal of a bandit algorithm is to find a tight bound for the regret to guarantee good performance theoretically. Generally speaking, there are three types of bounds of regret that are considered in the bandit literature. The first type of regret bound is called the worstcase regret bound where we measure the performance of a bandit algorithm with respect to the worst possible bandit instance. For linear bandits with K arms, it is well-established (Lattimore & Szepesv ari, 2020) that there exists a bandit instance such that, for all bandit strategies denoted by π, the regret satisfies RT Ω( d T). The second type of regret is called the instance-dependent regret bound, where the regret RT is bounded by problem dependent parameters including the time horizon, the mean rewards of different arms, the number of arms, etc. Instance-dependent regret bounds provide performance evaluation of a bandit algorithm specific to certain bandit instances, offering more delicate insights into an algorithm s behavior in practice. The above regret bounds are both in the finite-time regime, where we fixed the time horizon T. The third type of regret bound, the asymptotic regret bound, characterizes the performance of an algorithm when T goes infinite. It has been proved that the regret of any consistent algorithm satisfies lim inf T RT / log T c , where c := c (θ ) is a statistical complexity measure of the problem under the bandit instance θ (Graves & Lai, 1997; Combes et al., 2017; Lattimore & Szepesvari, 2017), which we provide a formal definition in Section 3. Achieving the optimal regret across various regret types poses a significant challenge. Notably, instance-dependent analysis proves to be more intricate than worst-case analysis. Many algorithms, while considered minimax optimal, may encounter difficulties in specific instances (Lattimore & Szepesvari, 2017). Conversely, certain existing instanceoptimal algorithms lack a guarantee for worst-case regret (Wagenmaker & Foster, 2023). It is even more challenging to achieve optimal regret for batched bandit algorithms. Due to their batch design, learning agents cannot update the exploration policy after each data point, resulting in a rarelyswitching nature. 
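For reference, the three performance criteria discussed above can be summarized as follows, with the worst-case rate stated for K-armed linear bandits, a representative gap-dependent form that appears later in the paper, and the complexity constant c*(θ*) defined formally in Section 3.

```latex
% Worst-case (minimax) lower bound: some instance forces
R_T \ge \Omega\bigl(\sqrt{dT}\bigr).
% Instance-dependent bound (gap-dependent), with \Delta_{\min} the smallest suboptimality gap:
R_T = O\bigl(d \log T / \Delta_{\min}\bigr).
% Asymptotic lower bound for any consistent policy:
\liminf_{T \to \infty} \frac{R_T}{\log T} \ge c^*(\theta^*).
```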
In this paper, we show that we can achieve the same order of all three types of regret as fully sequential algorithms but with far fewer batches. In particular, we first provide a lower bound on the batch complexity of asymptotically optimal algorithms, and then propose an algorithm that attains both the non-asymptotic and the asymptotic optimal regret simultaneously, with the corresponding optimal batch complexities. At the core of our algorithm is a careful design of the Explore-Estimate-Eliminate-Exploit framework leveraging D-optimal design exploration, optimal allocation learning, and arm elimination. Our contributions are summarized as follows.

We propose the Explore-Estimate-Eliminate-Exploit (E4) algorithm for solving batched linear contextual bandits. With a proper choice of exploration rate, we prove that E4 achieves an Õ(√(dT)) worst-case regret after only O(log log T) batches, which matches the existing lower bounds on the regret and batch complexity for linear contextual bandits with finitely many arms (Gao et al., 2019). Under the same configuration, when T → ∞, E4 achieves the asymptotically optimal regret with only 3 batches. We also prove that any asymptotically optimal algorithm must have at least 3 batches in expectation as T → ∞. Therefore, E4 is the first algorithm for linear bandits that is simultaneously optimal in terms of regret bound and batch complexity in both the finite-time worst-case setting and the asymptotic setting. We present a comparison of regret and batch complexities with existing work in Table 1 for the reader's reference.

With a different exploration rate, we prove that our E4 algorithm attains an instance-dependent regret bound O(d log(KT)/Δ_min) with at most O(log T) batches, where Δ_min is the minimum gap between the mean rewards of the best arm and the suboptimal arms. Under the same configuration, E4 also maintains the minimax optimality and asymptotic optimality. Our result nearly matches the batch complexity lower bound (Gao et al., 2019), which states that any algorithm with O(d log T/Δ_min) regret must have at least Ω(log T) batches.

We conduct experiments on challenging linear bandit problems, including the End of Optimism instances (Lattimore & Szepesvari, 2017), on which optimism-based algorithms have been shown to be suboptimal, as well as on a suite of randomly generated instances. Our empirical evaluation verifies that E4 not only outperforms the existing baselines in terms of regret and batch complexity but also attains superior computational efficiency, exhibiting a significant speedup over the baseline algorithms.

Notation. We denote the set {1, . . . , K} as [K], K ∈ N+. We use a bold letter x ∈ R^d to denote a vector, and ‖x‖_2 = √(x⊤x) is its Euclidean norm. For a positive semidefinite matrix V ∈ R^{d×d}, ‖x‖_V = √(x⊤Vx) is the Mahalanobis norm. For an event E within a probability space, we denote its complement by E^c. For f(T) and g(T) as functions of T, we write g(T) = O(f(T)) to mean that there is a constant C independent of T such that g(T) ≤ Cf(T), and use Õ(f(T)) to omit logarithmic dependencies. We also write f(T) = Ω(g(T)) if g(T) = O(f(T)). When both f(T) = O(g(T)) and g(T) = O(f(T)) hold, we write f(T) = Θ(g(T)). We adopt the convention 0/0 = 0.

Table 1: Regret and batch complexity comparison in the finite-time and asymptotic settings for linear bandits.

Algorithm | Worst-case regret (non-asymptotic) | Batch complexity (non-asymptotic) | Asymptotic regret | Batch complexity (asymptotic)
Abbasi-Yadkori et al. (2011) | O(d√T) | O(log T) | - | -
Esfandiari et al. (2021) | O(√(dT)) | O(log T) | - | -
Ruan et al. (2021) | O(√(dT)) | O(log log T) | - | -
Hanna et al. (2023) | O(√(dT)) | O(log log T) | - | -
Lower bound (Gao et al., 2019) | Ω(√(dT)) | Ω(log log T) | - | -
Lattimore & Szepesvari (2017) | - | - | Optimal | Sequential
OSSB (Combes et al., 2017) | - | - | Optimal | Sequential
OAM (Hao et al., 2020) | - | - | Optimal | Sequential
SOLID (Tirinzoni et al., 2020) | O(√((d + log K)T)) | Sequential | Optimal | Sequential
IDS (Kirschner et al., 2021) | O(d√T) | O(d⁴ log⁴ T/Δ²_min) | Optimal | O(log⁴ T)
Lower bound (Theorem 5.2) | - | - | Optimal | 3
E4 (Algorithm 1) | O(√(dT)) | O(log log T) | Optimal | 3

2. Related Work

Existing works on batched linear bandits predominantly concentrated on achieving non-asymptotic optimal regret with minimal batch complexity. Gao et al. (2019) argued that, for an algorithm to achieve minimax optimality, it should incorporate at least Θ(log log T) batches. Additionally, to attain an instance-dependent regret of order O(d log T/Δ_min), an algorithm should include at least Θ(log T) batches. In the contextual linear bandit setting, Han et al. (2020); Ruan et al. (2021); Zhang et al. (2021); Hanna et al. (2023) delved into linear bandits with both stochastic and adversarial contexts. However, the techniques proposed by Han et al. (2020) cannot be directly applied to our fixed K-arm setting, as they assume that each arm is i.i.d. sampled from some distribution. Ruan et al. (2021); Zhang et al. (2021); Hanna et al. (2023) studied stochastic contextual linear bandit problems, and our setting can be seen as a special case of theirs when the number of contexts is one, meaning the arm set is fixed. These works achieved an O(d√T) regret bound with O(log log T) batches. In the context of linear bandits with fixed K arms, a simple algorithm with O(log T) batch complexity achieves an optimal regret bound of O(√(dT)) (Esfandiari et al., 2021; Lattimore & Szepesvari, 2020). However, this falls short of optimality compared with the lower bound in Gao et al. (2019). Another related line of research is the algorithms that achieve asymptotically optimal regret for linear bandits (Lattimore & Szepesvari, 2017; Combes et al., 2017; Hao et al., 2020; Tirinzoni et al., 2020; Kirschner et al., 2021; Wagenmaker & Foster, 2023), which strive for the best possible performance over a sufficiently large horizon. Nevertheless, traditional well-performing optimistic algorithms like Upper Confidence Bound (UCB) and Thompson Sampling (TS) have been proven to fall short of achieving asymptotic optimality (Lattimore & Szepesvari, 2017). Lattimore & Szepesvari (2017) demonstrated an asymptotic lower bound on the regret of any consistent algorithm and proposed a simple algorithm with a matching upper bound, but no finite-time performance was guaranteed. Similarly, Combes et al. (2017) achieved asymptotic optimality for a large class of bandit problems, but no finite-time regret guarantees were provided either. Hao et al. (2020) achieved asymptotic optimality with good empirical performance within a finite horizon, considering a contextual bandit setting with a finite number of contexts, which shares similarities with our setting. They also proved that an optimistic algorithm could have bounded regret when the set of optimal arms spans the arm space, aligning with the End of Optimism phenomenon (Lattimore & Szepesvari, 2017). Further, Tirinzoni et al. (2020) provided the first asymptotically optimal regret bound together with non-asymptotic instance-dependent and worst-case regret bounds, achieving minimax optimality. Kirschner et al.
(2021) proposed the frequentist IDS algorithm that leverages an information theoretic method, achieving minimax and asymptotic optimality simultaneously with a non-asymptotic instance-dependent regret bound. Wagenmaker & Foster (2023) proposed a method for achieving instance optimality in both asymptotic and non-asymptotic settings, along with a unifying framework for non-asymptotic instance optimality. Even so, they fell short of achieving minimax optimality. More importantly, all of the mentioned works that focus on asymptotic regret analysis are designed in a fully sequential way and cannot be applied to the batched linear bandit setting. 3. Preliminary In this paper, we study the linear contextual bandit problem, where the reward of an arm x X Rd is given by r(x) = x, θ + ε. Here x is the arm feature, X is the arm set with |X| = K arms in total, θ Rd is an unknown weight parameter, and ε is a zero-mean 1-subgaussian noise. For simplicity, we assume X spans Rd. Otherwise, we can Optimal Batched Linear Bandits always find a corresponding arm set with a lower dimension that does by applying an orthogonal transformation to both θ and actions. For normalization purposes, we assume bounded expected rewards: | x, θ | L, x X, where L > 0 is some constant. Conventional bandit algorithms operate in a fully sequential way. In each round t = 1, . . . , T, the learner chooses an action xt X and then observes a reward rt = r(xt). The objective is to minimize the accumulated regret of the algorithm defined as RT = E[maxx X PT t=1 x xt, θ ]. In the batched setting, the algorithm pulls a batch of bℓarms at round ℓ. For T total arms played, we can equivalently rewrite the regret as RT = E[maxx X PB ℓ=1 Pbℓ j=1 x xℓ,j, θ ]. where bℓis the batch size of batch ℓ, B is the total number of batches, and PB ℓ=1 bℓ= T. Note that the two definitions are exactly the same by re-indexing the arms pulled in each batch. We assume the best arm x = argmaxx X x, θ is unique. We define x = x x, θ to be the suboptimality gap for an arm x = x , and min = minx =x x. We also define µx = x, θ to be the mean reward for pulling arm x. For a K-dimensional non-negative vector α RK 0, we use α = {αx}x X to denote an allocation of arm pulls over arm set X. Asymptotic lower bound Asymptotic lower bounds are usually defined for consistent policies (Lattimore & Szepesv ari, 2020). A policy π is called consistent (Lattimore & Szepesvari, 2017) if its regret is sub-polynomial for any bandit instance, i.e., RT = o(T p) for all θ and p > 0. For a linear bandit instance with arm set X and weight θ , define Hα = P x X αxxx , where α RK 0 is an allocation for the number of pulls of each arm.1 c (θ ) is the solution to the following convex program, c (θ ) = inf α RK 0 s.t. x x 2 H 1 α 2 x 2 , x X , where X = X {x }. We write c := c (θ ) when no ambiguity arises. Lemma 3.1 (Lattimore & Szepesvari (2017)). Any consistent algorithm π for the linear bandits setting with Gaussian noise has regret RT satisfying lim inf T RT / log T c , where c is given by (3.1). We call an algorithm asymptotically optimal if it achieves lim T RT / log T = c . 1For the ease of presentation, we may overload the notation of covariance matrix H a little bit in this paper. Specifically, for an integer t N+, we define Ht = Pt s=1 xsx s . For a vector w = {wx}x X RK 0, we define Hw = P x X wx xx . Least square estimators For a dataset {(xs, rs)}t s=1, we define the following least squares estimators which will be used in our algorithm design. 
Let H = Pt s=1 xsx s , and ˆθ = H 1Pt s=1 xsrs, (3.2a) ˆµx(t) = ˆθ, x , (3.2b) ˆx = argmaxx X ˆθ, x , (3.2c) ˆ x = ˆθ, ˆx x . (3.2d) 4. Optimal Batched Linear Bandits Algorithm Algorithm 1 Explore, Estimate, Eliminate, and Exploit (E4) Input: arm set X, horizon T, α > 0, γ > 0, Doptimal design allocation function f( , , ) defined in Definition 4.1, exploration rate {T1, T2, . . .}, elimination parameters {ε1, ε2, . . .}. Initialization: set A X and t 0. 1: Exploration: Pull arm x A for f(x, A, T1) times. 2: t t + P x A f(x, A, T1). 3: Estimation: Update least squares estimators ˆθ, ˆx , ˆ based on the current batch of data by (3.2). 4: Calculate w( ˆ ) = {wx}x A by Definition 4.3. Batch ℓ= 2 5: Exploration: Pull arm x A for f(x, A, T2) + nx times, where nx = min wx α log T, (log T)1+γ . 6: b2 P x A f(x, A, T2) + nx, t t + b2. 7: Estimation: Update least squares estimators ˆθ, ˆx , ˆ based on the current batch of data by (3.2). 8: Elimination: if stopping rule (4.2) holds A {ˆx }. end if Batch ℓ 3 9: ℓ 3. 10: while t < T and |A| > 1 do 11: Exploration: Pull arm x A for f(x, A, Tℓ) times. 12: Estimation: Update least squares estimator ˆθ based on the current batch of data by (3.2). 13: Elimination: A x A : maxy A ˆθ, y x 2εℓ . 14: ℓ ℓ+ 1, t t + P x A f(x, A, Tℓ). 15: end while 16: Exploitation: Pull arm x A for T t times . In this paper, we design a general algorithm framework, called Explore-Estimate-Eliminate-Exploit (E4), which is presented in Algorithm 1. The algorithm proceeds in a batched fashion. In each batch we use an Exploration stage that explores the arm space in some way we would specify below, an Estima- Optimal Batched Linear Bandits tion stage that calculate some statistics with least squares estimators, and an Elimination stage that eliminates lowrewarding arms. Note that in a specific Estimation stage, we exclusively use data from that particular batch for the computation of estimators and statistics. This ensures the independence, aligning with our theoretical objectives. In Algorithm 1, we maintain an active set, denoted by A X, which only consists of arms that we are unsure about its suboptimality and need to further explore. If after some batch, there is only one arm in the active arm set A, the algorithm enters the Exploitation stage. This means it identifies the best arm and would pull it until the total pulls reach T. Let ℓbe the index of batches. We denote bℓas the total number of arms pulled in the ℓ-th batch, which is also called as the batch size. And we denote Tℓas the D-optimal design exploration rate in the ℓ-th batch. In what follows, we elaborate the details in Algorithm 1 for different batches. Batch ℓ= 1: Explore with D-optimal design and estimate the exploration allocation w for the next batch. Exploration: In the first batch, the agent has no prior information on arms and thus explores all arms through the classical D-optimal design (Lattimore & Szepesv ari, 2020). Specifically, given an arm set A and and a user defined exploration rate M, we use D-optimal design rule to decide the number of pulls of each arm x, denoted by f(x, A, M), which is defined as follows. Definition 4.1. For an arm set A and an exploration rate M, the D-optimal design allocation of arm x A is given by f(x, A, M) = 2π xg(π )M/d , where π := π (A) is a probability measure over the action set defined by π (A) = argminπ maxx A x 2 H 1 π , (4.1) and g(π) = maxx A x 2 H 1 π . Throughout our algorithm, we employ this D-optimal design exploration rule. 
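As a concrete illustration, the sketch below approximates the D-optimal design distribution π* of Definition 4.1 with a simple Frank-Wolfe style iteration (the approach mentioned in Remark 4.2 below) and then rounds it into per-arm pull counts f(x, A, M). The step-size rule, iteration count, and regularization here are illustrative assumptions, not the exact procedure analyzed in the paper.

```python
import numpy as np

def d_optimal_design(arms, n_iters=1000, reg=1e-6):
    """Approximate the D-optimal design pi* over the rows of `arms` via Frank-Wolfe."""
    K, d = arms.shape
    pi = np.full(K, 1.0 / K)                          # start from the uniform design
    for k in range(n_iters):
        H = arms.T @ (pi[:, None] * arms) + reg * np.eye(d)
        H_inv = np.linalg.inv(H)
        # leverage scores ||x||^2_{H_pi^{-1}} for every arm
        lev = np.einsum('ij,jk,ik->i', arms, H_inv, arms)
        a = int(np.argmax(lev))                       # most informative arm
        step = 2.0 / (k + 2)                          # simple diminishing step size
        e_a = np.zeros(K); e_a[a] = 1.0
        pi = (1 - step) * pi + step * e_a
    H = arms.T @ (pi[:, None] * arms) + reg * np.eye(d)
    g = float(np.max(np.einsum('ij,jk,ik->i', arms, np.linalg.inv(H), arms)))
    return pi, g                                      # g(pi) = max_x ||x||^2_{H_pi^{-1}}

def allocation(arms, M):
    """Pull counts f(x, A, M) = ceil(2 * pi_x * g(pi) * M / d) for each arm x."""
    pi, g = d_optimal_design(arms)
    d = arms.shape[1]
    return np.ceil(2 * pi * g * M / d).astype(int)
```

By the Kiefer-Wolfowitz theorem, g(π*) = d at the exact optimum, so the rounded pull counts sum to roughly 2M; the near-optimal solution returned by Frank-Wolfe only changes this by a constant factor.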
Specifically, we set the exploration rate M = Tℓfor the ℓ-th batch. Remark 4.2. We present Definition 4.1 here just for simplicity. In both our theoretical analysis and experiments, we use the Frank Wolfe algorithm (see Chapter 21.2 in Lattimore & Szepesv ari (2020)) to approximately solve the optimization problem in (4.1) and get a near-optimal solution π. See Appendix C for more details. Estimation: Then Algorithm 1 calculates least squares estimators by (3.2). Based on these estimators, we solve an allocation w = {wx}x X using the following definition, which will be used in the Exploration stage of batch 2. Definition 4.3. Let ϵ = 1/ log log T, and w = {wx}x X [0, )K be the solution to the problem minw [0, )K P x =ˆx wx( ˆ x 4ϵ) s.t. x ˆx 2 H 1 w ( ˆ x 4ϵ)2/2, x = ˆx , and wˆx (log T)γ/α, where ˆ and ˆx are defined in (3.2), Hw is defined in Section 3, γ (0, 1) and α = (1 + 1/ log log T)(1 + d log log T/ log T). Remark 4.4. Our Definition 4.3 is novel and different from literature. Compared with the naive allocation solved from an optimization problem like (3.1), we use ˆ x 4ϵ instead of ˆ x, and add a constraint wˆx (log T)γ/α in the program. From Lemma C.6 we can see the solution from Definition 4.3 convergences to the solution of naive method as T , so our modification tricks in Definition 4.3 doesn t influence the asymptotic regret analysis. In fact, with high probability, estimated gaps based on data in the second batch of our algorithm are larger than those underestimated gaps ˆ x 4ϵ. We elaborate this trick in our proof of Lemma C.7, which shows that with high probability, the stopping rule (4.2) holds after the second batch. Batch ℓ= 2: Explore with D-optimal design and the allocation w from Definition 4.3, and eliminate all suboptimal arms based on Chernoff s stopping rule. Exploration: The exploration stage of batch 2 requires playing arms according to two allocations: (1) D-optimal design exploration specified by Definition 4.1, and (2) optimal allocation specified by Definition 4.3. More specifically, for the second part, we pull x A for min{wx α log T, (log T)1+γ} times, where γ (0, 1) is an arbitrarily small constant and w = {wx}x X is solved from Definition 4.3 based on the estimators from batch 1. Estimation: We calculate the least squares estimators according to (3.2) based on the current batch of data. Elimination: We then test the Chernoff s stopping rule and only keeps the current best arm based on our Estimation stage if it holds. This means we will eliminate all suboptimal arms and directly go to the Line 1, Exploitation stage of Algorithm 1. In particular, the stopping rule is defined as Z(b2) β(b2, 1/T) and Pb2 s=1 xsx s c Id , (4.2) where b2 is the batch size of the second batch defined in Line 1 of Algorithm 1, c = maxx X x 2 2 and Z(b2) = min x =ˆx ˆ 2 x/(2 x ˆx 2 H 1 b2 ), (4.3) β(t, δ) = (1 + 1/ log log T) log (t log log T) d 2 /δ . (4.4) If the stop rule (4.2) does not hold, Algorithm 1 will enter the while loop and conduct more batches of exploration. The use of Chernoff s stopping rule is inspired from best arm identification problems (Garivier & Kaufmann, 2016; Optimal Batched Linear Bandits Jedra & Proutiere, 2020; Jin et al., 2023). Specifically, if (4.2) holds, we can prove that ˆx is the true best arm with probability at least 1 1/T. This further ensures that no more exploration is needed and the regret is small. Batch ℓ 3: Perform phased elimination with D-optimal design exploration. 
When Algorithm 1 does enter these batches, it means that the previous two batches were not sufficient to identify the best arm and thus more exploration is needed. To this end, we adopt phased elimination with D-optimal design exploration (Lattimore & Szepesvari, 2020).
Exploration: We use the D-optimal design exploration defined in Definition 4.1 with rate Tℓ for the ℓ-th batch.
Estimation: We calculate the least squares estimators according to (3.2) based on the current batch of data.
Elimination: We eliminate arms whose estimated mean rewards are smaller than that of the empirically best arm by a margin of 2εℓ. The active arm set is updated as A ← {x ∈ A : max_{y∈A} ⟨θ̂, y − x⟩ ≤ 2εℓ}.
Final batch: Pull the estimated best arm x̂* until the end. If the Chernoff stopping rule (4.2) holds at the end of the second batch, or |A| = 1 after some batch ℓ ≥ 3, Algorithm 1 enters the Exploitation stage (Line 16). In this stage, the agent simply commits to the estimated best arm x̂* ∈ A until the total number of pulls reaches t = T.

5. Theoretical Analysis

In this section, we provide theoretical results on the regret optimality and batch complexity of Algorithm 1. First we give a formal definition of a batch.
Definition 5.1. For a linear bandit problem with time horizon T, we say that the batch complexity of a learning algorithm is (at most) M if the learner decides T1 and π1 before the learning process starts and executes π1 in the first T1 time steps (which corresponds to the first batch). Based on the data (context sets, played actions, and rewards) obtained from the first T1 steps, the learner then decides T2 and π2, and executes π2 for T2 time steps (the second batch). The learner repeats this process for M times/batches. In general, at the beginning of the k-th batch, the learner decides Tk (the size of the batch) and πk based on the data collected from the first (k − 1) batches. The batch sizes must satisfy Σ_{k=1}^{M} Tk = T.
We present the following result on the batch complexity lower bound of asymptotically optimal algorithms.
Theorem 5.2. If an algorithm achieves the asymptotic optimality defined in Lemma 3.1, then on some bandit instances it must have at least 3 batches in expectation as T → ∞.
Remark 5.3. In our proof detailed in Appendix B, we first show that any algorithm that performs at most 2 batches is not asymptotically optimal. Then we further show that even with randomly chosen batch sizes, the expected number of batches of an asymptotically optimal algorithm is at least 3. This is the first batch complexity lower bound in the literature for asymptotically optimal algorithms. Although the lower bound in Theorem 5.2 may seem conservative, we show in the next theorem that our proposed algorithm E4 achieves asymptotic optimality in regret and its batch complexity matches the lower bound in Theorem 5.2.
Next we present the upper bounds on the regret and the batch complexity of Algorithm 1.
Theorem 5.4. In Algorithm 1, let the parameters be set to α = (1 + 1/log log T)(1 + d log log T/log T), δ = 1/(KT²), γ ∈ (0, 1), and εℓ = √(d log(1/δ)/Tℓ) for ℓ ≥ 1. Let the exploration rates be defined as 𝒯₁ = {T1 = (log T)^{1/2}, T2 = (log T)^{1/2}, T3 = (log T)^{1+γ}, Tℓ = T^{1 − 1/2^{ℓ−3}} for ℓ ≥ 4}. Then the regret of Algorithm 1 satisfies RT = O(log log T · √(dT log(KT))) = Õ(√(dT)), and Algorithm 1 needs at most O(log log T) batches. As T → ∞, RT/log T → c*, and Algorithm 1 only runs 3 batches in expectation. Furthermore, Algorithm 1 runs with 3 batches with probability at least 1 − 2/(log T)² as T → ∞.
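As a quick sanity check on the schedule 𝒯₁ in Theorem 5.4, the short sketch below evaluates the exploration rates Tℓ for a given horizon. It only illustrates how the rates grow (with exponents doubling toward T); the cap on the number of batches is an implementation convenience, not part of the theorem.

```python
import math

def exploration_rates_T1(T, gamma=0.1, max_batches=12):
    """Exploration rates from Theorem 5.4:
    T1 = T2 = (log T)^(1/2), T3 = (log T)^(1+gamma),
    T_l = T^(1 - 1/2^(l-3)) for l >= 4."""
    rates = [math.log(T) ** 0.5, math.log(T) ** 0.5, math.log(T) ** (1 + gamma)]
    l = 4
    while len(rates) < max_batches:
        rates.append(T ** (1 - 1 / 2 ** (l - 3)))
        l += 1
    return rates

print([round(r, 1) for r in exploration_rates_T1(10 ** 5)])
# For l >= 4 the rates grow as T^(1/2), T^(3/4), T^(7/8), ..., so the largest
# batch reaches order T after O(log log T) phases, matching the batch complexity bound.
```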
Remark 5.5. Our Õ(√(dT)) regret bound nearly matches the lower bound in Lattimore & Szepesvari (2020), and the O(log log T) batch complexity matches the lower bound of Ω(log log T) in Gao et al. (2019). Furthermore, according to Theorem 5.2, our algorithm also achieves the asymptotically optimal regret bound with the optimal batch complexity, i.e., E4 only needs 3 batches to achieve asymptotic optimality. To the best of our knowledge, E4 is the only algorithm in the literature that is simultaneously optimal in terms of regret bound and batch complexity in both the finite-time worst-case setting and the asymptotic setting. In contrast, existing linear bandit algorithms can be categorized as follows: (1) achieving the minimax optimal regret bound but with O(log T) batch complexity (Esfandiari et al., 2021); (2) achieving the minimax optimal regret bound with optimal batch complexity but no asymptotic guarantees (Ruan et al., 2021; Hanna et al., 2023); (3) achieving the asymptotically optimal regret but no finite-time minimax optimality (Lattimore & Szepesvari, 2017; Combes et al., 2017; Hao et al., 2020; Wagenmaker & Foster, 2023); or (4) achieving the asymptotically optimal regret and the minimax optimality but in a fully sequential way (Tirinzoni et al., 2020). Compared to these works, frequentist Information Directed Sampling (IDS) (Kirschner et al., 2021) achieves both minimax and asymptotic optimality. Furthermore, it can be shown to have a batch complexity slightly larger than O(log⁴ T), which is still much higher than the batch complexity of our algorithm E4. (IDS is not a batched algorithm, but it has a rarely-switching structure and thus can easily be adapted into a batched algorithm; by some calculations, its batch complexity is larger than O(d⁴ log⁴ T/Δ²_min), which we also include in Table 1.) Moreover, the IDS algorithm can only achieve a worst-case regret bound of O(d√T), which is suboptimal for K-armed linear bandits when K is fixed.

Remark 5.6. Lastly, we would like to highlight that when T is large enough, the stopping rule (4.2) checked in Line 8 of Algorithm 1 holds with probability at least 1 − 2/(log T)². When this happens, our algorithm skips the while loop and goes directly to the final exploitation batch. Therefore, the algorithm only needs 3 batches to achieve asymptotic optimality. Notably, since the stopping rule in Line 8 only depends on the first two batches of Algorithm 1, different choices of exploration rates in later batches do not change the asymptotic behavior of Algorithm 1 as T → ∞, as we will see in the next theorem.

In the following theorem, we show that with a slightly different choice of exploration rates, Algorithm 1 can achieve an instance-dependent regret bound with the optimal batch complexity as well.
Theorem 5.7. Let the exploration rates in Algorithm 1 be chosen as 𝒯₂ = {T1 = (log T)^{1/2}, T2 = (log T)^{1/2}, T3 = (log T)^{1+γ}, Tℓ = d log(KT²)·2^{ℓ−3} for ℓ ≥ 4}, and let all other parameters remain the same as in Theorem 5.4. Then Algorithm 1 satisfies RT = O(√(dT log(KT))) = Õ(√(dT)) and RT = O((log T)^{1+γ} + d log(KT)/Δ_min), with at most O(log T) batches and O(log(1/Δ_min)) batches in expectation. Furthermore, the regret of Algorithm 1 satisfies lim sup_{T→∞} RT/log T ≤ c* with 3 batches in expectation.
Remark 5.8. Theorem 5.7 shows that E4 can achieve the minimax optimal regret and the instance-dependent regret bound O(d log T/Δ_min) with at most O(log T) batches. Gao et al. (2019) proved that any algorithm with O(d log T/Δ_min) regret must have at least Ω(log T/log log T) batches.
Hence, in this context, E4 attains the optimal batch complexity while achieving this instance-dependent regret bound. Notably, the instance-dependent bound gives a tighter regret bound than the worst-case regret bound within the class of bandit instances where Δ_min ≥ √(d/T).
Remark 5.9. Notably, our instance-dependent batch complexity O(log(1/Δ_min)) is novel and does not depend on the time horizon T, which shows that for instances with large reward gaps the batch complexity of Algorithm 1 is a small constant and does not increase as the horizon T increases.

6. Experiments

In this section, we conduct experiments on challenging linear bandit instances, assessing the performance of our E4 algorithm in comparison to several baseline algorithms. In particular, we first run simulations on hard linear bandit instances constructed in the spirit of the End of Optimism instance (Lattimore & Szepesvari, 2017), on which optimistic algorithms such as UCB and TS are proved to be suboptimal as T → ∞. We then conduct experiments on more general random instances. The implementation can be found at https://github.com/panxulab/optimal-batchedlinear-bandits.
Baselines: We compare E4 (Algorithm 1) with Rarely Switching OFUL (denoted by rs-OFUL) (Abbasi-Yadkori et al., 2011), the Optimal Algorithm (denoted by End OA) in Lattimore & Szepesvari (2017), IDS (Kirschner et al., 2021), and the Phased Elimination Algorithm with D-optimal design (denoted by Pha Elim D) (Esfandiari et al., 2021; Lattimore & Szepesvari, 2020). Note that IDS was shown to outperform other asymptotically optimal algorithms (Lattimore & Szepesvari, 2017; Combes et al., 2017; Hao et al., 2020; Tirinzoni et al., 2020), and thus we do not include the rest of them in our experiments.
Figure 1: The End of Optimism instance in R². The true parameter θ* is (1, 0). The arms are x1 = (1, 0), x2 = (0, 1), x3 = (1 − ε, 2ε). Note that xi is the best arm if θ* lies in the colored region Ci, i = 1, 2, 3.
Figure 2: Regret and Batch Analysis: End of Optimism instances (d = 5, K = 9). Panels (a)-(b): ϵ = 0.01; panels (c)-(d): ϵ = 0.2; each pair shows regret and number of batches versus time step for Pha Elim D, rs-OFUL, End OA, IDS, and E4.
End of Optimism Instances: Lattimore & Szepesvari (2017) designed a hard instance in R² such that optimism-based algorithms cannot achieve asymptotic optimality. For completeness, we present the original End of Optimism instance in R² in Figure 1. Following a similar idea, we design several hard instances in more complicated settings to evaluate our proposed algorithm. Let e^d_j denote the natural basis vector in R^d whose j-th entry is 1 and all others are 0. Then we construct the following hard instances inspired by the End of Optimism (Lattimore & Szepesvari, 2017). Let θ* = e^d_1 and define the arm set X as
X = {e^d_i}_{i=1}^{d} ∪ {(1 − ϵ)e^d_1 + 2ϵ e^d_j}_{j=2}^{d},   (6.1)
where ϵ = 0.01, 0.2 and the dimension d = 2, 3, 5. Hence there are 6 different hard instances in total. For each instance, the number of arms is K = 2d − 1.
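For concreteness, the construction in (6.1) can be generated as follows. The helper names are hypothetical, and the Gaussian reward model matches the noise described in the implementation details below; printing the suboptimality gaps is only for inspection.

```python
import numpy as np

def end_of_optimism_instance(d=5, eps=0.01):
    """Build the hard instance from (6.1): theta* = e_1 and
    arms = standard basis plus (1 - eps) * e_1 + 2 * eps * e_j for j >= 2."""
    theta = np.zeros(d); theta[0] = 1.0
    basis = np.eye(d)
    near_optimal = [(1 - eps) * basis[0] + 2 * eps * basis[j] for j in range(1, d)]
    X = np.vstack([basis] + near_optimal)             # K = 2d - 1 arms in total
    return X, theta

def pull(x, theta, rng):
    """Noisy reward r(x) = <x, theta*> + N(0, 1) noise."""
    return x @ theta + rng.standard_normal()

X, theta = end_of_optimism_instance(d=2, eps=0.01)
gaps = (X @ theta).max() - X @ theta                  # suboptimality gaps Delta_x
print(X, gaps)
```

For d = 2 and ϵ = 0.01 this reproduces the instance of Figure 1, with gaps Δ = (0, 1, ϵ).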
We conduct experiments on these instances and present the result for d = 5 in Figure 3. Due to space limit, we defer the results for d = 2 and d = 3 to Appendix A.1. Implementation: Based on the weight parameter θ and arm set X defined in (6.1), we generate the noise reward for arm x X as r(x) = x, θ + ε, where ε N(0, 1). For the parameters in E4 (Algorithm 1), we fix the exploration rate to be T1. We set the other parameters as described in Theorem 5.4. For rs-OFUL, we implement the rarely switching OFUL algorithm in Abbasi-Yadkori et al. (2011) with switching parameter C = 0.5. For End OA, we follow the Warmup phase and Success phase exactly the same way as Lattimore & Szepesvari (2017) describes, and use OFUL in its Recovery phase. For IDS, we follow their computationally efficient variant (Kirschner et al., 2021, Section 2) step by step. For Pha Elim D, we follow the framework in Esfandiari et al. (2021), and improve the batch sizes design by using exploration rate Ti = T 1 1/2i instead of qi in their paper for better performance. For each instance, we repeat the experiment for each method for 10 simulations and calculate the mean value and standard error of the cumulative regret and the batch complexity. Notably, we proposed two options for the exploration rates in Theorems 5.4 and 5.7. However, as we discussed in Remark 5.6, our algorithm skips the while loop with ex- tremely high probability. Consequently, different choices of exploration rates would not affect the performance of the algorithm in practice. Therefore, we fix the exploration rate to be T1 defined in Section 5 for the sake of convenience. Results: We observe that E4 (Algorithm 1) achieves a similar regret as IDS and outperforms all other baseline algorithms, which demonstrates the effectiveness of our proposed algorithm. Moreover, E4 only needs 3 batches to achieve this optimal regret, which is in sharp contrast with IDS which needs more than 105 batches to achieve the same regret. The only algorithm that enjoys a similar batch complexity as E4 is Pha Elim D, which only runs in around 4 batches. Nevertheless, the regret of Pha Elim D is significantly higher than our E4 algorithm. As ϵ goes from 0.01 to 0.2, the regret of our algorithm and IDS remains optimal and very small in magnitude, while the performance of other algorithms becomes worse. Table 2: Runtime (seconds) comparison. ϵ ALGORITHM d = 2 d = 3 d = 5 PHAELIMD 0.18 0.71 1.46 RS-OFUL 0.45 1.47 3.72 ENDOA 3.15 3.17 8.94 IDS 9.48 30.22 178.31 E4 0.04 0.12 0.25 PHAELIMD 0.15 0.76 1.40 RS-OFUL 0.28 1.60 2.90 ENDOA 2.23 3.87 10.19 IDS 6.42 31.24 246.53 E4 0.06 0.15 0.33 Runtime comparison: We also present the averaged runtime for all the competing algorithms across 10 independent trials, as displayed in Table 2. In all instances regardless of the dimension or ϵ, our algorithm E4 is consistently the most computationally efficient one, offering a speedup ranging from 5-fold to 1000-fold when compared to other baselines. Due to the space limit, we defer more experimental results to Appendix A. In particular, we conducted an ablation study on the sensitivity of different algorithms performance Optimal Batched Linear Bandits on the instance parameter ϵ in Appendix A.2. We also conducted extra experiments on some randomly generated instances and present the results in Appendix A.3. 7. 
Conclusion and Future Work In this paper, we proposed the Explore, Estimate, Eliminate and Exploit (E4) algorithm, which only needs 3 batches to achieve the asymptotically optimal regret in linear bandits. To the best of our knowledge, E4 is the first batched linear bandit algorithm that is simultaneously optimal in terms of regret bound and batch complexity in both the finite-time worst-case setting and the asymptotic setting. We conducted numerical experiments on challenging linear bandit instances, which unequivocally show that our method outperforms the current baseline methods in multiple critical aspects: it achieves the lowest level of regret, it requires the minimal batch complexity, and it enjoys superior running time efficiency. For future research, an intriguing objective is to investigate whether we can attain the optimal instance-dependent regret while maintaining the optimal batch complexity. While recent work by Wagenmaker & Foster (2023) achieves a specially defined instance-dependent optimality, their algorithms fail to achieve the minimax optimality, and more importantly, are fully sequential and thus cannot be applied to the batch setting. We hope our strategy proposed in this paper will provide a viable solution towards building an optimal instance-dependent algorithm with optimal batch complexity, which we leave for future work. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgements We would like to thank the anonymous reviewers for their helpful comments. This research is supported by the Whitehead Scholars Program, by the US National Science Foundation Award 2323112, by the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG-Ph D/2021-01004[T]), and by the Singapore Ministry of Education Academic Research Fund (Ac RF) Tier 2 under grant number A-8000423-00-00. In particular, P. Xu was supported in part by the National Science Foundation (DMS-2323112) and the Whitehead Scholars Program at the Duke University School of Medicine. T. Jin was supported by the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG-Ph D/2021- 01004[T]), and by the Singapore Ministry of Education Academic Research Fund (Ac RF) Tier 2 under grant number A-8000423-00-00. The views and conclusions in this paper are those of the authors and should not be interpreted as representing any funding agencies. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011. Abe, N., Biermann, A. W., and Long, P. M. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37:263 293, 2003. Agarwal, D., Chen, B.-C., Elango, P., Motgi, N., Park, S.-T., Ramakrishnan, R., Roy, S., and Zachariah, J. Online models for content optimization. In Proceedings of the 21st International Conference on Neural Information Processing Systems, pp. 17 24, 2008. Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning, pp. 127 135. PMLR, 2013. Bertsimas, D. and Mersereau, A. J. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120 1135, 2007. Bubeck, S., Cesa-Bianchi, N., et al. 
Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5 (1):1 122, 2012. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214. JMLR Workshop and Conference Proceedings, 2011. Combes, R., Magureanu, S., and Proutiere, A. Minimal exploration in structured stochastic bandits. Advances in Neural Information Processing Systems, 30, 2017. Esfandiari, H., Karbasi, A., Mehrabian, A., and Mirrokni, V. Regret bounds for batched bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7340 7348, 2021. Gao, Z., Han, Y., Ren, Z., and Zhou, Z. Batched multiarmed bandits problem. Advances in Neural Information Processing Systems, 32, 2019. Garivier, A. and Kaufmann, E. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp. 998 1027. PMLR, 2016. Optimal Batched Linear Bandits Graves, T. L. and Lai, T. L. Asymptotically efficient adaptive choice of control laws incontrolled markov chains. SIAM Journal on Control and Optimization, 35(3):715 743, 1997. Han, Y., Zhou, Z., Zhou, Z., Blanchet, J., Glynn, P. W., and Ye, Y. Sequential batch learning in finite-action linear contextual bandits. ar Xiv preprint ar Xiv:2004.06321, 2020. Hanna, O. A., Yang, L., and Fragouli, C. Contexts can be cheap: Solving stochastic contextual bandits with linear bandit algorithms. In Neu, G. and Rosasco, L. (eds.), Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pp. 1791 1821. PMLR, 12 15 Jul 2023. Hao, B., Lattimore, T., and Szepesvari, C. Adaptive exploration in linear contextual bandit. In International Conference on Artificial Intelligence and Statistics, pp. 3536 3545. PMLR, 2020. Jedra, Y. and Proutiere, A. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33:10007 10017, 2020. Jin, T., Tang, J., Xu, P., Huang, K., Xiao, X., and Gu, Q. Almost optimal anytime algorithm for batched multi-armed bandits. In International Conference on Machine Learning, pp. 5065 5073. PMLR, 2021a. Jin, T., Xu, P., Xiao, X., and Gu, Q. Double explore-thencommit: Asymptotic optimality and beyond. In Conference on Learning Theory, pp. 2584 2633. PMLR, 2021b. Jin, T., Yang, Y., Tang, J., Xiao, X., and Xu, P. Optimal batched best arm identification. ar Xiv preprint ar Xiv:2310.14129, 2023. Kirschner, J., Lattimore, T., Vernade, C., and Szepesv ari, C. Asymptotically optimal information-directed sampling. In Conference on Learning Theory, pp. 2777 2821. PMLR, 2021. Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20(1):96 1, 2007. Lattimore, T. and Szepesvari, C. The end of optimism? an asymptotic analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pp. 728 737. PMLR, 2017. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670, 2010. Perchet, V., Rigollet, P., Chassang, S., and Snowberg, E. Batched bandit problems. In Gr unwald, P., Hazan, E., and Kale, S. 
(eds.), Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pp. 1456 1456, Paris, France, 03 06 Jul 2015. PMLR. Robbins, H. Some aspects of the sequential design of experiments. 1952. Ruan, Y., Yang, J., and Zhou, Y. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 74 87, 2021. Slivkins, A. et al. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(12):1 286, 2019. Tirinzoni, A., Pirotta, M., Restelli, M., and Lazaric, A. An asymptotically optimal primal-dual incremental algorithm for contextual linear bandits. Advances in Neural Information Processing Systems, 33:1417 1427, 2020. Wagenmaker, A. J. and Foster, D. J. Instance-optimality in interactive decision making: Toward a non-asymptotic theory. In The Thirty Sixth Annual Conference on Learning Theory, pp. 1322 1472. PMLR, 2023. Xu, P., Zheng, H., Mazumdar, E. V., Azizzadenesheli, K., and Anandkumar, A. Langevin monte carlo for contextual bandits. In International Conference on Machine Learning, pp. 24830 24850. PMLR, 2022. Zhang, Z., Ji, X., and Zhou, Y. Almost optimal batch-regret tradeoff for batch linear contextual bandits. ar Xiv preprint ar Xiv:2110.08057, 2021. Optimal Batched Linear Bandits A. Additional Experiments In this section, we provide more experimental results. A.1. More Results on End of Optimism Instances We provide more experimental results on instances defined in (6.1) for d = 2 and 3 in Figure 3 and Figure 4 respectively. We also present the detailed batch complexities of algorithms in Table 3. These experiments demonstrate that the proposed algorithm E4 consistently outperforms baseline algorithms in terms of regret bound and batch complexity across different bandit instances. 0 2000 4000 6000 8000 10000 Time step Pha Elim D rs-OFUL End OA IDS E4 (a) d = 2, K = 3, ϵ = 0.01 0 2000 4000 6000 8000 10000 Time step Pha Elim D rs-OFUL End OA IDS E4 (b) d = 2, K = 3, ϵ = 0.01 0 2000 4000 6000 8000 10000 Time step Pha Elim D rs-OFUL End OA IDS E4 (c) d = 2, K = 3, ϵ = 0.2 0 2000 4000 6000 8000 10000 Time step Pha Elim D rs-OFUL End OA IDS E4 (d) d = 2, K = 3, ϵ = 0.2 Figure 3: Regret and Batch Analysis: End of Optimism instances (d = 2, K = 3). 0 10000 20000 30000 40000 50000 Time step Pha Elim D rs-OFUL End OA IDS E4 (a) d = 3, K = 5, ϵ = 0.01 0 10000 20000 30000 40000 50000 Time step Pha Elim D rs-OFUL End OA IDS E4 (b) d = 3, K = 5, ϵ = 0.01 0 10000 20000 30000 40000 50000 Time step Pha Elim D rs-OFUL End OA IDS E4 (c) d = 3, K = 5, ϵ = 0.2 0 10000 20000 30000 40000 50000 Time step Pha Elim D rs-OFUL End OA IDS E4 (d) d = 3, K = 5, ϵ = 0.2 Figure 4: Regret and Batch Analysis: End of Optimism instances (d = 3, K = 5). Table 3: Batch Complexity Analysis: End of Optimism instances. Note that batch complexity of sequential algorithms like End OA and IDS equals time horizon T. INSTANCE E4 PHAELIMD RS-OFUL ENDOA IDS d = 2, K = 3, T = 10000 ϵ = 0.01 3.0 0.0 4.0 0.0 36.1 0.3 - - ϵ = 0.2 3.0 0.0 4.0 0.0 37.0 0.0 - - d = 3, K = 5, T = 50000 ϵ = 0.01 3.0 0.0 4.0 0.0 61.0 0.5 - - ϵ = 0.2 3.0 0.0 4.0 0.0 60.5 0.8 - - d = 5, K = 9, T = 100000 ϵ = 0.01 3.0 0.0 4.0 0.0 102.3 0.9 - - ϵ = 0.2 3.0 0.0 4.0 0.0 101.8 0.6 - - A.2. Ablation Study of E4 on the End of Optimism Instances In this section, our focus is solely on batched algorithms, as our primary concern lies in evaluating the performance of batched linear bandits algorithms. 
According to Lattimore & Szepesvari (2017), optimism-based algorithms fail to achieve asymptotic optimality on these hard End of Optimism instances, and the construction of such instances requires a sufficiently small ϵ. It is therefore meaningful to verify this statement by varying the value of ϵ. Note that rs-OFUL and Pha Elim D are both algorithms based on optimism, the latter of which discards low-rewarding arms without considering their information gain. We use the simplest End of Optimism instances (see Figure 1) given by (6.1), where we choose d = 2 and ϵ = 0.005, 0.01, 0.05, 0.1, 0.15, 0.2.
Figure 5: Ablation study on the parameter ϵ. Panels (a)-(f) show regret versus time step for ϵ = 0.005, 0.01, 0.05, 0.1, 0.15, 0.2, comparing Pha Elim D, rs-OFUL, and E4.
As the results in Figure 5 show, when ϵ gets larger, the regret of all algorithms gets larger. This is because agents can quickly discard the worst arm x3 and then suffer the regret from x2 in the following rounds, so a larger ϵ leads to a larger regret (a worse case). However, we can also see that as ϵ increases, the performance difference between E4 and the other, optimism-based algorithms decreases. This is because optimism-based algorithms can achieve worst-case optimal regret (on instances with large ϵ) but fail to achieve asymptotic optimality on instances with small enough ϵ. In fact, we can view the regret of the optimism-based algorithms as the minimax optimal regret, and the regret of E4 as the asymptotically optimal regret. Hence these experimental results help to understand the difference between these two types of optimality.

A.3. Evaluations of E4 on Randomly Generated Instances

In this section, we empirically evaluate algorithms on randomly generated instances. Notably, we have excluded IDS from consideration here due to its consistently poor performance in batch complexity and runtime efficiency, as shown in Figure 2 and Table 2. For each chosen feature dimension and number of arms, we generate an instance with ‖θ*‖₂ = 1, where arm features are sampled from U([0, 1]^K). We set the time horizon to T = 50000 and conduct 10 independent experiments for each instance. The regret and batch complexity results are illustrated in Figure 6. The outcomes indicate that our algorithm E4 exhibits commendable performance even on randomly generated instances, which further demonstrates its advantages.

B. Lower Bounds on the Batch Complexity

The following lemma implies that no algorithm can find the best arm correctly with a high enough probability using only one exploration stage of o(log T) steps.
Lemma B.1. Consider any linear bandit instance with two arms.
Suppose a best-arm-identification algorithm has only o(log T) steps of exploration and let x̃ be its output. Then there are infinitely many T such that P(x̃ ≠ x*) > 1/√T.
Figure 6: Regret and Batch Analysis: Randomly generated instances (d = 2, 3, 5, 20). Panels (a)-(h) show regret and number of batches versus time step for (d, K) = (2, 3), (3, 5), (5, 9), (20, 50), comparing Pha Elim D, rs-OFUL, End OA, and E4.
Table 4: Batch Complexity Analysis: Randomly generated instances. Note that again the batch complexity of the sequential algorithm End OA equals the time horizon 50000.

Instance (T = 50000) | E4 | Pha Elim D | rs-OFUL | End OA
d = 2, K = 3 | 3.0 ± 0.0 | 3.0 ± 0.0 | 38.0 ± 0.0 | -
d = 3, K = 5 | 3.0 ± 0.0 | 3.7 ± 0.5 | 60.7 ± 0.5 | -
d = 5, K = 9 | 3.0 ± 0.0 | 4.0 ± 0.0 | 98.8 ± 0.7 | -
d = 20, K = 50 | 3.0 ± 0.0 | 4.0 ± 0.0 | 355.5 ± 0.7 | -

B.1. Algorithms with a Deterministic Number of Batches

We first present the following theorem, which states that algorithms performing at most 2 batches are not asymptotically optimal.
Theorem B.2. Algorithms that only perform 1 or 2 batches are not asymptotically optimal.
Proof. An algorithm is called asymptotically optimal if its regret satisfies lim sup_{T→∞} RT/log T ≤ c*(θ*) for all bandit instances. To prove the desired result, we only need to consider the instances given by the arm set X = {(1, 0), (0, 1)} and parameter θ* = (α, 0) or (0, α), where α ∈ (0, 1). Here the gap is Δ = α. We aim to prove that algorithms that only perform 1 or 2 batches fail to achieve asymptotic optimality on these instances.
First, no consistent algorithm can achieve sub-linear regret with only 1 batch. An algorithm with 1 batch must decide how many times to pull each arm at the beginning and then pull arms for T times in total according to that decision. With no information about the unknown instance, it must suffer O(T) regret on some instance.
Then we prove that no algorithm can achieve asymptotic optimality with 2 batches. Assume a certain algorithm with policy π can achieve this, meaning its regret satisfies
lim_{T→∞} RT/log T = c*(θ*) = Σ_{x ≠ x*} 2/Δ_x = 2/α,   (B.1)
where the second equality is the explicit form of c* in the multi-armed bandit setting; the details can be seen from Example 3 in Lattimore & Szepesvari (2017). Equation (B.1) implies that the asymptotically optimal algorithm pulls the optimal arm Θ(T) times in expectation, and pulls the suboptimal arm (2/α²) log T + o(log T) times in expectation. Since the algorithm has only 2 batches, let t1 and t2 be the expected batch sizes for the first and second batch, where t1 + t2 = T. Without loss of generality, we can assume the agent uses uniform exploration (pulling all arms an equal number of times) in the first batch.
On some instances, this leads regret with the order larger than O(log T), which is an contradiction with asymptotic optimality. We see t2 = T t1 = T o(T). This means in the second batch, the algorithm must pull one arm for at least T/2 o(T) times. Then we prove that t1 cannot be o(log T) by contradiction. Suppose an algorithm (denoted as Algorithm A) is a 2-batch asymptotically optimal algorithm with the initial batch size being o(log T). Another algorithm for best-arm identification, referred to as Algorithm B, can be derived from Algorithm A. This derivation is straightforward: it begins with the exploration phase using the first batch of Algorithm A, collects the data, updates the policy, and then diverges from Algorithm A s approach by outputting the arm, denoted as x, that is to be pulled more frequently in the second batch of Algorithm A as the best-arm. Since the second batch of Algorithm A has size Θ(T), it must pull x for Θ(T) times. Given that Algorithm A is asymptotically optimal, the probability of selecting an incorrect arm is P( x = x ) 1/ T for sufficiently large T. Otherwise, the regret of Algorithm A would be at least Θ(T) α 1/ T), for infinite many T, contradicting its asymptotic optimality, which is characterized by a regret of O(log T) order. However, viewing Algorithm B as a best-arm identification algorithm introduces a new issue: achieving the bound P( x = x ) 1/ T for sufficiently large T necessitates an exploration process involving more than Θ(log T) samples, as indicated by Lemma B.1, which is a contradiction. This implies that for Algorithm A to be asymptotically optimal, it cannot limit its initial batch to only o(log T) samples. So t1 cannot be o(log T). Combining with t1 = O(log T) we know t1 = Θ(log T). But the main problem arises here: how many steps do we need to explore in the first batch? In fact, we have no information about the unknown instance at the beginning. In the first batch, we may choose to pull each arm for 2 β2 log T + o(log T) times, where β > 0 is a predetermined constant parameter. Then for instances with = α > β and sufficiently large T, the regret of batch 1 is α 2 β2 log T + o(log T) > 2 α log T + o(log T), which implies the algorithm is suboptimal. This shows algorithms with only 2 batches cannot achieve asymptotic optimality, which ends the proof. B.2. Proof of Theorem 5.2 Firstly, any consistent algorithm must have at least 2 batches. We still consider instances given by given by arm set X = {(1, 0), (0, 1)}, parameter θ = (α, 0) or (0, α), where α (0, 1). Then we aim to prove any asymptotically optimal must have 3 batches in expectation as T on all these instances, which leads to the desired conclusion. Now we prove by contradiction. Suppose that an asymptotically optimal algorithm has 3 δ batches in expectation on some instance among these, for some constant δ (0, 1). Then we discuss the behavior of this algorithm on this certain bandit instance. Let M be the event that the algorithm has only 2 batches, and p = P(M). Therefore the algorithm has at least 3 batches with probability 1 p. We have that the expected batch complexity of this algorithm 3 δ 2 p + 3 (1 p) = 3 p, which implies p δ. We use the event M to decompose the regret: t=1 x xt, θ M t=1 x xt, θ Mc # := AT p + BT (1 p) here we use AT = E PT t=1 x xt, θ |M to denote the regret given M, and BT = E PT t=1 x xt, θ |Mc to denote the regret given Mc. 
Since we suppose the algorithm is asymptotically optimal, i.e., $\lim_{T\to\infty} R_T/\log T = c^*$, we have $A_T \le (c^*/\delta)\log T + o(\log T)$. Given $M$, we write $T = t_1 + t_2$, where $t_1$ and $t_2$ are the expected batch sizes of the two batches, respectively. Similar to the proof of Theorem B.2, we have $t_1 = O(\log T)$ and $t_2 = \Theta(T)$; otherwise the order of the regret would be larger than $\Omega(\log T)$. We can also use the same arguments as in the proof of Theorem B.2 to show that $t_1$ cannot be $o(\log T)$; otherwise, $A_T = \Omega(\sqrt{T})$. Combining these results, we know $t_1 = \Theta(\log T)$. The behavior of the algorithm in the first batch is not influenced by the bandit instance, and the algorithm must make $\Theta(\log T)$ pulls in the first batch. We then analyze the regret in the first batch. The algorithm faces the same problem as in the proof of Theorem B.2: how many steps does it need to explore in the first batch? In exactly the same way, we can find an instance on which the algorithm is suboptimal, which is a contradiction. In fact, no matter how the agent allocates these $\Theta(\log T)$ steps of exploration, there exists an instance that makes the algorithm suboptimal. This ends the proof.

C. Proofs of Main Theorems in Section 5

The two lemmas outlined below offer a theoretical basis for the D-optimal design exploration and the application of Chernoff's stopping rule mentioned in Section 4. In practice, we use the Frank-Wolfe algorithm to solve the optimization problem in (4.1) to get a near-optimal solution $\tilde\pi$, which only affects our theoretical analysis up to a constant factor. In this way, the agent pulls each arm $x\in X$ for $\lceil 2\tilde\pi_x g(\tilde\pi) M/d\rceil$ times. Besides, the number of iterations of this algorithm can be taken as $O(d\log\log d)$, which shows its efficiency. See Section 21.2 in Lattimore & Szepesvari (2020) for more details.

Lemma C.1 (D-optimal design concentration (Lattimore & Szepesvari, 2020)). Let $A$ be the active arm set. If the agent performs a D-optimal design exploration on $A$ with rate $M$ defined in Definition 4.1, we obtain the following concentration result for the associated least squares estimator: for any $x\in X$, with probability at least $1 - 1/(KT^2)$,
\[
|\langle x, \hat\theta - \theta^*\rangle| \le \sqrt{\frac{d\log(KT^2)}{M}}.
\]

Lemma C.2 ((Jedra & Proutiere, 2020)). Let $\delta\in(0,1)$ and $u > 0$. Regardless of the sampling strategy, the Chernoff stopping rule, defined as $Z(t) \ge \beta(t, \delta)$ together with $\sum_{s=1}^{t} x_s x_s^\top \succeq c\, I_d$ where $c = \max_{x\in X}\|x\|_2^2$,
\[
Z(t) = \min_{b\neq\hat{x}^*}\frac{\hat\Delta_b^2/2}{\|\hat{x}^* - b\|_{H_t^{-1}}^2},
\qquad
\beta(t, \delta) = (1 + u)\log\frac{(t/u)^{d/2}}{\delta},
\]
ensures that $P\big(\tau < \infty,\ \langle\theta^*, x^* - \hat{x}^*_\tau\rangle > 0\big) \le \delta$, where $\hat{x}^*_\tau$ is the estimated best arm at the stopping time $\tau$. This shows that if the stopping rule holds, we find the best arm correctly with probability at least $1 - \delta$. Note that we need a separate concentration result for each batch, each based only on that batch's independent data. This makes the proof much easier.
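To make the stopping rule in Lemma C.2 concrete, here is a small sketch of how one might evaluate it numerically for a finite arm set. It is an illustration under our own simplifying choices (in particular, the helper name, the default value of $u$, and the exact threshold written in the comments follow the reconstruction of $\beta(t,\delta)$ above), not the authors' implementation.

```python
import numpy as np

def chernoff_stopping_rule(X, arms_pulled, rewards, delta, u=0.5):
    """Evaluate the Chernoff stopping rule of Lemma C.2 on observed data.

    X           : (K, d) array of arm feature vectors.
    arms_pulled : (t,) integer array of arm indices pulled so far.
    rewards     : (t,) array of observed rewards.
    Returns (stop, best_arm_index).
    """
    t, d = len(rewards), X.shape[1]
    A = X[arms_pulled]                                  # (t, d) design matrix
    H = A.T @ A                                         # Gram matrix H_t
    c = np.max(np.sum(X ** 2, axis=1))                  # c = max_x ||x||_2^2

    # Second condition of the rule: sum_s x_s x_s^T >= c * I_d.
    if np.linalg.eigvalsh(H).min() < c:
        return False, None

    theta_hat = np.linalg.solve(H, A.T @ rewards)       # least squares estimate
    mu_hat = X @ theta_hat
    best = int(np.argmax(mu_hat))
    H_inv = np.linalg.inv(H)

    # Z(t) = min_{b != best} (gap_b^2 / 2) / ||x_best - x_b||^2_{H_t^{-1}}.
    Z = np.inf
    for b in range(len(X)):
        if b == best:
            continue
        gap = mu_hat[best] - mu_hat[b]
        diff = X[best] - X[b]
        Z = min(Z, (gap ** 2 / 2.0) / (diff @ H_inv @ diff))

    # Threshold beta(t, delta) = (1 + u) * log((t / u)^{d/2} / delta).
    beta = (1.0 + u) * np.log((t / u) ** (d / 2) / delta)
    return Z >= beta, best
```

In Algorithm 1 this check is only applied at the end of the second batch, using that batch's own data, consistent with the per-batch concentration remark above.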
The subsequent lemmas are crucial as they help establish the asymptotic properties of Algorithm 1. Lemma C.3 gives a concentration result for the case where the agent uses at least a D-optimal design exploration with rate $(\log T)^{1/2}$.

Lemma C.3. Suppose that in one batch of Algorithm 1 the agent conducts at least a D-optimal design exploration with rate $(\log T)^{1/2}$ as given in Definition 4.1. This means (1) $t_1 = \Theta\big((\log T)^{1/2}\big)$ pulls for the D-optimal design exploration, and (2) possibly some other pulls. Let $t\ge t_1$ be the total number of pulls in this batch, let $\hat\theta_{t_1}$ and $\hat\theta_t$ be the least squares estimators of $\theta^*$ after the D-optimal design exploration and after the whole batch of exploration, respectively, and let $\mu_x = \langle\theta^*, x\rangle$, $\hat\mu_x(t) = \langle\hat\theta_t, x\rangle$ for $x\in X$, and $\epsilon = 1/\log\log T$. Define the event
\[
E_1 = \{\forall t\ge t_1,\ \forall x\in X,\ |\hat\mu_x(t) - \mu_x|\le\epsilon\}. \tag{C.1}
\]
Then for sufficiently large $T$, we have $P(E_1^c)\le 1/(\log T)^2$.

Next, Lemma C.4 describes the exact concentration result for the first two batches of Algorithm 1. Note that in the estimation stage of the $\ell$-th batch, we only use data collected in the exploration stage of that batch to compute the least squares estimators.

Lemma C.4. Let $\epsilon = 1/\log\log T$, and let $b_1$ and $b_2$ be the batch sizes of the first two batches of Algorithm 1, respectively. For the first two batches, we use $\hat\mu_x(b_1) = \langle\hat\theta_{b_1}, x\rangle$ and $\hat\mu_x(b_2) = \langle\hat\theta_{b_2}, x\rangle$ to denote the estimated expected rewards of $x\in X$ using data collected in the first and the second batch, respectively. Define
\[
E_2 = \{\forall x\in X,\ |\hat\mu_x(b_1) - \mu_x|\le\epsilon,\ |\hat\mu_x(b_2) - \mu_x|\le\epsilon\}. \tag{C.2}
\]
Then for sufficiently large $T$, we have $P(E_2^c)\le 2/(\log T)^2$.

Now we discuss some properties of the optimal allocation solved from Definition 4.3. Without loss of generality, we can always assume that the allocation for $\hat{x}^*$ solved from Definition 4.3 satisfies
\[
w_{\hat{x}^*} = (\log T)^\gamma/\alpha. \tag{C.3}
\]
In fact, if we have one solution $w(\hat\Delta)\in[0,\infty)^K$ to the program in Definition 4.3, then we can increase the value $w_{\hat{x}^*}$ to any $w'_{\hat{x}^*}\ge w_{\hat{x}^*}$ to get another solution. This is because increasing $w_{\hat{x}^*}$ does not affect the objective function $\sum_{x\neq\hat{x}^*} w_x(\hat\Delta_x - 4\epsilon)$. Moreover, this increase can only decrease $\|x - \hat{x}^*\|^2_{H_w^{-1}}$ for $x\neq\hat{x}^*$, so the constraints remain satisfied. In the following analysis we use $\{w_x\}_{x\in X}$ to denote $w(\hat\Delta)$, the allocation solved using the estimated gaps.

The next definition gives the allocation solved from the true parameters, which we will use in our theoretical analysis. It is easy to see that Definition 4.3 is an approximate version of Definition C.5. Using the same argument as above, we can assume $w^*_{x^*} = \infty$.

Definition C.5. Let $\Delta = \{\Delta_x\}_{x\in X}\in[0,\infty)^K$ be the vector of gaps and $x^* = \arg\max_{x\in X}\langle\theta^*, x\rangle$ be the best arm. We define $w(\Delta) = \{w^*_x\}_{x\in X}\in[0,\infty]^K$ to be the solution to the optimization problem
\[
\min_{w\in[0,\infty]^K}\ \sum_{x\neq x^*} w_x\Delta_x
\quad\text{s.t.}\quad \|x - x^*\|^2_{H_w^{-1}} \le \frac{\Delta_x^2}{2},\ \forall x\in X,
\]
where $H_w = \sum_{x\in X} w_x x x^\top$.

The following lemma shows that the solution from Definition 4.3 converges to the solution from Definition C.5 as $T\to\infty$.

Lemma C.6. We use $w(\hat\Delta) = \{w_x\}_{x\in X}$ from Definition 4.3 and $w(\Delta) = \{w^*_x\}_{x\in X}$ from Definition C.5. Then if $E_2$ holds, $\lim_{T\to\infty} w_x = w^*_x$ for all $x\in X$.

The following lemmas demonstrate that for sufficiently large $T$ the stopping rule in the second batch of Algorithm 1 holds with probability larger than $1 - 2/(\log T)^2$, and the while loop breaks after the third batch with probability larger than $1 - 1/T$.

Lemma C.7. If $E_2$ holds, then for sufficiently large $T$, the stopping condition (4.2) holds in the second batch. Namely, Algorithm 1 skips the while loop after the second batch and enters the Exploitation stage with probability at least $1 - 2/(\log T)^2$, since $P(E_2^c)\le 2/(\log T)^2$.

Lemma C.8. If the stopping rule (4.2) does not hold in the second batch, then for sufficiently large $T$, the algorithm eliminates all suboptimal arms in the third batch with probability at least $1 - 1/T$.

Let $\mathrm{Regret}_2$ be the regret of batch $\ell = 2$ of Algorithm 1. The following lemma implies that $\mathrm{Regret}_2$ matches the leading term in the asymptotic lower bound in Lemma 3.1.

Lemma C.9. The regret of the second batch can be bounded by
\[
\limsup_{T\to\infty}\frac{\mathrm{Regret}_2}{\log T} \le c^*,
\]
where $c^* = c^*(\theta^*)$ is defined in Lemma 3.1.
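As an aside, the allocation program of Definition C.5 (and its empirical counterpart in Definition 4.3) can be solved numerically for small instances. The sketch below is our own illustration: the cap `w_max` stands in for the unbounded coordinate $w^*_{x^*}$, the small regularizer guards against a singular $H_w$, and the use of `scipy.optimize.minimize` with SLSQP is a convenience choice, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_allocation(X, theta_star, w_max=1e6):
    """Approximately solve the allocation program of Definition C.5.

    min_w  sum_{x != x*} w_x * Delta_x
    s.t.   ||x - x*||^2_{H_w^{-1}} <= Delta_x^2 / 2   for all x != x*,
    where H_w = sum_x w_x x x^T. The unbounded coordinate w_{x*} is capped
    at w_max for numerical purposes.
    """
    K, d = X.shape
    mu = X @ theta_star
    star = int(np.argmax(mu))
    gaps = mu[star] - mu                      # Delta_x, zero at x*

    def H(w):
        return (X.T * w) @ X                  # sum_x w_x x x^T

    def objective(w):
        return np.sum(np.delete(w * gaps, star))

    def constraint_factory(i):
        diff = X[star] - X[i]
        def g(w):
            # feasible iff g(w) >= 0
            H_inv = np.linalg.inv(H(w) + 1e-9 * np.eye(d))
            return gaps[i] ** 2 / 2.0 - diff @ H_inv @ diff
        return g

    cons = [{"type": "ineq", "fun": constraint_factory(i)}
            for i in range(K) if i != star]
    bounds = [(1e-6, w_max) for _ in range(K)]
    res = minimize(objective, np.ones(K), bounds=bounds,
                   constraints=cons, method="SLSQP")
    return res.x, star
```

In Algorithm 1, the same program is solved with the estimated gaps $\hat\Delta_x - 4\epsilon$ in place of $\Delta_x$ and with the additional normalization $w_{\hat{x}^*} = (\log T)^\gamma/\alpha$ discussed around (C.3).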
C.1. Proof of Theorem 5.4

Now we prove the results in Theorem 5.4. Note that under the choice of exploration rates $\mathcal{T}_1$, we want to show that E4 achieves the optimal regret and batch complexity in both the finite-time setting and the asymptotic setting.

C.1.1. THE FINITE-TIME SETTING

We first prove the batch complexity and the worst-case regret bound in the finite-time setting.

Batch complexity: In this part, we analyze the upper bound on the batch complexity. It is sufficient to count the number of while-loop iterations, starting from the third batch in Algorithm 1, before the Exploitation stage in Line 1. For $\ell\ge 4$, the $\ell$-th batch has size $\Theta\big(T^{1 - 1/2^{\ell-3}}\big)$, so the $(\log\log T + 3)$-th batch has size $\Theta(T)$. Say the $(\log\log T + 3)$-th batch size is larger than $T/M$ for some constant $M > 0$. By definition, the batch size increases with $\ell$, so the sizes of the batches after this one are also larger than $T/M$. Therefore, the algorithm has at most $\log\log T + 3 + M$ batches. This proves the batch complexity upper bound $O(\log\log T)$.

Minimax optimality: First, we assume the agent does not eliminate any arms in the first two batches; the other case is easier, as we show at the end of this part of the proof. Since the number of time steps in the first two batches is at most $O\big((\log T)^{1+\gamma}\big)$, the regret of the first two batches is at most $O\big((\log T)^{1+\gamma}\big)$, due to the bounded-rewards assumption. Besides, the property of Chernoff's stopping rule (Lemma C.2) implies that the agent identifies the best arm correctly with probability larger than $1 - 1/T$, so the regret of the Exploitation stage in Line 1 can be bounded by a small constant. Hence we do not need to consider the regret of the Exploitation stage in Line 1.

In Algorithm 1, we let $A$ be the current active arm set. Note that $A$ may shrink after some batches due to the Elimination stages. We say an arm $x$ is active in batch $\ell$ if it is in the active arm set $A$ of this batch. As in (Esfandiari et al., 2021) and (Lattimore & Szepesvari, 2020), we define the following good event $E_0$ that we will use throughout the proof:
\[
E_0 = \{\text{for any batch } \ell \text{ and any arm } x \text{ active at the beginning of batch } \ell,\ |\langle x, \hat\theta - \theta^*\rangle|\le\varepsilon_\ell \text{ at the end of this batch}\}. \tag{C.4}
\]
By Definition 4.1, for any $x\in X$, after the $\ell$-th batch, with probability at least $1 - 1/(KT^2)$, we have $|\langle x, \hat\theta - \theta^*\rangle|\le\varepsilon_\ell$. Since there are $K$ arms and at most $T$ batches, by a union bound the good event $E_0$ (C.4) happens with probability at least $1 - 1/T$. Consequently,
\[
R_T = \mathbb{E}\Big[\sum_{t=1}^{T}\langle x^* - x_t, \theta^*\rangle\,\Big|\,E_0^c\Big]P(E_0^c) + \mathbb{E}\Big[\sum_{t=1}^{T}\langle x^* - x_t, \theta^*\rangle\,\Big|\,E_0\Big]P(E_0)
\le 2LT\,P(E_0^c) + \mathbb{E}\Big[\sum_{t=1}^{T}\langle x^* - x_t, \theta^*\rangle\,\Big|\,E_0\Big]
\le 2L + \mathbb{E}\Big[\sum_{t=1}^{T}\langle x^* - x_t, \theta^*\rangle\,\Big|\,E_0\Big].
\]
So we can just bound the regret when $E_0$ happens. Under $E_0$, at the end of each batch $\ell\ge 3$, we know that for each active arm, $|\langle x, \hat\theta - \theta^*\rangle|\le\varepsilon_\ell$. Denote the best arm and the estimated best arm (ties broken arbitrarily) by
\[
x^* = \arg\max_{x\in A}\langle x, \theta^*\rangle, \qquad \hat{x}^* = \arg\max_{x\in A}\langle x, \hat\theta\rangle.
\]
Then
\[
\max_{y\in A}\langle\hat\theta, y - x^*\rangle = \langle\hat\theta, \hat{x}^* - x^*\rangle = \langle\hat\theta - \theta^* + \theta^*, \hat{x}^* - x^*\rangle
= \langle\hat\theta - \theta^*, \hat{x}^*\rangle - \langle\hat\theta - \theta^*, x^*\rangle + \langle\theta^*, \hat{x}^* - x^*\rangle
\le |\langle\hat\theta - \theta^*, \hat{x}^*\rangle| + |\langle\hat\theta - \theta^*, x^*\rangle| + \langle\theta^*, \hat{x}^* - x^*\rangle
\le 2\varepsilon_\ell + \langle\theta^*, \hat{x}^* - x^*\rangle \le 2\varepsilon_\ell,
\]
where the last inequality uses $\langle\theta^*, \hat{x}^* - x^*\rangle\le 0$. Comparing this with the elimination rule in Algorithm 1 for $\ell\ge 3$, we know the best arm will not be eliminated, namely, $x^*\in A_{\ell+1}$, where we use $A_\ell$ to denote the active arm set of the $\ell$-th batch. Correspondingly, for each suboptimal arm $x$, define $\ell_x = \min\{\ell : 4\varepsilon_\ell < \Delta_x\}$ to be the first phase in which the suboptimality gap of arm $x$ exceeds $4\varepsilon_\ell$. Then if $x\neq x^*$ is not eliminated in the first $\ell_x - 1$ batches, in the $\ell_x$-th batch,
\[
\max_{y\in A}\langle\hat\theta, y - x\rangle \ge \langle\hat\theta, x^* - x\rangle = \langle\hat\theta - \theta^* + \theta^*, x^* - x\rangle = \langle\hat\theta - \theta^*, x^* - x\rangle + \langle\theta^*, x^* - x\rangle
\ge -|\langle\hat\theta - \theta^*, x^*\rangle| - |\langle\hat\theta - \theta^*, x\rangle| + \Delta_x \ge \Delta_x - 2\varepsilon_{\ell_x} > 2\varepsilon_{\ell_x}.
\]
Comparing this with the elimination rule in Algorithm 1 for $\ell\ge 3$, we know $x\neq x^*$ is eliminated within the first $\ell_x$ batches, namely,
\[
x\notin A_{\ell_x + 1},\quad\forall x\neq x^*, \tag{C.5}
\]
where $A_{\ell_x+1}$ is the active arm set of the $(\ell_x + 1)$-th batch. Thus when the good event $E_0$ (C.4) happens, arms active in phase $\ell$ satisfy $\Delta_x\le 4\varepsilon_{\ell-1}$.
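Schematically, the Elimination stage analyzed above amounts to a one-line filter; the sketch below is our own rendering of the rule "drop $x$ if $\max_{y}\langle\hat\theta, y - x\rangle > 2\varepsilon_\ell$" for batches $\ell\ge 3$, not the full E4 algorithm (the exploration producing `theta_hat` and `eps_l` is assumed to happen elsewhere).

```python
import numpy as np

def eliminate_arms(active, X, theta_hat, eps_l):
    """One Elimination stage of the phased analysis above (batches l >= 3).

    active    : list of indices of currently active arms.
    X         : (K, d) array of arm feature vectors.
    theta_hat : (d,) least squares estimate from this batch's exploration.
    eps_l     : confidence width, e.g. sqrt(d * log(K * T**2) / T_l).

    Keeps x iff max_{y active} <theta_hat, y - x> <= 2 * eps_l; under the good
    event E_0 the best arm therefore survives, while every arm whose gap
    exceeds 4 * eps_l is removed.
    """
    scores = {i: float(X[i] @ theta_hat) for i in active}
    best_score = max(scores.values())
    return [i for i in active if best_score - scores[i] <= 2.0 * eps_l]
```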
When $T_3 = (\log T)^{1+\gamma}$, $T_\ell = T^{1 - 1/2^{\ell-3}}$ for $\ell\ge 4$, and $\varepsilon_\ell = \sqrt{d\log(KT^2)/T_\ell}$, Algorithm 1 has at most $B = O(\log\log T)$ batches. We use $\mathrm{Regret}_\ell$ to denote the regret of the $\ell$-th batch; then the regret of batches $\ell\ge 5$ under $E_0$ can be bounded as
\[
\sum_{\ell=5}^{B}\mathrm{Regret}_\ell \le \sum_{\ell=5}^{B} 4T_\ell\varepsilon_{\ell-1}
\le \sum_{\ell=5}^{B} 4C\,T^{1 - \frac{1}{2^{\ell-3}}}\sqrt{d\log(KT^2)\Big/T^{1 - \frac{1}{2^{\ell-4}}}}
\le \sum_{\ell=5}^{B} 4C\sqrt{dT\log(KT^2)}
= O\big(\log\log T\,\sqrt{dT\log(KT)}\big).
\]
As a result, the regret of batches $\ell\ge 3$ under $E_0$ can be bounded as
\[
\sum_{\ell=3}^{B}\mathrm{Regret}_\ell \le \sum_{\ell=3}^{B} 4T_\ell\varepsilon_{\ell-1}
= O(T_3) + O(T_4\varepsilon_3) + \sum_{\ell=5}^{B} 4T_\ell\varepsilon_{\ell-1}
= O\big((\log T)^{1+\gamma}\big) + O\big(\sqrt{dT\log(KT^2)}\big) + O\big(\log\log T\,\sqrt{dT\log(KT)}\big)
= O\big(\log\log T\,\sqrt{dT\log(KT)}\big).
\]
All of the arguments above suppose the algorithm does not eliminate any arms in the first two batches, i.e., the stopping rule in the second batch does not hold. Now suppose instead that the algorithm eliminates all suboptimal arms after the second batch. The property of Chernoff's stopping rule (Lemma C.2) implies the agent identifies the optimal arm with probability at least $1 - 1/T$, so the regret of the Exploitation stage in Line 1 after the second batch can be bounded by a constant. Combining these cases ends the proof.

C.1.2. THE ASYMPTOTIC SETTING

Now we investigate the batch complexity and the regret bound of E4 when $T\to\infty$.

Batch complexity: By Lemma C.7, for sufficiently large $T$, with probability at least $1 - 2/(\log T)^2$ the stopping rule (4.2) in the second batch of Algorithm 1 holds. This means the algorithm identifies the best arm and enters the Exploitation stage in Line 1. According to the previous part of the proof, with the time grid $\mathcal{T}_1$ (and even with $\mathcal{T}_2$), the algorithm has at most $O(\log T)$ batches in total. So we can calculate the expected batch complexity by
\[
\lim_{T\to\infty}\Big[3\cdot\big(1 - 2/(\log T)^2\big) + O(\log T)\cdot\frac{2}{(\log T)^2}\Big] = 3, \tag{C.6}
\]
because the algorithm uses exactly 3 batches when the stopping rule holds, and this event happens with probability at least $1 - 2/(\log T)^2$. Thus we complete the proof of the asymptotic batch complexity.
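For intuition about the batch counts used above (at most $O(\log\log T)$ batches under the grid $\mathcal{T}_1$, versus $O(\log T)$ under $\mathcal{T}_2$, cf. Section C.2), the following sketch simply enumerates the schedule until the cumulative batch size reaches $T$. The sizes used for the first three batches are schematic stand-ins consistent with the proof (orders $(\log T)^{1/2}$ and $(\log T)^{1+\gamma}$), and the constants are our own choices for illustration.

```python
import math

def num_batches(T, grid="T1", gamma=0.1, d=5, K=10):
    """Count batches until the cumulative batch size reaches T.

    grid = "T1": T_l = T^{1 - 1/2^{l-3}} for l >= 4  ->  O(log log T) batches.
    grid = "T2": T_l = d * log(K * T**2) * 2^{l-3}   ->  O(log T) batches.
    The first three batches are of order (log T)^{1/2}, (log T)^{1+gamma},
    (log T)^{1+gamma}, mirroring the analysis.
    """
    sizes = [math.log(T) ** 0.5,
             math.log(T) ** (1 + gamma),
             math.log(T) ** (1 + gamma)]
    l = 4
    while sum(sizes) < T:
        if grid == "T1":
            sizes.append(T ** (1.0 - 1.0 / 2 ** (l - 3)))
        else:
            sizes.append(d * math.log(K * T ** 2) * 2 ** (l - 3))
        l += 1
    return len(sizes)

# The count under T1 grows roughly like log log T, under T2 like log T.
for T in (10 ** 4, 10 ** 6, 10 ** 8):
    print(T, num_batches(T, "T1"), num_batches(T, "T2"))
```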
Asymptotic optimality: Lemmas C.7 and C.8 demonstrate that, for sufficiently large $T$, the stopping rule in the second batch of Algorithm 1 holds with probability larger than $1 - 2/(\log T)^2$, and the while loop starting from batch $\ell = 3$ breaks after the third batch with probability larger than $1 - 1/T$. Decompose the cumulative regret by batches as
\[
R_T = \mathrm{Regret}_1 + \mathrm{Regret}_2 + \mathrm{Regret}_3 + \mathrm{Regret}_{\mathrm{else}},
\]
where $\mathrm{Regret}_\ell$ represents the regret in the $\ell$-th batch ($\ell = 1, 2, 3$) and $\mathrm{Regret}_{\mathrm{else}}$ represents the regret after the third batch. We study the asymptotic setting, so all of the statements below assume that $T$ is sufficiently large. With the bounded-rewards assumption, the regret of the first batch is negligible compared with the $\Theta(\log T)$ order, namely, $\mathrm{Regret}_1 = o(\log T)$:
\[
\lim_{T\to\infty}\frac{\mathrm{Regret}_1}{\log T} \le \lim_{T\to\infty}\frac{O\big((\log T)^{1/2}\big)}{\log T} = 0.
\]
Note that the stopping rule (4.2) in the second batch holds with probability at least $1 - 2/(\log T)^2$, and batch $\ell = 3$ of the while loop is encountered only when the stopping rule fails to hold. Given that batch 3 has size $O\big((\log T)^{1+\gamma}\big)$, the regret of this batch is negligible:
\[
\lim_{T\to\infty}\mathrm{Regret}_3 \le \lim_{T\to\infty}\frac{2}{(\log T)^2}\,O\big((\log T)^{1+\gamma}\big) = 0.
\]
Similarly, note that the agent eliminates all suboptimal arms after the third batch with probability at least $1 - 1/T$ according to Lemma C.8, so the regret after this batch, $\mathrm{Regret}_{\mathrm{else}}$, is negligible too:
\[
\lim_{T\to\infty}\frac{\mathrm{Regret}_{\mathrm{else}}}{\log T} \le \lim_{T\to\infty}\frac{(1/T)\cdot O(T)}{\log T} = 0.
\]
The only term we need to care about is the regret of the second batch of Algorithm 1. Therefore, it is sufficient to use Lemma C.9 to conclude the proof of the asymptotically optimal regret, which shows $\limsup_{T\to\infty}\mathrm{Regret}_2/\log T\le c^*$. Now we can conclude the proof of the main Theorem 5.4 by combining these results about the different parts of the regret:
\[
\limsup_{T\to\infty}\frac{R_T}{\log T} = \limsup_{T\to\infty}\frac{\mathrm{Regret}_1 + \mathrm{Regret}_2 + \mathrm{Regret}_3 + \mathrm{Regret}_{\mathrm{else}}}{\log T} = \limsup_{T\to\infty}\frac{\mathrm{Regret}_2}{\log T} \le c^*,
\]
which matches the lower bound in Lemma 3.1.

C.2. Proof of Theorem 5.7

Finite-time instance-dependent batch complexity and regret bound: This part of the proof is similar to the analysis in the proof of Theorem 5.4. We use the same good event $E_0$ in (C.4). Here we choose $T_\ell = \Theta\big(d\log(KT^2)\,2^{\ell-3}\big)$ and $\varepsilon_\ell = \sqrt{1/2^{\ell-3}}$ for $\ell\ge 4$. Then the $(\log T)$-th batch of Algorithm 1 has size $\Theta\big(d\log(KT^2)\,T\big)$. Therefore, the while loop of the algorithm breaks after at most $O(\log T)$ batches. In this part we can also suppose the agent does not eliminate any arms in the first two batches; the reason is the same as in the proof of Theorem 5.4. Similarly, the regret of the first three batches is at most $O\big((\log T)^{1+\gamma}\big)$ and the regret of the Exploitation stage in Line 1 is $O(1)$, which is negligible.

To analyze the instance-dependent regret, we focus on batches $\ell\ge 4$. Define $\Delta_{\min} = \min_{x\neq x^*}\Delta_x$. Recall that under the good event $E_0$ (C.4), the arm $x\neq x^*$ is eliminated after the $\ell_x$-th batch. We can explicitly calculate when all suboptimal arms are eliminated. Solving
\[
\Delta_{\min} > 4\varepsilon_\ell = 4\sqrt{1/2^{\ell-3}} \tag{C.7}
\]
gives
\[
\ell > \log\frac{1}{\Delta_{\min}^2} + 7. \tag{C.8}
\]
Define $L_0 = \big\lceil\log\frac{1}{\Delta_{\min}^2}\big\rceil + 7 = O\big(\log(1/\Delta_{\min})\big)$. We just need to consider the regret of the first $L_0$ batches, because after that the only active arm is the optimal one. Exactly as in the proof of Theorem 5.4, we have
\[
\sum_{\ell=4}^{L_0}\mathrm{Regret}_\ell \le \sum_{\ell=4}^{L_0} 4T_\ell\varepsilon_{\ell-1}
\le \sum_{\ell=4}^{L_0} 4C\,d\log(KT^2)\,2^{\ell-3}\sqrt{1/2^{\ell-4}}
= \sum_{\ell=4}^{L_0} 8C\,d\log(KT^2)\,2^{(\ell-4)/2}
\le 8C'\,d\log(KT^2)\,2^{(L_0-3)/2}
= O\big(d\log(KT)/\Delta_{\min}\big), \tag{C.9}
\]
where $C'$ absorbs the geometric-series factor. Combining this with the regret of the first three batches, we have
\[
R_T = O\big((\log T)^{1+\gamma} + d\log(KT)/\Delta_{\min}\big). \tag{C.10}
\]
Since the good event $E_0$ (C.4) happens with probability at least $1 - 1/T$, the expected batch complexity of the algorithm is $O\big(\log(1/\Delta_{\min}) + (\log T)/T\big) = O\big(\log(1/\Delta_{\min})\big)$.

Minimax optimality: We can also prove $R_T = O\big(\sqrt{dT}\log(KT)\big)$, which implies that the algorithm with $\{T_\ell\}_{\ell=1}^{\infty} = \mathcal{T}_2$ is nearly minimax optimal. When $\Delta_{\min} > \sqrt{d/T}$, (C.10) is a tighter bound than $O(\sqrt{dT})$, so we just need to consider the case $\Delta_{\min}\le\sqrt{d/T}$. Using the same method as in (C.7)-(C.8), we can prove that under $E_0$, arms with gap larger than $\sqrt{d/T}$ are eliminated within the first $L = \lceil\log(T/d)\rceil + 5$ batches. Meanwhile, the regret caused by arms with gaps smaller than $\sqrt{d/T}$ is at most $T\cdot\sqrt{d/T} = \sqrt{dT}$. Note that the regret of batches $\ell = 4, \ldots, L$ in the while loop can be bounded by
\[
\sum_{\ell=4}^{L}\mathrm{Regret}_\ell \le \sum_{\ell=4}^{L} 4T_\ell\varepsilon_{\ell-1} \le 8C'\,d\log(KT^2)\,2^{(L-3)/2}
= O\big(\sqrt{dT}\log(KT)\big), \tag{C.11}
\]
where (C.11) follows in the same way as (C.9). As a result, the total regret can be bounded by
\[
R_T \le \sum_{\ell=1}^{3}\mathrm{Regret}_\ell + \sum_{\ell=4}^{L}\mathrm{Regret}_\ell + \sqrt{dT}
= O\big((\log T)^{1+\gamma}\big) + O\big(\sqrt{dT}\log(KT)\big) + \sqrt{dT}
= O\big(\sqrt{dT}\log(KT)\big).
\]

Asymptotic regret and batch complexity: When $\{T_\ell\}_{\ell=1}^{\infty}$ is set to either $\mathcal{T}_1$ or $\mathcal{T}_2$, Algorithm 1 behaves identically within the initial three batches. Thus the corresponding portion of the proof aligns with that of Theorem 5.4. Specifically, according to Lemmas C.7 and C.8, for sufficiently large $T$, the probability that the algorithm avoids the batch $\ell = 3$ exploration is at least $1 - 2/(\log T)^2$. Similarly, the probability that it avoids batch $\ell = 4$ is at least $1 - 1/T$. By the same rationale, we can complete this part of the proof.
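Remark (on the last step of (C.9) and (C.11)). Taking the logarithms defining $L_0$ and $L$ to be base 2 (an assumption we make for concreteness; it matches the doubling schedule $T_\ell\propto 2^{\ell-3}$), the geometric-sum bounds turn into the stated $1/\Delta_{\min}$ and $\sqrt{T/d}$ dependencies:
\[
2^{(L_0 - 3)/2} \le 2^{\big(\log_2\frac{1}{\Delta_{\min}^2} + 5\big)/2} = \frac{4\sqrt{2}}{\Delta_{\min}},
\qquad
2^{(L - 3)/2} \le 2^{\big(\log_2\frac{T}{d} + 3\big)/2} = 2\sqrt{2}\,\sqrt{\frac{T}{d}},
\]
so (C.9) is $O\big(d\log(KT)/\Delta_{\min}\big)$ and (C.11) is $O\big(\sqrt{dT}\log(KT)\big)$.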
D. Proofs of Technical Lemmas

D.1. Proof of Lemma B.1

We prove the statement by contradiction. Suppose there is an algorithm $\pi$ such that for all sufficiently large $T$, $P(\hat{x}\neq x^*)\le 1/\sqrt{T}$. The algorithm terminates after $b = o(\log T)$ steps and outputs the estimated best arm $\hat{x}$. By the assumption above, $P(\hat{x}\neq x^*)\le 1/\sqrt{T}$, so this is a $\delta$-PAC algorithm with $\delta = 1/\sqrt{T}$. However, by Theorem 1 in Jedra & Proutiere (2020), for a $(1/\sqrt{T})$-PAC strategy, any bandit instance $\theta^*$, and sufficiently large $T$,
\[
b \ge T^*(\theta^*)\log\sqrt{T} = \Theta(\log T),
\]
which contradicts $b = o(\log T)$. Therefore, the initial assumption is incorrect, and we conclude that $P(\hat{x}\neq x^*) > 1/\sqrt{T}$ for infinitely many $T$.

D.2. Proof of Lemma C.4

Here we can use Lemma C.3 directly. In the first batch of Algorithm 1, the agent just uses a D-optimal design exploration given in Definition 4.1 with $b_1 = t_1 = \Theta\big((\log T)^{1/2}\big)$ pulls. In the second batch of Algorithm 1, the agent uses a D-optimal design exploration given in Definition 4.1 with $t_1 = \Theta\big((\log T)^{1/2}\big)$ pulls, and then pulls each arm $x\in X$ for another $\min\{w_x\alpha\log T, (\log T)^{1+\gamma}\}$ times. The behavior of the first two batches satisfies the conditions of Lemma C.3. Therefore, by the conclusion of Lemma C.3,
\[
P\big(\forall x\in X,\ |\hat\mu_x(b_1) - \mu_x|\le\epsilon\big) \ge 1 - \frac{1}{(\log T)^2},
\qquad
P\big(\forall x\in X,\ |\hat\mu_x(b_2) - \mu_x|\le\epsilon\big) \ge 1 - \frac{1}{(\log T)^2}.
\]
Hence by a union bound we have
\[
P(E_2) = P\big(\forall x\in X,\ |\hat\mu_x(b_1) - \mu_x|\le\epsilon,\ |\hat\mu_x(b_2) - \mu_x|\le\epsilon\big) \ge 1 - \frac{2}{(\log T)^2}.
\]

D.3. Proof of Lemma C.6

First, if $E_2$ holds, for sufficiently large $T$ we have $\epsilon = 1/\log\log T < \Delta_{\min}/2$, so the estimated best arm is indeed the best arm, namely, $\hat{x}^* = x^*$. Then $\lim_{T\to\infty} w_{\hat{x}^*} = \lim_{T\to\infty}(\log T)^\gamma/\alpha = \infty = w^*_{x^*}$. Besides, as $T\to\infty$, $(\log T)^\gamma/\alpha\to\infty$, so the last constraint $w_{\hat{x}^*}\le(\log T)^\gamma/\alpha$ in the optimization problem of Definition 4.3 has no effect. In addition, by the concentration result of $E_2$, we know $\lim_{T\to\infty}(\hat\Delta_x - 4\epsilon) = \Delta_x$ for all $x\in X$. Then we can compare the two programs in Definition 4.3 and Definition C.5 line by line; by the continuity of the program we know $\lim_{T\to\infty} w_x = w^*_x$ for all $x\in X$.

D.4. Proof of Lemma C.7

If $E_2$ holds,
\[
|\hat\mu_x(b_1) - \mu_x|\le\epsilon,\quad |\hat\mu_x(b_2) - \mu_x|\le\epsilon,\quad\forall x\in X. \tag{D.1}
\]
For sufficiently large $T$, $\epsilon = 1/\log\log T < \Delta_{\min}/2$. Hence by (D.1), for sufficiently large $T$ and $x\neq x^*$,
\[
\hat\mu_{x^*}(b_1) - \hat\mu_x(b_1) = \big(\hat\mu_{x^*}(b_1) - \mu_{x^*}\big) - \big(\hat\mu_x(b_1) - \mu_x\big) + \Delta_x
\ge -|\hat\mu_{x^*}(b_1) - \mu_{x^*}| - |\hat\mu_x(b_1) - \mu_x| + \Delta_x
\ge -2\epsilon + \Delta_x > 0,
\]
which implies the estimated best arm is indeed the best arm, namely, $\hat{x}^* = x^*$. For a suboptimal arm $x\neq x^*$, we have
\[
|\hat\Delta_x - \Delta_x| = |(\hat\mu_{x^*} - \hat\mu_x) - (\mu_{x^*} - \mu_x)| \le |\hat\mu_{x^*} - \mu_{x^*}| + |\hat\mu_x - \mu_x| \le 2\epsilon. \tag{D.2}
\]
The second stopping condition $\sum_{s=1}^{t} x_sx_s^\top\succeq c\,I_d$ holds easily for sufficiently large $T$ with the D-optimal design exploration in the second batch, because the features of the arms used in the D-optimal design exploration span $\mathbb{R}^d$ (Lattimore & Szepesvari, 2020). Now, all that is left to do is to calculate the stopping statistic $Z(b_2)$ defined in (4.3) and to show that it is larger than the threshold value $\beta(b_2, 1/T)$ defined in (4.4) as $T\to\infty$.

After the first batch, the agent calculates the allocation $w(\hat\Delta) = \{w_x\}_{x\in X}$ according to Definition 4.3, where we use $\hat\Delta$ to denote the estimated gaps based on data from the first batch. Correspondingly, we define $w(\Delta) = \{w^*_x\}_{x\in X}$ to denote the allocation solved from Definition C.5 using the true instance parameter $\theta^*$. Under the concentration event $E_2$ and Lemma C.6, we know $\lim_{T\to\infty}\hat\Delta_x = \Delta_x$ and $\lim_{T\to\infty} w_x = w^*_x$ for all $x\in X$. Note that when $\ell = 2$, since $w^*_x$ for $x\neq x^*$ is independent of $T$,
\[
\lim_{T\to\infty}\frac{w_x\,\alpha\log T}{(\log T)^{1+\gamma}} = \lim_{T\to\infty}\frac{w^*_x\,\alpha\log T}{(\log T)^{1+\gamma}} = 0,\quad\forall x\neq\hat{x}^*.
\]
Thus for sufficiently large $T$, $w_x\,\alpha\log T\le(\log T)^{1+\gamma}$ for all $x\neq\hat{x}^*$. Besides, by (C.3) we know $w_{\hat{x}^*} = (\log T)^\gamma/\alpha$. It follows that $w_{\hat{x}^*}\,\alpha\log T = (\log T)^{1+\gamma}$.
Therefore, the exploration rule in the second batch is to perform a D-optimal design exploration with rate $T_2$ and then pull each arm $x\in X$ for $\min\{w_x\,\alpha\log T, (\log T)^{1+\gamma}\}$ times, i.e., each suboptimal arm for $w_x\,\alpha\log T$ times and the estimated best arm for $(\log T)^{1+\gamma}$ times. Here we use $b_2$ to denote the batch size of the second batch and $a_2 = \sum_{x\in A} f(x, A, T_2) = \Theta\big((\log T)^{1/2}\big)$ to denote the total number of D-optimal design exploration steps in this batch given by Definition 4.1. Consequently, we know
\[
b_2 = a_2 + (\log T)^{1+\gamma} + \sum_{x\neq\hat{x}^*} w_x\,\alpha\log T. \tag{D.3}
\]
Likewise, we can split the data $\{x_s\}_{s=1}^{b_2}$ into $\{x_s\}_{s=1}^{a_2}$ and $\{x_s\}_{s=a_2+1}^{b_2}$, with the former denoting the arms pulled in the D-optimal design exploration and the latter denoting the remaining arms pulled according to the allocation $w$ in the second batch. By the constraint in the convex program of Definition 4.3, we have $\|x - x^*\|^2_{H_w^{-1}} = \|x - \hat{x}^*\|^2_{H_w^{-1}}\le(\hat\Delta_x - 4\epsilon)^2/2$. Then we get
\[
\frac{(\hat\Delta_x - 4\epsilon)^2/2}{\|x - x^*\|^2_{H_w^{-1}}} = \frac{(\hat\Delta_x - 4\epsilon)^2/2}{\|x - \hat{x}^*\|^2_{H_w^{-1}}}\ge 1,\quad\forall x\neq x^*. \tag{D.4}
\]
In the second batch of exploration, the agent collects data and estimates the parameters. To avoid confusion, here we use $\tilde\theta$, $\tilde\Delta_x$, and $\tilde{x}^*$ to denote the corresponding estimators based on data from the second batch; they have the same concentration properties as the estimators before. Namely, $\lim_{T\to\infty}\tilde\Delta_x = \Delta_x$ and $\tilde{x}^* = x^*$. Similar to (D.2), we have $|\tilde\Delta_x - \Delta_x|\le 2\epsilon$ for all $x\neq x^*$. Further, combining this with (D.2), we have for sufficiently large $T$,
\[
\tilde\Delta_x \ge \hat\Delta_x - 4\epsilon \ge 0,\quad\forall x\neq x^*, \tag{D.5}
\]
where the second inequality holds because $\lim_{T\to\infty}\hat\Delta_x = \Delta_x > 0$ for $x\neq x^*$ and $\lim_{T\to\infty}\epsilon = \lim_{T\to\infty} 1/\log\log T = 0$. We can calculate the stopping statistic given by (4.3):
\[
Z(b_2) = \min_{x\neq\tilde{x}^*}\frac{\tilde\Delta_x^2/2}{\|\tilde{x}^* - x\|^2_{H_{b_2}^{-1}}} = \min_{x\neq x^*}\frac{\tilde\Delta_x^2/2}{\|x^* - x\|^2_{H_{b_2}^{-1}}}.
\]
Then
\[
\frac{Z(b_2)}{\alpha\log T} = \min_{x\neq x^*}\frac{\tilde\Delta_x^2/2}{\alpha\log T\,\|x^* - x\|^2_{H_{b_2}^{-1}}}
\ge \min_{x\neq x^*}\frac{(\hat\Delta_x - 4\epsilon)^2/2}{\alpha\log T\,\|x^* - x\|^2_{H_{b_2}^{-1}}} \tag{D.6}
\]
\[
= \min_{x\neq x^*}\frac{(\hat\Delta_x - 4\epsilon)^2/2}{\|x^* - x\|^2_{(H_{b_2}/(\alpha\log T))^{-1}}}
\ge \min_{x\neq x^*}\frac{(\hat\Delta_x - 4\epsilon)^2/2}{\|x^* - x\|^2_{H_{w}^{-1}}}, \tag{D.7}
\]
where (D.6) is because of (D.5), and (D.7) holds because
\[
\frac{H_{b_2}}{\alpha\log T} = \frac{\sum_{s=1}^{b_2} x_sx_s^\top}{\alpha\log T}
= \frac{1}{\alpha\log T}\Big(\sum_{s=1}^{a_2} x_sx_s^\top + \sum_{s=a_2+1}^{b_2} x_sx_s^\top\Big)
\succeq \frac{1}{\alpha\log T}\sum_{s=a_2+1}^{b_2} x_sx_s^\top
= \frac{1}{\alpha\log T}\Big((\log T)^{1+\gamma}\,x^*{x^*}^\top + \sum_{x\neq\hat{x}^*} w_x\,\alpha\log T\;xx^\top\Big)
= \frac{(\log T)^\gamma}{\alpha}\,x^*{x^*}^\top + \sum_{x\neq\hat{x}^*} w_x\,xx^\top = H_w.
\]
Then we can combine (D.4) and (D.7) to get
\[
\frac{Z(b_2)}{\alpha\log T} \ge \min_{x\neq x^*}\frac{(\hat\Delta_x - 4\epsilon)^2/2}{\|x^* - x\|^2_{H_w^{-1}}} \ge 1. \tag{D.8}
\]
By the selection of $\alpha = (1 + 1/\log\log T)(1 + d\log\log T/\log T)$ in Theorem 5.4, we have
\[
\alpha\log T = (1 + 1/\log\log T)(1 + d\log\log T/\log T)\log T = (1 + 1/\log\log T)\log\big(T(\log T)^d\big). \tag{D.9}
\]
Besides, for sufficiently large $T$, by Lemma C.6 we have $\sum_{x\neq x^*} w_x \le 2\sum_{x\neq x^*} w^*_x$. Combining this with (D.3), we get
\[
b_2 = a_2 + (\log T)^{1+\gamma} + \sum_{x\neq\hat{x}^*} w_x\,\alpha\log T
\le a_2 + (\log T)^{1+\gamma} + 2\sum_{x\neq x^*} w^*_x\,\alpha\log T
= \Theta\big((\log T)^{1/2}\big) + (\log T)^{1+\gamma} + 2\sum_{x\neq x^*} w^*_x\,\alpha\log T
= \Theta\big((\log T)^{1+\gamma}\big),
\]
since $\{w^*_x\}_{x\in X}$ is determined by the bandit instance and is independent of $T$. Hence for sufficiently large $T$, we have $b_2\le U(\log T)^{1+\gamma}$ for some constant $U > 0$. Then, plugging this upper bound on $b_2$ into the expression for $\beta(b_2, 1/T)$ from (4.4), we get
\[
\beta(b_2, 1/T) = (1 + 1/\log\log T)\log\big(T\,(b_2\log\log T)^{d/2}\big)
\le (1 + 1/\log\log T)\log\big(T\,(U\log\log T)^{d/2}(\log T)^{d(1+\gamma)/2}\big). \tag{D.10}
\]
Combining (D.8), (D.9), and (D.10), we have for sufficiently large $T$,
\[
Z(b_2) \ge \alpha\log T = (1 + 1/\log\log T)\log\big(T(\log T)^d\big)
\ge (1 + 1/\log\log T)\log\big(T\,(U\log\log T)^{d/2}(\log T)^{d(1+\gamma)/2}\big)
\ge \beta(b_2, 1/T),
\]
where the middle inequality holds for sufficiently large $T$ since $(\log T)^d$ dominates $(U\log\log T)^{d/2}(\log T)^{d(1+\gamma)/2}$ when $\gamma < 1$. This implies $Z(b_2)\ge\beta(b_2, 1/T)$ for sufficiently large $T$. So we have proved that if $E_2$ holds, then for sufficiently large $T$ the stopping condition (4.2) holds in the second batch. In addition, by Lemma C.4, $E_2$ happens with probability at least $1 - 2/(\log T)^2$. We conclude that the Chernoff stopping rule in (4.2) holds with probability at least $1 - 2/(\log T)^2$.

D.5. Proof of Lemma C.8

In this proof we continue using the good event $E_0$ (C.4) defined before.
Since it happens with probability at least $1 - 1/T$, we can simply assume that it holds and then prove the elimination claim. Under the good event $E_0$ (C.4), we have shown in (C.5) that any arm $x\neq x^*$ with gap
\[
\Delta_x > 4\varepsilon_3 = 4\sqrt{d\log(KT^2)/(\log T)^{1+\gamma}} \tag{D.11}
\]
is eliminated after the third batch. As $T\to\infty$, (D.11) holds for all suboptimal arms $x\neq x^*$, since $\Delta_{\min} > 0$ and $\varepsilon_3\to 0$ as $T\to\infty$. This finishes the proof.

D.6. Proof of Lemma C.9

With $E_2$ defined in (C.2), we can decompose the regret of the second batch as
\[
\mathrm{Regret}_2 = \mathrm{Regret}_2^{E_2}\,P(E_2) + \mathrm{Regret}_2^{E_2^c}\,P(E_2^c),
\]
where we let $\mathrm{Regret}_2^{E_2}$ and $\mathrm{Regret}_2^{E_2^c}$ be the regret of the second batch given $E_2$ and $E_2^c$, respectively. Lemma C.4 shows $P(E_2)\ge 1 - 2/(\log T)^2$. Under $E_2^c$, the regret is negligible for this $O\big((\log T)^{1+\gamma}\big)$-size batch as $T\to\infty$, because
\[
\lim_{T\to\infty}\mathrm{Regret}_2^{E_2^c}\,P(E_2^c) \le \lim_{T\to\infty}\frac{2}{(\log T)^2}\,O\big((\log T)^{1+\gamma}\big) = 0.
\]
This implies
\[
\limsup_{T\to\infty}\frac{\mathrm{Regret}_2}{\log T} = \limsup_{T\to\infty}\frac{\mathrm{Regret}_2^{E_2}\,P(E_2) + \mathrm{Regret}_2^{E_2^c}\,P(E_2^c)}{\log T} = \limsup_{T\to\infty}\frac{\mathrm{Regret}_2^{E_2}\,P(E_2)}{\log T}.
\]
Moreover, when $E_2$ holds, we can use the continuity of the problem to finish the proof. In fact, when the concentration result $E_2$ holds, for all arms $x$, $|\hat\mu_x - \mu_x|\le\epsilon = 1/\log\log T$, so $\lim_{T\to\infty}\hat\mu_x = \mu_x$. In a similar way, $\lim_{T\to\infty}\hat\Delta_x = \lim_{T\to\infty}(\hat\mu_{x^*} - \hat\mu_x) = \mu_{x^*} - \mu_x = \Delta_x$. Then using Lemma C.6, we get $\lim_{T\to\infty} w_x = w^*_x$ for all $x\in X$, where we use $\{w^*_x\}_{x\in X}$ to denote $w(\Delta)$ solved using the true gaps as in Definition C.5, and $\{w_x\}_{x\in X}$ to denote $w(\hat\Delta)$ solved using the estimated gaps as in Definition 4.3. As we discussed around (D.3), for sufficiently large $T$, in the second batch the agent first explores with a D-optimal design multi-set of size $o(\log T)$, then pulls the best arm $(\log T)^{1+\gamma}$ times and each suboptimal arm $x\neq x^*$ for $w_x\,\alpha\log T$ times. Thus the regret satisfies
\[
\limsup_{T\to\infty}\frac{\mathrm{Regret}_2^{E_2}\,P(E_2)}{\log T}
\le \limsup_{T\to\infty}\frac{o(\log T) + \sum_{x\neq x^*}\alpha\,w_x\log T\,\Delta_x}{\log T}
= \limsup_{T\to\infty}\sum_{x\neq x^*}\alpha\,w_x\,\Delta_x
= \sum_{x\neq x^*} w^*_x\,\Delta_x = c^*,
\]
where the last steps use $\alpha\to 1$ and $w_x\to w^*_x$ as $T\to\infty$. This completes the proof.