# Lipschitz Bandits with Batched Feedback

Yasong Feng (Shanghai Center for Mathematical Sciences, Fudan University, ysfeng20@fudan.edu.cn), Zengfeng Huang (School of Data Science, Fudan University, huangzf@fudan.edu.cn), Tianyu Wang (Shanghai Center for Mathematical Sciences, Fudan University, wangtianyu@fudan.edu.cn)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

In this paper, we study Lipschitz bandit problems with batched feedback, where the expected reward is Lipschitz and the reward observations are communicated to the player in batches. We introduce a novel landscape-aware algorithm, called Batched Lipschitz Narrowing (BLiN), that optimally solves this problem. Specifically, we show that for a $T$-step problem with Lipschitz reward of zooming dimension $d_z$, our algorithm achieves the theoretically optimal (up to logarithmic factors) regret rate $\widetilde{O}\big(T^{\frac{d_z+1}{d_z+2}}\big)$ using only $O(\log\log T)$ batches. We also provide complexity analysis for this problem. Our theoretical lower bound implies that $\Omega(\log\log T)$ batches are necessary for any algorithm to achieve the optimal regret. Thus, BLiN achieves the optimal regret rate using minimal communication.

1 Introduction

Multi-Armed Bandit (MAB) algorithms aim to exploit the good options while exploring the decision space. These algorithms and methodologies find successful applications in artificial intelligence and reinforcement learning [e.g., 37]. While the classic MAB setting assumes that the rewards are immediately observed after each arm pull, real-world data often arrive in different patterns. For example, observations from clinical trials are often collected in a batched fashion [34]. Another example is from online advertising, where strategies are tested on multiple customers at the same time [10]. In such cases, any observation-dependent decision-making should comply with this data-arriving pattern, including MAB algorithms.

In this paper, we study the Lipschitz bandit problem with batched feedback: a MAB problem where the expected reward is Lipschitz and the reward observations are communicated to the player in batches. In such settings, rewards are communicated only at the end of the batches, and the algorithm can only make decisions based on information up to the previous batch. Existing Lipschitz bandit algorithms heavily rely on timely access to the reward samples, since the partition of the arm space may change at any time. Therefore, they cannot solve the batched feedback setting. To address this difficulty, we present a novel adaptive algorithm for Lipschitz bandits with communication constraints, named Batched Lipschitz Narrowing (BLiN). BLiN learns the landscape of the reward by adaptively narrowing the arm set, so that regions of high reward are more frequently played. Also, BLiN determines the data collection procedure adaptively, so that only very few data communications are needed.

The above BLiN procedure achieves the optimal regret rate $\widetilde{O}\big(T^{\frac{d_z+1}{d_z+2}}\big)$ ($d_z$ is the zooming dimension [26, 15]), and admits a clean, easy-to-implement form. In addition to achieving the optimal regret rate, BLiN is also optimal in the following senses:

- BLiN's communication complexity is optimal. BLiN only needs $O(\log\log T)$ rounds of communications to achieve the optimal regret rate (Theorem 2), and no algorithm can achieve this rate with fewer than $\Omega(\log\log T)$ rounds of communications (Corollary 2).
- BLiN's time complexity is optimal (Remark 1): if the arithmetic operations and sampling are of complexity $O(1)$, then the time complexity of BLiN is $O(T)$, which improves on the best known time complexity $O(T\log T)$ for Lipschitz bandit problems [15].

1.1 Settings & Preliminaries

For a Lipschitz bandit problem (with communication constraints), the arm set is a compact doubling metric space $(\mathcal{A}, d_{\mathcal{A}})$. The expected reward $\mu : \mathcal{A} \to \mathbb{R}$ is 1-Lipschitz with respect to the metric $d_{\mathcal{A}}$, that is, $|\mu(x_1) - \mu(x_2)| \le d_{\mathcal{A}}(x_1, x_2)$ for any $x_1, x_2 \in \mathcal{A}$. At each time $t \le T$, the learning agent pulls an arm $x_t \in \mathcal{A}$ that yields a reward sample $y_t = \mu(x_t) + \epsilon_t$, where $\epsilon_t$ is a mean-zero independent sub-Gaussian noise. Without loss of generality, we assume that $\epsilon_t \sim \mathcal{N}(0,1)$, since generalizations to other sub-Gaussian noises are not hard. Similar to most bandit learning problems, the agent seeks to minimize regret in the batched feedback environment. The regret is defined as $R(T) = \sum_{t=1}^{T} (\mu^* - \mu(x_t))$, where $\mu^*$ denotes $\max_{x\in\mathcal{A}} \mu(x)$. For simplicity, we define $\Delta_x = \mu^* - \mu(x)$ (called the optimality gap of $x$) for all $x \in \mathcal{A}$.

1.1.1 Doubling Metric Spaces and the $([0,1]^d, \|\cdot\|_\infty)$ Metric Space

By Assouad's embedding theorem [6], the (compact) doubling metric space $(\mathcal{A}, d_{\mathcal{A}})$ can be embedded into a Euclidean space with some distortion of the metric; see [43] for more discussions in a machine learning context. Due to the existence of such embeddings, the metric space $([0,1]^d, \|\cdot\|_\infty)$, where metric balls are hypercubes, is sufficient for the purpose of our paper. For the rest of the paper, we will use hypercubes in algorithm design for simplicity, while our algorithmic idea generalizes to other doubling metric spaces.

1.1.2 Zooming Numbers and Zooming Dimensions

An important concept for bandit problems in metric spaces is the zooming number and the zooming dimension [26, 14, 39], which we discuss now. Define the set of $r$-optimal arms as $S(r) = \{x \in \mathcal{A} : \Delta_x \le r\}$. For any $r = 2^{-i}$, the decision space $[0,1]^d$ can be equally divided into $2^{di}$ cubes with edge length $r$, which we call standard cubes (also referred to as dyadic cubes). The $r$-zooming number is defined as
$$N_r := \#\{C : C \text{ is a standard cube with edge length } r \text{ and } C \subseteq S(16r)\}.$$
In words, $N_r$ is the $r$-packing number of the set $S(16r)$ in terms of standard cubes. The zooming dimension is then defined as
$$d_z := \min\{d' \ge 0 : \exists\, a > 0 \text{ such that } N_r \le a r^{-d'} \text{ for all } r = 2^{-i},\ i \in \mathbb{N}\}.$$
Moreover, we define the zooming constant $C_z$ as the minimal $a$ that makes the above inequality true for $d_z$,
$$C_z = \min\{a > 0 : N_r \le a r^{-d_z} \text{ for all } r = 2^{-i},\ i \in \mathbb{N}\}.$$
The zooming dimension $d_z$ can be significantly smaller than the ambient dimension $d$, and can even be zero. For a simple example, consider a problem with ambient dimension $d = 1$ and expected reward function $\mu(x) = x$ for $0 \le x \le 1$. Then for any $r = 2^{-i}$ with $i \ge 4$, we have $S(16r) = [1-16r, 1]$ and $N_r = 16$. Therefore, for this problem the zooming dimension equals $0$, with zooming constant $C_z = 16$.

1.2 Batched feedback pattern and our results

In the batched feedback setting, for a $T$-step game, the player determines a grid $\mathcal{T} = \{t_0, \dots, t_B\}$ adaptively, where $0 = t_0 < t_1 < \dots < t_B = T$ and $B \ll T$. During the game, reward observations are communicated to the player only at the grid points $t_1, \dots, t_B$. As a consequence, for any time $t$ in the $j$-th batch, that is, $t_{j-1} < t \le t_j$, the reward $y_t$ cannot be observed until time $t_j$, and the decision made at time $t$ depends only on rewards up to time $t_{j-1}$. The determination of the grid $\mathcal{T}$ is adaptive in the sense that the player chooses each grid point $t_j \in \mathcal{T}$ based on the operations and observations up to the previous point $t_{j-1}$.
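To make this feedback pattern concrete, here is a minimal Python sketch of a batched-feedback environment. The class and method names are ours and purely illustrative; they are not part of the paper or its code release.

```python
import numpy as np

class BatchedFeedbackEnv:
    """Toy batched-feedback environment over A = [0, 1]^d.

    Rewards y_t = mu(x_t) + N(0, 1) noise are generated at pull time but are
    buffered; the player only sees them when a batch closes (a grid point t_j).
    """

    def __init__(self, mu, seed=0):
        self.mu = mu                              # expected reward function on [0, 1]^d
        self.rng = np.random.default_rng(seed)
        self._buffer = []                         # hidden rewards of the current batch

    def pull(self, x):
        """Pull arm x; the reward is drawn now but not revealed."""
        y = float(self.mu(np.asarray(x, dtype=float))) + self.rng.standard_normal()
        self._buffer.append((tuple(x), y))

    def end_batch(self):
        """Grid point t_j: communicate all rewards of the finished batch."""
        batch, self._buffer = self._buffer, []
        return batch                              # list of (arm, reward) pairs, in pull order
```

Any decision made between two grid points can therefore depend only on the pairs returned by earlier `end_batch()` calls, which is exactly the constraint BLiN is designed to work under.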
In this work, we present the BLiN algorithm to solve Lipschitz bandits under batched feedback. During the learning procedure, BLiN detects and eliminates bad areas of the arm set in batches and partitions the remaining area according to an appropriate edge-length sequence. Our first theoretical upper bound is that with the simple Doubling Edge-length Sequence $r_m = 2^{-m+1}$, BLiN achieves the optimal regret rate $\widetilde{O}\big(T^{\frac{d_z+1}{d_z+2}}\big)$ using $O(\log T)$ batches.

Theorem 1. With probability exceeding $1 - 2T^{-6}$, the $T$-step total regret $R(T)$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies
$$R(T) \lesssim T^{\frac{d_z+1}{d_z+2}} (\log T)^{\frac{1}{d_z+2}},$$
where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $O(\log T)$ rounds of communications to achieve this regret rate.

Here and henceforth, $\lesssim$ only omits constants.

While D-BLiN is efficient for batched Lipschitz bandits, its communication complexity is not optimal. We then propose a new edge-length sequence, which we call the Appropriate Combined Edge-length Sequence (ACE Sequence), to improve the algorithm. The idea behind this sequence is that by appropriately combining some batches, the algorithm can achieve a better communication bound without incurring additional regret. As we shall see, BLiN with the ACE Sequence (A-BLiN) achieves regret rate $\widetilde{O}\big(T^{\frac{d_z+1}{d_z+2}}\big)$ with only $O(\log\log T)$ batches.

Theorem 2. With probability exceeding $1 - 2T^{-6}$, the $T$-step total regret $R(T)$ of A-BLiN satisfies
$$R(T) \lesssim T^{\frac{d_z+1}{d_z+2}} (\log T)^{\frac{1}{d_z+2}} \log\log T,$$
where $d_z$ is the zooming dimension of the problem instance. In addition, Algorithm 1 needs no more than $O(\log\log T)$ rounds of communications to achieve this regret rate.

As a comparison, seminal works [26, 39, 15] show that the optimal regret bound for Lipschitz bandits without communication constraints, where the reward observations are immediately observable after each arm pull, is $R(T) \lesssim T^{\frac{d_z+1}{d_z+2}} (\log T)^{\frac{1}{d_z+2}}$ in terms of the zooming dimension, and
$$R(T) \lesssim R_z(T) := \inf_{0 < r_0 \le 1} \Big( r_0 T + \sum_{r = 2^{-i}:\; r \ge r_0} \frac{N_r}{r} \log T \Big) \qquad (1)$$
in terms of the zooming number. Therefore, A-BLiN achieves the optimal regret rate of Lipschitz bandits by using very few batches. Moreover, our lower-bound analysis shows that $\Omega(\log\log T)$ batches are necessary for any algorithm to achieve the optimal regret rate. Thus, BLiN is optimal in terms of both regret and communication.

1.3 Related Works

The history of the Multi-Armed Bandit (MAB) problem dates back to Thompson [42]. Solvers for this problem include the UCB algorithms [28, 4, 7], the arm elimination method [20, 32, 36], the $\epsilon$-greedy strategy [7, 40], and the exponential weights and mirror descent framework [8]. Recently, with the prevalence of distributed computing and large-scale field experiments, the setting of batched feedback has captured attention [e.g., 17]. Perchet et al. [33] mainly consider batched bandits with two arms, and a matching lower bound for static grids is proved. This was then generalized by Gao et al. [21] to finite-armed bandit problems. In their work, the authors designed an elimination method for the finite-armed bandit problem and proved matching lower bounds for both static and adaptive grids. Soon afterwards, Zhang et al. [46] studied inference for batched bandits. Esfandiari et al. [19] studied batched linear bandits and batched adversarial bandits. Han et al. [22] and Ruan et al. [35] provide solutions for batched contextual linear bandits. Li and Scarlett [29] studied batched Gaussian process bandits. Batched dueling bandits have also been studied [2].
Parallel to the regret-control regime, best arm identification with a limited number of batches was studied in [1] and [23]. Top-$k$ arm identification in the collaborative learning framework is also closely related to the batched setting, where the goal is to minimize the number of iterations (or communication steps) between agents. In this setting, tight bounds have been obtained in recent works [41, 24]. Yet the problem of Lipschitz bandits with communication constraints remains unsolved.

The Lipschitz bandit problem is important in its own right. It was introduced as continuum-armed bandits [3], where the arm space is a compact interval. Along this line, bandits that are Lipschitz (or Hölder) continuous have been studied. For this problem, Kleinberg [25] proves an $\Omega(T^{2/3})$ lower bound and introduces a matching algorithm. Under extra conditions on top of Lipschitzness, a regret rate of $\widetilde{O}(T^{1/2})$ was achieved [9, 18]. For general (doubling) metric spaces, the Zooming bandit algorithm [26] and the Hierarchical Optimistic Optimization (HOO) algorithm [15] were developed. In more recent years, some attention has been focused on Lipschitz bandit problems with certain extra structures. To name a few, Bubeck et al. [16] study Lipschitz bandits for differentiable rewards, which enables algorithms to run without explicitly knowing the Lipschitz constant. Wang et al. [44] studied discretization-based Lipschitz bandit algorithms from a Gaussian process perspective. Magureanu et al. [31] derive a new concentration inequality and study discrete Lipschitz bandits. The idea of robust mean estimators [11, 5, 13] was applied to the Lipschitz bandit problem to cope with heavy-tailed rewards, leading to a near-optimal algorithm for Lipschitz bandits with heavy-tailed rewards [30]. Lipschitz bandits where a clustering is used to infer the underlying metric have also been studied [45]. Contextual Lipschitz bandits have been studied in [39] and [27]. Yet all of the existing works for Lipschitz bandits assume that the reward sample is immediately observed after each arm pull, and none of them solves the Lipschitz bandit problem with communication constraints.

This paper is organized as follows. In Section 2, we introduce the BLiN algorithm and give a visual illustration of the algorithm procedure. In Section 3, we prove that BLiN with the ACE Sequence achieves the optimal regret rate using only $O(\log\log T)$ rounds of communications. Section 4 provides information-theoretic lower bounds for Lipschitz bandits with communication constraints, which show that BLiN is optimal in terms of both regret and rounds of communications. Experimental results are presented in Section 5.

2 Algorithm

With communication constraints, the agent's knowledge about the environment does not accumulate within each batch. This characteristic of the problem suggests a uniform-type algorithm: we shall treat each step within the same batch equally. Following this intuition, in each batch, we uniformly play the remaining arms, and then eliminate arms of low reward after the observations are communicated. Next we describe the uniform play rule and the arm elimination rule.

Uniform Play Rule: At the beginning of each batch $m$, a collection of subsets of the arm space $\mathcal{A}_m = \{C_{m,1}, C_{m,2}, \dots, C_{m,|\mathcal{A}_m|}\}$ is constructed. This collection $\mathcal{A}_m$ consists of standard cubes, and all cubes in $\mathcal{A}_m$ have the same edge length $r_m$. We will detail the construction of $\mathcal{A}_m$ when we describe the arm elimination rule.
We refer to cubes in $\mathcal{A}_m$ as active cubes of batch $m$. During batch $m$, each cube in $\mathcal{A}_m$ is played $n_m := \frac{16\log T}{r_m^2}$ times, where $T$ is the total time horizon. More specifically, within each $C \in \mathcal{A}_m$, arms $x_{C,1}, x_{C,2}, \dots, x_{C,n_m} \in C$ are played (one can play $x_{C,1}, x_{C,2}, \dots, x_{C,n_m}$ arbitrarily, as long as $x_{C,i} \in C$ for all $i$). The reward samples $\{y_{C,1}, y_{C,2}, \dots, y_{C,n_m}\}_{C \in \mathcal{A}_m}$ corresponding to $\{x_{C,1}, x_{C,2}, \dots, x_{C,n_m}\}_{C \in \mathcal{A}_m}$ are collected at the end of this batch.

Arm Elimination Rule: At the end of batch $m$, information from the arm pulls is collected, and we estimate the reward of each $C \in \mathcal{A}_m$ by $\widehat{\mu}_m(C) = \frac{1}{n_m}\sum_{i=1}^{n_m} y_{C,i}$. Cubes of low estimated reward are then eliminated, according to the following rule: a cube $C \in \mathcal{A}_m$ is eliminated if $\widehat{\mu}_m^{\max} - \widehat{\mu}_m(C) > 4 r_m$, where $\widehat{\mu}_m^{\max} := \max_{C \in \mathcal{A}_m} \widehat{\mu}_m(C)$. After this necessary removal of "bad cubes", each cube in $\mathcal{A}_m$ that survives the elimination is equally partitioned into $\big(\frac{r_m}{r_{m+1}}\big)^d$ subcubes of edge length $r_{m+1}$, where $r_{m+1}$ is predetermined. These cubes (of edge length $r_{m+1}$) are collected to construct $\mathcal{A}_{m+1}$, and the learning process moves on to the next batch. Appropriate rounding may be required to ensure the ratio $\frac{r_m}{r_{m+1}}$ is an integer; see Remark 2 for more details. The learning process is summarized in Algorithm 1.

Algorithm 1 Batched Lipschitz Narrowing (BLiN)
1: Input. Arm set $\mathcal{A} = [0,1]^d$; time horizon $T$.
2: Initialization. Number of batches $B$; edge-length sequence $\{r_m\}_{m=1}^{B+1}$; the first grid point $t_0 = 0$; equally partition $\mathcal{A}$ into $r_1^{-d}$ subcubes and define $\mathcal{A}_1$ as the collection of these subcubes.
3: Compute $n_m = \frac{16\log T}{r_m^2}$ for $m = 1, \dots, B+1$.
4: for $m = 1, 2, \dots, B$ do
5:   For each cube $C \in \mathcal{A}_m$, play arms $x_{C,1}, \dots, x_{C,n_m}$ from $C$.
6:   Collect the rewards of all pulls up to $t_m$. Compute the average payoff $\widehat{\mu}_m(C) = \frac{\sum_{i=1}^{n_m} y_{C,i}}{n_m}$ for each cube $C \in \mathcal{A}_m$. Find $\widehat{\mu}_m^{\max} = \max_{C \in \mathcal{A}_m} \widehat{\mu}_m(C)$.
7:   For each cube $C \in \mathcal{A}_m$, eliminate $C$ if $\widehat{\mu}_m^{\max} - \widehat{\mu}_m(C) > 4 r_m$. Let $\mathcal{A}_m^+$ be the set of cubes not eliminated.
8:   Compute $t_{m+1} = t_m + (r_m/r_{m+1})^d\, |\mathcal{A}_m^+|\, n_{m+1}$. If $t_{m+1} \ge T$ or $m = B$, then break.
9:   Equally partition each cube in $\mathcal{A}_m^+$ into $(r_m/r_{m+1})^d$ subcubes and define $\mathcal{A}_{m+1}$ as the collection of these subcubes. /* See Remark 2 for more details on cases where $(r_m/r_{m+1})^d$ is not an integer. */
10: end for
11: Cleanup: Arbitrarily play the remaining arms until all $T$ steps are used.

The following theorem gives the regret and communication upper bounds of BLiN with the Doubling Edge-length Sequence $r_m = 2^{-m+1}$ (see Appendix B for the proof). Note that this result implies Theorem 1.

Theorem 3. With probability exceeding $1 - 2T^{-6}$, the $T$-step total regret $R(T)$ of BLiN with Doubling Edge-length Sequence (D-BLiN) satisfies $R(T) \le 528\,(\log T)^{\frac{1}{d_z+2}}\, T^{\frac{d_z+1}{d_z+2}}$, where $d_z$ is the zooming dimension of the problem instance. In addition, D-BLiN needs no more than $\frac{\log\frac{T}{\log T}}{d_z+2} + 2$ rounds of communications to achieve this regret rate.

Although D-BLiN efficiently solves batched Lipschitz bandits, its simple partition strategy leads to suboptimal communication complexity. Now we show that by appropriately combining some batches, BLiN achieves the optimal communication bound without incurring additional regret. Specifically, we introduce the following edge-length sequence, which we call the ACE Sequence.

Definition 1. For a problem with ambient dimension $d$, zooming dimension $d_z$ and time horizon $T$, let $c_1 = \frac{d_z+1}{(d+2)(d_z+2)}\log\frac{T}{\log T}$ and $c_{i+1} = \eta c_i$ for any $i \ge 1$, where $\eta = \frac{d+1-d_z}{d+2}$. Then the Appropriate Combined Edge-length Sequence (ACE Sequence) $\{r_m\}$ is defined by $r_m = 2^{-\sum_{i=1}^m c_i}$ for any $m \ge 1$.
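As an illustration of Algorithm 1, the following sketch runs the uniform play and elimination loop on the environment sketch above. It is our reading of the pseudocode, not the authors' released implementation; it assumes dyadic edge lengths (each ratio $r_m/r_{m+1}$ a power of 2) and plays random arms inside each cube, which the algorithm permits.

```python
import itertools
import math
import numpy as np

def blin(env, d, T, edge_lengths, seed=0):
    """Sketch of Algorithm 1 (BLiN) on A = [0, 1]^d with batched feedback.

    `env` exposes pull(x) / end_batch() as in the environment sketch above;
    `edge_lengths` is r_1 >= r_2 >= ... (e.g. the doubling sequence 2**(-m+1),
    or the ACE Sequence of Definition 1 after the rounding of Remark 2).
    """
    rng = np.random.default_rng(seed)
    r1 = edge_lengths[0]
    # A_1: all standard cubes of edge length r_1, indexed by their lower corners.
    active = [tuple(i * r1 for i in idx)
              for idx in itertools.product(range(round(1.0 / r1)), repeat=d)]
    t = 0
    for m, r_m in enumerate(edge_lengths, start=1):
        n_m = math.ceil(16 * math.log(T) / r_m ** 2)      # pulls per active cube
        if t + len(active) * n_m > T:                      # budget exhausted: stop, then clean up
            break
        # Uniform play rule: n_m arbitrary (here random) arms inside every active cube.
        order = []
        for corner in active:
            for _ in range(n_m):
                env.pull(tuple(c + r_m * rng.random() for c in corner))
                order.append(corner)
        t += len(order)
        # Rewards only arrive now, at the end of the batch.
        samples = {}
        for corner, (_, y) in zip(order, env.end_batch()):
            samples.setdefault(corner, []).append(y)
        mu_hat = {c: float(np.mean(v)) for c, v in samples.items()}
        mu_max = max(mu_hat.values())
        # Arm elimination rule: drop cubes trailing the best estimate by more than 4 r_m.
        survivors = [c for c in active if mu_max - mu_hat[c] <= 4 * r_m]
        if m == len(edge_lengths):
            active = survivors
            break
        # Partition each surviving cube into (r_m / r_{m+1})^d subcubes for the next batch.
        r_next = edge_lengths[m]
        ratio = round(r_m / r_next)
        active = [tuple(c + i * r_next for c, i in zip(corner, idx))
                  for corner in survivors
                  for idx in itertools.product(range(ratio), repeat=d)]
    return active, t    # Cleanup: the remaining T - t pulls may be spent arbitrarily in `active`
```

For D-BLiN one would pass `edge_lengths = [2 ** (-m + 1) for m in range(1, B + 2)]`; A-BLiN replaces this list with the (rounded) ACE Sequence.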
We show that BLiN with the ACE Sequence (A-BLiN) attains an improved communication complexity, which proves Theorem 2.

Theorem 4. With probability exceeding $1 - 2T^{-6}$, the $T$-step total regret $R(T)$ of Algorithm 1 satisfies
$$R(T) \le \Big( \frac{128\, C_z \log\log T}{\log\frac{d+2}{d+1-d_z}} + 8e \Big)\, T^{\frac{d_z+1}{d_z+2}} (\log T)^{\frac{1}{d_z+2}}, \qquad (2)$$
where $d_z$ is the zooming dimension of the problem instance. In addition, Algorithm 1 needs no more than $\frac{\log\log T}{\log\frac{d+2}{d+1-d_z}} + 1$ rounds of communications to achieve this regret rate.

The partition and elimination process of a real A-BLiN run is shown in Figure 1. In the $i$-th subfigure, the white cubes are those remaining after the $(i-1)$-th batch. In this experiment, we set $\mathcal{A} = [0,1]^2$, and the optimal arm is $x^* = (0.8, 0.7)$. Note that $x^*$ is not eliminated during the game. More details of this experiment are in Section 5.

Figure 1: Partition and elimination process of A-BLiN. The $i$-th subfigure shows the pattern before the $i$-th batch, which is from a real A-BLiN run on the reward function defined in Section 5. Dark gray cubes are those eliminated in the most recent batch, while the light gray ones are those eliminated in earlier batches. For the total time horizon $T = 80000$, A-BLiN needs 4 rounds of communications. For this experiment, $r_1 = \frac{1}{8}$, $r_3 = \frac{1}{16}$, $r_4 = \frac{1}{32}$, which is the ACE Sequence (rounded as in Remark 2) for $d = 2$ and $d_z = 0$.

Remark 1 (Time and space complexity). The time complexity of our algorithm is $O(T)$, which is better than the state of the art $O(T\log T)$ in [15]. This is because the running time of batch $j$ is of order $O(l_j)$, where $l_j = t_j - t_{j-1}$ is the number of samples in batch $j$. Since $\sum_j l_j = T$, the time complexity of BLiN is $O(T)$. Besides, the space complexity of BLiN is also improved, because we do not need to store information about cubes from previous batches.

3 Regret Analysis of A-BLiN

In this section, we provide regret analysis for A-BLiN. The highlight of the finding is that $O(\log\log T)$ batches are sufficient to achieve the optimal regret rate $\widetilde{O}\big(T^{\frac{d_z+1}{d_z+2}}\big)$, as summarized in Theorem 4. To prove Theorem 4, we first show that the estimator $\widehat{\mu}$ is concentrated around the true expected reward $\mu$ (Lemma 1), and that the optimal arm survives all eliminations with high probability (Lemma 2). In the following analysis, we let $B_{\mathrm{stop}}$ be the total number of batches of the BLiN run.

Lemma 1. Define the event
$$\mathcal{E} := \Big\{ |\mu(x) - \widehat{\mu}_m(C)| \le r_m + \sqrt{\tfrac{16\log T}{n_m}}, \;\; \forall\, 1 \le m \le B_{\mathrm{stop}} - 1,\ \forall\, C \in \mathcal{A}_m,\ \forall\, x \in C \Big\}.$$
It holds that $P(\mathcal{E}) \ge 1 - 2T^{-6}$.

Lemma 2. Under event $\mathcal{E}$, the optimal arm $x^* = \arg\max_{x\in\mathcal{A}} \mu(x)$ is not eliminated after the first $B_{\mathrm{stop}} - 1$ batches.

Based on these results, we show that the cubes surviving elimination contain only arms of high reward.

Lemma 3. Under event $\mathcal{E}$, for any $1 \le m \le B_{\mathrm{stop}}$, any $C \in \mathcal{A}_m$ and any $x \in C$, $x$ satisfies
$$\Delta_x \le 8 r_{m-1}. \qquad (3)$$

The proofs of Lemmas 1-3 are in Appendix A. We are now ready to prove the theorem.

Proof of Theorem 4. Let $R_m$ denote the regret of the $m$-th batch. Fixing any positive integer $B$, the total regret $R(T)$ can be divided into two parts: $R(T) = \sum_{m \le B} R_m + \sum_{m > B} R_m$. In the following, we bound these two parts separately and then determine $B$ to obtain the upper bound on the total regret. Moreover, we show that A-BLiN uses only $O(\log\log T)$ rounds of communications to achieve the optimal regret.

Recall that $\mathcal{A}_m$ is the set of active cubes in the $m$-th batch. According to Lemma 3, for any $x \in \bigcup_{C \in \mathcal{A}_m} C$, we have $\Delta_x \le 8 r_{m-1}$. Let $\mathcal{A}_m^+$ be the set of cubes not eliminated in batch $m$. Then any cube in $\mathcal{A}_{m-1}^+$ is a subset of $S(8 r_{m-1})$, and thus
$$|\mathcal{A}_{m-1}^+| \le N_{r_{m-1}} \le C_z r_{m-1}^{-d_z}. \qquad (4)$$
By definition, $r_m = r_{m-1} 2^{-c_m}$, so
$$|\mathcal{A}_m| = 2^{d c_m} |\mathcal{A}_{m-1}^+|. \qquad (5)$$
The total regret of the $m$-th batch is
$$R_m \le \sum_{C \in \mathcal{A}_m} \sum_{i=1}^{n_m} \Delta_{x_{C,i}} \le |\mathcal{A}_m| \cdot \frac{16\log T}{r_m^2} \cdot 8 r_{m-1} \overset{(i)}{=} 2^{d c_m} |\mathcal{A}_{m-1}^+| \cdot \frac{128\, r_{m-1}\log T}{r_m^2} \overset{(ii)}{\le} 2^{d c_m} C_z r_{m-1}^{-d_z} \cdot \frac{128\, r_{m-1}\log T}{r_m^2} \overset{(iii)}{=} 2^{(\sum_{i=1}^{m-1} c_i)(d_z+1) + c_m(d+2)} \cdot 128\, C_z \log T,$$
where (i) follows from (5), (ii) follows from (4), and (iii) follows from the definition of the ACE Sequence.

Define $C_m = \big(\sum_{i=1}^{m-1} c_i\big)(d_z+1) + c_m(d+2)$. Since $c_m = c_{m-1}\cdot\frac{d+1-d_z}{d+2}$, calculation shows that
$$C_m = \Big(\sum_{i=1}^{m-2} c_i\Big)(d_z+1) + c_{m-1}(d+2) + c_{m-1}(d_z+1-d-2) + c_m(d+2) = C_{m-1}.$$
Thus for any $m$, we have $C_m = C_1 = c_1(d+2)$. Hence,
$$R_m \le 2^{c_1(d+2)} \cdot 128\, C_z \log T = T^{\frac{d_z+1}{d_z+2}} \cdot 128\, C_z (\log T)^{\frac{1}{d_z+2}}. \qquad (6)$$
The inequality (6) holds even if the $m$-th batch does not exist (in which case we let $R_m = 0$) or is not completed. Thus we obtain the first upper bound
$$\sum_{m \le B} R_m \le 128\, C_z B\, T^{\frac{d_z+1}{d_z+2}} (\log T)^{\frac{1}{d_z+2}}.$$
Lemma 3 implies that any arm $x$ played after the first $B$ batches satisfies $\Delta_x \le 8 r_B$, so the total regret after the first $B$ batches is bounded by
$$\sum_{m > B} R_m \le 8 r_B T = 8 T\, 2^{-\sum_{i=1}^{B} c_i} = 8 T\, 2^{-c_1\frac{1-\eta^B}{1-\eta}} = 8\, T^{\frac{d_z+1}{d_z+2}} (\log T)^{\frac{1}{d_z+2}} \Big(\frac{T}{\log T}\Big)^{\frac{\eta^B}{d_z+2}}.$$
Therefore, the total regret $R(T)$ satisfies
$$R(T) = \sum_{m \le B} R_m + \sum_{m > B} R_m \le \Big( 128\, C_z B + 8 \Big(\frac{T}{\log T}\Big)^{\frac{\eta^B}{d_z+2}} \Big)\, T^{\frac{d_z+1}{d_z+2}} (\log T)^{\frac{1}{d_z+2}}.$$
This inequality holds for any positive $B$. Then by choosing $B = \Big\lceil \frac{\log\log T - \log(d_z+2)}{\log\frac{d+2}{d+1-d_z}} \Big\rceil$, we have $\frac{\eta^B}{d_z+2} \le \frac{1}{\log T}$ and
$$R(T) \le \Big( \frac{128\, C_z \log\log T}{\log\frac{d+2}{d+1-d_z}} + 8e \Big)\, T^{\frac{d_z+1}{d_z+2}} (\log T)^{\frac{1}{d_z+2}}.$$
The above analysis implies that we can achieve the optimal regret rate $\widetilde{O}\big(T^{\frac{d_z+1}{d_z+2}}\big)$ by letting the for-loop run $B$ times and finishing the remaining rounds in the Cleanup step. In other words, $B+1$ rounds of communications are sufficient for A-BLiN to achieve the regret bound (2).

Remark 2. The quantity $\frac{r_m}{r_{m+1}}$ in line 9 of Algorithm 1 may not be an integer for some $m$. Thus, in practice we denote $\alpha_n = \big\lceil \sum_{i=1}^n c_i \big\rceil$, $\beta_n = \big\lfloor \sum_{i=1}^n c_i \big\rfloor$, and define the rounded ACE Sequence $\{\widetilde{r}_m\}_{m \in \mathbb{N}}$ by $\widetilde{r}_m = 2^{-\beta_k}$ for $m = 2k-1$ and $\widetilde{r}_m = 2^{-\alpha_k}$ for $m = 2k$. Then the total regret can be divided as $R(T) = \sum_{1 \le k \le B} R_{2k-1} + \sum_{1 \le k \le B} R_{2k} + \sum_{m > 2B} R_m$. For the first part we have $\widetilde{r}_{2k-2} \le 2 r_{k-1}$ and $\widetilde{r}_{2k-1} \ge r_k$, while for the second part we have $\widetilde{r}_{2k-1}/\widetilde{r}_{2k} \le 2$. Therefore, by an argument similar to the proof of Theorem 4, we can bound these three parts separately, and conclude that BLiN with the rounded ACE Sequence achieves the optimal regret bound $\widetilde{O}\big(T^{\frac{d_z+1}{d_z+2}}\big)$ using only $O(\log\log T)$ rounds of communications. The details are in Appendix C.
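To make Definition 1 and Remark 2 concrete, here is a small sketch that computes the ACE partial sums and one plausible dyadic rounding. The choice of the number of terms follows the proof above, and the exact rounding convention is our reading of Remark 2, not a verbatim transcription of the paper's appendix.

```python
import itertools
import math

def ace_edge_lengths(d, d_z, T):
    """ACE Sequence of Definition 1, plus a dyadic rounding in the spirit of Remark 2."""
    eta = (d + 1 - d_z) / (d + 2)
    c1 = (d_z + 1) / ((d + 2) * (d_z + 2)) * math.log2(T / math.log(T))
    # Number of terms B, as chosen in the proof of Theorem 4.
    B = max(1, math.ceil((math.log(math.log(T)) - math.log(d_z + 2)) / math.log(1.0 / eta)))
    partial_sums = list(itertools.accumulate(c1 * eta ** i for i in range(B)))
    exact = [2.0 ** (-s) for s in partial_sums]            # r_m = 2^{-(c_1 + ... + c_m)}
    rounded = []                                           # floor/ceil of each partial sum,
    for s in partial_sums:                                 # kept strictly decreasing so that
        for e in (math.floor(s), math.ceil(s)):            # consecutive ratios are powers of 2
            if not rounded or 2.0 ** (-e) < rounded[-1]:
                rounded.append(2.0 ** (-e))
    return exact, rounded
```

Feeding the rounded list into the BLiN sketch above keeps every ratio $\widetilde{r}_m/\widetilde{r}_{m+1}$ a power of two, so line 9 of Algorithm 1 always produces an integer number of subcubes.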
4 Lower Bounds

In this section, we present lower bounds for Lipschitz bandits with batched feedback, which in turn give communication lower bounds for all Lipschitz bandit algorithms. Our lower bounds depend on the number of rounds of communications $B$. When $B$ is sufficiently large, our results match the lower bound for the vanilla Lipschitz bandit problem, $\widetilde{\Theta}(R_z(T))$ ($R_z(T)$ is defined in Eq. (1)). More importantly, this dependency on $B$ gives the minimal number of rounds of communications needed to achieve the optimal regret bound for all Lipschitz bandit algorithms, which is summarized in Corollary 2. Since this lower bound matches the upper bound presented in Theorem 4, BLiN optimally solves Lipschitz bandits with minimal communication.

4.1 Proof Outline

Similar to most lower bound proofs, we need to construct problem instances that are difficult to differentiate. What is different is that we need to carefully integrate the batched feedback pattern [33] with the Lipschitz payoff structure [39, 30]. To capture the adaptivity in grid determination, we construct static reference communication grids to remove the stochasticity in grid selection [1, 21]. Below, we first consider the static grid case, where the grid is predetermined. This static grid case will provide intuition for the adaptive and more general case.

The expected reward functions of these instances are constructed as follows: we choose some position and height, such that the expected reward function attains a local maximum of the specified height at the specified position. We will use the word "peak" to refer to such local maxima. The following theorem presents the lower bound for the static grid case.

Theorem 5 (Lower Bound for Static Grid). Consider Lipschitz bandit problems with time horizon $T$ such that the grid of reward communication $\mathcal{T}$ is static and determined before the game. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists a problem instance such that
$$\mathbb{E}[R_T(\pi)] \ge c\, (\log T)^{-\frac{1}{1-(\frac{1}{d+2})^B}}\, R_z(T)^{\frac{1}{1-(\frac{1}{d+2})^B}},$$
where $c > 0$ is a numerical constant independent of $B$, $T$, $\pi$ and $\mathcal{T}$, $R_z(T)$ is defined in (1), and $d$ is the dimension of the arm space.

To prove Theorem 5, we first show that for any $k > 1$ there exists an instance such that $\mathbb{E}[R_T(\pi)] \gtrsim t_k\, t_{k-1}^{-\frac{1}{d+2}}$. Fixing $k > 1$, we let $r_k = t_{k-1}^{-\frac{1}{d+2}}$ and $M_k := t_{k-1} r_k^2 = r_k^{-d}$. Then we construct a set of problem instances $\mathcal{I}_k = \{I_{k,1}, \dots, I_{k,M_k}\}$, such that the gap between the highest peak and the second highest peak is about $r_k$ for every instance in $\mathcal{I}_k$. Based on this construction, we prove that no algorithm can distinguish the instances in $\mathcal{I}_k$ from one another in the first $(k-1)$ batches, so the worst-case regret is at least of order $r_k t_k$, which gives the inequality we need. For the first batch $(0, t_1]$, we can easily construct a set of instances where the worst-case regret is at least of order $t_1$, since no information is available during this time. Thus, there exists a problem instance such that
$$\mathbb{E}[R_T(\pi)] \gtrsim \max\big\{ t_1,\; t_2 t_1^{-\frac{1}{d+2}},\; \dots,\; t_B t_{B-1}^{-\frac{1}{d+2}} \big\}.$$
Since $0 < t_1 < \dots < t_B = T$, the inequality in Theorem 5 follows.

As a consequence of Theorem 5, we can derive the minimum number of rounds of communications needed to achieve the optimal regret bound for the Lipschitz bandit problem, which is stated in Corollary 1.

Corollary 1. Any Lipschitz bandit algorithm needs $\Omega(\log\log T)$ rounds of communications to achieve the optimal regret rate, in the case that the times of reward communication are predetermined and static.

The detailed proofs of Theorem 5 and Corollary 1 are deferred to Appendices D and E.

4.2 Communication Lower Bound for BLiN

So far we have derived lower bounds for the static grid case. Yet there is a gap between the static and the adaptive case. We close this gap in the following theorem.

Theorem 6 (Lower Bound for Adaptive Grid). Consider Lipschitz bandit problems with time horizon $T$ such that the grid of reward communication $\mathcal{T}$ is adaptively determined by the player. If $B$ rounds of communications are allowed, then for any policy $\pi$, there exists an instance such that
$$\mathbb{E}[R_T(\pi)] \ge c\, (\log T)^{-\frac{1}{1-(\frac{1}{d+2})^B}}\, R_z(T)^{\frac{1}{1-(\frac{1}{d+2})^B}},$$
where $c > 0$ is a numerical constant independent of $B$, $T$, $\pi$ and $\mathcal{T}$, $R_z(T)$ is defined in (1), and $d$ is the dimension of the arm space.

To prove Theorem 6, we consider a reference static grid $\mathcal{T}_r = \{T_0, T_1, \dots, T_B\}$, where $T_j = T^{\frac{1-\varepsilon^j}{1-\varepsilon^B}}$ with $\varepsilon = \frac{1}{d+2}$. We construct a series of "worlds", denoted by $\mathcal{I}_1, \dots, \mathcal{I}_B$. Each world is a set of problem instances, and each problem instance in world $\mathcal{I}_j$ is defined by a peak location set $U_j$ and a basic height $r_j$, where the sets $U_j$ and quantities $r_j$ for $1 \le j \le B$ are presented in Appendix F. Based on these constructions, we first prove that for any adaptive grid and policy, there exists an index $j$ such that the event $A_j = \{t_{j-1} < T_{j-1},\ t_j \ge T_j\}$ happens with sufficiently high probability in world $\mathcal{I}_j$.
Then, similar to Theorem 5, we prove that in world $\mathcal{I}_j$ there exists a set of problem instances that are difficult to differentiate in the first $j-1$ batches. In addition, event $A_j$ implies that $t_j \ge T_j$, so the worst-case regret is at least of order $r_j T_j$, which gives the lower bound we need. The proof of Theorem 6 is deferred to Appendix F.

Similar to Corollary 1, we can prove that at least $\Omega(\log\log T)$ rounds of communications are needed to achieve the optimal regret bound. This result is formally summarized in Corollary 2.

Corollary 2. Any Lipschitz bandit algorithm (including algorithms whose number of batches is determined adaptively) needs $\Omega(\log\log T)$ rounds of communications to achieve the optimal regret rate.

5 Experiments

In this section, we present numerical studies of A-BLiN. In the experiments, we use the arm space $\mathcal{A} = [0,1]^2$ and the expected reward function $\mu(x) = 1 - \frac{1}{2}\|x - x_1\|_2 - \frac{3}{10}\|x - x_2\|_2$, where $x_1 = (0.8, 0.7)$ and $x_2 = (0.1, 0.1)$. The landscape of $\mu$ and the resulting partition are shown in Figure 2(a). As can be seen, the partition is finer in the area closer to the optimal arm $x^* = (0.8, 0.7)$.

Figure 2: Resulting partition and regret of A-BLiN. In Figure 2(a), we show the resulting partition of A-BLiN. The background color denotes the true value of the expected reward $\mu$, and blue means high values. The figure shows that the partition is finer for larger values of $\mu$. In Figure 2(b), we show the accumulated regret of A-BLiN and of the zooming algorithm [26]. In the figure, different background colors represent different batches of A-BLiN. For the total time horizon $T = 80000$, A-BLiN needs 4 rounds of communications.

We let the time horizon $T = 80000$, and report the accumulated regret in Figure 2(b). The regret curve is sublinear, which agrees with the regret bound (2). Besides, different background colors in Figure 2(b) represent different batches. For the total time horizon $T = 80000$, A-BLiN only needs 4 rounds of communications. We also present the regret curve of the zooming algorithm [26] for comparison. Different from the zooming algorithm, the regret curve of A-BLiN is approximately piecewise linear, because the strategy of BLiN does not change within each batch. Results of more repeated experiments are in Appendix G, as well as experimental results of D-BLiN. Our code is available at https://github.com/FengYasong-fifol/Batched-Lipschitz-Narrowing.
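For reference, the reward surface of this section can be reproduced in a few lines using the interfaces of the earlier sketches; the centres $x_1$, $x_2$ and the constants follow the formula above, and the commented usage at the end assumes the hypothetical `BatchedFeedbackEnv`, `blin`, and `ace_edge_lengths` helpers defined earlier.

```python
import numpy as np

x1 = np.array([0.8, 0.7])       # optimal arm
x2 = np.array([0.1, 0.1])

def mu(x):
    """Expected reward of Section 5: two peaks, maximised at x1 = (0.8, 0.7)."""
    x = np.asarray(x, dtype=float)
    return 1.0 - 0.5 * np.linalg.norm(x - x1) - 0.3 * np.linalg.norm(x - x2)

# Example run with the earlier sketches (illustrative only):
# env = BatchedFeedbackEnv(mu)
# remaining, used = blin(env, d=2, T=80000, edge_lengths=ace_edge_lengths(2, 0, 80000)[1])
```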
6 Conclusion

In this paper, we study Lipschitz bandits with communication constraints, and propose the BLiN algorithm as a solution. We prove that BLiN needs only $O(\log\log T)$ rounds of communications to achieve the optimal regret rate of the best previous Lipschitz bandit algorithms [26, 14], which need $T$ batches. This improvement in the number of batches significantly saves data communication costs. We also provide complexity analysis for this problem. We show that $\Omega(\log\log T)$ rounds of communications are necessary for any algorithm to optimally solve Lipschitz bandit problems. Hence, the BLiN algorithm is optimal.

Acknowledgments and Disclosure of Funding

This work was partly supported by the National Key Research and Development Program of China (2020AAA0107600).

References

[1] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. Learning with limited rounds of adaptivity: coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pages 39-75. PMLR, 2017.
[2] Arpit Agarwal, Rohan Ghuge, and Viswanath Nagarajan. Batched dueling bandits. arXiv preprint arXiv:2202.10660, 2022.
[3] Rajeev Agrawal. The continuum-armed bandit problem. SIAM Journal on Control and Optimization, 33(6):1926-1951, 1995.
[4] Rajeev Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078, 1995.
[5] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999.
[6] Patrice Assouad. Plongements Lipschitziens dans R^n. Bulletin de la Société Mathématique de France, 111:429-448, 1983.
[7] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.
[8] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[9] Peter Auer, Ronald Ortner, and Csaba Szepesvári. Improved rates for the stochastic continuum-armed bandit problem. In Conference on Computational Learning Theory, pages 454-468. Springer, 2007.
[10] Dimitris Bertsimas and Adam J. Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120-1135, 2007.
[11] Peter J. Bickel. On some robust estimates of location. The Annals of Mathematical Statistics, pages 847-858, 1965.
[12] Jean Bretagnolle and Catherine Huber. Estimation des densités: risque minimax. Séminaire de probabilités de Strasbourg, 12:342-363, 1978.
[13] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711-7717, 2013.
[14] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. Online optimization in X-armed bandits. Advances in Neural Information Processing Systems, 22:201-208, 2009.
[15] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12(5):1655-1695, 2011.
[16] Sébastien Bubeck, Gilles Stoltz, and Jia Yuan Yu. Lipschitz bandits without the Lipschitz constant. In International Conference on Algorithmic Learning Theory, pages 144-158. Springer, 2011.
[17] Nicolò Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 26:1160-1168, 2013.
[18] Eric W. Cope. Regret and convergence bounds for a class of continuum-armed bandit problems. IEEE Transactions on Automatic Control, 54(6):1243-1253, 2009.
[19] Hossein Esfandiari, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. Regret bounds for batched bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7340-7348, 2021.
[20] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(6):1079-1105, 2006.
[21] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32:503-513, 2019.
[22] Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W. Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
[23] Kwang-Sung Jun, Kevin Jamieson, Robert Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In Artificial Intelligence and Statistics, pages 139-148. PMLR, 2016.
[24] Nikolai Karpov, Qin Zhang, and Yuan Zhou. Collaborative top distribution identifications with limited interaction. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 160-171. IEEE, 2020.
[25] Robert Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. Advances in Neural Information Processing Systems, 18:697-704, 2005.
[26] Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 681-690, 2008.
[27] Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: smoothing, zooming, and adapting. In Conference on Learning Theory, pages 2025-2027. PMLR, 2019.
[28] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
[29] Zihan Li and Jonathan Scarlett. Gaussian process bandit optimization with few batches. In International Conference on Artificial Intelligence and Statistics, pages 92-107. PMLR, 2022.
[30] Shiyin Lu, Guanghui Wang, Yao Hu, and Lijun Zhang. Optimal algorithms for Lipschitz bandits with heavy-tailed rewards. In International Conference on Machine Learning, pages 4154-4163, 2019.
[31] Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: regret lower bound and optimal algorithms. In Conference on Learning Theory, pages 975-999. PMLR, 2014.
[32] Vianney Perchet and Philippe Rigollet. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693-721, 2013.
[33] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. The Annals of Statistics, 44(2):660-681, 2016.
[34] Stuart J. Pocock. Group sequential methods in the design and analysis of clinical trials. Biometrika, 64(2):191-199, 1977.
[35] Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74-87, 2021.
[36] Sudeep Salgia, Sattar Vakili, and Qing Zhao. A domain-shrinking based Bayesian optimization algorithm with order-optimal regret performance. In Advances in Neural Information Processing Systems, volume 34, pages 28836-28847, 2021.
[37] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
[38] Sean Sinclair, Tianyu Wang, Gauri Jain, Siddhartha Banerjee, and Christina Yu. Adaptive discretization for model-based reinforcement learning. Advances in Neural Information Processing Systems, 33:3858-3871, 2020.
[39] Aleksandrs Slivkins. Contextual bandits with similarity information. Journal of Machine Learning Research, 15(1):2533-2568, 2014.
[40] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[41] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126-146. IEEE, 2019.
[42] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.
[43] Tianyu Wang and Cynthia Rudin. Bandits for BMO functions. In International Conference on Machine Learning, pages 9996-10006. PMLR, 2020.
[44] Tianyu Wang, Weicheng Ye, Dawei Geng, and Cynthia Rudin. Towards practical Lipschitz bandits. In Proceedings of the 2020 ACM-IMS Foundations of Data Science Conference (FODS '20), pages 129-138, New York, NY, USA, 2020. Association for Computing Machinery.
[45] Nirandika Wanigasekara and Christina Yu. Nonparametric contextual bandits in an unknown metric space. In Advances in Neural Information Processing Systems, volume 32, pages 14684-14694, 2019.
[46] Kelly Zhang, Lucas Janson, and Susan Murphy. Inference for batched bandits. Advances in Neural Information Processing Systems, 33:9818-9829, 2020.