# On Interpolating Experts and Multi-Armed Bandits

Houshuang Chen¹, Yuchen He¹, Chihao Zhang²

¹Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. ²John Hopcroft Center for Computer Science, Shanghai Jiao Tong University, Shanghai, China. Correspondence to: Chihao Zhang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Learning with expert advice and multi-armed bandit are two classic online decision problems which differ in how the information is observed in each round of the game. We study a family of problems interpolating the two. For a vector $m = (m_1, \ldots, m_K) \in \mathbb{N}^K$, an instance of $m$-MAB indicates that the arms are partitioned into $K$ groups and the $i$-th group contains $m_i$ arms. Once an arm is pulled, the losses of all arms in the same group are observed. We prove tight minimax regret bounds for $m$-MAB and design an optimal PAC algorithm for its pure exploration version, $m$-BAI, where the goal is to identify the arm with minimum loss with as few rounds as possible. We show that the minimax regret of $m$-MAB is $\Theta\bigl(\sqrt{T\sum_{k=1}^K \log(m_k+1)}\bigr)$ and the minimum number of pulls for an $(\varepsilon, 0.05)$-PAC algorithm of $m$-BAI is $\Theta\bigl(\frac{1}{\varepsilon^2}\sum_{k=1}^K \log(m_k+1)\bigr)$. Both our upper bounds and lower bounds for $m$-MAB can be extended to a more general setting, namely bandit with graph feedback, in terms of the clique cover and related graph parameters. As a consequence, we obtain tight minimax regret bounds for several families of feedback graphs.

1. Introduction

A typical family of online decision problems is as follows: in each round of the game, the player chooses one of $N$ arms to pull and, at the same time, incurs the loss of the pulled arm. The objective is to minimize the expected regret, defined as the difference between the cumulative loss of the player and that of the single best arm over $T$ rounds. The minimax regret, denoted $R^*(T)$, represents the minimum expected regret achievable by any algorithm against the worst-case loss sequence.

There are variants of the problem according to the amount of information the player can observe in each round. In the problem of multi-armed bandit (MAB), the player can only observe the loss of the arm just pulled; the minimax regret is $\Theta(\sqrt{NT})$ (Audibert & Bubeck, 2009). Another important problem is the one in which the player observes the losses of all arms in each round, often referred to as learning with expert advice; the minimax regret is $\Theta(\sqrt{T\log N})$ (Freund & Schapire, 1997; Haussler et al., 1995). Bandit with graph feedback generalizes and interpolates both models. In this model, a directed graph $G$, called the feedback graph, is given. The vertex set of $G$ is the set of arms, and a directed edge from $i$ to $j$ indicates that pulling arm $i$ reveals the loss of arm $j$. As a result, MAB corresponds to the case where $G$ consists of singletons with self-loops, and learning with expert advice corresponds to the case where $G$ is a clique. A number of recent works are devoted to understanding how the structure of $G$ affects the minimax regret (Alon et al., 2015; Chen et al., 2021; He & Zhang, 2023; Eldowa et al., 2024; Kocák & Carpentier, 2023; Rouyer et al., 2022; Dann et al., 2023).

In this paper, we consider a natural interpolation between learning with expert advice and multi-armed bandit. Let $m = (m_1, m_2, \ldots, m_K) \in \mathbb{N}^K$ be a vector with each $m_i \ge 1$.
In an instance of $m$-MAB, the $N$ arms are partitioned into $K$ groups and a pull of any arm reveals the losses of all arms in the same group. In the language of bandit with graph feedback, the feedback graph $G$ is the disjoint union of $K$ cliques of sizes $m_1, m_2, \ldots, m_K$ respectively. We show that the minimax regret of $m$-MAB is $\Theta\bigl(\sqrt{T\sum_{k\in[K]}\log(m_k+1)}\bigr)$. As a result, this generalizes the optimal regret bounds for both MAB and learning with expert advice.
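As a quick check that this single expression recovers both classical rates, one can specialize the vector $m$ to the two extreme feedback patterns (the short computation below is ours, but both endpoints match the bounds cited above):

```latex
% Full information (learning with expert advice): one group containing all N arms.
m = (N): \qquad \sqrt{T\log(N+1)} \;=\; \Theta\!\bigl(\sqrt{T\log N}\bigr)
% Bandit feedback: N singleton groups.
m = (1,\dots,1): \qquad \sqrt{T\sum_{k=1}^{N}\log 2} \;=\; \sqrt{(\log 2)\,NT} \;=\; \Theta\!\bigl(\sqrt{NT}\bigr)
```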
A closely related problem is the so-called pure exploration version of bandit, often referred to as the best arm identification (BAI) problem, where the loss of each arm follows some (unknown) distribution. The goal of the problem is to identify the arm with minimum mean loss with as few rounds as possible. Similarly, we introduce the problem of $m$-BAI with the same feedback pattern as $m$-MAB. We design an $(\varepsilon, 0.05)$-PAC algorithm for $m$-BAI which terminates in $T = O\bigl(\frac{1}{\varepsilon^2}\sum_{k\in[K]}\log(m_k+1)\bigr)$ rounds for every $\varepsilon < \frac18$. This means that after $T$ rounds of the game, with probability at least $0.95$, the algorithm outputs an arm whose mean loss is less than $\varepsilon$ plus the mean loss of the best one. We show that our algorithm is optimal by proving a matching lower bound $\Omega\bigl(\frac{1}{\varepsilon^2}\sum_{k\in[K]}\log(m_k+1)\bigr)$ for any $(\varepsilon, 0.05)$-PAC algorithm.

Both our upper bounds and lower bounds for the minimax regret of $m$-MAB can be generalized to bandit with graph feedback. To capture the underlying structure necessary for our proofs, we introduce new graph parameters which yield optimal bounds for several families of feedback graphs. The main results are summarized in Section 1.1.

Our algorithm deviates from the standard online stochastic mirror descent (OSMD) algorithm for bandit problems. We employ the two-stage OSMD developed in (He & Zhang, 2023) and give a novel analysis which yields the optimal regret bound. For the lower bound, we prove new instance-specific lower bounds for the best arm identification problem. These lower bounds may find applications in other problems. We give an overview of our techniques in Section 1.2.

1.1. Main Results

We summarize our main results in this section. Formal definitions of $m$-MAB, $m$-BAI and bandit with graph feedback are given in Section 2. All the proof details can be found in the appendix.

Theorem 1.1. There exists an algorithm such that for any instance of $(m_1, \ldots, m_K)$-MAB, any $T > 0$ and any loss sequence $\ell^{(0)}, \ell^{(1)}, \ldots, \ell^{(T-1)} \in [0,1]^N$, its regret is at most $c\sqrt{T\sum_{k=1}^K \log(m_k+1)}$, where $c > 0$ is a universal constant.

Given an instance of $m$-BAI and $\varepsilon, \delta \in (0,1)$, an $(\varepsilon, \delta)$-PAC algorithm outputs, with probability at least $1-\delta$, an arm whose mean loss is less than $\varepsilon$ plus the mean loss of the optimal one. Using a reduction from $m$-BAI to $m$-MAB (Lemma A.1), we obtain a PAC algorithm for $m$-BAI:

Theorem 1.2. There exists an $(\varepsilon, 0.05)$-PAC algorithm for $(m_1, \ldots, m_K)$-BAI which pulls at most $\frac{c}{\varepsilon^2}\sum_{k=1}^K \log(m_k+1)$ arms, where $c > 0$ is a universal constant.

Let $\mathrm{Ber}(p)$ denote the Bernoulli distribution with mean $p$. We complement the above algorithm with the following lower bound:

Theorem 1.3. There exists an instance $H$ such that for every $(\varepsilon, 0.05)$-PAC algorithm $A$ of $(m_1, \ldots, m_K)$-BAI with $\varepsilon \in (0, \frac18)$, the expected number of pulls $T$ of $A$ on $H$ satisfies $\mathbb{E}[T] \ge \frac{c}{\varepsilon^2}\sum_{k=1}^K \log(m_k+1)$, where $c > 0$ is a universal constant. Moreover, we can pick $H$ as the instance in which the loss of each arm follows $\mathrm{Ber}(\frac12)$.

Using the reduction from $m$-BAI to $m$-MAB (Lemma A.1) again, we obtain the lower bound for $m$-MAB.

Theorem 1.4. For any algorithm $A$ of $(m_1, \ldots, m_K)$-MAB and any sufficiently large $T > 0$, there exists a loss sequence $\ell^{(0)}, \ell^{(1)}, \ldots, \ell^{(T-1)}$ such that the regret of $A$ in $T$ rounds is at least $c\sqrt{T\sum_{k=1}^K\log(m_k+1)}$, where $c > 0$ is a universal constant.

Our results generalize to the setting of bandit with graph feedback. Let $G = (V, E)$ be a directed graph with a self-loop on each vertex and let $V_1, \ldots, V_K \subseteq V$ be subsets of vertices. We say that they form a $(V_1, \ldots, V_K)$-clique cover of $G$ if each induced subgraph $G[V_k]$ for $k \in [K]$ is a clique and $\bigcup_{k\in[K]} V_k = V$.

Corollary 1.5. Let $G$ be a feedback graph with a self-loop on each vertex. If $G$ admits a $(V_1, \ldots, V_K)$-clique cover with $|V_k| = m_k$ for every $k \in [K]$, then the minimax regret of bandit with feedback graph $G$ is at most $c\sqrt{T\sum_{k=1}^K\log(m_k+1)}$ for some universal constant $c > 0$.

Our lower bounds generalize to bandit with graph feedback as well. The terms strongly observable feedback graph and weakly observable feedback graph are defined in Section 2.

Theorem 1.6. Let $G = (V, E)$ be the feedback graph. Assume that there exist $K$ disjoint sets $S_1, \ldots, S_K \subseteq V$ such that each $G[S_k]$ is a strongly observable graph with a self-loop on each vertex, and there is no edge between $S_i$ and $S_j$ for any $i \ne j$. Then for any algorithm $A$ and any sufficiently large time horizon $T > 0$, there exists some loss sequence on which the regret of $A$ is at least $c\sqrt{T\sum_{k=1}^K\log(|S_k|+1)}$ for some universal constant $c > 0$.

The following lower bound for weakly observable feedback graphs confirms a conjecture in (He & Zhang, 2023) and implies the optimality of several regret bounds established there, e.g., when the feedback graph is the disjoint union of loopless complete bipartite graphs. The notion of a $t$-packing independent set is defined in Section 2.

Theorem 1.7. Let $G = (V, E)$ be the feedback graph. Assume that $V$ can be partitioned into $K$ disjoint sets $V = V_1 \cup V_2 \cup \cdots \cup V_K$ such that: for every $k \in [K]$, $G[V_k]$ is observable; for every $k \in [K]$, there exists a $t_k$-packing independent set $S_k$ in $G[V_k]$ such that every vertex in $S_k$ does not have a self-loop; and there is no edge from $V_i$ to $S_j$ for any $i \ne j$ in $G$. Then for any algorithm $A$ and any sufficiently large time horizon $T > 0$, there exists some loss sequence on which the regret of $A$ with feedback graph $G$ is at least $c\,T^{2/3}\bigl(\sum_{k=1}^K \max\{\log|S_k|,\ |S_k|/t_k\}\bigr)^{1/3}$ for some universal constant $c > 0$.

Theorem 1.7 implies tight regret lower bounds for several weakly observable graphs. We summarize the minimax regrets for some feedback graphs, weakly or strongly observable, in Table 1.

1.2. Overview of Technique & Contribution

We note that a simple reduction (Lemma A.1) implies that any algorithm for $m$-MAB can be turned into a PAC algorithm for $m$-BAI. As a result, Theorems 1.1 to 1.4 follow from a minimax regret upper bound for $m$-MAB and a lower bound for $m$-BAI.
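The reduction is made precise in Lemma A.1 and Appendix A: run an $m$-MAB algorithm for $T$ rounds and output an arm drawn with probability proportional to its pull count, where $T$ is chosen so that $T = 2500\,\max_L r(T, L)/\varepsilon$ for a regret bound $r$. The following self-contained Python sketch illustrates the idea on a toy stochastic instance; the instance `p`, the placeholder uniform strategy `select`, and all other names are ours for illustration (the paper would run Algorithm 1 instead), and `T` is fixed arbitrarily here rather than by the equation above.

```python
import numpy as np

rng = np.random.default_rng(0)
m = [2, 3]                                               # group sizes (m_1, m_2)
p = [np.array([0.5, 0.4]), np.array([0.5, 0.5, 0.3])]    # mean losses; the last arm of group 2 is best

def pull(k, j):
    """Stochastic m-BAI instance: pulling arm (k, j) reveals the losses of the whole group k."""
    losses = rng.binomial(1, p[k]).astype(float)
    return losses[j], losses

def select(t):
    """Placeholder m-MAB strategy (uniform play); a real run would use Algorithm 1."""
    k = int(rng.integers(len(m)))
    return k, int(rng.integers(m[k]))

# Lemma A.1 reduction: run the m-MAB algorithm for T rounds, then output an arm
# drawn with probability proportional to its pull count T_i / T.
T = 10_000
counts = {(k, j): 0 for k in range(len(m)) for j in range(m[k])}
for t in range(T):
    k, j = select(t)
    _loss, _group_losses = pull(k, j)     # a real strategy would learn from these observations
    counts[(k, j)] += 1
arms = list(counts)
weights = np.array([counts[a] for a in arms], dtype=float) / T
candidate = arms[rng.choice(len(arms), p=weights)]
print("candidate arm (group, index):", candidate)
```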
1.2.1. UPPER BOUNDS FOR m-MAB

We design a new two-stage algorithm (Algorithm 1) to establish the upper bound for $m$-MAB. The algorithm is similar to the one used in (He & Zhang, 2023) to study weakly observable graphs, with a few tweaks to accommodate our new analysis. The algorithm maintains a distribution $Y^{(t)}$ over the $K$ groups and, for each group $k \in [K]$, a distribution $X^{(t)}_k$ over the arms in that group. In each round of the game, the algorithm pulls an arm in a two-stage manner: first pick a group according to the distribution over groups, and then pick an arm in that group following the distribution within the group. At the end of each round, all distributions are updated in a manner similar to online stochastic mirror descent (OSMD), with carefully designed loss vectors and various potential functions. Each group can be viewed as a super arm: $Y^{(t)}$ is updated with the corresponding group losses $\widehat L^{(t)}(k)$ for $k \in [K]$, while each $X^{(t)}_k$ is updated with the corresponding loss estimator $\hat\ell^{(t)}_k$ within group $k$.

Our main technical contribution is a novel analysis of this two-stage algorithm. We design auxiliary two-stage piecewise continuous processes whose regret is relatively easy to analyze. We then view our algorithm as a discretization of the process and bound the accumulated discretization errors. This new analysis is the key to the tight regret bound. If we apply the classical analysis of OSMD to the two-stage algorithm, as done in (He & Zhang, 2023), the regret decomposes into two parts: (1) the regret due to choosing the group and (2) the regret due to playing within the optimal group $k^*$. Let $R^{(t)}(\ell)$ denote the instant regret at round $t$ for loss sequence $\ell$. That is,
$$R^{(t)}(\ell) \le O\bigl(R^{(t)}(\widehat L) + R^{(t)}(\hat\ell_{k^*})\bigr).$$
The first part is easy to bound, while the second part is challenging because it contains a factor $\frac{1}{Y^{(t)}(k^*)}$ (the inverse of the probability of choosing the optimal group at each round), which is usually hard to bound. With our new analysis of OSMD, however, the second part can be improved to the expectation of the regret over the groups:
$$R^{(t)}(\ell) \le O\Bigl(R^{(t)}(\widehat L) + \sum_{k\in[K]} Y^{(t)}(k)\, R^{(t)}(\hat\ell_k)\Bigr).$$
Technically, the $Y^{(t)}(k)$ term eliminates the $\frac{1}{Y^{(t)}(k)}$ factor, which makes the optimal bound possible (see Lemma B.4 for details).

Since the notion of $m$-MAB generalizes both learning with expert advice and multi-armed bandit, we remark that our analysis of Algorithm 1 specializes to an analysis of both the ordinary mirror descent (MD) algorithm and the OSMD algorithm. We believe that the viewpoint of discretizing a piecewise continuous process is more intuitive than the textbook analysis of OSMD and may be of independent pedagogical interest.

Table 1. Minimax regret bounds on various feedback graphs.⁴

| Graph type | Previous result | This work |
| --- | --- | --- |
| General strongly observable graphs with self-loops | $O\bigl(\sqrt{\alpha T\log N}\bigr)$ ¹ | $O\bigl(\sqrt{T\sum_{k=1}^K\log m_k}\bigr)$ ²; see Theorem 1.6 for the lower bound |
| Disjoint union of $K$ cliques | | $O\bigl(\sqrt{T\sum_{k=1}^K\log m_k}\bigr)$ ² |
| General weakly observable graphs | $\Omega\bigl(T^{2/3}\max\{|S|/t,\ \log|S|\}^{1/3}\bigr)$ ³ | $\Omega\bigl(T^{2/3}\bigl(\sum_{k=1}^K\max\{\log|S_k|,\ |S_k|/t_k\}\bigr)^{1/3}\bigr)$ ³ |
| Disjoint union of $K$ loopless complete bipartite graphs | $\Omega\bigl(T^{2/3}(\log N)^{1/3}\bigr)$ | $\Omega\bigl(T^{2/3}\bigl(\sum_{k=1}^K\log m_k\bigr)^{1/3}\bigr)$ |

¹ Here $\alpha$ is the independence number of the graph.
² Here $K$ is the clique cover number of the graph and $m_1, m_2, \ldots, m_K$ are the sizes of the $K$ cliques respectively.
³ Here $S$ is a $t$-packing independent set of the graph; $S_k$ and $t_k$ are defined in Theorem 1.7.
⁴ Previous results are from (Alon et al., 2015), (Alon et al., 2017) and (Chen et al., 2021).

1.2.2. LOWER BOUNDS FOR m-BAI

Our lower bound on the number of rounds of an $(\varepsilon, 0.05)$-PAC algorithm for $m$-BAI with $m = (m_1, \ldots, m_K)$ is $\Omega\bigl(\frac{1}{\varepsilon^2}\sum_{k=1}^K\log(m_k+1)\bigr)$, which is the sum of lower bounds on each $(m_k)$-BAI instance. To achieve this, we show that the instance where all arms are $\mathrm{Ber}(\frac12)$ is in fact a universal hard instance, in the sense that every $(\varepsilon, 0.05)$-PAC algorithm requires $\Omega\bigl(\sum_{k=1}^K\frac{\log(m_k+1)}{\varepsilon^2}\bigr)$ rounds to identify it. Via a reduction of direct-sum flavor, we show that every $(\varepsilon, 0.05)$-PAC algorithm, when applied to this instance, must successfully identify that each group consists of $\mathrm{Ber}(\frac12)$ arms.
As a result, the lower bound is the sum of the lower bounds for each all Ber( 1 2) (mk)-BAI instance. We then prove the lower bound for all Ber( 1 2) (m)-BAI instance for every m 2. We use H (m) 0 to denote this instance. The H (m) 0 specified lower bound is obtained by constructing another m instances H (m) 1 , . . . , H (m) m and compare the distribution of losses generated by H (m) 0 and the distribution of losses generated by a mixture of H (m) 1 , . . . , H (m) m . For technical reasons, we first prove the lower bound when all arms are Gaussian and reduce the Gaussian arms to Bernoulli arms. 1.3. Related Works The bandit feedback setting as an online decision problem has received considerable attention. The work of (Audibert & Bubeck, 2009) first provided a tight bound for the bandit feedback setting, while the full information feedback case has been well studied in (Freund & Schapire, 1997; Haussler et al., 1995). Building upon these works, (Mannor & Shamir, 2011) introduced an interpolation between these two extremes and generalized the feedback of the classic bandit problem to a graph structure. Several prior studies, such as (Alon et al., 2015; Zimmert & Lattimore, 2019; Chen et al., 2021; He & Zhang, 2023), have proposed various graph parameters to characterize the factors that influence regret. However, the algorithms proposed in these works for more general graphs do not yield a tight bound in our specific setting. The pure exploration version of the bandit problem, known as the best arm identification (BAI) problem, has also received significant attention in the literature (Even-Dar et al., 2002; Mannor & Tsitsiklis, 2004; Bubeck et al., 2009; Audibert et al., 2010; Karnin et al., 2013; Chen et al., 2017). While the BAI problem may appear deceptively simple, determining the precise bound for BAI under the bandit feedback setting remains an open question. However, for the problem of identifying an ε-optimal arm with high probability, (Even-Dar et al., 2002) established a tight bound for the bandit feedback setting, while the bound for the full feedback model is relatively straightforward (see e.g. Chen et al. (2021)). 1.3.1. COMPARISON WITH PREVIOUS WORK The very recent work of (Eldowa et al., 2024) studied interpolation of learning with experts and multi-armed bandit as well from a different perspective. They proved an O p Tα(1 + log (N/α)) upper bound for the minimax regret of bandit with strongly feedback graph G where α is the independence number of G. The parameter is in general not comparable with clique covers used in this work for feedback graphs. Particularly on an m-MAB instance where m = (m1, . . . , m K), the independence number is K and therefore their upper bound becomes to O p TK log(N/K) while our results showed that the minimax regret is indeed Θ q T PK k=1 log(mk + 1) . On Interpolating Experts and Multi-Armed Bandits Another work (Foster et al., 2020) using the idea of the twostage algorithm bears similarity to ours. But applying their algorithm and analysis to our problem can only derive an upper bound of O p TK maxk [K] log mk , which is also suboptimal. To see the difference, assume K = log N and m = (1, 1, . . . , 1, N K +1), then the minimax regret is Θ T log N while the upper bounds in both (Eldowa et al., 2024) and (Foster et al., 2020) are O T log N . We believe that the algorithms and the analysis of previous work cannot achieve the same bound in our setting without significant extra effort. 2. 
Preliminaries

In this section, we formally define the notation used and introduce some background that will help in understanding this work.

2.1. Mathematical Notations

Let $n$ be a non-negative integer. We use $[n]$ to denote the set $\{1, 2, \ldots, n\}$ and $\Delta_{n-1} = \{x \in \mathbb{R}^n_{\ge 0} : \sum_{i=1}^n x(i) = 1\}$ to denote the $(n-1)$-dimensional standard simplex, where $\mathbb{R}_{\ge 0}$ is the set of all non-negative real numbers. For a real vector $x \in \mathbb{R}^n$, the $i$-th entry of $x$ is denoted $x(i)$ for every $i \in [n]$. We define $e^{[n]}_i$ as the indicator vector of the $i$-th coordinate, so that $e^{[n]}_i(i) = 1$ and $e^{[n]}_i(j) = 0$ for all $j \ne i$, $j \in [n]$. We may write $e^{[n]}_i$ as $e_i$ when $n$ is clear from the context. Given two vectors $x, y \in \mathbb{R}^n$, we define their inner product as $\langle x, y\rangle = \sum_{i=1}^n x(i)y(i)$. For any $a, b \in \mathbb{R}$, let $[a, b] = \{c \in \mathbb{R} \mid \min\{a,b\} \le c \le \max\{a,b\}\}$ be the interval between $a$ and $b$. For any $x, y \in \mathbb{R}^n$, we say $y \preceq x$ if $y(i) \le x(i)$ for every $i \in [n]$. We can then define the rectangle formed by $x$ and $y$: $\mathrm{Rect}(x, y) = \{z \in \mathbb{R}^n : y \preceq z \preceq x\}$. For any positive semi-definite matrix $M \in \mathbb{R}^{n\times n}$, let $\|x\|_M = \sqrt{x^{\mathsf T} M x}$ be the norm of $x$ with respect to $M$. Specifically, we abbreviate $\|x\|_{(\nabla^2\psi)^{-1}}$ as $\|x\|_{\nabla^{-2}\psi}$, where $\nabla^2\psi$ is the Hessian matrix of a convex function $\psi$.

Let $F : \mathbb{R}^n \to \mathbb{R}$ be a convex function which is differentiable on its domain $\mathrm{dom}(F)$. Given $x, y \in \mathrm{dom}(F)$, the Bregman divergence with respect to $F$ is defined as $B_F(x, y) = F(x) - F(y) - \langle x - y, \nabla F(y)\rangle$. Given two measures $P_1$ and $P_2$ on the same measurable space $(\Omega, \mathcal F)$, the KL-divergence between $P_1$ and $P_2$ is defined as $D_{\mathrm{KL}}(P_1, P_2) = \sum_{\omega\in\Omega} P_1[\omega]\log\frac{P_1[\omega]}{P_2[\omega]}$ if $\Omega$ is discrete, or $D_{\mathrm{KL}}(P_1, P_2) = \int_\Omega \log\frac{\mathrm d P_1}{\mathrm d P_2}[\omega]\,\mathrm d P_1[\omega]$ if $\Omega$ is continuous, provided $P_1$ is absolutely continuous with respect to $P_2$.

2.2. Graph Theory

Let $G = (V, E)$ be a directed graph with $|V| = N$. We use $(u, v)$ to denote the directed edge from vertex $u$ to vertex $v$. For any $U \subseteq V$, we denote by $G[U]$ the subgraph induced by $U$. For $v \in V$, let $N_{\mathrm{in}}(v) := \{u \in V : (u, v)\in E\}$ be the set of in-neighbors of $v$ and $N_{\mathrm{out}}(v) := \{u \in V : (v, u)\in E\}$ be the set of out-neighbors. If the graph is undirected, we have $N_{\mathrm{in}}(v) = N_{\mathrm{out}}(v)$, and we write $N(v)$ for the neighbors for brevity. We say $S \subseteq V$ is an independent set of $G$ if for every $v \in S$, $\{u \in S \mid u \ne v,\ u \in N_{\mathrm{in}}(v)\cup N_{\mathrm{out}}(v)\} = \emptyset$. The maximum size of an independent set of $G$ is the independence number, denoted $\alpha(G)$ and abbreviated as $\alpha$ when $G$ is clear from the context. Furthermore, we say an independent set $S$ is a $t$-packing independent set if and only if for every $v \in V$ there are at most $t$ out-neighbors of $v$ in $S$, i.e., $|N_{\mathrm{out}}(v)\cap S| \le t$. We say the subsets $V_1, \ldots, V_K \subseteq V$ form a $(V_1, \ldots, V_K)$-clique cover of $G$ if each induced subgraph $G[V_k]$ for $k \in [K]$ is a clique and $\bigcup_{k\in[K]} V_k = V$.

2.3. m-MAB and m-BAI

Let $K > 0$ be an integer. Given a vector $m = (m_1, m_2, \ldots, m_K) \in \mathbb{Z}^K_{\ge 1}$ with $\sum_{k\in[K]} m_k = N$, we now define the problems $m$-MAB and $m$-BAI respectively.

2.3.1. m-MAB

In the problem of $m$-MAB, there are $N$ arms. The arms are partitioned into $K$ groups and the $k$-th group contains $m_k$ arms. Let $T \in \mathbb{N}$ be the time horizon. Then $m$-MAB is the following online decision game, which proceeds in $T$ rounds. At each round $t = 0, 1, \ldots, T-1$: the player pulls an arm $A_t \in [N]$; the adversary chooses a loss function $\ell^{(t)} \in [0,1]^N$; the player incurs the loss $\ell^{(t)}(A_t)$ and observes the losses of all arms in the group containing $A_t$. Clearly the vector $m$ encodes the amount of information the player can observe in each round. The two extremes are the problem of learning with expert advice and multi-armed bandit, which correspond to $(N)$-MAB and $(1, \ldots, 1)$-MAB respectively.
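To make the feedback pattern concrete, here is a minimal Python sketch of this interaction protocol. The class `MMABGame`, the `UniformPolicy` placeholder and the constant loss function are our own illustrative choices, not part of the paper; in particular, the paper allows the adversary to choose an arbitrary loss sequence, which the `loss_fn` callback is meant to model.

```python
import numpy as np

class MMABGame:
    """Toy m-MAB protocol: pulling arm (k, j) reveals the losses of every arm in group k."""
    def __init__(self, m, loss_fn):
        self.m = list(m)            # group sizes (m_1, ..., m_K)
        self.loss_fn = loss_fn      # loss_fn(t) -> list of K arrays with entries in [0, 1]

    def play(self, T, policy):
        total = 0.0
        for t in range(T):
            losses = self.loss_fn(t)          # adversary's full loss vector, hidden from the player
            k, j = policy(t)                  # player picks arm (k, j)
            total += losses[k][j]             # player incurs its loss ...
            policy.observe(k, losses[k])      # ... and observes the whole group k
        return total

class UniformPolicy:
    """Placeholder player that pulls a uniformly random arm and ignores the feedback."""
    def __init__(self, m, seed=0):
        self.m, self.rng = list(m), np.random.default_rng(seed)
    def __call__(self, t):
        k = int(self.rng.integers(len(self.m)))
        return k, int(self.rng.integers(self.m[k]))
    def observe(self, k, group_losses):
        pass

m = [1, 1, 4]   # a (1, 1, 4)-MAB instance: two singleton groups and one group of four arms
game = MMABGame(m, lambda t: [np.full(mk, 0.5) for mk in m])
print(game.play(T=5, policy=UniformPolicy(m)))
```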
We assume the player knows $m$ and $T$ in advance and use $A$ to denote the player's algorithm (which can be viewed as a function mapping the previously observed information and the value of its own random seeds to the arm pulled at each round). The performance of the algorithm $A$ is measured by the notion of regret. Fix a loss sequence $L = (\ell^{(0)}, \ldots, \ell^{(T-1)})$. Let $a^* = \arg\min_{a\in[N]}\sum_{t=0}^{T-1}\ell^{(t)}(a)$ be the arm with minimum accumulated loss. The regret of the algorithm $A$ and time horizon $T$ on $L$ with respect to an arm $a$ is defined as
$$R_a(T, A, L) = \mathbb{E}\Bigl[\sum_{t=0}^{T-1}\ell^{(t)}(A_t)\Bigr] - \sum_{t=0}^{T-1}\ell^{(t)}(a).$$
If there is no ambiguity, we abbreviate $R_a(T, A, L)$ as $R_a(T)$. We also use $R(T)$ to denote $R_{a^*}(T)$. We are interested in the regret of the best algorithm against the worst adversary, namely the quantity
$$R^*_a(T) = \inf_A \sup_L R_a(T, A, L).$$
We call $R^*_{a^*}(T)$ the minimax regret of $m$-MAB and usually write it as $R^*(T)$.

We may use the following two ways to name an arm in $m$-MAB: use the pair $(k, j)$, where $k \in [K]$ and $j \in [m_k]$, to denote the $j$-th arm in the $k$-th group; or use a global index $i \in [N]$ to denote the $i$-th arm. Following this convention, we use $\ell^{(t)}(i)$ and $\ell^{(t)}_k(j)$ to denote the loss of arm $i$ and of arm $(k, j)$ at round $t$ respectively.

2.3.2. BEST ARM IDENTIFICATION AND m-BAI

The best arm identification (BAI) problem asks the player to identify the best arm among $N$ given arms with as few pulls as possible. Specifically, each arm $i$ is associated with a parameter $p_i$, and each pull of arm $i$ gives an observation of its random loss, which is drawn independently from a fixed distribution with mean $p_i$. The loss of each arm is restricted to lie in $[0, 1]$. The arm with the smallest $p_i$, indexed by $i^*$, is regarded as the best arm. An arm $j$ is called an $\varepsilon$-optimal arm if its mean is less than the mean of the best arm plus $\varepsilon$ for some $\varepsilon \in (0, 1)$, namely $p_j < p_{i^*} + \varepsilon$. For fixed $\varepsilon, \delta > 0$, an $(\varepsilon, \delta)$-probably approximately correct algorithm, or $(\varepsilon, \delta)$-PAC algorithm for short, finds an $\varepsilon$-optimal arm with probability at least $1 - \delta$. In most parts of this paper, we choose $\delta = 0.05$. For an algorithm $A$ of BAI, we usually use $T$ to denote the number of arms $A$ pulls before termination. Similarly, for any arm $i$, we use $T_i$ to denote the number of times that arm $i$ has been pulled by $A$ before its termination. We also use $N_i$ to denote the number of times that arm $i$ has been observed by $A$.

Let $m = (m_1, m_2, \ldots, m_K) \in \mathbb{Z}^K_{\ge 1}$ be a vector. As in $m$-MAB, the arms are partitioned into $K$ groups and the $k$-th group consists of $m_k$ arms. Each pull of an arm reveals the losses of all arms in its group. As usual, the goal is to identify the best arm (the one with minimum $p_i$) with as few rounds as possible. As in $m$-MAB, we use $i \in [N]$ or $(k, j)$ with $k \in [K]$ and $j \in [m_k]$ to name an arm. For a fixed algorithm, we use $T_i$ or $T_{(k,j)}$ to denote the number of times the respective arm has been pulled, and $N_i$ or $N_{(k,j)}$ to denote the number of times it has been observed. For every $k \in [K]$ we use $T^{(k)}$ to denote the number of times the arms in the $k$-th group have been pulled, namely $T^{(k)} = \sum_{j\in[m_k]} T_{(k,j)}$. By definition, it holds that $T = \sum_{k\in[K]} T^{(k)}$ and $N_{(k,j)} = T^{(k)}$ for every $j \in [m_k]$.

2.4. Bandit with Graph Feedback

A more general way to encode the observability of arms is to use feedback graphs. In this problem, a directed graph $G = (V, E)$ is given. The vertex set $V = [N]$ is the collection of all arms. The game proceeds in the same way as $m$-MAB.
The only difference is that when an arm $A_t$ is pulled by the player at a certain round, all arms in $N_{\mathrm{out}}(A_t)$ can be observed. As a result, given a vector $m = (m_1, m_2, \ldots, m_K) \in \mathbb{Z}^K_{\ge 1}$, the $m$-MAB problem is identical to bandit with graph feedback $G = (V, E)$ where $G$ is the disjoint union of $K$ cliques $G_1 = (V_1, E_1), G_2 = (V_2, E_2), \ldots, G_K = (V_K, E_K)$ with $m_k = |V_k|$ and $E_k = V_k^2$ for every $k \in [K]$.

Following (Alon et al., 2015), we measure the observability of each vertex in terms of its in-neighbors. If a vertex has no in-neighbor, we call it a non-observable vertex; otherwise it is observable. If a vertex $v$ has a self-loop or $N_{\mathrm{in}}(v)$ equals exactly $V \setminus \{v\}$, then $v$ is strongly observable. An observable vertex that is not strongly observable is weakly observable. In this work, we assume each vertex is observable. If all vertices are strongly observable, the graph $G$ is called a strongly observable graph. If $G$ contains weakly observable vertices (and does not have non-observable ones), we say $G$ is a weakly observable graph.

We can also define the notion of regret for bandit with graph feedback. With the notation above, the regret of an algorithm $A$ with feedback graph $G$ and time horizon $T$ on a loss sequence $L$ with respect to an arm $a$ is defined as
$$R_a(G, T, A, L) = \mathbb{E}\Bigl[\sum_{t=0}^{T-1}\ell^{(t)}(A_t)\Bigr] - \sum_{t=0}^{T-1}\ell^{(t)}(a).$$
If there is no ambiguity, we abbreviate $R_a(G, T, A, L)$ as $R_a(G, T)$ or $R_a(T)$. We also use $R(T)$ to denote $R_{a^*}(T)$. The minimax regret is again
$$R^*_a(G, T) = \inf_A \sup_L R_a(G, T, A, L).$$
When $G$ is clear from the context, we write it as $R^*(T)$.

3. The Upper Bounds

In this section, we prove Theorem 1.1 and Theorem 1.2. We describe the algorithm for $m$-MAB in Section 3.1 and analyze it in Section 3.2. The algorithm for $m$-BAI is obtained by a reduction to $m$-MAB described in Appendix A. Finally, we discuss how to extend the algorithm to bandit with strongly observable feedback graphs and prove Corollary 1.5 in Appendix B.3.

3.1. The Algorithm

As discussed in the introduction, our algorithm follows the framework of the two-stage online stochastic mirror descent developed in (He & Zhang, 2023). However, our updating rules are slightly different from those in (He & Zhang, 2023) in order to incorporate our new analysis.

Given a $K$-dimensional vector $m = (m_1, \ldots, m_K)$ as input, in each round $t$ the algorithm proceeds in the following two-stage manner: a distribution $Y^{(t)}$ over $[K]$ is maintained, indicating which group of arms the algorithm is going to pick; for each $k \in [K]$, a distribution $X^{(t)}_k$ is maintained, indicating which arm in the $k$-th group the algorithm will pick conditioned on the $k$-th group being picked in the first stage. The algorithm then picks the $j$-th arm in the $k$-th group with probability $Y^{(t)}(k)\cdot X^{(t)}_k(j)$.

The algorithm is described in Algorithm 1, and we explain each step below. Assuming $Y^{(0)}$ and $X^{(0)}_k$ for all $k \in [K]$ are properly initialized, in each time step $t = 0, 1, \ldots, T-1$ the player repeats the following operations:

Sampling: For each arm $(k, j)$, the algorithm pulls it with probability $Z^{(t)}(k, j) = Y^{(t)}(k)\cdot X^{(t)}_k(j)$. The arm pulled at this round is denoted $A_t = (k_t, j_t)$. Our algorithm guarantees that $Z^{(t)}$ is a distribution over all arms.

Observing: Observe the partial losses $\ell^{(t)}_{k_t}(j)$ for all $j \in [m_{k_t}]$.

Estimating: For each arm $(k, j)$, define the unbiased estimator $\hat\ell^{(t)}_k(j) = \frac{\mathbb{1}[k = k_t]}{\Pr[k = k_t]}\,\ell^{(t)}_k(j)$. It is clear that $\mathbb{E}\bigl[\hat\ell^{(t)}_k(j)\bigr] = \ell^{(t)}_k(j)$.
Updating: For each $k \in [K]$, update $X^{(t)}_k$ in the manner of standard OSMD:
$$\nabla\phi_k\bigl(\widetilde X^{(t+1)}_k\bigr) = \nabla\phi_k\bigl(X^{(t)}_k\bigr) - \hat\ell^{(t)}_k, \qquad X^{(t+1)}_k = \arg\min_{x\in\Delta_{m_k-1}} B_{\phi_k}\bigl(x, \widetilde X^{(t+1)}_k\bigr),$$
where $\phi_k(x) = \eta_k^{-1}\sum_i x(i)\log x(i)$ is the negative entropy scaled by the learning rate $\eta_k$. Define $\widetilde Y^{(t+1)}$ in the way that, for every $k \in [K]$,
$$\frac{1}{\sqrt{\widetilde Y^{(t+1)}(k)}} = \frac{1}{\sqrt{Y^{(t)}(k)}} + \frac{\eta}{\eta_k}\sum_{j\in[m_k]} X^{(t)}_k(j)\Bigl(1 - \exp\bigl(-\eta_k\,\hat\ell^{(t)}_k(j)\bigr)\Bigr), \qquad (1)$$
where $\eta$ is the learning rate. Then let $Y^{(t+1)}$ be the projection of $\widetilde Y^{(t+1)}$ onto $\Delta_{K-1}$:
$$Y^{(t+1)} = \arg\min_{y\in\Delta_{K-1}} B_\psi\bigl(y, \widetilde Y^{(t+1)}\bigr),$$
where $\psi(y) = -2\sum_i \sqrt{y(i)}$ for any $y = (y(1), \ldots, y(K))$, referred to as Tsallis entropy in the literature.

Note that $1 - \exp(-x) \approx x$ when $x$ is small. So when $\eta_k$ is small (and it is), the updating rule is approximately, for every $k \in [K]$,
$$\frac{1}{\sqrt{\widetilde Y^{(t+1)}(k)}} = \frac{1}{\sqrt{Y^{(t)}(k)}} + \eta\sum_{j\in[m_k]} X^{(t)}_k(j)\,\hat\ell^{(t)}_k(j),$$
which is equivalent to $\nabla\psi\bigl(\widetilde Y^{(t+1)}\bigr) = \nabla\psi\bigl(Y^{(t)}\bigr) - \eta\,\widehat L^{(t)}$, where $\widehat L^{(t)} = \bigl(\widehat L^{(t)}(1), \ldots, \widehat L^{(t)}(K)\bigr) \in \mathbb{R}^K$ satisfies $\widehat L^{(t)}(k) = \sum_{j\in[m_k]} X^{(t)}_k(j)\,\hat\ell^{(t)}_k(j)$. One can think of $\widehat L^{(t)}(k)$ as the average loss of the arms in the $k$-th group at round $t$. Nevertheless, we use the rule in Equation (1) in the algorithm to guarantee the result in Lemma B.1, since it is convenient for our later analysis. In the realization of Algorithm 1, we choose $\eta = \frac{1}{\sqrt T}$ and $\eta_k = \sqrt{\frac{\log(m_k+1)}{T\sum_{k'=1}^K \log(m_{k'}+1)}}$.

3.2. Regret Bound for MAB

We prove the following theorem, which implies Theorem 1.1.

Theorem 3.1. For every $T > 0$ and every loss sequence $\ell^{(0)}, \ldots, \ell^{(T-1)} \in [0,1]^N$, the regret of Algorithm 1 satisfies
$$R(T) \le O\Bigl(\sqrt{T\textstyle\sum_{k=1}^K \log(m_k+1)}\Bigr).$$
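The following is a compact NumPy sketch of one run of the procedure just described, assembled by us from the description above rather than taken from the paper. It uses the stated choices of $\eta$ and $\eta_k$, implements the Bregman projection for the Tsallis potential by bisection over the Lagrange multiplier, and assumes a `loss_fn` callback returning the (possibly adversarial) loss vectors of all groups at round $t$; all function names are ours.

```python
import numpy as np

def tsallis_projection(y_tilde):
    """Bregman projection of y_tilde (entrywise positive) onto the simplex w.r.t.
    psi(y) = -2 * sum(sqrt(y)).  The KKT conditions give y(k) = (1/sqrt(y_tilde(k)) + lam)^(-2);
    we bisect over lam so that the entries sum to one."""
    inv_sqrt = 1.0 / np.sqrt(y_tilde)
    lo, hi = -inv_sqrt.min() + 1e-15, 1.0
    while np.sum((inv_sqrt + hi) ** -2) > 1.0:
        hi *= 2.0
    for _ in range(200):                      # plain bisection is accurate enough for a sketch
        mid = 0.5 * (lo + hi)
        if np.sum((inv_sqrt + mid) ** -2) > 1.0:
            lo = mid
        else:
            hi = mid
    y = (inv_sqrt + hi) ** -2
    return y / y.sum()

def two_stage_osmd(m, T, loss_fn, seed=0):
    """Sketch of the two-stage algorithm: Y is the distribution over groups,
    X[k] the distribution over arms inside group k; loss_fn(t)[k][j] lies in [0, 1]."""
    rng = np.random.default_rng(seed)
    K = len(m)
    log_terms = np.log(np.asarray(m, dtype=float) + 1.0)
    eta = 1.0 / np.sqrt(T)
    eta_k = np.sqrt(log_terms / (T * log_terms.sum()))
    Y = np.full(K, 1.0 / K)
    X = [np.full(mk, 1.0 / mk) for mk in m]
    total_loss = 0.0
    for t in range(T):
        losses = loss_fn(t)
        kt = rng.choice(K, p=Y)                      # stage 1: pick a group
        jt = rng.choice(m[kt], p=X[kt])              # stage 2: pick an arm in that group
        total_loss += losses[kt][jt]
        Y_tilde = Y.copy()
        for k in range(K):
            # importance-weighted estimator: only the pulled group is observed
            ell_hat = losses[k] / Y[k] if k == kt else np.zeros(m[k])
            # outer pre-projection update, cf. Equation (1)
            inv_sqrt_new = 1.0 / np.sqrt(Y[k]) + (eta / eta_k[k]) * np.sum(
                X[k] * (1.0 - np.exp(-eta_k[k] * ell_hat)))
            Y_tilde[k] = inv_sqrt_new ** -2
            # inner update: negative-entropy OSMD = multiplicative weights + renormalisation
            Xk = X[k] * np.exp(-eta_k[k] * ell_hat)
            X[k] = Xk / Xk.sum()
        Y = tsallis_projection(Y_tilde)
    return total_loss

# toy run on a (2, 3)-MAB instance with a fixed loss vector in every round
m = [2, 3]
fixed = [np.array([0.9, 0.9]), np.array([0.9, 0.9, 0.1])]
print(two_stage_osmd(m, T=2000, loss_fn=lambda t: fixed))
```

On the toy loss sequence above, the play should concentrate on the arm with loss 0.1, so the cumulative loss stays close to $0.1\,T$ plus lower-order terms, in line with Theorem 3.1.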
For every (ε, 0.05)-PAC algorithm of m-BAI, we have EH m 0 T (k) c1 log(mk+1) ε2 for every k [K] with mk 2 and EH m 0 [T] PK k=1 c1 log(mk+1) 2ε2 if the total number of arms PK k=1 mk 2, where c1 is the constant in Lemma 4.1, T (k) is the number of rounds that arms in group k is played and T is the total pull times. Moreover, these lower bounds still hold even the algorithm can identify the ε-optimal arm with probability 0.95 only when the input arms have losses drawn from either Ber 1 Now let us fix m = (m1, . . . , m K). We then derive a regret lower bound for m-MAB and thus prove Theorem 1.4 using Lemmas 4.1 and A.1. Lemma 4.3. For any algorithm A of (m1, . . . , mk)-MAB, for any sufficiently large T > 0, there exists H H such that the expected regret of A satisfies EH [R(T)] c k=1 log(mk + 1) where c > 0 is a universal constant. Here the expectation is taken over the randomness of losses which are drawn from H independently in each round. We can,of course, apply our more powerful Lemma 4.1 to a broader class of graphs, thus obtaining improved lower bounds for both strongly and weakly observable graphs. 5. Conclusion In this study, we delve into the m-MAB and m-BAI, and reduce the the latter to the former. We propose a two stage algorithm for m-MAB and prove a lower bound for m-BAI, thereby providing both problems with tight bounds. Furthermore, we utilize the bound proven for BAI to more general graphs and yield some improved lower bounds. The technique developed in the upper bound is more intuitive than standard methods. The proof of the lower bound for BAI reveals that to address the failure error of m arms, calculations must consider them together, such as using mixture distribution, instead of assuming m failures to distinguish one distribution from all others one by one to construct a contradiction. We believe our approach may inspire others to enhance the logarithmic factor in other problems. Impact Statement This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Acknowledgments This work was supported by CCF-Huawei Populus Grove Fund CCF-Huawei LK2022007, and the NSFC No. On Interpolating Experts and Multi-Armed Bandits Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pp. 23 35. PMLR, 2015. Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785 1826, 2017. Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In COLT, volume 7, pp. 1 122, 2009. Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT, pp. 41 53, 2010. Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory: 20th International Conference, ALT 2009, Porto, Portugal, October 3-5, 2009. Proceedings 20, pp. 23 37. Springer, 2009. Chen, H., Huang, Z., Li, S., and Zhang, C. Understanding bandits with graph feedback. Advances in Neural Information Processing Systems, 34:24659 24669, 2021. Chen, L., Li, J., and Qiao, M. Towards instance optimal bounds for best arm identification. In Conference on Learning Theory, pp. 535 592. PMLR, 2017. Dann, C., Wei, C.-Y., and Zimmert, J. 
A blackbox approach to best of both worlds in bandits and beyond. In The Thirty Sixth Annual Conference on Learning Theory, pp. 5503 5570. PMLR, 2023. Eldowa, K., Esposito, E., Cesari, T., and Cesa-Bianchi, N. On the minimax regret for online learning with feedback graphs. Advances in Neural Information Processing Systems, 36, 2024. Erez, L. and Koren, T. Towards best-of-all-worlds online learning with feedback graphs. Advances in Neural Information Processing Systems, 34:28511 28521, 2021. Even-Dar, E., Mannor, S., and Mansour, Y. Pac bounds for multi-armed bandit and markov decision processes. In COLT, volume 2, pp. 255 270. Springer, 2002. Foster, D. J., Gentile, C., Mohri, M., and Zimmert, J. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33:11478 11489, 2020. Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 139, 1997. Haussler, D., Kivinen, J., and Warmuth, M. K. Tight worstcase loss bounds for predicting with expert advice. In European Conference on Computational Learning Theory, pp. 69 83. Springer, 1995. He, Y. and Zhang, C. Improved algorithms for bandit with graph feedback via regret decomposition. Theoretical Computer Science, 979:114200, 2023. Karnin, Z., Koren, T., and Somekh, O. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238 1246. PMLR, 2013. Koc ak, T. and Carpentier, A. Online learning with feedback graphs: The true shape of regret. In International Conference on Machine Learning, pp. 17260 17282. PMLR, 2023. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Mannor, S. and Shamir, O. From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems, 24, 2011. Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623 648, 2004. Rouyer, C., van der Hoeven, D., Cesa-Bianchi, N., and Seldin, Y. A near-optimal best-of-both-worlds algorithm for online learning with feedback graphs. Advances in Neural Information Processing Systems, 35:35035 35048, 2022. Zimmert, J. and Lattimore, T. Connections between mirror descent, thompson sampling and the information ratio. In Advances in Neural Information Processing Systems, pp. 11973 11982, 2019. On Interpolating Experts and Multi-Armed Bandits A. A Reduction from BAI to MAB In this section, we construct a PAC algorithm for m-BAI by leveraging an algorithm designed for m-MAB, employing Lemma A.1. We then reduce BAI to MAB. Consequently, the upper bound of an algorithm for MAB is also applicable to the constructed algorithm for BAI, and a lower bound for BAI implies one for MAB. Let r(T, L) be a real valued function with the time horizon T and loss sequence L = ℓ(1), . . . , ℓ(T ) as its input. Let H be a BAI instance. With fixed T > 0, we use EH h r(T, L) i to denote the expectation of r(T, L) where ℓ(t) in L is drawn from H independently for every t [T]. Let H be a set of BAI instances. Lemma A.1. Let A be an algorithm for m-MAB with regret Ra (T, A, L) r(T, L) for every time horizon T and every loss sequence L. 
Then there exists an (ε, 0.05)-PAC algorithm A for m-BAI that terminates in T rounds where T is the solution of the equation T = 2500 max L r(T , L) Moreover, if we only care about identifying an ε-optimal arm with probability 0.95 when the input is chosen from a known family H, we can construct an algorithm solving this problem that terminates in T H rounds where T H is the solution of the equation T H = 2500 max H H EH h r(T H, L) i Proof. Given an instance H of m-BAI, we run A for T rounds. Let Ti be the number of times that the arm i has been pulled, i.e., Ti = PT 1 t=0 [At = i]. Let Z = Z1, Z2, . . . , ZN = T1 T , . . . , TN T be a distribution on N arms. We construct A by simply sampling from Z = T1 T , . . . , TN T and outputting the result. Recall that pi is the mean of the i-th arm in H and arm a is the one with the minimum mean. Define the gap vector = (p1 pa , , p N pa ). Note that Z is a random vector and define conditional expected regret R(Z) = , Z T given Z. Thus the expected regret EZ R(Z) max L r(T , L). By Markov s inequality, R(Z) 100 max L r(T , L) with probability at least 0.99. Now we only consider Z conditioned on R(Z) 100 max L r(T , L). Let B [N] denote the bad set which contains arms that are not ε-optimal. Then εT P i B Zi 100 max L r(T , L). Note that T = 2500 max L r(T , L) ε . Therefore P i B Zi 0.04. In total, this algorithm will make a mistake with probability no more than 0.05 by the union bound. When we only care about the input instances chosen from H, we run A for T H rounds and similarly, we output an arm drawn from T1 T H , T2 T H , . . . , TN . It is easy to verify via the same arguments that this algorithm can output an ε-optimal arm with probability 0.95 when the input is chosen from H. B. Upper bound In this section, we first prove the regret bound for MAB, and then reduce BAI to MAB to give a bound for BAI with the help of Lemma A.1. Finally we can easily apply the algorithm to more general graph. B.1. Regret Upper Bound for MAB B.1.1. THE PIECEWISE CONTINUOUS PROCESS Assuming notations in Algorithm 1, the process Z(s) s [0,T ) is defined as Z(s)(k, j) = Y(s)(k) X (s) k (j), k [K], j [mk], s [0,T ) and n X (s) k o s [0,T ) for every k [K] are piecewise continuous processes defined in the following way. For every integer t {0, 1, . . . , T 1}, we let Y(t) = Y (t) and X (t) k = X(t) k for every k [K]. On Interpolating Experts and Multi-Armed Bandits For every integer t {0, 1, . . . , T 1} and every k [K], the trajectory of n X (s) k o s [t,t+1) is a continuous path in Rmk governed by the ordinary differential equation d ϕk(X (s) k ) ds = ˆℓ(t) k . (2) For every integer t {0, 1, . . . , T 1}, the trajectory of Y(s) s [t,t+1) is a continuous path in RK governed by the ordinary differential equation d ψ(Y(s)) ds = b L(s), (3) where b L(s) = b L(s)(1), . . . , b L(s)(K) RK satisfies b L(s)(k) = P j [mk] X (s) k (j) ˆℓ(t) k (j). Clearly the trajectories of Z(s), Y(s) and X (s) k for every k [K] are piecewise continuous paths in the time interval s [0, T). An important property is that the end of each piece of the trajectories of Y(s) and X (s) k coincides with its discrete counterpart before performing projection to the probability simplex. Formally, for every t [T] and k [K], define X (t) k := lims t X (s) k and Y(t) := lims t Y(s). We have the following lemma. Lemma B.1. For every t [T] and k [K], it holds that X (t) k = X (t) k and Y(t) = Y (t). Proof. To ease the notation, for any fixed t {0, 1, . . . 
, T 1} and fixed k [K], we now prove that X (t+1) k = X (t+1) k and Y(t+1) = Y (t+1) respectively. In fact, X (t+1) k = X (t+1) k immediately follows by integrating both sides of (2) from t to t + 1 and noting that X (t) k = X(t) k . More efforts are needed to prove the identity for Y(t). Recall ϕk(x) = η 1 k P j x(j) log x(j) for every x = x(1), . . . , x(mk) . It follows from (2) that for every s [t, t + 1) every k [K] and every j [mk], X (s) k (j) = X (t) k (j) exp (s t)ηkˆℓ(t) k (j) . As a result, we know that b L(s)(k) = X j [mk] X (t) k (j) exp (s t)ηkˆℓ(t) k (j) ˆℓ(t) k (j). Integrating (3) from t to s, plugging in above and noting that Y(t) = Y (t), we obtain Y(s)(k) = 1 p Y (t)(k) + η j [mk] X(t) k (j) 1 exp ηk (s t) ˆℓ(t) k (j) , which is exactly our rule to define Y (t+1) in Line 1 of Algorithm 1 (take s = t + 1). We define the regret for the piecewise continuous process as follows. Definition B.2. The continuous regret contributed by the process Z(s) s [0,T ) with respect to a fixed arm a [N] is defined as t=0 E Z t+1 t Z(s) e[N] a , ℓ(t) ds . Then we are ready to bound Ra(T). Recall that we may write e[N] a as ea if the information on N is clear from the context. On Interpolating Experts and Multi-Armed Bandits Lemma B.3. For any time horizon T > 0, any loss sequence ℓ(0), ℓ(1), . . . , ℓ(T 1) [0, 1]N, and any arm a = (k, j), it holds that Ra(T) Bψ(e[K] k , Y (0)) + Bϕk(e[mk] j , X(0) k ). Proof. Assume a = (k, j). For every t {0, 1, . . . , T 1}, we compute the decreasing rate of the Bregman divergence caused by the evolution of Y(s) and X (s) k respectively. First consider the change of Bψ(ek, Y(s)) over time: d ds Bψ(ek, Y(s)) = d ψ(ek) ψ(Y(s)) ek Y(s), ψ(Y(s)) = d ψ(Y(s)) ds , Y(s) ek = b L(s), Y(s) ek . Integrating above from t to t + 1, we have Z t+1 t b L(s), Y(s) ek ds = Bψ(ek, Y(t)) Bψ(ek, Y(t+1) ) = Bψ(ek, Y (t)) Bψ(ek, Y (t+1)), (4) where the last equality follows from Lemma B.1. Note that projection never increases Bregman divergence; that is, we have Bψ(ek, Y (t+1)) Bψ(ek, Y (t+1)) = ψ(Y (t+1)) ψ(Y (t+1)) + ψ(Y (t+1)), ek Y (t+1) ψ(Y (t+1)), ek Y (t+1) = ψ(Y (t+1)) ψ(Y (t+1)) ψ(Y (t+1)), Y (t+1) Y (t+1) | {z } A + ψ(Y (t+1)) ψ(Y (t+1)), Y (t+1) ek | {z } B Since ψ is convex, we have A 0. By the definition of Y (t+1), Y (t+1) = arg min y K 1 Bψ(y, Y (t+1)) = arg min y K 1 ψ(y) y, ψ(Y (t+1)) . The first-order optimality condition (see Section 26.5 in (Lattimore & Szepesv ari, 2020)) implies that B 0. As a result, Bψ(ek, Y (t+1)) Bψ(ek, Y (t+1)) and it follows from Equation (4) that t b L(s), Y(s) ek ds Bψ(ek, Y (t)) Bψ(ek, Y (t+1)). (5) Then we consider the change of Bϕk(ej, X (s) k ) over time. Likewise we have d ds Bϕk(ej, X (s) k ) = d ϕk(X (s) k ) ds , X (s) k ej = ˆℓ(t) k , X (s) k ej . By an argument similar to the one for Y(s) above, we can obtain Z t+1 t ˆℓ(t) k , X (s) k ej ds Bϕk(ej, X(t) k ) Bϕk(ej, X(t+1) k ). (6) On the other hand, we have for every s [t, t + 1) and any arm a = (k , j ), E h Z(s) ea , ℓ(t) i = E h Z(s) ea , ˆℓ(t) i = E j [mk] Y(s)(k) X (s) k (j) ˆℓ(t) k (j) ˆℓ(t)(a ) On Interpolating Experts and Multi-Armed Bandits Recall that for every k [K], it holds that b L(s)(k) = P j [mk] X (s) k (j) ˆℓ(t) k (j). Rearranging above yields E h Z(s) ea , ℓ(t) i = E k [K] Y(s)(k) b L(s)(k) ˆℓ(t)(a ) = E h Y(s), b L(s) ˆℓ(t)(a ) i = E h Y(s) ek , b L(s) + b L(s)(k ) ˆℓ(t) k (j ) i = E h Y(s) ek , b L(s) i + E h X (s) k ej , ˆℓ(t) k i . 
Integrating above from t to t + 1 and plugging in Equations (5) and (6), we obtain Z t+1 t E h Z(s) ea , ℓ(t) i ds = Z t+1 t E h Y(s) ek , b L(s) i ds + Z t+1 t E h X (s) k ej , ˆℓ(t) k i ds Bψ(ek, Y (t)) Bψ(ek, Y (t+1)) + Bϕk(ej, X(t) k ) Bϕk(ej, X(t+1) k ). Summing above over t from 0 to T 1 finishes the proof. B.1.2. COMPARISON OF Ra(T) AND Ra(T) For any fixed loss sequence ℓ(0), ℓ(1), . . . , ℓ(T 1), we bound the difference between the regret Ra(T) of Algorithm 1 and the continuous regret Ra(T) for any arm a. Formally, we establish the following lemma: Lemma B.4. Ra(T) Ra(T) 1 ξ Rect(Y (t),Y (t+1)) b L(t) 2 2ψ(ξ) + X k [K] Y (t)(k) sup ζk Rect(X(t) k ,X (t+1) k ) ˆℓ(t) k 2 2ϕk(ζk) Proof. By the definition of the regret, we have t=0 Z(t) ea, ˆℓ(t) t=0 E h Z(t) ea, ˆℓ(t) i t=0 E Z t+1 t Z(s) ea, ˆℓ(t) ds + Z t+1 t Z(t) Z(s), ˆℓ(t) ds t=0 E Z t+1 t Z(t) Z(s), ˆℓ(t) ds , where the first equality holds due to Fubini s theorem. Therefore, we only need to bound the term PT 1 t=0 E h R t+1 t Z(t) Z(s), ˆℓ(t) ds i . Fix t {0, 1, . . . , T 1}. We have shown in the proof of Lemma B.1 that X (s) k (j) = X(t) k (j) exp (s t)ηkˆℓ(t) k (j) X(t) k (j) for any s [t, t + 1) and any j [mk]. Recall that b L(s)(k) = P j [mk] X (s) k (j) ˆℓ(t) k (j) for every k [K]. Then by the discussion above, we have b L(s) b L(t) for any s [t, t + 1). As a result, it follows from (3) that for any s [t, t + 1), ψ(Y(s)) ψ(Y (t)) = Z s t b L(w) dw (s t) b L(t). (7) On Interpolating Experts and Multi-Armed Bandits Recall that for any two vectors x, y of the same dimension, Rect(x, y) is the rectangle between x and y. Since our ψ is a separable function (and therefore 2ψ is diagonal), we can apply the mean value theorem entrywise and obtain ψ(Y(s)) ψ(Y (t)) = 2ψ(ξ(s))(Y(s) Y (t)) (8) for some ξ(s) Rect(Y(s), Y (t)). By our choice of ψ, it holds that 2ψ(ξ(s)) 0 for any ξ(s) Rect(Y(s), Y (t)). Therefore, combining Equations (7) and (8), we have Y(s) Y (t) (s t) 2ψ(ξ(s)) b L(t). Similar argument yields that X (s) k X(t) k (s t) 2ϕk(ζ(s) k ) ˆℓ(t) k for some ζ(s) k Rect(X (s) k , X(t) k ). Therefore for any k [K], j [mk] and any s [t, t + 1), we can bound the difference between Z(t)(k, j) and Z(s)(k, j): Z(t)(k, j) Z(s)(k, j) = Y (t)(k) X(t) k (j) Y(s)(k) X (s) k (j) Y (t)(k) X(t) k (j) Y (t)(k) (s t) h 2ψ(ξ(s)) b L(t)i (k) X(t) k (j) (s t) h 2ϕk(ζ(s) k ) ˆℓ(t) k i (j) = (s t)2 h 2ψ(ξ(s)) b L(t)i (k) h 2ϕk(ζ(s) k ) ˆℓ(t) k i (j) + (s t) X(t) k (j) h 2ψ(ξ(s)) b L(t)i (k) + (s t) Y (t)(k) h 2ϕk(ζ(s) k ) ˆℓ(t) k i (j) (s t) X(t) k (j) h 2ψ(ξ(s)) b L(t)i (k) + (s t) Y (t)(k) h 2ϕk(ζ(s) k ) ˆℓ(t) k i (j) for some ξ(s) Rect(Y(s), Y (t)) and ζ(s) k Rect(X (s) k , X(t) k ). We are now ready to bound the gap between Ra(T) and Ra(T): Ra(T) Ra(T) t=0 E Z t+1 t Z(t) Z (s), ˆℓ(t) j [mk] X(t) k (j) sup ξ Rect(Y (t),Y (t+1)) h 2ψ(ξ) b L(t)i (k) ˆℓ(t) k (j) ds j [mk] Y (t)(k) sup ζk Rect(X(t) k ,X (t+1) k ) h 2ϕk(ζk) ˆℓ(t) k i (j) ˆℓ(t) k (j) ds Note that in both expressions (A) and (B) above, only the term (s t) depend on s. 
So we can integrate and obtain: j [mk] X(t) k (j) sup ξ Rect(Y (t),Y (t+1)) h 2ψ(ξ) b L(t)i (k) ˆℓ(t) k (j) ξ Rect(Y (t),Y (t+1)) h 2ψ(ξ) b L(t)i (k) j [mk] X(t) k (j) ˆℓ(t) k (j) ξ Rect(Y (t),Y (t+1)) h 2ψ(ξ) b L(t)i (k) b L(t)(k) On Interpolating Experts and Multi-Armed Bandits ξ Rect(Y (t),Y (t+1)) b L(t) 2 2ψ(ξ) j [mk] Y (t)(k) sup ζk Rect(X(t) k ,X (t+1) k ) h 2ϕk(ζk) ˆℓ(t) k i (j) ˆℓ(t) k (j) k [K] Y (t)(k) sup ζk Rect(X(t) k ,X (t+1) k ) ˆℓ(t) k 2 2ϕk(ζk) Combining Equations (9) and (10), we have Ra(T) Ra(T) 1 ξ Rect(Y (t),Y (t+1)) b L(t) 2 2ψ(ξ) + X k [K] Y (t)(k) sup ζk Rect(X(t) k ,X (t+1) k ) ˆℓ(t) k 2 2ϕk(ζk) If we apply the regret decomposition theorem in (He & Zhang, 2023) and use the standard OSMD bound for each stage, we will get the term sup ζk Rect(X(t) k ,X (t+1) k ) ˆℓ(t) k 2 2ϕk (ζk ) (12) where k is the index of the group containing the optimal arm instead of the term X k [K] Y (t)(k) sup ζk Rect(X(t) k ,X (t+1) k ) ˆℓ(t) k 2 2ϕk(ζk) in Equation (11). The new Y (t)(k) term is crucial to our optimal regret bound since it cancels a Y (t)(k) term hidden in the denominator of ˆℓ(t) k 2 2ϕk(ζk). This will be clear in Appendix B.1.3. B.1.3. THE REGRET OF ALGORITHM 1 Note that the regret of Algorithm 1 is composed of the two parts in Lemma B.3 and Lemma B.4. In this section, we will prove Theorem 3.1 by providing more specific bounds for the terms in these two lemmas. Proof of Theorem 3.1. By definition of Bregman divergence, Bψ(ek, Y (0)) = ψ(ek) ψ(Y (0)) ψ(Y (0)), ek Y (0) . Since we initialize Y (0) = arg minb K 1 ψ(b), Y (0)(k) = 1 K for k [K] and ψ(Y (0)), ek Y (0) 0 follows the first-order optimality condition for Y (0). Thus Bψ(ek, Y (0)) ψ(ek) ψ(Y (0)) = 2 + 2 Similarly we have X(0) k (j) = 1 mk for j [mk] and Bϕk(ej, X(0) k ) ϕk(ej) ϕk(X(0) k ) = log mk K η + log mk On Interpolating Experts and Multi-Armed Bandits Recall that At = (kt, jt) is the arm pulled by the algorithm at round t. Now we plug our estimator ˆℓ(t) k (j) = 1[kt=k] Y (t)(k) ℓ(t) k (j) and 2ψ(ξ) = diag 1 2ηξ(1)3/2 , 1 2ηξ(2)3/2 , , 1 2ηξ(K)3/2 into the first term on the RHS of Lemma B.4. ξ Rect(Y (t),Y (t+1)) b L(t) 2 2ψ(ξ) ξ Rect(Y (t),Y (t+1)) k [K] ξ(k)3/2 j [mk] ℓ(t) k (j)X(t) k (j) Y (t)(k) 3/2 j [mk] ℓ(t) k (j)X(t) k (j) 1 [kt = k] p Y (t)(k) (c) 2η E Y (t)(k) 2η In the calculation above: (a) follows from Y (t+1)(k) Y (t)(k), (b) is due to P j [mk] ℓ(t) k (j)X(t) k (j) [0, 1], and (c) is due to Jensen s inequality. Similarly we have for the second term with 2ϕk(ζk) = diag 1 ηkζk(1), 1 ηkζk(2), , 1 ηkζk(mk) k [K] Y (t)(k) sup ζk Rect(X(t) k ,X (t+1) k ) ˆℓ(t) k 2 2ϕk(ζk) k [K] ηk Y (t)(k) sup ζk Rect(X(t) k ,X (t+1) k ) j [mk] ζk(j) 1 [kt = k] Y (t)(k) ℓ(t) k (j) 2 k [K] ηk Y (t)(k) X j [mk] X(t) k (j) 1 [kt = k] Y (t)(k) ℓ(t) k (j) 2 j [mk] X(t) k (j) 1 [kt = k] j [mk] X(t) k (j) = X In the calculation above: (d) follows from X (t+1) k (j) X(t) k (j) and (e) is due to ℓ(t) k (j) [0, 1]. Hence, summing up above two terms from 0 to T 1, we obtain Ra(T) Ra(T) η k [K] ηk. (14) Combining Equations (13) and (14) and choosing η = 1 T and ηk = log(mk+1) T PK k=1 log(mk+1), we obtain for any fixed arm a, K η + log mk k [K] ηk + ηT k=1 log(mk + 1) On Interpolating Experts and Multi-Armed Bandits B.2. Upper Bound for BAI We can use the Algorithm 1 and Theorem 3.1 to give an upper bound for m-BAI through Lemma A.1. Proof of Theorem 1.2. We use Algorithm 1 to construct an (ε, 0.05)-PAC algorithm for m-BAI as described in Lemma A.1. 
Since the regret satisfies R(T) c q T PK k=1 log(1 + mk) for some constant c on every loss sequence by Theorem 1.1, running Algorithm 1 with T = (2500c)2 PK k=1 log(1+mk) ε2 , we can get an (ε, 0.05)-PAC algorithm which always terminates in O PK k=1 log(mk+1) B.3. The Strongly Observable Graph with Self-loops We can generalize our results to any strongly observable graph G = (V, E) with each vertex owning a self-loop. Assume G contains a (V1, . . . , VK)-clique cover. We construct a new graph G = (V, E ) by ignoring the edges between any two distinct cliques. It is clear that R (G, T) R (G , T). Then we can prove Corollary 1.5 by directly applying Algorithm 1 with feedback graph G . This proves Corollary 1.5, which asserts that R (G, T) = O k=1 log(mk + 1) Although we assume that each vertex contains a self-loop for the sake of simplicity, we note that our algorithm can still be applied to strongly observable graphs that have some vertices without self-loops. In such cases, we can incorporate an additional exploration term into our algorithm, and a similar analysis to that in Section 3.2 still works. There have been several works using the clique cover as the parameter to bound the minimax regret of graph bandit. For example, (Erez & Koren, 2021) applies FTRL algorithm with a carefully designed potential function which combines the Tsallis entropy with negative entropy. It achieves a regret of (log T)O(1) O KT . Our new bound takes into account the size of each clique and is always superior. C. Lower Bounds for m-BAI Let A be an algorithm for m-BAI where m = (m1, . . . , m K) is a vector. Given an instance of m-BAI, we use T to denote the number of rounds the algorithm A proceeds. Recall that for every group k [K] and j [mk], we use T(k,j) to denote the number of times that the arm (k, j) has been pulled. For every k [K], let T (k) = P j [mk] T(k,j) be the number of rounds the arms in the k-th group have been pulled. We also use N(k,j) to denote the number of times the arm (k, j) has been observed. Clearly N(k,j) = T (k). In the following part, we only consider stochastic environment. That is, ℓ(t) is independently drawn from the same distribution for each t N. Therefore, we omit the superscript (t) and only use ℓ(i) or ℓk(j) to denote the one-round loss of arm i or arm (k, j) respectively when the information is clear from the context. In Appendix C.1, we lower bound the number of rounds for a PAC algorithm on a specific m-BAI instance with m = (m) and then prove the result for m-BAI in Appendix C.4. We then use these results to prove a regret lower bound for m-MAB and bandit problems with general feedback graphs in Appendix E. C.1. An Instance-Specific Lower Bound for (m)-BAI In this section, we study the number of rounds required for (m)-BAI in an (ε, 0.05)-PAC algorithm. In this setting, the pull of any arm can observe the losses of all arms. We will establish a lower bound for a specified instance, namely the one where all arms follow Ber( 1 2). This is key to our lower bound later. We focus on instances of (m)-BAI where each arm is Bernoulli. As a result, each instance can be specified by a vector (p1, . . . , pm 1, pm) Rm meaning the loss of arm i follows Ber(pi) in each round independently. 2 . In the following context, when we denote an instance as H m, the superscript m indicates that it is an On Interpolating Experts and Multi-Armed Bandits m-BAI instance. Consider the following m + 1 (m)-BAI instances n H (m) j o The instance H (m) 0 is 1 2 . 
That is, pi = 1 2 for every i [m] in H (m) 0 ; the j-th arm that is, the instance satisfies pj = 1 2 ε and pi = 1 2 for every i = j. We say an algorithm A distinguishes n H (m) j o j [m] {0} with probability p if Pr h A outputs j the input instance is H (m) j i p, and the output can be arbitrary among {0, 1, . . . m} when the input is not in n H (m) j o j [m] {0}. We refer it as a distin- guishing algorithm, which is different from the (ε, 0.05)-PAC algorithm. The main result of this section is Lemma C.1 (restate Lemma 4.1). Let A be an (ε, 0.05)-PAC algorithm. Assume m 2. There exists a universal constant c1 > 0 such that A terminates on H (m) 0 after at least c1 ε2 log(m + 1) rounds in expectation. We will demonstrate in Lemma C.6 that an (ε, 0.05)-PAC algorithm can be adapted into an algorithm to distinguish n H (m) j o j [m] {0} with few extra samples. Thus, we first establish a lower bound for distinguishing algorithms. For technical reasons, we begin by proving a lower bound for distinguishing algorithms for Gaussian arms in Appendix C.2 and subsequently reduce the Bernoulli arms to Gaussian arms in Appendix C.3. C.2. The Gaussian Arms In this section, we relax the constraint on the range of each arm s loss and allow the losses to be arbitrary real numbers. Let ε 0, 1 2 and σ 1 2 . We construct m + 1 instances {Nj}j {0} [m] with Gaussian distributions: In the instance N0, for each i [m], ℓ(i) is independently drawn from a Gaussian distribution N(0, σ2); In the instance Nj for j [m], ℓ(j) N( ε, σ2) and ℓ(i) N(0, σ2) for each i = j and i [m] independently. Lemma C.2 (Bretagnolle-Huber inequality, see e.g. (Lattimore & Szepesv ari, 2020)). Let P1 and P2 be two probability measures on the same measurable space (Ω, F), and let E F be an arbitrary event. Then P1[E] + P2[E] 1 2e DKL(P1,P2) Let Nmix be the mixture of {Nj}j [m] meaning that the environment chooses k from [m] uniformly at random and generates losses according to Nk in the following BAI game. Let A be an algorithm distinguishing {Nj}j [m] {0}. Let Ωbe the set of all possible outcomes during the first t rounds, including the samples according to the input distribution and the output of A (if A does not terminate after the t -th round, we assume its output is 1). Note that if the algorithm terminates in t < t rounds, we can always add t t virtual rounds so that it still produces a certain loss sequence in Rm t . As a result, each outcome ω Ωcan be viewed as a pair ω = (w, x) where w Rm t is the loss sequence and x { 1, 0, 1, . . . , m} indicates the output of A. Thus Ω= W { 1, 0, 1, . . . , m} where W = Rm t . To ease the proof below, we slightly change A s output: if the original output is x { 1, 0, . . . , m}, we instead output a uniform real in [x, x + 1). Therefore, we can let Ω= W X where W = Rm t and X = R. The benefit of doing so is On Interpolating Experts and Multi-Armed Bandits that we can let F be the Borel sets in Ωwhich is convenient to work with. Clearly it is sufficient to establish lower bounds for the algorithms after the change. For any instance H (m), let PH (m) be the measure of outcomes of A in t rounds with input instance H (m) and p H (m) be the corresponding probability density function (PDF). Then PN0 and PNmix are two probability measures on (Ω, F) and p Nmix(ω) = 1 m P j [m] p Nj(ω) for any ω = (w, x) Ω= Rm t +1. We also let p W H (m) be the PDF of the samples during the first t rounds according to the input H (m) and p X H (m) be the PDF of A s output. 
Lemma C.3. It holds that
$$D_{\mathrm{KL}}\left(P_{N_{\mathrm{mix}}}, P_{N_0}\right) \le \log\left(\frac{m - 1 + \exp\left(\frac{\varepsilon^2 t^*}{\sigma^2}\right)}{m}\right).$$

Proof. For any $\omega = (w, x) \in \Omega$, let $w_{j,t}$ denote the $(j, t)$-th entry of the matrix $w$ for every $j \in [m]$ and $t \in [t^*]$. That is, $w_{j,t} = \ell^{(t)}(j)$ is the loss of arm $j$ in the $t$-th round. Then for each $i \in [m]$,
$$p^W_{N_i}(w) = \left(2\pi\sigma^2\right)^{-\frac{mt^*}{2}}\exp\left(-\frac{\sum_{t\in[t^*]}\left((w_{i,t}+\varepsilon)^2 + \sum_{j\ne i} w_{j,t}^2\right)}{2\sigma^2}\right),\qquad
p^W_{N_0}(w) = \left(2\pi\sigma^2\right)^{-\frac{mt^*}{2}}\exp\left(-\frac{\sum_{t\in[t^*],\, j\in[m]} w_{j,t}^2}{2\sigma^2}\right).$$
Therefore we have
$$\frac{p_{N_i}(\omega)}{p_{N_0}(\omega)} = \frac{p^W_{N_i}(w)}{p^W_{N_0}(w)} = \exp\left(-\frac{\sum_{t\in[t^*]}\left((w_{i,t}+\varepsilon)^2 + \sum_{j\ne i}w_{j,t}^2\right)}{2\sigma^2} + \frac{\sum_{t\in[t^*],\,j\in[m]} w_{j,t}^2}{2\sigma^2}\right) = \exp\left(-\frac{\varepsilon^2 t^* + 2\varepsilon\sum_{t\in[t^*]} w_{i,t}}{2\sigma^2}\right).$$
From Jensen's inequality, we have
$$D_{\mathrm{KL}}\left(P_{N_{\mathrm{mix}}}, P_{N_0}\right) = \int_{\Omega}\log\frac{p_{N_{\mathrm{mix}}}(\omega)}{p_{N_0}(\omega)}\,\mathrm{d}P_{N_{\mathrm{mix}}}(\omega) \le \log\int_{\Omega}\frac{p_{N_{\mathrm{mix}}}(\omega)}{p_{N_0}(\omega)}\,\mathrm{d}P_{N_{\mathrm{mix}}}(\omega) = \log\int_{\Omega}\frac{\left(\frac1m\sum_{j\in[m]}p_{N_j}(\omega)\right)\left(\frac1m\sum_{i\in[m]}p_{N_i}(\omega)\right)}{p_{N_0}(\omega)}\,\mathrm{d}\omega.$$
Note that for $\omega = (w, x)$: for $i, j \in [m]$ with $i \ne j$,
$$\int_{\Omega}\frac{p_{N_i}(\omega)\,p_{N_j}(\omega)}{p_{N_0}(\omega)}\,\mathrm{d}\omega = \int_X\int_W p^{X\mid W}_{N_i}(x\mid w)\,\frac{p^W_{N_i}(w)\, p^W_{N_j}(w)}{p^W_{N_0}(w)}\,\mathrm{d}w\,\mathrm{d}x = \int_W \frac{p^W_{N_i}(w)\,p^W_{N_j}(w)}{p^W_{N_0}(w)}\,\mathrm{d}w = \int_W\left(2\pi\sigma^2\right)^{-\frac{mt^*}{2}}\exp\left(-\frac{\sum_{t\in[t^*]}\left((w_{i,t}+\varepsilon)^2 + (w_{j,t}+\varepsilon)^2 + \sum_{j'\ne i,\, j'\ne j} w_{j',t}^2\right)}{2\sigma^2}\right)\mathrm{d}w = 1;$$
and for $i \in [m]$,
$$\int_{\Omega}\frac{p_{N_i}(\omega)\,p_{N_i}(\omega)}{p_{N_0}(\omega)}\,\mathrm{d}\omega = \int_X\int_W p^{X\mid W}_{N_i}(x\mid w)\,\frac{p^W_{N_i}(w)\,p^W_{N_i}(w)}{p^W_{N_0}(w)}\,\mathrm{d}w\,\mathrm{d}x = \int_W\frac{p^W_{N_i}(w)\,p^W_{N_i}(w)}{p^W_{N_0}(w)}\,\mathrm{d}w = \int_W\left(2\pi\sigma^2\right)^{-\frac{mt^*}{2}}\exp\left(-\frac{\sum_{t\in[t^*]}\left((w_{i,t}+2\varepsilon)^2 + \sum_{j\ne i}w_{j,t}^2\right) - 2\varepsilon^2 t^*}{2\sigma^2}\right)\mathrm{d}w = \exp\left(\frac{\varepsilon^2 t^*}{\sigma^2}\right).$$
Therefore, combining the equations above, we get
$$\int_{\Omega}\frac{\left(\frac1m\sum_{j\in[m]}p_{N_j}(\omega)\right)\left(\frac1m\sum_{i\in[m]}p_{N_i}(\omega)\right)}{p_{N_0}(\omega)}\,\mathrm{d}\omega = \frac{1}{m^2}\sum_{i,j\in[m]}\int_{\Omega}\frac{p_{N_i}(\omega)\,p_{N_j}(\omega)}{p_{N_0}(\omega)}\,\mathrm{d}\omega = \frac{m(m-1) + m\exp\left(\frac{\varepsilon^2 t^*}{\sigma^2}\right)}{m^2} = \frac{m - 1 + \exp\left(\frac{\varepsilon^2 t^*}{\sigma^2}\right)}{m},$$
where the first equality follows from Fubini's theorem. This indicates that $D_{\mathrm{KL}}(P_{N_{\mathrm{mix}}}, P_{N_0}) \le \log\left(\frac{m-1+\exp\left(\varepsilon^2 t^*/\sigma^2\right)}{m}\right)$.

Let $t^* = \frac{c_0\log(m+1)}{\varepsilon^2}$, where $c_0 \le \sigma^2$ is a universal constant. We have the following lemma to bound $\Pr_{N_0}[T \ge t^*]$. Here the randomness comes from the algorithm and the environment when the input instance is $N_0$.

Lemma C.4. For any algorithm distinguishing $\{N_j\}_{j\in[m]\cup\{0\}}$ with probability $0.925$, we have $\Pr_{N_0}[T \ge t^*] \ge 0.1$.

Proof. Let $A$ be an algorithm that distinguishes $\{N_j\}_{j\in[m]\cup\{0\}}$ with probability $0.925$. Let $E$ be the event that $A$ terminates within $t^*$ rounds and gives the answer $N_0$. Recall that $T$ is a random variable representing the number of rounds $A$ runs. Assume $\Pr_{N_0}[T \ge t^*] < 0.1$. Then we have $\Pr_{N_0}[\bar{E}] < 0.075 + 0.1$ from the union bound. Combining Lemma C.2 and Lemma C.3, we get
$$\Pr_{N_{\mathrm{mix}}}[E] \ge \frac{m}{2\left(m - 1 + \exp\left(\frac{\varepsilon^2 t^*}{\sigma^2}\right)\right)} - \Pr_{N_0}[\bar{E}] > \frac{m}{2(m - 1 + m + 1)} - 0.1 - 0.075 = 0.075$$
for every $m \ge 1$, where we used $\exp\left(\varepsilon^2 t^*/\sigma^2\right) \le m+1$ since $c_0 \le \sigma^2$. This indicates the existence of some $j \in [m]$ such that $\Pr_{N_j}[E] > 0.075$, which contradicts the promised success probability of $A$. Therefore $A$ satisfies $\Pr_{N_0}[T \ge t^*] \ge 0.1$.

C.3. From Gaussian to Bernoulli

We now show a reduction from Gaussian arms to Bernoulli arms which implies lower bounds for the instances $\{H^{(m)}_j\}_{j\in[m]\cup\{0\}}$. Given an input instance from $\{N_j\}_{j\in[m]\cup\{0\}}$, we can map it to a corresponding instance among $\{H^{(m)}_j\}_{j\in[m]\cup\{0\}}$ by the following rule. In each round, if an arm receives a loss $\ell \in \mathbb{R}$, let
$$\hat{\ell} = \begin{cases} 0, & \text{if } \ell < 0;\\ 1, & \text{if } \ell \ge 0.\end{cases} \qquad (15)$$
Obviously, losses drawn from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ are mapped to $\mathrm{Ber}\left(\frac12\right)$ losses. For a biased Gaussian $\mathcal{N}(-\varepsilon, \sigma^2)$, as Figure 1 shows, it holds that
$$\Pr\left[\hat{\ell} = 0\right] = \Pr[\ell < 0] = \int_{-\infty}^{-\varepsilon}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x+\varepsilon)^2}{2\sigma^2}}\,\mathrm{d}x + \int_{-\varepsilon}^{0}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x+\varepsilon)^2}{2\sigma^2}}\,\mathrm{d}x = \frac12 + \int_{-\varepsilon}^{0}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x+\varepsilon)^2}{2\sigma^2}}\,\mathrm{d}x.$$

[Figure 1. From Gaussian to Bernoulli: losses with $\ell < 0$ are mapped to $\hat\ell = 0$ and losses with $\ell \ge 0$ to $\hat\ell = 1$; the shaded area between $-\varepsilon$ and $0$ is $f(\sigma)$.]

Let $f(\sigma) = \int_{-\varepsilon}^{0}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x+\varepsilon)^2}{2\sigma^2}}\,\mathrm{d}x$ denote the shaded area in Figure 1. Note that $f$ is continuous and decreasing with respect to $\sigma$, with $f(\sigma)\to\frac12$ as $\sigma\to 0^+$ and $f(\sigma)\to 0$ as $\sigma\to\infty$. Assume that $\varepsilon < \frac18$. Therefore, there exists $\sigma_0 \le \frac12$ such that $f(\sigma_0) = \varepsilon$. Choose $\sigma = \sigma_0$. Then we map $\mathcal{N}(-\varepsilon, \sigma^2)$ to $\mathrm{Ber}\left(\frac12 - \varepsilon\right)$ and transform the sample space from $\mathbb{R}^{m\times t^*}$ to $\{0,1\}^{m\times t^*}$.
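The choice of $\sigma_0$ is easy to compute explicitly: since $f(\sigma) = \Phi(\varepsilon/\sigma) - \frac12$ for the standard normal CDF $\Phi$, the equation $f(\sigma_0) = \varepsilon$ gives $\sigma_0 = \varepsilon/\Phi^{-1}\left(\frac12+\varepsilon\right)$. Below is a minimal sketch (ours; the helper names are illustrative) of the mapping (15) together with this computation, using only the standard library:

```python
import random
from statistics import NormalDist

# Equation (15): threshold a Gaussian loss at 0. Choosing sigma_0 with
# f(sigma_0) = eps makes N(-eps, sigma_0^2) map exactly to Ber(1/2 - eps).

def sigma_for(eps: float) -> float:
    # f(sigma) = Phi(eps/sigma) - 1/2, so f(sigma_0) = eps  <=>  sigma_0 = eps / Phi^{-1}(1/2 + eps)
    return eps / NormalDist().inv_cdf(0.5 + eps)

def thresholded_loss(mean: float, sigma: float) -> int:
    ell = random.gauss(mean, sigma)
    return 0 if ell < 0 else 1           # the map of Equation (15)

eps = 0.1
s0 = sigma_for(eps)                      # about 0.395, indeed at most 1/2
n = 200_000
emp = sum(thresholded_loss(-eps, s0) for _ in range(n)) / n
print(f"sigma_0 = {s0:.4f}, empirical mean = {emp:.4f}, target = {0.5 - eps}")
```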
Lemma C.5. Let $\varepsilon$ be a number in $\left(0, \frac18\right)$. For any algorithm distinguishing $\{H^{(m)}_j\}_{j\in[m]\cup\{0\}}$ with probability $0.925$, we have $\Pr_{H^{(m)}_0}[T \ge t^*] \ge 0.1$.

Proof. Assume that there exists such an algorithm $A$ with $\Pr_{H^{(m)}_0}[T \ge t^*] < 0.1$. We then construct an algorithm $A'$ to distinguish $\{N_j\}_{j\in[m]\cup\{0\}}$. The algorithm $A'$ proceeds as follows: when $A'$ receives a loss $\ell$, it first computes $\hat\ell$ as in Equation (15) and feeds $\hat\ell$ to $A$ as the loss. If $A$ outputs $H^{(m)}_j$, then $A'$ outputs $N_j$. Therefore, $A'$ also succeeds with probability $0.925$ while satisfying $\Pr_{N_0}[T \ge t^*] < 0.1$. This violates Lemma C.4.

We remark that we cannot replace $H^{(m)}_0$ by $H^{(m)}_j$ for any $j \in [m]$ in Lemma C.5, since an $H^{(m)}_j$-favouring algorithm exists for every $j \in [m]$. For example, an $H^{(m)}_1$-favouring algorithm is as follows: first sample the arms for $\frac{2\log\frac{1}{0.03}}{\varepsilon^2}$ rounds. If the empirical mean satisfies $\hat{p}_1 < \frac12 - \frac{\varepsilon}{2}$, terminate and output $H^{(m)}_1$; otherwise apply an algorithm which distinguishes $\{H^{(m)}_j\}_{j\in[m]\cup\{0\}}$ with probability $0.96$. By Hoeffding's inequality, the error probability of the first stage is at most $0.03$. Therefore, this $H^{(m)}_1$-favouring algorithm has success probability at least $0.925$ and, with high probability, it only needs to play $\frac{2\log\frac{1}{0.03}}{\varepsilon^2}$ rounds when the input instance is $H^{(m)}_1$.

Then we are ready to prove Lemma C.1, which is a direct corollary of the following lemma.

Lemma C.6. Let $\varepsilon$ be a number in $\left(0, \frac18\right)$ and assume $m \ge 2$. There exists a constant $c_1 > 0$ such that for any algorithm $A$ which can output an $\varepsilon$-optimal arm on any instance among $\{H^{(m)}_j\}_{j\in[m]\cup\{0\}}$ with probability at least $0.95$, we have $\mathbb{E}_{H^{(m)}_0}[T] \ge \frac{c_1\log(m+1)}{\varepsilon^2}$.

Proof. We first consider the case $c_0\log(m+1) > 4\log 40$, where $c_0$ is the universal constant in the definition of $t^*$. We reduce from the hypothesis-testing lower bound in Lemma C.5. Assume $A$ satisfies $\Pr_{H^{(m)}_0}\left[T \ge \frac{c_0\log(m+1)}{2\varepsilon^2}\right] < 0.1$. Then we construct an algorithm $A'$ to distinguish $\{H^{(m)}_j\}_{j\in[m]\cup\{0\}}$. Given an instance among $\{H^{(m)}_j\}_{j\in[m]\cup\{0\}}$, we first apply $A$ to get an output arm $i$. Then we sample $\frac{2\log\frac{1}{0.025}}{\varepsilon^2}$ more rounds and check whether the empirical mean satisfies $\hat{p}_i < \frac12 - \frac{\varepsilon}{2}$. If so, output $H^{(m)}_i$; otherwise, output $H^{(m)}_0$. The success probability of at least $0.925$ is guaranteed by Hoeffding's inequality and the union bound. According to our assumption, with probability larger than $0.9$, $A'$ terminates within $\frac{c_0\log(m+1)}{2\varepsilon^2} + \frac{2\log\frac{1}{0.025}}{\varepsilon^2} < \frac{c_0\log(m+1)}{\varepsilon^2}$ rounds. This violates Lemma C.5.

Then we consider the case $c_0\log(m+1) \le 4\log 40$, that is, when $m$ is bounded by some constant. It then follows from Lemma D.3 that $A$ satisfies $\Pr_{H^{(m)}_0}\left[T \ge \frac{c_s}{\varepsilon^2}\right] \ge 0.1$ for a universal constant $c_s$ when $m \ge 2$. Then, choosing $c_1 = \min\left\{\frac{c_0}{20}, \frac{c_s}{10\log(m_0+1)}\right\}$ where $m_0 = e^{\frac{4\log 40}{c_0}} - 1$ is the largest value of $m$ with $c_0\log(m+1)\le 4\log 40$, we have $\mathbb{E}_{H^{(m)}_0}[T] \ge \frac{c_1\log(m+1)}{\varepsilon^2}$ for all algorithms that can output an $\varepsilon$-optimal arm on any instance among $\{H^{(m)}_j\}_{j\in[m]\cup\{0\}}$ with probability at least $0.95$ when $m \ge 2$.
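For concreteness, here is a minimal sketch (ours; the `sample_round` interface is an assumption for illustration) of the verification stage used in the proof of Lemma C.6: after the PAC algorithm proposes a candidate arm $i$, the extra $\frac{2\log(1/0.025)}{\varepsilon^2}$ rounds and the threshold $\frac12-\frac\varepsilon2$ decide between $H^{(m)}_i$ and $H^{(m)}_0$, with each error bounded by $0.025$ via Hoeffding's inequality:

```python
import math
import random

def verify_candidate(sample_round, i, eps):
    """Accept H_i iff the empirical mean of the candidate arm i is below 1/2 - eps/2."""
    n = math.ceil(2 * math.log(1 / 0.025) / eps ** 2)
    mean_i = sum(sample_round()[i] for _ in range(n)) / n
    return i if mean_i < 0.5 - eps / 2 else 0      # 0 stands for answering H_0

# Example on the all-fair instance H_0 (every arm Ber(1/2)): the output is 0
# with probability at least 0.975.
eps, m = 0.1, 4
fair_round = lambda: [1 if random.random() < 0.5 else 0 for _ in range(m)]
print(verify_candidate(fair_round, i=2, eps=eps))
```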
C.4. The Lower Bound for m-BAI

Recall that in $m$-BAI, the $N$ arms are partitioned into $K$ groups of sizes $m_1, m_2, \ldots, m_K$ respectively. Each pull of an arm results in an observation of all arms in its group. Consider the $m$-BAI instance $H^{m}_0$ which consists of all fair coins. Recall that we use $T^{(k)}$ to denote the number of rounds in which the pulled arm belongs to the $k$-th group. We then prove the following lemma, which directly implies Theorem 1.3.

Lemma C.7 (restatement of Lemma 4.2). Let $\varepsilon$ be a number in $\left(0, \frac18\right)$. For every $(\varepsilon, 0.05)$-PAC algorithm of $m$-BAI, we have $\mathbb{E}_{H^{m}_0}\left[T^{(k)}\right] \ge \frac{c_1\log(m_k+1)}{\varepsilon^2}$ for every $k \in [K]$ with $m_k \ge 2$, and $\mathbb{E}_{H^{m}_0}[T] \ge \sum_{k=1}^{K}\frac{c_1\log(m_k+1)}{2\varepsilon^2}$ if the total number of arms $\sum_{k=1}^{K} m_k \ge 2$, where $c_1$ is the constant in Lemma C.6. Moreover, these lower bounds still hold even if the algorithm can identify the $\varepsilon$-optimal arm with probability $0.95$ only when every input arm has losses drawn from either $\mathrm{Ber}\left(\frac12\right)$ or $\mathrm{Ber}\left(\frac12 - \varepsilon\right)$.

Proof. We only prove the latter, stronger statement. Let $\mathcal{H}$ be the set of all $m$-BAI instances in which every arm has losses drawn from either $\mathrm{Ber}\left(\frac12\right)$ or $\mathrm{Ber}\left(\frac12-\varepsilon\right)$. Let $A$ be an algorithm that identifies an $\varepsilon$-optimal arm with probability $0.95$ whenever the input instance is in $\mathcal{H}$. Assume $A$ satisfies $\mathbb{E}_{H^{m}_0}\left[T^{(k)}\right] < \frac{c_1\log(m_k+1)}{\varepsilon^2}$ for some $k \in [K]$. In the following, we construct an algorithm $A'$ that finds an $\varepsilon$-optimal arm on instances in $\{H^{(m_k)}_j\}_{j\in[m_k]\cup\{0\}}$. Given any $(m_k)$-BAI instance $H^{(m_k)} \in \{H^{(m_k)}_j\}_{j\in[m_k]\cup\{0\}}$, we construct an $m$-BAI instance: set $H^{(m_k)}$ to be the $k$-th group and let all remaining arms be fair coins. Then we apply $A$ on this instance; $A'$ only pulls the $(m_k)$-BAI instance in those rounds in which $A$ pulls an arm of the $k$-th group, and simulates the fair coins of all other groups by itself. The output of $A'$ is
$$\text{Output of } A' = \begin{cases}\text{arm } j, & \text{if the output of } A \text{ is arm } (k, j);\\ \text{an arbitrary arm}, & \text{otherwise.}\end{cases}$$
Clearly, $A'$ is correct with probability at least $0.95$. However, $A'$ satisfies $\mathbb{E}_{H^{(m_k)}_0}[T] < \frac{c_1\log(m_k+1)}{\varepsilon^2}$, which violates Lemma C.6. Therefore, we have $\mathbb{E}_{H^{m}_0}\left[T^{(k)}\right] \ge \frac{c_1\log(m_k+1)}{\varepsilon^2}$ for every $k \in [K]$ with $m_k \ge 2$, and thus $\mathbb{E}_{H^{m}_0}[T] \ge \sum_{k=1}^{K}\frac{c_1\log(m_k+1)}{\varepsilon^2}$ as long as every $m_k \ge 2$. For groups of size one, we can pair and merge them so that each group contains at least two arms (in case there is an odd number of singleton groups, we merge the remaining one into any other group). This operation only makes the problem easier (since one can observe more arms in each round) and only affects the lower bound by a factor of at most $2$. Therefore, we still have
$$\mathbb{E}_{H^{m}_0}[T] \ge \sum_{k=1}^{K}\frac{c_1\log(m_k+1)}{2\varepsilon^2}.$$

D. Lower Bound for (m)-BAI with Bounded m

In this section, we lower bound the number of pulls of $(\varepsilon, 0.05)$-PAC algorithms for $(m)$-BAI when $m$ is bounded by a constant. To this end, we first prove a likelihood lemma in Appendix D.1.

D.1. Likelihood Lemma

Consider two instances $H_a$ and $H_b$ which differ only at one arm (without loss of generality, assume it is the first arm). In $H_a$, $\ell(1)$ is drawn from $\mathrm{Ber}\left(\frac12\right)$ and in $H_b$, $\ell(1)$ is drawn from $\mathrm{Ber}\left(\frac12-\varepsilon\right)$, where $\varepsilon \in \left(0, \frac12\right)$ is a fixed number. Let $A$ be a PAC algorithm for BAI. Let $K^t_j = \sum_{r=1}^{t}\ell^{(r)}(j)$ be the accumulated loss of arm $j$ before the $(t+1)$-th round and abbreviate $K^{N_j}_j$ as $K_j$. Let $A_j$ be the event that $N_j < \hat{t}$ for a fixed $\hat{t} \in \mathbb{N}$. Let $C^a_j$ be the event $\left\{\max_{1\le t\le\hat{t}}\left|K^t_j - \frac12 t\right| < \sqrt{c\varepsilon^2\hat{t}}\cdot\sqrt{\hat{t}}\right\}$ and $C^b_j$ be the event $\left\{\max_{1\le t\le\hat{t}}\left|K^t_j - \left(\frac12-\varepsilon\right)t\right| < \sqrt{c\varepsilon^2\hat{t}}\cdot\sqrt{\hat{t}}\right\}$, where $c$ is a positive constant.

Lemma D.1 (Lemma 3 of (Mannor & Tsitsiklis, 2004)). If $0 \le x \le \frac12$ and $y > 0$, then $(1-x)^y \ge e^{-dxy}$ where $d = 1.78$.

Lemma D.2 (Likelihood Lemma). Let $S_a = A_1 \cap B \cap C^a_1$ and $S_b = A_1 \cap B \cap C^b_1$ where $B$ is an arbitrary event. Then we have
$$\Pr_{H_b}[S_a] \ge e^{-8(1+\sqrt{c})\varepsilon^2\hat{t}}\Pr_{H_a}[S_a] \qquad (16)$$
and
$$\Pr_{H_a}[S_b] \ge e^{-8(1+\sqrt{c})\varepsilon^2\hat{t}}\Pr_{H_b}[S_b]. \qquad (17)$$

Proof. We first prove Equation (16). For each $\omega \in S_a$ ($\omega$ is a history of the algorithm, including the behavior of the algorithm and the observed results in each round), we have
$$\frac{\Pr_{H_b}[\omega]}{\Pr_{H_a}[\omega]} = \frac{\left(\frac12-\varepsilon\right)^{K_1}\left(\frac12+\varepsilon\right)^{N_1-K_1}}{\left(\frac12\right)^{N_1}} = (1-2\varepsilon)^{K_1}(1+2\varepsilon)^{N_1-K_1} = \left(1-4\varepsilon^2\right)^{N_1-K_1}(1-2\varepsilon)^{2K_1-N_1} \ge \left(1-4\varepsilon^2\right)^{N_1}(1-2\varepsilon)^{2K_1-N_1}.$$
From Lemma D.1 and the definition of $S_a$, we have $\left(1-4\varepsilon^2\right)^{N_1} \ge \left(1-4\varepsilon^2\right)^{\hat{t}} \ge e^{-8\varepsilon^2\hat{t}}$ and $(1-2\varepsilon)^{2K_1-N_1} \ge (1-2\varepsilon)^{2\sqrt{c\varepsilon^2\hat{t}}\cdot\sqrt{\hat{t}}} \ge e^{-8\sqrt{c}\,\varepsilon^2\hat{t}}$. Therefore
$$\frac{\Pr_{H_b}[\omega]}{\Pr_{H_a}[\omega]} \ge e^{-8(1+\sqrt{c})\varepsilon^2\hat{t}}$$
and
$$\Pr_{H_b}[S_a] = \sum_{\omega\in S_a}\frac{\Pr_{H_b}[\omega]}{\Pr_{H_a}[\omega]}\Pr_{H_a}[\omega] \ge e^{-8(1+\sqrt{c})\varepsilon^2\hat{t}}\Pr_{H_a}[S_a].$$
The proof of Equation (17) is similar.
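The constants of Lemma D.2 are applied below (Lemma D.3) with $\hat t = \frac{\log 3}{100\varepsilon^2}$ and $c = 100$; a one-line numeric check (ours, not part of the paper) confirms that the resulting likelihood-ratio factor is independent of $\varepsilon$ and exceeds $\frac13$, which is all the proof needs:

```python
import math

# exp(-8 * (1 + sqrt(c)) * eps^2 * t_hat) with t_hat = log(3) / (100 eps^2) and c = 100:
factor = math.exp(-8 * (1 + math.sqrt(100)) * math.log(3) / 100)
print(factor, factor > 1 / 3)   # ~0.380 > 1/3, hence 0.15 * factor > 0.05
```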
D.2. Lower Bound for (m)-BAI with Constant m

Lemma D.3. There exists a constant $c_s$ such that for any algorithm $A$ which can output an $\varepsilon$-optimal arm on any instance among $\{H^{(m)}_j\}_{j\in[m]\cup\{0\}}$ with probability at least $0.95$, where $m \ge 2$ and $c_0\log(m+1) \le 4\log 40$, we have $\Pr_{H^{(m)}_0}\left[T \ge \frac{c_s}{\varepsilon^2}\right] \ge 0.1$.

Proof. Note that there must exist $j \in [m]$ such that $\Pr_{H^{(m)}_0}[A \text{ outputs arm } j] \le \frac1m$. Let $B$ be the event that the algorithm outputs any arm other than arm $j$. Apply Lemma D.2 with $\hat{t} = \frac{\log 3}{100\varepsilon^2}$, $c = 100$, $H_b = H^{(m)}_j$ and $H_a = H^{(m)}_0$ (with arm $j$ playing the role of arm $1$). Assume that $\Pr_{H^{(m)}_0}[T \ge \hat{t}] < 0.1$. By Kolmogorov's inequality, we have $\Pr_{H^{(m)}_0}\left[\max_{1\le t\le\hat{t}}\left|K^t_j - \frac12 t\right| < \sqrt{c\varepsilon^2\hat{t}}\cdot\sqrt{\hat{t}}\right] \ge 1 - 0.25$. Therefore, we have $\Pr_{H^{(m)}_0}[S_a] \ge 0.9 - \frac1m - 0.25 \ge 0.15$ by the union bound. Then from Equation (16), we have
$$\Pr_{H^{(m)}_j}[B] \ge \Pr_{H^{(m)}_j}[S_a] \ge e^{-8(1+\sqrt{c})\frac{\log 3}{100}}\Pr_{H^{(m)}_0}[S_a] > 0.15\cdot\frac13 = 0.05.$$
However, this contradicts the success probability of $A$ on $H^{(m)}_j$. Therefore, letting $c_s = \frac{\log 3}{100}$, we have $\Pr_{H^{(m)}_0}\left[T \ge \frac{c_s}{\varepsilon^2}\right] \ge 0.1$.

E. Regret Lower Bounds

In this section we prove lower bounds for the minimax regret in various settings. All regret lower bounds in this section are based on the lower bounds for $m$-BAI established in Appendix C.

E.1. Regret Lower Bound for m-MAB

Let us fix $m = (m_1, \ldots, m_K)$. We derive a regret lower bound for $m$-MAB and thus prove Theorem 1.4. Let $T$ be the time horizon and let $c_1$ be the constant in Lemma C.6. Consider the set of $m$-BAI instances in which every arm has losses drawn from either $\mathrm{Ber}\left(\frac12\right)$ or $\mathrm{Ber}\left(\frac12-\varepsilon\right)$, where $\varepsilon = \sqrt{\frac{c_1\sum_{k=1}^{K}\log(m_k+1)}{8T}}$. Denote this set by $\mathcal{H}$.

Lemma E.1. For any algorithm $A$ of $(m_1, \ldots, m_K)$-MAB and any sufficiently large $T > 0$, there exists $H \in \mathcal{H}$ such that the expected regret of $A$ satisfies
$$\mathbb{E}_H[R(T)] \ge c\sqrt{T\sum_{k=1}^{K}\log(m_k + 1)},$$
where $c > 0$ is a universal constant. Here the expectation is taken over the randomness of the losses, which are drawn from $H$ independently in each round.

Proof. Assume that $\mathbb{E}_H[R(T)]$ is smaller than a sufficiently small constant multiple of $\sqrt{T\sum_{k=1}^{K}c_1\log(m_k+1)}$ for every $H \in \mathcal{H}$. Lemma A.1 then shows that $A$ implies an algorithm which identifies an $\varepsilon$-optimal arm for $m$-BAI instances in $\mathcal{H}$ with probability $0.95$ and terminates within $\frac{c_1\sum_{k=1}^{K}\log(m_k+1)}{8\varepsilon^2}$ rounds, which equals $T$ by our choice of $\varepsilon$. We may assume $\varepsilon < \frac18$ since $T$ is sufficiently large. However, according to Lemma C.7, any such algorithm needs at least $\frac{c_1\sum_{k=1}^{K}\log(m_k+1)}{2\varepsilon^2} = 4T$ rounds in expectation on some instance in $\mathcal{H}$. This is a contradiction, and it yields a regret lower bound of $\Omega\left(\sqrt{T\sum_{k=1}^{K}\log(m_k+1)}\right)$.

Theorem 1.4 is a direct corollary of Lemma E.1.
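The reduction provided by Lemma A.1 (invoked in the proof above) converts low regret into a PAC guarantee by sampling the output arm proportionally to pull counts. Below is a minimal sketch of this conversion under our own illustrative interface (the `alg` object and the Markov-style constant $20$ are assumptions, not the paper's exact statement): if the expected regret over the $T$ rounds is at most $\varepsilon T/20$, the expected suboptimality of the sampled arm is at most $\varepsilon/20$, so by Markov's inequality it is $\varepsilon$-optimal with probability at least $0.95$.

```python
import random
from collections import Counter

def regret_to_pac(alg, sample_losses, T):
    """Run an m-MAB algorithm for T rounds, then output an arm sampled
    proportionally to its pull counts (a Lemma A.1-style conversion)."""
    pulls = Counter()
    for _ in range(T):
        arm = alg.select()                         # arm = (group k, index j)
        losses = sample_losses()                   # dict: arm -> loss this round
        group_feedback = {a: l for a, l in losses.items() if a[0] == arm[0]}
        alg.update(arm, group_feedback)            # observe the whole group of `arm`
        pulls[arm] += 1
    arms, weights = zip(*pulls.items())
    return random.choices(arms, weights=weights, k=1)[0]
```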
E.2. Regret Lower Bounds for Strongly Observable Graphs

Let $G = (V, E)$ be a strongly observable graph with a self-loop on each vertex and let $N = |V|$. Assume that there exist $K$ disjoint sets $S_1, \ldots, S_K \subseteq V$ such that there is no edge between $S_i$ and $S_j$ for any $i \ne j$. For every $k \in [K]$, let $m_k = |S_k|$, and let $S = \bigcup_{k\in[K]} S_k$.

Proof of Theorem 1.6. We present a reduction from $m$-MAB to bandit with feedback graph $G$, where $m = (m_1, \ldots, m_K)$. Let $A$ be an algorithm for bandit with feedback graph $G$. Consider the set of instances in which the loss of each arm is drawn from either $\mathrm{Ber}\left(\frac12\right)$ or $\mathrm{Ber}\left(\frac12-\varepsilon\right)$, where $\varepsilon = \sqrt{\frac{c_1\sum_{k=1}^{K}\log(m_k+1)}{8T}}$ (here $c_1$ is the constant in Lemma C.6). Denote this set by $\mathcal{H}$. When we say the input of MAB is an instance in $\mathcal{H}$, we mean that the loss sequence is drawn from this instance independently in each round. We then design an algorithm $A'$ for $m$-MAB on instances in $\mathcal{H}$ as follows. For an $m$-MAB instance $H^{m} \in \mathcal{H}$, we construct a bandit instance with feedback graph $G$: the losses of the arms in $S_k$ are the losses of the arms in the $k$-th group of $H^{m}$ in the $m$-MAB game, and the losses of the arms in $V \setminus S$ are always equal to $1$. The algorithm $A'$ makes decisions according to $A$. If $A$ pulls an arm in $S$, then $A'$ pulls the corresponding arm in the $m$-MAB game. Otherwise, when $A$ requests to pull an arm $A_t \in V\setminus S$, we replace this action by letting $A'$ pull the first arm of each group once and then feed the information that $A_t$ would have observed back to $A$ (note that all arms outside $S$ have fixed loss $1$). We force $A'$ to terminate after pulling exactly $T$ arms.

Note that $\varepsilon \le \frac1K$ since $T$ is sufficiently large. If we use $R(T)$ and $R'(T)$ to denote the regret of $A$ and $A'$ respectively, then by our choice of $\varepsilon$ we have $\mathbb{E}[R(T)] \ge \frac12\,\mathbb{E}[R'(T)]$, where the expectation is taken over the randomness of the loss sequences specified above: each round in which $A$ pulls an arm outside $S$ contributes at least $\frac12$ to $R(T)$, while the at most $K$ corresponding pulls of $A'$ contribute at most $K\varepsilon \le 1$ to $R'(T)$ in expectation, and all other rounds contribute equally to both. Lemma E.1 shows that there exists $H \in \mathcal{H}$ such that
$$\mathbb{E}_H[R'(T)] \ge c\sqrt{T\sum_{k=1}^{K}\log(m_k+1)}.$$
Therefore, there exist loss sequences on which $A$ suffers a regret of $\Omega\left(\sqrt{T\sum_{k=1}^{K}\log(m_k+1)}\right)$.

Remark E.2. Although we assume that each vertex has a self-loop in Theorem 1.6, it is easy to verify that the result also holds for strongly observable graphs containing some vertices without self-loops, as long as we can find legal $\{S_k\}_{k\in[K]}$. For example, for the loopless clique, we can apply Theorem 1.6 with $K = 1$ and $S_1 = V$. It gives a minimax regret lower bound of $\Omega\left(\sqrt{T\log N}\right)$, which matches the previous best upper bound in (Alon et al., 2015).

[Figure 2. A feedback graph example with disjoint parts $S_1, S_2, S_3, \ldots, S_K$.]

Theorem 1.6 gives a general regret lower bound for bandits with arbitrary feedback graphs. Intuitively, it allows us to partition the graph and account for the hardness of each part separately. For example, consider the graph shown in Figure 2: the feedback graph is the disjoint union of $K_1$ cliques and $K_2 = K - K_1$ cycles, where each clique contains $m_1$ vertices and each cycle contains $m_2$ vertices. Note that the clique cover of this graph contains $K_1$ cliques of size $m_1$ and roughly $\frac{K_2 m_2}{2}$ cliques of constant size. According to Theorem 3.1, our Algorithm 1 gives a regret upper bound of $O\left(\sqrt{T\left(K_1\log m_1 + K_2 m_2\right)}\right)$, which matches the lower bound given by Theorem 1.6. The previous best lower bound (Alon et al., 2015) for this feedback graph is $\Omega\left(\sqrt{(K_1 + K_2 m_2)T}\right)$. When $K_1$ and $m_1$ are large, our result wins by a factor of $\Theta\left(\sqrt{\log m_1}\right)$.
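A small numerical illustration (ours) of the comparison in the Figure 2 example: for large $K_1$ and $m_1$, the clique-cover-aware rate $\sqrt{T(K_1\log m_1 + K_2 m_2)}$ exceeds the earlier rate $\sqrt{(K_1+K_2m_2)T}$ by roughly a $\sqrt{\log m_1}$ factor.

```python
import math

def new_bound(T, K1, m1, K2, m2):     # Theorem 1.6 / Theorem 3.1 rate
    return math.sqrt(T * (K1 * math.log(m1) + K2 * m2))

def old_bound(T, K1, m1, K2, m2):     # Alon et al. (2015) rate for this graph
    return math.sqrt((K1 + K2 * m2) * T)

T, K1, m1, K2, m2 = 10 ** 6, 100, 10 ** 4, 10, 5
print(new_bound(T, K1, m1, K2, m2) / old_bound(T, K1, m1, K2, m2))  # ~ sqrt(log m1) up to constants
```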
E.3. Regret Lower Bounds for Weakly Observable Graphs

Let $G = (V, E)$ be a weakly observable graph. Assume that $V$ can be partitioned into $K$ disjoint sets $V = V_1\cup V_2\cup\cdots\cup V_K$ and that each $G[V_k]$ contains a $t_k$-packing independent set $S_k$ such that every vertex in $S_k$ has no self-loop. Assume there are no edges from $V_j$ to $S_i$ for any $i \ne j$. Let $m_k = |S_k|$ and $S = \bigcup_{k\in[K]} S_k$.

Without loss of generality, we assume in the following proof that each $m_k \ge 2$. When some $m_k = 1$, we can pair and merge such sets into new sets of size at least two (in case there is an odd number of singleton sets, we merge the remaining one into any other set). This merging only affects the result by a constant factor. Let $m = (m_1, \ldots, m_K)$. Our proof idea is to embed a certain $m'$-BAI instance into $G$ so that the lower bound follows from the lower bound for $m'$-BAI.

Proof of Theorem 1.7. Let $\xi_k = \max\left\{c_1\log(m_k+1), \frac{c_2 m_k}{t_k}\right\}$ for every $k \in [K]$, where $c_1 > 0$ is the constant in Lemma C.7 and $c_2 = \frac{c_1\log 3}{4}$. Assume there exists an algorithm $A$ whose regret satisfies
$$R(T) \le c_3\left(\sum_{k=1}^{K}\xi_k\right)^{\frac13}T^{\frac23} \qquad (18)$$
for every loss sequence, where $c_3 > 0$ is a sufficiently small universal constant (any $c_3 < \frac{1}{2\cdot 1250^{2/3}}$ suffices below). We will construct an $m'$-BAI game for some $m' = (m'_1, m'_2, \ldots, m'_{K'})$ and reduce this BAI game to the bandit problem with feedback graph $G$. The vector $m'$ is obtained from $m$ as follows. For every $k \in [K]$, we distinguish between two cases:

Case 1: if $c_1\log(m_k+1) \ge \frac{c_2 m_k}{t_k}$, we let the arms in $S_k$ form one group of the $m'$-BAI instance;

Case 2: if $c_1\log(m_k+1) < \frac{c_2 m_k}{t_k}$, we divide $S_k$ into $\lfloor\frac{m_k}{2}\rfloor$ small sets, each of size at least two, and each small set becomes a group of the $m'$-BAI instance.

In other words, each group of the $m'$-BAI instance is either some $S_k$ (Case 1) or a subset of some $S_k$ (Case 2). Given an $m'$-BAI instance and a time horizon $T > 0$, we define the loss sequence for the bandit problem with feedback graph $G$ as follows: the losses of the arms in $S$ are sampled in each round independently from the distributions of the corresponding arms of the $m'$-BAI instance, and the losses of the arms in $V\setminus S$ are always equal to $1$. We then design an algorithm $A'$ for the $m'$-BAI game by simulating $A$ on this graph bandit problem. If $A$ pulls an arm in $V\setminus S$ and thereby observes arms in $S_k$, we again consider two cases:

Case 1: if $c_1\log(m_k+1) \ge \frac{c_2 m_k}{t_k}$, we let $A'$ pull an arbitrary arm in the corresponding group of the $m'$-BAI instance;

Case 2: if $c_1\log(m_k+1) < \frac{c_2 m_k}{t_k}$, for each arm in $S_k$ that is observed, $A'$ pulls the corresponding arm of the $m'$-BAI instance once.

Otherwise, if $A$ pulls an arm in $S$, $A'$ does nothing and skips this round. Note that $A'$ always observes at least as much information about the arms in $S$ as $A$ does, so $A'$ can simulate $A$ by feeding the information it observes to $A$ and making decisions according to the behavior of $A$ as described above. Let $T_i$ be the number of times that arm $i$ has been pulled by $A$. At the end of the game, $A'$ samples an arm in $V$ according to the distribution $\left(\frac{T_1}{T}, \ldots, \frac{T_N}{T}\right)$. If the sampled arm is in $V\setminus S$, $A'$ outputs a random arm; otherwise $A'$ outputs the sampled arm. Choose
$$\varepsilon = \left(\frac{1250\sum_{k=1}^{K}\xi_k}{T}\right)^{\frac13}.$$
One can verify that $A'$ is an $(\varepsilon, 0.05)$-PAC algorithm through an argument similar to the one in our proof of Lemma A.1.

Let $T'^{(k')}$ be the number of times that the arms of group $k'$ have been pulled by $A'$ in the $m'$-BAI game. According to Lemma C.7, for each $k' \in [K']$,
$$\mathbb{E}_{H^{m'}_0}\left[T'^{(k')}\right] \ge \frac{c_1\log(m'_{k'}+1)}{\varepsilon^2},$$
where $H^{m'}_0$ is the $m'$-BAI instance consisting of all fair coins. Let $I_0$ denote the graph bandit instance constructed by the above rules from $H^{m'}_0$. Recall that one pull of $A$ corresponds to at most $t_k$ pulls of $A'$ in Case 2. Therefore, when the input is $I_0$, $A$ must pull the arms in $V_k\setminus S_k$ at least $\frac{c_1\lfloor m_k/2\rfloor\log 3}{t_k\varepsilon^2} \ge \frac{c_2 m_k}{t_k\varepsilon^2}$ times if $k$ is in Case 2, and at least $\frac{c_1\log(m_k+1)}{\varepsilon^2}$ times if $k$ is in Case 1. In other words, $A$ must pull the arms in $V_k\setminus S_k$ at least $\frac{\xi_k}{\varepsilon^2}$ times for every $k \in [K]$. Plugging in our choice of $\varepsilon$, $A$ needs to pull the arms in $V\setminus S$ at least $\frac{1}{1250^{2/3}}\left(\sum_{k=1}^{K}\xi_k\right)^{\frac13}T^{\frac23}$ times in total on $I_0$. Since every arm in $V\setminus S$ has loss $1$ while the optimal arm has mean loss at most $\frac12$, these pulls contribute a regret of at least $\frac{1}{2\cdot 1250^{2/3}}\left(\sum_{k=1}^{K}\xi_k\right)^{\frac13}T^{\frac23} > c_3\left(\sum_{k=1}^{K}\xi_k\right)^{\frac13}T^{\frac23}$, which contradicts the assumption in Equation (18). Therefore, there exist loss sequences on which $A$ satisfies
$$R(T) = \Omega\left(\left(\sum_{k=1}^{K}\max\left\{\log m_k, \frac{m_k}{t_k}\right\}\right)^{\frac13}T^{\frac23}\right).$$

Theorem 1.7 confirms a conjecture in (He & Zhang, 2023). It also generalizes the previous lower bound $\Omega\left(\max\left\{\log|S|, \frac{|S|}{t}\right\}^{\frac13}T^{\frac23}\right)$ for weakly observable graphs in (Chen et al., 2021): that bound is recovered by applying Theorem 1.7 with $K = 1$ and $V_1 = V$, where $S \subseteq V$ is a $t$-packing independent set of $G$. As consequences, Theorem 1.7 provides tight lower bounds for several feedback graphs.
For example, when $G$ is the disjoint union of $K$ complete bipartite graphs of sizes $m_1, m_2, \ldots, m_K$ respectively, Theorem 1.7 implies a lower bound of $\Omega\left(\left(\sum_{k\in[K]}\log m_k\right)^{\frac13}T^{\frac23}\right)$, which matches the upper bound in (He & Zhang, 2023).