# Understanding Bandits with Graph Feedback

Houshuang Chen (Shanghai Jiao Tong University, chenhoushuang@sjtu.edu.cn), Zengfeng Huang (Fudan University, huangzf@fudan.edu.cn), Shuai Li (Shanghai Jiao Tong University, shuaili8@sjtu.edu.cn), Chihao Zhang (Shanghai Jiao Tong University, chihao@sjtu.edu.cn)

**Abstract.** The bandit problem with graph feedback, proposed in [Mannor and Shamir, NeurIPS 2011], is modeled by a directed graph $G = (V, E)$ where $V$ is the collection of bandit arms, and once an arm is triggered, all its incident arms are observed. A fundamental question is how the structure of the graph affects the min-max regret. We propose the notions of the fractional weak domination number $\delta^*$ and the $k$-packing independence number, capturing the upper bound and the lower bound for the regret respectively. We show that the two notions are inherently connected by aligning them with the linear program of the weakly dominating set and its dual, the fractional vertex packing set, respectively. Based on this connection, we utilize the strong duality theorem to prove a general regret upper bound $O\big((\delta^* \log |V|)^{1/3}\, T^{2/3}\big)$ and a lower bound $\Omega\big((\delta^*/\alpha)^{1/3}\, T^{2/3}\big)$, where $\alpha$ is the integrality gap of the dual linear program. Therefore, our bounds are tight up to a $(\log |V|)^{1/3}$ factor on graphs with bounded integrality gap for the vertex packing problem, including trees and graphs with bounded degree. Moreover, we show that for several special families of graphs, we can get rid of the $(\log |V|)^{1/3}$ factor and establish optimal regret.

## 1 Introduction

The multi-armed bandit is an extensively studied problem in reinforcement learning. Imagine a player facing an $n$-armed bandit: at each time step, the player pulls one of the $n$ arms and incurs a loss. At the end of each round, the player receives some feedback and tries to make a better choice in the next round. The expected regret is defined as the difference between the player's cumulative loss and the cumulative loss of the single best arm during $T$ rounds. In this article, we assume the loss at each round is given in an adversarial fashion; this is called the adversarial bandit in the literature. The difficulty of the adversarial bandit problem is usually measured by the min-max regret, which is the expected regret of the best strategy against the worst possible loss sequence.

The player's strategy depends on how the feedback is given at each round. One simple type of feedback is *full feedback*, where the player observes the losses of all arms after playing an arm. An important problem studied in this model is online learning with experts [14, 17]. The other extreme, introduced in [8], is the vanilla *bandit feedback*, where the player can only observe the loss of the arm he/she just pulled. Optimal bounds for the regret, either in $n$ or in $T$, are known for both types of feedback.

The work of [24] initiated the study of the generalization of the above two extremes, in which the feedback consists of the losses of a collection of arms. This type of feedback can be naturally described by a feedback graph $G$ where the vertex set is $[n]$ and a directed edge $(i, j)$ means that pulling arm $i$ reveals the loss of arm $j$. Therefore, full feedback means that $G$ is a clique with self-loops, and vanilla bandit feedback means that $G$ consists of $n$ disjoint self-loops. A natural yet challenging question is how the graph structure affects the min-max regret.
The work of [1] systematically investigated this question and proved tight regret bounds in terms of the time horizon $T$. They show that if the graph is *strongly observable*, the regret is $\Theta(T^{1/2})$; if the graph is *weakly observable*, the regret is $\Theta(T^{2/3})$; and if the graph is *non-observable*, the regret is $\Theta(T)$. Here the notions of strongly observable, weakly observable and non-observable roughly indicate the connectivity of the feedback graph and will be formally defined in Section 2. However, unlike the case of full feedback or vanilla bandit feedback, the dependency of the regret on $n$, or more generally on the structure of the graph, is still not well understood. For example, for weakly observable graphs, an upper bound and a lower bound on the regret in terms of the weak domination number $\delta(G)$ were proved in [1], but a large gap exists between the two. This suggests that the weak domination number might not be the correct parameter to characterize the regret.

We make progress on this problem for weakly observable graphs. This family of graphs is general enough to encode almost all feedback patterns of bandits. We introduce the notions of the fractional weak domination number $\delta^*(G)$ and the $k$-packing independence number, and provide evidence that they are the correct graph parameters. The two parameters are closely related and help us improve the upper bound and the lower bound respectively. As the name indicates, $\delta^*(G)$ is the fractional version of $\delta(G)$, namely the optimum of the linear relaxation of the integer program for the weakly dominating set. We observe that this graph parameter has already been used in an algorithm for strongly observable graphs in [3], where it functioned differently. In the following, when the algorithm is clear from the context, we use $R(G, T)$ to denote the regret of the algorithm on the instance $G$ in $T$ rounds. Our main algorithmic result is:

**Theorem 1.** There exists an algorithm such that for any weakly observable graph $G$ and any time horizon $T \ge n^3 \log(n)/\delta^*(G)^2$, its regret satisfies $R(G, T) = O\big((\delta^*(G) \log n)^{1/3}\, T^{2/3}\big)$.

Note that the regret of the algorithm in [1] satisfies $R(G, T) = O\big((\delta(G) \log n)^{1/3}\, T^{2/3}\big)$. The fractional weak domination number $\delta^*$ is always at most $\delta$, and the gap between the two can be as large as $\Theta(\log n)$. We will give an explicit example in Section 4.3 in which the gap matters and our algorithm is optimal. Theorem 1 can be seamlessly extended to more general time-varying graphs and probabilistic graphs; the formal definitions of these models are in Appendix E.

On the other hand, we investigate graph structures that can be used to fool algorithms. We say a set $S$ of vertices is a *$k$-packing independent set* if $S$ is an independent set and any vertex has at most $k$ out-neighbors in $S$. We prove the following lower bound:

**Theorem 2.** Let $G = (V, E)$ be a directed graph. If $G$ contains a $k$-packing independent set $S$ with $|S| \ge 2$, then for any randomized algorithm and any time horizon $T$, there exists a sequence of loss functions such that the expected regret is $\Omega\big(\max\{|S|/k,\, \log |S|\}^{1/3}\, T^{2/3}\big)$.

For every $k \in \mathbb{N}$, we use $\zeta_k$, the $k$-packing independence number, to denote the size of the maximum $k$-packing independent set. To prove Theorem 2, we reduce the problem of minimizing regret to statistical hypothesis testing, for which powerful tools from information theory are available. We can use Theorem 2 to strengthen the lower bounds in [1].
Besides, we show that a large $\delta^*$ usually implies a large $\zeta_1$, via studying the linear programming dual of the fractional weakly dominating set and applying a novel rounding procedure. This is also one of our main technical contributions. Combinatorially, the dual linear program is to find the maximum fractional vertex packing set in the graph. Specifically, we can establish lower bounds in terms of $\delta^*$ by applying Theorem 2:

**Theorem 3.** If $G$ is weakly observable, then for any algorithm and any sufficiently large time horizon $T \in \mathbb{N}$, there exists a sequence of loss functions such that $R(G, T) = \Omega\big((\delta^*/\alpha)^{1/3}\, T^{2/3}\big)$, where $\alpha$ is the integrality gap of the linear program for vertex packing.

Clearly, the exact lower bound is determined by the integrality gap of a certain linear program. For general graphs, we have a universal upper bound $\alpha = O(n/\delta^*)$. For concrete instances, we can obtain clearer and tighter bounds on $\alpha$. For example, the linear program has a constant integrality gap $\alpha$ on graphs of bounded degree.

**Corollary 4.** Let $\Delta \in \mathbb{N}$ be a constant and $\mathcal{G}_\Delta$ be the family of graphs with maximum in-degree $\Delta$. Then for every weakly observable $G = (V, E) \in \mathcal{G}_\Delta$, any algorithm and any sufficiently large time horizon $T \in \mathbb{N}$, there exists a sequence of loss functions such that $R(G, T) = \Omega\big((\delta^*)^{1/3}\, T^{2/3}\big)$.

We also show that for 1-degenerate directed graphs (formally defined in Section 2.1), the integrality gap is 1. This family of graphs includes trees and directed cycles. As a consequence, we have:

**Corollary 5.** Let $G$ be a 1-degenerate weakly observable graph. Then for any algorithm and any sufficiently large time horizon $T \in \mathbb{N}$, there exists a sequence of loss functions such that $R(G, T) = \Omega\big((\delta^*)^{1/3}\, T^{2/3}\big)$.

**Comparison of previous results and our results.** In Table 1, we compare our new upper bounds, lower bounds and their gaps with the previous best results.

Table 1: A comparison of results

| Graph type | Min-max regret (previous best [1]) | Gap | Min-max regret (this work) | Gap |
|---|---|---|---|---|
| General graphs | $O\big((\delta \log n)^{1/3} T^{2/3}\big)$; $\Omega\big(\max\{(\delta/\log^2 n)^{1/3}, 1\}\, T^{2/3}\big)$ | see discussion below; for $\delta = \log^2 n$: $\Omega(T^{2/3})$, gap $O(\log n)$ | $O\big((\delta^* \log n)^{1/3} T^{2/3}\big)$; $\Omega\big(\max\{(\delta^*/\alpha)^{1/3}, 1\}\, T^{2/3}\big)$ | see discussion below; for $\delta = \log^2 n$: $\Omega\big((\log\log n)^{1/3} T^{2/3}\big)$, gap $O\big(\log n/(\log\log n)^{1/3}\big)$ |
| Trees / bounded in-degree | same as general graphs | $O(\log n)$ | $O\big((\delta^* \log n)^{1/3} T^{2/3}\big)$; $\Omega\big((\delta^*)^{1/3} T^{2/3}\big)$ | $O\big((\log n)^{1/3}\big)$ |
| Complete bipartite graphs | $O\big((\log n)^{1/3} T^{2/3}\big)$; $\Omega(T^{2/3})$ | $O\big((\log n)^{1/3}\big)$ | $\Theta\big((\log n)^{1/3} T^{2/3}\big)$ | $O(1)$ |
| Orthogonal relation on $\mathbb{F}_2^k$ | $O\big((\log n)^{2/3} T^{2/3}\big)$; $\Omega(T^{2/3})$ | $O\big((\log n)^{2/3}\big)$ | $\Theta\big((\log n)^{1/3} T^{2/3}\big)$ | $O(1)$ |

**Discussion.** In general, our upper bound is never worse than the previous one since $\delta^* \le \delta$. Our lower bound is not directly comparable to the previously known lower bound, as the two are stated in terms of different parameters. In fact, we could not find an instance on which our lower bound $\Omega\big(\max\{1, (\delta^*/\alpha)^{1/3}\}\, T^{2/3}\big)$ is worse than the previous lower bound $\Omega\big(\max\{1, (\delta/\log^2 n)^{1/3}\}\, T^{2/3}\big)$, and there are instances on which our bound outperforms it. The two key quantities, namely the integrality gap $\delta/\delta^*$ of the primal linear program and the integrality gap $\alpha$ of the dual linear program, seem to be correlated in a graph. The relation between the two bounds is worth further investigation.

**Related work.** The multi-armed bandit problem originated from the sequential decision making under uncertainty studied in [34, 6], and the adversarial bandit is a natural variant introduced by [7]. The work of [24] introduced the graph feedback model with a self-loop on each vertex in order to interpolate between the full feedback and bandit feedback settings.
This model has been extensively studied in the work of [24, 4, 20, 3]. The work of [1] removed the self-loop assumption and considered generalized constrained graphs. They gave a full characterization of the min-max regret in terms of the time horizon $T$. In contrast to fixed graph feedback, recent work of [20, 15, 2, 32] considered time-varying graphs. Another line of recent work [22, 23, 3] studies random graphs, or graphs with probabilistic feedback.

Most algorithms for adversarial bandits are derived from the EXP3 algorithm, e.g. [9, 29]. However, even for the vanilla multi-armed bandit problem, a direct application of EXP3 only yields an upper bound of $O(\sqrt{n \log n \cdot T})$ [9], which is suboptimal. In fact, EXP3 is a special case of the online stochastic mirror descent (OSMD) algorithm with the negentropy function as the potential function. OSMD was developed by [26] and [27] for online optimization. By choosing a more sophisticated potential function, namely the $1/2$-Tsallis entropy function [33], OSMD can achieve the tight bound $\Theta(\sqrt{nT})$.

The idea of using the domination number or related parameters to study the feedback graph has appeared many times in the literature, e.g. [1, 12, 13, 4, 3]. The work of [3] used the idea of the fractional dominating set to study the high-probability regret bound for strongly observable graphs. Other similar works [23, 32, 13] mainly focused on stochastic settings. Follow-up works related to weakly observable graphs mainly considered harder settings, including time-varying graphs [2, 15, 3], bounded-memory adversaries [19] and feedback graphs with switching costs [30, 5]. The recent work of [21] considered bounds with respect to the cumulative loss of the best arm. To the best of our knowledge, there has been no further development on the min-max regret since the work of [1].

## 2 Preliminaries

In this section, we formally describe the problem setting of bandits with graph feedback and introduce the notation, definitions and propositions that will be used later.

Let $n \in \mathbb{N}$. We will use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. Let $x \in \mathbb{R}^n$ be an $n$-dimensional vector. For every $i \in [n]$, we use $x(i)$ to denote the value of its $i$th coordinate. We use $\{e_1, \ldots, e_n\}$ to denote the standard basis of $\mathbb{R}^n$; that is, $e_i \in \mathbb{R}^n$ is the vector with $e_i(j) = 1$ if $j = i$ and $e_i(j) = 0$ otherwise, for every $j \in [n]$. For every $n \in \mathbb{N}$, we define $\Delta_n \triangleq \{x \in \mathbb{R}^n_{\ge 0} : \sum_{i=1}^n x(i) = 1\}$ as the $n$-dimensional probability simplex. Clearly $\Delta_n$ is convex, and every $x \in \Delta_n$ can be viewed as a distribution on $[n]$. Throughout the article, we will sometimes view a function $\ell : [n] \to \mathbb{R}$ equivalently as a vector in $\mathbb{R}^n$, depending on which form is more convenient in the context. With this in mind, we have the inner product $\langle \ell, x \rangle \triangleq \sum_{i \in [n]} \ell(i)\, x(i)$ for every $x \in \mathbb{R}^n$.

### 2.1 Graphs

In this article, we use $G = (V, E)$ to denote a directed graph with possible self-loops but no multiple edges. Each $(u, v) \in E$ indicates a directed edge from $u$ to $v$ in $G$. If we say a graph $G = (V, E)$ is undirected, we view each undirected edge $\{u, v\} \in E$ as two directed edges $(u, v)$ and $(v, u)$. In the following, we assume $|V| \ge 2$ unless otherwise specified. For any $S \subseteq V$, $G[S]$ is the subgraph of $G$ induced by $S$. For every $v \in V$, we define $N_{\mathrm{in}}(v) = \{u \in V : (u, v) \in E\}$ and $N_{\mathrm{out}}(v) = \{u \in V : (v, u) \in E\}$ as the set of in-neighbors and out-neighbors of $v$ respectively. We also call $|N_{\mathrm{in}}(v)|$ and $|N_{\mathrm{out}}(v)|$ the in-degree and out-degree of $v$ respectively.
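To fix a concrete representation for the illustrative snippets in later sections, here is a minimal Python sketch that encodes a feedback graph as a dictionary of out-neighbor sets. The encoding, the helper names and the example graph are our own illustration, not part of the paper:

```python
# A directed feedback graph G = (V, E) as a dict mapping each vertex to its
# set of out-neighbors. Self-loops are allowed.
from typing import Dict, Set

Graph = Dict[int, Set[int]]

def out_neighbors(G: Graph, v: int) -> Set[int]:
    """N_out(v): the vertices whose losses are observed when v is pulled."""
    return G[v]

def in_neighbors(G: Graph, v: int) -> Set[int]:
    """N_in(v): the vertices whose pull reveals the loss of v."""
    return {u for u in G if v in G[u]}

def no_self_loop(G: Graph) -> Set[int]:
    """U: the set of vertices without self-loops."""
    return {v for v in G if v not in G[v]}

# Example: a directed path 1 -> 2 -> 3 in which only vertex 1 has a self-loop.
G_example: Graph = {1: {1, 2}, 2: {3}, 3: set()}
assert no_self_loop(G_example) == {2, 3}
assert in_neighbors(G_example, 3) == {2}
```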
A set $S \subseteq V$ is an *independent set* if there are no $u, v \in S$ such that $(u, v) \in E$. Note that we do not consider an isolated vertex with a self-loop as an independent set. A vertex $v \in V$ is *strongly observable* if $(v, v) \in E$ or $(u, v) \in E$ for all $u \in V \setminus \{v\}$. A vertex $v \in V$ is *non-observable* if $N_{\mathrm{in}}(v) = \emptyset$. A directed graph $G$ is called strongly observable if every vertex of $G$ is strongly observable. It is called non-observable if it contains at least one non-observable vertex. The graph is called weakly observable if it is neither strongly observable nor non-observable.

We say a directed graph $G$ is *1-degenerate* if one can iteratively apply the following two operations in arbitrary order on $G$ to obtain the empty graph:

- pick a vertex with in-degree one and remove its in-edge;
- pick a vertex with in-degree zero and out-degree at most one, and remove both the vertex and its out-edge.

Typical 1-degenerate graphs include trees (directed or undirected) and directed cycles.

Let $U = \{i \in V : i \notin N_{\mathrm{in}}(i)\}$ denote the set of vertices without self-loops. Consider the following linear program defined on $G$, which we call (P):

$$\text{minimize} \quad \sum_{i \in V} x_i \quad \text{subject to} \quad \sum_{i \in N_{\mathrm{in}}(j)} x_i \ge 1 \;\; \forall j \in U; \qquad 0 \le x_i \le 1 \;\; \forall i \in V. \tag{P}$$

We use $\delta^*(G)$ to denote the optimum of the above linear program and call it the *fractional weak domination number* of $G$. The dual of this linear program, which we call (D), is

$$\text{maximize} \quad \sum_{j \in U} y_j \quad \text{subject to} \quad \sum_{j \in N_{\mathrm{out}}(i) \cap U} y_j \le 1 \;\; \forall i \in V; \qquad 0 \le y_j \le 1 \;\; \forall j \in U. \tag{D}$$

We use $\zeta^*(G)$ to denote the optimum of the dual and call it the *fractional vertex packing number* of $G$. It then follows from the strong duality theorem of linear programming (see e.g. [11]) that $\delta^*(G) = \zeta^*(G)$.

We remark that in [1], the weakly (integral) dominating set was defined to dominate all *weakly observable vertices* instead of *vertices without self-loops*. The two definitions are equivalent for all results in this article; see Appendix A for more explanation.
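As a concrete companion to (P), the following sketch computes $\delta^*(G)$ with an off-the-shelf LP solver. It is a minimal illustration assuming the dict-of-sets encoding above; the paper does not prescribe a solver, and `scipy.optimize.linprog` is simply one choice:

```python
import numpy as np
from scipy.optimize import linprog

def fractional_weak_domination(G):
    """Solve (P): minimize sum_i x_i subject to sum_{i in N_in(j)} x_i >= 1
    for all j in U, and 0 <= x_i <= 1. Returns (delta_star, x)."""
    V = sorted(G)
    idx = {v: i for i, v in enumerate(V)}
    U = [v for v in V if v not in G[v]]          # vertices without self-loops
    # linprog minimizes c @ x with A_ub @ x <= b_ub, so we negate the
    # covering constraints sum_{i in N_in(j)} x_i >= 1.
    A_ub = np.zeros((len(U), len(V)))
    for row, j in enumerate(U):
        for u in V:
            if j in G[u]:                        # u is an in-neighbor of j
                A_ub[row, idx[u]] = -1.0
    b_ub = -np.ones(len(U))
    c = np.ones(len(V))
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0.0, 1.0)] * len(V))
    assert res.success
    return res.fun, {v: res.x[idx[v]] for v in V}

# Example: on a directed 4-cycle without self-loops, every vertex's unique
# in-neighbor must receive weight 1, so delta(G) = delta*(G) = 4.
cycle = {0: {1}, 1: {2}, 2: {3}, 3: {0}}
delta_star, x = fractional_weak_domination(cycle)
print(delta_star)  # approximately 4.0
```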
### 2.2 Bandits

Let $G = (V, E)$ be a directed graph where $V = [n]$ is the collection of bandit arms. Let $T \in \mathbb{N}$ be the time horizon, which is known beforehand. The bandit problem is an online decision game running for $T$ rounds. The player designs an algorithm $\mathcal{A}$ with the following behavior in each round $t$ of the game:

- the algorithm $\mathcal{A}$ chooses an arm $A_t \in [n]$;
- an adversary privately provides a loss function $\ell_t : [n] \to [0, 1]$ and $\mathcal{A}$ pays the loss $\ell_t(A_t)$;
- the algorithm receives the feedback $\{\ell_t(j) : j \in N_{\mathrm{out}}(A_t)\}$.

The expected regret of the algorithm $\mathcal{A}$ against a specific loss sequence $\boldsymbol{\ell} = \{\ell_1, \ldots, \ell_T\}$ is defined by
$$R(G, T, \mathcal{A}, \boldsymbol{\ell}) = \mathbb{E}\Big[\sum_{t=1}^T \ell_t(A_t)\Big] - \min_{a \in [n]} \sum_{t=1}^T \ell_t(a).$$
Note that we take the expectation over the algorithm's randomness, since $\mathcal{A}$ might be randomized, and it is not hard to see that randomization is necessary to guarantee $o(T)$ regret due to the adversarial nature of the loss sequence. The purpose of the problem is to design an algorithm performing well against the worst loss sequence, namely to determine the min-max regret $R(G, T) \triangleq \inf_{\mathcal{A}} \sup_{\boldsymbol{\ell}} R(G, T, \mathcal{A}, \boldsymbol{\ell})$.

There is another model called *stochastic bandits*, in which the loss function at each round is not adversarially chosen but sampled from a fixed distribution. This model is no harder than the one introduced above, in the sense that any algorithm performing well in the adversarial setting also performs well in the stochastic setting. Therefore, we will construct instances of stochastic bandits to derive lower bounds in Section 4.

### 2.3 Optimization

Our upper bound is obtained via the online mirror descent algorithm. In this section, we collect a minimal set of terminologies needed to understand the algorithm. More details about the approach and its application to online decision making can be found in e.g. [28].

Let $C \subseteq \mathbb{R}^n$. We use $\mathrm{int}(C)$ to denote the interior of $C$. For a convex function $\Psi : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$, $\mathrm{dom}(\Psi) \triangleq \{x : \Psi(x) < \infty\}$ is the domain of $\Psi$. Assume $\Psi$ is differentiable in its domain. For every $x, y \in \mathrm{dom}(\Psi)$,
$$B_\Psi(x, y) \triangleq \Psi(x) - \Psi(y) - \langle x - y, \nabla\Psi(y) \rangle \ge 0$$
is the Bregman divergence between $x$ and $y$ with respect to the convex function $\Psi$. The diameter of $C$ with respect to $\Psi$ is $D_\Psi(C) \triangleq \sup_{x, y \in C} \{\Psi(x) - \Psi(y)\}$. Let $A \in \mathbb{R}^{n \times n}$ be a positive semi-definite matrix and $x \in \mathbb{R}^n$ be a vector. We define $\|x\|_A \triangleq \sqrt{x^{\mathsf{T}} A x}$ as the norm of $x$ with respect to $A$.

## 3 The algorithm

In this section, we design an algorithm achieving the upper bound in Theorem 1. The proof is in Appendix B.

### 3.1 Online stochastic mirror descent (OSMD)

Our algorithm is based on the online stochastic mirror descent framework, which has been widely used for bandit problems in various settings. Assuming the set of arms is $[n]$, possibly with additional structure, a typical OSMD algorithm consists of the following steps:

- Pick some initial distribution $X_1$ over the $n$ arms.
- For each round $t = 1, 2, \ldots, T$:
  - Tweak $X_t$ according to the problem structure to get a distribution $\bar{X}_t$ over the $n$ arms.
  - The adversary chooses some (unknown) loss vector $\ell_t : [n] \to [0, 1]$ with knowledge of all previous information, including $\bar{X}_t$. The algorithm then picks an arm $A_t \sim \bar{X}_t$ and pays the loss $\ell_t(A_t)$. After this, the algorithm observes some (partial) information $\Phi_t$ about $\ell_t$.
  - Construct an estimator $\hat{\ell}_t$ of $\ell_t$ using the collected information $\Phi_t$, $A_t$ and $\bar{X}_t$.
  - Compute an updated distribution $X_{t+1}$ from $X_t$ using mirror descent with a pre-specified potential function $\Psi$ and the estimated gradient $\hat{\ell}_t$.

Although the OSMD framework is standard, several key ingredients are left for the algorithm designer to specify: How to construct the distribution $\bar{X}_t$ from $X_t$? How to construct the estimator $\hat{\ell}_t$? How to pick the potential function $\Psi$? In general, filling in these blanks relies heavily on the problem structure and sometimes requires ingenious constructions to achieve low regret. We will first describe our algorithm and then explain our choices in Section 3.2.

### 3.2 The algorithm for bandits with graph feedback

Let $G = (V, E)$ be the input directed graph with $V = [n]$. A few offline preprocessing steps are required before the online part of the algorithm. We first solve the linear program (P) to get an optimal solution $\{x^*_i\}_{i \in [n]}$. Recall that $\delta^*(G) = \sum_{i \in [n]} x^*_i$ is the fractional weak domination number of $G$. Define a distribution $u \in \Delta_n$ by normalizing $\{x^*_i\}_{i \in [n]}$, i.e., let $u(i) = x^*_i / \sum_{j \in [n]} x^*_j$ for all $i \in [n]$. The distribution $u$ will serve as the *exploration distribution*, whose role will be explained later. Define the parameters
$$\gamma = \Big(\frac{\delta^*(G) \log n}{T}\Big)^{1/3}, \qquad \eta = \frac{\gamma^2}{\delta^*(G)},$$
where $\gamma$ is the exploration rate and $\eta$ is the step size of the mirror descent. Finally, let the potential function $\Psi : \mathbb{R}^n_{\ge 0} \to \mathbb{R}$ be $x \mapsto \sum_{i=1}^n x(i) \log x(i)$, with the convention that $0 \log 0 = 0$. When restricted to $\Delta_n$, $\Psi(x)$ is the negative entropy of the distribution $x$.

Algorithm 1: Online stochastic mirror descent with exploration

```
X_1 <- argmin_{a in Δ_n} Ψ(a)
for t = 1, 2, ..., T do
    X̄_t <- (1 - γ) X_t + γ u                    // use u to explore with rate γ
    play A_t ~ X̄_t and observe ℓ_t(j) for all j in N_out(A_t)
    for every j in [n]:                          // if j ∉ N_out(A_t), ℓ_t(j) is unset and ℓ̂_t(j) = 0
        ℓ̂_t(j) <- 1[j ∈ N_out(A_t)] · ℓ_t(j) / Σ_{i ∈ N_in(j)} X̄_t(i)
    X_{t+1} <- argmin_{x in Δ_n} η⟨x, ℓ̂_t⟩ + B_Ψ(x, X_t)   // mirror descent update w.r.t. Ψ
end
```
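With the negentropy potential, the mirror descent step in Algorithm 1 has the closed form $X_{t+1}(i) \propto X_t(i)\exp(-\eta\,\hat\ell_t(i))$, so the whole algorithm fits in a short sketch. The following is a hedged illustration rather than the authors' reference implementation: it assumes the graph encoding and the `fractional_weak_domination` helper from the earlier sketches, losses supplied up front as a $T \times n$ array, and $T$ large enough that $\gamma \le 1$:

```python
import numpy as np

def osmd_with_exploration(G, losses, rng=np.random.default_rng(0)):
    """Run Algorithm 1 on a loss matrix `losses` of shape (T, n), where
    column i corresponds to the i-th smallest vertex label of G.
    Returns the sequence of pulled arms."""
    V = sorted(G)
    n, T = len(V), losses.shape[0]
    delta_star, x = fractional_weak_domination(G)    # from the earlier sketch
    u = np.array([x[v] for v in V]) / delta_star     # exploration distribution
    gamma = (delta_star * np.log(n) / T) ** (1 / 3)  # exploration rate; needs T large enough that gamma <= 1
    eta = gamma ** 2 / delta_star                    # mirror descent step size
    X = np.full(n, 1.0 / n)                          # argmin of negentropy over the simplex
    arms = []
    for t in range(T):
        Xbar = (1 - gamma) * X + gamma * u           # mix in exploration
        a = rng.choice(n, p=Xbar)
        arms.append(V[a])
        # j is observed iff the pulled arm is an in-neighbor of j.
        obs = np.array([1.0 if j in G[V[a]] else 0.0 for j in V])
        # Probability that j is observed at round t: sum of Xbar over N_in(j).
        p_obs = np.array([sum(Xbar[i] for i in range(n) if j in G[V[i]]) for j in V])
        lhat = np.where(obs > 0, losses[t] / np.maximum(p_obs, 1e-12), 0.0)
        # Closed form of the mirror descent step for the negentropy potential:
        # a multiplicative-weights update followed by normalization.
        X = X * np.exp(-eta * lhat)
        X /= X.sum()
    return arms
```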
Clearly, our algorithm implements the OSMD framework by specializing the three ingredients mentioned in Section 3.1. We choose $\bar{X}_t = (1 - \gamma) X_t + \gamma u$. This means that our algorithm basically follows $X_t$, but with probability $\gamma$ explores the arms according to $u$. The reason for doing this is to guarantee that each arm has some not-too-small chance of being observed. It will be clear from the analysis of OSMD that the performance of the algorithm depends on the variance of $\hat{\ell}_t$, and a lower bound on each $\bar{X}_t(i)$ implies an upper bound on the variance. On the other hand, we cannot choose $\gamma$ too large, since it is $X_t$ that contains the information on which arm is good, and our probability of following $X_t$ is $1 - \gamma$. Therefore, our choice of $\gamma$ optimizes the trade-off between the two effects.

The Exp3.G algorithm in [1] used the uniform distribution over a weakly dominating set as the exploration distribution instead of $u$; this is the only difference between the two algorithms and leads to the different graph parameters in the regret bounds. Moreover, our exploration distribution can be computed efficiently by solving the linear program (P), while it is NP-hard to compute theirs. Our estimator $\hat{\ell}_t$ is a simple unbiased estimator for $\ell_t$, namely $\mathbb{E}[\hat{\ell}_t] = \ell_t$. The potential function we use is the negative entropy function.

## 4 Lower bounds

In this section, we prove several lower bounds on the regret in terms of different graph parameters. All the lower bounds obtained in this section are based on a meta lower bound (Theorem 2) via the notion of a $k$-packing independent set.

**Definition 6.** Let $G = (V, E)$ be a directed graph and $k \in \mathbb{N}$. A set $S \subseteq V$ is a *$k$-packing independent set* of $G$ if (1) $S$ is an independent set; and (2) for any $v \in V$, it holds that $|N_{\mathrm{out}}(v) \cap S| \le k$.

Intuitively, if a graph contains a large $k$-packing independent set $S$, then one can construct a hard instance as follows. All arms in $V \setminus S$ are bad, say with loss 1; all arms in $S$ have loss $\mathrm{Ber}(1/2)$, except a special one with loss $\mathrm{Ber}(1/2 - \varepsilon)$. Any algorithm with low regret must then successfully identify the special arm in $S$ without observing arms in $S$ too often (since each observation of arms in $S$ comes from pulling $V \setminus S$, which costs a penalty of at least $1/2$ in the regret), and we can tweak the parameters to make this impossible. A similar idea already appeared in [1]. However, we formally identify the problem of minimizing regret on this family of instances with the problem of best arm identification; therefore, the stronger lower bound $\Omega\big(\max\{|S|/k, \log |S|\}^{1/3}\, T^{2/3}\big)$ of Theorem 2 can be obtained using tools from information theory.
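To make this construction concrete, here is a small sketch that samples such a stochastic instance for a given $k$-packing independent set $S$. The function and parameter names are ours, and the exact choice of $\varepsilon$ used in the proof of Theorem 2 is in Appendix D:

```python
import numpy as np

def hard_instance(n, S, eps, T, rng=np.random.default_rng(0)):
    """Sample a T x n loss matrix (arms labeled 0..n-1): arms outside S always
    lose 1; arms in S lose Ber(1/2), except one uniformly chosen special arm
    in S, which loses Ber(1/2 - eps)."""
    S = list(S)
    special = S[rng.integers(len(S))]
    losses = np.ones((T, n))
    for j in S:
        bias = 0.5 - eps if j == special else 0.5
        losses[:, j] = rng.binomial(1, bias, size=T)
    return losses, special
```

Any low-regret algorithm must identify `special`, while every observation of an arm in $S$ requires pulling a bad arm in $V \setminus S$; tuning $\varepsilon$ to balance these two costs yields the bound of Theorem 2.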
**Remark.** If the maximum independent set of the graph $G$ has size one and $G$ is weakly observable, then it has been shown in [1] that $R(G, T) = \Omega(T^{2/3})$ for any algorithm. If the graph has no independent set, which means each vertex has a self-loop, then the graph is strongly observable and its regret can be $O(T^{1/2})$; in particular, the problem of vanilla $n$-armed bandits falls into this case.

We delay the proof of Theorem 2 to Appendix D and discuss some of its consequences in the remainder of this section. We recover and strengthen the lower bound based on the (integral) weak domination number of [1] in Section 4.1. Then we prove Theorem 3 in Section 4.2 and discuss some of its useful corollaries. Finally, we discuss in Section 4.3 when our lower bounds are optimal. Our main technical contribution here is that we relate $\delta^*$ to the lower bound as well. This is achieved by applying the strong duality theorem of linear programming and using a new rounding method to construct hard instances from fractional solutions of the dual linear program. This approach towards the lower bounds is much cleaner, and in many cases stronger, than the previous ones in [1]. To the best of our knowledge, the method is new to the community of bandit algorithms.

### 4.1 Lower bound via the (integral) weak domination number

We first use Theorem 2 to recover and strengthen the lower bounds in [1]. Let $G = (V, E)$ be a directed graph and $U \subseteq V$ be the set of vertices without self-loops. A weakly dominating set of $U$ is a set $S \subseteq V$ such that for every $u \in U$, there exists some $v \in S$ with $(v, u) \in E$. The weak domination number, denoted by $\delta(G)$, is the size of the minimum weakly dominating set of $U$. In fact, $\delta(G)$ is the optimum of the integral restriction of the linear program (P) in Section 2.1:

$$\text{minimize} \quad \sum_{i \in V} x_i \quad \text{subject to} \quad \sum_{i \in N_{\mathrm{in}}(j)} x_i \ge 1 \;\; \forall j \in U; \qquad x_i \in \{0, 1\} \;\; \forall i \in V. \tag{P$'$}$$

The following structural lemma was proved in [1].

**Lemma 7.** The graph $G$ contains a $(\log n)$-packing independent set $S$ of size at least $\frac{\delta(G)}{50 \log n}$.

Applying Lemma 7 to Theorem 2, we obtain:

**Theorem 8.** If $G$ is weakly observable, then for any algorithm and any sufficiently large time horizon $T \in \mathbb{N}$, there exists a sequence of loss functions such that
$$R(G, T) = \Omega\Big(\max\Big\{\frac{\delta(G)}{\log^2 n},\; \log\frac{\delta(G)}{\log n}\Big\}^{1/3}\, T^{2/3}\Big).$$

Note that the bound in [1] is $R(G, T) = \Omega\big(\max\{\delta(G)/\log^2 n,\, 1\}^{1/3}\, T^{2/3}\big)$. Theorem 8 outperforms it when $\delta(G) = \omega(\log n)$ and $\delta(G) = o(\log^2 n \cdot \log\log n)$.

### 4.2 Lower bound via the linear program dual

In this section, we use Theorem 2 to derive lower bounds in terms of $\delta^*(G)$. Recall the linear program (D) in Section 2.1, whose optimum is $\zeta^*(G) = \delta^*(G)$ by the strong duality theorem of linear programming. Consider its integral restriction (D$'$):

$$\text{maximize} \quad \sum_{j \in U} y_j \quad \text{subject to} \quad \sum_{j \in N_{\mathrm{out}}(i) \cap U} y_j \le 1 \;\; \forall i \in V; \qquad y_j \in \{0, 1\} \;\; \forall j \in U. \tag{D$'$}$$

For every feasible solution $\{\hat{y}_j\}_{j \in U}$ of (D$'$), the set $S \triangleq \{j \in U : \hat{y}_j = 1\}$ is called a *vertex packing set* on $U$. It enjoys the property that for every $i \in V$, $|N_{\mathrm{out}}(i) \cap S| \le 1$. Let $\zeta(G)$ be the optimum of (D$'$), namely the size of the maximum vertex packing set on $U$, and let $\alpha \triangleq \zeta^*(G)/\zeta(G)$ be the integrality gap of (D). In the following, we write $\delta^*, \delta, \zeta^*, \zeta$ instead of $\delta^*(G), \delta(G), \zeta^*(G), \zeta(G)$ when the graph $G$ is clear from the context.

We can use a greedy algorithm to find a 1-packing independent set of size at least $|S|/3$ inside any vertex packing set $S$; Theorem 3 then follows from Theorem 2.

Theorem 3 is less informative if we do not know how large the integrality gap $\alpha$ is. On the other hand, the integrality gap of linear programs for packing problems has been well studied in the literature on approximation algorithms. The following bound, due to [31], is tight for general graphs.

**Lemma 9 ([31]).** For any directed graph $G$, the integrality gap $\alpha = O(n/\delta^*)$.

**Corollary 10.** If $G$ is weakly observable, then for any algorithm and any sufficiently large time horizon $T \in \mathbb{N}$, there exists a sequence of loss functions such that $R(G, T) = \Omega\big((\delta^{*2}/n)^{1/3}\, T^{2/3}\big)$.

The bound on the integrality gap in Lemma 9 holds for all graphs; it can be greatly improved for special graphs. An interesting family of graphs with small $\alpha$ is that of bounded-degree graphs. If the in-degree of every vertex in $U$ is bounded, we have the following bound on the integrality gap:

**Lemma 11 ([10]).** If the in-degree of every vertex in $U$ is bounded by a constant $\Delta$, then the integrality gap $\alpha \le 8\Delta$.

**Corollary 12.** Let $\Delta \in \mathbb{N}$ be a constant and $\mathcal{G}_\Delta$ be the family of graphs with maximum in-degree $\Delta$.
Then for every weakly observable $G = (V, E) \in \mathcal{G}_\Delta$, any algorithm and any sufficiently large time horizon $T \in \mathbb{N}$, there exists a sequence of loss functions such that $R(G, T) = \Omega\big((\delta^*)^{1/3}\, T^{2/3}\big)$.

Next, we show that the integrality gap of another broad family of graphs, the 1-degenerate graphs, is 1. The family of 1-degenerate graphs was defined in Section 2.1; it includes trees (both directed and undirected) and directed cycles. The proof of the following lemma is in Appendix C.2.

**Lemma 13.** For any 1-degenerate directed graph, the integrality gap $\alpha = 1$.

**Corollary 14.** Let $G$ be a 1-degenerate weakly observable graph. Then for any algorithm and any sufficiently large time horizon $T \in \mathbb{N}$, there exists a sequence of loss functions such that $R(G, T) = \Omega\big((\delta^*)^{1/3}\, T^{2/3}\big)$.

We obtained in Theorem 1 that $R(G, T) = O\big((\delta^* \log n)^{1/3}\, T^{2/3}\big)$; therefore our lower bound is tight up to a factor of $(\log n)^{1/3}$ on 1-degenerate graphs and graphs of bounded degree.

### 4.3 Instances with optimal regret

In this section, we examine several families of graphs for which the optimal regret bound can be obtained using the tools developed in this article.

#### 4.3.1 Complete bipartite graphs

Let $G = (V_1 \cup V_2, E)$ be an undirected complete bipartite graph with $n = |V_1| + |V_2|$. Clearly $\delta = \delta^* = 2$. Therefore both our Theorem 1 and the algorithm in [1] satisfy $R(G, T) = O\big((\log n)^{1/3}\, T^{2/3}\big)$. Assuming without loss of generality that $|V_1| \ge |V_2|$, the set $V_1$ is a $|V_1|$-packing independent set of size at least $n/2$. It therefore follows from Theorem 2 that any algorithm satisfies $R(G, T) = \Omega\big((\log n)^{1/3}\, T^{2/3}\big)$. Note that the lower bound in [1] is only $\Omega(T^{2/3})$ for this instance.
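As a quick sanity check, one can verify $\delta^* = 2$ numerically with the `fractional_weak_domination` sketch from Section 2.1 (a hypothetical usage example; it assumes Python 3.9+ for the dict-union operator):

```python
# Complete bipartite graph K_{2,3} as a directed graph: each undirected edge
# becomes two directed edges, and there are no self-loops, so U = V.
V1, V2 = [0, 1], [2, 3, 4]
K23 = {v: set(V2) for v in V1} | {v: set(V1) for v in V2}
delta_star, _ = fractional_weak_domination(K23)
print(delta_star)  # 2.0: one vertex on each side weakly dominates everything
```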
#### 4.3.2 Orthogonal relation on $\mathbb{F}_2^k$

Our algorithm outperforms the one in [1] when $\delta^* \ll \delta$. Let us now construct a family of graphs where $\delta/\delta^* = \Omega(\log n)$. Let $\mathbb{F}_2$ be the finite field with two elements and let $k \in \mathbb{N}$ be sufficiently large. The vertex set of the undirected graph $G = (V_1 \cup V_2, E)$ consists of two disjoint parts $V_1$ and $V_2$, both isomorphic to $\mathbb{F}_2^k \setminus \{0\}$, so we can write $V_1 = \{x_\alpha\}_{\alpha \in \mathbb{F}_2^k \setminus \{0\}}$ and $V_2 = \{y_\beta\}_{\beta \in \mathbb{F}_2^k \setminus \{0\}}$. The set of edges $E$ is as follows:

- $E$ is such that $G[V_1]$ is a clique and $G[V_2]$ is an independent set;
- for every $x_\alpha \in V_1$ and $y_\beta \in V_2$, $\{x_\alpha, y_\beta\} \in E$ iff $\langle \alpha, \beta \rangle = \sum_{i=1}^k \alpha(i)\, \beta(i) = 1$, where all multiplications and summations are in $\mathbb{F}_2$.

We will show that the upper bound and the lower bound on the regret for this instance proved in [1] based on $\delta$ are $O\big((\log n)^{2/3}\, T^{2/3}\big)$ and $\Omega(T^{2/3})$ respectively. However, we achieve the optimal $\Theta\big((\log n)^{1/3}\, T^{2/3}\big)$ regret. We thus conclude that both our new upper bound and our new lower bound are crucial for the optimal regret on this family of instances. The details can be found in Appendix C.3.

## 5 Conclusion

In this article, we introduced the notions of the fractional weak domination number and the $k$-packing independence number to prove new upper bounds and lower bounds on the regret of bandits with graph feedback. Our results imply optimal regret on several families of graphs. Although some gaps remain in general, we believe that these two notions are the correct graph parameters to characterize the complexity of the problem. We now list a few interesting problems worth future investigation.

- Let $G$ be an $n$-vertex undirected cycle. What is the regret on this instance? Theorem 1 implies an upper bound $O\big((n \log n)^{1/3}\, T^{2/3}\big)$ and Theorem 2 implies a lower bound $\Omega\big(n^{1/3}\, T^{2/3}\big)$.
- The lower bound $\Omega\big((\delta^*/\alpha)^{1/3}\, T^{2/3}\big)$ in Theorem 3 for general graphs is not satisfactory. That lower bound is proved via the construction of a 1-packing independent set. This construction does not unleash the full power of Theorem 2, since the lower bound in the theorem applies to any $k$-packing independent set. It is still possible to construct larger $k$-packing independent sets via rounding the linear program (D) to a less integral solution.
- Is Theorem 2 tight? In fact, the bound for BESTARMID, the problem constructed to prove Theorem 2 in the full version of the paper, is tight since a matching upper bound exists. Therefore, one needs new constructions of hard instances to improve Theorem 2, if possible.

## Acknowledgements

This work was partly supported by the National Key Research and Development Program of China (2020AAA0107600), National Natural Science Foundation of China Grants (62006151, 62076161, 61802069, 61902241), the Shanghai Sailing Program, and Shanghai Science and Technology Commission Grant (17JC1420200).

## References

[1] Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Annual Conference on Learning Theory, volume 40. Microtome Publishing, 2015.

[2] Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. arXiv e-prints, arXiv:1502, 2015.

[3] Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.

[4] Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. Advances in Neural Information Processing Systems, 26:1610–1618, 2013.

[5] Raman Arora, Teodor V. Marinov, and Mehryar Mohri. Bandits with feedback graphs and switching costs. Advances in Neural Information Processing Systems, 32, 2019.

[6] Kenneth J. Arrow, David Blackwell, and Meyer A. Girshick. Bayes and minimax solutions of sequential decision problems. Econometrica, Journal of the Econometric Society, pages 213–244, 1949.

[7] Peter Auer and Nicolò Cesa-Bianchi. On-line learning with malicious noise and the closure algorithm. Annals of Mathematics and Artificial Intelligence, 23(1):83–99, 1998.

[8] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

[9] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[10] Nikhil Bansal, Nitish Korula, Viswanath Nagarajan, and Aravind Srinivasan. Solving packing integer programs via randomized rounding with alterations. Theory of Computing, 8(1):533–565, 2012.

[11] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[12] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B. Shroff. Multi-armed bandits in the presence of side observations in social networks. In 52nd IEEE Conference on Decision and Control, pages 7309–7314. IEEE, 2013.

[13] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B. Shroff. Stochastic bandits with side observations on networks. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, pages 289–300, 2014.

[14] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[15] Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819. PMLR, 2016.
[16] Thomas M. Cover. Elements of Information Theory. John Wiley & Sons, 1999.

[17] Elad Eban, Aharon Birnbaum, Shai Shalev-Shwartz, and Amir Globerson. Learning the experts for online sequence prediction. 2012.

[18] Robert M. Fano. Transmission of information: A statistical theory of communications. American Journal of Physics, 29(11):793–794, 1961.

[19] Zhili Feng and Po-Ling Loh. Online learning with graph-structured feedback against adaptive adversaries. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 931–935. IEEE, 2018.

[20] Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Neural Information Processing Systems, pages 613–621, 2014.

[21] Chung-Wei Lee, Haipeng Luo, and Mengxiao Zhang. A closer look at small-loss bounds for bandits with graph feedback. In Conference on Learning Theory, pages 2516–2564. PMLR, 2020.

[22] Shuai Li, Wei Chen, Zheng Wen, and Kwong-Sak Leung. Stochastic online learning with probabilistic graph feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4675–4682, 2020.

[23] Fang Liu, Swapna Buccapatnam, and Ness Shroff. Information directed sampling for stochastic bandits with graph feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[24] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 684–692, 2011.

[25] Shie Mannor and John N. Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.

[26] Arkadi Nemirovski. Efficient methods for large-scale convex optimization problems. Ekonomika i Matematicheskie Metody, 15(1), 1979.

[27] Arkadij Semenovič Nemirovskij and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.

[28] Francesco Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.

[29] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pages 784–791, 2008.

[30] Anshuka Rangi and Massimo Franceschetti. Online learning with feedback graphs and switching costs. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2435–2444. PMLR, 2019.

[31] Aravind Srinivasan. Improved approximation guarantees for packing and covering integer programs. SIAM Journal on Computing, 29(2):648–670, 1999.

[32] Aristide Tossou, Christos Dimitrakakis, and Devdatt Dubhashi. Thompson sampling for stochastic bandits with graph feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

[33] Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1):479–487, 1988.

[34] Abraham Wald. Sequential Analysis. 1947.

[35] Julian Zimmert and Tor Lattimore. Connections between mirror descent, Thompson sampling and the information ratio. In Advances in Neural Information Processing Systems, pages 11973–11982, 2019.