# Bandits with Feedback Graphs and Switching Costs

Raman Arora, Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21204, arora@cs.jhu.edu
Teodor V. Marinov, Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21204, tmarino2@jhu.edu
Mehryar Mohri, Google Research & Courant Institute of Mathematical Sciences, New York, NY 10012, mohri@google.com

We study the adversarial multi-armed bandit problem where the learner is supplied with partial observations modeled by a feedback graph and where shifting to a new action incurs a fixed switching cost. We give two new algorithms for this problem in the informed setting. Our best algorithm achieves a pseudo-regret of $\widetilde{O}\big(\gamma(G)^{1/3} T^{2/3}\big)$, where $\gamma(G)$ is the domination number of the feedback graph $G$. This significantly improves upon the previous best result for the same problem, which was based on the independence number of $G$. We also present matching lower bounds for our result that we describe in detail. Finally, we give a new algorithm with improved policy regret bounds when partial counterfactual feedback is available.

1 Introduction

A general framework for sequential learning is that of online prediction with expert advice [Littlestone and Warmuth, 1994, Cesa-Bianchi et al., 1997, Freund and Schapire, 1997], which consists of repeated interactions between a learner and the environment. The learner maintains a distribution over a set of experts or actions. At each round, the loss assigned to each action is revealed. The learner incurs the expected value of these losses for her current distribution and then updates her distribution. The learner's goal is to minimize her regret, which, in the simplest case, is defined as the difference between her cumulative loss over a finite number of rounds of interaction and that of the best expert in hindsight. The scenario just described corresponds to the so-called full information setting, where the learner is informed of the loss of all actions at each round. In the bandit setting, only the loss of the action the learner selects is revealed. These settings are both special instances of a general model of online learning with side information introduced by Mannor and Shamir [2011], where the information available to the learner is specified by a feedback graph. In an undirected feedback graph, each vertex represents an action, and an edge between vertices $a$ and $a'$ indicates that the loss of action $a'$ is observed when action $a$ is selected, and vice-versa. The bandit setting corresponds to a feedback graph reduced to only self-loops at each vertex, and the full information setting to a fully connected graph. Online learning with feedback graphs has been further extensively analyzed by Alon et al. [2013, 2017] and several other authors [Alon et al., 2015, Kocák et al., 2014, Cohen et al., 2016, Yun et al., 2018, Cortes et al., 2018].

In many applications, the learner also incurs a cost when switching to a new action. Consider, for example, a commercial bank that issues various credit card products, many of which are similar, e.g., different branded cards with comparable fees and interest rates. At each round, the bank offers a specific product to a particular sub-population (e.g., customers at a store). The payoff observed for this action also reveals feedback for related cards and similar sub-populations. At the same time, offering a different product to a group incurs a switching cost in terms of designing a new marketing campaign.
Another example of a problem with feedback graphs and switching costs is a large company seeking to allocate and reallocate employees to different tasks so that productivity is maximized. Employees with similar skills, e.g., technical expertise or people skills, can be expected to perform comparably on the same task. Reassigning employees between tasks, however, is associated with a cost for retraining and readjustment time. We refer the reader to Appendix B for more motivating examples.

The focus of this paper is to understand the fundamental tradeoffs between exploration and exploitation in online learning with feedback graphs and switching costs, and to design learning algorithms with provably optimal guarantees. We consider the general case of a feedback graph $G$ with a set of vertices or actions $V$. In the expert setting with no switching cost, the min-max optimal regret is achieved by the weighted-majority or Hedge algorithm [Littlestone and Warmuth, 1994, Freund and Schapire, 1997], and is in $\Theta\big(\sqrt{\log(|V|)\,T}\big)$. In the bandit setting, the extension of these algorithms, EXP3 [Auer et al., 2002], achieves a regret of $O\big(\sqrt{|V|\log(|V|)\,T}\big)$. The min-max optimal regret of $\Theta\big(\sqrt{|V|\,T}\big)$ can be achieved by the INF algorithm [Audibert and Bubeck, 2009]. The $\sqrt{|V|}$ term in the bandit setting is inherently related to the additional exploration needed to observe the loss of all actions. The scenario of online learning with side information modeled by feedback graphs, which interpolates between the full information and the bandit settings, was introduced by Mannor and Shamir [2011]. When the feedback graph $G$ is fixed over time and undirected, a regret of the order $O\big(\sqrt{\alpha(G)\log(|V|)\,T}\big)$ can be achieved, with a lower bound of $\Omega\big(\sqrt{\alpha(G)\,T}\big)$, where $\alpha(G)$ denotes the independence number of $G$. There has been a large body of work studying different settings of this problem with time-varying graphs $(G_t)_{t=1}^T$, in both the directed and undirected cases, and in both the so-called informed setting, where, at each round, the learner receives the graph before selecting an action, and the uninformed setting, where the graph is only made available after the learner has selected an action and updated its distribution [Alon et al., 2013, Kocák et al., 2014, Alon et al., 2015, Cohen et al., 2016, Alon et al., 2017, Cortes et al., 2018].

For the expert setting augmented with switching costs, the min-max optimal regret remains in $\Theta\big(\sqrt{\log(|V|)\,T}\big)$. However, classical algorithms such as Hedge or Follow-the-Perturbed-Leader [Kalai and Vempala, 2005] no longer achieve the optimal regret bound. Several algorithms designed by Kalai and Vempala [2005], Geulen et al. [2010], Gyorgy and Neu [2014] achieve this min-max optimal regret. In the setting of bandits with switching costs, the lower bound was carefully investigated by Cesa-Bianchi et al. [2013] and Dekel et al. [2014] and shown to be in $\widetilde{\Omega}\big(|V|^{1/3}\,T^{2/3}\big)$. This lower bound is asymptotically matched by mini-batching the EXP3 algorithm, as proposed by Arora et al. [2012]. The only work we are familiar with that studies both bandits with switching costs and side information is that of Rangi and Franceschetti [2019]. The authors propose two algorithms for time-varying feedback graphs in the uninformed setting. When reduced to the fixed feedback graph setting, their regret bound becomes $O\big(\alpha(G)^{1/3}\,T^{2/3}\big)$.
We note that, in the informed setting with a fixed feedback graph, this bound can be achieved by applying the mini-batching technique of Arora et al. [2012] to the EXP3-SET algorithm of Alon et al. [2013]. Our main contributions are two-fold. First, we propose two algorithms for online learning in the informed setting with a fixed feedback graph $G$ and switching costs. Our best algorithm admits a pseudo-regret bound in $\widetilde{O}\big(\gamma(G)^{1/3}\,T^{2/3}\big)$, where $\gamma(G)$ is the domination number of $G$. We note that the domination number $\gamma(G)$ can be substantially smaller than the independence number $\alpha(G)$, and therefore that our algorithm significantly improves upon the previous work of Rangi and Franceschetti [2019] in the informed setting. We also extend our results to achieve a policy regret bound in $O\big((m\gamma(G))^{1/3}\,T^{2/3}\big)$ when partial counterfactual feedback is available. The $\widetilde{O}\big(\gamma(G)^{1/3}\,T^{2/3}\big)$ regret bound in the switching costs setting might seem at odds with a lower bound stated by Rangi and Franceschetti [2019]. However, the lower bound of Rangi and Franceschetti [2019] can be shown to be technically inaccurate (see Appendix C). Our second main contribution is a lower bound in $\widetilde{\Omega}(T^{2/3})$ for any non-complete feedback graph. We also extend this lower bound to $\widetilde{\Omega}\big(\gamma(G)^{1/3}\,T^{2/3}\big)$ for a class of feedback graphs that we describe in detail. In Appendix I, we show a lower bound for the setting of evolving feedback graphs, matching the originally stated lower bound in [Rangi and Franceschetti, 2019].

The rest of this paper is organized as follows. In Section 2, we describe in detail the setup we analyze and introduce the relevant notation. In Section 3, we describe our main algorithms and results. We further extend our algorithms and analysis to the setting of online learning in reactive environments (Section 4). In Section 5, we present and discuss in detail lower bounds for this problem.

2 Problem Setup and Notation

We study a repeated game between an adversary and a player over $T$ rounds. For any $n \in \mathbb{N}$, we denote by $[n]$ the set of integers $\{1, \dots, n\}$. At each round $t \in [T]$, the player selects an action $a_t \in V$ and incurs a loss $\ell_t(a_t)$, as well as a cost of one if switching between distinct actions in consecutive rounds ($a_t \neq a_{t-1}$). For convenience, we define $a_0$ as an element not in $V$, so that the first action always incurs a switching cost. The regret $R_T$ of any sequence of actions $(a_t)_{t=1}^T$ is thus defined by $R_T = \max_{a \in V} \sum_{t=1}^T \big(\ell_t(a_t) - \ell_t(a)\big) + M$, where $M = \sum_{t=1}^T \mathbb{1}_{a_t \neq a_{t-1}}$ is the number of action switches in that sequence. We will assume an oblivious adversary or, equivalently, that the sequence of losses for all actions is determined by the adversary before the start of the game. The performance of an algorithm $\mathcal{A}$ in this setting is measured by its pseudo-regret $\overline{R}_T(\mathcal{A})$, defined by $\overline{R}_T(\mathcal{A}) = \max_{a \in V} \mathbb{E}\big[\sum_{t=1}^T \ell_t(a_t) - \ell_t(a) + \mathbb{1}_{a_t \neq a_{t-1}}\big]$, where the expectation is taken over the player's randomized choice of actions. The regret of $\mathcal{A}$ is defined as $\mathbb{E}[R_T]$, with the expectation outside of the maximum. In the following, we will abusively refer to $\overline{R}_T(\mathcal{A})$ as the regret of $\mathcal{A}$, to shorten the terminology. We also assume that the player has access to an undirected graph $G = (V, E)$, which determines which expert losses can be observed at each round. The vertex set $V$ is the set of experts (or actions), and the graph specifies that, if at round $t$ the player selects action $a_t$, then the losses of all experts whose vertices are adjacent to that of $a_t$ can be observed: $\ell_t(a)$ for $a \in N(a_t)$, where $N(a_t)$ denotes the neighborhood of $a_t$ in $G$, defined for any $u \in V$ by $N(u) = \{v \colon (u, v) \in E\}$.
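To make the protocol concrete, the sketch below simulates the interaction just described: the player's loss is accumulated, a unit cost is charged on every switch, and the losses of all neighbors of the played action are revealed. The helper names and the policy interface are ours, not part of the paper.

```python
import numpy as np

def play_with_feedback_graph(policy, losses, neighbors, rng):
    """Simulate T rounds of the game described above.

    losses:    array of shape (T, |V|), fixed in advance (oblivious adversary).
    neighbors: dict mapping each action to its neighborhood N(a)
               (every action contains itself, i.e. a self-loop).
    policy:    callable taking the round index, the feedback gathered so far,
               and a random generator, returning the next action
               (this interface is a hypothetical stand-in).
    """
    T, n_actions = losses.shape
    total_loss, switches = 0.0, 0
    prev_action = None                      # the first action always counts as a switch
    feedback = []                           # (round, action, observed loss) triples
    for t in range(T):
        a = policy(t, feedback, rng)
        total_loss += losses[t, a]
        if a != prev_action:
            switches += 1
        prev_action = a
        # Graph feedback: the player observes the loss of every neighbor of a.
        feedback.extend((t, b, losses[t, b]) for b in neighbors[a])
    # Regret with switching costs against the best fixed action in hindsight.
    best_fixed = losses.sum(axis=0).min()
    return total_loss - best_fixed + switches
```

Since the adversary is oblivious, the comparator term is deterministic, so averaging the returned quantity over the player's internal randomization yields the pseudo-regret defined above.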
We will denote by $\deg(u) = |N(u)|$ the degree of $u \in V$ in the graph $G$. We assume that $G$ admits a self-loop at every vertex, which implies that the player can at least observe the loss of their own action (bandit information). In all our figures, self-loops are omitted for the sake of simplicity. We assume that the feedback graph is available to the player at the beginning of the game (informed setting). The independence number of $G$ is the size of a maximum independent set in $G$ and is denoted by $\alpha(G)$. The domination number of $G$ is the size of a minimum dominating set and is denoted by $\gamma(G)$. The following inequality holds for all graphs $G$: $\gamma(G) \le \alpha(G)$ [Bollobás and Cockayne, 1979, Goddard and Henning, 2013]. In general, $\gamma(G)$ can be substantially smaller than $\alpha(G)$, with $\gamma(G) = 1$ and $\alpha(G) = |V| - 1$ in some cases. We note that all our results can be straightforwardly extended to the case of directed graphs.

3 An Adaptive Mini-batch Algorithm

In this section, we describe an algorithm for online learning with switching costs, using adaptive mini-batches. All proofs of results are deferred to Appendix D. The standard exploration versus exploitation dilemma in the bandit setting is further complicated in the presence of a feedback graph: if a poor action reveals the losses of all other actions, do we play the poor action? The lower bound construction of Mannor and Shamir [2011] suggests that we should not, since it might be better to just switch between the other actions. Adding switching costs, however, modifies the price of exploration, and the lower bound argument of Mannor and Shamir [2011] no longer holds. It is in fact possible to show that EXP3 and its graph-feedback variants switch too often in the presence of two good actions, thereby incurring $\Omega(T)$ regret due to the switching costs. One way to deal with switching costs is to adapt the fixed mini-batch technique of Arora et al. [2012]. That technique, however, treats all actions equally while, in the presence of switching costs, actions that provide additional information are more valuable. We deal with the issues just discussed by adopting the idea that the mini-batch sizes could depend both on how favorable an action is and on how much information an action provides about good actions.

3.1 Algorithm for Star Graphs

We start by studying a simple feedback graph case in which one action is adjacent to all other actions, with none of these other actions admitting other neighbors; see Figure 1 for an example.

Figure 1: Example of a star graph.

We call such graphs star graphs and we refer to the action adjacent to all other actions as the revealing action. The revealing action is denoted by $r$. Since only the revealing action can convey additional information about other actions, we will select our mini-batch size to be proportional to the quality of this action.

Algorithm 1: Algorithm for star graphs
Input: star graph $G(V, E)$, learning rates $(\eta_t)$, exploration rate $\beta \in [0, 1]$, maximum mini-batch size $\tau$.
Output: action sequence $(a_t)_{t=1}^T$.
1: $q_1 = \frac{1}{|V|}$
2: while $\sum_t \lfloor \tau_t \rfloor \le T$ do
3:   $p_t = (1 - \beta)\, q_t + \beta\, \delta(r)$   % $\delta(r)$ is the Dirac distribution on $r$
4:   Draw $a_t \sim p_t$, set $\tau_t = \tau\, p_t(r)$
5:   if $a_{t-1} \neq r$ and $a_t \neq r$ then
6:     Set $a_t = a_{t-1}$
7:   end if
8:   Play $a_t$ for the next $\lfloor \tau_t \rfloor$ iterations
9:   Set $\hat\ell_t(i) = \sum_{j=t}^{t+\lfloor\tau_t\rfloor-1} \mathbb{I}(a_t = r)\, \ell_j(i)$
10:  For all $i \in V$, $q_{t+1}(i) = \frac{q_t(i)\exp(-\eta_t \hat\ell_t(i))}{\sum_{j \in V} q_t(j)\exp(-\eta_t \hat\ell_t(j))}$
11:  $t = t + 1$
12: end while
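For concreteness, the following Python listing mirrors the steps of Algorithm 1 above. It is only an illustrative sketch under simplifying assumptions (a constant learning rate in place of the time-varying $\eta_t$, and a hypothetical function name); it is not the paper's implementation.

```python
import numpy as np

def star_graph_algorithm(losses, r, eta, beta, tau, rng):
    """Illustrative rendering of Algorithm 1 on a star graph with revealing action r.

    losses: (T, |V|) oblivious loss matrix; eta: fixed learning rate (simplification);
    beta: exploration rate mixed onto r; tau: maximum mini-batch size.
    """
    T, n = losses.shape
    q = np.full(n, 1.0 / n)                    # q_1 uniform over V
    t, prev, actions = 0, None, []
    while t < T:
        p = (1.0 - beta) * q
        p[r] += beta                           # p_t = (1 - beta) q_t + beta * delta(r)
        a = int(rng.choice(n, p=p))
        batch = max(1, int(tau * p[r]))        # adaptive mini-batch size tau_t = tau * p_t(r)
        if prev is not None and prev != r and a != r:
            a = prev                           # never switch between two non-revealing actions
        batch = min(batch, T - t)
        actions.extend([a] * batch)            # play a for the whole mini-batch
        # Only feedback gathered while playing the revealing action is used.
        ell_hat = losses[t:t + batch].sum(axis=0) if a == r else np.zeros(n)
        q = q * np.exp(-eta * ell_hat)         # exponential-weights update
        q /= q.sum()
        prev, t = a, t + batch
    return actions
```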
Also, to prevent our algorithm from switching between two non-revealing actions too often, we simply disallow such switches and allow switching only between the revealing action and a non-revealing action. Finally, we disregard any feedback that a non-revealing action provides. This simplifies the analysis of the regret of our algorithm. The pseudocode of the algorithm is given in Algorithm 1. The following intuition guides the design of our algorithm and its analysis. We need to visit the revealing action sufficiently often to derive information about all other actions, which is determined by the explicit exploration factor $\beta$. If $r$ is a good action, our regret will not be too large if we visit it often and spend a large amount of time on it. On the other hand, if $r$ is poor, then the algorithm should not sample it often and, when it does, it should not spend too much time there. Disallowing the algorithm to switch directly between non-revealing actions also prevents it from switching between two good non-revealing actions too often. The only remaining question is: do we observe enough information about each action to be able to devise a low-regret strategy? The following regret guarantee provides a precise positive response.

Theorem 3.1. Suppose that the inequality $\mathbb{E}[\hat\ell_t^2(i)] \le \sigma$ holds for all $t \le T$ and all $i \in V$, for some $\sigma$, and that $\beta \ge 1/\tau$. Then, for any action $a \in V$, Algorithm 1 admits a pseudo-regret guarantee, stated explicitly in Appendix D, in terms of $\eta_t$, $\beta$, $\tau$, and $\sigma$. Furthermore, the algorithm does not switch more than $2T/\tau$ times, in expectation.

The exploration parameter $\beta$ is needed to ensure that $\tau_t = \tau\, p_t(r) \ge 1$, so that at every iteration of the while loop Algorithm 1 plays at least one action. The bound assumed on the second moment $\mathbb{E}[\hat\ell_t^2(i)]$ might seem unusual, since in the adversarial setting we do not assume any randomization of the losses. For now, the reader can simply view it as a bound on the squared loss, that is, $\hat\ell_t^2(i) \le \sigma$. The role of this expectation and the source of the randomness will become clear in Section 3.3. We note that the star graph admits independence number $\alpha(G) = |V| - 1$ and domination number $\gamma(G) = 1$. In this case, the algorithms of Rangi and Franceschetti [2019] and variants of the mini-batching algorithm only guarantee a regret bound of the order $O\big(\alpha(G)^{1/3}\, T^{2/3}\big)$, while Algorithm 1 guarantees a regret bound of the order $O(T^{2/3})$ when we set $\eta = 1/T^{2/3}$, $\tau = T^{1/3}$, and $\beta = 1/T^{1/3}$.

3.2 Algorithm for General Feedback Graphs

We now extend Algorithm 1 to handle arbitrary feedback graphs. The pseudocode of this more general algorithm is given in Algorithm 2.

Algorithm 2: Algorithm for general feedback graphs
Input: graph $G(V, E)$, learning rates $(\eta_t)$, exploration rate $\beta \in [0, 1]$, maximum mini-batch size $\tau$.
Output: action sequence $(a_t)_t$.
1: Compute an approximate minimum dominating set $R$
2: $q_1 = \mathrm{Unif}(V)$, $u = \mathrm{Unif}(R)$
3: while $\sum_t \lfloor \tau_t \rfloor \le T$ do
4:   $p_t = (1 - \beta)\, q_t + \beta\, u$
5:   Draw $i \sim p_t$, set $\tau_t = \tau\, p_t(r_i)$, where $r_i$ is the dominating vertex for $i$, and set $a_t = i$
6:   if $a_{t-1} \notin R$ and $a_t \notin R$ then
7:     Set $a_t = a_{t-1}$
8:   end if
9:   Play $a_t$ for the next $\lfloor \tau_t \rfloor$ iterations
10:  Set $\hat\ell_t(i) = \sum_{j=t}^{t+\lfloor\tau_t\rfloor-1} \mathbb{I}(a_t = r_i)\, \ell_j(i)$
11:  For all $i \in V$, $q_{t+1}(i) = \frac{q_t(i)\exp(-\eta_t \hat\ell_t(i))}{\sum_{j \in V} q_t(j)\exp(-\eta_t \hat\ell_t(j))}$
12:  $t = t + 1$
13: end while

The first step of Algorithm 2 consists of computing an approximate minimum dominating set for $G$ using the Greedy Set Cover algorithm [Chvatal, 1979]. The Greedy Set Cover algorithm naturally partitions $G$ into disjoint star graphs with revealing actions/vertices in the dominating set $R$.
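A minimal sketch of this first step, assuming the graph is given as a dictionary of neighbor sets that include self-loops; the helper name and interface are ours, not code from the paper.

```python
def greedy_dominating_set(neighbors):
    """Greedy set cover over closed neighborhoods: an O(log|V|)-approximate minimum
    dominating set, together with the induced partition into disjoint stars.

    neighbors: dict mapping each vertex to its set of neighbors (self-loops included,
    so every vertex dominates itself).
    """
    uncovered = set(neighbors)
    dominating, stars = [], {}
    while uncovered:
        # Pick the vertex whose neighborhood covers the most still-uncovered vertices.
        v = max(neighbors, key=lambda u: len(neighbors[u] & uncovered))
        newly_covered = neighbors[v] & uncovered
        dominating.append(v)
        stars[v] = newly_covered       # leaves of the star whose revealing vertex is v
        uncovered -= newly_covered
    return dominating, stars
```

Each vertex in the returned dominating set then plays the role of the revealing arm for the star formed by the vertices it covered, which is the partition the general algorithm operates on.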
Next, Algorithm 2 associates with each star graph its revealing arm $r \in R$. The mini-batch size at time $t$ now depends on the probability $p_t(r)$ of sampling a revealing action $r$, as in Algorithm 1. There are several key differences, however, which we now point out. Unlike in Algorithm 1, the mini-batch size can change between rounds even if the action remains fixed. This occurs when the newly sampled action is associated with a revealing action in $R$ different from the one associated with the previous action. This difference introduces some complications, because $\tau_t$ conditioned on all prior actions $a_{1:t-1}$ is still a random variable, while it is deterministic in Algorithm 1. We also allow switches between any action and any vertex $r \in R$. This might seem a peculiar choice; for example, allowing switches only within each star graph of the partition, and only between revealing vertices, seems more natural. Allowing switches between any vertex and any revealing action benefits exploration while still being sufficient for controlling the number of switches. If we further constrain the number of switches by using the more natural approach, it is possible that not enough information is received about each action, leading to worse regret guarantees. We leave the investigation of such more natural approaches to future work. Algorithm 2 admits the following regret bound.

Theorem 3.2. For any $\beta \ge |R|/\tau$, the expected regret of Algorithm 2 admits a bound, stated explicitly in Appendix D.3, in terms of $\eta_t$, $\beta$, $\tau$, and $|R|$. Further, if the algorithm is augmented similarly to Algorithm 7, then it switches between actions at most $2T|R|/\tau$ times, in expectation.

Setting $\eta = 1/\big(|R|^{1/3} T^{2/3}\big)$, $\tau = |R|^{2/3} T^{1/3}$, and $\beta = (|R|/T)^{1/3}$ recovers a pseudo-regret bound of $O\big(|R|^{1/3} T^{2/3}\big)$, with an expected number of switches bounded by $2|R|^{1/3} T^{2/3}$. We note that $|R| = O(\gamma(G)\log(|V|))$, and thus the regret bound of our algorithm scales as $\gamma(G)^{1/3}$. Further, this is a strict improvement over the results of Rangi and Franceschetti [2019], as their result scales as $\alpha(G)^{1/3}$. The proof of Theorem 3.2 can be found in Appendix D.3.

3.3 Corralling Star Graph Algorithms

An alternative natural method to tackle the general feedback graph problem is to use the recent corralling algorithm of Agarwal et al. [2016]. Corralling star graph algorithms was in fact our initial approach.

Algorithm 3: Corralling star-graph algorithms
Input: feedback graph $G(V, E)$, learning rate $\eta$, mini-batch size $\tau$.
Output: action sequence $(a_t)_{t=1}^T$.
1: Compute an approximate minimum dominating set $R$ and initialize $|R|$ base star-graph algorithms $B_1, B_2, \dots, B_{|R|}$ (Algorithm 1), each with step size $\eta'$, mini-batch size $\tau'$, and exploration rate $1/\tau'$
2: Set $T' = T/\tau$, $\bar\beta = \frac{1}{T'}$, $\beta = \exp\!\big(\frac{1}{\ln T'}\big)$, $\eta_{1,i} = \eta$, $\rho_{1,i} = 2|R|$ for all $i \in [|R|]$, $q_1 = p_1 = \frac{1}{|R|}$
3: for $t = 1, \dots, T'$ do
4:   Draw $i_t \sim p_t$
5:   for $j_t = (t-1)\tau + 1, \dots, (t-1)\tau + \tau$ do
6:     Receive action $a^i_{j_t}$ from $B_i$ for all $i \in [|R|]$
7:     Set $a_{j_t} = a^{i_t}_{j_t}$, play $a_{j_t}$, and observe the loss $\ell_{j_t}(a_{j_t})$
8:     Send $\frac{\ell_{j_t}(a_{j_t})}{p_t(i_t)}\, \mathbb{I}\{i = i_t\}$ as the loss to algorithm $B_i$, for all $i \in [|R|]$
9:     Update $\hat\ell_t(i) = \hat\ell_t(i) + \frac{\ell_{j_t}(a_{j_t})}{p_t(i_t)}\, \mathbb{I}\{i = i_t\}$
10:  end for
11:  Update $q_{t+1} = \text{Algorithm 4}(q_t, \hat\ell_t, \eta_t)$
12:  Set $p_{t+1} = (1 - \bar\beta)\, q_{t+1} + \bar\beta\, \frac{1}{|R|}$
13:  for $i = 1, \dots, |R|$ do
14:    if $\frac{1}{p_t(i)} > \rho_{t,i}$ then
15:      Set $\rho_{t+1,i} = \frac{2}{p_t(i)}$, $\eta_{t+1,i} = \beta\, \eta_{t,i}$, and restart the $i$-th star-graph algorithm with updated step size $\eta'_{t+1,i}$
16:    else
17:      Set $\rho_{t+1,i} = \rho_{t,i}$, $\eta_{t+1,i} = \eta_{t,i}$
18:    end if
19:  end for
20: end for

Algorithm 4: Log-Barrier-OMD$(q_t, \ell_t, \eta_t)$
Input: previous distribution $q_t$, loss vector $\ell_t$, learning rate vector $\eta_t$.
Output: updated distribution $q_{t+1}$.
1: Find $\lambda \in [\min_i \ell_t(i), \max_i \ell_t(i)]$ such that $\sum_{i=1}^{|R|} \frac{1}{\frac{1}{q_t(i)} + \eta_{t,i}(\ell_t(i) - \lambda)} = 1$
2: Return $q_{t+1}$ such that $\frac{1}{q_{t+1}(i)} = \frac{1}{q_t(i)} + \eta_{t,i}(\ell_t(i) - \lambda)$
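The update of Algorithm 4 has a simple closed form once $\lambda$ is known, and $\lambda$ can be computed numerically, for instance by bisection, since the left-hand side of the constraint is monotone in $\lambda$. Below is a hedged sketch; the function name, iteration count, and final renormalization are our choices.

```python
import numpy as np

def log_barrier_omd(q, loss, eta, n_iter=100):
    """Sketch of the Log-Barrier-OMD step (Algorithm 4): find the normalization
    constant lambda by bisection, then return the updated distribution.

    q: current distribution over the |R| base algorithms; loss: loss vector;
    eta: vector of per-coordinate learning rates.  We assume the learning rates
    are small enough that all denominators stay positive on the search interval.
    """
    lo, hi = float(np.min(loss)), float(np.max(loss))

    def mass(lam):
        return np.sum(1.0 / (1.0 / q + eta * (loss - lam)))

    # mass(.) is increasing in lambda, with mass(lo) <= 1 <= mass(hi).
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mass(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    q_new = 1.0 / (1.0 / q + eta * (loss - lam))
    return q_new / q_new.sum()       # renormalize to absorb the bisection error
```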
In this section, we describe that technique, even though it does not seem to achieve an optimal rate. Here too, the first step consists of computing an approximate minimum dominating set. Next, we initialize an instance of Algorithm 1 for each star graph. Finally, we combine all of the star graph algorithms via a mini-batched version of the corralling algorithm of Agarwal et al. [2016]. Mini-batching is necessary to avoid switching between star graph algorithms too often. The pseudocode of this algorithm is given in Algorithm 3. Since during each mini-batch we sample a single star graph algorithm, we need to construct appropriate unbiased estimators of the losses $\ell_{j_t}$, which we feed back to the sampled star graph algorithm. The bound on the second moment of these estimators is exactly what Theorem 3.1 requires. Our algorithm admits the following guarantees.

Theorem 3.3. Let $\tau = T^{1/4}$, $\eta = |R|^{1/4}/\big(40\,c\log(T')\,T^{1/3}\log(|V|)\big)$, and $\eta' = 1/T^{2/3}$, where $c$ is a constant independent of $T$, $\tau$, $|V|$, and $|R|$. Then, for any $a \in V$, Algorithm 3 satisfies a pseudo-regret bound, stated explicitly in Appendix D, whose dependence on $|R|$ is worse than that of Theorem 3.2. Furthermore, the expected number of switches of the algorithm is bounded by $T^{3/4}$.

This bound is suboptimal compared to the $\gamma(G)^{1/3}$-dependency achieved by Algorithm 2. We conjecture that this gap is an artifact of the analysis of the corralling algorithm of Agarwal et al. [2016]. However, we were unable to improve on the current regret bound by simply corralling.

4 Policy Regret with Partial Counterfactual Feedback

In this section, we consider games played against an adaptive adversary, who can select losses based on the player's past actions. In that scenario, the notion of pseudo-regret is no longer meaningful or interpretable, as pointed out by Arora et al. [2012]. Instead, the authors proposed the notion of policy regret, defined by $\max_{a \in V} \sum_{t=1}^T \ell_t(a_1, \dots, a_t) - \sum_{t=1}^T \ell_t(a, \dots, a)$, where the benchmark action $a$ does not depend on the player's actions. Since it is impossible to achieve $o(T)$ policy regret when the $t$-th loss is allowed to depend on all past actions of the player, the authors made the natural assumption that the adversary is $m$-memory bounded, that is, that the $t$-th loss can only depend on the past $m$ actions chosen by the player. In that case, the known min-max policy regret bounds are in $\widetilde{\Theta}\big(|V|^{1/3}\, T^{2/3}\big)$ [Dekel et al., 2014], ignoring the dependency on $m$. Here, we show that the dependency on $|V|$ can be improved in the presence of partial counterfactual feedback. We assume that partial feedback on losses with memory $m$ is available. We restrict the feedback graph to admit only vertices corresponding to repeated $m$-tuples of actions in $V$; that is, we can only observe additional feedback for losses of the type $\ell_t(a, a, \dots, a)$, where $a \in V$.

Algorithm 5: Policy regret with side observations
Input: feedback graph $G(V, E)$, learning rate $\eta$, mini-batch size $\tau$, where $\eta$ and $\tau$ are set as in Theorem 3.3.
Output: action sequence $(a_t)_t$.
1: Transform the feedback graph $G$ from $m$-tuples to actions and initialize Algorithm 2
2: for $t = 1, \dots, T/m$ do
3:   Sample action $a_t$ from the distribution $p_t$ generated by Algorithm 2 and play it for the next $m$ rounds
4:   if $a_{t-1} = a_t$ then
5:     Observe the mini-batched loss $\hat\ell_t(a_t) = \frac{1}{m}\sum_{j=1}^{m} \ell_{(t-1)m+j}(a_t)$ and the additional side observations. Feed the mini-batched loss and the additional side observations to Algorithm 2
6:   else
7:     Set $\hat\ell_t(a_t) = 0$ and set the additional feedback losses to 0. Feed these losses to Algorithm 2
8:   end if
9: end for
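The following sketch shows how Algorithm 5 wraps a base algorithm playing the role of Algorithm 2; the environment and base-algorithm interfaces (`env.step`, `sample`, `update`) are hypothetical stand-ins, not APIs from the paper.

```python
def policy_regret_wrapper(env, base_algorithm, T, m):
    """Sketch of Algorithm 5: run a feedback-graph bandit algorithm on mini-batches
    of size m against an m-memory-bounded adversary.

    env.step(a) returns the loss of playing action a for one round together with
    the side observations allowed by the feedback graph (only constant m-tuples
    reveal extra feedback, as described above).
    base_algorithm exposes sample() and update(loss, side_obs, action).
    """
    prev_action = None
    for t in range(T // m):
        a = base_algorithm.sample()
        batch_losses, side_obs = [], []
        for _ in range(m):
            loss, obs = env.step(a)            # play a for all m rounds of the batch
            batch_losses.append(loss)
            side_obs.append(obs)
        if a == prev_action:
            # The batch corresponds to a repeated m-tuple: its averaged loss and
            # the side observations are valid counterfactual feedback.
            base_algorithm.update(sum(batch_losses) / m, side_obs, a)
        else:
            # Right after a switch the m-tuple is not constant; feed zero losses.
            base_algorithm.update(0.0, None, a)
        prev_action = a
```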
For a motivating example, consider the problem of prescribing treatment plans to incoming patients with certain disorders. Two patients that are similar, for example patients with the same disease sub-type or with similar physiological attributes, when prescribed different treatments, reveal counterfactual feedback about alternative treatments for each other. Our algorithm for incorporating such partial feedback to minimize policy regret is based on our algorithm for general feedback graphs (Algorithm 2). The learner receives feedback about $m$-memory bounded losses in the form of $m$-tuples. We simplify the representation by replacing each $m$-tuple vertex in the graph by a single action; that is, vertex $(a, \dots, a)$ is represented as $a$. As described in Algorithm 5, the input stream of $T$ losses is split into mini-batches of size $m$, indexed by $t$, such that $\hat\ell_t(\cdot) = \frac{1}{m}\sum_{j=1}^{m} \ell_{(t-1)m+j}(\cdot)$. This sequence of losses, $(\hat\ell_t)_{t=1}^{T/m}$, could be fed as input to Algorithm 2 if it were not for the constraint on the additional feedback. Suppose that between the $t$-th mini-batch and the $(t+1)$-st mini-batch, Algorithm 2 decides to switch actions, so that $a_{t+1} \neq a_t$. In that case, no additional feedback is available for $\hat\ell_{t+1}(a_{t+1})$ and the algorithm cannot proceed as usual. To fix this minor issue, the feedback provided to Algorithm 2 is that the loss of action $a_{t+1}$ was 0 and that all actions adjacent to $a_{t+1}$ also incurred 0 loss. This modification of the losses cannot occur more often than the number of switches performed by Algorithm 2. Since the expected number of switches is bounded by $O\big(\gamma(G)^{1/3}\, T^{2/3}\big)$, the modification does not affect the total expected regret.

Theorem 4.1. The expected policy regret of Algorithm 5 is bounded as $O\big((m\gamma(G))^{1/3}\, T^{2/3}\big)$.

The proof of the above theorem can be found in Appendix E. Let us point out that Algorithm 5 requires knowledge of (or an upper bound on) the memory of the adversary, unlike the algorithm proposed by Arora et al. [2012]. We conjecture that this is due to the adaptive mini-batch technique of our algorithm. In particular, we believe that for $m$-memory bounded adversaries, it is necessary to repeat each sampled action at least $m$ times.

5 Lower Bound

The main tool for constructing lower bounds when switching costs are involved is the stochastic process constructed by Dekel et al. [2014]. The crux of the proof consists of a carefully designed multi-scale random walk. The two characteristics of this random walk are its depth and its width. At time $t$, the depth of the walk is the number of previous rounds on which the value of the current round depends. The width of the walk measures how far apart two rounds that depend on each other are in time. The loss of each action is equal to the value of the random walk at each time step, and the loss of the best action is slightly better by a small positive constant. The depth of the process controls how well the losses concentrate in the interval $[0, 1]$ (technically, the losses are always clipped to $[0, 1]$). The width of the walk controls the variance between losses of different actions and ensures that it is impossible to gain information about the best action unless one switches between different actions.
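To make the depth and width discussion concrete, here is one common instantiation of a multi-scale random walk in the spirit of Dekel et al. [2014]; the offset, the Gaussian increments, and the clipping below are illustrative choices on our part rather than the exact construction used in the proofs.

```python
import numpy as np

def multiscale_random_walk_losses(T, n_actions, best_action, eps, sigma, rng):
    """Sketch of a multi-scale random walk loss sequence.

    Round t > 0 depends on its "parent" rho(t) = t - 2**k, where 2**k is the
    largest power of two dividing t; this keeps the depth logarithmic in T.
    Every action suffers the walk's value, the best action eps less, and all
    losses are clipped to [0, 1].
    """
    walk = np.zeros(T + 1)
    for t in range(1, T + 1):
        k = (t & -t).bit_length() - 1          # exponent of the largest power of 2 dividing t
        parent = t - 2 ** k
        walk[t] = walk[parent] + rng.normal(0.0, sigma)
    losses = 0.5 + walk[1:, None] * np.ones((1, n_actions))
    losses[:, best_action] -= eps              # the best action is better by eps
    return np.clip(losses, 0.0, 1.0)
```

The parent structure keeps the number of ancestors of any round logarithmic while leaving nearby rounds only loosely coupled, which is what makes it hard to identify the best action without switching.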
5.1 Lower Bound for Non-complete Graphs

Figure 2: Feedback graph for switching costs.

We first verify that the dependence on the time horizon cannot be improved beyond $T^{2/3}$ for any feedback graph in which at least one edge is missing, that is, in which there exist two vertices that do not reveal information about each other. Without loss of generality, assume that the two vertices not joined by an edge are $v_1$ and $v_2$. Take any vertex that is a shared neighbor and denote it by $v_3$ (see Figure 2 for an example). We set the loss of action $v_3$ and of all other vertices to be equal to one. We now focus the discussion on the subgraph with vertices $\{v_1, v_2, v_3\}$. The losses of actions $v_1$ and $v_2$ are set according to the construction in [Dekel et al., 2014]. Since $\{v_1, v_2\}$ forms an independent set, the player would need to switch between these vertices to gain information about the best action. This is also what the lower bound proof of Rangi and Franceschetti [2019] is based upon. However, it is important to realize that the construction in Dekel et al. [2014] also allows for gaining information about the best action if its loss is revealed together with some other loss constructed from the stochastic process. In that case, playing vertex $v_3$ would provide such information. This is a key property which Rangi and Franceschetti [2019] seem to have missed in their lower bound proof. We discuss this mistake carefully in Appendix C and provide a lower bound matching what the authors claim in the uninformed setting in Appendix I. Our discussion suggests that we should set the price for revealing information about multiple actions according to the switching cost, and this is why the losses of all vertices outside of the independent set are equal to one. We note that the losses of the best action are much smaller than one sufficiently often, so that enough instantaneous regret is incurred when pulling action $v_3$. Our main result follows; its proof can be found in Appendix F.

Theorem 5.1. For any non-complete feedback graph $G$, there exists a sequence of losses on which any algorithm $\mathcal{A}$ in the informed setting incurs an expected regret of at least $\Omega\big(T^{2/3}/\log(T)\big)$.

5.2 Lower Bound for Disjoint Union of Star Graphs

How do we construct a lower bound for a disjoint union of star graphs? First, note that if two adjacent vertices are allowed to admit losses set according to the stochastic process and one of them is the best vertex, then we could distinguish it in time $O(\sqrt{T})$ by repeatedly playing the other vertex. This suggests that losses set according to the stochastic process should be reserved for vertices in an independent set. Second, it is important to keep track of the amount of information revealed by common neighbors.

Figure 3: Disjoint union of star graphs.

Consider the feedback graph of Figure 3. This disjoint union of star graphs admits a domination number equal to four, and its minimum dominating set is denoted by $\{v_1, v_2, v_3, v_4\}$. Probably the most natural way to set up the losses of the vertices is to set the losses of the maximum independent set, which consists of the colored vertices, according to the construction of Dekel et al. [2014], and the losses of the minimum dominating set equal to one. Let $v_1$ be the vertex with the highest degree. Any time the best action is sampled to be not adjacent to $v_1$, switching between that action and $v_1$ reveals an amount of information about it proportional to $\deg(v_1)$. On the other hand, no matter how we sample the best action as a neighbor of $v_1$, it is then enough to play $v_1$ to gain enough information about it. If $I$ denotes the maximum independent set, the above reasoning shows that only $O\big(T^{2/3}|I|/\deg(v_1)\big)$ rounds of switching are needed to distinguish the best action.
Since $\deg(v_1)$ can be made arbitrarily large, so that $|I|/\deg(v_1)$ gets arbitrarily close to one, we see that the regret lower bound becomes independent of the domination number and equal to $\Omega(T^{2/3})$. We now present a construction for the disjoint union of star graphs which guarantees a lower bound of the order $\Omega\big(\gamma(G)^{1/3}\, T^{2/3}\big)$. The idea behind our construction is to choose an independent set such that none of its members have a common neighbor, thereby avoiding the problem described above. We note that such an independent set cannot have size greater than $\gamma(G)$. Let $R$ be the set of revealing vertices of the star graphs. We denote by $V_i$ the set of vertices associated with the star graph with revealing vertex $v_i$. To construct the losses, we first sample an active vertex for each star graph from its leaves. The active vertices are represented in red in Figure 3. This forms an independent set $I$ indexed by $R$. Next, we follow the construction of Dekel et al. [2014] for the vertices in $I$, by first sampling a best vertex uniformly at random from $I$ and then setting the losses in $I$ according to the multi-scale random walk. All other losses are set to one. Any star graph consisting of a single vertex is treated as a non-revealing vertex. This construction guarantees the following.

Theorem 5.2. The expected regret of any algorithm $\mathcal{A}$ on a disjoint union of star graphs is lower bounded as $\Omega\big(\gamma(G)^{1/3}\, T^{2/3}/\log(T)\big)$.

The proof of this theorem can be found in Appendix G. This result can be viewed as a consequence of that of Dekel et al. [2014], but it can also be proven in an alternative fashion. The general idea is to count the amount of information gained about the randomly sampled best vertex. For example, a strategy that switches between two revealing vertices $v_i$ and $v_j$ will gain information proportional to $\deg(v_i)\deg(v_j)$. The lower bound follows from carefully counting the information gain of switching between revealing vertices. This counting argument can be generalized beyond the disjoint union of star graphs by considering an appropriate pair of minimal dominating/maximal independent sets. We give an argument for the disjoint union of star graphs in Appendix G and leave a detailed argument for general graphs to future work.
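As an illustration of the construction above, the sketch below assigns losses for a disjoint union of star graphs, reusing the `multiscale_random_walk_losses` helper sketched earlier in this section; the sampling interface is ours and the construction is only schematic.

```python
import numpy as np

def star_union_lower_bound_losses(stars, T, eps, sigma, rng):
    """Schematic loss construction of Section 5.2 for a disjoint union of stars.

    stars: dict mapping each revealing vertex to the list of its leaves.
    Returns a (T, |V|) loss matrix indexed by vertex id.
    """
    vertices = sorted(set(stars) | {v for leaves in stars.values() for v in leaves})
    idx = {v: i for i, v in enumerate(vertices)}
    losses = np.ones((T, len(vertices)))          # every vertex outside I has loss one
    # One active leaf per star forms the independent set I; a single-vertex star
    # contributes its only vertex, treated as non-revealing.
    active = [rng.choice(leaves) if len(leaves) > 0 else r for r, leaves in stars.items()]
    best = int(rng.integers(len(active)))         # best vertex drawn uniformly from I
    walk_losses = multiscale_random_walk_losses(T, len(active), best, eps, sigma, rng)
    for j, v in enumerate(active):
        losses[:, idx[v]] = walk_losses[:, j]     # walk losses on I, loss one elsewhere
    return losses
```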
6 Conclusion

We presented an extensive analysis of online learning with feedback graphs and switching costs in the adversarial setting, a scenario relevant to several applications in practice. We gave a new algorithm whose regret guarantee only depends on the domination number. We also presented a matching lower bound for a family of graphs that includes disjoint unions of star graphs. The technical tools introduced in our proofs are likely to help derive a lower bound for all graph families. We further derived an algorithm with more favorable policy regret guarantees in the presence of feedback graphs.

Acknowledgements

This research was partly supported by NSF BIGDATA grants IIS-1546482 and NSF IIS-1838139, and by NSF CCF-1535987, NSF IIS-1618662, and a Google Research Award.

References

Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E. Schapire. Corralling a band of bandit algorithms. arXiv preprint arXiv:1612.06246, 2016.
Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618, 2013.
Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In JMLR Workshop and Conference Proceedings, volume 40. Microtome Publishing, 2015.
Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.
Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, pages 1747–1754, 2012.
Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
Béla Bollobás and Ernest J. Cockayne. Graph-theoretic parameters concerning domination, independence, and irredundance. Journal of Graph Theory, 3(3):241–249, 1979.
Swapna Buccapatnam, Atilla Eryilmaz, and Ness B. Shroff. Stochastic bandits with side observations on networks. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, pages 289–300. ACM, 2014.
Stephane Caron, Branislav Kveton, Marc Lelarge, and Smriti Bhagat. Leveraging side observations in stochastic bandits. In UAI, 2012.
Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.
Nicolò Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, pages 1160–1168, 2013.
Vasek Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.
Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819, 2016.
Corinna Cortes, Giulia De Salvo, Claudio Gentile, Mehryar Mohri, and Scott Yang. Online learning with abstention. In 35th ICML, 2018.
Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: $T^{2/3}$ regret. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 459–467. ACM, 2014.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
Sascha Geulen, Berthold Vöcking, and Melanie Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. In COLT, pages 132–143, 2010.
Wayne Goddard and Michael A. Henning. Independent domination in graphs: A survey and recent results. Discrete Mathematics, 313(7):839–854, 2013.
András György and Gergely Neu. Near-optimal rates for limited-delay universal lossy source coding. IEEE Transactions on Information Theory, 60(5):2823–2834, 2014.
Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621, 2014.
Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
Fang Liu, Swapna Buccapatnam, and Ness Shroff. Information directed sampling for stochastic bandits with graph feedback. In 32nd AAAI Conference on Artificial Intelligence, 2018.
Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
Anshuka Rangi and Massimo Franceschetti. Online learning with feedback graphs and switching costs. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2435–2444, 2019.
Aristide Tossou, Christos Dimitrakakis, and Devdatt Dubhashi. Thompson sampling for stochastic bandits with graph feedback. In 31st AAAI Conference on Artificial Intelligence, 2017.
Yifan Wu, András György, and Csaba Szepesvari. Online learning with Gaussian payoffs and side observations. In Advances in Neural Information Processing Systems 28, pages 1360–1368. Curran Associates, Inc., 2015a.
Yifan Wu, András György, and Csaba Szepesvári. Online learning with Gaussian payoffs and side observations. In NIPS, pages 1360–1368, 2015b.
Donggyu Yun, Alexandre Proutiere, Sumyeong Ahn, Jinwoo Shin, and Yung Yi. Multi-armed bandit with additional observations. Proc. ACM Meas. Anal. Comput. Syst., 2(1):13:1–13:22, 2018.