# Cooperative Online Learning with Feedback Graphs

Published in Transactions on Machine Learning Research (06/2024)

**Nicolò Cesa-Bianchi** (nicolo.cesa-bianchi@unimi.it), Politecnico di Milano & Università degli Studi di Milano, Milano, Italy
**Tommaso Cesari** (tcesari@uottawa.ca), School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
**Riccardo Della Vecchia** (ric.della.vecchia@gmail.com), Inria, Université de Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL

Reviewed on OpenReview: https://openreview.net/forum?id=PtNyIboDIG

**Abstract.** We study the interplay between communication and feedback in a cooperative online learning setting, where a network of communicating agents learns a common sequential decision-making task through a feedback graph. We bound the network regret in terms of the independence number of the strong product between the communication network and the feedback graph. Our analysis recovers as special cases many previously known bounds for cooperative online learning with expert or bandit feedback. We also prove an instance-based lower bound, demonstrating that our positive results are not improvable except in pathological cases. Experiments on synthetic data confirm our theoretical findings.

## 1 Introduction

Nonstochastic online learning with feedback graphs (Mannor and Shamir, 2011) is a sequential decision-making setting in which, at each decision round, an oblivious adversary assigns losses to all actions in a finite set. What the learner observes after choosing an action is determined by a feedback graph defined on the action set. Unlike bandit feedback, where a learner choosing an action pays and observes the corresponding loss, in the feedback graph setting the learner also observes (without paying) the loss of all neighboring actions in the graph.
Special cases of this setting are prediction with expert advice (where the graph is a clique) and multiarmed bandits (where the graph has no edges). The Exp3-SET algorithm (Alon et al., 2017) is known to achieve a regret scaling with the square root of the graph independence number, and this is optimal up to a logarithmic factor in the number of actions. In recommendation systems, feedback graphs capture situations in which a user's reaction to a recommended product allows the system to infer what reaction similar recommendations would have elicited in the same user; see Alon et al. (2017) for more examples. Online learning has also been investigated in distributed settings, in which a network of cooperating agents solves a common task. At each time step, some agents become active, meaning that they are requested to make predictions and pay the corresponding loss. Agents cooperate through a communication network by sharing the feedback obtained by the active agents. The time this information takes to travel the network is taken into account: a message broadcast by an agent is received by another agent after a delay equal to the length of the shortest path between them. Regret in cooperative online learning has been previously investigated only in the full-information setting (Cesa-Bianchi et al., 2020; Hsieh et al., 2022) and in the bandit setting (Cesa-Bianchi et al., 2019; Bar-On and Mansour, 2019a). In this work we provide a general solution to the problem of cooperative online learning with feedback graphs. In doing so, we generalize previous approaches and also clarify the impact on the regret of the mechanism governing the activation of agents.
Under the assumption that agents are stochastically activated, our analysis captures the interplay between the communication graph (over the agents) and the feedback graph (over the actions), showing that the network regret scales with the independence number of the strong product between the communication network and the feedback graph. More precisely, we design a distributed algorithm, Exp3-α2, whose average regret R_T/Q (where Q is the expected number of simultaneously active agents) on any communication network N and any feedback graph F is, up to log factors,

R_T/Q ≤ Õ( √( (1 + n + α(N^n ⊠ F)/Q) T ) ) ,    (1)

where T is the horizon, n is the diffusion radius (the maximum delay after which feedback is ignored), and α(N^n ⊠ F) is the independence number of the strong product between the n-th power N^n of the communication network N and the feedback graph F. We also prove a near-matching instance-dependent lower bound showing that, with the exception of pathological cases, for any pair of graphs (N, F), no algorithm run with an oblivious network interface can have average regret smaller than √( α(N^n ⊠ F) T / Q ). Our results hold for any diffusion radius n, which serves as a parameter to control the message complexity of the protocol. When n is equal to the diameter of the communication network, every agent can communicate with every other agent. Our protocol is reminiscent of the local communication model in distributed computing (Linial, 1992; Suomela, 2013), where the output of a node depends only on the inputs of other nodes in a constant-size neighborhood of it, and the goal is to derive algorithms whose running time is independent of the network size. Although our tasks have no completion time, in our model each node is directly influenced only by a constant-size neighborhood around it. Let |A| and |K| be, respectively, the number of agents and actions.
When Q = |A| (all agents are always active) and F is the bandit graph (no edges), then α(N^n ⊠ F) = |K| α(N^n), and we recover the bound √( (α(N^n)|K|/|A| + 1 + n) T ) of Cesa-Bianchi et al. (2019). When n = 1 and F is the expert graph (clique), then α(N^n ⊠ F) = α(N), and we recover the bound √( (α(N)/Q + 1) T ) of Cesa-Bianchi et al. (2020).¹ Interestingly, if all agents were always active in Cesa-Bianchi et al. (2020), the graph topology would become irrelevant in the expert setting, resulting in a simplified regret bound of O(√T), analogous to the case of a clique graph. This starkly contrasts with the bandit case of Cesa-Bianchi et al. (2019), where even when all agents are active simultaneously, the graph topology explicitly appears in the regret bound. Finally, in the non-cooperative case (N is the bandit graph), we obtain √( |A| α(F) T / Q ) which, for |A| = 1 and Q = 1, recovers the bound of Alon et al. (2017). Table 1 summarizes all known bounds (omitting log factors and setting, for simplicity, Q = 1 and n = 0).

| | \|A\| = 1 | Any N |
|---|---|---|
| F = clique (experts) | √T (Freund and Schapire, 1997) | √(α(N)T) (Cesa-Bianchi et al., 2020) |
| F = no edges (bandits) | √(\|K\|T) (Auer et al., 2002) | √(α(N)\|K\|T) (Cesa-Bianchi et al., 2019) |
| Any F | √(α(F)T) (Alon et al., 2017) | √(α(N ⊠ F)T) (this work) |

Table 1: Known bounds in online learning with feedback graphs and cooperative online learning.

Our theory is developed under the so-called oblivious network interface; i.e., agents are oblivious to the global network topology and run an instance of the same algorithm using a common initialization and a common learning rate for their updates. In this case, the stochastic activation assumption is necessary to avoid incurring linear regret R_T = Ω(T) (Cesa-Bianchi et al., 2020). Our core and main technical contributions are presented in Lemma 1 and Theorem 1 for the upper bound, and in Lemma 2 and Theorem 2 for the lower bound.
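The special cases just discussed (clique and edgeless feedback graphs) can be checked numerically on toy graphs. The sketch below uses brute force and is only meant as an illustration; all function names and the adjacency-dict representation are ours, not from the paper.

```python
from itertools import combinations, product

def closed_nbhd(adj, v):
    # Closed neighborhood N_G(v): v plus its neighbors (all self-loops assumed).
    return {v} | set(adj[v])

def graph_power(adj, m):
    # m-th power G^m: u, v adjacent iff dist_G(u, v) <= m,
    # obtained by m rounds of neighborhood expansion.
    ball = {v: {v} for v in adj}
    for _ in range(m):
        ball = {v: set().union(*(closed_nbhd(adj, u) for u in ball[v])) for v in adj}
    return {v: ball[v] - {v} for v in adj}

def strong_product(adj1, adj2):
    # Strong product G1 boxtimes G2: (v, w) adjacent to (u, x) iff
    # u is in N_{G1}(v) and x is in N_{G2}(w), excluding the vertex itself.
    verts = list(product(adj1, adj2))
    return {(v, w): {(u, x) for (u, x) in verts if (u, x) != (v, w)
                    and u in closed_nbhd(adj1, v) and x in closed_nbhd(adj2, w)}
            for (v, w) in verts}

def independence_number(adj):
    # Brute-force alpha(G); fine only for tiny illustrative graphs.
    verts = list(adj)
    for size in range(len(verts), 0, -1):
        for S in combinations(verts, size):
            if all(u not in adj[v] for u, v in combinations(S, 2)):
                return size
    return 0

# Toy instances: a 3-vertex path as communication network, 2 actions.
P3 = {1: {2}, 2: {1, 3}, 3: {2}}      # communication network N
clique = {'a': {'b'}, 'b': {'a'}}      # expert feedback graph (clique)
edgeless = {'a': set(), 'b': set()}    # bandit feedback graph (no edges)

# Clique F: alpha(N ⊠ F) = alpha(N).  Edgeless F: alpha(N ⊠ F) = |K| alpha(N).
assert independence_number(strong_product(P3, clique)) == independence_number(P3)
assert independence_number(strong_product(P3, edgeless)) == 2 * independence_number(P3)
```

Running the asserts on these toy graphs confirms the two identities used in the special-case bounds above.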
Lemma 1 implies that the second moment of the loss estimates is dominated by the independence number of the strong product between the two graphs. The proof of this result generalizes the analysis of (Cesa-Bianchi et al., 2019, Lemma 3), identifying the strong product as the appropriate notion for capturing the combined effects of the communication and feedback graphs. In Theorem 1, we present a new analysis, in the distributed learning setting, of the drift term arising from the decomposition of the network regret. This is obtained by combining Lemma 1 with a regret analysis technique developed by Gyorgy and Joulani (2021) for a single agent. The proof of the lower bound in Theorem 2 builds upon a new reduction to the setting of Lemma 2, which we prove in Appendix A. Lemma 2 contains a lower bound for a single-agent setting with a feedback graph and an oblivious adversary, where every time step is independently skipped with a known and constant probability 1 − q. This reduction is new and necessary: it is not enough to claim that the average number of rounds played is qT and plug this into the lower bound for bandits with feedback graphs. One needs to build an explicit assignment of ℓ_1, …, ℓ_T such that, by averaging over the random subset of active time steps, the lower bound can be proved under the conditions detailed in Lemma 2. In Section 6, we corroborate our theoretical results with experiments on synthetic data.

¹ This is a reformulation of the bound originally proven by Cesa-Bianchi et al. (2020); see Section C of the Supplementary Material for a proof.

## 2 Further related work

**Adversarial losses.** A setting closely related to ours is investigated by Herbster et al. (2021).
However, they assume that the learner has full knowledge of the communication network (a weighted undirected graph) and provide bounds for a harder notion of regret, defined with respect to an unknown smooth function mapping users to actions. Bar-On and Mansour (2019b) bound the individual regret (as opposed to our network regret) in the adversarial bandit setting of Cesa-Bianchi et al. (2019), in which all agents are active at all time steps. Their results, as well as the results of Cesa-Bianchi et al. (2019), have been extended to cooperative linear bandits by Ito et al. (2020). Della Vecchia and Cesari (2021) study cooperative linear semibandits and focus on computational efficiency. Dubey et al. (2020a) show regret bounds for cooperative contextual bandits, where the reward obtained by an agent is a linear function of the contexts. Nakamura et al. (2023) consider cooperative bandits in which agents dynamically join and leave the system.

**Stochastic losses.** Cooperative stochastic bandits are also an important topic in the online learning community. Kolla et al. (2018) study a setting in which all agents are active at all time steps. In our model, this corresponds to the special case where the feedback graph is a bandit graph (no edges) and the activation probabilities q(v) are equal to 1 for all agents v. More importantly, however, they focus on a stochastic multi-armed bandit problem. Hence, even restricting to the special case of bandits with simultaneous activation, their algorithmic ideas cannot be directly applied to our adversarial setting.
Other recently studied variants of cooperative stochastic bandits consider agent-specific restrictions on feedback (Chen et al., 2021) or on access to arms (Yang et al., 2022), bounded communication (Martínez-Rubio et al., 2019), corrupted communication (Madhushani et al., 2021), heavy-tailed reward distributions (Dubey et al., 2020b), stochastic cooperation models (Chawla et al., 2020), strategic agents (Dubey and Pentland, 2020), and Bayesian agents (Lalitha and Goldsmith, 2021). Multi-agent bandits have also been studied in federated learning settings with star-shaped communication networks (He et al., 2022; Li and Wang, 2022) in the presence of adversarial (as opposed to stochastic) agent activations. Finally, Liu et al. (2021) investigate a decentralized stochastic bandit network for matching markets.

## 3 Notation and setting

Our graphs are undirected and contain all self-loops. For any undirected graph G = (V, E) and all m ≥ 0, we let: δ_G(u, v) be the shortest-path distance (in G) between two vertices u, v ∈ V; G^m be the m-th power of G (i.e., the graph with the same vertex set V in which two vertices u, v ∈ V are adjacent if and only if δ_G(u, v) ≤ m); α(G) be the independence number of G (i.e., the largest cardinality of a subset I of V such that δ_G(u, v) > 1 for all distinct u, v ∈ I); and N_G(v) be the neighborhood {u ∈ V : δ_G(u, v) ≤ 1} of a vertex v ∈ V. To improve readability, we sometimes use the alternative notations α_m(G) for α(G^m) and N^G_m(v) for N_{G^m}(v). Finally, for any two undirected graphs G = (V, E) and G′ = (V′, E′), we denote by G ⊠ G′ their strong product, defined as the graph with vertex set V × V′ in which (v, v′) is adjacent to (u, u′) if and only if v ∈ N_G(u) and v′ ∈ N_{G′}(u′).

An instance of our problem is parameterized by:

1. A communication network N = (A, E_N) over a set A of agents, and a maximum communication delay n ≥ 0, limiting the communication among agents.
2.
A feedback graph F = (K, E_F) over a set K of actions.
3. An activation probability q(v) > 0 for each agent v ∈ A,² determining the subset of agents incurring losses on each round. Let Q = Σ_{v∈A} q(v) be the expected cardinality of this subset.
4. A sequence ℓ_1, ℓ_2, … : K → [0, 1] of losses, chosen by an oblivious adversary.

We assume the agents do not know N (see the oblivious network interface assumption introduced later). The only assumption we make is that each agent knows the pairs (v, q(v)) for all agents v located at distance n or less.³

The distributed learning protocol works as follows. At each round t = 1, 2, …, each agent v is activated with probability q(v), independently of the past and of the other agents. Agents that are not activated at time t remain inactive for that round. Let A_t be the subset of agents that are activated at time t. Each v ∈ A_t plays an action I_t(v) drawn according to its current probability distribution p_t(·, v), is charged the corresponding loss ℓ_t(I_t(v)), and then observes the losses ℓ_t(i) for all actions i ∈ N^F_1(I_t(v)). Afterwards, each agent v ∈ A broadcasts to all agents u ∈ N^N_n(v) in its n-neighborhood a feedback message containing all the losses observed by v at time t together with its current distribution p_t(·, v); any agent u ∈ N^N_n(v) receives this message at the end of round t + δ_N(v, u). Note that broadcasting a message in the n-neighborhood of an agent v can be done even when v knows only its 1-neighborhood. Indeed, because messages are time-stamped using a global clock, v drops any message received from an agent outside its n-neighborhood. On the other hand, v may receive the same message sent by some agent in its n-neighborhood more than once. To avoid making a double update, an agent can extract from each received message the timestamp together with the index of the sender, and keep these pairs stored for n time steps.
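The deduplication scheme just described (drop messages from outside the n-neighborhood, remember (timestamp, sender) pairs for n rounds) can be sketched as follows. The class and method names are ours, for illustration only.

```python
from collections import deque

class DedupBuffer:
    """Filters incoming feedback messages as in the protocol: drops messages
    arriving with delay larger than n, and drops duplicate copies by
    remembering (timestamp, sender) pairs for the last n rounds."""
    def __init__(self, n):
        self.n = n
        self.seen = set()       # (timestamp, sender) pairs currently stored
        self.history = deque()  # same pairs in arrival order, for expiry

    def accept(self, timestamp, sender, current_round):
        # With a global clock, a message from outside the n-neighborhood
        # arrives with delay > n and is dropped.
        if current_round - timestamp > self.n:
            return False
        key = (timestamp, sender)
        if key in self.seen:    # duplicate copy of an already processed message
            return False
        self.seen.add(key)
        self.history.append(key)
        # Expire pairs older than n rounds; copies can no longer arrive.
        while self.history and current_round - self.history[0][0] > self.n:
            self.seen.discard(self.history.popleft())
        return True
```

For instance, with n = 2, a second copy of the same message is rejected, and so is any message whose timestamp is more than 2 rounds old.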
Each loss observed by an agent v (either directly or in a feedback message) is used to update its local distribution p_t(·, v). To simplify the analysis, updates are postponed: updates made at time t involve only losses generated up to time t − n − 1. This means that agents may have to store feedback messages for up to n + 1 time steps before using them to perform updates. The online protocol can be written as follows. At each round t = 1, 2, …:

1. Each agent v is independently activated with probability q(v);
2. Each active agent v draws an action I_t(v) from K according to its current distribution p_t(·, v), is charged the corresponding loss ℓ_t(I_t(v)), and observes the set of losses L_t(v) = {(i, ℓ_t(i)) : i ∈ N^F_1(I_t(v))};
3. Each agent v broadcasts to all agents u ∈ N^N_n(v) the feedback message (t, v, L_t(v), p_t(·, v)), where L_t(v) = ∅ if v ∉ A_t;
4. Each agent v receives the feedback message (t − s, u, L_{t−s}(u), p_{t−s}(·, u)) from each agent u such that δ_N(v, u) = s, for all s ∈ [n].

Similarly to Cesa-Bianchi et al. (2019), we assume the feedback message sent out by an agent v at time t contains the distribution p_t(·, v) used by the agent to draw actions at time t. This is needed to compute the probabilities b_t(i, v) appearing in the importance-weighted estimates of the losses, see (3). The goal is to minimize the network regret

R_T = max_{i∈K} E[ Σ_{t=1}^T Σ_{v∈A_t} ℓ_t(I_t(v)) − Σ_{t=1}^T |A_t| ℓ_t(i) ] ,    (2)

² We assume without loss of generality that q(v) > 0 for all agents v ∈ A. The definition of regret (2) and all subsequent results could be restated equivalently in terms of the restriction N′ = (A′, E′_N) of the communication network N, where A′ = {v ∈ A : q(v) > 0} and, for all u, v ∈ A′, (u, v) ∈ E′_N if and only if (u, v) ∈ E_N.
³ This assumption can be relaxed straightforwardly by assuming that each agent v only knows q(v), which can then be broadcast to the n-neighborhood of v as the process unfolds.
where the expectation is taken with respect to the activations of the agents and the internal randomization of the strategies drawing the actions I_t(v). Since the active agents A_t are chosen i.i.d. from a fixed distribution, we also consider the average regret R_T / Q, where Q = E|A_t| > 0 for all t.

In our setting, each agent locally runs an instance of the same online algorithm. We do not require any ad-hoc interface between each local instance and the rest of the network. In particular, we make the following assumption (Cesa-Bianchi et al., 2020).

Assumption 1 (Oblivious network interface). An online algorithm alg is run with an oblivious network interface if:

1. each agent v locally runs a local instance of alg;
2. all local instances use the same initialization and the same strategy for updating the learning rates;
3. all local instances make updates while being oblivious to whether or not their host node v was active and when.

This assumption implies that each agent's instance is oblivious to both the network topology and the location of the agent in the network. Its purpose is to show that communication improves learning rates even without any network-specific tuning. In concrete applications, one might use ad-hoc variants that rely on knowledge of the task at hand and decrease the regret even further.

## 4 Upper bound

In this section, we introduce Exp3-α2 (Algorithm 1), an extension of the Exp3-Coop algorithm by Cesa-Bianchi et al. (2019), and analyze its network regret when run with an oblivious network interface.

Algorithm 1: Exp3-α2 (locally run by each agent v ∈ A)
  input: learning rates η_1(v), η_2(v), …
  for t = 1, 2, …, n + 1 do
    if v is active in this round, draw I_t(v) from K uniformly at random
  for t ≥ n + 2 do
    if v is active in this round, draw I_t(v) from K according to p_t(·, v) in (3)

An instance of Exp3-α2 is locally run by each agent v ∈ A.
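As a minimal illustration of the quantities defined in (3) below, the sketch here computes the exponential-weights distribution p_t(·, v) and the observation probability b_s(i, v), and checks by Monte Carlo that the importance-weighted estimate is unbiased. Function names, interfaces, and all numbers are ours, chosen for illustration.

```python
import math
import random

def exp_weights(cum_est_losses, eta):
    # p_t(i, v) in (3): exponential weights over the cumulative
    # importance-weighted loss estimates available up to round t - n - 1.
    w = [math.exp(-eta * L) for L in cum_est_losses]
    s = sum(w)
    return [x / s for x in w]

def observe_prob(mass, q):
    # b_s(i, v) in (3): probability that at least one agent u in the
    # n-neighborhood of v plays an action in the 1-neighborhood of i.
    # mass[u] = sum of p_s(j, u) over j in N^F_1(i); q[u] = activation prob.
    prod = 1.0
    for u in q:
        prod *= 1.0 - q[u] * mass[u]
    return 1.0 - prod

# Unbiasedness of the estimate l_hat = l * B / b, with B ~ Bernoulli(b):
# E[l_hat] = l (illustrative Monte Carlo check with made-up numbers).
random.seed(0)
loss = 0.7
b = observe_prob({"u": 0.5, "v": 0.5}, {"u": 1.0, "v": 1.0})  # b = 0.75
est = [loss * (random.random() < b) / b for _ in range(100_000)]
assert abs(sum(est) / len(est) - loss) < 0.01
```

The check mirrors the property E_{s−n}[ℓ̂_s(i, v)] = ℓ_s(i) stated after (3).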
The algorithm is parameterized by its (variable) learning rates η_1(v), η_2(v), …, which, in principle, can be arbitrary (measurable) functions of the history. In all rounds t in which the agent is active, v draws an action I_t(v) according to a distribution p_t(·, v). For the first n + 1 rounds, p_t(·, v) is the uniform distribution over K. During all remaining time steps t, the algorithm computes exponential weights using all the available feedback generated up to (and including) round t − n − 1. More precisely, for any action i ∈ K,

p_t(i, v) = w_t(i, v) / ‖w_t(·, v)‖_1 ,
w_t(i, v) = exp( −η_t(v) Σ_{s=1}^{t−n−1} ℓ̂_s(i, v) ) ,
ℓ̂_s(i, v) = ℓ_s(i) B_s(i, v) / b_s(i, v) ,    (3)
B_s(i, v) = I{ ∃u ∈ N^N_n(v) : u ∈ A_s, I_s(u) ∈ N^F_1(i) } ,
b_s(i, v) = 1 − Π_{u∈N^N_n(v)} ( 1 − q(u) Σ_{j∈N^F_1(i)} p_s(j, u) ) .

The event B_s(i, v) indicates whether an agent in the n-neighborhood of v played, at time s, an action in the 1-neighborhood of i. If B_s(i, v) occurs, then agent v can use ℓ̂_s(i, v) to update the local estimate of the cumulative loss of i. Note that the ℓ̂_s(i, v) are proper importance-weighted estimates, as E_{s−n}[ℓ̂_s(i, v)] = ℓ_s(i) for all v ∈ A, i ∈ K, and s > n, where the notation E_{s−n} denotes conditioning with respect to any randomness in rounds 1, …, s − n − 1. Note also that when q(u) = 1 for all u ∈ A and F is the edgeless graph, the probabilities p_t(i, v) in (3) correspond to those computed by Exp3-Coop (Cesa-Bianchi et al., 2019).

Before analyzing our cooperative implementation of Exp3-α2, we present a key graph-theoretic result that helps us characterize the joint impact on the regret of the communication network and the feedback graph. Our new result relates the variance of the estimates in (3) to the structure given by the strong product of N^n and F.

Lemma 1.
Let N = (A, E_N) and F = (K, E_F) be any two graphs, n ≥ 0, (q(v))_{v∈A} a set of numbers in (0, 1], Q = Σ_{v∈A} q(v), and (p(i, v))_{i∈K, v∈A} a set of numbers in (0, 1] such that Σ_{i∈K} p(i, v) = 1 for all v ∈ A. Then

Σ_{(i,v)∈K×A} q(v) p(i, v) / [ 1 − Π_{u∈N^N_n(v)} ( 1 − q(u) Σ_{j∈N^F_1(i)} p(j, u) ) ] ≤ (1/(1 − e^{−1})) ( α(N^n ⊠ F) + Q ) .

Proof. Let w = (w(i, v))_{(i,v)∈K×A}, where w(i, v) = q(v) p(i, v), and for all (i, v) ∈ K × A set W(i, v) = Σ_{(j,u)∈N^F_1(i)×N^N_n(v)} w(j, u). Define also, for all c = (c(j, u))_{(j,u)∈K×A} ∈ [0, 1]^{|K|×|A|} and (i, v) ∈ K × A,

f_c(i, v) = 1 − Π_{u∈N^N_n(v)} Π_{j∈N^F_1(i)} ( 1 − c(j, u) ) .

Then we can write the left-hand side of the statement of the lemma as

Σ_{(i,v)∈K×A} w(i, v)/f_w(i, v) = Σ_{(i,v): W(i,v)<1} w(i, v)/f_w(i, v) + Σ_{(i,v): W(i,v)≥1} w(i, v)/f_w(i, v) =: (I) + (II)

and proceed by upper bounding the two terms (I) and (II) separately. For the first term (I), using the inequality 1 − x ≤ e^{−x} (for all x ∈ ℝ) with x = w(j, u), we can write, for any (i, v) ∈ K × A,

f_w(i, v) ≥ 1 − exp( − Σ_{u∈N^N_n(v)} Σ_{j∈N^F_1(i)} w(j, u) ) = 1 − exp( −W(i, v) ) .

Now, since in (I) we are only summing over (i, v) ∈ K × A such that W(i, v) < 1, we can use the inequality 1 − e^{−x} ≥ (1 − e^{−1}) x (for all x ∈ [0, 1]) with x = W(i, v), obtaining f_w(i, v) ≥ (1 − e^{−1}) W(i, v), and in turn

(I) ≤ (1/(1 − e^{−1})) Σ_{(i,v): W(i,v)<1} w(i, v)/W(i, v) ≤ α(N^n ⊠ F) / (1 − e^{−1}) ,

where in the last step we used a known graph-theoretic result, see Lemma 3 in the Supplementary Material. For the second term (II): for all v ∈ A, let r(v) be the cardinality of N^N_n(v).
Then, for any (i, v) ∈ K × A such that W(i, v) ≥ 1,

1/f_w(i, v) ≤ max{ 1/f_c(i, v) : c ∈ [0, 1]^{|K|×|A|}, Σ_{(j,u)∈N^F_1(i)×N^N_n(v)} c(j, u) = 1 }
= max{ 1 / ( 1 − Π_{u∈N^N_n(v)} (1 − C(u)) ) : C ∈ [0, 1]^{|A|}, Σ_{u∈N^N_n(v)} C(u) = 1 }
= max{ 1 / ( 1 − Π_{u∈N^N_n(v)} (1 − C(u)) ) : C ∈ [0, 1]^{|A|}, Σ_{u∈N^N_n(v)} (1 − C(u)) = r(v) − 1 }
≤ 1 / ( 1 − ((r(v) − 1)/r(v))^{r(v)} ) ≤ 1/(1 − e^{−1}) ,

where the first equality follows from the definition of f_c(i, v) and the monotonicity of x ↦ 1 − x, the second-to-last inequality is implied by the AM-GM inequality (Lemma 4), and the last one comes from r(v) ≥ 1 (being v ∈ N^N_n(v)). Hence

(II) ≤ (1/(1 − e^{−1})) Σ_{(i,v): W(i,v)≥1} w(i, v) ≤ (1/(1 − e^{−1})) Σ_{v∈A} q(v) Σ_{i∈K} p(i, v) = Q/(1 − e^{−1}) .

For a slightly stronger version of this result, see Lemma 6 in the Supplementary Material. ∎

By virtue of Lemma 1, we can now show the main result of this section.

Theorem 1. If each agent v ∈ A uses adaptive learning rates equal to η_t(v) = 0 for t ≤ n + 1 and η_t(v) = √( log(K) / Σ_{s=1}^t X_s(v) ), with X_t(v) = n + Σ_{i∈K} p_t(i, v)/b_t(i, v), for t > n + 1, the average network regret of Exp3-α2 playing with an oblivious network interface can be bounded as

R_T / Q ≤ (n + 1) + 3 √( log(K) ( n + 1 + α(N^n ⊠ F)/Q ) T ) .

Proof. Let i* ∈ argmin_{i∈K} E[ Σ_{t=1}^T |A_t| ℓ_t(i) ], where the expectation is with respect to the random sequence A_t ⊆ A of agent activations. We write the network regret R_T as a weighted sum of single-agent regrets R_T(v):

R_T = Σ_{v∈A} q(v) R_T(v) ,  where  R_T(v) = E[ Σ_{t=1}^T ( Σ_{i∈K} ℓ̂_t(i, v) p_t(i, v) − ℓ̂_t(i*, v) ) ] ,

where the expectation is now only with respect to the random draw of the agents' actions, and it is separated from the activation probability q(v). Fix any agent v ∈ A. Exp3-α2 plays uniformly for the first n + 1 rounds, and each agent therefore incurs at most linear regret in this phase.
For t > n + 1 we borrow a decomposition technique from (Gyorgy and Joulani, 2021): for any sequence (p̃_t(·, v))_{t>n+1} of distributions over K, the above expectation can be written as

E[ Σ_{t=n+2}^T ( Σ_{i∈K} ℓ̂_t(i, v) p̃_{t+1}(i, v) − ℓ̂_t(i*, v) ) ] + E[ Σ_{t=n+2}^T Σ_{i∈K} ℓ̂_t(i, v) ( p_t(i, v) − p̃_{t+1}(i, v) ) ] .    (5)

Take now p̃_t(·, v) as the (full-information) exponential-weights updates with non-increasing step sizes η_{t−1}(v) for the sequence of losses ℓ̂_t(·, v). That is, p̃_1(·, v) is the uniform distribution over K and, for any time step t and action i ∈ K, p̃_{t+1}(i, v) = w̃_{t+1}(i, v)/‖w̃_{t+1}(·, v)‖_1, where w̃_{t+1}(i, v) = exp( −η_t(v) Σ_{s=1}^t ℓ̂_s(i, v) ). With this choice, the first term in (5) is the look-ahead regret for the iterates p̃_{t+1}(·, v) (which depend on ℓ̂_t(·, v) at time t), while the second one measures the drift of p_t(·, v) from p̃_{t+1}(·, v). Using an argument from (Joulani et al., 2020, Theorem 3),⁴ we deterministically bound the first term in (5):

Σ_{t=n+2}^T ( Σ_{i∈K} ℓ̂_t(i, v) p̃_{t+1}(i, v) − ℓ̂_t(i*, v) ) ≤ ln|K| / η_T(v) .    (6)

⁴ We use (Joulani et al., 2020, Theorem 3) with p_t = 0 for all t ∈ [T], r_0 = (1/η_0(v)) Σ_i p_i ln(p_i), r_t(p) = (1/η_t(v) − 1/η_{t−1}(v)) Σ_i p_i ln(p_i) for all t ∈ [T], and dropping the Bregman-divergence terms due to the convexity of r_t.

The subtle part is now to control the second term in (5). To do so, fix any t > n + 1. Note that for all i ∈ K,

w_t(i, v) = exp( −η_t(v) Σ_{s=1}^{t−n−1} ℓ̂_s(i, v) ) ≥ exp( −η_t(v) Σ_{s=1}^{t} ℓ̂_s(i, v) ) = w̃_{t+1}(i, v)

(using ℓ̂_s(i, v) ≥ 0 for all s, i, v), which in turn, using the inequality eˣ ≥ 1 + x (for all x ∈ ℝ), yields

p̃_{t+1}(i, v)/p_t(i, v) ≥ w̃_{t+1}(i, v)/w_t(i, v) = exp( −η_t(v) Σ_{s=t−n}^{t} ℓ̂_s(i, v) ) ≥ 1 − η_t(v) Σ_{s=t−n}^{t} ℓ̂_s(i, v) .

Thus, we upper bound the t-th summand of the second expectation in (5) by

E[ η_t(v) Σ_{i∈K} ℓ̂_t(i, v) p_t(i, v) Σ_{s=t−n}^{t−1} ℓ̂_s(i, v) ] + E[ η_t(v) Σ_{i∈K} ℓ̂_t(i, v)² p_t(i, v) ] =: g_t^{(1)}(v) + g_t^{(2)}(v) .    (7)

We study the two terms g_t^{(1)}(v) and g_t^{(2)}(v) separately. Let then H_t = H_t(v) be the σ-algebra generated by the activations of agents and the actions drawn by them at times 1, …
, t − 1, and write E_t = E[· | H_t]. First, we bound g_t^{(1)}(v) in (7) using the fact that p_t(·, v), η_t(v), and b_s(·, v) (for all s ∈ {t − n, …, t}) are determined by the randomness in steps 1, …, t − n − 1. Using the tower rule, we take the conditional expectation inside; since all quantities apart from B_{t−n}(i, v), …, B_t(i, v) are determined given H_{t−n}, we rewrite the expression as

g_t^{(1)}(v) = E[ Σ_{i∈K} η_t(v) p_t(i, v) (ℓ_t(i)/b_t(i, v)) Σ_{s=t−n}^{t−1} (ℓ_s(i)/b_s(i, v)) E_{t−n}[ B_t(i, v) B_s(i, v) ] ] .

Conditional on H_{t−n}, the Bernoulli random variables B_s(i, v) and B_t(i, v), for s = t − n, …, t − 1, are independent. This follows because the feedbacks from time s are still missing at time t for s = t − n, …, t − 1; hence, by the independent activation of agents and the fact that the only other source of randomness is the independent internal randomization of the algorithm, they are independent random variables, implying

E_{t−n}[ B_t(i, v) B_s(i, v) ] = b_t(i, v) b_s(i, v) ,  for s = t − n, …, t − 1 .

Using ℓ_t(i), ℓ_s(i) ≤ 1, we then get

g_t^{(1)}(v) ≤ E[ Σ_{i∈K} η_t(v) p_t(i, v) n ] = E[ η_t(v) n ] .

With a similar argument, we also get

g_t^{(2)}(v) = E[ Σ_{i∈K} η_t(v) ℓ_t(i)² p_t(i, v) / b_t(i, v)² · E_{t−n}[ B_t(i, v) ] ] ≤ E[ Σ_{i∈K} η_t(v) p_t(i, v) / b_t(i, v) ] .

Finally, the single-agent regret of each agent v ∈ A is bounded by

R_T(v) ≤ E[ ln|K| / η_T(v) ] + (n + 1) + Σ_{t=n+2}^T E[ η_t(v) X_t(v) ] ,

where we defined X_t(v) = I{t > n + 1} ( n + Σ_{i∈K} p_t(i, v)/b_t(i, v) ). We now take η_t(v) = √( log(K) / Σ_{s=1}^t X_s(v) ) and use a standard inequality stating that, for any a_t > 0,

Σ_{t=1}^T a_t / √( Σ_{s=1}^t a_s ) ≤ 2 √( Σ_{t=1}^T a_t ) .
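This standard inequality (the telescoping bound behind self-confident learning rates) can be checked numerically. The sketch below is an illustrative check on random positive sequences, not a proof; names are ours.

```python
import math
import random

def lhs(a):
    # Left-hand side: sum_t a_t / sqrt(sum_{s<=t} a_s).
    total, out = 0.0, 0.0
    for x in a:
        total += x
        out += x / math.sqrt(total)
    return out

# Check sum_t a_t / sqrt(sum_{s<=t} a_s) <= 2 * sqrt(sum_t a_t)
# on random positive sequences.
random.seed(1)
for _ in range(100):
    a = [random.uniform(0.01, 5.0) for _ in range(50)]
    assert lhs(a) <= 2 * math.sqrt(sum(a))
```

The proof of the inequality itself is one line: a_t/√(Σ_{s≤t} a_s) ≤ 2(√(Σ_{s≤t} a_s) − √(Σ_{s≤t−1} a_s)), which telescopes.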
Applying this inequality with a_t = X_t(v), we have

R_T(v) ≤ (n + 1) + E[ √( log(K) Σ_{t=1}^T X_t(v) ) ] + E[ Σ_{t=n+2}^T log(K) X_t(v) / √( log(K) Σ_{s=1}^t X_s(v) ) ] ≤ (n + 1) + 3 E[ √( log(K) Σ_{t=1}^T X_t(v) ) ] .

Multiplying by q(v) and summing over agents v ∈ A, we obtain

R_T = Σ_{v∈A} q(v) R_T(v) ≤ Q(n + 1) + 3Q Σ_{v∈A} (q(v)/Q) E[ √( log(K) Σ_{t=1}^T X_t(v) ) ]
≤ Q(n + 1) + 3Q √( log(K) Σ_{t=1}^T E[ Σ_{v∈A} (q(v)/Q) X_t(v) ] )
≤ Q(n + 1) + 3Q √( log(K) ( n + 1 + α(N^n ⊠ F)/Q ) T ) ,

where the first inequality follows from Jensen's inequality and the second from Lemma 1. ∎

Note that in Theorem 1, every agent tunes the learning rate η_t(v) using information available at time t. This allows the network regret to adapt to the unknown parameters of the problem, such as the time horizon T, the independence number α(N^n ⊠ F), and the total activation mass Q on A. This approach improves over the doubling-trick approach used in Cesa-Bianchi et al. (2019), since we do not need to restart the algorithm.

## 5 Lower bound

In this section, we prove that the upper bound in Theorem 1 is not only optimal in a minimax sense, i.e., attained for some pairs of graphs (N, F), but also tight in an instance-dependent sense, for all pairs of graphs belonging to a large class.

Definition 1. Let 𝒢 be the class of all pairs of graphs (N, F) such that α(N ⊠ F) = α(N) α(F).

Many sufficient conditions guaranteeing that (N, F) ∈ 𝒢 are known in the graph-theory literature: see, e.g., Hales (1973, Section 3), Acín et al. (2017, Theorem 6), and Rosenfeld (1967, Theorem 2). To the best of our knowledge, a full characterization of 𝒢 is still a challenging open problem in graph theory that goes beyond the scope of this paper. It is easy to verify that if either N or F is a clique or an edgeless graph, then (N, F) ∈ 𝒢. We remark that these instances cover, in particular, both the bandit and the full-information case, which were previously only studied individually and analyzed with ad hoc techniques. For some further discussion on 𝒢, we refer the interested reader to Appendix B.1.
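The single-agent skipped-rounds regret R^sa_T that the reduction below relies on (each round is active with probability μ, and only active rounds count towards the regret) can be estimated by direct simulation. The helper below is an illustrative sketch; its name and interface are ours, not from the paper.

```python
import random

def skipped_rounds_regret(mu, losses, policy, trials=1000, seed=0):
    # Monte Carlo estimate of the single-agent regret R^sa_T(mu, alg, l):
    # each round t is active (A_t = 1) with probability mu; skipped rounds
    # contribute neither loss nor regret. `policy(history)` returns an
    # action index given the history of active rounds seen so far.
    rng = random.Random(seed)
    T, K = len(losses), len(losses[0])
    played = 0.0                 # cumulative loss of the learner
    comparator = [0.0] * K       # cumulative loss of each fixed action
    for _ in range(trials):
        history = []
        for t in range(T):
            if rng.random() < mu:            # A_t = 1: round is active
                a = policy(history)
                played += losses[t][a]
                for i in range(K):
                    comparator[i] += losses[t][i]
                history.append((t, a))
    # Regret against the best fixed action, averaged over trials.
    return played / trials - min(comparator) / trials
```

For example, with two actions of losses (0, 1) at every round, always playing action 0 yields zero regret while always playing action 1 yields regret close to μT.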
The proof of the lower bound in Theorem 2 exploits a reduction to a setting we introduce in Lemma 2. In this lemma, we state that in a single-agent setting with a feedback graph, if each one of the T time steps is independently active with a known and constant probability μ (and skipped otherwise), the learner's regret is Ω(√(α(F) μ T)). Skipped rounds do not count towards the regret. More precisely, at each round t there is an independent Bernoulli random variable A_t with mean μ. If A_t = 0, the learner is not required to make any prediction, incurs no loss, and receives no feedback. The (single-agent) regret of a possibly randomized algorithm alg on a sequence (ℓ_t)_{t∈[T]} of losses is defined as R^sa_T(μ, alg, ℓ) = max_{i∈[K]} R^sa_T(μ, alg, ℓ, i), where

R^sa_T(μ, alg, ℓ, i) = E[ Σ_{t=1}^T ( ℓ_t(I_t) − ℓ_t(i) ) I{A_t = 1} ]

and I_t is the random variable denoting the action played by the learner at time t, which only depends on the rounds s ∈ {1, …, t} where A_s = 1. The expectation in R^sa_T is computed over both I_t and A_t for t ∈ [T]. Note that it is not true, in general, that R^sa_T(μ, alg, ℓ) = μ max_{i∈[K]} E[ Σ_{t=1}^T ( ℓ_t(I_t) − ℓ_t(i) ) ].

Lemma 2. For any feedback graph F, for any (possibly randomized) online learning algorithm alg, for any μ > 0, and for any T ≥ max{ 0.0064 α(F)³, 1/μ³ }, if each round t ∈ [T] is independently active with probability μ, then

sup_ℓ R^sa_T(μ, alg, ℓ) = Ω( √( α(F) μ T ) ) .

The proof of this lemma can be found in Appendix A. Now let alg be a possibly randomized online algorithm. Let R_T(q, alg, ℓ) be the network regret (2) incurred by alg run with an oblivious network interface, losses ℓ = (ℓ_t)_{t∈[T]}, and activation probabilities q = (q(v))_{v∈A}. We can now prove our lower bound.

Theorem 2.
For any choice of n, any pair of graphs (N^n, F) ∈ 𝒢, and all Q ∈ (0, α_n(N)], for T ≥ max{ 0.0064 α(F)³, α_n(N)³/Q³ } the following holds:

inf_alg sup_{ℓ,q} R_T(q, alg, ℓ) = Ω( √( Q α(N^n ⊠ F) T ) ) ,

where the infimum is over all randomized online algorithms and the supremum is over all assignments of losses ℓ = (ℓ_t)_{t∈[T]} and activation probabilities (q(v))_{v∈A} such that Q = Σ_{v∈A} q(v).

Proof. Fix any (N, n, F) and Q ∈ (0, α_n(N)] as in the statement of the theorem. Then α(N^n ⊠ F) = α_n(N) α(F). Let I ⊆ A be a set of α_n(N) agents such that δ_N(u, v) > n for all distinct u, v ∈ I. Define the activation probabilities q(v) = Q/α_n(N) ≤ 1 for all v ∈ I and q(v) = 0 for all v ∈ A \ I. By construction, no communication occurs among agents in I. Furthermore, each agent v ∈ I is activated independently with the same probability q(v) = P(v ∈ A_t) = Q/α_n(N). Therefore, we can use Lemma 2 to show that there exists a sequence of losses that simultaneously lower bounds the regret of all agents. Indeed, for all T ≥ max{ 0.0064 α(F)³, α_n(N)³/Q³ } we have

inf_alg sup_ℓ R_T(q, alg, ℓ) = inf_alg sup_ℓ max_{i∈K} E[ Σ_{t=1}^T Σ_{v∈I} ( ℓ_t(I_t(v)) − ℓ_t(i) ) I{v ∈ A_t} ]
= inf_alg sup_ℓ max_{i∈K} Σ_{v∈I} R^sa_T( Q/α_n(N), alg, ℓ, i )
= α_n(N) inf_alg sup_ℓ R^sa_T( Q/α_n(N), alg, ℓ )
= Ω( α_n(N) √( α(F) (Q/α_n(N)) T ) )    (Lemma 2 with μ = Q/α_n(N))
= Ω( √( Q α(F) α_n(N) T ) ) = Ω( √( Q α(N^n ⊠ F) T ) ) ,

where I_t(v) is the random variable denoting the arm pulled at time t by the instance of alg run by agent v, and Lemma 2 is invoked on |I| = α_n(N) instances of alg with feedback graph G = F and independent randomization. ∎

## 6 Experiments

To empirically appreciate the impact of cooperation, we run a number of experiments on synthetic data. Our code is available at Della Vecchia (2024). For each choice of N and F, we compare Exp3-α2 run on N and F against a baseline that runs Exp3-α2 on N′ and F, where N′ is an edgeless communication graph. Hence, the baseline runs |A| independent instances on the same feedback graph.
In our experiments, we fix the time horizon ($T = 10{,}000$), the number of arms ($K = 20$), and the number of agents ($|\mathcal A| = 20$). We also set the delay $\delta_N$ to 1. The loss of each action is a Bernoulli random variable with parameter $1/2$, except for the optimal action, which has parameter $1/2 - \sqrt{K/T}$. The activation probabilities $q(v)$ are the same for all agents $v \in \mathcal A$ and range in the set $\{0.05, 0.5, 1\}$. This implies that $Q \in \{1, 10, 20\}$. The feedback graph $F$ and the communication graph $N$ are Erdős–Rényi random graphs with parameters $p_N, p_F \in \{0.2, 0.8\}$. For each choice of the parameters, the same realization of $N$ and $F$ was kept fixed in all the experiments, see Figure 1.

Figure 1: The random instances of $N$ (leftmost graphs) and $F$ (rightmost graphs) used in our experiments. The sparse graphs are Erdős–Rényi with parameter 0.2, the dense graphs are Erdős–Rényi with parameter 0.8.

In each experiment, Exp3-α2 and our baseline are run on the same realization of losses and agent activations. Hence, the only stochasticity left is the internal randomization of the algorithms. Our results are averages over 20 repetitions of each experiment with respect to this randomization.

(a) Results for q = 0.05

p_N | p_F | Exp3-α2 | Indep. | Exp3-α2/Indep.
0.8 | 0.8 |  31.7   | 110.1  | 29%
0.8 | 0.2 |  94.4   | 130.5  | 72%
0.2 | 0.8 |  59.5   | 110.1  | 54%
0.2 | 0.2 | 103     | 130.5  | 79%

(b) Results for q = 0.5

p_N | p_F | Exp3-α2 | Indep. | Exp3-α2/Indep.
0.8 | 0.8 |  18.3   |  47.2  | 39%
0.8 | 0.2 |  34.1   | 101.1  | 34%
0.2 | 0.8 |  22.5   |  47.2  | 48%
0.2 | 0.2 |  77.7   | 101.1  | 77%

(c) Results for q = 1

p_N | p_F | Exp3-α2 | Indep. | Exp3-α2/Indep.
0.8 | 0.8 |  18.7   |  22.6  | 83%
0.8 | 0.2 |  23.2   |  84.7  | 27%
0.2 | 0.8 |  18.8   |  22.6  | 83%
0.2 | 0.2 |  41.3   |  84.7  | 49%

Table 2: Performance of Exp3-α2 and of independent single-agent optimal algorithms with feedback graphs. We summarize the performance in terms of total cumulative regret $R_T/Q$ after 1000 rounds for the two algorithms.
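A minimal sketch of the synthetic setup described above (the helper names `erdos_renyi` and `loss_matrix` are ours, not from the released code): Erdős–Rényi communication and feedback graphs, and Bernoulli losses where the optimal arm has parameter $1/2 - \sqrt{K/T}$.

```python
import random

def erdos_renyi(n, p, rng):
    """Undirected Erdos-Renyi graph with self-loops, as an adjacency-set dict."""
    adj = {i: {i} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def loss_matrix(T, K, best, rng):
    """Bernoulli(1/2) losses, except arm `best`, which has parameter 1/2 - sqrt(K/T)."""
    gap = (K / T) ** 0.5
    return [[float(rng.random() < (0.5 - gap if k == best else 0.5))
             for k in range(K)] for _ in range(T)]

rng = random.Random(42)
T, K, A = 10_000, 20, 20
N = erdos_renyi(A, 0.8, rng)   # dense communication graph (p_N = 0.8)
F = erdos_renyi(K, 0.2, rng)   # sparse feedback graph (p_F = 0.2)
losses = loss_matrix(T, K, best=0, rng=rng)
```

Fixing the random seed mirrors the paper's protocol of keeping one realization of $N$, $F$, losses, and activations fixed across repetitions, so that only the algorithms' internal randomization varies.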
The last column of each table is the percentage of the cumulative regret of Exp3-α2 with respect to the independent optimal algorithms. Figure 2 and Table 2 summarize the results of our experiments in terms of the average regret $R_T/Q$. See Appendix D for the actual learning curves. Recall that our upper bound (1) scales with the quantity $\sqrt{Q\,\alpha(N\boxtimes F)\,T}$.

Figure 2: Average regret of Exp3-α2 (blue dots) against the baseline (red dots). The X-axis and the Y-axis correspond to the parameters $p_F$ and $p_N$ of the Erdős–Rényi graphs, the Z-axis is the average regret $R_T/Q$. The three plots correspond to increasing values (from left to right) of the activation probability: q = 0.05 (leftmost plot), q = 0.5 (central plot), q = 1 (rightmost plot).

Note that our algorithm (blue dots) is never worse than the baseline (red dots). This is consistent with the fact that the baseline's communication graph $\bar N$ is edgeless, implying that $\alpha(\bar N \boxtimes F) = |\mathcal A|\,\alpha(F)$. Consistently with (1), the average performance gets worse when $Q$ decreases (see footnote 5). By construction, the performance of the baseline in each plot remains constant when $p_N$ varies in $\{0.2, 0.8\}$. On the other hand, our algorithm performs worse when $N$ is sparse because $\alpha(N \boxtimes F)$ increases. The performance of both algorithms is worse when $F$ is sparse because, once more, $\alpha(N \boxtimes F)$ increases.

7 Conclusions

In this work, we nearly characterize the minimax regret in cooperative online learning with feedback graphs, showing that the dependence on $\alpha(N^n \boxtimes F)$ in our bounds is tight in all but a few pathological instances. In a bandit setting, when all agents are active at all time steps, previous works showed that communication speeds up learning by reducing the variance of loss estimates. On the opposite end of the spectrum, in full-information settings, updating non-active agents was shown to improve regret. These results left open the question of which updates help in intermediate settings, and why.
In this paper, we prove that both types of updates help local learners across the entire experts–bandits spectrum (Theorem 1). We stress that this strategy crucially depends on the stochasticity of the activations. Indeed, Cesa-Bianchi et al. (2020) disproved the naive intuition that more information automatically translates into better bounds, showing how using all the available data can lead to linear regret in the case of adversarial activations with an oblivious network interface. As we only considered undirected feedback graphs, the extension to the directed case remains open. Using the terminology introduced by Alon et al. (2015) for directed graphs, we conjecture that our main result (Theorem 1) remains true in the strongly observable case. In the weakly observable case, the scaling parameter of the single-agent regret is a graph-theoretic quantity different from the independence number, and the minimax rate becomes $T^{2/3}$. In this case, we do not know the best possible scaling parameter for the network regret. Finally, we leave open the problem of characterizing the minimax regret with stochastic activations but no oblivious network interface. We conjecture that the lower bound in Theorem 2 can be extended to the case in which the algorithms run by the agents are not necessarily instances of the same online algorithm. In other words, we conjecture that the oblivious network interface is sufficient to achieve minimax optimality.

Footnote 5: Also the baseline, whose agents learn in isolation, gets worse when $Q$ decreases. Indeed, when $Q = 1$ agents only get to play for $T/|\mathcal A|$ time steps each, and together achieve a network regret $R_T$ that scales with $\sqrt{|\mathcal A|\,\alpha(F)}$, as predicted by our analysis.
Acknowledgments

NCB is partially supported by the MUR PRIN grant 2022EKNE5K (Learning in Markets and Society), funded by the NextGenerationEU program within the PNRR scheme, by the FAIR (Future Artificial Intelligence Research) project, funded by the NextGenerationEU program within the PNRR-PE-AI scheme, and by the EU Horizon CL4-2022-HUMAN-02 research and innovation action under grant agreement 101120237, project ELIAS (European Lighthouse of AI for Sustainability). TC gratefully acknowledges the support of the University of Ottawa through grant GR002837 (Start-Up Funds) and that of the Natural Sciences and Engineering Research Council of Canada (NSERC) through grants RGPIN-2023-03688 (Discovery Grants Program) and DGECR-2023-00208 (Discovery Grants Program, DGECR – Discovery Launch Supplement). RDV is thankful for the funding received from the CHIST-ERA project Causal eXplanations in Reinforcement Learning (CausalXRL).

References

Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems, 24, 2011.

Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.

Nicolò Cesa-Bianchi, Tommaso Cesari, and Claire Monteleoni. Cooperative online learning: Keeping your neighbors updated. In Algorithmic Learning Theory, pages 234–250. PMLR, 2020.

Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. Multi-agent online optimization with delays: Asynchronicity, adaptivity, and optimism. Journal of Machine Learning Research, 23(78):1–49, 2022.

Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 20(17):1–38, 2019.

Yogev Bar-On and Yishay Mansour.
Individual regret in cooperative nonstochastic multi-armed bandits. Advances in Neural Information Processing Systems, 32, 2019a.

Nathan Linial. Locality in distributed graph algorithms. SIAM Journal on Computing, 21(1):193–201, 1992.

Jukka Suomela. Survey of local algorithms. ACM Computing Surveys (CSUR), 45(2):1–40, 2013.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

András György and Pooria Joulani. Adapting to delays and data in adversarial multi-armed bandits. In International Conference on Machine Learning, pages 3988–3997. PMLR, 2021.

Mark Herbster, Stephen Pasteris, Fabio Vitale, and Massimiliano Pontil. A gang of adversarial bandits. Advances in Neural Information Processing Systems, 34, 2021.

Yogev Bar-On and Yishay Mansour. Individual regret in cooperative nonstochastic multi-armed bandits. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019b.

Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi Kawarabayashi. Delay and cooperation in nonstochastic linear bandits. Advances in Neural Information Processing Systems, 33:4872–4883, 2020.

Riccardo Della Vecchia and Tommaso Cesari. An efficient algorithm for cooperative semi-bandits. In Algorithmic Learning Theory, pages 529–552. PMLR, 2021.

Abhimanyu Dubey et al. Kernel methods for cooperative multi-agent contextual bandits. In International Conference on Machine Learning, pages 2740–2750. PMLR, 2020a.

Tomoki Nakamura, Naoki Hayashi, and Masahiro Inuiguchi. Cooperative learning for adversarial multi-armed bandit on open multi-agent systems.
IEEE Control Systems Letters, 2023.

Ravi Kumar Kolla, Krishna Jagannathan, and Aditya Gopalan. Collaborative learning of stochastic bandits over a social network. IEEE/ACM Transactions on Networking, 26(4):1782–1795, 2018.

Yu-Zhen Janice Chen, Stephen Pasteris, Mohammad Hajiesmaili, John Lui, Don Towsley, et al. Cooperative stochastic bandits with asynchronous agents and constrained feedback. Advances in Neural Information Processing Systems, 34, 2021.

Lin Yang, Yu-Zhen Janice Chen, Mohammad Hajiesmaili, John Lui, and Don Towsley. Distributed bandits with heterogeneous agents. arXiv preprint arXiv:2201.09353, 2022.

David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. Advances in Neural Information Processing Systems, 32, 2019.

Udari Madhushani, Abhimanyu Dubey, Naomi Leonard, and Alex Pentland. One more step towards reality: Cooperative bandits with imperfect communication. Advances in Neural Information Processing Systems, 34, 2021.

Abhimanyu Dubey et al. Cooperative multi-agent bandits with heavy tails. In International Conference on Machine Learning, pages 2730–2739. PMLR, 2020b.

Ronshee Chawla, Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. The gossiping insert-eliminate algorithm for multi-agent bandits. In International Conference on Artificial Intelligence and Statistics, pages 3471–3481. PMLR, 2020.

Abhimanyu Dubey and Alex Pentland. Private and byzantine-proof cooperative decision-making. In AAMAS, pages 357–365, 2020.

Anusha Lalitha and Andrea Goldsmith. Bayesian algorithms for decentralized stochastic bandits. IEEE Journal on Selected Areas in Information Theory, 2(2):564–583, 2021.

Jiafan He, Tianhao Wang, Yifei Min, and Quanquan Gu. A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. Advances in Neural Information Processing Systems, 35:4762–4775, 2022.

Chuanhao Li and Hongning Wang.
Asynchronous upper confidence bound algorithms for federated linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 6529–6553. PMLR, 2022.

Lydia T. Liu, Feng Ruan, Horia Mania, and Michael I. Jordan. Bandit learning in decentralized matching markets. Journal of Machine Learning Research, 22(1):9612–9645, 2021.

Pooria Joulani, András György, and Csaba Szepesvári. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, variance reduction, and variational bounds. Theoretical Computer Science, 808:108–138, 2020.

R. Stanton Hales. Numerical invariants and the strong product of graphs. Journal of Combinatorial Theory, Series B, 15(2):146–155, 1973.

Antonio Acín, Runyao Duan, David E. Roberson, Ana Belén Sainz, and Andreas Winter. A new property of the Lovász number and duality relations between graph parameters. Discrete Applied Mathematics, 216:489–501, 2017.

Moshe Rosenfeld. On a problem of C.E. Shannon in graph theory. Proceedings of the American Mathematical Society, 18(2):315–319, 1967.

Riccardo Della Vecchia. Cooperative online learning with feedback graphs. https://github.com/riccardodv/COOP-learning, 2024.

Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35. PMLR, 2015.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Francesco Orabona, Koby Crammer, and Nicolò Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411–435, 2015.

A Proof of Lemma 2

Lemma 2.
For any feedback graph $F$, for any (possibly randomized) online learning algorithm alg, for any $\mu > 0$, and for any $T \ge \max\big\{0.0064\,\alpha(F)^3,\ 1/\mu^3\big\}$, if each round $t\in[T]$ is independently active with probability $\mu$, then
$$\inf_{\mathrm{alg}} \sup_{\ell} R^{sa}_T(\mu,\mathrm{alg},\ell) = \Omega\big(\sqrt{\alpha(F)\mu T}\big)\,.$$

Proof. Let $\mathcal T$ be the (random) set of times $\{t\in[T] : A_t = 1\}$ and let $\tau_1 < \tau_2 < \dots < \tau_{|\mathcal T|}$ be the (random) elements of $\mathcal T$ in increasing order. Fix a (possibly randomized) online learning algorithm alg and a sequence $\ell = (\ell_t)_{t\in[T]}$ of losses, and fix any random variable $J$ (later, $J$ and the corresponding hard instance will be those used in the lower bound for online learning with feedback graphs, (Alon et al., 2017, Theorem 5)). Let $N_t = A_1 + \dots + A_{t-1}$ be the number of rounds actually played before time $t$. Then
$$R^{sa}_T(\mu,\mathrm{alg},\ell) = \max_{i\in[K]} \mathbb{E}\left[\sum_{t=1}^T \big(\ell_t(I_{N_t+1}) - \ell_t(i)\big)\,\mathbb I\{A_t=1\}\right] = \max_{i\in[K]} \mathbb E\left[\sum_{s=1}^{|\mathcal T|} \big(\ell_{\tau_s}(I_s) - \ell_{\tau_s}(i)\big)\right] \ge \mathbb E\left[\sum_{s=1}^{|\mathcal T|} \big(\ell_{\tau_s}(I_s) - \ell_{\tau_s}(J)\big)\right]$$
$$= \sum_{n=0}^{T} \sum_{\substack{T_0\subseteq[T]\\ |T_0|=n}} \mathbb E\left[\sum_{s=1}^{n} \ell_{\tau_s}(I_s) - \ell_{\tau_s}(J) \,\middle|\, \mathcal T = T_0\right] \mathbb P\big(\mathcal T = T_0\big)\,.$$
We then recognize that the conditional expectation in the previous formula is the expected regret for single-agent online learning with a feedback graph over $n$ rounds. Therefore, from (Alon et al., 2017, Theorem 5) we get that, letting $C_1 = (8/100)^2$, for all $T \ge C_1\,\alpha(F)^3$,
$$\inf_{\mathrm{alg}} \sup_{\ell} R^{sa}_T(\mu,\mathrm{alg},\ell) \ge \sum_{n\in[T]} \varepsilon n \left(\frac12 - 2\varepsilon\sqrt{\frac{n}{\alpha(F)}}\right) f_{\mathrm{Bin}(\mu,T)}(n)\,,$$
where $f_{\mathrm{Bin}(\mu,T)}$ is the p.m.f. of a binomial random variable with parameters $\mu$ and $T$, and $\varepsilon > 0$ parameterizes the hard instance. We want the $\varepsilon = \varepsilon^*(\mu,T)$ that maximizes this expression. We therefore expand the right-hand side as a quadratic polynomial $g(\varepsilon) = a\varepsilon - b\varepsilon^2$ in $\varepsilon$:
$$\sum_{n\in[T]} \varepsilon n\left(\frac12 - 2\varepsilon\sqrt{\frac{n}{\alpha(F)}}\right) f_{\mathrm{Bin}(\mu,T)}(n) = \left(\sum_{n\in[T]} \frac n2\, f_{\mathrm{Bin}(\mu,T)}(n)\right)\varepsilon - \left(2\sum_{m\in[T]} \frac{m^{3/2}}{\alpha(F)^{1/2}}\, f_{\mathrm{Bin}(\mu,T)}(m)\right)\varepsilon^2\,,$$
where $a = \sum_{n\in[T]} \frac n2\, f_{\mathrm{Bin}(\mu,T)}(n)$ and $b = 2\sum_{m\in[T]} \frac{m^{3/2}}{\alpha(F)^{1/2}}\, f_{\mathrm{Bin}(\mu,T)}(m)$.
We find that the maximum of $g$ is achieved at $\varepsilon^* = \frac{a}{2b}$, which gives the optimal value
$$\varepsilon^*(\mu,T) = \frac{\sum_{n\in[T]} \frac n2\, f_{\mathrm{Bin}(\mu,T)}(n)}{4\sum_{m\in[T]} \frac{m^{3/2}}{\alpha(F)^{1/2}}\, f_{\mathrm{Bin}(\mu,T)}(m)}\,,$$
and the function $g$ evaluated at this optimal value equals
$$g^* = g\big(\varepsilon^*(\mu,T)\big) = \frac{a^2}{4b} = \frac{\sqrt{\alpha(F)}\left(\sum_{n\in[T]} n\, f_{\mathrm{Bin}(\mu,T)}(n)\right)^2}{32\sum_{m\in[T]} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)}\,.$$
Therefore, we can lower bound the regret by $g^*$ and obtain
$$\inf_{\mathrm{alg}}\sup_{\ell} R^{sa}_T(\mu,\mathrm{alg},\ell) \ge \frac{\sqrt{\alpha(F)}\,(\mu T)^2}{32\sum_{m\in[T]} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)} = \frac{\sqrt{\alpha(F)\mu T}}{32}\cdot\frac{(\mu T)^{3/2}}{\sum_{m\in[T]} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)}\,,$$
where in the first equality we substituted the expected value $\mu T$ of a binomial distribution with parameters $\mu$ and $T$. We now want to prove the existence of a constant $c > 0$ such that, for every $\mu > 0$ and every $T \ge 1/\mu^3$,
$$\frac{(\mu T)^{3/2}}{\sum_{m\in[T]} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)} \ge c\,, \qquad \text{or equivalently} \qquad \frac{\sum_{m\in[T]} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)}{(\mu T)^{3/2}} \le \frac1c\,.$$
We split the sum over $m\in[T]$ into two blocks, the first for $1 \le m \le c_2\mu T$ and the second for $c_2\mu T < m \le T$, for the constant
$$c_2 = \frac{\mu T + \sqrt{\tfrac12 T\ln T^{3/2}} + 1}{\mu T}$$
and with $c_1 > 0$:
$$\frac{\sum_{m\in[T]} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)}{(\mu T)^{3/2}} = \frac{\sum_{m=1}^{\lfloor c_2\mu T\rfloor} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)}{(\mu T)^{3/2}} + \frac{\sum_{m > c_2\mu T} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)}{(\mu T)^{3/2}}\,. \qquad (8)$$
The idea is to choose the split point $c_2\mu T$ so that we can upper bound the tail mass using Hoeffding's inequality. Hoeffding's inequality yields the simple bound $F_{\mathrm{Bin}(\mu,T)}(m) \le \exp\big(-2T(\mu - m/T)^2\big)$ for $m \le \mu T$, and together with the symmetry property of the binomial distribution, $1 - F_{\mathrm{Bin}(\mu,T)}(m) = F_{\mathrm{Bin}(1-\mu,T)}(T - m)$, we obtain a corresponding bound on the upper tail. This contribution compensates exactly the factor $(\mu T)^{3/2}$ left in the denominator for $T \ge 1/\mu^3$ (which implies $\mu T \ge 1$), leaving just the constant $c_1$:
$$\frac{\sum_{m > c_2\mu T} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)}{(\mu T)^{3/2}} \le \frac{T^{3/2}\,\mathbb P\big(|\mathcal T| > c_2\mu T\big)}{(\mu T)^{3/2}} \le \frac{T^{3/2}\, e^{-\ln T^{3/2}}}{(\mu T)^{3/2}} = \frac{1}{(\mu T)^{3/2}} \le c_1\,.$$
For the first term in Equation (8), we upper bound the lower tail simply by one:
$$\sum_{m=1}^{\lfloor c_2\mu T\rfloor} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m) \le (c_2\mu T)^{3/2}\, F_{\mathrm{Bin}(\mu,T)}\big(\lfloor c_2\mu T\rfloor\big) \le (c_2\mu T)^{3/2}\,.$$
We conclude by proving that the first term in Equation (8) is bounded by a constant.
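The quantity controlled in this step, the normalized moment $\sum_m m^{3/2} f_{\mathrm{Bin}(\mu,T)}(m)\,/\,(\mu T)^{3/2}$, can be checked numerically. A minimal sketch (the helper names are ours; the p.m.f. is computed exactly via `math.comb`):

```python
import math

def binom_pmf(T, mu, m):
    """Exact binomial p.m.f. f_Bin(mu, T)(m)."""
    return math.comb(T, m) * mu**m * (1 - mu)**(T - m)

def moment_ratio(T, mu):
    """sum_m m^{3/2} f_Bin(mu, T)(m), divided by (mu * T)^{3/2}."""
    s = sum(m**1.5 * binom_pmf(T, mu, m) for m in range(T + 1))
    return s / (mu * T)**1.5

# The regime of Lemma 2: T >= 1/mu^3 (here 1/0.3^3 is about 37, and T = 200).
r = moment_ratio(200, 0.3)
# By Jensen's inequality r >= 1, and binomial concentration keeps r close to 1,
# consistent with the claim that the ratio is bounded by an absolute constant.
```

Since $m \mapsto m^{3/2}$ is convex, $r \ge 1$ always; the proof's Hoeffding argument shows the matching upper bound by an absolute constant in the stated regime.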
If we take $m \le c_2\mu T$, we obtain, for $T \ge 2/\mu$ and $c_1 \le 1$,
$$\frac{\sum_{m=1}^{\lfloor c_2\mu T\rfloor} m^{3/2}\, f_{\mathrm{Bin}(\mu,T)}(m)}{(\mu T)^{3/2}} \le \frac{(c_2\mu T)^{3/2}}{(\mu T)^{3/2}} = \left(1 + \frac{\sqrt{\tfrac12 T\ln T^{3/2}} + 1}{\mu T}\right)^{3/2} \le \left(1 + \frac{\sqrt{\tfrac12 T\ln T^{3/2}} + 1}{T^{2/3}}\right)^{3/2} \le 5\,,$$
where in the second inequality we used $\mu \ge 1/T^{1/3}$ (which is implied by $T \ge 1/\mu^3$), and the last inequality can be checked using that $(\ln T)^{3/4}/T^{1/4} \le (3/e)^{3/4} \le 1.08$ for all $T \ge 1$. Putting everything together and letting $c_1 = 1$ (so that the right-hand side of Equation (8) is at most $5 + 1 = 6$) yields
$$\inf_{\mathrm{alg}}\sup_\ell R^{sa}_T(\mu,\mathrm{alg},\ell) \ge \frac{\sqrt{\alpha(F)\mu T}}{192} = \Omega\big(\sqrt{\alpha(F)\mu T}\big)\,.$$

B Graph-theoretic results

In this section, we present a general version of a graph-theoretic lemma (Lemma 6) that is crucial for our positive results in Section 4. Before stating it, we recall a few known results. The first result is a direct consequence of (Alon et al., 2017, Lemma 10) specialized to undirected graphs.

Lemma 3. Let $G = (V, E)$ be an undirected graph containing all self-loops and $\alpha_d(G)$ its $d$-th independence number. For all $i \in V$, let $N^G_d(i)$ be the $d$-th neighborhood of $i$, $p(i) \ge 0$, and $P(i) = \sum_{j\in N^G_d(i)} p(j) > 0$. Then
$$\sum_{i\in V} \frac{p(i)}{P(i)} \le \alpha_d(G)\,.$$
Proof. Abbreviate $N = N^G_d$. Initialize $V_1 = V$, fix $j_1 \in \operatorname{argmin}_{j\in V_1} P(j)$, and let $V_2 = V_1 \setminus N(j_1)$. For $k \ge 2$, fix $j_k \in \operatorname{argmin}_{j\in V_k} P(j)$ and shrink $V_{k+1} = V_k \setminus N(j_k)$, until $V_{k+1} = \emptyset$. Since $G$ is undirected, $j_k \notin \bigcup_{s=1}^{k-1} N(j_s)$; therefore, the number $m$ of vertices that can be picked this way is upper bounded by $\alpha_d(G)$. Denoting $N'(j_k) = V_k \cap N(j_k)$, this implies
$$\sum_{i\in V} \frac{p(i)}{P(i)} = \sum_{k=1}^m \sum_{i\in N'(j_k)} \frac{p(i)}{P(i)} \le \sum_{k=1}^m \frac{\sum_{i\in N'(j_k)} p(i)}{P(j_k)} \le m \le \alpha_d(G)\,,$$
concluding the proof.

The following result, known as the inequality of arithmetic and geometric means, or simply the AM–GM inequality, is used in the proofs of Lemmas 1, 5, and 6.

Lemma 4 (AM–GM inequality). For any $x_1, \dots, x_r \in [0, \infty)$,
$$\frac{x_1 + \dots + x_r}{r} \ge (x_1\cdots x_r)^{1/r}\,.$$
Proof. By Jensen's inequality applied to the concave function $\ln$,
$$\ln\left(\frac1r\sum_{i=1}^r x_i\right) \ge \frac1r\sum_{i=1}^r \ln x_i = \sum_{i=1}^r \ln x_i^{1/r} = \ln\big((x_1\cdots x_r)^{1/r}\big)\,.$$

The second result is proven in (Cesa-Bianchi et al., 2019, Lemma 3), but here we give a slightly different proof based on the AM–GM inequality.

Lemma 5. Let $G = (V, E)$ be an undirected graph containing all self-loops and $\alpha_d(G)$ its $d$-th independence number.
For all $v\in V$, let $N^G_d(v)$ be the $d$-th neighborhood of $v$, $c(v) \ge 0$, and $C(v) = 1 - \prod_{w\in N^G_d(v)} \big(1 - c(w)\big) > 0$. Then
$$\sum_{v\in V} \frac{c(v)}{C(v)} \le \frac{1}{1 - e^{-1}}\left(\alpha_d(G) + \sum_{v\in V} c(v)\right)\,.$$
Proof. Set for brevity $P(v) = \sum_{w\in N^G_d(v)} c(w)$. Then we can write
$$\sum_{v\in V} \frac{c(v)}{C(v)} = \underbrace{\sum_{v\in V:\,P(v)\ge 1} \frac{c(v)}{C(v)}}_{\text{(I)}} + \underbrace{\sum_{v\in V:\,P(v)<1} \frac{c(v)}{C(v)}}_{\text{(II)}}$$
and proceed by upper bounding the two terms (I) and (II) separately. Let $r(v)$ be the cardinality of $N^G_d(v)$. For any given $v\in V$ with $P(v)\ge 1$ we have
$$C(v) = 1 - \prod_{w\in N^G_d(v)} \big(1 - c(w)\big) \ge 1 - \left(1 - \frac{P(v)}{r(v)}\right)^{r(v)} \ge 1 - \left(1 - \frac{1}{r(v)}\right)^{r(v)} \ge 1 - e^{-1}\,,$$
where the first equality follows from the definition of $C(v)$, the first inequality is implied by the AM–GM inequality (Lemma 4), the second uses $P(v)\ge 1$, and the last one comes from $r(v)\ge 1$ (since $v\in N^G_d(v)$). Hence
$$\text{(I)} \le \sum_{v\in V:\,P(v)\ge 1} \frac{c(v)}{1-e^{-1}} \le \frac{1}{1-e^{-1}}\sum_{v\in V} c(v)\,.$$
As for (II), using the inequality $1 - x \le e^{-x}$, valid for all $x\in\mathbb R$, with $x = c(w)$, we can write
$$C(v) = 1 - \prod_{w\in N^G_d(v)}\big(1 - c(w)\big) \ge 1 - \exp\big(-P(v)\big)\,.$$
Now, since in (II) we are only summing over $v$ such that $P(v) < 1$, we can use the inequality $1 - e^{-x} \ge (1 - e^{-1})\,x$, which holds for $x\in[0,1]$, with $x = P(v)$, thereby concluding that $C(v) \ge (1-e^{-1})\,P(v)$. Hence
$$\text{(II)} \le \sum_{v\in V:\,P(v)<1} \frac{c(v)}{(1-e^{-1})\,P(v)} \le \frac{\alpha_d(G)}{1-e^{-1}}\,,$$
where in the last step we used Lemma 3.

We can now state a more general version of our key graph-theoretic result, which can be proved similarly to Lemma 1.

Lemma 6. Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two undirected graphs containing all self-loops and $\alpha(G_1\boxtimes G_2)$ the independence number of their strong product $G_1\boxtimes G_2$. For all $(i,v)\in V_1\times V_2$, let also $N^{G_1}_1(i)$, $N^{G_2}_1(v)$, and $N^{G_1\boxtimes G_2}_1(i,v)$ be the first neighborhoods of $i$ (in $G_1$), $v$ (in $G_2$), and $(i,v)$ (in $G_1\boxtimes G_2$).
If $w = \big(w(j,u)\big)_{(j,u)\in V_1\times V_2}$ is an arbitrary matrix with non-negative entries such that $1 - \sum_{j\in N^{G_1}_1(i)} w(j,u) \ge 0$ for all $(i,u)\in V_1\times V_2$ and $1 - \prod_{u\in N^{G_2}_1(v)}\big(1 - \sum_{j\in N^{G_1}_1(i)} w(j,u)\big) > 0$ for all $(i,v)\in V_1\times V_2$, then
$$\sum_{(i,v)\in V_1\times V_2} \frac{w(i,v)}{1 - \prod_{u\in N^{G_2}_1(v)}\Big(1 - \sum_{j\in N^{G_1}_1(i)} w(j,u)\Big)} \le \frac{e}{e-1}\left(\alpha\big(G_1\boxtimes G_2\big) + \sum_{(i,v)\in V_1\times V_2} w(i,v)\right)\,.$$

B.1 Further discussion on $\mathcal G^*$

In general, $\alpha(N)\,\alpha(F) \le \alpha(N\boxtimes F)$ holds for any arbitrary pair of graphs $N, F$. Indeed, the Cartesian product $I\times J$ of an independent set $I$ of $N$ and an independent set $J$ of $F$ is an independent set of $N\boxtimes F$. There exist graphs $N, F$ with $\alpha(N)\,\alpha(F) < \alpha(N\boxtimes F)$, but these appear to be quite rare and pathological cases. For the sake of completeness, we add an example of such a construction below. This shows that not all pairs of graphs belong to $\mathcal G^*$.

Example 1. Take as the first graph $G_1 = (V_1, E_1)$ the cycle $C_5$ over 5 vertices. Then, for any $k\ge 2$, build $G_k = (V_k, E_k)$ inductively by replacing each vertex $v\in V_{k-1}$ with a copy of $C_5$ and each edge $e\in E_{k-1}$ with a copy of $K_{5,5}$ (the complete bipartite graph with partitions of size 5 and 5) between the two copies of $C_5$ that replaced its endpoints. It can be shown that $\alpha(G_k) = 2^k$ but $\alpha(G_k\boxtimes G_k) \ge 5^k > 4^k = \alpha(G_k)^2$. To see why, note first that $\alpha(C_5) = 2$ but $\alpha(C_5\boxtimes C_5) \ge 5$, as witnessed by the independent set containing the 5 vertices $(1,1), (2,3), (3,5), (4,2), (5,4)$. For $k\ge 2$, $\alpha(G_k) = 2^k$, and we can take the analogue in $G_k\boxtimes G_k$ of the above independent set of $C_5\boxtimes C_5$. This gives 5 sets $S_1, S_2, S_3, S_4, S_5$ of $25^{k-1}$ vertices each, with no edges between $S_i$ and $S_j$ when $i\neq j$. The subgraph of $G_k\boxtimes G_k$ induced by each $S_i$ is simply the previous iteration $G_{k-1}\boxtimes G_{k-1}$ of this construction, and proceeding by induction we can find an independent subset of each $S_i$ with $5^{k-1}$ vertices, giving a total of $5^k$ independent vertices.

C The upper bound of Cesa-Bianchi et al. (2020) for experts

(Cesa-Bianchi et al., 2020, Theorem 10) gives theoretical guarantees for the average regret over active agents.
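The base case of Example 1 can be verified by brute force. The sketch below (helper names are ours; vertices are 0-indexed, so the independent set $(1,1),(2,3),(3,5),(4,2),(5,4)$ becomes $(0,0),(1,2),(2,4),(3,1),(4,3)$) checks that $\alpha(C_5) = 2$ while the displayed 5 vertices are independent in $C_5\boxtimes C_5$:

```python
from itertools import combinations

def c5():
    """C5 with self-loops, as an adjacency-set dict."""
    return {i: {i, (i - 1) % 5, (i + 1) % 5} for i in range(5)}

def strong_product(g1, g2):
    """Strong product: (i, v) ~ (j, u) iff i ~ j in g1 and v ~ u in g2;
    since both graphs carry self-loops, this covers the 'one coordinate equal' cases."""
    return {(i, v): {(j, u) for j in g1[i] for u in g2[v]}
            for i in g1 for v in g2}

def independence_number(adj):
    """Brute-force independence number of a small graph with self-loops."""
    verts = list(adj)
    for r in range(len(verts), 0, -1):
        for s in combinations(verts, r):
            if all(b not in adj[a] for a, b in combinations(s, 2)):
                return r
    return 0

G = c5()
P = strong_product(G, G)                       # 25 vertices
S = {(0, 0), (1, 2), (2, 4), (3, 1), (4, 3)}   # Example 1's set, 0-indexed
```

Since $S$ is independent, $\alpha(C_5\boxtimes C_5) \ge 5 > 4 = \alpha(C_5)^2$, so $(C_5, C_5) \notin \mathcal G^*$, as claimed.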
In this section, we briefly discuss how to convert their statement into a corresponding result for the total regret over active agents, which is the focus of the present work. Before stating the theorem, we recall that the convex conjugate $f^*\colon \mathbb R^d \to \mathbb R$ of a convex function $f\colon \mathcal X \to \mathbb R$ is defined, for any $x\in\mathbb R^d$, by $f^*(x) = \sup_{w\in\mathcal X}\big(x^\top w - f(w)\big)$. Moreover, given $\sigma > 0$, we say that $f$ is $\sigma$-strongly convex on $\mathcal X$ with respect to a norm $\|\cdot\|$ if, for all $u, w\in\mathcal X$, we have $f(u) \ge f(w) + \nabla f(w)^\top (u - w) + \frac{\sigma}{2}\|u-w\|^2$. The following well-known result can be found in (Shalev-Shwartz, 2012, Lemma 2.19 and subsequent paragraph).

Lemma 7. Let $f\colon\mathcal X\to\mathbb R$ be a strongly convex function on $\mathcal X$. Then the convex conjugate $f^*$ is everywhere differentiable on $\mathbb R^d$.

The following result (see, e.g., (Orabona et al., 2015, bound (6) in Corollary 1 with $F$ set to zero)) shows an upper bound on the regret of Algorithm 2 for single-agent online convex optimization with expert feedback.

Theorem 3. Let $g\colon\mathcal X\to\mathbb R$ be a differentiable function, $\sigma$-strongly convex with respect to $\|\cdot\|$. Then the regret of Algorithm 2 run with $g_t = \frac{\sqrt t}{\eta}\, g$, for $\eta > 0$, satisfies
$$\sum_{t=1}^T \ell_t(x_t) - \inf_{x\in\mathcal X}\sum_{t=1}^T \ell_t(x) \le \left(\frac{D}{\eta} + \frac{\eta L^2}{2\sigma}\right)\sqrt T\,,$$

Algorithm 2:
input: $\sigma_t$-strongly convex regularizers $g_t\colon\mathcal X\to\mathbb R$ for $t = 1, 2, \dots$
initialization: $\theta_1 = 0 \in\mathbb R^d$
for $t = 1, 2, \dots$ do
  choose $w_t = \nabla g_t^*(\theta_t)$
  observe $\nabla\ell_t(w_t) \in\mathbb R^d$
  update $\theta_{t+1} = \theta_t - \nabla\ell_t(w_t)$

where $D = \sup_{\mathcal X} g$ and $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. If $\sup_t \|\nabla\ell_t\|_* \le L$, then choosing $\eta = \sqrt{2\sigma D}/L$ gives $R_T \le L\sqrt{2DT/\sigma}$.

We can now present the equivalent of (Cesa-Bianchi et al., 2020, Theorem 10) for cooperative online convex optimization with expert feedback (i.e., $F$ is a clique) where $n = 1$ but the feedback is broadcast to the first neighborhood immediately after an action is played (rather than in the following round).

Theorem 4. Consider a network $N = (\mathcal A, E_N)$ of agents.
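A minimal sketch of Algorithm 2 for the simplex, under the standard choice (ours, for illustration; the paper leaves $g$ generic) of the negative-entropy regularizer $g(w) = \sum_i w_i\ln w_i$, which is 1-strongly convex with respect to the $\ell_1$ norm; with $g_t = (\sqrt t/\eta)\,g$, the link function $w_t = \nabla g_t^*(\theta_t)$ becomes a softmax of the rescaled dual variable, and for linear losses $\nabla\ell_t(w) = \ell_t$.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def ftrl_entropic(loss_vectors, eta):
    """Sketch of Algorithm 2 on the probability simplex with the negative-entropy
    regularizer g. With g_t = (sqrt(t)/eta) * g, the prediction is
    w_t = grad g_t*(theta_t) = softmax(eta * theta_t / sqrt(t)); losses are
    linear, so the observed gradient is the loss vector itself."""
    K = len(loss_vectors[0])
    theta = [0.0] * K
    total = 0.0
    for t, ell in enumerate(loss_vectors, start=1):
        w = softmax([eta * th / math.sqrt(t) for th in theta])
        total += sum(wi * li for wi, li in zip(w, ell))  # expected loss this round
        theta = [th - li for th, li in zip(theta, ell)]  # theta_{t+1} = theta_t - l_t
    best = min(sum(ell[i] for ell in loss_vectors) for i in range(K))
    return total - best  # regret against the best fixed expert

# Hypothetical instance: expert 0 always has loss 0, the others loss 1.
regret = ftrl_entropic([[0.0, 1.0, 1.0, 1.0] for _ in range(200)], eta=1.0)
```

On this easy instance the weights concentrate on expert 0 and the regret stays bounded, consistent with the $O(\sqrt T)$ guarantee of Theorem 3.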
If all agents $v$ run Algorithm 2 with an oblivious network interface and $g_t = \frac{\sqrt t}{\eta}\, g$, where $\|\nabla\ell_t\|_*$ is upper bounded by a constant $L > 0$, $\eta > 0$ is a learning rate, and the regularizer $g\colon\mathcal X\to\mathbb R$ is differentiable, $\sigma$-strongly convex with respect to some norm $\|\cdot\|$, and upper bounded by a constant $M^2$, then, choosing $\eta = \sqrt{2\sigma}\,M/L$, the network regret satisfies
$$R_T = O\left(\frac{LM}{\sqrt\sigma}\,\sqrt{Q\big(\alpha(N) + Q\big)\,T}\right)\,.$$

Proof sketch. For any $x\in\mathcal X$, agent $v$, and time $t$, let $x_t(v)$ be the prediction made by $v$ at time $t$, $r_t(v,x) = \ell_t\big(x_t(v)\big) - \ell_t(x)$,
$$Q_v = \mathbb P\Big(v \in \bigcup_{w\in A_t} N^N_1(w)\Big) = 1 - \prod_{w\in N^N_1(v)}\big(1 - q(w)\big)\,,$$
and $\mathcal A^* := \{w\in\mathcal A : q(w) > 0\}$. Proceeding as in (Cesa-Bianchi et al., 2020, Theorem 2) yields, for each $v\in\mathcal A^*$ and $x\in\mathcal X$, an upper bound (9) on $\mathbb E\big[\sum_{t=1}^T r_t(v,x)\big]$. Now, by the independence of the activations of the agents at time $t$ and of $\big(r_t(v,x)\big)_{v\in\mathcal A^*,\,x\in\mathcal X}$, we get
$$R_T = \sup_{x\in\mathcal X} \sum_{v\in\mathcal A^*} q(v)\,\mathbb E\left[\sum_{t=1}^T r_t(v,x)\right]\,. \qquad (10)$$
Putting Equations (9) and (10) together and applying Jensen's inequality yields a bound proportional to $\sum_{v\in\mathcal A^*} q(v)\sqrt{1/Q_v}$. The proof is concluded by invoking Lemma 5 (with $c = q$ and $C(v) = Q_v$), which bounds $\sum_{v\in\mathcal A^*} q(v)/Q_v$ by $\big(\alpha(N) + Q\big)/(1 - e^{-1})$.

D Learning curves

We also plot the average regret $R_T/Q$ against the number $T$ of rounds. Our algorithm is the blue curve and the baseline is the red curve. Recall that these curves are averages over 20 repetitions of the same experiment (the shaded areas correspond to one standard deviation), where the stochasticity is due to the internal randomization of the algorithms. The experiments are designed to show the difference in performance when we allow agents to communicate and when we do not. The strong product captures this difference mathematically in the regret bound for our algorithm, while the experiments here show it empirically. In particular, the bound for the case of no communication is larger, and performance is correspondingly worse in our simulations, as expected from the theory. Experiments were run on a local cluster of CPUs (Intel Xeon E5-2623 v3, 3.00GHz), parallelizing the code over four cores. The run took approximately two hours.
Figure 3: Average regret $R_T/Q$ against the number of rounds ($T = 1000$), activation probability q = 1. Panels: (a) $p_N = 0.2$, $p_F = 0.2$; (b) $p_N = 0.8$, $p_F = 0.2$; (c) $p_N = 0.2$, $p_F = 0.8$; (d) $p_N = 0.8$, $p_F = 0.8$.

Figure 4: Average regret $R_T/Q$ against the number of rounds ($T = 1000$), activation probability q = 0.5. Panels: (a) $p_N = 0.2$, $p_F = 0.2$; (b) $p_N = 0.8$, $p_F = 0.2$; (c) $p_N = 0.2$, $p_F = 0.8$; (d) $p_N = 0.8$, $p_F = 0.8$.

Figure 5: Average regret $R_T/Q$ against the number of rounds ($T = 1000$), activation probability q = 0.05. Panels: (a) $p_N = 0.2$, $p_F = 0.2$; (b) $p_N = 0.8$, $p_F = 0.2$; (c) $p_N = 0.2$, $p_F = 0.8$; (d) $p_N = 0.8$, $p_F = 0.8$.