Decentralized Multi-Agent Linear Bandits with Safety Constraints

Sanae Amani,1 Christos Thrampoulidis2
1 University of California, Los Angeles
2 University of British Columbia, Vancouver
samani@ucla.edu, cthrampo@ece.ubc.ca

We study decentralized stochastic linear bandits, where a network of $N$ agents acts cooperatively to efficiently solve a linear bandit-optimization problem over a $d$-dimensional space. For this problem, we propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network. At each round of the algorithm, each agent chooses its actions following an upper confidence bound (UCB) strategy, and agents share information with their immediate neighbors through a carefully designed consensus procedure that repeats over cycles. Our analysis adjusts the duration of these communication cycles, ensuring near-optimal regret performance $\mathcal{O}(d\log(NT)\sqrt{NT})$ at a communication rate of $\mathcal{O}(dN^2)$ per round. The structure of the network affects the regret performance via a small additive term, coined the regret of delay, that depends on the spectral gap of the underlying graph. Notably, our results apply to arbitrary network topologies, without requiring a dedicated agent that acts as a server. In consideration of situations with high communication cost, we propose RC-DLUCB: a modification of DLUCB with rare communication among agents. The new algorithm trades off regret performance for a significantly reduced total communication cost of $\mathcal{O}(d^3N^{2.5})$ over all $T$ rounds. Finally, we show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits. For the recently studied problem of linear bandits with unknown linear safety constraints, we propose the first safe decentralized algorithm. Our study contributes towards applying bandit techniques in safety-critical distributed systems that repeatedly deal with unknown stochastic environments. We present numerical simulations for various network topologies that corroborate our theoretical findings.

1 Introduction

Linear stochastic bandits (LB) provide simple, yet commonly encountered, models for a variety of sequential decision-making problems under uncertainty. Specifically, LB generalizes the classical multi-armed bandit (MAB) problem of $K$ arms, each of which yields a reward sampled independently from an underlying distribution with unknown parameters, to a setting where the expected reward of each arm is a linear function of the same unknown parameter vector (Dani, Hayes, and Kakade 2008; Abbasi-Yadkori, Pál, and Szepesvári 2011; Rusmevichientong and Tsitsiklis 2010). LBs have been successfully applied over the years in online advertising, recommendation services, resource allocation, etc. (Lattimore and Szepesvári 2018). More recently, researchers have explored the potential of such algorithms in more complex systems, such as robotics, wireless networks, the power grid, and medical trials, e.g., (Li et al. 2013; Avner and Mannor 2019; Berkenkamp, Krause, and Schoellig 2016; Sui et al. 2018). A distinguishing feature of many of these perhaps less conventional bandit applications is their distributive nature.
For example, in sensor/wireless networks (Avner and Mannor 2019), collaborative behavior is required for decision-makers/agents to select better actions as individuals, yet each of them is only able to share information about the unknown environment with a subset of neighboring agents. While a distributed nature is inherent in certain systems, distributed solutions might also be preferred in broader settings, as they can lead to speed-ups of the learning process. This calls for extensions of the traditional bandit setting to networked systems. At the same time, in many of these applications the unknown system might be safety-critical, i.e., the algorithm's chosen actions need to satisfy certain constraints that, importantly, are often unknown. This leads to the challenge of balancing the goal of reward maximization with the restriction of playing safe actions. The past few years have seen a surge of research activity in these two areas: (i) distributed bandits (Wang et al. 2019; Martínez-Rubio, Kanade, and Rebeschini 2019; Szörényi et al. 2013; Landgren, Srivastava, and Leonard 2016a); and (ii) safe bandits (Sui et al. 2015, 2018; Kazerouni et al. 2017; Amani, Alizadeh, and Thrampoulidis 2019; Moradipari et al. 2019; Khezeli and Bitar 2019; Pacchiano et al. 2020). This paper contributes to the intersection of these two emerging lines of work. Concretely, we consider the problem of decentralized multi-agent linear bandits for a general (connected) network of $N$ agents, who can only communicate messages with their immediate neighbors. For this, we propose and analyze the first fully decentralized algorithm. We also present a communication-efficient version and discuss key trade-offs between regret, communication cost, and graph structure. Finally, we present the first simultaneously distributed and safe bandit algorithm for a setting with unknown linear constraints.

Notation. We use lower-case letters for scalars, lower-case bold letters for vectors, and upper-case bold letters for matrices. The Euclidean norm of $x$ is denoted by $\|x\|_2$. We denote the transpose of any column vector $x$ by $x^T$. For any vectors $x$ and $y$, we use $\langle x, y\rangle$ to denote their inner product. Let $A$ be a positive definite $d \times d$ matrix and $\nu \in \mathbb{R}^d$. The weighted 2-norm of $\nu$ with respect to $A$ is defined by $\|\nu\|_A = \sqrt{\nu^T A \nu}$. For positive integers $n$ and $m \le n$, $[n]$ and $[m:n]$ denote the sets $\{1, 2, \dots, n\}$ and $\{m, \dots, n\}$, respectively. We use $\mathbf{1}$ and $e_i$ to denote the vector of all ones and the $i$-th standard basis vector, respectively.

1.1 Problem Formulation

Decentralized Linear Bandit. We consider a network of $N$ agents and a known convex compact decision set $\mathcal{D} \subset \mathbb{R}^d$ (our results can be easily extended to settings with time-varying decision sets). Agents play actions synchronously. At each round $t$, each agent $i$ chooses an action $x_{i,t} \in \mathcal{D}$ and observes reward $y_{i,t} = \langle \theta_*, x_{i,t}\rangle + \eta_{i,t}$, where $\theta_* \in \mathbb{R}^d$ is an unknown vector and $\eta_{i,t}$ is random additive noise.

Communication Model. The agents are represented by the nodes of an undirected and connected graph $\mathcal{G}$. Each agent can send and receive messages only to and from its immediate neighbors. The topology of $\mathcal{G}$ is known to all agents via a communication matrix $P$ (see Assumption 1).
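To fix ideas, here is a minimal sketch of the reward model above (all names are illustrative, not from the paper); Gaussian noise is used as one example of a $\sigma$-sub-Gaussian distribution, consistent with the assumptions stated later.

```python
import numpy as np

class LinearBanditEnv:
    """Sketch of the reward model y_{i,t} = <theta_*, x_{i,t}> + eta_{i,t}."""

    def __init__(self, d, sigma=0.1, seed=0):
        rng = np.random.default_rng(seed)
        theta = rng.standard_normal(d)
        # ||theta_*||_2 <= 1, as in the boundedness assumption below.
        self.theta_star = theta / np.linalg.norm(theta)
        self.sigma, self.rng = sigma, rng

    def reward(self, x):
        # Gaussian noise is sigma-sub-Gaussian (Assumption 2).
        return float(self.theta_star @ x) + self.rng.normal(0.0, self.sigma)
```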
Safety. The learning environment might be subject to unknown constraints that restrict the choice of actions. In this paper, we model the safety constraint by a linear function depending on an unknown vector $\mu_* \in \mathbb{R}^d$ and a known constant $c \in \mathbb{R}$. Specifically, the chosen action $x_{i,t}$ must satisfy $\langle \mu_*, x_{i,t}\rangle \le c$, for all $i$ and $t$, with high probability. We define the unknown safe set as $\mathcal{D}^s(\mu_*) := \{x \in \mathcal{D} : \langle \mu_*, x\rangle \le c\}$. After playing $x_{i,t}$, agent $i$ observes the bandit-feedback measurement $z_{i,t} = \langle \mu_*, x_{i,t}\rangle + \zeta_{i,t}$. This type of safety constraint, but for single-agent settings, has been recently introduced and studied in (Amani, Alizadeh, and Thrampoulidis 2019; Pacchiano et al. 2020; Sui et al. 2015, 2018; Moradipari et al. 2019). See also (Kazerouni et al. 2017; Khezeli and Bitar 2019) for related notions of safety studied recently in the context of single-agent linear bandits.

Goal. Let $T$ be the total number of rounds. We define the cumulative regret of the entire network as $R_T := \sum_{t=1}^{T}\sum_{i=1}^{N} \big(\langle \theta_*, x_*\rangle - \langle \theta_*, x_{i,t}\rangle\big)$. The optimal action $x_*$ is defined with respect to $\mathcal{D}$ and $\mathcal{D}^s(\mu_*)$ as $\arg\max_{x \in \mathcal{D}} \langle \theta_*, x\rangle$ and $\arg\max_{x \in \mathcal{D}^s(\mu_*)} \langle \theta_*, x\rangle$ in the original and safe settings, respectively. The goal is to minimize the cumulative regret, while each agent is allowed to share $\mathrm{poly}(Nd)$ values per round with its neighbors. Specifically, we wish to achieve a regret close to that incurred by an optimal centralized algorithm over $NT$ rounds (the total number of plays). In the presence of a safety constraint, in addition to the aforementioned goals, agents' actions must also satisfy the safety constraint at each round.

1.2 Contributions

DLUCB. We propose a fully decentralized linear bandit algorithm (DLUCB), at each round of which the agents simultaneously share information with each other and pick their next actions. We prove a regret bound that captures both the degree of optimality of the selected actions and the inevitable delay in information-sharing due to the network structure; see Secs. 2.1 and 2.2. Compared to existing distributed LB algorithms, ours can be implemented (and remains valid) for any arbitrary (connected) network, without requiring a peer-to-peer network structure or a master node; see Sec. 2.4.

RC-DLUCB. We propose a fully decentralized algorithm with rare communication (RC-DLUCB) to reduce the communication cost (the total number of values communicated during the run of the algorithm) for applications that are sensitive to high communication cost; see Sec. 2.3.

Safe-DLUCB. We present and analyze the first fully decentralized algorithm for safe LBs with linear constraints. Our algorithm provably achieves regret of the same order (w.r.t. $NT$) as if no constraints were present; see Sec. 3.

We complement our theoretical results with numerical simulations under various settings in Sec. 4.

1.3 Related Works

Decentralized Bandits. There are several recent works on decentralized/distributed stochastic MAB problems. In the context of the classical $K$-armed MAB, (Martínez-Rubio, Kanade, and Rebeschini 2019; Landgren, Srivastava, and Leonard 2016a,b) proposed decentralized algorithms for a network of $N$ agents that can share information only with their immediate neighbors, while (Szörényi et al. 2013) studies the MAB problem on peer-to-peer networks. More recently, (Wang et al. 2019) focuses on communication efficiency and presents $K$-armed MAB algorithms with significantly lower communication overhead. In contrast to these, here we study a LB model. The most closely related works on distributed/decentralized LB are (Wang et al. 2019) and (Korda, Szörényi, and Shuai 2016).
In (Wang et al. 2019), the authors present a communication-efficient algorithm that operates under the coordination of a central server, such that every agent has instantaneous access to the full network information through the server. This model differs from the fully decentralized one considered here. In another closely related work, (Korda, Szörényi, and Shuai 2016) studies distributed LBs in peer-to-peer networks, where each agent can only send information to one other randomly chosen agent, not necessarily its neighbor, per round. A feature in common with our algorithm is the delayed use of bandit feedback, but the order of the delay differs between the two, owing to the different model; see also Sec. 2.4 for a more elaborate comparison. To recap, even though motivated by the aforementioned works, our paper presents the first fully decentralized algorithm for the multi-agent LB problem on a general network topology, with communication between any two neighbors in the network. Furthermore, none of the above has studied the presence of safety constraints.

Safe Bandits. In a more general context, the notion of safe learning has many diverse definitions in the literature. Safety in bandit problems has itself received significant attention in recent years, e.g., (Sui et al. 2015, 2018; Kazerouni et al. 2017; Amani, Alizadeh, and Thrampoulidis 2019; Moradipari et al. 2019; Khezeli and Bitar 2019; Pacchiano et al. 2020). To the best of our knowledge, all existing works on MAB/LB problems with safety constraints study a single agent. As mentioned in Sec. 1.1, the multi-agent safe LB studied here is a canonical extension of the single-agent setting studied in (Amani, Alizadeh, and Thrampoulidis 2019; Moradipari et al. 2019; Pacchiano et al. 2020). Accordingly, our algorithm and analysis build on ideas introduced in this prior work and extend them to multi-agent collaborative learning.

2 Decentralized Linear Algorithms

In this section, we present Decentralized Linear Upper Confidence Bound (DLUCB). Starting with a high-level description of the gossip communication protocol and of the benefits and challenges it brings to the problem in Sec. 2.1, we then explain DLUCB (Algorithm 1) in Sec. 2.2. In Sec. 2.3 we present a communication-efficient version of DLUCB. Finally, in Sec. 2.4 we compare our algorithms to prior art. Throughout this section, we do not assume any safety constraints. Below, we introduce some necessary assumptions.

Assumption 1 (Communication Matrix). For an undirected connected graph $\mathcal{G}$ with $N$ nodes, $P \in \mathbb{R}^{N \times N}$ is a symmetric communication matrix if it satisfies the following three conditions: (i) $P_{i,j} = 0$ if there is no connection between nodes $i$ and $j$; (ii) the sum of each row and column of $P$ is 1; (iii) the eigenvalues are real, and their magnitudes, other than the largest, are less than 1, i.e., $1 = |\lambda_1| > |\lambda_2| \ge \dots \ge |\lambda_N| \ge 0$.

We assume that agents have knowledge of the communication matrix $P$. We remark that $P$ can be constructed with little global information about the graph, such as its adjacency matrix and the graph's maximal degree; see Sec. 4 for an explicit construction. Once $P$ is known, the total number of agents $N$ and the graph's spectral gap $1 - |\lambda_2|$ are also known. We show in Sec. 2.2 that the latter two parameters fully capture how the network structure affects the algorithm's regret.

Assumption 2 (Subgaussian Noise). For $i \in [N]$ and $t > 0$, $\eta_{i,t}$ and $\zeta_{i,t}$ are zero-mean $\sigma$-sub-Gaussian random variables.

Assumption 3 (Boundedness). Without loss of generality, $\|x\|_2 \le 1$ for all $x \in \mathcal{D}$, $\|\theta_*\|_2 \le 1$, and $\|\mu_*\|_2 \le 1$.
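As one concrete way to satisfy Assumption 1, the sketch below builds $P = I - \frac{1}{\delta_{\max}+1}L$ from the combinatorial Laplacian, a standard construction in the spirit of the one used in Sec. 4 (the function name and example graph are ours, for illustration only).

```python
import numpy as np

def communication_matrix(adj):
    """Build a symmetric, doubly stochastic P from an adjacency matrix.

    Sketch of a standard Laplacian-based construction:
    P = I - L / (delta_max + 1), cf. Duchi, Agarwal, and Wainwright (2011).
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)                 # node degrees
    laplacian = np.diag(deg) - adj        # combinatorial Laplacian L = D - A
    return np.eye(len(adj)) - laplacian / (deg.max() + 1.0)

# Example: a 4-node ring. The spectral gap 1 - |lambda_2| governs mixing speed.
adj_ring = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
P = communication_matrix(adj_ring)
eigs = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
print("spectral gap:", 1 - eigs[1])       # here |lambda_2| = 1/3, gap = 2/3
```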
2.1 Information-Sharing Protocol

DLUCB implements a UCB strategy. At the core of single-agent UCB algorithms is the construction of a proper confidence set around the true parameter $\theta_*$ using past actions and their observed rewards. In multi-agent settings, each agent $i \in [N]$ maintains its own confidence set $\mathcal{C}_{i,t}$ at every round $t$. To exploit the network structure and enjoy the benefits of collaborative learning, it is important that $\mathcal{C}_{i,t}$ is built using information about past actions not only of agent $i$ itself, but also of agents $j \ne i \in [N]$. For simplicity, consider first a centralized setting of perfect information-sharing among agents. Specifically, assume that at every round $t$, agent $i$ knows the past chosen actions and observed rewards of all other agents in the graph. Having gathered all this information, each agent $i$ maintains knowledge of the following sufficient statistics during all rounds $t$:

$$A_{*,t} = \lambda I + \sum_{\tau=1}^{t-1}\sum_{i=1}^{N} x_{i,\tau}x_{i,\tau}^T, \qquad b_{*,t} = \sum_{\tau=1}^{t-1}\sum_{i=1}^{N} y_{i,\tau}x_{i,\tau}. \qquad (1)$$

Here, $\lambda \ge 1$ is a regularization parameter. Of course, in this idealized scenario, the confidence set constructed based on (1) is the same for every agent. In fact, it is the same as the confidence set that would be constructed by a single agent allowed to choose $N$ actions at every round. Here, we study a decentralized setting with imperfect information sharing. In particular, each agent $i$ can only communicate with its immediate neighbors $j \in \mathcal{N}(i)$ at any time $t$. As such, it does not have direct access to the perfect statistics $A_{*,t}$ and $b_{*,t}$ in (1). Instead, it is confined to approximations of them, which we denote by $A_{i,t}$ and $b_{i,t}$. In the worst case, where no communication is used, $A_{i,t} = \lambda I + \sum_{\tau=1}^{t-1} x_{i,\tau}x_{i,\tau}^T$ (similarly for $b_{i,t}$). But this is a very poor approximation of $A_{*,t}$ (correspondingly, $b_{*,t}$). Our goal is to construct a communication scheme that exploits the exchange of information among agents to allow for drastically better approximations of (1). Towards this goal, our algorithm implements an appropriate gossip protocol to communicate each agent's past actions and observed rewards to the rest of the network (even beyond immediate neighbors). We describe the details of this protocol next.

Running Consensus. In order to share information about agents' past actions across the network, we rely on running consensus, e.g., (Lynch 1996; Xiao and Boyd 2004). The goal of running consensus is that, after enough rounds of communication, each agent has an accurate estimate of the average (over all agents) of the agents' initial values. Precisely, let $\nu_0 \in \mathbb{R}^N$ be a vector whose entry $\nu_{0,i}$, $i \in [N]$, represents agent $i$'s information at some initial round. Then, running consensus aims at providing each agent with an accurate estimate of the average $\frac{1}{N}\sum_{i \in [N]} \nu_{0,i}$. Note that encoding $\nu_0 = X e_j$ allows all agents to eventually obtain an estimate of the value $X = \sum_{i \in [N]} \nu_{0,i}$ that was initially known only to agent $j$. To see how this is relevant to our setting, recall (1) and focus on $t = 2$ for simplicity. At round $t = 2$, each agent $j$ only knows $x_{j,1}$, and estimation of $A_{*,2} = \lambda I + \sum_{i=1}^{N} x_{i,1}x_{i,1}^T$ by agent $j$ boils down to estimating each $x_{i,1}$, $i \ne j$. In our previous example, let $X$ be the $k$-th entry of $x_{i,1}$ for some $i \ne j$. By running consensus on $\nu_0 = [x_{j,1}]_k e_j$ for $k \in [d]$, every agent eventually builds an accurate estimate of $[x_{j,1}]_k$, the $k$-th entry of $x_{j,1}$, which would otherwise only be known to $j$. It turns out that the communication matrix $P$ defined in Assumption 1 plays a key role in reaching consensus.
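To make this concrete, the following standalone demo (a hypothetical 4-node ring with $P = I - L/3$, satisfying Assumption 1) spreads a scalar known only to agent 0 via repeated neighbor-only averaging.

```python
import numpy as np

# Hypothetical 4-node ring with P = I - L/(delta_max + 1) = I - L/3
# (symmetric and doubly stochastic, as in Assumption 1).
P = np.array([[1/3, 1/3, 0,   1/3],
              [1/3, 1/3, 1/3, 0  ],
              [0,   1/3, 1/3, 1/3],
              [1/3, 0,   1/3, 1/3]])

X = 3.0
nu = X * np.eye(4)[0]     # nu_0 = X e_0: only agent 0 knows X
for _ in range(50):       # S = 50 rounds of neighbor-only averaging
    nu = P @ nu           # entry j mixes nu_j with its neighbors' values only
print(nu)                 # every entry approaches X/N = 0.75, since P^S -> 11^T/N
# Multiplying its estimate by N recovers X at every agent.
```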
The details are standard in the rich related literature (Xiao and Boyd 2004; Lynch 1996); here, we only give a brief explanation of the high-level principles. Roughly speaking, a consensus algorithm updates $\nu_0$ by $\nu_1 = P\nu_0$, and so on. Note that this operation respects the network structure, since the updated value $\nu_{1,j}$ is a weighted average of only $\nu_{0,j}$ itself and the neighbor-only values $\nu_{0,i}$, $i \in \mathcal{N}(j)$. Thus, after $S$ rounds, agent $j$ has access to entry $j$ of $\nu_S = P^S\nu_0$. This is useful because $P$ is well known to satisfy the following mixing property: $\lim_{S\to\infty} P^S = \mathbf{1}\mathbf{1}^T/N$ (Xiao and Boyd 2004). Thus, $\lim_{S\to\infty} [\nu_S]_j = \frac{1}{N}\sum_{i=1}^{N} \nu_{0,i}$ for all $j \in [N]$, as desired. Of course, in practice, the number $S$ of communication rounds is finite, leading to an $\epsilon$-approximation of the average.

Accelerated Consensus. In this paper, we adapt the polynomial filtering introduced in (Martínez-Rubio, Kanade, and Rebeschini 2019; Scaman et al. 2017) to speed up the mixing of information, by following an approach whose convergence rate is faster than the standard multiplication method above. Specifically, after $S$ communication rounds, instead of $P^S$, agents compute and apply to the initial vector $\nu_0$ an appropriately re-scaled Chebyshev polynomial $q_S(P)$ of degree $S$ of the communication matrix. Recall that Chebyshev polynomials are defined recursively. It turns out that the Chebyshev polynomial of degree $\ell$ for a communication matrix $P$ is also given by a recursive formula, as follows:

$$q_{\ell+1}(P) = \frac{2w_\ell}{|\lambda_2|\,w_{\ell+1}}\, P q_\ell(P) - \frac{w_{\ell-1}}{w_{\ell+1}}\, q_{\ell-1}(P),$$

where $w_0 = 1$, $w_1 = 1/|\lambda_2|$, $w_{\ell+1} = 2w_\ell/|\lambda_2| - w_{\ell-1}$, $q_0(P) = I$, and $q_1(P) = P$. Specifically, in a Chebyshev-accelerated gossip protocol (Martínez-Rubio, Kanade, and Rebeschini 2019), the agents update their estimates of the average of the entries of the initial vector $\nu_0$ as follows:

$$\nu_{\ell+1} = \frac{2w_\ell}{|\lambda_2|\,w_{\ell+1}}\, P\nu_\ell - \frac{w_{\ell-1}}{w_{\ell+1}}\, \nu_{\ell-1}. \qquad (2)$$

Our algorithm DLUCB, presented later in Sec. 2.2, implements the Chebyshev-accelerated gossip protocol outlined above; see (Martínez-Rubio, Kanade, and Rebeschini 2019) for a similar implementation, only for the classical $K$-armed MAB. Specifically, we summarize the accelerated communication step described in (2) with a function Comm(x_now, x_prev, ℓ) with three inputs: (1) x_now, the quantity of interest that the agent wants to update at the current round; (2) x_prev, the estimated value for the same quantity of interest that the agent updated in the previous round (cf. $\nu_{\ell-1}$ in (2)); and (3) ℓ, the current communication round. Note that the inputs here are scalars; however, matrices and vectors can also be passed as inputs, in which case Comm runs entrywise. For a detailed description of Comm, please refer to Algorithm 3 in App. B. The accelerated consensus algorithm implemented in Comm guarantees fast mixing of information thanks to the following key property (Martínez-Rubio, Kanade, and Rebeschini 2019, Lem. 3): for $\epsilon \in (0,1)$ and any vector $\nu_0$ in the $N$-dimensional simplex, it holds that $\|N q_S(P)\nu_0 - \mathbf{1}\|_2 \le \epsilon$, provided

$$S = \left\lceil \frac{\log(2N/\epsilon)}{\sqrt{2\log(1/|\lambda_2|)}} \right\rceil. \qquad (3)$$

In view of this, our algorithm properly calls Comm (see Algorithm 1) such that, for every $i \in [N]$ and $t \in [T]$, the action $x_{i,t}$ and the corresponding reward $y_{i,t}$ are communicated within the network for $S$ rounds. At round $t + S$, agent $i$ has access to $a_{i,j}x_{j,t}$ and $a_{i,j}y_{j,t}$, where $a_{i,j} = N[q_S(P)]_{i,j}$. Thanks to (3), $a_{i,j}$ is $\epsilon$-close to 1; thus, these are good approximations of the true $x_{j,t}$ and $y_{j,t}$.
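The Comm subroutine itself is given in App. B; the following is a minimal sketch of the underlying recursion (2) (function and variable names are ours), which mixes noticeably faster than plain powers of $P$.

```python
import numpy as np

def accelerated_consensus(P, nu0, S, lam2):
    """Chebyshev-accelerated gossip: returns nu_S = q_S(P) nu0 via recursion (2).

    lam2 is |lambda_2|, the second-largest eigenvalue magnitude of P.
    A sketch of the update implemented by Comm (App. B), not the paper's code.
    """
    if S == 0:
        return nu0
    w_prev, w_now = 1.0, 1.0 / lam2      # w_0 = 1, w_1 = 1/|lambda_2|
    nu_prev, nu_now = nu0, P @ nu0       # q_0(P) nu0 and q_1(P) nu0
    for _ in range(1, S):
        w_next = 2.0 * w_now / lam2 - w_prev
        nu_next = (2.0 * w_now / (lam2 * w_next)) * (P @ nu_now) \
                  - (w_prev / w_next) * nu_prev
        w_prev, w_now = w_now, w_next
        nu_prev, nu_now = nu_now, nu_next
    return nu_now
```

With $S$ chosen as in (3), $N q_S(P)\nu_0$ is within $\epsilon$ (in Euclidean norm) of the all-ones vector for any $\nu_0$ on the simplex, which is exactly the property used next.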
Accordingly, at the beginning of each round $t > S$, each agent $i$ computes

$$A_{i,t} := \lambda I + \sum_{\tau=1}^{t-S}\sum_{j=1}^{N} a_{i,j}^2\, x_{j,\tau}x_{j,\tau}^T, \qquad b_{i,t} := \sum_{\tau=1}^{t-S}\sum_{j=1}^{N} a_{i,j}^2\, y_{j,\tau}x_{j,\tau}, \qquad (4)$$

which are agent $i$'s approximations of the sufficient statistics $A_{*,t-S+1}$ and $b_{*,t-S+1}$ defined in (1). On the other hand, for rounds $1 \le t \le S$ (before any mixing has been completed), let $A_{i,t} = \lambda I + \sum_{\tau=1}^{t-1} x_{i,\tau}x_{i,\tau}^T$ and $b_{i,t} = \sum_{\tau=1}^{t-1} y_{i,\tau}x_{i,\tau}$ for $i \in [N]$. With these, at the beginning of each round $t \in [T]$, agent $i$ constructs the confidence set

$$\mathcal{C}_{i,t} := \{\nu \in \mathbb{R}^d : \|\nu - \hat\theta_{i,t}\|_{A_{i,t}} \le \beta_t\}, \qquad (5)$$

where $\hat\theta_{i,t} = A_{i,t}^{-1} b_{i,t}$ and $\beta_t$ is chosen as in Thm. 1 below to guarantee $\theta_* \in \mathcal{C}_{i,t}$ with high probability.

Theorem 1 (Confidence sets). Let Assumptions 1, 2, and 3 hold. Fix $\epsilon \in (0,1)$ and $S$ as in (3). For $\delta \in (0,1)$, let $\beta_t := \sqrt{d \log\left(\frac{2\lambda d N + 2N^2 t}{\lambda d \delta}\right)} + \lambda^{1/2}$. Then, with probability at least $1-\delta$, for all $i \in [N]$ and $t \in [T]$, it holds that $\theta_* \in \mathcal{C}_{i,t}$.

The proof is mostly adapted from (Abbasi-Yadkori, Pál, and Szepesvári 2011, Thm. 2), with necessary modifications to account for the imperfect information; see App. A.1.
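To make (4) and (5) concrete, here is a minimal sketch of how an agent could assemble its approximate statistics and the confidence-set center (names and data layout are illustrative, not from the paper's appendix).

```python
import numpy as np

def mixed_statistics(actions, rewards, a_weights, lam):
    """Sketch of (4): assemble A_{i,t} and b_{i,t} from mixed estimates.

    actions[tau][j], rewards[tau][j]: agent i's estimates of x_{j,tau}, y_{j,tau}
    a_weights[j]: a_{i,j} = N [q_S(P)]_{i,j}, epsilon-close to 1 by (3)
    """
    d = len(actions[0][0])
    A = lam * np.eye(d)
    b = np.zeros(d)
    for x_tau, y_tau in zip(actions, rewards):
        for a, x, y in zip(a_weights, x_tau, y_tau):
            A += a**2 * np.outer(x, x)
            b += a**2 * y * x
    theta_hat = np.linalg.solve(A, b)   # center of C_{i,t} in (5)
    return A, theta_hat
```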
2.2 Decentralized Linear UCB

We now describe DLUCB (Algorithm 1; see App. A.3 for a more detailed version). Each agent runs DLUCB in a parallel/synchronized way. For concreteness, let us focus on agent $i \in [N]$. At every round $t$, the agent maintains the following first-in first-out (FIFO) queues of size at most $S$: $\mathcal{A}_{i,t}$, $\mathcal{B}_{i,t}$, $\mathcal{A}_{i,t-1}$, and $\mathcal{B}_{i,t-1}$. The queue $\mathcal{A}_{i,t}$ contains agent $i$'s estimates of all actions played at rounds $[t-S : t-1]$. Concretely, its $j$-th member, denoted by $\mathcal{A}_{i,t}(j) \in \mathbb{R}^{N\times d}$, is a matrix whose $k$-th row is agent $i$'s estimate of agent $k$'s action played at round $t+j-S-1$. Similarly, we define $\mathcal{B}_{i,t}$ as the queue containing agent $i$'s estimates of the rewards observed at rounds $[t-S : t-1]$. At each round $t$, agent $i$ sends every member of $\mathcal{A}_{i,t}$ and $\mathcal{B}_{i,t}$ (each entry of them) to its neighbors and, at the same time, receives the corresponding values from them. The received values are used to update the information stored in $\mathcal{A}_{i,t}$ and $\mathcal{B}_{i,t}$. The update is implemented by the subroutine Comm outlined in Sec. 2.1 and presented in detail in App. B. At the beginning of each round $t > S$, when the information of rounds $[t-S]$ is mixed enough, agent $i$ updates its estimates $A_{i,t}$ and $b_{i,t}$ of $A_{*,t-S}$ and $b_{*,t-S}$, respectively. Using these, it creates the confidence set $\mathcal{C}_{i,t}$ and runs the UCB decision rule of Line 11 to select an action. Next, agent $i$ updates $\mathcal{A}_{i,t}$ and $\mathcal{B}_{i,t}$ in Lines 12 and 13 by eliminating (dequeuing) the first elements $\mathcal{A}_{i,t}(1)$ and $\mathcal{B}_{i,t}(1)$ of the queues and adding new elements at their end (enqueuing). To $\mathcal{A}_{i,t}$ it appends $X_{i,t} \in \mathbb{R}^{N\times d}$, whose rows are all zero except the $i$-th row, which is set to $x_{i,t}^T$. Concurrently, to $\mathcal{B}_{i,t}$ it appends $\mathbf{y}_{i,t} \in \mathbb{R}^N$, whose elements are all zero except the $i$-th element, which is set to $y_{i,t}$. Note that $X_{i,t}$ (similarly, $\mathbf{y}_{i,t}$) contains agent $i$'s estimates of the actions at round $t$, and the zero rows will be updated with agent $i$'s estimates of the other agents' information at round $t$ in future rounds. This is achieved by calling the consensus algorithm Comm in Lines 14 and 15, with which agent $i$ communicates all the members of $\mathcal{A}_{i,t}$ and $\mathcal{B}_{i,t}$ with its neighbors.

Algorithm 1: DLUCB for Agent $i$
Input: $\mathcal{D}$, $N$, $d$, $|\lambda_2|$, $\epsilon$, $\lambda$, $\delta$, $T$
1:  $S = \lceil \log(2N/\epsilon)/\sqrt{2\log(1/|\lambda_2|)} \rceil$
2:  $A_{i,1} = \lambda I$, $b_{i,1} = 0$, $\mathcal{A}_{i,0} = \mathcal{A}_{i,1} = \mathcal{B}_{i,0} = \mathcal{B}_{i,1} = \emptyset$
3:  for $t = 1, \dots, S$ do
4:      Play $x_{i,t} = \arg\max_{x\in\mathcal{D}} \max_{\nu\in\mathcal{C}_{i,t}} \langle\nu, x\rangle$ and observe $y_{i,t}$
5:      $\mathcal{A}_{i,t}$.append($X_{i,t}$) and $\mathcal{B}_{i,t}$.append($\mathbf{y}_{i,t}$)
6:      $\mathcal{A}_{i,t+1}$ = Comm($\mathcal{A}_{i,t}$, $\mathcal{A}_{i,t-1}$, $[t]$)
7:      $\mathcal{B}_{i,t+1}$ = Comm($\mathcal{B}_{i,t}$, $\mathcal{B}_{i,t-1}$, $[t]$)   // Comm runs for each member of $\mathcal{A}_{i,t}$ and $\mathcal{B}_{i,t}$
8:      $A_{i,t+1} = A_{i,t} + x_{i,t}x_{i,t}^T$, $b_{i,t+1} = b_{i,t} + y_{i,t}x_{i,t}$
9:  for $t = S+1, \dots, T$ do
10:     $A_{i,t} = A_{i,t-1} + N^2\mathcal{A}_{i,t}(1)^T\mathcal{A}_{i,t}(1)$, $b_{i,t} = b_{i,t-1} + N^2\mathcal{A}_{i,t}(1)^T\mathcal{B}_{i,t}(1)$
11:     Play $x_{i,t} = \arg\max_{x\in\mathcal{D}} \max_{\nu\in\mathcal{C}_{i,t}} \langle\nu, x\rangle$ and observe $y_{i,t}$
12:     $\mathcal{A}_{i,t}$.remove($\mathcal{A}_{i,t}(1)$).append($X_{i,t}$)
13:     $\mathcal{B}_{i,t}$.remove($\mathcal{B}_{i,t}(1)$).append($\mathbf{y}_{i,t}$)
14:     $\mathcal{A}_{i,t+1}$ = Comm($\mathcal{A}_{i,t}$, $\mathcal{A}_{i,t-1}(2:S)$, $[S]$)
15:     $\mathcal{B}_{i,t+1}$ = Comm($\mathcal{B}_{i,t}$, $\mathcal{B}_{i,t-1}(2:S)$, $[S]$)

Regret analysis. There are two key challenges in the analysis of DLUCB compared to that of single-agent LUCB. First, information sharing is imperfect: the consensus algorithm mixes information for only a finite number $S$ of communication rounds, resulting in $\epsilon$-approximations of the desired quantities (cf. (3)). Second, agents can use this (imperfect) information to improve their actions only after an inevitable delay. To see what changes in the analysis of regret, consider the standard decomposition of agent $i$'s instantaneous regret at round $t$: $r_{i,t} = \langle\theta_*, x_*\rangle - \langle\theta_*, x_{i,t}\rangle \le 2\beta_t \|x_{i,t}\|_{A_{i,t}^{-1}}$. Using the Cauchy-Schwarz inequality, an upper bound on the cumulative regret $\sum_{t=1}^T\sum_{i=1}^N r_{i,t}$ can be obtained by bounding the following key term:

$$\sum_{t=1}^{T}\sum_{i=1}^{N} \|x_{i,t}\|^2_{A_{i,t}^{-1}}. \qquad (6)$$

We do this in two steps, each addressing one of the above challenges. First, in Lemma 1, we address the influence of imperfect information by relating the $A_{i,t}^{-1}$-norms in (6) to their perfect-information counterparts in terms of $A_{*,t-S+1}^{-1}$. Hereafter, let $A_{*,t} = \lambda I$ for $t = -S, \dots, 0, 1$.

Lemma 1 (Influence of imperfect information). Fix any $\epsilon \in (0, 1/(4d+1))$ and choose $S$ as in (3). Then, for all $i \in [N]$, $t \in [T]$, it holds that $\|x_{i,t}\|^2_{A_{i,t}^{-1}} \le e\,\|x_{i,t}\|^2_{A_{*,t-S+1}^{-1}}$.

The intuition behind the lemma comes from the discussion of the accelerated protocol in Sec. 2.1. Specifically, with sufficiently small communication error $\epsilon$ (cf. (3)), $A_{i,t}$ (cf. (4)) is a good approximation of $A_{*,t-S+1}$ (cf. (1)). The lemma replaces the task of bounding (6) with that of bounding $\sum_{t=1}^T\sum_{i=1}^N \|x_{i,t}\|^2_{A_{*,t-S+1}^{-1}}$. Unfortunately, this remains challenging. Intuitively, the reason is the mismatch between the information about past actions in the Gram matrix $A_{*,t-S+1}$ at time $t$ and the inclusion of all terms $x_{i,\tau}$ up to time $t$ in (6). Our idea is to relate $\|x_{i,t}\|_{A_{*,t-S+1}^{-1}}$ to $\|x_{i,t}\|_{B_{i,t}^{-1}}$, where $B_{i,t} = A_{*,t} + \sum_{j=1}^{i-1} x_{j,t}x_{j,t}^T$. This is possible thanks to the following lemma.

Lemma 2 (Influence of delays). Let $S$ be as in (3). Then, $\|x_{i,t}\|^2_{A_{*,t-S+1}^{-1}} \le e\,\|x_{i,t}\|^2_{B_{i,t}^{-1}}$ is true for all pairs $(i,t) \in [N]\times[T]$, except for at most $\psi(\lambda, |\lambda_2|, \epsilon, d, N, T) := S d \log\left(1 + \frac{NT}{d\lambda}\right)$ of them.

Using Lemmas 1 and 2 allows controlling the regret of all actions, except at most $\psi$ of them, using standard machinery in the analysis of UCB-type algorithms. The proofs of Lemmas 1 and 2 and the technical details relating these results to the desired regret bound are deferred to App. A.2. The theorem below is our first main result and bounds the regret of DLUCB.

Theorem 2 (Regret of DLUCB). Fix $\epsilon \in (0, 1/(4d+1))$ and $\delta \in (0,1)$. Let Assumptions 1, 2, and 3 hold, and let $S$ be chosen as in (3). Then, with probability at least $1-\delta$, it holds that:

$$R_T \le 2Sd\log\left(1 + \frac{NT}{d\lambda}\right) + 2e\beta_T\sqrt{2dNT\log\left(\lambda + \frac{NT}{d}\right)}.$$

The regret bound has two additive terms: a small term $2\psi(\lambda, |\lambda_2|, \epsilon, d, N, T)$ (cf. Lemma 2), which we call the regret of delay, and a second, main term that (notably) is of the same order as the regret of a centralized problem where communication is possible between any two nodes (see Table 1). Thm. 2 holds for small $\epsilon \le 1/(4d+1)$; in App. A.2, we also provide a general regret bound for arbitrary $\epsilon \in (0,1)$.
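For a finite decision set, the decision rule in Lines 4 and 11 of Algorithm 1 admits the closed form $\max_{\nu\in\mathcal{C}_{i,t}}\langle\nu, x\rangle = \langle\hat\theta_{i,t}, x\rangle + \beta_t\|x\|_{A_{i,t}^{-1}}$; a sketch (names are illustrative):

```python
import numpy as np

def ucb_action(actions, A, theta_hat, beta_t):
    """Sketch of the UCB rule in Line 11 of Algorithm 1, finite action set.

    For the ellipsoidal set C_{i,t} of (5), the inner maximization equals
    <theta_hat, x> + beta_t * ||x||_{A^{-1}}.
    """
    A_inv = np.linalg.inv(A)
    scores = [theta_hat @ x + beta_t * np.sqrt(x @ A_inv @ x) for x in actions]
    return actions[int(np.argmax(scores))]
```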
2.3 DLUCB with Rare Communication

As discussed in more detail in Sec. 2.4, DLUCB achieves order-wise optimal regret, but its communication cost scales as $\mathcal{O}(dN^2T)$, i.e., linearly with the horizon $T$ (see Table 1). In this section, we present a modification tailored to settings that are sensitive to communication cost. The new algorithm, termed RC-DLUCB, is also fully decentralized; it trades off a slight increase in regret for a significantly reduced communication cost of $\mathcal{O}\left(\frac{d^3N^{2.5}\log(Nd)}{\log^{1/2}(1/|\lambda_2|)}\right)$ over the entire horizon $[T]$. Due to space limitations, we defer a detailed description (see Algorithm 4) and analysis (see Thms. 4 and 5) of RC-DLUCB to App. C. At a high level, RC-DLUCB is inspired by the Rarely Switching OFUL algorithm of (Abbasi-Yadkori, Pál, and Szepesvári 2011). In contrast to Rarely Switching OFUL, which is designed to save on computation in single-agent systems, RC-DLUCB incorporates a similar idea into DLUCB to save on communication rounds. Specifically, compared to DLUCB, where communication happens at every round, in RC-DLUCB agents continue selecting actions individually (i.e., with no communication) unless a certain condition is triggered by any one of them. Then, they all switch to a communication phase, in which they communicate the unmixed information they have gathered, for a duration of $S$ rounds. Roughly speaking, an agent triggers the communication phase only once it has gathered enough new information relative to the last update by the rest of the network; this can be measured by keeping track of the variations in the corresponding Gram matrix, as sketched below.
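The exact trigger is specified in App. C; purely as an illustration of the determinant-based idea inherited from Rarely Switching OFUL, one hypothetical version (the threshold $C$ and the statistic are our assumptions, not the paper's exact rule):

```python
import numpy as np

def should_communicate(A_local, A_last_sync, C=0.5):
    """Hypothetical trigger in the spirit of Rarely Switching OFUL (sketch).

    A_local     : agent's current Gram matrix, including unshared local actions
    A_last_sync : Gram matrix right after the last communication phase
    Fires once new local information has grown the log-determinant enough,
    i.e. det(A_local) > (1 + C) * det(A_last_sync).
    """
    _, logdet_now = np.linalg.slogdet(A_local)
    _, logdet_old = np.linalg.slogdet(A_last_sync)
    return logdet_now - logdet_old > np.log(1.0 + C)
```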
| Algorithm | Regret | Communication |
|---|---|---|
| DLUCB | $\mathcal{O}\Big(\frac{d\log(Nd)\log(NT)}{\log^{0.5}(1/\lvert\lambda_2\rvert)} + d\log(NT)\sqrt{NT}\Big)$ | $\mathcal{O}\Big(\frac{dN^2T\log(Nd)}{\log^{0.5}(1/\lvert\lambda_2\rvert)}\Big)$ |
| RC-DLUCB | $\mathcal{O}\Big(\frac{Nd^{1.5}\log(Nd)\log^{1.5}(NT)}{\log^{0.5}(1/\lvert\lambda_2\rvert)} + d\log^2(NT)\sqrt{NT}\Big)$ | $\mathcal{O}\Big(\frac{d^3N^{2.5}\log(Nd)}{\log^{0.5}(1/\lvert\lambda_2\rvert)}\Big)$ |
| No Communication | $\mathcal{O}(dN\log(T)\sqrt{T})$ | 0 |
| Centralized | $\mathcal{O}(d\log(NT)\sqrt{NT})$ | $\mathcal{O}(dN^2T)$ |
| DCB | $\mathcal{O}\big((d\sqrt{N}\log(NT))^3 + \log(NT)\sqrt{NT}\big)$ | $\mathcal{O}(d^2NT\log(NT))$ |
| DisLinUCB | $\mathcal{O}(\log^2(NT)\sqrt{NT})$ | $\mathcal{O}(d^3N^{1.5})$ |

Table 1: Comparison of DLUCB and RC-DLUCB to baselines, as well as to the state of the art. See Sec. 2.4 for details.

2.4 Regret-Communication Trade-offs and Comparison to State of the Art

In Table 1, we compare (in terms of regret and communication) DLUCB and RC-DLUCB to two baselines: (i) a No Communication and (ii) a fully Centralized algorithm, as well as to the state of the art: (iii) DCB (Korda, Szörényi, and Shuai 2016) and (iv) DisLinUCB (Wang et al. 2019).

Baselines. In the absence of communication, each agent independently implements a single-agent LUCB (Abbasi-Yadkori, Pál, and Szepesvári 2011). This trivial No Communication algorithm has zero communication cost and applies to any graph, but its regret scales linearly with the number of agents. At the other extreme, a fully Centralized algorithm assumes communication is possible between any two agents at every round. This achieves optimal regret $\mathcal{O}(\sqrt{NT})$, which is a lower bound on the regret of any decentralized algorithm. However, it is applicable only in very limited network topologies, such as a star graph, where the central node acts as a master node, or a complete graph. Notably, DLUCB achieves order-wise optimal regret that is the same as that of the Centralized algorithm, modulo a small additive regret-of-delay term.

DisLinUCB. In a motivating recent paper (Wang et al. 2019), the authors presented DisLinUCB, a communication algorithm that applies to multi-agent settings in which agents can communicate with a master node/server by sending or receiving information to/from it with zero latency. Notably, DisLinUCB is shown to achieve order-optimal regret performance, the same as the Centralized algorithm, but at a significantly lower communication cost that does not scale with $T$ (see Table 1). In this paper, we do not assume the presence of a master node; in our setting, this could only be assumed in very limited cases: a star or a complete graph. Thus, compared to DisLinUCB, our DLUCB can be used for arbitrary network topologies with similar regret guarantees. However, DLUCB requires that communication be performed at each round. This discrepancy motivated us to introduce RC-DLUCB, which has a communication cost (slightly larger, but) comparable to that of DisLinUCB (see Table 1), while being applicable to general graphs. As a final note, as in RC-DLUCB, the reduced communication cost of DisLinUCB relies on the idea of the Rarely Switching OFUL algorithm of (Abbasi-Yadkori, Pál, and Szepesvári 2011).

DCB. Another closely related work (Korda, Szörényi, and Shuai 2016) presented DCB for decentralized linear bandits in peer-to-peer networks. Specifically, it is assumed in (Korda, Szörényi, and Shuai 2016) that each agent communicates with only one other randomly chosen agent per round. Instead, we consider fixed network topologies where each agent can only communicate with its immediate neighbors at every round. Thus, the two algorithms are not directly comparable. Nevertheless, we remark that, similar to our setting, DCB also faces the challenge of controlling the delayed use of information, caused by requiring enough mixing of the communicated information among agents. A key difference is that the duration of the delay is typically $\mathcal{O}(\log t)$ in DCB, while in DLUCB it is fixed to $S$, i.e., independent of the round $t$. This explains the significantly smaller first term in the regret of DLUCB compared to the first term in the regret of DCB in Table 1.

3 Safe Decentralized Linear Bandits

For the safe decentralized LB problem, we propose Safe-DLUCB, an extension of DLUCB to the safe setting and an extension of single-agent safe algorithms (Amani, Alizadeh, and Thrampoulidis 2019; Moradipari et al. 2019; Pacchiano et al. 2020) to multi-agent systems. Due to space limitations, we defer a detailed description of Safe-DLUCB to Algorithm 5 in App. D. Here, we give a high-level description of its main steps and present its regret guarantees. First, we need the following assumption and notation.

Assumption 4 (Non-empty safe set). A safe action $x_0 \in \mathcal{D}$ and $c_0 := \langle\mu_*, x_0\rangle < c$ are known to all agents. Also, $\langle\theta_*, x_0\rangle \ge 0$.

Define the normalized safe action $\tilde{x}_0 := \frac{x_0}{\|x_0\|_2}$. For any $x \in \mathbb{R}^d$, denote by $x^o := \langle x, \tilde{x}_0\rangle \tilde{x}_0$ its projection on $\tilde{x}_0$, and by $x^\perp := x - x^o$ its projection onto the orthogonal subspace. In the presence of safety, the agents must act conservatively to ensure that the chosen actions $x_{i,t}$ do not violate the safety constraint $\langle\mu_*, x_{i,t}\rangle \le c$. To this end, agent $i$ communicates not only $x_{i,t}$ and $y_{i,t}$, but also the bandit-feedback measurements $z_{i,t}$, following the communication protocol implemented by Comm (cf. Sec. 2.1). Once information is sufficiently mixed, it builds an additional
confidence set $\mathcal{E}_{i,t}$ that includes $\mu_*^\perp$ with high probability (note that $\mu_*^o$ is already known by Assumption 4); please refer to App. D.1 for the details of constructing $\mathcal{E}_{i,t}$. Once $\mathcal{E}_{i,t}$ is constructed, agent $i$ creates the following safe inner approximation of the true $\mathcal{D}^s(\mu_*)$:

$$\mathcal{D}^s_{i,t} := \Big\{x \in \mathcal{D} : \frac{\langle x^o, \tilde{x}_0\rangle}{\|x_0\|_2}\, c_0 + \langle\hat\mu^\perp_{i,t}, x^\perp\rangle + \beta_t \|x^\perp\|_{A^{\perp,-1}_{i,t}} \le c\Big\}.$$

Specifically, Proposition 1 in App. D.1 guarantees, for any $\delta \in (0,1)$, that for all $i \in [N]$, $t \in [T]$, all actions in $\mathcal{D}^s_{i,t}$ are safe with probability $1-\delta$. After constructing $\mathcal{D}^s_{i,t}$, agent $i$ selects a safe action $x_{i,t} \in \mathcal{D}^s_{i,t}$ following a UCB decision rule:

$$\langle\tilde\theta_{i,t}, x_{i,t}\rangle = \max_{x \in \mathcal{D}^s_{i,t}}\; \max_{\nu \in \kappa_r \mathcal{C}_{i,t}} \langle\nu, x\rangle. \qquad (7)$$

A subtle but critical point in (7) is that the inner maximization is over an appropriately enlarged confidence set $\kappa_r\,\mathcal{C}_{i,t}$. Specifically, compared to Lines 4 and 11 in Algorithm 1, we need here that $\kappa_r > 1$. Intuitively, this is required because the outer maximization in (7) is not over the entire $\mathcal{D}^s(\mu_*)$, but only over a subset of it. Thus, larger values of $\kappa_r$ are needed to provide enough exploration to the algorithm so that the selected actions in $\mathcal{D}^s_{i,t}$ are, often enough, optimistic, i.e., $\langle\tilde\theta_{i,t}, x_{i,t}\rangle \ge \langle\theta_*, x_*\rangle$; see Lemma 5 in App. D.2 for the exact statement. We attribute the above idea, that more aggressive exploration of this form is needed in the safe setting, to (Moradipari et al. 2019), only they considered a Thompson-sampling scheme and a single agent. (Pacchiano et al. 2020) extended this idea to UCB algorithms, again in the single-agent setting (and for a slightly relaxed notion of safety). Here, we show that the idea extends to multi-agent systems and, when incorporated into the framework of DLUCB, leads to a safe decentralized algorithm with provable regret guarantees, stated in the theorem below. See App. D for the proof.

Theorem 3 (Regret of Safe-DLUCB). Fix $\delta \in (0, 0.5)$, $\kappa_r = \frac{2}{c - c_0} + 1$, and $\epsilon \in (0, 1/(4d+1))$. Let Assumptions 1, 2, 3, and 4 hold, and let $S$ be chosen as in (3). Then, with probability at least $1 - 2\delta$, it holds that:

$$R_T \le 2Sd\log\Big(1 + \frac{NT}{d\lambda}\Big) + 2e\kappa_r\beta_T\sqrt{2dNT\log\Big(\lambda + \frac{NT}{d}\Big)}.$$

The regret bound is of the same order as the DLUCB regret bound, with only an additional factor $\kappa_r$ in its second term.
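For a finite decision set, the safe rule (7) can be sketched as follows; this is a sketch under the assumption that $\hat\mu^\perp_{i,t}$, the orthogonal-subspace Gram matrix $A^\perp_{i,t}$, and the projections are available, not the paper's exact implementation (all names are illustrative).

```python
import numpy as np

def safe_ucb_action(actions, x0, c, c0, mu_perp_hat, A_perp, A, theta_hat,
                    beta_t, kappa_r):
    """Sketch of the safe UCB rule (7) over a finite set of candidate actions."""
    x0_tilde = x0 / np.linalg.norm(x0)
    A_inv, A_perp_inv = np.linalg.inv(A), np.linalg.inv(A_perp)
    best, best_val = None, -np.inf
    for x in actions:
        x_perp = x - (x @ x0_tilde) * x0_tilde   # projection orthogonal to x0
        # Membership in D^s_{i,t}: a pessimistic (upper) estimate of
        # <mu_*, x> must stay below c.
        ucb_safety = (x @ x0_tilde) / np.linalg.norm(x0) * c0 \
                     + mu_perp_hat @ x_perp \
                     + beta_t * np.sqrt(x_perp @ A_perp_inv @ x_perp)
        if ucb_safety > c:
            continue
        # Optimistic reward over the enlarged set kappa_r * C_{i,t}.
        val = theta_hat @ x + kappa_r * beta_t * np.sqrt(x @ A_inv @ x)
        if val > best_val:
            best, best_val = x, val
    return best   # None if no candidate is certified safe
```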
4 Experiments

In this section, we evaluate the performance of our algorithms on synthetic data. Since the UCB decision rule in Line 11 of Algorithm 1 involves a generally non-convex optimization problem, we use a standard computationally tractable modification that replaces the $\ell_2$-norm with the $\ell_1$-norm in the definition of the confidence set (5) (unless the decision set is finite); see (Dani, Hayes, and Kakade 2008). All results directly apply to this modified algorithm after only changing the radius $\beta_t$ to $\beta_t\sqrt{d}$ (Dani, Hayes, and Kakade 2008, Sec. 3.4). All results shown depict averages over 20 realizations, for which we have chosen $d = 5$, $\mathcal{D} = [-1, 1]^5$, $\lambda = 1$, and $\sigma = 0.1$. Moreover, $\theta_*$ is drawn from $\mathcal{N}(0, I_5)$ and then normalized to unit norm. We compute the communication matrix as $P = I - \frac{1}{\delta_{\max}+1}D^{-1/2}LD^{-1/2}$, where $\delta_{\max}$ is the maximum degree of the graph and $L$ is the graph Laplacian (see (Duchi, Agarwal, and Wainwright 2011) for details). In Figs. 1a and 1b, fixing $N = 20$, we evaluate the performance of DLUCB and RC-DLUCB on four different topologies: Ring, Star, Complete, and a random Erdős-Rényi graph with parameter $p = 0.5$; see Fig. 2 in App. E for graphical illustrations of the graphs. We also compare them to the performance of No Communication (see Sec. 2.4).

[Figure 1: Regret comparison. Panels (a) and (b) show the cumulative regret of DLUCB and RC-DLUCB, respectively, versus iteration $t$; panel (c) shows the per-agent regret $R_t$ versus iteration $t$.]

The plots verify the sublinear growth of regret for all graphs, the superiority over the No Communication setting, and the fact that a smaller $|\lambda_2|$ leads to a smaller regret (regret-of-delay term in Thm. 2). A comparison between Figs. 1a and 1b further confirms the slightly better regret performance of DLUCB compared to RC-DLUCB (but the latter has a superior communication cost). Fig. 1c emphasizes the value of collaboration in speeding up the learning process: it depicts the per-agent regret of DLUCB on random graphs with $N = 5$, 10, and 15 nodes and compares their performance with single-agent LUCB. Clearly, as the number of agents increases, each agent learns the environment faster as an individual.

5 Conclusion

In this paper, we proposed two fully decentralized LB algorithms: 1) DLUCB and 2) RC-DLUCB, with small communication cost. We also proposed Safe-DLUCB to address the problem of safe LB in multi-agent settings. We derived near-optimal regret bounds for all the aforementioned algorithms that are applicable to arbitrary, but fixed, networks. An interesting open problem is to design decentralized algorithms with provable guarantees for settings with time-varying networks. Extensions to nonlinear settings and other types of safety constraints are also important future directions.

Acknowledgments

This work is supported by the National Science Foundation under Grant Number 1934641. This work was completed while both authors were affiliated with UC Santa Barbara.

Ethics Statement

Sequential decision-making problems arise whenever learners repeatedly interact with an unknown environment in an effort to maximize a certain notion of reward gained from interactions with this environment. Bandits often provide a simple form of this interaction, and bandit optimization algorithms have been successfully applied over the years in online advertising, recommendation services, resource allocation, etc. (Lattimore and Szepesvári 2018). More recently, researchers have started exploring the potential of bandit algorithms in physical systems, such as robotics, wireless networks, the power grid, and medical trials. A distinguishing feature of many of these new applications is their safety-critical nature. Specifically, the algorithm's chosen actions need to satisfy certain system constraints. Importantly, the constraints are often unknown, which leads to the challenge of balancing the goal of reward maximization with the restriction of playing safe actions. At the same time, many modern applications of bandit algorithms involve a networked set of distributed agents (e.g., wireless/sensor networks). This calls for extensions of the traditional bandit setting to networked systems. The past few years have seen a surge of research activity in these two areas: (i) safe, and (ii) distributed bandit optimization (Sui et al. 2015; Amani, Alizadeh, and Thrampoulidis 2019; Sui et al. 2018; Kazerouni et al. 2017; Moradipari et al. 2019; Korda, Szörényi, and Shuai 2016; Martínez-Rubio, Kanade, and Rebeschini 2019; Wang et al. 2019; Szörényi et al. 2013). This paper presents the first (simultaneously) safe and distributed bandit algorithm and contributes at the intersection of these two emerging lines of work. We study simplified linear models, but we believe that they already capture some relevant key problem features and challenges. Finally, while our study is theoretical, we believe that it motivates further research in this field that can potentially guide practical implementations.
References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312–2320.

Amani, S.; Alizadeh, M.; and Thrampoulidis, C. 2019. Linear stochastic bandits under safety constraints. In Advances in Neural Information Processing Systems, 9252–9262.

Avner, O.; and Mannor, S. 2019. Multi-user communication networks: A coordinated multi-armed bandit approach. IEEE/ACM Transactions on Networking 27(6): 2192–2207.

Berkenkamp, F.; Krause, A.; and Schoellig, A. P. 2016. Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450.

Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. Stochastic linear optimization under bandit feedback.

Duchi, J. C.; Agarwal, A.; and Wainwright, M. J. 2011. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control 57(3): 592–606.

Kazerouni, A.; Ghavamzadeh, M.; Yadkori, Y. A.; and Van Roy, B. 2017. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, 3910–3919.

Khezeli, K.; and Bitar, E. 2019. Safe Linear Stochastic Bandits. arXiv preprint arXiv:1911.09501.

Korda, N.; Szörényi, B.; and Shuai, L. 2016. Distributed clustering of linear bandits in peer to peer networks. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 48, 1301–1309. International Machine Learning Society.

Landgren, P.; Srivastava, V.; and Leonard, N. E. 2016a. Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms. In 2016 IEEE 55th Conference on Decision and Control (CDC), 167–172. IEEE.

Landgren, P.; Srivastava, V.; and Leonard, N. E. 2016b. On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC), 243–248. IEEE.

Lattimore, T.; and Szepesvári, C. 2018. Bandit algorithms. preprint 28.

Li, S.; Hao, F.; Li, M.; and Kim, H.-C. 2013. Medicine rating prediction and recommendation in mobile social networks. In International Conference on Grid and Pervasive Computing, 216–223. Springer.

Lynch, N. A. 1996. Distributed Algorithms. Elsevier.

Martínez-Rubio, D.; Kanade, V.; and Rebeschini, P. 2019. Decentralized Cooperative Stochastic Bandits. In Advances in Neural Information Processing Systems, 4531–4542.

Moradipari, A.; Amani, S.; Alizadeh, M.; and Thrampoulidis, C. 2019. Safe Linear Thompson Sampling. arXiv preprint arXiv:1911.02156.

Pacchiano, A.; Ghavamzadeh, M.; Bartlett, P.; and Jiang, H. 2020. Stochastic Bandits with Linear Constraints. arXiv preprint arXiv:2006.10185.

Rusmevichientong, P.; and Tsitsiklis, J. N. 2010. Linearly parameterized bandits. Mathematics of Operations Research 35(2): 395–411.

Scaman, K.; Bach, F.; Bubeck, S.; Lee, Y. T.; and Massoulié, L. 2017. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 3027–3036. JMLR.org.

Sui, Y.; Gotovos, A.; Burdick, J. W.; and Krause, A. 2015. Safe exploration for optimization with Gaussian processes. Proceedings of Machine Learning Research 37: 997–1005.

Sui, Y.; Zhuang, V.; Burdick, J. W.; and Yue, Y. 2018. Stagewise safe Bayesian optimization with Gaussian processes. arXiv preprint arXiv:1806.07555.

Szörényi, B.; Busa-Fekete, R.; Hegedűs, I.; Ormándi, R.; Jelasity, M.; and Kégl, B. 2013.
Gossip-based distributed stochastic bandit algorithms. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 2, 1056–1064. International Machine Learning Society.

Wang, Y.; Hu, J.; Chen, X.; and Wang, L. 2019. Distributed Bandit Learning: How Much Communication is Needed to Achieve (Near) Optimal Regret. arXiv preprint arXiv:1904.06309.

Xiao, L.; and Boyd, S. 2004. Fast linear iterations for distributed averaging. Systems & Control Letters 53(1): 65–78.