Differentially-Private Federated Linear Bandits

Abhimanyu Dubey and Alex Pentland
Media Lab and Institute for Data, Systems and Society
Massachusetts Institute of Technology
{dubeya, pentland}@mit.edu

The rapid proliferation of decentralized learning systems mandates the need for differentially-private cooperative learning. In this paper, we study this in the context of the contextual linear bandit: we consider a collection of agents cooperating to solve a common contextual bandit, while ensuring that their communication remains private. For this problem, we devise FEDUCB, a multi-agent private algorithm for both centralized and decentralized (peer-to-peer) federated learning. We provide a rigorous technical analysis of its utility in terms of regret, improving several results in cooperative bandit learning, and we provide rigorous privacy guarantees as well. Our algorithms provide competitive performance both in terms of pseudoregret bounds and empirical benchmark performance in various multi-agent settings.

1 Introduction

The multi-armed bandit is the classical sequential decision-making problem, involving an agent sequentially choosing actions to take in order to maximize a (stochastic) reward [44]. It embodies the central exploration-exploitation dilemma present in sequential decision-making. Practical applications of the multi-armed bandit range from recommender systems [52] and anomaly detection [11] to clinical trials [15] and finance [24]. Increasingly, however, such large-scale applications are becoming decentralized, as their data is often located with different entities, and involve cooperation between these entities to maximize performance [14, 22]. This paradigm is now known as federated learning.1 The objective of the federated paradigm is to allow cooperative estimation with larger amounts of data (from multiple clients, devices, etc.) while keeping the data decentralized [27]. There has been a surge of interest in this problem from both academia and industry, owing to its broad applicability.

Most research on provably private algorithms in the federated setting has been on distributed supervised learning [28] and optimization [20]. The contextual bandit problem, however, is a very interesting candidate for private methods, since the involved contexts and rewards both typically contain sensitive user information [38]. There is an increasing body of work on online learning and multi-armed bandits in cooperative settings [13, 31, 39], and on private single-agent learning [41, 38], but methods for private federated bandit learning are still elusive, despite their immediate applicability.

Contributions. In this paper, we study the federated contextual bandit problem under constraints of differential privacy. We consider two popular paradigms: (a) centralized learning, where a central controller coordinates different clients [27, 49], and (b) decentralized peer-to-peer learning with delays, where agents communicate directly with each other, without a controller [39, 31, 30, 32]. We provide a rigorous formulation of $(\varepsilon, \delta)$-differential privacy in the federated contextual bandit, and present two variants of FEDUCB, the first federated algorithm ensuring that each agent is private with respect to the data from all other agents, while providing a parameter to control communication.
1 Originally, federated learning referred to the algorithm proposed in [28] for supervised learning; the term now broadly refers to the distributed cooperative learning setting [27].

Next, we prove rigorous bounds on the cumulative group pseudoregret obtained by FEDUCB. In the centralized setting, we prove a high-probability regret bound of $\tilde{O}(d^{3/4}\sqrt{MT/\varepsilon})$, which matches the non-private bound in terms of its dependence on $T$ and $M$. In the decentralized case, we prove a corresponding regret bound of $\tilde{O}\left(d^{3/4}\sqrt{\mathrm{diameter}(G)\cdot MT/\varepsilon}\right)$, where $G$ is the communication network between agents. In addition to the regret bounds, we present a novel analysis of communication complexity and its connections with the privacy budget and regret.

2 Related Work

Multi-Agent and Distributed Bandits. Bandit learning in multi-agent distributed settings has received attention from several academic communities. Channel selection in distributed radio networks considers the (context-free) multi-armed bandit with collisions [35, 37, 36] and cooperative estimation over a network with delays [31, 30, 32]. For the contextual case, recent work has considered non-private estimation in networks with delays [12, 13, 48, 29]. A closely related problem is that of bandits with side information [8, 5], where a single learner obtains multiple observations every round, similar to the multi-agent communicative setting. Our work builds on the remarkable work of Abbasi-Yadkori et al. [1], which in turn improves the LinUCB algorithm introduced in [33].

Differential Privacy. Our work utilizes differential privacy, a cryptographically-secure privacy framework introduced by Dwork [16, 18] that requires the behavior of an algorithm to fluctuate only slightly (in probability) with any change in its inputs. A technique to maintain differential privacy for the continual release of statistics was introduced in [10, 17], known as the tree-based algorithm, which privatizes the partial sums of $n$ entries by adding at most $\log n$ noisy terms. This method has been used to preserve privacy across several online learning problems, including convex optimization [26, 25], online data analysis [23], collaborative filtering [6] and data aggregation [9]. In the single-agent bandit setting, differential privacy using tree-based algorithms has been explored in the multi-armed case [43, 40, 46] and the contextual case [41]. In particular, our work uses several elements from [41], extending their single-agent results to the federated multi-agent setting. For the multi-agent multi-armed (i.e., context-free) bandit problem, differentially private algorithms have been devised for the centralized [45] and decentralized [14] settings. Empirically, the advantages of privacy-preserving contextual bandits have been demonstrated in the work of Malekzadeh et al. [38], and Hannun et al. [21] consider a centralized multi-agent contextual bandit algorithm that uses secure multi-party computation to provide privacy guarantees (neither work provides regret guarantees). To the best of our knowledge, this paper is the first to investigate differential privacy for contextual bandits in the federated learning setting, in both centralized and decentralized environments. Our work is most closely related to the important work of [1, 41], extending it to federated learning.

3 Background and Preliminaries

We use boldface to represent vectors, e.g., $\mathbf{x}$, and matrices, e.g., $\mathbf{X}$.
For any matrix $X$, its Gram matrix is given by $X^\top X$. A symmetric matrix $X$ is positive semi-definite (PSD) if $y^\top X y \ge 0$ for every vector $y$; we denote this by $X \succeq 0$. For any PSD matrix $X$ we denote the ellipsoid $X$-norm of a vector $y$ as $\|y\|_X = \sqrt{y^\top X y}$. For two matrices $X$ and $Y$, $X \succeq Y$ means that $X - Y \succeq 0$. For any PSD matrix $X$ and vectors $u, v$, we denote the $X$-ellipsoid inner product as $\langle u, v \rangle_X = u^\top X v$, and drop the subscript when $X = I$. We denote the set $\{a, a+1, \dots, b-1, b\}$ by the shorthand $[a, b]$, and when $a = 1$ we write $[b]$. We denote the $\gamma$-th power of a graph $G$ by $G^\gamma$ ($G^\gamma$ has an edge $(i, j)$ if the shortest distance between $i$ and $j$ in $G$ is at most $\gamma$). $N_\gamma(v)$ denotes the set of nodes in $G$ at a distance of at most $\gamma$ (including itself) from node $v \in G$.

Federated Contextual Bandit. This is an extension of the linear contextual bandit [33, 1] involving a set of $M$ agents. At every trial $t \in [T]$, each agent $i \in [M]$ is presented with a decision set $D_{i,t} \subset \mathbb{R}^d$ from which it selects an action $x_{i,t} \in \mathbb{R}^d$. It then obtains a reward $y_{i,t} = x_{i,t}^\top \theta^* + \eta_{i,t}$, where $\theta^* \in \mathbb{R}^d$ is an unknown (but fixed) parameter and $\eta_{i,t}$ is a noise term sampled i.i.d. every trial for every agent. The objective of the agents is to minimize the cumulative group pseudoregret:2
$$R_M(T) = \sum_{i=1}^{M} \sum_{t=1}^{T} \langle x^*_{i,t} - x_{i,t},\, \theta^* \rangle, \qquad \text{where } x^*_{i,t} = \arg\max_{x \in D_{i,t}} \langle x, \theta^* \rangle \text{ is the optimal action.}$$

2 The pseudoregret is an expectation (over the randomness of $\eta_{i,t}$) of the stochastic regret, and is more amenable to high-probability bounds. However, a bound on the pseudoregret also bounds the regret with high probability, e.g., by a Hoeffding concentration (see, e.g., [47]).

In the single-agent setting, the optimal regret is $\tilde{O}(d\sqrt{T})$, which has been matched by a variety of algorithms, the most popular being those based on upper confidence bounds (UCB) [1, 33, 4]. In non-private distributed contextual bandits, a group pseudoregret of $\tilde{O}(d\sqrt{MT})$ has been achieved [48, 49], which is the order-optimality we seek in our algorithms as well.

Assumptions. We assume: (a) bounded action set: $\forall i, t$, $\|x_{i,t}\| \le L$; (b) bounded mean reward: $|\langle x, \theta^* \rangle| \le 1$ for all $x$; (c) bounded target parameter: $\|\theta^*\| \le S$; (d) sub-Gaussian rewards: $\forall i, t$, $\eta_{i,t}$ is $\sigma$-sub-Gaussian; (e) bounded decision set: $\forall i, t$, $D_{i,t}$ is compact; (f) bounded reward: $\forall i, t$, $|y_{i,t}| \le B$.3

3 Assuming bounded rewards is required for the privacy mechanism, and is standard in the literature [3, 2].
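To make the interaction protocol concrete, the following is a minimal simulation sketch of the federated linear bandit and its group pseudoregret. This is our illustration rather than code from the paper: the decision-set construction, dimensions, noise level, and the uniform-random placeholder policy are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, T, K, sigma = 5, 4, 1000, 20, 0.1   # illustrative sizes

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)  # ||theta*|| <= S = 1

def decision_set():
    """Sample K unit-norm actions, a stand-in for the environment's D_{i,t}."""
    X = rng.normal(size=(K, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)  # ||x|| <= L = 1

pseudoregret = 0.0
for t in range(T):
    for i in range(M):
        D = decision_set()
        x = D[rng.integers(K)]                     # placeholder policy (FEDUCB replaces this)
        y = x @ theta_star + sigma * rng.normal()  # reward y = <x, theta*> + eta
        x_opt = D[np.argmax(D @ theta_star)]       # optimal action in D_{i,t}
        pseudoregret += (x_opt - x) @ theta_star   # accumulates R_M(T)

print(f"group pseudoregret of the random policy: {pseudoregret:.1f}")
```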
Differential Privacy. The contextual bandit problem involves two sets of variables that any agent must keep private from the other participating agents: the available decision sets $(D_{i,t})_{t \in [T]}$ and the observed rewards $(y_{i,t})_{t \in [T]}$. The adversary model assumed here is to prevent any two colluding agents $j$ and $k$ from obtaining non-private information about any specific element in agent $i$'s history. That is, we assume that each agent is a trusted entity that interacts with a new user at each instant $t$. Therefore, the context set $D_{i,t}$ and outcome $y_{i,t}$ are sensitive variables that the user trusts only with agent $i$. Hence, we wish to keep $(D_{i,t})_{t \in [T]}$ private. However, the agent only stores the chosen actions $(x_{i,t})_{t \in [T]}$ (and not all of $D_{i,t}$), and hence making our technique differentially private with respect to $((x_{i,t}, y_{i,t}))_{t \in [T]}$ suffices. We first denote two sequences $S_i = ((x_{i,t}, y_{i,t}))_{t \in [T]}$ and $S'_i = ((x'_{i,t}, y'_{i,t}))_{t \in [T]}$ as $t$-neighbors if for each $t' \neq t$, $(x_{i,t'}, y_{i,t'}) = (x'_{i,t'}, y'_{i,t'})$. We can now provide the formal definition of federated differential privacy:

Definition 1 (Federated Differential Privacy). In a federated learning setting with $M \ge 2$ agents, a randomized multi-agent contextual bandit algorithm $\mathcal{A} = (\mathcal{A}_i)_{i=1}^{M}$ is $(\varepsilon, \delta, M)$-federated differentially private under continual multi-agent observation if for any $i, j$ with $i \neq j$, any $t$, and any pair of sequence sets $\mathbf{S}_i = (S_k)_{k=1}^{M}$ and $\mathbf{S}'_i = (S_k)_{k=1, k \neq i}^{M} \cup S'_i$ such that $S_i$ and $S'_i$ are $t$-neighboring, and any subset $\mathcal{S}_j \subseteq D_{j,1} \times D_{j,2} \times \dots \times D_{j,T}$ of actions, it holds that:
$$\mathbb{P}\left(\mathcal{A}_j(\mathbf{S}_i) \in \mathcal{S}_j\right) \le e^{\varepsilon} \cdot \mathbb{P}\left(\mathcal{A}_j(\mathbf{S}'_i) \in \mathcal{S}_j\right) + \delta.$$

Our notion of federated differential privacy formalizes the standard intuition that the action chosen by any agent must be sufficiently impervious (in probability) to any single $(x, y)$ pair from any other agent. Here, we essentially lift the definition of joint differential privacy [41] from the individual $(x, y)$ level to the entire history $(x_t, y_t)_t$ of each participating agent. Note that our definition in its current form does not require each algorithm to be private with respect to its own history, but only with respect to the histories belonging to other agents, i.e., each agent can be trusted with its own data. This setting can also be understood as requiring all outgoing communication from any agent to be locally differentially private [51] with respect to the personal history $(x_t, y_t)_t$. We can alternatively relax this assumption and assume that the agent cannot be trusted with its own history, in which case the notion of joint or local DP at the individual level (i.e., $(x_t, y_t)$) must be considered, as done in [51, 41].

The same guarantee can be obtained if each agent $\mathcal{A}_i$ is $(\varepsilon, \delta)$-differentially private (in the standard contextual bandit sense, see [41]) with respect to any other agent $j$'s observations, for all $j$. A composition argument [16]4 over all $M$ agents would therefore provide us $\left(\sqrt{2M \log(1/\delta')}\,\varepsilon + M\varepsilon(e^{\varepsilon} - 1),\; M\delta + \delta'\right)$-differential privacy with respect to the overall sequence. To keep the notation simple, however, we adopt the $(\varepsilon, \delta, M)$ format.

4 Under the stronger assumption that each agent interacts with a completely different set of individuals, we do not need to invoke the composition theorem (as $x_{1,t}, x_{2,t}, \dots, x_{M,t}$ are independent for each $t$). However, in the case that one individual could potentially interact simultaneously with all agents, this is not true (e.g., when for some $t$, $D_{i,t} = D_{j,t}\ \forall i, j$), and we must invoke the $k$-fold composition theorem [17] to ensure privacy.

4 Federated LinUCB with Differential Privacy

In this section, we introduce our algorithm for federated learning with differential privacy. For the remainder of this section, for exposition, we consider the single-agent setting and drop the additional index subscript used in the actual algorithm (e.g., we refer to the action at time $t$ as $x_t$ and not $x_{i,t}$ for agent $i$). We build on the celebrated LinUCB algorithm, an application of the optimism heuristic to the linear bandit case [33, 1], designed for the single-agent problem. The central idea of the algorithm is, at every round $t$, to construct a confidence set $E_t$ that contains $\theta^*$ with high probability, followed by computing an upper confidence bound on the reward of each action within the decision set $D_t$, and finally selecting the action with the largest UCB, i.e., $x_t = \arg\max_{x \in D_t} \left(\max_{\theta \in E_t} \langle x, \theta \rangle\right)$.
The confidence set is an ellipsoid centered on the regularized linear regression estimate. In [1, 33], the regularizer is a fixed positive-definite matrix; in our case, however, we will carefully select the regularizer $H_t$ to introduce privacy, using a strategy similar to [41]. Given $V_t = G_t + H_t$, where $G_t$ is the Gram matrix of the actions chosen so far, let $\mathrm{UCB}_t(x; \theta) = \langle \theta, x \rangle + \beta_t \|x\|_{V_t^{-1}}$.

In the federated setting, since there are $M$ learners that take distinct actions, the communication protocol is a key component of algorithm design: communication often creates heterogeneity between agents (e.g., for any two agents, their estimators $\hat\theta_t$ at any instant are certainly distinct), and the algorithm must provide control over this heterogeneity in order to bound the group regret. We additionally require that communication between agents is $(\varepsilon, \delta)$-private, making the problem more challenging.

4.1 Centralized Environment with a Controller

We first consider the centralized communication environment, where a controller coordinates communication between the different agents, as is typical in large-scale distributed learning. We consider a set of $M$ agents that each interact with the contextual bandit and periodically communicate with the controller, which synchronizes them with the other agents. Our approach, CENTRALIZED FEDUCB, is detailed in Algorithm 1.

Algorithm 1 CENTRALIZED FEDUCB($D, M, T, \rho_{\min}, \rho_{\max}$)
1: Initialization: $\forall i$, set $S_{i,1} \leftarrow M\rho_{\min} I$, $s_{i,1} \leftarrow 0$, $\hat Q_{i,0} \leftarrow 0$, $U_{i,1} \leftarrow 0$, $u_{i,1} \leftarrow 0$.
2: for each iteration $t \in [T]$ do
3:  for each agent $i \in [M]$ do
4:   Set $V_{i,t} \leftarrow S_{i,t} + U_{i,t}$, $u_{i,t} \leftarrow s_{i,t} + u_{i,t}$.
5:   Receive $D_{i,t}$ from the environment.
6:   Compute the regressor $\hat\theta_{i,t} \leftarrow V_{i,t}^{-1} u_{i,t}$.
7:   Compute $\beta_{i,t}$ following Proposition 2.
8:   Select $x_{i,t} \leftarrow \arg\max_{x \in D_{i,t}} \langle x, \hat\theta_{i,t} \rangle + \beta_{i,t} \|x\|_{V_{i,t}^{-1}}$.
9:   Obtain $y_{i,t}$ from the environment.
10:   Update $U_{i,t+1} \leftarrow U_{i,t} + x_{i,t} x_{i,t}^\top$, $u_{i,t+1} \leftarrow u_{i,t} + x_{i,t} y_{i,t}$.
11:   Update $\hat Q_{i,t} \leftarrow \hat Q_{i,t-1} + [x_{i,t}^\top\ y_{i,t}]^\top [x_{i,t}^\top\ y_{i,t}]$.
12:   if $\log\det\left(V_{i,t} + x_{i,t} x_{i,t}^\top + M(\rho_{\max} - \rho_{\min}) I\right) - \log\det(S_{i,t}) \ge D/\Delta t_i$ then
13:    SYNCHRONIZE $\leftarrow$ TRUE.
14:   end if
15:   if SYNCHRONIZE then
16:    [AGENTS] Agent sends $\hat Q_{i,t}$ to the PRIVATIZER and receives $\hat U_{i,t+1}, \hat u_{i,t+1}$ from the PRIVATIZER.
17:    [AGENTS] Agent communicates $\hat U_{i,t+1}, \hat u_{i,t+1}$ to the controller.
18:    [CONTROLLER] Compute $S_{t+1} \leftarrow \sum_{i=1}^{M} \hat U_{i,t+1}$, $s_{t+1} \leftarrow \sum_{i=1}^{M} \hat u_{i,t+1}$.
19:    [CONTROLLER] Communicate $S_{t+1}, s_{t+1}$ back to the agents.
20:    [AGENTS] $S_{i,t+1} \leftarrow S_{t+1}$, $s_{i,t+1} \leftarrow s_{t+1}$.
21:    [AGENTS] $\hat Q_{i,t+1} \leftarrow 0$, $\Delta t_i \leftarrow 0$, $U_{i,t+1} \leftarrow 0$, $u_{i,t+1} \leftarrow 0$.
22:   else
23:    $S_{i,t+1} \leftarrow S_{i,t}$, $s_{i,t+1} \leftarrow s_{i,t}$, $\Delta t_i \leftarrow \Delta t_i + 1$.
24:   end if
25:  end for
26: end for

Algorithm 1 works as follows. Consider an agent $i$, and assume that synchronization last took place at instant $t'$. At any instant $t > t'$, the agent holds two sets of parameters: (A) all observations up to instant $t'$ from all $M$ agents (including itself), and (B) its own observations from instant $t'$ to $t$. Since (A) includes samples from other agents, these are privatized, and represented as the Gram matrix $S_{t'+1} = \sum_{i \in [M]} \hat U_{i,t'+1}$ and reward vector $s_{t'+1} = \sum_{i \in [M]} \hat u_{i,t'+1}$. Algorithm 1 privatizes its own observations as well (for simplicity in the analysis), and hence $S, s$ are identical for all agents at all times. Moreover, since the group parameters are noisy variants of the original parameters, i.e., $\hat U_{i,t} = G_{i,t} + H_{i,t}$ and $\hat u_{i,t} = u_{i,t} + h_{i,t}$ (where $H_{i,t}$ and $h_{i,t}$ are perturbations), we can rewrite $S_{t'}$ and $s_{t'}$ as
$$S_{t'} = \sum_{i \in [M]} \left(\sum_{\tau=1}^{t'} x_{i,\tau} x_{i,\tau}^\top + H_{i,t'}\right), \qquad s_{t'} = \sum_{i \in [M]} \left(\sum_{\tau=1}^{t'} y_{i,\tau} x_{i,\tau} + h_{i,t'}\right). \tag{1}$$
When we combine the group parameters with the local (unsynchronized) parameters, we obtain the final form of the parameters for any agent $i$ (for any instant $t > t'$):
$$V_{i,t} = \sum_{\tau=t'}^{t} x_{i,\tau} x_{i,\tau}^\top + S_{t'}, \qquad u_{i,t} = \sum_{\tau=t'}^{t} y_{i,\tau} x_{i,\tau} + s_{t'}. \tag{2}$$
Then, with a suitable sequence $(\beta_{i,t})_t$, the agent selects the action following the linear UCB objective:
$$x_{i,t} = \arg\max_{x \in D_{i,t}} \langle \hat\theta_{i,t}, x \rangle + \beta_{i,t} \|x\|_{V_{i,t}^{-1}}, \quad \text{where } \hat\theta_{i,t} = V_{i,t}^{-1} u_{i,t}. \tag{3}$$
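As an illustration of Equations (2) and (3), the following sketch (ours, not the authors' implementation) assembles the regressor from the synchronized statistics $S, s$ and the local buffers $U, u$, then selects the UCB-maximizing action; the exploration radius `beta` is the quantity characterized in Proposition 2 below.

```python
import numpy as np

def ucb_action(D, S, s, U, u, beta):
    """One step of Eq. (2)-(3).

    D    : (K, d) array of available actions
    S, s : synchronized (privatized) Gram matrix / reward vector, Eq. (1)
    U, u : local sums of x x^T and y x accumulated since the last sync
    beta : exploration radius beta_{i,t} (see Proposition 2)
    """
    V = S + U                                  # Eq. (2): Gram matrix V_{i,t}
    theta_hat = np.linalg.solve(V, s + u)      # regressor V^{-1} u_{i,t}
    V_inv = np.linalg.inv(V)
    # UCB(x) = <theta_hat, x> + beta * ||x||_{V^{-1}}  (Eq. (3))
    widths = np.sqrt(np.einsum("kd,de,ke->k", D, V_inv, D))
    return D[np.argmax(D @ theta_hat + beta * widths)]
```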
The central idea of the algorithm is therefore to carefully perturb the Gram matrices $V_{i,t}$ and the reward vectors $u_{i,t}$ with random noise $(H_{i,t}, h_{i,t})$ based on the sensitivity of these elements and the level of privacy required, as is typical in the literature [41]. First, each agent updates its local (unsynchronized) estimates. These are used to construct the UCB in a manner identical to the standard OFUL algorithm [1]. If, for any agent $i$, the log-determinant of the local Gram matrix exceeds that of the synchronized Gram matrix $S_{i,t}$ by an amount $D/\Delta t_i$ (where $\Delta t_i$ is the time since the last synchronization), then it sends a signal to the controller, which synchronizes all agents with their latest action/reward pairs. The synchronization is done using a privatized version of the Gram matrix and rewards, carried out by the subroutine PRIVATIZER (Section 4.3, Alg. 2). This synchronization ensures that the heterogeneity between the agents is controlled, allowing us to bound the overall regret and limit communication as well:

Proposition 1 (Communication Complexity). If Algorithm 1 is run with threshold $D$, then the total number of rounds of communication $n$ is upper bounded by $2\sqrt{(dT/D)\log\left(\rho_{\max}/\rho_{\min} + TL^2/d\rho_{\min}\right)} + 4$.

Now, the perturbations $H_{i,t}, h_{i,t}$ are designed with the privacy setting in mind. We defer their construction to the subsequent section in a subroutine known as PRIVATIZER, and concentrate on the performance guarantees first. The PRIVATIZER subroutine provides suitable perturbations based on the privacy budget ($\varepsilon$ and $\delta$), and is inspired by the single-agent private algorithm of [41]. In this paper, we assume these budgets to be identical for all agents (however, the algorithm and analysis hold for unique privacy budgets as well, as long as a lower bound on the budgets is known). In turn, the quantities $\varepsilon$ and $\delta$ affect the algorithm (and regret) via the quantities $\rho_{\min}$, $\rho_{\max}$, and $\kappa$, which can be understood as spectral bounds on $H_{i,t}, h_{i,t}$.

Definition 2 (Sparsely-accurate $\rho_{\min}, \rho_{\max}$ and $\kappa$). Consider a subsequence $\sigma$ of $[T] = \{1, \dots, T\}$ of size $n$. The bounds $0 < \rho_{\min} \le \rho_{\max}$ and $\kappa$ are $(\alpha/2nM, \sigma)$-accurate for $(H_{i,t})_{i \in [M], t \in \sigma}$ and $(h_{i,t})_{i \in [M], t \in \sigma}$ if, for each round $t \in \sigma$ and agent $i$,
$$\|H_{i,t}\| \le \rho_{\max}, \qquad \|H_{i,t}^{-1}\| \le 1/\rho_{\min}, \qquad \|h_{i,t}\|_{H_{i,t}^{-1}} \le \kappa,$$
with probability at least $1 - \alpha/2nM$.

The motivation for obtaining accurate bounds $\rho_{\min}, \rho_{\max}$ and $\kappa$ stems from the fact that in the non-private case, the quantities that determine the regret are not stochastic conditioned on the obtained sequence $(x_t, y_t)_{t \in [T]}$, whereas the addition of stochastic regularizers in the private case requires us to have control over their spectra to achieve any meaningful regret. To form the UCB, recall that we additionally require a suitable exploration sequence $\beta_{i,t}$ for each agent, defined as follows.

Definition 3 (Accurate $(\beta_{i,t})_{i \in [M], t \in [T]}$). A sequence $(\beta_{i,t})_{i \in [M], t \in [T]}$ is $(\alpha, M, T)$-accurate for $(H_{i,t})_{i \in [M], t \in [T]}$ and $(h_{i,t})_{i \in [M], t \in [T]}$ if it satisfies $\|\hat\theta_{i,t} - \theta^*\|_{V_{i,t}} \le \beta_{i,t}$ with probability at least $1 - \alpha$ for all rounds $t = 1, \dots, T$ and agents $i = 1, \dots, M$ simultaneously.

Proposition 2. Consider an instance of the problem where synchronization occurs exactly $n$ times on instances $\sigma$, up to and including $T$ trials, and $\rho_{\min}, \rho_{\max}$ and $\kappa$ are $(\alpha/2nM, \sigma)$-accurate. Then, for Algorithm 1, the sequence $(\beta_{i,t})_{i \in [M], t \in [T]}$ is $(\alpha, M, T)$-accurate, where:
$$\beta_{i,t} := \sigma \sqrt{2 \log(2/\alpha) + \log\det(V_{i,t}) - d \log(M \rho_{\min})} + S\sqrt{M \rho_{\max}} + \sqrt{M}\,\kappa.$$
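For concreteness, a direct transcription of this confidence width; note that the final two terms follow our reconstruction of the (partially garbled) formula above, so the exact constants should be treated as indicative.

```python
import numpy as np

def beta_prop2(V, alpha, sigma, S, M, rho_min, rho_max, kappa):
    """Confidence width beta_{i,t} per Proposition 2 (as reconstructed here)."""
    d = V.shape[0]
    sign, logdet_V = np.linalg.slogdet(V)      # log det(V_{i,t}), computed stably
    assert sign > 0, "V_{i,t} must be positive definite"
    radius = sigma * np.sqrt(2 * np.log(2 / alpha) + logdet_V - d * np.log(M * rho_min))
    return radius + S * np.sqrt(M * rho_max) + np.sqrt(M) * kappa
```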
The key point here is that for the desired levels of privacy $(\varepsilon, \delta)$ and synchronization rounds $n$, we can calculate appropriate $\rho_{\min}, \rho_{\max}$ and $\kappa$, which in turn provide us a UCB algorithm with guarantees. Our basic techniques are similar to [1, 41]; however, the multi-agent setting introduces several new aspects in the analysis, such as the control of heterogeneity.

Theorem 1 (Private Group Regret with Centralized Controller). Assuming Proposition 2 holds, and synchronization occurs in at least $n = \Omega\left(d \log\left(\rho_{\max}/\rho_{\min} + TL^2/d\rho_{\min}\right)\right)$ rounds, Algorithm 1 obtains the following group pseudoregret with probability at least $1 - \alpha$:
$$R_M(T) = O\left(\sqrt{MTd\,\Delta}\left(\sigma\sqrt{\log(2/\alpha)} + \sigma\sqrt{d\,\Delta} + S\sqrt{M\rho_{\max}} + \sqrt{M}\,\kappa\right)\right), \quad \text{where } \Delta := \log\left(\frac{\rho_{\max}}{\rho_{\min}} + \frac{TL^2}{d\rho_{\min}}\right).$$

This regret bound is obtained by setting $D = 2Td\left(\log\left(\rho_{\max}/\rho_{\min} + TL^2/d\rho_{\min}\right) + 1\right)^{-1}$, which ensures $O(M \log T)$ total communication (by Proposition 1). However, the remarks below clarify a more sophisticated relationship between communication, privacy and regret.

Remark 1 (Communication Complexity and Regret). Theorem 1 assumes that communication is essentially $O(M \log T)$. This rate (and the corresponding $D$) is chosen to provide a balance between privacy and utility, and can in fact be altered depending on the application. In the Appendix, we demonstrate that when we allow $O(MT)$ communication (i.e., synchronizing every round), the regret can be improved by a factor of $O(\sqrt{\log T})$ (in both the gap-dependent and gap-independent bounds), to match the single-agent $\tilde{O}(d\sqrt{MT})$ rate. Similarly, following the reasoning in [48], we can show that with $O(M^{1.5} d^3)$ communication (i.e., independent of $T$), we incur a regret of $O(d\sqrt{MT} \log^2(MT))$.

Theorem 1 demonstrates the relationship between communication complexity (i.e., the number of synchronization rounds) and the regret bound for a fixed privacy budget, via the dependence on the bounds $\rho_{\min}, \rho_{\max}$ and $\kappa$. We now present similar results (for a fixed privacy budget) on the group regret in the decentralized setting, which is more involved, as the delays and the lack of a centralized controller make it difficult to control the heterogeneity of information between agents. Subsequently, in the next section, we present results on the privacy guarantees of our algorithms.

4.2 Decentralized Peer-to-Peer Environment

In this environment, we assume that the collection of agents communicate by directly sending messages to each other. The communication network is denoted by an undirected (connected) graph $G = (V, E)$, where edge $e_{ij} \in E$ if agents $i$ and $j$ can communicate directly. The protocol operates as follows: every trial, each agent interacts with its respective bandit and obtains a reward. At any trial $t$, after receiving the reward, each agent $v$ sends the message $m_{v,t}$ to all its neighbors in $G$. This message is forwarded from agent to agent $\gamma$ times (taking one trial of the bandit problem between forwards), after which it is dropped. This communication protocol, based on the time-to-live (delay) parameter $\gamma \le \mathrm{diam}(G)$, is a common technique to control communication complexity, known as the LOCAL protocol [19, 34, 42]. Each agent $v \in V$ therefore also receives messages $m_{v',\, t - d(v, v')}$ from all agents $v'$ such that $d(v, v') \le \gamma$, i.e., from all agents in $N_\gamma(v)$.
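A minimal sketch of this LOCAL-style forwarding, assuming a dictionary-based adjacency representation and opaque payloads (both simplifications of ours): each message carries a time-to-live of $\gamma$ and is relayed one hop per trial until it expires, so agent $v$ eventually hears from exactly the agents in $N_\gamma(v)$, with delay equal to the graph distance.

```python
from collections import deque

def run_local_protocol(adj, gamma, T, make_message):
    """Forward messages over graph `adj` with a time-to-live of `gamma` hops.

    adj          : dict mapping each agent to a list of its neighbors in G
    make_message : callable (agent, t) -> payload broadcast by that agent
    Returns inbox[v]: list of (sender, sent_at, payload) tuples received by v.
    """
    inbox = {v: [] for v in adj}
    seen = {v: set() for v in adj}      # dedupe on (sender, sent_at)
    in_flight = deque()                 # (node, sender, sent_at, payload, ttl)
    for t in range(T):
        # relay every in-flight message one hop, decrementing its TTL
        for _ in range(len(in_flight)):
            node, sender, t0, payload, ttl = in_flight.popleft()
            for w in adj[node]:
                if w == sender or (sender, t0) in seen[w]:
                    continue
                seen[w].add((sender, t0))
                inbox[w].append((sender, t0, payload))
                if ttl > 1:
                    in_flight.append((w, sender, t0, payload, ttl - 1))
        # each agent injects its message for this trial
        for v in adj:
            in_flight.append((v, v, t, make_message(v, t), gamma))
    return inbox

# e.g., on the path 0-1-2 with gamma = 1, agent 2 never hears from agent 0:
inbox = run_local_protocol({0: [1], 1: [0, 2], 2: [1]}, gamma=1, T=5,
                           make_message=lambda v, t: f"m_{v},{t}")
assert all(sender != 0 for sender, _, _ in inbox[2])
```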
There are several differences in this setting compared to the previous one. First, since agents receive messages from different sets of agents depending on their position in $G$, they generally have heterogeneous information throughout (there is no global synchronization). Second, information does not flow instantaneously through $G$: messages can take up to $\gamma$ rounds to be communicated (delay). This requires (mainly technical) changes to the centralized algorithm in order to control the regret. We present the pseudocode of our algorithm DECENTRALIZED FEDUCB in the Appendix, but highlight the major differences from the centralized version here.

The first key algorithmic change from the earlier variant is the idea of subsampling, inspired by the work of Weinberger and Ordentlich [50]. Each agent $i$ maintains $\gamma$ estimators $\theta_{i,g}$, $g = 1, \dots, \gamma$, and uses them in a round-robin fashion, i.e., at $t = 1$ each agent uses $\theta_{i,1}$, at $t = 2$ each agent uses $\theta_{i,2}$, and so on (a minimal sketch of this round-robin scheme appears at the end of this subsection). These estimators (and their associated $V_{i,g}, u_{i,g}$) are updated in a manner similar to Algorithm 1: if the log-determinant of the $g$-th Gram matrix exceeds a threshold $D/(\Delta t_{i,g} + 1)$, then the agent broadcasts a request to synchronize. Each agent $j$ within the $\gamma$-clique of $i$ that receives this request broadcasts its own $V_{j,g}, u_{j,g}$. Therefore, each $V_{i,g}, u_{i,g}$ is updated with action/reward pairs from only the trials in which they were employed (across all agents). This ensures that if a signal to synchronize the $g$-th set of parameters has been broadcast by an agent $i$, all agents within the $\gamma$-clique of $i$ will have synchronized their $g$-th parameters by the next two rounds in which those parameters are used (i.e., $2\gamma$ trials later).

Algorithm 2 PRIVATIZER($\varepsilon, \delta, M, T$) for any agent $i$
1: Initialization:
2: If the communication rounds $n$ are fixed a priori, set $m \leftarrow 1 + \lceil \log n \rceil$; else $m \leftarrow 1 + \lceil \log T \rceil$.
3: Create a binary tree $\mathcal{T}$ of depth $m$.
4: for each node $\mathbf{n}$ in $\mathcal{T}$ do
5:  Sample noise $\widehat N \in \mathbb{R}^{(d+1) \times (d+1)}$, where $\widehat N_{ij} \sim \mathcal{N}\left(0,\, 16m(L^2+1)^2 \log(2/\delta)^2/\varepsilon^2\right)$.
6:  Store $N = (\widehat N + \widehat N^\top)/\sqrt{2}$ at node $\mathbf{n}$.
7: end for
8: Runtime:
9: for each communication round $t \le n$ do
10:  Receive $\hat Q_{i,t}$ from the agent, and insert it into $\mathcal{T}$ (see [26], Alg. 5).
11:  Compute $M_{i,t+1}$ using the least nodes of $\mathcal{T}$ (see [26], Alg. 5).
12:  Set $\hat U_{i,t+1} = U_{i,t+1} + H_{i,t}$ as the top-left $d \times d$ submatrix of $M_{i,t+1}$.
13:  Set $\hat u_{i,t+1} = u_{i,t+1} + h_{i,t}$ as the first $d$ entries of the last column of $M_{i,t+1}$.
14:  Return $\hat U_{i,t+1}, \hat u_{i,t+1}$ to the agent.
15: end for

We defer most of the intermediary results to the Appendix for brevity, and present the primary regret bound (in terms of the privacy parameters $\rho_{\min}, \rho_{\max}$ and $\kappa$):

Theorem 2 (Decentralized Private Group Regret). Assuming Proposition 2 holds, and synchronization occurs in at least $n = \Omega\left(d\left(\bar\chi(G^\gamma)\cdot\gamma\right)(1 + L^2)^{-1} \log\left(\rho_{\max}/\rho_{\min} + TL^2/d\rho_{\min}\right)\right)$ rounds, decentralized FEDUCB obtains the following group pseudoregret with probability at least $1 - \alpha$:
$$R_M(T) = O\left(\sqrt{\left(\bar\chi(G^\gamma)\cdot\gamma\right) MTd\,\Delta}\left(\sigma\sqrt{\log(2/\alpha)} + \sigma\sqrt{d\,\Delta} + S\sqrt{M\rho_{\max}} + \sqrt{M}\,\kappa\right)\right),$$
where $\bar\chi(\cdot)$ is the clique covering number and $\Delta$ is as in Theorem 1.

Remark 2 (Decentralized Group Regret). Decentralized FEDUCB obtains an identical dependence on the privacy bounds $\rho_{\min}, \rho_{\max}$ and $\kappa$ and on the horizon $T$ as Algorithm 1, since the underlying bandit subroutines are identical for both. The key difference is the additional leading factor of $\sqrt{\bar\chi(G^\gamma)\cdot\gamma}$, which arises from the delayed spread of information: if $G$ is dense, e.g., complete, then $\gamma = 1$ and $\bar\chi(G^\gamma) = 1$, since there is only one clique in $G$. In the worst case, if $G$ is a line graph, then $\bar\chi(G^\gamma) = M/\gamma$, giving an additional factor of $\sqrt{M}$ (i.e., it is as good as each agent acting individually). In more practical scenarios, we expect $G$ to be hierarchical, and expect a delay overhead of $o(1)$. Note that since each agent only communicates within its clique, the factor $\bar\chi(G^\gamma)$ is optimal; see [12].
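The round-robin subsampling referenced above reduces to a few lines of state management; the sketch below is an illustrative reduction of the Appendix pseudocode that omits privatization and the synchronization protocol (the `merge` hook stands in for a neighbor's broadcast of its increments since the last synchronization).

```python
import numpy as np

class SubsampledAgent:
    """gamma round-robin estimator states for the decentralized setting."""

    def __init__(self, d, gamma, M, rho_min):
        self.gamma = gamma
        # one (V, u) pair per phase g = 0, ..., gamma-1
        self.V = [M * rho_min * np.eye(d) for _ in range(gamma)]
        self.u = [np.zeros(d) for _ in range(gamma)]

    def estimator(self, t):
        g = t % self.gamma              # phase used at trial t
        return np.linalg.solve(self.V[g], self.u[g])

    def update(self, t, x, y):
        g = t % self.gamma              # only the phase-g state sees trial t
        self.V[g] += np.outer(x, x)
        self.u[g] += y * x

    def merge(self, g, dV, du):
        # a neighbor's increments (since its last sync) for phase g
        self.V[g] += dV
        self.u[g] += du
```

Because the phase-$g$ statistics only ever contain observations from trials $t \equiv g \pmod{\gamma}$, a synchronization request for phase $g$ has time to propagate through the $\gamma$-clique before that phase is used again.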
4.3 Privacy Guarantees

We now discuss the privacy guarantees for both algorithms. Here we present results for the centralized algorithm, but our results hold (almost identically) for the decentralized case as well (see Appendix). Note that each agent interacts with data from other agents only via the cumulative parameters $S_t$ and $s_t$. These, in turn, depend on $Z_{i,t} = U_{i,t} + H_{i,t}$ and $z_{i,t} = u_{i,t} + h_{i,t}$ for each agent $i$, on the instances $t$ at which synchronization occurs.

Proposition 3 (see [16, 41]). Consider $n \le T$ synchronization rounds occurring on trials $\sigma \subseteq [T]$. If the sequence $(Z_{i,t}, z_{i,t})_{t \in \sigma}$ is $(\varepsilon, \delta)$-differentially private with respect to $(x_{i,t}, y_{i,t})_{t \in [T]}$ for each agent $i \in [M]$, then all agents are $(\varepsilon, \delta, M)$-federated differentially private.

Tree-Based Mechanism. Let $x_1, x_2, \dots, x_T$ be a (matrix-valued) sequence of length $T$, and let $s_i = \sum_{t=1}^{i} x_t$ be the incremental sum of the sequence that must be released privately. The tree-based mechanism [17] for differential privacy involves a trusted entity maintaining a binary tree $\mathcal{T}$ (of depth $m = 1 + \lceil \log_2 T \rceil$), where the leaf nodes contain the sequence items $x_i$, and each parent node maintains the (matrix) sum of its children. Let $n_i$ be the value stored at any node in the tree. The mechanism achieves privacy by adding a noise term $h_i$ to each node, and releasing $n_i + h_i$ whenever a node is queried. To calculate $s_t$ for some $t \in [T]$, the procedure is to traverse the tree $\mathcal{T}$ up to the leaf node corresponding to $x_t$, summing up the values at each node on the traversal path. The advantage is that we access at most $m$ nodes, and add only $m = O(\log T)$ noise terms (instead of $O(T)$).

The private release of $(U_{i,t}, u_{i,t})$ is implemented via this ubiquitous tree-based mechanism for partial sums. Following [41], we aggregate both into a single matrix $M_{i,t} \in \mathbb{R}^{(d+1) \times (d+1)}$ by first concatenating $A_{i,t} := [X_{i,1:t},\, y_{i,1:t}] \in \mathbb{R}^{t \times (d+1)}$ and then computing $M_{i,t} = A_{i,t}^\top A_{i,t}$. The update is straightforward: $M_{i,t+1} = M_{i,t} + [x_{i,t}^\top\ y_{i,t}]^\top [x_{i,t}^\top\ y_{i,t}]$. Recall that in our implementation, we only communicate in synchronization rounds (and not every round). Assume that two successive rounds of synchronization occur at times $t'$ and $t$. Then, at instant $t$, each agent $i$ inserts $\sum_{\tau=t'}^{t} [x_{i,\tau}^\top\ y_{i,\tau}]^\top [x_{i,\tau}^\top\ y_{i,\tau}]$ into $\mathcal{T}$, and computes $(U_{i,t}, u_{i,t})$ by summing up the entire path up to instant $t$ via the tree mechanism. Therefore, the tree mechanism accesses at most $m = 1 + \lceil \log_2 n \rceil$ nodes (where $n$ total rounds of communication occur until instant $T$), and hence noise ensuring that each node guarantees $\left(\varepsilon/\sqrt{8m \ln(2/\delta)},\, \delta/2m\right)$-privacy is sufficient to make the outgoing sequence $(\varepsilon, \delta)$-private. This differs from the setting of the joint-DP single-agent bandit [41], where observations are inserted every round, not only in synchronization rounds.

To make the partial sums private, a noise matrix is added to each node of $\mathcal{T}$. We utilize additive Gaussian noise: at each node, we sample $\widehat N \in \mathbb{R}^{(d+1) \times (d+1)}$, where each $\widehat N_{i,j} \sim \mathcal{N}(0, \sigma_N^2)$ with $\sigma_N^2 = 16m(L^2+1)^2 \log(2/\delta)^2/\varepsilon^2$, and symmetrize it (see step 6 of Alg. 2). The total noise $H_{i,t}$ is the sum of at most $m$ such terms; hence the variance of each element of $H_{i,t}$ is $m\sigma_N^2$. We can bound the operator norm of the top-left $d \times d$ submatrix of each noise term. Therefore, to guarantee $(\varepsilon, \delta, M)$-federated DP, we require that, with probability at least $1 - \alpha/nM$:
$$\|H_{i,t}\|_{\mathrm{op}} \le \Lambda = 32m(L^2+1)\log(4/\delta)\left(4\sqrt{d} + 2\ln(2nM/\alpha)\right)/\varepsilon.$$
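A compact sketch of the tree-based mechanism described above, written for scalar increments for clarity; in the PRIVATIZER the leaves hold the $(d+1) \times (d+1)$ increments $\hat Q_{i,t}$ and the per-node noise is the symmetrized Gaussian of Algorithm 2. The heap-indexed tree and the power-of-two capacity are implementation choices of ours.

```python
import numpy as np

class PrivateTreeCounter:
    """Binary-tree mechanism for partial sums under continual observation.

    n leaves hold the increments; each node is released with Gaussian noise
    of scale sigma_node (sampled once and reused), so any prefix sum touches
    at most m = 1 + log2(n) noisy nodes. n must be a power of two here.
    """

    def __init__(self, n, sigma_node, seed=0):
        self.n = n
        self.sigma = sigma_node
        self.rng = np.random.default_rng(seed)
        self.vals = {}    # node -> exact partial sum stored at that node
        self.noise = {}   # node -> noise, sampled once per node

    def _noisy(self, node):
        if node not in self.noise:
            self.noise[node] = self.sigma * self.rng.normal()
        return self.vals.get(node, 0.0) + self.noise[node]

    def insert(self, idx, x):
        """Add increment x at leaf idx (0-based), updating all ancestors."""
        node = self.n + idx               # heap indexing: leaves at n .. 2n-1
        while node >= 1:
            self.vals[node] = self.vals.get(node, 0.0) + x
            node //= 2

    def prefix_sum(self, t):
        """Noisy sum of increments 0 .. t-1, touching O(log n) noisy nodes."""
        def query(node, lo, hi):          # node covers leaves [lo, hi)
            if t >= hi:
                return self._noisy(node)  # fully inside the prefix
            if t <= lo:
                return 0.0                # fully outside
            mid = (lo + hi) // 2
            return query(2 * node, lo, mid) + query(2 * node + 1, mid, hi)
        return query(1, 0, self.n)

tree = PrivateTreeCounter(n=8, sigma_node=1.0)
for i, x in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.insert(i, x)
print(tree.prefix_sum(4))  # ~10, plus at most 1 + log2(8) noise terms
```

Only the noisy values should ever leave such a component in a real deployment; caching the per-node noise keeps repeated queries consistent and keeps the privacy accounting at $m$ noise terms per release.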
Remark 3 (Privacy Guarantee). The procedure outlined above guarantees that each of the $n$ outgoing messages $(U_{i,t}, u_{i,t})$ (where $t$ is a synchronization round) for any agent $i$ is $(\varepsilon, \delta)$-differentially private. This analysis considers the $L_2$-sensitivity with respect to a single differing observation, i.e., $(x, y)$, and not the entire message itself, i.e., the complete sequence $(x_{i,\tau}, y_{i,\tau})_{\tau=t'}^{t}$ (where $t'$ and $t$ are successive synchronization rounds), which may potentially have $O(t)$ sensitivity and warrants a detailed analysis. While our analysis is sufficient for the user-level adversary model, there may be settings where privacy is required at the message level as well, which we leave as future work.

However, as noted by [41], this $H_{i,t}$ would not always be PSD. To ensure that it is always PSD, we can shift each $H_{i,t}$ by $2\Lambda I$, giving a bound on $\rho_{\max}$. Similarly, we can obtain bounds on $\rho_{\min}$ and $\kappa$:

Proposition 4. Fix $\alpha > 0$. If each agent $i$ samples noise parameters $H_{i,t}$ and $h_{i,t}$ using the tree-based Gaussian mechanism described above for all $n$ trials of $\sigma$ in which communication occurs, then the following $\rho_{\min}, \rho_{\max}$ and $\kappa$ are $(\alpha/2nM, \sigma)$-accurate bounds:
$$\rho_{\min} = \Lambda, \qquad \rho_{\max} = 3\Lambda, \qquad \kappa = \sqrt{m\,\sigma_N^2\left(d + 2\log(2nM/\alpha)\right)/\Lambda}.$$

Remark 4 (Strategyproof Analysis). The mechanism presented above assumes worst-case communication, i.e., synchronization occurring every round; therefore at most $T$ partial sums will be released, and $m = 1 + \lceil \log T \rceil$. This is not true in general settings, where infrequent communication would typically require only $m = O(\log \log T)$. However, if any agent is byzantine and requests a synchronization every trial, $m$ must be $1 + \lceil \log T \rceil$ to ensure privacy. In case the protocol is fixed in advance (i.e., synchronization occurs on a pre-determined set $\sigma$ of $n$ rounds), then we can set $m = 1 + \lceil \log n \rceil$ to achieve the maximum utility at the desired privacy budget.

Remark 5 (Decentralized Protocol). Decentralized FEDUCB obtains similar bounds for $\rho_{\min}, \rho_{\max}$ and $\kappa$, with $m = 1 + \lceil \log(T/\gamma) \rceil$. An additional term of $\log(\gamma)$ appears in $\Lambda$ and $\kappa$, since we now need to maintain $\gamma$ partial sums with at most $T/\gamma$ elements in the worst case (see Appendix). Unsurprisingly, there is no dependence on the network $G$, as privatization is done at the source itself.

Corollary 1 ($(\varepsilon, \delta)$-dependent Regret). FEDUCB with the PRIVATIZER subroutine of Algorithm 2 obtains $\tilde{O}(d^{3/4}\sqrt{MT/\varepsilon})$ centralized regret and $\tilde{O}\left(d^{3/4}\sqrt{\left(\bar\chi(G^\gamma)\cdot\gamma\right) MT/\varepsilon}\right)$ decentralized regret.
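To tie the pieces together, here is a short sketch mapping the privacy budget $(\varepsilon, \delta)$ to the algorithm's constants, using the per-node noise scale of Algorithm 2 and the bounds of Proposition 4; the exact constants in $\Lambda$ and $\kappa$ follow our reading of the original (partially garbled) formulas and are indicative only.

```python
import numpy as np

def privacy_constants(eps, delta, m, n, M, d, L, alpha):
    """Noise scale and spectral bounds implied by the privacy budget.

    Reconstruction of Section 4.3 / Proposition 4; constants are indicative.
    """
    # per-node variance of the tree noise (step 5 of Algorithm 2)
    sigma_N2 = 16 * m * (L**2 + 1)**2 * np.log(2 / delta)**2 / eps**2
    # high-probability operator-norm bound on the summed noise H_{i,t}
    Lam = (32 * m * (L**2 + 1) * np.log(4 / delta)
           * (4 * np.sqrt(d) + 2 * np.log(2 * n * M / alpha)) / eps)
    rho_min, rho_max = Lam, 3 * Lam   # after shifting each H_{i,t} by 2*Lam*I
    kappa = np.sqrt(m * sigma_N2 * (d + 2 * np.log(2 * n * M / alpha)) / Lam)
    return sigma_N2, Lam, rho_min, rho_max, kappa

# a stricter budget (smaller eps) inflates every bound, and hence beta_{i,t}:
print(privacy_constants(eps=1.0, delta=0.1, m=11, n=1024, M=100, d=10, L=1.0, alpha=0.1))
```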
Figure 1: A comparison of centralized FEDUCB on 3 different axes. Panel (A) describes the variation in asymptotic per-agent regret for varying privacy budget $\varepsilon$ (with $\delta = 0.1$); (B) describes the effect of $n$ in private (solid) vs. non-private (dashed) settings; (C) describes the effect of $d$ on per-agent regret in the private setting ($n = O(M \log T)$, $\varepsilon = 1$, $\delta = 0.1$). Experiments are averaged over 100 runs.

4.4 Experiments

We provide an abridged summary of our experiments here (see the Appendix for all experiments). We focus on the centralized environment, and on the variation of the regret with communication complexity and privacy budget. For all experiments, we assume $L = S = 1$. For any $d$, we randomly fix $\theta^* \in \mathcal{B}_d(1)$. Each $D_{i,t}$ is generated as follows: we randomly sample $K = d^2$ actions $x$ such that for $K - 1$ actions, $0.5 \le \langle x, \theta^* \rangle \le 0.6$, and for the optimal $x^*$, $0.7 \le \langle x^*, \theta^* \rangle \le 0.8$, so that the suboptimality gap is always at least $0.1$. $y_{i,t}$ is sampled from $\mathrm{Ber}(\langle x_{i,t}, \theta^* \rangle)$, so that $\mathbb{E}[y_{i,t}] = \langle x_{i,t}, \theta^* \rangle$ and $|y_{i,t}| \le 1$. Results are shown in Fig. 1; experiments are averaged over 100 trials.

Experiment 1: Privacy Budget. In this setting, we set $n = O(M \log T)$ and $d = 10$ (to balance communication and performance), and plot the average per-agent regret after $T = 10^7$ trials for varying $M$ and $\varepsilon$, keeping $\delta = 0.1$. Figure 1A describes the results, which remain competitive across privacy budgets.

Experiment 2: Communication. To test the communication-privacy trade-off, we plot the regret curves for $M = 100$ agents with $n = O(M^{1.5})$, $O(M \log T)$, and $O(MT)$ communication, for both private ($\varepsilon = 1$) and non-private settings. We observe the trade-off highlighted in Remark 1.

Experiment 3: Dependence on $d$. Finally, as an ablation, we provide the average per-agent regret curves for $M = 100$ while varying the dimensionality $d$. We observe an essentially quadratic dependence.

5 Discussion and Conclusion

The relentless permeation of intelligent decision-making into sensitive user data necessitates the development of technically sound and robust machine learning systems; it is difficult to overstate the importance of protecting user data in modern times. There is a significant effort [7, 22] in creating decentralized decision-making systems that benefit from pooling data from multiple sources while maintaining user-level privacy. This research provides an instance of such a system, along with guarantees on both its utility and privacy. From a technical perspective, we make improvements along several fronts: while there has been prior work on multi-agent private linear bandits [21, 38], our work is the first to provide rigorous guarantees on private linear bandits in the multi-agent setting. Our work additionally provides the first algorithm with regret guarantees for contextual bandits in decentralized networks, extending the work of many on multi-armed bandits [39, 31, 30, 14, 12].

There are several unresolved questions in this line of work, as highlighted by our algorithm itself. In the decentralized case, our algorithm obtains a communication overhead of $O(\sqrt{\bar\chi(G^\gamma)\cdot\gamma})$, which we believe is an artefact of our proof technique. We conjecture that the true dependence scales as $O(\sqrt{\alpha(G^\gamma) + \gamma})$ (where $\alpha(\cdot)$ is the independence number of the graph), which would be obtained by an algorithm that bypasses subsampling, with a more sophisticated synchronization technique. Additionally, the lower bound obtained for private contextual bandit estimation [41] has an $O(\varepsilon^{-1})$ dependence on the privacy budget. In the federated setting, given the intricate relationship between communication and privacy, an interesting line of inquiry would be to understand the communication-privacy trade-off in more detail, with appropriate lower bounds for fixed communication budgets.

Broader Impact

We envision a world that runs on data-dependent inference. Given the ease of availability of sensitive data, large-scale centralized data sources are not feasible for upholding the best interests and safety of the average user, and decentralized inference mechanisms are a feasible alternative, balancing utility with important normative values such as privacy, transparency and accountability. This research is part of such a broader research goal: to create decentralized systems that keep the security of the end-user in mind. We believe that methods like ours are a starting point for better private algorithms that can be readily deployed in industry, with appropriate guarantees as well.
While there are several subsequent steps that can be improved both theoretically and practically, this research has been designed from the ground up to ensure robustness and security, sometimes at the cost of efficiency.

Funding Disclosures

This project was funded by the MIT Trust:Data Consortium.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39.1–39.26, 2012.

[3] Shipra Agrawal and Navin Goyal. Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pages 99–107, 2013.

[4] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256, 2002.

[5] Swapna Buccapatnam, Atilla Eryilmaz, and Ness B Shroff. Stochastic bandits with side observations on networks. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, pages 289–300, 2014.

[6] Joseph A Calandrino, Ann Kilzer, Arvind Narayanan, Edward W Felten, and Vitaly Shmatikov. "You might also like:" Privacy risks of collaborative filtering. In 2011 IEEE Symposium on Security and Privacy, pages 231–246. IEEE, 2011.

[7] Igor Calzada and Esteve Almirall. Data ecosystems for protecting European citizens' digital rights. Transforming Government: People, Process and Policy, 2020.

[8] Nicolò Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. A gang of bandits. In Advances in Neural Information Processing Systems, pages 737–745, 2013.

[9] T-H Hubert Chan, Elaine Shi, and Dawn Song. Privacy-preserving stream aggregation with fault tolerance. In International Conference on Financial Cryptography and Data Security, pages 200–214. Springer, 2012.

[10] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. In International Colloquium on Automata, Languages, and Programming, pages 405–417. Springer, 2010.

[11] Kaize Ding, Jundong Li, and Huan Liu. Interactive anomaly detection on attributed networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 357–365, 2019.

[12] Abhimanyu Dubey and Alex Pentland. Cooperative multi-agent bandits with heavy tails. arXiv preprint arXiv:2008.06244, 2020.

[13] Abhimanyu Dubey and Alex Pentland. Kernel methods for cooperative multi-agent contextual bandits. arXiv preprint arXiv:2008.06220, 2020.

[14] Abhimanyu Dubey and Alex Pentland. Private and byzantine-proof cooperative decision-making. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pages 357–365, 2020.

[15] Audrey Durand, Charis Achilleos, Demetris Iacovides, Katerina Strati, Georgios D Mitsis, and Joelle Pineau. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pages 67–82, 2018.

[16] Cynthia Dwork. Differential privacy. Encyclopedia of Cryptography and Security, pages 338–340, 2011.

[17] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 715–724, 2010.

[18] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[19] Pierre Fraigniaud. Locality in distributed graph algorithms, 2016.

[20] Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.

[21] Awni Hannun, Brian Knott, Shubho Sengupta, and Laurens van der Maaten. Privacy-preserving contextual bandits. arXiv preprint arXiv:1910.05299, 2019.

[22] Thomas Hardjono and Alex Pentland. Data cooperatives: Towards a foundation for decentralized personal data management. arXiv preprint arXiv:1905.08819, 2019.

[23] Moritz Hardt and Guy N Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 61–70. IEEE, 2010.

[24] Xiaoguang Huo and Feng Fu. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society Open Science, 4(11):171377, 2017.

[25] Roger Iyengar, Joseph P Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical differentially private convex optimization. In 2019 IEEE Symposium on Security and Privacy (SP), pages 299–316. IEEE, 2019.

[26] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Conference on Learning Theory, pages 24.1–24.34, 2012.

[27] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

[28] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[29] Nathan Korda, Balázs Szörényi, and Shuai Li. Distributed clustering of linear bandits in peer to peer networks. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 48, pages 1301–1309. International Machine Learning Society, 2016.

[30] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 167–172. IEEE, 2016.

[31] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. On distributed cooperative decision-making in multiarmed bandits. In 2016 European Control Conference (ECC), pages 243–248. IEEE, 2016.

[32] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Social imitation in cooperative multiarmed bandits: Partition-based algorithms with strictly local information. In 2018 IEEE Conference on Decision and Control (CDC), pages 5239–5244. IEEE, 2018.

[33] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.

[34] Nathan Linial. Locality in distributed graph algorithms. SIAM Journal on Computing, 21(1):193–201, 1992.

[35] Keqin Liu and Qing Zhao. Decentralized multi-armed bandit with multiple distributed players. In 2010 Information Theory and Applications Workshop (ITA), pages 1–10. IEEE, 2010.

[36] Keqin Liu and Qing Zhao. Distributed learning in cognitive radio networks: Multi-armed bandit with distributed multiple players. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3010–3013. IEEE, 2010.
[37] Keqin Liu and Qing Zhao. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing, 58(11):5667–5681, 2010.

[38] Mohammad Malekzadeh, Dimitrios Athanasakis, Hamed Haddadi, and Ben Livshits. Privacy-preserving bandits. arXiv preprint arXiv:1909.04421, 2019.

[39] David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic multi-armed bandits. In Advances in Neural Information Processing Systems, 2019.

[40] Nikita Mishra and Abhradeep Thakurta. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601. AUAI Press, 2015.

[41] Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems, pages 4296–4306, 2018.

[42] Jukka Suomela. Survey of local algorithms. ACM Computing Surveys (CSUR), 45(2):24, 2013.

[43] Abhradeep Guha Thakurta and Adam Smith. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In Advances in Neural Information Processing Systems, pages 2733–2741, 2013.

[44] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[45] Aristide C. Y. Tossou and Christos Dimitrakakis. Differentially private, multi-agent multi-armed bandits. In European Workshop on Reinforcement Learning (EWRL), 2015.

[46] Aristide C. Y. Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[47] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.

[48] Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: How much communication is needed to achieve (near) optimal regret. arXiv preprint arXiv:1904.06309, 2019.

[49] Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations, 2020.

[50] Marcelo J Weinberger and Erik Ordentlich. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959–1976, 2002.

[51] Mengmeng Yang, Lingjuan Lyu, Jun Zhao, Tianqing Zhu, and Kwok-Yan Lam. Local differential privacy and its applications: A comprehensive survey. arXiv preprint arXiv:2008.03686, 2020.

[52] Chunqiu Zeng, Qing Wang, Shekoofeh Mokhtari, and Tao Li. Online context-aware recommendation with time varying multi-armed bandit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2025–2034, 2016.