# on_differentially_private_federated_linear_contextual_bandits__2e30ee39.pdf

Published as a conference paper at ICLR 2024

ON DIFFERENTIALLY PRIVATE FEDERATED LINEAR CONTEXTUAL BANDITS

Xingyu Zhou Wayne State University, USA Email: xingyu.zhou@wayne.edu

Sayak Ray Chowdhury Microsoft Research, India Email: t-sayakr@microsoft.com

We consider cross-silo federated linear contextual bandit (LCB) problem under differential privacy, where multiple silos interact with their respective local users and communicate via a central server to realize collaboration without sacrificing each user s privacy. We identify three issues in the state-of-the-art (Dubey & Pentland, 2020): (i) failure of claimed privacy protection, (ii) incorrect regret bound due to noise miscalculation and (iii) ungrounded communication cost. To resolve these issues, we take a two-step approach. First, we design an algorithmic framework consisting of a generic federated LCB algorithm and flexible privacy protocols. Then, leveraging the proposed framework, we study federated LCBs under two different privacy constraints. We first establish privacy and regret guarantees under silo-level local differential privacy, which fix the issues present in state-of-the-art algorithm. To further improve the regret performance, we next consider shuffle model of differential privacy, under which we show that our algorithm can achieve nearly optimal regret without a trusted server. We accomplish this via two different schemes one relies on a new result on privacy amplification via shuffling for DP mechanisms and another one leverages the integration of a shuffle protocol for vector sum into the tree-based mechanism, both of which might be of independent interest. Finally, we support our theoretical results with numerical evaluations over contextual bandit instances generated from both synthetic and real-life data.

1 INTRODUCTION

We consider the classic cross-silo Federated Learning (FL) paradigm (Kairouz et al., 2021) applied to linear contextual bandits (LCB). In this setting, a set of M local silos or agents (e.g., hospitals) communicate with a central server to learn about the unknown bandit parameter (e.g., hidden vector representing values of the user for different medicines). In particular, at each round t [T], each local agent i [M] receives a new user (e.g., patient) with context information ct,i Ci (e.g., age, gender, medical history), recommends an action at,i Ki (e.g., a choice of medicine), and then it observes a real-valued reward yt,i (e.g., effectiveness of the prescribed medicine). In linear contextual bandits, the reward yt,i is a linear function of the unknown bandit parameter θ Rd corrupted by i.i.d mean-zero observation noise ηt,i, i.e., yt,i = xt,i, θ + ηt,i, where xt,i = ϕi(ct,i, at,i) and ϕi : Ci Ki Rd is a known function that maps a context-action pair to a d-dimensional real-valued feature vector. The goal of federated LCB is to minimize the cumulative group pseudo-regret defined

t=1 max a Ki ϕi(ct,i, a), θ xt,i, θ .

To achieve the goal, as in standard cross-silo FL, the agents are allowed to communicate with the central server following a star-shaped communication, i.e., each agent can communicate with the server by uploading and downloading data, but agents cannot communicate with each other directly. However, the communication process (i.e., both data and schedule) could also possibly incur privacy leakage for each user t at each silo i, e.g., the sensitive context information ct,i and reward yt,i.

To address this privacy risk, we resort to differential privacy (Dwork & Roth, 2014), a principled way to prove privacy guarantee against adversaries with arbitrary auxiliary information. In standard cross-device FL, the notion of privacy is often the client-level DP, which protects the identity of each participating client or device. However, it has limitations in cross-silo FL, where the protection

Published as a conference paper at ICLR 2024

Table 1: Summary of main results. ε > 0, δ (0, 1) are privacy parameters.

DP model Privacy cost in regret Communication Reference

Central-JDP e O

MT d3/4 log1/4(1/δ) ε NA Shariff & Sheffet (2018)

Silo-level LDP e O

T (Md)3/4 log1/4(1/δ) ε

MT) Theorem 5.1

MT d3/4 log1/4(1/δ) ε

MT) Theorem 5.3 and 5.5

targets are users (e.g., patients) rather than participating silos or agents (e.g., hospitals). Also, in order to adopt client-level DP to cross-silo FL, one needs the server and other silos to be trustworthy, which is often not the case. Hence, recent studies (Lowy & Razaviyayn, 2021; Lowy et al., 2022; Liu et al., 2022; Dobbe et al., 2018) on cross-silo federated supervised learning have converged to a new privacy notion, which requires that for each silo, all of its communication during the entire process is private ( indistinguishable ) with respect to change of one local user of its own. This allows one to protect each user within each silo without trustworthy server and other silos. In this paper, we adapt it to the setting of cross-silo federated contextual bandits and call it silo-level LDP1.

Dubey & Pentland (2020) adopt a similar but somewhat weaker notion of privacy called Federated DP and takes the first step to tackle this important problem of private and federated linear contextual bandits (LCBs). In fact, the performance guarantees presented by the authors are currently the state-of-the-art for this problem. The proposed algorithm claims to protect the privacy of each user at each silo. Furthermore, given a privacy budget ε > 0, the claimed regret bound is e O( p

MT/ε) with only O(M log T) communication rounds, which matches the regret of a super-single agent that plays for total MT rounds. Unfortunately, in spite of being the state-of-the-art, the aforementioned privacy, regret and communication cost guarantees have fundamental gaps, as discussed below.

Our contributions: identify privacy, regret, communication gaps in state-of-the-art (Dubey & Pentland, 2020). In Section 3, we first show that the algorithm in (Dubey & Pentland, 2020) could leak privacy from the side channel of adaptive communication schedule, which depends on users nonprivate local data. Next, we identify a mistake in total injected privacy noise in their regret analysis. Accounting for this miscalculation, the correct regret bound would amount to e O(M 3/4p

T/ε), which is M 1/4 factor higher than the claimed one, and doesn t match regret performance of the super agent. Finally, we observe that due to the presence of privacy noise, its current analysis for O(M log T) communications no longer holds. To resolve these issues, we take the following two-step approach:

(i) design a generic algorithmic and analytical framework. In Section 4, we propose a generic federated LCB algorithm along with a flexible privacy protocol. Our algorithm adopts a fixed-batch schedule (rather than an adaptive one in Dubey & Pentland (2020)) that helps avoid privacy leakage from the side channel, as well as subtleties in communication analysis. Our privacy protocol builds on a distributed version of the celebrated tree-based algorithm (Chan et al., 2011; Dwork et al., 2010), enabling us to provide different privacy guarantees in a unified way. We further show that our algorithm enjoys a simple and generic analytical regret bound that only depends on the total amount of injected privacy noise under the required privacy constraints.

(ii) prove regret guarantees under different privacy notions. We build upon the above framework to study federated LCBs under two different privacy constraints. In Section 5.1, we consider silo-level LDP (a stronger notion of privacy than Federated DP of Dubey & Pentland (2020)) and establish privacy guarantee with a correct regret bound e O(M 3/4p

T/ε) and communication cost O(

MT), hence fixing the gaps in Dubey & Pentland (2020). Next, to match the regret of a super single agent, we consider shuffle DP (SDP) (Cheu et al., 2019) in Section 5.2 and establish a regret bound of e O( p

MT/ε). We provide two different techniques to achieve this one that relies on a new result on privacy amplification via shuffling for DP mechanisms and the other that integrates a shuffle protocol for vector sums (Cheu et al., 2021) into the tree-based mechanism. See Table 1 for a summary.

Related work. In standard multi-armed bandits, where rewards are only sensitive data, different DP models including central (Mishra & Thakurta, 2015; Azize & Basu, 2022; Sajed & Sheffet, 2019),

1It appears under different names in prior work, e.g., silo-specific sample-level DP (Liu et al., 2022), inter-silo record-level DP (Lowy & Razaviyayn, 2021).

Published as a conference paper at ICLR 2024

local (Ren et al., 2020) and distributed (Chowdhury & Zhou, 2022a; Tenenbaum et al., 2021), have been studied. In linear contextual bandits, where both contexts and rewards are sensitive, there is a line of work under central (Shariff & Sheffet, 2018), local (Zheng et al., 2020) and shuffle (Chowdhury & Zhou, 2022b; Garcelon et al., 2022; Tenenbaum et al., 2023) models of DP. Li et al. (2022); Hanna et al. (2022) study linear bandits without contexts protection. Dubey & Pentland (2020) is the first to consider federated LCBs under item-level privacy while Huang et al. (2023) study user-level privacy under some distributional assumptions; see Appendix A. Federated or distributed LCBs without privacy have also been studied (Wang et al., 2020; He et al., 2022a; Huang et al., 2021), where a common goal is to achieve the regret of a super single agent that plays MT rounds while keeping communication cost minimal. Lowy & Razaviyayn (2021); Liu et al. (2022) study private cross-silo federated learning under supervised setting, whereas we focus on the sequential learning setting.

2 DIFFERENTIAL PRIVACY IN FEDERATED LCBS

We now formally introduce differential privacy in cross-silo federated contextual bandits. Let a dataset Di at each silo i be given by a sequence of T unique users U1,i, . . . , UT,i. Each user Ut,i is identified by her context information ct,i as well as reward responses she would give to all possible actions recommended to her. We say two datasets Di and D i at silo i are adjacent if they differ exactly in one participating user, i.e., Uτ,i = U τ,i for some τ [T] and Us,i = U s,i for all s = τ.

Silo-level local differential privacy (LDP). Consider a multi-round, cross-silo federated learning algorithm Q. At each round t, each silo i communicates a randomized message Zt i of its data Di to the server, which may depend (due to collaboration) on previous randomized messages Z1 j , . . . , Zt 1 j from all other silos j = i. We allow Zt i to be empty if there is no communication at round t. Let Zi = (Z1 i , . . . , ZT i ) denote the full transcript of silo i s communications with the server over T rounds and Qi the induced local mechanism in this process. Note that Zi is a realization of random messages generated according to the local mechanism Qi. We denote by Z i = (Z1, . . . , Zi 1, Zi+1, . . . , ZM) the full transcripts of all but silo i. We assume that Zi is conditionally independent of Dj for all j = i given Di and Z i. With this notation, we have the following definition of silo-level LDP.

Definition 2.1 (Silo-level LDP). A cross-silo federated learning algorithm Q with M silos is said to be (εi, δi)i M silo-level LDP if for each silo i [M], it holds that

P Qi(Zi Ei|Di,Z i) eεi P Qi(Zi Ei|D i,Z i) +δi ,

for all adjacent datasets Di and D i, and for all events Ei in the range of Qi. If εi = ε and δi = δ for all i [M], we simply say Q is (ε, δ)-silo-level LDP.

Roughly speaking, a silo-level LDP algorithm protects the privacy of each individual user (e.g., patient) within each silo in the sense that an adversary (which could either be the central server or other silos) cannot infer too much about any individual s sensitive information (e.g., context and reward) or determine whether an individual participated in the learning process.2

Remark 2.2 (Federated DP vs. Silo-level LDP). Dubey & Pentland (2020) consider a privacy notion called Federated DP (Fed-DP in short). As summarized in Dubey & Pentland (2020), Fed-DP requires the action chosen by any agent must be sufficiently impervious (in probability) to any single pair (x, y) from any other agent . Both silo-level LDP and Fed-DP are item-level DP as the neighboring relationship is defined by differing in one participating user. The key here is to note that silo-level DP implies Fed-DP by the post-processing property of DP, and thus it is a stronger notion of privacy. In fact, Dubey & Pentland (2020) claim to achieve Fed-DP by relying on privatizing the communicated data from each silo. However, as we shall see in Section 3, its proposed algorithm fails to privatize the adaptive synchronization schedule, which is the key reason behind privacy leakage in their algorithm.

Shuffle differential privacy (SDP). Another common DP notion for FL is SDP (Cheu et al., 2019), which has been widely studied in supervised learning (Lowy & Razaviyayn, 2021; Girgis et al., 2021; Lowy et al., 2022) to match the centralized utility performance. Motivated by this, we adapt it to FL-LCBs. Specifically, each silo i [M] first applies a local randomizer R to its raw local data and sends the randomized output to a shuffler S. The shuffler S permutes all the messages from all M silos uniformly at random and sends those to the central server. Roughly speaking, SDP requires

2This is a notion of item-level DP. A comparison with standard local DP, central (joint)-DP and shuffle DP for single-agent LCBs is presented in Appendix G.2.

Published as a conference paper at ICLR 2024

all the messages sent by the shuffler to be private ( indistinguishable ) with respect to a single user change among all MT users. This item-level DP is defined formally as follows. Definition 2.3 (SDP). Consider a cross-silo federated learning algorithm Q that induces a (randomized) mechanism M whose output is the collection of all messages sent by the shuffler during the entire learning process. Then, the algorithm Q is said to be (ε, δ)-SDP if

P M(D) E eε P M(D ) E + δ ,

for all E in the range of M and for all adjacent datasets D = (D1, . . . , DM) and D = (D 1, . . . , D M) such that PM i=1 PT t=1 1{Ut,i =U t,i} = 1.

3 PRIVACY, REGRET AND COMMUNICATION GAPS IN STATE-OF-THE-ART

Gap in privacy analysis. We take a two-step approach to demonstrate the privacy issue in Dubey & Pentland (2020). To start with, we argue that Algorithm 1 in Dubey & Pentland (2020) fails to achieve silo-level LDP due to privacy leakage through the side channel of communication schedule (i.e., when agents communicate with the server). The key issue is that the adaptive communication schedule in the algorithm depends on users non-private data. This fact can be utilized by an adversary or malicious silo j to infer another silo i s users sensitive information, which violates the requirement of silo-level LDP. In that algorithm, all silos synchronously communicate with the server if some silo i [M] : f(Xi, Z) > 0 , (1) where f is some function, Xi is non-private local data of silo i since the last synchronization and Z is all previously synchronized data. Crucially, the form of f and the rule (1) are public information, known to all silos even before the algorithm starts. This local and non-private data-dependent communication rule in (1) causes privacy leakage, as illustrated below with a toy example. Example 3.1 (Privacy leakage). Consider two silos i and j following Algorithm 1 in Dubey & Pentland (2020). After the first round, Xi is the data of the first user in silo i (say Alice), Xj is the data of the first user in silo j (say Bob) and Z is zero. Let communication is triggered at the end of first round and assume f(Xj, 0) 0. Since the rule (1) is public, silo j can infer that f(Xi, 0) > 0, i.e. the communication is triggered by silo i. Since f is also public knowledge, silo j can utilize this to infer some property of Xi. Hence, by observing only the communication signal (even without looking at the data), silo j can infer sensitive data of Alice. In fact, the specific form of f in Dubey & Pentland (2020) allows silo j to infer context information of Alice (details in Appendix B).

This example shows that Algorithm 1 in Dubey & Pentland (2020) does not satisfy silo-level LDP, implying their proof for Fed-DP guarantee via post-processing of silo-level LDP does not hold. However, it does not imply that this algorithm fails to satisfy Fed-DP, which is a weaker notion than silo-level LDP. Nevertheless, by leveraging Example 3.1, one can show that this algorithm indeed fails to guarantee Fed-DP. To see this, recall the definition of Fed-DP from Remark 2.2. In the context of Example 3.1, it translates to silo j selecting similar actions for its users when a single user in silo i changes. Specifically, if the first user in silo i changes from Alice to say, Tracy, Fed-DP mandates that all T actions suggested by silo j to its local T users remain indistinguishable . This, in turn, implies that the communicated data from silo i must remain indistinguishable at silo j for each t [T]. This is because the actions at silo j are chosen deterministically based on its local data as well as on the communicated data from silo i, and the local data at silo j remains unchanged. However, in Algorithm 1 of Dubey & Pentland (2020), the communicated data from silo i is not guaranteed to remain indistinguishable as synchronization depends on non-private local data (Xi in (1)). In other words, without additional privacy noise added to Xi in (1), the change from Alice to Tracy could affect the existence of synchronization at round t 1. Consequently, under these two neighboring situations (e.g. Alice vs. Tracy), the communicated data from silo i could differ significantly at round t + 1. As a result, the action chosen at round t + 1 in silo j can be totally different violating Fed-DP. This holds true even if silo i injects noise while communicating its data (as done in Dubey & Pentland (2020)) due to a large change of non-private communicated data (see Appendix B for details).

Gaps in regret and communication analysis. We now turn to regret and communication analysis of Dubey & Pentland (2020), which has fundamental gaps that lead to incorrect conclusions in the end. First, the reported privacy cost in regret bound is O( p

MT/ε) (ignoring dependence on dimension d), which leads to the conclusion that federated LCBs across M silos under silo-level LDP can achieve the same order of regret as in the centralized setting (i.e., when a super single agent

Published as a conference paper at ICLR 2024

Algorithm 1 Private-Fed Lin UCB

1: Parameters: Batch size B N, regularization λ > 0, confidence radii {βt,i}t [T ],i [M], feature map ϕi : Ci Ki Rd, privacy protocol P = (R, S, A)

2: Initialize: Wi = 0, Ui = 0 for all agents i [M], f Wsyn = 0, e Usyn = 0 3: for t=1, . . . , T do 4: for each agent i = 1, . . . , M do 5: Receive context ct,i; compute Vt,i = λI + f Wsyn + Wi and bθt,i = V 1 t,i (e Usyn + Ui)

6: Play action at,i =argmaxa Ki ϕi(ct,i,a), bθt,i +βt,i ϕi(ct,i,a) V 1 t,i ; observe reward yt,i

7: Set xt,i =ϕi(ct,i, at,i), Ui = Ui + xt,iyt,i and Wi = Wi + xt,ix t,i 8: end for 9: if t mod B = 0 then 10: // Local randomizer R at all agents i [M] 11: Send randomized messages Rbias t,i = Rbias(Ui) and Rcov t,i = Rcov(Wi) to S 12: // Third party S 13: Shuffle (or, not) all messages Sbias t = S({Rbias t,i }i [M]) and Scov t = S({Rcov t,i }i [M]) 14: // Analyzer A at the server 15: Compute private synchronized statistics e Usyn = Abias(Sbias t ) and f Wsyn = Acov(Scov t ) 16: // All agents i [M]

17: Receive f Wsyn and e Usyn from the server and reset Wi = 0, Ui = 0 18: end if 19: end for

plays MT rounds). However, in the proposed analysis, the total amount of injected privacy noise is miscalculated. In particular, variance of total noise needs to be Mσ2 rather than the proposed value of σ2. This is due to the fact that each silo injects Gaussian noise with variance σ2 when sending out local data which amounts to total Mσ2 noise at the server. Accounting for this correction, the cost of privacy becomes O(M 3/4p

T/ε), which is O(M 1/4) factor worse than the claimed one. Hence, we conclude that Algorithm 1 in Dubey & Pentland (2020) cannot achieve the same order of regret as in centralized setting. Second, the proposed analysis to show O(log T) communication rounds for the data-adaptive schedule (1) under privacy constraint essentially follows from the non-private one of Wang et al. (2020). Unfortunately, due to privacy noise, this direct approach no longer holds, and hence the reported logarithmic cost stands ungrounded (details in Appendix B).

4 OUR APPROACH

To address the issues in Dubey & Pentland (2020), we introduce a generic algorithm for private, federated linear contextual bandits (Algorithm 1) and a flexible privacy protocol (Algorithm 2). This helps us (a) derive correct privacy, regret, and communication results under silo-level LDP (and hence under Fed-DP) (Section 5.1), and (b) achieve the same order of regret as in centralized setting under SDP (Section 5.2). Throughout the paper, we make the following standard assumptions in LCBs. Assumption 4.1 (Boundedness (Shariff & Sheffet, 2018)). The rewards are bounded, i.e., yt,i [0, 1] for all t [T] and i [M]. Moreover, θ 2 1 and supc,a ϕi(c, a) 2 1 for all i [M].

4.1 ALGORITHM: PRIVATE FEDERATED LINUCB

We build upon the celebrated Lin UCB algorithm (Abbasi-Yadkori et al., 2011) by adopting a fixedbatch schedule for synchronization among agents and designing a privacy protocol P (Algorithm 2) for both silo-level LDP and SDP . At each round t, each agent i recommends an action at,i to each local user following optimism in the face of uncertainty principle. First, the agent computes a local estimate bθt,i based on all available data to her, which includes previously synchronized data from all agents as well as her own new local data (line 5 of Algorithm 1). Then, the action at,i is selected based on the Lin UCB decision rule (line 6), where a proper radius βt,i is chosen to balance between exploration and exploitation. After observing the reward yt,i, each agent accumulates her own local data (bias vector xt,iyt,i and covariance matrix xt,ix t,i) and stores them in Ui and Wi, respectively

Published as a conference paper at ICLR 2024

Algorithm 2 P, a privacy protocol used in Algorithm 1

1: Procedure: Local Randomizer R at each agent 2: //Input: stream data (γ1, . . . , γK), ε>0, δ (0, 1] 3: for k=1, . . . , K do 4: Express k in binary form: k = P

j Binj(k) 2j

5: Find index of first one ik =min{j : Binj(k)=1} 6: Compute p-sum αik =P

j<ik αj+γk 7: Output bαk =αik +N(0,σ2 0I) 8: end for 9: Procedure: Analyzer A at server 10: //Input : data from S :(bαk,1, . . . , bαk,M), k [K] 11: for k=1, . . . , K do 12: Express k in binary and find index of first one ik 13: Add noisy p-sums of all agents: eαik = PM i=1 bαk,i 14: Output: esk = P

j:Binj(k)=1 eαj 15: end for

γ1 γ2 γ3 γ4 γ5 γ6 γ7 γ8

Σ[1,2] Σ[3,4] Σ[5,6] Σ[7,8]

Figure 1: Illustration of the tree-based algorithm. Each leaf node is the stream data and each internal node is a p-sum Σ[i, j] = Pj l=i γl. The green node corresponds to the newly computed p-sum at each k, i.e., αik in Algorithm 2.

(line 7). A communication is triggered between agents and central server whenever a batch ends we assume w.l.o.g. total rounds T is divisible by batch size B (line 9). During this process, a protocol P = (R, S, A) assists in aggregating local data among all agents while guaranteeing privacy properties (to be discussed in detail soon). After communication, each agent receives latest synchronized data f Wsyn, e Usyn from the server (line 17). Here, for any t=k B, k [T/B], f Wsyn represents noisy version of all covariance matrices up to round t from all agents (i.e., PM i=1 Pt s=1 xs,ix s,i) and similarly, e Usyn represents noisy version of all bias vectors PM i=1 Pt s=1 xs,iys,i. Finally, each agent resets Wi and Ui so that they can be used to accumulate new local data for the next batch. Note that Algorithm 1 uses a fixed-batch (data-independent) communication schedule rather than the adaptive, data-dependent one in Dubey & Pentland (2020), which allows us to resolve privacy and communication issues.

4.2 PRIVACY PROTOCOL

We now turn to our privacy protocol P (Algorithm 2), which helps to aggregate data among all agents under privacy constraints. The key component of P is a distributed version of the classic tree-based algorithm, which was originally designed for continual release of private sum statistics (Chan et al., 2011; Dwork et al., 2010). That is, given a stream of (multivariate) data γ =(γ1, . . . , γK), one aims to release sk =Pk l=1 γl privately for all k [K]. The tree-based mechanism constructs a complete binary tree T in online manner. The leaf nodes contain data γ1 to γK, and internal nodes contain the sum of all leaf nodes in its sub-tree, see Fig. 1 for an illustration. For any new arrival data γk, it only releases a tree node privately, which corresponds to a noisy partial sum (p-sum) between two time indices. As an example, take k = 6, and hence the new arrival is γ6. The tree-based mechanism first computes the p-sum P[5, 6] = γ5 + γ6 (line 6 in Algorithm 2). Then, it adds a Gaussian noise with appropriate variance σ2 0 to P[5, 6] and releases the noisy p-sum (line 7). Finally, to compute the prefix sum statistic P[1, 6] privately, it simply adds noisy p-sums for P[1, 4] and P[5, 6], respectively. Reasons behind releasing and aggregating p-sums are that (i) each data point γk only affects at most 1 + log K p-sums (useful for privacy) and (ii) each sum statistic P[1, k] only involves at most 1 + log k p-sums (useful for utility).

Our privacy protocol P = (R, S, A) breaks down the above classic mechanism of releasing and aggregating p-sums into a local randomizer R at each agent and an analyzer A at the server, separately, while allowing for a possible shuffler in between to amplify privacy. For each k, the local randomizer R at each agent computes and releases the noisy p-sum to a third-party S (lines 4-7). S can either be a shuffler that permutes the data uniformly at random (for SDP) or can simply be an identity mapping (for silo-level LDP). It receives a total of M noisy p-sums, one from each agent, and sends them to the central server. The analyzer A at the server first adds these M new noisy p-sums to synchronize them (line 13). It then privately releases the synchronized prefix sum by adding up all relevant synchronized p-sums as discussed in above paragraph (line 14). Finally, we employ P to Algorithm 1 by observing that local data γk,i for batch k and agent i consists of bias vectors γbias k,i =Pk B t=(k 1)B+1 xt,iyt,i and covariance matrices γcov k,i =Pk B t=(k 1)B+1 xt,ix t,i, which are stored

Published as a conference paper at ICLR 2024

in Ui and Wi respectively. We denote the randomizer and analyzer for bias vectors as Rbias and Abias, and for covariance matrices as Rcov and Acov in Algorithm 1.

5 THEORETICAL RESULTS

5.1 FEDERATED LCBS UNDER SILO-LEVEL LDP

We first present the performance of Algorithm 1 under silo-level LDP, hence fixing the privacy, regret and communication issues of the state-of-the-art algorithm in Dubey & Pentland (2020). The key idea is to inject Gaussian noise with proper variance (σ2 0 in Algorithm 2) when releasing a p-sum such that all the released p-sums up to any batch k [K] is (ε, δ)-DP for any agent i [M]. Then, by Definition 2.1, it achieves silo-level LDP. Note that in this case, there is no shuffler, which is equivalent to the fact that the third party S in P is simply an identity mapping, denoted by I. The following result states this formally, with proof deferred to Appendix E. Theorem 5.1 (Performance under silo-level LDP). Fix batch size B, privacy budgets ε>0, δ (0, 1). Let P =(R, I, A) be a protocol given by Algorithm 2 with parameters σ2 0 =8κ (log(2/δ)+ε)

ε2 , where κ = 1+log(T/B). Then, under Assumption 4.1, Algorithm 1 instantiated with P satisfies (ε, δ)- silo-level LDP. Moreover, for any α (0, 1], there exist choices of λ and {βt,i}t,i such that, with probability at least 1 α, it enjoys a group regret

RM(T)=O d MB log T +d

MT log(MT/α) + e O

T (Md)3/4 log1/4(1/δ) ε log1/4 T/Bα .

The first term in the above regret bound doesn t depend on privacy budgets ε, δ, and serves as a representative regret bound for federated LCBs without privacy constraint. The second term is the dominant one which depends on ε, δ and denotes the cost of privacy due to injected noise.

Corollary 5.2. Setting B = p

T/M, Algorithm 1 achieves e O d

T (Md)3/4 log1/4(1/δ) ε

group regret, with total

MT synchronizations under (ε, δ)-silo-level LDP.

Comparison with Dubey & Pentland (2020). First, we avoid the privacy leakage by adopting data-independent synchronization. However, this leads to an O(

T) communication cost. It remains open to design a (correct) data-adaptive schedule with logarithmic cost; details in Appendix D. We also show that privacy cost scales as O(M 3/4) with number of agents M, correcting the reported

M scaling. Next, we compare our result with that of a super single agent running for MT rounds under the central model of DP (i.e., where the central server is trusted), which serves as a benchmark for our results. As shown in Shariff & Sheffet (2018), the total regret for such a single agent is e O d

MT d3/4 log1/4(1/δ) ε . Comparing this with Corollary 5.2, we observe that the privacy

cost of federated LCBs under silo-level LDP is a multiplicative M 1/4 factor higher than a super agent under the central model. This motivates us to consider SDP in next section.

5.2 FEDERATED LCBS UNDER SDP

We now aim to close the above M 1/4 gap in the privacy cost under silo-level LDP compared to that achieved by a super single agent (with a truseted central server). To do so, we consider federated LCBs under SDP, which still enjoys the nice feature of silo-level LDP that the central server is not trusted. Thanks to our flexible protocol P, the only change needed compared to silo-level LDP is the introduction of a shuffler S to amplify privacy and adjustment of the privacy noise σ2 0 accordingly. Theorem 5.3 (Performance under SDP via amplification). Fix batch size B and let κ=1+log(T/B). Let P = (R, S, A) be a protocol given by Algorithm 2. Then, under Assumption 4.1, there exist constants C1, C2 > 0 such that for any ε κ C1T

M , δ κ C2T , Algorithm 1 instantiated with P and

σ2 0 = O 2κ log(1/δ) log(κ/(δT )) log(Mκ/δ)

ε2M , satisfies (ε, δ)-SDP. Moreover, for any α (0, 1], there exist choices of λ and {βt,i}t,i such that, with a probability at least 1 α, it enjoys a group regret

RM(T)=O d MB log T +d

MT log(MT/α) + e O d3/4

MT log3/4(Mκ/δ) ε log1/4 T/Bα .

Corollary 5.4. Setting B = p

T/M, Algorithm 1 achieves e O d

MT log3/4(Mκ/δ) ε

group regret, with total

MT synchronizations under (ε, δ)-SDP.

Published as a conference paper at ICLR 2024

Corollary 5.4 asserts that privacy cost of federated LCBs under SDP matches that of a super single agent under central DP (up to a log factor in T, M, δ).

Comparison with existing SDP analysis. Note that the above result doesn t directly follow from prior amplification results (Feldman et al., 2022; Erlingsson et al., 2019; Cheu et al., 2019; Balle et al., 2019), which show that shuffling outputs of M (ε, δ)-LDP algorithms achieve roughly 1/

M factor amplification in privacy for small ε the key to close the aforementioned gap in privacy cost. However, these amplification results apply only when each mechanism is LDP in the standard sense, i.e., they operate on a dataset of size n = 1. This doesn t hold in our case since the dataset at each silo is a stream of T points. Lowy & Razaviyayn (2021) adopt group privacy to handle the case of n > 1, which can amplify any general DP mechanism but comes at the expense of a large increase in δ. To avoid this, we prove a new amplification lemma specific to Gaussian DP mechanisms operating on datasets with size n>1. This helps us achieve the required 1/

M amplification in ε while keeping the increase in δ in check. The key idea behind our new lemma is to directly analyze the sensitivity when creating clones as in Feldman et al. (2022), but now by accounting for the fact that all n>1 points can be different (see Appendix F for formal statement and proof of the lemma).

5.2.1 SDP GUARANTEE FOR A WIDE RANGE OF PRIVACY PARAMETERS

One limitation of attaining SDP via amplification is that the privacy guarantee holds only for small values of ε, δ (see Theorem 5.3). In this subection, we propose an alternative privacy protocol to resolute this limitation. This new protocol leverages the same binary tree structure as in Algorithm 2 for releasing and aggregating p-sums, but it employs different local randomizers and analyzers for computing (noisy) synchronized p-sums of bias vectors and covariance matrices (eαik in Algorithm 2). Specifically, it applies the vector sum mechanism PVec of Cheu et al. (2021), which essentially take n vectors as inputs and outputs their noisy sum. Here privacy is ensured by injecting suitable binomial noise to a fixed-point encoding of each vector entry, which depends on ε, δ and n.

In our case, one cannot directly aggregate M p-sums using PVec with n = M. This is because each p-sum would then have a large norm (O(T) at the worst case), which would introduce a large amount of privacy noise (cf. Theorem 3.2 in Cheu et al. (2021)), resulting in worse utility (regret). Instead, we first expand each p-sum resulting in a set of points such that each with O(1) norm. Then, we aggregate all of those data points using PVec mechanism (one each for bias vectors and covariance matrices). For example, consider summing bias vectors during batch k = 6 and refer back to Fig. 1 for illustration. Here, the p-sum for each agent is given by P[5, 6] = γ5 + γ6 (see line 6 in Algorithm 2), the expansion of which results in 2B bias vectors (B each for batch 5 and 6). A noisy sum of n = 2BM bias vectors is then computed using PVec. We denote the entire mechanism as PT Vec see Algorithm 5 in Appendix F.2 for pseudo-code and a complete description.

Now, the key intuition behind using PVec as a building block is that it allows us to compute private vector sums under the shuffle model using nearly the same amount of noise as in the central model. In other words, it simulates the privacy noise introduced in vector summation under the central model using a shuffler. This, in turn, helps us match the regret of the centralized setting while guaranteeing (strictly stronger) SDP. Specifically, we have the same order of regret as in Theorem 5.3, but now it holds for a wide range of privacy budgets ε, δ as presented below. Proof is deferred to Appendix F. Theorem 5.5 (Performance under SDP via vector sum). Fix batch size B and let κ=1+log(T/B). Let PT Vec be a privacy protocol given by Algorithm 5. Then, under Assumption 4.1, there exist parameter choices of PT Vec such that for any ε 60 p

2κ log(2/δ) and δ 1, Algorithm 1 instantiated with PT Vec satisfies (ε, δ)-SDP. Moreover, for any α (0, 1], there exist choices of λ and {βt,i}t,i such that, with a probability at least 1 α, it enjoys a group regret

RM(T)=O d MB log T +d

MT log(MT/α) + e O d3/4

MT log3/4(κd2/δ) ε log1/4 T/Bα .

Remark 5.6 (Importance of communicating P-sums). A key technique behind closing the regret gap under SDP is to communicate and shuffle only the p-sums rather than prefix sums. With this we can ensure that each data point (bias vector/covariance matrix) participates only in O(log K) shuffle mechanisms (rather than in O(K) mechanisms if we communicate and shuffle prefix-sums). This helps us keep the final privacy cost in check after adaptive composition. In other words, one cannot simply use shuffling to amplify privacy of Algorithm 1 in Dubey & Pentland (2020) to close the regret gap (even ignoring its privacy and communication issues), since it communicates prefix sums at each

Published as a conference paper at ICLR 2024

0 2000 4000 6000 8000 10000 Round

Time-average Regret

SDP-Fed Lin UCB(ε = 0.0001)

SDP-Fed Lin UCB(ε = 0.001) LDP-Fed Lin UCB(ε = 0.0001)

LDP-Fed Lin UCB(ε = 0.001)

(a) Synthetic data (M = 100)

0 2000 4000 6000 8000 10000 Round

Time-average Regret

LDP-Fed Lin UCB(ε = 0.2)

SDP-Fed Lin UCB(ε = 0.2)

(b) Synthetic data (M = 100)

0 5000 10000 15000 20000 25000

Time-average Regret

Fed Lin UCB LDP-Fed Lin UCB(ε = 0.2) LDP-Fed Lin UCB(ε = 1)

LDP-Fed Lin UCB(ε = 5)

(c) Real data (M = 10) Figure 2: Comparison of time-average group regret for LDP-Fed Lin UCB (silo-level LDP), SDP-Fed Lin UCB (shuffle model) and Fed Lin UCB (non-private) under varying privacy budgets ε, δ on (a, b) synthetic Gaussian bandit instance and (c) bandit instance generated from MSLR-WEB10K Learning to Rank dataset.

synchronization. This again highlights the algorithmic novelty of our privacy protocols (Algorithms 2 and 5), which could be of independent interest. See Appendix F for further details.

6 SIMULATION RESULTS AND CONCLUSIONS

We evaluate regret performance of Algorithm 1 under silo-level LDP and SDP, which we abbreviate as LDP-Fed Lin UCB and SDP-Fed Lin UCB, respectively. We fix confidence level α=0.01, batchsize B = 25 and study comparative performances under varying privacy budgets ε, δ. We plot timeaveraged group regret Reg M(T)/T in Figure 2 by averaging results over 25 parallel runs.

Synthetic bandit instance. We simulate a LCB instance with a parameter θ of dimension d = 10 and |Ki| = 100 actions for each of the M agents. Similar to Vaswani et al. (2020), we generate θ

and feature vectors by sampling a (d 1)-dimensional vectors of norm 1/

2 uniformly at random, and append it with a 1/

2 entry. Rewards are corrupted with Gaussian N(0, 0.25) noise.

Real-data bandit instance. We generate bandit instances from Microsoft Learning to Rank dataset (Qin & Liu, 2013). Queries form contexts c and actions a are the available documents. The dataset contains 10K queries, each with up to 908 judged documents on scale of rel(c, a) {0, 1, 2}. Each pair (c, a) has a feature vector ϕ(c, a), which is partitioned into title and body features of dimensions 57 and 78, respectively. We first train a lasso regression model on title features to predict relevances from ϕ, and take this model as the parameter θ with d = 57. Next, we divide the queries equally into M =10 agents and assign corresponding feature vectors to the agents. This way, we obtain a federated LCB instance with 10 agents, each with number of actions |Ki| 908.

Observations. In Fig. 2(a), we compare performance of LDP-Fed Lin UCB and SDP-Fed Lin UCB (with amplification based privacy protocol P) on synthetic bandit instance with M =100 agents under privacy budget δ =0.0001 and ε=0.001 or 0.0001. We observe that regret of SDP-Fed Lin UCB is less than LDP-Fed Lin UCB for both values of ε, which is consistent with theoretical results. Here, we only work with small privacy budgets since the privacy guarantee of Theorem 5.3 holds for ε, δ 1. Instead, in Fig. 2(b), we consider higher privacy budgets as suggested in Theorem 5.5 (e.g. ε=0.2, δ=0.1) and compare the regret performance of LDP-Fed Lin UCB and SDP-Fed Lin UCB (with vecorsum based privacy protocol PT vec). As expected, we observe that regret of SDP-Fed Lin UCB decreases faster than that of LDP-Fed Lin UCB. Next, we benchmark the performance of Algorithm 1 under silo-level LDP (i.e. LDP-Fed Lin UCB) against a non-private Federated LCB algorithm with fixed communication schedule, building on Abbasi-Yadkori et al. (2011) and refer as Fed Lin UCB. In Fig. 2(c), we demonstrate the cost of privacy under silo-level LDP on real-data bandit instance by varying ε in the set {0.2, 1, 5} while keeping δ fixed to 0.1. We observe that regret of LDP-Fed Lin UCB decreases and comes closer to that of Fed Lin UCB as ε increases (i.e., level of privacy protection decreases). A similar regret behavior is noticed under SDP also (postponed to Appendix H).

Concluding Remarks. We conclude with some discussions. First, our adversary model behind silo-level LDP excludes malicious users within the same silo. If one is also interested in protecting against adversary users within the same silo, a simple tweak of Algorithm 1 would suffice (see Appendix G.3). With this, one can not only protect against colluding silos (as in silo-level LDP), but also against colluding users within the same silo (as in central JDP). Next, we assume that all MT users are unique in our algorithms. In practice, a user can participate in multiple rounds within the same silo or across different silos; see Appendix G.4. Finally, for future work, a challenging task is to achieve O(log T) communication cost with correct privacy and regret guarantees; see Appendix D for further discussions.

Published as a conference paper at ICLR 2024

ACKNOWLEDGEMENTS

XZ is supported in part by NSF CNS-2153220 and CNS-2312835. XZ would like to thank Abhimanyu Dubey for discussions on the work of Dubey & Pentland (2020). XZ would also like to thank Andrew Lowy and Ziyu Liu for insightful discussions on the privacy notion for cross-silo federated learning. XZ would thank Vitaly Feldman and Audra Mc Millan for the discussion on some subtleties behind hiding among the clones in Feldman et al. (2022).

Yasin Abbasi-Yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011.

Achraf Azize and Debabrota Basu. When privacy meets partial information: A refined analysis of differentially private bandits. ar Xiv preprint ar Xiv:2209.02570, 2022.

Borja Balle, James Bell, Adri a Gasc on, and Kobbi Nissim. The privacy blanket of the shuffle model. In Annual International Cryptology Conference, pp. 638 667. Springer, 2019.

Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635 658. Springer, 2016.

T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1 24, 2011.

Albert Cheu, Adam Smith, Jonathan Ullman, David Zeber, and Maxim Zhilyaev. Distributed differential privacy via shuffling. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 375 403. Springer, 2019.

Albert Cheu, Matthew Joseph, Jieming Mao, and Binghui Peng. Shuffle private stochastic convex optimization. ar Xiv preprint ar Xiv:2106.09805, 2021.

Sayak Ray Chowdhury and Xingyu Zhou. Distributed differential privacy in multi-armed bandits. ar Xiv preprint ar Xiv:2206.05772, 2022a.

Sayak Ray Chowdhury and Xingyu Zhou. Shuffle private linear contextual bandits. In Proceedings of the 39th International Conference on Machine Learning, pp. 3984 4009. PMLR, 2022b.

Roel Dobbe, Ye Pu, Jingge Zhu, Kannan Ramchandran, and Claire Tomlin. Customized local differential privacy for multi-agent distributed optimization. ar Xiv preprint ar Xiv:1806.06035, 2018.

Abhimanyu Dubey. No-regret algorithms for private gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2062 2070. PMLR, 2021.

Abhimanyu Dubey and Alex Sandy Pentland. Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33:6003 6014, 2020.

John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pp. 429 438. IEEE, 2013.

Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211 407, 2014.

Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 715 724, 2010.

Ulfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2468 2479. SIAM, 2019.

Published as a conference paper at ICLR 2024

Vitaly Feldman, Audra Mc Millan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 954 964. IEEE, 2022.

Evrard Garcelon, Kamalika Chaudhuri, Vianney Perchet, and Matteo Pirotta. Privacy amplification via shuffling for linear contextual bandits. In International Conference on Algorithmic Learning Theory, pp. 381 407. PMLR, 2022.

Antonious Girgis, Deepesh Data, Suhas Diggavi, Peter Kairouz, and Ananda Theertha Suresh. Shuffled model of differential privacy in federated learning. In International Conference on Artificial Intelligence and Statistics, pp. 2521 2529. PMLR, 2021.

Osama A Hanna, Antonious M Girgis, Christina Fragouli, and Suhas Diggavi. Differentially private stochastic linear bandits:(almost) for free. ar Xiv preprint ar Xiv:2207.03445, 2022.

Jiafan He, Tianhao Wang, Yifei Min, and Quanquan Gu. A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. ar Xiv preprint ar Xiv:2207.03106, 2022a.

Jiahao He, Jiheng Zhang, and Rachel Zhang. A reduction from linear contextual bandit lower bounds to estimation lower bounds. In International Conference on Machine Learning, pp. 8660 8677. PMLR, 2022b.

Ruiquan Huang, Weiqiang Wu, Jing Yang, and Cong Shen. Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34:27057 27068, 2021.

Ruiquan Huang, Huanyu Zhang, Luca Melis, Milan Shen, Meisam Hejazinia, and Jing Yang. Federated linear contextual bandits with user-level differential privacy. In International Conference on Machine Learning, pp. 14060 14095. PMLR, 2023.

Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aur elien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1 2):1 210, 2021.

Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th conference on Innovations in theoretical computer science, pp. 403 410, 2014.

Fengjiao Li, Xingyu Zhou, and Bo Ji. Differentially private linear bandits with partial distributed feedback. ar Xiv preprint ar Xiv:2207.05827, 2022.

Fengjiao Li, Xingyu Zhou, and Bo Ji. (private) kernelized bandits with distributed biased feedback. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 7(1):1 47, 2023.

Ziyu Liu, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith. On privacy and personalization in cross-silo federated learning. ar Xiv preprint ar Xiv:2206.07902, 2022.

Andrew Lowy and Meisam Razaviyayn. Private federated learning without a trusted server: Optimal algorithms for convex losses. ar Xiv preprint ar Xiv:2106.09779, 2021.

Andrew Lowy, Ali Ghafelebashi, and Meisam Razaviyayn. Private non-convex federated learning without a trusted server. ar Xiv preprint ar Xiv:2203.06735, 2022.

Nikita Mishra and Abhradeep Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 592 601, 2015.

Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. Co RR, abs/1306.2597, 2013. URL

http://arxiv.org/abs/1306.2597.

Wenbo Ren, Xingyu Zhou, Jia Liu, and Ness B Shroff. Multi-armed bandits with local differential privacy. ar Xiv preprint ar Xiv:2007.03121, 2020.

Touqir Sajed and Or Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. In International Conference on Machine Learning, pp. 5579 5588. PMLR, 2019.

Published as a conference paper at ICLR 2024

Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31, 2018.

Thomas Steinke. Composition of differential privacy & privacy amplification by subsampling. ar Xiv preprint ar Xiv:2210.00597, 2022.

Jay Tenenbaum, Haim Kaplan, Yishay Mansour, and Uri Stemmer. Differentially private multi-armed bandits in the shuffle model. Advances in Neural Information Processing Systems, 34, 2021.

Jay Tenenbaum, Haim Kaplan, Yishay Mansour, and Uri Stemmer. Concurrent shuffle differential privacy under continual observation. ar Xiv preprint ar Xiv:2301.12535, 2023.

Sharan Vaswani, Abbas Mehrabian, Audrey Durand, and Branislav Kveton. Old dog learns new tricks: Randomized ucb for bandit problems. In International Conference on Artificial Intelligence and Statistics, pp. 1988 1998. PMLR, 2020.

Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.

Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: How much communication is needed to achieve (near) optimal regret. ICLR, 2020.

Kai Zheng, Tianle Cai, Weiran Huang, Zhenguo Li, and Liwei Wang. Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33:12300 12310, 2020.

Xingyu Zhou and Jian Tan. Local differential privacy for bayesian optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12):11152 11159, 2021.

Published as a conference paper at ICLR 2024

A MORE DETAILS ON RELATED WORK

Private bandit learning has also been studied beyond linear settings, such as kernel bandits (Zhou & Tan, 2021; Dubey, 2021; Li et al., 2023). It is worth noting that Chowdhury & Zhou (2022a) also presents optimal private regret bounds under all three DP models (i.e., central, local, and distributed) in bandits while only relying on discrete privacy noise, hence avoiding the privacy leakage of continuous privacy noise on finite computers due to floating point arithmetic.

Recently, Huang et al. (2023) took the pioneering step to study user-level privacy for federated LCBs, establishing both regret upper bounds and lower bounds. In contrast to our item-level DP (e.g., silo-level LDP), user-level DP in Huang et al. (2023) roughly requires that even replacing the whole local history at any agent, the central server s broadcast message should be close across the whole learning period. This notion is more likely to be preferred in cross-device FL settings where the protection target is the device (agent). In addition to this, there are several other key differences compared to our work. First, they deal with linear bandits with stochastic contexts under additional distribution coverage assumptions (rather than the arbitrary adversary contexts in our case). In fact, it has been shown in Huang et al. (2021) that some assumption on the context distribution is necessary for a sublinear regret under user-level DP. Second, due to this stochastic context and some coverage conditions on contexts, an exponentially growing batch schedule can be applied in their case. In contrast, under the adversary context case, it is unclear to us how to apply the same technique to derive a sublinear regret.

B MORE DISCUSSIONS ON GAPS IN SOTA

In this section, we provide more details on the current gaps in Dubey & Pentland (2020), especially on privacy violation and communication cost. It turns out that both gaps come from the fact that an adaptive communication schedule is employed in Dubey & Pentland (2020).

B.1 MORE ON VIOLATION OF SILO-LEVEL LDP

As shown in the main paper, Algorithm 1 in Dubey & Pentland (2020) does not satisfy silo-level LDP. To give a more concrete illustration of privacy leakage, we now specify the form of f 3, local data Xi and synchronized data Z in (1) according to Dubey & Pentland (2020). In particular, a communication is triggered at round t if for any silo i, it holds that

det Z+Pt s=t +1 xs,ix s,i+λmin I

det (Z+λmin I)

where t is the latest synchronization time before t, Z is all synchronized (private) covariance matrices up to time t , λmin > 0 is some regularization constant (which depends on privacy budgets ε, δ) and D > 0 is some suitable threshold (which depends on number of silos M).

With the above explicit form in hand, we can give a more concrete discussion of Example 3.1. A communication is triggered at round t = 1 if det x1,mx 1,m+λmin I > det (λmin I) e D holds for any silo m. This implies that (λmin + x1,m 2)λd 1 min > e Dλd min, which, in turn, yields x1,m 2 > λmin(e D 1) =: C. Now, if x1,j 2 C, then silo j immediately knows that x1,i 2 > C, where C is a known constant. Since x1,i contains the context information of the user (Alice), this norm condition could immediately reveal that some specific features in the context vector are active (e.g., Alice has both diabetes and heart disease), thus leaking Alice s private and sensitive information to silo j. Remark B.1. The above result has two implications: (i) the current proof strategy for Fed-DP guarantee in Dubey & Pentland (2020) does not hold since it essentially relies on the post-processing of DP through silo-level LDP; (ii) Fed-DP could fail to handle reasonable adversary model in crosssilo federated LCBs. That is, even if Algorithm 1 in Dubey & Pentland (2020) satisfies Fed-DP, it still cannot protect Alice s information from being inferred by a malicious silo (which is a typical

3There is some minor issue in the form of f in Dubey & Pentland (2020). The correct one is given by our restatement of their Algorithm 1, see line 9 in Algorithm 3.

Published as a conference paper at ICLR 2024

adversary model in cross-silo FL). Thus, we believe that silo-level LDP is a more proper privacy notion for cross-silo federated LCBs.

B.2 MORE ON VIOLATION OF FED-DP

As shown in the main paper, Algorithm 1 in Dubey & Pentland (2020) also does not satisfy its weaker notion of Fed-DP. To give a more concrete illustration, recall Example 3.1 and let us define mi,j as the message/data sent from silo i to silo j after round t = 1. Suppose in the case of Alice, there is no synchronization and hence mi,j = 0. On the other hand, in the case of Tracy (i.e., the first user at silo i changes from Alice to Tracy), suppose synchronization is triggered by silo i via rule (1) due to Tracy s data. Then, according to Dubey & Pentland (2020), mi,j = x1,iy1,i + N (only consider bias vector here), where N is the injected noise when silo i sends out its data. Now, based on the requirement of Fed-DP, the recommended action at silo j in round t = 2 needs to be similar or indistinguishable in probability under the change from Alice to Tracy. Note that silo j chooses its action at round t = 2 based on its local data (which is unchanged) and mi,j, via deterministic selection rule (i.e., Lin UCB) in Algorithm 1 of Dubey & Pentland (2020). Thus, Fed-DP essentially requires mi,j to be close in probability when Alice changes to Tracy, which is definitely not the case (i.e., 0 vs. x1,iy1,i + N). Thus, Algorithm 1 in Dubey & Pentland (2020) also fails Fed-DP. Remark B.2. One can also think from the following perspective: the non-private data-dependent sync rule (i.e., (2)) in Dubey & Pentland (2020) impacts the communicated messages/data as well, which cannot be made private by injecting noise when sending out data. To rescue, a possible approach is to use private (noisy) data in rule (2) when determining synchronization (while still injecting noise when sending out data). As a result, whether there exists a synchronization would be indistinguishable under Alice or Tracy and hence mi,j now would be similar. However, this approach still suffers the gap in communication cost analysis (see below) and moreover it will incur new challenges in regret analysis, see Appendix D for a detailed discussion on this approach.

B.3 MORE ON COMMUNICATION COST ANALYSIS

The current analysis in Dubey & Pentland (2020) (cf. Proposition 5) for communication cost (i.e., how many rounds of communication within T) essentially follows the approach in the non-private work (Wang et al., 2020) (cf. proof of Theorem 4). However, due to additional privacy noise injected into the communicated data, one key step of the approach in Wang et al. (2020) fails in the private case. In the following, we first point out the issue using notations in Dubey & Pentland (2020).

The key issue in its current proof of Proposition 5 in Dubey & Pentland (2020) is that

log det(Si,t+n )

det(Si,t) > D

which appears right above Eq. 4 in Dubey & Pentland (2020) does not hold. More specifically, [t, t + n ] is the i-th interval between two communication steps and Si,t, Si,t+n are corresponding synchronized private matrices. At the time t+n , we know (2) is satisfied by some silo (say j [M]), since there is a new synchronization. In the non-private case, Si,t+n simply includes some additional local covariance matrices from silos other than j, which are positive semi-definite (PSD). As a result, (3) holds. However, in the private case, Si,t+n includes the private messages from silos other than j, which may not be positive semi-definite (PSD), since there are some new covariance matrices as well as new Gaussian privacy noise (which could be negative definite). Thus, (3) may not hold anymore.

C A GENERIC REGRET ANALYSIS FOR ALGORITHM 1

In this section, we first establish a generic regret bound for Algorithm 1 under sub-Gaussian noise condition, i.e., Lemma C.4. To this end, let us first give the following notations. Fix B, T N, we let K = T/B be the total number of communication steps. For all i [M] and all t = k B, k [K], we let Nt,i = f Wt,i Pt s=1 xs,ix s,i and nt,i = e Ut,i Pt s=1 xs,iys,i be the cumulative injected noise up to the k-th communication by agent i. We further let Ht := λId + P

i [M] Nt,i and ht := P

i [M] nt,i.

Published as a conference paper at ICLR 2024

Assumption C.1 (Regularity). Fix any α (0, 1], with probability at least 1 α, we have Ht is positive definite and there exist constants λmax, λmin and ν depending on α such that for all t = k B, k [K]

Ht λmax, H 1 t 1/λmin, ht H 1 t ν.

With the above regularity assumption and the boundedness in Assumption 4.1, we fist establish the following general regret bound of Algorithm 1, which can be viewed as a direct generalization of the results in Shariff & Sheffet (2018); Chowdhury & Zhou (2022b) to the federated case.

Lemma C.2. Let Assumptions C.1 and 4.1 hold. Fix any α (0, 1], there exist choices of λ and {βt,i}t [T ],i [M] such that, with probability at least 1 α, the group regret of Algorithm 1 satisfies

Reg M(T) = O

d MT log 1 + MT

+ O M B d log 1 + MT

where βT := r

α + d log 1 + MT dλmin

+ λmax + ν.

Lemma C.4 is a corollary of the above result, which holds by bounding λmax, λmin, ν under sub Gaussian privacy noise.

Assumption C.3 (sub-Gaussian private noise). There exist constants eσ1 and eσ2 such that for all t = k B, k [K]: (i) PM i=1 nt,i is a random vector whose entries are independent, mean zero, sub-Gaussian with variance at most eσ2 1, and (ii) PM i=1 Nt,i is a random symmetric matrix whose entries on and above the diagonal are independent sub-Gaussian random variables with variance at most eσ2 2. Let σ2 =max{eσ2 1, eσ2 2}.

Now, we are ready to state Lemma C.4 as follows.

Lemma C.4 (A generic regret bound of Algorithm 1). Let Assumptions C.3 and 4.1 hold. Fix time horizon T N, batch size B [T], confidence level α (0, 1]. Set λ = Θ(max{1, σ(

log(T/(Bα))}) and βt,i = q

α + d log 1 + Mt

λ for all i [M]. Then, Algorithm 1 achieves group regret

Reg M(T) = O d MB log T + d

MT log(MT/α) + O p

σMT log(MT)d3/4 log1/4(T/(Bα))

with probability at least 1 α.

Proof of Lemma C.2. We divide the proof into the following six steps. Let E be the event given in Assumption C.1, which holds with probability at least 1 α under Assumption C.1. In the following, we condition on the event E.

Step 1: Concentration. In this step, we will show that with high probability, θ bθt,i Vt,i βt,i

for all i [M]. Fix an agent i [M] and t [T], let tlast be the latest communication round of all agents before t. By the update rule, we have bθt,i = V 1 t,i (e Usyn + Ui)

s=1 xs,jys,j +

j=1 ntlast,j +

s=tlast+1 xs,iys,i

s=1 xs,jx s,j +

j=1 Ntlast,j +

s=tlast+1 xs,ix s,i

s=1 xs,jys,j +

j=1 ntlast,j +

s=tlast+1 xs,iys,i

Published as a conference paper at ICLR 2024

By the linear reward function ys,j = xs,j, θ + ηs,j for all j [M] and elementary algebra, we have

θ bθt,i = V 1 t,i

s=1 xs,jηs,j

s=tlast+1 xs,iηs,i htlast

where we recall that Htlast = λI + PM j=1 Ntlast,j and htlast = PM j=1 ntlast,j.

Thus, multiplying both sides by V 1/2 t,i , yields

θ bθt,i Vt,i

s=1 xs,jηs,j +

s=tlast+1 xs,iηs,i

+ Htlastθ V 1 t,i + htlast V 1 t,i

s=1 xs,jηs,j +

s=tlast+1 xs,iηs,i

(Gt,i+λmin I) 1 + θ Htlast + htlast H 1 tlast

s=1 xs,jηs,j +

s=tlast+1 xs,iηs,i

(Gt,i+λmin I) 1 + p

where (a) holds by Vt,i Htlast and Vt,i Gt,i + λmin I with Gt,i := PM j=1 Ptlast s=1 xs,jx s,j + Pt 1 s=tlast+1 xs,ix s,i (i.e., non-private Gram matrix) under event E; (b) holds by the boundedness of θ and event E.

For the remaining first term, we can use self-normalized inequality (cf. Theorem 1 in Abbasi-Yadkori et al. (2011)) with a proper filtration4. In particular, we have for any α (0, 1], with probability at least 1 α, for all t [T]

s=1 xs,jηs,j +

s=tlast+1 xs,iηs,i

(Gt,i+λmin I) 1

+ log det(Gt,i + λmin I)

det(λmin I)

Now, using the trace-determinant lemma (cf. Lemma 10 in Abbasi-Yadkori et al. (2011)) and the boundedness condition on xs,j for all s [T] and j [M], we have

det(Gt,i + λmin I) λmin + Mt

Putting everything together, we have with probability at least 1 2α, for all i [M] and all t [T], θ bθm Vt,i βt,i = βt, where

+ d log 1 + Mt dλmin

λmax + ν. (4)

Step 2: Per-step regret. With the above concentration result, based on our UCB policy for choosing the action, we have the classic bound on the per-step regret rt,i, that is, with probability at least 1 2α rt,i = θ , x t,i θ , xt,i

(a) = θ , x t,i UCBt,i(x t,i) + UCBt,i(x t,i) UCBt,i(xt,i) + UCBt,i(xt,i) θ , xt,i

(b) 0 + 0 + 2βt,i xt,i V 1 t,i 2βT xt,i V 1 t,i where in (a), we let UCBt,i(x) := bθt,i, x + βt,i x V 1 t,i ; (b) holds by the optimistic fact of UCB (from the concentration), greedy action selection, and the concentration result again.

4In particular, by the i.i.d noise assumption across time and agents, one can simply construct the filtration sequentially across agents and rounds, which enlarges the single-agent filtration by a factor of M.

Published as a conference paper at ICLR 2024

Step 3: Regret decomposition by good and bad epochs. In Algorithm 1, at the end of each synchronization time t = k B for k [K], all the agents will communicate with the server by uploading private statistics and downloading the aggregated ones from the server. We then divide time horizon T into epochs by the communication (sync) rounds. In particular, the k-th epoch contains rounds between (tk 1, tk], where tk = k B is the k-th sync round. We define Vk := λmin I + PM i=1 Ptk t=1 xt,ix t,i, i.e., all the data at the end of the k-th communication plus a regularizer. Then, we say that the k-th epoch is a good epoch if det(Vk) det(Vk 1) 2; otherwise it is a bad epoch. Thus, we can divide the group regret into two terms:

Reg M(T) = X

t good epochs rt,i + X

t bad epochs rt,i.

Step 4: Bound the regret in good epochs. To this end, we introduce an imaginary single agent that pulls all the MT actions in the following order: x1,1,, x1,2, . . . , x1,M, x2,1, . . . , x2,M, . . . , x T,1, . . . , x T,M. We define a corresponding imaginary design matrix Vt,i = λmin I + P

p<t,q [M] xp,qx p,q + P

p=t,q<i xp,qx p,q, i.e., the design matrix right before xt,i. The key reason behind this construction is that one can now use the standard result (i.e., the elliptical potential lemma (cf. Lemma 11 in Abbasi-Yadkori et al. (2011))) to bound the summation of bonus terms, i.e., P

t,i xt,i V 1 t,i .

Suppose that t [T] is within the k-th epoch. One key property we will use is that for all i, Vk Vt,i and Gt,i + λmin I Vk 1, which simply holds by their definitions. This property enables us to see that for any t good epochs, det( Vt,i)/ det(Gt,i + λmin I) 2. This is important since by the standard determinant trick , we have xt,i (Gt,i+λmin I) 1

2 xt,i V 1 t,i . (5)

In particular, this follows from Lemma 12 in Abbasi-Yadkori et al. (2011), that is, for two positive definite matrices A, B Rd d satisfying A B, then for any x Rd, x A x B p

det(A)/ det(B). Note that here we also use det(A) = 1/ det(A 1). Hence, we can bound the regret in good epochs as follows. X

t good epochs rt,i

t good epochs min{2βT xt,i V 1 t,i , 1}

t good epochs min{2βT xt,i (Gt,i+λmin I) 1 , 1}

t good epochs min{2

2βT xt,i V 1 t,i , 1}

t good epochs 2

2βT min{ xt,i V 1 t,i , 1}

2βT min{ xt,i V 1 t,i , 1}

d MT log 1 + MT

where (a) holds by the per-step regret bound in Step 2 and the boundedness of reward; (b) follows from the fact that Vt,i Gt,i + λmin I under event E; (c) holds by (5) when t is in good epochs; (d) is true since βT 1; (e) holds by the elliptical potential lemma (cf. Lemma 11 in Abbasi-Yadkori et al. (2011)).

Step 5: Bound the regret in bad epochs. Let Tbad be the total number of rounds in all bad epochs. Thus, the total number of bad rounds across all agents are M Tbad. As a result, the cumulative group regret in all these bad rounds are upper bounded by M Tbad due to the to the boundedness of reward.

Published as a conference paper at ICLR 2024

We are left to bound Tbad. All we need is to bound the Nbad total number of bad epochs. Then, we have Tbad = Nbad B, where B is the fixed batch size. To this end, recall that K = T/B and define Ψ := {k [K] : log det(Vk) log det(Vk 1) > log 2}, i.e., Nbad = |Ψ|. Thus, we have

log 2 |Ψ| X

k Ψ log det(Vk) log det(Vk 1) X

k [K] log det(Vk) log det(Vk 1)

d log 1 + MT

Hence, we have Nbad = |Ψ| d log 2 log 1 + MT dλmin

. Thus we can bound the regret in bad epochs as follows. X

t bad epochs rt,i M Tbad = M B Nbad M B d log 2 log 1 + MT

Step 6: Putting everything together. Now, we substitute the total regret in good epochs given by (6) and total regret in bad epochs given by (7) into the total regret decomposition in Step 3, yields the final cumulative group regret

Reg M(T) = O

d MT log 1 + MT

+ O M B d log 1 + MT

where βT := r

α + d log 1 + MT dλmin

+ λmax + ν. Finally, taking a union bound, we

have the required result.

Now, we turn to the proof of Lemma C.4, which is an application of Lemma C.2 we just proved.

Proof of Lemma C.4. To prove the result, thanks to Lemma C.2, we only need to determine the three constants λmax, λmin and ν under the sub-Gaussian private noise assumption in Assumption C.3. To this end, we resort to concentration bounds for sub-Gaussian random vector and random matrix.

To start with, under (i) in Assumption C.3, by the concentration bound for the norm of a vector containing sub-Gaussian entries (cf. Theorem 3.1.1 in Vershynin (2018)) and a union bound over all communication rounds, we have for all t = k B where k = [T/B] and any α (0, 1], with probability at least 1 α/2, for some absolute constant c1,

= ht Σn := c1 eσ1 (

log(T/(αB)).

By (ii) in Assumption C.3, the concentration bound for the norm of a sub-Gaussian symmetric random matrix (cf. Corollary 4.4.8 in Vershynin (2018)) and a union bound over all communication rounds, we have for all t = k B where k = [T/B] and any α (0, 1], with probability at least 1 α/2,

ΣN := c2 eσ2 (

log(T/(αB))

for some absolute constant c2. Thus, if we choose λ = 2ΣN, we have Ht = λId + PM i=1 Nt,i 3ΣN, i.e., λmax = 3ΣN, and λmin = ΣN. Finally, to determine ν, we note that

ht H 1 t 1 λmin ht c σ (

log(T/(αB)) 1/2 := ν,

where σ = max{eσ1, eσ2}. The final regret bound is obtained by plugging the three values into the result given by Lemma C.2.

D DISCUSSION ON PRIVATE ADAPTIVE COMMUNICATION

In the main paper and Appendix B, we have pointed out that the gap in privacy guarantee of Algorithm 1 in Dubey & Pentland (2020) is that its adaptive communication schedule leads to privacy leakage

Published as a conference paper at ICLR 2024

Algorithm 3 Restatement of Algorithm 1 in (Dubey & Pentland, 2020)

1: Parameters: Adaptive communication parameter D, regularization λ > 0, confidence radii {βt,i}t [T ],i [M], feature map ϕi : Ci Ki Rd, privacy budgets ε > 0, δ [0, 1].

2: Initialize: For all i [M], Wi = 0, Ui = 0, PRIVATIZER with ε, δ, f Wsyn = 0, e Usyn = 0, 3: for t=1, . . . , T do 4: for each agent i = 1, . . . , M do 5: Vt,i = λI + f Wsyn + Wi, bθt,i = V 1 t,i (e Usyn + Ui)

6: Play arm at,i = argmaxa Ki ϕi(ct,i, a), bθt,i + βt,i ϕi(ct,i, a) V 1 t,i and set xt,i =

ϕi(ct,i, at,i) 7: Observe reward yt,i 8: Update Wi = Wi + xt,ix t,i, Ui = Ui + xt,iyt,i 9: if log det(Vt,i + xt,ix t,i) log det(Vlast) > D t tlast then 10: Send a signal to the server to start a synchronization round. 11: end if 12: if a synchronization is started then 13: Send Wi and Ui to PRIVATIZER 14: PRIVATIZER sends private cumulative statistics f Wt,i, e Ut,i to server

15: Server aggregates f Wsyn = f Wsyn + PM j=1 f Wt,j and e Usyn = e Usyn + PM j=1 e Ut,j

16: Receive f Wsyn and e Usyn from the server

17: Reset Wi = 0, Ui = 0, tlast = t and Vlast = λI + f Wsyn 18: end if 19: end for 20: end for

due to its dependence on non-private data. As mentioned in Remark B.1, one possible approach is to use private data to determine the sync in (2). This will resolve the privacy issue. However, the same issue in communication cost still remains (due to privacy noise), and hence O(log T) communication does not hold. Moreover, this new approach will also lead to new challenges in regret analysis, when compared with its current one in Dubey & Pentland (2020) and the standard one in Wang et al. (2020).

To better illustrate the new challenges, let us restate Algorithm 1 in Dubey & Pentland (2020) using our notations and first focus on how to establish the regret based on its current adaptive schedule (which has the issue of privacy leakage). After we have a better understanding of the idea, we will see how new challenges come up when one uses private data for an adaptive schedule.

As shown in Algorithm 3, the key difference compared with our fixed-batch schedule is highlighted in color. Note that we only focus on silo-level LDP and use PRIVATIZER to represent a general protocol that can privatize the communicated data (e.g., P or the standard tree-based algorithm in Dubey & Pentland (2020)).

D.1 REGRET ANALYSIS UNDER NON-PRIVATE ADAPTIVE SCHEDULE

In this section, we demonstrate the key step in establishing the regret with the non-private adaptive communication schedule in Algorithm 3 (i.e., line 9). It turns out that the regret analysis is very similar to our proof for Lemma C.2 for the fixed batch case, in that the only key difference lies in Step 5 when bounding the regret in bad epochs5. The main idea behind adaptive communication is: whenever the accumulated local regret at any agent exceeds a threshold, then synchronization is required to keep the data homogeneous among agents. This idea is directly reflected in the following analysis.

5There is another subtle but important difference, which lies in the construction of filtration that is required to apply the standard self-normalized inequality to establish the concentration result. We believe that one cannot directly use the standard filtration (e.g., Abbasi-Yadkori et al. (2011)) in the adaptive case, and hence more care is indeed required.

Published as a conference paper at ICLR 2024

Bound the regret in bad epochs (adaptive communication case). Let s consider an arbitrary bad epoch k, i.e., (tk 1, tk], where tk is the round for the k-th communication. For all i, we want to bound the total regret between (tk 1, tk], denoted by Rk i . That is, the local regret between any two communications (in the bad epoch) will not be too large. For now, suppose we already have such a bound U (which will be achieved by adaptive communication later), i.e., Rk i U for all i, k, we can easily bound the total regret in bad epochs. To see this, recall that Ψ := {k [K] : log det(Vk) log det(Vk 1) > log 2}, i.e., Nbad = |Ψ|, we have X

t bad epochs rt,i = X

k Ψ Rk i = O (|Ψ|MU) .

Plugging in Nbad = |Ψ| d log 2 log 1 + MT

dλ , we have the total regret for bad epochs. Now, we are only left to find U. Here is the place where the adaptive schedule in the algorithm comes in. First, note that X

tk 1<t<tk rt,i

tk 1<t<tk min{2βT xt,i V 1 t,i , 1} (8)

(tk tk 1) log det Vtk,i

where (a) holds by boundedness of reward; (b) follows from the elliptical potential lemma, i.e., Vlast is PSD under event E and Vt,i = Vt 1,i + xt 1,ix t 1,i for all t (tk 1, tk); (c) holds by the adaptive

schedule in line 9 of Algorithm 3. As a result, we have Rk i O βT

D + 1, where the regret at round tk is at most 1 by the boundedness of reward. With a proper choice of D, one can obtain a final regret bound.

D.2 CHALLENGES IN REGRET ANALYSIS UNDER PRIVATE ADAPTIVE SCHEDULE

Now, we will discuss new challenges when one uses private data for an adaptive communication schedule. In this case, one needs to first privatize the new local gram matrices (e.g., Pt s=tlast+1 xs,ix s,i) before being used in the determinant condition. This can be done by using standard tree-based algorithm with each data point as xs,ix s,i. With this additional step, now the determinant condition becomes

log det(e Vt,i) log det(Vlast) > D t tlast , (10)

where e Vt,i := Vlast + Pt s=tlast+1 xs,ix s,i + N loc t,i and N loc t,i is the new local injected noise for private schedule up to time t. Now suppose one uses (10) to determine tk. Then, it does not imply that (9) is upper bounded by βT

D. That is, det(e Vt,i)

det(Vlast) D does not necessarily mean that

det(Vlast+Pt s=tlast+1 xs,ix s,i)

det(Vlast) D .

One may try to work around (8) by first using Gt,i + λmin I to lower bound Vt,i. Then, (9) becomes

(tk tk 1) log det(Gtk,i+λmin I) det(Gtk 1,i+λmin I)

, which again cannot be bouded based on the rule given

by (10). To see this, note that det(e Vtk 1,i)

det(Vlast) D only implies that det(Gtk,i+λmin I) det(Gtk 1,i+λmax I) D .

E ADDITIONAL DETAILS ON FEDERATED LCBS UNDER SILO-LEVEL LDP

In this section, we provide details for Section 5.1. In particular, we present the proof for Theorem 5.1 and the alternative privacy protocol for silo-level LDP.

Published as a conference paper at ICLR 2024

E.1 PROOF OF THEOREM 5.1

Proof of Theorem 5.1. Privacy. We only need to show that P in Algorithm 2 with a proper choice of σ0 satisfies (ε, δ)-DP for all k [K], which implies that the full transcript of the communication is private in Algorithm 1 for any local agent i.

First, we recall that the (multi-variate) Gaussian mechanism satisfies zero-concentrated differential privacy (z CDP) (Bun & Steinke, 2016). In particular, by Bun & Steinke (2016, Lemma 2.5), we have that computation of each node (p-sum) in the tree is ρ-z CDP with ρ = L2

2σ2 0 . Then, from the construction of the binary tree in P, one can easily see that one single data point γi (for all i [K]) only impacts at most 1 + log(K) nodes. Thus, by adaptive composition of z CDP (cf. Lemma 2.3 in Bun & Steinke (2016)), we have that the entire releasing of all p-sums is (1 + log K)ρ-z CDP. Finally, we will use the conversion lemma from z CDP to approximated DP (cf. Proposition 1.3 in Bun & Steinke (2016)). In particular, we have that ρ0-z CDP implies (ε = ρ0 + 2 p

ρ0 log(1/δ), δ)-DP for all δ > 0. In other words, to achieve a given (ε, δ)-DP, it suffices to achieve ρ0-z CDP with ρ0 = f(ε, δ) := ( p

log(1/δ) + ε p

log(1/δ))2. In our case, we have ρ0 = (1 + log(K))ρ = (1 + log(K)) L2

2σ2 0 . Thus, we have σ2 0 = (1 + log(K)) L2

2ρ0 = (1 + log(K)) L2 2f(ε,δ). To simply it, one

can lower bound f(ε, δ) by ε2 4 log(1/δ)+4ε (cf. Remark 15 in Steinke (2022)). Therefore, to obtain

(ε, δ)-DP, it suffices to set σ2 0 = 2 L2 (1+log(K))(log(1/δ)+ε)

ε2 . Note that there are two streams of data in Algorithm 1, and hence it suffices to ensure that each of them is (ε/2, δ/2)-DP. This gives us the final noise level σ2 0 = 8 (1+log(K))(log(2/δ)+ε)

ε2 (note that by boundedness assumption L = 1 in our case).

Regret. In order to establish the regret bound, thanks to Lemma C.4, we only need to determine the maximum noise level in the learning process. Recall that σ2 0 = 8 (1+log(K))(log(2/δ)+ε)

ε2 is the noise level for both streams (i.e., γbias and γcov). Now, by the construction of binary tree in P, one can see that each prefix sum P[1, k] only involves at most 1 + log(k) tree nodes. Thus, we have that the noise level in nt,i and Nt,i are upper bounded by (1 + log(K))σ2 0. As a result, the overall noise level across all M silos is upper bounded by σ2 total = M(1 + log(K))σ2 0. Finally, setting σ2 in Lemma C.4 to be the noise level σ2 total , yields the required result.

E.2 ALTERNATIVE PRIVACY PROTOCOL FOR SILO-LEVEL LDP

For silo-level LDP, each local randomizer can simply be the standard tree-based algorithm, i.e., releasing the prefix sum at each communication step k (rather than p-sum in Algorithm 2). The analyzer now becomes a simple aggregation. As before, no shuffler is required in this case. This alternative protocol is given by Algorithm 4, which is essentially the main protocol used in Dubey & Pentland (2020).

It can be seen that both privacy and regret guarantees under this Palt are the same as Theorem 5.1. To see this, for privacy, the prefix sum is a post-processing of the p-sums. Thus, since we have already shown that the entire releasing of p-sums is private in the proof of Theorem 5.1, hence the same as the prefix sum. Meanwhile, the total noise level at the server is the same as before. Thus, by Lemma C.4, we have the same regret bound.

F ADDITIONAL DETAILS ON FEDERATED LCBS UNDER SDP

In this section, we provide more detailed discussions on SDP and present the proof for Theorem 5.3 (SDP via amplification lemma) and Theorem 5.5 (SDP via vector sum).

First, let us start with some general discussions.

Importance of communicating P-sums. For SDP, it is important to communicate P-sums rather than prefix sum. Note that communicating noisy p-sums in our privacy protocol P rather than the noisy prefix sum (i.e., the sum from beginning as done in Dubey & Pentland (2020)) plays a key role in achieving optimal regret with shuffling. To see this, both approaches can guarantee silo-level LDP. By our new amplification lemma, privacy guarantee can be amplified by 1/

M in ε for each

Published as a conference paper at ICLR 2024

Algorithm 4 Palt, an alternative privacy protocol for silo-level LDP

1: Procedure: Local Randomizer R 2: // Input: stream data γ = (γi)i [K]; privacy parameters ε, δ; Output: private prefix sum 3: for k=1, . . . , K do 4: Express k in binary form: k = P

j Binj(k) 2j

5: Find the index of first one ik := min{j : Binj(k) = 1} 6: Compute p-sum αik = P

j<i αj + γk.

7: Add noise to p-sum bαik = αik + N(0, σ2 0I) 8: Output private prefix sum esk = P

j:Binj(k)=1 bαj 9: end for 10: end procedure 11: Procedure: Analyzer A 12: // Input: a collection of M data points, y = {yi}i [M]; Output: Aggregated sum 13: Output ey = P

i [M] yi 14: end procedure

of the K shuffled outputs, where K = T/B is total communication rounds. Now, if the prefix sum is released to the shuffler, then any single data point participates in at most K shuffle mechanisms, which would blow up ε by a factor of O(

K) (by advanced composition (Dwork & Roth, 2014)). This would eventually lead to a K1/4 factor blow up in regret due to privacy. Similarly, if we apply PVec to the data points in the prefix sum, then again a single data point can participate in at most K shuffled outputs.

On the other hand, if only noisy p-sums are released for shuffling at each communication round k [K] (as in our protocol P) or only the data points in each p-sum are used in PVec (as in our protocol in PT Vec), then due to the binary-tree structure, each data point only participates in at most log K shuffled mechanisms, which only leads to O( log K) blow-up of ε; hence allowing us to achieve the desired e O(

MT) regret scaling, and close the gap present under silo-level LDP.

Remark F.1 (Shuffled tree-based mechanism). Both the protocol P in Algorithm 2 along with our new amplification lemma and protocol PT Vec in Algorithm 5 can be treated as a black-box method, which integrates shuffling into the tree-based mechanism while providing formal guarantees for continual release of sum statistics. Hence, it can be applied to other federated online learning problems beyond contextual bandits.

F.1 AMPLIFICATION LEMMA FOR SDP

We first formally introduce our new amplification lemma, which is the key to our analysis, as mentioned in the main paper.

The motivation for our new amplification result is two-fold: (i) Existing results on privacy amplification via shuffling (e.g., Feldman et al. (2022); Erlingsson et al. (2019); Cheu et al. (2019); Balle et al. (2019)) are only limited to the standard LDP case, i.e., each local dataset has size n = 1, which is not applicable in our case where each silo runs a DP (rather than LDP) mechanism over a dataset of size n = T; (ii) Although a recent work (Lowy & Razaviyayn, 2021) establishes a general amplification result for the case of n > 1, it introduces a very large value for the final δ that scales linearly with n due to group privacy.

We first present the key intuition behind our new lemma. Essentially, as in Lowy & Razaviyayn (2021), we follow the nice idea of hiding among the clones introduced in Feldman et al. (2022). That is, the output from silo 2 to n can be similar to that of silo 1 by the property of DP (i.e., creating clones). The key difference between n = 1 and n > 1 is that in the latter case, the similarity distance between the output of silo 1 and j (j > 1) will be larger as in this case all n > 1 data points among two silos could be different. To capture this, Lowy & Razaviyayn (2021) resorts to group privacy for

Published as a conference paper at ICLR 2024

general DP local randomizers.6 However, group privacy for approximate DP will introduce a large value for δ. Thus, since we know that each local randomizer in our case is the Gaussian mechanism, we can capture the similarity of outputs between silo 1 and j (j > 1) by directly bounding the sensitivity. This helps to avoid the large value for the final δ. Specifically, we have the following result, which can be viewed as a refinement of Theorem D.5 in Lowy & Razaviyayn (2021) when specified to the Gaussian mechanism. We follow the notations in Lowy & Razaviyayn (2021) for easy comparison.

Lemma F.2 (Amplification lemma for Gaussian mechanism). Let X = (X1, , XN) X N n be a distributed data set, i.e., N silos each with n data points. Let r N and let R(i) r (Z, ) : X n Z := Rd be a Gaussian mechanism with (εr 0, δr 0)- DP, εr 0 (0, 1)7, for all Z = Z(1:N) (1:r 1) Z(r 1) N and i [N], where X is

an arbitrary set. Suppose for all i, maxany pair(X,X ) R(i) r (Z, X) R(i) r (Z, X ) n

maxadjacent pair(X,X ) R(i) r (Z, X) R(i) r (Z, X ) .8 Given Z = Z(1:N) (1:r 1), consider the shuffled

algorithm Ar s : X n N Z(r 1) N ZN that first samples a random permutation π of [N] and then computes Zr = (Z(1) r , , Z(N) r ), where Z(i) r := R(i) r (Z, Xπ(i)). Then, for any δ [0, 1] such

that εr 0 1

n ln N 16 log(2/δ) , Ar s is (εr, δr)-DP, where

1 + eεr 0 1 eεr 0 + 1

enεr 0 log(4/δ)

N + 8enεr 0

δr := eεr 0 1 eεr 0 + 1

δ + N(eεr + 1)(1 + e εr 0/2)δr 0.

If εr 0 1/n, choosing δ = Nnδr 0 yields εr = O εr 0

log(1/(n Nδr 0))

and δr = O(Nδr 0), where

δr 0 1/(Nn).

F.2 VECTOR SUM PROTOCOL FOR SDP

One limitation of our first scheme for SDP is that the privacy guarantee holds only for very small values of ε. This comes from two factors: one is due to the fact that standard 1/

M amplification result requires the local privacy budget to be close to one; the other one comes from the fact that now the local dataset could be n = T, which further reduces the range of valid ε.

In this section, we give the vector sum protocol in Cheu et al. (2021) for easy reference. Let s also give a concrete example to illustrate how to combine Algorithm 6 with Algorithm 5. Consider a fixed k = 6. Then, for each agent, we have αi6 = γ5 + γ6. That is, consider the case of summing bias vectors, for agent i [M], γ5 = P5B t=4B+1 xt,iyt,i and γ6 = P6B t=5B+1 xt,iyt,i. Then, D6 consists of 2B data points, each of which is a single bias vector. Now, Rvec and Avec (as well the shuffler) work together to compute the noisy sum of 2B M data points. In particular, denote by Pvec the whole process, then we have eαi6 = Pvec(DM 6 ), where DM 6 is the data set that consists of n = 2B M data points, each of them is a single bias vector.

Next, we present more details on the implementations, i.e., the parameter choices of g, b, p. Let s consider k = 6 again as an example. In this case, the total number of data points that participate in Pvec is n = 2B M. Then, according to the proof of Theorem C.1 in Chowdhury & Zhou (2022b),

6This is because it mainly focuses on the lower bound, where one needs to be general to handle any mechanisms. 7Note that standard Gaussian mechanism only applies to the regime when ε < 1. In our case, εr 0 is often less than 1. Gaussian mechanism also works for the regime ε > 1, in this case, σ2 1/ε rather than 1/ε2. With minor adjustment of the final εr, our proof can be extended. 8This is w.l.o.g; one can easily generalize it to any upper bound that is a function of n. 9In our application, each data point means a bias vector or a covariance matrix. See Appendix F.2 for a concrete example.

Published as a conference paper at ICLR 2024

Algorithm 5 PT Vec, another privacy protocol used in Algorithm 1

1: Procedure: Local Randomizer R at each agent 2: // Input: stream data (γ1, . . . , γK), privacy budgets ε > 0, δ (0, 1] 3: for k=1, . . . , K do 4: Express k in binary form: k = P

j Binj(k) 2j

5: Find index of first one ik =min{j : Binj(k)=1} 6: Let Dk be the set of all data points9that contribute to αik = P

j<ik αj + γk 7: Output yk = RVec(Dk) // apply RVec in Algorithm 6 to each data point 8: end for 9: Procedure: Analyzer A at server 10: // Input: stream data from S: { yk = ( yk,1, . . . , yk,M)}k [K] 11: for k=1, . . . , K do 12: Express k in binary and find index of first one ik 13: Add all messages from M agents: eαik = AVec( yk) // apply AVec in Algorithm 6 14: Output: esk = P j:Binj(k)=1 eαj 15: end for

g = max{2 n, d, 4}, b = 24 104 g2 log 4 (d2+1)

ε2n , p = 1/4.

Algorithm 6 Pvec, a shuffle protocol for vector summation (Cheu et al., 2021)

1: Input: Database of d-dimensional vectors X = (x1, , xn); privacy parameters ε, δ; L. 2: procedure: Local Randomizer Rvec(xi) 3: for j [d] do 4: Shift component to enforce non-negativity: wi,j xi,j + L 5: mj R1D(wi,j) 6: end for 7: Output labeled messages {(j, mj)}j [d] 8: end procedure 9: procedure: Analyzer Avec(y) 10: for j [d] do 11: Run analyzer on coordinate j s messages zj A1D(yj) 12: Re-center: oj zj n L 13: end for 14: Output the vector of estimates o = (o1, od) 15: end procedure

Algorithm 7 P1D, a shuffle protocol for summing scalars (Cheu et al., 2021)

1: Input: Scalar database X = (x1, xn) [0, L]n; g, b N; p (0, 1

2). 2: procedure: Local Randomizer R1D(xi) 3: xi xig/L . 4: Sample rounding value η1 Ber(xig/L xi). 5: Set bxi xi + η1. 6: Sample privacy noise value η2 Bin(b, p). 7: Report yi {0, 1}g+b containing bxi + η2 copies of 1 and g + b (bxi + η2) copies of 0. 8: end procedure 9: procedure: Analyzer A1D(S(y1, . . . , yn)) 10: Output estimator L

g ((Pn i=1 Pb+g j=1(yi)j) pbn). 11: end procedure

Published as a conference paper at ICLR 2024

First, we present proof of Theorem 5.3.

Proof of Theorem 5.3. Privacy. In this proof, we directly work on approximate DP. By the boundedness assumption and Gaussian mechanism, we have that with σ2 0 = 2L2 log(1.25/bδ0)

bε2 0 , R in P is

(bε0, bδ0)-DP for each communication round k [K] (provided bε0 1) . Now, by our amplification

lemma (Lemma F.2), we have that the shuffled output is (bε, bδ)-DP with bε = O bε0

log(1/(T Mbδ0))

and bδ = O(Mbδ0) (provided bε0 1/T and bδ0 1/(MT)). Here we note that in our case, N = M and n = T, where n = T follows from the fact that there exists αi in the tree that corresponds to the sum of T data points. Moreover, since the same mechanism is run at all silos, shuffling-thenprivatizing is the same as first privatizing-then-shuffling the outputs. Next, we apply the advanced composition theorem (cf. Theorem 3.20 in Dwork & Roth (2014)). In particular, by the binary tree structure, each data point involves only κ := 1 + log(K) times in the output of R. Thus, to achieve (ε, δ)-DP, it suffices to have bε = ε 2

2κ log(2/δ) and bδ = δ 2κ. Using all these equations, we can solve

for bε0 = C1 ε

κ log(1/δ) log(κ/(δT )) and bδ0 = C2 δ Mκ, for some constants C1 > 0 and C2 > 0. To

satisfy the conditions on bε0 and bδ0, we have ε κ C1T

M and δ κ C2T . With the choice of bε0 and

bδ0, we have the noise variance σ2 0 = O 2L2β log(1/δ) log(κ/(δT )) log(Mκ/δ)

ε2M . Thus, we can apply P to the bias and covariance terms (with L = 1), respectively.

Regret. Again, we simply resort to our Lemma C.4 for the regret analysis. In particular, we only need to determine the maximum noise level in the learning process. Note that σ2 0 = O 2L2κ log(1/δ) log(κ/(δT )) log(Mκ/δ)

ε2M is the noise level injected for both bias and covariance terms. Now, by the construction of the binary tree in P, one can see that each prefix sum only involves at most 1 + log(k) tree nodes. As a result, the overall noise level across all M silos is upper bounded by σ2 total = Mκσ2 0. Finally, setting σ2 in Lemma C.4 to be the noise level σ2 total , yields the required result.

Now, we prove Theorem 5.5.

Proof of Theorem 5.5. Privacy. For each calculation of the noisy synchronized p-sum, there exist parameters for PVec such that it satisfies (ε0, δ0)-SDP where ε0 (0, 15] and δ0 (0, 1/2) (see Lemma 3.1 in Cheu et al. (2021) or Theorem 3.5 in Chowdhury & Zhou (2022b)). Then, by the binary tree structure, each single data point (bias vector or covariance matrix) only participates in at most κ := 1+log(K) runs of PVec. Thus, to achieve (ε, δ)-DP, it suffices to have ε0 = ε 2

2κ log(2/δ) and δ0 = δ 2κ by advanced composition theorem. Thus, for any ε (0, 30 p

2κ log(2/δ)) and δ (0, 1), there exist parameters for PVec such that the entire calculations of noisy p-sums are (ε, δ)-SDP. Since we have two streams of data (bias and covariance), we finally have that for any ε (0, 60 p

2κ log(2/δ)) and δ (0, 1), there exist parameters for PVec such that Algorithm 1 with PT Vec satisfies (ε, δ)-SDP.

Regret. By the same analysis in the proof of Theorem 3.5 in Chowdhury & Zhou (2022b), the injected noise for each calculation of the noisy synchronized p-sum is sub-Gaussian with the variance being at most bσ2 = O log2(d2/δ0)

= O κ log(1/δ) log2(d2κ/δ)

ε2 . Now, by the binary tree structure, each prefix sum only involves at most κ p-sums. Hence, the overall noise level is upper bounded by σ2 total = κbσ2. Finally, setting σ2 in Lemma C.4 to be the noise level σ2 total , yields the required result.

Now, we provide proof of amplification Lemma F.2 for completeness. We follow the same idea as in Feldman et al. (2022) and Lowy & Razaviyayn (2021). For easy comparison, we use the same notations as in Lowy & Razaviyayn (2021) and highlighted the key difference using color text.

Published as a conference paper at ICLR 2024

Proof of Lemma F.2. Let X0, X1 X n N be adjacent distributed data sets (i.e. PN i=1 Pn j=1 1{xi,j =xi,j} = 1). Assume WLOG that X0 = (X0 1, X2, , XN) and X1 = (X1 1, X2, , XN), where X0 1 = (x1,0, x1,2, , x1,n) = (x1,1, x1,2, , x1,n). We can also assume WLOG that Xj / {X0 1, X1 1} for all j {2, , N} by re-defining X and R(i) r if necessary.

Fix i [N], r [R], Z = Z1:r 1 = Z(1:N) (1:r 1) Z(r 1) N, denote R(X) := R(i) r (Z, X) for X X n, and As(X) := Ar s(Z1:r 1, X). Draw π uniformly from the set of permutations of [N]. Now, since R is (εr 0, δr 0)-DP, R(X1 1) (εr 0,δr 0) R(X0 1), so by Lowy & Razaviyayn (2021, Lemma D.12),

there exists a local randomizer R such that R (X1 1) (εr 0,0) R(X0 1) and TV (R (X1 1), R(X1 1)) δr 0.

Hence, by Lowy & Razaviyayn (2021, Lemma D.8), there exist distributions U(X0 1) and U(X1 1) such that

R(X0 1) = eεr 0

eεr 0 + 1U(X0 1) + 1 eεr 0 + 1U(X1 1) (11)

R (X1 1) = 1 eεr 0 + 1U(X0 1) + eεr 0

eεr 0 + 1U(X1 1). (12)

Here, we diverge from the proof in (Lowy & Razaviyayn, 2021). We denote eε0 := nεr 0 and eδ0 := δr 0. Then, by the assumption of R(X), for any X, we have R(X) ( e ε0, e δ0) R(X0 1)) and

R(X) ( e ε0, e δ0) R(X1 1)). This is because by the assumption, when the dataset changes from any X to

X0 1 (or X1 1), the total change in terms of l2 norm can be n times that under an adjacent pair. Thus, one has to scale the εr 0 by n while keeping the same δr 0.

Now, we resume the same idea as in (Lowy & Razaviyayn, 2021). By convexity of hockey-stick divergence and the above result, we have R(X) ( e ε0, e δ0)

1 2(R(X0 1) + R(X1 1)) := ρ for all X X n.

That is, R is ( eε0, eδ0) deletion group DP for groups of size n with reference distribution ρ. Thus, by Lowy & Razaviyayn (2021, Lemma D.11), we have that there exists a local randomizer R such that R (X) and ρ are ( eε0, 0) indistinguishable and TV (R (X), R(X)) eδ0 for all X. Then by the definition of ( eε0, 0) indistinguishability, for all X there exists a left-over distribution LO(X) such that R (X) = 1 e f ε0 ρ + (1 1/e e ε0)LO(X) = 1 2e f ε0 (R(X0 1) + R(X1 1)) + (1 1/e e ε0)LO(X).

Now, define a randomizer L by L(X0 1) := R(X0 1), L(X1 1) := R (X1 1), and

L(X) := 1 2e e ε0 R(X0 1) + 1 2e e ε0 R (X1 1) + (1 1/e e ε0)LO(X)

= 1 2e e ε0 U(X0 1) + 1 2e e ε0 U(X1 1) + (1 1/e e ε0)LO(X) (13)

for all X X n \ {X0 1, X1 1}. (The equality follows from (11) and (12).) Note that TV (R(X0 1), L(X0 1)) = 0, TV (R(X1 1), L(X1 1)) δr 0, and for all X X n \ {X0 1, X1 1}, TV (R(X), L(X)) TV (R(X), R (X)) + TV (R (X), L(X)) eδ0 + 1 2e f ε0 TV (R (X1 1), R(X1 1)) = (1 + 1 2enεr 0 )δr 0.

Keeping r fixed (omitting r scripts everywhere), for any i [N] and Z := Z1:r 1 Z(r 1) N, let L(i)(Z, ), U (i)(Z, ), and LO(i)(Z, ) denote the randomizers resulting from the process described above. Let AL : X n N ZN be defined exactly the same way as Ar s := As (same π) but with the randomizers R(i) replaced by L(i). Since As applies each randomizer R(i) exactly once and R(1)(Z, Xπ(1), R(N)(Z, Xπ(N)) are independent (conditional on Z = Z1:r 1) 10, we have TV (As(X0), AL(X0)) N(1+ 1 2enεr 0 )δr 0 and TV (As(X1), AL(X1) N(1+ 1 2enεr 0 )δr 0. Now we

10This follows from the assumption that R(i)(Z1:r 1, X) is conditionally independent of X given Z1:r 1 for all Z1:r 1 and X = X .

Published as a conference paper at ICLR 2024

claim that AL(X0) and AL(X1) are (εr, δ) indistinguishable for any δ 2e Ne nεr 0 /16. Observe that this claim implies that As(X0) and As(X1) are (εr, δr) indistinguishable by Lowy & Razaviyayn (2021, Lemma D.13) (with P := AL(X0), Q := AL(X1), P := As(X0), Q := As(X1).) Therefore, it only remains to prove the claim, i.e. to show that Deεr (AL(X0), AL(X1) δ for any δ 2e Ne nεr 0 /16.

Now, define L(i) U (Z, X) :=

U (i)(Z, X0 1) if X = X0 1 U (i)(Z, X1 1) if X = X1 1 L(i)(Z, X) otherwise. . For any inputs Z, X, let AU(Z, X) be

defined exactly the same as As(Z, X) (same π) but with the randomizers R(i) replaced by L(i) U . Then by (11) and (12),

AL(X0) = eεr 0

eεr 0 + 1AU(X0)+ 1 eεr 0 + 1AU(X1) and AL(X1) = 1 eεr 0 + 1AU(X0)+ eεr 0

eεr 0 + 1AU(X1).

Then by (13), for any X X n\{X0 1, X1 1} and any Z = Z1:r 1 Z(r 1) N, we have L(i) U (Z, X) = 1 2e f ε0 L(i) U (Z, X0 1) + 1 2e f ε0 L(i) U (Z, X1 1) + (1 e e ε0)LO(i)(Z, X). Hence, Lowy & Razaviyayn (2021, Lemma D.10) (with p := e e ε0 = e nεr 0) implies that AU(X0) and AU(X1)) are

e e ε0 ln(4/δ)

N + 8e e ε0

indistinguishable for any δ 2e Ne nεr 0 /16.

Here, we also slightly diverge from Lowy & Razaviyayn (2021). Instead of using Lowy & Razaviyayn (2021, Lemma D.14), we can directly follow the proof of Lemma 3.5 in Feldman et al. (2022) and Lemma 2.3 in Feldman et al. (2022) to establish our claim that AL(X0) and AL(X1) are indistinguishable (hence the final result). Here, we also slightly improve the δ term compared to Feldman et al. (2022) by applying amplification via sub-sampling to the δ term as well. In particular, the key step is to rewrite (14) as follows (with T := 1

2(AU(X0) + AU(X1))

AL(X0) = 2 eεr 0 + 1T + eεr 0 1 eεr 0 + 1AU(X0) and AL(X1) = 2 eεr 0 + 1T + eεr 0 1 eεr 0 + 1AU(X1). (15)

Thus, by the convexity of the hockey-stick divergence and Lemma 2.3 in Feldman et al. (2022), we have AL(X0) and AL(X1) are

1 + εr 0 1 εr 0 + 1

e e ε0 ln(4/δr)

, εr 0 1 εr 0 + 1δ

indistinguishable for any δ 2e Ne nεr 0 /16. As decribed before, this leads to the result that As(X0) and As(X1) are (εr, δr) indistinguishable by Lowy & Razaviyayn (2021, Lemma D.13) (original result in Lemma 3.17 of Dwork & Roth (2014)) with (noting that eε0 = nεr 0)

1 + eεr 0 1 eεr 0 + 1

enεr 0 ln(4/δ)

N + 8enεr 0

δr := eεr 0 1 eεr 0 + 1

δ + N(eεr + 1)(1 + e εr 0/2)δr 0.

G FURTHER DISCUSSIONS

In this section, we provide more details on our upper bounds, privacy notion and algorithm design.

G.1 DISCUSSION ON TIGHTNESS OF UPPER BOUNDS

In the paper, we have established regrets of O(M 3/4p

T/ε) under silo-level LDP and O( p

MT/ε) under SDP. An essential open question is regarding tightness of these upper bounds. It turns out that

Published as a conference paper at ICLR 2024

the key to obtain both lower bounds is to first establish a tight characterization for single-agent LCB under central JDP (which is still open to the best of our knowledge), as elaborated below:

SDP: The current upper bound aligns with the state-of-the-art result achieved by a super single agent under central JDP. However, the tightness of this bound is still uncertain, as it even remains open whether the upper bound under the centralized setting is tight. To our best knowledge, the only existing lower bound for LCBs under central JDP is Ω(

T + 1/ε) (He et al., 2022b), implying a lower bound of Ω(

MT + 1/ε) for the super single agent case. That is, the privacy cost in the lower bound is only additive rather than multiplicative cost of 1/ ε present in the upper bound. It is unclear to us which one of the upper bound or lower bound is loose.

Silo-level LDP: A lower bound can potentially be established using a similar reduction as in Lowy & Razaviyayn (2021), where the authors derive a lower bound for silo-level LDP supervised learning via the central model. To be more specific, for any silo-level LDP algorithm A with privacy guarantee ε0, one can first virtually shuffle all the MT user sequences and then apply A, leading to a shuffled version As. As shown in Lowy & Razaviyayn (2021),, the shuffled version algorithm As enjoys an SDP privacy guarantee of roughly ε := ε0/

M (here again, one cannot directly use standard amplification lemma). Since SDP implies central JDP in our linear contextual bandit case, then one can conclude that As has a lower bound of Lc(ε), where Lc(ε) denotes the lower bound for LCB under central JDP with privacy guarantee ε. Finally, one can note that A and As have the same regret performance, hence establishing the regret lower bound Lc(ε0/

M) for A under silo-level LDP.

Implication: If one can establish a lower bound Lc(ε) = Ω( p

T/ε) for standard LCB under central JDP (i.e., Ω( p

MT/ε) for the super single agent case), then by the above argument, it directly implies that our SDP upper bound is tight and moreover, the upper bound under silo-level LDP is also tight.

It is worth noting that our new findings in this paper (e.g., identifying existing gaps and establishing new upper bounds) motivate the above interesting questions, which we believe will promote advances in the field.

G.2 SILO-LEVEL LDP/SDP VS. OTHER PRIVACY NOTIONS

In this section, we compare our silo-level LDP and SDP with standard privacy notions for single-agent LCBs, including local, central, and shuffle model for DP, respectively.

Silo-level LDP vs. single-agent local DP. Under standard LDP for single-agent LCBs (Zheng et al., 2020; Duchi et al., 2013; Zhou & Tan, 2021), each user only trusts herself and hence privatizes her response before sending it to the agent. In contrast, under silo-level LDP, each local user trusts the local silo (agent), which aligns with the pratical situations of cross-silo FL, e.g., patients often trust the local hospitals. In such cases, standard LDP becomes unnecessarily stringent, hindering performance/regret and making it less appealing to cross-silo federated LCBs.

Silo-level LDP vs. single-agent central DP. The comparison with standard central DP (in particular central JDP)11 for single-agent LCB (e.g., Shariff & Sheffet (2018)) is delicate. We first note that under both notions, users trust the agent and the privacy burden lies at the agent. Under standard central DP, the agent uses private statistics until round t to choose action for each round t, which ensures that any other users t = t cannot infer too much about user t s information by observing the actions on rounds t = t (i.e., joint differential privacy (JDP) (Kearns et al., 2014)). On the other hand, silo-level LDP does not necessarily require each agent (silo) to use private statistics to recommend actions to users within the silo. Instead, it only requires the agent to privatize its sent messages (both schedule and content). Thus, silo-level LDP may not protect a user t from the colluding of all other users within the same silo. In other words, the adversary model for silo-level LDP is that the adversary could be any other silos or the central server rather than other users within the same silo. Note that the same adversary model is assumed in a similar notion for federated supervised learning (e.g., inter-silo record-level differential privacy (ISRL-DP) in Lowy & Razaviyayn (2021)). In fact, with a minor tweak of our Algorithm 1, one can achieve a slightly stronger notion of privacy than silo-level LDP in that it now can protect against both other silos/server and users within the same silo.

11As shown in Shariff & Sheffet (2018), JDP relaxation is necessary for achieving sub-linear regret for LCBs under the central model. Otherwise, a linear regret lower bound exists for central standard DP.

Published as a conference paper at ICLR 2024

The key idea is exactly that now each agent will only use private statistics to recommend actions, see Appendix G.3.

Silo-level LDP vs. Federated DP in Dubey & Pentland (2020). In Dubey & Pentland (2020), the authors define the so-called notion of federated DP for federated LCBs, which essentially means that the action chosen by any agent must be sufficiently impervious (in probability) to any single data from any other agent . This privacy guarantee is directly implied by our silo-level LDP. In fact, in order to show such a privacy guarantee, Dubey & Pentland (2020) basically tried to show that the outgoing communication is private, which is the idea of silo-level LDP. However, as mentioned in the main paper, Dubey & Pentland (2020) only privatizes the communicated data and fails to privatize the communication schedule, which leads to privacy leakage. Moreover, as already mentioned in Remark B.1, Fed-DP fails to protect a user s privacy even under a reasonable adversary model. Thus, we believe that silo-level LDP is a better option for federated LCBs.

SDP vs. single-agent shuffle DP. Under the single-agent shuffle DP (Chowdhury & Zhou, 2022b; Tenenbaum et al., 2023), the shuffler takes as input a batch of users data (i.e., from t1 to t2), which enables to achieve a regret of e O(T 3/5) (vs. e O(T 3/4) regret under local model and e O(

T) regret under central model). In contrast, under our SDP, the shuffler takes as input the DP outputs from all M agents. Roughly speaking, single-agent shuffle DP aims to amplify the privacy dependence on T while our SDP amplifies privacy over M. Due to this, single-agent shuffle DP can directly apply a standard amplification lemma (e.g., Feldman et al. (2022)) or shuffle protocol (e.g., Cheu et al. (2021)) that works well with LDP mechanism at each user (i.e., the size of dataset is n = 1). In contrast, in order to realize amplification over M agents DP outputs, we have to carefully modify the standard amplification lemma to handle the fact that now each local mechanism operates on n > 1 data points, which is one of the key motivations for our new amplification lemma.

Sublinear regret under SDP vs. linear regret lower bound under central standard DP (not JDP). One may wonder why an even better regret under the shuffle model is possible given that there is a linear regret bound under central model for LCBs. This is not contradicting as the lower bound of linear regret is established under standard central DP while SDP in LCBs only implies central JDP. More specifically, in contrast to standard private data analysis and supervised learning, the shuffle model is NOT an intermediate trust model between central standard DP and local DP for LCBs. That is, even if an LCB algorithm satisfies shuffle DP, it can still fail to satisfy central standard DP. Rather, it is only an intermediate trust model between the central joint DP (JDP) and the local model. That is, if an LCB algorithm satisfies shuffle DP, it satisfies central JDP (via Billboard lemma).

G.3 A SIMPLE TWEAK OF ALGORITHM 1 FOR A STRONGER PRIVACY GUARANTEE

As discussed in the last subsection, the adversary model behind silo-level LDP only includes other silos and the central server, i.e., excluding adversary users within the same silo. Thus, for silo-level LDP, Algorithm 1 can use non-private data to recommend actions within a batch (e.g., Vt,i includes non-private recent local bias vectors and covariance matrices). If one is also interested in protecting against adversary users within the same silo, a simple tweak of Algorithm 1 suffices.

As shown in Algorithm 8, the only difference is a lazy update of bθt,i is adopted (line 5), i.e., it is only computed using private data without any dependence on new non-private local data. In fact, same regret bound as in Theorem 5.1 can be achieved for this new algorithm (though empirical performance could be worse due to the lazy update). In the following, we highlight the key changes in the regret analysis. It basically follows the six steps in the proof of Lemma C.2. One can now define a mapping κ(t) that maps any t [T] to the most recent communication round. That is, for any t [tk 1, tk] where tk = k B is the k-th communication round, we have κ(t) = tk 1. Then, one can replace all t in Vt,i and Gt,i by κ(t). The main difference that needs a check is Step 4 when bounding the regret in good epochs. The key is again to establish a similar form as (5). To this end, note that for all t [tk 1, tk] Vk Vt,i and Gκ(t),i + λmin I = Vk 1, which enables us to obtain xt,i (Gκ(t),i+λmin I) 1

2 xt,i V 1 t,i . Following the same analysis yields the desired regret bound.

Published as a conference paper at ICLR 2024

Algorithm 8 Priv-Fed Lin UCB-Lazy

1: Parameters: Batch size B N, regularization λ > 0, confidence radii {βt,i}t [T ],i [M], feature map ϕi : Ci Ki Rd, privacy protocol P = (R, S, A)

2: Initialize: For all i [M], Wi = 0, Ui = 0, f Wsyn = 0, e Usyn = 0 3: for t=1, . . . , T do 4: for each agent i = 1, . . . , M do 5: Vt,i = λI + f Wsyn, bθt,i = V 1 t,i e Usyn

6: Play arm at,i = argmaxa Ki ϕi(ct,i, a), bθt,i + βt,i ϕi(ct,i, a) V 1 t,i and set xt,i =

ϕi(ct,i, at,i) 7: Observe reward yt,i 8: Update Ui = Ui + xt,iyt,i and Wi = Wi + xt,ix t,i 9: end for 10: if t mod B = 0 then 11: // Local randomizer R at all agents i [M] 12: Send randomized messages Rbias t,i = Rbias(Ui) and Rcov t,j = Rcov(Wi) to the shuffler 13: // Third party S 14: Sbias t = S({Rbias t,i }i [M]) and Scov t = S({Rcov t,i }i [M]) 15: // Analyzer A at the server 16: Construct private cumulative statistics e Usyn = Abias(Sbias t ) and f Wsyn = Acov(Scov t ) 17: // All agents i [M]

18: Receive f Wsyn and e Usyn from the server 19: Reset Wi = 0, Ui = 0 20: end if 21: end for

G.4 NON-UNIQUE USERS

In the main paper, we assume all users across all silos and T rounds are unique. Here, we briefly discuss how to handle the case of non-unique users.

The same user appears multiple times in the same silo. One example of this could be one patient visiting the same hospital multiple times. In such cases, one needs to carefully apply group privacy or other technique (e.g., Chowdhury & Zhou (2022b)) to characterize the privacy loss of these returning users.

The same user appears multiple times across different silos. One example of this could be one patient who has multiple records across different hospitals. Then, one needs to use adaptive advanced composition to characterize the privacy loss of these returning users.

H ADDITIONAL DETAILS ON SIMULATION RESULTS

In Figure 3, we compare regret performance of LDP-Fed Lin UCB with Fed Lin UCB under varying privacy budgets.12 In sub-figure (a), we plot results for δ = 0.1 and varying level of ε {0.2, 1, 5} on synthetic Gaussian bandit instance, wherein sub-figure (b), we plot results for ε = 5 and varying level of δ {0.1, 0.01, 0.001}. In sub-figure (c), we plot results for δ = 0.1 and varying level of ε {0.2, 1, 5} on bandit instance generated from MSLR-WEB10K data by training a lasso model on bodyfeatures (d = 78). In all these plots, we observe that regret of LDP-Fed Lin UCB decreases and, comes closer to that of Fed Lin UCB as ε, δ increases (i.e., level of privacy protection decreases), which support our theoretical results. Here, we don t compare SDP-Lin UCB (with privacy amplification) since its privacy guarantee holds for ε, δ 1. Instead, we do so in sub-figure (d) with ε = δ = 0.0001. Here also, we observe a drop in regret of SDP-Fed Lin UCB compared to that of LDP-Fed Lin UCB.

12All existing non-private federated LCB algorithms (e.g., Wang et al. (2020)) adopts adaptive communication. We refrain from comparing with those to maintain consistency in presentation.

Published as a conference paper at ICLR 2024

0 2000 4000 6000 8000 10000 Round

Time-average Regret

Fed Lin UCB LDP-Fed Lin UCB(ε = 0.2) LDP-Fed Lin UCB(ε = 1)

LDP-Fed Lin UCB(ε = 5)

(a) Synthetic data (varying ε, δ = 0.1)

0 2000 4000 6000 8000 10000 Round

Time-average Regret

Fed Lin UCB LDP-Fed Lin UCB(δ=0.1) LDP-Fed Lin UCB(δ=0.01)

LDP-Fed Lin UCB(δ=0.001)

(b) Synthetic data (varying δ, ε = 5)

0 5000 10000 15000 20000 25000

Time-average Regret

Fed Lin UCB LDP-Fed Lin UCB(ε = 0.2)

LDP-Fed Lin UCB(ε = 1)

LDP-Fed Lin UCB(ε = 5)

(c) Real data (varying ε, δ = 0.1)

0 5000 10000 15000 20000 25000

Time-average Regret

Fed Lin UCB SDP-Fed Lin UCB(ε = 0.0001)

LDP-Fed Lin UCB(ε = 0.0001)

(d) Real data (M = 100)

Figure 3: Comparison of time-average group regret for Fed Lin UCB (non-private) and LDP-Fed Lin UCB (i.e., under silo-level LDP) on (a, b) synthetic Gaussian bandit instance and (c,d) bandit instance generated from MSLR-WEB10K Learning to Rank dataset.