# Shuffle Private Linear Contextual Bandits

Sayak Ray Chowdhury*1, Xingyu Zhou*2

Differential privacy (DP) has recently been introduced to linear contextual bandits to formally address the privacy concerns in the associated personalized services offered to participating users (e.g., recommendations). Prior work largely focuses on two trust models of DP: the central model, where a central server is responsible for protecting users' sensitive data, and the (stronger) local model, where information needs to be protected directly on the users' side. However, there remains a fundamental gap in the utility achieved by learning algorithms under these two privacy models, e.g., if all users are unique within a learning horizon $T$, $\tilde{O}(\sqrt{T})$ regret in the central model as compared to $\tilde{O}(T^{3/4})$ regret in the local model. In this work, we aim to achieve a stronger model of trust than the central model, while suffering a smaller regret than the local model, by considering the recently popular shuffle model of privacy. We propose a general algorithmic framework for linear contextual bandits under the shuffle trust model, where there exists a trusted shuffler in between users and the central server that randomly permutes a batch of users' data before sending those to the server. We then instantiate this framework with two specific shuffle protocols: one relying on privacy amplification of local mechanisms, and another incorporating a protocol for summing vectors and matrices of bounded norms. We prove that both these instantiations lead to regret guarantees that significantly improve on that of the local model, and can potentially be of the order $\tilde{O}(T^{3/5})$ if all users are unique. We also verify this regret behavior with simulations on synthetic data. Finally, under the practical scenario of non-unique users, we show that the regret of our shuffle private algorithm scales as $\tilde{O}(T^{2/3})$, which matches what the central model could achieve in this case.

*Equal contribution. 1Boston University, USA. 2Wayne State University, USA. Correspondence to: Sayak Ray Chowdhury, Xingyu Zhou. Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## 1. Introduction

In the linear contextual bandit problem (Auer, 2003; Chu et al., 2011), a learning agent observes the context information $c_t$ of a user at every round $t$. The goal is to recommend an action $a_t$ to the user so that the resulting reward $y_t$ is maximized. The mean reward is given by a linear function of an unknown parameter vector $\theta^* \in \mathbb{R}^d$, $d \in \mathbb{N}$, i.e., $\mathbb{E}[y_t \mid c_t, a_t] = \langle \theta^*, \phi(c_t, a_t) \rangle$, where $\phi : \mathcal{C} \times \mathcal{X} \to \mathbb{R}^d$ maps a context-action pair to a $d$-dimensional feature vector, and $\langle \cdot, \cdot \rangle$ denotes the standard Euclidean inner product. The context and action sets $\mathcal{C}$ and $\mathcal{X}$ are arbitrary, and can also possibly vary with time. An agent's performance over $T$ rounds is typically measured through the cumulative pseudo-regret
$$\mathrm{Reg}(T) = \sum_{t=1}^{T} \Big[ \max_{a \in \mathcal{X}} \langle \theta^*, \phi(c_t, a) \rangle - \langle \theta^*, \phi(c_t, a_t) \rangle \Big],$$
which is the total loss suffered due to not recommending the actions generating the highest possible rewards corresponding to the observed contexts. This framework has found applications in many real-life settings such as internet advertisement selection (Abe et al., 2003), article recommendation in web portals (Li et al., 2010), and mobile health (Tewari & Murphy, 2017), to name a few.
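To make the setup concrete, here is a minimal simulation sketch (not taken from the paper) of the linear reward model and the cumulative pseudo-regret; the feature construction, noise level, and the uniformly random placeholder policy are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 20, 1000                         # feature dimension, actions per round, horizon

theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)      # bounded unknown parameter theta*

pseudo_regret = 0.0
for t in range(T):
    # A context arrives as K candidate feature vectors phi(c_t, a) with norm <= 1.
    Phi = rng.normal(size=(K, d))
    Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)

    means = Phi @ theta_star                  # <theta*, phi(c_t, a)> for each action
    a_t = rng.integers(K)                     # placeholder policy (the paper uses a LinUCB rule)
    y_t = means[a_t] + 0.1 * rng.normal()     # noisy reward with E[y_t | c_t, a_t] = <theta*, phi>

    pseudo_regret += means.max() - means[a_t] # per-round term in Reg(T)

print(f"Reg(T) = {pseudo_regret:.2f} over T = {T} rounds")
```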
The general applicability of this framework has motivated a line of work (Shariff & Sheffet, 2018; Zheng et al., 2020) studying linear contextual bandit problems under the additional constraint of differential privacy (Dwork, 2008), which guarantees that the users' contexts and generated rewards cannot be inferred by an adversary during the learning process. To illustrate the privacy concern in the contextual bandit problem, let us consider a mobile medical application in which a mobile app recommends a tailored treatment plan (i.e., action) to each patient (i.e., user) based on her personal information such as age, weight, height, medical history, etc. (i.e., context). Meanwhile, the app's recommendation algorithm also needs to be updated once in a while on a cloud server after collecting data from a batch of patients, including treatment outcomes (i.e., rewards) and contexts, which are often considered to be private and sensitive information. Hence, each patient would like to obtain a personalized and effective treatment plan while guaranteeing that her sensitive information remains protected against a potential adversarial attack in this interactive process. Protection of privacy is typically achieved by injecting sufficient noise into users' data (Arora et al., 2014; Xin & Jaakkola, 2014), which results in a loss of utility (i.e., an increase in regret) of the recommended action. Hence, the key question is how to balance utility and privacy carefully. This has motivated studies of linear contextual bandits under different trust models of differential privacy (i.e., who the user will trust with her sensitive data). On one end of the spectrum lies the central model, which guarantees privacy to users who trust the learning agent to store their raw data in the server and use those to update its strategy of recommending actions. Under this trust model, Shariff & Sheffet (2018) have shown that the cumulative regret is $\tilde{O}\big(\frac{\sqrt{T}\,(\log(1/\delta))^{1/4}}{\sqrt{\varepsilon}}\big)$, where $\varepsilon$ and $\delta$ are privacy parameters with smaller values denoting a higher level of protection. Perhaps unsurprisingly, this regret bound, thanks to the high degree of trust, matches the optimal $\Theta(\sqrt{T})$ scaling for non-private linear contextual bandits (Chu et al., 2011). However, this relatively high trust model is not always feasible since the users may not trust the agent at all. This is captured by the local model, where any data sent by the users must already be private, and the agent can only store those randomized data in the server. This is a strictly stronger notion of privacy, and hence often comes at a price. Under this trust model, Zheng et al. (2020) have shown that the cumulative regret is $\tilde{O}\big(\frac{T^{3/4}(\log(1/\delta))^{1/4}}{\sqrt{\varepsilon}}\big)$, which, as expected, is much worse than that in the central model. This naturally leads to the following question: can a finer trade-off between privacy and regret in linear contextual bandits be achieved? Furthermore, in both Shariff & Sheffet (2018) and Zheng et al. (2020), the learning agent updates its strategy at every round. This not only puts an excessive computational burden on the server (due to $T$ updates, each taking at least $O(d^2)$ time and memory) but could also be practically infeasible at times. For example, consider the above mobile health application. It is often infeasible for the cloud server to update the algorithm deployed in the mobile app after interacting with each single user.
Rather, a more practical strategy is to update the algorithm after collecting a batch of users' data (e.g., a one-month period of data). Motivated by these, we consider the linear contextual bandit problem under an intermediate trust model of differential privacy, known as the shuffle model (Cheu et al., 2019; Erlingsson et al., 2019), in the hope of attaining a finer regret-privacy trade-off while only using batch updates. In this new trust model, there exists a shuffler between users and the central server which permutes a batch of users' randomized data before they are viewed by the server, so that it cannot distinguish between two users' data. Shuffling thus adds another layer of protection by decoupling data from the users that sent them. Here, as in the local model, the users do not trust the server. However, it is assumed that they have a certain degree of trust in the shuffler since it can be efficiently implemented using cryptographic primitives (e.g., mixnets) due to its simple operation (Bittau et al., 2017; Apple, 2017). The shuffle model provides the possibility of achieving a stronger privacy guarantee than the central model while suffering a smaller utility loss than the local model. The key intuition behind this is that the additional randomness of the shuffler creates a privacy blanket (Balle et al., 2019b), so that each user now needs much less random noise to hide her information in the crowd. Indeed, the shuffle model achieves a better trade-off between utility and privacy as compared to the central and local models in several learning problems such as empirical risk minimization (Girgis et al., 2021), stochastic convex optimization (Lowy & Razaviyayn, 2021; Cheu et al., 2021), and standard multi-armed bandits (Tenenbaum et al., 2021). However, little is known about (linear) contextual bandits in the shuffle model due to its intrinsic challenges. That is, in addition to rewards, the contexts are also sensitive information that needs to be protected, which not only results in the aforementioned large gap in regret between the local and central models,1 but also leads to new challenges in the shuffle model.

1 In contrast, for MAB, the problem-independent upper bounds in the local and central model are both $\tilde{O}(\sqrt{T})$ (Ren et al., 2020a).

Against this backdrop, we make the following contributions:

- We design a general algorithmic framework (Algorithm 1) for private linear contextual bandits in the shuffle model. It decomposes the learning process into three black-box components: a local randomizer at each user, an analyzer at the central server, and a shuffler in between.
- We instantiate the framework with two specific shuffle protocols. The first one directly builds on privacy amplification of existing local mechanisms. The other one utilizes an efficient mechanism for summing vectors with bounded $\ell_2$ norms. We show that both shuffle protocols provide stronger privacy protection compared to the central model. Furthermore, when all users are unique, we prove a regret bound of $\tilde{O}(T^{3/5})$ for both the protocols, which improves over the $\tilde{O}(T^{3/4})$ regret of the local model. Hence, we achieve a finer trade-off between regret and privacy.
- We further perform simulations on synthetic data that corroborate our theoretical results.
- As a practical application of our general framework, we show that under the setting of non-unique or returning users, the regret of both our shuffle protocols matches the one that the central model would achieve in the same setting.
This, along with the fact that both shuffle protocols also offer a certain degree of local privacy, further elaborate usefulness of shuffle model in private linear contextual bandits. Related work. Due to the utility gap present between central and local models, a significant body of recent work have focused on the shuffle model (Balle et al., 2019b; Feldman et al., 2020; Ghazi et al., 2019; Balle et al., 2019a). A nice overview of recent work in the shuffle model is presented in Cheu (2021). Regret performance of multi-armed bandit algorithms under central and local trust models have been considered in Mishra & Thakurta (2015); Sajed & Sheffet (2019); Ren et al. (2020a); Chen et al. (2020); Zhou & Tan (2020); Dubey (2021); Tossou & Dimitrakakis (2017), whereas online learning algorithms under full information have appeared in Guha Thakurta & Smith (2013); Agarwal & Singh (2017). Recently, the two models have also been adopted to design differentially private control and reinforcement learning algorithms (Vietri et al., 2020; Garcelon et al., 2020; Chowdhury et al., 2021; Chowdhury & Zhou, 2021). Han et al. (2021) consider linear bandits with stochastic contexts, and show that e O( T/ε) regret can be achieved even in the local model. In contrast, in this work, we allow the contexts to be arbitrary and can even be adversarially generated, which pose additional challenges. Batched linear bandits are studied in Han et al. (2020); Ren et al. (2020b), where the authors show that only O( T) model update is sufficient to achieve corresponding minimax optimal regrets. In the shuffle private model, batched learning not only reduces the model update frequency, but more importantly plays a key role in amplifying privacy via shuffling a batch of users data. Interestingly, as a byproduct, our established generic regret bound also improves over the non-private one in Ren et al. (2020b) in the sense that no restriction is required for the regularizer. Concurrent and independent work. While preparing this submission, we have noticed that Garcelon et al. (2021) also study linear contextual bandits in the shuffle model. The authors claim that a single fixed batch schedule is not sufficient to obtain a better regret-privacy trade-off in shuffle model. They propose to use separate asynchronous schedules a fixed batch scheme for the shuffler and an adaptive model update scheme for the server. In contrast, thanks to a tighter analysis, we show that a single fixed batch schedule is indeed sufficient to attain the same regret-privacy tradeoff in shuffle model. Moreover, we believe, there exists a fundamental gap in their analysis for the adaptive model update, which might make their results ungrounded. We provide a detailed discussion on this in Section 6, which highlights the key difference in dealing with adaptive update in the non-private and the private settings. Finally, in addition to the above differences in theoretical results, our established generic framework enables to design flexible shuffle private protocols for linear contextual bandits that are able to handle a wide range of practically interested privacy budget ε rather than a restricted small value ε 1 in the concurrent work (Garcelon et al., 2021). 2. Privacy in the Shuffle Model In this section, we introduce the shuffle model, and its corresponding privacy notion called the shuffle differential privacy. Before that, we recall definitions of differential privacy under central and local models (Dwork et al., 2014). 2.1. 
Central and Local Differential Privacy Throughout, we let D denote the data universe, and n N the number of (unique) users. Let Di D, i = 1, 2, . . . , n, denote the data point of user i, and D i Dn 1 denote collection of data points of all but the i-th user. Let ε > 0 and δ (0, 1] be given privacy parameters. Definition 2.1 (Differential Privacy (DP)). A mechanism M satisfies (ε, δ)-DP if for each user i [n], each data set D, D Dn, and each event E in the range of M, P [M(Di, D i) E] exp(ε)P [M(D i, D i) E] + δ. Definition 2.2 (Local Differential Privacy (LDP)). A mechanism M satisfies (ε, δ)-LDP if for each user i [n], each data point Di, D i D and each event E in the range of M, P [M(Di) E] exp(ε)P [M(D i) E] + δ. Roughly speaking, a central DP (or, simply, DP) mechanism ensures that the outputs of the mechanism on two neighbouring data sets (i.e., those differ only on one user) are approximately indistinguishable. In contrast, local DP ensures that the output of the local mechanism for each user is indistinguishable. 2.2. Shuffle Differential Privacy A (standard) shuffle protocol P = (R, S, A) consists of three parts: (i) a (local) randomizer R, (ii) a shuffler S and (iii) an analyzer A. For n users, the overall protocol works as follows. Each user i first applies the randomizer on its raw data Di and then sends the resulting messages R(Di) to the shuffler. The shuffler S permutes messages from all the users uniformly at random and then reports the permuted messages S(R(D1), . . . , R(Dn)) to the analyzer. Finally, the analyzer A computes the output using received messages. In this protocol, the users trust the shuffler but not the analyzer. Hence, the privacy objective is to ensure that the outputs of the shuffler on two neighbouring datasets are indistinguishable in the analyzer s view. To this end, define the mechanism (S Rn)(D) := S(R(D1), . . . , R(Dn)), where D Dn. Shuffle Private Linear Contextual Bandits Definition 2.3 (Shuffle differential privacy (SDP)). A protocol P = (R, S, A) for n users satisfies (ε, δ)-SDP if the mechanism S Rn satisfies (ε, δ)-DP. To achieve benefits of the shuffle model in intrinsically adaptive algorithms (e.g., gradient descent, multi-armed bandits etc.), one needs to divide the users into multiple batches, and run a potentially different shuffle protocol on each batch (Cheu et al., 2021; Tenenbaum et al., 2021). This is quite natural since the shuffler needs enough users data to infuse sufficient randomness so as to amplify the privacy. Moreover, each protocol might depend on the output of the preceding protocols to foster adaptivity. Formally, a general M-batch, M N, shuffle protocol P for n users works as follows. In each batch m, we simply run a standard singlebatch shuffle protocol for a subset of nm users (such that n = P m nm) with randomizer Rm, shuffler S and analyzer A. To ensure adaptivity, the randomizer Rm and number of users nm for the m-th batch could be chosen depending on outputs of the shuffler from all the previous batches, given by S Rm (D1), . . . , Rm (Dnm ) m 0, confidence radii {βm}m 0, feature map ϕ : C X Rd 2: Initialize: Batch counter m=1, end-time t0 =0, batch statistics V0 =λId, u0 =0, parameter estimate bθ0 =0 3: for local user t=1, 2, . . . 
do 4: Observe user s context information ct C 5: Choose action at argmaxa X ϕ(ct, a), bθm 1 + βm 1 ϕ(ct, a) V 1 m 1 6: Observe reward yt 7: # For the local randomizer: 8: Send randomized messages Mt,1 = R1(ϕ(ct, at)yt) and Mt,2 = R2(ϕ(ct, at)ϕ(ct, at) ) to the shuffler 9: if t = m B then 10: # For the shuffler: 11: Set batch end-time: tm = t 12: Permute all received messages uniformly at random Ym,1 = S1 {Mτ,1}tm 1+1 τ tm and Ym,2 = S2 {Mτ,2}tm 1+1 τ tm 13: # For the analyzer (server): 14: Compute per-batch statistics eum = A1(Ym,1) and e Vm = A2(Ym,2) using shuffled messages 15: Update overall batch statistics: um = um 1 + eum, Vm = Vm 1 + e Vm 16: Compute parameter estimate bθm = V 1 m um 17: Send updated models (bθm, Vm) to users 18: Increase batch counter: m = m + 1 19: end if 20: end for a d-dimensional ellipsoid Em with centre bθm, shape matrix Vm and radius βm so that it contains the unknown parameter θ with high probability. Moreover, the ellipsoids are designed while keeping the privacy setting in mind. They depend on the randomizer, shuffler and analyzer employed in the shuffle protocol based on required privacy levels ε, δ. The personal data of user t in batch m is given by the feature vector ϕ(ct, at) and reward yt, where the action at is selected given the context ct as at argmax a X { ϕ(ct, a), bθm 1 +βm 1 ϕ(ct, a) V 1 m 1}. We consider a fixed randomizer across all the batches given by two functions R1 and R2 that locally operate on the vectors ϕ(ct, at)yt and matrices ϕ(ct, at)ϕ(ct, at) , respectively. Similarly, we have shuffler functions S1 and S2 operating on batches (of size B) of those respective randomized messages. Finally, the analyzer functions A1 and A2 receive permuted messages from S1 and S2, and output, for each batch m , an aggregate vector eum and matrix e Vm , respectively. The central server uses this aggregate batch statistics to construct the ellipsoid: Vm = λId+Pm m =1 e Vm Shuffle Private Linear Contextual Bandits and bθm = V 1 m PM m =1 eum . For a given confidence level α (0, 1], the radius of the ellipsoid is set as βm = α + d log 1 + tm λ , where tm is the time when batch m ends. The regularizer λ and thus, in turn, the confidence radius βm typically depend on the total noise infused in the shuffle protocol. On a high level, these randomizer, shuffler and analyzer functions together provide suitable random perturbations to the Gram matrices and feature-reward vectors based on the privacy budget ε, δ, and in turn, they affect the regret performance via the noise levels of these perturbations. Next, we turn to discuss specific choices of these functions, and the associated performance guarantees of Algorithm 1 under those choices. 3.2. Achieving SDP via LDP Amplification In this section, we show that our general framework (Algorithm 1) enables us to directly utilize existing LDP mechanisms for linear contextual bandits to achieve a finer utilityprivacy trade-off. The key idea here is to leverage the explicit privacy amplification property of the shuffle protocol (Feldman et al., 2020). Roughly, the privacy guarantee can be amplified by a factor of B by randomly permuting the output of an LDP mechanism independently operating on a batch of B different users. In other words, the same level of privacy can be achieved for each user by adding a B factor less noise in the presence of shuffler, yielding a better utility. 
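Before instantiating the protocol, the following schematic sketch (ours, under simplifying assumptions) shows how the batched loop of Algorithm 1 composes the three black-box components. The callables `randomizer`, `shuffler`, and `analyzer` are placeholders for the concrete protocols of Sections 3.2 and 3.3, the confidence radius is kept fixed instead of the radii $\beta_m$ used in the analysis, and all names are illustrative.

```python
import numpy as np

def shuffle_linucb(get_features, get_reward, T, B, d,
                   randomizer, shuffler, analyzer, lam=1.0, beta=1.0):
    """Schematic version of Algorithm 1: users randomize their data locally,
    a shuffler permutes each batch, and the server (analyzer) aggregates it.
    randomizer(vec, mat) -> message; shuffler(list) -> permuted list;
    analyzer(list) -> (sum of vectors, sum of matrices)."""
    V = lam * np.eye(d)                       # overall batch statistic V_m
    u = np.zeros(d)                           # overall batch statistic u_m
    V_inv, theta_hat = np.linalg.inv(V), np.zeros(d)
    batch_msgs = []

    for t in range(1, T + 1):
        Phi = get_features(t)                 # feature vectors phi(c_t, a) of available actions
        widths = np.sqrt(np.einsum('ij,jk,ik->i', Phi, V_inv, Phi))
        a = int(np.argmax(Phi @ theta_hat + beta * widths))   # optimistic action
        y = get_reward(t, a)
        phi = Phi[a]
        # Local randomizer: the user privatizes phi*y and phi*phi^T before release.
        batch_msgs.append(randomizer(phi * y, np.outer(phi, phi)))

        if t % B == 0:                        # end of batch m
            u_tilde, V_tilde = analyzer(shuffler(batch_msgs))  # per-batch aggregates
            u, V = u + u_tilde, V + V_tilde
            V_inv = np.linalg.inv(V)
            theta_hat = V_inv @ u             # updated model (theta_hat, V) sent back to users
            batch_msgs = []
    return theta_hat

# Example plug-ins (no privacy): P_Amp / P_Vec of Sections 3.2-3.3 replace these.
identity_randomizer = lambda vec, mat: (vec, mat)
uniform_shuffler = lambda msgs: [msgs[i] for i in np.random.permutation(len(msgs))]
summing_analyzer = lambda msgs: (sum(m[0] for m in msgs), sum(m[1] for m in msgs))
```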
Specifically, we instantiate Algorithm 1 with the shuffle protocol $\mathcal{P}_{\mathrm{Amp}} = (\mathcal{R}_{\mathrm{Amp}}, \mathcal{S}_{\mathrm{Amp}}, \mathcal{A}_{\mathrm{Amp}})$, where we employ the standard Gaussian mechanism (Dwork et al., 2014) as the randomizer functions. Essentially, we inject independent Gaussian perturbations into each entry of the vector $\phi(c_t,a_t)y_t$ and the matrix $\phi(c_t,a_t)\phi(c_t,a_t)^\top$ with variances $\sigma_1^2$ and $\sigma_2^2$, respectively. We make sure the noisy matrix is symmetric by perturbing the upper-diagonal entries and copying those to the lower ones. The noise variances are properly tuned depending on the sensitivity of these elements to achieve the desired level of privacy. In this case, the shuffler functions simply permute their data uniformly at random, and the job of the analyzer is to simply add its received data (i.e., vectors or matrices). We defer further details on the protocol $\mathcal{P}_{\mathrm{Amp}}$ to Appendix B and focus on performance guarantees first.

Theorem 3.2 (Performance under LDP amplification). Fix time horizon $T \in \mathbb{N}$, batch size $B \in [T]$, confidence level $\alpha \in (0,1]$, privacy budgets $\delta \in (0,1]$, $\varepsilon \in \big(0, \sqrt{\log(2/\delta)/B}\,\big]$. Then, Algorithm 1 instantiated using the shuffle protocol $\mathcal{P}_{\mathrm{Amp}}$ with noise $\sigma_1 = \sigma_2 = \frac{4\sqrt{2\log(2.5B/\delta)\log(2/\delta)}}{\varepsilon\sqrt{B}}$ and regularizer $\lambda = \Theta\big(\sigma_1\sqrt{T\log(T/(B\alpha))}\big)$, enjoys the regret
$$\mathrm{Reg}(T) = O\!\left( d\sqrt{B}\log T + \frac{\log^{1/2}(B/\delta)}{\varepsilon^{1/2}B^{1/4}}\, d^{3/4}\, T^{3/4} \log^{2}(T/\alpha) \right)$$
with probability at least $1-\alpha$. Moreover, it satisfies $O(\varepsilon,\delta)$-shuffle differential privacy (SDP).

Corollary 3.3. Setting the batch size $B = O(T^{3/5})$ in Algorithm 1, we can achieve regret $\tilde{O}\big(\frac{T^{3/5}}{\sqrt{\varepsilon}}\log^{1/2}(T/\delta)\big)$.4

Comparison with central and local DP models. At this point, we turn to compare the regret of our shuffle private LinUCB algorithm to that of LinUCB under the central model with JDP guarantee5 (Shariff & Sheffet, 2018) and the local model with LDP guarantee (Zheng et al., 2020). As mentioned before, LinUCB achieves $\tilde{O}\big(\sqrt{T}/\sqrt{\varepsilon}\big)$ and $\tilde{O}\big(T^{3/4}/\sqrt{\varepsilon}\big)$ regret under JDP and LDP guarantees, respectively. As seen in Corollary 3.3, our regret bound in the shuffle trust model lies perfectly in between these two extremes. Importantly, it improves over the $T^{3/4}$ scaling in the (stronger) local trust model, achieving a better trade-off between regret and privacy. However, it does not achieve the optimal $\sqrt{T}$ scaling of the (weaker) central trust model. It remains an open question whether $\sqrt{T}$ regret can be achieved under any notion of privacy stronger than the central model.

Remark 3.4. Our shuffle protocol $\mathcal{P}_{\mathrm{Amp}}$, by design, provides a certain level of local privacy to each user. Specifically, for batch size $B$, Algorithm 1 is $\big(O(\varepsilon\sqrt{B/\log(2/\delta)}),\, \delta/B\big)$-LDP. Furthermore, since the shuffle model ensures a higher level of trust than the central model, Algorithm 1 is also $O(\varepsilon,\delta)$-JDP. See Appendix B for details.

Apart from achieving a refined utility-privacy trade-off, the above shuffle protocol $\mathcal{P}_{\mathrm{Amp}}$ requires minimal modifications over existing LDP mechanisms. However, the privacy guarantee in Theorem 3.2 holds only for a small privacy budget $\varepsilon$, particularly when the batch size $B$ is large, which could potentially limit its application in some practical scenarios (e.g., when $\varepsilon$ is around 1 or larger (Apple, 2017)). Moreover, $\mathcal{P}_{\mathrm{Amp}}$ needs to communicate and shuffle real vectors and matrices, which are often difficult to encode on finite computers in practice (Canonne et al., 2020; Kairouz et al., 2021), and a naive use of finite-precision approximation may lead to a possible failure of privacy protection (Mironov, 2012). To overcome these limitations of $\mathcal{P}_{\mathrm{Amp}}$, we introduce a different instantiation of Algorithm 1 in the next section.

3.3.
Achieving SDP via Vector Summation We instantiate Algorithm 1 with the shuffle protocol PVec = (RVec, SVec, AVec), where we rely on a particularly efficient and accurate mechanism for summing vectors with bounded ℓ2 norms (Cheu et al., 2021). First, the local randomizer RVec adopts a one-dimensional randomizer that operates 4Note that with a careful choice of B (depending on privacy parameters ε, δ), we can have a better regret dependence on ε, δ. See Corollary B.2 for details. 5JDP, or, joint differential privacy, is a notion of privacy under central trust model specific to contextual bandits. See Appendix D. Shuffle Private Linear Contextual Bandits independently on each entry of the vector ϕ(ct, at)yt and the matrix ϕ(ct, at)ϕ(ct, at) , respectively. This adopted one-dimensional randomizer transmits only bits (0/1) via a fixed-point encoding scheme (Cheu et al., 2019), and ensures privacy by injecting binomial noise. In particular, given any entry x [0, 1], it is first encoded as bx = x+γ1, using an accuracy parameter g N, where x = xg and γ1 Ber(xg x). Then a binomial noise is generated, γ2 Bin(b, p), where parameters b N, p (0, 1) control the privacy noise. The output of the one-dimensional randomizer is simply a collection of total g + b bits, in which bx+γ2 bits are 1 and the rest are 0. Combining the outputs of the one-dimensional randomizer for each entry of vector ϕ(ct, at)yt and matrix ϕ(ct, at)ϕ(ct, at) , yield final outputs of randomizer. The shuffler functions in SVec simply permutes all the received bits uniformly at random. The job of the analyzer AVec is to add the received bits for each entry, and remove the bias introduced due to encoding and binomial noise. This is possible since bits are already labeled entry-wise when leaving RVec. The constants g, b, p are left as tunable parameters of PVec, and need to be set properly depending on the desired level of privacy. The detailed implementation of this scheme is deferred to Appendix C. The following theorem states the performance guarantees of Algorithm 1 instantiated with PVec. Theorem 3.5 (Performance under vector sum). Fix batch size B [T], privacy budgets ε (0, 15], δ (0, 1/2). Then, Algorithm 1 instantiated with PVec with parameters p = 1/4, g = max{2 B, d, 4} and b = C g2 log2(d2/δ) ε2B is (ε, δ)-SDP, where C 1 is some sufficiently large constant. Furthermore, for any α (0, 1], setting λ = Θ log(d2/δ) log(T/Bα) , it enjoys the regret d B log T + log1/2(d2/δ) ε1/2B1/4 d3/4T 3/4 log2(T/α) with probability at least 1 α. Remark 3.6. Similar to Corollary 3.3, an e O T 3/5 can also be achieved in this case by setting B = O(T 3/5), but the dependence on δ is now: log1/2(d2/δ) as compared to log1/2(T/δ). Moreover, in contrast to Theorem 3.2, the guarantees hold for a wide range of ε, making PVec better suitable for practical purposes (Apple, 2017). Finally, as before, if B also depends on privacy parameters, the dependence on ε, δ can be improved, see Corollary C.2. Remark 3.7. PVec can also be regarded as privacy amplification of Binomial mechanism (rather than Gaussian mechanism in PAmp), which is the reason that it also offers a certain degree, O(ε B, δ), to be precise, of LDP guarantee. Remark 3.8. Both shuffle protocols, PAmp and PVec, in fact, can be tuned to satisfy (ε, δ)-LDP by sacrificing on regret performance. See Corollaries B.3 and C.3 for details. 3.4. Key Techniques: Overview In this section, we provide a generic template of regret bound for linear contextual bandits under the shuffle model of privacy. 
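Before presenting this template, a short sketch may help make the one-dimensional randomizer of $\mathcal{P}_{\mathrm{Vec}}$ from Section 3.3 concrete. This is a minimal illustration of the encoding and debiasing steps described above; the parameter values are illustrative and are not the privacy-calibrated choices of Theorem 3.5.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_scalar(x, g, b, p):
    """One-dimensional randomizer of P_Vec for an entry x in [0, 1]:
    fixed-point encoding plus unbiased Bernoulli rounding, then Binomial noise.
    Returns g + b bits, of which (x_bar + gamma1 + gamma2) are ones."""
    x_bar = int(np.floor(x * g))
    gamma1 = rng.binomial(1, x * g - x_bar)   # unbiased rounding bit
    gamma2 = rng.binomial(b, p)               # Binomial privacy noise
    bits = np.zeros(g + b, dtype=np.int8)
    bits[: x_bar + gamma1 + gamma2] = 1
    return bits

def analyze_sum(all_bits, n, g, b, p):
    """Analyzer: add the received bits for one entry and remove the bias
    introduced by the encoding (factor g) and the Binomial noise (mean n*b*p)."""
    return (all_bits.sum() - n * b * p) / g

# Toy check on a single coordinate held by a batch of n users (illustrative g, b, p).
n, g, b, p = 50, 64, 32, 0.25
xs = rng.uniform(0, 1, size=n)
bits = np.vstack([randomize_scalar(x, g, b, p) for x in xs])
# The shuffler only permutes messages; the per-entry sum of bits is unchanged.
est = analyze_sum(bits, n, g, b, p)
print(f"true sum = {xs.sum():.3f}, private estimate = {est:.3f}")
```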
To derive this template, we need the following notation to capture the effect of the noise added by the shuffle protocol on the learning process. Let $n_m = \tilde{u}_m - \sum_{t=t_{m-1}+1}^{t_m}\phi(c_t,a_t)y_t$ and $N_m = \tilde{V}_m - \sum_{t=t_{m-1}+1}^{t_m}\phi(c_t,a_t)\phi(c_t,a_t)^\top$ denote the total noise added during batch $m$ to the feature-reward vector and to the Gram matrix, respectively. Furthermore, assume that there exist constants $\tilde{\sigma}_1$ and $\tilde{\sigma}_2$ such that for each batch $m$, (i) $\sum_{m'=1}^{m} n_{m'}$ is a random vector whose entries are independent, mean zero, sub-Gaussian with variance at most $\tilde{\sigma}_1^2$, and (ii) $\sum_{m'=1}^{m} N_{m'}$ is a random symmetric sub-Gaussian matrix whose entries on and above the diagonal are independent with variance at most $\tilde{\sigma}_2^2$. Let $\sigma^2 = \max\{\tilde{\sigma}_1^2, \tilde{\sigma}_2^2\}$. Then, we have the following result.

Lemma 3.9 (Informal). With the choice of $\lambda \simeq \sigma\sqrt{\log(T/(B\alpha))}$, the regret of Algorithm 1 satisfies $\mathrm{Reg}(T) = \tilde{O}\big(d\sqrt{B} + d\sqrt{T} + \sqrt{\sigma T}\, d^{3/4}\big)$ with high probability.

With the above result, one only needs to determine the noise variance $\sigma^2$ under different privacy protocols. We illustrate this with the shuffle protocols introduced in the previous sections. First, note that since we assume unique users, Algorithm 1 is SDP if each batch is SDP. Now, for the LDP amplification protocol $\mathcal{P}_{\mathrm{Amp}}$, in order to guarantee SDP for each batch with a sufficiently small privacy loss $\varepsilon$, it suffices to work with an LDP mechanism with loss $\varepsilon\sqrt{B}$ by virtue of amplification.6 We ensure this by choosing the Gaussian mechanism with noise variance $O\big(\frac{1}{\varepsilon^2 B}\big)$. Hence, the total noise variance added by $\mathcal{P}_{\mathrm{Amp}}$ is $\sigma^2 \le O\big(\frac{T}{\varepsilon^2 B}\big)$. Thus, by Lemma 3.9, we obtain the result in Theorem 3.2. Similarly, for the vector sum protocol $\mathcal{P}_{\mathrm{Vec}}$, we ensure that $\mathcal{P}_{\mathrm{Vec}}$ is SDP by properly setting the parameters $g, b, p$. Moreover, the analyzer's outputs are unbiased estimates of the sums of the non-private vectors (matrices) within that batch, and the entry-wise private noise per batch is sub-Gaussian with variance $O\big(\frac{1}{\varepsilon^2}\big)$. Thus, the total noise variance added by $\mathcal{P}_{\mathrm{Vec}}$ is $\sigma^2 \le O\big(\frac{T}{\varepsilon^2 B}\big)$, and hence, by Lemma 3.9, we have the result in Theorem 3.5.

Remark 3.10. Lemma 3.9, in fact, can serve as a general template of regret for private linear contextual bandit algorithms. For example, for the local model (Zheng et al., 2020), $B=1$ and $\sigma^2 \simeq \frac{T}{\varepsilon^2}$, yielding $\tilde{O}\big(\frac{T^{3/4}}{\sqrt{\varepsilon}}\big)$ regret. Similarly, for the central model (Shariff & Sheffet, 2018), $B=1$ and $\sigma^2 \simeq \frac{\log T}{\varepsilon^2}$, which yields $\tilde{O}\big(\frac{T^{1/2}}{\sqrt{\varepsilon}}\big)$ regret.

6 We provide intuition without worrying about the details related to $\delta$-dependent terms. Refer to Appendix A for formal proofs.

## 4. Regret Performance under Returning Users

Similar to existing work on differentially private bandits, in the previous sections we have assumed that all participating users are unique, i.e., each user participates in the protocol at only one round. A more practical scenario is that a user can contribute her data at multiple rounds. For example, consider the mobile medical application described in the introduction. The cloud server can collect one particular user's data during multiple batches to track the effectiveness of its treatment plan over a period, and hence use the same user's data multiple times to update its recommendation algorithm. Motivated by this, we provide privacy and regret guarantees of Algorithm 1 under the setting of returning users in linear contextual bandits. We first define the setting of returning users that we consider in this section, and then state the performance guarantee for Algorithm 1.

Assumption 4.1 (Returning Users).
For a given time horizon T N and batch size B [T], any user can participate in all M = T/B batches, but within each batch m [M], she only contributes once. In addition to the above motivating example, this assumption also captures many practical adaptive learning scenarios such as clinical trials and product recommendations, in which each trial (batch) involves a group of unique people, but the same person may participate in multiple trials (Ren et al., 2020b; Schwartz et al., 2017). Theorem 4.2 (Performance guarantees (informal)). Under Assumption 4.1, we obtain the following results for PAmp and PVec, respectively. (i) For any ε 2 B log(2/δ) 2T and δ (0, 1], Algorithm 1 instantiated using PAmp with noise levels σ1 =σ2 = 16 log(2/δ) T (log(5T/δ)) εB is O(ε, δ)-SDP, and enjoys, with high-probability, the regret bound Reg(T) = e O ε d3/4 log3/4(T/δ) (ii) For any ε 15, δ (0, 1/2), there exist choices of parameters g, b N, p (0, 1/2) depending on B, ε, δ such that Algorithm 1 instantiated using PVec is (ε, δ)-SDP, and, enjoys, with high probability, the regret bound Reg(T)= e O ε d3/4 log3/4(d2M/δ) Proof sketch. In contrast to Section 3 for unique users, where (ε, δ)-SDP guarantee for Algorithm 1 can be established by showing each batch is (ε, δ)-SDP, we now need to guarantee that outputs of all the batches together have a total privacy loss of (ε, δ). This is due to the fact that now each batch can potentially operate on same set of users, and hence, one need to use advanced decomposition to calculate the total privacy loss. This leads to scaling up the noise variance by a multiplicative factor of O(M) at each batch, which eventually leads to the above bound (the additional M factor in δ also comes from advance composition). Interestingly, the privacy (ε, δ)-dependent term in above regret bounds match the one that can be achieved in the user-level central trust model that handles returning users. Note that, since existing work in the central model of privacy (i.e., under JDP guarantee) assume unique users (Shariff & Sheffet, 2018), we first generalize it to handle returning users. This can be viewed as the same form of generalization from event-level DP to user-level DP under continual observation, where the adjacent relation between two data streams changes from the flip of one single round to the flip of multiple rounds associated with a single user (Dwork et al., 2010). See Appendix D for formal definitions of event-level and user-level joint differential privacy (JDP). As in standard notion of DP, one straightforward approach for converting event-level JDP to user-level JDP is to use group privacy (Dwork et al., 2014). However, this blackbox approach would blow up the terms dependent on δ. To overcome this, we propose a simple modification of original (event-level) algorithm in Shariff & Sheffet (2018) so that it can handle returning users. In particular, user-level JDP can be achieved by scaling up the noise variance by a multiplicative factor of M 2 0 , if any user participates in at most M0 rounds. This follows from the fact that flipping one user now would change the ℓ2 sensitivity of the expanded binary-tree nodes from O( log T) to O(M0 log T). Note that we use M0 to distinguish from the number of batches M since there is no batch concept in standard central model. This modified version enjoys the following regret guarantee. Proposition 4.3. 
If any user participates in at most M0 rounds, the algorithm in Shariff & Sheffet (2018), with the above modification to handle user-level privacy, achieves the high-probability regret bound Reg(T) = e O ε d3/4 log1/4(1/δ) Remark 4.4. Comparing Theorem 4.2 and Proposition 4.3, we observe that the cost of privacy in the shuffle model is essentially same (upto a log factor) as in the central model under the setting of returning users. In particular, if M = M0 = T 1/3 rounds (i.e., the same number of possible returning rounds for any user), the regret is e O T 2/3 in both shuffle and user-level central trust models. See Appendix E for complete proofs and more details. Shuffle Private Linear Contextual Bandits 0 5000 10000 15000 20000 Round Lin UCB Lin UCB-JDP Lin UCB-SDP-Amp Lin UCB-SDP-Vec Lin UCB-LDP (a) ε = 0.2 0 5000 10000 15000 20000 Round Lin UCB Lin UCB-JDP Lin UCB-SDP-Amp Lin UCB-SDP-Vec Lin UCB-LDP 0 5000 10000 15000 20000 Round Lin UCB Lin UCB-JDP Lin UCB-SDP-Amp Lin UCB-SDP-Vec Lin UCB-LDP Figure 1: Comparison of cumulative regret for Lin UCB (non-private), Lin UCB-JDP (central model), Lin UCB-SDP (shuffle model) and Lin UCB-LDP (local model) with varying privacy level ε = 0.2 (a), ε = 1 (b) and ε = 10 (c). For ε = 0.2 (higher privacy level), gap between private and non-private regret is higher as compared to ε = 10 (lower privacy level). In all cases, regret of Lin UCB-SDP lies perfectly in between Lin UCB-JDP and Lin UCB-LDP, achieving finer regret-privacy trade-off. 5. Simulation Results In this section, we empirically evaluate the regret performance of Algorithm 1 (under shuffle model), which we abbreviate as Lin UCB-SDP-Amp and Lin UCB-SDP-Vec when instantiated with PAmp and PVec, respectively. We compare them with the algorithms of Shariff & Sheffet (2018) and Zheng et al. (2020) under central and local models, which we call Lin UCB-JDP and Lin UCB-LDP, respectively. We benchmark these against the non-private algorithm of Abbasi-Yadkori et al. (2011), henceforth referred as Lin UCB. For all the experiments, we consider 100 arms, set T = 20000 rounds, and average our results over 50 randomly generated bandit instances. Each instance is characterized by an (unknown) parameter θ and feature vectors of dimension d = 5. To ensure boundedness, similar to Vaswani et al. (2020), we generate each θ and feature vectors by sampling a (d 1)-dimensional vectors of norm 1/ 2 uniformly at random, and append it with a 1/ 2 entry. We consider Bernoulli {0, 1} rewards. We fix δ =0.1 and plot the results for varying privacy level ε {0.2, 1, 10} in Figure 1. We use Batchsize B = 20 for Lin UCB-SDP. We postpone the results for d = 10, 15 to Appendix G. From Figure 1, we observe that the regret performance of Lin UCB-SDP (under both shuffle protocols PAmp and PVec) is indeed better than Lin UCB-LDP. In addition, it is not surprising that Lin UCB-SDP incurs a larger regret than Lin UCB-JDP. Moreover, the regret performance of Lin UCBSDP (in fact for any private algorithm) comes closer to that of Lin UCB as ε increases, i.e, as the privacy guarantee becomes weaker. The experimental findings are consistent with our theoretical results.7 7Code is available at https://github.com/sayakrc/ Differentially-Private-Bandits. 6. Concluding Remarks We conclude by discussing some important theoretical and practical aspects about shuffle protocols, and in general, about privacy in linear contextual bandits. Communications. 
In the protocol PAmp, each participating user at each round need to send one d-dimensional real vector and one d d real matrix. On the other hand, the protocol PVec only communicates 0/1 bits. In particular, each participating user at each round sends out a total of O(d2(g+b)) bits, where g+b B+log(1/δ)/ε2. Hence, PVec might be more feasible in practice than PAmp. Batched algorithms for local and central models. Existing work on differentially private linear contextual bandits under both local and central models perform sequential update, i.e., the model estimates are updated after each round. As mentioned before, this may not be feasible in practice due to computational load. Fortunately, our proposed algorithm (Algorithm 1) along with its generic regret bound (Lemma 3.9) also offers a simple way to design and analyze private algorithms for local and central models with batched update. In particular, we show that it suffices to update after every B = e O(T 3/4) rounds to achieve the same privacyregret trade-off as in the sequential local model and every B = e O( T) to match the sequential central model. See Appendix F for the details. Adaptive model update. One might wonder whether we can further reduce the update frequency to O(log T) via an adaptive model update schedule based on the standard determinant trick (Lemma 12 of Abbasi-Yadkori et al. (2011)). In this approach, the key step is to establish that ϕ(c, a) V 1 τt η ϕ(c, a) V 1 t , where τt < t is the most recent model update time before t. To this end, if one uses the determinant trick, one can obtain that ϕ(c, a) V 1 τt det(Vt) det(Vτt) ϕ(c, a) V 1 t , if the condition Vt Vτt holds. Note that this is true in Shuffle Private Linear Contextual Bandits the non-private setting. However, this does not necessarily hold in private settings due to the added noise, which, to the best of our knowledge, is the key analytical gap in the current proof of the main result (Theorem 10) in Garcelon et al. (2021). As we can see, this issue exists in all three trust models when one needs to use the noisy design matrix to determine the update frequency via the determinant trick. Close the gap. For the case of unique users, we have shown that shuffle model enables us to achieve a regret O(T 3/5), which is between local model regret O(T 3/4) and central model regret O( T). One important question is how to further close the gap. To this end, one might first need to have a tight lower bound for the regret under the local model. To the best of our knowledge, Liao et al. (2021) presents the first lower bound under local model Ω( 1 min{2,eε}(eε 1)d T). It is unclear to us whether the upper bound and lower bound are tight for the local model8. Advanced shuffle protocols. We use vector summation protocol in Cheu et al. (2021) rather than advanced shuffle protocols (in terms of communication cost) (e.g., Ghazi et al. (2020); Balle et al. (2020)) because it is simple to implement and suffices to deliver our main message. Another key technical reason is that the random noise introduced by these advanced shuffle protocols is (discrete) Laplace, which at best exhibits sub-exponential tail behavior. Now, in contextual bandits, one need to protect both rewards and feature vectors, and a principled way to achieve privacy is to inject suitable random noise to the matrices ϕ( )ϕ( ) and vectors ϕ( )y generated by rewards and features. This raises the need to control respective norms of random vectors and random symmetric matrices with i.i.d. 
sub-exponential entries in order to achieve a meaningful utility/regret bound. In contrast, our current adopted shuffle protocol from Cheu et al results in random vectors and matrices with Binomial entries, which have sub-Gaussian tails. Hence, the respective norms can be controlled using standard results from random matrix theory. However, we are not aware of non-trivial bounds on matrix norms with sub-exponential entries (i.e., without a simple union bound over co-ordinates), which abstains us from using advanced shuffle protocols. Pure DP in the shuffle model. First, let us highlight the reason why we focus only on approximate DP. Existing linear contextual bandit algorithms under central and local privacy models consider only approximate DP, and our main motivation in this work is to close the gap in regret achieved under these two models via shuffling. Second, almost all the existing shuffle protocols can only guarantee approximate DP. To the best of knowledge, the very recent work (Cheu & Yan, 2021) develops the first shuffle protocol that is able to achieve pure DP with an error of O(1/ε). However, since 8Note that Zheng et al. (2020) conjecture that the lower bound for the local model is Ω(T 3/4), see Appendix G therein. this new protocol also relies on discrete Laplace noise, the same challenge of controlling norms of vectors and matrices with sub-exponential entries in contextual bandits still remain. At mentioned before, to the best of our knowledge, it remains open to derive non-trivial (i.e., without a simple union bound over co-ordinates) concentration bounds on the norm for Laplace random matrices. That being said, if there indeed exist non-trivial concentrations for sub-exponential or Laplace matrices, our derived result (i.e., Lemma A.2) can also be used to derive meaningful regret bounds under this new protocol. Regret in low-privacy regime: As in previous works, we focus on the high-privacy regime. One may wonder if it is possible to potentially achieve O( T/ε) regret after amplification in the low-privacy regime (i.e., ε q n , by Lemma B.4). For example, this is possible if one can replace ε 1 2 0 by e ε0 2 in the second term of the regret bound in Proposition F.1 However, this essentially amounts to requiring a mechanism that guarantees (ε0, δ0)-LDP for ε0 > 1 by injecting (sub)-Gaussian noise with standard deviation σ = O(e ε0). To the best of our knowledge, we are unaware of any such mechanism. The standard Gaussian mechanism fails to satisfy this since it needs σ = Θ(1/ ε0) to achieve (ε0, δ0)-DP in low-privacy regime. For pure DP, the state-of-the-art staircase mechanism (Geng et al., 2015), requires a noise with σ = O(e ε0/2) rather than σ = O(e ε0). This again can t guarantee T/ε regret even if one manages to apply it in our context (with proper matrix-norm bounds). That being said, if there indeed exists the above required mechanism, our results can be used to establish the desired T/ε regret in low-privacy regime. Future work. One immediate future research direction is to address the above adaptive model update in the private settings. We also believe our framework can be generalized to design shuffle private algorithms for reinforcement learning with linear function approximation (e.g., linear mixture Markov decision processes (MDPs)) to achieve finer trade-off between the local model (Liao et al., 2021) and the central model (Zhou, 2022). Acknowledgements We thank anonymous reviewers for their useful comments, which helped preparing the final version. 
XZ would like to thank Albert Cheu for insightful discussions on shuffle protocols. SRC is grateful to a CISE Postdoctoral fellowship of Boston University. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. In Advances in Shuffle Private Linear Contextual Bandits Neural Information Processing Systems, pp. 2312 2320, 2011. Abe, N., Biermann, A. W., and Long, P. M. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263 293, 2003. Agarwal, N. and Singh, K. The price of differential privacy for online learning. In International Conference on Machine Learning, pp. 32 40. PMLR, 2017. Apple. Learning with privacy at scale. 2017. URL https: //machinelearning.apple.com/research/ learning-with-privacy-at-scale. Arora, S., Yttri, J., and Nilsen, W. Privacy and security in mobile health (mhealth) research. Alcohol research: current reviews, 36(1):143, 2014. Auer, P. Using confidence bounds for exploitationexploration trade-offs. Journal of Machine Learning Research, 3:397 422, March 2003. ISSN 1532-4435. Balle, B., Bell, J., Gascon, A., and Nissim, K. Differentially private summation with multi-message shuffling. ar Xiv preprint ar Xiv:1906.09116, 2019a. Balle, B., Bell, J., Gasc on, A., and Nissim, K. The privacy blanket of the shuffle model. In Annual International Cryptology Conference, pp. 638 667. Springer, 2019b. Balle, B., Bell, J., Gasc on, A., and Nissim, K. Private summation in the multi-message shuffle model. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, pp. 657 676, 2020. Bittau, A., Erlingsson, U., Maniatis, P., Mironov, I., Raghunathan, A., Lie, D., Rudominer, M., Kode, U., Tinnes, J., and Seefeld, B. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, pp. 441 459, 2017. Canonne, C. L., Kamath, G., and Steinke, T. The discrete gaussian for differential privacy. In Neur IPS, 2020. Chan, T. H., Shi, E., and Song, D. Private and continual release of statistics. In International Colloquium on Automata, Languages, and Programming, pp. 405 417. Springer, 2010. Chen, X., Zheng, K., Zhou, Z., Yang, Y., Chen, W., and Wang, L. (locally) differentially private combinatorial semi-bandits. In International Conference on Machine Learning, pp. 1757 1767. PMLR, 2020. Cheu, A. Differential privacy in the shuffle model: A survey of separations. ar Xiv preprint ar Xiv:2107.11839, 2021. Cheu, A. and Yan, C. Pure differential privacy from secure intermediaries. ar Xiv preprint ar Xiv:2112.10032, 2021. Cheu, A., Smith, A., Ullman, J., Zeber, D., and Zhilyaev, M. Distributed differential privacy via shuffling. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 375 403. Springer, 2019. Cheu, A., Joseph, M., Mao, J., and Peng, B. Shuffle private stochastic convex optimization. ar Xiv preprint ar Xiv:2106.09805, 2021. Chowdhury, S. R. and Zhou, X. Differentially private regret minimization in episodic markov decision processes. ar Xiv preprint ar Xiv:2112.10599, 2021. Chowdhury, S. R., Zhou, X., and Shroff, N. Adaptive control of differentially private linear quadratic systems. In 2021 IEEE International Symposium on Information Theory (ISIT), pp. 485 490. IEEE, 2021. Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15, pp. 
208 214, 2011. Dubey, A. No-regret algorithms for private gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2062 2070. PMLR, 2021. Dwork, C. Differential privacy: A survey of results. In International conference on theory and applications of models of computation, pp. 1 19. Springer, 2008. Dwork, C., Naor, M., Pitassi, T., and Rothblum, G. N. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 715 724, 2010. Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211 407, 2014. Erlingsson, U., Feldman, V., Mironov, I., Raghunathan, A., Talwar, K., and Thakurta, A. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2468 2479. SIAM, 2019. Feldman, V., Mc Millan, A., and Talwar, K. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. ar Xiv preprint ar Xiv:2012.12803, 2020. Shuffle Private Linear Contextual Bandits Garcelon, E., Perchet, V., Pike-Burke, C., and Pirotta, M. Local differentially private regret minimization in reinforcement learning. ar Xiv preprint ar Xiv:2010.07778, 2020. Garcelon, E., Chaudhuri, K., Perchet, V., and Pirotta, M. Privacy amplification via shuffling for linear contextual bandits. ar Xiv preprint ar Xiv:2112.06008, 2021. Geng, Q., Kairouz, P., Oh, S., and Viswanath, P. The staircase mechanism in differential privacy. IEEE Journal of Selected Topics in Signal Processing, 9(7):1176 1184, 2015. Ghazi, B., Golowich, N., Kumar, R., Pagh, R., and Velingker, A. On the power of multiple anonymous messages. ar Xiv preprint ar Xiv:1908.11358, 2019. Ghazi, B., Kumar, R., Manurangsi, P., and Pagh, R. Private counting from anonymous messages: Near-optimal accuracy with vanishing communication overhead. In International Conference on Machine Learning, pp. 3505 3514. PMLR, 2020. Girgis, A., Data, D., Diggavi, S., Kairouz, P., and Suresh, A. T. Shuffled model of differential privacy in federated learning. In International Conference on Artificial Intelligence and Statistics, pp. 2521 2529. PMLR, 2021. Guha Thakurta, A. and Smith, A. (nearly) optimal algorithms for private online learning in full-information and bandit settings. Advances in Neural Information Processing Systems, 26:2733 2741, 2013. Han, Y., Zhou, Z., Zhou, Z., Blanchet, J., Glynn, P. W., and Ye, Y. Sequential batch learning in finite-action linear contextual bandits. ar Xiv preprint ar Xiv:2004.06321, 2020. Han, Y., Liang, Z., Wang, Y., and Zhang, J. Generalized linear bandits with local differential privacy. ar Xiv preprint ar Xiv:2106.03365, 2021. Hsu, J., Huang, Z., Roth, A., Roughgarden, T., and Wu, Z. S. Private matchings and allocations. SIAM Journal on Computing, 45(6):1953 1984, 2016. Kairouz, P., Liu, Z., and Steinke, T. The distributed discrete gaussian mechanism for federated learning with secure aggregation. In Neur IPS, 2021. Kearns, M., Pai, M., Roth, A., and Ullman, J. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th conference on Innovations in theoretical computer science, pp. 403 410, 2014. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 
661 670, 2010. Liao, C., He, J., and Gu, Q. Locally differentially private reinforcement learning for linear mixture markov decision processes. ar Xiv preprint ar Xiv:2110.10133, 2021. Lowy, A. and Razaviyayn, M. Private federated learning without a trusted server: Optimal algorithms for convex losses. ar Xiv preprint ar Xiv:2106.09779, 2021. Mironov, I. On significance of the least significant bits for differential privacy. In Proceedings of the 2012 ACM conference on Computer and communications security, pp. 650 661, 2012. Mishra, N. and Thakurta, A. (nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 592 601, 2015. Ren, W., Zhou, X., Liu, J., and Shroff, N. B. Multi-armed bandits with local differential privacy. ar Xiv preprint ar Xiv:2007.03121, 2020a. Ren, Z., Zhou, Z., and Kalagnanam, J. R. Batched learning in generalized linear contextual bandits with general decision sets. IEEE Control Systems Letters, 2020b. Sajed, T. and Sheffet, O. An optimal private stochasticmab algorithm based on optimal private stopping rule. In International Conference on Machine Learning, pp. 5579 5588. PMLR, 2019. Schwartz, E. M., Bradlow, E. T., and Fader, P. S. Customer acquisition via display advertising using multiarmed bandit experiments. Marketing Science, 36(4): 500 522, 2017. Shariff, R. and Sheffet, O. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31:4296 4306, 2018. Tenenbaum, J., Kaplan, H., Mansour, Y., and Stemmer, U. Differentially private multi-armed bandits in the shuffle model. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, 2021. URL https: //openreview.net/forum?id=P0Ae Y-ef PEx. Tewari, A. and Murphy, S. A. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pp. 495 517. Springer, 2017. Tossou, A. and Dimitrakakis, C. Achieving privacy in the adversarial multi-armed bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. Shuffle Private Linear Contextual Bandits Vaswani, S., Mehrabian, A., Durand, A., and Kveton, B. Old dog learns new tricks: Randomized ucb for bandit problems. In International Conference on Artificial Intelligence and Statistics, pp. 1988 1998. PMLR, 2020. Vershynin, R. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Vietri, G., Balle, B., Krishnamurthy, A., and Wu, S. Private reinforcement learning with pac and regret guarantees. In International Conference on Machine Learning, pp. 9754 9764. PMLR, 2020. Wang, T., Zhou, D., and Gu, Q. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. ar Xiv preprint ar Xiv:2101.02195, 2021. Xin, Y. and Jaakkola, T. Controlling privacy in recommender systems. Neural Information Processing Systems, 2014. Zheng, K., Cai, T., Huang, W., Li, Z., and Wang, L. Locally differentially private (contextual) bandits learning. In Neur IPS, 2020. Zhou, X. Differentially private reinforcement learning with linear function approximation. ar Xiv preprint ar Xiv:2201.07052, 2022. Zhou, X. and Tan, J. Local differential privacy for bayesian optimization. ar Xiv preprint ar Xiv:2010.06709, 2020. Shuffle Private Linear Contextual Bandits A. 
A Unified Regret Analysis Under Differential Privacy In this section, we will formally state Lemma 3.9, i.e., the generic regret of Algorithm 1 under sub-Gaussian private noise and then present its proof. Let s first recall the following notations. For each batch m [M], let Nm := e Vm Ptm t=tm 1+1 ϕ(ct, at)ϕ(ct, at) denote the additional noise injected into the non-private Gram-matrix and similarly let nm := eum Ptm t=tm 1+1 ϕ(ct, at)yt denote the additional noise injected into the non-private feature-reward vector. Then, we let Hm := λId + Pm i=1 Ni to denote the total noise in the first m batches plus the regularizer, and similarly let hm := Pm i=1 ni. Assumption A.1 (Regularity). For any α (0, 1], Hm is positive definite and there exist constants λmax, λmin and ν depending on α, such that with probability at least 1 α, for all m [M] Hm λmax, H 1 m 1/λmin, hm H 1 m ν. With the above regularity assumption and the boundedness in Assumption 3.1, we fist establish the following general regret bound of Algorithm 1, which can be viewed as a direct generalization of the results in (Shariff & Sheffet, 2018) to the batched case. Lemma A.2. Let Assumptions A.1 and 3.1 hold. Fix any α (0, 1], with probability at least 1 α, the regret of Algorithm 1 satisfies log 2 log 1 + T dλmin d T log 1 + T dλmin + d log 1 + T dλmin In fact, Lemma 3.9 in the main paper is a simple application of Lemma A.2 by considering the following assumption. Assumption A.3 (sub-Gaussian private noise). There exist constants eσ1 and eσ2 such that for all m [M]: (i) Pm m =1 nm is a random vector whose entries are independent, mean zero, sub-Gaussian with variance at most eσ2 1, and (ii) Pm m =1 Nm is a random symmetric matrix whose entries on and above the diagonal are independent sub-Gaussian random variables with variance at most eσ2 2. Let σ2 =max{eσ2 1, eσ2 2}. Now, we are well-prepared to formally state Lemma 3.9 in the main paper. Lemma A.4 (Formal statement of Lemma 3.9). Let Assumptions A.3 and 3.1 hold. Fix time horizon T N, batch size B [T], confidence level α (0, 1]. Set λ = Θ(max{1, σ( log(T/(Bα))}) and βm = q α + d log 1 + T λ. Then, Algorithm 1 achieves regret Reg(T) = O d B log T + d T log(T/α) + O σTd3/4 log T log(T/α) with probability at least 1 α. Remark A.5. The above lemma also presents a regret bound for non-private batched LCB when σ = 0. Note that in this case, our regret bound is achieved with a dimension-independent regularizer, in contrast to the necessary condition on λ = Θ(d) as required in (Ren et al., 2020b) to attain the optimal regret. A.1. Proofs In this section, we present proofs for Lemma A.2 and Lemma A.4 above, respectively. Proof of Lemma A.2. Let E be the event given in Assumption A.1, which holds with probability at least 1 α under Assumption A.1. In the following, we condition on the event E. We first show that bθm concentrates around the true parameter Shuffle Private Linear Contextual Bandits θ with a properly chosen confidence radius βm for all m [M]. To this end, note that bθm = V 1 m um t=1 ϕ(ct, at)ϕ(ct, at) + λId + t=1 ϕ(ct, at)yt + t=1 ϕ(ct, at)ϕ(ct, at) + Hm t=1 ϕ(ct, at)yt + hm By the linear reward function yt = ϕ(ct, at), θ + ηt and elementary algebra, we have θ bθm = V 1 m t=1 ϕ(ct, at)ηt hm Thus, multiplying both sides by V 1/2 m , yields t=1 ϕ(ct, at)ηt V 1 m + Hmθ V 1 m + hm V 1 m t=1 ϕ(ct, at)ηt (Gm+λmin I) 1 + θ Hm + hm H 1 m , where (a) holds by Vm Hm and Vm Gm + λmin I with Gm := Ptm t=1 ϕ(ct, at)ϕ(ct, at) under event E. 
A.1. Proofs

In this section, we present the proofs of Lemma A.2 and Lemma A.4, respectively.

Proof of Lemma A.2. Let $E$ be the event given in Assumption A.1, which holds with probability at least $1-\alpha$. In the following, we condition on $E$. We first show that $\hat{\theta}_m$ concentrates around the true parameter $\theta^*$ with a properly chosen confidence radius $\beta_m$ for all $m \in [M]$. To this end, note that
$$\hat{\theta}_m = V_m^{-1} u_m, \quad \text{with} \quad V_m = \sum_{t=1}^{t_m} \phi(c_t,a_t)\phi(c_t,a_t)^\top + H_m, \qquad u_m = \sum_{t=1}^{t_m} \phi(c_t,a_t)\, y_t + h_m.$$
By the linear reward model $y_t = \langle \phi(c_t,a_t), \theta^* \rangle + \eta_t$ and elementary algebra, we have
$$\theta^* - \hat{\theta}_m = V_m^{-1}\Big( H_m \theta^* - \sum_{t=1}^{t_m} \phi(c_t,a_t)\eta_t - h_m \Big).$$
Thus, taking $\|\cdot\|_{V_m}$ norms and applying the triangle inequality yields
$$\|\theta^* - \hat{\theta}_m\|_{V_m} \le \Big\|\sum_{t=1}^{t_m} \phi(c_t,a_t)\eta_t\Big\|_{V_m^{-1}} + \|H_m\theta^*\|_{V_m^{-1}} + \|h_m\|_{V_m^{-1}} \overset{(a)}{\le} \Big\|\sum_{t=1}^{t_m} \phi(c_t,a_t)\eta_t\Big\|_{(G_m+\lambda_{\min} I)^{-1}} + \|\theta^*\|_{H_m} + \|h_m\|_{H_m^{-1}},$$
where (a) holds by $V_m \succeq H_m$ and $V_m \succeq G_m + \lambda_{\min} I$ with $G_m := \sum_{t=1}^{t_m} \phi(c_t,a_t)\phi(c_t,a_t)^\top$ under event $E$.

Further, by the boundedness condition on $\theta^*$ and event $E$, $\|\theta^*\|_{H_m} \le \sqrt{\lambda_{\max}}$ and $\|h_m\|_{H_m^{-1}} \le \nu$. For the remaining first term, we use the self-normalized inequality (cf. Theorem 1 in (Abbasi-Yadkori et al., 2011)) with the filtration $\mathcal{F}_t = \sigma(c_1, a_1, y_1, \ldots, c_t, a_t, y_t, c_{t+1}, a_{t+1})$. In particular, with probability at least $1-\alpha$, for all $m \in [M]$,
$$\Big\|\sum_{t=1}^{t_m} \phi(c_t,a_t)\eta_t\Big\|_{(G_m+\lambda_{\min} I)^{-1}} \le \sqrt{2\log(1/\alpha) + \log\frac{\det(G_m+\lambda_{\min} I)}{\det(\lambda_{\min} I)}}.$$
Now, using the trace-determinant lemma (cf. Lemma 10 in (Abbasi-Yadkori et al., 2011)) and the boundedness condition on $\|\phi(c,a)\|$, we have
$$\det(G_m + \lambda_{\min} I) \le \big(\lambda_{\min} + t_m/d\big)^d. \tag{1}$$
Putting everything together, with probability at least $1-2\alpha$, for all $m \in [M]$, $\|\theta^* - \hat{\theta}_m\|_{V_m} \le \beta_m$, where
$$\beta_m = \sqrt{2\log(1/\alpha) + d\log\Big(1 + \frac{t_m}{d\lambda_{\min}}\Big)} + \sqrt{\lambda_{\max}} + \nu.$$

With the above concentration result and our OFUL-type algorithm, the instantaneous regret at any round $t$ in batch $m$ can be bounded as
$$2\beta_{m-1}\,\|\phi(c_t,a_t)\|_{V_{m-1}^{-1}} \le 2\beta_M\,\|\phi(c_t,a_t)\|_{(G_{m-1}+\lambda_{\min} I)^{-1}}. \tag{2}$$

At this moment, we note that the standard elliptical potential lemma (cf. Lemma 11 in (Abbasi-Yadkori et al., 2011)) cannot be applied directly in our batch setting, due to the delay in $G_m$. To handle this, inspired by (Wang et al., 2021), we let $\hat{V}_k := \sum_{t=1}^{k} \phi(c_t,a_t)\phi(c_t,a_t)^\top + \lambda_{\min} I_d$, that is, a (virtual) design matrix at the end of round $k$. Hence, $G_{m-1} + \lambda_{\min} I_d = \hat{V}_{t_{m-1}}$. Moreover, for any $t_{m-1} < t \le t_m$, let $m_t := t_{m-1}$, that is, $m_t$ maps round $t$ to the start of the batch that contains $t$. Finally, let $\Gamma_i(\cdot,\cdot) := \beta_M \|\phi(\cdot,\cdot)\|_{\hat{V}_i^{-1}}$. With these notations, the bound in (2) yields
$$\mathrm{Reg}(T) \le \sum_{t=1}^{T} 2\,\Gamma_{m_t}(c_t, a_t).$$

In the sequential case (i.e., $B = 1$), we always have $m_t = t-1$. Thus, the key is to bound the difference between $\Gamma_{m_t}(c_t,a_t)$ and $\Gamma_{t-1}(c_t,a_t)$. To this end, we have the following claim, which will be proved at the end.

Claim A.6. Define the set $\Psi := \{t \in [T] : \Gamma_{m_t}(c_t,a_t)/\Gamma_{t-1}(c_t,a_t) > 2\}$. Then,
$$|\Psi| \le \frac{dT}{2M\log 2}\,\log\Big(1 + \frac{T}{d\lambda_{\min}}\Big).$$

According to Claim A.6, we can decompose the regret as follows:
$$\mathrm{Reg}(T) \overset{(a)}{\le} \sum_{t=1}^{T} \min\{2\Gamma_{m_t}(c_t,a_t), 1\} = \sum_{t \in \Psi} \min\{2\Gamma_{m_t}(c_t,a_t), 1\} + \sum_{t \notin \Psi} \min\{2\Gamma_{m_t}(c_t,a_t), 1\}$$
$$\overset{(b)}{\le} |\Psi| + \sum_{t \notin \Psi} \min\{4\Gamma_{t-1}(c_t,a_t), 1\} \overset{(c)}{\le} |\Psi| + \sum_{t=1}^{T} 4\beta_M \min\{\|\phi(c_t,a_t)\|_{\hat{V}_{t-1}^{-1}}, 1\}$$
$$\overset{(d)}{\le} \frac{dT}{2M\log 2}\,\log\Big(1 + \frac{T}{d\lambda_{\min}}\Big) + 4\beta_M\sqrt{2dT\log\Big(1 + \frac{T}{d\lambda_{\min}}\Big)},$$
where (a) holds by the boundedness of rewards; (b) holds by the definition of $\Psi$; (c) holds by the fact that $\beta_M \ge 1$; (d) follows from Claim A.6 and the standard argument for linear bandits, i.e., Cauchy-Schwarz and the elliptical potential lemma (cf. Lemma 11 in (Abbasi-Yadkori et al., 2011)). Hence, we have finished the proof of Lemma A.2.

Finally, we give the proof of Claim A.6. For any $t \in \Psi$, suppose $t_{m-1} < t \le t_m$ for some $m$. Then, we have $m_t = t_{m-1}$ and
$$\log\det(\hat{V}_{t_m}) - \log\det(\hat{V}_{t_{m-1}}) \overset{(a)}{\ge} \log\det(\hat{V}_{t-1}) - \log\det(\hat{V}_{m_t}) \overset{(b)}{\ge} 2\log\big(\Gamma_{m_t}(c_t,a_t)/\Gamma_{t-1}(c_t,a_t)\big) > 2\log 2,$$
where (a) holds by the fact that $\hat{V}_{t_m} \succeq \hat{V}_{t-1}$ and $\hat{V}_{t_{m-1}} = \hat{V}_{m_t}$; (b) holds by Lemma 12 in (Abbasi-Yadkori et al., 2011), that is, for two positive definite matrices $A, B \in \mathbb{R}^{d\times d}$ satisfying $A \succeq B$ and any $x \in \mathbb{R}^d$, $\|x\|_A \le \|x\|_B \sqrt{\det(A)/\det(B)}$; note that here we also use $\det(A) = 1/\det(A^{-1})$. Therefore, if we let $\hat{\Psi} := \{m \in [M] : \log\det(\hat{V}_{t_m}) - \log\det(\hat{V}_{t_{m-1}}) > 2\log 2\}$, then $|\Psi| \le (T/M)|\hat{\Psi}|$. Thus, we only need to bound $|\hat{\Psi}|$.
Note that for each m, log det(b Vtm) log det(b Vt(m 1)) 0, and hence 2 log 2 |bΨ| X m bΨ log det(b Vtm) log det(b Vt(m 1)) m=1 log det(b Vtm) log det(b Vt(m 1)) = log det(GM + λmin I) det(λmin I) Shuffle Private Linear Contextual Bandits Algorithm 2 Local Randomizer RAmp 1: Parameters: σ1, σ2, d 2: function R1(ϕ(c, a)y) 3: Sample fresh noise n N(0, σ2 2Id d) 4: M1 = ϕ(c, a)y + n 5: return Mt,1 6: end function 7: function R2(ϕ(c, a)ϕ(c, a) ) 8: Sample fresh noise N(i,j) N(0, σ2 1), i j d and let N(j,i) = N(i,j) 9: M2 = ϕ(c, a)ϕ(c, a) + N 10: return M2 11: end function Finally, using the same analysis as in (1), yields |bΨ| d 2 log 2 log 1 + T dλmin which directly implies the result of Claim A.6. Proof of Lemma A.4. To prove the result, thanks to Lemma A.2, we only need to determine the three constants λmax, λmin and ν under the sub-Gaussian private noise assumption in Assumption A.3. To this end, we resort to concentration bounds for sub-Gaussian random vector and random matrix. To start with, under (i) in Assumption A.3, by the concentration bound for the norm of a vector containing sub-Gaussian entries (cf. Theorem 3.1.1 in (Vershynin, 2018)) and a union bound over m, we have for all m [M] and any α (0, 1], with probability at least 1 α/2, for some absolute constant c1, = hm Σn := c1 eσ1 ( By (ii) in Assumption A.3, the concentration bound for the norm of a sub-Gaussian symmetric random matrix (cf. Corollary 4.4.8 (Vershynin, 2018)) and a union bound over m, we have for all m [M] and any α (0, 1], with probability at least 1 α/2, ΣN := c2 eσ2 ( for some absolute constant c2. Thus, if we choose λ = 2ΣN, we have Hm = λId + Pm i=1 Ni 3ΣN, i.e., λmax = 3ΣN, and λmin = ΣN. Finally, to determine ν, we note that hm H 1 m 1 λmin hm c σ ( log(M/α)) 1/2 := ν, where σ = max{eσ1, eσ2}. The final regret bound is obtained by plugging the three values into the result given by Lemma A.2. B. Analysis of LDP Amplification Protocol B.1. Pseudocode of PAmp The shuffle protocol is given by PAmp =(RAmp, SAmp, AAmp), in which RAmp is presented in Algorithm 2, SAmp is presented in Algorithm 3, and AAmp is presented in Algorithm 4. Shuffle Private Linear Contextual Bandits Algorithm 3 Shuffler SAmp 1: Input: {Mτ,1}τ B and {Mτ,2}τ B, in which B is a batch and Mτ,1 Rd, Mτ,2 Rd d come from user τ 2: function S1({Mτ,1}τ B) 3: Generate a uniform permutation π of indexes in B 4: Set Y1 = (Mπ(1),1, . . . , Mπ(B),1) 5: return Y1 6: end function 7: function S2({Mτ,2}τ B) 8: Generate a uniform permutation π of indexes in B 9: Set Y2 = (Mπ(1),2, . . . , Mπ(B),2) 10: return Y2 11: end function Algorithm 4 Analyzer AAmp 1: Input: Shuffled outputs Y1 = (Mπ(1),1, . . . , Mπ(B),1) and Y2 = (Mπ(1),2, . . . , Mπ(B),2) 2: function A1(Y1) 3: return PB i=1 Mπ(i),1 4: end function 5: function A1(Y2) 6: return PB i=1 Mπ(i),2 7: end function B.2. Main Results Theorem B.1 (Restatement of Theorem 3.2). Fix time horizon T N, batch size B [T], confidence level α (0, 1], , and privacy budgets ε (0, q B ], δ (0, 1]. Then, Algorithm 1 instantiated with shuffle protocol PAmp with noise levels σ1 =σ2 = 4 2 log(2.5B/δ) log(2/δ) B , and regularizer λ = Θ( log(T/Bα)), enjoys the regret d B log T + log1/4(B/δ) log1/4(2/δ) ε1/2B1/4 d3/4T 3/4 log T log(T/α) with probability at least 1 α. Moreover, it satisfies O(ε, δ)-shuffle differential privacy (SDP). Corollary B.2 (Utility-targeted). Under the same assumption in Theorem B.1 and Algorithm 1 is instantiated with PAmp. 
Let B = O(d 1/5ε 2/5T 3/5 log1/5(T/δ) log1/5(2/δ), Algorithm 1 achieves O(ε, δ)-SDP with regret Reg(T) = e O d4/5T 3/5ε 2/5 log1/5(T/δ) log1/5(2/δ) . Simultaneously, Algorithm 1 also achieves O(ε, δ)-JDP and O(ε0, δ0)-LDP where ε0 = O ε4/5T 3/10d 1/10 log1/10(T/δ) log 2/5(2/δ) , δ0 = O δd1/5T 3/5ε2/5 log 1/5(T/δ) log 1/5(2/δ) . Corollary B.3 (Privacy-targeted). Let Assumption 3.1 hold and Algorithm 1 is instantiated with PAmp. For any ε0 [0, 1] and δ0 (0, 1], let σ1 = σ2 = 4 2 log(2.5/δ0) ε0 . Then, for all B [T], Algorithm 1 is (ε0, δ0)-LDP. Further suppose B = O(d 1/4T 3/4ε 1/2 0 log1/4(1/δ0)), then Algorithm 1 achieves regret Reg(T) = O d3/4T 3/4ε 1/2 0 log1/4(1/δ0) . Simultaneously, Algorithm 1 achieves O(ε, δ)-SDP and O(ε, δ)-JDP where ε = O ε5/4 0 T 3/8d1/8 log3/8(1/δ0) , δ = O(δ0d 1/4T 3/4ε 1/2 0 log1/4(1/δ0)). Shuffle Private Linear Contextual Bandits B.3. Proofs To prove Theorem B.1, we need the following important lemma, which can be seen as a special case of Theorem 3.8 in (Feldman et al., 2020). In particular, in our paper, we consider a fixed local randomizer rather than the more general adaptive one in (Feldman et al., 2020). Another difference is that we consider the case of randomizer-then-shuffle rather than the shuffle-then-randomizer. However, as pointed in (Feldman et al., 2020), the two cases are equivalent when the local randomizer is a fixed one. Lemma B.4 (Amplification by shuffling). Consider a one-round protocol P = (R, S, A) over n users. Let R be an (ε0, δ0)-LDP mechanism. Then, for any δ [0, 1] such that ε0 log( n 16 log(2/δ )), P is (eε, eδ)-SDP, i.e., the analyzer s view is (eε, eδ)-DP, where eε0 log(4/δ ) n + 8eε0 , eδ = δ + (eε + 1) 1 + e ε0 That is, when ε0 > 1, eε = O eε0 log(1/δ ) n and when ε0 1, eε = O ε0 log(1/δ ) n Roughly speaking, we have a privacy amplification by a factor n due to shuffling, which is the key to our analysis. Proof of Theorem B.1. To apply Lemma B.4, we choose δ = δ/2 and ε0 = ε log(1/δ ) = ε log(2/δ). For any ε B], we have ε0 1, which implies eε = O(ε). Meanwhile, we let δ0 = δ/B for any δ [0, 1], which implies that eδ = O(δ). Now, we are only left to choose σ1 and σ2 in RAmp so that it is (ε0, δ0)-LDP. To this end, via the standard Gaussian mechanism and boundedness assumption, we have when σ1 = σ2 = 4 p 2 log(2.5/δ0) RAmp is (ε0, δ0)-LDP. Finally, plugging in ε0 = ε log(2/δ) and δ0 = δ/B, yields σ1 = σ2 = 4 p 2 log(2.5B/δ) log(2/δ) Finally, plugging the value σ = 4 2T log(2.5B/δ) log(2/δ) B (since there are total at most T noise) into the regret bound in Lemma A.4 yields the required results. Proof of Corollary B.2. To establish the regret bound, we simply choose a balanced B in the regret bound given by Theorem B.1. To prove the JDP guarantee, we will use the powerful Billboard lemma (cf. Lemma 9 in (Hsu et al., 2016)), which says that an algorithm is JDP if the action recommended to each user is a function of her private data and a common signal computed in a differential private way. In our case, the private data is user s context and the common signal is the updated policy (i.e., bθm and design matrix Vm), which is a post-processing of shuffle outputs. Thus, the SDP guarantee directly implies the JDP guarantee in our case. Finally, the LDP guarantee simply follows from the standard Gaussian mechanism with parameter ε0 = ε log(2/δ) and δ0 = δ/B. Proof of Corollary B.3. The LDP guarantee follows from standard Gaussian mechanism. To show the regret bound, we will use the result in Theorem B.1. 
In particular, comparing the values of σ1, σ2 in Corollary B.3 and the values in Theorem B.1, we can plug ε = ε0 B and δ = δ0B into the regret bound in Theorem B.1. Then, with a balanced choice of B, we obtain the required regret. The SDP guarantee also follows from Theorem B.1 with ε = ε0 B and δ = δ0B. Finally, as in the proof of Corollary B.2, the JDP guarantee follows from SDP guarantee and Billboard lemma. Shuffle Private Linear Contextual Bandits Algorithm 5 Local Randomizer RVec 1: Parameters: g, b, p, d 2: # Local randomizer for a scalar within [0, ] 3: function R (x, ) 4: Set x = xg/ 5: Sample rounding value γ1 Ber(xg/ x) 6: Set bx = x + γ1 7: Sample binomial noise γ2 Bin(b, p) 8: Set m be a multi-set containing bx + γ2 copies of 1 and (g + b) (bx + γ2) copies of 0. 9: return m 10: end function 11: function R1(ϕ(c, a)y) 12: Set 1 = 1 13: for each coordinate k [d] do 14: Shift data wk = [ϕ(c, a)y]k + 1 15: Run the scalar randomizer mk = R (wk, 1) 16: end for 17: # Labeled outputs (all bits in mk are labeled by k) 18: M1 = {(k, mk)}k [d] 19: return M1 20: end function 21: function R2(ϕ(c, a)ϕ(c, a) ) 22: Set 2 = 1 23: for all i j d do 24: Shift data w(i,j) = [ϕ(c, a)ϕ(c, a) ](i,j) + 2 25: Run the scalar randomizer to obtain m(i,j) = R (w(i,j), 2) and m(j,i) = m(i,j) 26: end for 27: # Labeled outputs 28: M2 = {((i, j), m(i,j))}(i,j) [d] [d] 29: return M2 30: end function C. Analysis of Vector Summation Protocol C.1. Pseudocode of PVec The shuffle protocol is given by PVec =(RVec, SVec, AVec), in which RVec is presented in Algorithm 5, SVec is presented in Algorithm 6, and AVec is presented in Algorithm 7. Note that the original algorithm for the analyzer in (Cheu et al., 2021) has a small issue in the de-bias process (cf. Algorithm 2 in (Cheu et al., 2021)). In particular, instead of subtracting the norm , one needs to subtract B , see Lines 11 and 19 in Algorithm 7. Here, B corresponds to n in Algorithm 2 of (Cheu et al., 2021). C.2. Main Results Theorem C.1 (Restatement of Theorem 3.5). Fix batch size B [T], privacy budgets ε (0, 15], δ (0, 1/2). Then, Algorithm 1 instantiated with PVec with parameters p = 1/4, g = max{2 B, d, 4} and b = 24 104 g2 log2(4(d2+1)/δ) (ε, δ)-SDP. Furthermore, for any α (0, 1], setting λ=Θ log(d2/δ) log(T/(Bα)) , it enjoys the regret d B log T + log1/2(d2/δ) ε1/2B1/4 d3/4T 3/4 log T log(T/α) with probability at least 1 α. Corollary C.2 (Utility-targeted). Under the same assumption in Theorem C.1 and Algorithm 1 is instantiated with PVec. Let Shuffle Private Linear Contextual Bandits Algorithm 6 Shuffler SVec 1: Input: {Mτ,1}τ B and {Mτ,2}τ B, in which B is a batch of users. Mτ,1 = {(k, mk)}k [d] and M2 = {((i, j), m(i,j))}(i,j) [d] [d] are labeled data of user τ 2: function S1({Mτ,1}τ B) 3: Uniformly permutes all messages, i.e., a total of (g + b) B d bits 4: Set yk be the collection of bits labeled by k [d] 5: Set Y1 = {y1, . . . , yd} 6: return Y1 7: end function 8: function S2({Mτ,2}τ B) 9: Uniformly permutes all messages, i.e., a total of (g + b) B d2 bits 10: Set y(i,j) be the collection of bits labeled by (i, j) [d] [d] 11: Set Y2 = {y(i,j)}(i,j) [d] d 12: return Y2 13: end function B = O(d 1/5ε 2/5T 3/5 log2/5(d2/δ)), Algorithm 1 achieves (ε, δ)-SDP with regret Reg(T) = e O d4/5T 3/5ε 2/5 log2/5(d2/δ) . Simultaneously, Algorithm 1 also achieves O(ε, δ)-JDP and O(ε0, δ0)-LDP where ε0 = O ε4/5T 3/10d 1/10 log1/5(d2/δ) , δ0 = O(δ). Corollary C.3 (Privacy-targeted). Let Assumption 3.1 hold and Algorithm 1 is instantiated with PVec. 
For any ε0 (0, 15] and δ0 (0, 1/2), let g = max{d, 4}, b = 24 104 g2 log 4 (d2+1) ε2 0 , p = 1/4, Then, for all B [T], Algorithm 1 is (ε0, δ0)-LDP. Further suppose B = O(d 1/4T 3/4ε 1/2 0 (log(d2/δ0))1/2), then Algorithm 1 achieves regret Reg(T) = e O d3/4T 3/4 log1/2(d2/δ0) ε0 Simultaneously, Algorithm 1 also achieves O(ε, δ)-SDP and O(ε, δ)-JDP where ε = O ε5/4T 3/8d1/8(log(d2/δ0)) 1/4 , δ = O(δ0). C.3. Proofs Proof of Theorem C.1. The privacy part follows from the one-round SDP guarantee of vector summation protocol in (Cheu et al., 2021). In particular, by Theorem 3.2 in (Cheu et al., 2021), we have to properly choose parameters g, b, p in RVec. To this end, by adapting the results of Lemma 3.1 in (Cheu et al., 2021), we have in our case when one chooses B, d, 4}, b = 24 104 g2 log 4 (d2+1) ε2B , p = 1/4, PVec is (ε, δ)-SDP. It is worth pointing out that here we choose b such that p = 1/4, which is necessary for our following analysis on the tail of the private noise. This is the key difference compared to the original one in (Cheu et al., 2021) where the variance of the noise is sufficient. Shuffle Private Linear Contextual Bandits Algorithm 7 Analyzer AVec 1: Input: Shuffled outputs Y1 = {yk}k [d] and Y2 = {y(i,j)}(i,j) [d] d 2: Initialize: g, b, p 3: # Analyzer for a collection y of (g + b) B bits using 4: function A (y, ) g (P(g+b) B i=1 yi) p b B 6: end function 7: function A1(Y1) 8: 1 = 1 9: for each coordinate k [d] do 10: Run analyzer on k-th labeled data to obtain zk = A (yk, 1) 11: Re-center: ok = zk B 1 12: end for 13: return {o1, . . . , ok} 14: end function 15: function A2(Y2) 16: 2 = 1 17: for all i j d do 18: Run analyzer on (i, j)-th labeled data to obtain z(i,j) = A (y(i,j), 2) 19: Re-center: o(i,j) = z(i,j) B 2 and o(j,i) = o(i,j) 20: end for 21: return {o(i,j)}(i,j) [d] [d] 22: end function Now, we turn to regret analysis. Thanks to our general regret bound in Corollary A.4, we only need to verify the condition of sub-Gaussian private noise in the protocol PVec (in particular RVec). To this end, we need a more careful analysis compared to (Cheu et al., 2021) as the issue pointed above. Fix any coordinate k [d], we will determine the private noise in k, which motivates us to check the scalar randomizer R in RVec. Consider a batch of users. Let zi denote the sum of g + b bits generated by user i using R . That is, we have zi = xi + γ1,i + γ2,i. This implies that xi + γ1,i + xi g xi + γ2,i bp. Define shifted random variables ι1,i := γ1,i + xi g xi and ι2,i := γ2,i bp. Thus, taking the summation over all i within a given batch B of size B, yields X i B zi B b p = g i B ι1,i + X which implies that i B zi B b p Note that the above is exactly the output of the analyzer A in PVec. Thus, to verify the sub-Gaussian condition in Assumption A.3, we only need to show that the last two terms above are zero-mean and sub-Gaussian random variables. To this end, we note that γ1,i is draw from Ber( g xi xi). Hence, E [ι1,i] = 0 and ι1,i is sub-Gaussian with variance 1 since ι1,i [ 1, 1]. By independence of private noise across i, we have P i B ι1,i is sub-Gaussian with variance of B. Similarly, since γ2,i is independently sampled from binomial Bin(b, p), we have E [ι2,i] = 0 and P i B ι2,i can be viewed as a sum of B b bounded random variable within [0, 1], hence it is sub-Gaussian with variance of B b/4. 
Therefore, the total noise Shuffle Private Linear Contextual Bandits i B ι2,i is sub-Gaussian with variance given by g2 B b/4 (a) = 1 g2 B b/4 = O (log(d2/δ))2 where (a) holds by the fact that in PVec, = 1. Thus, this implies that eσ2 1, eσ2 2 in Assumption A.3 are satisfied with O M (log(d2/δ))2 ε2 , hence σ in Lemma A.4 is given by σ = O p T/B log(d2/δ) ε , which leads to the following regret bound R(T) = O d B + (log(d2/δ))1/2 ε d3/4B 1/4T 3/4 , Hence, we finish the proof. Proof of Corollary C.2. The regret bound simply follows from a balanced choice of B in Theorem C.1. As before, JDP follows from SDP and Billboard lemma. To show the LDP guarantee, one way is to use DP property of Binomial mechanism and the refined advanced composition in (Cheu et al., 2021) across dimensions (cf. Lemma 3.3 in (Cheu et al., 2021)). However, there is a simple way to achieve this by noting that when B = 1, the SDP guarantee of PVec also implies LDP guarantee since now the shuffle output is the same as the output at each local randomizer9. Thus, by comparing the values of b for a general B and the case when B = 1, we can see that ε0 = ε B and δ0 = δ, i.e., an implicit privacy amplification by B. Note that, this simple way might lead to a larger term in δ. A careful analysis via Binomial mechanism and the (refined) advanced composition could yield something like ε0 = ε log(d2/δ) and δ0 = δ/d2, where d2 comes from the d d matrix in the computation. Here we choose the simple way to avoid additional complexity for clarity. Proof of Corollary C.3. The LDP guarantee follows from the same trick as in the proof of Corollary C.2 which helps to avoid Binomial mechanism and advance composition over dimensions. To establish the regret bound, we can compare the values of b in Corollary C.3 and the one in Theorem C.1. In particular, we can plug ε = ε0 B and δ = δ0 into the regret bound in Theorem C.1. Then, with a balanced choice of B, we obtain the required regret. The SDP guarantee also follows from Theorem C.1 with ε = ε0 B and δ = δ0. Finally, as in the proof of Corollary B.2, the JDP guarantee follows from SDP guarantee and Billboard lemma. D. Joint Differenital Privacy In this section, we will give formal DP definitions in the central model for linear contextual bandits. In particular, we first present the standard (event-level) definition which assumes all users are unique and then generalize it to (user-level) definition that allows for returning users. To this end, we first give the following general DP definition. Definition D.1 (General DP). A randomized mechanism M : D R satisfies (ε, δ)-differential privacy if for any two adjacent datasets X, X D and for any measurable subsets of outputs Y R it holds that P [M(X) Y] exp(ε)P [M(X ) Y] + δ. Remark D.2. All the DP definitions in our main paper can be viewed as a particular instantiation of Definition D.1 in terms of adjacent relation between two datasets and the corresponding output sequences. A straightforward adaptation of Definition D.1 to linear contextual bandits in the central model is to consider the sequence of T unique users as the dataset, denoted by UT := {u1, . . . , u T } UT , and the corresponding prescribed actions as the output sequence, denoted by M(UT ) := {a1, . . . , at} AT . This is the central trust model because the learning agent in the protocol can have direct access to users sensitive information, but all the prescribed actions via the deployed algorithm are indistinguishable on two neighboring user sequences. 
Unfortunately, it is not hard to see that this is in conflict with the goal of personalization of linear contextual bandits, which essentially requires the algorithm to prescribe different actions to different users according to their contexts. Indeed, as shown in (Shariff & Sheffet, 2018), any learning protocol that satisfy the above notion of privacy protection has to incur a linear regret. Hence, to obtain a non-trivial utility-privacy trade-off, we need to relax DP to the notion called joint differential privacy (JDP) (Kearns et al., 2014) in the central model, which 9Here, we can assume that each local randomizer already randomly orders the g + b bits before they are sent out. Shuffle Private Linear Contextual Bandits requires that simultaneously for any user ut UT , the joint distribution of the actions recommended to all users other than ut be differentially private in the type of the user ut. It weakens the classic DP notion only in that the action suggested specifically to ut may be sensitive in her type (i.e., context and reward responses10), as required by personalization. However, JDP is still a very strong definition since it protects ut from any arbitrary collusion of other users against her, so long as she does not herself reveal the action suggested to her. Formally, we let M t(UT ) := M(UT ) \ {at} to denote all the actions prescribed by the deployed algorithm excluding the one recommended to ut and based on it we have the definition of JDP as follows. Definition D.3 (Joint Differential Privacy (JDP)). A learning process of linear contextual bandits is (ε, δ)-joint differentially private if its deployed algorithm M : UT AT satisfies that for all t [T], for all neighboring user sequences UT , U T UT differing only on the t-th user and for all set of actions A t AT 1 given to all but the t-th user, P [M t(UT ) A t] exp(ε)P [M t(U T ) A t] + δ. The above JDP definition assumes that all the T users are unique, which is the standard event-level DP considered in existing similar works (Shariff & Sheffet, 2018; Vietri et al., 2020; Chowdhury & Zhou, 2021). That is, since each user only contributes one event in the total T rounds, two user sequences UT and U T are said to be adjacent if they only differ at one round t [T]. However, a more practical situation is that one user could contribute her data at multiple rounds, i.e., returning users. This motivates us to consider a user-level JDP, in which two user sequences UT and U T are adjacent if one replaces all the data associated with user u to u in UT results in U T . In this case, changing one user in the sequence could affect the data at multiple rounds. Accordingly, the output sequences need to remove all the actions at these rounds to avoid the conflict with personalization. Following the notations in (Dwork et al., 2010), we say UT and U T are neighboring sequences if there exist u, u such that if one replace some of u in UT , the resultant sequence is U T . Formally, UT , U T are neighboring with neighboring indices I, if there exist u, u U and index set I [T] such that UT |I:u u = U T , in which UT |I:u u means replacing u by u in UT at all indices in I. Meanwhile, we let M I(UT ) := M(UT ) \ a I, where a I is the set of actions at indices in I. With these notations, we have the following formal definition. Definition D.4 (User-level JDP). 
A learning process of linear contextual bandits is (ε, δ)-joint differentially private if its deployed algorithm M : UT AT satisfies that for all neighboring user sequences UT , U T UT with neighboring indices given by I, and for all set of actions A I AT |I|, P [M I(UT ) A I] exp(ε)P [M I(U T ) A I] + δ. Remark D.5. A straightforward way to achieve user-level JDP via event-level JDP is to use group privacy property of DP (Dwork et al., 2014; Vietri et al., 2020). In particular, suppose a mechanism is (ε, δ)-JDP (event-level), then it is (kε, ke(k 1)εδ)-JDP (user-level) if each user contributes at most k rounds. This black-box approach leads to a large increase in δ. We will show that a careful and direct analysis can improve this part while the linear increase in ε is unchanged. This makes sense since the sensitivity now increases by a factor of k. E. Regret and Privacy Analysis Under Returning Users We consider the following returning users case. Assumption E.1 (Returning Users). Fix a batch size B, any particular user can potentially participates in all M = T/B batches, but within each batch m [M], she only contributes once. Under the above assumption, our previous SDP guarantee from one-round SDP protocol is no longer true. Instead, we now need to guarantee that outputs of all the batches together have a total privacy loss of (ε, δ), since all of them may reveal the sensitive information of a given user if she participates in all the batches, i.e., worst-case scenario. To this end, we resort to advanced composition theorem (Dwork et al., 2014), which is restated as follows for an easy reference. Theorem E.2 (Advanced composition). Given target privacy parameters ε (0, 1) and δ > 0, to ensure (ε , kδ + δ )-DP for the composition of k (adaptive) mechanisms, it suffices that each mechanism is (ε, δ)-DP with ε = ε 2k log(1/δ ). 10Technically speaking, the type of the user is identified by the reward response she would give to all possible actions recommended based on her context information. Shuffle Private Linear Contextual Bandits E.1. LDP Amplification Protocol Theorem E.3 (Formal statement of (i) in Theorem 4.2). Let Assumption 3.1 and Assumption E.1 hold. For any ε 2T], δ (0, 1] and B [T], let σ1 = σ2 = 16 log(2/δ) T (log(5T/δ)) εB . Then, Algorithm 1 instantiated using PAmp is O(ε, δ)-SDP. Furthermore, for any α (0, 1], setting λ = Θ( log(T/Bα)), it has the following regret ε d3/4 log1/4(T/δ) log1/2(2/δ) log T log(T/α) The following corollary says that if the batch schedule also depends on privacy parameters, one can improve the dependence on ε, i.e., from ε 1/2 to ε 1/3. Corollary E.4 (Utility-targeted). Under the same assumption in Theorem E.3 and B = O(d 1/6ε 1/3T 2/3(log(T/δ))1/2), Algorithm 1 instantiated using PAmp achieves O(ε, δ)-SDP with regret R(T) = O d5/6T 2/3ε 1/3 (log(T/δ))1/2 . Proof of Theorem E.3. First, by advanced composition in Theorem E.2, if we let each batch s privacy parameters be εm = ε 2 2M log(2/δ) and δm = δ/(2M), then final privacy guarantee is (ε, δ)-DP. Thus, we only need to replace ε by εm and δ by δm in Theorem B.1 E.2. Vector Summation Protocol Theorem E.5 (Formal statement of (ii) in Theorem 4.2). Let Assumption 3.1 and Assumption E.1 hold. Then, for any ε 15, δ (0, 1/2) and B [T], let B, d, 4}, b = 107 log(2/δ) g2 T log 8 T (d2+1) ε2B2 , p = 1/4. Algorithm 1 instantiated using PVec is (ε, δ)-SDP. Furthermore, for any α (0, 1], setting log(2/δ) log(d2T/(Bδ)) log(T/(Bα)) ! 
then it has the regret bound ε d3/4 log3/4(d2M/δ) log T log(T/α) Corollary E.6 (Utility-targeted). Under the same assumption in Theorem E.5, B = O(d 1/6ε 1/3T 2/3(log(Td2/δ))1/2), Algorithm 1 instantiated using PVec achieves (ε, δ)-SDP with regret R(T) = e O d5/6T 2/3ε 1/3 log(d2T/δ) 1/2 . Proof of Theorem E.5. First, by advanced composition in Theorem E.2, if we let each batch s privacy parameters be εm = ε 2 2M log(2/δ) and δm = δ/(2M), then final privacy guarantee is (ε, δ)-DP. Thus, we only need to replace ε by εm and δ by δm in Theorem C.1 E.3. JDP under Returning Users As mentioned before, existing algorithm with JDP guarantee assumes unique users, i.e., event-level JDP given by Definition D.3. To handle returning users, we need to consider user-level JDP given by Definition D.4. One straightforward way is Shuffle Private Linear Contextual Bandits to resort to group privacy (Dwork et al., 2014). That is, if any user appears at most M0 rounds in the process, the original (ε, δ)-JDP algorithm proposed in (Shariff & Sheffet, 2018) now achieves (M0ε, M0 exp((M0 1)ε)δ)-JDP (user-level). However, this black-box will incur a large loss in the δ term. To overcome this, we note that a simple modification of the added noise in the original algorithm in (Shariff & Sheffet, 2018) will work. In particular, we scale up the noise variance by a multiplicative factor of M 2 0 , if any user participates in at most M0 rounds. This follows from the fact that flipping one user now would change the ℓ2 sensitivity of the expanded binary-tree nodes from O( log T) to O(M0 log T). Then, utilizing our derived generic regret bound in Lemma A.4, yields the following result. Proposition E.7. If any user participates in at most M0 rounds, the algorithm in Shariff & Sheffet (2018), with the above modification to handle user-level privacy, achieves the high-probability regret bound Reg(T) = e O ε d3/4 log1/4(1/δ) Proof. The key idea behind the regret analysis in the central model for linear contextual bandits in (Shariff & Sheffet, 2018) is to utilize the following two properties of the so-called tree-based mechanism (or binary counting mechanism) (Chan et al., 2010): (i) change of each leaf-node (corresponding to a user s data) only incurs the change of l2-sensitivity of the expanded binary-tree by O( log T); (ii) for any t [T], the summation of data from time 1 to t only involves at most O(log T) tree nodes. Property (i) is used to compute the added noise at each node to guarantee privacy while property (ii) is used to compute the total noise in the private sum when bounding the regret. Now, in the case of returning users, if we flip one user s data, it will change the l2-sensitivity of the expanded binary-tree by O(M0 log T), i.e., an additional M0 factor in the sensitivity, which leads to the additional M 2 0 factor in the added noise. Property (ii) is the same as before, i.e., total number of noise is at most O(log T). Finally, by Lemma A.4, we have the result. F. Batched Algorithms for Local and Central Models To start with, for the batched algorithm in the local model, one can simply replace the shuffler in Algorithm 1 by an identity mapping while using the same local randomizer as in (Zheng et al., 2020) (i.e., Gaussian mechanism). We call this algorithm Batched-Local-Lin UCB. Thanks to Lemma A.4, we have the following privacy and regret guarantees. Proposition F.1. Let Assumption 3.1 hold. Fix any ε0 [0, 1], δ0 (0, 1] and α [0, 1], let σ1 = σ2 = 4 2 log(2.5/δ0) ε0 . 
Then, for all $B \in [T]$, Batched-Local-LinUCB is $(\varepsilon_0, \delta_0)$-LDP and, with probability at least $1-\alpha$,
$$R(T) = O\Big(dB + \frac{T^{3/4} d^{3/4} (\log(1/\delta_0))^{1/4}}{\sqrt{\varepsilon_0}}\, \log(T/\alpha)\Big).$$

Remark F.2. The above proposition indicates that it suffices to update only every $B = O(T^{3/4})$ rounds to ensure the same privacy and regret guarantees as in the sequential case.

For the batched algorithm in the central model, we can make the following simple modification to the sequential algorithm of (Shariff & Sheffet, 2018), which relies on the seminal tree-based mechanism (Chan et al., 2010) at the central server (analyzer) to balance privacy and regret. In the batched case, instead of updating the binary-tree nodes after every round, the server updates them only after each batch, treating the sum of the statistics (i.e., vectors and matrices) within the batch as a single new observation. We call this algorithm Batched-Central-LinUCB. With this modification and Lemma A.4, we have the following privacy and regret guarantees.

Proposition F.3. Let Assumption 3.1 hold. Fix any $\varepsilon \in [0, 1]$, $\delta \in (0, 1]$ and $\alpha \in (0, 1]$. Then, for all $B \in [T]$, Batched-Central-LinUCB is $(\varepsilon, \delta)$-JDP and, with probability at least $1-\alpha$,
$$R(T) = O\Big(dB + \frac{\sqrt{T}\, d^{3/4} (\log(1/\delta))^{1/4}}{\sqrt{\varepsilon}}\, \log(T/\alpha)\Big).$$

Remark F.4. The above proposition indicates that it suffices to update only every $B = O(\sqrt{T})$ rounds to attain the same privacy-regret trade-off as in the sequential case.

G. Additional Experimental Results

[Figure 2 consists of two panels plotting cumulative regret against rounds (0 to 20000) for LinUCB, LinUCB-JDP, LinUCB-SDP-Amp, LinUCB-SDP-Vec and LinUCB-LDP.]

Figure 2: Comparison of cumulative regret for LinUCB (non-private), LinUCB-JDP (central model), LinUCB-SDP (shuffle model) and LinUCB-LDP (local model) with privacy level ε = 1 for varying feature dimension d = 10 (a) and d = 15 (b). In all cases, the regret of LinUCB-SDP lies perfectly in between that of LinUCB-JDP and LinUCB-LDP, achieving a finer regret-privacy trade-off.
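For completeness, the following is a minimal sketch of a simulation harness of the kind that could produce curves like those in Figure 2. The instance generation (unit-norm parameter and features, Gaussian reward noise, a fixed number of candidate actions) and the agent interface are our assumptions; the exact experimental configuration is not specified in this appendix.

```python
import numpy as np

def make_instance(d=10, K=20, T=20000, seed=0):
    """Hypothetical synthetic instance (assumed setup, not the paper's exact one):
    unit-norm parameter theta* and unit-norm feature vectors phi(c_t, a)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)
    feats = rng.normal(size=(T, K, d))
    feats /= np.linalg.norm(feats, axis=2, keepdims=True)
    return theta, feats, rng

def cumulative_regret(agent, theta, feats, rng, noise_sd=0.1):
    """Run an agent (select(t, Phi) -> arm index, update(phi, y) -> None) and
    return the cumulative pseudo-regret trajectory, as plotted in Figure 2."""
    T = feats.shape[0]
    traj, total = np.zeros(T), 0.0
    for t in range(T):
        means = feats[t] @ theta                 # mean rewards of the K candidate actions
        a = agent.select(t, feats[t])
        y = means[a] + noise_sd * rng.normal()   # observed noisy reward
        agent.update(feats[t, a], y)
        total += means.max() - means[a]          # instantaneous pseudo-regret
        traj[t] = total
    return traj
```

Each private variant (central, shuffle, or local) would be plugged in as an agent whose update step consumes only suitably privatized batch statistics.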