# Differentially Private Contextual Linear Bandits

Roshan Shariff, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. roshan.shariff@ualberta.ca
Or Sheffet, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. osheffet@ualberta.ca

Abstract. We study the contextual linear bandit problem, a version of the standard stochastic multi-armed bandit (MAB) problem where a learner sequentially selects actions to maximize a reward which also depends on a user-provided per-round context. Though the context is chosen arbitrarily or adversarially, the reward is assumed to be a stochastic function of a feature vector that encodes the context and the selected action. Our goal is to devise private learners for the contextual linear bandit problem. We first show that using the standard definition of differential privacy results in linear regret. So instead we adopt the notion of joint differential privacy, where we assume that the action chosen on day t is only revealed to user t and thus needn't be kept private that day, only on the following days. We give a general scheme converting the classic linear-UCB algorithm into a jointly differentially private algorithm using the tree-based algorithm [10, 18]. We then apply either Gaussian noise or Wishart noise to achieve jointly differentially private algorithms and bound the resulting algorithms' regrets. In addition, we give the first lower bound on the additional regret any private algorithm for the MAB problem must incur.

1 Introduction

The well-known stochastic multi-armed bandit (MAB) is a sequential decision-making task in which a learner repeatedly chooses an action (or arm) and receives a noisy reward. The objective is to maximize cumulative reward by exploring the actions to discover the optimal ones (having the best expected reward), balanced with exploiting them. The contextual bandit problem is an extension of the MAB problem, where the learner also receives a context in each round, and the expected reward depends on both the context and the selected action. As a motivating example, consider online shopping: the user provides a context (composed of query words, past purchases, etc.), and the website responds with a suggested product and receives a reward if the user buys it. Ignoring the context and modeling the problem as a standard MAB (with an action for each possible product) suffers from the drawback of ignoring the variety of users' preferences, whereas separately learning each user's preferences doesn't allow us to generalize between users. It is therefore common to model the task as a contextual linear bandit problem: based on the user-given context, each action is mapped to a feature vector; the expected reward is then assumed to be given by the same unknown linear function of the feature vector across all users.

The above example motivates the need for privacy in the contextual bandit setting: users' past purchases and search queries are sensitive personal information, yet they strongly predict future purchases. In this work, we give upper and lower bounds for the problem of (joint) differentially private contextual linear bandits. Differential privacy is the de facto gold standard of privacy-preserving data analysis in both academia and industry, requiring that an algorithm's output have very limited dependency on any single user interaction (one context and reward).
However, as we later illustrate, adhering to the standard notion of differential privacy (under event-level continual observation) in the contextual bandit setting requires us to essentially ignore the context and thus incur linear regret. We therefore adopt the more relaxed notion of joint differential privacy [23] which, intuitively, allows us to present the t-th user with products corresponding to her preferences, while guaranteeing that all interactions with all users at times t' > t have very limited dependence on user t's preferences. The guarantee of differential privacy under continual observation assures us that even if all later users collude in an effort to learn user t's context or preferences, they still have very limited advantage over a random guess.

1.1 Problem Formulation

Stochastic Contextual Linear Bandits. In the classic MAB, in every round t a learner selects an action a_t from a fixed set A and receives a reward y_t. In the (stationary) stochastic MAB, the reward is noisy with a fixed but unknown expectation E[y_t | a_t] that depends only on the selected action. In the stochastic contextual bandit problem, before each round the learner also receives a context c_t ∈ C; the expected reward E[y_t | c_t, a_t] depends on both c_t and a_t. It is common to assume that the context affects the reward in a linear way: map every context-action pair to a feature vector φ(c, a) ∈ R^d (where φ is an arbitrary but known function) and assume that E[y_t | c_t, a_t] = ⟨θ∗, φ(c_t, a_t)⟩. The vector θ∗ ∈ R^d is the key unknown parameter of the environment, which the learner must discover to maximize reward. Alternatively, we say that on every round the learner is given a decision set D_t := {φ(c_t, a) | a ∈ A} of all the pre-computed feature vectors: choosing x_t ∈ D_t effectively determines the action a_t ∈ A. Thus, the contextual stochastic linear bandit framework consists of repeated rounds in which the learner: (i) receives a decision set D_t ⊆ R^d; (ii) chooses an action x_t ∈ D_t; and (iii) receives a stochastic reward y_t = ⟨θ∗, x_t⟩ + η_t. When all the D_t are identical and consist of the standard basis vectors, the problem reduces to the standard MAB. The learner's objective is to maximize cumulative reward, which is equivalent to minimizing regret: the extra reward the learner would have received by always choosing the best available action. In other words, the regret characterizes the cost of having to learn the optimal action over just knowing it beforehand. For stochastic problems, we are usually interested in a related quantity called pseudo-regret, which is the extra expected reward that the learner could have earned if it had known θ∗ in advance. In our setting, the cumulative pseudo-regret after n rounds is $\hat{R}_n := \sum_{t=1}^{n} \max_{x \in D_t} \langle \theta_*, x - x_t \rangle$.¹

¹ The pseudo-regret ignores the stochasticity of the reward but not the resulting randomness in the learner's choice of actions. It equals the regret in expectation, but is more amenable to high-probability bounds such as ours. In particular, in some cases we can achieve polylog(n) bounds on pseudo-regret because, unlike the regret, it doesn't have added reward noise of variance Ω(n).
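To make the interaction protocol concrete, here is a minimal simulation sketch of the loop above and its cumulative pseudo-regret. It is illustrative only: the feature map, the noise level, the uniformly random placeholder learner, and the decision-set generator are assumptions made for this example, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)        # assume ||theta*|| <= 1 (illustrative bound)

def decision_set(t, num_actions=10):
    """Illustrative decision set D_t: feature vectors of the actions available at round t."""
    X = rng.normal(size=(num_actions, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)   # assume ||x|| <= 1

def learner(D):
    """Placeholder learner: picks an action uniformly at random from D_t."""
    return D[rng.integers(len(D))]

pseudo_regret = 0.0
for t in range(1, n + 1):
    D_t = decision_set(t)                           # (i) receive decision set
    x_t = learner(D_t)                              # (ii) choose an action x_t in D_t
    y_t = theta_star @ x_t + rng.normal(scale=0.1)  # (iii) reward y_t = <theta*, x_t> + eta_t
    best = max(theta_star @ x for x in D_t)
    pseudo_regret += best - theta_star @ x_t        # shortfall relative to the best action in D_t

print(f"cumulative pseudo-regret after {n} rounds: {pseudo_regret:.2f}")
```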
Joint Differential Privacy. As discussed above, the context and reward may be considered private information about the users, which we wish to keep private from all other users. We thus introduce the notion of jointly differentially private learners under continual observation, a combination of two definitions [given in 23, 18]. First, we say two sequences S = (D_1, y_1), (D_2, y_2), ..., (D_n, y_n) and S' = (D'_1, y'_1), ..., (D'_n, y'_n) are t-neighbors if for all t' ≠ t it holds that (D_{t'}, y_{t'}) = (D'_{t'}, y'_{t'}).

Definition 1. A randomized algorithm A for the contextual bandit problem is (ε,δ)-jointly differentially private (JDP) under continual observation if for any t, any pair of t-neighboring sequences S and S', and any subset $S_{>t} \subseteq D_{t+1} \times D_{t+2} \times \cdots \times D_n$ of sequences of actions ranging from day t+1 to the end of the sequence, it holds that $\Pr(A(S) \in S_{>t}) \le e^{\varepsilon} \Pr(A(S') \in S_{>t}) + \delta$.

The standard notion of differential privacy under continual observation would require that changing the context c_t cannot have much effect on the probability of choosing action a_t even for round t itself (not just for future rounds as with JDP). In our problem formulation, however, changing c_t to c'_t may change the decision set D_t to a possibly disjoint D'_t, making that notion ill-defined. Therefore, when we discuss the impossibility of regret-minimization under standard differential privacy in Section 5, we revert to a fixed action set A with an explicit per-round context c_t.

1.2 Our Contributions and Paper Organization

In this work, in addition to formulating the definition of JDP under continual observation, we also present a framework for implementing JDP algorithms for the contextual linear bandit problem. Not surprisingly, our framework combines a tree-based privacy algorithm [10, 18] with a linear upper confidence bound (LinUCB) algorithm [13]. For modularity, in Section 3 we analyze a family of linear UCB algorithms that use different regularizers in every round, under the premise that the regularizers are PSD with bounded singular values. Moreover, we repeat our analysis twice: first we obtain a general O(√n) upper bound on regret; then, for problem instances that maintain a reward gap separating the optimal and sub-optimal actions, we obtain a polylog(n)/Δ regret upper bound. Our leading application is of course privacy, though one could postulate other reasons where such changing regularizers would be useful (e.g., if parameter estimates turn out to be wrong and have to be updated). We then plug two particular regularizers into our scheme: the first is a privacy-preserving mechanism that uses additive Wishart noise [30] (which is always PSD); the second uses additive Gaussian noise [19] (shifted to make it PSD w.h.p. over all rounds). The main term in the regret bounds obtained by both algorithms is $O(\sqrt{n}\, d^{3/4}/\sqrt{\varepsilon})$ (the bound itself depends on numerous parameters; a notation list appears in Section 2). Details of both techniques appear in Section 4. Experiments with a few variants of our algorithms are detailed in Section D of the supplementary material.

In Section 5 we also give a lower bound for the ε-differentially private MAB problem. Whereas all previous work on the private MAB problem uses standard (non-private) lower bounds, we show that any private algorithm must incur an additional regret of Ω(k log(n)/ε). While the result resembles the lower bound in the adversarial setting, the proof technique cannot rely on standard packing arguments [e.g. 20] since the input for the problem is stochastic rather than adversarial. Instead, we rely on a recent coupling argument [22] to prove that any private algorithm must substantially explore suboptimal arms.

By showing that the contextual bandit problem does not become much harder with the addition of privacy, we open the possibility of machine learning systems that are useful to their users without significantly compromising their privacy.
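As a rough illustration of the scheme summarized above (and not a reproduction of the paper's Algorithm 1, whose exact statement and confidence widths are not given in this excerpt), the following sketch runs a LinUCB-style learner whose least-squares statistics are replaced by externally supplied regularized versions H_t and h_t. The running sums M and u, the confidence-width constant beta, and the identity regularizer in the final line are illustrative assumptions.

```python
import numpy as np

def lin_ucb_with_regularizers(decision_sets, reward_of, d, regularizer, beta=1.0):
    """LinUCB-style loop in which the Gram matrix and reward vector are replaced by
    regularized versions H_t = M_t + N_t and h_t = u_t + n_t before estimating theta."""
    M = np.zeros((d, d))          # running sum of x_s x_s^T
    u = np.zeros(d)               # running sum of y_s x_s
    chosen = []
    for t, D_t in enumerate(decision_sets, start=1):
        N_t, n_t = regularizer(t)             # PSD matrix noise and vector noise for round t
        H_t, h_t = M + N_t, u + n_t
        H_inv = np.linalg.inv(H_t)
        theta_hat = H_inv @ h_t               # regularized least-squares estimate of theta*
        # optimism in the face of uncertainty: estimated reward plus a confidence width
        scores = [theta_hat @ x + beta * np.sqrt(x @ H_inv @ x) for x in D_t]
        x_t = np.asarray(D_t)[int(np.argmax(scores))]
        y_t = reward_of(t, x_t)
        M += np.outer(x_t, x_t)
        u += y_t * x_t
        chosen.append(x_t)
    return chosen

# Non-private baseline: a constant identity regularizer (d = 3 here) recovers plain ridge LinUCB.
identity_regularizer = lambda t: (np.eye(3), np.zeros(3))
```

Swapping `identity_regularizer` for a privacy-preserving regularizer is exactly the modularity the paper exploits: the bandit analysis only needs the regularizers to stay PSD with bounded singular values.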
Future Directions. The linear UCB algorithm we adapt in this work is a canonical approach to the linear bandit problem, using the principle of optimism in the face of uncertainty. However, recent work [24] shows that all such optimistic algorithms are sub-optimal, and instead proposes adapting to the decision set in a particular way by solving an intricate optimization problem. It remains an open question to devise a private version of this algorithm, which interpolates between UCB and fine-tuning to the specific action set.

1.3 Related Work

MAB and the Contextual Bandit Problem. The MAB dates to the seminal work of Robbins [28], with the UCB approach developed in a series of works [8, 4] culminating in [6]. Stochastic linear bandits were formally first introduced in [3], and [5] was the first paper to consider UCB-style algorithms. An algorithm based on a confidence ellipsoid is described by [13], with a variant based on ridge regression given in [12], an explore-then-commit variant in [29], and a variant related to a sparse setting in [2]. Abbasi-Yadkori et al. [1] give an instance-dependent bound for linear bandits, which we convert to the contextual setting.

Differential Privacy. Differential privacy, first introduced by Dwork et al. [17, 16], is a rigorous mathematical notion of privacy that requires the probability of any observable output to change very little when any single datum changes. (We omit the formal definition, having already defined JDP.) Among its many elegant traits is the notion of group privacy: should k datums change, the change in the probability of any event is still limited by (roughly) k times the change when a single datum is changed. Differential privacy also composes: the combination of k (ε,δ)-differentially private algorithms is $\big(O\big(k\varepsilon^2 + \varepsilon\sqrt{2k\log(1/\delta')}\big),\, k\delta + \delta'\big)$-differentially private for any δ' > 0 [14]. The notion of differential privacy under continual observation was first defined by Dwork et al. [18] using the tree-based algorithm [originally appearing in 10]. This algorithm maintains a binary tree whose n leaves correspond to the n entries in the input sequence. Each node in the tree maintains a noisy (privacy-preserving) sum of the input entries in its subtree; the cumulative sums of the inputs can thus be obtained by combining at most log(n) noisy sums. This algorithm is the key ingredient of a variety of works that deal with privacy in an online setting, including counts [18], online convex optimization [21], and regret minimization in both the adversarial [31, 34] and stochastic [26, 33] settings. We comment that Mishra and Thakurta [26] proposed an algorithm similar to our own for the contextual bandit setting, however (i) without maintaining PSD regularizers, (ii) without any analysis, only empirical evidence, and (iii) without presenting lower bounds. A partial utility analysis of this algorithm, in the reward-privacy model (where the context's privacy is not guaranteed), appears in the recent work of Neel and Roth [27]. Further details about achieving differential privacy via additive noise and the tree-based algorithm appear in Section A of the supplementary material. The related problem of private linear regression has also been extensively studied in the offline setting [11, 7].
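For concreteness, here is a minimal sketch of the tree-based mechanism described above, specialized to private prefix sums of bounded scalars. The Laplace noise and the simple per-node privacy split are illustrative choices, not the exact calibration of [10, 18] or of this paper (which adds matrix-valued noise at each node).

```python
import numpy as np

class TreeMechanism:
    """Binary-tree (binary counter) mechanism for private prefix sums of scalars.
    Each node covers a dyadic interval of rounds and holds one noisy partial sum."""

    def __init__(self, n, epsilon, sensitivity=1.0, seed=0):
        self.levels = int(np.ceil(np.log2(max(n, 2)))) + 1   # each entry touches <= levels nodes
        self.scale = sensitivity * self.levels / epsilon      # Laplace scale per node (epsilon split)
        self.node_sum = {}     # (level, index) -> true partial sum
        self.node_noise = {}   # (level, index) -> noise, sampled once when the node is first touched
        self.rng = np.random.default_rng(seed)

    def add(self, t, value):
        """Insert the t-th entry (1-indexed) into every dyadic node that covers it."""
        for level in range(self.levels):
            key = (level, (t - 1) >> level)
            self.node_sum[key] = self.node_sum.get(key, 0.0) + value
            if key not in self.node_noise:
                self.node_noise[key] = self.rng.laplace(scale=self.scale)

    def prefix_sum(self, t):
        """Noisy sum of entries 1..t, combining at most `levels` complete dyadic nodes."""
        total, pos, level = 0.0, t, 0
        while pos > 0:
            if pos & 1:                 # a complete dyadic block of size 2^level ends at position t
                key = (level, pos - 1)  # block covers [(pos-1)*2^level + 1, pos*2^level]
                total += self.node_sum.get(key, 0.0) + self.node_noise.get(key, 0.0)
            pos >>= 1
            level += 1
        return total
```

Each entry is written into at most `levels` nodes and each prefix sum is read from at most `levels` nodes, which is what keeps both the privacy loss and the error logarithmic in n.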
2 Preliminaries and Notation

We use bold letters to denote vectors and bold CAPITALS for matrices. Given a d-column matrix M, its Gram matrix is the (d × d) matrix MᵀM. A symmetric matrix M is positive semi-definite (PSD, denoted M ⪰ 0) if xᵀMx ≥ 0 for every vector x. Any such M defines a norm on vectors, so we define ‖x‖²_M = xᵀMx. We use M ⪰ N to mean M − N ⪰ 0. The Gaussian distribution N(µ, σ²) is defined by the density function (2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)). The squared L2-norm of a d-dimensional vector whose coordinates are drawn i.i.d. from N(0,1) follows the χ²(d) distribution, which is tightly concentrated around d. Given two distributions P and Q, we denote their total-variation distance by d_TV(P, Q) = max_{event E} |P(E) − Q(E)|.

Notation. Our bounds depend on many parameters of the problem, specified below. Additional parameters (bounds) are specified in the assumptions stated below.

- n: horizon, i.e. number of rounds
- s, t: indices of rounds
- d: dimensionality of the action space
- D_t ⊆ R^d: decision set at round t
- x_t ∈ D_t: action at round t
- y_t ∈ R: reward at round t
- θ∗ ∈ R^d: unknown parameter vector
- m := log₂(n) + 1: number of noisy sums combined by the tree-based algorithm on any day

If for each t the regularizers H_t and h_t are generated by the tree-based algorithm with Wishart noise W_{d+1}(L²I, k), then there exist (α/2n)-accurate bounds for ρ_min and ρ_max (each involving a √(2 ln(8n/α)) term) and for γ (involving a √(2 ln(2n/α)) term). Moreover, if we use the shifted regularizer H'_t := H_t − cI with c as given in Eq. (4), then the following are (α/2n)-accurate bounds: ρ_min = 4L²√(2 ln(8n/α)), ρ_max = 8L²√(2 ln(8n/α)), and γ involving √(2 ln(2n/α)). Plugging these into Theorems 5 and 6 gives us the following upper bounds on pseudo-regret.

Corollary 10. Running Algorithm 1 with H_t and h_t generated by the tree-based mechanism, with each node adding noise drawn independently from W_{d+1}((L² + B²)I, k) and then subtracting cI using Eq. (4), we get a pseudo-regret bound of

$O\Big( B\sqrt{n}\Big( \sigma\sqrt{\log(1/\alpha) + d\log(n)} + S\,L\sqrt{d}\,\log(n)^{3/4}\big(d^{1/4} + \varepsilon^{-1/2}\log(1/\delta)^{1/4}\big)\big(d^{1/4} + \log(n/\alpha)^{1/4}\big)\Big)\Big)$

in general, and a gap-dependent pseudo-regret bound of

$O\Big( \tfrac{1}{\Delta}\Big( \sigma\sqrt{\log(1/\alpha) + d\log(n)} + S\,L\sqrt{d}\,\log(n)^{3/4}\big(d^{1/4} + \varepsilon^{-1/2}\log(1/\delta)^{1/4}\big)\big(d^{1/4} + \log(n/\alpha)^{1/4}\big)\Big)^{2}\Big).$
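A minimal sketch of how one node's Wishart noise could be generated and turned into a regularized pair (H_t, h_t): the degrees-of-freedom parameter k, the scale, and the shift c are placeholders for the privacy-calibrated quantities the paper derives (e.g., via Eq. (4)), and the padding of each datum as (x_t, y_t) matches the (d+1)-dimensional noise used above.

```python
import numpy as np

def wishart_noise(dim, k, scale_sq, rng):
    """Sample one node's noise from W_dim(scale_sq * I, k): the Gram matrix of k i.i.d.
    N(0, scale_sq * I) rows. The result is always PSD by construction."""
    A = rng.normal(scale=np.sqrt(scale_sq), size=(k, dim))
    return A.T @ A

def wishart_private_statistics(data, k, L_sq, c_shift, rng):
    """Form the noisy second-moment matrix of the padded data (x_t, y_t), as one tree node
    would, then split it into the regularized pair (H_t, h_t) used by the UCB learner.
    `k` and `c_shift` are placeholders for the privacy-calibrated values."""
    d_plus_1 = data.shape[1]                       # rows are the padded vectors (x_t, y_t)
    M = data.T @ data                              # exact second-moment matrix
    N = wishart_noise(d_plus_1, k, L_sq, rng)
    noisy = M + N - c_shift * np.eye(d_plus_1)     # optional shift to reduce the added bias
    H_t = noisy[:-1, :-1]                          # top-left d x d block: noisy Gram matrix of x's
    h_t = noisy[:-1, -1]                           # last column (minus corner): noisy sum of y_t * x_t
    return H_t, h_t

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 6))                    # 50 rounds of padded vectors, d = 5
H, h = wishart_private_statistics(data, k=20, L_sq=1.0, c_shift=0.0, rng=rng)
```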
4.2 Differential Privacy via Additive Gaussian Noise

Our second alternative is to instantiate the tree-based algorithm with symmetric Gaussian noise: sample Z' ∈ R^{(d+1)×(d+1)} with each Z'_{i,j} ∼ N(0, σ²_noise) i.i.d. and symmetrize to get Z = (Z' + Z'ᵀ)/√2.⁴ Recall that each datum has a bounded L2-norm of L, hence a change to a single datum may alter the Frobenius norm of M_t by L². It follows that, in order to make sure each node in the tree-based algorithm preserves (ε/√(8m ln(2/δ)), δ/2)-differential privacy,⁵ the variance in each coordinate must be σ²_noise = 16m L⁴ ln(4/δ)²/ε². When all entries of Z are sampled from N(0,1), known concentration results [32] on the top singular value of Z give that P[‖Z‖ > 4√(d+1) + 2√(ln(2n/α))] < α/(2n). Note however that on each day t the noise N_t is the sum of m such matrices, so the variance of each coordinate is mσ²_noise. The top-left (d × d) submatrix of N_t therefore has operator norm of at most Υ := √(mσ²_noise)·(4√d + 2√(ln(2n/α))). However, it is important to note that adding Gaussian noise may not preserve the PSD property of the noisy Gram matrix. To that end, we shift N_t by cI in order to make sure that we maintain strictly positive eigenvalues throughout the execution of the algorithm. Since the bounds in Theorems 5 and 6 mainly depend on ρ_max + γ, we choose the shift magnitude to be 2ΥI. This makes ρ_max = 3Υ and ρ_min = Υ, and as a result $\lVert h_t \rVert_{H_t^{-1}}^2 \le \Upsilon^{-1} \lVert h_t \rVert^2$, which we can bound using standard concentration bounds on the χ²-distribution (see Claim 17). This culminates in the following bounds.

Proposition 11. Fix any α > 0. If for each t the regularizers H_t and h_t are obtained by applying the tree-based algorithm with symmetrized, shifted Gaussian noise whose entries are sampled i.i.d. from N(0, σ²_noise), then the following ρ_min, ρ_max, and γ are (α/2n)-accurate bounds: ρ_min = Υ, ρ_max = 3Υ, and γ = σ_noise √(2 ln(2n/α)) √(d + 2√(ln(2n/α))) / √Υ.

⁴ This increases the variance along the diagonal entries beyond the noise magnitude required to preserve privacy, but only by a constant factor of 2.
⁵ We use here the slightly better bounds for the composition of Gaussian noise based on zero-concentrated DP [9].

Note how this choice of shift indeed makes both ρ_max and γ² roughly on the order of O(Υ). The end result is that for each day t, the noise added to h_t is a sum of at most m d-dimensional vectors whose entries are sampled i.i.d. from N(0, σ²_noise); the symmetrization doesn't change the distribution of each coordinate. The noise added to H_t is given by: (i) summing at most m matrices whose entries are sampled i.i.d. from N(0, σ²_noise); (ii) symmetrizing the result as shown above; and (iii) adding 2ΥI. This leads to a bound on the regret of Algorithm 1 with the tree-based algorithm using Gaussian noise.

Corollary 12. Applying Algorithm 1 where the regularizers H_t and h_t are derived by applying the tree-based algorithm, where each node holds a symmetrized matrix whose entries are sampled i.i.d. from N(0, σ²_noise) and then adding 2ΥI, we get a regret bound of

$O\Big( B\sqrt{n}\Big( \sigma\sqrt{d\log(n) + \log(1/\alpha)} + S\,L\log(n)\sqrt{d + \ln(n/\alpha)}\;\ln(1/\delta)/\varepsilon\Big)\Big)$

in general, and a gap-dependent pseudo-regret bound of

$O\Big( \tfrac{1}{\Delta}\Big( \sigma\sqrt{d\log(n) + \log(1/\alpha)} + S\,L\log(n)\sqrt{d + \ln(n/\alpha)}\;\ln(1/\delta)/\varepsilon\Big)^{2}\Big).$
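A matching sketch for the Gaussian-noise alternative, again with placeholder values: sigma_noise, the tree depth m, and the shift upsilon below stand in for the calibrated quantities derived above.

```python
import numpy as np

def symmetrized_gaussian_noise(dim, sigma_noise, rng):
    """One node's noise: an i.i.d. N(0, sigma_noise^2) matrix symmetrized as (Z' + Z'^T)/sqrt(2)."""
    Zp = rng.normal(scale=sigma_noise, size=(dim, dim))
    return (Zp + Zp.T) / np.sqrt(2)

def gaussian_private_statistics(data, sigma_noise, m, upsilon, rng):
    """Noisy second-moment matrix with the sum of m symmetrized Gaussian noise matrices
    (as the tree-based algorithm would add on a worst-case day), shifted by 2*Upsilon*I
    so the top-left block stays positive definite with high probability."""
    d_plus_1 = data.shape[1]
    M = data.T @ data
    N = sum(symmetrized_gaussian_noise(d_plus_1, sigma_noise, rng) for _ in range(m))
    noisy = M + N + 2.0 * upsilon * np.eye(d_plus_1)
    H_t = noisy[:-1, :-1]        # regularized Gram matrix of the x's
    h_t = noisy[:-1, -1]         # regularized vector of y_t * x_t sums
    return H_t, h_t

rng = np.random.default_rng(2)
data = rng.normal(size=(50, 6))
d = data.shape[1] - 1
sigma_noise, m = 0.5, 11                         # placeholder noise level and tree depth
upsilon = np.sqrt(m) * sigma_noise * (4 * np.sqrt(d) + 2 * np.sqrt(np.log(2 * 50 / 0.05)))
H, h = gaussian_private_statistics(data, sigma_noise, m, upsilon, rng)
```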
5 Lower Bounds

In this section, we present lower bounds for two versions of the problem we deal with in this work. The first, and probably the more obvious of the two, deals with the impossibility of obtaining sub-linear regret for the contextual bandit problem with the standard notion of differential privacy (under continual observation). Here, we assume user t provides a context c_t which determines the mapping of the arms to feature vectors: φ(c_t, a) ∈ R^d. The sequence of choices thus made by the learner is (a_1, ..., a_n) ∈ A^n, which we aim to keep private. The next claim, whose proof is deferred to Section C in the supplementary material, shows that this is impossible without giving up any reasonable notion of utility.

Claim 13. For any ε < ln(2) and δ < 0.25, any (ε,δ)-differentially private algorithm A for the contextual bandit problem must incur pseudo-regret of Ω(n).

The second lower bound we show is more challenging. We show that any ε-differentially private algorithm for the classic MAB problem must incur an additional pseudo-regret of Ω(k log(n)/ε) on top of the standard (non-private) regret bounds. We consider an instance of the MAB where the leading arm is a_1, the rewards are drawn from distributions over {−1, 1}, and the gap between the means of arm a_1 and arm a ≠ a_1 is Δ_a. A simple calculation shows that for such distributions, the total-variation distance between two distributions whose means are µ and µ' is |µ − µ'|/2. Fix Δ_2, Δ_3, ..., Δ_k as some small constants; we now argue the following.

Claim 14. Let A be any ε-differentially private algorithm for the MAB problem with k arms whose expected regret is at most n^{3/4}. Fix any arm a ≠ a_1, whose gap from the optimal arm a_1 is Δ_a. Then, for sufficiently large n, A pulls arm a at least log(n)/(100εΔ_a) times with probability at least 1/2.

We comment that the bound n^{3/4} was chosen arbitrarily, and we only require a regret upper bound of n^{1−c} for some c > 0. Of course, we could have used standard assumptions, where the regret is asymptotically smaller than any polynomial, or discussed algorithms of regret O(√n) (the best minimax regret). Aiming to separate the standard lower bounds on regret from the private bounds, we decided to use n^{3/4}. As an immediate corollary we obtain the following private regret bound.

Corollary 15. The expected pseudo-regret of any ε-differentially private algorithm for the MAB is Ω(k log(n)/ε). Combined with the non-private bound of $\Omega\big(\sum_{a \neq a_1} \log(n)/\Delta_a\big)$, we get that the private regret bound is the maximum of the two terms, i.e., $\Omega\big(k \log(n)/\varepsilon + \sum_{a \neq a_1} \log(n)/\Delta_a\big)$.

Proof. By Claim 14, each suboptimal arm a is pulled at least log(n)/(100εΔ_a) times with probability at least 1/2, and each such pull contributes Δ_a to the pseudo-regret; hence the expected pseudo-regret is at least $\sum_{a \neq a_1} \tfrac{1}{2} \cdot \tfrac{\log(n)}{100\varepsilon\Delta_a} \cdot \Delta_a = \tfrac{(k-1)\log(n)}{200\varepsilon}$.

Proof of Claim 14. Fix arm a. Let P be the vector of the k probability distributions associated with the k arms. Denote by E the event that arm a is pulled fewer than t_a := log(n)/(100εΔ_a) times. Our goal is to show that P_{A; rewards ∼ P}[E] < 1/2. To that end, we postulate a different distribution for the rewards of arm a: a new distribution whose mean is greater by Δ_a than the mean reward of arm a_1. The total-variation distance between the given distribution and the postulated distribution is Δ_a. Denote by Q the vector of distributions of arm rewards (where only P_a ≠ Q_a). We now argue that should the rewards be drawn from Q, then the event E is very unlikely: P_{A; rewards ∼ Q}[E] ≤ 2n^{−1/4}/Δ_a. Indeed, this follows from a standard Markov-like argument: the expected pseudo-regret of A is at most n^{3/4}, yet under Q, on the event E the learner pulls arms other than a at least n − t_a times, so the pseudo-regret is at least $P_{A;\,\mathrm{rewards}\sim Q}[E]\,(n - t_a)\Delta_a \ge (n\Delta_a/2)\, P_{A;\,\mathrm{rewards}\sim Q}[E]$ for sufficiently large n.

We now apply a beautiful result of Karwa and Vadhan [22, Lemma 6.1], stating that the effective group privacy between the case where the n datums of the input are drawn i.i.d. from distribution P and the case where they are drawn from distribution Q is proportional to εn·d_TV(P, Q). In our case, the key point is that we only consider this change under the event E; thus the number of samples we need to redraw from the distribution P_a rather than Q_a is strictly smaller than t_a, and the elegant coupling argument of [22] reduces it to 6Δ_a t_a. To better illustrate the argument, consider the coupling argument of [22] as an oracle O. The oracle generates a collection of precisely t_a pairs of points: the left ones are i.i.d. samples from P_a and the right ones are i.i.d. samples from Q_a, and, in expectation, in a (1 − Δ_a) fraction of the pairs the right and left samples are identical. Whenever the learner A pulls arm a, it makes an oracle call to O, and depending on the environment (whether the distribution of rewards is P or Q) O provides either a fresh left sample or a right sample. Moreover, suppose there exists a counter C that stands between A and O, and in case O runs out of examples then C routes A's oracle calls to a different oracle. Now, Karwa and Vadhan [22, Lemma 6.1] assures us that the event "C never re-routes the requests" happens with similar probability under P and under Q, differing only by a multiplicative factor of exp(6εΔ_a t_a). And seeing as the event "C never re-routes the requests" is quite unlikely when O only provides right samples (from Q), it is also fairly unlikely when O only provides left samples (from P).
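To illustrate the coupling behind this argument, here is a small simulation sketch (purely illustrative, with placeholder constants) of the oracle O: it prepares t_a coupled reward pairs for arm a, drawn so that each pair agrees except with probability d_TV(P_a, Q_a) = Δ_a (the means of P_a and Q_a differ by 2Δ_a, so their total-variation distance is Δ_a).

```python
import numpy as np

def coupled_reward_pairs(t_a, mean_p, mean_q, rng):
    """Maximal coupling of two {-1,+1} reward distributions with means mean_p and mean_q:
    each pair of samples agrees except with probability |mean_p - mean_q| / 2."""
    pairs = []
    for _ in range(t_a):
        p_plus = (1 + mean_p) / 2          # P(reward = +1) under P_a
        q_plus = (1 + mean_q) / 2          # P(reward = +1) under Q_a
        u = rng.random()                   # one shared uniform draw couples the two samples
        left = 1 if u < p_plus else -1     # sample from P_a
        right = 1 if u < q_plus else -1    # coincides with `left` unless u falls in the gap
        pairs.append((left, right))
    return pairs

rng = np.random.default_rng(3)
delta_a, t_a = 0.1, 200
pairs = coupled_reward_pairs(t_a, mean_p=0.0, mean_q=2 * delta_a, rng=rng)
disagreements = sum(l != r for l, r in pairs)
print(f"{disagreements} of {t_a} coupled pairs differ (expected about {delta_a * t_a:.0f})")
```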
Formally, we conclude the proof by applying the result of [22] to infer that

$P_{A;\,\mathrm{rewards}\sim P}[E] \;\le\; \exp(6\varepsilon\Delta_a t_a)\, P_{A;\,\mathrm{rewards}\sim Q}[E] \;\le\; \exp(0.06\log(n)) \cdot \tfrac{2}{\Delta_a}\, n^{-1/4} \;=\; \tfrac{2}{\Delta_a}\, n^{-0.19} \;\le\; \tfrac{1}{2}$

for sufficiently large n, proving the required claim.

Acknowledgements. We gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting R.S. with the Alexander Graham Bell Canada Graduate Scholarship and O.S. with grant #2017-06701. R.S. was also supported by Alberta Innovates, and O.S. is also an unpaid collaborator on NSF grant #1565387.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24, pages 2312-2320. Curran Associates, Inc., 2011.
[2] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In AISTATS, pages 1-9, 2012.
[3] Naoki Abe, Alan W. Biermann, and Philip M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263-293, 2003.
[4] Rajeev Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078, 1995.
[5] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3:397-422, 2003.
[6] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
[7] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS '14, pages 464-473, Washington, DC, USA, 2014. IEEE Computer Society. ISBN 978-1-4799-6517-5. doi: 10.1109/FOCS.2014.56.
[8] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London; New York, 1985.
[9] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography, Lecture Notes in Computer Science, pages 635-658. Springer, Berlin, Heidelberg, November 2016. ISBN 978-3-662-53640-7 978-3-662-53641-4. doi: 10.1007/978-3-662-53641-4_24.
[10] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. In Automata, Languages and Programming, Lecture Notes in Computer Science, pages 405-417. Springer, Berlin, Heidelberg, July 2010. ISBN 978-3-642-14161-4 978-3-642-14162-1. doi: 10.1007/978-3-642-14162-1_34.
[11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069-1109, July 2011. ISSN 1532-4435.
[12] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, volume 15 of JMLR Proceedings, pages 208-214, 2011.
[13] Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, pages 355-366, January 2008.
[14] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51-60, October 2010. doi: 10.1109/FOCS.2010.12.
[15] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, August 2014. ISSN 1551-305X, 1551-3068. doi: 10.1561/0400000042.
[16] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology - EUROCRYPT 2006, Lecture Notes in Computer Science, pages 486-503. Springer, Berlin, Heidelberg, May 2006. ISBN 978-3-540-34546-6 978-3-540-34547-3. doi: 10.1007/11761679_29.
[17] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, Lecture Notes in Computer Science, pages 265-284. Springer, Berlin, Heidelberg, March 2006. ISBN 978-3-540-32731-8 978-3-540-32732-5. doi: 10.1007/11681878_14.
[18] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 715-724, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0050-6. doi: 10.1145/1806689.1806787.
[19] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 11-20, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2710-7. doi: 10.1145/2591796.2591883.
[20] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 705-714, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0050-6. doi: 10.1145/1806689.1806786.
[21] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Conference on Learning Theory, pages 24.1-24.34, June 2012.
[22] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. arXiv:1711.03908 [cs, math, stat], November 2017.
[23] Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. Pages 403-410. ACM Press, 2014. ISBN 978-1-4503-2698-8. doi: 10.1145/2554797.2554834.
[24] Tor Lattimore and Csaba Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In AISTATS, pages 728-737, 2017.
[25] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302-1338, October 2000. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1015957395.
[26] Nikita Mishra and Abhradeep Thakurta. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI '15, pages 592-601, Arlington, Virginia, United States, 2015. AUAI Press. ISBN 978-0-9966431-0-8.
[27] Seth Neel and Aaron Roth. Mitigating bias in adaptive data gathering via differential privacy. In ICML, pages 3717-3726, 2018.
[28] Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527-535, September 1952.
[29] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, April 2010. ISSN 0364-765X. doi: 10.1287/moor.1100.0446.
[30] Or Sheffet. Private approximations of the 2nd-moment matrix using existing techniques in linear regression. arXiv:1507.00056 [cs], June 2015.
[31] Adam Smith and Abhradeep Thakurta. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2733-2741. Curran Associates, Inc., 2013.
[32] Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Society, Providence, RI, 2012.
[33] Aristide C. Y. Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, March 2016.
[34] Aristide C. Y. Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. arXiv:1701.04222 [cs], January 2017.
[35] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027 [cs, math], November 2010.
[36] Fuzhen Zhang. Matrix Theory: Basic Results and Techniques. Universitext. Springer, New York, 2nd edition, 2011. ISBN 978-1-4614-1098-0.