# Differentially Private Contextual Linear Bandits

Roshan Shariff, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. roshan.shariff@ualberta.ca
Or Sheffet, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. osheffet@ualberta.ca

Abstract. We study the contextual linear bandit problem, a version of the standard stochastic multi-armed bandit (MAB) problem where a learner sequentially selects actions to maximize a reward which also depends on a user-provided per-round context. Though the context is chosen arbitrarily or adversarially, the reward is assumed to be a stochastic function of a feature vector that encodes the context and the selected action. Our goal is to devise private learners for the contextual linear bandit problem. We first show that using the standard definition of differential privacy results in linear regret. So instead we adopt the notion of joint differential privacy, where we assume that the action chosen on day t is only revealed to user t and thus needn't be kept private that day, only on the following days. We give a general scheme converting the classic linear-UCB algorithm into a jointly differentially private algorithm using the tree-based algorithm [10, 18]. We then apply either Gaussian noise or Wishart noise to achieve jointly differentially private algorithms and bound the resulting algorithms' regrets. In addition, we give the first lower bound on the additional regret any private algorithm for the MAB problem must incur.

1 Introduction

The well-known stochastic multi-armed bandit (MAB) is a sequential decision-making task in which a learner repeatedly chooses an action (or arm) and receives a noisy reward. The objective is to maximize cumulative reward by exploring the actions to discover the optimal ones (having the best expected reward), balanced with exploiting them. The contextual bandit problem is an extension of the MAB problem, where the learner also receives a context in each round, and the expected reward depends on both the context and the selected action. As a motivating example, consider online shopping: the user provides a context (composed of query words, past purchases, etc.), and the website responds with a suggested product and receives a reward if the user buys it. Ignoring the context and modeling the problem as a standard MAB (with an action for each possible product) suffers from the drawback of ignoring the variety of users' preferences, whereas separately learning each user's preferences doesn't allow us to generalize between users. It is therefore common to model the task as a contextual linear bandit problem: based on the user-given context, each action is mapped to a feature vector; the expected reward is then assumed to be given by the same unknown linear function of the feature vector across all users.

The above example motivates the need for privacy in the contextual bandit setting: users' past purchases and search queries are sensitive personal information, yet they strongly predict future purchases. In this work, we give upper and lower bounds for the problem of (joint) differentially private contextual linear bandits. Differential privacy is the de facto gold standard of privacy-preserving data analysis in both academia and industry, requiring that an algorithm's output have very limited dependency on any single user interaction (one context and reward).
However, as we later illustrate, adhering to the standard notion of differential privacy (under event-level continual observation) in the contextual bandit setting requires us to essentially ignore the context and thus incur linear regret. We therefore adopt the more relaxed notion of joint differential privacy [23] which, intuitively, allows us to present the t-th user with products corresponding to her preferences, while guaranteeing that all interactions with all users at times t' > t have very limited dependence on user t's preferences. The guarantee of differential privacy under continual observation assures us that even if all later users collude in an effort to learn user t's context or preferences, they still have very limited advantage over a random guess.

1.1 Problem Formulation

Stochastic Contextual Linear Bandits. In the classic MAB, in every round t a learner selects an action a_t from a fixed set A and receives a reward y_t. In the (stationary) stochastic MAB, the reward is noisy with a fixed but unknown expectation E[y_t | a_t] that depends only on the selected action. In the stochastic contextual bandit problem, before each round the learner also receives a context c_t ∈ C; the expected reward E[y_t | c_t, a_t] depends on both c_t and a_t. It is common to assume that the context affects the reward in a linear way: map every context-action pair to a feature vector φ(c, a) ∈ R^d (where φ is an arbitrary but known function) and assume that E[y_t | c_t, a_t] = ⟨θ∗, φ(c_t, a_t)⟩. The vector θ∗ ∈ R^d is the key unknown parameter of the environment, which the learner must discover to maximize reward. Alternatively, we say that on every round the learner is given a decision set D_t := {φ(c_t, a) | a ∈ A} of all the pre-computed feature vectors: choosing x_t ∈ D_t effectively determines the action a_t ∈ A. Thus, the contextual stochastic linear bandit framework consists of repeated rounds in which the learner: (i) receives a decision set D_t ⊆ R^d; (ii) chooses an action x_t ∈ D_t; and (iii) receives a stochastic reward y_t = ⟨θ∗, x_t⟩ + η_t. When all the D_t are identical and consist of the standard basis vectors, the problem reduces to the standard MAB. The learner's objective is to maximize cumulative reward, which is equivalent to minimizing regret: the extra reward the learner would have received by always choosing the best available action. In other words, the regret characterizes the cost of having to learn the optimal action over just knowing it beforehand. For stochastic problems, we are usually interested in a related quantity called pseudo-regret, which is the extra expected reward that the learner could have earned if it had known θ∗ in advance. In our setting, the cumulative pseudo-regret after n rounds is $\hat{R}_n := \sum_{t=1}^{n} \max_{x \in D_t} \langle \theta_*, x - x_t \rangle$.¹

¹ The pseudo-regret ignores the stochasticity of the reward but not the resulting randomness in the learner's choice of actions. It equals the regret in expectation, but is more amenable to high-probability bounds such as ours. In particular, in some cases we can achieve polylog(n) bounds on pseudo-regret because, unlike the regret, it doesn't have added reward noise of variance Ω(n).
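To make the interaction protocol concrete, here is a minimal simulation sketch of the loop above and its cumulative pseudo-regret. It is illustrative only: the feature map, the noise level, the uniformly random placeholder learner, and the decision-set generator are assumptions made for this example, not part of the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 1000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)        # assume ||theta*|| <= 1 (illustrative bound)

def decision_set(t, num_actions=10):
    """Illustrative decision set D_t: feature vectors of the actions available at round t."""
    X = rng.normal(size=(num_actions, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)   # assume ||x|| <= 1

def learner(D):
    """Placeholder learner: picks an action uniformly at random from D_t."""
    return D[rng.integers(len(D))]

pseudo_regret = 0.0
for t in range(1, n + 1):
    D_t = decision_set(t)                           # (i) receive decision set
    x_t = learner(D_t)                              # (ii) choose an action x_t in D_t
    y_t = theta_star @ x_t + rng.normal(scale=0.1)  # (iii) reward y_t = <theta*, x_t> + eta_t
    best = max(theta_star @ x for x in D_t)
    pseudo_regret += best - theta_star @ x_t        # shortfall relative to the best action in D_t

print(f"cumulative pseudo-regret after {n} rounds: {pseudo_regret:.2f}")
```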
Joint Differential Privacy. As discussed above, the context and reward may be considered private information about the users, which we wish to keep private from all other users. We thus introduce the notion of jointly differentially private learners under continual observation, a combination of two definitions [given in 23, 18]. First, we say two sequences S = (D_1, y_1), (D_2, y_2), ..., (D_n, y_n) and S' = (D'_1, y'_1), ..., (D'_n, y'_n) are t-neighbors if for all t' ≠ t it holds that (D_{t'}, y_{t'}) = (D'_{t'}, y'_{t'}).

Definition 1. A randomized algorithm A for the contextual bandit problem is (ε,δ)-jointly differentially private (JDP) under continual observation if for any t, any pair of t-neighboring sequences S and S', and any subset $S_{>t} \subseteq D_{t+1} \times D_{t+2} \times \cdots \times D_n$ of sequences of actions ranging from day t+1 to the end of the sequence, it holds that $\Pr(A(S) \in S_{>t}) \le e^{\varepsilon} \Pr(A(S') \in S_{>t}) + \delta$.

The standard notion of differential privacy under continual observation would require that changing the context c_t cannot have much effect on the probability of choosing action a_t even for round t itself (not just for future rounds as with JDP). In our problem formulation, however, changing c_t to c'_t may change the decision set D_t to a possibly disjoint D'_t, making that notion ill-defined. Therefore, when we discuss the impossibility of regret-minimization under standard differential privacy in Section 5, we revert to a fixed action set A with an explicit per-round context c_t.

1.2 Our Contributions and Paper Organization

In this work, in addition to formulating the definition of JDP under continual observation, we also present a framework for implementing JDP algorithms for the contextual linear bandit problem. Not surprisingly, our framework combines a tree-based privacy algorithm [10, 18] with a linear upper confidence bound (LinUCB) algorithm [13]. For modularity, in Section 3 we analyze a family of linear UCB algorithms that use different regularizers in every round, under the premise that the regularizers are PSD with bounded singular values. Moreover, we repeat our analysis twice: first we obtain a general O(√n) upper bound on regret; then, for problem instances that maintain a reward gap separating the optimal and sub-optimal actions, we obtain a polylog(n)/Δ regret upper bound. Our leading application is of course privacy, though one could postulate other reasons where such changing regularizers would be useful (e.g., if parameter estimates turn out to be wrong and have to be updated). We then plug two particular regularizers into our scheme: the first is a privacy-preserving mechanism that uses additive Wishart noise [30] (which is always PSD); the second uses additive Gaussian noise [19] (shifted to make it PSD w.h.p. over all rounds). The main term in the regret bounds obtained by both algorithms is $O(\sqrt{n}\, d^{3/4}/\sqrt{\varepsilon})$ (the bound itself depends on numerous parameters; a notation list appears in Section 2). Details of both techniques appear in Section 4. Experiments with a few variants of our algorithms are detailed in Section D of the supplementary material.

In Section 5 we also give a lower bound for the ε-differentially private MAB problem. Whereas all previous work on the private MAB problem uses standard (non-private) lower bounds, we show that any private algorithm must incur an additional regret of Ω(k log(n)/ε). While the result resembles the lower bound in the adversarial setting, the proof technique cannot rely on standard packing arguments [e.g. 20] since the input for the problem is stochastic rather than adversarial. Instead, we rely on a recent coupling argument [22] to prove that any private algorithm must substantially explore suboptimal arms.

By showing that the contextual bandit problem does not become much harder with the addition of privacy, we open the possibility of machine learning systems that are useful to their users without significantly compromising their privacy.
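As a rough illustration of the scheme summarized above (and not a reproduction of the paper's Algorithm 1, whose exact statement and confidence widths are not given in this excerpt), the following sketch runs a LinUCB-style learner whose least-squares statistics are replaced by externally supplied regularized versions H_t and h_t. The running sums M and u, the confidence-width constant beta, and the identity regularizer in the final line are illustrative assumptions.

```python
import numpy as np

def lin_ucb_with_regularizers(decision_sets, reward_of, d, regularizer, beta=1.0):
    """LinUCB-style loop in which the Gram matrix and reward vector are replaced by
    regularized versions H_t = M_t + N_t and h_t = u_t + n_t before estimating theta."""
    M = np.zeros((d, d))          # running sum of x_s x_s^T
    u = np.zeros(d)               # running sum of y_s x_s
    chosen = []
    for t, D_t in enumerate(decision_sets, start=1):
        N_t, n_t = regularizer(t)             # PSD matrix noise and vector noise for round t
        H_t, h_t = M + N_t, u + n_t
        H_inv = np.linalg.inv(H_t)
        theta_hat = H_inv @ h_t               # regularized least-squares estimate of theta*
        # optimism in the face of uncertainty: estimated reward plus a confidence width
        scores = [theta_hat @ x + beta * np.sqrt(x @ H_inv @ x) for x in D_t]
        x_t = np.asarray(D_t)[int(np.argmax(scores))]
        y_t = reward_of(t, x_t)
        M += np.outer(x_t, x_t)
        u += y_t * x_t
        chosen.append(x_t)
    return chosen

# Non-private baseline: a constant identity regularizer (d = 3 here) recovers plain ridge LinUCB.
identity_regularizer = lambda t: (np.eye(3), np.zeros(3))
```

Swapping `identity_regularizer` for a privacy-preserving regularizer is exactly the modularity the paper exploits: the bandit analysis only needs the regularizers to stay PSD with bounded singular values.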
Future Directions. The linear UCB algorithm we adapt in this work is a canonical approach to the linear bandit problem, using the principle of optimism in the face of uncertainty. However, recent work [24] shows that all such optimistic algorithms are sub-optimal, and instead proposes adapting to the decision set in a particular way by solving an intricate optimization problem. It remains an open question to devise a private version of this algorithm, which interpolates between UCB and fine-tuning to the specific action set.

1.3 Related Work

MAB and the Contextual Bandit Problem. The MAB dates to the seminal work of Robbins [28], with the UCB approach developed in a series of works [8, 4] culminating in [6]. Stochastic linear bandits were formally first introduced in [3], and [5] was the first paper to consider UCB-style algorithms. An algorithm based on a confidence ellipsoid is described by [13], with a variant based on ridge regression given in [12], an explore-then-commit variant in [29], and a variant related to a sparse setting in [2]. Abbasi-Yadkori et al. [1] give an instance-dependent bound for linear bandits, which we convert to the contextual setting.

Differential Privacy. Differential privacy, first introduced by Dwork et al. [17, 16], is a rigorous mathematical notion of privacy that requires the probability of any observable output to change very little when any single datum changes. (We omit the formal definition, having already defined JDP.) Among its many elegant traits is the notion of group privacy: should k datums change, the change in the probability of any event is still limited by (roughly) k times the change when a single datum is changed. Differential privacy also composes: the combination of k (ε,δ)-differentially private algorithms is $\big(O\big(k\varepsilon^2 + \varepsilon\sqrt{2k\log(1/\delta')}\big),\, k\delta + \delta'\big)$-differentially private for any δ' > 0 [14]. The notion of differential privacy under continual observation was first defined by Dwork et al. [18] using the tree-based algorithm [originally appearing in 10]. This algorithm maintains a binary tree whose n leaves correspond to the n entries in the input sequence. Each node in the tree maintains a noisy (privacy-preserving) sum of the input entries in its subtree; the cumulative sums of the inputs can thus be obtained by combining at most log(n) noisy sums. This algorithm is the key ingredient of a variety of works that deal with privacy in an online setting, including counts [18], online convex optimization [21], and regret minimization in both the adversarial [31, 34] and stochastic [26, 33] settings. We comment that Mishra and Thakurta [26] proposed an algorithm similar to our own for the contextual bandit setting, however (i) without maintaining PSD regularizers, (ii) without any analysis, only empirical evidence, and (iii) without presenting lower bounds. A partial utility analysis of this algorithm, in the reward-privacy model (where the context's privacy is not guaranteed), appears in the recent work of Neel and Roth [27]. Further details about achieving differential privacy via additive noise and the tree-based algorithm appear in Section A of the supplementary material. The related problem of private linear regression has also been extensively studied in the offline setting [11, 7].
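For concreteness, here is a minimal sketch of the tree-based mechanism described above, specialized to private prefix sums of bounded scalars. The Laplace noise and the simple per-node privacy split are illustrative choices, not the exact calibration of [10, 18] or of this paper (which adds matrix-valued noise at each node).

```python
import numpy as np

class TreeMechanism:
    """Binary-tree (binary counter) mechanism for private prefix sums of scalars.
    Each node covers a dyadic interval of rounds and holds one noisy partial sum."""

    def __init__(self, n, epsilon, sensitivity=1.0, seed=0):
        self.levels = int(np.ceil(np.log2(max(n, 2)))) + 1   # each entry touches <= levels nodes
        self.scale = sensitivity * self.levels / epsilon      # Laplace scale per node (epsilon split)
        self.node_sum = {}     # (level, index) -> true partial sum
        self.node_noise = {}   # (level, index) -> noise, sampled once when the node is first touched
        self.rng = np.random.default_rng(seed)

    def add(self, t, value):
        """Insert the t-th entry (1-indexed) into every dyadic node that covers it."""
        for level in range(self.levels):
            key = (level, (t - 1) >> level)
            self.node_sum[key] = self.node_sum.get(key, 0.0) + value
            if key not in self.node_noise:
                self.node_noise[key] = self.rng.laplace(scale=self.scale)

    def prefix_sum(self, t):
        """Noisy sum of entries 1..t, combining at most `levels` complete dyadic nodes."""
        total, pos, level = 0.0, t, 0
        while pos > 0:
            if pos & 1:                 # a complete dyadic block of size 2^level ends at position t
                key = (level, pos - 1)  # block covers [(pos-1)*2^level + 1, pos*2^level]
                total += self.node_sum.get(key, 0.0) + self.node_noise.get(key, 0.0)
            pos >>= 1
            level += 1
        return total
```

Each entry is written into at most `levels` nodes and each prefix sum is read from at most `levels` nodes, which is what keeps both the privacy loss and the error logarithmic in n.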
2 Preliminaries and Notation

We use bold letters to denote vectors and bold CAPITALS for matrices. Given a d-column matrix M, its Gram matrix is the (d × d) matrix MᵀM. A symmetric matrix M is positive semi-definite (PSD, denoted M ⪰ 0) if xᵀMx ≥ 0 for every vector x. Any such M defines a norm on vectors, so we define ‖x‖²_M = xᵀMx. We use M ⪰ N to mean M − N ⪰ 0. The Gaussian distribution N(µ, σ²) is defined by the density function (2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)). The squared L2-norm of a d-dimensional vector whose coordinates are drawn i.i.d. from N(0,1) follows the χ²(d) distribution, which is tightly concentrated around d. Given two distributions P and Q, we denote their total-variation distance by d_TV(P, Q) = max_{event E} |P(E) − Q(E)|.

Notation. Our bounds depend on many parameters of the problem, specified below. Additional parameters (bounds) are specified in the assumptions stated below.

- n: horizon, i.e. number of rounds
- s, t: indices of rounds
- d: dimensionality of the action space
- D_t ⊆ R^d: decision set at round t
- x_t ∈ D_t: action at round t
- y_t ∈ R: reward at round t
- θ∗ ∈ R^d: unknown parameter vector
- m := log₂(n) + 1: number of noisy sums combined by the tree-based algorithm on any day

If for each t the regularizers H_t and h_t are generated by the tree-based algorithm with Wishart noise W_{d+1}(L²I, k), then there exist (α/2n)-accurate bounds for ρ_min and ρ_max (each involving a √(2 ln(8n/α)) term) and for γ (involving a √(2 ln(2n/α)) term). Moreover, if we use the shifted regularizer H'_t := H_t − cI with c as given in Eq. (4), then the following are (α/2n)-accurate bounds: ρ_min = 4L²√(2 ln(8n/α)), ρ_max = 8L²√(2 ln(8n/α)), and γ involving √(2 ln(2n/α)). Plugging these into Theorems 5 and 6 gives us the following upper bounds on pseudo-regret.

Corollary 10. Running Algorithm 1 with H_t and h_t generated by the tree-based mechanism, with each node adding noise drawn independently from W_{d+1}((L² + B²)I, k) and then subtracting cI using Eq. (4), we get a pseudo-regret bound of

$O\Big( B\sqrt{n}\Big( \sigma\sqrt{\log(1/\alpha) + d\log(n)} + S\,L\sqrt{d}\,\log(n)^{3/4}\big(d^{1/4} + \varepsilon^{-1/2}\log(1/\delta)^{1/4}\big)\big(d^{1/4} + \log(n/\alpha)^{1/4}\big)\Big)\Big)$

in general, and a gap-dependent pseudo-regret bound of

$O\Big( \tfrac{1}{\Delta}\Big( \sigma\sqrt{\log(1/\alpha) + d\log(n)} + S\,L\sqrt{d}\,\log(n)^{3/4}\big(d^{1/4} + \varepsilon^{-1/2}\log(1/\delta)^{1/4}\big)\big(d^{1/4} + \log(n/\alpha)^{1/4}\big)\Big)^{2}\Big).$
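A minimal sketch of how one node's Wishart noise could be generated and turned into a regularized pair (H_t, h_t): the degrees-of-freedom parameter k, the scale, and the shift c are placeholders for the privacy-calibrated quantities the paper derives (e.g., via Eq. (4)), and the padding of each datum as (x_t, y_t) matches the (d+1)-dimensional noise used above.

```python
import numpy as np

def wishart_noise(dim, k, scale_sq, rng):
    """Sample one node's noise from W_dim(scale_sq * I, k): the Gram matrix of k i.i.d.
    N(0, scale_sq * I) rows. The result is always PSD by construction."""
    A = rng.normal(scale=np.sqrt(scale_sq), size=(k, dim))
    return A.T @ A

def wishart_private_statistics(data, k, L_sq, c_shift, rng):
    """Form the noisy second-moment matrix of the padded data (x_t, y_t), as one tree node
    would, then split it into the regularized pair (H_t, h_t) used by the UCB learner.
    `k` and `c_shift` are placeholders for the privacy-calibrated values."""
    d_plus_1 = data.shape[1]                       # rows are the padded vectors (x_t, y_t)
    M = data.T @ data                              # exact second-moment matrix
    N = wishart_noise(d_plus_1, k, L_sq, rng)
    noisy = M + N - c_shift * np.eye(d_plus_1)     # optional shift to reduce the added bias
    H_t = noisy[:-1, :-1]                          # top-left d x d block: noisy Gram matrix of x's
    h_t = noisy[:-1, -1]                           # last column (minus corner): noisy sum of y_t * x_t
    return H_t, h_t

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 6))                    # 50 rounds of padded vectors, d = 5
H, h = wishart_private_statistics(data, k=20, L_sq=1.0, c_shift=0.0, rng=rng)
```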
4.2 Differential Privacy via Additive Gaussian Noise

Our second alternative is to instantiate the tree-based algorithm with symmetric Gaussian noise: sample Z' ∈ R^{(d+1)×(d+1)} with each Z'_{i,j} ∼ N(0, σ²_noise) i.i.d. and symmetrize to get Z = (Z' + Z'ᵀ)/√2.⁴ Recall that each datum has a bounded L2-norm of L, hence a change to a single datum may alter the Frobenius norm of M_t by L². It follows that, in order to make sure each node in the tree-based algorithm preserves (ε/√(8m ln(2/δ)), δ/2)-differential privacy,⁵ the variance in each coordinate must be σ²_noise = 16m L⁴ ln(4/δ)²/ε². When all entries of Z are sampled from N(0,1), known concentration results [32] on the top singular value of Z give that P[‖Z‖ > 4√(d+1) + 2√(ln(2n/α))] < α/(2n). Note however that on each day t the noise N_t is the sum of m such matrices, so the variance of each coordinate is mσ²_noise. The top-left (d × d) submatrix of N_t therefore has operator norm of at most Υ := √(mσ²_noise)·(4√d + 2√(ln(2n/α))). However, it is important to note that adding Gaussian noise may not preserve the PSD property of the noisy Gram matrix. To that end, we shift N_t by cI in order to make sure that we maintain strictly positive eigenvalues throughout the execution of the algorithm. Since the bounds in Theorems 5 and 6 mainly depend on ρ_max + γ, we choose the shift magnitude to be 2ΥI. This makes ρ_max = 3Υ and ρ_min = Υ, and as a result $\lVert h_t \rVert_{H_t^{-1}}^2 \le \Upsilon^{-1} \lVert h_t \rVert^2$, which we can bound using standard concentration bounds on the χ²-distribution (see Claim 17). This culminates in the following bounds.

Proposition 11. Fix any α > 0. If for each t the regularizers H_t and h_t are obtained by applying the tree-based algorithm with symmetrized, shifted Gaussian noise whose entries are sampled i.i.d. from N(0, σ²_noise), then the following ρ_min, ρ_max, and γ are (α/2n)-accurate bounds: ρ_min = Υ, ρ_max = 3Υ, and γ = σ_noise √(2 ln(2n/α)) √(d + 2√(ln(2n/α))) / √Υ.

⁴ This increases the variance along the diagonal entries beyond the noise magnitude required to preserve privacy, but only by a constant factor of 2.
⁵ We use here the slightly better bounds for the composition of Gaussian noise based on zero-concentrated DP [9].

Note how this choice of shift indeed makes both ρ_max and γ² roughly on the order of O(Υ). The end result is that for each day t, the noise added to h_t is a sum of at most m d-dimensional vectors whose entries are sampled i.i.d. from N(0, σ²_noise); the symmetrization doesn't change the distribution of each coordinate. The noise added to H_t is given by: (i) summing at most m matrices whose entries are sampled i.i.d. from N(0, σ²_noise); (ii) symmetrizing the result as shown above; and (iii) adding 2ΥI. This leads to a bound on the regret of Algorithm 1 with the tree-based algorithm using Gaussian noise.

Corollary 12. Applying Algorithm 1 where the regularizers H_t and h_t are derived by applying the tree-based algorithm, where each node holds a symmetrized matrix whose entries are sampled i.i.d. from N(0, σ²_noise) and then adding 2ΥI, we get a regret bound of

$O\Big( B\sqrt{n}\Big( \sigma\sqrt{d\log(n) + \log(1/\alpha)} + S\,L\log(n)\sqrt{d + \ln(n/\alpha)}\;\ln(1/\delta)/\varepsilon\Big)\Big)$

in general, and a gap-dependent pseudo-regret bound of

$O\Big( \tfrac{1}{\Delta}\Big( \sigma\sqrt{d\log(n) + \log(1/\alpha)} + S\,L\log(n)\sqrt{d + \ln(n/\alpha)}\;\ln(1/\delta)/\varepsilon\Big)^{2}\Big).$
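A matching sketch for the Gaussian-noise alternative, again with placeholder values: sigma_noise, the tree depth m, and the shift upsilon below stand in for the calibrated quantities derived above.

```python
import numpy as np

def symmetrized_gaussian_noise(dim, sigma_noise, rng):
    """One node's noise: an i.i.d. N(0, sigma_noise^2) matrix symmetrized as (Z' + Z'^T)/sqrt(2)."""
    Zp = rng.normal(scale=sigma_noise, size=(dim, dim))
    return (Zp + Zp.T) / np.sqrt(2)

def gaussian_private_statistics(data, sigma_noise, m, upsilon, rng):
    """Noisy second-moment matrix with the sum of m symmetrized Gaussian noise matrices
    (as the tree-based algorithm would add on a worst-case day), shifted by 2*Upsilon*I
    so the top-left block stays positive definite with high probability."""
    d_plus_1 = data.shape[1]
    M = data.T @ data
    N = sum(symmetrized_gaussian_noise(d_plus_1, sigma_noise, rng) for _ in range(m))
    noisy = M + N + 2.0 * upsilon * np.eye(d_plus_1)
    H_t = noisy[:-1, :-1]        # regularized Gram matrix of the x's
    h_t = noisy[:-1, -1]         # regularized vector of y_t * x_t sums
    return H_t, h_t

rng = np.random.default_rng(2)
data = rng.normal(size=(50, 6))
d = data.shape[1] - 1
sigma_noise, m = 0.5, 11                         # placeholder noise level and tree depth
upsilon = np.sqrt(m) * sigma_noise * (4 * np.sqrt(d) + 2 * np.sqrt(np.log(2 * 50 / 0.05)))
H, h = gaussian_private_statistics(data, sigma_noise, m, upsilon, rng)
```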
5 Lower Bounds

In this section, we present lower bounds for two versions of the problem we deal with in this work. The first, and probably the more obvious of the two, deals with the impossibility of obtaining sub-linear regret for the contextual bandit problem with the standard notion of differential privacy (under continual observation). Here, we assume user t provides a context c_t which determines the mapping of the arms to feature vectors: φ(c_t, a) ∈ R^d. The sequence of choices thus made by the learner is (a_1, ..., a_n) ∈ A^n, which we aim to keep private. The next claim, whose proof is deferred to Section C in the supplementary material, shows that this is impossible without giving up any reasonable notion of utility.

Claim 13. For any ε < ln(2) and δ < 0.25, any (ε,δ)-differentially private algorithm A for the contextual bandit problem must incur pseudo-regret of Ω(n).

The second lower bound we show is more challenging. We show that any ε-differentially private algorithm for the classic MAB problem must incur an additional pseudo-regret of Ω(k log(n)/ε) on top of the standard (non-private) regret bounds. We consider an instance of the MAB where the leading arm is a_1, the rewards are drawn from distributions over {−1, 1}, and the gap between the means of arm a_1 and arm a ≠ a_1 is Δ_a. A simple calculation shows that for such distributions, the total-variation distance between two distributions whose means are µ and µ' is |µ − µ'|/2. Fix Δ_2, Δ_3, ..., Δ_k as some small constants; we now argue the following.

Claim 14. Let A be any ε-differentially private algorithm for the MAB problem with k arms whose expected regret is at most n^{3/4}. Fix any arm a ≠ a_1, whose gap from the optimal arm a_1 is Δ_a. Then, for sufficiently large n, A pulls arm a at least log(n)/(100εΔ_a) times with probability at least 1/2.

We comment that the bound n^{3/4} was chosen arbitrarily, and we only require a regret upper bound of n^{1−c} for some c > 0. Of course, we could have used standard assumptions, where the regret is asymptotically smaller than any polynomial, or discussed algorithms of regret O(√n) (the best minimax regret). Aiming to separate the standard lower bounds on regret from the private bounds, we decided to use n^{3/4}. As an immediate corollary we obtain the following private regret bound.

Corollary 15. The expected pseudo-regret of any ε-differentially private algorithm for the MAB is Ω(k log(n)/ε). Combined with the non-private bound of $\Omega\big(\sum_{a \neq a_1} \log(n)/\Delta_a\big)$, we get that the private regret bound is the maximum of the two terms, i.e., $\Omega\big(k \log(n)/\varepsilon + \sum_{a \neq a_1} \log(n)/\Delta_a\big)$.

Proof. By Claim 14, each suboptimal arm a is pulled at least log(n)/(100εΔ_a) times with probability at least 1/2, and each such pull contributes Δ_a to the pseudo-regret; hence the expected pseudo-regret is at least $\sum_{a \neq a_1} \tfrac{1}{2} \cdot \tfrac{\log(n)}{100\varepsilon\Delta_a} \cdot \Delta_a = \tfrac{(k-1)\log(n)}{200\varepsilon}$.

Proof of Claim 14. Fix arm a. Let P be the vector of the k probability distributions associated with the k arms. Denote by E the event that arm a is pulled fewer than t_a := log(n)/(100εΔ_a) times. Our goal is to show that P_{A; rewards ∼ P}[E] < 1/2. To that end, we postulate a different distribution for the rewards of arm a: a new distribution whose mean is greater by Δ_a than the mean reward of arm a_1. The total-variation distance between the given distribution and the postulated distribution is Δ_a. Denote by Q the vector of distributions of arm rewards (where only P_a ≠ Q_a). We now argue that should the rewards be drawn from Q, then the event E is very unlikely: P_{A; rewards ∼ Q}[E] ≤ 2n^{−1/4}/Δ_a. Indeed, this follows from a standard Markov-like argument: the expected pseudo-regret of A is at most n^{3/4}, yet under Q, on the event E the learner pulls arms other than a at least n − t_a times, so the pseudo-regret is at least $P_{A;\,\mathrm{rewards}\sim Q}[E]\,(n - t_a)\Delta_a \ge (n\Delta_a/2)\, P_{A;\,\mathrm{rewards}\sim Q}[E]$ for sufficiently large n.

We now apply a beautiful result of Karwa and Vadhan [22, Lemma 6.1], stating that the effective group privacy between the case where the n datums of the input are drawn i.i.d. from distribution P and the case where they are drawn from distribution Q is proportional to εn·d_TV(P, Q). In our case, the key point is that we only consider this change under the event E; thus the number of samples we need to redraw from the distribution P_a rather than Q_a is strictly smaller than t_a, and the elegant coupling argument of [22] reduces it to 6Δ_a t_a. To better illustrate the argument, consider the coupling argument of [22] as an oracle O. The oracle generates a collection of precisely t_a pairs of points: the left ones are i.i.d. samples from P_a and the right ones are i.i.d. samples from Q_a, and, in expectation, in a (1 − Δ_a) fraction of the pairs the right and left samples are identical. Whenever the learner A pulls arm a, it makes an oracle call to O, and depending on the environment (whether the distribution of rewards is P or Q) O provides either a fresh left sample or a right sample. Moreover, suppose there exists a counter C that stands between A and O, and in case O runs out of examples then C routes A's oracle calls to a different oracle. Now, Karwa and Vadhan [22, Lemma 6.1] assures us that the event "C never re-routes the requests" happens with similar probability under P and under Q, differing only by a multiplicative factor of exp(6εΔ_a t_a). And seeing as the event "C never re-routes the requests" is quite unlikely when O only provides right samples (from Q), it is also fairly unlikely when O only provides left samples (from P).
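To illustrate the coupling behind this argument, here is a small simulation sketch (purely illustrative, with placeholder constants) of the oracle O: it prepares t_a coupled reward pairs for arm a, drawn so that each pair agrees except with probability d_TV(P_a, Q_a) = Δ_a (the means of P_a and Q_a differ by 2Δ_a, so their total-variation distance is Δ_a).

```python
import numpy as np

def coupled_reward_pairs(t_a, mean_p, mean_q, rng):
    """Maximal coupling of two {-1,+1} reward distributions with means mean_p and mean_q:
    each pair of samples agrees except with probability |mean_p - mean_q| / 2."""
    pairs = []
    for _ in range(t_a):
        p_plus = (1 + mean_p) / 2          # P(reward = +1) under P_a
        q_plus = (1 + mean_q) / 2          # P(reward = +1) under Q_a
        u = rng.random()                   # one shared uniform draw couples the two samples
        left = 1 if u < p_plus else -1     # sample from P_a
        right = 1 if u < q_plus else -1    # coincides with `left` unless u falls in the gap
        pairs.append((left, right))
    return pairs

rng = np.random.default_rng(3)
delta_a, t_a = 0.1, 200
pairs = coupled_reward_pairs(t_a, mean_p=0.0, mean_q=2 * delta_a, rng=rng)
disagreements = sum(l != r for l, r in pairs)
print(f"{disagreements} of {t_a} coupled pairs differ (expected about {delta_a * t_a:.0f})")
```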
Formally, we conclude the proof by applying the result of [22] to infer that

$P_{A;\,\mathrm{rewards}\sim P}[E] \;\le\; \exp(6\varepsilon\Delta_a t_a)\, P_{A;\,\mathrm{rewards}\sim Q}[E] \;\le\; \exp(0.06\log(n)) \cdot \tfrac{2}{\Delta_a}\, n^{-1/4} \;=\; \tfrac{2}{\Delta_a}\, n^{-0.19} \;\le\; \tfrac{1}{2}$

for sufficiently large n, proving the required claim.

Acknowledgements. We gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting R.S. with the Alexander Graham Bell Canada Graduate Scholarship and O.S. with grant #2017-06701. R.S. was also supported by Alberta Innovates, and O.S. is also an unpaid collaborator on NSF grant #1565387.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 24, pages 2312-2320. Curran Associates, Inc., 2011.
[2] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online-to-confidence-set conversions and application to sparse stochastic bandits. In AISTATS, pages 1-9, 2012.
[3] Naoki Abe, Alan W. Biermann, and Philip M. Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263-293, 2003.
[4] Rajeev Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054-1078, 1995.
[5] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3:397-422, 2003.
[6] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.
[7] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS '14, pages 464-473, Washington, DC, USA, 2014. IEEE Computer Society. ISBN 978-1-4799-6517-5. doi: 10.1109/FOCS.2014.56.
[8] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London; New York, 1985.
[9] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography, Lecture Notes in Computer Science, pages 635-658. Springer, Berlin, Heidelberg, November 2016. ISBN 978-3-662-53640-7 978-3-662-53641-4. doi: 10.1007/978-3-662-53641-4_24.
[10] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. In Automata, Languages and Programming, Lecture Notes in Computer Science, pages 405-417. Springer, Berlin, Heidelberg, July 2010. ISBN 978-3-642-14161-4 978-3-642-14162-1. doi: 10.1007/978-3-642-14162-1_34.
[11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069-1109, July 2011. ISSN 1532-4435.
[12] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, volume 15 of JMLR Proceedings, pages 208-214, 2011.
[13] Varsha Dani, Thomas Hayes, and Sham Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, pages 355-366, January 2008.
[14] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51-60, October 2010. doi: 10.1109/FOCS.2010.12.
[15] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, August 2014. ISSN 1551-305X, 1551-3068. doi: 10.1561/0400000042.
[16] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology - EUROCRYPT 2006, Lecture Notes in Computer Science, pages 486-503. Springer, Berlin, Heidelberg, May 2006. ISBN 978-3-540-34546-6 978-3-540-34547-3. doi: 10.1007/11761679_29.
[17] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, Lecture Notes in Computer Science, pages 265-284. Springer, Berlin, Heidelberg, March 2006. ISBN 978-3-540-32731-8 978-3-540-32732-5. doi: 10.1007/11681878_14.
[18] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 715-724, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0050-6. doi: 10.1145/1806689.1806787.
[19] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 11-20, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2710-7. doi: 10.1145/2591796.2591883.
[20] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC '10, pages 705-714, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0050-6. doi: 10.1145/1806689.1806786.
[21] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. In Conference on Learning Theory, pages 24.1-24.34, June 2012.
[22] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. arXiv:1711.03908 [cs, math, stat], November 2017.
[23] Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. Pages 403-410. ACM Press, 2014. ISBN 978-1-4503-2698-8. doi: 10.1145/2554797.2554834.
[24] Tor Lattimore and Csaba Szepesvári. The end of optimism? An asymptotic analysis of finite-armed linear bandits. In AISTATS, pages 728-737, 2017.
[25] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302-1338, October 2000. ISSN 0090-5364, 2168-8966. doi: 10.1214/aos/1015957395.
[26] Nikita Mishra and Abhradeep Thakurta. (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI '15, pages 592-601, Arlington, Virginia, United States, 2015. AUAI Press. ISBN 978-0-9966431-0-8.
[27] Seth Neel and Aaron Roth. Mitigating bias in adaptive data gathering via differential privacy. In ICML, pages 3717-3726, 2018.
[28] Herbert Robbins. Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc., 58(5):527-535, September 1952.
[29] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411, April 2010. ISSN 0364-765X. doi: 10.1287/moor.1100.0446.
[30] Or Sheffet. Private approximations of the 2nd-moment matrix using existing techniques in linear regression. arXiv:1507.00056 [cs], June 2015.
[31] Adam Smith and Abhradeep Thakurta. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2733-2741. Curran Associates, Inc., 2013.
[32] Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Society, Providence, RI, 2012.
[33] Aristide C. Y. Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Thirtieth AAAI Conference on Artificial Intelligence, March 2016.
[34] Aristide C. Y. Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. arXiv:1701.04222 [cs], January 2017.
[35] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027 [cs, math], November 2010.
[36] Fuzhen Zhang. Matrix Theory: Basic Results and Techniques. Universitext. Springer, New York, 2nd edition, 2011. ISBN 978-1-4614-1098-0.