# Generalized Linear Bandits with Local Differential Privacy

Yuxuan Han¹ (yhanat@connect.ust.hk), Zhipeng Liang² (zliangao@connect.ust.hk), Yang Wang¹,² (yangwang@ust.hk), Jiheng Zhang¹,² (jiheng@ust.hk)
Department of Mathematics¹, Department of Industrial Engineering and Decision Analytics², The Hong Kong University of Science and Technology

Contextual bandit algorithms are useful in personalized online decision-making. However, many applications such as personalized medicine and online advertising require the utilization of individual-specific information for effective learning, while users' data should remain private from the server due to privacy concerns. This motivates the introduction of local differential privacy (LDP), a stringent notion of privacy, to contextual bandits. In this paper, we design LDP algorithms for stochastic generalized linear bandits that achieve the same regret bound as in non-private settings. Our main idea is to develop a stochastic gradient-based estimator and update mechanism to ensure LDP. We then exploit the flexibility of stochastic gradient descent (SGD), whose theoretical guarantees for bandit problems are rarely explored, in dealing with generalized linear bandits. We also develop an estimator and update mechanism based on ordinary least squares (OLS) for linear bandits. Finally, we conduct experiments on both simulated and real-world datasets to demonstrate the consistently superb performance of our algorithms under LDP constraints with reasonably small parameters $(\varepsilon, \delta)$ that ensure strong privacy protection.

*Equal contributions. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## 1 Introduction

Contextual bandit algorithms have received extensive attention for their efficacy in online decision making in many applications such as recommendation systems, clinical trials, and online advertising [7, 35, 24]. Despite their success in many applications, the intensive utilization of user-specific information, especially in privacy-sensitive domains such as clinical trials and e-commerce promotions, raises concerns about data privacy protection. Differential privacy, a provable protection against identification by attackers [18, 19], has been put forth as a competitive candidate for a formal definition of privacy and has received considerable attention from both academic research [33, 17, 43, 36, 8] and industry [20, 12, 37]. While increasing attention has been paid to bandit algorithms with joint differential privacy [34, 9], we introduce in this paper a more stringent notion, local differential privacy (LDP), in which users distrust even the server collecting the data, to contextual bandits.

In a contextual bandit, at each time round $t$ with individual-specific context $X_t$, the decision maker takes an action $a_t$ from a finite set (arms) and receives a reward randomly generated from a distribution depending on the context $X_t$ and the chosen arm through its parameter $\theta^*_{a_t}$, which is unknown to the decision maker. We use the standard notion of expected regret to measure the difference between the expected reward obtained by the action $a_t$ and the best achievable expected reward in this round. While several papers consider the adversarial setting (i.e., $X_t$ can be arbitrarily determined in each round), this paper considers the stochastic contextual case where $X_t$ is generated i.i.d. from a distribution $P_X$. The goal is to maximize the rewards accumulated over the time horizon.
An algorithm achieves an LDP guarantee if every user involved in the algorithm is guaranteed that anyone else can only access her context (and related information such as the arm chosen and the reward) with limited advantage over a random guess. Recently there is an emerging stream of works combining LDP and bandits. [6, 29, 10] consider LDP context-free bandits and design algorithms achieving the same regret as in the non-private setting. For contextual bandits, [44] considers the adversarial setting. Despite this pioneering work, their regret bound $\tilde O(T^{3/4})$ leaves a gap from the corresponding non-private result $\tilde O(T^{1/2})$, a gap that is conjectured to be inevitable. A natural question arises: can we close this gap for stochastic contextual bandits? In this paper, we design several algorithms and show that they achieve the same regret rate in terms of $T$ as in the non-private setting.

If we do not assume any structure on the arms' parameters, the above formulation is referred to as multi-parameter contextual bandits. If we impose structural assumptions such as all arms sharing the same parameter (see Section 2.2 for details), then the formulation is referred to as single-parameter contextual bandits. Although the multi-parameter and single-parameter settings can be shown to be equivalent, they require separate analyses and algorithm designs because of their distinct properties under different modeling assumptions (e.g., [27]). In this paper, we consider privacy guarantees in both settings. In fact, the multi-parameter setting is more difficult since we need to estimate the parameters of all $K$ arms with sufficient accuracy to make good decisions. However, privacy protection also requires protecting the information about which arm is pulled in each round. Such a requirement hinders the identification of the optimal arm and may incur considerable regret in the decision process. A proper balance between privacy protection and estimation accuracy is the key to designing algorithms with the desired performance guarantees in this setting.

| Result | Regret | Context | Parameter | β-Margin |
|---|---|---|---|---|
| Theorem 10 [44] | $\tilde O(T^{3/4}/\varepsilon)$ | Adversarial | Both | No margin |
| Theorem 3.1 | $\tilde O(T^{1/2}/\varepsilon)$ | Stochastic | Single | No margin |
| Theorem 3.3 | $\tilde O(\log T/\varepsilon^2)$ | Stochastic | Single | $\beta = 1$ |
| Theorem 3.3 | $\tilde O(T^{\frac{1-\beta}{2}}/\varepsilon^{1+\beta})$ | Stochastic | Single | $0 \le \beta < 1$ |
| Theorem 4.1 | $\tilde O((\log T/\varepsilon)^2)$ | Stochastic | Multiple | $\beta = 1$ |
| Theorem 4.1 | $\tilde O(T^{\frac{1-\beta}{2}}/\varepsilon^{1+\beta})$ | Stochastic | Multiple | $0 < \beta < 1$ |

Table 1: Summary of our main results under $(\varepsilon, \delta)$-LDP, where $\tilde O(\cdot)$ omits poly-logarithmic factors.

Contributions. We organize our results for the various settings in Table 1. Our main contributions can be summarized as follows:

1. We develop a framework for implementing LDP algorithms by integrating greedy algorithms with a private OLS estimator for linear bandits and a private SGD estimator for generalized linear bandits. We prove that our algorithms achieve regret bounds matching the corresponding non-private results.
2. In the multi-parameter setting, to ensure the privacy of the arm pulled in each round, we design a novel LDP strategy that simultaneously updates all the arms with synthetic information instead of releasing the pulled arm. By conducting such synthetic updates for unselected arms, we protect the identity of the pulled arm from the server and other users. This comes at the cost of corrupting the estimates of the unselected arms. To deal with this issue, we design an elimination method based only on data collected during a short warm-up period. We show that such a mechanism can be combined with the OLS and SGD estimators to achieve the desired performance guarantees.
3. We introduce the SGD estimator to bandit algorithms to tackle generalized linear reward structures. To the best of our knowledge, few papers have considered SGD-based bandit algorithms: theoretical regret bounds are established in [13] by combining SGD and Thompson sampling, while most other works are limited to empirical studies [7, 32]. We establish such theoretical regret bounds for SGD-based bandit algorithms. Our private SGD estimator for bandits is highly computationally efficient and, more importantly, greatly simplifies the data-processing mechanism required for the LDP guarantee.

## 2 Preliminaries

Notations. We start by fixing some notation that will be used throughout this paper. For a positive integer $n$, $[n]$ denotes the set $\{1, \dots, n\}$. $|A|$ denotes the cardinality of the set $A$. $\|\cdot\|_2$ is the Euclidean norm. $W(i, j)$ denotes the element in the $i$-th row and $j$-th column of a matrix $W$. We write $W > 0$ if the matrix $W$ is symmetric and positive definite. We denote by $I_d$ the $d$-dimensional identity matrix. Let $\otimes$ denote the Kronecker product. Let $B^d_r$ denote the $d$-dimensional ball with radius $r$ and $S^{d-1}_r$ denote the $(d-1)$-dimensional sphere bounding that ball. Given a set $A$, $\mathrm{Unif}(A)$ denotes the uniform distribution over $A$. For a tuple $(Z_{i,j})_{i \in [N], j \in [M]}$ and $1 \le k_1 < k_2 \le M$, we denote $Z_{i, k_1:k_2} = (Z_{i,k_1}, \dots, Z_{i,k_2})$. We adopt standard asymptotic notation: for two non-negative sequences $\{a_n\}$ and $\{b_n\}$, $a_n = O(b_n)$ iff $\limsup_n a_n/b_n < \infty$, $a_n = \Omega(b_n)$ iff $b_n = O(a_n)$, and $a_n = \Theta(b_n)$ iff $a_n = O(b_n)$ and $b_n = O(a_n)$. We also write $\tilde O(\cdot)$, $\tilde\Omega(\cdot)$ and $\tilde\Theta(\cdot)$ to denote the respective meanings up to multiplicative logarithmic factors in $n$.

### 2.1 Local Differential Privacy

Definition 2.1 (Local differential privacy). We say a (randomized) mechanism $M : \mathcal{X} \to \mathcal{Z}$ is $(\varepsilon, \delta)$-LDP if for every $x \ne x' \in \mathcal{X}$ and any measurable set $C \subseteq \mathcal{Z}$ we have
$$P(M(x) \in C) \le e^{\varepsilon} P(M(x') \in C) + \delta.$$
When $\delta = 0$, we simply say $\varepsilon$-LDP. We now present some tools that will be useful for our analysis.

Lemma 2.1 (Gaussian Mechanism [16, 19]). For any $f : \mathcal{X} \to \mathbb{R}^n$, let $\sigma_{\varepsilon,\delta} = \frac{1}{\varepsilon} \sup_{x, x' \in \mathcal{X}} \|f(x) - f(x')\|_2 \sqrt{2 \ln(1.25/\delta)}$. The Gaussian mechanism, which adds random noise independently drawn from the distribution $N(0, \sigma_{\varepsilon,\delta}^2 I_n)$ to the output of $f$, ensures $(\varepsilon, \delta)$-LDP.

Besides the Gaussian mechanism, we also use the following privacy mechanism for bounded vectors.

Lemma 2.2 (Privacy mechanism for the $\ell_2$-ball [14]). For any $R > 0$, let
$$r_{\varepsilon,d} = R\,\frac{\sqrt{\pi}}{2}\,\frac{e^{\varepsilon}+1}{e^{\varepsilon}-1}\,\frac{d\,\Gamma\!\big(\tfrac{d+1}{2}\big)}{\Gamma\!\big(\tfrac{d}{2}+1\big)},$$
where $\Gamma$ is the Gamma function. For any $x \in B^d_R$, consider the mechanism $\Psi_{\varepsilon,R} : B^d_R \to S^{d-1}_{r_{\varepsilon,d}}$ that generates $Z_x$ as follows. First, generate a random vector $\tilde X = (2b - 1)x$, where $b$ is a Bernoulli random variable with success probability $\tfrac{1}{2} + \tfrac{\|x\|_2}{2R}$. Next, generate the random vector $Z_x$ via
$$Z_x \sim \begin{cases} \mathrm{Unif}\{z \in \mathbb{R}^d : z^T \tilde X > 0,\ \|z\|_2 = r_{\varepsilon,d}\} & \text{with probability } e^{\varepsilon}/(1+e^{\varepsilon}), \\ \mathrm{Unif}\{z \in \mathbb{R}^d : z^T \tilde X \le 0,\ \|z\|_2 = r_{\varepsilon,d}\} & \text{with probability } 1/(1+e^{\varepsilon}). \end{cases}$$
Then $\Psi_{\varepsilon,R}$ is $\varepsilon$-LDP and $E[\Psi_{\varepsilon,R}(x)] = x$.

Lemma 2.3 (Post-processing property [19]). If $M : \mathcal{X} \to \mathcal{Y}$ is $(\varepsilon, \delta)$-LDP and $f : \mathcal{Y} \to \mathcal{Z}$ is a fixed map, then $f \circ M : \mathcal{X} \to \mathcal{Z}$ is $(\varepsilon, \delta)$-LDP.

Lemma 2.4 (Composition property [19]). If $M_1 : \mathcal{X} \to \mathcal{Z}_1$ is $(\varepsilon_1, \delta_1)$-LDP and $M_2 : \mathcal{X} \to \mathcal{Z}_2$ is $(\varepsilon_2, \delta_2)$-LDP, then $M = (M_1, M_2) : \mathcal{X} \to \mathcal{Z}_1 \times \mathcal{Z}_2$ is $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-LDP.
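To make the $\ell_2$-ball mechanism concrete, the following is a minimal Python sketch of Lemma 2.2 (our own illustration, not the authors' reference code; the function name and the NumPy-based sampling strategy are our choices):

```python
import numpy as np
from math import lgamma, pi, exp

def l2_ball_mechanism(x, eps, R, rng=None):
    """Sketch of the eps-LDP l2-ball mechanism Psi_{eps,R} of Lemma 2.2.
    Input x must satisfy ||x||_2 <= R; the output lies on a sphere of radius
    r_{eps,d} and is an unbiased randomization of x."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    # Radius of the output sphere, chosen so that E[Z_x] = x.
    r_eps = (R * np.sqrt(pi) / 2.0
             * (exp(eps) + 1.0) / (exp(eps) - 1.0)
             * d * exp(lgamma((d + 1) / 2.0) - lgamma(d / 2.0 + 1.0)))

    # Step 1: random sign flip with success probability 1/2 + ||x||_2/(2R).
    norm_x = np.linalg.norm(x)
    x_tilde = x if rng.random() < 0.5 + norm_x / (2.0 * R) else -x
    if norm_x == 0.0:
        x_tilde = rng.standard_normal(d)   # any direction works when x = 0

    # Step 2: sample uniformly from the hemisphere aligned with x_tilde
    # w.p. e^eps/(1+e^eps), otherwise from the opposite hemisphere.
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                 # uniform direction on the sphere
    aligned = rng.random() < exp(eps) / (1.0 + exp(eps))
    if (float(v @ x_tilde) > 0.0) != aligned:
        v = -v
    return r_eps * v
```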
### 2.2 Local Differential Privacy in Bandits

We consider contextual bandits with an LDP guarantee in the context of the user-server communication protocol described in Figure 1. The user in round $t$ with context $X_t \in \mathbb{R}^d$ receives (processed) historical information $S_{t-1}$ from the server and chooses an action $a_t \in [K]$ to obtain a random reward $r_t = v(X_t, a_t) + \epsilon_t$. Define $\mathcal{F}_t$ as the filtration of all historical information up to time $t$, i.e., $\mathcal{F}_t = \sigma(X_1, \dots, X_t, \epsilon_1, \dots, \epsilon_{t-1})$; we require $\epsilon_t$ to be bounded and $E[\epsilon_t \mid \mathcal{F}_t] = 0$. The user then processes the tuple $(X_t, r_t)$ with some mechanism $\psi$ with an LDP guarantee and sends the processed information $Z_t = \psi(X_t, r_t)$ to the server. After receiving $Z_t$, the server updates the historical information $S_t$ to get $S_{t+1}$.

We consider generalized linear bandits by allowing $v(X_t, a_t) = \mu(X_t^T \theta^*_{a_t})$, where $\mu : \mathbb{R} \to \mathbb{R}$ is a link function and $\theta^*_i \in \mathbb{R}^d$ is the underlying parameter of the $i$-th arm. For a fixed time $t$, we denote $a^*_t = \arg\max_{i \in [K]} \mu(X_t^T \theta^*_i)$. The regret over the time horizon $T$ is
$$\mathrm{Reg}(T) = \sum_{t=1}^{T} \big( \mu(X_t^T \theta^*_{a^*_t}) - \mu(X_t^T \theta^*_{a_t}) \big).$$
If we do not assume any structure on $\{\theta^*_i\}_{i \in [K]}$, we refer to this as the multi-parameter setting. We also consider the $d$-dimensional single-parameter setting by assuming $\theta^*_i = e_i \otimes \theta^*$ for some $\theta^* \in \mathbb{R}^d$, where $\{e_i\}_{i \in [K]}$ is the canonical basis of $\mathbb{R}^K$. In this case, $x_{t,i} \in \mathbb{R}^d$ is the $i$-th segment of $X_t \in \mathbb{R}^{dK}$ and $X_t^T \theta^*_i = x_{t,i}^T \theta^*$, so choosing arm $i$ amounts to choosing the $i$-th segment $x_{t,i}$ of the context.

Figure 1: User-server communication protocol (in each round $t$ the user applies the privacy mechanism $\psi_t$ to her data and the server updates its state $S_t \to S_{t+1} \to S_{t+2}$).

In the rest of the paper, we always assume that $\|\theta^*_i\|_2 \le 1$ for all $i \in [K]$, the reward is bounded by $c_r$, and the Euclidean norm of the context is bounded by $C_B$. Our analysis can easily be generalized to the case where $\epsilon_t$ and the context follow sub-Gaussian distributions. In fact, when the context and noise are sub-Gaussian, sub-Gaussian concentration guarantees that, with overwhelming probability, at least $T - O(\log T)$ users have contexts and rewards bounded by $C_{1,T}$, $C_{2,T}$, so we can still use the above LDP mechanisms to protect their privacy. Users whose contexts and rewards fall outside the range $C_{1,T}$, $C_{2,T}$ can send a private version of the $(0, 0)$ vector, so there is no privacy leakage. We also impose regularity assumptions on the link function, which are common in previous work [44, 31, 39]; the corresponding family contains many commonly used models, e.g., the linear model and the logistic model.

Assumption 1. The link function $\mu$ is continuously differentiable and Lipschitz (with constant $L$), and there exists some $\zeta > 0$ such that $\inf_{x \in [-C_B, C_B]} \mu'(x) = \zeta > 0$.

## 3 Single-Parameter Setting

In this section, we develop an LDP contextual bandit framework (Algorithm 1) by combining statistical estimation and privacy mechanisms in the single-parameter bandit setting to achieve optimal regret bounds in various cases. We use an abstract privacy mechanism $\psi$ in (1) and an abstract estimator $\phi$ in (2) to allow various components to be plugged in.

### 3.1 Privacy Guarantee

For the linear case where the link function is $\mu(x) = x$, we can use the following ordinary least squares (OLS) estimator. Let $\sigma_{\varepsilon,\delta} = 2\sqrt{2 \ln(1.25/\delta)}/\varepsilon$. Define $M_t = x_{t,a_t} x_{t,a_t}^T + W_t$, where $W_t$ is a random symmetric matrix with $W_t(i,j) \sim N(0, 4 C_B^2 \sigma_{\varepsilon,\delta}^2)$ and $W_t(j,i) = W_t(i,j)$, and $u_t = r_t x_{t,a_t} + \xi_t$, where $\xi_t$ is a random vector following the distribution $N(0, C_B^2 c_r^2 \sigma_{\varepsilon,\delta}^2 I_d)$. The OLS privacy mechanism and the corresponding estimator are
$$\psi^{\mathrm{OLS}}_t(x_{t,a_t}, r_t; \hat\theta_{t-1}) = (M_t, u_t), \qquad (3)$$
$$\phi^{\mathrm{OLS}}_t(Z_1, \dots, Z_t; \hat\theta_{t-1}) = \Big(\sum_{i=1}^{t} M_i + c\sqrt{t}\, I_d\Big)^{-1} \sum_{i=1}^{t} u_i, \qquad (4)$$
where $c > 0$ is to be determined.
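As a concrete illustration of the OLS privacy mechanism (3) and the shifted estimator (4), here is a minimal Python sketch (our own code; the function names and interfaces are assumptions, while the noise scales follow the description above):

```python
import numpy as np

def private_ols_statistics(x, r, eps, delta, C_B, c_r, rng=None):
    """User side, eq. (3): release a Gaussian-perturbed outer product M_t and
    a Gaussian-perturbed reward-weighted context u_t."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    sigma = 2.0 * np.sqrt(2.0 * np.log(1.25 / delta)) / eps   # sigma_{eps,delta}

    # Symmetric noise matrix W_t with entries N(0, (2*C_B*sigma)^2).
    W = rng.normal(0.0, 2.0 * C_B * sigma, size=(d, d))
    W = np.triu(W) + np.triu(W, 1).T
    M = np.outer(x, x) + W

    # Noise vector xi_t ~ N(0, (C_B*c_r*sigma)^2 I_d).
    u = r * x + rng.normal(0.0, C_B * c_r * sigma, size=d)
    return M, u

def ols_estimate(M_sum, u_sum, t, c):
    """Server side, eq. (4): shifted least squares with the c*sqrt(t)*I_d shift."""
    d = u_sum.shape[0]
    return np.linalg.solve(M_sum + c * np.sqrt(t) * np.eye(d), u_sum)
```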
Algorithm 1: LDP Single-parameter Contextual Bandit
Input: Time horizon $T$; privacy level $(\varepsilon, \delta)$.
1: Initialization: set $\hat\theta_0 = 0$.
2: for $t = 1$ to $T$ do
3:   User side:
4:     Receive $\hat\theta_{t-1}$ from the server.
5:     Pull arm $a_t = \arg\max_{a \in [K]} x_{t,a}^T \hat\theta_{t-1}$ and receive $r_t$.
6:     Generate $Z_t$ by $Z_t = \psi_t(x_{t,a_t}, r_t; \hat\theta_{t-1})$.   (1)
7:   Server side:
8:     Receive $Z_t$ from the user.
9:     Update the estimate via $\hat\theta_t = \phi_t(Z_1, \dots, Z_t; \hat\theta_{t-1})$.   (2)

We have the following LDP guarantee using the Gaussian mechanism (Lemma 2.1) and post-processing (Lemma 2.3).

Proposition 3.1. Algorithm 1 with the private OLS update mechanism $\psi^{\mathrm{OLS}}_t$ and estimator $\phi^{\mathrm{OLS}}_t$ is $(\varepsilon, \delta)$-LDP for $\varepsilon \in (0, 1]$.

For a general link function $\mu$, its non-linearity adds to the difficulty in terms of both privacy preservation and bandit learning. To estimate parameters in generalized linear bandits, one common approach is to use a maximum likelihood estimator (MLE) at each step. In contrast to the OLS solution, the MLE does not in general have a closed-form solution with simple sufficient statistics. Thus, solving the MLE optimization problem requires using all previous data points and conducting costly operations at each round, resulting in time complexity and memory usage that grow with time. Instead, we use a one-step stochastic gradient approximation to incrementally update the estimator with the new observation at each round. To obtain an LDP version of this approximation, we use the LDP $\ell_2$-ball mechanism in Lemma 2.2:
$$\psi^{\mathrm{SGD}}_t(x_{t,a_t}, r_t; \hat\theta_{t-1}) = \Psi_{\varepsilon,R}\big( (\mu(x_{t,a_t}^T \hat\theta_{t-1}) - r_t)\, x_{t,a_t} \big), \qquad (5)$$
$$\phi^{\mathrm{SGD}}_t(Z_1, \dots, Z_t; \hat\theta_{t-1}) = \hat\theta_{t-1} - \eta_t Z_t, \qquad (6)$$
where $\eta_t > 0$ is a stepsize to be determined and $R = 2 c_r C_B$. Similarly, we can prove the following LDP guarantee using the $\ell_2$-ball mechanism (Lemma 2.2) and post-processing (Lemma 2.3).

Proposition 3.2. Algorithm 1 with the private SGD update mechanism $\psi^{\mathrm{SGD}}_t$ and estimator $\phi^{\mathrm{SGD}}_t$ is $\varepsilon$-LDP.

### 3.2 Regret Analysis

To derive the regret bound of our framework, we need the following assumptions on the marginal distribution $P_X$ of the stochastic contexts $\{x_{t,a}\}_{a \in [K]}$.

Assumption 2. There exists some $\kappa_u > 0$ such that $\lambda_{\max}(\Sigma_a) \le \kappa_u / d$, where $\Sigma_a$ is the covariance matrix of $x_{t,a}$ under $P_X$ and $\lambda_{\max}(\Sigma_a)$ is its largest eigenvalue.

Assumption 3. For every $u$ with $\|u\|_2 = 1$, denote $a^* = \arg\max_{a \in [K]} x_{t,a}^T u$. There exist some $\kappa_l > 0$, $p > 0$ such that $P_u\big( (x^T v)^2 > \kappa_l / d \big) \ge p$ holds for any $u, v \in S^{d-1}_1$, where $P_u(\cdot)$ is the distribution of $x_{t,a^*}$.

Similar assumptions are common in the analysis of single-parameter contextual bandits, e.g., [13, 22], and our conditions cover a wide range of distributions, including sub-Gaussian distributions with bounded density; see Appendix A for a discussion. Now we can show that our framework indeed achieves the optimal regret bound.

Theorem 3.1. Under Assumptions 2 and 3, with the choice of $c = 2\sigma_{\varepsilon,\delta}(4\sqrt{d} + 2\log(2T/\alpha))$ in (4), Algorithm 1 with OLS mechanism $\psi^{\mathrm{OLS}}_t$ and estimator $\phi^{\mathrm{OLS}}_t$ achieves the following regret with probability at least $1 - \alpha$ for some constant $C$:
$$\mathrm{Reg}(T) \le C\sqrt{T}\left( \frac{C_B(\sigma_{\varepsilon,\delta} + \sigma_\epsilon)\, d\, \sqrt{(d + \log(T/\alpha)) \log(KT/\alpha)}}{\kappa_l p} + o(1) \right).$$
Under Assumptions 1–3, with the choice of $\eta_t = c\sqrt{d}/(\kappa_l \zeta p\, t)$ for some $c > 1$ in (6), Algorithm 1 with SGD mechanism $\psi^{\mathrm{SGD}}_t$ and estimator $\phi^{\mathrm{SGD}}_t$ achieves the following regret with probability at least $1 - \alpha$ for some constant $C$:
$$\mathrm{Reg}(T) \le C\sqrt{T}\left( \frac{r_{\varepsilon,d}\, L\, d\, C_B \sqrt{\log(\log(T)/\alpha)}}{\zeta \kappa_l p} + o(1) \right),$$
where $o(1)$ denotes a factor that tends to $0$ as $T \to \infty$.

In the algorithm we shift the sample covariance matrix by $c\sqrt{t}$ to ensure the positive-definiteness of the noise matrix, as in [34]. Such a shift guarantees estimation accuracy in the early stage. Note that the optimal worst-case regret bound in the non-private case is $\tilde O(T^{1/2})$; our results show that we can achieve the same regret bound as in the non-private case in terms of the time horizon $T$.
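For concreteness, the following sketch puts the greedy action, the SGD mechanism (5), and the server update (6) together for one round of Algorithm 1. It reuses `l2_ball_mechanism` from the earlier snippet; the function name, the `reward_fn` interface, and folding the constants of Theorem 3.1's step size into a single scale `eta0` are our own simplifications:

```python
import numpy as np

def ldp_sgd_round(x_arms, theta_hat, t, reward_fn, mu, eps, c_r, C_B, eta0,
                  rng=None):
    """One round of Algorithm 1 with the SGD plug-in (sketch).

    x_arms    : (K, d) array of context segments x_{t,1}, ..., x_{t,K}
    theta_hat : current estimate hat{theta}_{t-1}
    reward_fn : callable a -> observed reward r_t (the environment)
    mu        : link function
    eta0      : step-size scale; eta_t = eta0 / t as in Theorem 3.1
    """
    rng = np.random.default_rng() if rng is None else rng

    # User side: greedy arm on the current private estimate.
    a_t = int(np.argmax(x_arms @ theta_hat))
    r_t = reward_fn(a_t)

    # Local stochastic gradient, then eps-LDP randomization (eq. (5)),
    # with the gradient-norm bound R = 2 * c_r * C_B.
    grad = (mu(float(x_arms[a_t] @ theta_hat)) - r_t) * x_arms[a_t]
    z_t = l2_ball_mechanism(grad, eps, 2.0 * c_r * C_B, rng)

    # Server side: one private SGD step (eq. (6)).
    return a_t, theta_hat - (eta0 / t) * z_t
```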
In fact, we can show an $\Omega(\sqrt{T}/\varepsilon)$ lower bound in this setting even when $K = 2$, which verifies the optimal dependence on both $T$ and $\varepsilon$.

Theorem 3.2. For $\theta \in \mathbb{R}^d$ and an algorithm $\pi$, we denote by $E[\mathrm{Reg}_\pi(T; \theta)]$ the expected regret of $\pi$ when the underlying parameter is $\theta$. When $K = 2$ and $x_{t,a} \sim N(0, I_d/d)$ are independent over $a \in [K]$, we have for any $\varepsilon$-LDP algorithm $\pi$,
$$\sup_{\theta^* : \|\theta^*\|_2 \le 1} E[\mathrm{Reg}_\pi(T; \theta^*)] = \Omega(\sqrt{T}/\varepsilon).$$

Given the best known $\tilde O(T^{3/4})$ regret bound for adversarial contextual LDP bandits in [44], our $\tilde O(\sqrt{T}/\varepsilon)$ result points to a possible gap between stochastic contextual bandits and adversarial contextual bandits under the LDP constraint.

The bounds given above are problem-independent, i.e., they do not depend on the underlying parameters. If we consider the additional assumption that there is a gap between the optimal arm and the rest, which is usually the case when the number of contexts is small, then we can obtain sharper bounds than the problem-independent ones in Theorem 3.1.

Assumption 4 ((γ, β)-margin condition). We say $P_X$ satisfies the $(\gamma, \beta)$-margin condition with $\gamma > 0$, $0 \le \beta \le 1$, if for $\Delta_t := \mu(x_{t,a^*_t}^T \theta^*) - \max_{j \ne a^*_t} \mu(x_{t,j}^T \theta^*)$ and $h \in [0, b]$ with some positive constant $b$, we have $P[\Delta_t \le h] \le \gamma h^{\beta}$.

Theorem 3.3. Under Assumptions 2–4, with the same choice of $c$ as in Theorem 3.1, Algorithm 1 with OLS mechanism $\psi^{\mathrm{OLS}}_t$ and estimator $\phi^{\mathrm{OLS}}_t$ achieves the following regret with probability at least $1 - \alpha$ for some constant $C$:
$$\mathrm{Reg}(T) \le \begin{cases} C \gamma C_B \log T \left[ \left( \dfrac{C_B d (C_B \sigma_\epsilon + \sigma_{\varepsilon,\delta}) \sqrt{d + \log(T/\alpha)}}{\kappa_l p} \right)^{2} + o_{\beta,\gamma}(1) \right], & \beta = 1, \\[2ex] C \dfrac{\gamma C_B}{1-\beta}\, T^{\frac{1-\beta}{2}} \left[ \left( \dfrac{C_B d (C_B \sigma_\epsilon + \sigma_{\varepsilon,\delta}) \sqrt{d + \log(T/\alpha)}}{\kappa_l p} \right)^{1+\beta} + o_{\beta,\gamma}(1) \right], & 0 \le \beta < 1. \end{cases}$$
Under Assumptions 1–4 and with the same choice of $\eta_t$ as in Theorem 3.1, Algorithm 1 with SGD mechanism $\psi^{\mathrm{SGD}}_t$ and estimator $\phi^{\mathrm{SGD}}_t$ achieves the following regret with probability at least $1 - \alpha$ for some constant $C$:
$$\mathrm{Reg}(T) \le \begin{cases} C \gamma L C_B \log T \left[ \left( \dfrac{r_{\varepsilon,d}\, L d\, C_B \sqrt{\log(\log(T)/\alpha)}}{\zeta \kappa_l p} \right)^{2} + o_{\beta,\gamma}(1) \right], & \beta = 1, \\[2ex] C \dfrac{\gamma L C_B}{1-\beta}\, T^{\frac{1-\beta}{2}} \left[ \left( \dfrac{r_{\varepsilon,d}\, L d\, C_B \sqrt{\log(\log(T)/\alpha)}}{\zeta \kappa_l p} \right)^{1+\beta} + o_{\beta,\gamma}(1) \right], & 0 \le \beta < 1, \end{cases}$$
where $o_{\beta,\gamma}(1)$ is a factor depending on $\beta, \gamma$ that converges to $0$ as $T \to \infty$.

Unlike the worst-case bound, it is more challenging to establish a corresponding lower bound under the margin condition. A possible roadmap to show an $\Omega(\log T/\varepsilon^2)$ lower bound is to follow the argument in [21]: their argument uses the van Trees inequality to bound the mean squared estimation error from below at each time step, which yields the $\Omega(\log T)$ bound in the non-private setting. To account for the influence of the privacy noise, one could combine the LDP version of the van Trees inequality in [2] with the above argument to obtain an $\Omega(\log T/\varepsilon^2)$ bound.

## 4 Multi-parameter Setting

In this section, we present our LDP framework for the multi-parameter setting. Compared with the single-parameter setting, this framework introduces three non-trivial components to match classical regret bounds while still guaranteeing LDP: warm up, synthetic update, and elimination.
Algorithm 2: LDP Multi-parameter Contextual Bandit
Input: Time horizon $T$; warm-up period length $s_0$; privacy level $(\varepsilon, \delta)$.
1: Initialization: set $\hat\theta_{0,i} = 0$, $i \in [K]$.
2: for $t = 1$ to $K s_0$ do
3:   User side:
4:     Receive $\hat\theta_{t-1,1:K}$ from the server.
5:     Pull arm $a_t := (t \bmod K) + 1$ and receive $r_t$.
6:     Generate and send $Z_{t,i} = \mathbf{1}\{a_t = i\}\, \psi_t(X_t, r_t; \hat\theta_{t-1,i})$, $i \in [K]$, to the server.
7:   Server side:
8:     Receive the updates $Z_{t,1:K}$ from the user.
9:     Re-estimate parameters via $\hat\theta_{t,i} := \phi_t(Z_{1,i}, \dots, Z_{t,i})$, $i \in [K]$.
10: end for
11: for $t = K s_0 + 1$ to $T$ do
12:   User side:
13:     Receive $\hat\theta_{t-1,1:K}$ from the server.
14:     Determine a subset $\hat{\mathcal K}_t$ of $[K]$ by setting
        $\hat{\mathcal K}_t := \{ a \in [K] : X_t^T \hat\theta_{Ks_0,a} > \max_{a' \in [K]} X_t^T \hat\theta_{Ks_0,a'} - h \}$.   (7)
15:     Pull arm $a_t := \arg\max_{a \in \hat{\mathcal K}_t} \mu(X_t^T \hat\theta_{t-1,a})$ and receive $r_t$.
16:     Generate information for all arms $\{Z_{t,i}\}_{i \in [K]}$ by setting $Z_{t,i} = \psi_t(X_t, r_t; \hat\theta_{t-1,i})$ if $a_t = i$, and $Z_{t,i} = \psi_t(0, 0; \hat\theta_{t-1,i})$ otherwise.
17:   Server side:
18:     Receive the updates $\{Z_{t,i}\}_{i \in [K]}$ from the user.
19:     Re-estimate parameters via $\hat\theta_{t,i} := \phi_t(Z_{1,i}, \dots, Z_{t,i})$, $i \in [K]$.
20: end for

Warm up. In the warm-up stage, all arms are given equal opportunities to be explored for a preliminary estimation of their parameters. This estimation does not aim for enough accuracy to select the optimal arm with high probability; we only need accuracy at the level of ruling out substantially inferior arms. Thus, this stage only needs $O(\log T)$ steps. Since the actions in this stage are independent of the contexts, there is no need to protect the pulled arm. However, we still need to protect the contexts, using a privacy mechanism similar to the one in the single-parameter setting.

Synthetic update. After the warm up, we need to make decisions based on the contexts to achieve vanishing regret. To obtain the privacy guarantee, we introduce our synthetic update mechanism. Although only one arm is pulled in each round, we create synthetic data for all unselected arms. In this way, the server receives feedback about every arm, regardless of whether it was selected, and thus cannot figure out which one was selected. Another way to provide LDP protection for the selected arm is to ensure that the action $a_t$ itself satisfies LDP; however, the regret then grows linearly, as shown in [34].

Elimination. We use the information obtained during the warm up to exclude obviously inferior arms. Such a method has been applied in [5] to guarantee a certain kind of independence of the information used in each round; we use it here for a different purpose. The necessity of the elimination strategy comes from protecting privacy in the multi-parameter setting. Although we obtain estimates of a certain accuracy in the warm-up stage, our knowledge of the unselected arms is gradually corrupted by the noise incurred in the synthetic update in each round. Such corruption would make us unable to distinguish arms that are possibly optimal from the surely sub-optimal ones. Avoiding this corruption would require picking sub-optimal arms frequently, which results in large regret. That is why we use the warm-up information to eliminate the arms with extremely poor performance, as in (7).

### 4.1 Privacy Guarantee

The OLS/SGD mechanisms and estimators are the same as (3)–(6) in the single-parameter setting. To prevent the server from distinguishing the selected arm from the other $K - 1$ arms, a straightforward idea is to use an $(\varepsilon/K, \delta/K)$-LDP mechanism for the synthetic update, by the composition property in Lemma 2.4. However, we can prove that our algorithm still achieves the same LDP guarantee with a much less stringent privacy mechanism, namely $(\varepsilon/2, \delta/2)$-LDP, in Propositions 4.1 and 4.2.

Proposition 4.1. Algorithm 2 with the private OLS update mechanism $\psi^{\mathrm{OLS}}_t$ and estimator $\phi^{\mathrm{OLS}}_t$ is $(\varepsilon, \delta)$-LDP.

Proposition 4.2. Algorithm 2 with the private SGD update mechanism $\psi^{\mathrm{SGD}}_t$ and estimator $\phi^{\mathrm{SGD}}_t$ is $\varepsilon$-LDP.
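To illustrate how the synthetic update hides the pulled arm (line 16 of Algorithm 2), here is a small sketch that builds one privatized message per arm, reusing `private_ols_statistics` from the earlier OLS snippet and splitting the budget as $(\varepsilon/2, \delta/2)$ per message in the spirit of Propositions 4.1 and 4.2; the helper name and interface are ours:

```python
import numpy as np

def synthetic_update_round(X_t, r_t, a_t, K, eps, delta, C_B, c_r, rng=None):
    """Sketch of the synthetic-update step: the pulled arm a_t gets the real
    (context, reward) pair, every other arm gets a privatized (0, 0) pair, so
    the K messages look statistically alike to the server."""
    rng = np.random.default_rng() if rng is None else rng
    d = X_t.shape[0]
    messages = []
    for i in range(K):
        if i == a_t:
            messages.append(private_ols_statistics(X_t, r_t, eps / 2,
                                                    delta / 2, C_B, c_r, rng))
        else:
            messages.append(private_ols_statistics(np.zeros(d), 0.0, eps / 2,
                                                    delta / 2, C_B, c_r, rng))
    return messages
```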
### 4.2 Regret Analysis

Assumption 5 (Diversity condition). Let $\mathcal K_{\mathrm{opt}}$ and $\mathcal K_{\mathrm{sub}}$ be a partition of $[K]$ such that for any $i \in \mathcal K_{\mathrm{sub}}$, $\mu(X^T \theta_i) < \max_{j \ne i} \mu(X^T \theta_j) - h_{\mathrm{sub}}$ for some $h_{\mathrm{sub}} > 0$ and every $X \in \mathcal X$. For any $i \in \mathcal K_{\mathrm{opt}}$, define the set $U_i := \{X : \mu(X^T \theta_i) > \max_{j \ne i} \mu(X^T \theta_j)\}$. There exist $\kappa_l > 0$, $p > 0$ such that for all $i \in \mathcal K_{\mathrm{opt}}$ and every unit vector $v$, $P\big( (v^T X)^2 \mathbf{1}\{X \in U_i\} \ge \kappa_l / K_{\mathrm{opt}} \big) > p$.

Assumption 6 ((γ, β)-margin condition). This is almost identical to Assumption 4, except that we replace $\Delta_t$ with $\Delta_t := \mu(X_t^T \theta_{a^*_t}) - \max_{j \ne a^*_t} \mu(X_t^T \theta_j)$.

In our algorithm, the diversity condition guarantees that, conditional on arm $i$ being pulled, the distribution of $X_t$ can still provide enough information about $\theta_i$. We remark that we no longer need any deterministic gap in the definition of $U_i$, which weakens the assumptions made in [4, 5]. We are now in a position to present the theoretical guarantee of the algorithm.

Theorem 4.1. Under Assumptions 1, 5 and 6, with the choice of $c = 2\sigma_{\varepsilon/2,\delta/2}(4\sqrt{d} + 2\log(2TK/\alpha))$ in (4),
$$s_0 = C K \left( \frac{C_B \sigma_\epsilon + \sigma_{\varepsilon,\delta}}{\min\{\lambda_0, h\}\, p\, \kappa_l} \right)^{2} (d + \log(TK/\alpha)),$$
$h = h_{\mathrm{sub}}$, and a threshold $\lambda_0$ on the order of $(2\gamma L C_B)^{-1}$, Algorithm 2 with OLS mechanism $\psi^{\mathrm{OLS}}_t$ and estimator $\phi^{\mathrm{OLS}}_t$ achieves the following regret with probability at least $1 - \alpha$ for some constant $C$:
$$\mathrm{Reg}(T) \le \gamma C C_B \left[ \left( \frac{K C_B (C_B \sigma_\epsilon + \sigma_{\varepsilon,\delta}) \sqrt{d + \log(TK/\alpha)}}{\kappa_l p} \right)^{1+\beta} + o_{h_{\mathrm{sub}},\beta,\gamma}(1) \right] \cdot \begin{cases} \log T, & \beta = 1, \\ \dfrac{T^{\frac{1-\beta}{2}}}{1-\beta}, & 0 < \beta < 1. \end{cases}$$
Under Assumptions 1, 5 and 6, with the choice of step-size
$$\eta_t := c\, K_{\mathrm{opt}}\, (\zeta \kappa_l p)^{-1} \Big( \mathbf{1}\{t \le K s_0\} \big((t \bmod K) + 1\big) + \mathbf{1}\{t > K s_0\} \big(t - (K-1) s_0\big) \Big)^{-1}$$
for any $c \ge 1$ and $h = h_{\mathrm{sub}}$, Algorithm 2 with SGD mechanism $\psi^{\mathrm{SGD}}_t$ and estimator $\phi^{\mathrm{SGD}}_t$ achieves the following regret with probability at least $1 - \alpha$ for some constant $C$:
$$\mathrm{Reg}(T) \le \gamma L C C_B \left[ \left( \frac{K r_{\varepsilon,d}\, L C_B \sqrt{\log((TK \log T)/\alpha)}}{\zeta \kappa_l p} \right)^{1+\beta} + o_{h_{\mathrm{sub}},\beta,\gamma}(1) \right] \cdot \begin{cases} \log T, & \beta = 1, \\ \dfrac{T^{\frac{1-\beta}{2}}}{1-\beta}, & 0 < \beta < 1. \end{cases}$$

Theorem 4.1 recovers the non-private bound in [5] under similar conditions up to a logarithmic factor. Notice that, unlike Theorem 3.3 in the single-parameter case, we cannot establish the regret bound when $\beta = 0$. The reason is that in our analysis we need the probability of $\Delta_t \le h$ to vanish as $h \to 0$ in order to guarantee that the estimation error for $\theta_i$, $i \in \mathcal K_{\mathrm{opt}}$, converges. The corresponding theoretical result in this setting when $\beta = 0$ is left as an open question.

## 5 Experiment

To the best of our knowledge, contextual bandit algorithms with an LDP guarantee have only been studied by [44], who propose a variant of the LinUCB algorithm for linear bandits and a variant of the Generalized Linear Online-to-confidence-set Conversion (GLOC) framework [23] for generalized linear bandits. We refer to their methods as LDP-UCB and LDP-GLOC. We call our method LDP-OLS when we plug the OLS mechanism and estimator into Algorithms 1 and 2, and LDP-SGD when we plug in the SGD ones. We evaluate all four methods at two privacy levels, $\varepsilon = 0.5$ and $\varepsilon = 1$, on synthetic datasets; such values are in line with industry standards. For example, Apple uses $\varepsilon = 4$ in its projects on Emojis and Safari usage [38], and similar choices of the privacy parameter $\varepsilon$ can be found in [3, 20]. We also demonstrate the efficacy of our algorithms on real data on auto lending² in Appendix G. For the sake of comparison, the learning step parameter for LDP-GLOC and LDP-SGD is tuned in the same way.³ The first and second columns of Figure 2 correspond to the single-param and multi-param settings, respectively; both are simulation studies on linear bandits. The context is generated from $\mathrm{Unif}(S^{d-1}_1)$ at each round. In conclusion, our methods consistently and significantly outperform existing ones in all settings, and LDP-SGD achieves better performance under more stringent privacy requirements.
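For reference, the simulation environment described above can be set up along the following lines (a sketch under our own assumptions: the exact $\theta^*$ and reward-noise distribution used in the paper are not specified here, so the ones below are placeholders):

```python
import numpy as np

def make_linear_environment(K, d, seed=0):
    """Single-param linear bandit with contexts drawn from Unif(S^{d-1}_1)."""
    rng = np.random.default_rng(seed)
    theta_star = rng.standard_normal(d)
    theta_star /= np.linalg.norm(theta_star)            # ||theta*||_2 <= 1

    def draw_round():
        x = rng.standard_normal((K, d))
        x /= np.linalg.norm(x, axis=1, keepdims=True)    # uniform on the sphere
        def reward(a):
            # Linear reward with bounded noise (placeholder noise level).
            return float(x[a] @ theta_star) + rng.uniform(-0.1, 0.1)
        return x, reward

    return theta_star, draw_round
```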
Figure 2: Cumulative regret of LDP-SGD, LDP-OLS, LDP-UCB, and LDP-GLOC in the single-param setting ($\varepsilon = 0.5$, $K = 10$, $d = 2$ and $\varepsilon = 1$, $K = 10$, $d = 2$) and the multi-param setting ($\varepsilon = 0.5$, $K = 3$, $d = 2$ and $\varepsilon = 1$, $K = 3$, $d = 2$). We perform 10 replications for each case and plot the mean and 0.5 standard deviation of the regrets.

²The On-Line Auto Lending dataset CRPM-12-001 is provided by Columbia University at https://www8.gsb.columbia.edu/cprm/research/datasets and has been used in the study of contextual bandits by [26, 11].
³The source code to reproduce all the results is available at the GitHub repo liangzp/LDP-Bandit.

## 6 Conclusion

In this paper, we propose LDP contextual bandit frameworks in both the single-parameter and multi-parameter settings, with the flexibility to deal with generalized linear reward structures, and we establish theoretical guarantees for the algorithms built on these frameworks. Our algorithms are highly efficient and have superior empirical performance. There are still some open questions to be explored. Whether our regret bounds are optimal in terms of $\varepsilon$ in the multi-parameter setting is still unknown; it would be interesting to explore estimators and mechanisms beyond the private OLS and SGD ones to study the optimality in terms of $\varepsilon$. Moreover, whether there is a fundamental limit for adversarial contextual bandits under LDP constraints is still an open question. It also remains open to analyze the regret bound in the multi-parameter setting when $\beta = 0$ in the margin condition.

## Acknowledgments and Disclosure of Funding

This work was supported by the Hong Kong Research Grants Council (Grants 16317416, 16204718, 16214121) and the Guangdong-Hong Kong-Macao Joint Laboratory for Data-Driven Fluid Mechanics and Engineering Applications.

## References

[1] G.-Y. Ban and N. B. Keskin. Personalized dynamic pricing with machine learning: High-dimensional features and heterogeneous elasticity. Forthcoming, Management Science, 2020.
[2] L. P. Barnes, W.-N. Chen, and A. Özgür. Fisher information under local differential privacy. IEEE Journal on Selected Areas in Information Theory, 1(3):645–659, 2020.
[3] R. Bassily, K. Nissim, U. Stemmer, and A. Thakurta. Practical locally private heavy hitters. arXiv preprint arXiv:1707.04982, 2017.
[4] H. Bastani and M. Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
[5] H. Bastani, M. Bayati, and K. Khosravi. Mostly exploration-free algorithms for contextual bandits. arXiv, pages 1–62, 2017.
[6] D. Basu, C. Dimitrakakis, and A. Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost? arXiv, pages 1–27, 2019.
[7] A. Bietti, A. Agarwal, and J. Langford. A contextual bandit bake-off. pages 1–45, 2018.
[8] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.
[9] X. Chen, D. Simchi-Levi, and Y. Wang. Privacy-preserving dynamic personalized pricing with demand learning. arXiv, pages 1–35, 2020.
[10] X. Chen, K. Zheng, Z. Zhou, Y. Yang, W. Chen, and L. Wang. (Locally) differentially private combinatorial semi-bandits. arXiv, 2020.
[11] W. C. Cheung, D. Simchi-Levi, and R. Zhu. Hedging the drift: Learning to optimize under non-stationarity. Available at SSRN 3261050, 2018.
[12] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. arXiv preprint arXiv:1712.01524, 2017.
[13] Q. Ding, C.-J. Hsieh, and J. Sharpnack. An efficient algorithm for generalized linear bandit: Online stochastic gradient descent and Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 1585–1593. PMLR, 2021.
[14] J. C. Duchi and M. I. Jordan. Minimax optimal procedures for locally private estimation.
[15] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201, 2018.
[16] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.
[17] C. Dwork and J. Lei. Differential privacy and robust statistics. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, pages 371–380, 2009.
[18] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
[19] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–487, 2013.
[20] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 1054–1067, 2014.
[21] A. Goldenshluger and A. Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
[22] Y. Han, Z. Zhou, Z. Zhou, J. Blanchet, P. W. Glynn, and Y. Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv, 2020.
[23] K.-S. Jun, A. Bhargava, R. Nowak, and R. Willett. Scalable generalized linear bandits: Online computation and hashing. Advances in Neural Information Processing Systems, 2017-December:99–109, 2017.
[24] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[26] R. Phillips, A. S. Şimşek, and G. van Ryzin. The effectiveness of field price discretion: Empirical evidence from auto lending. Management Science, 61(8):1741–1759, 2015.
[27] M. Raghavan, A. Slivkins, J. W. Vaughan, and Z. S. Wu. The externalities of exploration and how data diversity helps exploitation. arXiv, 2018.
[28] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
[29] W. Ren, X. Zhou, J. Liu, and N. B. Shroff. Multi-armed bandits with local differential privacy. arXiv, 2020.
[30] Z. Ren and Z. Zhou. Dynamic batch learning in high-dimensional sparse linear contextual bandits. arXiv, pages 1–33, 2020.
[31] Z. Ren, Z. Zhou, and J. R. Kalagnanam. Batched learning in generalized linear contextual bandits with general decision sets. IEEE Control Systems Letters, 2020.
[32] C. Riquelme, G. Tucker, and J. Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.
[33] B. I. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. arXiv preprint arXiv:0911.5708, 2009.
[34] R. Shariff and O. Sheffet. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 2018-December:4296–4306, 2018.
[35] A. Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1–286, 2019.
[36] A. Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty-third Annual ACM Symposium on Theory of Computing, pages 813–822, 2011.
[37] J. Tang, A. Korolova, X. Bai, X. Wang, and X. Wang. Privacy loss in Apple's implementation of differential privacy on macOS 10.12. arXiv preprint arXiv:1709.02753, 2017.
[38] Apple Differential Privacy Team. Learning with privacy at scale. 2017.
[39] P. Toulis, E. Airoldi, and J. Rennie. Statistical analysis of stochastic gradient methods for generalized linear models. In International Conference on Machine Learning, pages 667–675. PMLR, 2014.
[40] J. A. Tropp. User-friendly tail bounds for matrix martingales. 2011.
[41] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.
[42] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
[43] L. Wasserman and S. Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375–389, 2010.
[44] K. Zheng, T. Cai, W. Huang, Z. Li, and L. Wang. Locally differentially private (contextual) bandits learning. arXiv, (NeurIPS):1–20, 2020.