# Local Differential Privacy for Bayesian Optimization

Xingyu Zhou,¹ Jian Tan²
¹ ECE, The Ohio State University   ² Alibaba Group, Sunnyvale
zhou.2055@osu.edu, j.tan@alibaba-inc.com

*This work was done during the internship of the first author at Alibaba Group, Sunnyvale, USA.*

## Abstract

Motivated by the increasing concern about privacy in today's data-intensive online learning systems, we consider black-box optimization in the nonparametric Gaussian process setting with a local differential privacy (LDP) guarantee. Specifically, the rewards from each user are further corrupted to protect privacy, and the learner only has access to the corrupted rewards when minimizing the regret. We first derive regret lower bounds that hold for any LDP mechanism and any learning algorithm. We then present three almost optimal algorithms based on the GP-UCB framework and the Laplace DP mechanism. In this process, we also propose a new Bayesian optimization (BO) method (called MoMA-GP-UCB) based on median-of-means techniques and kernel approximations, which complements previous BO algorithms for heavy-tailed payoffs with a reduced complexity. Further, empirical comparisons of different algorithms on both synthetic and real-world datasets highlight the superior performance of MoMA-GP-UCB in both private and non-private scenarios.

## Introduction

We consider the problem of maximizing an unknown function f over a set D by querying it sequentially and receiving only bandit feedback, i.e., when we query at x, we observe a possibly noisy evaluation of f(x). This model has been a main focus of machine learning research, e.g., the classic multi-armed bandit (MAB) setting (Lai and Robbins 1985), the linear bandit setting (Abbasi-Yadkori, Pál, and Szepesvári 2011), and general Bayesian optimization (Shahriari et al. 2015), with each one generalizing the previous one. It also finds broad applications in many real-world systems, including medical experiments, online shopping websites, and recommender systems (Li et al. 2010). These systems adaptively make decisions and receive rewards (feedback) from users, simultaneously learning insightful facts and maximizing profit.

Recently, privacy has become a key issue in the above-mentioned online learning systems. Users have become increasingly concerned about directly sharing their online information or activities with these systems, since these activities may reveal their private information. For example, a customer of an online shopping website may not be willing to tell the website that he or she has purchased medicine for mental-health issues. Another example is a medical experiment in which a patient may refuse to share the actual effects of a treatment due to privacy concerns. This motivates the need for a mechanism that further corrupts the feedback from each user to protect privacy, which fits exactly the locally differentially private (LDP) model (Kasiviswanathan et al. 2011; Duchi, Jordan, and Wainwright 2013). In contrast to the standard differential privacy model (Dwork, Roth et al. 2014), in which the learner collects the true data while releasing a private output to protect privacy, in the LDP model the learner only has access to corrupted input data from the users.
Hence, LDP often provides a much stronger privacy protection for the user and is more appealing in real applications, especially for the systems mentioned above (Cormode et al. 2018). To the best of our knowledge, in the setting of online learning with bandit feedback, the LDP model has only been studied theoretically very recently. For example, in (Gajane, Urvoy, and Kaufmann 2018; Ren et al. 2020), the authors investigate MAB with an LDP guarantee, and (Zheng et al. 2020) studies LDP in the linear (contextual) bandit setting. However, LDP in the most general scenario, i.e., Bayesian optimization (BO), remains an important open problem. Motivated by this, in this paper we investigate locally differentially private BO, in which the rewards are further corrupted to protect privacy. Specifically, we consider a Gaussian process (GP) model for BO (also called the Gaussian process bandit setting), which directly generalizes both the MAB and the linear bandit setting. The main contributions of this paper can be summarized as follows.

**Contributions.** (i) We first derive regret lower bounds that hold for any LDP mechanism and any learning algorithm. (ii) We then present three almost optimal algorithms based on the GP-UCB framework and the Laplace DP mechanism. (iii) The two new methods we develop for handling LDP also contribute to BO under heavy-tailed payoffs in general. In particular, one is a new truncation method that can be applied to any sub-Weibull rewards. The other, called MoMA-GP-UCB, is based on median-of-means techniques and kernel approximations, and complements previous BO algorithms for general heavy-tailed payoffs (Chowdhury and Gopalan 2019) with a reduced complexity. (iv) We further conduct empirical comparisons of different algorithms on both synthetic and real-world datasets, which demonstrate the superior performance of our new MoMA-GP-UCB algorithm in both private and non-private settings.

## Related Work

In the traditional non-private case, a line of BO methods based on Gaussian processes (GPs) and upper confidence bounds (UCB) has been analyzed in both the sub-Gaussian (Srinivas et al. 2009; Chowdhury and Gopalan 2017) and heavy-tailed scenarios (Chowdhury and Gopalan 2019). Kernel approximation has recently been proposed to reduce the complexity of GP-based BO algorithms (Mutny and Krause 2018; Calandriello et al. 2019). In the private BO case, (Kusner et al. 2015) studies how to privately release BO outputs to protect privacy (e.g., hyper-parameters of a machine learning model), and hence belongs to the traditional DP perspective rather than LDP. The LDP model has previously been considered in the MAB setting (Gajane, Urvoy, and Kaufmann 2018; Basu, Dimitrakakis, and Tossou 2019; Ren et al. 2020). Recently, it has been generalized to linear contextual bandits, in which both the rewards and the contexts are corrupted for privacy (Zheng et al. 2020). There are also other types of DP considered in the MAB and linear bandit settings (not comparable to our work). Due to space limitations, we refer readers to (Ren et al. 2020; Zheng et al. 2020) and the references therein.

## Problem Statement and Preliminaries

We consider a sequential decision-making problem over a set D. A learning policy selects an action $x_t \in D$ at each discrete time slot t = 1, 2, ..., with the corresponding reward observation $y_t = f(x_t) + \eta_t$, i.e., $y_t$ is a possibly noisy version of $f(x_t)$.
The reward $y_t$ is then further corrupted to protect privacy, and only the private response $\tilde{y}_t$ is revealed to the learning agent. The action $x_t$ is chosen based on the arms played and the private rewards obtained up to time $t-1$, denoted by the history $\tilde{H}_{t-1} = \{(x_s, \tilde{y}_s) : s \in [t-1]\}$, where $[m] := \{1, 2, \ldots, m\}$ for any positive integer m. The objective is to simultaneously preserve LDP and minimize the cumulative regret, defined as

$$R_T = \sum_{t=1}^{T} \big(f(x^*) - f(x_t)\big), \tag{1}$$

where $x^* = \arg\max_{x \in D} f(x)$ (assuming the maximum is attained).

**Definition 1 ((ϵ, δ)-LDP).** A randomized mechanism $M : D \to Z$ is said to protect $(\epsilon, \delta)$-LDP if for any $x, x' \in D$ and any measurable subset $E \subseteq Z$,

$$\mathbb{P}\{M(x) \in E\} \le e^{\epsilon}\, \mathbb{P}\{M(x') \in E\} + \delta,$$

for $\epsilon \ge 0$ and $\delta \ge 0$. Moreover, if $\delta = 0$, we say it protects ϵ-LDP.

Note that, if not explicitly stated, LDP in this paper means ϵ-LDP (which is stronger than (ϵ, δ)-LDP).

**Noise assumptions.** We assume that the noise $\eta_t$ has zero mean conditioned on the history and is bounded by R almost surely. We also address the case of unbounded noise at the end of the paper.

**Regularity assumptions.** Attaining sub-linear regret is in general infeasible for an arbitrary reward function f over a very large space without any assumptions on the structure of f. In this paper, we assume that D is compact and that f has a bounded norm in the RKHS of functions $D \to \mathbb{R}$ corresponding to a kernel function $k : D \times D \to \mathbb{R}$. This RKHS, denoted by $H_k(D)$, is completely determined by its kernel function and is equipped with an inner product $\langle \cdot, \cdot \rangle_H$ satisfying the reproducing property: $f(x) = \langle f, k(x, \cdot) \rangle_H$ for all $f \in H_k(D)$. The RKHS norm, $\|f\|_H = \sqrt{\langle f, f \rangle_H}$, measures the smoothness of f. We assume $\|f\|_H \le B$ for a known constant $B < \infty$. Moreover, we assume a bounded variance by restricting $k(x, x) \le 1$. Two commonly used kernels, the Squared Exponential (SE) and the Matérn kernel, satisfy the bounded variance assumption; they are defined as

$$k_{SE}(x, x') = \exp\!\Big(-\frac{s^2}{2l^2}\Big), \qquad k_{\mathrm{Mat\acute{e}rn}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \Big(\frac{s\sqrt{2\nu}}{l}\Big)^{\nu} B_{\nu}\!\Big(\frac{s\sqrt{2\nu}}{l}\Big),$$

where $l > 0$ and $\nu > 0$ are hyper-parameters, $s = \|x - x'\|_2$ specifies the similarity between two points, and $B_{\nu}(\cdot)$ is the modified Bessel function.

**Surrogate GP model.** A Gaussian process, denoted by $GP(\mu(\cdot), k(\cdot, \cdot))$, is a collection of (possibly infinitely many) random variables $f(x)$, $x \in D$, such that every finite subset $\{f(x_i), i \in [m]\}$ is jointly Gaussian with mean $\mathbb{E}[f(x_i)] = \mu(x_i)$ and covariance $\mathbb{E}[(f(x_i) - \mu(x_i))(f(x_j) - \mu(x_j))] = k(x_i, x_j)$ for $i, j \in [m]$ and $m \in \mathbb{N}$. By conditioning a GP on available observations, one obtains a non-parametric surrogate Bayesian model over the space of functions. In particular, we use $GP(0, k(\cdot, \cdot))$ as an initial prior on the unknown black-box function f, together with a Gaussian likelihood in which the noise variables $\eta_t$ are drawn independently across t from $N(0, \lambda)$. (This surrogate GP model, i.e., a GP prior and a Gaussian likelihood, is only used for algorithm design.) Conditioned on a set of past observations $H_t = \{(x_s, y_s) : s \in [t]\}$, by the properties of GPs (Rasmussen 2003), the posterior distribution over f is $GP(\mu_t(\cdot), k_t(\cdot, \cdot))$, where

$$\mu_t(x) = k_t(x)^T (K_t + \lambda I)^{-1} y_{1:t}, \tag{2}$$

$$k_t(x, x') = k(x, x') - k_t(x)^T (K_t + \lambda I)^{-1} k_t(x'), \qquad \sigma_t^2(x) = k_t(x, x), \tag{3}$$

with $k_t(x) = [k(x_1, x), \ldots, k(x_t, x)]^T$ and $K_t = [k(u, v)]_{u, v \in H_t}$. Therefore, for every $x \in D$, the posterior distribution of $f(x)$ given $H_t$ is $N(\mu_t(x), \sigma_t^2(x))$. The following quantity often plays a key role in the regret bounds of GP-based algorithms:

$$\gamma_t := \gamma_t(k, D) = \max_{A \subseteq D : |A| = t} \frac{1}{2} \ln\big|I_t + \lambda^{-1} K_A\big|, \quad \text{where } K_A = [k(x, x')]_{x, x' \in A}.$$
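To make the surrogate model concrete, the following is a minimal sketch (not taken from the paper's implementation) of the posterior update in Eqs. (2)-(3) and of the information quantity $\frac{1}{2}\ln|I + \lambda^{-1}K_A|$ for a fixed set of points, assuming an SE kernel and NumPy; the length-scale, λ, and the toy objective are illustrative choices only.

```python
import numpy as np

def k_se(X1, X2, l=0.2):
    """Squared-exponential kernel k_SE(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / (2 * l**2))

def gp_posterior(X, y, Xstar, lam=0.1, l=0.2):
    """Posterior mean/variance of Eqs. (2)-(3): mu_t(x) = k_t(x)^T (K_t + lam I)^{-1} y_{1:t}."""
    K = k_se(X, X, l) + lam * np.eye(len(X))           # K_t + lam I
    Ks = k_se(X, Xstar, l)                             # k_t(x), stacked column-wise
    mu = Ks.T @ np.linalg.solve(K, y)
    var = k_se(Xstar, Xstar, l).diagonal() - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, var

def information_gain(X, lam=0.1, l=0.2):
    """0.5 * ln|I + lam^{-1} K_A| for a fixed set A; gamma_t maximizes this over |A| = t."""
    K = k_se(X, X, l)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / lam)
    return 0.5 * logdet

# Toy usage on a 1-D domain discretized into 100 points.
rng = np.random.default_rng(0)
D = np.linspace(0, 1, 100)[:, None]
X = D[rng.choice(100, size=10, replace=False)]
y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(10)
mu, var = gp_posterior(X, y, D)
print(information_gain(X))
```

Note that $\gamma_t$ itself maximizes this quantity over all subsets of size t, which is needed for the analysis rather than for running the algorithms.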
Roughly speaking, $\gamma_t$ is the maximum mutual information that can be obtained about the GP prior from t samples corrupted by a Gaussian channel $N(0, \lambda)$. It is a function of the kernel k and the domain D. For instance, if D is compact and convex, then $\gamma_t = O((\ln t)^{d+1})$ for k_SE, $\gamma_t = O\big(t^{\frac{d(d+1)}{2\nu + d(d+1)}} \ln t\big)$ for k_Matérn, and $\gamma_t = O(d \ln t)$ for a linear kernel (Srinivas et al. 2009).

## Lower Bounds

In this section, we derive lower bounds for both k_SE and k_Matérn under any LDP mechanism and any learning algorithm, as presented in the following theorem.

**Theorem 1.** Let $D = [0, 1]^d$ for some $d \in \mathbb{N}$. Fix a kernel $k \in \{k_{SE}, k_{\mathrm{Mat\acute{e}rn}}\}$, $B > 0$, $\epsilon > 0$, $T \in \mathbb{N}$, $\delta \in (0, 1)$, $\alpha \in (0, 1]$ and $v > 0$. Given any learning algorithm and any ϵ-LDP mechanism, there exists a function $f \in H_k(D)$ with $\|f\|_H \le B$ and a reward distribution satisfying $\mathbb{E}\big[|y_t|^{1+\alpha} \mid \mathcal{F}_{t-1}\big] \le v$ for all $t \in [T]$, such that the following hold, respectively:

$$\mathbb{E}[R_T] = \Omega\Bigg( v^{\frac{1}{1+\alpha}}\, T^{\frac{1}{1+\alpha}}\, \zeta^{-\frac{2\alpha}{1+\alpha}} \bigg( \ln \frac{B^{(1+\alpha)/\alpha}\, T\, \zeta^2}{v^{1/\alpha}} \bigg)^{\frac{d\alpha}{2+2\alpha}} \Bigg), \quad \text{where } \zeta = e^{\epsilon} - 1, \text{ if } k = k_{SE};$$

$$\mathbb{E}[R_T] = \Omega\Big( v^{\frac{\nu}{\nu(1+\alpha)+d\alpha}}\, T^{\frac{\nu+d\alpha}{\nu(1+\alpha)+d\alpha}}\, \tilde{\zeta}\, B^{\frac{d\alpha}{\nu(1+\alpha)+d\alpha}} \Big), \quad \text{where } \tilde{\zeta} = \zeta^{-\big(\frac{2\alpha}{1+\alpha} + \frac{2d\alpha^2}{(1+\alpha)(\nu(1+\alpha)+d\alpha)}\big)},\ \zeta = e^{\epsilon} - 1, \text{ if } k = k_{\mathrm{Mat\acute{e}rn}}.$$

**Remark 1.** For a small ϵ and α = 1 (and hence ζ ≈ ϵ), the regret lower bounds in Theorem 1 carry an additional factor of 1/ϵ in front of the lower bounds for the non-private case in (Chowdhury and Gopalan 2019) (for k = k_Matérn, this holds for a large ν).

**Proof sketch of Theorem 1.** The proof follows the standard techniques in (Scarlett, Bogunovic, and Cevher 2017; Chowdhury and Gopalan 2019), which provide lower bounds for non-private BO under i.i.d. Gaussian noise (or heavy-tailed payoffs). The key challenge is to handle the additional requirement of ϵ-LDP. To this end, we relate the Kullback-Leibler (KL) divergence between two reward distributions $P_1$ and $P_2$ to the KL divergence between the two distributions $M_1$ and $M_2$ obtained by transforming $P_1$ and $P_2$ through a given ϵ-LDP mechanism. Inspired by (Basu, Dimitrakakis, and Tossou 2019), we resort to Theorem 1 of (Duchi, Jordan, and Wainwright 2013) and Pinsker's inequality. More specifically, by Theorem 1 of (Duchi, Jordan, and Wainwright 2013), we have

$$D_{kl}(M_1 \| M_2) + D_{kl}(M_2 \| M_1) \le 4 (e^{\epsilon} - 1)^2\, \|P_1 - P_2\|_{TV}^2.$$

Then, by Pinsker's inequality, $\|P_1 - P_2\|_{TV}^2 \le 2 D_{kl}(P_1 \| P_2)$. Thus, roughly speaking, the KL divergence picks up an additional factor of $(e^{\epsilon} - 1)^2$, which translates into the ζ factors above. The full proof is in the Appendix.

## Algorithms and Upper Bounds

In this section, we present three algorithms that achieve nearly optimal regret while guaranteeing ϵ-LDP. All three algorithms rely on adding Laplace noise to the reward (i.e., the Laplace mechanism in DP) to provide the privacy guarantee. Note that, due to the additional Laplace noise, the rewards received by the learner are no longer sub-Gaussian, and hence standard algorithms do not directly apply. As a result, the three algorithms mainly differ in how they handle non-sub-Gaussian rewards.

### Laplace Mechanism

A commonly used mechanism in DP is the Laplace mechanism, which adds independent Laplace noise to each data point. For any $L > 0$, the PDF of Laplace(L) (with zero mean) is given by

$$l(x \mid L) = (2L)^{-1} \exp(-|x|/L),$$

so it has mean 0 and variance $2L^2$. The Laplace mechanism used in this paper is stated in Curator 1 and its theoretical guarantee is given by Lemma 1.

**Curator 1: Convert-to-Laplace (CTL(ϵ)).** On receiving a reward observation $y_t$: return $\tilde{y}_t := y_t + \mathcal{L}$, where $\mathcal{L} \sim \text{Laplace}(L)$ and $L = 2(B + R)/\epsilon$.

**Lemma 1.** CTL(ϵ) guarantees ϵ-LDP. (Proof: see Appendix.)
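As an illustration, here is a minimal sketch of the CTL(ϵ) curator (assuming NumPy; the helper name and the toy values are not from the paper):

```python
import numpy as np

def ctl(y_t, B, R, eps, rng=None):
    """Convert-to-Laplace curator CTL(eps): return the private response y_t + Laplace noise.

    The reward satisfies |y_t| <= B + R, so its sensitivity is 2(B + R), and
    Laplace noise with scale L = 2(B + R) / eps yields eps-LDP (Lemma 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    L = 2.0 * (B + R) / eps
    return y_t + rng.laplace(loc=0.0, scale=L)

# Example: privatize one bounded reward with eps = 1 (toy values).
print(ctl(y_t=0.3, B=1.0, R=1.0, eps=1.0))
```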
### Adaptively Truncated Approximate (ATA) Algorithm and Regret

One direct way of handling non-sub-Gaussian rewards in BO is to use the recently developed techniques for heavy-tailed payoffs (Chowdhury and Gopalan 2019). In particular, the authors show that by combining a good feature approximation (e.g., the Nyström approximation) with an adaptive truncation of the rewards in the feature space (e.g., TOFU in (Shao et al. 2018)), one can obtain a regret bound of roughly $\tilde{O}(\gamma_T T^{\frac{1}{1+\alpha}})$ when the $(1+\alpha)$-th moment of the reward is finite and $\alpha \in (0, 1]$. Hence, when α = 1, it recovers the regret bound under sub-Gaussian rewards (Chowdhury and Gopalan 2017). It is therefore natural to adapt the ATA-GP-UCB algorithm of (Chowdhury and Gopalan 2019) to handle the non-sub-Gaussian payoffs caused by the Laplace noise in the LDP setting, which leads to LDP-ATA-GP-UCB, described in Algorithm 1.

**Algorithm 1: LDP-ATA-GP-UCB**

1. Input: parameters λ, B, R, ϵ > 0, $\{b_t\}_{t \ge 1}$, $\{\beta_t\}_{t \ge 1}$, and q.
2. Set $\tilde{\mu}_0(x) = 0$ and $\tilde{\sigma}_0(x) = k(x, x)$ for all $x \in D$.
3. For t = 1, 2, 3, ..., T do:
4. Play $x_t = \arg\max_{x \in D} \tilde{\mu}_{t-1}(x) + \beta_t \tilde{\sigma}_{t-1}(x)$.
5. Receive the private response $\tilde{y}_t$ from CTL(ϵ).
6. Set $m_t$ as the dimension of $\tilde{\phi}_t$.
7. Set $\tilde{\Phi}_t^T = [\tilde{\phi}_t(x_1), \ldots, \tilde{\phi}_t(x_t)]$ and $\tilde{V}_t = \tilde{\Phi}_t^T \tilde{\Phi}_t + \lambda I_{m_t}$.
8. Find the rows $u_1^T, \ldots, u_{m_t}^T$ of $\tilde{V}_t^{-1/2} \tilde{\Phi}_t^T$.
9. Set $\hat{r}_i = \sum_{\tau=1}^{t} u_{i,\tau} \tilde{y}_\tau \mathbf{1}\{|u_{i,\tau} \tilde{y}_\tau| \le b_\tau\}$ for $i \in [m_t]$.
10. Set $\tilde{\theta}_t = \tilde{V}_t^{-1/2} [\hat{r}_1, \ldots, \hat{r}_{m_t}]^T$.
11. Set $\tilde{\mu}_t(x) = \tilde{\phi}_t(x)^T \tilde{\theta}_t$.
12. Set $\tilde{\sigma}_t^2(x) = k(x, x) - \tilde{\phi}_t(x)^T \tilde{\phi}_t(x) + \lambda \tilde{\phi}_t(x)^T \tilde{V}_t^{-1} \tilde{\phi}_t(x)$.
13. End for.

Further, by adapting the regret analysis of ATA-GP-UCB in (Chowdhury and Gopalan 2019), we obtain the following regret upper bound for LDP-ATA-GP-UCB.

**Theorem 2.** Let $f \in H_k(D)$ with $\|f\|_H \le B$, and assume the noise $\eta_t$ is bounded by R. Fix ϵ > 0, $\varepsilon \in (0, 1)$, and set $\rho = \frac{1+\varepsilon}{1-\varepsilon}$ and $v = B^2 + R^2 + 8(B+R)^2/\epsilon^2$. Then, for any $\delta \in (0, 1]$, LDP-ATA-GP-UCB with parameters $q = 6\rho \ln(4T/\delta)/\varepsilon^2$, $b_t = \sqrt{v / \ln(4 m_t T/\delta)}$ and $\beta_{t+1} = B\big(1 + \frac{1}{\sqrt{1-\varepsilon}}\big) + 4\sqrt{\ln(4 m_t T/\delta)\, v\, m_t / \lambda}$ has, with probability at least 1 − δ, regret bound

$$R_T = O\big( \hat{\rho} B \sqrt{T \gamma_T} + \phi_T\, \gamma_T \sqrt{T} \big), \quad \text{in which } \hat{\rho} := \rho\Big(1 + \frac{1}{\sqrt{1-\varepsilon}}\Big) \text{ and } \phi_T := \sqrt{v \ln(T/\delta)}\, \ln\!\big(\gamma_T T \ln(T/\delta)\big).$$

**Remark 2.** By substituting the value of v into the regret bound, we obtain $R_T = \tilde{O}(\gamma_T \sqrt{T}/\epsilon)$. That is, the bound carries a factor of 1/ϵ compared to the non-private case, which matches the scaling in ϵ of the lower bounds in Theorem 1. Moreover, LDP-ATA-GP-UCB enjoys the same scaling with respect to both $\gamma_T$ and T as the state-of-the-art non-private sub-Gaussian case.

Although LDP-ATA-GP-UCB achieves an almost optimal regret bound, it may be overkill for the LDP setting. In particular, the original setting of the ATA-GP-UCB algorithm in (Chowdhury and Gopalan 2019) only assumes at most a finite variance. In our LDP setting, however, the corrupted reward $\tilde{y}$ has all moments bounded and enjoys an exponential-type tail. In other words, although the additional Laplace noise makes the corrupted reward no longer sub-Gaussian, it still enjoys much better properties than the general conditions under which ATA-GP-UCB works. This suggests that simpler algorithms might achieve the same regret bound. Another issue with ATA-GP-UCB is its computational complexity: as pointed out by (Shao et al. 2018) in the linear bandit setting (where ATA-GP-UCB reduces to TOFU), each round needs to re-truncate all of the historical payoffs, which leads to a high complexity, as illustrated in the sketch below.
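The following is a minimal sketch of the truncated estimate in lines 7-10 of Algorithm 1, under the assumption that the approximate features $\tilde{\phi}_t(x_i)$ are already available as a matrix; it is illustrative rather than the authors' implementation, and it makes explicit that every historical private reward is re-examined in every round.

```python
import numpy as np

def ata_truncated_estimate(Phi, y_priv, b, lam=0.1):
    """Feature-space truncation of Algorithm 1 (lines 7-10), for one round.

    Phi    : (t, m_t) matrix whose rows are the approximate features phi_t(x_1..x_t)
    y_priv : (t,) private responses from CTL(eps)
    b      : (t,) truncation levels b_1, ..., b_t
    Every historical reward is re-examined each round, hence the O(t * m_t) cost.
    """
    t, m = Phi.shape
    V = Phi.T @ Phi + lam * np.eye(m)                   # V_t = Phi^T Phi + lam I
    w, Q = np.linalg.eigh(V)                            # V is symmetric positive definite
    V_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T    # V_t^{-1/2}
    U = V_inv_sqrt @ Phi.T                              # rows u_1^T, ..., u_{m_t}^T
    W = U * y_priv[None, :]                             # entries u_{i,tau} * y_tau
    r_hat = np.where(np.abs(W) <= b[None, :], W, 0.0).sum(axis=1)   # truncated sums
    theta = V_inv_sqrt @ r_hat                          # theta_t = V_t^{-1/2} r_hat
    return theta, V

# Toy usage with random features standing in for the Nystrom embedding.
rng = np.random.default_rng(1)
Phi = rng.standard_normal((50, 8))
y_priv = rng.laplace(scale=2.0, size=50)
theta, V = ata_truncated_estimate(Phi, y_priv, b=np.full(50, 3.0))
```

In contrast, the two algorithms below either trim each reward only once or work with per-epoch estimates.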
Based on the discussion above, in the following we propose two novel algorithms that also achieve almost optimal regret while substantially reducing the implementation and computational complexity of LDP-ATA-GP-UCB.

### Raw Reward Truncation Algorithm and Regret

In this section, instead of the sophisticated truncation in the feature space used by LDP-ATA-GP-UCB, we adopt a simple truncation of the raw rewards. In the general heavy-tailed reward setting (with at most a finite variance), (Chowdhury and Gopalan 2019) proposed the TGP-UCB algorithm, which truncates a reward to zero if it exceeds a truncation point $b_t$ at round t. Specifically, in the finite-variance case the truncation point $b_t$ is $\Theta(t^{1/4})$ in TGP-UCB, which leads to a regret bound of $\tilde{O}(T^{3/4})$, i.e., an additional factor of $O(T^{1/4})$ compared to the regret bound for the sub-Gaussian case. This means that we cannot directly adopt TGP-UCB and still achieve the same regret bound as LDP-ATA-GP-UCB in the last section. However, as pointed out before, the corrupted reward $\tilde{y}$ has a nice exponential-tail property, which suggests that a truncation point of order $O(\ln t)$ is enough and in turn only incurs an additional $O(\ln T)$ factor in the regret. Based on this idea, we propose the LDP-TGP-UCB algorithm, described in Algorithm 2.

**Algorithm 2: LDP-TGP-UCB**

1. Input: parameters B, R, ϵ > 0, λ, δ.
2. Set $K = B^2 + R^2 + 2L^2$, with $L = 2(B+R)/\epsilon$.
3. For t = 1, 2, 3, ..., T do:
4. Set $b_t = B + R + L \ln t$.
5. Set $\beta_t = B + 2\sqrt{\gamma_{t-1} + \ln(1/\delta)} + \sqrt{K(\ln(t-1) + 1)}$.
6. Play $x_t = \arg\max_{x \in D} \hat{\mu}_{t-1}(x) + \beta_t \sigma_{t-1}(x)$.
7. Receive the private response $\tilde{y}_t$ from CTL(ϵ).
8. Set $\hat{y}_t = \tilde{y}_t \mathbf{1}\{|\tilde{y}_t| \le b_t\}$ and $\hat{Y}_t = [\hat{y}_1, \ldots, \hat{y}_t]^T$.
9. Set $\hat{\mu}_t(x) = k_t(x)^T (K_t + \lambda I)^{-1} \hat{Y}_t$.
10. Set $\sigma_t^2(x) = k(x, x) - k_t(x)^T (K_t + \lambda I)^{-1} k_t(x)$.
11. End for.

Moreover, by refining the regret analysis of TGP-UCB in (Chowdhury and Gopalan 2019), we obtain the following regret bound for LDP-TGP-UCB.

**Theorem 3.** Fix ϵ > 0. Let $f \in H_k(D)$ with $\|f\|_H \le B$, and assume the noise $\eta_t$ is bounded by R for all t. Then, for any $\delta \in (0, 1]$, LDP-TGP-UCB achieves, with probability at least 1 − δ, the regret bound

$$R_T = O\Big( \vartheta \sqrt{\ln T\, \gamma_T T} + \vartheta \sqrt{\gamma_T T\, (\gamma_T + \ln(1/\delta))} \Big), \quad \text{where } \vartheta = (B + R)/\epsilon.$$

**Remark 3.** As with LDP-ATA-GP-UCB, LDP-TGP-UCB is able to achieve a regret bound of $\tilde{O}(\gamma_T \sqrt{T}/\epsilon)$. The advantage of LDP-TGP-UCB is its simple implementation, in the sense that each reward is only trimmed once, as sketched below.
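A minimal sketch of this one-shot truncation follows (the helper name and toy constants are illustrative, not from the paper); the truncated values $\hat{y}_t$ then simply replace $y_{1:t}$ in the posterior mean of Eq. (2).

```python
import numpy as np

def truncate_private_reward(y_priv, t, B, R, eps):
    """Raw-reward truncation of LDP-TGP-UCB (Algorithm 2, line 8):
    keep y_tilde if |y_tilde| <= b_t, otherwise set it to zero.
    The truncation level b_t = B + R + L*ln(t) only grows logarithmically
    because the Laplace-corrupted reward has an exponential tail."""
    L = 2.0 * (B + R) / eps
    b_t = B + R + L * np.log(t)
    return y_priv if abs(y_priv) <= b_t else 0.0

# Example: truncate a stream of private rewards produced by CTL(eps).
rng = np.random.default_rng(2)
B, R, eps = 1.0, 1.0, 1.0
for t in range(1, 6):
    y = 0.5 + rng.laplace(scale=2 * (B + R) / eps)   # f(x_t) + eta_t corrupted by CTL
    print(t, truncate_private_reward(y, t, B, R, eps))
```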
**Proof sketch of Theorem 3.** As in most GP-UCB-like algorithms (Chowdhury and Gopalan 2017, 2019), the key step boils down to establishing a (high-probability) confidence-interval bound, which in our setting reads $|f(x) - \hat{\mu}_t(x)| \le \beta_{t+1} \sigma_t(x)$. To this end, with some linear algebra we have

$$|f(x) - \hat{\mu}_t(x)| \le \Big( B + \Big\| \sum_{\tau=1}^{t} \hat{\eta}_\tau \varphi(x_\tau) \Big\|_{V_t^{-1}} \Big)\, \sigma_t(x),$$

where $\hat{\eta}_t = \hat{y}_t - f(x_t)$, $\varphi(x) := k(x, \cdot)$ maps $x \in \mathbb{R}^d$ to the RKHS H associated with the kernel k, and $V_t = \Phi_t^T \Phi_t + \lambda I_H$ with $\Phi_t := [\varphi(x_1)^T, \ldots, \varphi(x_t)^T]^T$. The key term is $\| \sum_{\tau=1}^{t} \hat{\eta}_\tau \varphi(x_\tau) \|_{V_t^{-1}}$, which could be handled by the self-normalized inequality if $\hat{\eta}_\tau$ were sub-Gaussian; in our setting it is not. To overcome this issue, we divide it into two parts. In particular, similar to (Chowdhury and Gopalan 2019), we define $\xi_t = \hat{\eta}_t - \mathbb{E}[\hat{\eta}_t \mid \mathcal{F}_{t-1}]$, so that the key term can be written as

$$\Big\| \sum_{\tau=1}^{t} \hat{\eta}_\tau \varphi(x_\tau) \Big\|_{V_t^{-1}} \le \underbrace{\Big\| \sum_{\tau=1}^{t} \xi_\tau \varphi(x_\tau) \Big\|_{V_t^{-1}}}_{T_1} + \underbrace{\Big\| \sum_{\tau=1}^{t} \mathbb{E}[\hat{\eta}_\tau \mid \mathcal{F}_{\tau-1}]\, \varphi(x_\tau) \Big\|_{V_t^{-1}}}_{T_2}.$$

For $T_1$, note that $\xi_t = \hat{y}_t - \mathbb{E}[\hat{y}_t \mid \mathcal{F}_{t-1}]$, which is bounded by $2 b_t$ and hence sub-Gaussian. Thus, by the self-normalized inequality for RKHS-valued processes in (Durand, Maillard, and Pineau 2018; Chowdhury and Gopalan 2019), we can bound $T_1$ by $2\sqrt{\gamma_t + \ln(1/\delta)}$. For $T_2$, with some linear algebra we first bound it by $\sqrt{\sum_{\tau=1}^{t} \mathbb{E}[\hat{\eta}_\tau \mid \mathcal{F}_{\tau-1}]^2}$. Further, note that $\mathbb{E}[\hat{\eta}_\tau \mid \mathcal{F}_{\tau-1}] = -\mathbb{E}\big[\tilde{y}_\tau \mathbf{1}\{|\tilde{y}_\tau| > b_\tau\} \mid \mathcal{F}_{\tau-1}\big]$. Hence, by the Cauchy-Schwarz inequality with $b_\tau = B + R + L \ln \tau$, we have

$$\mathbb{E}[\hat{\eta}_\tau \mid \mathcal{F}_{\tau-1}]^2 \le \mathbb{E}[\tilde{y}_\tau^2 \mid \mathcal{F}_{\tau-1}]\, \mathbb{P}(|\mathcal{L}| > L \ln \tau) \le \frac{K}{\tau},$$

where $K := B^2 + R^2 + 2L^2$; the last inequality holds since $|\mathcal{L}|$ is exponential with mean L, so $\mathbb{P}(|\mathcal{L}| > L \ln \tau) = 1/\tau$. Therefore, by the property of the harmonic sum, $\sum_{\tau=1}^{t} \mathbb{E}[\hat{\eta}_\tau \mid \mathcal{F}_{\tau-1}]^2 \le K(\ln t + 1)$. Hence, the (high-probability) confidence-interval bound is obtained by setting $\beta_{t+1} = B + 2\sqrt{\gamma_t + \ln(1/\delta)} + \sqrt{K(\ln t + 1)}$. The full proof is relegated to the Appendix.

It is worth pointing out that the truncation trick used in LDP-TGP-UCB also sheds light on regret bounds for non-private BO under payoffs beyond sub-Gaussian, e.g., sub-Weibull payoffs, which include sub-Gaussian and sub-exponential payoffs as special cases (Vladimirova et al. 2019). More specifically, according to (Vladimirova et al. 2019), a random variable X is sub-Weibull with tail parameter θ, i.e., $X \sim \text{subW}(\theta)$, if there exist constants a and b such that

$$\mathbb{P}(|X| \ge x) \le a \exp(-b x^{1/\theta}) \quad \text{for all } x > 0. \tag{5}$$

Sub-Gaussian and sub-exponential distributions are special cases of sub-Weibull with θ = 1/2 and θ = 1, respectively. Thus, instead of choosing the truncation point $b_t = O(\ln t)$ as in LDP-TGP-UCB, one chooses $b_t = O((\ln t)^\theta)$, which again only incurs a logarithmic factor in the regret bound. As a result, with this simple truncation, the regret of non-private BO under sub-Weibull noise is still $\tilde{O}(\gamma_T \sqrt{T})$.

### Median-of-Means Approximate (MoMA) Algorithm and Regret

In this section, we introduce a new BO method that achieves almost optimal regret bounds under general heavy-tailed payoffs; the LDP setting is then just a special case. This new method is mainly inspired by the MENU algorithm in (Shao et al. 2018), which was introduced to handle heavy-tailed payoffs in the linear bandit setting and has been shown to have a lower complexity than TOFU (the core of ATA-GP-UCB). The key idea in MENU is the median-of-means technique (Bubeck, Cesa-Bianchi, and Lugosi 2013). The main challenge in generalizing MENU to the BO setting is handling the possibly infinite feature dimension associated with the kernel function. To this end, we again use kernel approximation techniques (i.e., the Nyström approximation) developed in (Calandriello et al. 2019; Chowdhury and Gopalan 2019). However, instead of updating the approximation after every iteration as in ATA-GP-UCB, our new method only updates the approximation after each epoch, which is composed of multiple iterations; this further reduces the complexity. The new algorithm, called MoMA-GP-UCB, is presented in Algorithm 3, and its median-of-means selection step is sketched below.
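The following is a minimal sketch (illustrative, not the authors' code) of the median-of-means selection used in lines 10-12 of Algorithm 3: each candidate estimate is scored by the median of its $V_n$-norm distances to the other candidates, and the candidate with the smallest score is kept.

```python
import numpy as np

def mom_select(thetas, V):
    """Median-of-means selection over k least-squares estimates
    (Algorithm 3, lines 10-12): for each candidate theta_j, take the median
    of its V-norm distances to the other candidates, then return the
    candidate whose median distance is smallest."""
    k = len(thetas)
    r = np.empty(k)
    for j in range(k):
        dists = [np.sqrt((thetas[j] - thetas[s]) @ V @ (thetas[j] - thetas[s]))
                 for s in range(k) if s != j]
        r[j] = np.median(dists)
    return thetas[int(np.argmin(r))]

# Toy usage: most estimates cluster near the truth while one is corrupted
# by heavy-tailed noise; the median-of-means choice ignores the outlier.
rng = np.random.default_rng(3)
m = 5
truth = rng.standard_normal(m)
thetas = np.array([truth + 0.05 * rng.standard_normal(m) for _ in range(9)])
thetas[0] += 10.0                      # an outlier estimate
print(np.linalg.norm(mom_select(thetas, np.eye(m)) - truth))
```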
With the aid of Fig. 1 (adapted from (Shao et al. 2018)), one can see that the total of T iterations is divided into N epochs, each consisting of k iterations. The algorithm loops over the epochs n = 1, ..., N. Within each epoch n, a point $x_n$ is selected in a GP-UCB fashion and played k times, with the corresponding rewards recorded; then the kernel approximation terms $\tilde{\phi}_n$, $\tilde{\Phi}_n$ and $\tilde{V}_n$ are updated. Following this update, the algorithm computes k least-squares estimates (LSEs), each based on the rewards along one row $j \in [k]$ of the epoch table (e.g., the data in the pink row of Fig. 1 generates the pink LSE, and similarly the green data generates the green LSE). Next, it applies the median-of-means technique to find the best LSE $\tilde{\theta}_{n,k^*}$ for epoch n. Finally, the posterior mean and variance are updated.

[Figure 1: Illustration of MoMA-GP-UCB (adapted from (Shao et al. 2018)). Within epoch n: update the approximations ($\tilde{\phi}_n$, $\tilde{\Phi}_n$, $\tilde{V}_n$); find k LSEs $\tilde{\theta}_{n,1}, \ldots, \tilde{\theta}_{n,k}$ based on $\{x_i\}_{i \le n}$; apply median-of-means over $\{\tilde{\theta}_{n,j}\}_{j \in [k]}$; update the posterior. Here $k = \lceil 24 \ln(4eT/\delta) \rceil$.]

**Algorithm 3: MoMA-GP-UCB**

1. Input: parameters λ, δ, $\{\beta_t\}_{t \ge 1}$, and q.
2. Set $\tilde{\mu}_0(x) = 0$ and $\tilde{\sigma}_0(x) = k(x, x)$ for all $x \in D$.
3. Set $k = \lceil 24 \ln(4eT/\delta) \rceil$ and $N = \lfloor T/k \rfloor$.
4. For n = 1, 2, 3, ..., N do:
5. Select $x_n = \arg\max_{x \in D} \tilde{\mu}_{n-1}(x) + \beta_n \tilde{\sigma}_{n-1}(x)$.
6. Play $x_n$ k times and observe rewards $y_{n,1}, y_{n,2}, \ldots, y_{n,k}$.
7. Set $\tilde{\phi}_n(\cdot)$ = Nyström-Embed$\big(\{(x_i, \tilde{\sigma}_{n-1}(x_i))\}_{i=1}^{n}, q\big)$.
8. Set $m_n$ as the dimension of $\tilde{\phi}_n$.
9. Set $\tilde{\Phi}_n^T = [\tilde{\phi}_n(x_1), \ldots, \tilde{\phi}_n(x_n)]$ and $\tilde{V}_n = \tilde{\Phi}_n^T \tilde{\Phi}_n + \lambda I_{m_n}$.
10. For $j \in [k]$, set $\tilde{\theta}_{n,j} = \tilde{V}_n^{-1} \sum_{i=1}^{n} y_{i,j}\, \tilde{\phi}_n(x_i)$.
11. For $j \in [k]$, set $r_j = \text{median}\big(\{\|\tilde{\theta}_{n,j} - \tilde{\theta}_{n,s}\|_{\tilde{V}_n} : s \in [k] \setminus \{j\}\}\big)$.
12. Set $k^* = \arg\min_{j \in [k]} r_j$.
13. Set $\tilde{\mu}_n(x) = \tilde{\phi}_n(x)^T \tilde{\theta}_{n,k^*}$.
14. Set $\tilde{\sigma}_n^2(x) = k(x, x) - \tilde{\phi}_n(x)^T \tilde{\phi}_n(x) + \lambda \tilde{\phi}_n(x)^T \tilde{V}_n^{-1} \tilde{\phi}_n(x)$.
15. End for.

We now have the following regret bound for MoMA-GP-UCB under general heavy-tailed payoffs.

**Theorem 4.** Let $f \in H_k(D)$ with $\|f\|_H \le B$, and assume that $\mathbb{E}\big[|\eta_t|^{1+\alpha} \mid \mathcal{F}_{t-1}\big] \le c$. Fix $\varepsilon \in (0, 1)$ and set $\rho = \frac{1+\varepsilon}{1-\varepsilon}$. Then, for any $\delta \in (0, 1]$, MoMA-GP-UCB with parameters $q = 6\rho \ln(4T/\delta)/\varepsilon^2$, $k = \lceil 24 \ln(4eT/\delta) \rceil$ and $\beta_{n+1} = B\big(1 + \frac{1}{\sqrt{1-\varepsilon}}\big) + 3 (9 m_n c)^{\frac{1}{1+\alpha}} n^{\frac{1-\alpha}{2(1+\alpha)}}$ has, with probability at least 1 − δ, regret bound

$$R_T = \tilde{O}\Big( \hat{B} \sqrt{T \gamma_T} + Z\, c^{\frac{1}{1+\alpha}}\, T^{\frac{1}{1+\alpha}}\, \gamma_T^{\frac{3+\alpha}{2(1+\alpha)}} \Big),$$

where $\hat{B} = \rho B\big(1 + \frac{1}{\sqrt{1-\varepsilon}}\big)$, $Z = \rho^{\frac{3+\alpha}{1+\alpha}} / \varepsilon^{\frac{2}{1+\alpha}}$, and $\tilde{O}(\cdot)$ hides polylogarithmic factors in T and 1/δ.

**Remark 4.** When α = 1, MoMA-GP-UCB recovers the same $\tilde{O}(\gamma_T \sqrt{T})$ regret bound as in the sub-Gaussian case. Moreover, for the special case of a linear kernel, substituting $\gamma_T = O(d \ln T)$, the bound in Theorem 4 recovers the regret bound in (Shao et al. 2018) up to logarithmic factors.

**Proof sketch of Theorem 4.** The proof is mainly inspired by (Shao et al. 2018; Chowdhury and Gopalan 2019). The key step is again a (high-probability) confidence-interval bound, i.e.,

$$|f(x) - \tilde{\mu}_n(x)| \le \beta_{n+1} \tilde{\sigma}_n(x). \tag{6}$$

Assuming the kernel approximation is ε-accurate, the left-hand side of Eq. (6) can be bounded as

$$|f(x) - \tilde{\mu}_n(x)| \le B\Big(1 + \frac{1}{\sqrt{1-\varepsilon}}\Big) \tilde{\sigma}_n(x) + \lambda^{-1/2} \big\| \tilde{V}_n^{-1} \tilde{\Phi}_n^T f_n - \tilde{\theta}_{n,k^*} \big\|_{\tilde{V}_n}\, \tilde{\sigma}_n(x), \tag{7}$$

where $f_n = [f(x_1), \ldots, f(x_n)]^T$ is the vector of f's evaluations up to epoch n. We therefore focus on the term $\|\tilde{V}_n^{-1} \tilde{\Phi}_n^T f_n - \tilde{\theta}_{n,k^*}\|_{\tilde{V}_n}$. To this end, we first establish the following result regarding the k LSEs: for each $j \in [k]$,

$$\mathbb{P}\Big( \big\| \tilde{V}_n^{-1} \tilde{\Phi}_n^T f_n - \tilde{\theta}_{n,j} \big\|_{\tilde{V}_n} \ge \bar{\gamma} \Big) \le \frac{1}{3}, \quad \text{where } \bar{\gamma} := (9 m_n c)^{\frac{1}{1+\alpha}} n^{\frac{1-\alpha}{2(1+\alpha)}}.$$

Based on this result and the choice of k, we obtain that for all $n \in [N]$, with probability at least 1 − δ,

$$\big\| \tilde{V}_n^{-1} \tilde{\Phi}_n^T f_n - \tilde{\theta}_{n,k^*} \big\|_{\tilde{V}_n} \le 3\bar{\gamma}. \tag{8}$$

Combining Eqs. (8) and (7) yields that, under the event that the kernel approximation is ε-accurate, with probability at least 1 − δ,

$$|f(x) - \tilde{\mu}_n(x)| \le \Big( B\Big(1 + \frac{1}{\sqrt{1-\varepsilon}}\Big) + 3\lambda^{-1/2} \bar{\gamma} \Big) \tilde{\sigma}_n(x)$$

for all $n \in [N]$. Since for any δ the kernel approximation (under the given parameters) is ε-accurate with probability at least 1 − δ, a union bound gives that

$$|f(x) - \tilde{\mu}_n(x)| \le \beta_{n+1} \tilde{\sigma}_n(x) \tag{9}$$

for all $n \in [N]$, where $\beta_{n+1} := B\big(1 + \frac{1}{\sqrt{1-\varepsilon}}\big) + 3\lambda^{-1/2} \bar{\gamma}$. Finally, by the properties of the Nyström approximation, with probability at least 1 − δ we have both $m_n = O\big(\frac{\rho^2}{\varepsilon^2} \gamma_n \ln(T/\delta)\big)$ and $\sigma_{n-1}(x_n) \le \rho\, \tilde{\sigma}_{n-1}(x_n)$.
The regret bound then follows from $R_T \le 2 k \sum_{n=1}^{N} \beta_n \tilde{\sigma}_{n-1}(x_n)$ together with the standard GP-UCB analysis. The full proof is relegated to the Appendix.

We can now easily design the private version, called LDP-MoMA-GP-UCB. The only difference is line 6 of Algorithm 3, in which LDP-MoMA-GP-UCB receives the private responses from CTL(ϵ). Thus, we directly obtain the regret bound of LDP-MoMA-GP-UCB as a corollary of Theorem 4.

**Corollary 1.** Let $f \in H_k(D)$ with $\|f\|_H \le B$, and assume the noise $\eta_t$ is bounded by R. Fix ϵ > 0, $\varepsilon \in (0, 1)$ and set $\rho = \frac{1+\varepsilon}{1-\varepsilon}$ and $c = R^2 + 8(B+R)^2/\epsilon^2$. Then, for any $\delta \in (0, 1]$, LDP-MoMA-GP-UCB with parameters $q = 6\rho \ln(4T/\delta)/\varepsilon^2$, $k = \lceil 24 \ln(4eT/\delta) \rceil$ and $\beta_{n+1} = B\big(1 + \frac{1}{\sqrt{1-\varepsilon}}\big) + 3 (9 m_n c)^{\frac{1}{1+\alpha}} n^{\frac{1-\alpha}{2(1+\alpha)}}$ has, with probability at least 1 − δ, regret bound

$$R_T = \tilde{O}\Big( \hat{B} \sqrt{T \gamma_T} + Z\, c^{\frac{1}{1+\alpha}}\, T^{\frac{1}{1+\alpha}}\, \gamma_T^{\frac{3+\alpha}{2(1+\alpha)}} \Big),$$

where $\hat{B} = \rho B\big(1 + \frac{1}{\sqrt{1-\varepsilon}}\big)$ and $Z = \rho^{\frac{3+\alpha}{1+\alpha}} / \varepsilon^{\frac{2}{1+\alpha}}$. In particular, since the Laplace-corrupted rewards have all moments bounded, taking α = 1 and substituting c gives $R_T = \tilde{O}(\gamma_T \sqrt{T}/\epsilon)$, matching the other two algorithms.

## Unbounded Noise Case

In the case of unbounded noise, we show that the Laplace mechanism can still ensure (ϵ, δ)-LDP (weaker than ϵ-LDP); see the Appendix.

## Experiments

We conduct experiments to compare the performance of the three private algorithms (i.e., LDP-ATA-GP-UCB, LDP-TGP-UCB, LDP-MoMA-GP-UCB) and of three non-private BO methods for general heavy-tailed payoffs (i.e., ATA-GP-UCB and TGP-UCB from (Chowdhury and Gopalan 2019), and the MoMA-GP-UCB proposed in this paper). As in (Chowdhury and Gopalan 2019), the parameters of each algorithm are set order-wise similar to those recommended by the theorems. We run each algorithm for 10 independent trials and plot the average cumulative regret over time.

[Figure 2: Cumulative regret vs. iteration. (a)-(c) The three LDP algorithms (LDP-TGP-UCB, LDP-ATA-GP-UCB, LDP-MoMA-GP-UCB): (a) LDP, ϵ = 1, synthetic data, k_Matérn; (b) LDP, ϵ = 0.5, light sensor data; (c) LDP, ϵ = 1, stock market data. (d)-(f) The non-private versions (TGP-UCB, ATA-GP-UCB, MoMA-GP-UCB) on heavy-tailed data, with standard deviation: (d) synthetic data, k_SE; (e) light sensor data; (f) stock market data.]

### Datasets and Settings

**Synthetic data.** The domain D is generated by discretizing [0, 1] uniformly into 100 points. The black-box function $f = \sum_{i=1}^{p} a_i k(\cdot, x_i)$ is generated by uniformly sampling $a_i \in [-1, 1]$ and support points $x_i \in D$, with p = 100. The kernel parameters are l = 0.2 for k_SE, and l = 0.2, ν = 2.5 for k_Matérn. We set $B = \max_{x \in D} |f(x)|$ and $y(x) = f(x) + \eta$. For the LDP case, the noise η is uniformly sampled in [−1, 1], hence R = 1. For the non-private heavy-tailed case, the noise η is sampled from the Student's t-distribution with 3 degrees of freedom; hence $v = B^2 + 3$ and c = 3. A sketch of this construction is given below.
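A minimal sketch of the synthetic construction (illustrative only; the seed, sampling details, and helper names are assumptions rather than the paper's code):

```python
import numpy as np

def k_se(x1, x2, l=0.2):
    """SE kernel on scalars: exp(-(x1 - x2)^2 / (2 l^2))."""
    return np.exp(-(x1 - x2) ** 2 / (2 * l ** 2))

rng = np.random.default_rng(0)
D = np.linspace(0, 1, 100)            # domain: [0, 1] discretized into 100 points
p = 100
a = rng.uniform(-1, 1, size=p)        # coefficients a_i ~ U[-1, 1]
support = rng.choice(D, size=p)       # support points x_i in D

def f(x):
    """Synthetic objective f(x) = sum_i a_i * k_SE(x, x_i)."""
    return np.sum(a * k_se(x, support))

B = max(abs(f(x)) for x in D)         # B = max_x |f(x)|

def observe(x):
    """Noisy reward y(x) = f(x) + eta with eta ~ U[-1, 1] (so R = 1), for the LDP case."""
    return f(x) + rng.uniform(-1, 1)

print(B, observe(D[10]))
```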
**Light sensor data.** This dataset was collected in the CMU Intelligent Workplace in November 2005 and is available online as a Matlab structure (http://www.cs.cmu.edu/~guestrin/Class/10708-F08/projects); it contains the locations of 41 sensors, 601 training samples, and 192 test samples. We use it in the context of finding the maximum average reading of the sensors. For a fair comparison, the settings for this dataset follow (Chowdhury and Gopalan 2019), which has shown that the payoffs are heavy-tailed. In particular, f is set as the empirical average of the test samples, with B set as its maximum, and k is set as the empirical covariance of the normalized training samples. The noise is estimated by taking the difference between the test samples and their empirical mean (i.e., f), and R is set as the maximum. Here we consider α = 1, set v as the empirical mean of the squared readings of the test samples, and set c as the empirical mean of the squared noise.

**Stock market data.** This dataset consists of the adjusted closing prices of 29 stocks from January 4, 2016 to April 10, 2019. We use it in the context of identifying the most profitable stock in a given pool of stocks. As verified in (Chowdhury and Gopalan 2019), the rewards follow a heavy-tailed distribution. We take the empirical mean of the stock prices as our objective function f and the empirical covariance of the normalized stock prices as our kernel function k. The noise is estimated by taking the difference between the raw prices and their empirical mean (i.e., f), with R set as the maximum. We consider α = 1, with v set as the empirical mean of the squared prices and c set as the empirical mean of the squared noise.

### Results

From Figure 2, we can see that MoMA-GP-UCB (or LDP-MoMA-GP-UCB) tends to empirically outperform the other two algorithms in both the non-private and private settings. We also conducted additional experiments (relegated to the Appendix), with similar observations. Note that, as in (Chowdhury and Gopalan 2019), the high error bar in panel (d) arises because a different f is chosen for each trial.

## Conclusion

We derived regret lower bounds for LDP Bayesian optimization and presented three almost optimal algorithms. We also proposed MoMA-GP-UCB, which complements previous BO algorithms for heavy-tailed payoffs and achieves superior empirical performance with a reduced complexity.

## References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312–2320.

Basu, D.; Dimitrakakis, C.; and Tossou, A. 2019. Differential Privacy for Multi-armed Bandits: What Is It and What Is Its Cost? arXiv preprint arXiv:1905.12298.

Bubeck, S.; Cesa-Bianchi, N.; and Lugosi, G. 2013. Bandits with heavy tail. IEEE Transactions on Information Theory 59(11): 7711–7717.

Calandriello, D.; Carratino, L.; Lazaric, A.; Valko, M.; and Rosasco, L. 2019. Gaussian process optimization with adaptive sketching: Scalable and no regret. arXiv preprint arXiv:1903.05594.

Chowdhury, S. R.; and Gopalan, A. 2017. On kernelized multi-armed bandits. arXiv preprint arXiv:1704.00445.

Chowdhury, S. R.; and Gopalan, A. 2019. Bayesian optimization under heavy-tailed payoffs. In Advances in Neural Information Processing Systems, 13790–13801.

Cormode, G.; Jha, S.; Kulkarni, T.; Li, N.; Srivastava, D.; and Wang, T. 2018. Privacy at scale: Local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data, 1655–1658.

Duchi, J. C.; Jordan, M. I.; and Wainwright, M. J. 2013. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 429–438. IEEE.

Durand, A.; Maillard, O.-A.; and Pineau, J. 2018. Streaming kernel regression with provably adaptive mean, variance, and regularization.
The Journal of Machine Learning Research 19(1): 650–683.

Dwork, C.; Roth, A.; et al. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science 9(3–4): 211–407.

Gajane, P.; Urvoy, T.; and Kaufmann, E. 2018. Corrupt bandits for preserving local privacy. In Algorithmic Learning Theory, 387–412. PMLR.

Kasiviswanathan, S. P.; Lee, H. K.; Nissim, K.; Raskhodnikova, S.; and Smith, A. 2011. What can we learn privately? SIAM Journal on Computing 40(3): 793–826.

Kusner, M.; Gardner, J.; Garnett, R.; and Weinberger, K. 2015. Differentially private Bayesian optimization. In International Conference on Machine Learning, 918–927.

Lai, T. L.; and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1): 4–22.

Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661–670.

Mutny, M.; and Krause, A. 2018. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In Advances in Neural Information Processing Systems, 9005–9016.

Rasmussen, C. E. 2003. Gaussian processes in machine learning. In Summer School on Machine Learning, 63–71. Springer.

Ren, W.; Zhou, X.; Liu, J.; and Shroff, N. B. 2020. Multi-Armed Bandits with Local Differential Privacy. arXiv preprint arXiv:2007.03121.

Scarlett, J.; Bogunovic, I.; and Cevher, V. 2017. Lower bounds on regret for noisy Gaussian process bandit optimization. arXiv preprint arXiv:1706.00090.

Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and De Freitas, N. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104(1): 148–175.

Shao, H.; Yu, X.; King, I.; and Lyu, M. R. 2018. Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. In Advances in Neural Information Processing Systems, 8420–8429.

Srinivas, N.; Krause, A.; Kakade, S. M.; and Seeger, M. 2009. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.

Vladimirova, M.; Girard, S.; Nguyen, H.; and Arbel, J. 2019. Sub-Weibull distributions: generalizing sub-Gaussian and sub-Exponential properties to heavier-tailed distributions. arXiv preprint arXiv:1905.04955.

Zheng, K.; Cai, T.; Huang, W.; Li, Z.; and Wang, L. 2020. Locally Differentially Private (Contextual) Bandits Learning. arXiv preprint arXiv:2006.00701.