# faster_rates_for_private_adversarial_bandits__ae0e6eed.pdf

Faster Rates for Private Adversarial Bandits

Hilal Asi * 1 Vinod Raman * 2 Kunal Talwar * 1

We design new differentially private algorithms for the problems of adversarial bandits and bandits with expert advice. For adversarial bandits, we give a simple and efficient conversion of any non-private bandit algorithm to a private bandit algorithm. Instantiating our conversion with existing non-private bandit algorithms gives a regret upper bound of O

KT ε , improving upon the existing upper bound

KT log(KT )

for all ε 1. In particular,

our algorithms allow for sublinear expected regret even when ε 1

T , establishing the first known separation between central and local differential privacy for this problem. For bandits with expert advice, we give the first differentially private algorithms, with expected

KT log(N) log(KT )

and O N 1/6K1/2T 2/3 log(NT )

ε1/3 + N 1/2 log(NT )

ε , where K and N are the number of actions and experts respectively. These rates allow us to get sublinear regret for different combinations of small and large K, N and ε.

1. Introduction

In the adversarial bandit problem, a learner plays a sequential game against nature over T N rounds. In each round t {1, . . . , T}, nature picks a loss function ℓt : [K] [0, 1], hidden to the learner. The learner, using the history of the game up to time point t 1, selects a potentially random action It {1, . . . , K} and nature reveals only the loss ℓt(It) of the selected action. For any sequence of loss functions ℓ1, . . . , ℓT , the goal of the learner is to select a sequence of actions I1, . . . , IT , while only observing

*Equal contribution 1Apple 2Department of Statistics, University of Michigan, Ann Arbor. Work done while interning at Apple. Correspondence to: Vinod Raman <vkraman@umich.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

the loss of selected actions, such that its expected regret

arg min i [K]

is minimized, where the expectation is taken with respect to the randomness of the learner.

Bandit algorithms, and in particular adversarial bandit algorithms (Auer et al., 2002), have been of significant interest for over two decades (Bubeck et al., 2012) due to their applications to online advertising, medical trials, and recommendation systems. In many of these settings, one would like to publish the actions selected by bandit algorithms without leaking sensitive user information. For example, when predicting treatment options for patients with the goal of maximizing the number of cured patients, one may want to publish results about the best treatment without leaking sensitive patient medical history (Lu et al., 2021). In online advertising, a goal is to publish the recommended ads without leaking user preferences. In light of such privacy concerns, we study adversarial bandits under the constraint of differential privacy (Dwork, 2006). Surprisingly, unlike the stochastic setting (Azize & Basu, 2022), the price of privacy in adversarial bandits is not well understood. Existing work by Agarwal & Singh (2017) and Tossou & Dimitrakakis (2017) give ε-differentially private bandit algorithms with

expected regret at most O

1. However, their

algorithms satisfy the stronger notion of local differential privacy and become vacuous for tasks with high privacy requirements, where one might take ε < 1

T . In fact, it was not known how large ε needs to be in order to obtain sublinear expected worst-case regret.

Main Contributions. Motivated by this gap, we provide new, differentially private algorithms for adversarial bandits and bandits with expert advice with better trade-offs between privacy and regret. In the adversarial bandits setting, we provide a simple and efficient conversion of any non-private bandit algorithm into a private bandit algorithm. By instantiating this conversion with existing (non-private)

1Tossou & Dimitrakakis (2017) also claim to give an algorithm

with regret O T 2/3

, however, we are unable to verify

its correctness. See Appendix G.1.

Faster Rates for Private Adversarial Bandits

bandit algorithms, we get ε-differentially private bandit algorithms with expected regret at most O

KT ε , improving upon the best known upper bounds for all ε 1. In particular, this result shows that sublinear regret is possible for any ε ω 1

T . As corollaries, we establish separations in the achievable regret bounds between:

(1) Oblivious and Adaptive adversaries. In particular, while we show that sublinear regret is possible for all choices of ε ω( 1

T ) under an oblivious adversary, this is not the case for an adaptive adversary, where one cannot achieve sublinear regret if ε o( 1

T ) (Asi et al., 2023).

(2) Central and Local differential privacy. While our results show that sublinear regret is possible if ε o( 1

T ) under central differential privacy, it is well known that this is not the case for local differential privacy.

For bandits with expert advice (Auer et al., 2002), we give the first differentially private algorithms. In particular, we give three different (ε, δ)-differentially private bandit algorithms, obtaining expected re-

KT log(N) log(KT )

O N1/6K1/2T 2/3 log(NT )

ε1/3 + N1/2 log(NT )

ε respectively. These regret guarantees cover regimes with high-privacy requirements and regimes with a large number of experts N. In both settings, our techniques involve combining the Laplace mechanism with batched losses.

1.1. Related Works

Adversarial Bandits and Bandits with Expert Advice. We refer the reader to the excellent book by Bubeck et al. (2012) for a history of stochastic and adversarial bandits. The study of the adversarial bandit problem dates back at least to the seminal work of Auer et al. (2002), who show that a modification to the Multiplicative Weights Algorithm, known as EXP3, achieves worst-case expected regret O p

TK log(K) . Following this work, there has been an explosion of interest in designing better adversarial bandit algorithms, including, amongst others, the work by Audibert & Bubeck (2009), who establish that the minimax regret for adversarial bandits is Θ

TK . More recently, there has been interest in unifying existing adversarial bandit algorithms through the lens of Follow-the-Regularized Leader (FTRL) and Follow-the-Perturbed-Leader (FTPL) (Abernethy et al., 2015). Surprisingly, while it was known since the work of Audibert & Bubeck (2009) that an FTRL-based approach can lead to minimax optimal regret bounds, it was only recently shown that this is also the case for FTPL-based bandit algorithms (Honda et al., 2023).

The first works for bandits with expert advice also date back at least to that of Auer et al. (2002), who propose EXP4 and bound its expected regret by O p

TK log(N) , where N is the number of experts. When N K, Seldin & Lugosi (2016) prove a lower bound of Ω q

K log(K)T log(N) on the expected regret, showing that EXP4 is already near optimal. As a result, EXP4 has become an important building block for related problems, like online multiclass classification (Daniely & Helbertal, 2013; Raman et al., 2024) and sleeping bandits (Kleinberg et al., 2010), among others.

Private Online Learning. Dwork et al. (2010a) initiated the study of differentially private online learning. Jain et al. (2012) extend these results to broad setting of online convex programming by using gradient-based algorithms to achieve differential privacy. Following this work, Guha Thakurta & Smith (2013) privatize the Follow-the-Approximate-Leader template to obtain sharper guarantees for online convex optimization. In the special case of learning with expert advice, Dwork et al. (2014) and Jain & Thakurta (2014) give private online learning algorithms with regret bounds of

. More recently, Agarwal & Singh (2017)

design private algorithms for online linear optimization with regret bounds that scale like O(

ε). In particular, for the setting of learning with expert advice, they show that it is possible to obtain a regret bound that scales like O p

T log(N) + N log(N) log2 T

ε , improving upon the work by Dwork et al. (2014) and (Jain & Thakurta, 2014). For large N, this upper bound was further improved to O p

T log(N) + T 1/3 log(N)

T log(N) + T 1/3 log(N)

ε2/3 by Asi et al. (2023) and Asi et al. (2024) respectively in the oblivious setting.

Private Bandits. The majority of existing work on differentially private bandits focus on the stochastic setting (Mishra & Thakurta, 2015; Tossou & Dimitrakakis, 2016; Sajed & Sheffet, 2019; Hu et al., 2021; Azize & Basu, 2022), linear contextual bandits (Shariff & Sheffet, 2018; Neel & Roth, 2018), or adjacent notions of differential privacy (Zheng et al., 2020; Tenenbaum et al., 2021; Ren et al., 2020). To our knowledge, there are only three existing works that study differentially private adversarial bandits. The first is by Guha Thakurta & Smith (2013) who give an (ε, δ)- differentially private bandit algorithm with expected regret O KT 3/4

ε . Finally, and in parallel, Agarwal & Singh (2017) and Tossou & Dimitrakakis (2017) improve the up-

per bound to O

. We note that the private

algorithms given by Agarwal & Singh (2017) and Tossou & Dimitrakakis (2017) satisfy the even stronger notion of

Faster Rates for Private Adversarial Bandits

Table 1. Summary of upper bounds with constant factors and dependencies on log 1

δ suppressed. The three rows for Bandits with Experts represent different algorithms with incomparable guarantees.

Existing Work Our Work

Adversarial Bandits

KT log(KT )

KT ε Bandits with Experts NA

Bandits with Experts NA

KT log(N) log(KT )

ε Bandits with Experts NA N1/6K1/2T 2/3 log(NT )

ε1/3 + N1/2 log(NT )

local differential privacy (Duchi et al., 2013).

2. Preliminaries

2.1. Notation

Let K N denote the number of actions and ℓ: [K] 7 [0, 1] denote an arbitrary loss function that maps an action to a bounded loss. For an abstract sequence z1, . . . , zn, we abbreviate it as z1:n and (zs)n s=1 interchangeably. For a measurable space (X, σ(X)), we let Π(X) denote the set of all probability measures on X. We let Lap(λ) denote the Laplace distribution with mean zero and scale λ such that its probability density function is fλ(x) = 1 2λ exp |x|

Finally, we let [N] := {1, . . . , N} for N N.

2.2. The Adversarial Bandit Problem

In the adversarial bandit problem, a learner plays a sequential game against nature over T N rounds. In each round t [T], the learner selects (potentially randomly) an action It [K] and observes only its loss ℓt(It). The goal of the learner is to adaptively select actions I1, . . . , IT [K] such that its cumulative loss is close to the best possible cumulative loss of the best fixed action i [K] in hindsight. Crucially, we place no assumptions on the sequence of losses ℓ1, . . . , ℓT , and thus they may be chosen adversarially.

Before we quantify the performance metric of interest, we provide a formal definition of a bandit online learning algorithm. This definition will be useful for precisely formalizing the notion of privacy (Section 2.4) and describing our generic transformation of non-private bandit algorithms to private ones (Section 3).

Definition 2.1 (Bandit Algorithm). A bandit algorithm is a deterministic map A : ([K] R) Π([K]) which, for every t N, maps a history of actions and observed losses (Is, ℓs(Is))t 1 s=1 ([K] R)t 1 to a distribution µt Π([K]). The learner then samples an action It µt.

We will slightly abuse notation by using A((Is, ℓs(Is))t 1 s=1) to denote the random action It drawn from µt, the distribu-

tion that A outputs when run on (Is, ℓs(Is))t 1 s=1. In addition, we will sometimes use Ht := (Is, ℓs(Is))t 1 s=1 to denote the history of selected actions and observed losses induced by running A up to, but not including, timepoint t N. Note that Ht is a random variable and we may write the action selected by algorithm A on round t N as A(Ht). It will also be helpful to think about Ht as the View of A as a result of its interaction with the adversary up to, but not including, timepoint t.

Given a bandit online learner A, we define its worst-case expected regret as

RA(T, K) = sup ℓ1:T

t=1 ℓt(A(Ht))

where the expectation is taken only with respect to the randomness of the learner. Our goal is to design a bandit algorithm A such that RA(T, K) = o(T). Note that our definition of regret means that we are assuming an oblivious adversary, one that selects the entire sequence of losses ℓ1, . . . , ℓT before the game begins. This assumption is in contrast to that of an adaptive adversary which, for every t N, may select the loss ℓt based on Ht. We leave quantifying the rates for private adversarial bandits under adaptive adversaries for future work. That said, we do note that the lower bounds for adaptive adversaries established in fullinformation setting by Asi et al. (2023) also carry over to the bandit feedback setting. Accordingly, Corollary 3.2 and Theorems 4 and 5 in Asi et al. (2023) show that the strong separation in the possible rates for oblivious and adaptive adversaries also holds under bandit feedback.

2.3. The Bandits with Expert Advice Problem

In adversarial bandits with expert advice (Auer et al., 2002), we distinguish between a set of experts [N] and the set of available actions [K]. In each round t [T], each expert j [N] predicts a distribution µj t Π([K]). The learner uses these predictions to compute its own distribution ˆµt Π([K]), after which it samples It ˆµt and observes the loss ℓt(It). The goal of the learner is to compete against the best fixed expert in hindsight while observing bandit feedback. We need a new definition of a bandit with expert

Faster Rates for Private Adversarial Bandits

advice algorithm to account for the fact that the learner has access to expert advice.

Definition 2.2 (Bandits with Expert Advice Algorithm). A bandit with expert advice algorithm is a deterministic map A : ([K] R) (Π([K])N) Π([K]) which, for every t N, maps the history of actions and observed losses (Is, ℓs(Is))t 1 s=1 ([K] R)t 1 as well the sequence of expert advice µ1:N 1:t ((Π([K])N)t to a distribution ˆµt Π([K]). The learner then samples an action action It ˆµt.

One can now take an analogous definition of worst-case expected regret to be

RA(T, K, N) := sup ℓ1:T sup µ1:N 1:T

t=1 ℓt(A(Ht, µ1:N 1:t ))

i=1 µi t(j) ℓt(i)

where the expectation is taken only with respect to the randomness of the learner. As for adversarial bandits, our definition of minimax regret for bandits with experts advice implicitly assumes an oblivious adversary.

2.4. Differential Privacy

In this work, we are interested in designing bandit algorithms that have low expected regret while satisfying the constraint of differential privacy. Roughly speaking, differential privacy quantifies the following algorithmic property: an algorithm A is a private bandit algorithm if, for any two sequences of losses that differ in exactly one position, the distributions over actions induced by running A on the two loss sequences are close. Definition 2.3 formalizes this notion of privacy in adversarial bandits.

Definition 2.3 ((ε, δ)-Differential Privacy in Adversarial Bandits (Dwork et al., 2014)). A bandit algorithm A is (ε, δ)-differentially private if for every T N, any two sequences of loss functions ℓ1:T and ℓ 1:T differing at exactly one time point t [T], and any E [K]T , we have that P [(A(H1), A(H2), . . . , A(HT )) E] eεP [(A(H 1), A(H 2), . . . , A(H T )) E] + δ, where we let Ht = (Is, ℓs(Is))t 1 s=1 and H t = (I s, ℓ s(Is))t 1 s=1.

We note that the our notion of differential privacy in Definition 2.3 is inherently for an oblivious adversary. A different definition of privacy is required if the adversary is allowed to be adaptive i.e., having the ability to pick the loss ℓt using the realized actions I1, . . . , It 1 played by the learner (see Definition 2.1 in Asi et al. (2023) for more details). While the utility guarantees of our bandit algorithms hold only for oblivious adversaries, their privacy guarantees hold against adaptive adversaries.

We use an analogous definition of differential privacy for bandits with expert advice. Definition 2.4 ((ε, δ)-Differential Privacy in Bandits with Expert Advice (Dwork et al., 2014)). A bandit with expert advice algorithm A is (ε, δ)-differentially private if for every T N, any two sequences of loss functions ℓ1:T and ℓ 1:T differing at exactly one time point t [T], and any E [K]T , we have that P [(A(H1), A(H2), . . . , A(HT )) E] eεP [(A(H 1), A(H 2), . . . , A(H T )) E] + δ, where we let Ht = (Is, ℓs(Is))t 1 s=1 and H t = (I s, ℓ s(Is))t 1 s=1.

Note that Definition 2.4 implicitly assumes that only the sequence of losses is sensitive information and that expert predictions are public.

Our main focus in this work will be on designing bandit algorithms that satisfy pure differential privacy (i.e. when δ = 0). In Appendix A, we review several fundamental properties of privacy and privacy-preserving mechanisms that serve as important building blocks.

3. Faster Rates for Private Adversarial Bandits

In this section, we establish a connection between nonprivate bandit algorithms that can handle negative losses and ε-differentially private bandit algorithms. Let B be any bandit algorithm and define

RB(T, K, λ) := sup ℓ1:T

t=1 ℓt(B( Ht))

inf i [K] E

t=1 ℓt(B( Ht))

where ℓt(i) = ℓt(i) + Zt(i) with Zt(i) Lap(λ), Ht = (Is, ℓs(Is))t 1 s=1, and the expectation is taken with respect to both the randomness of B and the losses ℓ1:T . Theorem 3.1 states that one can always convert B into an ε-differentially private bandit algorithm A whose regret guarantees can be written in terms of RB(T, K, λ). Theorem 3.1 (Generic Conversion). Let B be any bandit algorithm. Then, for every τ 1 and ε 1, there exists an ε-differentially private bandit algorithm Aτ such that

RAτ (T, K) τ RB

In particular, picking τ = 1

ε means that there exists a ε-differentially private bandit algorithm A such that

ε RB (εT, K, 1) + 2

As a corollary of Theorem 3.1, we establish new upper bounds on the expected regret under the constraint of εdifferential privacy that improves on existing work in all

Faster Rates for Private Adversarial Bandits

regimes of ε > 0. In particular, Corollary 3.2 follows by letting B be the HTINF algorithm from Huang et al. (2022), which modifies Follow-the-Regularized-Leader (FTRL) for heavy-tailed losses.

Corollary 3.2 (FTRL Conversion). For every ε [ 1

T , 1], if B is HTINF with α = 2 and σ =

6, then Algorithm 1, when run with B and τ = 1

ε , is ε-differentially private and suffers worse-case expected regret at most

In Appendix E, we instantiate Theorem 3.1 with EXP3 and FTPL to obtain two other upper bounds. In every case, our upper bounds establish the first known separation in rates between central differential privacy and local differential privacy (see Appendix A for definition) for this problem. Namely, while the lower bounds from Basu et al. (2019) show that any local ε-differentially private bandit algorithm must suffer linear Ω(T) expected regret when ε < 1

T , Corollary 3.2 gives an algorithm satisfying ε-central differential privacy (i.e. Definition 2.3) whose expected regret is sublinear o(T) even when ε < 1

T . The remainder of this section is dedicated to proving Theorem 3.1. Corollary 3.2 is proven in Appendix D.

3.1. Proof of Theorem 3.1

The conversion behind Theorem 3.1 is remarkably simple. At a high-level, it requires simulating the non-private bandit algorithm on noisy batched losses. That is, instead of passing every loss to the non-private bandit algorithm, we play the same arm for a batch size τ, average the loss across this batch, add independent Laplace noise to the batched loss, and then pass this noisy batched loss to the non-private bandit algorithm. By adding Laplace noise to batched losses as opposed to the original losses (as is done by Tossou & Dimitrakakis (2017) and Agarwal & Singh (2017)), the magnitude of the required noise is reduced by a multiplicative factor of the batch size.

However, a key issue that needs to be handled when adding noise (whether to batched or un-batched losses) is the fact that the losses fed to the non-private bandit algorithm can now be negative and unbounded. Accordingly, in order to get any meaningful utility guarantees, Theorem 3.1 effectively requires our non-private bandit algorithm to handle unbounded, negative (but still unbiased) losses. Fortunately, there are several existing adversarial bandit algorithms that can achieve low expected regret while observing negative losses. Three of these are presented in Corollary 3.2, E.1, and E.3. To the best of our knowledge, this is the first work to establish a connection between handling negative losses (for example in works that handle heavy-tailed losses) and

Algorithm 1 Non-Private to Private Conversion

1: Input: Non-private bandit algorithm B, batch size τ, privacy parameter ε (0, 1] 2: Initialize: j = 1 3: for t = 1, . . . , T do 4: if t = (j 1)τ + 1 then 5: Receive action Ij from B. 6: end if 7: Play action It := Ij 8: Observe loss ℓt(It). 9: if t = jτ then 10: Define ˆℓj(i) := 1

τ Pjτ s=(j 1)τ+1 ℓs(i)

11: Pass ˆℓj(Ij) + Zj to B, where Zj Lap( 1

τε). 12: Update j j + 1. 13: end if 14: end for

(non-local) differential privacy.

Algorithm 1 provides the pseudo code for converting a nonprivate bandit algorithm to a private bandit algorithm.

Lemma 3.3 (Privacy guarantee). For every bandit algorithm B, batch size τ 1, and ε 1, Algorithm 1 is ε-differentially private.

Proof. (sketch of Lemma 3.3) Observe that Algorithm 1 applies the bandit algorithm B on the losses ˆℓ1, . . . , ˆℓ T

τ in a black box fashion. Accordingly, the privacy guarantee of Algorithm 1 follows from the privacy guarantee of ˆℓ1(I1), . . . , ˆℓ T

τ ) and post-processing. The privacy

of each ˆℓj(Ij) follows from the Laplace mechanism.

A rigorous proof of Lemma 3.3 is in Appendix C.

Lemma 3.4 (Utility guarantee). For every bandit algorithm B, batch size τ 1, and ε 1, the worst-case expected regret of Algorithm 1 is at most τ RB( T

The proof of Lemma 3.4 follows from the following result by Arora et al. (2012).

Theorem 3.5 (Theorem 2 in (Arora et al., 2012)). Let B be any bandit algorithm. Let τ 1 be a batch size and let Aτ be the batched version of B. That is, the bandit algorithm Aτ groups the rounds 1, . . . , T into consecutive and disjoint batches of size τ such that the j th batch begins on round (j 1)τ + 1 and ends on round jτ. At the start of each batch j the algorithm Aτ calls B and receives an action Ij drawn from B s internal distribution. Then, Aτ plays this action for τ rounds. At the end of the batch, Aτ feeds B with the average loss value 1

τ Pjτ s=(j 1)τ+1 ℓs(Ij). For such an algorithm Aτ, its worst-case expected regret is at most τ RB T

Faster Rates for Private Adversarial Bandits

Note that Algorithm 1 is precisely the batched version of its input B. Accordingly, Theorem 3.5 immediately implies that on any sequence ℓ1:T , the expected regret of Algorithm 1 is at most τ RB( T

ετ ) + τ. We provide a complete proof of Lemma 3.4 in Appendix C.

4. Upper bounds for Bandits with Expert Advice

Theorem 3.1 also allows us to give guarantees for bandits with expert advice. To do so, we need Theorem 4.1, due to Auer et al. (2002), which shows that any bandit algorithm can be converted into a bandit with expert advice algorithm in a black-box fashion. For completeness, we provide this conversion and the proof of Theorem 4.1 in Appendix F.

Theorem 4.1 (Bandit to Bandit with Expert Advice). Let B be any bandit algorithm and RB(T, K) denote its worstcase expected regret. Then, the worst-case expected regret of Algorithm 5 when initialized with B is at most RB(T, N).

By treating each expert as a meta-action, Theorem 3.1 and Theorem 4.1 can be used to convert a non-private bandit algorithm B into a private bandit with expert advice algorithm A in the following way: given a non-private bandit algorithm B, use Theorem 3.1 to convert it into a private bandit algorithm B . Then, use Theorem 4.1 to convert B into a private bandit with expert advice algorithm A. By post-processing, the corresponding actions played by A are also private. In fact, this conversion also satisfies a stronger notion of privacy where the expert advice is also taken to be sensitive information. Theorem 4.2 formalizes this conversion.

Theorem 4.2 (Generic Conversion). Let B be any bandit algorithm. Then, for every τ 1 and ε 1, there exists an ε-differentially private bandit with expert advice algorithm Aτ such that

RAτ (T, K, N) τ RB

In particular, by setting τ = 1

ε , there exists an εdifferentially private bandit with expert advice algorithm A such that

RA(T, K, N) 2

ε RB (εT, N, 1) + 2

The proof of Theorem 4.2 is deferred to Appendix F since it closely follows that of Theorem 3.1. Using HTINF for B in Theorem 4.2 gives the following corollary.

Corollary 4.3 (FTRL Conversion). For every ε [ 1

T , 1], if B is HTINF with α = 2 and σ =

6, then Theorem 4.2 guarantees the existence of an ε-differentially private

algorithm whose worst-case expected regret at most

The upper bound in Corollary 4.3 is non-vacuous for constant or small N (i.e. N K). However, this bound is vacuous when N grows with T. To address this, we consider EXP4 which enjoys expected regret O p

KT log(N) in the non-private setting, exhibiting only a poly-logarithmic dependence on N (Auer et al., 2002). The following theorem shows that by adding independent Laplace noise to each observed loss, a similar improvement over Corollary 4.3 can be established for large N, at the cost of a worse dependence on ε.

Theorem 4.4 (Locally Private EXP4). For every ε 1,

Algorithm 6 when run with η = r

3T K 1+ 10 log2(KT )

and γ = 4ηK log(KT )

ε is ε-differentially private and suffers worst-case expected regret at most

TK log(N) log(KT)

Due to space constraints, we defer Algorithm 6, which just adds independent Laplace noise to each observed loss, to Appendix F. Note that when N K, the upper bound in Corollary 4.3 is still superior to that of Theorem 4.4 for all ranges of ε 1. The proof of Theorem 4.4 is also deferred to Appendix F.

Algorithm 6 provides a stronger privacy guarantee than what is actually necessary. Indeed, by adding independent Laplace noise to each observed loss, Algorithm 6 actually satisfies ε-local differential privacy (see Appendix A for definition). Accordingly, in contrast to Corollary 3.2, the upper bound in Theorem 4.4 is vacuous for ε 1

T . The following algorithm uses the batching technique from Section 3 to improve the dependence in ε from Theorem 4.4 while also improving the dependence on N from Corollary 4.3.

Theorem 4.5 (Private, Batched EXP4). For every ε, δ (0, 1], Algorithm 2, when run with

η = (N log( 1

δ ))1/6 log1/3(NT) log1/3(N)

T 1/3K1/2ε1/3 ,

τ = (N log( 1

δ ))1/3 log2/3(NT)T 1/3

ε2/3 log1/3(N) ,

Faster Rates for Private Adversarial Bandits

Algorithm 2 Private, Batched EXP4

1: Input: Action space [K], Number of experts N, batch size τ, privacy parameters ε, δ > 0, learning rate η, mixing parameter γ > 0 2: Initialize: r = 1, w1(j) = 1 for all j [N] 3: for t = 1, . . . , T do 4: Receive expert advice µ1 t, . . . , µN t Π([K]) 5: if t = (r 1)τ + 1 then

6: Set Pr(j) wr(j) P

j [N] wr(j) 7: end if 8: Set Qt(i) (1 γ) PN j=1 Pr(j)µj t(i) + γ

K . 9: Draw It Qt 10: Observe loss ℓt(It) and construct unbiased estimator ˆℓt(i) = ℓt(i)I{It=i}

Qt(i) 11: if t = rτ then 12: Define ℓr(j) := 1 τ Prτ s=(r 1)τ+1 ˆℓs µj s and ℓ r(j) := ℓr(j) + Zj r where

13: Update wr+1(j) wr(j) exp{ η ℓ r(j)} 14: Update r r + 1. 15: end if 16: end for

( η1/3N 1/3K2/3 log2/3(NT)

ε2/3τ 2/3 ,

δ ) log(NT)

satisfies (ε, δ)-differentially privacy and suffers worst-case expected regret at most

N 1/6K1/2T 2/3 log1/6( 1

δ ) log1/3(NT) log1/3(N) ε1/3

+ N 1/2 log( 1

δ )1/2 log(NT) log(N)

The proof of Theorem 4.5 modifies the standard proof of EXP4 to handle the noisy, batched losses. See Appendix F for the full proof. Compared to Theorem 4.3 and 4.4, Theorem 4.5 shows that Algorithm 2 enjoys sublinear regret even when N T 1/4 and ε = 1

T . Table 2 summarizes the three regimes and the corresponding algorithm that obtains the best regret bound in that regime. Our upper bounds for bandits with expert advice become vacuous when ε 1

and N T. We leave deriving non-vacuous upper bounds for this regime as an interesting direction for future work.

5. Barriers to Even Faster Rates for Private Adversarial Bandits

In this work, we provided new algorithms for the private adversarial bandit problem and its expert advice counterpart. In the adversarial bandits setting, we provided a generic conversion of a non-private bandit algorithm into a private bandit algorithm. Instantiating our conversion with existing bandit algorithms resulted in private bandit algorithms whose worst-case expected regret improve upon all existing work in all privacy regimes. In the bandits with expert advice setting, we provide, to the best of our knowledge, the first private adversarial bandit algorithms by modifying EXP4.

An important direction of future work is answering whether it is possible to achieve an additive separation in ε and T. We note that this is possible in the stochastic bandit setting (Azize & Basu, 2022) as well as the the full-information adversarial online setting (Agarwal & Singh, 2017). To this end, we conclude our paper by discussing some road blocks when attempting to derive such guarantees for the adversarial bandit setting.

5.1. On the Hardness of Privatizing EXP3

First, we comment on the difficulty of privatizing EXP3. In the full-information setting, a standard privacy analysis for exponential weights shows that for every t [T], the per-round privacy loss at time step t is at most 2η, and for η = ε

T advanced composition yields (ε, δ)-differential pri-

vacy with expected regret O( p

T log(K)/ε) (Dwork et al., 2014).

Unfortunately, it is not easy to bound the per-round privacy loss of EXP3 uniformly across time. This is because EXP3 uses Inverse-Probability-Weighted estimators (Robins et al., 1994) (see Algorithm 3). Thus the algorithm needs to know not just the arm It but also the probability Pt with which it was selected. It is, however, not clear how to account for the privacy cost of releasing Pt and indeed we can construct examples where the per-round privacy loss grows with the time horizon T. We provide a more formal analysis of this issue in Appendix G.1.

5.2. Limits of a Class of Adversarial Bandit Algorithms

In Section 3, we established upper bounds of O(

KT ε ) on the expected regret for the private adversarial bandit problem. Unfortunately, by exploiting the ability to pick arbitrary sequences of loss functions, we can show that for a large class of bandit algorithms, including EXP3 and its batched

Faster Rates for Private Adversarial Bandits

Table 2. Summary of the three regimes for bandits with expert advice and the corresponding algorithm that obtains the best regret bound in that regime.

Regime Best Alg. Guarantee Low-dimension, High Privacy (N K) Cor. 4.3 O(

High-dimension, Low Privacy (N K, ϵ K

N ) Thm. 4.4 O( KT log N log(KT )

High-dimension, High Privacy (N K, ϵ 1

T ) Thm. 4.5 O( N1/6K1/2T 2/3 log(NT )

ϵ1/3 + N1/2 log(NT )

variants, one cannot significantly improve upon this upper bound. Informally, our result holds for any (adaptively) private bandit algorithm that quickly reduces the probability of playing a sub-optimal arm. We provide the formal details below, starting with some intuition.

Consider an instance on two arms where arm 1 has loss 1 2 at each step, while arm 2 has loss 1 at each step. Any algorithm that has regret R must play arm 2 at most O(R) times on this instance. Informally, our lower bound applies to bandit algorithms that drops the probability of playing arm 2 to be about R

T within about o(T/R) steps. We note that EXP3 drops this probability to O( R

T ) in O(log T) steps. For algorithms of this kind, our lower bound shows that any ε-differentially private algorithm (for ε < 1) must incur regret O( p

T/ε). Intuitively, the lower bound follows from the fact that if the loss of arm 2 falls to 0 at step T/R (while arm 1 is unchanged at 1

2), then an ε-differentially private algorithm must pull arm 2 at least 1

ε times to notice this change. Accounting for the accumulated regret in the time it takes to pull arm 2 sufficiently many times, and setting parameters appropriately yields the lower bound.

More formally, fix K = 2 and T N. For γ [0, 1], τ {1, . . . , T} and p [T], define the sets

Eγ := n i1:T :

t=1 I{it = 2} γT o

s=τ+1 I{is = 2} p

Consider the sequence of loss functions ℓ1, . . . , ℓT , such that ℓ1:T (2) = 1 and ℓ1:T (1) = 1

2. Our assumptions on the bandit algorithms are with respect to their behavior on ℓ1, . . . , ℓT . In particular, we will consider bandit algorithms A for which there exists γ [0, 1], τ < T

2 and γτ p γ(T τ) such that:

(1) P(I1, . . . , IT Eγ) 1

(2) P(I1, . . . , IT Ep γ,τ) 1

where I1:T are the random variables denoting the actions played by A when run on the sequence of loss functions

ℓ1:T . The first condition simply lower bounds the probability that A plays action 2 by γ, when A is run on ℓ1:T . The second condition states that A drops, and subsequently maintains, the probability of playing action 2 to γ in roughly τ rounds. Accordingly, when τ is small, condition (2) states that A drops the probability of playing action 2 down to γ relatively quickly. One should really think of γ as being O RA(ℓ1:T )

T , where RA(ℓ1:T ) denotes the expected regret of A when run on ℓ1:T . Then, condition (1) is trivially satisfied, while condition (2) states that A roughly drops and keeps the probability of playing action 2 around O RA(ℓ1:T )

T by round τ, and keeps it there. In other words, after round τ, A plays action 2 on average p times in p

γ O( p T RA(T )) rounds. The latter property is reasonable for bandit algorithms given that ℓt(2) ℓt(1) = 1

2 for all t [T] , and thus any low-regret bandit algorithm should not be playing action 2 often. Lemma 5.1 provides a lower bound on the expected regret of private bandit algorithms that satisfy these two conditions. Lemma 5.1. Let ε 1 and A be any ε-differentially private algorithm. If A satisfies conditions (1) and (2) with parameters γ [0, 1

2 and 2γτ p γ(T τ), then the worst-case expected regret of A is at least

In particular, if A satisfies conditions (1) and (2) with pa-

rameters γ [0, 1

ε , and p = O( 1

ε), then the

worst-case expected regret of A is Ω q

Lemma 5.1 shows that if one wants to design an εdifferentially private algorithm (for ε 1) whose upper bound enjoys an additive separation between T and ε, then there cannot exist a γ [0, 1

2] such that it satisfies condi-

tions (1) and (2) with τ o q

ε , and p γ(T τ). Unfortunately, an implication of Lemma 5.1 is that such an additive separation is not possible for EXP3 and its batched variants. Indeed, batched EXP3 on the loss sequence ℓ1:T ,

Faster Rates for Private Adversarial Bandits

where ℓt(1) = 1/2 and ℓt(2) = 1, has regret at least γT + κ/η, where κ is the batch size. Assume, towards a contradiction that for some setting of parameters, this was o( p

T/ε). This implies that γ is o(1/

εT), and that κ/η is o( p

T/ε). EXP3 drops the probability of a bad arm to O(γ) in log(1/γ)/η steps, and the batched version will do so in κ log(1/γ)/η steps. Thus condition (2) is satisfied with τ being o( p

T/ε) and p = 1

ε (ignoring logarithmic factors). This then yields a lower bound of ω( p

T/ε) using Lemma 5.1, which contradicts the assumption. Thus the expected regret has to be Ω( p

T/ε) up to logarithmic factors. We provide the proof of Lemma 5.1 in Appendix G.

6. Future work

There are several important directions of future work. First, it is important to understand whether an additive separation between ε and T is possible under bandit feedback. Note that this is the case in the full-information setting under an oblivious adversary (Asi et al., 2023). Another important direction is improving our upper bounds for private bandits with expert advice. In particular, it would be interesting to see whether sublinear regret is possible for all N > 1 and ϵ ω( 1

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Abernethy, J. D., Lee, C., and Tewari, A. Fighting bandits with a new kind of smoothness. Advances in Neural Information Processing Systems, 28, 2015.

Agarwal, N. and Singh, K. The price of differential privacy for online learning. In International Conference on Machine Learning, pp. 32 40. PMLR, 2017.

Arora, R., Dekel, O., and Tewari, A. Online bandit learning against an adaptive adversary: from regret to policy regret. ar Xiv preprint ar Xiv:1206.6400, 2012.

Asi, H., Feldman, V., Koren, T., and Talwar, K. Private online prediction from experts: Separations and faster rates. In The Thirty Sixth Annual Conference on Learning Theory, pp. 674 699. PMLR, 2023.

Asi, H., Koren, T., Liu, D., and Talwar, K. Private online learning via lazy algorithms. ar Xiv preprint ar Xiv:2406.03620, 2024.

Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In COLT, pp. 217 226, 2009.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48 77, 2002.

Azize, A. and Basu, D. When privacy meets partial information: A refined analysis of differentially private bandits. Advances in Neural Information Processing Systems, 35: 32199 32210, 2022.

Basu, D., Dimitrakakis, C., and Tossou, A. Differential privacy for multi-armed bandits: What is it and what is its cost? ar Xiv preprint ar Xiv:1905.12298, 2019.

Bubeck, S., Cesa-Bianchi, N., et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5 (1):1 122, 2012.

Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006.

Cheng, D., Zhou, X., and Ji, B. Follow-the-perturbed-leader for adversarial bandits: Heavy tails, robustness, and privacy.

Daniely, A. and Helbertal, T. The price of bandit information in multiclass online classification. In Conference on Learning Theory, pp. 93 104. PMLR, 2013.

Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Local privacy, data processing inequalities, and statistical minimax rates. ar Xiv preprint ar Xiv:1302.3203, 2013.

Dwork, C. Differential privacy. In International colloquium on automata, languages, and programming, pp. 1 12. Springer, 2006.

Dwork, C., Naor, M., Pitassi, T., and Rothblum, G. N. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pp. 715 724, 2010a.

Dwork, C., Rothblum, G. N., and Vadhan, S. Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 51 60. IEEE, 2010b.

Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3 4):211 407, 2014.

Guha Thakurta, A. and Smith, A. (nearly) optimal algorithms for private online learning in full-information and bandit settings. Advances in Neural Information Processing Systems, 26, 2013.

Faster Rates for Private Adversarial Bandits

Honda, J., Ito, S., and Tsuchiya, T. Follow-the-perturbedleader achieves best-of-both-worlds for bandit problems. In International Conference on Algorithmic Learning Theory, pp. 726 754. PMLR, 2023.

Hu, B., Huang, Z., and Mehta, N. A. Optimal algorithms for private online learning in a stochastic environment. ar Xiv preprint ar Xiv:2102.07929, 2021.

Huang, J., Dai, Y., and Huang, L. Adaptive best-of-bothworlds algorithm for heavy-tailed multi-armed bandits. In international conference on machine learning, pp. 9173 9200. PMLR, 2022.

Jain, P. and Thakurta, A. G. (near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pp. 476 484. PMLR, 2014.

Jain, P., Kothari, P., and Thakurta, A. Differentially private online learning. In Conference on Learning Theory, pp. 24 1. JMLR Workshop and Conference Proceedings, 2012.

Kairouz, P., Oh, S., and Viswanath, P. The composition theorem for differential privacy. In International conference on machine learning, pp. 1376 1385. PMLR, 2015.

Kleinberg, R., Niculescu-Mizil, A., and Sharma, Y. Regret bounds for sleeping experts and bandits. Machine learning, 80(2):245 272, 2010.

Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Information and computation, 108(2):212 261, 1994.

Lu, Y., Xu, Z., and Tewari, A. Bandit algorithms for precision medicine. ar Xiv preprint ar Xiv:2108.04782, 2021.

Mishra, N. and Thakurta, A. (nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 592 601, 2015.

Neel, S. and Roth, A. Mitigating bias in adaptive data gathering via differential privacy. In International Conference on Machine Learning, pp. 3720 3729. PMLR, 2018.

Neu, G. and Bart ok, G. Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. Journal of Machine Learning Research, 17(154):1 21, 2016.

Raman, A., Raman, V., Subedi, U., Mehalel, I., and Tewari, A. Multiclass online learnability under bandit feedback. In International Conference on Algorithmic Learning Theory, pp. 997 1012. PMLR, 2024.

Ren, W., Zhou, X., Liu, J., and Shroff, N. B. Multi-armed bandits with local differential privacy. ar Xiv preprint ar Xiv:2007.03121, 2020.

Robins, J. M., Rotnitzky, A., and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846 866, 1994.

Sajed, T. and Sheffet, O. An optimal private stochasticmab algorithm based on optimal private stopping rule. In International Conference on Machine Learning, pp. 5579 5588. PMLR, 2019.

Seldin, Y. and Lugosi, G. A lower bound for multi-armed bandits with expert advice. In 13th European Workshop on Reinforcement Learning (EWRL), volume 2, pp. 7, 2016.

Shariff, R. and Sheffet, O. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31, 2018.

Tenenbaum, J., Kaplan, H., Mansour, Y., and Stemmer, U. Differentially private multi-armed bandits in the shuffle model. Advances in Neural Information Processing Systems, 34:24956 24967, 2021.

Tossou, A. and Dimitrakakis, C. Algorithms for differentially private multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Tossou, A. and Dimitrakakis, C. Achieving privacy in the adversarial multi-armed bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

Zheng, K., Cai, T., Huang, W., Li, Z., and Wang, L. Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33: 12300 12310, 2020.

Faster Rates for Private Adversarial Bandits

A. Privacy properties and privacy-preserving mechanisms

Definition A.1 (ε-indistinguishability). Let X and Y be random variables with support X. Let

D (X||Y ) := max S X

be the max divergence. Then X and Y are ε-indistinguishable if and only if

max{D (X||Y ), D (Y ||X)} ε.

Definition A.2 ((ε, δ)-indistinguishability). Let X and Y be random variables with support X. Let

Dδ (X||Y ) := max S X,P(X S) δ

ln P(X S) δ

be the δ-approximate max divergence. Then X and Y are (ε, δ)-indistinguishable if and only if

max{Dδ (X||Y ), Dδ (Y ||X)} ε.

The follow lemma relates the two notions of indistinguishability to differential privacy.

Lemma A.3 (Differential privacy Indistinguishability (Remark 3.2 in (Dwork et al., 2014))). Let X and Y be arbitrary sets. Let A be a randomized algorithm such that A : X n Y. Then, A is ε-differentially private if and only if for every pair of neighboring datasets x1:n and x 1:n, we have that the random variables A(x1:n) and A(x 1:n) are ε-indistinguishable. Likewise, A is (ε, δ)-differentially private if and only if for every pair of neighboring datasets x1:n and x 1:n, we have that the random variables A(x1:n) and A(x 1:n) are (ε, δ)-indistinguishable.

Next, we cover composition.

Lemma A.4 (Basic Composition (Corollary 3.15 in (Dwork et al., 2014))). Let X, Y1, Y2, . . . , YT be arbitrary sets and n N. Let A1, A2, . . . , AT be a sequence of randomized algorithms where A1 : X n Y1 and At : Y1, . . . , Yt 1, X n Yt for all t = 2, 3, . . . , T. If for every t [T] and every y1:t 1 Y1 Y2 Yt 1, we have that At(y1:t 1, ) is εt-differentially private, then the overall algorithm A : X n Y1 Y2 YT , defined as

A(x1:n) = A1(x1:n), A2(A1(x1:n), x1:n), . . . , AT (A1(x1:n), A2(A1(x1:n), x1:n), . . . , x1:n) ,

satisfies εT-differential privacy.

Lemma A.5 (Basic Composition (Corollary 3.15 in (Dwork et al., 2014))). Let X, Y1, Y2, . . . , YT be arbitrary sets and n N. Let A1, A2, . . . , AT be a sequence of randomized algorithms where A1 : X n Y1 and At : Y1, . . . , Yt 1, X n Yt for all t = 2, 3, . . . , T. If for every t [T] and every y1:t 1 Y1 Y2 Yt 1, we have that At(y1:t 1, ) is εt-differentially private, then the overall algorithm A : X n Y1 Y2 YT , defined as

A(x1:n) = A1(x1:n), A2(A1(x1:n), x1:n), . . . , AT (A1(x1:n), A2(A1(x1:n), x1:n), . . . , x1:n) ,

satisfies εT-differential privacy.

Lemma A.6 (Advanced Composition (Dwork et al., 2010b; Kairouz et al., 2015)). Let X, Y1, Y2, . . . , YT be arbitrary sets and n N. Let A1, A2, . . . , AT be a sequence of randomized algorithms where A1 : X n Y1 and At : Y1, . . . , Yt 1, X n Yt for all t = 2, 3, . . . , T. If for every t [T] and every y1:t 1 Y1 Y2 Yt 1, we have that At(y1:t 1, ) is εt-differentially private, then for every δ > 0, the overall algorithm A : X n Y1 Y2 YT , defined as

A(x1:n) = A1(x1:n), A2(A1(x1:n), x1:n), . . . , AT (A1(x1:n), A2(A1(x1:n), x1:n), . . . , x1:n) ,

satisfies (ε , δ )-differential privacy, where

t=1 ε2 t log 1

Faster Rates for Private Adversarial Bandits

Post-processing and group privacy will also be useful. Lemma A.7 (Post Processing (Proposition 2.1 in (Dwork et al., 2014))). Let X, Y, Z be arbitrary sets and n N. Let A : X n Y and B : Y Z be randomized algorithms. If A is (ε, δ)-differentially private then the composed algorithm B A : X n Z is also (ε, δ)-differentially private.

For our lower bounds in Section 5, the notion of group privacy will be useful. Lemma A.8 (Group Privacy (Theorem 2.2 in (Dwork et al., 2014))). Let X and Y be arbitrary sets and let n N. Suppose A : X n Y is an ε-differentially private algorithm. Then, for every pair of datasets x1:n, x 1:n that differ in 1 k n positions and every event E Y, we have that

P [A(x1:n) E] ekεP [A(x 1:n) E] .

For designing algorithms, the following primitive will be useful. Definition A.9 (Laplace Mechanism (Definition 3.3 in (Dwork et al., 2014))). Let X be an arbitrary set and n N. Suppose f : X n R is a query with sensitivity (i.e. for all pairs of datasets x1:n, x 1:n X n that differ in exactly one index, we have that |f(x1:n) f(x 1:n)| ). Then, for every ε, the mechanism M : X n R defined as M(x1:n) = f(x1:n) + Z, where Z Lap(

ε ), is ε-differentially private.

Lastly, as we make comparisons to local differential privacy, we define it below for the sake of completeness. Definition A.10 (Local differential privacy (Duchi et al., 2013)). Let X and Y be arbitrary sets. A randomized mechanism M : X Y is (ε, δ)-LDP, if for every x = x X and any measurable subset Y Y, we have that

P [M(x) Y ] eεP [M(x ) Y ] + δ.

When δ = 0, we say that M is ε local differentially private.

B. Helper Lemmas

Lemma B.1 (Hazard Rate of Laplace distribution). Let D denote the Laplace distribution Lap(0, λ), f and F denote its probability and cumulative density functions respectively. Define

h D(z) := f(z) 1 F(z)

to be the hazard rate function of Lap(0, λ). Then

sup z R h D(z) 1

Moreover, h D(z) is non-decreasing in z.

Proof. Recall that for λ > 0, we have

2λ exp{ |x|

( 1 2 exp{ z

λ}, if z 0 1 1

λ}, if z > 0 .

Fix x R. If x 0, then f(x) 1 F(x) =

1 2λ exp{ x

Otherwise, note that when x 0, we have

f(x) 1 F(x) =

1 2λ exp{ x

This shows that supx R h D(x) 1

λ. To see that h D(x) is non-decreasing, note that when x 0, we have that h D(x) =

1 2λ exp{ x

λ } is increasing in x and when x 0, h D(x) is constant.

Faster Rates for Private Adversarial Bandits

Lemma B.2 (Truncated Non-negativity of Noisy Losses). Let Z Lap(λ) and ℓ [0, 1]. Then, for any M 0, we have that

E [(Z + ℓ)I{|Z + ℓ| > M}] 0.

Proof. Let M 0 and ℓ [0, 1]. Then, we can write

E [(Z + ℓ)I{|Z + ℓ| > M}] = ℓ E [I{|Z + ℓ| > M}] + E [ZI{|Z + ℓ| > M}] .

Since ℓ 0, it suffices to show that E [ZI{|Z + ℓ| > M}] 0. To that end, note that

E [ZI{|Z + ℓ| > M}] = E [ZI{Z > M ℓ}] + E [ZI{Z < M ℓ}] .

Suppose that M ℓ 0. Then, since Z is symmetric random variable (around the origin), E [ZI{Z < M ℓ}] = E [ZI{Z > M + ℓ}]. Since M ℓ< M + ℓ, we have that

E [ZI{|Z + ℓ| > M}] = E [ZI{Z > M ℓ}] E [ZI{Z > M + ℓ}] 0.

Finally, suppose that M ℓ< 0. Then,

E [ZI{Z > M ℓ}] = E [ZI{0 Z > M ℓ}] + E [ZI{Z 0}] .

Using again the fact that Z is symmetric, we have that

E [ZI{0 Z > M ℓ}] = E [ZI{0 Z < ℓ M}] .

Finally, since ℓ M M + ℓ, we have that

E [ZI{|Z + ℓ| > M}] = E [ZI{Z 0}] E [ZI{0 Z < ℓ M}] E [ZI{Z > M + ℓ}] 0,

completing the proof.

Lemma B.3 (Norms of Laplace Vectors (Fact C.1 in (Agarwal & Singh, 2017))). If Z1, . . . , ZT (Lap(λ))N, then

P( t [T] : ||Zt||2 10λ2 log2(NT)) 1

C. Proof of Lemmas 3.3 and 3.4

C.1. Proof of Lemma 3.3

Note that the sequence of actions played by Algorithm 1 are completely determined by I1, . . . , I T

τ in a dataset-independent way. Thus, by post-processing it suffices to show that the actions I1, . . . , I T

τ are output in a ε-differentially private manner. Note that the distribution over the action I1 is independent of the dataset ℓ1, . . . , ℓT . Thus, it suffices to only prove privacy with respect to the actions I2, . . . , I T

τ . Consider the sequence of mechanisms M2, . . . , M T

M2 : [K] ℓ1:T R [K] is defined as

M2(i1, ℓ1:T ) = ˆℓ1(i1) + Z1, B((i1, ˆℓ1(i1) + Z1)) ,

for Z1 Lap( 1

τε) and Mj : ([K] R)j 2 [K] ℓ1:T R [K] is defined as

Mj((is, rs)j 2 s=1, ij 1, ℓ1:T ) = ˆℓj 1(ij 1) + Zj 1, B((is, rs)j 2 s=1 (ij 1, ˆℓj 1(ij 1) + Zj 1)) ,

Faster Rates for Private Adversarial Bandits

for Zj 1 Lap( 1

τε). Observe that Algorithm 1 is precisely the mechanism M : ℓ1:T ([K] R)T that adaptively composes M2, . . . , M T

τ . We will now show that M is ε-differentially private by showing that M(ℓ1:t) and M(ℓ 1:t) are

ε-indistinguishable for arbitrary neighboring datasets ℓ1:T and ℓ 1:T .

Consider two datasets ℓ1:T and ℓ 1:T that differ in exactly one position. Let t [T] be the index where the two datasets differ. Let j {1, . . . , T

τ } be the batch in where the t lies. That is, let j {1, . . . , T

τ } such that t {(j 1)τ +1, . . . , j τ}.

Fix outcomes (is, rs) T

τ 2 s=1 ([K] R) T

τ 2 and i T

For all j j , we have that the random variables Mj((is, rs)j 2 s=1, ij 1, ℓ1:T ) and Mj((is, rs)j 2 s=1, ij 1, ℓ 1:T ) are 0indistinguishable. We now show that the random variables Mj +1((is, rs)j 1 s=1 , ij , ℓ1:T ) and Mj +1((is, rs)j 1 s=1 , ij , ℓ 1:T ) are ε-indistinguishable.

Recall that Mj +1((is, rs)j 1 s=1 , ij , ℓ1:T ) = ˆℓj (ij ) + Zj , B((is, rs)j 1 s=1 (ij , ˆℓj (ij ) + Zj )) .

Note that the query ˆℓj (ij ) has sensitivity at most 1

τ . Indeed, we have that

ˆℓj (ij ) ˆℓ j (ij ) =

s=(j 1)τ+1 ℓs(ij ) ℓ s(ij )

τ |ℓt (ij ) ℓ t (ij )| 1

Thus, by Definition A.9 and post-processing, we have that Mj +1((is, rs)j 2 s=1, ij 1, ℓ1:T ) and Mj +1((is, rs)j 2 s=1, ij 1, ℓ 1:T ) are ε-indistinguishable for all inputs.

To complete the proof, we now show that for all j > j + 1, Mj((is, rs)j 2 s=1, ij 1, ℓ1:T ) and Mj((is, rs)j 2 s=1, ij 1, ℓ 1:T ) are 0-indistinguishable. Fix some j > j + 1. Recall, that

Mj((is, rs)j 2 s=1, ij 1, ℓ1:T ) = ˆℓj 1(ij 1) + Zj 1, B((is, rs)j 2 s=1 (ij 1, ˆℓj 1(ij 1) + Zj 1)) .

Since for every s {(j 1)τ + 1, . . . , jτ} we have that ℓs = ℓ s, we get that ˆℓj 1(ij 1) + Zj 1 and ˆℓ j 1(ij 1) + Zj 1 are same in distribution. The same can be said about B((is, rs)j 2 s=1 (ij 1, ˆℓj 1(ij 1) + Zj 1)) and B((is, rs)j 2 s=1 (ij 1, ˆℓ j 1(ij 1) + Zj 1)). Accordingly, Mj((is, rs)j 2 s=1, ij 1, ℓ1:T ) and Mj((is, rs)j 2 s=1, ij 1, ℓ 1:T ) are 0-indistinguishable for all inputs. Since M is the composition of M2, . . . , M T

τ , by basic composition, we have that

M(ℓ1:T ) and M(ℓ 1:T ) are ε-indistinguishable, and therefore M is ε-differentially private. This completes the proof.

C.2. Proof of Lemma 3.4

Let ℓ1, . . . , ℓT be any sequence of loss functions. Note that the bandit algorithm B is evaluated on the loss sequence ˆℓ1 + Z1, . . . , ˆℓ T

τ where ˆℓj(i) = 1

τ Pjτ s=(j 1)τ+1 ℓs(i) and Zj Lap( 1

τε). Let I1, . . . , I T

τ be the random

variables denoting the predictions of B as indicated in Line 4 in Algorithm 1. By definition of RB T

τε we get that

j=1 ˆℓj(Ij)

j=1 ˆℓj(i) RB

By definition of ˆℓs, we have that

s=(j 1)τ+1 ℓs(Ij)

s=(j 1)τ+1 ℓs(i) τ RB

Next, note that by construction, we have that for every j {1, . . . , T

τ } and s {(j 1)τ + 1, . . . , jτ}, we have that Is = Ij. Thus, we can write

Faster Rates for Private Adversarial Bandits

s=(j 1)τ+1 ℓs(Is)

s=(j 1)τ+1 ℓs(i) τ RB

which further gives

t=1 ℓt(i) τ RB

Finally, the expected regret for rounds τ T

τ + 1, . . . , T can be bounded above by τ. Thus, overall, we have that

t=1 ℓt(i) τ RB

Noting that ℓ1, . . . , ℓT was arbitrary completes the proof.

D. Proof of Corollary 3.2

The following Theorem from (Huang et al., 2022) will be useful.

Theorem D.1 (Theorem 4.1 in (Huang et al., 2022)). Let ℓ1, . . . , ℓT be any sequence of random loss functions that satisfy the following two properties: (1) for every i [K] and t [T], the random variable ℓt(i) is truncated non-negative and (2) for every i [K] and t [T], the random variable ℓt(i) is heavy-tailed with parameters α (1, 2] and σ > 0. Then, the expected regret of HTINF (Algorithm 1 in (Huang et al., 2022)) when run on ℓ1, . . . , ℓT is at most 30σK1 1

α (T + 1) 1 α .

We now make precise the definition of truncated non-negativity and heavy-tails.

Definition D.2 (Truncated Non-negativity). A random variable X is truncated non-negative if for every M 0, we have that E [X I{|X| > M}] 0.

In Appendix B, we prove that random losses of the form ℓ(i) = ℓ(i) + Zi are truncated non-negative when ℓ(i) [0, 1] and Zi Lap(λ).

Definition D.3 ((α, σ)-Heavy-tailed loss). A random loss ℓ(i) is (α, σ)-heavy tailed if E h | ℓ(i)|αi σα.

In addition, if ℓ(i) = ℓ(i) + Zi, where ℓ(i) [0, 1] and Zi Lap(λ), then ℓ(i) is (2,

2 + 4λ2)-heavy tailed. We are now ready to prove Corollary 3.2.

Proof. (of Corollary 3.2) In order to use Theorem 3.1, we need to upper bound RHTINF(T, λ). Let ℓ1, . . . , ℓT be any sequence of loss functions such that ℓt : [K] [0, 1] and let ℓ1, . . . , ℓT be such that ℓt(i) = ℓt(i) + Zt(i) where Zt(i) Lap(λ). Then, since for every t [T] and i [K], we have that ℓt(i) is truncated non-negative and (2,

2 + 4λ2)- heavy tailed, Theorem D.1 implies that

RHTINF(T, K, λ) 30 p

(2 + 4λ2)K(T + 1).

Finally, to get Corollary 3.2, we just upper bound

2 ε RHTINF(εT, K, 1) + 2

Faster Rates for Private Adversarial Bandits

Algorithm 3 EXP3 with Mixing

1: Input: Action space [K], learning rate η, mixing parameter γ > 0 2: Initialize: w1(i) = 1 for all i [K] 3: for t = 1, . . . , T do 4: Set Pt(i) = (1 γ) wt(i) P

i [K] wt(i) + γ

K 5: Draw It Pt 6: Observe loss ℓt(It) and construct unbiased estimator ˆℓt(i) = ℓt(i)I{It=i}

Pt(i) 7: Update wt+1(i) wt(i) exp{ ηˆℓt(i)} for all i [K] 8: end for

E. Additional Upper Bounds for Private Adversarial Bandits

In this section, we instantiate Theorem 3.1 with other (non-private) bandit algorithms to obtain two other regret upper bounds.

E.1. EXP3 Conversion

Corollary E.1 follows by letting B in Theorem 3.1 be the classical EXP3 algorithm (Auer et al., 2002). See Appendix D for the pseudocode of EXP3.

Corollary E.1 (EXP3 Conversion). For every ε 1, if B is EXP3 run with learning rate

log(K) 22 εKT log2(εKT)

and mixing parameter γ = 4ηK log(εKT), then Algorithm 1, when run with B and τ = 1

ε , is ε-differentially private and suffers worst-case expected regret at most

TK log(K) log(KT) ε + 4

Algorithm 3 provides the pseudocode for the version of EXP3 that we consider.

The following lemma about EXP3 will be useful when proving Corollary E.1.

Lemma E.2 ((Auer et al., 2002; Bubeck et al., 2012)). For any sequence of loss functions ℓ1, . . . , ℓT , where ℓt : [K] R, if η > 0 is such that η maxi [K] ˆℓt(i) 1 for all t [T], then EXP3 when run on ℓ1, . . . , ℓT outputs distributions P1:T Π([K])T such that

i=1 Pt(i)ℓt(i)

t=1 ℓt(i) + 2γT + log(K)

i=1 ℓt(i)2,

where ˆℓt is the unbiased estimate of the true loss ℓt that EXP3 computes in Line 6 of Algorithm 3 and the expectation is taken only with respect to the randomness of EXP3.

Proof. (of Corollary E.1) In order to use Theorem 3.1, we first need to bound REXP3(T, K, λ). Let ℓ1, . . . , ℓT be any sequence of loss functions such that ℓt : [K] [0, 1] and let ℓ1, . . . , ℓT be such that ℓt(i) = ℓt(i) + Zt(i) where Zt(i) Lap(λ). Let E be the event that there exists a t [T] such that maxi [K] |Zt(i)|2 10λ2 log2 KT. Then, Lemma B.3 shows that P [E] 1

T . Moreover, note that E [Zt(i)|Ec] = 0 for all i [K] and t [T]. Let A be the random variable denoting the internal randomness of EXP3. We need to bound

REXP3(T, K, λ) = E A,Z1:T

t=1 ℓt(EXP3( Ht)) inf i [K]

Faster Rates for Private Adversarial Bandits

We can write REXP3(T, K, λ) as

t=1 ℓt(EXP3( Ht)) inf i [K]

P(E) + E A,Z1:T

t=1 ℓt(EXP3( Ht)) inf i [K]

Since EA,Z1:T h PT t=1 ℓt(EXP3( Ht)) infi [K] PT t=1 ℓt(i) E i T, we have that

REXP3(T, K, λ) E A,Z1:T

t=1 ℓt(EXP3( Ht)) inf i [K]

We now want to use Lemma E.2 to bound EA,Z1:T h PT t=1 ℓt(EXP3( Ht)) infi [K] PT t=1 ℓt(i) Eci . Recall, that EXP3

is actually running on the noisy losses ℓ1, . . . , ℓT . So, in order to use Lemma E.2, we need to pick γ, η > 0 such that

η maxi [K] ˆ ℓt(i) 1, where we use ˆ ℓt to denote the unbiased estimate that EXP3 constructs of the true (noisy) loss ℓt. In

particular, recall that EXP3 constructs ˆ ℓt(i) = ℓ(i)I{It=i}

Pt(i) where we used Pt(i) to denote the measure that EXP3 uses to select its action It on round t [T]. Moreover, given a mixing parameter γ > 0, we have that Pt(i) γ

K . Thus, we need to pick γ and η such that

η max i [K] ˆ ℓt(i) ηK

γ max i [K] |Zt(i)| 1.

Conditioned on event Ec, we have that maxi [K] |Zt(i)| 4λ log(KT). Thus, it suffices to pick γ = 4ηλK log(KT). Then, conditioned on the event Ec and the random variables Z1, . . . , ZT , we can use Lemma E.2 to get that

i=1 Pt(i) ℓt(i)

t=1 ℓt(i) + 2γT + log(K)

i=1 ℓt(i)2.

Taking an outer expectation, then gives that

i=1 Pt(i) ℓt(i)

inf i [K] E Z1:T

+ 2γT + log(K)

η + η E Z1:T

i=1 ℓt(i)2 Ec #

Since Zt(i), conditioned on Ec, is zero-mean and Zt(i) conditioned on the history Ht is independent of Pt(i), we have that

i=1 Pt(i)ℓt(i)

t=1 ℓt(i) + 2γT + log(K)

η + η E Z1:T

i=1 ℓt(i)2 Ec #

which further gives

REXP3(T, K, λ) 2γT + log(K)

η + η E Z1:T

i=1 ℓt(i)2 Ec #

It just remains to bound EZ1:T h PT t=1 PK i=1 ℓt(i)2 Eci . Note that we can write

Faster Rates for Private Adversarial Bandits

i=1 ℓt(i)2 Ec #

t=1 max i [K] ℓt(i)2 Ec #

t=1 max i [K](ℓt(i) + Zt(i))2 Ec #

t=1 (1 + max i [K] Zt(i)2)

t=1 (1 + 10λ2 log2 KT)

= 2ηTK(1 + 10λ2 log2 KT).

Plugging this bound back in gives that

REXP3(T, K, λ) 2γT + log(K)

η + 2ηTK(1 + 10λ2 log2 KT) + 1.

Recall that we picked γ = 4ηλK log(KT). Substituting this selection gives

REXP3(T, K, λ) 8ηλKT log(KT) + log(K)

η + 2ηTK(1 + 10λ2 log2 KT) + 1.

We can then write

REXP3(T, K, λ) log(K)

η + 2ηTK(1 + 10 max{λ2, λ} log2 KT) + 1.

Picking η = q

log(K) 2T K(1+10 max{λ2,λ} log2 KT ), we get overall that

REXP3(T, K, λ) 2 q

2TK log(K)(1 + 10 max{λ2, λ} log2 KT) + 1.

Finally, Corollary E.1 follows by the fact that

2 ε REXP3(εT, K, 1) + 2

TK log(K) log(KT) ε + 4

This completes the proof.

E.2. FTPL Conversion

Corollary E.3 follows by using Follow-the-Perturbed-Leader (FTPL) with Geometric Resampling (Neu & Bart ok, 2016). The pseudocode for FTPL with Geometric Resampling is provided in Algorithm 4.

Corollary E.3 (FTPL Conversion). For every ε [ 1

T , 1], if B is FTPL with perturbation distribution Lap 1 η and

Geometric Resampling threshold M (see Algorithm 4), where M =

log(K) (εKT + 10εKT log2(εKT)), 1 M(1 + 4 log(εT))

Faster Rates for Private Adversarial Bandits

Algorithm 4 Bandit FTPL with Geometric Resampling (Neu & Bart ok, 2016)

1: Input: M, η 2: Initialize: ˆL0(i) = 0 for all i [K] 3: for t = 1, . . . , T do 4: Sample Z1, . . . , ZK i.i.d. from Lap(0, 1

5: Select action It arg mini [K](ˆLt 1(i) + Zi) 6: Observe loss ℓt(It) 7: Let Mt = 0. 8: for i = 1, 2, . . . , M do 9: Sample Z 1, . . . , Z K i.i.d. from Lap(0, 1

10: if It arg maxi [K](ˆLt 1(i) + Z i) then 11: Set Mt = i. 12: break 13: end if 14: end for 15: Define ˆℓt(i) = ℓt(i)Mt I{It = i}. 16: Update ˆLt = ˆLt 1 + ˆℓt(i). 17: end for

Algorithm 1, when run with B and τ = 1

ε , is ε-differentially private and suffers worse-case expected regret at most

KT log(K) log(KT) ε + 2

We now prove Corollary E.3. Lemma E.4 first bounds RB(T, K, λ) when B is Algorithm 4.

Lemma E.4. Let B denote Algorithm 4. Then, if M =

log(K) (KT + 10KTλ2 log2(KT)), 1 M(1 + 4λ log(T))

we have that

RB(T, K, λ) 11λ

KT log(K) log(KT) + 10

Proof. Let ℓ1, . . . , ℓT be any sequence of loss functions such that ℓt : [K] [0, 1] and let ℓ1, . . . , ℓT be such that ℓt(i) = ℓt(i) + Gt(i) where Gt(i) Lap(λ). Let E be the event that there exists a t [T] such that maxi [K] |Gt(i)|2 10λ2 log2 KT. Then, Lemma B.3 shows that P [E] 1

T . Moreover, note that E [Gt(i)|Ec] = 0 for all i [K] and t [T]. We need to bound

RB(T, K, λ) = E B,G1:T

t=1 ℓt(B( Ht)) inf i [K]

We can write RB(T, K, λ) as

t=1 ℓt(B( Ht)) inf i [K]

P(E) + E B,G1:T

t=1 ℓt(B( Ht)) inf i [K]

Since EB,G1:T h PT t=1 ℓt(B( Ht)) infi [K] PT t=1 ℓt(i) E i T, we have that

Faster Rates for Private Adversarial Bandits

RB(T, K, λ) E B,G1:T

t=1 ℓt(B( Ht)) inf i [K]

t=1 ℓt(B( Ht))

t=1 ℓt(i) + 1

Let i be the arm that minimizes PT t=1 ℓt(i). Moreover, let ˆ ℓt denote the unbiased estimate that Algorithm 4 constructs of the true (noisy) loss ℓt when run on the noisy losses ℓ1, . . . , ℓT . We start with the following regret decomposition for FTPL from (?)Lemma 3]honda2023follow.

2 E Z Lap( 1

max i [K] |Zi| + E B,G1:T

ˆ ℓt(i)(Pt(i) Pt+1(i))

where we define Pt(i) := P h It = i|ˆ ℓ1, . . . , ˆ ℓt 1 i . The first term on the right can be bounded as

2 E Z Lap( 1

max i [K] |Zi| 6 log(K)

As for the second term, Lemma 5 from (Cheng et al.) gives that

exp{ η||ˆ ℓt||1} Pt+1(i)

Pt(i) exp{η||ˆ ℓt||1}.

Accordingly, we have that

Pt(i)(1 exp{η||ˆ ℓt||1}) Pt(i) Pt+1(i) Pt(i)(1 exp{ η||ˆ ℓt||1}).

Thus, we can bound

ˆ ℓt(i)(Pt(i) Pt+1(i)) ˆ ℓt(i)Pt(i)(exp{η||ˆ ℓt||1} 1).

For η > 0 such that η||ˆ ℓt||1 1, we have that

exp{η||ˆ ℓt||1} 2η||ˆ ℓt||1 + 1.

Since ||ˆ ℓt||1 |Mt(ℓt(It) + Gt(It))| M(1 + 4η log(T)), it suffices to pick η 1 M(1+4λ log(T )). For this choice of η, we have that

ˆ ℓt(i)(Pt(i) Pt+1(i)) 2Pt(i)ηˆ ℓt(i)||ˆ ℓt||1 2Pt(i)η(ˆ ℓt(i))2.

Plugging this in gives

ˆ ℓt(i)(Pt(i) Pt+1(i))

2η E B,G1:T

i=1 Pt(i)(ˆ ℓt(i))2 Ec #

and therefore

Faster Rates for Private Adversarial Bandits

ˆ ℓt(It) ˆ ℓt(i )

η + 2η E B,G1:T

i=1 Pt(i)(ˆ ℓt(i))2 Ec #

To bound the second term on the right hand side, we have that

i=1 Pt(i)(ˆ ℓt(i))2 Ec #

i=1 Pt(i)( ℓt(i))2I{It = i}(Mt)2 Ec #

i=1 Pt(i)( ℓt(i))2I{It = i} 1 (Pt(i))2

= 2 E B,G1:T

i=1 ( ℓt(i))2I{It = i} 1 Pt(i)

i=1 (ℓt(i) + Gt(i))2 Ec #

i=1 (1 + Gt(i)2)

= 2KT + 20KTλ2 log2(KT),

where the first inequality follows from Lemma 12 in (Cheng et al.). Thus,

ˆ ℓt(It) ˆ ℓt(i )

η + 4ηKT + 40ηKTλ2 log2(KT).

Next, note that

t=1 ℓt(It) ℓt(i )

ˆ ℓt(It) ˆ ℓt(i )

t=1 ℓt(It) ˆ ℓt(It)

ˆ ℓt(i ) ℓt(i )

Thus, it suffices to upper bound the latter two terms. Starting with the third term, we have that

Faster Rates for Private Adversarial Bandits

ˆ ℓt(i ) ℓt(i )

t=1 ℓt(i )(1 (1 Pt(i ))M) ℓt(i )

t=1 ℓt(i ) ℓt(i )(1 Pt(i ))M ℓt(i )

t=1 ℓt(i )(1 Pt(i ))M Ec #

t=1 (ℓt(i ) + Gt(i))(1 Pt(i ))M Ec #

t=1 ℓt(i )(1 Pt(i ))M Ec #

where the second equality follows by Lemma 4 from (Neu & Bart ok, 2016). Now, for the second term, by Lemma 5 from (Neu & Bart ok, 2016) we have that

t=1 ℓt(It) ˆ ℓt(It)

Combining all our bounds gives

t=1 ℓt(It) ℓt(i )

η + 4η(KT + 10KTλ2 log2(KT)) + KT

KT and η = min nq

log(K) (KT +10KT λ2 log2(KT )), 1

KT (1+4λ log(T ))

o , we get that

t=1 ℓt(It) ℓt(i )

KT log(K) log(KT) + 10

Since EB,G1:T h PT t=1 ℓt(It) ℓt(i ) Eci = EB,G1:T h PT t=1 ℓt(It) ℓt(i ) Eci , we have that

RB(T, K, λ) 10λ

KT log(K) log(KT) + 10

which completes the proof.

Equipped with Lemma E.4, we are now ready to prove Corollary E.3.

Proof. (of Corollary E.3) Let B be Algorithm 4 with the hyperparameters selected according to Lemma E.4. Then, we know that

RB(T, K, λ) 11λ

KT log(K) log(KT) + 10

By Theorem 3.1, we can convert B into an ε-differentially private algorithm A such that

Faster Rates for Private Adversarial Bandits

Algorithm 5 Bandit to Bandit with Expert Advice

1: Input: Bandit algorithm B, Number of experts N, Action space [K] 2: Initialize: B with action space [N] 3: for t = 1, . . . , T do 4: Receive expert predictions µ1 t, . . . , µN t Π([K])N

5: Sample Ii t µj t for all j [N] 6: Define ℓt(j) := ℓt(Ij t ) for all j [N] 7: Receive expert Jt [N] from B 8: Play action IJt t [K] and observe loss ℓt(IJt t ) 9: Pass ℓt(Jt) to B 10: end for

ε RB(εT, K, 1) + 2

KεT log(K) log(KT) + 10

KT log(K) log(KT) ε + 2

completing the proof.

F. Proofs for Bandits with Expert Advice

The following guarantee about Multiplicative Weights (MW) will be useful when proving utility guarantees. Lemma F.1 ((Cesa-Bianchi & Lugosi, 2006; Littlestone & Warmuth, 1994)). For any sequence of loss functions ℓ1, . . . , ℓT , where ℓt : [N] R, if η > 0 is such that η maxj [N] ℓt(j) 1 for all t [T], then MW when run on ℓ1, . . . , ℓT outputs distributions P1:T Π([N])T such that

j=1 Pt(j)ℓt(j) inf j [N]

t=1 ℓt(j) + log(N)

j=1 Pt(j)ℓt(j)2.

F.1. Proof of Theorem 4.1

Proof. (of Theorem 4.1) Consider a loss sequence ℓ1, . . . , ℓT and a sequence of expert predictions µ1:N 1:T . Let j arg minj [N] PT t=1 PK i=1 µj t(i)ℓt(i) denote an optimal expert in hindsight. By definition of the bandit algorithm B, pointwise for every I1:N 1:T , we have that

t=1 ℓt(j ) + RB(T, N).

By definition of ℓt, we then have that

t=1 ℓt(IJt t )

t=1 ℓt(Ij t ) + RB(T, N).

Taking an outer expectation with respect to the randomness of I1:N 1:T , we have,

t=1 ℓt(Ij t )

i=1 µj t (i) ℓt(i)

Faster Rates for Private Adversarial Bandits

Algorithm 6 Local-DP EXP4

1: Input: Action space [K], Number of experts N, privacy parameters ε > 0, η, γ > 0 2: Initialize: w1(j) = 1 for all j [N] 3: for t = 1, . . . , T do 4: Receive expert advice µ1 t, . . . , µN t 5: Set Pt(j) = wt(j) P

j [N] wt(j)

6: Set Qt(i) = (1 γ) PN j=1 Pt(j)µj t(i) + γ

K . 7: Predict It Qt 8: Observe loss ℓt(It) and define ℓ t(i) := ℓt(i) + Zi t, where Zi t Lap(0, 1

9: Construct unbiased estimator ˆℓ t(i) = ℓ t(i)I{It=i}

Qt(i) 10: Define ℓ t(j) := µj t ˆℓ t for all j [N] 11: Update wt+1(j) wt(j) exp{ η ℓ t(j)} 12: end for

which completes the proof.

F.2. Proof of Theorem 4.2

Let B be any bandit algorithm. Then, for every τ 1. We need to show that there exists a ε-differentially private bandit with expert advice algorithm Aτ such that

RAτ (T, K, N) τ RB(T

Proof. (of Utility in Theorem 4.2). Fix ε 1 and τ 1. By Theorem 3.1, we can convert B into an ε-differentially private bandit algorithm Bτ such that

RBτ (T, K) τ RB(T

Then, using Theorem 4.1, we can convert Bτ into a bandit with expert advice algorithm Aτ such that

RAτ (T, K, N) RBτ (T, N) τ RB(T

completing the proof.

Proof. (of Privacy in Theorem 4.2) Consider the same algorithm as in the proof of the utility guarantee. That is, let Aτ be the result of using Theorem 1 to convert B to Bτ and Theorem 4.1 to convert Bτ to Aτ. By Theorem 3.1, we know that Bτ is ε-differentially private. It suffices to show that Algorithm 5, when given Eτ as input is also ε-differentially private. To that end, let ℓ1:T and ℓ 1:T be two sequences that differ at exactly one timepoint. Let µ1:N 1:T be any sequence of expert advice and fix Ii t µi t for all t [T] and i [N]. Observe that Algorithm 5 instantiates Bτ on the action space [N] and simulates Bτ on the sequence of losses ℓt(j) := ℓt(Ij t ). Let ℓ1:T and ℓ 1:T denote the two sequences of losses that Algorithm 5 simulates Bτ on when run on ℓ1:T and ℓ 1:T respectively. Note that ℓ1:T and ℓ 1:T differ at exactly one timepoint. Thus, Bτ outputs actions J1, . . . , JT in an ε-differentially private manner. Finally, by post-processing it follows that the sequence of actions IJt t output by Algorithm 5 is also ε-differentially private.

F.3. Proof of Theorem 4.4

Proof. (of Utility in Theorem 4.4) Fix ε 1. Let λ = 1

ε. Let ℓ1, . . . , ℓT be any sequence of loss functions and µ1:N 1:T be any sequence of advice vectors. Let E be the event that there exists a t [T] such that maxi [K] |Zi t|2 10λ2 log2(KT). Then, Lemma B.3 shows that P [E] 1

T . Moreover, note that E Zi t Ec = 0 for all i [K] and t [T]. Let A be the random variable denoting the internal randomness of Algorithm 6 when sampling actions It. We need to bound

Faster Rates for Private Adversarial Bandits

R(T, K, N) := E A,Z1:T

t=1 ℓt(It) inf j [N]

t=1 µj t ℓt

We can write R(T, K, N) as

t=1 ℓt(It) inf j [N]

t=1 µj t ℓt

P(E) + E A,Z1:T

t=1 ℓt(It) inf j [N]

t=1 µj t ℓt

Since EA,Z1:T h PT t=1 ℓt(It) infj [N] PT t=1 µj t ℓt E i T, we have that

R(T, K, N) E A,Z1:T

t=1 ℓt(It) inf j [N]

t=1 µj t ℓt

Accordingly, for the remainder of the proof, we will assume that event Ec has occurred, which further implies that maxt [T ] maxi [K] |Zi t| 4λ log(KT).

Algorithm 6 runs Multiplicative Weights using the noisy losses ℓ 1, . . . , ℓ T . For γ = 4ηKλ log(KT), we have that

η max t [T ] max j [N] ℓ t(j) = η max t [T ] max j [N] µj t ˆℓ t = η max t [T ] max j [N] µj t(It)(ℓt(It) + ZIt t ) Qt(It) ηK

γ (4λ log(KT)) 1.

Accordingly, for this choice of γ, Lemma F.1 implies that

j=1 Pt(j) ℓ t(j) inf j [N]

t=1 ℓ t(j) + log(N)

j=1 Pt(j) ℓ t(j)2.

Taking expectation of both sides, we have that

j=1 Pt(j) ℓ t(j)

inf j [N] E Z1:T

η + η E A,Z1:T

j=1 Pt(j) ℓ t(j)2

We now analyze each of the three terms with expectations separately. First,

j=1 Pr(j) ℓ t(j)

i=1 ˆℓ t(i)µj t(i)

j=1 Pt(j)µj t(i)

= 1 (1 γ) E A,Z1:T

i=1 Qt(i)ˆℓ t(i)

γ K(1 γ) E A,Z1:T

i=1 ˆℓ t(i)

Faster Rates for Private Adversarial Bandits

j=1 Pt(j) ℓ t(j)2

j=1 Pt(j)(µj t ˆℓ t)2

i=1 ˆℓ t(i)2µj t(i)

j=1 Pt(j)µj t(i)

ˆℓ t(i)2 Ec #

1 (1 γ) E A,Z1:T

i=1 Qt(i)ˆℓ t(i)2 Ec #

inf j [N] E Z1:T

= inf j [N] E Z1:T

t=1 ˆℓ t µj t

= inf j [N] E Z1:T

t=1 ℓ t µj t

= inf j [N]

t=1 ℓt µj t,

where the second equality follows by the unbiasedness of ˆℓ t and the last by the fact that Zi t is zero-mean (conditioned on Ec). Putting all the bounds together, we get that

1 (1 γ) E A,Z1:T

i=1 Qt(i)ˆℓ t(i)

t=1 ℓt µj t + log(N)

η + γ K(1 γ) E Z1:T

i=1 ˆℓ t(i)

+ η (1 γ) E A,Z1:T

i=1 Qt(i)ˆℓ t(i)2 Ec #

Multiplying both sides by (1 γ), we have that EA,Z1:T h PT t=1 PK i=1 Qt(i)ˆℓ t(i) Eci is at most

(1 γ) inf j [N]

t=1 ℓt µj t + (1 γ) log(N)

i=1 ˆℓ t(i)

+ η E A,Z1:T

i=1 Qt(i)ˆℓ t(i)2 Ec #

which implies that

i=1 Qt(i)ˆℓ t(i)

t=1 ℓt µj t + log(N)

η + γT + η E A,Z1:T

i=1 Qt(i)ˆℓ t(i)2 Ec #

Using the fact that ˆℓ t is an unbiased estimator of ℓ t gives that

Faster Rates for Private Adversarial Bandits

i=1 Qt(i)ℓ t(i)

t=1 ℓt µj t + log(N)

η + γT + η E Z1:T

i=1 ℓ t(i)2 Ec #

Since Zi t is zero-mean (conditioned on Ec) and independent of Qt(i), we get that,

i=1 Qt(i)ℓt(i)

t=1 ℓt µj t + log(N)

η + γT + η E Z1:T

i=1 ℓ t(i)2 Ec #

It suffices to bound the expectation on the right-hand side. To that end, observe that

i=1 ℓ t(i)2 Ec #

i=1 (ℓt(i) + Zi t)2 Ec #

i=1 (ℓt(i)2 + (Zi t)2)

i=1 (1 + (Zi t)2)

2KT(1 + 10λ2 log2 KT)

Thus, overall we have that

i=1 Qt(i)ℓt(i)

t=1 ℓt µj t + log(N)

η + γT + 2ηKT(1 + 10λ2 log2 KT).

Plugging in our choice of γ = 4ηKλ log(KT),

i=1 Qt(i)ℓt(i)

t=1 ℓt µj t + log(N)

η + 4ηKTλ log(KT) + 2ηKT(1 + 10λ2 log2 KT).

which for λ 1 gives

i=1 Qt(i)ℓt(i)

t=1 ℓt µj t + log(N)

η + 3ηKT(1 + 10λ2 log2 KT).

Picking η = q

log(N) 3T K(1+10λ2 log2 KT ), we have

i=1 Qt(i)ℓt(i)

t=1 ℓt µj t + 16 p

TK log(N)λ log(KT).

For our choice λ = 1

i=1 Qt(i)ℓt(i)

t=1 ℓt µj t + 16 p

TK log(N) log(KT)

Faster Rates for Private Adversarial Bandits

Finally, noting that

R(T, K, N) E A,Z1:T

i=1 Qt(i)ℓt(i)

t=1 µj t ℓt + 1

completes the proof.

The proof of privacy in Theorem 4.4 is identical to the proof of Lemma 3.3 after taking batch size τ = 1, so we omit the details here.

F.4. Proof of Theorem 4.5

Proof. (of Utility in Theorem 4.5) Fix ε, δ (0, 1] and batch size τ. Let λ = 3K

δ ) γτε . Let ℓ1, . . . , ℓT be any sequence of loss functions and µ1:N 1:T be any sequence of advice vectors. Let E be the event that there exists a r {1, . . . , T

τ } such that maxj [N] |Zj r|2 10λ2 log2(N T

τ ). Then, Lemma B.3 shows that P [E] τ

T . Moreover, note that E Zj r Ec = 0 for all j [N] and r [ T

τ ]. Let A be the random variable denoting the internal randomness of Algorithm 6 when sampling actions It. We need to bound

R(T, K, N) := E A,Z1:T

t=1 ℓt(It) inf j [N]

t=1 µj t ℓt

We can write R(T, K, N) as

t=1 ℓt(It) inf j [N]

t=1 µj t ℓt

P(E) + E A,Z1:T

t=1 ℓt(It) inf j [N]

t=1 µj t ℓt

Since EA,Z1:T h PT t=1 ℓt(It) infj [N] PT t=1 µj t ℓt E i T, we have that

R(T, K, N) E A,Z1:T

t=1 ℓt(It) inf j [N]

t=1 µj t ℓt

Accordingly, for the remainder of the proof, we will assume that event Ec has occurred, which further implies that maxr [ T

τ ] maxj [N] |Zj r| 4λ log(N T

Algorithm 2 runs Multiplicative Weights using the noisy, batched losses ℓ 1, . . . , ℓ T

τ . For γ 12ηK

δ ) log(NT ) ετ , we

τ ] max j [N] η( ℓ r(j)) max r [ T

τ ] max j [N] η( ℓr(j) + Zj r) max r [ T

τ ] max j [N] ηZj r η 12K q

Accordingly, for any choice γ 12ηK

δ ) log(NT ) ετ , Lemma F.1 implies that

j=1 Pr(j) ℓ r(j) inf j [N]

r=1 ℓ r(j) + log(N)

j=1 Pr(j) ℓ r(j)2.

Taking expectation of both sides, we have that

Faster Rates for Private Adversarial Bandits

j=1 Pr(j) ℓ r(j)

inf j [N] E Z1:T

η + η E A,Z1:T

j=1 Pr(j) ℓ r(j)2

Using the fact that Zj r is zero-mean and conditionally independent of Pr given the history of the game up to and including time point (r 1)τ, we have that

j=1 Pr(j) ℓr(j)

inf j [N] E Z1:T

η + η E A,Z1:T

j=1 Pr(j) ℓ r(j)2

We now analyze each of the three terms with expectations separately. First,

j=1 Pr(j) ℓr(j)

i=1 ˆℓs(i)µj s(i)

j=1 Pr(j)µj s(i)

1 τ(1 γ) E A,Z1:T

i=1 Qt(i)ˆℓt(i)

j=1 Pr(j) ℓ r(j)2

j=1 Pr(j)( ℓr(j) + Zj r)2

j=1 Pr(j)( ℓr(j)2 + (Zj r)2)

j=1 Pr(j)( ℓr(j)2 + 10λ2 log2(N T

j=1 Pr(j) ℓr(j)2

λ2 log2(N T

To bound the first of the two terms above, note that:

Faster Rates for Private Adversarial Bandits

j=1 Pr(j) ℓr(j)2

j=1 Pr(j) 1

ˆℓs µj s 2 Ec

i=1 ˆℓ2 s(i)µj s(i)

j=1 Pr(j)µj s(i)

1 τ(1 γ) E A,Z1:T

i=1 Qt(i)ˆℓ2 t(i)

inf j [N] E Z1:T

τ inf j [N] E Z1:T

τ inf j [N] E Z1:T

t=1 ˆℓt µj t

τ inf j [N]

t=1 ℓt µj t,

where the last equality follows by the unbiasedness of ˆℓt. Putting all the bounds together, we get that

Faster Rates for Private Adversarial Bandits

1 τ(1 γ) E A,Z1:T

i=1 Qt(i)ˆℓt(i)

τ inf j [N]

t=1 ℓt µj t + log(N)

η + γ (1 γ)

+ η τ(1 γ) E A,Z1:T

i=1 Qt(i)ˆℓ2 t(i)

λ2 log2(N T

Multiplying both sides by τ(1 γ), gives

i=1 Qt(i)ˆℓt(i)

(1 γ) inf j [N]

t=1 ℓt µj t + τ(1 γ) log(N)

+ η E A,Z1:T

i=1 Qt(i)ˆℓ2 t(i)

+ 10η(1 γ)τ T

λ2 log2(N T

which implies that

i=1 Qt(i)ˆℓt(i)

t=1 ℓt µj t + τ log(N)

+ η E A,Z1:T

i=1 Qt(i)ˆℓ2 t(i)

+ 10ηTλ2 log2(N T

Using the fact that ˆℓt is an unbiased estimator of ℓt gives that

i=1 Qt(i)ℓt(i)

t=1 ℓt µj t+τ log(N)

η +γT+η E A,Z1:T

i=1 ℓ2 t(i)

+10ηTλ2 log2(N T

By the boundedness of the loss, we have

i=1 Qt(i)ℓt(i)

t=1 ℓt µj t + τ log(N)

η + γT + ηKτ T

+ 10ηTλ2 log2(N T

Faster Rates for Private Adversarial Bandits

Bounding the regret in the last τ rounds by τ, gives

i=1 Qt(i)ℓt(i)

t=1 ℓt µj t + τ log(N)

η + γT + ηTK + 10ηTλ2 log2(NT) + τ

t=1 ℓt µj t + τ log(N)

η + γT + ηTK + 90ηTNK2 log( 1

δ ) log2(NT) ε2γ2τ 2 + τ

Using Equation 1, then gives that

R(T, K, N) τ log(N)

η + γT + ηTK + 90ηTNK2 log( 1

δ ) log2(NT) ε2γ2τ 2 + 2τ

Since η < 1, we trivially have that

R(T, K, N) 3τ log(N)

η + γT + ηTK + 90ηTNK2 log( 1

δ ) log2(NT) ε2γ2τ 2 .

Now, choosing γ = max η1/3N 1/3K2/3 log2/3(NT )

ε2/3τ 2/3 , 12ηK

δ ) log(NT ) ετ

R(T, K, N) 3τ log(N)

η1/3(N log( 1

δ ))1/3K2/3 log2/3(NT)

ε2/3τ 2/3 , ηK q

δ )) log(NT)

Choosing η = (N log( 1

δ ))1/6 log1/3(NT ) log1/3(N)

T 1/3K1/2ε1/3 and τ = (N log( 1

δ ))1/3 log2/3(NT )T 1/3

ε2/3 log1/3(N) gives

R(T, K, N) 95(N log( 1

δ ))1/6K1/2 log1/3(NT) log1/3(N)T 2/3

+ (95N log( 1

δ ))1/3K1/2 log2/3(NT) log2/3(N)T 1/3

100N 1/6K1/2T 2/3 log1/6( 1

δ ) log1/3(NT) log1/3(N) ε1/3

+ N 1/2 log( 1

δ )1/2 log(NT) log(N)

which completes the proof.

Proof. (of Privacy in Theorem 4.5) Fix ε, δ (0, 1]. Note that the sequence of actions played by Algorithm 2 are completely determined by the noisy loss vectors ℓ 1, . . . , ℓ T

τ . Thus, by post-processing it suffices to show that these vectors are output in a ε-differentially private manner. From this perspective, Algorithm 2 can be viewed as the adaptive composition M of the sequence of mechanisms M1, . . . , M T

τ , where M1 : ([K] Π([K]))τ ℓ1:T RN is defined as

M1(I1:τ, µ1:τ 1:T , ℓ1:T ) = ( ℓ 1(1), . . . , ℓ 1(N))

Faster Rates for Private Adversarial Bandits

for ℓ 1(j) defined as in Line 10 of Algorithm 2. Likewise, for s {2, . . . , T

τ }, define Ms : (RN)s 1 (Π([K]) [K])τ ℓ1:T RN such that

Ms( ℓ 1:s 1, µ1:N (s 1)τ+1:sτ, I(s 1)τ+1:sτ, ℓ1:T ) = ( ℓ sτ(1), . . . , ℓ sτ(N)).

We will prove that M(ℓ1:T ) and M(ℓ 1:T ) are (ϵ, δ)-indistinguishable. To do so, fix two neighboring data sets ℓ1:T and ℓ 1:T . Let t be the index where the two datasets differ. Let r {1, . . . , T

τ } be the batch in where t lies. Fix a sequence of

outcomes x1: T

τ , µ1:N 1:T Π([K])NT , and I1:T [K]T .

For all r < r , we have that the random variables Mr(x1:r 1, µ1:N (r 1)τ+1:rτ, I(r 1)τ+1:rτ, ℓ1:T ) and Mr(x1:r 1, µ1:N (r 1)τ+1:rτ, I(r 1)τ+1:rτ, ℓ 1:T ) are 0-indistinguishable. We now show that Mr (x1:r 1, µ1:N (r 1)τ+1:r τ, I(r 1)τ+1:r τ, ℓ1:T ) and Mr (x1:r 1, µ1:N (r 1)τ+1:r τ, I(r 1)τ+1:r τ, ℓ 1:T ) are (ε, δ)- indistinguishable. On input x1:r 1, µ1:N (r 1)τ+1:r τ, and I(r 1)τ+1:r τ, the mechanism Mr ( , ℓ1:T ) computes

ℓ r τ(j) = ℓr τ(j) + Zj r , where Zj r Lap 0, 3K

ℓr τ(j) = 1

µj m(i)ℓm(i)I{Im = i}

for every j [N]. Note that x1:r 1 complete determines Qm(i). Moreover, the global sensitivity of ℓr τ(j), with respect to neighboring datasets, is at most K

γτ since Qt(i) γ

K for all t [T]. Accordingly, by the Laplace Mechanism and advanced composition, we have that the outputs of Mr ( , ℓ1:T ) and Mr ( , ℓ 1:T ) are (ε, δ)-indistinguishable.

To complete the proof, it suffices to show that for all r r +1, we have that the outputs of, Mr( , ℓ1:T ) and Mr( , ℓ 1:T ) are 0indistinguishable. However, this follows from the fact that for every r r +1, we have that ℓ(r 1)τ+1:rτ+1 = ℓ (r 1)τ+1:rτ and that mechanism Mr does not access the true data ℓ1:(r 1)τ, but only the privatized, published outputs of the previous mechanisms M1, . . . , Mr 1. Thus, by advanced composition, we have that the entire mechanism M is (ε, δ)-differentially private.

G. Barriers to Private Adversarial Bandits

G.1. Privacy leakage in EXP3

To better understand its per-round privacy loss, it is helpful to view EXP3 as the adaptive composition of T 1 mechanisms M2, . . . , MT where Mt : [K]t 1 ℓ1:T [K]. For every t {2, . . . , T}, the mechanism Mt, given as input the previously selected actions I1, . . . , It 1 and the dataset ℓ1:T , computes the distribution

Pt(i) = (1 γ) wt(i) PK j=1 wt(j) + γ

where wt(j) = exp{ η Pt 1 s=1 ˆℓs(j)} and ˆℓs(j) = ℓs(j)I{Is=j}

Ps(j) . Then, Mt samples an action It Pt. The mechanism Mt is εt-differentially private if for any pair of neighboring data sets ℓ1:T and ℓ 1:T , we have that

sup I1,...,It 1 [K]t 1 sup i [K]

P[Mt(I1:t 1, ℓ1:T ) = i] P[Mt(I1:t 1, ℓ 1:T ) = i] eεt.

Now, consider two neighboring datasets ℓ1:T and ℓ 1:T that differ at the first time point t = 1. Let P1, . . . , PT denote the sequence of probabilities output by the mechanisms when run on ℓ1:T and let P 1, . . . , P T denote the same for ℓ 1:T . Since ℓ1 = ℓ 1, we have that ˆℓ1 = ˆℓ 1. Accordingly, P2 = P 2. The key insight now is that because P2 = P 2, we have that ˆℓ2 = ˆℓ 2, and so P3 = P 3. Continuing this process gives that Pt = P t and ˆℓt = ˆℓ t for all t 2. Unfortunately, this difference in probabilities can cause the privacy loss to grow with t. To get some intuition, fix some t 2 and sequence I1, . . . , It 1 [K]t 1. Consider the ratio

Faster Rates for Private Adversarial Bandits

Figure 1. Probabilities on action 2 assigned by EXP3 when run with γ = η = 0.0001, and T = 100 1

η on datasets ℓ1:T and ℓ 1:T .

Pt(i) P t(i) sup i [K]

wt(i) w t(i)

PK j=1 w t(j) PK j=1 wt(j) sup i [K]

wt(i) w t(i).

Observe that

wt(i) w t(i) = sup i [K] exp n η

s=1 ˆℓ s(i) ˆℓs(i) o

= sup i [K] exp n η (ℓ 1(i) ℓ1(i))I{I1 = i}

s=2 ℓs(i)I{Is = i}( 1 P s(i) 1 Ps(i)) o .

Since P s(i) = Ps(i) for every s t 1, we can pick two neighboring sequences of losses and a sequence of actions I1, . . . , IT such that supi [K] ws(i) w s(i) grows very quickly with s. For example, the following choices for neighboring datasets and sequences of actions will do. Let K = 2 and pick ℓ1:T such that ℓ1(1) = 1, ℓ1(2) = 0, and ℓt(1) = ℓt(2) = 1 for all t {2, . . . , T}. Pick neighboring dataset ℓ 1:T such that ℓ t(1) = ℓ t(2) = 1, for all t [T]. Finally, consider the sequence of actions I1, . . . , IT such that It = 2 if t is odd and It = 1 if t is even. That is, the sequence of actions I1, . . . , IT alternates between 2 and 1, starting with action 2.

We claim that for this choice of neighboring datasets and sequences of actions, Pt(2) and P t(2) diverge rapidly with Pt(2) approaching 1 γ

2 and P t(2) approaching γ

2 . To see why, fix t 2 and suppose that It = 1. Then, by definition, for the loss sequence ℓ1:T , we have that wt+2(1) = wt+1(1) = wt(1) exp η Pt(1) and wt+2(2) = wt(2) exp η 1 Pt+1(1). Since It = 1, we know that Pt+1(1) < Pt(1) and therefore wt+2(2) > wt(2) exp η 1 Pt(1). Now, if Pt(1) < 1

2, then wt+2(1)/wt+2(2) < wt(1)/wt(2) < 1. Accordingly, Pt+2(1) < 1/2, and repeating the analysis would eventually show that wt(1)/wt(2) 0, implying that Pt(2) 1 γ

2 . A symmetric argument shows that if Pt(1) > 1/2, then wt(2)/wt(1) 0, and therefore Pt(2) γ

2 . Since the two loss sequences ℓ1:T and ℓ 1:T are identical after time point t = 2, an identical argument holds for ℓ 1:T . To complete the proof sketch, note that for loss sequence ℓ1:T , I4 = 1 and P4(1) < 1

2, thus Pt(2) 1 γ

2 . On the other hand, for loss sequence ℓ 1:T , I2 = 1 and P 2(1) > 1

2, giving that P t(2) γ

We also verify this claim empirically in Figure 1, which gives a better sense of the rate of divergence between Pt(2) and P t(2). The code generating the figure above is provided below.

Faster Rates for Private Adversarial Bandits

import numpy as np import matplotlib.pyplot as plt

eta = 0.0001 T = 100 * int(1/eta) gamma = eta

# Execute EXP3 on loss sequence l_1, \dots, l_T w_1 = 1 w_2 = 1 P_2 = 0 P_2_hist = []

for t in range(T): Q_2 = (w_2/(w_2 + w_1)) #unmixed prob. P_2 = (1-gamma) * Q_2 + gamma/2 #mixed prob. P_2_hist.append(P_2) if t == 0: w_2 = w_2 * np.exp(0*eta/(P_2)) elif t % 2 == 0: w_2 = w_2 * np.exp(-1*eta/(P_2)) #pull action 2 in even rounds else: w_1 = w_1 * np.exp(-1*eta/((1-P_2))) #pull action 1 in odd rounds

plt.plot(P_2_hist, label= "P_t(2)")

# Execute EXP3 on loss sequence l'_1, \dots, l'_T w_1 = 1 w_2 = 1 P_2 = 0 P_2_hist = []

for t in range(T): Q_2 = (w_2/(w_2 + w_1)) P_2 = (1-gamma) * Q_2 + gamma/2 P_2_hist.append(P_2) if t % 2 == 0: w_2 = w_2 * np.exp(-1*eta/(P_2)) #pull action 2 in even rounds else: w_1 = w_1 * np.exp(-1*eta/((1-P_2))) #pull action 1 in odd rounds

plt.plot(P_2_hist, label= "P'_t(2)")

plt.xlabel("t") plt.legend() plt.show()

We note that the authors of (Tossou & Dimitrakakis, 2017) acknowledge that this issue was overlooked when stating Theorem 3.3 in (Tossou & Dimitrakakis, 2017). Therefore, we are unable to verify the Theorem 3.3. Unfortunately, (Tossou & Dimitrakakis, 2017) use Theorem 3.3 in the proof of Corollary 3.3, which claims to give a private adversarial bandit

algorithm with expected regret O T 2/3

K ln(K) ε1/3

, ignoring log factors in 1

δ. Thus, we are unable to verify whether

Corollary 3.3 is correct.

G.2. Proof of Lemma 5.1

Proof. (of Lemma 5.1) Let A be any ε-differentially private algorithm (for ε 1) that satisfies condition (1) and (2) with parameters γ [0, 1

2 and 2γτ p γ(T τ). Consider the alternate sequence of loss functions ℓ 1, . . . , ℓ T such that ℓ 1:τ = ℓ1:τ but ℓ τ+1:T is such that ℓ t(2) = 0 and ℓ t(1) = 1

2 for all t {τ + 1, . . . , T}.

Faster Rates for Private Adversarial Bandits

It suffices to show that P(I 1, . . . , I T / Ep γ,τ) eεp P(I1, . . . , IT / Ep γ,τ) 1

where I1:T and I 1:T are the random variables denoting the selected actions of A when run on ℓ1:T and ℓ 1:T respectively. Indeed, when I 1, . . . , I T Ep γ,τ, we have that the regret of A when run on ℓ 1:T is at least p 2γ p

2. On the other hand, if I1:T Eγ, we have that the regret of A on ℓ1:T is at least γ

2 T. So with probability 1

2, the regret of A on ℓ1:T is γ

2 T and with probability at least 1 1

2eεp, the regret of A on ℓ 1:T is at least p 2γ p

2, where the inequality follows from the fact that γ 1

2. Therefore, the worst-case expected regret is at least

To prove Equation 2, recall that we may write any randomized algorithm A as a deterministic function of an input x and an infinite sequence of bits b1, b2, . . . generated uniformly at random. From this perspective, we can think of a randomized bandit algorithm A as a deterministic mapping from a sequence of losses ℓ1:T and an infinite sequence of bits b {0, 1}N to a sequence of T actions. That is,

A : {0, 1}N [0, 1]K T [K]T .

Using this perspective, Equation 2 is equivalent to showing that:

P b {0,1}N (A(b, ℓ 1:T ) / Ep γ,τ) eεp Pb {0,1}N(A(b, ℓ1:T ) / Ep γ,τ).

Consider the following sequence of losses parameterized by S {τ + 1, . . . , T}, |S| p:

1/2, if i = 1 0, if i = 2 and t S 1, i = 2 and t / S

Let L := {ℓS 1:T : S {τ + 1, . . . , T}, S p } be the collection of all such sequences of loss functions. Note that every ℓS 1:T L differs from ℓ1:T only at time points t S. Thus, by group privacy (see Lemma A.8), we have that

sup ℓS 1:T L P b {0,1}N (A(b, ℓS 1:T ) / Ep γ,τ) eεp P b {0,1}N (A(b, ℓ1:T ) / Ep γ,τ).

Now, fix the sequence of random bits b {0, 1}N. Let i 1:T = A(b, ℓ 1:T ). Define S := {t τ + 1 : i t = 2} and

S p be the first p such time points. Let i S p 1:T = A(b, ℓ S p 1:T ). Let t = max{t τ + 1 : Pt s=τ+1 I{i s = 2} p} and

t S p = max{t τ + 1 : Pt s=τ+1 I{i S p s = 2} p}. Because bandit algorithms only observe the losses of the selected

action, we have that t = t S p. In addition, we have that i 1:T Ep γ,τ if and only if t τ + p

γ , and likewise for i S p 1:T . Therefore,

I{i 1:T Ep γ,τ} = I{i S p 1:T Ep γ,τ}

and therefore

I{i 1:T / Ep γ,τ} = I{i S p 1:T / Ep γ,τ}.

Taking expectation on both sides with respect to b {0, 1}N, gives that

Faster Rates for Private Adversarial Bandits

Pb {0,1}N(A(b, ℓ 1:T ) / Ep γ,τ) = Pb {0,1}N(A(b, ℓ S p 1:T ) / Ep γ,τ)

sup ℓS 1:T L P b {0,1}N (A(b, ℓS 1:T ) / Ep γ,τ)

eεp P b {0,1}N (A(b, ℓ1:T ) / Ep γ,τ),

completing the proof.