# On Private and Robust Bandits

Yulian Wu (KAUST, yulian.wu@kaust.edu.sa) · Xingyu Zhou (Wayne State University, xingyu.zhou@wayne.edu) · Youming Tao (Shandong University, ym.tao99@mail.sdu.edu.cn) · Di Wang (KAUST, di.wang@kaust.edu.sa)

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

**Abstract.** We study private and robust multi-armed bandits (MABs), where the agent receives Huber-contaminated heavy-tailed rewards and meanwhile needs to ensure differential privacy. We consider both the finite $k$-th raw moment and the finite $k$-th central moment settings for heavy-tailed reward distributions with $k \ge 2$. We first present the minimax lower bound, characterizing the information-theoretic limit of regret with respect to privacy budget, contamination level, and heavy-tailedness. Then, we propose a meta-algorithm that builds on a private and robust mean estimation sub-routine PRM that essentially relies on reward truncation and the Laplace mechanism. For the above two different heavy-tailed settings, we give corresponding schemes of PRM, which enable us to achieve nearly optimal regrets. Moreover, our two proposed truncation-based and histogram-based PRM schemes achieve the optimal trade-off between estimation accuracy, privacy and robustness. Finally, we support our theoretical results and show the effectiveness of our algorithms with experimental studies.

## 1 Introduction

The multi-armed bandit (MAB) [1] problem provides a fundamental framework for sequential decision-making under uncertainty with bandit feedback, and has found a wide range of applications in medicine [2], finance [3, 4], recommendation systems [5], and online advertising [6], to name a few. Consider portfolio selection in finance as an example. At each decision round $t \in [T]$, the learning agent selects an action $a_t \in [K]$ (i.e., a choice of assets for user $t$) and receives a reward $r_t$ (e.g., the corresponding payoff) that is drawn i.i.d. from an unknown probability distribution associated with the portfolio choice. The goal is to learn to maximize the cumulative payoff.

In practice, applying the celebrated MAB formulation to real-life applications (e.g., the above finance example) requires dealing with both robustness and privacy issues. On the one hand, it is known that financial data are often heavy-tailed (rather than sub-Gaussian) [7, 8]. Moreover, the received payoff data in finance often contain outliers [9] due to data contamination. On the other hand, privacy concerns in finance are growing [10–12]. For instance, even if an adversary does not have direct access to the dataset, it may still be able to reconstruct other customers' personal information by interacting with the pricing platform and observing its decisions [13].

Motivated by this, one line of work on MABs has focused on designing robust algorithms with respect to heavy-tailed rewards [14], adversarial contamination [15, 16], or both [17]. Another line of recent work has studied privacy protection in MABs via different trust models of differential privacy (DP) [18], such as central DP [19, 20], local DP [21, 22] and distributed DP [23, 24]. Moreover, there have also been recent advances in understanding the close relationship between robustness and privacy for the mean estimation problem (e.g., robustness induces privacy [25, 26] and vice versa [27]). In light of this, the fundamental question we ask in this paper is:

*Is there a simple algorithm that can tackle privacy and robustness in MABs simultaneously?*
Our contributions. We give an affirmative answer to it by showing that a unified truncation-based algorithm could achieve a nearly optimal trade-off between regret, privacy, and robustness for MABs under two different heavy-tailed settings. The key intuition is that reward truncation not only helps to reduce outliers (due to both heavy tails and contamination), but also bound its sensitivity (useful for DP guarantees). To make our intuition rigorous, we take the following principled approaches. (i) We first establish the minimax regret lower bound for private and robust MABs, i.e., heavy-tailed MABs with both privacy constraints and Huber s contamination (see section 4) where the reward distribution of each action has finite k-th moment for k 2. This characterizes the informationtheoretic limit of regret with respect to privacy budget, contamination level and heavy-tailedness. (ii) To match the lower bound, we first propose a meta-algorithm (see section 5), which builds upon the idea of batched successive elimination and relies on a generic private and robust mean estimation sub-routine denoted by PRM. Then, for two different settings of (heavy-tailed) reward distributions (i.e., finite raw or central moments), we propose corresponding schemes for the sub-routine PRM, both of which require truncation and the Laplace mechanism to guarantee robustness and privacy, simultaneously. Armed with these, our meta-algorithm can enjoy nearly matching regret upper bounds (see section 6). Experimental studies also corroborate our theoretical results. (iii) Along the way, several results could be of independent interest. In particular, our proposed PRM shows that reward truncation is sufficient to help achieve the optimal high-probability concentration for private and robust mean estimation in the one-dimension case. Moreover, without contamination, our regret upper bounds not only match the optimal one for private heavy-tailed MABs with finite raw moments, but also provide the first results for the case with finite central moments, hence a complete study for private bandits as well. Due to space limit, experiments (section A in Appendix), technical lemmas, and all proofs are relegated to Appendix. 2 Related Work Robust MABs. The studies on robust bandits can be largely categorized into two groups. The first group of work mainly focuses on the setting where the total contamination is bounded, i.e., the cumulative difference between observed reward and true reward is bounded by some constant [15]. The second group considers Huber s -contamination model [28] (which is also the focus of our paper) or a similar -fraction model. In these cases, the reward for each round can be contaminated by an arbitrary distribution with probability 2 [0, 1/2) [29, 16, 30], or at most -fraction of the rewards are arbitrarily contaminated [31]. The existing work in this group has mainly focused on the light-tailed setting where the true inlier distribution is Gaussian or sub-Gaussian and uses a robust median or trimmed-mean estimator. A very recent work [17] studies the (non-private) setting where the inlier distribution only has finite variance and uses Huber s estimator to establish problemdependent bounds. In contrast, we take the perspective of minimax regret, i.e., problem-independent bounds, and also account for privacy protection. Private MABs. 
In addition to the above mentioned results on private MABs with light-tailed rewards, [22] studies private heavy-tailed MABs with finite raw moments under both central and local models of DP. However, how to design private algorithms for heavy-tailed distributions with finite central moments is left as an important problem. In this paper, we resolve the problem as a byproduct of our main results. Robust and private mean estimation. Our work is also related to robust and private mean estimation, especially the one-dimensional case. On the robustness side with Huber s model, a high-probability concentration bound for the median of Gaussian (hence the mean by symmetry) is first established in [32]. Recently, [30] gives a high probability mean concentration via a trimmed-mean estimator for general sub-Gaussian inlier distributions while [33] focuses on the heavy-tailed setting. We refer the reader to the survey paper [34] on high-dimensional robust statistics focusing on robust mean estimation. On the robustness side with heavy-tail, we refer the reader to the survey paper [35] for mean estimation and regression under heavy-tailed distributions. Besides median-of-means techniques and trimmed mean we mentioned above to handle heavy-tailed data, Catoni s estimator is a very different estimator for heavy tail [36, 37] and also used in bandits [14]. On the privacy side, one close work is [38], which presents the first high-probability mean concentration for private heavy-tailed distributions with finite central moments (via a medians-of-means approach). It is worth noting that there are recent exciting advances in understanding the close relationship between robustness and privacy (e.g., robustness induces privacy [25, 26] and vice versa [27]). From this aspect, our results imply that for the one-dimensional mean estimation problem, truncation alone suffices to help to achieve both. Concurrent and independent work. While preparing this submission, we have noticed that [39] also studies private and robust bandits problems under the Huber model. However, there are many differences between our work and theirs. First, for the differential privacy and bandit models, we investigate the multi-armed bandit problem under central differential privacy while they study stochastic linear bandits under the local differential privacy model. For the assumption of rewards, we focus on the heavy-tailed cases where the distribution of reward has finite k-th raw moment and central moment for k 2 while they essentially assume that the rewards are bounded2. 3 Preliminaries In this section, we first formally introduce our problem of private and robust MABs and subsequently present its corresponding regret notions. 3.1 Private and Robust MABs In this paper, we consider the multi-armed bandits problem where the agent interacts with the environment for T rounds. In each round t, the agent chooses an action at 2 [K] and then standard reward rt is generated independently from inlier distribution. After contamination, the agent observes contaminated reward xt. As mentioned before, by robustness, we aim to handle both reward contamination and possible heavytailed inlier distributions. To this end, we first introduce the following two classes of heavy-tailed reward distributions. Definition 3.1 (Finite k-th raw moment). A distribution over R is said to have a finite k-th raw moment if it is within where k is considered fixed but arbitrary. Definition 3.2 (Finite k-th central moment [40, 38]). 
A distribution $P$ over $\mathbb{R}$ is said to have a finite $k$-th central moment if it belongs to the class $\mathcal{P}^c_k$ of distributions whose $k$-th central moment $\mathbb{E}_{X \sim P}|X - \mu|^k$ is bounded, where $\mu := \mathbb{E}_{X \sim P}[X] \in [-D, D]$ and $D \ge 1$.

The relationship between the finite raw moment and the finite central moment has been discussed in [22]. We further consider the celebrated Huber contamination model [28] in heavy-tailed MABs.

**Definition 3.3** (Heavy-tailed MABs with Huber contamination). Given the contamination level $\alpha \in [0, 1/2)$, for each round $t \in [T]$ the observed reward³ $x_t$ for action $a_t$ is sampled independently from the true distribution $P_{a_t} \in \mathcal{P}_k$ (or $P_{a_t} \in \mathcal{P}^c_k$) with probability $1 - \alpha$; otherwise, it is sampled from some arbitrary and unknown contamination distribution $G_{a_t} \in \mathcal{G}$.

²Note that although the general problem formulation in [39] considers sub-Gaussian rewards, their privacy guarantee only holds for bounded rewards.
³Here we use $x_t$ in the contaminated case to distinguish it from the standard reward $r_t$.

In addition to robustness, we also consider privacy protection in MABs through the lens of DP. In particular, we consider the standard central model of DP for MABs (e.g., [41]), where the learning agent has access to users' raw data (i.e., rewards) and guarantees that its output (i.e., the sequence of actions) is indistinguishable in probability on two neighboring reward sequences. Due to contamination, the reward data accessed by the learning agent at round $t$ could have already been contaminated. More precisely, we let $D_T = (x_1, \dots, x_T) \in \mathbb{R}^T$ be a reward sequence generated in the learning process and $\mathcal{M}(D_T) = (a_1, \dots, a_T) \in [K]^T$ denote the sequence of all actions recommended by a learning algorithm $\mathcal{M}$. With this setup, we have the following formal definition.

**Definition 3.4** (Differential Privacy for MABs). For any $\varepsilon > 0$, a learning algorithm $\mathcal{M} : \mathbb{R}^T \to [K]^T$ is $\varepsilon$-DP if for all sequences $D_T, D'_T \in \mathbb{R}^T$ differing only in a single element and for all events $E \subseteq [K]^T$, we have $\mathbb{P}[\mathcal{M}(D_T) \in E] \le e^{\varepsilon}\, \mathbb{P}[\mathcal{M}(D'_T) \in E]$.

In other words, we protect the privacy of any individual user who interacts with the learning agent, in the sense that an adversary observing the output of the learning agent (i.e., a sequence of actions) cannot infer too much about whether any particular individual has participated in this process, or about the specific reward feedback of this individual. In this paper, we will leverage the well-known Laplace mechanism to guarantee differential privacy.

**Definition 3.5** (Laplace Mechanism). Given a function $f : \mathcal{X}^n \to \mathbb{R}^d$, the Laplace mechanism is given by $\mathcal{M}_L(D_n, f, \varepsilon) = f(D_n) + (Y_1, Y_2, \dots, Y_d)$, where each $Y_i$ is drawn i.i.d. from a Laplace distribution⁴ $\mathrm{Lap}(\Delta_1(f)/\varepsilon)$ and $\Delta_1(f)$ is the $\ell_1$-sensitivity of the function $f$, i.e., $\Delta_1(f) = \sup_{D_n \sim D'_n} \|f(D_n) - f(D'_n)\|_1$. Then, for any $\varepsilon > 0$, the Laplace mechanism satisfies $\varepsilon$-DP.

⁴For a parameter $\lambda$, the Laplace distribution has density $\mathrm{Lap}(\lambda)(x) = \frac{1}{2\lambda}\exp(-|x|/\lambda)$.

In the following sections, for brevity, we will simply use *private and robust MABs* to refer to our setting, i.e., heavy-tailed MABs with Huber contamination and privacy constraints.
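To make Definition 3.5 concrete, below is a minimal, illustrative Python sketch of the scalar Laplace mechanism (the function name, the clipping-based example, and the sensitivity bound $2M/n$ are ours for illustration, not notation from the paper); it already hints at why truncation will be useful later, since clipping is what bounds the sensitivity.

```python
import numpy as np

def laplace_mechanism(value: float, sensitivity: float, eps: float,
                      rng: np.random.Generator) -> float:
    """Release a scalar statistic with eps-DP, given its l1-sensitivity."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / eps)

# Example: privately release the mean of rewards clipped to [-M, M].
# Changing a single reward moves the clipped empirical mean by at most 2M/n.
rng = np.random.default_rng(0)
rewards = rng.standard_t(df=3, size=1000)   # a heavy-tailed sample
M, eps = 5.0, 1.0
clipped_mean = float(np.clip(rewards, -M, M).mean())
private_mean = laplace_mechanism(clipped_mean, 2 * M / len(rewards), eps, rng)
```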
### 3.2 Regrets for Private and Robust MABs

In the contamination case, the standard regret defined with the observed (contaminated) rewards $\{x_t\}_{t \in [T]}$ is ill-defined [31]. Instead, the literature focuses on the clean regret, that is, competing with the best policy in hindsight as measured by the expected true uncontaminated rewards [31, 17, 42]. Hence, let $\mu_a$ be the mean of the inlier distribution of arm $a \in [K]$ and $\mu^* = \max_{a \in [K]} \mu_a$. We also let $\Pi_\varepsilon$ be the set of all $\varepsilon$-DP MAB algorithms and $\mathcal{E}_{\alpha,k}$ be the set of all instances of heavy-tailed MABs (with parameter $k$) with Huber contamination (of level $\alpha$).

**Definition 3.6** (Clean Regret). Fix an algorithm $\pi \in \Pi_\varepsilon$ and an instance $\nu \in \mathcal{E}_{\alpha,k}$. Then, the clean regret of $\pi$ under $\nu$ is given by $R_T(\pi, \nu) := \mathbb{E}_{\pi,\nu}\big[T\mu^* - \sum_{t=1}^{T} \mu_{a_t}\big]$.

Note that here the expectation is taken over the randomness generated by the contaminated environment and the $\varepsilon$-DP MAB algorithm, while the means are those of the true inlier distributions. To capture the intrinsic difficulty of the private and robust MAB problem, we are also interested in its minimax regret.

**Definition 3.7** (Minimax Regret). The minimax regret of our private and robust MAB problem is defined as $R^{\mathrm{minimax}}_{\varepsilon,\alpha,k} := \inf_{\pi \in \Pi_\varepsilon} \sup_{\nu \in \mathcal{E}_{\alpha,k}} \mathbb{E}_{\pi,\nu}\big[T\mu^* - \sum_{t=1}^{T} \mu_{a_t}\big]$.

## 4 Lower Bound

We start with the following lower bound on the minimax regret, which characterizes the fundamental impact of the privacy budget (via $\varepsilon$), the contamination level (via $\alpha$) and the heavy-tailedness of rewards (via $k$) on the regret. Note that this lower bound is mainly established under Definition 3.1, which in turn also serves as a valid lower bound for the central moment case in Definition 3.2.

**Theorem 4.1.** Consider a private and robust MAB problem where the inlier distributions have finite $k$-th raw (or central) moments ($k \ge 2$). Then, its minimax regret satisfies
$$R^{\mathrm{minimax}}_{\varepsilon,\alpha,k} = \Omega\Big(\sqrt{KT} + \big(K/\varepsilon\big)^{1-\frac{1}{k}}\, T^{\frac{1}{k}} + \alpha^{1-\frac{1}{k}}\, T\Big).$$

This lower bound is essentially a maximum of three terms. The first term comes from the standard regret for Gaussian rewards, the second one captures the additional cost in regret due to privacy and heavy-tailed rewards, and the last term indicates the additional cost in regret due to contamination and heavy-tailed rewards. Note that, for a given $k$, the impact of privacy and contamination is separable. It is also helpful to compare our lower bound with related ones, as discussed below.

**Remark 4.2.** First, when $k \to \infty$ and $\alpha = 0$, our lower bound recovers the state-of-the-art lower bound for private MABs with sub-Gaussian rewards [20]; second, when there is no privacy protection, a very recent work [17] establishes a problem-dependent regret lower bound for robust MABs, while we are interested in the problem-independent lower bound with further privacy protection.

## 5 Our Approach: A Meta-Algorithm

In this section, we first introduce a meta-algorithm for private and robust MABs, which not only allows us to tackle inlier distributions with bounded raw or central moments in a unified way, but also highlights the key component, i.e., a private and robust mean estimation sub-routine built on the main idea of reward truncation. Our meta-algorithm, at a high level, can be viewed as a batched version of the celebrated successive arm elimination [43] along with a private and robust mean estimation sub-routine PRM (see Algorithm 1).

**Algorithm 1** Private and Robust Arm Elimination
1: **Input:** number of arms $K$, time horizon $T$, privacy budget $\varepsilon$, Huber parameter $\alpha_1 \in (0, 1/2)$, error probability $\delta \in (0, 1]$, inlier distribution parameters, i.e., $k$ and (optionally) $D$.
2: **Initialize:** batch index $\tau = 0$, active set of arms $S = \{1, \dots, K\}$.
3: **for** batch $\tau = 1, 2, \dots$ **do**
4: Set batch size $B_\tau = 2^\tau$.
5: **if** $B_\tau < T_\alpha$ **then**
6: Randomly select an action $a \in [K]$.
7: Play action $a$ for $B_\tau$ times.
8: **else**
9: **for** each active arm $a \in S$ **do**
10: **for** $i$ from $1$ to $B_\tau$ **do**
11: Pull arm $a$, observe contaminated reward $x^a_i$.
12: If the total number of pulls reaches $T$, exit.
13: **end for**
14: Set truncation threshold $M_\tau$.
15: Set additional parameters $\Phi$.
16: Compute estimate $\tilde{\mu}_a = \mathrm{PRM}\big(\{x^a_i\}_{i=1}^{B_\tau}, M_\tau, \Phi\big)$.
17: **end for**
18: Set confidence radius $\beta_\tau$.
19: Let $\tilde{\mu}_{\max} = \max_{a \in S} \tilde{\mu}_a$.
20: Remove all arms $a$ from $S$ such that $\tilde{\mu}_{\max} - \tilde{\mu}_a > 2\beta_\tau$.
21: **end if**
22: **end for**
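To make the control flow concrete, here is a minimal, illustrative Python sketch of this meta-algorithm with the PRM sub-routine left abstract; the environment interface and the schedules for the truncation threshold and confidence radius are placeholders rather than the exact choices analyzed later (see Theorems 6.1 and 6.5).

```python
import numpy as np

def arm_elimination(env, K, T, eps, prm, M_schedule, beta_schedule, T_alpha):
    """Batched successive arm elimination with a pluggable private & robust
    mean estimator.

    env.pull(a)           -> one (possibly contaminated) reward of arm a
    prm(samples, M, eps)  -> private & robust mean estimate of `samples`
    M_schedule(tau), beta_schedule(tau) -> truncation threshold and
    confidence radius for batch tau (of size B = 2**tau)
    """
    active, pulls, tau = list(range(K)), 0, 0
    while pulls < T:
        tau += 1
        B = 2 ** tau
        if B < T_alpha:                       # phase 1: forced exploration
            a = np.random.randint(K)
            for _ in range(min(B, T - pulls)):
                env.pull(a)
                pulls += 1
        else:                                 # phase 2: estimate and eliminate
            estimates = {}
            for a in active:
                batch = []                    # only this batch's fresh rewards
                for _ in range(B):
                    if pulls >= T:
                        return active
                    batch.append(env.pull(a))
                    pulls += 1
                estimates[a] = prm(np.array(batch), M_schedule(tau), eps)
            beta = beta_schedule(tau)
            best = max(estimates.values())
            active = [a for a in active if best - estimates[a] <= 2 * beta]
    return active
```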
That is, it divides the time horizon T into batches with exponentially increasing size and eliminates sub-optimal arms successively based on the mean estimate via PRM. More specifically, based on the batch size, it consists of two phases. That is, when the batch size is less than a threshold T , it simply recommends actions randomly (line 5-7) (more on this will be explained soon). Otherwise, for each active arm a in batch , it first prescribes a to a batch of B = 2 fresh new users (i.e., doubling ) and observes possibly contaminated rewards (line 8). Then, it calls the sub-routine PRM to compute a private and robust mean estimate for each active arm a (line 12). In particular, it only uses the rewards within the most recent batch (i.e., forgetting ) along with a proper reward truncation threshold M . Finally, it adopts the classic idea of arm elimination with a proper choice of confidence radius β to remove sub-optimal arms with high confidence (line 18-20). We now provide more intuitions behind our algorithm design by highlighting how its main components work in concert. First, the reason behind the first phase (i.e., B T ), named forced exploration, is necessary as it ensures that concentration is satisfied so that the optimal arm will not be eliminated. Algorithm 2 PRM for the finite raw moment case 1: Input: A collection of data {xi}n i=1, truncation parameter M, additional parameters Φ = { }. 2: for i = 1, 2, . . . , n do 3: Truncate data xi = xi {|xi| M}. 4: end for 5: Return private estimate eµ = n + Lap( 2M This is due to the fact that in the contamination case, a well-behaved concentration only kicks in when the number of samples is larger than a threshold. Thus, one cannot adopt arm elimination in this phase since it might eliminate the optimal arm. Note that, instead of our choice of random selection, one can also use other methods for the first phase (see Remark 5.1 below). Second, for the second phase, the idea of doubling batching and forgetting is the key to achieving privacy with a minimal amount of noise (hence better regret). This is because now any single reward feedback only impacts one computation of estimate. This is in sharp contrast to standard arm elimination (e.g., [43]) where each mean estimate is based on all samples so far (as no batching is used), and hence a single reward change could impact O(T) mean estimations5. Third, the simple idea of reward truncation in PRM turns out to be extremely useful for both robustness and privacy. On the one hand, truncation helps to reduce the impact of outliers (due to both heavy tails and contamination); On the other hand, truncation also helps to bound the sensitivity, which is necessary for privacy. In fact, as we will show later, a well-tuned truncation threshold enables us to achieve a near-optimal trade-off between regret, privacy and robustness. Finally, in contrast to the first phase, we can now eliminate sub-optimal arms with high confidence due to the high probability concentration of mean estimate when batch size is larger than T (more details will be given later for specific choices of PRM and hence the choice of T ). We choose to use the successive elimination (SE) technique here because it suffices to enable us to achieve optimal order-wise regret later. In fact, once armed with our novel PRM module, one can also use other exploration strategies like UCB in [20]. One difference is that now instead of first pulling each arm once, it needs to pull each arm T times to ensure that concentration kicks in later. 
This is in fact not surprising since, at a high level, the analyses of SE and UCB are very similar, i.e., doubling (batching) and forgetting. We also note that instead of UCB, one can adapt our approach to the Thompson sampling strategy, e.g., [45], with our PRM module. Again, the key idea is batching and forgetting.

**Remark 5.1.** The algorithmic choice of the first phase can be flexible. For example, instead of playing a randomly selected action for the whole batch, one can choose to play a randomly selected action in each round. Moreover, one can also choose to be greedy or probabilistically greedy with respect to the mean estimate by PRM, which also only uses the rewards collected within the last batch for each arm. All of these choices enjoy the same theoretical guarantees, though some help to improve the empirical performance.

We then present the following remark that places our meta-algorithm in the existing literature.

**Remark 5.2** (Comparison with existing literature). For private MABs (without contamination), the state of the art also builds upon the idea of batching and forgetting [19, 20, 24] to achieve optimal regret. For robust MABs (without privacy), existing works take different robust mean estimators. For example, both [31, 30] use a trimmed-mean estimator for sub-Gaussian inlier distributions, while [17] adopts Huber's estimator to handle inlier distributions with only bounded variance. We are the first to study privacy and robustness simultaneously, via a simple truncation-based estimator, which in turn reveals the close relationship between privacy and robustness in MABs. This complements the recent advances in capturing the connection between these two in (high-dimensional) statistics [25, 27].

## 6 Upper Bounds

In this section, we establish the regret upper bounds for two specific instantiations of our meta-algorithm: one for the finite raw moment case and another for the finite central moment case. The results match our lower bound up to a log factor in $T$, demonstrating their near-optimality.

⁵One can use the tree-based mechanism [44] to reduce it to $O(\log T)$, but it is still sub-optimal [19].

### 6.1 Finite Raw Moment Case

In this section, we focus on private and robust MABs where the inlier distributions have a finite $k$-th raw moment as given by Definition 3.1. In particular, we first introduce the choice of PRM in this case (see Algorithm 2) and establish its concentration property, which plays a key role in our instantiation of the meta-algorithm.

The PRM in Algorithm 2 is simply a truncation-based Laplace mechanism. That is, it first truncates all the received data with the threshold $M$ (line 3). Then, Laplace noise is added to the empirical mean to preserve privacy (line 5). We highlight again that truncation here helps with both robustness (via removing outliers) and privacy (via bounding the sensitivity of the empirical mean). As in the standard algorithm design of MABs, the key is to utilize the concentration of the mean estimator. To this end, we first give the following high-probability concentration result for the mean estimate returned by PRM in Algorithm 2.

**Theorem 6.1** (Concentration of Mean Estimate). Given a collection of Huber-contaminated data $\{x_i\}_{i=1}^n$ where the inlier distribution satisfies Definition 3.1 with mean $\mu$, let $\tilde\mu$ be the mean estimate by Algorithm 2. Then, for any privacy budget $\varepsilon > 0$ and $\delta \in (0, 1)$, the following results hold:

*Uncontaminated case.* For $\alpha = 0$, we have, with probability at least $1 - \delta$,
$$|\tilde\mu - \mu| = O\Big(\sqrt{\tfrac{\log(1/\delta)}{n}} + \tfrac{M\log(1/\delta)}{n\varepsilon} + \tfrac{1}{M^{k-1}}\Big).$$
Thus, choosing the truncation threshold $M = \big(\tfrac{n\varepsilon}{\log(1/\delta)}\big)^{1/k}$ yields $|\tilde\mu - \mu| \le \beta$ with $\beta = O\Big(\sqrt{\tfrac{\log(1/\delta)}{n}} + \big(\tfrac{\log(1/\delta)}{n\varepsilon}\big)^{1-\frac{1}{k}}\Big)$.

*Contaminated case.* For $0 < \alpha \le \alpha_1 \in (0, 1/2)$ and $n = \Omega\big(\tfrac{\log(1/\delta)}{\alpha_1}\big)$, we have, with probability at least $1 - \delta$,
$$|\tilde\mu - \mu| = O\Big(\sqrt{\tfrac{\log(1/\delta)}{n}} + \tfrac{M\log(1/\delta)}{n\varepsilon} + \tfrac{1}{M^{k-1}} + \alpha_1 M\Big).$$
Therefore, choosing the truncation threshold $M = \min\Big\{\big(\tfrac{n\varepsilon}{\log(1/\delta)}\big)^{1/k},\ \alpha_1^{-1/k}\Big\}$ yields $|\tilde\mu - \mu| \le \beta$, where $\beta = O\Big(\sqrt{\tfrac{\log(1/\delta)}{n}} + \big(\tfrac{\log(1/\delta)}{n\varepsilon}\big)^{1-\frac{1}{k}} + \alpha_1^{1-\frac{1}{k}}\Big)$.

With the above result, several remarks are in order. First, for the uncontaminated case, our concentration result consists of the standard sub-Gaussian term and a new one due to privacy and heavy-tailed data. It can be translated into a sample complexity bound: to guarantee $|\tilde\mu - \mu| \le \eta$ for any $\eta \in (0, 1)$, it requires the sample size to be $n \ge O\big(\tfrac{\log(1/\delta)}{\eta^2} + \tfrac{\log(1/\delta)}{\varepsilon\, \eta^{k/(k-1)}}\big)$, which is optimal since it matches the lower bound for private heavy-tailed mean estimation (cf. Theorem 7.2 in [46]). Second, for the contaminated case, there is an additional bias term $O(\alpha_1^{1-1/k})$, which is also known to be information-theoretically optimal [47]. Thus, via truncation, the PRM given by Algorithm 2 achieves the optimal trade-off between accuracy, privacy and robustness, which in turn shows its potential to be integrated into our meta-algorithm. Note that $\alpha_1$ here is any upper bound (e.g., an estimate) on the true contamination level $\alpha$.

Now, based on the concentration result, we can set the remaining parameters in our meta-algorithm accordingly. In particular, we have the following theorem that states the specific instantiation along with its performance guarantees.

**Theorem 6.2** (Performance Guarantees). Consider a private and robust MAB with inlier distributions satisfying Definition 3.1 and $0 < \alpha \le \alpha_1 \in (0, 1/2)$. Let Algorithm 1 be instantiated with Algorithm 2 and $M_\tau$, $\beta_\tau$ be given by Theorem 6.1 with $n$ replaced by $B_\tau$. Set $T_\alpha = \Theta\big(\tfrac{\log(1/\delta)}{\alpha_1}\big)$ and $\delta = 1/T$. Then Algorithm 1 is $\varepsilon$-DP with regret upper bound
$$R_T = O\Big(\sqrt{KT\log T} + \Big(\tfrac{K\log T}{\varepsilon}\Big)^{1-\frac{1}{k}} T^{\frac{1}{k}} + \alpha_1^{1-\frac{1}{k}}\, T + \tfrac{K\log T}{\alpha_1}\Big).$$

The above theorem presents the first achievable regret guarantee for private and robust bandits. The first three terms match our lower bound in Theorem 4.1 up to $\log T$ factors. The last additive term is mainly due to the fact that the mean concentration result only holds when the sample size is larger than $T_\alpha = \Theta\big(\tfrac{\log(1/\delta)}{\alpha_1}\big)$. As a result, each sub-optimal arm has to be played at least $\Omega\big(\tfrac{\log(1/\delta)}{\alpha_1}\big)$ times. However, for a sufficiently large $T$ and a constant $\alpha_1$, the last term is dominated by the other terms. The last term is also exactly the reason why we choose an upper bound $\alpha_1$ on the actual contamination level $\alpha$ and state the upper bound results in terms of $\alpha_1$ rather than $\alpha$: for a very small but non-zero $\alpha$, one can choose a larger $\alpha_1$ to balance the regret. This subtle issue is also mentioned in the nice related work [42]; see the remark after Theorem 7.4 and Remark 5.4 in their work.

**Remark 6.3.** For the uncontaminated case with $\alpha = 0$, using the uncontaminated concentration bound in Theorem 6.1 and the same analysis, we achieve a regret upper bound $O\big(\sqrt{KT\log T} + \big(\tfrac{K\log T}{\varepsilon}\big)^{1-\frac{1}{k}} T^{\frac{1}{k}}\big)$, which also matches the lower bound in Theorem 4.1 up to a factor of $O(\log T)$.
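As a reference point for Section 6.1 (not the authors' code), the following minimal Python sketch mirrors the truncation-based PRM of Algorithm 2 together with the threshold choices of Theorem 6.1 with constants dropped; the helper names are ours.

```python
import numpy as np

def prm_raw_moment(x: np.ndarray, M: float, eps: float,
                   rng: np.random.Generator) -> float:
    """PRM for the finite raw moment case: truncate, average, add Laplace noise.

    Truncation to [-M, M] controls heavy tails / outliers and bounds the
    l1-sensitivity of the empirical mean by 2M/n, so Lap(2M/(n*eps)) noise
    suffices for eps-DP.
    """
    n = len(x)
    x_trunc = x * (np.abs(x) <= M)          # x_i * 1{|x_i| <= M}
    return float(x_trunc.mean()) + rng.laplace(scale=2 * M / (n * eps))

# Illustrative truncation thresholds following Theorem 6.1 (constants dropped):
def threshold_uncontaminated(n, eps, delta, k):
    return (n * eps / np.log(1 / delta)) ** (1 / k)

def threshold_contaminated(n, eps, delta, k, alpha1):
    return min(threshold_uncontaminated(n, eps, delta, k), alpha1 ** (-1 / k))
```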
### 6.2 Finite Central Moment Case

The setting of the last section for the finite raw moment case may not be entirely satisfactory, as it essentially assumes that the means of the arms are bounded within a small range (and hence so are the sub-optimality gaps). Thus, in this section, we turn to private and robust MABs where the inlier distributions have a finite $k$-th central moment as given by Definition 3.2, which allows a reward distribution with a large mean but small variability around it. To this end, we first need a new PRM, since simply truncating around zero as in Algorithm 2 will not work well (as discussed in detail in Remark 6.13).

Our new PRM is presented in Algorithm 3, and it consists of two steps. The intuition is simple: the first step aims to obtain a rough estimate of the mean, which is necessary since now the mean could be far away from zero; then, the second step truncates around the initial estimate to return the final result. More specifically, in the first step, we construct bins over the range $[-D, D]$, which is assumed to contain the true mean by Definition 3.2. Then, we compute a private histogram via the Laplace mechanism. The initial estimate is given by the left endpoint of the bin that has the largest (noisy) empirical mass. Next, in the second step, the algorithm simply truncates around the initial estimate and again adds Laplace noise for privacy (see also the code sketch after Theorem 6.5 below).

**Algorithm 3** PRM for the finite central moment case
1: **Input:** a collection of data $\{x_i\}_{i=1}^{2n}$, truncation parameter $M$, additional parameters $\Phi = \{\varepsilon, D, r\}$ with bin width $r \in \mathbb{R}$.
2: // First step: initial estimate
3: $B_j = [j, j + r)$, $j \in J = \{-D, -D + r, \dots, D - r\}$.
4: Compute the private histogram using the first fold of data: $\tilde p_j = \frac{\sum_{i=1}^{n} \mathbb{1}\{x_i \in B_j\}}{n} + \mathrm{Lap}\big(\frac{2}{n\varepsilon}\big)$.
5: Get the initial estimate $\tilde J = \arg\max_{j \in J} \tilde p_j$.
6: // Second step: final estimate
7: Get the final estimator using the second fold of data: $\tilde\mu = \tilde J + \frac{1}{n}\sum_{i=n+1}^{2n}(x_i - \tilde J)\,\mathbb{1}\{|x_i - \tilde J| \le M\} + \mathrm{Lap}\big(\frac{2M}{n\varepsilon}\big)$.

**Remark 6.4.** It is worth noting that a similar idea of two-step estimation has been used in previous work on robust mean estimation in the one-dimensional heavy-tailed case [33, 38, 40]. However, there are several differences in our algorithm design and analysis. In particular, while [33] considers mean estimation under Huber's model without privacy constraints, we further impose differential privacy requirements. As a result, the estimates in both steps take different forms in our case compared to [33], though they share the same high-level intuition. On the other hand, while [38] considers mean estimation under differential privacy, there is no consideration of Huber contamination as in our case. Moreover, our second-step estimate is based on truncation while their method uses medians-of-means. In fact, as will be shown later (see Remark 6.7), when our result is reduced to the uncontaminated case, it improves over the one in [38]. Finally, [40] considers both Huber contamination and local differential privacy, and establishes the corresponding mean squared error (MSE). In contrast, we consider central differential privacy and aim to establish a high-probability tail concentration. To this end, we take a different truncation method (i.e., using an indicator function in line 7) compared to the one in [40] (Section 3.2 in its first arXiv version).

As before, we first present the concentration property of our new PRM, which will manifest in the specific instantiation of our meta-algorithm. In particular, we first give the following general theorem and then state two more detailed corollaries.

**Theorem 6.5** (Concentration of Mean Estimate). Given a collection of Huber-contaminated data $\{x_i\}_{i=1}^{2n}$ where the inlier distribution satisfies Definition 3.2 with mean $\mu$, let $\tilde\mu$ be the mean estimate by Algorithm 3. For any $\alpha_1 \in (0, \alpha_{\max})$, $\varepsilon \in (0, 1]$ and $\delta \in (0, 1)$, there exist some constants $T(\alpha_1, \varepsilon, \delta)$, $r$, $M$ and $D \ge 2r$ such that for all $n \ge T(\alpha_1, \varepsilon, \delta)$, with probability at least $1 - \delta$,
$$|\tilde\mu - \mu| = O\Big(\sqrt{\tfrac{\log(1/\delta)}{n}} + \tfrac{M\log(1/\delta)}{n\varepsilon} + \tfrac{1}{M^{k-1}} + \alpha_1 M\Big),$$
where $\alpha_{\max} < 1/2$ is the breakdown point.

The above theorem follows the same pattern as the one for the raw moment case (Theorem 6.1).
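The code sketch promised above is given here: a minimal, illustrative Python version of the two-step PRM in Algorithm 3. The structure (private histogram, then private truncated average around the histogram winner) follows the listing, but the noise scales, bin width $r$, and threshold $M$ shown are placeholders; the settings used in the analysis are those of Corollaries 6.6 and 6.8.

```python
import numpy as np

def prm_central_moment(x: np.ndarray, M: float, eps: float, D: float, r: float,
                       rng: np.random.Generator) -> float:
    """Two-step PRM: (1) private histogram over [-D, D] for a coarse estimate,
    (2) truncate around that estimate and privately average the residuals."""
    n = len(x) // 2
    first, second = x[:n], x[n:2 * n]

    # Step 1: private histogram on the first fold; pick the heaviest bin.
    edges = np.arange(-D, D + r, r)                  # bins [j, j + r)
    counts, _ = np.histogram(first, bins=edges)
    noisy = counts / n + rng.laplace(scale=2 / (n * eps), size=len(counts))
    j_hat = edges[int(np.argmax(noisy))]             # left endpoint of best bin

    # Step 2: truncate around the initial estimate on the second fold.
    resid = second - j_hat
    resid = resid * (np.abs(resid) <= M)             # (x_i - j_hat) * 1{|.| <= M}
    return float(j_hat + resid.mean()) + rng.laplace(scale=2 * M / (n * eps))
```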
The key differences are the threshold value T ( , , δ) and the breakdown point max, which are summarized in the following results. Corollary 6.6 (Mean Concentration, = 0). Let the same assumptions in Theorem 6.5 hold. For any 2 (0, 1], setting r = 101/k and M = , then for all n (log(D/δ)/ ) and D 2r, we have that for any δ 2 (0, 1), with probability at least 1 δ, | µ µ| β where . In other words, taking number of samples n such that 2 + log(1/δ) k k 1 + log(D/δ) we have | µ µ| with probability at least 1 δ. Remark 6.7. The above lemma strictly improves the result in [38, Theorem 3.5]6. In particular, it uses the method of medians-of-means and achieves log D log(1/δ) for the third term. In contrast, our third term is additive rather than multiplicative. In fact, our concentration is optimal, which matches the lower bound for the one-dimensional case (see [46, Theorem 7.2]). Corollary 6.8 (Mean Concentration, > 0). Let the same assumptions in Theorem 6.5 hold. For any 2 (0, 1] and 0 < 1 2 (0, 0.133), we let r = 1/k where = 1 0.249 and , ( 1) 1/k}). Then, there exists constant c1, for all n such that n T = (max{ log(1/δ) , c1 log(D/δ) 1 }) and D 2r, we have that for any δ 2 (0, 1), with probability at least 1 δ, |eµ µ| β with β = O Remark 6.9. The above concentration has the same form as the one in Theorem 6.1. Specifically, for a large sample size n, it has the optimal concentration (for small ). The threshold T on n depends on both , now. We note that even for the sub-Gaussian inlier distributions without privacy protection, the existing concentration also has a threshold T = log(1/δ) 1 (see Lemma 4.1 in [30]). The finite central moment assumption will help us get logarithmic results for sample complexity on D of n (log(D/δ)/ ) which is better than the polynomial results on D from the finite raw moment. Actually, finding a private estimator that achieves an error on the logarithmic of the range of the parameter is significant in the topic of DP estimation (see [48 50] on the importance of the problem). Now, we are left to leverage the above two concentration results to design specific instantiations of our meta-algorithm and establish their performance guarantees. Our first instantiation is for the uncontaminated case, i.e., = 0. Therefore, robustness is then only with respect to heavy-tailed rewards while privacy is still preserved. Theorem 6.10 (Performance Guarantees, = 0). Consider a private and robust MAB with inlier distributions satisfying Definition 3.2 and = 0. Let Algorithm 1 be instantiated with Algorithm 3, and r, M , β be given by Corollary 6.6 with n replaced by B . Set T = ( log(D/δ) ) and δ = 1/T. Then, Algorithm 1 is -DP with its regret upper bound KT log T +(K log T/ ) where γ := O (KD log(DT)/ ). 6We also note that the main focus of [38] is not on achieving the optimal estimate though. Remark 6.11. To the best of our knowledge, this is the first result on private and heavy-tailed bandits with the finite central moment assumption. The state-of-the-art result only focuses on the simpler case, i.e., the finite raw moment assumption [22]. Finally, armed with Corollary 6.8, we have the second instantiation of our meta-algorithm that deals with the contaminated case. Theorem 6.12 (Performance Guarantees, > 0). Consider a private and robust MAB with inlier distributions satisfying Definition 3.2 and 1 2 (0, 0.133). Let Algorithm 1 be instantiated with Algorithm 3, and r, T , M , β be given by Corollary 6.8 with n replaced by B . 
Set δ = 1/T, then Algorithm 1 is -DP with its regret upper bound KT log T +(K log T/ ) where ˆγ := O 1 + DK log T + DK log(DT ) and = 1 0.249 . Remark 6.13. The above upper bound also matches our lower bound up to log factor and O(ˆγ), which is dominated by other terms for a sufficiently large T and small , D. A comparison between the results in Theorem 6.2 (Remark 6.3) and Theorem 6.12 (Theorem 6.10) for contaminated case (uncontaminated case), respectively, allows us to address a fundamental question: why we do not first transform the central moment condition to the raw moment condition, and then simply apply Algorithms 1 and 2 to get a regret upper bound? This is because the central moment condition implies E[|X|k] = O(Dk). By employing a simple scaling technique, Algorithms 1 and 2 would yield a regret bound of O(D Reg), where Reg corresponds to the bound stated in Theorem 6.2. Thus, a significant contrast arises when comparing this result with Theorem 6.12, where the dependence on D is only an additive linear term rather than a multiplicative factor of O(D). Consequently, the bounds presented in Theorem 6.12 and Theorem 6.10 prove to be substantially tighter when D assumes larger values, hence highlighting the contributions of our new PRM module in Algorithm 3. 7 Simulations and Conclusion We refer readers to Appendix A for our simulation results. In this paper, we investigated private and robust multi-armed bandits with heavy-tailed rewards under Huber s contamination model as well as differential privacy constraints. We proposed a meta-algorithm that builds on a private and robust mean estimation sub-routine PRM for different heavy-tailed assumptions of rewards, i.e. finite k-th raw moment and finite k-th central moment with k 2. Moreover, we also established regret upper bounds for these algorithms, which nearly match our derived minimax lower bound. There remain many open questions in this direction. First, the problem-dependent lower bound is unclear for private and robust bandits but some work has been done for corrupted heavy-tailed bandits (see Theorem 1 in [17]). One interesting future direction is to study how to leverage both the insights in [17] and the lower bound under privacy to derive a problem-dependent lower bound for private and robust bandits. Second, how to design algorithms and analyze the theoretical results for private and robust bandits under other corrupted models such as "bandits with total corruption budget" in [15]. Third, throughout the paper T is known in advance, and it remains to see how to handle the case where T is unknown and infinity. 8 Acknowledgments We thank Mengchu Li for the insightful discussions and for pointing out the first arxiv version of [40]. And we also thank the anonymous Neur IPS reviewers for their feedback. Xingyu Zhou is supported in part by NSF grant CNS-2153220. Di Wang and Yulian Wu were supported in part by the baseline funding BAS/1/1689-01-01, funding from the CRG grand URF/1/4663-01-01, FCC/1/1976-49-01 from CBRC of King Abdullah University of Science and Technology (KAUST). Di Wang was also supported by the funding of the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI) [1] Donald A Berry and Bert Fristedt. Bandit problems: sequential allocation of experiments (monographs on statistics and applied probability). London: Chapman and Hall, 5(71-87):7 7, 1985. [2] Benjamín Gutiérrez, Loïc Peter, Tassilo Klein, and Christian Wachinger. 
A multi-armed bandit to smartly select a training set from big medical data. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 38 45. Springer, 2017. [3] Matthew Hoffman, Eric Brochu, Nando De Freitas, et al. Portfolio allocation for bayesian optimization. In UAI, pages 327 336. Citeseer, 2011. [4] Weiwei Shen, Jun Wang, Yu-Gang Jiang, and Hongyuan Zha. Portfolio choices with orthogonal bandit learning. In Twenty-fourth international joint conference on artificial intelligence, 2015. [5] Stéphane Caron and Smriti Bhagat. Mixing bandits: A recipe for improved cold-start recom- mendations in a social network. In Proceedings of the 7th Workshop on Social Network Mining and Analysis, pages 1 9, 2013. [6] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500 522, 2017. [7] Svetlozar Todorov Rachev. Handbook of heavy tailed distributions in finance: Handbooks in finance, Book 1. Elsevier, 2003. [8] John Hull. Risk management and financial institutions,+ Web Site, volume 733. John Wiley & Sons, 2012. [9] John Adams, Darren Hayunga, Sattar Mansi, David Reeb, and Vincenzo Verardi. Identifying and treating outliers in finance. Financial Management, 48(2):345 384, 2019. [10] Yanzhe Murray Lei, Sentao Miao, and Ruslan Momot. Privacy-preserving personalized revenue management. HEC Paris Research Paper No. MOSI-2020-1391, 2020. [11] Xi Chen, David Simchi-Levi, and Yining Wang. Privacy-preserving dynamic personalized pricing with demand learning. Management Science, 68(7):4878 4898, 2022. [12] Xi Chen, Sentao Miao, and Yining Wang. Differential privacy in personalized pricing with nonparametric demand models. Operations Research, 2022. [13] Matthew Fredrikson, Eric Lantz, Somesh Jha, Simon Lin, David Page, and Thomas Ristenpart. Privacy in pharmacogenetics: An {End-to-End} case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14), pages 17 32, 2014. [14] Sébastien Bubeck, Nicolo Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711 7717, 2013. [15] Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114 122, 2018. [16] Sayash Kapoor, Kumar Kshitij Patel, and Purushottam Kar. Corruption-tolerant bandit learning. Machine Learning, 108(4):687 715, 2019. [17] Debabrota Basu, Odalric-Ambrym Maillard, and Timothée Mathieu. Bandits corrupted by na- ture: Lower bounds on regret and robust optimistic algorithm. ar Xiv preprint ar Xiv:2203.03186, 2022. [18] Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265 284. Springer, 2006. [19] Touqir Sajed and Or Sheffet. An optimal private stochastic-mab algorithm based on optimal private stopping rule. In Proceedings of the 36th International Conference on Machine Learning (ICML), pages 5579 5588, 2019. [20] Achraf Azize and Debabrota Basu. When privacy meets partial information: A refined analysis of differentially private bandits. ar Xiv preprint ar Xiv:2209.02570, 2022. [21] Wenbo Ren, Xingyu Zhou, Jia Liu, and Ness B Shroff. Multi-armed bandits with local differential privacy. ar Xiv preprint ar Xiv:2007.03121, 2020. 
[22] Youming Tao, Yulian Wu, Peng Zhao, and Di Wang. Optimal rates of (locally) differentially private heavy-tailed multi-armed bandits. ar Xiv preprint ar Xiv:2106.02575, 2021. [23] Jay Tenenbaum, Haim Kaplan, Yishay Mansour, and Uri Stemmer. Differentially private multi- armed bandits in the shuffle model. Advances in Neural Information Processing Systems, 34, 2021. [24] Sayak Ray Chowdhury and Xingyu Zhou. Distributed differential privacy in multi-armed bandits. ar Xiv preprint ar Xiv:2206.05772, 2022. [25] Samuel B Hopkins, Gautam Kamath, Mahbod Majid, and Shyam Narayanan. Robustness implies privacy in statistical estimation. ar Xiv preprint ar Xiv:2212.05015, 2022. [26] Hilal Asi, Jonathan Ullman, and Lydia Zakynthinou. From robustness to privacy and back. ar Xiv preprint ar Xiv:2302.01855, 2023. [27] Kristian Georgiev and Samuel B Hopkins. Privacy induces robustness: Information-computation gaps and sparse mean estimation. ar Xiv preprint ar Xiv:2211.00724, 2022. [28] Peter J Huber. Robust estimation of a location parameter. Ann. Math. Statist., 35(4):73 101, [29] Pranjal Awasthi, Sreenivas Gollapudi, Kostas Kollias, and Apaar Sadhwani. Online learning under adversarial corruptions. 2020. [30] Arpan Mukherjee, Ali Tajer, Pin-Yu Chen, and Payel Das. Mean-based best arm identification in stochastic bandits under reward contamination. Advances in Neural Information Processing Systems, 34:9651 9662, 2021. [31] Laura Niss and Ambuj Tewari. What you see may not be what you get: Ucb bandit algorithms robust to "-contamination. In Conference on Uncertainty in Artificial Intelligence, pages 450 459. PMLR, 2020. [32] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665 674. IEEE, 2016. [33] Adarsh Prasad, Sivaraman Balakrishnan, and Pradeep Ravikumar. A unified approach to robust mean estimation. ar Xiv preprint ar Xiv:1907.00927, 2019. [34] Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. ar Xiv preprint ar Xiv:1911.05911, 2019. [35] Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145 1190, 2019. [36] Olivier Catoni and Ilaria Giulini. Dimension-free pac-bayesian bounds for the estimation of the mean of a random vector. ar Xiv preprint ar Xiv:1802.04308, 2018. [37] Sujay Bhatt, Guanhua Fang, Ping Li, and Gennady Samorodnitsky. Nearly optimal catoni s m-estimator for infinite variance. In International Conference on Machine Learning, pages 1925 1944. PMLR, 2022. [38] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. In Conference on Learning Theory, pages 2204 2235. PMLR, 2020. [39] Vasileios Charisopoulos, Hossein Esfandiari, and Vahab Mirrokni. Robust and private stochastic linear bandits. 2023. [40] Mengchu Li, Thomas B Berrett, and Yi Yu. On robustness and local differential privacy. ar Xiv preprint ar Xiv:2201.00751, 2022. [41] Nikita Mishra and Abhradeep Thakurta. (Nearly) optimal differentially private stochastic multi- arm bandits. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI), pages 592 601, 2015. [42] Sitan Chen, Frederic Koehler, Ankur Moitra, and Morris Yau. Online and distribution-free robustness: Regression and contextual bandits with huber contamination. 
In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 684 695. IEEE, 2022. [43] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(6), 2006. [44] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1 24, 2011. [45] Bingshan Hu and Nidhi Hegde. Near-optimal thompson sampling-based algorithms for dif- ferentially private stochastic bandits. In Uncertainty in Artificial Intelligence, pages 844 852. PMLR, 2022. [46] Samuel B Hopkins, Gautam Kamath, and Mahbod Majid. Efficient mean estimation with pure differential privacy via a sum-of-squares exponential mechanism. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 1406 1417, 2022. [47] Ilias Diakonikolas. Algorithmic high-dimensional robust statistics. Webpage http://www. iliasdiakonikolas. org/simons-tutorial-robust. html, 2018. [48] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high- dimensional distributions. In Conference on Learning Theory, pages 1853 1902. PMLR, 2019. [49] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. ar Xiv preprint ar Xiv:1711.03908, 2017. [50] Gautam Kamath, Xingtu Liu, and Huanyu Zhang. Improved rates for differentially private stochastic convex optimization with heavy-tailed data. In International Conference on Machine Learning, pages 10633 10660. PMLR, 2022. [51] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211 407, 2014. [52] Joseph P Near and Chiké Abuah. Programming differential privacy. URL: https://uvm, 2021. [53] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. [54] Mengjie Chen, Chao Gao, and Zhao Ren. Robust covariance and scatter matrix estimation under huber s contamination model. The Annals of Statistics, 46(5):1932 1960, 2018. [55] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.