# Robust and private stochastic linear bandits

Vasileios Charisopoulos¹, Hossein Esfandiari², Vahab Mirrokni²

¹Operations Research & Information Engineering, Cornell University. Part of this work was completed while the author was with Google. ²Google Research. Correspondence to: Vasileios Charisopoulos.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

In this paper, we study the stochastic linear bandit problem under the additional requirements of differential privacy, robustness and batched observations. In particular, we assume an adversary randomly chooses a constant fraction of the observed rewards in each batch, replacing them with arbitrary numbers. We present differentially private and robust variants of the arm elimination algorithm using logarithmic batch queries under two privacy models and provide regret bounds in both settings. In the first model, every reward in each round is reported by a potentially different client, which reduces to standard local differential privacy (LDP). In the second model, every action is owned by a different client, who may aggregate the rewards over multiple queries and privatize the aggregate response instead. To the best of our knowledge, our algorithms are the first to simultaneously provide differential privacy and adversarial robustness in the stochastic linear bandit problem.

## 1. Introduction

The bandit model is a popular formulation for online learning, wherein a learner interacts with her environment by choosing a sequence of actions from an available (potentially infinite) set, each of which presents a reward to the learner. The goal of the learner is to minimize her regret, defined as the difference between the rewards obtained by the chosen sequence of actions and the best possible action in hindsight. To achieve this, the learner must balance exploration (choosing actions that reveal information about the action set) against exploitation (repeating actions that offered the highest rewards in previous rounds).

In theory, it is easiest to decide on the next action sequentially. In practice, however, there are several obstacles to overcome. The first obstacle is that the rewards in bandit algorithms are often the result of interactions with physical entities (Bouneffouf et al., 2020), e.g., recommendation systems, clinical trials, advertising, etc., raising concerns about the privacy of participating entities. For example, the responses of an individual to medical treatments can inadvertently reveal privacy-sensitive health information. Therefore, it is essential to design learning algorithms that preserve the privacy of reward sequences. Furthermore, observations collected from multiple users or external resources are prone to failures or corruptions. These corruptions are modeled by adversaries, which can tamper with a fraction of the observed rewards. Adversarial corruptions can be strategic (e.g., simultaneously hijacking the devices of multiple users) or random (such as misclicks in the context of an ad campaign). Regardless of their nature, they highlight the need for developing robust learning algorithms that succeed in the presence of such corruptions.
Developing robust private policies has drawn considerable attention in the past couple of years (Esfandiari et al., 2022; Liu et al., 2021; Kothari et al., 2022; Ghazi et al., 2021; Dimitrakakis et al., 2014; Li et al., 2022b). However, despite the importance of the bandit model, we are not aware of any provably robust and private policy for this model. Lastly, in practice, it is often desirable or even necessary for the learner to perform actions in parallel. For example, ad campaigns present an assortment of advertisements to multiple users at the same time and are only periodically recalibrated (Bertsimas & Mersereau, 2007). Consequently, batch policies must optimally balance between parallelization, which can offer significant time savings, and information exchange, which must happen frequently enough to allow for exploration of the action space (Esfandiari et al., 2021).

In this paper, we develop a learning policy that addresses both the privacy and the robustness challenges, while enjoying the benefits of parallelization. Specifically, our policy protects the privacy of reward sequences by respecting the standard differential privacy measure, while withstanding an adversary that changes a constant fraction of the observed rewards in each batch. In the remainder of this section, we formally introduce the problem and survey related work in the bandit literature.

### 1.1. Problem formulation and provable guarantees

We study the stochastic linear bandit problem: given an action space $\mathcal{A} \subset \mathbb{R}^d$ with $K$ elements satisfying $\max_{a \in \mathcal{A}} \|a\|_2 \leq 1$, a learner plays actions $a \in \mathcal{A}$ and receives rewards

$$r_a := \langle a, \theta \rangle + \eta, \qquad \eta \sim \mathrm{SubG}(1), \tag{1}$$

where $\theta$ is an unknown vector in $\mathbb{R}^d$ and $\mathrm{SubG}(1)$ denotes a zero-mean subgaussian random variable. Our assumption that $|\mathcal{A}| = K < \infty$ is without loss of generality, since our results extend to the infinite case by a standard covering argument (Lattimore & Szepesvári, 2020, Chapter 20). For simplicity, we also assume that $\|\theta\| \leq 1$. Given a budget of $T$ total actions, the goal of the learner is to minimize her expected regret:

$$\mathbb{E}\left[R_T\right] := \mathbb{E}\left[\max_{a \in \mathcal{A}} \sum_{t=1}^{T} \langle a - a_t, \theta \rangle\right]. \tag{2}$$

Batched observations. In bandit problems with batch policies, the learner commits to a sequence (i.e., a batch) of actions and observes the rewards of the actions only after the entire batch has been played. The learner may play multiple batches of actions, whose sizes may be chosen adaptively, subject to the requirement that the total number of batches does not exceed $B$ (in addition to the total number of actions played not exceeding the budget $T$). We assume that $B$ is also known to the learner.

Robustness. We require that our algorithm is robust under possibly adversarial corruptions. In particular, we assume that an adversary replaces every observation by an arbitrary number with some small probability $\alpha$. Thus, during each batch, the observed rewards satisfy

$$\tilde{r}_i = \begin{cases} \langle a_i, \theta \rangle + \eta_i, & \text{w.p. } 1 - \alpha, \\ \ast, & \text{w.p. } \alpha, \end{cases} \qquad i = 1, \ldots, n, \tag{3}$$

where $\ast$ is an arbitrary value, $\alpha \in [0, 1/4)$ is the corruption probability, and $n$ is the size of the batch.
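To make the observation model concrete, the following is a minimal simulation sketch of Eq. (3). The helper name `sample_batch_rewards`, the Gaussian noise and the constant used for corrupted entries are illustrative assumptions; the model only requires 1-subgaussian noise and allows arbitrary corrupted values.

```python
import numpy as np

def sample_batch_rewards(actions, theta, alpha, rng, corrupt_value=10.0):
    """Simulate one batch of observations under the corruption model of Eq. (3).

    Each reward is replaced by an arbitrary number (here: a fixed constant chosen
    only for illustration) independently with probability alpha; otherwise the
    learner observes <a_i, theta> + eta_i with standard Gaussian (1-subgaussian) noise.
    """
    actions = np.asarray(actions, dtype=float)            # shape (n, d)
    clean = actions @ theta + rng.standard_normal(len(actions))
    corrupted = rng.random(len(actions)) < alpha          # Bernoulli(alpha) mask
    return np.where(corrupted, corrupt_value, clean)

# Toy usage: unit-norm actions in d = 5, corruption probability alpha = 0.1.
rng = np.random.default_rng(0)
theta = rng.standard_normal(5)
theta /= np.linalg.norm(theta)
A = rng.standard_normal((100, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)
rewards = sample_batch_rewards(A, theta, alpha=0.1, rng=rng)
```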
Differential privacy. Our other requirement is that the algorithm is differentially private (DP).

Definition 1.1 (Differential privacy for bandits (Basu et al., 2019)). A randomized mechanism $M$ for stochastic linear bandits is called $(\varepsilon_{\mathrm{priv}}, \delta_{\mathrm{priv}})$-differentially private if, for any two neighboring sequences of rewards $R = (r_1, \ldots, r_T)$ and $R' = (r'_1, \ldots, r'_T)$ with $r_i \neq r'_i$ for at most one index $i$, and any subset of outputs $O \subseteq \mathcal{M}^T$, it satisfies

$$\mathbb{P}\left(M(R) \in O\right) \leq e^{\varepsilon_{\mathrm{priv}}}\, \mathbb{P}\left(M(R') \in O\right) + \delta_{\mathrm{priv}}. \tag{4}$$

The main contribution of our paper is a batched arm elimination algorithm that satisfies both desiderata, presented in detail in Section 2. We assume a distributed setting where a central server takes on the role of the learner, connected with several clients that report back rewards. The clients do not trust the central server and therefore choose to privatize their reward sequences; this model is better known as local differential privacy (LDP) (Kasiviswanathan et al., 2011). Our algorithm addresses the following client response models:

(M1) Each reward $r_i$ may be solicited from a different client $i$.

(M2) Each client owns an action $a \in \mathcal{A}$ and may report multiple rewards in each batch.

Remark 1.2. In Model (M1), we may assume without loss of generality that every reward $r_i$ is solicited from a different client, and thus every client returns at most one response.

Below, we provide informal statements for the expected regret that our algorithms achieve under each model. While the regret under (M1) has a better dependence on the dimension $d$, (M2) leads to a better dependence on the privacy parameter $\varepsilon_{\mathrm{priv}}$. The improved dependence on $\varepsilon_{\mathrm{priv}}$ in the latter should not come as a complete surprise, since Model (M2) can be viewed as interpolating between the local and central models of differential privacy. For simplicity, we focus on the case where $B$ scales logarithmically in $T$, although our analysis can be easily modified for general $B$. Figure 1 illustrates the qualitative behavior of our regret bounds.

Theorem 1.3 (Informal). Under Model (M1), there is an $\varepsilon_{\mathrm{priv}}$-locally differentially private algorithm that is robust to adversarial corruptions with expected regret satisfying

$$\mathbb{E}\left[R_T\right] = O\!\left(\left[d\sqrt{T} + T \max\left\{\sqrt{\alpha d},\, \alpha d\right\}\right]\left(1 + \frac{1}{\varepsilon_{\mathrm{priv}}}\right)\right).$$

It is worth noting that in the non-private setting, the regret bound above scales as $O(T\sqrt{\alpha d} + d\sqrt{T})$ when $\alpha < 1/d$ and as $O(T\alpha d + d\sqrt{T})$ when $\alpha \geq 1/d$. Note that the total amount of corruption injected by the adversary is upper bounded by $C = \alpha T$. Interestingly, our result shaves off a factor of at least $\sqrt{d}$ compared to the regret bound of the best previous work on robust stochastic linear bandits (Bogunovic et al., 2021), which scales as $O(d\sqrt{T} + C d^{3/2})$.

Theorem 1.4 (Informal). Under Model (M2), there is an $\varepsilon_{\mathrm{priv}}$-differentially private algorithm that is robust to adversarial corruptions with expected regret satisfying

$$\mathbb{E}\left[R_T\right] = O\!\left(d\sqrt{T} + \frac{d}{\varepsilon_{\mathrm{priv}}}\right) + O\!\left(d^{3/2}\left(\alpha T + \sqrt{dT} + \frac{d}{\varepsilon_{\mathrm{priv}}}\right)\right).$$

Compared to Theorem 1.3, Theorem 1.4 yields an improved dependence on $T$ in the private part of the regret at the expense of an additional $d$ factor in the non-private part.

### 1.2. Related work

In this section, we survey related work in the bandit literature that addresses differential privacy and/or robustness to corruptions. We note that, to the best of our knowledge, our work is the first to simultaneously provide robustness and differential privacy guarantees in the stochastic linear bandit setting. While preparing the camera-ready version of this manuscript, we were made aware of the work of Wu et al. (2023), which studies private and robust multi-armed bandits.

Differential privacy in linear bandits. Differential privacy has been well-studied in the context of bandit learning.
In the central DP model, which is the focus of this paper, Shariff & Sheffet (2018) proved a lower bound of $\Omega\!\left(\sqrt{T} + \frac{\log T}{\varepsilon_{\mathrm{priv}}}\right)$ on the expected regret and proposed a private variant of the LinUCB algorithm with additive noise that achieves expected regret $O(\sqrt{T}/\varepsilon_{\mathrm{priv}})$. In recent work, Li et al. (2022a) and Hanna et al. (2022) proposed a private variant of the arm elimination algorithm that obtains a regret bound of $O\!\left(\sqrt{T \log T} + \frac{\log^2 T}{\varepsilon_{\mathrm{priv}}}\right)$, which is tight up to logarithmic factors; in particular, the work of Li et al. (2022a) achieves $(\epsilon, \delta)$-differential privacy using the Gaussian mechanism, while Hanna et al. (2022) achieve $\epsilon$-differential privacy (also known as pure differential privacy) via the Laplace mechanism. While conceptually similar to that of Li et al. (2022a) and Hanna et al. (2022), our algorithm guarantees differential privacy and robustness to corrupted observations simultaneously and maintains an order-optimal regret bound.

In the local DP model, Zheng et al. (2020) used a reduction to private bandit convex optimization to achieve expected regret $O(T^{3/4}/\varepsilon_{\mathrm{priv}})$. Under additional distributional assumptions on the action set, this was improved to $O(T^{1/2}/\varepsilon_{\mathrm{priv}})$ by Han et al. (2021). The same rate was obtained by Hanna et al. (2022), who removed the requirement that actions are generated from a distribution. Finally, a recent line of work has focused on so-called shuffle differential privacy (Bittau et al., 2017; Cheu, 2021), wherein a trusted shuffler can preprocess client responses before transmitting them to the central server. A sequence of works (Tenenbaum et al., 2021; Chowdhury & Zhou, 2022; Garcelon et al., 2022; Hanna et al., 2022) proposed shuffle-DP algorithms for linear bandits, with Li et al. (2022a) and Hanna et al. (2022) achieving essentially the same regret bound as in the central DP setting.

Robustness to adversarial attacks. Recent work has proposed various adversarial attacks in the bandit setting, as well as algorithms to protect against them. Lykouris et al. (2018) (and Gupta et al. (2019) in follow-up work) study multi-armed bandits with adversarial scaling, wherein an adversary can shrink the means of the arm distributions in each round, and propose robust algorithms for this setting. The corruption model in these works differs from our setting, where the adversary can replace a random fraction of rewards arbitrarily. The works of Jun et al. (2018), Liu & Shroff (2019) and Garcelon et al. (2020) study multi-armed and contextual bandit algorithms from the attacker's perspective, demonstrating how an adversary can induce linear regret with logarithmic effort. Li et al. (2019) and Bogunovic et al. (2021) study additive adversarial corruptions in contextual bandits. In particular, they assume that the observed reward in round $i$ suffers an additive perturbation by $c_i(a_i)$, where $a_i$ is the $i$th context and $c_i : \mathcal{A} \to [-1, 1]$ is a context-dependent corruption function. Crucially, the adversary is subject to a budget constraint for some budget $C$ unknown to the learner:

$$\sum_{i=1}^{T} \max_{a \in \mathcal{A}} |c_i(a)| \leq C. \tag{5}$$

In Li et al. (2019), the authors present a robust exploration algorithm for contextual bandits using the Löwner ellipsoid. Letting $\Delta$ denote the gap between the highest and lowest expected rewards, their algorithm achieves a regret of $O\!\left(\frac{d^{5/2} C \log T}{\Delta} + \frac{d^{6} \log^{2} T}{\Delta^{2}}\right)$, under the key assumption that the action space $\mathcal{A}$ is a full-dimensional polytope, and requires no knowledge of the corruption budget $C$.
On the other hand, the work of Bogunovic et al. (2021) introduces a robust variant of the phased arm elimination algorithm for stochastic linear bandits that achieves an expected regret of $O(d\sqrt{T} + C d^{3/2})$, assuming the budget $C$ is known to the learner; for unknown budgets, an additional $C^2$ factor appears in the regret bound. Our work deviates from that of Bogunovic et al. (2021) in the sense that we measure corruption using the probability $\alpha$ of an adversary interfering with each observation; moreover, assuming that $C$ scales as $\alpha T$, our work shaves off a $\sqrt{d}$ factor from the result of Bogunovic et al. (2021) in certain regimes, while also ensuring differential privacy.

### 1.3. Notation

We let $\langle x, y \rangle := x^{\mathsf{T}} y$ denote the Euclidean inner product with induced norm $\|x\| = \sqrt{\langle x, x \rangle}$, and write $\mathbb{S}^{d-1} := \{x \in \mathbb{R}^d \mid \|x\|_2 = 1\}$ for the unit sphere in $d$ dimensions. When $M$ is a positive-definite matrix, we write $\|x\|_M := \sqrt{\langle x, M x \rangle}$ for the norm induced by $M$. Finally, we write $\|A\|_{\mathrm{op}} := \sup_{x \in \mathbb{S}^{n-1}} \|Ax\|$ for the $\ell_2 \to \ell_2$ operator norm of a matrix $A \in \mathbb{R}^{m \times n}$.

Figure 1. Demonstration of regret bounds. Left: private vs. non-private regret bounds under (M1). Center: scaling of the private regret bound under (M1) and (M2). Right: effect of the corruption parameter $\alpha$ under model (3).

### 1.4. Coresets and G-optimal designs

Our algorithms make use of coresets, which in turn are formed with the help of a concept called G-optimal design. We formally define this concept below.

Definition 1.5 (G-optimal design). Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set of vectors and let $\pi : \mathcal{A} \to [0, 1]$ be a probability distribution on $\mathcal{A}$ satisfying $\sum_{a \in \mathcal{A}} \pi(a) = 1$. Then $\pi$ is called a G-optimal design for $\mathcal{A}$ if it solves the following optimization problem:

$$\underset{\pi}{\text{minimize}} \quad \max_{a \in \mathcal{A}} \|a\|^2_{M^{-1}(\pi)}, \qquad \text{where } M(\pi) := \sum_{a \in \mathcal{A}} \pi(a)\, a a^{\mathsf{T}}. \tag{6}$$

A standard result in experiment design (Lattimore & Szepesvári, 2020, Theorem 21.1) shows that the optimal value of (6) is equal to $d$. Moreover, it is possible to find a probability distribution $\pi$ satisfying the following:

Definition 1.6 (Approximate G-optimal design). Let $\mathcal{A} \subset \mathbb{R}^d$ be a finite set of vectors and let $\pi : \mathcal{A} \to [0, 1]$ be a probability distribution on $\mathcal{A}$. We call $\pi$ an approximate G-optimal design for $\mathcal{A}$ if it satisfies

$$\max_{a \in \mathcal{A}} \|a\|^2_{M^{-1}(\pi)} \leq 2d, \qquad |\mathrm{supp}(\pi)| \leq C\, d \log\log d, \tag{7}$$

for a universal constant $C > 0$.

In particular, an approximate G-optimal design $\pi$ in the sense of Definition 1.6 can be found efficiently (in polynomial time), with support of size $O(d \log\log d)$. Given an (approximate) G-optimal design $\pi$ in the sense of Definition 1.5 or Definition 1.6, a coreset $S_{\mathcal{A}}$ of total size $n$ is a multiset $\{a_1, \ldots, a_n\}$ in which each action $a \in \mathrm{supp}(\pi)$ appears a total of $n_a := \lceil \pi(a)\, n \rceil$ times.
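To make the coreset construction concrete, the sketch below computes a design via the classical Frank-Wolfe (Wynn-Fedorov) iteration and rounds it to a multiset of plays. This is only an illustrative routine under the assumption that the actions span $\mathbb{R}^d$; it is not the specific procedure referenced in Definition 1.6 and does not enforce the $O(d \log\log d)$ support bound.

```python
import numpy as np

def frank_wolfe_design(A, n_iters=200):
    """Frank-Wolfe (Wynn-Fedorov) iteration for an approximate G-optimal design.

    A : (K, d) array whose rows are the actions; assumed to span R^d.
    Returns a probability vector pi of length K.
    """
    K, d = A.shape
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        M = A.T @ (A * pi[:, None])                      # M(pi) = sum_a pi(a) a a^T
        M_inv = np.linalg.inv(M)
        g = np.einsum('ij,jk,ik->i', A, M_inv, A)        # ||a||^2_{M^{-1}(pi)} per action
        k = int(np.argmax(g))
        step = (g[k] / d - 1.0) / (g[k] - 1.0)           # Wynn's step size
        pi = (1.0 - step) * pi
        pi[k] += step
    return pi

def coreset_from_design(A, pi, n, tol=1e-3):
    """Round a design to a coreset of roughly n plays: each supported action a
    appears ceil(n * pi(a)) times (tiny weights are treated as zero)."""
    counts = np.where(pi > tol, np.ceil(n * pi), 0).astype(int)
    idx = np.repeat(np.arange(len(pi)), counts)
    return A[idx], counts

# Usage: design over K = 50 random unit-norm actions in d = 5, coreset of ~100 plays.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
A /= np.linalg.norm(A, axis=1, keepdims=True)
pi = frank_wolfe_design(A)
coreset, counts = coreset_from_design(A, pi, n=100)
```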
## 2. Algorithm and main results

To minimize the regret of the learner, we use a variation of the standard arm elimination algorithm (Lattimore & Szepesvári, 2020). In this algorithm, the learner uses batches of actions to construct confidence intervals for the optimal rewards and eliminates a set of suboptimal arms in each round based on their performance on the current batch. While the vanilla arm elimination algorithm is neither robust nor differentially private, we develop a variant that simultaneously ensures both properties. An additional attractive property of our algorithm is that its implementation only requires a simple modification.

### 2.1. Our approach

To motivate our approach, we first sketch a naive attempt at modifying the arm elimination algorithm and briefly explain why it is unable to achieve good regret guarantees. Recall that in the standard arm elimination algorithm, the learner first forms a so-called coreset of the action space $\mathcal{A}$, which is a multiset of vectors $a_1, \ldots, a_n \in \mathcal{A}$, and plays all the actions $a_j$, receiving rewards $r_j$. To prune the action space, the learner computes the least-squares estimate

$$\hat{\theta} := \left(\sum_{j=1}^{n} a_j a_j^{\mathsf{T}}\right)^{-1} \sum_{j=1}^{n} a_j r_j, \tag{8}$$

and chooses a suitable threshold $\gamma$ to eliminate arms with $\langle a, \hat{\theta} \rangle < \max_{j=1,\ldots,n} \langle a_j, \hat{\theta} \rangle - 2\gamma$ (a code sketch of this step is given at the end of this subsection). Clearly, the arm elimination algorithm interacts with the rewards directly only when forming the least-squares estimate $\hat{\theta}$. Therefore, estimating $\hat{\theta}$ with a differentially private algorithm is sufficient to protect the privacy of rewards. Likewise, computing $\hat{\theta}$ robustly will ensure robustness of the overall algorithm.

The main idea behind our arm elimination variant is the following. First, let us dispense with the differential privacy requirement. Notice that in the absence of corruptions, $\hat{\theta}$ is the empirical mean of the sequence of variables $\{Z_1, \ldots, Z_n\}$, where

$$Z_i := n \left(\sum_{j=1}^{n} a_j a_j^{\mathsf{T}}\right)^{-1} a_i r_i.$$

To compute $\hat{\theta}$ robustly, one may attempt to run an algorithm such as the geometric median. However, the approximation guarantee of the geometric median method scales proportionally to $\max_{j \in [n]} \|Z_j - \hat{\theta}\|$, for which worst-case bounds are overly pessimistic. Indeed, letting $M$ denote the Gram matrix of the coreset used in the current arm elimination round, a tedious but straightforward calculation shows that these bounds scale with $\kappa_2(M)$, the condition number of $M$. In turn, the latter quantity depends on the geometry of the maintained action set and is difficult to control in general. For example, even if the original action set is well-conditioned, that property will not necessarily hold throughout the algorithm.

To work around this issue, we take advantage of the probabilistic nature of the adversary. The main idea is that, in expectation, the least-squares estimate computed over the subset of non-corrupted rewards, $I_{\mathrm{good}}$, and given by

$$\hat{\theta}_{I_{\mathrm{good}}} := \left(\sum_{i=1}^{n} a_i a_i^{\mathsf{T}}\right)^{-1} \sum_{j \in I_{\mathrm{good}}} a_j r_j, \tag{9}$$

is close to the true least-squares estimate in the absence of any corruptions. While the set $I_{\mathrm{good}}$ is not known a priori to the learner, we may still estimate $\hat{\theta}_{I_{\mathrm{good}}}$ from Eq. (9) using a well-known spectral filtering algorithm from the robust statistics literature. In doing so, we reduce the problem of robust linear regression (with a fixed design matrix) to that of robust mean estimation (over an appropriately weighted set of inputs). We mention in passing that the work of Chen et al. (2022) also develops a distribution-free algorithm for robust linear regression which applies to a more general class of problems. However, their algorithm requires repeatedly solving a semidefinite program, while our spectral filtering-based method is simpler to implement.

In what follows, we describe our robust linear regression primitive, state its theoretical approximation guarantees, and finally sketch how to take advantage of it to design a robust and differentially private algorithm for batched bandits.
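As promised above, here is a minimal sketch of the vanilla (non-robust, non-private) elimination step built around the least-squares estimate (8). The function name and argument layout are assumptions made for illustration rather than the paper's algorithm, which is given in Algorithm 1 below.

```python
import numpy as np

def vanilla_elimination_step(active, coreset, rewards, gamma):
    """One naive phased-elimination update: least squares (Eq. (8)) + thresholding.

    active  : (K, d) currently active actions,
    coreset : (n, d) actions that were actually played,
    rewards : (n,) observed rewards, gamma : confidence width.
    """
    M = coreset.T @ coreset                                  # sum_j a_j a_j^T
    theta_hat = np.linalg.solve(M, coreset.T @ rewards)      # Eq. (8)
    best = (coreset @ theta_hat).max()                       # max_j <a_j, theta_hat>
    keep = active @ theta_hat >= best - 2.0 * gamma          # elimination rule
    return theta_hat, active[keep]
```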
### 2.2. Robust linear regression with fixed designs

In this section, we describe an efficient algorithm for Huber-robust linear regression with a fixed design matrix. In particular, we let the (clean) set of observations satisfy

$$y_i = \langle a_i, \theta \rangle + \eta_i, \qquad i = 1, \ldots, n, \tag{10}$$

where the $\eta_i$ are independent noise realizations and $a_1, \ldots, a_n$ are design vectors. The least-squares estimate of $\theta$ is given by

$$\hat{\theta} := M_n^{-1} \sum_{i=1}^{n} y_i a_i, \qquad M_n := \sum_{i=1}^{n} a_i a_i^{\mathsf{T}}.$$

Now, suppose that an adversary corrupts each $y_i$ independently with probability $\alpha \in (0, 1/2)$, so the learner observes

$$\tilde{y}_i = \begin{cases} y_i, & \text{if } Z_i = 1, \\ \ast, & \text{otherwise}, \end{cases} \qquad Z_i \sim \mathrm{Ber}(1 - \alpha). \tag{11}$$

The goal is to estimate the least-squares solution $\hat{\theta}$ robustly. Our strategy will be to first estimate the least-squares solution over the subset of good indices $G_0$:

$$\hat{\theta}_{G_0} := \sum_{i \in G_0} M_n^{-1} a_i y_i, \qquad G_0 := \{i \mid Z_i = 1\}. \tag{12}$$

To estimate $\hat{\theta}_{G_0}$, we will apply the well-known (randomized) spectral filtering algorithm for robust mean estimation (see, e.g., Diakonikolas & Kane, 2019; Prasad et al., 2019), provided in Algorithm 2 for completeness, to the components of the least-squares solution after an appropriate reweighting. In particular, we compute

$$\gamma_{\{a_i\}_{i=1}^{n}} := \max_{a \in \mathcal{A}} \|a\|^2_{M_n^{-1}} \cdot \frac{\sum_{i=1}^{n} y_i^2}{n}; \qquad \tilde{w} := \mathrm{Filter}\!\left(\left\{M_n^{-1/2} a_i y_i\right\}_{i=1}^{n},\ \gamma_{\{a_i\}_{i=1}^{n}}\right); \qquad \tilde{\theta} := n\, M_n^{-1/2} \tilde{w}. \tag{13}$$

We prove the following guarantee for this method. The proof of this proposition is deferred to Appendix A. We use this proposition in the next section to design robust and differentially private algorithms for stochastic linear bandits.

Proposition 2.1. Fix $\delta \in (0, 1)$, $a \in \mathcal{A}$, and let $e_i := y_i - \langle a_i, \theta \rangle$. Then, with probability at least $1 - 2\delta$, we have

$$\left|\left\langle a, \tilde{\theta} - \theta \right\rangle\right| \;\lesssim\; \left(\sqrt{\alpha} + \sqrt{\frac{\log(1/\delta)}{n}}\right) \sqrt{\max_{a \in \mathcal{A}} \|a\|^{2}_{M_n^{-1}} \sum_{i=1}^{n} y_i^{2}} \;+\; \left|\sum_{i=1}^{n} e_i \left\langle a, M_n^{-1} a_i \right\rangle\right| \;+\; \alpha, \tag{14}$$

where $M_n := \sum_{i=1}^{n} a_i a_i^{\mathsf{T}}$ and $\lesssim$ hides a universal constant.

Algorithm 1: Robust arm elimination
1. Input: action space $\mathcal{A}$, horizon $T$, number of batches $B$, failure probability $\delta$, corruption probability $\alpha \in (0, 1/4)$, truncation parameter $\nu > 0$.
2. Set $\mathcal{A}_0 := \mathcal{A}$, $q := T^{1/B}$.
3. For $i = 1, \ldots, B - 1$:
4. Compute an approximate G-optimal design $\pi$ for $\mathcal{A}_{i-1}$ with $|\mathrm{supp}(\pi)| \lesssim d \log\log d$.
5. Form a coreset $S_{\mathcal{A}_{i-1}}$ by playing each distinct $a \in \mathrm{supp}(\pi)$ a total of $q^i \pi(a)$ times under Model (M1), or $q^i \max\{\pi(a), \nu\}$ times under Model (M2) (rounded up to an integer).
6. Play the actions $a_j \in S_{\mathcal{A}_{i-1}}$ and collect rewards $r_j$ according to (3).
7. Compute $\tilde{w}_i := \mathrm{Filter}\!\left(\left\{M_n^{-1/2} a_j r_j\right\}_{j=1}^{n},\ \max_{a \in \mathcal{A}} \|a\|^2_{M_n^{-1}} \cdot \frac{1}{n} \sum_{j=1}^{n} r_j^2\right)$, where $M_n := \sum_{j=1}^{n} a_j a_j^{\mathsf{T}}$ (see Algorithm 2).
8. Compute $\tilde{\theta}_i := n\, M_n^{-1/2} \tilde{w}_i$.
9. Set the elimination threshold $\gamma_i$ to the width of the robust confidence interval of Proposition 2.1, instantiated under (M1) or (M2); the explicit expressions, which involve $q^i$, $\delta$, $\alpha$, $\varepsilon_{\mathrm{priv}}$ and, under (M2), $\nu$, $m$ and $k := |\mathrm{supp}(\pi)|$, appear in the proofs of Theorems 3.4 and 3.9.
10. Eliminate suboptimal arms: $\mathcal{A}_i := \left\{a \in \mathcal{A}_{i-1} \;\middle|\; \langle a, \tilde{\theta}_i \rangle \geq \max_{a' \in \mathcal{A}_{i-1}} \langle a', \tilde{\theta}_i \rangle - 2\gamma_i\right\}$.
11. End for.
12. Play the best action in $\mathcal{A}_{B-1}$ in the last round.

Algorithm 2: $\mathrm{Filter}\!\left(S := \{X_i\}_{i=1}^{m}, \lambda\right)$
1. Compute the empirical mean and covariance: $\theta_S := \frac{1}{|S|} \sum_{i \in S} X_i$, $\Sigma_S := \frac{1}{|S|} \sum_{i \in S} (X_i - \theta_S)(X_i - \theta_S)^{\mathsf{T}}$.
2. Compute the leading eigenpair $(\mu, v)$ of $\Sigma_S$.
3. If $\mu < 4\lambda$: return $\theta_S$.
4. Else:
5. Compute outlier scores $\tau_i := \langle v, X_i - \theta_S \rangle^2$ for all $i$.
6. Sample an element $Y$ with $\mathbb{P}(Y = X_i) \propto \tau_i$.
7. Return $\mathrm{Filter}(S \setminus \{Y\}, \lambda)$.
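The following is a minimal NumPy sketch of Algorithm 2 together with the whitening reduction of Eq. (13). It is written for illustration only: for instance, the threshold below takes the maximum leverage over the design rows as a stand-in for the maximum over the action set, and the function names are assumptions rather than the paper's implementation.

```python
import numpy as np

def spectral_filter(X, lam, rng=None):
    """Randomized spectral filtering for robust mean estimation (cf. Algorithm 2).

    X   : (m, d) array of points, possibly containing outliers.
    lam : threshold; recursion stops once the top covariance eigenvalue is < 4 * lam.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.array(X, dtype=float)
    while True:
        mean = X.mean(axis=0)
        centered = X - mean
        cov = centered.T @ centered / len(X)
        eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
        mu, v = eigvals[-1], eigvecs[:, -1]               # leading eigenpair
        if mu < 4.0 * lam or len(X) <= 1:
            return mean
        tau = (centered @ v) ** 2                         # outlier scores
        drop = rng.choice(len(X), p=tau / tau.sum())      # remove one point w.p. prop. to tau
        X = np.delete(X, drop, axis=0)

def inv_sqrt_psd(M, eps=1e-12):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def robust_least_squares(A, y, rng=None):
    """Robust surrogate for the least-squares solution M_n^{-1} A^T y, cf. Eq. (13)."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    M = A.T @ A                                           # M_n = sum_i a_i a_i^T
    M_is = inv_sqrt_psd(M)
    Z = (A * y[:, None]) @ M_is                           # rows: M_n^{-1/2} a_i y_i
    leverage = np.einsum('ij,jk,ik->i', A, np.linalg.inv(M), A)
    lam = leverage.max() * (y ** 2).sum() / n             # threshold from Eq. (13)
    w = spectral_filter(Z, lam, rng=rng)
    return n * (M_is @ w)                                 # theta_tilde = n * M_n^{-1/2} w
```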
## 3. Robust differentially private bandits

In this section, we consider the requirement of differential privacy. In particular, we assume that the learner is an untrusted server; every client must therefore privatize their rewards before reporting them to the learner. Recall that we consider two different models for generating client responses:

(M1) Every reward is obtained from a distinct client.

(M2) All rewards associated with a distinct action $a$ are obtained from the same client.

Algorithm 1 documents the parameter choices under each of the models above. For our regret analysis, we rely on the following facts for each round $i$.

Fact 1: The optimal arm is not eliminated. Let $a^*$ denote the optimal action in the sense of maximizing the inner product $\langle a, \theta \rangle$. Then, with high probability,

$$\langle a^*, \tilde{\theta} \rangle - \langle a, \tilde{\theta} \rangle = \langle a^*, \theta \rangle + \langle a^*, \tilde{\theta} - \theta \rangle - \langle a, \theta \rangle - \langle a, \tilde{\theta} - \theta \rangle \geq \langle a^* - a, \theta \rangle - 2\gamma_i \geq -2\gamma_i,$$

using the bound $|\langle a, \tilde{\theta} - \theta \rangle| \leq \gamma_i$ on the estimation error in the penultimate inequality and the fact that $\langle a^*, \theta \rangle \geq \langle a, \theta \rangle$ in the last inequality. Thus, $a^*$ always satisfies the condition of the algorithm and is not eliminated.

Fact 2: Surviving arms have bounded gap. Fix an arm $a$ and let $\Delta_a := \langle a^* - a, \theta \rangle$ be its gap. We have

$$\langle a^* - a, \tilde{\theta} \rangle \geq \langle a^*, \theta \rangle - \gamma_i - \left(\langle a, \theta \rangle + \gamma_i\right) = \Delta_a - 2\gamma_i.$$

Now, let $i$ be the smallest positive integer such that $\gamma_i < \Delta_a / 4$. Then the above implies that $\langle a^* - a, \tilde{\theta} \rangle > 2\gamma_i$. Consequently, any arm $a$ with gap $\Delta_a > 4\gamma_i$ for some index $i$ will be eliminated at the end of that round. Therefore, all arms that are active at the beginning of round $i$ necessarily satisfy $\Delta_a \leq 4\gamma_{i-1}$.

### 3.1. Local differential privacy under (M1)

In this setting, we can achieve pure LDP using the Laplace mechanism (Dwork & Roth, 2014). In particular, we define

$$M(r) := r + \xi, \qquad \xi \sim \mathrm{Lap}\!\left(\frac{2}{\varepsilon_{\mathrm{priv}}}\right),$$

where $\varepsilon_{\mathrm{priv}}$ is the desired privacy parameter. Then, when queried for a response, client $i$ reports the privatized reward

$$\hat{r}_i = M(r_i) = \langle a_i, \theta \rangle + \eta_i + \xi_i, \qquad \xi_i \sim \mathrm{Lap}\!\left(\frac{2}{\varepsilon_{\mathrm{priv}}}\right). \tag{15}$$
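For concreteness, here is a minimal sketch of the client-side reports: the first function implements the per-reward Laplace report of Eq. (15), and the second anticipates the aggregated report of Eq. (20) in Section 3.2. Clipping rewards to [-1, 1] is an assumption made here so that the stated noise scales correspond to a sensitivity of 2; the function names are illustrative.

```python
import numpy as np

def privatize_reward(r, eps_priv, rng):
    """Model (M1): report r + Lap(2 / eps_priv), cf. Eq. (15).

    Rewards are clipped to [-1, 1] (an assumption made for this sketch), so that
    changing one reward moves the report by at most 2.
    """
    return float(np.clip(r, -1.0, 1.0)) + rng.laplace(scale=2.0 / eps_priv)

def privatize_aggregate(rewards, eps_priv, rng):
    """Model (M2): a client owning action a averages its n_a rewards and adds
    Lap(2 / (n_a * eps_priv)) noise, cf. Eq. (20) and Lemma 3.5."""
    rewards = np.clip(np.asarray(rewards, dtype=float), -1.0, 1.0)
    n_a = rewards.size
    return rewards.mean() + rng.laplace(scale=2.0 / (n_a * eps_priv))

# Usage: one client reporting a single reward, one client aggregating 32 rewards.
rng = np.random.default_rng(2)
print(privatize_reward(0.4, eps_priv=1.0, rng=rng))
print(privatize_aggregate(rng.normal(0.3, 0.1, size=32), eps_priv=1.0, rng=rng))
```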
The three forthcoming lemmata control the different terms appearing in the confidence interval from Eq. (14). Lemma 3.1 below controls the contribution of the additive noise.

Lemma 3.1. Under Model (M1), with probability at least $1 - 2\delta$ we have

$$\left|\sum_{i=1}^{n} e_i \left\langle a, M_n^{-1} a_i \right\rangle\right| \lesssim \|a\|_{M_n^{-1}} \left(\sqrt{\log(1/\delta)} + \frac{\log(1/\delta)}{\varepsilon_{\mathrm{priv}}}\right),$$

where $e_i := \hat{r}_i - \langle a_i, \theta \rangle$.

Two of the three terms in Eq. (14) depend on the maximal weighted norm $\|a\|^2_{M_n^{-1}}$ over the action set; Lemma 3.2 bounds that norm for an arbitrary round of the arm elimination algorithm.

Lemma 3.2. Under Model (M1), we have the bound

$$\max_{a \in \mathcal{A}} \|a\|^2_{M_n^{-1}} \leq \frac{2d}{n}.$$

Finally, Lemma 3.3 below controls the contribution of $\sqrt{\sum_{i=1}^{n} y_i^2}$ to the robust confidence interval.

Lemma 3.3. With probability at least $1 - \delta$, we have

$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2} \lesssim 1 + \sqrt{\log(n/\delta)} + \frac{\log(n/\delta)}{\varepsilon_{\mathrm{priv}}}.$$

With control over the confidence interval (14) at hand, we arrive at the regret bound in Theorem 3.4 below. The proof follows standard arguments (see, e.g., Esfandiari et al., 2021, Theorem 5.1) and can be found in Appendix B.1.4.

Theorem 3.4. Under Model (M1), the expected regret of Algorithm 1 is at most

$$\left[\sqrt{Td}\left(\sqrt{\log(T/\delta)} + \log(T/\delta)\right) + T \max\left\{\sqrt{\alpha d},\, \alpha d\right\}\right]\left(1 + \frac{\log(T/\delta)}{\varepsilon_{\mathrm{priv}}}\right) \tag{19}$$

up to a dimension-independent multiplicative constant.

### 3.2. Local differential privacy under (M2)

In this setting, every client achieves differential privacy by aggregating their responses before transmitting them to the server. In particular, let $n_a$ denote the number of times action $a$ is played during the current round. The parameter $n_a$ can be considered public, since it is known to the untrusted server. Then, client $a$ may report

$$M\!\left(\frac{1}{n_a} \sum_{i=1}^{n_a} \left(\langle a, \theta \rangle + \eta_i\right)\right) := \frac{1}{n_a} \sum_{i=1}^{n_a} \left(\langle a, \theta \rangle + \eta_i\right) + \xi_a, \tag{20}$$

where $\eta_i \sim \mathrm{SubG}(1)$ and $\xi_a$ is Laplace noise. The amount of noise needed to achieve privacy scales inversely with $n_a$.

Lemma 3.5. With $\xi_a \sim \mathrm{Lap}\!\left(\frac{2}{n_a \varepsilon_{\mathrm{priv}}}\right)$, the mechanism $M$ in (20) is $\varepsilon_{\mathrm{priv}}$-differentially private for client $a$.

Recall that in this model, the arm elimination algorithm follows the modifications below:

1. We receive $|\mathrm{supp}(\pi)|$ distinct responses in each round, where $\pi$ is an approximate G-optimal design.
2. Every action $a \in \mathrm{supp}(\pi)$ is played a total of $n_a = m \max\{\pi(a), \nu\}$ times (up to rounding) for fixed $m$ and $\nu > 0$.

Our proof for this setting is analogous to the proof under Model (M1). We have the following analogue of Lemma 3.1.

Lemma 3.6. Under Model (M2), with probability at least $1 - 2\delta$ we have

$$\left|\sum_{v \in \mathrm{supp}(\pi)} e_v \left\langle a, M^{-1} v \right\rangle\right| \lesssim \|a\|_{M^{-1}} \left(\sqrt{\frac{\log(1/\delta)}{\nu m}} + \frac{\log(1/\delta)}{\nu m\, \varepsilon_{\mathrm{priv}}}\right),$$

where $e_v := M(r_v) - \langle v, \theta \rangle$ and $M := \sum_{a \in \mathrm{supp}(\pi)} a a^{\mathsf{T}}$.

Lemma 3.7. Under Model (M2), we have the bound

$$\max_{a \in \mathcal{A}} \|a\|^2_{M^{-1}} \leq 2d. \tag{22}$$

We also have the following analogue of Lemma 3.3.

Lemma 3.8. With probability at least $1 - \delta$, we have

$$\sqrt{\sum_{v \in \mathrm{supp}(\pi)} y_v^2} \lesssim \sqrt{|\mathrm{supp}(\pi)|}\left(1 + \sqrt{\frac{\log(|\mathrm{supp}(\pi)|/\delta)}{\nu m}} + \frac{\log(|\mathrm{supp}(\pi)|/\delta)}{\nu m\, \varepsilon_{\mathrm{priv}}}\right).$$

Putting everything together, we arrive at Theorem 3.9 below, whose proof can be found in Appendix B.2.4.

Theorem 3.9. Under Model (M2), the expected regret of Algorithm 1 is at most

$$\left(1 + \nu d \log\log d\right)\left[\sqrt{\frac{dT \log(1/\delta)}{\nu}} + \log(1/\delta)\log T\right] + \alpha T\, d \log\log d + \sqrt{\frac{T d \log\log d / \delta}{\nu}} + \frac{\log\!\left(\frac{d \log\log d}{\delta}\right)\log T}{\nu\, \varepsilon_{\mathrm{priv}}} \tag{24}$$

up to a dimension-dependent multiplicative constant.

## 4. Conclusion

In this paper we presented a robust and $\varepsilon_{\mathrm{priv}}$-LDP policy for batched stochastic linear bandits with expected regret

$$\mathbb{E}\left[R_T\right] = O\!\left(\left[d\sqrt{T} + T \max\left\{\sqrt{\alpha d},\, \alpha d\right\}\right]\left(1 + \frac{1}{\varepsilon_{\mathrm{priv}}}\right)\right),$$

where $\alpha$ is the probability of corruption of each reward, and which only requires a logarithmic number of batch queries. In the absence of corruption ($\alpha = 0$), our regret matches that of the best-known non-robust differentially private algorithm (Hanna et al., 2022). On the other hand, when no differential privacy is required, our regret bound shaves off a factor of $\sqrt{d}$ compared to previous work on robust linear bandits (Bogunovic et al., 2021). In addition, a variant of our policy is immediately applicable to a differential privacy model that interpolates between the local and central settings and achieves an improved dependence on the privacy parameter $\varepsilon_{\mathrm{priv}}$.

While simple to implement, our algorithms require the learner to provide an upper bound on the corruption probability $\alpha$, which may be difficult to estimate in practice. We leave the task of designing an adaptive policy as exciting future work. At the same time, it is unclear if our regret bounds for the privacy model (M2) are tight (in terms of the dependence on $\varepsilon_{\mathrm{priv}}$ and $d$). A natural question left open by our work is constructing tight lower bounds in this setting.

## References

Basu, D., Dimitrakakis, C., and Tossou, A. Differential Privacy for Multi-armed Bandits: What Is It and What Is Its Cost? arXiv e-prints, art. arXiv:1905.12298, May 2019.

Bertsimas, D. and Mersereau, A. J. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135, 2007. doi: 10.1287/opre.1070.0427.

Bittau, A., Erlingsson, U., Maniatis, P., Mironov, I., Raghunathan, A., Lie, D., Rudominer, M., Kode, U., Tinnes, J., and Seefeld, B. Prochlo: Strong Privacy for Analytics in the Crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP '17, pp. 441–459, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450350853. doi: 10.1145/3132747.3132769.

Bogunovic, I., Losalka, A., Krause, A., and Scarlett, J. Stochastic Linear Bandits Robust to Adversarial Attacks. In Banerjee, A. and Fukumizu, K. (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 991–999. PMLR, 13–15 Apr 2021.

Bouneffouf, D., Rish, I., and Aggarwal, C. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. IEEE, 2020.

Chen, S., Koehler, F., Moitra, A., and Yau, M. Online and distribution-free robustness: Regression and contextual bandits with Huber contamination. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 684–695. IEEE, 2022.

Cheu, A. Differential Privacy in the Shuffle Model: A Survey of Separations. arXiv e-prints, art. arXiv:2107.11839, 2021.

Chowdhury, S. R. and Zhou, X. Shuffle private linear contextual bandits. arXiv e-prints, art. arXiv:2202.05567, 2022.
Diakonikolas, I. and Kane, D. M. Recent Advances in Algorithmic High-Dimensional Robust Statistics. arXiv e-prints, art. arXiv:1911.05911, November 2019.

Dimitrakakis, C., Nelson, B., Mitrokotsa, A., and Rubinstein, B. I. Robust and private Bayesian inference. In International Conference on Algorithmic Learning Theory, pp. 291–305. Springer, 2014.

Dwork, C. and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

Esfandiari, H., Karbasi, A., Mehrabian, A., and Mirrokni, V. Regret bounds for batched bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7340–7348, 2021.

Esfandiari, H., Mirrokni, V., and Narayanan, S. Tight and robust private mean estimation with few users. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 16383–16412. PMLR, 17–23 Jul 2022.

Garcelon, E., Roziere, B., Meunier, L., Tarbouriech, J., Teytaud, O., Lazaric, A., and Pirotta, M. Adversarial attacks on linear contextual bandits. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 14362–14373. Curran Associates, Inc., 2020.

Garcelon, E., Chaudhuri, K., Perchet, V., and Pirotta, M. Privacy amplification via shuffling for linear contextual bandits. In Dasgupta, S. and Haghtalab, N. (eds.), Proceedings of The 33rd International Conference on Algorithmic Learning Theory, volume 167 of Proceedings of Machine Learning Research, pp. 381–407. PMLR, 29 Mar–01 Apr 2022.

Ghazi, B., Kumar, R., Manurangsi, P., and Nguyen, T. Robust and private learning of halfspaces. In International Conference on Artificial Intelligence and Statistics, pp. 1603–1611. PMLR, 2021.

Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.

Gupta, A., Koren, T., and Talwar, K. Better algorithms for stochastic bandits with adversarial corruptions. In Beygelzimer, A. and Hsu, D. (eds.), Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pp. 1562–1578. PMLR, 25–28 Jun 2019.

Han, Y., Liang, Z., Wang, Y., and Zhang, J. Generalized linear bandits with local differential privacy. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 26511–26522. Curran Associates, Inc., 2021.

Hanna, O. A., Girgis, A. M., Fragouli, C., and Diggavi, S. Differentially Private Stochastic Linear Bandits: (Almost) for Free. arXiv e-prints, art. arXiv:2207.03445, July 2022.

Jun, K.-S., Li, L., Ma, Y., and Zhu, J. Adversarial attacks on stochastic bandits. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.

Kothari, P., Manurangsi, P., and Velingker, A. Private robust estimation by stabilizing convex relaxations. In Conference on Learning Theory, pp. 723–777. PMLR, 2022.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.
Li, F., Zhou, X., and Ji, B. Differentially private linear bandits with partial distributed feedback. In 2022 20th International Symposium on Modeling and Optimization in Mobile, Ad hoc, and Wireless Networks (WiOpt), pp. 41–48. IEEE, 2022a.

Li, M., Berrett, T. B., and Yu, Y. On robustness and local differential privacy. arXiv preprint arXiv:2201.00751, 2022b.

Li, Y., Lou, E. Y., and Shan, L. Stochastic Linear Optimization with Adversarial Corruption. arXiv e-prints, art. arXiv:1909.02109, 2019.

Liu, F. and Shroff, N. Data poisoning attacks on stochastic bandits. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4042–4050. PMLR, 09–15 Jun 2019.

Liu, X., Kong, W., Kakade, S., and Oh, S. Robust and differentially private mean estimation. Advances in Neural Information Processing Systems, 34:3887–3901, 2021.

Lykouris, T., Mirrokni, V., and Paes Leme, R. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pp. 114–122, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450355599. doi: 10.1145/3188745.3188918.

Prasad, A., Balakrishnan, S., and Ravikumar, P. A unified approach to robust mean estimation. arXiv preprint arXiv:1907.00927, 2019.

Shariff, R. and Sheffet, O. Differentially private contextual linear bandits. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Tenenbaum, J., Kaplan, H., Mansour, Y., and Stemmer, U. Differentially Private Multi-Armed Bandits in the Shuffle Model. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 24956–24967. Curran Associates, Inc., 2021.

Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

Wu, Y., Zhou, X., Tao, Y., and Wang, D. On Private and Robust Bandits. arXiv e-prints, art. arXiv:2302.02526, 2023. doi: 10.48550/arXiv.2302.02526.

Zheng, K., Cai, T., Huang, W., Li, Z., and Wang, L. Locally differentially private (contextual) bandits learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 12300–12310. Curran Associates, Inc., 2020.

## A. Proofs from Section 2.2

We will work with the empirical second-moment and covariance matrices defined below:

$$\tilde{\Sigma}_{G_0} := \frac{1}{|G_0|} \sum_{i \in G_0} M_n^{-1/2} y_i^2\, a_i a_i^{\mathsf{T}} M_n^{-1/2}, \qquad \Sigma_{G_0} := \tilde{\Sigma}_{G_0} - \theta_{G_0} \theta_{G_0}^{\mathsf{T}}, \tag{25a}$$

$$\tilde{\Sigma}_{n} := \frac{1}{n} \sum_{i=1}^{n} M_n^{-1/2} y_i^2\, a_i a_i^{\mathsf{T}} M_n^{-1/2}, \qquad \Sigma_{n} := \tilde{\Sigma}_{n} - \theta_{n} \theta_{n}^{\mathsf{T}}. \tag{25b}$$

In addition, we will use the vector notation

$$\theta_{G_0} := \frac{1}{|G_0|} \sum_{i \in G_0} M_n^{-1/2} y_i a_i, \qquad \theta_{n} := \frac{1}{n} \sum_{i=1}^{n} M_n^{-1/2} y_i a_i, \qquad y := (y_1, \ldots, y_n)^{\mathsf{T}}. \tag{26}$$

Our guarantees will depend on the maximal weighted norm of the elements $a_i$, which we will denote by

$$\mu := \max_{i=1,\ldots,n} \|a_i\|^2_{M_n^{-1}}. \tag{27}$$

Finally, we make the following assumption:

Assumption A.1. Fix $\delta$ to be a desired failure probability. The corruption probability $\alpha$ satisfies $\alpha \geq \frac{\log(1/\delta)}{n}$.

To approximate $\theta_{G_0}$, we will first reduce the above problem to robust mean estimation and apply the spectral filtering algorithm from the robust statistics literature.
In (Prasad et al., 2019), the authors provide the following guarantee: Theorem A.2 (Prasad et al. (2019, Theorem 4)). Suppose that λ ΣG0 op and that the set of inliers, G0, satisfies n + log(1/δ) n c, where c is a dimension-independent constant. Then with probability at least 1 δ, the spectral filtering algorithm for robust mean estimation terminates in at most O((n |G0|) + log(1/δ)) steps and returns an estimate eθ satisfying eθ θG0 C n + log(1/δ) In light of Theorem A.2, we will control the quantities involved. Before we proceed, we state the following bound for the size of G0 that we will repeatedly appeal to throughout: Lemma A.3. Let n log(1/δ) α . Then with probability at least 1 δ, we have Proof. Let Sn = Pn i=1 1 {i / G0}, which is a sum of i.i.d. Bernoulli random variables with parameter α. From the Chernoff bound (Vershynin, 2018, Exercise 2.3.5), it follows that for δ (0, 1], we have P |Sn nα| p nα log(1/δ) δ, (30) after setting δ = q αn 1 in (30). The claim follows since Robust and private stochastic linear bandits A.1. Controlling the empirical mean We now control the deviation of θG0 from the mean of the dataset absent any corruptions. Lemma A.4. With probability at least 1 δ, we have θG0 n(1 α) µα(1 α) log(1/δ). (31) Proof. Define the following collection of random variables: Qi := (Zi E [Zi])M 1/2 n yiai, with Zi = 1 {i G0} . (32) Clearly, we have E [Qi] = 0. At the same time, E h Qi 2i = Var(Zi)y2 i ai 2 M 1 n α(1 α)y2 i µ. Applying the vector Bernstein inequality (Gross, 2011, Theorem 12), we obtain µα(1 α) + t µα(1 α) y 2 Consequently, we may set t = y p µα(1 α) log(1/δ) to obtain the claimed probability. Finally, we note that i G0 M 1/2 n yiai (1 α) i=1 M 1/2 n yiai = |G0| θG0 n(1 α)θn. A.2. Putting everything together We now combine the bounds from Appendix A.1 and Theorem A.2. We first note that ΣG0 = eΣG0 θG0θT G0 eΣG0 i G0 y2 i M 1/2 n aia T i M 1/2 n i G0 y2 i M 1/2 n aia T i M 1/2 n op Id i G0 y2 i M 1/2 n ai 2Id i G0 y2 i ai 2 M 1 n Id, which implies that the spectral norm of ΣG0 is bounded from above by ΣG0 op y G0 2 |G0| max i ai 2 M 1 n µ y 2 |G0| . (33) At the same time, we appeal to Lemma A.3 to deduce that |G0| (1 α)n 3 p n log(1/δ) (1 α)n 2 , for n 18 log(1/δ) Consequently, we can replace the previous upper bound with ΣG0 op 2µ y 2 Robust and private stochastic linear bandits We now appeal to Theorem A.2. Note that Lemma A.3 yields n 1 (1 α) + Therefore, the estimate eθ computed by the spectral filtering algorithm satisfies n + log(1/δ) Taking a union bound over Lemmas A.3 and A.4, we deduce that (34) holds with probability at least 1 2δ. A.3. Application to phased elimination Let θLS := M 1 n Pn i=1 yiai denote the least squares solution from an approximate G-optimal design, and define θG0 = M 1 X i G0 yiai = M 1/2 |G0| θG0, (35a) θ = n M 1/2eθ. (35b) Note that θ can be computed from the output of Algorithm 2, while θG0 only serves for the analysis. With these at hand, we have the following decomposition: a, θ θ = a, θ θG0 + a, θG0 θLS + a, θLS θ (36) In what follows, we bound each term in (36) separately. A.3.1. BOUNDING THE FIRST TERM IN (36) The first term in (36) is equal to D M 1/2a, M 1/2( θ θG0) E = D M 1/2a, neθ |G0| θG0 E = D M 1/2a, n(eθ θG0) E + (n |G0|) D M 1/2a, θG0 E a M 1 n(eθ θG0) + (n |G0|) D M 1/2a, θG0 E (37) In particular, the second term in (37) is given by D M 1/2a, θG0 E = 1 |G0| M 1/2a, M 1/2 X 1 |G0| a M 1 1 |G0| a M 1 i G0 aia T i 1/2 X using the fact that P i G0 aia T i Pn i=1 aia T i . 
Let AG0 be a matrix whose rows are the vectors {ai | i G0}. We have i G0 aia T i = AT G0AG0, and X i G0 yiai = AT G0y G0. Letting AG0 = UΣV T denote the economic SVD of AG0, we thus have X i G0 aia T i 1/2 X = (AT G0AG0) 1/2AT G0y G0 = V Σ 1V TV ΣU Ty G0 y . (39) Robust and private stochastic linear bandits Plugging Eq. (39) into Eq. (38) and the result into Eq. (37), we obtain D M 1/2a, M 1/2( θ θG0) E a M 1 n(eθ θG0) + n |G0| Using Eq. (34), the bound a M 1 µ, and Lemma A.3 with α log(1/δ) n , the above becomes: D M 1/2a, M 1/2( θ θG0) E µ y n α + log(1/δ) A.3.2. BOUNDING THE SECOND TERM IN (36) Recall that θG0 = M 1/2 |G0| θG0. We further decompose the second term in (36) into a, θG0 θLS = D M 1/2a, M 1/2( θG0 θLS) E M 1/2a, M 1/2 X = M 1/2a, |G0| θG0 n |G0|θn = M 1/2a, |G0| θG0 n(1 α) + D M 1/2a, nαθn E (41) The first term in (41) can be upper bounded using Lemma A.4. Indeed, M 1/2a, |G0| θG0 n(1 α) µα log(1/δ) µ y p α log(1/δ). We now simplify the second term in (41). With ei = ri ai, θ , we obtain D M 1/2a, nαθn E = α i=1 M 1yiai i=1 M 1ai( ai, θ + ei) i=1 aia T i a, M 1ai ei = α a, θ + α a, M 1ai ei. Since maxa A | a, θ | 1, combining the two bounds above yields a, θG0 θLS µ y p α log(1/δ) + α a, M 1ai ei A.3.3. BOUNDING THE THIRD TERM IN (36) The last term is straightforward to bound. Let ei = yi ai, θ and note that θLS θ = M 1 n X i=1 yiai θ = M 1 n X i=1 ai( ai, θ + ei) θ = M 1 n X i=1 aiei. (43) Robust and private stochastic linear bandits A.3.4. PUTTING EVERYTHING TOGETHER Combining Eqs. (40), (42) and (43) yields the following robust confidence intervals: " n α + log(1/δ) i=1 ei a, M 1ai + α. (44) B. Missing proofs from Section 3 B.1. Missing proofs from Section 3.1 B.1.1. PROOF OF LEMMA 3.1 Proof. We write ei = M(ri) ai, θ = ηi + ξi, ηi Sub G(1), ξi Lap 2 εpriv . Now, define the random variables Xi := ηi a, M 1ai ; Yi := ξi a, M 1ai . The family {Xi} is subgaussian with Xi ψ2 a, M 1ai . Consequently, i=1 Xi 2 ψ2 i=1 Tr(a TM 1aia T i M 1a) i=1 aia T i M 1a = a 2 M 1 . Therefore, applying the Hoeffding inequality (Vershynin, 2018, Theorem 2.6.2) yields: i=1 ηi a, M 1ai c1 a M 1 p On the other hand, when ξi Lap(2/εpriv), we have the Bernstein-style bound E h eλ Pn i=1 ξi a,M 1ai i = i=1 E exp λξi a, M 1ai λ2 a, M 1ai 2 using (Vershynin, 2018, Proposition 2.7.1(e)) in the last step. Collecting terms we obtain λ2 a, M 1ai 2 λ2 Pn i=1 a, M 1ai 2 Now, appealing to (Vershynin, 2018, Proposition 2.7.1(a)), we obtain the concentration bound i=1 ξi a, M 1ai c2 a M 1 log(1/δ) Combining the two bounds yields the result. Robust and private stochastic linear bandits B.1.2. PROOF OF LEMMA 3.3 Proof. Let π below denote an approximate G-optimal design in the sense of Definition 1.6. We have i=1 aia T i a supp(π) naaa T a supp(π) π(a)aa T Consequently, we have the inequality a 2 M 1 = a, M 1a = a, (n M(π)) 1a n a, M 1(π)a = a 2 M 1(π) using the fact that π is an approximate G-optimal design in the last inequality. B.1.3. PROOF OF LEMMA 3.3 Proof. With y = y1 . . . yn T, we have y n y . To control the latter, we note max i | ai, θ + ηi + ξi| max i {| ai, θ | + |ηi| + |ξi|} 1 + max i |ηi| + max i |ξi| . Since ηi Sub G(1), standard concentration inequalities for subgaussian maxima yield P max i |ηi| C p log(n/δ) δ. (47) Similarly, ξi are subexponential with parameter 2/εpriv. By a union bound and (Vershynin, 2018, Proposition 2.7.1), P max i |ξi| t i=1 P (|ξi| t) n exp ( ε2 privt2 Setting t := 4 log(n/δ) εpriv above yields maxi |ξi| 4 log(n/δ) εpriv with probability at least 1 δ. 
Finally, taking a union bound and relabelling yields the result. B.1.4. PROOF OF THEOREM 3.4 Proof. We perform a regret analysis under the LDP model (M1). First, we simplify Proposition 2.1 using Lemmas 3.1 and 3.3 and the assumption α log(1/δ)/n. Letting µ := maxa A a 2 M 1, we have a, eθ θ µ n α + p αn log(1/δ) 1 + p log(n/δ) + log(n/δ) for any fixed a with probability at least 1 δ by suitably adjusting constants. Robust and private stochastic linear bandits In particular, when a1, . . . , an are drawn from an approximate G-optimal design, Lemma 3.3 implies that µ max a A a 2 M 1 n 2d so the bound in (48) can be written as log(n/δ) + log(n/δ) By standard arguments (see, e.g., the proof of (Esfandiari et al., 2021, Theorem 5.1)), we may focus on bounding the regret conditioned on the good event where all the invocations to the coreset construction and robust filtering algorithms succeed. This requires us to choose failure probability δ proportional to δ /(KT 2), where T is the number of rounds, K is the size of the action space, and δ is an overall desired failure probability. To ease notation, we relabel δ in this manner below. Now, recalling the width of the confidence interval log(qi/δ) + log(qi/δ) we have the following expression for the regret: i=1 (arms pulled) (instantaneous regret) i=1 qi4γi 1 log(qi 1/δ) + log(qi 1/δ) To bound the first sum above, we notice that qi = q q B/2 1 For the second sum, we first bound log(qi 1/δ) log(T B 1 B /δ) log(T/δ), followed by log(qi 1/δ) + log(qi 1/δ) log(T/δ) + log(T/δ) Finally, we note that when B log(T) we have q B/2 1 T and q = T 1/B e. Therefore, Td log(1/δ) log(T/δ) + log(T/δ) B.2. Missing proofs from Section 3.2 B.2.1. PROOF OF LEMMA 3.6 Proof. We have ev = M(rv) v, θ = ηv + ξv, where ηv Sub G(1/na) and ξv Lap(2/naεpriv). Therefore, ηv + ξv (d) = 1 na ηv + 1 na ξv, ηv Sub G(1), ξv Lap(2/εpriv). Robust and private stochastic linear bandits Consequently, we may trace the proof of Lemma 3.1 to arrive at v H ηv a, M 1v a M 1 p c1 na + c2 p log(1/δ) naεpriv This completes the proof after noticing that na νm. B.2.2. PROOF OF LEMMA 3.7 Proof. Recall that M = P v H vv T. In particular, we have X v H vv T = X v supp(ˆπ) X v H ˆπ(v)vv T = a 2 M 1 a 2 M 1(ˆπ) 2d, (53) where the last inequality follows since π is an approximate G-optimal design. B.2.3. PROOF OF LEMMA 3.8 Proof. Let k = |supp(ˆπ)| and let y1, . . . , yk be an enumeration of the elements yv, v supp(ˆπ). With y = y1 . . . yk T, we have y k y . To control the latter, we note max v | v, θ + ηv + ξv| max v {| v, θ | + |ηv| + |ξv|} 1 + max v |ηv| + max i |ξv| . Since ηv Sub G(1/nv), standard concentration inequalities for subgaussian maxima yield max v |ηv| C Similarly, ξv are subexponential with parameter 2 nvεpriv . By a union bound and (Vershynin, 2018, Proposition 2.7.1), P max v |ξv| t X v supp(ˆπ) P (|ξv| t) v supp(ˆπ) P (|ξv| t) ( minv n2 vε2 privt2 8 , minv nvεprivt ( ν2m2ε2 privt2 8 , νmεprivt Setting t := 4 log(k/δ) νmεpriv above yields maxv |ξv| 4 log(k/δ) νmεpriv with probability at least 1 δ. Finally, taking another union bound and relabelling yields the result. B.2.4. PROOF OF THEOREM 3.9 Proof. We proceed with deriving an expression for the robust confidence interval from Proposition 2.1 under (M2). Indeed, with probability at least 1 δ, for any fixed a A we have: 1 + 1 εpriv νm + log(k/δ) Robust and private stochastic linear bandits where k := |supp(π)|. Recall we can find (in poly-time) an approximate G-optimal design π satisfying k := |supp(π)| d log log d. 
(56) Therefore, we may proceed with the regret analysis. Similarly to the proof of Theorem 3.4, we condition on the case where all randomized algorithms and invocations to random events succeed with high probability. Then, with m = qi at round i, we have the following bound: v supp(π) nv qi max {π(v), ν} v supp(π) qi max {π(v), ν} + 1 = supp(π) + qi X v max {π(v), ν} qi(1 + νd log log d). In particular, we have the following property for the sum P i=1 qi = 1 1 + νd log log d i=1 ni = T 1 + νd log log d. (57) Consequently, the regret of the algorithm conditioned on the good event is given by i=1 niγi 1 q(1 + νd log log d) i=0 qiγi, (58) where γi is the width of the confidence interval at round i. The first term in the sum P log(1/δ) εpriv ν which is a term independent of the corruption fraction. The second group of summands in P log(d log log d/δ) νqi + log(d log log d/δ) = T 1 + νd log log d + log(d log log d/δ) q1/2 1 + log(d log log d/δ) εpriv . (60) Finally, summing over i using the last term of the confidence interval yields i=0 qiα = αT 1 + νd log log d. (61) Putting everything together, we arrive at the claimed regret bound: Regret (1 + νd log log d) d T log(1/δ) ν + log(1/δ) log(T) αd log log d + p Td log log d/δ ν + log(d log log d/δ) Robust and private stochastic linear bandits