# fixedbudget_differentially_private_best_arm_identification__d0458bd1.pdf Published as a conference paper at ICLR 2024 FIXED-BUDGET DIFFERENTIALLY PRIVATE BEST ARM IDENTIFICATION Zhirui Chen1, P. N. Karthik2 , Yeow Meng Chee1, and Vincent Y. F. Tan1 1National University of Singapore 2Indian Institute of Technology, Hyderabad zhiruichen@u.nus.edu pnkarthik@ai.iith.ac.in {ymchee,vtan}@nus.edu.sg We study best arm identification (BAI) in linear bandits in the fixed-budget regime under differential privacy constraints, when the arm rewards are supported on the unit interval. Given a finite budget T and a privacy parameter ε > 0, the goal is to minimise the error probability in finding the arm with the largest mean after T sampling rounds, subject to the constraint that the policy of the decision maker satisfies a certain ε-differential privacy (ε-DP) constraint. We construct a policy satisfying the ε-DP constraint (called DP-BAI) by proposing the principle of maximum absolute determinants, and derive an upper bound on its error probability. Furthermore, we derive a minimax lower bound on the error probability, and demonstrate that the lower and the upper bounds decay exponentially in T, with exponents in the two bounds matching order-wise in (a) the sub-optimality gaps of the arms, (b) ε, and (c) the problem complexity that is expressible as the sum of two terms, one characterising the complexity of standard fixed-budget BAI (without privacy constraints), and the other accounting for the ε-DP constraint. Additionally, we present some auxiliary results that contribute to the derivation of the lower bound on the error probability. These results, we posit, may be of independent interest and could prove instrumental in proving lower bounds on error probabilities in several other bandit problems. Whereas prior works provide results for BAI in the fixed-budget regime without privacy constraints or in the fixed-confidence regime with privacy constraints, our work fills the gap in the literature by providing the results for BAI in the fixed-budget regime under the ε-DP constraint. 1 INTRODUCTION Multi-armed bandit problems (Thompson, 1933) form a class of sequential decision-making problems with applications in fields as diverse as clinical trials, internet advertising, and recommendation systems. The common thread in all these applications is the need to balance exploration (learning about the environment) and exploitation (making the best decision given current knowledge). The exploration-exploitation trade-off has been studied extensively in the context of regret minimisation, where the goal is to minimise the cumulative difference between the rewards of the actions taken and the best possible action in hindsight; see Lattimore and Szepesvári (2020) and the references therein for an exhaustive list of works on regret minimisation. On the other hand, the pure exploration framework, which is the focus of this paper, involves identifying the best arm (action) based on a certain criterion such as the highest mean reward. The pure exploration paradigm has been a subject of rigorous study in the literature, predominantly falling within two overarching regimes: the fixed-confidence regime and the fixed-budget regime. In the fixed-confidence regime, the objective is to curtail the anticipated number of trials needed to pinpoint the optimal arm, all while adhering to a predefined maximum allowable error probability. 
Conversely, in the fixed-budget regime, the aim is to suppress the likelihood of erroneous identification of the best arm under a predetermined budget. Motivation: The task of identifying the best arm in a multi-armed bandit setting is non-trivial due to the inherent uncertainty associated with each arm s true reward distribution. This problem is amplified when privacy constraints are considered, such as the need to protect individual-level data in a medical trial or user data in an online advertising setting (Chan et al., 2011). In the context of such data-intensive applications, the notion of differential privacy (Dwork, 2006) has become the gold-standard for the modelling and analytical study of privacy. While there has been growing This work was carried out when P. N. Karthik was a Research Fellow at the National University of Singapore. Published as a conference paper at ICLR 2024 interest in the design of privacy-preserving algorithms for regret minimisation in multi-armed bandits (Basu et al., 2019; Jain et al., 2012; Chan et al., 2011; Guha Thakurta and Smith, 2013; Mishra and Thakurta, 2015; Tossou and Dimitrakakis, 2016), a comparable level of attention has not been directed towards the domain of pure exploration. Addressing this lacuna in the literature, our research aims to investigate differentially private best arm identification within the fixed-budget regime. Problem Setup: Briefly, our problem setup is as follows. We consider a multi-armed bandit in which each arm yields independent rewards supported on the unit interval [0, 1]. Each arm is associated with a known, d-dimensional feature vector, where d is potentially much smaller than the number of arms. The mean reward of each arm is a linear function of the associated feature vector, and is given by the dot product of the feature vector with an unknown d-dimensional vector θ which fully specifies the underlying problem instance. Given a designated budget T and a parameter ε > 0, the objective is to minimise the probability of error in identifying the arm with the largest mean reward (best arm), while concurrently fulfilling a certain ε-differential privacy (ε-DP) constraint delineated in Basu et al. (2019). We explain the specifics of our model and define the ε-DP constraint formally in Section 2 below. Overview of Prior Works: Differential privacy (DP) (Dwork, 2006) and best-arm identification (BAI) (Lattimore and Szepesvári, 2020) have both been extensively investigated in the literature, encompassing a wide array of works. In this section, we discuss a selection of more recent contributions at the intersection of these two topics. Shariff and Sheffet (2018) prove that any ε-DP (viz. (ε, δ)-DP with δ = 0) algorithm must incur an additional regret of at least Ω((K log T)/ε), where K is the number of arms. Building on this result, Sajed and Sheffet (2019) propose an elimination-based algorithm that satisfies the ε-DP constraint and achieves order-wise optimality in the additional regret term. Zheng et al. (2020) study regret minimisation with the (ε, δ)-local differential privacy constraint, a stronger requirement than (ε, δ)-DP, for contextual and generalised linear bandits. Azize and Basu (2022) study the ε-global differential privacy constraint for regret minimisation, and provide both minimax and problem-dependent regret bounds for general stochastic bandits and linear bandits. Chowdhury and Zhou (2023) and Solanki et al. (2023) explore differential privacy in a distributed (federated) setting. 
Chowdhury and Zhou (2023) explore regret minimization with the (ε, δ)-DP constraint in a distributed setting, considering an untrustworthy server. They derive an upper bound on the regret which matches order-wise with the one obtainable under a centralized setting with a trustworthy server; for a similar work that studies regret minimisation in the distributed and centralised settings, see Hanna et al. (2022). Solanki et al. (2023) study federated learning for combinatorial bandits, considering a slightly different notion of privacy than the one introduced in Dwork (2006). We observe that the existing literature on bandits mainly focused on regret minimisation with DP constraint and the pure exploration counterpart has not been studied extensively. In the pure exploration domain, Carpentier and Locatelli (2016) study the fixed-budget BAI problem and obtains a minimax lower bound on the error probability; the authors show that their bound is order-wise tight in the exponent of the error probability. Yang and Tan (2022) investigate fixed-budget BAI for linear bandits and propose an algorithm based on the G-optimal design. They prove a minimax lower bound on the error probability and obtain an upper bound on the error probability of their algorithm OD-LINBAI. Despite the significant contributions of Carpentier and Locatelli (2016), Yang and Tan (2022), Komiyama et al. (2022), and Kato et al. (2023), these works do not take into account DP constraints. Nikolakakis et al. (2021) and Rio et al. (2023) study BAI in the fixed-confidence setting with ε-DP constraint and propose successive elimination-type algorithms, but these works do not derive a lower bound that is a function of the privacy parameter ε. Our work is thus the first to study differentially private best arm identification in fixed-budget regime and provide a lower bound explicitly related to the privacy parameter ε. Our Contributions: We present a novel algorithm for fixed-budget BAI under the ε-DP constraint. Our proposed algorithm, called DP-BAI, is based on the principle of maximizing absolute determinants (or MAX-DET in short). A key aspect of our algorithm is the privatisation of the empirical mean of each arm via the addition of Laplacian noise. The amount of noise added to an arm is inversely proportional to the product of the privacy parameter ε and the number of times the arm is pulled. Recognising the trade-off between the number of arm pulls and the level of noise injected for privatisation, the MAX-DET principle minimises the maximum Laplacian noise injected across all arms, thereby ensuring a small probability of error in identifying the best arm. We believe our work can open for future exploration in precise control over Laplacian noise (crucial to meet the Published as a conference paper at ICLR 2024 ε-DP guarantee) and using other popular techniques in fixed-budget BAI, such as G-optimal designs (Kiefer and Wolfowitz, 1960; Yang and Tan, 2022) and XY-adaptive allocations (Soare et al., 2014), with DP-constraint. We find it analytically convenient to leverage the properties of the MAX-DET collection (cf. Definition 3.1) to satisfy the ε-DP constraint. See Remark 2 for a brief justification on why extending other popular techniques for fixed-budget BAI such as G-optimal design (Yang and Tan, 2022) to the differential privacy setting of our work is not readily feasible. Additionally, we establish the first-known lower bound on the error probability under the ε-DP constraint for a class of hard problem instances. 
We demonstrate that both the upper and lower bounds decay exponentially relative to the budget T. The exponents in these bounds capture the problem complexity through a certain hardness parameter, which we show can be expressed as the sum of two terms: one measuring the complexity of the standard fixed-budget BAI without privacy constraints, and the other accounting for the ε-DP constraint. We also present some auxiliary findings, such as the properties of the so-called early stopping version of a BAI policy (see Lemmas 4.3 and 4.5), that contribute to the derivation of the lower bound, which may be of independent interest and could prove instrumental in deriving lower bounds on error probabilities in several other bandit problems. Our work stands out as the first in the field to provide precise and tight bounds on the error probability for fixed-budget BAI under the ε-DP constraint, achieving order-wise optimal exponents in both the lower and the upper bounds. 2 NOTATIONS AND PRELIMINARIES Consider a multi-armed bandit with K > 2 arms, in which each arm yields independent and identically distributed (i.i.d.) rewards, and the rewards are statistically independent across arms. Let [K] := {1, . . . , K} denote the set of arms. For i [K], let νi denote the rewards distribution of arm i. As in several prior works (Chowdhury and Zhou, 2022; Shariff and Sheffet, 2018; Zhou and Chowdhury, 2023), we assume throughout the paper that νi is supported in [0, 1] for all i [K]. We impose a linear structure on the mean rewards of the arms. That is, for each i [K], we assume that arm i is associated with a feature vector ai Rd, where d is the dimension of the feature vector, and the mean reward of arm i is given by µi := a i θ for some fixed and unknown θ Rd. We assume that the feature vectors of the arms {ai}K i=1 are known beforehand to a decision maker, whose goal it is to identify the best arm i = argmaxi [K] µi; we assume that the best arm is unique and defined unambiguously. The Fixed-Budget Regime: The decision maker is allowed to pull the arms sequentially, one at each time t {1, 2, . . .}. Let At [K] denote the arm pulled by the decision maker at time t, and let Ni,t = Pt s=1 1{As=i} denote the number of times arm i is pulled up to time t. Upon pulling arm At, the decision maker obtains the instantaneous reward XAt,NAt,t [0, 1]; here, Xi,n νi denotes the reward obtained on the nth pull of arm i. Notice that E[Xi,n] = µi = a i θ for all i [K] and n 1. For all t, the decision to pull arm At is based on the history of arm pulls and rewards seen up to time t, i.e., At is a (random) function of Ht := (A1, XA1,NA1,1, . . . , At 1, XAt 1,NAt 1,t 1). Given a fixed budget T < , the objective of the decision maker is to minimise the probability of error in finding the best arm after T rounds of arm pulls, while also satisfying a certain differential privacy constraint outlined below. We let ˆIT denote the best arm output by the decision maker. The ε-Differential Privacy Constraint: Let X := {x = (xi,t)i [K],t [T ]} [0, 1]KT denote the collection of all possible rewards outcomes from the arms. Any sequential arm selection policy of the decision maker may be viewed as taking inputs from X and producing (A1, . . . , AT , ˆIT ) [K]T +1 as outputs in the following manner: for an input x = (xi,t) X, Output at time t = 1 : A1 = A1, Output at time t = 2 : A2 = A2(A1, x A1,NA1,1), Output at time t = 3 : A3 = A3(A1, x A1,NA1,1, A2, x A2,NA2,2), ... Output at time t = T : AT = AT (A1, x A1,NA1,1, . . . 
, AT 1, x NAT 1,T 1), Terminal output : ˆIT = ˆIT (A1, x A1,NA1,1, . . . , AT , x NAT ,T ). (1) Published as a conference paper at ICLR 2024 We say that x = (xi,t) and x = (x i,t) are neighbouring if they differ in exactly one location, i.e., there exists (i, t) [K] [T] such that xi,t = x i,t and xj,s = x j,s for all (j, s) = (i, t). With the viewpoint in (1), we now introduce the notion of ε-differential privacy for a sequential policy of the decision maker, following the lines of Nikolakakis et al. (2021, Section 5). Definition 2.1. Given any ε > 0, a randomised policy M : X [K]T +1 satisfies ε-differential privacy if, for any pair of neighbouring x, x X, PM(M(x) S) eε PM(M(x ) S) S [K]T +1. (2) Remark 1. A generalization of the notion of ε-differential privacy is that of (ε, δ)-differential privacy (Dwork et al., 2014, Chapter 2). For the sake of simplicity in exposition, in the main text, we provide details for the former. An extension of our algorithm, called DP-BAI-GAUSS, will be shown to be applicable to the latter (generalized) notion of differential privacy. The details and accompanying analyses of the performance of DP-BAI-GAUSS can be found in Appendix D. While the actual sequence of rewards observed under M is random, it is important to note that a pair of reward sequences, say (x, x ), is fixed when specifying the ε-DP constraint. In (2), PM denotes the probability measure induced by the randomness arising from only the arm outputs under M. In the sequel, we refer to the tuple v = ((ai)i [K], (νi)i [K], θ , ε) as a problem instance, and let P denote to be the set of all problem instances that admit a unique best arm. Given v P and a policy π, we write Pπ v to denote the probability measure induced under π and under the instance v. When the dependence on v is clear from the context, we simply write Pπ. 3 OUR METHODOLOGY To meet the ε-DP guarantee, our approach is to add Laplacian noise to the empirical mean reward of each arm, with the magnitude of the noise inversely proportional to the product of ε and the number of times the arm is pulled. Intuitively, to minimize the maximum Laplacian noise that is added (so as to minimize the failure probability of identifying the best arm), we aim to balance the number of pulls for each arm in the current active set. To this end, we employ the MAX-DET explained below. The MAX-DET Collection: Fix d N. For any set S Rd with |S| = d vectors, each of length d , let DET(S) to denote the absolute value of the determinant of the d d matrix formed by stacking the vectors in S as the columns of the matrix. Definition 3.1. Fix d N. Given any finite set A Rd with |A| d , we say B A with |B| = d is a MAX-DET collection of A if DET(B) DET(B ) for all B A with |B | = d . (3) Thus, a MAX-DET collection B A has the maximum absolute determinant among all subsets of A with the same cardinality as B. If span(A) = d , the vectors in B are linearly independent, and any b A may be expressed as a linear combination of the vectors in B. Call the coefficients appearing in this linear combination expression for b as its coordinates (Meyer, 2000, Chapter 4). The set of coordinates of each b A is unique, and b may be expressed alternatively as a d -length vector of its coordinates. In this new system of coordinates, the vectors in B constitute the standard basis vectors. 3.1 THE DIFFERENTIALLY PRIVATE BEST ARM IDENTIFICATION (DP-BAI) POLICY We now construct a policy based on the idea of successive elimination (SE) of arms. 
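As a concrete illustration of Definition 3.1 above, the following is a minimal sketch (in Python, with function names of our own choosing) of how a MAX-DET collection and the associated coordinates could be computed for a small active set by brute force; it is meant only to make the definition explicit, and is not the implementation used in our experiments.

```python
from itertools import combinations

import numpy as np


def max_det_collection(arm_vectors):
    """Return arm indices forming a MAX-DET collection (Definition 3.1).

    `arm_vectors` maps an arm index to its vector in R^{d'}; the vectors are
    assumed to span R^{d'}, so every maximiser is a basis.  Brute-force
    enumeration is exponential in general and is used purely for illustration.
    """
    d = len(next(iter(arm_vectors.values())))
    best_subset, best_det = None, -1.0
    for subset in combinations(arm_vectors, d):
        mat = np.column_stack([arm_vectors[i] for i in subset])
        det = abs(np.linalg.det(mat))
        if det > best_det:
            best_subset, best_det = list(subset), det
    return best_subset


def coordinates(arm_vectors, basis, i):
    """Coordinates of arm i in the basis formed by a MAX-DET collection."""
    B = np.column_stack([arm_vectors[j] for j in basis])
    return np.linalg.solve(B, arm_vectors[i])


# Toy usage: three arms in R^2; the MAX-DET pair spans the plane.
arms = {1: np.array([1.0, 0.0]), 2: np.array([0.9, 0.1]), 3: np.array([0.0, 1.0])}
basis = max_det_collection(arms)            # here, arms 1 and 3
print(basis, coordinates(arms, basis, 2))   # coordinates have absolute value <= 1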
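```

By Lemma G.1 in the appendix, the coordinates of any arm with respect to a MAX-DET basis always lie in [-1, 1]; this is precisely the property that keeps the imputed private means of unsampled arms well controlled.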
Our policy for Differentially Private Best Arm Identification, called DP-BAI, operates over a total of M phases, where M is designed to have order O(log d). In each phase p [M], the policy maintains an active set Ap of arms which are potential contenders for emerging as the best arm. The policy ensures that with high probability, the true best arm lies within the active set in each phase. Policy-Specific Notations: We now introduce some policy-specific notations. Let λ = inf{β 2 : βlog(d) K d2/4 }, (4) Let {gi}i 0 and {hi}i 0 be defined as follows: g0 = min{K, d2/4 }, gi = gi 1/2 i 1, (5) h0 = max{K d2/4 , 0}, hi = (hi 1 + 1)/λ 1 i 1. (6) Published as a conference paper at ICLR 2024 Let s0 = g0 + h0, and for each p [M], let sp = |Ap| denote the number of active arms at the beginning of phase p, defined via sp = g0 + hp 1, 1 p M1, gp M1, M1 < p M + 1. (7) For α > 0, let Lap 1 α denote the Laplacian distribution with density fα(z) = α 2 e α |z|, z R. Initialisation: We initialise our policy with the following parameters: M1 = min{i N : hi = 0}, M = M1 + min{i N : gi = 1} 1, T = T M1d (M M1) d2/4 , a(0) i = ai i [K], d0 = d, T0 = 0, A1 = [K]. (8) Policy Description: We now describe the DP-BAI policy. The policy takes as inputs the differential privacy parameter ε, budget T, the number of arms K, and the feature vectors of the arms {ai : i [K]}. With the initialisation in (8), the policy operates in phases. In each phase p [M], the first step is dimensionality reduction (Yang and Tan, 2022), whereby the dimension of the set of vectors {a(p 1) i : i Ap} is reduced using a linear transformation; here, a(p 1) i Rdp 1 for all i Ap. More specifically, suppose that dp := dim(span{a(p 1) i : i Ap}). The policy chooses an arbitrary orthogonal basis Up = (u(p) 1 , . . . , u(p) dp ) for span{a(p 1) i : i Ap}, and obtains a new set of vectors a(p) i := [a(p 1) i ]Up, for all i Ap, (9) where [v]Up denotes the coordinates of v with respect to Up. Subsequently, the policy checks if dp < sp, where sp = |Ap| is as defined in (7). If this is true, then the policy constructs a MAX-DET collection Bp Ap consisting of |Bp| = dp arms, and pulls each arm i Bp for T Mdp many times, and sets Tp = Tp 1 + dp T Mdp . On the other hand, if dp sp, then the policy pulls each arm in Msp many times, and sets Tp = Tp 1 + sp T Msp . After pulling the arms according to the preceding rule, the policy computes ˆµ(p) i = 1 Ni,Tp Ni,Tp 1 s=Ni,Tp 1+1 Xi,s (10) for each arm i Ap that was pulled at least once in phase p, and subsequently computes its private empirical mean eµ(p) i via eµ(p) i = ˆµ(p) i + eξ(p) i , (11) where eξ(p) i Lap 1 (Ni,Tp Ni,Tp 1)ε is independent of the arm pulls and arm rewards. For i Ap that was not pulled in phase p, the policy computes its corresponding private empirical mean via eµ(p) i = X j Bp αi,j eµ(p) j , (12) where (αi,j)j Bp is the unique set of coefficients such that a(p) i = P j Bp αi,j a(p) j . At the end of phase p, the policy retains only the top sp+1 arms with the largest private empirical means and eliminates the remaining arms; intuitively, these arms are most likely to produce the highest rewards in the subsequent phases. At the end of the Mth phase, the policy returns the only arm left in AM+1 as the best arm. For pseudo-code of the DP-BAI policy, see Algorithm 1. Remark 2. It is natural to wonder why we do not devise a differentially private version of ODLin BAI (Yang and Tan, 2022), the state-of-the-art linear fixed-budget BAI algorithm, which uses G-optimal designs. 
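Returning to the policy description above, the following minimal sketch (our own illustrative code, covering only the branch dp < sp and reusing max_det_collection from the earlier sketch) shows how the empirical means in (10) are privatised via (11), how private means are imputed for the unsampled arms via (12), and how the top sp+1 arms are then retained.

```python
import numpy as np

rng = np.random.default_rng(0)


def dp_bai_phase(active, vectors, pulls_per_arm, eps, sample_reward, s_next):
    """One phase of DP-BAI when d_p < s_p (a sketch of Lines 6-12 and 19-20).

    `vectors` maps each active arm to its phase-p feature vector,
    `sample_reward(i)` draws one reward of arm i in [0, 1], and
    `pulls_per_arm` plays the role of ceil(T'/(M * d_p)).
    """
    basis = max_det_collection({i: vectors[i] for i in active})
    private_mean = {}
    for i in basis:
        emp_mean = np.mean([sample_reward(i) for _ in range(pulls_per_arm)])  # eq. (10)
        noise = rng.laplace(scale=1.0 / (pulls_per_arm * eps))                # eq. (11)
        private_mean[i] = emp_mean + noise
    B = np.column_stack([vectors[j] for j in basis])
    for i in active:
        if i not in private_mean:
            alpha = np.linalg.solve(B, vectors[i])   # a_i = sum_j alpha_j a_j
            private_mean[i] = float(np.dot(alpha, [private_mean[j] for j in basis]))  # eq. (12)
    # keep the s_{p+1} arms with the largest private empirical means
    return sorted(active, key=lambda i: private_mean[i], reverse=True)[:s_next]
```

Note that every arm retained at the end of the phase is ranked using statistics built only from the Laplace-privatised means of the MAX-DET arms, so the unsampled arms consume no additional privacy budget.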
A proposal to do so, called DP-OD, is provided in Appendix E. However, the error probability in identifying the best arm under DP-OD depends not only on the suboptimality gaps of the arms, but is also a function of the arm vectors. For example, in a 2-armed bandit instance, let a1 = [x, 0] , a2 = [0, y] with x, y > 0, and θ = [(0.5+ )/x, 0.5/y] . Then, µ1 = 0.5+ , µ2 = 0.5, and the suboptimality gap = µ1 µ2. For this instance, the upper bound on the error probability of DP-OD is exp Ω T 2+ x y x y (ϵ ) 1 . We observe that x y x y can be made arbitrarily large. Thus, this bound is inferior to the upper bound of DP-BAI (equal to exp( Ω( T 2+(ϵ ) 1 )) and independent of the arm vectors). See Appendix E for further details. Published as a conference paper at ICLR 2024 Algorithm 1 Fixed-Budget Differentially Private Best Arm Identification (DP-BAI) ε: differential privacy parameter; T: budget; {ai : i [K]}: d-dimensional feature vectors. Output: ˆIT : best arm. 1: Initialise T0 = 0, A1 = [K], a(0) i = ai for all i [K]. Set M and T as in (8). 2: for p {1, 2, . . . , M} do 3: Set dp = dim(span{a(p 1) i : i Ap}). 4: Obtain the new vector set {a(p) i : i Ap} from the set {a(p 1) i : i Ap} via (9). 5: Compute sp using (7). 6: if dp < sp then 7: Construct a MAX-DET collection Bp Ap. 8: Pull each arm in Bp for T Mdp many times. Update Tp Tp 1 + dp T 9: Obtain the empirical means {ˆµi(p) : i Bp} via (10). 10: Generate eξ(p) i Lap 1 ε T Mdp 11: Set eµ(p) i ˆµ(p) i + eξ(p) i for all i Bp. 12: For arm i Ap \ Bp, compute eµ(p) i via (12). 13: else 14: Pull each arm in Ap for T Msp many times. Update Tp Tp 1 + sp T 15: Obtain the empirical means {ˆµ(p) i : i Ap} via (10). 16: Generate eξ(p) i Lap 1 ε T Msp 17: Set eµ(p) i ˆµ(p) i + eξ(p) i for all i Ap. 18: end if 19: Compute sp+1 using (7). 20: Ap+1 the set of sp+1 arms with largest private empirical means among {eµ(p) i : i Ap}. 21: end for 22: ˆIT the only arm remaining in AM+1 23: return Best arm ˆIT . 4 THEORETICAL RESULTS We now present theoretical results for the DP-BAI policy, followed by a minimax lower bound on the error probability. We write ΠDP-BAI to denote the DP-BAI policy symbolically. The first result below, proved in Appendix F, asserts that ΠDP-BAI meets the ε-DP constraint for any ε > 0. Proposition 4.1. The DP-BAI policy with privacy and budget parameters (ε, T) satisfies the ε-DP constraint, i.e., for any pair of neighbouring x, x X, PΠDP-BAI(ΠDP-BAI(x) S) eε PΠDP-BAI(ΠDP-BAI(x ) S) S [K]T +1. (13) In (13), the probabilities appearing on either sides of (13) are with respect to the randomness in the arms output by ΠDP-BAI for fixed neighbouring reward sequences x, x X (see Section 2). The use of Laplacian noise for the privatisation of the empirical means of the arms (see Lines 10-11 and 16-17 in Algorithm 1) plays a crucial role in showing (13). 4.1 THE HARDNESS PARAMETER Recall that a problem instance v may be expressed as the tuple v = ((ai)i [K], (νi)i [K], θ , ε). In this section, we capture the hardness of such an instance in terms of the instance-specific arm sub-optimality gaps and the privacy parameter ε. Recall that the arm means under the above instance v are given by µi = a i θ for all i [K]. Let i := µi (v) µi denote the sub-optimality gap of arm i [K]. Further, let (l1, . . . , l K) be a permutation of [K] such that l1 l2 . . . l K, and let (i) := li for all i [K]. 
The hardness of instance v is defined as H(v) := HBAI(v) + Hpri(v), (14) Published as a conference paper at ICLR 2024 where HBAI(v) := max 2 i (d2 K) i 2 (i) and Hpri(v) := 1 ε max 2 i (d2 K) i (i) . (15) Going forward, we omit the dependence of H, HBAI, and Hpri on v for notational brevity. It is worthwhile to mention here the quantity in (14) specialises to the hardness term H2 in Audibert et al. (2010), which is identical to HBAI, when K d2 and ε + . The former condition K d2 holds, for instance, for a standard K-armed bandit with K = d, θ Rd as the vector of arm means, and {ai}d i=1 as the standard basis vectors in Rd. Intuitively, while HBAI quantifies the difficulty of fixed-budget BAI without privacy constraints, Hpri accounts for the ε-DP constraint and captures the additional difficulty of BAI under this constraint. 4.2 UPPER BOUND ON THE ERROR PROBABILITY OF DP-BAI In this section, we provide an upper bound on the error probability of DP-BAI. Theorem 4.2. Fix v P. Let i (v) denote the unique best arm of instance v. For all sufficiently large T, the error probability of ΠDP-BAI with budget T and privacy parameter ε satisfies PΠDP-BAI v (ˆIT = i (v)) exp T where M and T are as defined in (8). In (16), PΠDP-BAI v denotes the probability measure induced by ΠDP-BAI under the instance v. This is proved in Appendix G. Since M = Θ(log d) and T = Θ(T) (as T ), (16) implies that PΠDP-BAI v (ˆIT = i (v)) = exp Ω T H log d 4.3 LOWER BOUND ON THE ERROR PROBABILITY In this section, we derive the first-of-its-kind lower bound on the error probability of fixed-budget BAI under the ε-DP constraint. Towards this, we first describe an auxiliary version of a generic policy that takes as input three arguments a generic policy π, n N, and ι [K] and pulls an auxiliary arm (arm 0) whenever arm ι is pulled n or more times under π. We believe that such auxiliary policies are potentially instrumental in deriving lower bounds on error probabilities in other bandit problems. The Early Stopping Policy: Suppose that the set of arms [K] is augmented with an auxiliary arm (arm 0) which yields reward 0 each time it is pulled; recall that the arm rewards are supported in [0, 1]. Given a generic policy π, n N and ι [K], let ES(π, n, ι) denote the early stopping version of π with the following sampling and recommendation rules. Sampling rule: given a realization Ht 1 = (a1, x1, . . . , at 1, xt 1), if Pt 1 s=1 1{as=ι} < n, then PES(π,n,ι)(At A | Ht 1) = Pπ(At A | Ht 1) A [K], (18) and if Pt 1 s=1 1{as=ι} n, then PES(π,n,ι)(At = 0 Ht 1) = 1. That is, as long as arm ι is pulled for a total of fewer than n times, the sampling rule of ES(π, n, ι) is identical to that of π. Else, ES(π, n, ι) pulls arm ι with certainty. Recommendation rule: Given history HT = (a1, x1, . . . , a T , x T ), if PT s=1 1{as=0} = 0, then PES(π,n,ι)(ˆIT A | HT ) = Pπ(ˆIT A | HT ) A [K], (19) and if PT s=1 1{as=0} > 0, then PES(π,n,ι)(ˆIT = 0 | HT ) = 1. That is, if the auxiliary arm 0 is not pulled under π, the recommendation of ES(π, n, ι) is consistent with that of π. Else, ES(π, n, ι) recommends arm 0 as the best arm. The next result below provides a bridge between a policy π and its early stopped version. Lemma 4.3. Fix a problem instance v P, policy π, n N, and ι [K]. For any A [K] and E = {ˆIT A} {Nι,T < n}, Pπ v(E) = PES(π,n,ι) v (E). (20) Published as a conference paper at ICLR 2024 In addition, let X (n,ι) := {(xi,t)i [K],t [ni] : (xi,t)i [K],t [T ] X} Rn1 . . . Rn K, where ni = T for all i = ι and nι = n. 
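To make the construction concrete, the following is a minimal sketch (our own illustrative code; the interface consisting of next_arm and recommend is an assumption, not notation from the paper) of the early stopping wrapper ES(π, n, ι) described by the sampling rule (18) and the recommendation rule (19).

```python
class EarlyStopped:
    """Early-stopped version ES(pi, n, iota) of a policy pi (Section 4.3).

    Arm 0 is the auxiliary arm, assumed to return reward 0 on every pull.
    As long as arm `iota` has been pulled fewer than `n` times, the wrapper
    defers to `pi`; afterwards it pulls arm 0 and finally recommends arm 0.
    """

    def __init__(self, pi, n, iota):
        self.pi, self.n, self.iota = pi, n, iota
        self.pulls_of_iota = 0
        self.pulled_aux_arm = False

    def next_arm(self, history):
        if self.pulls_of_iota >= self.n:      # sampling rule (18): stop exploring
            self.pulled_aux_arm = True
            return 0
        arm = self.pi.next_arm(history)       # otherwise behave exactly like pi
        if arm == self.iota:
            self.pulls_of_iota += 1
        return arm

    def recommend(self, history):
        if self.pulled_aux_arm:               # recommendation rule (19)
            return 0
        return self.pi.recommend(history)
```

Lemma 4.3 above states that, on the event that arm ι is pulled fewer than n times, the wrapper and the original policy agree in distribution, which is what makes the wrapper useful for change-of-measure arguments.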
Notice that Definition 2.1 readily extends to any randomised policy that maps X (n,ι) to {0, . . . , K}T +1. We then have the following corollary to Lemma 4.3. Corollary 4.4. If π : X [K]T +1 meets the ε-DP constraint, then ES(π, n, ι) : X (n,ι) {0, . . . , K}T +1 also meets the ε-DP constraint. Given the early stopping version of a policy π, the following lemma provides a bridge between two problem instances v, v P. Lemma 4.5. Fix a policy π, n N, ι [K], and ε > 0, and suppose that M = ES(π, n, ι) satisfies the ε-DP constraint with respect to X (n,ι). For any pair of instances v = ((ai)i [K], (νi)i [K], θ , ε) and v = ((ai)i [K], (ν i)i [K], θ , ε), with θ = θ , νι = ν ι, and νi = ν i for all i = ι, we have PM v M((Xi,j)i [K],j [ni]) S eε PM v M((Xi,j)i [K],j [ni]) S S {0, . . . , K}T +1 (21) where in (21), (i) ε = 6εn TV(vι, v ι), with TV(vι, v ι) being the total variation distance between the distributions νι and ν ι, and (ii) ni = T for all i = ι and nι = n. The proof of Lemma 4.5 follows exactly along the lines of the proof of Karwa and Vadhan (2018, Lemma 6.1) and is omitted. Leveraging Lemma 4.5 in conjunction with Lemma 4.3 provides us with a change-of-measure technique, facilitating the transition from Pπ v to Pπ v under any given policy π. This change-of-measure technique serves as the foundation that enables us to derive the subsequent minimax lower bound on the error probability. Definition 4.6. A policy π for fixed-budget BAI is said to be consistent if lim T + Pπ v ˆIT = i (v) = 0, v P. (22) Theorem 4.7 (Lower Bound). Fix any β1, β2, β3 [0, 1] with β1 + β2 + β3 < 3, a consistent policy π, and a constant c > 0. For all sufficiently large T, there exists an instance v P such that Pπ v ˆIT = i (v) > exp T c(log d)β1(HBAI(v)β2 + Hpri(v)β3) Consequently, inf π consistent lim inf T + sup v P Pπ v ˆIT = i (v) exp T c(log d)β1(HBAI(v)β2 + Hpri(v)β3) (24) for any c > 0 and β1, β2, β3 [0, 1] with β1 + β2 + β3 < 3. Theorem 4.7, proved in Appendix H, implies that for any chosen β [0, 1) (arbitrarily close to 1), there does not exist a consistent policy π with an upper bound on its error probability assuming any one of the following forms for all instances v P: exp Ω T (log d)β(HBAI(v)+Hpri(v)) , exp Ω T (log d)(HBAI(v)β+Hpri(v)) , or exp Ω T (log d)(HBAI(v)+Hpri(v)β) . In this sense, the dependencies of the upper bound in (16) on log d, HBAI(v), and Hpri(v) are tight . Also, in this precise sense, none of these terms can be improved upon in general. Remark 3. It is pertinent to highlight that the upper bound in (16) applies to any problem instance, whereas the lower bound in (23) is a minimax result that is applicable to one or more hard instances. An ongoing quest in fixed-budget BAI is to construct a policy with provably matching error probability upper and lower bounds for all problem instances. 5 NUMERICAL STUDY This section presents a numerical evaluation of our proposed DP-BAI policy on synthetic data, and compares it with BASELINE, an algorithm which follows DP-BAI but for Lines 6 to 13 in Algorithm 1, i.e., BASELINE does not construct MAX-DET collections. 
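For comparison, BASELINE replaces Lines 6 to 13 of Algorithm 1 with uniform sampling of the entire active set; a minimal sketch in the same style as the earlier phase sketch (again with our own function names, and with pulls_per_arm standing in for ceil(T'/(M * s_p))) is given below.

```python
import numpy as np

rng = np.random.default_rng(0)


def baseline_phase(active, pulls_per_arm, eps, sample_reward, s_next):
    """One phase of BASELINE: every active arm is pulled and privatised;
    no MAX-DET collection is constructed and no means are imputed."""
    private_mean = {}
    for i in active:
        emp_mean = np.mean([sample_reward(i) for _ in range(pulls_per_arm)])
        noise = rng.laplace(scale=1.0 / (pulls_per_arm * eps))
        private_mean[i] = emp_mean + noise
    return sorted(active, key=lambda i: private_mean[i], reverse=True)[:s_next]
```

Because every active arm must be sampled, the per-arm pull count is smaller than under the MAX-DET rule whenever dp < sp, so both the sampling noise and the injected Laplace noise are larger; this is the gap that the experiments below illustrate empirically.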
We note that BASELINE is ε-DP for any ε > 0, and bears similarities with SEQUENTIAL HALVING (Karnin et al., 2013) when ε + (i.e., Published as a conference paper at ICLR 2024 50 100 150 200 250 300 350 400 450 500 Success Rate OD-Lin BAI DP-BAI (ε = 0.1) DP-OD (ε = 0.1) Baseline (ε = 0.1) 0.1 0.3 0.5 0.7 0.9 1.1 1.3 1.5 1.7 1.9 ε in DP-BAI, Baseline and DP-OD Success Rate OD-Lin BAI DP-BAI DP-OD Baseline Figure 1: Comparison of DP-BAI to BASELINE, OD-LINBAI and DP-OD for different values of ε. Note that ε is not applicable to OD-LINBAI. non-private algorithm). However, because it does not exploit the linear structure on the arm means, we will see that it performs poorly vis-à-vis DP-BAI. In addition, we compare DP-BAI with the state-of-the-art OD-LINBAI (Yang and Tan, 2022) algorithm for fixed-budget best arm identification, which is a non-private algorithm and serves as an upper bound in performance (in terms of the error probability) of our algorithm. Also, we consider an ε-DP version of OD-LINBAI which we call DP-OD. A more comprehensive description of the DP-OD algorithm is presented in Appendix E.1. Our synthetic instance is constructed as follows. We set K = 30, d = 2, and θ = [0.045 0.5] , a1 = [0 1] , a2 = [0 0.9] , a3 = [10 0] , and ai = [1 ωi] for all i {4, . . . , 30}, where ωi is randomly generated from a uniform distribution on the interval [0, 0.8]. Clearly, µ1 = 0.5, µ2 = µ3 = 0.45, and µi = ωi/2 + 0.045 for all i {4, . . . , 30}, thereby implying that arm 1 is the best arm. The sub-optimality gaps are given by 2 = 3 = 0.05 and i > 0.05 for all i {4, . . . , 30}; thus, arms 2 and 3 exhibit the smallest gaps. In addition, we set νi, the reward distribution of arm i, to be the uniform distribution supported on [0, 2µi] for all i [K]. We run experiments with several choices for the budget T and the privacy parameter ε, conducting 1000 independent trials for each pair of (T, ε) and reporting the fraction of trials in which the best arm is successfully identified. The experimental results are shown in Figure 1 for varying T and ε values respectively. As the results demonstrate, the DP-BAI policy significantly outperforms BASELINE and DP-OD, demonstrating that the utility of the MAX-DET collection in exploiting the linear structure of the arm means. We also observe that as ε + (i.e., privacy requirement vanishes), the performances of DP-BAI and the non-private state-of-the-art OD-LINBAI algorithm are similar. 6 CONCLUSIONS AND FUTURE WORK This work has taken a first step towards understanding the effect of imposing a differential privacy constraint on the task of fixed-budget BAI in bandits with linearly structured mean rewards. Our contributions include the development and comprehensive analysis of a policy, namely DP-BAI, which exhibits exponential decay in error probability with respect to the budget T, and demonstrates a dependency on the dimensionality of the arm vectors d and a composite hardness parameter, which encapsulates contributions from both the standard fixed-budget BAI task and the imposed differential privacy stipulation. A distinguishing aspect in the design of this policy is the critical utilization of the MAX-DET collection, instead of existing tools like the G-optimal designs (Yang and Tan, 2022) and XY-adaptive allocations (Soare et al., 2014). Notably, we establish a minimax lower bound that underlines the inevitability of certain terms in the exponent of the error probability of DP-BAI. 
Some interesting directions for future research include extending our work to incorporate generalized linear bandits (Azizi et al., 2022) and neural contextual bandits (Zhou et al., 2020). Additionally, we aim to tackle the unresolved question raised after Theorem 4.7: does there exist an efficient fixed-budget BAI policy respecting the ε-DP requirement whose error probability upper bound approximately matches a problem-dependent lower bound?

ACKNOWLEDGEMENTS

This research/project is supported by the National Research Foundation Singapore under the AI Singapore Programme (AISG Award No: AISG2-TC-2023-012-SGIL). This work is also supported by the Singapore Ministry of Education Academic Research Fund (AcRF) Tier 2 under grant number A-8000423-00-00, and the Singapore Ministry of Education AcRF Tier 1 under grant number A-8000189-01-00.

REFERENCES

Audibert, J.-Y., Bubeck, S., and Munos, R. (2010). Best arm identification in multi-armed bandits. In COLT, pages 41–53.

Azize, A. and Basu, D. (2022). When privacy meets partial information: A refined analysis of differentially private bandits. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K., editors, Advances in Neural Information Processing Systems.

Azize, A. and Basu, D. (2023). Interactive and concentrated differential privacy for bandits. In Sixteenth European Workshop on Reinforcement Learning.

Azizi, M., Kveton, B., and Ghavamzadeh, M. (2022). Fixed-budget best-arm identification in structured bandits. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 2798–2804.

Basu, D., Dimitrakakis, C., and Tossou, A. (2019). Privacy in multi-armed bandits: Fundamental definitions and lower bounds. arXiv preprint arXiv:1905.12298.

Carpentier, A. and Locatelli, A. (2016). Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pages 590–604. PMLR.

Chan, T.-H. H., Shi, E., and Song, D. (2011). Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1–24.

Chowdhury, S. R. and Zhou, X. (2022). Shuffle private linear contextual bandits. In International Conference on Machine Learning, pages 3984–4009. PMLR.

Chowdhury, S. R. and Zhou, X. (2023). Distributed differential privacy in multi-armed bandits. In The Eleventh International Conference on Learning Representations.

Dwork, C. (2006). Differential privacy. In Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II 33, pages 1–12. Springer.

Dwork, C., Roth, A., et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407.

Guha Thakurta, A. and Smith, A. (2013). (Nearly) optimal algorithms for private online learning in full-information and bandit settings. Advances in Neural Information Processing Systems, 26:2733–2741.

Hanna, O. A., Girgis, A. M., Fragouli, C., and Diggavi, S. (2022). Differentially private stochastic linear bandits: (Almost) for free. arXiv preprint arXiv:2207.03445.

Jain, P., Kothari, P., and Thakurta, A. (2012). Differentially private online learning. In Conference on Learning Theory, pages 24–1. JMLR Workshop and Conference Proceedings.

Karnin, Z., Koren, T., and Somekh, O. (2013). Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pages 1238–1246. PMLR.
Karwa, V. and Vadhan, S. (2018). Finite sample differentially private confidence intervals. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl–Leibniz-Zentrum für Informatik.

Kato, M., Imaizumi, M., Ishihara, T., and Kitagawa, T. (2023). Asymptotically minimax optimal fixed-budget best arm identification for expected simple regret minimization. arXiv preprint arXiv:2302.02988.

Kiefer, J. and Wolfowitz, J. (1960). The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366.

Komiyama, J., Tsuchiya, T., and Honda, J. (2022). Minimax optimal algorithms for fixed-budget best arm identification. Advances in Neural Information Processing Systems, 35:10393–10404.

Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.

Meyer, C. D. (2000). Matrix analysis and applied linear algebra, volume 71. SIAM.

Mishra, N. and Thakurta, A. (2015). (Nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 592–601.

Nikolakakis, K. E., Kalogerias, D. S., Sheffet, O., and Sarwate, A. D. (2021). Quantile multi-armed bandits: Optimal best-arm identification and a differentially private scheme. IEEE Journal on Selected Areas in Information Theory, 2(2):534–548.

Rio, A., Barlier, M., Colin, I., and Soare, M. (2023). Multi-agent best arm identification with private communications. In International Conference on Machine Learning.

Sajed, T. and Sheffet, O. (2019). An optimal private stochastic-MAB algorithm based on optimal private stopping rule. In International Conference on Machine Learning, pages 5579–5588. PMLR.

Shariff, R. and Sheffet, O. (2018). Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems, volume 31, pages 4301–4311.

Sheffet, O. (2015). Private approximations of the 2nd-moment matrix using existing techniques in linear regression. arXiv preprint arXiv:1507.00056.

Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27:828–836.

Solanki, S., Kanaparthy, S., Damle, S., and Gujar, S. (2023). Differentially private federated combinatorial bandits with constraints. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part IV, pages 620–637. Springer.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294.

Tossou, A. and Dimitrakakis, C. (2016). Algorithms for differentially private multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, pages 2087–2093.

Weisberg, S. (2005). Applied linear regression, volume 528. John Wiley & Sons.

Yang, J. and Tan, V. (2022). Minimax optimal fixed-budget best arm identification in linear bandits. Advances in Neural Information Processing Systems, 35:12253–12266.

Zheng, K., Cai, T., Huang, W., Li, Z., and Wang, L. (2020). Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33:12300–12310.

Zhou, D., Li, L., and Gu, Q. (2020). Neural contextual bandits with UCB-based exploration. In Proceedings of the International Conference on Machine Learning, ICML-20, pages 11492–11502.

Zhou, X. and Chowdhury, S. R.
(2023). On differentially private federated linear contextual bandits. ar Xiv preprint ar Xiv:2302.13945. Published as a conference paper at ICLR 2024 Supplementary Material for Fixed-Budget Differentially Private Best Arm Identification A USEFUL FACTS In this section, we collate some useful facts that will be used in the subsequent proofs. Lemma A.1 (Hoeffding s inequality). Let X1, . . . , Xn be independent random variables such that ai Xi bi almost surely for all i [n], for some fixed constants ai bi, i [n]. Let Sn = X1 + + Xn. Then, for all ϵ > 0, P (Sn E [Sn] ϵ) exp 2ϵ2 Pn i=1 (bi ai)2 P (|Sn E [Sn]| ϵ) 2 exp 2ϵ2 Pn i=1 (bi ai)2 Definition A.2 (Sub-exponential random variable). Let τ R and b > 0. A random variable X is said to be sub-exponential with parameters (τ 2, b) if E [exp(λX)] exp λ2τ 2 λ : |λ| < 1 Lemma A.3 (Linear combination of sub-exponential random variables). Let X1, . . . , Xn be independent, zero-mean sub-exponential random variables, where Xi is τ 2 i , bi -sub-exponential. Then, for any a1, . . . , an R, the random variable Pn i=1 ai Xi is (τ 2, b)-sub-exponential, where τ 2 = Pn i=1 a2 i τ 2 i and b = maxi bi |ai|. Lemma A.4 (Tail bounds of sub-exponential random variables). Suppose that X is sub-exponential with parameters τ 2, b . Then, P(X E[X] ϵ) B AN EXAMPLE OF THE POLICY-SPECIFIC NOTATIONS IN SEC. 3.1 Consider d = 16, K = 10, 000. Then, we have g0 = 64, h0 = 9936, M1 = 4, M = 10 and λ 27.65. In the following table, we display the values of sp for p = 1, . . . , M1. p sp (sp 1 g0)/(sp g0) 1 1000 N.A. 2 423 27.65 3 77 27.68 4 (= M1) 64 27.62 For p = M1 + 1, . . . , M, we simply have sp = 210 p, i.e., s5 = 32, s6 = 16, . . . , s10 = 1. Notice that the values in the third column of the above table (apart from the first row which is not applicable) are nearly equal to the value of λ. Based on the above table, we may bifurcate the operations of our algorithm into two distinct stages. Roughly speaking, in the first stage (i.e., phases 1 to M1), the algorithm aims to reduce the number of arms to a predetermined quantity (i.e., g0) as quickly as possible. Following this, in the second stage (i.e., phase M1 + 1 to M), the algorithm iteratively halves the number of arms. This process is designed to gradually concentrate arm pulls on those arms with small sub-optimality gaps (including the best arm) as much as possible, thereby improving the accuracy of estimating the mean of the best arm and encouraging effective identification of the best arm. Published as a conference paper at ICLR 2024 C EQUIVALENCE OF OUR NOTION OF DP WITH THE NOTION OF TABLE-DP OF AZIZE AND BASU (2023) In this section, we discuss the notion of table-DP introduced in the recent work by Azize and Basu (2023, Section 2), and we will show that our definition of differential privacy in Definition 2.1 is equivalent to the notion of table-DP. Firstly, to maintain notational consistency with Azize and Basu (2023), we introduce a dual decision maker who obtains the reward XAt,t rather than the reward XAt,NAt,t upon pulling arm At at time step t; see Section 2 for other relevant notations. Then, for any policy π, we write D(π) to denote its dual policy obtained by substituting its decision maker with the dual decision maker. Clearly, in the non-private setting, any policy is equivalent to its dual policy. We demonstrate below that the same conclusion holds under DP considerations too. To begin, we formally define DP for the dual policy. Definition C.1 (Column-neighbouring). 
We say that x, x X are column-neighbouring if they differ in exactly one column, i.e., there exists t [T] such that x ,t = x ,t and x ,s = x ,s for all s = t, where x ,s := (xi,s)K i=1 and x ,s := (x i,s)K i=1 . Definition C.2 (Table-DP). Given any ε > 0 and a randomised policy M : X [K]T +1, we say that the dual policy D(M) satisfies ε-table-DP if, for any pair of column-neighbouring x, x X, PD(M) (D(M) (x) S) eε PD(M) (D(M)(x ) S) S [K]T +1. For any z = (zt)T +1 t=1 [K]T +1, we denote [z]j i := (zt)j t=i [K]j i+1. Then, we have the following alternative characterization of table-DP. Corollary C.3. Given any ε > 0 and a randomised policy M : X [K]T +1, the dual policy D(M) satisfies ε-table-DP if and only if for any t [T] and any pair of column-neighbouring x, x X with x ,t = x ,t, PD(M) [D(M) (x)]T +1 t+1 S eε PD(M) [D(M)(x )]T +1 t+1 S S [K]T t+1. In the following, we demonstrate that the notions of DP in Definition C.2 and Definition 2.1 are equivalent after some slight modifications in notations. Proposition C.4. Fix any policy π. If π satisfies ε-DP, then D(π) satisfies ε-table-DP. Proof. Fix any z = (zt)T +1 t=1 [K]T +1, and any pair of column-neighbouring x D, x D X. For i [K] and t [T], we let t=1 1{zt=i}. (29) Notice that ni,t is equal to Ni,t if Aτ = zτ for all τ t. Then, we construct x, x X by defining for all t [T] xzt,nzt,t := x D zt,t and x zt,nzt,t := x D zt,t, (30) and for all (i, j) [K] [T] \ {(zt, nzt,t) t [T]}, xi,j := 0 and x i,j := 0. Hence, the fact that x D, x D X are column-neighbouring, implies that either x and x are neighbouring or x = x . That is, by the assumption that π satisfies ε-DP, we have Pπ π(x) = z eεPπ π(x ) = z . (31) In addition, by (29) and (30) we obtain that Pπ π(x) = z = PD(π) D(π)(x D) = z and Pπ π(x ) = z = PD(π) D(π)(x D) = z . (32) Finally, combining (31) and (32), we have PD(π) D(π)(x D) = z eεPD(π) D(π)(x D) = z , which completes the desired proof. Published as a conference paper at ICLR 2024 Proposition C.5. Fix any policy π. If D(π) satisfies ε-table-DP, then π satisfies ε-DP. Proof. Fix any z = (zi)T +1 i=1 [K]T +1, and any pair of neighbouring x, x X. Again, for i [K] and t [T] we let t=1 1{zt=i}. (33) Then, we construct x D, x D X by defining for all t [T] x D zt,t := xzt,nzt,t and x D zt,t := x zt,nzt,t, (34) and for all (i, j) [K] [T] \ {(zt, t) t [T]}, x D i,j := 0 and x D i,j := 0. Hence, the fact that x, x X are neighbouring, implies that either x D and x D are columnneighbouring or x D = x D. That is, by the assumption that D(π) satisfies ε-table-DP, we have PD(π) D(π)(x D) = z eεPD(π) D(π)(x D) = z . (35) In addition, by (33) and (34) we obtain Pπ π(x) = z = PD(π) D(π)(x D) = z and Pπ π(x ) = z = PD(π) D(π)(x D) = z . (36) Finally, combining (35) and (36), we have Pπ π(x) = z eεPπ π(x ) = z , which completes the proof. Combining Propositions C.4 and C.5, we obtain the following corollary. Corollary C.6. A policy π satisfies ε-DP if and only if D(π) satisfies ε-table-DP. This proves the equivalence betweeen our notion of DP in Definition 2.1 and the notion of table-DP appearing in the work of Azize and Basu (2023). D EXTENSION OF OUR RESULTS TO (ε, δ)-DIFFERENTIAL PRIVACY Below, we first define the (ε, δ)-differential privacy constraint formally. Definition D.1 ((ε, δ)-differential privacy). Given any ε > 0 and δ > 0, a randomised policy M : X [K]T +1 satisfies (ε, δ)-differential privacy if, for any pair of neighbouring x, x X, PM(M(x) S) eε PM(M(x ) S) + δ S [K]T +1. 
The above definition particularises to that of ε-differential privacy when δ = 0. Therefore, it is clear that any policy that satisfies the ε-DP constraint automatically satisfies (ε, δ)-DP constraint for all δ > 0. In particular, our proposed DP-BAI policy satisfies the (ε, δ)-DP constraint for all δ > 0. However, DP-BAI has been specifically designed for (ε, 0)-DP, and does not adjust the agent s strategy for varying values of δ. Therefore, exclusively for the (ε, δ)-differential privacy constraint with δ > 0, we provide a variant of our proposed algorithm called DP-BAI-GAUSS that utilises the Gaussian mechanism (Dwork et al., 2014) and operates with additive Gaussian noise (in contrast to Laplacian noise under DP-BAI). The pseudo-code of DP-BAI-GAUSS is shown in Algorithm 2, where the differences from DP-BAI are highlighted in red. Proposition D.2. The DP-BAI-GAUSS policy with privacy and budget parameters (ε, δ, T) satisfies the (ε, δ)-DP constraint, i.e., for any pair of neighbouring x, x X and S [K]T +1, PΠDP-BAI-GAUSS(ΠDP-BAI-GAUSS(x) S) eε PΠDP-BAI-GAUSS(ΠDP-BAI-GAUSS(x ) S) + δ. (37) Published as a conference paper at ICLR 2024 Algorithm 2 DP-BAI-GAUSS ε, δ: differential privacy parameters; T: budget; {ai : i [K]}: d-dimensional feature vectors. Output: ˆIT : best arm. 1: Initialise T0 = 0, A1 = [K], a(0) i = ai for all i [K]. Set M and T as in (8). 2: for p {1, 2, . . . , M} do 3: Set dp = dim(span{a(p 1) i : i Ap}). 4: Obtain the new vector set {a(p) i : i Ap} from the set {a(p 1) i : i Ap} via (9). 5: Compute sp using (7). 6: if dp < sp then 7: Construct a MAX-DET collection Bp Ap. 8: Pull each arm in Bp for T Mdp many times. Update Tp Tp 1 + dp T 9: Obtain the empirical means {ˆµi(p) : i Bp} via (10). 10: Generate eξ(p) i Gaussian 0, 2 log(1.25/δ) (ε T Mdp )2 11: Set eµ(p) i ˆµ(p) i + eξ(p) i for all i Bp. 12: For arm i Ap \ Bp, compute eµ(p) i via (12). 13: else 14: Pull each arm in Ap for T Msp many times. Update Tp Tp 1 + sp T 15: Obtain the empirical means {ˆµ(p) i : i Ap} via (10). 16: Generate eξ(p) i Gaussian 0, 2 log(1.25/δ) (ε T Msp )2 17: Set eµ(p) i ˆµ(p) i + eξ(p) i for all i Ap. 18: end if 19: Compute sp+1 using (7). 20: Ap+1 the set of sp+1 arms with largest private empirical means among {eµ(p) i : i Ap}. 21: end for 22: ˆIT the only arm remaining in AM+1 23: return Best arm ˆIT . The proof of Proposition D.2 is deferred until Section I. Below, we present an upper bound on the error probability of DP-BAI-GAUSS. Proposition D.3. For any problem instance v P, PΠDP-BAI-GAUSS v (ˆIT = i (v)) exp T HBAI log d δ ) Hpri log d where HBAI and Hpri are as defined in (15). The proof of Proposition D.3 follows along the lines of the proof of Theorem 4.2, by using sub Gaussian concentration bounds in place of sub-exponential concentration bounds. The proof is deferred until Section J. Remark 4. It is pertinent to highlight that the contribution of our algorithms do not lie in the introduction of a novel privacy mechanism such as Laplace and Gaussian mechanisms. Instead, we propose a new sampling strategy based on the MAX-DET principle for fixed budget BAI. This strategy can be seamlessly integrated into any differential privacy mechanism, including the Laplace and the Gaussian mechanisms. Table 1 shows a comparison of the upper bounds for DP-BAI-GAUSS and DP-BAI (from Theorem 4.2). Because DP-BAI is (ε, δ)-differentially private for all δ > 0 as alluded to earlier, the preceding comparison is valid. We observe that DP-BAI-GAUSS outperforms DP-BAI for large values of δ and T. 
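As a rough illustration of the role of δ in this comparison, the following minimal sketch (our own code; the numerical values are arbitrary) contrasts the noise calibrations used by DP-BAI (Laplace, Algorithm 1) and DP-BAI-GAUSS (Gaussian, Algorithm 2) for an arm pulled n times in a phase.

```python
import numpy as np


def laplace_scale(n_pulls, eps):
    """Scale of the Laplace noise in DP-BAI: 1 / (n * eps)."""
    return 1.0 / (n_pulls * eps)


def gaussian_std(n_pulls, eps, delta):
    """Standard deviation of the Gaussian noise in DP-BAI-GAUSS:
    sqrt(2 * log(1.25 / delta)) / (n * eps); diverges as delta -> 0."""
    return np.sqrt(2.0 * np.log(1.25 / delta)) / (n_pulls * eps)


for delta in (1e-1, 1e-3, 1e-6):
    print(delta, laplace_scale(100, 0.5), gaussian_std(100, 0.5, delta))
```

The Gaussian noise has lighter tails, but its standard deviation diverges as δ approaches 0, which is the trade-off summarised in Table 1.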
In the small δ regime, however, DP-BAI performs better than DP-BAI-GAUSS. In particular, when δ = 0, the Published as a conference paper at ICLR 2024 Condition on T and δ DP-BAI-GAUSS DP-BAI T 4δ )Hpri log d 2 T HBAI log d exp Ω T HBAI log d exp Ω T H log d T H log d T 4δ )Hpri log d 2 < T HBAI log d exp 4δ )Hpri log d exp Ω T H log d 4δ )Hpri log d 2 < T H log d exp 4δ )Hpri log d exp Ω T H log d δ = 0, T > 0 exp( Ω(1)) (vacuous) exp Ω T H log d Table 1: Comparison between the upper bounds of DP-BAI and DP-BAI-GAUSS. The ticks indicate the tighter of the two bounds under the given condition. additive noise in the Gaussian mechanism is infinite and thus vacuous. That is, DP-BAI-GAUSS is not applicable when δ = 0. The above trends are not surprising, given that DP-BAI is designed for δ = 0 and is hence expected to work well for small values of δ. Finally, in order to keep the analysis simple and bring out the main ideas clearly, we hence only discuss ε-DP criterion in our main text, and the discussion regarding of (ε, δ)-DP is presented in this section of the appendix. E A DIFFERENTIALLY PRIVATE VERSION OF OD-LINBAI (YANG AND TAN, 2022) AND ITS ANALYSIS E.1 DETAILS OF IMPLEMENTING DP-OD To achieve ε-DP for DP-OD, a variant of OD-LINBAI (Yang and Tan, 2022), one of the natural solutions is to use the idea of Shariff and Sheffet (2018), which is to add random noise in the ordinary least square (OLS) estimator; we note that OD-LINBAI uses the OLS estimator as a subroutine, similar to the algorithm of Shariff and Sheffet (2018). Specifically, given a dataset {(xi, yi)}n i=1, one of the steps in the computation of the standard OLS estimator (Weisberg, 2005, Chapter 2) is to evaluate the coefficient vector ˆβ via Gˆβ = U, where G is the Gram matrix in the OLS estimator and U = Pn i=1 xiyi is the moment matrix. While Shariff and Sheffet (2018) add random noise to both G and U, we do not add noise to G as the feature vectors are determined and known to the decision maker in our setting. Instead, we add independent Laplacian noise with parameter L ε to each element of U of the OLS estimator, with L = maxi [K] ai ; we use Laplacian noise instead of Gaussian or Wishart noise, noting from Shariff and Sheffet (2018, Section 4.2) and Sheffet (2015, Theorem 4.1) that the latter versions of noise can potentially be infinite and hence not particularly suitable for satisfying the ε-DP constraint. Using Shariff and Sheffet (2018, Claim 7), we may then prove that the above method satisfies the ε-DP constraint. E.2 A BRIEF ANALYSIS OF DP-OD The error probability of identifying the best arm under DP-OD depends not only on the suboptimality gaps of the arms, but also inherently a function of the arm vectors, as demonstrated below. Proposition E.1. Let K = 2, a1 = [x, 0] , a2 = [0, y] with x, y > 0, and θ = [(0.5 + )/x, 0.5/y] for some fixed > 0. The upper bound on the error probability of DP-OD for the above bandit instance is Proof. Note that there is only one round in DP-OD for the two-armed bandit instance. Let T1 and T2 be the number of arm pulls for arm 1 and arm 2, respectively. By the sampling strategy of DP-OD, we have T1 = Θ(T) and T2 = Θ(T) (as T ). From the description in Section E.1, we note that Published as a conference paper at ICLR 2024 the moment matrix is given by U = PT1 i=1 a1X1,i + PT2 i=1 a2X2,i. We denote U to be the moment matrix with Laplacian noise, i.e., e U = U + [eξ1, eξ2] , (38) where eξ1 and eξ2 are independent Laplacian noises with parameters L ε and L = x y, respectively. 
The estimates of the expected reward of arm 1 and arm 2 are eµ1 = G 1 e U, a1 (39) and eµ2 = G 1 e U, a2 , (40) where the Gram matrix G = T1a1a 1 + T2a2a 2 . Then, the event {eµ1 > eµ2} implies that the decision maker successfully identifies the best arm. In other words, P(ˆIT = 1) P(eµ1 eµ2). (41) By (39) and (40), we have eµ1 eµ2 = 1 i=1 X2,i + eξ1 x T1 eξ2 y T2 P(eµ1 eµ2 < 0) P T1 X By Lemma A.1, we have For sufficiently large T1 and T2, by Lemma A.4, we have exp T1 ε 8(x y)/x exp T2 ε 8(x y)/y Recall that T1 = Θ(T) and T2 = Θ(T) (as T ), and by combining (42), (43), (44), (45) and (46), we have P(eµ1 < eµ2) exp Noting that x y x y can be arbitrarily large, the above bound is inferior to the upper bound of DPBAI (equal to exp( Ω( T 2+(ϵ ) 1 )) for the above bandit instance). Figure 2 shows the empirical performances of DP-BAI and DP-OD by fixing x and varying y in this problem instance, where = 0.05, x = 1 is fixed, and y > 1 is allowed to vary, and the reward distribution is uniform distribution in [0, 2µi] for arm i = 1, 2. The success rates are obtained by averaging across 1000 independent trials. We observe that when the ratio x y x y increases, the performance of DP-OD becomes worse. However, the performance of DP-BAI is essentially independent of this ratio as the theory suggests. The numerical results presented in Section 5 of our paper as well as this additional numerical study, showing the empirical performances of DP-BAI and DP-OD, reaffirm the inferiority of DP-OD. Published as a conference paper at ICLR 2024 1 4 7 10 13 16 19 22 25 28 x y x y Success Rate T = 500, ε = 0.2 DP-BAI DP-OD 1 4 7 10 13 16 19 22 25 28 x y x y Success Rate T = 500, ε = 0.4 DP-BAI DP-OD 1 4 7 10 13 16 19 22 25 28 x y x y Success Rate T = 250, ε = 0.2 DP-BAI DP-OD 1 4 7 10 13 16 19 22 25 28 x y x y Success Rate T = 250, ε = 0.4 DP-BAI DP-OD Figure 2: Comparison of DP-BAI to DP-OD for different values of x y x y in the two-armed bandit instance introduced in Appendix E.2. F PROOF OF PROPOSITION 4.1 Let N (p) i := Ni,Tp. Under the (random) process of DP-BAI, we introduce an auxiliary random mechanism π whose input is x X, and whose output is (N (P ) i , eµ(p) i )i [K],p [M], where we define eµ(p) i = 0 if i / Ap. By Dwork et al. (2014, Proposition 2.1), to prove Proposition 4.1, it suffices to show that π is ε-differentially private, i.e., for all neighbouring x, x X, P π( π(x) S) eε P π( π(x ) S) S Ω, (47) where Ωis the set of all possible outputs of π. In the following, we write P π x to denote the probability measure induced under π with input x X. Fix neighbouring z, z X arbitrarily. There exists a unique pair (i, t) with zi,n = z i,n. In addition, fix any n(p) i N and Borel set χ(p) i [0, 1] for i [K] and p [M] such that 0 n(1) i n(2) i . . . n(M) i T i [K]. ( M + 1, if n(M) i < n min{p [M] : n(p) i n}, otherwise. (48) For any j [M], we define the event Dj = { (i, p) [K] [j] : N (p) i = n(p) i , eµ(p) i χ(p) i } and D j = Dj 1 {N (p) j = n(p) j }. Published as a conference paper at ICLR 2024 For j {j + 1, . . . , M} Dj,j = { (i, p) : [K] {j + 1, . . . , j } : N (p) i = n(p) i , eµ(p) i χ(p) i } In particular, D0 and DM,M are events with probability 1, i.e., P π x (D0 DM,M) = 1 x X. Then, if n(M) i < n, by the definition of π we can have P π z (DM) = P π z (DM) . (49) In the following, we assume n(M) i n and we will show P π z (DM) exp(ε)P π z (DM) (50) which then implies that π is ε-differentially private. 
For x ∈ {z, z'}, we denote
µ̄_x = (1/(n_{i*}^{(p*)} − n_{i*}^{(p*−1)})) Σ_{j=n_{i*}^{(p*−1)}+1}^{n_{i*}^{(p*)}} x_{i*,j};
that is, µ̄_x is the value of µ̂_{i*}^{(p*)} on the event D'_{p*} under the probability measure P^π̃_x. Then, by the chain rule, we have for both x ∈ {z, z'}
P^π̃_x(D_M) = P^π̃_x(D'_{p*}) P^π̃_x(µ̃_{i*}^{(p*)} ∈ χ_{i*}^{(p*)} | D'_{p*}) P^π̃_x(D_{p*,M} | D_{p*})
= P^π̃_x(D'_{p*}) ∫_{u∈χ_{i*}^{(p*)}} f_Lap(u − µ̄_x) P^π̃_x(D_{p*,M} | D'_{p*} ∩ {µ̃_{i*}^{(p*)} = u}) du,   (51)
where f_Lap(·) is the probability density function of the Laplace distribution with parameter 1/(ε(n_{i*}^{(p*)} − n_{i*}^{(p*−1)})). Note that by definition it holds that
P^π̃_z(D'_{p*}) = P^π̃_{z'}(D'_{p*}),
P^π̃_z(D_{p*,M} | D'_{p*} ∩ {µ̃_{i*}^{(p*)} = u}) = P^π̃_{z'}(D_{p*,M} | D'_{p*} ∩ {µ̃_{i*}^{(p*)} = u}),
and f_Lap(u − µ̄_z) ≤ exp(ε) f_Lap(u − µ̄_{z'}) for all u ∈ χ_{i*}^{(p*)}.   (52)
Hence, combining (51) and (52), we have P^π̃_z(D_M) ≤ exp(ε) P^π̃_{z'}(D_M), which implies that π̃ is ε-differentially private.

G PROOF OF THEOREM 4.2

Lemma G.1. Fix d' ∈ N and a finite set A ⊂ R^{d'} such that span(A) = R^{d'}. Let B = {b_1, . . . , b_{d'}} be a MAX-DET collection of A. Fix an arbitrary z ∈ A, and let β_1, . . . , β_{d'} ∈ R be the unique coefficients such that z = Σ_{i=1}^{d'} β_i b_i. Then |β_i| ≤ 1 for all i ∈ [d'].

Proof. Fix z ∈ A. Let B̄ = [b_1 b_2 · · · b_{d'}] be the d' × d' matrix whose columns are the vectors in B, and let β = [β_1 β_2 · · · β_{d'}]^T be the vector of coefficients. By the definition of the β_i for i ∈ [d'], we have B̄β = z. In addition, let B̄^{(i)} be the matrix that is identical to B̄ except that its i-th column is replaced by z, i.e., B̄^{(i)} = [b_1 · · · b_{i−1} z b_{i+1} · · · b_{d'}] for i ∈ [d']. Note that by the definition of a MAX-DET collection, we have
|det(B̄^{(i)})| ≤ |det(B̄)|.   (53)
By Cramér's rule (Meyer, 2000, Chapter 6), we have
β_i = det(B̄^{(i)}) / det(B̄).   (54)
Finally, combining (53) and (54), we have |β_i| ≤ 1 for all i ∈ [d'].

Lemma G.2. Fix an instance v ∈ P and p ∈ [M]. Let i*(v) denote the unique best arm under v. For any set Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q, let d_Q = dim(span{a_i : i ∈ Q}). Then, for all sufficiently large T and all j ∈ Q \ {i*(v)},
P_v^{ΠDP-BAI}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤ 2 exp(−T Δ_j²/(16M(d_Q² ∨ s_p))) + 2 exp(−T ε Δ_j/(16M(d_Q² ∨ s_p))).

Proof. Fix j ∈ Q \ {i*(v)}. Notice that the sampling strategy of ΠDP-BAI depends on the relation between |Q| and d_Q. We consider two cases.

Case 1: d_Q > √s_p. In this case, each arm in Q is pulled ⌈T/(M|Q|)⌉ times. Let
E_1 := {µ̂_j^{(p)} − µ̂_{i*(v)}^{(p)} ≥ −Δ_j/2}.   (56)
Let X̄_{i,s} = X_{i,s} − µ_i for all i ∈ [K] and s ∈ [T], and note that X̄_{i,s} ∈ [−1, 1] for all (i, s). Then, by Lemma A.1 applied to the centred rewards of arms j and i*(v) collected in phase p,
P_v^{ΠDP-BAI}(E_1 | A_p = Q) ≤ 2 exp(−⌈T/(M|Q|)⌉ Δ_j²/8).
Furthermore, let
E_2 := {ξ̃_j^{(p)} − ξ̃_{i*(v)}^{(p)} ≥ Δ_j/2}.   (58)
We note that both ξ̃_j^{(p)} and ξ̃_{i*(v)}^{(p)} are (τ², b) sub-exponential, where τ = b = 2/(⌈T/(M|Q|)⌉ ε). Lemma A.3 then yields that ξ̃_j^{(p)} − ξ̃_{i*(v)}^{(p)} is sub-exponential, and consequently Lemma A.4 yields that for all sufficiently large T,
P_v^{ΠDP-BAI}(E_2 | A_p = Q) ≤ exp(−⌈T/(M|Q|)⌉ ε Δ_j/8).
Since {µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)}} ⊆ E_1 ∪ E_2, we therefore have
P_v^{ΠDP-BAI}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤ P_v^{ΠDP-BAI}(E_1 ∪ E_2 | A_p = Q) ≤(a) 2 exp(−⌈T/(M|Q|)⌉ Δ_j²/8) + exp(−⌈T/(M|Q|)⌉ ε Δ_j/8),
where (a) follows from the union bound.

Case 2: d_Q ≤ √s_p. In this case, each arm in B_p ⊆ Q is pulled ⌈T/(M d_Q)⌉ times. Recall that a_j^{(p)} = Σ_{i∈B_p} α_{j,i} a_i^{(p)}. Using (9), this implies a_j = Σ_{i∈B_p} α_{j,i} a_i, which in turn implies that ⟨a_j, θ⟩ = Σ_{i∈B_p} α_{j,i} ⟨a_i, θ⟩. Using (12), we then have
E_v^{ΠDP-BAI}[µ̃_j^{(p)} | A_p = Q] = µ_j.   (61)
If j ∉ B_p, we denote
µ̂_j^{(p)} = Σ_{ι∈B_p} α_{j,ι} µ̂_ι^{(p)}   (62)
and
ξ̃_j^{(p)} = Σ_{ι∈B_p} α_{j,ι} ξ̃_ι^{(p)}.   (63)
Note that (62) and (63) still hold in the case of j ∈ B_p.
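Before proceeding with Case 2, we record a small numerical aside (our own illustration, not part of the proof) showing how the guarantee of Lemma G.1 can be checked directly: a MAX-DET collection is computed here by brute force over all d-subsets (practical only for small instances), and every arm vector is then expressed in that basis via a linear solve, which by Cramér's rule yields coefficients bounded by 1 in absolute value. The helper name `max_det_collection` and the random vectors are illustrative.

```python
import numpy as np
from itertools import combinations

def max_det_collection(A):
    """Return the d-subset of the arm vectors A (rows) whose matrix has the
    largest absolute determinant -- a MAX-DET collection in the sense of Lemma G.1."""
    d = A.shape[1]
    best = max(combinations(range(len(A)), d),
               key=lambda idx: abs(np.linalg.det(A[list(idx)].T)))
    return A[list(best)]

rng = np.random.default_rng(1)
A = rng.uniform(-1, 1, size=(8, 3))          # 8 arm vectors spanning R^3 (a.s.)
B = max_det_collection(A)                    # rows b_1, ..., b_d of the collection
for z in A:
    beta = np.linalg.solve(B.T, z)           # coefficients of z in the basis {b_i}
    assert np.all(np.abs(beta) <= 1 + 1e-9)  # |beta_i| <= 1, as Lemma G.1 asserts
print("all arms have coefficients bounded by 1 in the MAX-DET basis")
```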
We define the event G_1 = {µ̂_j^{(p)} − µ_j ≥ Δ_j/4}. Writing µ̂_j^{(p)} − µ_j as a weighted sum of the centred rewards collected from the arms in B_p during phase p, and using |α_{j,ι}| ≤ 1 from Lemma G.1 together with Lemma A.1, we obtain
P_v^{ΠDP-BAI}(G_1 | A_p = Q) ≤ exp(−T Δ_j²/(16M(d_Q² ∨ s_p))).   (64)
In addition, we define the event G_2 = {ξ̃_j^{(p)} ≥ Δ_j/4}. By Lemma A.3 and Lemma G.1, ξ̃_j^{(p)} = Σ_{ι∈B_p} α_{j,ι} ξ̃_ι^{(p)} is sub-exponential with parameters (Σ_{ι∈B_p} α_{j,ι}² (2/(⌈T/(Md_Q)⌉ ε))², max_{ι∈B_p} |α_{j,ι}| · 2/(⌈T/(Md_Q)⌉ ε)). Then, by Lemma A.4, it holds for T sufficiently large that
P_v^{ΠDP-BAI}(G_2 | A_p = Q) ≤ exp(−T ε Δ_j/(16M(d_Q² ∨ s_p))).   (65)
Define the events G_3 = {µ̂_{i*(v)}^{(p)} − µ_{i*(v)} ≤ −Δ_j/4} and G_4 = {ξ̃_{i*(v)}^{(p)} ≤ −Δ_j/4}. Similarly to (64) and (65), we have
P_v^{ΠDP-BAI}(G_3 | A_p = Q) ≤ exp(−T Δ_j²/(16M(d_Q² ∨ s_p)))
and, for sufficiently large T,
P_v^{ΠDP-BAI}(G_4 | A_p = Q) ≤ exp(−T ε Δ_j/(16M(d_Q² ∨ s_p))).
Hence, in Case 2, since {µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)}} ⊆ G_1 ∪ G_2 ∪ G_3 ∪ G_4, for sufficiently large T we have
P_v^{ΠDP-BAI}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤ P_v^{ΠDP-BAI}(G_1 ∪ G_2 ∪ G_3 ∪ G_4 | A_p = Q) ≤ 2 exp(−T Δ_j²/(16M(d_Q² ∨ s_p))) + 2 exp(−T ε Δ_j/(16M(d_Q² ∨ s_p))).
Finally, combining both Cases 1 and 2, for sufficiently large T we have
P_v^{ΠDP-BAI}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤ 2 exp(−T Δ_j²/(16M(d_Q² ∨ s_p))) + 2 exp(−T ε Δ_j/(16M(d_Q² ∨ s_p))).

Lemma G.3. Fix an instance v ∈ P and p ∈ [M_1]. Recall the definitions of λ in (4) and g_0 in (8). For any set Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q, it holds for all sufficiently large T that
P_v^{ΠDP-BAI}(i*(v) ∉ A_{p+1} | A_p = Q) ≤ 2λ Ψ, where Ψ := exp(−T Δ_{(g_0)}²/(16M(d_Q² ∨ s_p))) + exp(−T ε Δ_{(g_0)}/(16M(d_Q² ∨ s_p))).

Proof. Fix v ∈ P, p ∈ [M_1] and Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q. Let Q_sub be the set of s_p − g_0 + 1 arms in Q with the largest sub-optimality gaps, and let
N_sub = Σ_{j∈Q_sub} 1{µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)}}
be the number of arms in Q_sub whose private empirical mean is at least that of the best arm. Then we have
E_v^{ΠDP-BAI}[N_sub | A_p = Q] = Σ_{j∈Q_sub} P_v^{ΠDP-BAI}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤(a) 2|Q_sub| Ψ = 2(s_p − g_0 + 1) Ψ,   (70)
where (a) follows from Lemma G.2 together with the fact that min_{j∈Q_sub} Δ_j ≥ Δ_{(g_0)}. Consequently,
P_v^{ΠDP-BAI}(i*(v) ∉ A_{p+1} | A_p = Q) ≤(a) P_v^{ΠDP-BAI}(N_sub ≥ s_{p+1} − g_0 + 1 | A_p = Q) ≤(b) E_v^{ΠDP-BAI}[N_sub | A_p = Q]/(s_{p+1} − g_0 + 1) ≤(c) [2(s_p − g_0 + 1)/(s_{p+1} − g_0 + 1)] Ψ ≤(d) 2(h_p + 1) Ψ ≤(e) 2λ Ψ,
where (a) follows from the fact that N_sub ≥ s_{p+1} − g_0 + 1 is a necessary condition for i*(v) ∉ A_{p+1} when A_p = Q, (b) follows from Markov's inequality, (c) is obtained from (70), (d) is obtained from the definition in (7), and (e) is obtained from the definition in (6).

Lemma G.4. Fix an instance v ∈ P and p ∈ {M_1 + 1, . . . , M}. For any set Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q, it holds for all sufficiently large T that
P_v^{ΠDP-BAI}(i*(v) ∉ A_{p+1} | A_p = Q) ≤ 6Φ, where Φ := exp(−T Δ_{(s_{p+2}+1)}²/(16M(d_Q² ∨ s_p))) + exp(−T ε Δ_{(s_{p+2}+1)}/(16M(d_Q² ∨ s_p))) and we define s_{M+2} = 1.

Proof. Fix v ∈ P, p ∈ {M_1 + 1, . . . , M} and Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q. Similarly, let Q_sub be the set of s_p − s_{p+2} arms in Q with the largest sub-optimality gaps, and again let N_sub = Σ_{j∈Q_sub} 1{µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)}}. Then we have
E_v^{ΠDP-BAI}[N_sub | A_p = Q] = Σ_{j∈Q_sub} P_v^{ΠDP-BAI}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤(a) 2|Q_sub| Φ = 2(s_p − s_{p+2}) Φ,   (72)
where (a) follows from Lemma G.2 together with the fact that min_{j∈Q_sub} Δ_j ≥ Δ_{(s_{p+2}+1)}. Similarly,
P_v^{ΠDP-BAI}(i*(v) ∉ A_{p+1} | A_p = Q) ≤(a) P_v^{ΠDP-BAI}(N_sub ≥ s_{p+1} − s_{p+2} + 1 | A_p = Q) ≤(b) E_v^{ΠDP-BAI}[N_sub | A_p = Q]/(s_{p+1} − s_{p+2} + 1) ≤(c) [2(s_p − s_{p+2})/(s_{p+1} − s_{p+2} + 1)] Φ ≤(d) 6Φ,
where (a) follows from the fact that N_sub ≥ s_{p+1} − s_{p+2} + 1 is a necessary condition for i*(v) ∉ A_{p+1} when A_p = Q, (b) follows from Markov's inequality, (c) follows from (72), and (d) is obtained from the definition in (7).

With the above ingredients in place, we are now ready to prove Theorem 4.2.

Proof of Theorem 4.2. Fix an instance v ∈ P.
Recall that in DP-BAI, the decision maker eliminates arms over successive phases, and it successfully identifies the best arm if and only if the best arm is not eliminated in any of the phases. That is,
P_v^{ΠDP-BAI}(Î_T ≠ i*(v)) ≤ Σ_{p=1}^{M} P_v^{ΠDP-BAI}(i*(v) ∉ A_{p+1} | i*(v) ∈ A_p).   (73)
We divide the sum on the right-hand side into two parts:
P_1 = Σ_{p=1}^{M_1} P_v^{ΠDP-BAI}(i*(v) ∉ A_{p+1} | i*(v) ∈ A_p) and P_2 = Σ_{p=M_1+1}^{M} P_v^{ΠDP-BAI}(i*(v) ∉ A_{p+1} | i*(v) ∈ A_p).
In the case K ≤ ⌈d²/4⌉, by definition we have M_1 = 0, which implies that P_1 = 0. In the case K > ⌈d²/4⌉, by Lemma G.3 we obtain
P_1 ≤ 4M_1 λ exp(−T/(64MH)).   (74)
In addition, by Lemma G.4,
P_2 ≤ Σ_{p=M_1+1}^{M} 6[exp(−T Δ_{(s_{p+2}+1)}²/(16Ms_p)) + exp(−T ε Δ_{(s_{p+2}+1)}/(16Ms_p))]
 ≤ Σ_{p=M_1+1}^{M} 6[exp(−T Δ_{(s_{p+2}+1)}²/(16M(4s_{p+2}+4))) + exp(−T ε Δ_{(s_{p+2}+1)}/(16M(4s_{p+2}+4)))]
 ≤ 6(M − M_1) max_{i∈{s_{p+2}+1 : p∈{M_1+1,...,M}}} [exp(−T Δ_{(i)}²/(64Mi)) + exp(−T ε Δ_{(i)}/(64Mi))]
 ≤ 12(M − M_1) exp(−T/(64MH)),   (75)
where the second inequality uses s_p ≤ 4s_{p+2} + 4. Finally, combining (74) and (75), for sufficiently large T we have
P_v^{ΠDP-BAI}(Î_T ≠ i*(v)) ≤ P_1 + P_2 ≤ (12M + 4M_1λ) exp(−T/(64MH)).

H PROOF OF THEOREM 4.7

Let P' be the set of problem instances that satisfy the following properties: (a) the best arm is unique; (b) the feature vectors {a_i}_{i=1}^K ⊂ R^d are mutually orthogonal; (c) K = d > 3; and (d) ε < 1. We have P' ⊆ P, and hence it suffices to consider the instances in P' to prove Theorem 4.7. Note that log(d) > 1, H_BAI > 1 and H_pri > 1 when v ∈ P'. Consequently, for any v ∈ P', T > 0, c > 0, and β_i, β'_i ∈ [0, 1] with β_i ≤ β'_i for i ∈ {1, 2, 3}, we have
exp(−T/(c(log d)^{β'_1}((H_BAI)^{β'_2} + (H_pri)^{β'_3}))) ≥ exp(−T/(c(log d)^{β_1}((H_BAI)^{β_2} + (H_pri)^{β_3}))).
Hence, in addition to restricting attention to instances in P', it suffices to show that (23) holds in the cases (i) β_1 ∈ [0, 1), β_2 = β_3 = 1; (ii) β_2 ∈ [0, 1), β_1 = β_3 = 1; and (iii) β_3 ∈ [0, 1), β_1 = β_2 = 1. In the following, we show that cases (i) and (ii) can be handled by following the argument of Carpentier and Locatelli (2016), while case (iii) is handled via the change-of-measure formula introduced in Subsection 4.3.

First, we present the proof of Lemma 4.3.

Proof of Lemma 4.3. Fix a problem instance v ∈ P, a policy π, n ∈ N, ι ∈ [K], and the event E = {Î_T ∈ A} ∩ {N_{ι,T} < n} for an arbitrary A ⊆ [K]. By the definition of the early-stopping policy, we have
P^π_v({N_{ι,T} < n}) = P^{ES(π,n,ι)}_v({N_{ι,T} < n})   (77)
and
P^π_v({Î_T ∈ A} | {N_{ι,T} < n}) = P^{ES(π,n,ι)}_v({Î_T ∈ A} | {N_{ι,T} < n}).   (78)
By the chain rule, we have
P^π_v(E) = P^π_v({N_{ι,T} < n}) P^π_v({Î_T ∈ A} | {N_{ι,T} < n})   (79)
and
P^{ES(π,n,ι)}_v(E) = P^{ES(π,n,ι)}_v({N_{ι,T} < n}) P^{ES(π,n,ι)}_v({Î_T ∈ A} | {N_{ι,T} < n}).   (80)
That is, P^π_v(E) = P^{ES(π,n,ι)}_v(E).   (81)

Lemma H.1 (Adapted from Carpentier and Locatelli (2016)). Fix K̄ > 0, H̄ > 0, and a non-private consistent policy π. For all sufficiently large T, there exists an instance v ∈ P' with K ≥ K̄ and H_BAI(v) > H̄ such that
P^π_v(Î_T ≠ i*(v)) > exp(−401T/(log(K) H_BAI(v))).   (82)

Note that (82) still holds under the DP constraint, and the privacy parameter ε does not affect H_BAI. Hence, we have the following corollary.

Corollary H.2. Fix K̄ > 0, H̄ > 0, ε̄ > 0 and a consistent policy π. For all sufficiently large T, there exists an instance v ∈ P' with K ≥ K̄, H_BAI(v) > H̄ and ε = ε̄ such that
P^π_v(Î_T ≠ i*(v)) > exp(−401T/(log(K) H_BAI(v))).   (83)

With the above corollary in place, we address Case (i) in Lemma H.3.

Lemma H.3. Fix any β ∈ [0, 1), a consistent policy π, and a constant c > 0. For all sufficiently large T, there exists an instance v ∈ P' such that
P^π_v(Î_T ≠ i*(v)) > exp(−T/(c(log K)^β (H_BAI(v) + H_pri(v)))).   (84)

Proof. Fix c > 0 and β ∈ [0, 1). Let v̄(K̄, H̄, ε̄, T) be the instance in Corollary H.2 that satisfies (83).
Then, when K̄ > exp((802c)^{1/(1−β)}), we have
log(K̄) > 802c (log(K̄))^β, i.e., log(K̄)/401 > 2c(log(K̄))^β.   (85)
In addition, by the definition of H_pri we have
lim_{ε̄→+∞} H_pri(v̄(K̄, H̄, ε̄, T)) = 0,   (86)
which implies that there exists ε*(K̄, H̄, T) such that for all ε̄ > ε*(K̄, H̄, T),
H_pri(v̄(K̄, H̄, ε̄, T)) < H_BAI(v̄(K̄, H̄, ε̄, T)),   (87)
since H_BAI is not affected by the privacy parameter. Finally, combining (83), (85) and (87), we have for all sufficiently large T
P^π_v(Î_T ≠ i*(v)) > exp(−T/(c(log K)^β (H_BAI(v) + H_pri(v)))),   (88)
where v = v̄(K̄, H̄, ε̄, T) with ε̄ > ε*(K̄, H̄, T) and K̄ > exp((802c)^{1/(1−β)}).

Similarly, we address Case (ii) in Lemma H.4.

Lemma H.4. Fix any β ∈ [0, 1), a consistent policy π, and a constant c > 0. For all sufficiently large T, there exists an instance v ∈ P' such that
P^π_v(Î_T ≠ i*(v)) > exp(−T/(c(log K)(H_BAI(v)^β + H_pri(v)))).   (89)

Proof. Again, fix c > 0 and β ∈ [0, 1). Recall that v̄(K̄, H̄, ε̄, T) is the instance in Corollary H.2 that satisfies (83). Similarly, when H̄ > (802c)^{1/(1−β)}, we have
H̄/401 > 2c H̄^β.   (90)
In addition, we have
lim_{ε̄→+∞} H_pri(v̄(K̄, H̄, ε̄, T)) = 0,   (91)
which implies that there exists ε*(K̄, H̄, T, β) such that for all ε̄ > ε*(K̄, H̄, T, β),
H_pri(v̄(K̄, H̄, ε̄, T)) < H_BAI(v̄(K̄, H̄, ε̄, T))^β.   (92)
Finally, combining (83), (90) and (92), we have for all sufficiently large T
P^π_v(Î_T ≠ i*(v)) > exp(−T/(c(log K)(H_BAI(v)^β + H_pri(v)))),   (93)
where v = v̄(K̄, H̄, ε̄, T) with ε̄ > ε*(K̄, H̄, T, β) and H̄ > (802c)^{1/(1−β)}.

Furthermore, before addressing Case (iii), we introduce the following lemma.

Lemma H.5. Fix H̄ > 0, β ∈ [0, 1), a consistent policy π, and K > 3. For all sufficiently large T, there exists an instance v ∈ P' with K arms, H_pri(v) ≥ H̄ and H_pri(v) ≥ (H_BAI(v))^{1/β} such that
P^π_v(Î_T ≠ i*(v)) > exp(−97T/(log(K) H_pri(v))).   (94)

Proof. Fix H̄ > 0, a consistent policy π, and K > 3. Fix γ_1 ∈ (0, (1/K)^{K+2}), and let γ_i = K γ_{i−1} for i = 2, . . . , K. We construct K + 1 problem instances v^{(0)}, v^{(1)}, . . . , v^{(K)} ∈ P' with K arms as follows. For instance j = 0, . . . , K, the reward distribution of arm i ∈ [K] is
ν_i^{(j)} = Ber(1/2 − γ_i) if i ≠ j, and ν_i^{(j)} = Ber(1/2 + γ_i) if i = j,   (95)
where Ber(·) denotes the Bernoulli distribution. In addition, the privacy parameters of these K + 1 instances are set to be the same value, denoted by ε. Since H_pri(v^{(j)}) → +∞ as ε → 0 for all j = 0, . . . , K, we may choose ε to be any value such that, for all j = 0, . . . , K,
H_pri(v^{(j)}) ≥ H̄ and H_pri(v^{(j)}) ≥ (H_BAI(v^{(j)}))^{1/β}.   (96)
Since π is a consistent policy, when T is sufficiently large we have
P^π_{v^{(0)}}(Î_T = 1) > 3/4,   (97)
where arm 1 is the best arm under v^{(0)}. Furthermore, we set n_i = 4E^π_{v^{(0)}}[N_{i,T}], i.e., four times the expected number of pulls of arm i ∈ [K] under instance v^{(0)} with time budget T. We then define the event E_i = {N_{i,T} < n_i} ∩ {Î_T = 1}. It holds that
P^π_{v^{(0)}}(E_i) = 1 − P^π_{v^{(0)}}(Ē_i) ≥(a) 1 − P^π_{v^{(0)}}(N_{i,T} ≥ n_i) − P^π_{v^{(0)}}(Î_T ≠ 1) ≥(b) 3/4 − P^π_{v^{(0)}}(N_{i,T} ≥ n_i) ≥(c) 1/2,   (98)
where (a) follows from the union bound, (b) follows from (97), and (c) is obtained from Markov's inequality (since n_i = 4E^π_{v^{(0)}}[N_{i,T}]).
Then, by Lemma 4.3, we have
P^{ES(π,n_i,i)}_{v^{(0)}}(E_i) = P^π_{v^{(0)}}(E_i) ≥ 1/2.   (99)
In addition, by Lemma 4.5, we have for i = 2, . . . , K
P^{ES(π,n_i,i)}_{v^{(i)}}(E_i) ≥ exp(−6εn_i TV(ν_i^{(i)}, ν_i^{(0)})) P^{ES(π,n_i,i)}_{v^{(0)}}(E_i) ≥(a) (1/4) exp(−6εn_i TV(ν_i^{(i)}, ν_i^{(0)})) ≥(b) (1/4) exp(−12εn_iγ_i),   (100)
where (a) follows from (99), and (b) is obtained from the definition in (95), which gives TV(ν_i^{(i)}, ν_i^{(0)}) = 2γ_i. In other words, noting that on E_i we have Î_T = 1 ≠ i = i*(v^{(i)}), we have
P^π_{v^{(i)}}(Î_T ≠ i*(v^{(i)})) ≥ P^π_{v^{(i)}}(E_i) =(a) P^{ES(π,n_i,i)}_{v^{(i)}}(E_i) ≥(b) (1/4) exp(−12εn_iγ_i),   (101)
where (a) follows from Lemma 4.3, and (b) follows from (100). Observe that for any {x_i}_{i=1}^{K−1} ⊂ R and {y_i}_{i=1}^{K−1} ⊂ R_+, it holds that min_{i∈[K−1]} x_i/y_i ≤ (Σ_{i∈[K−1]} x_i)/(Σ_{i∈[K−1]} y_i).
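For completeness, the following short display verifies this elementary inequality (a worked step we add for readability; it is applied immediately below with x_i = n_{i+1} and y_i = 1/(εγ_{i+1} H_pri(v^{(i+1)}))).

```latex
% Added justification of the min <= ratio-of-sums inequality used below.
\[
  m := \min_{i} \frac{x_i}{y_i}
  \;\Longrightarrow\;
  x_i \ge m\, y_i \ \ \forall i
  \;\Longrightarrow\;
  \sum_{i} x_i \ge m \sum_{i} y_i
  \;\Longrightarrow\;
  \frac{\sum_{i} x_i}{\sum_{i} y_i} \ge m,
\]
% where the last step uses the fact that each y_i > 0, and hence \sum_i y_i > 0.
```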
Then, setting x_i = n_{i+1} and y_i = 1/(εγ_{i+1} H_pri(v^{(i+1)})) for i ∈ [K−1], we conclude that there exists i* ∈ {2, . . . , K} such that
n_{i*} ε γ_{i*} H_pri(v^{(i*)}) ≤ (Σ_{i=2}^{K} n_i) / (Σ_{i=2}^{K} 1/(εγ_i H_pri(v^{(i)})))
 ≤(a) 4T / (Σ_{i=2}^{K} 1/(εγ_i H_pri(v^{(i)})))
 ≤(b) 4T / (Σ_{i=2}^{K} 1/i)
 ≤(c) 4T / (log(K) − log(2))
 ≤(d) 8T / log(K),   (102)
where (a) follows from the fact that Σ_{i=2}^{K} E^π_{v^{(0)}}[N_{i,T}] ≤ T and n_i = 4E^π_{v^{(0)}}[N_{i,T}], (b) is obtained from H_pri(v^{(i)}) ≤ i/(εγ_i), (c) follows from the fact that log(K) − log(2) = ∫_2^K (1/x) dx < Σ_{i=2}^{K} 1/i, and (d) is obtained from the assumption that K > 3. Combining (101) and (102), we have
P^π_{v^{(i*)}}(Î_T ≠ i*(v^{(i*)})) ≥ (1/4) exp(−12εn_{i*}γ_{i*}) = (1/4) exp(−12 H_pri(v^{(i*)}) ε n_{i*} γ_{i*} / H_pri(v^{(i*)})) ≥ (1/4) exp(−96T/(log(K) H_pri(v^{(i*)}))).   (103)
That is, when T is sufficiently large, we have
P^π_{v^{(i*)}}(Î_T ≠ i*(v^{(i*)})) ≥ exp(−97T/(log(K) H_pri(v^{(i*)}))).   (104)

With Lemma H.5 in place, we are now ready to show that Theorem 4.7 holds in Case (iii), via the following lemma.

Lemma H.6. Fix any β ∈ [0, 1), a consistent policy π, and a constant c > 0. For all sufficiently large T, there exists an instance v ∈ P' such that
P^π_v(Î_T ≠ i*(v)) > exp(−T/(c(log K)(H_BAI(v) + H_pri(v)^β))).

Proof. Fix c > 0 and β ∈ [0, 1). Similarly, let v̄(K, H̄, β, T) be the instance specified in Lemma H.5 that satisfies (94). Then, when H̄ > (194c)^{1/(1−β)}, we have
H̄/97 > 2c H̄^β.   (105)
Finally, combining (94), (96) and (105), we have for all sufficiently large T
P^π_v(Î_T ≠ i*(v)) > exp(−T/(c(log K)(H_BAI(v) + H_pri(v)^β))),   (106)
where v = v̄(K, H̄, β, T) and H̄ > (194c)^{1/(1−β)}.

Proof of Theorem 4.7. Combining Lemmas H.3, H.4 and H.6, we see that Theorem 4.7 holds.

I PROOF OF PROPOSITION D.2

The proof of Proposition D.2 shares some similarities with the proof of Proposition 4.1; the difference is that it uses the properties of the Gaussian mechanism instead of the Laplace mechanism. Let N_i^{(p)} := N_{i,T_p}. Under the (random) process of DP-BAI-GAUSS, we introduce an auxiliary random mechanism π̃ whose input is x ∈ X and whose output is (N_i^{(p)}, µ̃_i^{(p)})_{i∈[K], p∈[M]}, where we define µ̃_i^{(p)} = 0 if i ∉ A_p. By Dwork et al. (2014, Proposition 2.1), to prove Proposition D.2, it suffices to show that π̃ is (ε, δ)-differentially private, i.e., for all neighbouring x, x' ∈ X,
P(π̃(x) ∈ S) ≤ e^ε P(π̃(x') ∈ S) + δ  for all S ⊆ Ω̃,   (107)
where Ω̃ is the set of all possible outputs of π̃. In the following, we write P^π̃_x to denote the probability measure induced by π̃ with input x ∈ X. Fix neighbouring z, z' ∈ X arbitrarily. There exists a unique pair (i*, n*) such that z_{i*,n*} ≠ z'_{i*,n*}. In addition, fix any n_i^{(p)} ∈ N and Borel sets χ_i^{(p)} ⊆ [0, 1] for i ∈ [K] and p ∈ [M] such that 0 ≤ n_i^{(1)} ≤ n_i^{(2)} ≤ · · · ≤ n_i^{(M)} ≤ T for all i ∈ [K], and let
p* = M + 1 if n_{i*}^{(M)} < n*, and p* = min{p ∈ [M] : n_{i*}^{(p)} ≥ n*} otherwise.   (108)
For any j ∈ [M], we define the events
D_j = {∀ (i, p) ∈ [K] × [j] : N_i^{(p)} = n_i^{(p)}, µ̃_i^{(p)} ∈ χ_i^{(p)}},
D'_j = D_{j−1} ∩ {N_{i*}^{(j)} = n_{i*}^{(j)}} ∩ {∀ i ∈ [K] \ {i*} : N_i^{(j)} = n_i^{(j)}, µ̃_i^{(j)} ∈ χ_i^{(j)}},
and, for j' ∈ {j + 1, . . . , M},
D_{j,j'} = {∀ (i, p) ∈ [K] × {j + 1, . . . , j'} : N_i^{(p)} = n_i^{(p)}, µ̃_i^{(p)} ∈ χ_i^{(p)}}.
In particular, D_0 and D_{M,M} are events with probability 1, i.e., P^π̃_x(D_0) = P^π̃_x(D_{M,M}) = 1 for all x ∈ X. If n_{i*}^{(M)} < n*, then by the definition of π̃ we have
P^π̃_z(D_M) = P^π̃_{z'}(D_M).   (109)
In the following, we assume n_{i*}^{(M)} ≥ n* and show that
P^π̃_z(D_M) ≤ exp(ε) P^π̃_{z'}(D_M) + δ,   (110)
which then implies that π̃ is (ε, δ)-differentially private. For x ∈ {z, z'}, we denote
µ̄_x = (1/(n_{i*}^{(p*)} − n_{i*}^{(p*−1)})) Σ_{j=n_{i*}^{(p*−1)}+1}^{n_{i*}^{(p*)}} x_{i*,j};
that is, µ̄_x is the value of µ̂_{i*}^{(p*)} on the event D'_{p*} under the probability measure P^π̃_x.
Then, by the chain rule, we have for both x ∈ {z, z'}
P^π̃_x(D_M) = P^π̃_x(D'_{p*}) P^π̃_x(µ̃_{i*}^{(p*)} ∈ χ_{i*}^{(p*)} | D'_{p*}) P^π̃_x(D_{p*,M} | D_{p*})
= P^π̃_x(D'_{p*}) ∫_{u∈χ_{i*}^{(p*)}} f_Gauss(u − µ̄_x) P^π̃_x(D_{p*,M} | D'_{p*} ∩ {µ̃_{i*}^{(p*)} = u}) du,   (111)
where f_Gauss(·) is the probability density function of the Gaussian distribution with zero mean and variance 2 log(1.25/δ)/(ε(n_{i*}^{(p*)} − n_{i*}^{(p*−1)}))². Note that by definition it holds that
P^π̃_z(D'_{p*}) = P^π̃_{z'}(D'_{p*}) and
P^π̃_z(D_{p*,M} | D'_{p*} ∩ {µ̃_{i*}^{(p*)} = u}) = P^π̃_{z'}(D_{p*,M} | D'_{p*} ∩ {µ̃_{i*}^{(p*)} = u}) ≤ 1 for all u ∈ χ_{i*}^{(p*)}.   (112)
By a property of the Gaussian mechanism (Dwork et al., 2014, Appendix A), there exist sets γ_+, γ_− ⊆ χ_{i*}^{(p*)} with γ_+ ∩ γ_− = ∅ and γ_+ ∪ γ_− = χ_{i*}^{(p*)} such that
∫_{u∈γ_−} f_Gauss(u − µ̄_x) du ≤ δ/2 for x ∈ {z, z'}, and ∫_{u∈γ_+} f_Gauss(u − µ̄_z) du ≤ exp(ε) ∫_{u∈γ_+} f_Gauss(u − µ̄_{z'}) du.   (113)
Hence, combining (111), (112) and (113), we have P^π̃_z(D_M) ≤ exp(ε) P^π̃_{z'}(D_M) + δ, which implies that π̃ is (ε, δ)-differentially private.

J PROOF OF PROPOSITION D.3

The proof of Proposition D.3 shares some similarities with the proof of Theorem 4.2; the difference is that it uses a concentration bound for Gaussian random variables instead of one for sub-exponential random variables.

Lemma J.1. Fix an instance v ∈ P and p ∈ [M]. For any set Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q, let d_Q = dim(span{a_i : i ∈ Q}). Then, for all j ∈ Q \ {i*(v)},
P_v^{ΠDP-BAI-GAUSS}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤ 2 exp(−T Δ_j²/(16M(d_Q² ∨ s_p))) + 2 exp(−(Δ_j ε T/(M(d_Q² ∨ s_p)))²/(32 log(1.25/δ))).

Proof. Fix j ∈ Q \ {i*(v)}. Notice that the sampling strategy of ΠDP-BAI-GAUSS depends on the relation between |Q| and d_Q. We consider two cases.

Case 1: d_Q > √s_p. In this case, each arm in Q is pulled ⌈T/(M|Q|)⌉ times. Let
E_1 := {µ̂_j^{(p)} − µ̂_{i*(v)}^{(p)} ≥ −Δ_j/2}.   (115)
Let X̄_{i,s} = X_{i,s} − µ_i for all i ∈ [K] and s ∈ [T], and note that X̄_{i,s} ∈ [−1, 1] for all (i, s). Then, by Lemma A.1 applied to the centred rewards of arms j and i*(v) collected in phase p,
P_v^{ΠDP-BAI-GAUSS}(E_1 | A_p = Q) ≤ 2 exp(−⌈T/(M|Q|)⌉ Δ_j²/8).
Furthermore, let
E_2 := {ξ_j^{(p)} − ξ_{i*(v)}^{(p)} ≥ Δ_j/2}.   (117)
We note that under DP-BAI-GAUSS, both ξ_j^{(p)} and ξ_{i*(v)}^{(p)} are Gaussian random variables with zero mean and variance 2 log(1.25/δ)/(ε⌈T/(M|Q|)⌉)². Hence, ξ_j^{(p)} − ξ_{i*(v)}^{(p)} is a Gaussian random variable with zero mean and variance 4 log(1.25/δ)/(ε⌈T/(M|Q|)⌉)². Subsequently,
P_v^{ΠDP-BAI-GAUSS}(E_2 | A_p = Q) ≤ exp(−(Δ_j ε ⌈T/(M|Q|)⌉)²/(32 log(1.25/δ))).
Since {µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)}} ⊆ E_1 ∪ E_2, we therefore have
P_v^{ΠDP-BAI-GAUSS}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤ P_v^{ΠDP-BAI-GAUSS}(E_1 ∪ E_2 | A_p = Q) ≤(a) 2 exp(−⌈T/(M|Q|)⌉ Δ_j²/8) + exp(−(Δ_j ε ⌈T/(M|Q|)⌉)²/(32 log(1.25/δ))),
where (a) follows from the union bound.

Case 2: d_Q ≤ √s_p. Similarly, in this case, each arm in B_p ⊆ Q is pulled ⌈T/(M d_Q)⌉ times. Recall that a_j^{(p)} = Σ_{i∈B_p} α_{j,i} a_i^{(p)}. Using (9), this implies a_j = Σ_{i∈B_p} α_{j,i} a_i, which in turn implies that ⟨a_j, θ⟩ = Σ_{i∈B_p} α_{j,i} ⟨a_i, θ⟩. Using (12), we then have
E_v^{ΠDP-BAI-GAUSS}[µ̃_j^{(p)} | A_p = Q] = µ_j.   (119)
If j ∉ B_p, we denote
µ̂_j^{(p)} = Σ_{ι∈B_p} α_{j,ι} µ̂_ι^{(p)}   (120)
and
ξ_j^{(p)} = Σ_{ι∈B_p} α_{j,ι} ξ_ι^{(p)}.   (121)
Note that (120) and (121) still hold in the case of j ∈ B_p. We define the event G_1 = {µ̂_j^{(p)} − µ_j ≥ Δ_j/4}. Writing µ̂_j^{(p)} − µ_j as a weighted sum of the centred rewards collected from the arms in B_p during phase p, and using |α_{j,ι}| ≤ 1 from Lemma G.1 together with Lemma A.1, we have
P_v^{ΠDP-BAI-GAUSS}(G_1 | A_p = Q) ≤ exp(−T Δ_j²/(16M(d_Q² ∨ s_p))).   (122)
In addition, we define the event G_2 = {ξ_j^{(p)} ≥ Δ_j/4}. Since each ξ_ι^{(p)} is a Gaussian random variable with zero mean and variance 2 log(1.25/δ)/(ε⌈T/(M d_Q)⌉)², the random variable ξ_j^{(p)} = Σ_{ι∈B_p} α_{j,ι} ξ_ι^{(p)} is Gaussian with zero mean and variance Σ_{ι∈B_p} α_{j,ι}² · 2 log(1.25/δ)/(ε⌈T/(M d_Q)⌉)².
Note that Σ_{ι∈B_p} α_{j,ι}² ≤ d_Q. We then have
P_v^{ΠDP-BAI-GAUSS}(G_2 | A_p = Q) ≤ exp(−(Δ_j ε ⌈T/(M d_Q)⌉)²/(32 d_Q log(1.25/δ))) ≤ exp(−(Δ_j ε T/(M(d_Q² ∨ s_p)))²/(32 log(1.25/δ))).   (123)
Define the events G_3 = {µ̂_{i*(v)}^{(p)} − µ_{i*(v)} ≤ −Δ_j/4} and G_4 = {ξ_{i*(v)}^{(p)} ≤ −Δ_j/4}. Similarly to (122) and (123), we have
P_v^{ΠDP-BAI-GAUSS}(G_3 | A_p = Q) ≤ exp(−T Δ_j²/(16M(d_Q² ∨ s_p)))
and
P_v^{ΠDP-BAI-GAUSS}(G_4 | A_p = Q) ≤ exp(−(Δ_j ε T/(M(d_Q² ∨ s_p)))²/(32 log(1.25/δ))).
Hence, in Case 2, since {µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)}} ⊆ G_1 ∪ G_2 ∪ G_3 ∪ G_4, we have
P_v^{ΠDP-BAI-GAUSS}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤ P_v^{ΠDP-BAI-GAUSS}(G_1 ∪ G_2 ∪ G_3 ∪ G_4 | A_p = Q) ≤ 2 exp(−T Δ_j²/(16M(d_Q² ∨ s_p))) + 2 exp(−(Δ_j ε T/(M(d_Q² ∨ s_p)))²/(32 log(1.25/δ))).
Finally, combining both Cases 1 and 2, for sufficiently large T we have
P_v^{ΠDP-BAI-GAUSS}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤ 2 exp(−T Δ_j²/(16M(d_Q² ∨ s_p))) + 2 exp(−(Δ_j ε T/(M(d_Q² ∨ s_p)))²/(32 log(1.25/δ))).

Lemma J.2. Fix an instance v ∈ P and p ∈ [M_1]. Recall the definitions of λ in (4) and g_0 in (8). For any set Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q, it holds that
P_v^{ΠDP-BAI-GAUSS}(i*(v) ∉ A_{p+1} | A_p = Q) ≤ 2λ Ψ, where Ψ := exp(−T Δ_{(g_0)}²/(16M(d_Q² ∨ s_p))) + exp(−(Δ_{(g_0)} ε T/(M(d_Q² ∨ s_p)))²/(32 log(1.25/δ))).

Proof. Fix v ∈ P, p ∈ [M_1] and Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q. Let Q_sub be the set of s_p − g_0 + 1 arms in Q with the largest sub-optimality gaps, and let N_sub = Σ_{j∈Q_sub} 1{µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)}} be the number of arms in Q_sub whose private empirical mean is at least that of the best arm. Then we have
E_v^{ΠDP-BAI-GAUSS}[N_sub | A_p = Q] = Σ_{j∈Q_sub} P_v^{ΠDP-BAI-GAUSS}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤(a) 2|Q_sub| Ψ = 2(s_p − g_0 + 1) Ψ,   (124)
where (a) follows from Lemma J.1 together with the fact that min_{j∈Q_sub} Δ_j ≥ Δ_{(g_0)}. Consequently,
P_v^{ΠDP-BAI-GAUSS}(i*(v) ∉ A_{p+1} | A_p = Q) ≤(a) P_v^{ΠDP-BAI-GAUSS}(N_sub ≥ s_{p+1} − g_0 + 1 | A_p = Q) ≤(b) E_v^{ΠDP-BAI-GAUSS}[N_sub | A_p = Q]/(s_{p+1} − g_0 + 1) ≤(c) [2(s_p − g_0 + 1)/(s_{p+1} − g_0 + 1)] Ψ ≤(d) 2(h_p + 1) Ψ ≤(e) 2λ Ψ,
where (a) follows from the fact that N_sub ≥ s_{p+1} − g_0 + 1 is a necessary condition for i*(v) ∉ A_{p+1} when A_p = Q, (b) follows from Markov's inequality, (c) follows from (124), (d) is obtained from the definition in (7), and (e) is obtained from the definition in (6).

Lemma J.3. Fix an instance v ∈ P and p ∈ {M_1 + 1, . . . , M}. For any set Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q, it holds that
P_v^{ΠDP-BAI-GAUSS}(i*(v) ∉ A_{p+1} | A_p = Q) ≤ 6Φ, where Φ := exp(−T Δ_{(s_{p+2}+1)}²/(16M(d_Q² ∨ s_p))) + exp(−(Δ_{(s_{p+2}+1)} ε T/(M(d_Q² ∨ s_p)))²/(32 log(1.25/δ))) and we define s_{M+2} = 1.

Proof. Fix v ∈ P, p ∈ {M_1 + 1, . . . , M} and Q ⊆ [K] with |Q| = s_p and i*(v) ∈ Q. Similarly, let Q_sub be the set of s_p − s_{p+2} arms in Q with the largest sub-optimality gaps, and again let N_sub = Σ_{j∈Q_sub} 1{µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)}}. Then we have
E_v^{ΠDP-BAI-GAUSS}[N_sub | A_p = Q] = Σ_{j∈Q_sub} P_v^{ΠDP-BAI-GAUSS}(µ̃_j^{(p)} ≥ µ̃_{i*(v)}^{(p)} | A_p = Q) ≤(a) 2|Q_sub| Φ = 2(s_p − s_{p+2}) Φ,   (126)
where (a) follows from Lemma J.1 together with the fact that min_{j∈Q_sub} Δ_j ≥ Δ_{(s_{p+2}+1)}. By a similar calculation,
P_v^{ΠDP-BAI-GAUSS}(i*(v) ∉ A_{p+1} | A_p = Q) ≤(a) P_v^{ΠDP-BAI-GAUSS}(N_sub ≥ s_{p+1} − s_{p+2} + 1 | A_p = Q) ≤(b) E_v^{ΠDP-BAI-GAUSS}[N_sub | A_p = Q]/(s_{p+1} − s_{p+2} + 1) ≤(c) [2(s_p − s_{p+2})/(s_{p+1} − s_{p+2} + 1)] Φ ≤(d) 6Φ,
where (a) follows from the fact that N_sub ≥ s_{p+1} − s_{p+2} + 1 is a necessary condition for i*(v) ∉ A_{p+1} when A_p = Q, (b) follows from Markov's inequality, (c) follows from (126), and (d) is obtained from the definition in (7).

With the above ingredients in place, we are now ready to prove Proposition D.3.

Proof of Proposition D.3. Fix an instance v ∈ P.
Recall that in DP-BAI-GAUSS, the decision maker eliminates arms over successive phases, and it successfully identifies the best arm if and only if the best arm is not eliminated in any of the phases. That is,
P_v^{ΠDP-BAI-GAUSS}(Î_T ≠ i*(v)) ≤ Σ_{p=1}^{M} P_v^{ΠDP-BAI-GAUSS}(i*(v) ∉ A_{p+1} | i*(v) ∈ A_p).
We split the sum on the right-hand side into two parts:
P_1 = Σ_{p=1}^{M_1} P_v^{ΠDP-BAI-GAUSS}(i*(v) ∉ A_{p+1} | i*(v) ∈ A_p) and P_2 = Σ_{p=M_1+1}^{M} P_v^{ΠDP-BAI-GAUSS}(i*(v) ∉ A_{p+1} | i*(v) ∈ A_p).
When K ≤ ⌈d²/4⌉, by definition we have M_1 = 0, which implies that P_1 = 0. When K > ⌈d²/4⌉, by Lemma J.2 we obtain
P_1 ≤ 4M_1 λ exp(−Ω(min{T/(H_BAI log d), (εT/(H_pri log d))²/log(1.25/δ)})).   (127)
In addition, by Lemma J.3,
P_2 ≤ Σ_{p=M_1+1}^{M} 6[exp(−T Δ_{(s_{p+2}+1)}²/(16Ms_p)) + exp(−(Δ_{(s_{p+2}+1)} ε T/(Ms_p))²/(32 log(1.25/δ)))]
 ≤ Σ_{p=M_1+1}^{M} 6[exp(−T Δ_{(s_{p+2}+1)}²/(16M(4s_{p+2}+4))) + exp(−(Δ_{(s_{p+2}+1)} ε T/(M(4s_{p+2}+4)))²/(32 log(1.25/δ)))]
 ≤ 6(M − M_1) max_{i∈{s_{p+2}+1 : p∈{M_1+1,...,M}}} [exp(−T Δ_{(i)}²/(64Mi)) + exp(−(Δ_{(i)} ε T)²/(512(Mi)² log(1.25/δ)))]
 ≤ 12(M − M_1) exp(−Ω(min{T/(H_BAI log d), (εT/(H_pri log d))²/log(1.25/δ)})),   (128)
where the second inequality uses s_p ≤ 4s_{p+2} + 4. Finally, combining (127) and (128), for sufficiently large T we have
P_v^{ΠDP-BAI-GAUSS}(Î_T ≠ i*(v)) ≤ P_1 + P_2 ≤ (12M + 4M_1λ) exp(−Ω(min{T/(H_BAI log d), (εT/(H_pri log d))²/log(1.25/δ)})).
This completes the proof.
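To illustrate the δ-dependence behind the comparison in Table 1, the following sketch (our own, with illustrative parameter values) contrasts the per-phase noise scales of the two mechanisms: the Laplace parameter 1/(εn) used by DP-BAI and the Gaussian standard deviation √(2 log(1.25/δ))/(εn) used by DP-BAI-GAUSS, where n denotes the number of samples collected in the phase.

```python
import math

def laplace_scale(eps, n):
    # Laplace mechanism used by DP-BAI: noise ~ Lap(1 / (eps * n))
    return 1.0 / (eps * n)

def gaussian_std(eps, delta, n):
    # Gaussian mechanism used by DP-BAI-GAUSS: variance 2 log(1.25/delta) / (eps * n)^2
    return math.sqrt(2 * math.log(1.25 / delta)) / (eps * n)

eps, n = 0.2, 100  # illustrative values only
print(f"Laplace scale: {laplace_scale(eps, n):.4f}")
for delta in (1e-1, 1e-3, 1e-6, 1e-12):
    print(f"delta = {delta:.0e}: Gaussian std = {gaussian_std(eps, delta, n):.4f}")
# As delta -> 0 the Gaussian standard deviation grows without bound, matching the
# observation that the DP-BAI-GAUSS bound becomes vacuous at delta = 0, whereas the
# Laplace scale used by DP-BAI is independent of delta.
```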