# REPLICABLE BANDITS

Published as a conference paper at ICLR 2023

Hossein Esfandiari, Google Research, esfandiari@google.com
Alkis Kalavasis, National Technical University of Athens, kalavasisalkis@mail.ntua.gr
Amin Karbasi, Yale University & Google Research, amin.karbasi@yale.edu
Andreas Krause, ETH Zurich, krausea@ethz.ch
Vahab Mirrokni, Google Research, mirrokni@google.com
Grigoris Velegkas, Yale University, grigoris.velegkas@yale.edu

In this paper, we introduce the notion of replicable policies in the context of stochastic bandits, one of the canonical problems in interactive learning. A policy in the bandit environment is called replicable if it pulls, with high probability, the exact same sequence of arms in two different and independent executions (i.e., under independent reward realizations). We show that not only do replicable policies exist, but also they achieve almost the same optimal (non-replicable) regret bounds in terms of the time horizon. More specifically, in the stochastic multi-armed bandits setting, we develop a policy with an optimal problem-dependent regret bound whose dependence on the replicability parameter is also optimal. Similarly, for stochastic linear bandits (with finitely and infinitely many arms) we develop replicable policies that achieve the best-known problem-independent regret bounds with an optimal dependency on the replicability parameter. Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms in two different rounds of executions.

1 INTRODUCTION

In order for scientific findings to be valid and reliable, the experimental process must be repeatable, and must provide coherent results and conclusions across these repetitions. In fact, lack of reproducibility has been a major issue in many scientific areas; a 2016 survey that appeared in Nature (Baker, 2016a) revealed that more than 70% of researchers failed in their attempt to reproduce another researcher's experiments. What is even more concerning is that over 50% of them failed to reproduce their own findings. Similar concerns have been raised by the machine learning community, e.g., the ICLR 2019 Reproducibility Challenge (Pineau et al., 2019) and the NeurIPS 2019 Reproducibility Program (Pineau et al., 2021), due to the exponential increase in the number of publications and the reliability of the findings.

The aforementioned empirical evidence has recently led to theoretical studies and rigorous definitions of replicability. In particular, the works of Impagliazzo et al. (2022) and Ahn et al. (2022) considered replicability as an algorithmic property through the lens of (offline) learning and convex optimization, respectively. In a similar vein, in the current work, we introduce the notion of replicability in the context of interactive learning and decision making. In particular, we study replicable policy design for the fundamental setting of stochastic bandits. A multi-armed bandit (MAB) is a one-player game that is played over T rounds where there is a set of different arms/actions A of size |A| = K (in the more general case of linear bandits, we can consider even an infinite number of arms). In each round t = 1, 2, ..., T, the player pulls an arm a_t ∈ A and receives a corresponding reward r_t.
In the stochastic setting, the rewards of each arm are sampled in each round independently from some fixed but unknown distribution supported on [0, 1]. Crucially, each arm has a potentially different reward distribution, but the distribution of each arm is fixed over time. A bandit algorithm A at every round t takes as input the sequence of arm-reward pairs that it has seen so far, i.e., (a_1, r_1), ..., (a_{t−1}, r_{t−1}), then uses (potentially) some internal randomness ξ to pull an arm a_t ∈ A and, finally, observes the associated reward r_t ∼ D_{a_t}.

We propose the following natural notion of a replicable bandit algorithm, which is inspired by the definition of Impagliazzo et al. (2022). Intuitively, a bandit algorithm is replicable if two distinct executions of the algorithm, with internal randomness fixed between both runs, but with independent reward realizations, give the exact same sequence of played arms, with high probability. More formally, we have the following definition.

Definition 1 (Replicable Bandit Algorithm). Let ρ ∈ [0, 1]. We call a bandit algorithm A ρ-replicable in the stochastic setting if for any distribution D_{a_j} over [0, 1] of the rewards of the j-th arm a_j ∈ A, and for any two executions of A, where the internal randomness ξ is shared across the executions, it holds that

Pr_{ξ, r^(1), r^(2)} [ (a_1^(1), ..., a_T^(1)) = (a_1^(2), ..., a_T^(2)) ] ≥ 1 − ρ.

Here, a_t^(i) = A(a_1^(i), r_1^(i), ..., a_{t−1}^(i), r_{t−1}^(i); ξ) is the t-th action taken by the algorithm A in execution i ∈ {1, 2}.

The reason why we allow for some fixed internal randomness is that the algorithm designer has control over it, e.g., they can use the same seed for their (pseudo)random generator between two executions. Clearly, naively designing a replicable bandit algorithm is not quite challenging. For instance, an algorithm that always pulls the same arm or an algorithm that plays the arms in a particular random sequence determined by the shared random seed ξ are both replicable. The caveat is that the performance of these algorithms in terms of expected regret will be quite poor. In this work, we aim to design bandit algorithms which are replicable and enjoy small expected regret. In the stochastic setting, the (expected) regret after T rounds is defined as

E[R_T] = T · max_{a∈A} µ_a − E[ Σ_{t=1}^T µ_{a_t} ],

where µ_a = E_{r∼D_a}[r] is the mean reward for arm a ∈ A. In a similar manner, we can define the regret in the more general setting of linear bandits (see Section 5). Hence, the overarching question in this work is the following:

Is it possible to design replicable bandit algorithms with small expected regret?

At a first glance, one might think that this is not possible, since it looks like replicability contradicts the exploratory behavior that a bandit algorithm should possess. However, our main results answer this question in the affirmative and are summarized in Table 1.

Setting                                         | Algorithm   | Regret                    | Theorem
Stochastic MAB                                  | Algorithm 1 | Õ(K² log³(T) · H / ρ²)    | Theorem 3
Stochastic MAB                                  | Algorithm 2 | Õ(K² log(T) · H / ρ²)     | Theorem 4
Stochastic Linear Bandits                       | Algorithm 3 | Õ(K² √(dT) / ρ²)          | Theorem 6
Stochastic Linear Bandits, Infinite Action Space | Algorithm 4 | Õ(poly(d) √T / ρ²)       | Theorem 10

Table 1: Our results for replicable stochastic general multi-armed and linear bandits. In the expected regret column, Õ(·) subsumes logarithmic factors. H is equal to Σ_{j: Δ_j>0} 1/Δ_j, Δ_j is the difference between the mean of action j and the optimal action, K is the number of arms, and d is the ambient dimension in the linear bandit setting.
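To make Definition 1 concrete, the following minimal Python sketch (not from the paper) runs a bandit policy twice with the same internal seed ξ but independent reward realizations and checks whether the two arm sequences coincide. The `policy_factory` interface with `select_arm`/`update` methods is a hypothetical stand-in for an arbitrary bandit algorithm A.

```python
import numpy as np

def run_once(policy_factory, means, T, xi_seed, reward_seed):
    """One execution: internal randomness xi is seeded with xi_seed,
    reward realizations with reward_seed (independent across executions)."""
    policy = policy_factory(np.random.default_rng(xi_seed))  # shared internal randomness xi
    reward_rng = np.random.default_rng(reward_seed)          # fresh stochastic rewards
    arms = []
    for _ in range(T):
        a = policy.select_arm()
        r = reward_rng.binomial(1, means[a])                 # reward in [0, 1]
        policy.update(a, r)
        arms.append(a)
    return arms

def empirical_replicability(policy_factory, means, T, trials=100):
    """Estimate Pr[identical arm sequences] over pairs of executions sharing xi."""
    same = 0
    for i in range(trials):
        xi = 12345 + i                                       # same internal seed in both runs
        s1 = run_once(policy_factory, means, T, xi, reward_seed=2 * i)
        s2 = run_once(policy_factory, means, T, xi, reward_seed=2 * i + 1)
        same += (s1 == s2)
    return same / trials   # should be at least 1 - rho for a rho-replicable policy
```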
1.1 RELATED WORK

Reproducibility/Replicability. In this work, we introduce the notion of replicability in the context of interactive learning and, in particular, in the fundamental setting of stochastic bandits. Close to our work, the notion of a replicable algorithm in the context of learning was proposed by Impagliazzo et al. (2022), where it is shown how any statistical query algorithm can be made replicable with a moderate increase in its sample complexity. Using this result, they provide replicable algorithms for finding approximate heavy-hitters, medians, and the learning of half-spaces. Reproducibility has also been considered in the context of optimization by Ahn et al. (2022). We mention that in Ahn et al. (2022) the notion of a replicable algorithm is different from our work and that of Impagliazzo et al. (2022), in the sense that the outputs of two different executions of the algorithm do not need to be exactly the same. From a more application-oriented perspective, Shamir & Lin (2022) study irreproducibility in recommendation systems and propose the use of smooth activations (instead of ReLUs) to improve recommendation reproducibility. In general, the reproducibility crisis is reported in various scientific disciplines (Ioannidis, 2005; McNutt, 2014; Baker, 2016b; Goodman et al., 2016; Lucic et al., 2018; Henderson et al., 2018). For more details we refer to the report of the NeurIPS 2019 Reproducibility Program (Pineau et al., 2021) and the ICLR 2019 Reproducibility Challenge (Pineau et al., 2019).

Bandit Algorithms. Stochastic multi-armed bandits for the general setting without structure have been studied extensively (Slivkins, 2019; Lattimore & Szepesvári, 2020; Bubeck et al., 2012b; Auer et al., 2002; Cesa-Bianchi & Fischer, 1998; Kaufmann et al., 2012a; Audibert et al., 2010; Agrawal & Goyal, 2012; Kaufmann et al., 2012b). In this setting, the optimum regret achievable is O(log(T) · Σ_{i: Δ_i>0} 1/Δ_i); this is achieved, e.g., by the upper confidence bound (UCB) algorithm of Auer et al. (2002). The setting of d-dimensional linear stochastic bandits is also well-explored (Dani et al., 2008; Abbasi-Yadkori et al., 2011) under the well-specified linear reward model, achieving (near) optimal problem-independent regret of O(d √T log(T)) (Lattimore & Szepesvári, 2020). Note that the best-known lower bound is Ω(d √T) (Dani et al., 2008) and that the number of arms can, in principle, be unbounded. For a finite number of arms K, the best known upper bound is O(√(dT log(K))) (Bubeck et al., 2012a). Our work focuses on the design of replicable bandit algorithms and we hence consider only stochastic environments. In general, there is also extensive work on adversarial bandits and we refer the interested reader to Lattimore & Szepesvári (2020).

Batched Bandits. While sequential bandit problems have been studied for almost a century, there is much interest in the batched setting too. In many settings, like medical trials, one has to take a lot of actions in parallel and observe their rewards later. The works of Auer & Ortner (2010) and Cesa-Bianchi et al. (2013) provided sequential bandit algorithms which can easily work in the batched setting. The works of Gao et al. (2019) and Esfandiari et al. (2021) focus exclusively on the batched setting. Our work on replicable bandits builds upon some of the techniques from these two lines of work.
2 STOCHASTIC BANDITS AND REPLICABILITY

In this section, we first highlight the main challenges in order to guarantee replicability and then discuss how the results of Impagliazzo et al. (2022) can be applied in our setting.

2.1 WARM-UP I: NAIVE REPLICABILITY AND CHALLENGES

Let us consider the stochastic two-arm setting (K = 2) and a bandit algorithm A with two independent executions, A_1 and A_2. The algorithm A_i plays the sequence 1, 2, 1, 2, ... until some, potentially random, round T_i ∈ N after which one of the two arms is eliminated and, from that point, the algorithm picks the winning arm j_i ∈ {1, 2}. The algorithm A is ρ-replicable if and only if T_1 = T_2 and j_1 = j_2 with probability at least 1 − ρ. Assume that |µ_1 − µ_2| = Δ, where µ_i is the mean of the distribution of the i-th arm. If we assume that Δ is known, then we can run the algorithm for T_1 = T_2 = C Δ^{−2} log(1/ρ) for some universal constant C > 0 and obtain that, with probability 1 − ρ, it will hold that |µ̂_1^(j) − µ_1| < Δ/2 and |µ̂_2^(j) − µ_2| < Δ/2 for j ∈ {1, 2}, where µ̂_i^(j) is the estimate of arm i's mean during execution j. Hence, knowing Δ implies that the stopping criterion of the algorithm A is deterministic and that, with high probability, the winning arm will be detected at time T_1 = T_2. This makes the algorithm ρ-replicable. Observe that when K = 2, the only obstacle to replicability is that the algorithm should decide at the same time to select the winning arm and the selection must be the same in the two execution threads. In the presence of multiple arms, there exists the additional constraint that the above conditions must be satisfied during, potentially, multiple arm eliminations. Hence, the two questions arising from the above discussion are (i) how to modify the above approach when Δ is unknown and (ii) how to deal with K > 2 arms.

A potential solution to the second question (on handling K > 2 arms) is the Explore-Then-Commit (ETC) strategy. Consider the stochastic K-arm bandit setting. For any ρ ∈ (0, 1), the ETC algorithm with known Δ = min_i Δ_i and horizon T that uses m = 4 Δ^{−2} log(1/ρ) deterministic exploration phases before commitment is ρ-replicable. The intuition is exactly the same as in the K = 2 case. The caveats of this approach are that it assumes that Δ is known and that the obtained regret is quite unsatisfying. In particular, it achieves regret bounded by m · Σ_{i∈[K]} Δ_i + ρ · (T − mK) · Σ_{i∈[K]} Δ_i. Next, we discuss how to improve the regret bound without knowing the gaps Δ_i. Before designing new algorithms, we will inspect the guarantees that can be obtained by combining ideas from previous results in the bandits literature and the recent work in replicable learning of Impagliazzo et al. (2022).

2.2 WARM-UP II: BANDIT ALGORITHMS AND REPLICABLE MEAN ESTIMATION

First, we remark that we work in the stochastic setting and the distributions of the rewards of the arms are subgaussian. Thus, the problem of estimating their mean is an instance of a statistical query for which we can use the algorithm of Impagliazzo et al. (2022) to get a replicable mean estimator for the distributions of the rewards of the arms.

Proposition 2 (Replicable Mean Estimation (Impagliazzo et al., 2022)). Let τ, δ, ρ ∈ [0, 1]. There exists a ρ-replicable algorithm ReprMeanEstimation that draws Ω(log(1/δ)/(τ²(ρ − δ)²)) samples from a distribution with mean µ and computes an estimate µ̂ that satisfies |µ̂ − µ| ≤ τ with probability at least 1 − δ.

Notice that we are working in the regime where δ ≪ ρ, so the sample complexity is Ω(log(1/δ)/(τ²ρ²)).
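For intuition on how a guarantee like Proposition 2 can be obtained, the following sketch illustrates the randomized-rounding idea commonly used for replicable statistical queries: the empirical mean is snapped to a grid whose random offset comes from the shared randomness ξ. This is an illustrative stand-in, not the exact routine of Impagliazzo et al. (2022); ρ enters through the number of samples the caller must supply (of order log(1/δ)/(τ²ρ²)), so that the empirical means of two executions fall in the same grid cell with probability roughly 1 − ρ.

```python
import numpy as np

def repr_mean_estimation(samples, tau, rho, shared_rng):
    """Sketch of a replicable mean estimator via randomized rounding.

    The empirical mean is computed from enough samples to be much more
    accurate than tau and is then snapped to a grid of width ~tau whose
    offset is drawn from the SHARED randomness. Two executions whose
    empirical means are close land in the same cell with high probability,
    hence return the exact same value."""
    emp_mean = float(np.mean(samples))
    width = tau                              # grid width of order tau
    offset = shared_rng.uniform(0.0, width)  # shared across executions
    return offset + width * round((emp_mean - offset) / width)
```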
The straightforward approach is to try to use an optimal multi-armed bandit algorithm for the stochastic setting, such as UCB or arm-elimination (Even-Dar et al., 2006), combined with the replicable mean estimator. However, it is not hard to see that this approach does not give meaningful results: if we want to achieve replicability ρ we need to call the replicable mean estimator routine with parameter ρ/(KT), due to the union bound that we need to take. This means that we need to pull every arm at least K²T² times, so the regret guarantee becomes vacuous. This gives us the first key insight to tackle the problem: we need to reduce the number of calls to the mean estimator. Hence, we will draw inspiration from the line of work in stochastic batched bandits (Gao et al., 2019; Esfandiari et al., 2021) to derive replicable bandit algorithms.

3 REPLICABLE MEAN ESTIMATION FOR BATCHED BANDITS

As a first step, we would like to show how one could combine the existing replicable algorithms of Impagliazzo et al. (2022) with the batched bandits approach of Esfandiari et al. (2021) to get some preliminary non-trivial results. We build an algorithm for the K-arm setting, where the gaps Δ_j are unknown to the learner. Let δ be the confidence parameter of the arm elimination algorithm and ρ be the replicability guarantee we want to achieve. Our approach is the following: let us, deterministically, split the time interval into sub-intervals of increasing length. We treat each sub-interval as a batch of samples where we pull each active arm the same number of times and use the replicable mean estimation algorithm to, empirically, compute the true mean. At the end of each batch, we decide to eliminate some arm j using the standard UCB estimate. Crucially, if we condition on the event that all the calls to the replicable mean estimator return the same number, then the algorithm we propose is replicable.

Algorithm 1 Mean-Estimation Based Replicable Algorithm for Stochastic MAB (Theorem 3)
1: Input: time horizon T, number of arms K, replicability ρ
2: Initialization: B ← log(T), q ← T^(1/B), c_0 ← 0, A ← [K], r ← T, µ̂_a ← 0 for all a ∈ A
3: for i = 1 to B − 1 do
4:   if q^i · |A| > r then
5:     break
6:   c_i ← c_{i−1} + q^i
7:   Pull every arm a ∈ A for q^i times
8:   for a ∈ A do
9:     µ̂_a ← ReprMeanEstimation(δ = 1/(2KTB), τ = √(log(2KTB)/c_i), ρ = ρ/(KB))   (see Proposition 2)
10:  r ← r − |A| · q^i
11:  for a ∈ A do
12:    if µ̂_a < max_{a'∈A} µ̂_{a'} − 2τ then
13:      Remove a from A
14: In the last batch play the arm from A with the smallest index

Theorem 3. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 1) for the stochastic bandit problem with K arms and gaps (Δ_j)_{j∈[K]} whose expected regret is

E[R_T] ≤ C · (K² log²(T)/ρ²) · Σ_{j: Δ_j>0} ( Δ_j + log(KT log(T))/Δ_j ),

where C > 0 is an absolute numerical constant, and its running time is polynomial in K, T and 1/ρ.

The above result, whose proof can be found in Appendix A, states that, by combining the tools from Impagliazzo et al. (2022) and Esfandiari et al. (2021), we can design a replicable bandit algorithm with (instance-dependent) expected regret O(K² log³(T)/ρ²). Notice that the regret guarantee has an extra K² log²(T)/ρ² factor compared to its non-replicable counterpart in Esfandiari et al. (2021) (Theorem 5.1). This is because, due to a union bound over the rounds and the arms, we need to call the replicable mean estimator with parameter ρ/(K log(T)). In the next section, we show how to get rid of the log²(T) factor by designing a new algorithm.
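The following Python skeleton mirrors the batching structure of Algorithm 1 (geometric batch lengths q^i, one replicable per-arm estimate per batch, UCB-style elimination). It is a sketch, not the paper's implementation: it reuses the `repr_mean_estimation` sketch from Section 2.2, the `pull` callback is a hypothetical environment interface, and the constants are illustrative rather than those behind Theorem 3.

```python
import math

def algorithm1_sketch(pull, K, T, rho, shared_rng):
    """Skeleton of the batched elimination scheme of Algorithm 1 (illustrative).

    pull(a) returns one stochastic reward in [0, 1] for arm a;
    repr_mean_estimation is the replicable estimator sketched in Section 2.2."""
    B = max(2, int(math.log(T)))
    q = T ** (1.0 / B)
    active = list(range(K))
    history = {a: [] for a in active}
    budget = T
    c = 0
    for i in range(1, B):
        batch = int(q ** i)                            # q^i pulls per active arm in batch i
        if batch * len(active) > budget:
            break
        c += batch
        tau = math.sqrt(math.log(2 * K * T * B) / c)   # confidence width after c_i pulls per arm
        mu_hat = {}
        for a in active:
            history[a].extend(pull(a) for _ in range(batch))
            # one replicable estimate per (arm, batch); rho is split over the K*B calls
            mu_hat[a] = repr_mean_estimation(history[a], tau, rho / (K * B), shared_rng)
        budget -= batch * len(active)
        best = max(mu_hat.values())
        active = [a for a in active if mu_hat[a] >= best - 2 * tau]   # elimination rule
    while budget > 0:                                  # commit to the smallest-index survivor
        pull(min(active))
        budget -= 1
    return min(active)
```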
4 IMPROVED ALGORITHMS FOR REPLICABLE STOCHASTIC BANDITS

While the previous result provides a non-trivial regret bound, it is not optimal with respect to the time horizon T. In this section, we show how to improve it by designing a new algorithm, presented in Algorithm 2, which satisfies the guarantees of Theorem 4 and, essentially, decreases the dependence on the time horizon T from log³(T) to log(T). Our main result for replicable stochastic multi-armed bandits with K arms follows.

Theorem 4. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 2) for the stochastic bandit problem with K arms and gaps (Δ_j)_{j∈[K]} whose expected regret is

E[R_T] ≤ C · (K²/ρ²) · Σ_{j: Δ_j>0} ( Δ_j + log(KT log(T))/Δ_j ),

where C > 0 is an absolute numerical constant, and its running time is polynomial in K, T and 1/ρ.

Note that, compared to the non-replicable setting, we incur an extra factor of K²/ρ² in the regret. The proof can be found in Appendix B. Let us now describe how Algorithm 2 works. We decompose the time horizon into B = log(T) batches. Without the replicability constraint, one could draw q^i samples in batch i from each arm and estimate the mean reward. With the replicability constraint, we have to boost this: in each batch i, we pull each active arm O(β q^i) times, for some q to be determined, where β = O(K²/ρ²) is the replicability blow-up. Using these samples, we compute the empirical mean µ̂_a^(i) for any active arm a.

Algorithm 2 Replicable Algorithm for Stochastic Multi-Armed Bandits (Theorem 4)
1: Input: time horizon T, number of arms K, replicability ρ
2: Initialization: B ← log(T), q ← T^(1/B), c_0 ← 0, A_0 ← [K], r ← T, µ̂_a ← 0 for all a ∈ A_0
3: β ← max{K²/ρ², 2304}
4: for i = 1 to B − 1 do
5:   if β · q^i · |A_{i−1}| > r then
6:     break
7:   A_i ← A_{i−1}
8:   for a ∈ A_i do
9:     Pull arm a for β · q^i times
10:    Compute the empirical mean µ̂_a^(i)
11:  c_i ← c_{i−1} + q^i
12:  c̃_i ← β · c_i
13:  Ũ_i ← √(2 ln(2KTB)/c̃_i)
14:  U_i ← √(2 ln(2KTB)/c_i)
15:  U̅_i ← Uni[U_i/2, U_i]
16:  r ← r − β · |A_i| · q^i
17:  for a ∈ A_i do
18:    if µ̂_a^(i) + Ũ_i < max_{a'∈A_i} µ̂_{a'}^(i) − U̅_i then
19:      Remove a from A_i
20: In the last batch play the arm from A_{B−1} with the smallest index

Note that Ũ_i in Algorithm 2 corresponds to the size of the actual confidence interval of the estimation and U_i corresponds to the confidence interval of an algorithm that does not use the β-blow-up in the number of samples. The novelty of our approach comes from the choice of the interval around the mean of the maximum arm: we pick a threshold uniformly at random from an interval of size U_i/2 around the maximum mean. Then, the algorithm checks whether µ̂_a^(i) + Ũ_i < max_{a'} µ̂_{a'}^(i) − U̅_i, where the max runs over the active arms a' in batch i, and eliminates arms accordingly. To prove the result we show that there are three regions that some arm j can lie in relative to the confidence interval of the best arm in batch i (cf. Appendix B). If it lies in two of these regions, then the decision of whether to keep it or discard it is the same in both executions of the algorithm. However, if it is in the third region, the decision could differ between parallel executions, and since it relies on some external and unknown randomness, it is not clear how to reason about it. To overcome this issue, we use the random threshold to argue about the probability that the decision between two executions differs.
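The heart of Algorithm 2 is the elimination test of lines 13-19. The sketch below isolates it, with the cutoff drawn from the shared randomness so that both executions compare against the same threshold; it is a sketch of this single step, with notation following Algorithm 2, not a full implementation.

```python
import math

def elimination_test(mu_hat, c_i, beta, K, T, B, shared_rng):
    """Randomized elimination rule of Algorithm 2 (one batch).

    mu_hat maps each active arm to its empirical mean computed from
    beta * c_i pulls. U is the confidence width WITHOUT the beta blow-up,
    U_tilde the (much smaller) width WITH it. The cutoff is drawn uniformly
    from [U/2, U] using the shared randomness, so two executions whose
    estimates differ by O(U_tilde) make the same decision unless the cutoff
    lands in an interval of length O(U_tilde), an event of probability
    O(sqrt(1 / beta))."""
    U = math.sqrt(2 * math.log(2 * K * T * B) / c_i)
    U_tilde = math.sqrt(2 * math.log(2 * K * T * B) / (beta * c_i))
    cutoff = shared_rng.uniform(U / 2, U)          # shared across executions
    best = max(mu_hat.values())
    return {a for a, m in mu_hat.items() if m + U_tilde >= best - cutoff}
```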
The crucial observation that allows us to get rid of the extra log²(T) factor is that there are correlations between consecutive batches: we prove that if some arm j lies in this bad region in some batch i, then it will be outside this region after a constant number of batches.

5 REPLICABLE STOCHASTIC LINEAR BANDITS

We now investigate replicability in the more general setting of stochastic linear bandits. In this setting, each arm is a vector a ∈ R^d belonging to some action set A ⊆ R^d, and there is a parameter θ* ∈ R^d unknown to the player. In round t, the player chooses some action a_t ∈ A and receives a reward r_t = ⟨θ*, a_t⟩ + η_t, where η_t is a zero-mean 1-subgaussian random variable independent of any other source of randomness. This means that E[η_t] = 0 and that E[exp(λη_t)] ≤ exp(λ²/2) for any λ ∈ R. For normalization purposes, it is standard to assume that ||θ*||_2 ≤ 1 and sup_{a∈A} ||a||_2 ≤ 1. In the linear setting, the expected regret after T pulls a_1, ..., a_T can be written as

E[R_T] = T · sup_{a∈A} ⟨θ*, a⟩ − E[ Σ_{t=1}^T ⟨θ*, a_t⟩ ].

In Section 5.1 we provide results for the finite action space case, i.e., when |A| = K. Next, in Section 5.2, we study replicable linear bandit algorithms when dealing with infinite action spaces. In the following, we work in the regime where T ≥ d. We underline that our approach leverages connections of stochastic linear bandits with G-optimal experiment design, core set constructions, and least-squares estimators. Roughly speaking, the goal of G-optimal design is to find a (small) subset of arms A', which is called the core set, and define a distribution π over them with the following property: for any ε > 0, δ > 0, pulling only these arms for an appropriate number of times and computing the least-squares estimate θ̂ guarantees that sup_{a∈A} ⟨a, θ* − θ̂⟩ ≤ ε, with probability 1 − δ. For an extensive discussion, we refer to Chapters 21 and 22 of Lattimore & Szepesvári (2020).

5.1 FINITE ACTION SET

We first introduce a lemma that allows us to reduce the size of the action set that our algorithm has to search over.

Lemma 5 (See Chapters 21 and 22 in Lattimore & Szepesvári (2020)). For any finite action set A that spans R^d and any δ, ε > 0, there exists an algorithm that, in time polynomial in d, computes a multi-set of Θ(d log(1/δ)/ε² + d log log d) actions (possibly with repetitions) such that (i) they span R^d and (ii) if we perform these actions in a batched stochastic d-dimensional linear bandit setting with true parameter θ* ∈ R^d and let θ̂ be the least-squares estimate for θ*, then, for any a ∈ A, with probability at least 1 − δ, we have ⟨a, θ* − θ̂⟩ ≤ ε.

Essentially, the multi-set in Lemma 5 is obtained using an approximate G-optimal design algorithm. Thus, it is crucial to check whether this can be done in a replicable manner. Recall that the above set of distinct actions is called the core set and is the solution of an (approximate) G-optimal design problem. To be more specific, consider a distribution π : A → [0, 1] and define V(π) = Σ_{a∈A} π(a) a a^T ∈ R^{d×d} and g(π) = sup_{a∈A} ||a||²_{V(π)^{−1}}. The distribution π is called a design and the goal of G-optimal design is to find a design that minimizes g. Since the number of actions is finite, this problem reduces to an optimization problem which can be solved efficiently using standard optimization methods (e.g., the Frank-Wolfe method).
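As a concrete reference point, the following sketch implements the standard Frank-Wolfe iteration for an approximate G-optimal design over a finite action set; it is deterministic given its input, which is the property the replicability argument below relies on. The step-size rule and stopping tolerance are the textbook choices, not taken from the paper.

```python
import numpy as np

def g_optimal_design_fw(actions, iters=200, tol=1e-6):
    """Frank-Wolfe sketch for an approximate G-optimal design.

    `actions` is an (n x d) array whose rows span R^d. The iteration pushes
    mass onto the action with the largest ||a||^2_{V(pi)^{-1}} until
    g(pi) is within (1 + tol) of its optimum value d."""
    n, d = actions.shape
    pi = np.full(n, 1.0 / n)                       # uniform initialization
    for _ in range(iters):
        V = actions.T @ (pi[:, None] * actions)    # V(pi) = sum_a pi(a) a a^T
        Vinv = np.linalg.pinv(V)
        norms = np.einsum('ij,jk,ik->i', actions, Vinv, actions)  # ||a||^2_{V(pi)^{-1}}
        a_star = int(np.argmax(norms))
        g = norms[a_star]                          # g(pi); equals d at the optimum
        if g <= d * (1.0 + tol):                   # approximate design reached
            break
        gamma = (g / d - 1.0) / (g - 1.0)          # standard Frank-Wolfe step size
        pi *= (1.0 - gamma)
        pi[a_star] += gamma
    return pi
```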
Since the initialization is the same, the algorithm that finds the optimal (or an approximately optimal) design is replicable under the assumption that the gradients and the projections do not have numerical errors. This perspective is orthogonal to the work of Ahn et al. (2022), which defines reproducibility from a different viewpoint.

Algorithm 3 Replicable Algorithm for Stochastic Linear Bandits (Theorem 6)
1: Input: number of arms K, time horizon T, replicability ρ
2: Initialization: B ← log(T), q ← (T/c)^(1/B), A ← [K], r ← T
3: β ← max{K²/ρ², 2304}
4: for i = 1 to B − 1 do
5:   ε̃_i ← √(d log(KT²)/(β q^i))
6:   ε_i ← √(d log(KT²)/q^i)
7:   n_i ← 10 d log(KT²)/ε̃_i²
8:   a_1, ..., a_{n_i} ← multi-set given by Lemma 5 with parameters δ = 1/(KT²) and ε = ε̃_i
9:   if n_i > r then
10:    break
11:  Pull every arm a_1, ..., a_{n_i} and receive rewards r_1, ..., r_{n_i}
12:  Compute the LSE θ̂_i ← (Σ_{j=1}^{n_i} a_j a_j^T)^{−1} Σ_{j=1}^{n_i} a_j r_j
13:  ε̅_i ← Uni[ε_i/2, ε_i]
14:  r ← r − n_i
15:  for a ∈ A do
16:    if ⟨a, θ̂_i⟩ + ε̃_i < max_{a'∈A} ⟨a', θ̂_i⟩ − ε̅_i then
17:      Remove a from A
18: In the last batch play arg max_{a∈A} ⟨a, θ̂_{B−1}⟩

In our batched bandit algorithm (Algorithm 3), the multi-set of arms a_1, ..., a_{n_i} computed in each batch is obtained via a deterministic algorithm with runtime poly(K, d), where |A| = K. Hence, the multi-set will be the same in two different executions of the algorithm. On the other hand, the LSE will not be, since it depends on the stochastic rewards. We apply the techniques that we developed in the replicable stochastic MAB setting in order to design our algorithm. Our main result for replicable d-dimensional stochastic linear bandits with K arms follows. For the proof, we refer to Appendix C.

Theorem 6. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm for the stochastic d-dimensional linear bandit problem with K arms whose expected regret is

E[R_T] ≤ C · (K²/ρ²) · √(dT log(KT)),

where C > 0 is an absolute numerical constant, and its running time is polynomial in d, K, T and 1/ρ.

Note that the best known non-replicable algorithm achieves an upper bound of Õ(√(dT log(K))) and, hence, our algorithm incurs a replicability overhead of order K²/ρ². The intuition behind the proof is similar to the multi-armed bandit setting in Section 4.

5.2 INFINITE ACTION SET

Let us proceed to the setting where the action set A is infinite. Unfortunately, even when d = 1, we cannot directly get an algorithm that has satisfactory regret guarantees by discretizing the space and using Algorithm 3. The approach of Esfandiari et al. (2021) is to discretize the action space and use a 1/T-net to cover it, i.e., a set A' ⊆ A such that for all a ∈ A there exists some a' ∈ A' with ||a − a'||_2 ≤ 1/T. It is known that there exists such a net of size at most (3T)^d (Vershynin, 2018, Corollary 4.2.13). Then, they apply the algorithm for the finite-arm setting, increasing their regret guarantee by a factor of d. However, our replicable algorithm for this setting contains an additional factor of K² in the regret bound. Thus, even when d = 1, our regret guarantee is greater than T, so the bound is vacuous. One way to fix this issue and get a sublinear regret guarantee is to use a smaller net. We use a 1/T^{1/(4d+2)}-net that has size at most (3T)^{d/(4d+2)} and this yields an expected regret of order O(T^{(4d+1)/(4d+2)} √(d log(T))/ρ²). For further details, we refer to Appendix D. Even though the regret guarantee we managed to get using the smaller net of Appendix D is sublinear in T, it is not a satisfactory bound.
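Before moving to the infinite-action construction, the sketch below recaps the per-batch computation of Algorithm 3: the least-squares estimate from the G-optimal multi-set followed by the randomized elimination test with a shared cutoff. Variable names follow Algorithm 3; it is a sketch, and it assumes the Gram matrix of the pulled arms is invertible, which Lemma 5 guarantees since the multi-set spans R^d.

```python
import numpy as np

def lse_and_eliminate(pulled, rewards, actions, eps, eps_tilde, shared_rng):
    """One batch of Algorithm 3 (sketch).

    `pulled` is the (n_i x d) matrix of pulled arms, `rewards` the observed
    rewards, `actions` the (K x d) matrix of still-active arms."""
    V = pulled.T @ pulled                                # Gram matrix of the multi-set
    theta_hat = np.linalg.solve(V, pulled.T @ rewards)   # LSE theta_hat_i
    eps_bar = shared_rng.uniform(eps / 2, eps)           # shared cutoff, as in line 13
    scores = actions @ theta_hat                         # <a, theta_hat_i> for each arm
    best = scores.max()
    keep = scores + eps_tilde >= best - eps_bar          # elimination test of line 16
    return theta_hat, keep
```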
The next step is to provide an algorithm for the infinite action setting using a replicable LSE subroutine combined with the batching approach of Esfandiari et al. (2021). We will make use of the next lemma.

Lemma 7 (Section 21.2, Note 3 of Lattimore & Szepesvári (2020)). There exists a deterministic algorithm that, given an action space A ⊆ R^d, computes a 2-approximate G-optimal design π with a core set of size O(d log log(d)).

We additionally prove the next useful lemma, which, essentially, states that we can assume without loss of generality that every arm in the support of π has mass at least Ω(1/(d log(d))). We refer to Appendix F.1 for the proof.

Lemma 8 (Effective Support). Let π be the distribution that corresponds to the 2-approximate optimal G-design of Lemma 7 with input A. Assume that π(a) < c/(d log(d)), where c > 0 is some absolute numerical constant, for some arm a in the core set. Then, we can construct a distribution π̂ such that, for any arm a in the core set, π̂(a) ≥ C/(d log(d)), where C > 0 is an absolute constant, so that it holds that sup_{a∈A} ||a||²_{V(π̂)^{−1}} ≤ 4d.

The upcoming lemma is a replicable algorithm for the least-squares estimator and, essentially, builds upon Lemma 7 and Lemma 8. Its proof can be found in Appendix F.2.

Lemma 9 (Replicable LSE). Let ρ, ε ∈ (0, 1] and 0 < δ ≤ min{ρ, 1/d} (footnote 1). Consider an environment of d-dimensional stochastic linear bandits with infinite action space A. Assume that π is a 4-approximate optimal design with associated core set C as computed by Lemma 7 with input A. There exists a ρ-replicable algorithm that pulls each arm a ∈ C a total of Ω(d⁴ log(d/δ) log² log(d) log log log(d)/(ρ²ε²)) times and outputs θ_SQ that satisfies sup_{a∈A} |⟨a, θ_SQ − θ*⟩| ≤ ε, with probability at least 1 − δ.

(Footnote 1: We can handle the remaining range of δ by paying an extra log d factor in the sample complexity.)

Algorithm 4 Replicable LSE Algorithm for Stochastic Infinite Action Set (Theorem 10)
1: Input: time horizon T, action set A ⊆ R^d, replicability ρ
2: A' ← 1/T-net of A
3: Initialization: r ← T, B ← log(T), q ← (T/c)^(1/B)
4: for i = 1 to B − 1 do
5:   (q^i denotes the number of pulls of all arms before the replicability blow-up)
6:   ε_i ← c √(d² log(T)/q^i)
7:   The blow-up is M_i ← q^i · d³ log(d) log² log(d) log log log(d) · log²(T)/ρ²
8:   a_1, ..., a_{|C_i|} ← core set C_i of the design given by Lemma 7 with input A
9:   if M_i > r then
10:    break
11:  Pull every arm a_j for N_i = M_i/|C_i| rounds and receive rewards r_1^(j), ..., r_{N_i}^(j) for j ∈ [|C_i|]
12:  S_i ← {(a_j, r_t^(j)) : t ∈ [N_i], j ∈ [|C_i|]}
13:  θ̂_i ← ReplicableLSE(S_i, ρ = ρ/(dB), δ = 1/(2|A'|T²), τ = min{ε_i, 1})
14:  r ← r − M_i
15:  for a ∈ A' do
16:    if ⟨a, θ̂_i⟩ < max_{a'∈A'} ⟨a', θ̂_i⟩ − 2ε_i then
17:      Remove a from A'
18: In the last batch play arg max_{a∈A'} ⟨a, θ̂_{B−1}⟩
19:
20: ReplicableLSE(S, ρ, δ, τ):
21: for a ∈ C do
22:   v(a) ← ReplicableSQ(φ : x ∈ R ↦ x ∈ R, S, ρ, δ, τ)   (Impagliazzo et al. (2022))
23: return (Σ_{j∈[|S|]} a_j a_j^T)^{−1} (Σ_{a∈C} a · n_a · v(a))

The main result for the infinite actions case, obtained by Algorithm 4, follows. Its proof can be found in Appendix E.

Theorem 10. Let T ∈ N, ρ ∈ (0, 1]. There exists a ρ-replicable algorithm (Algorithm 4) for the stochastic d-dimensional linear bandit problem with infinite action set whose expected regret is

E[R_T] ≤ C · (d⁴ log(d) log² log(d) log log log(d)/ρ²) · √T · log^{3/2}(T),

where C > 0 is an absolute numerical constant, and its running time is polynomial in T^d and 1/ρ.

Our algorithm for the infinite-arm linear bandit case enjoys an expected regret of order Õ(poly(d) √T).
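Before commenting on optimality and computational aspects, the ReplicableLSE subroutine of Algorithm 4 can be pictured as follows: replicably estimate the mean reward of every core-set arm, then plug the estimates into the least-squares formula θ_SQ = (Σ_a n_a a a^T)^{−1} (Σ_a n_a v(a) a). The sketch below reuses the randomized-rounding estimator sketched in Section 2.2 as a stand-in for the replicable SQ oracle of Impagliazzo et al. (2022) and is illustrative rather than the exact subroutine.

```python
import numpy as np

def replicable_lse(core_set, rewards_per_arm, tau, rho, shared_rng):
    """Sketch of the ReplicableLSE step of Algorithm 4.

    core_set is a (|C| x d) array; rewards_per_arm[j] lists the rewards
    observed for core_set[j]. One replicable mean estimate per core-set arm
    is enough, since the pulled arms themselves are deterministic."""
    d = core_set.shape[1]
    V = np.zeros((d, d))
    b = np.zeros(d)
    for a, rewards in zip(core_set, rewards_per_arm):
        n_a = len(rewards)
        v_a = repr_mean_estimation(rewards, tau, rho, shared_rng)  # replicable per-arm estimate
        V += n_a * np.outer(a, a)
        b += n_a * v_a * a
    return np.linalg.solve(V, b)   # theta_SQ
```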
We underline that the dependence of the regret on the time horizon is (almost) optimal, and we incur an extra d³ factor in the regret guarantee compared to the non-replicable algorithm of Esfandiari et al. (2021). We now comment on the time complexity of our algorithm.

Remark 11. The current implementation of our algorithm requires time exponential in d. However, for a general convex set A, given access to a separation oracle for it and an oracle that computes an (approximate) G-optimal design, we can execute it in polynomial time and with polynomially many calls to the oracle. Notably, when A is a polytope such oracles exist. We underline that computational complexity issues also arise in the traditional setting of linear bandits with an infinite number of arms, and the computational overhead that the replicability requirement adds is minimal. For further details, we refer to Appendix G.

6 CONCLUSION AND FUTURE DIRECTIONS

In this paper, we have provided a formal notion of reproducibility/replicability for stochastic bandits and we have developed algorithms for the multi-armed bandit and the linear bandit settings that satisfy this notion and enjoy only a small loss in regret compared to their non-replicable counterparts. We hope and believe that our paper will inspire future works in replicable algorithms for more complicated interactive learning settings such as reinforcement learning. We also provide an experimental evaluation in Appendix H.

7 ACKNOWLEDGEMENTS

Alkis Kalavasis was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the "First Call for H.F.R.I. Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment" grant, project BALSAM, HFRI-FM17-1424. Amin Karbasi acknowledges funding in direct support of this work from NSF (IIS-1845032), ONR (N00014-19-1-2406), and the AI Institute for Learning-Enabled Optimization at Scale (TILOS). Andreas Krause was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program grant agreement no. 815943 and the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545. Grigoris Velegkas was supported by NSF (IIS-1845032), an Onassis Foundation PhD Fellowship and a Bodossaki Foundation PhD Fellowship.

REFERENCES

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011. 3

Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39.1–39.26. JMLR Workshop and Conference Proceedings, 2012. 3

Kwangjun Ahn, Prateek Jain, Ziwei Ji, Satyen Kale, Praneeth Netrapalli, and Gil I Shamir. Reproducibility in optimization: Theoretical framework and limits. arXiv preprint arXiv:2202.04598, 2022. 1, 3, 7

Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT, pp. 41–53. Citeseer, 2010. 3

Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55–65, 2010. 3

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002. 3

Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 2016a. 1

Monya Baker. Reproducibility crisis.
Nature, 533(26):353 66, 2016b. 3 Mihir Bellare and Phillip Rogaway. The complexity of approximating a nonlinear program. In Complexity in numerical optimization, pp. 16 32. World Scientific, 1993. 22 S ebastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pp. 41 1. JMLR Workshop and Conference Proceedings, 2012a. 3 S ebastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multiarmed bandit problems. Foundations and Trends in Machine Learning, 5(1):1 122, 2012b. 3 Nicolo Cesa-Bianchi and Paul Fischer. Finite-time regret bounds for the multiarmed bandit problem. In ICML, volume 98, pp. 100 108. Citeseer, 1998. 3 Nicolo Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 26, 2013. 3 Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008. 3 Hossein Esfandiari, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. Regret bounds for batched bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7340 7348, 2021. 3, 4, 5, 8, 9 Published as a conference paper at ICLR 2023 Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006. 4 Valerii Vadimovich Fedorov. Theory of optimal experiments. Elsevier, 2013. 22 Robert M Freund and James B Orlin. On the complexity of four polyhedral set containment problems. Mathematical programming, 33(2):139 145, 1985. 22 Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019. 3, 4 Steven N Goodman, Daniele Fanelli, and John PA Ioannidis. What does research reproducibility mean? Science translational medicine, 8(341):341ps12 341ps12, 2016. 3 Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018. 3 Russell Impagliazzo, Rex Lei, Toniann Pitassi, and Jessica Sorrell. Reproducibility in learning. ar Xiv preprint ar Xiv:2201.08430, 2022. 1, 2, 3, 4, 5, 9, 20 John PA Ioannidis. Why most published research findings are false. PLo S medicine, 2(8):e124, 2005. 3 Emilie Kaufmann, Olivier Capp e, and Aur elien Garivier. On bayesian upper confidence bounds for bandit problems. In Artificial intelligence and statistics, pp. 592 600. PMLR, 2012a. 3 Emilie Kaufmann, Nathaniel Korda, and R emi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International conference on algorithmic learning theory, pp. 199 213. Springer, 2012b. 3 Piyush Kumar and E Alper Yildirim. Minimum-volume enclosing ellipsoids and core sets. Journal of Optimization Theory and applications, 126(1):1 21, 2005. 22 Tor Lattimore and Csaba Szepesv ari. Bandit algorithms. Cambridge University Press, 2020. 3, 7, 8, 22 Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in rl with a generative model. In International Conference on Machine Learning, pp. 5662 5670. PMLR, 2020. 22 Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. 
Are gans created equal? a large-scale study. Advances in neural information processing systems, 31, 2018. 3 Olvi L Mangasarian and T-H Shiau. A variable-complexity norm maximization problem. SIAM Journal on Algebraic Discrete Methods, 7(3):455 461, 1986. 22 Marcia Mc Nutt. Reproducibility, 2014. 3 Joelle Pineau, Koustuv Sinha, Genevieve Fried, Rosemary Nan Ke, and Hugo Larochelle. Iclr reproducibility challenge 2019. Re Science C, 5(2), May 2019. doi: 10.5281/zenodo.3158244. URL https://zenodo.org/record/3158244/files/article.pdf. 1, 3 Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivi ere, Alina Beygelzimer, Florence d Alch e Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research: a report from the neurips 2019 reproducibility program. Journal of Machine Learning Research, 22, 2021. 1, 3 Gil I Shamir and Dong Lin. Real world large scale recommendation systems reproducibility and smooth activations. ar Xiv preprint ar Xiv:2202.06499, 2022. 3 Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends in Machine Learning, 12(1-2):1 286, 2019. 3 Published as a conference paper at ICLR 2023 Michael J Todd. Minimum-volume ellipsoids: Theory and algorithms. SIAM, 2016. 22 Stephen A Vavasis. Polynomial time weak approximation algorithms for quadratic programming. In Complexity in numerical optimization, pp. 490 500. World Scientific, 1993. 22 Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. 8 Yinyu Ye. On affine scaling algorithms for nonconvex quadratic programming. Mathematical Programming, 56(1):285 300, 1992. 22 Published as a conference paper at ICLR 2023 A THE PROOF OF THEOREM 3 Theorem. Let T N, ρ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 1) for the stochastic bandit problem with K arms and gaps ( j)j [K] whose expected regret is E[RT ] C K2 log2(T) j + log(2KT log(T)) where C > 0 is an absolute numerical constant, and its running time is polynomial in K, T and 1/ρ. Proof. First, we claim that the algorithm is ρ-replicable: since the elimination decisions are taken in the same iterates and are based solely on the mean estimations, the replicability of the algorithm of Proposition 2 implies the replicability of the whole algorithm. In particular, Pr[(a1, ..., a T ) = (a 1, ..., a T )] = Pr[ i [B], j [K] : bµ(i) j was not replicable] ρ . During each batch i, we draw for any active arm qi fresh samples for a total of ci samples and use the replicable mean estimation algorithm to estimate its mean. For an active arm, at the end of some batch i [B], we say that its estimation is correct if the estimation of its mean is within p log(2KTB)/ci from the true mean. Using Proposition 2, the estimation of any active arm at the end of any batch (except possibly the last batch) is correct with probability at least 1 1/(2KTB) and so, by the union bound, the probability that the estimation is incorrect for some arm at the end of some batch is bounded by 1/T. We remark that when δ < ρ, the sample complexity of Proposition 2 reduces to O(log(1/δ)/(τ 2ρ2)). Let E denote the event that our estimates are correct. The total expected regret can be bounded as E[RT ] T 1/T + E[RT |E] . It suffices to bound the second term of the RHS and hence we can assume that each gap is correctly estimated within an additive factor of p log(2KTB)/ci after batch i. 
First, due to the elimination condition, we get that the best arm is never eliminated. Next, we have that E[RT |E] = X j: j>0 j E[Tj|E] , where Tj is the total number of pulls of arm j. Fix a sub-optimal arm j and assume that i + 1 was the last batch it was active. Since this arm is not eliminated at the end of batch i, and the estimations are correct, we have that j p log(2KTB)/ci , and so ci log(2KTB)/ 2 j. Hence, the number of pulls to get the desired bound due to Proposition 2 is (since we need to pull an arm ci/ρ2 1 times in order to get an estimate at distance p log(1/δ)/c2 i with probability 1 δ in a ρ1-replicable manner when δ < ρ1) Tj ci+1/ρ2 1 = q/ρ2 1(1 + ci) q/ρ2 1 (1 + log(2KTB)/ 2 j) . This implies that the total regret is bounded by E[RT ] 1 + q/ρ2 1 X j + log(2KTB) We finally set q = T 1/B and B = log(T). Moreover, we have that ρ1 = ρ/(KB). These yield E[RT ] K2 log2(T) j + log(2KT log(T)) This completes the proof. Published as a conference paper at ICLR 2023 B THE PROOF OF THEOREM 4 Theorem. Let T N, ρ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 2) for the stochastic bandit problem with K arms and gaps ( j)j [K] whose expected regret is E[RT ] C K2 j: j>0 ( j + log(KT log(T))/ j) , for some absolute numerical constant C > 0, and its running time is polynomial in K, T and 1/ρ. To give some intuition, we begin with a non tight analysis which, however, provides the main ideas behind the actual proof. Non Tight Analysis Assume that the environment has K arms with unknown means µi and let T be the number of rounds. Consider B to the total number of batches and β > 1. We set q = T 1/B. In each batch i [B], we pull each arm β qi times. Hence, after the i-th batch, we will have drawn eci = P 1 j i β qj independent and identically distributed samples from each arm. Let us also set ci = P Let us fix i [B]. Using Hoeffding s bound for subgaussian concentration, the length of the confidence bound for arm j [K] that guarantees 1 δ probability of success (in the sense that the empirical estimate bµj will be close to the true µj) is equal to e Ui = p 2 log(1/δ)/eci , when the estimator uses eci samples. Also, let 2 log(1/δ)/ci . Assume that the active arms at the batch iteration i lie in the set Ai. Consider the estimates {bµ(i) j }i [B],j Ai, where bµ(i) j is the empirical mean of arm j using eci samples. We will eliminate an arm j at the end of the batch iteration i if bµ(i) j + e Ui max t Ai bµ(i) t U i , where U i Uni[Ui/2, Ui]. For the remaining of the proof, we condition on the event E that for every arm j [K] and every batch i [B] the true mean is within e Ui from the empirical one. We first argue about the replicability of our algorithm. Consider a fixed round i (end of i-th batch) and a fixed arm j. Let i be the optimal empirical arm after the i-th batch. i the empirical estimates of arms j, i after the i-th batch, under some other execution of the algorithm. We condition on the event E for the other execution as well. Notice that |bµ(i) bµ(i) j | 2e Ui, |bµ(i) i bµ(i) i | 2e Ui. Notice that, since the randomness of U i is shared, if bµ(i) j + e Ui bµ(i) i U i + 4e Ui, then the arm j will not be eliminated after the i-th batch in some other execution of the algorithm as well. Similarly, if bµ(i) j + e Ui < bµ(i) i U i 4e Ui the the arm j will get eliminated after the i-th batch in some other execution of the algorithm as well. 
In particular, this means that if bµ(i) j 2e Ui > bµ(i) i + e Ui Ui/2 then the arm j will not get eliminated in some other execution of the algorithm and if bµ(i) j + 5e Ui < bµ(i) i Ui then the arm j will also get eliminated in some other execution of the algorithm with probability 1 under the event E E . We call the above two cases good since they preserve replicability. Thus, it suffices to bound the probability that the decision about arm j will be different between the two executions when we are in neither of these cases. Then, the worst case bound due to the mass of the uniform probability measure is 2 log(1/δ)/eci p 2 log(1/δ)/ci . This implies that the probability mass of the bad event is at most 16 p ci/eci = 16 p 1/β. A union bound over all arms and batches yields that the probability that two distinct executions differ in at least one pull is Pr[(a1, . . . , a T ) = (a 1, . . . , a T )] 16KB p Published as a conference paper at ICLR 2023 and since δ ρ it suffices to pick β = 768K2B2/ρ2. We now focus on the regret of our algorithm. Let us set δ = 1/(KTB). Fix a sub-optimal arm j and assume that batch i + 1 was the last batch that is was active. We obtain that the total number of pulls of this arm is Tj eci+1 βq(1 + ci) βq(1 + 8 log(1/δ)/ 2 j) From the replicability analysis, it suffices to take β of order K2 log2(T)/ρ2 and so E[RT ] T 1/T+E[RT |E] = 1+ X j: j>0 j E[Tj|E] C K2 log2(T) j + log(KT log(T)) for some absolute constant C > 0. Notice that the above analysis, which uses a naive union bound, does not yield the desired regret bound. We next provide a more tight analysis of the same algorithm that achieves the regret bound of Theorem 4. Improved Analysis (The Proof of Theorem 4) In the previous analysis, we used a union bound over all arms and all batches in order to control the probability of the bad event. However, we can obtain an improved regret bound as follows. Fix a sub-optimal arm i [K] and let t be the first round that it appears in the bad event. We claim that after a constant number of rounds, this arm will be eliminated. This will shave the O(log2(T)) factor from the regret bound. Essentially, as indicated in the previous proof, the bad event corresponds to the case where the randomness of the cut-off threshold U can influence the decision of whether the algorithm eliminates an arm or not. The intuition is that during the rounds t and t + 1, given that the two intervals intersected at round t, we know that the probability that they intersect again is quite small since the interval of the optimal mean is moving upwards, the interval of the sub-optimal mean is concentrating around the guess and the two estimations have been moved by at most a constant times the interval s length. Since the bad event occurs at round t, we know that bµ(t) j h bµ(t) t Ut 5e Ut, bµ(t) t Ut/2 + 3e Ut i . In the above bµt t is the estimate of the optimal mean at round t whose index is denoted by t . Now assume that the bad event for arm j also occurs at round t + k. Then, we have that bµ(t+k) j h bµ(t+k) (t+k) Ut+k 5e Ut+k, bµ(t+k) (t+k) Ut+k/2 + 3e Ut+k i . First, notice that since the concentration inequality under event E holds for rounds t, t + k we have that bµ(t+k) j bµ(t) j + e Ut + e Ut+k. Thus, combining it with the above inequalities gives us bµ(t+k) (t+k) Ut+k 5e Ut+k bµ(t+k) j bµ(t) j + e Ut + e Ut+k bµ(t) t Ut/2 + 4e Ut + e Ut+k. We now compare bµ(t) t , bµ(t+k) (t+k) . Let o denote the optimal arm. 
We have that bµ(t+k) (t+k) bµ(t+k) o µo e Ut+k µt e Ut+k bµ(t) t e Ut e Ut+k. This gives us that bµ(t) t Ut+k 6e Ut+k e Ut bµ(t+k) (t+k) Ut+k 5e Ut+k. Thus, we have established that bµ(t) t Ut+k 6e Ut+k e Ut bµ(t) t Ut/2 + 4e Ut + e Ut+k = Ut+k Ut/2 7e Ut+k 5e Ut Ut/2 12e Ut. Since β 2304, we get that 12e Ut Ut/4. Thus, we get that Published as a conference paper at ICLR 2023 Notice that Ut+k thus it immediately follows that 16 = qt+1 1 qt+k+1 1 1 16 = 16 1 1 qt+1 qk 1 qt+1 = qk 16 + 1 qt+1 17 = k log q log 17 = k 5, when we pick B = log(T) batches. Thus, for every arm the bad event can happen at most 6 times, by taking a union bound over the K arms we see that the probability that our algorithm is not replicable is at most O(K p 1/β), so picking β = Θ(K2/ρ2) suffices to get the result. C THE PROOF OF THEOREM 6 Theorem. Let T N, ρ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 3) for the stochastic d-dimensional linear bandit problem with K arms whose expected regret is E[RT ] C K2 d T log(KT) , for some absolute numerical constant C > 0, and its running time is polynomial in d, K, T and 1/ρ. Proof. Let c, C be the numerical constants hidden in Lemma 5, i.e., the size of the multi-set is in the interval [cd log(1/δ)/ε2, Cd log(1/δ)/ε2]. We know that the size of each batch ni [cqi, Cqi] (see Lemma 5), so by the end of the B 1 batch we will have less than n B pulls left. Hence, the number of batches is at most B. We first define the event E that the estimates of all arms after the end of each batch are accurate, i.e., for every active arm a at the beginning of the i-th batch, at the end of the batch we have that D a, bθi θ E eεi. Since δ = 1/(KT 2) and there are at most T batches and K active arms in each batch, a simple union bound shows that E happens with probability at least 1 1/T. We condition on the event E throughout the rest of the proof. We now argue about the regret bound of our algorithm. We first show that any optimal arm a will not get eliminated. Indeed, consider any sub-optimal arm a [K] and any batch i [B]. Under the event E we have that a, bθi a , bθi ( a, θ + eεi) ( a , θ eεi) < 2eεi < εi + εi. Next, we need to bound the number of times we pull some fixed suboptimal arm a [K]. We let = a a, θ denote the gap and we let i be the smallest integer such that εi < /4. We claim that this arm will get eliminated by the end of batch i. Indeed, a , bθi a, bθi ( a , bθi eεi) ( a, bθi + eεi) = 2eεi > 4εi 2eεi > eεi + εi. This shows that during any batch i, all the active arms have gap at most 4εi 1. Thus, the regret of the algorithm conditioned on the event E is at most i=1 4niεi 1 4βC d log(KT 2)/qi 1 6βCq p O βq B/2+1p d log(KT) = O K2 ρ2 q B/2+1p d log(KT) = O K2 d T log(KT) . Thus, the overall regret is bounded by δ T + (1 δ) O K2 d T log(KT) = d T log(KT) . Published as a conference paper at ICLR 2023 We now argue about the replicability of our algorithm. The analysis follows in a similar fashion as in Theorem 4. Let bθi, bθ i be the LSE after the i-th batch, under two different executions of the algorithm and assume that the set of active arms. We condition on the event E for the other execution as well. Assume that the set of active arms is the same under both executions at the beginning of batch i. Notice that since the set that is guaranteed by Lemma 5 is computed by a deterministic algorithm, both executions will pull the same arms in batch i. Consider a suboptimal arm a and let ai = arg maxa A bθi, a , a i = arg maxa A bθ i, a . 
Under the event E E we have that | a, bθi bθ i | 2eεi, | ai , bθi bθ i | 2eεi, and | a i , bθ i ai , bθi | 2eεi. Notice that, since the randomness of εi is shared, if a, bθi + eεi ai , bθi εi + 4eεi, then the arm a will not be eliminated after the i-th batch in some other execution of the algorithm as well. Similarly, if a, bθi + eεi < ai , bθi εi 4eεi the the arm a will get eliminated after the i-th batch in some other execution of the algorithm as well. In particular, this means that if a, bθi 2eεi > ai , bθi +eεi εi/2 then the arm a will not get eliminated in some other execution of the algorithm and if a, bθi +5eεi < ai , bθi εi then the arm j will also get eliminated in some other execution of the algorithm with probability 1 under the event E E . Thus, it suffices to bound the probability that the decision about arm j will be different between the two executions when we are in neither of these cases. Then, the worst case bound due to the mass of the uniform probability measure is d log(1/δ)/eci p d log(1/δ)/ci . This implies that the probability mass of the bad event is at most 16 p ci/eci = 16 p 1/β. A naive union bound would require us to pick β = Θ(K2 log2 T/ρ2). We next show to avoid the log2 T factor. Fix a sub-optimal arm a [K] and let t be the first round that it appears in the bad event. Since the bad event occurs at round t, we know that a, bθt h at , bθt εt 5eεt, at , bθt εt/2 + 3eεt i . In the above, at is the optimal arm at round t w.r.t. the LSE. Now assume that the bad event for arm a also occurs at round t + k. Then, we have that a, bθt+k h a(t+k) , bθt+k εt+k 5eεt+k, a(t+k) , bθt+k εt/2 + 3eεt+k i . First, notice that since the concentration inequality under event E holds for rounds t, t + k we have that a, bθt+k a, bθt + eεt + eεt+k. Thus, combining it with the above inequalities gives us a(t+k) , bθt+k εt+k 5eεt+k a, bθt+k a, bθt + eεt + eεt+k at , bθt εt/2 + 4eεt + eεt+k. We now compare at , bθt , a(t+k) , bθt+k . Let a denote the optimal arm. We have that a(t+k) , bθt+k a , bθt+k a , θ eεt+k at , θ eεt+k at , bθt eεt+k eεt. This gives us that at , bθt εt+k 6eεt+k eεt a(t+k) , bθt+k εt+k 5eεt+k. Thus, we have established that at , bθt εt+k 6eεt+k eεt at , bθt εt/2 + 4eεt + eεt+k = εt+k εt/2 7eεt+k 5eεt εt/2 12eεt. Since β 2304, we get that 12eεt εt/4. Thus, we get that εt+k εt/4. Notice that εt+k thus it immediately follows that qt 16 = qk 16 = k log q log 16 = k 4, when we pick B = log(T) batches. Thus, for every arm the bad event can happen at most 5 times, by taking a union bound over the K arms we see that the probability that our algorithm is not replicable is at most O(K p 1/β), so picking β = Θ(K2/ρ2) suffices to get the result. Published as a conference paper at ICLR 2023 D NAIVE APPLICATION OF ALGORITHM 3 WITH INFINITE ACTION SPACE We use a 1/T 1/(4d+2) net that has size at most (3T) d 4d+2 . Let A be the new set of arms. We then run Algorithm 3 using A . This gives us the following result, that is proved right after. Corollary 12. Let T N, ρ (0, 1]. There is a ρ-replicable algorithm for the stochastic ddimensional linear bandit problem with infinite arms whose expected regret is at most E[RT ] C T 4d+1 4d+2 where C > 0 is an absolute numerical constant. Proof. Since K (3T) d 4d+2 , we have that T sup a A a, θ E (3T) 2d 4d+2 d T log T(3T) d 4d+2 ! 
T 4d+1 4d+2 Comparing to the best arm in A, we have that: T sup a A a, θ E = T sup a A a, θ T sup a A a, θ + T sup a A a, θ E Our choice of the 1/T 1/(4d+2)-net implies that for every a A there exists some a A such that ||a a ||2 1/T 1/(4d+2). Thus, supa A a, θ supa A a , θ ||a a ||2||θ ||2 1/T 1/(4d+2). Thus, the total regret is at most T 1/T 1/(4d+2) + O T 4d+1 4d+2 T 4d+1 4d+2 E THE PROOF OF THEOREM 10 Theorem. Let T N, ρ (0, 1]. There exists a ρ-replicable algorithm (presented in Algorithm 4) for the stochastic d-dimensional linear bandit problem with infinite action set whose expected regret is E[RT ] C d4 log(d) log2 log(d) log log log(d) T log3/2(T) , for some absolute numerical constant C > 0, and its running time is polynomial in T d and 1/ρ. Proof. First, the algorithm is ρ-replicable since in each batch we use a replicable LSE sub-routine with parameter ρ = ρ/B. This implies that Pr[(a1, ..., a T ) = (a 1, ..., a T )] = Pr[ i [B] : bθi was not replicable] ρ . Let us fix a batch iteration i [B 1]. Set Ci be the core set computed by Lemma 7. The algorithm first pulls ni = Cd4 log(d/δ) log2 log(d) log log log(d) ε2 i ρ 2 times each one of the arms of the i-th core set Ci, as indicated by Lemma 9 and computes the LSE bθi in a replicable way using the algorithm of Lemma 9. Let E be the event that over all batches the estimations are correct. We pick δ = 1/(2|A |T 2) so that this good event does hold with probability at least 1 1/T. Our goal is to control the expected regret which can be written as E[RT ] = T sup a A a, θ E t=1 at, θ . We have that T sup a A a, θ T sup a A a , θ 1 , Published as a conference paper at ICLR 2023 since A is a deterministic 1/T-net of A. Also, let us set the expected regret of the bounded action sub-problem as E[R T ] = T sup a A a , θ E t=1 at, θ . We can now employ the analysis of the finite arm case. During batch i, any active arm has gap at most 4εi 1, so the instantaneous regret in any round is not more than 4εi 1. The expected regret conditional on the good event E is upper bounded by i=1 4Miεi 1 , where Mi is the total number of pulls in batch i (using the replicability blow-up) and εi 1 is the error one would achieve by drawing qi samples (ignoring the blow-up). Then, for some absolute constant C > 0, we have that i=1 4 qi d3 log(d) log2 log(d) log log log(d) log2 T d2 log(T)/qi 1 , which yields that E[R T |E] C d4 log(d) log2 log(d) log log log(d) log(T) p log(T) ρ2 S , where we set q(i 1)/2 = q1/2 B X i=1 qi/2 = q(1+B)/2 . We pick B = log(T) and get that, if q = T 1/B then S = Θ( T). We remark that this choice of q is valid since i=1 qi = q B+1 q q 1 = Θ(q B) 1 Tρ2 d3 log(d) log2 log(d) log log log(d) . Hence, we have that E[R T |E] O d4 log(d) log2 log(d) log log log(d) T log3/2(T) . Note that when E does not hold, we can bound the expected regret by 1/T T = 1. This implies that the overall regret E[RT ] 2 + E[R T |E] and so it satisfies the desired bound and the proof is complete. F DEFERRED LEMMATA F.1 THE PROOF OF LEMMA 8 Proof. Consider the distribution π that is a 2-approximation to the optimal G-design and has support |C| = O(d log log d). Let C be the set of arms in the support such that π(a) c/d log d. We consider eπ = (1 x)π + xa, where a C and x will be specified later. Consider now the matrix V (eπ). Using the Sherman-Morrison formula, we have that V (eπ) 1 = 1 1 x V (π) 1 x V (π) 1aa V (π) 1 (1 x)2 1 + 1 1 x||a||2 V (π) 1 = 1 1 x V (π) 1 x V (π) 1aa V (π) 1 1 x + ||a||2 V (π) 1 Consider any arm a . 
F DEFERRED LEMMATA

F.1 THE PROOF OF LEMMA 8

Proof. Consider the distribution $\pi$ that is a 2-approximation to the optimal G-design and has support of size $|C| = O(d\log\log d)$. Let $C'$ be the set of arms in the support such that $\pi(a) < c/(d\log d)$. We consider $\tilde\pi = (1-x)\pi + x\,\delta_a$, where $a\in C'$ and $x$ will be specified later. Consider now the matrix $V(\tilde\pi)$. Using the Sherman-Morrison formula, we have that
$$V(\tilde\pi)^{-1} = \frac{1}{1-x}V(\pi)^{-1} - \frac{x\,V(\pi)^{-1}aa^\top V(\pi)^{-1}}{(1-x)^2\left(1+\frac{x}{1-x}\|a\|^2_{V(\pi)^{-1}}\right)} = \frac{1}{1-x}V(\pi)^{-1} - \frac{x\,V(\pi)^{-1}aa^\top V(\pi)^{-1}}{(1-x)\left(1-x+x\|a\|^2_{V(\pi)^{-1}}\right)}.$$
Consider any arm $a'$. Then,
$$\|a'\|^2_{V(\tilde\pi)^{-1}} = \frac{1}{1-x}\|a'\|^2_{V(\pi)^{-1}} - \frac{x}{1-x}\cdot\frac{\langle a, V(\pi)^{-1}a'\rangle^2}{1-x+x\|a\|^2_{V(\pi)^{-1}}} \le \frac{1}{1-x}\|a'\|^2_{V(\pi)^{-1}}.$$
Note that we apply this transformation at most $O(d\log\log d)$ times. Let $\hat\pi$ be the distribution we end up with. We see that
$$\|a'\|^2_{V(\hat\pi)^{-1}} \le \left(\frac{1}{1-x}\right)^{c\,d\log\log d}\|a'\|^2_{V(\pi)^{-1}} \le 2\left(\frac{1}{1-x}\right)^{c\,d\log\log d} d.$$
Notice that there is a constant $c'$ such that when $x = c'/(d\log d)$ we have that $\left(\frac{1}{1-x}\right)^{c\,d\log\log d} \le 2$. Moreover, notice that the mass of every arm is at least
$$x(1-x)^{|C'|} \ge x - |C'|x^2 = \frac{c'}{d\log(d)} - \frac{c'^2\,d\log\log d}{d^2\log^2(d)} \ge \frac{c}{d\log(d)},$$
for some absolute numerical constant $c>0$. This concludes the claim.
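The following short numerical check illustrates the two facts used in the proof above: the rank-one mixing update can be inverted via Sherman-Morrison, and mixing a small mass $x$ toward a low-mass arm inflates every $\|a'\|^2_{V(\pi)^{-1}}$ by at most a $1/(1-x)$ factor. The arm set, dimensions, and constants below are arbitrary illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_arms = 5, 40
arms = rng.normal(size=(n_arms, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)     # arms on the unit sphere

pi = rng.dirichlet(np.ones(n_arms))                     # some initial design pi
V = lambda w: (arms * w[:, None]).T @ arms              # V(w) = sum_a w(a) a a^T

def mix(pi, a_idx, x):
    """pi_tilde = (1 - x) * pi + x * delta_a."""
    new = (1 - x) * pi.copy()
    new[a_idx] += x
    return new

x, a_idx = 0.01, int(np.argmin(pi))                     # boost the lowest-mass arm
pi_t = mix(pi, a_idx, x)

# Sherman-Morrison inverse of V(pi_tilde) = (1 - x) V(pi) + x a a^T.
a = arms[a_idx]
Vinv = np.linalg.inv(V(pi))
u = Vinv @ a
sm_inv = Vinv / (1 - x) - (x / (1 - x)) * np.outer(u, u) / (1 - x + x * (a @ u))
assert np.allclose(sm_inv, np.linalg.inv(V(pi_t)), atol=1e-8)

# Every ||a'||^2 with respect to V(pi_tilde)^{-1} grows by at most a 1/(1-x) factor.
norms_before = np.einsum("id,de,ie->i", arms, Vinv, arms)
norms_after = np.einsum("id,de,ie->i", arms, sm_inv, arms)
assert np.all(norms_after <= norms_before / (1 - x) + 1e-9)
print("Sherman-Morrison update and the 1/(1-x) inflation bound both check out.")
```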
F.2 THE PROOF OF LEMMA 9

Proof. The proof works when we can treat $\Omega(\lceil d\log(1/\delta)\pi(a)/\varepsilon^2\rceil)$ as $\Omega(d\log(1/\delta)\pi(a)/\varepsilon^2)$, i.e., as long as $\pi(a) = \Omega(\varepsilon^2/(d\log(1/\delta)))$. In the regime we are in, this point is handled thanks to Lemma 8. Combining the following proof with Lemma 8, we can obtain the desired result. We underline that we work in the fixed-design setting: the arms $a_i$ are deterministically chosen, independently of the rewards $r_i$. Assume that the core set of Lemma 7 is the set $C$. Fix the multi-set $S = \{(a_i, r_i) : i\in[M]\}$, where each arm $a$ lies in the core set and is pulled $n_a = \Theta(\pi(a)\,d\log(d)\log(|C|/\delta)/\varepsilon^2)$ times (recall that $\pi(a) \ge c/(d\log(d))$ for some constant $c>0$, so this expression is $\Omega(\log(|C|/\delta)/\varepsilon^2)$). Hence, we have that $\sum_{a\in C} n_a = \Theta(d\log(d)\log(|C|/\delta)/\varepsilon^2)$. Let also $V = \sum_{i\in[M]} a_ia_i^\top$. The least-squares estimator can be written as
$$\theta^{(\varepsilon)}_{LSE} = V^{-1}\sum_{i\in[M]} a_ir_i = V^{-1}\sum_{a\in C} a\sum_{i\in[n_a]} r_i(a),$$
where each $a$ lies in the core set (deterministically) and $r_i(a)$ is the $i$-th reward generated independently by the linear regression process $\langle\theta^\star, a\rangle + \xi$, where $\xi$ is a fresh zero-mean sub-gaussian random variable. Our goal is to replicably estimate the average $\frac{1}{n_a}\sum_{i\in[n_a]} r_i(a)$ for any $a$. This is sufficient since two independent executions of the algorithm share the set $C$ and $n_a$ for any $a$. Note that the above average is a random variable. In the following, we condition on the high-probability event that the average reward of arm $a$ is $\varepsilon$-close to the expected one, i.e., to the value $\langle\theta^\star, a\rangle$. This happens with probability at least $1-\delta/(2|C|)$, given $\Omega(\pi(a)\,d\log(d)\log(|C|/\delta)/\varepsilon^2)$ samples from arm $a\in C$. In order to guarantee replicability, we will apply a result from Impagliazzo et al. (2022). Since we will union bound over all arms in the core set and $|C| = O(d\log\log(d))$ (via Lemma 7), we will make use of a $(\rho/|C|)$-replicable algorithm that gives an estimate $v(a)\in\mathbb R$ such that $|\langle\theta^\star, a\rangle - v(a)| \le \tau$, with probability at least $1-\delta/(2|C|)$. For $\delta < \rho$, the algorithm uses
$$S_a = \Omega\!\left(d^2\log(d/\delta)\log^2\log(d)\log\log\log(d)/(\rho^2\tau^2)\right)$$
many samples from the linear regression with fixed arm $a\in C$. Since we have conditioned on the randomness of $r_i(a)$ for any $i$, we get
$$\left|\frac{1}{n_a}\sum_{i\in[n_a]} r_i(a) - v(a)\right| \le \left|\frac{1}{n_a}\sum_{i\in[n_a]} r_i(a) - \langle\theta^\star, a\rangle\right| + |\langle\theta^\star, a\rangle - v(a)| \le \varepsilon + \tau,$$
with probability at least $1-\delta/(2|C|)$. Hence, by repeating this approach for all arms in the core set, we set $\theta_{SQ} = V^{-1}\sum_{a\in C} a\,n_a\,v(a)$. Let us condition on the randomness of the estimate $\theta^{(\varepsilon)}_{LSE}$. We have that
$$\sup_{a'\in\mathcal A}|\langle a', \theta_{SQ} - \theta^\star\rangle| \le \sup_{a'\in\mathcal A}|\langle a', \theta_{SQ} - \theta^{(\varepsilon)}_{LSE}\rangle| + \sup_{a'\in\mathcal A}|\langle a', \theta^{(\varepsilon)}_{LSE} - \theta^\star\rangle|.$$
Note that the second term is at most $\varepsilon$ with probability at least $1-\delta$ via Lemma 5. Our next goal is to tune the accuracy $\tau\in(0,1)$ so that the first term yields another $\varepsilon$ error. For the first term, we have that
$$\sup_{a'\in\mathcal A}|\langle a', \theta_{SQ} - \theta^{(\varepsilon)}_{LSE}\rangle| \le \sup_{a'\in\mathcal A}\left|\left\langle a', V^{-1}\sum_{a\in C} a\,n_a\right\rangle\right|(\varepsilon+\tau).$$
Note that $V = \frac{C\,d\log(d)\log(|C|/\delta)}{\varepsilon^2}\sum_{a\in C}\pi(a)aa^\top$ and so $V^{-1} = \frac{\varepsilon^2}{C\,d\log(d)\log(|C|/\delta)}V(\pi)^{-1}$, for some absolute constant $C>0$. This implies that
$$\sup_{a'\in\mathcal A}|\langle a', \theta_{SQ} - \theta^{(\varepsilon)}_{LSE}\rangle| \le (\varepsilon+\tau)\sup_{a'\in\mathcal A}\left|\left\langle a', \frac{\varepsilon^2}{C\,d\log(d)\log(|C|/\delta)}V(\pi)^{-1}\sum_{a\in C}\frac{C\,d\log(d)\log(|C|/\delta)\,\pi(a)}{\varepsilon^2}\,a\right\rangle\right|.$$
Hence, we get that
$$\sup_{a'\in\mathcal A}|\langle a', \theta_{SQ} - \theta^{(\varepsilon)}_{LSE}\rangle| \le (\varepsilon+\tau)\sup_{a'\in\mathcal A}\left|\left\langle a', V(\pi)^{-1}\sum_{a\in C}\pi(a)\,a\right\rangle\right|.$$
Consider a fixed arm $a'\in\mathcal A$. Then,
$$\left|\left\langle a', V(\pi)^{-1}\sum_{a\in C}\pi(a)\,a\right\rangle\right| \le \sum_{a\in C}\pi(a)\left|\langle a', V(\pi)^{-1}a\rangle\right| \le \sum_{a\in C}\pi(a)\,\frac{1+\langle a', V(\pi)^{-1}a\rangle^2}{2} = \frac{1+\|a'\|^2_{V(\pi)^{-1}}}{2},$$
where the last equality uses $\sum_{a\in C}\pi(a)aa^\top = V(\pi)$; moreover, $\|a'\|^2_{V(\pi)^{-1}} \le 4d$ follows from the fact that $\pi$ is a 4-approximation of the G-optimal design. Hence, in total, by picking $\tau = \varepsilon$, we get that
$$\sup_{a'\in\mathcal A}|\langle a', \theta_{SQ} - \theta^\star\rangle| \le 11\,d\,\varepsilon.$$
Thus, for any $\varepsilon > 0$, the total number of pulls of each arm is
$$\Omega\!\left(d^4\log(d/\delta)\log^2\log(d)\log\log\log(d)/(\rho^2\varepsilon^2)\right)$$
in order to get $\sup_{a'\in\mathcal A}|\langle a', \theta_{SQ} - \theta^\star\rangle| \le \varepsilon$.

G COMPUTATIONAL PERFORMANCE OF ALGORITHM 4

In this appendix, we discuss the barriers towards computational efficiency regarding Algorithm 4. The reasons why Algorithm 4 is computationally inefficient are the following: (a) we have to compute the arm in the set of active arms that has maximum correlation with the estimate $\hat\theta_i$, (b) we have to eliminate arms based on this value, and (c) we have to run at each batch the Frank-Wolfe algorithm (or some other optimization method needed for Lemma 5) in order to obtain an approximate G-optimal design. As a minimal assumption in what follows, we focus on the case where the action set $\mathcal A$ is convex and we have access to a separation oracle for it.

Note that executing both (a) and (b) naively requires time exponential in $d$. However, on the one hand, arm elimination (issue (b)) reduces to finding the intersection of the current active set with a halfspace $H$ whose normal vector is $\hat\theta_i$ and whose threshold is, roughly speaking, the maximum correlation. This maximum correlation can also be computed efficiently: finding an arm with (almost) maximum correlation relates to the problem of finding a point that maximizes a linear objective under the constraint that the point lies in the intersection of the active arm set with some linear constraints. Thus, we can use the ellipsoid algorithm to implement this step. The above discussion deals with issues (a) and (b) and, essentially, states that even with infinitely many actions, one could implement these steps efficiently.

We now focus on issue (c). The Frank-Wolfe method first requires a proper initialization. As mentioned in Lattimore & Szepesvári (2020), if the starting point is chosen to be the uniform distribution over $\mathcal A'$, then the number of iterations before getting a 2-approximate optimal design is roughly $\widetilde O(d)$. The issue is that since $\mathcal A'$ is exponential in $d$, it is not clear how to work with such an initialization efficiently. Notably, there is a different initialization (Fedorov, 2013; Lattimore et al., 2020) with support $O(d)$ for which the method runs in $O(d\log\log(d))$ rounds (see Note 3 in Section 21.2 of Lattimore & Szepesvári (2020) and Lattimore et al. (2020)). There are two issues: first, one requires an oracle to provide this good initialization. Second, each iteration of the Frank-Wolfe method (with current design guess $\pi$) requires computing a point in the current active set with maximum $V(\pi)^{-1}$-norm. (A minimal sketch of the Frank-Wolfe iteration on a finite arm set is given right below.)
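To fix ideas, here is a minimal sketch of the Frank-Wolfe iteration for an approximately G-optimal design over a finite arm set, following the standard treatment in Section 21.2 of Lattimore & Szepesvári (2020). This is not the oracle-based variant discussed above; the arm set, tolerances, and iteration cap are illustrative choices of ours. Each step performs exactly the maximum-$V(\pi)^{-1}$-norm computation that is the bottleneck in the infinite-action case.

```python
import numpy as np

def g_optimal_design(arms: np.ndarray, approx: float = 2.0, max_iters: int = 10_000):
    """Frank-Wolfe iterations for an approximately G-optimal design over a finite
    arm set (rows of `arms`). Stops once the maximum squared V(pi)^{-1}-norm is at
    most `approx * d`, i.e., pi is an `approx`-approximate design."""
    n, d = arms.shape
    pi = np.full(n, 1.0 / n)                      # uniform initialization (fine for finite sets)
    for _ in range(max_iters):
        V = (arms * pi[:, None]).T @ arms         # V(pi) = sum_a pi(a) a a^T
        Vinv = np.linalg.inv(V)
        norms = np.einsum("id,de,ie->i", arms, Vinv, arms)  # ||a||^2_{V(pi)^{-1}}
        k, g = int(np.argmax(norms)), float(np.max(norms))
        if g <= approx * d:                       # the G-optimal value is exactly d
            break
        gamma = (g / d - 1.0) / (g - 1.0)         # standard Frank-Wolfe step size
        pi = (1.0 - gamma) * pi
        pi[k] += gamma
    return pi

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    arms = rng.normal(size=(200, 6))
    pi = g_optimal_design(arms, approx=2.0)
    V = (arms * pi[:, None]).T @ arms
    worst = max(a @ np.linalg.inv(V) @ a for a in arms)
    print(f"max squared V(pi)^-1 norm: {worst:.2f} (target <= {2.0 * arms.shape[1]})")
```

When the arm set is an (implicit) polytope rather than an explicit matrix, the `argmax` line is precisely where one must plug in an approximate norm maximizer, as discussed next.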
As noted in Todd (2016), a good initialization for finding a G-optimal design, i.e., a minimum volume enclosing ellipsoid (MVEE), should be sufficiently sparse (compared to the number of active arms) and should assign positive mass to arms that correspond to extreme points, i.e., points that are close to the boundary of the MVEE. The work of Kumar & Yildirim (2005) provides an initial core set that depends only on $d$ and not on the number of points. The algorithm works as follows: it runs for $d$ iterations and, in each round, it adds 2 arms into the core set. Initially, we set the core set $C_0 = \emptyset$ and let $\Psi = \{0\}$. In each iteration $i\in[d]$, the algorithm draws a random direction $v_i$ in the orthogonal complement of $\Psi$ (this step is replicable thanks to the shared randomness) and computes the vectors in the active arm set with the maximum and the minimum correlation with $v_i$, say $a^+_i, a^-_i$. It then extends $C_0 \leftarrow C_0\cup\{a^+_i, a^-_i\}$ and sets $\Psi \leftarrow \mathrm{span}(\Psi\cup\{a^+_i - a^-_i\})$. Hence, the runtime of this algorithm corresponds to the runtime of the tasks $\max_{a\in\mathcal A}\langle a, v_i\rangle$ and $\min_{a\in\mathcal A}\langle a, v_i\rangle$. One can efficiently approximate these values using the ellipsoid algorithm and hence efficiently initialize the Frank-Wolfe algorithm as in Todd (2016) (e.g., set the weights uniformly to $1/(2d)$).

Our second challenge deals with finding a point in the active arm set with maximum $V(\pi)^{-1}$-norm, for some current guess $\pi$. Even if the current active set is a polytope, finding an exact norm maximizer is NP-hard (Freund & Orlin, 1985; Mangasarian & Shiau, 1986); in fact, even finding a constant-factor approximation, for some appropriate constant, is NP-hard (Bellare & Rogaway, 1993). Hence, one should focus on efficient approximation algorithms. We note that even a $\mathrm{poly}(d)$-approximate maximizer is sufficient to get $\widetilde O(\mathrm{poly}(d)\sqrt T)$ regret. Such an algorithm for polytopes, which achieves a $1/d^2$-approximation, is provided in Ye (1992) and Vavasis (1993). As a general note, if we assume that we have access to an oracle $\mathcal O$ that computes a 2-approximate G-optimal design in time $T_{\mathcal O}$, then our Algorithm 4 runs in time polynomial in $T_{\mathcal O}$.

H EXPERIMENTAL EVALUATION

In this section, we provide an experimental evaluation of our proposed algorithms in the setting of multi-armed stochastic bandits. In particular, we compare the performance of our algorithm with the standard UCB approach. We first describe the experimental setup.

Experimental Setup. We consider a multi-armed bandit setting with $K = 6$ arms that have Bernoulli rewards with biases $(0.44, 0.47, 0.5, 0.53, 0.56, 0.59)$; we run the algorithms for $T = 45000$ iterations and we execute both algorithms 20 different times. For our algorithm, we set its replicability parameter to $\rho = 0.3$. We are interested in comparing the replicability and the regret of the two algorithms. We underline that our theoretical bound in this setting shows that the regret of our algorithm could be, roughly, at most 360 times worse than the regret of UCB.

Observations. We first observe that our proposed bandit algorithm is indeed replicable, i.e., it satisfies our proposed reproducibility/replicability definition. In particular, for the 10 pairs of consecutive executions we consider, we observe that our algorithm pulls the exact same sequence of arms in all of them. On the other hand, the standard UCB algorithm is far from being replicable. To be more precise, between consecutive executions we see that the algorithm makes a different choice
in between 26208 and 28002 rounds out of the 45000 rounds, which is more than half of the rounds! However, the regret of our approach is quite competitive compared to the standard UCB algorithm. In particular, the regret of the standard UCB algorithm has the expected $\mathrm{polylog}(T)$ shape, while the regret of our replicable variant is, roughly, a multiplicative factor of 2.5 larger than that of UCB. This is depicted in Figure 1, where we plot the average cumulative regret (across the 20 executions) of the two algorithms.

Figure 1: Cumulative regret of UCB vs. our replicable batched algorithm.
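For completeness, the following sketch shows how such a comparison can be set up. It is an illustrative re-implementation, not the authors' code: the replicable baseline below is a simple round-robin-then-commit stand-in (not the paper's Algorithm 2), and the exploration length and seeds are placeholders. Two executions share the internal behavior of the algorithm but draw independent rewards, and we count the positions at which the pulled arms differ.

```python
import numpy as np

BIASES = np.array([0.44, 0.47, 0.5, 0.53, 0.56, 0.59])
T, K = 45_000, len(BIASES)

def ucb(reward_rng):
    """Standard UCB1; returns the sequence of pulled arms."""
    pulls, means, seq = np.zeros(K), np.zeros(K), []
    for t in range(T):
        if t < K:
            a = t                                   # pull each arm once
        else:
            a = int(np.argmax(means + np.sqrt(2 * np.log(t + 1) / pulls)))
        r = float(reward_rng.random() < BIASES[a])
        means[a] = (means[a] * pulls[a] + r) / (pulls[a] + 1)
        pulls[a] += 1
        seq.append(a)
    return seq

def batched_stand_in(reward_rng, explore_rounds=3_000):
    """A simple stand-in for a replicable policy (NOT the paper's Algorithm 2):
    a fixed round-robin exploration phase followed by committing to the empirical
    best arm. Its arm sequence depends on the rewards only through the commitment,
    so two executions can disagree only if they elect different winners."""
    means, pulls, seq = np.zeros(K), np.zeros(K), []
    for t in range(explore_rounds):
        a = t % K
        r = float(reward_rng.random() < BIASES[a])
        means[a] = (means[a] * pulls[a] + r) / (pulls[a] + 1)
        pulls[a] += 1
        seq.append(a)
    best = int(np.argmax(means))
    seq.extend([best] * (T - explore_rounds))
    return seq

def mismatches(algo):
    """Run `algo` twice with independent reward randomness and count the rounds
    where the two executions pull different arms."""
    s1, s2 = algo(np.random.default_rng(1)), algo(np.random.default_rng(2))
    return sum(a != b for a, b in zip(s1, s2))

print("UCB mismatched rounds:             ", mismatches(ucb))
print("Batched stand-in mismatched rounds:", mismatches(batched_stand_in))
```

UCB typically disagrees with itself in a large fraction of the rounds, consistent with the numbers reported above, whereas a policy whose arm schedule is insensitive to the reward noise disagrees rarely, and only in its commitment phase.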