# Multiplier Bootstrap-based Exploration

Runzhe Wan*1, Haoyu Wei*2,3, Branislav Kveton1, Rui Song1

*Equal contribution. 1Amazon. 2Department of Economics, University of California San Diego. 3Department of Statistics, North Carolina State University. Correspondence to: Runzhe Wan, Rui Song.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Despite the great interest in the bandit problem, designing efficient algorithms for complex models remains challenging, as there is typically no analytical way to quantify uncertainty. We propose Multiplier Bootstrap-based Exploration (MBE), a novel exploration strategy that is applicable to any reward model amenable to weighted loss minimization. We prove both instance-dependent and instance-independent rate-optimal regret bounds for MBE in sub-Gaussian multi-armed bandits. With extensive simulation and real-data experiments, we show the generality and adaptivity of MBE.

1. Introduction

The bandit problem has found wide applications in areas such as clinical trials (Durand et al., 2018), finance (Shen et al., 2015), and recommendation systems (Zhou et al., 2017), among others. Accurate uncertainty quantification is the key to addressing the exploration-exploitation trade-off. Most existing bandit algorithms critically rely on certain analytical properties of the imposed model (e.g., linear bandits) to quantify the uncertainty and derive the exploration strategy. Thompson Sampling (TS, Thompson, 1933) and the Upper Confidence Bound (UCB, Auer et al., 2002) are two prominent examples, which are typically based on explicit-form posterior distributions or confidence sets, respectively. However, in many real problems, the reward model is fairly complex: e.g., a general graphical model (Chapelle & Zhang, 2009) or a pipeline with multiple prediction modules and manual rules. In these cases, it is typically impossible to quantify the uncertainty analytically, and frameworks such as TS or UCB are either methodologically not applicable or computationally infeasible.

Motivated by these real needs, we are concerned with the following question: can we design a practical bandit algorithm framework that is general, adaptive, and computationally tractable, with certain theoretical guarantees? A straightforward idea is to apply the bootstrap method (Efron, 1992), a widely applicable data-driven approach for measuring uncertainty. However, as discussed in Section 2, most existing bootstrap-based bandit algorithms are either heuristic without a theoretical guarantee, computationally intensive, or only applicable in limited scenarios.

Contribution. Our contributions are three-fold. First, to address the aforementioned limitations, we propose a general-purpose bandit algorithm framework, Multiplier Bootstrap-based Exploration (MBE). MBE is based on the multiplier bootstrap (Van Der Vaart & Wellner, 1996), an easy-to-adapt bootstrap framework that only requires randomly weighted data points. We further show that a naive application of the multiplier bootstrap may result in linear regret, and we introduce a suitable way to add additional perturbations for sufficient exploration. The main advantage of MBE is its generality: it is applicable to any reward model amenable to weighted loss minimization, without the need for analytical-form uncertainty quantification or case-by-case algorithm design.
As a data-driven exploration strategy, MBE is also adaptive to different environments.

Second, theoretically, we prove near-optimal regret bounds for MBE under sub-Gaussian multi-armed bandits (MAB), in both the instance-dependent and the instance-independent sense. Compared with all existing results for bootstrap-based bandit algorithms, our result is strictly more general (see Table 1), since existing results only apply to special cases of sub-Gaussian distributions. To overcome the technical challenges, we prove a novel concentration inequality for certain functions of sub-exponential variables and develop, to the best of our knowledge, the first finite-sample concentration and anti-concentration analysis for the multiplier bootstrap. Given the broad applications of the multiplier bootstrap in statistics and machine learning, our theoretical analysis is of independent interest.

(This work does not relate to the positions of Runzhe Wan, Branislav Kveton and Rui Song at Amazon.)

Table 1: Comparison of several bootstrap- and perturbation-based bandit algorithms. All papers derive near-optimal regret bounds in MAB, under different reward distribution requirements. To compare the computational cost, we focus on MAB and consider Algorithm 2 for MBE. See Section 2 for more discussion of this table.

| Method | Exploration Source | Methodology Generality | Theory Requirement | Computation Cost |
| --- | --- | --- | --- | --- |
| MBE (this paper) | intrinsic & extrinsic | general | sub-Gaussian | O(KT) |
| GIRO (Kveton et al., 2019b) | intrinsic & extrinsic | general | Bernoulli | O(T^2) |
| ReBoot (Wang et al., 2020; Wu et al., 2022) | intrinsic & extrinsic | fixed & finite set of arms | Gaussian | O(KT) |
| PHE (Kveton et al., 2019a; 2020a;b) | only extrinsic | general | bounded | O(KT) |

Third, with extensive simulation and real-data experiments, we demonstrate that MBE yields comparable performance with existing algorithms in different MAB settings and in three real-world problems (online learning to rank, online combinatorial optimization, and dynamic slate optimization). This supports that MBE is easily generalizable, as it requires minimal modifications and derivations to match the performance of near-optimal algorithms specifically designed for each problem. Moreover, we also show that MBE adapts to different environments and is relatively robust, due to its data-driven nature.

2. Related Work

The most popular bandit algorithms, arguably, include ϵ-greedy (Watkins, 1989), TS, and UCB. ϵ-greedy is simple and thus widely used. However, its exploration strategy is not aware of the uncertainty in the data and is therefore known to be statistically sub-optimal. TS and UCB rely on posteriors and confidence sets, respectively. Yet, their closed forms only exist in limited cases, such as MAB or linear bandits. For a few other models (such as generalized linear models or neural networks), we know how to construct approximate posteriors or confidence sets with theoretical guarantees (Filippi et al., 2010; Li et al., 2017; Phan et al., 2019; Kveton et al., 2020a;b), though the corresponding algorithms are usually costly or conservative. In more general cases, it is often not clear how to adapt UCB and TS in a valid and efficient way. Although approximate TS can be implemented via approximate posterior inference methods (e.g., particle filtering, Markov chain Monte Carlo, or variational inference) (Gopalan et al., 2014; Kawale et al., 2015; Wan et al., 2021; Urteaga & Wiggins, 2018; Yu et al., 2020), these approaches do not come with guarantees.
Moreover, the dependency on the probabilistic model assumptions (e.g., the reward distribution family or the noise level) also pose challenges to being robust. To enable wider applications of bandit algorithms, several bootstrap-based (and related perturbation-based) methods have been proposed in the literature. Most algorithms are TS-type, by replacing the posterior with a bootstrap distribution. We next review the related papers, and summarize those with near-optimal asymptotic regret bounds in Table 1. We divide the sources of exploration into (i) leveraging the intrinsic randomness in the observed data (e.g., by randomizing the subset of history used for training) and (ii) manually adding extrinsic perturbations that are independent of the observed data (e.g., adding additive Gaussian noise to observed rewards). Arguably, the non-parametric bootstrap is the most wellknown bootstrap method, which works by re-sampling data with replacement. Vaswani et al. (2018) propose a version of non-parametric bootstrap with forced exploration to achieve a O(T 2/3) regret bound in Bernoulli MAB. GIRO proposed in Kveton et al. (2019b) successfully achieves a rate-optimal regret bound in Bernoulli MAB, by adding Bernoulli perturbations to non-parametric bootstrap. However, due to the re-sampling nature of non-parametric bootstrap, it is challenging to implement it efficiently beyond Bernoulli MAB (see Section 4.3). Specifically, the computational cost of re-sampling scales quadratically in T. Riou & Honda (2020) apply Bayesian bootstrap (Rubin, 1981), which is a smooth version of non-parametric Bootstrap. An asymptotically optimal regret bound is proved for MAB with bounded rewards. However, only MAB is studied and similar computational challenge exists in general cases. Another line of research is the residual bootstrap-based approach (Re Boot) (Hao et al., 2019; Wang et al., 2020; Tang et al., 2021; Wu et al., 2022). For each arm, Re Boot randomly perturbs the residuals of the corresponding observed rewards with respect to the estimated model to quantify the uncertainty for its mean reward. Although these methods also use random weights, they are applied to residuals, and thus are fundamentally different from our work. The limitation is that, by design, this approach is only applicable to problems with a fixed and finite set of arms, since the residuals are attached closely to each arm (see Appendix Multiplier Bootstrap-Based Exploration A.4 for more details). The perturbed history exploration (PHE) algorithm (Kveton et al., 2019a; 2020a;b) is also related. PHE works by adding additive noise to the observed rewards. Osband et al. (2019) apply similar ideas to reinforcement learning. However, PHE has two main limitations. First, for models where adding additive noise is not feasible (e.g., decision trees), PHE is not applicable. Second, as demonstrated in both Wang et al. (2020) and our experiments, the fact that PHE relies on only the extrinsically injected noise for exploration makes it less robust. For a complex structured problem, it may not be clear how to add the noise in a sound way (Wang et al., 2020). In contrast, it is typically more natural (and hence easier to be accepted) to leverage the intrinsic randomness in the observed data. Finally, we note that multiplier bootstrap has been considered in the bandit literature, mostly as a computationally efficient approximation to non-parametric bootstrap studied in those papers. 
Eckles & Kaptein (2014) study the direct adaptation of the multiplier bootstrap (see Section 4.1) in simulation, and its empirical performance in contextual bandits is studied later (Tang et al., 2015; Elmachtoub et al., 2017; Riquelme et al., 2018; Bietti et al., 2021). However, no theoretical guarantee is provided in these works. In fact, as demonstrated in Section 4.1, such a naive adaptation may incur linear regret. Osband & Van Roy (2015) show that, in Bernoulli MAB, a variant of the multiplier bootstrap is mathematically equivalent to TS; no further theoretical or numerical results are provided beyond this special case. Our work is the first systematic study of the multiplier bootstrap in bandits. Our unique contributions include: we identify the potential failure of naively applying the multiplier bootstrap, highlight the importance of additional perturbations, design a general algorithm framework to make this heuristic idea concrete, provide the first theoretical guarantee in general MAB settings, and conduct extensive numerical experiments to study its generality and adaptivity.

3. Preliminary

Setup. We consider a general stochastic bandit problem. For any positive integer $M$, let $[M] = \{1, \dots, M\}$. At each round $t \in [T]$, the agent observes a context vector $x_t$ (empty in non-contextual problems) and an action set $\mathcal{A}_t$, then chooses an action $A_t \in \mathcal{A}_t$, and finally receives the corresponding reward $R_t = f(x_t, A_t) + \epsilon_t$, where $f$ is an unknown function and $\epsilon_t$ is the noise term. Without loss of generality, we assume $f(x_t, A_t) \in [0, 1]$. We note that the realized reward $R_t$ does not need to be bounded. The goal is to minimize the cumulative regret

$$\mathrm{Reg}_T = \sum_{t=1}^{T} \mathbb{E}\Big[\max_{a \in \mathcal{A}_t} f(x_t, a) - f(x_t, A_t)\Big].$$

At the end of round $t$, with the existing dataset $\mathcal{D}_t = \{(x_l, A_l, R_l)\}_{l \in [t]}$, to decide the action $A_{t+1}$, most algorithms first estimate $f$ in some function class $\mathcal{F}$ by solving a weighted loss minimization problem (also called weighted empirical risk minimization or cost-sensitive training)

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{l=1}^{t} \omega_l\, L\big(f(x_l, A_l), R_l\big) + J(f). \qquad (1)$$

Here, $L$ is a loss function (e.g., the $\ell_2$ loss or the negative log-likelihood), $\omega_l$ is the weight of the $l$-th data point, and $J$ is an optional penalty function. We consider the weighted problem as it is general and related to our proposal below; one can set $\omega_l \equiv 1$ to recover the unweighted problem.

As the simplest example, consider the $K$-armed bandit problem, where $x_l$ is empty and $\mathcal{A}_l = [K]$. Let $L$ be the $\ell_2$ loss, $J \equiv 0$, and $f(x_l, A_l) \equiv r_{A_l}$, where $r_k$ is the mean reward of the $k$-th arm. Then, (1) reduces to $\arg\min_{\{r_1, \dots, r_K\}} \sum_{l=1}^{t} \omega_l (R_l - r_{A_l})^2$, which gives the estimator $\hat{r}_k = \big(\sum_{l: A_l = k} \omega_l\big)^{-1} \sum_{l: A_l = k} \omega_l R_l$, i.e., the arm-wise weighted average. Similarly, in linear bandits, (1) reduces to the weighted least-squares problem (see Appendix A.2 for details).
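Many off-the-shelf learners already support this weighted objective, which is what makes (1) a convenient interface. As a minimal illustration (our own sketch, not the paper's code; the estimator choice is an assumption), any scikit-learn regressor whose `fit` accepts a `sample_weight` argument can serve as the oracle for (1):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_weighted_oracle(X, rewards, weights):
    """One call to the weighted-loss oracle in (1): here F is a gradient-boosted
    tree ensemble with the squared loss; any regressor accepting sample_weight
    could be plugged in instead."""
    model = GradientBoostingRegressor()
    model.fit(X, rewards, sample_weight=weights)
    return model

# Toy usage: rows of X encode the (context, action) pair for each logged round.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
rewards = X[:, 0] + 0.1 * rng.normal(size=100)
weights = rng.exponential(1.0, size=100)   # multiplier weights, e.g. Exp(1)
f_hat = fit_weighted_oracle(X, rewards, weights)
print(f_hat.predict(X[:2]))
```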
Challenges. The estimation of $f$, together with the related uncertainty quantification, forms the foundation of most bandit algorithms. In the literature, $\mathcal{F}$ is typically a class of models that permits closed-form uncertainty quantification (e.g., linear models, Gaussian processes, etc.). However, in many real applications, the reward model can have a fairly complicated structure, e.g., a hierarchical pipeline with both classification and regression modules; manually specified rules are also a common part of the model. It is challenging to quantify the uncertainty of these complicated models in analytical form. Even when feasible, the dependency on probabilistic model assumptions also poses challenges to robustness.

Therefore, in this paper, we focus on the bootstrap-based approach due to its generality and data-driven nature. Bootstrapping, as a general approach to quantifying model uncertainty, has many variants. The most popular one, arguably, is the non-parametric bootstrap (used in GIRO), which constructs bootstrap samples by re-sampling the dataset with replacement. However, due to its re-sampling nature, it is computationally intensive (see Section 4.3 for more discussion). In contrast, the multiplier bootstrap (Van Der Vaart & Wellner, 1996), as an efficient and easy-to-implement alternative, is popular in statistics and machine learning.

Multiplier bootstrap. The main idea of the multiplier bootstrap is to learn the model using randomly weighted data points. Specifically, given a multiplier weight distribution $\rho(\omega)$, for every bootstrap sample at round $t$, we first randomly sample $\{\omega^{MB}_l\}_{l=1}^{t} \sim \rho(\omega)$ and then solve (1) with $\omega_l = \omega^{MB}_l$ to obtain $\hat{f}^{MB}$. Repeating this procedure, the distribution of $\hat{f}^{MB}$ forms the bootstrap distribution that quantifies our uncertainty over $f$. Popular choices of $\rho(\omega)$ include $N(1, \sigma_\omega^2)$, $\mathrm{Exp}(1)$, $\mathrm{Poisson}(1)$, and the double-or-nothing distribution $2 \cdot \mathrm{Bernoulli}(0.5)$.

4. Multiplier Bootstrap-based Exploration

4.1. Failure of the Naive Adaptation of Multiplier Bootstrap

To design an exploration strategy based on the multiplier bootstrap, a natural idea is to replace the posterior distribution in TS with the bootstrap distribution. Specifically, at every time point, we sample a function $\hat{f}$ following the multiplier bootstrap procedure described in Section 3 and then take the greedy action $\arg\max_{a \in \mathcal{A}_t} \hat{f}(x_t, a)$. However, perhaps surprisingly, such an adaptation may not be valid. The main reason is that the intrinsic randomness in a finite dataset is, in some cases, not enough to guarantee sufficient exploration. For example, the support of the bootstrap distribution cannot go outside the convex hull of the observed rewards. We illustrate this with the following toy example; a short simulation sketch of this failure mode follows the proof.

Example 1. Consider a two-armed Bernoulli bandit. Let the mean rewards of the two arms be $p_1$ and $p_2$, respectively. Without loss of generality, assume $1 > p_1 > p_2 > 0$, and let $P(\omega = 0) = 0$. Then, with non-zero probability, an agent following the naive adaptation of the multiplier bootstrap (breaking ties randomly and initializing in an optimistic way; see Algorithm 5 in Appendix A.3 for details) pulls arm 1 only once. Therefore, the agent suffers a linear regret.

Proof. We first define two events $E_1 = \{A_1 = 1, R_1 = 0\}$ and $E_2 = \{A_2 = 2, R_2 = 1\}$. By design, at time $t = 1$, the agent chooses an arm uniformly at random and hence pulls arm 1 with probability 0.5. The observed reward $R_1$ is then 0 with probability $1 - p_1$. Therefore, $P(E_1) = 0.5(1 - p_1)$. Conditioned on $E_1$, at $t = 2$ the agent pulls arm 2 (since multiplying $R_1 = 0$ by any weight always gives 0), and it observes reward $R_2 = 1$ with probability $p_2$. Conditioned on $E_1 \cap E_2$, by induction, the agent pulls arm 2 at every $t > 2$: the only reward record for arm 1 is $R_1 = 0$, so its weighted average is always 0, which is smaller than the weighted average for arm 2, which is positive. In conclusion, with probability at least $0.5(1 - p_1)p_2 > 0$, the algorithm takes the optimal arm 1 only once.
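As referenced above, the following small simulation (our own illustration, not code from the paper) runs the naive multiplier-bootstrap rule of Example 1 with Exp(1) weights, so that $P(\omega = 0) = 0$; after the unlucky start $E_1 \cap E_2$, the agent never returns to arm 1:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.3])           # arm index 0 ("arm 1") is optimal
T = 2000

# Force the unlucky start of Example 1: arm 1 returns reward 0, arm 2 returns 1.
arms, rewards = [0, 1], [0.0, 1.0]

for t in range(2, T):
    arms_arr = np.array(arms)
    rewards_arr = np.array(rewards)
    est = np.empty(2)
    for k in range(2):
        mask = arms_arr == k
        w = rng.exponential(1.0, size=mask.sum())   # multiplier weights, P(w = 0) = 0
        est[k] = np.dot(w, rewards_arr[mask]) / w.sum()
    a = int(np.argmax(est))        # greedy with respect to one bootstrap draw
    arms.append(a)
    rewards.append(float(rng.random() < p[a]))

# Arm 1's only record is 0, so its bootstrap estimate is stuck at 0 forever.
print("pulls of the suboptimal arm:", sum(a == 1 for a in arms))   # ~T
```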
4.2. Main Algorithm

The failure of the naive application of the multiplier bootstrap implies that some additional randomness is needed to ensure sufficient exploration. In this paper, we achieve this by adding pseudo-rewards, an approach that has proven effective in a few other setups (Kveton et al., 2019b; Wang et al., 2020). The intuition is as follows. The under-exploration issue arises when, by randomness, the observed rewards fall in the low-value region (compared with the expected reward). Therefore, if we blend in data points whose rewards have a relatively wide coverage, the agent has a higher chance to explore.

These discussions motivate the design of our main algorithm, Multiplier Bootstrap-based Exploration (MBE), presented in Algorithm 1. Specifically, at every round, in addition to the observed reward, we add two pseudo-rewards with values 0 and 1. The pseudo-rewards are associated with the pulled arm and the context (if it exists). Then, we solve a weighted loss minimization problem to update the model estimate (line 8). The weights are first sampled from a multiplier distribution (line 7), and those of the pseudo-rewards are additionally multiplied by a tuning parameter $\lambda$. In MAB, the estimates are the arm-wise weighted averages of all (observed or pseudo-) rewards

$$\hat{r}_k = \frac{\sum_{\ell: A_\ell = k} \big( \omega_\ell R_{k,\ell} + \lambda \omega'_\ell \cdot 1 + \lambda \omega''_\ell \cdot 0 \big)}{\sum_{\ell: A_\ell = k} \big( \omega_\ell + \lambda \omega'_\ell + \lambda \omega''_\ell \big)} \qquad (2)$$

for $k \in [K]$, where the multiplier weights $\{\omega_\ell, \omega'_\ell, \omega''_\ell\}_{\ell=1}^{t} \sim \rho(\omega)$. See Appendix A.1 for details.

We make three remarks on the algorithm design. First, we choose to add pseudo-rewards at the boundaries of the mean-reward range (i.e., $[0, 1]$), since such a design naturally induces a high variance (and hence more exploration); adding pseudo-rewards in other ways is also possible. Second, the tuning parameter $\lambda$ controls the amount of extrinsic perturbation and, together with the dispersion of $\rho(\omega)$, determines the degree of exploration. In Section 5, we give a theoretically valid range for $\lambda$. Finally and critically, besides guaranteeing sufficient exploration, we need to make sure the optimal arm can still be identified (asymptotically) after adding the pseudo-rewards. Intuitively, this is guaranteed, since the pseudo-rewards shift and scale the (asymptotic) mean reward from $f(x, a)$ to $(f(x, a) + \lambda)/(1 + 2\lambda) = f(x, a)/(1 + 2\lambda) + \lambda/(1 + 2\lambda)$, which preserves the order between arms. A detailed analysis for MAB can be found in Appendix A.1.

We conclude this section by revisiting Example 1 to provide some insight into how the pseudo-rewards help.

Example 1 (Continued). Even under the event $E_1 \cap E_2$, Algorithm 1 explores. To see this, take the multiplier distribution to be $2 \cdot \mathrm{Bernoulli}(0.5)$. Then

$$P(A_3 = 1) \ge P(\bar{Y}_1 > \bar{Y}_2) = P\!\left( \frac{\lambda \omega'_1}{\omega_1 + \lambda \omega'_1 + \lambda \omega''_1} > \frac{\omega_2 + \lambda \omega'_2}{\omega_2 + \lambda \omega'_2 + \lambda \omega''_2} \right) \ge P\big(\omega'_1 = 2,\ \omega_1 = \omega''_1 = \omega_2 = \omega'_2 = 0,\ \omega''_2 = 2\big) > 0,$$

where $\bar{Y}_k$ denotes the weighted average (2) for arm $k$. Therefore, the agent can still choose the optimal arm.

Algorithm 1: General Template for MBE
1  Data: function class $\mathcal{F}$, loss function $L$, (optional) penalty function $J$, multiplier weight distribution $\rho(\omega)$, tuning parameter $\lambda$
2  Initialize $\hat{f}$
3  for $t = 1, \dots, T$ do
4    Observe context $x_t$ and action set $\mathcal{A}_t$
5    Take action $A_t = \arg\max_{a \in \mathcal{A}_t} \hat{f}(x_t, a)$ (break ties randomly)
6    Observe reward $R_t$
7    Sample the multiplier weights $\{\omega_l, \omega'_l, \omega''_l\}_{l=1}^{t} \sim \rho(\omega)$
8    Solve the weighted loss minimization problem
       $\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{l=1}^{t} \big[ \omega_l L(f(x_l, A_l), R_l) + \lambda \omega'_l L(f(x_l, A_l), 1) + \lambda \omega''_l L(f(x_l, A_l), 0) \big] + J(f)$
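For concreteness, here is a minimal sketch (ours, not the authors' code; parameter defaults are assumptions) of one bootstrap draw of the MAB estimate (2): all multiplier weights are resampled from $N(1, \sigma_\omega^2)$, and each arm's estimate averages its observed rewards together with the pseudo-rewards 1 and 0.

```python
import numpy as np

def mbe_mab_estimates(arms, rewards, K, lam=0.5, sigma_w=1.0, rng=None):
    """One bootstrap draw of the arm means in (2).

    Each observation of arm k contributes its reward plus pseudo-rewards 1 and 0,
    with weights w, lam*w1, lam*w0 drawn from N(1, sigma_w^2).
    """
    rng = rng or np.random.default_rng()
    arms = np.asarray(arms)
    rewards = np.asarray(rewards, dtype=float)
    r_hat = np.full(K, np.inf)                 # unpulled arms stay optimistic
    for k in range(K):
        r = rewards[arms == k]
        if r.size == 0:
            continue
        w, w1, w0 = rng.normal(1.0, sigma_w, size=(3, r.size))
        num = np.dot(w, r) + lam * w1.sum() * 1.0 + lam * w0.sum() * 0.0
        den = w.sum() + lam * (w1.sum() + w0.sum())
        # A near-zero denominator can occur with Gaussian weights; a practical
        # guard is omitted in this sketch.
        r_hat[k] = num / den
    return r_hat

# The next action is the argmax of one such draw (Algorithm 1 specialized to MAB).
```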
4.3. Computationally-Efficient Implementation

Efficient computation is critical for real applications of bandit algorithms. One potential limitation of Algorithm 1 is its computational burden: at every decision point, we need to re-sample the weights for all historical observations and refit the model (lines 7-8). This leads to a total computational cost of order $O(T^2)$, similar to GIRO. Fortunately, one prominent advantage of the multiplier bootstrap over other bootstrap methods (such as the non-parametric bootstrap or the residual bootstrap) is that the (approximate) bootstrap distribution can be efficiently updated in an online manner, so that the per-round computation cost does not grow over time. Suppose we have a dataset $\mathcal{D}_t$ at time $t$, and denote by $\mathcal{B}(\mathcal{D}_t)$ the corresponding bootstrap distribution for $f$. With the multiplier bootstrap, it is feasible to update $\mathcal{B}(\mathcal{D}_{t+1})$ approximately based on $\mathcal{B}(\mathcal{D}_t)$. We detail the procedure below and elaborate further in Algorithm 2.

Specifically, we maintain $B$ different models $\{\hat{f}_{b,t}\}_{b=1}^{B}$ and, for every $b \in [B]$, the corresponding observed history $H_b = \{(x_l, A_l, R_l, \omega_{l,b})\}_{l=1}^{t}$ and pseudo-history $H'_b = \{(x_l, A_l, 1, \omega'_{l,b})\}_{l=1}^{t} \cup \{(x_l, A_l, 0, \omega''_{l,b})\}_{l=1}^{t}$. The models $\{\hat{f}_{b,t}\}_{b=1}^{B}$ can be regarded as sampled from $\mathcal{B}(\mathcal{D}_t)$, and hence the empirical distribution over them approximates the bootstrap distribution. At every time point $t$, for each replicate $b$, we only need to sample the weights for the new data point and then update $\hat{f}_{b,t}$ to $\hat{f}_{b,t+1}$. Then, $\{\hat{f}_{b,t+1}\}_{b=1}^{B}$ are still $B$ valid samples from $\mathcal{B}(\mathcal{D}_{t+1})$ and hence still a valid approximation. We note that, since there is only one new data point, the update of $\hat{f}$ can typically be done efficiently (e.g., with a closed-form update or via online gradient descent). The per-round computational cost is hence independent of $t$. Such an approximation is common practice in the online bootstrap literature and can be regarded as an ensemble sampling-type algorithm (Lu & Van Roy, 2017; Qin et al., 2022). The hyper-parameter $B$ is typically not treated as a tuning parameter but depends on the available computational resources (Hao et al., 2019). In our numerical experiments, this practical variant shows the desired performance with $B = 50$. Moreover, the algorithm is embarrassingly parallel and easy to implement: given an existing implementation for estimating $f$ (i.e., solving (1)), the main requirement is to replicate it $B$ times and use random weights for each. This feature is attractive in real applications.

Algorithm 2: Practical Implementation of MBE
1  Data: number of bootstrap replicates $B$, function class $\mathcal{F}$, loss function $L$, (optional) penalty function $J$, weight distribution $\rho(\omega)$, tuning parameter $\lambda$
2  Let $H_b = \{\}$ be the history and $H'_b = \{\}$ be the pseudo-history, for every $b \in [B]$
3  Initialize $\hat{f}_{b,0}$ for every $b \in [B]$
4  for $t = 1, \dots, T$ do
5    Observe context $x_t$ and action set $\mathcal{A}_t$
6    Sample an index $b_t$ uniformly from $\{1, \dots, B\}$
7    Offer $A_t = \arg\max_{a \in \mathcal{A}_t} \hat{f}_{b_t, t-1}(x_t, a)$ (break ties randomly)
8    Observe reward $R_t$
9    for $b = 1, \dots, B$ do
10     Sample the weights $\omega_{t,b}, \omega'_{t,b}, \omega''_{t,b} \sim \rho(\omega)$
11     Update $H_b = H_b \cup \{(x_t, A_t, R_t, \omega_{t,b})\}$ and $H'_b = H'_b \cup \{(x_t, A_t, 1, \omega'_{t,b}), (x_t, A_t, 0, \omega''_{t,b})\}$
12     Solve the weighted loss minimization problem
         $\hat{f}_{b,t} = \arg\min_{f \in \mathcal{F}} \sum_{l=1}^{t} \big[ \omega_{l,b} L(f(x_l, A_l), R_l) + \lambda \omega'_{l,b} L(f(x_l, A_l), 1) + \lambda \omega''_{l,b} L(f(x_l, A_l), 0) \big] + J(f)$
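To illustrate the online update, here is a minimal MAB instantiation of Algorithm 2 (our own sketch, not the authors' code; class and variable names, and the hyper-parameter defaults, are assumptions). Each replicate keeps running weighted sums, so the per-round cost does not grow with $t$.

```python
import numpy as np

class MBEMab:
    """Minimal sketch of Algorithm 2 for a K-armed bandit."""

    def __init__(self, K, B=50, lam=0.5, sigma_w=1.0, seed=0):
        self.K, self.B, self.lam, self.sigma_w = K, B, lam, sigma_w
        self.rng = np.random.default_rng(seed)
        # Per-replicate running numerator and denominator of (2).
        self.num = np.zeros((B, K))
        self.den = np.zeros((B, K))

    def act(self):
        b = self.rng.integers(self.B)        # pick one replicate (line 6)
        means = np.where(self.den[b] > 0,
                         self.num[b] / np.maximum(self.den[b], 1e-12),
                         np.inf)             # optimistic for unpulled arms
        return int(np.argmax(means))

    def update(self, arm, reward):
        # One weight triple per replicate for the new data point (lines 9-12);
        # old weights are kept, so the sums can be updated incrementally.
        w, w1, w0 = self.rng.normal(1.0, self.sigma_w, size=(3, self.B))
        self.num[:, arm] += w * reward + self.lam * w1 * 1.0 + self.lam * w0 * 0.0
        self.den[:, arm] += w + self.lam * (w1 + w0)

# Usage on a toy Bernoulli bandit.
true_means = np.array([0.2, 0.5, 0.7])
agent, rng = MBEMab(K=3), np.random.default_rng(1)
for t in range(5000):
    a = agent.act()
    agent.update(a, float(rng.random() < true_means[a]))
print("most-pulled arm (expected: 2):", np.argmax(agent.den.sum(axis=0)))
```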
5. Regret Analysis

In this section, we provide regret bounds for Algorithm 1 under MAB with sub-Gaussian rewards. We regard this as a first step towards the theoretical understanding of MBE and leave the analysis of more general settings for future work. We call a random variable $X$ $\sigma$-sub-Gaussian if $\mathbb{E}\exp\{t(X - \mathbb{E}X)\} \le \exp\{t^2\sigma^2/2\}$ for any $t \in \mathbb{R}$. The instantiation of Algorithm 1 under MAB is presented as Algorithm 3 in the appendix.

Theorem 5.1. Consider a $K$-armed bandit where the reward distribution of arm $k$ is 1-sub-Gaussian with mean $\mu_k$. Suppose arm 1 is the unique best arm, i.e., it has the highest mean reward, and define $\Delta_k = \mu_1 - \mu_k$. Take the multiplier weight distribution as $N(1, \sigma_\omega^2)$ and let the tuning parameter satisfy

$$\lambda \ge 1 + \sigma_\omega^2/4 + 4/\sigma_\omega + \sqrt{4\,(1 + 4/\sigma_\omega)/\sigma_\omega}.$$

Then the problem-dependent regret is upper bounded by

$$\mathrm{Reg}_T \le \sum_{k:\, \Delta_k > 0} \Big\{ 7\Delta_k + 55\big[C_1(\lambda, \sigma_\omega) + C_2(\lambda, \sigma_\omega)\big]\, \Delta_k^{-1} \log T \Big\},$$

and the problem-independent regret is bounded by

$$\mathrm{Reg}_T \le 7K\mu_1 + C_1(\lambda, \sigma_\omega)\, K \log T + C_2(\lambda, \sigma_\omega)\, \sqrt{KT \log T},$$

where

$$C_1(\lambda, \sigma_\omega) = 8\sqrt{2}\,\big[C_3(\lambda, \sigma_\omega) + 38\sigma_\omega^2\big], \qquad C_2(\lambda, \sigma_\omega) = 6\lambda^2 + 45(3 + \sigma_\omega^2)\lambda^4\big[C_3(\lambda, \sigma_\omega) + 38\sigma_\omega^2\big],$$

$$C_3(\lambda, \sigma_\omega) = \frac{\log\big[(1 + 15\sigma_\omega^{-2} + 3\sigma_\omega + 10\sigma_\omega^2)\lambda^2\big]}{3\log 2} + 1.$$

The two regret bounds are near-optimal (up to logarithmic terms) in both the problem-dependent and the problem-independent sense (Lattimore & Szepesvári, 2020). Notably, the Gaussian distribution and all bounded distributions belong to the sub-Gaussian class. Therefore, as reviewed in Table 1, our theory is strictly more general than all existing results for bootstrap-based MAB algorithms.

Technical challenges. It is particularly challenging to analyze MBE for two reasons. First, the probabilistic analysis of the multiplier bootstrap itself is technically challenging, since the same random weights appear in both the numerator and the denominator (recall that MBE uses the weighted averages (2) to select actions in MAB), and it is notoriously complicated to analyze ratios of random variables, especially correlated ones. Second, existing bootstrap-based papers rely on the properties of specific parametric reward classes (e.g., Bernoulli in Kveton et al. (2019b) and Gaussian in Wang et al. (2020)), while we lose these nice structures when considering sub-Gaussian rewards.

To overcome these challenges, we denote the first $s$ rewards from pulling arm $k$ as $\mathcal{H}_{k,s}$, with the $i$-th observation denoted as $R_{k,i}$, and start by carefully defining two good events $G_{k,s}$ and $A_{k,s}$. Here, $G_{k,s}$ denotes the event that the weighted average

$$\bar{Y}_{k,s} = \frac{\sum_{i=1}^{s} \big[\omega_i R_{k,i} + \lambda \omega'_i \cdot 1 + \lambda \omega''_i \cdot 0\big]}{\sum_{i=1}^{s} \big(\omega_i + \lambda \omega'_i + \lambda \omega''_i\big)}$$

is close to the unweighted average (with pseudo-rewards)

$$\bar{R}_{k,s} = \frac{\sum_{i=1}^{s} \big(R_{k,i} + \lambda \cdot 1 + \lambda \cdot 0\big)}{\sum_{i=1}^{s} (1 + \lambda + \lambda)} = \frac{\sum_{i=1}^{s} R_{k,i} + s\lambda}{s(1 + 2\lambda)},$$

and $A_{k,s}$ represents the event that $\bar{R}_{k,s}$ is close to its population mean $(\mu_k + \lambda)/(1 + 2\lambda)$. It is worth noting that $\{\omega_i, \omega'_i, \omega''_i\}_{i=1}^{s}$ are resampled from $\rho(\omega)$ at each round. To bound the probability of the bad event and the regret conditioned on it, we face two major technical challenges. First, when transforming the ratio into an analyzable form, a summation of correlated sub-Gaussian and sub-exponential variables appears and is hard to analyze. We carefully design and analyze a novel event to remove the correlation and the sub-Gaussian terms (see the proof of Lemma D.3). Second, the proof needs a new concentration inequality for functions of sub-exponential variables that does not exist in the literature; we obtain such an inequality (Lemma E.4) via a careful analysis of sub-exponential distributions. To the best of our knowledge, our proof provides the first finite-sample concentration and anti-concentration analysis for the multiplier bootstrap, which has broad applications in statistics and machine learning.

Extension. Theorem 5.1 is proved with Gaussian weights to simplify the analysis.
Indeed, to analyze the multiplier bootstrap, we need to analyze $\sum_{\ell: A_\ell = k} \omega_\ell R_{k,\ell}$ conditioned on the reward history, which is a sum of scaled i.i.d. variables, and the Gaussian distribution has nice analytical properties that let us derive the bound. We hypothesize that our result can be extended to other weight distributions satisfying similar anti-concentration and concentration properties, such as

$$\underline{C} \exp\{-t^2/(2\underline{\sigma}^2)\} \le P(|\omega_i - 1| > t) \le \bar{C} \exp\{-t^2/(2\bar{\sigma}^2)\}$$

with positive constants $\underline{C}, \bar{C}, \underline{\sigma}, \bar{\sigma}$. However, we expect some analytical challenges.

Tuning parameters. In Theorem 5.1, MBE has two tuning parameters, $\lambda$ and $\sigma_\omega$. Intuitively, $\lambda$ controls the amount of external perturbation and $\sigma_\omega$ controls the magnitude of exploration from the bootstrap. In general, higher values of these two parameters facilitate exploration but also lead to slower convergence. The condition on $\lambda$ in Theorem 5.1 requires that (i) $\lambda$ is not too small and (ii) the joint effect of $\lambda$ and $\sigma_\omega$ is not too small; both are intuitive. In practice, this condition can be loose: e.g., it requires $\lambda \ge 5.25 + 2\sqrt{5} \approx 9.72$ when $\sigma_\omega = 1$ (plugging $\sigma_\omega = 1$ into the condition gives $1 + 1/4 + 4 + \sqrt{20}$). As we observe in Section 6, MBE with a much smaller $\lambda$ (e.g., 0.5) still performs well empirically.

6. Experiments

In this section, we empirically evaluate MBE with both simulations (Section 6.1) and real datasets (Section 6.2).

6.1. MAB Simulation

We first experiment with simulated MAB instances. The goals are to (i) further validate our theoretical findings, (ii) check whether MBE yields comparable performance with standard methods, and (iii) study the robustness and adaptivity of MBE. We also experimented with linear bandits, and the main findings are similar; to save space, we defer those results to Appendix B.1.

We compare MBE with TS (Thompson, 1933), PHE (Kveton et al., 2019a), ReBoot (Wang et al., 2020), and GIRO (Kveton et al., 2019b). The last three are the existing bootstrap- or perturbation-type algorithms reviewed in Section 2. Specifically, PHE explores by perturbing observed rewards with additive noise, without leveraging the intrinsic uncertainty in the data; ReBoot explores by perturbing the residuals of the rewards observed for each arm; and GIRO re-samples observed data points with replacement. In all experiments below, the weights of MBE are sampled from $N(1, \sigma_\omega^2)$.¹ We fix $\lambda = 0.5$ and run MBE with three values of $\sigma_\omega^2$: 0.5, 1, and 1.5. We also compare with the naive adaptation of the multiplier bootstrap (i.e., no pseudo-rewards; denoted as Naive MB). We run Algorithm 2 with $B = 50$ replicates.

We first study 10-armed bandits, where the mean reward of each arm is independently sampled from Beta(1, 8). We consider three reward distributions: Bernoulli, Gaussian, and exponential. For the Gaussian MAB, the reward noise is sampled from $N(0, 1)$; the other two distributions are determined by their means. For TS, we always use the correct reward distribution class and its conjugate prior, with the prior mean and variance calibrated using the true model; TS is therefore a strong baseline. For GIRO and ReBoot, we use the default implementations, as they work well. For PHE, the original paper adds Bernoulli perturbations since it only studies bounded reward distributions; we extend PHE by sampling additive noise from the same distribution family as the true rewards, as done in Wu et al. (2022). GIRO, ReBoot, and PHE all have one tuning parameter that controls the degree of exploration. We tune it over $\{2^{k-4}\}_{k=0}^{6}$ and report the best performance for each method.
Without tuning, these algorithms generally do not perform well as originally proposed, due to differences in the settings. We tuned Naive MB as well.

Results. Results over 100 runs are reported in Figure 1.

¹ We also experimented with other weight distributions, with similar main conclusions. Using Gaussian weights allows us to study the impact of different multiplier magnitudes more clearly.

Our findings can be summarized as follows. First, without knowledge of the problem settings (e.g., the reward distribution family and its parameters, or the prior distribution) and without heavy tuning, MBE performs favorably and close to TS. Second, pseudo-rewards are indeed important for exploration; otherwise the algorithm suffers a linear regret. Third, MBE has stable performance across different $\sigma_\omega$, while the other methods are tuned for their best performance; this is thanks to the data-driven nature of MBE. Finally, the other three general-purpose exploration strategies perform reasonably after tuning, as expected. However, GIRO is computationally intense: for example, in Gaussian bandits, the time cost of GIRO is 2 minutes, while all the other algorithms complete within 10 seconds. The computational burden is due to the limitation of the non-parametric bootstrap (see Section 4.3). ReBoot also performs reasonably, yet by design it is not easy to extend to more complex problems (e.g., the problems in Section 6.2).

Adaptivity. PHE relies on sampling additive noise from an appropriate distribution, and TS can be viewed similarly. In the results above, we provide auxiliary information about the environment to them and need to modify their implementations in different setups. In contrast, MBE automatically adapts to these problems; as argued in Section 2, one main advantage of MBE over them is its adaptiveness. To see this, we consider the following procedure: we run the Gaussian versions of TS and PHE in the Bernoulli MAB, and run their Bernoulli versions in the Gaussian MAB. We also run MBE with $\sigma_\omega^2 = 0.5$; MBE does not require any modification across the two problems. The results presented in Figure 2 clearly demonstrate that MBE adapts to reward distributions. Similarly, in Figure 3, we also study the adaptivity of these methods with respect to the reward distribution scale (the standard deviation $\sigma$ of the Gaussian noise) and the task distribution (we sample the mean rewards from Beta($\alpha$, 8) and vary the parameter $\alpha$). For all settings, we use the algorithms tuned for Figure 1. MBE shows impressive adaptivity, while PHE and TS may not perform well when the environment is not close to the one they are tuned for. Recall that, in real applications, heavy tuning is not possible without the ground truth. This demonstrates the adaptivity of MBE as a data-driven exploration strategy.

Additional results. In Appendix B.2, we also try different values of $\lambda$ and $B$ for MBE, and we repeat the main experiment with $K = 25$. Our main observations still hold, and MBE is relatively robust to its tuning parameters.

6.2. Real-Data Applications

The main benefit of MBE is that it easily generalizes to complex models. In this section, we use real datasets to demonstrate this property. Specifically, we test whether MBE can achieve comparable performance with strong problem-specific baselines proposed in the literature, without problem-specific algorithm design and heavy tuning.
Figure 1: Simulation results under MAB (Bernoulli, Gaussian, and exponential rewards). The error bars indicate the standard errors, which may not be visible when the width is small.

Figure 2: Robustness results with respect to the reward distribution class (Bernoulli MAB and Gaussian MAB).

Figure 3: Results with different reward variances and task distributions: (a) trend with $\alpha$; (b) trend with $\sigma$. For the x-axis in both panels and the y-axis in the second one, we plot on a logarithmic scale for better visualization.

Domain-specific models. We study the three problems considered in Wan et al. (2022): cascading bandits for online learning to rank (Kveton et al., 2015), combinatorial semi-bandits for online combinatorial optimization (Chen et al., 2013), and multinomial logit (MNL) bandits for dynamic slate optimization (Agrawal et al., 2017; 2019). All of these are practical and important problems in real life. Yet, these domain models all have unique structures and require case-by-case algorithm design. For example, the rewards in MNL bandits follow multinomial distributions that depend on the pulled arms in a complex way; to derive the posterior or confidence bound, one has to use a delicately designed epoch-type procedure (Agrawal et al., 2019).

Datasets. We use the three datasets studied in Wan et al. (2022). Specifically, we use the Yelp rating dataset (Zong et al., 2016) to recommend and rank K restaurants, the Adult dataset (Dua & Graff, 2017) to send advertisements to K/2 men and K/2 women (a combinatorial semi-bandit problem with continuous rewards), and the MovieLens dataset (Harper & Konstan, 2015) to display K movies. In our experiments, we fix K = 4 and randomly sample 30 items from each dataset to choose from. We provide a summary of these datasets and problems in Appendix B.3, and refer interested readers to Wan et al. (2022) and references therein for more details.

Baselines. We compare MBE with state-of-the-art baselines in the literature, including TS-Cascade (Zhong et al., 2021) and CascadeKL-UCB (Kveton et al., 2015) for cascading bandits, CUCB (Chen et al., 2016) and CTS (Wang & Chen, 2018) for semi-bandits, and MNL-TS (Agrawal et al., 2017) and MNL-UCB (Agrawal et al., 2019) for MNL bandits. To save space, we denote the TS-type algorithms by TS and the UCB-type ones by UCB. We also study PHE and ϵ-greedy (EG) as two other general-purpose exploration strategies.

Tuning. For the baseline methods, as in Section 6.1, we either use the default hyperparameters in Wan et al. (2022) or tune them extensively via grid search and present their best performance. For EG, we choose the exploration rate as $\epsilon_t = \min(1, a/(2\sqrt{t}))$ with tuning parameter $a$, following Kveton et al. (2020a). For MBE, with every bootstrap sample, we estimate the reward model via maximum weighted likelihood estimation, which yields a closed-form solution that allows online updating in all three problems. The other implementation details are similar to Section 6.1.

Results. We present the results in Figure 4. The overall findings are consistent with Section 6.1. First, without any additional derivations or algorithm design, MBE matches the performance of problem-specific algorithms. Second, pseudo-rewards are important to guarantee sufficient exploration, and naively applying the multiplier bootstrap may fail.
Third, MBE has relatively stable performance with varying σω, since its exploration is mostly data-driven. In contrast, the hyper-parameters of PHE and EG have to be carefully tuned, since they rely on externally added perturbation or Multiplier Bootstrap-Based Exploration 0 10000 20000 30000 T Dynamic Slate Optimization 0 10000 20000 30000 T Online Learning to Rank 0 10000 20000 30000 T Online Advertisement Optimization EG MBE ( 2=0.5) MBE ( 2=1) MBE ( 2=1.5) Naive MB PHE TS UCB Figure 4: Real-data results for three structured bandit problems that need domain-specific models. forced exploration. For example, the best parameters for EG in the three problems are a = 5, 0.1 and 0.5. Finally, PHE does not perform well in MNL and cascading bandits, where the outcomes are binary. We investigated this trend and found that the response rates (i.e., the probabilities for the binary outcome to be 1) in the two datasets are low. In this case, PHE introduces too much noise to explore, which slows down the estimation convergence. 7. Conclusion In this paper, we propose a new bandit exploration strategy, Multiplier Bootstrap-based Exploration (MBE). The main advantage of MBE is its generality: for any reward model that can be estimated via weighted loss minimization, the idea of MBE is applicable, and requires minimal efforts on derivation or implementation of the exploration mechanism. As a datadriven method, MBE also shows nice adaptivity. We prove near-optimal regret bounds for MBE in the sub-Gaussian MAB setup, which is more general than in other bootstrapbased bandit papers. Numerical experiments demonstrate that MBE is general, efficient, and adaptive. There are a few meaningful future extensions. First, the regret analysis for MBE (and more generally, other bootstrapbased bandit methods) in more complicated setups would be valuable. The main challenge comes from analyzing the finite-sample property of multiplier bootstrap. Second, adding pseudo-rewards at every round is needed for the analysis. We hypothesize that there exists a more adaptive and efficient way of introducing extrinsic perturbation, such that we have sufficient exploration while avoiding over-exploration. Third, the practical implementation of MBE relies on an ensemble of models to approximate the bootstrap distribution and the online regression oracle to update the model estimation. Both parts lead to approximation and also correlation over time. Our numerical experiments show that such an approach works well empirically, but it would be still mean- ingful to have more theoretical understanding. Lastly, in this paper, we present MBE assuming the knowledge of a generative model of the rewards (i.e., assuming the existence of a regression oracle). The idea can be naturally generalized to the policy-based setting, where we assume the existence of a classification oracle that can compute the optimal policy within a pre-specified policy class (see, e.g., Agarwal et al. (2014)). We leave this to future study. Multiplier Bootstrap-Based Exploration Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638 1646. PMLR, 2014. Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Thompson sampling for the mnl-bandit. ar Xiv preprint ar Xiv:1706.00977, 2017. Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Mnlbandit: A dynamic learning approach to assortment selection. 
Operations Research, 67(5):1453 1485, 2019. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002. Bietti, A., Agarwal, A., and Langford, J. A contextual bandit bake-off. J. Mach. Learn. Res., 22:133 1, 2021. Chapelle, O. and Zhang, Y. A dynamic bayesian network click model for web search ranking. In Proceedings of the 18th international conference on World wide web, pp. 1 10, 2009. Chen, W., Wang, Y., and Yuan, Y. Combinatorial multiarmed bandit: General framework and applications. In International Conference on Machine Learning, pp. 151 159. PMLR, 2013. Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746 1778, 2016. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. Durand, A., Achilleos, C., Iacovides, D., Strati, K., Mitsis, G. D., and Pineau, J. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine learning for healthcare conference, pp. 67 82. PMLR, 2018. Eckles, D. and Kaptein, M. Thompson sampling with the online bootstrap. ar Xiv preprint ar Xiv:1410.4009, 2014. Efron, B. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pp. 569 593. Springer, 1992. Elmachtoub, A. N., Mc Nellis, R., Oh, S., and Petrik, M. A practical method for solving contextual bandit problems using decision trees. ar Xiv preprint ar Xiv:1706.04687, 2017. Filippi, S., Cappe, O., Garivier, A., and Szepesv ari, C. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010. Gopalan, A., Mannor, S., and Mansour, Y. Thompson sampling for complex online problems. In International conference on machine learning, pp. 100 108. PMLR, 2014. Hao, B., Abbasi Yadkori, Y., Wen, Z., and Cheng, G. Bootstrapping upper confidence bound. Advances in Neural Information Processing Systems, 32, 2019. Harper, F. M. and Konstan, J. A. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):1 19, 2015. Kawale, J., Bui, H. H., Kveton, B., Tran-Thanh, L., and Chawla, S. Efficient thompson sampling for online matrix-factorization recommendation. Advances in neural information processing systems, 28, 2015. Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767 776. PMLR, 2015. Kveton, B., Szepesv ari, C., Ghavamzadeh, M., and Boutilier, C. Perturbed-history exploration in stochastic multiarmed bandits. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 2786 2793, 2019a. Kveton, B., Szepesvari, C., Vaswani, S., Wen, Z., Lattimore, T., and Ghavamzadeh, M. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 3601 3610. PMLR, 2019b. Kveton, B., Szepesv ari, C., Ghavamzadeh, M., and Boutilier, C. Perturbed-history exploration in stochastic linear bandits. In Uncertainty in Artificial Intelligence, pp. 530 540. PMLR, 2020a. Kveton, B., Zaheer, M., Szepesvari, C., Li, L., Ghavamzadeh, M., and Boutilier, C. Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 2066 2076. PMLR, 2020b. Lai, C. D. and Balakrishnan, N. 
Continuous bivariate distributions. Springer, 2009. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071 2080. PMLR, 2017. Multiplier Bootstrap-Based Exploration Lu, X. and Van Roy, B. Ensemble sampling. Advances in neural information processing systems, 30, 2017. Osband, I. and Van Roy, B. Bootstrapped thompson sampling and deep exploration. ar Xiv preprint ar Xiv:1507.00300, 2015. Osband, I., Van Roy, B., Russo, D. J., Wen, Z., et al. Deep exploration via randomized value functions. J. Mach. Learn. Res., 20(124):1 62, 2019. Phan, M., Abbasi Yadkori, Y., and Domke, J. Thompson sampling and approximate inference. Advances in Neural Information Processing Systems, 32, 2019. Qin, C., Wen, Z., Lu, X., and Roy, B. V. An analysis of ensemble sampling. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview. net/forum?id=c6ibx0yl-a G. Riou, C. and Honda, J. Bandit algorithms based on thompson sampling for bounded reward distributions. In Algorithmic Learning Theory, pp. 777 826. PMLR, 2020. Riquelme, C., Tucker, G., and Snoek, J. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. ar Xiv preprint ar Xiv:1802.09127, 2018. Rubin, D. B. The bayesian bootstrap. The annals of statistics, pp. 130 134, 1981. Shen, W., Wang, J., Jiang, Y.-G., and Zha, H. Portfolio choices with orthogonal bandit learning. In Twenty-fourth international joint conference on artificial intelligence, 2015. Tang, L., Jiang, Y., Li, L., Zeng, C., and Li, T. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 323 332, 2015. Tang, Q., Xie, H., Xia, Y., Lee, J., and Zhu, Q. Robust contextual bandits via bootstrapping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 12182 12189, 2021. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. Urteaga, I. and Wiggins, C. Variational inference for the multi-armed contextual bandit. In International Conference on Artificial Intelligence and Statistics, pp. 698 706. PMLR, 2018. Van Der Vaart, A. W. and Wellner, J. A. Weak convergence. In Weak convergence and empirical processes, pp. 16 28. Springer, 1996. Vaswani, S., Kveton, B., Wen, Z., Rao, A., Schmidt, M., and Abbasi-Yadkori, Y. New insights into bootstrapping for bandits. ar Xiv preprint ar Xiv:1805.09793, 2018. Wan, R., Ge, L., and Song, R. Metadata-based multi-task bandits with bayesian hierarchical models. Advances in Neural Information Processing Systems, 34:29655 29668, 2021. Wan, R., Ge, L., and Song, R. Towards scalable and robust structured bandits: A meta-learning framework. ar Xiv preprint ar Xiv:2202.13227, 2022. Wang, C.-H., Yu, Y., Hao, B., and Cheng, G. Residual bootstrap exploration for bandit algorithms. ar Xiv preprint ar Xiv:2002.08436, 2020. Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5114 5122. PMLR, 2018. Watkins, C. J. C. H. Learning from delayed rewards. 1989. Wen, Z., Kveton, B., and Ashkan, A. 
Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, pp. 1113 1122. PMLR, 2015. Wu, S., Wang, C.-H., Li, Y., and Cheng, G. Residual bootstrap exploration for stochastic linear bandit. ar Xiv preprint ar Xiv:2202.11474, 2022. Yu, T., Kveton, B., Wen, Z., Zhang, R., and Mengshoel, O. J. Graphical models meet bandits: A variational thompson sampling approach. In International Conference on Machine Learning, pp. 10902 10912. PMLR, 2020. Zhang, H. and Chen, S. Concentration inequalities for statistical inference. Communications in Mathematical Research, 37(1):1 85, 2021. Zhang, H. and Wei, H. Sharper sub-weibull concentrations. Mathematics, 10(13):2252, 2022. Zhong, Z., Chueng, W. C., and Tan, V. Y. Thompson sampling algorithms for cascading bandits. Journal of Machine Learning Research, 22(218):1 66, 2021. Zhou, Q., Zhang, X., Xu, J., and Liang, B. Large-scale bandit approaches for recommender systems. In International Conference on Neural Information Processing, pp. 811 821. Springer, 2017. Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. ar Xiv preprint ar Xiv:1603.05359, 2016. Multiplier Bootstrap-Based Exploration A. Additional Method Details A.1. MBE for MAB In this section, we present the concrete form of MBE when being applied to MAB. Recall that xt is null, At [K], and rk is the mean reward of the k-th arm. We define f(xt, At; r) = r At, where the parameter vector r = (r1, . . . , r K) . We define the loss function as t=1 ωt(r At Rt)2. The solution is then (br1, . . . , br K) with brk = (P t:At=k ωt) 1 P t:At=k ωt Rt, i.e., the arm-wise weighted average. After adding the pseudo rewards, we can give algorithm for MAB in Algorithm 3. Next, we provide intuitive explanation on why Algorithm 3 works. Indeed, denote s := |Hk,T |, where Hk,T is the set of observed rewards for the k-th arm up to round T. Let Rk,l be the l-th element in Hk,T . Then Y k,s = Ps i=1 ωi Rk,i + λ Ps i=1 ω i Ps i=1 ωi + λ Ps i=1 ω i + λ Ps i=1 ω i = s 1 Ps i=1 ωi(Rk,i µk) + s 1 Ps i=1(ωi 1) + λs 1 Ps i=1(ω i 1) + µk + λ s 1 Ps i=1(ωi 1) + λs 1 Ps i=1(ω i 1) + λs 1 Ps i=1(ω i 1) + 1 + 2λ by using the law of large numbers. Then, by Slutsky s theorem, Y k,s µk + λ i=1 ωi(Rk,i µk) + 1 s i=1 (ωi 1) + λ s i=1 (ω i 1) will weakly converge to a mean-zero Gaussian distribution N 0, σ2 k+2 (1+2λ)2 σ2 ω . Therefore, our algorithm preserves the order of the arms for any λ > 0. Algorithm 3: MBE for MAB with sub-Gaussian rewards with mean bounded in [0, 1] Data: number of arms K, multiplier weight distribution ρ(ω), tuning parameter λ 2 Set Hk = {} be the history of the arm k and Y k = + , k [K] 3 for t = 1, . . . , T do 4 Pull At = arg maxk [K] Y k (break tie randomly), 5 Observe reward Rt 6 Set Hk = Hk {Rt} 7 for k = 1, . . . , K do 8 if |Hk| > 0 then 9 Sample the multiplier weights {ωl, ω l, ω l }|Hk| l=1 ρ(ω). 10 Update the mean reward ℓ=1 (ωℓ Rk,ℓ+ ω ℓ 1 λ + ω ℓ 0 λ) ℓ=1 (ωℓ+ λω ℓ+ λω ℓ) where Rk,l is the l-th element in Hk. A.2. MBE for stochastic linear bandits In this section, we derive the form of MBE when applied to stochastic linear bandits. We focus on the setup where xt is empty and At Rp is a linear feature vector, and other setups of linear bandits can be formulated similarly. In this case, Multiplier Bootstrap-Based Exploration Algorithm 4: MBE for linear bandits. 
Data: number of arms K, multiplier weight distribution ρ(ω), tuning parameter λ 2 Set Hk = {} be the history of the arm k, set A0 = 0, bθ0 = 0 with b0 = 0, and V0 = (1 + ξ)Ip. 3 if t = 1, . . . , p then 4 Offer At = t. 6 for t = p + 1, . . . , T do 7 Offer At = arg maxa At a θt (break tie randomly) 8 Observe reward Rt 9 Set Hk = Hk {Rt} 10 for k = 1, . . . , K do 11 if |Hk| > 0 then 12 Sample the multiplier weights {ωl, ω l, ω l }|Hk| l=1 ρ(ω). 13 Update the following quantities: Vt+1 = Vt + ωt At A t + λω t0Id + λω t Id; bt+1 = bt + At ωt Rt + λω t0 + λω t 1 ; Refresh the parameter as bθt+1 = V 1 t+1bt+1. f(xt, At; θ) = A t θ where the parameter vector is θ Rp. Then, the weighted loss function is t=1 ωt A t θ Rt 2 + ξ where ξ 0 is a penalty tuning parameter. The solution is the standard weighted ridge regression estimator and can be updated in the following way: 0. Initialization: A0 = 0, bθ0 = 0 with b0 = 0, and V0 = (ξ + 1)Idim(At). 1. bθt = V 1 t At; 2. Vt+1 = Vt + ωt At A t , bt+1 = bt + ωt Rt At, and hence update bθt+1 = V 1 t+1bt+1 = Vt + ωt At A t 1 bt + ωt Rt At = V 1 t V 1 t At(ω 1 t + A t V 1 t At) 1A t V 1 t 3. Take the action At+1 = arg maxa At a θt+1. The MBE algorithm for linear bandits is presented in Algorithm 4. A.3. Naive Adaptation of the Multiplier Bootstrap We present the naive multiplier bootstrap-based exploration algorithm in Algorithm 5. Specifically, there is no pseudorewards added. A.4. Re Boot For completeness, we introduce the details of Re Boot in this section and discuss its generalizability. More details can be found in the original papers (Wang et al., 2020; Wu et al., 2022). Multiplier Bootstrap-Based Exploration Algorithm 5: A Naive Design of MBE Data: Function class F, loss function L, (optional) penalty function J, multiplier weight distribution ρ(ω), tuning parameter λ 2 Set H = {} be the history be the pseudo-history 3 Initialize bf in an optimistic way 4 for t = 1, . . . , T do 5 Observe context xt and action set At 6 Offer At = arg maxa At bf(xt, a) (break tie randomly) 7 Observe reward Rt 8 Update H = H {(xt, At, Rt)} 9 Sample the multiplier weights {ωl}t l=1 ρ(ω) 10 Solve the weighted loss minimization problem to update bf as bf = arg min f F l=1 ωl L f(xl, Al), Rl + J(f). 0 10000 20000 T 0 10000 20000 T 0 10000 20000 T GIRO MBE ( =0.25) MBE ( =0.5) MBE ( =0.75) Naive MB PHE Reboot TS 0 10000 20000 T 0 10000 20000 T 0 10000 20000 T GIRO MBE ( 2=0.5) MBE ( 2=1.0) MBE ( 2=1.5) Naive MB PHE Reboot TS Figure 5: Performance of MBE in three linear bandit problems. Consider a stochastic bandit problem with a fixed and finite set of arm A. Every arm a A may have a fixed feature vector (which with slight overload of notation, we also denote as a). The mean reward of arm a is f(a). At each round t , Re Boot first fit the model f as bf using all data. Then for each arm a, Re Boot first computes the corresponding residuals using rewards related to that arm as {ϵt = Rt bf(a)}t:At=a,t t , then perturbs these residuals with random weights as {ωtϵt = Rt bf(a)}t:At=a,t t (Re Boot also adds pseudo-residuals, which we omit for ease of notations), and finally use bf(a) + |{t : At = a, t t }| 1 P t:At=a,t t ωtϵt as the perturbed estimation of the mean reward of arm a. By design, it can be seen that Re Boot critically relies on the reward history of each fixed arm. Therefore, to the best of our understanding, it is not easy to extend Re Boot to problems with either changing (e.g., contextual problems) or infinite arms. 
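Complementing the weighted least-squares formulation in Appendix A.2, here is a minimal sketch (our own illustration, not the authors' code; names are hypothetical) of a sequentially updated weighted ridge estimator using the rank-one inverse update shown in A.2, so no matrix inversion is needed per round. The pseudo-reward terms of Algorithm 4 are omitted for brevity, and Exp(1) multiplier weights are assumed to keep the update well-conditioned.

```python
import numpy as np

class WeightedRidge:
    """Online weighted ridge regression: theta_hat = V^{-1} b, with
    V = xi*I + sum_t w_t A_t A_t^T and b = sum_t w_t R_t A_t."""

    def __init__(self, p, xi=1.0):
        self.V_inv = np.eye(p) / xi
        self.b = np.zeros(p)

    def update(self, a, reward, w):
        # Sherman-Morrison rank-one update of V^{-1} for the weighted term w * a a^T.
        Va = self.V_inv @ a
        self.V_inv -= np.outer(Va, Va) / (1.0 / w + a @ Va)
        self.b += w * reward * a

    @property
    def theta(self):
        return self.V_inv @ self.b

# Usage: one bootstrap replicate of MBE for a linear reward model.
rng = np.random.default_rng(0)
p, theta_star = 5, np.array([0.5, -0.2, 0.1, 0.3, 0.0])
model = WeightedRidge(p)
for t in range(2000):
    a = rng.normal(size=p)                        # feature of the pulled arm
    r = a @ theta_star + 0.1 * rng.normal()
    model.update(a, r, w=rng.exponential(1.0))    # multiplier weight ~ Exp(1)
print(np.round(model.theta, 2))                   # close to theta_star
```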
Multiplier Bootstrap-Based Exploration 0 10000 20000 T 0 10000 20000 T 0 10000 20000 T 300 Exponential GIRO MBE ( =0.25) MBE ( =0.5) MBE ( =0.75) Naive MB PHE Re Boot TS Figure 6: Performance of MBE with different values of λ in MAB. B. More Experiment Results and Details B.1. Results for linear bandits We also consider the linear bandit problem. The linear bandit version of MBE is presented in Appendix A.2. We experiment with several dimensions p = 10, 15, 20. The number of arms is K = 100. The feature vector xk Rp of arm k is generated as follows. For the last 10 arms, the features are drawn uniformly at random from (0, 1). For the first 90 arms, we consider a practical setup where they are low-rank: we first generate a loading metric A = (aij) Rp 5 from Uniform(0, 1), then sample b R5 from Uniform(0, 1), and finally constructs xk = Ab. The parameter vector θ Rp is uniformly sampled from [0, 1]p. We normalize the feature vectors such that the mean reward µk = x k θ falls within the interval [0, 1]. The rewards of arm k are drawn i.i.d. from Bernoulli(x k θ). We still compare MBE with the method for linear bandit version of GIRO, PHE, and Re Boot with tuning to their best performance over the hyper-parameter set {2k 4}6 k=0 and report the best performance of each method. For TS, we use Gaussian for both its reward and prior distribution, and calibrate their parameters using the true model. The total rounds are T = 20000 and our results are averaged over 50 randomly chosen problems. Most other details are similar to our MAB experiments. We present the results in Figure 5, where we vary either σω or λ in the two subplots. We can see that naive MB leads to a linear regret. Hence, the pseudo-reward also matters in this problem. MBE achieves comparable performance with strong baselines such as TS. Another finding is that MBE is robust to its tuning parameters. Finally, Reboot needs to pull K times to initialize (the linear regret part in the first K rounds) due to the nature of its design. In contrast, most other linear bandit algorithms typically only need p rounds of forced exploration. This shows the limitation of the Re Boot framework (See Appendix A.4). B.2. Additional results In this section, we study the performance of MBE with respect to a few other hyper-parameters. We first study the robustness to another tuning parameter λ. Recall that λ controls the amount of external perturbation. Specifically, we repeat the experiment in 6.1 with σ2 ω fixed as 0.5 and with different values of λ (0.25, 0.5, 0.75). From Figure 6, it can be seen that a small amount of pseudo-rewards (λ = 0.25) seems sufficient in these settings, and the results are fairly stable. We believe this is because the exploration of MBE is main driven by the internal randomness in the data. In Figure 7, we repeat our main experiments with K changed to 25. We can see that our main conclusions still hold. Finally, in Figure 8, we implement MBE with different number of replicates B. As expected, more replicates does help exploration due to a better approximation to the whole bootstrapping distribution. Yet, we find that B = 50 suffices to generate comparable performance with TS and the performance of MBE becomes relatively stable for larger values of B. B.3. Details of the real-data experiments Our real-data experiments closely follow Wan et al. (2022). For completeness, we provide information of the three problems here, and refer interested readers to Wan et al. (2022) and references therein. 
B.3. Details of the real-data experiments

Our real-data experiments closely follow Wan et al. (2022). For completeness, we provide information on the three problems here, and refer interested readers to Wan et al. (2022) and the references therein.

In an online learning to rank problem, we aim to select and rank K items from a pool of L ones. We iteratively interact with users to learn about their preferences. The cascading model is popular in learning to rank (Kveton et al., 2015); it models the user behaviour as glancing from top to bottom (like a cascade) and clicking an item, following a Bernoulli distribution, when she looks at that item. Therefore, we observe binary outcomes for all items that the user has examined, and there are complex dependencies between them.

In a slate optimization (also called assortment optimization) problem, we aim to offer K items from a pool of L ones, especially when there exist substitution effects. The Multinomial Logit (MNL) model characterizes the choice behaviour as a multinomial distribution based on the attractiveness of each item. Since the offered subset changes over rounds, the joint likelihood is actually complex. To obtain the posterior or confidence bounds, one has to resort to an epoch-type offering schedule (Agrawal et al., 2017).

Online combinatorial optimization also has numerous applications (Wen et al., 2015), including maximum weighted matching, ads allocation, webpage optimization, etc. It is common that every chosen item generates a separate observation, which is known as the semi-bandit problem. We consider a special problem in our experiments, where we need to choose K persons from a pool under constraints.

The three datasets we use (and the related problem setups) are studied in the corresponding TS papers in the literature. To generate random rewards, we either generate them from a real-data-calibrated model or sample them directly from the dataset. We follow Wan et al. (2022) and the references therein. For cascading or MNL bandits, we split the dataset into a training set and a testing set, use the training set to estimate the reward model, and compare the methods on the testing set. For semi-bandits, we sample rewards from the dataset.

C. Main Proof

This section gives the proof of the main regret bound (Theorem 5.1). Section D gives the major lemmas required to bound the regret components used in this section. Section E lists all supporting technical lemmas, including the lower bound of the Gaussian tail and some novel results on the concentration properties of sub-Gaussian and sub-exponential distributions.

Before beginning our proof, we first provide the definitions of sub-Gaussian and sub-exponential variables. A mean-zero random variable X is called sub-Gaussian with variance proxy σ² if E exp{tX} ≤ exp{t²σ²/2} for any t ∈ R, and we denote it as X ∼ sub G(σ²). A mean-zero variable X is called sub-exponential with parameters λ and α if E exp{tX} ≤ exp{t²λ²/2} for all |t| < 1/α, and we denote it as X ∼ sub E(λ, α). For simplicity, we denote sub E(λ) := sub E(λ, λ). Sub-Gaussian and sub-exponential variables play important roles in bandit problems and exhibit various concentration properties. For more details, please refer to Zhang & Chen (2021) and Zhang & Wei (2022).

Notations: Let P_ξ(A) = ∫_A dF_ξ(x) denote the probability of the event A, where F_ξ(x) is the distribution function of the random variable ξ.
Similarly, let Eξf(ξ) = R f(x) d Fξ(x) represent the expectation. We write two functions a(s, T) b(s, T) if a(s, T) c b(s, T) for some constant c independent of s and T. We write a(s, T) b(s, T) if both a(s, T) b(s, T) and a(s, T) b(s, T). Furthermore, we define a b = max{a, b} and a b = min{a, b} for any real numbers a and b. Similarly, we define a b c = max{a, b, c} and a b c = min{a, b, c} for any a, b, c R. We will present a comprehensive version of the MBE theory under MAB, as stated in Theorem C.1, along with its proof. In this version, we allow for arbitrary variance proxies σ2 k instead of constraining them to be equal to one in Theorem 5.1. Theorem C.1. Consider a K-armed bandit, where the reward distribution of arm k is sub G(σ2 k) with mean µk. Suppose µ1 = maxk [K] µk and k = µ1 µk. Take the multiplier weight distribution as N(1, σ2 ω) in Algorithm 3. Let the tuning parameters satisfy λ 1 + σ2 ω 4 + 4σ1 σω + 1 , Then the problem-dependent regret is upper bounded by 7 + C1(σ1, σk, λ, σω) + C2(σ1, σk, λ, σω) C1(σ1, σk, λ, σω) = 55 8 2 max k [K] σ2 k log D2(σ1, σk, λ, σω) 3 log 2 + 1 + 38σ2 ω C2(σ1, σk, λ, σω) = 310λ2σ2 k + 55 D1(σ1, σk, λ, σω) log D2(σ1, σk, λ, σω) 3 log 2 + 1 + 38σ2 ω D1(σ1, σk, λ, σω) = 1 + 8 2 max k [K] σ2 k 16 + σ2 ω σ2 1 + 16σ4 1 + 3σ2 ω σ2 1 + 3σ2 ω + 1 σ2 ωλ4, D2(σ1, σk, λ, σω) = 3 1 + 3 πσ2 ω 2σ1 σω + 3λ + 16 π max k [K] σ2 k σ2 ω 16σ4 1 + 1 + λ2σ2 ω 4σ4 1 Furthermore, the problem-independent regret is upper bounded by Reg T 7Kµ1 + max k [K]\{1} C1(σ1, σk, λ)K log T + 2 r max k [K]\{1} C2(σ1, σk, λ)KT log T. Proof. We denote the first s rewards from pulling arm k as Hk,s, with the i-th observations denoted as Rk,i. Let Qk,s(τ) = P Y k,s > τ | Hk,s be the tail probability that Y k,s, conditioned on history Hk,s is at least τ. Further, let Nk,s(τ) = 1/Qk,s(τ) 1 represent the expected number of rounds in which the arm k is underestimated, given s sample rewards. Here Y k,s = Ps i=1 [ωi Rk,i + ω i (1 λ) + ω i (0 λ)] Ps i=1(ωi + λω i + λω i ) is the objective function defined in Algorithm 3. Step 0: Decomposition of the regret bound. Our proof relies on the following decomposition of the cumulative regret. Multiplier Bootstrap-Based Exploration Lemma C.2 (Theorem 1, Kveton et al. (2019b)). Suppose in MAB we select arms according to the rule At = arg maxk [K] Y k,t with Y k,t defined in Algorithm 3. Then for any {τk}K k=2 R, the expected T-round regret can be bounded above as k=2 k ak + bk , where ak = PT 1 s=0 ak,s and bk = PT 1 s=0 bk,s + 1, and ak,s = E N1,s(τk) T and bk,s = P Qk,s(τk) > T 1 . Recall the summation index s is the number of times we pull the k-th arm. In the proof, we will fix τk µk+λ 1+2λ , µ1+λ 1+2λ . The definitions of ak and bk have important meanings: ak represents the expected number of rounds that optimal arm 1 has been being underestimated, whereas bk is the probability that the suboptimal arm k is being overestimated. Here we only need to consider the lower bound of the tail of the distribution of the rewards from the optimal arm. The intuition behind this is twofold: (i) we only need the rewards from the optimal arm taking a relatively large value with a probability that is not too small; (ii) we do not care about the negligible probability of receiving a large reward from suboptimal arms. Therefore, our target is then to bound ak and bk for any k 2. These are completed in Step 1 and Step 2 below, respectively. Step 1: Bounding ak. 
We first provide a roadmap for proving ak is bounded by a term of O(log T) order: For a given constant level τk, the probability of the optimal arm 1 being underestimated given s rewards is 1 Q1,s (τk). If we pick the level to satisfy τk < µ1+λ 1+2λ , the theory of large deviation gives lim s Q1,s(τk) = 1. Hence the expected number of rounds to observe a not-underestimated case N1,s(τk) = 1 Q1,s(τk) 1 has the property lims Ns(τk) = 0 as the number of pulls s grows to infinity. Thus, given the round T, there exists a constant sa,k(T) such that Ns (τk) T 1 for all s over sa,k(T). Consequently, the quantity ak in regret bound will be bounded by s=0 E [N1,s(τk) T] + 1. The fact that the constant sa,k(T) is at the order of log T will be shown in Lemma D.2, Lemma D.3, and Lemma D.4. For small number of pulls s sa,k(T), we show in Lemma D.1 that E [N1,s(τk) T] 4 + 16e9/8 for any s N. Thus, it is enough to conclude that ak can be bounded by a term of O(log T) order. To formally bound ak in the non-asymptotic sense following the intuition above, we need to decompose ak. For the decomposition, a common approach is to use indicators on good events. Denote the shifted (sample-) mean reward as R k,s = Ps i=1 Rk,i + λs s(1 + 2λ) = λ 1 + 2λ + 1 1 + 2λRk,s. where Rk,s = 1 s Ps i=1 Rk,i is the mean reward of the k-th arm. Then we can define the following good events for l-th arm as Al,s = C1 k < R l,s µl + λ 1 + 2λ < C1 k and Gl,s = n C2 k Y l,s R l,s C2 k o . The definitions of Al,s and Gl,s are intuitive: Al,s represents the sample mean does not deviate excessively from the true mean, and events on Gl,s means Y l,s is not too far away from the scaled-shifted sample mean R l,s. Here C1, C2 are constants belonging to the interval (0, 1). Therefore, by using these good events, we decompose ak,s into the following three parts: ak,s,1 = E h N1,s(τk) T I(Ac 1,s) i , (3) Multiplier Bootstrap-Based Exploration ak,s,2 = E h N1,s(τk) T I(A1,s)I(Gc 1,s) i , (4) and ak,s,3 = E h N1,s(τk) T I(A1,s)I(G1,s) i . (5) Let C1 = 1 6λ and C2 = 1 12λ with fixed λ > 1, then C1, C2 (0, 1). Consider the case when T 2. Define sa,k,j(T) := max{s : ak,s,j T 1}, k = 2, . . . , K, j = 1, 2, 3. Lemma D.2 in Appendix D demonstrates that by taking s sa,k,1(T) := 144λ2σ2 1 (1 + 2λ)2 2 k log T, we will have ak,s,1 T 1. Lemma D.3 and Lemma D.4 in Appendix D say that: if we choose s sa,k,2(T) := (" Ω1 + a1 + b1 + Ω1a1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 s sa,k,3(T) := (" Ω1 + a2 + b2 + Ω1a2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 then we have ak,s,2 T 1 and ak,s,3 T 1, respectively, where a1 = 192σ2 ωλ4 3(1 + 2λ)2 2 k , b1 = 2σ2 ω 96λ4σ2 1 + (1 + 2λ)2C2 2 2 k 3(1 + 2λ)2 2 k , a2 = 36σ2 ωλ4 3(λ 1)2 2 k , b2 = σ2 ω 72λ4σ2 1 + 25(1 + 2λ)2 2 k 6(λ 1)2 2 k , and Ωk = 8 2σ2 k, k [K]. Let Ωmax = maxk [K] Ωk. Then, for any s sa,k(T) = 144λ2σ2 1 (1 + 2λ)2 2 k log T + (" Ωmax + (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) 1 3 log 1 2 log 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 it holds that s maxj=1,2,3 sa,k,j(T) because sa,k(T) = sa,k,1(T) + max j=2,3 sa,k,j(T) max j=1,2,3 sa,k,j(T). Hence, for any s sa,k(T), we have ak,s = ak,s,1 + ak,s,2 + ak,s,3 3T 1. Multiplier Bootstrap-Based Exploration Finally, Lemma D.1 in Appendix D guarantees that if we choose λ 1 + σ2 ω 4 + 4σ1 σω + 1 , the component ak,s 4 + 16e9/8 for any s 0. Thus, by setting λ 1 + σ2 ω 4 + 4σ1 σω + 1 , we have s µk+λ 1+2λ . 
Then, according to the theory of large deviations, we have lim s Qk,s(τk) = 0. Thus, given the time horizon T, there exists a constant sb,k(T) such that Qk,s(τk) T 1 for all s beyond sb,k(T). As a result, the event Qk,s(τk) > T 1 is empty if the number of pulls s exceeds sb,k(T). Consequently, s=0 P Qk,s (τk) > T 1 . We will demonstrate that the constant sb,k(T) is of order O(log T) in Lemma D.5, Lemma D.6, and Lemma D.7. For a small number of pulls, s sb,k(T), we apply a trivial bound P Qk,s (τk) > T 1 1 that holds for any s. Therefore, it is sufficient to conclude that bk can be bounded by a term of order O(log T). It should be noted that bk,s is naturally bounded by the constant 1. Similar to Step 1, we decompose bk,s = P Qk,s(τk) > T 1 into bk = bk,s,1 + bk,s,2 + bk,s,3, with bk,s,1 = E I(Qk,s(τk) > T 1)I(Ac k,s) , (6) bk,s,2 = E I(Qk,s(τk) > T 1)I(Ak,s)I(Gc k,s) , (7) and bk,s,3 = E I(Qk,s(τk) > T 1)I(Ak,s)I(Gk,s) . (8) Again, we define sb,k,j := max{s : bk,s,j T 1} for j = 1, 2, 3. Lemma D.5 in Appendix D guarantees that when s sb,k,1(T), with sb,k,1(T) = 72λ2σ2 k (1 + 2λ)2 2 k log T, Multiplier Bootstrap-Based Exploration we have bk,s,1 T 1. Considering T 2 and letting C1 = 1 6λ and C2 = 1 12λ with fixed λ > 1 as in Step 1, Lemma D.6 and Lemma D.7 proves that: if we take s sb,k,2(T) = (" Ωk + a1 + b1 + Ωka1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 s sb,k,3(T) = (" Ωk + a2 + b2 + Ωka2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 then we have bk,s,2 T 1 and bk,s,3 T 1, respectively, where a1, a2, b1, b2, and Ωk are already defined in Step 1. Let sb,k(T) := 72λ2σ2 k (1 + 2λ)2 2 k log T + (" Ωmax + (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) 1 3 log 1 2 log 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 So for any s sb,k(T), we have s maxj=1,2,3 sb,k,j(T) since sb,k(T) = sb,k,1(T) + maxj=2,3 sb,k,j(T). Therefore, bk,s,1 + bk,s,2 + bk,s,3 3T 1 for any s sb,k(T). Note that bk,s = P Qk,s(τk) > T 1 1 for any s 0, then 1 + 1 sb,k(T) + 3T 1 T sb,k(T) ( 72λ2σ2 k (1 + 2λ)2 2 k + " Ωmax + (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) 1 3 log 1 2 log 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 for any k {2, . . . , K} and T 2. Step 3: Aggregating results. Let us define d1 = 2 k (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) Multiplier Bootstrap-Based Exploration so the bounds in Step 1 and Step 2 yield ak 4(1 + 4e9/8) ( 144λσ2 1 (1 + 2λ)2 2 k + (Ωmax + d1) 1 log 2 + 1 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 ( 72λ2σ2 k (1 + 2λ)2 2 k + (Ωmax + d1) 1 log 2 + 1 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 for any k {2, . . . , K} and T 2. Therefore, by applying Lemma C.2, we obtain the following inequality k=2 k ak + bk 7 + c1(µ1, σ1, µk, σk, λ, σω) + c2(µ1, σ1, µk, σk, λ, σω) 2 k log T (9) for any T 2, where c1(µ1, σ1, µk, σk, λ) = (5 + 16e9/8) Ωmax 3 log 2 + 1 18σ2 ω{(2µ1 1)2 (2µk 1)2} (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 and c2(µ1, σ1, µk, σk, λ) = 72λ2σ2 k (1 + 2λ)2 (9 + 32e9/8) + (5 + 16e9/8) d1 3 log 2 + 1 18σ2 ω{(2µ1 1)2 (2µk 1)2} (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 for k = 2, . . . , K. When the total round T = 1, the bound (9) still holds because Reg T maxk=2,...K k 7 PK k=2 k. 
Finally, by utilizing the bound in (9) and the bounds for c1(µ1, σ1, µk, σk, λ, σω) and c2(µ1, σ1, µk, σk, λ, σω) in Lemma D.8, the following inequality holds for the problem-dependent case: 7 + c1(µ1, σ1, µk, σk, λ, σω) + c2(µ1, σ1, µk, σk, λ, σω) 2 k log T 7 + C1(σ1, σk, λ, σω) + C2(σ1, σk, λ, σω) 2 k log T . For proving the problem-independent case in Theorem C.1, let us denote t=1 I(It = k) = k=2 (ak + bk), Multiplier Bootstrap-Based Exploration and then we have k: k< k EJk,T + X k: k k EJk,T by the bounds of ak,bk T + X 7 + C1(σ1, σk, λ, σω) + C2(σ1, σk, λ, σω) 2 k log T k: k C1(σ1, σk, λ, σω) k log T + X C2(σ1, σk, λ, σω) by k µ1 T + 7Kµ1 + max k [K]\{1} C1(σ1, σk, λ, σω)K log T + max k [K]\{1} C2(σ1, σk, λ, σω)K log T for any previously specified (0, 1). Taking = p maxk [K]\{1} C2(σ1, σk, λ, σω)K log T/T, we obtain Reg T 7Kµ1 + max k [K]\{1} C1(σ1, σk, λ, σω)K log T + 2 r max k [K]\{1} C2(σ1, σk, λ, σω)KT log T. Thus, we complete the proof of Theorem C.1. Lastly, to prove Theorem 5.1, we set σk 1 for k [K] and utilize the fact C1(1, 1, λ, σω) + C2(1, 1, λ, σω) 2 k C1(1, 1, λ, σω) + C2(1, 1, λ, σω) Then applying the bounds for C1(1, 1, λ, σω) and C2(1, 1, λ, σω) from Lemma D.9, we obtain the first result in Theorem 5.1. By employing the same technique as in (12), we obtain the problem-independent regret we sought in Theorem 5.1. D. Lemmas on Bounding Regret Components D.1. Lemmas on bounding ak. Lemma D.1 (Bounding ak,s for any s > 0). Set λ 1 + σ2 ω 4 + 4σ1 then ak,s 4 + 16e9/8 for any k {2, . . . , K} and s 0. Proof. Note that if we take 1 + 2λ, (13) then we can ensure that the bound ak,s = E 1 Q1,s(τk) 1 T E 1 Q1,s(τk) by Q1,s( ) is decreasing EQ 1 1,s holds. Hence, we need to find a lower bound of the tail probability Q1,s µ1+λ 1+2λ = P Y 1,s > µ1+λ 1+2λ | H1,s . Throughout the proof, we will use the choice of (13) for τk, and we can take advantage of the bound (14). Multiplier Bootstrap-Based Exploration For further analyze (14), we will express it as the probability with respect to the weighted random summation of the Gaussian random variables {ωi, ω i, ω i }. Let xi := (2λ + 1)R1,i (µ1 + λ), yi := λ(µ1 1 λ), zi := λ(λ + µ1) for i [s], then i=1 xi = s (2λ + 1)R1,s (µ1 + λ) , Sy = i=1 yi = sλ(µ1 1 λ), Sz = i=1 zi = sλ(λ + µ1), with Sx Sy Sz = s(2λ + 1)(R1,s µ1) and (2λ + 1)R1,i (µ1 + λ) 2, Ty = i=1 y2 i = sλ2(µ1 1 λ)2, Tz = i=1 z2 i = sλ2(λ + µ1)2, with Tx + Ty + Tz = Ps i=1 (2λ + 1)R1,s (µ1 + λ) 2 + s λ2(µ1 1 λ)2 + λ2(λ + µ1)2 . Denote Z1 := Ps i=1 xiωi Ps i=1 yiω i Ps i=1 ziω i (Sx Sy Sz) σω p Tx + Ty + Tz Z2 := Ps i=1 ωi + λ Ps i=1 ω i + λ Ps i=1 ω i (1 + 2λ)s (1 + 2λ2)s . Y 1,s > µ1 + λ 1 + 2λ | H1,s i=1 ziω i > 0 i=1 ω i + λ i=1 ω i > 0 Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 . Note that (Z1, Z2) R2 is the mean-zero bivariate Gaussian distribution with covariance 1 ρ ρ 1 ρ = Eω,ω ,ω [Z1Z2] 0 = Eω,ω ,ω n [Ps i=1 xiωi Ps i=1 yiω i Ps i=1 ziω i (Sx Sy Sz)] [Ps i=1 ωi+λ Ps i=1 ω i+λ Ps i=1 ω i (1 + 2λ)s] o s(1 + 2λ2)(Tx+Ty+Tz) = σ2 ω(Sx λSy λSz) (Sx Sy Sz)(1 + 2λ)s s(1 + 2λ2)(Tx + Ty + Tz) . The sign of ρ depends on the sign of ˇρ, where ˇρ is defined as s σ2 ω(Sx λSy λSz) (Sx Sy Sz)(1 + 2λ)s = σ2 ω h (2λ + 1)R1,s µ1(2λ2 + 1) + λ2 λ i 2(1 + 2λ)2s(R1,s µ1) = (1 + 2λ) (R1,s µ1) 2(1 + 2λ)s σ2 ω + σ2 ωλ(λ 1)(1 2µ1). 
Multiplier Bootstrap-Based Exploration Thus, for the event {ρ < 0}, if we take λ > σ2 ω 4 1 2, we will have {ρ < 0} = {ˇρ < 0} (R1,s µ1) < σ2 ωλ(λ 1)(2µ1 1) (1 + 2λ) 2(1 + 2λ)s σ2ω : s 1 R1,s µ1 > σ2 ωλ(λ 1)(2µ1 1) (1 + 2λ) 2(1 + 2λ)s σ2ω : s 1 This furthermore implies {ρ < 0} {R1,s µ1 < 0} (R1,s µ1) > σ2 ωλ(λ 1)(2µ1 1) (1 + 2λ) 2(1 + 2λ)s σ2ω : s 1 {R1,s µ1 < 0} , if 2µ1 1 0 σ2 ωλ(λ 1)(2µ1 1) (1+2λ) 2(1+2λ)s σ2ω < R1,s µ1 < 0 : s 1 , if 2µ1 1 > 0. ( , if 2µ1 1 0 As =: n σ2 ωλ(λ 1)(2µ1 1) 2(1+2λ)2s < R1,s µ1 < 0 : s 1 o , if 2µ1 1 > 0. Now, we can decompose the second probability in (15) as Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 I (ρ < 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 < (1 + 2λ) s R1,s µ1 < 0 I ( ρ > 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) , where Z3 = Z2 and ϱ = ρ is the correlation coefficient between Z1 and Z3. We will utilize the lower bound of the tail probability for bivariate Gaussian distribution with a positive correlation coefficient. Multiplier Bootstrap-Based Exploration Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 0 I (ρ 0) ( (Sx Sy Sz) σω p Tx + Ty + Tz , (1 + 2λ) s ( (Sx Sy Sz) σω p Tx + Ty + Tz , (1 + 2λ) s R1,s µ1 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz , Z2 > (Sx Sy Sz) Tx + Ty + Tz R1,s µ1 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz 1 + ρ (Sx Sy Sz) Tx + Ty + Tz R1,s µ1 0 I (ρ 0) Z > (Sx Sy Sz) Tx + Ty + Tz Z > (Sx Sy Sz) Tx + Ty + Tz R1,s µ1 0 #2 where Z is the standard Gaussian distribution. Throughout the appendix, we will always use Z to represent the standard Gaussian distribution. The first equality is due to (Sx Sy Sz) = s(1+2λ) (R1,s µ1) 0 (1+2λ) s when R1,s µ1 0, and the second inequality is by (11.44) of (Lai & Balakrishnan, 2009) for bivariate Gaussian distributions. By applying the first inequality in Lemma E.1 and using the fact that Tx + Ty + Tz Tz sλ4, we can get that Z > (Sx Sy Sz) Tx + Ty + Tz R1,s µ1 0 1 4 exp s(2λ + 1)2(R1,s µ1)2 R1,s µ1 0 , Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 0 I (ρ 0) 16 exp 2s(2λ + 1)2(R1,s µ1)2 R1,s µ1 0 I (ρ 0) . Second, we also have Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) =PZ1,Z3 Z1 > h(s), Z3 < k(s) I R1,s µ1 < 0 I (ϱ > 0) =PZ1,Z3 Z1 > h(s), Z3 < k(s) I R1,s µ1 < 0 I (ϱ > 0) I(2µ1 1 > 0) h(s) := (Sx Sy Sz) Tx + Ty + Tz , k(s) := (1 + 2λ) s are positive when R1,s µ1 < 0. Note that PZ1,Z3 Z1 > h(s), Z3 < k(s) = PZ1 Z1 > h(s) PZ1,Z3 Z1 > h(s), Z3 > k(s) Multiplier Bootstrap-Based Exploration and PZ1,Z3 Z1 > h(s), Z3 > k(s) I ϱ > 0 ϱk(s) h(s) p + ϱ exp k2(s) h2(s) ϱh(s) k(s) p by (11.42) in Lai & Balakrishnan (2009). Next, by combining the above two inequalities, we have Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) (h Φ h(s) Φ k(s) i ϱk(s) h(s) p + ϱ exp k2(s) h2(s) ϱh(s) k(s) p R1,s µ1 < 0 I ϱ > 0 (h Φ h(s) Φ k(s) i ϱk(s) h(s) p + ϱ exp k2(s) h2(s) ϱh(s) k(s) p R1,s µ1 < 0 {ϱ > 0} . R1,s µ1 < 0 {ϱ > 0} As always hold, we have 0 < s(1 + 2λ)(R1,s µ1) < σ2 ωλ(λ 1)(2µ1 1) 2(1 + 2λ) . 
(17) Then using the inequality Tx + Ty + Tz sλ4, we have h(s) = (Sx Sy Sz) Tx + Ty + Tz by (17) < σωλ(λ 1)(2µ1 1) sλ2(1 + 2λ) by µ1 1 σω(λ 1) λ(1 + 2λ) by (19) (1 + 2λ) by λ > 1 (1 + 2λ) σω 1 + 2λ2 (1 + 2λ) s 1 + 2λ2 = k(s), where fourth inequality follows from the fact that 2σ2 ωλ < (1 + 2λ)2 (19) provided that ( R if σω < 2, σ2 ω 4 4 , if σω 2 . Note that when σω > 2, the expression 2 σω+σω σ2ω 4 4 is negative. Thus, the inequality (18) will hold for any λ > 1. This implies that h Φ h(s) Φ k(s) i I(As) > 0. Multiplier Bootstrap-Based Exploration whenever λ > σ2 ω 4 1 2. On the other hand, note that ˇρ σ2 ωλ(λ 1)(2µ1 1), then on the event As, we know that ϱ = sˇρ σ2ω p s(1 + 2λ2)(Tx + Ty + Tz) sσ2 ωλ(λ 1)(2µ1 1) σ2ω p s(1 + 2λ2)(Tx + Ty + Tz) and thus ϱk(s) h(s) = (1 + 2λ) s 1 + 2λ2 ϱ h(s) by the bound of ϱ (1 + 2λ) s σω 1 + 2λ2 sσ2 ωλ(λ 1)(2µ1 1) s(2λ + 1) (R1,s µ1) by the bound in (17) (1 + 2λ) s σω 1 + 2λ2 sσ2 ωλ(λ 1)(2µ1 1) 2(1 + 2λ) σωλ(λ 1)(2µ1 1) = 2(1 + 2λ)2 1 + 2λ2 ss > 1, i.e., ϱk(s) > h(s). Combining the above analysis, we have Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) h Φ h(s) Φ k(s) i I R1,s µ1 < 0 I (ϱ > 0) ϱk(s) h(s) p R1,s µ1 < 0 I (ϱ > 0) R1,s µ1 < 0 I (ϱ > 0) + Φ(0)I R1,s µ1 < 0 I (ϱ > 0) R1,s µ1 < 0 I (ϱ > 0) . Besides, note that when R1,s µ1 > 0, we have (Sx Sy Sz) < 0, and then Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 > 0 R1,s µ1 > 0 . Plugging (16), (20), and (21) into (15), we obtain that 16 exp 2s(2λ + 1)2(R1,s µ1)2 R1,s µ1 0 I(ρ 0) R1,s µ1 0 I(ρ < 0) + 1 R1,s µ1 > 0 . As a result, with the fact (14), the upper bound for ak,s now is ak,s = EQ 1 1,s 16E exp 2s(2λ + 1)2(R1,s µ1)2 R1,s µ1 0 I(ρ 0) R1,s µ1 0 I(ρ < 0) + 2EI R1,s µ1 > 0 16E exp 2s(2λ + 1)2(R1,s µ1)2 Multiplier Bootstrap-Based Exploration where the first inequality is due to that the several indicators are based on mutually exclusive events. From the above expression, to prove that ak,s is bounded by a constant independent of s, it remains to show that the expectation E exp 2s(2λ + 1)2(R1,s µ1)2 can be bounded below some constants free of s. Applying Lemma E.2, we know that if we take 2s(2λ + 1)2 σ2ωλ4 s 8σ2 1 , then (22) can be upper bounded by e9/8. A sufficient condition for the above inequality is σ2ωλ4 1 16σ2 1 , i.e. λ2 8σ1 in other words, this can be expressed as: σω + 1 . (23) Therefore, if we take the tuning parameters λ and σω as specified in (23), then ak,s will be bounded by a constant such that ak,s 4 + 16e9/8. (24) It is important to note that this lemma establishes the tuning conditions for the tuning parameters in Theorem C.1. Lemma D.2 (Bounding ak,s,1 at (3)). Take sa,k,1(T) := 4σ2 1 C2 1(1 + 2λ)2 2 k log T. Then for any k {2, . . . , K} and s sa,k,1(T), ak,s,1 T 1, Proof. We will bound ak,s,1 by bounding the probability of the event Ac k,s. Write ak,s,1 = E h N1,s(τk) T I(Ac 1,s) i E h TI(Ac 1,s) i = TP(Ac 1,s). (25) Since the summation of independent sub-Gaussian variables is still sub-Gaussian, (40) in Lemma E.2 gives R1,s µ1 sub G(σ2 1/s). By applying the concentration inequality of the sub-Gaussian variable, we have P(Ac 1,s) = P R1,s µ1 (1 + 2λ)C1 k 2 exp (1 + 2λ)2C2 1 2 ks 2σ2 1 take s sa,1(T ) 2σ2 1 C2 1(1+2λ)2 2 k log( 2T) 1 T 2 . Hence, we obtain that ak,s,1 TP(Ac 1,s) T 1, for any s sa,1(T). Multiplier Bootstrap-Based Exploration Lemma D.3 (Bounding ak,s,2 at (4)). 
Take sa,k,2(T) := (" Ω1 + a1 + b1 + Ω1a1 1 3 log 1 2 log 2πΩ1σ2 1a1 2b2 1 + 2a1 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 a1 = 4σ2 ωλ2 3C2 2(1 + 2λ)2 2 k , b1 = 2σ2 ω 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k 3C2 2(1 + 2λ)2 2 k , and Ω1 = 8 Then for any s sa,k,2(T), ak,s,2 T 1, for any k {2, . . . , K} and T 2. Proof. We will bound the probability of Gc 1,s conditioning on H1,s to prove this lemma. Note that (1 + 2λ)R1,i (R1,s + λ) ωi + i=1 λ(1 + λ R1,s)ω i i=1 λ(R1,s + λ)ω i =s (1 + 2λ)R1,s (R1,s + λ) + λ(1 + λ R1,s) λ(R1,s + λ) = 0, we can bound the tail probability P(Gc 1,s | H1,s) by P(Gc 1,s | H1,s) =P |Y 1,s R 1,s| > C2 k | H1,s Ps i=1 (1 + 2λ)R1,i (R1,s + λ) ωi + Ps i=1 λ(1 + λ R1,s)ω i Ps i=1 λ(R1,s + λ)ω i (1 + 2λ)(Ps i=1 ωi + λ Ps i=1 ω i + λ Ps i=1 ω i ) by Lemma E.3 2Pω,ω ,ω (1 + 2λ)R1,i (R1,s + λ) ωi + i=1 λ(1 + λ R1,s)ω i i=1 λ(R1,s + λ)ω i C2(1 + 2λ) k i=1 ω i + λ i=1 ω i + λ i=1 ω i < 0 I = Pω,ω ,ω (1 + 2λ)R1,i (R1,s + λ) ωi + i=1 λ(1 + λ R1,s)ω i i=1 λ(R1,s + λ)ω i C2(1 + 2λ) k i=1 ω i + λ II = Pω,ω ,ω i=1 ω i + λ i=1 ω i < 0 Multiplier Bootstrap-Based Exploration To bound P(Gc 1,s | H1,s), it is sufficient to bound I and II separately. We first examine II. Write II = Pω,ω ,ω i=1 ω i + λ i=1 ω i < 0 Ps i=1(ωi 1) + λ Ps i=1(ω i 1) + λ Ps i=1(ω i 1) p sσ2ω(1 + 2λ2) < (1 + 2λ)s p sσ2ω(1 + 2λ2) = Pω,ω ,ω Z < (1 + 2λ) σω by Lemma E.1 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) For studying I, we define the following functions of {R1,i}s i=1: f {R1,i}s i=1 = f1 {R1,i}s i=1 + f2 {R1,i}s i=1 f3 {R1,i}s i=1 , f1 {R1,i}s i=1 = (R1,i R1,s) + λ(2R1,i 1) (1 + 2λ)C2 k ωi, f2 {R1,i}s i=1 = λ(λ + 1 R1,s) (1 + 2λ)C2 k ω i, f3 {R1,i}s i=1 = λ(R1,s + λ) + (1 + 2λ)C2 k ω i . Then we can write I = Pω,ω ,ω (f1 + f2 f3 > 0). Given that f1, f2, and f3 are mutually independent conditioning on H1,s, the expectation is E f1 + f2 f3 | H1,s = 3C2(1 + 2λ) ks, and the variance is var f1 + f2 f3 | H1,s (R1,i R1,s) + λ(2R1,i 1) (1 + 2λ)C2 k 2 λ(λ + 1 R1,s) (1 + 2λ)C2 k 2 + λ(R1,s + λ) + (1 + 2λ)C2 k 2 # =σ2 ω h V1 + V2 + V3 i , n (R1,i R1,s) + λ(2R1,s 1) 2 + λ2(λ + 1 R1,s)2 + λ2(R1,s + λ)2o , n 2(1+2λ)C2 k (R1,i R1,s)+λ(2R1,s 1) 2(1+2λ)C2 kλ(λ+1 R1,s)+2(1+2λ)C2 k(R1,s+λ) o , (1 + 2λ)2C2 2 2 k + (1 + 2λ)C2 2 2 k + (1 + 2λ)C2 2 2 k . Multiplier Bootstrap-Based Exploration For bounding the conditional variance above, we will calculate its components as follows. (R1,i R1,s) + λ(2R1,i 1) 2 + sλ2(λ + 1 R1,s)2 + sλ2(R1,s + λ)2 i=1 (R1,i R1,s)2 + 2sλ2 3R 2 1,s 3R1,s + λ2 + λ + 1 i=1 (R1,i µ1)2 + (6λ2 1)s(R1,s µ1)2 + 6sλ2(2µ1 3)(R1,s µ1) + 2sλ2 λ2 + λ + 1 + 3µ1(µ1 1) Cauchy inequality 6λ2 s X i=1 (R1,i µ1)2 + 6sλ2(2µ1 1)(R1,s µ1) + 2sλ2 λ2 + λ + 1 µ1(1 µ1) i=1 (R1,i µ1)2 + 6sλ2(2µ1 1)(R1,s µ1) + 2sλ2 λ2 + λ + 1 µ1(1 µ1) , V2 = 2(1 + 2λ)C2 k (R1,i R1,s) + λ(2R1,i 1) + sλ(λ + 1 R1,s) sλ(R1,s + λ) = 2s(1 + 2λ)C2 k h R1,s R1,s) + λ(2R1,i 1) + λ(λ + 1 R1,s) λ(R1,s + λ) i = 0, and V3 = 3s(1 + 2λ)2C2 2 2 k. Therefore, the conditional variance in (27) is bounded by σ 2 ω var f1 + f2 f3 | H1,s = V1 + V2 + V3 i=1 (R1,i µ1)2 + 6sλ2(2µ1 1) | {z } the random part + 2sλ2 λ2 + λ + 1 3µ1(1 µ1) + 3s(1 + 2λ)2C2 2 2 k | {z } the determined part = R1 + R2 + D, R1 = 6λ2 s X i=1 (R1,i µ1)2, R2 = 6sλ2(2µ1 1) R1,s µ1 + 2sλ2 λ2 + λ + 1 3µ1(1 µ1) , and D = 3s(1 + 2λ)2C2 2 2 k. It is clear that both R1 and R2 are strictly positive. Additionally, it can be shown with high probability that R2 is non-positive. 
Indeed, note that 0 3µ1(1 µ1) 3 P(R2 > 0) = P 3(2µ1 1) R1,s µ1 > λ2 + λ + 1 3µ1(1 µ1) R1,s µ1 > λ2 + λ + 1 3µ1(1 µ1) I µ1 = 1 + P( ) I µ1 = 1 0 3µ1(1 µ1) 3 4 P R1,s µ1 > λ2 + λ + 1/4 + 0 I µ1 = 1 by sub-Gaussian inequality exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 Multiplier Bootstrap-Based Exploration Then by applying Lemma E.1 again, P(Gc 1,s | H1,s) = 2I + II f1 + f2 f3 E f1 + f2 f3 | H1,s var f1 + f2 f3 | H1,s > E f1 + f2 f3 | H1,s var f1 + f2 f3 | H1,s 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) E2 f1 + f2 f3 | H1,s 2 var f1 + f2 f3 | H1,s 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) Therefore, combining with (29), ak,s,2 will be bounded by ak,s,2 E h TI(Gc 1,s) i = TE h P(Gc 1,s | H1,s) i by (26) T 2I + II E2 f1 + f2 f3 | H1,s 2 var f1 + f2 f3 | H1,s 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) E2 f1 + f2 f3 | H1,s 2 var f1 + f2 f3 | H1,s ) I(R2 0) + I(R2 > 0) + T 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) by the decomposition (29) TE exp E2 f1 + f2 f3 | H1,s 2σ2ω R1 + R2 + D I(R2 0) + TP(R2 > 0) + T 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) E2 f1 + f2 f3 | H1,s 2σ2ω R1 + D + TP(R2 > 0) + T 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) Recall R1,i µ1 sub G(σ2 1), and then ξi := (R1,i µ1)2 var(R1,i) sub E(8 2σ2 1). Therefore, for furthermore bounding R1 + D, we have R1 + D = 6λ2 " s X i=1 (R1,i µ1)2 s var(R1,i) + s var(R1,i) + 3s(1 + 2λ)2C2 2 2 k i=1 ξi + s 6λ2 var(R1,i) + 3(1 + 2λ)2C2 2 2 k by var(R1,i) σ2 1 6λ2 s X i=1 ξi + 3s 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k = 3s 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k , Multiplier Bootstrap-Based Exploration where ξ = 1 s Ps i=1 ξi. Then ak,s,2 TE exp E2 f1 + f2 f3 | H1,s 2σ2ω R1 + D + TP(R2 > 0) + T 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) by (31) and (30) TE exp 9C2 2(1 + 2λ)2 2 ks2 6σ2ωs 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) 3C2 2(1 + 2λ)2 2 ks 2σ2ω 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) The next step is to apply Lemma E.4. Let a1 = 4σ2 ωλ2 3C2 2(1 + 2λ)2 2 k , b1 = 2σ2 ω 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k 3C2 2(1 + 2λ)2 2 k , λi 8 2σ2 1, i [s], and Ω1 = (s 1 Ps i=1 λ2 i )1/2 = 8 2σ2 1, then Lemma E.4 gives 3C2 2(1 + 2λ)2 2 ks 2σ2ω 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k 3C2 2(1+2λ)2 2 k ξ + 2σ2 ω[2λ2σ2 1+(1+2λ)2C2 2 2 k] 3C2 2(1+2λ)2 2 k = E exp s a1ξ + b1 2b2 1 + 2a1 exp s 2Ω1a1 (b1 + Ω1a1) by 2Ω1a1 (b1+Ω1a1) b1+Ω1a1+Ω1+a1 2b2 1 + 2a1 s exp s Ω1 + a1 + b1 + Ω1a1 Hence, by taking s Ω1 + a1 + b1 + Ω1a1 log 1 2 log 2b2 1 + 2a1 Ω1 + a1 + b1 + Ω1a1 log 2b2 1 + 2a1 3C2 2(1 + 2λ)2 2 ks 2σ2ω 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k s 3T 3 1 3T 2 . Similarly, the inequality T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 Multiplier Bootstrap-Based Exploration 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) 3T have the solution s 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 [log 3 + 2 log T] . s 2σ2 ω(1 + 2λ2) (1 + 2λ)2 [log(3/2) + 2 log T] , respectively. Therefore, we have ak,s,2 T 1 for any s sa,k,2(T) by taking sa,k,2(T) = " Ω1 + a1 + b1 + Ω1a1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 for any T 2. Lemma D.4 (Bounding ak,s,3 at (5)). Take sa,k,3(T) = (" Ω1 + a2 + b2 + Ω1a2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 a2 = σ2 ωλ2 3(λ 1)2C2 1 2 k , b2 = σ2 ω 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 6(λ 1)2C2 1 2 k , and Ω1 = 8 Then for any s sa,k,3(T) and λ > 1, ak,s,3 T 1. for any k {2, . . . , K} and T 2. Proof. 
Unlike the proofs for bounding ak,s,1 and ak,s,2, which involve controlling the probability of bad events, the definition of ak,s,3 is based on good events instead. Therefore, we require a different technique to handle ak,s,3. Observe that {N1,s(τk) < T 1} {N1,s(τk) < (T 2 1) 1} by (T 2 1) 1 T 1 = Q1,s (τk) 1 < 1 + T 2 1 1 = Q1,s(τk) 1 < (1 T 2) 1 = [1 Q1,s(τk)] < T 2 , and ak,s,3 E N1,s(τk)I(A1,s G1,s) . Thus, as suggested by Wang et al. (2020), we can bound ak,s,3 T 1 for s sa,k,3 by finding s sa,k,3 such that [1 Q1,s(τk)] < T 2 holds on the event {A1,s G1,s}. We can express the term 1 Q1,s(τk) I(A1,s)I(G1,s) = P Y 1,s R 1,s Γ1 | H1,s I(A1,s G1,s), where we define Γ1 := Γ1(τ1) = τ1 R 1,s as the difference between Y 1,s and R 1,s. On the event {A1,s G1,s}, Γ1 can be bounded within an interval. To see this, R 1,s µ1 + λ 1 + 2λ > C1 k R k,s µk + λ 1 + 2λ < C1 k = Γ1 < C1 k µ1 + λ 1 + 2λ + τ1 Γ1 > C2 k µ1 + λ 1 + 2λ + τ1 = τ1 C1 k µ1 + λ 1 + 2λ < Γ1 < τ1 + C1 k µ1 + λ Multiplier Bootstrap-Based Exploration and G1,s = C2 k Y 1,s R 1,s C2 k = τ1 C2 k Y 1,s + Γ1 τ1 + C2 k = τ1 C2 k Y 1,s Γ1 τ1 + C2 k Y 1,s by C2 k+R 1,s Y 1,s C2 k+R 1,s τ1 2C2 k R 1,s Γ1 τ1 + 2C2 k R 1,s . Let us combine the previous two results, given by C1 = 2C2, (35) which yields τ1 C1 k R 1,s µ1 + λ 1 + 2λ Γ1 τ1 + C1 k R 1,s µ1 + λ 1+2λ C1 k Γ1 τ1 2C1 k µ1 + λ = 2C1 k 6λC1 k 1 + 2λ = 2C1 k 1 + 3λ 1 + 2λ by 3λ 1+2λ 3 Thus, we obtain Γ1 2C1(λ 1) k 2λ+1 , 5C1 k (0, + ). Return to the quantity we are interested in, 1 Q1,s(τk) I(A1,s)I(G1,s). We can express it as follows: 1 Q1,s(τk) I(A1,s)I(G1,s) Y 1,s R 1,s Γ1 | H1,s I(A1,s G1,s) i=1 (R1,i R1,s)ωi + λ i=1 (2R1,iωi R1,sω i R1,sω i ) i=1 (ω i ωi) + λ2 s X i=1 (ω i ω i ) Γ1(1 + 2λ) s X i=1 ω i + λ I(A1,s G1,s) Multiplier Bootstrap-Based Exploration Similar to the technique we used to bound the conditional probability P(Gc 1,s | H1,s), we just need to replace C2 k by Γ1 in (32) and obtain that 1 Q1,s(τk) I(A1,s)I(G1,s) apply steps in (26) TE 9( Γ1)2(1 + 2λ)2s2 6σ2ωs 2λ2ξ + 2λ2σ2 1 + ( Γ1)2(1 + 2λ)2 I(A1,s G1,s) + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) by the bound of Γ1 TE exp 36(λ 1)2C2 1 2 ks2 6σ2ωs 2λ2ξ + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) Similarly, we define the following expressions a2 = σ2 ωλ2 3(λ 1)2C2 1 2 k , and b2 = σ2 ω 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 6(λ 1)2C2 1 2 k . Consider the value of s such that s Ω1 + a2 + b2 + Ω1a2 log 1 2 log 2b2 2 + 2a2 By applying Lemma E.4 and following the steps in (33) again, we obtain 36(λ 1)2C2 1 2 ks2 6σ2ωs 2λ2ξ + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 6(λ 1)2C2 1 2 ks σ2ω 2λ2ξ + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 3(λ 1)2C2 1 2 k ξ + σ2 ω[2λ2σ2 1+25(1+2λ)2C2 1 2 k] 6(λ 1)2C2 1 2 k = TE exp s a2ξ + b2 by applying steps in (33) 1 3T . Therefore, to determine the value of s such that [1 Q1,s(τk)] < T 2 on {A1,s G1,s}, we can define s sa,k,3(T) = " Ω1 + a2 + b2 + Ω1a2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 By choosing this value for sa,k,3 then we get that as,k,3 T 1 whenever s sa,k,3. Multiplier Bootstrap-Based Exploration D.2. Lemmas on bounding bk. Lemma D.5 (Bounding bk,s,1 at (6)). Consider sb,k,1(T) = 2σ2 k C2 1(1 + 2λ)2 2 k log T. For any s sb,k,1(T), we have bk,s,1 T 1, k {2, . . . , K} provided that T 2. Proof. 
Similar to Lemma D.5 and noting sb,k,1(T) σ2 k C2 1(1 + 2λ)2 2 k log(2T), we apply the Hoeffding inequality, which gives bk,s,1 = E I(Qk,s(τk) > T 1)I(Ac k,s) = P |Rk,s µk| (1 + 2λ)C1 k by Hoeffding inequality T 1. Lemma D.6 (Bounding bk,s,2 at (7)). Consider sb,k,2(T) = (" Ωk + a1 + b1 + Ωka1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 a1 = 4σ2 ωλ2 3C2 2(1 + 2λ)2 2 k , b1 = 2σ2 ω 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k 3C2 2(1 + 2λ)2 2 k , and Ωk = 8 For any s sb,k,2(T), we have bk,s,2 T 1, k {2, . . . , K} provided that T 2. Proof. Similar to bounding ak,s,2 in Lemma D.3, we can show that if we take sb,k,2(T) = (" Ωk + a1 + b1 + Ωka1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 with Ωk = 8 2σ2 k, we have bk,s,2 E I(Gc k,s) 3C2 2(1 + 2λ)2 2 ks 2σ2ω h 2λ2ζ(k) + 2λ2σ2 k + (1 + 2λ)2C2 2 2 k i + exp (λ2 + λ + 1/4)2s 18σ2ω(2µk 1)2 + 0 I µk = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) Multiplier Bootstrap-Based Exploration where ζ(k) = 1 s Ps i=1 ζ(k) i , with {ζ(k) i }s i=1 being i.i.d independent sub-Exponential variables such that ζ(k) i sub E(8 2σ2 k). Then bk,s,2 T 2 T 1 for any s sb,k,2 and k {2, . . . , K}. Lemma D.7 (Bounding bk,s,3 at (8)). Consider sb,k,3(T) = (" Ωk + a2 + b2 + Ωka2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 a2 = σ2 ωλ2 3(λ 1)2C2 1 2 k , b2 = σ2 ω 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 6(λ 1)2C2 1 2 k , and Ωk = 8 For any s sb,k,3(T) and λ > 1, bk,s,3 T 1, k {2, . . . , K} provided that T 2. Proof. The basic idea is the same as bounding ak,s,3. The only difference is that we replace 1 Qk,s(τk) with Qk,s(τk). Again, we will first bound Γk = Y k,s R k,s on the event Ak,s Gk,s. Exactly as before, we let τk = µk + λ 1 + 2λ + 6λC1 k which satisfies τk µk+λ 1+2λ . To ensure τk µ1+λ 1+2λ , one just needs to take C1 = 1 6λ, and then C2 = 1 2C1 = 1 12λ by (35). Next, we will obtain the range for Γk on the event Ak,s Gk,s as Γk τk 2C1 k µk + λ = 2C1 k λ 1 2λ + 1 by λ>1 > 0, Γk τk + 2C1 k µk + λ 1 + 3λ 1 + 2λ by 3λ 1+2λ 3 Multiplier Bootstrap-Based Exploration i.e., Γk 2C1(λ 1) k 2λ+1 , 5C1 k (0, ). Therefore, Qk,s(τk)I(Ak,s)I(Gk,s) Y k,s R k,s Γk | Hk,s I(Ak,s Gk,s) apply steps in (26) TE 9Γ2 k(1 + 2λ)2s2 6σ2ωs h 2λ2ζ(k) + 2λ2σ2 1 + Γ2 k(1 + 2λ)2 i I(Ak,s Gk,s) + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µk 1)2 + 0 I µk = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) by the bound of Γk TE exp 36(λ 1)2C2 1 2 ks2 6σ2ωs h 2λ2ζ(k) + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k i + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µk 1)2 + 0 I µk = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) where ζ(k) = 1 s Ps i=1 ζ(k) i with {ζ(k) i }s i=1 are i.i.d. independent sub-Exponential variables such that ζ(k) i sub E(8 2σ2 k). Let s Ωk + a2 + b2 + Ωka2 1 3 log 1 2 log 2b2 2 + 2a2 By applying Lemma E.4 again, we have 36(λ 1)2C2 1 2 ks2 6σ2ωs h 2λ2ζ(k) + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k i a2ζ(k) + b2 Therefore, we have bk,s,3 T 1 for any k {2, . . . , K} if we choose sb,k,3(T) = (" Ωk + a2 + b2 + Ωka2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 D.3. Lemmas on simplifying bounds. Lemma D.8. Assuming the conditions are identical to those in Theorem C.1, the constants c1(µ1, σ1, µk, σk, λ) and c2(µ1, σ1, µk, σk, λ), as defined in (10) and (11), can be upper-bounded by C1(σ1, σk, λ, σω) and C2(σ1, σk, λ, σω) specified in Theorem C.1, respectively. Proof. First, we establish bounds for the components d1 and d2 in c1(µ1, σ1, µk, σk, λ) and c2(µ1, σ1, µk, σk, λ). 
Observe Multiplier Bootstrap-Based Exploration that d1 = 2 k (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) by k 1 1 + Ωmax (1 + 2λ)2 + 36λ4 " 2 96λ4σ2 1 + (1 + 2λ)2/(36λ2) 3(1 + 2λ)2 + 72λ4σ2 1 + 25(1 + 2λ)2 Given λ 1 + σ2 ω 4 + 4σ1 σω + 1 > 4σ1 σω + 1, we can infer that (λ 1)2 16σ2 1/σ2 ω, and consequently, (1 + 2λ)2 + 36λ4 (λ 1)2 192λ4 16σ2 1/σ2ω = 48λ2 + 3σ2 ω σ2 1 λ4. Similarly, 2 96λ4σ2 1 + (1 + 2λ)2/(36λ2) 3(1 + 2λ)2 + 72λ4σ2 1 + 25(1 + 2λ)2 2 96λ4σ2 1 3 4λ2 + 2 3 36λ2 + 72λ4σ2 1 + 25(1 + 2λ)2 6 16σ2 1/σ2ω 16λ2σ2 1 + 1 48 + λ4σ2 ω + 25 9λ2 6 16σ2 1/σ2ω 16λ2σ2 1 + 1 48 + 3λ4σ2 ω + 3λ2σ2 ω/σ2 1. Thus, d1 = 2 k (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) (1 + Ωmax)σ2 ω 3 48λ2 + 3σ2 ω σ2 1 λ4 + σ2 ω 16λ2σ2 1 + 1 48 + 3λ4σ2 ω + 3σ2 ω σ2 1 λ2 16(1 + Ωmax) + 16σ4 1 + 3σ2 ω σ2 1 λ2 + (1 + Ωmax)σ2 ω σ2 1 + 3σ2 ω 16(1 + Ωmax) + 16σ4 1 + 3σ2 ω σ2 1 + (1 + Ωmax)σ2 ω σ2 1 + 3σ2 ω + 1 λ4 (1 + Ωmax) 16 + σ2 ω σ2 1 + 16σ4 1 + 3σ2 ω σ2 1 + 3σ2 ω + 1 σ2 ωλ4. D1(σ1, σk, λ, σω) := (1 + Ωmax) 16 + σ2 ω σ2 1 + 16σ4 1 + 3σ2 ω σ2 1 + 3σ2 ω + 1 σ2 ωλ4, then d1 D1(σ1, σk, λ, σω). On the other hand, regarding d2, we have the following: 2σ2ω(1 + 2λ)2 2 k/(144λ2) 3(1 + 2λ)2 2 k + 25σ2ω(1 + 2λ)2 2 k 6(λ 1)2 2 k 64λ4σ4 1 + (λ 1)2 12λ4σ2 1σ2ω 96λ4σ2 1 + (1 + 2λ)2/(36λ2) + 72λ4 72λ4σ2 1 + 25(1 + 2λ)2 To simply the bound above, we can use the fact λ > 1 to derive the following inequalities: s 2σ2ω(1 + 2λ)2 2 k/(144λ2) 3(1 + 2λ)2 2 k + 25σ2ω(1 + 2λ)2 2 k 6(λ 1)2 2 k σω 12λ + 5σω(1 + 2λ) σω + 3σ2 ω σ1 (1 + 2λ) 3σ2 ω σ1 Multiplier Bootstrap-Based Exploration 64λ4σ4 1 + (λ 1)2 12λ4σ2 1σ2ω 9λ2 64λ4σ4 1 + 16σ2 1/σ2 ω 12λ4σ2 1σ2ω 64λ4σ4 1 + 16σ2 1/σ2 ω 12λ2σ2 1σ2ω 2 1 σ2 1λ2 + 1 σ2ωλ2 2 σ2 ω 16σ4 1 + 1 96λ4σ2 1 + (1 + 2λ)2/(36λ2) + 72λ4 72λ4σ2 1 + 25(1 + 2λ)2 96λ4σ2 1 + 72λ4 16σ4 1/σ2ω = λ2σ2 ω 8σ4 1 . Thus, d2 is upper-bound by 3 1 + 3 πσ2 ω 2σ1 σ2 ω 16σ4 1 + 1 + λ2σ2 ω 4σ4 1 Once again, we define D2(σ1, σk, λ, σω) := 3 1 + 3 πσ2 ω 2σ1 σ2 ω 16σ4 1 + 1 + λ2σ2 ω 4σ4 1 then d2 D2(σ1, σk, λ, σω). Next, we can provide simple bounds for c1(µ1, σ1, µk, σk, λ, σω) and c2(µ1, σ1, µk, σk, λ, σω). Using the fact that a b a + b and a b c = [(a b) c], we obtain that Ωmax 3 log 2 + 1 18σ2 ω{(2µ1 1)2 (2µk 1)2} (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 3 log 2 + 1 72σ2 ω{(2µ1 1)2 + (2µk 1)2} (1 + 2λ)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 3 log 2 + 1 72σ2 ω 2 + 2σ2 ω(1 + 2λ2) (1 + 2λ)2 3 log 2 + 1 144σ2 ω + 2σ2 ω + 4λ2σ2 ω 4λ2 3 log 2 + 1 + 38σ2 ω Ωmax log D2(σ1, σk, λ, σω) 3 log 2 + 1 + 38σ2 ω. 3 log 2 + 1 18σ2 ω{(2µ1 1)2 (2µk 1)2} (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 D1(σ1, σk, λ, σω) log D2(σ1, σk, λ, σω) 3 log 2 + 1 + 38σ2 ω, where the last steps in two inequalities above are by (38) and (39). Finally, note that 5 + 16e9/8 55, 72λ2σ2 k (1 + 2λ)2 (9 + 32e9/8) 72λ2σ2 k 25 (9 + 32e9/8) 310λ2σ2 k, Multiplier Bootstrap-Based Exploration which implies c1(µ1, σ1, µk, σk, λ, σω) C1(σ1, σk, λ, σω) and c2(µ1, σ1, µk, σk, λ, σω) C2(σ1, σk, λ, σω). Lemma D.9. Assuming the conditions the same as in Lemma D.8, we have C1(1, 1, λ, σω) 55 8 log (1 + 15σ 2 ω + 3σω + 10σ2 ω)λ2 3 log 2 + 1 C2(1, 1, λ, σω) 330λ2 + 55 45(3 + σ2 ω) log (1 + 15σ 2 ω + 3σω + 10σ2 ω)λ2 3 log 2 + 1 λ4 + 38σ2 ω Proof. We have D1(1, 1, λ, σω) = 1 + 8 2 16 + σ2 ω + 16 + 3σ2 ω + 3σ2 ω + 1 σ2 ωλ4 13 16 + 3σ2 ω + 6σ2 ω + 17 σ2 ωλ4 = 45(3 + σ2 ω)λ4 D2(1, 1, λ, σω) = 3 1 + 3 πσ2 ω 2 σω + 3λ + 8 π σ2 ω 16 + 1 3 1 + 3 πσ2 ω 2 σω + 3 + 8 π σ2 ω 16 + 1 λ2 1 + 15σ 2 ω + 3σω + 10σ2 ω λ2. 
Therefore, we have C1(1, 1, λ, σω) = 55 8 2 log D2(1, 1, λ, σω) 3 log 2 + 1 + 55σ2 ω log (1 + 15σ 2 ω + 3σω + 10σ2 ω)λ2 3 log 2 + 1 C2(1, 1, λ, σω) = 310λ2 + 55 D1(1, 1, λ, σω) log D2(1, , 1, λ, σω) 3 log 2 + 1 + 38σ2 ω < 330λ2 + 55 45(3 + σ2 ω) log (1 + 15σ 2 ω + 3σω + 10σ2 ω)λ2 3 log 2 + 1 λ4 + 38σ2 ω E. Technical Lemmas Lemma E.1. Let Z be a standard Gaussian variable, then the tail probability P(Z > x) satisfies 1 4 exp( x2) < P(Z > x) 1 2 exp( x2/2) for any x 0. Proof. Let Q(x) := P(Z > x) for x > 0 represent the tail probability for a standard Gaussian variable Z. Let Z1 and Z2 are two independent standard Gaussian random variables. Then P(Z1 x, Z2 x) = Z x 1 2π exp z2 1 z2 2 /2 dz1dz2 1 2π exp r2/2 r dθ dr = 1 exp( x2), Multiplier Bootstrap-Based Exploration which implies that [1 2Q(x)]2 1 exp( x2) for x 0, or, equivalently, exp( x2) 4Q(x) 4Q2(x). However, since 4Q2(x) > 0 for all x, we obtain that which provides the lower bound in the lemma. For the upper bound, we observe that exp x2/2 Q(x) = exp x2/2 Z x (2π) 1/2 exp t2/2 dt x (2π) 1/2 exp t2 x2 /2 dt x (2π) 1/2 exp (t x)2/2 dt = 1 which establishes the upper bound. Lemma E.2. Let {Xi}n i=1 be a sequence of i.i.d. mean-zero sub-Gaussian random variables with variance proxy σ2, and let X = 1 n Pn i=1 Xi. Then E exp λX 2 e9/8 holds for any |λ| n 8σ2 . Proof. Note that if X µ sub G(σ2), then X µ sub G(σ2/n) (40) follows from the fact that E exp s(X µ) = i=1 exp n s (by Xi µ is sub-Gaussian) = exp s2(σ2/n) Then, by Proposition 4.3 in Zhang & Chen (2021), (X µ)2 n 1 var(X) sub E(8 2σ2/n, 8σ2/n). Therefore, E exp λ(X µ)2 = E exp λ (X µ)2 n 1 var(X) E exp{λn 1 var(X)} = exp 64λ2σ4 for any λ n 8σ2 . Lemma E.3. Suppose X and Y are Gaussian variables with EX = 0 and EY > 0. Then |Y | > c 2P(X > c Y ) + P (Y < 0) for any c > 0. Proof. We have |Y | > c = P (|X| > c|Y |) = P ({X > c Y } {X > 0} {Y > 0}) + P ({ X > c Y } {X 0} {Y > 0}) + P ({X > c Y } {X > 0} {Y 0}) + P ({ X > c Y } {X 0} {Y 0}) P ({X > c Y } {X > 0} {Y > 0}) + P ({ X > c Y } { X 0} {Y > 0}) + P(Y 0) by using X is symmetric about 0. 2P(X > c Y ) + P(Y 0) Multiplier Bootstrap-Based Exploration for any c > 0. Lemma E.4 (sub-Exponential concentration). Consider mean-zero independent random variables Xi sub E(λi), and positive constants a and b such that a X + b > 0, where the sample mean X := 1 s Ps i=1 Xi. Then, E exp s a X + b 2πλa 2b2 + 2a 2λa (b + λa) for s N such that 2λas/b 1, where λ := 1 s Ps i=1 λ2 i 1/2. Specifically, we have log E exp n s a X+b Proof. Denote Y := a X + b as a strictly positive random variable. For any non-negative strictly increasing function f( ), we have 0 P (f(Y ) > r) dr = Z 0 P Y > f 1(r) dr. Then for any fixed s N, E exp s a X + b = E exp n s 0 P exp n s 0 P Y > s log r 1 s log r 1 b dr s log r 1 b s log r 1 b s log r 1 b by letting u = s log r 1 = exp n s a | {z } >0 Recall that X is the average of independent mean-zero sub-exponential variables, and λ = 1 s Ps i=1 λ2 i 1/2. Then, we have X > t exp 1 for any t > 0 by Corollary 4.2.(c) in Zhang & Chen (2021). Therefore, we can further bound the expectation as E exp s a X + b by sub-Exponential inequality exp n s s u2 exp n s λ2a2 s(u b) λ2a2 s(u b) # s u2 exp s λ2a2 s(u b) o + I + II . Multiplier Bootstrap-Based Exploration The next step is bounding both I and II. For I, we have u + s(u b)2 s b2 exp s b + λa + s(u b)2 b2 exp s b + λa b exp s(u b)2 b2 exp s b + λa b exp (u b)2 2πλa 2b2 exp s b + λa For II, we decompose it as # s u2 exp s du + exp n s s(u b)(u 2λas =: II1 + II2. 
For II1, we have II1 = Z 2λas s u2 exp s 1 2λa is increasing on u 2λa (b + λa) + ( 2λa (b + λa) b) b b + λa exp s 2λa (b + λa) For II2, if 2λas b 1, i.e. λs b 2a, we obtain that II2 = exp n s s(u b)(u 2λas b 1 s exp n s b 1 s exp n s = s exp n s 1 e4s/b 2 π s/b e t2 dt by Lemma E.5 s exp n s Multiplier Bootstrap-Based Exploration By Combining the results obtained for I, II1, and II2, we can derive the following bound: E exp s a X + b o + I + II1 + II2 2b2 exp s b + λa + s (b + λa) b b + λa exp s 2λa (b + λa) 2b2 + 2λas2 2λa (b + λa) 2πλa 2b2 + 2a 2λa (b + λa) 2πλa 2b2 + 2a 2λa (b + λa) which gives the result we need. Lemma E.5. For any x 0, 2ex2 x e t2 dt 1. x e t2 dt = 2 π x e (t2 x2) dt x e (t x)2 dt
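As a quick numerical illustration of the two-sided Gaussian tail bound in Lemma E.1, the snippet below checks (1/4)exp(−x²) < P(Z > x) ≤ (1/2)exp(−x²/2) on a grid of our own choosing, using scipy's exact tail probability; it is a sanity check only, not part of the proof.

```python
# Numerically verify the bounds of Lemma E.1 on a grid of x values in [0, 5].
import numpy as np
from scipy.stats import norm

x = np.linspace(0.0, 5.0, 501)
tail = norm.sf(x)                      # exact P(Z > x)
lower = 0.25 * np.exp(-x**2)           # lower bound in Lemma E.1
upper = 0.5 * np.exp(-x**2 / 2.0)      # upper bound in Lemma E.1

assert np.all(lower < tail) and np.all(tail <= upper)
print("Lemma E.1 bounds hold on the grid; at x = 1:",
      float(lower[100]), "<", float(tail[100]), "<=", float(upper[100]))
```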