# Multiplier Bootstrap-based Exploration

Runzhe Wan*1, Haoyu Wei*2,3, Branislav Kveton1, Rui Song1

*Equal contribution. 1Amazon. 2Department of Economics, University of California San Diego. 3Department of Statistics, North Carolina State University. Correspondence to: Runzhe Wan, Rui Song.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

Despite the great interest in the bandit problem, designing efficient algorithms for complex models remains challenging, as there is typically no analytical way to quantify uncertainty. We propose Multiplier Bootstrap-based Exploration (MBE), a novel exploration strategy that is applicable to any reward model amenable to weighted loss minimization. We prove both instance-dependent and instance-independent rate-optimal regret bounds for MBE in sub-Gaussian multi-armed bandits. With extensive simulation and real-data experiments, we show the generality and adaptivity of MBE.

1. Introduction

The bandit problem has found wide applications in areas such as clinical trials (Durand et al., 2018), finance (Shen et al., 2015), and recommendation systems (Zhou et al., 2017), among others. Accurate uncertainty quantification is the key to addressing the exploration-exploitation trade-off. Most existing bandit algorithms critically rely on certain analytical properties of the imposed model (e.g., linear bandits) to quantify the uncertainty and derive the exploration strategy. Thompson Sampling (TS, Thompson, 1933) and the Upper Confidence Bound (UCB, Auer et al., 2002) are two prominent examples, which are typically based on explicit-form posterior distributions or confidence sets, respectively. However, in many real problems, the reward model is fairly complex: e.g., a general graphical model (Chapelle & Zhang, 2009) or a pipeline with multiple prediction modules and manual rules. In these cases, it is typically impossible to quantify the uncertainty analytically, and frameworks such as TS or UCB are either methodologically not applicable or computationally infeasible.

Motivated by these real needs, we are concerned with the following question: can we design a practical bandit algorithm framework that is general, adaptive, and computationally tractable, with certain theoretical guarantees? A straightforward idea is to apply the bootstrap method (Efron, 1992), a widely applicable data-driven approach for measuring uncertainty. However, as discussed in Section 2, most existing bootstrap-based bandit algorithms are either heuristic without a theoretical guarantee, computationally intensive, or only applicable in limited scenarios.

Contribution. Our contributions are three-fold. First, to address the aforementioned limitations, we propose a general-purpose bandit algorithm framework, Multiplier Bootstrap-based Exploration (MBE). MBE is based on the multiplier bootstrap (Van Der Vaart & Wellner, 1996), an easy-to-adapt bootstrap framework that only requires randomly weighted data points. We further show that a naive application of the multiplier bootstrap may result in linear regret, and we introduce a suitable way to add additional perturbations for sufficient exploration. The main advantage of MBE is its generality: it is applicable to any reward model amenable to weighted loss minimization, without the need for analytical-form uncertainty quantification or case-by-case algorithm design.
As a data-driven exploration strategy, MBE is also adaptive to different environments.

Second, theoretically, we prove near-optimal regret bounds for MBE under sub-Gaussian multi-armed bandits (MAB), in both the instance-dependent and the instance-independent sense. Compared with all existing results for bootstrap-based bandit algorithms, our result is strictly more general (see Table 1), since existing results only apply to special cases of sub-Gaussian distributions. To overcome the technical challenges, we prove a novel concentration inequality for certain functions of sub-exponential variables and develop, to the best of our knowledge, the first finite-sample concentration and anti-concentration analysis for the multiplier bootstrap. Given the broad applications of the multiplier bootstrap in statistics and machine learning, our theoretical analysis is of independent interest.

(This work does not relate to the positions of Runzhe Wan, Branislav Kveton and Rui Song at Amazon.)

Table 1: Comparison of several bootstrap- and perturbation-based bandit algorithms. All papers derive near-optimal regret bounds in MAB, under different reward distribution requirements. To compare the computational cost, we focus on MAB and consider Algorithm 2 for MBE. See Section 2 for more discussion of this table.

| Method | Exploration Source | Methodology Generality | Theory Requirement | Computation Cost |
| --- | --- | --- | --- | --- |
| MBE (this paper) | intrinsic & extrinsic | general | sub-Gaussian | O(KT) |
| GIRO (Kveton et al., 2019b) | intrinsic & extrinsic | general | Bernoulli | O(T^2) |
| ReBoot (Wang et al., 2020; Wu et al., 2022) | intrinsic & extrinsic | fixed & finite set of arms | Gaussian | O(KT) |
| PHE (Kveton et al., 2019a; 2020a;b) | only extrinsic | general | bounded | O(KT) |

Third, with extensive simulation and real-data experiments, we demonstrate that MBE yields comparable performance with existing algorithms in different MAB settings and in three real-world problems (online learning to rank, online combinatorial optimization, and dynamic slate optimization). This supports that MBE is easily generalizable, as it requires minimal modifications and derivations to match the performance of near-optimal algorithms specifically designed for each problem. Moreover, we also show that MBE adapts to different environments and is relatively robust, due to its data-driven nature.

2. Related Work

The most popular bandit algorithms, arguably, include ϵ-greedy (Watkins, 1989), TS, and UCB. ϵ-greedy is simple and thus widely used. However, its exploration strategy is not aware of the uncertainty in the data and is therefore known to be statistically sub-optimal. TS and UCB rely on posteriors and confidence sets, respectively. Yet, their closed forms only exist in limited cases, such as MAB or linear bandits. For a few other models (such as generalized linear models or neural networks), we know how to construct approximate posteriors or confidence sets with theoretical guarantees (Filippi et al., 2010; Li et al., 2017; Phan et al., 2019; Kveton et al., 2020a;b), though the corresponding algorithms are usually costly or conservative. In more general cases, it is often not clear how to adapt UCB and TS in a valid and efficient way. Although approximate TS can be implemented via approximate posterior inference methods (e.g., particle filtering, Markov chain Monte Carlo, or variational inference) (Gopalan et al., 2014; Kawale et al., 2015; Wan et al., 2021; Urteaga & Wiggins, 2018; Yu et al., 2020), these approaches do not come with guarantees.
Moreover, the dependency on the probabilistic model assumptions (e.g., the reward distribution family or the noise level) also pose challenges to being robust. To enable wider applications of bandit algorithms, several bootstrap-based (and related perturbation-based) methods have been proposed in the literature. Most algorithms are TS-type, by replacing the posterior with a bootstrap distribution. We next review the related papers, and summarize those with near-optimal asymptotic regret bounds in Table 1. We divide the sources of exploration into (i) leveraging the intrinsic randomness in the observed data (e.g., by randomizing the subset of history used for training) and (ii) manually adding extrinsic perturbations that are independent of the observed data (e.g., adding additive Gaussian noise to observed rewards). Arguably, the non-parametric bootstrap is the most wellknown bootstrap method, which works by re-sampling data with replacement. Vaswani et al. (2018) propose a version of non-parametric bootstrap with forced exploration to achieve a O(T 2/3) regret bound in Bernoulli MAB. GIRO proposed in Kveton et al. (2019b) successfully achieves a rate-optimal regret bound in Bernoulli MAB, by adding Bernoulli perturbations to non-parametric bootstrap. However, due to the re-sampling nature of non-parametric bootstrap, it is challenging to implement it efficiently beyond Bernoulli MAB (see Section 4.3). Specifically, the computational cost of re-sampling scales quadratically in T. Riou & Honda (2020) apply Bayesian bootstrap (Rubin, 1981), which is a smooth version of non-parametric Bootstrap. An asymptotically optimal regret bound is proved for MAB with bounded rewards. However, only MAB is studied and similar computational challenge exists in general cases. Another line of research is the residual bootstrap-based approach (Re Boot) (Hao et al., 2019; Wang et al., 2020; Tang et al., 2021; Wu et al., 2022). For each arm, Re Boot randomly perturbs the residuals of the corresponding observed rewards with respect to the estimated model to quantify the uncertainty for its mean reward. Although these methods also use random weights, they are applied to residuals, and thus are fundamentally different from our work. The limitation is that, by design, this approach is only applicable to problems with a fixed and finite set of arms, since the residuals are attached closely to each arm (see Appendix Multiplier Bootstrap-Based Exploration A.4 for more details). The perturbed history exploration (PHE) algorithm (Kveton et al., 2019a; 2020a;b) is also related. PHE works by adding additive noise to the observed rewards. Osband et al. (2019) apply similar ideas to reinforcement learning. However, PHE has two main limitations. First, for models where adding additive noise is not feasible (e.g., decision trees), PHE is not applicable. Second, as demonstrated in both Wang et al. (2020) and our experiments, the fact that PHE relies on only the extrinsically injected noise for exploration makes it less robust. For a complex structured problem, it may not be clear how to add the noise in a sound way (Wang et al., 2020). In contrast, it is typically more natural (and hence easier to be accepted) to leverage the intrinsic randomness in the observed data. Finally, we note that multiplier bootstrap has been considered in the bandit literature, mostly as a computationally efficient approximation to non-parametric bootstrap studied in those papers. 
Eckles & Kaptein (2014) study the direct adaptation of the multiplier bootstrap (see Section 4.1) in simulation, and its empirical performance in contextual bandits is studied later (Tang et al., 2015; Elmachtoub et al., 2017; Riquelme et al., 2018; Bietti et al., 2021). However, no theoretical guarantee is provided in these works. In fact, as demonstrated in Section 4.1, such a naive adaptation may incur linear regret. Osband & Van Roy (2015) show that, in Bernoulli MAB, a variant of the multiplier bootstrap is mathematically equivalent to TS; no further theoretical or numerical results are provided beyond this special case. Our work is the first systematic study of the multiplier bootstrap in bandits. Our unique contributions include: we identify the potential failure of naively applying the multiplier bootstrap, highlight the importance of additional perturbations, design a general algorithm framework to make this heuristic idea concrete, provide the first theoretical guarantee in general MAB settings, and conduct extensive numerical experiments to study its generality and adaptivity.

3. Preliminary

Setup. We consider a general stochastic bandit problem. For any positive integer $M$, let $[M] = \{1, \dots, M\}$. At each round $t \in [T]$, the agent observes a context vector $x_t$ (empty in non-contextual problems) and an action set $\mathcal{A}_t$, then chooses an action $A_t \in \mathcal{A}_t$, and finally receives the corresponding reward $R_t = f(x_t, A_t) + \epsilon_t$, where $f$ is an unknown function and $\epsilon_t$ is the noise term. Without loss of generality, we assume $f(x_t, A_t) \in [0, 1]$. We note that the realized reward $R_t$ does not need to be bounded. The goal is to minimize the cumulative regret

$$\mathrm{Reg}_T = \sum_{t=1}^{T} \mathbb{E}\Big[\max_{a \in \mathcal{A}_t} f(x_t, a) - f(x_t, A_t)\Big].$$

At the end of round $t$, with the existing dataset $\mathcal{D}_t = \{(x_l, A_l, R_l)\}_{l \in [t]}$, to decide the action $A_{t+1}$, most algorithms first estimate $f$ in some function class $\mathcal{F}$ by solving a weighted loss minimization problem (also called weighted empirical risk minimization or cost-sensitive training)

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{l=1}^{t} \omega_l\, L\big(f(x_l, A_l), R_l\big) + J(f). \qquad (1)$$

Here, $L$ is a loss function (e.g., the $\ell_2$ loss or the negative log-likelihood), $\omega_l$ is the weight of the $l$-th data point, and $J$ is an optional penalty function. We consider the weighted problem as it is general and related to our proposal below; one can set $\omega_l \equiv 1$ to recover the unweighted problem.

As the simplest example, consider the $K$-armed bandit problem, where $x_l$ is empty and $\mathcal{A}_l = [K]$. Let $L$ be the $\ell_2$ loss, $J \equiv 0$, and $f(x_l, A_l) \equiv r_{A_l}$, where $r_k$ is the mean reward of the $k$-th arm. Then, (1) reduces to $\arg\min_{\{r_1, \dots, r_K\}} \sum_{l=1}^{t} \omega_l (R_l - r_{A_l})^2$, which gives the estimator $\hat{r}_k = \big(\sum_{l: A_l = k} \omega_l\big)^{-1} \sum_{l: A_l = k} \omega_l R_l$, i.e., the arm-wise weighted average. Similarly, in linear bandits, (1) reduces to the weighted least-squares problem (see Appendix A.2 for details).
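Many off-the-shelf learners already support this weighted objective, which is what makes (1) a convenient interface. As a minimal illustration (our own sketch, not the paper's code; the estimator choice is an assumption), any scikit-learn regressor whose `fit` accepts a `sample_weight` argument can serve as the oracle for (1):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_weighted_oracle(X, rewards, weights):
    """One call to the weighted-loss oracle in (1): here F is a gradient-boosted
    tree ensemble with the squared loss; any regressor accepting sample_weight
    could be plugged in instead."""
    model = GradientBoostingRegressor()
    model.fit(X, rewards, sample_weight=weights)
    return model

# Toy usage: rows of X encode the (context, action) pair for each logged round.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
rewards = X[:, 0] + 0.1 * rng.normal(size=100)
weights = rng.exponential(1.0, size=100)   # multiplier weights, e.g. Exp(1)
f_hat = fit_weighted_oracle(X, rewards, weights)
print(f_hat.predict(X[:2]))
```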
Challenges. The estimation of $f$, together with the related uncertainty quantification, forms the foundation of most bandit algorithms. In the literature, $\mathcal{F}$ is typically a class of models that permits closed-form uncertainty quantification (e.g., linear models, Gaussian processes, etc.). However, in many real applications, the reward model can have a fairly complicated structure, e.g., a hierarchical pipeline with both classification and regression modules; manually specified rules are also a common part of the model. It is challenging to quantify the uncertainty of these complicated models in analytical form. Even when feasible, the dependency on probabilistic model assumptions also poses challenges to robustness.

Therefore, in this paper, we focus on the bootstrap-based approach due to its generality and data-driven nature. Bootstrapping, as a general approach to quantifying model uncertainty, has many variants. The most popular one, arguably, is the non-parametric bootstrap (used in GIRO), which constructs bootstrap samples by re-sampling the dataset with replacement. However, due to its re-sampling nature, it is computationally intensive (see Section 4.3 for more discussion). In contrast, the multiplier bootstrap (Van Der Vaart & Wellner, 1996), as an efficient and easy-to-implement alternative, is popular in statistics and machine learning.

Multiplier bootstrap. The main idea of the multiplier bootstrap is to learn the model using randomly weighted data points. Specifically, given a multiplier weight distribution $\rho(\omega)$, for every bootstrap sample at round $t$, we first randomly sample $\{\omega^{MB}_l\}_{l=1}^{t} \sim \rho(\omega)$ and then solve (1) with $\omega_l = \omega^{MB}_l$ to obtain $\hat{f}^{MB}$. Repeating this procedure, the distribution of $\hat{f}^{MB}$ forms the bootstrap distribution that quantifies our uncertainty over $f$. Popular choices of $\rho(\omega)$ include $N(1, \sigma_\omega^2)$, $\mathrm{Exp}(1)$, $\mathrm{Poisson}(1)$, and the double-or-nothing distribution $2 \cdot \mathrm{Bernoulli}(0.5)$.

4. Multiplier Bootstrap-based Exploration

4.1. Failure of the Naive Adaptation of Multiplier Bootstrap

To design an exploration strategy based on the multiplier bootstrap, a natural idea is to replace the posterior distribution in TS with the bootstrap distribution. Specifically, at every time point, we sample a function $\hat{f}$ following the multiplier bootstrap procedure described in Section 3 and then take the greedy action $\arg\max_{a \in \mathcal{A}_t} \hat{f}(x_t, a)$. However, perhaps surprisingly, such an adaptation may not be valid. The main reason is that the intrinsic randomness in a finite dataset is, in some cases, not enough to guarantee sufficient exploration. For example, the support of the bootstrap distribution cannot go outside the convex hull of the observed rewards. We illustrate this with the following toy example; a short simulation sketch of this failure mode follows the proof.

Example 1. Consider a two-armed Bernoulli bandit. Let the mean rewards of the two arms be $p_1$ and $p_2$, respectively. Without loss of generality, assume $1 > p_1 > p_2 > 0$, and let $P(\omega = 0) = 0$. Then, with non-zero probability, an agent following the naive adaptation of the multiplier bootstrap (breaking ties randomly and initializing in an optimistic way; see Algorithm 5 in Appendix A.3 for details) pulls arm 1 only once. Therefore, the agent suffers a linear regret.

Proof. We first define two events $E_1 = \{A_1 = 1, R_1 = 0\}$ and $E_2 = \{A_2 = 2, R_2 = 1\}$. By design, at time $t = 1$, the agent chooses an arm uniformly at random and hence pulls arm 1 with probability 0.5. The observed reward $R_1$ is then 0 with probability $1 - p_1$. Therefore, $P(E_1) = 0.5(1 - p_1)$. Conditioned on $E_1$, at $t = 2$ the agent pulls arm 2 (since multiplying $R_1 = 0$ by any weight always gives 0), and it observes reward $R_2 = 1$ with probability $p_2$. Conditioned on $E_1 \cap E_2$, by induction, the agent pulls arm 2 at every $t > 2$: the only reward record for arm 1 is $R_1 = 0$, so its weighted average is always 0, which is smaller than the weighted average for arm 2, which is positive. In conclusion, with probability at least $0.5(1 - p_1)p_2 > 0$, the algorithm takes the optimal arm 1 only once.
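As referenced above, the following small simulation (our own illustration, not code from the paper) runs the naive multiplier-bootstrap rule of Example 1 with Exp(1) weights, so that $P(\omega = 0) = 0$; after the unlucky start $E_1 \cap E_2$, the agent never returns to arm 1:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.3])           # arm index 0 ("arm 1") is optimal
T = 2000

# Force the unlucky start of Example 1: arm 1 returns reward 0, arm 2 returns 1.
arms, rewards = [0, 1], [0.0, 1.0]

for t in range(2, T):
    arms_arr = np.array(arms)
    rewards_arr = np.array(rewards)
    est = np.empty(2)
    for k in range(2):
        mask = arms_arr == k
        w = rng.exponential(1.0, size=mask.sum())   # multiplier weights, P(w = 0) = 0
        est[k] = np.dot(w, rewards_arr[mask]) / w.sum()
    a = int(np.argmax(est))        # greedy with respect to one bootstrap draw
    arms.append(a)
    rewards.append(float(rng.random() < p[a]))

# Arm 1's only record is 0, so its bootstrap estimate is stuck at 0 forever.
print("pulls of the suboptimal arm:", sum(a == 1 for a in arms))   # ~T
```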
4.2. Main Algorithm

The failure of the naive application of the multiplier bootstrap implies that some additional randomness is needed to ensure sufficient exploration. In this paper, we achieve this by adding pseudo-rewards, an approach that has proven effective in a few other setups (Kveton et al., 2019b; Wang et al., 2020). The intuition is as follows. The under-exploration issue arises when, by randomness, the observed rewards fall in the low-value region (compared with the expected reward). Therefore, if we blend in data points whose rewards have a relatively wide coverage, the agent has a higher chance to explore.

These discussions motivate the design of our main algorithm, Multiplier Bootstrap-based Exploration (MBE), presented in Algorithm 1. Specifically, at every round, in addition to the observed reward, we add two pseudo-rewards with values 0 and 1. The pseudo-rewards are associated with the pulled arm and the context (if it exists). Then, we solve a weighted loss minimization problem to update the model estimate (line 8). The weights are first sampled from a multiplier distribution (line 7), and those of the pseudo-rewards are additionally multiplied by a tuning parameter $\lambda$. In MAB, the estimates are the arm-wise weighted averages of all (observed or pseudo-) rewards

$$\hat{r}_k = \frac{\sum_{\ell: A_\ell = k} \big( \omega_\ell R_{k,\ell} + \lambda \omega'_\ell \cdot 1 + \lambda \omega''_\ell \cdot 0 \big)}{\sum_{\ell: A_\ell = k} \big( \omega_\ell + \lambda \omega'_\ell + \lambda \omega''_\ell \big)} \qquad (2)$$

for $k \in [K]$, where the multiplier weights $\{\omega_\ell, \omega'_\ell, \omega''_\ell\}_{\ell=1}^{t} \sim \rho(\omega)$. See Appendix A.1 for details.

We make three remarks on the algorithm design. First, we choose to add pseudo-rewards at the boundaries of the mean-reward range (i.e., $[0, 1]$), since such a design naturally induces a high variance (and hence more exploration); adding pseudo-rewards in other ways is also possible. Second, the tuning parameter $\lambda$ controls the amount of extrinsic perturbation and, together with the dispersion of $\rho(\omega)$, determines the degree of exploration. In Section 5, we give a theoretically valid range for $\lambda$. Finally and critically, besides guaranteeing sufficient exploration, we need to make sure the optimal arm can still be identified (asymptotically) after adding the pseudo-rewards. Intuitively, this is guaranteed, since the pseudo-rewards shift and scale the (asymptotic) mean reward from $f(x, a)$ to $(f(x, a) + \lambda)/(1 + 2\lambda) = f(x, a)/(1 + 2\lambda) + \lambda/(1 + 2\lambda)$, which preserves the order between arms. A detailed analysis for MAB can be found in Appendix A.1.

We conclude this section by revisiting Example 1 to provide some insight into how the pseudo-rewards help.

Example 1 (Continued). Even under the event $E_1 \cap E_2$, Algorithm 1 explores. To see this, take the multiplier distribution to be $2 \cdot \mathrm{Bernoulli}(0.5)$. Then

$$P(A_3 = 1) \ge P(\bar{Y}_1 > \bar{Y}_2) = P\!\left( \frac{\lambda \omega'_1}{\omega_1 + \lambda \omega'_1 + \lambda \omega''_1} > \frac{\omega_2 + \lambda \omega'_2}{\omega_2 + \lambda \omega'_2 + \lambda \omega''_2} \right) \ge P\big(\omega'_1 = 2,\ \omega_1 = \omega''_1 = \omega_2 = \omega'_2 = 0,\ \omega''_2 = 2\big) > 0,$$

where $\bar{Y}_k$ denotes the weighted average (2) for arm $k$. Therefore, the agent can still choose the optimal arm.

Algorithm 1: General Template for MBE
1  Data: function class $\mathcal{F}$, loss function $L$, (optional) penalty function $J$, multiplier weight distribution $\rho(\omega)$, tuning parameter $\lambda$
2  Initialize $\hat{f}$
3  for $t = 1, \dots, T$ do
4    Observe context $x_t$ and action set $\mathcal{A}_t$
5    Take action $A_t = \arg\max_{a \in \mathcal{A}_t} \hat{f}(x_t, a)$ (break ties randomly)
6    Observe reward $R_t$
7    Sample the multiplier weights $\{\omega_l, \omega'_l, \omega''_l\}_{l=1}^{t} \sim \rho(\omega)$
8    Solve the weighted loss minimization problem
       $\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{l=1}^{t} \big[ \omega_l L(f(x_l, A_l), R_l) + \lambda \omega'_l L(f(x_l, A_l), 1) + \lambda \omega''_l L(f(x_l, A_l), 0) \big] + J(f)$
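For concreteness, here is a minimal sketch (ours, not the authors' code; parameter defaults are assumptions) of one bootstrap draw of the MAB estimate (2): all multiplier weights are resampled from $N(1, \sigma_\omega^2)$, and each arm's estimate averages its observed rewards together with the pseudo-rewards 1 and 0.

```python
import numpy as np

def mbe_mab_estimates(arms, rewards, K, lam=0.5, sigma_w=1.0, rng=None):
    """One bootstrap draw of the arm means in (2).

    Each observation of arm k contributes its reward plus pseudo-rewards 1 and 0,
    with weights w, lam*w1, lam*w0 drawn from N(1, sigma_w^2).
    """
    rng = rng or np.random.default_rng()
    arms = np.asarray(arms)
    rewards = np.asarray(rewards, dtype=float)
    r_hat = np.full(K, np.inf)                 # unpulled arms stay optimistic
    for k in range(K):
        r = rewards[arms == k]
        if r.size == 0:
            continue
        w, w1, w0 = rng.normal(1.0, sigma_w, size=(3, r.size))
        num = np.dot(w, r) + lam * w1.sum() * 1.0 + lam * w0.sum() * 0.0
        den = w.sum() + lam * (w1.sum() + w0.sum())
        # A near-zero denominator can occur with Gaussian weights; a practical
        # guard is omitted in this sketch.
        r_hat[k] = num / den
    return r_hat

# The next action is the argmax of one such draw (Algorithm 1 specialized to MAB).
```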
4.3. Computationally-Efficient Implementation

Efficient computation is critical for real applications of bandit algorithms. One potential limitation of Algorithm 1 is its computational burden: at every decision point, we need to re-sample the weights for all historical observations and refit the model (lines 7-8). This leads to a total computational cost of order $O(T^2)$, similar to GIRO. Fortunately, one prominent advantage of the multiplier bootstrap over other bootstrap methods (such as the non-parametric bootstrap or the residual bootstrap) is that the (approximate) bootstrap distribution can be efficiently updated in an online manner, so that the per-round computation cost does not grow over time. Suppose we have a dataset $\mathcal{D}_t$ at time $t$, and denote by $\mathcal{B}(\mathcal{D}_t)$ the corresponding bootstrap distribution for $f$. With the multiplier bootstrap, it is feasible to update $\mathcal{B}(\mathcal{D}_{t+1})$ approximately based on $\mathcal{B}(\mathcal{D}_t)$. We detail the procedure below and elaborate further in Algorithm 2.

Specifically, we maintain $B$ different models $\{\hat{f}_{b,t}\}_{b=1}^{B}$ and, for every $b \in [B]$, the corresponding observed history $H_b = \{(x_l, A_l, R_l, \omega_{l,b})\}_{l=1}^{t}$ and pseudo-history $H'_b = \{(x_l, A_l, 1, \omega'_{l,b})\}_{l=1}^{t} \cup \{(x_l, A_l, 0, \omega''_{l,b})\}_{l=1}^{t}$. The models $\{\hat{f}_{b,t}\}_{b=1}^{B}$ can be regarded as sampled from $\mathcal{B}(\mathcal{D}_t)$, and hence the empirical distribution over them approximates the bootstrap distribution. At every time point $t$, for each replicate $b$, we only need to sample the weights for the new data point and then update $\hat{f}_{b,t}$ to $\hat{f}_{b,t+1}$. Then, $\{\hat{f}_{b,t+1}\}_{b=1}^{B}$ are still $B$ valid samples from $\mathcal{B}(\mathcal{D}_{t+1})$ and hence still a valid approximation. We note that, since there is only one new data point, the update of $\hat{f}$ can typically be done efficiently (e.g., with a closed-form update or via online gradient descent). The per-round computational cost is hence independent of $t$. Such an approximation is common practice in the online bootstrap literature and can be regarded as an ensemble sampling-type algorithm (Lu & Van Roy, 2017; Qin et al., 2022). The hyper-parameter $B$ is typically not treated as a tuning parameter but depends on the available computational resources (Hao et al., 2019). In our numerical experiments, this practical variant shows the desired performance with $B = 50$. Moreover, the algorithm is embarrassingly parallel and easy to implement: given an existing implementation for estimating $f$ (i.e., solving (1)), the main requirement is to replicate it $B$ times and use random weights for each. This feature is attractive in real applications.

Algorithm 2: Practical Implementation of MBE
1  Data: number of bootstrap replicates $B$, function class $\mathcal{F}$, loss function $L$, (optional) penalty function $J$, weight distribution $\rho(\omega)$, tuning parameter $\lambda$
2  Let $H_b = \{\}$ be the history and $H'_b = \{\}$ be the pseudo-history, for every $b \in [B]$
3  Initialize $\hat{f}_{b,0}$ for every $b \in [B]$
4  for $t = 1, \dots, T$ do
5    Observe context $x_t$ and action set $\mathcal{A}_t$
6    Sample an index $b_t$ uniformly from $\{1, \dots, B\}$
7    Offer $A_t = \arg\max_{a \in \mathcal{A}_t} \hat{f}_{b_t, t-1}(x_t, a)$ (break ties randomly)
8    Observe reward $R_t$
9    for $b = 1, \dots, B$ do
10     Sample the weights $\omega_{t,b}, \omega'_{t,b}, \omega''_{t,b} \sim \rho(\omega)$
11     Update $H_b = H_b \cup \{(x_t, A_t, R_t, \omega_{t,b})\}$ and $H'_b = H'_b \cup \{(x_t, A_t, 1, \omega'_{t,b}), (x_t, A_t, 0, \omega''_{t,b})\}$
12     Solve the weighted loss minimization problem
         $\hat{f}_{b,t} = \arg\min_{f \in \mathcal{F}} \sum_{l=1}^{t} \big[ \omega_{l,b} L(f(x_l, A_l), R_l) + \lambda \omega'_{l,b} L(f(x_l, A_l), 1) + \lambda \omega''_{l,b} L(f(x_l, A_l), 0) \big] + J(f)$
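To illustrate the online update, here is a minimal MAB instantiation of Algorithm 2 (our own sketch, not the authors' code; class and variable names, and the hyper-parameter defaults, are assumptions). Each replicate keeps running weighted sums, so the per-round cost does not grow with $t$.

```python
import numpy as np

class MBEMab:
    """Minimal sketch of Algorithm 2 for a K-armed bandit."""

    def __init__(self, K, B=50, lam=0.5, sigma_w=1.0, seed=0):
        self.K, self.B, self.lam, self.sigma_w = K, B, lam, sigma_w
        self.rng = np.random.default_rng(seed)
        # Per-replicate running numerator and denominator of (2).
        self.num = np.zeros((B, K))
        self.den = np.zeros((B, K))

    def act(self):
        b = self.rng.integers(self.B)        # pick one replicate (line 6)
        means = np.where(self.den[b] > 0,
                         self.num[b] / np.maximum(self.den[b], 1e-12),
                         np.inf)             # optimistic for unpulled arms
        return int(np.argmax(means))

    def update(self, arm, reward):
        # One weight triple per replicate for the new data point (lines 9-12);
        # old weights are kept, so the sums can be updated incrementally.
        w, w1, w0 = self.rng.normal(1.0, self.sigma_w, size=(3, self.B))
        self.num[:, arm] += w * reward + self.lam * w1 * 1.0 + self.lam * w0 * 0.0
        self.den[:, arm] += w + self.lam * (w1 + w0)

# Usage on a toy Bernoulli bandit.
true_means = np.array([0.2, 0.5, 0.7])
agent, rng = MBEMab(K=3), np.random.default_rng(1)
for t in range(5000):
    a = agent.act()
    agent.update(a, float(rng.random() < true_means[a]))
print("most-pulled arm (expected: 2):", np.argmax(agent.den.sum(axis=0)))
```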
5. Regret Analysis

In this section, we provide regret bounds for Algorithm 1 under MAB with sub-Gaussian rewards. We regard this as a first step towards the theoretical understanding of MBE and leave the analysis of more general settings for future work. We call a random variable $X$ $\sigma$-sub-Gaussian if $\mathbb{E}\exp\{t(X - \mathbb{E}X)\} \le \exp\{t^2\sigma^2/2\}$ for any $t \in \mathbb{R}$. The instantiation of Algorithm 1 under MAB is presented as Algorithm 3 in the appendix.

Theorem 5.1. Consider a $K$-armed bandit where the reward distribution of arm $k$ is 1-sub-Gaussian with mean $\mu_k$. Suppose arm 1 is the unique best arm, i.e., it has the highest mean reward, and define $\Delta_k = \mu_1 - \mu_k$. Take the multiplier weight distribution as $N(1, \sigma_\omega^2)$ and let the tuning parameter satisfy

$$\lambda \ge 1 + \sigma_\omega^2/4 + 4/\sigma_\omega + \sqrt{4\,(1 + 4/\sigma_\omega)/\sigma_\omega}.$$

Then the problem-dependent regret is upper bounded by

$$\mathrm{Reg}_T \le \sum_{k:\, \Delta_k > 0} \Big\{ 7\Delta_k + 55\big[C_1(\lambda, \sigma_\omega) + C_2(\lambda, \sigma_\omega)\big]\, \Delta_k^{-1} \log T \Big\},$$

and the problem-independent regret is bounded by

$$\mathrm{Reg}_T \le 7K\mu_1 + C_1(\lambda, \sigma_\omega)\, K \log T + C_2(\lambda, \sigma_\omega)\, \sqrt{KT \log T},$$

where

$$C_1(\lambda, \sigma_\omega) = 8\sqrt{2}\,\big[C_3(\lambda, \sigma_\omega) + 38\sigma_\omega^2\big], \qquad C_2(\lambda, \sigma_\omega) = 6\lambda^2 + 45(3 + \sigma_\omega^2)\lambda^4\big[C_3(\lambda, \sigma_\omega) + 38\sigma_\omega^2\big],$$

$$C_3(\lambda, \sigma_\omega) = \frac{\log\big[(1 + 15\sigma_\omega^{-2} + 3\sigma_\omega + 10\sigma_\omega^2)\lambda^2\big]}{3\log 2} + 1.$$

The two regret bounds are near-optimal (up to logarithmic terms) in both the problem-dependent and the problem-independent sense (Lattimore & Szepesvári, 2020). Notably, the Gaussian distribution and all bounded distributions belong to the sub-Gaussian class. Therefore, as reviewed in Table 1, our theory is strictly more general than all existing results for bootstrap-based MAB algorithms.

Technical challenges. It is particularly challenging to analyze MBE for two reasons. First, the probabilistic analysis of the multiplier bootstrap itself is technically challenging, since the same random weights appear in both the numerator and the denominator (recall that MBE uses the weighted averages (2) to select actions in MAB), and it is notoriously complicated to analyze ratios of random variables, especially correlated ones. Second, existing bootstrap-based papers rely on the properties of specific parametric reward classes (e.g., Bernoulli in Kveton et al. (2019b) and Gaussian in Wang et al. (2020)), while we lose these nice structures when considering sub-Gaussian rewards.

To overcome these challenges, we denote the first $s$ rewards from pulling arm $k$ as $\mathcal{H}_{k,s}$, with the $i$-th observation denoted as $R_{k,i}$, and start by carefully defining two good events $G_{k,s}$ and $A_{k,s}$. Here, $G_{k,s}$ denotes the event that the weighted average

$$\bar{Y}_{k,s} = \frac{\sum_{i=1}^{s} \big[\omega_i R_{k,i} + \lambda \omega'_i \cdot 1 + \lambda \omega''_i \cdot 0\big]}{\sum_{i=1}^{s} \big(\omega_i + \lambda \omega'_i + \lambda \omega''_i\big)}$$

is close to the unweighted average (with pseudo-rewards)

$$\bar{R}_{k,s} = \frac{\sum_{i=1}^{s} \big(R_{k,i} + \lambda \cdot 1 + \lambda \cdot 0\big)}{\sum_{i=1}^{s} (1 + \lambda + \lambda)} = \frac{\sum_{i=1}^{s} R_{k,i} + s\lambda}{s(1 + 2\lambda)},$$

and $A_{k,s}$ represents the event that $\bar{R}_{k,s}$ is close to its population mean $(\mu_k + \lambda)/(1 + 2\lambda)$. It is worth noting that $\{\omega_i, \omega'_i, \omega''_i\}_{i=1}^{s}$ are resampled from $\rho(\omega)$ at each round. To bound the probability of the bad event and the regret conditioned on it, we face two major technical challenges. First, when transforming the ratio into an analyzable form, a summation of correlated sub-Gaussian and sub-exponential variables appears and is hard to analyze. We carefully design and analyze a novel event to remove the correlation and the sub-Gaussian terms (see the proof of Lemma D.3). Second, the proof needs a new concentration inequality for functions of sub-exponential variables that does not exist in the literature; we obtain such an inequality (Lemma E.4) via a careful analysis of sub-exponential distributions. To the best of our knowledge, our proof provides the first finite-sample concentration and anti-concentration analysis for the multiplier bootstrap, which has broad applications in statistics and machine learning.

Extension. Theorem 5.1 is proved with Gaussian weights to simplify the analysis.
Indeed, to analyze the multiplier bootstrap, we need to analyze $\sum_{\ell: A_\ell = k} \omega_\ell R_{k,\ell}$ conditioned on the reward history, which is a sum of scaled i.i.d. variables, and the Gaussian distribution has nice analytical properties that let us derive the bound. We hypothesize that our result can be extended to other weight distributions satisfying similar anti-concentration and concentration properties, such as

$$\underline{C} \exp\{-t^2/(2\underline{\sigma}^2)\} \le P(|\omega_i - 1| > t) \le \bar{C} \exp\{-t^2/(2\bar{\sigma}^2)\}$$

with positive constants $\underline{C}, \bar{C}, \underline{\sigma}, \bar{\sigma}$. However, we expect some analytical challenges.

Tuning parameters. In Theorem 5.1, MBE has two tuning parameters, $\lambda$ and $\sigma_\omega$. Intuitively, $\lambda$ controls the amount of external perturbation and $\sigma_\omega$ controls the magnitude of exploration from the bootstrap. In general, higher values of these two parameters facilitate exploration but also lead to slower convergence. The condition on $\lambda$ in Theorem 5.1 requires that (i) $\lambda$ is not too small and (ii) the joint effect of $\lambda$ and $\sigma_\omega$ is not too small; both are intuitive. In practice, this condition can be loose: e.g., it requires $\lambda \ge 5.25 + 2\sqrt{5} \approx 9.72$ when $\sigma_\omega = 1$ (plugging $\sigma_\omega = 1$ into the condition gives $1 + 1/4 + 4 + \sqrt{20}$). As we observe in Section 6, MBE with a much smaller $\lambda$ (e.g., 0.5) still performs well empirically.

6. Experiments

In this section, we empirically evaluate MBE with both simulations (Section 6.1) and real datasets (Section 6.2).

6.1. MAB Simulation

We first experiment with simulated MAB instances. The goals are to (i) further validate our theoretical findings, (ii) check whether MBE yields comparable performance with standard methods, and (iii) study the robustness and adaptivity of MBE. We also experimented with linear bandits, and the main findings are similar; to save space, we defer those results to Appendix B.1.

We compare MBE with TS (Thompson, 1933), PHE (Kveton et al., 2019a), ReBoot (Wang et al., 2020), and GIRO (Kveton et al., 2019b). The last three are the existing bootstrap- or perturbation-type algorithms reviewed in Section 2. Specifically, PHE explores by perturbing observed rewards with additive noise, without leveraging the intrinsic uncertainty in the data; ReBoot explores by perturbing the residuals of the rewards observed for each arm; and GIRO re-samples observed data points with replacement. In all experiments below, the weights of MBE are sampled from $N(1, \sigma_\omega^2)$.¹ We fix $\lambda = 0.5$ and run MBE with three values of $\sigma_\omega^2$: 0.5, 1, and 1.5. We also compare with the naive adaptation of the multiplier bootstrap (i.e., no pseudo-rewards; denoted as Naive MB). We run Algorithm 2 with $B = 50$ replicates.

We first study 10-armed bandits, where the mean reward of each arm is independently sampled from Beta(1, 8). We consider three reward distributions: Bernoulli, Gaussian, and exponential. For the Gaussian MAB, the reward noise is sampled from $N(0, 1)$; the other two distributions are determined by their means. For TS, we always use the correct reward distribution class and its conjugate prior, with the prior mean and variance calibrated using the true model; TS is therefore a strong baseline. For GIRO and ReBoot, we use the default implementations, as they work well. For PHE, the original paper adds Bernoulli perturbations since it only studies bounded reward distributions; we extend PHE by sampling additive noise from the same distribution family as the true rewards, as done in Wu et al. (2022). GIRO, ReBoot, and PHE all have one tuning parameter that controls the degree of exploration. We tune it over $\{2^{k-4}\}_{k=0}^{6}$ and report the best performance for each method.
Without tuning, these algorithms generally do not perform well as originally proposed, due to differences in the settings. We tuned Naive MB as well.

Results. Results over 100 runs are reported in Figure 1.

¹ We also experimented with other weight distributions, with similar main conclusions. Using Gaussian weights allows us to study the impact of different multiplier magnitudes more clearly.

Our findings can be summarized as follows. First, without knowledge of the problem settings (e.g., the reward distribution family and its parameters, or the prior distribution) and without heavy tuning, MBE performs favorably and close to TS. Second, pseudo-rewards are indeed important for exploration; otherwise the algorithm suffers a linear regret. Third, MBE has stable performance across different $\sigma_\omega$, while the other methods are tuned for their best performance; this is thanks to the data-driven nature of MBE. Finally, the other three general-purpose exploration strategies perform reasonably after tuning, as expected. However, GIRO is computationally intense: for example, in Gaussian bandits, the time cost of GIRO is 2 minutes, while all the other algorithms complete within 10 seconds. The computational burden is due to the limitation of the non-parametric bootstrap (see Section 4.3). ReBoot also performs reasonably, yet by design it is not easy to extend to more complex problems (e.g., the problems in Section 6.2).

Adaptivity. PHE relies on sampling additive noise from an appropriate distribution, and TS can be viewed similarly. In the results above, we provide auxiliary information about the environment to them and need to modify their implementations in different setups. In contrast, MBE automatically adapts to these problems; as argued in Section 2, one main advantage of MBE over them is its adaptiveness. To see this, we consider the following procedure: we run the Gaussian versions of TS and PHE in the Bernoulli MAB, and run their Bernoulli versions in the Gaussian MAB. We also run MBE with $\sigma_\omega^2 = 0.5$; MBE does not require any modification across the two problems. The results presented in Figure 2 clearly demonstrate that MBE adapts to reward distributions. Similarly, in Figure 3, we also study the adaptivity of these methods with respect to the reward distribution scale (the standard deviation $\sigma$ of the Gaussian noise) and the task distribution (we sample the mean rewards from Beta($\alpha$, 8) and vary the parameter $\alpha$). For all settings, we use the algorithms tuned for Figure 1. MBE shows impressive adaptivity, while PHE and TS may not perform well when the environment is not close to the one they are tuned for. Recall that, in real applications, heavy tuning is not possible without the ground truth. This demonstrates the adaptivity of MBE as a data-driven exploration strategy.

Additional results. In Appendix B.2, we also try different values of $\lambda$ and $B$ for MBE, and we repeat the main experiment with $K = 25$. Our main observations still hold, and MBE is relatively robust to its tuning parameters.

6.2. Real-Data Applications

The main benefit of MBE is that it easily generalizes to complex models. In this section, we use real datasets to demonstrate this property. Specifically, we test whether MBE can achieve comparable performance with strong problem-specific baselines proposed in the literature, without problem-specific algorithm design and heavy tuning.
Figure 1: Simulation results under MAB (Bernoulli, Gaussian, and exponential rewards). The error bars indicate the standard errors, which may not be visible when the width is small.

Figure 2: Robustness results with respect to the reward distribution class (Bernoulli MAB and Gaussian MAB).

Figure 3: Results with different reward variances and task distributions: (a) trend with $\alpha$; (b) trend with $\sigma$. For the x-axis in both panels and the y-axis in the second one, we plot on a logarithmic scale for better visualization.

Domain-specific models. We study the three problems considered in Wan et al. (2022): cascading bandits for online learning to rank (Kveton et al., 2015), combinatorial semi-bandits for online combinatorial optimization (Chen et al., 2013), and multinomial logit (MNL) bandits for dynamic slate optimization (Agrawal et al., 2017; 2019). All of these are practical and important problems in real life. Yet, these domain models all have unique structures and require case-by-case algorithm design. For example, the rewards in MNL bandits follow multinomial distributions that depend on the pulled arms in a complex way; to derive the posterior or confidence bound, one has to use a delicately designed epoch-type procedure (Agrawal et al., 2019).

Datasets. We use the three datasets studied in Wan et al. (2022). Specifically, we use the Yelp rating dataset (Zong et al., 2016) to recommend and rank K restaurants, the Adult dataset (Dua & Graff, 2017) to send advertisements to K/2 men and K/2 women (a combinatorial semi-bandit problem with continuous rewards), and the MovieLens dataset (Harper & Konstan, 2015) to display K movies. In our experiments, we fix K = 4 and randomly sample 30 items from each dataset to choose from. We provide a summary of these datasets and problems in Appendix B.3, and refer interested readers to Wan et al. (2022) and references therein for more details.

Baselines. We compare MBE with state-of-the-art baselines in the literature, including TS-Cascade (Zhong et al., 2021) and CascadeKL-UCB (Kveton et al., 2015) for cascading bandits, CUCB (Chen et al., 2016) and CTS (Wang & Chen, 2018) for semi-bandits, and MNL-TS (Agrawal et al., 2017) and MNL-UCB (Agrawal et al., 2019) for MNL bandits. To save space, we denote the TS-type algorithms by TS and the UCB-type ones by UCB. We also study PHE and ϵ-greedy (EG) as two other general-purpose exploration strategies.

Tuning. For the baseline methods, as in Section 6.1, we either use the default hyperparameters in Wan et al. (2022) or tune them extensively via grid search and present their best performance. For EG, we choose the exploration rate as $\epsilon_t = \min(1, a/(2\sqrt{t}))$ with tuning parameter $a$, following Kveton et al. (2020a). For MBE, with every bootstrap sample, we estimate the reward model via maximum weighted likelihood estimation, which yields a closed-form solution that allows online updating in all three problems. The other implementation details are similar to Section 6.1.

Results. We present the results in Figure 4. The overall findings are consistent with Section 6.1. First, without any additional derivations or algorithm design, MBE matches the performance of problem-specific algorithms. Second, pseudo-rewards are important to guarantee sufficient exploration, and naively applying the multiplier bootstrap may fail.
Third, MBE has relatively stable performance with varying σω, since its exploration is mostly data-driven. In contrast, the hyper-parameters of PHE and EG have to be carefully tuned, since they rely on externally added perturbation or Multiplier Bootstrap-Based Exploration 0 10000 20000 30000 T Dynamic Slate Optimization 0 10000 20000 30000 T Online Learning to Rank 0 10000 20000 30000 T Online Advertisement Optimization EG MBE ( 2=0.5) MBE ( 2=1) MBE ( 2=1.5) Naive MB PHE TS UCB Figure 4: Real-data results for three structured bandit problems that need domain-specific models. forced exploration. For example, the best parameters for EG in the three problems are a = 5, 0.1 and 0.5. Finally, PHE does not perform well in MNL and cascading bandits, where the outcomes are binary. We investigated this trend and found that the response rates (i.e., the probabilities for the binary outcome to be 1) in the two datasets are low. In this case, PHE introduces too much noise to explore, which slows down the estimation convergence. 7. Conclusion In this paper, we propose a new bandit exploration strategy, Multiplier Bootstrap-based Exploration (MBE). The main advantage of MBE is its generality: for any reward model that can be estimated via weighted loss minimization, the idea of MBE is applicable, and requires minimal efforts on derivation or implementation of the exploration mechanism. As a datadriven method, MBE also shows nice adaptivity. We prove near-optimal regret bounds for MBE in the sub-Gaussian MAB setup, which is more general than in other bootstrapbased bandit papers. Numerical experiments demonstrate that MBE is general, efficient, and adaptive. There are a few meaningful future extensions. First, the regret analysis for MBE (and more generally, other bootstrapbased bandit methods) in more complicated setups would be valuable. The main challenge comes from analyzing the finite-sample property of multiplier bootstrap. Second, adding pseudo-rewards at every round is needed for the analysis. We hypothesize that there exists a more adaptive and efficient way of introducing extrinsic perturbation, such that we have sufficient exploration while avoiding over-exploration. Third, the practical implementation of MBE relies on an ensemble of models to approximate the bootstrap distribution and the online regression oracle to update the model estimation. Both parts lead to approximation and also correlation over time. Our numerical experiments show that such an approach works well empirically, but it would be still mean- ingful to have more theoretical understanding. Lastly, in this paper, we present MBE assuming the knowledge of a generative model of the rewards (i.e., assuming the existence of a regression oracle). The idea can be naturally generalized to the policy-based setting, where we assume the existence of a classification oracle that can compute the optimal policy within a pre-specified policy class (see, e.g., Agarwal et al. (2014)). We leave this to future study. Multiplier Bootstrap-Based Exploration Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638 1646. PMLR, 2014. Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Thompson sampling for the mnl-bandit. ar Xiv preprint ar Xiv:1706.00977, 2017. Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. Mnlbandit: A dynamic learning approach to assortment selection. 
Operations Research, 67(5):1453 1485, 2019. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002. Bietti, A., Agarwal, A., and Langford, J. A contextual bandit bake-off. J. Mach. Learn. Res., 22:133 1, 2021. Chapelle, O. and Zhang, Y. A dynamic bayesian network click model for web search ranking. In Proceedings of the 18th international conference on World wide web, pp. 1 10, 2009. Chen, W., Wang, Y., and Yuan, Y. Combinatorial multiarmed bandit: General framework and applications. In International Conference on Machine Learning, pp. 151 159. PMLR, 2013. Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. The Journal of Machine Learning Research, 17(1):1746 1778, 2016. Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml. Durand, A., Achilleos, C., Iacovides, D., Strati, K., Mitsis, G. D., and Pineau, J. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine learning for healthcare conference, pp. 67 82. PMLR, 2018. Eckles, D. and Kaptein, M. Thompson sampling with the online bootstrap. ar Xiv preprint ar Xiv:1410.4009, 2014. Efron, B. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics, pp. 569 593. Springer, 1992. Elmachtoub, A. N., Mc Nellis, R., Oh, S., and Petrik, M. A practical method for solving contextual bandit problems using decision trees. ar Xiv preprint ar Xiv:1706.04687, 2017. Filippi, S., Cappe, O., Garivier, A., and Szepesv ari, C. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010. Gopalan, A., Mannor, S., and Mansour, Y. Thompson sampling for complex online problems. In International conference on machine learning, pp. 100 108. PMLR, 2014. Hao, B., Abbasi Yadkori, Y., Wen, Z., and Cheng, G. Bootstrapping upper confidence bound. Advances in Neural Information Processing Systems, 32, 2019. Harper, F. M. and Konstan, J. A. The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis), 5(4):1 19, 2015. Kawale, J., Bui, H. H., Kveton, B., Tran-Thanh, L., and Chawla, S. Efficient thompson sampling for online matrix-factorization recommendation. Advances in neural information processing systems, 28, 2015. Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pp. 767 776. PMLR, 2015. Kveton, B., Szepesv ari, C., Ghavamzadeh, M., and Boutilier, C. Perturbed-history exploration in stochastic multiarmed bandits. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 2786 2793, 2019a. Kveton, B., Szepesvari, C., Vaswani, S., Wen, Z., Lattimore, T., and Ghavamzadeh, M. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 3601 3610. PMLR, 2019b. Kveton, B., Szepesv ari, C., Ghavamzadeh, M., and Boutilier, C. Perturbed-history exploration in stochastic linear bandits. In Uncertainty in Artificial Intelligence, pp. 530 540. PMLR, 2020a. Kveton, B., Zaheer, M., Szepesvari, C., Li, L., Ghavamzadeh, M., and Boutilier, C. Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 2066 2076. PMLR, 2020b. Lai, C. D. and Balakrishnan, N. 
Continuous bivariate distributions. Springer, 2009. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071 2080. PMLR, 2017. Multiplier Bootstrap-Based Exploration Lu, X. and Van Roy, B. Ensemble sampling. Advances in neural information processing systems, 30, 2017. Osband, I. and Van Roy, B. Bootstrapped thompson sampling and deep exploration. ar Xiv preprint ar Xiv:1507.00300, 2015. Osband, I., Van Roy, B., Russo, D. J., Wen, Z., et al. Deep exploration via randomized value functions. J. Mach. Learn. Res., 20(124):1 62, 2019. Phan, M., Abbasi Yadkori, Y., and Domke, J. Thompson sampling and approximate inference. Advances in Neural Information Processing Systems, 32, 2019. Qin, C., Wen, Z., Lu, X., and Roy, B. V. An analysis of ensemble sampling. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview. net/forum?id=c6ibx0yl-a G. Riou, C. and Honda, J. Bandit algorithms based on thompson sampling for bounded reward distributions. In Algorithmic Learning Theory, pp. 777 826. PMLR, 2020. Riquelme, C., Tucker, G., and Snoek, J. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. ar Xiv preprint ar Xiv:1802.09127, 2018. Rubin, D. B. The bayesian bootstrap. The annals of statistics, pp. 130 134, 1981. Shen, W., Wang, J., Jiang, Y.-G., and Zha, H. Portfolio choices with orthogonal bandit learning. In Twenty-fourth international joint conference on artificial intelligence, 2015. Tang, L., Jiang, Y., Li, L., Zeng, C., and Li, T. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 323 332, 2015. Tang, Q., Xie, H., Xia, Y., Lee, J., and Zhu, Q. Robust contextual bandits via bootstrapping. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 12182 12189, 2021. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285 294, 1933. Urteaga, I. and Wiggins, C. Variational inference for the multi-armed contextual bandit. In International Conference on Artificial Intelligence and Statistics, pp. 698 706. PMLR, 2018. Van Der Vaart, A. W. and Wellner, J. A. Weak convergence. In Weak convergence and empirical processes, pp. 16 28. Springer, 1996. Vaswani, S., Kveton, B., Wen, Z., Rao, A., Schmidt, M., and Abbasi-Yadkori, Y. New insights into bootstrapping for bandits. ar Xiv preprint ar Xiv:1805.09793, 2018. Wan, R., Ge, L., and Song, R. Metadata-based multi-task bandits with bayesian hierarchical models. Advances in Neural Information Processing Systems, 34:29655 29668, 2021. Wan, R., Ge, L., and Song, R. Towards scalable and robust structured bandits: A meta-learning framework. ar Xiv preprint ar Xiv:2202.13227, 2022. Wang, C.-H., Yu, Y., Hao, B., and Cheng, G. Residual bootstrap exploration for bandit algorithms. ar Xiv preprint ar Xiv:2002.08436, 2020. Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5114 5122. PMLR, 2018. Watkins, C. J. C. H. Learning from delayed rewards. 1989. Wen, Z., Kveton, B., and Ashkan, A. 
Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, pp. 1113 1122. PMLR, 2015. Wu, S., Wang, C.-H., Li, Y., and Cheng, G. Residual bootstrap exploration for stochastic linear bandit. ar Xiv preprint ar Xiv:2202.11474, 2022. Yu, T., Kveton, B., Wen, Z., Zhang, R., and Mengshoel, O. J. Graphical models meet bandits: A variational thompson sampling approach. In International Conference on Machine Learning, pp. 10902 10912. PMLR, 2020. Zhang, H. and Chen, S. Concentration inequalities for statistical inference. Communications in Mathematical Research, 37(1):1 85, 2021. Zhang, H. and Wei, H. Sharper sub-weibull concentrations. Mathematics, 10(13):2252, 2022. Zhong, Z., Chueng, W. C., and Tan, V. Y. Thompson sampling algorithms for cascading bandits. Journal of Machine Learning Research, 22(218):1 66, 2021. Zhou, Q., Zhang, X., Xu, J., and Liang, B. Large-scale bandit approaches for recommender systems. In International Conference on Neural Information Processing, pp. 811 821. Springer, 2017. Zong, S., Ni, H., Sung, K., Ke, N. R., Wen, Z., and Kveton, B. Cascading bandits for large-scale recommendation problems. ar Xiv preprint ar Xiv:1603.05359, 2016. Multiplier Bootstrap-Based Exploration A. Additional Method Details A.1. MBE for MAB In this section, we present the concrete form of MBE when being applied to MAB. Recall that xt is null, At [K], and rk is the mean reward of the k-th arm. We define f(xt, At; r) = r At, where the parameter vector r = (r1, . . . , r K) . We define the loss function as t=1 ωt(r At Rt)2. The solution is then (br1, . . . , br K) with brk = (P t:At=k ωt) 1 P t:At=k ωt Rt, i.e., the arm-wise weighted average. After adding the pseudo rewards, we can give algorithm for MAB in Algorithm 3. Next, we provide intuitive explanation on why Algorithm 3 works. Indeed, denote s := |Hk,T |, where Hk,T is the set of observed rewards for the k-th arm up to round T. Let Rk,l be the l-th element in Hk,T . Then Y k,s = Ps i=1 ωi Rk,i + λ Ps i=1 ω i Ps i=1 ωi + λ Ps i=1 ω i + λ Ps i=1 ω i = s 1 Ps i=1 ωi(Rk,i µk) + s 1 Ps i=1(ωi 1) + λs 1 Ps i=1(ω i 1) + µk + λ s 1 Ps i=1(ωi 1) + λs 1 Ps i=1(ω i 1) + λs 1 Ps i=1(ω i 1) + 1 + 2λ by using the law of large numbers. Then, by Slutsky s theorem, Y k,s µk + λ i=1 ωi(Rk,i µk) + 1 s i=1 (ωi 1) + λ s i=1 (ω i 1) will weakly converge to a mean-zero Gaussian distribution N 0, σ2 k+2 (1+2λ)2 σ2 ω . Therefore, our algorithm preserves the order of the arms for any λ > 0. Algorithm 3: MBE for MAB with sub-Gaussian rewards with mean bounded in [0, 1] Data: number of arms K, multiplier weight distribution ρ(ω), tuning parameter λ 2 Set Hk = {} be the history of the arm k and Y k = + , k [K] 3 for t = 1, . . . , T do 4 Pull At = arg maxk [K] Y k (break tie randomly), 5 Observe reward Rt 6 Set Hk = Hk {Rt} 7 for k = 1, . . . , K do 8 if |Hk| > 0 then 9 Sample the multiplier weights {ωl, ω l, ω l }|Hk| l=1 ρ(ω). 10 Update the mean reward ℓ=1 (ωℓ Rk,ℓ+ ω ℓ 1 λ + ω ℓ 0 λ) ℓ=1 (ωℓ+ λω ℓ+ λω ℓ) where Rk,l is the l-th element in Hk. A.2. MBE for stochastic linear bandits In this section, we derive the form of MBE when applied to stochastic linear bandits. We focus on the setup where xt is empty and At Rp is a linear feature vector, and other setups of linear bandits can be formulated similarly. In this case, Multiplier Bootstrap-Based Exploration Algorithm 4: MBE for linear bandits. 
Data: number of arms K, multiplier weight distribution ρ(ω), tuning parameter λ 2 Set Hk = {} be the history of the arm k, set A0 = 0, bθ0 = 0 with b0 = 0, and V0 = (1 + ξ)Ip. 3 if t = 1, . . . , p then 4 Offer At = t. 6 for t = p + 1, . . . , T do 7 Offer At = arg maxa At a θt (break tie randomly) 8 Observe reward Rt 9 Set Hk = Hk {Rt} 10 for k = 1, . . . , K do 11 if |Hk| > 0 then 12 Sample the multiplier weights {ωl, ω l, ω l }|Hk| l=1 ρ(ω). 13 Update the following quantities: Vt+1 = Vt + ωt At A t + λω t0Id + λω t Id; bt+1 = bt + At ωt Rt + λω t0 + λω t 1 ; Refresh the parameter as bθt+1 = V 1 t+1bt+1. f(xt, At; θ) = A t θ where the parameter vector is θ Rp. Then, the weighted loss function is t=1 ωt A t θ Rt 2 + ξ where ξ 0 is a penalty tuning parameter. The solution is the standard weighted ridge regression estimator and can be updated in the following way: 0. Initialization: A0 = 0, bθ0 = 0 with b0 = 0, and V0 = (ξ + 1)Idim(At). 1. bθt = V 1 t At; 2. Vt+1 = Vt + ωt At A t , bt+1 = bt + ωt Rt At, and hence update bθt+1 = V 1 t+1bt+1 = Vt + ωt At A t 1 bt + ωt Rt At = V 1 t V 1 t At(ω 1 t + A t V 1 t At) 1A t V 1 t 3. Take the action At+1 = arg maxa At a θt+1. The MBE algorithm for linear bandits is presented in Algorithm 4. A.3. Naive Adaptation of the Multiplier Bootstrap We present the naive multiplier bootstrap-based exploration algorithm in Algorithm 5. Specifically, there is no pseudorewards added. A.4. Re Boot For completeness, we introduce the details of Re Boot in this section and discuss its generalizability. More details can be found in the original papers (Wang et al., 2020; Wu et al., 2022). Multiplier Bootstrap-Based Exploration Algorithm 5: A Naive Design of MBE Data: Function class F, loss function L, (optional) penalty function J, multiplier weight distribution ρ(ω), tuning parameter λ 2 Set H = {} be the history be the pseudo-history 3 Initialize bf in an optimistic way 4 for t = 1, . . . , T do 5 Observe context xt and action set At 6 Offer At = arg maxa At bf(xt, a) (break tie randomly) 7 Observe reward Rt 8 Update H = H {(xt, At, Rt)} 9 Sample the multiplier weights {ωl}t l=1 ρ(ω) 10 Solve the weighted loss minimization problem to update bf as bf = arg min f F l=1 ωl L f(xl, Al), Rl + J(f). 0 10000 20000 T 0 10000 20000 T 0 10000 20000 T GIRO MBE ( =0.25) MBE ( =0.5) MBE ( =0.75) Naive MB PHE Reboot TS 0 10000 20000 T 0 10000 20000 T 0 10000 20000 T GIRO MBE ( 2=0.5) MBE ( 2=1.0) MBE ( 2=1.5) Naive MB PHE Reboot TS Figure 5: Performance of MBE in three linear bandit problems. Consider a stochastic bandit problem with a fixed and finite set of arm A. Every arm a A may have a fixed feature vector (which with slight overload of notation, we also denote as a). The mean reward of arm a is f(a). At each round t , Re Boot first fit the model f as bf using all data. Then for each arm a, Re Boot first computes the corresponding residuals using rewards related to that arm as {ϵt = Rt bf(a)}t:At=a,t t , then perturbs these residuals with random weights as {ωtϵt = Rt bf(a)}t:At=a,t t (Re Boot also adds pseudo-residuals, which we omit for ease of notations), and finally use bf(a) + |{t : At = a, t t }| 1 P t:At=a,t t ωtϵt as the perturbed estimation of the mean reward of arm a. By design, it can be seen that Re Boot critically relies on the reward history of each fixed arm. Therefore, to the best of our understanding, it is not easy to extend Re Boot to problems with either changing (e.g., contextual problems) or infinite arms. 
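Complementing the weighted least-squares formulation in Appendix A.2, here is a minimal sketch (our own illustration, not the authors' code; names are hypothetical) of a sequentially updated weighted ridge estimator using the rank-one inverse update shown in A.2, so no matrix inversion is needed per round. The pseudo-reward terms of Algorithm 4 are omitted for brevity, and Exp(1) multiplier weights are assumed to keep the update well-conditioned.

```python
import numpy as np

class WeightedRidge:
    """Online weighted ridge regression: theta_hat = V^{-1} b, with
    V = xi*I + sum_t w_t A_t A_t^T and b = sum_t w_t R_t A_t."""

    def __init__(self, p, xi=1.0):
        self.V_inv = np.eye(p) / xi
        self.b = np.zeros(p)

    def update(self, a, reward, w):
        # Sherman-Morrison rank-one update of V^{-1} for the weighted term w * a a^T.
        Va = self.V_inv @ a
        self.V_inv -= np.outer(Va, Va) / (1.0 / w + a @ Va)
        self.b += w * reward * a

    @property
    def theta(self):
        return self.V_inv @ self.b

# Usage: one bootstrap replicate of MBE for a linear reward model.
rng = np.random.default_rng(0)
p, theta_star = 5, np.array([0.5, -0.2, 0.1, 0.3, 0.0])
model = WeightedRidge(p)
for t in range(2000):
    a = rng.normal(size=p)                        # feature of the pulled arm
    r = a @ theta_star + 0.1 * rng.normal()
    model.update(a, r, w=rng.exponential(1.0))    # multiplier weight ~ Exp(1)
print(np.round(model.theta, 2))                   # close to theta_star
```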
Multiplier Bootstrap-Based Exploration 0 10000 20000 T 0 10000 20000 T 0 10000 20000 T 300 Exponential GIRO MBE ( =0.25) MBE ( =0.5) MBE ( =0.75) Naive MB PHE Re Boot TS Figure 6: Performance of MBE with different values of λ in MAB. B. More Experiment Results and Details B.1. Results for linear bandits We also consider the linear bandit problem. The linear bandit version of MBE is presented in Appendix A.2. We experiment with several dimensions p = 10, 15, 20. The number of arms is K = 100. The feature vector xk Rp of arm k is generated as follows. For the last 10 arms, the features are drawn uniformly at random from (0, 1). For the first 90 arms, we consider a practical setup where they are low-rank: we first generate a loading metric A = (aij) Rp 5 from Uniform(0, 1), then sample b R5 from Uniform(0, 1), and finally constructs xk = Ab. The parameter vector θ Rp is uniformly sampled from [0, 1]p. We normalize the feature vectors such that the mean reward µk = x k θ falls within the interval [0, 1]. The rewards of arm k are drawn i.i.d. from Bernoulli(x k θ). We still compare MBE with the method for linear bandit version of GIRO, PHE, and Re Boot with tuning to their best performance over the hyper-parameter set {2k 4}6 k=0 and report the best performance of each method. For TS, we use Gaussian for both its reward and prior distribution, and calibrate their parameters using the true model. The total rounds are T = 20000 and our results are averaged over 50 randomly chosen problems. Most other details are similar to our MAB experiments. We present the results in Figure 5, where we vary either σω or λ in the two subplots. We can see that naive MB leads to a linear regret. Hence, the pseudo-reward also matters in this problem. MBE achieves comparable performance with strong baselines such as TS. Another finding is that MBE is robust to its tuning parameters. Finally, Reboot needs to pull K times to initialize (the linear regret part in the first K rounds) due to the nature of its design. In contrast, most other linear bandit algorithms typically only need p rounds of forced exploration. This shows the limitation of the Re Boot framework (See Appendix A.4). B.2. Additional results In this section, we study the performance of MBE with respect to a few other hyper-parameters. We first study the robustness to another tuning parameter λ. Recall that λ controls the amount of external perturbation. Specifically, we repeat the experiment in 6.1 with σ2 ω fixed as 0.5 and with different values of λ (0.25, 0.5, 0.75). From Figure 6, it can be seen that a small amount of pseudo-rewards (λ = 0.25) seems sufficient in these settings, and the results are fairly stable. We believe this is because the exploration of MBE is main driven by the internal randomness in the data. In Figure 7, we repeat our main experiments with K changed to 25. We can see that our main conclusions still hold. Finally, in Figure 8, we implement MBE with different number of replicates B. As expected, more replicates does help exploration due to a better approximation to the whole bootstrapping distribution. Yet, we find that B = 50 suffices to generate comparable performance with TS and the performance of MBE becomes relatively stable for larger values of B. B.3. Details of the real-data experiments Our real-data experiments closely follow Wan et al. (2022). For completeness, we provide information of the three problems here, and refer interested readers to Wan et al. (2022) and references therein. 
B.3. Details of the real-data experiments

Our real-data experiments closely follow Wan et al. (2022). For completeness, we provide information on the three problems here, and refer interested readers to Wan et al. (2022) and the references therein.

In an online learning to rank problem, we aim to select and rank K items from a pool of L ones. We iteratively interact with users to learn about their preferences. The cascading model is popular in learning to rank (Kveton et al., 2015); it models the user behaviour as glancing from top to bottom (like a cascade) and clicking an item, following a Bernoulli distribution, when she looks at that item. Therefore, we observe binary outcomes for all items that the user has examined, and there are complex dependencies between them.

In a slate optimization (also called assortment optimization) problem, we aim to offer K items from a pool of L ones, especially when there exist substitution effects. The Multinomial Logit (MNL) model characterizes the choice behaviour as a multinomial distribution based on the attractiveness of each item. Since the offered subset changes over rounds, the joint likelihood is actually complex. To obtain the posterior or confidence bounds, one has to resort to an epoch-type offering schedule (Agrawal et al., 2017).

Online combinatorial optimization also has numerous applications (Wen et al., 2015), including maximum weighted matching, ads allocation, webpage optimization, etc. It is common that every chosen item generates a separate observation, which is known as the semi-bandit problem. We consider a special problem in our experiments, where we need to choose K persons from a pool under constraints.

The three datasets we use (and the related problem setups) are studied in the corresponding TS papers in the literature. To generate random rewards, we either generate them from a real-data-calibrated model or sample them directly from the dataset. We follow Wan et al. (2022) and the references therein. For cascading or MNL bandits, we split the dataset into a training set and a testing set, use the training set to estimate the reward model, and compare the methods on the testing set. For semi-bandits, we sample rewards from the dataset.

C. Main Proof

This section gives the proof of the main regret bound (Theorem 5.1). Section D gives the major lemmas required to bound the regret components used in this section. Section E lists all supporting technical lemmas, including the lower bound of the Gaussian tail and some novel results on the concentration properties of sub-Gaussian and sub-exponential distributions.

Before beginning our proof, we first provide the definitions of sub-Gaussian and sub-exponential variables. A mean-zero random variable X is called sub-Gaussian with variance proxy σ² if E exp{tX} ≤ exp{t²σ²/2} for any t ∈ R, and we denote it as X ∼ sub G(σ²). A mean-zero variable X is called sub-exponential with parameters λ and α if E exp{tX} ≤ exp{t²λ²/2} for all |t| < 1/α, and we denote it as X ∼ sub E(λ, α). For simplicity, we denote sub E(λ) := sub E(λ, λ). Sub-Gaussian and sub-exponential variables play important roles in bandit problems and exhibit various concentration properties. For more details, please refer to Zhang & Chen (2021) and Zhang & Wei (2022).

Notations: Let P_ξ(A) = ∫_A dF_ξ(x) denote the probability of the event A, where F_ξ(x) is the distribution function of the random variable ξ.
Similarly, let Eξf(ξ) = R f(x) d Fξ(x) represent the expectation. We write two functions a(s, T) b(s, T) if a(s, T) c b(s, T) for some constant c independent of s and T. We write a(s, T) b(s, T) if both a(s, T) b(s, T) and a(s, T) b(s, T). Furthermore, we define a b = max{a, b} and a b = min{a, b} for any real numbers a and b. Similarly, we define a b c = max{a, b, c} and a b c = min{a, b, c} for any a, b, c R. We will present a comprehensive version of the MBE theory under MAB, as stated in Theorem C.1, along with its proof. In this version, we allow for arbitrary variance proxies σ2 k instead of constraining them to be equal to one in Theorem 5.1. Theorem C.1. Consider a K-armed bandit, where the reward distribution of arm k is sub G(σ2 k) with mean µk. Suppose µ1 = maxk [K] µk and k = µ1 µk. Take the multiplier weight distribution as N(1, σ2 ω) in Algorithm 3. Let the tuning parameters satisfy λ 1 + σ2 ω 4 + 4σ1 σω + 1 , Then the problem-dependent regret is upper bounded by 7 + C1(σ1, σk, λ, σω) + C2(σ1, σk, λ, σω) C1(σ1, σk, λ, σω) = 55 8 2 max k [K] σ2 k log D2(σ1, σk, λ, σω) 3 log 2 + 1 + 38σ2 ω C2(σ1, σk, λ, σω) = 310λ2σ2 k + 55 D1(σ1, σk, λ, σω) log D2(σ1, σk, λ, σω) 3 log 2 + 1 + 38σ2 ω D1(σ1, σk, λ, σω) = 1 + 8 2 max k [K] σ2 k 16 + σ2 ω σ2 1 + 16σ4 1 + 3σ2 ω σ2 1 + 3σ2 ω + 1 σ2 ωλ4, D2(σ1, σk, λ, σω) = 3 1 + 3 πσ2 ω 2σ1 σω + 3λ + 16 π max k [K] σ2 k σ2 ω 16σ4 1 + 1 + λ2σ2 ω 4σ4 1 Furthermore, the problem-independent regret is upper bounded by Reg T 7Kµ1 + max k [K]\{1} C1(σ1, σk, λ)K log T + 2 r max k [K]\{1} C2(σ1, σk, λ)KT log T. Proof. We denote the first s rewards from pulling arm k as Hk,s, with the i-th observations denoted as Rk,i. Let Qk,s(τ) = P Y k,s > τ | Hk,s be the tail probability that Y k,s, conditioned on history Hk,s is at least τ. Further, let Nk,s(τ) = 1/Qk,s(τ) 1 represent the expected number of rounds in which the arm k is underestimated, given s sample rewards. Here Y k,s = Ps i=1 [ωi Rk,i + ω i (1 λ) + ω i (0 λ)] Ps i=1(ωi + λω i + λω i ) is the objective function defined in Algorithm 3. Step 0: Decomposition of the regret bound. Our proof relies on the following decomposition of the cumulative regret. Multiplier Bootstrap-Based Exploration Lemma C.2 (Theorem 1, Kveton et al. (2019b)). Suppose in MAB we select arms according to the rule At = arg maxk [K] Y k,t with Y k,t defined in Algorithm 3. Then for any {τk}K k=2 R, the expected T-round regret can be bounded above as k=2 k ak + bk , where ak = PT 1 s=0 ak,s and bk = PT 1 s=0 bk,s + 1, and ak,s = E N1,s(τk) T and bk,s = P Qk,s(τk) > T 1 . Recall the summation index s is the number of times we pull the k-th arm. In the proof, we will fix τk µk+λ 1+2λ , µ1+λ 1+2λ . The definitions of ak and bk have important meanings: ak represents the expected number of rounds that optimal arm 1 has been being underestimated, whereas bk is the probability that the suboptimal arm k is being overestimated. Here we only need to consider the lower bound of the tail of the distribution of the rewards from the optimal arm. The intuition behind this is twofold: (i) we only need the rewards from the optimal arm taking a relatively large value with a probability that is not too small; (ii) we do not care about the negligible probability of receiving a large reward from suboptimal arms. Therefore, our target is then to bound ak and bk for any k 2. These are completed in Step 1 and Step 2 below, respectively. Step 1: Bounding ak. 
We first provide a roadmap for proving ak is bounded by a term of O(log T) order: For a given constant level τk, the probability of the optimal arm 1 being underestimated given s rewards is 1 Q1,s (τk). If we pick the level to satisfy τk < µ1+λ 1+2λ , the theory of large deviation gives lim s Q1,s(τk) = 1. Hence the expected number of rounds to observe a not-underestimated case N1,s(τk) = 1 Q1,s(τk) 1 has the property lims Ns(τk) = 0 as the number of pulls s grows to infinity. Thus, given the round T, there exists a constant sa,k(T) such that Ns (τk) T 1 for all s over sa,k(T). Consequently, the quantity ak in regret bound will be bounded by s=0 E [N1,s(τk) T] + 1. The fact that the constant sa,k(T) is at the order of log T will be shown in Lemma D.2, Lemma D.3, and Lemma D.4. For small number of pulls s sa,k(T), we show in Lemma D.1 that E [N1,s(τk) T] 4 + 16e9/8 for any s N. Thus, it is enough to conclude that ak can be bounded by a term of O(log T) order. To formally bound ak in the non-asymptotic sense following the intuition above, we need to decompose ak. For the decomposition, a common approach is to use indicators on good events. Denote the shifted (sample-) mean reward as R k,s = Ps i=1 Rk,i + λs s(1 + 2λ) = λ 1 + 2λ + 1 1 + 2λRk,s. where Rk,s = 1 s Ps i=1 Rk,i is the mean reward of the k-th arm. Then we can define the following good events for l-th arm as Al,s = C1 k < R l,s µl + λ 1 + 2λ < C1 k and Gl,s = n C2 k Y l,s R l,s C2 k o . The definitions of Al,s and Gl,s are intuitive: Al,s represents the sample mean does not deviate excessively from the true mean, and events on Gl,s means Y l,s is not too far away from the scaled-shifted sample mean R l,s. Here C1, C2 are constants belonging to the interval (0, 1). Therefore, by using these good events, we decompose ak,s into the following three parts: ak,s,1 = E h N1,s(τk) T I(Ac 1,s) i , (3) Multiplier Bootstrap-Based Exploration ak,s,2 = E h N1,s(τk) T I(A1,s)I(Gc 1,s) i , (4) and ak,s,3 = E h N1,s(τk) T I(A1,s)I(G1,s) i . (5) Let C1 = 1 6λ and C2 = 1 12λ with fixed λ > 1, then C1, C2 (0, 1). Consider the case when T 2. Define sa,k,j(T) := max{s : ak,s,j T 1}, k = 2, . . . , K, j = 1, 2, 3. Lemma D.2 in Appendix D demonstrates that by taking s sa,k,1(T) := 144λ2σ2 1 (1 + 2λ)2 2 k log T, we will have ak,s,1 T 1. Lemma D.3 and Lemma D.4 in Appendix D say that: if we choose s sa,k,2(T) := (" Ω1 + a1 + b1 + Ω1a1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 s sa,k,3(T) := (" Ω1 + a2 + b2 + Ω1a2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 then we have ak,s,2 T 1 and ak,s,3 T 1, respectively, where a1 = 192σ2 ωλ4 3(1 + 2λ)2 2 k , b1 = 2σ2 ω 96λ4σ2 1 + (1 + 2λ)2C2 2 2 k 3(1 + 2λ)2 2 k , a2 = 36σ2 ωλ4 3(λ 1)2 2 k , b2 = σ2 ω 72λ4σ2 1 + 25(1 + 2λ)2 2 k 6(λ 1)2 2 k , and Ωk = 8 2σ2 k, k [K]. Let Ωmax = maxk [K] Ωk. Then, for any s sa,k(T) = 144λ2σ2 1 (1 + 2λ)2 2 k log T + (" Ωmax + (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) 1 3 log 1 2 log 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 it holds that s maxj=1,2,3 sa,k,j(T) because sa,k(T) = sa,k,1(T) + max j=2,3 sa,k,j(T) max j=1,2,3 sa,k,j(T). Hence, for any s sa,k(T), we have ak,s = ak,s,1 + ak,s,2 + ak,s,3 3T 1. Multiplier Bootstrap-Based Exploration Finally, Lemma D.1 in Appendix D guarantees that if we choose λ 1 + σ2 ω 4 + 4σ1 σω + 1 , the component ak,s 4 + 16e9/8 for any s 0. Thus, by setting λ 1 + σ2 ω 4 + 4σ1 σω + 1 , we have s µk+λ 1+2λ . 
Then, according to the theory of large deviations, we have lim s Qk,s(τk) = 0. Thus, given the time horizon T, there exists a constant sb,k(T) such that Qk,s(τk) T 1 for all s beyond sb,k(T). As a result, the event Qk,s(τk) > T 1 is empty if the number of pulls s exceeds sb,k(T). Consequently, s=0 P Qk,s (τk) > T 1 . We will demonstrate that the constant sb,k(T) is of order O(log T) in Lemma D.5, Lemma D.6, and Lemma D.7. For a small number of pulls, s sb,k(T), we apply a trivial bound P Qk,s (τk) > T 1 1 that holds for any s. Therefore, it is sufficient to conclude that bk can be bounded by a term of order O(log T). It should be noted that bk,s is naturally bounded by the constant 1. Similar to Step 1, we decompose bk,s = P Qk,s(τk) > T 1 into bk = bk,s,1 + bk,s,2 + bk,s,3, with bk,s,1 = E I(Qk,s(τk) > T 1)I(Ac k,s) , (6) bk,s,2 = E I(Qk,s(τk) > T 1)I(Ak,s)I(Gc k,s) , (7) and bk,s,3 = E I(Qk,s(τk) > T 1)I(Ak,s)I(Gk,s) . (8) Again, we define sb,k,j := max{s : bk,s,j T 1} for j = 1, 2, 3. Lemma D.5 in Appendix D guarantees that when s sb,k,1(T), with sb,k,1(T) = 72λ2σ2 k (1 + 2λ)2 2 k log T, Multiplier Bootstrap-Based Exploration we have bk,s,1 T 1. Considering T 2 and letting C1 = 1 6λ and C2 = 1 12λ with fixed λ > 1 as in Step 1, Lemma D.6 and Lemma D.7 proves that: if we take s sb,k,2(T) = (" Ωk + a1 + b1 + Ωka1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 s sb,k,3(T) = (" Ωk + a2 + b2 + Ωka2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 then we have bk,s,2 T 1 and bk,s,3 T 1, respectively, where a1, a2, b1, b2, and Ωk are already defined in Step 1. Let sb,k(T) := 72λ2σ2 k (1 + 2λ)2 2 k log T + (" Ωmax + (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) 1 3 log 1 2 log 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 So for any s sb,k(T), we have s maxj=1,2,3 sb,k,j(T) since sb,k(T) = sb,k,1(T) + maxj=2,3 sb,k,j(T). Therefore, bk,s,1 + bk,s,2 + bk,s,3 3T 1 for any s sb,k(T). Note that bk,s = P Qk,s(τk) > T 1 1 for any s 0, then 1 + 1 sb,k(T) + 3T 1 T sb,k(T) ( 72λ2σ2 k (1 + 2λ)2 2 k + " Ωmax + (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) 1 3 log 1 2 log 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 for any k {2, . . . , K} and T 2. Step 3: Aggregating results. Let us define d1 = 2 k (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) Multiplier Bootstrap-Based Exploration so the bounds in Step 1 and Step 2 yield ak 4(1 + 4e9/8) ( 144λσ2 1 (1 + 2λ)2 2 k + (Ωmax + d1) 1 log 2 + 1 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 ( 72λ2σ2 k (1 + 2λ)2 2 k + (Ωmax + d1) 1 log 2 + 1 18σ2 ω(2µ1 1)2 (2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 for any k {2, . . . , K} and T 2. Therefore, by applying Lemma C.2, we obtain the following inequality k=2 k ak + bk 7 + c1(µ1, σ1, µk, σk, λ, σω) + c2(µ1, σ1, µk, σk, λ, σω) 2 k log T (9) for any T 2, where c1(µ1, σ1, µk, σk, λ) = (5 + 16e9/8) Ωmax 3 log 2 + 1 18σ2 ω{(2µ1 1)2 (2µk 1)2} (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 and c2(µ1, σ1, µk, σk, λ) = 72λ2σ2 k (1 + 2λ)2 (9 + 32e9/8) + (5 + 16e9/8) d1 3 log 2 + 1 18σ2 ω{(2µ1 1)2 (2µk 1)2} (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 for k = 2, . . . , K. When the total round T = 1, the bound (9) still holds because Reg T maxk=2,...K k 7 PK k=2 k. 
Finally, by utilizing the bound in (9) and the bounds for c1(µ1, σ1, µk, σk, λ, σω) and c2(µ1, σ1, µk, σk, λ, σω) in Lemma D.8, the following inequality holds for the problem-dependent case: 7 + c1(µ1, σ1, µk, σk, λ, σω) + c2(µ1, σ1, µk, σk, λ, σω) 2 k log T 7 + C1(σ1, σk, λ, σω) + C2(σ1, σk, λ, σω) 2 k log T . For proving the problem-independent case in Theorem C.1, let us denote t=1 I(It = k) = k=2 (ak + bk), Multiplier Bootstrap-Based Exploration and then we have k: k< k EJk,T + X k: k k EJk,T by the bounds of ak,bk T + X 7 + C1(σ1, σk, λ, σω) + C2(σ1, σk, λ, σω) 2 k log T k: k C1(σ1, σk, λ, σω) k log T + X C2(σ1, σk, λ, σω) by k µ1 T + 7Kµ1 + max k [K]\{1} C1(σ1, σk, λ, σω)K log T + max k [K]\{1} C2(σ1, σk, λ, σω)K log T for any previously specified (0, 1). Taking = p maxk [K]\{1} C2(σ1, σk, λ, σω)K log T/T, we obtain Reg T 7Kµ1 + max k [K]\{1} C1(σ1, σk, λ, σω)K log T + 2 r max k [K]\{1} C2(σ1, σk, λ, σω)KT log T. Thus, we complete the proof of Theorem C.1. Lastly, to prove Theorem 5.1, we set σk 1 for k [K] and utilize the fact C1(1, 1, λ, σω) + C2(1, 1, λ, σω) 2 k C1(1, 1, λ, σω) + C2(1, 1, λ, σω) Then applying the bounds for C1(1, 1, λ, σω) and C2(1, 1, λ, σω) from Lemma D.9, we obtain the first result in Theorem 5.1. By employing the same technique as in (12), we obtain the problem-independent regret we sought in Theorem 5.1. D. Lemmas on Bounding Regret Components D.1. Lemmas on bounding ak. Lemma D.1 (Bounding ak,s for any s > 0). Set λ 1 + σ2 ω 4 + 4σ1 then ak,s 4 + 16e9/8 for any k {2, . . . , K} and s 0. Proof. Note that if we take 1 + 2λ, (13) then we can ensure that the bound ak,s = E 1 Q1,s(τk) 1 T E 1 Q1,s(τk) by Q1,s( ) is decreasing EQ 1 1,s holds. Hence, we need to find a lower bound of the tail probability Q1,s µ1+λ 1+2λ = P Y 1,s > µ1+λ 1+2λ | H1,s . Throughout the proof, we will use the choice of (13) for τk, and we can take advantage of the bound (14). Multiplier Bootstrap-Based Exploration For further analyze (14), we will express it as the probability with respect to the weighted random summation of the Gaussian random variables {ωi, ω i, ω i }. Let xi := (2λ + 1)R1,i (µ1 + λ), yi := λ(µ1 1 λ), zi := λ(λ + µ1) for i [s], then i=1 xi = s (2λ + 1)R1,s (µ1 + λ) , Sy = i=1 yi = sλ(µ1 1 λ), Sz = i=1 zi = sλ(λ + µ1), with Sx Sy Sz = s(2λ + 1)(R1,s µ1) and (2λ + 1)R1,i (µ1 + λ) 2, Ty = i=1 y2 i = sλ2(µ1 1 λ)2, Tz = i=1 z2 i = sλ2(λ + µ1)2, with Tx + Ty + Tz = Ps i=1 (2λ + 1)R1,s (µ1 + λ) 2 + s λ2(µ1 1 λ)2 + λ2(λ + µ1)2 . Denote Z1 := Ps i=1 xiωi Ps i=1 yiω i Ps i=1 ziω i (Sx Sy Sz) σω p Tx + Ty + Tz Z2 := Ps i=1 ωi + λ Ps i=1 ω i + λ Ps i=1 ω i (1 + 2λ)s (1 + 2λ2)s . Y 1,s > µ1 + λ 1 + 2λ | H1,s i=1 ziω i > 0 i=1 ω i + λ i=1 ω i > 0 Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 . Note that (Z1, Z2) R2 is the mean-zero bivariate Gaussian distribution with covariance 1 ρ ρ 1 ρ = Eω,ω ,ω [Z1Z2] 0 = Eω,ω ,ω n [Ps i=1 xiωi Ps i=1 yiω i Ps i=1 ziω i (Sx Sy Sz)] [Ps i=1 ωi+λ Ps i=1 ω i+λ Ps i=1 ω i (1 + 2λ)s] o s(1 + 2λ2)(Tx+Ty+Tz) = σ2 ω(Sx λSy λSz) (Sx Sy Sz)(1 + 2λ)s s(1 + 2λ2)(Tx + Ty + Tz) . The sign of ρ depends on the sign of ˇρ, where ˇρ is defined as s σ2 ω(Sx λSy λSz) (Sx Sy Sz)(1 + 2λ)s = σ2 ω h (2λ + 1)R1,s µ1(2λ2 + 1) + λ2 λ i 2(1 + 2λ)2s(R1,s µ1) = (1 + 2λ) (R1,s µ1) 2(1 + 2λ)s σ2 ω + σ2 ωλ(λ 1)(1 2µ1). 
Multiplier Bootstrap-Based Exploration Thus, for the event {ρ < 0}, if we take λ > σ2 ω 4 1 2, we will have {ρ < 0} = {ˇρ < 0} (R1,s µ1) < σ2 ωλ(λ 1)(2µ1 1) (1 + 2λ) 2(1 + 2λ)s σ2ω : s 1 R1,s µ1 > σ2 ωλ(λ 1)(2µ1 1) (1 + 2λ) 2(1 + 2λ)s σ2ω : s 1 This furthermore implies {ρ < 0} {R1,s µ1 < 0} (R1,s µ1) > σ2 ωλ(λ 1)(2µ1 1) (1 + 2λ) 2(1 + 2λ)s σ2ω : s 1 {R1,s µ1 < 0} , if 2µ1 1 0 σ2 ωλ(λ 1)(2µ1 1) (1+2λ) 2(1+2λ)s σ2ω < R1,s µ1 < 0 : s 1 , if 2µ1 1 > 0. ( , if 2µ1 1 0 As =: n σ2 ωλ(λ 1)(2µ1 1) 2(1+2λ)2s < R1,s µ1 < 0 : s 1 o , if 2µ1 1 > 0. Now, we can decompose the second probability in (15) as Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 I (ρ < 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 < (1 + 2λ) s R1,s µ1 < 0 I ( ρ > 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 < 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) , where Z3 = Z2 and ϱ = ρ is the correlation coefficient between Z1 and Z3. We will utilize the lower bound of the tail probability for bivariate Gaussian distribution with a positive correlation coefficient. Multiplier Bootstrap-Based Exploration Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 0 I (ρ 0) ( (Sx Sy Sz) σω p Tx + Ty + Tz , (1 + 2λ) s ( (Sx Sy Sz) σω p Tx + Ty + Tz , (1 + 2λ) s R1,s µ1 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz , Z2 > (Sx Sy Sz) Tx + Ty + Tz R1,s µ1 0 I (ρ 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz 1 + ρ (Sx Sy Sz) Tx + Ty + Tz R1,s µ1 0 I (ρ 0) Z > (Sx Sy Sz) Tx + Ty + Tz Z > (Sx Sy Sz) Tx + Ty + Tz R1,s µ1 0 #2 where Z is the standard Gaussian distribution. Throughout the appendix, we will always use Z to represent the standard Gaussian distribution. The first equality is due to (Sx Sy Sz) = s(1+2λ) (R1,s µ1) 0 (1+2λ) s when R1,s µ1 0, and the second inequality is by (11.44) of (Lai & Balakrishnan, 2009) for bivariate Gaussian distributions. By applying the first inequality in Lemma E.1 and using the fact that Tx + Ty + Tz Tz sλ4, we can get that Z > (Sx Sy Sz) Tx + Ty + Tz R1,s µ1 0 1 4 exp s(2λ + 1)2(R1,s µ1)2 R1,s µ1 0 , Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 0 I (ρ 0) 16 exp 2s(2λ + 1)2(R1,s µ1)2 R1,s µ1 0 I (ρ 0) . Second, we also have Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) =PZ1,Z3 Z1 > h(s), Z3 < k(s) I R1,s µ1 < 0 I (ϱ > 0) =PZ1,Z3 Z1 > h(s), Z3 < k(s) I R1,s µ1 < 0 I (ϱ > 0) I(2µ1 1 > 0) h(s) := (Sx Sy Sz) Tx + Ty + Tz , k(s) := (1 + 2λ) s are positive when R1,s µ1 < 0. Note that PZ1,Z3 Z1 > h(s), Z3 < k(s) = PZ1 Z1 > h(s) PZ1,Z3 Z1 > h(s), Z3 > k(s) Multiplier Bootstrap-Based Exploration and PZ1,Z3 Z1 > h(s), Z3 > k(s) I ϱ > 0 ϱk(s) h(s) p + ϱ exp k2(s) h2(s) ϱh(s) k(s) p by (11.42) in Lai & Balakrishnan (2009). Next, by combining the above two inequalities, we have Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) (h Φ h(s) Φ k(s) i ϱk(s) h(s) p + ϱ exp k2(s) h2(s) ϱh(s) k(s) p R1,s µ1 < 0 I ϱ > 0 (h Φ h(s) Φ k(s) i ϱk(s) h(s) p + ϱ exp k2(s) h2(s) ϱh(s) k(s) p R1,s µ1 < 0 {ϱ > 0} . R1,s µ1 < 0 {ϱ > 0} As always hold, we have 0 < s(1 + 2λ)(R1,s µ1) < σ2 ωλ(λ 1)(2µ1 1) 2(1 + 2λ) . 
(17) Then using the inequality Tx + Ty + Tz sλ4, we have h(s) = (Sx Sy Sz) Tx + Ty + Tz by (17) < σωλ(λ 1)(2µ1 1) sλ2(1 + 2λ) by µ1 1 σω(λ 1) λ(1 + 2λ) by (19) (1 + 2λ) by λ > 1 (1 + 2λ) σω 1 + 2λ2 (1 + 2λ) s 1 + 2λ2 = k(s), where fourth inequality follows from the fact that 2σ2 ωλ < (1 + 2λ)2 (19) provided that ( R if σω < 2, σ2 ω 4 4 , if σω 2 . Note that when σω > 2, the expression 2 σω+σω σ2ω 4 4 is negative. Thus, the inequality (18) will hold for any λ > 1. This implies that h Φ h(s) Φ k(s) i I(As) > 0. Multiplier Bootstrap-Based Exploration whenever λ > σ2 ω 4 1 2. On the other hand, note that ˇρ σ2 ωλ(λ 1)(2µ1 1), then on the event As, we know that ϱ = sˇρ σ2ω p s(1 + 2λ2)(Tx + Ty + Tz) sσ2 ωλ(λ 1)(2µ1 1) σ2ω p s(1 + 2λ2)(Tx + Ty + Tz) and thus ϱk(s) h(s) = (1 + 2λ) s 1 + 2λ2 ϱ h(s) by the bound of ϱ (1 + 2λ) s σω 1 + 2λ2 sσ2 ωλ(λ 1)(2µ1 1) s(2λ + 1) (R1,s µ1) by the bound in (17) (1 + 2λ) s σω 1 + 2λ2 sσ2 ωλ(λ 1)(2µ1 1) 2(1 + 2λ) σωλ(λ 1)(2µ1 1) = 2(1 + 2λ)2 1 + 2λ2 ss > 1, i.e., ϱk(s) > h(s). Combining the above analysis, we have Z1 > (Sx Sy Sz) Tx + Ty + Tz Z3 < (1 + 2λ) s R1,s µ1 < 0 I (ϱ > 0) h Φ h(s) Φ k(s) i I R1,s µ1 < 0 I (ϱ > 0) ϱk(s) h(s) p R1,s µ1 < 0 I (ϱ > 0) R1,s µ1 < 0 I (ϱ > 0) + Φ(0)I R1,s µ1 < 0 I (ϱ > 0) R1,s µ1 < 0 I (ϱ > 0) . Besides, note that when R1,s µ1 > 0, we have (Sx Sy Sz) < 0, and then Z1 > (Sx Sy Sz) Tx + Ty + Tz Z2 > (1 + 2λ) s R1,s µ1 > 0 R1,s µ1 > 0 . Plugging (16), (20), and (21) into (15), we obtain that 16 exp 2s(2λ + 1)2(R1,s µ1)2 R1,s µ1 0 I(ρ 0) R1,s µ1 0 I(ρ < 0) + 1 R1,s µ1 > 0 . As a result, with the fact (14), the upper bound for ak,s now is ak,s = EQ 1 1,s 16E exp 2s(2λ + 1)2(R1,s µ1)2 R1,s µ1 0 I(ρ 0) R1,s µ1 0 I(ρ < 0) + 2EI R1,s µ1 > 0 16E exp 2s(2λ + 1)2(R1,s µ1)2 Multiplier Bootstrap-Based Exploration where the first inequality is due to that the several indicators are based on mutually exclusive events. From the above expression, to prove that ak,s is bounded by a constant independent of s, it remains to show that the expectation E exp 2s(2λ + 1)2(R1,s µ1)2 can be bounded below some constants free of s. Applying Lemma E.2, we know that if we take 2s(2λ + 1)2 σ2ωλ4 s 8σ2 1 , then (22) can be upper bounded by e9/8. A sufficient condition for the above inequality is σ2ωλ4 1 16σ2 1 , i.e. λ2 8σ1 in other words, this can be expressed as: σω + 1 . (23) Therefore, if we take the tuning parameters λ and σω as specified in (23), then ak,s will be bounded by a constant such that ak,s 4 + 16e9/8. (24) It is important to note that this lemma establishes the tuning conditions for the tuning parameters in Theorem C.1. Lemma D.2 (Bounding ak,s,1 at (3)). Take sa,k,1(T) := 4σ2 1 C2 1(1 + 2λ)2 2 k log T. Then for any k {2, . . . , K} and s sa,k,1(T), ak,s,1 T 1, Proof. We will bound ak,s,1 by bounding the probability of the event Ac k,s. Write ak,s,1 = E h N1,s(τk) T I(Ac 1,s) i E h TI(Ac 1,s) i = TP(Ac 1,s). (25) Since the summation of independent sub-Gaussian variables is still sub-Gaussian, (40) in Lemma E.2 gives R1,s µ1 sub G(σ2 1/s). By applying the concentration inequality of the sub-Gaussian variable, we have P(Ac 1,s) = P R1,s µ1 (1 + 2λ)C1 k 2 exp (1 + 2λ)2C2 1 2 ks 2σ2 1 take s sa,1(T ) 2σ2 1 C2 1(1+2λ)2 2 k log( 2T) 1 T 2 . Hence, we obtain that ak,s,1 TP(Ac 1,s) T 1, for any s sa,1(T). Multiplier Bootstrap-Based Exploration Lemma D.3 (Bounding ak,s,2 at (4)). 
Take sa,k,2(T) := (" Ω1 + a1 + b1 + Ω1a1 1 3 log 1 2 log 2πΩ1σ2 1a1 2b2 1 + 2a1 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 a1 = 4σ2 ωλ2 3C2 2(1 + 2λ)2 2 k , b1 = 2σ2 ω 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k 3C2 2(1 + 2λ)2 2 k , and Ω1 = 8 Then for any s sa,k,2(T), ak,s,2 T 1, for any k {2, . . . , K} and T 2. Proof. We will bound the probability of Gc 1,s conditioning on H1,s to prove this lemma. Note that (1 + 2λ)R1,i (R1,s + λ) ωi + i=1 λ(1 + λ R1,s)ω i i=1 λ(R1,s + λ)ω i =s (1 + 2λ)R1,s (R1,s + λ) + λ(1 + λ R1,s) λ(R1,s + λ) = 0, we can bound the tail probability P(Gc 1,s | H1,s) by P(Gc 1,s | H1,s) =P |Y 1,s R 1,s| > C2 k | H1,s Ps i=1 (1 + 2λ)R1,i (R1,s + λ) ωi + Ps i=1 λ(1 + λ R1,s)ω i Ps i=1 λ(R1,s + λ)ω i (1 + 2λ)(Ps i=1 ωi + λ Ps i=1 ω i + λ Ps i=1 ω i ) by Lemma E.3 2Pω,ω ,ω (1 + 2λ)R1,i (R1,s + λ) ωi + i=1 λ(1 + λ R1,s)ω i i=1 λ(R1,s + λ)ω i C2(1 + 2λ) k i=1 ω i + λ i=1 ω i + λ i=1 ω i < 0 I = Pω,ω ,ω (1 + 2λ)R1,i (R1,s + λ) ωi + i=1 λ(1 + λ R1,s)ω i i=1 λ(R1,s + λ)ω i C2(1 + 2λ) k i=1 ω i + λ II = Pω,ω ,ω i=1 ω i + λ i=1 ω i < 0 Multiplier Bootstrap-Based Exploration To bound P(Gc 1,s | H1,s), it is sufficient to bound I and II separately. We first examine II. Write II = Pω,ω ,ω i=1 ω i + λ i=1 ω i < 0 Ps i=1(ωi 1) + λ Ps i=1(ω i 1) + λ Ps i=1(ω i 1) p sσ2ω(1 + 2λ2) < (1 + 2λ)s p sσ2ω(1 + 2λ2) = Pω,ω ,ω Z < (1 + 2λ) σω by Lemma E.1 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) For studying I, we define the following functions of {R1,i}s i=1: f {R1,i}s i=1 = f1 {R1,i}s i=1 + f2 {R1,i}s i=1 f3 {R1,i}s i=1 , f1 {R1,i}s i=1 = (R1,i R1,s) + λ(2R1,i 1) (1 + 2λ)C2 k ωi, f2 {R1,i}s i=1 = λ(λ + 1 R1,s) (1 + 2λ)C2 k ω i, f3 {R1,i}s i=1 = λ(R1,s + λ) + (1 + 2λ)C2 k ω i . Then we can write I = Pω,ω ,ω (f1 + f2 f3 > 0). Given that f1, f2, and f3 are mutually independent conditioning on H1,s, the expectation is E f1 + f2 f3 | H1,s = 3C2(1 + 2λ) ks, and the variance is var f1 + f2 f3 | H1,s (R1,i R1,s) + λ(2R1,i 1) (1 + 2λ)C2 k 2 λ(λ + 1 R1,s) (1 + 2λ)C2 k 2 + λ(R1,s + λ) + (1 + 2λ)C2 k 2 # =σ2 ω h V1 + V2 + V3 i , n (R1,i R1,s) + λ(2R1,s 1) 2 + λ2(λ + 1 R1,s)2 + λ2(R1,s + λ)2o , n 2(1+2λ)C2 k (R1,i R1,s)+λ(2R1,s 1) 2(1+2λ)C2 kλ(λ+1 R1,s)+2(1+2λ)C2 k(R1,s+λ) o , (1 + 2λ)2C2 2 2 k + (1 + 2λ)C2 2 2 k + (1 + 2λ)C2 2 2 k . Multiplier Bootstrap-Based Exploration For bounding the conditional variance above, we will calculate its components as follows. (R1,i R1,s) + λ(2R1,i 1) 2 + sλ2(λ + 1 R1,s)2 + sλ2(R1,s + λ)2 i=1 (R1,i R1,s)2 + 2sλ2 3R 2 1,s 3R1,s + λ2 + λ + 1 i=1 (R1,i µ1)2 + (6λ2 1)s(R1,s µ1)2 + 6sλ2(2µ1 3)(R1,s µ1) + 2sλ2 λ2 + λ + 1 + 3µ1(µ1 1) Cauchy inequality 6λ2 s X i=1 (R1,i µ1)2 + 6sλ2(2µ1 1)(R1,s µ1) + 2sλ2 λ2 + λ + 1 µ1(1 µ1) i=1 (R1,i µ1)2 + 6sλ2(2µ1 1)(R1,s µ1) + 2sλ2 λ2 + λ + 1 µ1(1 µ1) , V2 = 2(1 + 2λ)C2 k (R1,i R1,s) + λ(2R1,i 1) + sλ(λ + 1 R1,s) sλ(R1,s + λ) = 2s(1 + 2λ)C2 k h R1,s R1,s) + λ(2R1,i 1) + λ(λ + 1 R1,s) λ(R1,s + λ) i = 0, and V3 = 3s(1 + 2λ)2C2 2 2 k. Therefore, the conditional variance in (27) is bounded by σ 2 ω var f1 + f2 f3 | H1,s = V1 + V2 + V3 i=1 (R1,i µ1)2 + 6sλ2(2µ1 1) | {z } the random part + 2sλ2 λ2 + λ + 1 3µ1(1 µ1) + 3s(1 + 2λ)2C2 2 2 k | {z } the determined part = R1 + R2 + D, R1 = 6λ2 s X i=1 (R1,i µ1)2, R2 = 6sλ2(2µ1 1) R1,s µ1 + 2sλ2 λ2 + λ + 1 3µ1(1 µ1) , and D = 3s(1 + 2λ)2C2 2 2 k. It is clear that both R1 and R2 are strictly positive. Additionally, it can be shown with high probability that R2 is non-positive. 
Indeed, note that 0 3µ1(1 µ1) 3 P(R2 > 0) = P 3(2µ1 1) R1,s µ1 > λ2 + λ + 1 3µ1(1 µ1) R1,s µ1 > λ2 + λ + 1 3µ1(1 µ1) I µ1 = 1 + P( ) I µ1 = 1 0 3µ1(1 µ1) 3 4 P R1,s µ1 > λ2 + λ + 1/4 + 0 I µ1 = 1 by sub-Gaussian inequality exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 Multiplier Bootstrap-Based Exploration Then by applying Lemma E.1 again, P(Gc 1,s | H1,s) = 2I + II f1 + f2 f3 E f1 + f2 f3 | H1,s var f1 + f2 f3 | H1,s > E f1 + f2 f3 | H1,s var f1 + f2 f3 | H1,s 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) E2 f1 + f2 f3 | H1,s 2 var f1 + f2 f3 | H1,s 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) Therefore, combining with (29), ak,s,2 will be bounded by ak,s,2 E h TI(Gc 1,s) i = TE h P(Gc 1,s | H1,s) i by (26) T 2I + II E2 f1 + f2 f3 | H1,s 2 var f1 + f2 f3 | H1,s 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) E2 f1 + f2 f3 | H1,s 2 var f1 + f2 f3 | H1,s ) I(R2 0) + I(R2 > 0) + T 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) by the decomposition (29) TE exp E2 f1 + f2 f3 | H1,s 2σ2ω R1 + R2 + D I(R2 0) + TP(R2 > 0) + T 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) E2 f1 + f2 f3 | H1,s 2σ2ω R1 + D + TP(R2 > 0) + T 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) Recall R1,i µ1 sub G(σ2 1), and then ξi := (R1,i µ1)2 var(R1,i) sub E(8 2σ2 1). Therefore, for furthermore bounding R1 + D, we have R1 + D = 6λ2 " s X i=1 (R1,i µ1)2 s var(R1,i) + s var(R1,i) + 3s(1 + 2λ)2C2 2 2 k i=1 ξi + s 6λ2 var(R1,i) + 3(1 + 2λ)2C2 2 2 k by var(R1,i) σ2 1 6λ2 s X i=1 ξi + 3s 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k = 3s 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k , Multiplier Bootstrap-Based Exploration where ξ = 1 s Ps i=1 ξi. Then ak,s,2 TE exp E2 f1 + f2 f3 | H1,s 2σ2ω R1 + D + TP(R2 > 0) + T 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) by (31) and (30) TE exp 9C2 2(1 + 2λ)2 2 ks2 6σ2ωs 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) 3C2 2(1 + 2λ)2 2 ks 2σ2ω 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) The next step is to apply Lemma E.4. Let a1 = 4σ2 ωλ2 3C2 2(1 + 2λ)2 2 k , b1 = 2σ2 ω 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k 3C2 2(1 + 2λ)2 2 k , λi 8 2σ2 1, i [s], and Ω1 = (s 1 Ps i=1 λ2 i )1/2 = 8 2σ2 1, then Lemma E.4 gives 3C2 2(1 + 2λ)2 2 ks 2σ2ω 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k 3C2 2(1+2λ)2 2 k ξ + 2σ2 ω[2λ2σ2 1+(1+2λ)2C2 2 2 k] 3C2 2(1+2λ)2 2 k = E exp s a1ξ + b1 2b2 1 + 2a1 exp s 2Ω1a1 (b1 + Ω1a1) by 2Ω1a1 (b1+Ω1a1) b1+Ω1a1+Ω1+a1 2b2 1 + 2a1 s exp s Ω1 + a1 + b1 + Ω1a1 Hence, by taking s Ω1 + a1 + b1 + Ω1a1 log 1 2 log 2b2 1 + 2a1 Ω1 + a1 + b1 + Ω1a1 log 2b2 1 + 2a1 3C2 2(1 + 2λ)2 2 ks 2σ2ω 2λ2ξ + 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k s 3T 3 1 3T 2 . Similarly, the inequality T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 Multiplier Bootstrap-Based Exploration 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) 3T have the solution s 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 [log 3 + 2 log T] . s 2σ2 ω(1 + 2λ2) (1 + 2λ)2 [log(3/2) + 2 log T] , respectively. Therefore, we have ak,s,2 T 1 for any s sa,k,2(T) by taking sa,k,2(T) = " Ω1 + a1 + b1 + Ω1a1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 for any T 2. Lemma D.4 (Bounding ak,s,3 at (5)). Take sa,k,3(T) = (" Ω1 + a2 + b2 + Ω1a2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 a2 = σ2 ωλ2 3(λ 1)2C2 1 2 k , b2 = σ2 ω 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 6(λ 1)2C2 1 2 k , and Ω1 = 8 Then for any s sa,k,3(T) and λ > 1, ak,s,3 T 1. for any k {2, . . . , K} and T 2. Proof. 
Unlike the proofs for bounding ak,s,1 and ak,s,2, which involve controlling the probability of bad events, the definition of ak,s,3 is based on good events instead. Therefore, we require a different technique to handle ak,s,3. Observe that {N1,s(τk) < T 1} {N1,s(τk) < (T 2 1) 1} by (T 2 1) 1 T 1 = Q1,s (τk) 1 < 1 + T 2 1 1 = Q1,s(τk) 1 < (1 T 2) 1 = [1 Q1,s(τk)] < T 2 , and ak,s,3 E N1,s(τk)I(A1,s G1,s) . Thus, as suggested by Wang et al. (2020), we can bound ak,s,3 T 1 for s sa,k,3 by finding s sa,k,3 such that [1 Q1,s(τk)] < T 2 holds on the event {A1,s G1,s}. We can express the term 1 Q1,s(τk) I(A1,s)I(G1,s) = P Y 1,s R 1,s Γ1 | H1,s I(A1,s G1,s), where we define Γ1 := Γ1(τ1) = τ1 R 1,s as the difference between Y 1,s and R 1,s. On the event {A1,s G1,s}, Γ1 can be bounded within an interval. To see this, R 1,s µ1 + λ 1 + 2λ > C1 k R k,s µk + λ 1 + 2λ < C1 k = Γ1 < C1 k µ1 + λ 1 + 2λ + τ1 Γ1 > C2 k µ1 + λ 1 + 2λ + τ1 = τ1 C1 k µ1 + λ 1 + 2λ < Γ1 < τ1 + C1 k µ1 + λ Multiplier Bootstrap-Based Exploration and G1,s = C2 k Y 1,s R 1,s C2 k = τ1 C2 k Y 1,s + Γ1 τ1 + C2 k = τ1 C2 k Y 1,s Γ1 τ1 + C2 k Y 1,s by C2 k+R 1,s Y 1,s C2 k+R 1,s τ1 2C2 k R 1,s Γ1 τ1 + 2C2 k R 1,s . Let us combine the previous two results, given by C1 = 2C2, (35) which yields τ1 C1 k R 1,s µ1 + λ 1 + 2λ Γ1 τ1 + C1 k R 1,s µ1 + λ 1+2λ C1 k Γ1 τ1 2C1 k µ1 + λ = 2C1 k 6λC1 k 1 + 2λ = 2C1 k 1 + 3λ 1 + 2λ by 3λ 1+2λ 3 Thus, we obtain Γ1 2C1(λ 1) k 2λ+1 , 5C1 k (0, + ). Return to the quantity we are interested in, 1 Q1,s(τk) I(A1,s)I(G1,s). We can express it as follows: 1 Q1,s(τk) I(A1,s)I(G1,s) Y 1,s R 1,s Γ1 | H1,s I(A1,s G1,s) i=1 (R1,i R1,s)ωi + λ i=1 (2R1,iωi R1,sω i R1,sω i ) i=1 (ω i ωi) + λ2 s X i=1 (ω i ω i ) Γ1(1 + 2λ) s X i=1 ω i + λ I(A1,s G1,s) Multiplier Bootstrap-Based Exploration Similar to the technique we used to bound the conditional probability P(Gc 1,s | H1,s), we just need to replace C2 k by Γ1 in (32) and obtain that 1 Q1,s(τk) I(A1,s)I(G1,s) apply steps in (26) TE 9( Γ1)2(1 + 2λ)2s2 6σ2ωs 2λ2ξ + 2λ2σ2 1 + ( Γ1)2(1 + 2λ)2 I(A1,s G1,s) + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) by the bound of Γ1 TE exp 36(λ 1)2C2 1 2 ks2 6σ2ωs 2λ2ξ + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µ1 1)2 + 0 I µ1 = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) Similarly, we define the following expressions a2 = σ2 ωλ2 3(λ 1)2C2 1 2 k , and b2 = σ2 ω 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 6(λ 1)2C2 1 2 k . Consider the value of s such that s Ω1 + a2 + b2 + Ω1a2 log 1 2 log 2b2 2 + 2a2 By applying Lemma E.4 and following the steps in (33) again, we obtain 36(λ 1)2C2 1 2 ks2 6σ2ωs 2λ2ξ + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 6(λ 1)2C2 1 2 ks σ2ω 2λ2ξ + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 3(λ 1)2C2 1 2 k ξ + σ2 ω[2λ2σ2 1+25(1+2λ)2C2 1 2 k] 6(λ 1)2C2 1 2 k = TE exp s a2ξ + b2 by applying steps in (33) 1 3T . Therefore, to determine the value of s such that [1 Q1,s(τk)] < T 2 on {A1,s G1,s}, we can define s sa,k,3(T) = " Ω1 + a2 + b2 + Ω1a2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µ1 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 By choosing this value for sa,k,3 then we get that as,k,3 T 1 whenever s sa,k,3. Multiplier Bootstrap-Based Exploration D.2. Lemmas on bounding bk. Lemma D.5 (Bounding bk,s,1 at (6)). Consider sb,k,1(T) = 2σ2 k C2 1(1 + 2λ)2 2 k log T. For any s sb,k,1(T), we have bk,s,1 T 1, k {2, . . . , K} provided that T 2. Proof. 
Similar to Lemma D.5 and noting sb,k,1(T) σ2 k C2 1(1 + 2λ)2 2 k log(2T), we apply the Hoeffding inequality, which gives bk,s,1 = E I(Qk,s(τk) > T 1)I(Ac k,s) = P |Rk,s µk| (1 + 2λ)C1 k by Hoeffding inequality T 1. Lemma D.6 (Bounding bk,s,2 at (7)). Consider sb,k,2(T) = (" Ωk + a1 + b1 + Ωka1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 a1 = 4σ2 ωλ2 3C2 2(1 + 2λ)2 2 k , b1 = 2σ2 ω 2λ2σ2 1 + (1 + 2λ)2C2 2 2 k 3C2 2(1 + 2λ)2 2 k , and Ωk = 8 For any s sb,k,2(T), we have bk,s,2 T 1, k {2, . . . , K} provided that T 2. Proof. Similar to bounding ak,s,2 in Lemma D.3, we can show that if we take sb,k,2(T) = (" Ωk + a1 + b1 + Ωka1 1 3 log 1 2 log 2b2 1 + 2a1 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 with Ωk = 8 2σ2 k, we have bk,s,2 E I(Gc k,s) 3C2 2(1 + 2λ)2 2 ks 2σ2ω h 2λ2ζ(k) + 2λ2σ2 k + (1 + 2λ)2C2 2 2 k i + exp (λ2 + λ + 1/4)2s 18σ2ω(2µk 1)2 + 0 I µk = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) Multiplier Bootstrap-Based Exploration where ζ(k) = 1 s Ps i=1 ζ(k) i , with {ζ(k) i }s i=1 being i.i.d independent sub-Exponential variables such that ζ(k) i sub E(8 2σ2 k). Then bk,s,2 T 2 T 1 for any s sb,k,2 and k {2, . . . , K}. Lemma D.7 (Bounding bk,s,3 at (8)). Consider sb,k,3(T) = (" Ωk + a2 + b2 + Ωka2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 a2 = σ2 ωλ2 3(λ 1)2C2 1 2 k , b2 = σ2 ω 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k 6(λ 1)2C2 1 2 k , and Ωk = 8 For any s sb,k,3(T) and λ > 1, bk,s,3 T 1, k {2, . . . , K} provided that T 2. Proof. The basic idea is the same as bounding ak,s,3. The only difference is that we replace 1 Qk,s(τk) with Qk,s(τk). Again, we will first bound Γk = Y k,s R k,s on the event Ak,s Gk,s. Exactly as before, we let τk = µk + λ 1 + 2λ + 6λC1 k which satisfies τk µk+λ 1+2λ . To ensure τk µ1+λ 1+2λ , one just needs to take C1 = 1 6λ, and then C2 = 1 2C1 = 1 12λ by (35). Next, we will obtain the range for Γk on the event Ak,s Gk,s as Γk τk 2C1 k µk + λ = 2C1 k λ 1 2λ + 1 by λ>1 > 0, Γk τk + 2C1 k µk + λ 1 + 3λ 1 + 2λ by 3λ 1+2λ 3 Multiplier Bootstrap-Based Exploration i.e., Γk 2C1(λ 1) k 2λ+1 , 5C1 k (0, ). Therefore, Qk,s(τk)I(Ak,s)I(Gk,s) Y k,s R k,s Γk | Hk,s I(Ak,s Gk,s) apply steps in (26) TE 9Γ2 k(1 + 2λ)2s2 6σ2ωs h 2λ2ζ(k) + 2λ2σ2 1 + Γ2 k(1 + 2λ)2 i I(Ak,s Gk,s) + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µk 1)2 + 0 I µk = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) by the bound of Γk TE exp 36(λ 1)2C2 1 2 ks2 6σ2ωs h 2λ2ζ(k) + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k i + T exp (λ2 + λ + 1/4)2s 18σ2ω(2µk 1)2 + 0 I µk = 1 2 exp s(1 + 2λ)2 2σ2ω(1 + 2λ2) where ζ(k) = 1 s Ps i=1 ζ(k) i with {ζ(k) i }s i=1 are i.i.d. independent sub-Exponential variables such that ζ(k) i sub E(8 2σ2 k). Let s Ωk + a2 + b2 + Ωka2 1 3 log 1 2 log 2b2 2 + 2a2 By applying Lemma E.4 again, we have 36(λ 1)2C2 1 2 ks2 6σ2ωs h 2λ2ζ(k) + 2λ2σ2 1 + 25(1 + 2λ)2C2 1 2 k i a2ζ(k) + b2 Therefore, we have bk,s,3 T 1 for any k {2, . . . , K} if we choose sb,k,3(T) = (" Ωk + a2 + b2 + Ωka2 1 3 log 1 2 log 2b2 2 + 2a2 18σ2 ω(2µk 1)2 (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 D.3. Lemmas on simplifying bounds. Lemma D.8. Assuming the conditions are identical to those in Theorem C.1, the constants c1(µ1, σ1, µk, σk, λ) and c2(µ1, σ1, µk, σk, λ), as defined in (10) and (11), can be upper-bounded by C1(σ1, σk, λ, σω) and C2(σ1, σk, λ, σω) specified in Theorem C.1, respectively. Proof. First, we establish bounds for the components d1 and d2 in c1(µ1, σ1, µk, σk, λ) and c2(µ1, σ1, µk, σk, λ). 
Observe Multiplier Bootstrap-Based Exploration that d1 = 2 k (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) by k 1 1 + Ωmax (1 + 2λ)2 + 36λ4 " 2 96λ4σ2 1 + (1 + 2λ)2/(36λ2) 3(1 + 2λ)2 + 72λ4σ2 1 + 25(1 + 2λ)2 Given λ 1 + σ2 ω 4 + 4σ1 σω + 1 > 4σ1 σω + 1, we can infer that (λ 1)2 16σ2 1/σ2 ω, and consequently, (1 + 2λ)2 + 36λ4 (λ 1)2 192λ4 16σ2 1/σ2ω = 48λ2 + 3σ2 ω σ2 1 λ4. Similarly, 2 96λ4σ2 1 + (1 + 2λ)2/(36λ2) 3(1 + 2λ)2 + 72λ4σ2 1 + 25(1 + 2λ)2 2 96λ4σ2 1 3 4λ2 + 2 3 36λ2 + 72λ4σ2 1 + 25(1 + 2λ)2 6 16σ2 1/σ2ω 16λ2σ2 1 + 1 48 + λ4σ2 ω + 25 9λ2 6 16σ2 1/σ2ω 16λ2σ2 1 + 1 48 + 3λ4σ2 ω + 3λ2σ2 ω/σ2 1. Thus, d1 = 2 k (a1 + a2) + (b1 + b2) + Ωmax(a1 + a2) (1 + Ωmax)σ2 ω 3 48λ2 + 3σ2 ω σ2 1 λ4 + σ2 ω 16λ2σ2 1 + 1 48 + 3λ4σ2 ω + 3σ2 ω σ2 1 λ2 16(1 + Ωmax) + 16σ4 1 + 3σ2 ω σ2 1 λ2 + (1 + Ωmax)σ2 ω σ2 1 + 3σ2 ω 16(1 + Ωmax) + 16σ4 1 + 3σ2 ω σ2 1 + (1 + Ωmax)σ2 ω σ2 1 + 3σ2 ω + 1 λ4 (1 + Ωmax) 16 + σ2 ω σ2 1 + 16σ4 1 + 3σ2 ω σ2 1 + 3σ2 ω + 1 σ2 ωλ4. D1(σ1, σk, λ, σω) := (1 + Ωmax) 16 + σ2 ω σ2 1 + 16σ4 1 + 3σ2 ω σ2 1 + 3σ2 ω + 1 σ2 ωλ4, then d1 D1(σ1, σk, λ, σω). On the other hand, regarding d2, we have the following: 2σ2ω(1 + 2λ)2 2 k/(144λ2) 3(1 + 2λ)2 2 k + 25σ2ω(1 + 2λ)2 2 k 6(λ 1)2 2 k 64λ4σ4 1 + (λ 1)2 12λ4σ2 1σ2ω 96λ4σ2 1 + (1 + 2λ)2/(36λ2) + 72λ4 72λ4σ2 1 + 25(1 + 2λ)2 To simply the bound above, we can use the fact λ > 1 to derive the following inequalities: s 2σ2ω(1 + 2λ)2 2 k/(144λ2) 3(1 + 2λ)2 2 k + 25σ2ω(1 + 2λ)2 2 k 6(λ 1)2 2 k σω 12λ + 5σω(1 + 2λ) σω + 3σ2 ω σ1 (1 + 2λ) 3σ2 ω σ1 Multiplier Bootstrap-Based Exploration 64λ4σ4 1 + (λ 1)2 12λ4σ2 1σ2ω 9λ2 64λ4σ4 1 + 16σ2 1/σ2 ω 12λ4σ2 1σ2ω 64λ4σ4 1 + 16σ2 1/σ2 ω 12λ2σ2 1σ2ω 2 1 σ2 1λ2 + 1 σ2ωλ2 2 σ2 ω 16σ4 1 + 1 96λ4σ2 1 + (1 + 2λ)2/(36λ2) + 72λ4 72λ4σ2 1 + 25(1 + 2λ)2 96λ4σ2 1 + 72λ4 16σ4 1/σ2ω = λ2σ2 ω 8σ4 1 . Thus, d2 is upper-bound by 3 1 + 3 πσ2 ω 2σ1 σ2 ω 16σ4 1 + 1 + λ2σ2 ω 4σ4 1 Once again, we define D2(σ1, σk, λ, σω) := 3 1 + 3 πσ2 ω 2σ1 σ2 ω 16σ4 1 + 1 + λ2σ2 ω 4σ4 1 then d2 D2(σ1, σk, λ, σω). Next, we can provide simple bounds for c1(µ1, σ1, µk, σk, λ, σω) and c2(µ1, σ1, µk, σk, λ, σω). Using the fact that a b a + b and a b c = [(a b) c], we obtain that Ωmax 3 log 2 + 1 18σ2 ω{(2µ1 1)2 (2µk 1)2} (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 3 log 2 + 1 72σ2 ω{(2µ1 1)2 + (2µk 1)2} (1 + 2λ)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 3 log 2 + 1 72σ2 ω 2 + 2σ2 ω(1 + 2λ2) (1 + 2λ)2 3 log 2 + 1 144σ2 ω + 2σ2 ω + 4λ2σ2 ω 4λ2 3 log 2 + 1 + 38σ2 ω Ωmax log D2(σ1, σk, λ, σω) 3 log 2 + 1 + 38σ2 ω. 3 log 2 + 1 18σ2 ω{(2µ1 1)2 (2µk 1)2} (λ2 + λ + 1/4)2 2σ2 ω(1 + 2λ2) (1 + 2λ)2 D1(σ1, σk, λ, σω) log D2(σ1, σk, λ, σω) 3 log 2 + 1 + 38σ2 ω, where the last steps in two inequalities above are by (38) and (39). Finally, note that 5 + 16e9/8 55, 72λ2σ2 k (1 + 2λ)2 (9 + 32e9/8) 72λ2σ2 k 25 (9 + 32e9/8) 310λ2σ2 k, Multiplier Bootstrap-Based Exploration which implies c1(µ1, σ1, µk, σk, λ, σω) C1(σ1, σk, λ, σω) and c2(µ1, σ1, µk, σk, λ, σω) C2(σ1, σk, λ, σω). Lemma D.9. Assuming the conditions the same as in Lemma D.8, we have C1(1, 1, λ, σω) 55 8 log (1 + 15σ 2 ω + 3σω + 10σ2 ω)λ2 3 log 2 + 1 C2(1, 1, λ, σω) 330λ2 + 55 45(3 + σ2 ω) log (1 + 15σ 2 ω + 3σω + 10σ2 ω)λ2 3 log 2 + 1 λ4 + 38σ2 ω Proof. We have D1(1, 1, λ, σω) = 1 + 8 2 16 + σ2 ω + 16 + 3σ2 ω + 3σ2 ω + 1 σ2 ωλ4 13 16 + 3σ2 ω + 6σ2 ω + 17 σ2 ωλ4 = 45(3 + σ2 ω)λ4 D2(1, 1, λ, σω) = 3 1 + 3 πσ2 ω 2 σω + 3λ + 8 π σ2 ω 16 + 1 3 1 + 3 πσ2 ω 2 σω + 3 + 8 π σ2 ω 16 + 1 λ2 1 + 15σ 2 ω + 3σω + 10σ2 ω λ2. 
Therefore, we have C1(1, 1, λ, σω) = 55 8 2 log D2(1, 1, λ, σω) 3 log 2 + 1 + 55σ2 ω log (1 + 15σ 2 ω + 3σω + 10σ2 ω)λ2 3 log 2 + 1 C2(1, 1, λ, σω) = 310λ2 + 55 D1(1, 1, λ, σω) log D2(1, , 1, λ, σω) 3 log 2 + 1 + 38σ2 ω < 330λ2 + 55 45(3 + σ2 ω) log (1 + 15σ 2 ω + 3σω + 10σ2 ω)λ2 3 log 2 + 1 λ4 + 38σ2 ω E. Technical Lemmas Lemma E.1. Let Z be a standard Gaussian variable, then the tail probability P(Z > x) satisfies 1 4 exp( x2) < P(Z > x) 1 2 exp( x2/2) for any x 0. Proof. Let Q(x) := P(Z > x) for x > 0 represent the tail probability for a standard Gaussian variable Z. Let Z1 and Z2 are two independent standard Gaussian random variables. Then P(Z1 x, Z2 x) = Z x 1 2π exp z2 1 z2 2 /2 dz1dz2 1 2π exp r2/2 r dθ dr = 1 exp( x2), Multiplier Bootstrap-Based Exploration which implies that [1 2Q(x)]2 1 exp( x2) for x 0, or, equivalently, exp( x2) 4Q(x) 4Q2(x). However, since 4Q2(x) > 0 for all x, we obtain that which provides the lower bound in the lemma. For the upper bound, we observe that exp x2/2 Q(x) = exp x2/2 Z x (2π) 1/2 exp t2/2 dt x (2π) 1/2 exp t2 x2 /2 dt x (2π) 1/2 exp (t x)2/2 dt = 1 which establishes the upper bound. Lemma E.2. Let {Xi}n i=1 be a sequence of i.i.d. mean-zero sub-Gaussian random variables with variance proxy σ2, and let X = 1 n Pn i=1 Xi. Then E exp λX 2 e9/8 holds for any |λ| n 8σ2 . Proof. Note that if X µ sub G(σ2), then X µ sub G(σ2/n) (40) follows from the fact that E exp s(X µ) = i=1 exp n s (by Xi µ is sub-Gaussian) = exp s2(σ2/n) Then, by Proposition 4.3 in Zhang & Chen (2021), (X µ)2 n 1 var(X) sub E(8 2σ2/n, 8σ2/n). Therefore, E exp λ(X µ)2 = E exp λ (X µ)2 n 1 var(X) E exp{λn 1 var(X)} = exp 64λ2σ4 for any λ n 8σ2 . Lemma E.3. Suppose X and Y are Gaussian variables with EX = 0 and EY > 0. Then |Y | > c 2P(X > c Y ) + P (Y < 0) for any c > 0. Proof. We have |Y | > c = P (|X| > c|Y |) = P ({X > c Y } {X > 0} {Y > 0}) + P ({ X > c Y } {X 0} {Y > 0}) + P ({X > c Y } {X > 0} {Y 0}) + P ({ X > c Y } {X 0} {Y 0}) P ({X > c Y } {X > 0} {Y > 0}) + P ({ X > c Y } { X 0} {Y > 0}) + P(Y 0) by using X is symmetric about 0. 2P(X > c Y ) + P(Y 0) Multiplier Bootstrap-Based Exploration for any c > 0. Lemma E.4 (sub-Exponential concentration). Consider mean-zero independent random variables Xi sub E(λi), and positive constants a and b such that a X + b > 0, where the sample mean X := 1 s Ps i=1 Xi. Then, E exp s a X + b 2πλa 2b2 + 2a 2λa (b + λa) for s N such that 2λas/b 1, where λ := 1 s Ps i=1 λ2 i 1/2. Specifically, we have log E exp n s a X+b Proof. Denote Y := a X + b as a strictly positive random variable. For any non-negative strictly increasing function f( ), we have 0 P (f(Y ) > r) dr = Z 0 P Y > f 1(r) dr. Then for any fixed s N, E exp s a X + b = E exp n s 0 P exp n s 0 P Y > s log r 1 s log r 1 b dr s log r 1 b s log r 1 b s log r 1 b by letting u = s log r 1 = exp n s a | {z } >0 Recall that X is the average of independent mean-zero sub-exponential variables, and λ = 1 s Ps i=1 λ2 i 1/2. Then, we have X > t exp 1 for any t > 0 by Corollary 4.2.(c) in Zhang & Chen (2021). Therefore, we can further bound the expectation as E exp s a X + b by sub-Exponential inequality exp n s s u2 exp n s λ2a2 s(u b) λ2a2 s(u b) # s u2 exp s λ2a2 s(u b) o + I + II . Multiplier Bootstrap-Based Exploration The next step is bounding both I and II. For I, we have u + s(u b)2 s b2 exp s b + λa + s(u b)2 b2 exp s b + λa b exp s(u b)2 b2 exp s b + λa b exp (u b)2 2πλa 2b2 exp s b + λa For II, we decompose it as # s u2 exp s du + exp n s s(u b)(u 2λas =: II1 + II2. 
For II1, we have II1 = Z 2λas s u2 exp s 1 2λa is increasing on u 2λa (b + λa) + ( 2λa (b + λa) b) b b + λa exp s 2λa (b + λa) For II2, if 2λas b 1, i.e. λs b 2a, we obtain that II2 = exp n s s(u b)(u 2λas b 1 s exp n s b 1 s exp n s = s exp n s 1 e4s/b 2 π s/b e t2 dt by Lemma E.5 s exp n s Multiplier Bootstrap-Based Exploration By Combining the results obtained for I, II1, and II2, we can derive the following bound: E exp s a X + b o + I + II1 + II2 2b2 exp s b + λa + s (b + λa) b b + λa exp s 2λa (b + λa) 2b2 + 2λas2 2λa (b + λa) 2πλa 2b2 + 2a 2λa (b + λa) 2πλa 2b2 + 2a 2λa (b + λa) which gives the result we need. Lemma E.5. For any x 0, 2ex2 x e t2 dt 1. x e t2 dt = 2 π x e (t2 x2) dt x e (t x)2 dt
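As a quick numerical illustration of the two-sided Gaussian tail bound in Lemma E.1, the snippet below checks (1/4)exp(−x²) < P(Z > x) ≤ (1/2)exp(−x²/2) on a grid of our own choosing, using scipy's exact tail probability; it is a sanity check only, not part of the proof.

```python
# Numerically verify the bounds of Lemma E.1 on a grid of x values in [0, 5].
import numpy as np
from scipy.stats import norm

x = np.linspace(0.0, 5.0, 501)
tail = norm.sf(x)                      # exact P(Z > x)
lower = 0.25 * np.exp(-x**2)           # lower bound in Lemma E.1
upper = 0.5 * np.exp(-x**2 / 2.0)      # upper bound in Lemma E.1

assert np.all(lower < tail) and np.all(tail <= upper)
print("Lemma E.1 bounds hold on the grid; at x = 1:",
      float(lower[100]), "<", float(tail[100]), "<=", float(upper[100]))
```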