Zero-Inflated Bandits

Haoyu Wei *1  Runzhe Wan *2  Lei Shi 2  Rui Song 2

Abstract

Many real-world bandit applications are characterized by sparse rewards, which can significantly hinder learning efficiency. Leveraging problem-specific structures for careful distribution modeling is recognized as essential for improving estimation efficiency in statistics. However, this approach remains under-explored in the context of bandits. To address this gap, we initiate the study of zero-inflated bandits, where the reward is modeled using a classic semi-parametric distribution known as the zero-inflated distribution. We develop algorithms based on the Upper Confidence Bound and Thompson Sampling frameworks for this specific structure. The superior empirical performance of these methods is demonstrated through extensive numerical studies.

1. Introduction

Bandit algorithms have been widely applied in areas such as clinical trials (Durand et al., 2018), finance (Shen et al., 2015), and recommendation systems (Zhou et al., 2017), among others. Accurate uncertainty quantification is key to addressing the exploration-exploitation trade-off and typically relies on certain reward distribution assumptions. Existing assumptions can be roughly classified into two groups:

Parametric: the reward distribution is assumed to belong to a parameterized family, such as Gaussian or Bernoulli distributions (Audibert et al., 2010; Krause & Ong, 2011; Agrawal & Goyal, 2012). The strong assumption typically ensures clean algorithmic design and nice theoretical results. However, when misspecified, it may lead to over- or under-exploration.

*Equal contribution. 1Department of Economics, University of California San Diego, La Jolla, USA. 2Amazon, Seattle, USA. This work does not relate to the positions at Amazon. Correspondence to: Haoyu Wei, Rui Song.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025.
Copyright 2025 by the author(s).

Figure 1. Results from a real personalized pricing dataset detailed in Section 5. (a) Histogram of rewards, with zero represented in orange. (b) Comparison of 1 − δ upper confidence bounds across different methods (Monte Carlo quantile, non-zero-part-based UCB, variance-based UCB, estimated-proxy-based UCB, product-based UCB (ours), and true-proxy-based UCB). We use Monte Carlo to approximate the true quantile (the tightest valid upper confidence bound). All methods are validated, as their curves lie above the Monte Carlo baseline. Our proposed method (green) achieves the tightest bound quickly, demonstrating effective utilization of the zero-inflated structure. The other bounds correspond to UCB baselines detailed in Appendix D.1. Notably, applying existing concentration inequalities directly to the reward (yellow), even when the true size parameter is known but the ZI structure is not utilized, results in significantly looser bounds.

Non-parametric: the reward distributions need only satisfy certain characteristics, such as having sub-Gaussian tails (Chowdhury & Gopalan, 2017; Jin et al., 2021; Zhu & Tan, 2020) or being bounded (Kveton et al., 2019; 2020; Kalvit & Zeevi, 2021). In some cases, such an approach can achieve regret rates comparable to those of parametric methods (Urteaga & Wiggins, 2018; Kalvit & Zeevi, 2021; Jin et al., 2021). However, in general, these weaker assumptions sacrifice statistical efficiency by ignoring structural information. Even when the rates are the same, there may still be a significant empirical performance gap between the two approaches when the parametric distribution is correctly specified (see Chapter 9 in Lattimore & Szepesvári, 2020).
As in many areas of statistics and machine learning, using problem-specific structures to focus on a more refined distribution family can, when correctly specified, improve the efficiency of estimation or uncertainty quantification, which translates into lower regret in bandits. However, as discussed above, compared to the vast statistical literature on univariate distributions, this direction is under-explored in bandits. This paper initiates the study of this direction by focusing on the sparse-reward problem. Specifically, this work is motivated by the observation that rewards tend to be sparse in many real-world applications, meaning they are zero (or a constant) in most instances; such rewards are called zero-inflated (ZI).

Table 1. Summary of the theoretical guarantees for our algorithms.

| Bandit problem | Algorithm | Reward distribution | Minimax ratio |
|---|---|---|---|
| MAB | UCB | sub-Gaussian (Algorithm 1) | log T * |
| MAB | UCB | sub-Weibull (Algorithm 1) | log T ** |
| MAB | UCB | heavy-tailed (Algorithm B.1) | √(K/T) + (K/T)^{ε/(2(1+ε))} log^{ε/(1+ε)} K |
| MAB | TS | sub-Gaussian (Algorithm C.1) | 1 |
| GLM | UCB | sub-Gaussian (Algorithm C.2) | log(d∨q) log(T/(d∨q)) *** |
| GLM | TS | sub-Gaussian (Algorithm C.3) | (d∨q) log(d∨q) log(T/(d∨q)) |

Our MAB algorithms are designed for a fixed horizon T, but the doubling trick can adapt them into anytime algorithms with comparable guarantees (Section 6.2 in Lattimore & Szepesvári, 2020), preserving all the problem-dependent regret bounds (Theorems 7 and 9 in Besson & Kaufmann, 2018). *, ** Our problem-dependent and problem-independent regrets achieve state-of-the-art performance for both sub-Weibull (Hao et al., 2019) and heavy-tailed (Bubeck et al., 2013) rewards. *** Our problem-independent regret matches the minimax lower bound up to a logarithmic factor (Theorem 3 in Dani et al., 2008) and aligns with the state-of-the-art rate (Li et al., 2017).
For instance, in online advertising, most customers will not click the advertisement, and hence the reward is zero with high probability; for those who click, the reward then follows some distribution. Similar structures exist in a broad range of applications, including mobile health (Ling, 2019) and freemium games (Yu et al., 2021). While some standard bandit algorithms can still be applied, they fail to utilize this distributional structure and hence can be less efficient. See Figure 1a for a real example.

Contributions. Our contributions are threefold. First, we propose a general bandit algorithm framework for zero-inflated bandits (ZIB). Both Upper Confidence Bound (UCB)- and Thompson Sampling (TS)-type algorithms are proposed. By using the problem-specific structure, our algorithms are more efficient than existing ones via more accurate uncertainty quantification. We illustrate this with Figure 1b, which shows that our method leads to tighter concentration bounds; this translates into lower regret when used with UCB and TS. Our algorithms are designed for a wide range of reward distributions, including the sub-Weibull distribution and even heavier-tailed distributions (with finite moments of order exceeding one).

Second, we theoretically derive the regret bounds for our UCB and TS algorithms in multi-armed bandits (MAB) under weak reward distribution assumptions, as well as for contextual linear bandits. In many cases, our algorithms achieve regret rates that are either minimax optimal or state-of-the-art. A detailed summary is provided in Table 1. To our knowledge, this is the first finite-sample concentration analysis of general ZI models in the literature, even outside of bandits. Lastly, we show the value of our approach through both simulated and real experiments.

Related work. Besides the bandit literature with parametric or nonparametric reward distributions discussed above, our work connects to several related research areas.
First, there is research on semiparametric bandits (Krishnamurthy et al., 2018; Kim & Paik, 2019; Ou et al., 2019; Peng et al., 2019; Choi et al., 2023). However, these works focus on addressing the misspecification of the regression function within the context of contextual bandits, which is orthogonal to our objectives. Second, the zero-inflated distribution can also be regarded as a special case of hierarchical distributions. In recent years, there has been growing interest in leveraging hierarchical models in bandits (Hong et al., 2022; Wan et al., 2021; 2023a). However, all of them study the hierarchical structure among bandit instances (with meta-learning) instead of within the reward distribution, as in our setup. Third, to be agnostic to the distribution assumption, besides relying on nonparametric distribution families, one may also consider bootstrap-based methods (Wan et al., 2023b; Kveton et al., 2019). Nonetheless, on one hand, these methods still rely on certain restrictive distribution assumptions to ensure a regret guarantee; on the other hand, they fail to fully utilize the problem-specific structure, which may lead to compromised efficiency. Fourth, our work is related to research on sparse rewards and heavy-tailed/asymmetric bandits. While traditional sparse bandits (Kwon et al., 2017; Perrault et al., 2020) focus on arms with zero mean rewards, our zero-inflated framework addresses structural sparsity, where actions frequently yield zero rewards. This connects to heavy-tailed bandits (Bubeck et al., 2013; Zhang & Cutkosky, 2022) and asymmetric bandits (Zhang & Ong, 2021; Li et al., 2023), as zero-inflated distributions can exhibit heavy-tailed or asymmetric properties. However, our approach is specifically tailored to the zero-inflated structure, allowing the non-zero component to be either heavy-tailed or asymmetric while maintaining computational efficiency.
Finally, while zero-inflated structures have been studied in offline settings (supervised/unsupervised learning; see Lambert, 1992; Hall, 2000; Cheung, 2002) and there is some literature on model-specific zero-inflated bandits (Liu et al., 2023, which only considers two discrete count-reward models: Poisson and negative binomial), to our knowledge, this work is the first formal study of this topic in bandits in general, which poses unique challenges such as finite-sample concentration bounds and regret-rate analysis for a class of distributions.

2. Zero-Inflated Multi-Armed Bandits

In this section, we use the MAB setting to outline our motivation and strategy. We extend to contextual bandits in Section 3.

Setup. For any positive integer M, we denote the set {1, . . . , M} by [M]. We start our discussion with the MAB problem: on each round t ∈ [T], the agent chooses an action At ∈ A with the action space A = [K], and then receives a random reward Rt = r_{At} + εt, where rk = E[Rt | At = k] is the mean reward of the k-th arm and εt is the random error. The performance of a bandit algorithm is measured by the cumulative regret R(T) = Σ_{t=1}^T E[max_{a∈A} r_a − r_{At}]. We focus on applications where the reward is zero a significant proportion of the time, and propose to characterize the reward distribution by the following Zero-Inflated (ZI) model:

Xt = µ_{At} + εt,  Yt ∼ Bernoulli(p_{At}),  and  Rt = 0·(1 − Yt) + Xt·Yt.

Here, for each arm k, we introduce two unknown parameters: the non-zero probability pk ∈ [0, 1] and the mean of the non-zero part µk, such that rk = µk·pk. Here εt is a mean-zero random error term. For any arm k, we assume P(Xt = 0) = 0. We note this assumption can always be satisfied: given a reward variable Rt, one can always define Yt = 1(Rt ≠ 0) and let Xt coincide with Rt on the event {Rt ≠ 0}. Moreover, as such, it is natural to regard Yt as observable as well. In contrast, the value of Xt is only observable when Yt = 1 (equivalently, Rt ≠ 0), and in this case it is equal to Rt.
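As a concrete illustration, a ZI reward stream can be simulated directly from the model above. The Gaussian choice for the non-zero part Xt is an illustrative assumption for this sketch, not part of the model:

```python
import numpy as np

def zi_rewards(p, mu, sigma, n, rng):
    """Draw n rewards from the ZI model R_t = X_t * Y_t, with
    Y_t ~ Bernoulli(p) and an (illustrative) Gaussian non-zero part
    X_t ~ N(mu, sigma^2), so that E[R_t] = mu * p."""
    y = rng.binomial(1, p, size=n)       # Y_t: non-zero indicator
    x = rng.normal(mu, sigma, size=n)    # X_t: non-zero part
    return x * y                         # R_t = X_t * Y_t

rng = np.random.default_rng(0)
r = zi_rewards(p=0.3, mu=5.0, sigma=1.0, n=100_000, rng=rng)
```

With p = 0.3 and µ = 5, about 70% of the rewards are exactly zero while the overall mean is r = µ·p = 1.5.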
For simplicity of notation and without loss of generality, we assume Xt ⊥ Yt: even if Xt and Yt are dependent, we can re-define X̌t so that X̌t | Yt = y has the same distribution as Xt | Yt = 1 for y = 0, 1. In this case, X̌t ⊥ Yt and the observable reward Rt = Xt·Yt = X̌t·Yt; thus, we can simply replace Xt with X̌t. Finally, the conditional distribution of Rt is a mixture of two distributions, one of which is a delta distribution at zero while the other is only required to satisfy minimal assumptions; the assignment Yt is Bernoulli. For simplicity, we will occasionally omit the subscript t when there is no ambiguity. To simplify the exposition, we first focus on pulling a single arm and may omit the subscript k.

We begin by considering scenarios where εt exhibits relatively light tails. Specifically, we consider the sub-Weibull tail property, i.e., there exists θ > 0 for which the moment generating function (MGF) of |εt|^θ is defined within an interval around zero. This sub-Weibull distribution family is very general (Zhang & Chen, 2021; Zhang & Wei, 2022): for example, when θ = 1 and θ = 2, it reduces to the sub-exponential or sub-Gaussian family, respectively. Mathematically, we write εt ∼ subW(θ; C) if the noise εt satisfies E exp(|εt|^θ / C^θ) ≤ 2, with θ > 0 and C > 0 representing the tail and size parameters (Rinne, 2008; Vladimirova et al., 2020). In the design and analysis of bandit algorithms, it is commonly assumed that the parameters θ and C are known (Wu et al., 2016; Lattimore & Szepesvári, 2020; Wu et al., 2022; Zhou et al., 2025). We first note that the ZI structure retains the sub-Weibull tail behavior of the non-zero component Xt − µ, as established in Lemma 2.1.

Lemma 2.1. Assume independent Yt ∼ Bernoulli(p) and Xt − µ ∼ subW(θ; C), and let Rt = Xt·Yt. Then, there exists a constant CR > 0 such that Rt − µp ∼ subW(θ; CR).

Naive approaches and their limitations.
With Lemma 2.1, once the size parameter for Rt is known (or estimated), we can construct an upper confidence bound for r = µp using existing concentration inequalities for sub-Weibull variables {Rt} (Zhang & Chen, 2021; Zhang & Wei, 2022). The corresponding UCB algorithms then follow, which we refer to as naive approaches. While such approaches can theoretically attain minimax regret rates under appropriate parameter specifications, the zero-inflated structure introduces unique challenges that lead to two clear limitations.

First, even when the true size parameter is known, without leveraging the zero-inflated structure, such an approach yields a loose concentration bound and hence inefficient exploration. This can be seen in Figure 1b, in our numerical study in terms of regret, and also in our regret bound (e.g., Lemma 6.1). The intuition follows from the law of total variance:

var(Rt) = E_Y[var(Rt | Yt)] + var_Y[E(Rt | Yt)] = p·var(Xt) + µ²p(1 − p).

Therefore, if we directly apply a concentration bound to Rt, the width of the bound roughly increases linearly with µ. However, the noise level and the difficulty of estimating either p or µ (and hence r) should not change merely by shifting the non-zero distribution.

Second, in practice, estimating a valid size parameter CR for r (and hence obtaining a valid upper confidence bound) is challenging, as the size parameter depends on µ, p, θ, and C in a complex way. In real applications, there are a few common methods to choose CR: (1) use the size parameter of Xt − µ; (2) use the variance of Rt (estimated on the fly); (3) use the definition to calculate CR, with µ and p estimated on the fly. The first two approaches are not valid, while the third is also unreliable due to the sensitivity induced by the ZI structure.
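The variance decomposition var(Rt) = p·var(Xt) + µ²p(1 − p) above is easy to verify by simulation; the Gaussian non-zero part here is only an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(1)
p, mu, sigma, n = 0.3, 5.0, 1.0, 1_000_000

y = rng.binomial(1, p, size=n)
x = rng.normal(mu, sigma, size=n)
r = x * y

# Law of total variance: var(R) = p * var(X) + mu^2 * p * (1 - p)
formula = p * sigma**2 + mu**2 * p * (1 - p)   # = 0.3 + 5.25 = 5.55
empirical = r.var()
```

The µ²p(1 − p) term dominates as µ grows, which is exactly why a concentration bound applied directly to Rt widens roughly linearly in µ.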
We illustrate the issues with these methods in Lemma E.1 (in Appendix E) using sub-Gaussian distributions, which we summarize here: 1) CR can significantly exceed C, hence approach (1) is not valid; 2) CR can significantly exceed var(Rt), hence approach (2) is not valid; 3) CR is very sensitive to (p, µ, σ²), and hence prone to be heavily influenced by their estimation errors; specifically, the partial derivatives of CR with respect to these parameters can be arbitrarily large within some regions.

2.1. Proposed product method and upper confidence bound approach

To address the shortcomings of the naive approaches, we introduce the product method, which leverages the product structure of the true reward Rt = Xt·Yt. Utilizing the corresponding concentration inequalities, we establish valid upper confidence bounds for µ using {Xt}_{t=1}^n and for p using {Yt}_{t=1}^n, formulated as P(µ > X̄ + U_X) ≤ α/2 and P(p > Ȳ + U_Y) ≤ α/2. Here U_X and U_Y are known functions, and X̄ and Ȳ are the sample averages of {Xt}_{t=1}^n and {Yt}_{t=1}^n, respectively. Consequently,

P(µp > (X̄ + U_X)(Ȳ + U_Y)) ≤ P(µ − X̄ > U_X) + P(p − Ȳ > U_Y) ≤ α,  (1)

which implies that (X̄ + U_X)(Ȳ + U_Y) is a valid upper confidence bound for r = µ·p. Now, the key to establishing our method lies in determining sharp concentration bounds for both p and µ. For p, a standard concentration bound for Bernoulli variables can be applied, such as U_Y = √(log(2/α)/(2n)). However, establishing sharp bounds for µ presents significant challenges: (i) X̄ is not directly observable, since we only observe Xt when Yt = 1, and (ii) the number of non-zero observations is random, complicating standard concentration analysis. We therefore define the average of the observed Xt as X̄ := (1/#{t : Yt = 1}) Σ_{t: Yt=1} Xt, and let B = Σ_{t=1}^n Yt denote the count of observed Xt. While B is a random variable, given any fixed B, the sub-Weibull concentration inequality still holds.
For example, if Xt is sub-Gaussian (i.e., θ = 2) with a variance proxy σ²,

P(µ − X̄ > √((2σ²/B) log(2/α))) = E[P(µ − X̄ > √((2σ²/B) log(2/α)) | B)] ≤ α/2.

With U_X = √((2σ²/B) log(2/α)), based on (1), we can derive a valid upper confidence bound for rk and develop the corresponding UCB algorithm. The algorithm details are outlined in Algorithm 1.

Algorithm 1: UCB for ZI MAB with light tails
Data: horizon T, tail parameter θ, and size parameter C.
  Set U^µ_k = 1 and U^p_k = 1 for all k ∈ [K].
  Set the counters ck = 0, and set µ̂k = 0 and p̂k = 0.
  for t = 1, . . . , K do
    Take action At = t;
    Observe Rt, set Xt = 1(Rt ≠ 0)·Rt and Yt = 1(Rt ≠ 0).
  for t = K + 1, . . . , T do
    Take action At = argmax_{k∈[K]} U^µ_k · U^p_k (break ties randomly);
    Observe Rt, set Yt = 1(Rt ≠ 0);
    Update c_{At} = c_{At} + 1, p̂_{At} = p̂_{At} + (Yt − p̂_{At})/c_{At}, and U^p_{At} = p̂_{At} + √(log(4/δ)/(2c_{At}));
    if Rt ≠ 0 then
      Set Xt = Rt;
      Update µ̂_{At} = (1/#{l ≤ t : Al = At and Rl ≠ 0}) Σ_{l≤t: Al=At, Rl≠0} Rl and U^µ_{At} = µ̂_{At} + 2e·D(θ)·C·√(E(θ) log^{(1/θ)∨1}(4/δ) / (c_{At} p̂_{At}/2)), where D(θ) and E(θ) are defined in Lemma 2.2.

While the above construction provides a valid algorithm, the theoretical analysis presents additional complexities. The concentration bound √((2σ²/B) log(2/α)) is random and depends on {Yt}_{t=1}^n, which significantly complicates the regret analysis. Fortunately, Lemma 2.2 demonstrates that we can obtain a similar concentration bound with the random count B replaced by the deterministic quantity np, which is much easier to analyze. It also verifies that the observed average of i.i.d. sub-Weibull variables behaves with a combination of a Gaussian and a Weibull tail.

Lemma 2.2. Suppose Xt − µ are i.i.d. subW(θ; C) and Yt are i.i.d. Bernoulli(p). Let X̄ = (1/#{t ∈ [n] : Rt ≠ 0}) Σ_{t=1}^n Rt be the observed sample mean for X, and let U_X = 2e·D(θ)·C·√(2E(θ) log^{(1/θ)∨1}(4/δ) / (pn)); then P(µ − X̄ ≥ U_X) ≤ δ for any δ > 0 and n ≥ 4 log(2/δ)/p². The constants D(θ) and E(θ) are defined in Lemma F.2.
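The construction can be assembled into a minimal runnable sketch of the product-based UCB loop, specialized to a sub-Gaussian non-zero part with known variance proxy σ² (so the sub-Weibull bonus involving D(θ) and E(θ) is replaced by the simpler sub-Gaussian bonus from the display above; the bonus constants and the optimistic initialization are illustrative choices, not the paper's exact tuning):

```python
import numpy as np

def zi_product_ucb(T, p_true, mu_true, sigma, delta, rng):
    """Product-based UCB for a ZI MAB (sub-Gaussian non-zero part).
    Returns cumulative pseudo-regret and per-arm pull counts."""
    K = len(p_true)
    r_true = np.asarray(p_true) * np.asarray(mu_true)
    pulls = np.zeros(K)            # c_k: number of pulls
    nonzeros = np.zeros(K)         # number of non-zero observations
    p_hat = np.zeros(K)
    mu_hat = np.zeros(K)
    U_p = np.full(K, 1.0)
    U_mu = np.full(K, np.inf)      # optimistic until a non-zero reward is seen
    regret = 0.0
    for t in range(T):
        a = t if t < K else int(np.argmax(U_mu * U_p))  # round-robin, then UCB
        y = rng.binomial(1, p_true[a])
        r = y * rng.normal(mu_true[a], sigma)           # R = X * Y
        regret += r_true.max() - r_true[a]
        pulls[a] += 1
        p_hat[a] += (y - p_hat[a]) / pulls[a]           # Bernoulli part
        U_p[a] = p_hat[a] + np.sqrt(np.log(4 / delta) / (2 * pulls[a]))
        if r != 0:                                      # non-zero part
            nonzeros[a] += 1
            mu_hat[a] += (r - mu_hat[a]) / nonzeros[a]
            U_mu[a] = mu_hat[a] + sigma * np.sqrt(2 * np.log(4 / delta) / nonzeros[a])
    return regret, pulls

rng = np.random.default_rng(2)
T = 5000
regret, pulls = zi_product_ucb(T, p_true=[0.3, 0.3], mu_true=[5.0, 3.0],
                               sigma=1.0, delta=4 / T**2, rng=rng)
```

On this two-armed instance (r₁ = 1.5 vs r₂ = 0.9), the suboptimal arm is abandoned after a modest number of pulls, giving clearly sub-linear regret.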
More importantly, U_X is not only independent of µ but also enables a tighter product-method concentration for r = µ·p than standard sub-Weibull concentration applied to Rt = Xt·Yt. Our numerical results in Figure 1b demonstrate this advantage. In addition, some applications exhibit zero-inflated rewards whose non-zero part is heavy-tailed, possessing only finite moments of order 1 + ϵ for some ϵ ∈ (0, 1]. To handle such cases, we construct an upper bound based on the trimmed mean (Bickel, 1965; Bubeck et al., 2013), enabling the corresponding UCB algorithm. Further details are provided in Appendix B.

2.2. Thompson sampling approach

We also extend our discussion to another widely adopted approach, Thompson Sampling (TS; Thompson, 1933). Similarly to our UCB algorithms above, we consider the non-zero part Xt and the binary variable Yt separately. For ease of exposition, we consider the sub-Gaussian case for Xt. This can be easily extended to sub-Weibull cases by introducing an extra sampling step, known as Chambers-Mallows-Stuck (CMS) generation, to rescale Xt to a sub-Gaussian tail (Weron, 1996; Dubey & Pentland, 2019; Shi et al., 2023). Following the standard TS framework, we sample µk and pk from their respective posteriors and multiply them. We prove that this is equivalent to posterior sampling in Appendix C.1. To achieve a minimax-optimal TS algorithm for general sub-Gaussian distributions, we follow Jin et al. (2021); Karbasi et al. (2021) in using a clipped Gaussian distribution, denoted as cl-N(µ, σ²; ϑ) := max{N(µ, σ²), ϑ}, which curtails overestimation of suboptimal arms and ensures optimality. We adapt this idea, using the clipped Gaussian distribution for sampling the non-zero part Xt and a clipped Beta distribution for the binary part Yt. Our decision to use a clipped Beta distribution over a standard Beta distribution stems from our analysis, which shows it is critical for managing the risk of overestimating suboptimal arms in Rt = Xt·Yt.
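The arm index used by this TS variant is the product of the two posterior samples; a sketch with a clipped Gaussian for µ and a clipped Beta for p (the clipping thresholds and posterior parameters below are illustrative placeholders, not the paper's exact choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def clipped_gaussian(mean, var, lo, size, rng):
    """Sample from cl-N(mean, var; lo) = max{N(mean, var), lo}."""
    return np.maximum(rng.normal(mean, np.sqrt(var), size), lo)

def clipped_beta(a, b, lo, size, rng):
    """Sample from a Beta(a, b) posterior, clipped from below at lo."""
    return np.maximum(rng.beta(a, b, size), lo)

# Posterior samples for one arm (illustrative parameters),
# multiplied to form the sampled value of r = mu * p.
mu_draw = clipped_gaussian(mean=5.0, var=0.1, lo=0.0, size=10_000, rng=rng)
p_draw = clipped_beta(a=30, b=70, lo=0.0, size=10_000, rng=rng)
index = mu_draw * p_draw
```

Clipping from below caps how far a suboptimal arm's sampled index can fall, which is the mechanism the analysis relies on to control overestimation in the product.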
Further details are outlined in Appendix C.1.

3. Zero-Inflated Contextual Bandits

In this section, we extend our discussion to the contextual bandits problem. For concreteness, we consider the following setup, although other setups can be similarly formulated and addressed: on each round t, the agent observes a context vector xt and a set of feasible actions At, chooses an action At ∈ At, and receives a random reward Rt = r(xt, At) + εt, where r is the mean-reward function and εt is the random error. The cumulative regret in this setup is defined as R(T) = Σ_{t=1}^T E[max_{a∈At} r(xt, a) − r(xt, At)]. To utilize the ZI structure, we propose the following model:

Rt = 0·(1 − Yt) + Xt·Yt  with  Xt = g(xt, At; β) + εt,  Yt ∼ Bernoulli(h(xt, At; θ)),

where εt is a mean-zero error term, h is a function with codomain [0, 1] parameterized by θ ∈ R^q, and g is a function parameterized by β ∈ R^d. We remark the relationship r(x, a) = h(x, a; θ)·g(x, a; β).

Here, we present a general UCB template for our method in Algorithm 2, which extends the MAB version in Section 2.1. Specifically, based on the functional forms of h(·, ·; θ) and g(·, ·; β), we construct confidence bounds for exploration, denoted as Uall,t(x, a) for h and Unon-zero,t(x, a) for g, respectively. At each step, we estimate θ and β, then structure the UCB algorithm by maximizing the UCB score over actions. Similarly, a TS algorithm follows by designing appropriate sampling rules with suitable priors for θ and β. As a concrete example, Appendix C.2 provides detailed update formulas for both the UCB and TS algorithms when h and g are modeled as generalized linear functions.

Algorithm 2: General template of UCB for ZI contextual bandits
Data: confidence-bound exploration terms Uall,t(·, ·) and Unon-zero,t(·, ·), random selection period τ, and other algorithm-specific parameters.
  Set Hall = {} and Hnon-zero = {}.
  Randomly choose action at ∈ At for t ∈ [τ];
  for t = τ + 1, . . .
, T do
    Estimate θ̂t from the binary outcomes, using Hall;
    Estimate β̂t from the non-zero outcomes, using Hnon-zero;
    Take action At = argmax_{a∈At} [h(xt, a; θ̂t) + Uall,t(xt, a)] · [g(xt, a; β̂t) + Unon-zero,t(xt, a)].
    Observe reward Rt, set Xt = 1(Rt ≠ 0)·Rt and Yt = 1(Rt ≠ 0).
    Update the dataset as Hall ← Hall ∪ {(xt, At, Yt)}.
    if Rt ≠ 0 then
      Update the dataset as Hnon-zero ← Hnon-zero ∪ {(xt, At, Rt)};

4. Regret Analysis

4.1. Regret bounds for ZI MAB

Although concentration bounds for both components X (e.g., van de Geer & Lederer, 2013; Kuchibhotla & Chakrabortty, 2022; Zhang & Wei, 2022, for sub-Weibull) and Y (e.g., Bentkus, 2004; Mattner & Roos, 2007; Zhang & Chen, 2021, for Bernoulli) have been extensively studied, nontrivial challenges remain in analyzing our algorithm, which involves their product. These challenges, as discussed above, arise because the reward, and consequently the action selection, is jointly determined by both X and Y. Unlike standard bandits, where the observability of X depends solely on arm selection, here it also depends on the value of Y. Consequently, the number of times X is observed becomes a random variable. The lack of a precise distribution for sub-Weibull X further complicates the analysis. Fortunately, our Lemmas 2.2 and B.1 help address these challenges and support the regret analysis of our algorithms. Without loss of generality, we assume rk ∈ (0, 1) and r1 = max_{k∈[K]} rk, i.e., the first arm is the optimal arm. To demonstrate the benefits of exploiting the ZI structure from a theoretical perspective, we present the regret bound for our light-tailed ZI UCB algorithm for MAB.

Theorem 4.1. Consider a K-armed zero-inflated bandit with sub-Weibull noise subW(θ; C). We have an upper bound for the problem-dependent regret of Algorithm 1 with δ = 4/T² as

R(T) ≲ Σ_{k=2}^K log T / (p_k² Δ_k) + Σ_{k=2}^K log^{(1/θ)∨1} T / p_k + log T / p_1²,

where Δ_k = r_1 − r_k for k = 2, . . . , K.
In Theorem 4.1, the notation ≲ indicates inequality up to constants independent of bandit-specific parameters. Since pk is bounded in (0, 1] and Δk ≤ 1, the problem-independent regret simplifies to R(T) ≲ √(KT log T) + K. This matches the minimax lower bound up to a factor of O(√(log T)), as given in Theorem 15.2 of Lattimore & Szepesvári (2020). Similarly, we establish both problem-dependent and problem-independent regret bounds for our heavy-tailed UCB algorithm (Algorithm B.1 in Appendix B), which match the sharpest upper-bound rates in the current literature (Bubeck et al., 2013; Dubey et al., 2020; Chatterjee & Sen, 2021) and hence achieve state-of-the-art performance. The detailed regret analysis is provided in Appendix B.

We also provide the regret analysis for our TS algorithm, Algorithm C.1 in Appendix C.1, when the non-zero part is sub-Gaussian. In contrast to the proofs for UCB algorithms, we require anti-concentration properties of the distributions to control the probability of underestimating the optimal arm (Agrawal & Goyal, 2013; Jin et al., 2021; 2022). Fortunately, we prove that the clipped Gaussian and clipped Beta distributions, as well as their products, exhibit anti-concentration with optimal decay rates (Lemma F.3 and Lemma F.4). This finding allows us to establish the problem-dependent regret bound for our TS algorithm.

Theorem 4.2. Consider a K-armed ZIB with sub-Gaussian rewards. Let the tuning parameters satisfy γ ≥ 4, ρ ∈ (1/2, 1), and αk, βk, vk ∈ [0, 1]. Then Algorithm C.1 has

R(T) ≲ Σ_{k=2}^K Δ_k/p_k + √(KT)/p_1.

Given that the pk are bounded within (0, 1], the problem-independent regret is R(T) ≲ √(KT) + K. Compared to UCB (Theorem 4.1), Algorithm C.1 improves by a factor of √(log T), and this rate is minimax optimal (Auer et al., 2002) for sub-Gaussian MAB. One can extend Theorem 4.2 to sub-Weibull rewards with θ < 2 via the CMS generation (Dubey & Pentland, 2019).
This introduces an additional regret term scaling as (KT)^{1/θ} (Dubey & Pentland, 2019; Shi et al., 2023), which remains state-of-the-art.

4.2. Regret bounds for ZI contextual bandits

Analyzing ZI contextual bandit algorithms requires constructing confidence regions for both β and θ. Unlike in the standard contextual linear bandit literature, the ZI structure means that updates to β only occur when non-zero outcomes are observed. Fortunately, the concentration analysis we established for MAB can be similarly applied here. To illustrate the theoretical advantage of our ZI contextual bandit algorithms, we analyze an instantiation where both the non-zero part Xt and the binary part Yt follow generalized linear models (GLMs):

Xt = g(β⊤ψX(xt, At)) + εt  and  Yt ∼ Bernoulli(h(θ⊤ψY(xt, At))).

Here, εt is sub-Gaussian, g(·) and h(·) are strictly increasing link functions, and ψX(·) and ψY(·) are known transformations. The unknown parameter vectors β ∈ R^d and θ ∈ R^q govern the models, and the action space A = [K], K ∈ N, can be large or even infinite. This GLM setting is widely studied in the bandit literature (Filippi et al., 2010; Li et al., 2017; Wu et al., 2022; Li et al., 2010; 2012; Lattimore & Szepesvári, 2020). Denote ‖z‖²_A = z⊤Az for any z ∈ R^d and positive definite matrix A ∈ R^{d×d}. For the GLM setting, we design explicit forms for the confidence-bound exploration terms Uall,t and Unon-zero,t, along with the estimation steps in Algorithm 2. Specifically, at round t, the estimates β̂t and θ̂t are obtained via ridge-type regression:

β̂t := argmin_{β∈Γ} ‖Σ_{s∈[t]: Ys=1} (Rs − g(ψX(xs, As)⊤β)) ψX(xs, As)‖_{V_t^{-1}}  and
θ̂t := argmin_{θ∈Θ} ‖Σ_{s∈[t]} (Ys − h(ψY(xs, As)⊤θ)) ψY(xs, As)‖_{U_t^{-1}}.

The corresponding sample covariance matrices are Vt = λV·Id + Σ_{s∈[t]: Ys=1} ψX(xs, As) ψX(xs, As)⊤ and Ut = λU·Iq + Σ_{s∈[t]} ψY(xs, As) ψY(xs, As)⊤, where λV, λU are regularization parameters and Γ, Θ are compact parameter spaces.
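The ZI-specific bookkeeping is that Ut accumulates every round while Vt grows only on rounds with a non-zero reward; the ellipsoidal exploration bonus ‖ψ‖ in the inverse design matrix then shrinks for the non-zero model only as non-zero observations accumulate. A sketch (dimensions and regularization values are illustrative):

```python
import numpy as np

d, q = 3, 2
lam_V, lam_U = 1.0, 1.0
V = lam_V * np.eye(d)   # design matrix for the non-zero (X) model
U = lam_U * np.eye(q)   # design matrix for the binary (Y) model

def update_designs(V, U, psi_x, psi_y, y):
    """One round of updates: U always accumulates psi_y psi_y^T,
    while V is updated only when a non-zero reward is observed (y == 1)."""
    U = U + np.outer(psi_y, psi_y)
    if y == 1:
        V = V + np.outer(psi_x, psi_x)
    return V, U

def bonus(psi, M, rho=1.0):
    """Ellipsoidal exploration bonus rho * ||psi||_{M^{-1}}."""
    return rho * np.sqrt(psi @ np.linalg.solve(M, psi))

# A zero-reward round (y = 0): only the binary-model design grows.
V1, U1 = update_designs(V, U, np.ones(d), np.ones(q), y=0)
```

This mirrors the MAB analysis, where the number of observations of X is the random count B rather than the number of pulls.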
To balance exploitation and exploration, we select actions that maximize the sum of the estimated mean and an exploration bonus, which can be interpreted as an upper confidence bound. The confidence-bound exploration terms are designed as

Unon-zero,t(xt, a) = ρX,t ‖ψX(xt, a)‖_{V_t^{-1}}  and  Uall,t(xt, a) = ρY,t ‖ψY(xt, a)‖_{U_t^{-1}},

where {ρX,t, ρY,t}_{t≥0} are ellipsoidal scaling-factor sequences controlling exploration. This leads to our UCB algorithm for ZI GLM (Algorithm C.2 in Appendix C). To derive the regret bounds for our UCB algorithm, we impose regularity conditions on the link functions, parameter spaces, and underlying distributions. Informally, we assume that the link functions have bounded derivatives, the parameter spaces are compact, the variance matrices for both the non-zero and binary components are positive definite with bounded minimal eigenvalues, and the noise is sub-Gaussian. These assumptions are standard and mild in the literature on generalized linear contextual bandits (e.g., Li et al., 2010; Deshpande et al., 2018; Wu et al., 2022). The formal assumptions are detailed in Assumptions C.1 and C.2 in Appendix C.2. Utilizing results on self-normalized martingales (Abbasi-Yadkori et al., 2011), we derive the following regret bounds for Algorithm C.2.

Theorem 4.3. Fix any δ > 0. Suppose Algorithm C.2 is run with a suitable random selection period τ and tuning parameters {ρX,t, ρY,t}_{t≥0}, as specified in Assumption C.3 in Appendix C.2. Under the regularity conditions in Assumptions C.1 and C.2, the regret is bounded with probability at least 1 − 5δ as

R(T) ≲ d√(T log(1 + d⁻¹T) log(1/δ)) + d log(1 + d⁻¹T) + q√(T log(1 + q⁻¹T) log(1/δ)) + q log(1 + q⁻¹T),

for any λU, λV > 0. A notable observation is that the regularity conditions, random selection period, tuning-parameter choices, and regret bound in Theorem 4.3 are independent of the number of arms K.
This regret rate Õ((d∨q)√T) matches the minimax lower bound up to a logarithmic factor for contextual bandit problems with both finite and countably infinite action spaces (Theorem 3 in Dani et al., 2008). While Theorem 4.3 focuses on the UCB approach, the TS variant can achieve the same regret rate. Specifically, the TS algorithm samples parameters as β̃t ∼ N(β̂t, ϱ²_{X,t} V_t^{-1}) and θ̃t ∼ N(θ̂t, ϱ²_{Y,t} U_t^{-1}), using the same estimators β̂t, θ̂t, Vt, Ut as in the UCB algorithm. With appropriately chosen confidence radii {ϱX,t, ϱY,t}_{t≥0}, the TS algorithm selects actions as A^{TS}_t = argmax_{a∈[K]} [ψX(xt, a)⊤β̃t]·[ψY(xt, a)⊤θ̃t]. The corresponding regret bounds are provided in the following corollary.

Corollary 4.4. Suppose the conditions in Theorem 4.3 hold. If Algorithm C.3 is run with the same random selection period τ as Algorithm C.2 and the tuning parameters ϱX,t, ϱY,t specified in (C.3), then the regret is bounded with probability at least 1 − 5δ by Õ((d∨q)²√T).

5. Experiment

Simulation with MAB. For MAB problems, we evaluate our algorithms across three reward distributions: Gaussian, Gaussian mixture, and exponential. We set up several UCB baselines using the sub-Weibull concentration (Bogucki, 2015; Hao et al., 2019), differing in the size-parameter specification: the size parameter of the non-zero component, estimated variances, on-the-fly estimated size parameters, or the true size parameters (strong baseline). Additionally, we include the MOTS algorithm from Jin et al. (2021) as a TS baseline. The details of these baselines and our setting are presented in Appendix D.1. For the reward distributions, we consider: (i) Gaussian rewards with unit standard deviation and means µk; (ii) mixed Gaussian rewards drawn from (1 − pk)·N(µk/(2(1 − pk)), σ²) + pk·N(µk/(2pk), σ²), ensuring overall mean µk; and (iii) exponential rewards with mean µk. All mean parameters µk are independently drawn from U(0, 100).
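The three non-zero reward distributions can be generated as follows; the two mixture-component means µk/(2(1 − pk)) and µk/(2pk) are read off from the description above, so that the overall mean is (1 − pk)·µk/(2(1 − pk)) + pk·µk/(2pk) = µk:

```python
import numpy as np

def nonzero_reward(dist, mu, p, sigma, size, rng):
    """Generate the non-zero reward part for the three simulation
    settings; each choice has overall mean mu."""
    if dist == "gaussian":
        return rng.normal(mu, 1.0, size)                 # unit std dev
    if dist == "mixture":
        comp = rng.random(size) < p                      # component with weight p
        means = np.where(comp, mu / (2 * p), mu / (2 * (1 - p)))
        return rng.normal(means, sigma)
    if dist == "exponential":
        return rng.exponential(mu, size)                 # mean mu
    raise ValueError(f"unknown distribution: {dist}")

rng = np.random.default_rng(4)
samples = {name: nonzero_reward(name, mu=10.0, p=0.3, sigma=1.0,
                                size=200_000, rng=rng)
           for name in ("gaussian", "mixture", "exponential")}
```

Matching the means across distributions keeps the comparison between algorithms about tail shape rather than reward scale.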
Throughout our experimental evaluation, all plotted lines represent mean cumulative regret over multiple independent replications. Error bars around the curves indicate one tenth of the standard deviation, chosen to ensure visual clarity while reflecting variability. In some cases, error bars may appear negligible or invisible due to very low variability across replications. We run simulations with different distributions of $p = \max_{k \in [K]} p_k$, with $\delta = 4/T^2$. Here, we only present the results with $p_k \sim U[0.30, 0.35]$ in Figure 2. Besides, the $1 - \delta$ upper confidence bounds for the different UCB algorithms are shown in Figure 1(b). Findings from other settings (including high and very low values of $p$, and bounded rewards) are similar and detailed in Appendix D.1.

Our algorithms demonstrate sub-linear regrets across various distributions. In contrast, except for the strong baseline, all other methods exhibit either linear or significantly higher regrets in some cases, due to the challenge of correctly quantifying uncertainty in ZIB. Consistent with our observations in Figure 1(b), our UCB algorithm even exhibits much lower regret than the strong baseline in both the Gaussian and Gaussian-mixture cases. This corroborates our motivation that ignoring the ZI structure leads to looser concentration bounds and hence higher regret. Similarly, our TS algorithm outperforms the baseline TS algorithm (Jin et al., 2021) as well as the other baselines. We present results for TS algorithms only in the Gaussian and Gaussian-mixture cases.

Figure 2. Zero-inflated MAB with $K = 10$ and $T = 75000$ over $N = 50$ replications for $p_k \sim U[0.30, 0.35]$. (Panels: cumulative regret under Gaussian, Mixed Gaussian, and Exponential rewards; methods: product UCB (ours), product TS (ours), non-zero part based UCB, variance based UCB, estimated proxy based UCB, true proxy based UCB.)
For exponential distributions, the CMS transformation required to rescale the distribution to sub-Gaussian tails, discussed in Section 2.2, complicates fair comparisons.

Simulation with contextual bandits. We adapt the settings in Bastani et al. (2021); Wu et al. (2022) to ZIB. We design $K = 100$ arms and generate a vector $\nu_k \in \mathbb{R}^d$ with $d = 10$ for each arm $k \in [K]$. At each round $t$, the context for arm $k$ is sampled as $x_{k,t} \sim N_d\big(\nu_k, \tfrac{1}{2K} I_d\big)$. The non-zero part of the reward, $X_t$, is generated using a linear model, $X_t = \beta^\top x_{A_t,t} + \varepsilon_t$, with $\varepsilon_t$ representing white noise. For the binary part, we model the probability using a generalized linear function with the link function $h\big(\theta^\top \sin(x_{A_t,t})\big)$. Both $\beta$ and $\theta$ are $d$-dimensional, with $\theta$ having $s$ non-zero elements, where $s \in [d]$. The sparsity level $s$ signifies that only the non-zero elements in $\theta$ influence the occurrence of non-zero rewards.

Unlike the MAB problem, the contextual bandit problem introduces an additional layer of complexity for fair comparison due to model specification. We evaluate Algorithm C.2 against two baseline UCB methods. The first is a naive method, called misspecified UCB, that overlooks the ZI structure, treating the problem as a linear bandit and thus misspecifying the model. The second, termed integrated UCB, aligns with the UCB baselines in MAB: it correctly specifies the model, but directly quantifies the uncertainty of $(\beta^\top, \theta^\top)^\top$ from the model $\beta^\top x_t \cdot h\big(\theta^\top \sin(x_t)\big) + \varepsilon_t$ by utilizing generalized linear contextual bandit algorithms (Li et al., 2017). Similarly, we assess our TS algorithm, Algorithm C.3, against misspecified TS, which disregards the ZI structure, and integrated TS, which mirrors integrated UCB. Various sparsity levels and link functions $h(\cdot)$ are considered. Results for $s = 7$ using the probit link are shown in Figure 3, with additional simulation details provided in Appendix D.2. Again, TS algorithms are excluded for exponential noise to ensure fair comparison.
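The contextual data-generating process just described can be sketched directly. In this illustration of ours we use a logistic link for $h$ purely for self-containedness (the reported results use a probit link as well); all function names are our own.

```python
import numpy as np

def sample_contexts(rng, nus, K, d):
    """Per-arm contexts x_{k,t} ~ N_d(nu_k, (1/(2K)) I_d)."""
    return nus + rng.standard_normal((K, d)) / np.sqrt(2 * K)

def zi_glm_reward(rng, x, beta, theta):
    """Zero-inflated GLM reward: non-zero part X = beta^T x + eps with standard
    Gaussian noise, binary part Y ~ Bernoulli(h(theta^T sin(x))), logistic h."""
    prob = 1.0 / (1.0 + np.exp(-theta @ np.sin(x)))
    y = int(rng.random() < prob)
    return (x @ beta + rng.normal()) * y
```

A sparse $\theta$ (only $s$ non-zero coordinates) then means that only those coordinates of $\sin(x)$ affect whether the reward is zero, matching the setup above.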
As shown in Figure 3, our UCB and TS algorithms consistently achieve lower, sub-linear regrets across all tested distributions. In contrast, the other methods, except for integrated TS, exhibit linear regrets. While integrated TS also achieves sub-linear regret, our product TS demonstrates superior performance.

Real Data. We apply our ZI contextual bandit algorithms to a real dataset of loan records from a U.S. online auto loan company, a widely studied public dataset (Columbia Business School, Columbia University, 2024; Phillips et al., 2015; Ban & Keskin, 2021; Bastani et al., 2022). We mainly follow the setup in Chen et al. (2023). At each round $t$, an applicant arrives with certain covariates (the context) $x_t$. The company then proposes a loan interest rate, which can be transformed into the nominal profit $A_t$ for the company (the raw action). This raw action, along with $x_t$, determines the actual profit $X_t$ and affects the applicant's decision to accept, $Y_t \in \{0, 1\}$. The resulting reward for the company is hence $R_t = X_t Y_t$. We categorize the raw action $A_t \in \mathbb{R}$ into discrete levels, $\tilde A_t \in \mathcal{A}_{\text{discrete}} = \{\text{low}, \text{medium}, \text{high}, \text{very high}, \text{luxury}\}$, each representing a distinct price level. To simulate counterfactual outcomes, we begin by fitting the entire dataset using $X_t = (A_t, 1, b(x_t)^\top)\,\beta^* + \varepsilon_t$ and $Y_t \sim \text{Bernoulli}\big(h\big((A_t, 1, b(x_t)^\top)\,\theta^*\big)\big)$, where $h(\cdot)$ is the logistic function and the transformation function $b(\cdot)$ applied to the context is detailed in Appendix D.3. After deriving initial estimators $\hat\beta$ and $\hat\theta$, we compare our method with the two baseline approaches discussed in the contextual bandit simulations. At each round $t$, an action $\tilde A_t$ is selected from $\mathcal{A}_{\text{discrete}}$ using our algorithm; the profit $A_t$ is then sampled from its truncated distribution based on $\tilde A_t$; subsequently, the actual profit $X_t$, the binary decision $Y_t$ based on the offered $A_t$, and the corresponding reward $R_t = X_t Y_t$ are calculated.
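One plausible way to implement the action-discretization step is by empirical quintiles of the observed nominal profits; the paper defers the exact mapping to the appendix, so the quintile rule below is an assumption of ours for illustration.

```python
import numpy as np

LEVELS = ["low", "medium", "high", "very high", "luxury"]

def discretize_actions(raw_actions, n_levels=5):
    """Map continuous nominal profits A_t to the five discrete price levels by
    bucketing at empirical quantiles (an illustrative choice, not the paper's)."""
    qs = np.quantile(raw_actions, np.linspace(0, 1, n_levels + 1))
    idx = np.clip(np.searchsorted(qs, raw_actions, side="right") - 1, 0, n_levels - 1)
    return [LEVELS[i] for i in idx]
```

In the counterfactual simulation, the algorithm would pick a level $\tilde A_t$, and the raw profit $A_t$ would then be resampled from the data restricted to that level's bucket.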
With the estimated parameters, we determine the optimal actions for each round to compute the regret. Detailed procedures and tuning parameters are in Appendix D.3. The average cumulative regrets over 5000 rounds are shown in Figure 4. Our UCB and TS algorithms demonstrate significantly lower regret compared to the integrated UCB and TS algorithms, highlighting the importance of leveraging the ZI structure in real-world contextual bandit problems. Furthermore, it is worth noting that while the misspecified UCB and TS algorithms may appear to perform well on this dataset, their performance should be interpreted with caution: as shown in Figure 3, these algorithms sometimes incur linear regrets. In contrast, our UCB and TS algorithms consistently outperform the competition, achieving the lowest overall regret.

Figure 3. Zero-inflated contextual bandits with $T = 20000$ and $s = 7$ under $N = 25$ replications. (Panels: cumulative regret under Gaussian, Mixed Gaussian, and Exponential noise; methods: product UCB (ours), product TS (ours), misspecified UCB, integrated UCB, misspecified TS, integrated TS.)

Figure 4. Results with the real dataset.

6. Discussion

In this paper, we introduce a new bandit model and the corresponding algorithmic framework tailored to the common ZI distribution structure. The primary advantage lies in lower regrets from more accurate uncertainty quantification by utilizing the ZI structure. We establish theoretical regret bounds for UCB and TS algorithms in MABs under various reward distributions, as well as for contextual linear bandits.

Time and Space Complexity. Computational efficiency is crucial for practical applications.
An important advantage of our approach is that our ZI-based methods retain the same big-O time and space complexity as standard baselines for both MAB and GLM settings. The only difference is a small constant overhead from maintaining two estimators, one for the zero indicator $Y_t$ and one for the non-zero magnitude $X_t$, instead of a single reward estimator. Thus, our methods preserve the same computational order as existing approaches without adding extra computational cost.

Asymptotic Order-Optimality. A natural theoretical question is whether our algorithms achieve asymptotic order-optimality. The following lemma establishes a problem-dependent lower bound for ZIB with a Gaussian non-zero component.

Lemma 6.1. Consider a $K$-armed zero-inflated bandit whose non-zero components follow Gaussian distributions with variance $\sigma^2$. For any consistent algorithm (see Definition 16.1 in Lattimore & Szepesvári, 2020), the following lower bound holds:
$$\liminf_{T \to +\infty} \frac{R(T)}{\log T} \;\ge\; \sum_{k: \Delta_k > 0} \Delta_k \left[ \frac{\big[0 \vee (\mu_k - p_k^{-1} r_1)\big]^2}{2\sigma^2} + p_k \log\frac{p_k}{p_k \wedge \mu_k^{-1} r_1} + (1 - p_k)\log\frac{1 - p_k}{1 - p_k \wedge \mu_k^{-1} r_1} \right]^{-1}.$$
Specifically, a necessary condition for an algorithm to achieve asymptotic order-optimality for sub-Gaussian ZI MABs is that its regret satisfies $\liminf_{T \to +\infty} R(T)/\log T \gtrsim \sum_{k=2}^{K} p_k^{-1} \Delta_k^{-1}$.

We call an algorithm asymptotically order-optimal, also known as finite-time instance near-optimal (Lai & Robbins, 1985; Lattimore, 2015; Ménard & Garivier, 2017; Lattimore & Szepesvári, 2020), if its regret attains the lower bound in Lemma 6.1 up to universal constants independent of both the horizon $T$ and the problem-specific parameters $r_k$ (and thus $\Delta_k$, $p_k$). While we do not explicitly prove that our UCB and TS algorithms for MABs achieve asymptotic order-optimality, they satisfy the necessary condition by accounting for the ZI structure parameter $p_k$. However, designing an algorithm that attains this optimality may require additional refinements.
For example, in (1), we allocate equal probability $\alpha/2$ to the two concentration bounds, $P(\mu > \bar X + U_X) \le \alpha/2$ and $P(p > \bar Y + U_Y) \le \alpha/2$. To attain asymptotic order-optimality, these probability allocations should potentially be adapted based on the zero-inflation parameters $\{p_k\}_{k=1}^K$. We leave this as an open problem for future research.

Broader Implications. The ZI structure studied in this paper can be viewed as a special case of a broader class of problems where the reward distribution follows a hierarchical structure. This framework is particularly useful when the reward distribution is multimodal or when the reward generation mechanism follows a hierarchical process. For instance, in product recommendations, a customer's initial reaction (e.g., "highly interested", "somewhat interested", "not interested") could determine the subsequent reward distribution. Exploring these broader applications remains an interesting direction for future research.

Impact Statement

This paper presents an improved algorithmic framework for multi-armed bandits and contextual bandits, both mature areas with many applications. We are not aware of any particular negative social impacts of this work.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.

Abramowitz, M., Stegun, I. A., and Romer, R. H. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 1988.

Adamczak, R., Litvak, A. E., Pajor, A., and Tomczak-Jaegermann, N. Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling. Constructive Approximation, 34:61–88, 2011.

Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pp. 39.1–39.26. JMLR Workshop and Conference Proceedings, 2012.

Agrawal, S. and Goyal, N.
Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127–135. PMLR, 2013.

Ahle, T. D. Asymptotic tail bound and applications. 2017.

Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT, pp. 41–53, 2010.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Ban, G.-Y. and Keskin, N. B. Personalized dynamic pricing with machine learning: High-dimensional features and heterogeneous elasticity. Management Science, 67(9):5549–5568, 2021.

Bastani, H., Bayati, M., and Khosravi, K. Mostly exploration-free algorithms for contextual bandits. Management Science, 67(3):1329–1349, 2021.

Bastani, H., Simchi-Levi, D., and Zhu, R. Meta dynamic pricing: Transfer learning across experiments. Management Science, 68(3):1865–1881, 2022.

Bentkus, V. On Hoeffding's inequalities. Annals of Probability, 32(2):1650–1673, 2004.

Besson, L. and Kaufmann, E. What doubling tricks can and can't do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.

Bickel, P. J. On some robust estimates of location. The Annals of Mathematical Statistics, pp. 847–858, 1965.

Bogucki, R. Suprema of canonical Weibull processes. Statistics & Probability Letters, 107:253–263, 2015.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.

Chatterjee, S. and Sen, S. Regret minimization in isotonic, heavy-tailed contextual bandits via adaptive confidence bands. arXiv preprint arXiv:2110.10245, 2021.

Chen, X., Wang, Y., and Zhou, Y. Dynamic assortment optimization with changing contextual information.
The Journal of Machine Learning Research, 21(1):8918–8961, 2020.

Chen, X., Qi, Z., and Wan, R. Steel: Singularity-aware reinforcement learning. arXiv e-prints, arXiv:2301, 2023.

Cheung, Y. B. Zero-inflated models for regression analysis of count data: a study of growth and development. Statistics in Medicine, 21(10):1461–1469, 2002.

Choi, Y.-G., Kim, G.-S., Paik, S., and Paik, M. C. Semi-parametric contextual bandits with graph-Laplacian regularization. Information Sciences, 645:119367, 2023.

Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In International Conference on Machine Learning, pp. 844–853. PMLR, 2017.

Columbia Business School, Columbia University. Center for pricing and revenue management, 2024. Available at https://business.columbia.edu/cprm upon request.

Dai, C., Lin, B., Xing, X., and Liu, J. S. False discovery rate control via data splitting. Journal of the American Statistical Association, 118(544):2503–2520, 2023.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In COLT, volume 2, pp. 3, 2008.

Deshpande, Y., Mackey, L., Syrgkanis, V., and Taddy, M. Accurate inference for adaptive linear models. In International Conference on Machine Learning, pp. 1194–1203. PMLR, 2018.

Dubey, A. and Pentland, A. S. Thompson sampling on symmetric alpha-stable bandits. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 5715–5721. International Joint Conferences on Artificial Intelligence Organization, 2019.

Dubey, A. et al. Cooperative multi-agent bandits with heavy tails. In International Conference on Machine Learning, pp. 2730–2739. PMLR, 2020.

Durand, A., Achilleos, C., Iacovides, D., Strati, K., Mitsis, G. D., and Pineau, J. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine Learning for Healthcare Conference, pp. 67–82. PMLR, 2018.
Filippi, S., Cappé, O., Garivier, A., and Szepesvári, C. Parametric bandits: The generalized linear case. Advances in Neural Information Processing Systems, 23, 2010.

Hall, D. B. Zero-inflated Poisson and binomial regression with random effects: a case study. Biometrics, 56(4):1030–1039, 2000.

Hao, B., Abbasi-Yadkori, Y., Wen, Z., and Cheng, G. Bootstrapping upper confidence bound. Advances in Neural Information Processing Systems, 32, 2019.

Henzi, A. and Dümbgen, L. Some new inequalities for beta distributions. Statistics & Probability Letters, 195:109783, 2023.

Hong, J., Kveton, B., Zaheer, M., and Ghavamzadeh, M. Hierarchical Bayesian bandits. In International Conference on Artificial Intelligence and Statistics, pp. 7724–7741. PMLR, 2022.

Jin, T., Xu, P., Shi, J., Xiao, X., and Gu, Q. MOTS: Minimax optimal Thompson sampling. In International Conference on Machine Learning, pp. 5074–5083. PMLR, 2021.

Jin, T., Xu, P., Xiao, X., and Anandkumar, A. Finite-time regret of Thompson sampling algorithms for exponential family multi-armed bandits. Advances in Neural Information Processing Systems, 35:38475–38487, 2022.

Kalvit, A. and Zeevi, A. A closer look at the worst-case behavior of multi-armed bandit algorithms. Advances in Neural Information Processing Systems, 34:8807–8819, 2021.

Kang, M. and Kim, G.-S. Heavy-tailed linear bandit with Huber regression. In Uncertainty in Artificial Intelligence, pp. 1027–1036. PMLR, 2023.

Karbasi, A., Mirrokni, V., and Shadravan, M. Parallelizing Thompson sampling. Advances in Neural Information Processing Systems, 34:10535–10548, 2021.

Kim, G.-S. and Paik, M. C. Contextual multi-armed bandit algorithm for semiparametric reward model. In International Conference on Machine Learning, pp. 3389–3397. PMLR, 2019.

Krause, A. and Ong, C. Contextual Gaussian process bandit optimization. Advances in Neural Information Processing Systems, 24, 2011.

Krishnamurthy, A., Wu, Z. S., and Syrgkanis, V. Semiparametric contextual bandits.
In International Conference on Machine Learning, pp. 2776–2785. PMLR, 2018.

Kuchibhotla, A. K. and Chakrabortty, A. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Information and Inference: A Journal of the IMA, 11(4):1389–1456, 2022.

Kveton, B., Szepesvári, C., Vaswani, S., Wen, Z., Lattimore, T., and Ghavamzadeh, M. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 3601–3610. PMLR, 2019.

Kveton, B., Szepesvári, C., Ghavamzadeh, M., and Boutilier, C. Perturbed-history exploration in stochastic linear bandits. In Uncertainty in Artificial Intelligence, pp. 530–540. PMLR, 2020.

Kwon, J., Perchet, V., and Vernade, C. Sparse stochastic bandits. In Conference on Learning Theory, pp. 1269–1270. PMLR, 2017.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Lambert, D. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1):1–14, 1992.

Lattimore, T. Optimally confident UCB: Improved regret for finite-armed bandits. arXiv preprint arXiv:1507.07880, 2015.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.

Li, L., Chu, W., Langford, J., Moon, T., and Wang, X. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, pp. 19–36. JMLR Workshop and Conference Proceedings, 2012.

Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits.
In International Conference on Machine Learning, pp. 2071–2080. PMLR, 2017.

Li, S., Zhang, L., Yu, Y., and Li, X. Optimal arms identification with knapsacks. In International Conference on Machine Learning, pp. 20529–20555. PMLR, 2023.

Ling, W. Quantile regression for zero-inflated outcomes. PhD thesis, Columbia University, 2019.

Liu, X., Deliu, N., Chakraborty, T., Bell, L., and Chakraborty, B. Thompson sampling for zero-inflated count outcomes with an application to the Drink Less mobile health study. arXiv preprint arXiv:2311.14359, 2023.

Mattner, L. and Roos, B. A shorter proof of Kanter's Bessel function concentration bound. Probability Theory and Related Fields, 139:191–205, 2007.

Medina, A. M. and Yang, S. No-regret algorithms for heavy-tailed linear bandits. In International Conference on Machine Learning, pp. 1642–1650. PMLR, 2016.

Ménard, P. and Garivier, A. A minimax and asymptotically optimal algorithm for stochastic bandits. In International Conference on Algorithmic Learning Theory, pp. 223–237. PMLR, 2017.

Ou, M., Li, N., Yang, C., Zhu, S., and Jin, R. Semi-parametric sampling for stochastic bandits with many arms. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 7933–7940, 2019.

Peng, Y., Xie, M., Liu, J., Meng, X., Li, N., Yang, C., Yao, T., and Jin, R. A practical semi-parametric contextual bandit. In IJCAI, pp. 3246–3252, 2019.

Perrault, P., Valko, M., and Perchet, V. Covariance-adapting algorithm for semi-bandits with application to sparse outcomes. In Conference on Learning Theory, pp. 3152–3184. PMLR, 2020.

Phillips, R., Şimşek, A. S., and Van Ryzin, G. The effectiveness of field price discretion: Empirical evidence from auto lending. Management Science, 61(8):1741–1759, 2015.

Ren, Z. and Barber, R. F. Derandomised knockoffs: leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1):122–154, 2024.

Rinne, H.
The Weibull Distribution: A Handbook. CRC Press, 2008.

Shao, H., Yu, X., King, I., and Lyu, M. R. Almost optimal algorithms for linear stochastic bandits with heavy-tailed payoffs. Advances in Neural Information Processing Systems, 31, 2018.

Shen, W., Wang, J., Jiang, Y.-G., and Zha, H. Portfolio choices with orthogonal bandit learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Shi, Z., Kuruoglu, E. E., and Wei, X. Thompson sampling on asymmetric α-stable bandits. In International Conference on Agents and Artificial Intelligence, 2023.

Skorski, M. Bernstein-type bounds for beta distribution. Modern Stochastics: Theory and Applications, 10(2):211–228, 2023.

Sun, Q., Zhou, W.-X., and Fan, J. Adaptive Huber regression. Journal of the American Statistical Association, 115(529):254–265, 2020.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285–294, 1933.

Urteaga, I. and Wiggins, C. H. Nonparametric Gaussian mixture models for the multi-armed contextual bandit. Stat, 1050:8, 2018.

Vaart, A. v. d. and Wellner, J. A. Empirical processes. In Weak Convergence and Empirical Processes: With Applications to Statistics, pp. 127–384. Springer, 2023.

van de Geer, S. and Lederer, J. The Bernstein-Orlicz norm and deviation inequalities. Probability Theory and Related Fields, 157(1-2):225–250, 2013.

Vladimirova, M., Girard, S., Nguyen, H., and Arbel, J. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-exponential properties to heavier tailed distributions. Stat, 9(1):e318, 2020.

Wan, R., Ge, L., and Song, R. Metadata-based multi-task bandits with Bayesian hierarchical models. Advances in Neural Information Processing Systems, 34:29655–29668, 2021.

Wan, R., Ge, L., and Song, R. Towards scalable and robust structured bandits: A meta-learning framework. In International Conference on Artificial Intelligence and Statistics, pp. 1144–1173.
PMLR, 2023a.

Wan, R., Wei, H., Kveton, B., and Song, R. Multiplier bootstrap-based exploration. In International Conference on Machine Learning, pp. 35444–35490. PMLR, 2023b.

Wei, H., Lei, X., Han, Y., and Zhang, H. High-dimensional inference and FDR control for simulated Markov random fields. arXiv preprint arXiv:2202.05612, 2024.

Weron, R. On the Chambers-Mallows-Stuck method for simulating skewed stable random variables. Statistics & Probability Letters, 28(2):165–171, 1996.

Wu, S., Wang, C.-H., Li, Y., and Cheng, G. Residual bootstrap exploration for stochastic linear bandit. In Uncertainty in Artificial Intelligence, pp. 2117–2127. PMLR, 2022.

Wu, Y., Shariff, R., Lattimore, T., and Szepesvári, C. Conservative bandits. In International Conference on Machine Learning, pp. 1254–1262. PMLR, 2016.

Yu, M. et al. Online testing and semiparametric estimation of complex treatment effects. 2021.

Zhang, H. and Chen, S. X. Concentration inequalities for statistical inference. Communications in Mathematical Research, 37(1):1–85, 2021.

Zhang, H. and Wei, H. Sharper sub-Weibull concentrations. Mathematics, 10(13):2252, 2022.

Zhang, H., Wei, H., and Cheng, G. Tight non-asymptotic inference via sub-Gaussian intrinsic moment norm. arXiv preprint arXiv:2303.07287, 2023.

Zhang, J. and Cutkosky, A. Parameter-free regret in high probability with heavy tails. Advances in Neural Information Processing Systems, 35:8000–8012, 2022.

Zhang, M. and Ong, C. S. Quantile bandits for best arms identification. In International Conference on Machine Learning, pp. 12513–12523. PMLR, 2021.

Zhang, T. Mathematical Analysis of Machine Learning Algorithms. Cambridge University Press, 2023.

Zhao, P., Zhang, L., Jiang, Y., and Zhou, Z.-H. A simple approach for non-stationary linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 746–755. PMLR, 2020.

Zhong, Z., Cheung, W. C., and Tan, V. Y.
Thompson sampling algorithms for cascading bandits. The Journal of Machine Learning Research, 22(1):9915–9980, 2021.

Zhou, P., Wei, H., and Zhang, H. Selective reviews of bandit problems in AI via a statistical view. Mathematics, 13(4), 2025.

Zhou, Q., Zhang, X., Xu, J., and Liang, B. Large-scale bandit approaches for recommender systems. In International Conference on Neural Information Processing, pp. 811–821. Springer, 2017.

Zhu, Q. and Tan, V. Thompson sampling algorithms for mean-variance bandits. In International Conference on Machine Learning, pp. 11599–11608. PMLR, 2020.

A. Appendix Overview and Roadmap

We provide a roadmap to help readers navigate the supplementary material.

Appendix B: Heavy-tailed ZI MAB setup, concentration inequalities, UCB algorithm, and regret analysis.
Appendix C.1: Thompson Sampling algorithm for ZI MAB with sub-Gaussian rewards.
Appendix C.2.1: GLM UCB algorithm setup, tuning parameters, and regularity conditions.
Appendix C.2.2: GLM Thompson Sampling algorithm and tuning parameters.
Appendix C.2.3: Extension to GLM ZI bandits with heavy-tailed noise.
Appendix D.1: MAB simulation details (baselines, parameters, computational resources) and extended results (small/large p, bounded rewards); includes the confidence bounds of different concentration inequalities for Figure 1b.
Appendix D.2: GLM bandit simulation details (algorithms, parameters, computational setup) and extended results (different sparsity levels, other baselines).
Appendix D.3: US loan dataset setup and modeling as ZI GLM contextual bandits.
Appendix E: Supporting lemmas for the discussion of naive approaches (Section 2).
Appendix F: Proofs of all lemmas.
Appendix G: Regret bound proofs for UCB MAB algorithms.
Appendix H: Regret bound proofs for Thompson Sampling MAB algorithms.
Appendix I: Regret bound proofs for ZI GLM contextual bandit algorithms.

Table 2. Appendix overview.

B.
Heavy-tailed MAB

In many applications with zero-inflated rewards, the non-zero part can be heavy-tailed (with only finite moments of order $1+\epsilon$ for some $\epsilon \in (0, 1]$). To accommodate these scenarios, we adopt the trimmed mean (Bubeck et al., 2013) as a solution. The truncation follows a Hoeffding-type upper bound under the condition $E|X_t|^{1+\epsilon} \le M$. Specifically, for fully observable data $\{X_t\}_{t=1}^n$, we can construct a $1 - \delta$ upper confidence bound for $\mu$ with the trimmed mean
$$\bar X_{\text{trimmed}} := \frac{1}{n}\sum_{t=1}^{n} X_t\,\mathbf{1}\Big(|X_t| \le \big(\log^{-1}(\delta^{-1})\,M t\big)^{1/(1+\epsilon)}\Big).$$
The resulting upper confidence bound (Bickel, 1965; Bubeck et al., 2013; Dubey et al., 2020) is given by
$$\bar X_{\text{trimmed}} + 4 M^{\frac{1}{1+\epsilon}}\big(n^{-1}\log\delta^{-1}\big)^{\frac{\epsilon}{1+\epsilon}}.$$
In ZIB, we similarly define
$$\bar X := \frac{1}{\#\{t \in [n]: Y_t = 1\}}\sum_{t: Y_t = 1} X_t\,\mathbf{1}\Big(|X_t| \le \big(\log^{-1}(2/\delta)\,j(t)\,M\big)^{1/(1+\epsilon)}\Big) \quad\text{with}\quad j(t) = \sum_{\ell \le t}\mathbf{1}(Y_\ell = 1).$$
Then we can construct a valid upper bound for $\mu$ as $U_X = \bar X + M^{1/(1+\epsilon)}\big(32\log S(n)/n\big)^{\epsilon/(1+\epsilon)}$, where $S(n)$ is the round index at which the arm we focus on has been pulled $n$ times. The corresponding UCB algorithm is then constructed using $U_X \cdot U_Y$, with $U_Y$ as previously defined. This approach is further detailed in Algorithm B.1. Similar to Lemma 2.2, Lemma B.1 below confirms that this upper bound maintains a deterministic rate, simplifying the analysis.

Algorithm B.1 UCB for zero-inflated bandits with heavy tails
Data: Horizon $T$, parameters $\epsilon$ and $M$.
Set $U_k^\mu = 1$ and $U_k^p = 1$ for all $k \in [K]$.
Set the counters $c_k = 0$ and the mean estimators $\hat p_k = 0$ and $\hat\mu_k = 0$, $k \in [K]$.
for $t = 1, \ldots, K$ do: take action $A_t = t$, set $X_t = \mathbf{1}(R_t \ne 0) R_t$ and $Y_t = \mathbf{1}(R_t \ne 0)$.
for $t = K+1, \ldots, T$ do:
  Take action $A_t = \arg\max_{k \in [K]} U_k^\mu U_k^p$ (break ties randomly);
  Observe $R_t$, and set $Y_t = \mathbf{1}(R_t \ne 0)$.
  Update $c_{A_t} = c_{A_t} + 1$ and $\hat p_{A_t} \leftarrow \hat p_{A_t} + (Y_t - \hat p_{A_t})/c_{A_t}$, and set $U_{A_t}^p = \hat p_{A_t} + U_Y$, with $U_Y$ the Bernoulli confidence width defined above;
  if $R_t \ne 0$ then: set $X_t = R_t$ and update the trimmed mean
$$\hat\mu_{A_t} = \frac{1}{\#\{l \le t: A_l = A_t \text{ and } R_l \ne 0\}}\sum_{l \le t: A_l = A_t} R_l\,\mathbf{1}\Big(|R_l| \le g(\hat p_l, \epsilon)\,M^{\frac{1}{1+\epsilon}}\log l^2\Big),$$
  where the function $g(\cdot, \epsilon)$ is defined in Lemma B.1, and set $U_{A_t}^\mu = \hat\mu_{A_t} + M^{1/(1+\epsilon)}\big(32\log t/c_{A_t}\big)^{\epsilon/(1+\epsilon)}$.

Lemma B.1. Suppose $Y_t \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(p)$ and the $X_t$ satisfy $E|X_t - \mu|^{1+\epsilon} \le M$ for some $\epsilon \in (0, 1]$ and $M > 0$. Then
$$P\Big(\mu - \bar X \ge g(p, \epsilon)\,M^{\frac{1}{1+\epsilon}}\big(n^{-1}\log(2/\delta)\big)^{\frac{\epsilon}{1+\epsilon}}\Big) \le \delta$$
for any $\delta > 0$ and $n \ge 4\log(2/\delta)/p^2$, where $g(p, \epsilon) := p^{-\frac{\epsilon}{1+\epsilon}}(1+\epsilon)\,2^{\frac{\epsilon}{1+\epsilon}} + \sqrt{p}\cdot 14/3 + 2/\sqrt{p}$.

Similarly, using Lemma B.1, we can establish the problem-dependent regret bound for our heavy-tailed UCB algorithm in Algorithm B.1.

Theorem B.2. For a $K$-armed ZIB with noise satisfying $\max_{k \in [K]} E|\varepsilon_k|^{1+\epsilon} < \infty$ for some $\epsilon \in (0, 1]$, Algorithm B.1 has a problem-dependent regret bound
$$R(T) \lesssim \sum_{k: \Delta_k > 0}\Big(3\Delta_k + 4 p_1^{-2}\Delta_k^{-1} + 9 p_k^{-2}\Delta_k^{-1} + 2 p_k\,g(p_k, \epsilon)^{\frac{1+\epsilon}{\epsilon}} M^{1/\epsilon}\Delta_k^{-1/\epsilon}\Big)\log T.$$
When treating the number of arms as finite and assuming that the $p_k$ are fixed, a comparison with the heavy-tailed bandit literature (Bubeck et al., 2013; Dubey et al., 2020; Chatterjee & Sen, 2021) shows that our algorithm achieves the state-of-the-art regret rate. Moreover, from Theorem B.2, the problem-independent regret for Algorithm B.1 is
$$R(T) \lesssim p_1^{-2}\sqrt{KT} + (MT)^{\frac{1}{1+\epsilon}}(K\log T)^{\frac{\epsilon}{1+\epsilon}} \lesssim \sqrt{KT} + T^{\frac{1}{1+\epsilon}}(K\log T)^{\frac{\epsilon}{1+\epsilon}},$$
where the last step treats $p_1, p_k \in (0, 1]$ as fixed constants. This finalizes the minimax ratio in Table 1 for MAB.

C. Additional Algorithms Details

C.1. TS Algorithms for MAB

As outlined in Section 2, we have developed a light-tailed TS algorithm for MAB. The light-tailed TS-type algorithm achieves the minimax optimal rate when applied with the clipped distributions. The comprehensive procedure is presented in Algorithm C.1.

Algorithm C.1 TS for zero-inflated MAB with light tails
Data: Prior parameters $\{\alpha_k, \beta_k, v_k\}_{k=1}^K$ and $\rho$, $\gamma$.
Set the counter $c_k = 0$, $k \in [K]$;
for $t = 1, \ldots, K$ do: take action $A_t = t$, set $X_t = \mathbf{1}(R_t \ne 0) R_t$ and $Y_t = \mathbf{1}(R_t \ne 0)$.
for $t = K+1, \ldots, T$ do:
  Sample $\tilde p_k \sim \text{clBeta}\big(\alpha_k, \beta_k; \vartheta_k(p)\big)$ and $\tilde\mu_k \sim \text{clN}\big(v_k, \tfrac{2\sigma^2}{\rho c_k \hat p_k}; \vartheta_k(\mu)\big)$;
  Take action $A_t = \arg\max_{k \in \mathcal{A}} \tilde p_k \tilde\mu_k$;
  Observe reward $R_t$, and set $Y_t = \mathbf{1}(R_t \ne 0)$;
  Update $\alpha_{A_t} = \alpha_{A_t} + Y_t$ and $\beta_{A_t} = \beta_{A_t} + 1 - Y_t$;
  Update $\hat p_{A_t} \leftarrow \hat p_{A_t} + (Y_t - \hat p_{A_t})/c_{A_t}$ and $c_{A_t} = c_{A_t} + 1$; set $\vartheta_{A_t}(p) = \hat p_{A_t} + \sqrt{\tfrac{\gamma}{4 c_{A_t}}\log^+\tfrac{T}{4 c_{A_t} K}}$;
  if $R_t \ne 0$ then: calculate $\hat\mu_{A_t} = \frac{1}{\#\{l \le t: A_l = A_t \text{ and } R_l \ne 0\}}\sum_{l \le t: A_l = A_t} R_l$; update $v_{A_t} = \hat\mu_{A_t}$ and
$$\vartheta_{A_t}(\mu) = \hat\mu_{A_t} + \sqrt{\frac{4\gamma\big(1 + \log^{-1}(1 + 1/\sqrt{c_{A_t} T})\big)\sigma^2}{\hat p_{A_t}^2 c_{A_t}}\,\log^+\frac{4\big(1 + \log^{-1}(1 + 1/\sqrt{c_{A_t} T})\big)\sigma^2 T}{\hat p_{A_t}^2 c_{A_t} K}}.$$

Directly sampling $r_k$ is equivalent to the procedure in Algorithm C.1 of sampling $\mu_k$ and $p_k$ from their respective posteriors and then multiplying them. Mathematically,
$$P(r_k \mid \{R_t\}_{t: A_t = k}) \propto P(r_k)\,P(\{R_t\}_{t: A_t = k} \mid r_k) = \int_{\mu_k p_k = r_k} P(\mu_k, p_k)\,P(\{R_t\}_{t: A_t = k, R_t = 0} \mid \mu_k, p_k)\,P(\{R_t\}_{t: A_t = k, R_t \ne 0} \mid \mu_k, p_k)\,d\mu_k\,dp_k = \int_{\mu_k p_k = r_k} \big[P(p_k)\,P(\{R_t\}_{t: A_t = k, R_t = 0} \mid p_k)\big]\big[P(\mu_k)\,P(\{R_t\}_{t: A_t = k, R_t \ne 0} \mid \mu_k)\big]\,d\mu_k\,dp_k,$$
where the last equality results from two assumptions: 1) independent priors for $\mu_k$ and $p_k$, and 2) non-zero rewards depend solely on $\mu_k$, while zero rewards depend only on $p_k$.

An important aspect of the TS algorithm is our use of clipped distributions for both the sub-Gaussian non-zero part $X$ and the exactly Bernoulli-distributed part $Y$. As explained in Section 2.2, the reason for using a clipped Gaussian distribution for $X$ is its deviation from an exact Gaussian distribution. The rationale behind employing a clipped Beta distribution for the exactly Bernoulli-distributed $Y$ is as follows: the objective is to establish concentration for the product random variable $R = X \cdot Y$. The product of the original Beta distribution and the clipped Gaussian does not adequately control the overestimation probability of sub-optimal $R_k$ in our proofs. This is primarily due to the proof techniques employed. For more details, refer to the proof of Theorem 4.2 in Appendix H.

C.2. Algorithms for Generalized Linear Contextual Bandits

C.2.1.
UCB ALGORITHMS AND REGULARITY CONDITIONS As a concrete example, we consider the widely-used generalized linear contextual bandits with finite action set A = [K], where both functions h and g are structured as generalized linear functions here. This setup is characterized by the known transformation functions ψX( ), ψY ( ), and the known link functions g( ) and h( ), such that g(xt, At; β) = g ψX(xt, At) β and h(xt, At; θ) = h ψY (xt, At) θ . When the εt is sub-Gaussian, confidence radii for the ridge estimations bθt (or bβt) in a compact parameter space can be fully determined by ψY (xt, At) (or ψX(xt, At)) and the sub-Gaussian variance proxy of εt. Specifically, ρY,t is chosen such that (Lemma 17.8 in Zhang, 2023) 0 t T : ρY,t p λU + sup a [K],x1:t X s=1 εsψY (xs, a) Thus, the ellipsoidal ratio ρY,t in Algorithm C.2 is constructed accordingly. A similar approach applies to ρX,t for the non-zero component, though additional randomness from Xs = g ψX(xs, As) β must be carefully handled, as discussed in Section 2.1. The proof of Theorem 4.3 provides further details, which we omit here for brevity. Algorithm C.2 General template of UCB for zero-inflated generalized linear bandits Data: Link functions ψX( ), ψY ( ), g( ), and h( ). Ellipsoidal ratio sequences {ρX,t, ρY,t}. Rridge parameters λU and λV . 54 Set Hall = {} and Hnon-zero = {}. Set Ut = λUIq and Vt = λV Id. 55 Randomly choose action at [K] for t [τ]; 56 for t = τ + 1, . . . , T do 57 for a [K] do UCBt(a) = h ψX(xt, a) bβt + ρX,t ψX(xt, a) V 1 t ih ψY (xt, a) bθt + ρY,t ψY (xt, a) U 1 t 60 Take action At = arg maxa [K] UCBt(a). 61 Observe reward Rt, and set Yt = 1(Rt = 0). 62 Update the dataset as Hall Hall {(xt, At, Yt)} and Ut = Ut + ψY (xt, At)ψY (xt, At) . 
63 Solve the equation h Ys h ψY (xs, As) θ i ψY (xs, As) = 0 64 if Rt = 0 then 65 Update the dataset as Hnon-zero Hnon-zero {(xt, At, Rt)}; 66 Update Vt = Vt + ψX(xt, At)ψX(xt, At) ; bβt β Γ : X h Rs g ψX(xs, As) β i ψX(xs, As) = 0 ; Next, we present the regularity conditions and selections of tuning parameters for Theorem 4.3. Let the true parameters be β and θ , and Zt,a = ψX(xt, a) Rd and Wt,a = ψY (xt, a) Rq for a [K]. These assumptions are quite standard in generalized linear contextual bandits (see, for example, Li et al., 2010; Deshpande et al., 2018; Wu et al., 2022). Assumption C.1 (Link Functions and Parameters). (i) The parameter space Γ and Θ for β and θ satisfies supβ Γ β 2 1 and supβ Θ θ 2 1; (ii) The link function g( ) and h( ) is twice differentiable. Its first and second order derivatives are upper-bounded by Lg and Mg, and Lh and Mh, respectively; Zero-Inflated Bandits (iii) κg := inf z 2 1, β β 2 1 g(z β) > 0 and κh := inf w 2 1, θ θ 2 1 h(w θ) > 0; (iv) p := inf w 2 1,θ Θ h(w θ) > 0. Assumption C.2 (Distributions). (i) Zt,a 2 1 and Wt,a 2 1 for all a [K]; (ii) The minimal eigenvalues for E[Zt,a Z t,a] and E[Wt,a W t,a] satisfy λmin E[Zt,a Z t,a] σ2 z and λmin E[Wt,a W t,a] σ2 w for all a [K] with some positive σz and σw. (iii) The noise εt is sub-Gaussian distributed, satisfying E[εt | Ft 1] = 0 and E[esεt | Ft 1] es2σ2/2 for any s R, where Ft := σ {(x1, R1), . . . , (xt, Rt)} is an increasing sequence of sigma field. Assumption C.3 (Tuning Parameters). For any δ (0, 4/T), (i) the random selection period τ is chosen as log(1/δ)/p σ2z 2 + 2 p σ2z , 4 log(1/δ) C3 q + C4 p log(1/δ) σ2w (ii) the ellipsoidal ratio sequences are chosen as ρX,t = κ 1 g σ q 4 log(1/δ) + d log(1 + λ 1 V t/d) and ρY,t = κ 1 h 4 log(1/δ) + q log(1 + λ 1 U t/q), where Cℓ, ℓ= 1, 2, 3, 4 are some universal positive constants detailed in Appendix I. 
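The per-arm optimistic score in Algorithm C.2 can be sketched numerically. This is an illustrative sketch, not the paper's exact implementation: the function name is ours, and where the ellipsoidal bonus enters the link is written in one natural way (both links are increasing, so adding the bonus inside the link preserves optimism).

```python
import numpy as np

def zi_glm_ucb_score(phi_x, phi_y, beta_hat, theta_hat, V_inv, U_inv,
                     rho_x, rho_y, g, h):
    """Sketch of the product UCB score for one arm in the zero-inflated
    generalized linear bandit: an optimistic estimate of the non-zero mean
    times an optimistic estimate of the non-zero probability."""
    # ellipsoidal confidence widths ||phi||_{V^{-1}} and ||phi||_{U^{-1}}
    bonus_x = rho_x * np.sqrt(phi_x @ V_inv @ phi_x)
    bonus_y = rho_y * np.sqrt(phi_y @ U_inv @ phi_y)
    # optimism for each component, combined multiplicatively
    return g(phi_x @ beta_hat + bonus_x) * h(phi_y @ theta_hat + bonus_y)
```

The arm is then chosen as the argmax of this score over `a`, mirroring line 60 of Algorithm C.2.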
A quick remark here is that all regularity conditions in Assumption C.1, C.2, and tuning parameter selections in Assumptionn C.3 are independent of the number of arms K. C.2.2. THOMPSON SAMPLING FOR GLM Just like in the standard TS algorithm for linear contextual bandits (Agrawal & Goyal, 2013), we employ Gaussian priors for the sub-Gaussian noise. At each round t [T], we sample parameters { eβt, eθt} from Gaussian posterior distributions centered at the estimates { bβt, bθt} with covariance matrices ϱ2 X,t V 1 t , ϱ2 Y,t U 1 t , where these estimates and covariance matrices are obtained from the UCB algorithm. The arm At is then selected by maximizing the Thompson Sampling objective function ψX(xt, a) eβt ψY (xt, a) eθt leveraging the fact that both link functions are strictly increasing. Let A t denote the optimal action at round t, and define W t := ψX(Wt, A t ). To ensure an appropriate choice of confidence radius sequences {ϱY,t}t 0 (or {ϱX,t}t 0, both of which can be variance-dependent), it suffices to find ϱY,t > 0 such that (Lemma 2 in Agrawal & Goyal, 2013) P W t eθt > W t θ + q ϱ2 Y,t log3 T Wt U 1 t | Ft 1 Gt T ϵ/2 (C.2) for some ϵ (0, 1), where Gt is a high-probability good event defined as Gt := n a [K] : W t,a bθt θ ϱY,t Wt,a U 1 t From the proof of Theorem 4.3, we can set ϱY,t ρY,t to ensure that Gt holds with probability at least 1 δ for t > τ. To validate the anti-concentration property in (C.2), we use the inequality 1 2 πx exp x2/2 P(|N(µ, σ2) µ| > xσ) 1 πx exp x2/2 . Thus, choosing ϱY,t = p ϵ 1q log(1/δ) with ϵ log T ensures that P W t eθt > W t θ + q ϱ2 Y,t log3 T Wt U 1 t | Ft 1 Gt Zero-Inflated Bandits Algorithm C.3 General template of TS for ZI GLM bandits Data: Link functions ψX( ), ψY ( ), g( ), and h( ). Confidence radio sequences {ϱX,t, ϱY,t}. Rridge parameters λU and λV . 69 Set Hall = {} and Hnon-zero = {}. Set Ut = λUIq and Vt = λV Id. 70 Randomly choose action at [K] for t [τ]; 71 for t = τ + 1, . . . 
, T do 72 for a [K] do TSt(a) = h ψX(xt, a) eβt ih ψY (xt, a) eθt i . 75 Take action At = arg maxa [K] TSt(a). 76 Observe reward Rt, and set Yt = 1(Rt = 0). 77 Update the dataset as Hall Hall {(xt, At, Yt)} and Ut = Ut + ψY (xt, At)ψY (xt, At) . 78 Solve the equation h Ys h ψY (xs, As) θ i ψY (xs, As) = 0 . 79 Sample eθt from distribution N(bθt, ϱ2 Y,t U 1 t ). 80 if Rt = 0 then 81 Update the dataset as Hnon-zero Hnon-zero {(xt, At, Rt)}; 82 Update Vt = Vt + ψX(xt, At)ψX(xt, At) ; bβt β Γ : X h Rs g ψX(xs, As) β i ψX(xs, As) = 0 ; 84 Sample eβt from distribution N( bβt, ϱ2 X,t V 1 t ). Moreover, for any t [T], q log(1/δ) log(T) p log(1/δ) + q log(1 + t/q) ρY,t. Similarly, we select ϱX,t correspondingly, leading to the following choices: d log(1/δ) log(T) and ϱY,t p q log(1/δ) log(T). (C.3) C.2.3. EXTENSION FOR HEAVY-TAILED NOISE Similarly, one can also devise the UCB-type algorithm for sub-Gaussian εt, as detailed in Algorithm C.2 in Appendix C. In cases where εt follows a general sub-Weibull distribution or only has finite moments of order 1 + ϵ with ϵ (0, 1], specific adjustments are necessary, such as applying the median of means (Mo M) estimator for linear bandits (Medina & Yang, 2016; Shao et al., 2018) or Huber regression (Sun et al., 2020; Kang & Kim, 2023). D. Supplement to Simulation D.1. Detailed Simulation Setting for Multi-Armed Bandits Our simulations were performed on a Mac mini (2020) with an M1 chip and 8GB of RAM. We first detail the UCB baselines and TS baselines as follows: UCB baselines: We consider following UCB-type algorithms for comparison. At round t, the agent takes action At = maxk [K] Uk(t) with Zero-Inflated Bandits the k-th arm s upper bounds Uk(t) = Rk(t) + q 2τ 2 k log(2/δ) ck(t) for sub-Gaussian rewards Uk(t) = Rk(t) + α2 k q αk log(2/δ) ck(t) for sub-Exponential rewards. 
Here, the size parameters τ 2 and α for the true rewards are determined using the following methods: Using the original size parameters for the non-zero part Xk, assuming they are known, as the size parameter for constructing Uk(t); Using the estimated variance of the rewards from k-th arm as the size parameter τ 2 k and αk for constructing Uk(t); Using the estimated size parameter for k-th arm as follows: For sub-Gaussian Xk with sub-Gaussian variance proxy σ2 k, the sub-Gaussian variance proxy for Rk is solved by τ 2 k = max s R 2 s2 h sbµkbpk + log(1 bpk + bpkesbµk+s2σ2 k/2) i . where bpk is taken as the average of observations Yk; For sub-Exponential Xk with the single sub-Exponential parameter λk, the sub-Exponential parameter αk for the rewards from the k-th arm is solved by α2 k = λ2 k max s R 2 s2 h sbµkbpk + log(1 bpk + pkesbµk+s2λ2 k/2) i . Here bµk and bpk are taken as the averages of observations Xk and Yk, respectively. (Strong baseline) Using the true size parameter for the reward of each arm {Rk}K k=1. TS baselines: Here we exclusively consider the TS-type algorithm suitable for general sub-Gaussian distributions, namely the MOTS algorithm (Jin et al., 2021). For Gaussian and mixed-Gaussian rewards, we can directly apply both Algorithm C.1 and the MOTS algorithm. But, when applying with Exponential rewards, we adopt Algorithm 1 from Shi et al. (2023). In doing so, we integrate their step 5 with our algorithm and the MOTS algorithm. This ensures that both our method and the one proposed in Jin et al. (2021) are correctly adapted for use with sub-Gaussian distributions after the GMS generation. Simulation settings and extended results: For UCB-type algorithms, we set the confidence level δ = 4/T 2 consistently across all experiments. Prior parameters and tuning parameters for TS-type algorithms follow the recommendations in (Jin et al., 2021; Shi et al., 2023) for the MOTS algorithm and GMS generation. 
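The size-parameter maximization above has no closed form, but it can be solved numerically; a grid-search sketch follows (the function name and the grid are illustrative choices, and the sign of the first term follows the moment-generating-function derivation in the proof of Lemma E.1).

```python
import math

def zi_variance_proxy(mu, p, sigma2, s_grid=None):
    """Sketch: sub-Gaussian variance proxy tau^2 for R = X*Y with
    X ~ subG(sigma2) of mean mu and Y ~ Bernoulli(p), via
    tau^2 = max_s (2/s^2)[-s*mu*p + log(1 - p + p*exp(s*mu + s^2*sigma2/2))].
    A simple grid search over s stands in for the exact maximization."""
    if s_grid is None:
        s_grid = [i / 100.0 for i in range(-500, 501) if i != 0]
    best = 0.0
    for s in s_grid:
        val = (2.0 / s**2) * (-s * mu * p
                              + math.log(1 - p + p * math.exp(s * mu + s**2 * sigma2 / 2)))
        best = max(best, val)
    return best
```

As a sanity check, at p = 1 and mu = 0 the expression reduces to sigma2 for every s, so the proxy recovers the non-zero part's own variance proxy; large mu with intermediate p inflates it, which is exactly the phenomenon Lemma E.1 formalizes.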
Beyond Figure 2, we provide additional simulations across different parameter ranges: Figure 5 shows results for alternative $p$ settings, while Figures 6 and 7 examine extremely small and relatively large zero-inflation probabilities, respectively. We also extend our analysis to bounded rewards in Figure 8. Our methods consistently outperform existing approaches across most settings. However, occasional underperformance against certain proxy-based UCB methods can occur, particularly with exponential rewards under specific parameter ranges (e.g., $p_k \sim U[0.1, 0.3]$). As demonstrated in Lemma E.1, such proxy methods can become unreliable under high variance or strong zero-inflation, whereas our approach is more robust because it directly models the zero-inflated structure. For bounded rewards, our UCB method shows clear advantages in most cases. The only exception occurs when the zero-inflation level is low (i.e., $p_k$ is large), in which case the true Hoeffding-based UCB (with access to the exact bound) performs slightly better. Nonetheless, this case is rare in practice, as the exact-bound baseline is not available. In all other cases, especially when skewness or sparsity increases, our method shows clear advantages. All simulations for each parameter setting and distribution were completed within 24 hours. D.2.
Detailed Simulation Setting for Contextual Bandits

In our contextual bandit scenario, we employ two UCB baseline methods for comparison with our method.

Figure 5. Simulation for zero-inflated MAB with $K = 10$ and $T = 75000$ over $N = 50$ replications. The four rows represent $p \sim U[0.10, 0.15]$, $p \sim U[0.15, 0.20]$, $p \sim U[0.20, 0.25]$, and $p \sim U[0.25, 0.30]$, respectively. The compared methods are product UCB (ours), product TS (ours), non-zero-part-based UCB, variance-based UCB, estimated-proxy-based UCB, and true-proxy-based UCB.

(Naive Method) We ignore the zero-inflated structure. The upper bound for any action $a \in [K]$ is defined as
$$\mathrm{UCB}_t(a) := \psi_X(x_t, a)^\top \hat\beta_{\text{naive}} + \rho_{X,t} \big\|\psi_X(x_t, a)\big\|_{U_{\text{naive}}^{-1}},$$
where $U_{\text{naive}} = \lambda_U I_d + \sum_{s=1}^{t} \psi_X(x_s, A_s)\psi_X(x_s, A_s)^\top$ and $\hat\beta_{\text{naive}} = U_{\text{naive}}^{-1} \sum_{s=1}^{t} R_s \psi_X(x_s, A_s)$.

(Integrated Component Method) We correctly account for the zero-inflated structure.
The upper bound for any action $a \in [K]$ is defined as
$$\mathrm{UCB}_t(a) := \psi_X(x_t, a)^\top \hat\beta_{\text{integrated}} \cdot h\big(\psi_Y(x_t, a)^\top \hat\theta_{\text{integrated}}\big) + \sqrt{\rho_{X,t}\,\rho_{Y,t}}\, \big\|\big(\psi_X(x_t, a)^\top, \psi_Y(x_t, a)^\top\big)^\top\big\|_{W_{\text{integrated}}^{-1}}$$
with
$$\big(\hat\beta_{\text{integrated}}^\top, \hat\theta_{\text{integrated}}^\top\big)^\top = \arg\min_{(\beta, \theta) \in \Gamma \times \Theta} \sum_{s=1}^{t} \Big[R_s - \psi_X(x_s, A_s)^\top \beta \cdot h\big(\psi_Y(x_s, A_s)^\top \theta\big)\Big]^2$$
and $W_{\text{integrated}} = \lambda_V I_{2d} + \sum_{s=1}^{t} \big(\psi_X(x_s, A_s)^\top, \psi_Y(x_s, A_s)^\top\big)^\top \big(\psi_X(x_s, A_s)^\top, \psi_Y(x_s, A_s)^\top\big)$.

Figure 6. Simulation for zero-inflated MAB under $p \sim U[0.0005, 0.01]$ with $K = 10$ and $T = 10000$ over $N = 50$ replications.

Figure 7. Simulation for zero-inflated MAB under $p \sim U[0.75, 0.95]$ with $K = 10$ and $T = 25000$ over $N = 50$ replications.

Figure 8. Simulation for zero-inflated MAB under the scaled Beta distribution of the form $X_k \sim 2\mu_k\,\mathrm{Beta}(p_k, p_k)$, such that $E[X_k] = \mu_k$, with $K = 10$ and $T = 10000$ over $N = 50$ replications.

The first baseline (Naive Method) results in model misspecification. We design this baseline because this issue arises when researchers fail to correctly specify the zero-inflation structure.
In contrast, the second baseline (Integrated Component Method) appropriately specifies the model. It estimates the two unknown vectors $\beta$ and $\theta$ as a single vector $(\beta^\top, \theta^\top)^\top$ from the model $R_t = \psi_X(x_t, A_t)^\top \beta \cdot h\big(\psi_Y(x_t, A_t)^\top \theta\big) + \varepsilon_t$, which originates from a range of generalized linear contextual bandit literature (Li et al., 2017; Chen et al., 2020; Zhao et al., 2020). Similarly, for the naive TS and integrated TS algorithms, we sample $\tilde\beta_{\text{naive}}$ from $N(\hat\beta_{\text{naive}}, \phi_\beta^2 U_{\text{naive}}^{-1})$ and $(\tilde\beta_{\text{integrated}}^\top, \tilde\theta_{\text{integrated}}^\top)^\top$ from $N\big((\hat\beta_{\text{integrated}}^\top, \hat\theta_{\text{integrated}}^\top)^\top, (\phi_\beta^2 \vee \phi_\theta^2) W_{\text{integrated}}^{-1}\big)$, respectively. Our experimental setup aligns with that described in Section D.2 of Wu et al. (2022): the dimension $d$ is set to 10, and there are $K = 100$ arms in this context. The true parameter for the non-zero part, $\beta$, has a norm of 1 and is uniformly distributed, with each entry $\beta_i \sim U[0, 1]$ for $i \in [d]$. The zero-inflation parameter $\theta$ exhibits sparsity $s$, meaning only $s$ elements of $\theta$ are drawn from a uniform distribution, while the rest are zeros. Similarly, for each arm $k \in [K]$, we generate $\nu_k$ in exactly the same way as $\theta$. At each round $t$, the context of each arm $k$ is sampled as $x_{k,t} \sim N_d\big(\nu_k, \tfrac{1}{2K} I_d\big)$. Thus, in this setting, we have $\psi_X(x_t, A_t) \sim N\big(\nu_{A_t}, \tfrac{1}{2K} I_d\big)$ and $\psi_Y(x_t, A_t) = \varphi_Y\big(N(\nu_{A_t}, \tfrac{1}{2K} I_d)\big)$ with the element-wise mapping $\varphi_Y: x \mapsto \sin(x)$. Lastly, the noise term $\varepsilon_t$ follows a mean-zero distribution $F(0, \sigma^2)$ with standard deviation $\sigma = 1$. In detailing the tuning parameters for the UCB algorithms, we adopt the configurations in Appendix C.2. For the TS algorithms, we follow the original setting in Agrawal & Goyal (2013) while adding tuning parameters $\epsilon_X$ and $\epsilon_Y$ with $\epsilon_X \vee \epsilon_Y \asymp \log^{-1} T$, as suggested by Zhong et al. (2021), so that $\phi_\beta^2 = \frac{24\sigma^2 d}{K^2 \epsilon_X} \log(1/\delta)$ and $\phi_\theta^2 = \frac{24\sigma^2 d}{K^2 \epsilon_Y} \log(1/\delta)$. For simplicity, in our implementation we fix $\epsilon_X$ and $\epsilon_Y$ to $\log^{-1} T$. Additionally, for both UCB and TS algorithms, we set the parameter $\delta$ to $1/T$. Further details are provided in Section 5 of Kang & Kim (2023).
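The Naive Method baseline above reduces to ordinary ridge regression of the raw (zero-inflated) rewards on the features. A compact sketch, with function and variable names ours:

```python
import numpy as np

def naive_ridge(Phi, R, lam=1.0):
    """Sketch of the Naive Method estimator: ridge regression that ignores
    the zero-inflation structure. Computes
    U = lam*I + Phi^T Phi and beta_hat = U^{-1} Phi^T R."""
    d = Phi.shape[1]
    U = lam * np.eye(d) + Phi.T @ Phi
    # solve the linear system rather than forming the inverse explicitly
    return np.linalg.solve(U, Phi.T @ R)
```

When many rewards are exactly zero, this estimator shrinks the fitted mean toward zero for every arm, which is precisely the misspecification the baseline is designed to exhibit.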
In addition to Figure 3, we present simulation results with different sparsity levels $s$ in Figure 9 and comparisons with another semiparametric contextual bandit algorithm in Figure 10. We choose SPUCB from Peng et al. (2019) as the representative baseline, as other semiparametric algorithms exhibit similar performance. The results for various link functions $h(\cdot)$ are quite similar and have therefore been omitted for brevity. As in the MAB experiments, simulations were conducted on a Mac mini (2020) with an M1 chip and 8GB of RAM, with all runs completed within 48 hours.

D.3. Detailed Simulation Setting for Real Data Application

The loan records data, collected by Columbia Business School, encompass all auto loan records (totaling 208,085) from a major online auto loan lending company, spanning from July 2002 to November 2004. This dataset can be accessed upon request at https://business.columbia.edu/cprm. In our analysis, we treat the monthly payment and loan term as the raw action $A_t \in \mathbb{R}_+$. We then compute the price as
$$\text{price} = \sum_{t=1}^{\text{Term}} \frac{\text{Monthly payment}}{(1 + \text{Prime rate})^{t}} - \text{amount of loan requested},$$
which accounts for the net present value of future payments, following the approach used in Phillips et al. (2015); Bastani et al. (2022); Chen et al. (2023). Similarly, we exclude outlier records, as done in Chen et al. (2023). In this dataset, the binary decision of the applicant is denoted as $Y_t \in \{0, 1\}$. Following the approach in Chen et al. (2023), we define the synthetic observed reward as $R_t = A_t \cdot Y_t$. The context vector $x_t \in \mathbb{R}^5$ represents a set of features identified as significant in Ban & Keskin (2021); Chen et al. (2023): the FICO credit score, the approved loan amount, the prime rate, the competitor's rate, and the loan term. The range of the raw action $A_t$ spans from 0.0004 to 28.5725. To simplify the online decision process, we introduce a discretization of $A_t$ into categorized actions, denoted as $\tilde A_t$.
This discretization follows a specific transformation $\ell(A_t)$, defined as follows:
$$\tilde A_t = \ell(A_t) := \begin{cases} \text{low}, & \text{if } A_t \le 1.5400, \\ \text{medium}, & \text{if } 1.5400 < A_t \le 3.6390, \\ \text{high}, & \text{if } 3.6390 < A_t \le 5.4164, \\ \text{very high}, & \text{if } 5.4164 < A_t \le 8.5466, \\ \text{luxury}, & \text{if } A_t > 8.5466. \end{cases}$$

Figure 9. Simulation for zero-inflated contextual bandits with $T = 20000$ over $N = 25$ replications at different sparsity levels. The three rows represent $s = 1, 3, 5$, respectively. The compared methods are product UCB (ours), product TS (ours), misspecified UCB, integrated UCB, misspecified TS, and integrated TS.

Figure 10. Comparison with the SPUCB algorithm for zero-inflated contextual bandits with $T = 10{,}000$ rounds over $N = 25$ independent replications across different sparsity levels (Gaussian rewards with $s = 3, 5, 7$).

Figure 11. The density of the raw actions and the corresponding discretization.
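The price computation and the discretization map above translate directly into code; both functions follow the formulas stated in the text (function names are ours):

```python
def net_present_price(monthly_payment, term, prime_rate, loan_amount):
    """price = sum_{t=1}^{Term} MonthlyPayment / (1 + PrimeRate)^t - LoanAmount,
    i.e., the net present value of future payments minus the requested amount."""
    npv = sum(monthly_payment / (1.0 + prime_rate) ** t for t in range(1, term + 1))
    return npv - loan_amount

def discretize_action(a):
    """The transformation l(A_t): thresholds are the stated price quantiles."""
    if a <= 1.5400:
        return "low"
    if a <= 3.6390:
        return "medium"
    if a <= 5.4164:
        return "high"
    if a <= 8.5466:
        return "very high"
    return "luxury"
```

With a zero prime rate the price reduces to total payments minus the loan amount, which is a quick sanity check on the formula.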
This discretization is derived from specific quantiles of the offered price: the 25% and 75% quantiles, the 66% quantile of the offers beyond the 75% quantile, and the 95% quantile of the remaining offers. We have accordingly plotted the density $f(a)$ of the raw action set $\{A_t\}_{t=1}^T$, as shown in Figure 11. Again, the decision of applicants mainly depends on the nominal profit $A_t$ offered, but it also depends on the environment. Thus, we first model the binary selection $Y_t \in \{0, 1\}$ (whether the applicant accepts the offer) with a binary regression:
$$P(Y_t = 1) = h\big((A_t, 1, b^\top(x_t))\,\theta\big) = h\big(A_t\theta_1 + \theta_2 + b^\top(x_t)\theta_{-1,-2}\big)$$
for various link functions $h(\cdot)$ and suitable context transformations $b(\cdot)$. Intuitively, the raw action $A_t$, the nominal profit that the company offers, is the most relevant factor. The offline estimation of $\theta$ and its performance, measured via the estimated FDR (Dai et al., 2023) on the whole dataset, are shown in Table 3.

Table 3. Estimated FDR under data splitting. For each $b(\cdot)$, the best degree is given in brackets.

| $b(\cdot)$ \ $h(\cdot)$ | logit | probit | cauchit | log | cloglog |
| linear | 0.1831 | 0.1912 | 0.1699 | 0.0983 | 0.1835 |
| polynomial | 0.1682 (3) | 0.1705 (3) | 0.1610 (3) | 0.3483 (1) | 0.1707 (3) |
| spline | 0.1743 (5) | 0.1755 (5) | 0.1692 (3) | 0.3483 (1) | 0.1790 (5) |
| kernel (Epanechnikov) | 0.2052 | | | | |
| kernel (Triangular) | 0.2050 | | | | |
| kernel (Gaussian) | 0.2049 | | | | |

As shown in Table 3, when we choose $b(\cdot): x \mapsto x$, the estimated FDR for the whole dataset is no more than 0.1, a quite small value, which means this model characterizes the dataset quite well (Dai et al., 2023; Ren & Barber, 2024; Wei et al., 2024). Next, to simulate the counterfactual outcomes in the online setting, and recognizing that rewards would otherwise be deterministic given the non-zero part, we introduce noise to better mimic real-world scenarios, such as deviations in payment plans or additional service purchases.
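The offline binary-regression step above can be sketched with a plain maximum-likelihood fit. This is an illustrative sketch only: we show the logistic link, whereas the paper compares several links, and the simple gradient-ascent loop stands in for whatever solver was actually used.

```python
import numpy as np

def fit_logistic(Z, y, lr=0.1, iters=2000):
    """Sketch: fit P(Y=1) = sigmoid(Z @ theta) by gradient ascent on the
    Bernoulli log-likelihood. Z stacks (A_t, 1, b(x_t)) row-wise."""
    theta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))       # current acceptance probabilities
        theta += lr * Z.T @ (y - p) / len(y)       # average score function step
    return theta
```

The fitted $\hat\theta$ then drives the acceptance probabilities in the online simulation described next.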
Two important points here are: (1) conditional on the raw action, the noise is mean-zero; and (2) the noise should be related to the context. This means the true reward in the online procedure at each round, $X_t = g(A_t, x_t; \beta) + \varepsilon_t$, which the company will obtain, satisfies $E[X_t \mid A_t] = E[g(A_t, x_t; \beta) + \varepsilon_t \mid A_t] = A_t$.

Figure 12. The estimation of $\beta$ under different $\lambda$; the red line is the estimated coefficient for the raw action $A_t$, and the blue line is the $\ell_1$-norm plus one of the estimated coefficients of the environment $b(x_t)$.

To simplify the procedure, we just let $g(A_t, x_t; \beta) = (A_t, 1, b^\top(x_t))\,\beta = \beta_1 A_t + \beta_2 + \beta_{-1,-2}^\top b(x_t)$ with the same context transformation function $b(\cdot)$. Note that we do not observe $X_t$ in the real data (the observed reward is $R_t = A_t \cdot Y_t$ instead of $R_t = X_t \cdot Y_t$), so we manually create $X_t = A_t + \mathrm{Exp}(\lambda) - \lambda$, which satisfies $E[X_t \mid A_t] = A_t$, and use the whole dataset to estimate $\hat\beta$. The result of the offline estimation $\hat\beta$ under different $\lambda$ is shown in Figure 12. From Figure 12, we see that any $\lambda \in (0, 2]$ is reasonable, as $\hat\beta_1 \approx 1$ and $\|\hat\beta_{-1,-2}\|_1 \approx 0$, which means the nominal profit $A_t$ proposed by the company plays a major role, but the context $x_t$ does have some influence. We then implemented our method and the different baselines described in Section 5 in an online setting to compare their performance. Specifically, at each round $t$, the company selects a discrete action $\tilde A_t$ from the discrete set $\mathcal{A}_{\text{discrete}} = \{\text{low}, \text{medium}, \text{high}, \text{very high}, \text{luxury}\}$. The company then draws a potential nominal profit $A_t$ from the original density truncated to the chosen category, $f(a)\mathbf{1}\{a \in \ell^{-1}(\tilde A_t)\}$. Next, the potential real profit the company would receive under the current context $x_t$ is $\tilde X_t = (A_t, 1, b^\top(x_t))\,\hat\beta$. The applicant accordingly makes a binary decision $\tilde Y_t \in \{0, 1\}$ with probability $P(\tilde Y_t = 1) = h\big((A_t, 1, b^\top(x_t))\,\hat\theta\big)$.
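The counterfactual generation steps above (the synthetic non-zero reward and the applicant's decision) can be sketched as follows; the exponential noise uses the scale parameterization so that it has mean $\lambda$, and the logistic link is chosen purely for illustration (the paper compares several links).

```python
import math
import numpy as np

rng = np.random.default_rng(3)

def synthetic_nonzero_reward(a, lam):
    """X_t = A_t + Exp(lam) - lam; the exponential noise has mean lam
    (scale parameterization), so E[X_t | A_t] = A_t."""
    return a + rng.exponential(scale=lam) - lam

def applicant_decision(a, context_term, theta1, theta2):
    """Sample Y_t ~ Bernoulli(h(A_t*theta1 + theta2 + context_term)),
    with h taken as the logistic link for illustration."""
    logit = a * theta1 + theta2 + context_term
    prob = 1.0 / (1.0 + math.exp(-logit))
    return int(rng.random() < prob)
```

The two draws are independent given the action and context, matching the zero-inflated structure $R = X \cdot Y$ used throughout the paper.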
Next, the company receives the real reward calculated as e Rt = e Xt e Yt. The regret for each round is calculated as E[e Y t e X t ] E[e Yt e Xt], where e Y t and e X t represent the decision of the company and the random true profit according to the optimal action e A t defined as e A t arg max ea Adiscrete Z (a, 1, b (xt)) bβh (a, 1, b (xt))bθ f(a)1{ℓ 1(ea)} da. These online process mirrors that in Appendix D.2, with the same tuning parameters. We treat this dataset as a heavy-tailed contextual bandit, and thus use the Huber regression technique as described in (Kang & Kim, 2023). Once again, the simulations were conducted on a Mac mini (2020) with an M1 chip and 8GB of RAM, with results obtained within 6 hours. E. Supporting Lemma and Figures for Motivations Lemma E.1. Assume Xt µ sub G(σ2) and Yt Bernoulli(p), independent of Yt. Let Rt = Xt Yt. Then Rt µp is sub-Gaussian, with its sub-Gaussian variance proxy denoted by τ 2, satisfying: Zero-Inflated Bandits (i) Given any p (0, 1), σ2 > 0, and arbitrarily large M > 0, there exists µ > 0 such that τ 2 > Mσ2 for any µ > µ ; (ii) For any arbitrarily large M > 0, there exist σ2 , p , µ such that τ 2 > M var(Rt) for any µ > µ , σ2 < σ2 , and p [p , 1); (iii) (a) For any σ2 > 0 and arbitrarily large M > 0, there exists µ > 0 and p (0, 1) such that µ=µ ,p=p > M for any µ (0, µ ] and p [p , 1); (b) For any µ > 0 and arbitrarily large M > 0, there exists σ2 > 0 and p (0, 1) such that σ2=σ 2,p=p > M for any σ 2 (0, σ2 ] and p [p , 1); (c) For any µ R and arbitrarily large M > 0, there exists σ2 > 0 and p (0, 1) such that σ2=σ 2,p=p > M for any σ 2 (0, σ2 ] and p [p , 1). Analogous results apply for Xt µ sub E(λ), with σ2 replaced by λ2 and τ 2 by α2. We provide visual clarification for Lemma E.1 in Figure 13, which illustrates the lemma in the context where Xt N(µ, σ2). 
Figure 13 (a) and (b) show the values of τ 2 under different µ and p values when σ2 = 1, and the ratio of τ 2/ var(Rt) under varying p values when µ = 100, respectively. Figure 13 (c) presents the numerical derivative τ 2 p under different µ and p values when σ2 = 1. Figures 13(d) and (e) display the numerical derivatives τ 2 µ across a range of µ and σ2 values when µ = 100 and µ = 0.5, respectively. Zero-Inflated Bandits (a) Illustration for Lemma E.1 (i) (b) Illustration for Lemma E.1 (ii) (c) Illustration for Lemma E.1 (iii, a) (d) Illustration for Lemma E.1 (iii, b) (e) Illustration for Lemma E.1 (iii, c) Figure 13. The illustration Lemma E.1 when Xt is exactly Gaussian distributed. F. Proof of Lemmas In this section, we will provide the proofs for the concentration results presented in Section 2. We will also include relevant theoretical background and discussions. Notations: Let Pξ(A) = R A d Fξ(x) denote the probability of the event A, where Fξ(x) is the distribution function of the random variable ξ. Similarly, let Eξf(ξ) = R f(x)d Fξ(x) represent the expectation. We define a b = max{a, b} and a b = min{a, b} for any real numbers a and b. We first define the Revised-Generalized Bernstein-Orlicz (RGBO) transformation function Ψθ,L( ) based on the inverse function Ψ 1 θ,L(s) := p log(1 + s) + L(log(1 + s))(1/θ) 1 for any t 0. It is worthy to note that we replace (log(1 + s))1/θ with (log(1 + s))(1/θ) 1 in the Generalized Bernstein Orlicz function defined in van de Geer & Lederer (2013); Kuchibhotla & Chakrabortty (2022). It is easy to verify that is monotone increasing and Ψθ,L(0) = 0, and we can define the RGBO norm of a variable random X such that X Ψθ,L = inf{η > 0 : EΨθ,L(|X|/η) 1}. In contrast to the existent literature only care about heavy tail case θ < 1, Lemma F.2 provides the uniform optimal concentration in sense of rate for any θ > 0 with explicit constants. 
Before stating this lemma, we first give the equivalence of RGBO norm and concentration inequality stated as follows. Lemma F.1. For any zero-mean variable X and L > 0, we have X Ψθ,31/θ 1/2L 3τ P n |X| > τ s + Ls(1/θ) 1 o 2e s, for any s 0. Proof. The proof is exactly the same as the proof of Lemma 1 and Lemma 2 in (van de Geer & Lederer, 2013). It is worthy Zero-Inflated Bandits to note that the probability here can be rewritten as P n |X| > X Ψθ,31/θ 1/2LΨ 1 θ,L(es 1)/ for any s 0 and hence P {|X| ℓ X θ,K} 2 1 + Ψθ,31/2 1/θK( for any ℓ 0. Under this lemma, another way to state Lemma 2.2 is using Bernstein-Orlicz norm (van de Geer & Lederer, 2013). As delineated in Lemma F.1, Lemma 2.2 can be reformulated as: µ X Ψ θ, p E(θ) 2 n 4e D(θ)C with probability 1 δ/2 for any n 4 log(2/δ)/p2. Lemma F.2 (Sharper Sub-Weibull Concentrations). Suppose Xi µ i.i.d. sub W(θ; C), then for any s 0, µ X Ψθ,n 1/2E(θ) 2n 1/2e 1D(θ)C, P µ X > 2e D(θ)C r s n + E(θ)s(1/θ) 1 where D(θ) and E(θ) are defined as 8e3(2π)1/4e1/24 e2/e/θ 1/θ , if θ < 1, p 3/(2e2) C 1 Cθ 1 , if 1 θ < 2, p 17/(6e2) C 1 Cθ/2 1 , if θ 2, 22/θ 1/2, if θ < 1, 1/ 6, if 1 θ < 2, 0, if θ 2. Proof. We consider the case θ < 1, 1 θ < 2, and θ 2 separately. If θ 2. From E exp |X µ|θ/Cθ 2, we know that |X µ|θ/Cθ j j!Cjθ 1. (F.2) This implies for any k N, by θ 2, E|X µ|2k 1 + E|X µ|kθ by (F.2) 1 + k!Ckθ = 1 + k!(1 Cθ/2)2k by kk k! (2k)! 1 + (1 Cθ/2)2k (2k 1)!!(2k)!! by Bohr Mollerup theorem 1 + (1 Cθ/2)2k (2k 1)!! 2kk! 1 + (2k 1)!! (2k 1)!! 2 2Cθ/2 2k. Zero-Inflated Bandits Therefore, the sub-Gaussian intrinsic moment norm for X µ satisfies X µ G 2 2Cθ/2, and thus µ X > 2 2Cθ/2 r µ X > X µ G for any s 0 by Theorem 2(b) in Zhang et al. (2023). If 1 θ < 2, (F.2) still holds for θ [1, 2). We claim there exist positive ν and κ such that 2ν2κk 2k!, k = 2, 3, . . . . (F.3) Indeed, (F.2) together with θ 1 implies E|X µ|k 1 + E|X µ|kθ 1 + k!Ckθ Hence, a sufficient condition for (F.3) is 1 + k!Ckθ 1 2ν2κk 2k! for any k = 2, 3, . . 
.. We rewrite this as k! + 2Ckθ, k = 2, 3, . . . . Therefore, we can take κ = 1 Cθ and ν = 3κ, and then Xi µ i.i.d. subΓ(ν, κ) by Lemma 2.2.11 in Vaart & Wellner (2023). We can then apply concentration for sub-Gamma distributions in Corollary 5.2 of Zhang & Chen (2021); Boucheron et al. (2013) and obtain that If θ < 1. Denote β is the conjugate of θ, i.e., β = . Then from Theorem 1 in Zhang & Wei (2022), we know that i=1 ai(µ Xi) 2e D(θ) b 2 s + 2e L n(θ)s1/θ b β where b = n 1C1n Rn with b 2 = Cn 1/2, b β = Cn 1, and D(θ) = ( 8e3(2π)1/4e1/24 e2/e/θ 1/θ, and L n(θ) = Ln(θ)D(θ) b 2/ b β with Ln(θ) := 41/θ b β 2 b 2 . Then we have 2e D(θ) b 2 s + 2e L n(θ)s1/θ b β = 2e D(θ)C r s n + E(θ)s1/θ where E(θ) = 41/θ 2θ . Combining these results, we obtain the concentration inequality in the lemma. For the result of the RGBO norm, we just use the lemma F.1. Lemma F.2 establishes a uniform result for the sample mean of i.i.d. sub-Weibull random variables. It is important to note that our result here is sharper than those in Kuchibhotla & Chakrabortty (2022) and Zhang & Wei (2022), especially for the case when θ 1, as they focus on general weighted summations. Another notable difference is in comparison to the sub-Weibull concentration results for sample means in Adamczak et al. (2011); Bogucki (2015); Hao et al. (2019), which require symmetry, while our approach does not. Consequently, we present a novel concentration result for the sample mean of i.i.d. sub-Weibull random variables. Next, we will show some anti-concentrations for the posterior distributions in Algorithm C.1, which are essential for the proof of TS-type algorithms. Lemma F.3. For any 0 x β α+β , we have P Beta(α, β) > α α + β + x Γ(β + α) β α + β x β α α + β + x α β + 2 β + 1 α + β Zero-Inflated Bandits Proof. First, we note that Beta(α, β) = 1 Beta(β, α). 
Then we can rewrite the probability as P Beta(α, β) > α α + β + x =P 1 Beta(β, α) > α α + β + x =P Beta(β, α) 1 α α + β x by Theorem 1 in (Henzi & D umbgen, 2023) Γ(β + α) βΓ(β)Γ(α) 1 α α + β x β α α + β + x α 1 + β + α 1 α α + β x β α + β x β α α + β + x α β + 2 β + 1 α + β Lemma F.4. Suppose ξ N(µ, σ2) and ζ Beta(α, β), then P ξ ζ αµ α + β + x 1 2c(α, β), if x µβ 2(α+β), 2π 2(α+β)x µβ [2(α+β)x µβ]2+(2α+β)2σ2 exp 1 2 2(α+β)x µβ (2α+β)σ 2 , if x > µβ 2(α+β), c(α, β) = Γ(β + α) Proof. First, we note that P ξ ζ αµ α + β + x = P ξ ζ (µ + y) α α + β + z P (ξ µ + y) P ζ α α + β + z with y, z defined as y = 2(α + β)x µβ 2α + β , z = β 2(α + β). From Lemma F.3, we obtain P ζ α α + β + z Γ(β + α) β α + β z β α α + β + z α β + 2 β + 1 α + β with c(α, β) is a constant only depending on α and β. If y 0, i.e., x µβ 2(α+β), then P (ξ µ + y) 1 2, and thus P ξ ζ αµ α + β + x c(α, β) Zero-Inflated Bandits If y > 0, i.e., x > µβ 2(α+β), then by Abramowitz et al. (1988), P (ξ µ + y) = P ξ µ + σ 2(α + β)x µβ (2α+β)σ 2(α+β)x µβ (2α+β)σ 2 + 1 exp 2(α + β)x µβ = (2α + β)σ 2π 2(α + β)x µβ [2(α + β)x µβ]2 + (2α + β)2σ2 exp 2(α + β)x µβ which leads to the final result. The remaining part of this section will consist of the proofs of the lemmas presented in the main content. Proof of Lemma 2.1: Proof. From the definition of sub-Weibull distribution, we know that E exp{|X µ|θ/Cθ X} 2. Note that for any a, b 0: if 0 θ 1, (a + b)θ aθ + bθ; if θ > 1, (a + b)θ 2θ 1(aθ + bθ). Hence, (a + b)θ 2θ 1 1 (aθ + bθ). Thus, for any C > 0, we have E exp{|R µp|θ/Cθ} = E exp (X µ)Y + µ(Y p) θ/Cθ E exp |X µ|Y + µ|Y p| θ/Cθ E exp 2θ 1 1 |X µ|θY θ + µθ|Y p|θ /Cθ E exp 2θ 1 1 |X µ|θ + µθ(pθ + (1 p)θ) /Cθ = exp 2θ 1 1 µθ(pθ + (1 p)θ) /Cθ E exp 2θ 1 1 |X µ|θ/Cθ . Since E exp{|X µ|θ/Cθ X} 2, we have 0 lim C + E exp{|R µp|θ/Cθ} lim C + exp 2θ 1 1 µθ(pθ + (1 p)θ) /Cθ lim CX C + E exp 2θ 1 1 |X µ|θ/Cθ This implies there exists CR CX such that E exp{|R µp|θ/Cθ R} 2. Proof of Lemma E.1: Proof. 
The result that R µp is sub-Gaussian or sub-Exponential in the lemma directly comes from Lemma 2.1 by setting θ = 2 and θ = 1. For the second result, we first prove the results for sub-Gaussian case. Denote that τ 2 as the minimal value which satisfies E exp{s(R µp)} exp{s2τ 2/2} for any s R. By the definition, we have E exp{s(R µp)} = E exp{s(XY µp)} = E exp{s(0 µp)}P(Y = 0) + E exp{s(X µp)}P(Y = 1) = e sµp(1 p + p Ees X). Zero-Inflated Bandits Since σ2 is the minimal value such that E exp{s(X µ)} exp{s2σ2/2} for all s R, it also is the minimal value such that Ees X esµ+s2σ2/2. This indicates τ 2 satisfies e sµp(1 p + pesµ+s2σ2/2) exp{s2τ 2/2}, τ 2 = max s R 2 s2 h sµp + log(1 p + pesµ+s2σ2/2) i . f(s, µ, p, σ2) := 2 h sµp + log(1 p + pesµ+s2σ2/2) i with τ 2 = maxs R f(s, µ, p, σ2) = f(s , µ, p, σ2). We will first show that µ > 0, lim p 0 s = + and lim |µ| σ2 0 lim p 1 s = 0 + . Indeed, s satisfies sf(s, µ, p, σ2) = 0, i.e., s3 log(1 p + pes µ+s2 σ2/2) + 2 p(µ + σ2s ) p + (1 p)e s µ s2 σ2/2 = 0, or say, 2 p log(1 p + pes µ+s2 σ2/2) = s (µ + σ2s ) p + (1 p)e s µ s2 σ2/2 + µ . (F.4) As we can see whenever s is finite, we have limp 0 2 p log(1 p + pes µ+s2 σ2/2) = + , then there must be limp 0 s = + since µ > 0. On the other hand, by letting p = 1, the above equation becomes log(1 + es µ+s2 σ2/2) = s µ + s2 σ2/2. Since limx log(1 + ex) x = + , limx 0 log(1 + ex) x = log 2, we must have lim|µ| σ2 0 limp 1 s = 0+. 
Now, consider τ 2 σ2 = 2 s2 σ2 h s µp + log(1 p + pes µ+s2 σ2/2) i = p p + (1 p)e s µ s2 σ2/2 + µ(1 p)p sσ2 es µ+s2 σ2/2 + 1 pes µ+s2 σ2/2 + 1 p Then consider µ > 0, by taking a fixed p (0, 1), and denote s = s (p = p, µ, σ2) we get that p=p = p p + (1 p)e sµ s2σ2/2 + µ(1 p)p sσ2 esµ+s2σ2/2 + 1 pesµ+s2σ2/2 + 1 p sσ2 esµ+s2σ2/2 + 1 pesµ+s2σ2/2 + 1 p Since for any x 0 and p (0, 1) x = log ex log(1 p + pex) = log(1 p) + log 1 + p 1 pex log(1 p) + log p 1 pex = x log p, and note that s satisfies (F.4), we have s (µ + σ2s) p + (1 p)e sµ s2σ2/2 + µ = 2 p log(1 p + pesµ+s2σ2/2) by the inequality above 2 p sµ + s2σ2/2 + log p, sµ + s2σ2/2 i . Zero-Inflated Bandits Note that the left hand above is less than s (µ + σ2s) p + (1 p)e sµ s2σ2/2 + µ s (µ + σ2s) p + µ = (1 + p 1)sµ + p 1s2σ2, while the right hand above is larger than 2 p log(1 p + pesµ+s2σ2/2) 2(sµ + s2σ2/2 + log p) p = 2p 1sµ + 2p 1 log p + p 1s2σ2. Since for any fixed p (0, 1), we have p 1 > 1. Thus, we must have limµ + s = 0, or ( 2 p log(1 p + pesµ+s2σ2/2) s (µ + σ2s) p + (1 p)e sµ s2σ2/2 + µ ) sµ + 2p 1 log p > 0, which leads to a contradiction on the equation (F.4). Therefore, we must have lim µ + τ 2 p=p lim µ + µ(1 p)p sσ2 esµ+s2σ2/2 + 1 pesµ+s2σ2/2 + 1 p = lim µ + 2µ(1 p)p which concludes the results in (i). Then we will prove the results in (iii). By envelope theorem, pf(s, µ, p, σ2) s=s = 2 s µ + es µ+s2 σ2/2 1 1 + p es µ+s2 σ2/2 1 The fact that ex 1+ex is bounded on x [0, ) and the fact lim|µ| σ2 0 limp 1 s = 0+ ensure that lim |µ| σ2 0 lim p 1 τ 2 p = lim |µ| σ2 0 lim p 1 2 s2 [ s µ + 1] = + . By the continuity of f(s , , , ) on µ, p, and σ2, we get the first two results in (iii). Similarly, we can show that p + pes µ+s2 σ2/2 1 + p es µ+s2 σ2/2 1 which ensures limσ2 0 limp 1 τ 2 µ = + the last result in (iii). Finally, for the result in (ii), we note that var(R) = E (XY µp)2 = p var(X) + p(1 p)µ2. Then by σ2 var(X), we have var(R) = τ 2 p var(X) + p(1 p)µ2 τ 2 pσ2 + p(1 p)µ2 . 
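The variational characterization τ2 = sup_{s≠0} (2/s2)[−sµp + log(1 − p + p e^{sµ+s²σ²/2})] and the relation var(R) = pσ2 + p(1 − p)µ2 ≤ τ2 can be sanity-checked numerically. A minimal sketch, assuming X ∼ N(µ, σ2) (so that var(X) = σ2) and using a crude grid search; the values of µ, p, σ2 are our illustrative choices:

```python
import math

def tau_sq(mu, p, sigma2, s_grid):
    """Grid approximation of the variance proxy
    tau^2 = sup_{s != 0} (2/s^2) * [-s*mu*p + log(1 - p + p*exp(s*mu + s^2*sigma2/2))]."""
    best = float("-inf")
    for s in s_grid:
        if s == 0:
            continue
        val = (2.0 / s ** 2) * (-s * mu * p
                                + math.log(1 - p + p * math.exp(s * mu + s ** 2 * sigma2 / 2)))
        best = max(best, val)
    return best

grid = [i / 100 for i in range(-500, 501)]   # s in [-5, 5]
mu, p, sigma2 = 1.0, 0.3, 0.5
t2 = tau_sq(mu, p, sigma2, grid)
var_R = p * sigma2 + p * (1 - p) * mu ** 2   # variance of the zero-inflated reward R = XY
assert t2 >= var_R - 1e-6                    # the proxy dominates the true variance
assert abs(tau_sq(mu, 1.0, sigma2, grid) - sigma2) < 1e-6   # p = 1: no inflation, proxy = sigma^2
```

The grid maximum approaches τ2 from below (the value near s = 0 already equals var(R)), so the domination of the true variance is preserved; setting p = 1 removes the zero inflation and recovers the sub-Gaussian proxy σ2 of X itself.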
Take µ = cµ with arbitrary c > 1, by letting σ2 0 and p 1, we have pσ2 + p(1 p)µ2 τ 2|µ=µ >0 minµ [µ ,cµ ] τ 2 µ µ=µ (cµ µ ) pσ2 + p(1 p)c2µ2 by the result in (iii) (c) (c 1)Mµ pσ2 + p(1 p)c2µ2 + Zero-Inflated Bandits which gives the result in (ii). For the case that X µ sub E(λ), one only need to note that the sub-Exponential parameter α for R µp satisfies e sµp(1 p + pesµ+s2λ2/2) exp{s2α2/2} for any s 1 λ, which implies α2 = λ2 max s R 2 s2 h sµp + log(1 p + pesµ+s2λ2/2) i . Since λ2 g(λ, µ, p) with differential g( , , ) is also differential on its domain except the points that λ2 = g(λ, µ, p), the above results regarding large values will still hold. Thus, we finish the proof. Proof of Lemma 2.2: Proof. Denote B binomial(n; p) independent with {Xt}n t=1, we consider the positive p/2 > 0, by concentration for Bernoulli, we have P(B pn/2) = 1 P(p B p/2) 1 exp np2/4 . Given any δ > 0, the above inequality ensures with probability at least 1 δ/2 for any n 4 p2 log(2/δ). Now, denote Xk as the k-th observed Xt. Consider s > 0 which will be determined later, P µ X > 2e D(θ)CX r s np/2 + E(θ)s(1/θ) 1 > 2e D(θ)CX r s np/2 + E(θ)s(1/θ) 1 > 2e D(θ)CX r s np/2 + E(θ)s(1/θ) 1 > 2e D(θ)CX B + E(θ)s(1/θ) 1 where the last step is by Lemma F.2. Finally, by letting 2e s = δ/2, i.e., s = log(4/δ), we conclude the inequality in the lemma. Proof of Lemma B.1: Proof. Denote with z will be determined later. The proof idea comes from Lemma 1 in Bubeck et al. (2013). Denote Xk as the k-th Zero-Inflated Bandits observed Xt. Consider some positive s, k=1 Xk1(|Xk| Mk) > t k=1 EX1(|X| > Mk) + 1 h EX1(|X| Mk) Xk1(|Xk| Mk) i > s # M M ϵ k + 1 h EX1(|X| Mk) Xk1(|Xk| Mk) i > s # (1 + ϵ)M 1 1+ϵ log ϵ 1+ϵ z B ϵ 1+ϵ + 1 h EX1(|X| Mk) Xk1(|Xk| Mk) i > s # where B = Pn i=1 Yi binomial(n; p) is independent with {Xi}n i=1. The last inequality is using M M ϵ k = M 1 1+ϵ log ϵ 1+ϵ z 1 k=1 s ϵ 1+ϵ M 1 1+ϵ log ϵ 1+ϵ z 1 0 s ϵ 1+ϵ ds = (1 + ϵ)M 1 1+ϵ log ϵ 1+ϵ z B ϵ 1+ϵ . 
Next, we consider the positive p/2 > 0, by concentration for Bernoulli, we have P(B pn/2) = 1 P(p B p/2) 1 exp np2/4 . Given any δ, the above inequality ensures B pn/2 with probability at least 1 δ/2 for any n 4 p2 log(2/δ). Then by Bernstein s inequality h EX1(|X| Mk) Xk1(|Xk| Mk) i > s (1 + ϵ)M 1 1+ϵ log ϵ 1+ϵ z B ϵ 1+ϵ h EX1(|X| Mk) Xk1(|Xk| Mk) i > s (1 + ϵ)M 1 1+ϵ log ϵ 1+ϵ z B ϵ 1+ϵ , B pn/2 # h EX1(|X| Mk) Xk1(|Xk| Mk) i > s (1 + ϵ)M 1 1+ϵ log ϵ 1+ϵ z (pn/2) ϵ 1+ϵ , B pn/2 # B s (1+ϵ)M 1 1+ϵ log ϵ 1+ϵ z (pn/2) ϵ 1+ϵ MM 1 ϵ k + Mk s (1+ϵ)M 1 1+ϵ log ϵ 1+ϵ z (pn/2) ϵ 1+ϵ pn s (1+ϵ)M 1 1+ϵ log ϵ 1+ϵ z (pn/2) ϵ 1+ϵ MM 1 ϵ n + Mn s (1+ϵ)M 1 1+ϵ log ϵ 1+ϵ z (pn/2) ϵ 1+ϵ Zero-Inflated Bandits s (1 + ϵ)M 1 1+ϵ log ϵ 1+ϵ z (pn/2) ϵ 1+ϵ = 4 log(2/δ) 4MM 1 ϵ n log(2/δ) pn , by s2/2 σ2+Ms/3 = A n . Let z = log(2/δ), we have s = (1 + ϵ)M 1 1+ϵ log ϵ 1+ϵ (2/δ) (pn/2) ϵ 1+ϵ + 4 log(2/δ) 4MM 1 ϵ n log(2/δ) pn = (1 + ϵ)2 ϵ 1+ϵ M 1 1+ϵ ϵ 1+ϵ + 4M 1 1+ϵ ϵ 1+ϵ + 2M 1 1+ϵ p " (1 + ϵ)2 ϵ 1+ϵ p ϵ 1+ϵ + 4 M 1 1+ϵ log(2/δ) which leads to the result. Proof of Lemma 6.1: Proof. From Theorem 16.2 in Lattimore & Szepesv ari (2020), an algorithm is deemed asymptotically optimal for problemdependent regret if it satisfies: lim inf T + R(T) k dinf(Pk, r1, Mk). where dinf(P, r, M) = inf P M {KL (P, P ) : ER P R > r}. Here the model class Mk = Xk Yk is ZI structure such that Xk = {X µk sub G(σ2) : P(X = 0) = 0} and Yk = {Y Bernoulli(pk) : pk (0, 1)}, where X Y. To prove the first part of the theorem, let us choose a subclass X k Xk such that X k = {X N(µk, σ2) : µk R}. Denote Pk := N(µk, σ2) Ber(pk) M k := X k Yk. The remaining task is to calculate dinf(Pk, r1, M k) = inf µk,pk:µkpk>r1 KL(P1, Pk). Here we first note that the independence between X k and Yk implies KL(P1, Pk) = KL N(µ1, σ2), N(µk, σ2) + KL Ber(p1), Ber(pk) 2σ2 + pk log(pk/p1) + (1 pk) log 1 pk Then consider the restriction µ1p1 > µkpk, there are two cases: If µ1 r1/pk. 
Then we can let pk = p1, and µk < r1/p1 = r1/pk suffices to satisfy the constraint. In this case, KL(Ber(p1), Ber(pk)) = 0, and thus inf_{µk,pk: µkpk>r1} KL(P1, Pk) = inf_{µk<r1/pk} KL(N(µ1, σ2), N(µk, σ2)). If µ1 > r1/pk, we let µk = µ1, and then pk < r1/µ1 = r1/µk satisfies the constraint. Similarly, in this case inf_{µk,pk: µkpk>r1} KL(P1, Pk) = inf_{pk<r1/µk} KL(Ber(p1), Ber(pk)). Thus, it suffices to bound the infimum derived in the first part. For the Gaussian part, it is straightforward to see that inf_{µk<r1/pk} … P(Uk(t) > r1) = P(U µ k(t) U p k(t) > rk + ∆k). Since ∆k > 0, the sharp properties of our concentration results in Section 2 also control this probability, ensuring an exponential decay rate over rounds. G.1. Proof of Theorem 4.1 Proof. For any δ > 0, define the upper confidence bound for pk up to round t as U p k(t, δ) := bpk(t) + √( … /(2ck(t))), where bpk(t) is the point estimate at round t. Based on the estimate bpk(t), define the upper confidence bound for µk as U µ k(t, δ) := bµk(t) + 2e D(θ)C √(log(4/δ)/(ck(t)bpk(t)/2)) + E(θ) log^{(1/θ)∨1}(4/δ)/(ck(t)bpk(t)/2), where bµk(t) is again the point estimate. For simplicity, we also write bpk(t) as bpk,m when ck(t) = m, and similarly define bµk,m; likewise, we write U p k(t, δ) as U p k,m(δ) when ck(t) = m, and similarly define U µ k,m(δ). Now we can define the good events as follows: G0 := {r1 < min_{t∈{m1, m1+1, . . . , T}} U µ 1,m1(δ) U p 1,m1(δ)} and Gr k := {U µ k,mk(δ) U p k,mk(δ) < r1}. Furthermore, define Gp k := {bpk,mk > pk − ϵk} for k ∈ [K] and Gk := G0 ∩ Gr k ∩ Gp 1 ∩ Gp k for k ≠ 1, where the ϵk will be determined later. Step 1: To bound P(Gc k), we use the identity P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A ∩ Bc) + P(B). Then we can decompose P(Gc k) ≤ P(G0,c ∪ Gp,c 1) + P(Gr,c k ∪ Gp,c k) = P((G0)c ∩ Gp 1) + P((Gp 1)c) + P((Gr k)c ∩ Gp k) + P((Gp k)c), where P((Gp 1)c) and P((Gp k)c) can be bounded easily. Indeed, denoting by d(p1, p2) the KL-divergence between two Bernoulli distributions with parameters p1 and p2, we have P((Gp 1)c) = P(bp1,m1 ≤ p1 − ϵ1) ≤ exp(−m1 d(p1, p1 − ϵ1)) and, similarly, P((Gp k)c) ≤ exp(−mk d(pk, pk − ϵk)).
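The Bernoulli Chernoff bound used in Step 1, P(p̂ ≤ p − ϵ) ≤ exp(−m d(p, p − ϵ)) with d(·, ·) the Bernoulli KL-divergence, is easy to confirm by simulation. A small sketch; the sample size m, probability p, and margin ϵ are our illustrative choices:

```python
import math, random

def kl_bern(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

random.seed(0)
m, p, eps, trials = 100, 0.5, 0.1, 10000
hits = 0
for _ in range(trials):
    phat = sum(random.random() < p for _ in range(m)) / m
    hits += phat <= p - eps
empirical = hits / trials
bound = math.exp(-m * kl_bern(p, p - eps))   # Chernoff bound on the lower tail
assert empirical <= bound
```

The empirical lower-tail frequency sits well below the Chernoff bound here, consistent with the exponential decay in m claimed above.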
For another two terms, we first consider to decompose the sample space as Ω= {bp1,m1 p1 + ϵ 1} {bp1,m1 < p1 + ϵ 1} with ϵ 1 > 0 will be determined later. Then P (G0)c Gp 1 P Ω (G0)c Gp 1 P(bp1,m1 p1 + ϵ 1) + P bp1,m1 p1 + ϵ 1, r1 U µ 1,m1(δ) U p 1,m1(δ), bp1,m1 > p1 ϵ1 exp( m1d(p1 + ϵ 1, p1)) + P r1 U µ 1,m1(δ) U p 1,m1(δ), p1 ϵ1 < bp1,m1 p1 + ϵ 1 . Note that for any real numbers a, b and any random variable X, Y with b, Y 0, P(ab XY ) P {a X} or {b Y } P(a X) + P(b Y ). Zero-Inflated Bandits By the above inequality, we can next bound the second term in the above bound, P r1 U µ 1,m1(δ) U p 1,m1(δ), p1 ϵ1 < bp1,m1 p1 + ϵ 1 bµ1,m1 + 2e D(θ)C log(4/δ) m1bp1,m1/2 + E(θ)log(1/θ) 1(4/δ) , p1 ϵ1 < bp1,m1 p1 + ϵ 1 bµ1,m1 + 2e D(θ)C log(4/δ) m1(p1 + ϵ 1)/2 + E(θ)log(1/θ) 1(4/δ) m1(p1 + ϵ 1)/2 , p1 ϵ1 < bp1,m1 p1 + ϵ 1 bµ1,m1 + 2e D(θ)C log(4/δ) m1(p1 + ϵ 1)/2 + E(θ)log(1/θ) 1(4/δ) m1(p1 + ϵ 1)/2 µ1 bµ1,m1 + 2e D(θ)C log(4/δ) m1(p1 + ϵ 1)/2 + E(θ)log(1/θ) 1(4/δ) m1(p1 + ϵ 1)/2 p1 bp1,m1 + Since we have p1 bp1,m1 + µ1 bµ1,m1 + 2e D(θ)C log(4/δ) m1(p1 + ϵ 1)/2 + E(θ)log(1/θ) 1(4/δ) m1(p1 + ϵ 1)/2 µ1 bµ1,m1 + 2e D(θ)C v u u u u t m1p1/2 + E(θ) by p1 p1+ϵ 1 p1 p1+ϵ 1 µ1 bµ1,m1 + 2e D(θ)C v u u u u u t m1p1/2 + E(θ) by Lemma 2.2 Zero-Inflated Bandits we can obtain P r1 U µ 1,m1(δ) U p 1,m1(δ), p1 ϵ1 < bp1,m1 p1 + ϵ 1 4(δ/4) which concludes that P (G0)c Gp 1 exp m1d(p1 + ϵ 1, p1) + 4(δ/4) It remains to bound P Gr k)c Gp k . We first decompose Ω= {bpk,mk pk + ϵ k} {bpk,mk < pk + ϵ k} and similarly obtain that P Gr k)c Gp k P(bpk,mk pk + ϵ k) + P bpk,mk pk + ϵ k, r1 U µ k,mk(δ) U p k,mk(δ), bpk,mk > pk ϵk exp mkd(pk + ϵ k, pk) + P r1 U µ k,mk(δ) U p k,mk(δ), pk ϵk < bpk,mk pk + ϵ k . 
The second term in above can be furthermore bounded by P r1 U µ k,mk(δ) U p k,mk(δ), pk ϵk < bpk,mk pk + ϵ k bµk,mk + 2e D(θ)C log(4/δ) mkbpk,mk/2 + E(θ)log(1/θ) 1(4/δ) , pk ϵk < bpk,mk pk + ϵ k bµk,mk + 2e D(θ)C log(4/δ) mkbpk,mk/2 + E(θ)log(1/θ) 1(4/δ) k, pk < bpk,mk pk + ϵ k bµk,mk + 2e D(θ)C log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 , pk ϵk < bpk,mk pk + ϵ k bµk,mk + 2e D(θ)C log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 pk + pk k 2rk + k bµk,mk + 2e D(θ)C log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 Pk,µ + Pk,p, Zero-Inflated Bandits where the last step is by the fact that for any real numbers a, b and any random variable X, Y with b, Y 0, P(ab XY ) P {a X} or {b Y } P(a X) + P(b Y ). Now, the two parts in the upper bound of P r1 U µ k,mk(δ) U p k,mk(δ), pk ϵk < bpk,mk pk + ϵ k can be furthermore bounded. Indeed, Pk,p can be bounded as pk + pk k 2rk + k bpk,mk + pk k 2rk + k 2mk bpk,mk pk pk k 2rk + k pk k 2rk + k mk 2(2rk + k)2 p2 k 2 k log 2 2pk bµk,mk + 2e D(θ)C log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 ( k 2pk 2e D(θ)C log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 by Lemma 2.2 ( k 2pk 2e D(θ)C log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 µk bµk,mk Ψ θ, pk E(θ) 2 mk 4e D(θ)C by (F.1) 2 1 + Ψ θ, 31/2 1/θpk E(θ) s = µk bµk,mk 1 Ψ θ, pk E(θ) " k 2pk 2e D(θ)C log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 Zero-Inflated Bandits By the increasing property of Ψθ,L( ), we obtain that Ψ θ, 31/2 1/θpk E(θ) Ψ θ, 31/2 1/θpk E(θ) pk 3mk 4e D(θ)C " k 2pk 2e D(θ)C log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 by (G.3) Ψ θ, 31/2 1/θpk E(θ) log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 =Ψ θ, 31/2 1/θpk E(θ) 3pk log(4/δ) 2(pk ϵk) + 3pk E(θ) log(1/θ) 1(4/δ) mk(pk ϵk) =Ψ θ, 31/2 1/θpk E(θ) v u u tlog 4 3pk 2(pk ϵk) + 31/2 1/θpk E(θ) 2 mk log(1/θ) 1 4 3(1/θ) 1p(θ 1)/2 k (pk ϵk)θ 1 by 3pk 2(pk ϵk) 2θ 1 3(1/θ) 1p(θ 1)/2 k (pk ϵk)θ 1 Ψ θ, 31/2 1/θpk E(θ) v u u tlog 4 3pk 2(pk ϵk) + 31/2 1/θpk 
E(θ) 2 mk log(1/θ) 1 4 3pk 2(pk ϵk) 3pk 2(pk ϵk) 1 log(4/δ) mk(pk ϵk)/2 + E(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 2pk , (G.3) and a sufficient condition for this is log(4/δ) mk(pk ϵk)/2 k e D(θ)CE(θ)log(1/θ) 1(4/δ) mk(pk ϵk)/2 k Thus, we can take mk 128e2D2(θ)C2pk(1 + pk)2 (pk ϵk) 2 k log 4 + 16e D(θ)CE(θ) pk(1 + pk) (pk ϵk) k log(1/θ) 1 4 Therefore, we can furthermore upper-bound Pk,µ as Pk,µ 2 1 + Ψ θ, 31/2 1/θpk E(θ) δ 3pk 2(pk ϵk) 1 + δ/2 3pk 2(pk ϵk) + δ Thus, we obtain that P Gr k)c Gp k exp mkd(pk + ϵ k, pk) + 2 δ 3pk 2(pk ϵk) + δ. Zero-Inflated Bandits under condition (G.2) and (G.4). To summarize these results, we have P(Gc k) P (G0)c Gp 1 + P (Gp 1)c + P Gr k)c Gp k + P (Gp k)c exp m1d(p1 + ϵ 1, p1) + 4(δ/4) + δ/2 + exp m1d(p1, p1 ϵ1) + exp mkd(pk + ϵ k, pk) + 2 (δ/4) 3pk 2(pk ϵk) + δ + exp mkd(pk, pk ϵk) , whenever m1 satisfies (G.1) and mk satisfies (G.2) and (G.4). Step 2: Now, we deal with E[ck(T) 1(Gk)]. If ck(T) > max{m1, mk}, then arm k was pulled more than mk times over the first T rounds, and so there must exist a round t [m1, . . . , T] such that At = k. However, on the good event Gk, we have U µ k (t, δ)U p k(t, δ) = U µ k,mk(δ)U p k,mk(δ) on the event Gr k < r1 on the event G0 < min t [m1,...,T ] U µ 1 (t, δ)U p 1 (t, δ) U µ 1 (t, δ)U p 1 (t, δ). This means the agent will choose arm 1 instead of arm k at time point t, which leads to a contradiction. Thus, we must have ck(T) max{m1, mk}. Step 3: Combining the inequality in Step 1 and Step 2, we obtain Eck(T) E[ck(T) 1(Ek)] + E[ck(T) 1{(Ek)c}] E[ck(T) 1(Ek)] + TP(Ec k) max{m1, mk} + T h exp m1d(p1 + ϵ 1, p1) + exp m1d(p1, p1 ϵ1) + 4(δ/4) 3pk 2(pk ϵk) + 3δ/2 + exp mkd(pk, pk ϵk) + exp mkd(pk + ϵ k, pk) i max{m1, mk} + T h exp{ 2m1ϵ 2 1 } + exp{ 2m1ϵ2 1} + exp{ 2mkϵ 2 k } + exp{ 2mkϵ2 k} + 4(δ/4) 3pk 2(pk ϵk) + 2δ i whenever m1 satisfies (G.1) and mk satisfies (G.2) and (G.4). 
Now, taking = 2(2rk + k)2 p2 k + 128e2D2(θ)C2pk(1 + pk)2 + 16e D(θ)CE(θ) pk(1 + pk) (pk ϵk) log(1/θ) 1(4/δ) 2(2rk + k)2 p2 k 2 k log 2 + 128e2D2(θ)C2pk(1 + pk)2 (pk ϵk) 2 k log 4 + 16e D(θ)CE(θ) pk(1 + pk) (pk ϵk) k log(1/θ) 1 4 Zero-Inflated Bandits with ϵk = ϵ k = pk/2 for k [K] satisfies (G.1), (G.2), and (G.4) by rk (0, 1]. Under these choice, we obtain exp{ 2m1ϵ 2 1 } = exp{ 2m1ϵ2 1} (1/2)θ 1# 2 = 4 δ2/16 (1/2)θ 1 , exp{ 2mkϵ 2 k } = exp{ 2mkϵ2 k} 2(2rk + k)2 p2 k + 128e2D2(θ)C2pk(1 + pk)2 + 16e D(θ)CE(θ) pk(1 + pk) pk/2 log(1/θ) 1(4/δ) h (2rk + k)2 + 128p2 ke2D2(θ)C2(1 + pk) ilog(4/δ) 16e D(θ)CE(θ)p3/2 k (1 + pk)log(1/θ) 1(4/δ) exp (2rk + k)2 log(4/δ) exp 2 k log(4/δ) Aggregating these results, Eck(T) can be furthermore upper bounded by Eck(T) max{m1, mk} + T 8 δ2/16 (1/2)θ 1 + 2δ/4 + 4(δ/4)(1/2)θ 1 + 2(δ/4)3 + 2δ m1 + mk + T h 8 δ2/16 1/2 + δ/2 + 4(δ/4)1/2 + 2(δ/4)3 + 2δ i = m1 + mk + T 2 δ + 3δ + δ3/32 p2 1 log (4/δ)(1/2)θ 1 /2 + 2(2rk + k)2 p2 k + 128e2D2(θ)C2pk(1 + pk)2 + 16e D(θ)CE(θ) pk(1 + pk) pk/2 log(1/θ) 1(4/δ) δ + 4δ + δ3/32 θ 1 log (4/δ) + h (2rk + k)2 + 128e D2(θ)C2p2 k(1 + pk)2i2 log(4/δ) + 32e D(θ)CE(θ) pk(1 + pk)log(1/θ) 1(4/δ) δ + 4δ + δ3/32 by rk 1 and pk 1 4 p2 1 log (4/δ) + 2 9 + 512e D2(θ)C2 log(4/δ) + 64e D(θ)CE(θ)log(1/θ) 1(4/δ) δ + 4δ + δ3/32 . Zero-Inflated Bandits Now, choose δ = 4/T 2, Eck(T) 8 log T p2 1 + 4 9 + 512e D2(θ)C2 log T + 26+(1/θ) 1e D(θ)CE(θ)log(1/θ) 1 T Finally, we obtain the cumulative regret is bounded by k=2 k Eck(T) k=2 k + 4 9 + 512e D2(θ)C2 K X log T p2 k k + 26+(1/θ) 1e D(θ)CE(θ) log(1/θ) 1 T pk + 6 + 16 which gives the regret in the theorem. G.2. Proof of Theorem B.2 Before proving the heavy tailed bandit results. We first state some basic properties of g(p, ϵ) in the concentration of Lemma B.1. Define h(p, ϵ) := pg(p, ϵ) = (1 + ϵ)2 ϵ 1+ϵ p 1 1+ϵ + 2 p + 4 3, then g is monotonically decreasing and h is monotonically increasing with respect to p. 
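Taking the displayed definition at face value, i.e., h(p, ϵ) = (1 + ϵ) 2^{ϵ/(1+ϵ)} p^{1/(1+ϵ)} + 2p + 4/3 and g(p, ϵ) = h(p, ϵ)/p, the claimed monotonicity in p can be confirmed on a grid. A quick sketch; reading the constants this way from the display is our assumption:

```python
def h(p, eps):
    # h(p, eps) = (1 + eps) * 2**(eps/(1+eps)) * p**(1/(1+eps)) + 2*p + 4/3,
    # read off from the display above (our reading of the extracted constants)
    return (1 + eps) * 2 ** (eps / (1 + eps)) * p ** (1 / (1 + eps)) + 2 * p + 4 / 3

def g(p, eps):
    return h(p, eps) / p   # g(p, eps) := h(p, eps) / p

eps = 0.5
ps = [i / 1000 for i in range(1, 1000)]          # grid over p in (0, 1)
hs = [h(p, eps) for p in ps]
gs = [g(p, eps) for p in ps]
assert all(a < b for a, b in zip(hs, hs[1:]))    # h is increasing in p
assert all(a > b for a, b in zip(gs, gs[1:]))    # g is decreasing in p
```

Every summand of h is increasing in p, while dividing by p makes each summand of g non-increasing, which is what the grid check reflects.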
Specially, they satisfy h(p, ϵ) (1 + ϵ) 2 1 + 2 1 + 4 for any p (0, 1), and g(p1, ϵ) = (1 + ϵ)2 ϵ 1+ϵ ϵ 1+ϵ 1 + 4 3p1 + 2 p1 (1 + ϵ)2 ϵ 1+ϵ p 1 1+ϵ 2 + 4 3 + 2 p2 = h(p2, ϵ) for any p1, p2 (0, 1). Proof. The proof is similar to the proof of Theorem 4.1. The essential change is we use the concentration of trimming observable sample mean in Lemma B.1 instead of sub-Weibull concentrations. We will borrow some techniques in (Bubeck et al., 2013). For fixed ϵ, we will write g(p) = g(p, ϵ) and h(p) = h(p, ϵ). Similar as the technique in Bubeck et al. (2013), suppose At = k, we define the following bad events B0(t) := {r1 U µ 1,c1(t),t U p 1,c1(t)}, Bµ k(t) := bµk,ck(t),t µk + g(bpk,t)M 1 1+ϵ log t2 Bcount k (t) := ck(t) 2(pk + εk)g(pk + εk) 1+ϵ ( k εkµk) 1+ϵ ck(t) 2h(pk + εk) 1+ϵ ( k εkµk) 1+ϵ with εk (0, k/µk) determined later. Similarly, define Bp k(t) = {bpk,t > pk + εk} Zero-Inflated Bandits On the event B0c Bµc k Bcount, c k Bpc k , we have U µ 1,c1(t),t U p 1,c1(t),t > r1 = rk + k = µkpk + k µkpk + 2h(pk + εk) log t2 ϵ/(1+ϵ) + µkεk h is increasing µkpk + 2h(bpk,t)M 1/(1+ϵ) log t2 ϵ/(1+ϵ) + µkεk µk(bpk,t εk) + 2h(bpk,t)M 1/(1+ϵ) log t2 ϵ/(1+ϵ) + µkεk = µkbpk,t + 2h(bpk,t)M 1/(1+ϵ) log t2 bµk,ck(t),t g(bpk,t)M 1 1+ϵ log t2 ϵ 1+ϵ bpk,t + 2h(bpk,t)M 1/(1+ϵ) log t2 bµk,ck(t),t + bpk,t M 1/(1+ϵ) log t2 = U µ k,ck(t),t U p k,ck(t),t. This implies At = 1, which leads to a contradiction. Consider the probability: P(B0 Bµ k Bp k) = P((B0 Bµ k) (Bp k)c) + P(Bp k) P(B0 (Bp k)c) + P(Bµ k (Bp k)c) + P(Bp k) with P(Bp k) and P(Bµ k (Bp k)c) can be bounded easily. Indeed, if we consider them, then P Bp k(t) = P bpk,t > pk + εk exp tε2 k/2 . P Bµ k(t) (Bp k(t))c bµk,ck(t),t µk + g(bpk,t)M 1 1+ϵ log t2 bpk,t pk + εk ! P pk εk > bpk,t bµk,ck(t),t µk + g(bpk,t)M 1 1+ϵ log t2 pk εk bpk,t pk + εk ! by g is decreasing P pk εk > bpk,t + P bµk,ck(t),t µk + g(pk + εk)M 1 1+ϵ log t2 P pk εk > bpk,t exp tε2 k/2 . 
P(bµk,ck(t),t ≥ µk + g(pk + εk) M^{1/(1+ϵ)} (log t2/ck(t))^{ϵ/(1+ϵ)}) ≤ 2/t3, following the proof of Proposition 1 of Bubeck et al. (2013) (for which we need pk + εk ≤ 1). Thus, we obtain P(Bµ k(t) ∩ (Bp k(t))c) ≤ exp(−tε2 k/2) + 2/t3. Similarly, we can show that P(B0(t) ∩ (Bp k(t))c) = P({r1 ≤ U µ 1,c1(t−1),t U p 1,c1(t−1), bpk,t < pk + εk}) ≤ exp(−tε2 1/2) + 2/t3 for any ε1 ∈ (0, p1). Step 3: Denote Vk := 2pkg(pk) 1+ϵ. Take ε1 = p1/2 and εk = ∆k/(2µk), which satisfy ε1 ∈ (0, p1) and εk ∈ (0, ∆k/µk) for k = 2, . . . , K. From Step 2, we know that P(B0(t) ∪ Bµ k(t) ∪ Bp k(t)) ≤ exp(−tε2 k/2) + exp(−tε2 k/2) + 2/t2 + exp(−tε2 1/2) + 2/t2 = 2 exp(−t∆2 k/(8µ2 k)) + exp(−tp2 1/8) + 4/t2. Hence, E ΣT t=1 1(At = k) ≤ ΣT t=1 E[1{At = k} ∩ Bcount k(t)] + ΣT t=1 E[1{At = k} ∩ Bcount,c k(t)] ≤ Vk + Σ t>Vk E[1{At = k} ∩ Bcount,c k(t)] ≤ Vk + Σ t>Vk P(Bcount,c k(t)) ≤ Vk + Σ t>Vk P(B0(t) ∪ Bµ k(t) ∪ Bp k(t)) ≤ Vk + 2 ∫0∞ exp(−t∆2 k/(8µ2 k)) dt + ∫0∞ exp(−tp2 1/8) dt + 4 × 1.5 ≤ Vk + 16µ2 k/∆2 k + 8/p2 1 + 6 ≤ Vk + 16/(p2 k∆2 k) + 8/p2 1 + 6, using µk = rk/pk ≤ 1/pk. Finally, by plugging the above into the decomposition R(T) = ΣK k=2 ∆k Eck(T), we obtain the result in the theorem. Proof of the problem-independent regret bound for the heavy-tailed UCB: Proof. We again plug in the bound for Eck(T): ΣK k=2 ∆k Eck(T) ≤ ΣK k=2 ∆k [Vk + 16/(p2 k∆2 k) + 8/p2 1 + 6]. For the second part, we again apply the Cauchy–Schwarz inequality, with ∆ = √(T ΣK k=2 p−2 k), and obtain ΣK k=2 [16/(p2 k∆k) + (8/p2 1 + 6)∆k + ∆k] ≤ 2K(4/p2 1 + 3) + 8 √(T ΣK k=2 p−2 k). For the first part, we use Hölder's inequality instead, as follows: ΣK k=2 ∆k Vk ≤ K^{ϵ/(1+ϵ)} (Σk …)^{1/(1+ϵ)} max k∈[K] 2pk g(pk, ϵ) M^{1/(1+ϵ)} log^{ϵ/(1+ϵ)} T ≤ 2((1 + ϵ)2^{ϵ/(1+ϵ)} + 10/3) (MT)^{1/(1+ϵ)} (K log T)^{ϵ/(1+ϵ)}. Thus, we complete the proof of the problem-independent bound. H. Proof of Regret Bounds for TS-Type Algorithms H.1.
Proof of Theorem 4.2 The primary idea behind the proof for our TS-type algorithm can be divided into two parts: first, the overestimation of suboptimal arms is controlled through the truncation in the clipped distribution, a technique we already used, with some modifications, in the proofs of the UCB-type algorithms; second, the underestimation of the optimal arm is restricted through the anti-concentration results for the posterior distributions detailed in Appendix F. Proof. Let eµk(t) and epk(t) denote the posterior samples for the non-zero part and the probability part of the k-th arm at round t, and keep the other notation from the proof of Theorem 4.1. Define Ξµ = µ1 − min s∈[T] 4γ [1 + log−1(1 + 1/ bp2 1,s s log+ 4 [1 + log−1(1 + 1/, and Ξp = p1 − min s∈[T] √((γ/(4s)) log+ T). Let Ξ = Ξµ + Ξp; then we can decompose the regret as ΣK k=2 ∆k Eck(T) = Σ k: ∆k≤2Ξ ∆k ck(T) + … ≤ E[2TΞ] + Σ k: ∆k≥(2Ξ)∨(2e√(2pk K/T)) ∆k ck(T). (H.1) For any x ≥ 0, we have P(Ξ ≥ x) = P(Ξµ + Ξp ≥ x) ≤ P(Ξµ ≥ x/2) + P(Ξp ≥ x/2). By Lemma 2.2, we know that the Orlicz norm of bµ1,s − µ1 satisfies ∥bµ1,s − µ1∥ψ2 ≤ 2σ/√(p1 s) (H.2) with probability 1 − δ/2 whenever s ≥ 4p−2 1 log(2/δ). Thus, by the same decomposition technique as in the proof of Theorem 4.1, and letting δx be the Dirac delta function, we have P(Ξµ ≥ x/2) ≤ P(∃ s ∈ [T] : µ1 − min 1≤s≤T … p2 1 s log+ 4σ2T …) + P(∃ s ∈ [T] : bp1,s − p1 ≥ p1 log(1 + 1/ …)). By Lemma 9.3 in Lattimore & Szepesvári (2020), we have P(∃ s ∈ [T] : µ1 − min 1≤s≤T … p2 1 s log+ 4σ2T …) ≤ 15K/(T(x/2)2) = 60K/(Tx2), and by Hoeffding's inequality, we have P(∃ s ∈ [T] : bp1,s − p1 ≥ p1 log(1 + 1/ …)) ≤ Σ∞ s=1 exp(−2sp2 1 log2(1 + 1/ …)) ≤ Σ∞ s=1 exp(−2sp2 1 s−1T−1 …) ≤ ∫0∞ exp(−2u2) du ≤ 1. Similarly, one can obtain P(Ξp ≥ x/2) = P(∃ s ∈ [T] : p1 − min s∈[T] √((γ/(4s)) log+ T) ≥ x/2) ≤ …, since bp1,s is sub-Gaussian with variance proxy at most 1/(4s).
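The last step uses that a Bernoulli(p) observation is sub-Gaussian with variance proxy 1/4 by Hoeffding's lemma, so the average of s observations has proxy at most 1/(4s). A quick numerical check of the per-observation moment-generating-function bound E e^{λ(Y−p)} ≤ e^{λ²/8} over a grid of p and λ:

```python
import math

# Hoeffding's lemma for Y ~ Bernoulli(p): E exp(lam * (Y - p)) <= exp(lam^2 / 8),
# i.e., Y is sub-Gaussian with variance proxy 1/4, so the mean of s i.i.d. copies
# has variance proxy at most 1/(4s).
for p in [0.05, 0.3, 0.5, 0.7, 0.95]:
    for lam in [i / 10 for i in range(-50, 51)]:
        mgf = p * math.exp(lam * (1 - p)) + (1 - p) * math.exp(-lam * p)
        assert mgf <= math.exp(lam * lam / 8) + 1e-12
```

The bound is tight at λ = 0 and loosest for extreme p, matching the worst-case proxy (1 − 0)²/4 = 1/4 for variables supported on [0, 1].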
Thus, for the first term in (H.1), we have E[2TΞ] 2T Z + 0 P (Ξ x) dx x2T dx + 2T Now, consider the collection of the good sets defined as K := n k [K] : k (2Ξ) (2e p 2pk K/T) o , then by Theorem 36.2 in Lattimore & Szepesv ari (2020) with ϵ = k/2 in K, we have k Eck(T) k + k E t=K+1 1 At = k, Ec k(t) # 1 G1,s( k/2) 1 # Zero-Inflated Bandits Ec k(t) = eµk(t) epk(t) > r1 k and Gk,s(ϵ) = 1 Fk,s(µ1 ϵ) with Fk,s is the non-clipped full posterior of the k-th arm1. Note that eµk(t) comes from cl N bµk, 2 ρkckbpk ; τk(t) Ec k(t) τk(t) ζk(t) > r1 k = τk(t) ζk(t) > rk + k s=1 1 τk(s) > µk + k 4γ h 1 + log 1(1 + 1/ bp2 k,ss log+ 4 h 1 + log 1(1 + 1/ s=1 1 {Gk,s(µ)} , s=1 1 ζk(s) > pk + pk k 4rk + k γ 4s log+ T > pk + pk k 4rk + k s=1 1 {Gk,s(p)} , t=K+1 1 At = k, Ec k(t) # t=K+1 1 At = k 1 τk(t) ζk(t) > rk + k s=1 1 τk(s) ζk(s) > rk + k s=1 1 n Gk,s(µ) Gk,s(p) o# s=1 P Gk,s(µ) + s=1 P Gk,s(p) # 1There is a trick for converting the clipped distribution to the non-truncated full posterior distribution. We omit here, and the details can be seen in Jin et al. (2022). Zero-Inflated Bandits Similar as we are dealing with the first part in (H.1), we have s=1 P Gk,s(µ) p2 ks log+ 4σ2T s=1 P bpk,s pk pk log(1 + 1/ p2 ks log+ 4σ2T by Lemma 2 in Jin et al. (2021) k 2pk + 24pk log+ T 2 k 8p2 k K 2γπ log+ T 2 k 8p2 k K s=1 P Gk,s(p) γ 4s log+ T > pk + pk k 4rk + k γ 4s log+ T by Lemma 2 in Jin et al. (2021) pk + 12 log+ Tp2 k 4K 2γπ log+ Tp2 k 4K Note that x 1 log+(ax2) is decreasing for x e/ a, we have 1 k log+ T 2 k 8p2 k K for any k 2e p 2pk K/T and 1 pk log+ Tp2 k 4K pk log+ e = 1 for any T 4e K. Thus, we have the bounds on k K such that s=1 P Gk,s(µ) k log+ T 2 k 8p2 k K 2γπ log+ T 2 k 8p2 k K 2pk + 24pk 2e p 2pk K/T + 8γpk v u u t 2γπ s=1 P Gk,s(p) pk + 12 log+ Tp2 k 4K 2γπ log+ Tp2 k 4K Zero-Inflated Bandits Therefore, we have t=K+1 1 At = k, Ec k(t) # h P Gk,s(µ) + P Gk,s(p) i It remains to deal with the last term in (H.4). 
For doing this, we will first prove the following result: for any ϵ > 0, there exists a universal constant c > 0 such that 1 G1,s(ϵ) 1 # c p2 1ϵ2 . (H.5) Indeed, let Zk,s be the random variable denoting the number of consecutive independent trails until a sample of the distribution Pk,s := N bµk,s, 2σ2 Beta(αk,s, βk,s) becomes greater than r1 ϵ, then E h 1 G1,s(ϵ) 1 i = EΥ1,s. Consider an integer q 1 and z ρ with some ρ (ρ, 1) determined later. Let Mk,q be the maximum of q independent samples from Pk,s and Fk,s be the filtration consisting the history of plays of Algorithm C.1 up to the s-th pull of arm 1. Then P(Υ1,s q) P (M1,q > r1 ϵ) E M1,q > α1,sbµ1,s α1,s + β1,s + z bp1,s ρs, α1,sbµ1,s α1,s + β1,s + z bp1,s ρs r1 ϵ | F1,s 1 α1,sbµ1,s α1,s + β1,s + z bp1,s ρs r1 ϵ P M1,q > α1,sbµ1,s α1,s + β1,s + z bp1,s ρs | F1,s Then by Lemma F.4, P M1,q > α1,sbµ1,s α1,s + β1,s + z p ρsbp1,s | F1,s = P M1,q > α1 + B1,s α1 + β1 + s bµ1,s + z bp1,s ρs | F1,s by the choice of z 1 1 c(α1 + B1,s, β1 + s B1,s) q ρ 8ρ log q 8ρ log q + 1+e 2s by fact (H.8) 1 p2 1 1 p1 23+2α1+2β1πe e s(2 p1) log 2 q ρ 8ρ log q 8ρ log q + 1 by fact (H.10) 1 1 1 p1 24+2α1+2β1πe e s(2 p1) log 2 q ρ 8ρ log q 8ρ log q + 1 by (1 x)q e qx 1 exp 1 p1 24+2α1+2β1πe e s(2 p1) log 2 q1 ρ 128π log q q1 ρ 128π log q for some q e2 and ρ > 1/2 when we take z = 2σ(2α1,s + β1,s) 2ρ log q 2(α1,s + β1,s) by fact (H.9) σ(2α1,s + β1,s) 2ρ log q + bµ1,sβ1,s 2(α1,s + β1,s) . 
Zero-Inflated Bandits In the above, we use the the following three facts are c(α1 + B1,s, β1 + s B1,s) = B 1(α1 + B1,s, β1 + s B1,s) (β1 + s B1,s) β1 + s B1,s 2(α1 + β1 + s) 2α1 + β1 + s + B1,s 2(α1 + β1 + s) α1+B1,s β1 + s B1,s + 4 2(β1 + s B1,s + 1) by Stirling s formula 1 2πe(β1 + (1 p1)s) (α1 + β1 + s)α1+β1+s 1/2 (α1 + p1s)α1+p1s 1/2(β1 + (1 p1)s)β1+(1 p1)s 1/2 β1 + (1 p1)s 2(α1 + β1 + s) β1+(1 p1)s 2α1 + β1 + (1 + p1)s 2(α1 + β1 + s) 41+α1+β1eπ(β1 + (1 p1)s) (α1 + β1 + s) 1/2 (β1 + (1 p1)s) 1/2 2α1 + β1 + (1 + p1)s 1 p1 23+2α1+2β1eπ 4 s 2α1+p1s = 1 p1 23+2α1+2β1eπ e s(2 p1) log 2. for any s 1, σ(2α1,s + β1,s) p 2ρ log q bµ1,sβ1,s 2ρ log q bµ1,s(β1 + s B1,s) σ(2α1 + β1 + s + B1,s) 2ρ log q µ1 + 2σµ1 σ with probability at least 1 e (2σµ1)2s = q exp (2σ + 1)2r2 1 2ρ p2 1σ2 with probability at least 1 e 2r2 1s = q exp (2σ + 1)2 with probability at least 1 e 2s p2 1 1 p1 24+2α1+2β1eπ e s(2 p1) log 2 q1 ρ 128π log q (2 p1) log 2 2 p2 1 = 1 1 p1 24+2α1+2β1eπ q1 ρ 128π log q let dα:=maxx 1 x α log x = 1 1 p1 24+2α1+2β1π q1 ρ p 128πd1 ρ q1 ρ " 27+2α1+2β1+1/2π3/2e p with dα < for any α > 0. Now, take q e2 exp (2σ + 1)2 " 27+2α1+2β1+1/2π3/2e p # 2 1 ρ exp 160 (1 ρ )2 in (H.7), we obtain that P M1,q > α1,sbµ1,s α1,s + β1,s + z bp1,s ρs | F1,s Zero-Inflated Bandits On the other hand, P α1,sbµ1,s α1,s + β1,s + z bp1,s ρs r1 ϵ P α1,sbµ1,s α1,s + β1,s + 2σ(2α1,s + β1,s) 2ρ log q 2(α1,s + β1,s)bp1,s ρs µ1p1 bµ1,s µ1 + σ(2α1,s + β1,s) α1,s + β1,s α1,s p1 1 ! 
bµ1,s µ1 + sσ [2α1 + β1 + B1,s] (α1 + B1,s)B1,s ρs µ1 (1 p1)α1 + p1β1 (B1,s p1s) α1 + B1,s | {z } event U1,s =P (U1,s | B1,s p1s) P(B1,s p1s) v=1 P U1,s | s v < B1,s p1s < s v + 1 v < B1,s p1s < s v + 1 bµ1,s µ1 + σ s 1B1,s ρs 0 | B1,s p1s P(B1,s p1s) bµ1,s µ1,s + A1(s, v) ρs µ1A2(s, v) | s v < B1,s p1s < s v + 1 v < B1,s p1s < s v + 1 by (H.12) and (H.13) 1 q ρ /ρ P(B1,s p1s) + 1 e2p 4 1 q 2p 1 1 ρ /ρ P s v < B1,s p1s < s v + 1 1 q ρ /ρ e2p 4 1 q 2p 1 1 ρ /ρ, where we define A1(s, v) := σ[s 1(2α1 + β1) + s 1B1,s] s 1B1,s[s 1α1 + s 1B1,s] , A2(s, v) := s 1[ (1 p1)α1 + p1β1] (s 1B1,s p1) s 1α1 + s 1B1,s . We also use the facts that bµ1,s µ1 + σ s 1B1,s ρs 0 | B1,s p1s µ1 bµ1,s σ s 1B1,s ρs | B1,s p1s by the fact that B1,s(bµ1,s µ1) sub G(B1,sσ2) for any B1,s 1 1 exp ρ log q by s 1B1,s 1 1 q ρ /ρ, Zero-Inflated Bandits and similarly bµ1,s µ1 + A1(s, v) ρs µ1A2(s, v) | s v < B1,s p1s < s v + 1 by (1 p1)α1 p1β1 P bµ1,s µ1 + A1(s, v) ρs µ1(p1 s 1B1,s) s 1α1 + s 1B1,s | s v < B1,s p1s < s v + 1 by similar trick in (H.12) 1 sup s 1B1,s [p1 v 1,p1 (v+1) 1] exp " µ1(p1 s 1B1,s) s 1α1 + s 1B1,s A1(s, v) by the fact (H.14) 1 sup s 1B1,s [p1 v 1,p1 (v+1) 1] exp " v 2µ2 1 [p1 + (v + 1) 1]2 B1,s 2σ2 A2 1(s, v)2ρ log q by the fact (H.15) 1 exp " v 2µ2 1 [p1 + (v + 1) 1]2 1 p1 1 v+1 by the fact (H.16) 1 e2p 4 1 q 2p 1 1 ρ /ρ (H.13) where the fact we used is µ1(p1 s 1B1,s) s 1α1 + s 1B1,s 2 2A1(s, v)µ1(p1 s 1B1,s) s 1α1 + s 1B1,s = µ1(p1 s 1B1,s) [s 1α1 + s 1B1,s]2 µ1(p1 s 1B1,s) 2σ[s 1(2α1 + β1) + s 1B1,s] v h p1 + 1 v+1 i2 µ1 v = v 2µ2 1 [p1 + (v + 1) 1]2 , 2σ2 A2 1(s, v)2ρ log q ρs = [s 1(2α1 + β1) + s 1B1,s]2 s 1B1,s [s 1α1 + s 1B1,s]2 ρ log q by s 1B1,s p1 1 v+1 1 p1 1 v+1 v 2µ2 1 [p1 + (v + 1) 1]2 1 p1 1 v+1 ρs 2µ2 1 p2 1 2 uniformly on v, s 1 and s 1B1,s p1 v 1, p1 (v + 1) 1 . 
Therefore, by plugging these inequalities into (H.6), we conclude that P(Υ1,s < q) 1 q 2 q ρ /ρ e2p 4 1 q 2p 1 1 ρ /ρ Zero-Inflated Bandits for any q satisfied (H.11), and then q=0 P(Υ1,s q) e2 + exp (2σ + 1)2 " 27+2α1+2β1+1/2π3/2e p # 2 1 ρ + exp 160 (1 ρ )2 h q 2 + q ρ /ρ + e2p 4 1 q 2p 1 1 ρ /ρi e2 + exp (2σ + 1)2 " 27+2α1+2β1+1/2π3/2e p # 2 1 ρ + exp 160 (1 ρ )2 + 1 + 1 1 ρ /ρ + e2p 4 1 1 2p 1 1 ρ /ρ. for any s N. Let 2p 1 1 ρ /ρ = 1/2, we immediately conclude that E 1 G1,s(ϵ) 1 = EΥ1,s c (H.17) with c = c(p1, ρ, σ2) is fully determined by p1, ρ, σ2 and free of s. Now, let E1,s = {bµ1,s bp1,s > r1 ϵ/2}, then P Υ1,s > r1 ϵ | E1,s P Υ1,s > bµ1,s bp1,s ϵ/2 | E1,s =1 P bµ1,s bp1,s > N bµ1,s, 2σ2/(ρksbp2 1,s) Beta(α1,s, β1,s) + ϵ/2 | E1,s 1 P bµ1,s > N bµ1,s, 2σ2/(ρsbp2 1,s) + ϵ/(2p1) | E1,s + P bp1,s > Beta(α1,s, β1,s) + p1ϵ/(4r1 + ϵ) . Zero-Inflated Bandits Note that we have P bµ1,s > N bµ1,s, 2σ2/(ρsbp2 1,s) + ϵ/(2p1) | E1,s = P B1,sbµ1,s N B1,sbµ1,s, 2sσ2 by the inequality does not rely anything on E1,s 1 2E exp ϵ2B2 1,s 4p2 1 ρ 4sσ2 16p1σ2 s 1B1,s 2 | s 1B1,s > p1/2 P s 1B1,s > p1/2 + E exp sρϵ2 16p1σ2 s 1B1,s 2 | s 1B1,s p1/2 P s 1B1,s p1/2 # + P 0 < s 1B1,s p1/2 # by Theorem 2 in (Ahle, 2017) 1 2 + exp sd(p1/2, p1) # 2 p1 + log 2 p1 by x 1+x log x 1 2 + exp sp1(1 p1/2) # P bp1,s > Beta(α1,s, β1,s) + p1ϵ/(4r1 + ϵ) = P Beta(α1,s, β1,s) E Beta(α1,s, β1,s) < (α1 + β1)B1,s α1s (α1 + β1 + s)s p1ϵ 4r1 + ϵ when s 2(4+ϵ) p1ϵ P Beta(α1,s, β1,s) E Beta(α1,s, β1,s) < p1ϵ 2(4 + ϵ) 2 exp p2 1ϵ2 18(4 + ϵ) [(4 + ϵ) + 4p1ϵ]s , where we use the result in Theorem 1 of Skorski (2023): if α1,s β1,s, we have P Beta(α1,s, β1,s) E Beta(α1,s, β1,s) < p1ϵ 2(4 + ϵ) p1ϵ 2(4 + ϵ) 2 (α1,s + β1,s)2(α1,s + β1,s + 1) p1ϵ 2(4 + ϵ) 2 (α1 + β1 + s)2(1 + α1 + β1 + s) 2(α1 + B1,s)(β1 + s B1,s) p1ϵ 2(4 + ϵ) 2(α1 + s/2)(β1 + s/2) by α1,β1 1 exp p2 1ϵ2 18(4 + ϵ)2 s ; Zero-Inflated Bandits and if α1,s < β1,s, we have P Beta(α1,s, β1,s) E Beta(α1,s, β1,s) < p1ϵ 2(4 + ϵ) α1,sβ1,s (α1,s + β1,s)2(α1,s + 
β1,s + 1) + 2(β1,s α1,s) 3(α1,s + β1,s)(α1,s + β1,s + 2) p1ϵ 2(4 + ϵ) (α1 + B1,s)(β1 + s B1,s) (α1 + β1 + s)2(1 + α1 + β1 + s) + 2(β1 α1 + s B1,s) 3(α1 + β1 + s)(2 + α1 + β1 + s) p1ϵ 2(4 + ϵ) (α1 + s/2)(β1 + s/2) s3 + p1ϵ(β1 α1) 4s3 + p1ϵ 3(4 + ϵ)s2 18(4 + ϵ) [(4 + ϵ) + 4p1ϵ]s . By plugging these inequalities into (H.18), we obtain P Υ1,s > r1 ϵ | E1,s + exp sp1(1 p1/2) # 2 exp p2 1ϵ2 18(4 + ϵ) [(4 + ϵ) + 4p1ϵ]s . On the other hand, the probability of E1,s can be directly bounded through the concentrations for Gaussian and Binomial distribution as P(E1,s) = P bµ1,s bp1,s > r1 ϵ/2 1 P µ1 > bµ1,s + ϵ/(2p1) + P p1 > bp1,s + p1ϵ/(4r1 + ϵ) exp 2sp2 1ϵ2 by the facts that P µ1 > bµ1,s + ϵ/(2p1) = P µ1 bµ1,s > ϵ/(2p1) exp sϵ2 P p1 > bp1,s + p1ϵ/(4r1 + ϵ) = P p1 bp1,s > p1ϵ/(4r1 + ϵ) exp 2sp2 1ϵ2 Note that ρ (1/2, 1), we have exp sρϵ2 64p1σ2 , exp ( sp1(1 p1/2)), exp sp2 1ϵ2 18(4+ϵ)[(4+ϵ)+4p1ϵ] , exp sϵ2 8p2 1σ2 , and exp 2sp2 1ϵ2 (4r1+ϵ)2 are all less than exp sp2 1ϵ2 (4+ϵ)2(1 σ2) . Then, by combining the inequalities for P Υ1,s > r1 ϵ | E1,s and P(E1,s), we have P Υ1,s > r1 ϵ | E1,s P(E1,s) 1 exp sp2 1ϵ2 (4 + ϵ)2(1 σ2) 2 exp sp2 1ϵ2 (4 + ϵ)2(1 σ2) 1 2 exp sp2 1ϵ2 (4 + ϵ)2(1 σ2) 1 3 exp sp2 1ϵ2 (4 + ϵ)2(1 σ2) Zero-Inflated Bandits Thus, if we take L (4+ϵ)2(1 σ2) p2 1ϵ2 log(3(1 1/ 2) 1), it will yield 1 G1,s(ϵ) 1 E T X 1 P Υ1,s > r1 ϵ | E1,s P(E1,s) 1 " 1 3 exp sp2 1ϵ2 (4 + ϵ)2(1 σ2) by (1 3x) 2 1 12x for any x<1/3 1/(3 2) and the condition for L 12 s=L exp sp2 1ϵ2 (4 + ϵ)2(1 σ2) L exp sp2 1ϵ2 (4 + ϵ)2(1 σ2) ds + 12 exp Lp2 1ϵ2 (4 + ϵ)2(1 σ2) = 12 exp Lp2 1ϵ2 (4 + ϵ)2(1 σ2) 1 + (4 + ϵ)2(1 σ2) by the condition for L 4 1 1/ 2 1 + (4 + ϵ)2(1 σ2) The above inequality together inequality (H.17) implies (H.5) immediately. 
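The quantity E[1/G1,s(ϵ) − 1] = E Υ1,s reflects a geometric-trials identity: conditional on the history, the number of consecutive independent posterior draws needed until one exceeds the threshold is Geometric(q) with q = G1,s(ϵ), so its conditional mean is 1/q and the expected number of extra draws is 1/q − 1. A Monte Carlo sketch with a standard Gaussian stand-in for the posterior; the threshold and run counts are our illustrative choices:

```python
import math, random

random.seed(1)
threshold = 1.0
q = 0.5 * math.erfc(threshold / math.sqrt(2))   # q = P(N(0,1) > threshold)

def trials_until_exceed():
    """Number of consecutive N(0,1) draws until one exceeds the threshold."""
    n = 1
    while random.gauss(0, 1) <= threshold:
        n += 1
    return n

runs = 20000
avg = sum(trials_until_exceed() for _ in range(runs)) / runs
assert abs(avg - 1 / q) < 0.3   # Geometric(q) mean is 1/q; extra draws average 1/q - 1
```

When q is bounded away from zero (as the anti-concentration lemmas above guarantee for the optimal arm's posterior), this expectation stays uniformly bounded, which is exactly what the constant c in (H.17) encodes.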
Therefore, we obtain that the last term in (H.4) is bounded by
$$\sum_{s=1}^{T}\mathbb{E}\Big[\frac{1}{G_{1,s}(\Delta_k/2)} - 1\Big] \ \lesssim\ 1 + \frac{(4+\Delta_k/2)^2(1\vee\sigma^2)}{p_1^2(\Delta_k/2)^2} \qquad \text{by } \Delta_k \le 2\Xi,$$
and therefore,
$$\sum_{k}\Delta_k\,\mathbb{E}\,c_k(T) \ \le\ \sum_k \Delta_k + \sum_k \Delta_k\,\mathbb{E}\Big[\sum_{t=K+1}^{T}\mathbb{1}\{A_t = k,\ E_k^c(t)\}\Big] \ \le\ \sum_k \Delta_k\Big(1 + \mathbb{E}\sum_{s=1}^{T}\Big[\frac{1}{G_{1,s}(\Delta_k/2)}-1\Big]\Big) \ \lesssim\ \sum_k \Delta_k\Big(1 + \frac{1\vee\sigma^2}{p_1 p_k\,\Delta_k^2}\Big).$$
By substituting the above inequality and (H.3) back into (H.1), we obtain the result stated in the theorem.

I. Proof of Regret Bounds for Contextual Bandit Algorithms

Let $V_t := \sum_{\ell \le t:\, Y_\ell = 1} Z_\ell Z_\ell^\top$ and $U_t := \sum_{\ell=1}^{t} W_\ell W_\ell^\top$ be the sample covariance matrices of the non-zero part and the binary part at round $t$, respectively. After decomposing the regret into two periods, the proof for contextual bandits proceeds in three main steps. The first step finds the minimal round $\tau$ such that $\min\{\lambda_{\min}(V_t), \lambda_{\min}(U_t)\} \ge 1$ for all $t \ge \tau$ with high probability, which bounds the regret over the first $\tau$ rounds. The second step uses a self-normalized martingale result to bound the regret from round $\tau$ to round $T$. The final step combines the regret over both periods.

Proof. Step 0: We start by decomposing the regret. Let $A_t^*$ be the optimal action at round $t$, and let $Z_t^* := \psi_X(x_t, A_t^*)$ and $p_{A_t} := h\big(\psi_Y(x_t, A_t)^\top \theta^*\big)$. Then
$$\sum_{t=1}^{T}\big[p_{A_t^*}\,g(\beta^{*\top} Z_t^*) - p_{A_t}\,g(\beta^{*\top} Z_t)\big] \ \le\ \big[2L_g + g(0)\big]\tau + \sum_{t=\tau+1}^{T}\big[p_{A_t^*}\,g(\beta^{*\top} Z_t^*) - p_{A_t}\,g(\beta^{*\top} Z_t)\big], \tag{I.1}$$
where the last inequality is by
$$\begin{aligned}
p_{A_t^*}\,g(\beta^{*\top}Z_t^*) - p_{A_t}\,g(\beta^{*\top}Z_t) &\le \big|p_{A_t^*}g(\beta^{*\top}Z_t^*) - p_{A_t^*}g(\beta^{*\top}Z_t)\big| + \big|p_{A_t^*}g(\beta^{*\top}Z_t) - p_{A_t}g(\beta^{*\top}Z_t)\big|\\
&\le L_g\big|\beta^{*\top}(Z_t^* - Z_t)\big| + \big[L_g|\beta^{*\top}Z_t - 0| + g(0)\big]\,|p_{A_t^*} - p_{A_t}| \qquad \text{by Assumption C.1 (ii)}\\
&\le L_g\sqrt{\|\beta^*\|_2^2\,\|Z_t^* - Z_t\|_2^2} + \Big[L_g\sqrt{\|\beta^*\|_2^2\,\|Z_t\|_2^2} + g(0)\Big] \qquad \text{by Assumption C.1 (i)}\\
&\le 2L_g + g(0),
\end{aligned}$$
and $\tau \in \mathbb{N}$ will be determined later.

For each arm $k \in [K]$, let $V_k(m) := \sum_{t \le m:\, Y_{t,k}=1} Z_{t,k} Z_{t,k}^\top$ be the sample covariance matrix of the non-zero part, and let $B_k(m) := \{t \in [m] : Y_{t,k} = 1\}$ after pulling arm $k$ $m$ times.

Step 1: Let $\Sigma_k := \mathbb{E}[Z_{t,k} Z_{t,k}^\top]$ and $\Omega_k := \mathbb{E}[W_{t,k} W_{t,k}^\top]$ be the covariance matrices of the i.i.d. samples $\psi_X(x_t, A_k)$ and $\psi_Y(x_t, A_k)$, respectively.
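Step 1's requirement that enough non-zero observations accumulate for $\lambda_{\min}(V_k(m)) \ge 1$ can be illustrated with a quick simulation; the feature distribution and constants below are illustrative, not those of Proposition 1 in Li et al. (2017):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, p_star = 4, 500, 0.6

Z = rng.normal(size=(m, d)) / np.sqrt(d)   # features with E[Z Z^T] = I/d
Y = rng.random(m) < p_star                 # zero-inflation indicators

# Gram matrix over rounds with a non-zero reward, as in V_k(m)
V = Z[Y].T @ Z[Y]
lam_min = np.linalg.eigvalsh(V).min()

# lambda_min grows linearly in m (about m * p_star / d here), so the
# threshold lambda_min >= 1 is met once m is moderately large
assert lam_min >= 1.0
```

The linear-in-$m$ growth is what makes the warm-up length $\tau_k$ finite despite only a $p^*$ fraction of rounds contributing to $V_k(m)$.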
Note that $V_k(m)$ has the same distribution as $\sum_{i=1}^{|B_k(m)|} Z_{i,k} Z_{i,k}^\top$, since $X_t$ is independent of $Y_t$ for any fixed $t \in [m]$. Then, by applying Proposition 1 in Li et al. (2017), for any $\delta \in (0,1)$ there exist positive universal constants $C_1$ and $C_2$ such that $\lambda_{\min}\big(V_k(m)\big) \ge 1$ with probability at least $1 - \delta$, as long as
$$|B_k(m)| \ \ge\ \Big(\frac{C_1\sqrt{d} + C_2\sqrt{\log(1/\delta)}}{\lambda_{\min}(\Sigma_k)}\Big)^2 + \frac{2}{\lambda_{\min}(\Sigma_k)}.$$
Note that $|B_k(m)|$ is a sum of independent $\mathrm{Bernoulli}(p_{k,t})$ variables with $p_{k,t} \ge p^*$, so by Hoeffding's inequality we have $|B_k(m)| \ge m p^*/2$ with probability at least $1-\delta$ whenever $m \ge 4p^{*-2}\log(1/\delta)$. Similarly, we also apply Proposition 1 in Li et al. (2017) to $U_k(m) = \sum_{t=1}^{m} W_{t,k} W_{t,k}^\top$, and know that there exist some universal constants $C_3, C_4$ such that $\lambda_{\min}\big(U_k(m)\big) \ge 1$ whenever
$$m \ \ge\ \Big(\frac{C_3\sqrt{q} + C_4\sqrt{\log(1/\delta)}}{\lambda_{\min}(\Omega_k)}\Big)^2 + \frac{2}{\lambda_{\min}(\Omega_k)}.$$
Now, recalling Assumption C.2 (ii), we let
$$\tau_k \ :=\ \max\bigg\{\frac{2}{p^*}\Big[\Big(\frac{C_1\sqrt{d} + C_2\sqrt{\log(1/\delta)}}{\lambda_{\min}(\Sigma_k)}\Big)^2 + \frac{2}{\lambda_{\min}(\Sigma_k)}\Big],\ \frac{4\log(1/\delta)}{p^{*2}},\ \Big(\frac{C_3\sqrt{q} + C_4\sqrt{\log(1/\delta)}}{\lambda_{\min}(\Omega_k)}\Big)^2 + \frac{2}{\lambda_{\min}(\Omega_k)}\bigg\};$$
then
$$\min\big\{\lambda_{\min}(V_k(\tau_k)),\ \lambda_{\min}(U_k(\tau_k))\big\} \ \ge\ 1$$
with probability at least $1 - 3\delta$.

Step 2.1: Define $V_n = \sum_{k\in[K]} V_k(m_k) + \lambda_V I_d$, where $n$ is the round at which arm $k$ has been pulled $m_k$ times. From Step 1 and the fact that $\tau = \max_{k\in[K]}\tau_k$ by Assumption C.2 (ii), we know that for any $n \ge \tau+1$, the minimal eigenvalues of $V_n$ and $U_n$ satisfy $\lambda_{\min}(V_n) \ge 1 + \lambda_V$ and $\lambda_{\min}(U_n) \ge 1 + \lambda_U$ with probability at least $1 - 3\delta$.

Next, since $V_{t+1} = V_t + Y_t Z_t Z_t^\top$,
$$\det V_{t+1} = \det V_t \cdot \det\big(I_d + Y_t\, V_t^{-1/2} Z_t Z_t^\top V_t^{-1/2}\big) = \det V_t\,\big(1 + Y_t \|Z_t\|^2_{V_t^{-1}}\big),$$
which furthermore implies
$$\log\det V_T = \log\det V_\tau + \sum_{t=\tau+1:\,Y_t=1}^{T}\log\big(1 + \|Z_t\|^2_{V_t^{-1}}\big) \ \ge\ \log\det V_\tau + \frac12\sum_{t=\tau+1:\,Y_t=1}^{T}\|Z_t\|^2_{V_t^{-1}},$$
where the last step is due to $\|Z_t\|_2 \le 1$ and $\lambda_{\min}(V_t) > 1$ for $t \ge \tau+1$, so that $\|Z_t\|^2_{V_t^{-1}} \le 1$ and $\log(1+x) \ge x/2$ for $x \in [0,1]$. Therefore, we conclude that
$$\sum_{t=\tau+1:\,Y_t=1}^{T}\|Z_t\|_{V_t^{-1}} \ \le\ \sqrt{T\sum_{t=\tau+1:\,Y_t=1}^{T}\|Z_t\|^2_{V_t^{-1}}} \ \le\ \sqrt{2T\log\frac{\det V_T}{\det V_\tau}}. \tag{I.2}$$
Note that, by the inequality of arithmetic and geometric means, $\det V_T \le \big(\mathrm{tr}(V_T)/d\big)^d$ with $\mathrm{tr}(V_T) \le \mathrm{tr}(V_\tau) + T - \tau$, and we have the facts that $\mathrm{tr}(V_\tau) \le \lambda_V d + \tau$ and $\det V_\tau \ge (1+\lambda_V)^d$.
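The log-determinant argument behind (I.2) can be verified numerically: with $\|Z_t\|_2 \le 1$ and $\lambda_{\min}(V_t) > 1$, each $\|Z_t\|^2_{V_t^{-1}}$ lies in $[0,1]$, so $\log(1+x) \ge x/2$ applies. A minimal check on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 3, 300
V = 2.0 * np.eye(d)                      # lambda_min(V_tau) > 1
sign0, logdet0 = np.linalg.slogdet(V)

total = 0.0
for _ in range(T):
    z = rng.normal(size=d)
    z /= max(1.0, np.linalg.norm(z))     # enforce ||z|| <= 1
    total += z @ np.linalg.solve(V, z)   # ||z||^2 in the V_t^{-1} norm
    V += np.outer(z, z)                  # V_{t+1} = V_t + z z^T

sign1, logdet1 = np.linalg.slogdet(V)
# sum of squared elliptical norms <= 2 * log(det V_T / det V_tau)
assert total <= 2.0 * (logdet1 - logdet0)
```

Applying Cauchy-Schwarz to `total`, as in (I.2), then gives the $\sqrt{2T\log(\det V_T/\det V_\tau)}$ rate used in the regret bound.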
Our inequality (I.2) can therefore be further bounded as
$$\sum_{t=\tau+1:\,Y_t=1}^{T}\|Z_t\|_{V_t^{-1}} \ \le\ \sqrt{2T\log\frac{\det V_T}{\det V_\tau}} \ \le\ \sqrt{2Td\,\log\frac{\lambda_V + T/d}{1+\lambda_V}}. \tag{I.3}$$
Similarly, we have
$$\sum_{t=\tau+1}^{T}\|W_t\|_{U_t^{-1}} \ \le\ \sqrt{2Tq\,\log\frac{\lambda_U + T/q}{1+\lambda_U}}, \tag{I.4}$$
as $Y_t$ is always observed, which is a special case of the above.

Step 2.2: Let $\Psi_{X,t}(\beta) := \sum_{\ell \le t:\, Y_\ell = 1}\big[g(Z_\ell^\top\beta) - g(Z_\ell^\top\beta^*)\big] Z_\ell$. By Assumption C.1 (iii), we have
$$\|\Psi_{X,t}(\beta)\|^2_{V_t^{-1}} \ \ge\ \kappa_g^2\,\|\beta - \beta^*\|_2^2 \tag{I.5}$$
for any $\beta \in \Gamma := \{\beta : \|\beta - \beta^*\|_2 \le 1\}$. Next, note that $\hat\beta_t$ comes directly from solving $\sum_{\ell \le t:\, Y_\ell=1}\big[X_\ell - g(Z_\ell^\top\beta)\big] Z_\ell = 0$; then
$$\Psi_{X,t}(\hat\beta_t) = \sum_{\ell:\,Y_\ell=1}\big[g(Z_\ell^\top\hat\beta_t) - g(Z_\ell^\top\beta^*)\big]Z_\ell = \sum_{\ell:\,Y_\ell=1}\big[X_\ell - g(Z_\ell^\top\beta^*)\big]Z_\ell - \sum_{\ell:\,Y_\ell=1}\big[X_\ell - g(Z_\ell^\top\hat\beta_t)\big]Z_\ell = \sum_{\ell\in[t]:\,Y_\ell=1}\varepsilon_\ell Z_\ell.$$
Note that $\varepsilon_\ell \mid \mathcal{F}_{\ell-1} \sim \mathrm{subG}(\sigma^2)$ directly implies $\varepsilon_\ell \mid \sigma\big(\{Y_s = 1 : s \in [\ell]\}\big) \vee \mathcal{F}_{\ell-1} \sim \mathrm{subG}(\sigma^2)$. Let
$$\frac{1}{2\sigma^2}\,\|\Psi_{X,t}(\hat\beta_t)\|^2_{V_t^{-1}} \ =\ \max_{v\in\mathbb{R}^d}\Big\{v^\top\!\!\sum_{\ell\in[t]:\,Y_\ell=1}\!\sigma^{-1}\varepsilon_\ell Z_\ell - \frac12\|v\|^2_{V_t}\Big\} \ =:\ \max_{v\in\mathbb{R}^d}\,\log M_{X,t}(v);$$
then $\bar M_{X,t} := \int M_{X,t}(v)\,d\mu(v)$ is also an $\mathcal{F}$-adapted non-negative supermartingale with initial value $1$ for any probability measure $\mu$ on $\mathbb{R}^d$. Thus, by Theorem 3.9 in Lattimore & Szepesvári (2020), we have $\mathbb{P}\big(\exists t:\ \bar M_{X,t} \ge 1/\delta\big) \le \delta\,\mathbb{E}\bar M_{X,0} = \delta$ for the probability measure $\mu = \mathcal{N}(0, \lambda_V^{-1} I_d)$. Plugging in the explicit formula for $\bar M_{X,t}$, we obtain
$$\mathbb{P}\Big(\exists t \in \mathbb{N}:\ \frac{1}{2\sigma^2}\,\|\Psi_{X,t}(\hat\beta_t)\|^2_{V_t^{-1}} \ \ge\ 2\log(1/\delta) + \log\frac{\det V_t}{\lambda_V^d}\Big) \ \le\ \delta \tag{I.6}$$
for any $\lambda_V > 0$ and $\delta \in (0,1)$. Combining inequalities (I.5) and (I.6), we conclude that
$$\|\hat\beta_t - \beta^*\|_2 \ \le\ \kappa_g^{-1}\sigma\sqrt{2\Big[2\log(1/\delta) + \log\frac{\det V_t}{\lambda_V^d}\Big]} \ \le\ \kappa_g^{-1}\sigma\sqrt{4\log(1/\delta) + d\log\big(1 + \lambda_V^{-1}t/d\big)} \ =:\ \rho_{X,t}$$
holds with probability at least $1-\delta$ for any $t \ge \tau+1$, where the last inequality is due to
$$\log\frac{\det V_t}{\lambda_V^d} \ \le\ d\log\Big(1 + \frac{|\{\ell \in [t] : Y_\ell = 1\}|}{\lambda_V d}\Big) \ \le\ d\log\Big(1 + \frac{t}{\lambda_V d}\Big).$$
By using a similar argument, we can also show that
$$\|\hat\theta_t - \theta^*\|_2 \ \le\ \kappa_h^{-1}\sqrt{4\log(1/\delta) + q\log\big(1+\lambda_U^{-1}t/q\big)} \ =:\ \rho_{Y,t}$$
holds with probability at least $1-\delta$ for any $t \ge \tau+1$.

Step 2.3: The design of Algorithm C.2 for choosing $A_t \in [K]$ ensures
$$\hat\beta_t^\top Z_t^* + \rho_{X,t}\|Z_t^*\|_{V_t^{-1}} \ \le\ \hat\beta_t^\top Z_t + \rho_{X,t}\|Z_t\|_{V_t^{-1}},\quad \text{i.e.,}\quad \hat\beta_t^\top(Z_t^* - Z_t) \ \le\ \rho_{X,t}\big(\|Z_t\|_{V_t^{-1}} - \|Z_t^*\|_{V_t^{-1}}\big),$$
and
$$\hat\theta_t^\top W_t^* + \rho_{Y,t}\|W_t^*\|_{U_t^{-1}} \ \le\ \hat\theta_t^\top W_t + \rho_{Y,t}\|W_t\|_{U_t^{-1}},\quad \text{i.e.,}\quad \hat\theta_t^\top(W_t^* - W_t) \ \le\ \rho_{Y,t}\big(\|W_t\|_{U_t^{-1}} - \|W_t^*\|_{U_t^{-1}}\big).$$
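The selection rule in Step 2.3 maximizes an optimistic index for each part of the zero-inflated reward. A minimal sketch of such a rule follows; the function name and the way the two optimistic indices are combined into a product are our own illustration, not the exact statement of Algorithm C.2:

```python
import numpy as np

def zi_ucb_action(Z, W, beta_hat, theta_hat, V_inv, U_inv, rho_x, rho_y, g, h):
    """Pick the arm maximizing the product of optimistic indices for the
    non-zero magnitude part g(beta^T z) and the non-zero probability h(theta^T w)."""
    idx = []
    for z, w in zip(Z, W):
        mean_ucb = g(z @ beta_hat + rho_x * np.sqrt(z @ V_inv @ z))
        prob_ucb = h(w @ theta_hat + rho_y * np.sqrt(w @ U_inv @ w))
        idx.append(prob_ucb * mean_ucb)
    return int(np.argmax(idx))

# Toy usage with monotone links (g = h = sigmoid)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
Z = np.array([[1.0], [0.5]])
W = np.array([[1.0], [1.0]])
a = zi_ucb_action(Z, W, np.array([1.0]), np.array([0.0]),
                  np.eye(1), np.eye(1), 0.0, 0.0, sigmoid, sigmoid)
# with zero exploration bonus this reduces to the greedy choice
```

Since $g$ is monotone, applying it to the optimistic linear index $\hat\beta_t^\top z + \rho_{X,t}\|z\|_{V_t^{-1}}$ preserves exactly the comparison used in Step 2.3.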
Applying the last two inequalities in Step 2.2, we obtain
$$\begin{aligned}
(Z_t^* - Z_t)^\top\beta^* &= (Z_t^* - Z_t)^\top\hat\beta_t - (Z_t^* - Z_t)^\top(\hat\beta_t - \beta^*)\\
&\le \rho_{X,t}\big(\|Z_t\|_{V_t^{-1}} - \|Z_t^*\|_{V_t^{-1}}\big) + \|Z_t^* - Z_t\|_{V_t^{-1}}\,\|\hat\beta_t - \beta^*\|_2 \qquad \text{by the choice of } \rho_{X,t} \text{ and } t > \tau\\
&\le \rho_{X,t}\big(\|Z_t\|_{V_t^{-1}} - \|Z_t^*\|_{V_t^{-1}}\big) + \rho_{X,t}\big(\|Z_t^*\|_{V_t^{-1}} + \|Z_t\|_{V_t^{-1}}\big)\\
&\le 2\rho_{X,t}\|Z_t\|_{V_t^{-1}}
\end{aligned}$$
with probability at least $1 - 4\delta$ for any $t \ge \tau+1$. Similarly, by applying the same technique to $(W_t^* - W_t)^\top\theta^*$, we can obtain that
$$(Z_t^* - Z_t)^\top\beta^* \ \le\ 2\rho_{X,t}\|Z_t\|_{V_t^{-1}} \quad\text{and}\quad (W_t^* - W_t)^\top\theta^* \ \le\ 2\rho_{Y,t}\|W_t\|_{U_t^{-1}} \tag{I.7}$$
both hold with probability at least $1 - 5\delta$ for any $t \ge \tau+1$.

Step 3: Now, with inequality (I.7), we can deal with the remaining part of (I.1) from Step 0. By following the same technique used to prove $\sum_{t=1}^{\tau}\big[p_{A_t^*}\,g(\beta^{*\top}Z_t^*) - \hat p_{A_t}\,g(\hat\beta_t^\top Z_t)\big] \le [2L_g + g(0)]\tau$, we have
$$\begin{aligned}
\sum_{t=\tau+1}^{T}\big[p_{A_t^*}\,g(\beta^{*\top}Z_t^*) - p_{A_t}\,g(\beta^{*\top}Z_t)\big]
&\le \sum_{t=\tau+1}^{T}\Big\{L_g\big|\beta^{*\top}(Z_t^* - Z_t)\big| + \big[L_g\|\beta^{*\top}Z_t\| + g(0)\big]\,L_h\big|\theta^{*\top}(W_t^* - W_t)\big|\Big\} \quad \text{by Assumption C.1 (ii)}\\
&\le 2L_g\sum_{t=\tau+1}^{T}\rho_{X,t}\|Z_t\|_{V_t^{-1}} + 2\big(L_g + g(0)\big)L_h\sum_{t=\tau+1}^{T}\rho_{Y,t}\|W_t\|_{U_t^{-1}} \quad \text{by (I.7)}\\
&\le 2L_g\,\rho_{X,T}\sum_{t=\tau+1}^{T}\|Z_t\|_{V_t^{-1}} + 2\big(L_g + g(0)\big)L_h\,\rho_{Y,T}\sum_{t=\tau+1}^{T}\|W_t\|_{U_t^{-1}}\\
&\le 2\kappa_g^{-1}\sigma L_g\sqrt{4\log(1/\delta) + d\log\big(1+\lambda_V^{-1}T/d\big)}\cdot\sqrt{2Td\,\log\frac{\lambda_V + T/d}{1+\lambda_V}}\\
&\qquad + 2\kappa_h^{-1}\big(L_g + g(0)\big)L_h\sqrt{4\log(1/\delta) + q\log\big(1+\lambda_U^{-1}T/q\big)}\cdot\sqrt{2Tq\,\log\frac{\lambda_U + T/q}{1+\lambda_U}} \quad \text{by (I.3) and (I.4)},
\end{aligned}$$
which completes the regret bound for our UCB algorithm. The regret bound for the TS algorithm follows similarly, leveraging the anti-concentration result in (C.2) along with the proof techniques in Theorem 1 of Agrawal & Goyal (2013); specifically, it incorporates anti-concentration results for $(\tilde\beta_t - \hat\beta_t)^\top(Z_t^* - Z_t)$ and $(\tilde\theta_t - \hat\theta_t)^\top(W_t^* - W_t)$.
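Both Step 0 and Step 3 rest on the same Lipschitz decomposition of the product reward $p\,g(\beta^\top Z)$. The sketch below checks it numerically with $g$ taken to be the sigmoid, an illustrative choice with $L_g = 1/4$ and $g(0) = 1/2$ (not the paper's assumed link function):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
Lg, g0, d = 0.25, 0.5, 5   # sigmoid is 1/4-Lipschitz and g(0) = 1/2

beta = rng.normal(size=d)
beta /= np.linalg.norm(beta)               # ||beta*|| <= 1

for _ in range(1000):
    z1, z2 = rng.normal(size=(2, d))
    z1 /= max(1.0, np.linalg.norm(z1))     # ||Z|| <= 1
    z2 /= max(1.0, np.linalg.norm(z2))
    p1, p2 = rng.uniform(size=2)           # non-zero probabilities in [0, 1]

    lhs = abs(p1 * sigmoid(beta @ z1) - p2 * sigmoid(beta @ z2))
    # decomposition used in Steps 0 and 3
    rhs = Lg * abs(beta @ (z1 - z2)) + (Lg * abs(beta @ z2) + g0) * abs(p1 - p2)
    assert lhs <= rhs + 1e-12
    # per-round regret bound from Step 0
    assert lhs <= 2 * Lg + g0
```

The first assertion mirrors the triangle-inequality split into a magnitude term and a probability term; the second is the crude per-round bound $2L_g + g(0)$ used for the warm-up period.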