# Short-lived High-volume Bandits

Su Jia *1, Nishant Oli *2, Ian Anderson 2, Paul Duff 2, Andrew A. Li 3, R. Ravi 3

Abstract

We study how to efficiently perform A/B/n testing for a high volume of short-lived treatments. We formulate the problem as a multiple-play bandit model. In each round, a set of k actions arrives. Each action is available for w rounds and has an unknown reward rate. In each round, the learner selects a multiset of n actions and immediately observes the realized rewards. We aim to minimize the average loss under a random input model where the instance is randomly drawn from a known prior distribution D. We show that if $k = O(n^\rho)$ for some $\rho > 0$, our policy achieves $O(n^{-\min\{\rho,\, w/(2(w+1))\}})$ average loss on a sufficiently large class of prior distributions. We also complement this result by showing that every policy suffers $\Omega(n^{-\min\{\rho,\, 1/2\}})$ average loss on the same class of distributions. We further validate the effectiveness of our policy through a large-scale field experiment on Glance, a content card-serving platform.

1. Introduction

Modern platforms leverage randomized experiments to make informed decisions from a given set of alternatives. As a particularly challenging scenario, these alternatives can simultaneously have (i) high volume, with thousands of new items being released each hour, and (ii) short lifetime, due to the transient nature of the contents. This challenge arises from, for example, recommending short-lived content on a video-sharing platform. Orthogonal to lifetime, the problem is well understood when there is a low volume of contents relative to the number of users: dedicated exploration methods, such as standard A/B/n testing, are sufficient for finding the most appealing content for the users; see, e.g., (Kohavi & Longbotham, 2017). Naturally, then, the most challenging settings are those where the contents to be selected are short-lived and have high volume, which occur in a variety of applications.

*First and second authors. All other authors are alphabetically ordered. 1 Center of Data Science for Enterprise and Society (CDSES), Cornell University, Ithaca, USA; 2 Glance, Bangalore, India; 3 Tepper School of Business, Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Su Jia.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

(A) Recommender Systems. From online advertising to social networks, platforms are sometimes faced with a massive amount of short-lived content to be recommended to users. For example, more than 210 million Snaps were created on Snapchat each day in 2020, most of which expired within just 24 hours (Vuleta, 2021). The brevity of the lifetime can be caused either by the features of the content itself (e.g., breaking news) or by the transient nature of user attention. The platform needs to decide which contents appear at the top of the users' news feeds to maximize user engagement.

(B) Website Optimization. In internet marketing, online platforms perform multi-variate A/B/n testing (e.g., (McFarland, 2012; Yang et al., 2017)) to test different designs of their user interface. For example, LinkedIn runs over 400 concurrent experiments per day to compare different website designs, with the goal of encouraging users to establish or refine their personal profiles, or increasing subscriptions to LinkedIn Premium (Xu et al., 2015).
The combinatorial nature of the decision space results in a high volume of items to test. Specifically, the number of possible designs can be exponential in the number of features (e.g., logos, font, background color, etc.). However, these items can have a short life. For example, considering the non-stationarity of the underlying environment (say, due to seasonality or societal events), any estimation is only reliable for a short amount of time. A simple approach is to partition the time horizon into short segments, in which the conversion rates of the designs are approximately constant, and then view the same design in different segments as separate copies.

Adding to this challenge is personalization. A common and naive approach is to cluster the users and solve the problem for each cluster separately. However, when restricting to a cluster, the number of users becomes much smaller while the number of actions remains the same. In other words, personalization would substantially limit the resources available for experimentation, rendering the problem harder.

Thus motivated, we study how to efficiently identify the effective items from a high volume of short-lived candidates. To encapsulate the key features of the problem, we propose the Short-lived High-volume Bandits (SLHVB) problem. We employ a multiple-play bandit framework with the following key features: 1) Multiple-play. In each round, we can choose each action (or arm) multiple times as long as the total number of plays is n, which corresponds to the number of user interactions in this time period;[1] 2) High-Volume of Arrivals. In each round, a set of $k = n^\rho$ actions arrives, where $\rho > 0$; 3) Short Lifetime. Each arm is available for w rounds, where w is a known, small constant; 4) Random Input Model (RIM). The reward rates are drawn from a fixed known distribution. Our analysis focuses on the average performance over all instances, rather than against the worst-case instance. We present a policy that recursively refines the exploration up to w times and compare it against what is best achievable for different regimes of $\rho > 0$.

1.1. Our Contributions

This work contributes to the literature of multi-armed bandits and online controlled experiments in the following ways.

1) A Novel and Practical Formulation. Our first contribution is formulating an online learning model that faithfully captures a ubiquitous problem faced by online platforms. Our formulation considers a practical metric, the average loss, which better reflects the quality of a policy than the worst-case metrics that are more common in the previous literature.

2) Average Regret for Batched Bandits. As a subroutine for showing our main result, we show that the Batched Successive Elimination algorithm (Gao et al., 2019) for the Batched Bandits (BB) problem achieves $O((k/n)^{\ell/(\ell+2)})$ average regret using $\ell$ rounds of adaptivity whenever $\rho \ge \frac{\ell-1}{2\ell+1}$.[2] In particular, for $\ell \ge 2$, this bound is asymptotically better than the optimal $O(\sqrt{k/n}\cdot n^{2^{-\ell}})$ worst-case regret. This contrast is strongest when $\rho = 1/2$: in this case our policy has average regret $n^{-1/2+O(1/\ell)}$ whereas the optimal worst-case regret is $O(n^{-1/4+2^{-\ell}})$.

3) Policy for SLHVB. We show that any policy for BB with $R(n,k)$ average regret can be converted into a policy for SLHVB with $O(n^{-\rho} + R(n,k))$ average loss. Our average regret bound for BB then implies an $O(n^{-\min\{\rho,\, w/(2(w+1))\}})$ average loss bound for SLHVB, by choosing a suitable $\ell$ depending on $\rho$ and w.

4) Nearly Matching Lower Bound.
We show that any policy for SLHVB suffers an $\Omega(n^{-\min\{\rho,\,1/2\}})$ average loss. Further, we juxtapose this result with a lower bound on the worst-case loss which is higher, and hence highlights the value of our RIM.

[1] Do not confuse this n with the n in the term A/B/n testing; the latter is actually our k, i.e., the number of treatments. [2] All big-Os are with respect to n, with $\rho$ fixed.

5) Large-Scale Field Experiment. Finally and most importantly, we validated the effectiveness of our policy in a field experiment via a collaboration with Glance, a leading lock-screen content platform in India. This firm faces exactly the aforementioned challenge: it generates hundreds of content cards on an hourly basis, which are available for at most 48 hours. Its current recommender is based on a state-of-the-art Deep Neural Network (DNN), which is time-consuming to re-train and hence unable to utilize user feedback in a timely manner. In a field experiment, we observed that our 1-Layered Sieve policy outperforms their DNN-based recommender by 4% to 7% in user engagement.

1.2. Related Work

Our problem is a variant of the Multi-Armed Bandits (MAB) problem (Lai et al., 1985). Three lines of work are most related to ours: multiple-play bandits, mortal bandits and high-volume bandits.

Multiple-play Bandits. In this variant, several arms are selected in each round. Many results in single-play bandits can be generalized to the multiple-play variant; for example, (Komiyama et al., 2015) showed that the instance-dependent regret bound for Thompson sampling can be generalized to the multiple-play setting. One motivation of the multiple-play variant is online ranking, see e.g., (Radlinski et al., 2008; Lagrée et al., 2016; Gauthier et al., 2022), where the learner presents an ordered list of items to each user, viewed sequentially under a certain click model. Unlike in our formulation, the arms have infinite lifetime. Further, there is no arrival of new arms, so the learner does not need to take into consideration the ages of the arms.

Mortality of Arms. A quintessential motivation for the mortality of arms is online advertising. In the classical pay-per-click model, the ad broker matches each ad from a large corpus to contents, and is paid by the advertiser (i.e., whoever created the ad) only when an ad is clicked. As the key feature, an ad becomes unavailable when its advertiser's budget runs out. Thus motivated, (Chakrabarti et al., 2008) introduced the mortal bandits problem and considered two death models. In the deterministic model, each arm dies after being selected a certain number of times, which corresponds to the advertiser's budget in the advertising example. In the stochastic lifetime model, an arm dies with a fixed probability every time it is selected. Relatedly, in rotting bandits (Levine et al., 2017), each arm's reward rate decays in the number of times it has been selected. In particular, if the reward function is an indicator function, then effectively each arm has a finite lifetime.

Motivated by demand learning in assortment planning, (Farias & Madan, 2011) considered the irrevocable bandits problem, which bears both the multiple-play and mortality features: arms are selected in batches and discarded immediately once selected. Unlike in this work, however, none of these models considered arrivals, and hence the learner does not need to take into consideration the age of the actions.

High Volume of Arms.
Most existing work concerning a large volume of arms considered the worst-case regret of a policy, i.e., the regret on the worst input in a given family (e.g., (Berry et al., 1997; Zhang & Frazier, 2021)). As a distinctive feature, in this work we consider a random input model where the reward rates follow a known distribution, and perform an average-case analysis. As we will soon see, our formulation leads to theoretical results that would be otherwise impossible. (Wang et al., 2008) also assumed that the reward rates of the arms are independently drawn from a common distribution such that the probability of being $\varepsilon$-optimal is $O(\varepsilon^\beta)$, where $\beta \in [0,1]$ is a known constant. However, unlike in our problem, there are no arrivals and hence the policy does not need to balance the exploration for items with different ages.

Finally, we are aware of another variant of MAB closely related to the multiple-play bandits (and hence to this work).

Low-Adaptive Bandit Algorithms. In the batched bandits problem (Perchet et al., 2016; Agarwal et al., 2017) we aim to achieve low regret using low adaptivity. Unlike in the multiple-play setting, here the learner can partition the time horizon [T] into w batches, where w is a given constant, and choose a batch (i.e., a multiset) of arms based on the realizations in the previous batches. Alternatively, w can be interpreted as a constraint on the adaptivity of the policy. In the classical setting, the learner has unlimited adaptivity, i.e., w = T. (Gao et al., 2019) showed that for any k arms, we can achieve $O(\sqrt{kT})$ regret whenever $w = \Omega(\log\log T)$, which is optimal among all policies with arbitrary adaptivity.

2. Formulation

Suppose at the start of each round t = 1, 2, ..., a set $A_t$ of k actions (or arms) arrives, available in rounds $t, \dots, t+w$. We call w the lifetime and view it as a small constant. In each round, the learner selects a multiset of n available arms; each arm can be chosen multiple times, as long as the total number of plays is n. If an arm a is selected m times, the learner immediately observes i.i.d. rewards $X_1, \dots, X_m$ with mean $\mu_a$.[3] For simplicity, we consider only Bernoulli random variables, although the analysis can be generalized to subgaussian reward distributions in a straightforward manner.

[3] i.i.d. stands for independent and identically distributed.

For concreteness, let us relate the above formulation to the recommendation problem. Most online platforms retrieve user interaction data and update the predictions periodically. A round corresponds to such a period. Further, n corresponds to the number of user impressions in a round, and is assumed to be independent of the quality of the recommendation. Selecting a multiset of arms corresponds to recommending exactly one item to each impression.

A Random Input Model. Unlike most work in MAB, here we consider a Random Input Model (RIM) where the $\mu_a$'s are assumed to be drawn i.i.d. from a known distribution D. This is realistic in practice since D can be approximated using past data. In contrast to the minimax framework, this formulation better reflects reality and, more importantly, enables us to obtain theoretical guarantees that would be otherwise impossible; see Section 3. As a standard assumption in statistical learning (Ghosal, 2001; Petrone & Wasserman, 2002; Audibert & Tsybakov, 2007), we assume the density of D is bounded from above and bounded below away from 0, although our analysis extends to more general distributions (with possibly weaker guarantees).

Assumption 2.1 (Bounded Density Assumption).
The distribution D admits a density function f with compact support C, and there exist constants $C_1, C_2 > 0$ such that $C_1 \le f(x) \le C_2$ for all $x \in C$. W.l.o.g.[4] we assume C = [0, 1].

The Average Loss. The procedure of selecting arms can be encoded by a policy $\pi = (\pi_t)$ satisfying $\sum_{a \in A^t_{t-w}} \pi_t(a) = n$ for each t, where $A^{t}_{t'} := \bigcup_{s=t'}^{t} A_s$ for $0 < t' \le t$. The expected reward in round t is then $\sum_a \pi_t(a)\,\mu_a$. The problem would be easy if the $\mu_a$'s were known: letting $a^*_t = \arg\max\{\mu_a : a \in A^t_{t-w}\}$, the optimal policy is given by $\pi^*_t(a) = n \cdot \mathbf{1}[a = a^*_t]$. When the reward rates are unknown, the performance of a policy is measured by the following notion of average loss.[5]

We first define the finite-time loss. For any integer $T \ge w$ and reward rates $\mu = \{\mu_a\}$, consider
$$\mathrm{Loss}_n(\pi; T, \mu) := \frac{1}{nT}\, \mathbb{E}\Big[\sum_{t=1}^{T} \sum_{a \in A^t_{t-w}} \pi_t(a)\, (\mu^*_t - \mu_a)\Big],$$
where the expectation is only over the reward realizations, not over $\mu$. To characterize the long-run performance of the system, let $T \to \infty$ and define the average loss (or simply, the loss)
$$\mathrm{Loss}_n(\pi) := \lim_{T \to \infty} \mathbb{E}_{\mu \sim D}\,[\mathrm{Loss}_n(\pi; T, \mu)].$$
We are interested in how rapidly $\mathrm{Loss}_n(\pi)$ vanishes given a fixed $\rho > 0$.

[4] w.l.o.g. stands for without loss of generality. [5] To avoid confusion, we use the term loss for SLHVB and regret for Batched Bandits in Section 4.

3. Lower Bounds

We first show that every policy suffers $\Omega(n^{-\min\{\rho,\,1/2\}})$ average loss. We defer the details to Appendix A and only explain the high-level ideas here. Consider the per-round loss $L_t$ given by $\frac1n \sum_{a \in A^t_{t-w}} \pi_t(a)\,(\mu^*_t - \mu_a)$, where $\mu^*_t = \mu_{\max}(A^t_{t-w})$. By linearity of expectation, it suffices to lower bound $L_t$. We will show that the loss is large in each of the following two cases.

First consider the case $\pi_t(A_t) \ge \frac n2$, i.e., the policy sufficiently explores the arms arriving at time t. Due to the RIM and by Assumption 2.1, the reward rates $\mu_a$ are approximately evenly spaced. Since there are wk arms available at any time, the gap between the best and second-best arms $a, a'$ is approximately $\frac{1}{wk}$; formally, $\mu_a - \mu_{a'} \gtrsim \frac{1}{wk}$. Consequently, if the arriving batch $A_t$ does not contain the best available arm, i.e., $a^*_t \notin A_t$, then $\mu(a^*_t) - \mu_{\max}(A_t) \gtrsim \frac{1}{wk}$. Moreover, by symmetry $a^*_t$ is equally likely to appear in each of the w available batches, so the above event occurs with probability $1 - \frac1w \ge \frac12$, and hence $L_t \gtrsim \frac{1}{wk}$.

Now consider the other case, $\pi_t(A_t) < \frac n2$, i.e., $A_t$ is under-explored in round t. Consider the event $a^*_t \in A_t$. Then, by an argument similar to the first case, we have $\mu(a^*_t) - \mu_{\max}(A^{t-1}_{t-w}) \gtrsim \frac{1}{wk}$. Further, this event occurs with probability $\frac1w$, so in this case $L_t \gtrsim \frac1w \cdot \frac{1}{wk} = \frac{1}{w^2}\, n^{-\rho}$.

Thus in both cases we have $L_t \gtrsim \frac{1}{w^2}\, n^{-\rho}$, as formally stated below. We defer the proof to Appendix A.1.

Proposition 3.1 (Lower Bound). For any policy $\pi$ and $\rho > 0$, we have $\mathrm{Loss}_n(\pi) \ge \frac{1}{12 w^2}\, n^{-\rho}$.

However, this bound becomes very weak when $\rho$ is large. We next present a lower bound specific to $\rho \ge \frac12$. We argue that to avoid an $\Omega(n^{-1/2})$ loss, a policy has to identify an $n^{-1/2}$-optimal arm a, i.e., one with $\mu_a \ge \mu^*_t - O(n^{-1/2})$. To this end, by Assumption 2.1, the learner needs to explore $n^{1/2}$ distinct arms, leading to an $\Omega(n^{1/2})$ regret. We formally state the second lower bound below and defer the proof to Appendix A.2.

Proposition 3.2 (Lower Bound for $\rho \ge \frac12$). Suppose $\rho \ge \frac12$. Then for any policy $\pi$, we have $\mathrm{Loss}_n(\pi) \ge \frac{1}{96 w^2}\, n^{-1/2}$.

Note that when $\rho = 1/2$, the above two lower bounds have the same asymptotic order. By combining them, we immediately obtain the following.

Theorem 3.3 (Lower Bound). For any $\rho > 0$ and policy $\pi$, it holds that $\mathrm{Loss}_n(\pi) \ge \frac{1}{96 w^2}\, n^{-\min\{\rho,\,1/2\}}$.
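Before moving on, the following minimal Python sketch makes the formulation of Section 2 concrete: it simulates the SLHVB environment under the RIM and estimates the average loss of a given policy. It is purely illustrative and not from the paper; the uniform prior, the `policy(mus, n)` interface, and the baseline policy are our own assumptions (a learning policy would only observe realized Bernoulli rewards, whereas the trivial baseline below ignores them anyway).

```python
import numpy as np

def estimate_average_loss(policy, n, rho, w, T, seed=0):
    """Monte-Carlo estimate of Loss_n(pi; T, mu) for one draw of mu under
    the RIM, with D = Uniform[0, 1] as an illustrative prior (the paper
    only assumes a density bounded away from 0 and infinity on [0, 1])."""
    rng = np.random.default_rng(seed)
    k = int(np.ceil(n ** rho))
    alive = []                                 # alive[j]: reward rates of the age-j batch
    total = 0.0
    for _ in range(T):
        alive.insert(0, rng.uniform(size=k))   # new batch A_t arrives
        alive = alive[: w + 1]                 # batches older than the lifetime expire
        mus = np.concatenate(alive)
        plays = policy(mus, n)                 # play counts over available arms, summing to n
        total += float(plays @ (mus.max() - mus)) / n
    return total / T

def uniform_policy(mus, n):
    """A trivial baseline that spreads the n plays evenly over all
    available arms; its average loss stays Omega(1)."""
    m = len(mus)
    plays = np.full(m, n // m)
    plays[: n % m] += 1                        # distribute the remainder
    return plays

# e.g. estimate_average_loss(uniform_policy, n=10_000, rho=0.3, w=2, T=100)
```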
Finally, we present a lower bound on the worst-case regret that highlights the advantage of the RIM. We show that no policy achieves o(1) worst-case regret, defined as
$$\mathrm{Loss}^{\mathrm{wc}}_n(\pi) := \max_{\mu}\, \lim_{T \to \infty} \mathrm{Loss}_n(\pi; T, \mu),$$
where the max is over all instances $0 \le \mu \le 1$.

Proposition 3.4 (Worst-case Regret). For any policy $\pi$ and lifetime $w > 0$, we have $\mathrm{Loss}^{\mathrm{wc}}_n(\pi) \ge \frac{1}{2 w^2}$.

This bound can be proved by constructing an instance with binary rewards, such that in each round there is exactly one arm a with $\mu_a = 1$.

4. Upper Bounds

In this section, we establish the connections to the Batched Bandits (BB) problem. We first explain how a semi-adaptive algorithm for BB can be converted into a policy for SLHVB, and show how the guarantees translate from one problem to the other. Then, we consider a variant of the Batched Successive Elimination (BSE) algorithm (Gao et al., 2019) and analyze its regret. Finally, using this result, we obtain a policy for the SLHVB problem with $O(n^{-\min\{\rho,\, w/(2(w+1))\}})$ average loss. We defer the details to Appendix C.

4.1. Batched Bandits

In the BB problem (Perchet et al., 2016), the learner is given k arms, n slots, and an adaptivity level $\ell \ge 1$. In each phase $i = 0, 1, \dots, \ell$, the learner selects a multiset[6] $M_i$ of arms such that $\sum_{i=0}^{\ell} |M_i| = n$. Each time an arm a is selected, a reward is randomly drawn from a Bernoulli distribution with unknown reward rate $\mu_a$, and is immediately observed. If an arm is selected multiple times in one phase, the realized rewards may be different. The goal is to maximize the expected total reward.

Given an instance $(\mu_a)_{a \in [k]}$, the regret[7] of an algorithm A for BB is defined as
$$\mathrm{Reg}_n(A; \mu) := \frac1n \sum_{a \in [k]} (\mu^* - \mu_a)\, N_a,$$
where $\mu^* = \max_{a \in [k]}\{\mu_a\}$ and $N_a$ is the number of times arm a is selected. Most existing work for MAB considered the worst-case regret
$$\mathrm{Reg}^{\mathrm{wc}}_n(A) := \max_{\mu \in [0,1]^k} \mathrm{Reg}_n(A; \mu).$$
Analogous to the notion of average loss for the SLHVB problem, for BB we define the average regret over all instances drawn from a distribution D as
$$\mathrm{Reg}^{\mathrm{avg}}_n(A) := \mathbb{E}_{\mu \sim D}\,[\mathrm{Reg}_n(A; \mu)].$$

[6] An easy way to remember the indexing rule: $M_i$ is the batch of arms selected when using the i-th chance of being adaptive. [7] To avoid confusion, we use the synonymous terms algorithm for BB and policy for SLHVB; for the objective, we use regret for BB and loss for SLHVB.

An appealing class of algorithms is the semi-adaptive algorithms, where the cardinality of each batch $M_i$ of selected arms is decided in advance non-adaptively, while the selection of arms within each batch depends adaptively on the reward realizations.

Definition 4.1 (Semi-adaptive Algorithm). Given an adaptivity level $\ell > 0$, a semi-adaptive algorithm is specified by (i) grid sizes $\varepsilon_0, \dots, \varepsilon_{\ell-1} \in (0,1)$ with $\sum_{i=0}^{\ell-1} \varepsilon_i < 1$, and (ii) a family of decision rules $A_j : ([k] \times \mathbb{R})^{n_{j-1}} \to [k]^{n_j - n_{j-1}}$ for $j = 0, \dots, \ell$, where[8]
$$n_j = \begin{cases} \sum_{i=0}^{j} \varepsilon_i n, & \text{if } j = 0, \dots, \ell-1,\\ n, & \text{if } j = \ell,\\ 0, & \text{if } j = -1. \end{cases}$$

(Gao et al., 2019) proposed the following Batched Successive Elimination (BSE) algorithm. In each phase the algorithm keeps track of a subset $S_i \subseteq [k]$ of surviving arms (or simply survivors) that serve as candidates for the optimal arm, computed in the following manner. Initially $S_0 = [k]$. In each phase $i = 0, \dots, \ell-1$, the algorithm explores the arms in $S_i$ uniformly, i.e., selects each arm $\varepsilon_i n / |S_i|$ times. Finally, in the last phase, i.e., phase $\ell$, we arbitrarily choose an arm from $S_\ell$. For completeness, we formally state the algorithm in Algorithm 1. (Gao et al., 2019) analyzed the performance of the BSE algorithm under two types of grids.
The first is called the minimax grid: let $c = n^{\frac{1}{2-2^{1-\ell}}-1} = O(n^{-1/2}\, n^{2^{-\ell}})$ and recursively define $\varepsilon_{i+1} = c\,\sqrt{\varepsilon_i}$ with $\varepsilon_0 = c$. The second is called the geometric grid, as the batch sizes form a geometric progression. Specifically, let $c = n^{1/\ell}$ and recursively define $\varepsilon_{i+1} = c\,\varepsilon_i$ for $i = 0, \dots, \ell-2$. As the name suggests, the BSE algorithm under the minimax grid achieves the minimax regret. The geometric grid, on the other hand, is shown to achieve nearly-optimal instance-dependent regret. We rephrase the main result of (Gao et al., 2019) as follows.

[8] To avoid integrality issues, we assume $\varepsilon_i n / |S_i|$ is an integer. This does not affect the asymptotic order of our bounds.

Theorem 4.2 (Theorem 1 of (Gao et al., 2019)). Given any adaptivity level $\ell$, denote by $\mathrm{BSE}^{\mathrm{minimax}}_\ell$ and $\mathrm{BSE}^{\mathrm{geo}}_\ell$ the BSE algorithms under the minimax and geometric grids. Then, for any k-armed instance, it holds that $\mathrm{Reg}^{\mathrm{wc}}_n(\mathrm{BSE}^{\mathrm{minimax}}_\ell) = O(\sqrt{k/n}\cdot n^{2^{-\ell}})$. Further, if the highest and second-highest reward rates differ by $\Delta$, then $\mathrm{Reg}^{\mathrm{wc}}_n(\mathrm{BSE}^{\mathrm{geo}}_\ell) = O(n^{\frac1\ell - 1}\, k/\Delta)$.

Algorithm 1 Batched Successive Elimination Policy $\mathrm{BSE}(\varepsilon_0, \dots, \varepsilon_{\ell-1}; k')$ for Batched Bandits.
1: Input: $\ell$: adaptivity level; $\varepsilon_0, \dots, \varepsilon_{\ell-1} \in (0,1)$: exploration intensities; n: number of arms to be selected in total; A: a set of k arms
2: Let A' be a uniformly random subset of A of size k'; initialize $S_0 \leftarrow A'$
3: for $i = 0, 1, \dots, \ell-1$ do
4:   if $|S_i| = 1$ then
5:     Set $S_\ell = S_i$; Break
6:   end if
7:   if $|S_i| \ge 2$ then
8:     $n_i \leftarrow \lfloor \varepsilon_i n / |S_i| \rfloor$
9:     for $a \in S_i$ do
10:      Select arm a for $n_i$ times and observe rewards $X_{a,1}, \dots, X_{a,n_i}$
11:      $\bar X_a \leftarrow \frac{1}{n_i} \sum_{j=1}^{n_i} X_{a,j}$
12:    end for
13:    $\bar X_{\max} \leftarrow \max_{a \in S_i} \bar X_a$
14:    $S_{i+1} = \{a \in S_i : |\bar X_a - \bar X_{\max}| \le 3\, n_i^{-1/2} \log^{1/2} n\}$
15:   end if
16: end for
17: Select any arm in $S_\ell$ for $n - \sum_{i=0}^{\ell-1} \varepsilon_i n$ times

Clearly, the worst-case regret bound trivially carries over to the average regret. On the other hand, the worst-case regret of $\mathrm{BSE}^{\mathrm{geo}}_\ell$ depends on the parameter $\Delta$, which is random under the RIM. By Assumption 2.1, we have $\mathbb{E}_\mu[\Delta^{-1}] = \Theta(k)$, immediately implying an $O(k^2 n^{\frac1\ell - 1})$ bound on the average regret, as summarized below.

Corollary 4.3 (Average Regret of BSE). For any adaptivity level $\ell \ge 1$, we have $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^{\mathrm{minimax}}_\ell) = O(\sqrt{k/n}\cdot n^{2^{-\ell}})$ and $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^{\mathrm{geo}}_\ell) = O(k^2 n^{-(1-\frac1\ell)})$.

In particular, when $\ell = O(\log\log n)$, we have $n^{2^{-\ell}} = O(1)$ and hence the bound in Theorem 4.2 becomes $O(\sqrt{k/n})$. Surprisingly, this matches the $\Omega(\sqrt{k/n})$ lower bound for unlimited adaptivity; see, e.g., (Auer et al., 2002).

4.2. Breaking the $\sqrt{k/n}$ Barrier

Using a different geometric grid, we obtain a stronger bound on the average regret of the BSE algorithm, compared to the $O(\sqrt{k/n})$ bound in Corollary 4.3. More precisely, the common ratio of this geometric progression depends on k/n, whereas in (Gao et al., 2019) it depends only on n. With some foresight, let us consider the following choice of exploration intensities.

Definition 4.4 (Revised Geometric Grid). For any adaptivity level $\ell$ and integer $i = 0, \dots, \ell-1$, we define $\varepsilon^*_{i,\ell} = (k/n)^{(\ell-i)/(\ell+2)}$. For each fixed $\ell$, denote by $\mathrm{BSE}^*_\ell$ the algorithm $\mathrm{BSE}_\ell$ with grid sizes $(\varepsilon^*_{0,\ell}, \dots, \varepsilon^*_{\ell-1,\ell})$.

Our bound for a fixed $\ell$ requires $\rho$ to be larger than the following threshold; otherwise there is no need for having $\ell$ phases.

Definition 4.5 (Threshold Exponent). For each integer $\ell \ge 1$, we define the threshold exponent as $\theta_\ell := \frac{\ell-1}{2\ell+1}$.

To see why we need $\rho$ to be large, observe that by the RIM, the $\mu_a$'s are roughly $n^{-\rho}$ apart. On the other hand, if $\rho$ is small, the confidence intervals are expected to be narrower than $n^{-\rho}$ with strictly fewer than $\ell$ phases, and hence there is no need for extra layers; a Python sketch of the algorithm under this grid follows, and a concrete numerical example comes after it.
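The following is a minimal Python sketch of Algorithm 1 under the revised geometric grid of Definition 4.4. It is our own simplified reading of the pseudocode, not the authors' implementation: the resampling step (line 2 of Algorithm 1) is omitted, and the `pull(a, m)` reward oracle is an assumed interface.

```python
import numpy as np

def revised_geometric_grid(k, n, ell):
    """Exploration intensities eps*_{i,ell} = (k/n)^((ell-i)/(ell+2)),
    i = 0, ..., ell-1, as in Definition 4.4."""
    return [(k / n) ** ((ell - i) / (ell + 2)) for i in range(ell)]

def bse(pull, k, n, ell, eps=None):
    """Batched Successive Elimination (a sketch of Algorithm 1).
    `pull(a, m)` must return an array of m Bernoulli rewards for arm a."""
    eps = revised_geometric_grid(k, n, ell) if eps is None else eps
    survivors = list(range(k))                       # S_0: all arms survive
    used = 0                                         # plays spent so far
    for i in range(ell):
        if len(survivors) == 1:
            break
        m_i = int(eps[i] * n / len(survivors))       # plays per survivor in phase i
        if m_i == 0:
            continue
        means = {a: float(np.mean(pull(a, m_i))) for a in survivors}
        used += m_i * len(survivors)
        best = max(means.values())
        radius = 3.0 * np.sqrt(np.log(n) / m_i)      # elimination threshold (line 14)
        survivors = [a for a in survivors if best - means[a] <= radius]
    winner = survivors[0]                            # exploitation: any survivor
    pull(winner, max(0, n - used))                   # spend the remaining budget
    return winner

# Example oracle for a synthetic instance drawn from the RIM:
# mu = np.random.default_rng(0).uniform(size=k)
# pull = lambda a, m: np.random.binomial(1, mu[a], size=m)
```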
More concretely, suppose $\ell = 2$ and consider $\rho = 0.04 < \theta_2 = 1/5$. To see why the second phase is redundant, note that by Definition 4.4, $\varepsilon^*_0 = O((k/n)^{1/2})$, so each arm is selected $\varepsilon^*_0 n / k = (n/k)^{1/2} = n^{0.48}$ times upon arrival. Thus, the confidence interval of each arm after the first batch has width $(n^{0.48})^{-1/2} = n^{-0.24}$. On the other hand, the reward rates are spaced $n^{-\rho} = n^{-0.04}$ apart on average. Therefore, the optimal arm in $A_t$ is likely to be identified after just one phase.

Theorem 4.6 (Average Regret of $\mathrm{BSE}^*_\ell$). For any adaptivity level $\ell$ with $\rho \ge \theta_\ell$, the average regret of the BSE algorithm under the revised geometric grid can be bounded as $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_\ell) = O((k/n)^{\ell/(\ell+2)})$.

The proof of this result crucially relies on the RIM. We argue that, due to the RIM, after j phases the number of surviving arms is w.h.p.[9] bounded by a function of $n, \rho, j$. Consequently, each surviving arm is guaranteed at least a certain number of slots in phase j + 1. As a caveat, the above argument breaks when only one surviving arm remains after a number of phases. This event, however, occurs with low probability, since $\rho$ is greater than the threshold exponent.

[9] We say an event occurs w.h.p. (which stands for "with high probability") if it has probability $1 - n^{-\Omega(1)}$.

Note that the average regret becomes lower as $\ell$ increases. But $\ell$ cannot be arbitrarily high since we require $\theta_\ell \le \rho$. To find the maximal feasible $\ell$, consider the following meta-algorithm: If $\rho < 1/2$, choose the maximum $\ell$ with $\rho \ge \theta_\ell$; equivalently, choose $\ell = \ell^*(\rho) := \lfloor \frac{1+\rho}{1-2\rho} \rfloor$. If $\rho \ge \frac12$, then the condition $\rho \ge \theta_\ell$ holds for every $\ell$, and so we can choose an arbitrarily large $\ell$. We formally state this result as follows.

Corollary 4.7 (Explicit Form of the Average Regret). The average regret of the BSE algorithm under the revised geometric grid satisfies the following:
(i) If $0 < \rho < \frac12$, then for adaptivity level $\ell^*(\rho) = \lfloor \frac{1+\rho}{1-2\rho} \rfloor$, the average regret can be bounded as $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_{\ell^*(\rho)}) = O\big((k/n)^{\frac{\ell^*(\rho)}{\ell^*(\rho)+2}}\big)$.
(ii) If $\rho \ge \frac12$, then for any adaptivity level $\ell \ge 1$, the average regret can be bounded as $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_\ell) = O\big((k/n)^{\frac{\ell}{\ell+2}}\big)$. In particular, $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_\ell) = (k/n)^{1-O(1/\ell)}$ as $\ell \to \infty$.

Note that $n^{2^{-\ell}} > 1$ for any $\ell$, so the $O(\sqrt{k/n}\cdot n^{2^{-\ell}})$ bound in Theorem 4.2 is no better than $O((k/n)^{1/2})$. In contrast, whenever $\rho \ge 1/5$, we have $\ell^*(\rho) \ge 2$, so the bound in Corollary 4.7 is stronger. This contrast becomes sharper as $\rho$ increases. For example, with $\rho = 1/3$, we have $\ell^*(\rho) = 4$, and hence $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_{\ell^*(\rho)}) = O((k/n)^{2/3})$, which is better than $O((k/n)^{1/2})$. In the extreme case, for $\varepsilon > 0$ and $\rho = \frac12 - \varepsilon$, we have $\ell^*(\rho) = \Omega(1/\varepsilon)$ and hence an average regret of $(k/n)^{1-O(\varepsilon)}$, which is a much stronger guarantee than $O((k/n)^{1/2})$.

4.3. From Regret to Loss

A semi-adaptive algorithm for BB induces the following policy for SLHVB. Let $\varepsilon_i$ be the grid sizes, usually chosen such that $\sum_{i=0}^{\ell-1} \varepsilon_i = o(1)$, e.g., as in Definition 4.4. At each time t, the induced policy uses $\varepsilon_i n$ slots to perform the i-th phase of the BB algorithm on $A_{t-i}$, i.e., the arms that arrived at time $t-i$. Meanwhile, we use the remaining $(1 - \sum_{i=0}^{\ell-1} \varepsilon_i)\, n$ slots for exploitation; see Algorithm 2 for a formal statement.

As a nice property, the induced policy of any semi-adaptive algorithm selects exactly n arms in each round and is hence valid for SLHVB. In fact, suppose the semi-adaptive policy for BB has grid sizes $(\varepsilon_i)$. Using the notation in Algorithm 2, for each round t and each phase $j = 0, \dots, \ell$, an arm $a \in A_{t-j}$ is selected $N^t_{j,a}$ times. By definition of $\varepsilon_j$, we have $\sum_{a \in A_{t-j}} N^t_{j,a} = \varepsilon_j n$.
Summing over all j's, the total number of arms selected is $\sum_{j=0}^{\ell} \sum_{a \in A_{t-j}} N^t_{j,a} = n$.

Algorithm 2 Induced Policy $\pi[A; k']$ for SLHV Bandits
1: Input: A: a semi-adaptive algorithm for BB; k': cardinality of the random subset of arms
2: for t = 1, 2, ... do
3:   Sample a random subset $A'_t$ of $A_t$ of size k'
4:   $A_t \leftarrow A'_t$
5:   for $j = 0, \dots, \ell$ do
6:     for $a \in A_{t-j}$ do
7:       Query A for the number $N^t_{j,a}$ of times to select arm a in the j-th batch
8:       Select a for $N^t_{j,a}$ times, and return the rewards to A
9:     end for
10:  end for
11: end for

Note that in Algorithm 2, we only use age-$\ell$ arms for exploitation. This is not practical: an obviously better policy would exploit the empirically best arm in $A^{t-\ell}_{t-w} = \bigcup_{j=\ell}^{w} A_{t-j}$ rather than just in $A_{t-\ell}$. We choose to state the policy in this way for simplicity of analysis. By doing so, the loss only increases by $O(n^{-\rho})$, which is of the same order as the $\Omega(n^{-\rho})$ lower bound in Theorem 3.1, and is hence not essential.

We can convert the regret of any semi-adaptive BB algorithm to the loss of the induced SLHVB policy as follows.

Proposition 4.8 (Regret-to-Loss Conversion). Suppose A is a semi-adaptive algorithm with average regret R(n, k) on any BB instance with k arms and n slots. Then for any k', the loss of the induced policy $\pi[A, k']$ for SLHVB satisfies $\mathrm{Loss}_n(\pi[A, k']) = O(1/k' + \rho^{-1} n^{-\rho} + R(n, k'))$.

We explain the high-level idea and defer the details to Appendix B. The term 1/k' captures the gap between the optimal reward rate in the resampled subset and that in the original set of arms. The second term, of order $n^{-\rho} = 1/k$, captures the loss due to the reward uncertainty of new arrivals: we have to decide how many times to choose them without any knowledge of their qualities apart from the RIM assumption. The final term, R(n, k'), bounds the average regret on the resampled subset, which contains k' arms.

4.4. Loss of the Induced Policy

The average regret bound in Theorem 4.6 and the regret-to-loss conversion formula in Proposition 4.8 immediately lead to a loss bound for the induced SLHVB policy. To formalize this, we first recapitulate some notation. Given a semi-adaptive algorithm A for BB, we defined $\pi[A, k']$ as the induced policy for SLHVB with resampling size k'. Let $\mathrm{BSE}^*(\ell, k') := \pi[\mathrm{BSE}^*_\ell, k']$ be the SLHVB policy induced by the BSE algorithm with the revised geometric grid $(\varepsilon^*_{j,\ell})$ specified in Definition 4.4.

Proposition 4.9 (Loss of the Induced Policy). If $\ell \le w$ and $\rho \ge \theta_\ell$, then
$$\mathrm{Loss}_n(\mathrm{BSE}^*(\ell, k)) = O\Big(n^{-\rho} + n^{(\rho-1)\frac{\ell}{\ell+2}}\Big) = \begin{cases} O(n^{-\rho}), & \text{if } \rho < \frac{\ell}{2(\ell+1)},\\ O\big(n^{(\rho-1)\frac{\ell}{\ell+2}}\big), & \text{otherwise,} \end{cases} \quad (1)$$

Different from the previous subsections, we now have an extra lifetime constraint that $\ell \le w$. Our meta-policy, called the Hybrid policy, chooses a suitable $\ell$ for any given $\rho$ (see Algorithm 3).

Algorithm 3 Hybrid Policy $H(\rho; w)$
1: If $\rho < \frac15$, then set $\ell = 1$ and $k' = k$.
2: If $\frac15 \le \rho < \frac{w}{2w+2}$, then set $k' = k$ and choose any $\ell \le w$ such that $\theta_\ell = \frac{\ell-1}{2\ell+1} \le \rho < \frac{\ell}{2\ell+2}$.
3: If $\rho \ge \frac{w}{2w+2}$, then set $\ell = w$ and $k' = n^{\frac{w}{2w+2}}$.
4: Invoke $\mathrm{BSE}^*(\ell, k')$.

We prove the following in Appendix D.

Theorem 4.10 (Loss of the Hybrid Policy). For any SLHVB instance with n slots per round, volume exponent $\rho > 0$ and lifetime $w \ge 1$, the average loss of the Hybrid policy satisfies $\mathrm{Loss}_n(H(\rho; w)) = O\big(\rho^{-1}\, n^{-\min\{\rho,\, \frac{w}{2(w+1)}\}}\big)$.

To better understand the impact of the lifetime w, observe that for w = 1, 2, the loss bounds are asymptotically $r_1(n) := n^{-\min\{\rho,\,1/4\}}$ and $r_2(n) := n^{-\min\{\rho,\,1/3\}}$. In particular, when $\rho > 1/4$, we have $r_2 = o(r_1)$ as $n \to \infty$.
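To make the parameter choice of Algorithm 3 explicit, here is a short sketch that maps $(\rho, w, k, n)$ to the adaptivity level $\ell$ and resampling size k' used by the Hybrid policy. The tie-breaking (taking the smallest feasible $\ell$ in Step 2) and the rounding of k' are our own choices and are not specified in the paper.

```python
import math

def hybrid_parameters(rho, w, k, n):
    """Adaptivity level ell and resampling size k_prime, chosen as in the
    Hybrid policy H(rho; w) (Algorithm 3)."""
    if rho < 1 / 5:                                    # Step 1
        return 1, k
    if rho < w / (2 * w + 2):                          # Step 2
        for ell in range(1, w + 1):                    # any feasible ell works; take the smallest
            theta = (ell - 1) / (2 * ell + 1)          # threshold exponent (Definition 4.5)
            if theta <= rho < ell / (2 * ell + 2):
                return ell, k
    # Step 3: rho >= w/(2w+2); cap ell at the lifetime and shrink the arm set.
    return w, math.ceil(n ** (w / (2 * w + 2)))

# e.g. hybrid_parameters(rho=0.3, w=4, k=1000, n=100_000) -> (ell, k_prime)
```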
5. Field Experiment

We further validated the effectiveness of our policy in a field experiment, via a collaboration with Glance, a leading lock-screen content platform that faces the aforementioned challenge. Specifically, their marketing team curates around 200 content cards (or simply, cards) per hour, and around 70% of them expire within 48 hours. Each card consists of a link to some external content (e.g., video, news or article), along with a short text description; see Figure 1. The firm needs to recommend cards to users with the goal of maximizing the total user engagement, measured by the total duration of the interactions and the number of click-throughs. This problem can be cast as an SLHVB problem, assuming that the total number of impressions is independent of the recommendations.

Two key quantities are of particular interest for each card: (i) the click-through rate (CTR) and (ii) the average duration per impression. Both metrics are unknown when a card is released. We mix the above two metrics by considering conversions: a conversion occurs if either a click-through occurs or the duration reaches a threshold of θ = 5 seconds. For simplicity, we assume that the rewards across different impressions are i.i.d. random variables.

Figure 1. Glance's content cards. The image and text summarize the content, and by clicking on the link the user is shown the details.

The platform sends cards to the users on an hourly basis. For each user, the platform replaces the cards that the user has viewed in the past with new content cards, provided the device is connected to the internet. The users can swipe through the cards stored in the platform's app. When a user is interested in the content, they can click on the provided link and be redirected to an external source for further engagement, and then be redirected back to the app when finished.

To decide which cards to send to each user, the firm deployed a recommender system based on a state-of-the-art Deep Neural Network (DNN). This DNN predicts, for each pair of user and card, the expected conversion probability, using (i) the user's interaction history and (ii) the card's text and image features. Although the current recommender system works reasonably well, there is considerable potential for improvement. Most notably, the current system only uses the user feedback to update the users' behavioral signatures for future prediction, and does not directly leverage the feedback in making recommendations for similar users. In particular, it does not use the feedback to adjust the predictions of the conversion rates directly. This may have caused substantial loss in user engagement. It is thus vital for the platform to find a recommendation policy that (i) can learn the true conversion rates of new cards quickly using user interaction data and (ii) is computationally simple to deploy. Our policy is well-suited for this task. We defer the implementation details to Appendix E.

We perform a detailed analysis of the field experiment results and show that the simplest version of our policy outperforms the DNN-based recommender by about 4% in total duration (see Figure 2) and by nearly 7% in the total number of click-throughs per user per day (see Figure 3). We defer the detailed analysis to Appendix F.

Figure 2. Duration per user-day pair.

Figure 3. Click-through per impression.

References

Agarwal, A., Agarwal, S., Assadi, S., and Khanna, S.
Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pp. 39-75. PMLR, 2017. Audibert, J.-Y. and Tsybakov, A. B. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608-633, 2007. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002. Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and Shepp, L. A. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103-2116, 1997. Chakrabarti, D., Kumar, R., Radlinski, F., and Upfal, E. Mortal multi-armed bandits. Advances in Neural Information Processing Systems, 21:273-280, 2008. Farias, V. F. and Madan, R. The irrevocable multiarmed bandit problem. Operations Research, 59(2):383-399, 2011. Gao, Z., Han, Y., Ren, Z., and Zhou, Z. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019. Gauthier, C.-S., Gaudel, R., and Fromont, E. UniRank: Unimodal bandit algorithms for online ranking. In International Conference on Machine Learning, pp. 7279-7309. PMLR, 2022. Ghosal, S. Convergence rates for density estimation with Bernstein polynomials. The Annals of Statistics, 29(5):1264-1280, 2001. Kohavi, R. and Longbotham, R. Online controlled experiments and A/B testing. Encyclopedia of Machine Learning and Data Mining, 7(8):922-929, 2017. Komiyama, J., Honda, J., and Nakagawa, H. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning, pp. 1152-1161. PMLR, 2015. Lagrée, P., Vernade, C., and Cappé, O. Multiple-play bandits in the position-based model. Advances in Neural Information Processing Systems, 29, 2016. Lai, T. L., Robbins, H., et al. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985. Levine, N., Crammer, K., and Mannor, S. Rotting bandits. Advances in Neural Information Processing Systems, 30, 2017. McFarland, C. Experiment!: Website conversion rate optimization with A/B and multivariate testing. New Riders, 2012. Perchet, V., Rigollet, P., Chassang, S., Snowberg, E., et al. Batched bandit problems. Annals of Statistics, 44(2):660-681, 2016. Petrone, S. and Wasserman, L. Consistency of Bernstein polynomial posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(1):79-100, 2002. Radlinski, F., Kleinberg, R., and Joachims, T. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pp. 784-791, 2008. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018. Vuleta, B. How much data is created every day? 2021. URL https://seedscientific.com/how-much-data-is-created-every-day/. Wang, Y., Audibert, J.-Y., and Munos, R. Algorithms for infinitely many-armed bandits. Advances in Neural Information Processing Systems, 21, 2008. Wasserman, L. All of Nonparametric Statistics. Springer Science & Business Media, 2006. Xu, Y., Chen, N., Fernandez, A., Sinno, O., and Bhasin, A. From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2227-2236, 2015. Yang, F., Ramdas, A., Jamieson, K. G., and Wainwright, M. J.
A framework for multi-a (rmed)/b (andit) testing with online fdr control. Advances in Neural Information Processing Systems, 30, 2017. Zhang, X. and Frazier, P. I. Restless bandits with many arms: Beating the central limit theorem. ar Xiv preprint ar Xiv:2107.11911, 2021. Short-lived High-volume Bandits A. Details of the Lower Bounds A.1. Proof of Proposition 3.1 Consider the event Wt 1 = 1 2 (w 1)k µmax(At 1 t w) 1 1 (w 1)k We first show that this event occurs with large probability. Lemma A.1. If K 10, then P(Wt 1) 1 Proof. Denote µmax = µmax(At 1 t w). Let H = n µmax 1 1 (w 1)k o and H = n µmax < 1 2 (w 1)k o . Then for any k 10, we have H = P µmax < 1 1 (w 1)k = 1 1 (w 1)k and thus P[H] 1 3 4e. Moreover, since P[H ] = 1 2 (w 1)k (w 1)k 2 2 e 2, we conclude that P[Wt 1] 1 P[H ] P[H] 1 1 3 Thus, by giving up a constant factor in the loss, we may restrict our attention on the event Wt 1. For each time period t, define probability measure Pt( ) = P[ |Wt 1] and Et( ) = E[ |Wt 1]. Consider the event that at time t, the policy selects arms from At at least n/2 times, i.e., Et = P a At πt(a) n 2 . Further, define B t = µmax(At) µmax(At 1 t w) 1 and B+ t = µmax(At) µmax(At 1 t w) + 1 Observe that Et(Lt) = Et(Lt|Et B t ) Pt(Et B t ) + Et(Lt|Et B+ t ) Pt(Et B+ t ) + Et(Lt|Et B t ) Pt(Et B t ) + Et(Lt|Et B+ t ) Pt(Et B+ t ) Et[Lt|Et B t ] Pt(Et B t ) + Et(Lt|Et B+ t ) Pt(Et B+ t ). (2) We next lower bound the above two terms. Note that the event in the first term, i.e., Et B t , says that the policy selects arms from At for at least n/2 times, but the best arm in At is suboptimal (w.r.t. µmax(At t w)) by 1/k. Intuitively, when this event occurs, the policy suffers at least n k = n 2k loss in this round. On the other hand, the event in the other term, i.e., Et B+ t , says that the policy selects At for less than n/2 times, but the best arm in µ(At 1 t w) is suboptimal by n/(wk). By a similar argument, we can show that the loss in this round is at least 1 2wk on this event. At a high level, both these two lower bounds are caused by not knowing quality of the new batch of arms. We formalize the above ideas in the following lemma. Lemma A.2 (Loss for Uncertainty in the New Batch). The loss in any round t satisfies Et[Lt|Et B t ] n 2k and Et[Lt|Et B+ t ] n 2wk. Proof. Write µ t = µmax(At). Recall from the definition that when B t occurs, we have µ t µmax(At) 1/k. Thus, if π Short-lived High-volume Bandits selects arms from At for Ω(n) times, an Ω(n/k) loss is incurred. Formally, Et Lt| Et B t = Et a At t w πt(a) (µ t µa) Et B t a At πt(a) (µ t µa) Et B t a At πt(a) (µ t µmax(At)) Et B t Note that conditional on Et, B t and Wt 1 (recall that Et[ ] = E[ |Wt 1]), we have P a At πt(a) n 2 and µ t µmax(At) 1/k. Thus, Now consider the second term in (2). When B+ t occurs, we have µmax(At) µmax(At 1 t w) 1 wk, so if the π selects arms in At 1 t w for Ω(n) times, an Ω(n/k) loss is incurred. Formally, Et Lt| Et B+ t = Et a At t w πt(a) (µ t µa) Et B+ t πt(a) µmax (At) µmax(At 1 t w) Et B+ t Note that conditional on Et, B t and Wt 1, we have P a At 1 t w πt(a) n 2 and At µmax(At 1 t w) 1 wk. Therefore, wk = n 2wk . A.2. Proof of Proposition 3.2 Consider the event Gt = µmax(At t w) 1 n 1/2 where t w. Lemma A.3 (µmax is Close to 1). For k > n, we have P[Gt] 1 Proof. Since |At t w| = wk and the reward rate of each arm is drawn i.i.d. uniformly, we have P[Gt] = (1 δ)kw = (1 δ) 1 δ kwδ e kwδ. Since k > n, we have kδ > n 1 2 n 1 2 1, so P[Gt] e kwδ e w 1 2, i.e. 
P[Gt 1 An arm is said to be unexplored at time t if it has never been selected by the policy before. Our proof considers the number of unexplored arms selected in each round, as formalized below. Definition A.4 (Unexplored Arms). We say an arm a is unexplored at round t if it has never been selected by (the start of) round t; formally, this means Pt 1 τ=1 πτ(a) = 0. For each round t, define Mt as the subset of unexplored arms selected in this round; formally, Mt = {a At t w : πt(a) > 0 and Pt 1 s=1 πs(a) = 0}. Moreover, as for any t, t , we write From a myopic perspective, selecting too many unexplored arms in a round leads to high loss since, by Assumption 2.1, each unexplored arm is suboptimal by Ω(1) on average. We formalize this in the following lemma. Lemma A.5 (Cost of Selecting Many unexplored Arms). For any t 1, consider the event Vt = n |Mt| n expected loss in round t conditional on Vt satisfies E [Lt|Vt] n 24w. Short-lived High-volume Bandits Proof. Write µ t = µmax(At t w) and Et[ ] = E[ |Gt] for any t. We start with lower bounding the conditional loss. Observe that Et [Lt|Vt] = Et a A πt(a) (µ t µa) Vt a A Et h πt(a) 1(a Mt) (µ t µa) Vt i a A Et[1(a Mt) 1 n 1/2 µa | Vt], (5) where the last inequality follows by definition of Et and that a Mt implies πt(a) 1. To further simplify, note that the reward rate of any unexplored arm a is an independent draw and is hence independent with the set Mt of unexplored arms selected in this round, so a A Et [1(a Mt)| Vt] Et h 1 n 1/2 µa| Vt i . (6) Note that Et[µa|Vt] = E[µa] = 1 2, we have Et 1 n 1/2 µa| Vt = 1 n 1/2 1 3 for any n 1. Therefore, a A 1(a Mt) Vτ 3 = Et h |Mt| Vτ i 1 By Lemma A.5, we have P[Gt] 1 2, and thus E[Lt|Vt] Et[Lt|Vt] P[Gt] 1 2 n 12w = n 24w. The above says that selecting too many unexplored arms leads to high loss. On the flip side, selecting too few unexplored arms also leads to high loss. To show this, consider the cold-start event where no δ-good arms that π ever selected is still available at the start of round t. Definition A.6 (Cold-start Event). For each t 1, the cold-start event is defined as Bt = µmax(St t w) 1 n 1/2 . As the name suggests, when this event occurs, the policy has little information about the available arms, and hence it behaves as if the time horizon has restarted. This leads to high loss. In fact, a policy has to identify a δ-good arm to have low loss from time t w to t + w. This forces the policy to explore Ω(1/δ) = Ω(n1/2) unexplored arms, which leads to an Ω(n1/2) loss. This result is implied by the lower bound for infinite armed bandits (Wang et al., 2008). Lemma A.7 (Cold-start Incurs High Loss). For any t, we have E h Pt+w τ=t Lτ Bt i n To apply the above, we next characterize when Bt would occur. Consider the number Nt of new arms selected in [t w, t]. If ENt 1/δ, then on average the policy suffers Ω(1/δ) = Ω( n) loss. If ENt < 1/δ, then with Ω(1) probability, none of those Nt arms are δ-good, which leads to the cold-start event Bt. We formalize this idea below. Lemma A.8 (Under-Exploration Leads to Cold-start Event). Suppose E h Pt s=t w Ls i n 96w2 , then P[Bt] 1 Proof. Consider the event Es = n µmax(Ss) 1 1 n o that the for any s = 1, 2 . . . , then our goal reduces to showing that P[Es] 1 2w for every s = t w, . . . , t. In fact, if none of the events Et w, . . . , Et occurs, i.e. no unexplored arm selected in the past w rounds is n 1/2-good, then Bt occurs. It then follows from the union bound that Short-lived High-volume Bandits and the proof will be complete. 
To show P[Es] 1 2w for s {t w, ..., t}, observe that for any fixed s, P[Es] = P[Es|Vs] P[Vs] + P[Es|Vs] P[Vs] P[Vs] + P[Es|Vs], (7) where we recall that Vt = n |Mt| n 4w o . we will bound each of the two terms in (7) by 1 4w separately. To see P[Vs] 1 4w, note that n 96w2 t =t w E[Lt ] E[Ls] E[Ls|Vs] P[Vs]. Note that by Lemma A.5, we have E[Ls|Vs] n 24w, so P[Vs] 1 4w. To bound the second term in (7), note that by definition of Es, for any C, δ > 0 we have Es Ss C i (1 δ)C (1 δ) 1 δ δC e δC 1 δC, where the last inequality follows since e x 1 x for any x R. In particular, for C = n 4w and δ = 1 n, we have Es Vs i = P P h Es Vs i = 1 P h We now combine the above to establish the n lower bound. Proof of Proposition 3.2. Fix any round t. Decompose Lt+w t w into the loss before and after round t as τ=t w ELτ = τ=t w ELτ + If the first term, i.e., the loss before time t, is greater than n 96w2 , then the claim holds trivially. Otherwise, by Lemma A.8, the cold-start event Bt occurs w.p. P[Bt] 1 2. Thus, the loss after time t can be bounded using Lemma A.7 as 2 = n 12w > n 96w2 . A.3. Proof of Theorem 3.1 Now we are ready to show the Ω(1/k) lower bound. By (2) and Lemma A.2, we have Et(Lt) Et[Lt|Et B t ] Pt(Et B t ) + Et(Lt|Et B+ t ) Pt(Et B+ t ) 2k Pt(Et B t ) + n 2wk Pt(Et B+ t ). (8) Note that the events Et and B+ t (Et and B t resp.) are independent conditional on Gt 1, so (8) n 2wk 1 2 Pt(Et) + 1 w Pt(Et) n 2w2k . Short-lived High-volume Bandits E[Lt] E[Lt 1(Gt 1)] = E[Lt|Gt 1] P[Gt 1] n 2w2k 1 8 = n 16w2k . Summing over t and taking the limit, we conclude that Lossn(π) = lim T 1 n T t=1 E[Lt] 1 16w2k . B. Proof of Proposition 4.8: Regret-to-Loss Conversion As we recall, given a policy π = (πt) for SLHV bandits, the loss in round t is Lt = 1 a At t w πt(a) (µ t µa) i . We first decompose Lt into the following internal and external loss. Definition B.1 (External and Internal Loss). Let A be a semi-adaptive algorithm for the BB problem with adaptivity level ℓand π = (πt) be the induced policy for the SLHVB problem. For any round t, integer j w, let t,j = µmax(At t w) µmax(At). Define the external and internal loss as Lext t = E max 1 j w t,j and Lint t = E n (µmax(At j) µ(a)) Here, the term t,j is external in the sense that it does not depend on the policy, but only on the randomness in the RIM. On the other hand, the term Lint t is internal it measures the reward gap between the selected arms and the best arms among the respective age groups. It is straightforward to show the following. Lemma B.2 (Loss Decomposition). In any round t, the loss satisfies Lt Lint t + Lext t . Proof. By the definition of internal and external loss, we have a At t w πt(a) (µ t µa) n µmax(At t w) µa (since πt(a) = 0 if a At t w\At t ℓ) n ( t,j + (µmax(At j) µa)) (by definition of i,j) E max 1 j w t,j n (µmax(At j) µa) We next bound the external and internal losses separately. We start by showing the external regret is O(1/k). Recall from the definition of the RIM that µa D and from Assumption 2.1 that the density of D is bounded by C2, C1 > 0 from above and below. Proposition B.3 (External Loss). For any round t, the external loss can be bounded as Lext t 3ρ log n C1k for any sufficiently large n.10 10We say that a property Pn (e.g., an inequality) holds for any sufficiently large n if there exists a constant n0 > 0 such that Pn holds whenever n n0. Short-lived High-volume Bandits Proof. Consider any i [w] and Yi := µmax(At i). 
For any ε < 1 and arm a A, by the RIM and Assumption 2.1, we have P[µa > 1 ε] C1ε, or equivalently, P[µa 1 ε] 1 C1ε. In particular, for ε = 2ρ log n C1k , we have P Yi < 1 2ρ log n 1 C1 2ρ log n k = 1 2ρ log n k 2ρ log n 2ρ log n e 2ρ log n = n 2ρ. Thus, the event B := S i [w] n Yi < 1 2ρ log n C1k o has probability P [B] P i [w] n 2ρ wn 2ρ. It follows that E min i [w] Yi = E min i [w] Yi B P B 1 2ρ log n 1 wn 2ρ 1 3ρ log n where the last inequality follows since wn 2ρ 2ρ log n C1k for any sufficiently large n. Therefore, E maxi [w] t,i 1 E mini [w] Yi 3ρ log n Now we are ready to prove the regret-to-loss conversion formula. Proof of Proposition 4.8. Recall that k is the size of the resampled subset of each set At of arriving arms. For any fixed T w, by re-arranging the terms in (9), we obtain t=1 Lint t = n (µmax (At j) µa) n µmax (At j) µmax A t j + µmax A t j µa µmax A t j µmax (At j) + 1 a At j πt(a) µmax A t j µa where the inequality follows since for any t, the total number of arms the policy selects satisfies Pℓ j=0 P a At j πt(a) = n. Note that E max0 j ℓ µmax A t j µmax (At j) 1 j=0 πt+j(a) (µmax(At) µa) k + (T 2ℓ) R(n, k ) + 2ℓ, where the 2ℓterm in the first inequality is because we are summing from ℓto T ℓ. It follows that Lossn(π) lim T 1 T Lint t + Lext t k + 3 log k C1ρk + lim T (T 2ℓ) R(n, k ) + 2ℓ k + 3 log k C1ρk + R(n, k ). where the last identity follows since R(n, k ) does not depend on T, and ℓ= O(1) as T . C. Proof of Theorem 4.6: Average Regret of BSE In this section we prove Theorem 4.10, which says that the average regret is O((k/n)ℓ/(ℓ+2)) for the BSE algorithm with the revised geometric grid, specified in Definition 4.4. We illustrate the key ideas by considering adaptivity levels ℓ= 1 and 2 as warm-up first in Section C.2 and C.2 respectively, and then in Section C.4 we present the proof for general ℓ 1. Short-lived High-volume Bandits C.1. Preliminaries We introduce some tools for our analysis. For any arm a, consider i.i.d. Bernoulli rewards (Za i,j)i [w],j [n] with mean µa. We first state a standard concentration bound for independent random variables, see e.g., (Vershynin, 2018). Lemma C.1 (Concentration Bounds). Let Z1, ..., Zm be independent random variables supported on [0, 1], and Z = 1 m Pm i=1 Zi, then for any δ > 0, it holds that P(| Z E( Z)| > δ) exp δ2m (Hoeffding s inequality), P | Z E( Z)| > δ E( Z) exp (Chernoff s inequality). Consider the event that the empirical mean rewards of all available arms available at time t satisfy Hoeffding s inequality. Definition C.2 (Clean Event). For any integers i, m and arm a, define j=1 Za i,j mµa p 2m (ρ + 3) log n We define the clean event as Ct = T Ci,m a where the intersection is over m [n], i [w] and a At t w. We use Lemma C.1 to show that Ct occurs with high probability in each round t. Lemma C.3 (Clean Event is Likely). For any time t, the clean event satisfies P Proof. Fix an arbitrary i [w] and arm a At t w. Then for any integer m, by Hoeffding s inequality, Ci,m a i = P j=1 Za i,j mµa > p 2m(ρ + 3) log n 2m 2m(ρ + 3) log n = n 3. By the union bound over all a [k], i [w] and m [n], we have m [n],i [w], a At t w m [n],i [w], a At t w m [n],i [w], a At t w j=1 Za i,j mµa > p 2m(ρ + 3) log n nwk n ρ 3 = w n 2 n 1. We will also repeatedly apply the following simple fact. Lemma C.4 (3 -Inequality). Let > 0 and X = {xj}j [k], X = {x j}j [k] be two sets of real numbers such that |xj x j| for all j [k]. Suppose for some i we have x i max X . Then, xi max X 3 . Proof. 
Since |xj x j| for all j [k], we have | max X max X | . Hence, x i max X max X 2 . Therefore, xi x i max X 2 = max X 3 . Short-lived High-volume Bandits C.2. Warm-up: ℓ= 1 In this subsection we show that the average regret is O (k/n)1/3 for the BSE algorithm with grid ε 0,1. We first prove that w.h.p. all surviving arms are close to optimal. Recall from Algorithm 1 that S1 denotes the surviving arms at the end of the first phase. Lemma C.5 (Surviving Arms Have High Rewards). If the clean event C occurs, then µmax(A) µmin(S1) δ1 where δ1 = 3 ε0n k 1/2 log1/2 n. Proof. By Hoeffding s inequality (see Lemma C.1), the deviation of the empirical mean reward of each arm satisfies |ˆµa µa| 1 3δ1 for all arms a A. Consider ˆa = arg maxa ˆµa and a = arg maxa µa. By definition of S1 (see Algorithm 1), an arm a survives only if |ˆµa ˆµˆa| 1 3δ1. Thus, by the 3 -inequality (Lemma C.4) we have µa µmax(A) δ1. Combining the above with the regret decomposition (Lemma B.2), we obtain the following. Proposition C.6 (Average Regret of BSE 1). Suppose ρ 1. Then the average regret of BSE 1, the BSE algorithm with exploration intensities ε 0,1, satisfies Regavg n (BSE 1) 5 k 1/3 log1/3 n. Proof. To suppress notations, write ε0 := ε 0;1. By Lemma C.5, we have µmax(A) µmin(S1) δ1, and thus Regavg n (BSE 1) ε0 + (1 ε0) E [µmax(A) µmin(S1)] ε0 + δ1 P [C] + P ε0 + δ1 + n 1, where the last inequality follows from Lemma C.3. Expanding δ1 using Lemma C.5 and recalling that ε 0,1 = k n 1/3 log1/3 n, we have Regavg n (BSE 1) 4 k 3 log 1 3 n + n 1 5 k 3 log 1 3 n, where the last inequality follows since 1 C.3. Bounding the Number of Survivors: Analysis for ℓ= 2 Recall that for adaptivity level ℓ= 2, we first select a batch of arms and compute a subset S1 of surviving arms whose confidence intervals are not dominated by any other arm. Then we select another batch of arms, and compute a further subset S2 S1 in a similar manner. Finally, in the exploitation phase, we choose an arbitrary arm from S2. The key step is bounding the number of survivors after the first phase - if we can upper-bound S1 s cardinality (w.h.p.), then we can lower-bound the number of times each arm in S1 is selected in phase 2, leading to a guarantee on the width of the confidence interval. To this goal, consider the following uniform event that µa s are distributed approximately uniformly. Definition C.7 (Uniform Event). Consider any constant δ (0, 1), and the number Nδ of arms in A that lie in a δneighborhood of the optimal arm, formally, Nδ = {a A : µa µmax(A) δ} . We define the uniform event as We show that the uniform event is likely to occur when k is large. This is a direct consequence of the RIM and Assumption 2.1, and its proof is similar to that of Proposition B.3. Lemma C.8 (Uniform Event is Likely). Suppose δk 8 C1 log n. Then for any δ (0, 1) and round t, the uniform event satisfies P Short-lived High-volume Bandits Proof. Index the arms from 1 to k, and denote by f the density of D, the distribution from which the reward rates µi s are drawn. We first upper bound P[ G | µmax = 1 γ] for any fixed γ (0, 1). We subsequently fix an arbitrary i [k]. Denote by Zi = 1[µi µmax δ] and consider the largest reward rate µmax i in [k]\{i}, i.e., µmax i = max {µ1, . . . , µi 1, µi+1, . . . , µk} . Observe that P [ Zi = 1 | µmax = 1 γ ] = P µi [1 γ δ, 1 γ] | µmax i δ = Z 1 γ 1 γ δ f(z) dz, where the last identity follows since µi s are independent, in particular, µmax i and µi are independent. 
Recall by Assumption 2.1 that the density satisfies C1 f(z) C2, so C1δ P [ Zi = 1 | µmax = 1 γ ] C2δ. Note that conditional on the event {µmax = 1 γ}, the random variables Zi are still i.i.d., so by Chernoff s inequality (see Lemma C.1) we obtain P h G µmax = 1 γ i P i [k] Zi > 3 2 C2δk µmax = 1 γ i [k] Zi < 1 2 C1δk µmax = 1 γ 8 8 C1ρ = 2n 1. P[ G] = Eγ P[ G | µmax = 1 γ] 2n 1. Assuming that Uδ1 and C both occur, we have |S1| δ1k. Thus, in the second phase, each surviving arm is selected ε1n times, and hence the confidence intervals have widths ε1n δ1k 1/2 . We next make these ideas precise. Definition C.9 (Widths of Confidence Intervals). For any ε0, ε1 (0, 1), define δ1 = δ1(ε0) = 8 2 log n and δ2 = δ2(ε0, ε1) = 6 3 2 4 log 5 4 n. For any adaptivity level ℓ, denote by δ j,ℓ= δj,ℓ(ε 0, ε 1) for j = 1, . . . , ℓ. Under the above notations, we can bound the suboptimality of S2, the surviving arms after the second phase as follows. Lemma C.10 (Suboptimality of S2). Consider the algorithm BSE2(ε0, ε1) such that δ1 = δ1(ε0, ε1) satisfies δ1k > 8 C1 log n. If the clean event C and the uniform event Uδ1 both occur, then µmax(A) µmin(S2) δ2. Proof. We first upper bound the cardinality of S1. By Lemma C.5, since C occurs, we have µmax (A) µmin (S1) δ1. In other words, for an arm a A to survive the first phase, its mean reward needs to be δ1-close to µmax(A). Since the uniform event Uδ1 occurs, by Lemma C.8 the number of such arms can be bounded as 2C2δ1k. (11) Short-lived High-volume Bandits Note that in phase 2, each arm in S1 is selected m1 := ε1n |S1| times. By definition of the clean event C, the empirical mean reward of every arm deviates from the mean by at most 2(ρ + 3) m 1/2 1 log1/2 n. Thus by the 3 -inequality (Lemma C.4), we can bound the suboptimality of arms in S2 as µmax (A) µmin (S2) 3 p 2(ρ + 3) m 1/2 1 log1/2 n. To further bound the above, we use (11) to lower bound m1 = ε1n |S1|. Specifically, µmax (A) µmin (S2) 2(ρ + 3) ε1n 3 2C2δ1k (ρ + 3)C2 ε 1 2 log 1 4 k log 1 2 n. (12) Recall from Definition C.9 that 2 log n and δ2 = δ2(ε0, ε1) = 6 3 2 4 log 5 4 n, we can simplify the above inequality as 4 log 1 4 k log n = δ2. We can now bound the average regret of the BSE algorithm for any grid as follows. Proposition C.11 (Average Regret of BSE with Arbitrary Grid, ℓ= 2). Suppose 0 < ε0 < ε1 < 1 and δ1k > 8 log n C1 . Then, the average regret of the BSE algorithm with grid size ε0, ε1 satisfies Regavg n (BSEℓ=2 (ε0, ε1)) ε0 + ε1δ1 + δ2 + O n 1 Proof. By definition of regret, we have Regavg n (BSEℓ(ε0, ε1)) = ε0 + ε1 E [µmax(A) µmin(S1)] + (1 ε0 ε1) E [µmax (A) µmin (S2)] . We bound each term separatly. By Lemma C.3 and Lemma C.5, the second term can be bounded as E [µmax(A) µmin(S1)] = E [µmax(A) µmin(S1) | C] P[C] + E µmax(A) µmin(S1) | C P[C] δ1 1 + 2n 1. By Lemma C.10, if the events Uδ1 and C both occur, then µmax(A) µmin(S2) δ2. Further, by Lemma C.8, we have P [Uδ1] 2n 1 whenever δ1k 8 log n C1 . Combining the above facts, we obtain Regavg n (BSEℓ(ε0, ε1)) ε0 + ε1 δ1 + 2n 1 + δ2 + n 1, and the proof is complete. The revised geometric grids in Definition 4.4 are motivated by minimizing the above bound. Since we are focusing on ℓ= 2, we will suppress ℓand write ε i = ε i,ℓfor i ℓ= 2. Choose ε0 so that ε0 = ε1 δ1(ε0) = δ2(ε0, ε1), then we have as specified in Definition 4.4. We obtain the following guarantee for the revised geometric grid by choosing εi = ε i in Proposition C.11. 
The proof is straightforward: we only need to verify that $\delta^*_1 k \ge \frac{8}{C_1}\log n$.

Corollary C.12 (Average Regret of BSE$^*_2$). If $\rho \ge \theta_2$, then $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_{\ell=2}) = O\big((k/n)^{1/2}\log^{5/4} n\big)$.

Proof. Recall from Definition 4.4 that $\varepsilon^*_0 = (k/n)^{1/2}$, and from Definition C.9 that for any $\varepsilon_0$ we defined $\delta_1 = \delta_1(\varepsilon_0) = 8(\varepsilon_0 n/k)^{-1/2}\log n$. Expanding the expressions for $\varepsilon^*_0$ and $\delta^*_1 = \delta_1(\varepsilon^*_0)$, we have
$$\delta^*_1 k = 8\Big(\frac{\varepsilon^*_0 n}{k}\Big)^{-1/2} k\log n = 8\,k^{5/4} n^{-1/4}\log n \ge \frac{8}{C_1}\log n,$$
where the last inequality follows since $\rho \ge \theta_2 = \frac{1}{5}$. By Proposition C.11, we conclude that
$$\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_{\ell=2}) \le \varepsilon^*_0 + \varepsilon^*_1\delta^*_1 + \delta^*_2 + O(n^{-1}) = O\big((k/n)^{1/2}\log^{5/4} n\big).$$

C.4. Proof of Theorem 4.6

We now extend the analysis to the general-$\ell$ case. For each available arm, an $\ell$-Layer Sieve policy recursively explores its reward rate and derives a sequence of confidence intervals with the following widths.

Definition C.13 (Width of Confidence Interval). Given $\varepsilon_0, \ldots, \varepsilon_{\ell-1}$, for each $i = 1, 2, \ldots, \ell$ we define
$$\delta_i = \delta_i(\varepsilon_0, \ldots, \varepsilon_{i-1}) = 3\cdot 15^{\,i-1}\, C_2^{1/2}\;\varepsilon_0^{-2^{-i}}\varepsilon_1^{-2^{-(i-1)}}\cdots\varepsilon_{i-1}^{-2^{-1}}\Big(\frac{k}{n}\Big)^{1-2^{-i}}\log^{\frac{3}{2}-2^{-i}} n.$$
It is straightforward to verify that the above $\delta_i$'s satisfy the recursion $\delta_i \asymp \big(\delta_{i-1}k/(\varepsilon_{i-1}n)\big)^{1/2}\log^{3/4} n$ for $i \ge 2$.

As in the analysis for $\ell = 2$ in Section C.3, we will show that the loss on the $j$-th layer is bounded as follows.

Lemma C.14 (Regret on the $j$-th Layer). Fix integers $t, j$ and suppose the events $G^{\delta_1}_{t-1}, \ldots, G^{\delta_j}_{t-j}$ and $\mathcal{C}_t$ occur. Then $\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j}) \le \delta_j$ and $|S^j_{t-j}| \le 6C_2\delta_j k\log^{1/2} k$.

Proof. We show this by induction on $j$. For the base case $j = 1$, note that by Lemma C.5 we have $\mu_{\max}(A_{t-1}) - \mu_{\min}(S^1_{t-1}) \le \delta_1$. Moreover, we showed in the first paragraph of the proof of Lemma C.10 that $|S^1_{t-1}| \le 6C_2\delta_1 k\log^{1/2} k$, and thus the claim holds for $j = 1$.

Now consider $j \ge 2$. As the induction hypothesis, assume the claim holds for $1, \ldots, j-1$. Then $|S^{j-1}_{t-(j-1)}| \le 6C_2\delta_{j-1}k\log^{1/2} k$. Consequently, each arm in $S^{j-1}_{t-(j-1)}$ is selected at least
$$m_{j-1} := \frac{\varepsilon_{j-1} n}{|S^{j-1}_{t-(j-1)}|} \ge \frac{\varepsilon_{j-1} n}{6C_2\delta_{j-1}k\log^{1/2} k} \qquad (13)$$
times. Since the clean event $\mathcal{C}_t$ occurs, the empirical mean of each arm from $S^{j-1}_{t-(j-1)}$ deviates from its mean by at most $3\,m_{j-1}^{-1/2}\log^{1/2} m_{j-1}$. Thus, by the $3\varepsilon$-inequality, if an arm $a \in A_{t-j}$ survives the $j$-th phase, then
$$\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j}) \le 9\,m_{j-1}^{-1/2}\log^{1/2} m_{j-1}. \qquad (14)$$
Since the good event $G^{\delta_j}_{t-j}$ occurs, we have $|S^j_{t-j}| \le 6C_2\big(9\,m_{j-1}^{-1/2}\log^{1/2} m_{j-1}\big)k\log^{1/2} k$. To simplify, we expand $m_{j-1}$ using (13) and the closed form of $\delta_{j-1}$ in Definition C.13, which yields
$$9\,m_{j-1}^{-1/2}\log^{1/2} m_{j-1} \le 9\Big(\frac{6C_2\delta_{j-1}k\log^{1/2} k}{\varepsilon_{j-1} n}\Big)^{1/2}\log^{1/2} n \le \delta_j, \qquad (15)$$
where the last inequality relies on $\delta_{j-1}k > 1$. Hence $|S^j_{t-j}| \le 6C_2\delta_j k\log^{1/2} k$, and by (14) and (15), $\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j}) \le \delta_j$. This completes the induction.

We now have the following regret bound for the BSE algorithm with adaptivity $\ell$.

Proposition C.15 (Average Regret of BSE, Arbitrary Grid). Consider an arbitrary adaptivity level $\ell$ and grid $0 < \varepsilon_0 < \cdots < \varepsilon_{\ell-1} < 1$ such that $\delta_i k > 1$ for all $i \le \ell-1$. Then for any round $t$,
$$\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}_\ell(\varepsilon_0, \ldots, \varepsilon_{\ell-1})) = O\big(\varepsilon_0 + \varepsilon_1\delta_1 + \varepsilon_2\delta_2 + \cdots + \varepsilon_{\ell-1}\delta_{\ell-1} + \delta_\ell + n^{-\rho}\big).$$

Proof. By definition of internal regret,
$$\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}_\ell(\varepsilon_0, \ldots, \varepsilon_{\ell-1})) \le \sum_{j=0}^{\ell-1}\varepsilon_j\,\mathbb{E}\big[\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j})\big] + \mathbb{E}\Big[\max_{j\in[w]}\mu_{\max}(A_{t-j}) - \mu_{\min}(S^\ell_{t-j})\Big].$$
Note that if the events $\mathcal{C}_t$ and $G^{\delta_j}_{t-j}$, $j = 0, \ldots, \ell-1$, all occur, then by Lemma C.14 we have $\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j}) \le \delta_j$ for $j = 0, \ldots, \ell-1$, and $\max_{\ell\le j\le w}\mu_{\max}(A_{t-j}) - \mu_{\min}(S^\ell_{t-j}) \le \delta_\ell$. Further, by Lemma C.8, for each $j$ we have $\mathbb{P}[\bar{G}^{\delta_j}_{t-j}] \le k^{-2}$ provided $\delta_j k > 1$ for all $j = 0, \ldots, \ell-1$, so
$$\mathbb{E}\big[\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j})\big] \le \delta_j + n^{-2+\rho}\,w.$$
Similarly,
$$\mathbb{E}\Big[\max_{\ell\le j\le w}\mu_{\max}(A_{t-j}) - \mu_{\min}(S^\ell_{t-j})\Big] \le \delta_\ell + k^{-2}w + n^{-2+\rho}\,w.$$
When $\rho \le 1$, the above is bounded by $\delta_\ell + n^{-2\rho}w + o(1/n)$. Combining, we conclude that
$$\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}_\ell(\varepsilon_0, \ldots, \varepsilon_{\ell-1})) \le \varepsilon_0 + \sum_{j=1}^{\ell-1}\varepsilon_j\delta_j + \delta_\ell + n^{-3} + n^{-2\rho}w + o\Big(\frac{1}{n}\Big) = O\Big(\varepsilon_0 + \sum_{j=1}^{\ell-1}\varepsilon_j\delta_j + \delta_\ell + n^{-\rho}\Big).$$

D. Proof of Proposition 4.9

To derive the upper bound for SLHVB, we need to compare the asymptotic orders of the two terms in (1). Consider the following three cases for $\rho$.

Case 1. Suppose $\rho \le \frac{1}{5}$. Consider $\ell = 1$. By Proposition 4.9, we have $\mathrm{Loss}_n(\mathrm{BSE}^*_1, k) = O\big(n^{-\rho} + n^{(\rho-1)\frac{1}{3}}\big)$. Since $\rho \le \frac{1}{5}$, it holds that $n^{-\rho} \ge n^{(\rho-1)\frac{1}{3}}$, so the above becomes $O(n^{-\rho})$.

Case 2. Suppose $\frac{1}{5} < \rho \le \frac{w}{2w+2}$. The claimed $\ell$ in Step 2 exists¹¹ due to the following elementary fact: for any integer $w \ge 2$,
$$\Big(\frac{1}{5},\; \frac{w}{2w+2}\Big] \subseteq \bigcup_{\ell=2}^{w}\Big(\theta_\ell,\; \frac{\ell}{2\ell+2}\Big].$$
It then follows by Proposition 4.9 that $\mathrm{Loss}_n(\mathrm{BSE}^*(\ell, k)) = O\big(n^{-\rho} + n^{(\rho-1)\frac{\ell}{\ell+2}}\big)$. Further, note that when $\rho \le \frac{\ell}{2\ell+2}$, we have $n^{-\rho} \ge n^{(\rho-1)\frac{\ell}{\ell+2}}$, so the above becomes $O(n^{-\rho})$.

Case 3. Suppose $\rho \ge \frac{w}{2w+2}$. Note that the threshold exponent $\theta_\ell$ increases in $\ell$; in particular, for any $\ell \le w$ we have $\theta_\ell \le \theta_w \le \rho$. It then follows from Proposition 4.9 that
$$\mathrm{Loss}_n(\mathrm{BSE}^*(w, k)) = O\big(n^{-\rho} + n^{(\rho-1)\frac{w}{w+2}}\big). \qquad (18)$$
Although this is already sublinear in $n$, we observe that the two terms in (18) are, in general, not equal, which suggests a potential improvement. Consider the resampling size $k' = n^{\rho'}$ that renders the two terms equal, that is, $\rho' = \frac{w}{2w+2} \le \rho$. Then, by Proposition 4.8 we obtain
$$\mathrm{Loss}_n(\mathrm{BSE}^*(w, k')) = O\big(n^{-\rho} + \rho'^{-1}n^{-\rho'} + n^{(\rho'-1)\frac{w}{w+2}}\big) \le O\big(n^{-\frac{w}{2w+2}} + \rho'^{-1}n^{-\frac{w}{2w+2}} + n^{-\frac{w}{2w+2}}\big) = O\big(\rho'^{-1}n^{-\frac{w}{2w+2}}\big),$$
where the inequality follows since $\rho' \le \rho$. We summarize the above discussion in the following theorem.

¹¹Alternatively, one can show this constructively: $\ell' = \ell'(\rho) = \big\lfloor\frac{\rho+1}{1-2\rho}\big\rfloor$ satisfies $\theta_{\ell'} \le \rho \le \frac{\ell'}{2\ell'+2}$. However, the proof, which is essentially arithmetic manipulation, is slightly tedious, so we choose not to specify this explicit form.

E. Field Experiment: Implementation Details

In this section we provide more details about the implementation of the field experiment.

E.1. Randomized BSE: A Thompson Sampling Variant

We implement a variant of the semi-adaptive policy for SLHVB induced by the BSE policy (Algorithm 1) with $\ell = 1$. We modify the policy because of the following practical concerns.

The first issue is the lack of knowledge of $n$. In our implementation, each round is set to be an hour, so $n$ corresponds to the number of impressions per hour. In practice, however, $n$ is unknown. This can easily be fixed via randomization: for each card slot, we assign a newly arriving card (for exploration) with probability $\varepsilon_0$ and an old card (for exploitation) otherwise.

Algorithm 4 SetPrior(g)
1: Input: a card $g$ and $m$ users
2: Output: $\hat\alpha, \hat\beta$
3: Randomly sample $m$ users $u_1, \ldots, u_m$
4: for $i = 1, \ldots, m$ do
5:   $\mu(u_i, g) \leftarrow$ DNN-predicted reward on $(u_i, g)$
6: end for
7: Compute the sample mean $\bar\mu \leftarrow \frac{1}{m}\sum_{i=1}^m \mu(u_i, g)$ and variance $\bar v \leftarrow \frac{1}{m-1}\sum_{i=1}^m (\mu(u_i, g) - \bar\mu)^2$
8: Return $\hat\alpha \leftarrow \bar\mu\big(\frac{\bar\mu(1-\bar\mu)}{\bar v} - 1\big)$ and $\hat\beta \leftarrow (1-\bar\mu)\big(\frac{\bar\mu(1-\bar\mu)}{\bar v} - 1\big)$

The second issue is the prior information for the reward rates of the newly arriving cards. Although the DNN predictions are sometimes inaccurate, they do provide useful information. To utilize this information, we fit a Beta prior distribution for each card's reward rate using the method of moments (see, e.g., (Wasserman, 2006)) based on the DNN predictions (see Algorithm 4). Specifically, recall that the DNN returns a predicted conversion rate for each pair of user and card. For a fixed card, denote by $\bar\mu, \bar v$ the mean and variance of the predicted conversion rates on $m = 500$ randomly sampled users. The fitted Beta prior $B(\hat\alpha, \hat\beta)$ is then given by
$$\hat\alpha = \bar\mu\Big(\frac{\bar\mu(1-\bar\mu)}{\bar v} - 1\Big) \quad\text{and}\quad \hat\beta = (1-\bar\mu)\Big(\frac{\bar\mu(1-\bar\mu)}{\bar v} - 1\Big).$$
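The method-of-moments fit in Algorithm 4 can be sketched in a few lines of Python; this is our own illustrative sketch (the function `predict_ctr` standing in for the DNN is a hypothetical name), not the production code.

```python
import numpy as np

def set_prior(predict_ctr, card, users, m=500, rng=None):
    """Fit a Beta(alpha, beta) prior for a card's reward rate by the method of moments,
    using DNN-predicted conversion rates on m randomly sampled users (cf. Algorithm 4)."""
    rng = rng or np.random.default_rng()
    sample = rng.choice(users, size=m, replace=False)
    preds = np.array([predict_ctr(u, card) for u in sample])  # predicted rates in (0, 1)

    mean, var = preds.mean(), preds.var(ddof=1)
    if var <= 0 or var >= mean * (1 - mean):
        return 1.0, 1.0                      # flat-prior fallback (our own safeguard)
    common = mean * (1 - mean) / var - 1     # method-of-moments factor
    return mean * common, (1 - mean) * common
```

The fit is only valid when the sample variance is smaller than $\bar\mu(1-\bar\mu)$; the fallback to a flat Beta(1, 1) prior above is our own safeguard, not part of the paper's procedure.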
The final issue is the diversity of the recommended content. Recall that the basic version of the BSE policy selects the empirically best arm to exploit. However, this is not reasonable in a practical setting, where such an extreme promotion of a single card is undesirable. We therefore consider a Thompson Sampling version of the Sieve policy under a Beta-Bernoulli reward model; see Algorithm 5. Specifically, for each slot we draw a score for each card from its posterior, and then assign to this slot the card with the highest score. Note that the posterior can be efficiently updated using Bayes' rule, since the Beta distribution is a conjugate prior for the Bernoulli distribution.

Algorithm 5 Randomized One-Layer Sieve Policy
1: Input: $\varepsilon \in [0,1]$, $\theta \ge 0$
2: for each hour $t = 1, 2, \ldots$ do
3:   Receive a set $A_{\mathrm{new}}$ of new cards
4:   Update the set $A$ of available cards
5:   for each card $g \in A$ do
6:     if $g \in A_{\mathrm{new}}$ then
7:       $(\alpha_g, \beta_g) \leftarrow$ SetPrior$(g)$
8:     else
9:       $n_g \leftarrow$ number of interactions of $g$ in the last hour
10:      $h_g \leftarrow$ number of conversions of $g$ in the last hour
11:      $\alpha_g \leftarrow \alpha_g + h_g$, $\beta_g \leftarrow \beta_g + n_g - h_g$
12:    end if
13:  end for
14:  $A_{\mathrm{well}} \leftarrow \{g : \alpha_g + \beta_g > \theta\}$
15:  for each user $u$ do
16:    Receive the number $r_u$ of cards requested by $u$
17:    $A_u \leftarrow$ cards that have never been assigned to $u$
18:    for each card $g \in A_u$ do
19:      Draw $X_{u,g} \sim \mathrm{Beta}(\alpha_g, \beta_g)$
20:    end for
21:    Sort $A_u \cap A_{\mathrm{well}}$ by $X_{u,g}$ in non-increasing order as $g_1, g_2, \ldots$
22:    Sort $A_u \setminus A_{\mathrm{well}}$ by $X_{u,g}$ in non-increasing order as $g'_1, g'_2, \ldots$
23:    $i, i' \leftarrow 1$
24:    for $j = 1, \ldots, r_u$ do
25:      $Z_j \sim \mathrm{Ber}(\varepsilon)$
26:      if $Z_j = 1$ then
27:        $S_j \leftarrow g'_{i'}$
28:        $i' \leftarrow i' + 1$
29:      else
30:        $S_j \leftarrow g_i$
31:        $i \leftarrow i + 1$
32:      end if
33:    end for
34:    Send to $u$ the cards $\{S_j : j = 1, \ldots, r_u\}$
35:  end for
36: end for

E.2. Implementation Details

The firm has maintained a partition of the users into several hundred buckets for various online experiments. This partition is randomly re-generated every six months. We chose three buckets as the treatment group, involving over 600,000 users and accounting for around 1% of the total traffic. We implemented the randomized BSE policy with $\ell = 1$ on their production system in the first 14 days of July 2021. For comparison, we also analyzed the interaction data from the first 14 days of May of the same year.

Using an offline semi-synthetic simulation, we determined the empirically optimal parameter to be around $\varepsilon_0 = 0.2$, which we used in the field experiment. This choice is also consistent with our theoretical analysis. In fact, as we recall from Proposition C.6, the optimal parameter is $\varepsilon^*_{0,1} \approx (k/n)^{1/3}$. In our scenario, we observed from past data that there were on average around 11 million impressions per 14 days, so the number $n$ of impressions per hour is about $3.2 \times 10^4$. Moreover, on average $k = 150$ cards are released per hour, and thus $\varepsilon^*_{0,1} \approx 0.14$.
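To complement the pseudocode, the following is a compact sketch (ours, with hypothetical data structures, not the deployed system) of the two core steps of Algorithm 5: the hourly Beta-Bernoulli posterior update and the per-slot $\varepsilon$-randomized assignment between well-estimated and under-explored cards.

```python
import numpy as np

def update_posteriors(posteriors, hourly_stats):
    """Beta-Bernoulli conjugate update: posteriors maps card -> [alpha, beta],
    hourly_stats maps card -> (impressions, conversions) from the last hour."""
    for card, (n_g, h_g) in hourly_stats.items():
        posteriors[card][0] += h_g            # alpha += conversions
        posteriors[card][1] += n_g - h_g      # beta  += non-conversions

def assign_cards(posteriors, eligible, r_u, eps, theta, rng):
    """Fill r_u slots for one user: with prob. eps take the next under-explored card
    (alpha + beta <= theta), otherwise the next well-estimated card, ranked by a
    Thompson sample from each card's Beta posterior (cf. Algorithm 5)."""
    scores = {g: rng.beta(*posteriors[g]) for g in eligible}
    well = sorted((g for g in eligible if sum(posteriors[g]) > theta),
                  key=scores.get, reverse=True)
    fresh = sorted((g for g in eligible if sum(posteriors[g]) <= theta),
                   key=scores.get, reverse=True)
    slots = []
    for _ in range(r_u):
        pool = fresh if (rng.random() < eps and fresh) else well
        if not pool:                          # fall back if one pool is exhausted (our addition)
            pool = well or fresh
        if not pool:
            break
        slots.append(pool.pop(0))
    return slots
```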
F. Field Experiment: Analysis

We now present a detailed statistical analysis of the field experiment, including a bootstrap hypothesis test and a Difference-in-Differences (DID) analysis. Our analysis shows that our policy outperforms the DNN-based recommender by about 4% in total duration and nearly 7% in the total number of click-throughs per user per day.

We first explain how we eliminate outliers. An outlier is typically introduced in one of two ways. On the one hand, users may accidentally swipe through two cards in a row without even looking at the first one; we thus remove any impression with duration less than 0.2 seconds. On the other hand, users may sometimes leave their devices unattended for minutes, generating an abnormally high duration. Since most cards' content can be fully consumed within 300 seconds, we remove any impression with duration over $u = 300$ seconds.

Table 1. Types of data in the analysis
                      CT          Duration
  Per User-Day        integral    numeric
  Per Impression      binary      numeric

Table 2. Overall statistics
                                        May                       July
                                   NN         MAB           NN         MAB
  Per-User-Per-Day
    Duration    Mean            175.910     175.548       137.059     142.618
                SE of Mean        0.699       0.659         0.6081      0.597
                Median           44.250      44.279        32.973      34.430
    #CT         Mean              1.275       1.273         0.941       1.010
                SE of Mean     9.251e-03   8.814e-03     7.276e-03   7.549e-03
  Per Impression
    Duration    Mean              3.9697      4.0195        4.1183      4.2391
                SE of Mean     4.529e-03   4.402e-03     5.738e-03   5.599e-03
                Median            0.693       0.697         0.702       0.703
    CTR         Mean           2.887e-02   2.915e-02     2.827e-02   3.001e-02
                SE of Mean     4.698e-05   4.568e-05     5.804e-05   5.671e-05

F.1. Analysis of 2 × 2 = 4 Metrics

The firm is interested in the analysis at two levels, per-user-per-day and per-impression, and in two measures of user engagement: duration and click-throughs (CT). Altogether we have four metrics, as shown in Table 1.

In the per-impression analysis, we treat each impression as an independent sample. For example, the row for per-impression duration in Table 2 is the ratio between the total duration and the total number of impressions. However, the firm's objective is the total user engagement rather than the per-impression engagement, which motivated our analysis at the per-user-per-day level. At first sight, it seems reasonable to consider the total engagement of a fixed user over all 14 days of the experiment. However, this metric is flawed, since the frequency at which users enter the app is affected by many external factors, such as holidays and weekends, which introduces extra noise. We therefore only consider the days on which a user entered the app (i.e., had at least one impression). Formally, for each user $u$ and day $d$ on which this user has at least one impression, we define a tuple $(u, d, D_{ud})$, where $D_{ud}$ is the total duration. Thus, the number of tuples associated with each user is between 1 and 14.

Under this definition of total engagement, we summarize the experimental results in Table 2 (the unit of duration in all tables is seconds) and visualize the per-user-per-day user engagement in Figure 4 and Figure 5. We observe that in May the user engagement of the two groups is approximately identical, but in July the MAB group has a significantly higher mean user engagement. Moreover, this improvement also appears in the median duration, indicating that it is unlikely to be caused by a heavier tail in the distribution.

It is also worth noting that the per-user-per-day engagement decreased from May to July. This is because May 2021 was when the Covid-19 pandemic reached its peak in the country where most users were located. During the lockdown, users may have had more time to spend on the app, resulting in higher total engagement. In Section F.2, we perform a DID analysis that incorporates this underlying change of environment across time.

Finally, we emphasize that in our implementation, our policy is not personalized. Despite this disadvantage, our Sieve policy still outperforms the firm's personalized DNN recommender in all user-engagement metrics, at both the per-impression and per-user levels; see Table 2.
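As an illustration of the data preparation described above, the per-user-per-day tuples $(u, d, D_{ud})$ can be constructed as in the following pandas sketch; the column names are hypothetical and this is not the pipeline actually used.

```python
import pandas as pd

def per_user_day_engagement(impressions: pd.DataFrame) -> pd.DataFrame:
    """impressions: one row per impression with (hypothetical) columns
    'user_id', 'date', 'duration_sec', 'clicked' (0/1), 'group' ('NN' or 'MAB')."""
    # Outlier removal: drop accidental swipes (< 0.2 s) and unattended sessions (> 300 s).
    clean = impressions[(impressions.duration_sec >= 0.2) & (impressions.duration_sec <= 300)]

    # One tuple (u, d, D_ud, C_ud) per user-day with at least one impression.
    return (clean.groupby(['group', 'user_id', 'date'])
                 .agg(duration=('duration_sec', 'sum'), ct=('clicked', 'sum'))
                 .reset_index())
```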
F.2. Significance Tests

From Figure 4 and Figure 5 we observe that our policy outperforms the control policy. We now test whether this improvement is statistically significant.

Figure 4. Number of click-throughs per user per day.
Figure 5. Duration per user per day.

Table 3. Significance testing
                             Basic                     Bootstrap
                       Z-score    p-value         Z-score    p-value
  Per-User-Per-Day
    Duration            4.610     2.018e-06        4.6197    1.921e-06
    CT                  4.259     1.027e-05        4.2556    1.042e-05
  Per Impression
    Duration            6.963     1.665e-12        6.972     1.556e-12
    CT                 12.999     6.127e-39       12.933     1.469e-38

For each month $m \in \{\mathrm{May}, \mathrm{July}\}$, denote by $X^m, Y^m$ the user engagement (either duration or number of click-throughs) in the NN and MAB groups, respectively. Similarly, denote by $\bar X^m, \bar Y^m$ the sample means of $X^m, Y^m$ in month $m$. We are interested in the difference-in-differences in user engagement before and after our policy was deployed, i.e.,
$$\Delta = (\bar Y^{\mathrm{July}} - \bar X^{\mathrm{July}}) - (\bar Y^{\mathrm{May}} - \bar X^{\mathrm{May}}).$$
Consider the hypotheses $H_0: \mathbb{E}[\Delta] \le 0$ vs. $H_1: \mathbb{E}[\Delta] > 0$. First consider the basic Z-score
$$Z = \frac{(\bar Y^{\mathrm{July}} - \bar X^{\mathrm{July}}) - (\bar Y^{\mathrm{May}} - \bar X^{\mathrm{May}})}{\hat S},$$
where $\hat S$ is the estimated standard deviation of $(\bar Y^{\mathrm{July}} - \bar X^{\mathrm{July}}) - (\bar Y^{\mathrm{May}} - \bar X^{\mathrm{May}})$. For $W \in \{X^{\mathrm{May}}, X^{\mathrm{July}}, Y^{\mathrm{May}}, Y^{\mathrm{July}}\}$, denote by $N_W$ the number of i.i.d. samples in pool $W$, and let $S^2_W$ be its sample variance. Assuming the samples are independent, we may approximate
$$\hat S \approx \sqrt{\frac{S^2_{X^{\mathrm{May}}}}{N_{X^{\mathrm{May}}}} + \frac{S^2_{X^{\mathrm{July}}}}{N_{X^{\mathrm{July}}}} + \frac{S^2_{Y^{\mathrm{May}}}}{N_{Y^{\mathrm{May}}}} + \frac{S^2_{Y^{\mathrm{July}}}}{N_{Y^{\mathrm{July}}}}}.$$
As shown in the Basic column of Table 3, the p-values for each of the four metrics are very low. We therefore reject the null hypothesis $H_0$ and conclude that the treatment effect is statistically significant.

However, in reality the samples are not independent, since (1) each user may appear in both months, (2) each user has multiple data points within a month, and (3) the same set of cards is shown to both the treatment and control groups. We remove the dependence by bootstrapping as follows. From each of the four pools of data points, we randomly draw $10^6$ samples with replacement and redefine each $\bar W$ as the bootstrap sample mean, where $W \in \{X^{\mathrm{May}}, X^{\mathrm{July}}, Y^{\mathrm{May}}, Y^{\mathrm{July}}\}$; see the Bootstrap column in Table 3. We still observe very low p-values, which further validates our conclusion.
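The basic Z-statistic and the bootstrap resampling step can be sketched as follows (our own illustration; the four pools are arrays of engagement values for the NN and MAB groups in May and July, and the Z-score uses the pooled-variance formula above).

```python
import numpy as np
from scipy.stats import norm

def did_z_test(x_may, x_july, y_may, y_july):
    """One-sided test of H0: E[Delta] <= 0, where
    Delta = (mean(Y_July) - mean(X_July)) - (mean(Y_May) - mean(X_May)).
    X = NN (control) engagement, Y = MAB (treatment) engagement."""
    delta = (y_july.mean() - x_july.mean()) - (y_may.mean() - x_may.mean())
    se = np.sqrt(sum(p.var(ddof=1) / len(p) for p in (x_may, x_july, y_may, y_july)))
    z = delta / se
    return z, norm.sf(z)                      # p-value = P[N(0,1) > z]

def bootstrap_means(pools, n_draws=10**6, rng=None):
    """Redefine each pool mean as the mean of n_draws samples drawn with replacement."""
    rng = rng or np.random.default_rng()
    return [rng.choice(p, size=n_draws, replace=True).mean() for p in pools]
```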
Figure 6. Illustration of DID regression for duration per impression.

F.3. DID Regression

We now consider a DID regression analysis. We first illustrate the idea of DID regression in Figure 6, using per-impression duration as an example. We first vectorize each tuple $(u, d, Y_{ud})$ into a four-dimensional vector $(t_{ud}, i_{ud}, t_{ud}i_{ud}, Y_{ud})$, where $t_{ud} = \mathbb{1}[\text{day } d \text{ is in July}]$ and $i_{ud} = \mathbb{1}[\text{user } u \text{ is in the MAB group}]$ are the time and intervention indicators, and $Y_{ud} \in \{C_{ud}, D_{ud}\}$ is the metric under consideration (i.e., the click-throughs or duration of user $u$ on day $d$). The basic assumption in a DID analysis is that the outcome follows the linear model
$$Y_{ud} = \beta_0 + \beta_1 t_{ud} + \beta_2 i_{ud} + \beta_3 t_{ud} i_{ud} + \varepsilon_{ud}, \qquad (20)$$
where $\varepsilon_{ud} \sim N(0, \sigma^2)$ with unknown variance $\sigma^2$. Intuitively, $\beta_1$ captures the overall trend over time and $\beta_2$ measures the effect of being assigned to the treatment group. Thus, if there were no treatment effect, the differences between the two groups would remain unchanged from May to July, and the means of the samples from the four pools (shown as the red dots in Figure 6) would form a perfect parallelogram. Now suppose there is indeed a positive treatment effect. Then the top-right corner of this quadrilateral is raised; see the green dot in Figure 6. This lift is measured by the coefficient $\beta_3$. To see this, by setting $t_{ud} = i_{ud} = 0$ we observe that $\beta_0$ is the mean engagement of control-group users in May. Further, note that the top-right corner of the parallelogram is $\beta_0 + \beta_1 + \beta_2$. In contrast, for a day $d$ in July and a user $u$ in the MAB group, i.e., $i_{ud} = t_{ud} = 1$, the mean outcome satisfies $\mathbb{E}[Y_{ud}] = \beta_0 + \beta_1 + \beta_2 + \beta_3$, which is higher than the top-right red dot by $\beta_3$.

Assuming Gaussian noise, we can compute confidence intervals and p-values for the coefficients $\beta_i$; see Table 4. For both duration and CT, the coefficients $\beta_3$ are positive and have very low p-values. Therefore, whether a user is assigned to the MAB group indeed has a significant effect. Moreover, note that $\beta_2$ has high p-values, so we conclude that the partition of users is sufficiently random, at least at the per-user-per-day level.

As shown in the second half of Table 4, we also consider per-impression user engagement. Similar to the above analysis, we convert each impression $j$ into a three-dimensional vector $(t_j, i_j, Y_j)$, where $Y_j$ is either the duration or the click-through indicator of impression $j$. The duration per impression is numerical, so we can analyze it using the linear regression (20). In contrast, the per-impression click-throughs $Y_j$ are binary, so we instead apply logistic regression: we assume $Y_j \sim \mathrm{Ber}\big(\frac{e^z}{1+e^z}\big)$, where $z = \beta_0 + \beta_1 t_j + \beta_2 i_j + \beta_3 t_j i_j$.

Table 4. Difference-in-Differences regression
                       Coef.      Std. Dev.        t         p-value       0.025Q      0.975Q
  Per User-Day
   Duration   β0     175.9103      0.640       274.941      0.000         174.656     177.164
              β1     -38.8514      0.942       -41.263      0.000         -40.697     -37.006
              β2      -0.3622      0.887        -0.409      0.683          -2.100       1.375
              β3       5.9208      1.303         4.544      2.759e-06       3.367       8.475
   #CT        β0       1.2750      0.008       153.851      0.000           1.259       1.291
              β1      -0.3341      0.012       -27.394      1.616e-165     -0.358      -0.310
              β2      -0.0016      0.011        -0.141      0.888          -0.024       0.021
              β3       0.0704      0.017         4.171      1.516e-05       0.037       0.103
  Per Impression
   Duration   β0       3.9697      0.005       863.796      0.000           3.961       3.979
              β1       0.1486      0.007        20.234      2.753e-89       0.134       0.163
              β2       0.0497      0.006         7.781      3.597e-15       0.037       0.062
              β3       0.0711      0.010         6.998      1.298e-12       0.051       0.091
   CT         β0      -3.5198      0.002     -2092.794      0.000          -3.523      -3.517
              β1      -0.0161      0.003        -5.947      1.365e-09      -0.021      -0.011
              β2       0.0133      0.002         5.712      5.582e-09       0.009       0.018
              β3       0.0474      0.004        12.819      6.417e-38       0.040       0.055
Note: All regressions are linear except for per-impression CT, where we applied logistic regression due to the binary labels.

As opposed to the per-user-per-day regression, in this case all coefficients have tiny p-values for both CT and duration. In particular, the coefficient $\beta_2$ for the intervention has a low p-value, indicating that the initial user partition may not be truly random in terms of per-impression engagement. Nonetheless, this difference is interpretable: our experiment was performed on random user groups that Glance had been using for months prior to our field test, on which some previous experiments had been performed, causing this discrepancy in user behavior.

We next quantify the improvement. For the per-user-per-day metrics, we observe that the total duration and the number of click-throughs improved by $\beta_3/(\beta_0 + \beta_1) \approx 4.319\%$ and $7.482\%$, respectively. For the per-impression metrics, the duration improved by $1.726\%$. Finally, note that the $\beta$'s for CTR are based on logistic regression; the improvement in the odds ratio is $e^{\beta_3} - 1 \approx 4.854\%$.
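The two DID regressions in Table 4 can be reproduced in spirit with statsmodels, as in the sketch below; the data-frame column names are hypothetical and this is not the original analysis script.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def did_regressions(df: pd.DataFrame):
    """df has one row per observation with (hypothetical) columns:
    't' = 1{July}, 'i' = 1{MAB group}, 'duration' (numeric), 'ct' (binary, per impression).
    Returns the fitted models and lift estimates analogous to Section F.3."""
    # Linear DID model (20): Y = b0 + b1*t + b2*i + b3*t*i + eps.
    ols_fit = smf.ols('duration ~ t + i + t:i', data=df).fit()

    # Logistic DID model for binary click-throughs: logit P[Y=1] = b0 + b1*t + b2*i + b3*t*i.
    logit_fit = smf.logit('ct ~ t + i + t:i', data=df).fit(disp=0)

    b = ols_fit.params
    duration_lift = b['t:i'] / (b['Intercept'] + b['t'])      # beta3 / (beta0 + beta1)
    odds_ratio_lift = np.exp(logit_fit.params['t:i']) - 1     # e^beta3 - 1
    return ols_fit, logit_fit, duration_lift, odds_ratio_lift
```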