# Short-lived High-volume Bandits

Su Jia *1, Nishant Oli *2, Ian Anderson 2, Paul Duff 2, Andrew A. Li 3, R. Ravi 3

Abstract

We study how to efficiently perform A/B/n testing for a high volume of short-lived treatments. We formulate the problem as a multiple-play bandit model. In each round, a set of k actions arrives. Each action is available for w rounds and has an unknown reward rate. In each round, the learner selects a multiset of n actions and immediately observes the realized rewards. We aim to minimize the average loss under a random input model where the instance is randomly drawn from a known prior distribution D. We show that if $k = O(n^\rho)$ for some $\rho > 0$, our policy achieves $O(n^{-\min\{\rho,\, w/(2(w+1))\}})$ average loss on a sufficiently large class of prior distributions. We also complement this result by showing that every policy suffers $\Omega(n^{-\min\{\rho,\, 1/2\}})$ average loss on the same class of distributions. We further validate the effectiveness of our policy through a large-scale field experiment on Glance, a content card-serving platform.

1. Introduction

Modern platforms leverage randomized experiments to make informed decisions from a given set of alternatives. As a particularly challenging scenario, these alternatives can simultaneously have (i) high volume, with thousands of new items being released each hour, and (ii) short lifetime, due to the transient nature of the contents. This challenge arises from, for example, recommending short-lived content on a video-sharing platform. Orthogonal to lifetime, the problem is well understood when there is a low volume of contents relative to the number of users: dedicated exploration methods, such as standard A/B/n testing, are sufficient for finding the most appealing content for the users; see, e.g., (Kohavi & Longbotham, 2017). Naturally, then, the most challenging settings are those where the contents to be selected are short-lived and have high volume, which occur in a variety of applications.

*First and second authors. All other authors are alphabetically ordered. 1 Center of Data Science for Enterprise and Society (CDSES), Cornell University, Ithaca, USA; 2 Glance, Bangalore, India; 3 Tepper School of Business, Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Su Jia.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

(A) Recommender Systems. From online advertising to social networks, platforms are sometimes faced with a massive amount of short-lived content to be recommended to users. For example, more than 210 million Snaps were created on Snapchat each day in 2020, most of which expired within just 24 hours (Vuleta, 2021). The brevity of the lifetime can be caused either by the features of the content itself (e.g., breaking news) or by the transient nature of user attention. The platform needs to decide which contents appear at the top of the users' news feeds to maximize user engagement.

(B) Website Optimization. In internet marketing, online platforms perform multi-variate A/B/n testing (e.g., (McFarland, 2012; Yang et al., 2017)) to test different designs of their user interface. For example, LinkedIn runs over 400 concurrent experiments per day to compare different website designs, with the goal of encouraging users to establish or refine their personal profiles, or increasing subscriptions to LinkedIn Premium (Xu et al., 2015).
The combinatorial nature of the decision space results in a high volume of items to test. Specifically, the number of possible designs can be exponential in the number of features (e.g., logos, font, background color, etc.). However, these items can have a short life. For example, considering the non-stationarity of the underlying environment (say, due to seasonality or societal events), any estimation is only reliable for a short amount of time. A simple approach is to partition the time horizon into short segments, in which the conversion rates of the designs are approximately constant, and then view the same design in different segments as separate copies.

Adding to this challenge is personalization. A common and naive approach is to cluster the users and solve the problem for each cluster separately. However, when restricting to a cluster, the number of users becomes much smaller while the number of actions remains the same. In other words, personalization would substantially limit the resources available for experimentation, rendering the problem harder.

Thus motivated, we study how to efficiently identify the effective items from a high volume of short-lived candidates. To encapsulate the key features of the problem, we propose the Short-lived High-volume Bandits (SLHVB) problem. We employ a multiple-play bandit framework with the following key features: 1) Multiple-play. In each round, we can choose each action (or arm) multiple times as long as the total number of plays is n, which corresponds to the number of user interactions in this time period;[1] 2) High-Volume of Arrivals. In each round, a set of $k = n^\rho$ actions arrives, where $\rho > 0$; 3) Short Lifetime. Each arm is available for w rounds, where w is a known, small constant; 4) Random Input Model (RIM). The reward rates are drawn from a fixed known distribution. Our analysis focuses on the average performance over all instances, rather than against the worst-case instance. We present a policy that recursively refines the exploration up to w times and compare it against what is best achievable for different regimes of $\rho > 0$.

1.1. Our Contributions

This work contributes to the literature of multi-armed bandits and online controlled experiments in the following ways.

1) A Novel and Practical Formulation. Our first contribution is formulating an online learning model that faithfully captures a ubiquitous problem faced by online platforms. Our formulation considers a practical metric, the average loss, which better reflects the quality of a policy than the worst-case metrics that are more common in the previous literature.

2) Average Regret for Batched Bandits. As a subroutine for showing our main result, we show that the Batched Successive Elimination algorithm (Gao et al., 2019) for the Batched Bandits (BB) problem achieves $O((k/n)^{\ell/(\ell+2)})$ average regret using $\ell$ rounds of adaptivity whenever $\rho \ge \frac{\ell-1}{2\ell+1}$.[2] In particular, for $\ell \ge 2$, this bound is asymptotically better than the optimal $O(\sqrt{k/n}\cdot n^{2^{-\ell}})$ worst-case regret. This contrast is strongest when $\rho = 1/2$: in this case our policy has average regret $n^{-1/2+O(1/\ell)}$ whereas the optimal worst-case regret is $O(n^{-1/4+2^{-\ell}})$.

3) Policy for SLHVB. We show that any policy for BB with $R(n,k)$ average regret can be converted into a policy for SLHVB with $O(n^{-\rho} + R(n,k))$ average loss. Our average regret bound for BB then implies an $O(n^{-\min\{\rho,\, w/(2(w+1))\}})$ average loss bound for SLHVB, by choosing a suitable $\ell$ depending on $\rho$ and w.

4) Nearly Matching Lower Bound.
We show that any policy for SLHVB suffers an $\Omega(n^{-\min\{\rho,\,1/2\}})$ average loss. Further, we juxtapose this result with a lower bound on the worst-case loss which is higher, and hence highlights the value of our RIM.

[1] Do not confuse this n with the n in the term A/B/n testing; the latter is actually our k, i.e., the number of treatments. [2] All big-Os are with respect to n, with $\rho$ fixed.

5) Large-Scale Field Experiment. Finally and most importantly, we validated the effectiveness of our policy in a field experiment via a collaboration with Glance, a leading lock-screen content platform in India. This firm faces exactly the aforementioned challenge: it generates hundreds of content cards on an hourly basis, which are available for at most 48 hours. Its current recommender is based on a state-of-the-art Deep Neural Network (DNN), which is time-consuming to re-train and hence unable to utilize user feedback in a timely manner. In a field experiment, we observed that our 1-Layered Sieve policy outperforms their DNN-based recommender by 4% to 7% in user engagement.

1.2. Related Work

Our problem is a variant of the Multi-Armed Bandits (MAB) problem (Lai et al., 1985). Three lines of work are most related to ours: multiple-play bandits, mortal bandits and high-volume bandits.

Multiple-play Bandits. In this variant, several arms are selected in each round. Many results in single-play bandits can be generalized to the multiple-play variant; for example, (Komiyama et al., 2015) showed that the instance-dependent regret bound for Thompson sampling can be generalized to the multiple-play setting. One motivation of the multiple-play variant is online ranking, see e.g., (Radlinski et al., 2008; Lagrée et al., 2016; Gauthier et al., 2022), where the learner presents an ordered list of items to each user, viewed sequentially under a certain click model. Unlike in our formulation, the arms have infinite lifetime. Further, there is no arrival of new arms, so the learner does not need to take into consideration the ages of the arms.

Mortality of Arms. A quintessential motivation for the mortality of arms is online advertising. In the classical pay-per-click model, the ad broker matches each ad from a large corpus to contents, and is paid by the advertiser (i.e., whoever created the ad) only when an ad is clicked. As the key feature, an ad becomes unavailable when its advertiser's budget runs out. Thus motivated, (Chakrabarti et al., 2008) introduced the mortal bandits problem and considered two death models. In the deterministic model, each arm dies after being selected a certain number of times, which corresponds to the advertiser's budget in the advertising example. In the stochastic lifetime model, an arm dies with a fixed probability every time it is selected. Relatedly, in rotting bandits (Levine et al., 2017), each arm's reward rate decays in the number of times it has been selected. In particular, if the reward function is an indicator function, then effectively each arm has a finite lifetime.

Motivated by demand learning in assortment planning, (Farias & Madan, 2011) considered the irrevocable bandits problem, which bears both the multiple-play and mortality features: arms are selected in batches and discarded immediately once selected. Unlike in this work, however, none of these models considered arrivals, and hence the learner does not need to take into consideration the age of the actions.

High Volume of Arms.
Most existing work concerning a large volume of arms considered the worst-case regret of a policy, i.e., the regret on the worst input in a given family (e.g., (Berry et al., 1997; Zhang & Frazier, 2021)). As a distinctive feature, in this work we consider a random input model where the reward rates follow a known distribution, and perform an average-case analysis. As we will soon see, our formulation leads to theoretical results that would be otherwise impossible. (Wang et al., 2008) also assumed that the reward rates of the arms are independently drawn from a common distribution such that the probability of being $\varepsilon$-optimal is $O(\varepsilon^\beta)$, where $\beta \in [0,1]$ is a known constant. However, unlike in our problem, there are no arrivals and hence the policy does not need to balance the exploration for items with different ages.

Finally, we are aware of another variant of MAB closely related to the multiple-play bandits (and hence to this work).

Low-Adaptive Bandit Algorithms. In the batched bandits problem (Perchet et al., 2016; Agarwal et al., 2017) we aim to achieve low regret using low adaptivity. Unlike in the multiple-play setting, here the learner can partition the time horizon [T] into w batches, where w is a given constant, and choose a batch (i.e., a multiset) of arms based on the realizations in the previous batches. Alternatively, w can be interpreted as a constraint on the adaptivity of the policy. In the classical setting, the learner has unlimited adaptivity, i.e., w = T. (Gao et al., 2019) showed that for any k arms, we can achieve $O(\sqrt{kT})$ regret whenever $w = \Omega(\log\log T)$, which is optimal among all policies with arbitrary adaptivity.

2. Formulation

Suppose at the start of each round t = 1, 2, ..., a set $A_t$ of k actions (or arms) arrives, available in rounds $t, \dots, t+w$. We call w the lifetime and view it as a small constant. In each round, the learner selects a multiset of n available arms; each arm can be chosen multiple times, as long as the total number of plays is n. If an arm a is selected m times, the learner immediately observes i.i.d. rewards $X_1, \dots, X_m$ with mean $\mu_a$.[3] For simplicity, we consider only Bernoulli random variables, although the analysis can be generalized to subgaussian reward distributions in a straightforward manner.

[3] i.i.d. stands for independent and identically distributed.

For concreteness, let us relate the above formulation to the recommendation problem. Most online platforms retrieve user interaction data and update the predictions periodically. A round corresponds to such a period. Further, n corresponds to the number of user impressions in a round, and is assumed to be independent of the quality of the recommendation. Selecting a multiset of arms corresponds to recommending exactly one item to each impression.

A Random Input Model. Unlike most work in MAB, here we consider a Random Input Model (RIM) where the $\mu_a$'s are assumed to be drawn i.i.d. from a known distribution D. This is realistic in practice since D can be approximated using past data. In contrast to the minimax framework, this formulation better reflects reality and, more importantly, enables us to obtain theoretical guarantees that would be otherwise impossible; see Section 3. As a standard assumption in statistical learning (Ghosal, 2001; Petrone & Wasserman, 2002; Audibert & Tsybakov, 2007), we assume the density of D is bounded from above and bounded below away from 0, although our analysis extends to more general distributions (with possibly weaker guarantees).

Assumption 2.1 (Bounded Density Assumption).
The distribution D admits a density function f with compact support C, and there exist constants $C_1, C_2 > 0$ such that $C_1 \le f(x) \le C_2$ for all $x \in C$. W.l.o.g.[4] we assume C = [0, 1].

The Average Loss. The procedure of selecting arms can be encoded by a policy $\pi = (\pi_t)$ satisfying $\sum_{a \in A^t_{t-w}} \pi_t(a) = n$ for each t, where $A^{t}_{t'} := \bigcup_{s=t'}^{t} A_s$ for $0 < t' \le t$. The expected reward in round t is then $\sum_a \pi_t(a)\,\mu_a$. The problem would be easy if the $\mu_a$'s were known: letting $a^*_t = \arg\max\{\mu_a : a \in A^t_{t-w}\}$, the optimal policy is given by $\pi^*_t(a) = n \cdot \mathbf{1}[a = a^*_t]$. When the reward rates are unknown, the performance of a policy is measured by the following notion of average loss.[5]

We first define the finite-time loss. For any integer $T \ge w$ and reward rates $\mu = \{\mu_a\}$, consider
$$\mathrm{Loss}_n(\pi; T, \mu) := \frac{1}{nT}\, \mathbb{E}\Big[\sum_{t=1}^{T} \sum_{a \in A^t_{t-w}} \pi_t(a)\, (\mu^*_t - \mu_a)\Big],$$
where the expectation is only over the reward realizations, not over $\mu$. To characterize the long-run performance of the system, let $T \to \infty$ and define the average loss (or simply, the loss)
$$\mathrm{Loss}_n(\pi) := \lim_{T \to \infty} \mathbb{E}_{\mu \sim D}\,[\mathrm{Loss}_n(\pi; T, \mu)].$$
We are interested in how rapidly $\mathrm{Loss}_n(\pi)$ vanishes given a fixed $\rho > 0$.

[4] w.l.o.g. stands for without loss of generality. [5] To avoid confusion, we use the term loss for SLHVB and regret for Batched Bandits in Section 4.

3. Lower Bounds

We first show that every policy suffers $\Omega(n^{-\min\{\rho,\,1/2\}})$ average loss. We defer the details to Appendix A and only explain the high-level ideas here. Consider the per-round loss $L_t$ given by $\frac1n \sum_{a \in A^t_{t-w}} \pi_t(a)\,(\mu^*_t - \mu_a)$, where $\mu^*_t = \mu_{\max}(A^t_{t-w})$. By linearity of expectation, it suffices to lower bound $L_t$. We will show that the loss is large in each of the following two cases.

First consider the case $\pi_t(A_t) \ge \frac n2$, i.e., the policy sufficiently explores the arms arriving at time t. Due to the RIM and by Assumption 2.1, the reward rates $\mu_a$ are approximately evenly spaced. Since there are wk arms available at any time, the gap between the best and second-best arms $a, a'$ is approximately $\frac{1}{wk}$; formally, $\mu_a - \mu_{a'} \gtrsim \frac{1}{wk}$. Consequently, if the arriving batch $A_t$ does not contain the best available arm, i.e., $a^*_t \notin A_t$, then $\mu(a^*_t) - \mu_{\max}(A_t) \gtrsim \frac{1}{wk}$. Moreover, by symmetry $a^*_t$ is equally likely to appear in each of the w available batches, so the above event occurs with probability $1 - \frac1w \ge \frac12$, and hence $L_t \gtrsim \frac{1}{wk}$.

Now consider the other case, $\pi_t(A_t) < \frac n2$, i.e., $A_t$ is under-explored in round t. Consider the event $a^*_t \in A_t$. Then, by an argument similar to the first case, we have $\mu(a^*_t) - \mu_{\max}(A^{t-1}_{t-w}) \gtrsim \frac{1}{wk}$. Further, this event occurs with probability $\frac1w$, so in this case $L_t \gtrsim \frac1w \cdot \frac{1}{wk} = \frac{1}{w^2}\, n^{-\rho}$.

Thus in both cases we have $L_t \gtrsim \frac{1}{w^2}\, n^{-\rho}$, as formally stated below. We defer the proof to Appendix A.1.

Proposition 3.1 (Lower Bound). For any policy $\pi$ and $\rho > 0$, we have $\mathrm{Loss}_n(\pi) \ge \frac{1}{12 w^2}\, n^{-\rho}$.

However, this bound becomes very weak when $\rho$ is large. We next present a lower bound specific to $\rho \ge \frac12$. We argue that to avoid an $\Omega(n^{-1/2})$ loss, a policy has to identify an $n^{-1/2}$-optimal arm a, i.e., one with $\mu_a \ge \mu^*_t - O(n^{-1/2})$. To this end, by Assumption 2.1, the learner needs to explore $n^{1/2}$ distinct arms, leading to an $\Omega(n^{1/2})$ regret. We formally state the second lower bound below and defer the proof to Appendix A.2.

Proposition 3.2 (Lower Bound for $\rho \ge \frac12$). Suppose $\rho \ge \frac12$. Then for any policy $\pi$, we have $\mathrm{Loss}_n(\pi) \ge \frac{1}{96 w^2}\, n^{-1/2}$.

Note that when $\rho = 1/2$, the above two lower bounds have the same asymptotic order. By combining them, we immediately obtain the following.

Theorem 3.3 (Lower Bound). For any $\rho > 0$ and policy $\pi$, it holds that $\mathrm{Loss}_n(\pi) \ge \frac{1}{96 w^2}\, n^{-\min\{\rho,\,1/2\}}$.
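Before moving on, the following minimal Python sketch makes the formulation of Section 2 concrete: it simulates the SLHVB environment under the RIM and estimates the average loss of a given policy. It is purely illustrative and not from the paper; the uniform prior, the `policy(mus, n)` interface, and the baseline policy are our own assumptions (a learning policy would only observe realized Bernoulli rewards, whereas the trivial baseline below ignores them anyway).

```python
import numpy as np

def estimate_average_loss(policy, n, rho, w, T, seed=0):
    """Monte-Carlo estimate of Loss_n(pi; T, mu) for one draw of mu under
    the RIM, with D = Uniform[0, 1] as an illustrative prior (the paper
    only assumes a density bounded away from 0 and infinity on [0, 1])."""
    rng = np.random.default_rng(seed)
    k = int(np.ceil(n ** rho))
    alive = []                                 # alive[j]: reward rates of the age-j batch
    total = 0.0
    for _ in range(T):
        alive.insert(0, rng.uniform(size=k))   # new batch A_t arrives
        alive = alive[: w + 1]                 # batches older than the lifetime expire
        mus = np.concatenate(alive)
        plays = policy(mus, n)                 # play counts over available arms, summing to n
        total += float(plays @ (mus.max() - mus)) / n
    return total / T

def uniform_policy(mus, n):
    """A trivial baseline that spreads the n plays evenly over all
    available arms; its average loss stays Omega(1)."""
    m = len(mus)
    plays = np.full(m, n // m)
    plays[: n % m] += 1                        # distribute the remainder
    return plays

# e.g. estimate_average_loss(uniform_policy, n=10_000, rho=0.3, w=2, T=100)
```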
Finally, we present a lower bound on the worst-case regret that highlights the advantage of the RIM. We show that no policy achieves o(1) worst-case regret, defined as
$$\mathrm{Loss}^{\mathrm{wc}}_n(\pi) := \max_{\mu}\, \lim_{T \to \infty} \mathrm{Loss}_n(\pi; T, \mu),$$
where the max is over all instances $0 \le \mu \le 1$.

Proposition 3.4 (Worst-case Regret). For any policy $\pi$ and lifetime $w > 0$, we have $\mathrm{Loss}^{\mathrm{wc}}_n(\pi) \ge \frac{1}{2 w^2}$.

This bound can be proved by constructing an instance with binary rewards, such that in each round there is exactly one arm a with $\mu_a = 1$.

4. Upper Bounds

In this section, we establish the connections to the Batched Bandits (BB) problem. We first explain how a semi-adaptive algorithm for BB can be converted into a policy for SLHVB, and show how the guarantees translate from one problem to the other. Then, we consider a variant of the Batched Successive Elimination (BSE) algorithm (Gao et al., 2019) and analyze its regret. Finally, using this result, we obtain a policy for the SLHVB problem with $O(n^{-\min\{\rho,\, w/(2(w+1))\}})$ average loss. We defer the details to Appendix C.

4.1. Batched Bandits

In the BB problem (Perchet et al., 2016), the learner is given k arms, n slots, and an adaptivity level $\ell \ge 1$. In each phase $i = 0, 1, \dots, \ell$, the learner selects a multiset[6] $M_i$ of arms such that $\sum_{i=0}^{\ell} |M_i| = n$. Each time an arm a is selected, a reward is randomly drawn from a Bernoulli distribution with unknown reward rate $\mu_a$, and is immediately observed. If an arm is selected multiple times in one phase, the realized rewards may be different. The goal is to maximize the expected total reward.

Given an instance $(\mu_a)_{a \in [k]}$, the regret[7] of an algorithm A for BB is defined as
$$\mathrm{Reg}_n(A; \mu) := \frac1n \sum_{a \in [k]} (\mu^* - \mu_a)\, N_a,$$
where $\mu^* = \max_{a \in [k]}\{\mu_a\}$ and $N_a$ is the number of times arm a is selected. Most existing work for MAB considered the worst-case regret
$$\mathrm{Reg}^{\mathrm{wc}}_n(A) := \max_{\mu \in [0,1]^k} \mathrm{Reg}_n(A; \mu).$$
Analogous to the notion of average loss for the SLHVB problem, for BB we define the average regret over all instances drawn from a distribution D as
$$\mathrm{Reg}^{\mathrm{avg}}_n(A) := \mathbb{E}_{\mu \sim D}\,[\mathrm{Reg}_n(A; \mu)].$$

[6] An easy way to remember the indexing rule: $M_i$ is the batch of arms selected when using the i-th chance of being adaptive. [7] To avoid confusion, we use the synonymous terms algorithm for BB and policy for SLHVB; for the objective, we use regret for BB and loss for SLHVB.

An appealing class of algorithms is the semi-adaptive algorithms, where the cardinality of each batch $M_i$ of selected arms is decided in advance non-adaptively, while the selection of arms within each batch depends adaptively on the reward realizations.

Definition 4.1 (Semi-adaptive Algorithm). Given an adaptivity level $\ell > 0$, a semi-adaptive algorithm is specified by (i) grid sizes $\varepsilon_0, \dots, \varepsilon_{\ell-1} \in (0,1)$ with $\sum_{i=0}^{\ell-1} \varepsilon_i < 1$, and (ii) a family of decision rules $A_j : ([k] \times \mathbb{R})^{n_{j-1}} \to [k]^{n_j - n_{j-1}}$ for $j = 0, \dots, \ell$, where[8]
$$n_j = \begin{cases} \sum_{i=0}^{j} \varepsilon_i n, & \text{if } j = 0, \dots, \ell-1,\\ n, & \text{if } j = \ell,\\ 0, & \text{if } j = -1. \end{cases}$$

(Gao et al., 2019) proposed the following Batched Successive Elimination (BSE) algorithm. In each phase the algorithm keeps track of a subset $S_i \subseteq [k]$ of surviving arms (or simply survivors) that serve as candidates for the optimal arm, computed in the following manner. Initially $S_0 = [k]$. In each phase $i = 0, \dots, \ell-1$, the algorithm explores the arms in $S_i$ uniformly, i.e., selects each arm $\varepsilon_i n / |S_i|$ times. Finally, in the last phase, i.e., phase $\ell$, we arbitrarily choose an arm from $S_\ell$. For completeness, we formally state the algorithm in Algorithm 1. (Gao et al., 2019) analyzed the performance of the BSE algorithm under two types of grids.
The first is called the minimax grid: let $c = n^{\frac{1}{2-2^{1-\ell}}-1} = O(n^{-1/2}\, n^{2^{-\ell}})$ and recursively define $\varepsilon_{i+1} = c\,\sqrt{\varepsilon_i}$ with $\varepsilon_0 = c$. The second is called the geometric grid, as the batch sizes form a geometric progression. Specifically, let $c = n^{1/\ell}$ and recursively define $\varepsilon_{i+1} = c\,\varepsilon_i$ for $i = 0, \dots, \ell-2$. As the name suggests, the BSE algorithm under the minimax grid achieves the minimax regret. The geometric grid, on the other hand, is shown to achieve nearly-optimal instance-dependent regret. We rephrase the main result of (Gao et al., 2019) as follows.

[8] To avoid integrality issues, we assume $\varepsilon_i n / |S_i|$ is an integer. This does not affect the asymptotic order of our bounds.

Theorem 4.2 (Theorem 1 of (Gao et al., 2019)). Given any adaptivity level $\ell$, denote by $\mathrm{BSE}^{\mathrm{minimax}}_\ell$ and $\mathrm{BSE}^{\mathrm{geo}}_\ell$ the BSE algorithms under the minimax and geometric grids. Then, for any k-armed instance, it holds that $\mathrm{Reg}^{\mathrm{wc}}_n(\mathrm{BSE}^{\mathrm{minimax}}_\ell) = O(\sqrt{k/n}\cdot n^{2^{-\ell}})$. Further, if the highest and second-highest reward rates differ by $\Delta$, then $\mathrm{Reg}^{\mathrm{wc}}_n(\mathrm{BSE}^{\mathrm{geo}}_\ell) = O(n^{\frac1\ell - 1}\, k/\Delta)$.

Algorithm 1 Batched Successive Elimination Policy $\mathrm{BSE}(\varepsilon_0, \dots, \varepsilon_{\ell-1}; k')$ for Batched Bandits.
1: Input: $\ell$: adaptivity level; $\varepsilon_0, \dots, \varepsilon_{\ell-1} \in (0,1)$: exploration intensities; n: number of arms to be selected in total; A: a set of k arms
2: Let A' be a uniformly random subset of A of size k'; initialize $S_0 \leftarrow A'$
3: for $i = 0, 1, \dots, \ell-1$ do
4:   if $|S_i| = 1$ then
5:     Set $S_\ell = S_i$; Break
6:   end if
7:   if $|S_i| \ge 2$ then
8:     $n_i \leftarrow \lfloor \varepsilon_i n / |S_i| \rfloor$
9:     for $a \in S_i$ do
10:      Select arm a for $n_i$ times and observe rewards $X_{a,1}, \dots, X_{a,n_i}$
11:      $\bar X_a \leftarrow \frac{1}{n_i} \sum_{j=1}^{n_i} X_{a,j}$
12:    end for
13:    $\bar X_{\max} \leftarrow \max_{a \in S_i} \bar X_a$
14:    $S_{i+1} = \{a \in S_i : |\bar X_a - \bar X_{\max}| \le 3\, n_i^{-1/2} \log^{1/2} n\}$
15:   end if
16: end for
17: Select any arm in $S_\ell$ for $n - \sum_{i=0}^{\ell-1} \varepsilon_i n$ times

Clearly, the worst-case regret bound trivially carries over to the average regret. On the other hand, the worst-case regret of $\mathrm{BSE}^{\mathrm{geo}}_\ell$ depends on the parameter $\Delta$, which is random under the RIM. By Assumption 2.1, we have $\mathbb{E}_\mu[\Delta^{-1}] = \Theta(k)$, immediately implying an $O(k^2 n^{\frac1\ell - 1})$ bound on the average regret, as summarized below.

Corollary 4.3 (Average Regret of BSE). For any adaptivity level $\ell \ge 1$, we have $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^{\mathrm{minimax}}_\ell) = O(\sqrt{k/n}\cdot n^{2^{-\ell}})$ and $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^{\mathrm{geo}}_\ell) = O(k^2 n^{-(1-\frac1\ell)})$.

In particular, when $\ell = O(\log\log n)$, we have $n^{2^{-\ell}} = O(1)$ and hence the bound in Theorem 4.2 becomes $O(\sqrt{k/n})$. Surprisingly, this matches the $\Omega(\sqrt{k/n})$ lower bound for unlimited adaptivity; see, e.g., (Auer et al., 2002).

4.2. Breaking the $\sqrt{k/n}$ Barrier

Using a different geometric grid, we obtain a stronger bound on the average regret of the BSE algorithm, compared to the $O(\sqrt{k/n})$ bound in Corollary 4.3. More precisely, the common ratio of this geometric progression depends on k/n, whereas in (Gao et al., 2019) it depends only on n. With some foresight, let us consider the following choice of exploration intensities.

Definition 4.4 (Revised Geometric Grid). For any adaptivity level $\ell$ and integer $i = 0, \dots, \ell-1$, we define $\varepsilon^*_{i,\ell} = (k/n)^{(\ell-i)/(\ell+2)}$. For each fixed $\ell$, denote by $\mathrm{BSE}^*_\ell$ the algorithm $\mathrm{BSE}_\ell$ with grid sizes $(\varepsilon^*_{0,\ell}, \dots, \varepsilon^*_{\ell-1,\ell})$.

Our bound for a fixed $\ell$ requires $\rho$ to be larger than the following threshold; otherwise there is no need for having $\ell$ phases.

Definition 4.5 (Threshold Exponent). For each integer $\ell \ge 1$, we define the threshold exponent as $\theta_\ell := \frac{\ell-1}{2\ell+1}$.

To see why we need $\rho$ to be large, observe that by the RIM, the $\mu_a$'s are roughly $n^{-\rho}$ apart. On the other hand, if $\rho$ is small, the confidence intervals are expected to be narrower than $n^{-\rho}$ with strictly fewer than $\ell$ phases, and hence there is no need for extra layers; a Python sketch of the algorithm under this grid follows, and a concrete numerical example comes after it.
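The following is a minimal Python sketch of Algorithm 1 under the revised geometric grid of Definition 4.4. It is our own simplified reading of the pseudocode, not the authors' implementation: the resampling step (line 2 of Algorithm 1) is omitted, and the `pull(a, m)` reward oracle is an assumed interface.

```python
import numpy as np

def revised_geometric_grid(k, n, ell):
    """Exploration intensities eps*_{i,ell} = (k/n)^((ell-i)/(ell+2)),
    i = 0, ..., ell-1, as in Definition 4.4."""
    return [(k / n) ** ((ell - i) / (ell + 2)) for i in range(ell)]

def bse(pull, k, n, ell, eps=None):
    """Batched Successive Elimination (a sketch of Algorithm 1).
    `pull(a, m)` must return an array of m Bernoulli rewards for arm a."""
    eps = revised_geometric_grid(k, n, ell) if eps is None else eps
    survivors = list(range(k))                       # S_0: all arms survive
    used = 0                                         # plays spent so far
    for i in range(ell):
        if len(survivors) == 1:
            break
        m_i = int(eps[i] * n / len(survivors))       # plays per survivor in phase i
        if m_i == 0:
            continue
        means = {a: float(np.mean(pull(a, m_i))) for a in survivors}
        used += m_i * len(survivors)
        best = max(means.values())
        radius = 3.0 * np.sqrt(np.log(n) / m_i)      # elimination threshold (line 14)
        survivors = [a for a in survivors if best - means[a] <= radius]
    winner = survivors[0]                            # exploitation: any survivor
    pull(winner, max(0, n - used))                   # spend the remaining budget
    return winner

# Example oracle for a synthetic instance drawn from the RIM:
# mu = np.random.default_rng(0).uniform(size=k)
# pull = lambda a, m: np.random.binomial(1, mu[a], size=m)
```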
More concretely, suppose $\ell = 2$ and consider $\rho = 0.04 < \theta_2 = 1/5$. To see why the second phase is redundant, note that by Definition 4.4, $\varepsilon^*_0 = O((k/n)^{1/2})$, so each arm is selected $\varepsilon^*_0 n / k = (n/k)^{1/2} = n^{0.48}$ times upon arrival. Thus, the confidence interval of each arm after the first batch has width $(n^{0.48})^{-1/2} = n^{-0.24}$. On the other hand, the reward rates are spaced $n^{-\rho} = n^{-0.04}$ apart on average. Therefore, the optimal arm in $A_t$ is likely to be identified after just one phase.

Theorem 4.6 (Average Regret of $\mathrm{BSE}^*_\ell$). For any adaptivity level $\ell$ with $\rho \ge \theta_\ell$, the average regret of the BSE algorithm under the revised geometric grid can be bounded as $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_\ell) = O((k/n)^{\ell/(\ell+2)})$.

The proof of this result crucially relies on the RIM. We argue that, due to the RIM, after j phases the number of surviving arms is w.h.p.[9] bounded by a function of $n, \rho, j$. Consequently, each surviving arm is guaranteed at least a certain number of slots in phase j + 1. As a caveat, the above argument breaks when only one surviving arm remains after a number of phases. This event, however, occurs with low probability, since $\rho$ is greater than the threshold exponent.

[9] We say an event occurs w.h.p. (which stands for "with high probability") if it has probability $1 - n^{-\Omega(1)}$.

Note that the average regret becomes lower as $\ell$ increases. But $\ell$ cannot be arbitrarily high since we require $\theta_\ell \le \rho$. To find the maximal feasible $\ell$, consider the following meta-algorithm: If $\rho < 1/2$, choose the maximum $\ell$ with $\rho \ge \theta_\ell$; equivalently, choose $\ell = \ell^*(\rho) := \lfloor \frac{1+\rho}{1-2\rho} \rfloor$. If $\rho \ge \frac12$, then the condition $\rho \ge \theta_\ell$ holds for every $\ell$, and so we can choose an arbitrarily large $\ell$. We formally state this result as follows.

Corollary 4.7 (Explicit Form of the Average Regret). The average regret of the BSE algorithm under the revised geometric grid satisfies the following:
(i) If $0 < \rho < \frac12$, then for adaptivity level $\ell^*(\rho) = \lfloor \frac{1+\rho}{1-2\rho} \rfloor$, the average regret can be bounded as $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_{\ell^*(\rho)}) = O\big((k/n)^{\frac{\ell^*(\rho)}{\ell^*(\rho)+2}}\big)$.
(ii) If $\rho \ge \frac12$, then for any adaptivity level $\ell \ge 1$, the average regret can be bounded as $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_\ell) = O\big((k/n)^{\frac{\ell}{\ell+2}}\big)$. In particular, $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_\ell) = (k/n)^{1-O(1/\ell)}$ as $\ell \to \infty$.

Note that $n^{2^{-\ell}} > 1$ for any $\ell$, so the $O(\sqrt{k/n}\cdot n^{2^{-\ell}})$ bound in Theorem 4.2 is no better than $O((k/n)^{1/2})$. In contrast, whenever $\rho \ge 1/5$, we have $\ell^*(\rho) \ge 2$, so the bound in Corollary 4.7 is stronger. This contrast becomes sharper as $\rho$ increases. For example, with $\rho = 1/3$, we have $\ell^*(\rho) = 4$, and hence $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_{\ell^*(\rho)}) = O((k/n)^{2/3})$, which is better than $O((k/n)^{1/2})$. In the extreme case, for $\varepsilon > 0$ and $\rho = \frac12 - \varepsilon$, we have $\ell^*(\rho) = \Omega(1/\varepsilon)$ and hence an average regret of $(k/n)^{1-O(\varepsilon)}$, which is a much stronger guarantee than $O((k/n)^{1/2})$.

4.3. From Regret to Loss

A semi-adaptive algorithm for BB induces the following policy for SLHVB. Let $\varepsilon_i$ be the grid sizes, usually chosen such that $\sum_{i=0}^{\ell-1} \varepsilon_i = o(1)$, e.g., as in Definition 4.4. At each time t, the induced policy uses $\varepsilon_i n$ slots to perform the i-th phase of the BB algorithm on $A_{t-i}$, i.e., the arms that arrived at time $t-i$. Meanwhile, we use the remaining $(1 - \sum_{i=0}^{\ell-1} \varepsilon_i)\, n$ slots for exploitation; see Algorithm 2 for a formal statement.

As a nice property, the induced policy of any semi-adaptive algorithm selects exactly n arms in each round and is hence valid for SLHVB. In fact, suppose the semi-adaptive policy for BB has grid sizes $(\varepsilon_i)$. Using the notation in Algorithm 2, for each round t and each phase $j = 0, \dots, \ell$, an arm $a \in A_{t-j}$ is selected $N^t_{j,a}$ times. By definition of $\varepsilon_j$, we have $\sum_{a \in A_{t-j}} N^t_{j,a} = \varepsilon_j n$.
Summing over all j's, the total number of arms selected is $\sum_{j=0}^{\ell} \sum_{a \in A_{t-j}} N^t_{j,a} = n$.

Algorithm 2 Induced Policy $\pi[A; k']$ for SLHV Bandits
1: Input: A: a semi-adaptive algorithm for BB; k': cardinality of the random subset of arms
2: for t = 1, 2, ... do
3:   Sample a random subset $A'_t$ of $A_t$ of size k'
4:   $A_t \leftarrow A'_t$
5:   for $j = 0, \dots, \ell$ do
6:     for $a \in A_{t-j}$ do
7:       Query A for the number $N^t_{j,a}$ of times to select arm a in the j-th batch
8:       Select a for $N^t_{j,a}$ times, and return the rewards to A
9:     end for
10:  end for
11: end for

Note that in Algorithm 2, we only use age-$\ell$ arms for exploitation. This is not practical: an obviously better policy would exploit the empirically best arm in $A^{t-\ell}_{t-w} = \bigcup_{j=\ell}^{w} A_{t-j}$ rather than just in $A_{t-\ell}$. We choose to state the policy in this way for simplicity of analysis. By doing so, the loss only increases by $O(n^{-\rho})$, which is of the same order as the $\Omega(n^{-\rho})$ lower bound in Theorem 3.1, and is hence not essential.

We can convert the regret of any semi-adaptive BB algorithm to the loss of the induced SLHVB policy as follows.

Proposition 4.8 (Regret-to-Loss Conversion). Suppose A is a semi-adaptive algorithm with average regret R(n, k) on any BB instance with k arms and n slots. Then for any k', the loss of the induced policy $\pi[A, k']$ for SLHVB satisfies $\mathrm{Loss}_n(\pi[A, k']) = O(1/k' + \rho^{-1} n^{-\rho} + R(n, k'))$.

We explain the high-level idea and defer the details to Appendix B. The term 1/k' captures the gap between the optimal reward rate in the resampled subset and that in the original set of arms. The second term, of order $n^{-\rho} = 1/k$, captures the loss due to the reward uncertainty of new arrivals: we have to decide how many times to choose them without any knowledge of their qualities apart from the RIM assumption. The final term, R(n, k'), bounds the average regret on the resampled subset, which contains k' arms.

4.4. Loss of the Induced Policy

The average regret bound in Theorem 4.6 and the regret-to-loss conversion formula in Proposition 4.8 immediately lead to a loss bound for the induced SLHVB policy. To formalize this, we first recapitulate some notation. Given a semi-adaptive algorithm A for BB, we defined $\pi[A, k']$ as the induced policy for SLHVB with resampling size k'. Let $\mathrm{BSE}^*(\ell, k') := \pi[\mathrm{BSE}^*_\ell, k']$ be the SLHVB policy induced by the BSE algorithm with the revised geometric grid $(\varepsilon^*_{j,\ell})$ specified in Definition 4.4.

Proposition 4.9 (Loss of the Induced Policy). If $\ell \le w$ and $\rho \ge \theta_\ell$, then
$$\mathrm{Loss}_n(\mathrm{BSE}^*(\ell, k)) = O\Big(n^{-\rho} + n^{(\rho-1)\frac{\ell}{\ell+2}}\Big) = \begin{cases} O(n^{-\rho}), & \text{if } \rho < \frac{\ell}{2(\ell+1)},\\ O\big(n^{(\rho-1)\frac{\ell}{\ell+2}}\big), & \text{otherwise,} \end{cases} \quad (1)$$

Different from the previous subsections, we now have an extra lifetime constraint that $\ell \le w$. Our meta-policy, called the Hybrid policy, chooses a suitable $\ell$ for any given $\rho$ (see Algorithm 3).

Algorithm 3 Hybrid Policy $H(\rho; w)$
1: If $\rho < \frac15$, then set $\ell = 1$ and $k' = k$.
2: If $\frac15 \le \rho < \frac{w}{2w+2}$, then set $k' = k$ and choose any $\ell \le w$ such that $\theta_\ell = \frac{\ell-1}{2\ell+1} \le \rho < \frac{\ell}{2\ell+2}$.
3: If $\rho \ge \frac{w}{2w+2}$, then set $\ell = w$ and $k' = n^{\frac{w}{2w+2}}$.
4: Invoke $\mathrm{BSE}^*(\ell, k')$.

We prove the following in Appendix D.

Theorem 4.10 (Loss of the Hybrid Policy). For any SLHVB instance with n slots per round, volume exponent $\rho > 0$ and lifetime $w \ge 1$, the average loss of the Hybrid policy satisfies $\mathrm{Loss}_n(H(\rho; w)) = O\big(\rho^{-1}\, n^{-\min\{\rho,\, \frac{w}{2(w+1)}\}}\big)$.

To better understand the impact of the lifetime w, observe that for w = 1, 2, the loss bounds are asymptotically $r_1(n) := n^{-\min\{\rho,\,1/4\}}$ and $r_2(n) := n^{-\min\{\rho,\,1/3\}}$. In particular, when $\rho > 1/4$, we have $r_2 = o(r_1)$ as $n \to \infty$.
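To make the parameter choice of Algorithm 3 explicit, here is a short sketch that maps $(\rho, w, k, n)$ to the adaptivity level $\ell$ and resampling size k' used by the Hybrid policy. The tie-breaking (taking the smallest feasible $\ell$ in Step 2) and the rounding of k' are our own choices and are not specified in the paper.

```python
import math

def hybrid_parameters(rho, w, k, n):
    """Adaptivity level ell and resampling size k_prime, chosen as in the
    Hybrid policy H(rho; w) (Algorithm 3)."""
    if rho < 1 / 5:                                    # Step 1
        return 1, k
    if rho < w / (2 * w + 2):                          # Step 2
        for ell in range(1, w + 1):                    # any feasible ell works; take the smallest
            theta = (ell - 1) / (2 * ell + 1)          # threshold exponent (Definition 4.5)
            if theta <= rho < ell / (2 * ell + 2):
                return ell, k
    # Step 3: rho >= w/(2w+2); cap ell at the lifetime and shrink the arm set.
    return w, math.ceil(n ** (w / (2 * w + 2)))

# e.g. hybrid_parameters(rho=0.3, w=4, k=1000, n=100_000) -> (ell, k_prime)
```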
5. Field Experiment

We further validated the effectiveness of our policy in a field experiment, via a collaboration with Glance, a leading lock-screen content platform that faces the aforementioned challenge. Specifically, their marketing team curates around 200 content cards (or simply, cards) per hour, and around 70% of them expire within 48 hours. Each card consists of a link to some external content (e.g., video, news or article), along with a short text description; see Figure 1. The firm needs to recommend cards to users with the goal of maximizing the total user engagement, measured by the total duration of the interactions and the number of click-throughs. This problem can be cast as an SLHVB problem, assuming that the total number of impressions is independent of the recommendations.

Two key quantities are of particular interest for each card: (i) the click-through rate (CTR) and (ii) the average duration per impression. Both metrics are unknown when a card is released. We mix the above two metrics by considering conversions: a conversion occurs if either a click-through occurs or the duration reaches a threshold of θ = 5 seconds. For simplicity, we assume that the rewards across different impressions are i.i.d. random variables.

Figure 1. Glance's content cards. The image and text summarize the content, and by clicking on the link the user is shown the details.

The platform sends cards to the users on an hourly basis. For each user, the platform replaces the cards that the user has viewed in the past with new content cards, provided the device is connected to the internet. The users can swipe through the cards stored in the platform's app. When a user is interested in the content, they can click on the provided link and be redirected to an external source for further engagement, and then be redirected back to the app when finished.

To decide which cards to send to each user, the firm deployed a recommender system based on a state-of-the-art Deep Neural Network (DNN). This DNN predicts, for each pair of user and card, the expected conversion probability, using (i) the user's interaction history and (ii) the card's text and image features. Although the current recommender system works reasonably well, there is considerable potential for improvement. Most notably, the current system only uses the user feedback to update the users' behavioral signatures for future prediction, and does not directly leverage the feedback in making recommendations for similar users. In particular, it does not use the feedback to adjust the predictions of the conversion rates directly. This may have caused substantial loss in user engagement. It is thus vital for the platform to find a recommendation policy that (i) can learn the true conversion rates of new cards quickly using user interaction data and (ii) is computationally simple to deploy. Our policy is well-suited for this task. We defer the implementation details to Appendix E.

We perform a detailed analysis of the field experiment results and show that the simplest version of our policy outperforms the DNN-based recommender by about 4% in total duration (see Figure 2) and by nearly 7% in the total number of click-throughs per user per day (see Figure 3). We defer the detailed analysis to Appendix F.

Figure 2. Duration per user-day pair.

Figure 3. Click-through per impression.

References

Agarwal, A., Agarwal, S., Assadi, S., and Khanna, S.
Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pp. 39-75. PMLR, 2017. Audibert, J.-Y. and Tsybakov, A. B. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608-633, 2007. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002. Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and Shepp, L. A. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103-2116, 1997. Chakrabarti, D., Kumar, R., Radlinski, F., and Upfal, E. Mortal multi-armed bandits. Advances in Neural Information Processing Systems, 21:273-280, 2008. Farias, V. F. and Madan, R. The irrevocable multiarmed bandit problem. Operations Research, 59(2):383-399, 2011. Gao, Z., Han, Y., Ren, Z., and Zhou, Z. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019. Gauthier, C.-S., Gaudel, R., and Fromont, E. UniRank: Unimodal bandit algorithms for online ranking. In International Conference on Machine Learning, pp. 7279-7309. PMLR, 2022. Ghosal, S. Convergence rates for density estimation with Bernstein polynomials. The Annals of Statistics, 29(5):1264-1280, 2001. Kohavi, R. and Longbotham, R. Online controlled experiments and A/B testing. Encyclopedia of Machine Learning and Data Mining, 7(8):922-929, 2017. Komiyama, J., Honda, J., and Nakagawa, H. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning, pp. 1152-1161. PMLR, 2015. Lagrée, P., Vernade, C., and Cappé, O. Multiple-play bandits in the position-based model. Advances in Neural Information Processing Systems, 29, 2016. Lai, T. L., Robbins, H., et al. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985. Levine, N., Crammer, K., and Mannor, S. Rotting bandits. Advances in Neural Information Processing Systems, 30, 2017. McFarland, C. Experiment!: Website conversion rate optimization with A/B and multivariate testing. New Riders, 2012. Perchet, V., Rigollet, P., Chassang, S., Snowberg, E., et al. Batched bandit problems. Annals of Statistics, 44(2):660-681, 2016. Petrone, S. and Wasserman, L. Consistency of Bernstein polynomial posteriors. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(1):79-100, 2002. Radlinski, F., Kleinberg, R., and Joachims, T. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pp. 784-791, 2008. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018. Vuleta, B. How much data is created every day? 2021. URL https://seedscientific.com/how-much-data-is-created-every-day/. Wang, Y., Audibert, J.-Y., and Munos, R. Algorithms for infinitely many-armed bandits. Advances in Neural Information Processing Systems, 21, 2008. Wasserman, L. All of Nonparametric Statistics. Springer Science & Business Media, 2006. Xu, Y., Chen, N., Fernandez, A., Sinno, O., and Bhasin, A. From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2227-2236, 2015. Yang, F., Ramdas, A., Jamieson, K. G., and Wainwright, M. J.
A framework for multi-a (rmed)/b (andit) testing with online fdr control. Advances in Neural Information Processing Systems, 30, 2017. Zhang, X. and Frazier, P. I. Restless bandits with many arms: Beating the central limit theorem. ar Xiv preprint ar Xiv:2107.11911, 2021. Short-lived High-volume Bandits A. Details of the Lower Bounds A.1. Proof of Proposition 3.1 Consider the event Wt 1 = 1 2 (w 1)k µmax(At 1 t w) 1 1 (w 1)k We first show that this event occurs with large probability. Lemma A.1. If K 10, then P(Wt 1) 1 Proof. Denote µmax = µmax(At 1 t w). Let H = n µmax 1 1 (w 1)k o and H = n µmax < 1 2 (w 1)k o . Then for any k 10, we have H = P µmax < 1 1 (w 1)k = 1 1 (w 1)k and thus P[H] 1 3 4e. Moreover, since P[H ] = 1 2 (w 1)k (w 1)k 2 2 e 2, we conclude that P[Wt 1] 1 P[H ] P[H] 1 1 3 Thus, by giving up a constant factor in the loss, we may restrict our attention on the event Wt 1. For each time period t, define probability measure Pt( ) = P[ |Wt 1] and Et( ) = E[ |Wt 1]. Consider the event that at time t, the policy selects arms from At at least n/2 times, i.e., Et = P a At πt(a) n 2 . Further, define B t = µmax(At) µmax(At 1 t w) 1 and B+ t = µmax(At) µmax(At 1 t w) + 1 Observe that Et(Lt) = Et(Lt|Et B t ) Pt(Et B t ) + Et(Lt|Et B+ t ) Pt(Et B+ t ) + Et(Lt|Et B t ) Pt(Et B t ) + Et(Lt|Et B+ t ) Pt(Et B+ t ) Et[Lt|Et B t ] Pt(Et B t ) + Et(Lt|Et B+ t ) Pt(Et B+ t ). (2) We next lower bound the above two terms. Note that the event in the first term, i.e., Et B t , says that the policy selects arms from At for at least n/2 times, but the best arm in At is suboptimal (w.r.t. µmax(At t w)) by 1/k. Intuitively, when this event occurs, the policy suffers at least n k = n 2k loss in this round. On the other hand, the event in the other term, i.e., Et B+ t , says that the policy selects At for less than n/2 times, but the best arm in µ(At 1 t w) is suboptimal by n/(wk). By a similar argument, we can show that the loss in this round is at least 1 2wk on this event. At a high level, both these two lower bounds are caused by not knowing quality of the new batch of arms. We formalize the above ideas in the following lemma. Lemma A.2 (Loss for Uncertainty in the New Batch). The loss in any round t satisfies Et[Lt|Et B t ] n 2k and Et[Lt|Et B+ t ] n 2wk. Proof. Write µ t = µmax(At). Recall from the definition that when B t occurs, we have µ t µmax(At) 1/k. Thus, if π Short-lived High-volume Bandits selects arms from At for Ω(n) times, an Ω(n/k) loss is incurred. Formally, Et Lt| Et B t = Et a At t w πt(a) (µ t µa) Et B t a At πt(a) (µ t µa) Et B t a At πt(a) (µ t µmax(At)) Et B t Note that conditional on Et, B t and Wt 1 (recall that Et[ ] = E[ |Wt 1]), we have P a At πt(a) n 2 and µ t µmax(At) 1/k. Thus, Now consider the second term in (2). When B+ t occurs, we have µmax(At) µmax(At 1 t w) 1 wk, so if the π selects arms in At 1 t w for Ω(n) times, an Ω(n/k) loss is incurred. Formally, Et Lt| Et B+ t = Et a At t w πt(a) (µ t µa) Et B+ t πt(a) µmax (At) µmax(At 1 t w) Et B+ t Note that conditional on Et, B t and Wt 1, we have P a At 1 t w πt(a) n 2 and At µmax(At 1 t w) 1 wk. Therefore, wk = n 2wk . A.2. Proof of Proposition 3.2 Consider the event Gt = µmax(At t w) 1 n 1/2 where t w. Lemma A.3 (µmax is Close to 1). For k > n, we have P[Gt] 1 Proof. Since |At t w| = wk and the reward rate of each arm is drawn i.i.d. uniformly, we have P[Gt] = (1 δ)kw = (1 δ) 1 δ kwδ e kwδ. Since k > n, we have kδ > n 1 2 n 1 2 1, so P[Gt] e kwδ e w 1 2, i.e. 
P[Gt 1 An arm is said to be unexplored at time t if it has never been selected by the policy before. Our proof considers the number of unexplored arms selected in each round, as formalized below. Definition A.4 (Unexplored Arms). We say an arm a is unexplored at round t if it has never been selected by (the start of) round t; formally, this means Pt 1 τ=1 πτ(a) = 0. For each round t, define Mt as the subset of unexplored arms selected in this round; formally, Mt = {a At t w : πt(a) > 0 and Pt 1 s=1 πs(a) = 0}. Moreover, as for any t, t , we write From a myopic perspective, selecting too many unexplored arms in a round leads to high loss since, by Assumption 2.1, each unexplored arm is suboptimal by Ω(1) on average. We formalize this in the following lemma. Lemma A.5 (Cost of Selecting Many unexplored Arms). For any t 1, consider the event Vt = n |Mt| n expected loss in round t conditional on Vt satisfies E [Lt|Vt] n 24w. Short-lived High-volume Bandits Proof. Write µ t = µmax(At t w) and Et[ ] = E[ |Gt] for any t. We start with lower bounding the conditional loss. Observe that Et [Lt|Vt] = Et a A πt(a) (µ t µa) Vt a A Et h πt(a) 1(a Mt) (µ t µa) Vt i a A Et[1(a Mt) 1 n 1/2 µa | Vt], (5) where the last inequality follows by definition of Et and that a Mt implies πt(a) 1. To further simplify, note that the reward rate of any unexplored arm a is an independent draw and is hence independent with the set Mt of unexplored arms selected in this round, so a A Et [1(a Mt)| Vt] Et h 1 n 1/2 µa| Vt i . (6) Note that Et[µa|Vt] = E[µa] = 1 2, we have Et 1 n 1/2 µa| Vt = 1 n 1/2 1 3 for any n 1. Therefore, a A 1(a Mt) Vτ 3 = Et h |Mt| Vτ i 1 By Lemma A.5, we have P[Gt] 1 2, and thus E[Lt|Vt] Et[Lt|Vt] P[Gt] 1 2 n 12w = n 24w. The above says that selecting too many unexplored arms leads to high loss. On the flip side, selecting too few unexplored arms also leads to high loss. To show this, consider the cold-start event where no δ-good arms that π ever selected is still available at the start of round t. Definition A.6 (Cold-start Event). For each t 1, the cold-start event is defined as Bt = µmax(St t w) 1 n 1/2 . As the name suggests, when this event occurs, the policy has little information about the available arms, and hence it behaves as if the time horizon has restarted. This leads to high loss. In fact, a policy has to identify a δ-good arm to have low loss from time t w to t + w. This forces the policy to explore Ω(1/δ) = Ω(n1/2) unexplored arms, which leads to an Ω(n1/2) loss. This result is implied by the lower bound for infinite armed bandits (Wang et al., 2008). Lemma A.7 (Cold-start Incurs High Loss). For any t, we have E h Pt+w τ=t Lτ Bt i n To apply the above, we next characterize when Bt would occur. Consider the number Nt of new arms selected in [t w, t]. If ENt 1/δ, then on average the policy suffers Ω(1/δ) = Ω( n) loss. If ENt < 1/δ, then with Ω(1) probability, none of those Nt arms are δ-good, which leads to the cold-start event Bt. We formalize this idea below. Lemma A.8 (Under-Exploration Leads to Cold-start Event). Suppose E h Pt s=t w Ls i n 96w2 , then P[Bt] 1 Proof. Consider the event Es = n µmax(Ss) 1 1 n o that the for any s = 1, 2 . . . , then our goal reduces to showing that P[Es] 1 2w for every s = t w, . . . , t. In fact, if none of the events Et w, . . . , Et occurs, i.e. no unexplored arm selected in the past w rounds is n 1/2-good, then Bt occurs. It then follows from the union bound that Short-lived High-volume Bandits and the proof will be complete. 
To show P[Es] 1 2w for s {t w, ..., t}, observe that for any fixed s, P[Es] = P[Es|Vs] P[Vs] + P[Es|Vs] P[Vs] P[Vs] + P[Es|Vs], (7) where we recall that Vt = n |Mt| n 4w o . we will bound each of the two terms in (7) by 1 4w separately. To see P[Vs] 1 4w, note that n 96w2 t =t w E[Lt ] E[Ls] E[Ls|Vs] P[Vs]. Note that by Lemma A.5, we have E[Ls|Vs] n 24w, so P[Vs] 1 4w. To bound the second term in (7), note that by definition of Es, for any C, δ > 0 we have Es Ss C i (1 δ)C (1 δ) 1 δ δC e δC 1 δC, where the last inequality follows since e x 1 x for any x R. In particular, for C = n 4w and δ = 1 n, we have Es Vs i = P P h Es Vs i = 1 P h We now combine the above to establish the n lower bound. Proof of Proposition 3.2. Fix any round t. Decompose Lt+w t w into the loss before and after round t as τ=t w ELτ = τ=t w ELτ + If the first term, i.e., the loss before time t, is greater than n 96w2 , then the claim holds trivially. Otherwise, by Lemma A.8, the cold-start event Bt occurs w.p. P[Bt] 1 2. Thus, the loss after time t can be bounded using Lemma A.7 as 2 = n 12w > n 96w2 . A.3. Proof of Theorem 3.1 Now we are ready to show the Ω(1/k) lower bound. By (2) and Lemma A.2, we have Et(Lt) Et[Lt|Et B t ] Pt(Et B t ) + Et(Lt|Et B+ t ) Pt(Et B+ t ) 2k Pt(Et B t ) + n 2wk Pt(Et B+ t ). (8) Note that the events Et and B+ t (Et and B t resp.) are independent conditional on Gt 1, so (8) n 2wk 1 2 Pt(Et) + 1 w Pt(Et) n 2w2k . Short-lived High-volume Bandits E[Lt] E[Lt 1(Gt 1)] = E[Lt|Gt 1] P[Gt 1] n 2w2k 1 8 = n 16w2k . Summing over t and taking the limit, we conclude that Lossn(π) = lim T 1 n T t=1 E[Lt] 1 16w2k . B. Proof of Proposition 4.8: Regret-to-Loss Conversion As we recall, given a policy π = (πt) for SLHV bandits, the loss in round t is Lt = 1 a At t w πt(a) (µ t µa) i . We first decompose Lt into the following internal and external loss. Definition B.1 (External and Internal Loss). Let A be a semi-adaptive algorithm for the BB problem with adaptivity level ℓand π = (πt) be the induced policy for the SLHVB problem. For any round t, integer j w, let t,j = µmax(At t w) µmax(At). Define the external and internal loss as Lext t = E max 1 j w t,j and Lint t = E n (µmax(At j) µ(a)) Here, the term t,j is external in the sense that it does not depend on the policy, but only on the randomness in the RIM. On the other hand, the term Lint t is internal it measures the reward gap between the selected arms and the best arms among the respective age groups. It is straightforward to show the following. Lemma B.2 (Loss Decomposition). In any round t, the loss satisfies Lt Lint t + Lext t . Proof. By the definition of internal and external loss, we have a At t w πt(a) (µ t µa) n µmax(At t w) µa (since πt(a) = 0 if a At t w\At t ℓ) n ( t,j + (µmax(At j) µa)) (by definition of i,j) E max 1 j w t,j n (µmax(At j) µa) We next bound the external and internal losses separately. We start by showing the external regret is O(1/k). Recall from the definition of the RIM that µa D and from Assumption 2.1 that the density of D is bounded by C2, C1 > 0 from above and below. Proposition B.3 (External Loss). For any round t, the external loss can be bounded as Lext t 3ρ log n C1k for any sufficiently large n.10 10We say that a property Pn (e.g., an inequality) holds for any sufficiently large n if there exists a constant n0 > 0 such that Pn holds whenever n n0. Short-lived High-volume Bandits Proof. Consider any i [w] and Yi := µmax(At i). 
For any ε < 1 and arm a A, by the RIM and Assumption 2.1, we have P[µa > 1 ε] C1ε, or equivalently, P[µa 1 ε] 1 C1ε. In particular, for ε = 2ρ log n C1k , we have P Yi < 1 2ρ log n 1 C1 2ρ log n k = 1 2ρ log n k 2ρ log n 2ρ log n e 2ρ log n = n 2ρ. Thus, the event B := S i [w] n Yi < 1 2ρ log n C1k o has probability P [B] P i [w] n 2ρ wn 2ρ. It follows that E min i [w] Yi = E min i [w] Yi B P B 1 2ρ log n 1 wn 2ρ 1 3ρ log n where the last inequality follows since wn 2ρ 2ρ log n C1k for any sufficiently large n. Therefore, E maxi [w] t,i 1 E mini [w] Yi 3ρ log n Now we are ready to prove the regret-to-loss conversion formula. Proof of Proposition 4.8. Recall that k is the size of the resampled subset of each set At of arriving arms. For any fixed T w, by re-arranging the terms in (9), we obtain t=1 Lint t = n (µmax (At j) µa) n µmax (At j) µmax A t j + µmax A t j µa µmax A t j µmax (At j) + 1 a At j πt(a) µmax A t j µa where the inequality follows since for any t, the total number of arms the policy selects satisfies Pℓ j=0 P a At j πt(a) = n. Note that E max0 j ℓ µmax A t j µmax (At j) 1 j=0 πt+j(a) (µmax(At) µa) k + (T 2ℓ) R(n, k ) + 2ℓ, where the 2ℓterm in the first inequality is because we are summing from ℓto T ℓ. It follows that Lossn(π) lim T 1 T Lint t + Lext t k + 3 log k C1ρk + lim T (T 2ℓ) R(n, k ) + 2ℓ k + 3 log k C1ρk + R(n, k ). where the last identity follows since R(n, k ) does not depend on T, and ℓ= O(1) as T . C. Proof of Theorem 4.6: Average Regret of BSE In this section we prove Theorem 4.10, which says that the average regret is O((k/n)ℓ/(ℓ+2)) for the BSE algorithm with the revised geometric grid, specified in Definition 4.4. We illustrate the key ideas by considering adaptivity levels ℓ= 1 and 2 as warm-up first in Section C.2 and C.2 respectively, and then in Section C.4 we present the proof for general ℓ 1. Short-lived High-volume Bandits C.1. Preliminaries We introduce some tools for our analysis. For any arm a, consider i.i.d. Bernoulli rewards (Za i,j)i [w],j [n] with mean µa. We first state a standard concentration bound for independent random variables, see e.g., (Vershynin, 2018). Lemma C.1 (Concentration Bounds). Let Z1, ..., Zm be independent random variables supported on [0, 1], and Z = 1 m Pm i=1 Zi, then for any δ > 0, it holds that P(| Z E( Z)| > δ) exp δ2m (Hoeffding s inequality), P | Z E( Z)| > δ E( Z) exp (Chernoff s inequality). Consider the event that the empirical mean rewards of all available arms available at time t satisfy Hoeffding s inequality. Definition C.2 (Clean Event). For any integers i, m and arm a, define j=1 Za i,j mµa p 2m (ρ + 3) log n We define the clean event as Ct = T Ci,m a where the intersection is over m [n], i [w] and a At t w. We use Lemma C.1 to show that Ct occurs with high probability in each round t. Lemma C.3 (Clean Event is Likely). For any time t, the clean event satisfies P Proof. Fix an arbitrary i [w] and arm a At t w. Then for any integer m, by Hoeffding s inequality, Ci,m a i = P j=1 Za i,j mµa > p 2m(ρ + 3) log n 2m 2m(ρ + 3) log n = n 3. By the union bound over all a [k], i [w] and m [n], we have m [n],i [w], a At t w m [n],i [w], a At t w m [n],i [w], a At t w j=1 Za i,j mµa > p 2m(ρ + 3) log n nwk n ρ 3 = w n 2 n 1. We will also repeatedly apply the following simple fact. Lemma C.4 (3 -Inequality). Let > 0 and X = {xj}j [k], X = {x j}j [k] be two sets of real numbers such that |xj x j| for all j [k]. Suppose for some i we have x i max X . Then, xi max X 3 . Proof. 
Since |xj x j| for all j [k], we have | max X max X | . Hence, x i max X max X 2 . Therefore, xi x i max X 2 = max X 3 . Short-lived High-volume Bandits C.2. Warm-up: ℓ= 1 In this subsection we show that the average regret is O (k/n)1/3 for the BSE algorithm with grid ε 0,1. We first prove that w.h.p. all surviving arms are close to optimal. Recall from Algorithm 1 that S1 denotes the surviving arms at the end of the first phase. Lemma C.5 (Surviving Arms Have High Rewards). If the clean event C occurs, then µmax(A) µmin(S1) δ1 where δ1 = 3 ε0n k 1/2 log1/2 n. Proof. By Hoeffding s inequality (see Lemma C.1), the deviation of the empirical mean reward of each arm satisfies |ˆµa µa| 1 3δ1 for all arms a A. Consider ˆa = arg maxa ˆµa and a = arg maxa µa. By definition of S1 (see Algorithm 1), an arm a survives only if |ˆµa ˆµˆa| 1 3δ1. Thus, by the 3 -inequality (Lemma C.4) we have µa µmax(A) δ1. Combining the above with the regret decomposition (Lemma B.2), we obtain the following. Proposition C.6 (Average Regret of BSE 1). Suppose ρ 1. Then the average regret of BSE 1, the BSE algorithm with exploration intensities ε 0,1, satisfies Regavg n (BSE 1) 5 k 1/3 log1/3 n. Proof. To suppress notations, write ε0 := ε 0;1. By Lemma C.5, we have µmax(A) µmin(S1) δ1, and thus Regavg n (BSE 1) ε0 + (1 ε0) E [µmax(A) µmin(S1)] ε0 + δ1 P [C] + P ε0 + δ1 + n 1, where the last inequality follows from Lemma C.3. Expanding δ1 using Lemma C.5 and recalling that ε 0,1 = k n 1/3 log1/3 n, we have Regavg n (BSE 1) 4 k 3 log 1 3 n + n 1 5 k 3 log 1 3 n, where the last inequality follows since 1 C.3. Bounding the Number of Survivors: Analysis for ℓ= 2 Recall that for adaptivity level ℓ= 2, we first select a batch of arms and compute a subset S1 of surviving arms whose confidence intervals are not dominated by any other arm. Then we select another batch of arms, and compute a further subset S2 S1 in a similar manner. Finally, in the exploitation phase, we choose an arbitrary arm from S2. The key step is bounding the number of survivors after the first phase - if we can upper-bound S1 s cardinality (w.h.p.), then we can lower-bound the number of times each arm in S1 is selected in phase 2, leading to a guarantee on the width of the confidence interval. To this goal, consider the following uniform event that µa s are distributed approximately uniformly. Definition C.7 (Uniform Event). Consider any constant δ (0, 1), and the number Nδ of arms in A that lie in a δneighborhood of the optimal arm, formally, Nδ = {a A : µa µmax(A) δ} . We define the uniform event as We show that the uniform event is likely to occur when k is large. This is a direct consequence of the RIM and Assumption 2.1, and its proof is similar to that of Proposition B.3. Lemma C.8 (Uniform Event is Likely). Suppose δk 8 C1 log n. Then for any δ (0, 1) and round t, the uniform event satisfies P Short-lived High-volume Bandits Proof. Index the arms from 1 to k, and denote by f the density of D, the distribution from which the reward rates µi s are drawn. We first upper bound P[ G | µmax = 1 γ] for any fixed γ (0, 1). We subsequently fix an arbitrary i [k]. Denote by Zi = 1[µi µmax δ] and consider the largest reward rate µmax i in [k]\{i}, i.e., µmax i = max {µ1, . . . , µi 1, µi+1, . . . , µk} . Observe that P [ Zi = 1 | µmax = 1 γ ] = P µi [1 γ δ, 1 γ] | µmax i δ = Z 1 γ 1 γ δ f(z) dz, where the last identity follows since µi s are independent, in particular, µmax i and µi are independent. 
Recall by Assumption 2.1 that the density satisfies C1 f(z) C2, so C1δ P [ Zi = 1 | µmax = 1 γ ] C2δ. Note that conditional on the event {µmax = 1 γ}, the random variables Zi are still i.i.d., so by Chernoff s inequality (see Lemma C.1) we obtain P h G µmax = 1 γ i P i [k] Zi > 3 2 C2δk µmax = 1 γ i [k] Zi < 1 2 C1δk µmax = 1 γ 8 8 C1ρ = 2n 1. P[ G] = Eγ P[ G | µmax = 1 γ] 2n 1. Assuming that Uδ1 and C both occur, we have |S1| δ1k. Thus, in the second phase, each surviving arm is selected ε1n times, and hence the confidence intervals have widths ε1n δ1k 1/2 . We next make these ideas precise. Definition C.9 (Widths of Confidence Intervals). For any ε0, ε1 (0, 1), define δ1 = δ1(ε0) = 8 2 log n and δ2 = δ2(ε0, ε1) = 6 3 2 4 log 5 4 n. For any adaptivity level ℓ, denote by δ j,ℓ= δj,ℓ(ε 0, ε 1) for j = 1, . . . , ℓ. Under the above notations, we can bound the suboptimality of S2, the surviving arms after the second phase as follows. Lemma C.10 (Suboptimality of S2). Consider the algorithm BSE2(ε0, ε1) such that δ1 = δ1(ε0, ε1) satisfies δ1k > 8 C1 log n. If the clean event C and the uniform event Uδ1 both occur, then µmax(A) µmin(S2) δ2. Proof. We first upper bound the cardinality of S1. By Lemma C.5, since C occurs, we have µmax (A) µmin (S1) δ1. In other words, for an arm a A to survive the first phase, its mean reward needs to be δ1-close to µmax(A). Since the uniform event Uδ1 occurs, by Lemma C.8 the number of such arms can be bounded as 2C2δ1k. (11) Short-lived High-volume Bandits Note that in phase 2, each arm in S1 is selected m1 := ε1n |S1| times. By definition of the clean event C, the empirical mean reward of every arm deviates from the mean by at most 2(ρ + 3) m 1/2 1 log1/2 n. Thus by the 3 -inequality (Lemma C.4), we can bound the suboptimality of arms in S2 as µmax (A) µmin (S2) 3 p 2(ρ + 3) m 1/2 1 log1/2 n. To further bound the above, we use (11) to lower bound m1 = ε1n |S1|. Specifically, µmax (A) µmin (S2) 2(ρ + 3) ε1n 3 2C2δ1k (ρ + 3)C2 ε 1 2 log 1 4 k log 1 2 n. (12) Recall from Definition C.9 that 2 log n and δ2 = δ2(ε0, ε1) = 6 3 2 4 log 5 4 n, we can simplify the above inequality as 4 log 1 4 k log n = δ2. We can now bound the average regret of the BSE algorithm for any grid as follows. Proposition C.11 (Average Regret of BSE with Arbitrary Grid, ℓ= 2). Suppose 0 < ε0 < ε1 < 1 and δ1k > 8 log n C1 . Then, the average regret of the BSE algorithm with grid size ε0, ε1 satisfies Regavg n (BSEℓ=2 (ε0, ε1)) ε0 + ε1δ1 + δ2 + O n 1 Proof. By definition of regret, we have Regavg n (BSEℓ(ε0, ε1)) = ε0 + ε1 E [µmax(A) µmin(S1)] + (1 ε0 ε1) E [µmax (A) µmin (S2)] . We bound each term separatly. By Lemma C.3 and Lemma C.5, the second term can be bounded as E [µmax(A) µmin(S1)] = E [µmax(A) µmin(S1) | C] P[C] + E µmax(A) µmin(S1) | C P[C] δ1 1 + 2n 1. By Lemma C.10, if the events Uδ1 and C both occur, then µmax(A) µmin(S2) δ2. Further, by Lemma C.8, we have P [Uδ1] 2n 1 whenever δ1k 8 log n C1 . Combining the above facts, we obtain Regavg n (BSEℓ(ε0, ε1)) ε0 + ε1 δ1 + 2n 1 + δ2 + n 1, and the proof is complete. The revised geometric grids in Definition 4.4 are motivated by minimizing the above bound. Since we are focusing on ℓ= 2, we will suppress ℓand write ε i = ε i,ℓfor i ℓ= 2. Choose ε0 so that ε0 = ε1 δ1(ε0) = δ2(ε0, ε1), then we have as specified in Definition 4.4. We obtain the following guarantee for the revised geometric grid by choosing εi = ε i in Proposition C.11. 
The proof is straightforward: we only need to verify that $\delta^*_1 k \ge \frac{8}{C_1}\log n$.

Corollary C.12 (Average Regret of BSE$^*_2$). If $\rho \ge \theta_2$, then $\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_{\ell=2}) = O\big((k/n)^{1/2}\log^{5/4} n\big)$.

Proof. Recall from Definition 4.4 that $\varepsilon^*_0 = (k/n)^{1/2}$, and from Definition C.9 that for any $\varepsilon_0$ we defined $\delta_1 = \delta_1(\varepsilon_0) = 8(\varepsilon_0 n/k)^{-1/2}\log n$. Expanding the expressions for $\varepsilon^*_0$ and $\delta^*_1 = \delta_1(\varepsilon^*_0)$, we have
$$\delta^*_1 k = 8\Big(\frac{\varepsilon^*_0 n}{k}\Big)^{-1/2} k\log n = 8\,k^{5/4} n^{-1/4}\log n \ge \frac{8}{C_1}\log n,$$
where the last inequality follows since $\rho \ge \theta_2 = \frac{1}{5}$. By Proposition C.11, we conclude that
$$\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}^*_{\ell=2}) \le \varepsilon^*_0 + \varepsilon^*_1\delta^*_1 + \delta^*_2 + O(n^{-1}) = O\big((k/n)^{1/2}\log^{5/4} n\big).$$

C.4. Proof of Theorem 4.6

We now extend the analysis to the general-$\ell$ case. For each available arm, an $\ell$-Layer Sieve policy recursively explores its reward rate and derives a sequence of confidence intervals with the following widths.

Definition C.13 (Width of Confidence Interval). Given $\varepsilon_0, \ldots, \varepsilon_{\ell-1}$, for each $i = 1, 2, \ldots, \ell$ we define
$$\delta_i = \delta_i(\varepsilon_0, \ldots, \varepsilon_{i-1}) = 3\cdot 15^{\,i-1}\, C_2^{1/2}\;\varepsilon_0^{-2^{-i}}\varepsilon_1^{-2^{-(i-1)}}\cdots\varepsilon_{i-1}^{-2^{-1}}\Big(\frac{k}{n}\Big)^{1-2^{-i}}\log^{\frac{3}{2}-2^{-i}} n.$$
It is straightforward to verify that the above $\delta_i$'s satisfy the recursion $\delta_i \asymp \big(\delta_{i-1}k/(\varepsilon_{i-1}n)\big)^{1/2}\log^{3/4} n$ for $i \ge 2$.

As in the analysis for $\ell = 2$ in Section C.3, we will show that the loss on the $j$-th layer is bounded as follows.

Lemma C.14 (Regret on the $j$-th Layer). Fix integers $t, j$ and suppose the events $G^{\delta_1}_{t-1}, \ldots, G^{\delta_j}_{t-j}$ and $\mathcal{C}_t$ occur. Then $\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j}) \le \delta_j$ and $|S^j_{t-j}| \le 6C_2\delta_j k\log^{1/2} k$.

Proof. We show this by induction on $j$. For the base case $j = 1$, note that by Lemma C.5 we have $\mu_{\max}(A_{t-1}) - \mu_{\min}(S^1_{t-1}) \le \delta_1$. Moreover, we showed in the first paragraph of the proof of Lemma C.10 that $|S^1_{t-1}| \le 6C_2\delta_1 k\log^{1/2} k$, and thus the claim holds for $j = 1$.

Now consider $j \ge 2$. As the induction hypothesis, assume the claim holds for $1, \ldots, j-1$. Then $|S^{j-1}_{t-(j-1)}| \le 6C_2\delta_{j-1}k\log^{1/2} k$. Consequently, each arm in $S^{j-1}_{t-(j-1)}$ is selected at least
$$m_{j-1} := \frac{\varepsilon_{j-1} n}{|S^{j-1}_{t-(j-1)}|} \ge \frac{\varepsilon_{j-1} n}{6C_2\delta_{j-1}k\log^{1/2} k} \qquad (13)$$
times. Since the clean event $\mathcal{C}_t$ occurs, the empirical mean of each arm from $S^{j-1}_{t-(j-1)}$ deviates from its mean by at most $3\,m_{j-1}^{-1/2}\log^{1/2} m_{j-1}$. Thus, by the $3\varepsilon$-inequality, if an arm $a \in A_{t-j}$ survives the $j$-th phase, then
$$\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j}) \le 9\,m_{j-1}^{-1/2}\log^{1/2} m_{j-1}. \qquad (14)$$
Since the good event $G^{\delta_j}_{t-j}$ occurs, we have $|S^j_{t-j}| \le 6C_2\big(9\,m_{j-1}^{-1/2}\log^{1/2} m_{j-1}\big)k\log^{1/2} k$. To simplify, we expand $m_{j-1}$ using (13) and the closed form of $\delta_{j-1}$ in Definition C.13, which yields
$$9\,m_{j-1}^{-1/2}\log^{1/2} m_{j-1} \le 9\Big(\frac{6C_2\delta_{j-1}k\log^{1/2} k}{\varepsilon_{j-1} n}\Big)^{1/2}\log^{1/2} n \le \delta_j, \qquad (15)$$
where the last inequality relies on $\delta_{j-1}k > 1$. Hence $|S^j_{t-j}| \le 6C_2\delta_j k\log^{1/2} k$, and by (14) and (15), $\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j}) \le \delta_j$. This completes the induction.

We now have the following regret bound for the BSE algorithm with adaptivity $\ell$.

Proposition C.15 (Average Regret of BSE, Arbitrary Grid). Consider an arbitrary adaptivity level $\ell$ and grid $0 < \varepsilon_0 < \cdots < \varepsilon_{\ell-1} < 1$ such that $\delta_i k > 1$ for all $i \le \ell-1$. Then for any round $t$,
$$\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}_\ell(\varepsilon_0, \ldots, \varepsilon_{\ell-1})) = O\big(\varepsilon_0 + \varepsilon_1\delta_1 + \varepsilon_2\delta_2 + \cdots + \varepsilon_{\ell-1}\delta_{\ell-1} + \delta_\ell + n^{-\rho}\big).$$

Proof. By definition of internal regret,
$$\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}_\ell(\varepsilon_0, \ldots, \varepsilon_{\ell-1})) \le \sum_{j=0}^{\ell-1}\varepsilon_j\,\mathbb{E}\big[\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j})\big] + \mathbb{E}\Big[\max_{j\in[w]}\mu_{\max}(A_{t-j}) - \mu_{\min}(S^\ell_{t-j})\Big].$$
Note that if the events $\mathcal{C}_t$ and $G^{\delta_j}_{t-j}$, $j = 0, \ldots, \ell-1$, all occur, then by Lemma C.14 we have $\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j}) \le \delta_j$ for $j = 0, \ldots, \ell-1$, and $\max_{\ell\le j\le w}\mu_{\max}(A_{t-j}) - \mu_{\min}(S^\ell_{t-j}) \le \delta_\ell$. Further, by Lemma C.8, for each $j$ we have $\mathbb{P}[\bar{G}^{\delta_j}_{t-j}] \le k^{-2}$ provided $\delta_j k > 1$ for all $j = 0, \ldots, \ell-1$, so
$$\mathbb{E}\big[\mu_{\max}(A_{t-j}) - \mu_{\min}(S^j_{t-j})\big] \le \delta_j + n^{-2+\rho}\,w.$$
Similarly,
$$\mathbb{E}\Big[\max_{\ell\le j\le w}\mu_{\max}(A_{t-j}) - \mu_{\min}(S^\ell_{t-j})\Big] \le \delta_\ell + k^{-2}w + n^{-2+\rho}\,w.$$
When $\rho \le 1$, the above is bounded by $\delta_\ell + n^{-2\rho}w + o(1/n)$. Combining, we conclude that
$$\mathrm{Reg}^{\mathrm{avg}}_n(\mathrm{BSE}_\ell(\varepsilon_0, \ldots, \varepsilon_{\ell-1})) \le \varepsilon_0 + \sum_{j=1}^{\ell-1}\varepsilon_j\delta_j + \delta_\ell + n^{-3} + n^{-2\rho}w + o\Big(\frac{1}{n}\Big) = O\Big(\varepsilon_0 + \sum_{j=1}^{\ell-1}\varepsilon_j\delta_j + \delta_\ell + n^{-\rho}\Big).$$

D. Proof of Proposition 4.9

To derive the upper bound for SLHVB, we need to compare the asymptotic orders of the two terms in (1). Consider the following three cases for $\rho$.

Case 1. Suppose $\rho \le \frac{1}{5}$. Consider $\ell = 1$. By Proposition 4.9, we have $\mathrm{Loss}_n(\mathrm{BSE}^*_1, k) = O\big(n^{-\rho} + n^{(\rho-1)\frac{1}{3}}\big)$. Since $\rho \le \frac{1}{5}$, it holds that $n^{-\rho} \ge n^{(\rho-1)\frac{1}{3}}$, so the above becomes $O(n^{-\rho})$.

Case 2. Suppose $\frac{1}{5} < \rho \le \frac{w}{2w+2}$. The claimed $\ell$ in Step 2 exists¹¹ due to the following elementary fact: for any integer $w \ge 2$,
$$\Big(\frac{1}{5},\; \frac{w}{2w+2}\Big] \subseteq \bigcup_{\ell=2}^{w}\Big(\theta_\ell,\; \frac{\ell}{2\ell+2}\Big].$$
It then follows by Proposition 4.9 that $\mathrm{Loss}_n(\mathrm{BSE}^*(\ell, k)) = O\big(n^{-\rho} + n^{(\rho-1)\frac{\ell}{\ell+2}}\big)$. Further, note that when $\rho \le \frac{\ell}{2\ell+2}$, we have $n^{-\rho} \ge n^{(\rho-1)\frac{\ell}{\ell+2}}$, so the above becomes $O(n^{-\rho})$.

Case 3. Suppose $\rho \ge \frac{w}{2w+2}$. Note that the threshold exponent $\theta_\ell$ increases in $\ell$; in particular, for any $\ell \le w$ we have $\theta_\ell \le \theta_w \le \rho$. It then follows from Proposition 4.9 that
$$\mathrm{Loss}_n(\mathrm{BSE}^*(w, k)) = O\big(n^{-\rho} + n^{(\rho-1)\frac{w}{w+2}}\big). \qquad (18)$$
Although this is already sublinear in $n$, we observe that the two terms in (18) are, in general, not equal, which suggests a potential improvement. Consider the resampling size $k' = n^{\rho'}$ that renders the two terms equal, that is, $\rho' = \frac{w}{2w+2} \le \rho$. Then, by Proposition 4.8 we obtain
$$\mathrm{Loss}_n(\mathrm{BSE}^*(w, k')) = O\big(n^{-\rho} + \rho'^{-1}n^{-\rho'} + n^{(\rho'-1)\frac{w}{w+2}}\big) \le O\big(n^{-\frac{w}{2w+2}} + \rho'^{-1}n^{-\frac{w}{2w+2}} + n^{-\frac{w}{2w+2}}\big) = O\big(\rho'^{-1}n^{-\frac{w}{2w+2}}\big),$$
where the inequality follows since $\rho' \le \rho$. We summarize the above discussion in the following theorem.

¹¹Alternatively, one can show this constructively: $\ell' = \ell'(\rho) = \big\lfloor\frac{\rho+1}{1-2\rho}\big\rfloor$ satisfies $\theta_{\ell'} \le \rho \le \frac{\ell'}{2\ell'+2}$. However, the proof, which is essentially arithmetic manipulation, is slightly tedious, so we choose not to specify this explicit form.

E. Field Experiment: Implementation Details

In this section we provide more details about the implementation of the field experiment.

E.1. Randomized BSE: A Thompson Sampling Variant

We implement a variant of the semi-adaptive policy for SLHVB induced by the BSE policy (Algorithm 1) with $\ell = 1$. We modify the policy because of the following practical concerns.

The first issue is the lack of knowledge of $n$. In our implementation, each round is set to be an hour, so $n$ corresponds to the number of impressions per hour. In practice, however, $n$ is unknown. This can easily be fixed via randomization: for each card slot, we assign a newly arriving card (for exploration) with probability $\varepsilon_0$ and an old card (for exploitation) otherwise.

Algorithm 4 SetPrior(g)
1: Input: a card $g$ and $m$ users
2: Output: $\hat\alpha, \hat\beta$
3: Randomly sample $m$ users $u_1, \ldots, u_m$
4: for $i = 1, \ldots, m$ do
5:   $\mu(u_i, g) \leftarrow$ DNN-predicted reward on $(u_i, g)$
6: end for
7: Compute the sample mean $\bar\mu \leftarrow \frac{1}{m}\sum_{i=1}^m \mu(u_i, g)$ and variance $\bar v \leftarrow \frac{1}{m-1}\sum_{i=1}^m (\mu(u_i, g) - \bar\mu)^2$
8: Return $\hat\alpha \leftarrow \bar\mu\big(\frac{\bar\mu(1-\bar\mu)}{\bar v} - 1\big)$ and $\hat\beta \leftarrow (1-\bar\mu)\big(\frac{\bar\mu(1-\bar\mu)}{\bar v} - 1\big)$

The second issue is the prior information for the reward rates of the newly arriving cards. Although the DNN predictions are sometimes inaccurate, they do provide useful information. To utilize this information, we fit a Beta prior distribution for each card's reward rate using the method of moments (see, e.g., (Wasserman, 2006)) based on the DNN predictions (see Algorithm 4). Specifically, recall that the DNN returns a predicted conversion rate for each pair of user and card. For a fixed card, denote by $\bar\mu, \bar v$ the mean and variance of the predicted conversion rates on $m = 500$ randomly sampled users. The fitted Beta prior $B(\hat\alpha, \hat\beta)$ is then given by
$$\hat\alpha = \bar\mu\Big(\frac{\bar\mu(1-\bar\mu)}{\bar v} - 1\Big) \quad\text{and}\quad \hat\beta = (1-\bar\mu)\Big(\frac{\bar\mu(1-\bar\mu)}{\bar v} - 1\Big).$$
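The method-of-moments fit in Algorithm 4 can be sketched in a few lines of Python; this is our own illustrative sketch (the function `predict_ctr` standing in for the DNN is a hypothetical name), not the production code.

```python
import numpy as np

def set_prior(predict_ctr, card, users, m=500, rng=None):
    """Fit a Beta(alpha, beta) prior for a card's reward rate by the method of moments,
    using DNN-predicted conversion rates on m randomly sampled users (cf. Algorithm 4)."""
    rng = rng or np.random.default_rng()
    sample = rng.choice(users, size=m, replace=False)
    preds = np.array([predict_ctr(u, card) for u in sample])  # predicted rates in (0, 1)

    mean, var = preds.mean(), preds.var(ddof=1)
    if var <= 0 or var >= mean * (1 - mean):
        return 1.0, 1.0                      # flat-prior fallback (our own safeguard)
    common = mean * (1 - mean) / var - 1     # method-of-moments factor
    return mean * common, (1 - mean) * common
```

The fit is only valid when the sample variance is smaller than $\bar\mu(1-\bar\mu)$; the fallback to a flat Beta(1, 1) prior above is our own safeguard, not part of the paper's procedure.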
The final issue is the diversity of the recommended content. Recall that the basic version of the BSE policy selects the empirically best arm to exploit. However, this is not reasonable in a practical setting, where such an extreme promotion of a single card is undesirable. We therefore consider a Thompson Sampling version of the Sieve policy under a Beta-Bernoulli reward model; see Algorithm 5. Specifically, for each slot we draw a score for each card from its posterior, and then assign to this slot the card with the highest score. Note that the posterior can be efficiently updated using Bayes' rule, since the Beta distribution is a conjugate prior for the Bernoulli distribution.

Algorithm 5 Randomized One-Layer Sieve Policy
1: Input: $\varepsilon \in [0,1]$, $\theta \ge 0$
2: for each hour $t = 1, 2, \ldots$ do
3:   Receive a set $A_{\mathrm{new}}$ of new cards
4:   Update the set $A$ of available cards
5:   for each card $g \in A$ do
6:     if $g \in A_{\mathrm{new}}$ then
7:       $(\alpha_g, \beta_g) \leftarrow$ SetPrior$(g)$
8:     else
9:       $n_g \leftarrow$ number of interactions of $g$ in the last hour
10:      $h_g \leftarrow$ number of conversions of $g$ in the last hour
11:      $\alpha_g \leftarrow \alpha_g + h_g$, $\beta_g \leftarrow \beta_g + n_g - h_g$
12:    end if
13:  end for
14:  $A_{\mathrm{well}} \leftarrow \{g : \alpha_g + \beta_g > \theta\}$
15:  for each user $u$ do
16:    Receive the number $r_u$ of cards requested by $u$
17:    $A_u \leftarrow$ cards that have never been assigned to $u$
18:    for each card $g \in A_u$ do
19:      Draw $X_{u,g} \sim \mathrm{Beta}(\alpha_g, \beta_g)$
20:    end for
21:    Sort $A_u \cap A_{\mathrm{well}}$ by $X_{u,g}$ in non-increasing order as $g_1, g_2, \ldots$
22:    Sort $A_u \setminus A_{\mathrm{well}}$ by $X_{u,g}$ in non-increasing order as $g'_1, g'_2, \ldots$
23:    $i, i' \leftarrow 1$
24:    for $j = 1, \ldots, r_u$ do
25:      $Z_j \sim \mathrm{Ber}(\varepsilon)$
26:      if $Z_j = 1$ then
27:        $S_j \leftarrow g'_{i'}$
28:        $i' \leftarrow i' + 1$
29:      else
30:        $S_j \leftarrow g_i$
31:        $i \leftarrow i + 1$
32:      end if
33:    end for
34:    Send to $u$ the cards $\{S_j : j = 1, \ldots, r_u\}$
35:  end for
36: end for

E.2. Implementation Details

The firm has maintained a partition of the users into several hundred buckets for various online experiments. This partition is randomly re-generated every six months. We chose three buckets as the treatment group, involving over 600,000 users and accounting for around 1% of the total traffic. We implemented the randomized BSE policy with $\ell = 1$ on their production system in the first 14 days of July 2021. For comparison, we also analyzed the interaction data from the first 14 days of May of the same year.

Using an offline semi-synthetic simulation, we determined the empirically optimal parameter to be around $\varepsilon_0 = 0.2$, which we used in the field experiment. This choice is also consistent with our theoretical analysis. In fact, as we recall from Proposition C.6, the optimal parameter is $\varepsilon^*_{0,1} \approx (k/n)^{1/3}$. In our scenario, we observed from past data that there were on average around 11 million impressions per 14 days, so the number $n$ of impressions per hour is about $3.2 \times 10^4$. Moreover, on average $k = 150$ cards are released per hour, and thus $\varepsilon^*_{0,1} \approx 0.14$.
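To complement the pseudocode, the following is a compact sketch (ours, with hypothetical data structures, not the deployed system) of the two core steps of Algorithm 5: the hourly Beta-Bernoulli posterior update and the per-slot $\varepsilon$-randomized assignment between well-estimated and under-explored cards.

```python
import numpy as np

def update_posteriors(posteriors, hourly_stats):
    """Beta-Bernoulli conjugate update: posteriors maps card -> [alpha, beta],
    hourly_stats maps card -> (impressions, conversions) from the last hour."""
    for card, (n_g, h_g) in hourly_stats.items():
        posteriors[card][0] += h_g            # alpha += conversions
        posteriors[card][1] += n_g - h_g      # beta  += non-conversions

def assign_cards(posteriors, eligible, r_u, eps, theta, rng):
    """Fill r_u slots for one user: with prob. eps take the next under-explored card
    (alpha + beta <= theta), otherwise the next well-estimated card, ranked by a
    Thompson sample from each card's Beta posterior (cf. Algorithm 5)."""
    scores = {g: rng.beta(*posteriors[g]) for g in eligible}
    well = sorted((g for g in eligible if sum(posteriors[g]) > theta),
                  key=scores.get, reverse=True)
    fresh = sorted((g for g in eligible if sum(posteriors[g]) <= theta),
                   key=scores.get, reverse=True)
    slots = []
    for _ in range(r_u):
        pool = fresh if (rng.random() < eps and fresh) else well
        if not pool:                          # fall back if one pool is exhausted (our addition)
            pool = well or fresh
        if not pool:
            break
        slots.append(pool.pop(0))
    return slots
```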
F. Field Experiment: Analysis

We now present a detailed statistical analysis of the field experiment, including a bootstrap hypothesis test and a Difference-in-Differences (DID) analysis. Our analysis shows that our policy outperforms the DNN-based recommender by about 4% in total duration and nearly 7% in the total number of click-throughs per user per day.

We first explain how we eliminate outliers. An outlier is typically introduced in one of two ways. On the one hand, users may accidentally swipe through two cards in a row without even looking at the first one; we thus remove any impression with duration less than 0.2 seconds. On the other hand, users may sometimes leave their devices unattended for minutes, generating an abnormally high duration. Since most cards' content can be fully consumed within 300 seconds, we remove any impression with duration over $u = 300$ seconds.

Table 1. Types of data in the analysis
                      CT          Duration
  Per User-Day        integral    numeric
  Per Impression      binary      numeric

Table 2. Overall statistics
                                        May                       July
                                   NN         MAB           NN         MAB
  Per-User-Per-Day
    Duration    Mean            175.910     175.548       137.059     142.618
                SE of Mean        0.699       0.659         0.6081      0.597
                Median           44.250      44.279        32.973      34.430
    #CT         Mean              1.275       1.273         0.941       1.010
                SE of Mean     9.251e-03   8.814e-03     7.276e-03   7.549e-03
  Per Impression
    Duration    Mean              3.9697      4.0195        4.1183      4.2391
                SE of Mean     4.529e-03   4.402e-03     5.738e-03   5.599e-03
                Median            0.693       0.697         0.702       0.703
    CTR         Mean           2.887e-02   2.915e-02     2.827e-02   3.001e-02
                SE of Mean     4.698e-05   4.568e-05     5.804e-05   5.671e-05

F.1. Analysis of 2 × 2 = 4 Metrics

The firm is interested in the analysis at two levels, per-user-per-day and per-impression, and in two measures of user engagement: duration and click-throughs (CT). Altogether we have four metrics, as shown in Table 1.

In the per-impression analysis, we treat each impression as an independent sample. For example, the row for per-impression duration in Table 2 is the ratio between the total duration and the total number of impressions. However, the firm's objective is the total user engagement rather than the per-impression engagement, which motivated our analysis at the per-user-per-day level. At first sight, it seems reasonable to consider the total engagement of a fixed user over all 14 days of the experiment. However, this metric is flawed, since the frequency at which users enter the app is affected by many external factors, such as holidays and weekends, which introduces extra noise. We therefore only consider the days on which a user entered the app (i.e., had at least one impression). Formally, for each user $u$ and day $d$ on which this user has at least one impression, we define a tuple $(u, d, D_{ud})$, where $D_{ud}$ is the total duration. Thus, the number of tuples associated with each user is between 1 and 14.

Under this definition of total engagement, we summarize the experimental results in Table 2 (the unit of duration in all tables is seconds) and visualize the per-user-per-day user engagement in Figure 4 and Figure 5. We observe that in May the user engagement of the two groups is approximately identical, but in July the MAB group has a significantly higher mean user engagement. Moreover, this improvement also appears in the median duration, indicating that it is unlikely to be caused by a heavier tail in the distribution.

It is also worth noting that the per-user-per-day engagement decreased from May to July. This is because May 2021 was when the Covid-19 pandemic reached its peak in the country where most users were located. During the lockdown, users may have had more time to spend on the app, resulting in higher total engagement. In Section F.2, we perform a DID analysis that incorporates this underlying change of environment across time.

Finally, we emphasize that in our implementation, our policy is not personalized. Despite this disadvantage, our Sieve policy still outperforms the firm's personalized DNN recommender in all user-engagement metrics, at both the per-impression and per-user levels; see Table 2.
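As an illustration of the data preparation described above, the per-user-per-day tuples $(u, d, D_{ud})$ can be constructed as in the following pandas sketch; the column names are hypothetical and this is not the pipeline actually used.

```python
import pandas as pd

def per_user_day_engagement(impressions: pd.DataFrame) -> pd.DataFrame:
    """impressions: one row per impression with (hypothetical) columns
    'user_id', 'date', 'duration_sec', 'clicked' (0/1), 'group' ('NN' or 'MAB')."""
    # Outlier removal: drop accidental swipes (< 0.2 s) and unattended sessions (> 300 s).
    clean = impressions[(impressions.duration_sec >= 0.2) & (impressions.duration_sec <= 300)]

    # One tuple (u, d, D_ud, C_ud) per user-day with at least one impression.
    return (clean.groupby(['group', 'user_id', 'date'])
                 .agg(duration=('duration_sec', 'sum'), ct=('clicked', 'sum'))
                 .reset_index())
```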
F.2. Significance Tests

From Figure 4 and Figure 5 we observe that our policy outperforms the control policy. We now test whether this improvement is statistically significant.

Figure 4. Number of click-throughs per user per day.
Figure 5. Duration per user per day.

Table 3. Significance testing
                             Basic                     Bootstrap
                       Z-score    p-value         Z-score    p-value
  Per-User-Per-Day
    Duration            4.610     2.018e-06        4.6197    1.921e-06
    CT                  4.259     1.027e-05        4.2556    1.042e-05
  Per Impression
    Duration            6.963     1.665e-12        6.972     1.556e-12
    CT                 12.999     6.127e-39       12.933     1.469e-38

For each month $m \in \{\mathrm{May}, \mathrm{July}\}$, denote by $X^m, Y^m$ the user engagement (either duration or number of click-throughs) in the NN and MAB groups, respectively. Similarly, denote by $\bar X^m, \bar Y^m$ the sample means of $X^m, Y^m$ in month $m$. We are interested in the difference-in-differences in user engagement before and after our policy was deployed, i.e.,
$$\Delta = (\bar Y^{\mathrm{July}} - \bar X^{\mathrm{July}}) - (\bar Y^{\mathrm{May}} - \bar X^{\mathrm{May}}).$$
Consider the hypotheses $H_0: \mathbb{E}[\Delta] \le 0$ vs. $H_1: \mathbb{E}[\Delta] > 0$. First consider the basic Z-score
$$Z = \frac{(\bar Y^{\mathrm{July}} - \bar X^{\mathrm{July}}) - (\bar Y^{\mathrm{May}} - \bar X^{\mathrm{May}})}{\hat S},$$
where $\hat S$ is the estimated standard deviation of $(\bar Y^{\mathrm{July}} - \bar X^{\mathrm{July}}) - (\bar Y^{\mathrm{May}} - \bar X^{\mathrm{May}})$. For $W \in \{X^{\mathrm{May}}, X^{\mathrm{July}}, Y^{\mathrm{May}}, Y^{\mathrm{July}}\}$, denote by $N_W$ the number of i.i.d. samples in pool $W$, and let $S^2_W$ be its sample variance. Assuming the samples are independent, we may approximate
$$\hat S \approx \sqrt{\frac{S^2_{X^{\mathrm{May}}}}{N_{X^{\mathrm{May}}}} + \frac{S^2_{X^{\mathrm{July}}}}{N_{X^{\mathrm{July}}}} + \frac{S^2_{Y^{\mathrm{May}}}}{N_{Y^{\mathrm{May}}}} + \frac{S^2_{Y^{\mathrm{July}}}}{N_{Y^{\mathrm{July}}}}}.$$
As shown in the Basic column of Table 3, the p-values for each of the four metrics are very low. We therefore reject the null hypothesis $H_0$ and conclude that the treatment effect is statistically significant.

However, in reality the samples are not independent, since (1) each user may appear in both months, (2) each user has multiple data points within a month, and (3) the same set of cards is shown to both the treatment and control groups. We remove the dependence by bootstrapping as follows. From each of the four pools of data points, we randomly draw $10^6$ samples with replacement and redefine each $\bar W$ as the bootstrap sample mean, where $W \in \{X^{\mathrm{May}}, X^{\mathrm{July}}, Y^{\mathrm{May}}, Y^{\mathrm{July}}\}$; see the Bootstrap column in Table 3. We still observe very low p-values, which further validates our conclusion.
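The basic Z-statistic and the bootstrap resampling step can be sketched as follows (our own illustration; the four pools are arrays of engagement values for the NN and MAB groups in May and July, and the Z-score uses the pooled-variance formula above).

```python
import numpy as np
from scipy.stats import norm

def did_z_test(x_may, x_july, y_may, y_july):
    """One-sided test of H0: E[Delta] <= 0, where
    Delta = (mean(Y_July) - mean(X_July)) - (mean(Y_May) - mean(X_May)).
    X = NN (control) engagement, Y = MAB (treatment) engagement."""
    delta = (y_july.mean() - x_july.mean()) - (y_may.mean() - x_may.mean())
    se = np.sqrt(sum(p.var(ddof=1) / len(p) for p in (x_may, x_july, y_may, y_july)))
    z = delta / se
    return z, norm.sf(z)                      # p-value = P[N(0,1) > z]

def bootstrap_means(pools, n_draws=10**6, rng=None):
    """Redefine each pool mean as the mean of n_draws samples drawn with replacement."""
    rng = rng or np.random.default_rng()
    return [rng.choice(p, size=n_draws, replace=True).mean() for p in pools]
```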
Figure 6. Illustration of DID regression for duration per impression.

F.3. DID Regression

We now consider a DID regression analysis. We first illustrate the idea of DID regression in Figure 6, using per-impression duration as an example. We first vectorize each tuple $(u, d, Y_{ud})$ into a four-dimensional vector $(t_{ud}, i_{ud}, t_{ud}i_{ud}, Y_{ud})$, where $t_{ud} = \mathbb{1}[\text{day } d \text{ is in July}]$ and $i_{ud} = \mathbb{1}[\text{user } u \text{ is in the MAB group}]$ are the time and intervention indicators, and $Y_{ud} \in \{C_{ud}, D_{ud}\}$ is the metric under consideration (i.e., the click-throughs or duration of user $u$ on day $d$). The basic assumption in a DID analysis is that the outcome follows the linear model
$$Y_{ud} = \beta_0 + \beta_1 t_{ud} + \beta_2 i_{ud} + \beta_3 t_{ud} i_{ud} + \varepsilon_{ud}, \qquad (20)$$
where $\varepsilon_{ud} \sim N(0, \sigma^2)$ with unknown variance $\sigma^2$. Intuitively, $\beta_1$ captures the overall trend over time and $\beta_2$ measures the effect of being assigned to the treatment group. Thus, if there were no treatment effect, the differences between the two groups would remain unchanged from May to July, and the means of the samples from the four pools (shown as the red dots in Figure 6) would form a perfect parallelogram. Now suppose there is indeed a positive treatment effect. Then the top-right corner of this quadrilateral is raised; see the green dot in Figure 6. This lift is measured by the coefficient $\beta_3$. To see this, by setting $t_{ud} = i_{ud} = 0$ we observe that $\beta_0$ is the mean engagement of control-group users in May. Further, note that the top-right corner of the parallelogram is $\beta_0 + \beta_1 + \beta_2$. In contrast, for a day $d$ in July and a user $u$ in the MAB group, i.e., $i_{ud} = t_{ud} = 1$, the mean outcome satisfies $\mathbb{E}[Y_{ud}] = \beta_0 + \beta_1 + \beta_2 + \beta_3$, which is higher than the top-right red dot by $\beta_3$.

Assuming Gaussian noise, we can compute confidence intervals and p-values for the coefficients $\beta_i$; see Table 4. For both duration and CT, the coefficients $\beta_3$ are positive and have very low p-values. Therefore, whether a user is assigned to the MAB group indeed has a significant effect. Moreover, note that $\beta_2$ has high p-values, so we conclude that the partition of users is sufficiently random, at least at the per-user-per-day level.

As shown in the second half of Table 4, we also consider per-impression user engagement. Similar to the above analysis, we convert each impression $j$ into a three-dimensional vector $(t_j, i_j, Y_j)$, where $Y_j$ is either the duration or the click-through indicator of impression $j$. The duration per impression is numerical, so we can analyze it using the linear regression (20). In contrast, the per-impression click-throughs $Y_j$ are binary, so we instead apply logistic regression: we assume $Y_j \sim \mathrm{Ber}\big(\frac{e^z}{1+e^z}\big)$, where $z = \beta_0 + \beta_1 t_j + \beta_2 i_j + \beta_3 t_j i_j$.

Table 4. Difference-in-Differences regression
                       Coef.      Std. Dev.        t         p-value       0.025Q      0.975Q
  Per User-Day
   Duration   β0     175.9103      0.640       274.941      0.000         174.656     177.164
              β1     -38.8514      0.942       -41.263      0.000         -40.697     -37.006
              β2      -0.3622      0.887        -0.409      0.683          -2.100       1.375
              β3       5.9208      1.303         4.544      2.759e-06       3.367       8.475
   #CT        β0       1.2750      0.008       153.851      0.000           1.259       1.291
              β1      -0.3341      0.012       -27.394      1.616e-165     -0.358      -0.310
              β2      -0.0016      0.011        -0.141      0.888          -0.024       0.021
              β3       0.0704      0.017         4.171      1.516e-05       0.037       0.103
  Per Impression
   Duration   β0       3.9697      0.005       863.796      0.000           3.961       3.979
              β1       0.1486      0.007        20.234      2.753e-89       0.134       0.163
              β2       0.0497      0.006         7.781      3.597e-15       0.037       0.062
              β3       0.0711      0.010         6.998      1.298e-12       0.051       0.091
   CT         β0      -3.5198      0.002     -2092.794      0.000          -3.523      -3.517
              β1      -0.0161      0.003        -5.947      1.365e-09      -0.021      -0.011
              β2       0.0133      0.002         5.712      5.582e-09       0.009       0.018
              β3       0.0474      0.004        12.819      6.417e-38       0.040       0.055
Note: All regressions are linear except for per-impression CT, where we applied logistic regression due to the binary labels.

As opposed to the per-user-per-day regression, in this case all coefficients have tiny p-values for both CT and duration. In particular, the coefficient $\beta_2$ for the intervention has a low p-value, indicating that the initial user partition may not be truly random in terms of per-impression engagement. Nonetheless, this difference is interpretable: our experiment was performed on random user groups that Glance had been using for months prior to our field test, on which some previous experiments had been performed, causing this discrepancy in user behavior.

We next quantify the improvement. For the per-user-per-day metrics, we observe that the total duration and the number of click-throughs improved by $\beta_3/(\beta_0 + \beta_1) \approx 4.319\%$ and $7.482\%$, respectively. For the per-impression metrics, the duration improved by $1.726\%$. Finally, note that the $\beta$'s for CTR are based on logistic regression; the improvement in the odds ratio is $e^{\beta_3} - 1 \approx 4.854\%$.
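The two DID regressions in Table 4 can be reproduced in spirit with statsmodels, as in the sketch below; the data-frame column names are hypothetical and this is not the original analysis script.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def did_regressions(df: pd.DataFrame):
    """df has one row per observation with (hypothetical) columns:
    't' = 1{July}, 'i' = 1{MAB group}, 'duration' (numeric), 'ct' (binary, per impression).
    Returns the fitted models and lift estimates analogous to Section F.3."""
    # Linear DID model (20): Y = b0 + b1*t + b2*i + b3*t*i + eps.
    ols_fit = smf.ols('duration ~ t + i + t:i', data=df).fit()

    # Logistic DID model for binary click-throughs: logit P[Y=1] = b0 + b1*t + b2*i + b3*t*i.
    logit_fit = smf.logit('ct ~ t + i + t:i', data=df).fit(disp=0)

    b = ols_fit.params
    duration_lift = b['t:i'] / (b['Intercept'] + b['t'])      # beta3 / (beta0 + beta1)
    odds_ratio_lift = np.exp(logit_fit.params['t:i']) - 1     # e^beta3 - 1
    return ols_fit, logit_fit, duration_lift, odds_ratio_lift
```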