# Minimax Regret for Cascading Bandits

Daniel Vial (UT Austin & UIUC) dvial@utexas.edu
Sujay Sanghavi (UT Austin & Amazon) sanghavi@mail.utexas.edu
Sanjay Shakkottai (UT Austin) sanjay.shakkottai@utexas.edu
R. Srikant (UIUC) rsrikant@illinois.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract. Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where variance-aware algorithms only improve constants. In light of this, and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show that its regret strictly improves the state of the art.

## 1 Introduction

The cascading click model describes users interacting with ranked lists, such as search results or online advertisements [Craswell et al., 2008]. In this model, there is a set of items [L] = {1, . . . , L}. The user is given a list of K items A = (a_1, . . . , a_K), sequentially examines the list, and clicks on the first attractive item (if any). If a click occurs, the user leaves without examining the subsequent items. It is assumed that the e-th item has an attraction probability w(e) (which we also call the mean reward), and that the random (Bernoulli) clicks are conditionally independent given A.

Cascading bandits, introduced concurrently by Kveton et al. [2015a] and Combes et al. [2015], are a sequential learning version of the model where the mean rewards {w(e)}_{e=1}^L are initially unknown. At each round t ∈ [n], the learner chooses an action, which is a list of items A_t = (a^t_1, . . . , a^t_K). As in the basic click model, the user scans the list and clicks on the first attractive item (if any). Thus, if the C_t-th item is clicked, the learner knows that the user was not attracted to {a^t_k}_{k=1}^{C_t−1} but was attracted to a^t_{C_t}. However, the learner receives no feedback on the items {a^t_k}_{k=C_t+1}^{K} that the user did not examine before leaving. The objective for the learner is to choose the sequence {A_t}_{t=1}^n to maximize the expected number of clicks, or equivalently, to minimize the regret defined in (1).

This work provides problem-independent (i.e., gap-free) regret bounds for cascading bandits that strictly improve the state of the art. In the case of unstructured rewards, our results provide the first minimax-optimal regret bounds (up to log terms). Our key insight is that, compared to the standard bandit problem, the reward variance plays an outsized role in the gap-free analysis. In particular, we show that for cascading bandits, the worst-case problem instances are those with low mean rewards,
namely, w(e) ≲ 1/K, where the variance w(e)(1 − w(e)) is also small. We emphasize that these worst-case instances are not pathological; rather, they model the low click-through rates that prevail in practice [Richardson et al., 2007]. Further, we argue that adapting to low variance is crucial to cope with the worst-case instances. Put differently, algorithms should be variance-aware, i.e., more exploitative when the variance is small. We provide the intuition behind this key insight in Section 1.2 and show how to formalize it with a proof sketch in Section 5.

Table 1: Problem-independent upper bounds and minimax lower bounds for cascading bandits with L total items, K recommended items, horizon n, and feature dimension d. The tabular columns assume unstructured mean rewards. The linear column assumes the reward is linear in unit-norm features and does not suppress any L dependencies in the Õ(·) notation. See Appendix A for other related work.

| Paper | Tabular upper bound | Tabular lower bound | Linear upper bound |
|---|---|---|---|
| Zong et al. [2016] | none | none | Õ(√(nd²K²)) |
| Wang and Chen [2017] | Õ(√(nLK)) | none | none |
| Lattimore et al. [2018] | Õ(√(nLK³)) | none | none |
| Li and Zhang [2018] | none | none | Õ(√(nd²K)) |
| Zhong et al. [2021] |  | Ω(√(nL/K)) | Õ(√(nd²K³ min{d, log L})) |
| Kveton et al. [2022] | Õ(√(nLK)) | none | none |
| Ours | Õ(√(nL)) (Thm 2) | Ω(√(nL)) (Thm 1) | Õ(√(nd(d + K))) (Thm 4) |

Table 1 notes:
- These papers contain minimax lower bounds, but for click models that are distinct from cascading bandits.
- These bounds assume L is large compared to K and n is large compared to L and K.
- This bound holds for a Bayesian notion of regret and includes a dependence on the prior not shown above.
- These bounds assume n is large compared to d and K to suppress some additive o(√n) terms.

More specifically, our first goal is to establish upper bounds on the problem-independent regret (i.e., the maximum over w) of cascading bandit algorithms, as well as minimax lower bounds (i.e., the infimum over algorithms of their problem-independent regret). As shown in Table 1, the tightest existing such bounds are Õ(√(nLK)) and Ω(√(nL/K)), respectively. What is surprising is that, not only is there a gap between the bounds, but they increase and decrease in K, respectively. In other words, the following fundamental question is unresolved: as K grows, does the problem become harder (as suggested by the upper bound) or easier (as suggested by the lower bound)?

As discussed by Zhong et al. [2021], this question lacks an obvious answer. On the one hand, larger K means that the learner needs to identify more good items, which hints at a harder problem. On the other hand, the learner receives more feedback as K grows, which intuitively makes the problem easier. As we show later, variance-aware algorithms are the key to resolving this tradeoff.

In addition to this basic version of the model (hereafter, the tabular case, where no structure is assumed for the mean rewards), we are also interested in the linear case, where w(e) = ⟨φ(e), θ⟩ for some known feature map φ : [L] → R^d and unknown parameter vector θ ∈ R^d. A second goal of this work is to apply our insights from the tabular case to the linear one, in hopes of improving the best known upper bound Õ(√(nd²K)) (see Table 1).

### 1.1 Main contributions

Tabular case. First, we show that no algorithm can achieve o(√(nL)) regret uniformly across w, assuming L is large compared to K and n is large compared to L (see Theorem 1). Next, we consider three algorithms: CascadeKL-UCB, CascadeUCB-V, and CascadeUCB1 (the first and third are due to Kveton et al. [2015a]; CascadeUCB-V is new).
All three rank the L items using upper confidence bounds (UCBs) and choose A_t as the K highest-ranked items. They differ in the choice of UCB. As the names suggest, CascadeKL-UCB and CascadeUCB-V use KL-UCB [Garivier and Cappé, 2011, Maillard et al., 2011, Cappé et al., 2013] and UCB-V [Audibert et al., 2009], respectively. Both are variance-aware, in the sense that their respective UCBs are derived from the Chernoff and Bernstein inequalities. We show that both algorithms have near-optimal regret Õ(√(nL)) for any w (see Theorem 2). In contrast, CascadeUCB1 relies on the Hoeffding-style UCB1 [Auer et al., 2002], and we show that it suffers suboptimal regret Ω(√(nLK)) on some w (see Theorem 3). In summary, we prove that (i) the minimax regret for cascading bandits is √(nL) up to log terms, (ii) the variance-aware algorithms CascadeKL-UCB and CascadeUCB-V are minimax-optimal up to log terms, and (iii) the variance-unaware algorithm CascadeUCB1 is decidedly suboptimal. Moreover, note from Table 1 that we strictly improve both the upper and lower bounds for this problem.

Discussion. There are two surprising aspects to these results. First, the minimax bound of order √(nL) shows that (in a worst-case sense) the number of recommended items K plays no role.[1] In other words, the aforementioned tradeoff (identifying more good items but receiving more feedback) is perfectly balanced when the correct algorithm (e.g., CascadeKL-UCB) is employed. The second (and arguably more surprising) aspect is that CascadeKL-UCB and CascadeUCB-V are optimal but CascadeUCB1 is not. This stands in contrast to the standard L-armed bandit problem, where the analogous algorithms KL-UCB, UCB-V, and UCB1 all achieve the minimax regret √(nL) (up to log terms), and the main advantage of the former two is only to improve the constants in the gap-dependent bounds. We discuss the intuition behind this contrast in Section 1.2.

[1] The Õ(·) notation hides log K terms, but they can be bounded by log L while retaining Õ(√(nL)) regret.

Linear case. Motivated by these findings, we also propose a variance-aware algorithm for the linear case called CascadeWOFUL.[2] CascadeWOFUL proceeds in two steps. First, we use the Hoeffding-style UCBs of Abbasi-Yadkori et al. [2011] to upper bound the mean rewards w(e), and thus the Bernoulli variances w(e)(1 − w(e)), with high probability. Second, we use these Hoeffding UCBs as proxies for the true variances in the WOFUL algorithm of Zhou et al. [2021], which computes a variance-weighted estimate of θ that enjoys Bernstein-style concentration. In Theorem 4, we show that the regret of CascadeWOFUL is Õ(√(nd(d + K))) for large n, which improves existing bounds by a factor of order at least √(min{d, K}) (see Table 1).

[2] WOFUL stands for weighted optimism in the face of uncertainty for linear bandits.

### 1.2 Why do variance-aware algorithms succeed but variance-unaware algorithms fail?

To answer this question, we focus on the tabular case and contrast the standard L-armed bandit setting with the cascading one. We first recall how the O(√(nL)) bound for L-armed bandits is derived. For simplicity, we restrict to instances with w(1) = p and w(2) = · · · = w(L) = p − Δ for some p ∈ (0, 1) and Δ ∈ (0, p). In this case, UCB1 plays each of the L − 1 suboptimal items 1/Δ² times, up to constants and log factors [Auer et al., 2002]. Each such play costs regret Δ, for a total regret that scales as L/Δ. Alternatively, regret can simply be bounded by nΔ, since the n plays incur at most Δ regret each. Combining the bounds gives min{L/Δ, nΔ}. The worst case occurs when the two bounds are equal, i.e., when Δ = √(L/n), which implies √(nL) regret.
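As a quick worked check of this balancing step (constants and log factors suppressed, using that a minimum is at most the geometric mean):

$$
\min\{L/\Delta,\; n\Delta\} \;\le\; \sqrt{(L/\Delta)\cdot n\Delta} \;=\; \sqrt{nL},
\qquad \text{with equality of the two terms at } \Delta = \sqrt{L/n}.
$$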
In contrast, UCB-V plays each suboptimal item σ²/Δ² times (in an order sense), where σ² ≤ p is the variance of the Bernoulli(p − Δ) reward [Audibert et al., 2009]. Therefore, the argument of the previous paragraph shows that regret grows as min{Lp/Δ, nΔ}. Here the worst case occurs when p is non-vanishing and Δ ≈ √(L/n), which gives the same regret scaling as UCB1. A similar argument holds for KL-UCB, because the number of plays grows as 1/d(p − Δ, p) [Cappé et al., 2013] and one can show d(p − Δ, p) = Θ(Δ²/p) (see Claim 1 in Appendix G).

The analysis is more complicated for cascading bandits, because regret is nonlinear in the mean rewards and the amount of feedback is random. To oversimplify things, we draw an analogy with the above and assume w(1) = · · · = w(K) = p and w(K + 1) = · · · = w(L) = p − Δ. In this case, CascadeUCB1 similarly plays the L − K suboptimal items 1/Δ² times each, which costs L/Δ regret when L ≫ K. However, we can no longer bound regret by nΔ, because the total number of plays depends on the random number of items the user examines at each round, which is roughly min{1/p, K} (the mean of a Geometric(p) random variable truncated at the maximum K). Thus, the nΔ bound inflates to nΔ min{1/p, K}, which gives min{L/Δ, nΔ min{1/p, K}} regret. For p ≈ 1/K and Δ ≈ √(L/(nK)), this yields the best known bound √(nLK). We emphasize that, unlike in the previous paragraph, the worst case here occurs when p (i.e., the click-through rate) is small.

On the other hand, the analogous bound for CascadeUCB-V and CascadeKL-UCB scales as min{Lp/Δ, nΔ min{1/p, K}}. Crucially, the factor of p in the first term (which arises due to the variance-aware nature of the algorithms) offsets the factor 1/p in the second term. Thus, in the hard case p ≈ 1/K, the bound becomes min{L/(ΔK), nΔK}. Here the worst case is Δ = √(L/(nK²)), which yields the minimax regret √(nL) that we establish in Theorems 1 and 2.

## 2 Preliminaries

In this section, we precisely formulate our problem. A cascading bandit instance is defined by the triple (L, K, w), where L ∈ ℕ is the total number of items, K ∈ [L] = {1, . . . , L} is the number of items the learner displays to the user at each round, and w ∈ [0, 1]^L is the vector of attraction probabilities.

We define a sequential game as follows. At each round t ∈ [n], the learner chooses an action A_t = (a^t_1, . . . , a^t_K), where a^t_k ∈ [L] and a^t_k ≠ a^t_{k'} when k ≠ k' (i.e., A_t is an ordered list of K distinct items). The user sequentially examines this list, clicks on the first item that attracts them, and stops examining items after clicking. Mathematically, we denote the first attractive item by C_t = inf{k ∈ [K] : w_t(a^t_k) = 1}, where w_t(e) ∼ Bernoulli(w(e)) for each e ∈ [L]. If no item is clicked, i.e., if w_t(a^t_k) = 0 for all k, we set C_t = ∞. Note that the learner only observes the realizations corresponding to the items that the user examined, i.e., {w_t(a^t_k) : k ∈ [min{C_t, K}]}. We denote by H_t = ∪_{s=1}^{t−1} ({A_s} ∪ {w_s(a^s_k) : k ∈ [min{C_s, K}]}) the history of actions and observations before time t.[3] We let P_t(·) = P(· | H_t ∪ {A_t}) and E_t[·] = E[· | H_t ∪ {A_t}] denote conditional probability and expectation given the history and the current action. In the cascading bandit model, it is assumed that the feedback {w_t(a^t_k)}_{k=1}^K (i.e., the presence or absence of clicks) is conditionally independent given H_t and A_t.

[3] Our notation mostly follows Kveton et al. [2015a], but we clarify that their definition of H_t includes A_t.
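To make the feedback structure concrete, here is a minimal simulation of one round of the cascade model, written as a sketch under the assumptions above (the function name and the use of `math.inf` for "no click" are our own illustrative choices, not the paper's notation):

```python
import math
import random

def cascade_round(w, A):
    """Simulate one round: the user scans the list A and clicks the first attractive item.

    w: sequence mapping item index -> attraction probability w(e)
    A: ordered list of K distinct items (the action A_t)
    Returns (C_t, feedback), where C_t is the 1-based click position (math.inf if no click)
    and feedback contains only the Bernoulli realizations the learner actually observes,
    i.e., those of the first min{C_t, K} items."""
    C_t = math.inf
    feedback = []
    for k, item in enumerate(A, start=1):
        click = 1 if random.random() < w[item] else 0
        feedback.append((item, click))
        if click == 1:
            C_t = k
            break  # the user leaves after the first click; later items are never observed
    return C_t, feedback

# Example: L = 5 items, recommend K = 3 of them.
w = [0.05, 0.2, 0.1, 0.02, 0.3]
print(cascade_round(w, [4, 1, 2]))
```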
Under the conditional-independence assumption above, the click probability given the history and the current action is

$$
P_t(C_t < \infty) = 1 - P_t(C_t = \infty) = 1 - P_t\Big(\bigcap_{k=1}^{K} \{w_t(a^t_k) = 0\}\Big) = 1 - \prod_{k=1}^{K} \big(1 - w(a^t_k)\big).
$$

Mappings from H_t to A_t are called policies. Let Π be the set of all policies. Given an instance w ∈ [0, 1]^L, a policy π ∈ Π, and a horizon n ∈ ℕ, the expected number of clicks is

$$
\mathbb{E}_{\pi,w}\Big[ \sum_{t=1}^{n} \mathbb{1}(C_t < \infty) \Big],
$$

where 1(·) is the indicator function. Let A* = (a*_1, . . . , a*_K) be any action A = (a_1, . . . , a_K) that maximizes the click probability 1 − ∏_{k=1}^K (1 − w(a_k)). Our goal is to minimize regret, which is the expected difference in the number of clicks between π and the policy that always plays A*, i.e.,

$$
R_{\pi,w}(n) = \sum_{t=1}^{n} \mathbb{E}_{\pi,w}\Big[ \prod_{k=1}^{K}\big(1 - w(a^t_k)\big) - \prod_{k=1}^{K}\big(1 - w(a^*_k)\big) \Big]. \qquad (1)
$$

## 3 Results for the tabular case

We can now state our tabular results (the proofs are discussed in Section 5). First, we have a minimax lower bound showing that no algorithm can achieve o(√(nL)) regret uniformly over the mean rewards w.

Theorem 1. Suppose N := L/K ∈ {4, 5, . . .} and n ≥ L. Then for any policy π ∈ Π, there exists a mean reward vector w ∈ [0, 1]^L such that R_{π,w}(n) = Ω(√(nL)).

Remark 1. The proof of Theorem 1 is essentially a reduction to Lattimore et al. [2018]'s lower bound for the so-called document-based click model. Their proof and ours both use the assumption L/K ∈ ℕ to simplify the analysis, which involves partitioning the L items into K subsets of size L/K each. When L/K ∉ ℕ, one of the subsets will have fewer items, which makes the analysis more cumbersome; however, this does not fundamentally alter either result.

Remark 2. The theorem also requires L ≥ 4K, which is not very restrictive since L ≫ K in typical applications. However, this assumption does eliminate an interesting analytical regime, namely, when K is close to L. We conjecture the minimax lower bound is Ω(√(n(L − K))) in this case, since any algorithm obtains zero regret when K = L and there are no suboptimal items.

We next consider the algorithms CascadeKL-UCB and CascadeUCB1 from Kveton et al. [2015a], along with a new one called CascadeUCB-V. All follow a similar template, which is given in Algorithm 1. This is a natural generalization of upper confidence bound (UCB) algorithms from the standard L-armed bandit setting. At each round t ∈ [n], it computes UCBs U_t(e) in a manner to be specified shortly, then chooses A_t as the K items with the highest UCBs (in order of UCB). After observing the click feedback C_t, the algorithm increments the number of observations T_t(e) and updates the empirical mean ŵ_{T_t(e)}(e) for each item e that the user examined.

Algorithm 1: General UCB algorithm for tabular cascading bandits
- Initialize the number of observations T_0(e) = 0 and the empirical mean ŵ_0(e) = 0 for each e ∈ [L].
- For t = 1, . . . , n:
  - Compute U_t(e) for each e ∈ [L] (by (2), (3), or (4) for CascadeKL-UCB, CascadeUCB-V, and CascadeUCB1, respectively).
  - For k = 1, . . . , K: let a^t_k = argmax_{e ∈ [L] \ {a^t_1, . . . , a^t_{k−1}}} U_t(e) be the item with the k-th highest UCB.
  - Play A_t = (a^t_1, . . . , a^t_K) and observe C_t = inf{k ∈ [K] : w_t(a^t_k) = 1} (where inf ∅ = ∞).
  - Let T_t(e) = T_{t−1}(e) for each e ∈ [L].
  - For k = 1, . . . , min{C_t, K}: set e = a^t_k, T_t(e) ← T_t(e) + 1, and ŵ_{T_t(e)}(e) ← (T_{t−1}(e) ŵ_{T_{t−1}(e)}(e) + 1(C_t = k)) / T_t(e).

For CascadeKL-UCB, the UCBs U_t(e) are computed as follows:

$$
U_t(e) = \max\big\{u \in [0, 1] : d(\hat{w}_{T_{t-1}(e)}(e), u) \le \log(f(t)) / T_{t-1}(e)\big\}, \qquad (2)
$$

where f(t) = t (log t)³ and d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) is the relative entropy between Bernoullis with means p, q ∈ [0, 1]. The set in (2) is a confidence interval for w(e), which is derived from the Chernoff bound.
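The index in (2) has no closed form, but since d(p, ·) is increasing on [p, 1], it can be computed by bisection. Below is a minimal sketch (the helper names are ours, not the paper's):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean_hat, pulls, t, iters=30):
    """Largest u in [0, 1] with d(mean_hat, u) <= log(f(t)) / pulls, where f(t) = t (log t)^3."""
    if pulls == 0:
        return 1.0  # unobserved items get the maximal UCB
    threshold = max(0.0, math.log(t * math.log(t) ** 3) / pulls)  # guard small t
    lo, hi = mean_hat, 1.0
    for _ in range(iters):  # bisection: d(mean_hat, .) is increasing on [mean_hat, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mean_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(0.1, pulls=25, t=1000))  # an index somewhat above the empirical mean 0.1
```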
For CascadeUCB-V, the UCBs are instead given by

$$
U_t(e) = \hat{w}_{T_{t-1}(e)}(e) + \sqrt{\frac{4\, \hat{v}_{T_{t-1}(e)}(e) \log t}{T_{t-1}(e)}} + \frac{6 \log t}{T_{t-1}(e)}, \qquad (3)
$$

where v̂_s(e) = ŵ_s(e)(1 − ŵ_s(e)) is the empirical variance from s observations of item e. This UCB is derived from the coarser, but crucially still variance-aware, Bernstein inequality. Finally, for CascadeUCB1, the UCBs are derived from the Hoeffding bound and computed as follows:

$$
U_t(e) = \hat{w}_{T_{t-1}(e)}(e) + c_{t, T_{t-1}(e)}, \quad \text{where } c_{t,s} = \sqrt{1.5 \log(t) / s}. \qquad (4)
$$

We can now show that the variance-aware UCBs are nearly optimal, while CascadeUCB1 is suboptimal.

Theorem 2. Suppose π is CascadeKL-UCB or CascadeUCB-V, i.e., the policy from Algorithm 1 with the UCBs given by (2) or (3). Then R_{π,w}(n) = Õ(√(nL)) for any w ∈ [0, 1]^L.

Remark 3. The reader may wonder why we proposed CascadeUCB-V, since the CascadeKL-UCB bound is enough to establish the minimax regret. The main reason is to demonstrate that variance-awareness alone (without the additional information encoded by KL-UCB) suffices to achieve the optimal regret, which helps motivate our linear algorithm. Furthermore, we show empirically in Section 6 that, while CascadeUCB-V is inferior to CascadeKL-UCB in terms of regret, its closed-form nature leads to quicker computation, while still improving on the regret of CascadeUCB1.

Theorem 3. Suppose n ≥ max{LK, 49K⁴}, L ≥ 800K, and π is CascadeUCB1, i.e., the policy from Algorithm 1 with the UCBs given by (4). Then R_{π,w}(n) = Ω(√(nLK)) for some w ∈ [0, 1]^L.

Remark 4. The assumed bounds on n and L simplify the calculations and can be improved (see Remark 9 in Appendix J). The main point of Theorem 3 is to show that, unlike the variance-aware algorithms, CascadeUCB1 cannot satisfy the conclusion of Theorem 2 for all choices of n and L.

## 4 Results for the linear case

In light of the previous section, we seek a variance-aware algorithm for the linear case. Our method is based on the WOFUL algorithm of Zhou et al. [2021], which was designed for the standard linear bandit setting (the case K = 1). In this section, we review WOFUL, discuss how to overcome its limitations in the cascading setting, explain our CascadeWOFUL algorithm, and bound its regret.

Existing algorithm. In Section 4.3 of their work, Zhou et al. [2021] consider the following problem. As above, there is a set of items [L], a known feature map φ : [L] → R^d, and an unknown parameter vector θ ∈ R^d. Successive plays of e ∈ [L] give i.i.d. rewards with mean w(e) = ⟨φ(e), θ⟩ and variance upper bounded by σ²_e. Thus, at each round t ∈ [n], the learner chooses a_t ∈ [L] and receives a random reward r_t = ⟨φ(a_t), θ⟩ + ε_t, where E[ε_t | a_t] = 0 and E[ε_t² | a_t] ≤ σ²_{a_t}. For this setting, the authors proposed the WOFUL algorithm, which is based on the (unweighted) OFUL algorithm [Abbasi-Yadkori et al., 2011]. At each round t ∈ [n], WOFUL chooses the item

$$
a_t = \arg\max_{e \in [L]} \; \langle \phi(e), \hat{\theta}_t \rangle + \beta\, \|\phi(e)\|_{\Sigma_t^{-1}}, \qquad (5)
$$

where β > 0 is an exploration parameter, ‖x‖_B = √(xᵀBx) is the norm induced by a positive definite matrix B, and θ̂_t is the regularized and variance-weighted least-squares estimate given by

$$
\hat{\theta}_t = \Sigma_t^{-1} \sum_{s \in [t-1]} \frac{\phi(a_s)\, r_s}{\sigma_{a_s}^2}, \quad \text{where } \Sigma_t = I + \sum_{s \in [t-1]} \frac{\phi(a_s) \phi(a_s)^{\mathsf{T}}}{\sigma_{a_s}^2}.
$$

To gain some intuition, we assume momentarily that d = L and φ(e) is the e-th standard basis vector, i.e., the vector with 1 in the e-th coordinate and 0 elsewhere. In this case, one can easily calculate

$$
\langle \phi(e), \hat{\theta}_t \rangle = \frac{\sum_{s \in [t-1]: a_s = e} r_s / \sigma_e^2}{1 + T_{t-1}(e)/\sigma_e^2}, \qquad \|\phi(e)\|_{\Sigma_t^{-1}} = \frac{1}{\sqrt{1 + T_{t-1}(e)/\sigma_e^2}}.
$$

Therefore, for large T_{t−1}(e) (large enough that 1 + T_{t−1}(e)/σ²_e ≈ T_{t−1}(e)/σ²_e), we have

$$
\langle \phi(e), \hat{\theta}_t \rangle + \beta\, \|\phi(e)\|_{\Sigma_t^{-1}} \approx \frac{1}{T_{t-1}(e)} \sum_{s \in [t-1]: a_s = e} r_s + \beta \sqrt{\frac{\sigma_e^2}{T_{t-1}(e)}}. \qquad (6)
$$
Similar to the UCB-V index (3), the right-hand side of (6) is the empirical mean plus an exploration bonus that grows with the variance upper bound σ²_e and decays in T_{t−1}(e). Hence, the term inside the argmax in (5) can be interpreted as a Bernstein-style UCB. The analysis of this UCB relies on a novel concentration inequality for vector-valued martingales [Zhou et al., 2021, Theorem 2], which is a Bernstein analogue of the Hoeffding-style bound due to Abbasi-Yadkori et al. [2011].

Limitations. The fact that WOFUL (roughly) generalizes UCB-V is promising. However, computing (6) requires knowledge of the variance upper bounds σ²_e, and nontrivial bounds on the variance are rarely available in practice. We sidestep this issue with three simple observations: (i) cascading bandits only involve Bernoulli rewards, (ii) for Bernoulli rewards, variances are upper bounded by means, and (iii) these means can be learned efficiently since they are linearly parameterized. This suggests the following algorithm: first, compute Hoeffding-style UCBs; second, treat these UCBs as upper bounds for the true means, and thus the true variances, in WOFUL.

Proposed algorithm. Algorithm 2 formalizes this approach. It contains three steps. First, step 1 defines Hoeffding-style UCBs U_{t,H} as in Abbasi-Yadkori et al. [2011]. Next, step 2 uses U_{t,H} as an upper bound for the variance and computes the Bernstein-style UCBs U_{t,B} as in WOFUL. Finally, step 3 chooses A_t as the K items with the highest U_{t,B}, analogous to Algorithm 1.

Two technical clarifications are in order. First, observe that in step 1, we clip the variance bound U_{t,H} below by 1/K. We do so to ensure that Σ_{t,B} (which inverts U_{t,H}) remains bounded. Additionally, we note the choice 1/K is precisely motivated by Section 1.2, which shows this is a critical threshold for the small click-through rate. Second, Σ_{t,B} uses the regularizer K·I. This is to ensure that the regularizer is large enough compared to the summands φ(a^s_k)φ(a^s_k)ᵀ / U_{s,H}(a^s_k), which scale as K in the worst case where the variance upper bound U_{s,H}(a^s_k) is clipped to 1/K.

Remark 5. Zong et al. [2016] proposed a WOFUL-style cascading algorithm called CascadeLinUCB (see Appendix C), but they set the variance upper bound to a fixed σ² > 0. In fact, they remark that ideally, σ² should be the variance of the observation noise, which CascadeWOFUL essentially learns.

Remark 6. Appendix B contains an improved version of CascadeWOFUL. It is more efficient (for example, the inverses are iteratively updated via Sherman-Morrison), satisfies the same theoretical guarantee as Algorithm 2, and includes some tweaks that improve performance in practice. The tradeoff is that Algorithm 2 is simpler to explain, which is why we prefer it for the main text.

Algorithm 2: CascadeWOFUL for linear cascading bandits
Input: exploration parameters {β_{t,H}, β_{t,B}}_{t=1}^n. For t = 1, . . . , n:

Step 1: Define the (clipped) Hoeffding-style UCBs

$$
U_{t,H}(e) = \max\Big\{ \tfrac{1}{K},\; \langle \phi(e), \hat{\theta}_{t,H} \rangle + \beta_{t,H}\, \|\phi(e)\|_{\Sigma_{t,H}^{-1}} \Big\},
$$

$$
\text{where } \hat{\theta}_{t,H} = \Sigma_{t,H}^{-1} \sum_{s=1}^{t-1} \sum_{k=1}^{\min\{C_s, K\}} \phi(a^s_k)\, w_s(a^s_k), \qquad
\Sigma_{t,H} = I + \sum_{s=1}^{t-1} \sum_{k=1}^{\min\{C_s, K\}} \phi(a^s_k) \phi(a^s_k)^{\mathsf{T}}.
$$

Step 2: Define the Bernstein-style UCBs

$$
U_{t,B}(e) = \langle \phi(e), \hat{\theta}_{t,B} \rangle + \beta_{t,B}\, \|\phi(e)\|_{\Sigma_{t,B}^{-1}},
$$

$$
\text{where } \hat{\theta}_{t,B} = \Sigma_{t,B}^{-1} \sum_{s=1}^{t-1} \sum_{k=1}^{\min\{C_s, K\}} \frac{\phi(a^s_k)\, w_s(a^s_k)}{U_{s,H}(a^s_k)}, \qquad
\Sigma_{t,B} = K I + \sum_{s=1}^{t-1} \sum_{k=1}^{\min\{C_s, K\}} \frac{\phi(a^s_k) \phi(a^s_k)^{\mathsf{T}}}{U_{s,H}(a^s_k)}.
$$

Step 3: Let a^t_k = argmax_{e ∈ [L] \ {a^t_1, . . . , a^t_{k−1}}} U_{t,B}(e) be the item with the k-th highest UCB, play A_t = (a^t_1, . . . , a^t_K), and observe C_t = inf{k ∈ [K] : w_t(a^t_k) = 1} (as in Algorithm 1).
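For concreteness, a minimal NumPy sketch of one round of Algorithm 2 is given below. It assumes the caller maintains the running sums that define θ̂_{t,H}, Σ_{t,H}, θ̂_{t,B}, Σ_{t,B}; the variable names (gram_H, b_H, etc.) and the bookkeeping helper are our own, not notation from the paper or from Appendix B's optimized variant.

```python
import numpy as np

def cascade_woful_round(Phi, K, gram_H, b_H, gram_B, b_B, beta_H, beta_B):
    """One round of the Algorithm 2 sketch.

    Phi: (L, d) array of item features phi(e).
    gram_H, b_H: running sums for the Hoeffding-style (unweighted) ridge regression.
    gram_B, b_B: running sums for the Bernstein-style (variance-weighted) regression.
    Returns the recommended list A_t (indices, best first) and the clipped Hoeffding UCBs U_H."""
    L, d = Phi.shape

    # Step 1: Hoeffding-style UCBs, clipped below by 1/K.
    Sigma_H = np.eye(d) + gram_H
    theta_H = np.linalg.solve(Sigma_H, b_H)
    inv_H = np.linalg.inv(Sigma_H)
    bonus_H = beta_H * np.sqrt(np.sum((Phi @ inv_H) * Phi, axis=1))  # ||phi(e)||_{Sigma_H^{-1}}
    U_H = np.maximum(Phi @ theta_H + bonus_H, 1.0 / K)

    # Step 2: Bernstein-style UCBs; the weights 1/U_{s,H} entered gram_B and b_B in earlier rounds.
    Sigma_B = K * np.eye(d) + gram_B
    theta_B = np.linalg.solve(Sigma_B, b_B)
    inv_B = np.linalg.inv(Sigma_B)
    bonus_B = beta_B * np.sqrt(np.sum((Phi @ inv_B) * Phi, axis=1))
    U_B = Phi @ theta_B + bonus_B

    # Step 3: recommend the K items with the highest Bernstein UCBs, in order.
    A_t = np.argsort(-U_B)[:K]
    return A_t, U_H


def update_sums(Phi, A_t, C_t, U_H, gram_H, b_H, gram_B, b_B):
    """Fold the examined items into the running sums after observing the click position C_t
    (np.inf if no click); only the first min{C_t, K} items are observed."""
    K = len(A_t)
    examined = int(min(C_t, K)) if np.isfinite(C_t) else K
    for k in range(examined):
        e = A_t[k]
        click = 1.0 if (k + 1) == C_t else 0.0
        x = Phi[e]
        gram_H += np.outer(x, x)
        b_H += x * click
        gram_B += np.outer(x, x) / U_H[e]
        b_B += x * click / U_H[e]
    return gram_H, b_H, gram_B, b_B
```

As Remark 6 notes, an efficient implementation would maintain the inverses incrementally (e.g., via Sherman-Morrison) rather than refactorizing the matrices each round.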
Regret bound. We can now state our main result for the linear case. Here B_d(M) = {x ∈ R^d : ‖x‖₂ ≤ M} denotes the Euclidean ball of radius M > 0 in R^d.[4]

[4] Our results hold with minor modification if φ : [L] → B_d(M₁) and θ ∈ B_d(M₂) for general M₁, M₂ > 0. However, as in prior work, the modified algorithm needs to know an upper bound on M₂.

Theorem 4. Suppose w ∈ [0, 1]^L, φ : [L] → B_d(1), and θ ∈ B_d(1) satisfy w(e) = ⟨φ(e), θ⟩ for all e ∈ [L]. Let π be the policy of Algorithm 2 with inputs β_{t,H} = √(d log(1 + tK/d) + 2 log(n)) + 1 and β_{t,B} = 8√(d log(1 + tK/d) log(n³K)) + 4√K log(n³K) + √K. Then

$$
R_{\pi,w}(n) = \tilde{O}\Big( \min\Big\{ \sqrt{nd(d+K)} + n^{1/3}(d+K)^{7/6},\;\; \sqrt{n}\,\max\{d,K\}\,\min\{d,K\}^{1/3} \Big\} \Big).
$$

Note this bound is completely independent of the number of items L. For large n (typically the case of interest), it becomes Õ(√(nd(d + K))), which improves the best known bound of Õ(√(nd²K)) [Li and Zhang, 2018]. The theorem also establishes Õ(√n max{d, K} min{d, K}^{1/3}) regret uniformly in n (i.e., without additive o(√n) terms, which may dominate for small n), which improves the best known uniform-n bound of Õ(dK√n) [Zong et al., 2016]. See Table 1 for more details.

## 5 Overview of the analysis

We next discuss the key ideas behind our proofs. The details are deferred to Appendices E-J.

Theorem 2 proof. For simplicity, we assume the arms are ordered by their means, i.e., w(1) ≥ · · · ≥ w(L). Under this assumption, the optimal action is [K]. We call those items optimal and [L] \ [K] suboptimal. We let E_t be the bad event that the empirical and true means differ substantially at time t, Ē_t its complement, G_{e,e*,t} the event that suboptimal item e > K was chosen in favor of optimal item e* ≤ K and subsequently examined by the user, and Δ_{e,e*} = w(e*) − w(e) the reward gap. Then, as in Appendix A.1 of Kveton et al. [2015a], we linearize regret as follows (up to constant factors):

$$
R_{\pi,w}(n) \;\lesssim\; \sum_{t=1}^{n} \sum_{e > K} \sum_{e^* \le K} \Delta_{e,e^*}\, \mathbb{P}\big(\bar{E}_t,\, G_{e,e^*,t}\big) \;+\; \sum_{t=1}^{n} \mathbb{P}(E_t). \qquad (7)
$$

The second term is small due to concentration. For the first term, define Δ := w(K)√(L/n) and assume n ≥ L (so Δ ≤ w(K)).[5] For each e > K, let K_Δ(e) = {e* ≤ K : Δ_{e,e*} ≤ Δ} be the optimal items with small gap relative to e, and let K̄_Δ(e) = [K] \ K_Δ(e) be the other optimal items. Then

$$
\Delta_{e,e^*}\, \mathbb{P}\big(\bar{E}_t, G_{e,e^*,t}\big) \;\le\; \Delta\, \mathbb{P}\big(G_{e,e^*,t}\big)\, \mathbb{1}\big(e^* \in K_\Delta(e)\big) \;+\; \Delta_{e,e^*}\, \mathbb{P}\big(\bar{E}_t, G_{e,e^*,t}\big)\, \mathbb{1}\big(e^* \in \bar{K}_\Delta(e)\big).
$$

Therefore, we can upper bound the first term in (7) by

$$
\Delta \sum_{t=1}^{n} \sum_{e > K} \sum_{e^* \in K_\Delta(e)} \mathbb{P}\big(G_{e,e^*,t}\big) \;+\; \sum_{t=1}^{n} \sum_{e > K} \sum_{e^* \in \bar{K}_\Delta(e)} \Delta_{e,e^*}\, \mathbb{P}\big(\bar{E}_t, G_{e,e^*,t}\big). \qquad (8)
$$

For the first term in (8), the inner double summation is (in expectation) the number of suboptimal items e > K that were chosen in favor of some e* ∈ K_Δ(e), and subsequently examined by the user, at round t. Denote this number by O_t. Recall w(e*) ≥ w(K) (by the assumed ordering) and Δ ≤ w(K), so for e* ∈ K_Δ(e), we have w(e) = w(e*) − Δ_{e,e*} ≥ w(e*) − Δ ≥ w(K) − Δ ≳ w(K). Hence, O_t is bounded by the number of items e with w(e) ≳ w(K) that the user examined. Since the user stops examining at the first attractive item, this number is dominated by a Geometric(w(K)) random variable. Therefore, the first term in (8) is O(Δn/w(K)) = O(√(nL)).

The second term in (8) accounts for choosing e instead of e* when Δ_{e,e*} ≥ Δ and the empirical means are concentrated. Building upon the intuition of Section 1.2, we exploit the variance-awareness of KL-UCB and UCB-V (and use w(e) ≤ w(K), by the assumed ordering) to bound this term by O(L·w(e)(1 − w(e))/Δ) ≤ O(L·w(K)/Δ). Thus, by the choice of Δ, this term is O(√(nL)) as well.

[5] The full proof addresses the cases n < L and w(K) = 0 (here we implicitly assume the latter does not hold, so that we may divide by w(K)).
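To summarize, the choice Δ = w(K)√(L/n) exactly balances the two bounds just derived (a quick arithmetic check):

$$
\Delta \cdot \frac{n}{w(K)} \;=\; \sqrt{nL} \;=\; \frac{L\, w(K)}{\Delta}.
$$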
Theorem 4 proof. We decompose regret similarly to (7) and (8), though with different choices of E_t and Δ.[6] To bound P(E_t), we use the aforementioned result of Abbasi-Yadkori et al. [2011] to show that the Hoeffding UCBs U_{t,H} upper bound the variances with high probability, then prove a guarantee for the least-squares estimate θ̂_{t,B} using the Bernstein-style bound from Zhou et al. [2021]. We bound the first term in (8) using the exact same logic as in the tabular case. The second term has a more complicated analysis, but (as in the tabular case) it amounts to bounding the number of times that items Δ-far from optimal are chosen when θ̂_{t,B} is well concentrated. For this, we adapt techniques from the standard linear bandit setting (the case K = 1) to general K ∈ [L].

[6] In fact, the proofs of Theorems 2 and 4 both rely on a more general gap-free regret decomposition for cascading bandits (Lemma 1 in Appendix F), which to our knowledge is novel and may be of independent interest.

Theorem 1 proof. We let H'_t = ∪_{s=1}^{t−1} {A_s, w_s(a^s_1), . . . , w_s(a^s_K)} be the entire history before time t, which includes the unobserved rewards w_s(a^s_k), k > C_s. We also let Π' be the set of policies that map H'_t to A_t. Note H_t ⊂ H'_t, so Π ⊂ Π'. For any π ∈ Π' and w ∈ [0, 1]^L, we define

$$
R'_{\pi,w}(n) = \mathbb{E}_{\pi,w}\Big[ \sum_{t=1}^{n} \sum_{k=1}^{K} \big( w(a^*_k) - w(a^t_k) \big) \Big], \quad \text{where } a^*_k = \arg\max_{e \in [L] \setminus \{a^*_1, \ldots, a^*_{k-1}\}} w(e).
$$

Then a lower bound linearization analogous to (7) shows that for any p ∈ [0, 1] and w ∈ [0, p]^L, R_{π,w}(n) ≥ (1 − p)^{K−1} R'_{π,w}(n), which implies

$$
\inf_{\pi \in \Pi} \sup_{w \in [0,1]^L} R_{\pi,w}(n) \;\ge\; \inf_{\pi \in \Pi'} \sup_{w \in [0,p]^L} R_{\pi,w}(n) \;\ge\; (1 - p)^{K-1} \inf_{\pi \in \Pi'} \sup_{w \in [0,p]^L} R'_{\pi,w}(n). \qquad (10)
$$

If we choose p = 1, the inf-sup at right is the minimax regret for the document-based model that was analyzed by Lattimore et al. [2018] (see Remark 1), but this choice makes (10) vacuous. On the other hand, by choosing p = O(1/K) (again, the small click-through rate of Section 1.2), so that the term (1 − p)^{K−1} in (10) is Ω(1), we can modify their analysis to prove Theorem 1.

Theorem 3 proof. We linearize the regret similarly to the proof of Theorem 1, then define a problem instance reminiscent of Section 1.2 and (roughly) follow the intuition for UCB1 therein.

## 6 Experiments

Before closing, we conduct experiments on both synthetic and real data. Some details regarding the experimental setup are deferred to Appendix C. Code is available in the supplementary material.

Figure 1: Results for synthetic data (tabular on top, linear on bottom; L = 100 in both).
Figure 2: Results for MovieLens data [Harper and Konstan, 2015].

Synthetic data. We let L = 100 and K ∈ {2i}_{i=2}^{8} = {4, 6, . . . , 16}. For each K, we sample w(e) uniformly at random from a sub-interval of [1/(3K), 2/K] for e ≤ K and from [0, 1/(3K)] for e > K. Note this yields a positive gap with the small click-through rate of Section 1.2. The top left plot in Figure 1 shows the regret at n = 10⁵ for the tabular algorithms of Section 3 (the shaded regions are the standard deviations across five trials). As predicted by Theorems 2 and 3, the CascadeUCB1 curve grows with K, while the variance-aware curves do not. The top middle plot shows the time spent computing UCBs for the same experiment, which confirms the behavior mentioned in Remark 3. For the linear case, we vary d ∈ {2i}_{i=2}^{8}, generate the same w, then compute unit-norm vectors θ and φ(e) satisfying w(e) = ⟨φ(e), θ⟩ (see Appendix C). We compare CascadeWOFUL to CascadeLinUCB [Zong et al., 2016], which Li and Zhang [2018] showed has the best existing regret guarantee. As suggested by Theorem 4, the left and middle plots on the bottom of Figure 1 show that our algorithm's regret has superior dependence on K and d. The rightmost plots show that regret is sublinear in n for the median values K = d = 10.
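One standard way to construct such unit-norm vectors from a given w is sketched below; this is only an illustrative construction (Appendix C may build θ and φ differently), and it assumes d ≥ 2 and w(e) ∈ [0, 1]:

```python
import numpy as np

def linear_features_from_means(w, d, rng=None):
    """Build unit-norm theta and phi(e) with <phi(e), theta> = w(e).
    Each phi(e) is w(e)*theta plus an orthogonal component of the right length."""
    rng = np.random.default_rng(rng)
    w = np.asarray(w, dtype=float)
    theta = np.zeros(d)
    theta[0] = 1.0  # any unit vector works
    phi = np.zeros((len(w), d))
    for e, we in enumerate(w):
        v = rng.standard_normal(d)
        v -= (v @ theta) * theta          # project out theta
        v /= np.linalg.norm(v)            # unit vector orthogonal to theta
        phi[e] = we * theta + np.sqrt(1.0 - we**2) * v  # norm = sqrt(we^2 + 1 - we^2) = 1
    return theta, phi

theta, phi = linear_features_from_means([0.3, 0.05, 0.2], d=5)
print(phi @ theta)  # recovers the means [0.3, 0.05, 0.2]
```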
Real data. We replicate the first experiment from Zong et al. [2016] on the MovieLens-1M dataset (grouplens.org/datasets/movielens/1m/), which contains user ratings for L ≈ 4000 movies. In brief, the setup is as follows. First, we use their default choices d = 20 and K = 4. Next, we divide the ratings into train and test sets based on the user who provided the rating. From the training data and a rank-d SVD approximation, we learn a feature mapping φ from movies to the probability that a uniformly random training user rated the movie more than three stars. Finally, we run the algorithms as above, except that at round t ∈ [n], we sample a uniformly random user J_t from the test set and define w_t(a^t_k) = W(J_t, a^t_k), where W(j, a) = 1(user j rated movie a more than 3 stars). In other words, instead of the independent Bernoulli clicks of Section 2, we observe the actual feedback of user J_t. We point the reader to Section 4 of Zong et al. [2016] and Appendix C for further details.

The left plot of Figure 2 shows that CascadeWOFUL outperforms CascadeLinUCB across n, eventually incurring less than 66% of the regret. In addition to this setup from Zong et al. [2016], we reran the experiment while restricting the set of items to movies of a particular genre, for each of the 18 genres in the dataset. This is intended to model platforms like Netflix that recommend movies in various categories. The right plot shows that CascadeWOFUL is superior for all genres; for some genres (e.g., fantasy), its regret is about half of CascadeLinUCB's. Moreover, our experiments indicate that CascadeWOFUL improves on CascadeLinUCB more dramatically for genres with smaller click-through rates (see Figure 3 and the surrounding discussion in Appendix C), which reinforces a key message of this paper.

## 7 Conclusion

In this work, we proved matching upper and lower bounds for the problem-independent regret of tabular cascading bandits and an upper bound for the linear case, all of which improve the best known. Our results suggest some interesting future directions, such as proving minimax lower bounds for the linear case and revisiting Thompson sampling for cascading bandits [Zhong et al., 2021] in light of our variance-aware insight; see Appendix D for details.

Finally, we note that the paper is theoretical and has no immediate societal impact. Nevertheless, we urge caution regarding the negative impacts that could arise in practice. For example, our MovieLens experiments involved training on a subset of users, which could cause poor recommendations for demographics underrepresented in the training set.

## Acknowledgments and Disclosure of Funding

This work was partially supported by ONR Grant N00014-19-1-2566, NSF TRIPODS Grant 1934932, NSF Grants CCF 22-07547, CCF 19-34986, CNS 21-06801, 2019844, 2112471, 2107037, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program. We thank Advait Parulekar for helpful discussions.

## References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876-1902, 2009.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, pages 1516-1541, 2013.
Wang Chi Cheung, Vincent Tan, and Zixin Zhong. A Thompson sampling algorithm for cascading bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 438-447. PMLR, 2019.

Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. Click Models for Web Search. Morgan & Claypool Publishers, 2015.

Richard Combes, Stefan Magureanu, Alexandre Proutiere, and Cyrille Laroche. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 231-244, 2015.

Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87-94, 2008.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 2137-2143, 2008.

David A. Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100-118, 1975.

Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359-376. JMLR Workshop and Conference Proceedings, 2011.

F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1-19, 2015.

Gaurush Hiranandani, Harvineet Singh, Prakhar Gupta, Iftikhar Ahamath Burhanuddin, Zheng Wen, and Branislav Kveton. Cascading linear submodular bandits: Accounting for position bias and diversity in online learning to rank. In Uncertainty in Artificial Intelligence, pages 722-732. PMLR, 2020.

Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767-776. PMLR, 2015a.

Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Combinatorial cascading bandits. Advances in Neural Information Processing Systems, 28, 2015b.

Branislav Kveton, Ofer Meshi, Zhen Qin, and Masrour Zoghi. On the value of prior in online learning to rank. In The 25th International Conference on Artificial Intelligence and Statistics, 2022.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvari. TopRank: A practical algorithm for online stochastic ranking. Advances in Neural Information Processing Systems, 31, 2018.

Chang Li and Maarten de Rijke. Cascading non-stationary bandits: Online learning to rank in the non-stationary cascade model. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 2859-2865, 2019.

Shuai Li and Shengyu Zhang. Online clustering of contextual cascading bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. In International Conference on Machine Learning, pages 1245-1253. PMLR, 2016.

Shuai Li, Tor Lattimore, and Csaba Szepesvári. Online learning to rank with features. In International Conference on Machine Learning, pages 3856-3865. PMLR, 2019.

Xutong Liu, Jinhang Zuo, Siwei Wang, Carlee Joe-Wong, John Lui, and Wei Chen. Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms. arXiv preprint arXiv:2208.14837, 2022.
Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Proceedings of the 24th Annual Conference on Learning Theory, pages 497-514. JMLR Workshop and Conference Proceedings, 2011.

Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521-530, 2007.

Lingda Wang, Huozhi Zhou, Bingcong Li, Lav R. Varshney, and Zhizhen Zhao. Near-optimal algorithms for piecewise-stationary cascading bandits. In ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3365-3369. IEEE, 2021.

Qinshi Wang and Wei Chen. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. Advances in Neural Information Processing Systems, 30, 2017.

Zixin Zhong, Wang Chi Cheung, and Vincent Y. F. Tan. Thompson sampling algorithms for cascading bandits. Journal of Machine Learning Research, 22:1-66, 2021.

Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pages 4532-4576. PMLR, 2021.

Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. Online learning to rank in stochastic click models. In International Conference on Machine Learning, pages 4199-4208. PMLR, 2017.

Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 835-844, 2016.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The claims are accurate. Furthermore, we provide references to the formal results alongside each claim stated in the introduction.
   (b) Did you describe the limitations of your work? [Yes] The main limitation is that our lower bounds (Theorems 1 and 3) require some additional assumptions. These are stated clearly in the theorem statements and discussed in Remarks 1 and 4.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] Section 7 addresses societal impact.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] As mentioned above, the assumptions are clearly stated in the theorem statements.
   (b) Did you include complete proofs of all theoretical results? [Yes] Section 1.2 discusses the high-level intuition behind our proofs, Section 5 contains a more detailed proof sketch, and the appendices include complete proofs.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Complete code to recreate all plots is available in the supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] The experimental setup is discussed in detail in Appendix C, and other training details can be found in the code.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] All plots include error bars.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] The experiments were run over several hours on a laptop, so no significant resources were used. Nevertheless, Appendix C mentions the approximate runtime needed to recreate the figures.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] We only used the MovieLens dataset, which we cited and also provided a link to.
   (b) Did you mention the license of the assets? [Yes] The link to the MovieLens dataset contains a README with license information.
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]