# Minimax Regret for Cascading Bandits

Daniel Vial (UT Austin & UIUC) dvial@utexas.edu
Sujay Sanghavi (UT Austin & Amazon) sanghavi@mail.utexas.edu
Sanjay Shakkottai (UT Austin) sanjay.shakkottai@utexas.edu
R. Srikant (UIUC) rsrikant@illinois.edu

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract. Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and the fact that small mean implies small variance for Bernoullis, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where variance-aware algorithms only improve constants. In light of this, and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show that its regret strictly improves the state of the art.

## 1 Introduction

The cascading click model describes users interacting with ranked lists, such as search results or online advertisements [Craswell et al., 2008]. In this model, there is a set of items [L] = {1, . . . , L}. The user is given a list of K items A = (a_1, . . . , a_K), sequentially examines the list, and clicks on the first attractive item (if any). If a click occurs, the user leaves without examining the subsequent items. It is assumed that the e-th item has an attraction probability w(e) (which we also call the mean reward), and that the random (Bernoulli) clicks are conditionally independent given A.

Cascading bandits, introduced concurrently by Kveton et al. [2015a] and Combes et al. [2015], are a sequential learning version of the model where the mean rewards {w(e)}_{e=1}^L are initially unknown. At each round t ∈ [n], the learner chooses an action, which is a list of items A_t = (a^t_1, . . . , a^t_K). As in the basic click model, the user scans the list and clicks on the first attractive item (if any). Thus, if the C_t-th item is clicked, the learner knows that the user was not attracted to {a^t_k}_{k=1}^{C_t−1} but was attracted to a^t_{C_t}. However, the learner receives no feedback on the items {a^t_k}_{k=C_t+1}^{K} that the user did not examine before leaving. The objective for the learner is to choose the sequence {A_t}_{t=1}^n to maximize the expected number of clicks, or equivalently, to minimize the regret defined in (1).

This work provides problem-independent (i.e., gap-free) regret bounds for cascading bandits that strictly improve the state of the art. In the case of unstructured rewards, our results provide the first minimax-optimal regret bounds (up to log terms). Our key insight is that, compared to the standard bandit problem, the reward variance plays an outsized role in the gap-free analysis. In particular, we show that for cascading bandits, the worst-case problem instances are those with low mean rewards,
namely, w(e) ≲ 1/K, where the variance w(e)(1 − w(e)) is also small. We emphasize that these worst-case instances are not pathological; rather, they model the low click-through rates that prevail in practice [Richardson et al., 2007]. Further, we argue that adapting to low variance is crucial to cope with the worst-case instances. Put differently, algorithms should be variance-aware, i.e., more exploitative when the variance is small. We provide the intuition behind this key insight in Section 1.2 and show how to formalize it with a proof sketch in Section 5.

Table 1: Problem-independent upper bounds and minimax lower bounds for cascading bandits with L total items, K recommended items, horizon n, and feature dimension d. The tabular columns assume unstructured mean rewards. The linear column assumes the reward is linear in unit-norm features and does not suppress any L dependencies in the Õ(·) notation. See Appendix A for other related work.

| Paper | Tabular upper bound | Tabular lower bound | Linear upper bound |
|---|---|---|---|
| Zong et al. [2016] | none | none | Õ(√(nd²K²)) |
| Wang and Chen [2017] | Õ(√(nLK)) | none | none |
| Lattimore et al. [2018] | Õ(√(nLK³)) | none | none |
| Li and Zhang [2018] | none | none | Õ(√(nd²K)) |
| Zhong et al. [2021] |  | Ω(√(nL/K)) | Õ(√(nd²K³ min{d, log L})) |
| Kveton et al. [2022] | Õ(√(nLK)) | none | none |
| Ours | Õ(√(nL)) (Thm 2) | Ω(√(nL)) (Thm 1) | Õ(√(nd(d + K))) (Thm 4) |

Table 1 notes:
- These papers contain minimax lower bounds, but for click models that are distinct from cascading bandits.
- These bounds assume L is large compared to K and n is large compared to L and K.
- This bound holds for a Bayesian notion of regret and includes a dependence on the prior not shown above.
- These bounds assume n is large compared to d and K to suppress some additive o(√n) terms.

More specifically, our first goal is to establish upper bounds on the problem-independent regret (i.e., the maximum over w) of cascading bandit algorithms, as well as minimax lower bounds (i.e., the infimum over algorithms of their problem-independent regret). As shown in Table 1, the tightest existing such bounds are Õ(√(nLK)) and Ω(√(nL/K)), respectively. What is surprising is that, not only is there a gap between the bounds, but they increase and decrease in K, respectively. In other words, the following fundamental question is unresolved: as K grows, does the problem become harder (as suggested by the upper bound) or easier (as suggested by the lower bound)?

As discussed by Zhong et al. [2021], this question lacks an obvious answer. On the one hand, larger K means that the learner needs to identify more good items, which hints at a harder problem. On the other hand, the learner receives more feedback as K grows, which intuitively makes the problem easier. As we show later, variance-aware algorithms are the key to resolving this tradeoff.

In addition to this basic version of the model (hereafter, the tabular case, where no structure is assumed for the mean rewards), we are also interested in the linear case, where w(e) = ⟨φ(e), θ⟩ for some known feature map φ : [L] → R^d and unknown parameter vector θ ∈ R^d. A second goal of this work is to apply our insights from the tabular case to the linear one, in hopes of improving the best known upper bound Õ(√(nd²K)) (see Table 1).

### 1.1 Main contributions

Tabular case. First, we show that no algorithm can achieve o(√(nL)) regret uniformly across w, assuming L is large compared to K and n is large compared to L (see Theorem 1). Next, we consider three algorithms: CascadeKL-UCB, CascadeUCB-V, and CascadeUCB1 (the first and third are due to Kveton et al. [2015a]; CascadeUCB-V is new).
All three rank the L items using upper confidence bounds (UCBs) and choose A_t as the K highest-ranked items. They differ in the choice of UCB. As the names suggest, CascadeKL-UCB and CascadeUCB-V use KL-UCB [Garivier and Cappé, 2011, Maillard et al., 2011, Cappé et al., 2013] and UCB-V [Audibert et al., 2009], respectively. Both are variance-aware, in the sense that their respective UCBs are derived from the Chernoff and Bernstein inequalities. We show that both algorithms have near-optimal regret Õ(√(nL)) for any w (see Theorem 2). In contrast, CascadeUCB1 relies on the Hoeffding-style UCB1 [Auer et al., 2002], and we show that it suffers suboptimal regret Ω(√(nLK)) on some w (see Theorem 3). In summary, we prove that (i) the minimax regret for cascading bandits is √(nL) up to log terms, (ii) the variance-aware algorithms CascadeKL-UCB and CascadeUCB-V are minimax-optimal up to log terms, and (iii) the variance-unaware algorithm CascadeUCB1 is decidedly suboptimal. Moreover, note from Table 1 that we strictly improve both the upper and lower bounds for this problem.

Discussion. There are two surprising aspects to these results. First, the minimax bound of order √(nL) shows that (in a worst-case sense) the number of recommended items K plays no role.[1] In other words, the aforementioned tradeoff (identifying more good items but receiving more feedback) is perfectly balanced when the correct algorithm (e.g., CascadeKL-UCB) is employed. The second (and arguably more surprising) aspect is that CascadeKL-UCB and CascadeUCB-V are optimal but CascadeUCB1 is not. This stands in contrast to the standard L-armed bandit problem, where the analogous algorithms KL-UCB, UCB-V, and UCB1 all achieve the minimax regret √(nL) (up to log terms), and the main advantage of the former two is only to improve the constants in the gap-dependent bounds. We discuss the intuition behind this contrast in Section 1.2.

[1] The Õ(·) notation hides log K terms, but they can be bounded by log L while retaining Õ(√(nL)) regret.

Linear case. Motivated by these findings, we also propose a variance-aware algorithm for the linear case called CascadeWOFUL.[2] CascadeWOFUL proceeds in two steps. First, we use the Hoeffding-style UCBs of Abbasi-Yadkori et al. [2011] to upper bound the mean rewards w(e), and thus the Bernoulli variances w(e)(1 − w(e)), with high probability. Second, we use these Hoeffding UCBs as proxies for the true variances in the WOFUL algorithm of Zhou et al. [2021], which computes a variance-weighted estimate of θ that enjoys Bernstein-style concentration. In Theorem 4, we show that the regret of CascadeWOFUL is Õ(√(nd(d + K))) for large n, which improves existing bounds by a factor of order at least √(min{d, K}) (see Table 1).

[2] WOFUL stands for weighted optimism in the face of uncertainty for linear bandits.

### 1.2 Why do variance-aware algorithms succeed but variance-unaware algorithms fail?

To answer this question, we focus on the tabular case and contrast the standard L-armed bandit setting with the cascading one. We first recall how the O(√(nL)) bound for L-armed bandits is derived. For simplicity, we restrict to instances with w(1) = p and w(2) = · · · = w(L) = p − Δ for some p ∈ (0, 1) and Δ ∈ (0, p). In this case, UCB1 plays each of the L − 1 suboptimal items 1/Δ² times, up to constants and log factors [Auer et al., 2002]. Each such play costs regret Δ, for a total regret that scales as L/Δ. Alternatively, regret can simply be bounded by nΔ, since the n plays incur at most Δ regret each. Combining the bounds gives min{L/Δ, nΔ}. The worst case occurs when the two bounds are equal, i.e., when Δ = √(L/n), which implies √(nL) regret.
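As a quick worked check of this balancing step (constants and log factors suppressed, using that a minimum is at most the geometric mean):

$$
\min\{L/\Delta,\; n\Delta\} \;\le\; \sqrt{(L/\Delta)\cdot n\Delta} \;=\; \sqrt{nL},
\qquad \text{with equality of the two terms at } \Delta = \sqrt{L/n}.
$$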
In contrast, UCB-V plays each suboptimal item σ²/Δ² times (in an order sense), where σ² ≤ p is the variance of the Bernoulli(p − Δ) reward [Audibert et al., 2009]. Therefore, the argument of the previous paragraph shows that regret grows as min{Lp/Δ, nΔ}. Here the worst case occurs when p is non-vanishing and Δ ≈ √(L/n), which gives the same regret scaling as UCB1. A similar argument holds for KL-UCB, because the number of plays grows as 1/d(p − Δ, p) [Cappé et al., 2013] and one can show d(p − Δ, p) = Θ(Δ²/p) (see Claim 1 in Appendix G).

The analysis is more complicated for cascading bandits, because regret is nonlinear in the mean rewards and the amount of feedback is random. To oversimplify things, we draw an analogy with the above and assume w(1) = · · · = w(K) = p and w(K + 1) = · · · = w(L) = p − Δ. In this case, CascadeUCB1 similarly plays the L − K suboptimal items 1/Δ² times each, which costs L/Δ regret when L ≫ K. However, we can no longer bound regret by nΔ, because the total number of plays depends on the random number of items the user examines at each round, which is roughly min{1/p, K} (the mean of a Geometric(p) random variable truncated at the maximum K). Thus, the nΔ bound inflates to nΔ min{1/p, K}, which gives min{L/Δ, nΔ min{1/p, K}} regret. For p ≈ 1/K and Δ ≈ √(L/(nK)), this yields the best known bound √(nLK). We emphasize that, unlike in the previous paragraph, the worst case here occurs when p (i.e., the click-through rate) is small.

On the other hand, the analogous bound for CascadeUCB-V and CascadeKL-UCB scales as min{Lp/Δ, nΔ min{1/p, K}}. Crucially, the factor of p in the first term (which arises due to the variance-aware nature of the algorithms) offsets the factor 1/p in the second term. Thus, in the hard case p ≈ 1/K, the bound becomes min{L/(ΔK), nΔK}. Here the worst case is Δ = √(L/(nK²)), which yields the minimax regret √(nL) that we establish in Theorems 1 and 2.

## 2 Preliminaries

In this section, we precisely formulate our problem. A cascading bandit instance is defined by the triple (L, K, w), where L ∈ ℕ is the total number of items, K ∈ [L] = {1, . . . , L} is the number of items the learner displays to the user at each round, and w ∈ [0, 1]^L is the vector of attraction probabilities.

We define a sequential game as follows. At each round t ∈ [n], the learner chooses an action A_t = (a^t_1, . . . , a^t_K), where a^t_k ∈ [L] and a^t_k ≠ a^t_{k'} when k ≠ k' (i.e., A_t is an ordered list of K distinct items). The user sequentially examines this list, clicks on the first item that attracts them, and stops examining items after clicking. Mathematically, we denote the first attractive item by C_t = inf{k ∈ [K] : w_t(a^t_k) = 1}, where w_t(e) ∼ Bernoulli(w(e)) for each e ∈ [L]. If no item is clicked, i.e., if w_t(a^t_k) = 0 for all k, we set C_t = ∞. Note that the learner only observes the realizations corresponding to the items that the user examined, i.e., {w_t(a^t_k) : k ∈ [min{C_t, K}]}. We denote by H_t = ∪_{s=1}^{t−1} ({A_s} ∪ {w_s(a^s_k) : k ∈ [min{C_s, K}]}) the history of actions and observations before time t.[3] We let P_t(·) = P(· | H_t ∪ {A_t}) and E_t[·] = E[· | H_t ∪ {A_t}] denote conditional probability and expectation given the history and the current action. In the cascading bandit model, it is assumed that the feedback {w_t(a^t_k)}_{k=1}^K (i.e., the presence or absence of clicks) is conditionally independent given H_t and A_t.

[3] Our notation mostly follows Kveton et al. [2015a], but we clarify that their definition of H_t includes A_t.
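To make the feedback structure concrete, here is a minimal simulation of one round of the cascade model, written as a sketch under the assumptions above (the function name and the use of `math.inf` for "no click" are our own illustrative choices, not the paper's notation):

```python
import math
import random

def cascade_round(w, A):
    """Simulate one round: the user scans the list A and clicks the first attractive item.

    w: sequence mapping item index -> attraction probability w(e)
    A: ordered list of K distinct items (the action A_t)
    Returns (C_t, feedback), where C_t is the 1-based click position (math.inf if no click)
    and feedback contains only the Bernoulli realizations the learner actually observes,
    i.e., those of the first min{C_t, K} items."""
    C_t = math.inf
    feedback = []
    for k, item in enumerate(A, start=1):
        click = 1 if random.random() < w[item] else 0
        feedback.append((item, click))
        if click == 1:
            C_t = k
            break  # the user leaves after the first click; later items are never observed
    return C_t, feedback

# Example: L = 5 items, recommend K = 3 of them.
w = [0.05, 0.2, 0.1, 0.02, 0.3]
print(cascade_round(w, [4, 1, 2]))
```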
Under the conditional-independence assumption above, the click probability given the history and the current action is

$$
P_t(C_t < \infty) = 1 - P_t(C_t = \infty) = 1 - P_t\Big(\bigcap_{k=1}^{K} \{w_t(a^t_k) = 0\}\Big) = 1 - \prod_{k=1}^{K} \big(1 - w(a^t_k)\big).
$$

Mappings from H_t to A_t are called policies. Let Π be the set of all policies. Given an instance w ∈ [0, 1]^L, a policy π ∈ Π, and a horizon n ∈ ℕ, the expected number of clicks is

$$
\mathbb{E}_{\pi,w}\Big[ \sum_{t=1}^{n} \mathbb{1}(C_t < \infty) \Big],
$$

where 1(·) is the indicator function. Let A* = (a*_1, . . . , a*_K) be any action A = (a_1, . . . , a_K) that maximizes the click probability 1 − ∏_{k=1}^K (1 − w(a_k)). Our goal is to minimize regret, which is the expected difference in the number of clicks between π and the policy that always plays A*, i.e.,

$$
R_{\pi,w}(n) = \sum_{t=1}^{n} \mathbb{E}_{\pi,w}\Big[ \prod_{k=1}^{K}\big(1 - w(a^t_k)\big) - \prod_{k=1}^{K}\big(1 - w(a^*_k)\big) \Big]. \qquad (1)
$$

## 3 Results for the tabular case

We can now state our tabular results (the proofs are discussed in Section 5). First, we have a minimax lower bound showing that no algorithm can achieve o(√(nL)) regret uniformly over the mean rewards w.

Theorem 1. Suppose N := L/K ∈ {4, 5, . . .} and n ≥ L. Then for any policy π ∈ Π, there exists a mean reward vector w ∈ [0, 1]^L such that R_{π,w}(n) = Ω(√(nL)).

Remark 1. The proof of Theorem 1 is essentially a reduction to Lattimore et al. [2018]'s lower bound for the so-called document-based click model. Their proof and ours both use the assumption L/K ∈ ℕ to simplify the analysis, which involves partitioning the L items into K subsets of size L/K each. When L/K ∉ ℕ, one of the subsets will have fewer items, which makes the analysis more cumbersome; however, this does not fundamentally alter either result.

Remark 2. The theorem also requires L ≥ 4K, which is not very restrictive since L ≫ K in typical applications. However, this assumption does eliminate an interesting analytical regime, namely, when K is close to L. We conjecture the minimax lower bound is Ω(√(n(L − K))) in this case, since any algorithm obtains zero regret when K = L and there are no suboptimal items.

We next consider the algorithms CascadeKL-UCB and CascadeUCB1 from Kveton et al. [2015a], along with a new one called CascadeUCB-V. All follow a similar template, which is given in Algorithm 1. This is a natural generalization of upper confidence bound (UCB) algorithms from the standard L-armed bandit setting. At each round t ∈ [n], it computes UCBs U_t(e) in a manner to be specified shortly, then chooses A_t as the K items with the highest UCBs (in order of UCB). After observing the click feedback C_t, the algorithm increments the number of observations T_t(e) and updates the empirical mean ŵ_{T_t(e)}(e) for each item e that the user examined.

Algorithm 1: General UCB algorithm for tabular cascading bandits
- Initialize the number of observations T_0(e) = 0 and the empirical mean ŵ_0(e) = 0 for each e ∈ [L].
- For t = 1, . . . , n:
  - Compute U_t(e) for each e ∈ [L] (by (2), (3), or (4) for CascadeKL-UCB, CascadeUCB-V, and CascadeUCB1, respectively).
  - For k = 1, . . . , K: let a^t_k = argmax_{e ∈ [L] \ {a^t_1, . . . , a^t_{k−1}}} U_t(e) be the item with the k-th highest UCB.
  - Play A_t = (a^t_1, . . . , a^t_K) and observe C_t = inf{k ∈ [K] : w_t(a^t_k) = 1} (where inf ∅ = ∞).
  - Let T_t(e) = T_{t−1}(e) for each e ∈ [L].
  - For k = 1, . . . , min{C_t, K}: set e = a^t_k, T_t(e) ← T_t(e) + 1, and ŵ_{T_t(e)}(e) ← (T_{t−1}(e) ŵ_{T_{t−1}(e)}(e) + 1(C_t = k)) / T_t(e).

For CascadeKL-UCB, the UCBs U_t(e) are computed as follows:

$$
U_t(e) = \max\big\{u \in [0, 1] : d(\hat{w}_{T_{t-1}(e)}(e), u) \le \log(f(t)) / T_{t-1}(e)\big\}, \qquad (2)
$$

where f(t) = t (log t)³ and d(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)) is the relative entropy between Bernoullis with means p, q ∈ [0, 1]. The set in (2) is a confidence interval for w(e), which is derived from the Chernoff bound.
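The index in (2) has no closed form, but since d(p, ·) is increasing on [p, 1], it can be computed by bisection. Below is a minimal sketch (the helper names are ours, not the paper's):

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Relative entropy d(p, q) between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean_hat, pulls, t, iters=30):
    """Largest u in [0, 1] with d(mean_hat, u) <= log(f(t)) / pulls, where f(t) = t (log t)^3."""
    if pulls == 0:
        return 1.0  # unobserved items get the maximal UCB
    threshold = max(0.0, math.log(t * math.log(t) ** 3) / pulls)  # guard small t
    lo, hi = mean_hat, 1.0
    for _ in range(iters):  # bisection: d(mean_hat, .) is increasing on [mean_hat, 1]
        mid = (lo + hi) / 2
        if kl_bernoulli(mean_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(0.1, pulls=25, t=1000))  # an index somewhat above the empirical mean 0.1
```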
For CascadeUCB-V, the UCBs are instead given by

$$
U_t(e) = \hat{w}_{T_{t-1}(e)}(e) + \sqrt{\frac{4\, \hat{v}_{T_{t-1}(e)}(e) \log t}{T_{t-1}(e)}} + \frac{6 \log t}{T_{t-1}(e)}, \qquad (3)
$$

where v̂_s(e) = ŵ_s(e)(1 − ŵ_s(e)) is the empirical variance from s observations of item e. This UCB is derived from the coarser, but crucially still variance-aware, Bernstein inequality. Finally, for CascadeUCB1, the UCBs are derived from the Hoeffding bound and computed as follows:

$$
U_t(e) = \hat{w}_{T_{t-1}(e)}(e) + c_{t, T_{t-1}(e)}, \quad \text{where } c_{t,s} = \sqrt{1.5 \log(t) / s}. \qquad (4)
$$

We can now show that the variance-aware UCBs are nearly optimal, while CascadeUCB1 is suboptimal.

Theorem 2. Suppose π is CascadeKL-UCB or CascadeUCB-V, i.e., the policy from Algorithm 1 with the UCBs given by (2) or (3). Then R_{π,w}(n) = Õ(√(nL)) for any w ∈ [0, 1]^L.

Remark 3. The reader may wonder why we proposed CascadeUCB-V, since the CascadeKL-UCB bound is enough to establish the minimax regret. The main reason is to demonstrate that variance-awareness alone (without the additional information encoded by KL-UCB) suffices to achieve the optimal regret, which helps motivate our linear algorithm. Furthermore, we show empirically in Section 6 that, while CascadeUCB-V is inferior to CascadeKL-UCB in terms of regret, its closed-form nature leads to quicker computation, while still improving on the regret of CascadeUCB1.

Theorem 3. Suppose n ≥ max{LK, 49K⁴}, L ≥ 800K, and π is CascadeUCB1, i.e., the policy from Algorithm 1 with the UCBs given by (4). Then R_{π,w}(n) = Ω(√(nLK)) for some w ∈ [0, 1]^L.

Remark 4. The assumed bounds on n and L simplify the calculations and can be improved (see Remark 9 in Appendix J). The main point of Theorem 3 is to show that, unlike the variance-aware algorithms, CascadeUCB1 cannot satisfy the conclusion of Theorem 2 for all choices of n and L.

## 4 Results for the linear case

In light of the previous section, we seek a variance-aware algorithm for the linear case. Our method is based on the WOFUL algorithm of Zhou et al. [2021], which was designed for the standard linear bandit setting (the case K = 1). In this section, we review WOFUL, discuss how to overcome its limitations in the cascading setting, explain our CascadeWOFUL algorithm, and bound its regret.

Existing algorithm. In Section 4.3 of their work, Zhou et al. [2021] consider the following problem. As above, there is a set of items [L], a known feature map φ : [L] → R^d, and an unknown parameter vector θ ∈ R^d. Successive plays of e ∈ [L] give i.i.d. rewards with mean w(e) = ⟨φ(e), θ⟩ and variance upper bounded by σ²_e. Thus, at each round t ∈ [n], the learner chooses a_t ∈ [L] and receives a random reward r_t = ⟨φ(a_t), θ⟩ + ε_t, where E[ε_t | a_t] = 0 and E[ε_t² | a_t] ≤ σ²_{a_t}. For this setting, the authors proposed the WOFUL algorithm, which is based on the (unweighted) OFUL algorithm [Abbasi-Yadkori et al., 2011]. At each round t ∈ [n], WOFUL chooses the item

$$
a_t = \arg\max_{e \in [L]} \; \langle \phi(e), \hat{\theta}_t \rangle + \beta\, \|\phi(e)\|_{\Sigma_t^{-1}}, \qquad (5)
$$

where β > 0 is an exploration parameter, ‖x‖_B = √(xᵀBx) is the norm induced by a positive definite matrix B, and θ̂_t is the regularized and variance-weighted least-squares estimate given by

$$
\hat{\theta}_t = \Sigma_t^{-1} \sum_{s \in [t-1]} \frac{\phi(a_s)\, r_s}{\sigma_{a_s}^2}, \quad \text{where } \Sigma_t = I + \sum_{s \in [t-1]} \frac{\phi(a_s) \phi(a_s)^{\mathsf{T}}}{\sigma_{a_s}^2}.
$$

To gain some intuition, we assume momentarily that d = L and φ(e) is the e-th standard basis vector, i.e., the vector with 1 in the e-th coordinate and 0 elsewhere. In this case, one can easily calculate

$$
\langle \phi(e), \hat{\theta}_t \rangle = \frac{\sum_{s \in [t-1]: a_s = e} r_s / \sigma_e^2}{1 + T_{t-1}(e)/\sigma_e^2}, \qquad \|\phi(e)\|_{\Sigma_t^{-1}} = \frac{1}{\sqrt{1 + T_{t-1}(e)/\sigma_e^2}}.
$$

Therefore, for large T_{t−1}(e) (large enough that 1 + T_{t−1}(e)/σ²_e ≈ T_{t−1}(e)/σ²_e), we have

$$
\langle \phi(e), \hat{\theta}_t \rangle + \beta\, \|\phi(e)\|_{\Sigma_t^{-1}} \approx \frac{1}{T_{t-1}(e)} \sum_{s \in [t-1]: a_s = e} r_s + \beta \sqrt{\frac{\sigma_e^2}{T_{t-1}(e)}}. \qquad (6)
$$
Similar to the UCB-V index (3), the right-hand side of (6) is the empirical mean plus an exploration bonus that grows with the variance upper bound σ²_e and decays in T_{t−1}(e). Hence, the term inside the argmax in (5) can be interpreted as a Bernstein-style UCB. The analysis of this UCB relies on a novel concentration inequality for vector-valued martingales [Zhou et al., 2021, Theorem 2], which is a Bernstein analogue of the Hoeffding-style bound due to Abbasi-Yadkori et al. [2011].

Limitations. The fact that WOFUL (roughly) generalizes UCB-V is promising. However, computing (6) requires knowledge of the variance upper bounds σ²_e, and nontrivial bounds on the variance are rarely available in practice. We sidestep this issue with three simple observations: (i) cascading bandits only involve Bernoulli rewards, (ii) for Bernoulli rewards, variances are upper bounded by means, and (iii) these means can be learned efficiently since they are linearly parameterized. This suggests the following algorithm: first, compute Hoeffding-style UCBs; second, treat these UCBs as upper bounds for the true means, and thus the true variances, in WOFUL.

Proposed algorithm. Algorithm 2 formalizes this approach. It contains three steps. First, step 1 defines Hoeffding-style UCBs U_{t,H} as in Abbasi-Yadkori et al. [2011]. Next, step 2 uses U_{t,H} as an upper bound for the variance and computes the Bernstein-style UCBs U_{t,B} as in WOFUL. Finally, step 3 chooses A_t as the K items with the highest U_{t,B}, analogous to Algorithm 1.

Two technical clarifications are in order. First, observe that in step 1, we clip the variance bound U_{t,H} below by 1/K. We do so to ensure that Σ_{t,B} (which inverts U_{t,H}) remains bounded. Additionally, we note the choice 1/K is precisely motivated by Section 1.2, which shows this is a critical threshold for the small click-through rate. Second, Σ_{t,B} uses the regularizer K·I. This is to ensure that the regularizer is large enough compared to the summands φ(a^s_k)φ(a^s_k)ᵀ / U_{s,H}(a^s_k), which scale as K in the worst case where the variance upper bound U_{s,H}(a^s_k) is clipped to 1/K.

Remark 5. Zong et al. [2016] proposed a WOFUL-style cascading algorithm called CascadeLinUCB (see Appendix C), but they set the variance upper bound to a fixed σ² > 0. In fact, they remark that ideally, σ² should be the variance of the observation noise, which CascadeWOFUL essentially learns.

Remark 6. Appendix B contains an improved version of CascadeWOFUL. It is more efficient (for example, the inverses are iteratively updated via Sherman-Morrison), satisfies the same theoretical guarantee as Algorithm 2, and includes some tweaks that improve performance in practice. The tradeoff is that Algorithm 2 is simpler to explain, which is why we prefer it for the main text.

Algorithm 2: CascadeWOFUL for linear cascading bandits
Input: exploration parameters {β_{t,H}, β_{t,B}}_{t=1}^n. For t = 1, . . . , n:

Step 1: Define the (clipped) Hoeffding-style UCBs

$$
U_{t,H}(e) = \max\Big\{ \tfrac{1}{K},\; \langle \phi(e), \hat{\theta}_{t,H} \rangle + \beta_{t,H}\, \|\phi(e)\|_{\Sigma_{t,H}^{-1}} \Big\},
$$

$$
\text{where } \hat{\theta}_{t,H} = \Sigma_{t,H}^{-1} \sum_{s=1}^{t-1} \sum_{k=1}^{\min\{C_s, K\}} \phi(a^s_k)\, w_s(a^s_k), \qquad
\Sigma_{t,H} = I + \sum_{s=1}^{t-1} \sum_{k=1}^{\min\{C_s, K\}} \phi(a^s_k) \phi(a^s_k)^{\mathsf{T}}.
$$

Step 2: Define the Bernstein-style UCBs

$$
U_{t,B}(e) = \langle \phi(e), \hat{\theta}_{t,B} \rangle + \beta_{t,B}\, \|\phi(e)\|_{\Sigma_{t,B}^{-1}},
$$

$$
\text{where } \hat{\theta}_{t,B} = \Sigma_{t,B}^{-1} \sum_{s=1}^{t-1} \sum_{k=1}^{\min\{C_s, K\}} \frac{\phi(a^s_k)\, w_s(a^s_k)}{U_{s,H}(a^s_k)}, \qquad
\Sigma_{t,B} = K I + \sum_{s=1}^{t-1} \sum_{k=1}^{\min\{C_s, K\}} \frac{\phi(a^s_k) \phi(a^s_k)^{\mathsf{T}}}{U_{s,H}(a^s_k)}.
$$

Step 3: Let a^t_k = argmax_{e ∈ [L] \ {a^t_1, . . . , a^t_{k−1}}} U_{t,B}(e) be the item with the k-th highest UCB, play A_t = (a^t_1, . . . , a^t_K), and observe C_t = inf{k ∈ [K] : w_t(a^t_k) = 1} (as in Algorithm 1).
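For concreteness, a minimal NumPy sketch of one round of Algorithm 2 is given below. It assumes the caller maintains the running sums that define θ̂_{t,H}, Σ_{t,H}, θ̂_{t,B}, Σ_{t,B}; the variable names (gram_H, b_H, etc.) and the bookkeeping helper are our own, not notation from the paper or from Appendix B's optimized variant.

```python
import numpy as np

def cascade_woful_round(Phi, K, gram_H, b_H, gram_B, b_B, beta_H, beta_B):
    """One round of the Algorithm 2 sketch.

    Phi: (L, d) array of item features phi(e).
    gram_H, b_H: running sums for the Hoeffding-style (unweighted) ridge regression.
    gram_B, b_B: running sums for the Bernstein-style (variance-weighted) regression.
    Returns the recommended list A_t (indices, best first) and the clipped Hoeffding UCBs U_H."""
    L, d = Phi.shape

    # Step 1: Hoeffding-style UCBs, clipped below by 1/K.
    Sigma_H = np.eye(d) + gram_H
    theta_H = np.linalg.solve(Sigma_H, b_H)
    inv_H = np.linalg.inv(Sigma_H)
    bonus_H = beta_H * np.sqrt(np.sum((Phi @ inv_H) * Phi, axis=1))  # ||phi(e)||_{Sigma_H^{-1}}
    U_H = np.maximum(Phi @ theta_H + bonus_H, 1.0 / K)

    # Step 2: Bernstein-style UCBs; the weights 1/U_{s,H} entered gram_B and b_B in earlier rounds.
    Sigma_B = K * np.eye(d) + gram_B
    theta_B = np.linalg.solve(Sigma_B, b_B)
    inv_B = np.linalg.inv(Sigma_B)
    bonus_B = beta_B * np.sqrt(np.sum((Phi @ inv_B) * Phi, axis=1))
    U_B = Phi @ theta_B + bonus_B

    # Step 3: recommend the K items with the highest Bernstein UCBs, in order.
    A_t = np.argsort(-U_B)[:K]
    return A_t, U_H


def update_sums(Phi, A_t, C_t, U_H, gram_H, b_H, gram_B, b_B):
    """Fold the examined items into the running sums after observing the click position C_t
    (np.inf if no click); only the first min{C_t, K} items are observed."""
    K = len(A_t)
    examined = int(min(C_t, K)) if np.isfinite(C_t) else K
    for k in range(examined):
        e = A_t[k]
        click = 1.0 if (k + 1) == C_t else 0.0
        x = Phi[e]
        gram_H += np.outer(x, x)
        b_H += x * click
        gram_B += np.outer(x, x) / U_H[e]
        b_B += x * click / U_H[e]
    return gram_H, b_H, gram_B, b_B
```

As Remark 6 notes, an efficient implementation would maintain the inverses incrementally (e.g., via Sherman-Morrison) rather than refactorizing the matrices each round.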
Regret bound. We can now state our main result for the linear case. Here B_d(M) = {x ∈ R^d : ‖x‖₂ ≤ M} denotes the Euclidean ball of radius M > 0 in R^d.[4]

[4] Our results hold with minor modification if φ : [L] → B_d(M₁) and θ ∈ B_d(M₂) for general M₁, M₂ > 0. However, as in prior work, the modified algorithm needs to know an upper bound on M₂.

Theorem 4. Suppose w ∈ [0, 1]^L, φ : [L] → B_d(1), and θ ∈ B_d(1) satisfy w(e) = ⟨φ(e), θ⟩ for all e ∈ [L]. Let π be the policy of Algorithm 2 with inputs β_{t,H} = √(d log(1 + tK/d) + 2 log(n)) + 1 and β_{t,B} = 8√(d log(1 + tK/d) log(n³K)) + 4√K log(n³K) + √K. Then

$$
R_{\pi,w}(n) = \tilde{O}\Big( \min\Big\{ \sqrt{nd(d+K)} + n^{1/3}(d+K)^{7/6},\;\; \sqrt{n}\,\max\{d,K\}\,\min\{d,K\}^{1/3} \Big\} \Big).
$$

Note this bound is completely independent of the number of items L. For large n (typically the case of interest), it becomes Õ(√(nd(d + K))), which improves the best known bound of Õ(√(nd²K)) [Li and Zhang, 2018]. The theorem also establishes Õ(√n max{d, K} min{d, K}^{1/3}) regret uniformly in n (i.e., without additive o(√n) terms, which may dominate for small n), which improves the best known uniform-n bound of Õ(dK√n) [Zong et al., 2016]. See Table 1 for more details.

## 5 Overview of the analysis

We next discuss the key ideas behind our proofs. The details are deferred to Appendices E-J.

Theorem 2 proof. For simplicity, we assume the arms are ordered by their means, i.e., w(1) ≥ · · · ≥ w(L). Under this assumption, the optimal action is [K]. We call those items optimal and [L] \ [K] suboptimal. We let E_t be the bad event that the empirical and true means differ substantially at time t, Ē_t its complement, G_{e,e*,t} the event that suboptimal item e > K was chosen in favor of optimal item e* ≤ K and subsequently examined by the user, and Δ_{e,e*} = w(e*) − w(e) the reward gap. Then, as in Appendix A.1 of Kveton et al. [2015a], we linearize regret as follows (up to constant factors):

$$
R_{\pi,w}(n) \;\lesssim\; \sum_{t=1}^{n} \sum_{e > K} \sum_{e^* \le K} \Delta_{e,e^*}\, \mathbb{P}\big(\bar{E}_t,\, G_{e,e^*,t}\big) \;+\; \sum_{t=1}^{n} \mathbb{P}(E_t). \qquad (7)
$$

The second term is small due to concentration. For the first term, define Δ := w(K)√(L/n) and assume n ≥ L (so Δ ≤ w(K)).[5] For each e > K, let K_Δ(e) = {e* ≤ K : Δ_{e,e*} ≤ Δ} be the optimal items with small gap relative to e, and let K̄_Δ(e) = [K] \ K_Δ(e) be the other optimal items. Then

$$
\Delta_{e,e^*}\, \mathbb{P}\big(\bar{E}_t, G_{e,e^*,t}\big) \;\le\; \Delta\, \mathbb{P}\big(G_{e,e^*,t}\big)\, \mathbb{1}\big(e^* \in K_\Delta(e)\big) \;+\; \Delta_{e,e^*}\, \mathbb{P}\big(\bar{E}_t, G_{e,e^*,t}\big)\, \mathbb{1}\big(e^* \in \bar{K}_\Delta(e)\big).
$$

Therefore, we can upper bound the first term in (7) by

$$
\Delta \sum_{t=1}^{n} \sum_{e > K} \sum_{e^* \in K_\Delta(e)} \mathbb{P}\big(G_{e,e^*,t}\big) \;+\; \sum_{t=1}^{n} \sum_{e > K} \sum_{e^* \in \bar{K}_\Delta(e)} \Delta_{e,e^*}\, \mathbb{P}\big(\bar{E}_t, G_{e,e^*,t}\big). \qquad (8)
$$

For the first term in (8), the inner double summation is (in expectation) the number of suboptimal items e > K that were chosen in favor of some e* ∈ K_Δ(e), and subsequently examined by the user, at round t. Denote this number by O_t. Recall w(e*) ≥ w(K) (by the assumed ordering) and Δ ≤ w(K), so for e* ∈ K_Δ(e), we have w(e) = w(e*) − Δ_{e,e*} ≥ w(e*) − Δ ≥ w(K) − Δ ≳ w(K). Hence, O_t is bounded by the number of items e with w(e) ≳ w(K) that the user examined. Since the user stops examining at the first attractive item, this number is dominated by a Geometric(w(K)) random variable. Therefore, the first term in (8) is O(Δn/w(K)) = O(√(nL)).

The second term in (8) accounts for choosing e instead of e* when Δ_{e,e*} ≥ Δ and the empirical means are concentrated. Building upon the intuition of Section 1.2, we exploit the variance-awareness of KL-UCB and UCB-V (and use w(e) ≤ w(K), by the assumed ordering) to bound this term by O(L·w(e)(1 − w(e))/Δ) ≤ O(L·w(K)/Δ). Thus, by the choice of Δ, this term is O(√(nL)) as well.

[5] The full proof addresses the cases n < L and w(K) = 0 (here we implicitly assume the latter does not hold, so that we may divide by w(K)).
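To summarize, the choice Δ = w(K)√(L/n) exactly balances the two bounds just derived (a quick arithmetic check):

$$
\Delta \cdot \frac{n}{w(K)} \;=\; \sqrt{nL} \;=\; \frac{L\, w(K)}{\Delta}.
$$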
Theorem 4 proof. We decompose regret similarly to (7) and (8), though with different choices of E_t and Δ.[6] To bound P(E_t), we use the aforementioned result of Abbasi-Yadkori et al. [2011] to show that the Hoeffding UCBs U_{t,H} upper bound the variances with high probability, then prove a guarantee for the least-squares estimate θ̂_{t,B} using the Bernstein-style bound from Zhou et al. [2021]. We bound the first term in (8) using the exact same logic as in the tabular case. The second term has a more complicated analysis, but (as in the tabular case) it amounts to bounding the number of times that items Δ-far from optimal are chosen when θ̂_{t,B} is well concentrated. For this, we adapt techniques from the standard linear bandit setting (the case K = 1) to general K ∈ [L].

[6] In fact, the proofs of Theorems 2 and 4 both rely on a more general gap-free regret decomposition for cascading bandits (Lemma 1 in Appendix F), which to our knowledge is novel and may be of independent interest.

Theorem 1 proof. We let H'_t = ∪_{s=1}^{t−1} {A_s, w_s(a^s_1), . . . , w_s(a^s_K)} be the entire history before time t, which includes the unobserved rewards w_s(a^s_k), k > C_s. We also let Π' be the set of policies that map H'_t to A_t. Note H_t ⊂ H'_t, so Π ⊂ Π'. For any π ∈ Π' and w ∈ [0, 1]^L, we define

$$
R'_{\pi,w}(n) = \mathbb{E}_{\pi,w}\Big[ \sum_{t=1}^{n} \sum_{k=1}^{K} \big( w(a^*_k) - w(a^t_k) \big) \Big], \quad \text{where } a^*_k = \arg\max_{e \in [L] \setminus \{a^*_1, \ldots, a^*_{k-1}\}} w(e).
$$

Then a lower bound linearization analogous to (7) shows that for any p ∈ [0, 1] and w ∈ [0, p]^L, R_{π,w}(n) ≥ (1 − p)^{K−1} R'_{π,w}(n), which implies

$$
\inf_{\pi \in \Pi} \sup_{w \in [0,1]^L} R_{\pi,w}(n) \;\ge\; \inf_{\pi \in \Pi'} \sup_{w \in [0,p]^L} R_{\pi,w}(n) \;\ge\; (1 - p)^{K-1} \inf_{\pi \in \Pi'} \sup_{w \in [0,p]^L} R'_{\pi,w}(n). \qquad (10)
$$

If we choose p = 1, the inf-sup at right is the minimax regret for the document-based model that was analyzed by Lattimore et al. [2018] (see Remark 1), but this choice makes (10) vacuous. On the other hand, by choosing p = O(1/K) (again, the small click-through rate of Section 1.2), so that the term (1 − p)^{K−1} in (10) is Ω(1), we can modify their analysis to prove Theorem 1.

Theorem 3 proof. We linearize the regret similarly to the proof of Theorem 1, then define a problem instance reminiscent of Section 1.2 and (roughly) follow the intuition for UCB1 therein.

## 6 Experiments

Before closing, we conduct experiments on both synthetic and real data. Some details regarding the experimental setup are deferred to Appendix C. Code is available in the supplementary material.

Figure 1: Results for synthetic data (tabular on top, linear on bottom; L = 100 in both).
Figure 2: Results for MovieLens data [Harper and Konstan, 2015].

Synthetic data. We let L = 100 and K ∈ {2i}_{i=2}^{8} = {4, 6, . . . , 16}. For each K, we sample w(e) uniformly at random from a sub-interval of [1/(3K), 2/K] for e ≤ K and from [0, 1/(3K)] for e > K. Note this yields a positive gap with the small click-through rate of Section 1.2. The top left plot in Figure 1 shows the regret at n = 10⁵ for the tabular algorithms of Section 3 (the shaded regions are the standard deviations across five trials). As predicted by Theorems 2 and 3, the CascadeUCB1 curve grows with K, while the variance-aware curves do not. The top middle plot shows the time spent computing UCBs for the same experiment, which confirms the behavior mentioned in Remark 3. For the linear case, we vary d ∈ {2i}_{i=2}^{8}, generate the same w, then compute unit-norm vectors θ and φ(e) satisfying w(e) = ⟨φ(e), θ⟩ (see Appendix C). We compare CascadeWOFUL to CascadeLinUCB [Zong et al., 2016], which Li and Zhang [2018] showed has the best existing regret guarantee. As suggested by Theorem 4, the left and middle plots on the bottom of Figure 1 show that our algorithm's regret has superior dependence on K and d. The rightmost plots show that regret is sublinear in n for the median values K = d = 10.
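One standard way to construct such unit-norm vectors from a given w is sketched below; this is only an illustrative construction (Appendix C may build θ and φ differently), and it assumes d ≥ 2 and w(e) ∈ [0, 1]:

```python
import numpy as np

def linear_features_from_means(w, d, rng=None):
    """Build unit-norm theta and phi(e) with <phi(e), theta> = w(e).
    Each phi(e) is w(e)*theta plus an orthogonal component of the right length."""
    rng = np.random.default_rng(rng)
    w = np.asarray(w, dtype=float)
    theta = np.zeros(d)
    theta[0] = 1.0  # any unit vector works
    phi = np.zeros((len(w), d))
    for e, we in enumerate(w):
        v = rng.standard_normal(d)
        v -= (v @ theta) * theta          # project out theta
        v /= np.linalg.norm(v)            # unit vector orthogonal to theta
        phi[e] = we * theta + np.sqrt(1.0 - we**2) * v  # norm = sqrt(we^2 + 1 - we^2) = 1
    return theta, phi

theta, phi = linear_features_from_means([0.3, 0.05, 0.2], d=5)
print(phi @ theta)  # recovers the means [0.3, 0.05, 0.2]
```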
Real data. We replicate the first experiment from Zong et al. [2016] on the MovieLens-1M dataset (grouplens.org/datasets/movielens/1m/), which contains user ratings for L ≈ 4000 movies. In brief, the setup is as follows. First, we use their default choices d = 20 and K = 4. Next, we divide the ratings into train and test sets based on the user who provided the rating. From the training data and a rank-d SVD approximation, we learn a feature mapping φ from movies to the probability that a uniformly random training user rated the movie more than three stars. Finally, we run the algorithms as above, except that at round t ∈ [n], we sample a uniformly random user J_t from the test set and define w_t(a^t_k) = W(J_t, a^t_k), where W(j, a) = 1(user j rated movie a more than 3 stars). In other words, instead of the independent Bernoulli clicks of Section 2, we observe the actual feedback of user J_t. We point the reader to Section 4 of Zong et al. [2016] and Appendix C for further details.

The left plot of Figure 2 shows that CascadeWOFUL outperforms CascadeLinUCB across n, eventually incurring less than 66% of the regret. In addition to this setup from Zong et al. [2016], we reran the experiment while restricting the set of items to movies of a particular genre, for each of the 18 genres in the dataset. This is intended to model platforms like Netflix that recommend movies in various categories. The right plot shows that CascadeWOFUL is superior for all genres; for some genres (e.g., fantasy), its regret is about half of CascadeLinUCB's. Moreover, our experiments indicate that CascadeWOFUL improves on CascadeLinUCB more dramatically for genres with smaller click-through rates (see Figure 3 and the surrounding discussion in Appendix C), which reinforces a key message of this paper.

## 7 Conclusion

In this work, we proved matching upper and lower bounds for the problem-independent regret of tabular cascading bandits and an upper bound for the linear case, all of which improve the best known. Our results suggest some interesting future directions, such as proving minimax lower bounds for the linear case and revisiting Thompson sampling for cascading bandits [Zhong et al., 2021] in light of our variance-aware insight; see Appendix D for details.

Finally, we note that the paper is theoretical and has no immediate societal impact. Nevertheless, we urge caution regarding the negative impacts that could arise in practice. For example, our MovieLens experiments involved training on a subset of users, which could cause poor recommendations for demographics underrepresented in the training set.

## Acknowledgments and Disclosure of Funding

This work was partially supported by ONR Grant N00014-19-1-2566, NSF TRIPODS Grant 1934932, NSF Grants CCF 22-07547, CCF 19-34986, CNS 21-06801, 2019844, 2112471, 2107037, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program. We thank Advait Parulekar for helpful discussions.

## References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.

Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876-1902, 2009.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, pages 1516-1541, 2013.
Wang Chi Cheung, Vincent Tan, and Zixin Zhong. A Thompson sampling algorithm for cascading bandits. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 438-447. PMLR, 2019.

Aleksandr Chuklin, Ilya Markov, and Maarten de Rijke. Click Models for Web Search. Morgan & Claypool Publishers, 2015.

Richard Combes, Stefan Magureanu, Alexandre Proutiere, and Cyrille Laroche. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 231-244, 2015.

Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87-94, 2008.

Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 2137-2143, 2008.

David A. Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100-118, 1975.

Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359-376. JMLR Workshop and Conference Proceedings, 2011.

F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):1-19, 2015.

Gaurush Hiranandani, Harvineet Singh, Prakhar Gupta, Iftikhar Ahamath Burhanuddin, Zheng Wen, and Branislav Kveton. Cascading linear submodular bandits: Accounting for position bias and diversity in online learning to rank. In Uncertainty in Artificial Intelligence, pages 722-732. PMLR, 2020.

Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767-776. PMLR, 2015a.

Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Combinatorial cascading bandits. Advances in Neural Information Processing Systems, 28, 2015b.

Branislav Kveton, Ofer Meshi, Zhen Qin, and Masrour Zoghi. On the value of prior in online learning to rank. In The 25th International Conference on Artificial Intelligence and Statistics, 2022.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Tor Lattimore, Branislav Kveton, Shuai Li, and Csaba Szepesvari. TopRank: A practical algorithm for online stochastic ranking. Advances in Neural Information Processing Systems, 31, 2018.

Chang Li and Maarten de Rijke. Cascading non-stationary bandits: Online learning to rank in the non-stationary cascade model. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 2859-2865, 2019.

Shuai Li and Shengyu Zhang. Online clustering of contextual cascading bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. In International Conference on Machine Learning, pages 1245-1253. PMLR, 2016.

Shuai Li, Tor Lattimore, and Csaba Szepesvári. Online learning to rank with features. In International Conference on Machine Learning, pages 3856-3865. PMLR, 2019.

Xutong Liu, Jinhang Zuo, Siwei Wang, Carlee Joe-Wong, John Lui, and Wei Chen. Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms. arXiv preprint arXiv:2208.14837, 2022.
Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Proceedings of the 24th Annual Conference on Learning Theory, pages 497-514. JMLR Workshop and Conference Proceedings, 2011.

Matthew Richardson, Ewa Dominowska, and Robert Ragno. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, pages 521-530, 2007.

Lingda Wang, Huozhi Zhou, Bingcong Li, Lav R. Varshney, and Zhizhen Zhao. Near-optimal algorithms for piecewise-stationary cascading bandits. In ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3365-3369. IEEE, 2021.

Qinshi Wang and Wei Chen. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. Advances in Neural Information Processing Systems, 30, 2017.

Zixin Zhong, Wang Chi Cheung, and Vincent Y. F. Tan. Thompson sampling algorithms for cascading bandits. Journal of Machine Learning Research, 22:1-66, 2021.

Dongruo Zhou, Quanquan Gu, and Csaba Szepesvari. Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory, pages 4532-4576. PMLR, 2021.

Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, and Zheng Wen. Online learning to rank in stochastic click models. In International Conference on Machine Learning, pages 4199-4208. PMLR, 2017.

Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 835-844, 2016.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The claims are accurate. Furthermore, we provide references to the formal results alongside each claim stated in the introduction.
   (b) Did you describe the limitations of your work? [Yes] The main limitation is that our lower bounds (Theorems 1 and 3) require some additional assumptions. These are stated clearly in the theorem statements and discussed in Remarks 1 and 4.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] Section 7 addresses societal impact.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] As mentioned above, the assumptions are clearly stated in the theorem statements.
   (b) Did you include complete proofs of all theoretical results? [Yes] Section 1.2 discusses the high-level intuition behind our proofs, Section 5 contains a more detailed proof sketch, and the appendices include complete proofs.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Complete code to recreate all plots is available in the supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] The experimental setup is discussed in detail in Appendix C, and other training details can be found in the code.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] All plots include error bars.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] The experiments were run over several hours on a laptop, so no significant resources were used. Nevertheless, Appendix C mentions the approximate runtime needed to recreate the figures.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] We only used the MovieLens dataset, which we cited and also provided a link to.
   (b) Did you mention the license of the assets? [Yes] The link to the MovieLens dataset contains a README with license information.
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]