# Cascading Contextual Assortment Bandits

Hyun-jun Choi (Seoul National University, Seoul, South Korea, nschj1@snu.ac.kr) · Rajan Udwani (UC Berkeley, Berkeley, CA, USA, rudwani@berkeley.edu) · Min-hwan Oh (Seoul National University, Seoul, South Korea, minoh@snu.ac.kr)

**Abstract.** We present a new combinatorial bandit model, the cascading contextual assortment bandit. This model generalizes both existing cascading bandits and assortment bandits, broadening their applicability in practice. For this model, we propose our first UCB bandit algorithm, UCB-CCA. We prove that this algorithm achieves a $T$-step regret upper bound of $\tilde{O}(\frac{1}{\kappa}d\sqrt{T})$, sharper than existing bounds for cascading contextual bandits because it eliminates the dependence on the cascade length $K$. To improve the dependence on the problem-dependent constant $\kappa$, we introduce our second algorithm, UCB-CCA+, which leverages a new Bernstein-type concentration result. This algorithm achieves $\tilde{O}(d\sqrt{T})$ without dependence on $\kappa$ in the leading term. We substantiate our theoretical claims with numerical experiments, demonstrating the practical efficacy of our proposed methods.

## 1 Introduction

Sequential interactions between users and a recommender agent are often modeled as the multi-armed bandit problem or one of its variants [15]. In practice, a user typically encounters multiple items per round of interaction rather than a solitary item. Two popular models that capture this aspect are the cascading bandit [13; 14; 18] and the assortment bandit [4; 5; 7; 22], also often known as the multinomial logit bandit.

In the cascading bandit problem [13; 14; 18; 29; 25; 28], the agent selects a cascade of $K$ items from a total of $N$ items each round. These selected items are presented sequentially, one at a time, to a user. If the user clicks on a presented item, the round ends. If not, the agent reveals the next item from the cascade. This process continues until either the user clicks on an item or all $K$ items in the cascade have been presented without a click. Once a round ends, the next round commences with a newly selected list of $K$ items. In the assortment bandit problem [4; 5; 7; 8; 9; 22; 23], the agent presents an assortment of $M$ items all at once and then receives user choice feedback on the assortment. The user may opt for one of the $M$ items presented or choose none at all, concluding the round in either case.

Both cascading and assortment bandit problems are significant combinatorial variations of the multi-armed bandit problem and have been extensively examined both theoretically and in practice. However, a scheme more commonly encountered in real-world applications is a generalization of these two settings, in which a cascade of assortments is revealed sequentially in each round. This scheme is evident in video streaming services, where assortments of recommended contents are revealed as users scroll through webpages or mobile applications. Similar experiences can be found in various online retail services and search engines. To address this, we propose a new interactive model, which we term the cascading assortment bandit. In the cascading assortment bandit, the agent chooses a cascade of $K$ assortments, each containing $M$ items. The agent reveals one assortment at a time in the cascade. The cascade concludes if the user clicks on one of the items contained in a given assortment. If not, the agent proceeds to unveil the subsequent assortment in the cascade. This cycle continues until either an assortment receives a click from the user or the agent depletes the pre-selected assortments.
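To fix ideas, the following minimal simulation sketches one round of this protocol. The per-item click probabilities here are hypothetical placeholders of our own choosing; Section 2.2 replaces them with the MNL choice model.

```python
import random

def present_cascade(cascade, click_prob, rng=random):
    """Simulate one round of the cascading assortment interaction.

    cascade: list of K assortments, each a list of M item ids.
    click_prob: hypothetical map from item id to click probability
                (a stand-in for the MNL model of Section 2.2).
    Returns (k, item) for a click on assortment k, or (None, None)
    if the user scans all K assortments without clicking.
    """
    for k, assortment in enumerate(cascade):
        for item in assortment:
            if rng.random() < click_prob[item]:
                return k, item          # a click ends the round
        # no click on assortment k: reveal the next assortment
    return None, None                   # cascade depleted without a click

# Example: a cascade of K = 2 assortments with M = 2 items each.
probs = {0: 0.10, 1: 0.20, 2: 0.30, 3: 0.05}
print(present_cascade([[0, 1], [2, 3]], probs))
```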
*Figure 1: Comparisons between the cascading assortment bandit and the other combinatorial bandits. The cascading assortment bandit subsumes the multi-armed bandit ($K = 1$, $M = 1$), the cascading bandit ($K > 1$, $M = 1$), and the assortment bandit ($K = 1$, $M > 1$). (Four panels: Multi-Armed Bandit ($K = 1$, $M = 1$), Cascading Bandit ($K = 5$, $M = 1$), Assortment Bandit ($K = 1$, $M = 4$), Cascading Assortment Bandit ($K = 5$, $M = 4$); legend: scanned & not clicked, not scanned, current scan.)*

The cascading assortment bandit problem is a strict generalization of both the cascading bandit model and the assortment bandit model, and also of the simple multi-armed bandit problem. That is, if $K = 1$ and $M = 1$, the problem is the simple multi-armed bandit. If $K > 1$ and $M = 1$, the problem corresponds to the cascading bandit. If $K = 1$ and $M > 1$, we recover the assortment bandit problem. Comparisons between the cascading assortment bandit and the other combinatorial bandits are illustrated in Figure 1.

In order to accommodate the generalization of the interactive model across items and assortments, we also incorporate feature information of the items and a parametrization of the click model. Hence, we name the model the cascading contextual assortment bandit (see Section 2.2 for the formal definition of the problem setting).¹ Under this newly proposed combinatorial bandit model, we posit the following question: *Can we design a provably efficient algorithm for cascading contextual assortment bandits?*

To address this question, we first have to overcome the technical challenges inherent in each special case of our problem setting: the cascading contextual bandit and the contextual assortment bandit. First, in the cascading contextual bandit [18; 25], a longstanding issue has been the suboptimal dependence on the cascade length $K$. Intuitively, one would expect that as $K$ increases in the cascading model, the regret should either diminish or at least remain constant; performance deterioration should not occur. However, all existing regret bounds for cascading contextual bandits scale polynomially with $K$ [18; 29; 17; 28; 25]. This is not just counter-intuitive, but also suboptimal (for further discussion, refer to Section 2.4). As a result, (i) eradicating the suboptimal dependence on the cascade length $K$ has been recognized as an open problem, even within the cascading contextual bandit setting.²

Further, in the context of assortment bandits, there is a widely recognized suboptimal dependence on the problem-specific constant $\kappa$, as demonstrated in the existing assortment bandit literature [9; 22; 23]. This problem-specific constant $\kappa$ (defined in Assumption 4.2) represents the curvature of the multinomial logit (MNL) function. Recent studies [24; 3] have demonstrated an improved dependence on $\kappa$, with $\kappa$ appearing only in terms multiplied by logarithmic factors. However, this improvement comes at the expense of an increased dependence on the assortment size $M$, a conclusion that is both counter-intuitive and suboptimal.
Thus, (ii) decreasing the dependence on $\kappa$ without escalating the dependence on $M$ remains an unresolved issue. While addressing either of the two challenges (i) and (ii) individually is daunting, tackling both simultaneously poses an even greater challenge in both our algorithm design and our regret analysis.

¹Note that a non-contextual version of the cascading assortment bandit is a special case of the cascading contextual assortment bandit with a one-hot encoded feature vector for each item. Hence, in providing efficient algorithms for the cascading contextual assortment bandit, we also address the non-contextual cascading assortment bandit, which has not been studied previously.

²A concurrent work [20] addresses this suboptimal dependence on $K$ under the linear model assumption. Our work tackles this challenge under the MNL model in a more general setting.

Table 1: Comparisons of algorithms for contextual cascading and assortment bandits as well as for cascading contextual assortment bandits. $N$ is the number of ground items, $K$ is the length of the cascade, $d$ is the dimension of the feature vectors, and $T$ is the total number of rounds. $\kappa$ is a problem-dependent parameter for the MNL model. See Appendix A for more discussion.

| Algorithm | Context | Cascade | Assortment | Click Model | Regret Bound |
|---|---|---|---|---|---|
| CombCascade [14] | – | ✓ | – | – | $\tilde{O}(\sqrt{KNT})$ |
| C3-UCB [18] | ✓ | ✓ | – | Linear | $\tilde{O}(d\sqrt{KT})$ |
| EE-MNL [5] | – | – | ✓ | MNL | $\tilde{O}(\sqrt{NT})$ |
| TS-MNL [22] | ✓ | – | ✓ | MNL | $\tilde{O}(\frac{1}{\kappa}d\sqrt{T})$ |
| UCB-MNL [23] | ✓ | – | ✓ | MNL | $\tilde{O}(\frac{1}{\kappa}d\sqrt{T})$ |
| LinTS-Cascade [28] | ✓ | ✓ | – | Linear | $\tilde{O}(d^{3/2}K\sqrt{T})$ |
| CascadeWOFUL [25] | ✓ | ✓ | – | Linear | $\tilde{O}(\sqrt{d^2T} + \sqrt{dTK})$ |
| VAC2-UCB [20] | ✓ | ✓ | – | Linear | $\tilde{O}(d\sqrt{T})$ |
| UCB-CCA (Algorithm 1) | ✓ | ✓ | ✓ | MNL | $\tilde{O}(\frac{1}{\kappa}d\sqrt{T})$ |
| UCB-CCA+ (Algorithm 2) | ✓ | ✓ | ✓ | MNL | $\tilde{O}(d\sqrt{T})$ |

To this end, we design novel upper confidence bound (UCB) algorithms for cascading contextual assortment bandits, tackling both technical challenges. We show that our proposed algorithms achieve provable guarantees on regret, overcoming the longstanding technical challenges; our regret bounds are sharper than those of the existing contextual cascading bandits and assortment bandits. We corroborate our theoretical claims through numerical experiments, ensuring that both our newly proposed bandit framework and the proposed algorithms establish provable efficiency and practical applicability. Our main contributions are summarized as follows.

- We formulate a general combinatorial bandit model, named the cascading contextual assortment bandit, that encompasses the existing cascading bandits and assortment bandits. This novel problem setting is observed in many practical applications.
- We first propose a UCB bandit algorithm, UCB-CCA, for the cascading contextual assortment bandit and establish a $T$-step regret upper bound of $\tilde{O}(\frac{1}{\kappa}d\sqrt{T})$ (Theorem 4.3). This regret bound is tighter than the existing bounds for cascading contextual bandits: we not only remove the longstanding, unnecessary dependence on $K$ but also establish the result without dependence on $M$.
- While UCB-CCA is an efficient algorithm achieving both statistical efficiency and strong practical performance (shown in Section 7), its regret bound depends on the inverse of a problem-dependent constant $\kappa$, which can be large in the worst case. To improve the dependence on $\kappa$, we propose our second algorithm, UCB-CCA+, which exploits a new Bernstein-type concentration result that takes into account the local curvature of the MNL model.
We prove that UCB-CCA+ achieves $\tilde{O}(d\sqrt{T})$ without dependence on $\kappa$ in the leading term ($\kappa$ appears only in terms scaling with logarithmic factors), hence significantly improving on the regret of UCB-CCA without increasing the other dependencies. We thus resolve both technical challenges (i) and (ii) above.
- As an independent contribution, we prove that a greedy algorithm for the cascading assortment optimization problem gives a 0.5-approximation of the optimal solution (discussed in Section 6). To the best of our knowledge, this is the first rigorous result establishing such an approximation guarantee even for the contextual cascading bandit problem, rather than simply assuming access to an approximate optimization oracle.
- We evaluate our proposed methods in numerical experiments and show that their practical performance supports our theoretical claims. Hence, our proposed algorithms, along with our newly proposed bandit model, establish provable efficiency and practical applicability.

## 2 Preliminaries

### 2.1 Notation

Define $[n]$ as the set of positive integers from 1 to $n$. Let $|\cdot|$ denote the length of a sequence or the cardinality of a set. For a vector $x \in \mathbb{R}^d$, we denote the 2-norm of $x$ as $\|x\|_2$ and the $V$-weighted norm of $x$ for a positive-definite matrix $V$ as $\|x\|_V = \sqrt{x^\top V x}$. The determinant and trace of a matrix $V$ are $\det(V)$ and $\operatorname{trace}(V)$, respectively. $\lambda_{\min}(V)$ denotes the minimum eigenvalue of $V$.

### 2.2 Cascading Contextual Assortment Bandit Problem

Consider a set $[N]$ of $N$ items. Let $\mathcal{A}$ be the set of candidate assortments of size $M$, i.e., $\mathcal{A} := \{A \subseteq [N] : |A| = M\}$. A cascade $S$ is an ordered sequence of $K$ assortments chosen from $\mathcal{A}$ in which all items across the $K$ assortments are distinct. The set $\mathcal{S}$ of all feasible cascades can then be defined as
$$\mathcal{S} := \big\{(A_1, \ldots, A_K) \,\big|\, A_k \in \mathcal{A} \text{ for all } k \in [K],\ A_k \cap A_{k'} = \emptyset \text{ for all } k \neq k'\big\}.$$

At round $t$, feature vectors $\{x_{ti} \in \mathbb{R}^d : i \in [N]\}$ for every item are revealed to the decision-making agent. Each feature vector $x_{ti}$ may contain the contextual information of the user at round $t$ and of item $i$. After observing this contextual information, the agent recommends a cascade $S_t = (A_{tk})_{k\in[K]}$ to the user, where $A_{tk} \in \mathcal{A}$ is the $k$-th assortment of the cascade at round $t$. The user scans the assortments in $S_t$ one by one. If the items in $A_{tk}$ do not attract the user, the user moves on to the next assortment $A_{t,k+1}$. The user stops at the $O_t$-th assortment when the user is attracted by an item in that assortment and clicks on it. After the user clicks on the item, the agent observes a sequence of user choices $y_t = (y_{tk})_{k\in[O_t]}$, where the binary vector $y_{tk} = (y_{tk0}, y_{tk1}, \ldots, y_{tkM})$ represents the user's choices on assortment $A_{tk}$. We set $y_{tkm} = 1$ if the $m$-th item $i_m$ in $A_{tk}$ is clicked by the user, and $y_{tkm} = 0$ for items that are not clicked. For each assortment, there is also an outside option: with some probability the user clicks none of the items in $A_{tk}$, in which case $y_{tk0} = 1$ and $y_{tkm} = 0$ for all $m \in [M]$.

The user choice for each assortment is given by the multinomial logit (MNL) choice model [21]. For this MNL model, there is an unknown time-invariant parameter $\theta^* \in \mathbb{R}^d$. We define the true weight of item $i$ in round $t$ as $w^*_{ti} := x_{ti}^\top \theta^*$, and, for convenience, we write the vector of true weights as $w^*_t = (w^*_{ti})_{i\in[N]}$.
Under this model, the probability that the user clicks the $m$-th item in $A_{tk}$ and the probability of the outside option in $A_{tk}$ are given respectively by
$$p_t(i_m \mid A_{tk}, w^*_t) = \frac{\exp(w^*_{t i_m})}{1 + \sum_{j \in A_{tk}} \exp(w^*_{tj})} \quad \text{and} \quad p_t(i_0 \mid A_{tk}, w^*_t) = \frac{1}{1 + \sum_{j \in A_{tk}} \exp(w^*_{tj})},$$
where item $i_0$ represents the outside option. The user choice $y_{tk}$ is sampled from the multinomial distribution $y_{tk} \sim \text{MNL}\big(1, \{p_t(i_m \mid A_{tk}, w^*_t)\}_{m=0}^{M}\big)$, where the argument 1 indicates that $y_{tk}$ is a single-trial sample. Hence, $\sum_{m=0}^{M} y_{tkm}$ is always 1. We denote the measurement noise as $\epsilon_{tkm} := y_{tkm} - p_t(i_m \mid A_{tk}, w^*_t)$. Since $y_{tkm}$ is bounded in $[0, 1]$, $\epsilon_{tkm}$ is $\sigma^2$-sub-Gaussian with $\sigma^2 = 1/4$. It is important to note that the $\epsilon_{tkm}$ across items in the same assortment are not independent, due to the substitution effect in the MNL model.

The expected reward function of a combinatorial action $S_t$ under $w^*_t$ is given by
$$f(S_t, w^*_t) = 1 - \prod_{k=1}^{K} p_t(i_0 \mid A_{tk}, w^*_t).$$
The formulation above is also known as the cascade model with a disjunctive objective, where the user stops at the first attractive item [13; 14; 18].

### 2.3 α-Approximation Oracle and α-Regret

The exact combinatorial optimization required to compute an optimal cascade of assortments can be computationally expensive. Therefore, we allow for approximate optimization. We assume that the agent has access to an α-optimization oracle that computes an α-approximate solution of the cascade optimization problem, with $\alpha \le 1$. For approximate optimization, we prove that greedy selection for the cascading assortment optimization problem gives a 0.5-approximation of the optimal solution, which may be of independent interest (see Section 6). Formally, given an α-optimization oracle and a weight parameter $w$, the oracle outputs an approximately optimal cascade $\hat{S} = \mathcal{O}^\alpha(w) \in \mathcal{S}$ satisfying $f(\hat{S}, w) \ge \alpha f(S^*, w)$, where $S^* \in \arg\max_{S\in\mathcal{S}} f(S, w)$ is an optimal cascade without approximation. The instantaneous α-regret of cascade $S_t$ in round $t$ is defined as
$$R^\alpha(t, S_t) := \mathbb{E}\big[\alpha f(S^*_t, w^*_t) - f(S_t, w^*_t)\big],$$
where $S^*_t \in \arg\max_{S\in\mathcal{S}} f(S, w^*_t)$ is the true optimal cascade. The goal of the agent is then to minimize the cumulative α-regret
$$R^\alpha(T) = \sum_{t=1}^{T} R^\alpha(t, S_t) = \sum_{t=1}^{T} \mathbb{E}\big[\alpha f(S^*_t, w^*_t) - f(S_t, w^*_t)\big].$$

### 2.4 Suboptimal Dependence on Problem-Dependent Parameters

In this subsection, we discuss the main technical challenges faced in the regret analysis of our problem setting. In particular, the suboptimal dependence on the cascade length $K$ has been a long-standing open problem even in contextual cascading bandits.

#### 2.4.1 Dependence on the Length of the Cascade K

The previous literature on contextual cascading bandits [18; 17; 26; 25] bounds the instantaneous regret in each round using the monotonicity and Lipschitz continuity of the expected reward function $f$. A simple adaptation of the previously known techniques to our problem would result in the following upper bound on the instantaneous regret $R^\alpha(t, S_t)$ for $S_t = (A_{t1}, A_{t2}, \ldots, A_{tK})$:
$$R^\alpha(t, S_t) \le \sum_{k=1}^{K}\sum_{i\in A_{tk}} \beta_t\,\|x_{ti}\|_{V_t^{-1}}, \tag{1}$$
where $\beta_t$ is a suitable confidence radius chosen by the algorithm and $V_t$ is a positive-definite Gram matrix. The dependence on the cascade length $K$ and the assortment size $M$ would then appear in the regret bound after summing the right-hand side of Eq. (1) over the time horizon and applying the Cauchy-Schwarz inequality. For this reason, even when $M = 1$, a dependence on $K$ persists in the regret bounds of all previous contextual cascading bandits (see Table 1).
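Before describing how we remove these dependencies, a minimal sketch (with function names of our own choosing) of the model primitives from Section 2.2 may help ground the quantities appearing in Eq. (1) and in the objective $f$:

```python
import numpy as np

def mnl_probs(weights):
    """MNL choice probabilities for one assortment with an outside option.

    weights: array of item weights w_{ti} = x_{ti}^T theta for one assortment.
    Returns (p_items, p_outside) with p_items[m] = exp(w_m)/(1 + sum_j exp(w_j))
    and p_outside = 1/(1 + sum_j exp(w_j)).
    """
    expw = np.exp(np.asarray(weights, dtype=float))
    denom = 1.0 + expw.sum()
    return expw / denom, 1.0 / denom

def expected_reward(cascade_weights):
    """Disjunctive cascade objective f(S, w) = 1 - prod_k p_t(i_0 | A_k, w)."""
    no_click = 1.0
    for w in cascade_weights:
        no_click *= mnl_probs(w)[1]   # user passes assortment k w.p. p(i_0|A_k)
    return 1.0 - no_click

def sample_choice(rng, weights):
    """Single-trial multinomial sample y_{tk}; index 0 is the outside option."""
    p_items, p0 = mnl_probs(weights)
    return rng.choice(len(p_items) + 1, p=np.concatenate(([p0], p_items)))

rng = np.random.default_rng(0)
w_cascade = [np.array([0.2, -0.1]), np.array([0.5, 0.0])]  # K = 2, M = 2
print(expected_reward(w_cascade), sample_choice(rng, w_cascade[0]))
```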
To overcome this challenge, we establish a new Lipschitz continuity property of the expected reward function, which yields a regret bound independent of $M$ and $K$ by replacing the summation in Eq. (1) with a maximum over the assortments in a cascade (see Section 4.3 for details).

#### 2.4.2 Dependence on the Worst-Case Scanning Probability

Analogous to the existing algorithms for contextual cascading bandits [18; 26], the Gram matrix $V_t$ accumulates the rank-1 matrices of the observed items up to round $t$, i.e., $V_t = \sum_{s=1}^{t}\sum_{k=1}^{O_s}\sum_{i\in A_{sk}} x_{si}x_{si}^\top + \lambda I$. However, an out-of-control issue arises: the rank-1 matrices corresponding to assortments $O_t + 1$ through $|S_t|$ in Eq. (1) are not included in the Gram matrix $V_t$. Note that this issue also arises in cascading contextual assortment bandits. Adapting a technique used in the existing literature [18] to mitigate this issue, let $p_{t,S_t}$ be the probability of examining all assortments in $S_t$, and let $p^* = \min_{t\in[T]}\min_{S\in\mathcal{S}} p_{t,S}$ be the worst-case probability of examining a full cascade over all rounds and all feasible cascades. A simple adaptation of the existing methods would then bound the expected instantaneous regret as
$$\mathbb{E}\big[R^\alpha(t, S_t)\big] \le \frac{1}{p^*}\,\mathbb{E}\big[R^\alpha(t, S_t)\,\mathbf{1}\{O_t = |S_t|\}\big].$$
This concedes a dependence on $p^*$, which can be exponentially small in the worst case. We overcome this challenge by designing an algorithm that offers the assortment containing the most uncertain item as the first assortment in the cascade. We discuss this salient feature of the proposed algorithm in more detail in Section 3.2.

## 3 Algorithm: UCB-CCA

**Algorithm 1** UCB-CCA
Input: confidence radius $\beta_t$ and ridge penalty parameter $\lambda \ge 1$
1: for $t = 1, \ldots, T$ do
2: &nbsp;&nbsp;Observe $x_{ti}$ for all $i \in [N]$
3: &nbsp;&nbsp;Compute $u_{ti} = x_{ti}^\top \hat{\theta}_{t-1} + \beta_{t-1}\|x_{ti}\|_{V_{t-1}^{-1}}$ for all $i \in [N]$
4: &nbsp;&nbsp;Compute a candidate cascade $S'_t = (A'_{tk})_{k\in[K]} = \mathcal{O}^\alpha(u_t)$
5: &nbsp;&nbsp;Find the optimistic exposure assortment index $k^*$ via $(k^*, i^*) = \arg\max_{k\in[K],\, i\in A'_{tk}} \|x_{ti}\|_{V_{t-1}^{-1}}$
6: &nbsp;&nbsp;Optimistic exposure swap: $S_t \leftarrow (A_{tk})_{k\in[K]}$, where $A_{tk} := A'_{tk^*}$ if $k = 1$; $A'_{t1}$ if $k = k^*$; and $A'_{tk}$ otherwise
7: &nbsp;&nbsp;Offer $S_t$, and observe user feedback $O_t$ and $y_t = (y_{tk})_{k\in[O_t]}$
8: &nbsp;&nbsp;Update $V_t \leftarrow V_{t-1} + \sum_{k=1}^{O_t}\sum_{i\in A_{tk}} x_{ti}x_{ti}^\top$
9: &nbsp;&nbsp;Compute the regularized MLE $\hat{\theta}_t$ by solving $\nabla \ell(\theta) = 0$ for the regularized negative log-likelihood $\ell$ of Section 3.3
10: end for

### 3.1 Upper Confidence Bounds and Confidence Set

UCB-CCA uses the upper confidence bound (UCB) technique [6; 1; 16] to compute an optimistic action based on optimistic estimates of each item's weight, $u_{ti} = x_{ti}^\top \hat{\theta}_{t-1} + \beta_{t-1}(\delta)\,\|x_{ti}\|_{V_{t-1}^{-1}}$ for all $i \in [N]$. The confidence radius $\beta_t(\delta)$ is specified to maintain a high-probability confidence set for the unknown parameter $\theta^*$,
$$\mathcal{C}_t(\delta) := \big\{\theta \in \mathbb{R}^d : \|\hat{\theta}_t - \theta\|_{V_t} \le \beta_t(\delta)\big\}, \tag{2}$$
although the algorithm never explicitly constructs $\mathcal{C}_t(\delta)$. Setting a proper confidence radius $\beta_t(\delta)$ guarantees that $\theta^*$ lies within the confidence set with probability at least $1 - \delta$. On the event $\theta^* \in \mathcal{C}_t(\delta)$, the UCB weight $u_{ti}$ serves as an upper bound on the true weight $w^*_{ti}$ for every item $i \in [N]$. We denote the UCB weight vector by $u_t = (u_{ti})_{i\in[N]}$ for convenience.

### 3.2 Optimistic Exposure Swapping

A distinctive element of UCB-CCA is what we call optimistic exposure swapping, a procedure crucial for eliminating the dependence on the worst-case scanning probability discussed in Section 2.4.2. This technique strategically places the assortment containing the item with the highest uncertainty among the top $MK$ items in the first slot of the cascade of assortments. In each round $t$, the α-approximation oracle $\mathcal{O}^\alpha(u_t)$ outputs a candidate cascade $S'_t$ determined by the UCB weights $u_t$. It is important to note that $S'_t$ is not immediately presented to the user.
Instead, after $S'_t$ is derived from the optimization oracle $\mathcal{O}^\alpha(u_t)$, the algorithm identifies the index $k^*$ of the assortment that contains the item with the largest $\|x_{ti}\|_{V_{t-1}^{-1}}$. The algorithm then swaps positions: assortment $A'_{tk^*}$ is moved to the top of $S_t$, becoming $A_{t1}$, and the initially top assortment $A'_{t1}$ is relocated to the $k^*$-th position of $S_t$, becoming $A_{tk^*}$. The positions of the other assortments remain unchanged, i.e., $A_{tk} = A'_{tk}$ for all $k \in [K] \setminus \{1, k^*\}$.³ This procedure is viable because the expected reward is unaffected by the display order of the assortments in the cascade, as shown in Lemma 4.5.

³While the swapping occurs between the first and the $k^*$-th positions for specificity, it suffices to place $A'_{tk^*}$ in the top position of $S_t$; the sequence of the remaining assortments is not critical.

### 3.3 Regularized Maximum Likelihood Estimation

UCB-CCA computes a regularized maximum likelihood estimate of the unknown parameter $\theta^*$. The negative log-likelihood is given by
$$\ell_t(\theta) = -\sum_{s=1}^{t-1}\sum_{k=1}^{O_s}\sum_{m=0}^{M} y_{skm}\,\log p_\theta(i_m \mid A_{sk}, w_s),$$
where $w_s = (w_{si})_{i\in[N]}$ is the weight vector with elements $w_{si} = x_{si}^\top \theta$. For penalty parameter $\lambda \ge 1$, the $\ell_2$-regularized MLE is given by
$$\hat{\theta}_t = \arg\min_{\theta\in\mathbb{R}^d}\; \ell_t(\theta) + \frac{\lambda}{2}\|\theta\|_2^2. \tag{3}$$
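Putting Sections 3.1 through 3.3 together, one round of Algorithm 1 can be sketched as follows. Here `oracle` and `user` are stand-ins for the α-approximation oracle $\mathcal{O}^\alpha$ and the MNL user feedback, the Sherman-Morrison update is an implementation convenience, and the MLE refit of line 9 is only indicated in a comment:

```python
import numpy as np

def ucb_cca_round(X, theta_hat, V_inv, beta, oracle, user):
    """One round of Algorithm 1 (a sketch). X is the N x d matrix of round-t
    features; oracle(u) returns a cascade (list of K item-id lists) and
    user(cascade) returns (O_t, choices); both are assumed stand-ins."""
    # Line 3: optimistic weights u_{ti} = x^T theta_hat + beta ||x||_{V^{-1}}
    bonus = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))
    u = X @ theta_hat + beta * bonus
    # Line 4: candidate cascade from the alpha-approximation oracle
    cascade = oracle(u)
    # Lines 5-6: optimistic exposure swap -- the assortment holding the most
    # uncertain item (largest bonus) moves to the first slot
    k_star = max(range(len(cascade)),
                 key=lambda k: max(bonus[i] for i in cascade[k]))
    cascade[0], cascade[k_star] = cascade[k_star], cascade[0]
    # Line 7: offer the cascade; the user examines assortments 1..O_t
    O_t, choices = user(cascade)
    # Line 8: rank-1 updates for every examined item (Sherman-Morrison on V^{-1})
    for k in range(O_t):
        for i in cascade[k]:
            v = V_inv @ X[i]
            V_inv = V_inv - np.outer(v, v) / (1.0 + X[i] @ v)
    # Line 9 (omitted here): re-fit theta_hat by solving the regularized MLE of
    # Section 3.3 on all observed choices, e.g., by Newton's method.
    return cascade, O_t, choices, V_inv
```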
## 4 Regret Analysis of UCB-CCA

### 4.1 Regularity Conditions

**Assumption 4.1.** $\|x_{ti}\|_2 \le 1$ for every round $t$ and item $i \in [N]$, and $\|\theta^*\|_2 \le 1$.

**Assumption 4.2.** There exists $\kappa > 0$ such that for every round $t \in [T]$, any assortment $A \in \mathcal{A}$, and any item $i \in A$,
$$\inf_{\theta \in \mathbb{R}^d}\; p_t(i \mid A, w)\, p_t(i_0 \mid A, w) \ge \kappa,$$
where $w = (w_i)_{i\in[N]}$ and $w_i = x_{ti}^\top \theta$.

**Discussion of assumptions.** Assumption 4.1 makes the regret bound independent of the scale of the feature vectors and the parameter; it is standard in the contextual bandit literature [1; 18; 22]. Assumption 4.2 is the standard regularity assumption in the contextual assortment bandit literature [8; 27; 22; 7; 23], adapted from the standard assumption on the link function in the generalized linear contextual bandit literature [16], and ensures that the Fisher information matrix is non-singular.

### 4.2 Regret Bound of UCB-CCA

**Theorem 4.3** (α-regret upper bound of UCB-CCA). Suppose Assumptions 4.1 and 4.2 hold, and run UCB-CCA for $T$ rounds with $\lambda \ge 1$ and
$$\beta_t = \frac{1}{2\kappa}\sqrt{d\log\Big(1 + \frac{tKM}{d\lambda}\Big) + 4\log t + \lambda}.$$
Then the α-regret of UCB-CCA is upper-bounded by
$$R^\alpha(T) = O\bigg(\Big(\frac{K}{K+1}\Big)^{K+1}\,\frac{d\sqrt{T}}{\kappa}\,\sqrt{\log\Big(1 + \frac{TKM}{d\lambda}\Big) + 4\log T + \lambda}\,\bigg) = \tilde{O}\Big(\frac{1}{\kappa}\,d\sqrt{T}\Big).$$

**Discussion of Theorem 4.3.** Theorem 4.3 establishes that UCB-CCA achieves a regret bound of $\tilde{O}(\frac{1}{\kappa}d\sqrt{T})$. Notably, this bound removes the dependence on $p^*$ completely and removes the polynomial dependence on $K$, matching the best-known bound for contextual cascading bandits [18; 25; 20], a special case of our problem setting. Apart from the generalization considered in this work, a key distinction between our work and previous contextual cascading bandit models lies in our adoption of the MNL model, as opposed to the linear model assumed by the existing literature. This model choice introduces the dependence on the parameter $\kappa$ in the regret bound of UCB-CCA, an aspect we address and refine in subsequent sections. It is essential to highlight that our work tackles a more general problem yet achieves improved bounds with respect to the key problem parameters previously considered suboptimal. The factor $(K/(K+1))^{K+1}$ in Theorem 4.3, which involves the length of the cascade, is also notable: it is bounded above by 1 regardless of the value of $K$. Consequently, the regret bound does not increase polynomially with $K$, ensuring scalability.

### 4.3 Proof Outline

In this subsection, we present a proof sketch of Theorem 4.3. One of the key components of the regret analysis that enables carving off the dependence on $K$ is the following lemma.

**Lemma 4.4** (Maximal Lipschitz continuity). Suppose $u_{ti} \ge w^*_{ti}$ for all $i \in [N]$. Then
$$f(S_t, u_t) - f(S_t, w^*_t) \le \Big(\frac{K}{K+1}\Big)^{K+1} \max_{A_{tk}\in S_t}\max_{i\in A_{tk}}\,(u_{ti} - w^*_{ti}).$$

Lemma 4.4 demonstrates that the difference between the reward functions under the UCB parameter and the true parameter is upper-bounded by the maximal difference between the UCB and true weights. This implies that the regret bound remains unaffected by increases in the cascade length $K$ or the assortment size $M$.

Eliminating the dependence on $p^*$ is the other key element of our analysis. To this end, we first show that the order of assortments in the cascade model with the disjunctive objective does not affect the expected reward. We formalize this property in the following lemma.

**Lemma 4.5.** Let $p_k$ be the probability that the user clicks on some item in $A_k$. Given a collection of assortments $\{A_1, \ldots, A_K\}$ with click probabilities $\{p_1, \ldots, p_K\}$, the order of display does not matter: for every permutation $\sigma: [K] \to [K]$,
$$1 - \prod_{k=1}^{K}\big(1 - p_{\sigma(k)}\big) = 1 - \prod_{k=1}^{K}\big(1 - p_k\big).$$

Consolidating these key results, we proceed to bound the cumulative regret. We begin by leveraging the monotonicity of the expected reward function and the definition of the α-approximate optimization oracle:
$$R^\alpha(t, S_t) \le \mathbb{E}\big[f(S_t, u_t) - f(S_t, w^*_t)\big] \le C_K\,\mathbb{E}\Big[\max_{A_{tk}\in S_t}\max_{i\in A_{tk}}\,(u_{ti} - w^*_{ti})\Big] \le 2\beta_t C_K\,\mathbb{E}\Big[\max_{A_{tk}\in S_t}\max_{i\in A_{tk}}\,\|x_{ti}\|_{V_{t-1}^{-1}}\Big] = 2\beta_t C_K\,\mathbb{E}\Big[\max_{k\in[O_t]}\max_{i\in A_{tk}}\,\|x_{ti}\|_{V_{t-1}^{-1}}\Big]. \tag{4}$$
The second inequality is from Lemma 4.4, letting $C_K := (K/(K+1))^{K+1}$. The third inequality is given by the concentration of the UCB weights (see Lemma B.5). Note that the assortment including the item with the largest value of $\|x_{ti}\|_{V_{t-1}^{-1}}$ is always examined by the user, since the optimistic exposure swapping technique of Section 3.2 places it in the first slot of $S_t$; by Lemma 4.5, this change of order does not affect the expected reward. Hence, for every round $t \in [T]$, we obtain
$$\max_{A_{tk}\in S_t}\max_{i\in A_{tk}}\,\|x_{ti}\|_{V_{t-1}^{-1}} = \max_{k\in[O_t]}\max_{i\in A_{tk}}\,\|x_{ti}\|_{V_{t-1}^{-1}}, \tag{5}$$
which gives the last equality in Eq. (4). We can then apply a maximal version of the elliptical potential lemma (see Lemma B.8) to bound the cumulative regret.
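Lemma 4.5 is elementary but load-bearing for the swapping argument. The following quick numerical check illustrates the order-invariance (a sanity check under arbitrary click probabilities, not part of the analysis):

```python
import numpy as np
from itertools import permutations

# Check Lemma 4.5: 1 - prod_k (1 - p_k) is invariant to the display order.
rng = np.random.default_rng(0)
p = rng.uniform(0.0, 0.5, size=4)                      # p_1, ..., p_K
rewards = {round(1 - np.prod(1 - np.array(perm)), 12)  # reward per ordering
           for perm in permutations(p)}
assert len(rewards) == 1                               # identical for all K! orders
```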
## 5 Improved Dependence on κ

While UCB-CCA achieves a regret bound of $\tilde{O}(\frac{1}{\kappa}d\sqrt{T})$ and improves the dependence on $K$, the bound includes the problem-dependent constant $\kappa$, so the regret bound may become large when $\kappa$ is very small. To circumvent this challenge, we propose a new optimism-in-the-face-of-uncertainty (OFU) algorithm, UCB-CCA+, whose regret bound is independent of $\kappa$ in the leading term. The pseudocode of UCB-CCA+ is detailed in Algorithm 2.

### 5.1 Algorithm: UCB-CCA+

#### 5.1.1 Confidence Set

UCB-CCA+ computes a regularized MLE $\hat{\theta}_t$ following the same procedure described in Section 3.3. The algorithm then constructs a new confidence set centered at $\hat{\theta}_t$ using a Bernstein-type tail inequality for self-normalized martingales [11; 3]. Nevertheless, a simple adaptation of the previous approaches may incur an increased dependence on $M$; hence, a more intricate analysis and a refined algorithmic strategy are needed to address this challenge effectively.

First, we define
$$g_t(\theta) := \sum_{s=1}^{t}\sum_{k=1}^{O_s}\sum_{i\in A_{sk}} p_\theta(i \mid A_{sk}, w_s)\,x_{si} + \lambda_t\,\theta,$$
where $w_s = (w_{si})_{i\in[N]}$ and $w_{si} = x_{si}^\top\theta$. We denote the partial derivative of $p_t(i \mid A_{tk}, w_t)$ with respect to $w_{ti}$ as $\dot{p}_t(i \mid A_{tk}, w_t) := p_t(i \mid A_{tk}, w_t)\,p_t(i_0 \mid A_{tk}, w_t)$. Additionally, we define the design matrix containing local curvature information,
$$H_t(\theta) := \sum_{s=1}^{t}\sum_{k=1}^{O_s}\sum_{i\in A_{sk}} \dot{p}_\theta(i \mid A_{sk}, w_s)\,x_{si}x_{si}^\top + \lambda_t I_d,$$
and, for convenience, $H_t := H_t(\theta^*)$. The algorithm then constructs the confidence set
$$\mathcal{B}_t(\delta) := \big\{\theta \in \mathbb{R}^d : \|g_t(\hat{\theta}_t) - g_t(\theta)\|_{H_t^{-1}(\theta)} \le \gamma_t(\delta)\big\}, \tag{6}$$
where the confidence radius $\gamma_t(\delta)$ is suitably specified to ensure that the true parameter $\theta^*$ lies in $\mathcal{B}_t(\delta)$ with high probability. On the event $\theta^* \in \mathcal{B}_t(\delta)$, the following lemma bounds the weighted 2-norm of the difference between any member of $\mathcal{B}_t(\delta)$ and $\theta^*$.

**Lemma 5.1.** Suppose $\theta^* \in \mathcal{B}_t(\delta)$. Then, for any $\theta \in \mathcal{B}_t(\delta)$, we have $\|\theta - \theta^*\|_{H_t} \le 6\gamma_t(\delta)$.

**Algorithm 2** UCB-CCA+
Input: confidence radius $\gamma_t$ and ridge penalty parameter $\lambda \ge 1$
1: for $t = 1, \ldots, T$ do
2: &nbsp;&nbsp;Observe $x_{ti}$ for all $i \in [N]$
3: &nbsp;&nbsp;Construct the confidence set $\mathcal{B}_{t-1}(\delta)$ as defined in Eq. (6)
4: &nbsp;&nbsp;Compute a candidate cascade $\big(S'_t = (A'_{tk})_{k\in[K]},\, \theta_t\big) = \arg\max_{S\in\mathcal{S},\,\theta\in\mathcal{B}_{t-1}(\delta)} f(S, w_t)$
5: &nbsp;&nbsp;Find the optimistic exposure assortments $L_{t,H}$ and $L_{t,V}$ (at positions $h$ and $v$) in $S'_t$
6: &nbsp;&nbsp;$S_t \leftarrow (A_{tk})_{k\in[K]}$, where $A_{tk} = L_{t,H}$ if $k = 1$; $L_{t,V}$ if $k = 2$; $A'_{t1}$ if $k = h$; $A'_{t2}$ if $k = v$; and $A'_{tk}$ otherwise
7: &nbsp;&nbsp;Offer $S_t$ and observe $O_t$ and $y_t = (y_{tk})_{k\in[O_t]}$
8: &nbsp;&nbsp;Update $H_t \leftarrow H_{t-1} + \sum_{k=1}^{O_t}\sum_{i\in A_{tk}} \dot{p}_t(i \mid A_{tk}, w_t)\,x_{ti}x_{ti}^\top$
9: &nbsp;&nbsp;Update $V_t \leftarrow V_{t-1} + \sum_{k=1}^{O_t}\sum_{i\in A_{tk}} x_{ti}x_{ti}^\top$
10: &nbsp;&nbsp;Compute the regularized MLE $\hat{\theta}_t$ by solving $\nabla \ell(\theta) = 0$ for the regularized negative log-likelihood of Section 3.3
11: end for

#### 5.1.2 Doubly Optimistic Exposure Swapping

UCB-CCA+ still faces the challenge outlined in Section 2.4.2. The complication is exacerbated for UCB-CCA+ because the algorithm concurrently updates two Gram matrices, $H_t$ and $V_t$, which contain only the information of the observed items. Building upon the optimistic exposure swapping technique of Section 3.2, in each round $t$ UCB-CCA+ assigns the assortment $L_{t,H}$, containing the item with the largest uncertainty with respect to $H_{t-1}$ among the top $KM$ items, to the first slot of the cascade $S_t$. Similarly, the algorithm places $L_{t,V}$, containing the item with the largest uncertainty with respect to $V_{t-1}$, in the second slot of $S_t$.

### 5.2 Regret Analysis of UCB-CCA+

**Theorem 5.2** (Regret upper bound of UCB-CCA+). Suppose Assumptions 4.1 and 4.2 hold, and run UCB-CCA+ for $T$ rounds with $\lambda \ge 1$ and
$$\gamma_t(\delta) := \frac{3\sqrt{\lambda_t}}{2} + \frac{2}{\sqrt{\lambda_t}}\log\bigg(\frac{(\lambda_t + KMt/d)^{d/2}\,\lambda_t^{-d/2}}{\delta}\bigg) + \frac{2d}{\sqrt{\lambda_t}}\log 2,$$
with $\delta = 1/t^2$. Then the α-regret of UCB-CCA+ is upper-bounded by
$$R^\alpha(T) \le C_1\,\gamma_T(\delta)\,\sqrt{dT\log\Big(1 + \frac{KMT}{d\lambda}\Big)} + \frac{C_2}{\kappa}\,\gamma_T(\delta)^2\, d\log\Big(1 + \frac{KMT}{d\lambda}\Big),$$
where $C_1 = 36$ and $C_2 = 216(1 + Me)$.

**Discussion of Theorem 5.2.** Theorem 5.2 establishes a regret bound of $\tilde{O}(d\sqrt{T})$ whose leading term is independent of $\kappa$. Although the second term depends on $\kappa$, it scales only logarithmically in $T$, so its effect diminishes relative to the leading term as $T$ grows. Hence, the worst-case regret guarantee of UCB-CCA+ improves upon that of UCB-CCA. The comprehensive proof of Theorem 5.2 is provided in the appendix.
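For concreteness, the two ingredients that distinguish UCB-CCA+ from UCB-CCA, namely the curvature-weighted update of $H_t$ (Algorithm 2, line 8) and the doubly optimistic swap of Section 5.1.2, can be sketched as follows. This is a sketch under our own naming; edge cases (such as both maxima falling in the same assortment) are ignored for brevity.

```python
import numpy as np

def mnl_probs(w):
    """MNL probabilities (p_items, p_outside) for one assortment (Section 2.2)."""
    e = np.exp(np.asarray(w, dtype=float))
    return e / (1.0 + e.sum()), 1.0 / (1.0 + e.sum())

def update_design_matrices(H, V, X, cascade, O_t, theta_hat):
    """Algorithm 2, lines 8-9 (sketch): accumulate the curvature-weighted
    matrix H_t and the plain Gram matrix V_t over the examined items."""
    for k in range(O_t):
        items = cascade[k]
        p_items, p0 = mnl_probs(X[items] @ theta_hat)
        for m, i in enumerate(items):
            pdot = p_items[m] * p0              # \dot p = p(i|A,w) p(i_0|A,w)
            H = H + pdot * np.outer(X[i], X[i])
            V = V + np.outer(X[i], X[i])
    return H, V

def doubly_optimistic_swap(cascade, bonus_H, bonus_V):
    """Section 5.1.2 (sketch): place the assortment with the largest
    H-uncertainty first and the largest V-uncertainty second."""
    h = max(range(len(cascade)),
            key=lambda k: max(bonus_H[i] for i in cascade[k]))
    cascade[0], cascade[h] = cascade[h], cascade[0]
    v = max(range(1, len(cascade)),
            key=lambda k: max(bonus_V[i] for i in cascade[k]))
    cascade[1], cascade[v] = cascade[v], cascade[1]
    return cascade
```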
## 6 Approximation Algorithm for the Optimal Combinatorial Action

In this section, we show that computing the optimal cascade is, in general, a weakly NP-hard problem, and we give a fast polynomial-time algorithm that the agent can use to compute a 0.5-approximation of the optimal cascade for any weight vector $w$. We assume that the MNL weights are given and consider the problem of finding the cascade that maximizes the expected reward. This is an optimization problem over sequences of $K$ size-$M$ assortments drawn from a ground set of $N$ items; the number of feasible cascades is $\frac{N!}{(N-KM)!\,(M!)^K}$, which grows exponentially in $KM$. In fact, we establish the following hardness result.

**Lemma 6.1.** For general $M$, the optimization problem is weakly NP-hard even for $K = 2$.

The hardness of our problem follows from the hardness result of [19] for the related setting of unconstrained cascade optimization, where the size of each assortment can be arbitrary. In light of this hardness, we turn our attention to fast approximation algorithms. While there has been much recent work on the unconstrained cascade optimization problem (see [19; 12; 10] and the references therein), none of the previous algorithms apply to our constrained setting, where the size of each assortment is $M$. We consider the following greedy approach. Suppose arbitrary weights $w_t = (w_{ti})_{i\in[N]}$ are given at round $t$. We order the items of $[N]$ in decreasing order of weight and form the following cascade of assortments:
$$D_1 = \{1, 2, \ldots, M\},\quad D_2 = \{M+1, M+2, \ldots, 2M\},\quad \ldots,\quad D_K = \{(K-1)M+1, \ldots, KM\}.$$
We call these the decreasing-order assortments. Let OPT denote the value of the optimal solution. We establish that showing these assortments, in any order, gives a good approximation.

**Lemma 6.2.** Let OPT denote the overall click probability of the optimal solution. The decreasing-order assortments $D_1, \ldots, D_K$, shown in any order, achieve an overall click probability of at least $0.5\,\mathrm{OPT}$.

We present the proof in Appendix E. We also show that our algorithm is optimal for $M = 1$, which captures the classic cascade optimization problem.

## 7 Numerical Experiments

*Figure 2: N = 10, K = 2, M = 2, and d = 5.* *Figure 3: N = 15, K = 2, M = 2, and d = 10.*

In this section, we evaluate our proposed algorithms UCB-CCA and UCB-CCA+ in numerical experiments and compare their performance with the existing combinatorial bandit algorithms CombCascade and C3-UCB. For the simulations, we generate a random sample of the unknown time-invariant parameter $\theta^*$ from $N(0, 1)$ at the beginning of the simulation, and we sample $N$ feature vectors from $N(0, 1)$ in each round $t$. At each round $t$, the oracle computes a sequence of assortments in decreasing order of weight, forming a cascade $S_t$ based on the given $w_t$. We assess the cumulative regret of UCB-CCA, UCB-CCA+, CombCascade, and C3-UCB. Note that the user's choice for UCB-CCA and UCB-CCA+ is determined by the MNL choice model, whereas C3-UCB assumes a linear model and CombCascade is non-contextual.

Figure 2 and Figure 3 indicate that both UCB-CCA and UCB-CCA+ significantly outperform C3-UCB and CombCascade. We also observe that UCB-CCA+ shows a slight performance advantage over UCB-CCA, although the difference is not statistically significant. While UCB-CCA+ has the sharper worst-case regret guarantee, UCB-CCA provides favorable practical performance with a simpler implementation and greater computational efficiency.
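The decreasing-order oracle of Section 6, which also serves as $\mathcal{O}^\alpha$ in these simulations, reduces to a sort; a minimal sketch:

```python
import numpy as np

def decreasing_order_cascade(w, K, M):
    """Greedy 0.5-approximation of Section 6: sort items by weight and pack
    them into K disjoint assortments of size M in decreasing order."""
    order = np.argsort(-np.asarray(w))        # item ids by decreasing weight
    return [order[k * M:(k + 1) * M].tolist() for k in range(K)]

# Example with N = 10 items, K = 2 assortments of M = 2 items.
rng = np.random.default_rng(0)
print(decreasing_order_cascade(rng.standard_normal(10), K=2, M=2))
```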
## Acknowledgments and Disclosure of Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00222663, No. 2022R1C1C1006859, No. 2022R1A4A103057912, No. 2021M3E5D2A01024795), by the Creative-Pioneering Researchers Program through Seoul National University, and by Naver.

## References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.
[2] Marc Abeille, Louis Faury, and Clément Calauzènes. Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics, pages 3691–3699. PMLR, 2021.
[3] Priyank Agrawal, Theja Tulabandhula, and Vashist Avadhanula. A tractable online learning algorithm for the multinomial logit contextual bandit. European Journal of Operational Research, 2023.
[4] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Thompson sampling for the MNL-bandit. In Conference on Learning Theory, pages 76–78. PMLR, 2017.
[5] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. MNL-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.
[6] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
[7] Xi Chen, Yining Wang, and Yuan Zhou. Dynamic assortment optimization with changing contextual information. Journal of Machine Learning Research, 2020.
[8] Wang Chi Cheung and David Simchi-Levi. Assortment optimization under unknown multinomial logit choice models. arXiv preprint arXiv:1704.00108, 2017.
[9] Wang Chi Cheung and David Simchi-Levi. Thompson sampling for online personalized assortment optimization problems with multinomial logit choice models. Available at SSRN 3075658, 2017.
[10] Elaheh Fata, Will Ma, and David Simchi-Levi. Multi-stage and multi-customer assortment optimization with inventory constraints. arXiv preprint arXiv:1908.09808, 2019.
[11] Louis Faury, Marc Abeille, Clément Calauzènes, and Olivier Fercoq. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning, pages 3052–3060. PMLR, 2020.
[12] Pin Gao, Yuhang Ma, Ningyuan Chen, Guillermo Gallego, Anran Li, Paat Rusmevichientong, and Huseyin Topaloglu. Assortment optimization and pricing under the multinomial logit model with impatient customers: Sequential recommendation and selection. Operations Research, 69(5):1509–1532, 2021.
[13] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767–776. PMLR, 2015.
[14] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Combinatorial cascading bandits. Advances in Neural Information Processing Systems, 28, 2015.
[15] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[16] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071–2080. PMLR, 2017.
[17] Shuai Li and Shengyu Zhang. Online clustering of contextual cascading bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[18] Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. In International Conference on Machine Learning, pages 1245–1253. PMLR, 2016.
[19] Nan Liu, Yuhang Ma, and Huseyin Topaloglu. Assortment optimization under the multinomial logit model with sequential offerings. INFORMS Journal on Computing, 32(3):835–853, 2020.
[20] Xutong Liu, Jinhang Zuo, Siwei Wang, John C.S. Lui, Mohammad Hajiesmaili, Adam Wierman, and Wei Chen. Contextual combinatorial bandits with probabilistically triggered arms. In International Conference on Machine Learning, pages 22559–22593. PMLR, 2023.
[21] Daniel McFadden. Modeling the choice of residential location. Transportation Research Record, (673), 1978.
[22] Min-hwan Oh and Garud Iyengar. Thompson sampling for multinomial logit contextual bandits. Advances in Neural Information Processing Systems, 32, 2019.
[23] Min-hwan Oh and Garud Iyengar. Multinomial logit contextual bandits: Provable optimality and practicality. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9205–9213, 2021.
[24] Noemie Perivier and Vineet Goyal. Dynamic pricing and assortment under a contextual MNL demand. Advances in Neural Information Processing Systems, 35:3461–3474, 2022.
[25] Daniel Vial, Sujay Sanghavi, Sanjay Shakkottai, and R. Srikant. Minimax regret for cascading bandits. arXiv preprint arXiv:2203.12577, 2022.
[26] Kun Wang. Conservative contextual combinatorial cascading bandit. IEEE Access, 9:151434–151443, 2021.
[27] Xue Wang, Mike Mingcheng Wei, and Tao Yao. Online assortment optimization with high-dimensional data. Available at SSRN 3521843, 2019.
[28] Zixin Zhong, Wang Chi Cheung, and Vincent Y.F. Tan. Thompson sampling algorithms for cascading bandits. The Journal of Machine Learning Research, 22(1):9915–9980, 2021.
[29] Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 2016.