# online_clustering_of_contextual_cascading_bandits__cc3ac68a.pdf

Online Clustering of Contextual Cascading Bandits

Shuai Li, Shengyu Zhang The Chinese University of Hong Kong, Sha Tin, Hong Kong {shuaili, syzhang}@cse.cuhk.edu.hk

We consider a new setting of online clustering of contextual cascading bandits, an online learning problem where the underlying cluster structure over users is unknown and needs to be learned from a random preﬁx feedback. More precisely, a learning agent recommends an ordered list of items to a user, who checks the list and stops at the ﬁrst satisfactory item, if any. We propose an algorithm of CLUB-cascade for this setting and prove an n-step regret bound of order O( n). Previous work (Li et al. 2016) corresponds to the degenerate case of only one cluster, and our general regret bound in this special case also signiﬁcantly improves theirs. We conduct experiments on both synthetic and real data, and demonstrate the effectiveness of our algorithm and the advantage of incorporating online clustering method.

Introduction Most recommendation systems nowadays display items in an ordered list. Examples include typical hotels/restaurants/goods recommendation, search engines, etc. This is especially the case for apps or games recommendations on mobile devices due to the limited size of screen. Click behaviors and feedback in such ordered lists have their distinctive features, and a cascade model was recently developed for studying feedback of user click behaviors (Craswell et al. 2008). In the model, after a user receives a list of items, she checks the items in the given order and clicks the ﬁrst satisfactory one. After the click, she stops checking the rest items in the list. The learning agent receives the feedback of the click and knows that the items before the clicked one have been checked and are unsatisfactory, but whether the user likes any items after the clicked one is unknown. The cascade model is straightforward but effective in characterizing user behaviors (Chuklin, Markov, and Rijke 2015). In this paper, we consider an online learning variant of cascading bandits (Kveton et al. 2015a; 2015b). In our model, the learning agent uses exploration-exploitation techniques to learn the preferences of users over items by interactions with users. At each time step, the learning agent recommends a list of items to the current user, observes the click feedback, and receives a reward of 1 if the user clicks

Copyright c 2018, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

on an item (and receives reward 0 otherwise). The learning agent aims to maximize its cumulative rewards after n rounds. Previous work (Li et al. 2016; Zong et al. 2016) considered a setting of linear cascading bandits to deal with ever-changing set of items. Roughly speaking, the learning agent adaptively learns a linear mapping between expected rewards and features of items and users. One important limit of the linear cascading bandit algorithms is that they mainly work in a content-dependent regime, discarding the often useful method of collaborative ﬁltering. One way to utilize the collaborative effect of users is to consider their clustering structure. In this paper, we formulate the problem of online clustering of contextual cascading bandits, and design an algorithm to learn the clustering information and extract user feature vectors adaptively with low cumulative regret. Following the approach in (Gentile, Li, and Zappella 2014), we use a dynamic graph on all users to represent clustering structure, where an edge indicates the similarity between the two users. Edges between different clusters are gradually removed as the algorithms learns from the feedback that the pairs of users are not similar. We prove an upper bound of O(d

mn K ln(n)) for the cumulative regret, where m is the number of clusters, d is the dimension of feature space, n is the number of rounds, and K is the number of recommended items. This extends and improves the existing results in the degenerate setting of only one cluster (m = 1). Finally we experiment on both synthetic and real datasets to demonstrate the advantage of the model and algorithm. The organization of this paper is as follows. We ﬁrst introduce previous work related to our setting, then formulates the setting of Online Clustering of Contextual Cascading Bandits with some appropriate assumptions. Next we give our UCB-like algorithm CLUB-cascade and the cumulative regret bound, which is better than the existing results in the degenerate case. Then we report experimental results on both synthetic data and real data to demonstrate the advantage of incorporating online clustering. Last is the conclusion of the paper.

Related Work (Kveton et al. 2015a; Katariya et al. 2016) introduced the click model with cascading feedback and DCM feedback to the MAB framework, which describes the random feedback

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

dependent on the display order of items. In the cascading feedback, a user clicks the ﬁrst satisfying items and stops checking further, while in the DCM feedback, after a user clicks an item, there is a chance that she is not satisﬁed and continues checking. (Kveton et al. 2015b) considered the problem where the random feedback stops at the ﬁrst default position (reward 0), in comparison with the ﬁrst success position (reward 1) in the cascade setting. Even though the settings are similar, the techniques are totally different because of the asymmetry of the binary OR function and binary AND function. (Zoghi et al. 2017) brought up an online elimination algorithm to deal with different click models. All the above works focused on the setting of ﬁxed item set. (Li et al. 2016) generalized both the cascade setting and combinatorial cascade setting with contextual information, position discounts and more general reward functions. For the binary OR case, they provided a regret bound for n rounds with order O( d

n K ln(n)) where p is probability to check all recommended items and could be small. At the same time, (Zong et al. 2016) also generalized the cascade setting with linear payoff and brought up a UCB-like algorithm, Cascade Lin UCB, as well as a Thompson sampling (TS) algorithm without a proof. They proved a regret bound of n rounds for the Cascade Lin UCB algorithm of order O(d K n ln(n)). In this paper, we consider the basic cascade setting, where the random feedback stops at the ﬁrst click position, together with the online clustering to explore user structure. We provide a regret bound of order O(d

mn K ln(n)). Cast in this framework, the existing results studied the degenerate case of m = 1. The work (Gentile, Li, and Zappella 2014) ﬁrst considered online clustering of linear bandits, and maintained a graph among users and used connected components to denote user clusters. A follow-up (Li, Karatzoglou, and Gentile 2016) explored item structures to help cluster users and to improve recommendation performance. (Gentile et al. 2017) considered a variant where the clusters over users are dependent on the current context. In this paper, we employ some idea of the ﬁrst paper, and manage to make it to work with random feedback in click models. (Combes et al. 2015) considered a similar setting of clustered users with cascade feedback, but in their paper, the clusters are ﬁxed and known to the learning agent. In our paper, the cluster structure is unknown and has to be learned by the learning agent.

Problem Setup

In this section, we formulate the problem of Online Clustering of Contextual Cascading Bandits . In this problem, there are u users, denoted by set [u] = {1, . . . , u}. At each time step t, a user it comes to be served with contents and the learning agent receives the user index with a ﬁnite feasible content set Dt Rd 1, where x 2 1 for all x Dt. Then the learning agent recommends a ranked list of distinct K items Xt = (x1, ..., x K) ΠK(Dt) to the user. The user checks the items in the order from the ﬁrst one to the last one, clicks the ﬁrst attractive item, and stops checking after the click. We use the Bernoulli random variable yt,k to indicate whether the item xt,k has been clicked or not.

The learning agent receives the feedback of the index of the clicked item, that is

Ct = inf{k : yt,k = 1}. (1)

Note that inf( ) = and Ct = represents that the user does not click on any given item. Let Kt = min{Ct, K}. The user checks the ﬁrst Kt items and the learning agent receives the feedback {yt,k, k = 1, . . . , Kt}. Let Ht be the entire history information until the end of round t. Then the action Xt is Ht 1-adaptive. We will write Et[ ] for E[ |Ht 1] for convenience of notation, use the boldface symbols to denote random variables, and denote [m] = {1, . . . , m}. We assume the probability of clicking on an item to be a linear function of item feature vector. Speciﬁcally there exists a vector θit Rd 1, θit 2 1 for user it, such that the expectation of the binary click feedback y on the checking item x is given by the inner product of x with θit, i.e., Et[y|x] = θ itx, (2) independently of any other given item. We assume that there are m clusters among the users, where m u, and the partition of the clusters is ﬁxed but unknown. Speciﬁcally we use I1, . . . , Im to denote the true clusters and a mapping function j : [u] [m] to map user i to its true cluster index j(i) (user i belongs to cluster Ij(i)). We assume the order of user appearance and the set of feasible items are not under the control of the learning agent. In addition, we assume clusters, users, and items satisfy the following assumptions.

Cluster regularity All users in the same cluster Ij share the same θ, denoted as θj. Users in different clusters have a gap between their θ s, that is

θj θj γ > 0

for any j = j .

User uniformness At each time step t, the user is drawn uniformly from the set of all users [u], independently over past.

Item regularity At each time step t, given the size of Dt, the items in Dt are drawn independently from a ﬁxed distribution x with x 2 1, and E[xx ] is full rank with minimal eigenvalue λx > 0. Also at all time t, for any ﬁxed unit vector θ Rd, given the size of Dt, (θ X)2 has sub-Gaussian tail with variance parameter σ2 λ2 x/(8 log(4|Dt|)). Note that the above assumptions follow the settings from previous work (Gentile, Li, and Zappella 2014). We will have more discussions on these in later section. At each time step t, the reward of X = (x1, . . . , x K) ΠK(Dt) under the random click result yt = (yt(x))x D (if known) is

f(X, yt) = 1

k=1 (1 yt(xk)).

By independence assumption, it is easily veriﬁed that the expectation of f(X, yt) is

f(X, yt) = 1

k=1 (1 yt(xk)),

where yt(x) = θ j(it)x and yt = (yt(x))x Dt. Let

X t = argmax X ΠK(Dt)ft(X, yt)

be the optimal action in round t. Then the regret in time step t is Rt(X, yt) = f(X t , yt) f(X, yt). The goal for the learning agent is to minimize the expected cumulative regret

t=1 Rt(Xt, yt) . (3)

Algorithm and Results

Notations Our main algorithm is given in Algorithm 1. Before diving into details, let us deﬁne some useful notations used in later analysis. For any time step t and user i, deﬁne

k=1 xs,kx s,k, bi,t =

k=1 ys,kxs,k,

s t, is=i Ks

to be the Gramian matrix, the moment matrix of regressand by regressors, and the number of effective feedbacks for user i up to time t, respectively. Let = I [u] be any nonempty user index subset and

M I,t = λId +

i I Si,t, b I,t =

ˆθI,t = M 1 I,tb I,t, TI,t =

i I Ti,t (5)

be the regularized Gramian matrix, the moment matrix of regressand, the estimate by ridge regressors, and the frequency associated with user set I and regularization parameter λ > 0 up to time t, respectively.

Algorithm The algorithm maintains an undirected graph structure on all users Gt = ([u], Et), where an edge exists between a pair of users if they are similar. The collection of the connected components represents a partition of the users. The learning agent starts with a complete graph over all users and initializes Gramian matrix and the moment matrix of regressand for each user i (Line 2). At each time step t, the learning agent receives a user index it and a feasible ﬁnite content set Dt (Line 4), where x 2 1 for all x Dt. From the current graph structure on users

Algorithm 1 CLUB-cascade

1: Input: λ, α, β > 0 2: Initialize: G0 = ([u], E0) is a complete graph over all users, Si,0 = 0d d, bi,0 = 0d 1, Ti,0 = 0 for all i [u]. 3: for all t = 1, 2, . . . , n do 4: Receive user index it, and the feasible context set Dt Rd 1; 5: Find the connected component Vt for user it in the current graph Gt 1 = ([u], Et 1), and compute

M Vt,t 1 = λI +

i Vt Si,t 1,

i Vt bi,t 1,

ˆθVt,t 1 = M 1 t bt;

6: For all x Dt, compute

U t(x) = min{ˆθ Vt,t 1x + β

x M 1 Vt,t 1x, 1}; (4) 7: Recommend a list of K items Xt = (xt,1, . . . , xt,K) with largest U t( ) values and receive feedback Ct {1, . . . , K, }; 8: Update statistics

Sit,t = Sit,t 1 +

k=1 xt,kx t,k,

bit,t = bit,t 1 +

k=1 xt,k1{Ct = k},

Tit,t = Tit,t 1 + Kt,

where Kt = min{Ct, K} and update

ˆθit,t = (λI + Sit,t) 1bit,t;

9: Update Sℓ,t = Sℓ,t 1, bℓ,t = bℓ,t 1, Tℓ,t = Tℓ,t 1, ˆθℓ,t = ˆθℓ,t 1 for all ℓ = it; 10: Delete the edge (it, ℓ) Et 1, if

ˆθit,t ˆθℓ,t 2 α

1 + ln(1 + Tit,t)

1 + ln(1 + Tℓ,t)

and obtain a new graph Gt = ([u], Et); 11: end for t

Gt 1 = ([u], Et 1), the agent ﬁnds the connected component Vt containing user it and computes the Gramian matrix, the moment matrix of regressand, and the estimates ˆθVt,t 1 by ridge regressor associated with set Vt up to time t 1 (Line 5). Then it uses this ˆθVt,t 1 as the estimate for the true weight vector θj(it) to compute the upper conﬁdence bound of the expected reward θ j(it)x for each item x Dt (Line 6). This step relies on the following lemma, which gives the theoretical guarantee of the ridge regression estimate for the true weight vector.

Lemma 1 Suppose (x1, y1), . . . , (xt, yt), . . . are generated sequentially from a linear model such that xt 1 for all t, E[yt|xt] = θ xt for ﬁxed but unknown θ with norm at most 1, and {yt θ xt}t=1,2,... have R-sub-Gaussian tails. Let Mt = λI + t s=1 xsx s , bt = t s=1 xsys, and δ > 0. If ˆθt = M 1 t bt is the ridge regression estimator of θ , then with probability at least 1 δ, for all t 0,

=: β(t, δ).

This Lemma is by Theorem 2 of (Abbasi-Yadkori, P al, and Szepesv ari 2011). When the current cluster is correct (which is guaranteed after O(ln(n)) rounds and to be proved later), i.e. Vt = Ij(it), θj(it) ˆθVt,t 1 M Vt,t 1 β(TVt,t 1, δ) β(n, δ).

(Here for a positive-deﬁnite matrix M, deﬁne the norm x M =

x T Mx. It is not hard to verify that if M 0, the dual norm x M is x M 1.) Then by Cauchy-Schwarz inequality, we have ˆθ Vt,t 1x θ j(it)x ˆθVt,t 1 θj(it) MVt,t 1 x M 1 Vt,t 1

β(n, δ) x M 1 Vt,t 1 ,

which results in a conﬁdence interval for the expected reward θ j(it)x on each item x Dt. Next the learning agent recommends a list of K items Xt = (x1, . . . , x K) which have the largest upper conﬁdence bounds. The user it checks the recommended items from the ﬁrst one, clicks on the ﬁrst satisfactory item, and stops checking anymore. Then the learning agent receives feedback Ct {1, . . . , K, } (Line 7). Ct {1, . . . , K} means that the user clicks Ct-th item and the ﬁrst Ct 1 items are not satisfactory, while the items after Ct-th position are not checked by the user. Ct = means that the user has checked all recommended items but none of them is satisfactory. Based on the feedbacks, the learning agent updates its statistics on user it (Line 8) but not on other users (Line 9). Based on the updates, the weight vector estimate for the user it might change and the similarity with other users might be veriﬁed false. The learning agent checks the edge

of (it, ℓ) Et 1 for any user ℓthat is linked to user it and deletes it if the distance between the two estimated weight vectors is large enough (Line 10).

Analysis The following theorem gives a bound on the cumulative regret achieved by our algorithm CLUB-cascade.

Theorem 2 Suppose the cluster structure on the users, user appearance, and items satisfy the assumptions stated in the section of Problem Setup with gap parameter γ > 0 and item regularity parameter 0 < λx 1. Let λ, K be the regularization constant and the number of recommended items in each round. Let λ K, β =

λd + 2 ln(4mn) +

λ and α = 4

d/λx, where d, m, u denotes the feature dimension, the number of clusters and the number of users, respectively. Then the cumulative regret of CLUB-cascade algorithm for n rounds satisﬁes

2dmn K ln 1 + n K

λ2x ) ln(n) +

mn K ln(n) .

For the degenerate case when m = 1, our result improves the existing regret bounds.

Corollary 3 When the number of clusters m = 1, that is all users are treated as one, let λ = K and β =

λd + 2 ln(4n) +

λ. Then the cumulative regret of CLUB-cascade after n rounds satisﬁes

dn ln(1 + n

λd) + 2 ln(4n) +

2d K ln 1 + n K

n K ln(K)). (7)

Note this result improves the existing results(Li et al. 2016; Zong et al. 2016). Discussions about the results, problem assumptions and implementations are given later. Next we give a proof sketch for the Theorem 2. Proof. [Sketch for Theorem 2] The proof for the main theorem is mainly based on two parts. The ﬁrst part proves the exploration rounds needed to guarantee the clusters partitioned correctly. And the second part is to estimate regret bounds for linear cascading bandits after the clusters are partitioned correctly. Under the assumption of item regularity, we prove when Ti,t O 1 γλx

d ln(n) , the 2 conﬁdence radius for weight vector associated with user i will be smaller than γ/2, where the γ is the gap constant raised in the assumption of cluster regularity. Suppose user i and user ℓbelong

to different clusters and the effective number of feedbacks associated to both user i and ℓmeet the requirements. Then the condition in the Algorithm 1 of deleting an edge (i, ℓ) (Line 10) will be satisﬁed, thus the edge between user i and ℓwill be deleted under our algorithm with high probability. On the other hand, if the condition of deleting an edge (i, ℓ) is satisﬁed, then the 2 difference between the weight vectors is greater than 0, thus the two users belong to different clusters, by the assumption of cluster regularity. By the assumption of item regularity and Bernstein s inequality, after t O (u + 1 λ2 x ) ln(n) +

rounds, we could gather enough information for every user, thus resulting a correct clustering with high probability. After the clusters are correctly partitioned, the recommendation is based on the estimates of cluster weight vector with the cascade feedback collected so far. After decomposing, the instaneous regret can be bounded by the individual difference between expected rewards of best items and checked items, which can be bounded with 2β x M, by the deﬁnition of U t(x). Then it remains to bound the sum of self-normalized sequence n t=1 xt M 1 t 1, where

Mt = Mt 1 + xx .

Extensions to Generalized Linear Rewards In this section, we consider a general case that the expected reward of recommending item x to user it at round t is

Et[y|x] = μ(θ itx),

where μ is a strictly increasing link function, continuously differentiable, and Lipschitz with constant κμ. This deﬁnition arises from exponential family distributions (Filippi et al. 2010) and incorporates a large class of problems, like Poisson or logistic regression. Let cμ = infa [ 2,2] μ (a) and assume cμ > 0. In this setting, let the estimator ˆθI,t 1 for the set of users I be maximum likelihood estimator, or equivalently (Filippi et al. 2010; Li, Lu, and Zhou 2017) the unique solution of

s=1 1{is I}

ys,k μ(θ xs,k) xs,k = 0, (8)

which can be found efﬁciently using Newton s algorithm. Note that the original samples (xs,k, ys,k) are stored instead of only aggregation S, b in the linear case. With a slightly modiﬁed version of Algorithm 1, a result of the cumulative regret bound is obtained and provided in the following theorem.

Theorem 4 Under the same assumptions and notations in linear setting, let β = 1 cμ

8 λx + d ln(n/d) + 2 ln(4mn)

d/(λxcμ), where d, m, u denotes the feature dimension, the number of clusters, and the number of users, respectively. Then the cumulative regret of CLUB-cascade with generalized linear rewards, after n rounds, satisﬁes

2dmn K ln 1 + n K

λ2x ) ln(n) +

mn K ln(n) .

Discussions The degenerate case where the number of clusters m = 1, or equivalently all users are treated as the same type, has the same setting with Section 4.2.2 of (Li et al. 2016) and the setting in (Zong et al. 2016). The regret proved in the ﬁrst paper is O( d

n K ln(n)), which has an additional term 1/p

compared to ours. The parameter p denotes the minimal probability that a user has checked all items, which could be quite small. The reason that we can get rid of such a 1/p term is because we have a better regret decomposition formula than theirs. The regret presented in the second paper has the bound of O(d K n ln(n)), which has an additional term

K than ours. This reason is that we have a tighter bound for the sum of self-normalized sequence. For the assumption on the true cluster structure over users, we assume there is a gap γ > 0 between the weight vectors associated with different clusters. The parameter γ is a trade-off between personalization and collaborative ﬁltering, where γ = 2 corresponds to the case of only one cluster containing all users and γ taking the value of minimal distance between different user weight vectors corresponds to the case that each user is one cluster. Also, the assumption of γ can be further relaxed by modifying γ along running the algorithm. Our algorithm explores clustering structures adaptively: It starts with one cluster, then ﬁnds ﬁner and ﬁner clustering until the cluster distance reaches γ. As more data ﬂows in, the parameter γ can be changed smaller and our algorithm can continue working without the need to restart. By a similar analysis, we could derive an asymptotic regret bound (with parameters in the algorithm changed accordingly). We omit this part and simply assume a γ > 0 gap exists. For the users, we assume the learning agent has no control over user appearances and at each time step, a user is drawn uniformly from all users, and independently from the past. If the learning agent has the access to sample users, the setting becomes active learning in online clustering and should have a better regret bound because the learning agent does not need to wait for collecting enough information. The uniform user appearance assumption means that the users we take care of are on the same activity level, which is easier for us to deal with. If there is some activity structure over users, we might need further assumptions and corresponding strategies on the activity structure. For example, if there are a large amount of new users, or users who only come a few time, additional assumptions like that those users share the same prediction vectors θ might be brought up. We leave the relaxations of the two assumptions as future work.

Experiments In this section, we compare our algorithm with C3-UCB (Li et al. 2016) and Cascade Lin UCB (Zong et al. 2016),

0K 20K 40K 60K 80K 100K Time t

Cum. Regret

CLUB-cascade C3-UCB/Cascade Lin UCB

(a) θ orthogonal, m = 2

0K 20K 40K 60K 80K 100K Time t

Cum. Regret

CLUB-cascade C3-UCB/Cascade Lin UCB

(b) θ orthogonal, m = 5

0M 1M 2M 3M 4M 5M Time t

Cum. Regret

CLUB-cascade C3-UCB/Cascade Lin UCB

(c) γ = 0.2, m = 2

0M 1M 2M 3M 4M 5M Time t

Cum. Regret

CLUB-cascade C3-UCB/Cascade Lin UCB

(d) γ = 0.2, m = 5

Figure 1: Synthetic Data Set, u = 40, L = 200, K = 4, d = 20

which are the most related works. In both synthetic and real datasets, the results demonstrate the advantage of incorporating online clustering in the setting of online recommendations with cascade model. We focus on linear rewards for all experiments. To accelerate our algorithm, we use a sparse initialization instead of the complete graph initialization, similar in (Gentile, Li, and Zappella 2014).

Synthetic Data

In this section, we compare our algorithm, CLUB-cascade, with C3-UCB/Cascade Lin UCB on the synthetic data. The results are shown in Figure 1. In all the four settings, we randomly choose a content set with L = 200 items, each of which has a feature vector x Rd with x 2 1 and d = 20. We use u = 40 users and assign them randomly to m = 2, 5 clusters. For each cluster j [m], we ﬁx a weight vector θj with norm 1 and use it to generate Bernoulli random variable, whose mean is the inner product of θj with the corresponding item vector. In each round, a random user comes and the algorithm recommends K = 4 items to the user. According to the Bernoulli random variables, the algorithm receives the cascading feedback and updates its statistics accordingly. In the synthetic setting, since we know the true weight vector θj, the best action can be computed and thus the cumulative regret for algorithms. The vertical axis denotes the cumulative regret and the horizontal axis denotes time step t. In the four subﬁgures, we explore the distance gap γ between different θ s and the number of clusters m. When the gap γ between weight vectors θ is ﬁxed, our algorithm has a better advantage over theirs when the number of clusters m is bigger. The θ s in subﬁgures (a)(c) are orthogonal, that is, the difference gap between them is γ =

2. The difference gap γ in (b)(d) is set to be 0.2, thus the cosine similarity between the different θ s is 0.98, which is quite high. Thus un-

0K 20K 40K 60K 80K 100K Time t

Cum. Reward

CLUB-cascade C3-UCB/Cascade Lin UCB

(a) u = 40, L = 200

0K 20K 40K 60K 80K 100K Time t

Cum. Reward

CLUB-cascade C3-UCB/Cascade Lin UCB

(b) u = 40, L = 1k

0K 20K 40K 60K 80K 100K Time t

Cum. Reward

CLUB-cascade C3-UCB/Cascade Lin UCB

(c) u = 200, L = 200

0K 20K 40K 60K 80K 100K Time t

Cum. Reward

CLUB-cascade C3-UCB/Cascade Lin UCB

(d) u = 200, L = 1k

Figure 2: Cumulative clicks on Yelp dataset, d = 20, K = 4

der the same number of clusters, our algorithm needs more time to learn well in the setting of a smaller γ. Because γ = 0.2 means near-1 cosine similarity, to regard all users as a whole might have advantages in early rounds. However, after our algorithm learns out the true cluster structure, their advantage depreciates very fast. Although our algorithm needs more steps to achieve an obvious advantage with a smaller gap γ, typically it is not required to differentiate θ s with near-1 cosine similarity. The purpose we use this setting is to demonstrate the extreme case. In real applications, the estimated weight vectors will not be too similar so that our algorithm can easily outperform theirs, which we will see in the next experiments. Note that our algorithm is not as good as theirs in the beginning. The reason is that we use a random sparse graph initialization and this initialization might result in inaccurate clustering for early rounds. However, after collecting enough information, our algorithm can still learn out the correct clustering (which might be a little ﬁner for true clustering).

Yelp Dataset In this section, we compare our algorithm, CLUB-cascade, with C3-UCB/Cascade Lin UCB on restaurant recommendations with Yelp dataset1. The dataset contains user ratings for several businesses. For restaurants, it contains 1579523 ratings of 26629 restaurants from 478841 users. We extract 1k restaurants with most reviews and 1k users who review most for experiments. Before we start, we randomly choose 100 users and formulate a binary matrix H R100 1k (stands for history ) where H(i, k) = 1 denotes the user i has rated restaurant k and H(i, k) = 0 denotes otherwise. We want to construct feature vectors for 1k restaurants from the records of 100 users and then use them to conduct experiments on the

1http : //www.yelp.com/dataset challenge

0K 20K 40K 60K 80K 100K Time t

Cum. Reward

CLUB-cascade C3-UCB/Cascade Lin UCB

(a) u = 40, L = 200

0K 20K 40K 60K 80K 100K Time t

Cum. Reward

CLUB-cascade C3-UCB/Cascade Lin UCB

(b) u = 40, L = 1k

0K 200K 400K 600K 800K 1000K Time t

Cum. Reward

CLUB-cascade C3-UCB/Cascade Lin UCB

(c) u = 200, L = 200

0K 200K 400K 600K 800K 1000K Time t

Cum. Reward

CLUB-cascade C3-UCB/Cascade Lin UCB

(d) u = 200, L = 1k

Figure 3: Cumulative clicks on Movie Lens, d = 20, K = 4

records of the remaining 900 users. Then, we perform SVD on H to get a d = 20 feature vectors for each of the chosen restaurants. The remaining ratings form another binary matrix F R900 1k (stands for future ), which is used for online experiments. For each of the following settings, we randomly choose L = 200 (or 1k) restaurants and u = 40 (or 200) users. At each time step t, a user is selected uniformly and the learning agent recommends K = 4 restaurants to the user. By referring the binary matrix F, the learning agent receives a feedback Ct {1, ..., K, } and updates its statistics. The objective is to maximize the cumulative clicks of the learning agent2. The results are shown in Figure 2, where the vertical axis denotes the cumulative rewards and the horizontal axis is the time step t. From the results, the performance of our algorithm has a clear advantage over theirs.

Movie Lens Dataset

In this experiment, we compare our algorithm, CLUBcascade, with C3-UCB/Cascade Lin UCB on the real dataset Movie Lens (Harper and Konstan 2016). We use the processed 20m dataset3, in which there are 20 million ratings for 27k movies by 138k users. Since the Movie Lens dataset has been processed and all users and movies have records with similar density, we randomly draw 1k movies and 1k users for experiments. After that, we randomly draw 100 users from the 1k users and formulate a binary matrix H R100 1k, where H(i, j) = 1 denotes the user i has rated movie j and H(i, j) = 0 denotes otherwise. Then we perform SVD on H to get a d = 20 feature vectors for all the 1k chosen movies. The

2Since there is no universal truth about correct clustering and choosing the gap parameter γ is quite subjective, to avoid disputes and be consistent with previous works (Li et al. 2016), we adopt the measure of cumulative rewards here. 3https : //grouplens.org/datasets/movielens/20m/

0K 20K 40K 60K 80K 100K Time t

Cum. Reward / Time t

CLUB-cascade C3-UCB/Cascade Lin UCB

0K 200K 400K 600K 800K 1000K Time t

Cum. Reward / Time t

CLUB-cascade C3-UCB/Cascade Lin UCB

(b) Movie Lens

Figure 4: Comparisons of CTR on Yelp and Movie Lens, d = 20, K = 4, u = 200, L = 1000

records for the remaining 900 users form another binary matrix F R900 1k, which is used for online evaluations. For each of the four settings, we randomly choose L = 200 (or 1k) movies and u = 40 (or 200) users. At each time step t, a user is selected uniformly and the learning agent recommends K = 4 movies to the user. By referring to the binary matrix F, the learning agent receives a feedback Ct {1, ..., K, } and updates its statistics. The objective is to maximize the cumulative clicks of the learning agent. The results are shown in Figure 3, where the vertical axis denotes the cumulative rewards and the horizontal axis is the time step t. From the results, the performance of our algorithm has a clear advantage over theirs. Comparing the performances on two datasets, our algorithm seems to need more steps to obtain an obvious advantage in the Movie Lens dataset. This is because the Movie Lens dataset has been processed and the user-movie matrix we are dealing with is quite dense, thus users are more similar. To see this phenomenon more clearly, we draw the results on the average rewards, the cumulative rewards up to time t divided by t, for both datasets in Figure 4. In earlier rounds, their algorithm taking all users as one will have a temporary advantage in Movie Lens dataset since the users are similar. At the same time, our algorithm pays the cost of exploring clusters and starts with low average rewards. However, as the explored cluster structure becomes more and more accurate, our algorithm beneﬁts from it and keeps a high increasing rate. As time goes by, the cost for regarding users as one is not negligible and our algorithm outperforms theirs. In most real applications, the user-item matrix would be very sparse and the users are tending to be dissimilar, resulting in a more advantaged environment for our algorithm.

Conclusions

In this paper, we bring up a new problem of online clustering of contextual cascading bandits, where the algorithm has to explore the unknown cluster structure on users under a preﬁx feedback of the recommended item list. We propose a CLUB-cascade algorithm based on the principle of optimism in face of uncertainty and prove a cumulative regret bound, whose degenerate case improves the existing results. The experiments conducted on both synthetic and real dataset demonstrate the advantage of incorporating online clustering.

Acknowledgments This research was sponsored by Research Grants Council of the Hong Kong S.A.R. (Project no. CUHK14239416), as well as Huawei Innovation Research Program.

References Abbasi-Yadkori, Y.; P al, D.; and Szepesv ari, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312 2320. Chuklin, A.; Markov, I.; and Rijke, M. d. 2015. Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services 7(3):1 115. Combes, R.; Magureanu, S.; Proutiere, A.; and Laroche, C. 2015. Learning to rank: Regret lower bounds and efﬁcient algorithms. ACM SIGMETRICS Performance Evaluation Review 43(1):231 244. Craswell, N.; Zoeter, O.; Taylor, M.; and Ramsey, B. 2008. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, 87 94. ACM. Filippi, S.; Cappe, O.; Garivier, A.; and Szepesv ari, C. 2010. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, 586 594. Gentile, C.; Li, S.; Kar, P.; Karatzoglou, A.; Zappella, G.; and Etrue, E. 2017. On context-dependent clustering of bandits. In International Conference on Machine Learning, 1253 1262. Gentile, C.; Li, S.; and Zappella, G. 2014. Online clustering of bandits. In Proceedings of the 31st International Conference on Machine Learning, 757 765. Harper, F. M., and Konstan, J. A. 2016. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (Tii S) 5(4):19. Katariya, S.; Kveton, B.; Szepesvari, C.; and Wen, Z. 2016. Dcm bandits: Learning to rank with multiple clicks. In Proceedings of The 33rd International Conference on Machine Learning, 1215 1224. Kveton, B.; Szepesv ari, C.; Wen, Z.; and Ashkan, A. 2015a. Cascading bandits: learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning. Kveton, B.; Wen, Z.; Ashkan, A.; and Szepesvari, C. 2015b. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems, 1450 1458. Li, S.; Wang, B.; Zhang, S.; and Chen, W. 2016. Contextual combinatorial cascading bandits. In Proceedings of The 33rd International Conference on Machine Learning, 1245 1253. Li, S.; Karatzoglou, A.; and Gentile, C. 2016. Collaborative ﬁltering bandits. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 539 548. ACM. Li, L.; Lu, Y.; and Zhou, D. 2017. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, 2071 2080.

Zoghi, M.; Tunys, T.; Ghavamzadeh, M.; Kveton, B.; Szepesvari, C.; and Wen, Z. 2017. Online learning to rank in stochastic click models. In International Conference on Machine Learning, 4199 4208. Zong, S.; Ni, H.; Sung, K.; Ke, N. R.; Wen, Z.; and Kveton, B. 2016. Cascading bandits for large-scale recommendation problems. In the 32nd Conference on Uncertainty in Artiﬁcial Intelligence.