# On Context-Dependent Clustering of Bandits

Claudio Gentile¹, Shuai Li², Purushottam Kar³, Alexandros Karatzoglou⁴, Giovanni Zappella⁵, Evans Etrue¹

¹DiSTA, University of Insubria, Italy. ²University of Cambridge, United Kingdom. ³IIT Kanpur, India. ⁴Telefonica Research, Spain. ⁵Amazon Dev Center, Germany (work done while at the University of Milan, Italy). Correspondence to: Claudio Gentile.

*Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).*

## Abstract

We investigate a novel cluster-of-bandits algorithm, CAB, for collaborative recommendation tasks, which implements the underlying feedback-sharing mechanism by estimating user neighborhoods in a context-dependent manner. CAB makes sharp departures from the state of the art by incorporating collaborative effects into both inference and learning processes, in a manner that seamlessly interleaves explore-exploit tradeoffs and collaborative steps. We prove regret bounds for CAB under various data-dependent assumptions, which exhibit a crisp dependence on the expected number of clusters over the users, a natural measure of the statistical difficulty of the learning task. Experiments on production and real-world datasets show that CAB offers significantly better prediction performance than a representative pool of state-of-the-art methods.

## 1. Introduction

In many prominent applications of bandit algorithms, such as computational advertising, web-page content optimization, and recommendation systems, one of the main sources of information is embedded in the preference relationships between users and the items served. Preference patterns, emerging from clicks, views, or purchases of items, are typically exploited through collaborative filtering techniques. In fact, it is common knowledge in recommendation systems practice that collaborative effects carry more information about user preferences than, say, demographic metadata (Pilaszy & Tikk, 2009). Yet, as content recommendation functionalities are incorporated into diverse online services, the requirements differ vastly too. For instance, in a movie recommendation system, where the catalog is relatively static and ratings for items accumulate, one can easily deploy collaborative filtering methods such as matrix factorization or restricted Boltzmann machines. However, the same methods become practically impossible to use in more dynamic environments, such as news or YouTube video recommendation, where one has to deal with a near-continuous stream of new items to be recommended, along with new users to be served. These dynamic environments pose a dual challenge to recommendation methods: 1) how to present new items to the users (or, vice versa, which items to present to new users) in order to optimally gather preference information on the new content (exploration), and 2) how to use all the available user-item preference information gathered so far (exploitation). Ideally, one would like to exploit not only the content information but also, and more importantly, the collaborative effects that can be observed across users and items. When the users to serve are many and the content universe (or content popularity) changes rapidly over time, recommendation services must exhibit both strong adaptation in matching user preferences and a high degree of algorithmic scalability/responsiveness, so as to allow effective online deployment.
In typical scenarios like social networks, where users are engaged in technology-mediated interactions influencing each other's behavior, it is often possible to single out a few groups or communities made up of users sharing similar interests and/or behavior. Such communities are not static over time and, more often than not, are clustered around specific content types, so that a given set of users can in fact host a multiplex of interdependent communities depending on the specific content items, which can change dramatically on the fly. We call this multiplex of interdependent clusterings over the users, induced by the content universe, a *context-dependent clustering*. In addition, the set of users itself can change over time, as new users get targeted by the service while others sign out or unregister. Thus, a recommendation method has to readily adapt to a changing set of both users and items.

In this paper, we introduce and analyze the CAB (Context-Aware clustering of Bandits) algorithm, a simple and flexible algorithm rooted in the linear contextual bandit framework that does the above by incorporating collaborative effects which traditional approaches to contextual bandits ignore (e.g., (Auer, 2002; Li et al., 2010; Chu et al., 2011; Abbasi-Yadkori et al., 2011)). CAB adapts to match user preferences in the face of a constantly evolving content universe and set of targeted users. It implements the context-dependent clustering intuition by computing clusterings of bandits that allow each content item to cluster users into groups (which are few relative to the total number of users), such that, within each group, users tend to react similarly when that item gets recommended. CAB distinguishes itself by allowing distinct items to induce distinct clusterings, something frequently observed in practice (Sutskever et al., 2009). These clusterings are in turn suggestive of a natural context-dependent feedback-sharing mechanism across users. CAB is thus able to exploit collaborative effects in contextual bandit settings in a manner similar to neighborhood techniques in batch collaborative filtering.

We analyze CAB from both theoretical and experimental standpoints. We show that CAB enjoys a regret bound in which the number of users essentially enters only through the expected number of context-dependent clusters over the users, a natural measure of the predictive hardness of learning these users. We extend this result to provide a sharper bound under sparsity assumptions on the user model vectors. Experimentally, we present comparative evidence on production and real-world datasets that CAB significantly outperforms, in terms of predictive performance, state-of-the-art contextual bandit algorithms that either do not leverage any clustering at all or do so in a context-independent fashion.

### 1.1. Related Work

The literature on contextual bandit algorithms is too vast to be surveyed here; in the sequel, we point out the works that most closely relate to ours. The technique of sequentially clustering users in the bandit setting was introduced by Gentile et al. (2014) and Maillard & Mannor (2014), and was also inspired by earlier works such as (Azar et al., 2013) on transfer learning for stochastic bandits, and (Djolonga et al., 2013) on low-rank (Gaussian Process) bandits.
This led to further developments such as (Nguyen & Lauw, 2014), which relies on k-means clustering; (Korda et al., 2016), which proposes distributed clustering of confidence-ball algorithms for solving linear bandit problems in peer-to-peer networks; and (Zhou & Brunskill, 2016), which learns the latent classes of users (the clusters) so as to better serve new users. Related papers that implement feedback-sharing mechanisms by leveraging (additional) social information among users include (Cesa-Bianchi et al., 2013; Wu et al., 2016). In all these cases, the way users are grouped is not context-dependent. Even more related to our work is the recent paper of Li et al. (2016), which proposes to simultaneously cluster users as well as items, with item clusters dictating user clusters. A significant limitation of this approach, however, is that the content universe has to be finite and known in advance; in addition, the resulting algorithm is somewhat involved.

It is worth stressing that having context-dependent clusters changes not only the model considerably, as compared to previous works, but the algorithms as well. All previous works have a clustering process which is context-oblivious, or else assume a static item universe. A specific drawback of works such as (Gentile et al., 2014; Li et al., 2016) is that the clustering is unidirectional, in that users/items once (erroneously) separated into different clusters cannot be joined again, even if future evidence suggests they should be. Compared to previous works, our approach distinguishes itself by being simple and flexible (e.g., we can seamlessly accommodate the inclusion/exclusion of items and users), as well as by performing feedback propagation among users in a context-dependent manner. As will be demonstrated in Section 5, this offers significant performance boosts in real-world recommendation settings.

## 2. Notation and Preliminaries

We consider the bandit clustering model standard in the literature, but with the crucial difference that we allow user behavior similarity to be represented by a *family* of clusterings that depend on the specific item context under consideration. In particular, we let $U = \{1, \ldots, n\}$ represent the set of $n$ users. An item, represented by its feature vector $x \in \mathbb{R}^d$, is seen as inducing a (potentially unique) partition of the user set $U$ into a small number $m(x)$ of clusters $\{U_1(x), U_2(x), \ldots, U_{m(x)}(x)\}$, where $m(x) \ll n$. Users belonging to the same cluster $U_j(x)$ share similar behavior w.r.t. $x$ (i.e., they all tend to like, or all tend to dislike, the item represented by $x$), while users lying in different clusters have significantly different behavior. This is a very flexible model that allows users to agree on their opinion of certain items and disagree on others, something that often holds in practice. It is important to note that the mapping $x \mapsto \{U_1(x), U_2(x), \ldots, U_{m(x)}(x)\}$ specifying the partition of $U$ into the clusters determined by $x$ (including the number of clusters $m(x)$), as well as the common user behavior within each cluster, are unknown to the learner, and have to be inferred based on user feedback.

For the sake of simplicity, we assume that the context-dependent clustering is determined by the linear functions $x \mapsto u_i^\top x$, each parameterized by an unknown vector $u_i \in \mathbb{R}^d$ hosted at user $i \in U$, with $\|u_i\| = 1$ for all $i$, in such a way that if users $i, i' \in U$ are in the same cluster w.r.t. $x$ then $u_i^\top x = u_{i'}^\top x$, and if $i, i' \in U$ are in different clusters w.r.t. $x$ then $|u_i^\top x - u_{i'}^\top x| \ge \gamma$, for some gap parameter $\gamma > 0$.¹ We will henceforth call this assumption the *γ-gap assumption*.

¹ As usual, this hypothesis may be relaxed by assuming the existence of two thresholds, one for the within-cluster distance between $u_i^\top x$ and $u_{i'}^\top x$, the other for the between-cluster distance.
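As a concrete toy instance of this model (our own illustration, not taken from the paper), consider $n = 3$ users in $\mathbb{R}^3$ with unit-norm profiles

```latex
u_1 = e_1, \qquad u_2 = e_2, \qquad u_3 = \tfrac{1}{\sqrt{2}}\,(e_1 + e_2).
```

For the item $x = e_3$ we get $u_1^\top x = u_2^\top x = u_3^\top x = 0$, so all users fall into a single cluster and $m(e_3) = 1$. For the item $x = e_1$ the expected payoffs are $1$, $0$, and $1/\sqrt{2}$, so the same three users split into three singleton clusters, $m(e_1) = 3$, with pairwise gaps at least $1 - 1/\sqrt{2} \approx 0.29$. The two items thus induce different partitions of the same user set, and the γ-gap assumption holds for this pair of items with any $\gamma \le 0.29$.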
For user vectors $u_1, \ldots, u_n \in \mathbb{R}^d$ corresponding to the $n$ users (recall that these vectors are unknown to the algorithm), context $x \in \mathbb{R}^d$, and user index $i \in U$, we denote by $N_i(x)$ the true neighborhood of $i$ w.r.t. $x$, i.e., $N_i(x) = \{j \in U : u_j^\top x = u_i^\top x\}$. Hence, $N_i(x)$ is simply the cluster (over $U$) that $i$ belongs to w.r.t. $x$. Notice that $i \in N_i(x)$ for any $i$ and any $x$. We will henceforth assume that all item vectors $x$ satisfy $\|x\| \le 1$.

As is standard in linear bandit settings (Auer, 2002; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Krause & Ong, 2011; Crammer & Gentile, 2011; Yue et al., 2012; Djolonga et al., 2013; Cesa-Bianchi et al., 2013; Agrawal & Goyal, 2013; Gentile et al., 2014; Li et al., 2016; Korda et al., 2016), the (unknown) user vector $u_i$ determines the average behavior of user $i$. More precisely, upon encountering an item context vector $x$, user $i$ reacts by delivering a payoff value
$$y_i(x) = u_i^\top x + \epsilon_i(x)\,,$$
where $\epsilon_i(x)$ is a conditionally zero-mean sub-Gaussian error variable with (conditional) variance parameter $\sigma^2(x) \le \sigma^2$ for all $x$.² Hence, conditioned on the past, the quantity $u_i^\top x$ is indeed the expected payoff observed at user $i$ for context vector $x$. For the sake of concreteness, we will assume that for all $i \in U$ and $x \in \mathbb{R}^d$ we have $y_i(x) \in [-1, +1]$ a.s.

² Recall that a zero-mean random variable $X$ is sub-Gaussian with variance parameter $\sigma^2$ if $\mathbb{E}[\exp(sX)] \le \exp(s^2\sigma^2/2)$ for all $s \in \mathbb{R}$. Any variable $X$ with $\mathbb{E}[X] = 0$ and $|X| \le b$ is sub-Gaussian with variance parameter upper bounded by $b^2$.

Learning takes place over a discrete sequence of time steps (or rounds). At each time $t = 1, 2, \ldots$, the learner receives a user index $i_t \in U$, representing the user to serve content to. Notice that the user to serve may change from round to round, and the same user may recur several times. Together with $i_t$, the learner receives a set of context vectors $C_t = \{x_{t,1}, x_{t,2}, \ldots, x_{t,c_t}\} \subseteq \mathbb{R}^d$, with $\|x_{t,k}\| \le 1$ for all $t$ and $k = 1, \ldots, c_t$, encoding the content currently available for recommendation to user $i_t$. The learner is compelled to pick some $\bar{x}_t = x_{t,k_t} \in C_t$ to recommend to $i_t$, and then observes $i_t$'s feedback in the form of a payoff $y_t \in [-1, +1]$ whose (conditional) expectation is $u_{i_t}^\top \bar{x}_t$. The sequence of pairings $\{(i_t, C_t)\}_{t=1}^T = \{(i_1, C_1), (i_2, C_2), \ldots, (i_T, C_T)\}$ is generated by an exogenous process and, in a sense, represents the "data at hand". As we shall see in Section 4, the performance of our algorithm depends on the properties of these data.

The practical goal of the learner is to maximize its total payoff $\sum_{t=1}^T y_t$ over $T$ time steps. From a theoretical standpoint, we are instead interested in bounding the cumulative regret incurred by the algorithm. More precisely, let the regret $r_t$ of the learner at time $t$ be the extent to which the average payoff of the best choice in hindsight at user $i_t$ exceeds the average payoff of the algorithm's choice, i.e.,
$$r_t = \max_{x \in C_t} u_{i_t}^\top x \;-\; u_{i_t}^\top \bar{x}_t\,.$$
We aim to bound, with high probability (over the noise variables $\epsilon_{i_t}(\bar{x}_t)$ and any other possible source of randomness), the cumulative regret $\sum_{t=1}^T r_t$.
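To make the protocol concrete, here is a minimal simulation sketch of the payoff model and the per-round regret (our own illustration; names such as `payoff` and `regret` are ours, and the learner's choice is left arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, d, sigma = 5, 3, 0.1

# Unknown unit-norm user profiles u_i (drawn at random purely for illustration).
U = rng.normal(size=(n_users, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)

def payoff(i, x):
    """Noisy payoff y_i(x) = u_i^T x + eps; bounded noise is sub-Gaussian."""
    eps = rng.uniform(-sigma, sigma)
    return float(U[i] @ x) + eps

def regret(i, C, k_chosen):
    """r_t = max_{x in C_t} u_i^T x - u_i^T xbar_t (needs the true profiles)."""
    means = C @ U[i]
    return float(means.max() - means[k_chosen])

# One round of the protocol: user i_t arrives with a list C_t of unit-norm items.
i_t = int(rng.integers(n_users))
C_t = rng.normal(size=(4, d))
C_t /= np.linalg.norm(C_t, axis=1, keepdims=True)
k_t = 0                       # the learner's choice (arbitrary in this sketch)
y_t = payoff(i_t, C_t[k_t])   # the only feedback the learner actually sees
r_t = regret(i_t, C_t, k_t)   # computable in simulation only
```

Note that $r_t$ is computable only in simulation, since it requires the true profiles $u_i$; the learner only ever observes $y_t$.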
As a special case of the above model, when the items do not possess informative features, we can always resort to the non-contextual bandit setting (e.g., (Auer et al., 2002; Audibert et al., 2009)). To do so, we simply take the set of all items (which must be finite for this technique to work) and apply a one-hot encoding, assigning to the $i$-th item the $i$-th canonical basis vector $e_i$ (one at the $i$-th position, zero everywhere else) as its context vector. It is easy to see that the expected payoff given by user $i$ to item $j$ is then simply the $j$-th component of the vector $u_i$.

Our aim is to obtain a regret bound that gracefully improves as the context-dependent clustering structure over the users becomes stronger. The values taken by the number of clusters $m(x)$ are of particular interest, since we expect to reap the strongest collaborative effects when $m(x)$ is small, whereas not much can be done by way of collaborative analysis if $m(x) \approx n$. Consequently, a desirable regret bound is one that scales with $m(x)$. Yet, recall that $m(x)$ is a function of the context vector $x$, which means that we expect our regret bound to also depend on the properties of the actual data $\{(i_t, C_t)\}_{t=1}^T$. We will see in Section 4 that, under suitable stochastic assumptions on the way $\{(i_t, C_t)\}_{t=1}^T$ is generated, our regret analysis essentially replaces the dependence on the total number of users $n$ by the (possibly much) smaller quantity $\mathbb{E}[m(X)]$, the expected number of clusters over the users, the expectation being over the random draw of the context vector $X$.

## 3. The Context-Aware Bandit Algorithm

We present the Context-Aware (clustering of) Bandits algorithm (dubbed CAB; see Algorithm 1), an upper-confidence-bound-based algorithm for performing recommendations in the context-sensitive bandit clustering model. Similar to previous works (Cesa-Bianchi et al., 2013; Gentile et al., 2014; Nguyen & Lauw, 2014; Li et al., 2016; Wu et al., 2016), CAB maintains a vector estimate $w_{i,t}$ to serve as a proxy for the unknown user vector $u_i$ at time $t$. CAB also maintains standard correlation matrices $M_{i,t}$. The standard confidence bound function for user $i$ on item $x$ at time $t$ is derived as $\mathrm{CB}_{i,t}(x) = \alpha(t)\sqrt{x^\top M_{i,t}^{-1}\,x}$, for a suitable function $\alpha(t) = O(\sqrt{d \log t})$. However, CAB makes sharp departures from previous works both in the way items are recommended and in the way the proxy estimates $w_{i,t}$ are updated.
```
Algorithm 1  Context-Aware clustering of Bandits (CAB)

 1: Input: separation parameter γ, exploration parameter α(t).
 2: Init: b_{i,0} = 0 ∈ R^d and M_{i,0} = I ∈ R^{d×d}, for all i ∈ U.
 3: for t = 1, 2, ..., T do
 4:    Set w_{i,t-1} = M_{i,t-1}^{-1} b_{i,t-1}, for all i ∈ U;
 5:    Use CB_{i,t-1}(x) = α(t) · sqrt(x^T M_{i,t-1}^{-1} x), for all x and i ∈ U;
 6:    Receive the user i_t ∈ U to be served, and the context vectors
       C_t = {x_{t,1}, ..., x_{t,c_t}} of the items to recommend from;
 7:    for k = 1, ..., c_t do
 8:       Compute the estimated neighborhood N̂_k := N̂_{i_t,t}(x_{t,k}) for this item:
          N̂_k = { j ∈ U : |w_{i_t,t-1}^T x_{t,k} - w_{j,t-1}^T x_{t,k}|
                            ≤ CB_{i_t,t-1}(x_{t,k}) + CB_{j,t-1}(x_{t,k}) };
 9:       Set w̄_{N̂_k,t-1} = (1/|N̂_k|) Σ_{j ∈ N̂_k} w_{j,t-1};
10:       Set CB̄_{N̂_k,t-1}(x_{t,k}) = (1/|N̂_k|) Σ_{j ∈ N̂_k} CB_{j,t-1}(x_{t,k});
11:    end for
12:    Recommend the item x̄_t = x_{t,k_t} ∈ C_t such that
       k_t = argmax_{k=1,...,c_t} ( w̄_{N̂_k,t-1}^T x_{t,k} + CB̄_{N̂_k,t-1}(x_{t,k}) );
13:    Observe the payoff y_t ∈ [-1, 1];
14:    if CB_{i_t,t-1}(x̄_t) ≥ γ/4 then
15:       Set M_{i_t,t} = M_{i_t,t-1} + x̄_t x̄_t^T;
16:       Set b_{i_t,t} = b_{i_t,t-1} + y_t x̄_t;
17:       Set M_{j,t} = M_{j,t-1} and b_{j,t} = b_{j,t-1} for all j ≠ i_t;
18:    else
19:       for all j ∈ N̂_{k_t} such that CB_{j,t-1}(x̄_t) < γ/4 do
20:          M_{j,t} = M_{j,t-1} + x̄_t x̄_t^T;
21:          b_{j,t} = b_{j,t-1} + y_t x̄_t;
22:       end for
23:       Set M_{j,t} = M_{j,t-1} and b_{j,t} = b_{j,t-1} for all j ∉ N̂_{k_t},
          as well as for all j ∈ N̂_{k_t} with CB_{j,t-1}(x̄_t) ≥ γ/4.
24:    end if
25: end for
```

**Item recommendation.** At time $t$, we are required to serve user $i_t \in U$ by presenting an item out of the set $C_t = \{x_{t,1}, \ldots, x_{t,c_t}\}$ available at time $t$. To do so, CAB first computes, for each item $x_{t,k}$ in $C_t$, the set of users that are likely to give the item a payoff similar to $i_t$'s. This set $\hat{N}_{i_t,t}(x_{t,k})$ is the *estimated neighborhood* of user $i_t$ with respect to item $x_{t,k}$. A user $j$ is included in $\hat{N}_{i_t,t}(x_{t,k})$ if the payoff $j$ is estimated to give to item $x_{t,k}$ is sufficiently close to the one estimated for user $i_t$ (see step 8). CAB incorporates collaborative effects by lifting the notions of user proxy and confidence bound to a set of users $N \subseteq U$ through an inexpensive, flat averaging: $\bar{w}_{N,t} = \frac{1}{|N|}\sum_{j \in N} w_{j,t}$ and $\overline{\mathrm{CB}}_{N,t}(x) = \frac{1}{|N|}\sum_{j \in N} \mathrm{CB}_{j,t}(x)$. Next, CAB uses (see step 12) the aggregated proxy vectors $\bar{w}_{\hat{N}_{i_t,t}(x_{t,k}),t-1}$ and the aggregated confidence bounds $\overline{\mathrm{CB}}_{\hat{N}_{i_t,t}(x_{t,k}),t-1}(x_{t,k})$ to select an item $\bar{x}_t = x_{t,k_t} \in C_t$ through an upper-confidence estimation step.

**Proxy updates.** Classical approaches update the user proxies $w_{i,t}$ by solving a regularized least-squares problem involving (feature representations of) the items served previously to user $i$ and the payoffs received. CAB, however, remains fully committed to the collaborative approach (see steps 14–24) by allowing a user $i$ to inherit updates due to an item $x$ served to another user $j$, provided the two users are deemed to agree on their opinion of $x$ with a sufficiently high degree of confidence. The proxies $w_{j,t}$ are updated after receiving the feedback $y_t$ from user $i_t$. If CAB is not confident enough about its estimate along the direction $\bar{x}_t$, formally if $\mathrm{CB}_{i_t,t-1}(\bar{x}_t) \ge \gamma/4$, then only the proxy of user $i_t$ is updated (steps 15–17). However, if CAB is confident, i.e., if $\mathrm{CB}_{i_t,t-1}(\bar{x}_t) < \gamma/4$, then the proxy updates are performed (steps 19–23) for all users $j$ in $i_t$'s estimated neighborhood with respect to $\bar{x}_t$ about whose opinions CAB is confident too. Notice that all such users $j$ undergo the same update, which is motivated by the algorithm's belief that $\hat{N}_{i_t,t}(\bar{x}_t) = N_{i_t}(\bar{x}_t)$, i.e., that the conditional expectation $u_{i_t}^\top \bar{x}_t$ of $y_t$ given $\bar{x}_t$ is actually also equal to $u_j^\top \bar{x}_t$ for all users $j \in \hat{N}_{i_t,t}(\bar{x}_t)$ such that $\mathrm{CB}_{j,t-1}(\bar{x}_t) < \gamma/4$.

It is worth noting that CAB is extremely flexible in handling a fluid set of users $U$. Due to its context-sensitive user aggregation step, which is repeated at every round, CAB allows users to be added or dropped on the fly, in a seamless manner. This is in stark contrast to past approaches to bandit aggregation, such as GobLin (Cesa-Bianchi et al., 2013), CLUB (Gentile et al., 2014), and COFIBA (Li et al., 2016), which implement more involved feedback-sharing mechanisms across users, based either on static network Laplacians or on time-evolving connected components of graphs over a fixed set of users.
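For concreteness, here is a compact, illustrative rendering of one CAB round in Python (a sketch of Algorithm 1 written by us, not the authors' code; the function signature and helper names are our own):

```python
import numpy as np

def cab_round(i_t, C_t, M, b, t, alpha, gamma, observe_payoff):
    """One round of CAB (Algorithm 1). M: (n,d,d) matrices, b: (n,d) vectors."""
    n, d = b.shape
    Minv = np.linalg.inv(M)                      # M_{i,t-1}^{-1} for all users
    w = np.einsum('ndk,nk->nd', Minv, b)         # proxies w_{i,t-1}
    cb = lambda j, x: alpha(t) * np.sqrt(x @ Minv[j] @ x)   # CB_{j,t-1}(x)

    best_k, best_score, best_hood = None, -np.inf, None
    for k, x in enumerate(C_t):
        # Step 8: estimated neighborhood of i_t w.r.t. item x.
        hood = [j for j in range(n)
                if abs(w[i_t] @ x - w[j] @ x) <= cb(i_t, x) + cb(j, x)]
        # Steps 9-10: flat averaging of proxies and confidence bounds.
        w_bar = np.mean([w[j] for j in hood], axis=0)
        cb_bar = np.mean([cb(j, x) for j in hood])
        score = w_bar @ x + cb_bar               # step 12: UCB score
        if score > best_score:
            best_k, best_score, best_hood = k, score, hood

    x_t = C_t[best_k]
    y_t = observe_payoff(best_k)                 # step 13
    # Steps 14-24: update i_t alone if unsure, else its confident neighbors.
    targets = ([i_t] if cb(i_t, x_t) >= gamma / 4
               else [j for j in best_hood if cb(j, x_t) < gamma / 4])
    for j in targets:
        M[j] += np.outer(x_t, x_t)
        b[j] += y_t * x_t
    return best_k, y_t
```

A full run would initialize `M = np.tile(np.eye(d), (n, 1, 1))` and `b = np.zeros((n, d))`, then call `cab_round` once per round. Inverting all $n$ matrices per round, as done here, is for clarity only; maintaining the inverses via rank-one updates would be the natural optimization.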
## 4. Regret Analysis

Our regret analysis depends on a specific measure of hardness of the data at hand: for an observed sequence of users $\{i_t\}_{t=1}^T = \{i_1, \ldots, i_T\}$ and a corresponding sequence of item sets $\{C_t\}_{t=1}^T = \{C_1, \ldots, C_T\}$, where $C_t = \{x_{t,1}, \ldots, x_{t,c_t}\}$, the hardness $\mathrm{HD}(\{i_t, C_t\}_{t=1}^T, \eta)$ of the pairing $\{i_t, C_t\}_{t=1}^T$ at level $\eta > 0$ is defined as
$$\mathrm{HD}\big(\{i_t, C_t\}_{t=1}^T, \eta\big) = \max\Big\{t = 1, \ldots, T \,:\, \exists j \in U,\ \exists k_1, k_2, \ldots, k_t \ \text{such that}\ \textstyle\sum_{s \le t \,:\, i_s = j} x_{s,k_s} x_{s,k_s}^\top\ \text{has smallest eigenvalue} < \eta \Big\}\,.$$
Given a data sequence $\{i_t, C_t\}_{t=1}^T$, the quantity $\mathrm{HD}(\{i_t, C_t\}_{t=1}^T, \eta)$ measures the number of rounds we need to wait until the correlation matrices $M_{j,t}$ corresponding to all users $j \in U$ have eigenvalues lower bounded by $\eta$. For the sake of convenience, the above quantity is calculated using the worst possible way of updating the matrices $M_{j,t}$ through rank-one adjustments based on the data. Based on this hardness definition, the following result summarizes our main efforts in this section.

**Theorem 1.** *Suppose CAB is executed on $\{i_t, C_t\}_{t=1}^T$, with $c_t \le c$ for all $t$, and suppose the condition $|u_j^\top x - w_{j,t}^\top x| \le \mathrm{CB}_{j,t}(x)$ holds for all $j \in U$ and $x \in \mathbb{R}^d$, along with the γ-gap assumption. Then the cumulative regret $\sum_{t=1}^T r_t$ of CAB can be deterministically upper bounded as follows:*
$$\sum_{t=1}^T r_t \;\le\; 9\,\alpha(T)\,c\;\mathrm{HD}\Big(\{i_t, C_t\}_{t=1}^T,\ \tfrac{16\,\alpha^2(T)}{\gamma^2}\Big) \;+\; O\!\left(\alpha(T)\sqrt{d\,\log T\ \textstyle\sum_{t=1}^T \frac{n}{|N_{i_t}(\bar{x}_t)|}}\,\right),$$
*where we set $\alpha(T) = O(\sqrt{d \log T})$.*

Some comments are in order. Theorem 1 delivers a deterministic bound on the cumulative regret, composed of two terms. The first term is a measure of the hardness of the data sequence $\{i_t, C_t\}_{t=1}^T$ at hand, whereas the second term is the usual $\sqrt{T}$-style term in linear bandit regret analyses (Auer, 2002; Chu et al., 2011; Abbasi-Yadkori et al., 2011). However, note that the dependence of the second term on the total number $n$ of users to be served gets replaced by the (possibly much smaller) quantities $n/|N_{i_t}(\bar{x}_t)|$, which depend on the actual sizes of the context-dependent clusters of the served users. The statement of the theorem requires $|u_j^\top x - w_{j,t}^\top x| \le \mathrm{CB}_{j,t}(x)$, which holds by standard results (see Lemma 3 below). We will shortly see that if the pairings $\{i_t, C_t\}_{t=1}^T$ are generated in a favorable manner, e.g., by sampling the item vectors $x_{t,k}$ i.i.d. from a (possibly unknown) distribution over the instance space (see Lemma 1 below), the hardness measure can be upper bounded with high probability by a term of the form $\frac{\log T}{\gamma^2}$, up to factors depending on $n$ and $d$. Similarly, for the second term, in the simple case where $|N_{i_t}(\bar{x}_t)| = B$ for all $t$, the second term takes the form $\sqrt{\frac{n}{B}\,T}$, up to log factors. Notice that $\sqrt{T}$ is roughly the regret effort for learning a single bandit, and $\sqrt{\frac{n}{B}\,T}$ is the effort for learning $\frac{n}{B}$-many (unrelated) clusters of bandits when the clustering is known. Thus, in this example, it is the ratio $\frac{n}{B}$ that quantifies the hardness of the problem insofar as clustering is concerned. Again, under favorable circumstances (see Lemma 2 below), we can relate the quantity $\sum_{t=1}^T \frac{n}{|N_{i_t}(\bar{x}_t)|}$ to the expected number of context-dependent clusters of users, the expectation being w.r.t. the random draw of the item context vectors.

On the other hand, making no assumptions whatsoever on the way $\{i_t, C_t\}_{t=1}^T$ is generated makes it hard to exploit the cluster structure. For instance, if $\{i_t, C_t\}_{t=1}^T$ is generated by an adaptive adversary, this can cause $\mathrm{HD}(\{i_t, C_t\}_{t=1}^T, \eta)$ to become linear in $T$ for any constant $\eta > 1$, thereby making the bound in Theorem 1 vacuous. In contrast, a naive algorithm that disregards the cluster structure, making no attempt to incorporate collaborative effects and simply running $n$-many independent LinUCB-like algorithms (Auer, 2002; Abbasi-Yadkori et al., 2011; Chu et al., 2011), still easily yields a $\sqrt{nT}$ regret bound.³

³ To see this, simply observe that each of the $n$ LinUCB-like algorithms has a regret bound of the form $\sqrt{T_i}$, where $T_i$ is the number of rounds where $i_t = i$. Then $\sum_{t=1}^T r_t \le \sum_{i=1}^n \sqrt{T_i} \le \sqrt{nT}$, with equality if $T_i = T/n$ for all $i$.
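A step worth making explicit (our own gloss, consistent with the definitions above): the level $16\,\alpha^2(T)/\gamma^2$ at which the hardness term is evaluated is precisely the eigenvalue threshold past which CAB's confidence bounds fall below $\gamma/4$, i.e., the point at which the algorithm switches from solo updates to confident, shared updates. Indeed, for any $\|x\| \le 1$,

```latex
\lambda_{\min}\!\big(M_{j,t-1}\big) \;\ge\; \frac{16\,\alpha^2(t)}{\gamma^2}
\quad\Longrightarrow\quad
\mathrm{CB}_{j,t-1}(x)
= \alpha(t)\sqrt{x^\top M_{j,t-1}^{-1}\,x}
\;\le\; \frac{\alpha(t)\,\|x\|}{\sqrt{\lambda_{\min}(M_{j,t-1})}}
\;\le\; \frac{\gamma}{4}\,.
```

So $\mathrm{HD}(\{i_t, C_t\}_{t=1}^T, 16\,\alpha^2(T)/\gamma^2)$ bounds the number of rounds one may have to wait before every user's correlation matrix clears this threshold, after which the γ-gap assumption allows neighborhoods to be identified reliably.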
A sufficient condition for controlling the hardness term in Theorem 1 is provided by the following lemma.

**Lemma 1.** *For each round $t$, let the context vectors $C_t = \{x_{t,1}, \ldots, x_{t,c_t}\}$ be generated i.i.d. (conditioned on $i_t$, $c_t$, the past data $\{i_s, C_s\}_{s=1}^{t-1}$, and the rewards $y_1, \ldots, y_{t-1}$) from a sub-Gaussian random vector $X \in \mathbb{R}^d$ with (conditional) variance parameter $\nu^2$, such that $\|X\| \le 1$ a.s. and $\mathbb{E}[XX^\top]$ is full rank with smallest eigenvalue $\lambda > 0$. Let $c_t \le c$ for all $t$, and $\nu^2 \le \frac{\lambda^2}{8\ln(4c)}$. Finally, let the sequence $\{i_t\}_{t=1}^T$ be generated uniformly at random,⁴ independently of all other variables. Then, with probability at least $1 - \delta$, $\mathrm{HD}(\{i_t, C_t\}_{t=1}^T, \eta) = O(n\,\eta)$, where the $O(\cdot)$ hides factors depending on $\lambda$ and $\log\frac{n}{\delta}$.*

⁴ Any (non-uniform) process $\{i_t\}_{t=1}^T$ that nevertheless ensures that each user $i$ gets served a number of times that is likely to grow unbounded with $T$ would suffice here.

The following lemma handles the second term in the bound of Theorem 1.

**Lemma 2.** *For each round $t$, let the context vectors $C_t = \{x_{t,1}, \ldots, x_{t,c_t}\}$ be generated i.i.d. (conditioned on $i_t$, $c_t$, the past data $\{i_s, C_s\}_{s=1}^{t-1}$, and the rewards $y_1, \ldots, y_{t-1}$) from a random vector $X \in \mathbb{R}^d$ with $\|X\| \le 1$. Let also $c_t \le c$ for all $t$. Then, with probability at least $1 - \delta$,*
$$\sum_{t=1}^T \frac{1}{|N_{i_t}(\bar{x}_t)|} \;\le\; \frac{2\,T\,c\;\mathbb{E}[m(X)]}{n} \;+\; 12\,\log\frac{\log T}{\delta}\,.$$

**Remark 1.** The linear dependence on $c$ can be turned into a logarithmic one, e.g., at the cost of an extra sub-Gaussian assumption on the variables $1/|N_i(x)|$, $i \in U$.

Finally, we recall the following upper confidence bound, from (Abbasi-Yadkori et al., 2011).

**Lemma 3.** *Let $\mathrm{CB}_{j,t}(x) = \alpha(t)\sqrt{x^\top M_{j,t}^{-1}\,x}$, with $\alpha(t) = O\big(\sqrt{d\,\log\frac{t}{\delta}}\big)$.⁵ Then, under the payoff noise model defined in Section 2, with probability at least $1 - \delta$ the condition $|u_j^\top x - w_{j,t}^\top x| \le \mathrm{CB}_{j,t}(x)$ holds uniformly for all $j \in U$, $x \in \mathbb{R}^d$, and $t = 1, 2, \ldots$*

⁵ The big-O notation here hides the dependence on the variance $\sigma^2$ of the payoff values.

A straightforward combination of Theorem 1 with Lemmata 1, 2, and 3 yields the following result.

**Corollary 1.** *Let $\mathrm{CB}_{j,t}(x)$ be defined with $\alpha(t)$ as in Lemma 3, and let the γ-gap assumption hold. Assume the context vectors are generated as in Lemma 1, in such a way that the sub-Gaussian assumption therein holds, with $c_t \le c$. Finally, let the sequence $\{i_t\}_{t=1}^T$ be generated as described in Lemma 1. Then, with probability at least $1 - \delta$, the regret of CAB (Algorithm 1) satisfies*
$$\sum_{t=1}^T r_t \;=\; R \;+\; \widetilde{O}\Big(d\,\sqrt{T\,c\;\mathbb{E}[m(X)]}\Big)\,,$$
*where the $\widetilde{O}$-notation hides logarithmic factors in $\frac{Tnd}{\delta}$, and $R$ is of the form⁶ $\frac{d\sqrt{d}}{\lambda^2\gamma^2}\,\log^{2.5}\!\big(\frac{Tnd}{\delta}\big)$.*

⁶ We note that no special efforts have been devoted here to obtaining sharper upper bounds on $R$.

**Sparse user models.** We conclude with a pointer to an extension of the CAB framework to sparse linear models, which extends past analyses on sparse linear bandits for a single user (Abbasi-Yadkori et al., 2012; Carpentier & Munos, 2012; Carpentier, 2015) to include collaborative effects. If all user vectors are sparse, i.e., $\|u_i\|_0 \le s$ for all $i \in U$ with $s \ll d$, then by replacing the least-squares solution in step 4 of Algorithm 1 with the solution computed by a fully corrective sparse recovery method (Dai & Milenkovic, 2009; Jain et al., 2014; Needell & Tropp, 2008), we can obtain an improved regret bound. Specifically, the factor $d\sqrt{d}$ in the term $R$ above can be replaced by $s^2\sqrt{s}$, and the factor $d$ multiplying the $\sqrt{T}$-term by a correspondingly smaller, $s$-dependent factor.
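As an illustration of the sparse variant, the sketch below replaces the ridge-style proxy $w_{i,t-1} = M_{i,t-1}^{-1} b_{i,t-1}$ of step 4 with a fully corrective, hard-thresholding-style estimator in the spirit of (Jain et al., 2014). This is our own minimal rendering, not the authors' implementation; `iht_proxy` and its parameters (`n_iters`, `lam`) are hypothetical choices.

```python
import numpy as np

def iht_proxy(X, y, s, n_iters=50, lam=1.0):
    """Sparse proxy for one user: s-sparse (ridge-regularized) least squares
    via iterative hard thresholding with a fully corrective refit.
    X: (T_i, d) items served to this user; y: (T_i,) observed payoffs."""
    T_i, d = X.shape
    w = np.zeros(d)
    step = 1.0 / (np.linalg.norm(X, ord=2) ** 2 + lam)  # safe step size
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) + lam * w          # regularized LS gradient
        w = w - step * grad
        S = np.argsort(np.abs(w))[-s:]              # keep the s largest entries
        w_new = np.zeros(d)
        # Fully corrective step: re-solve least squares on the support S.
        A = X[:, S]
        w_new[S] = np.linalg.solve(A.T @ A + lam * np.eye(s), A.T @ y)
        w = w_new
    return w
```

Here `X` and `y` collect the item vectors and payoffs credited to a given user so far; the fully corrective step re-fits the least-squares solution on the current support rather than merely thresholding the gradient step.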
## 5. Experiments

We tested CAB on production and real-world datasets, comparing it to standard baselines as well as to state-of-the-art bandit and clustering-of-bandits algorithms. For datasets with no item features, a one-hot encoding was used. We attempted to follow, as closely as possible, previous experimental settings, such as those described in (Cesa-Bianchi et al., 2013; Gentile et al., 2014; Korda et al., 2016; Li et al., 2016).

### 5.1. Dataset Description

**Tuenti.** Tuenti (owned by Telefonica) is a Spanish social network website that serves ads on its site. The data contain ad impressions viewed by users, along with a variable registering a click on an ad, and comprise $d = 105$ ads, $n = 14{,}612$ users, and 15M records/time steps. We adopted a one-hot encoding scheme for the items; hence items are described by the unit-norm vectors $e_1, \ldots, e_d$. Since the available payoffs are only those associated with the items actually served by the system, we performed offline policy evaluation through a standard importance-sampling technique: we discarded on the fly all records where the system's recommendation (the logged policy) did not coincide with the algorithm's recommendation. The resulting number of retained records was around $T = 1$M, loosely depending on the algorithm and the run. Yet, because this technique delivers reliable estimates only when the logged policy makes random choices (e.g., (Li et al., 2010)), we actually simulated a random logged policy as follows. At each round $t$, we retained the ad served to the current user $i_t$, with payoff value $a_t$ (1 = "clicked", 0 = "not clicked"), but also included 14 extra items (hence $c_t = 15$ for all $t$) drawn uniformly at random, in such a way that, for any item $e_j$, if $e_j$ occurs in some set $C_t$, then it is the item served by the system only $1/15$ of the time. Notice that this random selection was independent of the available payoff $a_t$.

**KDD Cup.** This dataset was released for the KDD Cup 2012 Online Advertising Competition,⁷ with instances derived from the session logs of the search engine soso.com. A search session included user, query, and ad information, and was divided into multiple instances, each described by the ad impressed at that time, at a certain depth and position. Instances with the same user ID, ad ID, and query were aggregated. We kept the chronological order among all instances, and seeded the algorithms with recommendation lists of length $c_t = 20$. Payoffs $a_t$ are again binary. The resulting dataset had $n = 10{,}333$ distinct users and $d = 6{,}780$ distinct ads. As with the Tuenti dataset, we generated random recommendation lists and a random logged policy, and again employed a one-hot encoding. The number of retained records was around $T = 0.1$M.

⁷ http://www.kddcup2012.org/c/kddcup2012-track2

**Avazu.** This dataset was released for the Avazu Click-Through Rate Prediction Challenge on Kaggle.⁸ Here, click-through data were ordered chronologically, and non-clicks and clicks were subsampled according to different strategies. As before, we simulated a random logged policy over recommendation lists of size $c_t = 20$ for all $t$. Payoffs are once again binary. The final dataset had $n = 48{,}723$ users and $d = 5{,}099$ items, while the number of retained records was around $T = 1.1$M. Again, we used a one-hot encoding for the items.

⁸ https://www.kaggle.com/c/avazu-ctr-prediction
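The offline evaluation described above is the standard replay/rejection-sampling scheme: a record counts only when the (uniformly random) logged action matches the algorithm's action. A minimal sketch (our own, assuming a stream of `(user, item_list, served_idx, payoff)` records with a uniformly random `served_idx`, and a `policy` object exposing hypothetical `select`/`update` methods):

```python
def replay_evaluate(policy, log_stream):
    """Offline evaluation against a uniformly random logged policy.
    Only rounds where the policy's pick matches the logged pick are retained."""
    clicks, retained = 0, 0
    for user, item_list, served_idx, payoff in log_stream:
        chosen = policy.select(user, item_list)
        if chosen != served_idx:
            continue                      # discard the record on the fly
        retained += 1
        clicks += payoff
        policy.update(user, item_list[chosen], payoff)
    return clicks / max(retained, 1)      # estimated clickthrough rate
```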
**Last FM and Delicious.** These two datasets⁹ are extracted from the music streaming service Last.fm and the social bookmarking web service Delicious. The Last FM dataset includes $n = 1{,}892$ users and 17,632 items (the artists). Delicious refers to $n = 1{,}861$ users and 69,226 items (the URLs). Preprocessing of the data followed previous experimental settings where these datasets have been used, e.g., (Cesa-Bianchi et al., 2013; Gentile et al., 2014). Specifically, starting from a tf-idf representation of the available items, the context vectors $x_{t,k}$ were generated by retaining only the first $d = 25$ principal components.

⁹ www.grouplens.org/node/462

Table 1. Dataset statistics. Here, $n$ is the number of users, $d$ is the dimension of the item vectors (which equals the number of items for Tuenti, KDD Cup, and Avazu), $c_t$ is the size of the recommendation lists, and $T$ is the number of records (or retained records, in the case of Tuenti, KDD Cup, and Avazu).

| Dataset | $n$ | $d$ | $c_t$ | $T$ |
|---|---|---|---|---|
| Tuenti | 14,612 | 105 | 15 | 1,000,000 |
| KDD Cup | 10,333 | 6,780 | 20 | 100,000 |
| Avazu | 48,723 | 5,099 | 20 | 1,100,000 |
| Last FM | 1,892 | 25 | 25 | 50,000 |
| Delicious | 1,861 | 25 | 25 | 50,000 |

Binary payoffs were created as follows. Last FM: if a user listened to an artist at least once, the payoff is 1, otherwise it is 0. Delicious: the payoff is 1 if the user bookmarked the URL, and 0 otherwise. We processed the datasets to make them suitable for use with multi-armed bandit algorithms: recommendation lists $C_t$ of size $c_t = 25$ for all $t$ were generated at random by first selecting the index $i_t$ at random over the $n$ users, and then padding with 24 vectors chosen at random from the items available up to that time step, in such a way that at least one of these 25 items had payoff 1 for the current user $i_t$. This was repeated for $T = 50{,}000$ time steps for each of the two datasets.
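The list construction just described can be sketched as follows (our own illustration; `positives[i]`, holding the indices of payoff-1 items for user $i$, is our assumed bookkeeping, and the "items available up to time $t$" detail is ignored for brevity):

```python
import numpy as np

def make_round(rng, n_users, items, positives):
    """Build one (user, C_t) pair: 1 guaranteed-positive item + 24 random ones.
    items: (num_items, d) array; positives[i]: payoff-1 item indices for user i."""
    i_t = int(rng.integers(n_users))
    pos = int(rng.choice(positives[i_t]))            # at least one payoff-1 item
    pad = rng.choice(len(items), size=24, replace=False)
    C_idx = np.concatenate(([pos], pad))
    rng.shuffle(C_idx)                               # hide the positive's position
    return i_t, items[C_idx]
```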
### 5.2. Algorithms

We used the first 20% of each dataset to tune the algorithms' parameters through a grid search, and report results on the remaining 80%. All results are averaged over 5 runs. We compared CAB to a number of state-of-the-art bandit and clustering-of-bandits methods:

- **CLUB** (Gentile et al., 2014): sequentially refines user clusters based on their confidence ellipsoid balls. We seeded the graph over users with an initial random Erdős-Rényi graph with sparsity parameter $p = (3 \log n)/n$. Since this is a randomized algorithm, each run was repeated five times and the results averaged (the observed variance turned out to be small).
- **DynUCB** (Nguyen & Lauw, 2014): uses a traditional k-means algorithm to cluster the bandits.
- **LinUCB-SINGLE**: uses a single instance of LinUCB (Chu et al., 2011) to serve all users, i.e., all users belong to the same cluster, independent of the items.
- **LinUCB-MULTIPLE**: uses an independent instance of LinUCB per user, with no user interaction. Each user forms their own cluster, independent of the items.
- **CAB**: the following variant of Algorithm 1: each user $j$ is considered for addition to the estimated neighborhoods $\hat{N}_k$ of the currently served user $i_t$ only if $w_{j,t-1}$ has been updated at least once in the past.
- **RAN**: recommends a uniformly random item from $C_t$.

All tested algorithms (except RAN) are based on upper-confidence bounds of the form $\mathrm{CB}_{i,t}(x) = \alpha\sqrt{x^\top M_{i,t}^{-1}\,x\;\log(1+t)}$. We tuned $\alpha$ for all algorithms across the grid $\{0, 0.01, 0.02, \ldots, 0.2\}$. The $\alpha_2$ parameter in CLUB was tuned within $\{0.1, 0.2, \ldots, 0.5\}$. The number of clusters in DynUCB was increased in an exponential progression, starting from 1 and ending at $n$. Finally, the $\gamma$ parameter in CAB was simply set to 0.2; in fact, the value of $\gamma$ did not turn out to have a significant influence on the performance of the version of CAB we tested.

### 5.3. Results

Figures 1, 2, and 3 summarize our experimental results. For the online advertising datasets Tuenti, KDD Cup, and Avazu (Figure 1), we report performance using the click-through rate (CTR); hence the higher the curves, the better. For the Last FM and Delicious datasets (Figure 2), we report the ratio of the cumulative regret of the tested algorithm to the cumulative regret of RAN; hence the lower the curves, the better. Our experimental setting, as well as some of the results reproduced here, are in line with past work in the area (Li et al., 2010; Cesa-Bianchi et al., 2013; Gentile et al., 2014; Li et al., 2016). Given the way the data have been prepared, our findings give reliable estimates of the actual CTR (Figure 1) or regret (Figure 2) performance of the tested algorithms.

[Figure 1. Clickthrough rate vs. retained records ("time") for CAB, CLUB, DynUCB, LinUCB-SINGLE, and LinUCB-MULTIPLE on the three online advertising datasets. The higher the curves, the better.]

[Figure 2. Ratio of the cumulative regret of each algorithm to the cumulative regret of RAN, against time, on the two datasets Last FM and Delicious. The lower the curves, the better.]

[Figure 3. Average (estimated) Euclidean distance between the served user $i_t$ and all other users, as a function of $t$, for Last FM (left) and Delicious (right). The distance is computed by associating with each user a model vector obtained through a regularized least-squares solution based on all available data for that user (instance vectors and corresponding payoffs).]

In four out of five datasets, CAB was found to offer performance superior to all the other baselines. CAB performed particularly well on the Tuenti dataset, where it delivered almost double the CTR of some of the baselines. CAB's performance advantage was more moderate on the KDD Cup and Avazu datasets. This is expected, since exploiting collaborative effects matters more on a dataset like Tuenti, where users are exposed to few (~100) ads, than on the KDD Cup dataset (where ads are modulated by a user query) or the Avazu dataset, both of which have a much broader ad base (~7,000). This provides a strong indication that CAB effectively exploits collaborative effects. In general, on Tuenti, KDD Cup, and Avazu (Figure 1), CAB was found to offer benefits in the cold-start region (i.e., the initial, relatively small fraction of the time horizon), but it also continued to maintain its lead throughout.

On the Last FM and Delicious datasets (Figure 2), the results we report are consistent with (Gentile et al., 2014). On Last FM, all methods are again outperformed by CAB. The overall performance of all bandit methods seems, however, to be relatively poor; this can be attributed to the way the Last FM dataset was generated: users typically have little interaction with the music-serving system, and a lot of the songs played were generated by a recommender.
Hence, while there are collaborative effects, they are relatively weak compared to datasets such as Tuenti. On the other hand, on the Delicious dataset the best performing strategy turns out to be LinUCB-MULTIPLE, which deliberately avoids any feedback-sharing mechanism among the users. This dataset reflects user web-browsing patterns, as evinced by their bookmarks. As noted previously (Gentile et al., 2014), this dataset does not seem to contain any collaborative information, hence we can hardly expect to take advantage of clustering efforts. To shed further light, in Figure 3 we plot the average distance between a linear model for user $i_t$ and the corresponding linear models for all other users $j \ne i_t$, as a function of $t$. For each user of these two datasets, these linear models were computed by taking the whole test set and treating each pairing $(x_{t,k}, y_{t,k})$ with $i_t = i$ and $y_{t,k} = 1$ as a training sample for a (regularized) least-squares estimator for user $i$. The conclusion we can draw, after visually comparing the left and the right plots of Figure 3, is that on Delicious these estimated user models tend to be significantly more separated than on Last FM, which readily explains the effectiveness of LinUCB-MULTIPLE. Moreover, on Delicious, studies have shown that tags (which are used as item features) are generally chosen by users to reflect their interests and for personal use; hence we can expect these features to diverge even for similar websites. On the other hand, Last FM tags are typically indicative of the genre of the song.

## 6. Conclusions and Ongoing Research

In this paper we proposed CAB, a novel contextual bandit algorithm for personalized recommendation systems. CAB effectively incorporates collaborative effects by implementing a simple context-dependent feedback-sharing mechanism, which greatly relaxes restrictions imposed by earlier works. We established crisp regret bounds for CAB that scale gracefully with the expected number of clusters over the users. We further found CAB to outperform existing approaches in practice, through extensive experimentation on a number of production and real-world datasets. We have begun preliminary experiments with Thompson-sampling versions of CAB and its competitors, but have thus far not observed statistically significant deviations from the results in Section 5. From a theoretical standpoint, it would be desirable to complement our upper bound in Corollary 1 with a lower bound helping characterize the regret complexity of our problem. We are also planning to experimentally benchmark the performance of the sparse-bandit version of CAB, which infers sparse user models.

## Acknowledgments

The authors thank the ICML reviewers for useful comments. CG acknowledges partial support from Criteo through a Faculty Research Award. PK is supported by the Deep Singh and Daljeet Kaur Faculty Fellowship and the Research-I Foundation at IIT Kanpur, and thanks Microsoft Research India for a research grant. GZ would like to thank N. Cesa-Bianchi for inspiring discussions regarding bandit algorithms and related topics.

## References

Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved Algorithms for Linear Stochastic Bandits. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), 2011.
Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Online-to-Confidence-Set Conversions and Application to Sparse Stochastic Bandits. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Agrawal, Shipra and Goyal, Navin. Thompson Sampling for Contextual Bandits with Linear Payoffs. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Audibert, Jean-Yves, Munos, Rémi, and Szepesvári, Csaba. Exploration-Exploitation Tradeoff Using Variance Estimates in Multi-armed Bandits. Theoretical Computer Science, 410(19):1876-1902, 2009.

Auer, Peter. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.

Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47:235-256, 2002.

Azar, Mohammad Gheshlaghi, Lazaric, Alessandro, and Brunskill, Emma. Sequential Transfer in Multi-armed Bandit with Finite Set of Models. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013.

Carpentier, Alexandra. Implementable Confidence Sets in High Dimensional Regression. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

Carpentier, Alexandra and Munos, Rémi. Bandit Theory Meets Compressed Sensing for High-dimensional Stochastic Linear Bandit. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Cesa-Bianchi, Nicolò, Gentile, Claudio, and Zappella, Giovanni. A Gang of Bandits. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013.

Chu, Wei, Li, Lihong, Reyzin, Lev, and Schapire, Robert. Contextual Bandits with Linear Payoff Functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Crammer, Koby and Gentile, Claudio. Multiclass Classification with Bandit Feedback Using Adaptive Regularization. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.

Dai, Wei and Milenkovic, Olgica. Subspace Pursuit for Compressive Sensing Signal Reconstruction. IEEE Transactions on Information Theory, 55(5):2230-2249, 2009.

Dekel, Ofer, Gentile, Claudio, and Sridharan, Karthik. Selective Sampling and Active Learning from Single and Multiple Teachers. Journal of Machine Learning Research, 13:2655-2697, 2012.

Djolonga, Josip, Krause, Andreas, and Cevher, Volkan. High-Dimensional Gaussian Process Bandits. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013.

Gentile, Claudio, Li, Shuai, and Zappella, Giovanni. Online Clustering of Bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

Jain, Prateek, Tewari, Ambuj, and Kar, Purushottam. On Iterative Hard Thresholding Methods for High-dimensional M-Estimation. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.

Kakade, S. and Tewari, A. On the Generalization Ability of Online Strongly Convex Programming Algorithms. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), 2008.

Korda, Nathan, Szörényi, Balázs, and Li, Shuai. Distributed Clustering of Linear Bandits in Peer to Peer Networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
Krause, Andreas and Ong, Cheng Soon. Contextual Gaussian Process Bandit Optimization. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), 2011.

Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert. A Contextual-Bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International World Wide Web Conference (WWW), 2010.

Li, Shuai, Karatzoglou, Alexandros, and Gentile, Claudio. Collaborative Filtering Bandits. In Proceedings of the 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2016.

Maillard, Odalric-Ambrym and Mannor, Shie. Latent Bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

Needell, Deanna and Tropp, Joel A. CoSaMP: Iterative Signal Recovery from Incomplete and Inaccurate Samples. Applied and Computational Harmonic Analysis, 26:301-321, 2008.

Nguyen, Trong and Lauw, Hady. Dynamic Clustering of Contextual Multi-Armed Bandits. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), 2014.

Pilászy, István and Tikk, Domonkos. Recommending New Movies: Even a Few Ratings Are More Valuable Than Metadata. In Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys), 2009.

Sutskever, Ilya, Salakhutdinov, Ruslan, and Tenenbaum, Joshua. Modelling Relational Data Using Bayesian Clustered Tensor Factorization. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009.

Tropp, Joel A. Freedman's Inequality for Matrix Martingales. Electronic Communications in Probability, 16:262-270, 2011.

Wu, Qingyun, Wang, Huazheng, Gu, Quanquan, and Wang, Hongning. Contextual Bandits in a Collaborative Environment. In Proceedings of the 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2016.

Yue, Yisong, Hong, Sue Ann, and Guestrin, Carlos. Hierarchical Exploration for Accelerating Contextual Bandits. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Zhou, Li and Brunskill, Emma. Latent Contextual Bandits and their Application to Personalized Recommendations for New Users. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), 2016.