# Spectral Bandits for Smooth Graph Functions

Michal Valko (MICHAL.VALKO@INRIA.FR), INRIA Lille - Nord Europe, SequeL team, 40 avenue Halley, 59650 Villeneuve d'Ascq, France
Rémi Munos (REMI.MUNOS@INRIA.FR), INRIA Lille - Nord Europe, SequeL team, France; Microsoft Research New England, Cambridge, MA, USA
Branislav Kveton (BRANISLAV.KVETON@TECHNICOLOR.COM), Technicolor Research Center, 735 Emerson St, Palo Alto, CA 94301, USA
Tomáš Kocák (TOMAS.KOCAK@INRIA.FR), INRIA Lille - Nord Europe, SequeL team, 40 avenue Halley, 59650 Villeneuve d'Ascq, France

Abstract

Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to that of its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a real-world content-recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.

1. Introduction

A smooth graph function is a function on a graph that returns similar values on neighboring nodes. This concept arises frequently in manifold and semi-supervised learning (Zhu, 2008), and reflects the fact that the outcomes on neighboring nodes tend to be similar. It is well known (Belkin et al., 2006; 2004) that a smooth graph function can be expressed as a linear combination of the eigenvectors of the graph Laplacian with the smallest eigenvalues. Therefore, the problem of learning such a function can be cast as a regression problem on these eigenvectors. This is the first work that brings this concept to bandits.

Figure 1. Eigenvectors 1-6 from the Flixster data, corresponding to the few smallest eigenvalues of the graph Laplacian, projected onto the first principal component of the data. Colors indicate the values.

In particular, we study a bandit problem where the arms are the nodes of a graph and the expected payoff of pulling an arm is a smooth function on this graph. We are motivated by a range of practical problems that involve graphs. One application is targeted advertisement in social networks. Here, the graph is a social network and our goal is to discover a part of the network that is interested in a given product. Interests of people in a social network tend to change smoothly (McPherson et al., 2001), because friends tend to have similar preferences. Therefore, we take advantage of this structure and formulate the problem as learning a smooth preference function on a graph. Another application of our work is recommender systems (Jannach et al., 2010).
In content-based recommendation (Chau et al., 2011), the user is recommended items that are similar to the items that the user rated highly in the past. The underlying assumption is that users rate similar items similarly. The similarity of the items can be measured, for instance, by a nearest-neighbor graph (Billsus et al., 2000), where each item is a node and its neighbors are the most similar items.

2. Setting

In this paper, we consider the following learning setting. The graph is known in advance and its edges represent the similarity of the nodes. At time $t$, we choose a node and then observe its payoff. In targeted advertisement, this may correspond to showing an ad and then observing whether the person clicked on it. In content-based recommendation, this may correspond to recommending an item and then observing the assigned rating. Based on the payoff, we update our model of the world and the game proceeds to time $t + 1$. Since the number of nodes $N$ can be huge, we are interested in the regime $T < N$.

If the smooth graph function can be expressed as a linear combination of $k$ eigenvectors of the graph Laplacian, and $k$ is small and known, our learning problem can be solved using ordinary linear bandits (Auer, 2002; Li et al., 2010). In practice, $k$ is problem specific and unknown. Moreover, the number of features $k$ may approach the number of nodes $N$. Therefore, proper regularization is necessary so that the regret of the learning algorithm does not scale with $N$. We are interested in the setting where the regret is independent of $N$, and the problem is therefore non-trivial.

Let $\mathcal{G}$ be the given graph with the set of nodes $\mathcal{V}$ and denote by $|\mathcal{V}| = N$ the number of nodes. Let $W$ be the $N \times N$ matrix of similarities $w_{ij}$ (edge weights) and $D$ the $N \times N$ diagonal matrix with entries $d_{ii} = \sum_j w_{ij}$. The graph Laplacian of $\mathcal{G}$ is defined as $L = D - W$. Let $\{\lambda^L_k, q_k\}_{k=1}^N$ be the eigenvalues and eigenvectors of $L$, ordered such that $0 = \lambda^L_1 \le \lambda^L_2 \le \dots \le \lambda^L_N$. Equivalently, let $L = Q\Lambda_L Q^{\mathsf{T}}$ be the eigendecomposition of $L$, where $Q$ is an $N \times N$ orthogonal matrix with the eigenvectors in its columns.

In our setting, we assume that the reward function is a linear combination of the eigenvectors. For any set of weights $\alpha$, let $f_\alpha : \mathcal{V} \to \mathbb{R}$ be the function defined on the nodes, linear in the basis of the eigenvectors of $L$:
$$f_\alpha(v) = \sum_{k=1}^N \alpha_k (q_k)_v = \langle x_v, \alpha\rangle,$$
where $x_v$ is the $v$-th row of $Q$, i.e., $(x_v)_i = (q_i)_v$. If the weight coefficients of the true $\alpha^*$ are such that the large coefficients correspond to the eigenvectors with small eigenvalues and vice versa, then $f_{\alpha^*}$ is a smooth function on $\mathcal{G}$ (Belkin et al., 2006). Figure 1 displays the first few eigenvectors of the Laplacian constructed from the data we use in our experiments.

In the extreme case, the true $\alpha^*$ may be of the form $[\alpha^*_1, \alpha^*_2, \dots, \alpha^*_k, 0, \dots, 0]^{\mathsf{T}} \in \mathbb{R}^N$ for some $k \ll N$. Had we known $k$ in such a case, known linear bandit algorithms would work with performance scaling with $k$ instead of $D = N$. Unfortunately, first, we do not know $k$ and, second, we do not want to assume such an extreme case (i.e., $\alpha^*_i = 0$ for $i > k$). Therefore, we opt for the more plausible assumption that the coefficients with high indexes are small. Consequently, we deliver algorithms whose performance scales with the smoothness of the reward function with respect to the graph.

The learning setting is the following. In each time step $t \le T$, the recommender $\pi$ chooses a node $\pi(t)$ and obtains a noisy reward
$$r_t = \langle x_{\pi(t)}, \alpha^*\rangle + \varepsilon_t,$$
where the noise $\varepsilon_t$ is assumed to be $R$-sub-Gaussian for any $t$.
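To make the setting above concrete, the following minimal numpy sketch builds a toy weighted graph, its Laplacian and eigenbasis, a smooth payoff function $f_{\alpha^*} = Q\alpha^*$ with weight on the smoothest eigenvectors, and noisy rewards revealed one node at a time. The path topology, the weights, and all numeric values are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small undirected similarity graph on N nodes (hypothetical weights).
N = 8
W = np.zeros((N, N))
for i in range(N - 1):                      # a weighted path graph as a toy example
    W[i, i + 1] = W[i + 1, i] = rng.uniform(0.5, 1.0)

D = np.diag(W.sum(axis=1))                  # degree matrix, d_ii = sum_j w_ij
L = D - W                                   # graph Laplacian L = D - W

# Eigendecomposition L = Q Lambda_L Q^T, eigenvalues in increasing order.
lam_L, Q = np.linalg.eigh(L)

# A smooth reward function: large coefficients only on the smoothest eigenvectors.
alpha_star = np.zeros(N)
alpha_star[:3] = rng.normal(size=3)         # sparse weights on the low-frequency basis
f = Q @ alpha_star                          # f_alpha*(v) = <x_v, alpha*>, x_v = v-th row of Q

# Pulling node v returns a noisy reward with R-sub-Gaussian (here Gaussian) noise.
R = 0.01
def pull(v):
    return f[v] + rng.normal(scale=R)

print("expected payoffs:", np.round(f, 3), "best node:", int(np.argmax(f)))
```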
In our setting, we have $x_v \in \mathbb{R}^D$ and $\|x_v\|_2 \le 1$ for all $x_v$. The goal of the recommender is to minimize the cumulative regret with respect to the strategy that always picks the best node with respect to $\alpha^*$. Let $\pi(t)$ be the node picked (we also say the arm pulled) by an algorithm $\pi$ at time $t$. The cumulative (pseudo) regret of $\pi$ is defined as:
$$R_T = T\max_v f_{\alpha^*}(v) - \sum_{t=1}^T f_{\alpha^*}(\pi(t)).$$
We call this bandit setting spectral since it is built on the spectral properties of a graph. Compared to the linear and multi-armed bandits, the number of arms $K$ is equal to the number of nodes $N$ and to the dimension of the basis $D$ (the eigenvectors are of dimension $N$). However, a regret that scales with $N$ or $D$, as can be obtained in those settings, is not acceptable because the number of nodes can be large. While we are mostly interested in the setting with $K = N$, our algorithms and analyses apply to any finite $K$.

3. Related Work

The setting most related to ours is that of linear and contextual linear bandits. Auer (2002) proposed the SupLinRel algorithm and showed that it obtains $\sqrt{DT}$ regret, which matches the lower bound of Dani et al. (2008). The first practical and empirically successful algorithm was LinUCB (Li et al., 2010). Later, Chu et al. (2011) analyzed SupLinUCB, a LinUCB equivalent of SupLinRel, to show that it also obtains $\sqrt{DT}$ regret. Abbasi-Yadkori et al. (2011) proposed the OFUL algorithm for linear bandits, which obtains $D\sqrt{T}$ regret. Using their analysis, it is possible to show that LinUCB obtains $D\sqrt{T}$ regret as well (Remark 2). Whether LinUCB matches the $\sqrt{DT}$ lower bound for this setting is still an open problem.

Abernethy et al. (2008) and Bubeck et al. (2012) studied a more difficult adversarial setting of linear bandits, where the reward function is time-dependent. It is an open problem whether these approaches would work in our setting with an upper bound on the regret that scales better than with $D$. Kleinberg et al. (2008), Slivkins (2009), and Bubeck et al. (2011) use similarity information between the contexts of arms, assuming Lipschitz or more general properties. While such settings are indeed more general, the regret bounds scale worse with the relevant dimensions. Srinivas et al. (2010) and Valko et al. (2013) also perform maximization over smooth functions that are either sampled from a Gaussian process prior or have a small RKHS norm. Their setting is also more general than ours, since it already generalizes linear bandits. However, their regret bound in the linear case also scales with $D$. Moreover, the regret of these algorithms depends on a quantity for which data-independent bounds exist only for some kernels, whereas our effective dimension is always computable given the graph.

Another graph bandit setting, called the gang of bandits, was studied by Cesa-Bianchi et al. (2013), where each node is a linear bandit with its own weight vector and the weight vectors are assumed to be smooth on the graph. Next, Caron et al. (2012) assume that the learner obtains the reward not only from the selected node but also from all of its neighbors. Yet another kind of partial-observability model for bandits on graphs in the adversarial setting is considered by Alon et al. (2013).

4. Spectral Bandits

In our setting, the arms are orthogonal to each other. It would nevertheless be incorrect to think that the reward observed for one arm provides no information about the other arms, because we assume that under another basis the unknown parameter has a low norm.
This provides additional information across the arms through the estimation of the parameter. We can think of our setting as an $N$-armed bandit problem where $N$ is larger than the time horizon $T$ and the vector $\mu$ of mean rewards $\mu_k$ satisfies the property that, under a change of coordinates, it has small weights, i.e., there exists a known orthogonal matrix $U$ such that $\alpha = U\mu$ has a low norm. As a consequence, we can estimate $\alpha$ using penalization and then recover $\mu$.

Given a vector of weights $\alpha$, we define its $\Lambda$-norm as:
$$\|\alpha\|_\Lambda = \sqrt{\sum_{k=1}^N \lambda_k \alpha_k^2} = \sqrt{\alpha^{\mathsf{T}}\Lambda\alpha}. \qquad (1)$$
We make use of this norm later in our algorithms. Intuitively, we would like to penalize the coefficients of $\alpha$ that correspond to the eigenvectors with large eigenvalues, in other words, to the less smooth basis functions on $\mathcal{G}$.

4.1. Effective dimension

In order to present our algorithms and analyses, we introduce the notion of an effective dimension $d$. We keep using the capital $D$ to denote the ambient dimension, which is equal to $N$ in the spectral setting. Intuitively, the effective dimension is a proxy for the number of relevant dimensions. We first provide a formal definition and then discuss its properties.

In general, we assume that there exists a diagonal matrix $\Lambda$ with entries $0 < \lambda = \lambda_1 \le \lambda_2 \le \dots \le \lambda_N$ and a set of $K$ vectors $x_1, \dots, x_K \in \mathbb{R}^N$ such that $\|x_i\|_2 \le 1$ for all $i$. For the spectral bandits, we have $K = N$. Moreover, since $Q$ is an orthonormal matrix, $\|x_i\|_2 = 1$. Finally, since the first eigenvalue of a graph Laplacian is always zero, $\lambda^L_1 = 0$, we use $\Lambda = \Lambda_L + \lambda I$ in order to have $\lambda_1 = \lambda$.

Definition 1. Let the effective dimension $d$ be the largest $d$ such that:
$$(d-1)\,\lambda_d \le \frac{T}{\log(1 + T/\lambda)}.$$

The effective dimension $d$ is small when the coefficients $\lambda_i$ grow rapidly above $T$. This is the case when the dimension of the space $D$ (and $K$) is much larger than $T$, such as in graphs from social networks with a very large number of nodes $N$. In contrast, when the coefficients are all small, $d$ may be of the order of $T$, which would make our regret bounds useless. Figure 2 shows how $d$ behaves compared to $D$ on a generated graph and on the real Flixster network graph¹ that we use for the experiments in Section 6.

Figure 2. Effective dimension $d$ in the regime $T < N$ for a Barabási-Albert graph ($N = 500$) and the Flixster graph ($N = 4546$). The effective dimension for these data is much smaller than the ambient dimension $D = N$, which is 500 and 4546 respectively.

The actual form of Definition 1 comes from Lemma 6 and will become apparent in Section 5. The dependence of the effective dimension on $T$ comes from the fact that $d$ is related to the number of non-negligible dimensions characterizing the space in which the solution to the penalized least-squares problem may lie, since this solution is essentially constrained to an ellipsoid defined by the inverse of the eigenvalues multiplied by $T$. Consequently, $d$ is related to the metric dimension of this ellipsoid. Therefore, when $T$ goes to infinity, the constraint is relaxed and all directions matter; the solution can then lie anywhere in a (bounded) space of dimension $N$. On the contrary, for a smaller $T$, the ellipsoid possesses a smaller number of non-negligible dimensions. Notice that it is natural for the effective dimension to depend on $T$, as we consider the setting $T < N$. If we wanted to avoid $T$ in the definition of $d$, we could define it in terms of $N$ by replacing $T$ with $N$ in Definition 1, but this would only loosen its value.

¹ We set $\Lambda$ to $\Lambda_L + \lambda I$ with $\lambda = 0.01$, where $\Lambda_L$ is the graph Laplacian of the respective graph.
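Definition 1 is easy to evaluate once the Laplacian spectrum is available. The following sketch computes $d$ for a given horizon $T$ under the convention $\Lambda = \Lambda_L + \lambda I$ used in this paper; the path-graph example at the bottom is a hypothetical illustration, not the authors' code.

```python
import numpy as np

def effective_dimension(laplacian_eigvals, T, lam=0.01):
    """Largest d such that (d - 1) * lambda_d <= T / log(1 + T / lam),
    with Lambda = Lambda_L + lam * I (Definition 1)."""
    lam_all = np.sort(laplacian_eigvals) + lam    # lambda_1 <= ... <= lambda_N
    threshold = T / np.log(1.0 + T / lam)
    d = 1
    for i, lam_d in enumerate(lam_all, start=1):  # i plays the role of d
        if (i - 1) * lam_d <= threshold:
            d = i
    return d

# Toy usage on a path-graph Laplacian (hypothetical example).
N, T = 500, 250
L = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
L[0, 0] = L[-1, -1] = 1
print(effective_dimension(np.linalg.eigvalsh(L), T))
```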
Algorithm 1 SPECTRALUCB
Input:
  N: the number of nodes, T: the number of pulls
  {Λ_L, Q}: spectral basis of L
  λ, δ: regularization and confidence parameters
  R, C: upper bounds on the noise and on ‖α*‖_Λ
Run:
  Λ ← Λ_L + λI
  d ← max{d : (d − 1)λ_d ≤ T / log(1 + T/λ)}
  for t = 1 to T do
    Update the basis coefficients α̂:
      X_t ← [x_1, ..., x_{t−1}]^T
      r ← [r_1, ..., r_{t−1}]^T
      V_t ← X_t^T X_t + Λ
      α̂_t ← V_t^{−1} X_t^T r
      c_t ← 2R sqrt(d log(1 + t/λ) + 2 log(1/δ)) + C
    Choose the node v_t (x_{v_t} is the v_t-th row of Q):
      v_t ← arg max_v ( f_{α̂_t}(v) + c_t ‖x_v‖_{V_t^{−1}} )
    Observe the reward r_t
  end for

4.2. SPECTRALUCB

The first algorithm we present is SPECTRALUCB (Algorithm 1), which is based on LinUCB and uses the spectral penalty (1). For clarity, we set $x_t = x_{v_t} = x_{\pi(t)}$. We consider the regularized least-squares estimate $\hat\alpha_t$ of the form:
$$\hat\alpha_t = \arg\min_\alpha\left(\sum_{s=1}^{t-1}\left[\langle x_s, \alpha\rangle - r_s\right]^2 + \|\alpha\|_\Lambda^2\right).$$
A key part of the algorithm is the definition of the confidence widths $c_t\|x_v\|_{V_t^{-1}}$ for the prediction of the rewards. We take advantage of our analysis (Section 5.2) to define $c_t$ based on the effective dimension $d$, which is specifically tailored to our setting. By doing this, we also avoid the computation of the determinant (see Section 5). The following theorem characterizes the performance of SPECTRALUCB and bounds the regret as a function of the effective dimension $d$.

Theorem 1. Let $d$ be the effective dimension and $\lambda$ the minimum eigenvalue of $\Lambda$. If $\|\alpha^*\|_\Lambda \le C$ and $\langle x_v, \alpha^*\rangle \in [-1, 1]$ for all $x_v$, then the cumulative regret of SPECTRALUCB is, with probability at least $1 - \delta$, bounded as:
$$R_T \le \left[8R\sqrt{d\log(1 + T/\lambda) + 2\log(1/\delta)} + 4C + 4\right]\sqrt{dT\log(1 + T/\lambda)}.$$

Remark 1. The constant $C$ needs to be such that $\|\alpha^*\|_\Lambda \le C$. If we set $C$ too small, the true $\alpha^*$ will lie outside of the confidence region and far from $\hat\alpha_t$, causing the algorithm to underperform. Alternatively, $C$ can be time dependent, e.g., $C_t = \log t$. In such a case, we do not need to know an upper bound on $\|\alpha^*\|_\Lambda$ in advance, but our regret bound would only hold after some $t$, namely once $C_t \ge \|\alpha^*\|_\Lambda$.

We provide the proof of Theorem 1 in Section 5 and examine the performance of SPECTRALUCB experimentally in Section 6. The $d\sqrt{T}$ result of Theorem 1 is to be compared with classical linear bandits, where LinUCB, the algorithm often used in practice (Li et al., 2010), achieves $D\sqrt{T}$ cumulative regret. As mentioned above and demonstrated in Figure 2, in the $T < N$ regime we can expect $d \ll D = N$ and thereby obtain improved performance.
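The following sketch shows one possible implementation of the SPECTRALUCB loop of Algorithm 1 in numpy, reusing the quantities `Q`, `lam_L`, and the `pull` function from the earlier snippet. The explicit matrix inverse, the default constants, and the toy interface are simplifications made for illustration; this is not the authors' implementation, which relies on lazy updates and iterative inversion (Section 4.4).

```python
import numpy as np

def spectral_ucb(Q, lam_L, pull, T, R=0.01, lam=0.01, delta=0.001, C=1.0):
    """Sketch of SpectralUCB: ridge regression with the spectral penalty Lambda,
    plus an upper-confidence bonus c_t * ||x_v||_{V_t^{-1}} (Algorithm 1)."""
    N = Q.shape[0]
    lam_sorted = np.sort(lam_L + lam)
    # effective dimension: largest d with (d - 1) * lambda_d <= T / log(1 + T / lam)
    d = max(i + 1 for i in range(N)
            if i * lam_sorted[i] <= T / np.log(1 + T / lam))
    V = np.diag(lam_L + lam)                   # V_t = Lambda + sum_s x_s x_s^T
    b = np.zeros(N)                            # sum_s r_s x_s
    picks = []
    for t in range(1, T + 1):
        V_inv = np.linalg.inv(V)
        alpha_hat = V_inv @ b                  # penalized least-squares estimate
        c_t = 2 * R * np.sqrt(d * np.log(1 + t / lam) + 2 * np.log(1 / delta)) + C
        widths = np.sqrt(np.einsum('vi,ij,vj->v', Q, V_inv, Q))   # ||x_v||_{V^{-1}}
        v = int(np.argmax(Q @ alpha_hat + c_t * widths))          # optimistic node
        r = pull(v)                            # observe the noisy reward
        x = Q[v]
        V += np.outer(x, x)
        b += r * x
        picks.append(v)
    return picks
```

On the toy graph above, `spectral_ucb(Q, lam_L, pull, T=100)` should quickly concentrate its pulls on the best node after a short exploration phase.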
4.3. SPECTRALELIMINATOR

It is known that the upper bound available for LinUCB or OFUL is not optimal in terms of the dimension $D$ for the linear bandit setting with a finite number of arms. On the other hand, the algorithms SupLinRel and SupLinUCB achieve the optimal $\sqrt{DT}$ regret. In the following, we likewise provide an algorithm that scales better with $d$ and achieves $\sqrt{dT}$ regret. The algorithm is called SPECTRALELIMINATOR (Algorithm 2) and works in phases, eliminating the arms that are not promising. The phases are defined by the time indexes $t_1 = 1 \le t_2 \le \dots$ and depend on a parameter $\beta$.

Algorithm 2 SPECTRALELIMINATOR
Input:
  N: the number of nodes, T: the number of pulls
  {Λ_L, Q}: spectral basis of L
  λ: regularization parameter
  β, {t_j}_{j=1}^J: parameters of the elimination and the phases
Run:
  A_1 ← {x_1, ..., x_K}
  for j = 1 to J do
    V_{t_j} ← Λ_L + λI
    for t = t_j to min(t_{j+1} − 1, T) do
      Play the x_t ∈ A_j with the largest width and observe r_t:
        x_t ← arg max_{x ∈ A_j} ‖x‖_{V_t^{−1}}
        V_{t+1} ← V_t + x_t x_t^T
    end for
    Eliminate the arms that are not promising:
      α̂_t ← V_t^{−1} [x_{t_j}, ..., x_t][r_{t_j}, ..., r_t]^T
      A_{j+1} ← { x ∈ A_j : ⟨α̂_t, x⟩ + ‖x‖_{V_t^{−1}} β ≥ max_{x′ ∈ A_j} ( ⟨α̂_t, x′⟩ − ‖x′‖_{V_t^{−1}} β ) }
  end for

The algorithm is similar in spirit to the Improved UCB of Auer & Ortner (2010). The main idea of SPECTRALELIMINATOR is to divide the time steps into phases in order to introduce independence and allow the Azuma-Hoeffding inequality (Azuma, 1967) to be applied. In the following theorem we characterize the performance of SPECTRALELIMINATOR and show that its regret upper bound improves over that of SPECTRALUCB by a factor of $\sqrt{d}$.

Theorem 2. Choose the phase starts as $t_j = 2^{j-1}$. Assume that all rewards are in $[0, 1]$ and that $\|\alpha^*\|_\Lambda \le C$. For any $\delta > 0$, with probability at least $1 - \delta$, the cumulative regret of the SPECTRALELIMINATOR algorithm run with parameter $\beta = 2R\sqrt{14\log(2K\log_2 T/\delta)} + C$ is bounded as:
$$R_T \le \mathcal{O}\!\left(\left(R\sqrt{14\log\frac{2K\log_2 T}{\delta}} + C\right)\sqrt{dT\log\left(1 + \frac{T}{\lambda}\right)}\right).$$

4.4. Scalability and computational complexity

There are three main computational issues to address in order to make SPECTRALUCB scalable: the computation of $N$ UCBs, matrix inversion, and obtaining the eigenbasis which serves as an input to the algorithm. First, to speed up the computation of the $N$ UCBs in each time step, we use the lazy-updates technique (Desautels et al., 2012), which maintains a sorted queue of UCBs and in practice leads to substantial speed gains. Second, to speed up matrix inversion, we use iterative matrix inversion (Zhang, 2005). Finally, while the eigendecomposition of a general matrix is computationally difficult, Laplacians are symmetric diagonally dominant (SDD). This enables us to use fast SDD solvers such as CMG by Koutis et al. (2011). Furthermore, using CMG we can find good approximations to the first $J$ eigenvectors in $O(Jm\log m)$ time, where $m$ is the number of edges in the graph (e.g., $m = 10N$ in the Flixster experiment). CMG can easily work with $N$ in the millions. In general, we have $J = N$, but in our experience a smooth reward function can often be approximated by dozens of eigenvectors. In fact, $J$ can be considered an upper bound on the number of eigenvectors we actually need. Furthermore, by choosing a small $J$ we not only reduce the complexity of the eigendecomposition but also the complexity of the least-squares problem solved in each iteration. Choosing a small $J$ can significantly reduce the computation, but it is important to choose $J$ large enough so that fewer than $J$ eigenvectors still suffice. This way, the problem that we solve is still relevant and our analysis applies. In short, the problem cannot be solved trivially by choosing the first $k$ relevant eigenvectors, because $k$ is unknown. Therefore, in practice we choose the largest $J$ such that our method is able to run; one way to extract such a reduced basis is sketched below. In Section 6.4, we demonstrate that we can obtain good results with a relatively small $J$.
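As a rough illustration of the reduced-basis idea, the sketch below extracts the $J$ smallest Laplacian eigenpairs of a sparse similarity matrix. The paper relies on the CMG solver of Koutis et al. (2011) for this step; scipy's `eigsh` with shift-invert is only a generic stand-in chosen for this example, and the function name and parameters are assumptions.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def reduced_spectral_basis(W, J, lam=0.01):
    """Extract the J smallest Laplacian eigenvalues/eigenvectors of a sparse
    similarity matrix W (a reduced basis in the sense of Sections 4.4 and 6.4).
    Not the authors' pipeline: they use the CMG SDD solver instead of eigsh."""
    W = sp.csr_matrix(W)
    degrees = np.asarray(W.sum(axis=1)).ravel()
    L = sp.diags(degrees) - W                       # graph Laplacian L = D - W
    # Shift-invert around a small negative sigma: L - sigma*I is positive definite,
    # and 'LM' on the inverted operator returns the eigenvalues of L closest to 0.
    vals, vecs = eigsh(L, k=J, sigma=-1e-6, which='LM')
    order = np.argsort(vals)
    lam_L_J, Q_J = vals[order], vecs[:, order]
    return lam_L_J + lam, Q_J      # first J entries of Lambda = Lambda_L + lam*I and of Q

# Usage: lam_J, Q_J = reduced_spectral_basis(W, J=200); run SpectralUCB with x_v = v-th row of Q_J.
```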
5. Analysis

The analysis of SPECTRALUCB (Section 5.3) has two main ingredients. The first is the derivation of the confidence ellipsoid for $\hat\alpha$, which is a straightforward adaptation of the analysis of OFUL (Abbasi-Yadkori et al., 2011) using a self-normalized martingale inequality (Section 5.1). The second part is crucial for proving that the final regret bound scales only with the effective dimension $d$ and not with the ambient dimension $D$. We achieve this by considering the geometrical properties of the determinant which hold in our setting (Section 5.2). We also use this result to upper bound the regret of SPECTRALELIMINATOR (Section 5.4). The proofs of the lemmas are in the appendix.

5.1. Confidence ellipsoid

The first two lemmas are by Abbasi-Yadkori et al. (2011) and we restate them for convenience.

Lemma 1. Let $V_t = X_{t-1}^{\mathsf{T}}X_{t-1} + \Lambda$ and define $\xi_t = \sum_{s=1}^{t}\varepsilon_s x_s$. With probability at least $1 - \delta$, for all $t \ge 1$:
$$\|\xi_t\|^2_{V_t^{-1}} \le 2R^2\log\frac{|V_t|^{1/2}}{\delta\,|\Lambda|^{1/2}}.$$

Lemma 2. We have:
$$\sum_{s=1}^{T}\min\left(1, \|x_s\|^2_{V_s^{-1}}\right) \le 2\log\frac{|V_T|}{|\Lambda|}.$$

The next lemma is a generalization of Theorem 2 of Abbasi-Yadkori et al. (2011) to the regularization with $\Lambda$. The result of this lemma is also used for the confidence coefficient $c_t$ in Algorithm 1, which we upper bound in Section 5.2 in order to avoid the computation of determinants.

Lemma 3. Let $V_t = X_{t-1}^{\mathsf{T}}X_{t-1} + \Lambda$ and $\|\alpha^*\|_\Lambda \le C$. For any $x$ and any $t \ge 1$, with probability at least $1 - \delta$:
$$|x^{\mathsf{T}}(\hat\alpha_t - \alpha^*)| \le \|x\|_{V_t^{-1}}\left(R\sqrt{2\log\frac{|V_t|^{1/2}}{\delta\,|\Lambda|^{1/2}}} + C\right).$$

5.2. Effective dimension

In Section 5.1 we show that several quantities scale with $\log(|V_T|/|\Lambda|)$, which can be of the order of $D$. Therefore, in this part we present the key ingredient of our analysis, based on the geometrical properties of determinants (Lemmas 4 and 5), to upper bound $\log(|V_T|/|\Lambda|)$ by a term that scales with $d$ (Lemma 6). Not only does this allow us to show that the regret scales with $d$, it also helps us avoid the computation of the determinants in Algorithm 1.

Lemma 4. Let $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_N)$ be any diagonal matrix with strictly positive entries. For any vectors $(x_t)_{1\le t\le T}$ such that $\|x_t\|_2 \le 1$ for all $1 \le t \le T$, the determinant $|V_T|$ of $V_T = \Lambda + \sum_{t=1}^T x_t x_t^{\mathsf{T}}$ is maximized when all $x_t$ are aligned with the axes.

Lemma 5. For any $T$, let $V_T = \sum_{t=1}^T x_t x_t^{\mathsf{T}} + \Lambda$. Then:
$$\log\frac{|V_T|}{|\Lambda|} \le \max\sum_{i=1}^N\log\left(1 + \frac{t_i}{\lambda_i}\right),$$
where the maximum is taken over all possible positive real numbers $\{t_1, \dots, t_N\}$ such that $\sum_{i=1}^N t_i = T$.

Lemma 6. Let $d$ be the effective dimension. Then:
$$\log\frac{|V_T|}{|\Lambda|} \le 2d\log\left(1 + \frac{T}{\lambda}\right).$$

5.3. Cumulative regret of SPECTRALUCB

Proof of Theorem 1. Let $x^* = \arg\max_{x_v} x_v^{\mathsf{T}}\alpha^*$ and let $r_t$ denote the instantaneous regret at time $t$. With probability at least $1 - \delta$, for all $t$:
$$r_t = x^{*\mathsf{T}}\alpha^* - x_t^{\mathsf{T}}\alpha^* \le x_t^{\mathsf{T}}\hat\alpha_t + c_t\|x_t\|_{V_t^{-1}} - x_t^{\mathsf{T}}\alpha^* \quad (2)$$
$$\le x_t^{\mathsf{T}}\hat\alpha_t + c_t\|x_t\|_{V_t^{-1}} - x_t^{\mathsf{T}}\hat\alpha_t + c_t\|x_t\|_{V_t^{-1}} \quad (3)$$
$$= 2c_t\|x_t\|_{V_t^{-1}}.$$
The inequality (2) is by the algorithm design and reflects the optimistic principle of SPECTRALUCB. Specifically, $x^{*\mathsf{T}}\hat\alpha_t + c_t\|x^*\|_{V_t^{-1}} \le x_t^{\mathsf{T}}\hat\alpha_t + c_t\|x_t\|_{V_t^{-1}}$, from which, applying Lemma 3 to $x^*$:
$$x^{*\mathsf{T}}\alpha^* \le x^{*\mathsf{T}}\hat\alpha_t + c_t\|x^*\|_{V_t^{-1}} \le x_t^{\mathsf{T}}\hat\alpha_t + c_t\|x_t\|_{V_t^{-1}}.$$
In (3) we applied Lemma 3 again: $x_t^{\mathsf{T}}\hat\alpha_t \le x_t^{\mathsf{T}}\alpha^* + c_t\|x_t\|_{V_t^{-1}}$. Finally, by the Cauchy-Schwarz inequality, Lemmas 2 and 6, and the assumption $r_t \le 2$:
$$R_T = \sum_{t=1}^T r_t \le 2\max(1, c_T)\sum_{t=1}^T\min\left(1, \|x_t\|_{V_t^{-1}}\right) \le 2(c_T + 1)\sqrt{T\sum_{t=1}^T\min\left(1, \|x_t\|^2_{V_t^{-1}}\right)} \le 2(c_T + 1)\sqrt{2T\log\frac{|V_T|}{|\Lambda|}} \le 4(c_T + 1)\sqrt{dT\log(1 + T/\lambda)}.$$
Above, we used that $c_t \le c_T$ because $c_t$ is nondecreasing. By plugging in $c_T$ and using Lemma 6 once more, we get that with probability at least $1 - \delta$:
$$R_T \le \left[4R\sqrt{2\log\frac{|V_T|^{1/2}}{\delta\,|\Lambda|^{1/2}}} + 4C + 4\right]\sqrt{dT\log(1 + T/\lambda)} \le \left[8R\sqrt{d\log(1 + T/\lambda) + 2\log(1/\delta)} + 4C + 4\right]\sqrt{dT\log(1 + T/\lambda)}.$$

Remark 2. Notice that if we set $\Lambda = I$ in Algorithm 1, we recover LinUCB. Since $\log(|V_T|/|\Lambda|)$ can be upper bounded by $D\log T$ (Abbasi-Yadkori et al., 2011), we obtain an $O(D\sqrt{T})$ upper bound on the regret of LinUCB as a corollary of Theorem 1.

5.4. Cumulative regret of SPECTRALELIMINATOR

The probability space induced by the rewards $r_1, r_2, \dots$ can be decomposed as a product of independent probability spaces induced by the rewards in each phase $[t_j, t_{j+1} - 1]$. Denote by $\mathcal{F}_j$ the $\sigma$-algebra generated by the rewards $r_1, \dots, r_{t_{j+1}-1}$, i.e., those received before and during phase $j$. We have the following three lemmas for any phase $j$. Let us write $V_j$ for $V_{t_j}$ and $\hat\alpha_j$ for $\hat\alpha_{t_j}$.

Lemma 7. For any fixed $x \in \mathbb{R}^N$ and any $\delta > 0$, if $\beta(\alpha^*, \delta) = 2R\sqrt{14\log(2/\delta)} + \|\alpha^*\|_\Lambda$, then at time $t_j$ (the beginning of phase $j$):
$$\mathbb{P}\left(|x^{\mathsf{T}}(\hat\alpha_j - \alpha^*)| \le \|x\|_{V_j^{-1}}\,\beta(\alpha^*, \delta)\right) \ge 1 - \delta.$$

Lemma 8. For all $x \in A_j$, we have:
$$\|x\|^2_{V_j^{-1}} \le \frac{1}{t_j - t_{j-1}}\sum_{s=t_{j-1}+1}^{t_j}\|x_s\|^2_{V_s^{-1}}.$$

Lemma 9. For each phase $j$, we have:
$$\sum_{s=t_{j-1}+1}^{t_j}\min\left(1, \|x_s\|^2_{V_s^{-1}}\right) \le 2d\log\left(1 + \frac{t_j}{\lambda}\right).$$

Combining all the lemmas above, we obtain:
Lemma 10. For each $j$, $x$, and $\delta$, with probability at least $1 - \delta$:
$$\min\left(1, |x^{\mathsf{T}}(\hat\alpha_j - \alpha^*)|\right) \le \beta(\alpha^*, \delta)\sqrt{\frac{2d\log(1 + t_j/\lambda)}{t_j - t_{j-1}}}.$$

Now we are ready to upper bound the cumulative regret of SPECTRALELIMINATOR.

Proof of Theorem 2. Let $J = \lceil\log_2 T\rceil$. We have:
$$R_T = \sum_{t=1}^T\langle x^* - x_t, \alpha^*\rangle = \sum_{j=0}^{J}\sum_{t=t_j}^{t_{j+1}-1}\langle x^* - x_t, \alpha^*\rangle \le \sum_{j=0}^{J}(t_{j+1} - t_j)\left[\langle x^* - x_t, \hat\alpha_j\rangle + \left(\|x^*\|_{V_j^{-1}} + \|x_t\|_{V_j^{-1}}\right)\beta\right],$$
on an event $\Omega$ of probability at least $1 - KJ\delta$, where we used Lemma 7 in the last inequality. By the definition of the action subset $A_j$ at phase $j$, under $\Omega$ we have:
$$\langle x^* - x_t, \hat\alpha_j\rangle \le \left(\|x^*\|_{V_j^{-1}} + \|x_t\|_{V_j^{-1}}\right)\beta,$$
since $x^* \in A_j$ for all $j \le J$. By the previous lemma:
$$R_T \le \sum_{j=0}^{J}(t_{j+1} - t_j)\,\beta(\alpha^*, \delta)\sqrt{\frac{2d\log(1 + t_j/\lambda)}{t_j - t_{j-1}}} = \mathcal{O}\!\left(\beta\sqrt{dT\log\left(1 + \frac{T}{\lambda}\right)}\right),$$
where the last step uses the phase starts $t_j = 2^{j-1}$, so that the sum over the phases is a geometric series of order $\sqrt{T}$.

Figure 3. Cumulative regret for the random graph models: Barabási-Albert ($N = 500$), Erdős-Rényi ($N = 500$), and lattice ($N = 625$), comparing SpectralUCB, SpectralEliminator, and LinUCB.

Remark 3. If we set $\Lambda = I$ in Algorithm 2, as in Remark 2, we get a new algorithm, LINEARELIMINATOR, which is a competitor to SupLinRel (Auer, 2002) and, as a corollary of Theorem 2, also enjoys an $O(\sqrt{DT})$ upper bound on the cumulative regret. On the other hand, compared to SupLinRel, LINEARELIMINATOR and its analysis are significantly simpler and more elegant.

6. Experiments

We evaluated our algorithms and compared them to LinUCB. In all experiments we set $\delta$ to 0.001 and $R$ to 0.01. For SPECTRALUCB and SPECTRALELIMINATOR we set $\Lambda$ to $\Lambda_L + \lambda I$ with $\lambda = 0.01$. For LinUCB we regularized with $\lambda I$, where $\lambda = 1$. Our results are robust to small perturbations of all learning parameters. We also performed experiments with SupLinRel, SupLinUCB, and SupSpectralUCB², but for known reasons (Chu et al., 2011) these algorithms are not efficient³ and they were always outperformed by SPECTRALUCB and LinUCB.

² An equivalent of SupLinUCB with the spectral regularization.
³ At least for the sizes of $N$ and $T$ that we deal with.

6.1. Random graph models

To simulate realistic graph structures, we generated graphs of $N$ nodes using three models that are commonly used in social network modeling. First, we considered the widely known Erdős-Rényi (ER) model; we sampled the edges of the ER model independently, each with probability 3%. Second, we considered the Barabási-Albert (BA) model (1999) with degree parameter 3; BA models are commonly used for modeling real networks due to their preferential-attachment property. Finally, we considered graphs whose edge structure forms a regular lattice. For all graph models, we assigned uniformly random weights to the edges. Then, we randomly generated a $k$-sparse vector $\alpha^*$ of $N$ weights, $k \ll N$, and defined the true graph function as $f = Q\alpha^*$, where $Q$ is the matrix of eigenvectors from the eigendecomposition of the graph Laplacian. We ran the algorithms in the desired $T < N$ regime, with $N = 500$ ($N = 5^4 = 625$ for the lattice), $T = 250$, and $k = 5$.

Figure 3 shows that, for all three models, the regret of LinUCB still has a linear trend within the first $T$ steps, unlike SPECTRALUCB, which performs much better. Unfortunately, even though the regret bound of SPECTRALELIMINATOR is asymptotically better, it was outperformed by SPECTRALUCB. This is similar to the linear case, where LinUCB outperforms⁴ SupLinUCB in practice (Chu et al., 2011) while it remains an open problem whether LinUCB can be shown to have a better regret guarantee.

⁴ For example, these algorithms do not use all of the obtained rewards for the estimation of $\hat\alpha$.
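The following sketch reproduces the random-graph protocol of this section under the stated parameters ($N = 500$, $T = 250$, $k = 5$, BA degree parameter 3): it builds a weighted Barabási-Albert graph, a $k$-sparse $\alpha^*$ on the smoothest eigenvectors, and measures the cumulative regret of the `spectral_ucb` sketch from Section 4.2. The use of networkx, the seeds, and the weight range are illustrative assumptions, not the authors' setup.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)

# Barabasi-Albert graph with degree parameter 3 and uniformly random edge weights.
N, T, k = 500, 250, 5
G = nx.barabasi_albert_graph(N, 3, seed=1)
W = nx.to_numpy_array(G)
W[W > 0] = rng.uniform(0.1, 1.0, size=int((W > 0).sum()))
W = np.maximum(W, W.T)                       # keep the weight matrix symmetric

L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
lam_L, Q = np.linalg.eigh(L)

# k-sparse true weight vector on the smoothest eigenvectors, f = Q alpha*.
alpha_star = np.zeros(N)
alpha_star[:k] = rng.normal(size=k)
f = Q @ alpha_star

R = 0.01
pull = lambda v: f[v] + rng.normal(scale=R)  # noisy payoff of node v

picks = spectral_ucb(Q, lam_L, pull, T, R=R)           # sketch from Section 4.2
regret = T * f.max() - f[picks].sum()
print("cumulative regret:", round(float(regret), 3))
```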
6.2. MovieLens experiments

In this experiment, we took the user preferences and the similarity graph over movies from the MovieLens dataset (Lam & Herlocker, 2012), a dataset of 6k users who provided one million ratings. We divide the dataset into two equally sized parts. The first part is used to build our model of the users, i.e., of the rating that user $i$ assigns to movie $j$. We factor the user-item matrix using low-rank matrix factorization (Keshavan et al., 2009) as $M \approx UV^{\mathsf{T}}$, a standard approach to collaborative filtering. The rating that user $i$ assigns to movie $j$ is estimated as $\hat r_{i,j} = \langle u_i, v_j\rangle$, where $u_i$ is the $i$-th row of $U$ and $v_j$ is the $j$-th row of $V$. The rating $\hat r_{i,j}$ is the payoff of pulling arm $j$ when recommending to user $i$. The second part is used to build our similarity graph over movies. We factor it in the same way as the first part. The graph contains an edge between movies $i$ and $i'$ if movie $i'$ is among the 10 nearest neighbors of movie $i$ in the latent space of items $V$. The weight of every edge is one. Notice that if two items are close in the item space, then their expected ratings are expected to be similar. However, the opposite is not true: if two items have a similar expected rating, they do not have to be close in the item space. In other words, we take advantage of the ratings but do not hardwire two similarly rated items to be similar.

In Figure 4, we sampled 50 users and evaluated the regret of both algorithms for $T = 100$. Here SPECTRALUCB suffers only about one fourth of the regret of LinUCB, specifically 43.4611 vs. 133.0996 on average.

Figure 4. MovieLens and Flixster: cumulative regret of SpectralUCB and LinUCB for 50 randomly chosen users ($T = 100$). The horizontal axis shows the user number.

Figure 5. Cumulative regret and total computational time (in seconds) with the reduced basis, for $J = 20$, $200$, and $2000$.

6.3. Flixster experiments

We also performed experiments on user preferences from the movie recommendation website Flixster. The social network of the users was crawled by Jamali & Ester (2010) and then clustered by Graclus (2013) to obtain a strongly connected subgraph. We extracted a subset of users and movies such that each movie has at least 30 ratings and each user rated at least 30 movies. This resulted in a dataset of 4546 movies and 5202 users. As with the MovieLens dataset, we completed the missing ratings by a low-rank matrix factorization and used it to construct a 10-NN similarity graph (sketched below). Again, in Figure 4, we sampled 50 users and evaluated the regret of both algorithms for $T = 100$. On average, SPECTRALUCB suffers only about one third of the regret of LinUCB, specifically 37.6499 vs. 99.8730.
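For completeness, here is a sketch of the 10-NN similarity-graph construction used in Sections 6.2 and 6.3, assuming the latent item factors (the rows of $V$) are already available from the low-rank factorization; the brute-force distance computation and the function name are assumptions made only for illustration.

```python
import numpy as np

def knn_similarity_graph(V_items, k=10):
    """Build a k-NN movie similarity graph from the latent item factors (rows of
    V_items): an unweighted edge links each movie to its k nearest neighbors in
    the latent space, as described in Section 6.2."""
    n = V_items.shape[0]
    # pairwise squared Euclidean distances in the latent space
    sq = (V_items ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2 * V_items @ V_items.T
    np.fill_diagonal(dist, np.inf)              # no self-edges
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argpartition(dist[i], k)[:k]  # k nearest items to movie i
        W[i, nbrs] = 1.0                        # the weight of every edge is one
    return np.maximum(W, W.T)                   # symmetrize the adjacency matrix

# Usage: W = knn_similarity_graph(V_items, k=10); then L = diag(W.sum(1)) - W.
```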
6.4. Reduced basis

As discussed in Section 4.4, one can decrease the computational complexity, and thus increase the scalability, by extracting only the first $J \ll N$ eigenvectors of the graph Laplacian. First, the computational complexity of this operation is $O(Jm\log m)$, where $m$ is the number of edges. Second, the least-squares problem that we solve in each time step of the algorithm is only $J$-dimensional. In Figure 5 we plot the cumulative regret and the total computational time in seconds (log scale) for a single user from the MovieLens dataset. We varied $J$ over 20, 200, and 2000, which corresponds to about 1%, 10%, and 100% of the basis functions ($N = 2019$). The total computational time also includes the computational savings from lazy updates and iterative matrix inversion. We see that with 10% of the eigenvectors we can achieve performance similar to that of the full basis, at a fraction of the computational time.

7. Conclusion

We presented the spectral bandit setting, inspired mostly by applications in recommender systems and targeted advertisement in social networks. In this setting, we are asked to repeatedly maximize an unknown graph function that is assumed to be smooth on a given similarity graph. Traditional linear bandit algorithms can be applied, but their regret scales with the ambient dimension $D$, either linearly or as a square root, which can be very large. Therefore, we introduced two algorithms, SPECTRALUCB and SPECTRALELIMINATOR, whose regret scales only with the effective dimension $d$, which is typically much smaller than $D$ for real-world graphs. We demonstrated that SPECTRALUCB delivers the desired benefit for graphs generated by the Barabási-Albert, Erdős-Rényi, and regular lattice models, as well as for the movie rating data from the MovieLens and Flixster social networks. In the future, we plan to extend this work to a sparse setting, where the smooth function is assumed to be a linear combination of only a finite number of eigenvectors.

8. Acknowledgements

We would like to thank Yiannis Koutis for his great help with the efficient computation of eigenvectors. We thank Andreas Krause for suggesting the lazy updates of the UCBs. We would also like to thank Giovanni Zappella, Claudio Gentile, and especially Alessandro Lazaric for helpful discussions. The research presented in this paper was supported by the French Ministry of Higher Education and Research, by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270327 (project CompLACS), and by Intel Corporation.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved Algorithms for Linear Stochastic Bandits. In Neural Information Processing Systems, 2011.

Abernethy, J. D., Hazan, E., and Rakhlin, A. Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization. In Conference on Learning Theory, 2008.

Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. From Bandits to Experts: A Tale of Domination and Independence. In NIPS, 2013.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.

Auer, P. and Ortner, R. UCB Revisited: Improved Regret Bounds for the Stochastic Multi-Armed Bandit Problem. Periodica Mathematica Hungarica, 2010.

Azuma, K. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3):357-367, 1967.

Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286:509-512, 1999.

Belkin, M., Matveeva, I., and Niyogi, P.
Regularization and Semi-Supervised Learning on Large Graphs. In Conference on Learning Theory, 2004.

Belkin, M., Niyogi, P., and Sindhwani, V. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research, 7:2399-2434, 2006.

Billsus, D., Pazzani, M. J., and Chen, J. A learning agent for wireless news access. In IUI, pp. 33-36, 2000.

Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. X-armed bandits. Journal of Machine Learning Research, 12:1587-1627, 2011.

Bubeck, S., Cesa-Bianchi, N., and Kakade, S. Towards minimax policies for online linear optimization with bandit feedback. In COLT, 2012.

Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. Leveraging Side Observations in Stochastic Bandits. In Uncertainty in Artificial Intelligence, pp. 142-151, 2012.

Cesa-Bianchi, N., Gentile, C., and Zappella, G. A Gang of Bandits. In NIPS, 2013.

Chau, D. H., Kittur, A., Hong, J. I., and Faloutsos, C. Apolo: Making sense of large network data by combining rich user interaction and machine learning. In CHI, 2011.

Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual Bandits with Linear Payoff Functions. In AISTATS, 2011.

Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic Linear Optimization under Bandit Feedback. In The 21st Annual Conference on Learning Theory, 2008.

Desautels, T., Krause, A., and Burdick, J. Parallelizing Exploration-Exploitation Tradeoffs with Gaussian Process Bandit Optimization. In ICML, 2012.

Graclus. Graclus, 2013. URL www.cs.utexas.edu/users/dml/Software/graclus.html.

Jamali, M. and Ester, M. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the Fourth ACM Conference on Recommender Systems. ACM, 2010.

Jannach, D., Zanker, M., Felfernig, A., and Friedrich, G. Recommender Systems: An Introduction. Cambridge University Press, 2010.

Keshavan, R., Oh, S., and Montanari, A. Matrix Completion from a Few Entries. In IEEE International Symposium on Information Theory, pp. 324-328, 2009.

Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandit problems in metric spaces. In 40th ACM Symposium on Theory of Computing, 2008.

Koutis, I., Miller, G. L., and Tolliver, D. Combinatorial preconditioners and multilevel solvers for problems in computer vision and image processing. Computer Vision and Image Understanding, 115:1638-1646, 2011.

Lam, S. and Herlocker, J. MovieLens 1M Dataset. http://www.grouplens.org/node/12, 2012.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A Contextual-Bandit Approach to Personalized News Article Recommendation. In WWW, 2010.

McPherson, M., Smith-Lovin, L., and Cook, J. Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology, 27:415-444, 2001.

Slivkins, A. Contextual Bandits with Similarity Information. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 1-27, 2009.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proceedings of the International Conference on Machine Learning, 2010.

Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-Time Analysis of Kernelised Contextual Bandits. In Uncertainty in Artificial Intelligence, 2013.

Zhang, F. The Schur Complement and Its Applications, volume 4. Springer, 2005.

Zhu, X. Semi-Supervised Learning Literature Survey. Technical Report 1530, University of Wisconsin-Madison, 2008.