# Spectral Thompson Sampling

Tomáš Kocák (SequeL team, INRIA Lille - Nord Europe, France), Michal Valko (SequeL team, INRIA Lille - Nord Europe, France), Rémi Munos (SequeL team, INRIA Lille, France, and Microsoft Research NE, USA), Shipra Agrawal (ML and Optimization Group, Microsoft Research, Bangalore, India)

## Abstract

Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze the Spectral TS algorithm for a bandit problem where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of neighboring nodes are assumed to be similar. Although the setting has applications both in recommender systems and advertising, traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension $d$, which is small in real-world graphs. We deliver an analysis showing that the regret of Spectral TS scales as $d\sqrt{T \ln N}$ with high probability, where $T$ is the time horizon and $N$ is the number of choices. Since a $d\sqrt{T \ln N}$ regret is comparable to the known results, Spectral TS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.

## 1 Introduction

Thompson Sampling (Thompson 1933) is one of the oldest heuristics for sequential problems with limited feedback, also known as bandit problems. It addresses the exploration-exploitation dilemma by a simple and intuitive rule: when choosing the next action to play, choose it according to the probability that it is the best one, that is, the one that maximizes the expected payoff. Using this heuristic, it is straightforward to design many bandit algorithms, such as the Spectral TS algorithm presented in this paper. What is challenging, though, is to provide the analysis and prove performance guarantees for TS algorithms. This may be the reason why TS has not been at the center of interest of sequential machine learning, where mostly optimistic algorithms were studied (Auer, Cesa-Bianchi, and Fischer 2002; Auer 2002). Nevertheless, the past few years witnessed a rise of interest in TS due to its empirical performance, in particular in computational advertising (Chapelle and Li 2011), a major source of income for Internet companies. This motivated researchers to explain the success of TS. A major breakthrough in this respect was the work of Agrawal and Goyal (2012a), who provided the first finite-time analysis of TS. It was shortly followed by a refined analysis (Kaufmann, Korda, and Munos 2012) showing the optimal performance of TS for Bernoulli distributions. Some asymptotic results for TS were proved by May et al. (2012). Agrawal and Goyal (2013a) then provided a distribution-independent analysis of TS for the multi-armed bandit. The most relevant results for our work are by Agrawal and Goyal (2013b), who bring a new martingale technique enabling the analysis of cases where the payoffs of the actions are linear in some basis. Later, Korda, Kaufmann, and Munos (2013) extended the known optimal TS analysis to 1-dimensional exponential family distributions.
Finally, in the Bayesian setting, Russo and Van Roy (2013; 2014) analyzed TS with respect to the Bayesian risk. In our prior work (Valko et al. 2014), we introduced a spectral bandit setting, relevant for content-based recommender systems (Pazzani and Billsus 2007), where the payoff function is expressed as a linear combination of a smooth basis. In such systems, we aim to recommend content to a user based on her personal preferences. Recommender systems take advantage of content similarity in order to offer relevant suggestions. In other words, we assume that the user preferences are smooth on the similarity graph of content items. The results from manifold learning (Belkin, Niyogi, and Sindhwani 2006) show that the eigenvectors corresponding to the smallest eigenvalues of the similarity graph Laplacian offer a useful smooth basis. Another example of leveraging a useful structural property present in real-world data is to take advantage of the hierarchy of content features (Yue, Hong, and Guestrin 2012). Although LinUCB (Li et al. 2010), GP-UCB (Srinivas et al. 2010), and Linear TS (Agrawal and Goyal 2013b) could be used for the spectral bandit setting, they would not scale well with the number of possible items to recommend. This is why we defined (Valko et al. 2014) the effective dimension $d$, likely to be small in real-world graphs, and provided an algorithm based on the optimistic principle: Spectral UCB.

|  | Linear | Spectral |
| --- | --- | --- |
| Optimistic approach ($D^2 N$ per-step update) | LinUCB: $D\sqrt{T \ln T}$ | Spectral UCB: $d\sqrt{T \ln T}$ |
| Thompson Sampling ($D^2 + DN$ per-step update) | Linear TS: $D\sqrt{T \ln N}$ | Spectral TS: $d\sqrt{T \ln N}$ |

Table 1: Linear vs. Spectral Bandits

In this paper, we focus on the TS alternative: Spectral Thompson Sampling (Table 1). The algorithm is easy to obtain, since there is no need to derive upper confidence bounds. Furthermore, one of the main benefits of Spectral TS is its computational efficiency. The main contribution of this paper is the finite-time analysis of Spectral TS. We prove that the regret of Spectral TS scales as $d\sqrt{T \ln N}$, which is comparable to the known results. Although the regret bound is a factor of $\sqrt{\ln N}$ away from that of Spectral UCB, this factor is negligible for the relevant applications, e.g., movie recommendation. Interestingly, even for the linear case there is no polynomial-time algorithm for linear contextual bandits with better than $D\sqrt{T \ln N}$ regret (Agrawal and Goyal 2012b), where $D$ is the dimension of the context vector. The optimistic approach (UCB) for linear contextual bandits is not polynomially implementable when the number of choices is at least exponential in $D$ (e.g., when the set of arms is all the vectors in a polytope), and the approximation given by Dani, Hayes, and Kakade (2008) achieves only $D^{3/2}\sqrt{T}$ regret. Similarly, for the spectral bandit case, Spectral TS offers a computationally attractive alternative to Spectral UCB (Table 1, left). Instead of computing the upper confidence bounds for all the arms in each step, we only need to sample from our current belief of the true model and perform the maximization given this belief. We support this claim with an empirical evaluation.

## 2 Setting

In this section, we formally define the spectral bandit setting. The most important notation is summarized in Table 2. Let $G$ be the given graph with the set $V$ of $N$ nodes. Let $W$ be the $N \times N$ matrix of edge weights and $D$ the $N \times N$ diagonal matrix with entries $d_{ii} = \sum_j w_{ij}$. The graph Laplacian of $G$ is the $N \times N$ matrix defined as $\mathcal{L} = D - W$.
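As a concrete illustration of this construction (our own sketch, not code from the paper), the following NumPy snippet builds the Laplacian from a given weight matrix and computes the spectral basis that the next paragraphs formalize; the function name `spectral_basis` and the toy path graph are ours.

```python
import numpy as np

def spectral_basis(W):
    """Return (eigenvalues, Q) of the graph Laplacian L = D - W.

    W : (N, N) symmetric matrix of non-negative edge weights.
    Eigenvalues come back in ascending order; the rows of Q are the
    per-node feature vectors b_i used throughout the paper.
    """
    D = np.diag(W.sum(axis=1))      # degree matrix, d_ii = sum_j w_ij
    L = D - W                       # graph Laplacian
    eigvals, Q = np.linalg.eigh(L)  # symmetric eigendecomposition
    return eigvals, Q

# Toy usage: a weighted path graph on 4 nodes.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 2., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 0.]])
eigvals, Q = spectral_basis(W)
b = Q                               # b[i] = features of arm i in the spectral basis
```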
Let $\mathcal{L} = Q \Lambda_{\mathcal{L}} Q^{\mathsf{T}}$ be the eigendecomposition of $\mathcal{L}$, where $Q$ is an $N \times N$ orthogonal matrix with the eigenvectors of $\mathcal{L}$ in its columns. Let $\{b_i\}_{1 \le i \le N}$ be the $N$ rows of $Q$. This way, each $b_i$ corresponds to the features of action $i$ (commonly referred to as arm $i$) in the spectral basis. The reason for the spectral basis comes from manifold learning (Belkin, Niyogi, and Sindhwani 2006). In our setting, we assume that neighboring content has similar payoffs, which means that the payoff function is smooth on $G$. Belkin, Niyogi, and Sindhwani (2006) showed that smooth functions can be represented as a linear combination of the eigenvectors with small eigenvalues. This explains the choice of regularizer in Section 3. We now describe the learning setting.

| Symbol | Meaning |
| --- | --- |
| $N$ | number of arms |
| $b_i$ | feature vector of arm $i$ |
| $d$ | effective dimension |
| $a(t)$ | arm played at time $t$ |
| $a^*$ | optimal arm |
| $\lambda$ | regularization parameter |
| $C$ | upper bound on $\|\mu\|_\Lambda$ |
| $\mu$ | true (unknown) vector of weights |
| $v$ | $R\sqrt{6d \ln((\lambda + T)/(\delta\lambda))} + C$ |
| $p$ | $1/(4e\sqrt{\pi})$ |
| $l$ | $R\sqrt{2d \ln((\lambda + T)T^2/(\delta\lambda))} + C$ |
| $g$ | $\sqrt{4 \ln(TN)}\, v + l$ |
| $\Delta_i$ | $b_{a^*}^{\mathsf{T}}\mu - b_i^{\mathsf{T}}\mu$ |

Table 2: Overview of the notation.

At each time $t$, the recommender chooses an action $a(t)$ and receives a payoff that is in expectation linear in the associated features $b_{a(t)}$,
$$r(t) = b_{a(t)}^{\mathsf{T}}\mu + \varepsilon_t,$$
where $\mu$ encodes the (unknown) vector of user preferences and $\varepsilon_t$ is $R$-sub-Gaussian noise, i.e.,
$$\forall \xi \in \mathbb{R},\quad \mathbb{E}\big[e^{\xi \varepsilon_t} \mid \{b_i\}_{i=1}^N, H_{t-1}\big] \le \exp\!\left(\frac{\xi^2 R^2}{2}\right).$$
The history $H_{t-1}$ is defined as $H_{t-1} = \{a(\tau), r(\tau),\ \tau = 1, \dots, t-1\}$. We now define the performance metric of any algorithm for the setting above. The instantaneous (pseudo-)regret at time $t$ is defined as the difference between the mean payoff (reward) of the optimal arm $a^*$ and that of the arm $a(t)$ chosen by the recommender,
$$\mathrm{regret}(t) = \Delta_{a(t)} = b_{a^*}^{\mathsf{T}}\mu - b_{a(t)}^{\mathsf{T}}\mu.$$
The performance of any algorithm is measured in terms of the cumulative regret, which is the sum of regrets over time, $R(T) = \sum_{t=1}^{T} \mathrm{regret}(t)$.

## 3 Algorithm

In this paper, we use TS to decide which arm to play. Specifically, we represent our current knowledge about $\mu$ as the normal distribution $\mathcal{N}(\hat\mu(t), v^2 B_t^{-1})$, where $\hat\mu(t)$ is our current approximation of the unknown parameter $\mu$ and $v^2 B_t^{-1}$ reflects our uncertainty about it. As mentioned, we assume that the reward function is a linear combination of the eigenvectors of $\mathcal{L}$, with the large coefficients corresponding to the eigenvectors with small eigenvalues. We encode this assumption into our initial confidence ellipsoid by setting $B_1 = \Lambda = \Lambda_{\mathcal{L}} + \lambda I_N$, where $\lambda$ is a regularization parameter. After that, at every time step $t$ we generate a sample $\tilde\mu(t)$ from the distribution $\mathcal{N}(\hat\mu(t), v^2 B_t^{-1})$ and choose the arm $a(t)$ that maximizes $b_i^{\mathsf{T}}\tilde\mu(t)$. After receiving a reward, we update our estimate of $\mu$ and our confidence in it, i.e., we compute $\hat\mu(t+1)$ and $B_{t+1}$,
$$B_{t+1} = B_t + b_{a(t)} b_{a(t)}^{\mathsf{T}}, \qquad \hat\mu(t+1) = B_{t+1}^{-1} \sum_{i=1}^{t} b_{a(i)} r(i).$$

Remark 1. Since TS is a Bayesian approach, it requires a prior to run, and we choose it here to be Gaussian. However, this does not pose any assumption whatsoever about the actual data, either for the algorithm or for the analysis. The only assumptions we make about the data are: (a) that the mean payoff is linear in the features, (b) that the noise is sub-Gaussian, and (c) that we know a bound on the Laplacian norm of the mean reward function. We provide a frequentist bound on the regret (and not an average over the prior), which is a much stronger worst-case result.

The computational advantage of Spectral TS in Algorithm 1, compared to Spectral UCB, is that we do not need to compute a confidence bound for each arm.
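For concreteness, here is a minimal NumPy sketch of the sampling-and-update loop just described (Algorithm 1 below gives the paper's pseudocode). It is our own illustration: the function name, the `reward` callable, and the default value of `v` are assumptions, and a practical implementation would maintain $B_t^{-1}$ with rank-one (Sherman-Morrison) updates instead of re-inverting.

```python
import numpy as np

def spectral_ts(b, laplacian_eigvals, reward, T, lam=1.0, v=1.0, rng=None):
    """Sketch of the Spectral TS loop: sample a model from the current belief,
    play the best arm under the sample, then update the belief.

    b                 : (N, D) array whose rows are the spectral features b_i (here D = N).
    laplacian_eigvals : (D,) eigenvalues of the graph Laplacian (diagonal of Lambda_L).
    reward            : callable arm_index -> noisy reward r(t).
    v                 : width of the sampling distribution (see Table 2 / Algorithm 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    D = b.shape[1]
    B = np.diag(laplacian_eigvals) + lam * np.eye(D)   # B_1 = Lambda_L + lambda * I_N
    f = np.zeros(D)                                    # running sum of b_{a(i)} r(i)
    mu_hat = np.zeros(D)
    arms = []
    for _ in range(T):
        cov = v ** 2 * np.linalg.inv(B)                # v^2 * B_t^{-1}
        mu_tilde = rng.multivariate_normal(mu_hat, cov)
        a = int(np.argmax(b @ mu_tilde))               # arm maximizing b_i^T mu_tilde
        r = reward(a)
        f += r * b[a]
        B += np.outer(b[a], b[a])                      # B_{t+1} = B_t + b_a b_a^T
        mu_hat = np.linalg.solve(B, f)                 # hat mu(t+1) = B_{t+1}^{-1} f
        arms.append(a)
    return arms, mu_hat
```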
Indeed, in Spectral TS we need to sample $\tilde\mu$, which can be done in $N^2$ time (note that $B_t$ is only changed by a rank-one update), and to find a maximizer of $b_i^{\mathsf{T}}\tilde\mu$, which can also be done in $N^2$ time. On the other hand, in Spectral UCB we need to compute a $B_t^{-1}$-norm for each of the $N$ feature vectors, which amounts to $ND^2$ time. Table 1 (left) summarizes the computational complexity of the two approaches. Notice that in our setting $D = N$, which comes to $N^2$ vs. $N^3$ time per step. We support this argument in Section 5. Finally, note that the eigendecomposition needs to be done only once at the beginning, and since $\mathcal{L}$ is diagonally dominant, this can be done even for $N$ in the millions (Koutis, Miller, and Peng 2010).

Algorithm 1: Spectral Thompson Sampling

Input: $N$: number of arms, $T$: number of pulls; $\{\Lambda_{\mathcal{L}}, Q\}$: spectral basis of the graph Laplacian $\mathcal{L}$; $\lambda, \delta$: regularization and confidence parameters; $R, C$: upper bounds on the noise and on $\|\mu\|_\Lambda$.

Initialization: $v = R\sqrt{6d \ln((\lambda + T)/(\delta\lambda))} + C$; $\hat\mu = 0_N$, $f = 0_N$, $B = \Lambda_{\mathcal{L}} + \lambda I_N$.

Run: for $t = 1$ to $T$ do
- Sample $\tilde\mu \sim \mathcal{N}(\hat\mu, v^2 B^{-1})$
- $a(t) \leftarrow \arg\max_a b_a^{\mathsf{T}}\tilde\mu$
- Observe a noisy reward $r(t) = b_{a(t)}^{\mathsf{T}}\mu + \varepsilon_t$
- $f \leftarrow f + b_{a(t)} r(t)$
- Update $B \leftarrow B + b_{a(t)} b_{a(t)}^{\mathsf{T}}$
- Update $\hat\mu \leftarrow B^{-1} f$

end for

In order to state our result, we first define the effective dimension, a quantity shown to be small in real-world graphs (Valko et al. 2014).

Definition 1. Let the effective dimension $d$ be the largest $d$ such that
$$(d - 1)\lambda_d \le \frac{T}{\ln(1 + T/\lambda)},$$
where $\lambda_1 \le \lambda_2 \le \dots \le \lambda_N$ denote the eigenvalues of $\Lambda = \Lambda_{\mathcal{L}} + \lambda I_N$.

We would like to stress that we consider the regime $T < N$, because we aim for applications with a large set of arms and we are interested in a satisfactory performance after just a few iterations. For instance, when we aim to recommend $N$ movies, we would like to have useful recommendations within time $T < N$, i.e., before the user has seen all of them. In the typical $T > N$ setting, $d$ can be of the order of $N$ and our approach does not bring an improvement over linear bandit algorithms. The following theorem upper bounds the cumulative regret of Spectral TS in terms of $d$.

Theorem 1. Let $d$ be the effective dimension and $\lambda$ be the minimum eigenvalue of $\Lambda$. If $\|\mu\|_\Lambda \le C$ and $|b_i^{\mathsf{T}}\mu| \le 1$ for all $b_i$, then the cumulative regret of Spectral Thompson Sampling is, with probability at least $1 - \delta$, bounded as
$$R(T) \le \frac{11g}{p}\sqrt{\frac{4 + 4\lambda}{\lambda}\, dT \ln\frac{\lambda + T}{\lambda}} + \frac{1}{T} + \left(\frac{11}{\sqrt{\lambda}} + 2\right)\frac{g}{p}\sqrt{2T \ln\frac{2}{\delta}},$$
where $p = 1/(4e\sqrt{\pi})$ and
$$g = \sqrt{4\ln(TN)}\left(R\sqrt{6d \ln\frac{\lambda + T}{\delta\lambda}} + C\right) + R\sqrt{2d \ln\frac{(\lambda + T)T^2}{\delta\lambda}} + C.$$

Remark 2. Substituting $g$ and $p$, we see that the regret bound scales as $d\sqrt{T \ln N}$. Note that $N = D$ could be exponential in $d$, and we therefore need to consider the $\ln N$ factor in our bound. On the other hand, if $N$ is indeed exponential in $d$, then our algorithm scales with $\ln D \sqrt{T \ln D} = (\ln D)^{3/2}\sqrt{T}$, which is even better.

## 4 Analysis

### Preliminaries

In the first five lemmas, we state the known results on which we build our analysis.

Lemma 1. For a Gaussian distributed random variable $Z$ with mean $m$ and variance $\sigma^2$, for any $z \ge 1$,
$$\frac{1}{2\sqrt{\pi}\,z}e^{-z^2/2} \le \Pr(|Z - m| > \sigma z) \le \frac{1}{\sqrt{\pi}\,z}e^{-z^2/2}.$$

Multiple use of Sylvester's determinant theorem gives:

Lemma 2. Let $B_t = \Lambda + \sum_{\tau=1}^{t-1} b_\tau b_\tau^{\mathsf{T}}$. Then
$$\sum_{\tau=1}^{t-1}\ln\left(1 + \|b_\tau\|^2_{B_\tau^{-1}}\right) \le \ln\frac{|B_t|}{|\Lambda|}.$$

Lemma 3 (Abbasi-Yadkori, Pál, and Szepesvári 2011). Let $B_t = \Lambda + \sum_{\tau=1}^{t-1} b_\tau b_\tau^{\mathsf{T}}$ and define $\xi_t = \sum_{\tau=1}^{t-1} \varepsilon_\tau b_\tau$. Then, with probability at least $1 - \delta$, for all $t \ge 1$,
$$\|\xi_t\|^2_{B_t^{-1}} \le 2R^2 \ln\frac{|B_t|^{1/2}}{\delta\,|\Lambda|^{1/2}}.$$

The next lemma is a generalization of Theorem 2 of Abbasi-Yadkori, Pál, and Szepesvári (2011) to any $\Lambda$.

Lemma 4 (Lemma 3 by Valko et al. (2014)). Let $\|\mu\|_\Lambda \le C$ and let $B_t$ be as above. Then, with probability at least $1 - \delta$, for any $x$ and all $t \ge 1$,
$$|x^{\mathsf{T}}(\hat\mu(t) - \mu)| \le \|x\|_{B_t^{-1}}\left(R\sqrt{2\ln\frac{|B_t|^{1/2}}{\delta\,|\Lambda|^{1/2}}} + C\right).$$

Lemma 5 (Lemma 7 by Valko et al. (2014)). Let $d$ be the effective dimension. Then, for all $t \le T + 1$,
$$\ln\frac{|B_t|}{|\Lambda|} \le 2d\ln\left(1 + \frac{T}{\lambda}\right).$$
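Before moving to the regret analysis, the small helper below (our own sketch; the name and interface are assumptions, not from the paper) computes the effective dimension of Definition 1 from the Laplacian spectrum. It is reused in the experiment sketch in Section 5.

```python
import numpy as np

def effective_dimension(laplacian_eigvals, T, lam=1.0):
    """Largest d with (d - 1) * lambda_d <= T / ln(1 + T / lam), per Definition 1.

    laplacian_eigvals : eigenvalues of the graph Laplacian; lam is added to obtain
                        the eigenvalues of Lambda = Lambda_L + lam * I (ascending).
    """
    eigs = np.sort(np.asarray(laplacian_eigvals)) + lam
    threshold = T / np.log(1.0 + T / lam)
    d = 1
    for k, eig in enumerate(eigs, start=1):   # k = 1, 2, ..., N
        if (k - 1) * eig <= threshold:
            d = k
    return d
```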
### Cumulative regret analysis

Our analysis is based on the proof technique of Agrawal and Goyal (2013b). The technique can be summarized as follows. Each time an arm is played, our algorithm improves the confidence of its estimate of $\mu$ via the update of $B_t$, and thus the update of the confidence ellipsoid. However, when we play a suboptimal arm, the regret we incur can be much higher than the improvement of our knowledge. To overcome this difficulty, the arms are divided into two groups, saturated and unsaturated, based on whether an arm's gap exceeds a multiple of its standard deviation (Definition 3) or not. Consequently, the optimal arm is in the group of unsaturated arms. The idea is to bound the regret of playing an unsaturated arm in terms of its standard deviation, and to show that the probability that a saturated arm is played is small enough. This way we overcome the difficulty of incurring high regret while gaining little knowledge by playing an arm. In the following, we use the notation from Table 2.

Definition 2. We define $E^{\hat\mu}(t)$ as the event that for all $i$,
$$|b_i^{\mathsf{T}}\hat\mu(t) - b_i^{\mathsf{T}}\mu| \le l\,\|b_i\|_{B_t^{-1}},$$
and $E^{\tilde\mu}(t)$ as the event that for all $i$,
$$|b_i^{\mathsf{T}}\tilde\mu(t) - b_i^{\mathsf{T}}\hat\mu(t)| \le v\,\|b_i\|_{B_t^{-1}}\sqrt{4\ln(TN)}.$$

Definition 3. We say that an arm $i$ is saturated at time $t$ if $\Delta_i > g\,\|b_i\|_{B_t^{-1}}$, and unsaturated otherwise. Let $C(t)$ denote the set of saturated arms at time $t$ (note that the optimal arm $a^*$ is never saturated, since $\Delta_{a^*} = 0$).

Definition 4. We define the filtration $\mathcal{F}_{t-1}$ as the union of the history until time $t-1$ and the features, i.e., $\mathcal{F}_{t-1} = \{H_{t-1}\} \cup \{b_i,\ i = 1, \dots, N\}$. By definition, $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \dots \subseteq \mathcal{F}_{T-1}$.

Lemma 6. For all $t$ and all $0 < \delta < 1$, $\Pr(E^{\hat\mu}(t)) \ge 1 - \delta/T^2$, and for all possible filtrations $\mathcal{F}_{t-1}$, $\Pr(E^{\tilde\mu}(t) \mid \mathcal{F}_{t-1}) \ge 1 - 1/T^2$.

Proof. Bounding the probability of the event $E^{\hat\mu}(t)$: using Lemma 4, where $C$ is such that $\|\mu\|_\Lambda \le C$, with probability at least $1 - \delta$ we have for all $i$,
$$|b_i^{\mathsf{T}}(\hat\mu(t) - \mu)| \le \|b_i\|_{B_t^{-1}}\left(R\sqrt{2\ln\frac{|B_t|^{1/2}}{\delta\,|\Lambda|^{1/2}}} + C\right).$$
Therefore, using Lemma 5 and substituting $\delta/T^2$ for $\delta$, we get that with probability at least $1 - \delta/T^2$, for all $i$,
$$|b_i^{\mathsf{T}}(\hat\mu(t) - \mu)| \le \|b_i\|_{B_t^{-1}}\left(R\sqrt{2d\ln\frac{\lambda + T}{\lambda} + 2\ln\frac{T^2}{\delta}} + C\right) \le \|b_i\|_{B_t^{-1}}\left(R\sqrt{2d\ln\frac{(\lambda + T)T^2}{\delta\lambda}} + C\right) = l\,\|b_i\|_{B_t^{-1}}.$$
Bounding the probability of the event $E^{\tilde\mu}(t)$: the deviation of each individual term $|b_i^{\mathsf{T}}(\tilde\mu(t) - \hat\mu(t))|$ can be bounded using Lemma 1 with $z = \sqrt{4\ln(TN)}$ to get
$$\Pr\left(|b_i^{\mathsf{T}}(\tilde\mu(t) - \hat\mu(t))| > v\,\|b_i\|_{B_t^{-1}}\sqrt{4\ln(TN)}\right) \le \frac{e^{-2\ln(TN)}}{\sqrt{4\pi\ln(TN)}} \le \frac{1}{T^2 N}.$$
We complete the proof by taking a union bound over all $N$ vectors $b_i$. Notice that we took a different approach than Agrawal and Goyal (2013b) in order to avoid a dependence on the ambient dimension $D$.

Lemma 7. For any filtration $\mathcal{F}_{t-1}$ such that $E^{\hat\mu}(t)$ is true,
$$\Pr\left(b_{a^*}^{\mathsf{T}}\tilde\mu(t) > b_{a^*}^{\mathsf{T}}\mu \,\middle|\, \mathcal{F}_{t-1}\right) \ge \frac{1}{4e\sqrt{\pi}}.$$

Proof. Since $b_{a^*}^{\mathsf{T}}\tilde\mu(t)$ is a Gaussian random variable with mean $b_{a^*}^{\mathsf{T}}\hat\mu(t)$ and standard deviation $v\,\|b_{a^*}\|_{B_t^{-1}}$, we can use the anti-concentration inequality of Lemma 1,
$$\Pr\left(b_{a^*}^{\mathsf{T}}\tilde\mu(t) \ge b_{a^*}^{\mathsf{T}}\mu \,\middle|\, \mathcal{F}_{t-1}\right) = \Pr\left(\frac{b_{a^*}^{\mathsf{T}}\tilde\mu(t) - b_{a^*}^{\mathsf{T}}\hat\mu(t)}{v\,\|b_{a^*}\|_{B_t^{-1}}} \ge \frac{b_{a^*}^{\mathsf{T}}\mu - b_{a^*}^{\mathsf{T}}\hat\mu(t)}{v\,\|b_{a^*}\|_{B_t^{-1}}} \,\middle|\, \mathcal{F}_{t-1}\right) \ge \frac{1}{4\sqrt{\pi}\,|Z_t|}e^{-Z_t^2},$$
where
$$|Z_t| = \frac{|b_{a^*}^{\mathsf{T}}\mu - b_{a^*}^{\mathsf{T}}\hat\mu(t)|}{v\,\|b_{a^*}\|_{B_t^{-1}}}.$$
Since we consider a filtration $\mathcal{F}_{t-1}$ such that $E^{\hat\mu}(t)$ is true, we can upper bound the numerator to get
$$|Z_t| \le \frac{l\,\|b_{a^*}\|_{B_t^{-1}}}{v\,\|b_{a^*}\|_{B_t^{-1}}} = \frac{l}{v} \le 1.$$
Finally, $\Pr\left(b_{a^*}^{\mathsf{T}}\tilde\mu(t) > b_{a^*}^{\mathsf{T}}\mu \mid \mathcal{F}_{t-1}\right) \ge \frac{1}{4e\sqrt{\pi}}$.

Lemma 8. For any filtration $\mathcal{F}_{t-1}$ such that $E^{\hat\mu}(t)$ is true,
$$\Pr(a(t) \notin C(t) \mid \mathcal{F}_{t-1}) \ge \frac{1}{4e\sqrt{\pi}} - \frac{1}{T^2}.$$

Proof. The algorithm chooses the arm with the highest value of $b_i^{\mathsf{T}}\tilde\mu(t)$ to be played at time $t$. Therefore, if $b_{a^*}^{\mathsf{T}}\tilde\mu(t)$ is greater than $b_j^{\mathsf{T}}\tilde\mu(t)$ for all saturated arms, i.e., $b_{a^*}^{\mathsf{T}}\tilde\mu(t) > b_j^{\mathsf{T}}\tilde\mu(t)$ for all $j \in C(t)$, then one of the unsaturated arms (which include the optimal arm and other suboptimal unsaturated arms) must be played.
Therefore,
$$\Pr(a(t) \notin C(t) \mid \mathcal{F}_{t-1}) \ge \Pr\left(b_{a^*}^{\mathsf{T}}\tilde\mu(t) > b_j^{\mathsf{T}}\tilde\mu(t),\ \forall j \in C(t) \,\middle|\, \mathcal{F}_{t-1}\right).$$
By definition, for all saturated arms, i.e., for all $j \in C(t)$, $\Delta_j > g\,\|b_j\|_{B_t^{-1}}$. Now, if both of the events $E^{\hat\mu}(t)$ and $E^{\tilde\mu}(t)$ are true, then, by the definition of these events, for all $j \in C(t)$, $b_j^{\mathsf{T}}\tilde\mu(t) \le b_j^{\mathsf{T}}\mu + g\,\|b_j\|_{B_t^{-1}}$. Therefore, given a filtration $\mathcal{F}_{t-1}$ such that $E^{\hat\mu}(t)$ is true, either $E^{\tilde\mu}(t)$ is false, or else for all $j \in C(t)$,
$$b_j^{\mathsf{T}}\tilde\mu(t) \le b_j^{\mathsf{T}}\mu + g\,\|b_j\|_{B_t^{-1}} < b_{a^*}^{\mathsf{T}}\mu.$$
Hence, for any $\mathcal{F}_{t-1}$ such that $E^{\hat\mu}(t)$ is true,
$$\Pr\left(b_{a^*}^{\mathsf{T}}\tilde\mu(t) > b_j^{\mathsf{T}}\tilde\mu(t),\ \forall j \in C(t) \,\middle|\, \mathcal{F}_{t-1}\right) \ge \Pr\left(b_{a^*}^{\mathsf{T}}\tilde\mu(t) > b_{a^*}^{\mathsf{T}}\mu \,\middle|\, \mathcal{F}_{t-1}\right) - \Pr\left(\overline{E^{\tilde\mu}(t)} \,\middle|\, \mathcal{F}_{t-1}\right) \ge \frac{1}{4e\sqrt{\pi}} - \frac{1}{T^2}.$$
In the last inequality we used Lemma 6 and Lemma 7.

Lemma 9. For any filtration $\mathcal{F}_{t-1}$ such that $E^{\hat\mu}(t)$ is true,
$$\mathbb{E}[\Delta_{a(t)} \mid \mathcal{F}_{t-1}] \le \frac{11g}{p}\,\mathbb{E}\big[\|b_{a(t)}\|_{B_t^{-1}} \mid \mathcal{F}_{t-1}\big] + \frac{1}{T^2}.$$

Proof. Let $\bar a(t)$ denote the unsaturated arm with the smallest norm $\|b_i\|_{B_t^{-1}}$, i.e.,
$$\bar a(t) = \operatorname*{arg\,min}_{i \notin C(t)} \|b_i\|_{B_t^{-1}}.$$
Notice that since $C(t)$ and $\|b_i\|_{B_t^{-1}}$, for all $i$, are fixed once $\mathcal{F}_{t-1}$ is fixed, so is $\bar a(t)$. Now, using Lemma 8, for any $\mathcal{F}_{t-1}$ such that $E^{\hat\mu}(t)$ is true,
$$\mathbb{E}\big[\|b_{a(t)}\|_{B_t^{-1}} \mid \mathcal{F}_{t-1}\big] \ge \mathbb{E}\big[\|b_{a(t)}\|_{B_t^{-1}} \mid \mathcal{F}_{t-1},\ a(t) \notin C(t)\big]\,\Pr(a(t) \notin C(t) \mid \mathcal{F}_{t-1}) \ge \|b_{\bar a(t)}\|_{B_t^{-1}}\left(p - \frac{1}{T^2}\right).$$
Now, if the events $E^{\hat\mu}(t)$ and $E^{\tilde\mu}(t)$ are true, then for all $i$, by definition, $b_i^{\mathsf{T}}\tilde\mu(t) \le b_i^{\mathsf{T}}\mu + g\,\|b_i\|_{B_t^{-1}}$. Using this observation along with $b_{a(t)}^{\mathsf{T}}\tilde\mu(t) \ge b_i^{\mathsf{T}}\tilde\mu(t)$ for all $i$,
$$\begin{aligned}
\Delta_{a(t)} &= \Delta_{\bar a(t)} + \big(b_{\bar a(t)}^{\mathsf{T}}\mu - b_{a(t)}^{\mathsf{T}}\mu\big) \\
&\le \Delta_{\bar a(t)} + \big(b_{a(t)}^{\mathsf{T}}\tilde\mu(t) - b_{\bar a(t)}^{\mathsf{T}}\tilde\mu(t)\big) + g\,\|b_{a(t)}\|_{B_t^{-1}} + g\,\|b_{\bar a(t)}\|_{B_t^{-1}} \\
&\le \Delta_{\bar a(t)} + g\,\|b_{a(t)}\|_{B_t^{-1}} + g\,\|b_{\bar a(t)}\|_{B_t^{-1}} \\
&\le g\,\|b_{\bar a(t)}\|_{B_t^{-1}} + g\,\|b_{a(t)}\|_{B_t^{-1}} + g\,\|b_{\bar a(t)}\|_{B_t^{-1}} = 2g\,\|b_{\bar a(t)}\|_{B_t^{-1}} + g\,\|b_{a(t)}\|_{B_t^{-1}}.
\end{aligned}$$
Therefore, for any $\mathcal{F}_{t-1}$ such that $E^{\hat\mu}(t)$ is true, either $\Delta_{a(t)} \le 2g\,\|b_{\bar a(t)}\|_{B_t^{-1}} + g\,\|b_{a(t)}\|_{B_t^{-1}}$, or $E^{\tilde\mu}(t)$ is false. We can deduce that
$$\mathbb{E}[\Delta_{a(t)} \mid \mathcal{F}_{t-1}] \le \mathbb{E}\big[2g\,\|b_{\bar a(t)}\|_{B_t^{-1}} + g\,\|b_{a(t)}\|_{B_t^{-1}} \,\big|\, \mathcal{F}_{t-1}\big] + \frac{1}{T^2} \le \frac{2g}{p - \frac{1}{T^2}}\,\mathbb{E}\big[\|b_{a(t)}\|_{B_t^{-1}} \mid \mathcal{F}_{t-1}\big] + g\,\mathbb{E}\big[\|b_{a(t)}\|_{B_t^{-1}} \mid \mathcal{F}_{t-1}\big] + \frac{1}{T^2} \le \frac{11g}{p}\,\mathbb{E}\big[\|b_{a(t)}\|_{B_t^{-1}} \mid \mathcal{F}_{t-1}\big] + \frac{1}{T^2}.$$
In the last inequality we used that $1/(p - 1/T^2) \le 5/p$; this is equivalent to $T^2 \ge 5e\sqrt{\pi}$, which holds for $T \ge 5$, while for $T \le 4$ the lemma holds trivially.

Definition 5. We define $\mathrm{regret}'(t) = \mathrm{regret}(t)\cdot\mathbb{I}\{E^{\hat\mu}(t)\}$.

Definition 6. A sequence of random variables $(Y_t;\ t \ge 0)$ is called a super-martingale corresponding to a filtration $\mathcal{F}_t$ if, for all $t$, $Y_t$ is $\mathcal{F}_t$-measurable, and for $t \ge 1$, $\mathbb{E}[Y_t - Y_{t-1} \mid \mathcal{F}_{t-1}] \le 0$.

Next, following Agrawal and Goyal (2013b), we establish a super-martingale process that will form the basis of our proof of the high-probability regret bound.

Definition 7. Let
$$X_t = \mathrm{regret}'(t) - \frac{11g}{p}\,\|b_{a(t)}\|_{B_t^{-1}} - \frac{1}{T^2} \qquad\text{and}\qquad Y_t = \sum_{\tau=1}^{t} X_\tau.$$

Lemma 10. $(Y_t;\ t = 0, \dots, T)$ is a super-martingale process with respect to the filtration $\mathcal{F}_t$.

Proof. We need to prove that for all $t \in \{1, \dots, T\}$ and any possible $\mathcal{F}_{t-1}$, $\mathbb{E}[Y_t - Y_{t-1} \mid \mathcal{F}_{t-1}] \le 0$, i.e.,
$$\mathbb{E}[\mathrm{regret}'(t) \mid \mathcal{F}_{t-1}] \le \frac{11g}{p}\,\mathbb{E}\big[\|b_{a(t)}\|_{B_t^{-1}} \mid \mathcal{F}_{t-1}\big] + \frac{1}{T^2}.$$
Note that whether $E^{\hat\mu}(t)$ is true or not is completely determined by $\mathcal{F}_{t-1}$. If $\mathcal{F}_{t-1}$ is such that $E^{\hat\mu}(t)$ is not true, then $\mathrm{regret}'(t) = \mathrm{regret}(t)\cdot\mathbb{I}\{E^{\hat\mu}(t)\} = 0$, and the above inequality holds trivially. Moreover, for $\mathcal{F}_{t-1}$ such that $E^{\hat\mu}(t)$ holds, the inequality follows from Lemma 9.

Unlike Agrawal and Goyal (2013b) and Abbasi-Yadkori, Pál, and Szepesvári (2011), we do not want to require $\lambda \ge 1$. Therefore, we provide the following lemma, which shows the dependence of $\|b_{a(t)}\|^2_{B_t^{-1}}$ on $\lambda$.

Lemma 11. For all $t$,
$$\|b_{a(t)}\|^2_{B_t^{-1}} \le \left(2 + \frac{2}{\lambda}\right)\ln\left(1 + \|b_{a(t)}\|^2_{B_t^{-1}}\right).$$

Proof. Note that $\|b_{a(t)}\|_{B_t^{-1}} \le (1/\sqrt{\lambda})\,\|b_{a(t)}\| \le 1/\sqrt{\lambda}$ and that for all $0 \le x \le 1$ we have
$$x \le 2\ln(1 + x). \qquad (1)$$
Now we consider two cases depending on $\lambda$.
If $\lambda \ge 1$, we know that $0 \le \|b_{a(t)}\|_{B_t^{-1}} \le 1$, and therefore by (1),
$$\|b_{a(t)}\|^2_{B_t^{-1}} \le 2\ln\left(1 + \|b_{a(t)}\|^2_{B_t^{-1}}\right).$$
Similarly, if $\lambda < 1$, then $0 \le \lambda\,\|b_{a(t)}\|^2_{B_t^{-1}} \le 1$ and we get
$$\|b_{a(t)}\|^2_{B_t^{-1}} \le \frac{2}{\lambda}\ln\left(1 + \lambda\,\|b_{a(t)}\|^2_{B_t^{-1}}\right) \le \frac{2}{\lambda}\ln\left(1 + \|b_{a(t)}\|^2_{B_t^{-1}}\right).$$
Combining the two, we get that for all $\lambda > 0$,
$$\|b_{a(t)}\|^2_{B_t^{-1}} \le \max\left(2, \frac{2}{\lambda}\right)\ln\left(1 + \|b_{a(t)}\|^2_{B_t^{-1}}\right) \le \left(2 + \frac{2}{\lambda}\right)\ln\left(1 + \|b_{a(t)}\|^2_{B_t^{-1}}\right).$$

Proof of Theorem 1. First, notice that $X_t$ is bounded as $|X_t| \le 1 + \frac{11g}{p\sqrt{\lambda}} + \frac{1}{T^2} \le \left(\frac{11}{\sqrt{\lambda}} + 2\right)\frac{g}{p}$. Thus, we can apply the Azuma-Hoeffding inequality to obtain that with probability at least $1 - \delta/2$,
$$\sum_{t=1}^{T}\mathrm{regret}'(t) \le \sum_{t=1}^{T}\left(\frac{11g}{p}\,\|b_{a(t)}\|_{B_t^{-1}} + \frac{1}{T^2}\right) + \left(\frac{11}{\sqrt{\lambda}} + 2\right)\frac{g}{p}\sqrt{2T\ln\frac{2}{\delta}}.$$
Since $p$ and $g$ are constants and $\sum_{t=1}^{T} 1/T^2 = 1/T$, this gives, with probability at least $1 - \delta/2$,
$$\sum_{t=1}^{T}\mathrm{regret}'(t) \le \frac{11g}{p}\sum_{t=1}^{T}\|b_{a(t)}\|_{B_t^{-1}} + \frac{1}{T} + \left(\frac{11}{\sqrt{\lambda}} + 2\right)\frac{g}{p}\sqrt{2T\ln\frac{2}{\delta}}.$$
The last step is to upper bound $\sum_{t=1}^{T}\|b_{a(t)}\|_{B_t^{-1}}$. For this purpose, Agrawal and Goyal (2013b) rely on the analysis of Auer (2002) and the assumption $\lambda \ge 1$. We provide an alternative approach using the Cauchy-Schwarz inequality, Lemma 2, and Lemma 11 to get
$$\sum_{t=1}^{T}\|b_{a(t)}\|_{B_t^{-1}} \le \sqrt{T\sum_{t=1}^{T}\|b_{a(t)}\|^2_{B_t^{-1}}} \le \sqrt{\left(2 + \frac{2}{\lambda}\right)T\ln\frac{|B_{T+1}|}{|\Lambda|}} \le \sqrt{\frac{4 + 4\lambda}{\lambda}\,dT\ln\frac{\lambda + T}{\lambda}}.$$
Finally, we know that $E^{\hat\mu}(t)$ holds for all $t$ with probability at least $1 - \frac{\delta}{2}$, and therefore $\mathrm{regret}'(t) = \mathrm{regret}(t)$ for all $t$ with probability at least $1 - \frac{\delta}{2}$. Hence, with probability $1 - \delta$,
$$R(T) \le \frac{11g}{p}\sqrt{\frac{4 + 4\lambda}{\lambda}\,dT\ln\frac{\lambda + T}{\lambda}} + \frac{1}{T} + \left(\frac{11}{\sqrt{\lambda}} + 2\right)\frac{g}{p}\sqrt{2T\ln\frac{2}{\delta}}.$$

## 5 Experiments

The aim of this section is to give empirical evidence that Spectral TS, a faster and simpler algorithm than Spectral UCB, also delivers comparable or better empirical performance. For the synthetic experiment, we considered the Barabási-Albert (BA) model (1999), known for its preferential attachment property, common in real-world graphs. We generated a random graph using the BA model with N = 250 nodes and degree parameter 3. For each run, we generated the weights of the edges uniformly at random. Then we generated µ, a random vector of weights (unknown to the algorithms), in order to compute the payoffs, and evaluated the cumulative regret. The µ in each simulation was a random linear combination of the first 3 smoothest eigenvectors of the graph Laplacian. In all experiments, we set δ = 0.001, λ = 1, and R = 0.01. We evaluated the algorithms in the T < N regime, where the linear bandit algorithms are not expected to perform well. Figure 1 shows the results averaged over 10 simulations. Notice that while the results of Spectral TS are comparable to Spectral UCB, its computational time is much shorter, due to the reasons discussed in Section 3. Recall that while both algorithms solve a least-squares problem of the same size, Spectral UCB then has to compute the confidence interval for each arm.

Figure 1: Barabási-Albert random graph results (N = 250, basis size 3, effective d = 1): cumulative regret and computational time in seconds for Spectral TS, Linear TS, Spectral UCB, and LinUCB.

Furthermore, we performed a comparison of the algorithms on the MovieLens dataset (Lam and Herlocker 2012) of movie ratings. The graph in this dataset is the graph of 2019 movies, with edges corresponding to movie similarities. For each user, we have a graph function, unknown to the algorithms, that assigns to each node the rating of that particular user. A detailed description of the preprocessing is deferred to (Valko et al. 2014). Our goal is then to recommend the most highly rated content. Figure 2 shows the MovieLens results averaged over 10 randomly sampled users. Notice that the results follow the same trends as for the synthetic data. Our results show that the empirical performance of the computationally more efficient Spectral TS is similar to or slightly better than that of Spectral UCB. The main contribution of this paper is the analysis that backs up this evidence with a theoretical justification.

Figure 2: MovieLens data results (N = 2019, average over 10 users, T = 200, d = 5): cumulative regret and computational time in seconds for Spectral TS, Linear TS, Spectral UCB, and LinUCB.
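As a rough sketch of the synthetic setup described above (reusing `spectral_basis`, `effective_dimension`, and `spectral_ts` from the earlier snippets, and using the networkx BA generator; the random seed, the choice `v = 1`, and the Gaussian noise model are our own assumptions, not the paper's exact protocol), the Barabási-Albert experiment could be reproduced along these lines:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
N, T, lam, R = 250, 200, 1.0, 0.01

G = nx.barabasi_albert_graph(N, 3, seed=0)            # BA graph, degree parameter 3
W = np.triu(nx.to_numpy_array(G), 1)
W[W > 0] = rng.uniform(size=int((W > 0).sum()))       # random edge weights
W = W + W.T                                           # symmetric weight matrix

eigvals, Q = spectral_basis(W)
b = Q                                                 # arm features = rows of Q
d = effective_dimension(eigvals, T, lam=lam)

mu = np.zeros(N)
mu[:3] = rng.normal(size=3)                           # combination of the 3 smoothest eigenvectors

reward = lambda a: float(b[a] @ mu) + rng.normal(0.0, R)
arms, mu_hat = spectral_ts(b, eigvals, reward, T, lam=lam, v=1.0, rng=rng)

best = (b @ mu).max()
cum_regret = np.cumsum([best - float(b[a] @ mu) for a in arms])
print(d, cum_regret[-1])                              # effective dimension and final regret
```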
## 6 Conclusion

We considered the spectral bandit setting, where the reward function is assumed to be smooth on a given graph, and proposed Spectral Thompson Sampling (Spectral TS) for it. Our main contribution is stated in Theorem 1, where we prove that the regret bound scales with the effective dimension d, which is typically much smaller than the ambient dimension D appearing in the bounds of linear bandit algorithms. In our experiments, we showed that Spectral TS outperforms LinUCB and Linear TS, and provides results similar to Spectral UCB in the regime T < N. Moreover, we showed that Spectral TS is computationally less expensive than Spectral UCB.

## 7 Acknowledgements

The research presented in this paper was supported by the French Ministry of Higher Education and Research, by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270327 (project CompLACS), and by Intel Corporation.

## References

Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved Algorithms for Linear Stochastic Bandits. In Neural Information Processing Systems.

Agrawal, S., and Goyal, N. 2012a. Analysis of Thompson Sampling for the multi-armed bandit problem. In Proceedings of the 25th Annual Conference on Learning Theory (COLT).

Agrawal, S., and Goyal, N. 2012b. Thompson Sampling for Contextual Bandits with Linear Payoffs. CoRR abs/1209.3352, http://arxiv.org/abs/1209.3352.

Agrawal, S., and Goyal, N. 2013a. Further Optimal Regret Bounds for Thompson Sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 31, 99-107.

Agrawal, S., and Goyal, N. 2013b. Thompson Sampling for Contextual Bandits with Linear Payoffs. In International Conference on Machine Learning.

Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47(2-3):235-256.

Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3:397-422.

Barabási, A.-L., and Albert, R. 1999. Emergence of scaling in random networks. Science 286:509-512.

Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research 7:2399-2434.

Chapelle, O., and Li, L. 2011. An Empirical Evaluation of Thompson Sampling. In Neural Information Processing Systems.

Dani, V.; Hayes, T. P.; and Kakade, S. M. 2008. The Price of Bandit Information for Online Optimization. In Neural Information Processing Systems. MIT Press.

Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis. In Algorithmic Learning Theory.

Korda, N.; Kaufmann, E.; and Munos, R. 2013. Thompson Sampling for 1-Dimensional Exponential Family Bandits. In Neural Information Processing Systems.

Koutis, I.; Miller, G. L.; and Peng, R. 2010. Approaching Optimality for Solving SDD Linear Systems. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, 235-244. IEEE.

Lam, S., and Herlocker, J. 2012. MovieLens 1M Dataset. http://www.grouplens.org/node/12.

Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A Contextual-Bandit Approach to Personalized News Article Recommendation. In WWW '10.
May, B. C.; Korda, N.; Lee, A.; and Leslie, D. S. 2012. Optimistic Bayesian sampling in contextual-bandit problems. The Journal of Machine Learning Research 13(1):2069-2106.

Pazzani, M. J., and Billsus, D. 2007. Content-Based Recommendation Systems. In The Adaptive Web.

Russo, D., and Van Roy, B. 2013. Eluder Dimension and the Sample Complexity of Optimistic Exploration. In Neural Information Processing Systems.

Russo, D., and Van Roy, B. 2014. Learning to Optimize Via Posterior Sampling. Mathematics of Operations Research.

Srinivas, N.; Krause, A.; Kakade, S.; and Seeger, M. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proceedings of the International Conference on Machine Learning, 1015-1022.

Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25:285-294.

Valko, M.; Munos, R.; Kveton, B.; and Kocák, T. 2014. Spectral Bandits for Smooth Graph Functions. In Proceedings of the 31st International Conference on Machine Learning.

Yue, Y.; Hong, S. A.; and Guestrin, C. 2012. Hierarchical Exploration for Accelerating Contextual Bandits. In Langford, J., and Pineau, J., eds., Proceedings of the 29th International Conference on Machine Learning (ICML-12), 1895-1902. New York, NY, USA: ACM.