# Efficient Learning in Large-Scale Combinatorial Semi-Bandits

Zheng Wen (ZHENGWEN@YAHOO-INC.COM), Yahoo Labs, Sunnyvale, CA
Branislav Kveton (KVETON@ADOBE.COM), Adobe Research, San Jose, CA
Azin Ashkan (AZIN.ASHKAN@TECHNICOLOR.COM), Technicolor Research, Los Altos, CA

Abstract: A stochastic combinatorial semi-bandit is an online learning problem where, at each step, a learning agent chooses a subset of ground items subject to combinatorial constraints, then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we consider efficient learning in large-scale combinatorial semi-bandits with linear generalization and, as a solution, propose two learning algorithms called Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB). Both algorithms are computationally efficient as long as the offline version of the combinatorial problem can be solved efficiently. We establish that CombLinTS and CombLinUCB are also provably statistically efficient under reasonable assumptions, by developing regret bounds that are independent of the problem scale (number of items) and sublinear in time. We also evaluate CombLinTS on a variety of problems with thousands of items. Our experimental results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines.

1. Introduction

Combinatorial optimization is a mature field (Papadimitriou & Steiglitz, 1998) with countless practical applications. One of the most studied problems in combinatorial optimization is maximization of a modular function subject to combinatorial constraints. Many important problems, such as minimum spanning tree (MST), shortest path, and maximum-weight bipartite matching, can be viewed as instances of this problem. In practice, the optimized modular function is often unknown and needs to be learned while repeatedly solving the problem. This class of learning problems was recently formulated as a combinatorial bandit/semi-bandit, depending on the feedback model (Audibert et al., 2014). Since then, many combinatorial bandit/semi-bandit algorithms have been proposed: for the stochastic setting (Gai et al., 2012; Chen et al., 2013; Russo & Van Roy, 2014; Kveton et al., 2015b); for the adversarial setting (Cesa-Bianchi & Lugosi, 2012; Audibert et al., 2014; Neu & Bartók, 2013); and for subclasses of combinatorial problems, such as matroid and polymatroid bandits (Kveton et al., 2014a;b;c), submodular maximization (Wen et al., 2013; Gabillon et al., 2013), and cascading bandits (Kveton et al., 2015a). Many regret bounds have been established for these combinatorial semi-bandit algorithms. To achieve an $O(\sqrt{n})$ dependence on time $n$, all of these regret bounds are $\Omega(\sqrt{L})$, where $L$ is the number of items. The dependence on $L$ is intrinsic because the algorithms estimate the weight of each item separately, and matching lower bounds have been established (Section 3.2). However, in many real-world problems, the number of items $L$ is intractably large. For instance, online advertising on a mainstream commercial website can be viewed as a bipartite matching problem with millions of users and products; routing in the Internet can be formulated as a shortest path problem with billions of edges.
Thus, learning algorithms with $\Omega(\sqrt{L})$ regret are impractical in such problems. On the other hand, in many problems, items have features and their weights are similar when the features are similar. In movie recommendation, for instance, the expected ratings of movies that are close in the latent space are also similar. In this work, we show how to leverage this structure to learn to make good decisions more efficiently. More specifically, we assume a linear generalization across the items: conditioned on the features of an item, the expected weight of that item can be estimated using a linear model. Our goal is to develop more efficient learning algorithms for combinatorial semi-bandits with linear generalization. It is relatively easy to extend many linear bandit algorithms, such as Thompson sampling (Thompson, 1933; Agrawal & Goyal, 2012; Russo & Van Roy, 2013) and Linear UCB (LinUCB; see Auer (2002); Dani et al. (2008); Abbasi-Yadkori et al. (2011)), to combinatorial semi-bandits with linear generalization. In this paper, we propose two such learning algorithms, Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB), based on Thompson sampling and LinUCB. Both CombLinTS and CombLinUCB are computationally efficient, as long as the offline version of the combinatorial problem can be solved efficiently. The first major contribution of this paper is that we establish a Bayes regret bound for CombLinTS and a regret bound for CombLinUCB under reasonable assumptions. Both bounds are $L$-independent and sublinear in time. The second major contribution is that we evaluate CombLinTS on a variety of problems with thousands of items, two of which are based on real-world datasets. We only evaluate CombLinTS since recent literature (Chapelle & Li, 2011) suggests that Thompson sampling algorithms usually outperform UCB-like algorithms in practice. Our experimental results demonstrate that CombLinTS is scalable, robust to the choice of algorithm parameters, and significantly outperforms the best of our baselines. It is worth mentioning that our derived $L$-independent regret bounds also hold in cases with $L = \infty$. Moreover, as we will discuss in Section 7, our proposed algorithms and their analyses can be easily extended to contextual combinatorial semi-bandits. Finally, we briefly review some relevant papers. Gabillon et al. (2014) and Yue & Guestrin (2011) focus on submodular maximization with linear generalization. Our paper differs from these two papers in two aspects: (1) our paper allows general combinatorial constraints while they do not; (2) our paper focuses on maximization of modular functions while they focus on submodular maximization.

2. Combinatorial Optimization

We focus on a class of combinatorial optimization problems that aim to find a maximum-weight set from a given family of sets. Specifically, one such combinatorial optimization problem can be represented as a triple $(E, \mathcal{A}, w)$, where (1) $E = \{1, \ldots, L\}$ is a set of $L$ items, called the ground set, (2) $\mathcal{A} \subseteq \{A \subseteq E : |A| \leq K\}$ is a family of subsets of $E$ with at most $K$ items, where $K \leq L$, and (3) $w : E \rightarrow \mathbb{R}$ is a weight function that assigns each item $e$ in the ground set $E$ a real number. The total weight of all items in a set $A \subseteq E$ is defined as

$$f(A, w) = \sum_{e \in A} w(e), \qquad (1)$$

which is a linear functional of $w$ and a modular function in $A$.
A set $A^{\mathrm{opt}}$ is a maximum-weight set in $\mathcal{A}$ if

$$A^{\mathrm{opt}} \in \arg\max_{A \in \mathcal{A}} f(A, w) = \arg\max_{A \in \mathcal{A}} \sum_{e \in A} w(e). \qquad (2)$$

Many classical combinatorial optimization problems, such as finding an MST, bipartite matching, the shortest path problem, and the traveling salesman problem (TSP), have form (2). Though some of these problems can be solved efficiently (e.g., bipartite matching), others (e.g., TSP) are known to be NP-hard. However, for many such NP-hard problems, there exist computationally efficient approximation algorithms and/or randomized algorithms that achieve near-optimal solutions with high probability. Similarly to Chen et al. (2013), in this paper we allow the agent to use any approximation/randomized algorithm ORACLE to solve (2), and denote its solution by $A = \mathrm{ORACLE}(E, \mathcal{A}, w)$. To distinguish it from a learning algorithm, we refer to a combinatorial optimization algorithm as an oracle in this paper.

3. Combinatorial Semi-Bandits with Linear Generalization

Many real-world problems are combinatorial in nature. In recommender systems, for instance, the user is typically recommended $K$ items out of $L$. The value of an item, such as the expected rating of a movie, is never known perfectly and has to be refined while repeatedly recommending to the pool of users. Recommender problems are known to be highly structured. In particular, it is well known that the user-item matrix is typically low-rank (Koren et al., 2009) and that the value of an item can be written as a linear combination of its position in the latent space. In this work, we propose a learning algorithm for combinatorial optimization that leverages this structure. In particular, we assume that the weight of each item is a linear function of its features, and then we learn the parameters of this model jointly for all items.

3.1. Combinatorial Semi-Bandits

We formalize our learning problem as a combinatorial semi-bandit. A combinatorial semi-bandit is a triple $(E, \mathcal{A}, P)$, where $E$ and $\mathcal{A}$ are defined as in Section 2 and $P$ is a probability distribution over the weights $w \in \mathbb{R}^L$ of the items in the ground set $E$. We assume that the weights $w$ are drawn i.i.d. from $P$. The mean weight is denoted by $\bar{w} = \mathbb{E}[w]$. Each item $e$ is associated with an arm, and we assume that multiple arms can be pulled. A subset of arms $A \subseteq E$ can be pulled if and only if $A \in \mathcal{A}$. The return of pulling arms $A$ is $f(A, w)$ (Equation (1)), the sum of the weights of all items in $A$. After the arms $A$ are pulled, we observe the individual return of each arm, $\{w(e) : e \in A\}$. This feedback model is known as semi-bandit (Audibert et al., 2014). We assume that the combinatorial structure $(E, \mathcal{A})$ is known and the distribution $P$ is unknown. We would like to stress that we do not make any structural assumptions on $P$. The optimal solution to our problem is a maximum-weight set in expectation:

$$A^{\mathrm{opt}} \in \arg\max_{A \in \mathcal{A}} \mathbb{E}_w[f(A, w)] = \arg\max_{A \in \mathcal{A}} \sum_{e \in A} \bar{w}(e). \qquad (3)$$

This objective is equivalent to the one in Equation (2). Our learning problem is episodic. In each episode $t$, the learning agent adaptively chooses $A_t \in \mathcal{A}$ based on its observations of the weights up to episode $t$, gains $f(A_t, w_t)$, and observes the weights of all chosen items in episode $t$, $\{(e, w_t(e)) : e \in A_t\}$. The learning agent interacts with the combinatorial semi-bandit for $n$ episodes and its goal is to maximize the expected cumulative return $\mathbb{E}\big[\sum_{t=1}^{n} f(A_t, w_t)\big]$, where the expectation is over (1) the random weights $w_t$'s, (2) possible randomization in the learning algorithm, and (3) $\bar{w}$ if it is randomly generated. Notice that the choice of $A_t$ impacts both the return and the observations in episode $t$, so we need to trade off exploration and exploitation, similarly to other bandit problems.
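To make the interaction protocol concrete, the following is a minimal Python sketch of a stochastic combinatorial semi-bandit environment with semi-bandit feedback. It is our own illustration, not part of the paper: the class and method names are hypothetical, and the Gaussian weight noise is used purely for concreteness (the paper makes no structural assumption on $P$).

```python
import numpy as np

class CombinatorialSemiBandit:
    """Illustrative semi-bandit environment: draws w_t ~ P each episode and,
    after the agent commits to a feasible set A_t, reveals w_t(e) for every
    chosen item e (semi-bandit feedback) together with the total return."""

    def __init__(self, w_bar, noise_std, rng=None):
        self.w_bar = np.asarray(w_bar, dtype=float)  # mean weights, one per item
        self.noise_std = noise_std                   # std of the i.i.d. weight noise
        self.rng = rng or np.random.default_rng(0)

    def step(self, chosen_items):
        """Pull the arms in `chosen_items`; return (total return, per-item feedback)."""
        w_t = self.w_bar + self.noise_std * self.rng.standard_normal(len(self.w_bar))
        feedback = {e: w_t[e] for e in chosen_items}  # observed weights of chosen items
        return sum(feedback.values()), feedback
```

A learning agent only ever sees the per-item feedback of the sets it actually chooses, which is what creates the exploration-exploitation trade-off discussed above.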
3.2. Linear Generalization

As discussed in Section 1, many provably efficient algorithms have been developed for various combinatorial semi-bandits of form (3) (Chen et al., 2013; Gai et al., 2012; Russo & Van Roy, 2014; Kveton et al., 2014b; 2015b). However, since there are $L$ parameters to learn and these algorithms do not consider generalization across items, the derived upper bounds on the expected cumulative regret and/or the Bayes cumulative regret of these algorithms are at least $O(\sqrt{L})$. Furthermore, Audibert et al. (2014) derived an $\Omega(\sqrt{LKn})$ lower bound for adversarial combinatorial semi-bandits, while Kveton et al. (2014b; 2015b) derived asymptotic $\Omega(L \log(n)/\Delta)$ gap-dependent lower bounds for stochastic combinatorial semi-bandits, where $\Delta$ is an appropriate gap. However, in many modern combinatorial semi-bandit problems, $L$ tends to be enormous, and thus an $O(\sqrt{L})$ regret is unacceptably large. On the other hand, in many practical problems, there exists a generalization model under which the weight of one item can be (approximately) inferred from the weights of other items. By exploiting such generalization models, an $o(\sqrt{L})$ or even an $L$-independent cumulative regret might be achieved.

In this paper, we assume that there is a (possibly imperfect) linear generalization model across the items. Specifically, we assume that the agent knows a generalization matrix $\Phi \in \mathbb{R}^{L \times d}$ such that $\bar{w}$ either lies in or is close to the subspace $\mathrm{span}[\Phi]$. We use $\phi_e$ to denote the transpose of the $e$-th row of $\Phi$, and refer to it as the feature vector of item $e$. Without loss of generality, we assume that $\mathrm{rank}[\Phi] = d$. Similarly to some existing literature (Wen & Van Roy, 2013; Van Roy & Wen, 2014), we distinguish between coherent learning cases, in which $\bar{w} \in \mathrm{span}[\Phi]$, and agnostic learning cases, in which $\bar{w} \notin \mathrm{span}[\Phi]$. Like existing literature on linear bandits (Dani et al., 2008; Abbasi-Yadkori et al., 2011), the analysis in this paper focuses on coherent learning cases. However, we would like to emphasize that both of our proposed algorithms, CombLinTS and CombLinUCB, are also applicable to agnostic learning cases. As demonstrated in Section 6, CombLinTS performs well in agnostic learning cases. Finally, we define $\theta^* = \arg\min_{\theta} \|\bar{w} - \Phi\theta\|_2$. Since $\mathrm{rank}[\Phi] = d$, $\theta^*$ is uniquely defined. Moreover, in coherent learning cases, we have $\bar{w} = \Phi\theta^*$.

3.3. Performance Metrics

Let $A^* = \mathrm{ORACLE}(E, \mathcal{A}, \bar{w})$. In this paper, we measure the performance loss of a learning algorithm with respect to $A^*$. Recall that the learning algorithm chooses $A_t$ in episode $t$; we define $R_t = f(A^*, w_t) - f(A_t, w_t)$ as the realized regret in episode $t$. If the expected weight $\bar{w}$ is fixed but unknown, we define the expected cumulative regret of the learning algorithm in $n$ episodes as

$$R(n) = \sum_{t=1}^{n} \mathbb{E}[R_t \mid \bar{w}], \qquad (4)$$

where the expectation is over the random weights and possible randomization in the learning algorithm. When necessary, we write $R(n)$ as $R(n; \bar{w})$ to emphasize the dependence on $\bar{w}$.
On the other hand, if $\bar{w}$ is randomly generated or the agent has a prior belief over $\bar{w}$, then following Russo & Van Roy (2013), the Bayes cumulative regret of the learning algorithm in $n$ episodes is defined as

$$R_{\mathrm{Bayes}}(n) = \mathbb{E}_{\bar{w}}[R(n; \bar{w})] = \sum_{t=1}^{n} \mathbb{E}[R_t], \qquad (5)$$

where the expectation is also over $\bar{w}$. That is, $R_{\mathrm{Bayes}}(n)$ is a weighted average of $R(n; \bar{w})$ under the prior on $\bar{w}$.

4. Learning Algorithms

In this section, we propose two learning algorithms for combinatorial semi-bandits: Combinatorial Linear Thompson Sampling (CombLinTS) and Combinatorial Linear UCB (CombLinUCB), which are motivated by Thompson sampling and LinUCB, respectively. Both algorithms maintain a mean vector $\bar{\theta}_t$ and a covariance matrix $\Sigma_t$, and use Kalman filtering to update $\bar{\theta}_t$ and $\Sigma_t$. They differ in how they choose $A_t$ (i.e., how they explore) in each episode $t$: CombLinTS chooses $A_t$ based on a randomly sampled coefficient vector $\theta_t$, while CombLinUCB chooses $A_t$ based on the optimism in the face of uncertainty (OFU) principle.

4.1. Combinatorial Linear Thompson Sampling

The pseudocode of CombLinTS is given in Algorithm 2, where $(E, \mathcal{A})$ is the combinatorial structure, $\Phi$ is the generalization matrix, ORACLE is a combinatorial optimization algorithm, and $\lambda$ and $\sigma$ are two algorithm parameters controlling the learning rate. Specifically, $\lambda$ is an inverse-regularization parameter, and a smaller $\lambda$ makes the covariance matrix $\Sigma_t$ closer to 0. Thus, a too small $\lambda$ will lead to insufficient exploration and significantly reduce the performance of CombLinTS. On the other hand, $\sigma$ controls the decrease rate of the covariance matrix $\Sigma_t$. In particular, a large $\sigma$ will lead to slow learning, while a too small $\sigma$ will make the algorithm quickly converge to some sub-optimal coefficient vector. In each episode $t$, Algorithm 2 consists of three steps. First, it randomly samples a coefficient vector $\theta_t$ from a Gaussian distribution. Second, it computes $A_t$ based on $\theta_t$ and the pre-specified oracle. Finally, it updates the mean vector $\bar{\theta}_{t+1}$ and the covariance matrix $\Sigma_{t+1}$ based on Kalman filtering (Algorithm 1). It is worth pointing out that if (1) $\bar{w} = \Phi\theta^*$, (2) the prior on $\theta^*$ is $N(0, \lambda^2 I)$, and (3) for all $(t, e)$, the noise $\eta_t(e) = w_t(e) - \bar{w}(e)$ is independently sampled from $N(0, \sigma^2)$, then in each episode $t$ the CombLinTS algorithm samples $\theta_t$ from the posterior distribution of $\theta^*$. We henceforth refer to a case satisfying conditions (1)-(3) as a coherent Gaussian case. Obviously, the CombLinTS algorithm can be applied to more general cases, even to cases with no prior and/or agnostic learning cases.

4.2. Combinatorial Linear UCB

The pseudocode of CombLinUCB is given in Algorithm 3, where $E$, $\mathcal{A}$, $\Phi$, and ORACLE are defined as in Algorithm 2, and $\lambda$, $\sigma$, and $c$ are three algorithm parameters. As before, $\lambda$ is an inverse-regularization parameter, $\sigma$ controls the decrease rate of the covariance matrix, and $c$ controls the degree of optimism (exploration). Specifically, if $c$ is too small, the algorithm might converge to some sub-optimal coefficient vector due to insufficient exploration; on the other hand, a too large $c$ will lead to excessive exploration and slow learning. In each episode $t$, Algorithm 3 also consists of three steps. First, for each $e \in E$, it computes an upper confidence bound (UCB) $\hat{w}_t(e)$. Second, it computes $A_t$ based on $\hat{w}_t$ and the pre-specified oracle. Finally, it updates $\bar{\theta}_{t+1}$ and $\Sigma_{t+1}$ based on Kalman filtering (Algorithm 1).
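The following Python sketch shows how the two algorithms fit together with the Kalman-filter update. It is our own illustrative rendering of Algorithms 1-3 (reproduced below), not the authors' code; the class and method names (`CombLinAgent`, `item_weights_ts`, `item_weights_ucb`, `update`) are ours, and the problem-specific oracle is left abstract.

```python
import numpy as np

class CombLinAgent:
    """Sketch of CombLinTS / CombLinUCB state and updates (Algorithms 1-3)."""

    def __init__(self, Phi, lam, sigma, c=None, rng=None):
        self.Phi = np.asarray(Phi, dtype=float)      # L x d generalization matrix
        self.sigma2 = sigma ** 2                     # observation-noise variance
        self.c = c                                   # UCB width (CombLinUCB only)
        d = self.Phi.shape[1]
        self.theta_bar = np.zeros(d)                 # posterior mean of theta*
        self.Sigma = lam ** 2 * np.eye(d)            # posterior covariance of theta*
        self.rng = rng or np.random.default_rng(0)

    def item_weights_ts(self):
        """CombLinTS: sample theta_t ~ N(theta_bar, Sigma) and score every item."""
        theta_t = self.rng.multivariate_normal(self.theta_bar, self.Sigma)
        return self.Phi @ theta_t

    def item_weights_ucb(self):
        """CombLinUCB: optimistic weights <phi_e, theta_bar> + c * sqrt(phi_e' Sigma phi_e)."""
        means = self.Phi @ self.theta_bar
        widths = np.sqrt(np.einsum("ld,dk,lk->l", self.Phi, self.Sigma, self.Phi))
        return means + self.c * widths

    def update(self, chosen_items, observed_weights):
        """Algorithm 1: one rank-one Kalman-filter update per observed item.

        The mean update below is algebraically the same as the bracketed form in
        Algorithm 1: theta <- theta + Sigma phi (y - <phi, theta>) / (phi' Sigma phi + sigma^2).
        """
        for e, y in zip(chosen_items, observed_weights):
            phi = self.Phi[e]
            Sigma_phi = self.Sigma @ phi
            denom = phi @ Sigma_phi + self.sigma2
            self.theta_bar = self.theta_bar + Sigma_phi * (y - phi @ self.theta_bar) / denom
            self.Sigma = self.Sigma - np.outer(Sigma_phi, Sigma_phi) / denom
```

In each episode one would call `item_weights_ts()` (or `item_weights_ucb()`), feed the resulting weight vector to the problem-specific oracle to obtain $A_t$, and then call `update` with the observed weights of the chosen items.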
Algorithm 1 (Compute $\bar{\theta}_{t+1}$ and $\Sigma_{t+1}$ based on Kalman filtering)
- Input: $\bar{\theta}_t$, $\Sigma_t$, $\sigma$, and feature-observation pairs $\{(\phi_e, w_t(e)) : e \in A_t\}$
- Initialize $\bar{\theta}_{t+1} \leftarrow \bar{\theta}_t$ and $\Sigma_{t+1} \leftarrow \Sigma_t$
- For $k = 1, \ldots, |A_t|$, where $a_k$ is the $k$-th element of $A_t$, update
$$\bar{\theta}_{t+1} \leftarrow \left[I - \frac{\Sigma_{t+1}\phi_{a_k}\phi_{a_k}^{\top}}{\phi_{a_k}^{\top}\Sigma_{t+1}\phi_{a_k} + \sigma^2}\right]\bar{\theta}_{t+1} + \frac{\Sigma_{t+1}\phi_{a_k}}{\phi_{a_k}^{\top}\Sigma_{t+1}\phi_{a_k} + \sigma^2}\, w_t(a_k), \qquad \Sigma_{t+1} \leftarrow \Sigma_{t+1} - \frac{\Sigma_{t+1}\phi_{a_k}\phi_{a_k}^{\top}\Sigma_{t+1}}{\phi_{a_k}^{\top}\Sigma_{t+1}\phi_{a_k} + \sigma^2}$$
- Output: $\bar{\theta}_{t+1}$ and $\Sigma_{t+1}$

Algorithm 2 (Combinatorial Linear Thompson Sampling)
- Input: combinatorial structure $(E, \mathcal{A})$, generalization matrix $\Phi \in \mathbb{R}^{L \times d}$, algorithm parameters $\lambda, \sigma > 0$, oracle ORACLE
- Initialize $\Sigma_1 \leftarrow \lambda^2 I \in \mathbb{R}^{d \times d}$ and $\bar{\theta}_1 = 0 \in \mathbb{R}^d$
- For $t = 1, 2, \ldots, n$: sample $\theta_t \sim N(\bar{\theta}_t, \Sigma_t)$; compute $A_t \leftarrow \mathrm{ORACLE}(E, \mathcal{A}, \Phi\theta_t)$; choose set $A_t$ and observe $w_t(e)$ for all $e \in A_t$; compute $\bar{\theta}_{t+1}$ and $\Sigma_{t+1}$ based on Algorithm 1

Algorithm 3 (Combinatorial Linear UCB)
- Input: combinatorial structure $(E, \mathcal{A})$, generalization matrix $\Phi \in \mathbb{R}^{L \times d}$, algorithm parameters $\lambda, \sigma, c > 0$, oracle ORACLE
- Initialize $\Sigma_1 \leftarrow \lambda^2 I \in \mathbb{R}^{d \times d}$ and $\bar{\theta}_1 = 0 \in \mathbb{R}^d$
- For $t = 1, 2, \ldots, n$: define the UCB weight vector $\hat{w}_t$ by $\hat{w}_t(e) = \langle \phi_e, \bar{\theta}_t \rangle + c\sqrt{\phi_e^{\top}\Sigma_t\phi_e}$ for all $e \in E$; compute $A_t \leftarrow \mathrm{ORACLE}(E, \mathcal{A}, \hat{w}_t)$; choose set $A_t$ and observe $w_t(e)$ for all $e \in A_t$; compute $\bar{\theta}_{t+1}$ and $\Sigma_{t+1}$ based on Algorithm 1

5. Regret Bounds

In this section, we present a Bayes regret bound for CombLinTS and a regret bound for CombLinUCB. We also briefly discuss how these bounds are derived, as well as their tightness. The detailed proofs are left to the appendices. Without loss of generality, throughout this section we assume that $\|\phi_e\|_2 \leq 1$ for all $e \in E$.

5.1. Bayes Regret Bound on CombLinTS

We have the following upper bound on $R_{\mathrm{Bayes}}(n)$ when CombLinTS is applied to a coherent Gaussian case with the right parameters.

Theorem 1. If (1) $\bar{w} = \Phi\theta^*$, (2) the prior on $\theta^*$ is $N(0, \lambda^2 I)$, (3) the noises are i.i.d. sampled from $N(0, \sigma^2)$, and (4) $\lambda \geq \sigma$, then under the CombLinTS algorithm with parameters $(\Phi, \lambda, \sigma)$, we have

$$R_{\mathrm{Bayes}}(n) \leq \tilde{O}\!\left(K\lambda\sqrt{d\,n\,\min\{\ln(L), d\}}\right). \qquad (6)$$

Notice that conditions (1)-(3) ensure it is a coherent Gaussian case, and condition (4) almost always holds¹. The $\tilde{O}$ notation hides logarithmic factors. We also note that Equation (6) is a minimum of two bounds. The first bound is $L$-dependent, but the dependence is only $O(\sqrt{\ln(L)})$; on the other hand, the second bound is $L$-independent, but is $O(d)$ instead of $O(\sqrt{d})$. We would like to emphasize that Theorem 1 holds even if ORACLE is an approximation/randomized algorithm.

¹Condition (4) is not essential; please refer to Theorem 3 in Appendix A for a Bayes regret bound without condition (4).

We now outline the proof of Theorem 1, which is motivated by Russo & Van Roy (2013) and Dani et al. (2008). Let $H_t$ denote the history (i.e., all the available information) by the start of episode $t$. Note that, from the Bayesian perspective, conditioned on $H_t$, $\theta^*$ and $\theta_t$ are i.i.d. drawn from $N(\bar{\theta}_t, \Sigma_t)$ (Russo & Van Roy, 2013). This is because, conditioned on $H_t$, the posterior belief over $\theta^*$ is $N(\bar{\theta}_t, \Sigma_t)$ and, by Algorithm 2, $\theta_t$ is independently sampled from $N(\bar{\theta}_t, \Sigma_t)$. Since ORACLE is a fixed combinatorial optimization algorithm (even though it can be independently randomized), and $E$, $\mathcal{A}$, $\Phi$ are all fixed, then, conditioned on $H_t$, $A^*$ and $A_t$ are also i.i.d.; furthermore, $A^*$ is conditionally independent of $\theta_t$, and $A_t$ is conditionally independent of $\theta^*$. To simplify the exposition, for any $\theta \in \mathbb{R}^d$ and $A \subseteq E$, we define $g(A, \theta) = \sum_{e \in A} \langle \phi_e, \theta \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product. Thus we have $\mathbb{E}[R_t \mid H_t] = \mathbb{E}[g(A^*, \theta^*) - g(A_t, \theta^*) \mid H_t]$. We also define a UCB function $U_t : 2^E \rightarrow \mathbb{R}$ as

$$U_t(A) = \sum_{e \in A} \left[\langle \phi_e, \bar{\theta}_t \rangle + c\sqrt{\phi_e^{\top}\Sigma_t\phi_e}\right],$$

where $c > 0$ is a constant to be specified. Notice that, conditioned on $H_t$, $U_t$ is a deterministic function and $A^*$, $A_t$ are i.i.d.; hence $\mathbb{E}[U_t(A_t) - U_t(A^*) \mid H_t] = 0$ and

$$\mathbb{E}[R_t \mid H_t] = \mathbb{E}[g(A^*, \theta^*) - U_t(A^*) \mid H_t] + \mathbb{E}[U_t(A_t) - g(A_t, \theta^*) \mid H_t]. \qquad (7)$$

Theorem 1 follows by respectively bounding the two terms on the right-hand side of Equation (7).
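For concreteness, the decomposition in Equation (7) can be written out in one chain of equalities (this merely restates the argument above):

$$
\begin{aligned}
\mathbb{E}[R_t \mid H_t]
  &= \mathbb{E}[g(A^*, \theta^*) - g(A_t, \theta^*) \mid H_t] \\
  &= \mathbb{E}[g(A^*, \theta^*) - U_t(A^*) \mid H_t]
   + \mathbb{E}[U_t(A^*) - U_t(A_t) \mid H_t]
   + \mathbb{E}[U_t(A_t) - g(A_t, \theta^*) \mid H_t] \\
  &= \mathbb{E}[g(A^*, \theta^*) - U_t(A^*) \mid H_t]
   + \mathbb{E}[U_t(A_t) - g(A_t, \theta^*) \mid H_t],
\end{aligned}
$$

where the middle term vanishes because, conditioned on $H_t$, $U_t$ is a deterministic function and $A^*$ and $A_t$ are identically distributed. The first remaining term measures how much $U_t$ can underestimate the value of the optimal set, and the second measures the width of the confidence intervals along the chosen set.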
Two key observations are that (1) if $c = O\big(\sqrt{\min\{\ln(L), d\}}\big)$, then $\mathbb{E}[g(A^*, \theta^*) - U_t(A^*) \mid H_t] \leq O(1)$ and $\mathbb{E}[U_t(A_t) - g(A_t, \theta^*) \mid H_t] = c\,\mathbb{E}\big[\sum_{e \in A_t} \sqrt{\phi_e^{\top}\Sigma_t\phi_e} \,\big|\, H_t\big]$, and (2) we have a worst-case bound (see Lemma 4 in Appendix A) on $\sum_{t=1}^{n}\sum_{e \in A_t}\sqrt{\phi_e^{\top}\Sigma_t\phi_e}$. Please refer to Appendix A for the detailed proof of Theorem 1.

Finally, we briefly discuss the tightness of our bound. Without loss of generality, we assume that $\lambda = 1$. For the special case where $\Phi = I$ (i.e., no generalization), Russo & Van Roy (2014) provides an $O\big(\sqrt{LK\log(L/K)\,n}\big)$ upper bound on $R_{\mathrm{Bayes}}(n)$ when Thompson sampling is applied, and Audibert et al. (2014) provides an $\Omega(\sqrt{LKn})$ lower bound². Since $L = d$ when $\Phi = I$, the above results indicate that for general $\Phi$, the best upper bound one can hope for is $O(\sqrt{Kdn})$. Hence, our bound is at most $O\big(\sqrt{K\min\{\ln(L), d\}}\big)$ larger. It is well known that the $O(\sqrt{d})$ factor is due to linear generalization (Dani et al., 2008; Abbasi-Yadkori et al., 2011), and as discussed in the appendix (see Remark 1), the extra $O(\sqrt{K})$ factor is also due to linear generalization. They might be intrinsic, but we leave the final word and a tightness analysis to future work.

²Audibert et al. (2014) focuses on the adversarial setting, but the lower bound is stochastic, so it is a reasonable lower bound to compare with.

5.2. Regret Bound on CombLinUCB

Under the assumptions that (1) the support of $P$ is a subset of $[0, 1]^L$, (2) the stochastic item weights $\{w(e)\}_{e \in E}$ are statistically independent under $P$, and (3) the oracle ORACLE exactly solves the offline optimization problem³, we have the following upper bound on $R(n)$ when CombLinUCB is applied to coherent learning cases.

Theorem 2. For any $\lambda, \sigma > 0$, any $\delta \in (0, 1)$, and any $c$ satisfying Inequality (8), a confidence-radius condition of order $\sqrt{d\ln\big(1 + nK\lambda^2/(d\sigma^2)\big) + \ln(1/\delta)}$ whose exact form is given in Appendix B, if $\bar{w} = \Phi\theta^*$ and the above three assumptions hold, then under the CombLinUCB algorithm with parameters $(\Phi, \lambda, \sigma, c)$, $R(n)$ is bounded by a term of order $cK\sqrt{d\,n\,\ln\big(1 + nK\lambda^2/(d\sigma^2)\big)}$, plus an $nK\delta$ term arising from the low-probability failure event.

³If ORACLE is an approximation algorithm, then a variant of Theorem 2 can be proved (see Appendix D).

Generally speaking, the proof of Theorem 2 proceeds as follows. We first construct a confidence set $G$ for $\theta^*$ based on the self-normalized bound developed in Abbasi-Yadkori et al. (2011). Then we decompose the regret over the high-probability good event $G$ and the low-probability bad event $\bar{G}$, where $\bar{G}$ is the complement of $G$. Finally, we bound the term associated with the event $G$ using the same worst-case bound on $\sum_{t=1}^{n}\sum_{e \in A_t}\sqrt{\phi_e^{\top}\Sigma_t\phi_e}$ used in the analysis of CombLinTS (see Lemma 4 in Appendix A), and bound the term associated with the event $\bar{G}$ using a naive bound. Please refer to Appendix B for the detailed proof of Theorem 2. Notice that if we choose $\lambda = \sigma = 1$, $\delta = 1/(nK)$, and $c$ as the lower bound specified in Inequality (8), then the regret bound derived in Theorem 2 is also $\tilde{O}(Kd\sqrt{n})$. Compared with the lower bound derived in Audibert et al. (2014), this bound is at most $\tilde{O}(\sqrt{Kd})$ larger. Similarly, the extra $O(\sqrt{d})$ factor is also due to linear generalization. Finally, we would like to clarify that the assumption that the support of $P$ is bounded is not essential.
By slightly modifying the analysis, we can achieve a similar high-probability bound on the realized cumulative regret as long as $P$ is sub-Gaussian. We also want to point out that the $L$-independent bounds derived in both Theorems 1 and 2 still hold even if $L = \infty$.

6. Experiments

In this section, we evaluate CombLinTS on three problems. The first problem is synthetic, while the last two are constructed from real-world datasets. As discussed in Section 1, we only evaluate CombLinTS since, in practice, Thompson sampling algorithms usually outperform UCB-like algorithms. Our experimental results on the synthetic problem demonstrate that CombLinTS is both scalable and robust to the choice of algorithm parameters. They also suggest that the Bayes regret bound derived in Theorem 1 is likely to be tight. On the other hand, our experimental results on the last two problems show the value of linear generalization in real-world settings: with domain-specific but imperfect linear generalization (i.e., agnostic learning), CombLinTS can significantly outperform state-of-the-art learning algorithms that do not exploit linear generalization, which serve as baselines in these two problems. In all three problems, the oracle ORACLE exactly solves the offline combinatorial optimization problem. Moreover, in the two real-world problems, we report results using a new performance metric, the expected per-step return in $n$ episodes, which is defined as

$$\frac{1}{n}\,\mathbb{E}_{w_1, \ldots, w_n}\!\left[\sum_{t=1}^{n} f(A_t, w_t) \,\Big|\, \bar{w}\right]. \qquad (9)$$

Obviously, this is the expected cumulative return in $n$ episodes divided by $n$. We report results in terms of the expected per-step return rather than $R(n)$ since it is more illustrative.

6.1. Longest Path

We first evaluate CombLinTS on a synthetic problem. Specifically, we experiment with a stochastic longest path problem on an $(m+1) \times (m+1)$ square grid⁴. The items in the ground set $E$ are the edges in the grid, $L = 2m(m+1)$ in total. The feasible set $\mathcal{A}$ consists of all paths in the grid from the upper-left corner to the bottom-right corner that follow the directions of the edges. The length of these paths is $K = 2m$. In this problem, we focus on coherent Gaussian cases and randomly sample the linear generalization matrix $\Phi \in \mathbb{R}^{L \times d}$ to weaken the dependence on a particular choice of $\Phi$. Our experiments are parameterized by a sextuple $(m, d, \lambda_{\mathrm{true}}, \sigma_{\mathrm{true}}, \lambda, \sigma)$, where $m$, $d$, $\lambda$, and $\sigma$ are defined as before, and $\lambda_{\mathrm{true}}$ and $\sigma_{\mathrm{true}}$ are, respectively, the true standard deviations of $\theta^*$ and of the observation noise. In each round of simulation, we first construct a problem instance as follows: (1) generate $\Phi$ by sampling each component of $\Phi$ i.i.d. from $N(0, 1)$; (2) sample $\theta^*$ independently from $N(0, \lambda_{\mathrm{true}}^2 I)$ and set $\bar{w} = \Phi\theta^*$; and (3) for all $(t, e)$, sample the observation noise $\eta_t(e) = w_t(e) - \bar{w}(e)$ i.i.d. from $N(0, \sigma_{\mathrm{true}}^2)$. Then we apply CombLinTS with parameters $(\lambda, \sigma)$ to the constructed instance for $n$ episodes. Notice that in general $(\lambda, \sigma) \neq (\lambda_{\mathrm{true}}, \sigma_{\mathrm{true}})$. We average the results over 200 simulations to estimate the Bayes cumulative regret $R_{\mathrm{Bayes}}(n)$. We start with a default case with $m = 30$, $d = 200$, $\lambda_{\mathrm{true}} = \lambda = 10$, and $\sigma_{\mathrm{true}} = \sigma = 1$. Notice that in this case $L = 1860$ and $|\mathcal{A}| \approx 1.18 \times 10^{17}$. We choose $n = 150$ since, in the default case, the Bayes per-episode regret of CombLinTS vanishes far before episode 150. In the default case, $R_{\mathrm{Bayes}}(150) \approx 1.56 \times 10^4$.

⁴That is, each side has $m$ edges and $m + 1$ nodes. Notice that the longest path problem and the shortest path problem are mathematically equivalent.
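A compact sketch of how such a problem instance and its exact oracle can be set up is given below. This is our own illustration under the stated assumptions (edges oriented right and down, coherent Gaussian weights); the function names are ours, and the dynamic program is simply the standard exact longest-path computation on this grid DAG.

```python
import numpy as np

def grid_edges(m):
    """Directed edges of an (m+1) x (m+1) grid with right/down moves.

    Returns a list of (head_node, tail_node) pairs, L = 2 * m * (m + 1) in total;
    node (i, j) is row i, column j.
    """
    edges = []
    for i in range(m + 1):
        for j in range(m + 1):
            if j < m:
                edges.append(((i, j), (i, j + 1)))   # move right
            if i < m:
                edges.append(((i, j), (i + 1, j)))   # move down
    return edges

def longest_path_oracle(m, edges, weights):
    """Exact oracle: dynamic programming over the grid DAG.

    Returns the edge indices of a maximum-weight path from (0, 0) to (m, m).
    """
    index = {e: k for k, e in enumerate(edges)}
    value = {(0, 0): 0.0}   # best path value ending at each node
    best_in = {}            # edge used to enter that node on the best path
    for i in range(m + 1):
        for j in range(m + 1):
            if (i, j) == (0, 0):
                continue
            candidates = []
            if j > 0:
                k = index[((i, j - 1), (i, j))]
                candidates.append((value[(i, j - 1)] + weights[k], k))
            if i > 0:
                k = index[((i - 1, j), (i, j))]
                candidates.append((value[(i - 1, j)] + weights[k], k))
            value[(i, j)], best_in[(i, j)] = max(candidates)
    path, node = [], (m, m)                 # backtrack from the bottom-right corner
    while node != (0, 0):
        k = best_in[node]
        path.append(k)
        node = edges[k][0]
    return path

# One instance of the default case (m = 30, d = 200, lambda_true = 10, sigma_true = 1).
m, d, lam_true, sigma_true = 30, 200, 10.0, 1.0
rng = np.random.default_rng(0)
edges = grid_edges(m)
L = len(edges)                                   # 2 * m * (m + 1) = 1860
Phi = rng.standard_normal((L, d))                # random generalization matrix
theta_star = lam_true * rng.standard_normal(d)   # theta* ~ N(0, lambda_true^2 I)
w_bar = Phi @ theta_star                         # coherent case: w_bar = Phi theta*
A_star = longest_path_oracle(m, edges, w_bar)    # optimal path under the mean weights
```

Running CombLinTS on such an instance amounts to scoring the edges with the sampled weights $\Phi\theta_t$ in each episode and passing them to `longest_path_oracle`.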
Figure 1. Experiment results in the longest path problem: (a) $R_{\mathrm{Bayes}}(150)$ vs. $m$, (b) $R_{\mathrm{Bayes}}(150)$ vs. $d$, (c) $R_{\mathrm{Bayes}}(150)$ vs. $\sigma$, (d) $R_{\mathrm{Bayes}}(150)$ vs. $\lambda$.

In the experiments, we vary one and only one parameter while keeping all the other parameters fixed to their default values specified above, to demonstrate the scalability and robustness of CombLinTS.

First, we study how the Bayes cumulative regret of CombLinTS scales with the size of the problem by varying $m = 10, 20, \ldots, 100$, and show the results in Figure 1(a). The results show that $R_{\mathrm{Bayes}}(150)$ increases roughly linearly with $m$, which indicates that CombLinTS is scalable with respect to the problem size $m$. We also experiment with $m = 250$; in this case we have $L \approx 125\mathrm{k}$, $|\mathcal{A}| \approx 1.17 \times 10^{149}$, and $R_{\mathrm{Bayes}}(150) \approx 6.56 \times 10^4$, which is only 4.2 times $R_{\mathrm{Bayes}}(150)$ in the default case. It is worth mentioning that this result also suggests that the Bayes regret bound derived in Theorem 1 is (almost) tight in this problem⁵. To see this, notice that $K = 2m$ and $L = O(m^2)$, and hence the Bayes regret bound derived in Theorem 1 is $O(m)$.

Second, we study how the Bayes cumulative regret of CombLinTS scales with $d$, the dimension of the feature vectors, by varying $d = 10, 60, 110, \ldots, 510$, and show the results in Figure 1(b). The results indicate that $R_{\mathrm{Bayes}}(150)$ also increases roughly linearly with $d$, and hence CombLinTS is also scalable with the feature dimension $d$. This result also suggests that the dependence on $d$ in the bound of Theorem 1 is (almost) tight⁵.

⁵Recall that Theorem 1 requires $\max_{e \in E} \|\phi_e\|_2 \leq 1$. It can be easily extended to cases with $\max_{e \in E} \|\phi_e\|_2 \leq M$ by scaling the Bayes regret bound by $M$. However, in this problem $\|\phi_e\|_2$ is not bounded since $\phi_e$ is sampled from a Gaussian distribution. We believe that Theorem 1 can be extended to this case by exploiting the properties of the Gaussian distribution; roughly speaking, in this problem, with high probability, $\|\phi_e\|_2 = O(\sqrt{d})$.

Finally, we study the robustness of CombLinTS with respect to the algorithm parameters $\sigma$ and $\lambda$. In Figure 1(c), we vary $\sigma = 10^{-7}, 10^{-6}, \ldots, 10^{4}$, and in Figure 1(d), we vary $\lambda = 10^{-2}, 10^{-1}, \ldots, 10^{5}$. We emphasize again that we only vary the algorithm parameters and fix $\sigma_{\mathrm{true}} = 1$ and $\lambda_{\mathrm{true}} = 10$. The results show that CombLinTS is robust to the choice of algorithm parameters and performs well for a wide range of $\sigma$ and $\lambda$. However, a too small or too large $\sigma$, or a too small $\lambda$, can significantly reduce the performance of CombLinTS, as discussed in Section 4.1.

6.2. Online Advertising

In the second experiment, we evaluate CombLinTS on an advertising problem. Our objective is to identify 100 people who are most likely to accept an advertisement offer, subject to the targeting constraint that exactly half of them are female. Specifically, the ground set $E$ includes 33k representative people from the Adult dataset (Asuncion & Newman, 2007), which was collected in the 1994 US census. A feasible solution $A$ is any subset of $E$ with $|A| = 100$ that satisfies the targeting constraint. We assume that person $e$ accepts an advertisement offer with probability

$$\bar{w}(e) = \begin{cases} 0.15 & \text{if the person's income is at least 50k} \\ 0.05 & \text{otherwise,} \end{cases}$$

and that people accept offers independently of each other.
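For this problem the exact oracle is simple, and the following sketch shows one way it could be implemented (our own illustration; the function name and interface are ours): because the objective is modular and the constraint only fixes the sizes of the two groups, taking the highest-weight half from each group is optimal.

```python
import numpy as np

def advertising_oracle(weights, is_female, budget=100):
    """Pick `budget` people with the highest estimated acceptance probabilities,
    subject to exactly budget/2 of them being female (exact for this constraint)."""
    weights = np.asarray(weights, dtype=float)
    is_female = np.asarray(is_female, dtype=bool)
    females = np.flatnonzero(is_female)
    males = np.flatnonzero(~is_female)
    top_f = females[np.argsort(weights[females])[::-1][: budget // 2]]
    top_m = males[np.argsort(weights[males])[::-1][: budget // 2]]
    return np.concatenate([top_f, top_m])
```

In every episode, CombLinTS simply feeds its sampled weight vector $\Phi\theta_t$ into such an oracle to obtain the 100 targeted people.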
The features in the generalization matrix $\Phi$ are the age, which is binned into 7 groups; gender; whether the person works more than 40 hours per week; and the length of education in years. All of these features can be constructed from the Adult dataset. CombLinTS is compared to three baselines. The first baseline is the optimal solution $A^{\mathrm{opt}}$. The second baseline is CombUCB1 (Kveton et al., 2015b). This algorithm estimates the probability $\bar{w}(e)$ that person $e$ accepts the offer independently of the other probabilities. The third baseline is CombLinTS without linear generalization, which we simply refer to as CombTS. As in CombUCB1, this algorithm estimates the probability $\bar{w}(e)$ that person $e$ accepts the offer independently of the other probabilities; the posterior of $\bar{w}(e)$ is modeled as a beta distribution.

Our experiment results are reported in Figure 2. We observe two major trends. First, CombLinTS learns extremely quickly. In particular, its per-step return at episode 100 is 70% of the optimum, and its per-step return at episode 1k is 80% of the optimum. These results are remarkable since the linear generalization is imperfect in this problem. Second, both CombUCB1 and CombTS perform poorly due to insufficient observations with respect to the model complexity. Specifically, in 1k episodes, the people in $E$ are observed 100k times, which implies that each person is observed only 3 times on average. This is not enough to discriminate the people who are likely to accept the advertisement offer from those who are not.

Figure 2. Experiment results in the online advertising problem (expected per-step return vs. episode $n$ for the optimal solution, CombUCB1, CombTS, and CombLinTS).

6.3. Artist Recommendation

In the last experiment, we evaluate CombLinTS on the problem of recommending $K = 10$ music artists that are most likely to be chosen by an average user of a music recommendation website. Specifically, the ground set $E$ includes artists from the Last.fm music recommendation dataset (Cantador et al., 2011). The dataset contains tagging and music artist listening information from a set of 2k users of the Last.fm online music system⁶. The tagging part includes the tag assignments of all artists provided by the users. For each user, the artists to whom she listened and the number of listening events are also available in the dataset. We choose $E$ as the set of artists that were listened to by at least two users and had at least one tag assignment among the top 20 most popular tags; $|E| \approx 6\mathrm{k}$. For each artist $e$, we construct its feature vector $\phi_e \in [0, 1]^{20}$ by setting its $j$-th component to the fraction of users who assigned tag $j$ to this artist. We assume that each artist $e$ is chosen by an average user with probability $\bar{w}(e) = \frac{1}{|U_e|}\sum_{u \in U_e} w_u(e)$, where $U_e$ is the set of users that listened to artist $e$ and $w_u(e)$ is the probability that user $u$ likes artist $e$. We estimate $w_u(e)$ based on a Naïve Bayes classifier with respect to the number of person/artist listening events. As in Section 6.2, we compare CombLinTS to three baselines: the optimal solution $A^{\mathrm{opt}}$, the CombUCB1 algorithm, and the CombTS algorithm. Our experiment results are reported in Figure 3.

⁶http://www.lastfm.com

Figure 3. Experiment results in the artist recommendation problem (expected per-step return vs. episode $n$ for the optimal solution, CombUCB1, CombTS, and CombLinTS).
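For concreteness, the sketch below shows one way the artist features $\phi_e$ and mean weights $\bar{w}(e)$ described above could be assembled. It is our own illustration: the function names, the normalization choice (fraction taken over the users who tagged the artist, which is one plausible reading), and the data-access details are assumptions, and the Naïve Bayes estimate of $w_u(e)$ is not reproduced here.

```python
import numpy as np
from collections import defaultdict

def artist_features(tag_assignments, top_tags):
    """phi_e in [0,1]^20: per-artist fraction of tagging users that assigned each top tag.

    `tag_assignments` is an iterable of (user, artist, tag) triples and `top_tags`
    lists the 20 most popular tags; both would be read from the Last.fm (HetRec 2011)
    tagging file, whose exact format we do not rely on here.
    """
    tag_index = {tag: j for j, tag in enumerate(top_tags)}
    users_with_tag = defaultdict(set)    # (artist, tag index) -> users who assigned it
    users_of_artist = defaultdict(set)   # artist -> users who tagged the artist at all
    for user, artist, tag in tag_assignments:
        users_of_artist[artist].add(user)
        if tag in tag_index:
            users_with_tag[(artist, tag_index[tag])].add(user)
    features = {}
    for artist, users in users_of_artist.items():
        phi = np.zeros(len(top_tags))
        for j in range(len(top_tags)):
            phi[j] = len(users_with_tag[(artist, j)]) / len(users)
        features[artist] = phi
    return features

def mean_artist_weights(user_probs):
    """w_bar(e): average of w_u(e) over the users U_e who listened to artist e.

    `user_probs` maps artist -> {user: w_u(e)}, where w_u(e) would come from the
    Naive Bayes estimate described in the text.
    """
    return {artist: float(np.mean(list(probs.values()))) for artist, probs in user_probs.items()}
```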
Similarly to Figure 2, the expected per-step return of CombLinTS approaches that of $A^{\mathrm{opt}}$ much faster than that of CombUCB1 and CombTS. Moreover, both CombUCB1 and CombTS perform poorly due to insufficient observations with respect to the model complexity: in 1k episodes, each artist is observed less than 2 times on average, which is not enough to discriminate the most popular artists from the less popular ones.

7. Conclusion

We have proposed two learning algorithms, CombLinTS and CombLinUCB, for stochastic combinatorial semi-bandits with linear generalization. The main contribution of this work is two-fold. First, we have established $L$-independent regret bounds for these two algorithms under reasonable assumptions, where $L$ is the number of items. Second, we have evaluated CombLinTS on a variety of problems. The experiment results on the first problem show that CombLinTS is scalable and robust, and the experiment results on the other two problems demonstrate the value of exploiting linear generalization in real-world settings.

It is worth mentioning that our results can be easily extended to contextual combinatorial semi-bandits with linear generalization. In a contextual combinatorial semi-bandit, the probability distribution $P$ (and hence the expected weight $\bar{w}$) also depends on a context $x$, which either follows an exogenous stochastic process or is adaptively chosen by an adversary. Assuming that each context-item pair $(x, e)$ is associated with a feature vector $\phi_{x,e}$, then, similarly to Agrawal & Goyal (2013), both CombLinTS and CombLinUCB, as well as their analyses, can be generalized to contextual combinatorial semi-bandits.

We leave open several questions of interest. One interesting open question is how to derive regret bounds for CombLinTS and CombLinUCB in the agnostic learning cases. Another is how to extend the results to combinatorial semi-bandits with nonlinear generalization. We believe that our results can be extended to combinatorial semi-bandits with generalized linear generalization⁷, but leave this to future work.

⁷That is, $\bar{w}(e) = h(\phi_e^{\top}\theta^*)$, where $h : \mathbb{R} \rightarrow \mathbb{R}$ is a strictly monotone function.

References

Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems 24, pp. 2312-2320, 2011.

Agrawal, Shipra and Goyal, Navin. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pp. 39.1-39.26, 2012.

Agrawal, Shipra and Goyal, Navin. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 127-135, 2013.

Asuncion, A. and Newman, D.J. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

Audibert, Jean-Yves, Bubeck, Sébastien, and Lugosi, Gábor. Regret in online combinatorial optimization. Mathematics of Operations Research, 39(1):31-45, 2014.

Auer, Peter. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.

Cantador, Iván, Brusilovsky, Peter, and Kuflik, Tsvi. Second workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011. ACM, 2011.
Cesa-Bianchi, Nicolò and Lugosi, Gábor. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404-1422, 2012.

Chapelle, Olivier and Li, Lihong. An empirical evaluation of Thompson sampling. In Neural Information Processing Systems, pp. 2249-2257, 2011.

Chen, Wei, Wang, Yajun, and Yuan, Yang. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, pp. 151-159, 2013.

Dani, Varsha, Hayes, Thomas, and Kakade, Sham. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pp. 355-366, 2008.

Gabillon, Victor, Kveton, Branislav, Wen, Zheng, Eriksson, Brian, and Muthukrishnan, S. Adaptive submodular maximization in bandit setting. In Advances in Neural Information Processing Systems 26, pp. 2697-2705, 2013.

Gabillon, Victor, Kveton, Branislav, Wen, Zheng, Eriksson, Brian, and Muthukrishnan, S. Large-scale optimistic adaptive submodularity. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, 2014.

Gai, Yi, Krishnamachari, Bhaskar, and Jain, Rahul. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466-1478, 2012.

Koren, Yehuda, Bell, Robert, and Volinsky, Chris. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30-37, 2009.

Kveton, Branislav, Wen, Zheng, Ashkan, Azin, and Eydgahi, Hoda. Matroid bandits: Practical large-scale combinatorial bandits. In Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014a.

Kveton, Branislav, Wen, Zheng, Ashkan, Azin, Eydgahi, Hoda, and Eriksson, Brian. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 420-429, 2014b.

Kveton, Branislav, Wen, Zheng, Ashkan, Azin, Eydgahi, Hoda, and Valko, Michal. Learning to act greedily: Polymatroid semi-bandits. CoRR, abs/1405.7752, 2014c.

Kveton, Branislav, Szepesvári, Csaba, Wen, Zheng, and Ashkan, Azin. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015a.

Kveton, Branislav, Wen, Zheng, Ashkan, Azin, and Szepesvári, Csaba. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015b.

Neu, Gergely and Bartók, Gábor. An efficient algorithm for learning with semi-bandit feedback. In Jain, Sanjay, Munos, Rémi, Stephan, Frank, and Zeugmann, Thomas (eds.), Algorithmic Learning Theory, volume 8139 of Lecture Notes in Computer Science, pp. 234-248. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-40934-9.

Papadimitriou, Christos and Steiglitz, Kenneth. Combinatorial Optimization. Dover Publications, Mineola, NY, 1998.

Russo, Daniel and Van Roy, Benjamin. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.

Russo, Daniel and Van Roy, Benjamin. An information-theoretic analysis of Thompson sampling. CoRR, abs/1403.5341, 2014.

Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.

Van Roy, Benjamin and Wen, Zheng. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
Wen, Zheng and Van Roy, Benjamin. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems 26, pp. 3021-3029, 2013.

Wen, Zheng, Kveton, Branislav, Eriksson, Brian, and Bhamidipati, Sandilya. Sequential Bayesian search. In Proceedings of the 30th International Conference on Machine Learning, pp. 977-983, 2013.

Yue, Yisong and Guestrin, Carlos. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems 24, pp. 2483-2491, 2011.