# thompson_sampling_for_combinatorial_semibandits__e85a0f11.pdf

Thompson Sampling for Combinatorial Semi-Bandits

Siwei Wang 1 Wei Chen 2

We study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB, and obtain the ﬁrst distributiondependent regret bound of O(m log T/ min) for TS under general CMAB, where m is the number of arms, T is the time horizon, and min is the minimum gap between the expected reward of the optimal solution and any non-optimal solution. We also show that one cannot use an approximate oracle in TS algorithm for even MAB problems. Then we expand the analysis to matroid bandit, a special case of CMAB and for which we could remove the independence assumption across arms and achieve a better regret bound. Finally, we use some experiments to show the comparison of regrets of CUCB and CTS algorithms.

1. Introduction

Multi-armed bandit (MAB) (Berry & Fristedt, 1985; Sutton & Barto, 1998) is a classical online learning model typically described as a game between a learning agent (player) and the environment with m arms. In each step, the environment generates an outcome, and the player uses a policy (or an algorithm), which takes the feedback from the previous steps as input, to select an arm to pull. After pulling an arm, the player receives a reward based on the pulled arm and the environment outcome. In this paper, we consider stochastic MAB problem, which means the environment outcome is drawn from an unknown distribution (Lai & Robbins, 1985), not generated by an adversary (Auer et al., 2002b). The goal of the player is to cumulate as much reward as possible over a total of T steps (T may be unknown). The performance metric is the (expected) regret, which is the cumulative

1Tsinghua University, Beijing, China 2Microsoft Research, Beijing, China. Correspondence to: Siwei Wang <wangsw15@mails.tsinghua.edu.cn>, Wei Chen <weic@microsoft.com>.

Proceedings of the 35 th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

difference over T steps between always playing the arm with the optimal expected reward and playing the arms according to the policy.

MAB models the key tradeoff between exploration continuing exploring new arms not observed often, and exploitation sticking to the best performed arm based on the observation so far. A famous MAB algorithm is the upper conﬁdence bound (UCB) policy (Gittins, 1989; Auer et al., 2002a), which achieves O(m log T/ ) distributiondependent regret, where is the minimum gap in the expected reward between an optimal arm and any non-optimal arm, and it matches the lower bound (Lai & Robbins, 1985).

Combinatorial multi-armed bandit (CMAB) problem has recently become an active research area (Gai et al., 2012; Chen et al., 2016b; Gopalan et al., 2014a; Kveton et al., 2014; 2015a;b; Wen et al., 2015; Combes et al., 2015; Chen et al., 2016a; Wang & Chen, 2017). In CMAB, the environment contains m base arms, but the player needs to pull a set of base arms S in each time slot, where S is called a super arm (or an action). The kind of reward and feedback varies in different settings. In this paper, we consider the semi-bandit setting, where the feedback includes the outcomes of all base arms in the played super arm, and the reward is a function of S and the observed outcomes of arms in S. CMAB has found applications in many areas such as wireless networking, social networks, online advertising, etc. Thus it is important to investigate different approaches to solve CMAB problems.

An alternative approach different from UCB is the Thompson sampling (TS) approach, which is introduced much earlier by Thompson (1933), but the theoretical analysis of the TS policy comes much later Kaufmann et al. (2012) and Agrawal & Goyal (2012) give the ﬁrst regret bound for the TS policy, which essentially matches the UCB policy theoretically. Moreover, TS policy often performs better than UCB in empirical simulation results, making TS an attractive policy for further studies.

TS policy follows the Bayesian inference framework to solve the MAB problems. The unknown distribution of environment outcomes is parameterized, with an assumed prior distribution. TS updates the prior distribution in each step with two phases: ﬁrst it uses the prior distribution to sample a parameter, which is used to determine the action to

Thompson Sampling for Combinatorial Semi-Bandits

play in the current step; second it uses the feedback obtained in the current step to update the prior distribution to posterior distribution according to the Bayes rule. To avoid confusion on these two kinds of random variables, in the rest of this paper, we use the word sample to denote the variable in the ﬁrst phase, i.e. the random variable coming from the prior distribution. The word observation represents the feedback random variable, which follows the unknown environment distribution.

In this paper, we study the application of the Thompson sampling approach to CMAB. The reason that we are interested in this approach is that it has good performance in experiments. We found out that TS-based policy performs better than many kinds of UCB-based policy in experiments. We also adjust the parameters of UCB-based policy to make it behave better, but those parameters do not have theoretical guarantees. Thompson sampling policy behaves almost the same as UCB-based policies with no-guarantee parameters, and much better than those with parameters that have theoretical regret bounds. We can see those results in our experiments from Section 5. Another interesting thing is that TS-based policy only require the reward function to be continuous, while UCB-based policy need it to be monotone as well. These make TS-based policy more competitive in real applications.

We consider a general CMAB case similar with (Chen et al., 2016b), i.e. we assume that (a) the problem instance satisﬁes a Lipschitz continuity assumption to handle non-linear reward functions, and (b) the player has access to an exact oracle for the ofﬂine optimization problem. We use the standard TS policy together with the ofﬂine oracle, and refer it as the combinatorial Thompson sampling (CTS) algorithm.

CTS policy would ﬁrst derive a set of parameters θ = (θ1, , θm) as sample set for the base arms, and then select the optimal super arm under θ. The original analysis for TS on MAB model then faces a challenge in addressing the dependency issue: it essentially requires that different super arms be related with independent samples so that when comparing them and selecting the optimal super arm, the actual optimal one is selected with high probability. But when super arms are based on the same sample set θ, dependency and correlation among super arms may likely fail the above high probability analysis.

One way to get around this is to independently derive a sample set θ(S) for every super arm S and compute its expected reward under θ(S), and then select the optimal super arm. Obviously this solution incurs exponential sampling cost and is what we want to avoid when solving CMAB.

To address the dependency challenge, we adapt an analysis of (Komiyama et al., 2015) for selecting the top-k arms to the general CMAB setting. The adaptation is nontrivial

since we are dealing with arbitrary combinatorial constraints while they only deal with super arms containing k arms.

We show that CTS achieves O(P

i [m] log T/ i,min) + O((2/ε)2k ) distribution-dependent regret bound for some small ε, where i,min is the minimum gap between the optimal expected reward and any non-optimal expected reward containing arm i, and k is the size of the optimal solution. This is the ﬁrst distribution-dependent regret bound for general CMAB using TS-based policy, and the result matches the theoretical performance of the UCB-based solution CUCB in (Chen et al., 2016b). When considering CMAB with linear reward functions, the other complexity factors in the leading log T term also matches the regret lower bound for linear CMAB in (Kveton et al., 2015a). For the exponential constant term, we show an example that it is unavoidable for Thompson sampling.

Comparing to the UCB-based solution in (Chen et al., 2016b), the advantages of CTS is that: a) we do not need to assume that the expected reward is monotone to the mean outcomes of the base arms; b) it has better behaviour in experiments. CTS also suffers from some disadvantages. For example, CTS policy can not adapt an approximation oracle as in (Chen et al., 2016b) (the regret becomes to approximate regret as well). However, we claim that it is because of the difference between TS-based algorithm and UCB-based algorithm. To show this, we provide a counter example for origin MAB problem, which cause an approximate regret of Θ(T) when using TS policy.

Another disadvantage is that we need to assume that all the outcomes of all base arms are mutually independent. This is because TS policy maintains a prior distribution for every base arm s mean value µi. Only when the distributions are independent, we can use a simple method to update those prior distributions; otherwise the update method will be much more complicated for both the implmentation and the analysis. This assumption is still reasonable, since many real applications satisfy this assumption.

However, when applying on some further combinatorial structures, we do not need such an assumption, such as in matroid bandit. Matroid bandit is a special class of CMAB (Kveton et al., 2014), in which the base arms are the elements in the ground set and the super arms are the independent sets of a matroid. The reward function is the sum of all outcomes of the base arms in the super arm. We show that the regret of CTS is upper bounded by O(P

i S log T/ i) + O(m/ε4) for some small ε, where S is an optimal solution and i is the minimum positive gap between the mean outcome of any arms in S and the mean outcome of arm i. This result does not need to assume that all arm distributions are independent, and do not have a constant term exponential with k . It matches both the theoretical performance of the UCB-based algorithm

Thompson Sampling for Combinatorial Semi-Bandits

and the lower bound given in Kveton et al. (2014), and the constant term is similar with results in Agrawal & Goyal (2012), which appears in almost every TS analysis paper.

We further conduct empirical simulations, and show that CTS performs much better than the CUCB algorithm of (Chen et al., 2016b) and C-KL-UCB algorithm based on KL-UCB of (Garivier & Cappé, 2011) on both matroid and non-matroid CMAB problem instances.

In summary, our contributions include that: (a) we provide a novel analysis for the general CMAB problem, and provides the ﬁrst distribution-dependent regret bound for general CMAB problems and matroid bandit problems based on Thompson sampling; (b) we show that approximation oracle can not be used in TS-based algorithms; and (c) we show that the exponential constant in our regret bound for general CMAB problems is unavoidable.

Due to space constraint, complete proofs are moved to the supplementary material.

1.1. Related Work

A number of related works on the general context of multiarmed bandit and Thompson sampling have been given, and we focus here on the most relevant studies related to CMAB.

Our study follows the general CMAB framework of (Chen et al., 2016b), which provides a UCB-style algorithm CUCB and show a O(P

i [m] log T/ i,min) regret bound. Comparing with our CTS, both use an ofﬂine oracle, assume a Lipschitz continuity condition, and the bound essentially match asymptotically on time horizon T. Their differences include: (a) CTS does not need the monotonicity property but CUCB requires that to use the ofﬂine oracle; (b) CUCB allows an approximation oracle (and uses approximate regret), but CTS requires an exact oracle; (c) the regret bound of CTS has some additional terms not related to T, which is common in TS-based regret bounds (See Section 3 for more details). Combes et al. (2015) propose ESCB algorithm to solve CMAB with linear reward functions and independent arms, and their regret is a factor O( m) better than our corresponding regret bound for CTS. However, their ESCB algorithm requires an exponential-time for computation, which is what we want to avoid when designing CMAB algorithms.

Matroid bandit is deﬁned and studied by Kveton et al. (2014), who provide a UCB-based algorithm with regret bound almost exactly matches CTS algorithm. They also prove a matching lower bound using a partition matroid bandit.

Thompson sampling has also been applied to settings with combinatorial actions. Gopalan et al. (2014b) study a general action space with a general feedback model, and provide

analytical regret bounds for the exact TS policy. However, their general model cannot be applied to our case. In particular, they assume that the arm outcome distribution is from a known parametric family and the prior distribution on the parameter space is ﬁnitely supported. We instead work on arbitrary nonparametric and unknown distributions with bounded support, and even if we work on a parametric family, we allow the support of prior distributions to be inﬁnite or continuous. The reason is that in our CMAB setting, we only need to learn the means of base arms (same as in (Chen et al., 2016b)). Moreover, their regret bounds are high probability bounds, not expected regret bounds, and their bounds contain a potentially very large constant, which will turn into a non-constant term when we convert them to expected regret bounds.

In (Komiyama et al., 2015), the authors consider the TSbased policy for the top-k CMAB problem, a special case of matroid bandits where the super arms are subsets of size at most k. Thus we generalize top-k bandits to matroid bandits, and our regret bound for matroid bandit still matches the one in (Komiyama et al., 2015).

Wen et al. (2015) analyze the regret of using TS policy for contextual CMAB problems. The key difference between their work and ours is that they use the Bayesian regret metric. Bayesian regret takes another expectation on the prior distribution of parameters, while our regret bound works for any given parameter. This leads to very different analytical method, and they cannot provide distributiondependent regret bounds. Russo & Van Roy (2016) also use Bayesian regret to analyze the regret bounds of TS policy for any kind of MAB problems. Again, due to the use of Bayesian regret, their analytical method is very different and cannot be used for our purpose.

2. Model and Deﬁnitions

2.1. CMAB Problem Formulation

A CMAB problem instance is modeled as a tuple ([m], I, D, R, Q). [m] = {1, 2, , m} is the set of base arms; I 2[m] is the set of super arms; D is a probability distribution in [0, 1]m, and is unknown to the player, R and Q are reward and feedback functions to be speciﬁed shortly. Let µ = (µ1, , µm), where µi = EX D[Xi]. At discrete time slot t 1, the player pulls a super arm S(t) I, and the environment draws a random outcome vector X(t) = {X1(t), , Xm(t)} [0, 1]m from D, independent of any other random variables. Then the player receives an unknown reward R(t) = R(S(t), X(t)), and observes the feedback Q(t) = Q(S(t), X(t)). As in (Chen et al., 2016b) and other papers studying CMAB, we consider semi-bandit feedback, that is, Q(t) = {(i, Xi(t)) | i S(t)}. At time t, the previous step information is

Thompson Sampling for Combinatorial Semi-Bandits

Ft 1 = {(S(τ), Q(τ)) : 1 τ t 1}, which is the input to the learning algorithm to select the action S(t). Similar to (Chen et al., 2016b), we make the following two assumptions. For a parameter vector µ, we use µS to denote the projection of µ on S, where S is a subset of all the base arms.

Assumption 1. The expected reward of a super arm S I only depends on the mean outcomes of base arms in S. That is, there exists a function r such that E[R(t)] = EX(t) D[R(S(t), X(t))] = r(S(t), µS(t)).

The second assumption is a Lipschitz-continuity assumption of function r to deal with non-linear reward functions (it is based on one-norm).

Assumption 2. There exists a constant B, such that for every super arm S and every pair of mean vectors µ and µ , |r(S, µ) r(S, µ )| B||µS µ S||1.

The goal of the player is to minimize the total (expected) regret under time horizon T, as deﬁned below:

t=1 (r(S , µ) r(S(t), µ))

where S argmax S I r(S, µ) is a best super arm.

2.2. Matroid Bandit

In matroid bandit settings, ([m], I) is a matroid, which means that I has two properties:

If A I, then A A, A I;

If A1, A2 I, |A1| > |A2|, then there exists i A1 \ A2 such that A2 {i} I.

The reward function is R(S, x) = P

i S xi, and thus the expected reward function is r(S, µ) = P

3. Combinatorial Thompson Sampling

We ﬁrst consider the general CMAB setting. For this setting, we assume that the player has an exact oracle Oracle(θ) that takes a vector of parameters θ = (θ1, . . . , θm) as input, and output a super arm S = argmax S I r(S, θ).

The combinatorial Thompson sampling (CTS) algorithm is described in Algorithm 1. Initially we set the prior distribution of the means of all base arms as the Beta distribution β(1, 1), which is the uniform distribution on [0, 1]. After we get observation Q(t), we update the prior of all base arms in S(t) using procedure Update (Algorithm 2): for each observation Xi(t), we generate a Bernoulli random variable Yi(t) (the value of Yi at time t) independently with mean Xi(t), and then we update the prior Beta distribution

Algorithm 1 CTS Algorithm for CMAB

1: For each arm i, let ai = bi = 1 2: for t = 1, 2, do 3: For all arm i, draw a sample θi(t) from Beta distribution β(ai, bi); let θ(t) = (θ1(t), . . . , θm(t)) 4: Play action S(t) = Oracle(θ(t)), get the observation Q(t) = {(i, Xi(t)) : i S(t)} 5: Update({(ai, bi) | i S(t)}, Q(t)) 6: end for

Algorithm 2 Procedure Update

1: Input: {(ai, bi) | i S}, Q = {(i, Xi) | i S} 2: Output: updated {(ai, bi) | i S} 3: for all (i, Xi) Q do 4: Yi 1 with probability Xi, 0 with probability 1 Xi 5: ai ai + Yi; bi bi + 1 Yi 6: end for

of base arm i using Yi(t) as the new observation. It is easy to see that {Yi(t)}t s are i.i.d. with the same mean µi as the samples {Xi(t)}t s. Let ai(t) and bi(t) denote the values of ai and bi at the beginning of time step t. Then, following the Bayes rule, the posterior distribution of parameter µi after observation Q(t) is β(ai(t)+Yi(t), bi(t)+1 Yi(t)), which is what the Update procedure does for updating ai and bi. When choosing a super arm, we simply draw independent samples from all base arms prior distributions, i.e. θi(t) β(ai(t), bi(t)), and then send the sample vector θ(t) = (θ1(t), . . . , θm(t)) to the oracle. We use the output from the oracle S(t) as the super arm to play.

We also need a further assumption to tackle the problem:

Assumption 3. D = D1 D2 Dm, i.e., the outcomes of all base arms are mutually independent.

This assumption is not necessary in CUCB algorithms. However, when using TS method, this assumption is needed. This is because that we are using the Bayes Rule, thus we need the exact likelihood function (as we can see in (Gopalan et al., 2014b)). Only when the distributions for all the base arms are independent, we can use the Update procedure (Algorithm 2) to update their mean vector s prior distribution. When the distributions are correlated, the update procedure will also be much more complicated.

3.1. Regret Upper Bound

Let OPT = argmax S I r(S, µ) be the set of optimal super arms. Let S argmin S OPT |S| is one of the optimal super arm with minimum size k . Then we can deﬁne S = r(S , µ) r(S, µ), and max = max S I S. Kmax is the maximum size of super arms, i.e. Kmax = max S I |S|.

Theorem 1. Under Assumptions 1, 2, and 3, for all D,

Thompson Sampling for Combinatorial Semi-Bandits

Algorithm 1 has regret upper bound

i=1 max S:i S 8B2|S| log T S 2B(k 2 + 2)ε + (m K2 max ε2 + 3m) max

ε2 + 1)k log k

for any ε such that S, S > 2B(k 2 + 2)ε, where B is the Lipschitz constant in Assumption 2, and α1 is a constant not dependent on the problem instance.

When ε is sufﬁciently small, the leading log T term in the regret bound is comparable with the regret bound for CUCB in (Chen et al., 2016b). The term related to ε is to handle continuous Beta prior since we will never be able to sample a θ(k) i (t) to be exactly the true value µi, we need to consider the ε neighborhood of µi. This ε term is common in most Thompson sampling analysis.

The constant term has an exponential dependency on k . This is because we need all the k base arms in the best super arm to have samples close to their means to make sure that it is the best super arm in sampling. In contrast, for the top-k MAB of (Komiyama et al., 2015), there is no such exponential dependency, because they only compare one base arm at a time (this can also be seen in our matroid Bandit analysis). When dealing with the general actions, the regret result of (Gopalan et al., 2014b) also contains an exponentially large constant term without a close form, which is likely to be much larger than ours. In Section 3.3, we show that this exponential constant is unavoidable for the general CTS.

We now provide the proof outline for Theorem 1. First we deﬁne the following four events:

A(t) = {S(t) / OPT}

B(t) = { i S(t), |ˆµi(t) µi| > ε |S(t)|}

C(t) = {||θS(t)(t) µS(t)||1 > S(t)

B (k 2 + 1)ε}

D(t) = { i S(t), |θi(t) ˆµi(t)| > q

Then the total regret can be written as:

t=1 E I[A(t)] S(t)

t=1 E I[B(t) A(t)] S(t)

t=1 E I[ B(t) C(t) D(t) A(t)] S(t)

t=1 E I[ B(t) C(t) D(t) A(t)] S(t)

t=1 E I[ C(t) A(t)] S(t)

The ﬁrst term can be bounded by Chernoff Bound, which is m K2 max ε2 + m max. The second term can be bounded by some basic results of Beta distribution, the upper bound is 2m max.

The third term is a little bit tricky. Notice that under C(t), we can use B||θS(t)(t) µS(t)||1 as an approximation of S(t). However, it is hard to bound P

i S(t) |θi(t) µi|. To deal with this, we say one base arm i S(t) is sufﬁciently learned if Ni(t) > Li(S(t)) = 2 log T/( S(t) 2B|S(t)|

|S(t)| ε)2. When computing P i S(t) |θi(t) µi|, we do not count all the sufﬁciently learned arms in. To compensate their contributions, we double all the insufﬁciently learned arms contributions from q

Ni(t) to 2 q

Ni(t) . One can check that the sum of contributions from insufﬁciently learned arms is an upper bound for the regret S(t) under B(t) C(t) D(t). Thus, we can upper bound the total contribution of base arm i as: P

2 log TLmax i where Lmax i = max S:i S Li(S).

The difﬁculty comes mainly from the last term. Although the θi(t) s of all base arms are mutually independent, when it comes to super arms, the value r(S, θ(t)S) s for different super arms S are not mutually independent, because super arms may overlap one another. For example, Lemma 1 in (Agrawal & Goyal, 2013) is not true for considering super arms because of the lack of independence. This means that we cannot simply use the technique of (Agrawal & Goyal, 2013). Dealing with the dependency issue for this case is the main novelty in our analysis, as we now explain.

Let θ = (θ1, , θm) be a vector of parameters, Z [m] and Z = be some base arm set and Zc be the complement of Z. Recall that θZ is the sub-vector of θ projected onto Z, and we use notation (θ Z, θZc) to denote replacing θi s with θ i s for i Z and keeping the values θi for i Zc

Given a subset Z S , we consider the following property for θZc. For any ||θ Z µZ|| ε, let θ = (θ Z, θZc):

Z Oracle(θ );

Either Oracle(θ ) OPT or ||θ Oracle(θ )

µOracle(θ )||1 > Oracle(θ )

B (k 2 + 1)ε.

The ﬁrst one is to make sure that if we have normal samples in Z at time t, then arms in Z will be played and observed.

Thompson Sampling for Combinatorial Semi-Bandits

These observations would update the Beta distributions of these arms to be more accurate, such that it is easier next time that the samples from these arms are also within ε of their true value. This fact would be used later in the quantitative regret analysis. The second one says that if the samples in Z are normal, then C(t) A(t) can not happen (similar to the analysis in (Agrawal & Goyal, 2013) and (Komiyama et al., 2015)). As time going on, the probability that C(t) A(t) happens will become smaller and smaller, thus the expectation on its sum has an constant upper bound.

We use EZ,1(θ) to denote the event that the vector θZc has such a property, and emphasize that this event only depends on the values in vector θZc. What we want to do is to ﬁnd some exact Z such that EZ,1(θ) happens when C(t) A(t) happens. The following lemma shows that such Z must exist, it is the key lemma in this section.

Lemma 1. Suppose that C(t) A(t) happens, then there exists Z S and Z = such that EZ,1(θ(t)) holds.

By Lemma 1, for some nonempty Z, EZ,1(θ(t)) occurs when C(t) A(t) happens. Another fact is that ||θZ(t) µZ|| > ε. The reason is that if ||θZ(t) µZ|| ε, by deﬁnition of the property, either S(t) OPT or ||θS(t)(t)

µS(t)||1 > S(t)

B (k 2 + 1)ε, which means C(t) A(t) can not happen. Let EZ,2(θ) be the event {||θZ µZ|| > ε}. Then { C(t) A(t)} Z S ,Z = (EZ,1(θ(t)) EZ,2(θ(t))).

Using similar techniques in (Komiyama et al., 2015) we can get the upper bound O 8 ε2 ( 4

ε2 )|Z| log |Z|

ε2 for PT t=1 E [I{EZ,1(θ(t)), EZ,2(θ(t))}].

3.2. Approximation Oracle

We consider using an approximation oracle in our CTS algorithm as well, like what the author did in (Chen et al., 2016b) or (Wen et al., 2015). However, we found out that Thompson sampling does not work with an approximation oracle even in the original MAB model, as shown in Theorem 2. Notice that here we do not consider the Bayesian regret, so it does not contradict with the results in (Wen et al., 2015).

To make it clear, we need to show the deﬁnitions of approximation oracle and approximation regret here.

Deﬁnition 1. An approximation oracle with rate λ for MAB problem is a function Oracle : [0, 1]m {1, , m} such that µOracle(µ) λ maxi µi.

Deﬁnition 2. The approximation regret with rate λ of MAB problem on mean vector µ is deﬁned as:

t=1 (λ max i µi µi(t)),

where i(t) is the arm pulled on time step t.

The TS algorithm using approximation oracle Oralce works just as Algorithm 1.

Theorem 2. There exists an MAB instance with an approximation oracle such that when using Algorithm 1, the regret is Ω(T).

Proof Sketch. Consider the following MAB instance:

Problem Instance 1. m = 3, µ = [0.9, 0.82, 0.7], approximate rate λ = 0.8. The Oracle works as following: if µ3 λ maxi µi, Oracle(µ) = 3; else if µ2 λ maxi µi, Oracle(µ) = 2; otherwise Oracle(µ) = 1.

The key idea is that we may never play the best arm (arm 1 above) when using the approximation oracle. When the sample from the prior distribution of the best arm is good, we choose an approximate arm (arm 2 above) but not the best arm; otherwise we choose an bad arm (arm 3 above) with positive approximation regret. Thus the expected regret of each time slot depends on whether the prior distribution of the best arm at the beginning is good or not. Since the best arm is never observed, we never update its prior distribution. Thus the expected regret in each time slot can remain a positive constant forever.

3.3. The Exponential Constant Term

Since every arm s sample θi(t) is chosen independently, the worst case is that we need all the samples for base arms in the best super arm to be close to their true means to choose that super arm. Under this case, the probability that we have no regret in each time slot is exponentially with k , thus we will have such a constant term.

Theorem 3. There exists a CMAB instance such that the regret of Algorithm 1 on this instance is at least Ω(2k ).

Proof Sketch. Consider the following CMAB instance:

Problem Instance 2. m = k +1, there are only two super arms in I, where S1 = {1, 2, , k } and S2 = {k + 1}. The mean vector µS1 = [1, , 1]. The reward function R follows R(S1, X) = Q

i S1 Xi, and R(S2, X) = 1 , while = 0.5. The distributions Di are all independent Bernoulli distributions with mean µi (since µi = 1, the observations are always 1).

One can show that the expected time until Algorithm 1 plays the optimal super arm S1 for the ﬁrst time is Ω(2k ).

The exponential term comes from the bad prior distribution at the beginning of the algorithm. In fact, from the proof of Theorem 1 we know that if we can pull each base arm for O( 1

ε2 ) times at the beginning and then use the CTS policy

Thompson Sampling for Combinatorial Semi-Bandits

whose prior distribution at the beginning is the posterior distribution after those observations, then we can reduce the exponential constant term to O( m

ϵ4 ). However, since ε depends on min, which is unknown to the player, we can not simply run each base arm for a few time steps to avoid the exponential constant regret term. Perhaps an adaptive choice can be used here, and this is a further research item.

4. Matroid Bandit Case

In matroid bandit, we suppose the oracle we use is the greedy one since the greedy algorithm gives back the exact best super arm.

4.1. Regret Upper Bound

Let S argmax S I r(S, µ) be one of the optimal super arm. Deﬁne i = minj|j S ,µj>µi µj µi. If i / S

but {j | j S , µj > µi} = , we deﬁne i = , so that 1 i = 0. Let K = max S I |S| = |S |. We have the following theorem for CTS algorithm under the matroid bandit case:

Theorem 4. Under Assumptions 1 and 2, the regret upper bound of Algorithm 1 for a matroid bandit is:

log T i 2ε i ε i 2ε + α2 m

for any ε > 0 such that i / S , i 2ε > 0, where α2 is a constant not dependent on the problem instance.

Notice that we do not need the distributions of all the base arms to be independent due to the special structure of matroid. When ε is small, the leading log T term of the above regret bound matches the regret lower bound P

i/ S 1 i log T given in (Kveton et al., 2014). For the constant term, we have an O( 1

ε4 ) factor while Agrawal & Goyal (2013) have an O( 1

ε2 ) factor in their theorem. However, even following their analysis, we can only obtain O( 1

ε4 ) and cannot recover the O( 1

ε2 ) in their analysis.

We now provide a proof outline. The difference from Theorem 1 is that we can use the special combinatorial structure of matroid to improve the analysis.

Firstly, we introduce a fact from (Kveton et al., 2014).

Fact 1. (Lemma 1 in (Kveton et al., 2014)) For each S(t) = {i(1)(t), , i(K)(t)} chosen by Algorithm 1 (the superscript is the order when they are chosen), we could ﬁnd a bijection Lt from {1, 2, . . . , K} to S such that:

1) If i(k)(t) S , then Lt(k) = i(k)(t);

2) 1 k K, {i(1)(t), , i(k 1)(t), Lt(k)} I.

With a bijection Lt, we could decouple the regret of playing one action S(t) to each pair of mapped arms between S(t)

and S . For example, the regret of time t is PK k=1 µLt(k) µi(k)(t).

We use Ni,j(t) to denote the number of rounds that i(k)(t) = i and Lt(k) = j for i / S , j S within in time slots 1, 2, , T, then

j:j S ,µj>µi E[Ni,j(t)](µj µi).

We can see that if {j : j S , µj > µi} = , then P

j:j S ,µj>µi E[Ni,j(t)](µj µi) = 0, thus we do not need to consider the regret from base arm i, so we set i = to make 1 i = 0.

Now we just need to bound the value Ni,j(t), similarly, we can deﬁned the following three events:

Ai,j(t) = { k, i(k)(t) = i Lt(k) = j}

Bi(t) = {ˆµi(t) > µi + ε}

Ci,j(t) = {θi(t) > µj ε}

E[Ni,j(t)] = E

t=1 I[Ai,j(t)]

t=1 I[Ai,j(t) Bi(t)]

t=1 I[Ai,j(t) Bi(t) Ci,j(t)]

t=1 I[Ai,j(t) Ci,j(t)]

We can use Chernoff Bound to get an upper bound of the ﬁrst term as (m K)(1 + 1 ε2 ). As for the second term, basic properties of Beta distribution give an upper bound: (m K)K + P

The largest difference appears in the third term, instead of Lemma 1, here we can have some further steps in matroid bandit. Lemma 2. Suppose the vector θ(t) satisfy that Ai,j(t) Ci,j(t) happens. Then if we change θj(t) to θ j(t) > µj ε and set other values in θ(t) unchanged to get θ (t), then arm j must be chosen in θ (t).

We use the notation θ i to be the vector θ without θi.

For any j S , let Wj be the set of all possible values of θ satisﬁes that Ai,j(t) Ci,j(t) happens for some i, and W j = {θ j : θ Wj}.

Thompson Sampling for Combinatorial Semi-Bandits

From Lemma 2, we know θ(t) Wj only if θ j(t) W j and θj(t) µj ε. Then similar with the analysis of Theorem 1, we can bound the value PT t=1 E[θ(t) Wj] O( 1

5. Experiments

We conduct some preliminary experiments to empirically evaluate the performance of CTS versus CUCB and C-KLUCB. The reason that we choose C-KL-UCB is that: a) in classical MAB model, KL-UCB behaves better than UCB; b) similar with TS, it is also a policy based on Bayesian Rule. We also make simulations on CUCB and C-KL-UCB with chosen parameters, represented by CUCB-m and CKL-UCB-m. In CUCB, we choose the conﬁdence radius to be radi(t) = q

3 log t 2Ni(t), while in CUCB-m, it is q

log t 2Ni(t). In C-KL-UCB we choose f(t) = log t + 2 log log t, while in C-KL-UCB-m it is log t. Those chosen parameters in CUCB-m and C-KL-UCB-m make them behave better, but lack performance analysis.

5.1. Matroid Bandit

It is well known that spanning trees form a matroid. Thus, we test the maximum spanning tree problem as an example of matroid bandits, where edges are arms, and super arms are forests.

We ﬁrst generate a random graph with M nodes, and each pair of nodes has an edge with probability p. If the resulting graph has no spanning tree, we regenerate the graph again. The mean of the distribution is randomly and uniformly chosen from [0, 1]. The expected reward for any spanning tree is the sum of the means of all edges in it. It is easy to see that this setting is an instance of the matroid bandit.

The results are shown in Figure 1 with the probability p = 0.6 and M = 30. In Figure 1(a), we set all the arms to have independent distributions. In Figure 1(b), each time slot we generate a global random variable rand uniformly in [0, 1], all edges with mean larger than rand will have outcome 1, while others have outcome 0. In other words, the distributions of base arms are correlated. We can see that CTS has smaller regret than CUCB, CUCB-m and CKL-UCB in both two experiments. As for C-KL-UCB-m algorithm, it behaves better with small T, but loses when T is very large. We emphasize that C-KL-UCB-m policy uses parameters without theoretical guarantee, thus CTS algorithm is a better choice.

5.2. General CMAB

In the general CMAB case, we consider the shortest path problem. We build two graphs for this experiment, the results of them are shown in Figure 2(a) and Figure 2(b).

Figure 1. Experiments on matroid bandit: Maximum Spanning Tree

Figure 2. Experiments on general CMAB: Shortest Path

The cost of a path is the sum of all edges mean in that path, while the outcome of each edge e follows a independent Bernoulli distribution with mean µe. The objective is to ﬁnd the path with minimum cost. To make the problem more challenging, in both graphs we construct a lot of paths from the source node s to the sink node t that only have a little larger cost than the optimal one, and some of them are totally disjoint with the optimal path.

Similar to the case of matroid bandit, the regret of CTS is also much smaller than that of CUCB, CUCB-m and C-KLUCB, especially when T is large. As for the C-KL-UCB-m algorithm, although it behaves best in the four UCB-based policies, it still has a large difference between CTS.

6. Future Work

In this paper, we apply combinatorial Thompson sampling to combinatorial multi-armed bandit and matroid bandit problems, and obtain theoretical regret upper bounds for those two settings.

There are still a number of interesting questions that may worth further investigation. For example, pulling each base arm for a number of time slots at the beginning of the game can decease the constant term to non-exponential, but the point is that the player does not know how many time slots are enough. Thus how can we use an adaptive policy or some further assumptions to do so is a good question. In this paper, we suppose that all the distributions for base arms are independent. Another question is how to ﬁnd analysis for using CTS under correlated arm distributions.

Thompson Sampling for Combinatorial Semi-Bandits

Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In COLT, pp. 39 1, 2012.

Agrawal, S. and Goyal, N. Further optimal regret bounds for thompson sampling. In AISTATS, pp. 99 107, 2013.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235 256, 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The non-stochastic multi-armed bandit problem. Siam Journal on Computing, 32(1):48 77, 2002b.

Berry, D. A. and Fristedt, B. Bandit problems: sequential allocation of experiments (Monographs on statistics and applied probability). Springer, 1985.

Chen, W., Hu, W., Li, F., Li, J., Liu, Y., and Lu, P. Combinatorial multi-armed bandit with general reward functions. In NIPS, 2016a.

Chen, W., Wang, Y., Yuan, Y., and Wang, Q. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(50):1 33, 2016b. A preliminary version appeared as Chen, Wang, and Yuan, combinatorial multi-armed bandit: General framework, results and applications , ICML 2013.

Combes, R., Talebi, M. S., Proutiere, A., and Lelarge, M. Combinatorial bandits revisited. In NIPS, 2015.

Gai, Y., Krishnamachari, B., and Jain, R. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20, 2012.

Garivier, A. and Cappé, O. The kl-ucb algorithm for bounded stochastic bandits and beyond. Mathematics, 2011.

Gittins, J. Multi-armed bandit allocation indices. wileyinterscience series in systems and optimization. 1989.

Gopalan, A., Mannor, S., and Mansour, Y. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014a.

Gopalan, A., Mannor, S., and Mansour, Y. Thompson sampling for complex online problems. In ICML, volume 14, pp. 100 108, 2014b.

Kaufmann, E., Korda, N., and Munos, R. Thompson sampling: an asymptotically optimal ﬁnite-time analysis. In Proceedings of the 23rd international conference on Algorithmic Learning Theory, pp. 199 213, 2012.

Komiyama, J., Honda, J., and Nakagawa, H. Optimal regret analysis of thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proceedings of The 32nd International Conference on Machine Learning, pp. 1152 1161, 2015.

Kveton, B., Wen, Z., Ashkan, A., Eydgahi, H., and Eriksson, B. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the 30th Conference on Uncertainty in Artiﬁcial Intelligence (UAI), 2014.

Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, C. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artiﬁcial Intelligence and Statistics, 2015a.

Kveton, B., Wen, Z., Ashkan, A., and Szepesvari, C. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems, pp. 1450 1458, 2015b.

Lai, T. L. and Robbins, H. Asymptotically efﬁcient adaptive allocation rules. Advances in applied mathematics, 6(1): 4 22, 1985.

Russo, D. and Van Roy, B. An information-theoretic analysis of thompson sampling. Journal of Machine Learning Research, 17(68):1 30, 2016.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285 294, 1933.

Wang, Q. and Chen, W. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In Advances in Neural Information Processing Systems, 2017.

Wen, Z., Kveton, B., and Ashkan, A. Efﬁcient learning in large-scale combinatorial semi-bandits. In ICML, pp. 1113 1122, 2015.