# Contextual Combinatorial Cascading Bandits

Shuai Li (SHUAILI@CSE.CUHK.EDU.HK), The Chinese University of Hong Kong, Hong Kong
Baoxiang Wang (BXWANG@CSE.CUHK.EDU.HK), The Chinese University of Hong Kong, Hong Kong
Shengyu Zhang (SYZHANG@CSE.CUHK.EDU.HK), The Chinese University of Hong Kong, Hong Kong
Wei Chen (WEIC@MICROSOFT.COM), Microsoft Research, Beijing, China

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

We propose contextual combinatorial cascading bandits, a combinatorial online learning game in which, at each time step, a learning agent is given a set of contextual information, then selects a list of items, and observes the stochastic outcomes of a prefix of the selected items determined by some stopping criterion. In online recommendation, the stopping criterion might be the first item the user selects; in network routing, it might be the first blocked edge on a path. We consider position discounts in the list order, so that the agent's reward is discounted depending on the position where the stopping criterion is met. We design a UCB-type algorithm, C3-UCB, for this problem, prove an $n$-step regret bound of $O(\sqrt{n})$ in the general setting, and give finer analyses for two special cases. Our work generalizes existing studies in several directions, including contextual information, position discounts, and a more general cascading bandit model. Experiments on synthetic and real datasets demonstrate the advantage of involving contextual information and position discounts.

1. Introduction

The multi-armed bandit (MAB) problem has been extensively studied in statistics and machine learning. The problem is usually formulated as a system of $K$ base arms whose rewards are random samples from unknown distributions with unknown means. The learning agent pulls one arm at each time step and tries to minimize the cumulative regret, which is the difference in expected cumulative reward between always pulling the best arm and playing according to the agent's strategy. MAB algorithms have to deal with the trade-off between exploitation (pulling the empirically best arm) and exploration (trying arms that have not been pulled sufficiently often).

Recently, stochastic combinatorial bandits have started to draw much attention (Gai et al., 2012; Chen et al., 2013; Lin et al., 2014; Gopalan et al., 2014; Chen et al., 2014; Kveton et al., 2014; 2015b;a;c; Lin et al., 2015; Combes et al., 2015b). At every time step, a learning agent chooses a subset of ground items (a super arm) under certain combinatorial constraints. There are several different kinds of feedback: (1) bandit feedback, where the learning agent only observes the reward of the chosen super arm; (2) semi-bandit feedback, where the learning agent also observes the stochastic outcomes of all base arms constituting the chosen super arm; and (3) cascading feedback, where the learning agent observes the reward of the chosen super arm and the weights of some of its base arms, according to some problem-specific stopping criterion for observation.

The cascading feedback model fits many real application scenarios. For example, in online or mobile recommendation, it is typical practice to recommend an ordered list (instead of a single item) to a user, who usually goes over the list in the recommended order and clicks through the first item of interest.
Thus it is reasonable to assume that the items before the clicked one are of no interest to the user, while the user's interest in the items after the clicked one remains unclear. Another example is network routing, where the agent needs to select a routing path that is least likely to be blocked. To determine whether a path is blocked, the agent checks from the source node until it meets a blocked edge. The edges before the blocked one are unblocked and the edges after the blocked one are unobserved. In both of the above applications, we observe the feedback of a prefix of items in the chosen ordered list. The bandit problem with such cascading feedback has been studied in recent papers (Kveton et al., 2015a;c). In this paper, we generalize their work in several directions to cover more realistic scenarios.

First, we incorporate contextual information into cascading bandits. In online recommendation, contextual information includes various user and item information, and user behavior differs across contexts, so utilizing contextual information is crucial for personalized recommendation. In network routing, temporal and spatial contextual information may also be useful for determining better routing paths. We therefore incorporate contextual information into the cascading bandit formulation.

Second, the cascading bandits studied in (Kveton et al., 2015a;c) treat all positions in the cascading list equally, but in applications different positions may bring different rewards. For example, in online recommendation, we usually prefer users to find their desired item as early in the list as possible, to increase user satisfaction. In network routing, we may prefer to hit a blocked edge as late as possible. To model these preferences, we introduce position discounts to the cascading bandit, and we show through experiments that incorporating position discounts may significantly improve the learning results.

Finally, we generalize the reward functions of (Kveton et al., 2015a;c), which are based on disjunctive and conjunctive binary random variables, to more general non-linear reward functions satisfying monotonicity and Lipschitz continuity conditions. This generalization allows us to cover new realistic scenarios not handled in (Kveton et al., 2015a;c), as exhibited in Section 4.3.

The organization of this paper is as follows: Section 2 compares our work to previous work; Section 3 presents the formulation of contextual combinatorial cascading bandits with general reward functions and position discounts; Section 4 gives our algorithm and main results, both for general reward functions and for two special reward functions; Section 5 demonstrates the experimental results; and Section 6 concludes.

2. Related Work

We first discuss the studies most relevant to our work. Table 1 summarizes their different settings, which we explain in detail next. A comparison of the regret bounds obtained by the different methods is deferred to Section 4.3, after we present the full results on our regret bounds.

| | context | cascading | position discount | general reward |
|---|---|---|---|---|
| CUCB | no | yes | no | yes |
| C2UCB | yes | no | no | yes |
| CombCascade | no | yes | no | no |
| C3-UCB (ours) | yes | yes | yes | yes |

Table 1. Comparisons of our setting with previous ones.
Kveton et al. (2015a) introduced the cascading model to the multi-armed bandit framework with a disjunctive objective, where the set of feasible actions forms a uniform matroid and the reward of an action is 1 if there is at least one good item in the list. The combinatorial cascading bandits of (Kveton et al., 2015c) (referred to as CombCascade in Table 1) generalized this framework, allowing each feasible action to be a subset of ground items under combinatorial constraints. It studied a problem of combinatorial bandits with a conjunctive objective, where each feasible action is a chain of items and the reward is 1 only if all items are good. Our work generalizes these models and incorporates both contextual information and position discounts.

The experiments of (Kveton et al., 2015a) found that recommending items in increasing order of their UCBs (so that the items with lower preference come early) performs better than recommending in decreasing order of their UCBs. This unnatural result arises, perhaps, because the order of the items does not affect the reward, and thus putting low-preference items first and high-preference items at the back may produce more feedback in the cascade. The position discounts introduced in this paper make the recommended set be ordered by decreasing UCBs, which is more realistic.

Combes et al. (2015a) considered a cascading model similar to that of (Kveton et al., 2015a) with a particular case of contextual information, where a user is recommended $K$ items selected from a user-related group. Under this particular setting, they considered position discounts. In our setting, users and items are represented by general feature vectors.

Our work considers contextual combinatorial bandits with cascading feedback and nonlinear reward. The paper (Li et al., 2010) studied multi-armed bandits with contextual information, recommending one item at a time. A recent work (Qin et al., 2014) (referred to as C2UCB in Table 1) introduced contextual combinatorial bandits with semi-bandit feedback and nonlinear reward. Their feedback is more informative than ours. We relax the feedback from semi-bandit to cascading, yet our algorithm achieves the same order of regret bound.

There have been several pieces of work on partial monitoring MAB, but the results are not applicable to our setting. The papers (Agrawal et al., 1989; Bartók et al., 2012) studied partial monitoring problems, but their algorithms are inefficient in our setting. Lin et al. (2014) studied a combinatorial partial monitoring problem in which the feedback is a fixed (with respect to the chosen action) combination of weights. In our setting, even if the chosen action is fixed, the number of observed items varies with the user's behavior, so the combination constituting the feedback is not fixed.

Chen et al. (2016) considered general reward functions and probabilistic triggering of base arms in the stochastic combinatorial semi-bandit model (referred to as CUCB in Table 1). Cascading bandits can be viewed as one type of probabilistic triggering, where the head of the list is deterministically triggered and all the remaining items are probabilistically triggered. However, their work does not consider contextual information, and it is also difficult to incorporate position discounts in a general triggering model.

3. Problem Formulation

We formulate the problem of contextual combinatorial cascading bandits as follows. Suppose we have a finite set $E = \{1, \dots, L\}$ of $L$ ground items, also referred to as base arms. Let $\Pi_k = \{(a_1, \dots, a_k) : a_1, \dots, a_k \in E,\ a_i \neq a_j \text{ for any } i \neq j\}$ be the set of all $k$-tuples of distinct items from $E$; we call each such tuple an action of length $k$, and write $|A|$ for the length of an action $A$. Let $\Pi^{\le K} = \bigcup_{k=1}^{K}\Pi_k$ denote the set of all actions of length at most $K$, and let $S \subseteq \Pi^{\le K}$ be the set of feasible actions of length at most $K$. As a convention, we always use boldface symbols to represent random variables, and denote $[m] = \{1, \dots, m\}$.

At time $t$, feature vectors $x_{t,a} \in \mathbb{R}^{d\times 1}$ with $\|x_{t,a}\|_2 \le 1$, one for every base arm $a \in E$, are revealed to the learning agent. Each feature vector combines the information of the user and the corresponding base arm. For example, suppose the user at time $t$ is characterized by a feature vector $u_t$ and base arm $a$ has a feature vector $v_a$; then we can use $x_{t,a} = u_t v_a^{\top}$, the outer product of $u_t$ and $v_a$, as the combined feature vector of base arm $a$ at time $t$. (See Section 5.2 for an application.) Denote by $x_t = (x_{t,a})_{a \in E}$ all the contextual information at time $t$.
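To make the combined-context construction concrete, here is a minimal NumPy sketch (not the authors' code). It assumes the outer product $u_t v_a^{\top}$ is flattened into a single $d$-dimensional vector and rescaled so that $\|x_{t,a}\|_2 \le 1$; all names are illustrative.

```python
import numpy as np

def combined_context(u_t: np.ndarray, v_a: np.ndarray) -> np.ndarray:
    """Combine a user feature u_t and an item feature v_a into x_{t,a}.

    Sketch only: flatten the outer product u_t v_a^T into one vector of
    dimension d = len(u_t) * len(v_a) and rescale so that ||x||_2 <= 1,
    as required by the formulation (assumption: flattening is used).
    """
    x = np.outer(u_t, v_a).ravel()        # outer product, then vectorize
    norm = np.linalg.norm(x)
    return x / norm if norm > 1 else x    # enforce ||x||_2 <= 1

# Example: 3-dimensional user and item features give a 9-dimensional context.
u_t = np.array([0.6, 0.8, 0.0])
v_a = np.array([0.5, 0.5, 0.5])
x_ta = combined_context(u_t, v_a)
```

With unit-norm user and item vectors, the flattened outer product already has $\ell_2$-norm $\|u_t\|_2\|v_a\|_2 \le 1$, so the rescaling above is only a safeguard.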
Then the learning agent recommends a feasible action $A_t = (a_1^t, \dots, a_{|A_t|}^t) \in S$ to the user. In the cascading feedback model, the user checks the recommended action from its first item and stops at the $O_t$-th item according to some stopping criterion. For example, in recommender systems, the stopping criterion might be that the user finds the first attractive item. The learning agent then observes the weights of the first $O_t$ base arms in $A_t$, denoted by $w_t(a_k^t)$, $k \le O_t$.

Let $H_t$ denote the history before the learning agent chooses an action at time $t$. Thus $H_t$ contains the feedback information at all times $s < t$, as well as the contextual information at time $t$. Each arm $a$ has a weight $w_t(a)$, representing its quality to the user at time $t$. Given the history $H_t$, we assume that the $w_t(a)$'s are mutually independent random variables with expectation

$$\mathbb{E}[w_t(a)\,|\,H_t] = \theta^{*\top} x_{t,a}, \quad (1)$$

where $\theta^*$ is an unknown $d$-dimensional vector with $\|\theta^*\|_2 \le 1$ and $0 \le \theta^{*\top} x_{t,a} \le 1$ for all $t, a$. Denote by $\bar{w}_{t,a} = \theta^{*\top} x_{t,a}$ the expected weight of base arm $a$ at time $t$. We assume that each $w_t(a)$ is a random variable with an $R$-sub-Gaussian tail, which means that for all $b \in \mathbb{R}$,

$$\mathbb{E}\big[\exp\big(b(w_t(a) - \theta^{*\top} x_{t,a})\big)\,\big|\,H_t\big] \le \exp(b^2 R^2/2).$$

Recall that the random variable $O_t$ is the number of observed base arms in $A_t$; at time $t$, the agent observes the first $O_t$ items of $A_t$. We say that item $a$ is observed if $a = a_k^t$ for some $k \le O_t$. Thus $H_t$ consists of $\{x_s, A_s = (a_1^s, \dots, a_{|A_s|}^s), O_s, w_s(a_k^s)\}_{k \le O_s,\, s < t}$, as well as $x_t$. The reward of the chosen action is determined by the action and the weights of its items; we write its expectation as $f(A_t, \bar{w}_t)$ for an expected reward function $f$ that is required to be monotone and $B$-Lipschitz continuous in the expected weights (the precise requirements are used in Theorem 4.3). Performance is measured by the cumulative $\alpha$-regret, the gap between $\alpha$ times the expected reward of the optimal feasible action and the expected reward collected by the agent, where $\alpha \le 1$ accounts for the $\alpha$-approximation oracle $\mathrm{OS}$ used to select actions.

4. Algorithm

The learning agent maintains a ridge-regression estimate of $\theta^*$: for $\lambda > 0$,

$$\hat{\theta}_t = (X_t^\top X_t + \lambda I)^{-1} X_t^\top Y_t, \quad (2)$$

where $X_t \in \mathbb{R}^{(\sum_{s=1}^{t} O_s)\times d}$ is the matrix whose rows are $\gamma_k x_{s,a_k^s}^\top$ and $Y_t$ is the column vector whose entries are $\gamma_k w_s(a_k^s)$, $k \in [O_s]$, $s \in [t]$. Here $\gamma_1, \dots, \gamma_K \in [0,1]$ are the position discounts. Let

$$V_t = X_t^\top X_t + \lambda I = \lambda I + \sum_{s=1}^{t}\sum_{k=1}^{O_s}\gamma_k^2\, x_{s,a_k^s} x_{s,a_k^s}^\top.$$

Then $V_t \in \mathbb{R}^{d\times d}$ is a symmetric positive definite matrix. For any symmetric positive definite matrix $V \in \mathbb{R}^{d\times d}$ and any vector $x \in \mathbb{R}^{d\times 1}$, we define the 2-norm of $x$ based on $V$ by $\|x\|_V = (x^\top V x)^{1/2}$.

Next we obtain a good estimate of the difference between $\hat{\theta}_t$ and $\theta^*$ by Theorem 2 of (Abbasi-Yadkori et al., 2011), restated as follows.

Lemma 4.1 (Abbasi-Yadkori et al., 2011). Let

$$\beta_t(\delta) = R\sqrt{\ln\!\big(\det(V_t)/(\lambda^d\delta^2)\big)} + \sqrt{\lambda}. \quad (3)$$

Then for any $\delta > 0$, with probability at least $1 - \delta$, for all $t > 0$, we have

$$\|\hat{\theta}_t - \theta^*\|_{V_t} \le \beta_t(\delta). \quad (4)$$

Thus, with high probability, the estimate $\hat{\theta}_t$ lies in an ellipsoid centered at $\theta^*$ with confidence radius $\beta_t(\delta)$ under the $V_t$-norm. Building on this, we define an upper confidence bound of the expected weight of each base arm $a$ by

$$U_t(a) = \min\big\{\hat{\theta}_{t-1}^\top x_{t,a} + \beta_{t-1}(\delta)\,\|x_{t,a}\|_{V_{t-1}^{-1}},\ 1\big\}. \quad (5)$$
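To make Eqs. (2), (3), and (5) concrete, here is a minimal NumPy sketch of the ridge-regression estimate, the confidence radius, and the resulting UCB index. It is illustrative only; it assumes the caller has already stacked the discounted rows $\gamma_k x_{s,a_k^s}^\top$ into `X` and the discounted responses $\gamma_k w_s(a_k^s)$ into `Y`, and `R` is the sub-Gaussian constant.

```python
import numpy as np

def ridge_estimate(X: np.ndarray, Y: np.ndarray, lam: float) -> np.ndarray:
    """Eq. (2): theta_hat = (X^T X + lam*I)^{-1} X^T Y."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(V, X.T @ Y)

def confidence_radius(V: np.ndarray, lam: float, delta: float, R: float) -> float:
    """Eq. (3): beta(delta) = R*sqrt(ln(det(V)/(lam^d * delta^2))) + sqrt(lam)."""
    d = V.shape[0]
    _, logdet = np.linalg.slogdet(V)
    return R * np.sqrt(logdet - d * np.log(lam) - 2 * np.log(delta)) + np.sqrt(lam)

def ucb(x: np.ndarray, theta_hat: np.ndarray, V: np.ndarray, beta: float) -> float:
    """Eq. (5): U_t(a) = min{theta_hat^T x + beta * ||x||_{V^{-1}}, 1}."""
    width = np.sqrt(x @ np.linalg.solve(V, x))   # ||x||_{V^{-1}}
    return min(theta_hat @ x + beta * width, 1.0)
```

In Algorithm 1 below, `V`, `X`, and `Y` are updated with the observed prefix of each recommended action only.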
The fact that $U_t(a)$ is an upper confidence bound on the expected weight $\bar{w}_{t,a} = \theta^{*\top} x_{t,a}$ is proved in the following lemma.

Lemma 4.2. When Ineq. (4) holds at time $t-1$, we have

$$0 \le U_t(a) - \bar{w}_{t,a} \le 2\beta_{t-1}(\delta)\,\|x_{t,a}\|_{V_{t-1}^{-1}}.$$

Proof. Note that $\bar{w}_{t,a} = \theta^{*\top} x_{t,a}$. By Hölder's inequality,

$$\big|\hat{\theta}_{t-1}^\top x_{t,a} - \theta^{*\top} x_{t,a}\big| = \big|\big[V_{t-1}^{1/2}(\hat{\theta}_{t-1} - \theta^*)\big]^\top\big(V_{t-1}^{-1/2} x_{t,a}\big)\big| \le \big\|V_{t-1}^{1/2}(\hat{\theta}_{t-1} - \theta^*)\big\|_2\,\big\|V_{t-1}^{-1/2} x_{t,a}\big\|_2 = \|\hat{\theta}_{t-1} - \theta^*\|_{V_{t-1}}\,\|x_{t,a}\|_{V_{t-1}^{-1}} \le \beta_{t-1}(\delta)\,\|x_{t,a}\|_{V_{t-1}^{-1}}.$$

Because $1 \ge \theta^{*\top} x_{t,a} \ge 0$ and $0 \le \big(\hat{\theta}_{t-1}^\top x_{t,a} + \beta_{t-1}(\delta)\|x_{t,a}\|_{V_{t-1}^{-1}}\big) - \theta^{*\top} x_{t,a} \le 2\beta_{t-1}(\delta)\|x_{t,a}\|_{V_{t-1}^{-1}}$, the claimed result is obtained.

Based on the above analysis, we design our algorithm as follows. First, the learning agent computes the upper confidence bounds (UCBs) $U_t \in [0,1]^E$ for the expected weights of all base arms in $E$. Second, the agent uses the computed UCBs $U_t$ to select an action $A_t = (a_1^t, \dots, a_{|A_t|}^t)$. Third, the user checks the list from the first base arm of $A_t$ and stops after checking the $O_t$-th base arm, and the agent observes $w_t(a_k^t)$, $k \le O_t$. Finally, the learning agent updates $V_t$, $X_t$, $Y_t$ in order to obtain a new estimate $\hat{\theta}_t$ of $\theta^*$ and a new confidence radius $\beta_t(\delta)$. Our proposed algorithm, C3-UCB, is described in Algorithm 1, where we use the notation $[A; B]$ for the matrix obtained by stacking $A$ and $B$ vertically.

Algorithm 1 C3-UCB

Parameters: $\{\gamma_k \in [0,1]\}_{k \le K}$; $\delta = 1/\sqrt{n}$; $\lambda \ge C_\gamma = \sum_{k=1}^{K}\gamma_k^2$.
Initialization: $\hat{\theta}_0 = 0$, $\beta_0(\delta) = 1$, $V_0 = \lambda I$, $X_0 = \emptyset$, $Y_0 = \emptyset$.
For $t = 1, 2, \dots, n$:
  Obtain the contexts $x_{t,a}$ for all $a \in E$.
  For every $a \in E$, compute $U_t(a) = \min\{\hat{\theta}_{t-1}^\top x_{t,a} + \beta_{t-1}(\delta)\|x_{t,a}\|_{V_{t-1}^{-1}},\ 1\}$.
  Choose the action $A_t = (a_1^t, \dots, a_{|A_t|}^t) \leftarrow \mathrm{OS}(U_t)$ using the UCBs $U_t$.
  Play $A_t$ and observe $O_t$ and $w_t(a_k^t)$, $k \in [O_t]$.
  Update the statistics:
    $V_t \leftarrow V_{t-1} + \sum_{k=1}^{O_t}\gamma_k^2\, x_{t,a_k^t} x_{t,a_k^t}^\top$
    $X_t \leftarrow [X_{t-1};\ \gamma_1 x_{t,a_1^t}^\top;\ \dots;\ \gamma_{O_t} x_{t,a_{O_t}^t}^\top]$
    $Y_t \leftarrow [Y_{t-1};\ \gamma_1 w_t(a_1^t);\ \dots;\ \gamma_{O_t} w_t(a_{O_t}^t)]$
    $\hat{\theta}_t \leftarrow (X_t^\top X_t + \lambda I)^{-1} X_t^\top Y_t$
    $\beta_t(\delta) \leftarrow R\sqrt{\ln\!\big(\det(V_t)/(\lambda^d\delta^2)\big)} + \sqrt{\lambda}$
End for
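The following is a compact Python sketch of one round of Algorithm 1. It is not the authors' implementation: it assumes an oracle that simply returns the $K$ items with the largest UCBs in decreasing order (the choice used in our experiments when $\gamma \le 1$), and a cascade environment supplied by the caller; all names are illustrative.

```python
import numpy as np

class C3UCB:
    """Minimal sketch of Algorithm 1 (C3-UCB); illustrative, not the authors' code."""

    def __init__(self, d, K, gammas, lam, R, n):
        assert lam >= np.sum(np.asarray(gammas) ** 2)   # lambda >= C_gamma
        self.d, self.K = d, K
        self.gammas = np.asarray(gammas)                # position discounts gamma_1..gamma_K
        self.lam, self.R = lam, R
        self.delta = 1.0 / np.sqrt(n)
        self.V = lam * np.eye(d)                        # V_0 = lambda * I
        self.b = np.zeros(d)                            # running X^T Y
        self.theta_hat = np.zeros(d)
        self.beta = 1.0                                 # beta_0(delta) = 1

    def ucbs(self, X_ctx):
        """U_t(a) for every arm, given the L x d context matrix X_ctx."""
        Vinv_X = np.linalg.solve(self.V, X_ctx.T)                 # columns are V^{-1} x_a
        widths = np.sqrt(np.einsum('ij,ji->i', X_ctx, Vinv_X))    # ||x_a||_{V^{-1}}
        return np.minimum(X_ctx @ self.theta_hat + self.beta * widths, 1.0)

    def select(self, X_ctx):
        """Top-K oracle: the K arms with largest UCBs, in decreasing order."""
        return list(np.argsort(-self.ucbs(X_ctx))[:self.K])

    def update(self, X_ctx, action, observed_weights):
        """Update V, theta_hat and beta from the observed prefix of the action."""
        for k, w in enumerate(observed_weights):        # only the first O_t positions
            g, x = self.gammas[k], X_ctx[action[k]]
            self.V += (g ** 2) * np.outer(x, x)         # V_t update
            self.b += (g * x) * (g * w)                 # accumulates X^T Y
        self.theta_hat = np.linalg.solve(self.V, self.b)   # Eq. (2): V = X^T X + lam*I
        _, logdet = np.linalg.slogdet(self.V)
        self.beta = (self.R * np.sqrt(logdet - self.d * np.log(self.lam)
                                      - 2 * np.log(self.delta)) + np.sqrt(self.lam))
```

A caller would loop over $t$, draw the contexts, call `select`, run the cascade (for example, stop at the first weight-1 item for the disjunctive objective), and pass the observed prefix of weights to `update`.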
4.2. Results

4.2.1. General Results

To state our main theorem, we need some definitions. Let $p_{t,A}$ be the probability of full observation of $A$, that is, the probability of observing all base arms of $A = (a_1, \dots, a_{|A|})$ at time $t$. Let $p^* = \min_{1\le t\le n}\min_{A\in S} p_{t,A}$ be the minimal probability that an action has all of its base arms observed, over all times and actions. The following is the main theorem on the regret achieved by our C3-UCB algorithm.

Theorem 4.3. Suppose the expected reward function $f(A, \bar{w})$ is a function of the expected weights and satisfies the requirements of monotonicity and $B$-Lipschitz continuity. Then the $\alpha$-regret of our algorithm, C3-UCB, satisfies

$$R^\alpha(n) = O\!\left(\frac{B}{p^*}\sqrt{nKd\ln\!\big(1 + C_\gamma n/(\lambda d)\big)}\left(\sqrt{\lambda} + R\sqrt{\ln\!\big[(1 + C_\gamma n/(\lambda d))^d\, n\big]}\right)\right) = O\!\left(\frac{dB}{p^*}\sqrt{nK}\,\ln(C_\gamma n)\right), \quad (6)$$

where $R$ is the sub-Gaussian constant and $C_\gamma = \sum_{k=1}^{K}\gamma_k^2 \le K$.

The regret bound deteriorates as $p^*$ gets smaller. Unfortunately, this reciprocal dependence on $p^*$ is not known to be improvable, even in the special case of the disjunctive objective. Indeed, in that case, the bound in Section 2.3 of (Kveton et al., 2015c) also has a reciprocal dependence on $p^*$. However, in the special case of the conjunctive objective, we will give an improved bound.

Proof (sketch). Assume $A_t = (a_1^t, \dots, a_{|A_t|}^t)$. We first employ the monotonicity and $B$-Lipschitz continuity of $f$ to upper bound the one-step regret as

$$R^\alpha(t, A_t) \le 2B\sum_{k=1}^{|A_t|}\beta_{t-1}(\delta)\,\gamma_k\,\|x_{t,a_k^t}\|_{V_{t-1}^{-1}}. \quad (7)$$

As $V_t$ contains only information about observed base arms, our UCB-type algorithm guarantees that the sum of the first $O_t$ terms on the right of Ineq. (7) is small, with the leftover sum (from $O_t + 1$ to $|A_t|$) out of control. We cope with this by a reduction to the case $O_t = |A_t|$:

$$\mathbb{E}[R^\alpha(t, A_t)\,|\,H_t] = \mathbb{E}\!\left[R^\alpha(t, A_t)\,\mathbb{E}\!\left[\tfrac{1}{p_{t,A_t}}\mathbf{1}\{O_t = |A_t|\}\,\Big|\,H_t, A_t\right]\Big|\,H_t\right] \le \frac{1}{p^*}\,\mathbb{E}\big[R^\alpha(t, A_t)\,\mathbf{1}\{O_t = |A_t|\}\,\big|\,H_t\big]. \quad (8)$$

Combining (7) and (8) gives an upper bound on $R^\alpha(n)$ in terms of $\sum_{t=1}^{n}\sum_{k=1}^{O_t}\gamma_k\|x_{t,a_k^t}\|_{V_{t-1}^{-1}}$. The next lemma upper bounds this sum of norms, squared.

Lemma 4.4. If $\lambda \ge C_\gamma$, then for any time $t$,

$$\sum_{s=1}^{t}\sum_{k=1}^{O_s}\gamma_k^2\,\|x_{s,a_k^s}\|_{V_{s-1}^{-1}}^{2} \le 2d\ln\!\big(1 + C_\gamma t/(\lambda d)\big).$$

The claimed result then follows from the Cauchy-Schwarz inequality. See the Appendix for a complete proof.

4.2.2. Disjunctive Objective

In the problem of cascading recommendation, when recommended an ordered list of items $A_t = (a_1^t, \dots, a_{|A_t|}^t)$, the user checks the list in that order. The checking process stops when the user selects one item, or after all items have been checked without any selection. The weight of each base arm $a$ at time $t$, $w_t(a)$, is a $\{0,1\}$ value indicating whether the user selects item $a$ or not. Then the random variable $O_t$ satisfies

$$O_t = \begin{cases} k, & \text{if } w_t(a_j^t) = 0 \text{ for all } j < k \text{ and } w_t(a_k^t) = 1,\\ |A_t|, & \text{if } w_t(a_j^t) = 0 \text{ for all } j \le |A_t|.\end{cases}$$

If the user selects the $k$-th item, the learning agent receives a reward $\gamma_k \in [0,1]$. Usually the $\{\gamma_k\}$ are decreasing in $k$: $1 = \gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_K \ge 0$. If the user does not select any item, the agent receives no reward. At the end of time step $t$, the learning agent observes $O_t$ and $w_t(a_k^t)$, $k \le O_t$, and receives the reward $r_t = w_t(a_{O_t}^t)\gamma_{O_t}$. It is not hard to see that $r_t = \bigvee_{k=1}^{|A_t|}\gamma_k w_t(a_k^t)$ is the disjunction of the discounted weights, where we use the notation $\bigvee_{k=1}^{n} a_k = \max_{1\le k\le n} a_k$. Notice that the order of $A_t$ affects both the feedback and the reward.

Now define a function $f : S\times[0,1]^E \to [0,1]$ on $A = (a_1, \dots, a_{|A|}) \in S$, $w = (w(1), \dots, w(L))$ by

$$f(A, w) = \sum_{k=1}^{|A|}\gamma_k\prod_{i=1}^{k-1}\big(1 - w(a_i)\big)\, w(a_k). \quad (9)$$

It is easily verified that $r_t = f(A_t, w_t)$ and that $\mathbb{E}[r_t] = f(A_t, \theta^{*\top} x_t) = f(A_t, \bar{w}_t)$. So $f$ is the expected reward function and is also a function of the expected weights. It can be proved that $f$ satisfies the properties of monotonicity and 1-Lipschitz continuity. (All verifications are left to the Appendix.) We thus have the following corollary.

Corollary 4.5. In the problem of cascading recommendation, let the expected reward function be defined by Eq. (9) with $1 = \gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_K \ge 0$. Then the $\alpha$-regret of our algorithm, C3-UCB, satisfies

$$R^\alpha(n) = O\!\left(\frac{1}{p^*}\sqrt{nKd\ln\!\big(1 + C_\gamma n/(\lambda d)\big)}\left(\sqrt{\lambda} + R\sqrt{\ln\!\big[(1 + C_\gamma n/(\lambda d))^d\, n\big]}\right)\right) = O\!\left(\frac{d}{p^*}\sqrt{nK}\,\ln(C_\gamma n)\right). \quad (10)$$
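To illustrate the disjunctive stopping rule and Eq. (9), here is a small sketch (illustrative names, not the authors' code) that simulates the cascade on realized weights and evaluates the expected reward on expected weights.

```python
def disjunctive_cascade(weights, gammas):
    """Disjunctive cascade on realized {0,1} weights: the user scans in order,
    stops at the first item with weight 1 (reward gamma_k); otherwise all
    items are checked and the reward is 0. Returns (O_t, reward)."""
    for k, w in enumerate(weights):
        if w == 1:
            return k + 1, gammas[k]
    return len(weights), 0.0

def expected_disjunctive_reward(w_bar, gammas):
    """Eq. (9): f(A, w) = sum_k gamma_k * prod_{i<k} (1 - w(a_i)) * w(a_k)."""
    f, no_click_so_far = 0.0, 1.0
    for k, w in enumerate(w_bar):
        f += gammas[k] * no_click_so_far * w
        no_click_so_far *= (1.0 - w)
    return f

# Example with position discounts gamma_k = 0.9**(k-1), as in the experiments.
gammas = [0.9 ** k for k in range(4)]
w_bar = [0.3, 0.5, 0.2, 0.4]          # expected weights of the chosen items
print(expected_disjunctive_reward(w_bar, gammas))
```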
4.2.3. Conjunctive Objective

The above formulation concerns the disjunctive objective, for which the user stops once she finds a good item. Similarly, we can consider the conjunctive objective, for which the user stops once she finds a bad item. In this case, we derive a better result. The Bernoulli random variable $w_t(a) \in \{0,1\}$ is the weight of item $a$ at time $t$ and satisfies Eq. (1). The learning agent observes up to the first position $k$ in the given action $A_t = (a_1^t, \dots, a_{|A_t|}^t)$ with $w_t(a_k^t) = 0$. We also consider partial rewards: if the $k$-th item is the first item with weight 0, the learning agent receives reward $1 - \gamma_k$; if all items have weight 1, the agent receives reward 1. The more items are revealed, the more reward the agent should receive. This can be used to model scenarios such as online surveys, where questions are given adaptively and the more questions are observed, the more helpful it is to the survey conductor. Based on this, we assume that $1 = \gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_K \ge 0$. At the end of time step $t$, the learning agent observes $O_t$ and $w_t(a_k^t)$, $k \le O_t$, and receives the reward

$$r_t = \begin{cases} 1 - \gamma_k, & \text{if } w_t(a_i^t) = 1 \text{ for all } i < k \text{ and } w_t(a_k^t) = 0,\\ 1, & \text{if } w_t(a_i^t) = 1 \text{ for all } i \le |A_t|.\end{cases}$$

If we define a function $f : S\times[0,1]^E \to [0,1]$ on $A = (a_1, \dots, a_{|A|}) \in S$ and $w = (w(1), \dots, w(L))$ by

$$f(A, w) = \sum_{k=1}^{|A|}(1-\gamma_k)\prod_{i=1}^{k-1} w(a_i)\,\big(1 - w(a_k)\big) + \prod_{i=1}^{|A|} w(a_i), \quad (11)$$

then it is not hard to verify that $r_t = f(A_t, w_t)$ and $\mathbb{E}[r_t] = f(A_t, \bar{w}_t) = f(A_t, \theta^{*\top} x_t)$.
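Analogously to the disjunctive case, a small sketch of the conjunctive stopping rule and of Eq. (11); again illustrative code, not the authors' implementation.

```python
def conjunctive_cascade(weights, gammas):
    """Conjunctive cascade on realized {0,1} weights: the user stops at the
    first item with weight 0 (reward 1 - gamma_k); if every item has weight 1,
    all items are observed and the reward is 1. Returns (O_t, reward)."""
    for k, w in enumerate(weights):
        if w == 0:
            return k + 1, 1.0 - gammas[k]    # partial reward at position k+1
    return len(weights), 1.0                  # full observation, reward 1

def expected_conjunctive_reward(w_bar, gammas):
    """Eq. (11): sum_k (1-gamma_k)*prod_{i<k} w(a_i)*(1-w(a_k)) + prod_i w(a_i)."""
    f, all_good_so_far = 0.0, 1.0
    for k, w in enumerate(w_bar):
        f += (1.0 - gammas[k]) * all_good_so_far * (1.0 - w)
        all_good_so_far *= w
    return f + all_good_so_far
```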
Next we state our result for the conjunctive objective, where the bound of Theorem 4.3 is improved: $p^*$ is replaced by $\alpha f^*$, where $f^* = \min_t f^*_t$ with $f^*_t = \max_{A\in S} f(A, \bar{w}_t)$. In the network routing problem, $p^*$ is related to the bad paths, which can make the probability of observing a whole path very small, while $f^*$ is related to the good paths, which makes $f^*$ usually large.

Theorem 4.6. Suppose $1 = \gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_K \ge 1 - \frac{\alpha}{4}f^*$. Then the $\alpha$-regret of our algorithm, C3-UCB, for the conjunctive objective satisfies

$$R^\alpha(n) = O\!\left(\frac{1}{\alpha f^*}\sqrt{nKd\ln\!\big(1 + C_\gamma n/(\lambda d)\big)}\left(\sqrt{\lambda} + R\sqrt{\ln\!\big[(1 + C_\gamma n/(\lambda d))^d\, n\big]}\right)\right) = O\!\left(\frac{d}{\alpha f^*}\sqrt{nK}\,\ln(C_\gamma n)\right). \quad (12)$$

Proof (sketch). First, we prove that the expected reward function satisfies the properties of monotonicity and Lipschitz continuity. Then we use the following lemma to replace $A_t$ with a prefix of $A_t$.

Lemma 4.7. Suppose $1 = \gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_K \ge 0$ and let $A = (a_1, \dots, a_{|A|})$. For time $t$ and the conjunctive objective, there exists a prefix $B$ of $A$ whose full-observation probability satisfies $p_{t,B} \ge \frac{2f^*_t}{1+\gamma_{|B|}} - 1$ and whose $\alpha$-regret $R^\alpha(t, B)$ is at least a constant fraction of $R^\alpha(t, A)$.

Therefore we can upper bound $\mathbb{E}[R^\alpha(t, A_t)\,|\,H_t]$ by a term of order $\frac{1}{\alpha f^*}\,\mathbb{E}\big[\beta_{t-1}(\delta)\sum_{k=1}^{O_t}\gamma_k\|x_{t,a_k^t}\|_{V_{t-1}^{-1}}\,\big|\,H_t\big]$. Then, similarly to the proof of Theorem 4.3 and by Lemma 4.4, we can prove the theorem.

4.3. Discussions

In the above, we have formulated the general setting of contextual combinatorial cascading bandits with position discounts. This setting generalizes several existing results. The combinatorial cascading bandits of (Kveton et al., 2015c) have a setting similar to ours in Section 4.2.3, but theirs has no contextual information and no position discounts (i.e., all $\gamma_k = 1$). So our setting in Section 4.2.3 generalizes theirs, with regret bound $O\big(\frac{d}{f^*}\sqrt{nK}\ln(n)\big)$, which carries an additional factor compared to their result. The loss arises because we use the technique of linear bandits (Abbasi-Yadkori et al., 2011) to handle the contextual information through a confidence ellipsoid for $\theta^*$. The regret bound of linear bandits has an additional factor of $O(\sqrt{d\ln(n)})$ over the standard stochastic MAB bound $O(\sqrt{nd\ln(n)})$. A similar comparison can be made between the disjunctive objective in our Section 4.2.2 and Section 2.3 of (Kveton et al., 2015c).

We can also compare to the work of (Qin et al., 2014), where there is no triggering and no position discounts, and all base arms in action $A_t$ can be observed. In their setting, our random variable $O_t$ becomes the deterministic value $|A_t|$ and the probability $p_{t,A_t}$ is 1; thus $p^* = 1$. Also, all position discounts $\gamma_k = 1$. Then our regret bound of Theorem 4.3 is of the same order as theirs. Notice that the Lipschitz bound $C$ in their setting corresponds to our constant $B$. Our algorithm also applies to the setting of (Chen et al., 2013) by allowing contextual information. Note that the probability $p^*$ in our results should then be the probability that all base arms in $S$ (in their notation) are triggered.

Our setting generalizes several existing works on cascading feedback, and actually covers more cases. Consider a network routing problem in which edges have latencies following exponential distributions. The observed latency follows a cut-off exponential distribution (we will not wait arbitrarily long for an edge to respond) with mean $\theta^{*\top}x$. Suppose we treat an edge as blocked if its latency is larger than some tolerance $\tau$. The edge has reward 1 if it is unblocked; otherwise, its reward is 0. Then the expected reward of an edge is a non-linear function of $\theta^{*\top}x$, instead of $\theta^{*\top}x$ itself as in the conjunctive case. A path has reward 1 only when all its edges have reward 1. It can then be proved that the resulting expected reward function, which cannot be represented by the conjunctive/disjunctive objectives of (Kveton et al., 2015a;c), is monotone and Lipschitz continuous. Details can be found in the Appendix.

As for the $p^*$ in the result of Theorem 4.3, in our understanding it is always tied to the reward function. In general, if the probability of observing a base arm $i$ is very small, making $1/p^*$ large, while observing $i$ is not tied to the reward function, then we can ignore arm $i$ so that it does not affect $p^*$. At the other extreme, when observing arm $i$ could indeed make a difference in the optimal action selection, one has to observe arm $i$ no matter how small its observation probability, which means that in this case a regret scaling with $1/p^*$ is reasonable. In other cases, when $p^*$ is tied to the reward function but not to the optimal reward, one may add a condition similar to the one in Lemma 4.7 to detach $p^*$ from the regret. In addition, the assumption of monotonicity (together with the offline oracle assumption) can be removed if the problem can easily find $A_t = \arg\max_{A,w}\{f(A, w) : A \in S,\ w \text{ in the confidence ellipsoid}\}$.

5. Experiments

We evaluate our algorithm, C3-UCB, in a synthetic setting and two real applications. The results are compared with CombCascade, the study most related to ours, and demonstrate the advantage of involving contextual information and position discounts. In the experiments, we set the position discounts to $\gamma_k = \gamma^{k-1}$ for some $\gamma$.

5.1. Synthetic Data

In the first experiment, we compare C3-UCB to CombCascade on synthetic problems. The problem is a contextual cascading bandit with $L = 200$ items and $K = 4$, where at each time $t$ the agent recommends $K$ items to the user. First, we randomly choose a $\tilde{\theta} \in \mathbb{R}^{d-1}$ with $\|\tilde{\theta}\|_2 = 1$ and let $\theta^* = (\tilde{\theta}/2,\, 1/2)$. Then at each time $t$, we randomly assign a vector $\tilde{x}_{t,a} \in \mathbb{R}^{d-1}$ with $\|\tilde{x}_{t,a}\|_2 = 1$ to arm $a$ and use $x_{t,a} = (\tilde{x}_{t,a},\, 1)$ as the contextual information of arm $a$. This construction guarantees that the inner product $\theta^{*\top} x_{t,a} = \frac{1}{2}(\tilde{\theta}^\top \tilde{x}_{t,a} + 1) \in [0,1]$. Next, we generate the weight of arm $a$ at time $t$ as a random sample from the Bernoulli distribution with mean $\theta^{*\top} x_{t,a}$.
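The construction above can be written in a few lines; a sketch (illustrative, with $d = 20$ as in Figure 1):

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, d = 200, 4, 20

def random_unit(dim):
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# theta* = (tilde_theta/2, 1/2), so theta*^T x_{t,a} = (tilde_theta^T tilde_x + 1)/2 in [0, 1].
tilde_theta = random_unit(d - 1)
theta_star = np.concatenate([tilde_theta / 2, [0.5]])

# One round of contexts and Bernoulli weights.
tilde_x = np.stack([random_unit(d - 1) for _ in range(L)])   # L x (d-1), unit-norm rows
X_ctx = np.hstack([tilde_x, np.ones((L, 1))])                # x_{t,a} = (tilde_x_{t,a}, 1)
means = X_ctx @ theta_star                                    # expected weights in [0, 1]
weights = rng.binomial(1, means)                              # realized {0,1} weights
```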
We conduct experiments in four settings. In each setting, the learning agent chooses a set of $K$ items out of the $L$ ground items. The first two settings use the disjunctive objective, where the learning agent observes a prefix of the chosen $K$ items up to the first one with weight 1; the last two use the conjunctive objective, where the learning agent observes from the first item up to the first one with weight 0. Notice that with $\gamma \le 1$, the oracle selects the set of $K$ items with the highest UCBs, in decreasing order of UCB. The regrets are shown in Fig. 1; our algorithm outperforms the CombCascade algorithm because the latter does not make use of the contextual information.

Figure 1. Synthetic data set, L = 200, K = 4, d = 20. (a) Disjunctive, γ = 1; (b) Disjunctive, γ = 0.9; (c) Conjunctive, γ = 1; (d) Conjunctive, γ = 0.9.

5.2. Movie Recommendation

In this experiment, we evaluate C3-UCB on movie recommendation with the MovieLens dataset (Lam & Herlocker, 2015) of 2015. The learning problem is formulated as follows. There is a large sparse matrix $A \in \{0,1\}^{N_1\times N_2}$, where $A(i,j) = 1$ denotes that user $i$ has watched movie $j$. We split $A$ as $H + F$ by assigning each 1-entry of $A$ to $H$ or $F$ according to a Bernoulli distribution $\mathrm{Ber}(p)$ for some fixed $p$. We regard $H$ as known history information about what users have watched, and regard $F$ as the future criterion. We use $H$ to derive feature vectors of both users and movies by the SVD decomposition $H = USM^\top$, where $U = (u_1; \dots; u_{N_1})$ and $M = (m_1; \dots; m_{N_2})$. At every time $t$, we randomly choose a user $I_t \in [N_1]$. Then, in the same spirit as (Li et al., 2010), we use $x_{t,j} = u_{I_t} m_j^\top$, the outer product of $u_{I_t}$ and $m_j$, as the contextual information of each movie $j$. The real weight of movie $j$ at time $t$, $w_t(j)$, is $F(I_t, j)$.
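A minimal sketch of this feature construction (hypothetical variable names; the SVD is truncated to a small rank $r$, and splitting the singular values as $U\sqrt{S}$ and $M\sqrt{S}$ is one simple choice, not necessarily the exact one used in the experiments):

```python
import numpy as np

def movielens_features(H: np.ndarray, r: int):
    """Rank-r SVD of the history matrix H ~ U S M^T; returns per-user and
    per-movie feature vectors (rows of U*sqrt(S) and M*sqrt(S), one simple choice)."""
    U, s, Mt = np.linalg.svd(H, full_matrices=False)
    U_r, s_r, M_r = U[:, :r], s[:r], Mt[:r, :].T
    return U_r * np.sqrt(s_r), M_r * np.sqrt(s_r)     # user features, movie features

def movie_context(u_i: np.ndarray, m_j: np.ndarray) -> np.ndarray:
    """x_{t,j}: flattened outer product of the user and movie features."""
    return np.outer(u_i, m_j).ravel()

# Toy usage on a random 0/1 history matrix.
H = (np.random.default_rng(1).random((50, 80)) < 0.1).astype(float)
user_feats, movie_feats = movielens_features(H, r=5)
x_tj = movie_context(user_feats[3], movie_feats[10])
```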
For this experiment, we randomly choose $L = 400$ movies and recommend $K = 4$ movies at each time step. We experiment with both $\gamma = 1$ (no position discount) and $\gamma = 0.9$, and compare our algorithm with CombCascade. The results are shown in Fig. 2. The rewards of our algorithm are 3.52 and 3.736 times those of CombCascade (for $\gamma = 1$ and $\gamma = 0.9$, respectively), which demonstrates the advantage of involving contextual information in real applications.

Figure 2. Rewards on MovieLens, L = 400, K = 4, d = 400. (a) γ = 1; (b) γ = 0.9.

5.3. Network Routing Problem

In this experiment, we evaluate C3-UCB on a network routing problem with the RocketFuel dataset (Spring et al., 2004). The ground set $E$ is the set of links in the network. Before learning, the environment randomly chooses a $d$-dimensional vector $\theta^* \in [0,1]^d$. At each time $t$, a pair of source and destination nodes is randomly chosen, and the feasible action set $S_t$ at time $t$ contains all simple paths (paths without cycles) between the source and the destination. Every edge $a$ in $S_t$ is assigned a random $d$-dimensional contextual vector $x_{t,a}$. Here both $\theta^*$ and $x$ have been processed as in Section 5.1 so that $\theta^{*\top}x \in [0,1]$. The weight of each edge $a$ is a sample from the Bernoulli distribution with mean $\theta^{*\top}x_{t,a}$. The learning agent then recommends a feasible path $A = (a_1, \dots, a_{|A|})$ between the source and destination nodes to maximize the expected reward under the conjunctive objective. We experiment with different position discounts. The regrets are shown in Fig. 3 (a), (b).

Figure 3. Network routing, d = 5. (a) Network 1221, γ = 1; (b) Network 1755, γ = 0.9.

6. Conclusions

In this paper, we propose contextual combinatorial cascading bandits with position discounts, where each action is an ordered list and only a prefix of the action is observed each time. We propose the C3-UCB algorithm and prove cumulative regret bounds for general reward functions and for two special reward functions. The experiments demonstrate the advantage of involving contextual information and position discounts. In the future, we would like to investigate lower bounds on the regret and cascading on general graphs.

Acknowledgment

The work was supported in part by the Research Grants Council of the Hong Kong S.A.R. (Project No. CUHK419413), Microsoft Research Asia Fund (Project No. FY15-RESTHEME-025), and the National Natural Science Foundation of China (Grant No. 61433014).

References

Abbasi-Yadkori, Yasin, Pál, Dávid, and Szepesvári, Csaba. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.

Agrawal, Rajeev, Teneketzis, Demosthenis, and Anantharam, Venkatachalam. Asymptotically efficient adaptive allocation schemes for controlled Markov chains: Finite parameter space. IEEE Transactions on Automatic Control, 34(12):1249–1259, 1989.

Bartók, Gábor, Zolghadr, Navid, and Szepesvári, Csaba. An adaptive algorithm for finite stochastic partial monitoring. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Chen, Shouyuan, Lin, Tian, King, Irwin, Lyu, Michael R., and Chen, Wei. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 379–387, 2014.

Chen, Wei, Wang, Yajun, and Yuan, Yang. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, pp. 151–159, 2013.

Chen, Wei, Wang, Yajun, Yuan, Yang, and Wang, Qinshi. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(50):1–33, 2016.

Combes, Richard, Magureanu, Stefan, Proutiere, Alexandre, and Laroche, Cyrille. Learning to rank: Regret lower bounds and efficient algorithms. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 231–244. ACM, 2015a.

Combes, Richard, Shahi, Mohammad Sadegh Talebi Mazraeh, Proutiere, Alexandre, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pp. 2107–2115, 2015b.

Gai, Yi, Krishnamachari, Bhaskar, and Jain, Rahul. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5):1466–1478, 2012.

Gopalan, Aditya, Mannor, Shie, and Mansour, Yishay. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning, pp. 100–108, 2014.

Kveton, Branislav, Wen, Zheng, Ashkan, Azin, Eydgahi, Hoda, and Eriksson, Brian. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014.

Kveton, Branislav, Szepesvári, Csaba, Wen, Zheng, and Ashkan, Azin. Cascading bandits: Learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015a.

Kveton, Branislav, Wen, Zheng, Ashkan, Azin, and Szepesvári, Csaba. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015b.

Kveton, Branislav, Wen, Zheng, Ashkan, Azin, and Szepesvári, Csaba. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems, 2015c.

Lam, Shyong and Herlocker, Jon. MovieLens 20M dataset. 2015. URL http://grouplens.org/datasets/movielens/20m/.

Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670. ACM, 2010.
Lin, Tian, Abrahao, Bruno, Kleinberg, Robert, Lui, John, and Chen, Wei. Combinatorial partial monitoring game with linear feedback and its applications. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 901–909, 2014.

Lin, Tian, Li, Jian, and Chen, Wei. Stochastic online greedy learning with semi-bandit feedbacks. In Advances in Neural Information Processing Systems, pp. 352–360, 2015.

Qin, Lijing, Chen, Shouyuan, and Zhu, Xiaoyan. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining (SDM), 2014.

Spring, Neil, Mahajan, Ratul, Wetherall, David, and Anderson, Thomas. Measuring ISP topologies with Rocketfuel. IEEE/ACM Transactions on Networking, 12(1):2–16, 2004.