# Fatigue-Aware Bandits for Dependent Click Models

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Junyu Cao,1 Wei Sun,2 Zuo-Jun (Max) Shen,1 Markus Ettl2
1University of California, Berkeley, California 94720
2IBM Research, Yorktown Heights, New York 10591

Correspondence to: Junyu Cao (jycao@berkeley.edu)
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

As recommender systems send a massive amount of content to keep users engaged, users may experience fatigue, which is caused by 1) overexposure to irrelevant content and 2) boredom from seeing too many similar recommendations. To address this problem, we consider an online learning setting in which a platform learns a policy to recommend content that takes user fatigue into account. We propose an extension of the Dependent Click Model (DCM) to describe users' behavior. We stipulate that for each piece of content, its attractiveness to a user depends on its intrinsic relevance and a discount factor which measures how many similar pieces of content have already been shown. Users view the recommended content sequentially and click on the ones that they find attractive. Users may leave the platform at any time, and the probability of exiting is higher when they do not like the content. Based on the user's feedback, the platform learns the relevance of the underlying content as well as the discounting effect due to content fatigue. We refer to this learning task as the fatigue-aware DCM bandit problem. We consider two learning scenarios depending on whether the discounting effect is known. For each scenario, we propose a learning algorithm which simultaneously explores and exploits, and characterize its regret bound.

## 1 Introduction

Recommender systems increasingly influence how users discover content. Some well-known examples include Facebook's News Feed, movie recommendations on Netflix, and Spotify's Discover Weekly playlists. To compete for users' attention and time, platforms push out a vast amount of content when users browse their websites or mobile apps. While a user sifts through the proliferation of content, her experience could be adversely influenced by 1) marketing fatigue, which occurs due to an overexposure to irrelevant content, and 2) content fatigue, which refers to the boredom from seeing too many similar recommendations. Motivated by this phenomenon, we consider an online learning setting with a fatigue-aware platform, i.e., one that learns a policy to select a sequence of recommendations for a user while being conscious that both the choice of content and its placement could influence the experience.

We propose a variant of the Dependent Click Model (DCM), in which a user can click on multiple items (we use "content" and "items" interchangeably in this work) that she finds attractive. For each piece of content, its attractiveness depends on its intrinsic relevance and a discount factor that measures how many similar recommendations have already been shown to this user. The user may leave the platform at any time, and the probability of exiting is higher when she does not like the content, reflecting the presence of marketing fatigue. Meanwhile, showing too much relevant but similar content could also reduce its attractiveness due to content fatigue and drive bored users to abandon the platform.
The platform, whose objective is to maximize the total number of clicks, needs to learn users' latent preferences from their feedback in order to determine the optimal sequence of recommendations. We refer to this online learning task as fatigue-aware DCM bandits.

The contribution of our work is fourfold. Firstly, we propose a novel model which captures the effects of marketing fatigue and content fatigue on users' behavior. In the original DCM (Guo, Liu, and Wang 2009; Katariya et al. 2016b), users can only exit the platform upon clicking on a recommendation. In reality, users may leave at any time, and they are more likely to exit when they are not engaged with the recommendations. We extend the DCM to capture marketing fatigue by incorporating exiting behavior after both non-clicks and clicks. To reflect content fatigue, we incorporate a discount factor into DCM that penalizes repetitive content of similar types and promotes diversity in the recommendations. Secondly, even in the offline setting where all the information is known, the optimization problem of selecting a sequence of recommendations, which is central to the learning task, is combinatorial in nature without a straightforward efficient algorithm. We propose a polynomial-time algorithm for this problem. Thirdly, for the online setting, we first consider a scenario where the discount factor is known (e.g., when it can be estimated from historical data). We propose a learning algorithm and quantify its regret bound. Lastly, we consider a more general scenario where both the discount factor and the intrinsic relevance of items need to be learned. This is a significantly more challenging setting to analyze because 1) we only observe partial feedback on an item's attractiveness, which depends on both the discount factor and the relevance, and 2) the discount factor depends on the content that we recommend. For this scenario, we also establish a regret bound for the algorithm that we propose.

Our paper is organized as follows. In Section 2, we review related work. In Section 3, we introduce the fatigue-aware DCM model and provide an offline algorithm to find the optimal sequence when all parameters are known. In Sections 4 and 5, we introduce our learning problems for two scenarios, depending on whether the discount factor is known, and analyze the regret of the proposed algorithms. In Section 6, we perform several numerical experiments to evaluate the performance of our algorithms.

## 2 Related Literature

The Cascade Model (Chuklin, Markov, and Rijke 2015; Craswell et al. 2008) is a popular model that describes a user's single-click behavior. Several variants of this model have been considered in the bandit literature (Combes et al. 2015; Kveton et al. 2015a; 2015b). One of its limitations is that the positions of the items do not influence the reward, since a user is expected to browse the list of recommendations until she clicks on the content that she likes. DCM (Guo, Liu, and Wang 2009) generalizes the Cascade Model to allow multiple clicks by incorporating a parameter indicating the probability that a user resumes browsing after clicking on a recommendation. On the other hand, if a user does not click, she views the next item with certainty unless the sequence runs out. Katariya et al. (2016b) analyzed one type of DCM bandit setting. However, the reward in Katariya et al. (2016b) does not exactly correspond to the number of clicks.
In our DCM bandit setting, the reward exactly corresponds to the number of users' clicks, which is a ubiquitous measure of a recommender's effectiveness. In addition, as our model allows users to exit at any time, the positions of the items matter, since users may exit the platform early if they do not like what has been shown initially. An alternative multi-click model is the Position-Based Model (PBM) (Richardson, Dominowska, and Ragno 2007). In PBM, the click probability of an item is the product of a static position-based examination probability and the item's intrinsic relevance score. Several papers have investigated PBM bandits: in Lagrée, Vernade, and Cappé (2016), the position-based examination probability is assumed known, whereas this quantity has to be learned together with item attractiveness in Komiyama, Honda, and Takeda (2017). In particular, Komiyama, Honda, and Takeda (2017) derive a parameter-dependent regret bound, which assumes that the gap between parameters is strictly positive. While our model also includes a position-dependent parameter which discounts the attractiveness of an item, this quantity does not merely depend on the position: the discount factor also depends on the order and the content types of the recommendations. For our setting, we derive a parameter-independent regret bound.

Our work also bears some resemblance to the stochastic rank-1 bandits (Katariya et al. 2016a; 2017), where a row arm and a column arm are pulled simultaneously in each round, and the reward corresponds to the product of the two values. In our setting, item attractiveness is a product of the discount factor and the item's intrinsic relevance. However, in rank-1 bandits, one is only interested in determining the particular pair that yields the highest reward, whereas we need to learn and rank all items, which is significantly harder.

An important topic in recommender systems is fatigue control (Kapoor et al. 2015; Ma, Liu, and Shen 2016). One way to manage content fatigue is to introduce diversity in recommendations (Radlinski, Kleinberg, and Joachims 2008; Ziegler et al. 2005). The linear submodular function has been used to tackle the diversification problem in the bandit setting (Yu, Fang, and Tao 2016; Yue and Guestrin 2011). Warlop, Lazaric, and Mary (2018) propose a reinforcement learning framework that uses a linear reward structure capturing the effect of recent recommendations on a user's preferences. However, instead of recommending specific items as in our work, Warlop, Lazaric, and Mary (2018) restrict attention to recommending genres or content types to a user. While Cao and Sun (2019) and Wang and Tulabandhula (2019) have also studied marketing fatigue in an online setting, their settings only capture a single click, whereas our model allows multiple clicks and incorporates the effect of content fatigue in addition to marketing fatigue.

## 3 Problem Formulation

In this section, we formally introduce the fatigue-aware DCM model and present an algorithm to determine the optimal sequence of recommendations in an offline setting where all underlying parameters are known.

### 3.1 Setting

Suppose there are $N$ available items (e.g., songs, videos, articles) for the platform to choose from, denoted as $[N]$. Each item belongs to one of $K$ types (e.g., genres, categories, topics). Let $S = (S_1, \ldots, S_m)$ denote a sequence of recommended items. For each item $i$, $u_i$ denotes its intrinsic relevance to a user, and $C(i) = j$ if item $i$ belongs to type $j$.
We define $h_i(S)$ as the number of items of type $C(i)$ shown before item $i$. Similar to Warlop, Lazaric, and Mary (2018), we model the impact of showing too much similar content as a discount factor on items' attractiveness. More precisely, define the attractiveness of item $i$ as $z_i(S) := f(h_i(S))\,u_i$, where $f(\cdot)$ represents the discount factor on users' preferences due to content fatigue, which depends on how many items of the same type have been shown. $f(\cdot)$ is assumed to be a decreasing function. Without loss of generality, we assume $f(0) = 1$.

Given a list of recommendations $S = (S_1, \ldots, S_m)$, a user examines it sequentially, starting from $S_1$. The user clicks on item $i$ when she finds it attractive. The platform receives a reward for every click. DCM models the multi-click behavior by incorporating a parameter $g$, which is the probability that the user will see more recommendations after a click. In case of no click (a skip), unlike in the original DCM (Guo, Liu, and Wang 2009; Katariya et al. 2016b) where a user examines the next item with probability 1, we use $q$ to denote the probability that the user will resume browsing. As users are more likely to stay on the platform and continue browsing after consuming a piece of good content (we provide some evidence of this phenomenon based on estimates from a real dataset in Section 6), we let $q \le g$. The interaction between a user and recommended content is illustrated in Fig 1.

Figure 1: User behavior in a fatigue-aware DCM.

The probability of clicking on item $i$ from a sequence $S$, denoted as $P_i(S)$, depends on its position in the list, its own content, as well as the content of the items shown previously. Formally,

$$P_i(S) = \begin{cases} u_i, & \text{if } i = S_1,\\ \prod_{k=1}^{l-1}\big(g\,z_{I(k)}(S) + q\,(1 - z_{I(k)}(S))\big)\, z_i(S), & \text{if } i = S_l,\ l \ge 2, \end{cases}$$

where $I(\cdot)$ denotes the index function, i.e., $I(k) = i$ if and only if $S_k = \{i\}$. When $i$ is the first item of the sequence, the probability of clicking is simply $u_i$, its intrinsic relevance. For the remainder of the sequence, the probability of clicking on $i$ as the $l$th item is the joint probability that 1) the user finds item $i$ attractive after taking content fatigue into account, $z_i(S)$, and 2) she remains on the platform after examining the first $l-1$ items, $\prod_{k=1}^{l-1}\big(g\,z_{I(k)}(S) + q\,(1 - z_{I(k)}(S))\big)$, which accounts for both clicking and skipping behavior.

### 3.2 Platform's optimization problem

The platform's objective is to maximize the total number of clicks by optimizing the sequence of recommended items. We use $R(S)$ to denote the platform's reward, i.e., $\mathbb{E}[R(S)] = \sum_{i \in [N]} P_i(S)$. Thus, the platform's optimization problem can be written as

$$\max_S \ \mathbb{E}[R(S)] \quad \text{s.t.} \quad S_i \cap S_j = \emptyset, \ \forall i \ne j. \tag{1}$$

The constraint requires that there be no duplicated items in the recommendations. We define $S^*$ as the optimal sequence of content, and $R^*$ as the corresponding maximum reward. While this problem is combinatorial in nature, Theorem 3.1 shows that it is polynomial-time solvable and that the optimal sequence $S^*$ can be identified by Algorithm 1.

Algorithm 1: Determine the optimal sequence $S^*$ for the offline combinatorial problem
1. For each type $i = 1, \ldots, K$: sort $u_j$ for $C(j) = i$ in decreasing order $o(\cdot)$, where $o(j) = r$ if item $j$ is at the $(r+1)$th position; set $\lambda_j = u_j f(o(j))$.
2. Sort $\lambda_j$ for all $j = 1, \ldots, N$ in decreasing order $o'(\cdot)$, where $o'(r) = j$ if item $j$ is at the $r$th position.
3. Set $S^* = (o'(1), o'(2), \ldots, o'(N))$.
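To make Algorithm 1 and the reward model above concrete, here is a minimal Python sketch. The function names (`optimal_sequence`, `expected_reward`), the toy inputs, and the example discount function are our own illustrations and not part of the paper; the logic follows Algorithm 1 and the click probabilities $P_i(S)$ defined above.

```python
from collections import defaultdict

def optimal_sequence(u, types, f):
    """Algorithm 1 sketch: rank items within each type by relevance,
    discount the r-th item of a type by f(r), then sort all items globally."""
    by_type = defaultdict(list)
    for j, c in enumerate(types):
        by_type[c].append(j)
    lam = {}
    for items in by_type.values():
        items.sort(key=lambda j: u[j], reverse=True)   # rank within the type
        for r, j in enumerate(items):                  # r same-type items precede j
            lam[j] = u[j] * f(r)                       # discounted attractiveness lambda_j
    # rank all items by decreasing discounted attractiveness
    return sorted(lam, key=lambda j: lam[j], reverse=True)

def expected_reward(seq, u, types, f, g, q):
    """Expected number of clicks E[R(S)] under the fatigue-aware DCM."""
    reward, stay = 0.0, 1.0
    shown = defaultdict(int)                           # same-type items shown so far
    for j in seq:
        z = f(shown[types[j]]) * u[j]                  # attractiveness z_j(S)
        reward += stay * z                             # P_j(S) = (prob. of reaching j) * z_j(S)
        stay *= g * z + q * (1.0 - z)                  # continue w.p. g after a click, q after a skip
        shown[types[j]] += 1
    return reward

# toy usage with made-up numbers
u = [0.40, 0.30, 0.45, 0.20]
types = [0, 0, 1, 1]
f = lambda r: 0.9 ** r                                 # a decreasing discount with f(0) = 1
S = optimal_sequence(u, types, f)
print(S, expected_reward(S, u, types, f, g=0.85, q=0.7))
```

The two sorting passes (within each type, then globally) are what make the combinatorial problem tractable.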
Theorem 3.1 Algorithm 1 finds the optimal sequence $S^*$.

Due to the space limit, we present the proof outline in the main paper; the detailed proofs are included in the Supplementary Material.

Proof outline: We prove the result by contradiction. If the optimal sequence $S^*$ is not ordered as in Algorithm 1, then there exists a neighboring pair $I(i)$ and $I(i+1)$ such that either 1) $I(i)$ and $I(i+1)$ belong to the same type and $u_{I(i)} < u_{I(i+1)}$, or 2) $I(i)$ and $I(i+1)$ belong to different types and $f(h_{I(i)}(S^*))\,u_{I(i)} < f(h_{I(i+1)}(S^*))\,u_{I(i+1)}$. We then show that swapping items $I(i)$ and $I(i+1)$ increases the expected reward, which contradicts the optimality of $S^*$.

For each category, Algorithm 1 first ranks items based on their intrinsic relevance. If an item has the $(r+1)$th largest relevance score within a category, we multiply its score by $f(r)$ to compute its attractiveness. Next, the algorithm ranks items across all categories in decreasing order of attractiveness. The complexity of Algorithm 1 is $O(N \log N)$. We also observe that the optimal sequence is completely determined by the items' intrinsic relevance $u$ and the discount factor $f$, and is independent of the resuming probabilities $g$ and $q$. In other words, the platform only needs to learn $u$ and $f$ for the online learning task. Nevertheless, the resuming probabilities influence the learning speed, as we will investigate in the following sections.

## 4 Learning with Known Discount Factor f

In the previous section, we assumed all the parameters are known to the platform. It is natural to ask what the platform should do in the absence of such knowledge. Beginning with this section, we present learning algorithms for the fatigue-aware DCM bandit problem and characterize the corresponding regret bounds. We first investigate a scenario where the discounting effect caused by content fatigue is known (e.g., it could be estimated from historical data), and the platform needs to learn the items' intrinsic relevance $u$ to determine the optimal sequence. We want to emphasize the differences between this setting and other learning-to-rank problems (Combes et al. 2015; Kveton et al. 2015a; 2015b). Firstly, in our setting, only partial feedback on a list is observed, as the user might leave before viewing all the recommendations. Secondly, as the order of the content influences its attractiveness, the click rate of an item is non-stationary. Thirdly, in many previous learning-to-rank problems, the position-biased examination probability is usually assumed to be fixed for each position; in our setting, this examination probability depends on the content of the previously shown recommendations.

Assume users arrive at times $t = 1, \ldots, T$. For each user $t$, the platform determines a sequence of items $\tilde S_t$. We evaluate a learning algorithm by its expected cumulative reward under policy $\pi$, or equivalently, its expected cumulative regret, defined as

$$\text{Regret}_\pi(T; u) = \mathbb{E}_\pi\Big[\sum_{t=1}^{T} \big(R(S^*, u) - R(\tilde S_t, u)\big)\Big].$$

### 4.1 Algorithm FA-DCM-P

Our strategy, which we call Fatigue-Aware DCM with Partial information (FA-DCM-P), is a UCB-based policy. We first define an unbiased estimator for $u$ and its upper confidence bound. We define the pair $(i, j)$ as the event that item $j$ is the $(i+1)$th recommendation from category $C(j)$; that is, $i$ recommendations from the same category $C(j)$ have been shown before item $j$. We define $\mathcal T_{ij}$ as the epoch for the event $(i, j)$, i.e., $t \in \mathcal T_{ij}$ if and only if user $t$ observes the event $(i, j)$. Let $T_{ij}(t)$ denote the number of times that the event $(i, j)$ occurs by time $t$, i.e., $T_{ij}(t) = |\mathcal T_{ij}(t)|$. Define $T_j(t) = \sum_i T_{ij}(t)$.
Let $z^t_{ij}$ be the binary feedback from user $t$ observing the event $(i, j)$, where $z^t_{ij} = 1$ indicates that the user clicks, and $z^t_{ij} = 0$ means no click (a skip). Define $\hat u_{j,t}$ as the estimator for $u_j$, and $u^{UCB}_{j,t}$ as its upper confidence bound, i.e.,

$$\hat u_{j,t} = \frac{1}{T_j(t)} \sum_{i} \sum_{\tau \in \mathcal T_{ij}(t)} \frac{z^{\tau}_{ij}}{f(i)}, \qquad u^{UCB}_{j,t} = \hat u_{j,t} + \sqrt{\frac{2 \log t}{T_j(t)}}. \tag{2}$$

In Algorithm FA-DCM-P, for a user arriving at time $t$, we use $u^{UCB}_{t-1}$ to compute the current optimal sequence of recommendations. Define $w_t$ as the last item examined by user $t$, which occurs when one of the following types of feedback is observed: 1) the user exits after clicking on $w_t$; 2) the user views $w_t$ without clicking and exits; 3) the sequence runs out. Once the last examined item $w_t$ is known, we update $T_j(t)$ and the upper confidence bounds $u^{UCB}_t$ based on the feedback.

Algorithm 2 (FA-DCM-P): An algorithm for the fatigue-aware DCM bandit when $f$ is known
1. Initialization: set $u^{UCB}_{i,0} = 1$ and $T_i(0) = 0$ for all $i \in [N]$; $t = 1$.
2. While $t < T$:
   a. Compute $\tilde S_t = \arg\max_S \mathbb{E}[R(S, u^{UCB}_{t-1})]$ according to Theorem 3.1;
   b. Offer sequence $\tilde S_t$ and observe the user's feedback;
   c. Update $T_i(t)$; update $u^{UCB}_t$ according to Equation (2); $t = t + 1$.
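As a concrete illustration of the FA-DCM-P loop, the following Python sketch reuses `optimal_sequence` from Section 3. Here `offer_and_observe` is a hypothetical callback standing in for the user interaction (the paper does not specify such an interface), and the confidence radius is the generic $\sqrt{2\log t / T_j(t)}$ term; the exact constant in Equation (2) may differ.

```python
import math
import numpy as np

def fa_dcm_p(T, N, types, f, offer_and_observe):
    """FA-DCM-P sketch (discount f known).

    offer_and_observe(seq) is assumed to return a list of (item, clicked)
    pairs for the prefix of seq that the user actually examined."""
    scaled_clicks = np.zeros(N)      # running sum of z / f(i) for each item (Equation (2))
    T_j = np.zeros(N)                # number of times item j was examined
    ucb = np.ones(N)                 # u^UCB, initialized to 1
    for t in range(1, T):
        seq = optimal_sequence(ucb, types, f)       # Theorem 3.1 / Algorithm 1
        shown = {}
        for j, clicked in offer_and_observe(seq):   # examined prefix only (partial feedback)
            i = shown.get(types[j], 0)              # same-type items shown before j
            scaled_clicks[j] += clicked / f(i)      # unbiased sample of u_j
            T_j[j] += 1
            shown[types[j]] = i + 1
        for j in range(N):                          # update the upper confidence bounds
            if T_j[j] > 0:
                u_hat = scaled_clicks[j] / T_j[j]
                ucb[j] = min(1.0, u_hat + math.sqrt(2 * math.log(t + 1) / T_j[j]))
    return ucb
```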
### 4.2 Regret analysis for Algorithm FA-DCM-P

In this section, we quantify the regret bound for our proposed algorithm. We first derive the following lemmas, which will be used in the regret analysis.

Lemma 4.1 For any $t$ and $j \in [N]$, with high probability, $u^{UCB}_{j,t} - 2\sqrt{\tfrac{2\log t}{T_j(t)}} < u_j < u^{UCB}_{j,t}$.

Lemma 4.2 Assume $S$ is the optimal sequence of items. Under the condition that $0 \le u \le u'$, we have $\mathbb{E}[R(S, u')] \ge \mathbb{E}[R(S, u)]$.

Proof of Lemma 4.2. We prove this result by induction. For the optimal sequence $S = (S_1, S_2, \ldots, S_N)$, since $u'_{I(N)} \ge u_{I(N)}$, we have $\mathbb{E}[R(S_N, u')] \ge \mathbb{E}[R(S_N, u)]$. Assume the inequality $\mathbb{E}[R((S_{k+1}, \ldots, S_N), u')] \ge \mathbb{E}[R((S_{k+1}, \ldots, S_N), u)]$ holds; we now prove that $\mathbb{E}[R((S_k, \ldots, S_N), u')] \ge \mathbb{E}[R((S_k, \ldots, S_N), u)]$. Since

$$\mathbb{E}[R((S_k, \ldots, S_N), u')] = u'_{I(k)} + \big(g\,u'_{I(k)} + q(1 - u'_{I(k)})\big)\,\mathbb{E}[R((S_{k+1}, \ldots, S_N), u')] = u'_{I(k)} + \big((g - q)\,u'_{I(k)} + q\big)\,\mathbb{E}[R((S_{k+1}, \ldots, S_N), u')] \ge u_{I(k)} + \big((g - q)\,u_{I(k)} + q\big)\,\mathbb{E}[R((S_{k+1}, \ldots, S_N), u)] = \mathbb{E}[R((S_k, \ldots, S_N), u)],$$

we reach the desired result.

Theorem 4.3 The regret of Algorithm FA-DCM-P over horizon $T$ is bounded by $\text{Regret}_\pi(T) \le C \big(\tfrac{1}{1-g}\big)^{3/2} \sqrt{N T \log T}$ for some constant $C$.

Proof outline: Define the event $A_{i,t} = \{u^{UCB}_{i,t} - 2\sqrt{\tfrac{2\log t}{T_i(t)}} < u_i < u^{UCB}_{i,t}\}$ and $E_t = \cap_{i=1}^{N} A_{i,t}$. On the high-probability event $E_t$, Lemma 4.2 gives

$$\mathbb{E}[R(\tilde S_t, u)] \le \mathbb{E}[R(S^*, u)] \le \mathbb{E}[R(S^*, u^{UCB})] \le \mathbb{E}[R(\tilde S_t, u^{UCB})],$$

which implies $\mathbb{E}[R(S^*, u)] - \mathbb{E}[R(\tilde S_t, u)] \le \mathbb{E}[R(\tilde S_t, u^{UCB})] - \mathbb{E}[R(\tilde S_t, u)]$. Let $I_t(\cdot)$ denote the index function for user $t$ and let $l_t(i)$ denote the position of item $i$ in $\tilde S_t$. We then show that this difference on the event $E_t$ can be bounded from above by

$$\mathbb{E}_\pi\Big[\sum_{j=1}^{N} \kappa_{j,t}\,\big(u^{UCB}_{j,t} - u_j\big)\,\mathbb 1(E_t)\Big], \quad \text{where } \kappa_{j,t} = \prod_{k=1}^{l_t(j)-1}\big((g-q)\,f(h_{I_t(k)}(\tilde S_t))\,u_{I_t(k)} + q\big)$$

denotes the probability that user $t$ examines item $j$. Summing over $t$, the regret on $E_t$ can be further bounded by $\tfrac{1}{1-g}\sqrt{2 N \log T \sum_{i \in [N]} \mathbb{E}_\pi[T_i(T)]}$; since $\sum_{i \in [N]} \mathbb{E}_\pi[T_i(T)] \le \tfrac{T}{1-g}$, this is at most of order $\big(\tfrac{1}{1-g}\big)^{3/2}\sqrt{N T \log T}$. To complete the proof, we show that on the low-probability event $E^c_t$, the regret can also be bounded.

The following results hold under two special cases of the DCM setting.

Corollary 4.4 When $g = q = 1$, the regret can be bounded by $\text{Regret}_\pi(T) \le C N \sqrt{T \log T}$.

Corollary 4.5 When at most $L$ items can be recommended to a single user, the regret can be bounded by $\text{Regret}_\pi(T) \le C L^{3/2} \sqrt{N T \log T}$.

Corollary 4.4 characterizes the regret bound in a special setting where users never exit the platform and browse all the recommendations. In Theorem 4.3, we do not limit the length of the sequence, as online recommenders continuously send content to users to keep them on the platform. If at most $L$ items can be recommended, the corresponding regret is shown in Corollary 4.5. The detailed proofs can be found in the Supplementary Material.

## 5 Learning with Unknown Discount Factor f

In this section, we consider a more general scenario where the discount factor $f$ is also unknown and needs to be learned, in conjunction with learning the items' intrinsic relevance scores $u$. Before delving into our approach, we first discuss a few alternative approaches and their trade-offs. A straightforward approach is to treat each combination of $f(i)$ and $u_j$ as an arm whose expected reward is $z_{ij} := f(i)\,u_j$. However, it is not clear how to solve the offline combinatorial problem under this formulation. An alternative approach is to view the problem as a generalized linear bandit, which can be solved by GLM-UCB (Li, Lu, and Zhou 2017): taking logarithms, we have $\log(z_{ij}) = \log(f(i)) + \log(u_j)$, which is a linear model. However, this approach is problematic for the following reasons: a) when $u_j$ and $f(i)$ go to 0, $\log(u_j)$ and $\log(f(i))$ go to negative infinity, which implies that the parameter space is unbounded; b) in each step, GLM-UCB needs to compute the maximum likelihood estimators, which is computationally costly. We now present our approach, a UCB-based algorithm for this learning task which circumvents the limitations associated with the alternative approaches discussed above.

### 5.1 Algorithm FA-DCM

Throughout this section, we define $M$ as a threshold such that after showing $M$ items from the same category, the discounting effect on item attractiveness due to content fatigue stays the same; that is, $f(r) = f(r + 1)$ for $r \ge M$. Note that the maximum possible value of $M$ is $N$. Following the notation introduced in Section 4.1, we define the unbiased estimators for $f(i)$ and $u_j$ as

$$\hat f^t_i = \frac{1}{\hat T_i(t)} \sum_{j} \sum_{r \in \mathcal T_{ij}(t)} \frac{z^{r}_{ij}}{\hat u_j}, \qquad \hat u_{j,t} = \frac{\sum_{r \in \mathcal T_{0j}(t)} z^{r}_{0j}}{T_{0j}(t)}, \tag{3}$$

where $\hat T_i(t) = \sum_j T_{ij}(t)$, and the subscript in $T_{0j}$ and $z^r_{0j}$ refers to the event $(0, j)$, that is, item $j$ is the first item from category $C(j)$ to be shown. The corresponding upper confidence bounds at time $t$ are defined as

$$f^{UCB}_t(i) = \hat f^t_i + \Delta^t_i, \qquad u^{UCB}_{j,t} = \hat u_{j,t} + \sqrt{\frac{\log t}{T_{0j}(t)}}, \tag{4}$$

where $\Delta^t_i$ consists of two parts: the first is the exploration term related to $u_j$ for all $j \in [N]$, and the second is the exploration term with respect to $f(i)$.

We propose a learning algorithm, which we refer to as FA-DCM, for the scenario where both the function $f$ and the vector $u$ need to be learned. Define $\theta^{UCB} = (f^{UCB}, u^{UCB})$ and $\theta = (f, u)$. At time $t$, we recommend a sequence $\tilde S_t$ based on $\theta^{UCB}_{t-1}$. Suppose there exists an item $i$ for which we have not collected sufficient feedback, i.e., $T_i(t) < \alpha T^{2/3}$, where $\alpha$ is a tuning parameter. This item is then inserted at the beginning of the sequence $\tilde S_t$ to guarantee that it will be examined. This procedure ensures that we obtain a reliable estimate of $u_i$, since the discount factor is $f(0) = 1$ for the first item. Once the feedback of the user at time $t$ is observed, we update $f^{UCB}_t$ and $u^{UCB}_t$. Finally, it may be possible that some estimated values of $f^{UCB}_t(i)$ violate the decreasing property of $f$, i.e., $f^{UCB}_t(i) > f^{UCB}_t(i-1)$; in this case we correct them to enforce the property.

Algorithm 3 (FA-DCM): An algorithm for the fatigue-aware DCM when $f$ is unknown
1. Initialization: set $u^{UCB}_{j,0} = 1$ for all $j \in [N]$ and $f^{UCB}(i) = 1$ for all $i \in [M]$; $t = 1$.
2. While $t < T$:
   a. Compute $\tilde S_t = \arg\max_S \mathbb{E}[R(S, \theta^{UCB}_{t-1})]$ according to Theorem 3.1;
   b. If there exists $i \in [N]$ such that $T_i(t) < \alpha T^{2/3}$, insert item $i$ at the beginning of $\tilde S_t$ as the first item;
   c. Offer sequence $\tilde S_t$ and observe the user's feedback;
   d. Update $u^{UCB}_{j,t}$ for all $j$ and $f^{UCB}_t(i)$ for all $i$ according to Equation (4);
   e. For $i = 1, \ldots, M$: if $f^{UCB}_t(i) > f^{UCB}_t(i-1)$, set $f^{UCB}_t(i) = f^{UCB}_t(i-1)$;
   f. $t = t + 1$.
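Below is a rough Python sketch of the FA-DCM loop under simplifying assumptions: it reuses `optimal_sequence` from Section 3, `offer_and_observe` is again a hypothetical user-interaction callback, and the confidence radii are generic $\sqrt{\log t / n}$ terms rather than the paper's exact $\Delta^t_i$ (which also accounts for the error in $\hat u_j$).

```python
import math
import numpy as np

def fa_dcm(T, N, M, types, offer_and_observe, alpha=1.0):
    """FA-DCM sketch (both u and f unknown).

    u_j is estimated from events (0, j) -- item j shown as the first item of
    its category, where f(0) = 1 -- and f(i) from the ratios z / u_hat_j
    collected on events (i, j). Confidence radii are illustrative placeholders."""
    clicks0, T0 = np.zeros(N), np.zeros(N)        # statistics for events (0, j)
    ratio_sum, T_i = np.zeros(M), np.zeros(M)     # statistics for f(1), ..., f(M)
    u_ucb, f_ucb = np.ones(N), np.ones(M + 1)     # f_ucb[0] = f(0) = 1
    f = lambda r: f_ucb[min(r, M)]                # f(r) constant for r >= M
    for t in range(1, T):
        seq = optimal_sequence(u_ucb, types, f)   # step 2a: Theorem 3.1 / Algorithm 1
        forced = [j for j in range(N) if T0[j] < alpha * T ** (2 / 3)]
        if forced:                                # step 2b: forced exploration
            j0 = forced[0]
            seq = [j0] + [k for k in seq if k != j0]
        shown = {}
        for j, clicked in offer_and_observe(seq): # step 2c: examined prefix only
            h = shown.get(types[j], 0)
            if h == 0:                            # event (0, j): direct sample of u_j
                clicks0[j] += clicked
                T0[j] += 1
            elif clicks0[j] > 0:                  # event (h, j): sample used to estimate f(h)
                u_hat = clicks0[j] / T0[j]
                ratio_sum[min(h, M) - 1] += clicked / u_hat
                T_i[min(h, M) - 1] += 1
            shown[types[j]] = h + 1
        for j in range(N):                        # step 2d: update u^UCB (Equation (4))
            if T0[j] > 0:
                u_ucb[j] = min(1.0, clicks0[j] / T0[j]
                               + math.sqrt(math.log(t + 1) / T0[j]))
        for i in range(1, M + 1):                 # update f^UCB, enforce monotonicity (step 2e)
            if T_i[i - 1] > 0:
                f_ucb[i] = min(1.0, ratio_sum[i - 1] / T_i[i - 1]
                               + math.sqrt(math.log(t + 1) / T_i[i - 1]))
            f_ucb[i] = min(f_ucb[i], f_ucb[i - 1])
    return u_ucb, f_ucb
```

The forced-exploration check mirrors step 2b of Algorithm 3, and the final clipping enforces the monotonicity correction of step 2e.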
### 5.2 Regret analysis for Algorithm FA-DCM

Define $V^t_i = \frac{1}{\hat T_i(t)} \sum_{j} \sum_{\tau \in \mathcal T_{ij}(t)} \frac{z^{\tau}_{ij}}{u_j}$. We first have the following lemma to bound the difference between $V^t_i$ and $f(i)$.

Lemma 5.1 For any $t$, $P\Big(|V^t_i - f(i)| \ge \sqrt{\tfrac{\log t}{\hat T_i(t)}}\Big) \le \tfrac{2}{t^2}$.

Lemma 5.2 For any $1 \le i \le M$, we have $\sum_{t=1}^{T} P\big(|\hat f_t(i) - f(i)| \ge \Delta^t_i\big) \le C N$ for some constant $C$, where $\hat f_t(i)$ is defined in Equation (3).

Proof outline: We decompose the deviation of $\hat f_t(i)$ from $f(i)$ into two parts: the error introduced by plugging in the estimates $\hat u_j$ in place of $u_j$, and the deviation of $V^t_i$ from $f(i)$. The first part is controlled by standard concentration bounds on $\hat u_j$, contributing a term of order $\tfrac{2}{t^2}$ plus an exponentially small term $\exp\big(-\tfrac{1}{2} T_{0j}(t)\,u_j^2\big)$ for each $j$; the second part is bounded using Lemma 5.1. Summing over $t$ yields $\sum_{t=1}^{T} P\big(|\hat f_t(i) - f(i)| \ge \Delta^t_i\big) \le C N$ for some constant $C$.

Lemma 5.3 Assume $S$ is the optimal sequence of items. Under the condition that $0 \le u \le u'$ and $0 \le f \le f'$, we have $\mathbb{E}[R(S, f', u')] \ge \mathbb{E}[R(S, f, u)]$.

Proof: Similar to the proof of Lemma 4.2, the result follows by induction.

Theorem 5.4 The regret of Algorithm FA-DCM over horizon $T$ is bounded by $\text{Regret}_\pi(T) \le C\,\tfrac{1}{1-g}\, N \sqrt{N T^{4/3} \log T}$ for some constant $C$.

Proof outline: Define the events $A_{j,t} = \{u^{UCB}_{j,t} - 2\sqrt{\tfrac{\log t}{T_{0j}(t)}} < u_j < u^{UCB}_{j,t}\}$, $B_{i,t} = \{f^{UCB}_t(i) - 2\Delta^t_i < f(i) < f^{UCB}_t(i)\}$, and $E_t = \cap_{j=1}^{N} A_{j,t} \cap \cap_{i=1}^{M} B_{i,t}$. Let $\bar f^{UCB}_t(i) = \min_{j \le i} f^{UCB}_t(j)$, $\bar B_{i,t} = \{\bar f^{UCB}_t(i) - 2\Delta^t_i < f(i) < \bar f^{UCB}_t(i)\}$, and $\bar E_t = \cap_{j=1}^{N} A_{j,t} \cap \cap_{i=1}^{M} \bar B_{i,t}$. Note that if $E_t$ holds, $\bar E_t$ also holds. We say that time $t$ is in the exploration phase, denoted $t \in E$, if there exists $j \in [N]$ such that $T_{0j}(t) < \alpha t^{2/3}$. Similar to the proof of Theorem 4.3, the regret $\mathbb{E}[\sum_{t \notin E} R(S^*, u) - R(\tilde S_t, u)]$ on the event $\bar E_t$ can first be bounded from above by

$$\mathbb{E}_\pi\Big[\sum_{j=1}^{N} \kappa_{j,t}\,\big(u^{UCB}_{j,t} - u_j\big)\,\mathbb 1(\bar E_t)\Big] + \mathbb{E}_\pi\Big[\sum_{i=1}^{M} \kappa_{i,t}\,\big(\bar f^{UCB}_t(i) - f(i)\big)\,\mathbb 1(\bar E_t)\Big],$$

where $\kappa_{i,t} = \prod_{k=1}^{l_t(i)-1}\big((g-q)\,f(h_{I_t(k)}(\tilde S_t))\,u_{I_t(k)} + q\big)$ denotes the probability of observing an item as the $i$th recommendation within its own category. This term can be further bounded by $C\,\tfrac{1}{1-g}\,N\sqrt{N T^{4/3}\log T}$. Finally, using Lemma 5.2, we bound the regret on the low-probability event $\bar E^c_t$ and the cumulative regret over $t \in E$.

## 6 Numerical Experiments

In this section, we perform four sets of numerical experiments to evaluate the performance of our online learning algorithms. In the first two experiments, we investigate the robustness of Algorithms FA-DCM-P and FA-DCM, respectively. In the last two experiments, we compare our algorithm with a benchmark using simulated and real datasets, respectively.

### 6.1 Robustness study

Experiment I: With a known discount factor. In this experiment, we investigate the robustness of Algorithm FA-DCM-P when the discount factor $f$ is known to the platform. We consider a setting with three categories, each containing 10 items. Items' intrinsic relevance $u$ is uniformly generated from $[0, 0.5]$. The known discount factor is set to $f(r) = \exp(-0.1(r-1))$.
We fix the resuming probability after non-clicks to $q = 0.7$, and compare three cases by varying the resuming probability after clicks: Case 1: $g = 0.95$; Case 2: $g = 0.85$; Case 3: $g = 0.75$. For each case, we run 20 independent simulations.

Experiment II: With an unknown discount factor. We assume both $f$ and $u$ are unknown in this experiment, and study the robustness of Algorithm FA-DCM. There are three categories, each containing 10 items. We consider three cases, where we use $q = 0.7$ as in Experiment I and vary $g$ and the discount factor $f$: Case 4: $g = 0.85$, $f(r) = \exp(-0.1(r-1))$; Case 5: $g = 0.85$, $f(r) = \exp(-0.15(r-1))$; Case 6: $g = 0.75$, $f(r) = \exp(-0.1(r-1))$.

Result: The left plot in Fig 2 shows the average regret for Algorithm FA-DCM-P as the solid line and its 95% sampled confidence interval as the shaded area, generated from 20 independent simulations for each case. The average regret at $T = 10{,}000$ is 307.52, 277.77, and 265.73 for Cases 1, 2, and 3, respectively. The result shows that the regret decreases as $g$ decreases, which is consistent with the regret bound shown in Theorem 4.3. The right plot in Fig 2 shows the results for Algorithm FA-DCM, where the average regret at $T = 10{,}000$ is 1237.82, 1178.24, and 991.97 for Cases 4, 5, and 6, respectively. Comparing Cases 4 and 5, we find that when the discount factor decays faster, the regret decreases. Meanwhile, comparing Cases 4 and 6 shows that the regret increases with $g$, which is consistent with the conclusion in Theorem 5.4. Comparing Cases 2 and 4, as well as Cases 3 and 6, the simulation results show that the regret is much larger when $f$ also needs to be learned.

Figure 2: Simulation results for Experiments I and II (left: Experiment I; right: Experiment II).

### 6.2 Comparison with a benchmark algorithm

Since there is no other algorithm addressing our specific setting, we choose an explore-then-exploit algorithm as a benchmark for Algorithm FA-DCM. During the exploration time steps, items are randomly selected. For the first $t$ time steps, we ensure that there are $\beta \log t$ exploration periods, where $\beta$ is a tuning parameter; that is, if the number of exploration periods before $t$ is less than $\beta \log t$, then $t$ is used as an exploration period in which items are randomly selected. During the exploitation periods, we use the empirical means of $u$ and $f$ to find the optimal sequence (a sketch of this benchmark is given below). In both Experiments III and IV, we run 20 independent simulations to compare the performance of our algorithm against the benchmark.
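For completeness, here is a minimal Python sketch of the explore-then-exploit benchmark just described. The bookkeeping mirrors the FA-DCM sketch above, `optimal_sequence` and `offer_and_observe` are the same assumed helpers, and the default value of `beta` is our own choice, not the paper's.

```python
import math
import random
import numpy as np

def explore_then_exploit(T, N, M, types, offer_and_observe, beta=10.0):
    """Benchmark sketch: keep at least beta * log(t) pure-exploration rounds
    (randomly ordered items); otherwise exploit the empirical means of u and f
    through Algorithm 1. beta is a tuning parameter."""
    clicks0, T0 = np.zeros(N), np.zeros(N)       # same empirical statistics as in FA-DCM
    ratio_sum, T_i = np.zeros(M), np.zeros(M)
    n_explore = 0
    for t in range(1, T):
        if n_explore < beta * math.log(t + 1):   # exploration round
            seq = random.sample(range(N), N)
            n_explore += 1
        else:                                    # exploitation round: plug in empirical means
            u_hat = np.where(T0 > 0, clicks0 / np.maximum(T0, 1), 1.0)
            f_hat = np.ones(M + 1)
            f_hat[1:] = np.where(T_i > 0, ratio_sum / np.maximum(T_i, 1), 1.0)
            seq = optimal_sequence(u_hat, types, lambda r: f_hat[min(r, M)])
        shown = {}
        for j, clicked in offer_and_observe(seq):
            h = shown.get(types[j], 0)
            if h == 0:
                clicks0[j] += clicked
                T0[j] += 1
            elif clicks0[j] > 0:
                ratio_sum[min(h, M) - 1] += clicked / (clicks0[j] / T0[j])
                T_i[min(h, M) - 1] += 1
            shown[types[j]] = h + 1
```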
Experiment III: With simulated data. We set parameters $g = 0.75$, $q = 0.7$, and $f(r) = \exp(-0.1(r-1))$. Items' intrinsic relevance $u$ is uniformly generated from $[0, 0.5]$.

Experiment IV: With real data. The data is from Taobao,3 and contains 26 million ad display and click logs from 1,140,000 randomly sampled users of the Taobao website over 8 days (5/6/2017-5/13/2017). To fit the data into our framework, we use a long period of inactivity since the last interaction with the website as a proxy for exiting the platform. Based on the data, we estimate that the average probability of resuming browsing after a non-click is 0.823, i.e., $q = 0.823$, while the probability of resuming browsing after a click is higher, i.e., $g = 0.843$. For the discount factor, we use $f(r) = \exp(-0.1(r-1))$. In this experiment, we consider five categories of items and select the top 20 selling products from each category. We estimate the click probabilities of these 100 products and use them as the ground-truth values for $u_i$.

Result: Fig 3 compares the performance of our algorithm and the benchmark based on 20 independent simulations. Our algorithm outperforms the benchmark in both experiments, highlighting the benefits of simultaneous exploration and exploitation. In particular, in Experiment III, shown in the left plot, the average regret for the benchmark is 1279.88 at $T = 10{,}000$, which is 29.02% higher than that of Algorithm FA-DCM. Meanwhile, in Experiment IV, shown in the right plot, the average regret for Algorithm FA-DCM is 910.87, while the regret for the benchmark is 1519.20 at $T = 100{,}000$.

Figure 3: Simulation results for Experiments III and IV (left: Experiment III; right: Experiment IV).

3 https://tianchi.aliyun.com/datalab/dataSet.html?spm=5176.100073.0.0.14d53ea7Rleuc9&dataId=56

## 7 Conclusion

In this work, we investigated a fatigue-aware DCM bandit problem for an online recommender. The setting takes into account users' fatigue triggered by an overexposure to irrelevant content, as well as the boredom one might experience from seeing too much similar content. Depending on whether the discount factor $f$ is known to the platform, we proposed two online learning algorithms, established their regret bounds, and used numerical experiments to illustrate their performance. There are several interesting future directions. One natural extension is to consider the contextual version of this model by incorporating item features and user attributes, such that the system is capable of producing personalized recommendations. Another direction for future work is to approach the fatigue-aware DCM bandits via Thompson Sampling, although the regret analysis remains a challenging problem. Last but not least, as users' preferences often change with time, it would be interesting to incorporate non-stationary item intrinsic relevance ($u_i$) into the model.

References

Cao, J., and Sun, W. 2019. Dynamic learning of sequential choice bandit problem under marketing fatigue. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).

Chuklin, A.; Markov, I.; and de Rijke, M. 2015. Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services 7(3):1-115.

Combes, R.; Magureanu, S.; Proutiere, A.; and Laroche, C. 2015. Learning to rank: Regret lower bounds and efficient algorithms. ACM SIGMETRICS Performance Evaluation Review 43(1):231-244.

Craswell, N.; Zoeter, O.; Taylor, M.; and Ramsey, B. 2008. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, 87-94. ACM.

Guo, F.; Liu, C.; and Wang, Y. M. 2009. Efficient multiple-click models in web search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, 124-131. ACM.

Kapoor, K.; Subbian, K.; Srivastava, J.; and Schrater, P. 2015. Just in time recommendations: Modeling the dynamics of boredom in activity streams. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 233-242. ACM.

Katariya, S.; Kveton, B.; Szepesvári, C.; Vernade, C.; and Wen, Z. 2016a. Stochastic rank-1 bandits. arXiv preprint arXiv:1608.03023.

Katariya, S.; Kveton, B.; Szepesvári, C.; and Wen, Z. 2016b.
DCM bandits: Learning to rank with multiple clicks. In International Conference on Machine Learning, 1215-1224.

Katariya, S.; Kveton, B.; Szepesvári, C.; Vernade, C.; and Wen, Z. 2017. Bernoulli rank-1 bandits for click feedback. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2001-2007. AAAI Press.

Komiyama, J.; Honda, J.; and Takeda, A. 2017. Position-based multiple-play bandit problem with unknown position bias. In Advances in Neural Information Processing Systems, 4998-5008.

Kveton, B.; Szepesvári, C.; Wen, Z.; and Ashkan, A. 2015a. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, 767-776.

Kveton, B.; Wen, Z.; Ashkan, A.; and Szepesvári, C. 2015b. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems, 1450-1458.

Lagrée, P.; Vernade, C.; and Cappé, O. 2016. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems, 1597-1605.

Li, L.; Lu, Y.; and Zhou, D. 2017. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2071-2080. JMLR.org.

Ma, H.; Liu, X.; and Shen, Z. 2016. User fatigue in online news recommendation. In Proceedings of the 25th International Conference on World Wide Web, 1363-1372. International World Wide Web Conferences Steering Committee.

Radlinski, F.; Kleinberg, R.; and Joachims, T. 2008. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, 784-791. ACM.

Richardson, M.; Dominowska, E.; and Ragno, R. 2007. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on World Wide Web, 521-530. ACM.

Wang, Y., and Tulabandhula, T. 2019. Thompson sampling for a fatigue-aware online recommendation system. arXiv preprint arXiv:1901.07734.

Warlop, R.; Lazaric, A.; and Mary, J. 2018. Fighting boredom in recommender systems with linear reinforcement learning. In Advances in Neural Information Processing Systems, 1757-1768.

Yu, B.; Fang, M.; and Tao, D. 2016. Linear submodular bandits with a knapsack constraint. In Thirtieth AAAI Conference on Artificial Intelligence.

Yue, Y., and Guestrin, C. 2011. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems, 2483-2491.

Ziegler, C.-N.; McNee, S. M.; Konstan, J. A.; and Lausen, G. 2005. Improving recommendation lists through topic diversification. In Proceedings of the 14th International Conference on World Wide Web, 22-32. ACM.