# Diversified Interactive Recommendation with Implicit Feedback

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Yong Liu (1,2), Yingtai Xiao (2,4), Qiong Wu (2), Chunyan Miao (3), Juyong Zhang (4), Binqiang Zhao (5), Haihong Tang (5)

(1) Alibaba-NTU Singapore Joint Research Institute
(2) Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY)
(3) School of Computer Science and Engineering, Nanyang Technological University
(4) University of Science and Technology of China
(5) Alibaba Group

{stephenliu, wu.qiong, ascymiao}@ntu.edu.sg, gzxyt@mail.ustc.edu.cn, juyong@ustc.edu.cn, binqiang.zhao@alibaba-inc.com, piaoxue@taobao.com

This work was performed while this author was a research assistant at Nanyang Technological University, Singapore. Corresponding author.

## Abstract

Interactive recommender systems, which enable interactions between users and the recommender system, have attracted increasing research attention. Previous methods mainly focus on optimizing recommendation accuracy. However, they usually ignore the diversity of the recommendation results, which often leads to unsatisfying user experiences. In this paper, we propose a novel diversified recommendation model, named Diversified Contextual Combinatorial Bandit (DC2B), for interactive recommendation with users' implicit feedback. Specifically, DC2B employs a determinantal point process in the recommendation procedure to promote the diversity of the recommendation results. To learn the model parameters, a Thompson sampling-type algorithm based on variational Bayesian inference is proposed. In addition, a theoretical regret analysis is provided to guarantee the performance of DC2B. Extensive experiments on real datasets demonstrate the effectiveness of the proposed method in balancing recommendation accuracy and diversity.

## Introduction

Conventional recommender systems are usually developed in a non-interactive manner and learn user preferences from logged user behavior data (Liu et al. 2017; Yang et al. 2018; Liu et al. 2018; Wang et al. 2018). One main drawback of these systems is that they cannot capture the changes of users' preferences in time. This motivates the development of interactive recommender systems that enable interactions between users and the recommender system (Steck, van Zwol, and Johnson 2015). In the literature, contextual bandit learning has been demonstrated to be a promising solution to interactive recommendation problems (Li et al. 2010; Zhao, Zhang, and Wang 2013; Tang et al. 2015; Wang, Wu, and Wang 2017; Qi et al. 2018). In these methods, the recommender system sequentially recommends a set of items to a user and adopts the user's immediate feedback to improve its recommendation policy. In practice, users' implicit feedback (e.g., clicking history) is usually utilized to build recommender systems, because implicit feedback is user-centric and can be easily collected (Shi, Larson, and Hanjalic 2014; Wang et al. 2019). However, implicit feedback usually carries biased signals, which makes the recommendation problem much more challenging. This bias comes from the fact that implicit feedback can only capture positive user preferences (i.e., observed user-item interactions), while all negative user preferences are missing.
Although the non-interaction between a user and an item is usually treated as negative user preference in previous research (Shi, Larson, and Hanjalic 2014), it does not explicitly indicate that the user dislikes the item, as the non-interaction may also occur simply because the item has not been exposed to the user (Liang et al. 2016). Moreover, previous interactive recommendation methods mainly focus on optimizing recommendation accuracy. They usually ignore other important properties of the recommendation results, for example the diversity of the recommended item set (Kunaver and Požrl 2017; Wu et al. 2019b). Therefore, the items in the recommendation lists generated by these approaches tend to be very similar to each other, and the recommendation results may only cover a small fraction of items. This usually leads to inferior user experiences, and thus reduces the commercial value of recommender systems.

Intuitively, it is very challenging to achieve both high accuracy and high diversity. Methods that focus too strongly on diversity usually put accuracy at risk: because there is a lack of data for less popular items, considering such items for recommendation may lead to a decrease in recommendation accuracy (Adomavicius and Kwon 2011). Therefore, the main objective of diversified recommendation methods is to optimize the trade-off between accuracy and diversity, which is usually referred to as the accuracy-diversity dilemma (Zhou et al. 2010).

In this paper, we propose a novel bandit learning framework for interactive recommender systems based on users' implicit feedback, which strives to achieve a good balance between accuracy and diversity in the recommendation results. To address the bias problem caused by implicit feedback, we model the interactions between users and the recommender system from two perspectives: (i) Diversified Item Exposure: the recommender system selects a set of relevant yet diverse items to expose to the user; (ii) User Engagements: the user eventually engages with some of the exposed items (e.g., clicks on the items). Specifically, the determinantal point process (DPP) (Kulesza, Taskar, and others 2012) is employed to select a set of diverse items to expose to users, considering both the qualities of the items and the diversity of the selected item set. The advantage of DPP is that it explicitly models the probability that an item set would be selected to be shown to the user, and thus can help solve the bias problem caused by implicit feedback (Liang et al. 2016). In addition, the contextual features of items are also utilized to model the observed user engagements on the recommended items. To summarize, the major contributions made in this paper are as follows: (1) we propose a novel bandit learning method, i.e., Diversified Contextual Combinatorial Bandit (DC2B), to improve the recommendation diversity of interactive recommender systems; (2) we propose a variational Bayesian inference algorithm under the Thompson sampling framework to learn the model parameters; (3) we provide a theoretical regret analysis for the proposed DC2B method; (4) we perform extensive experiments on real datasets to demonstrate the effectiveness of DC2B in balancing recommendation accuracy and diversity.

## Related Work

**Diversified Recommendation.** One major group of diversified recommendation methods is based on greedy heuristics.
The pioneering work is maximal marginal relevance (MMR) (Carbonell and Goldstein 1998), which defines a marginal relevance score that combines relevance and diversity, and builds a diversified ranking by greedily choosing, at each iteration, the item that maximizes the marginal relevance. Other greedy heuristic methods vary in the definition of the marginal relevance, often in the form of a submodular objective function (Qin and Zhu 2013; Sha, Wu, and Niu 2016), which can be solved greedily with an approximation guarantee to the optimal solution. Another group of methods is based on refinement heuristics, which usually re-rank a pre-ranked item list through post-processing (Zhang and Hurley 2008; Antikacioglu and Ravi 2017). From another perspective, (Cheng et al. 2017) formulates diversified recommendation as a supervised learning task and proposes a diversified collaborative filtering model to solve the resulting optimization problems. Recently, DPP has been demonstrated to be effective in modeling diversity in various machine learning problems (Kulesza, Taskar, and others 2012), and some recent work (Chen, Zhang, and Zhou 2018; Wilhelm et al. 2018; Wu et al. 2019a) employs DPP to improve recommendation diversity.

**Interactive Recommendation.** Contextual bandits have often been used to build interactive recommender systems. These methods mainly focus on optimizing recommendation accuracy. For instance, (Li et al. 2010) proposes a contextual bandit algorithm, named LinUCB, which sequentially recommends articles to users based on the contextual information of users and articles. (Zhao, Zhang, and Wang 2013) combines probabilistic matrix factorization with Thompson sampling and upper confidence bound based bandit algorithms to interactively select items. (Tang et al. 2015) proposes a parameter-free bandit approach that uses online bootstrap to learn the online recommendation model. Recently, (Wang et al. 2017) extends LinUCB to incorporate users' social relationships into interactive recommender systems. (Wang, Wu, and Wang 2017) proposes a factorization-based bandit approach to solve the online interactive recommendation problem. Moreover, in (Qi et al. 2018), the Thompson sampling framework is employed to solve bandit problems with implicit feedback, where the implicit feedback is modeled as a composition of user result examination and relevance judgement. There also exist some interactive recommender systems focusing on promoting recommendation diversity. For example, (Qin, Chen, and Zhu 2014) proposes a contextual combinatorial bandit framework that incorporates the entropy regularizer (Qin and Zhu 2013) to diversify the recommendation results. Differing from (Qin, Chen, and Zhu 2014), DC2B is a full Bayesian framework that is more effective in balancing recommendation accuracy and diversity.

## Problem Formulation

We employ contextual bandits to build the diversified interactive recommender system. The recommender system is treated as an agent, and each item is treated as an arm. Let $\mathcal{A} = \{a_i\}_{i=1}^{N}$ denote the set of $N$ arms (i.e., items). We assume each arm $a_i$ has a contextual feature vector $\mathbf{x}_i \in \mathbb{R}^{1 \times d}$ summarizing its side information, and denote the features of all arms by $\mathbf{X} \in \mathbb{R}^{N \times d}$. At each trial, the recommender agent first chooses a subset of arms $S$ from $\mathcal{A}$, considering both the qualities of the arms and the diversity of the selected arms. $S$ is usually called a super arm.
Here, we empirically define the quality of an arm $a_i$ as follows:
$$r_i = \exp(\boldsymbol{\theta}\mathbf{x}_i^{\top}), \qquad (1)$$
where $\boldsymbol{\theta}$ is the bandit parameter that describes the user's preferences. The diversity of the selected super arm $S$ can be measured by the intra-list distance metric (Zhang and Hurley 2008). Once a diversified super arm $S$ has been selected according to a policy $\pi$ and displayed to the user, the user's engagements on the displayed items (e.g., clicks on items) are used as the rewards for the recommender agent to optimize its recommendation policy. Through interactions with the user, the recommender agent aims to adjust its super arm selection strategy to maximize its cumulative reward over time.

### Diversified Item Exposure

The DPP is an elegant probabilistic model with the ability to model diversity in various machine learning problems (Kulesza, Taskar, and others 2012). In this work, we utilize DPP to model the selection probability of a relevant yet diverse super arm $S$. Formally, a DPP $\mathcal{P}$ on the set of candidate arms $\mathcal{A}$ is a probability measure on $2^{\mathcal{A}}$, the set of all subsets of $\mathcal{A}$. If $\mathcal{P}$ assigns nonzero probability to the empty set $\emptyset$, there exists a real, positive semi-definite kernel matrix $\mathbf{L} \in \mathbb{R}^{N \times N}$ such that the probability of the super arm $S$ can be defined as follows:
$$p(S) = \frac{\det(\mathbf{L}_{[S]})}{\det(\mathbf{L} + \mathbf{I})}, \qquad (2)$$
where $\mathbf{I}$ is the identity matrix, and $\mathbf{L}_{[S]} \triangleq [\mathbf{L}_{ij}]_{a_i, a_j \in S}$ is the sub-matrix of $\mathbf{L}$ indexed by $S$. As revealed in (Kulesza, Taskar, and others 2012), $\mathbf{L}$ can be written as a Gram matrix, $\mathbf{L} = \mathbf{V}\mathbf{V}^{\top}$, where the rows of $\mathbf{V}$ are vectors representing the arms. Following (Chen, Zhang, and Zhou 2018; Wilhelm et al. 2018), we empirically set $\mathbf{V}_i = (r_i)^{\alpha}\mathbf{x}_i$, where $\alpha > 0$ is a parameter controlling the impact of item qualities. Then, the elements of $\mathbf{L}$ are defined as $\mathbf{L}_{ij} = (r_i r_j)^{\alpha}\mathbf{x}_i\mathbf{x}_j^{\top}$. If $\mathbf{x}_i$ is normalized, i.e., $\|\mathbf{x}_i\|_2 = 1$, the cosine similarity between $a_i$ and $a_j$ can be calculated as $\mathbf{C}_{ij} = \mathbf{x}_i\mathbf{x}_j^{\top}$. We can re-write $\mathbf{L}$ as follows:
$$\mathbf{L} = \mathrm{Diag}\{\exp(\alpha\tilde{\mathbf{r}})\}\,\mathbf{C}\,\mathrm{Diag}\{\exp(\alpha\tilde{\mathbf{r}})\}, \qquad (3)$$
where $\mathrm{Diag}\{\tilde{\mathbf{r}}\}$ is a diagonal matrix whose $i$-th diagonal element is $\tilde{r}_i = \boldsymbol{\theta}\mathbf{x}_i^{\top}$, and $\mathbf{C}$ is the similarity matrix. Then, the log-probability of the super arm $S$ is:
$$\log p(S) \propto 2\alpha\sum_{a_i \in S}\tilde{r}_i + \log\det\big(\mathbf{C}_{[S]}\big), \qquad (4)$$
where the last term is maximized when the features of the arms in $S$ are orthogonal; it therefore helps promote recommendation diversity (Chen, Zhang, and Zhou 2018). In addition, Eq. (4) also indicates that the parameter $\alpha$ can help balance the relevance and diversity of the items selected for recommendation.

### User Engagements

The user's engagements on the displayed items are expressed by her implicit feedback (e.g., clicks on the items), which is usually described by a set of binary variables. If the user engages with the arm $a_i$, we set $y_i$ to 1; otherwise, we set $y_i$ to 0. Once an arm $a_i \in S$ has been displayed to the user, we assume the user's engagement with $a_i$ is determined only by its quality. Thus, the probability of an observed user engagement with $a_i$, i.e., $y_i = 1$, can be defined as follows:
$$p_i \triangleq \rho(\boldsymbol{\theta}\mathbf{x}_i^{\top}) = \frac{\exp(\boldsymbol{\theta}\mathbf{x}_i^{\top})}{1 + \exp(\boldsymbol{\theta}\mathbf{x}_i^{\top})} = \frac{r_i}{1 + r_i}. \qquad (5)$$
This can be interpreted as follows: when an arm $a_i$ is offered to the user, the user engages with either this arm or a virtual arm $a_0$ with a relevance score of 1. Based on these assumptions, we can define the joint probability of the observed user engagements $\mathcal{Y} = \{y_i \,|\, a_i \in S\}$ as follows:
$$p(\mathcal{Y}, S, \boldsymbol{\theta}) = p(\boldsymbol{\theta})\,p(S|\boldsymbol{\theta})\,p(\mathcal{Y}|S, \boldsymbol{\theta}) = p(\boldsymbol{\theta})\,\frac{\det(\mathbf{L}_{[S]})}{\det(\mathbf{L} + \mathbf{I})}\prod_{a_i \in S} p_i^{y_i}(1 - p_i)^{1 - y_i}, \qquad (6)$$
where $p(\boldsymbol{\theta})$ is the prior assigned to the bandit parameters. In addition, we assume $p(\boldsymbol{\theta})$ follows a Gaussian distribution $\mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma})$, and that $\mathbf{m}$ and $\boldsymbol{\Sigma}$ are bounded. This assumption is typically used in practice.
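To make Eqs. (1)-(5) concrete, here is a minimal Python sketch (ours, not part of the paper) that builds the DPP kernel of Eq. (3) from row-normalized item features and a bandit parameter, evaluates the unnormalized super-arm log-probability of Eq. (4), and computes the per-item engagement probabilities of Eq. (5). All function and variable names (`dpp_kernel`, `theta`, `X`, `alpha`) are illustrative assumptions.

```python
import numpy as np

def dpp_kernel(theta, X, alpha):
    """DPP kernel L of Eq. (3): L = Diag{exp(a*r~)} C Diag{exp(a*r~)}.

    theta : (d,) sampled bandit parameter
    X     : (N, d) row-normalized item features
    alpha : trade-off parameter controlling the impact of item qualities
    """
    r_tilde = X @ theta                      # log-qualities, r~_i = theta x_i^T
    C = X @ X.T                              # cosine similarities (rows have unit norm)
    D = np.diag(np.exp(alpha * r_tilde))
    return D @ C @ D

def superarm_log_prob(theta, X, S, alpha):
    """Unnormalized log p(S) of Eq. (4) for an index list S.

    Assumes the features of the items in S are linearly independent,
    so that det(C_[S]) > 0.
    """
    r_tilde = X @ theta
    C_S = X[S] @ X[S].T
    _, logdet = np.linalg.slogdet(C_S)
    return 2.0 * alpha * r_tilde[S].sum() + logdet

def engagement_prob(theta, X):
    """Per-item click probabilities p_i = r_i / (1 + r_i) of Eq. (5)."""
    return 1.0 / (1.0 + np.exp(-(X @ theta)))
```

With this parameterization, a larger `alpha` shifts the selection toward high-quality items, while an `alpha` close to zero leaves mainly the diversity term $\log\det(\mathbf{C}_{[S]})$.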
**Algorithm 1: Thompson Sampling for DC2B**

- Initialize $\mathbf{m} = \mathbf{0}$, $\boldsymbol{\Sigma} = \lambda\mathbf{I}$, and $\mathcal{R} = \emptyset$.
- **for** $t = 0$ to $T$ **do**
  - $\mathcal{A}_t \leftarrow \mathcal{A} \setminus \mathcal{R}$, $\mathbf{X}_t = \{\mathbf{x}_i \,|\, a_i \in \mathcal{A}_t\}$
  - Randomly sample $\tilde{\boldsymbol{\theta}} \sim \mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma})$
  - $S \leftarrow O(\tilde{\boldsymbol{\theta}}, \mathbf{X}_t)$
  - Play super arm $S$ and observe the rewards $\mathcal{Y}$
  - Update $\boldsymbol{\Sigma}$ and $\mathbf{m}$ according to Eqs. (9), (10), and (11)
  - $\mathcal{R} \leftarrow \mathcal{R} \cup S$
- **end for**

## Parameter Inference

Once a new observation $(S, \mathcal{Y})$ is available, we employ variational Bayesian inference (Blei, Kucukelbir, and McAuliffe 2017) to develop a closed-form approximation to the posterior of $\boldsymbol{\theta}$. According to (Blei, Kucukelbir, and McAuliffe 2017), the approximated posterior $q(\boldsymbol{\theta})$ of $\boldsymbol{\theta}$ can be expressed as $\log q^{*}(\boldsymbol{\theta}) = \mathbb{E}_{\mathrm{param} \neq \boldsymbol{\theta}}[\log p(\mathcal{Y}, S, \boldsymbol{\theta})] + \mathrm{const}$. Moreover, by standard results in linear algebra, we have $\det(\mathbf{L}_{[S]}) = \prod_{a_i \in S} r_i^{2\alpha}\det(\mathbf{X}_{[S]}\mathbf{X}_{[S]}^{\top})$ and $\det(\mathbf{L} + \mathbf{I}) = \exp\big(\mathrm{tr}(\log(\mathbf{L} + \mathbf{I}))\big)$. Then, we have the following log-likelihood function:
$$\log p(\mathcal{Y}, S|\boldsymbol{\theta}) \propto \sum_{a_i \in S}(\phi_i + 2\alpha\log r_i) + \log\det\big(\mathbf{X}_{[S]}\mathbf{X}_{[S]}^{\top}\big) - \sum_{j=1}^{N}\log\big(1 + r_j^{2\alpha}\mathbf{x}_j\mathbf{x}_j^{\top}\big), \qquad (7)$$
where $\phi_i = y_i\log p_i + (1 - y_i)\log(1 - p_i)$. In Eq. (7), the likelihood involves the logistic function, which is not conjugate with the Gaussian prior on $\boldsymbol{\theta}$. To address this issue, the following Gaussian lower bound on the logistic function is employed to approximate the likelihood (Jaakkola and Jordan 1997): $\rho(x) \geq \rho(\xi)\exp\big(\frac{x - \xi}{2} - \lambda(\xi)(x^2 - \xi^2)\big)$, where $\lambda(\xi) = \frac{1}{2\xi}\big(\rho(\xi) - \frac{1}{2}\big)$, and $\xi$ is an auxiliary variable that needs to be adjusted to make the bound tight at $x = \xi$. Moreover, by assuming $\|\boldsymbol{\theta}\|_2 \leq A$ and $\|\mathbf{x}_j\|_2 \leq B$, we have $\log\big(1 + \exp(2\alpha\boldsymbol{\theta}\mathbf{x}_j^{\top})\mathbf{x}_j\mathbf{x}_j^{\top}\big) \leq \exp(2\alpha\boldsymbol{\theta}\mathbf{x}_j^{\top})\mathbf{x}_j\mathbf{x}_j^{\top} \leq \exp(2\alpha AB)B^2$. As we assume $\mathbf{m}$ and $\boldsymbol{\Sigma}$ are bounded, it is reasonable to infer that $\boldsymbol{\theta}$ is bounded. By normalizing $\mathbf{x}_j$, we can make $\mathbf{x}_j$ bounded. Then, we have the following lower bound of the log-likelihood function in Eq. (7):
$$\log p(\mathcal{Y}, S|\boldsymbol{\theta}) \geq \mathrm{const} + \sum_{a_i \in S}\Big[\frac{(2y_i - 1 + 4\alpha)\boldsymbol{\theta}\mathbf{x}_i^{\top}}{2} - \lambda(\xi_i)\,\boldsymbol{\theta}(\mathbf{x}_i^{\top}\mathbf{x}_i)\boldsymbol{\theta}^{\top} + \varphi(\xi_i)\Big], \qquad (8)$$
where $\varphi(\xi_i) = \log\rho(\xi_i) - \frac{\xi_i}{2} + \lambda(\xi_i)\xi_i^2$. The optimal variational distribution of $\boldsymbol{\theta}$ satisfies $\log q^{*}(\boldsymbol{\theta}) \propto \mathbb{E}[\log h(\boldsymbol{\theta}, \boldsymbol{\xi})] + \mathbb{E}[\log p(\boldsymbol{\theta})] + \mathrm{const}$, where $h(\boldsymbol{\theta}, \boldsymbol{\xi})$ denotes the lower bound above. Due to model conjugacy, we know that $q(\boldsymbol{\theta})$ follows a Gaussian distribution $\mathcal{N}(\mathbf{m}_{\mathrm{post}}, \boldsymbol{\Sigma}_{\mathrm{post}})$, where the covariance and mean are as follows:
$$\boldsymbol{\Sigma}_{\mathrm{post}}^{-1} = \boldsymbol{\Sigma}^{-1} + 2\sum_{a_i \in S}\lambda(\xi_i)\,\mathbf{x}_i^{\top}\mathbf{x}_i, \qquad (9)$$
$$\mathbf{m}_{\mathrm{post}} = \boldsymbol{\Sigma}_{\mathrm{post}}\Big(\boldsymbol{\Sigma}^{-1}\mathbf{m} + \sum_{a_i \in S}\big(y_i + 2\alpha - \tfrac{1}{2}\big)\mathbf{x}_i\Big). \qquad (10)$$
Since no prior has been assigned to $\xi_i$, the optimal value of $\xi_i$ can be derived by maximizing the expected log-likelihood function $\ell(\xi_i) = \mathbb{E}[\log p(\mathcal{Y}, S|\boldsymbol{\theta}, \xi_i)]$. Taking the derivative of $\ell(\xi_i)$ with respect to $\xi_i$ and setting it to zero, the optimal value of $\xi_i$ is obtained as follows:
$$\xi_i = \sqrt{\mathbf{x}_i\big(\boldsymbol{\Sigma}_{\mathrm{post}} + \mathbf{m}_{\mathrm{post}}^{\top}\mathbf{m}_{\mathrm{post}}\big)\mathbf{x}_i^{\top}}. \qquad (11)$$

We employ Thompson sampling (TS) to update the model parameters while balancing exploration and exploitation. The details of the TS algorithm are summarized in Algorithm 1. The standard TS method requires sampling from the true posterior of the model parameter $\boldsymbol{\theta}$. As the logistic likelihood function is not conjugate with the Gaussian prior, we propose to sample from the approximated posterior distribution $q(\boldsymbol{\theta})$. Once $\boldsymbol{\theta}$ has been sampled, the DPP kernel matrix $\mathbf{L}$ is fixed, and we can select the optimal super arm $S$ by maximizing $f_{\boldsymbol{\theta}}(S) = \prod_{a_i \in S} p_i \det(\mathbf{L}_{[S]})$. We employ the fast greedy MAP inference algorithm (Chen, Zhang, and Zhou 2018) to obtain the optimal super arm. The details of the greedy algorithm are summarized in Algorithm 2.

**Algorithm 2: DPP Greedy Search $S \leftarrow O(\tilde{\boldsymbol{\theta}}, \mathbf{X}_t)$**

- Startup: construct $\mathbf{L}$ and $\mathbf{p}$ according to $\tilde{\boldsymbol{\theta}}$ and $\mathbf{X}_t$.
- Initialize $\mathbf{c}_i = [\,]$, $d_i^2 = \mathbf{L}_{ii}$, $j = \arg\max_{i \in Z}\big\{\log d_i^2 + \log(p_i)\big\}$, $S = \{j\}$.
- **for** $k = 0$ to $K$ **do**
  - **for** $i \in Z \setminus S$ **do**
    - $e_i = \big(\mathbf{L}_{ji} - \langle\mathbf{c}_j, \mathbf{c}_i\rangle\big)/d_j$
    - $\mathbf{c}_i = [\mathbf{c}_i \; e_i]$, $d_i^2 = d_i^2 - e_i^2$
  - **end for**
  - $j = \arg\max_{i \in Z \setminus S}\big\{\log d_i^2 + \log(p_i)\big\}$, $S = S \cup \{j\}$
- **end for**
- Return $S$
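As a concrete illustration of the posterior update, here is a minimal Python sketch of Eqs. (9)-(11) that alternates the Gaussian update and the refinement of the auxiliary variables $\xi_i$ for a few fixed-point iterations; the iteration count and all names (`update_posterior`, `X_S`, `y_S`) are our own assumptions rather than the paper's specification.

```python
import numpy as np

def lam(xi):
    """lambda(xi) = (rho(xi) - 1/2) / (2 * xi) from the Jaakkola-Jordan bound."""
    rho = 1.0 / (1.0 + np.exp(-xi))
    return (rho - 0.5) / (2.0 * xi)

def update_posterior(m, Sigma, X_S, y_S, alpha, n_iter=5):
    """Variational update of q(theta) = N(m_post, Sigma_post), Eqs. (9)-(11).

    m, Sigma : prior mean (d,) and covariance (d, d)
    X_S, y_S : features (k, d) and binary feedback (k,) of the played super arm
    alpha    : quality/diversity trade-off parameter of DC2B
    """
    Sigma_inv = np.linalg.inv(Sigma)
    xi = np.ones(len(y_S))                       # nonzero initial auxiliary variables
    for _ in range(n_iter):
        # Eq. (9): Sigma_post^{-1} = Sigma^{-1} + 2 * sum_i lambda(xi_i) x_i^T x_i
        Sigma_post = np.linalg.inv(Sigma_inv + 2.0 * (X_S.T * lam(xi)) @ X_S)
        # Eq. (10): m_post = Sigma_post (Sigma^{-1} m + sum_i (y_i + 2a - 1/2) x_i)
        m_post = Sigma_post @ (Sigma_inv @ m + X_S.T @ (y_S + 2.0 * alpha - 0.5))
        # Eq. (11): xi_i = sqrt(x_i (Sigma_post + m_post m_post^T) x_i^T)
        M = Sigma_post + np.outer(m_post, m_post)
        xi = np.sqrt(np.einsum('ij,jk,ik->i', X_S, M, X_S))
    return m_post, Sigma_post
```

For the next round of Algorithm 1, $\tilde{\boldsymbol{\theta}}$ would then be sampled from $\mathcal{N}(\mathbf{m}_{\mathrm{post}}, \boldsymbol{\Sigma}_{\mathrm{post}})$, e.g., with `np.random.multivariate_normal(m_post, Sigma_post)`.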
## Regret Analysis

We consider a model involving a set of actions $\mathcal{S}$ and a set of functions $\mathcal{F} = \{f_{\boldsymbol{\theta}} : \mathcal{S} \rightarrow \mathbb{R} \,|\, \boldsymbol{\theta} \in \Theta\}$ indexed by a random variable $\boldsymbol{\theta}$ that belongs to an index set $\Theta$. At each time $t$, a random subset $\mathcal{S}_t \subseteq \mathcal{S}$ is presented and an action $S_t \in \mathcal{S}_t$ is selected, after which the reward $R_t$ is gained. We define the expected reward as
$$\mathbb{E}[R_t] \triangleq f_{\boldsymbol{\theta}}(S_t) = \prod_{a_i \in S_t} p_i \det(\mathbf{L}_{[S_t]}) = \prod_{a_i \in S_t} p_i r_i^{2\alpha}\det\big(\mathbf{X}_{[S_t]}\mathbf{X}_{[S_t]}^{\top}\big),$$
and define the reward at trial $t$ as $R_t = f_{\boldsymbol{\theta}}(S_t) + \epsilon_t$. Therefore, we have $\mathbb{E}[\epsilon_t] = 0$. In addition, we assume $\forall f_{\boldsymbol{\theta}} \in \mathcal{F}, \forall S_t \in \mathcal{S}$, $f_{\boldsymbol{\theta}}(S_t) \in [0, C]$. For a recommendation policy $\pi$, we can define the Bayesian risk bound as follows:
$$\mathrm{Regret}(T, \pi) = \sum_{t=1}^{T}\mathbb{E}\Big[\max_{s \in \mathcal{S}_t} f_{\boldsymbol{\theta}}(s) - f_{\boldsymbol{\theta}}(S_t)\Big]. \qquad (12)$$
To perform the regret analysis, we first introduce the following two lemmas, which can be proved following Propositions 9 and 10 in (Russo and Van Roy 2014). The difference is that the variable in our approach is a set of arms $S_t$ instead of a single arm $a$; the proofs are similar, so we omit them due to the space limitation. We set $\sigma = 1$ according to Lemma 7 in (Qi et al. 2018), which also shows that $f_{\boldsymbol{\theta}}$ satisfies Assumption 2 in (Russo and Van Roy 2014).

**Lemma 1.** For all $T \in \mathbb{N}$, $\alpha_0 > 0$ and $\delta \leq 1/2T$,
$$\mathrm{Regret}\big(T, \pi^{TS}\big) \leq 4\sqrt{\dim_M(\mathcal{F}, T^{-1})\,\beta_T^{*}(\mathcal{F}, \alpha_0, \delta)\,T} + 1 + \big[\dim_M(\mathcal{F}, T^{-1}) + 1\big]C,$$
where $\dim_M(\mathcal{F}, T^{-1})$ is the eluder dimension of $\mathcal{F}$ at precision $T^{-1}$, $\beta_T^{*}(\mathcal{F}, \alpha_0, \delta) := 8\ln\big(N(\mathcal{F}, \alpha_0, \|\cdot\|_{\infty})/\delta\big) + 2\alpha_0 t\big(\tfrac{15}{2}C + \sqrt{\ln(2t^2/\delta)}\big)$, and $N(\mathcal{F}, \alpha_0, \|\cdot\|_{\infty})$ denotes the $\alpha_0$-covering number of $\mathcal{F}$.

**Lemma 2.** Suppose $\Theta \subset \mathbb{R}^d$ and $|f_{\boldsymbol{\theta}}(S_t) - f_{\boldsymbol{\theta}'}(S_t)| \leq h\,\big|(\boldsymbol{\theta} - \boldsymbol{\theta}')\varphi(S_t)^{\top}\big|$, where $\varphi(S_t) = \sum_{a_i \in S_t}\mathbf{x}_i$ and $h$ is a constant. Assume there exist constants $\gamma$ and $S_0$ such that for all $S_t \in \mathcal{S}$ and $\boldsymbol{\theta} \in \Theta$, $\|\boldsymbol{\theta}\|_2 \leq S_0$ and $\|\varphi(S_t)\|_2 \leq \gamma$. Then we have
$$\dim_M(\mathcal{F}, \epsilon) \leq 3d\,\frac{e}{e-1}\,\ln\Big\{3 + 3\Big(\frac{2S_0 h\gamma}{\epsilon}\Big)^2\Big\} + 1.$$

According to Lemma 1, in our problem $C = 1$, and we can choose $\alpha_0 = 1/T^2$ and $\delta = 1/T^2$. Then, the Bayesian risk bound of DC2B is given by the following theorem.

**Theorem 1.** When $T$ is sufficiently large, the Bayesian risk bound of DC2B is
$$\mathrm{Regret}\big(T, \pi^{TS}\big) = O\big(d\ln\big(\alpha s e^{2\alpha s}T\big)\sqrt{T}\big),$$
where $s$ is the number of items in a selected set, and $d$ is the dimension of $\boldsymbol{\theta}$.

*Proof.* We first assume $\|\boldsymbol{\theta}\|_2 \leq 1$ and $\|\mathbf{x}\|_2 \leq 1$, and introduce the following inequalities. (1) Mean value theorem: we have $|\boldsymbol{\theta}\mathbf{x}^{\top}| \leq \|\boldsymbol{\theta}\|_2\|\mathbf{x}\|_2 \leq 1$; then $|p - p'| = |\rho(\boldsymbol{\theta}\mathbf{x}^{\top}) - \rho(\boldsymbol{\theta}'\mathbf{x}^{\top})| = |\rho'(\xi)(\boldsymbol{\theta} - \boldsymbol{\theta}')\mathbf{x}^{\top}| \leq \frac{1}{4}\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|_2\|\mathbf{x}\|_2$, and $|r^{2\alpha} - r'^{2\alpha}| = |\exp(2\alpha\boldsymbol{\theta}\mathbf{x}^{\top}) - \exp(2\alpha\boldsymbol{\theta}'\mathbf{x}^{\top})| = |2\alpha\exp(2\alpha\zeta)(\boldsymbol{\theta} - \boldsymbol{\theta}')\mathbf{x}^{\top}| \leq 2\alpha e^{2\alpha}\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|_2\|\mathbf{x}\|_2$, where $\rho'(x) = \rho(x)(1 - \rho(x)) \leq \frac{1}{4}$, $0 \leq \xi \leq 1$, and $0 \leq \zeta \leq 1$. (2) Gram inequality: $|\det(\mathbf{X}_{[S]}\mathbf{X}_{[S]}^{\top})| = |\det(G(\mathbf{x}_1, \ldots, \mathbf{x}_s))| \leq \|\mathbf{x}_1\|_2^2\cdots\|\mathbf{x}_s\|_2^2 \leq 1$, where $[G(\mathbf{x}_1, \ldots, \mathbf{x}_n)]_{i,j} = \mathbf{x}_i\mathbf{x}_j^{\top}$ defines a Gram matrix. (3) Triangle inequality: $|x_1x_2 - y_1y_2| = |x_1x_2 - y_1x_2 + y_1x_2 - y_1y_2| \leq |x_1 - y_1||x_2| + |y_1||x_2 - y_2|$. Based on these inequalities, we have
$$|f_{\boldsymbol{\theta}}(S_t) - f_{\boldsymbol{\theta}'}(S_t)| = \big|\det(\mathbf{X}_{[S_t]}\mathbf{X}_{[S_t]}^{\top})\big|\,\Big|\prod_{i=1}^{s} p_i r_i^{2\alpha} - \prod_{i=1}^{s} p_i' r_i'^{2\alpha}\Big| \leq \frac{8\alpha + 1}{4}e^{2\alpha s}\big|(\boldsymbol{\theta} - \boldsymbol{\theta}')\varphi(S_t)^{\top}\big| \leq \frac{8\alpha + 1}{4}e^{2\alpha s}\,s\sqrt{d}\,\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|_{\infty}, \qquad (15)$$
where we use the inequality $\|\boldsymbol{\theta}\|_2 \leq \sqrt{d}\,\|\boldsymbol{\theta}\|_{\infty}$. According to Eq. (15), an $\alpha_0$-covering of $\mathcal{F}$ can therefore be attained through an $(\alpha_0/\gamma')$-covering of $\Theta \subset \mathbb{R}^d$, where $\gamma' = \frac{8\alpha + 1}{4}e^{2\alpha s}s\sqrt{d}$. Evenly dividing $\mathbb{R}^d$ in each dimension, we can obtain $N(\mathbb{R}^d, \alpha_0, \|\cdot\|_{\infty}) = (1/\alpha_0)^d$. Then, we have $N(\mathcal{F}, \alpha_0, \|\cdot\|_{\infty}) = (\gamma'/\alpha_0)^d = \big(\frac{(8\alpha + 1)s e^{2\alpha s}\sqrt{d}}{4\alpha_0}\big)^d$. In our problem, $S_0 = 1$, $h = \frac{8\alpha + 1}{4}e^{2\alpha s}$ and $\gamma = s$. According to Lemma 2 and Eq. (15), we have the following bound on $\dim_M(\mathcal{F}, T^{-1})$:
$$\dim_M(\mathcal{F}, T^{-1}) \leq 3d\,\frac{e}{e-1}\,\ln\Big\{3 + 3\Big(\frac{(8\alpha + 1)e^{2\alpha s}Ts}{2}\Big)^2\Big\} + 1.$$
Let $\alpha_0 = 1/T^2$, $\delta = 1/T^2$, and $C = 1$. When $T$ is sufficiently large, the second part of $\beta_T^{*}(\mathcal{F}, \alpha_0, \delta)$ decreases to zero. After some calculation, together with the above two bounds, we can finish the proof. □

The upper bound in Theorem 1 mainly depends on the dimensionality $d$ of the model parameter, the size $s$ of the recommended item set, and the quality-controlling parameter $\alpha$. Here, $d$ describes the model complexity. As $d$ increases, $\boldsymbol{\theta}$ is able to model more complex scenarios.
However, an overly sophisticated model may cause over-fitting, resulting in poor performance. Therefore, the regret bound becomes high when $d$ is large. Proposition 9 in (Russo and Van Roy 2014) gives the Bayesian risk bound for non-combinatorial bandit methods as $O\big(rd\sqrt{T}\log(rT)\big)$, where $r$ is a parameter determined by the reward function. By simply repeating the recommendation $s$ times to obtain a set of items, the bound would be $O\big(srd\sqrt{T}\log(rT)\big)$. In DC2B, if we set $\alpha = 1$, the Bayesian regret bound would be $O\big(d\ln(se^{2s}T)\sqrt{T}\big)$, which is slightly different from multiplying $s$ into the bound of non-combinatorial methods. This is because our reward function also takes the recommendation diversity into account. As $\alpha$ controls the impact of item qualities, increasing $\alpha$ increases the risk caused by the estimation of item qualities. Thus, the regret grows as $\alpha$ increases.

## Experiments

### Experimental Settings

**Datasets.** The experiments are performed on the following datasets: MovieLens-100K, MovieLens-1M (https://grouplens.org/datasets/movielens/), and Anime (https://www.kaggle.com/CooperUnion/anime-recommendations-database). MovieLens-100K contains 100,000 ratings given by 943 users to 1,682 movies, and MovieLens-1M contains 1,000,209 ratings given by 6,040 users to 3,706 movies. There are 18 movie categories in both MovieLens datasets. We denote these two datasets by ML-100K and ML-1M, respectively. For the Anime dataset, there are 7,813,737 ratings given by 73,515 users to 11,200 animes, and there are 44 anime categories. Following (Liu et al. 2018), we keep the ratings larger than 3 as positive feedback on the ML-100K and ML-1M datasets, and keep the ratings larger than 6 as positive feedback on the Anime dataset. Table 1 summarizes the statistics of the experimental datasets, where movies and animes are regarded as "items". In these datasets, each item may belong to multiple categories. The densities of the ML-100K, ML-1M, and Anime datasets are 4.06%, 2.70%, and 0.85%, respectively.

Table 1: The statistics of experimental datasets.

| Datasets | # Users | # Items | # Inter. | # Cate. |
|---|---|---|---|---|
| ML-100K | 942 | 1,447 | 55,375 | 18 |
| ML-1M | 6,038 | 3,533 | 575,281 | 18 |
| Anime | 69,400 | 8,825 | 5,231,117 | 44 |

**Setup and Metrics.** For interactive recommendation methods, it is most appropriate to use an online experimental setting with real-time user interactions for evaluation. However, such an environment is typically unavailable in academic research. Hence, following (Zhao, Zhang, and Wang 2013; Qin, Chen, and Zhu 2014; Wang et al. 2017), we assume that users' ratings on items recorded in our experimental datasets are not biased by the recommender system, and that these records can be regarded as unbiased user feedback in our experimental settings. The unbiased offline evaluation strategy (Li et al. 2011) is used to evaluate the recommendation methods. In the experiments, we randomly partition each dataset into two non-overlapping sets, by randomly sampling 80% of the users for training and using the remaining 20% of the users for testing. Moreover, we employ BPRMF (Rendle et al. 2009) to learn the embeddings of items based on the training data, which are used as the contextual features of arms. Empirically, we set the dimensionality of the item embeddings to 10. As users are usually interested in only a few top-ranked recommended items, we adopt Precision@N to evaluate the recommendation accuracy (Shi, Larson, and Hanjalic 2014), by aggregating the items recommended in N/|S| trials and computing the precision. Specifically, N is set to 10, 30, and 50.
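As a concrete reading of this protocol, the sketch below aggregates the items recommended to one user over the first N/|S| trials and computes Precision@N against the user's held-out positive items; the function and variable names are hypothetical, and the aggregation order is our own interpretation of the setup described above.

```python
def precision_at_n(recommended_sets, positives, n, set_size=10):
    """Precision@N aggregated over the first ceil(n / set_size) trials.

    recommended_sets : list of per-trial recommended item lists for one user
    positives        : set of the user's held-out positively rated items
    n                : cut-off (10, 30, or 50 in the experiments)
    set_size         : number of items recommended per trial (|S| = 10 here)
    """
    n_trials = -(-n // set_size)                     # ceiling division
    items = [a for S in recommended_sets[:n_trials] for a in S][:n]
    hits = sum(1 for a in items if a in positives)
    return hits / n
```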
We also evaluate the average recommendation diversity of each method over all recommendation trials, using the intra-list distance (ILD) metric (Zhang and Hurley 2008):
$$\mathrm{ILD} = \frac{1}{T}\sum_{t=1}^{T}\Big[\frac{2}{|S_t|(|S_t| - 1)}\sum_{a_i, a_j \in S_t,\, i \neq j}\big(1 - \mathrm{sim}_{ij}\big)\Big],$$
where $S_t$ is the recommended item set at trial $t$, $|S_t|$ denotes the size of $S_t$, $T$ is the total number of recommendation trials, and $\mathrm{sim}_{ij}$ denotes the similarity between $a_i$ and $a_j$. As an item may belong to multiple item categories, we define the item similarity $\mathrm{sim}_{ij}$ as the Jaccard similarity of the category sets of the two items. For the accuracy and diversity metrics, we first compute the value for each user, and then report the value averaged over all users. Following (Cheng et al. 2017), we also employ the F-measure to evaluate how well different methods trade off accuracy and diversity, where F-measure = 2 × accuracy × diversity / (accuracy + diversity).

Table 2: Recommendation performances of different algorithms.

| Dataset | Metrics | Log Rank | MMR | ϵ-Greedy | DPPmap | C2UCB | EC-Bandit | DC2B |
|---|---|---|---|---|---|---|---|---|
| ML-100K | Precision@10 | 0.3548 | 0.3665 | 0.3421 | 0.3665 | 0.3633 | 0.2128 | 0.3649 |
| ML-100K | Precision@30 | 0.2872 | 0.2872 | 0.2792 | 0.2846 | 0.3415 | 0.1633 | 0.3211 |
| ML-100K | Precision@50 | 0.2507 | 0.2499 | 0.2433 | 0.2554 | 0.3146 | 0.1453 | 0.2882 |
| ML-100K | Diversity | 0.8024 | 0.8151 | 0.8145 | 0.7985 | 0.7827 | 0.8356 | 0.8118 |
| ML-100K | F-measure | 0.3820 | 0.3825 | 0.3747 | 0.3870 | 0.4488 | 0.2476 | 0.4254 |
| ML-1M | Precision@10 | 0.3785 | 0.3754 | 0.3631 | 0.3764 | 0.3418 | 0.2160 | 0.3785 |
| ML-1M | Precision@30 | 0.3204 | 0.3173 | 0.3084 | 0.3173 | 0.3192 | 0.1750 | 0.3401 |
| ML-1M | Precision@50 | 0.2841 | 0.2824 | 0.2745 | 0.2807 | 0.2998 | 0.1611 | 0.3117 |
| ML-1M | Diversity | 0.8516 | 0.8531 | 0.8462 | 0.8174 | 0.8319 | 0.8326 | 0.8367 |
| ML-1M | F-measure | 0.4261 | 0.4221 | 0.4145 | 0.4179 | 0.4408 | 0.2700 | 0.4542 |
| Anime | Precision@10 | 0.3141 | 0.3157 | 0.2867 | 0.3157 | 0.0095 | 0.1733 | 0.3003 |
| Anime | Precision@30 | 0.2527 | 0.2534 | 0.2366 | 0.2541 | 0.1116 | 0.1326 | 0.2666 |
| Anime | Precision@50 | 0.2165 | 0.2178 | 0.2025 | 0.2164 | 0.1518 | 0.1168 | 0.2419 |
| Anime | Diversity | 0.8323 | 0.8495 | 0.8521 | 0.8414 | 0.5031 | 0.8460 | 0.8355 |
| Anime | F-measure | 0.3436 | 0.3467 | 0.3272 | 0.3443 | 0.2332 | 0.2053 | 0.3752 |

**Evaluated Recommendation Methods.** As the training users do not overlap with the testing users, recommendation algorithms designed for warm-start settings (Shi, Larson, and Hanjalic 2014) are not suitable as baselines. In this paper, we compare DC2B with the following recommendation methods: (1) **Log Rank**: In this method, we define the quality score of each arm $a_i$ as $r_i = 1/(1 + \exp(-\bar{\mathbf{u}}\mathbf{x}_i^{\top}))$, where $\bar{\mathbf{u}}$ is the mean of the user embeddings learnt from the training data. Then, the $|S_t|$ available arms with the highest quality scores are selected as the super arm $S_t$ for recommendation at trial $t$; (2) **MMR**: This method employs the MMR strategy (Carbonell and Goldstein 1998) to promote recommendation diversity. At trial $t$, this method sequentially selects the available arm with the largest maximal marginal relevance score into $S_t$. The maximal marginal relevance score is defined as $\tilde{r}_i = \alpha r_i - (1 - \alpha)\sum_{a_j \in S_t}\mathrm{sim}(\mathbf{x}_i, \mathbf{x}_j)$, where $r_i$ is the arm quality defined in the Log Rank method, and $\mathrm{sim}(\mathbf{x}_i, \mathbf{x}_j)$ is the cosine similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$; (3) **ϵ-Greedy**: This method randomly adds an available arm into $S_t$ with probability $\epsilon$, and adds the arm with the highest quality into $S_t$ with probability $1 - \epsilon$. The item quality is defined in the same way as in the Log Rank method; (4) **DPPmap** (Chen, Zhang, and Zhou 2018): This non-interactive method uses the determinantal point process to promote recommendation diversity.
The item quality is defined in the same way as in Log Rank; (5) **C2UCB** (Qin, Chen, and Zhu 2014): This method integrates the LinUCB framework with an entropy regularizer to promote diversity in interactive recommendation; (6) **EC-Bandit** (Qi et al. 2018): This bandit method is based on the Thompson sampling framework and is developed for interactive recommendation with users' implicit feedback. In this method, the user needs to interact with the recommender $|S_t|$ times to generate the recommended item set at trial $t$. For all methods, we empirically set the size of $S_t$ to 10 in each trial. A validation set is sampled from the training data to choose the hyper-parameters. The best parameter settings for each method are as follows: $\alpha$ is set to 0.9 for MMR; $\epsilon$ is set to 0.1 for ϵ-Greedy; $\theta$ is set to 0.6 for DPPmap; in C2UCB, we set $\lambda_0 = 100$, $\lambda = 0.1$, and $\sigma = 1$; in EC-Bandit, we set the parameter $\lambda = 1$; for DC2B, we empirically set $\alpha = 3$ and $\lambda = 1$ on all datasets.

*Figure 1: Performance trend of DC2B with respect to different settings of α on the ML-100K dataset (Precision@50 and Diversity, with α varied in {0.01, 0.1, 1, 3, 5, 10, 100}).*

### Performance Comparison

The recommendation accuracy and diversity of the different algorithms are summarized in Table 2. As shown in Table 2, the proposed DC2B method usually achieves the best recommendation accuracy (i.e., Precision@N) on the ML-1M and Anime datasets, and achieves the second best accuracy on the ML-100K dataset. For example, on the Anime dataset, DC2B significantly outperforms C2UCB and EC-Bandit by 59.35% and 107.11%, and achieves 11.73%, 11.07%, 19.46%, and 11.78% better performance than Log Rank, MMR, ϵ-Greedy, and DPPmap, respectively, in terms of Precision@50. These results indicate that DC2B is more effective than the baseline methods on large and sparse datasets. Moreover, we also note that the combinatorial bandit methods C2UCB and DC2B significantly outperform EC-Bandit. One potential reason is that the combinatorial methods employ the user's feedback on a set of items to update the model parameters, whereas EC-Bandit uses the user's feedback on a single item. The parameter learning of C2UCB and DC2B is therefore more stable than that of EC-Bandit, and thus C2UCB and DC2B can achieve better recommendation accuracy. In addition, we can note that the non-interactive methods MMR and ϵ-Greedy usually achieve slightly higher recommendation diversity than DC2B, and DC2B attains better recommendation diversity than DPPmap on the ML-100K and ML-1M datasets. Compared with the interactive methods, Table 2 indicates that the recommendation diversity of DC2B is higher than that of C2UCB on all datasets, while EC-Bandit achieves higher recommendation diversity than DC2B on the ML-100K and Anime datasets.

Table 3: Relative improvements of DC2B over baselines.

| Methods | ML-100K Prec.@50 | ML-100K Div. | ML-100K F-m. | ML-1M Prec.@50 | ML-1M Div. | ML-1M F-m. | Anime Prec.@50 | Anime Div. | Anime F-m. |
|---|---|---|---|---|---|---|---|---|---|
| Log Rank | +14.96% | +1.17% | +11.36% | +9.71% | -1.75% | +6.59% | +11.73% | +0.38% | +9.20% |
| MMR | +15.33% | -0.40% | +11.22% | +10.38% | -1.92% | +7.60% | +11.07% | -1.65% | +8.22% |
| ϵ-Greedy | +18.45% | -0.33% | +13.53% | +13.55% | -1.12% | +9.58% | +19.46% | -1.95% | +14.67% |
| DPPmap | +12.84% | +1.67% | +9.92% | +11.04% | +2.36% | +8.69% | +11.78% | -0.70% | +8.97% |
| C2UCB | -8.39% | +3.72% | -5.21% | +3.97% | +0.58% | +3.04% | +59.35% | +66.07% | +60.89% |
| EC-Bandit | +98.35% | -2.85% | +71.81% | +93.48% | +0.49% | +68.22% | +107.11% | -1.21% | +82.76% |

*Figure 2: Performance trend of DC2B with respect to different settings of |St| on the ML-100K dataset (Precision@10 to Precision@50 for |St| = 5 and |St| = 10).*
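The Diversity and F-measure values reported in Tables 2 and 3 follow the ILD and F-measure definitions given in the Setup and Metrics subsection; the sketch below is a minimal implementation under the assumption that each item's categories are available as a Python set (names such as `item_cats` are ours).

```python
import numpy as np

def jaccard_sim(cats_i, cats_j):
    """Jaccard similarity between the category sets of two items."""
    union = cats_i | cats_j
    return len(cats_i & cats_j) / len(union) if union else 0.0

def ild_diversity(recommended_sets, item_cats):
    """Average intra-list distance (ILD) over all recommendation trials.

    recommended_sets : list of per-trial recommended item lists (each of size >= 2)
    item_cats        : dict mapping item id -> set of category labels
    """
    trial_scores = []
    for S in recommended_sets:
        k = len(S)
        dist = sum(1.0 - jaccard_sim(item_cats[a], item_cats[b])
                   for idx, a in enumerate(S) for b in S[idx + 1:])
        trial_scores.append(2.0 * dist / (k * (k - 1)))
    return float(np.mean(trial_scores))

def f_measure(precision, diversity):
    """Harmonic mean of accuracy and diversity used for the overall comparison."""
    return 2.0 * precision * diversity / (precision + diversity)
```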
To better understand these results, the F-measure is used to study the effectiveness of each recommendation algorithm in balancing recommendation accuracy and diversity. Here, we use Precision@50 and Diversity to compute the F-measure. As shown in Table 2, DC2B achieves the best F-measure value on the ML-1M and Anime datasets, and the second best F-measure value on the ML-100K dataset. In addition, we summarize the relative improvements of DC2B over the baseline methods on Precision@50, Diversity, and F-measure in Table 3. These results demonstrate that the proposed DC2B method is more effective in balancing recommendation accuracy and diversity than the baseline methods, especially on larger and sparser datasets.

### Parameter Sensitivity Analysis

Moreover, we also evaluate the impacts of $\alpha$ and the size of the super arm $|S_t|$ on the performance of DC2B on the ML-100K dataset. The parameter $\alpha$ is varied in $\{0.01, 0.1, 1, 3, 5, 10, 100\}$. Figure 1 shows the performance of DC2B with respect to different settings of $\alpha$. As shown in Figure 1, the recommendation accuracy in terms of Precision@50 first increases with $\alpha$. When $\alpha$ is larger than 5, the recommendation accuracy of DC2B drops drastically as $\alpha$ is changed to 10 and 100. We can also note that the diversity value decreases as $\alpha$ increases, because a larger $\alpha$ makes DC2B focus more on item qualities when generating recommendations. Overall, the results in Figure 1 indicate that $\alpha$ can effectively control the trade-off between item quality and diversity when generating recommendations. Additionally, we vary the size of the recommendation list $|S_t|$ at each trial in $\{5, 10\}$. Figure 2 summarizes the accuracy of DC2B with respect to different super arm sizes. As shown in Figure 2, a larger super arm size tends to result in better recommendation accuracy once a sufficient number of interactions (e.g., more than 3) between the user and the recommender system have been performed. This is because the model update based on the user's feedback on a larger set of items is expected to be more stable and accurate. Moreover, the average recommendation diversities with respect to $|S_t| = 5$ and $|S_t| = 10$ are 0.8110 and 0.8118, respectively. This indicates that $|S_t|$ does not have a significant impact on diversity.

## Conclusion and Future Work

This work proposes a novel bandit learning method, namely Diversified Contextual Combinatorial Bandit (DC2B), for interactive recommendation based on users' implicit feedback. Specifically, DC2B is a full Bayesian recommendation framework, which enables interactions between the recommender system and the user, and employs a determinantal point process (DPP) to promote recommendation diversity. We develop a Thompson sampling-type optimization algorithm to iteratively learn the model parameters, and conduct a regret analysis to provide a theoretical guarantee for DC2B. Moreover, empirical experiments on real datasets demonstrate the effectiveness of the proposed DC2B in balancing recommendation accuracy and diversity. As for future work, we intend to develop more complex DPP kernels and efficient DPP inference algorithms for interactive recommender systems. In addition, we are also interested in developing more sophisticated models to describe user engagements on the recommendation results.
## Acknowledgments

This research is supported, in part, by the National Research Foundation, Prime Minister's Office, Singapore under its AI Singapore Programme (Award Number: AISG-GC-2019-003) and under its NRF Investigatorship Programme (NRFI Award No. NRF-NRFI05-2019-0002). This research is also supported, in part, by the Alibaba-NTU Singapore Joint Research Institute (Award Number: Alibaba-NTU-AIR2019B1), Nanyang Technological University, Singapore, and Alibaba Group.

## References

- Adomavicius, G., and Kwon, Y. 2011. Improving aggregate recommendation diversity using ranking-based techniques. IEEE Transactions on Knowledge and Data Engineering 24(5):896-911.
- Antikacioglu, A., and Ravi, R. 2017. Post processing recommender systems for diversity. In KDD, 707-716. ACM.
- Blei, D. M.; Kucukelbir, A.; and McAuliffe, J. D. 2017. Variational inference: A review for statisticians. Journal of the American Statistical Association 112(518).
- Carbonell, J., and Goldstein, J. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, 335-336. ACM.
- Chen, L.; Zhang, G.; and Zhou, E. 2018. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In NIPS, 5622-5633.
- Cheng, P.; Wang, S.; Ma, J.; Sun, J.; and Xiong, H. 2017. Learning to recommend accurate and diverse items. In WWW, 183-192. International World Wide Web Conferences Steering Committee.
- Jaakkola, T., and Jordan, M. 1997. A variational approach to Bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, volume 82, 4.
- Kulesza, A.; Taskar, B.; et al. 2012. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning 5(2-3).
- Kunaver, M., and Požrl, T. 2017. Diversity in recommender systems - a survey. Knowledge-Based Systems 123:154-162.
- Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW, 661-670. ACM.
- Li, L.; Chu, W.; Langford, J.; and Wang, X. 2011. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, 297-306. ACM.
- Liang, D.; Charlin, L.; McInerney, J.; and Blei, D. M. 2016. Modeling user exposure in recommendation. In WWW, 951-961. International World Wide Web Conferences Steering Committee.
- Liu, Y.; Zhao, P.; Liu, X.; Wu, M.; Duan, L.; and Li, X.-L. 2017. Learning user dependencies for recommendation. In IJCAI, 2379-2385. AAAI Press.
- Liu, Y.; Zhao, L.; Liu, G.; Lu, X.; Gao, P.; Li, X.-L.; and Jin, Z. 2018. Dynamic Bayesian logistic matrix factorization for recommendation with implicit feedback. In IJCAI, 3463-3469. AAAI Press.
- Qi, Y.; Wu, Q.; Wang, H.; Tang, J.; and Sun, M. 2018. Bandit learning with implicit feedback. In NIPS, 7276-7286.
- Qin, L., and Zhu, X. 2013. Promoting diversity in recommendation by entropy regularizer. In IJCAI, 2698-2704. AAAI Press.
- Qin, L.; Chen, S.; and Zhu, X. 2014. Contextual combinatorial bandit and its application on diversified online recommendation. In SDM, 461-469. SIAM.
- Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI, 452-461. AUAI Press.
- Russo, D., and Van Roy, B. 2014. Learning to optimize via posterior sampling. Mathematics of Operations Research 39(4).
- Sha, C.; Wu, X.; and Niu, J. 2016. A framework for recommending relevant and diverse items. In IJCAI, 3868-3874. AAAI Press.
- Shi, Y.; Larson, M.; and Hanjalic, A. 2014. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys 47(1):3.
- Steck, H.; van Zwol, R.; and Johnson, C. 2015. Interactive recommender systems: Tutorial. In RecSys, 359-360. ACM.
- Tang, L.; Jiang, Y.; Li, L.; Zeng, C.; and Li, T. 2015. Personalized recommendation via parameter-free contextual bandits. In SIGIR, 323-332. ACM.
- Wang, X.; Hoi, S. C.; Liu, C.; and Ester, M. 2017. Interactive social recommendation. In CIKM, 357-366. ACM.
- Wang, S.; Hu, L.; Cao, L.; Huang, X.; Lian, D.; and Liu, W. 2018. Attention-based transactional context embedding for next-item recommendation. In AAAI, 2532-2539. AAAI Press.
- Wang, S.; Hu, L.; Wang, Y.; Cao, L.; Sheng, Q. Z.; and Orgun, M. 2019. Sequential recommender systems: Challenges, progress and prospects. In IJCAI, 6332-6338. AAAI Press.
- Wang, H.; Wu, Q.; and Wang, H. 2017. Factorization bandits for interactive recommendation. In AAAI, 2695-2702. AAAI Press.
- Wilhelm, M.; Ramanathan, A.; Bonomo, A.; Jain, S.; Chi, E. H.; and Gillenwater, J. 2018. Practical diversified recommendations on YouTube with determinantal point processes. In CIKM, 2165-2173. ACM.
- Wu, Q.; Liu, Y.; Miao, C.; Zhao, B.; Zhao, Y.; and Guan, L. 2019a. PD-GAN: Adversarial learning for personalized diversity-promoting recommendation. In IJCAI, 3870-3876. AAAI Press.
- Wu, Q.; Liu, Y.; Miao, C.; Zhao, Y.; Guan, L.; and Tang, H. 2019b. Recent advances in diversified recommendation. arXiv preprint arXiv:1905.06589.
- Yang, P.; Zhao, P.; Liu, Y.; and Gao, X. 2018. Robust cost-sensitive learning for recommendation with implicit feedback. In SDM, 621-629. SIAM.
- Zhang, M., and Hurley, N. 2008. Avoiding monotony: Improving the diversity of recommendation lists. In RecSys, 123-130. ACM.
- Zhao, X.; Zhang, W.; and Wang, J. 2013. Interactive collaborative filtering. In CIKM, 1411-1420. ACM.
- Zhou, T.; Kuscsik, Z.; Liu, J.-G.; Medo, M.; Wakeling, J. R.; and Zhang, Y.-C. 2010. Solving the apparent diversity-accuracy dilemma of recommender systems. Proceedings of the National Academy of Sciences 107(10):4511-4515.