# Improving Implicit Recommender Systems with View Data

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Jingtao Ding1, Guanghui Yu1, Xiangnan He2, Yuhan Quan1, Yong Li1, Tat-Seng Chua2, Depeng Jin1 and Jiajie Yu3
1 Beijing National Research Center for Information Science and Technology (BNRist), Department of Electronic Engineering, Tsinghua University
2 School of Computing, National University of Singapore
3 Beibei Inc
liyong07@tsinghua.edu.cn

Abstract
Most existing recommender systems leverage the primary feedback data only, such as purchase records in E-commerce. In this work, we additionally integrate view data into implicit feedback based recommender systems (dubbed as implicit recommender systems). We propose to model the pairwise ranking relations among purchased, viewed, and non-viewed interactions, which is more effective and flexible than typical pointwise matrix factorization (MF) methods. However, such a pairwise formulation poses efficiency challenges in learning the model. To address this problem, we design a new learning algorithm based on the element-wise Alternating Least Squares (eALS) learner. Notably, our algorithm can efficiently learn model parameters from the whole user-item matrix (including all missing data), with a rather low time complexity that depends on the observed data only. Extensive experiments on two real-world datasets demonstrate that our method outperforms several state-of-the-art MF methods by 10%–28.4%. Our implementation is available at: https://github.com/dingjingtao/View_enhanced_ALS.

1 Introduction
Recent research on recommendation has shifted from explicit ratings [Koren, 2010] to implicit feedback, such as purchases, clicks, and watches [Bayer et al., 2017]. Distinct from explicit feedback like ratings, in implicit feedback data the negative signal about user preference over items is naturally scarce, resulting in a one-class learning problem [Pan et al., 2008; Zhao et al., 2015a].
Therefore, to learn from implicit feedback, it is crucial to account for both observed and missing data. A state-of-the-art MF method for implicit feedback is eALS [He et al., 2016], which treats all missing data as negative feedback but with a lower weight. This whole-data based formulation has been shown to be superior to the prevalent sampling-based method [Rendle et al., 2009] that models partial missing data only.

In an online information system, in addition to the primary feedback data that is directly related to the business KPI, other kinds of user feedback data are also available [Pan et al., 2015; Tang et al., 2016; Ding et al., 2018]. For example, on E-commerce sites, a user must view a product (i.e., click the product page) before purchasing it. This kind of view data provides a valuable signal of user preference, which can complement the purchase data in two ways. First, if a user views an item, regardless of whether it is purchased, the view at least reflects that the user is interested in the item (i.e., a positive signal, compared to non-viewed items). Second, if a user views a product but does not purchase it afterward, the item is of less interest to the user (i.e., a negative signal, compared to purchased items). As such, view data can be seen as an intermediate feedback between purchase and missing data, which enriches the two-level implicit feedback with multiple levels that better distinguish user preference.

In this work, we aim to integrate the valuable view data into the state-of-the-art eALS method, so as to enhance the performance of implicit recommender systems. Nevertheless, it is non-trivial to integrate such intermediate feedback into eALS, which is designed for learning from binary 0/1 data only. Specifically, it performs regression by treating purchased interactions as having a label of 1, and missing interactions as having a label of 0.
While an intuitive solution is to assign view interactions an intermediate label, i.e., a value between 0 and 1, it is difficult to set a proper value for each interaction. Setting a uniform value for all intermediate interactions oversimplifies the problem, which is sub-optimal or may even degrade the performance if the value is set improperly.

In this work, we make a novel technical contribution in integrating intermediate feedback into eALS. Instead of assigning a specific label value for point-wise learning, we propose a new solution that models the relative preference orders among different interactions. Specifically, to encode the twofold semantics of view data, we consider the pairwise ranking relations between 1) purchased and viewed interactions, and 2) viewed and non-viewed interactions. The idea is to regularize eALS by enforcing that the predictions of a user over purchased items should be larger than those over viewed items; the same regularization applies to viewed and non-viewed items. Despite its soundness, this solution poses strong challenges to learning efficiency. In particular, the large number of non-viewed interactions (i.e., missing entries) makes even point-wise regression over them unaffordable [He et al., 2016], not to mention the pairwise comparisons between viewed and non-viewed interactions. To make the learning tractable, we develop a fast algorithm that leverages the bilinear structure of MF to achieve speedups. Through rigorous mathematical analysis, we identify computational bottlenecks in optimization, and resolve them via careful designs of memoization strategies.

We summarize the contributions of the paper as follows.
1. We improve implicit recommender systems by incorporating view data, proposing a View-enhanced eALS (VALS) method that models the pairwise relations among purchased, viewed, and non-viewed interactions.
2. We propose a fast algorithm that solves the challenging VALS problem with a controllable time complexity that is determined by the number of observed interactions only.
3. We conduct extensive experiments on two real datasets, demonstrating that our method outperforms state-of-the-art view-aware recommender systems by a large margin.

2 Related Work
To improve implicit recommender systems with multiple types of feedback, two types of methods have been proposed.

Model-based. A typical method of this type is collective matrix factorization (CMF), which performs multi-relational learning by sharing information between models of different feedback [Singh and Gordon, 2008; Cheng et al., 2014; Yuan et al., 2014]. While CMF originated in explicit rating prediction, it has been extended to implicit recommender systems as well [Krohn-Grimberghe et al., 2012; Zhao et al., 2015b; Cao et al., 2017]. However, as CMF-based models generate different user-item relations, i.e., latent factors, for each type of feedback, it is hard to differentiate their preference levels. In contrast, our VALS method learns the same user-item relation to indicate the relative preference order among purchase and view data, which is more effective.

Learning-based. Several other methods integrate multiple types of feedback in the negative sampler of BPR [Lerche and Jannach, 2014]. Specifically, Multi-channel BPR (MC-BPR) assigns different preference levels to different types of user feedback and then, during training, samples item pairs based on these preference orders [Loni et al., 2016]. A recent proposal also performs biased sampling from viewed but non-purchased items and observes significant performance improvements [Ding et al., 2018].
However, these BPR-based solutions still suffer from the shortcomings of sampling-based methods, i.e., degradation in both performance and fidelity, as well as expensive tuning of the learning rate. Our VALS method differs from them by integrating different preference levels into a whole-data based learning strategy. To our knowledge, VALS is the first attempt to exploit different preference levels of implicit feedback in whole-data based MF methods.

3 Preliminaries
We start by introducing some basic notation. For a user-item interaction matrix $R \in \mathbb{R}^{M \times N}$, $M$ and $N$ denote the numbers of users and items, respectively, and $\mathcal{R}$ denotes the set of user-item pairs that have interactions. For a specific user $u$, vector $p_u \in \mathbb{R}^K$ denotes the $K$-dimensional latent feature vector, and set $\mathcal{R}_u$ denotes the set of items that $u$ has interacted with. Similarly, for an item $i$, the notations $q_i$ and $\mathcal{R}_i$ are used. Matrices $P \in \mathbb{R}^{M \times K}$ and $Q \in \mathbb{R}^{N \times K}$ denote the latent factor matrices for users and items. The standard MF model is used for prediction: each entry $r_{ui}$ of $R$ is estimated as $\hat{r}_{ui} = \langle p_u, q_i \rangle = p_u^T q_i$. To learn the user/item latent factors, [He et al., 2016] developed the eALS method, which introduces a weighted regression function that assigns a zero $r_{ui}$ value to missing entries with a confidence weight:

$$\sum_{(u,i) \in \mathcal{R}} \omega_{ui} (\hat{r}_{ui} - r_{ui})^2 + \sum_{u=1}^{M} \sum_{i \notin \mathcal{R}_u} s_i \hat{r}_{ui}^2 + \lambda \left( \|P\|_F^2 + \|Q\|_F^2 \right), \quad (1)$$

where $\omega_{ui}$ denotes the weight of entry $(u,i) \in \mathcal{R}$. Determined by an item popularity-aware weighting strategy, $s_i$ denotes the confidence that an item $i$ missed by users is a true negative assessment. The above problem can be solved with an ALS-based technique by iteratively optimizing each coordinate of a latent vector while leaving the others fixed [Liu et al., 2016]. The time complexity is $O((M+N)K^2 + |\mathcal{R}|K)$ for one iteration, which approaches that of the SGD method, $O(|\mathcal{R}|K)$, when $(M+N)K$ and $|\mathcal{R}|$ are close. Due to the space limit, we leave out the update rules for user/item factors [He et al., 2016].
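For concreteness, Eq. (1) can be evaluated directly by looping over users: observed (purchased) entries are regressed toward 1, and every missing entry is pushed toward 0 with confidence $s_i$. The following minimal numpy sketch uses illustrative names and uniform weights; it is an assumption-laden sketch, not the paper's released implementation.

```python
import numpy as np

def eals_objective(P, Q, purchased, s, lam, w=1.0):
    """Whole-data eALS objective (Eq. 1): weighted squared error on
    observed entries plus confidence-weighted penalty on all missing
    entries, plus L2 regularization.  Illustrative names only.

    P: (M, K) user factors; Q: (N, K) item factors.
    purchased: dict mapping user id -> set of purchased item ids.
    s: (N,) confidence weights s_i for missing entries.
    """
    loss = lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    N = Q.shape[0]
    for u, items in purchased.items():
        r_hat = P[u] @ Q.T                      # predictions over all N items
        for i in items:                         # observed entries, target 1
            loss += w * (r_hat[i] - 1.0) ** 2
        missing = np.setdiff1d(np.arange(N), list(items))
        loss += np.sum(s[missing] * r_hat[missing] ** 2)  # missing entries, target 0
    return loss
```

A real implementation never materializes the full prediction matrix this way; eALS avoids the cost via the caching tricks that the fast algorithm in Section 4.2 extends.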
Since eALS learns latent factors from the whole missing data, it not only retains the model's fidelity but also achieves higher accuracy than sampling-based methods that sample negative instances from the missing data. Considering these advantages, we choose to develop a view-enhanced whole-data based method on top of this framework (details in Section 4).

4 Our View-enhanced eALS Method
Based on eALS, we consider how to effectively learn user preference from both purchase and view data. First, in order to differentiate their preference levels using the same user-item relation, we consider the pairwise ranking order between a viewed item and a purchased (or non-viewed) item in the objective function. More specifically, we use two margins to describe the intermediate preference level of a user's viewed interactions, which is lower than that of purchased ones but higher than that of non-viewed ones. On the other hand, introducing the above pairwise relations in the view-enhanced objective function makes the original eALS learning algorithm N times slower, which is unsuitable for large-scale data. Therefore, we further develop a fast VALS learning algorithm to efficiently optimize the view-enhanced objective function. For readability, we summarize the major notations of the paper in Table 1.

Notation | Description
M, N, K | The numbers of users, items, and factors.
P, {p_u} | The latent factor matrix and vectors for users.
Q, {q_i} | The latent factor matrix and vectors for items.
R, R_u, R_i | The sets of all purchased (u, i) pairs, items purchased by u, and users that have purchased i.
V, V_u, V_i | Similar notations for viewed interactions.
RV, RV_u, RV_i | Similar notations for the union of R and V, i.e., R ∪ V.
r̂_ui, r̂_uv, r̂_uj | Predictions of user u over purchased item i, viewed item v, and non-viewed item j.
ω_ui | Weight of the purchased interaction (u, i).
s_j | Item-oriented weight of item j in missing data.
c_v | Item-oriented weight of item v in view data.
γ1, γ2 | Margins between r̂_ui and r̂_uv, and between r̂_uv and r̂_uj.
λ | Regularization parameter.
Table 1: List of commonly used notations.

4.1 View-enhanced Objective Function
In E-commerce recommender systems, besides purchases as the primary feedback directly related to optimizing the conversion rate, the view logs of users can intuitively be treated as intermediate feedback between the purchased and missing data. Therefore, for user u's viewed item v, the prediction $\hat{r}_{uv}$ should take an intermediate value between those of a non-viewed item j (i.e., a missing entry) and a purchased item i, i.e., between $\hat{r}_{uj}$ and $\hat{r}_{ui}$. However, it is difficult to choose an appropriate $\hat{r}_{uv}$ for different users. To solve this problem, instead of assigning a uniform target value $r_{uv}$ for $\hat{r}_{uv}$ to optimize, as in the commonly used squared loss $(r_{uv} - \hat{r}_{uv})^2$, we consider the pairwise ranking between $\hat{r}_{uv}$ and the predictions over other items, i.e., $\hat{r}_{uj}$ and $\hat{r}_{ui}$. By this means, we are able to differentiate the preference levels between view feedback and the others in a more accurate and flexible way. Our view-enhanced objective function is then designed as:

$$L = L_{eALS} + L_{Reg} + L_{view} = \sum_{(u,i) \in \mathcal{R}} \omega_{ui} (\hat{r}_{ui} - r_{ui})^2 + \sum_{u=1}^{M} \sum_{j \notin \mathcal{RV}_u} s_j \hat{r}_{uj}^2 + \lambda \left( \|P\|_F^2 + \|Q\|_F^2 \right)$$
$$+ \sum_{(u,v) \in \mathcal{V}} c_v \Big[ \sum_{i \in \mathcal{R}_u} \big( \gamma_1 - (\hat{r}_{ui} - \hat{r}_{uv}) \big)^2 + \sum_{j \notin \mathcal{RV}_u} \big( \gamma_2 - (\hat{r}_{uv} - \hat{r}_{uj}) \big)^2 \Big]. \quad (2)$$

Note that this objective function can be divided into three terms: the first two represent the prediction error and the regularizer from the original eALS solution, while the last term $L_{view}$ describes the intermediate preference level of the view signal. For u's viewed item v, the pairwise ranking between v and a purchased item i or a non-viewed item j is enforced through a margin-based loss. More specifically, the prediction $\hat{r}_{uv}$ is optimized to be lower than $\hat{r}_{ui}$ by a margin $\gamma_1$, as indicated by the $(\gamma_1 - (\hat{r}_{ui} - \hat{r}_{uv}))^2$ term.
Similarly, $\hat{r}_{uv}$ is optimized to be higher than $\hat{r}_{uj}$ by another margin $\gamma_2$. For each (u, v), $N - |\mathcal{V}_u|$ terms are considered in the pairwise ranking loss, which leads to a total time cost of $O(|\mathcal{V}|(N - |\mathcal{V}_u|))$. The loss of each (u, v) is aggregated with a v-dependent weight $c_v$, which indicates the weight of view data in L. By varying the margin values $(\gamma_1, \gamma_2)$, we are able to control the possible scoring range of predictions over viewed items, and thus search for the best preference level of the view signal. Without loss of generality, we only consider those viewed but not purchased items, i.e., $\mathcal{R} \cap \mathcal{V} = \emptyset$.

4.2 Fast Learning Algorithm
As mentioned above, the pairwise ranking loss between a viewed item and another item introduces nearly $|\mathcal{V}| \cdot N$ terms in the $L_{view}$ part of Eq. (2), making the time cost N times higher if we directly use the original eALS method [He et al., 2016]. To overcome this efficiency challenge, we develop a fast VALS learning algorithm that speeds up the learning process by avoiding massive repeated computations in $L_{view}$. We detail this for the user latent factors; the counterpart for item factors can be derived likewise and is omitted due to the space limitation.

First, following the idea of the ALS technique, user u's f-th latent factor is updated by setting $\partial L / \partial p_{uf}$ to 0 while fixing the others. Since $L = L_{eALS} + L_{Reg} + L_{view}$ and the speed-up strategies for $L_{eALS}$ and $L_{Reg}$ have been discussed in [He et al., 2016], we only present the speed-up of the $L_{view}$ term. According to Eq. (2), we obtain the derivative of $L_{view}$ w.r.t. $p_{uf}$:

$$\frac{\partial L_{view}}{\partial p_{uf}} = 2 \sum_{v \in \mathcal{V}_u} c_v \Big[ \sum_{i \in \mathcal{R}_u} (q_{if} - q_{vf})^2 \, p_{uf} \ \ (3) \quad + \sum_{j \notin \mathcal{RV}_u} (q_{jf} - q_{vf})^2 \, p_{uf} \ \ (4)$$
$$+ \sum_{i \in \mathcal{R}_u} (\hat{r}^f_{ui} - \hat{r}^f_{uv})(q_{if} - q_{vf}) \ \ (5) \quad + \sum_{j \notin \mathcal{RV}_u} (\hat{r}^f_{uj} - \hat{r}^f_{uv})(q_{jf} - q_{vf}) \ \ (6)$$
$$- \sum_{i \in \mathcal{R}_u} \gamma_1 (q_{if} - q_{vf}) \ \ (7) \quad + \sum_{j \notin \mathcal{RV}_u} \gamma_2 (q_{jf} - q_{vf}) \ \ (8) \Big],$$

where $\hat{r}^f_u = \hat{r}_u - p_{uf} q_f$, i.e., the prediction without the component of latent factor f.
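To see why the naive approach is expensive, consider evaluating the $L_{view}$ term of Eq. (2) for a single user: every viewed item is paired with every purchased item and every non-viewed item, i.e., $O(|\mathcal{V}_u| \cdot N)$ work per user. A minimal sketch under hypothetical names:

```python
import numpy as np

def lview_user_naive(p_u, Q, R_u, V_u, c, gamma1, gamma2):
    """Naive per-user evaluation of the L_view term of Eq. (2).
    Every viewed item v is compared against every purchased item i and
    every non-viewed item j, so the cost is O(|V_u| * N).
    Illustrative sketch; names are not from the released code."""
    N = Q.shape[0]
    r_hat = Q @ p_u                             # predictions for all N items
    RV_u = set(R_u) | set(V_u)
    non_viewed = [j for j in range(N) if j not in RV_u]
    loss = 0.0
    for v in V_u:
        # viewed item should score gamma1 below each purchased item ...
        loss += c[v] * sum((gamma1 - (r_hat[i] - r_hat[v])) ** 2 for i in R_u)
        # ... and gamma2 above each non-viewed (missing) item
        loss += c[v] * sum((gamma2 - (r_hat[v] - r_hat[j])) ** 2 for j in non_viewed)
    return loss
```

The gradient above inherits the same blow-up over the non-viewed set, which is what the decomposition in the following paragraphs removes.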
Clearly, the bottleneck lies in terms (4), (6), and (8), which contain summations over item pairs (v, j) introduced by the pairwise ranking between viewed and non-viewed items. A raw implementation takes $O(|\mathcal{V}_u|(N - |\mathcal{RV}_u|))$ time. To solve this inefficiency, we first break the summations over item pairs down into independent summations, each over one item index only, which reduces the time complexity to $O(N - |\mathcal{RV}_u|)$. Then, we further apply a memoization strategy to avoid the massive repeated computations over non-viewed items j, achieving efficient learning in $O(|\mathcal{RV}_u| + K)$ time. We detail this process for term (6) of $\partial L_{view} / \partial p_{uf}$; the rest can be handled likewise. More specifically, we obtain

$$\sum_{v \in \mathcal{V}_u} \sum_{j \notin \mathcal{RV}_u} c_v (\hat{r}^f_{uj} - \hat{r}^f_{uv})(q_{jf} - q_{vf}) = \Big( \sum_{v \in \mathcal{V}_u} c_v \Big) \sum_{j \notin \mathcal{RV}_u} \hat{r}^f_{uj} q_{jf} - \Big( \sum_{v \in \mathcal{V}_u} c_v q_{vf} \Big) \sum_{j \notin \mathcal{RV}_u} \hat{r}^f_{uj}$$
$$- \Big( \sum_{v \in \mathcal{V}_u} c_v \hat{r}^f_{uv} \Big) \sum_{j \notin \mathcal{RV}_u} q_{jf} + \Big( \sum_{v \in \mathcal{V}_u} c_v \hat{r}^f_{uv} q_{vf} \Big) \sum_{j \notin \mathcal{RV}_u} 1. \quad (9)$$

By this reformulation, we observe that each term above factorizes into two parts, each dependent on only one item index, i.e., the viewed item v or the non-viewed item j. The original summation over item pairs (v, j) is thus broken down into independent summations. As the summations over v take $O(|\mathcal{V}_u|)$ time, the current bottleneck falls onto the j-dependent summations, which require a traversal of the whole negative space, taking nearly O(N) time. Therefore, we further speed up the calculation of these terms:

$$\sum_{j \notin \mathcal{RV}_u} \hat{r}^f_{uj} = \sum_{j=1}^{N} \hat{r}^f_{uj} - \sum_{j \in \mathcal{RV}_u} \hat{r}^f_{uj} = \sum_{k \neq f} p_{uk} \sum_{j=1}^{N} q_{jk} - \sum_{j \in \mathcal{RV}_u} \hat{r}^f_{uj};$$
$$\sum_{j \notin \mathcal{RV}_u} \hat{r}^f_{uj} q_{jf} = \sum_{k \neq f} p_{uk} \sum_{j=1}^{N} q_{jk} q_{jf} - \sum_{j \in \mathcal{RV}_u} \hat{r}^f_{uj} q_{jf}. \quad (10)$$

As shown above, the major computation lies in the $\sum_{j=1}^{N} q_{jk}$ and $\sum_{j=1}^{N} q_{jk} q_{jf}$ terms, which are independent of u. Therefore, when updating the latent factors of different users, it is unnecessary to recompute these terms.
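The whole-minus-observed trick of Eq. (10) can be sketched as follows: the full-item sums are precomputed once (they do not depend on u), and each user then pays only for its observed set. All names are illustrative, not from the released code.

```python
import numpy as np

def missing_sums(P, Q, u, f, RV_u, Dq=None, Eq=None):
    """Eq. (10): compute sum_{j not in RV_u} rhat^f_uj and
    sum_{j not in RV_u} rhat^f_uj * q_jf as (whole sum) - (observed sum).
    The u-independent item sums Dq = sum_j q_j and Eq = sum_j q_j q_j^T
    are precomputed once and shared across all users."""
    if Dq is None:
        Dq = Q.sum(axis=0)                 # (K,)  cache, one pass over items
    if Eq is None:
        Eq = Q.T @ Q                       # (K,K) cache, one pass over items
    mask = np.ones(P.shape[1], dtype=bool)
    mask[f] = False                        # all factors k != f
    # rhat^f_uj over the observed set only: O(|RV_u| * K) work
    r_f_obs = Q[RV_u][:, mask] @ P[u][mask]
    sum1 = P[u][mask] @ Dq[mask] - r_f_obs.sum()
    sum2 = P[u][mask] @ Eq[mask, f] - r_f_obs @ Q[RV_u, f]
    return sum1, sum2
```

Brute-forcing the same sums over all non-observed items gives identical values, but costs O(NK) per user per factor instead of O(|RV_u|K) plus a one-time cache build.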
We define the cache $D^q = \sum_{i=1}^{N} q_i$ and the cache $E^q = \sum_{i=1}^{N} q_i q_i^T$, which can be pre-computed once and reused when updating the latent factors of all users. Based on these two caches, Eq. (10) can be further evaluated as:

$$\sum_{k \neq f} p_{uk} d^q_k - \sum_{j \in \mathcal{RV}_u} \hat{r}^f_{uj} \quad \text{and} \quad \sum_{k \neq f} p_{uk} e^q_{kf} - \sum_{j \in \mathcal{RV}_u} \hat{r}^f_{uj} q_{jf}, \quad (11)$$

which can be done in $O(K + |\mathcal{RV}_u|)$ time. Overall, Eq. (9) can be calculated efficiently with the help of the $D^q$ and $E^q$ caches. For the remaining terms of $\partial L_{view} / \partial p_{uf}$, we apply the same strategy of breaking down and memoizing summations, calculating them in $O(K + |\mathcal{RV}_u|)$ time as well. Combined with the previous solution for $\partial (L_{eALS} + L_{Reg}) / \partial p_{uf}$, the time complexity for updating $p_{uf}$ is still $O(K + |\mathcal{RV}_u|)$. Similarly, for updating the item latent factors, we achieve a time complexity of $O(K + |\mathcal{RV}_i|)$ by applying the same speed-up strategy. Algorithm 1 summarizes the accelerated algorithm of our VALS learner; note that solving $\partial (L_{eALS} + L_{Reg}) / \partial p_{uf}$ and $\partial (L_{eALS} + L_{Reg}) / \partial q_{if}$ uses the original eALS method. Overall, one VALS iteration takes $O((M+N)K^2 + (|\mathcal{R}| + |\mathcal{V}|)K)$ time, which depends only on the number of observed interactions.

Note that some previous works [Takács and Tikk, 2012; Zhang et al., 2017] have also used a squared loss with learning methods based on ALS or coordinate descent. However, compared to VALS, they only solve a sub-problem with a single feedback type and have no tunable margin to control the pairwise ranking relations. Also, the update of user vectors is embarrassingly parallel, whereas the updates of item vectors can influence each other and thus introduce approximation loss under parallelization; we plan to address this issue in future work.

5 Experiments
5.1 Experimental Settings
Datasets and Preprocessing. We perform experiments on two real-world datasets, Beibei and Tmall:

Algorithm 1: Fast VALS Learning algorithm.
Input: R, V, K, λ, W, item confidence vectors {s, c}, and margin values {γ1, γ2}
Output: Latent feature matrices P and Q
1: Randomly initialize P and Q
2: while the stopping criterion is not met do
3:   // Update user factors
4:   for u ← 1 to M do
5:     for f ← 1 to K do
6:       Calculate each term in ∂(L_eALS + L_Reg)/∂p_uf
7:       Calculate each term in ∂L_view/∂p_uf
8:       Update p_uf by setting ∂(L_eALS + L_Reg + L_view)/∂p_uf = 0
11:  // Update item factors
12:  for i ← 1 to N do
13:    for f ← 1 to K do
14:      Calculate each term in ∂(L_eALS + L_Reg)/∂q_if
15:      Calculate each term in ∂L_view/∂q_if
16:      Update q_if by setting ∂(L_eALS + L_Reg + L_view)/∂q_if = 0

Beibei1: Beibei is the largest E-commerce platform for maternal and infant products in China. We sample a subset of user interactions containing views and purchases in the period from 2017/05/25 to 2017/06/28.
Tmall2: Tmall is the largest E-commerce platform in China. To make our results reproducible, we use a public benchmark released in the IJCAI-2015 challenge3.

We take three steps for data preprocessing. First, we merge repeated purchases into one purchase with the earliest timestamp, as we aim to recommend novel items for a user to purchase. Second, we filter out users' views of already-purchased items to avoid information leakage. Third, we filter out users and items with fewer than 12 and 16 purchase interactions, respectively, to alleviate the data sparsity issue. Table 2 summarizes the data statistics.

Dataset | Purchase# | View# | User# | Item# | Sparsity
Beibei | 2,654,467 | 23,668,454 | 158,907 | 119,012 | 99.99% / 99.87%
Tmall | 352,768 | 1,585,225 | 28,059 | 32,339 | 99.96% / 99.83%
Table 2: Statistics of the evaluation datasets.

Evaluation Methodology. In the evaluation, we adopt the leave-one-out protocol [Rendle et al., 2009; He et al., 2016], where the latest purchase interaction of each user is held out for testing and the models are trained on the remaining data.
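The leave-one-out protocol can be sketched per user as follows: rank all non-purchased items by predicted score and check the position of the single held-out purchase at a cutoff. HR and NDCG below follow their standard single-test-item definitions; the function names are hypothetical.

```python
import numpy as np

def hr_ndcg_at_k(scores, held_out, train_items, k=100):
    """Leave-one-out evaluation for one user: rank all non-purchased
    items by predicted score and measure Hit Ratio and NDCG at cutoff k
    for the single held-out purchase.  Illustrative sketch only."""
    scores = scores.copy()
    scores[list(train_items)] = -np.inf        # exclude training purchases
    rank = int(np.sum(scores > scores[held_out]))  # 0-based rank of test item
    hr = 1.0 if rank < k else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0
    return hr, ndcg
```

Averaging these per-user values over all users yields the HR@100 and NDCG@100 numbers reported in the experiments.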
For the metrics, we employ Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) on the ranking of all non-purchased items for a user. We truncate the ranked list at position 100 and report the average score over all users.

1 http://www.beibei.com/
2 https://www.tmall.com/
3 The dataset is downloaded from https://tianchi.aliyun.com/datalab/dataSet.htm?id=5

Parameter | Method | Tuning Range | Beibei | Tmall
s0 | eALS | [1, 2, 4, 8, 16, 32, 64] × 10^2 | 1600 | 800
ε_BPR | (MR/MC-)BPR | [0.05, 0.1, 0.5, 1, 5] × 10^-2 | 0.001 | 0.01
α | MR-BPR | [0.125, 0.25, 0.5, 1, 2, 4] | 1 | 2
β1, β2 | MC-BPR | [0.01, 0.05, 0.1, 0.5, 1, 5, 10], [0.01, 0.05, 0.1, 0.5, 1] | 5, 0.05 |
ε_MFPR | MFPR | [0.05, 0.1, 0.5, 1, 5] × 10^-2 | 0.001 | 0.001
Table 3: Parameter exploration for baselines and optimal settings.

Baselines. We compare with two types of methods. For methods that use only the primary purchase feedback, we choose:
- eALS [He et al., 2016]. This is a state-of-the-art MF method for implicit recommender systems. We tuned the weight of missing data s_i.
- BPR [Rendle et al., 2009]. BPR optimizes the MF model with a pairwise ranking loss and learns model parameters with Stochastic Gradient Descent (SGD). We tuned the learning rate ε_BPR.
For the second type of methods, which integrate both purchase and view feedback, we choose:
- MR-BPR [Krohn-Grimberghe et al., 2012]. Applying the CMF method to BPR, this method exerts the impact of viewing behavior on predicting purchases with a weight α.
- MC-BPR [Loni et al., 2016]. This method samples both positive and negative instances from viewed items. Two parameters control this process: 1) β1 denotes the relative weight of viewed items when sampling positive instances; 2) β2 denotes the probability of sampling a viewed item as a negative instance.
- MFPR [Liu et al., 2017]. This is a recent method that integrates multiple types of implicit feedback.
We adapt it to our case by generating training item pairs in the same way as BPR, and use a fixed learning rate ε_MFPR in SGD.

Parameter settings. For the above baselines, we carefully explored the corresponding parameters and list the results in Table 3. Note that we uniformly set the weight of missing data as s_i = s0/N, as the effectiveness of the popularity-biased weighting strategy is beyond the scope of this paper. The score of purchased interactions r_ui and its weight ω_ui are both uniformly set to 1, the suggested values in the eALS implementation. For regularization, we set λ to 0.001 for all methods for a fair comparison. Since the findings are consistent across the number of latent factors K, we report the results for K = 32 only.

Efficiency. Our experiments on the same machine (Intel Xeon 2.10 GHz CPU) show that the actual training time per iteration of VALS on the Beibei dataset is about 75s. When the training set is reduced to 1/4, 1/2, and 3/4 of the original size, the time is 15s, 35s, and 55s, respectively, corresponding to our theoretical analysis of linearity in the size of the observed data.

5.2 Hyper-parameter Investigation
Our VALS has two groups of hyper-parameters: the margin values {γ1, γ2}, which control the pairwise ranking between viewed items and other items, and c_i, which determines the weight of view data.

[Figure 1: Impact of the weighting parameter c0 and the margin value γ on VALS's performance. Panels: (a) γ vs. HR and (b) γ vs. NDCG on Beibei (γ from 0.15 to 0.45); (c) γ vs. HR and (d) γ vs. NDCG on Tmall (γ from 0.0 to 5.0), each under different values of c0.]

Without loss of generality, we set the margin values uniformly as γ1 = γ2 = γ. Similar to the weight of missing data s_i, we use a uniform weight distribution (i.e., c_i = c0/N) and leave item-dependent weighting strategies to future work.
To find the best setting of (γ, c0), we conduct a grid search over these two parameters and report the mean values of HR and NDCG over the last 10 iterations. Figure 1 plots the prediction accuracy of VALS under different γ and c0.

First, we study the impact of the margin value γ. For Beibei (Figures 1a and 1b), we observe that the best γ values lie between 0.25 and 0.35 under different values of the weight c0. Since the prediction is optimized to indicate the preference level, this highlights the necessity of an appropriate margin γ to control the preference level of viewed interactions, which should be lower than that of purchased ones but higher than that of non-viewed ones. Surprisingly, for Tmall (Figures 1c and 1d), we observe that a relatively large margin value (γ = 3 to 5) achieves better performance. According to our definition in Eq. (2), purchased items are optimized to be predicted as 1, while the optimized predictions for non-viewed items are 0. When the margin γ between a viewed item and a purchased/non-viewed item is large, a viewed item is more likely to be predicted outside the (0, 1) range, while the optimized prediction should be inside it. Therefore, the large γ observed on the Tmall dataset indicates a higher possibility that the preference level of viewed interactions is close to that of purchased ones or non-viewed ones.

Then, we investigate the best setting of the weight c0. For Beibei (Figures 1a and 1b), the peak performance is achieved when c0 is 1.6; similarly, for Tmall (Figures 1c and 1d), the optimal c0 is 0.5. When c0 becomes smaller or too large, the performance decreases in both cases, indicating the necessity of weighting viewed interactions carefully.

According to the above parameter exploration, we fix γ and c0 according to the best performance evaluated by HR, i.e., γ = 0.3, c0 = 1.6 for Beibei and γ = 3.5, c0 = 0.5 for Tmall.
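The parameter exploration described above is a plain grid search over (γ, c0); a minimal sketch with placeholder training and evaluation routines (both are assumptions, standing in for a full VALS run and the HR evaluation):

```python
import itertools

def grid_search(train_fn, eval_fn, gammas, c0s):
    """Grid search over (gamma, c0) as in Section 5.2: train a model for
    each setting and keep the pair with the best HR.  train_fn and
    eval_fn are hypothetical placeholders for the actual routines."""
    best_setting, best_hr = None, -1.0
    for gamma, c0 in itertools.product(gammas, c0s):
        model = train_fn(gamma=gamma, c0=c0)   # e.g., run VALS to convergence
        hr = eval_fn(model)                    # e.g., mean HR over last 10 iters
        if hr > best_hr:
            best_setting, best_hr = (gamma, c0), hr
    return best_setting, best_hr
```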
Methods | eALS (HR / NDCG) | VALS (HR / NDCG) | Improvement (HR / NDCG)
Beibei | 0.1388 / 0.0345 | 0.1443 / 0.0363 | +3.96% / +5.22%
Tmall | 0.0289 / 0.0079 | 0.0675 / 0.0171 | +133.56% / +116.46%
Table 4: Performance improvement after integrating view data.

[Figure 2: eALS versus VALS, in terms of quantiles (5%, 25%, 50%, 75%, 95%) and means of the predictions over viewed items; (a) Beibei (γ = 0.3, c0 = 1.6), (b) Tmall (γ = 3.5, c0 = 0.5).]

5.3 Performance Gain of View Data
Table 4 displays the performance of our VALS method compared with eALS, w.r.t. HR@100 and NDCG@100. We report the mean values over the last 10 iterations. Clearly, after integrating view data as intermediate feedback, our method significantly outperforms the original eALS, which leverages purchase data only. For Beibei, the relative improvements in HR and NDCG are 3.96% and 5.22%, respectively. Moreover, for Tmall, we observe an improvement of over 100% on both metrics. This indicates that users' viewing behaviors on Tmall are much more valuable for learning an accurate preference order among different items.

To further explain the distinct performance gains on the two datasets, we compare the predictions over viewed items on both datasets, training the model with both eALS and VALS. Figure 2 plots the distribution quantiles (5%, 25%, 50%, 75%, 95%) and means of viewed item scores, with the results of the two methods presented together. Clearly, on both datasets the viewed items are predicted to have near-zero scores when trained with eALS, as they are treated as negative instances. However, compared to Beibei (Figure 2a), VALS generates much higher scores for viewed items on Tmall (Figure 2b), i.e., 0.72 vs. 0.15 and 0.91 vs. 0.16 in terms of median and mean value, respectively.
These higher scores for viewed items reflect the fact that viewing behavior is more closely related to purchasing behavior on Tmall, which is the main reason behind the over-100% improvement of the VALS method. Moreover, this also verifies our earlier discussion of γ: a VALS model with a large γ is more likely to generate predictions outside (0, 1), as γ is set to 3.5 for Tmall.

5.4 Performance Comparison
Figure 3 shows the prediction accuracy of each method at each training iteration. Note that we run VALS for only 200 iterations on the two datasets, and run the other methods except MC-BPR for 1500 iterations on the Beibei dataset, which is enough for them to converge.

[Figure 3: Prediction accuracy of VALS compared with the baseline methods (BPR, MC-BPR, MR-BPR, MFPR) at each training iteration.]

We have the following three key observations. First, VALS achieves the best performance after convergence. All improvements are statistically significant, as evidenced by a one-sample paired t-test (p < 0.01). The best baseline is MC-BPR. Except for a close HR metric on Beibei, VALS outperforms it by a large margin (on average, the relative improvements on Beibei and Tmall are 10.0% and 28.4%, respectively). Compared with MC-BPR, VALS mainly benefits from 1) the margin-based design that controls the pairwise ranking between the predictions for different feedback types and 2) the whole-data based strategy for handling missing data. Second, MR-BPR and MFPR achieve higher performance than the vanilla BPR that leverages purchase data only, while the relative improvement is quite insignificant compared with MC-BPR and VALS.
This highlights the necessity of exploiting the different preference levels of purchase and view data, which is lacking in MR-BPR and MFPR. Finally, VALS maintains both accuracy and fidelity, an inherent advantage of the whole-data based learning strategy. Comparatively, although MC-BPR outperforms the vanilla BPR on the Beibei dataset on both metrics, we find that it obtains a higher HR but a lower NDCG score compared to eALS, which does not integrate view data (i.e., 0.0330 vs. 0.0345). This means that the whole-data based strategy is a better candidate than the sampling-based one for improving implicit recommender systems. We notice that the BPR-based methods show an unusual spike in early iterations and performance degradation with more iterations on the Beibei dataset, which might be caused by some regularities in the data. For example, the Beibei dataset is highly popularity-skewed: the top 1% of items contribute almost 50% of the purchases. This may cause unstable performance because these popular items are ranked high in early iterations.

6 Conclusion
We study the problem of improving implicit recommender systems by integrating both purchase and view data. Based on the state-of-the-art eALS method, we model a user's viewed interactions as intermediate feedback between purchased and non-viewed interactions. To address the key efficiency challenge in optimization, we further develop a fast learning algorithm that efficiently learns parameters from the whole data instead of sampling negative instances. With these designs, our VALS method not only achieves higher accuracy, but also becomes practical for large-scale data. This work has focused on the collaborative filtering setting, which leverages only the feedback data and is mostly used in the candidate selection stage of industrial recommender systems.
In the future, we will focus more on the ranking stage, integrating view data into generic feature-based models, such as the expressive neural factorization machines [He and Chua, 2017]. In addition, our VALS method can also be applied to other domains of implicit recommender systems where various additional kinds of user feedback can be leveraged, such as news [Agarwal et al., 2012], online videos [Covington et al., 2016] and social networks [Hong et al., 2013].

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under 6171101425, 61621091 and 61673237, and by the research fund of the Tsinghua University - Tencent Joint Laboratory for Internet Innovation Technology. This project is also part of NExT, supported by the National Research Foundation, Prime Minister's Office, Singapore under its IRC@SG Funding Initiative. The corresponding author is Xiangnan He.

References

[Agarwal et al., 2012] Deepak Agarwal, Bee-Chung Chen, and Xuanhui Wang. Multi-faceted ranking of news articles using post-read actions. In CIKM, pages 694–703, 2012.
[Bayer et al., 2017] Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. A generic coordinate descent framework for learning from implicit feedback. In WWW, pages 1341–1350, 2017.
[Cao et al., 2017] Cheng Cao, Hancheng Ge, Haokai Lu, Xia Hu, and James Caverlee. What are you known for?: Learning user topical profiles with implicit and explicit footprints. In SIGIR, pages 743–752, 2017.
[Cheng et al., 2014] Jian Cheng, Ting Yuan, Jinqiao Wang, and Hanqing Lu. Group latent factor model for recommendation with multiple user behaviors. In SIGIR, pages 995–998, 2014.
[Covington et al., 2016] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for YouTube recommendations. In RecSys, pages 191–198, 2016.
[Ding et al., 2018] Jingtao Ding, Fuli Feng, Xiangnan He, Guanghui Yu, Yong Li, and Depeng Jin. An improved sampler for Bayesian personalized ranking by leveraging view data.
In WWW, pages 13–14, 2018.
[He and Chua, 2017] Xiangnan He and Tat-Seng Chua. Neural factorization machines for sparse predictive analytics. In SIGIR, pages 355–364, 2017.
[He et al., 2016] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, pages 549–558, 2016.
[Hong et al., 2013] Liangjie Hong, Aziz S Doumith, and Brian D Davison. Co-factorization machines: modeling user interests and predicting individual decisions in Twitter. In WSDM, pages 557–566, 2013.
[Koren, 2010] Yehuda Koren. Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89–97, 2010.
[Krohn-Grimberghe et al., 2012] Artus Krohn-Grimberghe, Lucas Drumond, Christoph Freudenthaler, and Lars Schmidt-Thieme. Multi-relational matrix factorization using Bayesian personalized ranking for social network data. In WSDM, pages 173–182, 2012.
[Lerche and Jannach, 2014] Lukas Lerche and Dietmar Jannach. Using graded implicit feedback for Bayesian personalized ranking. In RecSys, pages 353–356, 2014.
[Liu et al., 2016] An-An Liu, Wei-Zhi Nie, Yue Gao, and Yu-Ting Su. Multi-modal clique-graph matching for view-based 3D model retrieval. IEEE Transactions on Image Processing, 25(5):2103–2116, 2016.
[Liu et al., 2017] Jian Liu, Chuan Shi, Binbin Hu, Shenghua Liu, and S Yu Philip. Personalized ranking recommendation via integrating multiple feedbacks. In PAKDD, pages 131–143, 2017.
[Loni et al., 2016] Babak Loni, Roberto Pagano, Martha Larson, and Alan Hanjalic. Bayesian personalized ranking with multi-channel user feedback. In RecSys, pages 361–364, 2016.
[Pan et al., 2008] Rong Pan, Yunhong Zhou, Bin Cao, Nathan N Liu, Rajan Lukose, Martin Scholz, and Qiang Yang. One-class collaborative filtering. In ICDM, pages 502–511, 2008.
[Pan et al., 2015] Weike Pan, Hao Zhong, Congfu Xu, and Zhong Ming. Adaptive Bayesian personalized ranking for heterogeneous implicit feedbacks.
Knowledge-Based Systems, 73:173–180, 2015.
[Rendle et al., 2009] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pages 452–461, 2009.
[Singh and Gordon, 2008] Ajit P Singh and Geoffrey J Gordon. Relational learning via collective matrix factorization. In KDD, pages 650–658, 2008.
[Takács and Tikk, 2012] Gábor Takács and Domonkos Tikk. Alternating least squares for personalized ranking. In RecSys, pages 83–90, 2012.
[Tang et al., 2016] Liang Tang, Bo Long, Bee-Chung Chen, and Deepak Agarwal. An empirical study on recommendation with multiple types of feedback. In KDD, pages 283–292, 2016.
[Yuan et al., 2014] Ting Yuan, Jian Cheng, Xi Zhang, Shuang Qiu, Hanqing Lu, et al. Recommendation by mining multiple user behaviors with group sparsity. In AAAI, pages 222–228, 2014.
[Zhang et al., 2017] Yan Zhang, Defu Lian, and Guowu Yang. Discrete personalized ranking for fast collaborative filtering from implicit feedback. In AAAI, pages 1669–1675, 2017.
[Zhao et al., 2015a] Tong Zhao, Julian McAuley, and Irwin King. Improving latent factor models via personalized feature projection for one-class recommendation. In CIKM, pages 821–830, 2015.
[Zhao et al., 2015b] Zhe Zhao, Zhiyuan Cheng, Lichan Hong, and Ed H Chi. Improving user topic interest profiles by behavior factorization. In WWW, pages 1406–1416, 2015.

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)