
Curriculum Disentangled Recommendation with Noisy Multi-feedback

Hong Chen¹, Yudong Chen¹,², Xin Wang¹, Ruobing Xie², Rui Wang², Feng Xia², Wenwu Zhu¹

¹Tsinghua University, ²WeChat Search Application Department, Tencent. {h-chen20,cyd18}@mails.tsinghua.edu.cn, {xin_wang,wwzhu}@tsinghua.edu.cn, {ruobingxie,rysanwang,xiafengxia}@tencent.com

Learning disentangled representations for user intentions from multi-feedback (i.e., positive and negative feedback) can enhance the accuracy and explainability of recommendation algorithms. However, learning such disentangled representations from multi-feedback data is challenging because i) multi-feedback is complex: there exist complex relations among different types of feedback (e.g., click, unclick, and dislike) as well as various user intentions, and ii) multi-feedback is noisy: there exists noisy (useless) information in both features and labels, which may deteriorate the recommendation performance. Existing disentangled recommendation works focus only on positive feedback, failing to handle the complex relations and noise hidden in multi-feedback data. To solve this problem, we propose a Curriculum Disentangled Recommendation (CDR) model that is capable of efficiently learning disentangled representations from complex and noisy multi-feedback for better recommendation. Concretely, we design a co-filtering dynamic routing mechanism that simultaneously captures the complex relations among different behavioral feedback and user intentions and denoises the representations at the feature level. We then present an adjustable self-evaluating curriculum that evaluates sample difficulties for better model training and conducts denoising at the label level by disregarding useless information. Our extensive experiments on several real-world datasets demonstrate that the proposed CDR model can significantly outperform several state-of-the-art methods in terms of recommendation accuracy.³

1 Introduction

Recommenders aim to capture the user's preferences from different aspects of information for more accurate prediction [1-4]. Learning disentangled representations that can uncover and disentangle the latent explanatory factors hidden in user behavioral data has recently been shown to be an effective way to discover users' intentions, improving both recommendation accuracy and explainability [5-9]. Multi-feedback, normally including positive feedback (e.g., click) and negative feedback (e.g., unclick and dislike), is of great significance for depicting the user's unbiased and various intentions [10]. Learning disentangled representations from multi-feedback can capture various user intentions more accurately, leading to improved accuracy and explainability in recommendation.

However, learning such disentangled representations from multi-feedback that can best serve recommendation is quite challenging for two reasons. i) Multi-feedback is complex: different types of feedback in multi-feedback data have complex relations with each other, and the relations between multi-feedback and various user intentions are also complicated. For instance, a user may unclick an item because of disinclination, because of getting tired of seeing the same type of item that was previously clicked too many times, or because of lacking enough time to view the item in detail. ii) Multi-feedback is noisy: the large amount of feedback such as unclick brings a lot of noisy information (useless for recommendation) that may severely deteriorate the recommendation accuracy. For example, a user may unclick an item because they truly dislike it, or because they are interested in it but are interrupted before making the click decision. As such, fully incorporating the unclick feedback as historical features to extract user intention representations runs the risk of introducing feature-level noise, while directly regarding the unclick behavior as negative samples to train the model will probably introduce label-level noise. Existing works on disentangled representation learning rely merely on positive user feedback to extract the user intentions, failing to handle the complex relations and noisy (useless) information hidden in multi-feedback data.

Equal contributions. Corresponding authors. ³Our code will be released at https://github.com/forchchch/CDR

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

To tackle these challenges, we propose a Curriculum Disentangled Recommendation (CDR) model that is able to accurately discover various user intentions from different kinds of user feedback. The CDR model consists of two core components, i.e., the co-filtering dynamic routing mechanism and the adjustable self-evaluating curriculum strategy, which together learn disentangled representations, capture the complex relational dependencies, and filter out noise from multi-feedback to achieve more accurate recommendation. More concretely, the proposed routing mechanism utilizes the dependencies among different feedback to accurately discover the user preferences, followed by intention aggregation, which further helps to learn disentangled representations for users. Our adjustable self-evaluating curriculum then guides the model towards better optima by reweighting samples according to their self-evaluated difficulty. The proposed curriculum can further enhance the model performance by making the model learn from multi-feedback data of different difficulties at different learning paces, thus preventing the model from overfitting the data with improper labels. We conduct extensive experiments on several real-world datasets to demonstrate that our proposed CDR model can significantly outperform baseline methods in terms of recommendation accuracy.

To summarize, our contributions are as follows. (1) We propose a Curriculum Disentangled Recommendation (CDR) model to learn disentangled representations for recommendation from multi-feedback with complex relations and noisy information. (2) We propose a co-filtering dynamic routing mechanism to simultaneously capture complex relations among different sorts of feedback and various user intentions and to filter out feature-level noise in multi-feedback. (3) We propose an adjustable self-evaluating curriculum to guide the model towards better optima by alleviating the impact of label-level noise in a more controllable and convenient way.

2 Related Work

Recommendation Based on Multi-Feedback Apart from positive feedback, negative feedback data are also essential for capturing user preference [11]. Early efforts [12-14] regard all unclick data equally as negative feedback and assign them lower confidence than click data. Later works utilize additional information [15-18] or reinforcement learning [19-21] to distinguish real negative signals from all the unclick data. A recent method [22] also recognizes negative feedback among the clicked news by reading dwell time. However, all these methods merely select better negative training samples and ignore multi-feedback as features for learning user interests. DFN [10] simultaneously incorporates click, unclick, and dislike feedback, but only through a rough attention mechanism and concatenation, which is insufficient to capture the user's comprehensive and accurate interests.

Disentangled Representation Learning Disentangled representation learning aims to learn the various hidden explanatory factors behind observable data in different parts of the learned vector representation [23]. Many variants of the variational auto-encoder (VAE) [24] have been studied to improve the disentanglement of the learned representation [25-27] by adding regularization terms that decrease the mutual information between different parts of the learned vector. Disentangled representation learning has also found application in recommendation [5-8, 28, 29], where disentangled user preferences are learned from users' positive feedback to improve both performance and interpretability. Different from these works, our work focuses on learning more comprehensive disentangled interests with multi-feedback.

Curriculum Learning Curriculum learning [30, 31] aims to design a dynamic sample reweighting strategy throughout the training process to improve performance and training efficiency. Traditional methods mostly follow an easy-to-hard paradigm, i.e., assigning higher weights to easier samples in earlier training, where the easiness measurement can be either domain-knowledge-based [32-34] or loss-based [35, 36]. These methods can only be effective for specific tasks, and easy-to-hard is sometimes sub-optimal compared with harder-first [37]. Recent works propose to automatically learn a curriculum by reinforcement learning [38-40], meta-learning [41], etc. However, the optimization process of these curricula is time- or resource-consuming, which is too costly for abundant recommendation data. Our method provides an efficient and flexible curriculum that goes beyond the easy-to-hard limit and is more adjustable and interpretable than automatic methods.

In this section, we introduce the proposed Curriculum Disentangled Recommendation (CDR) model (Figure 1) to mine comprehensive and accurate user intentions from users' noisy multi-feedback.

3.1 Notations and Problem Formulation

Notations We denote $\phi(a, b)$ as the inner product of the two vectors $\mathrm{LayerNorm}(a)$ and $\mathrm{LayerNorm}(b)$, where LayerNorm refers to the Layer Normalization operation, so that $\phi(a, b)$ measures the similarity between vectors $a$ and $b$. $\mathrm{sim}(key_i, query)$ is the normalized similarity between $query$ and $key_i$ on the set $Q$ that is composed of all the keys:

$$\mathrm{sim}(key_i, query) = \frac{\exp(\phi(key_i, query))}{\sum_{i' \in Q} \exp(\phi(key_{i'}, query))}.$$
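The two notations above can be illustrated with a minimal NumPy sketch. It assumes a layer normalization without learnable gain or bias, and the array names `keys` and `query` are hypothetical; it is not the authors' implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Layer normalization without learnable gain or bias (an assumption).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def phi(a, b):
    # Inner product of the two layer-normalized vectors.
    return np.sum(layer_norm(a) * layer_norm(b), axis=-1)

def sim(keys, query):
    # Softmax of phi(key_i, query) over all keys (rows of `keys`).
    scores = phi(keys, query[None, :])
    scores = scores - scores.max()  # stabilizes exp; the softmax is unchanged
    weights = np.exp(scores)
    return weights / weights.sum()

rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))   # five hypothetical keys of dimension 8
query = keys[2].copy()           # a query identical to the third key
weights = sim(keys, query)
```

Because the query coincides with the third key, that key receives the largest normalized similarity, and the weights over all keys sum to one.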

Multi-feedback based prediction The $v$th user's historical behavior contains his or her clicked item sequence $c^{(v)} = [c^{(v)}_1, c^{(v)}_2, \ldots, c^{(v)}_m]$, unclicked (i.e., presented but not clicked) item sequence $u^{(v)} = [u^{(v)}_1, u^{(v)}_2, \ldots, u^{(v)}_n]$, and disliked (i.e., marked with the dislike button or rated low) item sequence $d^{(v)} = [d^{(v)}_1, d^{(v)}_2, \ldots, d^{(v)}_l]$, where each term in these sequences represents an item that the $v$th user interacted with. Our goal is to learn users' disentangled intentions and then accurately predict $s_{vt}$, i.e., the $v$th user's preference towards the candidate item $t^{(v)}$. Besides, we utilize the profile (e.g., age and gender) of the $v$th user to aid the learning and prediction process.

3.2 Co-filtering Dynamic Routing

The routing mechanism takes as input the user profile, the candidate item feature, and the user's multi-feedback history, and makes the final prediction in three steps. It first utilizes the relations among different kinds of feedback to discover where the user's true interests lie. Then, the model aggregates the user's disentangled intentions from the useful behavior. Finally, it predicts the user's preference towards the candidate item based on the learned intentions.

3.2.1 Interests Mining

The user's interests can be reflected by his or her various kinds of feedback. Ignoring any kind of feedback may lead to incomplete or inaccurate preference modeling. However, the noise hidden in these data makes it infeasible to directly use multi-feedback to learn the user's interests. Thus, the key in this step is how to make use of the relations among different kinds of feedback to discover where the user's true interests lie.

First, we project each item in $c^{(v)}$, $d^{(v)}$, and $u^{(v)}$ into the embedding space by concatenating its ID and category embeddings. Thus, we obtain the clicked embedding sequence $h^{(v)}_c = [h^{(v)}_{c1}, h^{(v)}_{c2}, \ldots, h^{(v)}_{cm}]$, the unclicked embedding sequence $h^{(v)}_u = [h^{(v)}_{u1}, h^{(v)}_{u2}, \ldots, h^{(v)}_{un}]$, and the disliked embedding sequence $h^{(v)}_d = [h^{(v)}_{d1}, h^{(v)}_{d2}, \ldots, h^{(v)}_{dl}]$. We also project the information of the user profile into the embedding space and obtain the user profile feature $F^{(v)} = \{F^{(v)}_1, F^{(v)}_2, \ldots, F^{(v)}_g\}$, where $g$ is the number of user profile fields. We then use transformer encoders [42] to obtain a better representation for each item through full interactions:

$$[z^{(v)}_{c1}, z^{(v)}_{c2}, \ldots, z^{(v)}_{cm}] = \text{C-Encoder}(h^{(v)}_c), \quad [z^{(v)}_{u1}, z^{(v)}_{u2}, \ldots, z^{(v)}_{un}] = \text{U-Encoder}(h^{(v)}_u), \quad [z^{(v)}_{d1}, z^{(v)}_{d2}, \ldots, z^{(v)}_{dl}] = \text{D-Encoder}(h^{(v)}_d),$$

where $z^{(v)}_{ci}$, $z^{(v)}_{ui}$, and $z^{(v)}_{di}$ are $d$-dimensional feature vectors containing the $v$th user's interests.

Figure 1: The framework of the proposed Curriculum Disentangled Recommendation (CDR) model. [Figure omitted. Panels: Co-filtering Dynamic Routing, with the stages Interests Mining, Intention Aggregation, and Prediction; Adjustable Self-evaluating Curriculum, plotting sample weight against self-evaluated difficulty over the number of training batches, with the adjustable hyperparameters difficulty thre, smoothness µ, and learning speed τ.]

We then need to think more deeply about the relations among multiple kinds of feedback to discover the user's true interests. The first thing to consider is where the user's interests lie. We note that not only does the click behavior contain user interests, but the unclick sequence generated by the recommender algorithms could also contain rich information about user interests. Therefore, utilizing both kinds of feedback could help to learn comprehensive user interests. The second noteworthy thing is how to discover the user's true interests from the noisy data. To filter out the noisy information in these historical sequences, we need the disliked item sequences, which reflect the user's strong negative intentions and have high confidence. Thus, we average these strong negative features and regard the result as the negative intention of the $v$th user: $n^{(v)} = \frac{1}{l} \sum_{i=1}^{l} z^{(v)}_{di}$. This highly confident negative intention can be used to filter the noise in both clicked and unclicked history items. We adopt the similarity between each item and the negative intention to judge whether the item should contribute to the user intentions. Based on the intuition that an item with higher similarity to the negative intention should contribute less to the user preference, the contribution of each item to the user intentions can be formulated as follows:

$$
\begin{aligned}
n'^{(v)}_c &= n^{(v)} + b_{c1}, \quad z'^{(v)}_{ci} = z^{(v)}_{ci} + \sigma_1(W_{c1} z^{(v)}_{ci} + b_{c2}), \\
d^{(v)}_{ci} &= \frac{\exp(-\phi(z'^{(v)}_{ci}, n'^{(v)}_c))}{\sum_{j=1}^{m} \exp(-\phi(z'^{(v)}_{cj}, n'^{(v)}_c))}, \\
n'^{(v)}_u &= n^{(v)} + b_{u1}, \quad z'^{(v)}_{uj} = z^{(v)}_{uj} + \sigma_1(W_{u1} z^{(v)}_{uj} + b_{u2}), \\
d^{(v)}_{uj} &= \sigma(-\phi(z'^{(v)}_{uj}, n'^{(v)}_u) + \mathrm{MLP}([F^{(v)}; z^{(v)}_{uj}])),
\end{aligned}
$$

where $\sigma_1$ is the ReLU activation function and $\sigma$ is the Sigmoid activation function. $d^{(v)}_{ci}$ and $d^{(v)}_{uj}$ represent the contribution of each clicked item and unclicked item to the user intention, respectively. We obtain $d^{(v)}_{ci}$ by calculating the similarity between each clicked item and the negative intention, and we also introduce the parameters $W_{c1}$, $b_{c1}$, $b_{c2}$ to handle outliers that disobey our intuition. The unclicked item sequence contains more noise. Thus, we utilize not only the information of the user's negative intention but also the user profile feature $F^{(v)}$ to measure the importance of each unclicked item. We first concatenate the user profile features and the unclicked item feature, and then use a two-layer MLP to judge whether the unclicked item is important.
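The contribution scores described above can be sketched in NumPy as follows. All sizes, the hidden width of the MLP, and the randomly initialized parameters are assumptions made for illustration; in the model these parameters are learned, and the encoder outputs `z_c`, `z_u`, `z_d` would come from the transformer encoders rather than a random generator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, l, g = 8, 4, 6, 3, 2   # illustrative sizes: feature dim, #clicked, #unclicked, #disliked, #profile fields

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def phi(a, b):
    return np.sum(layer_norm(a) * layer_norm(b), axis=-1)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init(*shape):
    # Random stand-in for a learnable parameter.
    return 0.1 * rng.normal(size=shape)

# Stand-ins for the encoder outputs and the concatenated profile fields F.
z_c, z_u, z_d = rng.normal(size=(m, d)), rng.normal(size=(n, d)), rng.normal(size=(l, d))
profile = rng.normal(size=(g * d,))

W_c1, b_c1, b_c2 = init(d, d), init(d), init(d)
W_u1, b_u1, b_u2 = init(d, d), init(d), init(d)
M1, M2 = init(g * d + d, 16), init(16, 1)   # two-layer MLP, hidden size 16 assumed

neg = z_d.mean(axis=0)                      # negative intention n: mean of disliked features

# Clicked items: softmax over the negated similarity to the negative intention.
z_c_prime = z_c + relu(z_c @ W_c1.T + b_c2)
logits = -phi(z_c_prime, (neg + b_c1)[None, :])
d_c = np.exp(logits - logits.max())
d_c /= d_c.sum()

# Unclicked items: sigmoid gate that also looks at the user profile.
z_u_prime = z_u + relu(z_u @ W_u1.T + b_u2)
mlp_in = np.concatenate([np.tile(profile, (n, 1)), z_u], axis=1)
mlp_out = (relu(mlp_in @ M1) @ M2)[:, 0]
d_u = sigmoid(-phi(z_u_prime, (neg + b_u1)[None, :]) + mlp_out)
```

In this sketch the clicked contributions compete with each other through the softmax, whereas each unclicked item is gated independently by the sigmoid.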

Besides the noise effect, we also take the time and candidate item factors into consideration. Intuitively, more recently clicked items and items having a higher similarity to the candidate item will contribute more to the user preference towards the candidate item. The formulation is as follows:

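$$
\begin{aligned}
time^{(v)}_{ci} &= z^{(v)}_{ci} + \sigma_1(W_{c2}[z^{(v)}_{ci}; p_i] + b_{c3}), \quad time'^{(v)}_{cm} = z^{(v)}_{cm} + \sigma_1(W_{c3}[z^{(v)}_{cm}; p_m] + b_{c4}), \\
cand^{(v)}_{ci} &= z^{(v)}_{ci} + \sigma_1(W_{c4} z^{(v)}_{ci} + b_{c5}), \quad z'^{(v)}_{ct} = h^{(v)}_t + b_{c6}, \\
f^{(v)}_{ci} &= \mathrm{sim}(time^{(v)}_{ci}, time'^{(v)}_{cm}) + \mathrm{sim}(cand^{(v)}_{ci}, z'^{(v)}_{ct}), \\
cand^{(v)}_{uj} &= z^{(v)}_{uj} + \sigma_1(W_{u2} z^{(v)}_{uj} + b_{u3}), \quad z'^{(v)}_{ut} = h^{(v)}_t + b_{u4}, \quad f^{(v)}_{uj} = \mathrm{sim}(cand^{(v)}_{uj}, z'^{(v)}_{ut}).
\end{aligned}
$$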

We consider the impact of both time and candidate factors for the clicked items. The candidate item feature $h^{(v)}_t \in \mathbb{R}^d$ and the most recently clicked item feature $z^{(v)}_{cm}$ measure the importance of each clicked item. $p_i$ is the position embedding of each item, encoded for more accurate consideration of the time factor. However, for the unclick sequence, we only consider the candidate factor, because the time of unclicking an item has little relation to the user's current interests. $f^{(v)}_{ci}$ and $f^{(v)}_{uj}$ are the importance of each clicked item and unclicked item, respectively.
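A minimal sketch of the clicked-item importance $f_{ci}$ is given below. It assumes randomly initialized stand-ins for the learnable parameters, random position embeddings, and clicked items ordered from oldest to most recent; the unclicked case keeps only the candidate term.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 5   # illustrative feature size and number of clicked items

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def sim(keys, query):
    # Normalized similarity of every key (row) to the query.
    scores = np.sum(layer_norm(keys) * layer_norm(query[None, :]), axis=-1)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def init(*shape):
    # Random stand-in for a learnable parameter.
    return 0.1 * rng.normal(size=shape)

z_c = rng.normal(size=(m, d))   # encoded clicked items, oldest first (assumed order)
pos = rng.normal(size=(m, d))   # position embeddings p_i
h_t = rng.normal(size=(d,))     # candidate item feature

W_c2, W_c3, W_c4 = init(d, 2 * d), init(d, 2 * d), init(d, d)
b_c3, b_c4, b_c5, b_c6 = init(d), init(d), init(d), init(d)

zp = np.concatenate([z_c, pos], axis=1)              # [z_ci; p_i]
time_c = z_c + relu(zp @ W_c2.T + b_c3)              # time-aware item features
time_last = z_c[-1] + relu(zp[-1] @ W_c3.T + b_c4)   # most recent click as query
cand_c = z_c + relu(z_c @ W_c4.T + b_c5)             # candidate-aware item features
z_ct = h_t + b_c6                                    # shifted candidate as query

f_c = sim(time_c, time_last) + sim(cand_c, z_ct)
```

Since `f_c` is the sum of two normalized similarities, its entries add up to two across the clicked sequence.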

3.2.2 Intention Aggregation

After considering how each kind of feedback contributes to the user's intentions, we conduct intention aggregation to obtain the $v$th user's intentions under various latent categories. Assuming that the intentions of all users can be decomposed into $K$ latent categories and that each latent category has its prototype $m_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$, we can predict the probability of each item belonging to the $k$th latent category by their similarities:

$$c^{(v)}_{ik} = \mathrm{sim}(m_k, z^{(v)}_{ci}), \quad u^{(v)}_{jk} = \mathrm{sim}(m_k, z^{(v)}_{uj}). \qquad (3)$$

We then use the highly confident negative intention to filter the click and unclick historical item features by a residual structure. The $k$th intention of the $v$th user can be aggregated from all the filtered clicked and unclicked item features, $z1^{(v)}_{ci}$ and $z1^{(v)}_{uj}$, as in Eq. (4). $\beta_k$ is the bias for the $k$th latent category. $\lambda < 1$ is the prior confidence of the unclicked items. $inten^{(v)}_k$ is the $v$th user's interest under the $k$th latent category.

$$
\begin{aligned}
z1^{(v)}_{ci} &= z^{(v)}_{ci} + \mathrm{MLP}([z^{(v)}_{ci}; n'^{(v)}_c; z^{(v)}_{ci} - n'^{(v)}_c; z^{(v)}_{ci} \odot n'^{(v)}_c]), \\
z1^{(v)}_{uj} &= z^{(v)}_{uj} + \mathrm{MLP}([z^{(v)}_{uj}; n'^{(v)}_u; z^{(v)}_{uj} - n'^{(v)}_u; z^{(v)}_{uj} \odot n'^{(v)}_u]), \\
inten^{(v)}_k &= \sum_{i=1}^{m} d^{(v)}_{ci} f^{(v)}_{ci} c^{(v)}_{ik} (z1^{(v)}_{ci} + \beta_k) + \lambda \sum_{j=1}^{n} d^{(v)}_{uj} f^{(v)}_{uj} u^{(v)}_{jk} (z1^{(v)}_{uj} + \beta_k).
\end{aligned} \qquad (4)
$$
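The aggregation step can be sketched as a weighted sum over items for every latent category. In the sketch below, the item weights `w_c` and `w_u` stand in for the products $d_{ci} f_{ci}$ and $d_{uj} f_{uj}$, the filtered features are random stand-ins, the value of `lam` is arbitrary, and $\beta_k$ is taken to be a vector in $\mathbb{R}^d$, which is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n, K = 8, 4, 6, 3   # illustrative sizes
lam = 0.5                 # prior confidence of unclicked items (value assumed)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def category_probs(prototypes, items):
    # sim(m_k, z_i): softmax over the K prototypes for every item.
    scores = layer_norm(items) @ layer_norm(prototypes).T
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def aggregate(weights, probs, feats, beta):
    # sum_i weights_i * probs_ik * (feats_i + beta_k) for every category k.
    shifted = feats[:, None, :] + beta[None, :, :]   # shape (items, K, d)
    return np.einsum("i,ik,ikd->kd", weights, probs, shifted)

prototypes = rng.normal(size=(K, d))
beta = 0.1 * rng.normal(size=(K, d))                 # per-category bias, vector-valued by assumption
z_c, z_u = rng.normal(size=(m, d)), rng.normal(size=(n, d))
z1_c = z_c + 0.1 * rng.normal(size=(m, d))           # stand-in for the filtered clicked features
z1_u = z_u + 0.1 * rng.normal(size=(n, d))           # stand-in for the filtered unclicked features
w_c, w_u = rng.random(m), rng.random(n)              # stand-ins for d*f item weights

p_c = category_probs(prototypes, z_c)
p_u = category_probs(prototypes, z_u)
inten = aggregate(w_c, p_c, z1_c, beta) + lam * aggregate(w_u, p_u, z1_u, beta)
```

Each row of `inten` is one intention vector, so the user is represented by $K$ vectors rather than a single one.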

3.2.3 Prediction

Based on the learned intentions, we can predict the user's preference towards the candidate item by calculating the inner product between the candidate item and the user's disentangled intentions. However, some users have few historical behaviors. We therefore also discover the interests of the $v$th user from his or her profile feature and the candidate item feature, aiming to capture some common patterns, e.g., females may show high interest in fashion articles. We collect this information in $F_{all} = [F^{(v)}_1, \ldots, F^{(v)}_g, h^{(v)}_{t1}, \ldots, h^{(v)}_{tr}]$, where $h^{(v)}_{tj} \in \mathbb{R}^{d/r}$ is one field of the candidate item embedding $h^{(v)}_t$. Then multi-head attention [42] is used to capture the interactions among different fields of features:

$$[f_1, f_2, \ldots, f_{g+r}] = \mathrm{MultiheadAttention}(F_{all}). \qquad (5)$$

By simultaneously considering the user's interests learned from the historical multi-feedback and from feature interaction, we can predict the user's preference towards the candidate item, $s_{vt}$, as in Eq. (6), where the first term is the preference inferred from user feedback ($\langle \cdot, \cdot \rangle$ is the inner product) and the second term comes from the feature interaction between the user profile and item features.

$$s_{vt} = \sum_{k=1}^{K} \langle inten^{(v)}_k, h^{(v)}_t \rangle + \mathrm{MLP}([f_1; f_2; \ldots; f_{g+r}]). \qquad (6)$$

Our loss function is designed as follows. It is composed of the widely adopted binary cross-entropy loss and a regularization term for disentanglement that aims to minimize the similarities among different intentions. $y_{vt}$ is the ground truth label, and $D_T$ is the training set.

$$L = -\frac{1}{|D_T|} \sum_{(v,t) \in D_T} \left[ y_{vt} \log \sigma(s_{vt}) + (1 - y_{vt}) \log(1 - \sigma(s_{vt})) \right] + \sum_{i=1}^{K} \sum_{j \neq i} \phi(m_i, m_j) \qquad (7)$$
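The loss can be sketched in a few lines of NumPy. The sketch assumes a layer normalization without learnable parameters inside $\phi$ and adds a small `eps` inside the logarithms for numerical safety; the prototype values are hypothetical and chosen so that the penalty vanishes.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def disentangle_penalty(prototypes):
    # Sum of phi(m_i, m_j) over all ordered pairs with i != j.
    normed = layer_norm(prototypes)
    gram = normed @ normed.T          # gram[i, j] = phi(m_i, m_j)
    return gram.sum() - np.trace(gram)

def cdr_loss(scores, labels, prototypes, eps=1e-12):
    # Mean binary cross-entropy plus the disentanglement penalty.
    p = sigmoid(scores)
    bce = -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    return bce + disentangle_penalty(prototypes)

# Two prototypes that are orthogonal after layer normalization, so the penalty is zero.
prototypes = np.array([[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]])
loss = cdr_loss(np.zeros(4), np.array([1.0, 0.0, 1.0, 0.0]), prototypes)
```

With all scores at zero the cross-entropy equals $\log 2$, so `loss` is approximately 0.693 in this toy setting.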

3.3 Adjustable Self-evaluating Curriculum towards A Better Self

All the aforementioned methods help to capture the user's comprehensive true interests from the input features. However, the noise existing in the training labels can still mislead our model to suboptimal parameters. To tackle this problem, we leverage the idea of curriculum learning to denoise [31]. However, it is shown in [36] that different curriculum strategies (e.g., easy to hard, hard example mining, etc.) can be effective for different dataset settings, and thus we need a flexible curriculum design to adapt to complex recommendation scenarios.

To this end, we propose a novel adjustable self-evaluating curriculum for recommendation, which is shown in Algorithm 1. Our goal is to obtain the optimal parameters θ for our recommender F. When training on the batch B, we first calculate the prediction result $\sigma(s_j)$ for each sample and the difference between the ground truth label $y_j$ and our prediction. Then, we obtain the importance of each sample through a Gaussian function, where a sample whose difference is closer to a preset value thre gets higher importance. Here, the hyperparameter thre ∈ (0, 1) reflects the sample difficulty we want the model to focus on. Concretely, if thre approaches 1, the harder samples get higher weights, while if thre approaches 0, the model lays more emphasis on the easier samples. In our experiments, we can adjust the value of thre to set curricula of different difficulty levels for our model to improve itself. Finally, we optimize the parameters by the reweighted loss with an existing optimization method π. In the algorithm, µ controls the degree of concentration of the model and τ is a time-weight-decay factor. As training goes on, µ becomes smaller and smaller through repeated multiplication by τ, which makes the Gaussian function smoother and smoother, so that all the samples get almost equal weights in the end. This time-varying process conforms to the human learning process: after we gradually improve ourselves with a scheduled curriculum, we should review all the samples to further consolidate the knowledge in our mind. Since the model has already learned enough knowledge in the early stage, it is more robust and less likely to be influenced by the noisy samples. By adjusting the value of τ, we can control the curriculum learning speed, where a larger τ means a slower learning process.

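Algorithm 1 Adjustable Self-evaluating Curriculum towards A Better Self

1: input: $\{(x_{vt}, y_{vt})\}_{(v,t) \in D_T}$, $\pi(\cdot)$, $\ell(\cdot, \cdot)$, $\tau < 1$, counter, interval, thre, F
2: initialize: $\theta$, $\mu$
3: for $B \in D_T$ do
4:   for $(x_j, y_j) \in B$ do
5:     $s_j = F(x_j)$, $d_j = \mathrm{abs}(y_j - \sigma(s_j))$;
6:     $w_j = \exp(-\mu (d_j - thre)^2)$;
7:     $w_j \leftarrow w_j / \sum_{j=1}^{|B|} w_j$;
8:   end for
9:   $\theta \leftarrow \theta - \pi(\nabla_\theta \sum_{j=1}^{|B|} w_j \ell(s_j, y_j) + \sum_{i=1}^{K} \sum_{j \neq i} \phi(m_i, m_j))$;
10:  counter $\leftarrow$ counter + 1;
11:  if counter % interval == 0 then
12:    $\mu \leftarrow \mu \tau$;
13:  end if
14: end for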
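The reweighting and decay schedule of Algorithm 1 can be sketched on a toy logistic-regression model. Several simplifications are assumptions of this sketch: plain SGD stands in for the optimizer π, the sample weights are treated as constants when taking the gradient, the disentanglement term is omitted because the toy model has no prototypes, and the data are random.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def curriculum_weights(scores, labels, mu, thre):
    # Gaussian-shaped weights centred on the target difficulty `thre`.
    difficulty = np.abs(labels - sigmoid(scores))
    w = np.exp(-mu * (difficulty - thre) ** 2)
    return w / w.sum()

def train(X, y, mu=10.0, tau=0.9, thre=0.5, interval=5,
          lr=0.1, batch_size=32, epochs=3, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    counter = 0
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            s = X[idx] @ theta
            w = curriculum_weights(s, y[idx], mu, thre)
            # Gradient of the weighted cross-entropy, with the weights held fixed.
            grad = X[idx].T @ (w * (sigmoid(s) - y[idx]))
            theta -= lr * grad            # plain SGD as the stand-in for pi
            counter += 1
            if counter % interval == 0:
                mu *= tau                 # flatten the weighting over time
    return theta, mu

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = (rng.random(64) < 0.5).astype(float)
theta, final_mu = train(X, y)
```

With 64 samples, a batch size of 32, and three epochs there are six batches, so `mu` is decayed once (10 to 9) under `interval=5`. Setting `thre` near 0 puts the largest weight on the samples the model already predicts well, while setting it near 1 favors the samples with the largest error.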

Discussion. Our proposed curriculum converts the design of a curriculum training strategy into a hyper-parameter search over thre, µ, and τ, which improves the flexibility, controllability, and explainability of curriculum design while going beyond the conventional easy-to-hard assumption [30, 35, 32, 33]. Meanwhile, compared to automatic curricula [38, 40, 43, 39], our method requires almost no extra time or memory overhead for learning and applying the curriculum, which is efficient and meets the demands of recommendation scenarios.

4 Experiments

4.1 Experimental Setup

Table 1: Dataset statistics

| | Amazon-Beauty | Amazon-Sports | MovieLens-1M | WeChat5D |
|---|---|---|---|---|
| # of users | 22,342 | 35,590 | 6,039 | 13,340 |
| # of items | 12,099 | 18,356 | 3,628 | 112,859 |
| # of click | 176,520 | 277,088 | 836,478 | 749,138 |
| # of unclick | 788,008 | 1,179,266 | 2,138,040 | 7,766,013 |
| # of dislike | 21,847 | 19,203 | 163,515 | 295,504 |

Datasets We conduct our experiments on four real-world datasets: WeChat5D, MovieLens-1M [44], Amazon Sports [45], and Amazon Beauty [45]. The dataset statistics are shown in Table 1. WeChat5D is a mobile article recommendation dataset from WeChat Top Stories and natively contains different kinds of user feedback. All data are preprocessed via data masking to protect user privacy. For the Amazon and MovieLens datasets, we regard ratings larger than 2 points as click feedback and the rest as dislike feedback. Since these three datasets have no information about the items that were recommended to the users but not clicked, we simulate a simple recommendation environment. For each user like or dislike interaction, we generate four unclick interactions for this user: three items sampled from the top popular items at that time (top 3000 for the Amazon datasets and top 1/2 for MovieLens-1M) and one item randomly sampled from all the items. The popular items simulate the simplest recommendation rule, i.e., recommendation based on item popularity. The randomly sampled item simulates the scenario in which recommenders always recommend something to explore the customer's potential interests [46]. Note that none of the user's unclick interactions are in the user's clicked or disliked item sets. The whole dataset is chronologically divided into training, validation, and test sets by the ratio of 8:1:1. Note that our training and testing phases follow the sequential recommendation setting. For example, suppose one user's historical behavior is the sequence {1, 2, 3, ..., 18, 19, 20}. Then, we generate the validation and test samples as follows: two validation samples {[1, 2, 3, 4, ..., 16], [17]}, {[1, 2, 3, 4, ..., 16, 17], [18]} and two test samples {[1, 2, 3, 4, ..., 16, 17, 18], [19]}, {[1, 2, 3, 4, ..., 16, 17, 18, 19], [20]}, where the first bracketed term represents the historical information we use for prediction and the second bracketed term is the next item to predict. We always use all the user's real historical behaviors as the sequential input to the models to predict the next item the user will click.
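The sample construction described above can be sketched as follows. The function names are hypothetical, the split is applied per user as in the 20-item example, the way training samples are formed is an assumption (the text only spells out validation and test samples), and the popular-item list is taken as given rather than computed per timestamp.

```python
import random

def sequential_samples(history, ratio=(8, 1, 1)):
    # Split next-item prediction samples chronologically by the given ratio.
    n, total = len(history), sum(ratio)
    train_end = n * ratio[0] // total
    valid_end = n * (ratio[0] + ratio[1]) // total
    samples = {"train": [], "valid": [], "test": []}
    for pos in range(1, n):
        if pos < train_end:
            split = "train"
        elif pos < valid_end:
            split = "valid"
        else:
            split = "test"
        # (history used for prediction, next item to predict)
        samples[split].append((history[:pos], history[pos]))
    return samples

def simulate_unclicks(popular_items, all_items, interacted, rng):
    # Three popular items plus one random item, none already interacted with.
    popular_pool = [i for i in popular_items if i not in interacted]
    full_pool = [i for i in all_items if i not in interacted]
    return rng.sample(popular_pool, 3) + [rng.choice(full_pool)]

history = list(range(1, 21))
samples = sequential_samples(history)
unclicks = simulate_unclicks(range(100, 130), range(100, 200),
                             set(history), random.Random(0))
```

For the 20-item history this reproduces the two validation samples targeting items 17 and 18 and the two test samples targeting items 19 and 20.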

Baselines We compare our approach with state-of-the-art (SOTA) methods. DeepFM [47] and AutoInt [48] are recommenders based on feature interaction. DIN [49], SASRec [50], DFN [10], and SDR [9] are methods based on the user's sequential historical behavior. Specifically, DFN considers different kinds of user feedback and concatenates these features together. SDR only utilizes positive click feedback to capture the user's disentangled interests. For fair comparison, we add the feature interaction module to every baseline that does not originally utilize the user profile.

Implementation and hyper-parameters We implement our method in TensorFlow and use the Adagrad [51] optimizer, which is suitable for sparse data, for mini-batch gradient descent with a mini-batch size of 256. All the mentioned transformer encoders are four-head and one-layer. We cap the maximum length of the sequential historical behavior at 30 for all datasets. We fix µ in the curriculum to 10, and the other hyper-parameters are tuned using random search. The search space is listed as follows. More detailed experimental settings can be found in our appendix.

- The number of latent intentions K ∈ {1, 2, ..., 8}.
- The prior confidence for the unclicked data λ ∈ {0.1, 0.2, ..., 1.0}.
- The learning rate ∈ {0.0001, 0.001, 0.01, 0.1, 1.0}.
- The hidden size of each field of feature ∈ {32, 64, 128, 256}.
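Random search over the space listed above could be organized as in the sketch below. The dictionary keys, the `evaluate` callback, and the number of trials are hypothetical; in practice `evaluate` would train the model with the sampled configuration and return the validation AUC.

```python
import random

# Search space as listed above; the key names are illustrative.
SEARCH_SPACE = {
    "num_intentions": list(range(1, 9)),
    "unclick_confidence": [round(0.1 * i, 1) for i in range(1, 11)],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 1.0],
    "hidden_size": [32, 64, 128, 256],
}

def sample_config(rng):
    # Draw one value per hyper-parameter, uniformly at random.
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def random_search(evaluate, num_trials, seed=0):
    # Keep the configuration with the highest score returned by `evaluate`.
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Dummy objective used only to exercise the search loop.
best_config, best_score = random_search(
    lambda cfg: -abs(cfg["num_intentions"] - 4), num_trials=20)
```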

Table 2: Model performance

| Model | Amazon-Beauty AUC | Amazon-Beauty RelaImpr | Amazon-Sports AUC | Amazon-Sports RelaImpr | MovieLens-1M AUC | MovieLens-1M RelaImpr | WeChat5D AUC | WeChat5D RelaImpr |
|---|---|---|---|---|---|---|---|---|
| DeepFM | 0.6975 | 0.00% | 0.7608 | 0.00% | 0.8098 | 0.00% | 0.7219 | 0.00% |
| AutoInt | 0.6826 | -7.57% | 0.7575 | -1.27% | 0.7940 | -5.08% | 0.7263 | 1.96% |
| SASRec | 0.7415 | 22.28% | 0.7830 | 8.52% | 0.8826 | 19.32% | 0.7126 | -4.18% |
| DIN | 0.7633 | 33.32% | 0.7968 | 13.80% | 0.8760 | 21.39% | 0.7282 | 2.83% |
| DFN | 0.7670 | 35.21% | 0.7763 | 5.93% | 0.8293 | 6.31% | 0.7329 | 4.96% |
| SDR | 0.7238 | 13.30% | 0.7972 | 13.95% | 0.8865 | 24.78% | 0.7296 | 3.45% |
| CDR (Ours) | 0.7991 | 51.43% | 0.8152 | 20.85% | 0.9065 | 31.24% | 0.7622 | 18.14% |

4.2 Recommendation Performance

We evaluate the performance of our proposed method on the classical click-through-rate (CTR) prediction task and use the widely adopted Area Under Curve (AUC) metric for evaluation. We also follow [52] in using RelaImpr to measure the relative improvement over the base model (i.e., DeepFM in our experiment). The results are shown in Table 2.
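For reference, RelaImpr is commonly defined as the relative gain in AUC above the 0.5 level of a random predictor. The short sketch below assumes this standard definition; the function name is illustrative, and small deviations from the reported percentages can arise because the AUC values are rounded to four decimals.

```python
def rela_impr(auc_model, auc_base):
    # Relative improvement (in percent) of the AUC margin over a random predictor (0.5).
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0

# DIN versus the DeepFM base model on Amazon-Beauty, using the AUC values of Table 2.
din_beauty = rela_impr(0.7633, 0.6975)
```

Here `din_beauty` is about 33.32, in line with the corresponding entry of Table 2.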

We observe that our approach outperforms the baselines significantly, both on the dense MovieLens dataset and on the sparse Amazon and WeChat5D datasets. On the MovieLens dataset, where each user has rich historical behaviors, the models that utilize the sequential historical behavior obtain much better performance. However, DFN fails to capture the user's interests as accurately as the other sequential models on MovieLens-1M. This is likely because DFN uses three kinds of feedback, and in this dataset the user's rich feedback contains more noise (introduced by more unclick behavior), which DFN fails to filter. Our method performs best on datasets of different sparsity, mainly benefiting from both our model design and the curriculum training strategy, which are discussed in Section 4.3 and Section 4.4, respectively.

4.3 Multi-feedback

We validate the effectiveness of our method in capturing the user's preference from different kinds of feedback. We compare the following four settings: our complete method (complete), our model without curriculum (w/o CL), our model that only utilizes the clicked historical feedback without curriculum (click), and our model that directly utilizes the clicked and unclicked historical feedback (i.e., the output of the C-Encoder and U-Encoder) to aggregate the user's intentions without curriculum (c&un). The results are shown in Table 3.

Table 3: Effectiveness of our model components

| Model | Amazon-Beauty AUC | Amazon-Beauty RelaImpr | Amazon-Sports AUC | Amazon-Sports RelaImpr | MovieLens-1M AUC | MovieLens-1M RelaImpr | WeChat5D AUC | WeChat5D RelaImpr |
|---|---|---|---|---|---|---|---|---|
| w/o CL | 0.7914 | 47.56% | 0.8083 | 18.22% | 0.8972 | 28.22% | 0.7548 | 14.82% |
| click | 0.7477 | 25.43% | 0.8006 | 15.26% | 0.8879 | 25.22% | 0.7347 | 5.76% |
| c&un | 0.7477 | 25.42% | 0.7969 | 13.86% | 0.8854 | 24.40% | 0.7340 | 5.42% |
| complete | 0.7991 | 51.43% | 0.8152 | 20.85% | 0.9065 | 31.24% | 0.7622 | 18.14% |

By comparing the click and c&un results, we conclude that directly utilizing the unclick historical behavior does not improve the model performance. A possible reason is that the noise in unclick feedback prevents the model from capturing the user's true interests. However, with our proposed co-filtering mechanism to filter the noise and locate the user's true preference, the model can capture the user's intentions more precisely and achieve higher performance (comparing the results of w/o CL with those of c&un). Furthermore, the results of complete and w/o CL indicate that the curriculum learning strategy takes a further step in helping the model learn better parameters.

4.4 Curriculum Exploration

In this part, we explore what kind of curriculum strategy would beneﬁt our model. We respectively explore the impact of the curriculum difﬁculty thre and the curriculum learning speed τ.

Impact of Curriculum Difﬁculty As the difﬁculty of our curriculum is controlled by the parameter thre (larger thre means curriculum of higher difﬁculty), we set different values for thre to train our model, while ﬁxing the other hyper-parameters the same as the random grid search result. The performance of the model trained with different values of thre is shown in Fig 2(a). The results on Amazon Sports dataset show the typical easy-to-hard curriculum pattern [31]. Learning from the easier samples ﬁrst will make the objective smoother, thus more easily reaching the global optimal. In contrast, the results on the Amazon Beauty dataset show a different curriculum scenario: although learning easy samples ﬁrst could achieve comparably good performance, learning from the samples that have difﬁculty of about 0.8 could lead to better performance. It matches the spirits of hard example mining [37], focusing more on the harder samples that are more informative for the model helps the model to discriminate the samples better. Although the results on the two datasets show that they have different suitable curriculum, there is one phenomenon in common, curriculum concentrating on the too hard samples (thre = 1.0) results in bad performance. This is quite reasonable for our scenario, because there exists label-level noise. While the noisy data always cause higher prediction error than the clean data [53], focusing on the noisy data will absolutely cause the learned parameters sub-optimal.

Impact of Curriculum Speed Fixing the most suitable curriculum difficulty thre, we vary the hyper-parameter τ to see at what pace the model should learn. A smaller τ means faster adaptation to all the samples, because a smaller τ makes the Gaussian function flatten faster. From the results in Figure 2(b), we can see that a curriculum that is too fast or too slow performs worse, and that the best pace differs across datasets. Moreover, we observe that if τ equals 1, the performance on both datasets drops (see footnote 4). This phenomenon is easy to understand by analogy with human learning: if we always concentrate on problems of one particular level of difficulty, we cannot perform well in exams that present problems of different levels of difficulty. For the Amazon Sports dataset, since we fix thre = 0.0, the model always concentrates on the easiest samples and hardly learns the more difficult ones, and thus suffers a dramatic performance drop at test time. For the Amazon Beauty dataset, since thre = 0.8, the model focuses more on its errors throughout training, and thus does not suffer as large a performance drop as on the Amazon Sports dataset.
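The interplay of thre and τ can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each sample has a difficulty score in [0, 1] (for instance a per-sample prediction error) and that the Gaussian weighting takes the form w_i = exp(-precision0 * τ^epoch * (d_i - thre)^2). This functional form, the function name curriculum_weights, and the constant precision0 are assumptions chosen to match the qualitative behavior described above, not the exact formulation of the curriculum.

```python
import numpy as np


def curriculum_weights(difficulty, thre, tau, epoch, precision0=50.0):
    """Gaussian-shaped sample weights centred on the target difficulty `thre`.

    Assumed form (illustrative only, not the paper's exact equation):
        w_i = exp(-precision0 * tau**epoch * (d_i - thre)**2)
    With tau < 1 the Gaussian flattens as training proceeds; with tau = 1
    it never flattens and the focus stays on samples near `thre`.
    """
    difficulty = np.asarray(difficulty, dtype=float)
    precision = precision0 * tau ** epoch  # shrinks over epochs when tau < 1
    return np.exp(-precision * (difficulty - thre) ** 2)


# Hypothetical difficulty scores for four samples.
d = np.array([0.0, 0.4, 0.8, 1.0])

# Easy-first curriculum (thre = 0.0) after 10 epochs.
w_fast = curriculum_weights(d, thre=0.0, tau=0.5, epoch=10)   # nearly uniform
w_fixed = curriculum_weights(d, thre=0.0, tau=1.0, epoch=10)  # still peaked at 0.0

print(np.round(w_fast, 3))
print(w_fixed)
```

Under this assumed form, τ = 1 keeps the weights concentrated around thre for the whole of training, whereas a smaller τ flattens them so that all samples are eventually weighted almost equally.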

4.5 Other Studies on Explainability

Disentangled Intentions We validate the disentanglement of the learned intentions. We calculate the similarity between each item and each of the K prototype intentions, and assign each item to the intention with which it has the highest similarity. Then, under each intention, we count the number of items belonging to each item category and plot the results in Figure 4. We can see that the learned intentions have disentangled meanings. For example, in Figure 4(d), intention 0 indicates interest in "Sports", while intention 5 mainly indicates a preference for "Entertainment". However, the disentanglement on the MovieLens-1M dataset is less promising: almost all intentions are highly related to "Comedy", "Action", and "Drama". This is probably because the long historical behavior in MovieLens-1M poses great challenges for disentanglement, as mentioned in [9].
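The assignment procedure above can be sketched as follows. The sketch assumes cosine similarity between item embeddings and prototype embeddings (the text only specifies "similarity"); the function name, variable names, and toy data are hypothetical.

```python
import numpy as np


def intention_category_counts(item_emb, prototypes, item_category, num_categories):
    """Assign each item to its most similar prototype intention and count,
    for every intention, how many assigned items fall into each category.

    Cosine similarity is an assumption made for this illustration.
    """
    items = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sim = items @ protos.T            # shape: (num_items, K)
    assigned = sim.argmax(axis=1)     # index of the closest intention per item
    counts = np.zeros((prototypes.shape[0], num_categories), dtype=int)
    np.add.at(counts, (assigned, item_category), 1)  # accumulate per (intention, category)
    return assigned, counts


# Toy example: 4 items, K = 2 prototype intentions, 2 item categories.
item_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]])
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
item_category = np.array([0, 0, 1, 1])

assigned, counts = intention_category_counts(item_emb, prototypes, item_category, 2)
print(assigned)
print(counts)
```

Each row of counts corresponds to one intention and can be plotted as a per-category histogram, which is the kind of view that Figure 4 presents.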

Mining Interests from Unclick Sequence We also conduct an ablation study to validate our claim that user interests also exist in the unclick sequence and that our model can locate these interests in the noisy data. One example is shown in Figure 3: this user clicked topics about "Social livelihood", "Technology", and "Human history". Our interests mining algorithm discovers the user's interests in "International News" and "Education" from the unclick feedback. When a candidate item about "International News" arrives, our model makes the right prediction based on these located interests. This shows that our model can mine, from the user's unclick history, true interests that do not appear in the clicked history, thus capturing users' comprehensive and accurate intentions.

4 We do not plot the performance on the Amazon Sports dataset because it drops to 0.5645 when τ equals 1.0.

Figure 2: Ablation on Curriculum. Panels: (a) curriculum difficulty, (b) curriculum speed. Each panel plots Amazon Sports, Amazon Sports w/o CL, Amazon Beauty, and Amazon Beauty w/o CL, with axis ticks from 0.0 to 1.0.

Figure 3: Interests mining from unclick feedback. The example shows the clicked topics (Human history, Technology, Social livelihood), the interests mining and intention aggregation steps over the unclick sequence, and a candidate item about International with prediction 0.78 and ground truth 1 (right prediction).

Figure 4: Disentangled Intentions. Panels: (a) Amazon Beauty, (b) Amazon Sports, (c) MovieLens-1M, (d) WeChat5D.

5 Conclusion

In this paper, we propose to learn disentangled representations for users' intentions from multi-feedback. The proposed routing mechanism models the complex relations among multi-feedback and various user intentions, and tackles the impact of noise brought by the multi-feedback. Experimental results show that the disentangled representations learned from multi-feedback capture comprehensive user intentions and consequently improve the recommendation performance. Moreover, the proposed curriculum further alleviates the impact of data with improper labels in different datasets through adjustable hyper-parameters, and can serve as an efficient plug-in for recommendation models. Future work may include explicitly locating the noisy sample labels in multi-feedback data.

Acknowledgement

This work is supported by the National Key Research and Development Program of China (No. 2020AAA0106300) and the National Natural Science Foundation of China (No. 62050110, No. 62102222).

References

[1] Xin Wang, Wenwu Zhu, and Chenghao Liu. Social recommendation with optimal limited attention. In Ankur Teredesai, Vipin Kumar, Ying Li, Rómer Rosales, Evimaria Terzi, and George Karypis, editors, Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pages 1518-1527. ACM, 2019.

[2] Shengze Yu, Xin Wang, Wenwu Zhu, Peng Cui, and Jingdong Wang. Disparity-preserved deep cross-platform association for cross-platform video recommendation. In Sarit Kraus, editor, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 4635-4641. ijcai.org, 2019.

[3] Qianqian Xu, Jiechao Xiong, Xiaochun Cao, Qingming Huang, and Yuan Yao. From social to individuals: A parsimonious path of multi-level models for crowdsourced preference aggregation. IEEE Trans. Pattern Anal. Mach. Intell., 41(4):844-856, 2019.

[4] Xin Wang, Wei Lu, Martin Ester, Can Wang, and Chun Chen. Social recommendation with strong and weak ties. In Snehasis Mukhopadhyay, ChengXiang Zhai, Elisa Bertino, Fabio Crestani, Javed Mostafa, Jie Tang, Luo Si, Xiaofang Zhou, Yi Chang, Yunyao Li, and Parikshit Sondhi, editors, Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, Indianapolis, IN, USA, October 24-28, 2016, pages 5-14. ACM, 2016.

[5] Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, and Wenwu Zhu. Learning disentangled representations for recommendation. In Advances in neural information processing systems, pages 5711-5722, 2019.

[6] Yin Zhang, Ziwei Zhu, Yun He, and James Caverlee. Content-collaborative disentanglement representation learning for enhanced recommendation. In Fourteenth ACM Conference on Recommender Systems, pages 43-52, 2020.

[7] Linmei Hu, Siyong Xu, Chen Li, Cheng Yang, Chuan Shi, Nan Duan, Xing Xie, and Ming Zhou. Graph neural news recommendation with unsupervised preference disentanglement. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4255-4264, 2020.

[8] Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. Disentangled graph collaborative filtering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1001-1010, 2020.

[9] Jianxin Ma, Chang Zhou, Hongxia Yang, Peng Cui, Xin Wang, and Wenwu Zhu. Disentangled self-supervision in sequential recommenders. In Rajesh Gupta, Yan Liu, Jiliang Tang, and B. Aditya Prakash, editors, KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 483-491. ACM, 2020.

[10] Ruobing Xie, Cheng Ling, Yalong Wang, Rui Wang, Feng Xia, and Leyu Lin. Deep feedback network for recommendation. Proceedings of IJCAI-PRICAI, 2020.

[11] Qian Zhao, F Maxwell Harper, Gediminas Adomavicius, and Joseph A Konstan. Explicit or implicit feedback? engagement or satisfaction? a field experiment on machine-learning-based recommender systems. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pages 1331-1340, 2018.

[12] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pages 263-272. IEEE, 2008.

[13] Yao Wu, Christopher DuBois, Alice X Zheng, and Martin Ester. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 153-162, 2016.

[14] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pages 173-182, 2017.

[15] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 549-558, 2016.

[16] Jiawei Chen, Can Wang, Sheng Zhou, Qihao Shi, Yan Feng, and Chun Chen. Samwalker: Social recommendation with informative sampling strategy. In The World Wide Web Conference, pages 228-239, 2019.

[17] Dawen Liang, Laurent Charlin, James McInerney, and David M Blei. Modeling user exposure in recommendation. In Proceedings of the 25th international conference on World Wide Web, pages 951-961, 2016.

[18] Erheng Zhong, Nathan Liu, Yue Shi, and Suju Rajan. Building discriminative user profiles for large-scale content recommendation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2277-2286, 2015.

[19] Xiang Wang, Yaokun Xu, Xiangnan He, Yixin Cao, Meng Wang, and Tat-Seng Chua. Reinforced negative sampling over knowledge graph for recommendation. In Proceedings of The Web Conference 2020, pages 99-109, 2020.

[20] Jingtao Ding, Yuhan Quan, Xiangnan He, Yong Li, and Depeng Jin. Reinforced negative sampling for recommendation with exposure data. In IJCAI, pages 2230-2236, 2019.

[21] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1040-1048, 2018.

[22] Chuhan Wu, Fangzhao Wu, Yongfeng Huang, and Xing Xie. Neural news recommendation with negative feedback. CCF Transactions on Pervasive Computing and Interaction, 2(3):178-188, 2020.

[23] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798-1828, 2013.

[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[25] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.

[26] Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in neural information processing systems, pages 2610-2620, 2018.

[27] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.

[28] Haoyang Li, Xin Wang, Ziwei Zhang, Jianxin Ma, Peng Cui, and Wenwu Zhu. Intention-aware sequential recommendation with structured intent transition. IEEE Transactions on Knowledge and Data Engineering, 2021.

[29] Xin Wang, Hong Chen, and Wenwu Zhu. Multimodal disentangled representation for recommendation. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1-6, 2021.

[30] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41-48, 2009.

[31] Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

[32] Radu Tudor Ionescu, Bogdan Alexe, Marius Leordeanu, Marius Popescu, Dim P Papadopoulos, and Vittorio Ferrari. How hard can it be? estimating the difficulty of visual search in an image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2157-2166, 2016.

[33] Emmanouil Antonios Platanios, Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom M Mitchell. Competence-based curriculum learning for neural machine translation. arXiv preprint arXiv:1903.09848, 2019.

[34] Yudong Chen, Xin Wang, Miao Fan, Jizhou Huang, Shengwen Yang, and Wenwu Zhu. Curriculum meta-learning for next POI recommendation. In Feida Zhu, Beng Chin Ooi, and Chunyan Miao, editors, KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, pages 2692-2702. ACM, 2021.

[35] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in neural information processing systems, pages 1189-1197, 2010.

[36] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. arXiv preprint arXiv:1904.03626, 2019.

[37] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 761-769, 2016.

[38] Yang Fan, Fei Tian, Tao Qin, Xiang-Yang Li, and Tie-Yan Liu. Learning to teach. arXiv preprint arXiv:1805.03643, 2018.

[39] Gaurav Kumar, George Foster, Colin Cherry, and Maxim Krikun. Reinforcement learning based curriculum optimization for neural machine translation. arXiv preprint arXiv:1903.00041, 2019.

[40] Xinyi Wang, Hieu Pham, Paul Michel, Antonios Anastasopoulos, Jaime Carbonell, and Graham Neubig. Optimizing data usage via differentiable rewards. In International Conference on Machine Learning, pages 9983-9995. PMLR, 2020.

[41] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050, 2018.

[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998-6008, 2017.

[43] Tianyi Zhou, Shengjie Wang, and Jeff A Bilmes. Curriculum learning by dynamic instance hardness. Advances in Neural Information Processing Systems, 33, 2020.

[44] F. Maxwell Harper and Joseph A. Konstan. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1-19:19, 2016.

[45] Ruining He and Julian J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Jacqueline Bourdeau, Jim Hendler, Roger Nkambou, Ian Horrocks, and Ben Y. Zhao, editors, Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016, pages 507-517. ACM, 2016.

[46] Ruobing Xie, Shaoliang Zhang, Rui Wang, Feng Xia, and Leyu Lin. Hierarchical reinforcement learning for integrated recommendation. 2021.

[47] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017.

[48] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. Autoint: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1161-1170, 2019.

[49] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1059-1068, 2018.

[50] Wang-Cheng Kang and Julian J. McAuley. Self-attentive sequential recommendation. In IEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018, pages 197-206. IEEE Computer Society, 2018.

[51] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121-2159, 2011.

[52] Ling Yan, Wu-Jun Li, Gui-Rong Xue, and Dingyi Han. Coupled group lasso for web-scale CTR prediction in display advertising. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 802-810. JMLR.org, 2014.

[53] Eric Arazo, Diego Ortego, Paul Albert, Noel E. O'Connor, and Kevin McGuinness. Unsupervised label noise modeling and loss correction. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 312-321. PMLR, 2019.