# Sequential Recommendation with Relation-Aware Kernelized Self-Attention

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Mingi Ji, Weonyoung Joo, Kyungwoo Song, Yoon-Yeong Kim, Il-Chul Moon
Korea Advanced Institute of Science and Technology (KAIST), Korea
{qwertgfdcvb, es345, gtshs2, yoonyeong.kim, icmoon}@kaist.ac.kr

Recent studies identified that sequential recommendation is improved by the attention mechanism. Following this development, we propose Relation-Aware Kernelized Self-Attention (RKSA), which adopts the self-attention mechanism of the Transformer and augments it with a probabilistic model. The original self-attention of the Transformer is a deterministic measure without relation-awareness. Therefore, we introduce a latent space to the self-attention, and the latent space models the recommendation context from relations as a multivariate skew-normal distribution with a kernelized covariance matrix built from co-occurrences, item characteristics, and user information. This work merges the self-attention of the Transformer with sequential recommendation by adding a probabilistic model of the recommendation task specifics. We experimented with RKSA on benchmark datasets, and RKSA shows significant improvements over recent baseline models. Also, RKSA is able to produce a latent space model that explains the reasons behind a recommendation.

## Introduction

Recommendation is one of the key application areas of artificial intelligence in the big data era. Recommendation tasks are supported by large-scale data, and users need to select a specific item from many alternatives. This selection requirement motivates the use of the attention mechanism in the recommendation task. The attention is applied to the item selection, and sequential recommendation in particular attends over the past item choice records to make the recommendation at the current timestep (Wang et al. 2018; Liu et al. 2018; Li et al. 2017; Ying et al. 2018; Yu et al. 2019; Huang et al. 2018).

Given the relationship between attention and recommendation, adopting new attention mechanisms for the recommendation task has become a research trend. For instance, Self-Attentive Sequential Recommendation (SASRec) (Kang and McAuley 2018) adopted the self-attention mechanism of the Transformer (Vaswani et al. 2017) for the recommendation task. This adaptation is interesting, but it was only limitedly customized to meet the task specifics.

Figure 1: Each entry of the co-occurrence matrix is the number of users for whom the corresponding movie pair appeared together in a user sequence of the MovieLens dataset. We can see that many users watched the Star Wars movies together. This allows modifying the attention weight from blue to red using the co-occurrence information when Star Wars 6 is the query.

Recommendation often requires understanding items, users, browsing sequences, etc., and recommendation models need to consider such contexts, which SASRec does not provide. Following SASRec, there have been developments in using the self-attention mechanism of the Transformer to model task-specific features of sequential recommendation. For example, ATRank (Zhou et al. 2018) utilized the self-attention mechanism to consider the influences from heterogeneous behavior representations.
To model the user's short-term intent, AttRec (Zhang et al. 2019) adopted the self-attention mechanism on the user interaction history. Similar to ATRank and AttRec, BST (Chen et al. 2019) used the self-attention mechanism to aggregate the auxiliary user and item features. Given the success of self-attention (Tan et al. 2018; Devlin et al. 2018; Zhang et al. 2018), the recommendation task can be improved with the sequential information, which was only limitedly used in previous works. Moreover, such utilization of the sequential information provides a new approach to customize the self-attention structure to the recommendation task. Figure 1 is an example of how the co-occurrence information may influence the attention weight. It is feasible to see a movie pair that has a higher co-occurrence than others, and this movie pair should inform the attention mechanism to increase the weight.

We renovate and customize the self-attention of the Transformer with a latent space model. Specifically, we add a latent space to the self-attention value of the Transformer, and we use the latent space to model the context from relations of the recommendation task. The latent space is modeled as a multivariate skew-normal (MSN) distribution (Azzalini and Valle 1996) whose dimension is the number of unique items in the sequence. The covariance matrix of the MSN distribution is the variable through which we model the relations of a sequence, items, and a user by a kernel function, which provides the flexibility for the recommendation task adaptation. After the kernel modeling, we provide a reparametrization of the MSN distribution to enable amortized inference on the introduced latent space. Since the relation modeling is done with kernelization, we call this model relation-aware kernelized self-attention (RKSA).

We designed RKSA with three innovations. First, the deterministic Transformer may not work well in the generalized task of recommendation because of sparsity, so we added a latent dimension and its corresponding reparameterization. Second, the covariance modeling with the relation-aware kernel enables a more fundamental adaptation of the self-attention to the recommendation task. Third, the kernelized latent space of the self-attention provides the reasoning behind the recommendation result. RKSA is evaluated against eight baseline models, including SASRec, HCRNN, and NARM, on five benchmark datasets, including Amazon reviews, MovieLens, and Steam. Our experiments showed that RKSA consistently and significantly improves the performance over the baselines on the benchmarks.

## Preliminary

### Multi-Head Attention

We start the preliminary by reviewing the self-attention structure that is the backbone of RKSA. Recently, (Vaswani et al. 2017) proposed the scaled-dot product attention, which is defined by Equation 1, where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, and $V \in \mathbb{R}^{m \times d_v}$ are the query, the key, and the value matrices, respectively. The scaled-dot product attention calculates importance weights from the dot-product of query $i$ with key $j$ with a scaling of $\sqrt{d_k}$. This importance is bounded by the softmax, and the bounded importance is multiplied by the value matrix $V$ to form the scaled-dot product attention.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \qquad (1)$$

When the query, the key, and the value take the same $X \in \mathbb{R}^{n \times d}$ as an input matrix in Equation 2, the scaled-dot product attention is called the self-attention. A self-attention with an additional predefined or learnable positional embedding
(Vaswani et al. 2017; Kang and McAuley 2018) is able to capture the latent information of the position like previous recurrent networks.

$$\mathrm{SA}(X) = \mathrm{Attention}(XW^Q, XW^K, XW^V) \qquad (2)$$

Multi-head attention uses $H$ scaled-dot product attentions with $1/H$ times smaller dimension on the attention weight parameters. (Vaswani et al. 2017) found that the multi-head attention is useful even though it uses a similar number of parameters compared to the single-head attention.

$$\mathrm{MHA}(Q, K, V) = [\mathrm{Head}_1, \dots, \mathrm{Head}_H]W^O, \quad \text{where } \mathrm{Head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V) \qquad (3)$$

(Yang et al. 2019) considered the dependencies, i.e. item co-occurrence, between the temporal state representations over a single sequence with the scaled-dot product attention. Their model introduces a context vector $C$ to be linearly combined with $Q$ and $K$ in the self-attention. We expand this context modeling with stochasticity and a kernel method to add flexibility to the self-attention.

### Multivariate Skew-Normal Distribution

As mentioned for the latent space model of RKSA, we introduce an explicit probability density model to the self-attention structure. Here, we choose the multivariate skew-normal (MSN) distribution as the explicit density because we intend to model 1) the covariance structure between items; and 2) the skewness of the attention value. It would be natural to consider the multivariate normal distribution to enable the covariance model, but the normal distribution is unable to model the skewness because it enforces a symmetric shape of the density curve. As the name suggests, the MSN distribution reflects the skewness through the shape parameter $\alpha$ (Azzalini and Valle 1996). The MSN distribution needs four parameters: location $\xi$, scale $\omega$, correlation $\psi$, and shape $\alpha$. Following (Azzalini and Capitanio 1999), a $k$-dimensional random variable $x \in \mathbb{R}^k$ follows the MSN distribution with the location parameter $\xi \in \mathbb{R}^k$; the correlation matrix $\psi \in \mathbb{R}^{k \times k}$; the scale parameter $\omega = \mathrm{diag}(\omega_1, \dots, \omega_k) \in \mathbb{R}^{k \times k}$; and the shape parameter $\alpha \in \mathbb{R}^k$, as Equation 4.

$$f(x) = 2\,\phi_k(x; \xi, \Sigma)\,\Phi\!\left(\alpha^\top \omega^{-1}(x - \xi)\right) \qquad (4)$$

Here, $\Sigma = \omega\psi\omega$ is the covariance matrix; $\phi_k$ is the $k$-dimensional multivariate normal density with mean $\xi$ and covariance $\Sigma$; and $\Phi$ is the cumulative distribution function of $N(0, 1)$. If $\alpha$ is a zero vector, the distribution reduces to the multivariate normal distribution with mean $\xi$ and covariance $\Sigma$.

### Kernel Function

Given that we intend to model the covariance of the MSN, we introduce how we provide the flexible covariance structure through kernels. A kernel function, $k(x, x') = \phi(x) \cdot \phi(x')$, evaluates a pair of observations in the observation space $\mathcal{X}$ with a real value. In the machine learning field, kernel functions are widely used to compute the similarity between two data points as a covariance matrix. Given observations $X = \{x_i\}_{i=1}^{n}$, a function $k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a valid kernel if and only if it is (1) symmetric: $k(x, x') = k(x', x)$ for all $x, x' \in \mathcal{X}$; and (2) positive semi-definite: $\sum_{i,j} a_i a_j k(x_i, x_j) \ge 0$ for all $a_i, a_j \in \mathbb{R}$ (Rasmussen 2003). We apply a customized kernel function to model the relational covariance parameter of the MSN in RKSA, and we provide proofs of the validity of our customized kernels.

Figure 2: (a) Graphical notation of RKSA. $\phi$ is the parameter of the MSN distribution, and the dashed line denotes the sampling procedure. (b) The overall structure of RKSA with the MSN parameters. The scaled-dot product denotes the matrix multiplication between the query and key matrices in the scaled-dot product attention.
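As a concrete reference for Equation 4, the following is a minimal NumPy/SciPy sketch of the MSN density. The function name `msn_pdf` and the toy parameter values are illustrative assumptions, not code from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def msn_pdf(x, xi, omega, psi, alpha):
    """Density of the multivariate skew-normal (Equation 4).

    x, xi, alpha : (k,) arrays; omega : (k,) positive scales;
    psi : (k, k) correlation matrix.
    """
    Omega = np.diag(omega)                 # scale matrix omega = diag(omega_1, ..., omega_k)
    Sigma = Omega @ psi @ Omega            # covariance Sigma = omega * psi * omega
    phi_k = multivariate_normal.pdf(x, mean=xi, cov=Sigma)       # k-dim normal density
    skew = norm.cdf(alpha @ np.linalg.inv(Omega) @ (x - xi))     # skewing factor Phi(.)
    return 2.0 * phi_k * skew

# With alpha = 0 the density reduces to the multivariate normal N(xi, Sigma).
x = np.array([0.3, -0.1])
print(msn_pdf(x, xi=np.zeros(2), omega=np.ones(2),
              psi=np.array([[1.0, 0.4], [0.4, 1.0]]), alpha=np.array([2.0, 0.0])))
```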
## Methodology

This section explains the sequential recommendation task, the overall structure of Relation-Aware Kernelized Self-Attention (RKSA), and its detailed parameter modeling.

### Problem Statement

A sequential recommendation uses datasets built upon the past action sequences of users. Let $U = \{u_1, u_2, \dots, u_{|U|}\}$ be a set of users; let $I = \{i_1, i_2, \dots, i_{|I|}\}$ be a set of items; and let $S_u = \{i^{(u)}_1, i^{(u)}_2, \dots, i^{(u)}_{n_u}\}$ be user $u$'s action sequence. The task of sequential recommendation is predicting the next item the user will interact with, i.e. $P(i^{(u)}_{n_u+1} = i \mid \bigcup_{u \in U} S_u)$.

### Self-Attention Block

We propose Relation-Aware Kernelized Self-Attention (RKSA), which is a modification of the self-attention structure embedded in the Transformer (Vaswani et al. 2017). Figure 2 illustrates that RKSA is a customized self-attention based on relations, such as the item, the user, and the global co-occurrence information. The detailed procedure is explained below.

**Embedding Layer.** Since the raw data of items and interactions follow sparse one-hot encodings, we need to embed the information of items and positions of interactions. To create such embeddings, we use the latest $n$ actions from the user sequence $S_u$. Specifically, the item embedding matrix is defined as $E \in \mathbb{R}^{|I| \times d}$, where $d$ denotes the dimensionality of the embedding. $E$ is estimated by a hidden layer as a part of the modeled neural network, and the raw input to the hidden layer is the one-hot encoding of the interacted item at time $t$. Similarly, we set a user embedding matrix $U \in \mathbb{R}^{|U| \times d}$ to make a distinction between users. Also, we define a positional embedding matrix $P \in \mathbb{R}^{n \times d}$ to introduce the sequential ordering information of the interactions, following the ideas of (Kang and McAuley 2018). $P$ and $U$ are also estimated by a hidden layer that matches the dimensionality of $E$ for the further construction of $x_t$. Afterward, we estimate the inputs to RKSA, and the input should convey the representation of items and positions in the sequences. Here, we assume that the item at time $t$, which is $i_t$, is represented as $x_t$ for that timestep of the sequence, and we denote the representation as $x_t$ because it is the input to RKSA. $x_t$ is estimated as the summation of the item embedding $e_{i_t} \in E$ and the positional embedding $p_t \in P$, i.e. $x_t = e_{i_t} + p_t$. Finally, the input item sequence is expressed as $X \in \mathbb{R}^{n \times d}$ by combining the item embedding $E$ and the positional embedding $P$.

**Relation-Aware Kernelized Self-Attention.** The core component of RKSA is the multi-head attention structure that includes a latent variable $z$. Given that Equation 1 is deterministic, we intend to turn $\frac{QK^\top}{\sqrt{d_k}}$ into a latent variable $Z$. The changed part is originally the alignment score of the attention mechanism, so its range is $\mathbb{R}$. Additionally, we assume that there is a skewed shape in the alignment score distribution, so we design $Z$ to follow the multivariate skew-normal (MSN) distribution, as in Equation 5. In other words, we sample the logits of the softmax function from the MSN distribution.

$$H = \mathrm{RKSA}(X, C) = \mathrm{softmax}(Z)V, \quad \text{where } Z \sim \mathrm{MSN}(Z \mid \xi, \Sigma, \alpha) \qquad (5)$$

In the above, the parameters of the MSN distribution include the location $\xi$, the covariance $\Sigma$, and the shape parameter $\alpha$. The details of the parameters are explained in Section Parameter Modeling. Additionally, in Equation 5, $X$ denotes the items in the sequence $S_u$, and $C$ is the co-occurrence matrix of $S_u$ used by our kernel model, which is explained in Section Covariance.
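The embedding layer above reduces to a lookup-and-add operation, $x_t = e_{i_t} + p_t$. The sketch below illustrates it with NumPy; the sizes, the random initialization, and the left-zero-padding of short sequences are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 64                # latest n actions, embedding dimension d (illustrative sizes)
num_items = 1000             # |I| is a hypothetical catalog size

E = rng.normal(scale=0.01, size=(num_items, d))   # item embedding matrix, E in R^{|I| x d}
P = rng.normal(scale=0.01, size=(n, d))           # positional embedding matrix, P in R^{n x d}

def build_input(item_ids):
    """x_t = e_{i_t} + p_t for the latest n items of one user sequence.

    item_ids : list of item indices, oldest first. Returns X in R^{n x d};
    shorter sequences are left-padded with zeros (an assumed convention).
    """
    X = np.zeros((n, d))
    items = np.asarray(item_ids)[-n:]             # keep only the latest n actions
    offset = n - len(items)
    X[offset:] = E[items] + P[offset:]            # add item and positional embeddings
    return X

X = build_input([3, 17, 42, 7])
print(X.shape)  # (50, 64)
```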
The co-occurrence matrix $C$ is constructed by counting the co-occurrences between item pairs in the whole dataset. We follow the amortized inference with a reparametrization on the MSN, and $X$ and $C$ are used as inputs to the inference. Lastly, the output of RKSA is the hidden representation defined as $H = \{h_1, h_2, \dots, h_n\}$, $h_i \in \mathbb{R}^d$ in Equation 5. $V$ is the value matrix estimated from the input item sequence representation $X$. Since we modify the scaled-dot product attention, RKSA is easily expanded to a variant of multi-head attention by following the same procedure as Equation 3.

**Point-Wise Feed-Forward Network.** We apply the point-wise feed-forward network of the Transformer to the output of RKSA at each position. The point-wise feed-forward network consists of two linear transformations with a ReLU nonlinear activation between them. The final output of the point-wise feed-forward network is $F = \{\mathrm{FFN}(h_1), \dots, \mathrm{FFN}(h_n)\}$. Besides the above modeling structure, we stack multiple self-attention blocks to learn complex transition patterns, and we add residual connections (He et al. 2016) to train a deeper network structure. We also apply layer normalization (Ba, Kiros, and Hinton 2016) and dropout (Srivastava et al. 2014) to the output of each layer, following (Vaswani et al. 2017).

**Output Layer.** Let $B$ be the number of self-attention blocks. The task requires predicting the $(n+1)$-th item with the $n$-th output of the $B$-th self-attention block. We use the same weights as the item embedding layer to rank the item predictions. The relevance score of item $i$ at step $n$ is defined as $r_{i,n}$:

$$r_{i,n} = F^{(B)}_n E_i^\top \qquad (6)$$

$F^{(B)}_n$ denotes the $n$-th output of the last self-attention block, and $E_i$ is the embedding of item $i$. The prediction ranking of the $(n+1)$-th item is defined by the ranking of the items' relevance scores.

### Parameter Modeling

This section enumerates the detailed modeling of the MSN parameters, which are used for the latent variable $Z$ in RKSA.

**Location.** The location $\xi$ has the same role as the mean of the multivariate normal distribution. Given that we use the MSN to sample the alignment score, we still need to provide the deterministic alignment score with the highest likelihood. Therefore, we let the alignment score be the location parameter:

$$\xi = f\!\left((XW^Q_l)(XW^K_l)^\top\right) \qquad (7)$$

An activation function $f$ and a scaling can additionally be applied to $\xi$.

**Covariance.** The covariance $\Sigma$ represents the relation between items. While $\Sigma$ is a square matrix of parameters, $\Sigma$ has a limited size because we only use the latest $n$ items, and because there are not many unique items in those latest interactions. The relation can be measured by various methods, ranging from a simple co-occurrence counting to a nonlinear kernel function. This paper designs a kernel function to measure the relation between a pair of items because the kernel function is known to be an efficient, nonlinear, high-dimensional distance metric that can also be learned by optimizing the kernel hyperparameters. We compose a kernel function by considering the relations of the co-occurrence, the item, and the user. For a given sequence, for timesteps $i$ and $j$, we utilize the normalized representations $\hat{x}_i$ and $\hat{x}_j$. Additionally, we infer the variances $\omega^2_i, \omega^2_j \in \mathbb{R}_+$ of $z$ at timesteps $i$ and $j$ by an amortized inference as Equation 8.

$$\omega_i = \mathrm{softplus}\!\left((x_n W^Q_\omega)(x_i W^K_\omega)^\top\right) \qquad (8)$$

In the above, we set the activation function of the standard deviation to the softplus to keep the standard deviation positive.
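Because Equation 8 is truncated in the source, the sketch below is a hedged reading of it: a bilinear alignment between the last item's representation and each timestep, passed through a softplus. The optional scaling by the key dimension and the random projection weights are assumptions for illustration only.

```python
import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))

def amortized_omega(X, W_q, W_k, d_k=None):
    """Per-timestep scale omega_i = softplus((x_n W^Q)(x_i W^K)^T), a reading of Equation 8.

    X : (n, d) input representations; the query is always the last timestep x_n.
    W_q, W_k : (d, d_k) projection matrices (learned in the full model; random here).
    """
    q = X[-1] @ W_q                      # query from the last item x_n
    K = X @ W_k                          # keys from every timestep
    scores = K @ q                       # bilinear alignment per timestep
    if d_k is not None:                  # optional scaling, as in scaled-dot attention (assumed)
        scores = scores / np.sqrt(d_k)
    return softplus(scores)              # softplus keeps the standard deviations positive

rng = np.random.default_rng(1)
n, d = 6, 8
X = rng.normal(size=(n, d))
omega = amortized_omega(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)), d_k=d)
print(omega)                             # strictly positive omega_1, ..., omega_n
```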
The following defines the three kernel functions, where we denote $\hat{x}_i$ as $x_i$ for simplicity.

**Counting kernel** is defined by the co-occurrence number of each item pair. The counting kernel is $k_c(x_i, x_j) = \omega_i \omega_j \frac{P_{ij}^2}{P_i P_j}$, where $P_i$ and $P_j$ are the numbers of occurrences of items $i$ and $j$, respectively, and $P_{ij}$ is the number of co-occurrences of items $i$ and $j$.

**Item kernel** utilizes the representation of each item. There are two alternative kernels. The linear item kernel is $k_i(x_i, x_j) = \omega_i \omega_j (x_i \cdot x_j)$, where $\cdot$ denotes the dot product; and the Radial Basis Function (RBF) kernel is $k_i(x_i, x_j) = \omega_i \omega_j \exp(-\|x_i - x_j\|^2)$.

**User kernel** utilizes the representations of the items and the user. The user kernel is $k_u(x_i, x_j) = \omega_i \omega_j \left[(W_s u_s \odot x_i) \cdot (W_s u_s \odot x_j)\right]$ for the user embedding $u_s \in \mathbb{R}^d$ and weight matrix $W_s \in \mathbb{R}^{d \times d}$, where $\odot$ denotes the Hadamard product.

Unlike the item and the user kernels, the validity of the counting kernel should be checked because it is not a well-known form like the linear or the RBF kernels. The counting kernel is always symmetric and positive semi-definite. Therefore, the counting kernel is a valid kernel function. From the properties of kernel functions, we combine the kernel functions by summation to make the final kernel function flexible. The final kernel function is defined as:

$$k(x_i, x_j) = r_1 k_c(x_i, x_j) + r_2 k_i(x_i, x_j) + r_3 k_u(x_i, x_j), \quad \text{where } r = \mathrm{softmax}(u_s W_u + b_u) \qquad (9)$$

With the above kernel function, our modeling of the correlation matrix is $\psi_{i,j} = \frac{k(x_i, x_j)}{\omega_i \omega_j}$, similar to the definition of $\Sigma$ in Equation 4. This section describes the covariance modeling with the final kernel, so the kernel hyperparameters, such as $\omega$, $W_s$, and $r$, need to be inferred. While they need supervision to be learned, the loss of the recommendation task needs to be augmented with an additional loss to guide the kernel hyperparameters. Therefore, we model a loss that regularizes the covariance toward the item co-occurrence. Since we have other loss terms, i.e. the recommendation loss, the learned correlation does not become the same as the item co-occurrence, but the co-occurrence loss acts as prior knowledge. Particularly, we measure the co-occurrence loss $\mathcal{L}_{rank}$ with the listwise ranking loss to match the alignment of the correlation and the ranking of the item co-occurrences. The co-occurrence loss is defined with the listwise ranking loss of (Cao et al. 2007).

**Shape.** The shape parameter $\alpha$ reflects the relation between the final item and an item in a user sequence. We designate $\alpha = \{\alpha_1, \dots, \alpha_n\}$ to the items $\{i_1, \dots, i_n\}$. We define $\alpha$ by introducing a ratio parameter $\hat{\alpha}$ with the co-occurrence matrix $C$, and a learnable scaling parameter $s$. Specifically, we assume $\alpha_j = s_j \frac{\hat{\alpha}_j}{\max(\hat{\alpha})}$, which is a scaled correlation between the final item $i_n$ and the item $i_j$. First, we calculate the ratio parameter $\hat{\alpha}_j \in [0, 1]$ with the co-occurrence matrix $C$, by the summation of the linear alignments between the last item $i_n$ and the aligned item $i_j$. Here, let $c_{i,j}$ be the value of the $i$-th row and $j$-th column of the co-occurrence matrix $C$. For simplicity, we denote $c_{i_j, i_k}$ as $c_{j,k}$. The following is the detailed formula of $\hat{\alpha}_j$:

$$\hat{\alpha}_j = \sum_{k=1}^{n} \tilde{c}_{j,k}\,\tilde{c}_{k,n}, \quad \text{where } \tilde{c}_{k,k} = \frac{1}{n-1}\sum_{l \in \{1,\dots,n\}\setminus\{k\}} c_{k,l} \text{ and } \tilde{c}_{j,k} = c_{j,k} \text{ for } j \neq k \qquad (10)$$

Equation 10 computes $\hat{\alpha}_j$ by the dot-product between the $j$-th row and the $n$-th column of the co-occurrence matrix $C$, which means that we calculate the correlation between the co-occurrences of $i_j$ and $i_n$. Having said that, the co-occurrence of an item with itself is semantically meaningless in $C$, so such entries use the average of the remaining elements in each row in the dot-product process. $\hat{\alpha}_j$ enables modeling the two-hop dependency between $i_j$ and $i_n$ through $i_k$.
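The following is a small sketch of the two-hop computation in Equation 10 under the reading above: diagonal entries of the sequence-level co-occurrence matrix are replaced by their row averages, then the rows are aligned with the last item's column, and the result is normalized to [0, 1]. The function name, toy matrix, and normalization placement are illustrative assumptions.

```python
import numpy as np

def shape_ratio(C_seq):
    """Ratio parameter alpha_hat from the sequence co-occurrence matrix (a reading of Eq. 10).

    C_seq : (n, n) symmetric co-occurrence counts among the items of one sequence;
            the n-th (last) item is the query. Diagonal entries are replaced by the
            average of the remaining entries in their row, since the co-occurrence
            of an item with itself carries no information.
    """
    C = C_seq.astype(float).copy()
    n = C.shape[0]
    row_means = (C.sum(axis=1) - np.diag(C)) / (n - 1)   # average of off-diagonal entries per row
    np.fill_diagonal(C, row_means)
    raw = C @ C[:, -1]                                    # two-hop alignment with the last item i_n
    return raw / raw.max()                                # normalize to [0, 1]

C_seq = np.array([[0, 3, 1, 2],
                  [3, 0, 4, 5],
                  [1, 4, 0, 1],
                  [2, 5, 1, 0]])
print(shape_ratio(C_seq))
```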
Second, Equation 11 defines the scaling parameter $s_j$:

$$s_j = f\!\left((x_n W^Q_s)(x_j W^K_s)^\top\right) \qquad (11)$$

We can apply the softplus activation as $f$, so the shape parameter becomes positive.

## Model Inference

### Loss Function

Given the above model structure, this subsection introduces the inference on the latent variable $z$ following the MSN distribution. It is well known that the latent variable can be inferred by optimizing the evidence lower bound obtained from Jensen's inequality, so we optimize the evidence lower bound on the marginal log-likelihood $p(y_n)$ when predicting the $(n+1)$-th item. Equation 12 describes the loss function of this prediction task.

$$\mathcal{L}_z = \mathbb{E}_z\!\left[\log p(y_n \mid z)\right] \le \log \int p(y_n \mid z)\, p(z)\, dz = \log p(y_n) \qquad (12)$$

$\mathcal{L}_z$ utilizes the binary cross-entropy loss with negative sampling, as conducted in (Kang and McAuley 2018), to calculate $p(y_n \mid z)$. It should be noted that the actual loss function is a combination of the prediction loss and the co-occurrence loss, $\mathcal{L} = \mathcal{L}_z + \lambda_r \mathcal{L}_{rank}$, where $\lambda_r$ is the regularization weight hyperparameter of the co-occurrence loss.

### Reparametrization of Z

We sample the values of $z$ from the $\mathrm{MSN}(Z \mid \xi, \omega, \psi, \alpha)$ distribution using the reparameterization trick. Equation 13 shows the reparametrization of the MSN distribution with samples from two normal distributions.

$$y_0 \sim N(0, 1), \quad y \sim N(0, \psi), \quad \delta_j = \frac{\alpha_j}{\sqrt{1 + \alpha_j^2}}, \quad \hat{z}_j = \delta_j |y_0| + (1 - \delta_j^2)^{\frac{1}{2}} y_j, \quad z_j = \xi_j + \omega_j \hat{z}_j \qquad (13)$$

This reparametrization is utilized because $z$ needs to be instantiated for the forward path. Equation 13 shows how to sample $z$ given the amortized inference parameters $\alpha$, $\xi$, $\omega$, and $\psi$. Once the forward path is enabled, the neural network can be trained via back-propagation.
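Equation 13 translates directly into code. Below is a minimal NumPy sketch for one row of $Z$; the Cholesky-based sampling of $y \sim N(0, \psi)$ and the small jitter term are implementation choices, not details from the paper.

```python
import numpy as np

def sample_msn(xi, omega, psi, alpha, rng):
    """Reparameterized sample z ~ MSN(xi, omega, psi, alpha), following Equation 13."""
    k = xi.shape[0]
    y0 = rng.standard_normal()                              # y0 ~ N(0, 1)
    L = np.linalg.cholesky(psi + 1e-6 * np.eye(k))          # jitter for numerical stability
    y = L @ rng.standard_normal(k)                          # y ~ N(0, psi)
    delta = alpha / np.sqrt(1.0 + alpha ** 2)               # delta_j = alpha_j / sqrt(1 + alpha_j^2)
    z_hat = delta * np.abs(y0) + np.sqrt(1.0 - delta ** 2) * y
    return xi + omega * z_hat                               # z_j = xi_j + omega_j * z_hat_j

rng = np.random.default_rng(0)
k = 4
psi = 0.5 * np.eye(k) + 0.5 * np.ones((k, k))               # a valid correlation matrix
z = sample_msn(xi=np.zeros(k), omega=np.ones(k), psi=psi,
               alpha=2.0 * np.ones(k), rng=rng)
print(z)
```

Because every operation is differentiable in $\xi$, $\omega$, $\psi$, and $\alpha$, the same computation inside an autodiff framework lets gradients flow through the sampled logits, which is exactly why the reparametrization is needed for the forward path.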
## Experiment Result

### Datasets

We evaluate our model on five real-world datasets: Amazon (Beauty, Games) (He and McAuley 2016; McAuley et al. 2015), CiteULike, Steam, and MovieLens. We follow the same preprocessing procedure on Beauty, Games, and Steam as (Kang and McAuley 2018). For preprocessing CiteULike and MovieLens, we follow the procedure of (Song et al. 2019). We split all datasets into training, validation, and testing sets following the procedure of (Kang and McAuley 2018). Table 1 summarizes the statistics of the preprocessed datasets.

Table 1: Statistics of evaluation datasets.

| Dataset | #users | #items | #actions | avg. actions/user | avg. actions/item |
|---|---|---|---|---|---|
| Beauty | 52,024 | 57,289 | 0.4m | 7.6 | 6.9 |
| Games | 31,013 | 23,715 | 0.3m | 9.3 | 12.1 |
| CiteULike | 1,798 | 2,000 | 0.05m | 30.6 | 27.5 |
| Steam | 334,730 | 13,047 | 3.7m | 11.0 | 282.5 |
| MovieLens | 4,639 | 930 | 0.2m | 40.9 | 204.0 |

### Baselines

We compared RKSA with eight baselines.

- **Pop** always recommends the most popular items.
- **Item-KNN** (Linden, Smith, and York 2003) recommends an item based on the measured similarity to the last item.
- **BPR-MF** (Rendle et al. 2009) recommends an item using the user and item latent vectors from matrix factorization.
- **GRU4REC** (Hidasi et al. 2015) models the sequential user history with a GRU and specialized recommendation loss functions such as the Top1 and BPR losses.
- **NARM** (Li et al. 2017) focuses on both the short- and long-term dependencies of a sequence with an attention and a modified bi-linear embedding function.
- **HCRNN** (Song et al. 2019) considers the user's sequential interest change with the global, the local, and the temporary context modeling. It modifies the GRU cell structure to incorporate the various context modeling.
- **AttRec** (Zhang et al. 2019) models the short-term intent using self-attention and the long-term preference with metric learning.
- **SASRec** (Kang and McAuley 2018) is a Transformer model which combines the strengths of Markov chains and RNNs. SASRec focuses on finding the relevant items adaptively with self-attention mechanisms.

### Experiment Settings

For GRU4REC, NARM, HCRNN, and SASRec, we use the official codes written by the corresponding authors. For GRU4REC, NARM, and HCRNN, we apply the data augmentation method proposed by NARM (Li et al. 2017). We use two self-attention blocks and one head for SASRec and RKSA, following the default setting of (Kang and McAuley 2018). For fair comparisons, we apply the same settings of the batch size (128), the item embedding dimension (64), the dropout rate (0.5), the learning rate (0.001), and the optimizer (Adam). We use the authors' settings for the other hyperparameters. For RKSA, we set the co-occurrence loss weight $\lambda_r$ to 0.001. Furthermore, we use learning rate decay and early stopping based on the validation accuracy for all methods. We use the latest 50 actions of each sequence for all datasets.

### Quantitative Analysis

Table 2: Performance comparison (higher is better). RKSA is the best performing model and SASRec the second best in every row; * indicates that the result has a p-value less than 0.05 against the second-best result based on a t-test.

| Dataset | Metric | Pop | Item-KNN | BPR-MF | GRU4REC | NARM | HCRNN | AttRec | SASRec | RKSA |
|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | Hit@5 | 0.2972 | 0.0885 | 0.0735 | 0.3097 | 0.3663 | 0.3643 | 0.3341 | 0.3735 | 0.3999* |
| | NDCG@5 | 0.1478 | 0.0872 | 0.0486 | 0.2257 | 0.2785 | 0.2764 | 0.2535 | 0.2846 | 0.2998* |
| | Hit@10 | 0.4289 | 0.0885 | 0.1285 | 0.4174 | 0.4674 | 0.4653 | 0.4222 | 0.4720 | 0.5015* |
| | NDCG@10 | 0.1882 | 0.0872 | 0.0662 | 0.2604 | 0.3111 | 0.3091 | 0.2819 | 0.3164 | 0.3326* |
| Games | Hit@5 | 0.3416 | 0.1969 | 0.1291 | 0.5749 | 0.6224 | 0.6229 | 0.5673 | 0.6395 | 0.6544* |
| | NDCG@5 | 0.1730 | 0.1892 | 0.0920 | 0.4570 | 0.4927 | 0.4955 | 0.4358 | 0.5068 | 0.5168* |
| | Hit@10 | 0.4846 | 0.1969 | 0.1919 | 0.6733 | 0.7244 | 0.7233 | 0.6812 | 0.7373 | 0.7551* |
| | NDCG@10 | 0.2168 | 0.1892 | 0.1121 | 0.4889 | 0.5257 | 0.5281 | 0.4727 | 0.5385 | 0.5495* |
| CiteULike | Hit@5 | 0.1318 | 0.3563 | 0.1624 | 0.4310 | 0.4457 | 0.4442 | 0.4275 | 0.5044 | 0.5308 |
| | NDCG@5 | 0.0650 | 0.2666 | 0.1107 | 0.2982 | 0.3016 | 0.3053 | 0.2891 | 0.3447 | 0.3687* |
| | Hit@10 | 0.2144 | 0.3815 | 0.2472 | 0.5879 | 0.6150 | 0.6077 | 0.5808 | 0.6757 | 0.6893* |
| | NDCG@10 | 0.0902 | 0.2751 | 0.1378 | 0.3488 | 0.3565 | 0.3583 | 0.3388 | 0.4001 | 0.4202* |
| Steam | Hit@5 | 0.5545 | 0.2964 | 0.5724 | 0.7065 | 0.7095 | 0.7136 | 0.5936 | 0.7477 | 0.7514 |
| | NDCG@5 | 0.2873 | 0.2724 | 0.4144 | 0.5444 | 0.5476 | 0.5516 | 0.4182 | 0.5828 | 0.5841 |
| | Hit@10 | 0.7162 | 0.2965 | 0.7083 | 0.8293 | 0.8314 | 0.8344 | 0.7491 | 0.8610 | 0.8668* |
| | NDCG@10 | 0.3370 | 0.2724 | 0.4587 | 0.5844 | 0.5873 | 0.5909 | 0.4687 | 0.6196 | 0.6217 |
| MovieLens | Hit@5 | 0.1521 | 0.2950 | 0.1241 | 0.3883 | 0.4057 | 0.4039 | 0.3493 | 0.4260 | 0.4361* |
| | NDCG@5 | 0.0733 | 0.2019 | 0.0767 | 0.2650 | 0.2775 | 0.2770 | 0.2217 | 0.2965 | 0.3023* |
| | Hit@10 | 0.2547 | 0.4051 | 0.2088 | 0.5487 | 0.5617 | 0.5606 | 0.5094 | 0.5873 | 0.5997* |
| | NDCG@10 | 0.1044 | 0.2376 | 0.1039 | 0.3167 | 0.3278 | 0.3275 | 0.2734 | 0.3485 | 0.3552* |

Table 2 presents the recommendation performance of the experimented models.
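As detailed next, ranking quality is reported with Hit@K and NDCG@K over a candidate set consisting of the target item plus sampled negatives. The sketch below shows the standard per-user computation of these two metrics with a single relevant item; it is illustrative, not the authors' evaluation code, and the function and variable names are assumptions.

```python
import numpy as np

def hit_and_ndcg_at_k(scores, target_idx, k=10):
    """Hit@K and NDCG@K for one user, with a single relevant item among the candidates.

    scores : relevance scores for the candidate set (the target plus sampled negatives);
    target_idx : position of the ground-truth item in `scores`.
    """
    rank = int((scores > scores[target_idx]).sum()) + 1   # 1-based rank of the target
    hit = 1.0 if rank <= k else 0.0
    ndcg = 1.0 / np.log2(rank + 1) if rank <= k else 0.0  # ideal DCG is 1 for a single relevant item
    return hit, ndcg

rng = np.random.default_rng(0)
scores = rng.normal(size=101)          # 1 target + 100 sampled negatives, as in the protocol below
print(hit_and_ndcg_at_k(scores, target_idx=0, k=10))
```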
We adopt two widely used measurements: Hit Rate@K and NDCG@K (He et al. 2017). Considering that evaluating all user-item pairs requires heavy computation, we use 100 negative samples for the evaluation, following (Kang and McAuley 2018; He et al. 2017). We repeat each experiment five times, and the reported results are the averages for each method. The performance of RKSA comes from the best kernel variant of RKSA, and RKSA outperforms all baseline models on all datasets and metrics. Especially, Beauty shows the biggest improvement. Beauty is the sparsest dataset, so it contains many infrequently occurring items. This result suggests that using the relational information can be helpful for predicting such infrequent items.

### Ablation Study

We compared the kernel function combinations on the Beauty and MovieLens datasets. We consider Beauty as a representative sparse dataset, and MovieLens as a representative dense dataset. Table 3 shows the performance of each kernel combination. We assume that it is hard to learn the representations of the items and the user from a sparse and short dataset. Accordingly, RKSA with the counting kernel function shows the best performance on the sparse dataset. On the contrary, it is relatively easy to learn the representations of the items and the user from the dense dataset, and Table 3 shows that the combination of the item and the user kernels is the best there.

Table 3: Ablation study on the Beauty and MovieLens datasets. The measure is Hit@10, and C, I, and U denote the counting, item, and user kernel functions, respectively. B is the Beauty dataset; M is the MovieLens dataset.

| | C | I | U | C+I | C+U | I+U | C+I+U |
|---|---|---|---|---|---|---|---|
| B | 0.5015 | 0.4982 | 0.4958 | 0.5012 | 0.4955 | 0.4951 | 0.5011 |
| M | 0.5911 | 0.5966 | 0.5977 | 0.5960 | 0.5962 | 0.5997 | 0.5973 |

### Qualitative Analysis

**Item Embedding and Correlation Matrix.** The item kernel utilizes the dependency between the items at each timestep. When learning the co-occurrence loss, the kernel hyperparameters and the item embedding capture the relational information of the co-occurrence.

Figure 3: (a) Item embedding visualization with t-SNE (Van der Maaten and Hinton 2008) of the MovieLens dataset. (b) Correlation between movies by the counting and item kernel combination.

Figure 4: (a) The weights of the counting, the item, and the user kernels for the final kernel calculation. (b) Average predicted ranking of SASRec and RKSA by item occurrence in the Beauty dataset. As the value on the x-axis grows, the group contains more frequently occurring items. RKSA predicts a higher ranking for infrequent items.

Figure 3a illustrates the item embedding of movies. Item embeddings of movies with the same genre are distributed close together. We generate a synthetic sequence to analyze the correlation from the trained kernel function. We use the counting and item kernel combination without the user kernel because the sequence is synthetic. The synthetic sequence includes four different movie series and an animation movie. Figure 3b shows that movies belonging to the same series have high correlations. On the contrary, the correlations between the animation genre and the other genres are low. Finally, we observe the weights of the counting, the item, and the user kernels, see Figure 4a, because the kernel weights also contribute to the construction of the correlation matrix. Since each dataset has different characteristics, each dataset emphasizes the counting, the item, and the user relations differently. Interestingly, the counting kernel was not the most dominant kernel on MovieLens; the user kernel was. MovieLens is a relatively dense dataset with respect to the average number of actions per user, as shown in Table 1. Our proposed model, RKSA, adapts well to the properties of the dataset, and focuses on the user kernel instead of the other kernels on the MovieLens dataset.

Figure 5: Attention heatmap for a user sequence of MovieLens.
The first row indicates the co-occurrence; the last item does not have co-occurrence information. If the co-occurrence between the last item (the query) and an item is larger than the average co-occurrence of the sequence, we fill that timestep in black, and the rest in white. The second row is the attention weight of SASRec, and the row below is the attention weight of RKSA.

**Predicted Ranking of Infrequent Items.** A sparse dataset, like Beauty, has many infrequent items, which are difficult to predict because of their information sparsity. To overcome this problem, RKSA utilizes the relational information of the whole dataset, instead of a single sequence, in the prediction. Figure 4b shows that the target item is ranked higher by RKSA as the information sparsity worsens, compared to the predicted ranking of SASRec.

**Attention Weight Case Study.** Figure 5 shows the attention weights of SASRec and RKSA together with the co-occurrence information between the last item and each item of the sequence. The sequence instance in Figure 5 has high co-occurrence values at timesteps 0, 1, 2, and 5, and Figure 5 confirms that RKSA places higher attention values there than SASRec. In the opposite case, the attention weight of RKSA is lower than the attention weight of SASRec.

## Conclusion

We present relation-aware kernelized self-attention (RKSA) for the sequential recommendation task. RKSA introduces a new self-attention mechanism which is stochastic as well as kernelized by the relational information. While past attention mechanisms are deterministic, we introduce a latent variable into the attention. Moreover, the latent variable utilizes the kernelized correlation matrix, so the kernel can be expanded to include further relational information and modeling. With these innovations, we were able to obtain the best performance in all experimental settings. We expect that further developments on the stochastic attention of the Transformer will come in the near future.

## Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1C1B6008652).

## References

- Azzalini, A., and Capitanio, A. 1999. Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(3):579–602.
- Azzalini, A., and Valle, A. D. 1996. The multivariate skew-normal distribution. Biometrika 83(4):715–726.
- Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- Cao, Z.; Qin, T.; Liu, T.-Y.; Tsai, M.-F.; and Li, H. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, 129–136. ACM.
- Chen, Q.; Zhao, H.; Li, W.; Huang, P.; and Ou, W. 2019. Behavior sequence transformer for e-commerce recommendation in Alibaba. arXiv preprint arXiv:1905.06874.
- Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- He, R., and McAuley, J. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, 507–517. International World Wide Web Conferences Steering Committee.
- He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
- He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee.
- Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
- Huang, X.; Qian, S.; Fang, Q.; Sang, J.; and Xu, C. 2018. CSAN: Contextual self-attention network for user sequential recommendation. In 2018 ACM Multimedia Conference on Multimedia Conference, 447–455. ACM.
- Kang, W.-C., and McAuley, J. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), 197–206. IEEE.
- Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; and Ma, J. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 1419–1428. ACM.
- Linden, G.; Smith, B.; and York, J. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing (1):76–80.
- Liu, Q.; Zeng, Y.; Mokhosi, R.; and Zhang, H. 2018. STAMP: Short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1831–1839. ACM.
- McAuley, J.; Targett, C.; Shi, Q.; and Van Den Hengel, A. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 43–52. ACM.
- Rasmussen, C. E. 2003. Gaussian processes in machine learning. In Summer School on Machine Learning, 63–71. Springer.
- Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 452–461. AUAI Press.
- Song, K.; Ji, M.; Park, S.; and Moon, I.-C. 2019. Hierarchical context enabled recurrent neural network for recommendation. In Proceedings of the AAAI.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
- Tan, Z.; Wang, M.; Xie, J.; Chen, Y.; and Shi, X. 2018. Deep semantic role labeling with self-attention. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Van der Maaten, L., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
- Wang, S.; Hu, L.; Cao, L.; Huang, X.; Lian, D.; and Liu, W. 2018. Attention-based transactional context embedding for next-item recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Yang, B.; Li, J.; Wong, D. F.; Chao, L. S.; Wang, X.; and Tu, Z. 2019. Context-aware self-attention networks. arXiv preprint arXiv:1902.05766.
- Ying, H.; Zhuang, F.; Zhang, F.; Liu, Y.; Xu, G.; Xie, X.; Xiong, H.; and Wu, J. 2018. Sequential recommender system based on hierarchical attention networks. In the 27th International Joint Conference on Artificial Intelligence.
- Yu, L.; Zhang, C.; Liang, S.; and Zhang, X. 2019. Multi-order attentive ranking model for sequential recommendation. In Proceedings of the AAAI.
- Zhang, H.; Goodfellow, I.; Metaxas, D.; and Odena, A. 2018. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.
- Zhang, S.; Tay, Y.; Yao, L.; Sun, A.; and An, J. 2019. Next item recommendation with self-attentive metric learning. In Thirty-Third AAAI Conference on Artificial Intelligence, volume 9.
- Zhou, C.; Bai, J.; Song, J.; Liu, X.; Zhao, Z.; Chen, X.; and Gao, J. 2018. ATRank: An attention-based user behavior modeling framework for recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence.