# Sequential Recommendation with Relation-Aware Kernelized Self-Attention

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Mingi Ji, Weonyoung Joo, Kyungwoo Song, Yoon-Yeong Kim, Il-Chul Moon
Korea Advanced Institute of Science and Technology (KAIST), Korea
{qwertgfdcvb, es345, gtshs2, yoonyeong.kim, icmoon}@kaist.ac.kr

Recent studies identified that sequential recommendation is improved by the attention mechanism. Following this development, we propose Relation-Aware Kernelized Self-Attention (RKSA), which adopts the self-attention mechanism of the Transformer and augments it with a probabilistic model. The original self-attention of the Transformer is a deterministic measure without relation-awareness. Therefore, we introduce a latent space to the self-attention, and the latent space models the recommendation context from relations as a multivariate skew-normal distribution with a kernelized covariance matrix built from co-occurrences, item characteristics, and user information. This work merges the self-attention of the Transformer with sequential recommendation by adding a probabilistic model of the recommendation task specifics. We experimented with RKSA on benchmark datasets, and RKSA shows significant improvements over recent baseline models. Also, RKSA is able to produce a latent space model that explains the reasons behind a recommendation.

## Introduction

Recommendation is one of the key application areas of artificial intelligence in the big data era. Recommendation tasks are supported by large-scale data, and users need to select a specific item from many alternatives. This selection requirement motivates the use of the attention mechanism in the recommendation task. The attention is applied to the item selection, and sequential recommendation in particular attends over the past item choice records to make the recommendation at the current timestep (Wang et al. 2018; Liu et al. 2018; Li et al. 2017; Ying et al. 2018; Yu et al. 2019; Huang et al. 2018).

Given the relationship between attention and recommendation, adopting new attention mechanisms for the recommendation task has become a research trend. For instance, Self-Attentive Sequential Recommendation (SASRec) (Kang and McAuley 2018) adopted the self-attention mechanism of the Transformer (Vaswani et al. 2017) for the recommendation task. This adaptation is interesting, but it was only limitedly customized to meet the task specifics.

Figure 1: Each entry of the co-occurrence matrix is the number of users for whom the corresponding movie pair appeared together in a user sequence of the MovieLens dataset. We can see that many users watched the Star Wars movies together. This allows modifying the attention weight from blue to red using the co-occurrence information when Star Wars 6 is the query.

Recommendation often requires understanding items, users, browsing sequences, etc., and recommendation models need to consider such contexts, which SASRec does not provide. Following SASRec, there have been developments in using the self-attention mechanism of the Transformer to model task-specific features of sequential recommendation. For example, ATRank (Zhou et al. 2018) utilized the self-attention mechanism to consider the influences from heterogeneous behavior representations.
To model the user's short-term intent, AttRec (Zhang et al. 2019) adopted the self-attention mechanism on the user interaction history. Similar to ATRank and AttRec, BST (Chen et al. 2019) used the self-attention mechanism to aggregate the auxiliary user and item features. Given the success of self-attention (Tan et al. 2018; Devlin et al. 2018; Zhang et al. 2018), the recommendation task can be improved with the sequential information, which was only limitedly used in previous works. Moreover, such utilization of the sequential information provides a new approach to customize the self-attention structure to the recommendation task. Figure 1 is an example of how the co-occurrence information may influence the attention weight. It is feasible to see a movie pair that has a higher co-occurrence than others, and this movie pair should inform the attention mechanism to increase the weight.

We renovate and customize the self-attention of the Transformer with a latent space model. Specifically, we add a latent space to the self-attention value of the Transformer, and we use the latent space to model the context from relations of the recommendation task. The latent space is modeled as a multivariate skew-normal (MSN) distribution (Azzalini and Valle 1996) whose dimension is the number of unique items in the sequence. The covariance matrix of the MSN distribution is the variable through which we model the relations of a sequence, items, and a user by a kernel function, which provides the flexibility for the recommendation task adaptation. After the kernel modeling, we provide a reparametrization of the MSN distribution to enable amortized inference on the introduced latent space. Since the relation modeling is done with kernelization, we call this model relation-aware kernelized self-attention (RKSA).

We designed RKSA with three innovations. First, the deterministic Transformer may not work well in the generalized task of recommendation because of sparsity, so we added a latent dimension and its corresponding reparameterization. Second, the covariance modeling with the relation-aware kernel enables a more fundamental adaptation of the self-attention to the recommendation task. Third, the kernelized latent space of the self-attention provides the reasoning behind the recommendation result. RKSA is evaluated against eight baseline models, including SASRec, HCRNN, and NARM, on five benchmark datasets, including Amazon reviews, MovieLens, and Steam. Our experiments showed that RKSA consistently and significantly improves the performance over the baselines on the benchmarks.

## Preliminary

### Multi-Head Attention

We start the preliminary by reviewing the self-attention structure that is the backbone of RKSA. Recently, (Vaswani et al. 2017) proposed the scaled-dot product attention, which is defined by Equation 1, where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, and $V \in \mathbb{R}^{m \times d_v}$ are the query, the key, and the value matrices, respectively. The scaled-dot product attention calculates importance weights from the dot-product of query $i$ with key $j$ with a scaling of $\sqrt{d_k}$. This importance is bounded by the softmax, and the bounded importance is multiplied by the value matrix $V$ to form the scaled-dot product attention.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \qquad (1)$$

When the query, the key, and the value take the same $X \in \mathbb{R}^{n \times d}$ as an input matrix in Equation 2, the scaled-dot product attention is called the self-attention. A self-attention with an additional predefined or learnable positional embedding
(Vaswani et al. 2017; Kang and McAuley 2018) is able to capture the latent information of the position like previous recurrent networks.

$$\mathrm{SA}(X) = \mathrm{Attention}(XW^Q, XW^K, XW^V) \qquad (2)$$

Multi-head attention uses $H$ scaled-dot product attentions with $1/H$ times smaller dimension on the attention weight parameters. (Vaswani et al. 2017) found that the multi-head attention is useful even though it uses a similar number of parameters compared to the single-head attention.

$$\mathrm{MHA}(Q, K, V) = [\mathrm{Head}_1, \dots, \mathrm{Head}_H]W^O, \quad \text{where } \mathrm{Head}_i = \mathrm{Attention}(XW_i^Q, XW_i^K, XW_i^V) \qquad (3)$$

(Yang et al. 2019) considered the dependencies, i.e. item co-occurrence, between the temporal state representations over a single sequence with the scaled-dot product attention. Their model introduces a context vector $C$ to be linearly combined with $Q$ and $K$ in the self-attention. We expand this context modeling with stochasticity and a kernel method to add flexibility to the self-attention.

### Multivariate Skew-Normal Distribution

As mentioned for the latent space model of RKSA, we introduce an explicit probability density model to the self-attention structure. Here, we choose the multivariate skew-normal (MSN) distribution as the explicit density because we intend to model 1) the covariance structure between items; and 2) the skewness of the attention value. It would be natural to consider the multivariate normal distribution to enable the covariance model, but the normal distribution is unable to model the skewness because it enforces a symmetric shape of the density curve. As the name suggests, the MSN distribution reflects the skewness through the shape parameter $\alpha$ (Azzalini and Valle 1996). The MSN distribution needs four parameters: location $\xi$, scale $\omega$, correlation $\psi$, and shape $\alpha$. Following (Azzalini and Capitanio 1999), a $k$-dimensional random variable $x \in \mathbb{R}^k$ follows the MSN distribution with the location parameter $\xi \in \mathbb{R}^k$; the correlation matrix $\psi \in \mathbb{R}^{k \times k}$; the scale parameter $\omega = \mathrm{diag}(\omega_1, \dots, \omega_k) \in \mathbb{R}^{k \times k}$; and the shape parameter $\alpha \in \mathbb{R}^k$, as Equation 4.

$$f(x) = 2\,\phi_k(x; \xi, \Sigma)\,\Phi\!\left(\alpha^\top \omega^{-1}(x - \xi)\right) \qquad (4)$$

Here, $\Sigma = \omega\psi\omega$ is the covariance matrix; $\phi_k$ is the $k$-dimensional multivariate normal density with mean $\xi$ and covariance $\Sigma$; and $\Phi$ is the cumulative distribution function of $N(0, 1)$. If $\alpha$ is a zero vector, the distribution reduces to the multivariate normal distribution with mean $\xi$ and covariance $\Sigma$.

### Kernel Function

Given that we intend to model the covariance of the MSN, we introduce how we provide the flexible covariance structure through kernels. A kernel function, $k(x, x') = \phi(x) \cdot \phi(x')$, evaluates a pair of observations in the observation space $\mathcal{X}$ with a real value. In the machine learning field, kernel functions are widely used to compute the similarity between two data points as a covariance matrix. Given observations $X = \{x_i\}_{i=1}^{n}$, a function $k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is a valid kernel if and only if it is (1) symmetric: $k(x, x') = k(x', x)$ for all $x, x' \in \mathcal{X}$; and (2) positive semi-definite: $\sum_{i,j} a_i a_j k(x_i, x_j) \ge 0$ for all $a_i, a_j \in \mathbb{R}$ (Rasmussen 2003). We apply a customized kernel function to model the relational covariance parameter of the MSN in RKSA, and we provide proofs of the validity of our customized kernels.

Figure 2: (a) Graphical notation of RKSA. $\phi$ is the parameter of the MSN distribution, and the dashed line denotes the sampling procedure. (b) The overall structure of RKSA with the MSN parameters. The scaled-dot product denotes the matrix multiplication between the query and key matrices in the scaled-dot product attention.
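As a concrete reference for Equation 4, the following is a minimal NumPy/SciPy sketch of the MSN density. The function name `msn_pdf` and the toy parameter values are illustrative assumptions, not code from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def msn_pdf(x, xi, omega, psi, alpha):
    """Density of the multivariate skew-normal (Equation 4).

    x, xi, alpha : (k,) arrays; omega : (k,) positive scales;
    psi : (k, k) correlation matrix.
    """
    Omega = np.diag(omega)                 # scale matrix omega = diag(omega_1, ..., omega_k)
    Sigma = Omega @ psi @ Omega            # covariance Sigma = omega * psi * omega
    phi_k = multivariate_normal.pdf(x, mean=xi, cov=Sigma)       # k-dim normal density
    skew = norm.cdf(alpha @ np.linalg.inv(Omega) @ (x - xi))     # skewing factor Phi(.)
    return 2.0 * phi_k * skew

# With alpha = 0 the density reduces to the multivariate normal N(xi, Sigma).
x = np.array([0.3, -0.1])
print(msn_pdf(x, xi=np.zeros(2), omega=np.ones(2),
              psi=np.array([[1.0, 0.4], [0.4, 1.0]]), alpha=np.array([2.0, 0.0])))
```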
## Methodology

This section explains the sequential recommendation task, the overall structure of Relation-Aware Kernelized Self-Attention (RKSA), and its detailed parameter modeling.

### Problem Statement

A sequential recommendation uses datasets built upon the past action sequences of users. Let $U = \{u_1, u_2, \dots, u_{|U|}\}$ be a set of users; let $I = \{i_1, i_2, \dots, i_{|I|}\}$ be a set of items; and let $S_u = \{i^{(u)}_1, i^{(u)}_2, \dots, i^{(u)}_{n_u}\}$ be user $u$'s action sequence. The task of sequential recommendation is predicting the next item the user will interact with, i.e. $P(i^{(u)}_{n_u+1} = i \mid \bigcup_{u \in U} S_u)$.

### Self-Attention Block

We propose Relation-Aware Kernelized Self-Attention (RKSA), which is a modification of the self-attention structure embedded in the Transformer (Vaswani et al. 2017). Figure 2 illustrates that RKSA is a customized self-attention based on relations, such as the item, the user, and the global co-occurrence information. The detailed procedure is explained below.

**Embedding Layer.** Since the raw data of items and interactions follow sparse one-hot encodings, we need to embed the information of items and positions of interactions. To create such embeddings, we use the latest $n$ actions from the user sequence $S_u$. Specifically, the item embedding matrix is defined as $E \in \mathbb{R}^{|I| \times d}$, where $d$ denotes the dimensionality of the embedding. $E$ is estimated by a hidden layer as a part of the modeled neural network, and the raw input to the hidden layer is the one-hot encoding of the interacted item at time $t$. Similarly, we set a user embedding matrix $U \in \mathbb{R}^{|U| \times d}$ to make a distinction between users. Also, we define a positional embedding matrix $P \in \mathbb{R}^{n \times d}$ to introduce the sequential ordering information of the interactions, following the ideas of (Kang and McAuley 2018). $P$ and $U$ are also estimated by a hidden layer that matches the dimensionality of $E$ for the further construction of $x_t$. Afterward, we estimate the inputs to RKSA, and the input should convey the representation of items and positions in the sequences. Here, we assume that the item at time $t$, which is $i_t$, is represented as $x_t$ for that timestep of the sequence, and we denote the representation as $x_t$ because it is the input to RKSA. $x_t$ is estimated as the summation of the item embedding $e_{i_t} \in E$ and the positional embedding $p_t \in P$, i.e. $x_t = e_{i_t} + p_t$. Finally, the input item sequence is expressed as $X \in \mathbb{R}^{n \times d}$ by combining the item embedding $E$ and the positional embedding $P$.

**Relation-Aware Kernelized Self-Attention.** The core component of RKSA is the multi-head attention structure that includes a latent variable $z$. Given that Equation 1 is deterministic, we intend to turn $\frac{QK^\top}{\sqrt{d_k}}$ into a latent variable $Z$. The changed part is originally the alignment score of the attention mechanism, so its range is $\mathbb{R}$. Additionally, we assume that there is a skewed shape in the alignment score distribution, so we design $Z$ to follow the multivariate skew-normal (MSN) distribution, as in Equation 5. In other words, we sample the logits of the softmax function from the MSN distribution.

$$H = \mathrm{RKSA}(X, C) = \mathrm{softmax}(Z)V, \quad \text{where } Z \sim \mathrm{MSN}(Z \mid \xi, \Sigma, \alpha) \qquad (5)$$

In the above, the parameters of the MSN distribution include the location $\xi$, the covariance $\Sigma$, and the shape parameter $\alpha$. The details of the parameters are explained in Section Parameter Modeling. Additionally, in Equation 5, $X$ denotes the items in the sequence $S_u$, and $C$ is the co-occurrence matrix of $S_u$ used by our kernel model, which is explained in Section Covariance.
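The embedding layer above reduces to a lookup-and-add operation, $x_t = e_{i_t} + p_t$. The sketch below illustrates it with NumPy; the sizes, the random initialization, and the left-zero-padding of short sequences are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 64                # latest n actions, embedding dimension d (illustrative sizes)
num_items = 1000             # |I| is a hypothetical catalog size

E = rng.normal(scale=0.01, size=(num_items, d))   # item embedding matrix, E in R^{|I| x d}
P = rng.normal(scale=0.01, size=(n, d))           # positional embedding matrix, P in R^{n x d}

def build_input(item_ids):
    """x_t = e_{i_t} + p_t for the latest n items of one user sequence.

    item_ids : list of item indices, oldest first. Returns X in R^{n x d};
    shorter sequences are left-padded with zeros (an assumed convention).
    """
    X = np.zeros((n, d))
    items = np.asarray(item_ids)[-n:]             # keep only the latest n actions
    offset = n - len(items)
    X[offset:] = E[items] + P[offset:]            # add item and positional embeddings
    return X

X = build_input([3, 17, 42, 7])
print(X.shape)  # (50, 64)
```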
The co-occurrence matrix $C$ is constructed by counting the co-occurrences between item pairs in the whole dataset. We follow the amortized inference with a reparametrization on the MSN, and $X$ and $C$ are used as inputs to the inference. Lastly, the output of RKSA is the hidden representation defined as $H = \{h_1, h_2, \dots, h_n\}$, $h_i \in \mathbb{R}^d$ in Equation 5. $V$ is the value matrix estimated from the input item sequence representation $X$. Since we modify the scaled-dot product attention, RKSA is easily expanded to a variant of multi-head attention by following the same procedure as Equation 3.

**Point-Wise Feed-Forward Network.** We apply the point-wise feed-forward network of the Transformer to the output of RKSA at each position. The point-wise feed-forward network consists of two linear transformations with a ReLU nonlinear activation between them. The final output of the point-wise feed-forward network is $F = \{\mathrm{FFN}(h_1), \dots, \mathrm{FFN}(h_n)\}$. Besides the above modeling structure, we stack multiple self-attention blocks to learn complex transition patterns, and we add residual connections (He et al. 2016) to train a deeper network structure. We also apply layer normalization (Ba, Kiros, and Hinton 2016) and dropout (Srivastava et al. 2014) to the output of each layer, following (Vaswani et al. 2017).

**Output Layer.** Let $B$ be the number of self-attention blocks. The task requires predicting the $(n+1)$-th item with the $n$-th output of the $B$-th self-attention block. We use the same weights as the item embedding layer to rank the item predictions. The relevance score of item $i$ at step $n$ is defined as $r_{i,n}$:

$$r_{i,n} = F^{(B)}_n E_i^\top \qquad (6)$$

$F^{(B)}_n$ denotes the $n$-th output of the last self-attention block, and $E_i$ is the embedding of item $i$. The prediction ranking of the $(n+1)$-th item is defined by the ranking of the items' relevance scores.

### Parameter Modeling

This section enumerates the detailed modeling of the MSN parameters, which are used for the latent variable $Z$ in RKSA.

**Location.** The location $\xi$ has the same role as the mean of the multivariate normal distribution. Given that we use the MSN to sample the alignment score, we still need to provide the deterministic alignment score with the highest likelihood. Therefore, we let the alignment score be the location parameter:

$$\xi = f\!\left((XW^Q_l)(XW^K_l)^\top\right) \qquad (7)$$

An activation function $f$ and a scaling can additionally be applied to $\xi$.

**Covariance.** The covariance $\Sigma$ represents the relation between items. While $\Sigma$ is a square matrix of parameters, $\Sigma$ has a limited size because we only use the latest $n$ items, and because there are not many unique items in those latest interactions. The relation can be measured by various methods, ranging from a simple co-occurrence counting to a nonlinear kernel function. This paper designs a kernel function to measure the relation between a pair of items because the kernel function is known to be an efficient, nonlinear, high-dimensional distance metric that can also be learned by optimizing the kernel hyperparameters. We compose a kernel function by considering the relations of the co-occurrence, the item, and the user. For a given sequence, for timesteps $i$ and $j$, we utilize the normalized representations $\hat{x}_i$ and $\hat{x}_j$. Additionally, we infer the variances $\omega^2_i, \omega^2_j \in \mathbb{R}_+$ of $z$ at timesteps $i$ and $j$ by an amortized inference as Equation 8.

$$\omega_i = \mathrm{softplus}\!\left((x_n W^Q_\omega)(x_i W^K_\omega)^\top\right) \qquad (8)$$

In the above, we set the activation function of the standard deviation to the softplus to keep the standard deviation positive.
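Because Equation 8 is truncated in the source, the sketch below is a hedged reading of it: a bilinear alignment between the last item's representation and each timestep, passed through a softplus. The optional scaling by the key dimension and the random projection weights are assumptions for illustration only.

```python
import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))

def amortized_omega(X, W_q, W_k, d_k=None):
    """Per-timestep scale omega_i = softplus((x_n W^Q)(x_i W^K)^T), a reading of Equation 8.

    X : (n, d) input representations; the query is always the last timestep x_n.
    W_q, W_k : (d, d_k) projection matrices (learned in the full model; random here).
    """
    q = X[-1] @ W_q                      # query from the last item x_n
    K = X @ W_k                          # keys from every timestep
    scores = K @ q                       # bilinear alignment per timestep
    if d_k is not None:                  # optional scaling, as in scaled-dot attention (assumed)
        scores = scores / np.sqrt(d_k)
    return softplus(scores)              # softplus keeps the standard deviations positive

rng = np.random.default_rng(1)
n, d = 6, 8
X = rng.normal(size=(n, d))
omega = amortized_omega(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)), d_k=d)
print(omega)                             # strictly positive omega_1, ..., omega_n
```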
The following defines the three kernel functions, where we denote $\hat{x}_i$ as $x_i$ for simplicity.

**Counting kernel** is defined by the co-occurrence number of each item pair. The counting kernel is $k_c(x_i, x_j) = \omega_i \omega_j \frac{P_{ij}^2}{P_i P_j}$, where $P_i$ and $P_j$ are the numbers of occurrences of items $i$ and $j$, respectively, and $P_{ij}$ is the number of co-occurrences of items $i$ and $j$.

**Item kernel** utilizes the representation of each item. There are two alternative kernels. The linear item kernel is $k_i(x_i, x_j) = \omega_i \omega_j (x_i \cdot x_j)$, where $\cdot$ denotes the dot product; and the Radial Basis Function (RBF) kernel is $k_i(x_i, x_j) = \omega_i \omega_j \exp(-\|x_i - x_j\|^2)$.

**User kernel** utilizes the representations of the items and the user. The user kernel is $k_u(x_i, x_j) = \omega_i \omega_j \left[(W_s u_s \odot x_i) \cdot (W_s u_s \odot x_j)\right]$ for the user embedding $u_s \in \mathbb{R}^d$ and weight matrix $W_s \in \mathbb{R}^{d \times d}$, where $\odot$ denotes the Hadamard product.

Unlike the item and the user kernels, the validity of the counting kernel should be checked because it is not a well-known form like the linear or the RBF kernels. The counting kernel is always symmetric and positive semi-definite. Therefore, the counting kernel is a valid kernel function. From the properties of kernel functions, we combine the kernel functions by summation to make the final kernel function flexible. The final kernel function is defined as:

$$k(x_i, x_j) = r_1 k_c(x_i, x_j) + r_2 k_i(x_i, x_j) + r_3 k_u(x_i, x_j), \quad \text{where } r = \mathrm{softmax}(u_s W_u + b_u) \qquad (9)$$

With the above kernel function, our modeling of the correlation matrix is $\psi_{i,j} = \frac{k(x_i, x_j)}{\omega_i \omega_j}$, similar to the definition of $\Sigma$ in Equation 4. This section describes the covariance modeling with the final kernel, so the kernel hyperparameters, such as $\omega$, $W_s$, and $r$, need to be inferred. While they need supervision to be learned, the loss of the recommendation task needs to be augmented with an additional loss to guide the kernel hyperparameters. Therefore, we model a loss that regularizes the covariance toward the item co-occurrence. Since we have other loss terms, i.e. the recommendation loss, the learned correlation does not become the same as the item co-occurrence, but the co-occurrence loss acts as prior knowledge. Particularly, we measure the co-occurrence loss $\mathcal{L}_{rank}$ with the listwise ranking loss to match the alignment of the correlation and the ranking of the item co-occurrences. The co-occurrence loss is defined with the listwise ranking loss of (Cao et al. 2007).

**Shape.** The shape parameter $\alpha$ reflects the relation between the final item and an item in a user sequence. We designate $\alpha = \{\alpha_1, \dots, \alpha_n\}$ to the items $\{i_1, \dots, i_n\}$. We define $\alpha$ by introducing a ratio parameter $\hat{\alpha}$ with the co-occurrence matrix $C$, and a learnable scaling parameter $s$. Specifically, we assume $\alpha_j = s_j \frac{\hat{\alpha}_j}{\max(\hat{\alpha})}$, which is a scaled correlation between the final item $i_n$ and the item $i_j$. First, we calculate the ratio parameter $\hat{\alpha}_j \in [0, 1]$ with the co-occurrence matrix $C$, by the summation of the linear alignments between the last item $i_n$ and the aligned item $i_j$. Here, let $c_{i,j}$ be the value of the $i$-th row and $j$-th column of the co-occurrence matrix $C$. For simplicity, we denote $c_{i_j, i_k}$ as $c_{j,k}$. The following is the detailed formula of $\hat{\alpha}_j$:

$$\hat{\alpha}_j = \sum_{k=1}^{n} \tilde{c}_{j,k}\,\tilde{c}_{k,n}, \quad \text{where } \tilde{c}_{k,k} = \frac{1}{n-1}\sum_{l \in \{1,\dots,n\}\setminus\{k\}} c_{k,l} \text{ and } \tilde{c}_{j,k} = c_{j,k} \text{ for } j \neq k \qquad (10)$$

Equation 10 computes $\hat{\alpha}_j$ by the dot-product between the $j$-th row and the $n$-th column of the co-occurrence matrix $C$, which means that we calculate the correlation between the co-occurrences of $i_j$ and $i_n$. Having said that, the co-occurrence of an item with itself is semantically meaningless in $C$, so such entries use the average of the remaining elements in each row in the dot-product process. $\hat{\alpha}_j$ enables modeling the two-hop dependency between $i_j$ and $i_n$ through $i_k$.
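The following is a small sketch of the two-hop computation in Equation 10 under the reading above: diagonal entries of the sequence-level co-occurrence matrix are replaced by their row averages, then the rows are aligned with the last item's column, and the result is normalized to [0, 1]. The function name, toy matrix, and normalization placement are illustrative assumptions.

```python
import numpy as np

def shape_ratio(C_seq):
    """Ratio parameter alpha_hat from the sequence co-occurrence matrix (a reading of Eq. 10).

    C_seq : (n, n) symmetric co-occurrence counts among the items of one sequence;
            the n-th (last) item is the query. Diagonal entries are replaced by the
            average of the remaining entries in their row, since the co-occurrence
            of an item with itself carries no information.
    """
    C = C_seq.astype(float).copy()
    n = C.shape[0]
    row_means = (C.sum(axis=1) - np.diag(C)) / (n - 1)   # average of off-diagonal entries per row
    np.fill_diagonal(C, row_means)
    raw = C @ C[:, -1]                                    # two-hop alignment with the last item i_n
    return raw / raw.max()                                # normalize to [0, 1]

C_seq = np.array([[0, 3, 1, 2],
                  [3, 0, 4, 5],
                  [1, 4, 0, 1],
                  [2, 5, 1, 0]])
print(shape_ratio(C_seq))
```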
Second, Equation 11 defines the scaling parameter $s_j$:

$$s_j = f\!\left((x_n W^Q_s)(x_j W^K_s)^\top\right) \qquad (11)$$

We can apply the softplus activation as $f$, so the shape parameter becomes positive.

## Model Inference

### Loss Function

Given the above model structure, this subsection introduces the inference on the latent variable $z$ following the MSN distribution. It is well known that the latent variable can be inferred by optimizing the evidence lower bound obtained from Jensen's inequality, so we optimize the evidence lower bound on the marginal log-likelihood $p(y_n)$ when predicting the $(n+1)$-th item. Equation 12 describes the loss function of this prediction task.

$$\mathcal{L}_z = \mathbb{E}_z\!\left[\log p(y_n \mid z)\right] \le \log \int p(y_n \mid z)\, p(z)\, dz = \log p(y_n) \qquad (12)$$

$\mathcal{L}_z$ utilizes the binary cross-entropy loss with negative sampling, as conducted in (Kang and McAuley 2018), to calculate $p(y_n \mid z)$. It should be noted that the actual loss function is a combination of the prediction loss and the co-occurrence loss, $\mathcal{L} = \mathcal{L}_z + \lambda_r \mathcal{L}_{rank}$, where $\lambda_r$ is the regularization weight hyperparameter of the co-occurrence loss.

### Reparametrization of Z

We sample the values of $z$ from the $\mathrm{MSN}(Z \mid \xi, \omega, \psi, \alpha)$ distribution using the reparameterization trick. Equation 13 shows the reparametrization of the MSN distribution with samples from two normal distributions.

$$y_0 \sim N(0, 1), \quad y \sim N(0, \psi), \quad \delta_j = \frac{\alpha_j}{\sqrt{1 + \alpha_j^2}}, \quad \hat{z}_j = \delta_j |y_0| + (1 - \delta_j^2)^{\frac{1}{2}} y_j, \quad z_j = \xi_j + \omega_j \hat{z}_j \qquad (13)$$

This reparametrization is utilized because $z$ needs to be instantiated for the forward path. Equation 13 shows how to sample $z$ given the amortized inference parameters $\alpha$, $\xi$, $\omega$, and $\psi$. Once the forward path is enabled, the neural network can be trained via back-propagation.
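Equation 13 translates directly into code. Below is a minimal NumPy sketch for one row of $Z$; the Cholesky-based sampling of $y \sim N(0, \psi)$ and the small jitter term are implementation choices, not details from the paper.

```python
import numpy as np

def sample_msn(xi, omega, psi, alpha, rng):
    """Reparameterized sample z ~ MSN(xi, omega, psi, alpha), following Equation 13."""
    k = xi.shape[0]
    y0 = rng.standard_normal()                              # y0 ~ N(0, 1)
    L = np.linalg.cholesky(psi + 1e-6 * np.eye(k))          # jitter for numerical stability
    y = L @ rng.standard_normal(k)                          # y ~ N(0, psi)
    delta = alpha / np.sqrt(1.0 + alpha ** 2)               # delta_j = alpha_j / sqrt(1 + alpha_j^2)
    z_hat = delta * np.abs(y0) + np.sqrt(1.0 - delta ** 2) * y
    return xi + omega * z_hat                               # z_j = xi_j + omega_j * z_hat_j

rng = np.random.default_rng(0)
k = 4
psi = 0.5 * np.eye(k) + 0.5 * np.ones((k, k))               # a valid correlation matrix
z = sample_msn(xi=np.zeros(k), omega=np.ones(k), psi=psi,
               alpha=2.0 * np.ones(k), rng=rng)
print(z)
```

Because every operation is differentiable in $\xi$, $\omega$, $\psi$, and $\alpha$, the same computation inside an autodiff framework lets gradients flow through the sampled logits, which is exactly why the reparametrization is needed for the forward path.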
## Experiment Result

### Datasets

We evaluate our model on five real-world datasets: Amazon (Beauty, Games) (He and McAuley 2016; McAuley et al. 2015), CiteULike, Steam, and MovieLens. We follow the same preprocessing procedure on Beauty, Games, and Steam as (Kang and McAuley 2018). For preprocessing CiteULike and MovieLens, we follow the procedure of (Song et al. 2019). We split all datasets into training, validation, and testing sets following the procedure of (Kang and McAuley 2018). Table 1 summarizes the statistics of the preprocessed datasets.

Table 1: Statistics of evaluation datasets.

| Dataset | #users | #items | #actions | avg. actions/user | avg. actions/item |
|---|---|---|---|---|---|
| Beauty | 52,024 | 57,289 | 0.4m | 7.6 | 6.9 |
| Games | 31,013 | 23,715 | 0.3m | 9.3 | 12.1 |
| CiteULike | 1,798 | 2,000 | 0.05m | 30.6 | 27.5 |
| Steam | 334,730 | 13,047 | 3.7m | 11.0 | 282.5 |
| MovieLens | 4,639 | 930 | 0.2m | 40.9 | 204.0 |

### Baselines

We compared RKSA with eight baselines.

- **Pop** always recommends the most popular items.
- **Item-KNN** (Linden, Smith, and York 2003) recommends an item based on the measured similarity to the last item.
- **BPR-MF** (Rendle et al. 2009) recommends an item using the user and item latent vectors from matrix factorization.
- **GRU4REC** (Hidasi et al. 2015) models the sequential user history with a GRU and specialized recommendation loss functions such as the Top1 and BPR losses.
- **NARM** (Li et al. 2017) focuses on both the short- and long-term dependencies of a sequence with an attention and a modified bi-linear embedding function.
- **HCRNN** (Song et al. 2019) considers the user's sequential interest change with the global, the local, and the temporary context modeling. It modifies the GRU cell structure to incorporate the various context modeling.
- **AttRec** (Zhang et al. 2019) models the short-term intent using self-attention and the long-term preference with metric learning.
- **SASRec** (Kang and McAuley 2018) is a Transformer model which combines the strengths of Markov chains and RNNs. SASRec focuses on finding the relevant items adaptively with self-attention mechanisms.

### Experiment Settings

For GRU4REC, NARM, HCRNN, and SASRec, we use the official codes written by the corresponding authors. For GRU4REC, NARM, and HCRNN, we apply the data augmentation method proposed by NARM (Li et al. 2017). We use two self-attention blocks and one head for SASRec and RKSA, following the default setting of (Kang and McAuley 2018). For fair comparisons, we apply the same settings of the batch size (128), the item embedding dimension (64), the dropout rate (0.5), the learning rate (0.001), and the optimizer (Adam). We use the authors' settings for the other hyperparameters. For RKSA, we set the co-occurrence loss weight $\lambda_r$ to 0.001. Furthermore, we use learning rate decay and early stopping based on the validation accuracy for all methods. We use the latest 50 actions of each sequence for all datasets.

### Quantitative Analysis

Table 2: Performance comparison (higher is better). RKSA is the best performing model and SASRec the second best in every row; * indicates that the result has a p-value less than 0.05 against the second-best result based on a t-test.

| Dataset | Metric | Pop | Item-KNN | BPR-MF | GRU4REC | NARM | HCRNN | AttRec | SASRec | RKSA |
|---|---|---|---|---|---|---|---|---|---|---|
| Beauty | Hit@5 | 0.2972 | 0.0885 | 0.0735 | 0.3097 | 0.3663 | 0.3643 | 0.3341 | 0.3735 | 0.3999* |
| | NDCG@5 | 0.1478 | 0.0872 | 0.0486 | 0.2257 | 0.2785 | 0.2764 | 0.2535 | 0.2846 | 0.2998* |
| | Hit@10 | 0.4289 | 0.0885 | 0.1285 | 0.4174 | 0.4674 | 0.4653 | 0.4222 | 0.4720 | 0.5015* |
| | NDCG@10 | 0.1882 | 0.0872 | 0.0662 | 0.2604 | 0.3111 | 0.3091 | 0.2819 | 0.3164 | 0.3326* |
| Games | Hit@5 | 0.3416 | 0.1969 | 0.1291 | 0.5749 | 0.6224 | 0.6229 | 0.5673 | 0.6395 | 0.6544* |
| | NDCG@5 | 0.1730 | 0.1892 | 0.0920 | 0.4570 | 0.4927 | 0.4955 | 0.4358 | 0.5068 | 0.5168* |
| | Hit@10 | 0.4846 | 0.1969 | 0.1919 | 0.6733 | 0.7244 | 0.7233 | 0.6812 | 0.7373 | 0.7551* |
| | NDCG@10 | 0.2168 | 0.1892 | 0.1121 | 0.4889 | 0.5257 | 0.5281 | 0.4727 | 0.5385 | 0.5495* |
| CiteULike | Hit@5 | 0.1318 | 0.3563 | 0.1624 | 0.4310 | 0.4457 | 0.4442 | 0.4275 | 0.5044 | 0.5308 |
| | NDCG@5 | 0.0650 | 0.2666 | 0.1107 | 0.2982 | 0.3016 | 0.3053 | 0.2891 | 0.3447 | 0.3687* |
| | Hit@10 | 0.2144 | 0.3815 | 0.2472 | 0.5879 | 0.6150 | 0.6077 | 0.5808 | 0.6757 | 0.6893* |
| | NDCG@10 | 0.0902 | 0.2751 | 0.1378 | 0.3488 | 0.3565 | 0.3583 | 0.3388 | 0.4001 | 0.4202* |
| Steam | Hit@5 | 0.5545 | 0.2964 | 0.5724 | 0.7065 | 0.7095 | 0.7136 | 0.5936 | 0.7477 | 0.7514 |
| | NDCG@5 | 0.2873 | 0.2724 | 0.4144 | 0.5444 | 0.5476 | 0.5516 | 0.4182 | 0.5828 | 0.5841 |
| | Hit@10 | 0.7162 | 0.2965 | 0.7083 | 0.8293 | 0.8314 | 0.8344 | 0.7491 | 0.8610 | 0.8668* |
| | NDCG@10 | 0.3370 | 0.2724 | 0.4587 | 0.5844 | 0.5873 | 0.5909 | 0.4687 | 0.6196 | 0.6217 |
| MovieLens | Hit@5 | 0.1521 | 0.2950 | 0.1241 | 0.3883 | 0.4057 | 0.4039 | 0.3493 | 0.4260 | 0.4361* |
| | NDCG@5 | 0.0733 | 0.2019 | 0.0767 | 0.2650 | 0.2775 | 0.2770 | 0.2217 | 0.2965 | 0.3023* |
| | Hit@10 | 0.2547 | 0.4051 | 0.2088 | 0.5487 | 0.5617 | 0.5606 | 0.5094 | 0.5873 | 0.5997* |
| | NDCG@10 | 0.1044 | 0.2376 | 0.1039 | 0.3167 | 0.3278 | 0.3275 | 0.2734 | 0.3485 | 0.3552* |

Table 2 presents the recommendation performance of the experimented models.
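As detailed next, ranking quality is reported with Hit@K and NDCG@K over a candidate set consisting of the target item plus sampled negatives. The sketch below shows the standard per-user computation of these two metrics with a single relevant item; it is illustrative, not the authors' evaluation code, and the function and variable names are assumptions.

```python
import numpy as np

def hit_and_ndcg_at_k(scores, target_idx, k=10):
    """Hit@K and NDCG@K for one user, with a single relevant item among the candidates.

    scores : relevance scores for the candidate set (the target plus sampled negatives);
    target_idx : position of the ground-truth item in `scores`.
    """
    rank = int((scores > scores[target_idx]).sum()) + 1   # 1-based rank of the target
    hit = 1.0 if rank <= k else 0.0
    ndcg = 1.0 / np.log2(rank + 1) if rank <= k else 0.0  # ideal DCG is 1 for a single relevant item
    return hit, ndcg

rng = np.random.default_rng(0)
scores = rng.normal(size=101)          # 1 target + 100 sampled negatives, as in the protocol below
print(hit_and_ndcg_at_k(scores, target_idx=0, k=10))
```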
We adopt two widely used measurements: Hit Rate@K and NDCG@K (He et al. 2017). Considering that evaluating all user-item pairs requires heavy computation, we use 100 negative samples for the evaluation, following (Kang and McAuley 2018; He et al. 2017). We repeat each experiment five times, and the reported results are the averages for each method. The performance of RKSA comes from the best kernel variant of RKSA, and RKSA outperforms all baseline models on all datasets and metrics. Especially, Beauty shows the biggest improvement. Beauty is the sparsest dataset, so it contains many infrequently occurring items. This result suggests that using the relational information can be helpful for predicting such infrequent items.

### Ablation Study

We compared the kernel function combinations on the Beauty and MovieLens datasets. We consider Beauty as a representative sparse dataset, and MovieLens as a representative dense dataset. Table 3 shows the performance of each kernel combination. We assume that it is hard to learn the representations of the items and the user from a sparse and short dataset. Accordingly, RKSA with the counting kernel function shows the best performance on the sparse dataset. On the contrary, it is relatively easy to learn the representations of the items and the user from the dense dataset, and Table 3 shows that the combination of the item and the user kernels is the best there.

Table 3: Ablation study on the Beauty and MovieLens datasets. The measure is Hit@10, and C, I, and U denote the counting, item, and user kernel functions, respectively. B is the Beauty dataset; M is the MovieLens dataset.

| | C | I | U | C+I | C+U | I+U | C+I+U |
|---|---|---|---|---|---|---|---|
| B | 0.5015 | 0.4982 | 0.4958 | 0.5012 | 0.4955 | 0.4951 | 0.5011 |
| M | 0.5911 | 0.5966 | 0.5977 | 0.5960 | 0.5962 | 0.5997 | 0.5973 |

### Qualitative Analysis

**Item Embedding and Correlation Matrix.** The item kernel utilizes the dependency between the items at each timestep. When learning the co-occurrence loss, the kernel hyperparameters and the item embedding capture the relational information of the co-occurrence.

Figure 3: (a) Item embedding visualization with t-SNE (Van der Maaten and Hinton 2008) of the MovieLens dataset. (b) Correlation between movies by the counting and item kernel combination.

Figure 4: (a) The weights of the counting, the item, and the user kernels for the final kernel calculation. (b) Average predicted ranking of SASRec and RKSA by item occurrence in the Beauty dataset. As the value on the x-axis grows, the group contains more frequently occurring items. RKSA predicts a higher ranking for infrequent items.

Figure 3a illustrates the item embedding of movies. Item embeddings of movies with the same genre are distributed close together. We generate a synthetic sequence to analyze the correlation from the trained kernel function. We use the counting and item kernel combination without the user kernel because the sequence is synthetic. The synthetic sequence includes four different movie series and an animation movie. Figure 3b shows that movies belonging to the same series have high correlations. On the contrary, the correlations between the animation genre and the other genres are low. Finally, we observe the weights of the counting, the item, and the user kernels, see Figure 4a, because the kernel weights also contribute to the construction of the correlation matrix. Since each dataset has different characteristics, each dataset emphasizes the counting, the item, and the user relations differently. Interestingly, the counting kernel was not the most dominant kernel on MovieLens; the user kernel was. MovieLens is a relatively dense dataset with respect to the average number of actions per user, as shown in Table 1. Our proposed model, RKSA, adapts well to the properties of the dataset, and focuses on the user kernel instead of the other kernels on the MovieLens dataset.

Figure 5: Attention heatmap for a user sequence of MovieLens.
The first row indicates the co-occurrence; the last item does not have co-occurrence information. If the co-occurrence between the last item (the query) and an item is larger than the average co-occurrence of the sequence, we fill that timestep in black, and the rest in white. The second row is the attention weight of SASRec, and the row below is the attention weight of RKSA.

**Predicted Ranking of Infrequent Items.** A sparse dataset, like Beauty, has many infrequent items, which are difficult to predict because of their information sparsity. To overcome this problem, RKSA utilizes the relational information of the whole dataset, instead of a single sequence, in the prediction. Figure 4b shows that the target item is ranked higher by RKSA as the information sparsity worsens, compared to the predicted ranking of SASRec.

**Attention Weight Case Study.** Figure 5 shows the attention weights of SASRec and RKSA together with the co-occurrence information between the last item and each item of the sequence. The sequence instance in Figure 5 has high co-occurrence values at timesteps 0, 1, 2, and 5, and Figure 5 confirms that RKSA places higher attention values there than SASRec. In the opposite case, the attention weight of RKSA is lower than the attention weight of SASRec.

## Conclusion

We present relation-aware kernelized self-attention (RKSA) for the sequential recommendation task. RKSA introduces a new self-attention mechanism which is stochastic as well as kernelized by the relational information. While past attention mechanisms are deterministic, we introduce a latent variable into the attention. Moreover, the latent variable utilizes the kernelized correlation matrix, so the kernel can be expanded to include further relational information and modeling. With these innovations, we were able to obtain the best performance in all experimental settings. We expect that further developments on the stochastic attention of the Transformer will come in the near future.

## Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1C1B6008652).

## References

- Azzalini, A., and Capitanio, A. 1999. Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(3):579–602.
- Azzalini, A., and Valle, A. D. 1996. The multivariate skew-normal distribution. Biometrika 83(4):715–726.
- Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- Cao, Z.; Qin, T.; Liu, T.-Y.; Tsai, M.-F.; and Li, H. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, 129–136. ACM.
- Chen, Q.; Zhao, H.; Li, W.; Huang, P.; and Ou, W. 2019. Behavior sequence transformer for e-commerce recommendation in Alibaba. arXiv preprint arXiv:1905.06874.
- Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- He, R., and McAuley, J. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, 507–517. International World Wide Web Conferences Steering Committee.
- He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
- He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, 173–182. International World Wide Web Conferences Steering Committee.
- Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
- Huang, X.; Qian, S.; Fang, Q.; Sang, J.; and Xu, C. 2018. CSAN: Contextual self-attention network for user sequential recommendation. In 2018 ACM Multimedia Conference on Multimedia Conference, 447–455. ACM.
- Kang, W.-C., and McAuley, J. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), 197–206. IEEE.
- Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; and Ma, J. 2017. Neural attentive session-based recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 1419–1428. ACM.
- Linden, G.; Smith, B.; and York, J. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing (1):76–80.
- Liu, Q.; Zeng, Y.; Mokhosi, R.; and Zhang, H. 2018. STAMP: Short-term attention/memory priority model for session-based recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1831–1839. ACM.
- McAuley, J.; Targett, C.; Shi, Q.; and Van Den Hengel, A. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 43–52. ACM.
- Rasmussen, C. E. 2003. Gaussian processes in machine learning. In Summer School on Machine Learning, 63–71. Springer.
- Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 452–461. AUAI Press.
- Song, K.; Ji, M.; Park, S.; and Moon, I.-C. 2019. Hierarchical context enabled recurrent neural network for recommendation. In Proceedings of the AAAI.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
- Tan, Z.; Wang, M.; Xie, J.; Chen, Y.; and Shi, X. 2018. Deep semantic role labeling with self-attention. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Van der Maaten, L., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov).
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
- Wang, S.; Hu, L.; Cao, L.; Huang, X.; Lian, D.; and Liu, W. 2018. Attention-based transactional context embedding for next-item recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Yang, B.; Li, J.; Wong, D. F.; Chao, L. S.; Wang, X.; and Tu, Z. 2019. Context-aware self-attention networks. arXiv preprint arXiv:1902.05766.
- Ying, H.; Zhuang, F.; Zhang, F.; Liu, Y.; Xu, G.; Xie, X.; Xiong, H.; and Wu, J. 2018. Sequential recommender system based on hierarchical attention networks. In the 27th International Joint Conference on Artificial Intelligence.
- Yu, L.; Zhang, C.; Liang, S.; and Zhang, X. 2019. Multi-order attentive ranking model for sequential recommendation. In Proceedings of the AAAI.
- Zhang, H.; Goodfellow, I.; Metaxas, D.; and Odena, A. 2018. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.
- Zhang, S.; Tay, Y.; Yao, L.; Sun, A.; and An, J. 2019. Next item recommendation with self-attentive metric learning. In Thirty-Third AAAI Conference on Artificial Intelligence, volume 9.
- Zhou, C.; Bai, J.; Song, J.; Liu, X.; Zhao, Z.; Chen, X.; and Gao, J. 2018. ATRank: An attention-based user behavior modeling framework for recommendation. In Thirty-Second AAAI Conference on Artificial Intelligence.