# interpretable_deep_generative_recommendation_models__fffc5e12.pdf Journal of Machine Learning Research 22 (2021) 1-54 Submitted 10/20; Revised 8/21; Published 8/21 Interpretable Deep Generative Recommendation Models Huafeng Liu huafeng@bjtu.edu.cn School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Department of Mathematics, The University of Hong Kong, Hong Kong SAR, China Liping Jing lpjing@bjtu.edu.cn Jingxuan Wen jingxuan@bjtu.edu.cn Pengyu Xu pengyuxu@bjtu.edu.cn Jiaqi Wang jiaqi.wang@bjtu.edu.cn Jian Yu jianyu@bjtu.edu.cn School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China Michael K. Ng mng@maths.hku.hk Department of Mathematics, The University of Hong Kong, Hong Kong SAR, China Editor: Samy Bengio User preference modeling in recommendation system aims to improve customer experience through discovering users intrinsic preference based on prior user behavior data. This is a challenging issue because user preferences usually have complicated structure, such as inter-user preference similarity and intra-user preference diversity. Among them, inter-user similarity indicates different users may share similar preference, while intra-user diversity indicates one user may have several preferences. In literatures, deep generative models have been successfully applied in recommendation systems due to its flexibility on statistical distributions and strong ability for non-linear representation learning. However, they suffer from the simple generative process when handling complex user preferences. Meanwhile, the latent representations learned by deep generative models are usually entangled, and may range from observed-level ones that dominate the complex correlations between users, to latent-level ones that characterize a user s preference, which makes the deep model hard to explain and unfriendly for recommendation. Thus, in this paper, we propose an Interpretable Deep Generative Recommendation Model (In DGRM) to characterize inter-user preference similarity and intra-user preference diversity, which will simultaneously disentangle the learned representation from observed-level and latent-level. In In DGRM, the observed-level disentanglement on users is achieved by modeling the user-cluster structure (i.e., inter-user preference similarity) in a rich multimodal space, so that users with similar preferences are assigned into the same cluster. The observed-level disentanglement on items is achieved by modeling the intra-user preference diversity in a prototype learning strategy, where different user intentions are captured by item groups (one group refers to one intention). To promote disentangled latent representations, In DGRM adopts structure and sparsity-inducing penalty and integrates them into the generative procedure, which has ability to enforce each latent factor focus on a limited subset of items (e.g., one item group) and benefit latent-level disentanglement. Meanwhile, it can be efficiently inferred by minimizing its penalized upper bound with the aid of local variational optimization technique. Theoretically, we analyze the generalization error bound of In DGRM to . Corresponding author. c 2021 Huafeng Liu, Liping Jing, Jingxuan Wen, Pengyu Xu, Jiaqi Wang, Jian Yu and Michael K. Ng. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-1098.html. Liu, Jing, Wen, Xu, Wang, Yu and Ng guarantee its performance. A series of experimental results on four widely-used benchmark datasets demonstrates the superiority of In DGRM on recommendation performance and interpretability.1 Keywords: Recommendation System, Collaborative Filtering, Deep Generative Model, Interpretable Machine Learning, Latent Factor Model 1. Introduction Interpretability can be defined as the degree to which a human can understand the cause of a decision . Making machine learning models explainable helps understand why the models succeed or fail, and could give us better intuition about the problem and higher trust in the solution. More fundamental need for the explainability stems from an incompleteness in the problem formalization. Explainable recommendation refers to personalized recommendation algorithms that address the problem of why - they not only provide users or system designers with recommendation results, but also explanations to clarify why such items are recommended (Zhang and Chen, 2020). In this way, it helps to improve the transparency, persuasiveness, effectiveness, trustworthiness, and user satisfaction of recommendation systems. It also facilitates system designers to diagnose, debug, and refine the recommendation algorithm. Even though significant progress has been made for recommendation, they hardly realize the underlying reasons behind the users decision making processes since user preferences are highly similar and diverse, and may range from inter-user preference similarity that governs preference relationship among users to intra-user preference diversity that characterizes a user s diverse preference when executing an intention. Learning representations that reflect users preference, based chiefly on user behavior, has been a central theme of research on recommendation systems (Liang et al., 2018; Liu et al., 2019b; Ma et al., 2019). However, complex user preference is hardly to model and the learned representations are highly entangled and may range from observed-level ones that dominate the complex correlations between users and items and latent-level ones that characterize a user s preference. On the one hand, the latent representations of different users are assumed independent to each other when building the model. Actually, in real applications, different users may interact with each other due to inter-user preference similarity on behaviors. By using Movielens 20M 2 dataset as an example, we investigate the genres that each user likes3. Figure 1(a) demonstrates the genres distribution related to 61 users for each genre4. We can see that although different users have different genre distributions, users show a strong correlation on genres. Furthermore, for each user, we sort the genres in descending order according to the number of genres interactions, and take the top 20% genres as the main preference of users. Figure 1(b) demonstrates the number of users who like each genre together in terms of genres. We can see that there are a large number of users with the same 1. A preliminary version of this work was presented in Proceedings of the Web Conference (WWW), 2020 (Liu et al., 2020). 2. https://grouplens.org/datasets/movielens/ 3. In Movielens 20M dataset, each user interacted with several items and each item is labeled by one or more genre labels. 4. We count the interaction proportion of each user on different genres, and get the distribution of genres. For each type of genres, we select the top 10 users to display. Considering that some users have more interactions on multiple genres, a total of 61 users are statistically displayed. Interpretable Deep Generative Recommendation Models 1 2 3 4 5 6 7 8 910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061 Action Adventure Animation Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western (a) Genres distribution related to top 10 users for each genre 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Documentary (b) The number of users who like each genre together in terms of genres 0 1 2 3 4 5 log(#Users) log(#Items) (c) The number of interacted items along all users 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 (d) The size of genres corresponding to items which are interacted by each user Figure 1: Some data distributions in Movielens 20M dataset. preferences. Although different users may have different preference, users with the same preference genre show a strong correlation. Thus, ignoring correlation between users may destroy the data distribution estimation. Meanwhile, such correlations between users are indeed helpful to explain the recommendation result, which has been proven by structural user models (Balog et al., 2019). On the other hand, the user representations can be highly entangled in the latent-level, which preserves the confounding of the factors and is prone to mistake the relationship between the latent factors and the observed user behavior, and further leads to non-robust recommendation and low interpretability. In this case, the system designers can not inter- Liu, Jing, Wen, Xu, Wang, Yu and Ng pret the learned latent factors, and the end-user can not understand the final recommendation results. It is interesting to note that the intra-user preference diversity where the preferences of a user are always diverse but sparsely distributed with respect to the whole item space (Zhou et al., 2018; Liu et al., 2019a, 2020). Figure 1(c) shows the distribution about the number of interacted items along all users. Many users are only interested with a few items, which leads to a power-law distribution. This statistical observation confirms that the interested item data set is sparsely distributed. Figure 1(d) demonstrates the size of genres corresponding to items which are interacted by each user. It can be found that most of the users are interested in more than ten genres, which indicates that users have diverse interests. Thus, it is necessary to disentangle the latent representations so that different factors can capture different preferences for each user. Such intra-user preference diversity modeling is critical to improve the recommendation performance (Ma et al., 2019). In this paper, based on our prior user preference modeling work (Liu et al., 2020), we propose a new Interpretable Deep Generative Recommendation Model (In DGRM) to characterize user behavior from both inter-user preference similarity and intra-user preference diversity modeling, and achieve latent-level and observed-level disentanglements for interpretable recommendation. In In DGRM, inter-user preference similarity indicates strong correlations among a subset of users who have similar preference, and the latent user representation is modeled by a mixture prior in a rich multimodal space, and further achieve observed-level disentanglement on users by separating users into different preference groups. Intra-user preference diversity represents user s intrinsic diverse preference since one user might be interested in different kinds items, which is model by identifying the item groups via learning a set of prototypes to achieve observed-level disentanglement on items, based on which the user intention related with each item is inferred, and then capturing the preference of a user about the different intentions separately. We promote disentangled latent representations by introducing structure and sparsity-inducing penalty into a generative procedure, which enforces each latent factor to influence a limited subset of items (i.e., item groups) and achieve latent-level disentanglement. To effectively handle the optimizing process, we adopt Wasserstein auto-encoder (WAE) framework (Tolstikhin et al., 2018) to measure the true preference data distribution and the generated data distribution. In DGRM can be efficiently inferred via the local variational optimization technique. Theoretically, we provide its generalization error bound to guarantee its performance. A series of experiments are conducted on four real-world datasets, and the results have demonstrated that In DGRM outperforms the state-of-the-art baselines in terms of several popular evaluation metrics. And the learned disentanglement on latent representation and observed behavior is demonstrated to be interpretable. 2. Related Work In this section, we briefly review two different areas which are highly relevant to the proposed method, interpretable recommendation and user preference modeling. 2.1 Interpretable Recommendation Interpretable recommendation can not only provide interpretation to a user why the items are recommended, but also facilitate system designers to diagnose, debug, and refine the Interpretable Deep Generative Recommendation Models recommendation models. Early interpretable approaches to personalized recommender systems mostly focused on content-based or collaborative filtering (CF) based recommendation (Adomavicius and Tuzhilin, 2005; Pazzani and Billsus, 2007; Sarwar et al., 2001; Konstan et al., 1997). Content-based recommendation attempts to model user and/or item profiles with various available content information, such as the price, color, brand of the goods in e-commerce, or the genre, director, duration of movies in review systems. Since the item contents are much understandable to the users, they are always used to explain why an item is recommended out of other candidates in content-based recommendation. However, collecting content information in different application scenarios is expensive and time-consuming. The most popular CF-based recommendation includes user-based CF and item-based CF (Konstan et al., 1997). Those methods can provide intuitive interpretations, such as customers who bought this item also bought... in user-based CF and the item is similar to your previously loved items in item-based CF. Leveraging the neighborhood style explanation mechanism, Abdollahi and Nasraoui (2016) presented an explainable matrix factorization (EMF) by extending matrix factorization-based CF model via an explainability regularizer. Even though EMF provides explanation, it prefers to the popular items recommendation. Inspired by the successes of attention-based deep learning methods, researchers presented neural attentive interpretable recommendation model (NAIRS) (Yu et al., 2019) to extend the traditional item similarity model (Kabbur et al., 2013). In our recent work (Liu et al., 2019c), an influence-based interpretable recommendation was proposed to modeling the influence of historical interactions. Recently, Ma et al. (2019) focused on disentangled representation for interpretable recommendation. Although the above methods are able to provide interpretable results, the learned deep latent features are too entangled in preference space to investigate internal recommendation mechanism. Another stream of work seeks for recommendation interpretations from auxiliary information. For example, textual reviews and tag information are studied in additional to the basic recommender model. To make use of textual reviews, topic model is integrated with matrix factorization to determine the explicit review-aware item features which are aligned to the latent factor for explanations generation (Mcauley and Leskovec, 2013; Zhang et al., 2014). Recently, due to the powerful ability in representation learning, deep learning is adopted in recommendation to model textual content and generate the explanations (Seo et al., 2017; Chen et al., 2018b; Donkers et al., 2017; Chen et al., 2018a). For instance, Seo et al. (2017) introduced an interpretable convolutional neural network to learn the item feature from users review information. Donkers et al. (2017) combined the user-item interaction and review information in a unified LSTM framework. Moreover, social trust information has been proved an alternative view of user preference to improve trustworthiness and transparency for recommendation (Park et al., 2018), and the utilization of tags for explainable recommendation has been particularly well studied (Balog et al., 2019). We in this work target on making interpretable recommendation from implicit feedback data only. The key is to infer the interpretable disentangled factor from observed feedback data, while related work discussed here aims at linking the auxiliary information with the recommendation decisions. Our study thus has a different problem setting and is generally applicable to systems where the auxiliary information is unavailable. Liu, Jing, Wen, Xu, Wang, Yu and Ng 2.2 User Preference Modeling User preference modeling in recommendation systems aims to explore users intrinsic preferences to improve recommendation performance. Most existing recommender systems represent a user s preference with a feature vector, which is assumed to be fixed or followed in the same distribution when predicting this user s preferences for different items, e.g., latent factor models (Mnih and Salakhutdinov, 2008; Salakhutdinov et al., 2007; Xue et al., 2017). However, the same vector or distribution cannot accurately capture a user s varying preferences on all items, especially when considering the diverse characteristics of various items. In order to capture the correlations among users and diverse preferences, recently, researchers proposed several localized latent factor models (Lee et al., 2016; Wu et al., 2016; Zhang et al., 2017; Liu et al., 2019a, 2020, 2021a,b) by exploiting local structure of the large-scale preference matrix. The main idea focuses on dividing the whole preference matrix into several submatrices so that each submatrix contains a set of like-minded users and the items that they are interested in. To sufficiently reduce the approximation error, the original preference matrix is partitioned several times to get a set of approximated preference matrices, and reconstructed with them and the corresponding weights in an ensemble manner. In literature, a simple way to capture local structure is randomly selecting users/items to form submatrices with similar interests (Mackey et al., 2011), but it can not guarantee that the users in the same submatrix share the common interests and the items have the similar categories. To address this problem, several works (Lee et al., 2016; Chen et al., 2015; Beutel et al., 2015; Wang et al., 2016; Zhang et al., 2017) are proposed to effectively partition preference matrix. Lee et al. (2016) proposed a local low-rank matrix approximation method using a kernel smoothing nearest neighbors method to acquire local structure and represent rating matrix as a weighted sum of several local low-rank matrices. In Beutel et al. (2015), Bayesian co-clustering was proposed to determine the local structure and a concise model was designed for matrix approximation in an additive strategy. Zhang et al. (2017) proposed a heuristic anchor-point selecting method to enhance local low-rank matrix approximation. However, these methods assign each user/item to only one single cluster, which make them can not handle the users with multiple interests well. Thus, researchers introduced an affiliation score to characterize the strength between user/item and the corresponding submatrices (Zhang et al., 2013; Wu et al., 2016). Our latest work DGLGM aims to learn global and local user representation in a deep generative model and achieve promising recommendation performance (Liu et al., 2020). However, above methods modeling user preference solely focus on partitioning user well and neglect interpretability of recommendation results. Thus, in this paper, we focus on modeling user preference on both inter-user preference similarity and intra-user preference diversity and achieving interpretable recommendation by investigating observed-level and latent-level disentanglement based on our previous work (Liu et al., 2020). 3. Notations and Problem Formulation Let calligraphic letter (e.g., A) indicate set, lower or upper case regular letter (e.g., a or A) for scalar, lower-case bold letter (e.g., a) for vector, and upper-case bold letter (e.g., Interpretable Deep Generative Recommendation Models A) for matrix. Suppose there are N users and M items, X = {(i, j, r), i DI, j DJ, r {0, 1, , R} denotes user-item preference set where (i, j, r) indicates the i-th user gives preference r to the j-th item (here preference value r is defined by non-negative integer value.), DI and DJ are user set and item set respectively. In most scenarios, preference value r is a binary value (i.e., 0 or 1) since it is collected implicitly, while, in few areas, the explicit preference information (e.g., ratings) can be collected. Thus, in this paper, we focus on modeling user s implicit feedback to achieve interpretable recommendation. Let X NN M denote the user-item interaction matrix among N users and M items. Its element xi,j {0, 1} indicates whether the j-th item is interacted by the i-th user. xi = [xi,1, , xi,M] NM is a binary vector demonstrating the interaction history of user i on all items. The general goal of deep generative recommendation method is to determine the latent user representation by defining proper prior and generative process to user-item interactions. In this work, we focus on learning latent user representations {zi}N i=1 (zi Rdi, di is the size of the optimal latent space for current user) to model inter-user preference similarity and intra-user preference diversity, and achieve the latent-level and observed-level disentanglement of the representations for interpretable recommendation. Inter-user preference similarity: In recommendation platform, users with similar preference may affect each other. In this case, it is intuitive to assume that users with similar preference follow the same distribution, and their latent representations can be generated in same way. Specifically, the users can be separated into several preference groups, where users in the same group share the same interests (preference). This grouping operation is helpful to achieve observed-level disentanglement on users. Given one user s interaction behavior xi, the corresponding latent user representation zi Rdi can be modeled in a rich multimodal space. In this space, the whole preference space is partitioned into K disjoint clusters, such that users in the same group are close to each other, while users in different groups are far away from each other in terms of user s preference. Meanwhile, different user groups can be spanned by their own optimal latent space. Here the users are grouped into K disjoint clusters to capture the inter-user preference similarity. Intra-user preference diversity: One user might be interested in different kinds of items, especially when the system contains a large set of items with various types. In literatures, there is another line of work to handle preference diversity by introducing overlapping clusters (Lee et al., 2016). In this work, we investigate the intra-user preference diversity by introducing item groups. Specifically, a set of multi-hot vectors C = {cj}M j=1 are used to indicate the item-category membership. For the j-th item, cj = [cj,1, cj,2, , cj,G] NG, and cj,g = 1 if item j belongs to category g, otherwise cj,g = 0. C is helpful to achieve observed-level disentanglement on items. To capture the essential user preference in each item category (group), the interaction behavior data of user i is split into G groups, denoted as xi = {x(1) i , x(2) i , , x(G) i }, where x(g) i contains the interaction data corresponding to the items in group g. Meanwhile, in the latent space for user representation, each latent factor can be connected to the explicit item group for latent-level disentanglement. Here, a structure and sparsity mapping matrix Oi = [o(1) i , o(2) i , , o(G) i ] Rdi G between factors and item groups is designed for each user. Modeling inter-user preference similarity and intra-user preference diversity is expected to properly capture the user preference and output more accurate recommendation results, Liu, Jing, Wen, Xu, Wang, Yu and Ng so as to achieve observed-level disentanglement on users and items, as well as latent-level disentanglement on representations for interpretable recommendation. In the next section, we will describe the model in detail. 𝒑𝟏 𝒑𝟐 𝒑𝑮 𝒕𝟏 𝒕𝟐 𝒕𝑮 Intra-user preference diversity modeling Item prototype learning 𝝅𝒊𝟏 𝝅𝒊𝒌 𝝅𝒊𝑲 Item category membership Inter-user preference similarity modeling Observed-level disentanglement Item groups User clusters Latent-level disentanglement 𝒑𝒈 Privileged information Item representation Figure 2: Architecture of the proposed In DGRM by modeling both inter-user preference similarity and intra-user preference diversity to achieve observed-level disentanglement and latent-level disentanglement. Inter-user preference similarity indicates strong correlation among a subset of users who have similar preference, which is achieved by learning a non-parametric mixture prior with several components to capture the correlation among users (user clusters). Intra-user preference diversity indicates the inherent diversity preferences of a single user, which is achieved by learning a set of item group prototypes, based on which the user intention related with each item is inferred, and then capturing the preference of a user about the different intentions separately. Both inter-user preference similarity and intra-user preference diversity modeling strategies focus on achieve disentanglement in observed-level and latent-level, i.e., users clusters, item groups, and disentangled latent representations. 4. The Proposed Method In this section, an Interpretable Deep Generative Recommendation Model (In DGRM) will be presented by investigating user behaviors from the views of inter-user preference similarity and intra-user preference diversity. Our goal is to achieve interpretable mechanism for deep learning architecture from two perspectives, observed-level disentanglement and latent-level disentanglement. The architecture of In DGRM is given in Figure 2. In DGRM consists of three modules. One module is to capture the inter-user preference similarity and achieve observed-level disentanglement on users via user grouping technique with multimodal prior on latent user representations. The second module is to characterize the intra-user preference diversity and achieve observed-level disentanglement on items via Interpretable Deep Generative Recommendation Models item prototype learning technique. The third module aims to disentangle the latent factor with the aid of item groups and achieve latent-level disentanglement for interpretable recommendation. In In DGRM, the behavior data of each user can be generated via a hierarchical generative model with two layers of latent variables as follows: k=1 πi,k Epθ(C) g=1 pθ x(g) i |zi, o(g) i , C pθ(zi|k)pθ(k|πi)pθ(o(g) i |γ)p(γ)dγdzi (1) Here pθ(C) indicates the item-group correlation distribution which aims to assign items into different groups. πi RK is the prior cluster probability for the i-th user and PK k=1 πi,k = 1 (K is the number of clusters). pθ(k|πi) is the categorical distribution parameterized by πi, and θ is the learnable parameters. The discrete latent variable k indicates the corresponding cluster that the i-th user belongs to. According to the assigned user cluster k, the latent user representation zi Rdi can be obtained by pθ(zi|k). The sparsity mapping {o(g) i }G g=1 (o(g) i Rdi) from latent factors to item groups is modeled by a sparsity-inducing penalty pθ(o(g) i |γ)p(γ) (γ is a variable to control the sparsity, which can be sampled from a Gamma distribution), which will lead to a disentangled latent representation. Finally, the useritem interaction behavior data xi can be generated through the pre-defined distribution parameterized with latent representation zi, item group assignment C and sparsity mapping {o(g) i }G g=1. Next, we will describe the detailed implementation of above generative procedure. 4.1 Inter-user Preference Similarity Modeling The general goal of deep generative recommendation method is to determine the latent user representation zi by defining proper prior and generative process from latent representation to observed user-item interaction data. Among it, the prior is usually crucial for recommendation data generation. In literatures, most researches restrict user latent representation in a univariate Gaussian space, which means that all users come from a single preference pattern. As aforementioned, different users may have different habits, while some of them may be effected to each other due to their similar preference. In this case, a simple prior, e.g., Gaussian distribution, becomes unreasonable to model such complicated user structure. In order to properly capture the inter-user preference similarity and achieve observedlevel disentanglement on users, the mixture model is adopted as the prior distribution of latent representation, because it has been proved was a universal approximator for any continuous density function and able to characterize the recommendation data well (Ma et al., 2019; Liu et al., 2020). Nevertheless, the existing methods have to predefine the number of mixture components K and share the same latent space size in all components, which often leads to a priori inaccuracy. Such hyperparameter setting definitely affects the final recommendation performance. For instance, if K is too small, the mixture model may not be able to well capture the complicated local structure from the wide-ranging sources. On the other hand, if K is too large, it will be time-consuming to learn all components even though most of them may make small contribution. Moreover, without a proper prior assumption for the mixing coefficient, the mixture model may be unstable and result in overfitting (Chen et al., 2016). Different components have their own structure or own space, therefore, it is necessary to determine the optimal feature subspace for each component. Liu, Jing, Wen, Xu, Wang, Yu and Ng 4.1.1 User group structure identification To capture the group structure among users, a deep mixture generative recommendation model is designed, with which the latent user representation can be generated via k=1 πi,kpθ(zi|k)pθ(k|πi). (2) Here K is the number of components in mixture model, which is usually taken as a predefined parameter. However, it is hard to previously set proper value for K. To automatically determine the number of components, we exploit non-parametric Bayesian technique which provides an elegant solution on automatic adaptation of model capacity. Such adaptivity can be obtained by defining stochastic processes on rich measure spaces such as Dirichlet Process (Neal, 2000), Indian Buffet Process (Ghahramani and Griffiths, 2006) and etc. In this work, the Dirichilet Process is adopted. More specifically, the latent variable zi is generated via a hierarchical structure with two layers according to pθ(zi|k)pθ(k|πi) and user-component membership πi. Among it, πi is sampled from Griffiths-Engen-Mc Closkey (GEM ) distribution (Pitman, 2002) via πi GEM(α) (with P k=1 πi,k = 1), which is a special case of Dirichlet Process (DP). GEM ( ) distribution is able to construct an infinite data partition and can be efficiently constructed via a stick-breaking process. Considering a stick with unit length, it can be broken into an infinite number of segments {πi,k} k=1 by the following process with parameter vk Beta(1, α): πi,k = υ1 if k = 1, υk Qk 1 l=1 (1 υl) for k > 1. (3) The precision parameter α controls the number of significant sticks that have appreciable weights πi,k. This process provides insights for developing variational approximate inference algorithms (Blei and Jordan, 2006). In each component, the data is assumed following a Gaussian distribution. Then, the prior of user representation zi along all components can be formulated as: k=1 πi,k N µ(k) i , σ(k) i I πi GEM(α) with k=1 πi,k = 1 here µ(k) i and σ(k) i I are mean and variance of Gaussian distribution related to the k-th component. For the users with similar preference, they are expected to have similar usercomponent distribution πi. In this case, their latent representation zi will approach to each other. 4.1.2 Optimal subspace determination For each component corresponding to one preference cluster, to capture its own structure, its latent feature space including space size is automatically determined. For convenient Interpretable Deep Generative Recommendation Models computing and disentangled representation, the latent features in zi are assumed independent to each other and each feature has zero mean. Then, the Gaussian distribution N(µ(k) i , σ(k) i I) related to the k-th component can be modeled as: N µ(k) i , σ(k) i I = Yd(k) l=1 N 0, λ(k) l = N 0, Λ(k) λ(k) l IG(ηa, ηb) (5) where µ(k) i Rd(k) and Λ(k) = diag(λ(k)) Rd(k) d(k) are the mean vector and covariance matrix respectively. In the covariance matrix, its l-th diagonal element (λ(k) l ) is sampled from Inverse Gamma distribution with parameters ηa and ηb. To determine the optimal dimensionality d(k), we take advantage of the automatic relevance determination (ARD) technique (Neal, 1996; Wipf and Rao, 2007). Since each latent feature is assumed having zero mean, the feature with small variance {λ(k) l } will shrink to zero. In this case, the latent features with small variance can not contribute to characterize users, thus, we can remove them. In other words, only the latent features with larger variance (empirically greater than 0.001) are useful to form the latent space. A good by-product is that each component can contain its own latent space size (d(k)). Note that the generative process is built on the truncated stick-breaking process at step K, which has been proved to closely approximate a true Dirichlet Process as long as K is chosen to be large enough (Ishwaran and James, 2001). Empirically K may be initialized to some value from tens to hundreds based on the model complexity. The useless dimensions will gradually be pruned automatically. For our case, small πik indicates that it is very unlike for some entries belonging to the corresponding cluster, thus, we can prune such clusters because they make few contribution for generation process. As a consequence, the users can be clustered into K groups without a complicated model selection procedure. Meanwhile, user clustering is able to achieve observed-level disentanglement on users. 4.2 Intra-user Preference Diversity Modeling To capture the preference diversity for each user, all interacted items are automatically grouped via prototype learning technique. 4.2.1 Prototype-based item group assignment To sufficiently exploit group structure among items, G category prototypes {pg}G g=1 are introduced. Meanwhile, a Multinomial distribution pθ(cj) over G groups is used to model the j-th item along all groups, where cj = [cj,1, cj,2, , cj,G] is a multi-hot vector drawn from cj Mult(Nj, sj) sj,g exp cos(hj, pg) where {pg}G g=1 indicates the G category prototypes and {hj}M j=1 indicate the item latent representations. The correlation between each pair of item and prototype is measured via Liu, Jing, Wen, Xu, Wang, Yu and Ng cosine similarity5 cos(hj, pg) = h j pg ||hj||||pg|| and normalized via a softmax function to produce a probability vector sj = [sj,1, sj,2, , sj,G] SG 1 in a (G 1)-simplex. Among it, sj,g quantifies the correlation between the j-th item and the g-th group. The larger the correlation value is, the higher probability that the j-th item belongs to the g-th group. The hyperparameter τ aims to scales the similarity from [ 1, 1] to [ 1 τ ]. In experiments, τ is set to be 0.1 for more skewed distribution. Nj is the number of groups to which the j-th item belongs. With the aid of the above variables, the multi-hot group membership vector cj can be sampled from a Multinomial distribution with probability sj. According to the distribution of cj, i.e., pθ(cj), we can get the item-group distribution pθ(C) = QM j=1 pθ(cj) under the assumption that all items are drawn from independent and identically distributed (i.i.d.). The item j will be assigned to the group g if g = arg maxg{cj,g|G g=1}. 4.2.2 Prototype enhancing with privileged information According to the cognitive science(Hampton, 1993), the prototypes are expected to demonstrate some concepts or categories, which are usually average or best exemplars. Prototypes provide a concise representation for an entire group (category) of entities, providing means to anticipate hidden properties and interact with novel stimuli based on their similarity to prototypical members of their group. A prototype learning system is a system of cognitive processes together with their underlying neural structures that enables one to learn a category prototype from a set of data points(Zeithamova, 2012). The prototype learning on pure interacted data is hard to achieve this goal. Fortunately, in the real-world applications, item is usually labeled with special characteristics, such as genre information of movies(Harper and Konstan, 2015), category information of goods(Ma et al., 2019) and etc. Such kind of privilege information has been proved to be effective for prototype learning (Vapnik and Izmailov, 2015). Thus, in this work, they are taken as prior knowledge and incorporated in the prototype learning process. Our main idea is to construct the triple set from item category or genre information, and model the alignment between prototype relations and the given category relations. Specifically, the category description can be collected and preprocessed as semantical representations {tg}G g=1 via word embedding technique. The triplet of categories can be constructed, where three categories in one set have the following indicator sj,k,g = 1(cos(tg, tj) cos(tg, tk)) to demonstrate the ranking order of prototype j and k when given prototype g. With the help of sj,k,g, the prototypes triplet (pg, pj, pk) can be 5. Here cosine similarity is adopted instead of the inner product similarity which is used by most existing deep learning methods (He et al., 2017), because it is crucial for preventing mode collapse. Empirically, with inner product, the majority of the items are highly likely to be assigned into a single category pg with an extremely large norm. In contrast, cosine similarity avoids this degenerate case due to the normalization. Moreover, cosine similarity is related with the Euclidean distance on the unit hypersphere, and the Euclidean distance is a proper metric that is more suitable for inferring the cluster structure, compared to inner product. Interpretable Deep Generative Recommendation Models constrained by the following Bernoulli distribution: sj,k,g exp cos(hj, pk) τ cos(hj, pg) sj,k,g B(sj,k,g) (7) where the ranking order likelihood sj,k,g is modeled via a logistic function which maps the similarity difference to probability. Incorporating similarity order, instead of direct similarity, will enforce the hyperspherical prototypes have the same similarity ranking order as the priors {tg}G g=1, which is much more reasonable in real applications (Mettes et al., 2019). A good by-product of item grouping is to achieve observed-level disentanglement on items. Specially, the observed interactions behavior data of the i-th user (xi) can be separated into G groups, xi = [x(1) i , x(2) i , x(G) i ]. x(g) i contains the interaction data related to the items which belong to the g-th group. 4.3 Factor-to-group Sparse Mapping As we known, the latent representations learned by deep generative models are usually entangled, which makes the deep model hard to explain and non-transparent for recommendation. A straight but efficient way to handle this issue is to disentangle the latent factors. In our work, the item groups with category information provide high-level and interpretable concepts. Thus, we can connect the latent features of zi with the item groups for interpretable recommendation model construction. Inspired by the sparse factor analysis (Tibshirani, 1996; Tank et al., 2018), the latent features can be well explained by encouraging sparse mappings from latent features to item groups. Specifically, a groupspecific factor mapping vector o(g) i = [o(g) i,1 , o(g) i,2 , , o(g) i,di] Rdi is introduced between the latent representation and item group for the current user i, where di is the optimal latent space size (as shown in the right part of Figure 2). Note that the l-th element of the mapping vector, i.e., o(g) i,l , indicates the influence of the items in group g on the l-th latent factor for user i. To make the influence more explainable, o(g) i,l is set to be 0 if the g-th item group have no influence on the l-th latent factor. Meanwhile, each latent factor should only focus on as few items as possible, i.e., the mapping vector should be sparse (Anindya et al., 2015). To implement this, a hierarchical distribution with Bayesian prior p(γ) is introduced to model o(g) i,l as follows: o(g) i,l N(0, γ2) (8) where sparsity variance γ2 is sampled from Gamma distribution parametrized by shape parameter (a+1) 2 and rate parameter b2 2 . Among it, the sparsity can be controlled by rate parameter b2 2 , larger b implying more sparse in o(g) i . Note that deep structure allows rescaling of the parameters across layer boundaries without affecting the end behavior of the networks. Liu, Jing, Wen, Xu, Wang, Yu and Ng With the aid of mapping vectors {o(g) i }G g=1, the generative process of In DGRM has the ability to capture the group-specific latent feature. Obviously, such latent feature learning is helpful to achieve disentangled latent representation. In addition, since each item group is related to one type of user preferences, such group-specific latent feature is beneficial to model intra-user preference diversity. 4.4 Interaction Data Generation Once having the latent representation zi for user i, sparse mapping vectors {o(g) i }G g=1 and item representations {hj}M j=1, the interaction behavior data xi can be generated. To sufficiently exploit the preference diversity, xi is split into G groups (as shown in Subsection 4.2). For each user, all interaction behaviors [x(g) i,j ]1 g G,1 j M share the same user latent repre- sentation zi. In this case, the user-item interaction information xi = {x(1) i , x(2) i , , x(G) i } with group structure can be generated by: PG g=1 cj,gfθ(g) cos(hj, zi)/τ + o(g) i z i PM j=1 PG g=1 cj,gfθ(g) cos(hj, zi)/τ + o(g) i z i Among them, the user-item interaction behavior x(g) i,j can be decoded with an individual Bernoulli distribution. Here function fθ(g)( ) indicates neural network parameterized by θ(g) that estimates how much the user with a given preference is interested in item group g. cos(hj, zi)/τ implies that item representation hj will be disentangled if zi is disentangled, as the two s dimensions are aligned. o(g) i z i connects the latent user representation and item groups to achieve latent-level disentanglement. Consider the computational complexity, here we use sampled softmax (Jean et al., 2015) to estimate PM j=1 PG g=1 cj,gfθ(g) cos(hj, zi)/τ + o(g) i z i based on a few sampled items when M is vary large. The whole generative process of In DGRM is summarized in Algorithm 1. 4.5 The Interpretability of In DGRM The proposed In DGRM model aims to capture the complicated user preference pattern and simultaneously provide interpretable recommendation. The interpretability of In DGRM can be guaranteed from two aspects: 1. Observed-level disentanglement on users and items: By introducing mixture prior to model inter-user preference similarity on latent user representation, In DGRM allows dynamic allocation of statistical capacity among components, where users will be assigned into different components (clusters) so that users with similar preference belong to one component. In this case, users in the same component can be taken as neighborhood to each other. Based to this operation, In DGRM is able to capture user correlations structure and achieve observed-level disentanglement on users. Furthermore, this kind of disentanglement alleviates data sparsity by allowing a near cold-start user to borrow information from other users from the same cluster. By introducing prototype-based item grouping to model intra-user preference diversity, Interpretable Deep Generative Recommendation Models Algorithm 1 In DGRM Generative Process 1. Randomly initialize hyperparameters α, ηa, ηb, a, b, item representations {hj}N j=1, group representations {pg}G g=1; 2. For each item j [1, M]: 1) For each item group g [1, G]: a) Draw sj,g exp cos(hj,pg) b) if privilege information i) Draw sj,k,g exp n cos(hj,pk) τ cos(hj,pg) ii) Draw sj,k,g B(sj,k,g) 2) Draw cj Mult(Nj, sj) 3. For each user i [1, N]: 1) Draw πi GEM(α) 2) For each latent dimension l [1, d(k)] a) Draw λ(k) l IG(ηa, ηb) 3) Draw use representation zi P k=1 πi,k N µ(k) i , σ(k) i I = P k=1 πi,k Qd(k) l=1 N 0, λ(k) l 4) For each interaction related to user i: a) Draw γ2 Γ a+1 b) For each item group g [1, G] and each latent dimension l [1, d(k)]: i) Draw o(g) i,l N(0, γ2) c) Draw the rating xi QM j=1 QG g=1 B x(g) i,j ; PG g=1 cj,gfθ(g) cos(hj,zi)/τ+o(g) i z i PM j=1 PG g=1 cj,gfθ(g) cos(hj,zi)/τ+o(g) i z i Liu, Jing, Wen, Xu, Wang, Yu and Ng In DGRM captures item groups to reflect the diverse interests of single user so as to achieve observed-level disentanglement on items. 2. Latent-level disentanglement on representations: By introducing the column-wise latent-to-group sparse mapping matrix from latent user representation to item groups, each latent factor will be automatically associated with item group. The mapping matrix can be taken as a bipartite graph on latent feature set and item group set, from which we can quickly identify the relationships among these two sets. When we have category prior for each group, the latent feature can be directly explained according to the category information of the most related groups. A good by-product of factor-to-group mapping is that the latent feature can be easily disentangled in the latent space. Specially, by penalizing mapping between the latent features and item groups, for each user, we can force the model to learn a latent representation where the correlation among latent features is as small as possible. Furthermore, the cosine similarity measurement between user representation and item representation (in Eq. (9)) also encourages latent-level disentanglement on item representations when their dimensions are aligned well. In a word, in In DGRM, the observed-level and latent-level disentanglement schemes are properly designed and seamlessly integrated into the recommendation model via a unified generative process, which will enforce each other to improve recommendation performance and achieve model interpretability. 4.6 Inference Learning for In DGRM Deep generative models target to minimizing certain discrepancy measures between the true data distribution and the generative distribution (Kingma and Welling, 2013; Goodfellow et al., 2014). Kullback-Leibler (KL) divergence (equivalently maximizing the marginal data log-likelihood) or Jensen-Shannon (JS) divergence are commonly used to estimate the model parameters. However, user behavior data is characterized by few frequently occurring items and a large amount of tail items, where data can be actually characterized by a low dimensional manifold. In this case, thus, we adopt Wasserstein distance (Villani, 2008) to measure the difference between two distributions rather than KL-divergence or JS-divergence, because it has the ability to preserve the transitivity in latent space due to the much weaker topology (Tolstikhin et al., 2018; Liu et al., 2019b). Meanwhile, the generative process can be implemented under the auto-encoder framework from the optimal transport point of view (Tolstikhin et al., 2018; Arjovsky et al., 2017). Motivated by above techniques, the parameters in In DGRM can be determined by minimizing Wasserstein distance between the ground truth distribution and the model generated distribution on user-item interaction data. It can be approximated by optimizing the following bound: L(xi; θ, φ, O) = inf qφ(zi,k|xi,C) QZ K Ep(xi)Eqφ(zi,k|xi,C)[c(xi, ˆxi)] + λr Ep(xi)[D (qφ(zi, k|xi, C)||pθ(zi|k)pθ(k|πi))] λo X g ||o(g) i ||2. (10) Interpretable Deep Generative Recommendation Models where xi and ˆxi are the original and generated feedback data respectively. c(x, y) is the cost function to measure the distance between x and y, here the squared cost function c(x, y) = ||x y||2 2 is used. infqφ(zi,k|xi,C) QZ K Ep(xi)Eqφ(zi,k|xi,C)[c(xi, ˆxi)] is the tractable upper bound of Wasserstein distance W(p(xi), p G(xi)), and QZ K is the set of all conditioned distributions over zi. Variational distribution qφ(zi, k|xi, C) in the inference model can be factorized as qφ(zi|k, xi, C) and qφ(k|xi, C). Since any parametrization of qφ(zi, k|xi, C) reduces the search space of the infimum, the above objective function can be taken as an upper bound of the true Wasserstein distance. D( ) is an arbitrary divergence between prior and posterior, two mixture of Gaussian distributions, which can be efficiently and effectively approximated (Hershey and Olsen, 2007). The last penalty term on Oi aims to constrain its values in he collapsed bound and encourage it sparse. λr > 0 and λo > 0 are two hyperparameters. As we mentioned before, prototype learning can be promoted via introducing privilege information (Eq. (7)). Thus, the final objective can be formulated as follow: L(xi; θ, φ, O) = inf qφ(zi,k|xi,C) QZ K Ep(xi)Eqφ(zi,k|xi,C)[c(xi, ˆxi)] + λr Ep(xi)[D (qφ(zi, k|xi, C)||pθ(zi|k)pθ(k|πi))] λo X g ||o(g) i ||2 (j,k,g) P sj,k,g log sj,k,g (1 sj,k,g) log(1 sj,k,g). where the last term λp 1 |P| P (j,k,g) P sj,k,g log sj,k,g (1 sj,k,g) log(1 sj,k,g) is derived from Eq. (7) in Section 4.2.2. P denotes the set of all ranking order triplets for prototype and λp is a hyperparameter controlling the effect of prototype enhancing. Intuitively, this term optimizes for hyperspherical prototypes to have the same ranking order as the semantic priors. The second term with KL divergence in Eq. (10) and Eq. (11) can be rewritten as Ep(xi) [D(qφ(zi, k|xi, C)||pθ(zi|k)pθ(k|πi))] Eqφ(zi,k|xi,C) ln qφ(zi, k|xi, C) qφ(zi, k|C) + ln qφ(zi, k|C) pθ(zi|k)pθ(k|πi) = Ep(xi)[D(qφ(zi, k|xi, C)||qφ(zi, k|C))] + Eqφ(zi|xi,C) ln qφ(zi, k|C) pθ(zi|k)pθ(k|πi) = I(xi, zi) + D(qφ(zi, k|C)||pθ(zi|k)pθ(k|πi)). I(xi, zi) is the mutual information between xi and zi. Minimizing it is equivalent to applying the information bottleneck principle on the learning process, which encourages zi to ignore as much noise in the input as it can, so that the latent representation mainly focus on the essential information. The second term can encourage independence among latent features with the aid of proper prior. For instance, a prior with independent features, pθ(zi|k) = Qd l=1 pθ(zi,l|k) (as shown in Eq.(5)) is adopted here. Penalizing this KL term will make the posterior approach to the prior and preserve such independent properties. Meanwhile, this term seamlessly integrate preference diversity modeling and user correlation modeling. Liu, Jing, Wen, Xu, Wang, Yu and Ng As mentioned before, to capture the inter-user preference similarity, each user latent representation is assumed following a Mixture of Gaussian distribution. By sufficiently exploiting the relationship between user clusters and item groups, the variational posterior qφ(zi|k, xi, C) can be approximated by qφ(zi|k, xi, C) = k=1 πi,k N µ(φ,k) i , σ(φ,k) i I . (13) In mixture model, each component is a Gaussian distribution with the following mean µ(φ,k) i and standard deviation σ(φ,k) i : µ(φ,k) i = W(k) µ PM j=1 PG g=1 cj,ghj q PM j=1 PG g=1 cj,g , fφ {x(g) i }G g=1 σ(φ,k) i = softplus PM j=1 PG g=1 cj,ghj q PM j=1 PG g=1 cj,g , fφ {x(g) i }G g=1 where W(k) µ and W(k) σ are learnable weight matrices related to the k-th component. fφ( ) is the neural network to extract the high-level features from user behavior information with group structure. { πi,k}K k=1 are the group-specific attention weights, which can be computed via: πi,k = softmax W2tanh W1 xi, {xi }i Uk, {hj}xij=1 . (15) Here Uk indicates users set related to the k-th user cluster and hj is representation for the j-th item. W1 and W2 are learnable weight metrices. To mirror the prior, we parameterize variational distribution qφ(k|xi, C) via mixture prior pθ(zi|k) pθ(k|πi). Specifically, qφ(k|xi, C) = πi,k N µ(k) i , σ(k) i I PK k =1 πi,k N µ(k ) i , σ(k ) i I (16) is a categorical distribution over K user clusters. The information loss induced by the factorized approximation can be mitigated by forcing its dependency on the posterior pθ(k|zi) and non-formative prior pθ(k|πi). For the group lasso penalty on the latent-to group matrix Oi in Eq.(10) or Eq.(11), we update it via proximal gradient descent method, which is efficient for the separable objectives with both differentiable and potentially non-differentiable components. Specifically, the proximal operator on vector λo P g ||o(g) i ||2 can be formulated as: o(g) i = o(g) i ||o(g) i ||2 max 0, ||o(g) i ||2 ηoλo . Geometrically, this operator reduces the norm of o(g) i by ηoλo, and shrinks o(g) i with ||o(g) i ||2 ηoλo to zero. Obviously, this updating is cheap for computing and leads to machine-precision zeros. Interpretable Deep Generative Recommendation Models 4.7 Local Variational Optimization for In DGRM In the objective function, Eq.(10) or Eq.(11), {θ, φ, O} are the parameters to be optimized. Among it, the latent representation zi can be obtained via reparametrization trick, zi = µ(φ,g) i + ϵn σ(φ,g) i , where ϵn N(0, I). Thus, the posterior is be controlled by following variational parameters: ψ(xi) = n {µ(φ,g) i }G g=1, {σ(φ,g) i }G g=1 o . In this case, given xi, our main goal is to find the optimal variational parameters ψ(xi), which can be implemented by maximizing L(xi; θ, φ, O, ψ(xi)) on the neural network with parameters φ. A simple and popular way is random sampling ψ(xi). However, it has been proven that such strategy can not guarantee optimal variational parameters and may yield a much looser upper bound for the original generative model designed on Wasserstein distance (between prior and posterior distributions) (Krishnan et al., 2017). Therefore, in this work, the optimal variational parameters ψ are automatically determined, which has ability to obtain a tighter upper bound L (xi; θ, φ, O, ψ (xi)) as follows. L(xi; θ, φ, O) L (xi; θ, φ, O, ψ (xi)) W(p(xi), p G(xi)) (17) Algorithm 2 Learning with local variational optimization for In DGRM Input: Implicit user-item interaction matrix X with N users and M items. Randomly initialized parameters {θ, φ, O}. 1: for t = 1, 2, , T 2: Sample a user xi and set ψ0 = ψ(xi). 3: Learning item group assignment C and prototypes via Eq.(6) and Eq.(7). 4: Approximate ψL ψ (xi) = minψ LG(xi; θt, φt, Ot, ψ(xi)). 5: for τ = 0, , L 1 6: ψτ+1 = ψτ + ηψ ψτ LG(xi; θt, φt, Ot, ψτ). 7: Update θ: θt+1 θt + ηθ θt L(xi; θt, φt, Ot, ψL). 8: Update φ: φt+1 φt + ηφ φt L(xi; θt, φt, Ot, ψL). 9: Update O: Ot+1 Ot + ηo ot L(xi; θt, φt, Ot, ψL). 10: for g = 1, 2, , G 11: Update o(g) i with o(g) i = o(g) i ||o(g) i ||2 max 0, ||o(g) i ||2 ηoρ . 12: Reduce certain dimension of zi if corresponding ARD variance is less than ϵλ. Output: The learned optimal parameters {θ, φ, O}. To obtain the optimal ψ for further updating model parameters {θ, φ, O}, we adopt a local variational optimization (LVO) method to estimate parameters ψ = ψ(xi) with the aid of the inference network. Its main idea is to optimize ψ via gradient descent, where one initializes with ψ0 and takes successive steps of ψτ+1 = ψτ + ηψ ψτ L(xi; θ, φ, O, ψτ), where the gradient with respect to ψ can be approximated via Monte Carlo sampling. As shown in line 5-7 of Algorithm 2, the resulting ψL can approach the optimal variational Liu, Jing, Wen, Xu, Wang, Yu and Ng parameters. Note that L in inner optimization is a predefined value, usually few steps are enough to find the optimal ψ. With ψL, we can update the parameters θ, φ and O via stochastic back-propogation, as shown in line 8, 9 and 10, where the learning rates ηψ, ηθ, ηφ and ηo obtained via ADAM (Kingma and Ba, 2014). The factor-to-group mapping vector o(g) i is updated by the proximal stochastic gradient descent method, as shown in line 11-13. The whole training procedure for In DGRM with local variational optimization strategy is summarized in Algorithm 3. 5. Theoretical Analysis of In DGRM In this section, we focus on theoretically analyzing the proposed In DGRM model including its generalization error bound and relationship with existing methods. In the first we analyze the generalization error bound of the proposed In DGRM to underscore the characteristics of estimation error in terms of model parameters. Then we prove that In DGRM is superior to the existing recommendation methods. 5.1 Generalization Error Bound Motivated by Arora et al. (2017) and Lee et al. (2016), the generalization of interaction data generation process can be defined by measuring the difference between generative data distribution ˆ Xp and real data distribution Xp. Our main idea is check whether the population distance between Xp and ˆ Xp is close to the empirical distance between the empirical distributions (Xe and ˆ Xe). Definition 1 For the empirical version of the true distribution (Xe) with N training examples, a generative distribution ( ˆ Xe) generalizes under the distance d( , ) between distributions with generalization error δ1 > 0 if the following holds with high probability, |Ep( ˆX) Ee(ˆX)| δ1 (18) Ep( ˆX) = Ex(p) Xp,x(p) ˆ Xpd x(p), ˆx(p) , Ee( ˆX) = 1 NM j=1 d x(e) i,j , ˆx(e) i,j | x(e) i,j ˆ Xe, ˆx(e) i,j ˆ Xe (19) Ep( ˆX) indicates the population distance between the true and generative distributions (Xp and ˆ Xp). Ee(ˆX) is the empirical distance between the true and generative distributions (Xe and ˆ Xe). Since the proposed In DGRM with single user cluster and item group can be seem as a generalized matrix completion model, the Frobenius norm between the ground truth data (X(e)) and the generative data ( ˆX(e)) with N users and M items, is used as the metric to establish the error bound of the proposed method, i.e. d(x(e) ij , ˆx(e) i,j ) = ||x(e) i,j ˆx(e) i,j ||2 2. In the Interpretable Deep Generative Recommendation Models Algorithm 3 The training procedure for In DGRM with local variational optimization strategy. 1: input: user-item interaction matrix X RM N. 2: parameters: The parameters {θ, φ, O} include: user representations {zi}N i=1 RN di, item representations {hj}M j=1 RM di, group prototypes {pg}G k=g RK di, sparse mapping {Oi}N i=1 RN di G, and the parameters of neural networks. 3: function Prototype Group Assignment() 4: if privilege information 5: sj,k,g = 1(cos(tg, tj) cos(tg, tk)), (i, j, k) P 6: sj,k,g exp ncos(hj,pk) τ cos(hj,pg) 7: PI = 1 |P| P (j,k,g) P sj,k,g log sj,k,g (1 sj,k,g) log(1 sj,k,g) 8: sj,g exp cos(hj,pg) τ , g = 1, 2, , G 9: cj Gumble-Softmax([sj,1, , sj,G]) 10: return {cj}M j=1, PI 11: function Encoder(xi, {cj}M j=1) 12: µ(φ,k) i = W(k) µ " PM j=1 PG g=1 cj,ghj q PM j=1 PG g=1 cj,g , fφ {x(g) i }G g=1 # 13: σ(φ,k) i = softplus " PM j=1 PG g=1 cj,ghj q PM j=1 PG g=1 cj,g , fφ {x(g) i }G g=1 #! 14: πi,k = softmax W2tanh W1 xi, {xi }i Uk, {hj}xij=1 15: zi = PK k=1 πi,k N µ(φ,k) i , σ(φ,k) i I 16: return zi, KL = PK k=1 KL N(µ(φ,k) i , σ(φ,k) i I)||N(0, Λ(k) 17: function Decoder(zi, {cj}M j=1) 18: SP = P g ||o(g) i ||2 19: for g = 1, 2, , G // For simplicity, we omite sup-script (g) and denote p(g) i,j as pi,j. PG g=1 cj,gfθ(g) cos(hj,zi)/τ+o(g) i z i PM j=1 PG g=1 cj,gfθ(g) cos(hj,zi)/τ+o(g) i z i , j = 1, 2, , M 21: [ˆxi,1, ˆxi,2, , ˆxi,M] Softmax([pi,1, pi,2, , pi,M]) 22: RC = ||xi ˆxi||2 F 23: return RC + λo SP 24: {cj}M j=1, PI Prototype Group Assignment(). 25: zi, KL Encoder(xi, {cj}M j=1) 26: RC + λo SP Decoder(zi, {cj}M j=1) 27: if privilege information 28: L (xi; θ, φ, O, ψ (xi)) = RC + λr KL + λo SP + λp PI 30: L (xi; θ, φ, O, ψ (xi)) = RC + λr KL + λo SP 31: θ, φ, O Update θ, φ and O to minimize L (xi; θ, φ, O, ψ (xi)), using the proposed local variational optimization method. Liu, Jing, Wen, Xu, Wang, Yu and Ng following content, we abbreviate X(e) as X and ˆX(e) as ˆX, and d (xi,j, ˆxi,j) = ||xi,j ˆxi,j||2 2 is used as shorthand for d(x(e) ij , ˆx(e) i,j ) = ||x(e) i,j ˆx(e) i,j ||2 2. According to the generative process of In DGRM, it can be seen that the whole interaction matrix is generated by K G components with different properties. Here we define X(k,g) as the (k, g)-th submatrix, N(k,g) is the number of users in submatrix X(k,g) and M(k,g) is the number of items. Firstly, we establish the error bound of the proposed In DGRM, i.e., the generative error of the proposed deep generative recommendation model is bounded, so that we can still find optimal generative model for recommendation by optimizing the proposed optimization problem. Theorem 2 For any X NN M (N, M > 2), Ep( ˆX) Ee(ˆX) + max (i,j) di,j holds with probability at least 1 δ1 over uniformly choosing an empirical version of X. Here, di,j = d (xi,j, ˆxi,j). Proof Since the entries in X are chosen independently and uniformly, it is reasonable to assume each di,j = d(xi,j, ˆxi,j) is a random variable and satisfies p(ζ di,j 0) = 1 where ζ = max(i,j) di,j. Hence, based on the Hoeffding Inequality, we have p(|Ep( ˆX) Ee(ˆX)| ϵ) exp 2NMϵ2 ζ2 . By setting ϵ = q log δ1 2NM ζ2, we have |Ep( ˆX) Ee(ˆX)| log δ1 2NM ζ2 ! Ep( ˆX) Ee(ˆX) + max (i,j) di,j Therefore, the errors of interaction data generative model are bounded. Then, we theoretically prove the generalization error bound of In DGRM related to a single user cluster and a single item group, X(k,g). To make the theorem more convincing, we make several standard assumptions: (1) each submatrix X(k,g) related to the (k)-th user cluster and the (g)-th item group is incoherent (Cand es and Recht, 2009; Sun and Luo, 2016); (2) X(k,g) is well-conditioned (Cand es and Recht, 2009); (3) the number of users is larger than items in each submatrix (N(k,g) M(k,g)). As shown in Theorem 7 in Candes and Plan (2011), a theoretical bound to generalized matrix reconstruction model is existing as follows. Interpretable Deep Generative Recommendation Models Theorem 3 If X(k,g) is well-conditioned and incoherent such that |D(k,g)| Cµ2M(k,g)d(k,g) log6 M(k,g), then with high probability 1 (M(k,g)) 3, X(k,g) satisfies ||X(k,g) ˆX(k,g)||F 4ε (2 + ρ(k,g))M(k,g) ρ(k,g) + 2ε (2 + ρ(k,g))M(k,g) where ρ(k,g) = |X(k,g)| N(k,g)M(k,g) indicates the density of the observed entries ρ in every submatrix is consistent with each other. ε = max(ˆX) min(ˆX) = 1 and d(k,g) is the optimal dimensionality of latent representation for the (k, g)-th submatrix, D(k,g) D is the set of observed feedback in submatrix X(k,g). This theorem indicates that the generative error of each submatrix ˆX(k,g) is bounded. Then, based on Theorem 3, we can analyze the generalization error bound of the proposed In DGRM method on whole interaction matrix via the following theorem. Theorem 4 If matrix related to K user clusters and G item groups satisfied Theorem 3, then with high probability 1 δ2, ˆX is divided into K G submatrices satisfies ||X ˆX||F ω where δ2 = (2KGM) 3 and ω indicates the average weight, ρ = |X| NM is the density of the observed entries in X. Proof For each user-item pair (i, j), an implicit feedback xi,j is equal to ˆxi,j + z where z is a random variable whose absolute error is bounded by ||W (X ˆX)|| ω(max (X) min (X)) = ωε = ω (25) where W indicates the mixture weights. ε = max (X) min (X) = 1 since we focus on implicit feedback data. By applying Theorem 3 to implicit preference reconstruction problem with bounded noise, we get with probability greater than 1 v 3 that every mixture model to approximate X will satisfy ||W (X ˆX)||F ω where v = max(N, M), γ = min(N, M). For a submatrix, there are K G different components X(k,g) and obviously P k P g γ(k,g) N. Using Cauchy-Schwarz inequality, we get X (k,g) γ(k,g) Liu, Jing, Wen, Xu, Wang, Yu and Ng Therefore, we can bound the reconstruction error as follow: ||W (X ˆX)||F (a) X (k,g) ||W (k,g) (X(k,g) ˆX(k,g))||F γ(k,g)(2 + ρ) 2KGN(2 + ρ) in which (a) holds due to the triangle inequality of Frobenius norm; and (b) holds due to (26); and (c) holds due to (27). Since for all user-item pairs, we have Ee( ˆX) = ||X ˆX||F NM ||W (X ˆX)||F Combining (28) and (29), we established the error bound of ˆX as stated above. In order to adjust the confidence level, we take a union bound of events ||W (X ˆX)||F ρ + 2 for each mixture component, then we have (k,g) v(k,g) (2KGM) 3 (30) thus, the error bound holds with probability at least 1 (2KGM) 3. Remark 5 According to above theorems, we can underscore the characteristics of estimation error in terms of parameters such as the training set size of each submatrix |D(k,g)|, optimal dimensions of each user representation di, and the number of submatrix K G. Considering the basic assumption of each submatrix is followed by Candes and Plan (2011), our model with K user clusters and G item groups requires a lower training set size, i.e., P (k,g) |D(k,g)| |D|. This property also guarantees that our model is able to achieve competitive performance in more sparse scenario. And we observe empirically that this phenomenon does occur with cold-start scenario (see Table 3). Furthermore, our method is able to model user representations in each submatrix with the optimal subspace dimensions since the introduction of ARD technique, which further relaxes the requirement of the size of training set. 5.2 Relations to Existing Methods Generative model aims to reconstruct the input interaction data including the observed data and the unobserved data. In this scenario, the proposed method is closely related to global-based methods (e.g., PMF (Mnih and Salakhutdinov, 2008)) and local-based methods Interpretable Deep Generative Recommendation Models (e.g., LLORMA (Lee et al., 2016), DGLGM (Liu et al., 2020)). In this section, we will theoretically compare them from the perspective of generalization bound. The objective function of the traditional matrix reconstruction methods (Mnih and Salakhutdinov, 2008) can be formulated as follow: (i,j) D ||xi,j ˆxi,j||2 2. (31) Here we ignore the regularization term. For local-based matrix reconstruction method with S local structures (Lee et al., 2016), its objective function can be written (i,j) D(s) ||x(s) i,j ˆx(s) i,j ||2 2, (32) where D(s) is the observed entries set related to the s-th local structure and ws indicates the importance of the corresponding local structure with PS s=1 ws = 1 and the interactions in the same local structure have the same weight. Next, let s prove that the local-based method (Eq. (32)) can achieve lower generalization error bound than global-based method (Eq. (31)), i.e., the later has ability to obtain potentially better generalization performance than the former. Theorem 6 Let E[Lg] = E[Ll] = µ. For any ϵ > 0, if p(|Ll µ| ϵ) 1 δl and p(|Lg µ| ϵ) 1 δg, then δl δg. Proof Based on Markov s inequality6 and Hoeffding s Lemma7, for any ϵ, t > 0, we have p(Lg µ ϵ) = p et(Lg µ) etϵ (a) E[et(Lg µ)] 1 8t2(a b)b where µ is the expectation of Lg, a = sup{Lg µ} and b = inf{Lg µ}. (a) holds due to Markov s inequality, and (b) holds due to Hoeffding s Lemma. Similarly, we have p(µ Lg ϵ) e 1 8 t2(a b)2 Combing the above two inequalities, we have p(|µ Lg| ϵ) 1 2e 1 8 t2(a b)2 6. (Markov s Inequality) Let x be a real-valued non-negative random variable. Then, for any ϵ > 0, p(x ϵ) E[x] ϵ . 7. (Hoeffding s Lemma) Let x be a real-valued random variable with zero mean and p(x [a, b]) = 1. Then, for any z R, E[ezx] exp( 1 8z2(b a)2). Liu, Jing, Wen, Xu, Wang, Yu and Ng δg = 2e 1 8 t2(a b)2 Let L(s) g = 1 D(s) P (i,j) D(s) ||xi,j ˆxi,j||2 2 (s {1, 2, , S}). Then, we know that Ll and Lg have the same expectation µ, Therefore, we can derive δl for Ll as follows: s=1 ws L(s) l µ ϵ ! (a) E[et(P s ws L(s) l µ)] etϵ ws E[et(L(s) l µ)] etϵ wse 1 8 t2(as bs)2 where (a) holds due to E[et(P s ws L(s) l µ)] etϵ = E[et(P s ws(L(s) l µ))] etϵ and Markov s inequality, (b) holds due to the convexity of exponential function, and (c) holds due to Hoeffding s Lemma. Let as = sup{L(s) l µ} and bs = inf{L(s) l µ}, we have wse 1 8 t2(as bs)2 Considering D(s) D, we know sup{L(s) l } sup{Lg} and inf{L(s) l } inf{Lg}. Therefore, sup{L(s) l µ} sup{Lg µ} and inf{L(s) l µ} inf{Lg µ}, i.e., as a and bs b. Then, we have (a b)2 (as bs)2, i.e., e 1 8 t2(as bs)2 etϵ e 1 8 t2(a b)2 etϵ for s {1, , S}. (39) Then, under the constraint P s ws = 1, we can conclude that δl δg. Remark 7 The above theorem demonstrates that Ll will be close to its expectation µ with higher probability than Lg. Note that Theorem 6 can be applied to general local-based matrix reconstruction methods. However, different local structure determination strategies can derive different δl, i.e., a well-designed local structure determination strategy can achieve better generalization performance than random splitting because a well-designed local structure determination strategy can achieve more accurate submatrix partition. Similar to the local-based method, we can rephrase the objective function of the proposed In DGRM method by (i,j) D(s) w(s) i,j ||x(s) i,j ˆx(s) i,j ||2 2 (40) Interpretable Deep Generative Recommendation Models where ws is the local-specific weight. Similarly, all weights are constrained by PS s=1 ws = 1. w(s) i,j indicates the entry-specific weight in the s-local structure. We can prove that L can achieve lower generalization error bound in the following Theorem. Theorem 8 Let E[L] = E[Ll] = µ. For any ϵ > 0, if p(|Ll µ| ϵ) 1 δl and p(|L µ| ϵ) 1 δ, then δ δl. Proof According to the definition of L and L(s), we can get E[L] = E[Ll] = µ. Let L(s) = P (i,j) D(s) w(s) i,j ||x(s) i,j ˆx(s) i,j ||2 2 (s {1, 2, , S}). We can derive δ for L as follows: s {1,2, ,S} ws L(s) µ ϵ (a) E[et(P s ws L(s) µ)] ws E[et(L(s) µ)] wse 1 8 t2(cs ds)2 where inequalities (a), (b) and (c) hold due to the Markov s inequality, the convexity of exponential function, and Hoeffding s Lemma, respectively. Let cs = sup{L(s) µ} and ds = inf{L(s) µ}, we have wse 1 8 t2(cs ds)2 According to Theorem 6, we know sup{L(s)} sup{L(s) l } and inf{L(s)} inf{L(s) l }. Therefore, sup{L(s) µ} sup{L(s) l µ} and inf{L(s) µ} inf{L(s) l µ}, i.e., cs as and ds bs. Then, we have (as bs)2 (cs ds)2, i.e., e 1 8 t2(cs ds)2 etϵ e 1 8 t2(as bs)2 etϵ for s {1, , S}. (43) Then, under the constraint P s ws = 1, we can conclude that δ < δl. Remark 9 Theorem 8 indicates that our In DGRM have a lower generalization bound than the existing local-based matrix reconstruction methods, because L will be close to its expectation µ with higher probability than Ll. Therefore, minimizing L can achieve small generalization error with higher probability than minimizing Ll. 6. Experimental Results In this section, we evaluate the proposed deep generative model on four datasets by comparing with the state-of-the-art recommendation methods. Liu, Jing, Wen, Xu, Wang, Yu and Ng ML 20M Netflix Ali Shop-7C Yelp users (N) 138,493 480,189 10,668 1,182,626 items (M) 26,744 17,770 20,591 156,638 Interactions 20,000,263 100,000,000 767,493 4,731,265 Density 0.54% 1.17% 0.35% 0.0026% M i 144 208 47 4 N j 748 5,627 231 30 M i: the average number of items rated by each user N j: the average number of users interested in each item Table 1: Summary of experimental datasets 6.1 Experimental Setting Datasets: In experiments, four widely-used recommendation datasets, Movie Lens 20M (ML 20M) 8, Netflix 9, Ali Shop-7C 10 and Yelp 11, are used to validate the recommendation performance. Among them, ML 20M (Harper and Konstan, 2015) and Netflix come from movie domain, Alishop-7C (Ma et al., 2019) belongs to online product domain, and Yelp is related to local business domain. The preference scores in ML 20M are 10 discrete numerical values in the range of [0.5,5] with step 0.5, while the ratings in other datasets are ordinal values on the scale 1 to 5. Following Liang et al. (2018), we binarize explicit data by keeping ratings of four or higher. More detailed information is summarized in Table 1. Metrics for ranking estimation: Recall Recall@n(i) = |Re(i) T (i) is used as ranking estimation, where Re(i) denotes the set of recommended items to user i and T (i) denotes the set of favorite items of user i. Meanwhile, the normalized discounted cumulative gain (NDCG) is adopted to measure the item ranking accuracy, which can be computed by: NDCG@n(i) = DCG@n(i) IDCG@n(i) DCG@n(i) = 2relr 1 log2(r + 1) where IDCG is the DCG value with perfect ranking. relr is the graded relevance of the result at position r. Metrics for interpretability: Two metrics, Explainability Precision (EP) and Explainability Recall (ER), are adopted to evaluate model interpretability (Abdollahi and Nasraoui, 2017). EP is defined as the proportion of explainable items in the top-n recommendation list relative to the number of recommended (top-n) items for each user. Similar to the recall metric, ER is the proportion of explainable items in the top-n recommendation list relative to the number of all explainable items for a given user. Note that an item j is explainable 8. https://grouplens.org/datasets/movielens/ 9. https://www.netflixprize.com 10. https://jianxinma.github.io/disentangle-recsys.html 11. https://www.yelp.com/dataset/challenge Interpretable Deep Generative Recommendation Models for user i if E(xij|j Np) = xij Np Ii where Np is the set of neighbors of item p. For each item, cosine similarity based on interaction behaviors is used to measure the relationship between items, top-50 items are selected as neighbors. Ii is the set of items that user i interacted with. Following (Abdollahi and Nasraoui, 2016), we set explainable threshold ξ = 0.01. Mean EP (MEP) is the average value of explainability precision over all users and mean ER (MER) is the mean explainability recall calculated over all users. Baselines: Two kinds of recommendation methods are adopted as baselines: Traditional recommendation method (EMF (Abdollahi and Nasraoui, 2016)) and Deep generative recommendation models (Mult-VAE (Liang et al., 2018), Macrid VAE (Ma et al., 2019) and DGLGM (Liu et al., 2020)). Parameter setting: For fair comparison, the number of learnable parameters is set around 2Mdmax for each method, which is equivalent to using dmax-dimensional representations for the M items. The initial dimensionality dmax is set as 150. The dropout technique (Srivastava et al., 2014) is adopted at the input layer with probability 0.5. The model is trained using Adam (Kingma and Ba, 2014) with batch size of 128 users for 200 epoch on all datasets. The regularization coefficient λ is set to 1.2 for ML 20M and Netflix, 1.5 for Ali Shop-7C and Yelp. λo is set to 1 for better disentanglement. For auto-encoder based deep methods (Mult-VAE, Macri VAE, DGLGM and our method), the hyperparameters are automatically tuned via TPE (Bergstra et al., 2011), which searches the optimal hyperparameter configuration with 200 trials on the validation set. The held-out users strategy and five-fold cross validation are used to evaluate the recommendation performance. The 20% users are taken as held-out users and evenly separated for validation and test respectively. For each held-out user, his/her feedback data is randomly split into five equal sized subsets. Among them, four subsets are used to obtain the latent representation, and the rest subset is for evaluation in each round. Finally, the averaged results on five rounds are reported. 6.2 Results and Discussion In this section, the proposed In DGRM is sufficiently investigated from seven views. Firstly, In DGRM is compared with baselines from All Users view and Near-cold-start Users view in terms of recommendation performance.Secondly, the interpretability of In DGRM is evaluated in terms of two metrics. Thirdly, a series of ablation experiments is conducted to illustrate the performance of the proposed model. Fourthly, we demonstrate the observed-level disentanglement on users and items. Fifthly, we analyze the factor-to-group interactions to show the disentanglement of user representations and the interpretability of In DGRM. Sixthly, the disentanglement ability on item representations are investigated to further confirm the interpretability of In DGRM. f Finally, the generalization ability and convergence of the proposed model is empirically analyzed. Liu, Jing, Wen, Xu, Wang, Yu and Ng Datasets Methods Recall NDCG @1 @5 @10 @50 @100 @1 @5 @10 @50 @100 EMF 0.1011 0.1985 0.3168 0.4662 0.5437 0.1026 0.2128 0.3235 0.3541 0.3764 Mult-VAE 0.1231 0.2231 0.3320 0.5184 0.5761 0.1211 0.2453 0.3185 0.3833 0.4122 Macrid VAE 0.1322 0.2433 0.3411 0.5233 0.6032 0.1322 0.2633 0.3376 0.3790 0.4122 DGLGM 0.1442 0.2542 0.3428 0.5205 0.6231 0.1527 0.2876 0.3391 0.3931 0.4198 In DGRM 0.1631 0.2651 0.3458 0.5264 0.6389 0.1723 0.3041 0.3426 0.4152 0.4276 EMF 0.1422 0.2411 0.3123 0.4143 0.5772 0.0944 0.1511 0.2234 0.2753 0.3422 Mult-VAE 0.1675 0.2651 0.3208 0.4232 0.5996 0.1238 0.1782 0.2449 0.2985 0.3689 Macrid VAE 0.1842 0.2782 0.3521 0.4348 0.6181 0.1437 0.1916 0.2584 0.3156 0.3798 DGLGM 0.1942 0.2953 0.3513 0.4401 0.6331 0.1653 0.2142 0.2669 0.3289 0.3763 In DGRM 0.2131 0.3153 0.3585 0.4453 0.6542 0.1841 0.2316 0.2695 0.3451 0.3831 Ali Shop-7C EMF 0.0421 0.0852 0.1152 0.2367 0.3651 0.0782 0.1123 0.1522 0.1842 0.2141 Mult-VAE 0.0589 0.1032 0.1343 0.2459 0.3856 0.0943 0.1322 0.1632 0.2032 0.2387 Macrid VAE 0.0642 0.1152 0.1544 0.3025 0.3984 0.1085 0.1527 0.1722 0.2231 0.2925 DGLGM 0.0743 0.1165 0.1585 0.3036 0.4132 0.1231 0.1675 0.1792 0.2431 0.2994 In DGRM 0.0895 0.1287 0.1622 0.3067 0.4275 0.1406 0.1873 0.1853 0.2633 0.3021 EMF 0.2322 0.3536 0.4873 0.5589 0.7431 0.4354 0.5642 0.7285 0.7378 0.7531 Mult-VAE 0.2548 0.3753 0.5322 0.5824 0.7643 0.4558 0.5885 0.7328 0.7551 0.7689 Macrid VAE 0.2773 0.3871 0.5435 0.5903 0.7834 0.4669 0.5973 0.7386 0.7663 0.7746 DGLGM 0.2697 0.3795 0.5458 0.5893 0.7896 0.4772 0.6085 0.7404 0.7595 0.7734 In DGRM 0.2846 0.3996 0.5496 0.5932 0.8132 0.4875 0.6173 0.7542 0.7762 0.7785 Table 2: Comparisons of different methods in terms of ranking estimation (Recall and NDCG) from All Users view. 6.2.1 Recommendation Performance The first experiment is conducted to compare the proposed In DGRM with the baselines from two views (All Users and Near-cold-start Users) on four datasets in terms of Recall and NDCG. All Users indicates that all users are used as the testing set. Near-cold-start Users view means that the users with less than 5 interacted items are involved in the testing set12. Table 2 shows the recommendation performance (Recall and NDCG) for All Users on four datasets. The best and second results are marked in bold and underline. We can see that deep methods significantly outperform the traditional recommendation approach (EMF) on All Users, which indicates that non-linear deep features are beneficial for improving recommendation quality. As expected, In DGRM performs better than all deep generative baselines, which confirms that modeling inter-user preference similarity and intra-user preference diversity in a proper way is beneficial to capture user s intrinsic preference and further improve the prediction accuracy. The more sparser feedback data is, the more challenging the personalized recommendation task is. In these four datasets, the average number of Near-cold-start Users are 1543, 2201, 5310 and 4431526 in ML 20M, Netflix, Ali Shop-7C and Yelp respectively, and they are about 1.85%, 0.46%, 57.4%, 46.7% of all users in the corresponding datasets. It can be seen that Ali Shop-7C and Yelp are more sparse, thus we further evaluate the recommendation performance on Near-cold-start Users. The results obtained by the proposed model and all baselines are listed in the bottom part of Table 3. It is exciting to see that In DGRM is 12. It is not suitable to set the cold start user with less than 5 interacted items on yelp dataset since it is extremely sparse. We define Near-cold-start Users on Yelp as users with less than 2 interacted items. Interpretable Deep Generative Recommendation Models Datasets Methods Recall NDCG @1 @5 @10 @50 @100 @1 @5 @10 @50 @100 EMF 0.0843 0.1433 0.2122 0.3431 0.4955 0.0564 0.1003 0.1933 0.2345 0.3154 Mult-VAE 0.1054 0.1534 0.2390 0.3763 0.5432 0.0675 0.1226 0.1976 0.2557 0.3256 Macrid VAE 0.1251 0.1833 0.2339 0.3843 0.5642 0.0762 0.1323 0.1974 0.2733 0.3412 DGLGM 0.1346 0.1963 0.2431 0.3816 0.5598 0.0721 0.1296 0.1998 0.2872 0.3314 In DGRM 0.1421 0.1985 0.2498 0.3928 0.5795 0.0898 0.1473 0.2033 0.2965 0.3433 EMF 0.0784 0.1342 0.1752 0.2546 0.3421 0.1021 0.1342 0.1852 0.2123 0.2655 Mult-VAE 0.0931 0.1489 0.1945 0.2794 0.3575 0.1321 0.1489 0.1938 0.2322 0.2921 Macrid VAE 0.1034 0.1543 0.2042 0.2852 0.3678 0.1465 0.1589 0.2052 0.2456 0.3015 DGLGM 0.1142 0.1698 0.2076 0.2893 0.3801 0.1634 0.1731 0.2136 0.2542 0.3078 In DGRM 0.1298 0.1846 0.2136 0.2954 0.3988 0.1814 0.1896 0.2144 0.2679 0.3124 Ali Shop-7C EMF 0.0132 0.0436 0.0731 0.1923 0.2653 0.0236 0.0563 0.0852 0.1432 0.1842 Mult-VAE 0.0245 0.0576 0.0952 0.2124 0.2781 0.0353 0.0655 0.1023 0.1623 0.2042 Macrid VAE 0.0276 0.0631 0.1052 0.2241 0.2833 0.0403 0.0732 0.1103 0.1673 0.2134 DGLGM 0.0315 0.0615 0.1084 0.2278 0.2913 0.0514 0.0833 0.1156 0.1765 0.2312 In DGRM 0.0397 0.0783 0.1126 0.2337 0.3042 0.0703 0.0984 0.1214 0.1896 0.2363 EMF 0.0232 0.0511 0.0788 0.1231 0.3244 0.0421 0.0873 0.1122 0.1766 0.2456 Mult-VAE 0.0341 0.0631 0.0981 0.1328 0.3433 0.0542 0.0967 0.1352 0.1879 0.2642 Macrid VAE 0.0421 0.0745 0.1031 0.1398 0.3542 0.0672 0.1042 0.1428 0.1982 0.2762 DGLGM 0.0542 0.0872 0.1121 0.1472 0.3648 0.0732 0.1101 0.1531 0.2098 0.2892 In DGRM 0.0673 0.0993 0.1289 0.1628 0.3761 0.0894 0.1315 0.1672 0.2214 0.3015 Table 3: Comparisons of different methods in terms of ranking estimation (Recall and NDCG) from Near-cold-start Users view. superior to all baselines. The main reason, we believe, is that In DGRM sufficiently exploits the user correlations which is helpful to characterize the Near-cold-start Users. Inter-user preference similarity and intra-user preference diversity not only allow us to accurately represent the diverse interests of a user with the aid of group structure among users and items, but also alleviates data sparsity by allowing a rarely visited item to propagate information from other items of the same item group. For demonstrating the efficiency of the proposed In DGRM method intuitively, we calculate improvements between In DGRM and four baselines, as shown in Figure 3. Although the percentage of relative improvements are small, small improvements can lead to significant differences of recommendations in practice (Koren et al., 2010). Meanwhile, we conduct paired t-test (confidence 0.95) between In DGRM and each baseline with five-fold cross-validation results. As shown in Table 4, we list the p-value obtained by In DGRM vs. four baselines on All Users view. The p-values in all cases are less than 10 5, which indicates that our improvements are statistically significant at the 5% level. Similar statistical test results can be found in Near-cold-start Users view, we omit it here. Therefore, based on these observations, we can say In DGRM consistently outperforms the state-of-the-art recommendation methods and significantly improves the recommendation performance. 6.2.2 Interpretability Performance Following Abdollahi and Nasraoui (2016) and Abdollahi and Nasraoui (2017), two metrics, Mean Explainable Precision (MEP) and Mean Explainable Recall (MER), are adopted to evaluate the performance on explainability. Larger MEP and MER indicate better interpretability performance. As shown in Table 5 and 6, In DGRM consistently performs better Liu, Jing, Wen, Xu, Wang, Yu and Ng Recall@1 Recall@5 Recall@10 Recall@50 Recall@100 NDCG@1 NDCG@5 NDCG@10 NDCG@50 NDCG@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (a) ML 20M / All Users Recall@1 Recall@5 Recall@10 Recall@50 Recall@100 NDCG@1 NDCG@5 NDCG@10 NDCG@50 NDCG@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (b) Netflix / All Users Recall@1 Recall@5 Recall@10 Recall@50 Recall@100 NDCG@1 NDCG@5 NDCG@10 NDCG@50 NDCG@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (c) Ali Shop-7C / All Users Recall@1 Recall@5 Recall@10 Recall@50 Recall@100 NDCG@1 NDCG@5 NDCG@10 NDCG@50 NDCG@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (d) Yelp / All Users 0.00% 10.00% 30.00% 40.00% 50.00% 70.00% 80.00% Recall@1 Recall@5 Recall@10 Recall@50 Recall@100 NDCG@1 NDCG@5 NDCG@10 NDCG@50 NDCG@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (e) ML 20M / Near-cold-start Users 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% Recall@1 Recall@5 Recall@10 Recall@50 Recall@100 NDCG@1 NDCG@5 NDCG@10 NDCG@50 NDCG@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (f) Netflix / Near-cold-start Users Recall@1 Recall@5 Recall@10 Recall@50 Recall@100 NDCG@1 NDCG@5 NDCG@10 NDCG@50 NDCG@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (g) Ali Shop-7C / Near-cold-start Users Recall@1 Recall@5 Recall@10 Recall@50 Recall@100 NDCG@1 NDCG@5 NDCG@10 NDCG@50 NDCG@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (h) Yelp / Near-cold-start Users Figure 3: The relative improvements of In DGRM vs. four baselines in terms of recommendation performance (Recall and NDCG) on All Users and Near-cold-start Users views. Interpretable Deep Generative Recommendation Models Datasets Methods Recall NDCG @1 @5 @10 @50 @100 @1 @5 @10 @50 @100 vs.EMF 2.52E-08 7.12E-08 4.93E-06 7.66E-07 9.09E-06 6.34E-07 7.85E-05 1.41E-07 8.65E-06 5.23E-05 vs.Mult-VAE 4.88E-07 3.78E-07 8.24E-09 4.79E-07 5.29E-05 5.52E-06 6.19E-05 3.95E-05 3.83E-08 4.42E-08 vs.Macrid VAE 1.38E-08 2.28E-07 3.57E-07 9.85E-07 4.99E-08 5.52E-06 6.19E-05 6.23E-06 3.53E-06 2.53E-08 vs.DGLGM 6.07E-08 5.18E-06 4.67E-07 6.44E-06 1.69E-06 3.52E-05 5.89E-07 9.85E-07 4.99E-05 5.52E-06 vs.EMF 5.32E-06 5.75E-07 7.34E-09 4.79E-07 6.21E-07 3.58E-06 5.34E-06 1.55E-06 2.42E-08 3.52E-07 vs.Mult-VAE 8.07E-06 7.43E-08 3.66E-08 9.85E-07 4.99E-05 5.52E-06 6.19E-05 4.33E-08 2.22E-06 3.09E-08 vs.Macrid VAE 8.08E-08 3.66E-08 5.23E-09 5.31E-05 3.91E-05 2.53E-07 9.12E-06 1.03E-08 1.39E-06 5.78E-07 vs.DGLGM 3.55E-07 3.53E-08 5.23E-07 3.33E-06 3.72E-06 3.31E-07 2.69E-05 2.55E-05 8.97E-06 6.45E-07 Ali Shop-7C vs.EMF 6.95E-05 5.56E-07 3.22E-06 6.54E-06 3.33E-06 4.12E-07 2.88E-05 9.07E-06 8.51E-06 3.53E-06 vs.Mult-VAE 8.06E-05 5.24E-05 3.52E-06 2.72E-07 3.36E-06 3.53E-07 2.88E-09 2.79E-07 5.33E-05 2.78E-08 vs.Macrid VAE 8.32E-06 2.90E-05 4.98E-05 5.33E-09 3.23E-07 3.11E-06 2.93E-05 6.63E-06 1.20E-06 4.66E-08 vs.DGLGM 4.19E-07 3.36E-06 5.81E-05 3.82E-06 2.94E-06 2.26E-06 6.11E-08 3.81E-07 1.92E-06 3.44E-07 vs.EMF 9.54E-07 5.93E-05 4.42E-06 2.73E-06 9.49E-06 8.58E-08 3.14E-05 4.96E-07 6.12E-06 8.60E-06 vs.Mult-VAE 9.65E-06 7.56E-07 4.86E-07 2.83E-06 2.44E-05 2.55E-07 3.33E-08 3.73E-06 3.95E-06 8.68E-08 vs.Macrid VAE 5.23E-06 4.22E-07 4.62E-08 4.71E-06 2.79E-06 3.58E-07 3.38E-06 3.34E-07 8.74E-06 3.53E-08 vs.DGLGM 3.65E-08 2.99E-06 3.52E-07 6.26E-08 2.27E-06 3.33E-06 6.36E-07 6.35E-07 3.47E-08 3.27E-05 Table 4: Statistical significance (p-value) obtained by In DGRM vs. four baselines on All Users view. Datasets Methods MEP MER @1 @5 @10 @50 @100 @1 @5 @10 @50 @100 EMF 0.1233 0.2524 0.5121 0.6422 0.7832 0.0023 0.0046 0.0066 0.0098 0.1135 Mult-VAE 0.1354 0.2653 0.5001 0.6555 0.7943 0.0037 0.0049 0.0062 0.0142 0.1256 Macrid VAE 0.1398 0.2734 0.5242 0.6653 0.8033 0.0035 0.0057 0.0068 0.0189 0.1354 DGLGM 0.1453 0.2844 0.5232 0.6873 0.8142 0.0043 0.0068 0.0069 0.0197 0.1398 In DGRM 0.1512 0.2952 0.5234 0.6793 0.8233 0.0056 0.0075 0.0073 0.0202 0.1435 EMF 0.2313 0.3766 0.6657 0.7322 0.7932 0.0023 0.0056 0.0093 0.0145 0.0342 Mult-VAE 0.2521 0.3872 0.6325 0.7593 0.8132 0.0028 0.0067 0.0089 0.0242 0.0433 Macrid VAE 0.2633 0.3933 0.6732 0.7633 0.8234 0.0034 0.0075 0.0094 0.0342 0.0489 DGLGM 0.2692 0.3842 0.6643 0.7768 0.8293 0.0038 0.0079 0.0095 0.0387 0.0532 In DGRM 0.2753 0.3992 0.6801 0.7832 0.8432 0.0046 0.0073 0.0096 0.0424 0.0593 Ali Shop-7C EMF 0.1452 0.3522 0.5433 0.6892 0.7832 0.0242 0.0456 0.0742 0.0953 0.1231 Mult-VAE 0.1569 0.3688 0.5401 0.6945 0.7985 0.0356 0.0543 0.0753 0.1032 0.1389 Macrid VAE 0.1678 0.3764 0.5552 0.7065 0.8032 0.0384 0.0598 0.0843 0.1135 0.1456 DGLGM 0.1732 0.3734 0.5563 0.7121 0.8142 0.0396 0.0613 0.0855 0.1189 0.1498 In DGRM 0.1787 0.3875 0.5642 0.7232 0.8214 0.0406 0.0645 0.0913 0.1232 0.1534 EMF 0.1358 0.2533 0.3923 0.4983 0.6732 0.0142 0.0235 0.0357 0.0642 0.0983 Mult-VAE 0.1542 0.2661 0.3778 0.5122 0.6893 0.0246 0.0342 0.0353 0.0714 0.1042 Macrid VAE 0.1688 0.2785 0.3943 0.5235 0.6832 0.0346 0.0398 0.0358 0.0798 0.1098 DGLGM 0.1752 0.2549 0.3913 0.5212 0.6901 0.0389 0.0412 0.0358 0.0833 0.1123 In DGRM 0.1872 0.2911 0.4078 0.5378 0.6993 0.0477 0.0479 0.0361 0.0819 0.1242 Table 5: Comparing different methods in terms of explainability metrics (MEP and MER) on All Users view. Liu, Jing, Wen, Xu, Wang, Yu and Ng Datasets Methods MEP MER @1 @5 @10 @50 @100 @1 @5 @10 @50 @100 EMF 0.0544 0.1823 0.2152 0.4222 0.5433 0.0012 0.0024 0.0037 0.0065 0.0842 Mult-VAE 0.0636 1922 0.2251 0.4367 0.5546 0.0014 0.0027 0.0042 0.0072 0.0887 Macrid VAE 0.0698 0.1976 0.2336 0.4403 0.5598 0.0018 0.0032 0.0046 0.0078 0.0938 DGLGM 0.0721 0.1989 0.2386 0.4451 0.5648 0.0021 0.0037 0.0051 0.0094 0.0984 In DGRM 0.0789 0.2032 0.2451 0.4526 0.5687 0.0026 0.0043 0.0054 0.0098 0.1032 EMF 0.1124 0.1831 0.3326 0.4257 0.5452 0.0014 0.0024 0.0045 0.0112 0.0215 Mult-VAE 0.1255 0.1925 0.3453 0.4333 0.5544 0.0016 0.0027 0.0047 0.0124 0.0227 Macrid VAE 0.1358 0.1993 0.3533 0.4428 0.5646 0.0019 0.0033 0.0049 0.0135 0.0245 DGLGM 0.1527 0.2043 0.3596 0.4489 0.5766 0.0023 0.0036 0.0051 0.0153 0.0263 In DGRM 0.1738 0.2176 0.3645 0.4573 0.5834 0.0026 0.0042 0.0056 0.0178 0.0276 Ali Shop-7C EMF 0.0633 0.1443 0.2353 0.3411 0.5123 0.0133 0.0225 0.0412 0.0591 0.0716 Mult-VAE 0.0783 0.1325 0.2443 0.3537 0.5236 0.0147 0.0261 0.0451 0.0633 0.0821 Macrid VAE 0.0856 0.1489 0.2495 0.3646 0.5312 0.0166 0.0276 0.0487 0.0684 0.0895 DGLGM 0.0945 0.1437 0.2575 0.3735 0.5463 0.0178 0.0296 0.0516 0.0742 0.0963 In DGRM 0.0987 0.1514 0.2651 0.3829 0.5535 0.0189 0.0314 0.0578 0.0793 0.1035 EMF 0.0256 0.1336 0.1945 0.3144 0.4877 0.0064 0.0121 0.0215 0.0401 0.0634 Mult-VAE 0.0358 0.1453 0.2042 0.3228 0.4964 0.0121 0.0154 0.0226 0.0458 0.0712 Macrid VAE 0.0411 0.1478 0.2148 0.3328 0.5024 0.0153 0.0168 0.0231 0.0549 0.0789 DGLGM 0.0436 0.1562 0.2235 0.3446 0.5120 0.0162 0.0185 0.0248 0.0637 0.0846 In DGRM 0.0458 0.1635 0.2358 0.3562 0.5258 0.0178 0.0197 0.0262 0.0691 0.0915 Table 6: Comparing different methods in terms of explainability metrics (MEP and MER) on Near-cold-start Users view. Dataset Methods 0-5 6-20 21-50 51-100 > 100 MEP MER MEP MER MEP MER MEP MER MEP MER EMF 0.4578 0.0065 0.4656 0.0065 0.4756 0.0068 0.4941 0.0069 0.5153 0.0070 Macrid VAE 0.4615 0.0066 0.4734 0.0068 0.4861 0.0069 0.4904 0.0069 0.5211 0.0070 In DGRM 0.4745 0.0068 0.4989 0.0069 0.4868 0.0071 0.4922 0.0072 0.5222 0.0071 EMF 0.5812 0.0088 0.6479 0.0090 0.6621 0.0093 0.6656 0.0095 0.6653 0.0096 Macrid VAE 0.6033 0.0084 0.6472 0.0086 0.6615 0.0089 0.6714 0.0090 0.6731 0.0097 In DGRM 0.6232 0.0089 0.6574 0.0092 0.6734 0.0095 0.6802 0.0096 0.6812 0.0098 Ali Shop-7C EMF 0.5053 0.0698 0.5315 0.0713 0.5401 0.0739 0.5423 0.0742 0.5432 0.0752 Macrid VAE 0.5231 0.0783 0.5351 0.0803 0.5415 0.0822 0.5514 0.0832 0.5531 0.0847 In DGRM 0.5389 0.0834 0.5571 0.0856 0.5634 0.0884 0.5643 0.0898 0.5652 0.0911 EMF 0.3521 0.0357 0.3751 0.0363 0.3901 0.0369 0.3925 0.0371 0.3911 0.0374 Macrid VAE 0.3656 0.0353 0.3765 0.0359 0.3901 0.0365 0.3914 0.0368 0.3931 0.0373 In DGRM 0.3733 0.0362 0.3856 0.0371 0.3953 0.0376 0.3998 0.0379 0.3998 0.0384 Table 7: Comparisons of explainable recommendation methods (EMF, Macri VAE and In DGRM) on all items with different interaction degrees in terms of MEP@10 and MER@10. Interpretable Deep Generative Recommendation Models 40.00% 60.00% MEP@1 MEP@5 MEP@10 MEP@50 MEP@100 MER@1 MER@5 MER@10 MER@50 MER@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (a) ML 20M / All Users MEP@1 MEP@5 MEP@10 MEP@50 MEP@100 MER@1 MER@5 MER@10 MER@50 MER@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (b) Netflix / All Users 10.00% 20.00% 30.00% 50.00% 60.00% 70.00% 80.00% MEP@1 MEP@5 MEP@10 MEP@50 MEP@100 MER@1 MER@5 MER@10 MER@50 MER@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (c) Ali Shop-7C / All Users MEP@1 MEP@5 MEP@10 MEP@50 MEP@100 MER@1 MER@5 MER@10 MER@50 MER@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (d) Yelp / All Users MEP@1 MEP@5 MEP@10 MEP@50 MEP@100 MER@1 MER@5 MER@10 MER@50 MER@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (e) ML 20M / Near-cold-start Users 0.00% 10.00% 20.00% 30.00% 40.00% 50.00% 60.00% 70.00% 80.00% 90.00% MEP@1 MEP@5 MEP@10 MEP@50 MEP@100 MER@1 MER@5 MER@10 MER@50 MER@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (f) Netflix / Near-cold-start Users MEP@1 MEP@5 MEP@10 MEP@50 MEP@100 MER@1 MER@5 MER@10 MER@50 MER@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (g) Ali Shop-7C / Near-cold-start Users MEP@1 MEP@5 MEP@10 MEP@50 MEP@100 MER@1 MER@5 MER@10 MER@50 MER@100 Improvement(%) vs. EMF vs. Mult-VAE vs. Macrid VAE vs. DGLGM (h) Yelp / Near-cold-start Users Figure 4: The relative improvements of In DGRM vs. four baselines in terms of interpretability (MEP and MER) on All Users and Near-cold-start Users view. Liu, Jing, Wen, Xu, Wang, Yu and Ng than baselines in terms of both All Users view and Near-cold-start Users view. These results demonstrate that the proposed model has ability to improve the interpretability of recommendation system. At least, the recommended list includes much more explainable recommended items for each user. Note that whether a recommended item is explainable or not depends on whether it is also interacted by similar users. The good performance benefits from two points. Firstly, In DGRM sufficiently determines the user-cluster structure, which is helpful to determine similar users according to their similar preference. Secondly, In DGRM bridges the semantic gap between the latent factors and user-item interaction behavior groups, which is able to disentangle the user latent representation and further improve the model s interpretability. We also report the improvements between In DGRM between four baselines in terms of interpretability metrics, as shown in Figure 4. We can see that the improvements on MEP are more obvious, that is, In DGRM can recall more interpretable items than the baselines. The possible reason is that prototype learning in In DGRM can effectively mine the relationship between items and achieve better preference modeling with the aid of sparse mapping mechanism. To further demonstrate the interpretability, we split items into five groups (0-5, 6-20, 21-50, 51-100, >100) according to their number of interactions from all users in training data. Three interpretable methods (EMF, Macrid VAE and In DGRM) are used for comparison. The interpretability performance on each group is shown in Table 7 (in terms of MEP and MER). The proposed In DGRM and baselines have the similar trends with respect to different groups, i.e., MEP and MER becomes better and better with the increasing of interactions, which indicates that item interaction frequency plays an important role in interpretability performance. To be exciting, the proposed In DGRM significantly outperforms the baselines for all groups. This result further demonstrates that In DGRM has the ability to effectively handle data with various items, especially for Near-cold-start Items. 6.2.3 Ablation Test The proposed In DGRM can be regraded as an improved deep generative model equipped with three modules, mixture of Gaussian prior for latent user representation, item group mining and factor-to-group sparse mapping. In this ablation experiment, we try to demonstrate the effect of each module. Two simplified versions of In DGRM (denoted as In DGRMa and In DGRM-b) are mainly constructed. In DGRM-a considers item group mining and factor-to-group sparse mapping, and removes the mixture of Gaussian prior. In DGRM-b only considers mixture of Gaussian prior. Table 8 lists the results (including recommendation performance and interpretability) on Mult-VAE, In DGRM-a, In DGRM-b and In DGRM in terms of All Users view and Nearcold-start Users view. Here Mult-VAE is used as a basic baseline, which is deep generative model but does not consider three modules (Kingma and Welling, 2013). In DGRM-a and In DGRM-b are superior to Mult-VAE, which show the effectiveness of the corresponding proposed modules. In DGRM-b is superior to In DGRM-a in terms of interpretability evaluation metrics (MEP and MER). The possible reason is that mixture of Gaussian prior is able to grouping users with similar preference. Like the neighborhood-based interpretation, this scheme is helpful to provide clues for interpretation. As expected, In DGRM outperforms Mult-VAE, In DGRM-a and In DGRM-b. This confirms that modeling both inter-user Interpretable Deep Generative Recommendation Models Datasets Methods All Users Near-cold-start Users Recall@10 NDCG@10 MEP@10 MER@10 Recall@10 NDCG@10 MEP@10 MER@10 Mult-VAE 0.3320 0.3185 0.5001 0.0062 0.2390 0.1976 0.2251 0.0042 In DGRM-a 0.3368 0.3365 0.5194 0.0070 0.2415 0.1996 0.2402 0.0048 In DGRM-b 0.3397 0.3396 0.5203 0.0071 0.2433 0.2004 0.2413 0.0050 In DGRM 0.3458 0.3426 0.5234 0.0073 0.2498 0.2033 0.2451 0.0054 Mult-VAE 0.3208 0.2449 0.6325 0.0089 0.1945 0.1938 0.3453 0.0047 In DGRM-a 0.3453 0.2623 0.6725 0.0092 0.2078 0.2103 0.3597 0.0052 In DGRM-b 0.3496 0.2649 0.6743 0.0093 0.2093 0.2112 0.3605 0.0053 In DGRM 0.3585 0.2695 0.6801 0.0096 0.2136 0.2144 0.3645 0.0056 Ali Shop-7C Mult-VAE 0.1343 0.1632 0.5401 0.0753 0.0952 0.1023 0.2443 0.0451 In DGRM-a 0.1573 0.1804 0.5591 0.8985 0.1095 0.1183 0.1603 0.0521 In DGRM-b 0.1589 0.1821 0.5604 0.0900 0.1101 0.1191 0.2617 0.0543 In DGRM 0.1622 0.1853 0.5642 0.0913 0.1126 0.1214 0.2651 0.0578 Mult-VAE 0.5322 0.7328 0.3778 0.0353 0.0981 0.1352 0.2042 0.0226 In DGRM-a 0.5421 0.7356 0.3984 0.0359 0.1589 0.1600 0.2287 0.0245 In DGRM-b 0.5443 0.7390 0.4025 0.0360 0.1611 0.1621 0.2307 0.0251 In DGRM 0.5496 0.7542 0.4078 0.0361 0.1289 0.1672 0.2358 0.0262 Table 8: Ablation analysis of In DGRM in terms of different metrics. Datasets Methods All Users Near-cold-start Users Recall@10 NDCG@10 MEP@10 MER@10 Recall@10 NDCG@10 MEP@10 MER@10 ML 20M In DGRM-p 0.3498 0.3545 0.5361 0.0077 0.2590 0.2136 0.2563 0.0057 In DGRM 0.3458 0.3426 0.5234 0.0073 0.2498 0.2033 0.2451 0.0054 Netflix In DGRM-p 0.3658 0.2789 0.6942 0.0098 0.2251 0.2261 0.3763 0.0059 In DGRM 0.3585 0.2695 0.6801 0.0096 0.2136 0.2144 0.3645 0.0056 Ali Shop-7C In DGRM-p 0.1733 0.1894 0.5731 0.0963 0.1262 0.1331 0.2721 0.0615 In DGRM 0.1622 0.1853 0.5642 0.0913 0.1126 0.1214 0.2651 0.0578 Yelp In DGRM-p 0.5562 0.5631 0.4151 0.0363 0.1341 0.1721 0.2425 0.0273 In DGRM 0.5496 0.7542 0.4078 0.0361 0.1289 0.1672 0.2358 0.0262 Table 9: Comparison between In DGRM and In DGRM-p in terms of different metrics. preference similarity and intra-user preference diversity has ability to significantly improve model performance. Furthermore, In DGRM with prototype enhancing by considering privilege information (In DGRM-p) is also investigated. Table 9 lists the comparison results between In DGRM-p and In DGRM. We can see that prototype enhancing with privilege information is helpful for better preference modeling and further improving performance in terms of both recommendation accuracy and interpretability. Additionally, the number of mixture components K is a key parameter in the proposed In DGRM, which affects the ability of capturing the inter-user preference similarity. Thus, we conduct experiment to evaluate the effect of K on the recommendation performance of In DGRM. Among it, K is selected from {1, 2, 4, 6, 8, 10, 15, 20, 25, 30}. Figure 5 shows the recommendation performance (in terms of Recall@10 and NDCG@10) by varying K. It can be seen that the performance is improved with the increasing of K. This is due to the fact that different mixture components may capture different user preferences and further improve recommendation performance. After reaching the optimal value K (The optimal values are 6, 8, 4 and 15 on ML 20M, Netflix, Ali Shop-7C and Yelp, respectively), the performance decreases when K further increases. This phenomenon implies that large K may overfit the training data. Meanwhile, an interesting observation is that the optimal K is larger on the large dataset than that on the small dataset. This is reasonable because Liu, Jing, Wen, Xu, Wang, Yu and Ng 1 2 4 6 8 10 15 20 25 30 K 0.326 0.328 0.33 0.332 0.334 0.336 0.338 0.34 0.342 0.305 0.31 0.315 0.32 0.325 0.33 0.335 0.34 0.345 Recall@10 NDCG@10 1 2 4 6 8 10 15 20 25 30 K Recall@10 NDCG@10 1 2 4 6 8 10 15 20 25 30 K Recall@10 NDCG@10 (c) Ali Shop-7C 1 2 4 6 8 10 15 20 25 30 K Recall@10 NDCG@10 Figure 5: Effect of K (the number of mixture components (i.e., user clusters)) on In DGRM. large dataset contains the larger number of users and items, which is much more complicated and requires more components to capture their characteristics. In Section 4.2, the cosine similarity, instead of inner product similarity, is used to calculate the correlations between each pair of item and propertype. In this experiment, a new model with inner product similarity is trained to compare that with cosine similarity. Figure 6 shows the learned item groups with two models on Ml 20M and Ali Shop. It can be seen that, inner product promotes most items belonging to one prototype (see Figure 6 (a) and (c)). As expected, each prototype output by cosine-based model can contain a certain number of items (see Figure 6 (b) and (d)). This finding confirms our claim that selecting a proper metric, such as cosine similarity, is important for model training. 6.2.4 Observed-level Disentanglement Analysis To capture the user-cluster structure reflecting inter-user preference similarity, the mixture prior is introduced to model user s preference so that users with similar tastes can be assigned to the same group. To confirm this, taking ML 20M dataset as an example, we try to investigate the corresponding semantical information of the mixture components with the aid of side information. In ML 20M, each movie is marked by one or more genres, and all movies belong to 18 genres, including Action , Adventure , Animation , Children , Comedy , Crime , Documentary , Drama , Fantasy , Film-Noir , Horror , Musical , Mystery , Romance , Sci-Fi , Thriller , War and Western . For the k-th user cluster obtained by In DGRM, let M (k) be the number of items related by the users belonging to this cluster, where the number of items belonging to the q-th genre is denoted as M(k) q and PNl q=1 M(k) q = M(k) (Nl = 18 (the number of genres) in ML 20M ). {M(k) q /M (k)}Nl q=1 indicates the item proportion along different genres for k-th user Interpretable Deep Generative Recommendation Models 1 2 3 4 5 6 7 8 9 101112131415161718 Number of Items Item Group ID (a) Item groups with inner product similarity on ML 20M 1 2 3 4 5 6 7 8 9 101112131415161718 Number of Items Item Group ID (b) Item groups with cosine similarity on ML 20M 0 2000 4000 6000 8000 10000 12000 14000 16000 1 2 3 4 5 6 7 Number of Items Item Group ID (c) Item groups with inner product similarity on Ali Shop-7C 1 2 3 4 5 6 7 Number of Items Item Group ID (d) Item groups with cosine similarity on Ali Shop-7C Figure 6: The number of item in each item group obtained via In DGRM with cosine similarity and inner product similarity. Liu, Jing, Wen, Xu, Wang, Yu and Ng Documentary Item propotion User Cluster 1 User Cluster 2 User Cluster 3 User Cluster 4 User Cluster 5 User Cluster 6 Figure 7: The item proportion with genre in each user clusters obtained by In DGRM on ML 20M. Dataset K d(k) in each component ML 20M 6 {159, 126, 147, 152, 144, 156} Netflix 8 {144, 78, 134, 153, 126, 159, 145, 114} Ali Shop-7C 4 {136, 184, 135, 90} Yelp 15 {137, 126, 154, 142, 137, 165, 121, 95, 78, 115, 125, 147, 89, 120, 156} Table 10: The optimal latent dimensionality dk in each component determined by In DGRM on four datasets. K is the optimal number of mixture components. dk is the optimal dimensionality in the k-th component. cluster. The item proportion with specific genre is shown for six clusters in Figure 7. It can be seen that most items (> 70%) in one component are always from the same genre, such as {Drama}, {Comedy}, {Action}, {Romance}, {Thriller}, {Animation, Children s} for user cluster 1 to 6. This result indicates each user cluster definitely captures one main user preference. Thus, we can say In DGRM has the ability to determine the user-cluster structure and achieve observed-level disentanglement on users. Thanks for ARD technique with thresh, the proposed In DGRM model can adaptively determine the optimal latent dimensionality for each mixture component. Table 10 demonstrates the optimal latent space size for zi in each user cluster on different datasets. It can be seen that the latent dimensionalities have different variances from different datasets. For instance, the minimal and maximal sizes are {126, 159}, {78, 159}, {90, 184} and {78, 165} on ML 20M, Netflix, Ali Shop-7C and Yelp, respectively. An interesting thing is that the optimal size is almost proportional to the group density. This result confirms that user clusters with few interactions should be represented in a lower dimensional space, other be of higher dimensional space, which is consistent with the conclusion given in Li et al. (2017) and Liu et al. (2019a). Interpretable Deep Generative Recommendation Models Datasets ML 20M Netflix Ali Shop-7C Yelp Metrics Recall@10 NDCG@10 Recall@10 NDCG@10 Recall@10 NDCG@10 Recall@10 NDCG@10 In DGRM(d=10) 0.3334 0.3323 0.3401 0.2489 0.1537 0.1822 0.5403 0.7343 In DGRM(d=20) 0.3338 0.3357 0.3419 0.2504 0.1558 0.1824 0.5405 0.7369 In DGRM(d=50) 0.3314 0.3283 0.3405 0.2545 0.1543 0.1784 0.5393 0.7354 In DGRM(d=80) 0.3321 0.3314 0.3414 0.2506 0.1545 0.1805 0.5398 0.7365 In DGRM(d=90) 0.3345 0.3355 0.3422 0.2555 0.1546 0.1815 0.5421 0.7388 In DGRM(d=100) 0.3326 0.3311 0.3431 0.2553 0.1589 0.1835 0.5431 0.7391 In DGRM(d=110) 0.3354 0.3326 0.3457 0.2589 0.1598 0.1846 0.5415 0.7387 In DGRM(d=120) 0.3389 0.3331 0.3478 0.2656 0.1577 0.1836 0.5396 0.7385 In DGRM(d=130) 0.3413 0.3358 0.3532 0.2675 0.1568 0.1827 0.5382 0.7378 In DGRM(d=140) 0.3424 0.3379 0.3492 0.2631 0.1555 0.1815 0.5390 0.7381 In DGRM(d=150) 0.3373 0.3369 0.3455 0.2575 0.1532 0.1789 0.5395 0.7383 In DGRM(d=160) 0.3367 0.3358 0.3432 0.2557 0.1576 0.1825 0.5374 0.7376 In DGRM(d=170) 0.3354 0.3328 0.3446 0.2565 0.1554 0.1813 0.5353 0.7368 In DGRM(d=180) 0.3321 0.3310 0.3421 0.2545 0.1533 0.1804 0.5365 0.7372 In DGRM(d=190) 0.3348 0.3356 0.3468 0.2595 0.1565 0.1819 0.5345 0.7369 In DGRM(d=200) 0.3346 0.3357 0.3446 0.2564 0.1542 0.1816 0.5327 0.7361 DGLGM 0.3428 0.3391 0.3513 0.2669 0.1585 0.1792 0.5458 0.7404 In DGRM 0.3458 0.3426 0.3585 0.2695 0.1622 0.1853 0.5496 0.7542 Table 11: Recommendation performance of In DGRM against In DGRM with fixed latent dimensionality for zi in terms of Recall@10 and NDCG@10. Additionally, we compare the recommendation performance of In DGRM against In DGRM with fixed latent dimensionalities in all components {10, 20, 50, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200}. The results are list in Table 11. As we can see, the performance of In DGRM with fixed latent dimensionality is unstable when the dimensionality increases from 10 to 200. The competitive performance can be obtained by In DGRM with fixed dimensions 140, 130, 110 and 100 on ML 20M, Netflix, Ali Shop-7C and Yelp, respectively. In the case of fixed dimensions, the optimal value is better than DGLGM, but still worse than In DGRM. The reason is that mapping users with preference diversity into a fixed latent space size cannot model all users well, so that some users are either under-fitted or over-fitted. Furthermore, In DGRM is superior to DGLGM which adopts ARD technique as well, the main reason is that item group structure can model user preference in a more fine-grained manner. 6.2.5 Disentangled User Representations Analysis a). Factor-to-group Mapping By telling us the item groups responsible for a given prediction, factor-to-group mapping matrices reveal insights about how models rely on and extrapolate from the user behavior. In this subsection, we focus on show how In DGRM can understand model behavior with the aid of factor-to-group mapping mechanism. Group structure among items can effectively reflect the intra-user preference diversity. Firstly, we investigate the properties of the learned factor-to-group mapping matrix by splitting its values (o(g) i,l ) into ten groups at interval of 0.1 according. Figure 8 shows the proportion of mapping value in difference intervals. Obviously, they approximately follow Laplace-like distribution. In other words, most values belong to first three ranges, which Liu, Jing, Wen, Xu, Wang, Yu and Ng [0, 0.1] [0.1, 0.2] [0.2, 0.3] [0.3, 0.4] [0.4, 0.5] [0.5, 0.6] [0.6, 0.7] [0.7, 0.8] [0.8, 0.9] [0.9, 1] [0, 0.1] [0.1, 0.2] [0.2, 0.3] [0.3, 0.4] [0.4, 0.5] [0.5, 0.6] [0.6, 0.7] [0.7, 0.8] [0.8, 0.9] [0.9, 1] [0, 0.1] [0.1, 0.2] [0.2, 0.3] [0.3, 0.4] [0.4, 0.5] [0.5, 0.6] [0.6, 0.7] [0.7, 0.8] [0.8, 0.9] [0.9, 1] (c) Ali Shop-7C [0, 0.1] [0.1, 0.2] [0.2, 0.3] [0.3, 0.4] [0.4, 0.5] [0.5, 0.6] [0.6, 0.7] [0.7, 0.8] [0.8, 0.9] [0.9, 1] Figure 8: Factor-to-group mapping value learned by In DGRM. (a) Sampled user 1 on ML 20M (b) Sampled user 2 on ML 20M 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Computer accessories & hardware Office supplies Food Household goods Mobile phone accessories Electrical appliances 0.00 0.15 0.30 0.45 0.60 0.75 (c) Sampled user 1 on Ali Shop-7C 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Computer accessories & hardware Office supplies Food Household goods Mobile phone accessories Electrical appliances (d) Sampled user 2 on Ali Shop-7C Figure 9: Demonstrating the learned factor-to-group mapping matrices O i related to 2 sampled users on ML 20M and Ali Shop-7C. Interpretable Deep Generative Recommendation Models indicates that there is few relations between a large number of dimensions and item groups. This result is obtained by penalizing the interactions via group lasso, which has ability to encourage each factor to capture the distinct modes of variation, so that the correlations among factors of zi can be minimized as much as possible. Therefore, In DGRM can achieve the disentanglement of latent item representations. Furthermore, we plot the learned factor-to-group matrix Oi for two sampled users in ML 20M (Figure 9(a) - (b)) and Ali Shop-7C (Figure 9(c) - (d)). In DGRM successfully disentangles the dimensions of zi because each dimension only corresponds to one group with higher probability. For instance, from Figure 9(a), we know that sampled user 1 likes movies with genre Action , Adventure , Comedy , Fantasy , Musical , and Romance . Sampled user 2 in Figure 9(b) tends to watch movies about Action , Adventure , Crime , Horror genres, which indicates this user s taste is crime and violence. Category Action is liked by most of users. The reason is that lots of movies contain genre Action . Furthermore, we can see some dimensions are associated with multiple genres since some genres are usually co-existed, e.g., an Action movie is usually an Adventure movie. Ali Shop-7C contains goods with 7 categories, including Computer accessories & hardware , Office supplies , Food , Household goods , Mobile phone accessories , Electrical appliances , and Bag . Figure 9(c) -(d) show two sample users with diverse interests on Ali Shop-7C. For example, sample user 1 tends to buy Computer accessories & hardware , Office supplies and Household goods , while sample user 2 like to buy Food , Household goods and Bag . Like ML 20M, each dimensions is related to its own group. However, unlike in ML 20M (where each dimension is related to one group), in Ali Shop-7C each group is usually associated with more than one dimensions. The main reason is that category information in Ali Shop-7C is high-level primary category and contains more categories that can be subdivided. For example, Food includes sea food, fruit, meat and etc. It can be seen that the factor-to-group mapping matrix is able to demonstrate user s preference explicitly. This interpretable result benefits from the factor-to-group sparse mapping so that each latent factor is associated with distinct mode. b). Disentangled Representation Inspired by Kim and Mnih (2018), the disentanglement ability of the latent representations can be quantitatively assessed. Specifically, by fixing a particular latent feature u (1 u d), we can generate the user-item interaction information (xi) with randomly varying other features. Then the empirical variance of each feature is computed according to the normalized generated data. The feature with lowest variance and target feature u will be taken as two samples and fed into a majority-vote classifier to check the disentanglement scores of u. The evaluation metric Independence (Ma et al., 2019) is adopted to investigate the relationship (independence) between any pair of latent features, which is defined as Independence = 1 2 d(d 1) 1 i