# Learning Disentangled Representations for Recommendation

Jianxin Ma1,2, Chang Zhou1, Peng Cui2, Hongxia Yang1, Wenwu Zhu2
1Alibaba Group, 2Tsinghua University
majx13fromthu@gmail.com, ericzhou.zc@alibaba-inc.com, cuip@tsinghua.edu.cn, yang.yhx@alibaba-inc.com, wwzhu@tsinghua.edu.cn
Equal contribution. Work done at Alibaba.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

## Abstract

User behavior data in recommender systems are driven by the complex interactions of many latent factors behind the users' decision-making processes. The factors are highly entangled, and may range from high-level ones that govern user intentions to low-level ones that characterize a user's preference when executing an intention. Learning representations that uncover and disentangle these latent factors can bring enhanced robustness, interpretability, and controllability. However, learning such disentangled representations from user behavior is challenging, and remains largely neglected by the existing literature. In this paper, we present the MACRo-mIcro Disentangled Variational Auto-Encoder (MacridVAE) for learning disentangled representations from user behavior. Our approach achieves macro disentanglement by inferring the high-level concepts associated with user intentions (e.g., to buy a shirt or a cellphone), while capturing the preference of a user regarding the different concepts separately. A micro-disentanglement regularizer, stemming from an information-theoretic interpretation of VAEs, then forces each dimension of the representations to independently reflect an isolated low-level factor (e.g., the size or the color of a shirt). Empirical results show that our approach can achieve substantial improvement over the state-of-the-art baselines. We further demonstrate that the learned representations are interpretable and controllable, which can potentially lead to a new paradigm for recommendation where users are given fine-grained control over targeted aspects of the recommendation lists.

## 1 Introduction

Learning representations that reflect users' preference, based chiefly on user behavior, has been a central theme of research on recommender systems. Despite their notable success, the existing user-behavior-based representation learning methods, such as the recent deep approaches [49, 32, 31, 52, 11, 18], generally neglect the complex interaction among the latent factors behind the users' decision-making processes. In particular, the latent factors can be highly entangled, and range from macro ones that govern the intention of a user during a session, to micro ones that describe at a granular level a user's preference when implementing a specific intention. The existing methods fail to disentangle the latent factors, and the learned representations are consequently prone to mistakenly preserve the confounding of the factors, leading to non-robustness and low interpretability.

Disentangled representation learning, which aims to learn factorized representations that uncover and disentangle the latent explanatory factors hidden in the observed data [3], has recently gained much attention. Not only can disentangled representations be more robust, i.e., less sensitive to the misleading correlations present in the limited training data, but the enhanced interpretability also finds
direct application in recommendation-related tasks, such as transparent advertising [33], customer-relationship management, and explainable recommendation [51, 17]. Moreover, the controllability exhibited by many disentangled representations [19, 14, 10, 8, 9, 25] can potentially bring a new paradigm for recommendation, by giving users explicit control over the recommendation results and providing a more interactive experience.

Figure 1: Our framework. Macro disentanglement is achieved by learning a set of prototypes, based on which the user intention associated with each item is inferred, and then capturing the preference of a user about the different intentions separately. Micro disentanglement is achieved by magnifying the KL divergence, from which a term that penalizes total correlation can be separated, with a factor of β.

However, the existing efforts on disentangled representation learning are mainly from the field of computer vision [28, 15, 20, 30, 53, 14, 10, 39, 19]. Learning disentangled representations based on user behavior data, a kind of discrete relational data that is fundamentally different from the well-researched image data, is challenging and largely unexplored. Specifically, it poses two challenges. First, the co-existence of macro and micro factors requires us to separate the two levels when performing disentanglement, in a way that preserves the hierarchical relationships between an intention and the preference about the intention. Second, the observed user behavior data, e.g., user-item interactions, are discrete and sparse in nature, while the learned representations are continuous. This implies that the majority of the points in the high-dimensional representation space will not be associated with any behavior, which is especially problematic when one attempts to investigate the interpretability of an isolated dimension by varying the value of the dimension while keeping the other dimensions fixed.

In this paper, we propose the MACRo-mIcro Disentangled Variational Auto-Encoder (MacridVAE) for learning disentangled representations based on user behavior. Our approach explicitly models the separation of macro and micro factors, and performs disentanglement at each level. Macro disentanglement is achieved by identifying the high-level concepts associated with user intentions, and separately learning the preference of a user regarding the different concepts. A regularizer for micro disentanglement, derived by interpreting VAEs [27, 44] from an information-theoretic perspective, is then strengthened so as to force each individual dimension to reflect an independent micro factor. A beam-search strategy, which handles the conflict between sparse discrete observations and dense continuous representations by finding a smooth trajectory, is then proposed for investigating the interpretability of each isolated dimension. Empirical results show that our approach can achieve substantial improvement over the state-of-the-art baselines. The learned disentangled representations are also demonstrated to be interpretable and controllable.

## 2 Proposed Approach

In this section, we present our approach for learning disentangled representations from user behavior.

### 2.1 Notations and Problem Formulation

A user behavior dataset D consists of the interactions between N users and M items. The interaction between the u-th user and the i-th item is denoted by $x_{u,i} \in \{0, 1\}$, where $x_{u,i} = 1$ indicates that user u explicitly adopts item i, whereas $x_{u,i} = 0$ means there is no recorded interaction between the two.
For convenience, we use $x_u = \{x_{u,i} : x_{u,i} = 1\}$ to represent the items adopted by user u. The goal is to learn user representations $\{z_u\}_{u=1}^{N}$ that achieve both macro and micro disentanglement. We use θ to denote the set that contains all the trainable parameters of our model.

**Macro disentanglement**  Users may have very diverse interests, and interact with items that belong to many high-level concepts, e.g., product categories. We aim to achieve macro disentanglement by learning a factorized representation of user u, namely $z_u = [z_u^{(1)}; z_u^{(2)}; \ldots; z_u^{(K)}] \in \mathbb{R}^{d'}$, where $d' = Kd$, assuming that there are K high-level concepts. The k-th component $z_u^{(k)} \in \mathbb{R}^{d}$ is for capturing the user's preference regarding the k-th concept. Additionally, we infer a set of one-hot vectors $C = \{c_i\}_{i=1}^{M}$ for the items, where $c_i = [c_{i,1}; c_{i,2}; \ldots; c_{i,K}]$. If item i belongs to concept k, then $c_{i,k} = 1$ and $c_{i,k'} = 0$ for any $k' \neq k$. We jointly infer $\{z_u\}_{u=1}^{N}$ and C in an unsupervised manner.

**Micro disentanglement**  High-level concepts correspond to the intentions of a user, e.g., to buy clothes or a cellphone. We are also interested in disentangling a user's preference at a more granular level regarding the various aspects of an item. For example, we would like the different dimensions of $z_u^{(k)}$ to individually capture the user's preferred sizes, colors, etc., if concept k is clothing.

We start by proposing a generative model that encourages macro disentanglement. For a user u, our generative model assumes that the observed data are generated from the following distribution:

$$p_\theta(x_u) = \mathbb{E}_{p_\theta(C)}\left[ \int p_\theta(x_u \mid z_u, C)\, p_\theta(z_u)\, \mathrm{d}z_u \right], \qquad (1)$$

$$p_\theta(x_u \mid z_u, C) = \prod_{x_{u,i} \in x_u} p_\theta(x_{u,i} \mid z_u, C). \qquad (2)$$

The meanings of $x_u$, $z_u$, and C are described in the previous subsection. We have assumed $p_\theta(z_u) = p_\theta(z_u \mid C)$ in the first equation, i.e., $z_u$ and C are generated by two independent sources. Note that $c_i = [c_{i,1}; c_{i,2}; \ldots; c_{i,K}]$ is one-hot, since we assume that item i belongs to exactly one concept. Moreover,

$$p_\theta(x_{u,i} \mid z_u, C) = Z_u^{-1} \sum_{k=1}^{K} c_{i,k}\, g_\theta^{(i)}\big(z_u^{(k)}\big)$$

is a categorical distribution over the M items, where $Z_u = \sum_{i=1}^{M} \sum_{k=1}^{K} c_{i,k}\, g_\theta^{(i)}(z_u^{(k)})$ and $g_\theta^{(i)} : \mathbb{R}^{d} \to \mathbb{R}_{+}$ is a shallow neural network that estimates how much a user with a given preference is interested in item i. We use sampled softmax [23] to estimate $Z_u$ based on a few sampled items when M is very large.

**Macro disentanglement**  We assume above that the user representation $z_u$ is sufficient for predicting how the user will interact with the items. We further assume that using the k-th component $z_u^{(k)}$ alone is already sufficient if the prediction is about an item from concept k. This design explicitly encourages $z_u^{(k)}$ to capture preference regarding only the k-th concept, as long as the inferred concept assignment matrix C is meaningful. We will describe later the implementation details of $p_\theta(C)$, $p_\theta(z_u)$, and $g_\theta^{(i)}(z_u^{(k)})$. Nevertheless, we note that $p_\theta(C)$ requires careful design to prevent mode collapse, i.e., the degenerate case where almost all items are assigned to a single concept.

**Variational inference**  We follow the variational auto-encoder (VAE) paradigm [27, 44], and optimize θ by maximizing a lower bound of $\sum_u \ln p_\theta(x_u)$, where $\ln p_\theta(x_u)$ is bounded as follows:

$$\ln p_\theta(x_u) \ge \mathbb{E}_{p_\theta(C)}\Big[ \mathbb{E}_{q_\theta(z_u \mid x_u, C)}\big[\ln p_\theta(x_u \mid z_u, C)\big] - D_{\mathrm{KL}}\big(q_\theta(z_u \mid x_u, C) \,\|\, p_\theta(z_u)\big) \Big]. \qquad (3)$$

See the supplementary material for the derivation of the lower bound. Here we have introduced a variational distribution $q_\theta(z_u \mid x_u, C)$, whose implementation also encourages macro disentanglement and will be presented later. The two expectations, i.e., $\mathbb{E}_{p_\theta(C)}[\cdot]$ and $\mathbb{E}_{q_\theta(z_u \mid x_u, C)}[\cdot]$, are intractable, and are therefore estimated using the Gumbel-Softmax trick [22, 41] and the Gaussian re-parameterization trick [27], respectively. Once the training procedure is finished, we use the mode of $p_\theta(C)$ as C, and the mode of $q_\theta(z_u \mid x_u, C)$ as the representation of user u.
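For concreteness, the following minimal PyTorch sketch shows how these two stochastic estimators are commonly implemented; the tensor shapes and variable names are illustrative assumptions rather than the paper's released code.

```python
# Minimal sketch (PyTorch) of the two estimators mentioned above; shapes and
# names are illustrative assumptions, not the released implementation.
import torch
import torch.nn.functional as F

def sample_concepts(logits):
    """Gumbel-Softmax trick: draw (approximately) one-hot concept assignments C.

    logits: (M, K) unnormalized log-probabilities of assigning item i to concept k
    (they would come from the concept-assignment scores defined later in Section 2.3).
    hard=True returns one-hot samples in the forward pass while gradients flow
    through the underlying soft sample (straight-through estimator).
    """
    return F.gumbel_softmax(logits, tau=1.0, hard=True)  # (M, K)

def sample_user_component(mu, sigma):
    """Gaussian re-parameterization: z = mu + sigma * eps with eps ~ N(0, I)."""
    return mu + sigma * torch.randn_like(sigma)

# Toy usage with random numbers.
M, K, d = 500, 7, 100
C = sample_concepts(torch.randn(M, K))
z_k = sample_user_component(torch.zeros(d), 0.1 * torch.ones(d))
```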
**Micro disentanglement**  A natural strategy to encourage micro disentanglement is to force statistical independence between the dimensions, i.e., to force $q_\theta(z_u^{(k)} \mid C) \approx \prod_{j=1}^{d} q_\theta(z_{u,j}^{(k)} \mid C)$, so that each dimension describes an isolated factor. Here $q_\theta(z_u \mid C) = \int q_\theta(z_u \mid x_u, C)\, p_{\mathrm{data}}(x_u)\, \mathrm{d}x_u$. Fortunately, the Kullback-Leibler (KL) divergence term in the lower bound above does provide a way to encourage independence. Specifically, the KL term of our model can be rewritten as:

$$\mathbb{E}_{p_{\mathrm{data}}(x_u)}\Big[ D_{\mathrm{KL}}\big(q_\theta(z_u \mid x_u, C) \,\|\, p_\theta(z_u)\big) \Big] = I_q(x_u; z_u) + D_{\mathrm{KL}}\big(q_\theta(z_u \mid C) \,\|\, p_\theta(z_u)\big). \qquad (4)$$

See the supplementary material for the proof. Similar decompositions of the KL term have been noted previously for the original VAEs [1, 25, 9]. Penalizing the latter KL term would encourage independence between the dimensions, if we choose a prior that satisfies $p_\theta(z_u) = \prod_{j=1}^{d'} p_\theta(z_{u,j})$. On the other hand, the former term $I_q(x_u; z_u)$ is the mutual information between $x_u$ and $z_u$ under $q_\theta(z_u \mid x_u, C)\, p_{\mathrm{data}}(x_u)$. Penalizing $I_q(x_u; z_u)$ is equivalent to applying the information bottleneck principle [47, 2], which encourages $z_u$ to ignore as much noise in the input as it can and to focus on merely the essential information. We therefore follow β-VAE [19], and strengthen these two regularization terms by a factor of β ≥ 1, which brings us to the following training objective:

$$\mathbb{E}_{p_\theta(C)}\Big[ \mathbb{E}_{q_\theta(z_u \mid x_u, C)}\big[\ln p_\theta(x_u \mid z_u, C)\big] - \beta\, D_{\mathrm{KL}}\big(q_\theta(z_u \mid x_u, C) \,\|\, p_\theta(z_u)\big) \Big]. \qquad (5)$$

### 2.3 Implementation

In this section, we describe the implementation of $p_\theta(C)$, $p_\theta(x_{u,i} \mid z_u, C)$ (the decoder), $p_\theta(z_u)$ (the prior), and $q_\theta(z_u \mid x_u, C)$ (the encoder), and propose an efficient strategy to combat mode collapse. The parameters θ of our implementation include: K concept prototypes $\{m_k\}_{k=1}^{K} \in \mathbb{R}^{K \times d}$, M item representations $\{h_i\}_{i=1}^{M} \in \mathbb{R}^{M \times d}$ used by the decoder, M context representations $\{t_i\}_{i=1}^{M} \in \mathbb{R}^{M \times d}$ used by the encoder, and the parameters of a neural network $f_{\mathrm{nn}} : \mathbb{R}^{d} \to \mathbb{R}^{2d}$. We optimize θ to maximize the training objective (see Equation 5) using Adam [26].

**Prototype-based concept assignment**  A straightforward approach would be to assume $p_\theta(C) = \prod_{i=1}^{M} p(c_i)$ and parameterize each categorical distribution $p(c_i)$ with its own set of K − 1 free parameters. This approach, however, would result in over-parameterization and low sample efficiency. We instead propose a prototype-based implementation. To be specific, we introduce K concept prototypes $\{m_k\}_{k=1}^{K}$ and reuse the item representations $\{h_i\}_{i=1}^{M}$ from the decoder. We then assume $c_i$ is a one-hot vector drawn from the following categorical distribution $p_\theta(c_i)$:

$$c_i \sim \mathrm{Categorical}\big( \mathrm{softmax}([s_{i,1}; s_{i,2}; \ldots; s_{i,K}]) \big), \quad s_{i,k} = \cos(h_i, m_k)/\tau, \qquad (6)$$

where $\cos(a, b) = a^\top b / (\|a\|_2 \|b\|_2)$ is the cosine similarity, and τ is a hyper-parameter that scales the similarity from $[-1, 1]$ to $[-\frac{1}{\tau}, \frac{1}{\tau}]$. We set τ = 0.1 to obtain a more skewed distribution.
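To make the effect of τ concrete, the NumPy sketch below evaluates the assignment distribution of Equation 6 on toy data; the variable names and values are illustrative assumptions rather than the paper's code.

```python
# Sketch (NumPy) of the prototype-based assignment in Eq. (6).
# h: item representations (M, d), m: concept prototypes (K, d) -- toy values here.
import numpy as np

def concept_probs(h, m, tau=0.1):
    """softmax(cosine(h_i, m_k) / tau) for every item i."""
    h_n = h / np.linalg.norm(h, axis=1, keepdims=True)
    m_n = m / np.linalg.norm(m, axis=1, keepdims=True)
    s = h_n @ m_n.T / tau                      # (M, K) scaled cosine similarities
    s -= s.max(axis=1, keepdims=True)          # subtract the row max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
h, m = rng.normal(size=(5, 16)), rng.normal(size=(3, 16))
print(concept_probs(h, m, tau=0.1))   # rows are close to one-hot
print(concept_probs(h, m, tau=1.0))   # rows are much flatter
```

A small τ (e.g., 0.1) makes each row of the assignment distribution close to one-hot, which is what the Gumbel-Softmax samples then follow.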
**Preventing mode collapse**  We use cosine similarity, instead of the inner product similarity adopted by most existing deep learning methods [32, 31, 18]. This choice is crucial for preventing mode collapse. In fact, with inner product, the majority of the items are highly likely to be assigned to a single concept $m_k$ that has an extremely large norm, i.e., $\|m_k\|_2 \to \infty$, even when the items $\{h_i\}_{i=1}^{M}$ correctly form K clusters in the high-dimensional Euclidean space. We observe empirically that this phenomenon does occur frequently with inner product (see Figure 2e). In contrast, cosine similarity avoids this degenerate case due to the normalization. Moreover, cosine similarity is related to the Euclidean distance on the unit hypersphere, and the Euclidean distance is a proper metric that is more suitable for inferring the cluster structure than inner product.

**Decoder**  The decoder predicts which item out of the M ones is most likely to be clicked by a user, given the user's representation $z_u = [z_u^{(1)}; z_u^{(2)}; \ldots; z_u^{(K)}]$ and the one-hot concept assignments $\{c_i\}_{i=1}^{M}$. We assume that $p_\theta(x_{u,i} \mid z_u, C) \propto \sum_{k=1}^{K} c_{i,k}\, g_\theta^{(i)}(z_u^{(k)})$ is a categorical distribution over the M items, and define $g_\theta^{(i)}(z_u^{(k)}) = \exp\big( \cos(z_u^{(k)}, h_i)/\tau \big)$. This design implies that $\{h_i\}_{i=1}^{M}$ will be micro-disentangled if $\{z_u^{(k)}\}_{u=1}^{N}$ is micro-disentangled, as the dimensions of the two are aligned.

**Prior & Encoder**  The prior $p_\theta(z_u)$ needs to be factorized in order to achieve micro disentanglement. We therefore set $p_\theta(z_u)$ to $\mathcal{N}(0, \sigma_0^2 I)$. The encoder $q_\theta(z_u \mid x_u, C)$ computes the representation of a user given the user's behavior data $x_u$. The encoder maintains an additional set of context representations $\{t_i\}_{i=1}^{M}$, rather than reusing the item representations $\{h_i\}_{i=1}^{M}$ from the decoder, which is a common practice in the literature [32]. We assume $q_\theta(z_u \mid x_u, C) = \prod_{k=1}^{K} q_\theta(z_u^{(k)} \mid x_u, C)$, and represent each $q_\theta(z_u^{(k)} \mid x_u, C)$ as a multivariate normal distribution with a diagonal covariance matrix, $\mathcal{N}\big(\mu_u^{(k)}, [\mathrm{diag}(\sigma_u^{(k)})]^2\big)$, where the mean and the standard deviation are parameterized by a neural network $f_{\mathrm{nn}} : \mathbb{R}^{d} \to \mathbb{R}^{2d}$:

$$\big(a_u^{(k)}, b_u^{(k)}\big) = f_{\mathrm{nn}}\left( \frac{\sum_{i:\, x_{u,i}=1} c_{i,k}\, t_i}{\sqrt{\sum_{i:\, x_{u,i}=1} c_{i,k}^2}} \right), \quad \mu_u^{(k)} = \frac{a_u^{(k)}}{\|a_u^{(k)}\|_2}, \quad \sigma_u^{(k)} = \sigma_0 \exp\!\left(-\tfrac{1}{2}\, b_u^{(k)}\right).$$

The neural network $f_{\mathrm{nn}}(\cdot)$ captures nonlinearity, and is shared across the K components. We normalize the mean so as to be consistent with the use of cosine similarity, which projects the representations onto a unit hypersphere. Note that $\sigma_0$ should be set to a small value, e.g., around 0.1, since the learned representations are now normalized.
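A minimal PyTorch sketch of this encoder for a single concept k is given below; the two-layer form of $f_{\mathrm{nn}}$, the layer sizes, and the exponential variance head are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch (PyTorch) of the encoder for one concept k; the two-layer
# f_nn and all sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptEncoder(nn.Module):
    def __init__(self, d, sigma0=0.1):
        super().__init__()
        self.f_nn = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 2 * d))
        self.sigma0 = sigma0

    def forward(self, x, c_k, t):
        """x: (N, M) binary clicks, c_k: (M,) assignments to concept k, t: (M, d) context reps."""
        w = x * c_k                                   # keep only clicks on concept-k items
        pooled = (w @ t) / (w.pow(2).sum(1, keepdim=True).sqrt() + 1e-12)
        a, b = self.f_nn(pooled).chunk(2, dim=-1)
        mu = F.normalize(a, dim=-1)                   # project the mean onto the unit hypersphere
        sigma = self.sigma0 * torch.exp(-0.5 * b)     # assumed variance head, matching the text above
        return mu, sigma

# Toy usage with random data.
enc = ConceptEncoder(d=16)
x = (torch.rand(4, 50) < 0.1).float()
mu, sigma = enc(x, torch.randint(0, 2, (50,)).float(), torch.randn(50, 16))
```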
### 2.4 User-Controllable Recommendation

The controllability enabled by the disentangled representations can bring a new paradigm for recommendation. It allows a user to interactively search for items that are similar to an initial item except for some controlled aspects, or to explicitly adjust the disentangled representation of his/her preference, learned by the system from his/her past behaviors, so that it actually matches the current preference. Here we formalize the task of user-controllable recommendation, and illustrate a possible solution.

**Task definition**  Let $h^\star \in \mathbb{R}^{d}$ be the representation to be altered, which can be initialized as either an item representation or a component of a user representation. The task is to gradually alter its j-th dimension $h^\star_j$, while retrieving items whose representations are similar to the altered representation. This task is nontrivial, since usually no item will have exactly the same representation as the altered one, especially when we want the transition to be smooth, monotonic, and thus human-understandable.

**Solution**  Here we illustrate our approach to this task. We first probe the suitable range (a, b) for $h^\star_j$. Let us assume that prototype $k^\star$ is the prototype closest to $h^\star$. The range (a, b) is decided such that prototype $k^\star$ remains the prototype closest to $h^\star$ if and only if $h^\star_j \in (a, b)$. We can decide each endpoint of the range using binary search. We then divide the range (a, b) into B subranges, $a = a_0 < a_1 < a_2 < \cdots < a_B = b$. We ensure that the subranges contain roughly the same number of items from concept $k^\star$ when dividing (a, b). Finally, we aim to retrieve B items $\{i_t\}_{t=1}^{B} \in \{1, 2, \ldots, M\}^B$ that belong to concept $k^\star$, each from one of the B subranges, i.e., $h_{i_t, j} \in (a_{t-1}, a_t]$. We thus decide the B items by maximizing

$$\sum_{t=1}^{B} \cos\big(h_{i_t, \neg j},\, h^\star_{\neg j}\big) \;+\; \gamma \sum_{1 \le t < t' \le B} \cos\big(h_{i_t, \neg j},\, h_{i_{t'}, \neg j}\big),$$

where $h_{i, \neg j} = [h_{i,1}; h_{i,2}; \ldots; h_{i,j-1}; h_{i,j+1}; \ldots; h_{i,d}] \in \mathbb{R}^{d-1}$ and γ is a hyper-parameter. We approximately solve this maximization problem sequentially using beam search [36]. Intuitively, selecting items from the B subranges ensures that the items change monotonically in terms of the j-th dimension. On the other hand, the first term in the maximization problem forces the retrieved items to be similar to the initial item in terms of the dimensions other than j, while the second term encourages any two retrieved items to be similar in terms of the dimensions other than j.
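The sketch below illustrates this retrieval procedure in NumPy with a greedy pass (beam width 1) instead of full beam search; all names and toy values are illustrative assumptions rather than the paper's code.

```python
# Simplified sketch (NumPy) of the retrieval step: pick one item per subrange of
# dimension j while staying close to the anchor on the remaining dimensions.
# A greedy pass (beam width 1) replaces full beam search for brevity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def controllable_list(h, h_star, j, bounds, gamma=0.5):
    """h: (M, d) item reps; h_star: (d,) anchor; bounds: B+1 increasing edges for dimension j."""
    keep = np.arange(h.shape[1]) != j          # the dimensions other than j
    picked = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        cand = np.where((h[:, j] > lo) & (h[:, j] <= hi))[0]
        if cand.size == 0:
            continue
        def score(i):
            s = cosine(h[i, keep], h_star[keep])
            if picked:                          # similarity to the items already retrieved
                s += gamma * np.mean([cosine(h[i, keep], h[p, keep]) for p in picked])
            return s
        picked.append(int(cand[np.argmax([score(i) for i in cand])]))
    return picked

rng = np.random.default_rng(0)
h, h_star = rng.normal(size=(200, 8)), rng.normal(size=8)
edges = np.quantile(h[:, 3], np.linspace(0, 1, 6))   # five subranges of dimension 3
print(controllable_list(h, h_star, j=3, bounds=edges))
```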
## 3 Empirical Results

### 3.1 Experimental Setup

**Datasets**  We conduct our experiments on five real-world datasets. Specifically, we use the large-scale Netflix Prize dataset [4], and three MovieLens datasets of different scales (i.e., ML-100k, ML-1M, and ML-20M) [16]. We follow MultVAE [32], and binarize these four datasets by keeping ratings of four or higher while only keeping users who have watched at least five movies. We additionally collect a dataset, named AliShop-7C (the dataset and our code are available at https://jianxinma.github.io/disentangle-recsys.html), from Alibaba's e-commerce platform Taobao. AliShop-7C contains user-item interactions associated with items from seven categories, as well as item attributes such as titles and images. Every user in this dataset clicks items from at least two categories. The category labels are used for evaluation only, and not for training.

**Baselines**  We compare our approach with MultDAE [32] and β-MultVAE [32], the two state-of-the-art methods for collaborative filtering. In particular, β-MultVAE is similar to β-VAE [19], and has a hyper-parameter β that controls the strength of disentanglement. However, β-MultVAE does not learn disentangled representations, because it requires β < 1 to perform well.

Table 1: Collaborative filtering. All methods are constrained to have around 2Md parameters, where M is the number of items and d is the dimension of each item representation. We set d = 100.

| Dataset | Method | NDCG@100 | Recall@20 | Recall@50 |
| --- | --- | --- | --- | --- |
| AliShop-7C | MultDAE | 0.23923 (±0.00380) | 0.15242 (±0.00305) | 0.24892 (±0.00391) |
| AliShop-7C | β-MultVAE | 0.23875 (±0.00379) | 0.15040 (±0.00302) | 0.24589 (±0.00387) |
| AliShop-7C | Ours | 0.29148 (±0.00380) | 0.18616 (±0.00317) | 0.30256 (±0.00397) |
| ML-100k | MultDAE | 0.24487 (±0.02738) | 0.23794 (±0.03605) | 0.32279 (±0.04070) |
| ML-100k | β-MultVAE | 0.27484 (±0.02883) | 0.24838 (±0.03294) | 0.35270 (±0.03927) |
| ML-100k | Ours | 0.28895 (±0.02739) | 0.30951 (±0.03808) | 0.41309 (±0.04503) |
| ML-1M | MultDAE | 0.40453 (±0.00799) | 0.34382 (±0.00961) | 0.46781 (±0.01032) |
| ML-1M | β-MultVAE | 0.40555 (±0.00809) | 0.33960 (±0.00919) | 0.45825 (±0.01039) |
| ML-1M | Ours | 0.42740 (±0.00789) | 0.36046 (±0.00947) | 0.49039 (±0.01029) |
| ML-20M | MultDAE | 0.41900 (±0.00209) | 0.39169 (±0.00271) | 0.53054 (±0.00285) |
| ML-20M | β-MultVAE | 0.41113 (±0.00212) | 0.38263 (±0.00273) | 0.51975 (±0.00289) |
| ML-20M | Ours | 0.42496 (±0.00212) | 0.39649 (±0.00271) | 0.52901 (±0.00284) |
| Netflix | MultDAE | 0.37450 (±0.00095) | 0.33982 (±0.00123) | 0.43247 (±0.00126) |
| Netflix | β-MultVAE | 0.36291 (±0.00094) | 0.32792 (±0.00122) | 0.41960 (±0.00125) |
| Netflix | Ours | 0.37987 (±0.00096) | 0.34587 (±0.00124) | 0.43478 (±0.00125) |

**Hyper-parameters**  We constrain the number of learnable parameters to be around 2Md for each method so as to ensure a fair comparison, which is equivalent to using d-dimensional representations for the M items. Note that all the methods under investigation use two sets of item representations, and we do not constrain the dimension of user representations since the user representations are not parameters. We set d = 100 unless otherwise specified. We fix τ to 0.1. We tune the other hyper-parameters of both our approach and the baselines automatically using the TPE method [6] implemented in Hyperopt [5].

### 3.2 Recommendation Performance

We evaluate the performance of our approach on the task of collaborative filtering for implicit feedback datasets [21], one of the most common settings for recommendation. We strictly follow the experiment protocol established by the previous work [32], and use the same preprocessing procedure as well as evaluation metrics. The results on the five datasets are listed in Table 1.

We observe that our approach outperforms the baselines significantly, especially on small, sparse datasets. The improvement is likely due to two desirable properties of our approach. Firstly, macro disentanglement not only allows us to accurately represent the diverse interests of a user using the different components, but also alleviates data sparsity by allowing a rarely visited item to borrow information from other items of the same category, which is the motivation behind many hierarchical methods [50, 38]. Secondly, as we will show in Section 3.4, the dimensions of the representations learned by our approach are highly disentangled, i.e., independent, thanks to the micro-disentanglement regularizer, which leads to more robust performance.

### 3.3 Macro Disentanglement

We visualize the high-dimensional representations learned by our approach on AliShop-7C in order to qualitatively examine to which degree our approach can achieve macro disentanglement. Specifically, we set K to seven, i.e., the number of ground-truth categories, when training our model. We visualize the item representations and the user representations together using t-SNE [40], where we treat the K components of a user as K individual points and keep only the two components that have the highest confidence levels. The confidence of component k is defined as $\sum_{i:\, x_{u,i} > 0} c_{i,k}$, where $c_{i,k}$ is the value inferred by our model, rather than the ground truth. The results are shown in Figure 2.
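A small sketch (NumPy + scikit-learn) of this visualization setup, i.e., computing per-component confidences, keeping the top two components per user, and embedding them together with the items via t-SNE; the array names and toy data are illustrative assumptions.

```python
# Sketch of the visualization setup described above; all names are illustrative.
import numpy as np
from sklearn.manifold import TSNE

def top2_user_components(z, c, x):
    """z: (N, K, d) user components, c: (M, K) inferred one-hot assignments,
    x: (N, M) binary interactions.  Returns the (2N, d) kept components."""
    confidence = x @ c                                  # (N, K): sum of c_{i,k} over clicked items
    top2 = np.argsort(-confidence, axis=1)[:, :2]       # the two most confident components per user
    kept = np.stack([z[np.arange(len(z)), top2[:, j]] for j in (0, 1)], axis=1)
    return kept.reshape(-1, z.shape[-1])

rng = np.random.default_rng(0)
N, M, K, d = 200, 300, 7, 16
z = rng.normal(size=(N, K, d))                          # toy user components
c = np.eye(K)[rng.integers(0, K, M)]                    # toy one-hot concept assignments
x = (rng.random((N, M)) < 0.05).astype(float)           # toy binary interactions
points = np.concatenate([rng.normal(size=(M, d)), top2_user_components(z, c, x)])
emb2d = TSNE(n_components=2, init="random", perplexity=30).fit_transform(points)
```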
**Interpretability**  Figure 2c, which shows the clusters inferred based on the prototypes, is rather similar to Figure 2d, which shows the ground-truth categories, despite the fact that our model is trained without the ground-truth category labels. This demonstrates that our approach is able to discover and disentangle the macro structures underlying the user behavior data in an interpretable way. Moreover, the components of the user representations are near the correct cluster centers (see Figure 2a and Figure 2b), and are hence likely capturing the users' separate preferences for different categories.

Figure 2: The discovered clusters of items (see Figure 2c), learned in an unsupervised manner, align well with the ground-truth categories (see Figure 2d, where the color order is chosen such that the connections between the ground-truth categories and the learned clusters are easy to verify). Figure 2e highlights the importance of using cosine similarity, rather than inner product, to combat mode collapse. (a) Items and users; item i is colored according to $\arg\max_k c_{i,k}$, i.e., the inferred category, each component of a user is treated as an individual point, and the k-th component is colored according to k. (b) Users only, colored in the same way as Figure 2a. (c) Items only, colored in the same way as Figure 2a. (d) Items only, colored according to their ground-truth categories. (e) Items, obtained by training a new model that uses inner product instead of cosine, colored according to the value of $\arg\max_k c_{i,k}$.

Figure 3: Starting from an item representation, we gradually alter the value of a target dimension, and list the items that have representations similar to the altered representations (see Subsection 2.4). (a) Bag size. (b) Bag color. (c) Styles of phone cases. (d) Bag size, the same dimension as Figure 3a. (e) Bag color, the same dimension as Figure 3b. (f) Chicken, beef, mutton, seafood.

**Cosine vs. inner product**  To highlight the necessity of using cosine similarity instead of the more commonly used inner product similarity, we additionally train a new model that uses inner product in place of cosine, and visualize the learned item representations in Figure 2e. With inner product, the majority of the items are assigned to the same prototype (see Figure 2e). In comparison, all seven prototypes learned by the cosine-based model are assigned a significant number of items (see Figure 2c). This finding supports our claim that a proper metric space, such as the one implied by the cosine similarity, is important for preventing mode collapse.

### 3.4 Micro Disentanglement

Figure 4: Micro disentanglement vs. recommendation performance. Each panel plots performance (Recall@20) against disentanglement (uncorrelatedness), for item representations and for user representations, comparing Ours(100, 700), Ours(100, 100), β-MultVAE(100, 700), and β-MultVAE(100, 100). (d, d′) indicates d-dimensional item representations and d′-dimensional user representations. Note that d′ = Kd. We observe that (1) our approach outperforms the baselines in terms of both performance and micro disentanglement, and (2) macro disentanglement benefits micro disentanglement, as K = 7 is better than K = 1.

**Independence**  We vary the hyper-parameters related to micro disentanglement (β and σ0 for our approach, β for β-MultVAE), and plot in Figure 4 the relationship between the level of independence achieved and the recommendation performance. Each method is evaluated with 2,000 randomly
sampled configurations on ML-100k. We quantify the level of independence achieved by a set of d-dimensional representations using $\frac{2}{d(d-1)} \sum_{1 \le i < j \le d} \cdots$
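One plausible instantiation of such a pairwise uncorrelatedness score, averaging $1 - |\mathrm{corr}|$ over all dimension pairs, is sketched below; this is an illustrative assumption and not necessarily the exact estimator used for Figure 4.

```python
# One plausible pairwise uncorrelatedness score (NumPy); an illustrative
# assumption, not necessarily the exact estimator behind Figure 4.
import numpy as np

def uncorrelatedness(Z):
    """Z: (n_points, d) representations.  Returns the mean of 1 - |corr| over dimension pairs."""
    corr = np.corrcoef(Z, rowvar=False)        # (d, d) Pearson correlations between dimensions
    iu = np.triu_indices(corr.shape[0], k=1)   # the d*(d-1)/2 pairs with i < j
    return float(np.mean(1.0 - np.abs(corr[iu])))

print(uncorrelatedness(np.random.default_rng(0).normal(size=(1000, 8))))  # close to 1 for random data
```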