# Bayesian Deep Collaborative Matrix Factorization

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Teng Xiao,1,2 Shangsong Liang,1,2 Weizhou Shen,1,2 Zaiqiao Meng1,2
1School of Data and Computer Science, Sun Yat-sen University, China
2Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China
{cstengxiao, liangshangsong}@gmail.com, shenwzh3@mail2.sysu.edu.cn, zqmeng@aliyun.com
Shangsong Liang is the corresponding author.

## Abstract

In this paper, we propose a Bayesian Deep Collaborative Matrix Factorization (BDCMF) algorithm for collaborative filtering (CF). BDCMF is a novel Bayesian deep generative model that learns user and item latent vectors from users' social interactions, item contents as auxiliary information, and the user-item rating (feedback) matrix. It alleviates the matrix sparsity problem by incorporating items' auxiliary and users' social information into the model, and it learns more robust and dense latent representations by integrating deep learning into a Bayesian probabilistic framework. As a deep generative model, it enjoys both non-linearity and a Bayesian nature. In BDCMF, we first derive an efficient EM-style point estimation algorithm for parameter learning. To further improve recommendation performance, we also derive a full Bayesian posterior estimation algorithm for inference. Experiments conducted on two sparse datasets show that BDCMF significantly outperforms state-of-the-art CF methods.

## Introduction

Social media such as Facebook and Twitter has greatly influenced people's daily lives, and building effective Recommender Systems (RS) for it has therefore attracted much interest in recent years. The most widely used technique in RS is collaborative filtering (CF), which makes use of users' ratings over items. Matrix factorization (MF) (Mnih and Salakhutdinov 2008) is one of the most commonly used CF methods due to its effectiveness and scalability. The goal of MF is to learn latent factors of both users and items from the user-item feedback matrix. However, traditional MF methods suffer from the matrix sparsity problem: recommendation performance drops dramatically when the user-item feedback matrix is very sparse. In addition, traditional MF methods assume users are independent and identically distributed, ignoring the social connections among users. To alleviate the matrix sparsity problem, many social MF methods such as those in (Park, Kim, and Choi 2013; Adams, Dahl, and Murray 2014) incorporate auxiliary information, e.g., content information of items, into MF. However, these methods incorporate auxiliary information as a linear regularization term, so learning latent representations of users and items becomes ineffective when the auxiliary information is itself sparse. Aiming at learning more effective representations, many CF-based models using different types of auxiliary information have been proposed to obtain items' latent content representations, based on, e.g., latent Dirichlet allocation (LDA) (Wang and Blei 2011), the stacked denoising autoencoder (SDAE) (Wang, Wang, and Yeung 2015; Zhang et al. 2016), the variational autoencoder (Li and She 2017) and the marginalized denoising autoencoder (Li, Kawale, and Fu 2015). Nevertheless, all these models ignore social connections among users.
In order to jointly utilize item contents and social network information, some hybrid MF methods have been proposed, which can be roughly classified into two categories: Collaborative Topic Regression (CTR)-based methods (Wang and Blei 2011; Chen et al. 2014; Purushotham, Liu, and Kuo 2012; Wang, Chen, and Li 2013; Xu et al. 2017) and Collaborative Deep Learning (CDL)-based methods (Wang, Wang, and Yeung 2015; Zhang et al. 2016; Nguyen and Lauw 2017). Although CTR-based and CDL-based methods have achieved strong performance, full Bayesian posterior estimation is intractable in these models: all of them maximize a point estimate for parameter learning. In this paper, we point out that maximizing such a point estimate actually assumes, from a variational Bayesian perspective, that the variational distributions are discrete (discussed later), which leads to higher variance in prediction and ignores uncertainty in the model parameters (Lim and Teh 2007). This estimation also leads to overly optimistic estimates of test-set log-likelihood (Welling, Chemudugunta, and Sutter 2008) and easily overfits the observed data, as shown in our experiments. These drawbacks limit the ability of CTR-based and CDL-based methods to learn good latent representations from sparse datasets.

To tackle the above problems, we aim to build a full Bayesian deep framework that jointly learns latent factors of users and items from three data sources: the item content, user-item feedback and user social matrices. We propose a Bayesian Deep Collaborative Matrix Factorization model, abbreviated as BDCMF, which integrates item contents and user social information into probabilistic matrix factorization (PMF). Different from CTR-based and CDL-based methods, which use LDA and the SDAE to capture items' latent content vectors, BDCMF is a fully deep generative model that considers the various item content information to be generated from items' latent content factors through a generative network. In addition, with both item content and social information, BDCMF can effectively tackle the matrix sparsity problem. To infer the latent factors of users and items in BDCMF, we first propose a novel variational EM algorithm from a Bayesian point estimation perspective. Given the Bayesian nature of BDCMF and the drawbacks of Bayesian point estimation, we also propose a Bayesian posterior estimation algorithm to infer the posterior distributions of users' and items' latent factors. To sum up, our main contributions are:

- We propose a novel deep generative model called BDCMF, which jointly learns latent factors of users and items from content, feedback and social matrices.
- We derive an efficient parallel variational EM-style algorithm from a Bayesian point estimation perspective to infer latent factors of users and items.
- To obtain higher-quality latent factors, we derive a full Bayesian variational posterior estimation algorithm that infers the posterior distributions of the latent factors.
- Comprehensive experiments conducted on two large real-world datasets show that BDCMF significantly outperforms state-of-the-art hybrid MF methods for CF.

## Related Work

Matrix factorization (MF) is one of the most widely used approaches in collaborative filtering (CF) and in other applications such as rank aggregation (Liang et al. 2014; 2018a). Since traditional MF suffers from the matrix sparsity problem, previous work has explored items' content information to improve performance.
Ma et al. (2008) proposed SoRec, which incorporates social information into probabilistic MF. Wang and Blei (2011) proposed CTR, which uses LDA to capture text topics and integrates it into probabilistic MF. Wang, Wang, and Yeung (2015) proposed CDL, which incorporates the SDAE (Stacked Denoising Autoencoder) into the MF framework. To consider both user social information and item content information, many CTR-based and CDL-based methods have been proposed. Purushotham, Liu, and Kuo (2012) proposed CTR-SMF, which combines CTR with social matrix factorization to improve performance. In C-CTR-SMF2 (Chen et al. 2014), social context information was integrated into CTR. de Souza da Silva, Langseth, and Ramampiaro (2017) proposed PoissonMF-CS, which uses Poisson matrix factorization to model users' preferences and items' contents. Ren et al. (2017) proposed sCVR to predict item ratings based on user opinions and social relations. Recently, some works have used neural networks to model the latent factors of users and items, owing to their non-linear modelling abilities; these include NeuMF (Neural Matrix Factorization) (He et al. 2017) and CDAE (Collaborative Denoising Autoencoder) (Wu et al. 2016). However, their stacked neural network structures make them difficult to train and incur high computational cost. In contrast, our BDCMF incorporates a neural network into a Bayesian generative framework, which gives it non-linear modelling abilities while remaining effective to train.

## Problem Definition and Notations

Let $\mathbf{R} \in \{0, 1\}^{N \times M}$ be a user-item matrix, where $N$ and $M$ are the numbers of users and items, respectively. $R_{ij} = 1$ denotes that the implicit feedback from user $i$ over item $j$ is observed, and $R_{ij} = 0$ otherwise. We use $\mathbf{R}_{\cdot j}$ to represent the $j$-th column of $\mathbf{R}$. Let $\mathbf{S} = \{S_{ik}\} \in \{0, 1\}^{N \times N}$ denote the social matrix of a social network graph $G$, where $S_{ik} = 1$ if user $i$ is associated with user $k$ and $S_{ik} = 0$ otherwise. Let $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M] \in \mathbb{R}^{L \times M}$ represent the item content matrix, where $L$ denotes the dimension of the content vector $\mathbf{x}_j$, and $\mathbf{x}_j$ is the content information of item $j$. For example, if item $j$ is a product or a piece of music, the content $\mathbf{x}_j$ can be the bag-of-words representation of its tags. We use $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_N] \in \mathbb{R}^{D \times N}$ and $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M] \in \mathbb{R}^{D \times M}$ to denote the user and item latent matrices, respectively, where $D$ denotes the latent dimension. $\mathbf{G} = [\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_N] \in \mathbb{R}^{D \times N}$ denotes the social factor latent matrix, and $\mathbf{I}_D$ represents the identity matrix of dimension $D$. The problem we aim to address is: given a user-item matrix $\mathbf{R}$, item content matrix $\mathbf{X}$ and social matrix $\mathbf{S}$, infer the user latent matrix $\mathbf{U}$, social factor latent matrix $\mathbf{G}$ and item latent matrix $\mathbf{V}$ such that a missing value $R_{ij}$ in $\mathbf{R}$ can be effectively predicted.
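For concreteness, a toy setup of the three observed matrices with the shapes above might look as follows; the sizes, densities, and the symmetry of S are our illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L, D = 4, 6, 10, 3          # toy sizes: users, items, content dim, latent dim

# Binary implicit-feedback matrix R (N x M): R[i, j] = 1 iff the interaction is observed.
R = (rng.random((N, M)) < 0.2).astype(float)

# Binary social matrix S (N x N): S[i, k] = 1 iff users i and k are connected
# (made symmetric here purely for illustration).
S = np.triu((rng.random((N, N)) < 0.3).astype(float), k=1)
S = S + S.T

# Item content matrix X (L x M): column j is the bag-of-words tag vector of item j.
X = (rng.random((L, M)) < 0.4).astype(float)

print(R.shape, S.shape, X.shape)  # (4, 6) (4, 4) (10, 6)
```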
## Bayesian Deep Collaborative Matrix Factorization

In this section, we propose the Bayesian Deep Collaborative Matrix Factorization (BDCMF) model for recommendation, the goal of which is to infer the user latent matrix U, social factor latent matrix G, and item latent matrix V given the item content matrix X, user social matrix S, and feedback matrix R.

### The Proposed Model

Similar to (Ma et al. 2008), we let the user-item and social matrices, R and S, share the same user latent matrix U, such that R is factorized into the user and item latent matrices, U and V, via probabilistic matrix factorization (PMF) (Mnih and Salakhutdinov 2008), and S is factorized into the user latent and social factor latent matrices, U and G. As for items' content information, it can be complex and varied, and we do not know its real distribution. However, any distribution can be generated by mapping a simple Gaussian through a sufficiently complicated function (Doersch 2016). In our proposed model, we therefore consider item contents to be generated from their latent content vectors through a generative network. The generative process of BDCMF is:

1. For each user $i$, draw the user latent vector $\mathbf{u}_i \sim \mathcal{N}(\mathbf{0}, \lambda_u^{-1} \mathbf{I}_D)$.
2. For each social factor $k$, draw its latent vector, i.e., the social latent vector $\mathbf{g}_k \sim \mathcal{N}(\mathbf{0}, \lambda_g^{-1} \mathbf{I}_D)$.
3. For each item $j$:
   (a) Draw the item content latent vector $\mathbf{z}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_D)$.
   (b) Draw the item content vector $\mathbf{x}_j \sim p_\theta(\mathbf{x}_j|\mathbf{z}_j)$.
   (c) Draw the item latent offset $\mathbf{k}_j \sim \mathcal{N}(\mathbf{0}, \lambda_v^{-1} \mathbf{I}_D)$ and set the item latent vector as $\mathbf{v}_j = \mathbf{z}_j + \mathbf{k}_j$.
4. For each user-item pair $(i, j)$ in $\mathbf{R}$, draw $R_{ij}$:
$$R_{ij} \sim \mathcal{N}(\mathbf{u}_i^\top \mathbf{v}_j, \, c_{ij}^{-1}). \quad (1)$$
5. For each user-social factor pair $(i, k)$ in $\mathbf{S}$, draw $S_{ik}$:
$$S_{ik} \sim \mathcal{N}(\mathbf{u}_i^\top \mathbf{g}_k, \, \lambda_q^{-1} e_{ik}^{-1}). \quad (2)$$

In this process, $\lambda_v$, $\lambda_u$, $\lambda_g$ and $\lambda_q$ are free parameters; in particular, $\lambda_q$ trades off the social matrix against the user-item matrix in the user latent factor (discussed later). Following (Wang and Blei 2011; Wang, Wang, and Yeung 2015), $c_{ij}$ in Eq. 1 and $e_{ik}$ in Eq. 2 serve as confidence parameters for $R_{ij}$ and $S_{ik}$, respectively:

$$c_{ij} = \begin{cases} \phi_1 & \text{if } R_{ij} = 1, \\ \phi_2 & \text{if } R_{ij} = 0, \end{cases} \quad (3) \qquad e_{ik} = \begin{cases} \phi_3 & \text{if } S_{ik} = 1, \\ \phi_4 & \text{if } S_{ik} = 0, \end{cases} \quad (4)$$

where $\phi_1 > \phi_2 > 0$ and $\phi_3 > \phi_4 > 0$ are free parameters. Following (Wang, Wang, and Yeung 2015; Purushotham, Liu, and Kuo 2012), we set $\phi_1 = \phi_3 = 1$ and $\phi_2 = \phi_4 = 0.1$. $p_\theta(\mathbf{x}_j|\mathbf{z}_j)$ models the item content information: $\mathbf{x}_j$ is generated from the latent content vector $\mathbf{z}_j$ through a generative neural network parameterized by $\theta$. Note that the specific form of $p_\theta(\mathbf{x}_j|\mathbf{z}_j)$ depends on the type of the item content vector. For instance, if $\mathbf{x}_j$ is a binary vector, $p_\theta(\mathbf{x}_j|\mathbf{z}_j)$ can be a multivariate Bernoulli distribution $\mathrm{Ber}(F_\theta(\mathbf{z}_j))$, with $F_\theta(\mathbf{z}_j)$ a highly non-linear function parameterized by $\theta$. According to the graphical model in Fig. 1, the joint probability of R, S, X, U, V, G, Z can be written as:

$$p(\mathbf{O}, \mathcal{Z}) = \prod_{i=1}^{N} \prod_{j=1}^{M} \prod_{k=1}^{N} p(\mathbf{O}_{ijk}, \mathcal{Z}_{ijk}) = \prod_{i=1}^{N} \prod_{j=1}^{M} \prod_{k=1}^{N} p(\mathbf{z}_j) p(\mathbf{g}_k) p(\mathbf{u}_i) p_\theta(\mathbf{x}_j|\mathbf{z}_j) p(\mathbf{v}_j|\mathbf{z}_j) p(R_{ij}|\mathbf{u}_i, \mathbf{v}_j) p(S_{ik}|\mathbf{u}_i, \mathbf{g}_k), \quad (5)$$

where $\mathbf{O} = \{\mathbf{R}, \mathbf{S}, \mathbf{X}\}$ is the set of all observed variables, $\mathcal{Z} = \{\mathbf{U}, \mathbf{V}, \mathbf{G}, \mathbf{Z}\}$ is the set of all latent variables to be inferred, and we write $\mathbf{O}_{ijk} = \{R_{ij}, S_{ik}, \mathbf{x}_j\}$ and $\mathcal{Z}_{ijk} = \{\mathbf{u}_i, \mathbf{v}_j, \mathbf{g}_k, \mathbf{z}_j\}$ for short.
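To make the generative process concrete, the following minimal numpy sketch samples a toy dataset from the model; the linear-plus-sigmoid stand-in for the generative network $F_\theta$, the unit confidence values, and all sizes are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, L, D = 4, 6, 10, 3
lam_u, lam_g, lam_v, lam_q = 0.1, 0.1, 1.0, 10.0   # free parameters (illustrative values)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# A stand-in for the generative network F_theta: one random linear layer + sigmoid.
W_theta = rng.normal(size=(L, D))

U = rng.normal(0.0, lam_u ** -0.5, size=(D, N))      # step 1: u_i ~ N(0, lam_u^{-1} I)
G = rng.normal(0.0, lam_g ** -0.5, size=(D, N))      # step 2: g_k ~ N(0, lam_g^{-1} I)
Z = rng.normal(0.0, 1.0, size=(D, M))                # step 3(a): z_j ~ N(0, I)
X = (rng.random((L, M)) < sigmoid(W_theta @ Z)).astype(float)  # 3(b): Bernoulli(F_theta(z_j))
V = Z + rng.normal(0.0, lam_v ** -0.5, size=(D, M))  # step 3(c): v_j = z_j + k_j

# steps 4-5: noisy observations around the inner products; the true confidences
# c_ij, e_ik depend on the observed R, S (Eqs. 3-4) and are fixed to 1 here.
R = rng.normal(U.T @ V, 1.0)
S = rng.normal(U.T @ G, (lam_q * 1.0) ** -0.5)
```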
### Bayesian Point Estimation

Previous works (Wang, Wang, and Yeung 2015; Wang and Blei 2011) have shown that an EM-style algorithm enables the recommendation methods built on it to obtain high-quality latent vectors (in our case, U and V). Inspired by these works, we first derive an EM-style algorithm, called BDCMF-1, from the view of Bayesian point estimation. The marginal log-likelihood can be bounded as:

$$\log p(\mathbf{O}) = \log \int p(\mathbf{O}, \mathcal{Z}) \, d\mathcal{Z} \geq \int q(\mathcal{Z}) \log p(\mathbf{O}, \mathcal{Z}) \, d\mathcal{Z} - \int q(\mathcal{Z}) \log q(\mathcal{Z}) \, d\mathcal{Z} \triangleq \mathcal{L}(q), \quad (6)$$

where we apply Jensen's inequality, and $q(\mathcal{Z})$ and $\mathcal{L}(q)$ are the variational distribution and the evidence lower bound (ELBO), respectively. We take the variational distribution $q(\mathcal{Z})$ to be matrix-wise independent:

$$q(\mathcal{Z}) = q(\mathbf{U}) q(\mathbf{V}) q(\mathbf{G}) q(\mathbf{Z}) = \prod_{i=1}^{N} q(\mathbf{u}_i) \prod_{j=1}^{M} q(\mathbf{v}_j) \prod_{k=1}^{N} q(\mathbf{g}_k) \prod_{j=1}^{M} q(\mathbf{z}_j). \quad (7)$$

For Bayesian point estimation, we assume the variational distribution of $\mathbf{u}_i$ is:

$$q(\mathbf{u}_i) = \prod_{d=1}^{D} \delta(U_{id} - \hat{U}_{id}), \quad (8)$$

where $\{\hat{U}_{id}\}_{d=1}^{D}$ are variational parameters and $\delta$ is the Dirac delta function. The variational distributions of $\mathbf{v}_j$ and $\mathbf{g}_k$ are defined similarly. When $U_{id}$ is discrete (the continuous case will be discussed in the next section), the entropy of $\mathbf{u}_i$ is:

$$H(\mathbf{u}_i) = -\int q(\mathbf{u}_i) \log q(\mathbf{u}_i) = -\sum_{d=1}^{D} \sum_{U_{id}} \delta(U_{id} - \hat{U}_{id}) \log \delta(U_{id} - \hat{U}_{id}) = 0. \quad (9)$$

Similarly, $H(\mathbf{v}_j)$ and $H(\mathbf{g}_k)$ are 0 when their elements are discrete. The evidence lower bound $\mathcal{L}(q)$ (Eq. 6) can then be written as:

$$\mathcal{L}_{\text{point}}(\hat{\mathbf{U}}, \hat{\mathbf{V}}, \hat{\mathbf{G}}, \theta, \phi) = \big\langle \log p(\mathbf{U}) p(\mathbf{G}) p(\mathbf{V}|\mathbf{Z}) p(\mathbf{X}|\mathbf{Z}) p(\mathbf{R}|\mathbf{U}, \mathbf{V}) p(\mathbf{S}|\mathbf{U}, \mathbf{G}) \big\rangle_q - \mathrm{KL}\big(q_\phi(\mathbf{Z}|\mathbf{X}) \,\|\, p(\mathbf{Z})\big), \quad (10)$$

where $\langle \cdot \rangle$ denotes the statistical expectation with respect to the corresponding variational distribution, and $\hat{\mathbf{U}} = \{\hat{U}_{id}\}$, $\hat{\mathbf{V}} = \{\hat{V}_{jd}\}$ and $\hat{\mathbf{G}} = \{\hat{G}_{kd}\}$ are the variational parameters of $q(\mathbf{U})$, $q(\mathbf{V})$ and $q(\mathbf{G})$, respectively. For the latent variables Z, we use a generative network to represent the distribution $p_\theta(\mathbf{X}|\mathbf{Z})$. As discussed in (Kingma and Welling 2014), the traditional variational EM algorithm (Neal and Hinton 1998) cannot tractably infer the posterior of Z. Consequently, as in the VAE (Kingma and Welling 2014), we introduce a variational distribution $q_\phi(\mathbf{Z}|\mathbf{X}, \mathbf{R})$ to approximate the true posterior $p(\mathbf{Z}|\mathbf{O})$; it is implemented by an inference neural network parameterized by $\phi$ (see Fig. 1). Specifically, for $\mathbf{z}_j$ we have:

$$q(\mathbf{z}_j) = q_\phi(\mathbf{z}_j|\mathbf{x}_j, \mathbf{R}_{\cdot j}) = \mathcal{N}(\boldsymbol{\mu}_j, \mathrm{diag}(\boldsymbol{\delta}_j^2)), \quad (11)$$

where the mean $\boldsymbol{\mu}_j$ and variance $\boldsymbol{\delta}_j$ are outputs of the inference network.

Figure 1: Graphical model of our BDCMF. Solid and dashed lines represent the generative and inference processes, respectively. Shaded nodes represent observed variables.

Directly maximizing the ELBO (Eq. 10) involves jointly solving for the parameters $\hat{\mathbf{U}}$, $\hat{\mathbf{V}}$, $\hat{\mathbf{G}}$, $\theta$ and $\phi$, which is intractable. We therefore derive an iterative variational EM (VEM) algorithm to maximize $\mathcal{L}_{\text{point}}(\hat{\mathbf{U}}, \hat{\mathbf{V}}, \hat{\mathbf{G}}, \theta, \phi)$, abbreviated $\mathcal{L}_{\text{point}}$.

**Variational E-step.** We first keep $\theta$ and $\phi$ fixed and optimize $\mathcal{L}_{\text{point}}$ with respect to $\hat{\mathbf{U}}$, $\hat{\mathbf{V}}$ and $\hat{\mathbf{G}}$. The updating rules for $\hat{\mathbf{u}}_i$, $\hat{\mathbf{v}}_j$ and $\hat{\mathbf{g}}_k$ are (we omit the derivations due to space limitations):

$$\hat{\mathbf{u}}_i \leftarrow (\hat{\mathbf{V}} \mathbf{C}_i \hat{\mathbf{V}}^\top + \lambda_q \hat{\mathbf{G}} \mathbf{E}_i \hat{\mathbf{G}}^\top + \lambda_u \mathbf{I}_D)^{-1} (\hat{\mathbf{V}} \mathbf{C}_i \mathbf{R}_i + \lambda_q \hat{\mathbf{G}} \mathbf{E}_i \mathbf{S}_i), \quad (12)$$
$$\hat{\mathbf{v}}_j \leftarrow (\hat{\mathbf{U}} \mathbf{C}_j \hat{\mathbf{U}}^\top + \lambda_v \mathbf{I}_D)^{-1} (\hat{\mathbf{U}} \mathbf{C}_j \mathbf{R}_j + \lambda_v \langle \mathbf{z}_j \rangle), \quad (13)$$
$$\hat{\mathbf{g}}_k \leftarrow (\lambda_q \hat{\mathbf{U}} \mathbf{E}_k \hat{\mathbf{U}}^\top + \lambda_g \mathbf{I}_D)^{-1} (\lambda_q \hat{\mathbf{U}} \mathbf{E}_k \mathbf{S}_k), \quad (14)$$

where $\mathbf{C}_i = \mathrm{diag}(c_{i1}, \ldots, c_{iM})$, $\mathbf{E}_i = \mathrm{diag}(e_{i1}, \ldots, e_{iN})$, $\mathbf{R}_i = [R_{i1}, \ldots, R_{iM}]^\top$ and $\mathbf{S}_i = [S_{i1}, \ldots, S_{iN}]^\top$. For the item latent vector $\mathbf{v}_j$ and social latent vector $\mathbf{g}_k$, $\mathbf{C}_j$, $\mathbf{R}_j$, $\mathbf{E}_k$ and $\mathbf{S}_k$ are defined similarly, and $\hat{\mathbf{u}}_i = [\hat{U}_{i1}, \ldots, \hat{U}_{iD}]^\top$, $\hat{\mathbf{v}}_j = [\hat{V}_{j1}, \ldots, \hat{V}_{jD}]^\top$ and $\hat{\mathbf{g}}_k = [\hat{G}_{k1}, \ldots, \hat{G}_{kD}]^\top$. For $\mathbf{z}_j$, its expectation is $\langle \mathbf{z}_j \rangle = \boldsymbol{\mu}_j$, the output of the inference network.

Note that $\lambda_v$ governs how much the latent content vector $\mathbf{z}_j$ affects the item latent vector $\mathbf{v}_j$: if $\lambda_v = \infty$, we use the latent content vector directly as the item latent vector; if $\lambda_v = 0$, we embed no item content information into the item latent vector. Likewise, $\lambda_q$ balances the social network matrix against the user-item matrix in the user latent vector $\mathbf{u}_i$: if $\lambda_q = \infty$, we use only the social network information to model a user's preference; if $\lambda_q = 0$, we use only the user-item matrix and item content information for prediction. Thus, $\lambda_v$ and $\lambda_q$ act as collaborative parameters among the item content, feedback and social matrices.
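Each E-step update is a ridge-regression-style linear solve. The following numpy sketch of Eqs. 12-14 is our illustration (the function name and dense formulation are ours); $\langle \mathbf{z}_j \rangle$ is taken to be the inference-network mean $\boldsymbol{\mu}_j$:

```python
import numpy as np

def e_step(U_hat, V_hat, G_hat, R, S, mu, lam_u, lam_g, lam_v, lam_q,
           phi=(1.0, 0.1)):
    """One pass of the point-estimate E-step (Eqs. 12-14); factor matrices are D x N / D x M."""
    D, N = U_hat.shape
    M = V_hat.shape[1]
    I = np.eye(D)
    C = np.where(R == 1, phi[0], phi[1])     # confidence c_ij (Eq. 3)
    E = np.where(S == 1, phi[0], phi[1])     # confidence e_ik (Eq. 4)

    for i in range(N):                       # Eq. 12
        A = (V_hat * C[i]) @ V_hat.T + lam_q * (G_hat * E[i]) @ G_hat.T + lam_u * I
        b = (V_hat * C[i]) @ R[i] + lam_q * (G_hat * E[i]) @ S[i]
        U_hat[:, i] = np.linalg.solve(A, b)
    for j in range(M):                       # Eq. 13
        A = (U_hat * C[:, j]) @ U_hat.T + lam_v * I
        b = (U_hat * C[:, j]) @ R[:, j] + lam_v * mu[:, j]
        V_hat[:, j] = np.linalg.solve(A, b)
    for k in range(N):                       # Eq. 14
        A = lam_q * (U_hat * E[:, k]) @ U_hat.T + lam_g * I
        b = lam_q * (U_hat * E[:, k]) @ S[:, k]
        G_hat[:, k] = np.linalg.solve(A, b)
    return U_hat, V_hat, G_hat
```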
**Variational M-step.** Keeping $\hat{\mathbf{U}}$, $\hat{\mathbf{V}}$ and $\hat{\mathbf{G}}$ fixed, we optimize $\mathcal{L}_{\text{point}}$ w.r.t. $\phi$ and $\theta$ (we only keep the terms containing $\phi$ and $\theta$):

$$\mathcal{L}_{\text{point}} = \text{constant} + \sum_{j=1}^{M} \mathcal{L}(\theta, \phi; \mathbf{x}_j, \mathbf{v}_j, \mathbf{R}_{\cdot j}) = \text{constant} + \sum_{j=1}^{M} \Big( -\frac{\lambda_v}{2} \big\langle (\mathbf{v}_j - \mathbf{z}_j)^\top (\mathbf{v}_j - \mathbf{z}_j) \big\rangle_{q(\mathbf{Z})} + \big\langle \log p_\theta(\mathbf{x}_j|\mathbf{z}_j) \big\rangle_{q_\phi(\mathbf{z}_j|\mathbf{x}_j,\mathbf{R}_{\cdot j})} - \mathrm{KL}\big(q_\phi(\mathbf{z}_j|\mathbf{x}_j, \mathbf{R}_{\cdot j}) \,\|\, p(\mathbf{z}_j)\big) \Big), \quad (15)$$

where $M$ is the number of items and the constant collects the terms that contain neither $\theta$ nor $\phi$. The expectation $\langle \log p_\theta(\mathbf{x}_j|\mathbf{z}_j) \rangle_{q_\phi(\mathbf{z}_j|\mathbf{x}_j,\mathbf{R}_{\cdot j})}$ cannot be solved analytically. To handle this problem, we approximate it by Monte Carlo sampling:

$$\big\langle \log p_\theta(\mathbf{x}_j|\mathbf{z}_j) \big\rangle_{q_\phi(\mathbf{z}_j|\mathbf{x}_j,\mathbf{R}_{\cdot j})} \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(\mathbf{x}_j|\mathbf{z}_j^l), \quad (16)$$

where $L$ is the number of samples and $\mathbf{z}_j^l$ denotes the $l$-th sample, reparameterized as $\mathbf{z}_j^l = \boldsymbol{\mu}_j + \boldsymbol{\epsilon}_j^l \odot \boldsymbol{\delta}_j$. Here $\boldsymbol{\epsilon}_j^l$ is drawn from $\mathcal{N}(\mathbf{0}, \mathbf{I}_D)$ and $\odot$ denotes element-wise multiplication. Using this reparameterization trick and Eq. 11, $\mathcal{L}(\theta, \phi; \mathbf{x}_j, \mathbf{v}_j, \mathbf{R}_{\cdot j})$ in Eq. 15 can be estimated by:

$$\mathcal{L}(\theta, \phi; \mathbf{x}_j, \mathbf{v}_j, \mathbf{R}_{\cdot j}) \approx \mathcal{L}_j(\theta, \phi) = -\frac{\lambda_v}{2} \big( -2 \boldsymbol{\mu}_j^\top \hat{\mathbf{v}}_j + \boldsymbol{\mu}_j^\top \boldsymbol{\mu}_j + \mathrm{tr}(\mathrm{diag}(\boldsymbol{\delta}_j^2)) \big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(\mathbf{x}_j|\mathbf{z}_j^l) - \mathrm{KL}\big(q_\phi(\mathbf{z}_j|\mathbf{x}_j, \mathbf{R}_{\cdot j}) \,\|\, p(\mathbf{z}_j)\big) + \text{constant}. \quad (17)$$

We can then construct a minibatch estimator of $\mathcal{L}_{\text{point}}(\theta, \phi; \mathbf{X}, \mathbf{V}, \mathbf{R})$:

$$\mathcal{L}_{\text{point}}(\theta, \phi) \approx \mathcal{L}^P(\theta, \phi) = \frac{M}{P} \sum_{p=1}^{P} \mathcal{L}_p(\theta, \phi). \quad (18)$$

As discussed in (Kingma and Welling 2014), the number of samples $L$ per item $j$ can be set to 1 as long as the minibatch size $P$ is large enough, e.g., $P = 100$. We update $\theta$ and $\phi$ using the gradient $\nabla_{\theta, \phi} \mathcal{L}^P(\theta, \phi)$, and we iterate the updates of U, V, G, $\theta$ and $\phi$ until convergence. Our Bayesian point estimation procedure is summarized in Algorithm 1.

### Bayesian Posterior Estimation

Above, we assumed $\hat{\mathbf{U}}$, $\hat{\mathbf{V}}$ and $\hat{\mathbf{G}}$ are discrete. When these variables are continuous, as discussed in previous work (Welling, Chemudugunta, and Sutter 2008), Bayesian point estimation leads to overly optimistic estimates of test-set log-likelihood. This is exactly the problem that well-known recommendation models using point estimation, such as the CTR-based and CDL-based models (Wang, Wang, and Yeung 2015; Wang and Blei 2011; Purushotham, Liu, and Kuo 2012; Chen et al. 2014; Zhang et al. 2016), suffer from. In addition, point estimators lead to higher variance in prediction and ignore uncertainty in the model parameters. Unlike CTR-based and CDL-based methods, BDCMF can easily be estimated by a full Bayesian posterior estimator. Here we derive this Bayesian posterior estimation algorithm, called BDCMF-2. We again assume the variational distributions are matrix-wise independent:

$$q(\mathcal{Z}) = q(\mathbf{U}) q(\mathbf{V}) q(\mathbf{G}) q(\mathbf{Z}) = \prod_{i=1}^{N} q(\mathbf{u}_i) \prod_{j=1}^{M} q(\mathbf{v}_j) \prod_{k=1}^{N} q(\mathbf{g}_k) \prod_{j=1}^{M} q(\mathbf{z}_j). \quad (19)$$

As in Bayesian point estimation, we introduce a variational distribution $q_\phi(\mathbf{Z}|\mathbf{X}, \mathbf{R})$, implemented by an inference neural network parameterized by $\phi$ (see Fig. 1), to approximate the true posterior $p(\mathbf{Z}|\mathbf{O})$. For $\mathbf{z}_j$:

$$q(\mathbf{z}_j) = q_\phi(\mathbf{z}_j|\mathbf{x}_j, \mathbf{R}_{\cdot j}) = \mathcal{N}(\boldsymbol{\mu}_j, \mathrm{diag}(\boldsymbol{\delta}_j^2)). \quad (20)$$

Note that we do not assume specific forms for the variational distributions $q(\mathbf{u}_i)$, $q(\mathbf{v}_j)$ and $q(\mathbf{g}_k)$ in Bayesian posterior estimation. Our goal is again to maximize the evidence lower bound $\mathcal{L}(q)$ (Eq. 6), and the algorithm again alternates a variational E-step and M-step.

**Variational E-step.** We fix $\theta$ and $\phi$ and optimize the ELBO (Eq. 6) w.r.t. $\mathbf{u}_i$, $\mathbf{v}_j$ and $\mathbf{g}_k$. Taking the derivative of the bound w.r.t. $q(\mathbf{u}_i)$ and setting it to zero gives:

$$q(\mathbf{u}_i) \propto \exp \big\langle \log p(\mathbf{O}, \mathcal{Z}) \big\rangle_{q(\mathcal{Z} \setminus \mathbf{u}_i)} = \mathcal{N}(\langle \mathbf{u}_i \rangle, \boldsymbol{\Lambda}_i^u), \quad (21)$$

where the expectation is taken with respect to the variational distributions over all latent variables except $\mathbf{u}_i$. The updating rules for U are therefore:

$$q(\mathbf{U}) = \prod_{i=1}^{N} \mathcal{N}(\langle \mathbf{u}_i \rangle, \boldsymbol{\Lambda}_i^u), \text{ where} \quad (22)$$
$$\boldsymbol{\Lambda}_i^u \leftarrow \Big( \langle \mathbf{V} \rangle \mathbf{C}_i \langle \mathbf{V} \rangle^\top + \sum_{j=1}^{M} c_{ij} \boldsymbol{\Lambda}_j^v + \lambda_q \big( \langle \mathbf{G} \rangle \mathbf{E}_i \langle \mathbf{G} \rangle^\top + \sum_{k=1}^{N} e_{ik} \boldsymbol{\Lambda}_k^g \big) + \lambda_u \mathbf{I}_D \Big)^{-1}, \quad (23)$$
$$\langle \mathbf{u}_i \rangle \leftarrow \boldsymbol{\Lambda}_i^u \big( \langle \mathbf{V} \rangle \mathbf{C}_i \mathbf{R}_i + \lambda_q \langle \mathbf{G} \rangle \mathbf{E}_i \mathbf{S}_i \big). \quad (24)$$

Similarly, the updating rules for V and G are:

$$q(\mathbf{V}) = \prod_{j=1}^{M} \mathcal{N}(\langle \mathbf{v}_j \rangle, \boldsymbol{\Lambda}_j^v), \text{ where} \quad (25)$$
$$\boldsymbol{\Lambda}_j^v \leftarrow \Big( \langle \mathbf{U} \rangle \mathbf{C}_j \langle \mathbf{U} \rangle^\top + \sum_{i=1}^{N} c_{ij} \boldsymbol{\Lambda}_i^u + \lambda_v \mathbf{I}_D \Big)^{-1}, \quad (26)$$
$$\langle \mathbf{v}_j \rangle \leftarrow \boldsymbol{\Lambda}_j^v \big( \langle \mathbf{U} \rangle \mathbf{C}_j \mathbf{R}_j + \lambda_v \langle \mathbf{z}_j \rangle \big), \quad (27)$$
$$q(\mathbf{G}) = \prod_{k=1}^{N} \mathcal{N}(\langle \mathbf{g}_k \rangle, \boldsymbol{\Lambda}_k^g), \text{ where} \quad (28)$$
$$\boldsymbol{\Lambda}_k^g \leftarrow \Big( \lambda_q \big( \langle \mathbf{U} \rangle \mathbf{E}_k \langle \mathbf{U} \rangle^\top + \sum_{i=1}^{N} e_{ik} \boldsymbol{\Lambda}_i^u \big) + \lambda_g \mathbf{I}_D \Big)^{-1}, \quad (29)$$
$$\langle \mathbf{g}_k \rangle \leftarrow \boldsymbol{\Lambda}_k^g \big( \lambda_q \langle \mathbf{U} \rangle \mathbf{E}_k \mathbf{S}_k \big). \quad (30)$$
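Compared with the point-estimate E-step, the posterior updates additionally propagate the covariances. As an illustration, a minimal numpy sketch of the user-factor update (Eqs. 22-24) might look as follows; the function name, dense loops, and data layout are our assumptions, not the authors' implementation:

```python
import numpy as np

def posterior_update_u(Vm, Vcov, Gm, Gcov, R, S, lam_u, lam_q, phi=(1.0, 0.1)):
    """Gaussian posterior update for all users (Eqs. 22-24).

    Vm, Gm: posterior means <V> (D x M) and <G> (D x N);
    Vcov, Gcov: posterior covariances, shapes (M, D, D) and (N, D, D).
    Returns posterior means Um (D x N) and covariances Ucov (N, D, D)."""
    D, N = Gm.shape[0], R.shape[0]
    C = np.where(R == 1, phi[0], phi[1])               # c_ij (Eq. 3)
    E = np.where(S == 1, phi[0], phi[1])               # e_ik (Eq. 4)
    Um = np.zeros((D, N))
    Ucov = np.zeros((N, D, D))
    for i in range(N):
        prec = ((Vm * C[i]) @ Vm.T + np.tensordot(C[i], Vcov, axes=1)
                + lam_q * ((Gm * E[i]) @ Gm.T + np.tensordot(E[i], Gcov, axes=1))
                + lam_u * np.eye(D))                    # Eq. 23 (precision = inverse covariance)
        Ucov[i] = np.linalg.inv(prec)
        Um[:, i] = Ucov[i] @ ((Vm * C[i]) @ R[i] + lam_q * (Gm * E[i]) @ S[i])  # Eq. 24
    return Um, Ucov
```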
**Variational M-step.** We now focus only on the parameters $\theta$ and $\phi$. The M-step is the same as in Bayesian point estimation: using the reparameterization trick as before, the objective function w.r.t. $\theta$ and $\phi$ is

$$\mathcal{L}(q) = \text{constant} + \sum_{j=1}^{M} \Big( -\frac{\lambda_v}{2} \big\langle (\mathbf{v}_j - \mathbf{z}_j)^\top (\mathbf{v}_j - \mathbf{z}_j) \big\rangle_{q(\mathcal{Z})} + \big\langle \log p_\theta(\mathbf{x}_j|\mathbf{z}_j) \big\rangle_{q_\phi(\mathbf{z}_j|\mathbf{x}_j,\mathbf{R}_{\cdot j})} - \mathrm{KL}\big(q_\phi(\mathbf{z}_j|\mathbf{x}_j, \mathbf{R}_{\cdot j}) \,\|\, p(\mathbf{z}_j)\big) \Big), \quad (31)$$

which we optimize exactly as in Bayesian point estimation.

Algorithm 1: BDCMF-1 inference algorithm.
Require: user-item matrix $\mathbf{R}$, item content matrix $\mathbf{X}$, parameters $\lambda_v$, $\lambda_u$, $\lambda_g$ and $\lambda_q$; $P = 100$ and $L = 1$.
1: Randomly initialize the variational parameters $\hat{\mathbf{u}}_i$, $\hat{\mathbf{v}}_j$, $\hat{\mathbf{g}}_k$ and the network parameters $\theta$ and $\phi$.
2: while not converged do
3:  for $i = 1$ to $N$ do
4:    update $\hat{\mathbf{u}}_i$ using Eq. 12.
5:  end for
6:  for $j = 1$ to $M$ do
7:    update $\hat{\mathbf{v}}_j$ using Eq. 13.
8:  end for
9:  for $k = 1$ to $N$ do
10:   update $\hat{\mathbf{g}}_k$ using Eq. 14.
11:  end for
12:  randomly draw $P$ triples $\{(\mathbf{x}_p, \mathbf{v}_p, \mathbf{R}_{\cdot p})\}_{p=1}^{P}$ from the dataset.
13:  $\boldsymbol{\epsilon} \leftarrow$ draw $P$ samples from $\mathcal{N}(\mathbf{0}, \mathbf{I}_D)$, one for each triple $(\mathbf{x}_p, \mathbf{v}_p, \mathbf{R}_{\cdot p})$.
14:  $\theta, \phi \leftarrow$ update $\theta, \phi$ using the gradient $\nabla_{\theta, \phi} \mathcal{L}^P(\theta, \phi)$.
15: end while

### Discussion

According to Algorithm 1, the time complexity of updating $\mathbf{g}_k$ is $O(ND^2 + D^3)$, where $N$ is the number of users and $D$ is the dimension of the latent space. The time complexity of updating $\mathbf{v}_j$ is $O(ND^2 + D^3 + LP^2)$, where $L$ here denotes the number of layers in the neural network and $P$ the average dimension of those layers. The time complexity of updating $\mathbf{u}_i$ is $O(MD^2 + ND^2 + D^3)$, where $M$ is the number of items. Because $\hat{\mathbf{V}} \mathbf{C}_i \hat{\mathbf{V}}^\top$ in Eq. 12 can be rewritten as $\hat{\mathbf{V}} (\mathbf{C}_i - \phi_2 \mathbf{I}) \hat{\mathbf{V}}^\top + \phi_2 \hat{\mathbf{V}} \hat{\mathbf{V}}^\top$, the complexity of such an update decreases from $O(ND^2 + D^3)$ to $O(N_0 D^2 + D^3)$, where $N_0$ is the number of observed ratings involved, which is relatively small in a sparse matrix; $\hat{\mathbf{U}} \mathbf{C}_j \hat{\mathbf{U}}^\top$ and $\hat{\mathbf{U}} \mathbf{E}_k \hat{\mathbf{U}}^\top$ can be rewritten similarly. In practice, the latent dimension $D$ is also relatively small (< 200). Therefore, our BDCMF can be very efficient for large datasets. Additionally, our model can flexibly handle different types of content in different scenarios by replacing the neural network architecture used to infer items' latent content representations with another one, such as a recurrent neural network.

### Prediction

After Algorithm 1 has converged, we predict a missing value $R_{ij}$ in $\mathbf{R}$ using the learned latent features $\mathbf{u}_i$ and $\mathbf{v}_j$:

$$R^*_{ij} = \langle R_{ij} \rangle = (\langle \mathbf{z}_j \rangle + \langle \mathbf{k}_j \rangle)^\top \langle \mathbf{u}_i \rangle = \langle \mathbf{v}_j \rangle^\top \langle \mathbf{u}_i \rangle. \quad (32)$$

For a new item that has not been rated by any user, the offset $\langle \mathbf{k}_j \rangle$ is zero, and we predict $R_{ij}$ by:

$$R^*_{ij} = \langle R_{ij} \rangle = \langle \mathbf{z}_j \rangle^\top \langle \mathbf{u}_i \rangle. \quad (33)$$
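In practice, Eq. 32 amounts to scoring every unseen item by an inner product and taking the top-K. A minimal sketch (the function name and masking convention are ours):

```python
import numpy as np

def recommend_top_k(U_hat, V_hat, R_train, k=20):
    """Rank unseen items for every user by the predicted score v_j^T u_i (Eq. 32)."""
    scores = U_hat.T @ V_hat                 # (N, M) predicted feedback
    scores[R_train == 1] = -np.inf           # never re-recommend observed items
    return np.argsort(-scores, axis=1)[:, :k]
```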
## Experiments

### Experimental Setup

**Research Questions.** The research questions guiding the remainder of the paper are: (RQ1) How does our proposed BDCMF-1 compare to state-of-the-art MF methods for CF on sparse matrices? (RQ2) How do different parameter settings (e.g., the social parameter λq and content parameter λv) affect BDCMF-1? (RQ3) Does Bayesian posterior estimation outperform Bayesian point estimation?

**Datasets.** To answer our research questions, we conduct experiments on two real-world datasets, Lastfm (lastfm-2k, http://www.lastfm.com) and Delicious (delicious-2k, http://www.delicious.com), collected by Brusilovsky et al. (2010). Both datasets contain user-item, user-user and user-tag-item relations; their statistics are given in Table 1.

Table 1: Statistics of the datasets.

| | lastfm-2k | delicious-2k |
|---|---|---|
| users | 1,892 | 1,876 |
| items | 17,632 | 69,226 |
| tags | 11,946 | 53,388 |
| user-item relations | 92,834 | 104,799 |
| user-user relations | 25,434 | 15,328 |
| user-tag-item relations | 186,479 | 437,593 |

We first transform the datasets into implicit feedback. For the Lastfm dataset, the user-item feedback is 1 if the user has listened to the artist (item), and 0 otherwise. Similarly, for the Delicious dataset, the feedback is 1 if the user has bookmarked a URL (item), and 0 otherwise. After this transformation, the Lastfm and Delicious datasets contain only 0.27% and 0.08% observed feedback, respectively, and are thus extremely sparse. We use items' bag-of-words tag representations as their content information.

**Baselines.** For fair comparison, like our BDCMF-1 and BDCMF-2 (in what follows, we use BDCMF to refer to both models), most of the baselines also incorporate user social information or item content information into matrix factorization.

MF-based methods: (1) PMF (Mnih and Salakhutdinov 2008), a well-known MF method that only uses the user-item feedback matrix. (2) SoRec (Ma et al. 2008), which jointly decomposes the user-user social matrix and the user-item feedback matrix to learn user and item latent representations.

Content-based methods: (3) Collaborative topic regression (CTR) (Wang and Blei 2011), which combines a topic model with matrix factorization to learn latent representations of users and items. (4) Collaborative deep learning (CDL) (Wang, Wang, and Yeung 2015), which uses a stacked denoising autoencoder to learn items' latent content representations and incorporates them into probabilistic matrix factorization.

Hybrid methods: (5) CTR-SMF (Purushotham, Liu, and Kuo 2012), which combines topic modeling and probabilistic MF of social networks. (6) PoissonMF-CS (de Souza da Silva, Langseth, and Ramampiaro 2017), which jointly models user social trust, item content and users' preferences in a Poisson matrix factorization framework; it is a state-of-the-art MF method for top-N recommendation on the Lastfm dataset.

Neural network-based method: (7) Neural Matrix Factorization (NeuMF) (He et al. 2017), a state-of-the-art CF method that uses a neural network to model the interaction between user and item features.

**Settings.** For fair comparison, we set the parameters for PMF, SoRec, CTR, CTR-SMF, CDL and NeuMF via five-fold cross-validation. For PMF, we set D, λu and λv to 50, 0.01 and 0.001, respectively. The parameter settings for SoRec are λc = 10 and λu = λv = λz = 0.001. CTR achieves its best performance with D = 50, λu = 0.1, λv = 100, a = 1 and b = 0.01. CTR-SMF yields its best performance with D = 75, λu = 0.1, λv = 100, λq = 100, a = 1 and b = 0.01. CDL performs best when we set a = 1, b = 0.01, D = 50, λu = 1, λv = 10, λn = 1000 and λw = 0.0001. For NeuMF, we set D = 50 and the size of the last hidden layer to 16. For PoissonMF-CS, we set λc = 0.1 and λs = 0.1. For our BDCMF, we set D = 50, with λv = 1, λq = 10 on the Lastfm dataset and λv = 0.1, λq = 10 on the Delicious dataset. To evaluate performance on extremely sparse matrices, we use 90% of each dataset to train the models and the remainder for testing. The layer dimensions of the BDCMF network are (L+N)-200-100(zj)-100-100-(L+N), where L and N are the dimensions of xj and R·j, respectively, and '-' separates consecutive layers.
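Under those stated layer sizes, a PyTorch sketch of the inference and generative networks might look as follows; the class name, the log-variance parameterization, and the choice to reconstruct the concatenated [x_j; R·j] (implied by the (L+N)-dimensional output layer) are our assumptions:

```python
import torch
import torch.nn as nn

class BDCMFNet(nn.Module):
    """Sketch of q_phi(z_j | x_j, R_.j) and the generative network p_theta."""
    def __init__(self, L, N, D=100):               # D = 100 matches the listed size of z_j
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(L + N, 200), nn.ReLU())   # (L+N)-200
        self.mu = nn.Linear(200, D)                 # mean of q_phi (Eq. 11)
        self.log_var = nn.Linear(200, D)            # log-variance (our parameterization)
        self.dec = nn.Sequential(                   # 100-100-(L+N) generative network
            nn.Linear(D, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, L + N), nn.Sigmoid())    # Bernoulli parameters for binary targets

    def forward(self, x, r):
        h = self.enc(torch.cat([x, r], dim=-1))
        mu, log_var = self.mu(h), self.log_var(h)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization trick (Eq. 16)
        return self.dec(z), mu, log_var
```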
**Evaluation Metrics.** The metrics we use are Recall@K, NDCG@K and MAP@K, which are common metrics for recommendation; their definitions can be found in (Croft, Metzler, and Strohman 2015; Liang and de Rijke 2016). We report the average performance over all users on each metric.

### Experimental Results and Discussions

**Overall Performance (RQ1).** Table 2 shows the performance of our BDCMF-1 and the baselines on the two datasets.

Table 2: Recommendation performance of BDCMF-1 and the baselines. The best and second-best scores per metric per dataset are marked in boldface and underlined, respectively.

Lastfm dataset:

| Method | Recall@20 | Recall@50 | NDCG@20 | MAP@20 |
|---|---|---|---|---|
| PMF | 0.0923 | 0.1328 | 0.0703 | 0.1083 |
| SoRec | 0.1088 | 0.1524 | 0.0721 | 0.1128 |
| CTR | 0.1192 | 0.1624 | 0.0799 | 0.1334 |
| CTR-SMF | 0.1232 | 0.1832 | 0.0823 | 0.1386 |
| CDL | 0.1346 | 0.2287 | 0.0928 | 0.1553 |
| NeuMF | <u>0.1517</u> | 0.2584 | 0.1036 | <u>0.1678</u> |
| PoissonMF-CS | 0.1482 | <u>0.2730</u> | <u>0.1089</u> | 0.1621 |
| BDCMF-1 (ours) | **0.1625** | **0.3026** | **0.1174** | **0.1701** |

Delicious dataset:

| Method | Recall@20 | Recall@50 | NDCG@20 | MAP@20 |
|---|---|---|---|---|
| PMF | 0.0272 | 0.0561 | 0.0286 | 0.0312 |
| SoRec | 0.0384 | 0.0892 | 0.0323 | 0.0389 |
| CTR | 0.0563 | 0.1342 | 0.0487 | 0.0581 |
| CTR-SMF | 0.0781 | 0.1564 | 0.0534 | 0.0588 |
| CDL | 0.0861 | 0.1897 | 0.0726 | 0.0792 |
| NeuMF | <u>0.1078</u> | <u>0.2267</u> | <u>0.0769</u> | <u>0.0928</u> |
| PoissonMF-CS | 0.1012 | 0.2123 | 0.0742 | 0.0863 |
| BDCMF-1 (ours) | **0.1121** | **0.2341** | **0.0826** | **0.1021** |

According to Table 2, we have the following findings: a) BDCMF-1 outperforms the baselines in terms of all metrics on Lastfm, which demonstrates the effectiveness of our approach to inferring the latent factors of users and items. b) On the sparser dataset, Delicious, BDCMF-1 also achieves the best performance, which demonstrates that our model can effectively handle the matrix sparsity problem. c) Both methods that use content and social information (BDCMF-1 and PoissonMF-CS) outperform the others (CDL, CTR, SoRec and PMF), which shows that incorporating content and social information can effectively alleviate the matrix sparsity problem. d) BDCMF-1 outperforms the state-of-the-art PoissonMF-CS, although both are Bayesian generative models; the reason is that BDCMF-1 incorporates a neural network into the Bayesian generative model, giving it powerful non-linearity for modelling items' latent content representations.

**Impact of Parameters (RQ2).** Figs. 2(a) and 2(b) show the contours of Recall@50 obtained by varying the content parameter λv and the social parameter λq on the Lastfm and Delicious datasets.

Figure 2: The contour of Recall@50 by varying λv and λq on the Lastfm (a) and Delicious (b) datasets.

As we can see, BDCMF-1 achieves its best recommendation performance with λv = 0.1 and λq = 10 on Lastfm. We also see that performance degrades when λv > 10: λv controls how much item content information is incorporated into the item latent vector, and Lastfm is a social media dataset where people have preferences similar to their friends'. λq controls how much social information is incorporated into the user latent factor. Thus, our BDCMF-1 is quite sensitive to λv and λq. For the Delicious dataset, Fig. 2(b) shows that BDCMF-1 achieves its best Recall@50 with λv = 1 and λq = 10. Figs. 2(a) and 2(b) show that we can balance content information against social information by varying λv and λq, leading to better recommendation performance. To examine the impact of the latent dimension D, we vary D and observe the performance. As shown in Fig. 3, BDCMF-1 outperforms all the baselines at every dimension, which demonstrates that BDCMF-1 learns better latent representations of users and items than the baselines regardless of the dimension size.

**The Effectiveness of Bayesian Posterior Estimation (RQ3).** To assess the effectiveness of Bayesian posterior estimation, we compare BDCMF-1 and BDCMF-2 on the Lastfm dataset; results on Delicious show the same trend and are thus omitted. The behaviour of BDCMF-1 and BDCMF-2 over iterations on Lastfm is as follows: a) With more iterations, the ELBOs of BDCMF-1 and BDCMF-2 gradually increase.
BDCMF-2 achieves a higher ELBO than BDCMF-1, which indicates that BDCMF-2 fits the data better (the marginal likelihood of the data is higher). b) In terms of recommendation performance, BDCMF-2 outperforms BDCMF-1 on all metrics; specifically, BDCMF-2 achieves a 10.4% relative improvement on Recall@20. c) When the number of iterations exceeds 30, BDCMF-1 suffers from a severe overfitting problem (its Recall@20 begins to decrease even though its ELBO keeps increasing). In contrast, BDCMF-2 does not suffer from this overfitting problem. These observations demonstrate that BDCMF-2 learns better latent representations than BDCMF-1.

Figure 3: Recommendation performance on Lastfm and Delicious with various values of dimension D.

Figure 4: ELBO and recommendation performance of the BDCMF methods w.r.t. the number of iterations on Lastfm (D = 50, λv = 0.1 and λq = 10).

## Conclusions

In this paper we studied the problem of inferring latent factors of users and items in social recommender systems. We have proposed a novel Bayesian deep collaborative matrix factorization model, BDCMF, which incorporates various item content information and user social relationships into a full Bayesian framework. To model an item's latent content vector, we use a generative network to model the generative process of item content. To infer the latent factors of users and items effectively, we first derived an EM-style algorithm from a Bayesian point estimation perspective; given the full Bayesian nature of our proposed model and the drawbacks of Bayesian point estimation, we then inferred the full posterior distributions of the latent factors of users and items. We have conducted several experiments on two public datasets. The results show that BDCMF can effectively infer latent factors of users and items, and that Bayesian posterior estimation is more robust than point estimation. In the future, we will apply the proposed model to other information retrieval tasks such as user profiling (Liang 2018; Liang et al. 2018b; Liang, Yilmaz, and Kanoulas 2018) and social network analysis.

## References

Adams, R. P.; Dahl, G. E.; and Murray, I. 2014. Incorporating side information in probabilistic matrix factorization with Gaussian processes. Papeles De Poblacion 3(14):33.

Brusilovsky, P.; Koren, Y.; Kuflik, T.; and Weimer, M. 2010. Workshop on information heterogeneity and fusion in recommender systems. In RecSys, 375-376.

Chen, C.; Zheng, X.; Wang, Y.; Hong, F.; and Lin, Z. 2014. Context-aware collaborative topic regression with social matrix factorization for recommender systems. In AAAI, 9-15.

Croft, W. B.; Metzler, D.; and Strohman, T. 2015. Search Engines: Information Retrieval in Practice. Addison-Wesley.

de Souza da Silva, E.; Langseth, H.; and Ramampiaro, H. 2017. Content-based social recommendation with Poisson matrix factorization. In ECML-PKDD, 530-546.

Doersch, C. 2016. Tutorial on variational autoencoders. CoRR abs/1606.05908.

He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X.; and Chua, T.-S. 2017. Neural collaborative filtering. In WWW, 173-182.

Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. In ICLR.

Li, X., and She, J. 2017. Collaborative variational autoencoder for recommender systems. In KDD, 305-314.

Li, S.; Kawale, J.; and Fu, Y. 2015. Deep collaborative filtering via marginalized denoising auto-encoder. In CIKM, 811-820.

Liang, S., and de Rijke, M. 2016. Formal language models for finding groups of experts. Information Processing & Management 52(4):529-549.
Liang, S.; Ren, Z.; Weerkamp, W.; Meij, E.; and de Rijke, M. 2014. Time-aware rank aggregation for microblog search. In CIKM, 989-998.

Liang, S.; Markov, I.; Ren, Z.; and de Rijke, M. 2018a. Manifold learning for rank aggregation. In WWW, 1735-1744.

Liang, S.; Zhang, X.; Ren, Z.; and Kanoulas, E. 2018b. Dynamic embeddings for user profiling in Twitter. In KDD, 1764-1773.

Liang, S.; Yilmaz, E.; and Kanoulas, E. 2018. Collaboratively tracking interests for user clustering in streams of short texts. IEEE Transactions on Knowledge and Data Engineering.

Liang, S. 2018. Dynamic user profiling for streams of short texts. In AAAI, 5860-5867.

Lim, Y. J., and Teh, Y. W. 2007. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, volume 7, 15-21.

Ma, H.; Yang, H.; Lyu, M. R.; and King, I. 2008. SoRec: social recommendation using probabilistic matrix factorization. In CIKM, 931-940.

Mnih, A., and Salakhutdinov, R. R. 2008. Probabilistic matrix factorization. In NIPS, 1257-1264.

Neal, R. M., and Hinton, G. E. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, 355-368.

Nguyen, T. T., and Lauw, H. W. 2017. Collaborative topic regression with denoising autoencoder for content and community co-representation. In CIKM, 2231-2234.

Park, S.; Kim, Y. D.; and Choi, S. 2013. Hierarchical Bayesian matrix factorization with side information. In IJCAI, 1593-1599.

Purushotham, S.; Liu, Y.; and Kuo, C. C. J. 2012. Collaborative topic regression with social matrix factorization for recommendation systems. In ICML, 691-698.

Ren, Z.; Liang, S.; Li, P.; Wang, S.; and de Rijke, M. 2017. Social collaborative viewpoint regression with explainable recommendations. In WSDM, 485-494.

Wang, C., and Blei, D. M. 2011. Collaborative topic modeling for recommending scientific articles. In KDD, 448-456.

Wang, H.; Chen, B.; and Li, W. J. 2013. Collaborative topic regression with social regularization for tag recommendation. In IJCAI, 2719-2725.

Wang, H.; Wang, N.; and Yeung, D. Y. 2015. Collaborative deep learning for recommender systems. In KDD, 1235-1244.

Welling, M.; Chemudugunta, C.; and Sutter, N. 2008. Deterministic latent variable models and their pitfalls. In SDM, 196-207.

Wu, Y.; DuBois, C.; Zheng, A. X.; and Ester, M. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In WSDM, 153-162.

Xu, J.; Yao, Y.; Tong, H.; Tao, X.; and Lu, J. 2017. HoORaYs: High-order optimization of rating distance for recommender systems. In KDD, 525-534.

Zhang, F.; Yuan, N. J.; Lian, D.; Xie, X.; and Ma, W. Y. 2016. Collaborative knowledge base embedding for recommender systems. In KDD, 353-362.