# Streaming Bayesian Deep Tensor Factorization

Shikai Fang 1, Zheng Wang 1, Zhimeng Pan 1, Ji Liu 2 3, Shandian Zhe 1

1 University of Utah, 2 Kwai Inc., 3 University of Rochester. Correspondence to: Shandian Zhe <zhe@cs.utah.edu>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Despite the success of existing tensor factorization methods, most of them conduct a multilinear decomposition, and rarely exploit powerful modeling frameworks, like deep neural networks, to capture a variety of complicated interactions in data. More importantly, for highly expressive, deep factorization, we lack an effective approach to handle streaming data, which are ubiquitous in real-world applications. To address these issues, we propose SBDT, a Streaming Bayesian Deep Tensor factorization method. We first use Bayesian neural networks (NNs) to build a deep tensor factorization model. We assign a spike-and-slab prior over each NN weight to encourage sparsity and to prevent overfitting. We then use the multivariate delta method and moment matching to approximate the posterior of the NN output and calculate the running model evidence, based on which we develop an efficient streaming posterior inference algorithm in the assumed-density-filtering and expectation propagation framework. Our algorithm provides responsive incremental updates for the posterior of the latent factors and NN weights upon receiving newly observed tensor entries, and meanwhile identifies and inhibits redundant/useless weights. We show the advantages of our approach in four real-world applications.

1. Introduction

Tensor factorization is a fundamental tool for multiway data analysis. While many tensor factorization methods have been developed (Tucker, 1966; Harshman, 1970; Chu & Ghahramani, 2009; Kang et al., 2012; Choi & Vishwanathan, 2014), most of them conduct a multilinear decomposition and are incapable of capturing complex, nonlinear relationships in data. Deep neural networks (NNs) are a class of very flexible and powerful modeling frameworks, known to be able to estimate all kinds of complicated (e.g., highly nonlinear) mappings. Recent work (Liu et al., 2018; 2019) has attempted to incorporate NNs into tensor factorization and shown improved performance, in spite of the risk of overfitting the tensor data, which are typically sparse.

Nonetheless, one critical bottleneck for NN-based factorization is the lack of effective approaches for streaming data. In practice, many applications produce huge volumes of data at a fast pace (Du et al., 2018). It is extremely costly to run the factorization from scratch every time we receive a new set of entries. Some privacy-demanding applications (e.g., Snapchat) even forbid us from revisiting the previously seen data. Hence, given new data, we need an effective and efficient way to incrementally update the model. A general and popular approach is streaming variational Bayes (SVB) (Broderick et al., 2013), which integrates the current posterior with the new data and then estimates a variational approximation as the updated posterior. Although SVB has been successfully used to develop the state-of-the-art multilinear streaming tensor factorization (Du et al., 2018), it does not perform well for deep NN-based factorization.
Due to the nested nonlinear coupling of the latent factors and NN weights, the variational model evidence lower bound (ELBO) that SVB maximizes is analytically intractable, and we have to resort to stochastic optimization, which is unstable and whose convergence is hard to diagnose. We cannot use common tricks, e.g., cross-validation and early stopping, to alleviate the issue, because we cannot store or revisit the data in the streaming scenario. Consequently, the posterior updates are often unreliable and inferior, which in turn hurts the subsequent updates and tends to result in a poor final model estimation.

To address these issues, we propose SBDT, a streaming Bayesian deep tensor factorization method that not only exploits NNs' expressive power to capture intricate relationships, but also provides efficient, high-quality posterior updates for streaming data. Specifically, we first use Bayesian neural networks to build a deep tensor factorization model, where the input is the concatenation of the factors associated with each tensor entry and the NN output predicts the entry value. To reduce the risk of overfitting, we place a spike-and-slab prior over each NN weight to encourage sparsity. For streaming inference, we use the multivariate delta method (Bickel & Doksum, 2015), which employs a Taylor expansion of the NN output to analytically compute its moments, and match the moments to obtain its current posterior and the running model evidence. We then use back-propagation to calculate the gradient of the log evidence, with which we match the moments and update the posterior of the latent factors and NN weights in the assumed-density-filtering (Boyen & Koller, 2013) framework. After processing all the newly received entries, we update the spike-and-slab prior approximation with expectation propagation (Minka, 2001a) to identify and inhibit redundant/useless weights. In this way, the incremental posterior updates are deterministic, reliable and efficient.

For evaluation, we examined SBDT in four real-world, large-scale applications, including both binary and continuous tensors. We compared with the state-of-the-art streaming tensor factorization algorithm (Du et al., 2018), which is based on a multilinear form, and with streaming nonlinear factorization methods implemented with SVB. In both running and final predictive performance, our method consistently outperforms the competing approaches, mostly by a large margin. The running accuracy of SBDT is also much more stable and smooth than that of the SVB-based methods.

2. Background

Tensor Factorization. We denote a K-mode tensor by $\mathcal{Y} \in \mathbb{R}^{d_1 \times \cdots \times d_K}$, where mode k includes $d_k$ nodes. We index each entry by a tuple $\mathbf{i} = (i_1, \ldots, i_K)$, which stands for the interaction of the corresponding K nodes. The value of entry $\mathbf{i}$ is denoted by $y_{\mathbf{i}}$. To factorize the tensor, we represent all the nodes by K latent factor matrices $\mathcal{U} = \{\mathbf{U}^1, \ldots, \mathbf{U}^K\}$, where each $\mathbf{U}^k = [\mathbf{u}^k_1, \ldots, \mathbf{u}^k_{d_k}]^\top$ is of size $d_k \times r_k$, and each $\mathbf{u}^k_j$ comprises the factors of node j in mode k. The goal is to use $\mathcal{U}$ to recover the observed entries in $\mathcal{Y}$. To this end, the classical Tucker factorization (Tucker, 1966) assumes $\mathcal{Y} = \mathcal{W} \times_1 \mathbf{U}^1 \times_2 \cdots \times_K \mathbf{U}^K$, where $\mathcal{W} \in \mathbb{R}^{r_1 \times \cdots \times r_K}$ is a parametric tensor and $\times_k$ is the mode-k tensor-matrix multiplication (Kolda, 2006), which resembles the matrix-matrix multiplication. If we set all $r_k = r$ and $\mathcal{W}$ to be diagonal, Tucker factorization becomes CANDECOMP/PARAFAC (CP) factorization (Harshman, 1970). The element-wise form is $y_{\mathbf{i}} = \sum_{j=1}^{r} \prod_{k=1}^{K} u^k_{i_k, j} = (\mathbf{u}^1_{i_1} \circ \cdots \circ \mathbf{u}^K_{i_K})^\top \mathbf{1}$, where $\circ$ is the Hadamard (element-wise) product and $\mathbf{1}$ is the vector filled with ones.
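To make the CP element-wise form concrete, here is a minimal sketch in NumPy (the factor matrices, rank, and entry index are illustrative assumptions, not values from the paper) that reconstructs a single entry as the sum over the Hadamard product of the associated factor vectors.

```python
import numpy as np

def cp_entry_value(factors, index):
    """Reconstruct a single tensor entry under the CP model:
    y_i = sum_j prod_k U^k[i_k, j] = (u^1_{i_1} o ... o u^K_{i_K})^T 1.

    factors : list of K arrays, factors[k] has shape (d_k, r)
    index   : tuple (i_1, ..., i_K) of node indices
    """
    # Start from the factor vector of the node in the first mode.
    prod = factors[0][index[0]].copy()
    # Hadamard (element-wise) product across the remaining modes.
    for k in range(1, len(factors)):
        prod *= factors[k][index[k]]
    # Summing the element-wise product equals the inner product with the all-ones vector.
    return prod.sum()

# Toy usage with random factors (illustrative only).
rng = np.random.default_rng(0)
d, r = [5, 6, 7], 3
U = [rng.standard_normal((d_k, r)) for d_k in d]
print(cp_entry_value(U, (2, 4, 1)))
```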
We can estimate the factors $\mathcal{U}$ by minimizing a loss function, e.g., the mean squared error in recovering the observed elements in $\mathcal{Y}$.

Streaming Model Estimation. A general and popular framework for incremental model estimation is streaming variational Bayes (SVB) (Broderick et al., 2013), which is grounded on the incremental version of Bayes' rule,

$$p(\theta \mid \mathcal{D}_{old} \cup \mathcal{D}_{new}) \propto p(\theta \mid \mathcal{D}_{old}) \, p(\mathcal{D}_{new} \mid \theta), \quad (1)$$

where $\theta$ are the latent random variables in the probabilistic model we are interested in, $\mathcal{D}_{old}$ all the data that have been seen so far, and $\mathcal{D}_{new}$ the incoming data. SVB approximates the current posterior $p(\theta \mid \mathcal{D}_{old})$ with a variational posterior $q_{cur}(\theta)$. When new data arrive, SVB integrates $q_{cur}(\theta)$ with the likelihood of the new data to obtain an unnormalized, blending distribution,

$$\widetilde{p}(\theta) = q_{cur}(\theta) \, p(\mathcal{D}_{new} \mid \theta), \quad (2)$$

which can be viewed as approximately proportional to the joint distribution $p(\theta, \mathcal{D}_{old} \cup \mathcal{D}_{new})$. To conduct the incremental update, SVB uses $\widetilde{p}(\theta)$ to construct a variational ELBO (Wainwright et al., 2008), $\mathcal{L}(q(\theta)) = \mathbb{E}_q[\log(\widetilde{p}(\theta)/q(\theta))]$, and maximizes the ELBO to obtain the updated posterior, $q^* = \operatorname{argmax}_q \mathcal{L}(q)$. This is equivalent to minimizing the Kullback-Leibler (KL) divergence between $q$ and the normalized $\widetilde{p}(\theta)$. We then set $q_{cur} = q^*$ and prepare the update for the next batch of new data. At the beginning (when we have not received any data), we set $q_{cur} = p(\theta)$, the original prior in the model. For efficiency and convenience, a factorized variational posterior $q(\theta) = \prod_j q(\theta_j)$ is usually adopted to fulfill cyclic, closed-form updates. For example, the state-of-the-art streaming tensor factorization, POST (Du et al., 2018), uses the CP form to build a Bayesian model and applies SVB to update the posterior of the factors incrementally when receiving new tensor entries.

3. Bayesian Deep Tensor Factorization

Despite the elegance and convenience of the popular Tucker and CP factorization, their multilinear form can severely limit the capability of estimating complicated, highly nonlinear/nonstationary relationships hidden in data. While numerous other methods have also been proposed, e.g., (Chu & Ghahramani, 2009; Kang et al., 2012; Choi & Vishwanathan, 2014), most are still inherently based on the CP or Tucker form. Enlightened by the expressive power of (deep) neural networks (Goodfellow et al., 2016), we propose a Bayesian deep tensor factorization model to overcome the limitation of traditional methods and flexibly estimate all kinds of complex relationships.

Specifically, for each tensor entry $\mathbf{i}$, we construct an input $\mathbf{x}_{\mathbf{i}}$ by concatenating all the latent factors associated with $\mathbf{i}$, namely, $\mathbf{x}_{\mathbf{i}} = [\mathbf{u}^1_{i_1}; \ldots; \mathbf{u}^K_{i_K}]$. We assume that there is an unknown mapping between the input factors $\mathbf{x}_{\mathbf{i}}$ and the value of entry $\mathbf{i}$, $f: \mathbb{R}^{\sum_{k=1}^K r_k} \to \mathbb{R}$, which reflects the complex interactions/relationships between the tensor nodes in entry $\mathbf{i}$. Note that CP factorization uses a multilinear mapping. We use an M-layer neural network (NN) to model the mapping $f$, which is parameterized by M weight matrices $\mathcal{W} = \{\mathbf{W}_1, \ldots, \mathbf{W}_M\}$. Each $\mathbf{W}_m$ is $V_m \times (V_{m-1} + 1)$, where $V_m$ and $V_{m-1}$ are the widths of layer m and layer m-1, respectively; $V_0 = \sum_{k=1}^K r_k$ is the input dimension and $V_M = 1$. We denote the output of each hidden layer m by $\mathbf{h}_m$ ($1 \le m \le M-1$) and define $\mathbf{h}_0 = \mathbf{x}_{\mathbf{i}}$. We compute each $\mathbf{h}_m = \sigma(\mathbf{W}_m [\mathbf{h}_{m-1}; 1] / \sqrt{V_{m-1} + 1})$, where $\sigma(\cdot)$ is a nonlinear activation function, e.g., ReLU or tanh. Note that we append a constant feature 1 to introduce the bias terms in the linear transformation, namely the last column in each $\mathbf{W}_m$. For the last layer, we compute the output by $f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}}) = \mathbf{W}_M [\mathbf{h}_{M-1}; 1] / \sqrt{V_{M-1} + 1}$.
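As a concrete illustration of this architecture, the following sketch (NumPy; the layer widths, tanh activation, and random weights are illustrative assumptions) builds the input by concatenating the factors of an entry and propagates it through the scaled, bias-augmented layers described above.

```python
import numpy as np

def nn_forward(factors, index, weights):
    """Deterministic forward pass of the deep tensor factorization NN.

    factors : list of K factor matrices, factors[k] has shape (d_k, r_k)
    index   : tuple (i_1, ..., i_K) of node indices
    weights : list of M matrices; with zero-based m, weights[m] has shape (V_{m+1}, V_m + 1)
    Returns the scalar output f_W(x_i).
    """
    # Input x_i: concatenation of the factors associated with the entry.
    h = np.concatenate([factors[k][index[k]] for k in range(len(factors))])
    for m, W in enumerate(weights):
        # Append the constant feature 1 (bias column) and scale by sqrt(V_{m-1} + 1),
        # as in h_m = sigma(W_m [h_{m-1}; 1] / sqrt(V_{m-1} + 1)).
        z = W @ np.append(h, 1.0) / np.sqrt(h.size + 1)
        # Nonlinear activation on hidden layers; the last layer is linear.
        h = np.tanh(z) if m < len(weights) - 1 else z
    return h.item()

# Toy usage with assumed sizes: K = 2 modes, rank-3 factors, one hidden layer of width 8.
rng = np.random.default_rng(0)
U = [rng.standard_normal((4, 3)), rng.standard_normal((5, 3))]
W = [rng.standard_normal((8, 6 + 1)), rng.standard_normal((1, 8 + 1))]
print(nn_forward(U, (1, 2), W))
```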
Given the output, we sample the observed entry value $y_{\mathbf{i}}$ via a noise model. For continuous data, we use a Gaussian noise model, $p(y_{\mathbf{i}} \mid \mathcal{U}, \mathcal{W}, \tau) = \mathcal{N}(y_{\mathbf{i}} \mid f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}}), \tau^{-1})$, where $\tau$ is the inverse noise variance. We further assign $\tau$ a Gamma prior, $p(\tau) = \text{Gamma}(\tau \mid a_0, b_0)$. For binary data, we use the probit model, $p(y_{\mathbf{i}} \mid \mathcal{U}, \mathcal{W}) = \Phi\big((2y_{\mathbf{i}} - 1) f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}})\big)$, where $\Phi(\cdot)$ is the cumulative density function (CDF) of the standard normal distribution.

Despite their great flexibility, NNs run the risk of overfitting. The larger a network, i.e., the more weight parameters it has, the more easily it overfits the data. To prevent overfitting, we assign a spike-and-slab prior (Ishwaran et al., 2005; Titsias & Lázaro-Gredilla, 2011), which is ideal due to its selective shrinkage effect, over each NN weight to sparsify and condense the network. Specifically, for each weight $w_{mjt} = [\mathbf{W}_m]_{jt}$, we first sample a binary selection indicator $s_{mjt}$ from $p(s_{mjt} \mid \rho_0) = \text{Bern}(s_{mjt} \mid \rho_0) = \rho_0^{s_{mjt}} (1 - \rho_0)^{1 - s_{mjt}}$. The weight is then sampled from

$$p(w_{mjt} \mid s_{mjt}) = s_{mjt} \mathcal{N}(w_{mjt} \mid 0, \sigma_0^2) + (1 - s_{mjt}) \delta(w_{mjt}), \quad (3)$$

where $\delta(\cdot)$ is the Dirac delta function. Hence, the selection indicator $s_{mjt}$ determines the type of prior over $w_{mjt}$: if $s_{mjt}$ is 1, meaning the weight is useful and active, we assign a flat Gaussian prior with variance $\sigma_0^2$ (the slab component); if otherwise $s_{mjt}$ is 0, namely the weight is useless and should be deactivated, we assign a spike prior concentrating on 0 (the spike component). Finally, we place a standard normal prior over the factors $\mathcal{U}$.

Given the set of observed tensor entries $\mathcal{D} = \{y_{\mathbf{i}_1}, \ldots, y_{\mathbf{i}_N}\}$, the joint probability of our model for continuous data is

$$p(\mathcal{U}, \mathcal{W}, \mathcal{S}, \tau, \mathcal{D}) = \prod_{m=1}^{M} \prod_{j,t} \text{Bern}(s_{mjt} \mid \rho_0) \big[ s_{mjt} \mathcal{N}(w_{mjt} \mid 0, \sigma_0^2) + (1 - s_{mjt}) \delta(w_{mjt}) \big] \cdot \prod_{k=1}^{K} \prod_{j=1}^{d_k} \mathcal{N}(\mathbf{u}^k_j \mid \mathbf{0}, \mathbf{I}) \cdot \text{Gamma}(\tau \mid a_0, b_0) \prod_{n=1}^{N} \mathcal{N}\big(y_{\mathbf{i}_n} \mid f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}_n}), \tau^{-1}\big), \quad (4)$$

where $\mathcal{S} = \{s_{mjt}\}$, and for binary data is

$$p(\mathcal{U}, \mathcal{W}, \mathcal{S}, \mathcal{D}) = \prod_{m=1}^{M} \prod_{j,t} \text{Bern}(s_{mjt} \mid \rho_0) \big[ s_{mjt} \mathcal{N}(w_{mjt} \mid 0, \sigma_0^2) + (1 - s_{mjt}) \delta(w_{mjt}) \big] \cdot \prod_{k=1}^{K} \prod_{j=1}^{d_k} \mathcal{N}(\mathbf{u}^k_j \mid \mathbf{0}, \mathbf{I}) \prod_{n=1}^{N} \Phi\big( (2 y_{\mathbf{i}_n} - 1) f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}_n}) \big). \quad (5)$$

4. Streaming Posterior Inference

We now present our streaming model estimation algorithm. In general, the observed tensor entries are assumed to be streamed in a sequence of small batches, $\{\mathcal{B}_1, \mathcal{B}_2, \ldots\}$. Different batches do not have to include the same number of entries. Upon receiving each batch $\mathcal{B}_t$, we aim to update the posterior distribution of the factors $\mathcal{U}$, the inverse noise variance $\tau$ (for continuous data), the selection indicators $\mathcal{S}$ and the neural network weights $\mathcal{W}$, without re-accessing the previous batches $\{\mathcal{B}_j\}_{j<t}$.
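Before turning to the update steps, the following sketch ties the pieces of Section 3 together by ancestrally sampling from the continuous-data generative model in Eq. (4): factors from standard normal priors, weights from the spike-and-slab prior of Eq. (3), the noise precision from its Gamma prior, and then one observed entry value. All sizes and hyperparameter values (rho0, sigma0, a0, b0), the tanh activation, and the rate parameterization of the Gamma prior are illustrative assumptions; the inline forward pass mirrors the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not values from the paper).
d, r = [4, 5], 3                      # two modes, rank-3 factors
widths = [r * len(d), 8, 1]           # V_0 = sum_k r_k, one hidden layer, scalar output
rho0, sigma0, a0, b0 = 0.5, 1.0, 1.0, 1.0

# Factors: standard normal prior, u^k_j ~ N(0, I).
U = [rng.standard_normal((d_k, r)) for d_k in d]

# Weights: spike-and-slab prior, Eq. (3).
W, S = [], []
for m in range(len(widths) - 1):
    s = rng.binomial(1, rho0, size=(widths[m + 1], widths[m] + 1))  # selection indicators s_mjt
    w = s * rng.normal(0.0, sigma0, size=s.shape)                   # slab if s=1, spike at 0 if s=0
    W.append(w); S.append(s)

# Noise precision: tau ~ Gamma(a0, b0), treating b0 as a rate (scale = 1/b0).
tau = rng.gamma(a0, 1.0 / b0)

# Forward pass for one entry i = (1, 2), mirroring Section 3.
h = np.concatenate([U[0][1], U[1][2]])
for m, Wm in enumerate(W):
    z = Wm @ np.append(h, 1.0) / np.sqrt(h.size + 1)
    h = np.tanh(z) if m < len(W) - 1 else z

# Observed value: y_i ~ N(f_W(x_i), 1/tau), the likelihood in Eq. (4).
y = rng.normal(h.item(), 1.0 / np.sqrt(tau))
print(S[0].mean(), y)
```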