# Streaming Bayesian Deep Tensor Factorization

Shikai Fang 1, Zheng Wang 1, Zhimeng Pan 1, Ji Liu 2 3, Shandian Zhe 1

1 University of Utah, 2 Kwai Inc., 3 University of Rochester. Correspondence to: Shandian Zhe <zhe@cs.utah.edu>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Despite the success of existing tensor factorization methods, most of them conduct a multilinear decomposition, and rarely exploit powerful modeling frameworks, like deep neural networks, to capture a variety of complicated interactions in data. More importantly, for highly expressive, deep factorization, we lack an effective approach to handle streaming data, which are ubiquitous in real-world applications. To address these issues, we propose SBDT, a Streaming Bayesian Deep Tensor factorization method. We first use Bayesian neural networks (NNs) to build a deep tensor factorization model. We assign a spike-and-slab prior over each NN weight to encourage sparsity and to prevent overfitting. We then use the multivariate delta method and moment matching to approximate the posterior of the NN output and calculate the running model evidence, based on which we develop an efficient streaming posterior inference algorithm in the assumed-density-filtering and expectation propagation framework. Our algorithm provides responsive incremental updates for the posterior of the latent factors and NN weights upon receiving newly observed tensor entries, and meanwhile identifies and inhibits redundant/useless weights. We show the advantages of our approach in four real-world applications.

1. Introduction

Tensor factorization is a fundamental tool for multiway data analysis. While many tensor factorization methods have been developed (Tucker, 1966; Harshman, 1970; Chu & Ghahramani, 2009; Kang et al., 2012; Choi & Vishwanathan, 2014), most of them conduct a multilinear decomposition and are incapable of capturing complex, nonlinear relationships in data. Deep neural networks (NNs) are a class of very flexible and powerful modeling frameworks, known to be able to estimate all kinds of complicated (e.g., highly nonlinear) mappings. Recent work (Liu et al., 2018; 2019) has attempted to incorporate NNs into tensor factorization and shown improved performance, in spite of the risk of overfitting the tensor data, which are typically sparse.

Nonetheless, one critical bottleneck for NN-based factorization is the lack of effective approaches for streaming data. In practice, many applications produce huge volumes of data at a fast pace (Du et al., 2018). It is extremely costly to run the factorization from scratch every time we receive a new set of entries. Some privacy-demanding applications (e.g., Snapchat) even forbid us from revisiting the previously seen data. Hence, given new data, we need an effective and efficient way to incrementally update the model. A general and popular approach is streaming variational Bayes (SVB) (Broderick et al., 2013), which integrates the current posterior with the new data and then estimates a variational approximation as the updated posterior. Although SVB has been successfully used to develop the state-of-the-art multilinear streaming tensor factorization (Du et al., 2018), it does not perform well for deep NN-based factorization.
Due to the nested nonlinear coupling of the latent factors and NN weights, the variational model evidence lower bound (ELBO) that SVB maximizes is analytically intractable, and we have to resort to stochastic optimization, which is unstable and whose convergence is hard to diagnose. We cannot use common tricks, e.g., cross-validation and early stopping, to alleviate the issue, because we cannot store or revisit the data in the streaming scenario. Consequently, the posterior updates are often unreliable and inferior, which in turn hurts the subsequent updates and tends to result in a poor final model estimation.

To address these issues, we propose SBDT, a streaming Bayesian deep tensor factorization method that not only exploits NNs' expressive power to capture intricate relationships, but also provides efficient, high-quality posterior updates for streaming data. Specifically, we first use Bayesian neural networks to build a deep tensor factorization model, where the input is the concatenation of the factors associated with each tensor entry and the NN output predicts the entry value. To reduce the risk of overfitting, we place a spike-and-slab prior over each NN weight to encourage sparsity. For streaming inference, we use the multivariate delta method (Bickel & Doksum, 2015), which employs a Taylor expansion of the NN output to analytically compute its moments, and match the moments to obtain its current posterior and the running model evidence. We then use back-propagation to calculate the gradient of the log evidence, with which we match the moments and update the posterior of the latent factors and NN weights in the assumed-density-filtering (Boyen & Koller, 2013) framework. After processing all the newly received entries, we update the spike-and-slab prior approximation with expectation propagation (Minka, 2001a) to identify and inhibit redundant/useless weights. In this way, the incremental posterior updates are deterministic, reliable and efficient.

For evaluation, we examined SBDT in four real-world, large-scale applications, including both binary and continuous tensors. We compared with the state-of-the-art streaming tensor factorization algorithm (Du et al., 2018), which is based on a multilinear form, and with streaming nonlinear factorization methods implemented with SVB. In both running and final predictive performance, our method consistently outperforms the competing approaches, mostly by a large margin. The running accuracy of SBDT is also much more stable and smooth than that of the SVB-based methods.

2. Background

Tensor Factorization. We denote a K-mode tensor by $\mathcal{Y} \in \mathbb{R}^{d_1 \times \cdots \times d_K}$, where mode k includes $d_k$ nodes. We index each entry by a tuple $\mathbf{i} = (i_1, \ldots, i_K)$, which stands for the interaction of the corresponding K nodes. The value of entry $\mathbf{i}$ is denoted by $y_{\mathbf{i}}$. To factorize the tensor, we represent all the nodes by K latent factor matrices $\mathcal{U} = \{\mathbf{U}^1, \ldots, \mathbf{U}^K\}$, where each $\mathbf{U}^k = [\mathbf{u}^k_1, \ldots, \mathbf{u}^k_{d_k}]^\top$ is of size $d_k \times r_k$, and each $\mathbf{u}^k_j$ comprises the factors of node j in mode k. The goal is to use $\mathcal{U}$ to recover the observed entries in $\mathcal{Y}$. To this end, the classical Tucker factorization (Tucker, 1966) assumes $\mathcal{Y} = \mathcal{W} \times_1 \mathbf{U}^1 \times_2 \cdots \times_K \mathbf{U}^K$, where $\mathcal{W} \in \mathbb{R}^{r_1 \times \cdots \times r_K}$ is a parametric tensor and $\times_k$ is the mode-k tensor-matrix multiplication (Kolda, 2006), which resembles the matrix-matrix multiplication. If we set all $r_k = r$ and $\mathcal{W}$ to be diagonal, Tucker factorization becomes CANDECOMP/PARAFAC (CP) factorization (Harshman, 1970). The element-wise form is $y_{\mathbf{i}} = \sum_{j=1}^{r} \prod_{k=1}^{K} u^k_{i_k, j} = (\mathbf{u}^1_{i_1} \circ \cdots \circ \mathbf{u}^K_{i_K})^\top \mathbf{1}$, where $\circ$ is the Hadamard (element-wise) product and $\mathbf{1}$ is the vector filled with ones.
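To make the CP element-wise form concrete, here is a minimal sketch in NumPy (the factor matrices, rank, and entry index are illustrative assumptions, not values from the paper) that reconstructs a single entry as the sum over the Hadamard product of the associated factor vectors.

```python
import numpy as np

def cp_entry_value(factors, index):
    """Reconstruct a single tensor entry under the CP model:
    y_i = sum_j prod_k U^k[i_k, j] = (u^1_{i_1} o ... o u^K_{i_K})^T 1.

    factors : list of K arrays, factors[k] has shape (d_k, r)
    index   : tuple (i_1, ..., i_K) of node indices
    """
    # Start from the factor vector of the node in the first mode.
    prod = factors[0][index[0]].copy()
    # Hadamard (element-wise) product across the remaining modes.
    for k in range(1, len(factors)):
        prod *= factors[k][index[k]]
    # Summing the element-wise product equals the inner product with the all-ones vector.
    return prod.sum()

# Toy usage with random factors (illustrative only).
rng = np.random.default_rng(0)
d, r = [5, 6, 7], 3
U = [rng.standard_normal((d_k, r)) for d_k in d]
print(cp_entry_value(U, (2, 4, 1)))
```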
We can estimate the factors $\mathcal{U}$ by minimizing a loss function, e.g., the mean squared error in recovering the observed elements in $\mathcal{Y}$.

Streaming Model Estimation. A general and popular framework for incremental model estimation is streaming variational Bayes (SVB) (Broderick et al., 2013), which is grounded on the incremental version of Bayes' rule,

$$p(\theta \mid \mathcal{D}_{old} \cup \mathcal{D}_{new}) \propto p(\theta \mid \mathcal{D}_{old}) \, p(\mathcal{D}_{new} \mid \theta), \quad (1)$$

where $\theta$ are the latent random variables in the probabilistic model we are interested in, $\mathcal{D}_{old}$ all the data that have been seen so far, and $\mathcal{D}_{new}$ the incoming data. SVB approximates the current posterior $p(\theta \mid \mathcal{D}_{old})$ with a variational posterior $q_{cur}(\theta)$. When new data arrive, SVB integrates $q_{cur}(\theta)$ with the likelihood of the new data to obtain an unnormalized, blending distribution,

$$\widetilde{p}(\theta) = q_{cur}(\theta) \, p(\mathcal{D}_{new} \mid \theta), \quad (2)$$

which can be viewed as approximately proportional to the joint distribution $p(\theta, \mathcal{D}_{old} \cup \mathcal{D}_{new})$. To conduct the incremental update, SVB uses $\widetilde{p}(\theta)$ to construct a variational ELBO (Wainwright et al., 2008), $\mathcal{L}(q(\theta)) = \mathbb{E}_q[\log(\widetilde{p}(\theta)/q(\theta))]$, and maximizes the ELBO to obtain the updated posterior, $q^* = \operatorname{argmax}_q \mathcal{L}(q)$. This is equivalent to minimizing the Kullback-Leibler (KL) divergence between $q$ and the normalized $\widetilde{p}(\theta)$. We then set $q_{cur} = q^*$ and prepare the update for the next batch of new data. At the beginning (when we have not received any data), we set $q_{cur} = p(\theta)$, the original prior in the model. For efficiency and convenience, a factorized variational posterior $q(\theta) = \prod_j q(\theta_j)$ is usually adopted to fulfill cyclic, closed-form updates. For example, the state-of-the-art streaming tensor factorization, POST (Du et al., 2018), uses the CP form to build a Bayesian model and applies SVB to update the posterior of the factors incrementally when receiving new tensor entries.

3. Bayesian Deep Tensor Factorization

Despite the elegance and convenience of the popular Tucker and CP factorization, their multilinear form can severely limit the capability of estimating complicated, highly nonlinear/nonstationary relationships hidden in data. While numerous other methods have also been proposed, e.g., (Chu & Ghahramani, 2009; Kang et al., 2012; Choi & Vishwanathan, 2014), most are still inherently based on the CP or Tucker form. Enlightened by the expressive power of (deep) neural networks (Goodfellow et al., 2016), we propose a Bayesian deep tensor factorization model to overcome the limitation of traditional methods and flexibly estimate all kinds of complex relationships.

Specifically, for each tensor entry $\mathbf{i}$, we construct an input $\mathbf{x}_{\mathbf{i}}$ by concatenating all the latent factors associated with $\mathbf{i}$, namely, $\mathbf{x}_{\mathbf{i}} = [\mathbf{u}^1_{i_1}; \ldots; \mathbf{u}^K_{i_K}]$. We assume that there is an unknown mapping between the input factors $\mathbf{x}_{\mathbf{i}}$ and the value of entry $\mathbf{i}$, $f: \mathbb{R}^{\sum_{k=1}^K r_k} \to \mathbb{R}$, which reflects the complex interactions/relationships between the tensor nodes in entry $\mathbf{i}$. Note that CP factorization uses a multilinear mapping. We use an M-layer neural network (NN) to model the mapping $f$, which is parameterized by M weight matrices $\mathcal{W} = \{\mathbf{W}_1, \ldots, \mathbf{W}_M\}$. Each $\mathbf{W}_m$ is $V_m \times (V_{m-1} + 1)$, where $V_m$ and $V_{m-1}$ are the widths of layer m and layer m-1, respectively; $V_0 = \sum_{k=1}^K r_k$ is the input dimension and $V_M = 1$. We denote the output of each hidden layer m by $\mathbf{h}_m$ ($1 \le m \le M-1$) and define $\mathbf{h}_0 = \mathbf{x}_{\mathbf{i}}$. We compute each $\mathbf{h}_m = \sigma(\mathbf{W}_m [\mathbf{h}_{m-1}; 1] / \sqrt{V_{m-1} + 1})$, where $\sigma(\cdot)$ is a nonlinear activation function, e.g., ReLU or tanh. Note that we append a constant feature 1 to introduce the bias terms in the linear transformation, namely the last column in each $\mathbf{W}_m$. For the last layer, we compute the output by $f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}}) = \mathbf{W}_M [\mathbf{h}_{M-1}; 1] / \sqrt{V_{M-1} + 1}$.
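As a concrete illustration of this architecture, the following sketch (NumPy; the layer widths, tanh activation, and random weights are illustrative assumptions) builds the input by concatenating the factors of an entry and propagates it through the scaled, bias-augmented layers described above.

```python
import numpy as np

def nn_forward(factors, index, weights):
    """Deterministic forward pass of the deep tensor factorization NN.

    factors : list of K factor matrices, factors[k] has shape (d_k, r_k)
    index   : tuple (i_1, ..., i_K) of node indices
    weights : list of M matrices; with zero-based m, weights[m] has shape (V_{m+1}, V_m + 1)
    Returns the scalar output f_W(x_i).
    """
    # Input x_i: concatenation of the factors associated with the entry.
    h = np.concatenate([factors[k][index[k]] for k in range(len(factors))])
    for m, W in enumerate(weights):
        # Append the constant feature 1 (bias column) and scale by sqrt(V_{m-1} + 1),
        # as in h_m = sigma(W_m [h_{m-1}; 1] / sqrt(V_{m-1} + 1)).
        z = W @ np.append(h, 1.0) / np.sqrt(h.size + 1)
        # Nonlinear activation on hidden layers; the last layer is linear.
        h = np.tanh(z) if m < len(weights) - 1 else z
    return h.item()

# Toy usage with assumed sizes: K = 2 modes, rank-3 factors, one hidden layer of width 8.
rng = np.random.default_rng(0)
U = [rng.standard_normal((4, 3)), rng.standard_normal((5, 3))]
W = [rng.standard_normal((8, 6 + 1)), rng.standard_normal((1, 8 + 1))]
print(nn_forward(U, (1, 2), W))
```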
Given the output, we sample the observed entry value $y_{\mathbf{i}}$ via a noise model. For continuous data, we use a Gaussian noise model, $p(y_{\mathbf{i}} \mid \mathcal{U}, \mathcal{W}, \tau) = \mathcal{N}(y_{\mathbf{i}} \mid f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}}), \tau^{-1})$, where $\tau$ is the inverse noise variance. We further assign $\tau$ a Gamma prior, $p(\tau) = \text{Gamma}(\tau \mid a_0, b_0)$. For binary data, we use the probit model, $p(y_{\mathbf{i}} \mid \mathcal{U}, \mathcal{W}) = \Phi\big((2y_{\mathbf{i}} - 1) f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}})\big)$, where $\Phi(\cdot)$ is the cumulative density function (CDF) of the standard normal distribution.

Despite their great flexibility, NNs run the risk of overfitting. The larger a network, i.e., the more weight parameters it has, the more easily it overfits the data. To prevent overfitting, we assign a spike-and-slab prior (Ishwaran et al., 2005; Titsias & Lázaro-Gredilla, 2011), which is ideal due to its selective shrinkage effect, over each NN weight to sparsify and condense the network. Specifically, for each weight $w_{mjt} = [\mathbf{W}_m]_{jt}$, we first sample a binary selection indicator $s_{mjt}$ from $p(s_{mjt} \mid \rho_0) = \text{Bern}(s_{mjt} \mid \rho_0) = \rho_0^{s_{mjt}} (1 - \rho_0)^{1 - s_{mjt}}$. The weight is then sampled from

$$p(w_{mjt} \mid s_{mjt}) = s_{mjt} \mathcal{N}(w_{mjt} \mid 0, \sigma_0^2) + (1 - s_{mjt}) \delta(w_{mjt}), \quad (3)$$

where $\delta(\cdot)$ is the Dirac delta function. Hence, the selection indicator $s_{mjt}$ determines the type of prior over $w_{mjt}$: if $s_{mjt}$ is 1, meaning the weight is useful and active, we assign a flat Gaussian prior with variance $\sigma_0^2$ (the slab component); if otherwise $s_{mjt}$ is 0, namely the weight is useless and should be deactivated, we assign a spike prior concentrating on 0 (the spike component). Finally, we place a standard normal prior over the factors $\mathcal{U}$.

Given the set of observed tensor entries $\mathcal{D} = \{y_{\mathbf{i}_1}, \ldots, y_{\mathbf{i}_N}\}$, the joint probability of our model for continuous data is

$$p(\mathcal{U}, \mathcal{W}, \mathcal{S}, \tau, \mathcal{D}) = \prod_{m=1}^{M} \prod_{j,t} \text{Bern}(s_{mjt} \mid \rho_0) \big[ s_{mjt} \mathcal{N}(w_{mjt} \mid 0, \sigma_0^2) + (1 - s_{mjt}) \delta(w_{mjt}) \big] \cdot \prod_{k=1}^{K} \prod_{j=1}^{d_k} \mathcal{N}(\mathbf{u}^k_j \mid \mathbf{0}, \mathbf{I}) \cdot \text{Gamma}(\tau \mid a_0, b_0) \prod_{n=1}^{N} \mathcal{N}\big(y_{\mathbf{i}_n} \mid f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}_n}), \tau^{-1}\big), \quad (4)$$

where $\mathcal{S} = \{s_{mjt}\}$, and for binary data is

$$p(\mathcal{U}, \mathcal{W}, \mathcal{S}, \mathcal{D}) = \prod_{m=1}^{M} \prod_{j,t} \text{Bern}(s_{mjt} \mid \rho_0) \big[ s_{mjt} \mathcal{N}(w_{mjt} \mid 0, \sigma_0^2) + (1 - s_{mjt}) \delta(w_{mjt}) \big] \cdot \prod_{k=1}^{K} \prod_{j=1}^{d_k} \mathcal{N}(\mathbf{u}^k_j \mid \mathbf{0}, \mathbf{I}) \prod_{n=1}^{N} \Phi\big( (2 y_{\mathbf{i}_n} - 1) f_{\mathcal{W}}(\mathbf{x}_{\mathbf{i}_n}) \big). \quad (5)$$

4. Streaming Posterior Inference

We now present our streaming model estimation algorithm. In general, the observed tensor entries are assumed to be streamed in a sequence of small batches, $\{\mathcal{B}_1, \mathcal{B}_2, \ldots\}$. Different batches do not have to include the same number of entries. Upon receiving each batch $\mathcal{B}_t$, we aim to update the posterior distribution of the factors $\mathcal{U}$, the inverse noise variance $\tau$ (for continuous data), the selection indicators $\mathcal{S}$ and the neural network weights $\mathcal{W}$, without re-accessing the previous batches $\{\mathcal{B}_j\}_{j<t}$.
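Before turning to the update steps, the following sketch ties the pieces of Section 3 together by ancestrally sampling from the continuous-data generative model in Eq. (4): factors from standard normal priors, weights from the spike-and-slab prior of Eq. (3), the noise precision from its Gamma prior, and then one observed entry value. All sizes and hyperparameter values (rho0, sigma0, a0, b0), the tanh activation, and the rate parameterization of the Gamma prior are illustrative assumptions; the inline forward pass mirrors the earlier sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not values from the paper).
d, r = [4, 5], 3                      # two modes, rank-3 factors
widths = [r * len(d), 8, 1]           # V_0 = sum_k r_k, one hidden layer, scalar output
rho0, sigma0, a0, b0 = 0.5, 1.0, 1.0, 1.0

# Factors: standard normal prior, u^k_j ~ N(0, I).
U = [rng.standard_normal((d_k, r)) for d_k in d]

# Weights: spike-and-slab prior, Eq. (3).
W, S = [], []
for m in range(len(widths) - 1):
    s = rng.binomial(1, rho0, size=(widths[m + 1], widths[m] + 1))  # selection indicators s_mjt
    w = s * rng.normal(0.0, sigma0, size=s.shape)                   # slab if s=1, spike at 0 if s=0
    W.append(w); S.append(s)

# Noise precision: tau ~ Gamma(a0, b0), treating b0 as a rate (scale = 1/b0).
tau = rng.gamma(a0, 1.0 / b0)

# Forward pass for one entry i = (1, 2), mirroring Section 3.
h = np.concatenate([U[0][1], U[1][2]])
for m, Wm in enumerate(W):
    z = Wm @ np.append(h, 1.0) / np.sqrt(h.size + 1)
    h = np.tanh(z) if m < len(W) - 1 else z

# Observed value: y_i ~ N(f_W(x_i), 1/tau), the likelihood in Eq. (4).
y = rng.normal(h.item(), 1.0 / np.sqrt(tau))
print(S[0].mean(), y)
```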