Spectral Methods for Supervised Topic Models

Yining Wang
Machine Learning Department, Carnegie Mellon University
yiningwa@cs.cmu.edu

Jun Zhu
Dept. of Comp. Sci. & Tech.; Tsinghua National TNList Lab; State Key Lab of Intell. Tech. & Sys., Tsinghua University
dcszj@mail.tsinghua.edu.cn

Abstract

Supervised topic models simultaneously model the latent topic structure of large collections of documents and a response variable associated with each document. Existing inference methods are based on either variational approximation or Monte Carlo sampling. This paper presents a novel spectral decomposition algorithm to recover the parameters of supervised latent Dirichlet allocation (sLDA) models. The Spectral-sLDA algorithm is provably correct and computationally efficient. We prove a sample complexity bound and subsequently derive a sufficient condition for the identifiability of sLDA. Thorough experiments on a diverse range of synthetic and real-world datasets verify the theory and demonstrate the practical effectiveness of the algorithm.

1 Introduction

Topic modeling offers a suite of useful tools that automatically learn the latent semantic structure of a large collection of documents. Latent Dirichlet allocation (LDA) [9] is one of the most popular topic models. Vanilla LDA is an unsupervised model built only on the contents of documents. In many applications, however, side information is available in addition to the raw contents, e.g., user-provided rating scores for an online review text. Such side signals usually provide additional information that reveals the underlying structure of the documents under study. There have been extensive studies on developing topic models that incorporate various kinds of side information, e.g., by treating it as supervision. Representative models include supervised LDA (sLDA) [8], which captures a real-valued regression response for each document; multi-class sLDA [21], which learns with discrete classification responses; discriminative LDA (DiscLDA) [14], which incorporates classification responses via discriminative linear transformations on topic mixing vectors; and MedLDA [22, 23], which employs a max-margin criterion to learn discriminative latent topic representations.

Topic models are typically learned by finding maximum likelihood estimates (MLE) through local search or sampling methods [12, 18, 19], which may suffer from local optima. Much recent progress has been made on developing spectral decomposition [1, 2, 3] and nonnegative matrix factorization (NMF) [4, 5, 6, 7] methods to infer latent topic-word distributions. Instead of finding MLE estimates, which is a known NP-hard problem [6], these methods assume that the documents are i.i.d. sampled from a topic model and attempt to recover the underlying model parameters. Compared to local search and sampling algorithms, these methods enjoy the advantage of being provably effective. In fact, sample complexity bounds have been proved, showing that given a sufficiently large collection of documents, these algorithms recover the model parameters accurately with high probability.

Although spectral decomposition (as well as NMF) methods have achieved increasing success in recovering latent variable models, their applicability is still quite limited. For example, previous work has mainly focused on unsupervised latent variable models, leaving the broad family of supervised models (e.g., sLDA) largely unexplored.
The only exception is [10], which presents a spectral method for mixtures of regression models, a setting quite different from sLDA. This gap is not a coincidence, as supervised models impose new technical challenges. For instance, a direct application of previous techniques [1, 2] to sLDA cannot handle regression models with duplicate entries. In addition, the sample complexity bound gets much worse if we try to match entries of the regression model with their corresponding topic vectors. On the practical side, few quantitative experimental results (if any at all) are available for spectral decomposition based methods on LDA models.

In this paper, we extend the applicability of spectral learning methods by presenting a novel spectral decomposition algorithm that recovers the parameters of sLDA models from empirical low-order moments estimated from the data. We provide a sample complexity bound and analyze the identifiability conditions. A key step in our algorithm is a power update step that recovers the regression model in sLDA. The method uses a newly designed empirical moment to recover the regression model entries directly from the data and the reconstructed topic distributions. It places no constraints on the underlying regression model and does not increase the sample complexity by much. We also provide thorough experiments on both synthetic and real-world datasets to demonstrate the practical effectiveness of our proposed algorithm. By combining our spectral recovery algorithm with a Gibbs sampling procedure, we show superior performance in terms of language modeling, prediction accuracy and running time compared to traditional inference algorithms.

2 Preliminaries

We first overview the basics of sLDA, orthogonal tensor decomposition, and the notations to be used.

2.1 Supervised LDA

Latent Dirichlet allocation (LDA) [9] is a generative model for topic modeling of text documents. It assumes $k$ different topics with topic-word distributions $\mu_1, \ldots, \mu_k \in \Delta^{V-1}$, where $V$ is the vocabulary size and $\Delta^{V-1}$ denotes the probability simplex of a $V$-dimensional random vector. For each document, LDA models a topic mixing vector $h \in \Delta^{k-1}$ as a probability distribution over the $k$ topics. A conjugate Dirichlet prior with parameter $\alpha$ is imposed on the topic mixing vectors. A bag-of-words model is then adopted, which generates each word in the document based on $h$ and the topic-word vectors $\{\mu_i\}$.

Supervised latent Dirichlet allocation (sLDA) [8] incorporates an extra response variable $y \in \mathbb{R}$ for each document. The response variable is modeled by a linear regression model $\eta \in \mathbb{R}^k$ on either the topic mixing vector $h$ or the averaged topic assignment vector $\bar z$, where $\bar z_i = \frac{1}{m}\sum_{j=1}^{m} \mathbf{1}[z_j = i]$ with $m$ the number of words in the document. The noise is assumed to be Gaussian with zero mean and variance $\sigma^2$. Fig. 1 shows the graphical structure of the two sLDA variants mentioned above.

[Figure 1: Plate notations for two variants of sLDA: (a) $y_d = \eta^\top h_d + \varepsilon_d$; (b) $y_d = \eta^\top \bar z_d + \varepsilon_d$.]

Although previous work has mainly focused on model (b), which is convenient for Gibbs sampling and variational inference, we consider model (a) because it considerably simplifies our spectral algorithm and analysis. One may assume that whenever a document is not too short, the empirical distribution of its word topic assignments should be close to the document's topic mixing vector. Such a scheme was adopted to learn sparse topic coding models [24], and has demonstrated promising results in practice.
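To make the generative process of model (a) concrete, the following sketch samples a small synthetic corpus with real-valued responses. It is a minimal illustration in Python/NumPy, not the implementation used in this paper (which is in C++); the dimensions, random seed, and variable names are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

V, k, M, N = 500, 20, 250, 1000      # vocabulary size, topics, words/doc, documents
alpha = np.full(k, 1.0 / k)           # homogeneous Dirichlet prior (alpha_0 = 1)
mu = rng.random((V, k))
mu /= mu.sum(axis=0, keepdims=True)   # topic-word distributions: columns on the simplex
eta = rng.standard_normal(k)          # linear regression model
sigma = 0.5                           # response noise standard deviation

docs, responses = [], []
for _ in range(N):
    h = rng.dirichlet(alpha)                          # topic mixing vector
    z = rng.choice(k, size=M, p=h)                    # topic assignment per word
    words = np.array([rng.choice(V, p=mu[:, t]) for t in z])
    y = eta @ h + sigma * rng.standard_normal()       # model (a): response regressed on h
    docs.append(np.bincount(words, minlength=V))      # store per-document word counts
    responses.append(y)

X = np.stack(docs)         # N x V word-count matrix
y = np.array(responses)    # N responses
```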
2.2 High-order tensor product and orthogonal tensor decomposition

A real $p$-th order tensor $A \in \bigotimes_{i=1}^{p} \mathbb{R}^{n_i}$ belongs to the tensor product of the Euclidean spaces $\mathbb{R}^{n_i}$. Generally we assume $n_1 = n_2 = \cdots = n_p = n$, and we can identify each coordinate of $A$ by a $p$-tuple $(i_1, \ldots, i_p)$ with $i_1, \ldots, i_p \in [n]$. For instance, a $p$-th order tensor is a vector when $p = 1$ and a matrix when $p = 2$. We can also consider a $p$-th order tensor $A$ as a multilinear mapping. For $A \in \bigotimes^{p} \mathbb{R}^{n}$ and matrices $X_1, \ldots, X_p \in \mathbb{R}^{n \times m}$, the mapping $A(X_1, \ldots, X_p)$ is a $p$-th order tensor in $\bigotimes^{p} \mathbb{R}^{m}$, with
$$[A(X_1, \ldots, X_p)]_{i_1, \ldots, i_p} \triangleq \sum_{j_1, \ldots, j_p \in [n]} A_{j_1, \ldots, j_p}\,[X_1]_{j_1, i_1}[X_2]_{j_2, i_2}\cdots[X_p]_{j_p, i_p}.$$
Consider some concrete examples of such a multilinear mapping: when $A$, $X_1$ and $X_2$ are matrices, we have $A(X_1, X_2) = X_1^\top A X_2$; similarly, when $A$ is a matrix and $x$ is a vector, $A(I, x) = Ax$.

An orthogonal tensor decomposition of a tensor $A \in \bigotimes^{p} \mathbb{R}^{n}$ is a collection of orthonormal vectors $\{v_i\}_{i=1}^{k}$ and scalars $\{\lambda_i\}_{i=1}^{k}$ such that $A = \sum_{i=1}^{k} \lambda_i v_i^{\otimes p}$. Without loss of generality, we assume the $\lambda_i$ are nonnegative when $p$ is odd. Although orthogonal tensor decomposition in the matrix case can be done efficiently by singular value decomposition (SVD), it has several delicate issues in higher-order tensor spaces [2]. For instance, tensors may not have unique decompositions, and an orthogonal decomposition may not exist for every symmetric tensor [2]. Such issues are further complicated when only noisy estimates of the desired tensors are available. For these reasons, we need more advanced techniques to handle high-order tensors. In this paper, we apply the robust tensor power method [2] to recover robust eigenvalues and eigenvectors of an (estimated) third-order tensor. The algorithm recovers eigenvalues and eigenvectors up to an absolute error $\varepsilon$, while running in time polynomial in the tensor dimension and $\log(1/\varepsilon)$. Further details and analysis of the robust tensor power method are presented in Appendix A.2 and [2].

2.3 Notations

Throughout, we use $v^{\otimes p} \triangleq v \otimes v \otimes \cdots \otimes v$ to denote the $p$-th order tensor generated by a vector $v$. We use $\|v\| = \sqrt{\sum_i v_i^2}$ to denote the Euclidean norm of a vector $v$, $\|M\|$ to denote the spectral norm of a matrix $M$, and $\|T\|$ to denote the operator norm of a high-order tensor. $\|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2}$ denotes the Frobenius norm of a matrix. We use an indicator vector $x \in \mathbb{R}^{V}$ to represent a word in a document; e.g., for the $i$-th word in the vocabulary, $x_i = 1$ and $x_j = 0$ for all $j \neq i$. We also use $O \triangleq (\mu_1, \mu_2, \ldots, \mu_k) \in \mathbb{R}^{V \times k}$ to denote the topic distribution matrix, and $\tilde O \triangleq (\tilde\mu_1, \tilde\mu_2, \ldots, \tilde\mu_k)$ to denote the canonical version of $O$, where $\tilde\mu_i \triangleq \sqrt{\frac{\alpha_i}{\alpha_0(\alpha_0+1)}}\,\mu_i$ with $\alpha_0 = \sum_{i=1}^{k} \alpha_i$.
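As a quick numerical illustration of the multilinear mapping and the orthogonal tensor decomposition defined in Section 2.2, the NumPy sketch below evaluates the map with einsum, checks the matrix special case $A(X_1, X_2) = X_1^\top A X_2$, and builds a symmetric third-order tensor with a known orthogonal decomposition. The dimensions and names are illustrative choices, not anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 6, 3, 2

# Third-order multilinear map: [A(X1, X2, X3)]_{abc} = sum_{ijl} A_{ijl} [X1]_{ia} [X2]_{jb} [X3]_{lc}.
def multilinear(A, X1, X2, X3):
    return np.einsum('ijl,ia,jb,lc->abc', A, X1, X2, X3)

# Matrix case sanity check: A(X1, X2) = X1^T A X2.
A2 = rng.standard_normal((n, n))
X1, X2 = rng.standard_normal((n, m)), rng.standard_normal((n, m))
assert np.allclose(np.einsum('ij,ia,jb->ab', A2, X1, X2), X1.T @ A2 @ X2)

# A symmetric third-order tensor with an orthogonal decomposition
# A = sum_i lambda_i v_i^{(x)3} for orthonormal v_1, ..., v_k.
V_orth, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal columns
lam = np.array([2.0, 1.0])
A3 = sum(w * np.einsum('i,j,l->ijl', v, v, v) for w, v in zip(lam, V_orth.T))

# Contracting against v_1 in all three modes recovers its eigenvalue lambda_1.
print(multilinear(A3, V_orth[:, [0]], V_orth[:, [0]], V_orth[:, [0]]))   # ~ [[[2.0]]]
```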
3 Spectral Parameter Recovery

We now present a novel spectral parameter recovery algorithm for sLDA. The algorithm consists of two key components: an orthogonal tensor decomposition of observable moments, which recovers the topic distribution matrix $O$, and a power update method, which recovers the linear regression model $\eta$. We elaborate on these techniques and give a rigorous theoretical analysis in the following sections.

3.1 Moments of observable variables

Our spectral decomposition methods recover the topic distribution matrix $O$ and the linear regression model $\eta$ by manipulating moments of observable variables. In Definition 1, we define a list of moments on random variables from the underlying sLDA model.

Definition 1. We define the following moments of observable variables:
$$M_1 = \mathbb{E}[x_1], \qquad M_2 = \mathbb{E}[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0+1}\, M_1 \otimes M_1, \tag{1}$$
$$M_3 = \mathbb{E}[x_1 \otimes x_2 \otimes x_3] - \frac{\alpha_0}{\alpha_0+2}\big(\mathbb{E}[x_1 \otimes x_2 \otimes M_1] + \mathbb{E}[x_1 \otimes M_1 \otimes x_2] + \mathbb{E}[M_1 \otimes x_1 \otimes x_2]\big) + \frac{2\alpha_0^2}{(\alpha_0+1)(\alpha_0+2)}\, M_1 \otimes M_1 \otimes M_1, \tag{2}$$
$$M_y = \mathbb{E}[y\, x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0+2}\big(\mathbb{E}[y]\,\mathbb{E}[x_1 \otimes x_2] + \mathbb{E}[x_1] \otimes \mathbb{E}[y x_2] + \mathbb{E}[y x_1] \otimes \mathbb{E}[x_2]\big) + \frac{2\alpha_0^2}{(\alpha_0+1)(\alpha_0+2)}\,\mathbb{E}[y]\, M_1 \otimes M_1. \tag{3}$$

Note that the moments $M_1$, $M_2$ and $M_3$ were also defined and used in previous work [1, 2] for parameter recovery in LDA models. For the sLDA model, we need to define a new moment $M_y$ in order to recover the linear regression model $\eta$. The moments are based on observable variables in the sense that they can be estimated from i.i.d. sampled documents. For instance, $M_1$ can be estimated by computing the empirical distribution of all words, and $M_2$ can be estimated using $M_1$ and word co-occurrence frequencies. Though the moments in the above forms look complicated, we can apply elementary calculations based on the conditional independence structure of sLDA to significantly simplify them and, more importantly, to connect them with the model parameters to be recovered, as summarized in Proposition 1. The proof is deferred to Appendix B.

Proposition 1. The moments can be expressed using the model parameters as:
$$M_2 = \frac{1}{\alpha_0(\alpha_0+1)} \sum_{i=1}^{k} \alpha_i\, \mu_i \otimes \mu_i, \qquad M_3 = \frac{2}{\alpha_0(\alpha_0+1)(\alpha_0+2)} \sum_{i=1}^{k} \alpha_i\, \mu_i \otimes \mu_i \otimes \mu_i, \tag{4}$$
$$M_y = \frac{2}{\alpha_0(\alpha_0+1)(\alpha_0+2)} \sum_{i=1}^{k} \alpha_i \eta_i\, \mu_i \otimes \mu_i. \tag{5}$$

3.2 Simultaneous diagonalization

Proposition 1 shows that the moments in Definition 1 are all weighted sums of tensor products of $\{\mu_i\}_{i=1}^{k}$ from the underlying sLDA model. One idea for reconstructing $\{\mu_i\}_{i=1}^{k}$ is to perform simultaneous diagonalization on tensors of different orders. This idea has been used in a number of recent developments of spectral methods for latent variable models [1, 2, 10]. Specifically, we first whiten the second-order tensor $M_2$ by finding a matrix $W \in \mathbb{R}^{V \times k}$ such that $W^\top M_2 W = I_k$. This whitening procedure is possible whenever the topic distribution vectors $\{\mu_i\}_{i=1}^{k}$ are linearly independent (and hence $M_2$ has rank $k$). The whitening procedure and the linear independence assumption also imply that $\{W^\top \mu_i\}_{i=1}^{k}$ are orthogonal vectors (see Appendix A.2 for details), and they can subsequently be recovered by performing an orthogonal tensor decomposition on the simultaneously whitened third-order tensor $M_3(W, W, W)$. Finally, by multiplying with the pseudo-inverse of the whitening matrix, $W^{+}$, we obtain the topic distribution vectors $\{\mu_i\}_{i=1}^{k}$.

It should be noted that Jennrich's algorithm [13, 15, 17] could recover $\{\mu_i\}_{i=1}^{k}$ directly from the third-order tensor $M_3$ alone when $\{\mu_i\}_{i=1}^{k}$ are linearly independent. However, we still adopt the above simultaneous diagonalization framework because the intermediate vectors $\{W^\top \mu_i\}_{i=1}^{k}$ play a vital role in the recovery procedure of the linear regression model $\eta$.
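Before turning to the regression model, here is a small sketch of how the observable moments and the whitening step could be estimated from data. It builds $\hat M_1$, $\hat M_2$ and $\hat M_y$ from per-document word counts and whitens via a truncated eigendecomposition of the symmetric $\hat M_2$. This is a simplified NumPy sketch under the assumption of a small, dense vocabulary; the within-document co-occurrence estimator and the function names are our own choices for illustration, not code from the paper.

```python
import numpy as np

def empirical_moments(X, y, alpha0):
    """X: (N, V) per-document word counts; y: (N,) responses; alpha0: sum of the Dirichlet prior.
    Documents are assumed to have at least 3 words (cf. Theorem 1), so m*(m-1) > 0."""
    N, V = X.shape
    m = X.sum(axis=1, keepdims=True)                    # words per document
    M1 = (X / m).mean(axis=0)                           # E[x1]
    # Unbiased within-document estimate of E[x1 (x) x2]:
    # pairs of distinct positions give (c c^T - diag(c)) / (m (m - 1)) for a count vector c.
    Xp = X / (m * (m - 1))
    Ex1x2 = (X.T @ Xp - np.diag(Xp.sum(axis=0))) / N
    M2 = Ex1x2 - alpha0 / (alpha0 + 1) * np.outer(M1, M1)
    # Cross moments needed for M_y (Definition 1, Eq. (3)).
    yEx1x2 = ((X.T * y) @ Xp - np.diag((Xp * y[:, None]).sum(axis=0))) / N   # E[y x1 (x) x2]
    Eyx1 = ((X / m) * y[:, None]).mean(axis=0)                               # E[y x1] = E[y x2]
    My = (yEx1x2
          - alpha0 / (alpha0 + 2) * (y.mean() * Ex1x2
                                     + np.outer(M1, Eyx1) + np.outer(Eyx1, M1))
          + 2 * alpha0**2 / ((alpha0 + 1) * (alpha0 + 2)) * y.mean() * np.outer(M1, M1))
    return M1, M2, My

def whiten(M2, k):
    """Return W with W^T M2 W = I_k, from the top-k eigenpairs of the symmetric M2."""
    vals, vecs = np.linalg.eigh(M2)
    idx = np.argsort(vals)[::-1][:k]                    # top-k eigenvalues (assumed positive)
    return vecs[:, idx] / np.sqrt(vals[idx])

# Example usage (with X, y from the generative sketch above and alpha0 = 1.0):
# M1, M2, My = empirical_moments(X, y, alpha0=1.0)
# W = whiten(M2, k=20)
```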
3.3 The power update method

Although the linear regression model $\eta$ could be recovered in a similar manner by performing simultaneous diagonalization on $M_2$ and $M_y$, such a method has several disadvantages, thereby calling for novel solutions. First, after obtaining the entry values $\{\eta_i\}_{i=1}^{k}$ we need to match them to the topic distributions $\{\mu_i\}_{i=1}^{k}$ recovered previously. This can easily be done when we have access to the true moments, but becomes difficult when only estimates of the observable tensors are available, because the estimated moments may not share the same singular vectors due to sampling noise. A more serious problem is that when $\eta$ has duplicate entries, the orthogonal decomposition of $M_y$ is no longer unique. Though a randomized strategy similar to the one used in [1] might solve the problem, it could substantially increase the sample complexity [2] and render the algorithm impractical.

We develop a power update method to resolve the above difficulties. Specifically, after obtaining the whitened (orthonormal) vectors $\{v_i \triangleq c_i\, W^\top \mu_i\}$ (here $c_i$ is a scalar coefficient that depends on $\alpha_0$ and $\alpha_i$; see Appendix A.2 for details), we recover the entry $\eta_i$ of the linear regression model directly by computing a power update $v_i^\top M_y(W, W)\, v_i$. In this way, the matching problem is automatically solved because we know which topic distribution vector $\mu_i$ is used when recovering $\eta_i$. Furthermore, the singular values (corresponding to the entries of $\eta$) do not need to be distinct, because we are not using any uniqueness properties of the SVD of $M_y(W, W)$. As a result, our proposed algorithm works for any linear model $\eta$.

3.4 Parameter recovery algorithm

An outline of our parameter recovery algorithm for sLDA (Spectral-sLDA) is given in Alg. 1. First, empirical estimates of the observable moments in Definition 1 are computed from the given documents. The simultaneous diagonalization method is then used to reconstruct the topic distribution matrix $O$ and its prior parameter $\alpha$. After obtaining $O = (\mu_1, \ldots, \mu_k)$, we use the power update method introduced in the previous section to recover the linear regression model $\eta$.

Algorithm 1: Spectral parameter recovery algorithm for sLDA. Input parameters: $\alpha_0$, $L$, $T$.
1: Compute empirical moments and obtain $\hat M_2$, $\hat M_3$ and $\hat M_y$.
2: Find $\hat W \in \mathbb{R}^{V \times k}$ such that $\hat M_2(\hat W, \hat W) = I_k$.
3: Find robust eigenvalues and eigenvectors $(\hat\lambda_i, \hat v_i)$ of $\hat M_3(\hat W, \hat W, \hat W)$ using the robust tensor power method [2] with parameters $L$ and $T$.
4: Recover prior parameters: $\hat\alpha_i \leftarrow \frac{4\alpha_0(\alpha_0+1)}{(\alpha_0+2)^2 \hat\lambda_i^2}$.
5: Recover topic distributions: $\hat\mu_i \leftarrow \frac{\alpha_0+2}{2}\,\hat\lambda_i\,(\hat W^{+})^{\top} \hat v_i$.
6: Recover the linear regression model: $\hat\eta_i \leftarrow \frac{\alpha_0+2}{2}\,\hat v_i^\top \hat M_y(\hat W, \hat W)\, \hat v_i$.
7: Output: $\hat\eta$, $\hat\alpha$ and $\{\hat\mu_i\}_{i=1}^{k}$.

Alg. 1 admits three hyper-parameters: $\alpha_0$, $L$ and $T$. $\alpha_0$ is defined as the sum of all entries of the prior parameter $\alpha$. Following the conventions in [1, 2], we assume that $\alpha_0$ is known a priori and use this value to perform parameter estimation. It should be noted that this is a mild assumption, as in practice a homogeneous vector $\alpha$ is usually assumed and the entire vector is known [20]. The $L$ and $T$ parameters control the number of iterations in the robust tensor power method. In general, the robust tensor power method runs in $O(k^3 L T)$ time. To ensure sufficient recovery accuracy, $L$ should be at least a linear function of $k$ and $T$ should be set as $T = \Omega(\log(k) + \log\log(\lambda_{\max}/\varepsilon))$, where $\lambda_{\max} = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\min}}}$ and $\varepsilon$ is an error tolerance parameter. Appendix A.2 and [2] provide a deeper analysis of the choice of $L$ and $T$.
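To illustrate steps 3 to 6 of Alg. 1 concretely, the sketch below runs a plain tensor power iteration with deflation on a whitened third-order tensor and then applies the power update for $\eta$. It is a simplified stand-in for the robust tensor power method of [2] (without the restart pruning and robustness checks analyzed there), written in NumPy; the function and variable names are ours, and the whitened moments are assumed to be given as inputs.

```python
import numpy as np

def tensor_power(T3, k, n_starts=10, n_iters=100, rng=None):
    """Deflation-based power iteration: eigenpairs (lambda_i, v_i) of a whitened 3rd-order tensor."""
    rng = rng if rng is not None else np.random.default_rng(0)
    T3 = T3.copy()
    lams, vecs = [], []
    for _ in range(k):
        best_lam, best_v = -np.inf, None
        for _ in range(n_starts):                      # multiple random restarts, keep the best
            v = rng.standard_normal(T3.shape[0])
            v /= np.linalg.norm(v)
            for _ in range(n_iters):                   # v <- T3(I, v, v) / ||T3(I, v, v)||
                v = np.einsum('ijl,j,l->i', T3, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('ijl,i,j,l->', T3, v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        lams.append(best_lam); vecs.append(best_v)
        T3 -= best_lam * np.einsum('i,j,l->ijl', best_v, best_v, best_v)   # deflate
    return np.array(lams), np.stack(vecs, axis=1)

def recover_slda(W, M3_whitened, My, alpha0, k):
    """Steps 3-6 of Alg. 1: W is the (V, k) whitening matrix, M3_whitened is M3(W, W, W)."""
    lam, V = tensor_power(M3_whitened, k)
    alpha_hat = 4 * alpha0 * (alpha0 + 1) / ((alpha0 + 2) ** 2 * lam ** 2)     # step 4
    mu_hat = (alpha0 + 2) / 2 * (np.linalg.pinv(W).T @ V) * lam                # step 5, columns mu_i
    My_white = W.T @ My @ W                                                    # M_y(W, W)
    eta_hat = (alpha0 + 2) / 2 * np.einsum('ij,jl,li->i', V.T, My_white, V)    # step 6 power update
    return alpha_hat, mu_hat, eta_hat
```

In combination with the moment and whitening sketch above (and an estimate of the whitened third moment), `recover_slda` reproduces the overall pipeline of Alg. 1 up to a permutation of the recovered topics.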
3.5 Speeding up moment computation

In Alg. 1, a straightforward computation of the third-order tensor $\hat M_3$ requires $O(NM^3)$ time and $O(V^3)$ storage, where $N$ is the corpus size and $M$ is the number of words per document. Such time and space complexities are clearly prohibitive for real applications, where the vocabulary usually contains tens of thousands of terms. However, we can employ a trick similar to that in [11] to speed up the moment computation. We first note that only the whitened tensor $\hat M_3(\hat W, \hat W, \hat W)$ is needed in our algorithm, which takes only $O(k^3)$ storage. Another observation is that the most expensive term in $\hat M_3$ can be written as $\sum_{i=1}^{r} c_i\, u_{i,1} \otimes u_{i,2} \otimes u_{i,3}$, where $r$ is proportional to $N$ and each $u_{i,\cdot}$ contains at most $M$ non-zero entries. This allows us to compute $\hat M_3(\hat W, \hat W, \hat W)$ in $O(NMk)$ time by computing $\sum_{i=1}^{r} c_i\, (\hat W^\top u_{i,1}) \otimes (\hat W^\top u_{i,2}) \otimes (\hat W^\top u_{i,3})$. Appendix B.2 provides more details about this speed-up trick. The overall time complexity is $O(NM(M + k^2) + V^2 + k^3 L T)$ and the space complexity is $O(V^2 + k^3)$.

4 Sample Complexity Analysis

We now analyze the sample complexity of Alg. 1 required to achieve $\varepsilon$-error with high probability. For clarity, we focus on presenting the main results, deferring the proof details to Appendix A, including the proofs of the important lemmas needed for the main theorem.

Theorem 1. Let $\sigma_1(\tilde O)$ and $\sigma_k(\tilde O)$ be the largest and the smallest singular values of the canonical topic distribution matrix $\tilde O$. Define $\lambda_{\min} \triangleq \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\max}}}$ and $\lambda_{\max} \triangleq \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\min}}}$, with $\alpha_{\max}$ and $\alpha_{\min}$ the largest and the smallest entries of $\alpha$. Suppose $\hat\mu$, $\hat\alpha$ and $\hat\eta$ are the outputs of Algorithm 1, and $L$ is at least a linear function of $k$. Fix $\delta \in (0, 1)$. For any small error-tolerance parameter $\varepsilon > 0$, if Algorithm 1 is run with parameter $T = \Omega(\log(k) + \log\log(\lambda_{\max}/\varepsilon))$ on $N$ i.i.d. sampled documents (each containing at least 3 words) with $N \geq \max(n_1, n_2, n_3)$, where
$$n_1 = C_1\,\big(1 + \sqrt{\log(6/\delta)}\big)^2\,\frac{\alpha_0^2(\alpha_0+1)^2}{\alpha_{\min}}, \qquad
n_2 = C_2\,\frac{\big(1 + \sqrt{\log(15/\delta)}\big)^2}{\varepsilon^2\,\sigma_k(\tilde O)^4}\,\max\!\Big\{\big(\|\eta\| + \Phi^{-1}(\delta/60\sigma)\big)^2,\ \alpha_{\max}^2\,\sigma_1(\tilde O)^2\Big\},$$
$$n_3 = C_3\,\frac{\big(1 + \sqrt{\cdots}\,\big)^2}{\sigma_k(\tilde O)^{10}}\,\max\{1, \cdots\},$$
and $C_1$, $C_2$ and $C_3$ are universal constants, then with probability at least $1 - \delta$ there exists a permutation $\pi: [k] \to [k]$ such that for every topic $i$ the following hold:
1. $|\alpha_i - \hat\alpha_{\pi(i)}| \leq \frac{4\alpha_0(\alpha_0+1)(\lambda_{\max} + 5\varepsilon)}{(\alpha_0+2)^2\,\lambda_{\min}^2\,(\lambda_{\min} - 5\varepsilon)^2}\, 5\varepsilon$, if $\lambda_{\min} > 5\varepsilon$;
2. $\|\mu_i - \hat\mu_{\pi(i)}\| \leq \Big(3\sigma_1(\tilde O)\,\frac{8\alpha_{\max}}{\lambda_{\min}} + 5(\alpha_0+2)\Big)\,\varepsilon$;
3. $|\eta_i - \hat\eta_{\pi(i)}| \leq \Big(\frac{\|\eta\|}{\lambda_{\min}} + (\alpha_0 + 2)\Big)\,\varepsilon$.

[Figure 2: Reconstruction errors ($\ell_1$ norm) of $\alpha$, $\eta$ and $\mu$ for Alg. 1, for training sizes from 300 to 10,000 documents and $M = 250$ or $M = 500$ words per document. Error bars denote the standard deviations measured over 3 independent trials under each setting.]

In brief, the proof is based on matrix perturbation lemmas (see Appendix A.1) and on an analysis of the orthogonal tensor decomposition methods (including SVD and the robust tensor power method) performed on inaccurate tensor estimates (see Appendix A.2). The sample complexity lower bound consists of three terms, $n_1$ to $n_3$. The $n_3$ term comes from the sample complexity bound for the robust tensor power method [2]; the $(\|\eta\| + \Phi^{-1}(\delta/60\sigma))^2$ term in $n_2$ characterizes the recovery accuracy for the linear regression model $\eta$, and the $\alpha_{\max}^2\sigma_1(\tilde O)^2$ term arises when we try to recover the topic distribution vectors $\mu$; finally, the term $n_1$ is required so that some technical conditions are met. The $n_1$ term does not depend on either $k$ or $\sigma_k(\tilde O)$, and can largely be neglected in practice.

An important implication of Theorem 1 is that it provides a sufficient condition for a supervised LDA model to be identifiable, as shown in Remark 1. To some extent, Remark 1 is the best identifiability result possible under our inference framework, because it makes no restriction on the linear regression model $\eta$, and the linear independence assumption is unavoidable without making further assumptions on the topic distribution matrix $O$.

Remark 1. Given a sufficiently large number of i.i.d. sampled documents with at least 3 words per document, a supervised LDA model $\mathcal{M} = (\alpha, \mu, \eta)$ is identifiable if $\alpha_0 = \sum_{i=1}^{k} \alpha_i$ is known and $\{\mu_i\}_{i=1}^{k}$ are linearly independent.
We also make remarks on the indirectly observed quantities appearing in Theorem 1 (e.g., $\sigma_k(\tilde O)$) and give a simplified sample complexity bound for some special cases; these can be found in Appendix A.4.

5 Experiments

5.1 Dataset description and implementation details

We perform experiments on both synthetic and real-world datasets. The synthetic data are generated in a similar manner as in [22], with a fixed vocabulary of size $V = 500$. We generate the topic distribution matrix $O$ by first sampling each entry from a uniform distribution and then normalizing every column of $O$. The linear regression model $\eta$ is sampled from a standard Gaussian distribution. The prior parameter $\alpha$ is assumed to be homogeneous, i.e., $\alpha = (1/k, \ldots, 1/k)$. Documents and response variables are then generated from the sLDA model specified in Sec. 2.1.

For real-world data, we use the large-scale dataset built on Amazon movie reviews [16] to demonstrate the practical effectiveness of our algorithm. The dataset contains 7,911,684 movie reviews written by 889,176 users from Aug 1997 to Oct 2012. Each movie review is accompanied by a score from 1 to 5 indicating how much the user likes a particular movie. The median number of words per review is 101. A vocabulary with $V = 5{,}000$ terms is built by selecting high-frequency words. We also pre-process the dataset by shifting the review scores so that they have zero mean.

Both Gibbs sampling for the sLDA model in Fig. 1(b) and the proposed spectral recovery algorithm are implemented in C++. For our spectral algorithm, the hyperparameters $L$ and $T$ are set to 100, which is sufficiently large for all settings in our experiments. Since Alg. 1 only recovers the topic model itself, we use Gibbs sampling to iteratively sample topic mixing vectors $h$ and topic assignments $z$ for each word in order to perform prediction on a held-out dataset.

[Figure 3: Mean squared error and negative per-word log-likelihood of Alg. 1 (Spec-sLDA) and Gibbs-sLDA, with the reference (true) model shown for comparison, for $k = 20$ and $k = 50$. Each document contains $M = 500$ words; the X axis denotes the training size ($\times 10^3$).]

[Figure 4: pR$^2$ scores and negative per-word log-likelihood of Gibbs-sLDA, Spec-sLDA and the hybrid method, for $\alpha = 0.01$, $0.1$ and $1.0$. The X axis indicates the number of topics; error bars indicate the standard deviation of 5-fold cross-validation.]

5.2 Convergence of reconstructed model parameters

We demonstrate how the sLDA model reconstructed by Alg. 1 converges to the underlying true model as more observations become available. Fig. 2 presents the $\ell_1$-norm reconstruction errors of $\alpha$, $\eta$ and $\mu$. The number of topics $k$ is set to 20 and the number of words per document (i.e., $M$) is set to 250 and 500. Since Spectral-sLDA can only recover topic distributions up to a permutation over $[k]$, a minimum-weight graph matching was computed between $O$ and $\hat O$ to find an optimal permutation. Fig. 2 shows that the reconstruction errors for all the parameters go down rapidly as we obtain more documents.
Furthermore, though Theorem 1 does not involve the number of words per document, the simulation results demonstrate a significant improvement when more words are observed in each document, which is a nice complement to the theoretical analysis.

5.3 Prediction accuracy and per-word likelihood

We compare the prediction accuracy and per-word likelihood of Spectral-sLDA and Gibbs-sLDA on both synthetic and real-world datasets. On the synthetic datasets, the regression error is measured by the mean squared error (MSE), and the per-word log-likelihood is defined as $\log_2 p(w \mid h, O) = \log_2 \sum_{k=1}^{K} p(w \mid z = k, O)\, p(z = k \mid h)$. The hyper-parameters used in our Gibbs sampling implementation are the same as the ones used to generate the datasets. Fig. 3 shows that Spectral-sLDA consistently outperforms Gibbs-sLDA. Our algorithm also enjoys the advantage of being less variable, as indicated by the curves and error bars. Moreover, when the number of training documents is sufficiently large, the performance of the reconstructed model is very close to that of the underlying true model (due to the randomness in the data generating process, the true model itself has a non-zero prediction error), which implies that Alg. 1 can correctly identify an sLDA model from its observations, thereby supporting our theory.

We also test both algorithms on the large-scale Amazon movie review dataset. The quality of the prediction is assessed with predictive $R^2$ (pR$^2$) [8], a normalized version of MSE, defined as $\mathrm{pR}^2 \triangleq 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}$, where $\hat y_i$ is the estimate, $y_i$ is the ground truth, and $\bar y$ is the average true value. We report the results under various settings of $\alpha$ and $k$ in Fig. 4, with the $\sigma$ hyper-parameter of Gibbs-sLDA selected via cross-validation on a smaller subset of documents. Apart from Gibbs-sLDA and Spectral-sLDA, we also test the performance of a hybrid algorithm, which runs Gibbs sampling using the models reconstructed by Spectral-sLDA as initializations. Fig. 4 shows that in general Spectral-sLDA does not perform as well as Gibbs sampling. One possible reason is that real-world datasets are not exact i.i.d. samples from an underlying sLDA model. However, a significant improvement can be observed when the Gibbs sampler is initialized with models reconstructed by Spectral-sLDA instead of random initializations. This is because Spectral-sLDA helps avoid the local optima that trap local search methods such as Gibbs sampling. Similar improvements for spectral methods were also observed in previous work [10].

Table 1: Training time of Gibbs-sLDA and Spectral-sLDA, measured in minutes. k is the number of topics and n is the number of documents used in training.

                   k = 10                          k = 50
n (x10^4)      1     5     10    50    100     1     5     10    50     100
Gibbs-sLDA     0.6   3.0   6.0   30.5  61.1    2.9   14.3  28.2  145.4  281.8
Spec-sLDA      1.5   1.6   1.7   2.9   4.3     3.1   3.6   4.3   9.5    16.2

Table 2: Prediction accuracy and per-word log-likelihood of Gibbs-sLDA and the hybrid algorithm. The initialization is obtained by running Alg. 1 on a collection of 1 million documents, while n is the number of documents used in Gibbs sampling; k = 8 topics are used. Standard deviations are given in parentheses.

                   predictive R^2                  Negative per-word log-likelihood
log10 n        3       4       5       6        3       4       5       6
Gibbs-sLDA     0.00    0.04    0.11    0.14     7.72    7.55    7.45    7.42
              (0.01)  (0.02)  (0.02)  (0.01)   (0.01)  (0.01)  (0.01)  (0.01)
Hybrid         0.02    0.17    0.18    0.18     7.70    7.49    7.40    7.36
              (0.01)  (0.03)  (0.03)  (0.03)   (0.01)  (0.02)  (0.01)  (0.01)

Note that for k > 8 the performance of Spectral-sLDA deteriorates significantly.
This phenomenon can be explained by the nature of Spectral-sLDA itself: one crucial step in Alg. 1 is to whiten the empirical moment $\hat M_2$, which is only possible when the underlying topic matrix $O$ has full rank. For the Amazon movie review dataset, it is impossible to whiten $\hat M_2$ when the underlying model contains more than 8 topics. This interesting observation shows that the Spectral-sLDA algorithm can also be used for model selection, to avoid overfitting caused by using too many topics.

5.4 Time efficiency

The proposed spectral recovery algorithm is very time efficient because it avoids the time-consuming iterative steps of traditional inference and sampling methods. Furthermore, the empirical moment computation, the most time-consuming part of Alg. 1, consists only of elementary operations and can easily be optimized. Table 1 compares the training time of Gibbs-sLDA and Spectral-sLDA and shows that our proposed algorithm is over 15 times faster than Gibbs sampling, especially for large document collections. Although both algorithms are implemented in a single-threaded manner, Spectral-sLDA is very easy to parallelize because, unlike iterative local search methods, the moment computation step in Alg. 1 does not require much communication or synchronization.

There might be concerns about the claimed time efficiency, however, because significant performance improvements could only be observed when Spectral-sLDA is used together with Gibbs-sLDA, and the Gibbs sampling step might slow down the entire procedure. To see why this is not the case, we show in Table 2 that, in order to obtain high-quality models and predictions, only a very small collection of documents is needed after the model reconstruction of Alg. 1. In contrast, Gibbs-sLDA with random initialization requires more data to reach reasonable performance. To get a more intuitive idea of how fast our proposed method is, we combine Tables 1 and 2 and see that by running Spectral-sLDA on $10^6$ documents and then post-processing the reconstructed models using Gibbs sampling on only $10^4$ documents, we obtain a pR$^2$ score of 0.17 in 5.8 minutes, while Gibbs-sLDA takes over an hour to process a million documents and achieves a pR$^2$ score of only 0.14. Similarly, the hybrid method takes only 10 minutes to reach a per-word likelihood comparable to that of the Gibbs sampling algorithm, which requires more than an hour of running time.

6 Conclusion

We propose a novel spectral decomposition based method to reconstruct supervised LDA models from labeled documents. Although our work has mainly focused on tensor decomposition based algorithms, it remains an interesting open problem whether NMF based methods could also be applied to obtain better sample complexity bounds and superior practical performance for supervised topic models.

Acknowledgements

The work was done when Y.W. was at Tsinghua. The work is supported by the National Basic Research Program of China (No. 2013CB329403), National NSF of China (Nos. 61322308, 61332007), and the Tsinghua University Initiative Scientific Research Program (No. 20121088071).

References

[1] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, and Y.-K. Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. arXiv:1204.6703, 2012.
[2] A. Anandkumar, R. Ge, D. Hsu, S. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. arXiv:1210.7559, 2012.
[3] A. Anandkumar, D. Hsu, and S. Kakade. A method of moments for mixture models and hidden Markov models. arXiv:1203.0683, 2012.
[4] S. Arora, R. Ge, Y. Halpern, D. Mimno, and A. Moitra. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.
[5] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization - provably. In STOC, 2012.
[6] S. Arora, R. Ge, and A. Moitra. Learning topic models - going beyond SVD. In FOCS, 2012.
[7] V. Bittorf, B. Recht, C. Re, and J. Tropp. Factoring nonnegative matrices with linear programs. In NIPS, 2012.
[8] D. Blei and J. McAuliffe. Supervised topic models. In NIPS, 2007.
[9] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, (3):993–1022, 2003.
[10] A. Chaganty and P. Liang. Spectral experts for estimating mixtures of linear regressions. In ICML, 2013.
[11] S. Cohen and M. Collins. Tensor decomposition for fast parsing with latent-variable PCFGs. In NIPS, 2012.
[12] M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[13] J. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions, with applications to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, 1977.
[14] S. Lacoste-Julien, F. Sha, and M. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, 2008.
[15] S. Leurgans, R. Ross, and R. Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.
[16] J. McAuley and J. Leskovec. From amateurs to connoisseurs: Modeling the evolution of user expertise through online reviews. In WWW, 2013.
[17] A. Moitra. Algorithmic aspects of machine learning. 2014.
[18] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In SIGKDD, 2008.
[19] R. Redner and H. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984.
[20] M. Steyvers and T. Griffiths. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2007.
[21] C. Wang, D. Blei, and F.-F. Li. Simultaneous image classification and annotation. In CVPR, 2009.
[22] J. Zhu, A. Ahmed, and E. Xing. MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research, (13):2237–2278, 2012.
[23] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs max-margin topic models with data augmentation. Journal of Machine Learning Research, (15):1073–1110, 2014.
[24] J. Zhu and E. Xing. Sparse topic coding. In UAI, 2011.