# Topic Modeling via Full Dependence Mixtures

Dan Fisher¹, Mark Kozdoba¹, Shie Mannor¹²

¹Technion, Israel Institute of Technology  ²NVIDIA Research

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

## Abstract

In this paper we introduce a new approach to topic modeling that scales to large datasets by using a compact representation of the data and by leveraging the GPU architecture. In this approach, topics are learned directly from the co-occurrence data of the corpus. In particular, we introduce a novel mixture model which we term the Full Dependence Mixture (FDM) model. FDMs model the second moment under general generative assumptions on the data. While there is previous work on topic modeling using second moments, we develop a direct stochastic optimization procedure for fitting an FDM with a single Kullback-Leibler objective. Moment methods in general have the benefit that an iteration no longer needs to scale with the size of the corpus. Our approach allows us to leverage standard optimizers and GPUs for the problem of topic modeling. In particular, we evaluate the approach on three large datasets, NeurIPS papers, a Twitter corpus, and the full English Wikipedia, with a large number of topics, and show that the approach performs comparably to or better than the standard benchmarks.

## 1. Introduction

A topic model is a probabilistic model of the joint distribution of the data that is typically used as a dimensionality reduction technique in a variety of applications, such as text mining, information retrieval, and recommender systems. In this paper we concentrate on topic models for text data.

Perhaps the most widely used topic model for text is the Latent Dirichlet Allocation (LDA) model (Blei et al., 2002). LDA is a fully generative probabilistic model, and is typically learned through a Bayesian approach by sampling from the parameter distribution given the data, via collapsed Gibbs samplers (Griffiths & Steyvers, 2004), (Steyvers & Griffiths, 2007), variational methods (Blei et al., 2002), (Foulds et al., 2013), (Hoffman et al., 2013), or other related approaches. A typical learning procedure for Bayesian methods involves an iteration over the entire corpus, where a topic assignment is sampled for each token or document in the corpus. In order to apply these methods to large corpora, a variety of optimized procedures were developed, where the speedup is achieved either via parallelization or via a more economical sampling scheme per token. An additional level of complexity is added by the fact that LDA has two hyperparameters, the Dirichlet priors of the token distribution in topics and of the topic distribution in documents. This may complicate the application of the methods, since the choice of parameters may influence the results even for relatively large data sizes. A detailed discussion of LDA optimization is given in Section 2.

In this paper we propose an alternative approach to topic modeling, based on principles that are fundamentally different from the standard LDA optimization techniques. We show that using this approach it is possible to analyze large datasets on a single workstation with a GPU, and to obtain results that are comparable to or better than the standard benchmarks. A reference implementation of the algorithm is available at https://github.com/fisherd3/fdm.
In the rest of this section we introduce some necessary notation, describe our model and the related loss function, discuss the optimization procedure for this loss, and overview the experimental results of the paper.

### 1.1. Method Description

We assume that the text data was generated from a pLSA (Probabilistic Latent Semantic Analysis) probability model (Hofmann, 1999), as follows: Denote by $X$ the set of distinct tokens in the corpus, and suppose that we are given $T$ topics, $\mu_t$, $t \le T$, where each topic is a probability distribution on the set of tokens $X$. Then the pLSA assumption is that each document $d$ is generated by independently sampling tokens from a mixture of topics, denoted $\nu_d$:

$$\nu_d = \sum_{t \le T} \theta_d(t)\,\mu_t, \tag{1}$$

where $\theta_d(t) \ge 0$ and $\sum_t \theta_d(t) = 1$ for every document $d$. Note that we do not specify the generative model for $\theta_d$. In this sense, pLSA is a semi-generative model, and is more general than LDA.

Next, for every document in the corpus we construct its token co-occurrence probability matrix, and we take the co-occurrence probability matrix of the corpus to be the average of the document matrices. Let $N = |X|$ be the dictionary size, the number of distinct tokens in the corpus. Then the co-occurrence matrix $\widehat{M}$ of the corpus is an $N \times N$ matrix with non-negative entries that sum to 1. Suppose that one performs the following experiment: sample a document from the corpus at random, and then sample two tokens independently from the document. Then $\widehat{M}_{u,v}$ is the probability of observing the pair $u, v$ in this experiment (up to a small modification; full details are given in Section 3).

Now, if one assumes the pLSA model of the corpus, then it can be shown that the expectation of $\widehat{M}$ should be of the form

$$M_{u,v}(\mu, \alpha) = \sum_{i,j=1}^{T} \alpha_{i,j}\,\mu_i(u)\,\mu_j(v), \tag{2}$$

where $\mu_i$ are the topics and $\alpha_{i,j} \ge 0$, $\alpha_{i,j} = \alpha_{j,i}$, $\sum_{i,j} \alpha_{i,j} = 1$ represent the corpus level topic-topic correlations. We refer to matrices of the form (2) as Full Dependence Mixture (FDM) matrices. This is due to the analogy with standard multinomial mixture models (also known as categorical mixture models), which can be represented in the form (2) but with $\alpha$ restricted to be a diagonal matrix. Multinomial mixture models correspond to the special case where each document contains samples from only one topic, where the topic may be different for different documents. Equivalently, in multinomial mixtures, each $\theta_d$ is 0 on all but one coordinate.

In this paper, we consider a set of topics $\mu_t$ to be a good fit for the data if there are some correlation coefficients $\alpha$ such that $M(\mu, \alpha)$ is close to $\widehat{M}$, the FDM generated from the data. Specifically, we define the loss by

$$L(\mu, \alpha) = -\sum_{u,v \in X} \widehat{M}_{u,v} \log M(\mu, \alpha)_{u,v}, \tag{3}$$

and we are interested in minimizing $L$ over all $\mu, \alpha$. Clearly, minimizing $L$ is equivalent to minimizing the Kullback-Leibler divergence between $M(\mu, \alpha)$ and $\widehat{M}$, viewed as probability distributions over $X \times X$.

To gain basic intuition as to why an $M(\mu, \alpha)$ that approximates the empirical matrix $\widehat{M}$ well should yield good topics, it is useful to consider again the simple case of the multinomial mixture, and moreover, the specific case where the topics are disjoint. In this case, the matrix $\widehat{M}$ will be block diagonal (up to a reordering of the dictionary), with disjoint blocks that correspond to the topics. Thus, provided enough sampled documents, the topics can easily be read off from the matrix.

Figure 1. An example of an empirical matrix $\widehat{M}$.

An example of a more complicated matrix $\widehat{M}$ is shown in Figure 1.
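To make the objects above concrete, the following is a minimal NumPy sketch of the FDM matrix (2) and the loss (3). The function and variable names (`fdm_matrix`, `fdm_loss`, `mu`, `alpha`, `M_hat`) are ours and are not taken from the reference implementation.

```python
import numpy as np

def fdm_matrix(mu, alpha):
    """FDM matrix of Eq. (2): M(mu, alpha)_{u,v} = sum_{i,j} alpha_{ij} mu_i(u) mu_j(v).

    mu    : (T, N) array; each row is a topic, i.e. a distribution over the N tokens.
    alpha : (T, T) array with non-negative entries summing to 1.
    """
    return mu.T @ alpha @ mu          # (N, N) matrix

def fdm_loss(M_hat, mu, alpha, eps=1e-12):
    """Loss (3): cross-entropy of the model FDM against the empirical matrix M_hat."""
    M = fdm_matrix(mu, alpha)
    return -np.sum(M_hat * np.log(M + eps))
```

Forming the full $N \times N$ matrix in this way is exactly the direct computation that becomes infeasible for large $N$ and $T$, which motivates the stochastic reformulation discussed in Section 1.2 below.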
In the example of Figure 1, the ground set is $X = \{1, \ldots, 100\}$, and the topics are uniform on the intervals $[1, 40]$, $[30, 70]$, and $[60, 100]$ respectively. The documents were generated from the mixture (1), with $\theta_d$ sampled from a non-uniform Dirichlet, $\theta_d \sim \mathrm{Dir}((2, 1, 1.5))$, which makes the topics appear with different frequencies in the data. Although here the relation between the topics and the matrix is more involved, one can still see that the topics could be traced from the matrix.

In Section 4 we show the asymptotic consistency of the loss (3) for topics which satisfy the classical anchor words assumption ((Donoho & Stodden, 2003), (Arora et al., 2013)). That is, if the topics satisfy the anchor words assumption, then given enough samples the topics can be uniquely reconstructed from the matrix $\widehat{M}$ by minimizing the loss (3). The anchor words assumption roughly states that each topic has a word that is unique to that particular topic. Note that this word does not have to appear in each document containing the topic, and may in fact have a relatively low probability inside the topic itself. It is known (Arora et al., 2013), and easy to verify, that natural topics, such as topics produced by learning LDA, do satisfy the anchor words assumption. We refer to Section 4 for further details.

The advantage of using the cost $L(\mu, \alpha)$ is that it depends on the corpus only through the matrix $\widehat{M}$. Therefore, the size of the corpus does not enter the optimization problem directly, and we are dealing with a fixed size, $N \times N$ problem. This is a general feature of reconstruction-through-moments approaches, such as (Arora et al., 2013), (Anandkumar et al., 2012). In particular, the number of variables for the optimization is $TN + T^2$, in contrast to variational or Gibbs sampler based methods, which either have $T$ variables for every document, or have a single variable for every token in the corpus, respectively.

### 1.2. Optimization of the Objective $L(\mu, \alpha)$

For smaller problems, one may directly optimize the objective (3) using gradient descent methods. However, note that if one computes $L(\mu, \alpha)$ directly, then one has to compute $M(\mu, \alpha)$, which is a sum of $T^2$ matrices of size $N^2$. Indeed, denote by $M^{i,j}$ the $N \times N$ matrix with entries $M^{i,j}_{u,v} = \mu_i(u)\,\mu_j(v)$. Then $M(\mu, \alpha) = \sum_{i,j \le T} \alpha_{i,j}\,M^{i,j}$. On standard GPU computing architectures, all $T^2$ of these matrices would have to be in memory simultaneously, which is prohibitive even for moderate values of $N, T$.

To resolve this issue, we reformulate the optimization of $L$ as a stochastic optimization problem in $u, v$. To this end, note that $L$ is an expectation of the term $-\log M(\mu, \alpha)_{u,v}$ over pairs of tokens $(u, v)$ sampled from $\widehat{M}$, where $\widehat{M}$ is viewed as a probability distribution over $X \times X$. Formally,

$$L(\mu, \alpha) = -\mathbb{E}_{(u,v) \sim \widehat{M}}\,\log M(\mu, \alpha)_{u,v}. \tag{4}$$

Therefore, given $(u, v)$, one only has to compute the gradient of $M(\mu, \alpha)_{u,v}$ at $\mu, \alpha$, rather than the full $M(\mu, \alpha)$, which is a much smaller computation, of size $O(T^2)$, and this can be done for moderate $(u, v)$ batch sizes. This approach makes the optimization of $L(\mu, \alpha)$ practically feasible. The full algorithm, given as Algorithm 2 below, is discussed in detail in Section 3.

Note that this approach differs from the standard stochastic gradient descent flow on the GPU, where the batches consist of data samples (documents in the case of text data). Instead, here the data is already summarized as $\widehat{M}$, and the batches consist of pairs of tokens $(u, v)$ that we sample actively from $\widehat{M}$.
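The following is a minimal PyTorch sketch of one step of this stochastic scheme; the formal procedure is Algorithm 2 in Section 3. The softmax parametrization and the Adam optimizer follow the choices described there, while the sizes, the random stand-in for $\widehat{M}$, and all names are illustrative assumptions.

```python
import numpy as np
import torch

T, N, B = 20, 500, 1024                       # illustrative sizes (topics, tokens, batch)
rng = np.random.default_rng(0)
M_hat = rng.dirichlet(np.ones(N * N)).reshape(N, N)   # stand-in for the empirical matrix

# Free (unconstrained) variables; softmax maps them onto the probability simplex.
mu_free = torch.randn(T, N, requires_grad=True)
alpha_free = torch.randn(T, T, requires_grad=True)
opt = torch.optim.Adam([mu_free, alpha_free], lr=1e-2)

def sample_pairs(batch_size):
    """Sample token pairs (u, v) from M_hat viewed as a distribution over X x X, as in Eq. (4)."""
    flat = rng.choice(N * N, size=batch_size, p=M_hat.ravel())
    return torch.as_tensor(flat // N), torch.as_tensor(flat % N)

def batch_objective(u, v):
    """Average of log M(mu, alpha)_{u_k, v_k} over the batch; no N x N matrix is ever formed."""
    mu = torch.softmax(mu_free, dim=1)                              # (T, N) topics
    alpha = torch.softmax(alpha_free.reshape(-1), dim=0).reshape(T, T)
    m_uv = torch.einsum('ij,ki,kj->k', alpha, mu[:, u].T, mu[:, v].T)
    return torch.log(m_uv + 1e-12).mean()

u, v = sample_pairs(B)
opt.zero_grad()
loss = -batch_objective(u, v)    # minimize the negative, i.e. ascend the batch log-likelihood
loss.backward()
opt.step()
```

In a full run this sampling and update step would simply be repeated until the batch objective converges, as in Algorithm 2.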
### 1.3. Experimental Results

We evaluate the FDM algorithm on a semi-synthetic dataset where the ground truth topics are known and taken to be realistic (topics learned by LDA on NeurIPS data), and on three real-world datasets: the NeurIPS full papers corpus, a very large (20 million tweets) Twitter dataset that was collected via the Twitter API, and the full English Wikipedia. For the semi-synthetic dataset the topic quality was measured by comparison to the ground truth topics, while for the real datasets coherence and log-likelihood on a hold-out set were measured. We compare FDM to a state of the art LDA collapsed Gibbs sampler (termed Sparse LDA), and to the topic learning algorithm introduced in (Arora et al., 2013) (termed EP). Additional details on these benchmarks are given in Section 2.

For the semi-synthetic models, we find that while, as expected, Sparse LDA with the true hyperparameters performs somewhat better, FDM performs similarly to Sparse LDA with close but different hyperparameters. All algorithms find a reasonable approximation of the topics, although EP performance is notably weaker. On NeurIPS, FDM performs similarly to Sparse LDA, while on the Twitter and Wikipedia data FDM performance is somewhat better. Both algorithms outperform EP.

To summarize, the contributions of this paper are as follows: We introduce a new approach to topic modeling, via the fitting of the empirical FDM $\widehat{M}$ to the topic FDM $M(\mu, \alpha)$ by minimizing the Kullback-Leibler objective (3), and prove the consistency of the associated loss. We introduce a practical optimization procedure for this method, and experimentally show that it produces topics that are comparable to or better than the state of the art approaches, while using principles that significantly differ from the existing methods.

## 2. Literature

The subject of optimization in topic models has received significant attention in the literature. We first describe the general directions of this research. Variational methods for the LDA objective were developed in the paper that introduced the LDA model (Blei et al., 2002); see also (Foulds et al., 2013), (Hoffman et al., 2013). The collapsed Gibbs sampler for LDA was introduced in (Griffiths & Steyvers, 2004), (Steyvers & Griffiths, 2007). Optimizations of the collapsed sampler that exploit the sparsity of the topics in a document were developed in (Yao et al., 2009), (Xiao & Stibor, 2010), (Li et al., 2014) and yielded further performance improvements. Streaming, or online, methods for the LDA objective were proposed in (Newman et al., 2009), (Hoffman et al., 2010). We also note that the topic modeling problem, and in particular the pLSA model, is closely related to the Non-Negative Matrix Factorization (NMF) problem (Huang et al., 2012). In this context, EM-type algorithms for topic models were developed in the paper that introduced pLSA (Hofmann, 1999). Streaming algorithms for NMF in general are also an active field; see for instance (Zhao & Tan, 2016), (Guan et al., 2012), which involve Euclidean costs, but could possibly be adapted to the topic modeling setting. Finally, distributed methods for LDA were introduced in (Newman et al., 2009), (Smola & Narayanamurthy, 2010), (Liu et al., 2011), (Zhai et al., 2012), (Ahmed et al., 2012).

Topic reconstruction from corpus statistics such as the matrix $\widehat{M}$ was previously considered in the theoretical study of topic models. Topic reconstruction from the third moments of the data was proposed in (Anandkumar et al., 2012).
While highly significant theoretically, these algorithms require the construction and analysis of an $N \times N \times N$ tensor, and thus cannot be applied with dictionaries of a size of several thousand tokens. An algorithm that is based on the matrix $\widehat{M}$, as in our approach, was given in (Arora et al., 2012) and improved in (Arora et al., 2013). However, despite the fact that both (Arora et al., 2013) and our approach use $\widehat{M}$, the underlying principles behind the two algorithms are completely different. The approach of (Arora et al., 2013) is based on the notion of anchor words (see Section 4), and consists of two steps: First, the algorithm attempts to explicitly identify the anchor words from the matrix $\widehat{M}$. Then, the topics are reconstructed using the obtained anchor words. Due to the structure of the pLSA model, it can be shown that any row of the matrix $\widehat{M}$ can be approximately represented as a convex combination of the rows of $\widehat{M}$ that correspond to the anchor words. Thus, the anchor word identification in (Arora et al., 2013) is done by identifying the approximate extreme points of a set, where the set is the set of rows of $\widehat{M}$.

In contrast, our approach does not attempt to find or use anchor words explicitly, and conceptually is a much simpler gradient descent algorithm. We optimize in the space of topics directly, by approximating $\widehat{M}$ by an FDM $M(\mu, \alpha)$. It is also worth mentioning that in the consistency proof of Algorithm 2, Theorem 4.2 (although not in the algorithm itself), we use the characterization of the topics as the extreme points of a certain polytope. However, these are not the same extreme points as in (Arora et al., 2013). The extreme points we use for the proof correspond to topics, while the extreme points in (Arora et al., 2013) correspond to conditional probabilities of tokens given anchor words, and are generally very different from the topics themselves. Finally, as mentioned earlier, we use the algorithm from (Arora et al., 2013) as an additional benchmark. While the algorithm of (Arora et al., 2013) is extremely fast, and can be very precise under certain conditions, the quality of the topics found by that algorithm is significantly lower compared to the topics produced by the standard optimized LDA Gibbs sampler.

A variant of the Gibbs sampler for LDA that is adapted to computation on the GPU was recently proposed in (Tristan et al., 2015). We note that, similarly to the Gibbs samplers or variational methods, this approach maintains a form of a topic assignment for each document. Therefore, the number of variables that need to be stored grows with the number of documents and topics and is $\Omega(D \cdot T)$, where $D$ is the number of documents and $T$ the number of topics. This is true despite the remarkable memory optimizations described in (Tristan et al., 2015), which address other matrices used by that algorithm. Observe that for GPUs this problem is quite severe, since the amount of memory typically available on a GPU is much smaller than the CPU memory. This is in contrast to our approach, where the GPU memory requirement is independent of the number of documents or tokens. To put this in context, the datasets we analyze here, NeurIPS and Twitter (both with $T = 500$), cannot be analyzed via the approach of (Tristan et al., 2015) on a high end desktop GPU (10GB memory).

The MALLET code (McCallum, 2002) is widely used as the standard benchmark.
This code is based on a collapsed Gibbs sampler for LDA, and implements a variety of the optimizations discussed earlier. In particular, it exploits sparsity, based on (Yao et al., 2009), and parallelization (within a single workstation), based on (Newman et al., 2009), and has an efficient and publicly available implementation. Finally, we note that, while outside the scope of this paper, the methods presented here could easily be adapted to distributed, multi-GPU settings, using standard SGD parallelization techniques. This may be achieved either by elementary means, by increasing the batch size and processing it on multiple GPUs in parallel, or via more involved, lock-free methods, such as (Recht et al., 2011).

## 3. Formal Algorithm Specification

Recall that $|X| = N$ is the size of the dictionary. For a document $d$ given as a sequence of tokens $d = \{x_1, \ldots, x_{l_d}\}$, where $l_d$ is the total number of tokens in $d$, denote by $c_d \in \mathbb{R}^N$ the count vector of $d$,

$$c_d(u) = \#\{x_i \mid x_i = u\} \quad \text{for } u \in X. \tag{5}$$

Thus, $c_d$ is the bag-of-words representation of $d$, and $c_d(u)$ is the number of times $u$ appears in $d$. With this notation, the construction of the matrix $\widehat{M}$ from the data is described in Algorithm 1.

To motivate this construction, assume the pLSA model and suppose a document $d$ is sampled from a mixture of topics

$$\nu = \sum_{t \le T} \theta(t)\,\mu_t. \tag{6}$$

The co-occurrence matrix of the mixture $\nu$ is

$$(M_\nu)_{u,v} = \nu(u)\,\nu(v) = \sum_{i,j \le T} \theta(i)\,\theta(j)\,\mu_i(u)\,\mu_j(v). \tag{7}$$

Thus, $(M_\nu)_{u,v}$ is the probability of obtaining the pair $(u, v)$ when one samples two tokens independently from $\nu$. The co-occurrence matrix of the corpus is the average of the corresponding $M_\nu$ over all documents $d$. Observe from (7) that $M_\nu$ has the form of an FDM, (2), and hence the co-occurrence matrix of the full corpus also has this form.

Next, note that we do not have access to the matrices $M_\nu$ themselves, only to the documents $d$, which are samples from $\nu$. We therefore estimate $M_\nu$ using the tokens of the document. Specifically, the matrix $\widehat{M}_d$ constructed in (9) in Algorithm 1 is an unbiased estimate of $M_\nu$ from the tokens in $d$: we have

$$\mathbb{E}_d\left[\widehat{M}_d \mid \theta\right] = M_\nu, \tag{8}$$

where the expectation is over the documents sampled from $\nu$. We provide the proof of this statement in the supplementary material. Therefore, to obtain an approximation to the co-occurrence matrix of the model, in Algorithm 1 we first compute the estimates $\widehat{M}_d$ for each document, and then average over the corpus. Once the matrix $\widehat{M}$ is constructed, our goal is to reconstruct the topics $\mu$ and the corpus level coefficients $\alpha$ from $\widehat{M}$.

Algorithm 1: Computation of $\widehat{M}$
Input: a corpus $C = \{d_1, \ldots, d_D\}$ with $D$ documents.
1. For every $d \in C$ construct $\widehat{M}_d$ such that the entry $u, v$ is

$$\left(\widehat{M}_d\right)_{u,v} = \begin{cases} \dfrac{l_d}{l_d - 1}\,\dfrac{c_d(u)}{l_d}\,\dfrac{c_d(v)}{l_d} & \text{if } u \neq v, \\[2ex] \dfrac{l_d}{l_d - 1}\,\dfrac{c_d(u)}{l_d}\,\dfrac{c_d(u)}{l_d} - \dfrac{c_d(u)}{l_d(l_d - 1)} & \text{if } u = v. \end{cases} \tag{9}$$

2. Set

$$\widehat{M} = \frac{1}{D} \sum_{d \in C} \widehat{M}_d. \tag{10}$$

Algorithm 2: FDM Optimization
Input: $\widehat{M}$ (the empirical FDM), $B$ (batch size), $T$ (number of topics).
1. Initialize free variables $\alpha'$, $\mu'_t$ at random.
2. Set $\alpha = \mathrm{softmax}(\alpha')$, $\mu_t = \mathrm{softmax}(\mu'_t)$.
3. While $L_B$ not converged do:
4. Sample $B$ pairs of tokens, $\mathrm{Batch} = \{(u_1, v_1), \ldots, (u_B, v_B)\}$, where each pair is sampled from $\widehat{M}$, $(u_i, v_i) \sim \widehat{M}$.
5. Set $L_B(\mu, \alpha) = \sum_{k=1}^{B} \log \sum_{i,j \le T} \alpha_{i,j}\,\mu_i(u_k)\,\mu_j(v_k)$.
6. $(\mu', \alpha') \leftarrow (\mu', \alpha') + \gamma\,\nabla_{\mu',\alpha'} L_B$.
7. End while.

As discussed in the introduction, the FDM optimization algorithm is a stochastic gradient ascent on the pairs of tokens $(u, v)$ sampled from $\widehat{M}$, given as Algorithm 2.
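Before turning to the parametrization used in Algorithm 2, here is a minimal NumPy sketch of Algorithm 1. The document representation (lists of integer token ids) and the guard skipping documents with fewer than two tokens are our assumptions, not part of the paper's specification.

```python
import numpy as np
from collections import Counter

def empirical_fdm(corpus, N):
    """Algorithm 1: compute the empirical co-occurrence matrix M_hat.

    corpus : iterable of documents, each a list of token ids in {0, ..., N-1}.
    N      : dictionary size.
    """
    M_hat = np.zeros((N, N))
    D = 0
    for doc in corpus:
        l = len(doc)
        if l < 2:                       # the estimator (9) needs l_d >= 2
            continue
        c = np.zeros(N)
        for u, cnt in Counter(doc).items():
            c[u] = cnt
        # Unbiased per-document estimate, Eq. (9):
        #   off-diagonal: c(u) c(v) / (l (l - 1));  diagonal: c(u) (c(u) - 1) / (l (l - 1)).
        M_d = np.outer(c, c)
        np.fill_diagonal(M_d, c * (c - 1.0))
        M_hat += M_d / (l * (l - 1.0))
        D += 1
    return M_hat / D                    # Eq. (10): average over the corpus
```

Note that each $\widehat{M}_d$ has non-negative entries summing to 1, so $\widehat{M}$ is itself a distribution over token pairs, which is what allows Algorithm 2 to sample batches from it.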
Note that since the parameters $\alpha, \mu$ in which we are interested are constrained to be probability distributions, we parametrize them with free variables $\alpha', \mu'_t$ via softmax:

$$\alpha_{i,j} = \frac{e^{\alpha'_{i,j}}}{\sum_{i',j' \le T} e^{\alpha'_{i',j'}}} \quad \text{and} \quad \mu_t(l) = \frac{e^{\mu'_t(l)}}{\sum_{l' \le N} e^{\mu'_t(l')}}, \tag{11}$$

where $i, j \le T$, $t \le T$, and $l \le N$. Thus, in Algorithm 2, $\alpha, \mu$ are functions of $\alpha', \mu'$ and the gradient ascent is over $\alpha', \mu'$. Note that any SGD optimizer, including adaptive rate optimizers, may be used in Step 6 of Algorithm 2. We use Adam (Kingma & Ba, 2015) in the experiments.

## 4. Consistency

In this section we discuss the consistency of the loss function (3) under the anchor words assumption. Specifically, we show that if the corpus is generated by a pLSA model with a set of topics $\{\mu_t\}_1^T$ that satisfy the anchor words assumption, then among topics satisfying this assumption, the loss (3) can only be asymptotically minimized by an FDM $M(\mu, \alpha)$ with the true topics $\mu$. It follows that if one minimizes the loss (3) and the resulting topics satisfy the anchor words assumption (this holds empirically, see below), then in the limit the topics must be the true topics.

The anchor words assumption was introduced in (Donoho & Stodden, 2003) as a part of a set of sufficient conditions for identifiability in NMF. A set $\{\mu_t\}_1^T$ of topics is said to have the anchor words property, denoted (AW), if for every $t \le T$ there is a point $u_t \le N$ such that $\mu_t(u_t) > 0$ and $\mu_{t'}(u_t) = 0$ for all $t' \neq t$. The point $u_t$ is called the anchor word of the topic $\mu_t$. As mentioned earlier, natural topics tend to have the anchor word property. For instance, topics found by various LDA based methods have anchor words (Arora et al., 2013). We note that topics found by the FDM optimization algorithm, Algorithm 2, also have anchor words.

We first develop a few equivalent geometric characterizations of anchor words. While some of the arguments used in the proofs are similar in spirit to those in (Donoho & Stodden, 2003), (Arora et al., 2012), the particular notions we introduce have not appeared in the literature previously, and significantly clarify the geometric nature of the anchor word assumption. We thus provide these results here for completeness. Some of the equivalences below play an important role in the proof of Theorem 4.2.

Denote by $\Delta^{N-1}$ the probability simplex in $\mathbb{R}^N$. A set of probability measures $\{\mu_t\}_1^T \subset \Delta^{N-1}$ is called span-maximal (SM) if

$$\Delta^{N-1} \cap \mathrm{span}\left(\{\mu_t\}_1^T\right) = \mathrm{conv}\left(\{\mu_t\}_1^T\right).$$

Here $\mathrm{span}$ and $\mathrm{conv}$ denote the span and the convex hull of the set. A set $\{\mu_t\}_1^T \subset \Delta^{N-1}$ is positive (P) if the following holds: for every linear combination such that $\sum_t a_t \mu_t \in \Delta^{N-1}$, we have $a_t \ge 0$ for all $t \le T$. In other words, a set is positive if only its linear combinations with non-negative coefficients belong to the simplex. Finally, a set is maximal (M) if the following holds: for any set $\{\nu_t\}_1^T \subset \Delta^{N-1}$, if $\mathrm{conv}\left(\{\mu_t\}_1^T\right) \subseteq \mathrm{conv}\left(\{\nu_t\}_1^T\right)$, then we must have $\mu_t = \nu_{\pi(t)}$ for some permutation $\pi$ on $\{1, \ldots, T\}$. Equivalently, a set is maximal if it cannot be properly contained in the convex hull of another set of topics of size $T$.

Proposition 4.1. For a set of linearly independent topics $\{\mu_t\}_1^T \subset \Delta^{N-1}$, the properties (SM), (P), (AW), and (M) are equivalent.

The proof is given in supplementary material Section C. Note in addition that (AW) implies that $\{\mu_t\}_1^T \subset \Delta^{N-1}$ are linearly independent.

We now discuss the relation between $\{\mu_t\}_1^T$ and an FDM matrix $M_{u,v} = \sum_{i,j=1}^T \alpha_{i,j}\,\mu_i(u)\,\mu_j(v)$. In particular, we describe how $\{\mu_t\}_1^T$ can be recovered from the image of $M$ (as an operator, $M : \mathbb{R}^N \to \mathbb{R}^N$) under (AW).
Note first that the image of $M$ satisfies $\mathrm{Im}(M) \subseteq \mathrm{span}\left(\{\mu_t\}_1^T\right)$. Indeed, for a fixed $v$, a column of $M$ can be written as a linear combination, $M_{\cdot,v} = \sum_{i,j} \alpha_{i,j}\,\mu_j(v)\,\mu_i(\cdot)$. Moreover, if the matrix $\alpha = (\alpha_{i,j})$ has full rank, $\mathrm{rank}(\alpha) = T$, and if $\{\mu_t\}_1^T$ are linearly independent, then $\mathrm{Im}(M) = \mathrm{span}\left(\{\mu_t\}_1^T\right)$. Therefore, when $\mathrm{rank}(\alpha) = T$, given $M$ we can recover $\mathrm{span}\left(\{\mu_t\}_1^T\right)$ as $\mathrm{Im}(M)$. Now, if $\{\mu_t\}_1^T$ satisfies (AW), then by Proposition 4.1 it also satisfies (SM). It then follows that $\mathrm{conv}\left(\{\mu_t\}_1^T\right)$ can be recovered as $\mathrm{conv}\left(\{\mu_t\}_1^T\right) = \mathrm{Im}(M) \cap \Delta^{N-1}$. Finally, if we know $\mathrm{conv}\left(\{\mu_t\}_1^T\right)$, we can recover $\{\mu_t\}_1^T$, since these are simply the extreme points of that polytope, and every polytope is uniquely characterized by its extreme points. This relation between $\{\mu_t\}_1^T$ and $M$ is at the basis of the consistency result below.

We state the consistency for the probabilistic setting: we complement the pLSA model to a full generative model by assuming that the topic distribution $\theta_d$ is sampled independently for each document $d$ from some probability distribution $\mathcal{T}$ on $\Delta^{T-1}$. If $\mathcal{T}$ is a Dirichlet distribution, symmetric or asymmetric, this corresponds to an LDA model. Other examples include models with correlated topics (Blei & Lafferty, 2007), or hierarchical topics (Li & McCallum, 2006), among many others. The only requirement on the topic distribution is the following: let $\Theta_{i,j} = \mathbb{E}\left[\theta_d(i)\,\theta_d(j)\right]$ be the expected topic-topic co-occurrence matrix corresponding to the sampling scheme. We require that $\Theta$ is full rank. This assumption holds in all the examples above.

Theorem 4.2 (Consistency). Consider a generative pLSA model (1), over topics $\{\mu_i\}_1^T$ which satisfy (AW), and where $\theta_d$ are sampled independently from a fixed distribution on $\Delta^{T-1}$, with topic-topic expected co-occurrence matrix $\Theta$. Set $M_{u,v} := M(\mu, \Theta)_{u,v} = \sum_{i,j=1}^T \Theta_{i,j}\,\mu_i(u)\,\mu_j(v)$. Then with probability 1,

$$\lim_{D \to \infty} \sum_{u,v} \widehat{M}_{u,v} \log M_{u,v} = \sum_{u,v} M_{u,v} \log M_{u,v}. \tag{12}$$

Conversely, let $\{\mu'_t\}_{t=1}^T \neq \{\mu_t\}_{t=1}^T$ be a different set of topics satisfying (AW). Then there is a $\gamma = \gamma\left(\{\mu_i\}_1^T, \{\mu'_i\}_1^T, \kappa(\Theta)\right) > 0$ such that for any FDM $M'$ over $\{\mu'_i\}_1^T$, with probability 1,

$$\lim_{D \to \infty} \sum_{u,v} \widehat{M}_{u,v} \log M'_{u,v} \le \sum_{u,v} M_{u,v} \log M_{u,v} - \gamma. \tag{13}$$

This result is proved in supplementary material Section D. The probability in Theorem 4.2 is over the samples from the generative model, through the random variables $\widehat{M}$ (note that $\widehat{M}$, defined in (10), depends on $D$). The theorem states that the loss (3) is minimized at the true parameters $\{\mu_i\}_1^T$ and $\Theta$. Indeed, denote $L_0 = -\sum_{u,v} M_{u,v} \log M_{u,v}$, and note that the cost $L(\mu, \Theta)$ is a random variable, as it depends on $\widehat{M}$. Then Eq. (12) states that $L(\mu, \Theta)$ converges to $L_0$, while Eq. (13) is equivalent to stating that for any other set of topics $\{\mu'_i\}_1^T$ we have $L(\mu', \alpha) \ge L_0 + \gamma$ for any $\alpha$. The gap $\gamma$ depends on how well $\{\mu'_i\}_1^T$ approximates the true topics $\{\mu_i\}_1^T$.

## 5. Experiments

In the following sections we discuss experiments on semi-synthetic data, on the NeurIPS papers corpus, and on the Twitter and Wikipedia datasets; see Section 1.3 for an overview. For non-synthetic data the performance is evaluated by the log-likelihood on a test set, and by the coherence measure of the topics. The coherence results are discussed in Section 5.5.

### 5.1. Synthetic Data

To approximate real data in a synthetic setting, we used $T = 500$ topics learned by Sparse LDA optimization on the NeurIPS papers corpus (Section 5.2) as the ground truth topics. The dictionary size in this case is $N = 10500$ tokens.
The synthetic documents were generated using the LDA model: for each document, a topic distribution $\theta_d$ was sampled from a symmetric Dirichlet with the standard concentration parameter $\alpha = 1/T$, and 30 tokens were sampled from the $\theta_d$-mixture of the ground truth topics. The corpus size was $D = 100000$ documents. Note that for a dictionary of size $N = 10500$ and non-uniform topics, this is not a very large corpus.

To reconstruct the topics, we compared three algorithms: (i) Sparse LDA, a sparsity-optimized parallel collapsed Gibbs sampler for LDA, implemented in the MALLET framework (see Section 2 for details), which was run with 4 parallel threads. (ii) FDM, Algorithm 2. (iii) The topic learning algorithm from (Arora et al., 2013), to which we refer as EP (Extreme Points) in what follows. All algorithms were run 5 times, until convergence, on the fixed dataset. Note that the EP algorithm does not have a random initialization, but uses a random projection as an intermediate dimensionality reduction step; therefore different runs might be affected by restarts (although very mildly, in practice). We used $M = 1000$ as the dimension of the random projection, a value that was specified in (Arora et al., 2013) as the practical value for dictionary sizes of the order we use here. Hardware specifications are given in the supplementary material.

All the algorithms were run with the true number of topics $T$ as a parameter. The Sparse LDA algorithm was run in two modes: with the true hyperparameter $\alpha = 1/T$, corresponding to the true $\alpha$ of the corpus, and with topic sparsity parameter $\beta = 1/N$, a standard setting which was also used to learn the ground truth topics. To model a situation where the true hyperparameters are unknown, we also evaluated Sparse LDA with a modified hyperparameter $\alpha = 10/T$ and the same $\beta$. Note that this is a relatively mild change of the hyperparameter.

Figure 2. NeurIPS Papers Experiment. (a) Distribution of log-likelihoods on the test set; larger (closer to zero) values are better. FDM (blue), Sparse LDA (orange), EP (green). (b) Distribution of distances in an optimal matching: FDM to Sparse LDA (blue), FDM to empirical (orange), Sparse LDA to empirical (green).

Figure 3. Distribution of the distances to the ground truth topics, in an optimal matching, for a typical instance of each method.

The quality of the learned topics was measured by calculating the optimal-matching $\ell_1$ distances to the ground truth topics, where $\|\mu - \nu\|_1 = \sum_{u \in X} |\mu(u) - \nu(u)|$. That is, given the topics $\nu_t$, $1 \le t \le T$, returned by the model, and the ground truth topics $\mu_t$, we compute

$$\frac{1}{T} \sum_{t=1}^{T} \left\|\mu_t - \nu_{\tau(t)}\right\|_1, \tag{14}$$

where $\tau$, the matching, is a permutation of the set $\{1, \ldots, T\}$. The optimal matching $\tau$ was computed using the Hungarian algorithm. The results are given in Table 1, which shows for each algorithm the average error and the standard deviation over the different runs. To put the numbers in perspective, the typical $\ell_1$ distance between two ground truth topics is around 1.75. Thus all algorithms learned at least some approximation of the ground truth.
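A sketch of this evaluation, using SciPy's Hungarian-algorithm implementation (`scipy.optimize.linear_sum_assignment`); the function name and array layout are our assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def matching_l1_error(ground_truth, learned):
    """Optimal-matching l1 distances between two topic sets, as in Eq. (14).

    ground_truth, learned : (T, N) arrays whose rows are topics (distributions over tokens).
    Returns the average distance and the per-topic distances under the optimal matching tau.
    """
    # Pairwise l1 distances ("cityblock" metric) between all ground-truth / learned topic pairs.
    cost = cdist(ground_truth, learned, metric='cityblock')
    row_ind, col_ind = linear_sum_assignment(cost)   # Hungarian algorithm: optimal permutation tau
    dists = cost[row_ind, col_ind]
    return dists.mean(), dists
```

The per-topic distances returned here are the quantities whose histograms are shown in Figure 3.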
By visual inspection, a topic at distance 0.6 from a given ground truth topic tends to capture the mass at the correct tokens, but the amount of mass deviates somewhat from the correct one. We observe that Sparse LDA with the ground truth $\alpha$ attained the best performance. This is not surprising, since the algorithm is based on the generative model that is the true generative model of the corpus, and was provided with the true hyperparameters; both of these constitute considerable prior knowledge. The performance of the EP algorithm was relatively low, which is likely due to the fact that the corpus size was not sufficiently large for that algorithm, with this set of topics. Finally, FDM and Sparse LDA with the modified hyperparameter attained similar performance.

Interestingly, Sparse LDA and FDM tend to err slightly differently. In Figure 3, for a set of topics found by each algorithm, we show the histogram of the quantities $\|\mu_t - \nu_{\tau(t)}\|_1$, i.e. the distribution of the distances within the matching. Sparse LDA tends to completely miss about 50 to 80 out of 500 topics, while FDM is slightly less precise on the topics that it does approximate well. This figure is remarkably consistent across the different runs of the algorithms.

Table 1. Synthetic Corpus Matching Distances

| Model | Average $\ell_1$ |
|---|---|
| FDM | 0.66 ± 0.005 |
| Sparse LDA, $\alpha = 1/T$ (true) | 0.41 ± 0.0004 |
| EP | 1.05 ± 0.005 |
| Sparse LDA, $\alpha = 10/T$ | 0.66 ± 0.01 |

### 5.2. NeurIPS Dataset

The NeurIPS dataset (NeurIPS Papers Corpus, 2016) consists of all NeurIPS papers from 1987 to 2016. Each document was taken to be a single paragraph. Stop words, numbers, and tokens appearing less than 50 times were removed. All tokens were stemmed, and documents with less than 20 tokens were removed. The preprocessed dataset contained roughly $D = 251000$ documents over a dictionary of around $N = 10500$ unique tokens. 20% of the documents were taken at random as a hold-out (test) set. The log-likelihood of the documents in the hold-out set was used as the performance measure. The computation of the (log) likelihood on the hold-out set given topics is standard, with full details provided in supplementary material Section E.

Figure 4. Twitter Experiment. (a) Distribution of log-likelihoods on the test set; larger (closer to zero) values are better. FDM (blue), Sparse LDA (orange), EP (green). (b) Distribution of distances in an optimal matching: FDM to Sparse LDA (blue), FDM to empirical (orange), Sparse LDA to empirical (green).

The following models were trained: FDM, Sparse LDA ($\alpha = 1/T$, $\beta = 1/N$) and EP (Arora et al., 2013), with $T = 500$ topics. All models were run until convergence. The mean hold-out log-likelihoods for each method are shown in Table 2, and a histogram of the distribution of the hold-out log-likelihoods for a single run of each algorithm is shown in Figure 2a. We observe that the performance of Sparse LDA and FDM is practically identical, and both perform better than EP. To obtain some insight into the relation between the models, Figure 2b shows the histogram of the optimal matching distances between the topics learned by a fixed run of Sparse LDA and a fixed run of FDM. For scale, the distances of the Sparse LDA and FDM topics to a fixed topic, the empirical distribution of the corpus, are also shown. It appears that both models find somewhat similar topics.
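As an aside, the preprocessing described at the start of this section can be sketched as follows; the regex tokenizer, the Porter stemmer, and the stop-word list are our assumptions rather than the exact pipeline used in the paper.

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer

def preprocess(paragraphs, stop_words, min_count=50, min_doc_len=20):
    """Tokenize, stem, and filter a corpus of paragraphs (one paragraph = one document)."""
    stem = PorterStemmer().stem
    docs = []
    for p in paragraphs:
        tokens = re.findall(r"[a-z]+", p.lower())              # drops numbers and punctuation
        docs.append([stem(t) for t in tokens if t not in stop_words])
    counts = Counter(t for d in docs for t in d)
    vocab = {t for t, c in counts.items() if c >= min_count}   # drop tokens appearing < min_count times
    docs = [[t for t in d if t in vocab] for d in docs]
    return [d for d in docs if len(d) >= min_doc_len]          # drop documents with < min_doc_len tokens
```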
Table 2. Test set average log-likelihoods

| Model | NeurIPS | Twitter | Wiki |
|---|---|---|---|
| FDM | -5.62 | -5.26 | -6.28 |
| Sparse LDA | -5.57 | -5.70 | -6.68 (-6.45) |
| EP | -6.31 | -8.5 | -6.84 |

### 5.3. Twitter Data

We first describe the collection and processing of the Twitter corpus. The tweets were collected via the Twitter API (Twitter API). The data contains about 16M (million, after pre-processing, see below) publicly available tweets posted by 60K users during roughly the period of 1/2014 to 1/2017. The tweets were preprocessed to remove numbers, non-ASCII characters, mentions (Twitter usernames preceded by an @ character) and URLs. All tokens were stemmed, and stop words and tokens with less than 3 characters were removed. The most common 200 tokens and rare tokens (less than 1000 appearances in the corpus) were removed. Following this, tweets shorter than 4 tokens were also removed. This resulted in a corpus of about $D = 16\mathrm{M}$ tweets over a dictionary of slightly more than $N = 15000$ unique tokens. Each tweet was considered as a separate document, and typical tweets have 4 to 8 tokens.

The experimental setting was similar to the NeurIPS papers corpus. 20% of the documents were taken as a hold-out set, and the hold-out log-likelihood of the Sparse LDA, FDM and EP topics was evaluated. All algorithms were run to convergence, for about 12 hours per run. The resulting hold-out log-likelihoods are given in Table 2, and the distribution of the log-likelihoods is shown in Figure 4a. We observe that in this case the performance of FDM is better than that of Sparse LDA, while EP does not produce a good approximation of the dataset.

### 5.4. Wikipedia

We use the full English Wikipedia corpus, as archived on 04/2020. In Wikipedia the articles are naturally divided into sections, and each section was taken as a single document. 20% of the articles (that is, all sections from these articles) were taken as the test set. Similarly to the Twitter data, the text was stemmed and stop words were removed. Then the dictionary was restricted to the $N = 50000$ most common tokens. After this preprocessing the total number of tokens in the train set was 1.3 billion, across approximately 11 million documents, $D = 11 \times 10^6$. In all models we used $T = 1000$ topics, and similarly to the other datasets the LDA models were trained with the standard hyperparameters $\alpha = 1/T$, $\beta = 1/N$.

Figure 5. Wikipedia Test Likelihoods as a Function of Train Time.

The test set log-likelihoods are given in Table 2, and shown as a function of train time in Figure 5. While absolute training times depend strongly on hardware, in this case both FDM and MALLET were run on the same system, using CPUs and a GPU of the same generation. See Section F for full hardware details. As shown in Figure 5, EP provides non-trivial results relatively quickly (note that EP is not an iterative method, and its results cannot be further improved by using more time). However, these results are significantly weaker than what FDM and MALLET provide. FDM converges to its best log-likelihood, -6.28, after about 10 hours, while MALLET achieves -6.68 after 24 hours, and -6.45 after 80 hours, which still does not match the FDM performance.

### 5.5. Coherence

In addition to the hold-out log-likelihood, another topic quality measure that is often used in the literature is topic coherence (Mimno et al., 2011).
The coherence of a topic $\mu$ with respect to the corpus is defined as follows. Let $C(v)$ denote the number of documents containing at least one instance of the token $v$, and let $C(v, w)$ denote the number of documents containing at least one instance of both $v$ and $w$. For a topic $\mu$, denote by $u_1, \ldots, u_M$ the $M$ tokens with the highest weights in the topic, ordered by weight. Thus, for instance, $u_1$ is the token with the highest weight $\mu(u_1)$. The (normalized) coherence of $\mu$ is then given by

$$\mathrm{Coh}(\mu; M) = \frac{1}{\binom{M}{2}} \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{C(u_m, u_l) + 1}{C(u_l)}.$$

Coherence values are normally negative, and typically values closer to 0 are considered to be better. All coherences below are computed at the standard value of $M = 10$ top words. For a set of topics produced by any given model, we compute the average coherence of the topics. The results for all models and datasets we consider are given in Table 3. In particular, FDM and Sparse LDA achieve similar coherence scores, with Sparse LDA's slightly higher.

Figure 6. Wikipedia: Distribution of Topic Coherence.

When discussing coherence, it is crucial to note that coherence typically decreases with training. Moreover, for a given corpus, define the empirical mean topic $\bar{\mu}$ as $\bar{\mu}(v) = C(v) / \sum_u C(u)$. That is, this is the topic in which each token has the same frequency as in the whole corpus. While this is in many senses a trivial topic, it usually has high coherence compared to other, more specific and useful topics. The coherence of this topic for each corpus is shown in the last row of Table 3. Since most initialization algorithms initialize topics such that they are close to $\bar{\mu}$, this explains the decrease of coherence during training. For this reason, while higher values of coherence might be desirable, such values are only meaningful when the log-likelihood values are also high. The distribution of coherences for the Wikipedia topics for FDM, MALLET and EP is shown in Figure 6.

Table 3. Average Topic Coherence (at 10 top words)

| Model | NeurIPS | Twitter | Wiki |
|---|---|---|---|
| FDM | -2.62 | -4.51 | -2.15 |
| Sparse LDA | -2.44 | -4.36 | -2.07 |
| EP | -2.19 | -7.35 | -2.72 |
| Mean Empirical | -1.34 | -4.02 | -0.98 |

## 6. Conclusions

In this paper we introduced a new topic modeling approach, FDM topic modeling, which is based on matching, via the KL divergence, the token co-occurrence distribution induced by the topics to the co-occurrence distribution of the corpus. We have shown the asymptotic consistency of the approach under the anchor words assumption and presented an efficient stochastic optimization procedure for this problem. This algorithm enables the approach to leverage GPU computation and efficient SGD optimizers. Our empirical evaluation shows that FDM produces topics of good quality and improves over the performance of Sparse LDA.

## References

Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., and Smola, A. J. Scalable inference in latent variable models. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 123-132. ACM, 2012.

Anandkumar, A., Foster, D. P., Hsu, D. J., Kakade, S. M., and Liu, Y.-K. A spectral algorithm for latent Dirichlet allocation. In NIPS, 2012.

Arora, S., Ge, R., and Moitra, A. Learning topic models: going beyond SVD. In FOCS, 2012.

Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M.
A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.

Bhatia, R. Matrix Analysis. Graduate Texts in Mathematics. Springer New York, 1997.

Blei, D. M. and Lafferty, J. D. A correlated topic model of science. The Annals of Applied Statistics, 2007.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 2002.

Cover, T. M. and Thomas, J. A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.

Donoho, D. and Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS'03, 2003.

Foulds, J., Boyles, L., DuBois, C., Smyth, P., and Welling, M. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In KDD, 2013.

Griffiths, T. L. and Steyvers, M. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101, 2004.

Guan, N., Tao, D., Luo, Z., and Yuan, B. Online nonnegative matrix factorization with robust stochastic approximation. IEEE Transactions on Neural Networks and Learning Systems, 2012.

Hoffman, M., Bach, F. R., and Blei, D. M. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23, 2010.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14(1), 2013.

Hofmann, T. Probabilistic latent semantic indexing. In SIGIR, 1999.

Huang, Z., Zhou, A., and Zhang, G. Non-negative matrix factorization: A short survey on methods and applications. In Computational Intelligence and Intelligent Systems. Springer Berlin Heidelberg, 2012.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Li, A. Q., Ahmed, A., Ravi, S., and Smola, A. J. Reducing the sampling complexity of topic models. In KDD, 2014.

Li, W. and McCallum, A. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning, ICML'06. Association for Computing Machinery, 2006.

Liu, Z., Zhang, Y., Chang, E. Y., and Sun, M. PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology, 2011.

McCallum, A. K. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.

Mimno, D., Wallach, H. M., Talley, E., Leenders, M., and McCallum, A. Optimizing semantic coherence in topic models. In EMNLP, 2011.

NeurIPS Papers Corpus. https://www.kaggle.com/benhamner/nips-papers, 2016. Accessed: 1/8/2019.

Newman, D., Asuncion, A., Smyth, P., and Welling, M. Distributed algorithms for topic models. Journal of Machine Learning Research, 2009.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.

Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.

Smola, A. and Narayanamurthy, S. An architecture for parallel topic models. Proceedings of the VLDB Endowment, 2010.

Steyvers, M.
and Griffiths, T. Probabilistic Topic Models. 2007.

Tristan, J.-B., Tassarotti, J., and Steele, G. Efficient training of LDA on a GPU by mean-for-mode estimation. In ICML, 2015.

Twitter API. https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html. Accessed: 1/8/2019.

Xiao, H. and Stibor, T. Efficient collapsed Gibbs sampling for latent Dirichlet allocation. In ACML, 2010.

Yao, L., Mimno, D., and McCallum, A. Efficient methods for topic model inference on streaming document collections. In KDD, 2009.

Zhai, K., Boyd-Graber, J., Asadi, N., and Khouja, J. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In WWW, 2012.

Zhao, R. and Tan, V. Y. F. Online nonnegative matrix factorization with outliers. In ICASSP, 2016.