# NEURAL TOPIC MODEL VIA OPTIMAL TRANSPORT

Published as a conference paper at ICLR 2021

He Zhao, Dinh Phung, Viet Huynh, Trung Le, Wray Buntine
Department of Data Science and Artificial Intelligence, Faculty of Information Technology
Monash University, Australia
{ethan.zhao,dinh.phung,viet.huynh,trunglm,wray.buntine}@monash.edu

ABSTRACT

Recently, Neural Topic Models (NTMs) inspired by variational autoencoders have attracted increasing research interest due to their promising results on text analysis. However, it is usually hard for existing NTMs to achieve good document representation and coherent/diverse topics at the same time. Moreover, their performance often degrades severely on short documents. The requirement of reparameterisation can also compromise their training quality and model flexibility. To address these shortcomings, we present a new neural topic model via the theory of optimal transport (OT). Specifically, we propose to learn the topic distribution of a document by directly minimising its OT distance to the document's word distribution. Importantly, the cost matrix of the OT distance models the weights between topics and words, and is constructed from the distances between topics and words in an embedding space. Our proposed model can be trained efficiently with a differentiable loss. Extensive experiments show that our framework significantly outperforms state-of-the-art NTMs on discovering more coherent and diverse topics and on deriving better document representations for both regular and short texts.

1 INTRODUCTION

As an unsupervised approach, topic modelling has enjoyed great success in automatic text analysis. In general, a topic model aims to discover a set of latent topics from a collection of documents, each of which describes an interpretable semantic concept. Topic models like Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its hierarchical/Bayesian extensions, e.g., in Blei et al. (2010); Paisley et al. (2015); Gan et al. (2015); Zhou et al. (2016), have achieved impressive performance for document analysis. Recently, the development of Variational Auto-Encoders (VAEs) and Autoencoding Variational Inference (AVI) (Kingma & Welling, 2013; Rezende et al., 2014) has facilitated the proposal of Neural Topic Models (NTMs), such as in Miao et al. (2016); Srivastava & Sutton (2017); Krishnan et al. (2018); Burkhardt & Kramer (2019). Inspired by VAEs, many NTMs use an encoder that takes the Bag-of-Words (BoW) representation of a document as input and approximates the posterior distribution of the latent topics. The posterior samples are then fed into a decoder to reconstruct the BoW representation. Compared with conventional topic models, NTMs usually enjoy better flexibility and scalability, which are important for applications on large-scale data.

Despite the promising performance and recent popularity, existing NTMs have several shortcomings that could hinder their usefulness and further extensions. i) The training and inference processes of NTMs are typically complex due to the prior and posterior constructions of the latent topics. To encourage topic sparsity and smoothness, Dirichlet (Burkhardt & Kramer, 2019) or gamma (Zhang et al., 2018) distributions are usually used as the prior and posterior of topics, but reparameterisation is inapplicable to them; thus, complex sampling schemes or approximations have to be used, which could limit the model flexibility.
ii) A desideratum of a topic model is to generate better topical representations of documents together with more coherent and diverse topics; but for many existing NTMs, it is hard to achieve good document representation and coherent/diverse topics at the same time. This is because the objective of NTMs is to achieve a lower reconstruction error, which usually means less coherent and diverse topics, as observed and analysed in Srivastava & Sutton (2017); Burkhardt & Kramer (2019). iii) It is well known that topic models degrade severely on short documents such as tweets, news headlines and product reviews, as each individual document contains insufficient word co-occurrence information. This issue can be exacerbated for NTMs because of the use of encoder and decoder networks, which are usually more vulnerable to data sparsity.

To address the above shortcomings of NTMs, in this paper we propose a neural topic model built upon a novel Optimal Transport (OT) framework derived from a new view of topic modelling. For a document, we consider its content to be encoded by two representations: the observed representation, x, a distribution over all the words in the vocabulary, and the latent representation, z, a distribution over all the topics. x can be obtained by normalising a document's word count vector, while z needs to be learned by a model. For a document collection, the vocabulary size (i.e., the number of unique words) can be very large, but one individual document usually contains only a tiny subset of the words. Therefore, x is a sparse and low-level representation of the semantic information of a document. As the number of topics is much smaller than the vocabulary size, z is a relatively dense and high-level representation of the same content. The learning of a topic model can therefore be viewed as the process of learning the distribution z to be as close to the distribution x as possible. Accordingly, it is crucial to investigate how to measure the distance between two distributions with different supports (i.e., words for x and topics for z). Optimal transport is a powerful tool for measuring the distance travelled in transporting the mass of one distribution to match another given a specific cost function, and recent developments in computational OT (e.g., Cuturi (2013); Frogner et al. (2015); Seguy et al. (2018); Peyré et al. (2019)) have made it feasible to compute OT efficiently for large-scale problems, so it is natural for us to develop a new NTM based on the minimisation of OT. Specifically, our model leverages an encoder that outputs the topic distribution z of a document by taking its word count vector as input, like standard NTMs, but we minimise the OT distance between x and z, which are two discrete distributions on the supports of words and topics, respectively. Notably, the cost function of the OT distance specifies the weights between topics and words, which we define as distances in an embedding space in which all topics and words are embedded to represent their semantics. By leveraging pretrained word embeddings, the cost function becomes a function of the topic embeddings, which are learned jointly with the encoder. With the advanced properties of OT for modelling geometric structures on spaces of probability distributions, our model is able to achieve a better balance between obtaining good document representations and generating coherent/diverse topics.
In addition, our model eases the burden of designing complex sampling schemes for the posterior of NTMs. More interestingly, our model provides a natural way of incorporating pretrained word embeddings, which have been demonstrated to alleviate the issue of insufficient word co-occurrence information in short texts (Zhao et al., 2017; Dieng et al., 2020). With extensive experiments, our model is shown to enjoy state-of-the-art performance in terms of both topic quality and document representation for both regular and short texts.

2 BACKGROUND

In this section, we recap the essential background of neural topic models and optimal transport.

2.1 NEURAL TOPIC MODELS

Most existing NTMs can be viewed as extensions of the framework of VAEs where the latent variables are interpreted as topics. Suppose the document collection to be analysed has V unique words (i.e., the vocabulary size). Each document is represented by a word count vector $x \in \mathbb{N}^V$ and a latent distribution over K topics, $z \in \mathbb{R}^K$. An NTM assumes that z for a document is generated from a prior distribution $p(z)$ and that x is generated from the conditional distribution $p_\phi(x|z)$ modelled by a decoder $\phi$. The model's goal is to infer the topic distribution given the word counts, i.e., to calculate the posterior $p(z|x)$, which is approximated by the variational distribution $q_\theta(z|x)$ modelled by an encoder $\theta$. Similar to VAEs, the training objective of NTMs is the maximisation of the Evidence Lower BOund (ELBO):

$$\max_{\theta,\phi} \; \mathbb{E}_{q_\theta(z|x)}\left[\log p_\phi(x|z)\right] - \mathrm{KL}\left[q_\theta(z|x) \,\|\, p(z)\right]. \quad (1)$$

The first term above is the expected log-likelihood, i.e., the (negative) reconstruction error. As x is a count-valued vector, it is usually assumed to be generated from a multinomial distribution: $p_\phi(x|z) := \mathrm{Multi}(\phi(z))$, where $\phi(z)$ is a probability vector output by the decoder. Therefore, the expected log-likelihood is proportional to $x^\top \log \phi(z)$. The second term is the Kullback-Leibler (KL) divergence that regularises $q_\theta(z|x)$ to be close to its prior $p(z)$. To interpret topics with words, $\phi(z)$ is usually constructed by a single-layer network (Srivastava & Sutton, 2017): $\phi(z) := \mathrm{softmax}(Wz)$, where $W \in \mathbb{R}^{V \times K}$ contains the weights between topics and words. Different NTMs may vary in the prior and the posterior of z; for example, the model in Miao et al. (2017) applies Gaussian distributions for them, while Srivastava & Sutton (2017); Burkhardt & Kramer (2019) show that the Dirichlet is a better choice. However, reparameterisation cannot be directly applied to a Dirichlet, so various approximations and sampling schemes have been proposed.

2.2 OPTIMAL TRANSPORT

OT distances have been widely used for the comparison of probabilities. Here we limit our discussion to OT for discrete distributions, although it applies to continuous distributions as well. Specifically, consider two probability vectors $r \in \Delta^{D_r}$ and $c \in \Delta^{D_c}$, where $\Delta^D$ denotes the $(D-1)$-simplex. The OT distance[1] between the two probability vectors can be defined as:

$$d_M(r, c) := \min_{P \in U(r,c)} \langle P, M \rangle, \quad (2)$$

where $\langle \cdot, \cdot \rangle$ denotes the Frobenius dot-product; $M \in \mathbb{R}^{D_r \times D_c}_{\geq 0}$ is the cost matrix/function of the transport; $P \in \mathbb{R}^{D_r \times D_c}_{>0}$ is the transport matrix/plan; $U(r, c)$ denotes the transport polytope of r and c, which is the polyhedral set of $D_r \times D_c$ matrices $U(r, c) := \{P \in \mathbb{R}^{D_r \times D_c}_{>0} \mid P \mathbf{1}_{D_c} = r,\; P^\top \mathbf{1}_{D_r} = c\}$; and $\mathbf{1}_D$ is the D-dimensional vector of ones.

[1] To be precise, an OT distance is a distance metric in the mathematical sense only if the cost function M is induced from a distance metric. We call it an OT distance to assist the readability of our paper.
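To make the linear-program form of Eq. (2) concrete, the following sketch solves the exact discrete OT problem for small toy inputs. It is not part of the paper; it assumes SciPy's `linprog` with the HiGHS solver and flattens the transport plan P row-major.

```python
import numpy as np
from scipy.optimize import linprog

def exact_ot(r, c, M):
    """Exact OT distance d_M(r, c) of Eq. (2), solved as a linear program.

    r: (Dr,) and c: (Dc,) probability vectors; M: (Dr, Dc) cost matrix.
    The transport plan P is flattened row-major into a vector of length Dr*Dc.
    """
    Dr, Dc = M.shape
    A_eq = np.zeros((Dr + Dc, Dr * Dc))
    for i in range(Dr):
        A_eq[i, i * Dc:(i + 1) * Dc] = 1.0   # row sums of P must equal r
    for j in range(Dc):
        A_eq[Dr + j, j::Dc] = 1.0            # column sums of P must equal c
    b_eq = np.concatenate([r, c])
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(Dr, Dc)

# Toy example: two 3-bin distributions with |i - j| as the ground cost.
r = np.array([0.5, 0.3, 0.2])
c = np.array([0.2, 0.3, 0.5])
M = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
dist, P = exact_ot(r, c, M)   # dist = <P, M> at the optimal plan P
```

Solving this linear program exactly becomes expensive as the supports grow, which motivates the entropy-regularised Sinkhorn distance discussed next.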
Intuitively, if we consider two discrete random variables $X \sim \mathrm{Categorical}(r)$ and $Y \sim \mathrm{Categorical}(c)$, the transport matrix P is a joint probability of $(X, Y)$, i.e., $p(X = i, Y = j) = p_{ij}$, and $U(r, c)$ is the set of all such joint probabilities. The OT distance above is computed by finding the optimal transport matrix $P^*$. It is also noteworthy that the Wasserstein distance can be viewed as a specific case of OT distances. As directly optimising Eq. (2) can be time-consuming for large-scale problems, a regularised optimal transport distance with an entropic constraint was introduced in Cuturi (2013), named the Sinkhorn distance:

$$d_{M,\alpha}(r, c) := \min_{P \in U_\alpha(r,c)} \langle P, M \rangle, \quad (3)$$

where $U_\alpha(r, c) := \{P \in U(r, c) \mid h(P) \geq h(r) + h(c) - \alpha\}$, $h(\cdot)$ is the entropy function, and $\alpha \in [0, \infty)$. To compute the Sinkhorn distance, a Lagrange multiplier is introduced for the entropy constraint of Eq. (3), resulting in the Sinkhorn algorithm, which is widely used for discrete OT problems.

3 PROPOSED MODEL

Now we introduce the details of our proposed model. Specifically, we represent each document as a distribution over V words, $\bar{x} \in \Delta^V$, obtained by normalising its word count vector: $\bar{x} := x/S$, where $S := \sum_{v=1}^{V} x_v$ is the length of the document. Each document is also associated with a distribution over K topics, $z \in \Delta^K$, each entry of which indicates the proportion of one topic in this document. Like other NTMs, we leverage an encoder to generate z from $\bar{x}$: $z = \mathrm{softmax}(\theta(\bar{x}))$. Notably, $\theta$ is implemented as a neural network with dropout layers for adding randomness. As $\bar{x}$ and z are two distributions with different supports for the same document, to learn the encoder we propose to minimise the following OT distance to push z towards $\bar{x}$:

$$\min_{\theta} \; d_M(\bar{x}, z). \quad (4)$$

Here $M \in \mathbb{R}^{V \times K}_{\geq 0}$ is the cost matrix, where $m_{vk}$ indicates the semantic distance between topic k and word v. Therefore, each column of M captures the importance of the words in the corresponding topic. In addition to the encoder, M is a variable that needs to be learned in our model. However, learning the cost function is reported to be a non-trivial task (Cuturi & Avis, 2014; Sun et al., 2020). To address this problem, we specify the following construction of M:

$$m_{vk} = 1 - \cos(e_v, g_k), \quad (5)$$

where $\cos(\cdot, \cdot) \in [-1, 1]$ is the cosine similarity, and $g_k \in \mathbb{R}^L$ and $e_v \in \mathbb{R}^L$ are the embeddings of topic k and word v, respectively. The embeddings are expected to capture the semantic information of the topics and words. Instead of learning the word embeddings, we propose to use pretrained word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). This not only reduces the parameter space, making the learning of M more stable, but also enables us to leverage the rich semantic information in pretrained word embeddings, which is beneficial for short documents. The cosine distance is used rather than alternatives for two reasons: it is the most commonly used similarity metric for word embeddings, and the cost matrix M needs to be non-negative, which requires the similarity to be upper-bounded. As the cosine similarity falls in the range $[-1, 1]$, we have $M \in [0, 2]^{V \times K}$. For ease of presentation, we denote by $G \in \mathbb{R}^{L \times K}$ and $E \in \mathbb{R}^{L \times V}$ the collections of the embeddings of all topics and words, respectively. Now we can rewrite Eq. (4) as:

$$\min_{\theta, G} \; d_M(\bar{x}, z). \quad (6)$$
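As a concrete illustration of Eq. (5), the following sketch (our own, not the authors' code) builds the V × K cost matrix from fixed pretrained word embeddings E and trainable topic embeddings G; the sizes and the random initialisation of E are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

V, K, L = 2000, 100, 50                      # assumed vocabulary size, topic number, embedding dim

E = torch.randn(L, V)                         # stand-in for pretrained word embeddings (kept fixed)
G = torch.randn(L, K, requires_grad=True)     # topic embeddings, learned jointly with the encoder

def cost_matrix(E, G):
    """Eq. (5): m_vk = 1 - cos(e_v, g_k), so every entry of M lies in [0, 2]."""
    E_n = F.normalize(E, dim=0)               # unit-normalise each word embedding (column)
    G_n = F.normalize(G, dim=0)               # unit-normalise each topic embedding (column)
    return 1.0 - E_n.t() @ G_n                # (V, K) matrix of cosine distances

M = cost_matrix(E, G)                         # differentiable with respect to G
```

In the actual model, E would be filled with GloVe or word2vec vectors for the corpus vocabulary; random values here only keep the sketch self-contained.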
Although their mechanisms are totally different, both M in our model and W in NTMs (see Section 2.1) capture the relations between topics and words (M measures distance while W measures similarity): M is the cost function of our OT loss, while W contains the weights of the decoder of NTMs. Different from other NTMs based on VAEs, our model does not explicitly have a decoder to project z back to the word space to reconstruct $\bar{x}$, as the OT distance allows us to compute the distance between z and $\bar{x}$ directly. To further understand our model, we can nevertheless project z into the space of $\bar{x}$ by virtually defining a decoder: $\phi(z) := \mathrm{softmax}((2 - M)z)$. With this notation, we show the following theorem to reveal the relationship between other NTMs and ours, whose proof is given in Section A of the appendix.

Theorem 1. When $V \geq 8$ and $M \in [0, 2]^{V \times K}$, we have:

$$d_M(\bar{x}, z) \leq -\bar{x}^\top \log \phi(z). \quad (7)$$

With Theorem 1, we have:

Lemma 1. Maximising the expected multinomial log-likelihood of NTMs is equivalent to minimising an upper bound of the OT distance in our model.

Frogner et al. (2015) propose to minimise the OT distance between the predicted and true label distributions for classification tasks. It is reported in that paper that combining the OT loss with the conventional cross-entropy loss gives better performance than using either of them alone. As the expected multinomial log-likelihood is easier to optimise and can help guide the optimisation of the OT distance, empirically inspired by Frogner et al. (2015) and theoretically motivated by Theorem 1, we propose the following joint loss for our model, which combines the OT distance with the expected log-likelihood:

$$\max_{\theta, G} \; \bar{x}^\top \log \phi(z) - d_M(\bar{x}, z). \quad (8)$$

If we compare the above loss with the ELBO of Eq. (1), it can be observed that, similar to the KL divergence in NTMs, our OT distance can be viewed as a regularisation term on the expected log-likelihood (note that $\bar{x}^\top \log \phi(z) = \frac{1}{S}\, x^\top \log \phi(z)$). Compared with other NTMs, our model eases the burden of developing prior/posterior distributions and the associated sampling schemes. Moreover, with OT's ability to better model geometric structures, our model achieves better performance in terms of both document representation and topic quality. In addition, the cost function of the OT distance provides a natural way of incorporating pretrained word embeddings, which boosts our model's performance on short documents. Finally, we replace the OT distance with the Sinkhorn distance (Cuturi, 2013), which leads to the final loss function:

$$\max_{\theta, G} \; \epsilon\, \bar{x}^\top \log \phi(z) - d_{M,\alpha}(\bar{x}, z), \quad (9)$$

where $z = \mathrm{softmax}(\theta(\bar{x}))$; M is parameterised by G; $\phi(z) := \mathrm{softmax}((2 - M)z)$; x and $\bar{x}$ are the word count vector and its normalisation, respectively; $\epsilon$ is the hyperparameter that controls the weight of the expected likelihood; and $\alpha$ is the hyperparameter of the Sinkhorn distance. To compute the Sinkhorn distance, we leverage the Sinkhorn algorithm (Cuturi, 2013). Accordingly, we name our model the Neural Sinkhorn Topic Model (NSTM), whose training algorithm is shown in Algorithm 1.

Algorithm 1: Training algorithm for NSTM. $X \in \mathbb{N}^{V \times B}$ and $Z \in \mathbb{R}^{K \times B}_{>0}$ consist of the word count vectors and topic distributions of the documents in a batch, respectively; $\odot$ denotes element-wise multiplication.

Input: input documents, pretrained word embeddings E, topic number K, $\epsilon$, $\alpha$
Output: $\theta$, G
Randomly initialise $\theta$ and G;
while not converged do
    Sample a batch of B input documents X;
    Column-wise normalise X to get $\bar{X}$;
    Compute M with G and E by Eq. (5);
    Compute $Z = \mathrm{softmax}(\theta(\bar{X}))$;
    Compute the first term of Eq. (9);
    # Sinkhorn iterations #
    $\Psi_1 = \mathrm{ones}(K, B)/K$; $\Psi_2 = \mathrm{ones}(V, B)/V$; $H = e^{-M/\alpha}$;
    while $\Psi_1$ changes or any other relevant stopping criterion do
        $\Psi_2 = \bar{X} \odot 1/(H \Psi_1)$;
        $\Psi_1 = Z \odot 1/(H^\top \Psi_2)$;
    end
    Compute the second term of Eq. (9): $d_{M,\alpha} = \mathrm{sum}\big(\Psi_2 \odot ((H \odot M)\Psi_1)\big)$;
    Compute the gradients of Eq. (9) with respect to $\theta$ and G;
    Update $\theta$ and G with the gradients;
end
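To make Algorithm 1 concrete, here is a hedged PyTorch sketch of one training step of Eq. (9). It is not the paper's implementation (which is in TensorFlow); the encoder is assumed to be any network mapping a V-dimensional input to K logits, and the stopping rule inside the Sinkhorn loop is illustrative.

```python
import torch
import torch.nn.functional as F

def sinkhorn_distance(M, x_bar, z, alpha, n_iters=1000, tol=5e-3):
    """Batched Sinkhorn distance d_{M,alpha}(x_bar, z) following Algorithm 1.

    M: (V, K) cost matrix; x_bar: (V, B) normalised word counts; z: (K, B) topic
    distributions. H = exp(-M / alpha) as written in Algorithm 1.
    """
    V, K = M.shape
    B = x_bar.shape[1]
    H = torch.exp(-M / alpha)
    psi1 = torch.ones(K, B, device=M.device) / K
    psi2 = torch.ones(V, B, device=M.device) / V
    for _ in range(n_iters):
        psi1_prev = psi1
        psi2 = x_bar / (H @ psi1 + 1e-16)         # match the word marginal x_bar
        psi1 = z / (H.t() @ psi2 + 1e-16)         # match the topic marginal z
        if (psi1 - psi1_prev).abs().max() < tol:  # illustrative stopping criterion
            break
    # sum_b <P_b, M> with P_b = diag(psi2[:, b]) H diag(psi1[:, b])
    return (psi2 * ((H * M) @ psi1)).sum()

def training_step(encoder, E, G, x_counts, alpha=20.0, eps=0.07):
    """One step of Eq. (9): minimise d_{M,alpha}(x_bar, z) - eps * x_bar^T log(phi(z))."""
    x_bar = x_counts / x_counts.sum(dim=0, keepdim=True)         # (V, B) column-normalised counts
    z = F.softmax(encoder(x_bar.t()), dim=1).t()                 # (K, B) topic distributions
    M = 1.0 - F.normalize(E, dim=0).t() @ F.normalize(G, dim=0)  # Eq. (5) cost matrix
    log_phi = F.log_softmax((2.0 - M) @ z, dim=0)                # phi(z) = softmax((2 - M) z)
    log_lik = (x_bar * log_phi).sum()                            # batched x_bar^T log(phi(z))
    loss = sinkhorn_distance(M, x_bar, z, alpha) - eps * log_lik
    return loss                                                  # backpropagate to theta and G
```

Calling `loss.backward()` and stepping an optimiser over the encoder parameters and G reproduces the joint update of $\theta$ and G in Algorithm 1; $\alpha = 20$ and $\epsilon = 0.07$ are the values used in Section 5.1.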
It is noteworthy that the Sinkhorn iterations can be implemented with the tensor operations of TensorFlow/PyTorch (Patrini et al., 2020). Therefore, the loss of Eq. (9) is differentiable in terms of $\theta$ and G, which can be optimised jointly in one training iteration. After training the model, we can infer z by conducting a forward pass of the encoder $\theta$ with the input $\bar{x}$. In practice, x can be normalised by other methods, e.g., softmax, or one can use TF-IDF as the input to the encoder.

4 RELATED WORKS

We first consider NTMs (e.g., in Miao et al. (2016); Srivastava & Sutton (2017); Krishnan et al. (2018); Card et al. (2018); Burkhardt & Kramer (2019); Dieng et al. (2020)), reviewed in Section 2.1, as the closest line of related work to ours. For a detailed survey of NTMs, we refer to Zhao et al. (2021). Connections and comparisons between our model and NTMs have been discussed in Section 3. In addition, word embeddings have recently been widely used as complementary metadata for topic models, especially for modelling short texts. For Bayesian probabilistic topic models, word embeddings are usually incorporated into the generative process of word counts, such as in Petterson et al. (2010); Nguyen et al. (2015); Li et al. (2016); Zhao et al. (2017). Due to the flexibility of NTMs, word embeddings can be incorporated as part of the encoder input, as in Card et al. (2018), or used in the generative process of words, as in Dieng et al. (2020). The novelty of NSTM is that word embeddings are naturally incorporated into the cost function of the OT distance.

To our knowledge, works that connect topic modelling with OT are still very limited. In Yurochkin et al. (2019), the authors propose to compare the similarity of two documents with the OT distance between their topic distributions extracted from a pretrained LDA, but the aim is not to learn a topic model. Another recent work related to ours is Wasserstein LDA (WLDA) (Nan et al., 2019), which adapts the framework of Wasserstein Auto-Encoders (WAEs) (Tolstikhin et al., 2018). The key difference from ours is that WLDA minimises the Wasserstein distance between the fake data generated with topics and the real data, which can be viewed as an OT variant of VAE-NTMs. In contrast, our NSTM directly minimises the OT distance between z and $\bar{x}$, with no explicit generative process from topics to data. Two other related works are Distilled Wasserstein Learning (DWL) (Xu et al., 2018) and Optimal Transport LDA (OTLDA) (Huynh et al., 2020), which adapt the idea of Wasserstein barycentres and Wasserstein dictionary learning (Rolet et al., 2016; Schmitz et al., 2018).
There are fundamental differences between our model and DWL/OTLDA in terms of the relations between documents, topics, and words. Specifically, in DWL and OTLDA, documents and topics both live in the word space (i.e., both are distributions over words), and $\bar{x}$ is approximated by the weighted Wasserstein barycentre of all the topic-word distributions, where the weights can be interpreted as the topic proportions of the document, i.e., z. In NSTM, however, a document lives in both the topic space and the word space, while topics and words are embedded in a shared embedding space. These differences lead to different views of topic modelling and to different frameworks. Moreover, DWL mainly focuses on learning word embeddings and representations for International Classification of Diseases (ICD) codes, whereas NSTM aims to be a general method for topic modelling. Finally, DWL and OTLDA are not neural network models while ours is.

5 EXPERIMENTS

We conduct extensive experiments on several benchmark text datasets to evaluate the performance of NSTM against state-of-the-art neural topic models.

5.1 EXPERIMENTAL SETTINGS

Datasets: Our experiments are conducted on five widely used benchmark text datasets of varying size: 20 News Groups (20NG)[2], Web Snippets (WS) (Phan et al., 2008), Tag My News (TMN) (Vitale et al., 2012)[3], Reuters, extracted from the Reuters-21578 dataset[4], and Reuters Corpus Volume 2 (RCV2) (Lewis et al., 2004)[5]. The statistics of the datasets are shown in Table 1. In particular, WS and TMN consist of short documents; 20NG, WS, and TMN are associated with document labels[6].

Table 1: Statistics of the datasets.

| Dataset | Number of docs | Vocabulary size (V) | Total number of words | Number of labels |
|---------|---------------|---------------------|-----------------------|------------------|
| 20NG    | 18,846        | 22,636              | 2,037,671             | 20               |
| WS      | 12,337        | 10,052              | 192,483               | 8                |
| TMN     | 32,597        | 13,368              | 592,973               | 7                |
| Reuters | 11,367        | 8,817               | 836,397               | N/A              |
| RCV2    | 804,414       | 7,282               | 60,209,009            | N/A              |

[2] http://qwone.com/~jason/20Newsgroups/
[3] http://acube.di.unipi.it/tmn-dataset/
[4] https://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
[5] https://trec.nist.gov/data/reuters/reuters.html
[6] We do not consider the labels of Reuters and RCV2 as one document can have multiple labels.
[7] http://palmetto.aksw.org

Evaluation metrics: We report Topic Coherence (TC) and Topic Diversity (TD) as performance metrics for topic quality. TC measures the semantic coherence of the most significant words (top words) of a topic, given a reference corpus. We apply the widely used Normalized Pointwise Mutual Information (NPMI) (Aletras & Stevenson, 2013; Lau et al., 2014), computed over the top 10 words of each topic by the Palmetto package (Röder et al., 2015)[7]. As not all the discovered topics are interpretable (Yang et al., 2015; Zhao et al., 2018), to comprehensively evaluate topic quality, we choose the topics with the highest NPMI and report the average score over those selected topics. We vary the proportion of selected topics from 10% to 100%, where 10% indicates that the top 10% of topics with the highest NPMI are selected and 100% means all topics are used. TD measures how diverse the discovered topics are. We define topic diversity as the percentage of unique words in the top 25 words (Dieng et al., 2020) of the selected topics, with topics selected in the same way as for TC. TD close to 0 indicates redundant topics; TD close to 1 indicates more varied topics.
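As a reference for how TD is computed under this definition (a direct reimplementation of the stated formula, not the paper's evaluation script), one can extract each topic's top words from the cost matrix M, where a smaller $m_{vk}$ means word v is more related to topic k, and then count unique words:

```python
import numpy as np

def top_words(M, vocab, n_top=25):
    """Top words per topic: rank words by ascending cost m_vk (Eq. (5))."""
    order = np.argsort(M, axis=0)                                  # (V, K), most related words first
    return [[vocab[v] for v in order[:n_top, k]] for k in range(M.shape[1])]

def topic_diversity(topics, n_top=25):
    """TD: fraction of unique words among the top-n_top words of the selected topics."""
    words = [w for topic in topics for w in topic[:n_top]]
    return len(set(words)) / len(words)

# Usage sketch: `selected` could be all topics (100%) or the top-x% topics ranked by NPMI.
# topics = top_words(M_numpy, vocab)
# selected = topics            # or a subset chosen by NPMI
# td = topic_diversity(selected)
```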
As doc-topic distributions can be viewed as unsupervised document representations, to evaluate the quality of such representations we perform document clustering and report purity and Normalized Mutual Information (NMI) (Manning et al., 2008) on 20NG, WS, and TMN, where document labels are available. With the default training/testing splits of the datasets, we train a model on the training documents and infer the topic distributions z of the testing documents. Given z, we adopt two strategies to perform the document clustering task: i) following Nguyen et al. (2015), we use the most significant topic of a testing document as its clustering assignment to compute purity and NMI (denoted by top-Purity and top-NMI); ii) we apply the KMeans algorithm on z (over all topics) of the testing documents and report the purity and NMI of the KMeans clusters (denoted by km-Purity and km-NMI). For the first strategy, the number of clusters equals the number of topics, while for the second we vary the number of KMeans clusters in the range {20, 40, 60, 80, 100}. Note that our goal is not to achieve state-of-the-art document clustering results but to compare the document representations of topic models. For all the metrics, higher values indicate better performance.

Figure 1: The first row shows the TC scores for all the datasets and the second row shows the corresponding TD scores. In each subfigure, the horizontal axis indicates the proportion of selected topics according to their NPMIs.

Figure 2: The first row shows the km-Purity scores and the second row shows the corresponding km-NMI scores. In each subfigure, the horizontal axis indicates the number of KMeans clusters.

Figure 3: Training loss, shown (a) over batches and (b) over seconds.

Baseline methods and their settings: We compare with the following state-of-the-art NTMs:
- LDA with Products of Experts (ProdLDA) (Srivastava & Sutton, 2017), which replaces the mixture model in LDA with a product of experts and uses AVI for training;
- Dirichlet VAE (DVAE) (Burkhardt & Kramer, 2019), a neural topic model imposing the Dirichlet prior/posterior on z; we use the variant of DVAE with rejection-sampling VI, which is reported to perform the best;
- Embedding Topic Model (ETM) (Dieng et al., 2020), a topic model that incorporates word embeddings and is learned by AVI;
- Wasserstein LDA (WLDA) (Nan et al., 2019), a WAE-based topic model.
For all the above baselines, we use their official code with the best reported settings.
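For reference, the KMeans-based clustering evaluation (km-Purity and km-NMI) described above can be sketched with scikit-learn as follows; the purity helper follows the standard definition in Manning et al. (2008), and integer-encoded labels are assumed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def purity(y_true, y_pred):
    """Purity: map each cluster to its majority class and measure the resulting accuracy."""
    correct = 0
    for cluster in np.unique(y_pred):
        labels = y_true[y_pred == cluster]
        correct += np.bincount(labels).max()      # size of the majority class in this cluster
    return correct / len(y_true)

def km_cluster_eval(Z, y_true, n_clusters=20, seed=0):
    """Cluster test documents by their topic distributions Z (N x K) with KMeans."""
    y_pred = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(Z)
    return purity(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred)

# Strategy i) in the text instead uses the most significant topic as the cluster assignment:
# y_pred = Z.argmax(axis=1)
# purity(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred)
```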
Settings for NSTM: NSTM is implemented in TensorFlow. For the encoder $\theta$, to keep it simple, we use a fully connected neural network with one hidden layer of 200 units and ReLU as the activation function, followed by a dropout layer (rate = 0.75) and a batch-norm layer, following the settings of Burkhardt & Kramer (2019). For the Sinkhorn algorithm, following Cuturi (2013), the maximum number of iterations is 1,000 and the stop tolerance is 0.005[8]. In all the experiments, we fix $\alpha = 20$ and $\epsilon = 0.07$; we further vary these two hyperparameters to study our model's sensitivity to them in Figure B.1 of the appendix, and fine-tuning them for a specific dataset may give better results. The optimisation of NSTM is done by Adam (Kingma & Ba, 2015) with learning rate 0.001 and batch size 200 for a maximum of 50 iterations. For NSTM and ETM, the 50-dimensional (i.e., L = 50, see Eq. (5)) GloVe word embeddings (Pennington et al., 2014) pretrained on Wikipedia[9] are used. We use K = 100 topics in most cases and set K = 500 on RCV2 to test our model's scalability.

[8] The Sinkhorn algorithm usually reaches the stop tolerance in fewer than 50 iterations in NSTM.
[9] https://nlp.stanford.edu/projects/glove/

5.2 RESULTS

Quantitative results: We run all the compared models five times with different random seeds and report the mean and standard deviation (as error bars). We show the TC and TD results in Figure 1, the top-Purity/NMI results in Table 2, and the km-Purity/NMI results in Figure 2.

Table 2: top-Purity and top-NMI for document clustering. The best score of each dataset is highlighted in boldface.

| Model   | top-Purity 20NG | top-Purity WS | top-Purity TMN | top-NMI 20NG | top-NMI WS | top-NMI TMN |
|---------|-----------------|---------------|----------------|--------------|------------|-------------|
| LDA     | 0.398±0.013 | 0.446±0.022 | 0.470±0.008 | 0.320±0.010 | 0.185±0.013 | 0.125±0.006 |
| ProdLDA | 0.417±0.004 | 0.293±0.023 | 0.405±0.157 | 0.321±0.004 | 0.066±0.016 | 0.091±0.101 |
| DVAE    | 0.281±0.006 | 0.284±0.005 | 0.477±0.012 | 0.187±0.005 | 0.059±0.001 | 0.113±0.004 |
| ETM     | 0.063±0.003 | 0.215±0.001 | 0.556±0.022 | 0.005±0.005 | 0.003±0.003 | 0.328±0.010 |
| WLDA    | 0.117±0.001 | 0.239±0.003 | 0.260±0.002 | 0.060±0.001 | 0.026±0.001 | 0.009±0.001 |
| NSTM    | **0.477±0.011** | **0.451±0.009** | **0.637±0.010** | **0.415±0.012** | **0.201±0.004** | **0.334±0.004** |

We make the following remarks about the results. i) Our proposed NSTM outperforms the others significantly in terms of topic coherence while obtaining high topic diversity on all the datasets. Although other models may have higher TD than ours on one or two datasets, they usually cannot achieve a high TC at the same time. ii) In terms of document clustering, our model performs the best in general, with a significant gap over the other NTMs, except that ours is second for the KMeans clustering on 20NG. This demonstrates that NSTM is not only able to discover interpretable topics of better quality but also learns good document representations for clustering. It also shows that, with the OT distance, our model achieves a better balance across the comprehensive metrics of topic modelling. iii) For all the evaluation metrics, our model is consistently the best on the short-document datasets, WS and TMN. This demonstrates the effectiveness of our way of incorporating pretrained word embeddings and shows our model's potential for short-text topic modelling. Although ETM also uses pretrained word embeddings, its performance is not comparable to ours.

Scalability: NSTM has comparable scalability to other NTMs and is able to scale to large datasets with a large number of topics. To demonstrate this, we run NSTM, DVAE, and ProdLDA (these three are implemented in TensorFlow, while ETM is in PyTorch and WLDA is in MXNet) on RCV2 with K = 500.
The three models run on a Titan RTX GPU with batch size 1,000. Figure 3 shows the training losses, which demonstrate that NSTM has a learning speed similar to ProdLDA and better than DVAE. The TC and TD scores of this experiment are shown in Section C of the appendix, where it can be observed that, with 500 topics, our model shows a similar performance advantage over the others.

Qualitative analysis: As topics in our model are embedded in the same space as the pretrained word embeddings, they share similar geometric properties. Figure 4 shows a qualitative analysis. For the t-SNE (Maaten & Hinton, 2008) visualisation, we select the top 50 topics with the highest NPMI learned by a run of NSTM on RCV2 with K = 100 and feed their (50-dimensional) embeddings into t-SNE. We also show the top five words and the topic number (1 to 50) of each topic.

Figure 4: Left: t-SNE visualisation of topic embeddings on RCV2, where one red dot represents a topic and the top 5 words and the topic number (1 to 50) of each topic are shown. Right: interactions between word and topic embeddings, showing the top 10 related words of "apple", of "apple" minus topic 46, and of "apple" plus topic 1.

We can observe that, although the words of the topics differ, the semantic similarity between the topics captured by the embeddings is highly interpretable. In addition, we take the GloVe embedding of the polysemous word "apple" and find its 10 closest words among the 0.4 million words of the GloVe vocabulary according to cosine similarity. It can be seen that, by default, "apple" refers more to the Apple company in GloVe. Either adding the embedding of topic 1, which describes the concept of food, or subtracting the embedding of topic 46, which describes the concept of tech companies, reveals the fruit sense of the word "apple". More qualitative analysis of topics is provided in Section E of the appendix.

6 CONCLUSION

In this paper, we presented a novel neural topic model based on optimal transport, where a document is endowed with two representations: the word distribution $\bar{x}$ and the topic distribution z. An OT distance is leveraged to compare the semantic distance between the two distributions, with a cost function defined according to the cosine similarities between topics and words in an embedding space. z is obtained from an encoder that takes $\bar{x}$ as input and is trained by minimising the OT distance between z and $\bar{x}$. With pretrained word embeddings, topic embeddings are learned by the same minimisation of the OT distance through the cost function. Our model has appealing properties that overcome several shortcomings of existing neural topic models. Extensive experiments show that our model achieves state-of-the-art performance both in discovering quality topics and in deriving useful document representations, for both regular and short texts.

ACKNOWLEDGMENTS

Trung Le was supported by AOARD grant FA2386-19-1-4040.
Wray Buntine was supported by the Australian Research Council under award DP190100017.

REFERENCES

Nikolaos Aletras and Mark Stevenson. Evaluating topic coherence using distributional semantics. In International Conference on Computational Semantics, pp. 13–22, 2013.

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.

David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7, 2010.

Sophie Burkhardt and Stefan Kramer. Decoupling sparsity and smoothness in the Dirichlet variational autoencoder topic model. JMLR, 20(131):1–27, 2019.

Dallas Card, Chenhao Tan, and Noah A Smith. Neural models for documents with metadata. In ACL, pp. 2031–2040, 2018.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pp. 2292–2300, 2013.

Marco Cuturi and David Avis. Ground metric learning. JMLR, 15(1):533–564, 2014.

Adji B Dieng, Francisco JR Ruiz, and David M Blei. Topic modeling in embedding spaces. TACL, 8:439–453, 2020.

Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a Wasserstein loss. In NIPS, pp. 2053–2061, 2015.

Zhe Gan, R. Henao, D. Carlson, and Lawrence Carin. Learning deep sigmoid belief networks with data augmentation. In AISTATS, pp. 268–276, 2015.

Viet Huynh, He Zhao, and Dinh Phung. OTLDA: A geometry-aware optimal transport approach for topic modeling. In NeurIPS, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2013.

Rahul Krishnan, Dawen Liang, and Matthew Hoffman. On the challenges of learning with inference networks on sparse, high-dimensional data. In AISTATS, pp. 143–151, 2018.

John D Lafferty and David M Blei. Correlated topic models. In NIPS, pp. 147–154, 2006.

Jey Han Lau, David Newman, and Timothy Baldwin. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In EACL, pp. 530–539, 2014.

David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5(Apr):361–397, 2004.

Chenliang Li, Haoran Wang, Zhiqian Zhang, Aixin Sun, and Zongyang Ma. Topic modeling for short texts with auxiliary word embeddings. In SIGIR, pp. 165–174, 2016.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.

Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to Information Retrieval. Cambridge University Press, Cambridge, 2008.

Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In ICML, pp. 1727–1736, 2016.

Yishu Miao, Edward Grefenstette, and Phil Blunsom. Discovering discrete latent topics with neural variational inference. In ICML, pp. 2410–2419, 2017.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.

Feng Nan, Ran Ding, Ramesh Nallapati, and Bing Xiang. Topic modeling with Wasserstein autoencoders. In ACL, pp. 6345–6381, 2019.

Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. Improving topic models with latent feature word representations. TACL, 3:299–313, 2015.

John Paisley, Chong Wang, David M Blei, and Michael I Jordan. Nested hierarchical Dirichlet processes. TPAMI, 37(2):256–270, 2015.
Giorgio Patrini, Rianne van den Berg, Patrick Forre, Marcello Carioni, Samarth Bhargav, Max Welling, Tim Genewein, and Frank Nielsen. Sinkhorn autoencoders. In UAI, pp. 733–743, 2020.

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In EMNLP, pp. 1532–1543, 2014.

James Petterson, Wray Buntine, Shravan M Narayanamurthy, Tibério S Caetano, and Alex J Smola. Word features for latent Dirichlet allocation. In NIPS, pp. 1921–1929, 2010.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Xuan-Hieu Phan, Le-Minh Nguyen, and Susumu Horiguchi. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In WWW, pp. 91–100, 2008.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286, 2014.

Michael Röder, Andreas Both, and Alexander Hinneburg. Exploring the space of topic coherence measures. In WSDM, pp. 399–408, 2015.

Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast dictionary learning with a smoothed Wasserstein loss. In AISTATS, pp. 630–638, 2016.

Morgan A Schmitz, Matthieu Heitz, Nicolas Bonneel, Fred Ngole, David Coeurjolly, Marco Cuturi, Gabriel Peyré, and Jean-Luc Starck. Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear dictionary learning. SIAM Journal on Imaging Sciences, 11(1):643–678, 2018.

Vivien Seguy, Bharath Bhushan Damodaran, Remi Flamary, Nicolas Courty, Antoine Rolet, and Mathieu Blondel. Large scale optimal transport and mapping estimation. In ICLR, 2018.

Akash Srivastava and Charles Sutton. Autoencoding variational inference for topic models. In ICLR, 2017.

Haodong Sun, Haomin Zhou, Hongyuan Zha, and Xiaojing Ye. Learning cost functions for optimal transport. arXiv preprint arXiv:2002.09650, 2020.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In ICLR, 2018.

Daniele Vitale, Paolo Ferragina, and Ugo Scaiella. Classification of short texts by deploying topical annotations. In ECIR, pp. 376–387, 2012.

Hongteng Xu, Wenlin Wang, Wei Liu, and Lawrence Carin. Distilled Wasserstein learning for word embedding and topic modeling. In NeurIPS, pp. 1716–1725, 2018.

Yi Yang, Doug Downey, and Jordan Boyd-Graber. Efficient methods for incorporating knowledge into topic models. In EMNLP, pp. 308–317, 2015.

Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin M Solomon. Hierarchical optimal transport for document representation. In NeurIPS, pp. 1599–1609, 2019.

Hao Zhang, Bo Chen, Dandan Guo, and Mingyuan Zhou. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In ICLR, 2018.

He Zhao, Lan Du, and Wray Buntine. A word embeddings informed focused topic model. In ACML, pp. 423–438, 2017.

He Zhao, Lan Du, Wray Buntine, and Mingyuan Zhou. Dirichlet belief networks for topic structure learning. In NeurIPS, pp. 7966–7977, 2018.

He Zhao, Dinh Phung, Viet Huynh, Yuan Jin, Lan Du, and Wray Buntine. Topic modelling meets deep neural networks: A survey. arXiv preprint arXiv:2103.00498, 2021.

Mingyuan Zhou, Yulai Cong, and Bo Chen. Augmentable gamma belief networks. JMLR, 17(163):1–44, 2016.

A PROOF OF THEOREM 1
Proof. Before presenting the proof, we introduce the following notation. We denote by $k \in \{1, \dots, K\}$ and $v \in \{1, \dots, V\}$ the topic and word indexes; the $s$-th ($s \in \{1, \dots, S\}$) token of the document picks a word in the vocabulary, denoted by $w_s \in \{1, \dots, V\}$; and the normaliser in the softmax of $\phi(z)$ is denoted by $\hat{\phi}$, so that

$$\hat{\phi} = \sum_{v=1}^{V} e^{\sum_{k=1}^{K} z_k (2 - m_{vk})} = e^{2} \sum_{v=1}^{V} e^{-\sum_{k=1}^{K} z_k m_{vk}}.$$

With this notation, we first have the following equation for the multinomial log-likelihood:

$$-\bar{x}^\top \log \phi(z) = -\frac{1}{S} \sum_{s=1}^{S} \log \phi(z)_{w_s} = -\frac{1}{S} \sum_{s=1}^{S} \left( \sum_{k=1}^{K} z_k (2 - m_{w_s k}) - \log \hat{\phi} \right) = \log \hat{\phi} - 2 + \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} z_k m_{w_s k}. \quad \text{(A.1)}$$

Recall that in Eq. (2) of the main paper, the transport matrix P is one of the joint distributions of $\bar{x}$ and z. We introduce the conditional distribution of topics given words as Q, where $q(v, k)$ indicates the probability of assigning a token of word v to topic k. Given that P satisfies $P \in U(\bar{x}, z)$ and $p_{vk} = \bar{x}_v q(v, k)$, Q must satisfy $U'(\bar{x}, z) := \{Q \in \mathbb{R}^{V \times K}_{>0} \mid \sum_{v=1}^{V} \bar{x}_v q(v, k) = z_k\}$. With Q, we can rewrite the OT distance as:

$$d_M(\bar{x}, z) = \min_{Q \in U'(\bar{x}, z)} \sum_{v=1}^{V} \sum_{k=1}^{K} \bar{x}_v q(v, k)\, m_{vk} = \min_{Q \in U'(\bar{x}, z)} \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} q(w_s, k)\, m_{w_s k}.$$

If we let $q(v, k) = z_k$, meaning that all the tokens of a document are assigned to the topics according to the document's doc-topic distribution, then Q satisfies $U'(\bar{x}, z)$, which leads to:

$$d_M(\bar{x}, z) \leq \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} z_k m_{w_s k}. \quad \text{(A.2)}$$

Together with Eq. (A.1), the definition of $\hat{\phi}$, and the fact that $m_{vk} \leq 2$, we have:

$$-\bar{x}^\top \log \phi(z) = \log\!\left( \sum_{v=1}^{V} e^{-\sum_{k=1}^{K} z_k m_{vk}} \right) + \frac{1}{S} \sum_{s=1}^{S} \sum_{k=1}^{K} z_k m_{w_s k} \geq (\log V - 2) + d_M(\bar{x}, z) \geq d_M(\bar{x}, z), \quad \text{(A.3)}$$

where the last inequality holds if $\log V \geq 2$, i.e., $V \geq 8$.

Figure B.1: Parameter sensitivity of NSTM on 20NG. The first and second rows show the performance with different values of $\epsilon$ and $\alpha$, respectively. In the first row, we fix $\alpha = 20$ and vary $\epsilon$; in the second row, we fix $\epsilon = 0.07$ and vary $\alpha$.

Figure C.1: TC and TD on RCV2 with 500 topics.

B PARAMETER SENSITIVITY

In the previous experiments, we fixed the values of $\epsilon$ and $\alpha$, which control the weight of the multinomial likelihood in Eq. (9) and the weight of the entropic regularisation in the Sinkhorn distance, respectively. Here we report the performance of NSTM on 20NG (blue lines) under different settings of the two hyperparameters in Figure B.1. Moreover, we propose two variants of NSTM. The first removes the Sinkhorn distance from the training loss of Eq. (9) (i.e., only the expected log-likelihood term remains); its performance is shown as the red lines. The second removes the expected log-likelihood term from the training loss of Eq. (9) (i.e., only the Sinkhorn distance remains); its performance is shown as the yellow lines.

C TC AND TD ON RCV2 WITH 500 TOPICS

The results are shown in Figure C.1.

D AVERAGE SINKHORN DISTANCE WITH VARIED NUMBER OF TOPICS

In Figure D.1, we show the average Sinkhorn distance with a varied number of topics on 20NG, WS, TMN, and Reuters. It can be observed that, as K increases, there is a clear trend that $d_{M,\alpha}(\bar{x}, z)$ decreases.
Figure D.1: Sinkhorn distance with varied K. Vertical axis: the average Sinkhorn distance over all the training documents, i.e., the mean of $d_{M,\alpha}(\bar{x}, z)$. Horizontal axis: the number of topics K.

E MORE TOPIC EMBEDDING VISUALISATIONS

In Figures E.1, E.2, E.3, and E.4, we show the visualisations for 20NG, WS, TMN, and Reuters, respectively. We note that the topic embeddings in general present clear clustering structures of topics in the semantic space. Such topic correlations can only be detected by specialised topic models (e.g., in Lafferty & Blei (2006); Blei et al. (2010); Zhou et al. (2016)). Instead, the correlations of topics in our model are implicitly captured by the semantic embeddings.

Figure E.1: t-SNE visualisation of topic embeddings on 20NG (each topic is shown with its number and top five words).
Figure E.2: t-SNE visualisation of topic embeddings on WS (each topic is shown with its number and top five words).

Figure E.3: t-SNE visualisation of topic embeddings on TMN (each topic is shown with its number and top five words).
Figure E.4: t-SNE visualisation of topic embeddings on Reuters (each topic is shown with its number and top five words).