Published in Transactions on Machine Learning Research (11/2023)

# Revisiting Topic-Guided Language Models

Carolina Zheng (carozheng@cs.columbia.edu), Department of Computer Science, Columbia University

Keyon Vafa (kv2294@columbia.edu), Department of Computer Science, Columbia University

David M. Blei (david.blei@columbia.edu), Department of Statistics and Department of Computer Science, Columbia University

Reviewed on OpenReview: https://openreview.net/forum?id=lXBEwFfxpA

## Abstract

A recent line of work in natural language processing has aimed to combine language models and topic models. These topic-guided language models augment neural language models with topic models, unsupervised learning methods that can discover document-level patterns of word use. This paper compares the effectiveness of these methods in a standardized setting. We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on four corpora. Surprisingly, we find that none of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics. Further, we train a probe of the neural language model that shows that the baseline's hidden states already encode topic information. We make public all code used for this study.

## 1 Introduction

Recurrent neural networks (RNNs) and LSTMs have been an important class of models in the development of methods for many tasks in natural language processing, including machine translation, summarization, and speech recognition. One of the most successful applications of these models is in language modeling, where they are effective at modeling small text corpora. Even with the advent of transformer-based language models, RNNs and LSTMs can outperform non-pretrained transformers on various small datasets (Melis et al., 2020).
While powerful, RNN- and LSTM-based models struggle to capture long-range dependencies in their context history (Bai et al., 2018; Sankar et al., 2019). Additionally, they are not designed to learn interpretable structure in a corpus of documents. To this end, multiple researchers have proposed adapting these models by incorporating topic models (Dieng et al., 2017; Lau et al., 2017; Rezaee & Ferraro, 2020; Guo et al., 2020). The motivation for combining language models and topic models is to decouple local syntactic structure, which can be modeled by a language model, from document-level semantic concepts, which can be captured by a topic model (Khandelwal et al., 2018; O'Connor & Andreas, 2021). The topic model component is also designed to uncover latent structure in documents. We refer to these models as topic-guided language models.

Broadly, this body of research has reported good results: topic-guided language models improve next-word predictive performance and learn interpretable topics. In this work, we re-investigate this class of models by evaluating four representative topic-guided language model (TGLM) papers in a unified setting. We train the models from Dieng et al. (2017), Lau et al. (2017), Rezaee & Ferraro (2020), and Guo et al. (2020) on three document-level corpora and evaluate their held-out perplexity. Unlike some prior work, during next-word prediction we take care to condition the topic model component only on previous words, rather than on the entire document. Moreover, we use a baseline language model that is conditioned on all previously seen document words, rather than being restricted to the current sentence (Lau et al., 2017; Rezaee & Ferraro, 2020; Guo et al., 2020). Additionally, we choose baseline language models with comparable model sizes to ensure valid comparisons. Our finding: TGLMs offer no predictive improvement over a standard LSTM-LM baseline (Zaremba et al., 2014).
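Held-out perplexity, the evaluation metric used above, is the exponentiated average negative log-likelihood per token, where each token's probability is conditioned only on the preceding tokens. A minimal sketch in pure Python; the toy `docs` values are hypothetical stand-ins for a model's next-token log-probabilities, not outputs of any model in this study:

```python
import math

def perplexity(log_probs):
    """Corpus perplexity from per-token log-probabilities log p(x_t | x_<t)."""
    n = sum(len(doc) for doc in log_probs)          # total token count
    total = sum(sum(doc) for doc in log_probs)      # total log-likelihood
    return math.exp(-total / n)

# Two toy "documents", each a list of next-token log-probabilities.
docs = [[math.log(0.25)] * 4, [math.log(0.5)] * 2]
print(perplexity(docs))  # 2 ** (5/3), roughly 3.1748
```

Lower is better: a model that assigned every token probability 1 would achieve the minimum perplexity of 1.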
In order to understand why topic-guided language models offer no predictive improvement, we probe the LSTM-LM's hidden representations. A probe is a trained predictor used to measure the extent to which fitted black-box models, such as neural models, have learned specific linguistic features of the input (Hewitt & Liang, 2019). The probe reveals that the LSTM-LM already encodes topic information, rendering a formal topic model component redundant.

Additionally, topic-guided language models were developed to provide insight into text corpora by uncovering latent topics. This method of exploratory text analysis is commonly used in the social sciences and digital humanities (Griffiths & Steyvers, 2004; Blei & Lafferty, 2007; Grimmer & Stewart, 2013; Mohr & Bogdanov, 2013). Here, we show that the topics learned by topic-guided language models are no better than those of a standard topic model and, for some of the models, qualitatively poor.

This paper contributes to a line of reproducibility studies in machine learning that aim to evaluate competing methods in a consistent and equitable manner. These studies have uncovered instances where results are not directly comparable, as reported numbers are borrowed from prior works that used different experimental settings (Marie et al., 2021; Hoyle et al., 2021). Furthermore, they identify cases where baselines are either too weak or improperly tuned (Dacrema et al., 2019; Nityasya et al., 2023). We observe analogous issues within the topic-guided language modeling literature. To support transparency and reproducibility, we make public all code used in this study.1

Finally, we consider how these insights apply to other models. While prior work has incorporated topic model components into RNNs and LSTMs, the topic-guided language model framework is agnostic to the class of neural language model used.
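The probing methodology mentioned above amounts to fitting a simple supervised classifier on frozen representations. A minimal sketch in pure Python, using a binary logistic-regression probe; the two-dimensional "hidden states" and binary "topic" labels are hypothetical stand-ins, not the LSTM-LM representations or inferred topics from this study:

```python
import math

def train_linear_probe(states, labels, lr=0.5, epochs=200):
    """Fit a binary logistic-regression probe on frozen hidden states."""
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for h, y in zip(states, labels):
            z = sum(wi * hi for wi, hi in zip(w, h)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(label = 1)
            g = p - y                        # d(log loss)/dz
            w = [wi - lr * g * hi for wi, hi in zip(w, h)]
            b -= lr * g
    return w, b

def probe_accuracy(w, b, states, labels):
    """Fraction of states whose predicted label matches the true label."""
    hits = 0
    for h, y in zip(states, labels):
        z = sum(wi * hi for wi, hi in zip(w, h)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(states)

# Hypothetical 2-d "hidden states"; the first coordinate happens to encode the topic.
states = [[1.0, 0.3], [0.9, -0.2], [-1.1, 0.4], [-0.8, -0.5]]
labels = [1, 1, 0, 0]
w, b = train_linear_probe(states, labels)
```

High probe accuracy on held-out states would indicate that the topic signal is linearly decodable from the hidden states, which is the sense in which a separate topic model component becomes redundant.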
We conclude by discussing how the results in this paper are relevant to researchers considering incorporating topic models into more powerful neural language models, such as transformers.

## 2 Study Design

Let $x_{1:T} = \{x_1, \dots, x_T\}$ be a sequence of tokens collectively known as a document, where each $x_t$ indexes one of $V$ words in a vocabulary (words outside the vocabulary are mapped to a special out-of-vocabulary token). Given a corpus of documents, the goal of language modeling is to learn a model $p(x_{1:T})$ that approximates the probability of observing a document. A document can be modeled autoregressively using the chain rule of probability,

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t}).$$
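The chain-rule factorization can be made concrete with a count-based bigram model, which approximates $p(x_t \mid x_{<t})$ by $p(x_t \mid x_{t-1})$. A minimal, self-contained sketch; the toy corpus and add-one smoothing are illustrative choices, not the models evaluated in this study:

```python
import math
from collections import Counter, defaultdict

def fit_bigram_lm(corpus, vocab):
    """Estimate p(x_t | x_{t-1}) from add-one-smoothed bigram counts (BOS-padded)."""
    counts = defaultdict(Counter)
    for doc in corpus:
        prev = "<bos>"
        for tok in doc:
            counts[prev][tok] += 1
            prev = tok
    V = len(vocab)
    def cond_prob(tok, prev):
        return (counts[prev][tok] + 1) / (sum(counts[prev].values()) + V)
    return cond_prob

def doc_log_prob(doc, cond_prob):
    """log p(x_{1:T}) = sum_t log p(x_t | x_<t), here a bigram approximation."""
    lp, prev = 0.0, "<bos>"
    for tok in doc:
        lp += math.log(cond_prob(tok, prev))
        prev = tok
    return lp

corpus = [["the", "cat"], ["the", "dog"]]
cond_prob = fit_bigram_lm(corpus, vocab={"the", "cat", "dog"})
lp = doc_log_prob(["the", "cat"], cond_prob)
# p(the | <bos>) * p(cat | the) = 3/5 * 2/5 = 0.24, so lp = log(0.24)
```

A neural language model replaces the count-based conditional with a parameterized one, $p(x_t \mid x_{<t}; \theta)$, that can condition on the entire prefix rather than only the previous token.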