# Compressed Nonparametric Language Modelling

Ehsan Shareghi, Gholamreza Haffari, Trevor Cohn
Faculty of Information Technology, Monash University
Computing and Information Systems, The University of Melbourne
first.last@{monash.edu, unimelb.edu.au}

Hierarchical Pitman-Yor Process priors are compelling for learning language models, outperforming point-estimate based methods. However, these models remain unpopular due to computational and statistical inference issues, such as memory and time usage, as well as poor mixing of the sampler. In this work we propose a novel framework which represents the HPYP model compactly using compressed suffix trees. We then develop an efficient approximate inference scheme in this framework that has a much lower memory footprint compared to the full HPYP and is fast at inference time. The experimental results illustrate that our model can be built on significantly larger datasets compared to previous HPYP models, while being several orders of magnitude smaller, fast for training and inference, and improving on the perplexity of state-of-the-art Modified Kneser-Ney count-based LM smoothing by up to 15%.

## 1 Introduction

Statistical Language Models (LMs) are key components of many tasks in Natural Language Processing, such as Statistical Machine Translation and Speech Recognition. Conventional LMs are n-gram models which apply the Markov assumption to approximate the probability of a sequence $w_1^N$ as

$$P(w_1^N) = \prod_{i=1}^{N} P(w_i \mid w_1^{i-1}) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}^{i-1}). \tag{1}$$

Several smoothing techniques have been proposed to address the statistical sparsity issue in the computation of each conditional probability term. The most widely used smoothing techniques for LMs are Kneser-Ney (KN) [Kneser and Ney, 1995] and its extension Modified Kneser-Ney (MKN) [Chen and Goodman, 1999]. The intuition behind KN, MKN and their extensions [Shareghi et al., 2016a] is to adjust the original distribution to assign non-zero probability to unseen or rare events. This is achieved by re-allocating probability mass in an interpolative procedure via absolute discounting.

It turns out that the Bayesian generalisation of the KN family of smoothing methods is the Hierarchical Pitman-Yor Process (HPYP) LM [Teh, 2006a], which was originally developed for finite-order LMs [Teh, 2006b] and was extended as the Sequence Memoizer (SM) [Wood et al., 2011] to model infinite-order LMs. While capturing long-range dependencies via the HPYP improves the estimation of conditional probabilities, these models remain impractical due to several computational and learning challenges, namely large model size (the data structure representing the model and the number of parameters), long training and test times, and poor sampler mixing.

In this paper we address the aforementioned issues. Inspired by recent advances in using compressed data structures for LMs [Shareghi et al., 2015; 2016b], our model is built on top of a compressed suffix tree (CST) [Ohlebusch et al., 2010]. In the training step, only the CST representation of the text is constructed, allowing for very fast training, while an efficient approximate inference algorithm is used at test time. Mixing issues are avoided via careful sampler initialisation and design. The empirical results show that our proposed approximation of the HPYP is richer than KN and MKN, and is much more efficient in the learning and inference phases compared to the full HPYP.
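As a concrete reading of Equation (1), the sketch below accumulates the log-probability of a sequence under an order-$n$ Markov approximation. It is a minimal illustration only: `cond_prob` is a hypothetical stand-in for any smoothed estimator of $P(w_i \mid w_{i-n+1}^{i-1})$, such as the KN, MKN, or HPYP estimators discussed in this paper.

```python
import math
from typing import Callable, Sequence


def sentence_logprob(words: Sequence[str],
                     cond_prob: Callable[[str, Sequence[str]], float],
                     n: int) -> float:
    """log P(w_1^N) under the Markov approximation of Equation (1)."""
    logp = 0.0
    for i, w in enumerate(words):
        # Condition only on the previous n-1 words, i.e. w_{i-n+1}^{i-1}.
        context = words[max(0, i - n + 1):i]
        logp += math.log(cond_prob(w, context))
    return logp
```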
Compared with 10-gram KN and MKN models, our $\infty$-gram model consistently improves perplexity by up to 15%. Our compressed framework allows us to train on large collections of text, i.e. 100× larger than the largest dataset used in previous HPYP LMs [Wood et al., 2011], while having a memory footprint several orders of magnitude smaller and supporting fast and efficient inference.

## 2 Interpolative Language Models

Conventional interpolative smoothing techniques in LM follow a general form,

$$P(w_i \mid u) = \frac{c(uw_i) - d}{c(u)} + \frac{\gamma(u, d)}{c(u)}\, P(w_i \mid \pi(u)),$$

where $u = w_{i-n+1}^{i-1}$ is called the context, $\pi(u)$ is $u$ with its least recent symbol dropped, $c$ and $d$ are the count and the absolute discount, and $\gamma$ is the mass allocated to the lower level of the interpolation. Interpolative smoothing assumes that $P(w \mid u)$, the conditional distribution of a word $w$ in the context $u$, is similar to, and hence smoothed by, that of the suffix of the context, $P(w \mid \pi(u))$. This recursive smoothing stops at the unigram level, where the conditioning context is empty, $u = \varepsilon$ (see the left panel of Table 1).

| $k$ | $u$ | $P^k_{\text{KN}}(w_i \mid u, d_k)$ | $P^k_{\text{HPYP}}(w_i \mid u, \eta_u)$ |
|---|---|---|---|
| $n$ | $u_1$ | $\frac{c(uw_i) - d_k}{c(u)} + \frac{d_k N_{1+}(u\bullet)}{c(u)}\, P^{k-1}_{\text{KN}}(w_i \mid \pi(u), d_{k-1})$ | $\frac{n^u_{w_i} - d_u t^u_{w_i}}{n^u_{\cdot} + \theta_u} + \frac{\theta_u + d_u t^u_{\cdot}}{n^u_{\cdot} + \theta_u}\, P^{k-1}_{\text{HPYP}}(w_i \mid \pi(u), \eta_{\pi(u)})$ |
| $n-1$ | $\pi(u_1)$ | $\frac{N_{1+}(\bullet u w_i) - d_k}{N_{1+}(\bullet u \bullet)} + \frac{d_k N_{1+}(u\bullet)}{N_{1+}(\bullet u \bullet)}\, P^{k-1}_{\text{KN}}(w_i \mid \pi(u), d_{k-1})$ | $\frac{n^u_{w_i} - d_u t^u_{w_i}}{n^u_{\cdot} + \theta_u} + \frac{\theta_u + d_u t^u_{\cdot}}{n^u_{\cdot} + \theta_u}\, P^{k-1}_{\text{HPYP}}(w_i \mid \pi(u), \eta_{\pi(u)})$ |
| $\dots$ | $\dots$ | $\dots$ | $\dots$ |
| $1$ | $\varepsilon$ | $\frac{N_{1+}(\bullet u w_i) - d_k}{N_{1+}(\bullet u \bullet)} + \frac{d_k N_{1+}(u\bullet)}{N_{1+}(\bullet u \bullet)} \cdot \frac{1}{\lvert\sigma\rvert}$ | $\frac{n^u_{w_i} - d_u t^u_{w_i}}{n^u_{\cdot} + \theta_u} + \frac{\theta_u + d_u t^u_{\cdot}}{n^u_{\cdot} + \theta_u} \cdot \frac{1}{\lvert\sigma\rvert}$ |

Table 1: One-to-one mapping of interpolative smoothing under KN and HPYP. In the $P^k_{\text{KN}}$ column, $N_{1+}(u\bullet) = |\{w : c(uw) > 0\}|$, and $N_{1+}(\bullet u w_i)$ and $N_{1+}(\bullet u \bullet)$ are defined similarly. In the $P^k_{\text{HPYP}}$ column, $n^u_{w_i} = c(uw_i)$ and $n^u_{\cdot} = c(u)$ when $k = n$. Also $n^u_{\cdot} = \sum_{w \in \sigma_u} n^u_w$ and $t^u_{\cdot}$ is defined similarly, and $\eta_u = \{d_u, \theta_u, \{n^u_w, t^u_w\}_{w \in \sigma_u}\}$.

Figure 1: An example of an interpolative LM of depth 3, with tree levels $k = 1, \dots, 4$ corresponding to the contexts $\varepsilon$, $\pi(\pi(u_1))$, $\pi(u_1)$, and the leaves $u_1, u_2, u_3$. Moving from leaf node $u_1$ towards the root corresponds to moving from the top row towards the bottom row of Table 1. Nodes sharing a parent correspond to identical sequences except for their least recent symbol; for example, a partial sequence assignment to the nodes is $\pi(\pi(u_1)) = c$, $\pi(u_1) = bc$, $u_1 = abc$, $u_2 = bbc$, $u_3 = dbc$.

In what follows, we provide a brief overview of the hierarchical Pitman-Yor Process LM and its relationship with KN.
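Before that, as a concrete reference for the left (KN) column of Table 1, the sketch below implements the recursive interpolation above with a single absolute discount. It is a simplification, not the full KN/MKN estimator: it uses raw counts at every level rather than the continuation counts $N_{1+}(\bullet u w_i)$ that KN substitutes below the top level, and the count table and discount value are hypothetical placeholders.

```python
from collections import Counter
from typing import Dict, Tuple

# Hypothetical count tables: counts[u] maps a word w to c(uw) for the context u
# (a tuple of words); contexts of every order, down to the empty tuple, are included.
Counts = Dict[Tuple[str, ...], Counter]


def interp_prob(w: str, u: Tuple[str, ...], counts: Counts,
                d: float, vocab_size: int) -> float:
    """P(w | u) by absolute-discount interpolation, recursing on the suffix pi(u)."""
    # Lower-order estimate: drop the least recent symbol, or use uniform 1/|sigma|
    # once the context is empty (u = epsilon).
    lower = interp_prob(w, u[1:], counts, d, vocab_size) if u else 1.0 / vocab_size
    c_u = counts.get(u)
    if not c_u:                      # unseen context: use the lower-order estimate directly
        return lower
    c_total = sum(c_u.values())                        # c(u)
    n_types = sum(1 for v in c_u.values() if v > 0)    # N_{1+}(u .)
    discounted = max(c_u.get(w, 0) - d, 0.0) / c_total
    backoff_mass = d * n_types / c_total               # gamma(u, d) / c(u)
    return discounted + backoff_mass * lower


# Toy usage (hypothetical counts):
# counts = {(): Counter(a=3, b=1), ("a",): Counter(b=2)}
# interp_prob("b", ("a",), counts, d=0.75, vocab_size=2)
```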
### 2.1 Hierarchical Pitman-Yor Process (HPYP) LM

We start by describing the Pitman-Yor process (PYP; [Pitman and Yor, 1997]), used as a prior over LM parameters. $\text{PYP}(d, \theta, H)$ is a distribution over the space of probability distributions with three parameters: a base distribution $H$, which is the expected value of a draw from the PYP; a concentration parameter $\theta > -d$, which controls the variation of draws from the PYP around $H$; and a discount parameter $0 \leq d < 1$, which controls the heavy-tailedness of the sampled distributions. The PYP is an appropriate prior for LMs because it captures the power-law behaviour prevalent in natural language [Goldwater et al., 2011].

To illustrate the use of the PYP prior, we consider as the likelihood a simple unigram LM $G$, from which the words of a text are generated. The Chinese restaurant process (CRP) is a metaphor that allows generating words from a PYP without directly dealing with the LM $G$ itself, by integrating it out. Consider a restaurant where customers are seated at different tables, and each table is served one dish. To make the analogy, the restaurant corresponds to the LM, the customers are the text tokens to be generated, and the dishes are the words. Let $t_w$ denote the number of tables serving the same dish $w$ in the restaurant, $n_w$ the total number of customers seated at these tables, and $t_{\cdot} = \sum_w t_w$ and $n_{\cdot} = \sum_w n_w$ the total numbers of tables and customers, respectively. Generating the next word from the LM is done by sending a customer to the restaurant, who either (i) sits at an existing table serving the dish $w$ with probability proportional to $n_w - d t_w$, or (ii) sits at a new table with probability proportional to $\theta + d t_{\cdot}$ and orders a dish from the base distribution $H$. The probability of the next word $w$ is thus

$$P(w \mid \eta) = \frac{n_w - d t_w}{n_{\cdot} + \theta} + \frac{\theta + d t_{\cdot}}{n_{\cdot} + \theta}\, P(w \mid H),$$

where $\eta = \{d, \theta, \{n_w, t_w\}_{w \in \sigma}\}$ and $\sigma$ is the vocabulary. Note that $\{n_w, t_w\}_{w \in \sigma}$ form the sufficient statistics for generating the next word.

In an HPYP LM, the distribution $G_u$ of words following a context $u$ has a $\text{PYP}(d_u, \theta_u, G_{\pi(u)})$ prior,

$$G_u \mid d_u, \theta_u, G_{\pi(u)} \sim \text{PYP}(d_u, \theta_u, G_{\pi(u)}),$$

where the base distribution $G_{\pi(u)}$ itself has a PYP prior. This induces a hierarchy among these context-conditioned distributions (see Figure 1), tying the base distribution of each node of the hierarchy to the distribution of its parent. Note that $u$ refers to both a context (a sequence of words) and its corresponding node in the HPYP tree. The distribution at the root of the hierarchy, $G_{\varepsilon}$, corresponds to the empty context $\varepsilon$:

$$G_{\varepsilon} \mid d_{\varepsilon}, \theta_{\varepsilon}, U \sim \text{PYP}(d_{\varepsilon}, \theta_{\varepsilon}, U),$$

where the base distribution $U$ is the uniform distribution over the vocabulary $\sigma$.

Given an HPYP LM, the next word is generated by integrating out all of the distributions corresponding to the nodes of the tree. This is achieved by the hierarchical Chinese restaurant process (HCRP), whereby a customer is sent to the restaurant of the context from which a word needs to be generated. If the customer is seated at a new table, a new customer is sent to the parent restaurant to order the dish. The number of customers sent to the parent is orchestrated via the concentration and discount parameters. The following constraints hold for $\{n^u_w, t^u_w\}_{w \in \sigma_u}$ across the tree nodes: $\forall w \in \sigma_u : 0 \leq t^u_w \leq n^u_w$.
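The predictive rule above, together with its hierarchical back-off to the parent restaurant (the right column of Table 1), can be sketched as follows. This is a minimal illustration: the `Restaurant` class and its field names are ours, the parameter values in the usage example are hypothetical, and the seating updates and the sampling of $(d_u, \theta_u)$ and table assignments are omitted.

```python
from typing import Dict, Optional


class Restaurant:
    """Sufficient statistics of one HPYP node u: per-word customer and table counts."""

    def __init__(self, d: float, theta: float,
                 parent: Optional["Restaurant"], vocab_size: int):
        self.d, self.theta = d, theta
        self.parent, self.vocab_size = parent, vocab_size
        self.n: Dict[str, int] = {}   # n^u_w: customers eating dish w
        self.t: Dict[str, int] = {}   # t^u_w: tables serving dish w

    def prob(self, w: str) -> float:
        """P(w | u, eta_u): discounted seating term plus base-distribution term."""
        # Base distribution: the parent restaurant, or uniform 1/|sigma| at the root.
        base = self.parent.prob(w) if self.parent else 1.0 / self.vocab_size
        n_tot, t_tot = sum(self.n.values()), sum(self.t.values())   # n^u_., t^u_.
        if n_tot == 0:                # empty restaurant: fall through to the base
            return base
        seated = (self.n.get(w, 0) - self.d * self.t.get(w, 0)) / (n_tot + self.theta)
        new_table = (self.theta + self.d * t_tot) / (n_tot + self.theta)
        return seated + new_table * base


# Usage: a small fragment of a hierarchy like Figure 1, where the restaurant for a
# context backs off to the root (empty-context) restaurant.
root = Restaurant(d=0.8, theta=1.0, parent=None, vocab_size=1000)
ctx_bc = Restaurant(d=0.8, theta=1.0, parent=root, vocab_size=1000)
```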