Published as a conference paper at ICLR 2025

CHUNK-DISTILLED LANGUAGE MODELING

Yanhong Li (University of Chicago & TTI-Chicago) yanhongli@uchicago.edu
Karen Livescu (TTI-Chicago) klivescu@ttic.edu
Jiawei Zhou (TTI-Chicago & Stony Brook University) jiawei.zhou.1@stonybrook.edu

ABSTRACT

We introduce Chunk-Distilled Language Modeling (CD-LM), an approach to text generation that addresses two challenges in current large language models (LLMs): the inefficiency of token-level generation, and the difficulty of adapting to new data and knowledge. Our method combines deep network-based LLMs with a straightforward retrieval module, which allows the generation of multi-token text chunks at a single decoding step. Our retrieval framework enables flexible construction of model- or domain-specific datastores, either leveraging the internal knowledge of existing models, or incorporating expert insights from human-annotated corpora. This adaptability allows for enhanced control over the language model's distribution without necessitating additional training. We present the CD-LM formulation along with performance metrics demonstrating its ability to improve language model performance and efficiency across a diverse set of downstream applications.¹

1 INTRODUCTION

Large language models (LLMs) have become a crucial component of intelligent systems, but still suffer from fundamental challenges to their efficiency and performance. LLMs are most commonly based on autoregressive Transformers (Vaswani et al., 2017) and typically generate text sequences one token at a time in a serial fashion, which limits their efficiency. Moreover, once pre-trained, updating the model parameters requires expensive data and computational resources, making it difficult to incorporate dynamic knowledge into the model.
Several techniques have been proposed to improve the efficiency and performance of LLMs, such as speculative decoding (Leviathan et al., 2023; Chen et al., 2023; Miao et al., 2024; Spector & Re, 2023) and retrieval-augmented generation (RAG) (Lewis et al., 2020; Guu et al., 2020; Borgeaud et al., 2022). The former relies on a smaller model to speculate several tokens at a time, reducing inference runtime while retaining the same model distribution; the latter combines parametric language models with non-parametric memory to improve adaptability to dynamic knowledge, but often without efficiency gains.

This work aims to alleviate both challenges via a fine-grained retrieval-augmented language modeling approach that focuses on text chunks: contiguous spans of tokens that often appear together. The intuition for this approach is that a substantial amount of linguistic or factual knowledge can be expressed in text chunks spanning multiple contiguous tokens, such as named entities, multi-word expressions, and other common phrases. These sub-sentence structures tend to exhibit lower variability than larger text units such as sentences, and are often memorized precisely by well-trained LLMs. Figures 1 and 2 demonstrate this effect: chunks conveying key content are often repeated verbatim across multiple decoding runs with similar contexts, and the LLM probabilities over token sequences show recurring plateaus of high probability within such multi-token chunks. By injecting memorized or novel chunks into the generation process, we may be able to improve the model's ability to adapt to new domains or knowledge. In addition, if entire chunks can be cached and retrieved during inference, we should also be able to speed up text generation.

¹ Code and data are available at https://github.com/yanhong-lbh/cd-lm.

Figure 1: LLMs may generate sequences with repeated chunks spanning contiguous tokens conveying key information in similar contexts. Examples are generated from Llama-2-7b-chat.

Figure 2: LLM token probabilities (Llama-2-7b-chat and Llama-2-70b-chat) for the sentence: "The answer to life, the universe, and everything is 42, according to Douglas Adams' The Hitchhiker's Guide to the Galaxy." These models bind token sequences such as "Douglas Adams" and "The Hitchhiker's Guide to the Galaxy" into chunks with plateaus of high probability.

Inspired by these observations, we present Chunk-Distilled Language Modeling (CD-LM), a new training-free generation approach that mixes LM token generation with chunk retrieval. To facilitate efficient search, we store text chunks of variable sizes, along with their preceding contexts, in a trie-structured datastore, and retrieve the most likely chunks as possible text continuations given the current generation. The context matching is done in the vector representation space induced by the LM itself, without the additional overhead of the specialized embedding modules commonly used in RAG (Lan et al., 2023; Ram et al., 2023; Borgeaud et al., 2022).
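To make the matching step concrete, here is a toy sketch of accepting a retrieved chunk only when the current context vector is close enough to a stored one. The `best_match` helper, the hand-written vectors, and the 0.9 threshold are illustrative assumptions, not details from the paper; in CD-LM the vectors come from the LM's own hidden representations rather than short made-up lists.

```python
import math

def cosine(u, v):
    # Cosine similarity between two (nonzero) vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_match(query, stored_contexts, threshold=0.9):
    """Index of the stored context vector closest to the query,
    or None if nothing is similar enough to trust the chunk."""
    scores = [cosine(query, c) for c in stored_contexts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None

# Toy 3-d vectors standing in for LM hidden states.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(best_match([0.95, 0.05, 0.0], stored))  # close to the first context -> 0
print(best_match([0.5, 0.5, 0.0], stored))    # ambiguous context -> None
```

Returning `None` corresponds to rejecting the retrieval and falling back to ordinary token-by-token decoding.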
Well-matched chunk continuations are accepted, skipping multiple token decoding steps. Using the same generation approach, CD-LM allows language models (LMs) to work with chunks mined in different ways to achieve various goals in applications. As suggested by Figure 2, chunks can be naturally derived from any parametric pre-trained LM as memorized high-probability sequences. When the chunks are extracted from the distribution of a more powerful or specialized LM, CD-LM implements a form of knowledge distillation, adapting the base model's distribution (without any additional training) by injecting chunks during inference. In this setting CD-LM can either improve smaller models with knowledge drawn from larger models or perform training-free domain adaptation. On the other hand, when the chunks are extracted from the same LM used for generation, they form a self-memory datastore that can be used to improve inference efficiency while maintaining the same model distribution, as in speculative decoding. Finally, chunks need not be extracted from a parametric model at all: they can also be curated directly by human experts. Such external knowledge can be factual information or private data that the LM may not have direct access to.

CD-LM requires no training and can work with any off-the-shelf language model in both chunk discovery and sequence generation. We conduct a diverse set of empirical studies, including language modeling perplexity, text generation, and domain adaptation, showing the ability of CD-LM to improve inference efficiency and modeling performance.

2 BACKGROUND

While many attempts have been made to improve language modeling and generation efficiency, it remains a significant challenge to address both simultaneously.
For example, non-parametric approaches like kNN-LM (Khandelwal et al., 2020) reduce LM perplexity in certain domains, but tend to require a sizable database for retrieval and add latency during generation; specialized inference algorithms like speculative decoding (Spector & Re, 2023) speed up generation but keep the LM's distribution fixed. Unlike prior work, CD-LM can both speed up generation and adapt the LM's distribution. We include a more comprehensive overview of related work in Appendix C.

[Figure 3 diagram: (1) extract the chunks using an LM's token probabilities (SCD-LM) or existing human knowledge (KCD-LM); (2) build the trie datastore, in which each trie has a root node corresponding to an entry token, each node represents a text chunk for retrieval, and each node stores the chunk's preceding contexts; (3) inference: search trie -> match contexts -> accept or reject chunk -> generate chunk directly if accepted.]

Figure 3: Overview of CD-LM. Colored text spans are generated together by chunk retrieval, interleaved with token-by-token generation by the LM. Note that the same chunk can appear in multiple contexts, so each node in the trie datastore contains multiple context vectors in practice.
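The datastore and retrieval steps in Figure 3 can be sketched in a few dozen lines. Everything below is an illustrative reconstruction under simplifying assumptions: the class names (`ChunkTrie`, `TrieNode`), the cosine-similarity acceptance test, and the 0.9 threshold are our own choices, and context vectors are plain Python lists rather than LM hidden states.

```python
import math

def cosine(u, v):
    # Cosine similarity between two (nonzero) context vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

class TrieNode:
    def __init__(self):
        self.children = {}       # next token -> TrieNode
        self.contexts = []       # context vectors under which this chunk was seen
        self.is_chunk_end = False

class ChunkTrie:
    """One trie per entry token; each path from a root spells out a chunk."""

    def __init__(self):
        self.roots = {}          # entry token -> TrieNode

    def insert(self, tokens, context_vec):
        node = self.roots.setdefault(tokens[0], TrieNode())
        for tok in tokens[1:]:
            node = node.children.setdefault(tok, TrieNode())
        node.is_chunk_end = True
        node.contexts.append(context_vec)

    def lookup(self, entry_token, query_vec, threshold=0.9):
        """Return the longest stored chunk starting at entry_token whose
        saved context is similar enough to query_vec, else None (reject)."""
        root = self.roots.get(entry_token)
        if root is None:
            return None
        best = None
        stack = [(root, [entry_token])]   # depth-first walk of the trie
        while stack:
            node, toks = stack.pop()
            if node.is_chunk_end and any(
                cosine(query_vec, c) >= threshold for c in node.contexts
            ):
                if best is None or len(toks) > len(best):
                    best = toks
            for tok, child in node.children.items():
                stack.append((child, toks + [tok]))
        return best

# Step 2: build the datastore from extracted chunks (toy 2-d "context vectors").
trie = ChunkTrie()
trie.insert(["Douglas", "Adams", "'"], [1.0, 0.0])
trie.insert(["Douglas", "fir"], [0.0, 1.0])

# Step 3: at inference, the current context vector selects (or rejects) a chunk.
print(trie.lookup("Douglas", [0.98, 0.05]))   # context matches the Adams chunk
print(trie.lookup("Douglas", [0.5, 0.5]))     # no context close enough -> None
```

When `lookup` returns a chunk, generation emits all of its tokens in one step and resumes token-by-token decoding afterwards; on `None`, the base LM simply decodes the next token as usual.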
Non-Parametric Language Modeling. kNN-LM (Khandelwal et al., 2020) extends a pre-trained LM by linearly interpolating its distribution with a non-parametric k-nearest-neighbors model based on token retrieval, thereby often improving language modeling performance. However, it is typically very inefficient, as it performs retrieval at every token, and it affects only the immediate next-token distribution via soft mixing. A series of methods have been proposed to improve the efficiency of kNN-LM (He et al., 2021; Alon et al., 2022); however, they are still slower than the pre-trained LM. Unlike kNN-LM, CD-LM does not perform retrieval at every token, and it makes a hard decision about multiple tokens in a chunk rather than mixing token distributions, enabling it to enjoy the benefits of dynamic retrieval while saving on kNN searches.

Speculative Decoding. Speculative decoding (Leviathan et al., 2023; Chen et al., 2023; Miao et al., 2024; Spector & Re, 2023; He et al., 2024) is an inference acceleration technique. Given a particular target LLM, a smaller LM is used to quickly generate multiple draft tokens, which are then considered together by the target LLM. The work most closely related to ours is REST (He et al., 2024), which retrieves draft token sequences from an external datastore. While CD-LM also retrieves chunks from a datastore, it is fundamentally different from speculative decoding. Speculative decoding methods use the target LLM for draft-token verification, so the language model's distribution, and therefore downstream performance, cannot be further improved and no new knowledge can be injected. In contrast, CD-LM is designed to inject chunk-level knowledge into generation, so the model distribution can be adapted.

3 LANGUAGE MODELING WITH CHUNK GENERATION

In this section, we introduce a general framework of language modeling that interleaves chunk generations with tokens from a standard autoregressive LM.
We then describe the operational details of the chunk generation process, with retrieval from a structured datastore, in Section 4. Together, these two sections build the core ideas of CD-LM. Finally, we derive a tractable algorithm for computing sequence probabilities under CD-LM in Section 5.

3.1 PRELIMINARIES

An autoregressive language model assigns a probability to any given sequence of tokens (x_1, x_2, ..., x_N) as follows:

p_θ(x_1, x_2, ..., x_N) = ∏_{n=1}^{N} p_θ(x_n | x_{<n})
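As a quick numeric illustration of this factorization, the sequence probability is the product of the per-token conditionals, usually accumulated in log space; the conditional values below are made up, not from any model:

```python
import math

# Made-up conditional probabilities p(x_n | x_<n) for a 4-token sequence;
# a real LM would produce each of these with a softmax at every step.
conditionals = [0.5, 0.25, 0.8, 0.1]

# Accumulate in log space for numerical stability, then exponentiate.
log_prob = sum(math.log(p) for p in conditionals)
prob = math.exp(log_prob)

print(round(prob, 6))  # 0.5 * 0.25 * 0.8 * 0.1 = 0.01
```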