# CLLMs: Consistency Large Language Models

Siqi Kou\*¹, Lanxiang Hu\*², Zhezhi He³, Zhijie Deng¹, Hao Zhang²

\*Equal contribution. ¹Qing Yuan Research Institute, SEIEE, Shanghai Jiao Tong University. ²University of California, San Diego. ³School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University. Correspondence to: Zhijie Deng. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## Abstract

Parallel decoding methods such as Jacobi decoding show promise for more efficient LLM inference because they break the sequential nature of the LLM decoding process and transform it into parallelizable computation. In practice, however, Jacobi decoding achieves little speedup compared to traditional autoregressive (AR) decoding, primarily because it seldom accurately predicts more than one token in a single fixed-point iteration step. To address this, we develop a new approach aimed at realizing fast convergence from any state to the fixed point on a Jacobi trajectory. This is accomplished by refining the target LLM to consistently predict the fixed point given any state as input. Extensive experiments demonstrate the effectiveness of our method, showing 2.4× to 3.4× improvements in generation speed while preserving generation quality across both domain-specific and open-domain benchmarks. Our code is available at https://github.com/hao-ai-lab/Consistency_LLM.

## 1. Introduction

Large language models (LLMs), including GPT-4 (Achiam et al., 2023), LLaMA (Touvron et al., 2023a;b), and PaLM (Anil et al., 2023), are pushing the limits of artificial intelligence. As LLMs are integrated into more applications (Zheng et al., 2023; Wu et al., 2023), their inference latency plays a crucial role in ensuring a positive user experience and high service quality. However, LLM serving operates in an AR paradigm, generating one token at a time because the attention mechanism needs the states of previous tokens to generate the next one. To produce a lengthy response, one must execute forward passes through the LLM as many times as the number of tokens generated, resulting in high latency.

Existing methods address this issue from various perspectives. For example, speculative decoding (Leviathan et al., 2023; Chen et al., 2023) introduces a small draft LLM to guess tokens and lets the target LLM verify them in parallel. Although such methods can opportunistically generate multiple tokens in a single evaluation of the target LLM, obtaining a small yet effective draft model is non-trivial, and managing multiple models within a single system remains a challenging engineering task. Medusa (Cai et al., 2024) instead augments the target LLM with extra guess heads to enable self-speculation with as much as 3× speedup on various tasks. Yet the number of added parameters can be significant (e.g., Medusa2 with 5 extra heads adds 1.6B parameters to a 6.7B target LLM). The increased memory consumption can limit generation length and negatively affect inference latency due to the reduced memory available for the key-value (KV) cache (Pope et al., 2023).

On the other hand, originating from the Jacobi and Gauss-Seidel fixed-point iterations for solving nonlinear equations (Ortega & Rheinboldt, 2000; Song et al., 2021a), the Jacobi decoding method (Santilli et al., 2023) first randomly guesses the next n tokens in a sequence (referred to as the n-token sequence hereinafter) from an input prompt.
The n-token sequence, along with the prompt, is then fed to the LLM to be iteratively updated. Eventually, the n-token sequence converges to the same output that AR decoding produces under a greedy strategy (see Figure 1). The evolution of the n-token sequence forms a Jacobi trajectory between a randomly initialized sequence and the n-token sequence generated by AR decoding (i.e., the fixed point).

However, vanilla Jacobi decoding for LLMs shows only marginal speedup over AR decoding in practice, e.g., an average of 1.05× speedup in Santilli et al. (2023). This is because an LLM can rarely yield a correct token when there are incorrect tokens¹ among its preceding tokens, a consequence of the attention mechanism, resulting in a long trajectory as illustrated on the left side of Figure 2. Lookahead decoding (Fu et al., 2024) improves efficiency by leveraging n-grams generated from previous Jacobi iterations and verifying them in parallel during the decoding process. However, both works are unable to achieve the same level of speedup as Medusa.

¹ By correctness, we mean alignment with the AR decoding result under a greedy sampling strategy.

Figure 1. An instance of a Jacobi trajectory. "n-token seq" refers to the n-token sequence that is iteratively updated in Jacobi iterations.

This work aims to achieve all three goals (substantial speedup, no auxiliary components or extra memory, and preserved generation quality) by refining the target LLM. Specifically, we propose to fine-tune the LLM so that it can yield multiple subsequent tokens of a prefix at once, instead of one. In the ideal case, with the prompt and a randomly initialized n-token sequence as input, our goal is to train an LLM that can generate the same n-token sequence as AR decoding (the fixed point) using only one step. Our preliminary experiments show that this single-step learning task is difficult when n is large and leads to slow model convergence. We therefore ease the learning process by also taking into account intermediate points on the Jacobi trajectory that contain more correct tokens. In particular, for the second-to-last point on the trajectory, the learning problem is identical to AR modeling, at which the target LLM without adaptation has already excelled. We argue that such a learning strategy, in which a single model is tuned to solve a series of learning problems mapping any arbitrary point on the trajectory to the fixed point, is beneficial to model convergence (see Figure 4 and Figure 5).

Imagining the evolution of the n-token sequence as the denoising process of a natural image (Ho et al., 2020; Song et al., 2021b), we surprisingly find that the above learning procedure draws a sharp analogy to the acceleration technique for diffusion models known as consistency models (CMs) (Song et al., 2023; Song & Dhariwal, 2023). CMs aim to achieve single-step image generation using a denoising objective that minimizes distances between consecutive denoising steps along the probability-flow ordinary differential equation (ODE) trajectory during training. Our method and CMs share the notion of directly mapping intermediate states of a solving process (of nonlinear systems or ODEs) to its final solution for inference acceleration. Based on this, we refer to our trained models as Consistency Large Language Models (CLLMs).
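To make the iteration concrete, here is a minimal Python sketch of greedy Jacobi decoding (our illustration, not the authors' implementation). The `greedy_next_tokens` callable is a hypothetical stand-in for a causal LM that returns, for every position of the input, the argmax prediction of the next token:

```python
from typing import Callable, List, Tuple

# Hypothetical interface: given a token sequence, return for each position i the
# greedy (argmax) prediction of the token at position i + 1. A causal LM produces
# all of these in one forward pass, since position i attends only to tokens 0..i.
GreedyLM = Callable[[List[int]], List[int]]


def jacobi_decode(greedy_next_tokens: GreedyLM,
                  prompt: List[int],
                  init_guess: List[int],
                  max_iters: int = 100) -> Tuple[List[int], List[List[int]]]:
    """Refine an n-token guess until it reaches the greedy AR fixed point.

    Returns the converged n-token sequence and the Jacobi trajectory,
    i.e., the list of intermediate n-token states.
    """
    n = len(init_guess)
    y = list(init_guess)              # y^(0): randomly initialized n-token sequence
    trajectory = [list(y)]
    for _ in range(max_iters):
        # One parallel forward pass over prompt + current guess updates all n positions.
        preds = greedy_next_tokens(prompt + y)
        y_new = preds[len(prompt) - 1 : len(prompt) - 1 + n]
        trajectory.append(list(y_new))
        if y_new == y:                # fixed point reached: identical to greedy AR output
            break
        y = y_new
    return y, trajectory
```

With an off-the-shelf LLM this loop typically needs close to n iterations to converge; the training procedure described below aims to make it terminate in far fewer.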
In comparison with previous methods such as speculative decoding and Medusa, CLLMs introduce no extra memory cost to accommodate auxiliary model components while delivering significant speedup with minimal performance degradation. Implementing this learning strategy only requires model training with two loss terms. Following CMs, we can convert the aforementioned learning objective into a consistency loss, where the model is required to map an arbitrary point on the Jacobi trajectory to the fixed point. CLLMs also include an AR loss to avoid deviating from the distribution of the target LLM and hence ensure generation quality. The fine-tuning cost of CLLMs is moderate, e.g., training on only 1M tokens for LLaMA-7B achieves a 3.4× speedup on the Spider dataset. We further empirically identify that such acceleration likely stems from the existence of 1) fast forwarding, where multiple consecutive tokens are correctly predicted in a single forward pass, and 2) stationary tokens, which are correctly predicted and remain unaltered through subsequent iterations despite being preceded by inaccurate tokens. An illustration of these examples is shown in Figure 2.

To summarize, our key contributions are as follows:

- We propose Consistency Large Language Models (CLLMs), a new family of LLMs specialized for the Jacobi decoding method for latency reduction.
- We empirically observe the existence of the fast forwarding and stationary tokens phenomena in Jacobi decoding of CLLMs. Empirically, CLLMs can lead to a 2.0× to 6.8× improvement in the count of fast-forwarded tokens and stationary tokens compared to the original LLM.
- We demonstrate the efficacy of CLLMs on a variety of benchmarks. On domain-specific benchmarks including GSM8K, CodeSearchNet Python, and Spider, CLLMs achieve a 2.4× to 3.4× speedup using Jacobi decoding with nearly no loss in accuracy. On the open-domain benchmark MT-bench, CLLMs achieve a 2.4× speedup on ShareGPT with state-of-the-art performance, scoring 6.4.

## 2. Related Work

**Efficient LLM Inference.** This body of work can be broadly categorized into two streams: methods that necessitate additional training and those that do not. The high AR inference cost of LLMs has sparked a surge of research aimed at efficient LLM inference, primarily focused on accelerating the AR decoding process.

Methods that do not require additional training include speculative decoding, as introduced in studies by Leviathan et al. (2023) and Chen et al. (2023). These techniques enhance LLM decoding speed by leveraging a smaller draft model to predict the outputs of a larger target model, which subsequently verifies these predictions. Another category of training-free approaches involves system- or hardware-oriented optimizations. Notable examples include PagedAttention (Kwon et al., 2023), which optimizes KV cache management for throughput using memory paging, and FlashAttention (Dao et al., 2022; Dao, 2023), which accelerates attention computation by reducing HBM access via softmax tiling. Other strategies enhance LLM inference speed by optimizing model designs, reducing weight/activation precision, and utilizing sparsity, including multi-query and grouped-query attention mechanisms with fused heads (Shazeer, 2019; Ainslie et al., 2023), post-training quantization (Dettmers et al., 2022; Xiao et al., 2023; Frantar et al., 2022; Lin et al., 2023), and various pruning techniques (Sun et al., 2023; Frantar & Alistarh, 2023; Ashkboos et al., 2024).
For methods that necessitate training, integration of auxiliary components, such as additional LM or AR heads, is often required to facilitate faster AR generation (Cai et al., 2024; Li et al., 2024). They may also involve significant modifications to the model weights or architecture, as seen in various pruning approaches (Ma et al., 2023; Xia et al., 2022; 2023). Moreover, training can enhance certain training-free techniques, like speculative decoding, by capturing the behavior of the original, larger model in a smaller student model through distillation, thereby retaining performance with reduced size (Zhou et al., 2023b; Liu et al., 2023). A detailed analysis comparing CLLMs with different SOTA baseline methods is provided in Section B and Table 7. It is worth noting that CLLMs require neither modifications to pre-trained models nor any auxiliary components, which brings higher memory efficiency and adaptability to users at inference time.

**LLM Distillation.** Knowledge distillation (KD) serves as a technique for creating smaller models that replicate the functionality of larger ones. While traditional KD approaches often fall short for LLMs, Gu et al. (2023) adapt KD to autoregressive LLMs, focusing on minimizing the reverse KL divergence between student and teacher models through student-driven decoding. In another advancement, Agarwal et al. (2023) introduce generalized knowledge distillation (GKD), which balances forward and reverse KL divergences by employing a mix of data sampled from both teacher and student models. CLLMs are distinct from these works, as our proposed method can be regarded as a self-distillation approach with a Jacobi trajectory training dataset that matches the target LLM's output distribution.

**Consistency Models.** Diffusion models (Ho et al., 2020; Song et al., 2021b) suffer from a slow iterative sampling process. Consistency models overcome this limitation by mapping any point along the probability-flow ODE of the diffusion process back to the original point, corresponding to the initial image, in a single step (Song et al., 2023). In this work, we highlight that a parallel can be drawn between the few-step generation capability of CLLMs and that of consistency models.

## 3. Methodology

This section begins with a review of the Jacobi decoding method (Santilli et al., 2023) for accelerating LLM inference, then elaborates on CLLMs, a refinement of pre-trained LLMs that enjoys a higher speedup from Jacobi decoding. In this paper, we only consider greedy sampling and leave other sampling strategies to future work. We also empirically identify the fast-forwarding phenomenon and the emergence of stationary tokens in CLLMs, which serve as the source of such acceleration.

### 3.1. Preliminary: Jacobi Decoding

Given a prompt $\mathbf{x}$ and a pre-trained LLM $p(\cdot \mid \mathbf{x})$, we typically obtain the model response with the standard AR decoding method under the greedy strategy, i.e.,

$$y_i = \arg\max_{y} \, p(y \mid \mathbf{y}_{<i}, \mathbf{x}).$$

Jacobi trajectories for training CLLMs are collected from the prompts in an origin dataset O as follows, with an optional data-augmentation step:

```
repeat
  sample the next prompt x from the origin dataset O
  while EOS is not generated and the generated length < N do
    J = {y^(0), ..., y^*} ← JacobiDecoding(p, x)
    x ← cat(x, y^*)
    if use data augmentation then
      for all y ∈ J do
        augment y by randomly correcting false tokens
      end for
    end if
    append x and J to the training dataset D
  end while
until all prompts in the origin dataset O are used
```

**Data augmentation.** In the collected Jacobi trajectories, n-token sequences usually exhibit a "correct, correct, wrong, wrong, wrong" pattern; in comparison, patterns like "correct, correct, wrong, correct, wrong" can be rare. To enhance the learning and generalization capabilities of CLLMs, we augment the dataset D by randomly correcting erroneously predicted tokens within the samples.
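The augmentation branch in the procedure above can be illustrated with the following sketch (hypothetical, not the authors' code): wrong tokens in a trajectory state are randomly replaced by the corresponding fixed-point tokens, so the training data covers more varied correct/wrong patterns. The `correction_prob` knob is an assumption, not a value from the paper.

```python
import random
from typing import List, Optional


def augment_state(y: List[int],
                  y_star: List[int],
                  correction_prob: float = 0.5,
                  rng: Optional[random.Random] = None) -> List[int]:
    """Randomly correct erroneous tokens in a Jacobi-trajectory state.

    Each token of y that disagrees with the fixed point y_star is replaced by
    the correct token with probability `correction_prob`; all other tokens are
    kept unchanged.
    """
    assert len(y) == len(y_star)
    rng = rng or random.Random()
    augmented = []
    for tok, tok_star in zip(y, y_star):
        if tok != tok_star and rng.random() < correction_prob:
            augmented.append(tok_star)   # correct a wrong token
        else:
            augmented.append(tok)        # keep the original prediction
    return augmented
```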
**Data post-processing.** Since the target LLM itself can make errors on some prompts, the Jacobi trajectories often contain low-quality generations. We find that training a CLLM on n-token sequences with token-level (Holtzman et al., 2019) or sentence-level repetitions (Polišenská et al., 2015) often results in repetitive content generation and noticeably degrades performance. Recognizing the significance of high-quality datasets for training LLMs (Zhou et al., 2023a), we perform post-processing to eliminate the low-quality samples from our training dataset D using a rule-based detector.

#### 3.2.2. Training

We jointly optimize two losses for tuning CLLMs: one guarantees the prediction of multiple tokens at once, and the other keeps the CLLM from deviating from the target LLM so as to maintain generation quality.

**Consistency Loss.** For a prompt $\mathbf{x}$ with Jacobi trajectory $\mathcal{J}$, let $\mathbf{y}$ and $\mathbf{y}^*$ denote a random state on the trajectory and the fixed point, respectively. We can directly push the CLLM to output $\mathbf{y}^*$ with $\mathbf{y}$ as the input by minimizing the following global consistency loss:

$$\mathcal{L}_{\mathrm{GC}} = \mathbb{E}_{(\mathbf{x},\mathcal{J})\sim\mathcal{D},\,\mathbf{y}\sim\mathcal{J}}\left[\sum_{i=1}^{n} D\big(q_{\theta^{-}}(\cdot \mid \mathbf{y}^{*}_{<i}, \mathbf{x}) \,\big\|\, q_{\theta}(\cdot \mid \mathbf{y}_{<i}, \mathbf{x})\big)\right],$$

where $q_\theta$ denotes the CLLM being trained, $\theta^{-} = \mathrm{stopgrad}(\theta)$, and $D(\cdot\,\|\,\cdot)$ is a distance measure between distributions.

Figure 4. Illustration of the global consistency loss, where we aim to directly learn a model $q_\theta$ that maps an arbitrary n-token sequence $\mathbf{y}^{(0)}, \mathbf{y}^{(1)}, \ldots$ on the Jacobi trajectory to the fixed point $\mathbf{y}^*$.

Figure 5. Illustration of the local consistency loss, where we aim to learn a model $q_\theta$ that maps an arbitrary n-token sequence $\mathbf{y}^{(j)}$ to its adjacent next state, thereby implicitly mapping it to the fixed point $\mathbf{y}^*$.
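As a rough PyTorch sketch of one plausible instantiation of the two training terms (forward KL as the distance D, a detached forward pass standing in for $q_{\theta^-}$, and Hugging Face-style `.logits` outputs; `model`, `prompt_ids`, `y_state`, `y_star`, and the weight `w` are illustrative names, not the authors' API):

```python
import torch
import torch.nn.functional as F


def global_consistency_loss(model, prompt_ids, y_state, y_star):
    """One possible instantiation of L_GC for a single (x, y, y*) pair.

    model:      causal LM returning logits of shape (batch, seq_len, vocab)
    prompt_ids: (1, m) prompt tokens x
    y_state:    (1, n) an intermediate state y on the Jacobi trajectory
    y_star:     (1, n) the fixed point y*
    """
    m = prompt_ids.shape[1]
    # Target distributions: condition on the fixed point y*, detached so gradients
    # flow only through the branch conditioned on the intermediate state.
    with torch.no_grad():
        logits_star = model(torch.cat([prompt_ids, y_star], dim=1)).logits
        p_star = F.softmax(logits_star[:, m - 1:-1, :], dim=-1)
    # Online distributions: condition on the intermediate state y.
    logits_y = model(torch.cat([prompt_ids, y_state], dim=1)).logits
    log_q_y = F.log_softmax(logits_y[:, m - 1:-1, :], dim=-1)
    # Forward KL(p_star || q_y), summed over the n positions of the n-token block.
    return F.kl_div(log_q_y, p_star, reduction="batchmean")


def ar_loss(model, prompt_ids, y_star):
    """Next-token cross-entropy on the fixed point, keeping the CLLM close to
    the target LLM's output distribution."""
    m = prompt_ids.shape[1]
    inputs = torch.cat([prompt_ids, y_star], dim=1)
    logits = model(inputs).logits[:, m - 1:-1, :]   # predictions for the n block tokens
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]), y_star.reshape(-1))


# Assumed combined objective: loss = global_consistency_loss(...) + w * ar_loss(...)
```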
## B. Comparison with Baseline Algorithms

In this section, we present a comparative analysis of baseline algorithms for efficient LLM inference. The key features considered are listed below. Table 7 underlines that CLLMs, our proposed method, stand out for memory efficiency and adaptability, requiring no modifications to the existing model architecture while achieving up to 3.4× inference speedup.

- **Lossless:** whether the method generates exactly the same output distribution as AR decoding does with the backbone model.
- **Training-free:** whether the method requires no additional training.
- **Architecture-design-free:** whether the method requires no modifications or auxiliary components added to the pre-trained LLM (such as extra MLP layers, LM heads (Cai et al., 2024), or autoregressive heads (Li et al., 2024)).
- **Attention-modification-free:** whether the method requires no modifications to the existing attention mechanism in transformers (for example, the tree token verification that appears in Cai et al. (2024)).
- **Extra-memory-free:** whether the method requires no extra memory in the system to accommodate a speculative model or extra parameters.
- **Speedup:** whether the method can effectively deliver inference speedup in practical use cases.

Table 7. All speedups are relative to vanilla AR decoding. CLLMs have the best memory efficiency and adaptability, as they require no modifications to the model. A "yes" in the Speedup column refers to the capability of achieving more than 3× speedup on at least one of our benchmarks; Jacobi decoding does not always lead to a speedup, as discussed in Section 3.1, so its entry carries that caveat.

| Methods | Lossless | Training-free | Arch-design-free | Attention-mod-free | Extra-memory-free | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla AR | yes | yes | yes | yes | yes | no |
| Jacobi Decoding | yes | yes | yes | yes | yes | yes |
| Speculative Decoding | yes | yes | yes | yes | no | yes |
| Lookahead Decoding | yes | yes | yes | yes | no | yes |
| SD with Distilled Student | yes | no | yes | yes | no | yes |
| Eagle | yes | no | no | no | no | yes |
| Medusa | no | no | no | no | no | yes |
| CLLMs (Ours) | no | no | yes | yes | yes | yes |

## C. Pseudocode for Jacobi Decoding with KV Cache

```
Algorithm 3: Jacobi Decoding with KV Cache
Input: prompt x, n-gram size n, past KV cache K, LLM, Jacobi trajectory J
y ← random tokens sampled from x
n_t ← 0                      {initialize the accurate length}
y_0, K ← LLM(x)              {prefill phase: generate the first token}
z_next ← cat(y_0, y[1:])
repeat
  z_current ← z_next
  z_next, K ← LLM(z_current, K)
  i ← max{i | z_current …
```
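As a hedged, simplified illustration of the prefix-cache-reuse idea only (not the authors' full algorithm, which additionally tracks the accurate length n_t), the sketch below prefills the prompt once and then repeatedly re-decodes the n-token block against a copy of that prefix cache, assuming a Hugging Face-style causal LM; the GPT-2 checkpoint and the per-iteration cache deep-copy are simplifying assumptions.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model choice is illustrative; any causal LM with KV-cache support would do.
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tok = AutoTokenizer.from_pretrained("gpt2")


@torch.no_grad()
def jacobi_decode_with_prefix_cache(prompt: str, n: int = 16, max_iters: int = 64) -> str:
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    # Prefill: run the prompt once, keeping its KV cache and the first greedy token y_0.
    out = model(prompt_ids, use_cache=True)
    prefix_cache = out.past_key_values
    y0 = out.logits[:, -1:, :].argmax(dim=-1)
    # Initialize the n-token block: y_0 followed by n - 1 random guesses.
    guess = torch.randint(0, model.config.vocab_size, (1, n - 1))
    block = torch.cat([y0, guess], dim=1)
    for _ in range(max_iters):
        # Re-decode the whole block against a copy of the (unchanged) prompt cache;
        # copying avoids mutating the prefix cache across iterations.
        step = model(block, past_key_values=copy.deepcopy(prefix_cache), use_cache=True)
        new_block = torch.cat([y0, step.logits[:, :-1, :].argmax(dim=-1)], dim=1)
        if torch.equal(new_block, block):   # fixed point: matches greedy AR output
            break
        block = new_block
    return tok.decode(torch.cat([prompt_ids, block], dim=1)[0])
```

A production implementation would additionally append already-converged (accurate) tokens to the cache instead of recomputing them at every iteration, which is what the accurate-length bookkeeping in the algorithm above appears to serve.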