# Online Speculative Decoding

Xiaoxuan Liu¹, Lanxiang Hu², Peter Bailis³, Alvin Cheung¹, Zhijie Deng⁴, Ion Stoica¹, Hao Zhang²

¹UC Berkeley, ²UCSD, ³Google Inc., ⁴SJTU. Correspondence to: Hao Zhang, Zhijie Deng. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## Abstract

Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding to address this challenge. The main idea is to continuously update the (multiple) draft model(s) on observed user query data. Adapting to the query distribution mitigates the shift between the training distribution of the draft model and the query distribution, enabling the draft model to more accurately predict the target model's outputs. We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data. The results show a substantial increase in the token acceptance rate by 0.1 to 0.65, bringing a 1.42× to 2.17× latency reduction. Our code is available at https://github.com/LiuXiaoxuanPKU/OSD.

Figure 1. Overview of the online speculative decoding (OSD) framework: For each prompt, the draft model suggests multiple tokens and the target model performs the verification. If the student proposes incorrect tokens, both the draft and target distributions are stored in a buffer. Once the buffer exceeds a size limit or is too old, the draft model is updated by calculating the loss between the draft and target distributions using various distance metrics (KL, reverse KL, JSD). Decoding then resumes from the corrected token to generate subsequent tokens.

## 1. Introduction

Large language models (LLMs) such as GPT-4 (OpenAI, 2023) and LLaMA (Touvron et al., 2023a;b) are rapidly reinventing today's applications. Many companies are racing to deploy LLMs in online services, such as search, chatbots, and virtual assistants. Since most of these services demand low latency, optimizing LLM serving latency directly translates into better quality of service and cost reduction. The latency of today's LLM services is unfortunately very high, primarily because serving a user query requires multiple serial evaluations of the LLM, each generating only one token of the response. An emerging solution to reduce the latency is speculative decoding (Leviathan et al., 2023): it employs a small draft model to speculate multiple output tokens of the target (large) model, and then lets the target LLM verify these speculations in parallel. If the verification of a token fails, the large model must recompute from that point. Therefore, the performance of speculative decoding largely depends on the speculation accuracy of the draft model (also known as the token acceptance rate).
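To illustrate the mechanism, the following is a minimal sketch of one speculative decoding step with greedy verification, a simplification of the stochastic acceptance rule of Leviathan et al. (2023); `draft_next_dist` and `target_next_dists` are hypothetical stand-ins for the two models, not part of the paper's implementation.

```python
import numpy as np

def speculative_step(draft_next_dist, target_next_dists, prefix, k=4):
    """One speculative decoding step with greedy verification (illustrative).

    draft_next_dist(tokens)            -> probability vector for the next token
    target_next_dists(prefix, drafted) -> k + 1 probability vectors, obtained
                                          from a single parallel pass of the
                                          target model over the drafted tokens
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        token = int(np.argmax(draft_next_dist(ctx)))   # greedy draft
        drafted.append(token)
        ctx.append(token)

    # 2. The target model scores all k + 1 positions in one parallel pass.
    p_dists = target_next_dists(list(prefix), drafted)

    # 3. Verify: keep drafted tokens while they match the target's choice;
    #    at the first mismatch, substitute the target's token and stop.
    output = []
    for i, token in enumerate(drafted):
        target_token = int(np.argmax(p_dists[i]))
        if token != target_token:
            output.append(target_token)                # corrected token
            return output
        output.append(token)
    # All k drafted tokens accepted; the extra distribution gives a bonus token.
    output.append(int(np.argmax(p_dists[k])))
    return output
```

The more drafted tokens survive verification, the more target-model evaluations are amortized per step, which is why the token acceptance rate governs the achievable speedup.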
In the presence of diverse text inputs, the accuracy of the speculations is often not very high, due to the capability gap between the draft and target models. Employing a larger, more accurate draft model, however, defeats the purpose of speculative decoding as it will increase latency. To address this challenge, we introduce a novel method, online speculative decoding (OSD), to periodically finetune the draft model based on the corrections of the target model. OSD aims to reduce query latency while preserving the compact size of the draft model.

First, OSD employs knowledge distillation within speculative decoding to enhance the alignment between the draft and target models. Speculative decoding involves the draft model proposing potential tokens with their respective probability distributions. The target model then assesses these suggestions, correcting discrepancies to ensure that the outputs remain consistent with those produced without the draft model. This correction mechanism serves as an effective way for the draft model to assimilate and learn from this enriched information. Compared to conventional label finetuning, knowledge distillation offers a significant advantage by providing a probability distribution for each token. By leveraging the insights from the teacher model (Gu et al., 2023), this method effectively aligns the draft and target models.

Furthermore, instead of relying on a static draft model, we periodically update the draft model. This is because user queries to a specific LLM service often exhibit domain-specific distributions (Zheng et al., 2023a), reflecting shared usage patterns. While accurately speculating the larger model's outputs on arbitrarily diverse inputs is challenging, it is feasible to enhance the draft model's prediction accuracy for the similar inputs posted to the service, characterized by the query distribution. Updates can be implemented through several methods. One approach is to fine-tune the draft model in the background and then apply these updates in real time after a predetermined period. Alternatively, one can leverage the excess computational capacity of the serving system while it is running, as detailed in Section 4.2.2. Importantly, the real-time tuning of the draft models enables them to continuously adapt based on incoming query data. This dynamic approach is essential for preserving a high token acceptance rate, ensuring the model remains efficient and current with evolving data and trends.

Lastly, to further improve the token acceptance rate, OSD not only narrows the query distribution but also routes each query to the draft model best suited for that specific distribution. This is accomplished by developing draft models that are finely tuned to cater to distinct domains. Concentrating on a narrower query distribution has proven more effective for learning. Consequently, OSD efficiently directs queries to the corresponding draft model that specializes in their domain. As evidenced in Section 5.2 of our evaluation, we have trained multiple draft models, each uniquely tailored to distinct languages or topics. This method highlights the significant potential for improved efficiency and accuracy when dealing with a diverse range of queries.
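To make the serving-side bookkeeping concrete, below is a minimal sketch, under assumed names and thresholds (`OnlineDraftManager`, `classify`, `distill_update`, `max_size`, `max_age_s`), of how per-domain routing and the buffer-then-update policy of Figure 1 could be wired together; it illustrates the idea rather than the paper's implementation.

```python
import time

class OnlineDraftManager:
    """Illustrative serving-side bookkeeping for OSD (hypothetical names).

    Keeps one draft model per domain, buffers (draft, target) probability
    distributions from rejected tokens, and refreshes a domain's draft model
    once its buffer is too large or too old (cf. Figure 1).
    """

    def __init__(self, draft_models, classify, distill_update,
                 max_size=4096, max_age_s=600.0):
        self.draft_models = draft_models        # domain name -> draft model
        self.classify = classify                # query -> domain name
        self.distill_update = distill_update    # (model, pairs) -> updated model
        self.max_size, self.max_age_s = max_size, max_age_s
        self.buffers = {d: [] for d in draft_models}
        self.created = {d: time.time() for d in draft_models}

    def pick_draft(self, query):
        # Route the query to the draft model specialized for its domain.
        domain = self.classify(query)
        return domain, self.draft_models[domain]

    def record_rejections(self, domain, pairs):
        # pairs: list of (draft_probs, target_probs) collected at rejected
        # positions during verification for this domain's queries.
        self.buffers[domain].extend(pairs)
        too_big = len(self.buffers[domain]) >= self.max_size
        too_old = time.time() - self.created[domain] >= self.max_age_s
        if too_big or too_old:
            # A few distillation steps on the buffered distributions, e.g.
            # minimizing KL / reverse KL / JSD between draft and target.
            self.draft_models[domain] = self.distill_update(
                self.draft_models[domain], self.buffers[domain])
            self.buffers[domain] = []
            self.created[domain] = time.time()
```

A serving loop would call `pick_draft(query)` before decoding each query and `record_rejections(domain, pairs)` with the mismatched positions afterwards, so routing and online adaptation add no extra passes over the target model.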
In summary, this paper makes the following contributions:

- We explore various generalized knowledge distillation (GKD) methods for constructing draft models and identify the most effective variants (Section 4.1).
- We introduce online speculative decoding to reduce LLM serving latency by adapting draft models on the fly (Section 4.2).
- We investigate draft model customization for speculative decoding, wherein each query is directed to the draft model that corresponds to the query's domain (Section 4.3).
- OSD demonstrates a significant improvement in token acceptance rate, by 10-65% on diverse datasets, which translates into a 1.4-2.1× reduction in latency. OSD can be combined with existing methods that construct static draft models, and it matches the accuracy achieved if all query data were available beforehand (Section 5).

## 2. Related Work

**Speculative decoding.** Speculative decoding (Leviathan et al., 2023; Chen et al., 2023a) accelerates LLM decoding by employing a (small) draft model to predict the outputs of the larger target model, which the target model then verifies. If the draft model can correctly predict more than one token per verification step, the memory I/O for accessing the weights and KV cache of the (large) target model at inference is amortized across multiple output tokens, thereby reducing latency, especially since LLM inference is often constrained by GPU HBM bandwidth. The efficacy of speculative decoding hinges on the draft model's ability to accurately predict the target model's outputs. Existing work improves the speculation accuracy by using staged (Spector & Re, 2023), RAG-style (He et al., 2023), or multiple (Miao et al., 2023; Chen et al., 2023b) draft models, and by sampling multiple candidates from the draft model (Yang et al., 2024; Cai et al., 2024). Additionally, there exists a line of research that eliminates the need for a separate draft model by leveraging auxiliary modules within the target model itself (Cai et al., 2023; Stern et al., 2018; Cai et al., 2024; Lin et al., 2024; Zhang et al., 2023). These methods predominantly assume a static draft model post-deployment. In contrast, our work introduces a framework that actively adapts the draft model to the evolving user query distribution on the fly, irrespective of the draft model's construction. OSD is orthogonal to the aforementioned methods, enabling its integration with them to improve overall efficacy in online deployment scenarios.

**Distillation for auto-regressive models.** Knowledge distillation (KD) is a framework for generating smaller models that emulate the performance of larger models. However, KD in its conventional form has been observed to be less effective for LLMs. Gu et al. (2023) extend KD to autoregressive LLMs by decoding from the student model and optimizing the reverse KL divergence between students and teachers. Agarwal et al. (2023) introduce generalized knowledge distillation (GKD) to optimize a linear combination of the forward KL and reverse KL between teacher and student, using a blend of teacher- and student-sampled data. Drawing inspiration from both works, OSD applies KD to speculative decoding for LLMs and extends it to dynamically adjust draft models (Section 4.1). We acknowledge the simultaneous emergence of a related work, DistillSpec (Zhou et al., 2023), which also employs KD for speculative decoding; our work and DistillSpec were developed concurrently. Moreover, DistillSpec represents a specific aspect of our broader framework, OSD. OSD not only explores KD for speculative decoding but also addresses challenges in the online setting and routes queries across various distributions.
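These distillation objectives are standard divergences between the target (teacher) distribution p and the draft (student) distribution q at a token position. The following small numeric sketch (illustrative only, not the paper's training code) computes the forward KL, reverse KL, and generalized JSD with mixing weight `lam`:

```python
import numpy as np

def forward_kl(p, q, eps=1e-9):
    """KL(p || q): penalizes the draft q for failing to cover the target p."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def reverse_kl(p, q, eps=1e-9):
    """KL(q || p): penalizes the draft q for placing mass where p has little."""
    return forward_kl(q, p, eps)

def jsd(p, q, lam=0.5, eps=1e-9):
    """Generalized Jensen-Shannon divergence with mixture m = lam*p + (1-lam)*q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = lam * p + (1.0 - lam) * q
    return lam * forward_kl(p, m, eps) + (1.0 - lam) * forward_kl(q, m, eps)

# Example: the target is peaked on token 0, the draft is more spread out.
p = [0.90, 0.05, 0.05]
q = [0.50, 0.30, 0.20]
print(forward_kl(p, q), reverse_kl(p, q), jsd(p, q))
```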
## 3. Background

We first briefly review speculative decoding (Leviathan et al., 2023), a critical technique that accelerates the inference of a large target LLM $p(\cdot \mid x)$ with token proposals from a small draft model $q_\theta(\cdot \mid x)$, where $x$ denotes the concatenation of the input prompt and the already generated tokens. The two distributions are both learned in an auto-regressive way. We emphasize the parameters $\theta$ of the draft model because we usually need to tailor them to the target LLM for more substantial acceleration.

Speculative decoding uses the (small) draft model to propose $k$ tokens $\mathbf{y} \triangleq \{y_i\}_{i=1}^{k} \sim q_\theta(\cdot \mid x)$, and lets the target LLM estimate the $k+1$ token probabilities $\{p(y \mid x, \mathbf{y}_{<i})\}_{i=1}^{k+1}$ in parallel.
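For reference, the verification step of Leviathan et al. (2023) accepts each drafted token $y_i$ with probability $\min\!\big(1, p(y_i \mid x, \mathbf{y}_{<i}) / q_\theta(y_i \mid x, \mathbf{y}_{<i})\big)$ and, on rejection, resamples from the renormalized residual $\max(p - q_\theta, 0)$, which preserves the target model's output distribution. A minimal sketch of this test for a single position, assuming array-based probability vectors, is shown below; it is an illustration, not the paper's code.

```python
import numpy as np

def verify_token(p_probs, q_probs, token, rng=None):
    """Standard speculative-decoding acceptance test for one drafted token.

    Accept `token` (sampled from the draft q, so q[token] > 0) with probability
    min(1, p[token] / q[token]); on rejection, resample from the residual
    distribution max(p - q, 0), renormalized. This keeps the overall output
    distribution identical to sampling from the target model alone.
    """
    rng = rng or np.random.default_rng()
    p = np.asarray(p_probs, dtype=float)
    q = np.asarray(q_probs, dtype=float)
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True                      # token accepted
    residual = np.clip(p - q, 0.0, None)
    residual /= residual.sum()                  # renormalize the residual
    return int(rng.choice(len(p), p=residual)), False

# Example: the target is confident in token 0, the draft spreads its mass.
p = [0.8, 0.1, 0.1]
q = [0.4, 0.4, 0.2]
print(verify_token(p, q, token=1))              # likely rejected and corrected
```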