# RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding

Guanzheng Chen *1,2,3  Qilong Feng *1  Jinjie Ni 1  Xin Li 2,3  Michael Qizhe Shieh 1

Code: https://github.com/NUS-TRAIL/RAPID

The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented SPeculatIve Decoding (RAPID), which leverages RAG both to accelerate and to enhance generation quality in long-context inference. RAPID introduces the RAG drafter, a draft LLM operating on shortened retrieval contexts, to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm in which same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities of stronger RAG drafters, we develop an inference-time knowledge transfer that enriches the target distribution by RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on ∞Bench for LLaMA-3.1-8B) with more than 2× speedups for long-context inference. Our analyses also reveal the robustness of RAPID across various context lengths and retrieval quality.

*Equal contribution. 1National University of Singapore 2DAMO Academy, Alibaba Group 3Hupan Lab, 310023, Hangzhou, China.
Correspondence to: Guanzheng Chen, Michael Qizhe Shieh.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Large language models (LLMs) have traditionally relied on retrieval-augmented generation (RAG) to process extensive documents by selectively retrieving relevant text segments. While effective, the performance of RAG is inherently bounded by the capability of the retriever to extract pertinent information across diverse queries (Gao et al., 2023). The recent emergence of long-context LLMs, capable of directly processing million-word documents (Team et al., 2024), suggests a promising alternative to complex RAG pipelines. However, this breakthrough is bottlenecked by the computational efficiency of long-context inference, where processing extensive key-value (KV) caches becomes memory-bound and introduces substantial latency (Pope et al., 2022).

Speculative Decoding (SD) (Chen et al., 2023; Leviathan et al., 2023) is a prevalent approach to accelerating LLM inference without compromising generation quality. By leveraging a smaller draft model to propose multiple candidates for single-pass validation by the target model, SD achieves significant speedup when candidates are accepted. The benefits of SD hinge on two critical factors: the computational efficiency of the draft model in generating candidates, and its capability to produce high-quality, acceptable candidates. However, SD becomes less effective in long-context scenarios, as memory-bound KV cache operations prevent smaller LLMs from maintaining significant speed benefits over larger models (Pope et al., 2022; Ainslie et al., 2023b). As depicted in Figure 1, the throughput gains of LLaMA-3.1-8B over LLaMA-3.1-70B diminish drastically (from 23.6× to 9.4×) as context lengths increase from 1K to 128K tokens.
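The draft-then-verify acceptance rule behind SD can be sketched as follows. This is a minimal toy over explicit token distributions, not the paper's implementation; here q is the drafter's distribution and p the target's:

```python
import random

def sample(dist):
    """Sample an index from a discrete distribution given as a list."""
    r, acc = random.random(), 0.0
    for i, pi in enumerate(dist):
        acc += pi
        if r < acc:
            return i
    return len(dist) - 1

def verify(draft_tokens, q_dists, p_dists):
    """Speculative-sampling verification (Leviathan et al., 2023):
    accept each draft token with probability min(1, p(x)/q(x)); on the
    first rejection, resample from the residual norm(max(p - q, 0)).
    (The real algorithm also samples one bonus token when all drafts
    are accepted; omitted here for brevity.)"""
    accepted = []
    for t, q, p in zip(draft_tokens, q_dists, p_dists):
        if random.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)  # draft token kept; distributions match well here
        else:
            # rejection: sample from the normalized residual distribution
            residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
            z = sum(residual)
            accepted.append(sample([ri / z for ri in residual]))
            break
    return accepted
```

When the drafter's distribution matches the target's exactly, every draft token is accepted, which is why a drafter that closely tracks the target yields large speedups.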
In this work, we introduce Retrieval-Augmented SPeculatIve Decoding (RAPID) to bridge the gap of SD for accelerating long-context inference while enhancing generation quality. RAPID employs a RAG drafter, a draft LLM operating on the shortened context from RAG, to speculate on the generation of the long-context LLM following the SD process. We propose that the RAG drafter can serve as an ideal draft model for the long-context target LLM, as it demonstrates the potential to approach the capabilities of the long-context LLM (Li et al., 2024b) while offering superior computational efficiency.

Figure 1. Performance (accuracy, left axis) and throughput (tokens/sec, right axis) of LLaMA-3.1-8B (serving on 1 A800) and LLaMA-3.1-70B (serving on 8 A800) on LongBench v2 (Long) across retrieval context lengths from 1K to 128K tokens.

As illustrated in Figure 1, LLaMA-3.1-8B with RAG on 4K to 16K tokens can recover most of the performance achieved with the full 128K tokens. This indicates that the RAG drafter is capable of producing high-quality candidates for the long-context target LLM with a high acceptance rate, while eliminating the memory-bound KV cache operations over the long context to accelerate inference. In addition, RAPID opens a new paradigm for SD that leverages same-scale or even larger LLMs as RAG drafters to accelerate smaller target LLMs. This paradigm shift is possible because RAG drafters, operating on shortened contexts (e.g., 4K), potentially maintain higher efficiency than target LLMs of the same or even larger scale on long contexts (e.g., 128K), as evidenced in Figure 1.
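The memory-bound bottleneck can be made concrete with a back-of-envelope KV-cache estimate. The architecture numbers below are the commonly reported LLaMA-3.1-8B settings (32 layers, 8 KV heads under GQA, head dimension 128), not figures taken from this paper:

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    # K and V caches in fp16: 2 tensors * layers * kv_heads * head_dim * tokens * bytes
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 2**30

full_ctx = kv_cache_gb(32, 8, 128, 128 * 1024)  # target LLM decoding at 128K tokens
rag_ctx = kv_cache_gb(32, 8, 128, 4 * 1024)     # RAG drafter decoding at 4K tokens

# Every decoded token must stream the whole KV cache through memory,
# so the drafter's per-token memory traffic is ~32x smaller.
print(f"128K context: {full_ctx:.1f} GiB of KV cache; 4K context: {rag_ctx:.2f} GiB")
```

Under these assumptions the 128K-token cache is 16 GiB versus 0.5 GiB at 4K, which is why a same-scale drafter on a short retrieved context can still be far faster than the same model decoding over the full long context.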
Therefore, RAPID operates in two settings: (1) self-speculation, where the long-context target LLM and the RAG drafter are of the same scale; and (2) upward-speculation, where the RAG drafter has a larger parameter scale than the target LLM. Moreover, in both settings, the generation quality of the RAG drafter may surpass that of the long-context target model in some scenarios (Li et al., 2024a). However, native SD, which treats the target LLM's prediction as the ground-truth distribution for rejection sampling, may neglect high-quality candidates from the stronger RAG drafter. This results in unnecessary rejection of valid candidates, impeding both efficiency and performance gains.

To address this limitation, RAPID implements a retrieval-augmented target distribution, which augments the native long-context target distribution in SD with an inference-time knowledge transfer. Specifically, we reverse the usual roles, positioning the RAG drafter as the teacher and the long-context target LLM as the student, to derive a distilled logits shift towards the RAG drafter during inference. By incorporating this shift into the prediction logits of the target LLM, we obtain an enriched target distribution that is more receptive to high-quality speculative candidates.

RAPID can serve as a drop-in decoding method for long-context inference. We conduct experiments on the LLaMA-3.1 (8B, 70B) (Dubey et al., 2024) and Qwen2.5 (7B, 72B) (Yang et al., 2024) series on ∞Bench (Zhang et al.) and LongBench v2 (Bai et al., 2024b). The experimental results demonstrate that RAPID successfully integrates the complementary strengths of long-context LLMs and RAG while maintaining significant inference speedups. In self-speculation settings, RAPID achieves consistent performance improvements (e.g., 42.83 vs. 39.33 on ∞Bench for LLaMA-3.1-8B) with significant speedup (up to 2.69×) over the long-context target LLMs.
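The inference-time knowledge transfer can be illustrated as one gradient step that shifts the target logits toward the drafter's distribution. This is a sketch of the mechanism with an assumed strength parameter eta, consistent with the distillation gradient derived in Appendix A, but not a reproduction of the paper's exact Eq. (6):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

def shift_logits(target_logits, drafter_probs, eta):
    """Move target logits toward the drafter's distribution q:
    z <- z + eta * (q - p), i.e. one descent step on a distillation
    loss whose gradient w.r.t. z is proportional to -(q - p)
    (temperature absorbed into eta). Illustrative sketch only."""
    p = softmax(target_logits)
    return [z + eta * (qi - pi)
            for z, qi, pi in zip(target_logits, drafter_probs, p)]
```

With eta = 0 the native target distribution is recovered; larger eta makes the enriched distribution more receptive to the drafter's candidates, matching the role of the strength parameter η studied in the robustness analysis (Table 3).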
The upward-speculation setting further boosts performance through effective knowledge transfer from larger RAG drafters (e.g., improving LLaMA-3.1-8B from 42.83 to 49.98 on ∞Bench), with efficiency comparable to the smaller long-context target LLMs. With a moderate retrieval length (≤16K) for the RAG drafter, we find that RAPID consistently achieves speedup when the target long-context length exceeds 32K. Our analyses also indicate that RAPID is robust to retrieval quality and offers potentially superior generation quality in real-world multi-turn dialogue tasks. These results validate RAPID as an effective decoding method for accelerating long-context inference while, at the same time, enhancing generation quality through retrieval-augmented speculation.

2. RAPID: Retrieval-Augmented Speculative Decoding

2.1. Background: Speculative Decoding

Autoregressive generation with an LLM pϕ traditionally requires sequential forward passes, where each token xi is sampled from the distribution pϕ(xi | x<i).

RAPID achieves more than 2× speedup for long-context inference. In self-speculation settings, RAPID achieves significant speedup over the LC baseline (2.10× for LLaMA-3.1-8B, 2.69× for LLaMA-3.1-70B), and significantly surpasses naive SD and MagicDec. When employing upward-speculation with larger drafters, RAPID still maintains comparable throughput¹ (1.14× for LLaMA-3.1-8B with the 70B drafter, 0.93× for Qwen2.5-7B with the 72B drafter) while substantially improving generation quality. While pure RAG shows the highest throughput (e.g., 3.35× speedup for LLaMA-3.1-8B), its performance can be significantly compromised in certain scenarios (e.g., En.QA accuracy drops from 39.21 to 30.72 for Qwen2.5-72B). In contrast, RAPID effectively maintains competitive throughput while consistently achieving superior generation quality across different settings.

4.2. Benefits Integration Analysis

RAPID incorporates benefits from the RAG drafter while maintaining target model capabilities.
To analyze how RAPID integrates the strengths of both the RAG drafter and the target LLM, we examine the relative successes and failures of the RAG drafter, SD, and RAPID on LongBench v2. As shown in Figure 2, RAPID successfully handles additional cases where the target LLM fails by incorporating beneficial knowledge from the RAG drafter. Meanwhile, RAPID maintains the capabilities of the target LLM, exhibiting significantly lower failure rates compared to using the RAG drafter alone. This combination of gains from the RAG drafter with minimal degradation of target LLM capabilities enables RAPID to outperform both the target and draft models. Furthermore, the gains from the RAG drafter in RAPID substantially exceed those in naive SD, demonstrating the effectiveness of our retrieval-augmented target distribution in Eq. (6).

¹ Note that upward-speculation requires extra GPUs to serve the RAG drafter, as in regular SD.

Figure 3. Impact of context and retrieval lengths on RAPID (self-speculation) performance and efficiency based on LLaMA-3.1-8B, for target context lengths from 4K to 128K tokens. RAPID consistently outperforms naive SD and achieves speedup beyond 32K context length with moderate retrieval lengths (≤16K). Top: ΔAccuracy indicates the accuracy margins on the LongBench v2 (Long, CoT) subset over the target LLM. Middle: Acceptance rate indicates the proportion of accepted draft tokens. Bottom: Speedup ratio compared to target LLM inference (> 1 indicates acceleration).

RAPID exhibits capabilities beyond the individual target/draft LLMs. Most notably, we observe an emergent phenomenon where RAPID successfully handles cases that both the target LLM and the RAG drafter fail individually (shown as RAPID Only in Figure 2).
Specifically, this emergent accuracy mass grows more pronounced as RAG drafters become stronger, from LLaMA-3.1-8B to LLaMA-3.1-70B. This suggests that RAPID not only combines the strengths of both models but also enables new capabilities through their synergistic interaction. The phenomenon becomes particularly evident in the upward-speculation setting, where the stronger RAG drafter facilitates more sophisticated knowledge transfer during inference.

Table 2. Evaluation on multi-turn dialogue generation with extended chat history, for LLaMA-3.1-8B as both target and draft LLM. Quality scores (1-10) are rated by GPT-4-Turbo-1106 using the LLM-as-a-Judge protocol.

| Method | Quality | Acceptance Rate (%) | Throughput (tokens/sec) |
|---|---|---|---|
| Target LLM | 2.82 | - | 10.64 ± 0.98 |
| RAG Drafter | 3.95 | - | 40.49 ± 0.47 |
| SD | 2.94 | 56.34 ± 0.13 | 14.07 ± 3.08 |
| RAPID | 4.21 | 76.94 ± 0.13 | 18.18 ± 3.23 |

Table 3. Robustness study of RAPID with different draft influence parameter η. Results show performance gains (ΔAccuracy) and speedup ratios on the LongBench v2 (Long, CoT) subset using LLaMA-3.1-8B as the target LLM, with LLaMA-3.1-8B and LLaMA-3.1-70B as RAG drafters under an unrelated retrieval context.

| η | ΔAccuracy (8B draft) | Speedup (8B draft) | ΔAccuracy (70B draft) | Speedup (70B draft) |
|---|---|---|---|---|
| 0 | 1.20 | 1.62× | -1.30 | 0.67× |
| 5 | 2.80 | 1.75× | 0.40 | 0.69× |
| 10 | 1.60 | 1.77× | 1.20 | 0.72× |
| 20 | 1.20 | 1.78× | 4.40 | 0.75× |
| 30 | -2.40 | 2.07× | 6.60 | 0.80× |
| 40 | -2.60 | 2.08× | 6.60 | 0.84× |
| 50 | -6.30 | 2.10× | 6.00 | 0.87× |

4.3. Impact of Context and Retrieval Length

RAPID demonstrates effectiveness across various context configurations. We analyze how RAPID performs under varying target context lengths and RAG drafter retrieval lengths, as shown in Figure 3. The results demonstrate consistent advantages of RAPID over naive SD across all configurations. First, RAPID achieves significantly better performance gains (2-8% ΔAccuracy) over the long-context baseline, compared to the marginal or negative gains (-5% to 2%) of naive SD.
This superior performance is accompanied by consistently higher acceptance rates (75-85% versus 60-70%) and better speedup ratios across all context and retrieval length configurations.

RAPID achieves speedup for long-context inference beyond 32K. The impact of retrieval length reveals an interesting efficiency-effectiveness trade-off. In terms of computational efficiency, RAPID achieves acceleration (speedup > 1.0×) when the target context length exceeds 32K, while SD requires contexts beyond 64K to demonstrate speedup. For retrieval length, while longer retrieval contexts generally lead to higher acceptance rates (up to 85%), the speedup ratio does not necessarily increase. Specifically, retrieval lengths of 4K and 8K achieve nearly identical speedup ratios, indicating minimal overhead in this range. However, when the retrieval length exceeds 16K, the increased computational overhead from longer draft contexts becomes apparent and impacts the overall speedup. These findings suggest that RAPID achieves remarkable efficiency when accelerating long-context inference beyond 32K tokens with a moderate retrieval length within 16K.

4.4. Generation Quality Analysis

RAPID achieves superior generation quality and throughput in a real-world application. To evaluate the effectiveness of RAPID in practical long-context applications, we assess its performance on multi-turn dialogue generation. We construct a challenging evaluation dataset by adapting MT-Bench-101 (Bai et al., 2024a): for each of the first 100 samples, we preserve their last-turn queries while distributing their previous conversation context within a longer chat history comprising additional dialogue turns from another 500 samples in MT-Bench-101. The resulting chat history is around 122K tokens long. This setup tests the ability of models to maintain coherence and relevance while processing extensive dialogue history.
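The history-construction procedure above can be sketched as follows; the function and data layout are illustrative assumptions, not the released evaluation code:

```python
import random

def build_long_history(sample_turns, distractor_turns, seed=0):
    """Keep the sample's last-turn query; scatter its earlier turns (in
    order) among distractor turns taken from other dialogues.
    Illustrative sketch only -- names and layout are assumptions."""
    rng = random.Random(seed)
    *context_turns, last_query = sample_turns
    history = list(distractor_turns)
    # choose increasing insertion points so the sample's own turns stay ordered
    positions = sorted(rng.sample(range(len(history) + 1), len(context_turns)))
    for offset, (pos, turn) in enumerate(zip(positions, context_turns)):
        history.insert(pos + offset, turn)
    return history, last_query
```

The key property is that the sample's own context turns remain in their original relative order, so the query stays answerable but the relevant turns are buried inside a much longer history.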
As shown in Table 2, RAPID demonstrates substantial improvements across all metrics. Using GPT-4-Turbo-1106 as the evaluator following LLM-as-a-Judge (Zheng et al., 2023), RAPID achieves a generation quality score of 4.21, significantly outperforming the target LLM (2.82), the RAG drafter (3.95), and naive SD (2.94). This quality improvement comes with a robust acceptance rate of 76.94% (vs. 56.34% for SD) and enhanced throughput of 18.18 tokens/second (1.7× speedup over the target LLM), demonstrating the practical advantages of RAPID in real-world long-context applications.

4.5. Robustness to Retrieval Quality

RAPID is robust to retrieval quality, and this robustness is further enhanced by a stronger drafter. To assess the robustness of RAPID with respect to retrieval quality, we conduct stress tests by deliberately using an unrelated retrieval context (the context of the first sample from LongBench v2 for all samples) while varying the knowledge transfer parameter η in Eq. (6). As shown in Table 3, with self-speculation (LLaMA-3.1-8B drafter), RAPID maintains performance gains (ΔAccuracy > 0) and improved efficiency (speedup 1.62×-1.78×) for η ≤ 20, even with an irrelevant retrieval context. However, when η > 20, the RAG drafter may overly impact the target distribution, leading to performance degradation. Moreover, upward-speculation with LLaMA-3.1-70B as the drafter demonstrates even better robustness, maintaining positive performance gains (up to 6.60%) across all η values despite a totally unrelated retrieval context. This increased resilience suggests that RAPID effectively leverages the inherent capabilities of stronger RAG drafters, maintaining reliable performance even under suboptimal retrieval quality.

5. Related Work

Speculative Decoding. Speculative Decoding (Chen et al., 2023; Leviathan et al., 2023) accelerates LLM inference by leveraging smaller draft models to propose multiple tokens for single-pass validation.
REST (He et al., 2024b) extends the drafting mechanism by retrieving possible continuations from a pre-built corpus rather than generating with a draft LLM. Ouroboros (Zhao et al., 2024) proposes producing longer and more acceptable candidates from the draft LLM per step based on draft phrases. Inspired by the speculation mechanism, Speculative RAG (Wang et al., 2024) proposes a parallel draft-then-verify mechanism to improve RAG quality. Recent works like TriForce (Sun et al., 2024) and MagicDec (Chen et al., 2024a) attempt to extend SD to long-context scenarios through KV cache compression techniques (Xiao et al., 2023). However, such compression approaches often result in weakened draft models with limited speedup in complex applications. In contrast, RAPID adopts RAG drafters that maintain both high-quality speculation and substantial speedup across various applications.

Long-Context Inference Speedup. Research on accelerating long-context inference has primarily focused on two directions: optimizing KV cache operations through selective retention (Xiao et al., 2023; Kang et al., 2024; Zhang et al., 2023) or quantization (Sheng et al., 2023; Liu et al., 2024b; He et al., 2024a), and exploring prompt compression methods (Chevalier et al., 2023; Jiang et al., 2023; Pan et al., 2024). While these approaches improve efficiency, they often compromise contextual information without quality guarantees (Zhang et al., 2024). RAPID addresses this limitation by leveraging SD to maintain generation quality through explicit verification from long-context LLMs, providing a more reliable balance between efficiency and performance.

RAG and Long-Context LLMs. Recent studies have revealed complementary strengths between RAG and long-context LLMs, with substantial prediction overlap despite different performance characteristics (Li et al., 2024b;a). While long-context LLMs excel in document-based tasks, RAG shows advantages in scenarios like dialogue-based question answering.
Previous attempts to combine these approaches, such as self-reflection routing (Li et al., 2024b) and step-by-step RAG enhancement (Yue et al., 2024), rely heavily on task-specific prompt engineering. RAPID provides a more principled solution by directly integrating RAG benefits into the decoding process, enabling dynamic adaptation while preserving the advantages of both paradigms.

6. Conclusion

In this work, we introduce RAPID, a novel decoding method that bridges the efficiency gap of speculative decoding (SD) in long-context inference while enhancing generation quality through retrieval-augmented speculation. The key to RAPID lies in leveraging RAG drafters to enable efficient speculation for long-context target LLMs, along with a retrieval-augmented target distribution that effectively integrates knowledge from potentially stronger drafters. Through extensive experiments, we demonstrate that RAPID achieves both computational efficiency and improved generation quality across different model scales and tasks. Specifically, RAPID enables more than 2× speedup while maintaining performance advantages in self-speculation settings, and achieves substantial quality improvements through upward-speculation with stronger RAG drafters. These results establish RAPID as a practical solution for accelerating long-context inference with improved generation quality.

Acknowledgments

This project was partially supported by the Singapore Ministry of Education Academic Research Fund Tier 1 (Award Number: T1 251RES2514) and the DAMO Academy Research Intern Program.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S.
GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4895-4901, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.298. URL https://aclanthology.org/2023.emnlp-main.298/.

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023b. URL https://openreview.net/forum?id=hmOwOZWzYE.

Bai, G., Liu, J., Bu, X., He, Y., Liu, J., Zhou, Z., Lin, Z., Su, W., Ge, T., Zheng, B., and Ouyang, W. MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7421-7454, Bangkok, Thailand, August 2024a. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.401. URL https://aclanthology.org/2024.acl-long.401/.

Bai, Y., Tu, S., Zhang, J., Peng, H., Wang, X., Lv, X., Cao, S., Xu, J., Hou, L., Dong, Y., Tang, J., and Li, J. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. arXiv, abs/2412.15204, 2024b. URL https://api.semanticscholar.org/CorpusID:274859535.

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling, 2023. URL https://arxiv.org/abs/2302.01318.

Chen, J., Tiwari, V., Sadhukhan, R., Chen, Z., Shi, J., Yen, I. E.-H., and Chen, B. MagicDec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding. arXiv preprint arXiv:2408.11049, 2024a.
Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z. M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 2318-2335, Bangkok, Thailand, August 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.137. URL https://aclanthology.org/2024.findings-acl.137/.

Chevalier, A., Wettig, A., Ajith, A., and Chen, D. Adapting language models to compress contexts. arXiv, abs/2305.14788, 2023. URL https://api.semanticscholar.org/CorpusID:258865249.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Guo, Q., Wang, M., and Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv, abs/2312.10997, 2023. URL https://api.semanticscholar.org/CorpusID:266359151.

He, Y., Zhang, L., Wu, W., Liu, J., Zhou, H., and Zhuang, B. ZipCache: Accurate and efficient KV cache quantization with salient token identification. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024a. URL https://openreview.net/forum?id=5t4ZAkPiJs.

He, Z., Zhong, Z., Cai, T., Lee, J., and He, D. REST: Retrieval-based speculative decoding. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1582-1595, Mexico City, Mexico, June 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.88. URL https://aclanthology.org/2024.naacl-long.88/.

Hinton, G. E., Vinyals, O., and Dean, J.
Distilling the knowledge in a neural network. arXiv, abs/1503.02531, 2015. URL https://api.semanticscholar.org/CorpusID:7200347.

Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., and Qiu, L. LLMLingua: Compressing prompts for accelerated inference of large language models. In Conference on Empirical Methods in Natural Language Processing, 2023. URL https://api.semanticscholar.org/CorpusID:263830701.

Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C.-Y., Yang, Y., and Qiu, L. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=fPBACAbqSN.

Kang, H., Zhang, Q., Kundu, S., Jeong, G., Liu, Z., Krishna, T., and Zhao, T. GEAR: An efficient KV cache compression recipe for near-lossless generative inference of LLM. arXiv, abs/2403.05527, 2024. URL https://api.semanticscholar.org/CorpusID:268297231.

Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding, 2023. URL https://arxiv.org/abs/2211.17192.

Li, X., Cao, Y., Ma, Y., and Sun, A. Long context vs. RAG for LLMs: An evaluation and revisits. 2024a. URL https://api.semanticscholar.org/CorpusID:275323896.

Li, X. L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., Zettlemoyer, L., and Lewis, M. Contrastive decoding: Open-ended text generation as optimization. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12286-12312, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.687. URL https://aclanthology.org/2023.acl-long.687/.

Li, Z., Li, C., Zhang, M., Mei, Q., and Bendersky, M. Retrieval augmented generation or long-context LLMs? A comprehensive study and hybrid approach.
arXiv, abs/2407.16833, 2024b. URL https://api.semanticscholar.org/CorpusID:271404721.

Liu, H., Yan, W., Zaharia, M., and Abbeel, P. World model on million-length video and language with blockwise ring attention. arXiv, abs/2402.08268, 2024a. URL https://api.semanticscholar.org/CorpusID:267637090.

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv, abs/2402.02750, 2024b. URL https://api.semanticscholar.org/CorpusID:267413049.

Pan, Z., Wu, Q., Jiang, H., Xia, M., Luo, X., Zhang, J., Lin, Q., Rühle, V., Yang, Y., Lin, C.-Y., Zhao, H. V., Qiu, L., and Zhang, D. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Annual Meeting of the Association for Computational Linguistics, 2024. URL https://api.semanticscholar.org/CorpusID:268531237.

Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. arXiv, abs/2211.05102, 2022. URL https://api.semanticscholar.org/CorpusID:253420623.

Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Fu, D. Y., Xie, Z., Chen, B., Barrett, C. W., Gonzalez, J., Liang, P., Ré, C., Stoica, I., and Zhang, C. High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:257495837.

Sun, H., Chen, Z., Yang, X., Tian, Y., and Chen, B. TriForce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912, 2024.

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al.
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. LLaMA: Open and efficient foundation language models. 2023. URL https://arxiv.org/abs/2302.13971.

Wang, Z., Wang, Z., Le, L. T., Zheng, H. S., Mishra, S., Perot, V., Zhang, Y., Mattapalli, A., Taly, A., Shang, J., Lee, C.-Y., and Pfister, T. Speculative RAG: Enhancing retrieval augmented generation through drafting. arXiv, abs/2407.08223, 2024. URL https://api.semanticscholar.org/CorpusID:271097348.

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv, 2023.

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.

Yue, Z., Zhuang, H., Bai, A., Hui, K., Jagerman, R., Zeng, H., Qin, Z., Wang, D., Wang, X., and Bendersky, M. Inference scaling for long-context retrieval augmented generation. arXiv, abs/2410.04343, 2024. URL https://api.semanticscholar.org/CorpusID:273185794.

Zhang, J., Zhu, D., Song, Y., Wu, W., Kuang, C., Li, X., Shang, L., Liu, Q., and Li, S. More tokens, lower precision: Towards the optimal token-precision trade-off in KV cache compression. arXiv, abs/2412.12706, 2024. URL https://api.semanticscholar.org/CorpusID:274789429.

Zhang, X., Chen, Y., Hu, S., Xu, Z., Chen, J., Hao, M., Han, X., Thai, Z., Wang, S., Liu, Z., and Sun, M. ∞Bench: Extending long context evaluation beyond 100K tokens. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). URL https://aclanthology.org/2024.acl-long.814.

Zhang, Z.
A., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C. W., Wang, Z., and Chen, B. H2O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv, abs/2306.14048, 2023. URL https://api.semanticscholar.org/CorpusID:259263947.

Zhao, W., Huang, Y., Han, X., Xu, W., Xiao, C., Zhang, X., Fang, Y., Zhang, K., Liu, Z., and Sun, M. Ouroboros: Generating longer drafts phrase by phrase for faster speculative decoding. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 13378-13393, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.742. URL https://aclanthology.org/2024.emnlp-main.742/.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J., and Stoica, I. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv, abs/2306.05685, 2023. URL https://api.semanticscholar.org/CorpusID:259129398.

A. Proof of Theorem 1

We analyze the gradient of the knowledge distillation loss with respect to the target model's logits.
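The gradient claimed by Theorem 1, ∂L/∂z_i = −T[q(x_i) − p(x_i)], can also be verified numerically with central finite differences; a minimal sketch:

```python
import math

def softmax_T(z, T):
    """Softmax over temperature-scaled logits z / T."""
    m = max(zi / T for zi in z)
    exps = [math.exp(zi / T - m) for zi in z]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(z, q, T):
    # L = T^2 * KL(q || p) with p = softmax(z / T)
    p = softmax_T(z, T)
    return T * T * sum(qj * math.log(qj / pj) for qj, pj in zip(q, p))

z, q, T = [0.5, -1.2, 2.0, 0.3], [0.1, 0.2, 0.6, 0.1], 2.0
p = softmax_T(z, T)
analytic = [-T * (qi - pi) for qi, pi in zip(q, p)]

eps = 1e-6
numeric = []
for i in range(len(z)):
    zp, zm = list(z), list(z)
    zp[i] += eps
    zm[i] -= eps
    numeric.append((distill_loss(zp, q, T) - distill_loss(zm, q, T)) / (2 * eps))

# central differences agree with -T * (q - p) to high precision
assert all(abs(a - n) < 1e-5 for a, n in zip(analytic, numeric))
```

The example logits and temperature are arbitrary; any valid distribution q and logits z satisfy the same check.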
The distillation loss with temperature T is defined as:

L = T^2 \, \mathrm{KL}(q(x) \,\|\, p(x)) = T^2 \sum_j q(x_j) \log \frac{q(x_j)}{p(x_j)}

where the target distribution p(x) is parameterized by logits z through softmax:

p(x_j) = \frac{\exp(z_j / T)}{\sum_k \exp(z_k / T)}    (12)

Theorem: The gradient of the distillation loss with respect to logit z_i is:

\frac{\partial L}{\partial z_i} = -T \, [q(x_i) - p(x_i)]    (13)

Proof: We derive this gradient through the following steps.

1) First, expand the derivative using the chain rule:

\frac{\partial L}{\partial z_i} = T^2 \sum_j q(x_j) \frac{\partial}{\partial z_i} \left[ \log q(x_j) - \log p(x_j) \right]    (14)

2) Note that q(x_j) is independent of z_i:

= -T^2 \sum_j q(x_j) \frac{\partial}{\partial z_i} \log p(x_j)    (15)

3) Expand the log probability:

\log p(x_j) = \frac{z_j}{T} - \log \sum_k \exp(z_k / T)    (16)

4) Apply the derivative using the Kronecker delta \delta_{ij}:

= -T^2 \sum_j q(x_j) \left[ \frac{\delta_{ij}}{T} - \frac{1}{T} \cdot \frac{\exp(z_i / T)}{\sum_k \exp(z_k / T)} \right]    (17)

5) Simplify using the definition of p(x_i):

= -T \sum_j q(x_j) \left[ \delta_{ij} - p(x_i) \right]    (18)

6) The sum over j with \delta_{ij} selects only q(x_i):

= -T \left[ q(x_i) - \sum_j q(x_j) \, p(x_i) \right]    (19)

Since \sum_j q(x_j) = 1, we obtain the final result:

\frac{\partial L}{\partial z_i} = -T \, [q(x_i) - p(x_i)]    (20)

This gradient shows that the distillation loss pushes the target distribution p(x) towards the draft distribution q(x) with strength proportional to the temperature T.

B. Correctness of RAPID's Residual Distribution

We prove that for RAPID's retrieval-augmented speculative decoding, when rejection occurs, sampling from the distribution

x_i \sim \mathrm{norm}(\max(p(x_i) - \hat{p}(x_i), \; p(x_i) - q(x_i)))    (21)

maintains the target distribution p(x_i), where:

p(x_i) = p_\phi(x_i | [C; x

β_SD), which directly reduces the amortized FLOPs per generated token.

Table 4. FLOPs comparison for different methods per step.

Method       | FLOPs
Long Context | 2γTL + γ²T
RAG Drafter  | 2γDL_R + γ²D
SD           | (2γDL_R + γ²D + 2T(L + γ)) / β_SD
RAPID        | (2γDL_R + γ²D + 2T(L + γ)) / β_RAPID

D.2. Overhead of RAG

Unlike a regular RAG pipeline, which builds indexes over a large external corpus (hundreds of millions of documents), we only index and retrieve chunks of the input long context (<128K tokens) on the fly during inference. The latency of the RAG component is therefore marginal compared to the inference latency over the long context.
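Because retrieval here operates only over the input context rather than an external corpus, the RAG component reduces to on-the-fly chunking and ranking. The sketch below is a minimal illustration in pure Python; the chunk size, budget, and lexical-overlap scorer are hypothetical stand-ins for whatever retriever and embedding model are actually used.

```python
from collections import Counter
import math

def chunk_text(text, chunk_size=512):
    # split the long input context into fixed-size word chunks,
    # built on the fly at inference time (no persistent corpus index)
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def score(query, chunk):
    # simple lexical-overlap score; a dense embedding retriever would
    # normally replace this (illustrative stand-in)
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    overlap = sum((q & c).values())
    return overlap / math.sqrt(len(chunk.split()) + 1)

def retrieve(query, context, budget_chunks=8, chunk_size=512):
    # rank chunks of the input context and concatenate the top ones,
    # producing the shortened retrieval context for the RAG drafter
    chunks = chunk_text(context, chunk_size)
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    return "\n".join(ranked[:budget_chunks])
```

The structure (chunk, rank, concatenate a shortened drafter context) is the part that matters for the latency accounting above: all work is linear in the input length, which is why the RAG-pipeline column in Table 5 stays near-constant while prefill and generation dominate.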
Table 5 presents the average latency (in seconds) of each component of RAPID on LongBench v2 (Long, CoT), using LLaMA-3.1-8B and LLaMA-3.1-70B in self-speculative mode.

Table 5. Latency of RAPID components on LongBench v2 (Long, CoT).

Model               | RAG Pipeline (s) | Prefill (s) | Generation (s)
LLaMA-3.1-8B-RAPID  | 1.43             | 26.37       | 32.25
LLaMA-3.1-70B-RAPID | 1.43             | 163.43      | 121.76

E. More Results

E.1. Comparison with TriForce

TriForce was not included in Table 1 because it is not directly compatible with modern LLMs that use Grouped Query Attention (GQA) (Ainslie et al., 2023a). We therefore conducted comparisons on LWM-Text-Chat-128K (Liu et al., 2024a) (based on LLaMA-2-7B (Touvron et al., 2023)), with a retrieval budget of 4096 tokens, a chunk size of 8, and a draft cache budget of 256 for TriForce. Table 6 shows decoding performance and speedup on LongBench v2 (Long, CoT).

Table 6. Comparison of RAPID and TriForce on LWM-Text-Chat-128K on the LongBench v2 (Long, CoT) task.

Model              | Accuracy | Speedup
LWM-Text-Chat-128K | 18.4     | 1.00
TriForce           | 18.0     | 1.27
RAPID              | 21.6     | 2.56

While TriForce achieves modest efficiency gains, RAPID delivers superior speedup and performance. TriForce relies on chunk-wise attention scores for information recall, but high attention scores do not always correlate with semantic relevance; e.g., initial tokens may act as attention sinks despite lacking meaningful content (Xiao et al., 2023). In contrast, our RAG drafter prioritizes semantically relevant information, resulting in a higher acceptance rate and greater speedup on complex tasks.

E.2. Comparison with MInference

We evaluated MInference (Jiang et al., 2024) against RAPID using LLaMA-3.1-8B on the LongBench v2 (Long, CoT) task. Table 7 reports performance, prefill time (in seconds), and decoding speedup relative to the LLaMA-3.1-8B baseline.
Model                   | Accuracy | Prefill Time (s) | Speedup
LLaMA-3.1-8B (baseline) | 30.4     | 25.89            | 1.00
MInference              | 30.9     | 9.10             | 0.62
RAPID                   | 34.2     | 26.37            | 2.10

MInference significantly reduces prefill time, showcasing its efficiency in the initial processing phase. However, RAPID outperforms MInference in overall performance and decoding throughput, achieving a higher speedup. We note that sparse attention, as utilized by MInference, is orthogonal to our approach, suggesting that integrating sparse attention with RAPID could further enhance efficiency.
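As a reference point for the decoding speedups reported above, the verification rule that speculative decoding amortizes can be stated compactly. The sketch below implements the standard accept/reject step of speculative sampling (Leviathan et al., 2023) with the vanilla residual norm(max(p - q, 0)); RAPID's Eq. (21) substitutes its retrieval-augmented residual, but the control flow is the same. The distributions here are illustrative toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(p, q, draft_token):
    # standard speculative-sampling verification: accept draft token x ~ q
    # with probability min(1, p(x)/q(x)); on rejection, resample from the
    # residual distribution norm(max(p - q, 0)), which preserves p exactly
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

# empirical check: accepted-or-resampled tokens follow the target p
p = np.array([0.6, 0.3, 0.1])   # target distribution (toy)
q = np.array([0.2, 0.5, 0.3])   # draft distribution (toy)
counts = np.zeros(3)
for _ in range(100_000):
    x = rng.choice(3, p=q)
    counts[speculative_accept(p, q, x)] += 1
print(counts / counts.sum())    # close to p = [0.6, 0.3, 0.1]
```

The acceptance rate of this step is what β in Table 4 captures: the closer the drafter's q is to the target's p, the more tokens survive verification, which is why a semantically focused RAG drafter yields larger speedups than attention-score or sparse-attention heuristics.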