Published as a conference paper at ICLR 2025

LONG CONTEXT COMPRESSION WITH ACTIVATION BEACON

Peitian Zhang1,2, Zheng Liu1, Shitao Xiao1, Ninglu Shao1,2, Qiwei Ye1, Zhicheng Dou2
1: Beijing Academy of Artificial Intelligence; 2: Gaoling School of Artificial Intelligence, Renmin University of China
{namespace.pt, zhengliu1026}@gmail.com

ABSTRACT

Long context compression is a critical research problem due to its significance in reducing the high computational and memory costs associated with LLMs. In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts. To achieve this, our method introduces the following technical designs. 1) We directly compress the activations (i.e., the keys and values at every layer), rather than leveraging soft prompts to relay information (which constitute a major bottleneck to encapsulating the complex information within long contexts). 2) We tailor the compression workflow, where each fine-grained input unit is progressively compressed, enabling high-quality compression and efficient computation during both training and inference. 3) We train the model through compression-based auto-regression, making full use of plain texts and instructional data to optimize the model's compression performance. 4) During training, we randomly sample a compression ratio at each step, teaching the model to support a wide range of compression configurations. Extensive evaluations are conducted on various long-context tasks whose lengths (e.g., 128K) may far exceed the maximum training length (20K), such as document understanding, few-shot learning, and Needle-in-a-Haystack.
Whilst existing methods struggle to handle these challenging tasks, Activation Beacon maintains comparable performance to the uncompressed baseline across various scenarios, achieving a 2x acceleration in inference time and an 8x reduction in the memory cost of the KV cache.

1 INTRODUCTION

Large language models (LLMs) need to process long contexts to accomplish many important tasks, such as long-document understanding (Jiang et al., 2024b), long-content creation (Bai et al., 2024), and long-term memorization/reasoning (Zhang et al., 2024). To address these needs, modern LLMs are built with extended context windows (e.g., 128K) that enable remarkable long-context processing capabilities (OpenAI, 2024; Yang et al., 2024; et al., 2024).

Despite their effectiveness, LLMs encounter efficiency challenges in processing long contexts. On one hand, transformer-based LLMs incur substantial computational costs due to the quadratic complexity of self-attention. On the other hand, they require tremendous GPU memory to hold the KV cache of the entire sequence for faster decoding. Both computation and memory costs increase as the context length grows.

A wide array of studies is dedicated to alleviating these efficiency issues, among which context compression is a promising direction (Mu et al., 2023; Chevalier et al., 2023; Ge et al., 2024; Jiang et al., 2023a;b). This approach aims to compress the raw input into more concise representations, allowing the generation process to be conditioned on a shorter context. Therefore, it helps to reduce both the computation cost of inference and the memory cost of the KV cache, while also enabling the processing of longer inputs than the LLM's built-in context window. Despite current progress, it remains a tough challenge to compress long contexts.
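The scale of the KV-cache problem can be sanity-checked with simple arithmetic. The sketch below (plain Python; the model dimensions are typical Llama-2-7B-style values assumed for illustration, not taken from the paper) estimates the fp16 KV-cache size of a 128K-token context and the effect of an 8x sequence-wise reduction:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    """Estimate KV-cache size: keys AND values (factor 2) are stored at
    every layer, for every head, for every token, in fp16 (2 bytes)."""
    return 2 * n_layers * n_heads * head_dim * dtype_bytes * seq_len

full = kv_cache_bytes(128 * 1024)   # uncompressed 128K-token context
compressed = full // 8              # 8x fewer cached (beacon) positions
print(f"full:       {full / 2**30:.0f} GiB")        # -> full:       64 GiB
print(f"compressed: {compressed / 2**30:.0f} GiB")  # -> compressed: 8 GiB
```

At roughly 0.5 MiB of cache per token under these assumptions, a 128K-token context alone exceeds the memory of most single GPUs, which is what motivates sequence-wise compression.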
Specifically, existing methods usually summarize the context into a few soft tokens (Chevalier et al., 2023; Ge et al., 2024), which constitute the major bottleneck to summarizing the complex information within long contexts.

Peitian Zhang and Zheng Liu are the co-first authors. Zheng Liu is the corresponding author.

Figure 1: Overview of Activation Beacon. The context is partitioned into chunks. Each chunk is further split into fine-grained units and interleaved with beacon tokens according to a compression ratio (2 in the figure). The LLM encodes one chunk at a time, compressing the context into the beacon tokens' activations, which are accumulated and reused for encoding the following chunks.

Besides, they try to compress the context all at once, lacking a fine-grained handling of the detailed information. Moreover, these soft tokens must be re-encoded before generation, resulting in inferior efficiency in both training and inference. Lastly, these methods are learned to compress with a fixed number of soft tokens; thus, it is hard to customize the compression ratio for downstream tasks. While some alternative methods focus on deleting unimportant tokens (Jiang et al., 2023b; Li et al., 2024b), they depend on the input question to estimate the token importance, limiting their efficiency in real-world multi-turn scenarios.

To address the above challenges, we present Activation Beacon (Figure 1), a plug-in module for transformer-based LLMs that enables effective, efficient, and flexible compression of long contexts. Activation Beacon features the following technical designs. First of all, we introduce a new special token, called the beacon token ⟨b⟩. The context is distilled into the beacon tokens' activations (i.e., the keys and values at every layer), whose capacity is large enough to encapsulate the complex information within long contexts.
Next, we tailor the compression workflow, where each fine-grained context unit is progressively compressed. Specifically, the long context is partitioned into equal-size chunks. Each chunk is further split into fine-grained units of size α, where α is the desired compression ratio. A group of beacon tokens is interleaved with these units (one beacon token is dispatched to the end of every unit). The LLM encodes one chunk at a time, distilling the chunk's information into the beacon tokens' activations during self-attention. After encoding, the raw tokens' activations are discarded, while the beacon tokens' activations are accumulated and reused for encoding the following chunks. This progressive workflow brings forth several advantages: 1) It can handle inputs longer than the backbone LLM's context window, as the chunk size is small. 2) It achieves fine-grained compression, since the attention scope of each beacon token is differentiated. 3) By caching and reusing activations, it facilitates contiguous gradient propagation in training, avoids re-encoding overhead in inference, and allows for incrementally updating the compression results in multi-turn scenarios.

Finally, Activation Beacon is learned with compression-based auto-regression to optimize the generation quality conditioned on the compressed context. Thanks to its high sample efficiency, the model can be effectively trained with a 1B plain-text corpus and 30K fine-tuning samples (the maximum context length is 20K), which can be quickly accomplished. During training, we randomly sample the compression ratio for each chunk, enhancing the model's flexibility to tackle different compression ratios in downstream tasks. Note that since all beacon tokens share the same token embedding, one can use an arbitrary number of beacon tokens to achieve the desired compression ratio simply by repeating it.

In our experiments, Activation Beacon is applied to Llama-2 (Touvron et al., 2023) and Qwen2 (Yang et al., 2024).
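The progressive workflow described above can be sketched in a few lines of pure Python. This is a schematic toy, not the paper's implementation: all names (`compress_chunk`, `progressive_compress`, `BEACON`) are ours, token strings stand in for per-layer key/value activations, and "encoding" is faked by simply keeping the beacon slots:

```python
import random

BEACON = "<b>"  # all beacon tokens share one embedding, so it can be repeated freely

def compress_chunk(chunk, ratio):
    """Interleave one beacon after every `ratio` raw tokens, then keep only
    the beacon slots as the chunk's compressed representation."""
    assert len(chunk) % ratio == 0, "chunk size must be divisible by the ratio"
    interleaved = []
    for i in range(0, len(chunk), ratio):
        interleaved.extend(chunk[i:i + ratio])  # one fine-grained unit
        interleaved.append(BEACON)              # beacon at the unit's end
    # Raw-token "activations" are discarded; k = len(chunk)/ratio beacons remain.
    return [t for t in interleaved if t == BEACON]

def progressive_compress(tokens, chunk_size=8, ratios=(2, 4, 8)):
    cache = []  # accumulated beacon "activations", reused by later chunks
    for start in range(0, len(tokens), chunk_size):
        ratio = random.choice(ratios)  # per-chunk ratio, sampled as in training
        cache.extend(compress_chunk(tokens[start:start + chunk_size], ratio))
    return cache

cache = progressive_compress([f"t{i}" for i in range(32)])
print(len(cache))  # between 4 (all ratios 8) and 16 (all ratios 2)
```

The key property the sketch illustrates is that the cache grows only by `chunk_size / ratio` entries per chunk, so later chunks condition on a compressed, ever-accumulating prefix rather than on all raw tokens.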
We evaluate the resulting models on a variety of long-context tasks (whose lengths may be much longer than the training length, e.g., 128K), such as document understanding, few-shot learning, and Needle-in-a-Haystack. Whilst existing methods struggle to handle these challenging tasks, Activation Beacon maintains comparable performance to the uncompressed baseline across various compression configurations, meanwhile achieving a 2x acceleration and an 8x KV cache reduction. Moreover, the LLM's original capabilities on short contexts are well preserved.

2 RELATED WORKS

Recently, processing long contexts has become a fundamental capability of modern LLMs (OpenAI, 2024; et al., 2024; Yang et al., 2024; DeepSeek-AI, 2024). The recipe for context window extension is roughly the same: modifying the rotary position embedding (Su et al., 2021) by extrapolation and interpolation (Chen et al., 2023a; ntk, 2023; Peng et al., 2023; Ding et al., 2024), and leveraging long-dependency data in both the pre-training and post-training stages. Despite the impressive progress in effectiveness, LLMs face significant challenges in efficiency. There is a significant computational cost due to the quadratic complexity of the transformer, and a huge memory cost because LLMs need to hold the KV activations of the entire sequence on GPU for faster decoding. Multiple threads of research endeavour to reduce these costs, which are discussed as follows.

Sparse Attention. Conventional sparse attention methods require re-training a model from scratch using the designated sparse patterns (Zaheer et al., 2020; Beltagy et al., 2020). However, extensive recent studies have identified that the attention patterns of LLMs are naturally sparse despite being densely trained (Jiang et al., 2024a; Xiao et al., 2023; Han et al., 2023; Zhu et al., 2024).
These studies also propose to dynamically set appropriate sparse patterns for each head so that the attention mass can be largely preserved, leading to competitive performance against full-attention methods with reduced computation. However, these methods require holding all KV activations on chip to dynamically determine the optimal sparse patterns, making them unsuitable for KV cache reduction. Some sparse attention methods directly evict the middle tokens (Han et al., 2023; Xiao et al., 2023). Despite their high efficiency and ability to generate endless fluent texts, these methods cannot memorize information in the middle contexts, leading to inferior performance on long-context tasks (Xiao et al., 2024).

KV Compression. This line of research focuses on compressing the KV activations to reduce the attention computation as well as the cache size. Since the KV activations are per-layer, per-head, per-token, and per-channel float numbers, they can be reduced along all five of these dimensions (including the numerical dimension). For example, CLA (Brandon et al., 2024) shares the KV cache across multiple layers; GQA (Ainslie et al., 2023) compresses multiple key/value heads into a single one; MLA (DeepSeek-AI, 2024) compresses the channels into fewer and more compact ones; and KIVI (Zirui Liu et al., 2023) quantizes the numerical values in the activations. Sequence-wise compression (also known as context compression), where Activation Beacon falls, is introduced in the following paragraph. It is orthogonal to compression along the other dimensions, and exploring the complementary effect of compression along different dimensions is left for future work. Besides, some recent studies design efficient strategies for offloading and transferring the KV cache (Liu et al., 2023; Xiao et al., 2024). They can also be jointly used with KV compression techniques to achieve more efficient long-context generation.

Context Compression.
These methods aim to compress the raw context into shorter yet more compact representations. Existing studies are usually tailored for compressing short contexts (less than 1K), which tends to be sub-optimal for long-context compression. Specifically, Gisting (Mu et al., 2023) compresses the user instruction into gist activations all at once. As a result, it cannot process contexts longer than the backbone LLM's window. CCM (Kim et al., 2024) extends Gisting to compress conversations in online chatting, yet it cannot be used in general long-context tasks such as long-document understanding. ICAE (Ge et al., 2024) and AutoCompressor (Chevalier et al., 2023) alleviate this problem by segmenting the long context into chunks and compressing each chunk, in order to compress contexts longer than the backbone LLM's window. CEPE (Yen et al., 2024) shares a similar workflow while introducing a standalone encoder to compress the context and utilizing the compression results through a cross-attention module. However, these methods compress the context into soft tokens, which are the major bottleneck to encapsulating the complex information in long contexts. Their compression workflows also lack fine-grained handling of the chunked inputs, resulting in inferior compression quality. Moreover, these methods must perform re-encoding or employ an additional cross-attention mechanism to utilize the compressed soft tokens, which introduces extra overhead. Lastly, since the number of soft tokens is pre-defined, it is hard to flexibly assign the compression ratio for downstream tasks. Another branch of methods (Jiang et al., 2023b; Li et al., 2024b) proposes to delete unimportant tokens to realize compression. However, they depend on the input question to accurately estimate the token importance, leading to low efficiency in real-world multi-turn scenarios. Compared with existing approaches, Activation Beacon achieves more effective, efficient, and flexible compression.
Based on context compression techniques, there are some innovative frameworks like LLoCO (Tan et al., 2024). It is built upon a compressor and a decoder, where the context is compressed offline and offloaded into a retrieval system. The decoder then efficiently responds to user inputs based on the retrieved compression results. Both modules are learned with in-domain fine-tuning. Our work aims at improving the compressor itself, and hence is orthogonal to this framework-level research.

3 METHODOLOGY

LLMs accomplish arbitrary tasks in the form of next-token prediction. Formally, given the context X = [x_1, ..., x_n], the LLM generates the next token based on all preceding tokens and its well-trained parameters: Pr(x_{n+1} | x_1, ..., x_n; Θ). Transformer-based LLMs incur heavy computation cost due to the quadratic complexity of self-attention; besides, they require tremendous GPU memory to store the KV cache of the preceding tokens for faster decoding (Zhang et al., 2023). Both the computation and memory costs expand significantly as the context length increases.

Activation Beacon employs a new special token, namely the beacon token ⟨b⟩, and condenses the raw context X into the beacon tokens' activations Ψ (i.e., their keys and values at every layer). Next-token prediction is converted to condition on the compressed context instead of the plain one. Given |Ψ| < |X|, both the computation cost and the KV cache size are reduced. Additionally, the LLM is enabled to handle contexts longer than its window size based on the compressed representations. We tailor the compression mechanism and the learning method of Activation Beacon towards effective, efficient, and flexible compression, which will be elaborated in the following.

3.1 COMPRESSION MECHANISM

Overview. We propose to progressively compress each fine-grained unit of the long context.
Specifically, given the input context X, whose length may exceed the LLM's context window N, it is first partitioned into chunks of the same size w (e.g., 1024):

$$[x_1, \dots, x_n] \xrightarrow{\text{Partition}} [X_1, \dots, X_{\lceil n/w \rceil}], \quad X_i = [x_{(i-1)w+1}, \dots, x_{iw}] = [x^i_1, \dots, x^i_w]. \tag{1}$$

Next, for each chunk X_i, we determine a compression ratio α_i (w is evenly divisible by α_i). The chunk is further split into fine-grained units of size α_i. Then a group of k_i = w/α_i beacon tokens, B_i = [⟨b⟩^i_1, ..., ⟨b⟩^i_{k_i}], is interleaved with these units. In other words, one beacon token is dispatched to the end of every unit:

$$X_i \xrightarrow{\text{Interleave } B_i} X'_i = [x^i_1, \dots, x^i_{\alpha_i}, \langle b\rangle^i_1, \dots, x^i_{w-\alpha_i+1}, \dots, x^i_w, \langle b\rangle^i_{k_i}]. \tag{2}$$

The LLM encodes these chunks one by one, compressing the contextual information of each chunk into the corresponding beacon tokens' activations during self-attention. After encoding X'_i, we discard the activations of all the raw tokens X_i, while we accumulate the activations of the beacon tokens B_i. When encoding the next chunk X'_{i+1}, the LLM directly conditions on the accumulated beacon activations as a proxy for the raw context. This progressive workflow benefits both compression quality and running efficiency. On one hand, it enables thorough distillation of the complex information within long contexts and allows for the compression of inputs that exceed the LLM's context window. On the other hand, by caching and reusing the beacon tokens' activations, it avoids redundant computation and allows for incremental updates of the compression results in multi-turn interactions.

Encoding and Compression. As shown in Figure 2, Activation Beacon reuses all modules of the LLM except for a slight modification to self-attention. Without loss of generality, for the i-th chunk X'_i, the encoding process can be written as:

$$\mathrm{LLM}(\underbrace{\langle b\rangle^1_1, \dots, \langle b\rangle^{i-1}_{k_{i-1}},}_{\text{beacon activations accumulated from } X}$$
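The partition and interleave steps of Eqs. (1) and (2) amount to a simple token-layout transformation, sketched below in Python (token strings stand in for token ids; the function names `partition` and `interleave` are ours, not the paper's):

```python
def partition(x, w):
    """Eq. (1): split the context into chunks of size w."""
    assert len(x) % w == 0, "assume the length is padded to a multiple of w"
    return [x[j:j + w] for j in range(0, len(x), w)]

def interleave(chunk, alpha, beacon="<b>"):
    """Eq. (2): dispatch one beacon token to the end of every unit of size
    alpha, yielding k = w / alpha beacons for a chunk of size w."""
    assert len(chunk) % alpha == 0, "w must be evenly divisible by alpha"
    out = []
    for j in range(0, len(chunk), alpha):
        out.extend(chunk[j:j + alpha])  # one fine-grained unit of the chunk
        out.append(beacon)
    return out

x = [f"x{j}" for j in range(1, 9)]   # n = 8 raw tokens
chunks = partition(x, w=4)           # two chunks of size w = 4
print(interleave(chunks[0], alpha=2))
# -> ['x1', 'x2', '<b>', 'x3', 'x4', '<b>']
```

With alpha = 2 (as in Figure 1), each chunk of 4 raw tokens yields 2 beacon positions, and only those positions' activations survive into the cache used by later chunks.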