CLOVER: Cross-Layer Orthogonal Vectors Pruning

Fanxu Meng 1 2, Pingzhi Tang 1, Fan Jiang 1, Muhan Zhang 1 2

Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bounded. To address this challenge, we introduce CLOVER (Cross-Layer Orthogonal Vectors) pruning, a novel approach that treats pairs of components of the attention mechanism as low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values, in turn, guide pruning and further serve as trainable parameters for efficient fine-tuning, ultimately enabling the model to recover its performance to the level before pruning. After pruning and fine-tuning, these values are reintegrated into the model without increasing its parameter count. Visualizations across various models show that CLOVER effectively removes linear redundancies within attention heads, greatly improving pruning efficiency. For example, pruning 70% of the Q-K head dimension in GPT-2-XL results in a perplexity comparable to that of pruning just 8% using vanilla pruning. The combination of CLOVER and TransMLA achieves a speedup of up to 11.1× over LLaMA-2-7B. Our code is available at: https://github.com/GraphPKU/CLOVER

1 Institute for Artificial Intelligence, Peking University. 2 State Key Laboratory of General Artificial Intelligence, BIGAI. Correspondence to: Muhan Zhang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1. (a) We treat the Query-Key and Value-Output layers within a single attention head as a unified structure. (b) Apply SVD to obtain two sets of singular vectors for initializing the Q-K and V-O layers, along with singular values that guide pruning or enable efficient full-rank fine-tuning. (c) This cross-layer orthogonalization strategy allows for higher pruning rates. (d) The pruned model maintains strong performance after fine-tuning.

1. Introduction

In recent years, Large Language Models (LLMs) have rapidly evolved into essential tools for productivity (OpenAI, 2024; Anthropic, 2024; Team et al., 2024a). Open-source models (AI@Meta, 2024; Mistral, 2024; Qwen, 2024; Liu et al., 2024b; Team et al., 2024b; Abdin et al., 2024) have also narrowed the performance gap with closed-source models. The success of LLMs is largely attributed to Next Token Prediction (Radford et al., 2018; Brown et al., 2020), where tokens are predicted sequentially, with attention computed between each token and all preceding ones. To avoid redundant computations, key-value features are cached. However, as model size grows, the overhead of caching becomes substantial, leading to memory and communication bottlenecks. For instance, in the case of a 65B-parameter model (Touvron et al., 2023) with 8-bit key-value quantization, storing 512K tokens requires over 86GB of GPU memory, which surpasses the capacity of a single 80GB GPU (Sun et al., 2024).
To enable efficient training and inference, we introduce CLOVER (Cross-Layer Orthogonal Vectors) pruning, a novel method that orthogonalizes the Query, Key, Value, and Output vectors without generating additional transformation matrices. As shown in Figure 1a, we treat the Q-K and V-O pairs in each attention head as a low-rank decomposition of $W_{QK}$ and $W_{VO}$. By crossing these layers and performing SVD on $W_{QK}$ and $W_{VO}$, the Query, Key, Value, and Output vectors become orthogonal within each attention head. Figure 1b illustrates how the resulting singular values can guide pruning or serve as trainable parameters for efficient fine-tuning. After pruning or fine-tuning, these values can be reintegrated into the model without increasing its parameter count. Notably, previous methods, such as SVFT (Lingam et al., 2024), obtain orthogonal vectors by directly performing orthogonal decomposition on each projection matrix, which can introduce a large number of additional parameters during fine-tuning. In contrast, CLOVER jointly decomposes Q-K and V-O pairs as transformation matrices for each other. CLOVER only generates a small set of singular values to guide pruning and fine-tuning, which can be merged back into the model without increasing inference costs.

By orthogonalizing the vectors, we eliminate linear redundancy. Attention heads contain numerous non-zero-norm vectors. Directly pruning these vectors would degrade performance, but orthogonalizing them allows us to represent the entire attention head's space using a small set of orthonormal basis vectors. The remaining vectors are nearly zero, making them safe to prune. As shown in Figure 1c, pruning 70% of the total budget in the query-key pair using CLOVER (where the pruning ratio can vary across different attention layers) yields a perplexity comparable to that of vanilla pruning, which removes only 8% of the vectors.

We summarize the contributions of our paper as follows:

- We treat the Q-K and V-O pairs in each attention head as low-rank approximations of $W_{QK}$ and $W_{VO}$. By performing SVD, we orthogonalize the attention head without adding extra transformation matrices.
- This orthogonalization not only reduces linear redundancy, but is also compatible with any other pruning method, thereby allowing for higher pruning ratios. Pruning 46.42% of the vectors in the attention heads of Whisper, OpenAI's speech-to-text model (Radford et al., 2023), preserves performance without the need for additional training.
- CLOVER enables parameter-efficient fine-tuning, surpassing SOTA methods such as LoRA, DoRA, HiRA, and PiSSA on eight commonsense reasoning tasks across LLaMA-2-7B and LLaMA-3-8B. Additional analyses further highlight its advantages.
2. Related Work

LLM Compression. To alleviate the memory pressure of KV caches in long-context models, researchers have explored several complementary directions: dynamic token pruning keeps only the past tokens whose attention scores materially influence future predictions, discarding the rest at inference time (Fu et al., 2024; Jo & Shin, 2024; Li et al., 2024b); rank compression of keys and values factorizes or groups the K/V tensors so that a lower-rank set of basis vectors plus coefficients can replicate the original attention, cutting the cache size almost linearly with the reduced rank (Shazeer, 2019; Ainslie et al., 2023; Liu et al., 2024a; Yu et al., 2024); head- or dimension-level pruning statistically identifies and removes low-impact attention heads or subdimensions, slimming each token's stored representation (Ashkboos et al., 2024; Xia et al., 2023; Sun et al., 2023); cross-layer KV sharing reuses one KV table across multiple transformer layers, turning layer-wise memory growth into a constant factor (Sun et al., 2024; Brandon et al., 2024; Liu et al., 2024c; Zuhri et al., 2024); and quantization encodes KV weights and activations in INT4/INT8 (or fewer bits), shrinking the cache size without altering the computational graph (Frantar et al., 2022; Dettmers et al., 2022; Xiao et al., 2023; Liu et al., 2024e; Hooper et al., 2024). Although these compression methods can reduce the model size and may ultimately achieve inference speedup, dropping active dimensions inevitably hurts accuracy (Ma et al., 2023); fortunately, parameter-efficient fine-tuning (PEFT) methods, e.g., LoRA or adapters inserted after pruning, restore most of the lost quality with just 0.1-1% extra trainable parameters (Guo et al., 2023).

Parameter-Efficient Fine-Tuning. Low-Rank Adaptation (LoRA) is widely used due to its simplicity and effectiveness, with recent works enhancing it further (Zhang et al., 2023; Zi et al., 2023; Liu et al., 2024d; Zhao et al., 2024; Jiang et al., 2024). PiSSA (Meng et al., 2024) improves convergence speed by initializing adapters with principal singular values and vectors, also reducing quantization error (Wang et al., 2024a;b; Li et al., 2024a). However, PiSSA is limited by its use of a fixed set of orthogonal bases. SVFT (Lingam et al., 2024) directly applies Singular Value Decomposition (SVD) to the original matrix, but this increases the number of parameters, raising computational overhead and reducing efficiency. The CLOVER method addresses these issues by treating the Query-Key pairs in each attention head as low-rank matrices. Using orthogonal decomposition, CLOVER eliminates the need for additional transformation matrices. Instead, it leverages a small set of singular values to linearly combine orthogonal vectors, making the approach more parameter-efficient. After fine-tuning, the adapter can be smoothly reintegrated into the original matrix structure.

3. CLOVER: Cross-Layer Orthogonal Vectors

Below is a step-by-step breakdown of the CLOVER method, illustrating how it performs orthogonalization of the Query, Key, Value, and Output layers in Multi-Head Attention, how orthogonal initialization helps improve pruning rates, and how the singular value matrices obtained from orthogonal decomposition can be used for parameter-efficient fine-tuning. We begin by using the computation of the Q-K pair as a representative example, which is then generalized to the V-O pair.
Multi-Head Self-Attention. In a multi-head self-attention mechanism with H heads, each head $h \in \{1, \ldots, H\}$ computes an attention score as:

$$\mathrm{attn}(Q_h, K_h) = \mathrm{softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d}}\right),$$

where d is the dimension of each head and $Q_h, K_h \in \mathbb{R}^{n \times d}$ are the query and key representations for head h. Specifically, the queries and keys for head h are obtained by multiplying the input matrix $X \in \mathbb{R}^{n \times D}$ (n is the sequence length, D is the hidden dimension) with the corresponding slices of the projection matrices $W_Q, W_K \in \mathbb{R}^{D \times H \times d}$, respectively:

$$Q_h = X\, W_Q^{[:,h,:]}, \qquad K_h = X\, W_K^{[:,h,:]}.$$

Cross-Layer Merging. Substituting the expressions for $Q_h$ and $K_h$ into the product $Q_h K_h^\top$, we have:

$$Q_h K_h^\top = X\, W_Q^{[:,h,:]} \big(W_K^{[:,h,:]}\big)^\top X^\top.$$

Notice that the original weights $W_Q^{[:,h,:]}$ and $W_K^{[:,h,:]}$ are each in $\mathbb{R}^{D \times d}$. When multiplied together, the resulting matrix $W_{QK}^h = W_Q^{[:,h,:]} \big(W_K^{[:,h,:]}\big)^\top$ has dimensions $D \times D$. Since $d \ll D$, directly using $W_{QK}^h$ in computations or storing it as trainable parameters would be highly inefficient, limiting the applicability of such parameter merging.

Cross-Layer Orthogonal Decomposition. To mitigate the large size of $W_{QK}^h$, we factorize it via SVD:

$$W_{QK}^h = U_{QK}^h\, S_{QK}^h\, \big(V_{QK}^h\big)^\top,$$

where $U_{QK}^h$ and $V_{QK}^h$ are $D \times D$ orthogonal matrices and $S_{QK}^h$ is a $D \times D$ diagonal matrix of singular values. Since $W_Q^{[:,h,:]}$ and $W_K^{[:,h,:]}$ each have dimensions $\mathbb{R}^{D \times d}$, the rank of $W_{QK}^h$ is at most d. Thus, the number of nonzero singular values in $S_{QK}^h$ is at most d. We can truncate the SVD to retain only the top-r singular values without any loss of information:

$$W_{QK}^h = U_{QK}^h[:, :r]\; S_{QK}^h[:r, :r]\; V_{QK}^h[:, :r]^\top.$$

The process can be easily applied to $W_V$ and $W_O$, as detailed in Appendix D.4.

CLOVER for Pruning. After performing SVD, we can rewrite the weight matrix $W_{QK}^h$ as follows:

$$W_{QK}^h = \underbrace{U_{QK}^h[:, :r]\; S_{QK}^h[:r, :r]}_{\widetilde{W}_Q^h}\; \underbrace{V_{QK}^h[:, :r]^\top}_{(\widetilde{W}_K^h)^\top}.$$

Instead of storing the full matrices $W_Q^h$ and $W_K^h \in \mathbb{R}^{D \times d}$, we store the smaller factors $\widetilde{W}_Q^h$ and $\widetilde{W}_K^h \in \mathbb{R}^{D \times r}$, which are significantly smaller than the original matrix since $r \leq d \ll D$. This leads to a reduction in both memory usage and computational cost. Additionally, we can further prune small nonzero singular values (and their corresponding singular vectors) that fall below a chosen threshold, further reducing the parameter count and computational overhead.

CLOVER for Fine-Tuning. CLOVER can be used not only for pruning, but also for parameter-efficient fine-tuning. We freeze the matrices $U_{QK}^h[:, :r]$ and $V_{QK}^h[:, :r]$, and only fine-tune the singular values $S_{QK}^h[:r, :r]$. In contrast to SVFT, which factorizes the original weight matrices $W_Q, W_K, W_V, W_O \in \mathbb{R}^{D \times H \times d}$ individually, CLOVER factorizes the merged weights $W_{QK}^h$ and $W_{VO}^h$ within each attention head. As a result, the tunable matrix $S_{QK}$ has a size bounded by $H \times d \times d$ (considering all heads). In comparison, SVFT requires factorizing large matrices each into three components ($U, S, V \in \mathbb{R}^{D \times D}$), leading to a significant increase in parameter count and computational overhead, even with sparse updates for the singular values S. For example, consider the LLaMA-2-7B model with H = 32 attention heads and a head dimension of d = 128. By factorizing each head separately, the largest size for $S_{QK}$ is $O(32 \times 128 \times 128)$, which is significantly smaller than factorizing a $4096 \times 4096$ matrix. This makes CLOVER's parameter efficiency comparable to that of a LoRA configuration with rank 32, as shown in Appendix B, but with additional potential for pruning.
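To make the procedure concrete, below is a minimal PyTorch sketch of the Q-K orthogonalization for a single head. The dense (D, d) weight slices, the toy dimensions, and the relative pruning threshold are illustrative assumptions; the factorization itself (folding S into the query side while the key side stays orthonormal) follows the equations above.

```python
import torch

def clover_orthogonalize_head(W_Q, W_K, rel_threshold=1e-6):
    """Cross-layer orthogonalization of one attention head.

    W_Q, W_K: (D, d) slices of the query/key projections for head h.
    Returns factors of shape (D, r) whose product reproduces W_Q @ W_K.T.
    """
    d = W_Q.shape[1]
    # Merge the pair: W_QK^h = W_Q^h (W_K^h)^T, a (D, D) matrix of rank <= d.
    W_QK = W_Q @ W_K.T
    # Thin SVD; only the first d singular values can be nonzero.
    U, S, Vh = torch.linalg.svd(W_QK, full_matrices=False)
    U, S, Vh = U[:, :d], S[:d], Vh[:d, :]
    # Keep the top-r directions; near-zero singular values are linear
    # redundancy and can be pruned without changing the product.
    r = int((S > rel_threshold * S[0]).sum())
    W_Q_new = U[:, :r] * S[:r]   # \tilde{W}_Q^h = U[:, :r] S[:r, :r]
    W_K_new = Vh[:r, :].T        # \tilde{W}_K^h = V[:, :r]
    return W_Q_new, W_K_new

# Toy check with D = 512, d = 64 (shapes chosen for illustration only).
D, d = 512, 64
W_Q, W_K = torch.randn(D, d) / d**0.5, torch.randn(D, d) / d**0.5
W_Q_new, W_K_new = clover_orthogonalize_head(W_Q, W_K)
assert W_Q_new.shape[1] <= d
assert torch.allclose(W_Q_new @ W_K_new.T, W_Q @ W_K.T, atol=1e-3)
```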
Table 1. Comparison of CLOVER with SliceGPT and TransMLA on pruning DeepSeek-V2-Lite and LLaMA-2-7B separately, and evaluation of their fine-tuned performance across six benchmarks.

| Model | Hidden Size | Head Dim | Avg. | MMLU | ARC | PIQA | HS | OBQA | WG |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Lite | - | - | 61.54 | 43.29 | 60.39 | 79.92 | 74.51 | 45.40 | 65.75 |
| SliceGPT | -6.25% | - | 57.30 | 38.40 | 55.95 | 77.20 | 68.67 | 41.20 | 62.35 |
| SliceGPT | -12.50% | - | 53.51 | 35.24 | 51.97 | 74.27 | 62.08 | 37.80 | 59.67 |
| CLOVER | - | -25% | 59.84 | 41.16 | 57.56 | 79.27 | 72.61 | 44.60 | 63.85 |
| CLOVER | - | -50% | 57.25 | 38.96 | 55.27 | 78.02 | 69.63 | 41.40 | 60.22 |

| Model | KV Cache | Head Dim | Avg. | MMLU | ARC | PIQA | HS | OBQA | WG |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-2-7B | - | - | 59.85 | 41.43 | 59.24 | 78.40 | 73.29 | 41.80 | 64.96 |
| TransMLA | -68.75% | - | 59.82 | 40.87 | 59.18 | 77.91 | 71.82 | 45.20 | 63.93 |
| TransMLA | -87.50% | - | 59.36 | 40.77 | 58.84 | 78.18 | 71.28 | 43.60 | 63.46 |
| TransMLA | -92.97% | - | 58.68 | 40.82 | 59.72 | 76.55 | 69.97 | 43.60 | 61.40 |
| CLOVER | -68.75% | -50% | 59.40 | 40.91 | 58.97 | 78.35 | 71.32 | 43.40 | 63.46 |
| CLOVER | -87.50% | -50% | 59.28 | 40.46 | 59.12 | 77.48 | 70.62 | 44.60 | 63.38 |
| CLOVER | -92.97% | -50% | 59.13 | 40.69 | 60.03 | 77.09 | 69.65 | 45.20 | 62.12 |

Table 2. Comparison of latency metrics for different methods.

| Model | Prefilling (ms) | Generation (ms/token) |
|---|---|---|
| DeepSeek | 195.12 | 40.11 |
| SliceGPT | 191.91 | 40.32 |
| CLOVER | 177.02 | 31.00 |

4. Experiments

In Section 4.1, we compare CLOVER with SliceGPT (Ashkboos et al., 2024) and TransMLA (Meng et al., 2025), which respectively prune DeepSeek-V2-Lite (DeepSeek-AI, 2024) and LLaMA-2-7B (AI@Meta, 2023). In Section 4.2, we visualize how CLOVER removes linear redundancy between vectors, facilitating more efficient pruning. In Section 4.3, we evaluate the acceleration performance of CLOVER. In Section 4.4, we demonstrate CLOVER's ability to perform significant pruning without additional training. In Section 4.5, we apply CLOVER to orthogonalize the attention heads of the GPT-2-XL model (Radford et al., 2019) to explore the role of CLOVER in both pruning and fine-tuning. In Section 4.6, we conduct fine-tuning experiments on eight commonsense tasks, comparing CLOVER with SOTA PEFT methods.

4.1. Comparing CLOVER with Other Methods

Currently, pruning efforts for DeepSeek models are limited. The few existing approaches mainly focus on reducing the number of experts in the MoE module (Gu et al., 2025). However, by orthogonalizing the attention heads in DeepSeek, we observe that significant redundancy also exists within MLA (Figure 2a). Removing this redundancy can substantially reduce the computational overhead during training, pre-filling, and the computation of query representations in the absorb phase. To compare the effectiveness of CLOVER with other pruning methods, we adapted the SliceGPT (Ashkboos et al., 2024) codebase to support the DeepSeek model architecture. We then applied CLOVER to orthogonally initialize the attention heads and pruned the attention head dimensions based on the magnitude of singular values.

Additionally, we compared pruning for LLaMA-2-7B with TransMLA (Meng et al., 2025), which converts models using Multi-Head Attention (MHA) or Grouped Query Attention (GQA) into MLA-based models, effectively compressing the KV cache. TransMLA can be further combined with CLOVER to prune the dimensionality of attention heads more efficiently. We pruned the K-NoPE and V head dimensions in the LLaMA-2-7B model released in their paper to evaluate CLOVER's effectiveness. For fine-tuning, we followed the TransMLA procedure on a mixed pretraining dataset, as shown in Table 6.
All models were evaluated on six benchmarks: MMLU (Hendrycks et al., 2021), ARC (Easy and Challenge) (Clark et al., 2018a), PIQA (Bisk et al., 2020a), HellaSwag (HS) (Zellers et al., 2019a), OpenBookQA (OBQA) (Mihaylov et al., 2018), and WinoGrande (WG) (Sakaguchi et al., 2021a). These evaluations serve to validate the effectiveness of different pruning strategies. As shown in Table 1, CLOVER achieves performance comparable to SliceGPT while pruning 50% of the head dimension, compared to SliceGPT's 6.25% pruning. However, as demonstrated in Table 2, CLOVER delivers a 1.25× speedup, whereas SliceGPT provides no acceleration. Furthermore, building on TransMLA, an additional 50% pruning of the head dimension still allows the model to recover its performance with only a small amount of retraining.

Figure 2. CLOVER (orange) uses fewer orthogonal basis vectors than Vanilla Pruning (blue) to span the attention head space: (a) DeepSeek-V2-Lite, (b) Llama-3.2-Vision, (c) Whisper-Large-v3, (d) Stable Diffusion XL, (e) CLIP-ViT-bigG. The first row shows the importance of Q-K dimensions, and the second row shows V-O dimensions. After the red dot, CLOVER's importance is lower, and pruning these vectors results in less performance loss.

4.2. CLOVER Removes Redundant Vectors

CLOVER achieves a higher pruning ratio due to the significant linear redundancy present in the model. By representing the entire attention head with only a small number of orthogonal vectors, CLOVER effectively removes this redundancy. To illustrate the advantages of CLOVER in eliminating linear redundancy, we apply it to a variety of models, including the large language model DeepSeek-V2-Lite (DeepSeek-AI, 2024), the multimodal automatic speech recognition and speech translation model Whisper-Large-v3 (Radford et al., 2023), the multimodal instruction-tuned image-reasoning generative model LLaMA-3.2-11B-Vision (AI@Meta, 2024), the image encoder CLIP-ViT-bigG (Cherti et al., 2022), and the image generation model Stable Diffusion XL (Podell et al., 2023).

We compute the L2 norm of each dimension (which, after CLOVER, equals the corresponding singular value) in both the Q-K pair and the V-O pair, sorting the values in descending order within each attention head for better visualization. For comparison, we also perform Vanilla Pruning, which does not utilize CLOVER initialization but instead sorts directly based on the L2 norm. Figure 2 showcases the first attention head from the first layer of each model. In the first row of the figure, depicting the Q-K norm, we observe that in the original model, the importance of each dimension is relatively balanced (e.g., Figure 2c). This balanced distribution is a result of the linear redundancy: different directions are intertwined, making it challenging to prune individual directions without negatively affecting the model's performance. However, after applying CLOVER's orthogonal decomposition, only a small number of orthogonal bases on the left side exhibit significantly large norms. These vectors span almost the entire attention head's space, and the remaining vectors have norms that approach zero, indicating that they are already represented by the dominant singular vectors and can be pruned without loss of performance.
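The importance scores plotted in Figure 2 can be computed as in the sketch below. The paper states only that Vanilla Pruning sorts by L2 norm; taking the product of the paired query and key column norms as the vanilla score is our assumption for illustration.

```python
import torch

def qk_dimension_importance(W_Q, W_K):
    """Per-dimension importance for the Q-K pair of one head.

    Vanilla: L2 norm of each paired query/key column (no orthogonalization),
    so intertwined directions keep the scores relatively balanced.
    CLOVER:  singular values of the merged W_QK = W_Q W_K^T, i.e. the norms
    of the orthogonalized dimensions, which concentrate in a few bases.
    Both are returned in descending order, as plotted in Figure 2.
    """
    vanilla = W_Q.norm(dim=0) * W_K.norm(dim=0)       # one score per column
    vanilla = vanilla.sort(descending=True).values
    clover = torch.linalg.svdvals(W_Q @ W_K.T)[: W_Q.shape[1]]
    return vanilla, clover
```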
Beyond the red intersection point, CLOVER's remaining vectors exhibit consistently lower importance than those in Vanilla Pruning, meaning that pruning these vectors results in less performance degradation. This demonstrates why CLOVER enables a higher pruning ratio. A similar trend is observed for the V-O pair, although the model's inherent sparsity is less pronounced there than in the Q-K pair, making the effect less noticeable. Still, in most models, pruning half of the vectors has a smaller impact on performance compared to Vanilla Pruning. Notably, in CLIP-ViT-bigG (Figure 2e), a proportion of the vectors already have a norm of zero, allowing for safe pruning.

Figure 3. Inference speedups with CLOVER compared to the original LLaMA-2-7B model on three platforms: (a) 165.2 TFLOPS, 24GB (up to 11.1×); (b) 312 TFLOPS, 40GB (up to 6.7×); (c) 320 TFLOPS, 64GB (up to 4.8×). Low-rank Q (= 512) and Full-rank Q indicate whether the query projections were also compressed. Context length represents the total sequence length.

4.3. Inference Speedup with CLOVER

In Figure 3, we benchmark the inference performance of CLOVER, featuring a 92.97% reduction in the KV cache and a 50% reduction in the Q-NoPE, K-NoPE, and V head dimensions, using the vLLM framework across three GPUs with varying compute capabilities and memory sizes: 165.2 TFLOPS with 24GB memory, 312 TFLOPS with 40GB memory, and 320 TFLOPS with 64GB memory. The figure illustrates the inference speedup of the pruned model relative to the original LLaMA-2-7B. Low-rank Q and Full-rank Q indicate whether the query projections were also compressed. The context length refers to the total sequence length, which includes both the prompt and generated tokens (with equal lengths for each). Our experiments demonstrate that CLOVER's inference speedup increases with longer context lengths. As long sequences typically lead to both compute and memory bottlenecks, compressing the KV cache and attention head dimensions helps alleviate these issues, thereby enabling higher speedups.
Notably, for an 8K context window on the first hardware platform, the CLOVER-pruned model achieves an impressive 11.1× inference acceleration.

4.4. CLOVER for Training-Free Pruning

As demonstrated by the prominent low-rank properties in Figure 2c, we applied pruning to the Whisper-large-v3 model (Radford et al., 2023). We use the official Whisper-large-v3 example (the LibriSpeech-Long dataset (Gandhi et al., 2023); https://huggingface.co/openai/whisper-large-v3) to intuitively highlight the effectiveness of CLOVER pruning. For reference, the waveform of this input is shown in Figure 4, and the corresponding target transcript is provided in Appendix C. After applying CLOVER to orthogonalize the vectors, we pruned vectors with magnitudes close to zero (singular values of $W_Q W_K^\top$ below $5 \times 10^{-3}$ and of $W_V W_O$ below $6 \times 10^{-3}$). This pruning achieved ratios of 56.01% and 36.82% for the parameters in the Q-K pair and V-O pair, respectively. Remarkably, the model's output remains nearly unchanged, with only one error, which has been highlighted in the text using strikethrough for clarity:

Figure 4. An audio waveform from the LibriSpeech dataset.

Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Up Guards and Adam paintings, and Mason's exquisite idles are as national as a jingo poem. Mr. Birkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. ~~And,~~ and Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, next man.

In contrast, vanilla pruning, which forgoes orthogonal initialization and prunes head vectors solely based on their norm, results in the model completely failing to generate valid outputs at the same pruning ratio:

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

This example validates our earlier claim that straightforward pruning of non-zero dimensions can lead to accumulated loss. In contrast, CLOVER effectively eliminates linear redundancy, enabling a significantly higher pruning ratio. When the linear redundancy is sufficiently pronounced, CLOVER can even achieve a high pruning ratio without the need for fine-tuning to recover performance.

Table 3. Pruning the attention layers of GPT-2-XL using CLOVER and vanilla pruning at various sparsity levels. We report perplexity on WikiText-2 (lower is better) and evaluate fine-tuning performance on OpenWebText under different token budgets. The base model's perplexity is 14.78. CLOVER-FT and Vanilla fine-tune the pruned attention layers, while CLOVER-PEFT fine-tunes the singular value matrices obtained from the decomposition of the Q-K and V-O projections.
| Pruning Ratio | Vanilla (w/o FT) | CLOVER (w/o FT) | Vanilla (66M) | CLOVER-FT (66M) | CLOVER-PEFT (66M) | Vanilla (131M) | CLOVER-FT (131M) | CLOVER-PEFT (131M) |
|---|---|---|---|---|---|---|---|---|
| 12.5% | 33.76 | 15.89 | 16.04 | 15.45 | 15.67 | 16.38 | 15.77 | 15.42 |
| 25.0% | 78.36 | 17.45 | 16.93 | 15.70 | 15.89 | 17.07 | 16.05 | 15.75 |
| 37.5% | 159.4 | 20.95 | 18.17 | 16.17 | 16.60 | 18.14 | 16.48 | 16.41 |
| 50.0% | 338.9 | 35.12 | 20.45 | 17.22 | 17.63 | 19.02 | 17.13 | 17.71 |
| 62.5% | 538.5 | 85.25 | 24.65 | 19.32 | 20.64 | 21.44 | 18.40 | 20.39 |
| 75.0% | 708.8 | 187.4 | 36.04 | 24.65 | 29.28 | 27.22 | 20.99 | 28.44 |

4.5. Pruning and Fine-Tuning with CLOVER

Model pruning often necessitates fine-tuning to recover from performance degradation. CLOVER supports both pruning and fine-tuning within a unified framework. In this section, we evaluate CLOVER's effectiveness in both aspects.

We initialize GPT-2-XL with CLOVER and prune the model by removing the vectors corresponding to the singular values with the smallest magnitudes. To achieve better performance, the setup of Figure 1c allows each layer to adopt a different pruning rate: a fixed proportion of parameters is pruned across the entire model based on the global ranking of singular values (for CLOVER) or L2 norms (for Vanilla pruning). Remarkably, pruning 70% of the parameters using CLOVER yields performance comparable to pruning only 8% with Vanilla pruning, highlighting the effectiveness of CLOVER's orthogonal initialization in facilitating structured pruning. However, a uniform pruning rate across all layers, where the same percentage of the smallest singular vectors is pruned per layer, is beneficial for consistent training and inference speed. Therefore, unless otherwise noted, we apply uniform pruning across layers; the two budget policies are sketched below.
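The sketch below illustrates the two policies, assuming each layer's per-dimension importance is available as a 1-D tensor (singular values for CLOVER, L2 norms for Vanilla) and that the pruning ratio is below 1:

```python
import torch

def global_keep_masks(values_per_layer, prune_ratio):
    """Global-budget pruning (Figure 1c): rank importance values across all
    layers and drop the globally smallest fraction, so each layer ends up
    with its own pruning rate."""
    all_vals = torch.cat(values_per_layer)
    k = int(prune_ratio * all_vals.numel())     # assumes 0 <= ratio < 1
    threshold = all_vals.sort().values[k]
    return [v >= threshold for v in values_per_layer]

def uniform_keep_masks(values_per_layer, prune_ratio):
    """Uniform pruning: drop the same fraction of smallest values in every
    layer, which keeps per-layer shapes identical across the model."""
    masks = []
    for v in values_per_layer:
        k = int(prune_ratio * v.numel())
        keep = torch.ones(v.numel(), dtype=torch.bool)
        keep[v.sort().indices[:k]] = False      # mask the k smallest
        masks.append(keep)
    return masks
```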
After fine-tuning, the singular values S are merged into their corresponding U and V matrices. For comparison, we also apply Vanilla pruning, which lacks CLOVER's orthogonalization and instead uses an L2-norm-based criterion. Following pruning, we evaluate model perplexity on the WikiText-2 dataset (Merity et al., 2016). We then fine-tune the pruned models on the OpenWebText dataset (Gokaslan & Cohen, 2019), using the nanoGPT framework (https://github.com/karpathy/nanoGPT). To minimize disruption to the pretrained model, only the pruned attention layers are fine-tuned, while the MLP, embedding layers, and LM head remain fixed. These setups are referred to as CLOVER-FT and Vanilla, respectively. In the CLOVER-PEFT configuration, the singular values S are not immediately merged into the U and V matrices. Instead, they are retained for parameter-efficient fine-tuning, where only these singular values are updated, and merging is deferred until post-training. PEFT typically converges more slowly than full-parameter fine-tuning. To accelerate convergence, we increase the learning rate from $6 \times 10^{-4}$ to $6 \times 10^{-3}$ and remove weight decay, while keeping all other hyperparameters consistent with those used in Vanilla and CLOVER-FT.

As shown in Table 3, CLOVER induces significantly less performance degradation than Vanilla pruning by concentrating functionality into fewer orthogonal bases. For instance, pruning 50% of the parameters without fine-tuning increases CLOVER's perplexity by only 1.38×, compared to 21.9× for Vanilla. After fine-tuning, CLOVER-FT substantially outperforms Vanilla; for example, CLOVER with a 75% pruning rate achieves performance comparable to Vanilla pruning at only 62.5%.

Owing to its reduced model disruption, CLOVER-FT also requires fewer training tokens to restore performance (e.g., perplexity with 66M tokens closely matches that with 131M), whereas Vanilla pruning demands more data, increasing both computational cost and the risk of degradation on out-of-domain tasks. Moreover, CLOVER-PEFT, which fine-tunes only the singular values from the SVD decomposition and the attention-layer biases, enables performance recovery with minimal resource consumption and parameter updates. At lower pruning rates, CLOVER-PEFT even surpasses full attention-layer training (CLOVER-FT). However, at higher pruning rates, performance declines significantly due to the limited number of remaining tunable parameters (e.g., only 0.15% of the original attention-layer parameters are updated). These results empirically validate the benefits discussed earlier: CLOVER's orthogonal initialization of attention heads enables the representation of the entire attention space using a compact set of orthogonal bases, which is highly advantageous for pruning. Furthermore, the singular value matrix can be seamlessly merged back into the attention head.

Table 4. Accuracy comparison of LLaMA-2-7B and LLaMA-3-8B with various PEFT methods on eight commonsense reasoning datasets. Results of LoRA and DoRA are taken from (Liu et al., 2024d). Results of HiRA are taken from (Huang et al., 2025).

| Model | Method | Params | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-2-7B | LoRA | 0.83% | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
| | DoRA | 0.84% | 71.8 | 83.7 | 76.0 | 89.1 | 82.6 | 83.7 | 68.2 | 82.4 | 79.7 |
| | HiRA | 0.83% | 71.2 | 83.4 | 79.5 | 88.1 | 84.0 | 86.7 | 73.8 | 84.6 | 81.4 |
| | PiSSA | 0.83% | 75.0 | 87.0 | 81.6 | 95.0 | 86.5 | 88.5 | 75.9 | 86.4 | 84.5 |
| | CLOVER | 0.83% | 75.0 | 86.4 | 82.0 | 95.1 | 87.5 | 89.6 | 76.6 | 89.4 | 85.2 |
| LLaMA-3-8B | LoRA | 0.70% | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
| | DoRA | 0.71% | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |
| | HiRA | 0.70% | 75.4 | 89.7 | 81.2 | 95.4 | 87.7 | 93.3 | 82.9 | 88.3 | 86.7 |
| | PiSSA | 0.70% | 77.2 | 90.0 | 82.9 | 96.6 | 88.4 | 93.6 | 82.4 | 87.4 | 87.3 |
| | CLOVER | 0.47% | 76.4 | 89.3 | 82.1 | 96.9 | 89.9 | 93.6 | 84.5 | 90.6 | 87.9 |

Table 5. Comparison of training costs between LoRA and CLOVER on LLaMA-2-7B. We trained on a commonsense dataset for 3 epochs with model_max_length = 1024, per_device_train_batch_size = 2, and gradient_accumulation_steps = 2, executed on 4 GPUs with 312 TFLOPS and 80GB memory each.

| Method | Params | Max Memory | Runtime |
|---|---|---|---|
| LoRA | 0.83% | 110.84 GB | 2:42:37 |
| CLOVER | 0.83% | 104.75 GB | 2:22:47 |

4.6. Comparison with PEFT Methods

In this section, we conduct an ablation study to compare the fine-tuning capability of CLOVER against several parameter-efficient fine-tuning (PEFT) methods, including LoRA (Hu et al., 2021), DoRA (Liu et al., 2024d), HiRA (Huang et al., 2025), and PiSSA (Meng et al., 2024). We exclude SVFT (Lingam et al., 2024) from this comparison due to its significant computational overhead. The evaluation spans eight sub-tasks, as detailed in Table 7. All models are fine-tuned on the Commonsense-148k dataset and evaluated on the respective test sets of each sub-task.

For CLOVER, we apply orthogonal decomposition to the Value-Output projection and fine-tune the resulting singular value matrix. Due to the non-linear RoPE (Su et al., 2024) operation between the query and key, we instead decompose the Key layer and fine-tune its transition matrix. Likewise, in the mlp.up_proj layer, we treat every 64 consecutive dimensions as a head, apply orthogonal decomposition, and update the corresponding transition matrix.
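The following is a minimal sketch of the CLOVER-PEFT idea for one decomposed pair, assuming the frozen factors U, V and the initial singular values s were obtained as in Section 3; the class and method names are ours, not from the released code. Note that S is trained as a full r × r matrix rather than just its diagonal, which is what enables the full-rank updates analyzed in Appendix D.

```python
import torch
import torch.nn as nn

class CloverPEFTPair(nn.Module):
    """Fine-tune only the r x r transition matrix S of one decomposed pair.

    U, V are the frozen orthogonal factors from the cross-layer SVD; S is
    initialized to diag(singular values) and trained as a full matrix, so
    the update learns linear combinations of all orthogonal bases while
    adding only r*r trainable parameters per head.
    """

    def __init__(self, U, s, V):
        super().__init__()
        self.register_buffer("U", U)           # (D, r), frozen
        self.register_buffer("V", V)           # (D, r), frozen
        self.S = nn.Parameter(torch.diag(s))   # (r, r), trainable

    def effective_weight(self):
        # W^h = U S V^T; gradients flow only into S during fine-tuning.
        return self.U @ self.S @ self.V.T

    @torch.no_grad()
    def merge(self):
        # After training, fold S back into the factors (e.g. W_Q <- U S,
        # W_K <- V), so inference keeps the original parameter count.
        return self.U @ self.S, self.V
```

With r = 128 and 32 heads, S contributes 32 × 128 × 128 trainable parameters per layer for the Q-K pair, matching the count derived in Appendix B.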
The number of trainable parameters in LLaMA-2-7B matches that used in LoRA, DoRA, HiRA, and PiSSA, all employing rank-32 updates. For LLaMA-3-8B, we reduce the number of trainable parameters to two-thirds of the amount used in the other methods. The comparison of memory consumption and runtime in Table 5 demonstrates that CLOVER consumes less GPU memory and exhibits shorter training runtime compared to LoRA. We attribute this to CLOVER being applied between two layers, whereas LoRA operates in parallel with the main branch. This enables sequential computation, eliminating the need to retain the input features of the main branch.

LoRA and DoRA results are taken from the DoRA paper, while HiRA results are sourced directly from its original publication. Since PiSSA has not conducted experiments on commonsense reasoning datasets, we include its performance by reproducing the experiments ourselves. For a fair comparison, we adopt the hyperparameters from DoRA and adjust the learning rates accordingly. As shown in Table 8, CLOVER achieves the best performance with a learning rate of 1e-4, which we apply consistently across both LLaMA-2-7B and LLaMA-3-8B. PiSSA performs best with a learning rate of 2e-5, as reported in its original paper; all other hyperparameters remain unchanged. Due to the stable training behavior observed in both PiSSA and CLOVER, we omit the validation procedure used in DoRA, where the best-performing model is selected every 80 iterations based on the validation set. Instead, we train for the full 3 epochs and use the final model checkpoint for testing.

Table 4 demonstrates that CLOVER consistently outperforms all other methods across all models and tasks. Specifically, on LLaMA-2-7B, CLOVER surpasses LoRA, DoRA, HiRA, and PiSSA by 7.6%, 5.5%, 3.8%, and 0.7%, respectively. Even on LLaMA-3-8B, with fewer trainable parameters, CLOVER outperforms them by 7.1%, 2.7%, 1.2%, and 0.6%. CLOVER leads in most sub-tasks and ranks second in a few. These experiments demonstrate that CLOVER possesses strong fine-tuning capabilities, making it effective for recovering the performance degradation caused by pruning. Additional analysis is provided in Appendix D.

5. Conclusion and Limitations

In this paper, we introduce Cross-Layer Orthogonal Vectors (CLOVER), a method that orthogonalizes vectors within attention heads without requiring additional transformation matrices. This orthogonalization process condenses effective parameters into fewer vectors, improving the pruning ratio. By fine-tuning the singular values obtained through orthogonalization, CLOVER learns linear combinations of orthogonal bases, enabling full-rank updates. When applied to prune 50% of the attention head parameters in GPT-2-XL, CLOVER results in a perplexity that is just one-tenth of that achieved by standard pruning methods. For Whisper-Large-v3, CLOVER removes 46.42% of the parameters without fine-tuning, while preserving model performance. Furthermore, when used for fine-tuning, CLOVER outperforms state-of-the-art methods such as LoRA, DoRA, HiRA, and PiSSA, achieving superior results with equal or fewer trainable parameters. We also demonstrate how CLOVER removes linear redundancy to facilitate pruning and discuss the necessity of fine-tuning across all orthogonal bases. Visual comparisons of models fine-tuned with different methods further illustrate its effectiveness.

Despite its advantages, CLOVER has some limitations.
When nonlinear operations are present between Q-K or V-O pairs (such as with the widely-used RoPE (Su et al., 2024)), cross-layer orthogonalization is not feasible. In these cases, we instead perform head-wise orthogonalization within the Key layer during fine-tuning. Fortunately, CLOVER fine-tuning can apply intra-layer attention-head orthogonalization, while CLOVER pruning remains applicable to many popular models, including DeepSeek (DeepSeek-AI, 2024; Liu et al., 2024b) (which uses decoupled RoPE), ViT and SDXL (which use absolute positional encoding), and BLOOM (Workshop et al., 2022) (which employs ALiBi relative positional encoding (Press et al., 2021)). Additionally, as a newly proposed method, our current evaluation focuses primarily on basic pruning tasks and does not include comparisons with other state-of-the-art pruning techniques. However, because CLOVER does not alter the model structure and only updates the initialization method, it can be combined with existing pruning methods to further enhance their effectiveness. As a novel technique, CLOVER holds considerable promise for future applications. For instance, it could be combined with quantization methods to eliminate outliers, guide pruning and fine-tuning based on data feature directions, or even inspire new model architectures.

Acknowledgement

This work is supported by the National Key R&D Program of China (2022ZD0160300), National Natural Science Foundation of China (62276003), and Kunpeng&Ascend Center of Excellence, Peking University.

Impact Statement

This paper proposes a cross-layer orthogonal initialization method to guide model pruning and efficient fine-tuning, offering valuable insights for the application and development of large models. Both application directions aim to reduce training and inference costs, lower computational overhead, decrease power consumption, and minimize carbon emissions.

References

Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

AI@Meta. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023. URL https://doi.org/10.48550/arXiv.2307.09288.

AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

Anthropic. Claude 3.5 Sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.

Ashkboos, S., Croci, M. L., Nascimento, M. G. d., Hoefler, T., and Hensman, J. SliceGPT: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024.

Ben Allal, L., Lozhkov, A., Penedo, G., Wolf, T., and von Werra, L. SmolLM-corpus, 2024. URL https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: Reasoning about physical commonsense in natural language. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pp. 7432-7439. AAAI Press, 2020a. doi: 10.1609/AAAI.V34I05.6239. URL https://doi.org/10.1609/aaai.v34i05.6239.
Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432-7439, 2020b.

Brandon, W., Mishra, M., Nrusimha, A., Panda, R., and Kelly, J. R. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981, 2024.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., and Jitsev, J. Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143, 2022.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018a. URL http://arxiv.org/abs/1803.05457.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018b.

DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024. URL https://doi.org/10.48550/arXiv.2405.04434.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318-30332, 2022.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Fu, Q., Cho, M., Merth, T., Mehta, S., Rastegari, M., and Najibi, M. LazyLLM: Dynamic token pruning for efficient long context LLM inference. arXiv preprint arXiv:2407.14057, 2024.

Gandhi, S., von Platen, P., and Rush, A. M. Distil-Whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430, 2023.

Gokaslan, A. and Cohen, V. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

Gu, H., Li, W., Li, L., Zhu, Q., Lee, M., Sun, S., Xue, W., and Guo, Y. Delta decompression for MoE-based LLMs compression. arXiv preprint arXiv:2502.17298, 2025.

Guo, S., Xu, J., Zhang, L. L., and Yang, M. Compresso: Structured pruning with collaborative prompting learns compact large language models. arXiv preprint arXiv:2310.05015, 2023.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv preprint arXiv:2401.18079, 2024.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Huang, Q., Ko, T., Zhuang, Z., Tang, L., and Zhang, Y. HiRA: Parameter-efficient Hadamard high-rank adaptation for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TwJrTz9cRS.

Jiang, T., Huang, S., Luo, S., Zhang, Z., Huang, H., Wei, F., Deng, W., Sun, F., Zhang, Q., Wang, D., et al. MoRA: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130, 2024.

Jo, H.-r. and Shin, D. A2SF: Accumulative attention scoring with forgetting factor for token pruning in transformer decoder. arXiv preprint arXiv:2407.20485, 2024.

Li, M., Lin, Y., Zhang, Z., Cai, T., Li, X., Guo, J., Xie, E., Meng, C., Zhu, J.-Y., and Han, S. SVDQuant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024a.

Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024b.

Lingam, V., Tejaswi, A., Vavre, A., Shetty, A., Gudur, G. K., Ghosh, J., Dimakis, A., Choi, E., Bojchevski, A., and Sanghavi, S. SVFT: Parameter-efficient fine-tuning with singular vectors. arXiv preprint arXiv:2405.19597, 2024.

Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a.

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024b.

Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., and Zhuang, B. MiniCache: KV cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366, 2024c.

Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., and Chen, M.-H. DoRA: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024d.

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv preprint arXiv:2402.02750, 2024e.

Lozhkov, A., Ben Allal, L., von Werra, L., and Wolf, T. FineWeb-Edu: the finest collection of educational content, 2024a. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

Lozhkov, A., Li, R., Allal, L. B., Cassano, F., Lamy-Poirier, J., Tazi, N., Tang, A., Pykhtar, D., Liu, J., Wei, Y., et al. StarCoder 2 and The Stack v2: The next generation, 2024b.
Ma, X., Fang, G., and Wang, X. LLM-Pruner: On the structural pruning of large language models. Advances in Neural Information Processing Systems, 36:21702-21720, 2023.

Meng, F., Wang, Z., and Zhang, M. PiSSA: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024.

Meng, F., Yao, Z., and Zhang, M. TransMLA: Multi-head latent attention is all you need. arXiv preprint arXiv:2502.07864, 2025.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

Mistral. Cheaper, better, faster, stronger: Continuing to push the frontier of AI and making it accessible to all, 2024. URL https://mistral.ai/news/mixtral-8x22b.

OpenAI. Hello GPT-4o, 2024. URL https://openai.com/index/hello-gpt-4o/.

Paster, K., Santos, M. D., Azerbayev, Z., and Ba, J. OpenWebMath: An open dataset of high-quality mathematical web text, 2023.

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.

Qwen. Qwen2.5: A party of foundation models, 2024. URL https://qwenlm.github.io/blog/qwen2.5.

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. San Francisco, CA, USA, 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pp. 28492-28518. PMLR, 2023.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021a. doi: 10.1145/3474381. URL https://doi.org/10.1145/3474381.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99-106, 2021b.

Sap, M., Rashkin, H., Chen, D., Le Bras, R., and Choi, Y. SocialIQA: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.

Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.

Shuttleworth, R., Andreas, J., Torralba, A., and Sharma, P. LoRA vs full fine-tuning: An illusion of equivalence. arXiv preprint arXiv:2410.21228, 2024.

Stack Overflow. Stack Overflow, 2025. URL https://stackoverflow.com. Accessed: 2025-05-21.

Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.

Sun, Y., Dong, L., Zhu, Y., Huang, S., Wang, W., Ma, S., Zhang, Q., Wang, J., and Wei, F. You only cache once: Decoder-decoder architectures for language models. arXiv preprint arXiv:2405.05254, 2024.
Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024a.

Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024b.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wang, S., Yu, L., and Li, J. LoRA-GA: Low-rank adaptation with gradient approximation. arXiv preprint arXiv:2407.05000, 2024a.

Wang, Z., Liang, J., He, R., Wang, Z., and Tan, T. LoRA-Pro: Are low-rank adapters properly optimized? arXiv preprint arXiv:2407.18242, 2024b.

Workshop, B., Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Wu, X., Huang, S., and Wei, F. Mixture of LoRA experts. arXiv preprint arXiv:2404.13628, 2024.

Xia, M., Gao, T., Zeng, Z., and Chen, D. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087-38099. PMLR, 2023.

Yu, H., Yang, Z., Li, S., Li, Y., and Wu, J. Effectively compress KV heads for LLM. arXiv preprint arXiv:2406.07056, 2024.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pp. 4791-4800. Association for Computational Linguistics, 2019a. doi: 10.18653/V1/P19-1472. URL https://doi.org/10.18653/v1/p19-1472.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019b.

Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., and Zhao, T. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023.

Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., and Tian, Y. GaLore: Memory-efficient LLM training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024.

Zi, B., Qi, X., Wang, L., Wang, J., Wong, K.-F., and Zhang, L. Delta-LoRA: Fine-tuning high-rank parameters with the delta of low-rank matrices. arXiv preprint arXiv:2309.02411, 2023.

Zuhri, Z. M. K., Adilazuarda, M. F., Purwarianti, A., and Aji, A. F. MLKV: Multi-layer key-value heads for memory efficient transformer decoding. arXiv preprint arXiv:2406.09297, 2024.

A. Dataset and Hyper-Parameters for Table 1

Following the experimental setups of TransMLA (Meng et al., 2025), we fine-tune our models using the pretraining corpus from SmolLM (Ben Allal et al., 2024).
The dataset comprises FineWeb-Edu-Dedup (Lozhkov et al., 2024a), Cosmopedia-v2 (a synthetic dataset generated by Mixtral) (Wu et al., 2024), Python-Edu from StarCoder (Lozhkov et al., 2024b), Open-Web-Math (Paster et al., 2023), and data from Stack Overflow (Stack Overflow, 2025).

Table 6. Composition of the training dataset.

| Dataset | Sampling Weight |
|---|---|
| fineweb-edu-dedup | 0.70 |
| cosmopedia-v2 | 0.15 |
| python-edu | 0.06 |
| open-web-math | 0.08 |
| stackoverflow | 0.01 |

To ensure a fair comparison, we replicate the dataset mixing ratios used in the TransMLA setup, as shown in Table 6, to maintain experimental consistency. DeepSeek-V2-Lite is trained on 1B tokens for both SliceGPT and CLOVER using SliceGPT's hyperparameters. For LLaMA-2-7B, we apply CLOVER on top of the checkpoint released by TransMLA, which pruned 93% of the KV cache and was trained on 6B tokens. Additionally, we further train the model with 1.5B tokens for various pruning ratios using TransMLA's hyperparameters.

B. Dataset and Hyper-Parameters for Table 4

The commonsense reasoning tasks consist of 8 subtasks, each with predefined training and testing sets, as described by LLM-Adapters (Hu et al., 2023). The following table lists the details of each sub-dataset.

Table 7. Details of datasets for commonsense reasoning tasks.

| Dataset | Train | Test | About |
|---|---|---|---|
| BoolQ (Clark et al., 2019) | 9,427 | 3,270 | Naturally occurring yes/no questions from unconstrained settings. |
| PIQA (Bisk et al., 2020b) | 16,113 | 1,838 | Questions with two solutions requiring physical commonsense. |
| SIQA (Sap et al., 2019) | 33,410 | 1,954 | Reasoning about actions and social implications. |
| HellaSwag (Zellers et al., 2019b) | 39,905 | 10,042 | Commonsense NLI questions with context and endings. |
| WinoGrande (Sakaguchi et al., 2021b) | 40,398 | 1,267 | Fill-in-the-blank task with binary options. |
| ARC-e (Clark et al., 2018b) | 2,251 | 2,376 | Grade-school multiple-choice science questions (Easy set). |
| ARC-c (Clark et al., 2018b) | 1,119 | 1,172 | Grade-school multiple-choice science questions (Challenge set). |
| OBQA (Mihaylov et al., 2018) | 4,957 | 500 | Questions requiring multi-step reasoning and commonsense knowledge. |

For WinoGrande, the original dataset includes multiple partitions: [xs, s, m, l, xl, debiased]. While LLM-Adapters simply concatenated all these partitions, note that the xl partition actually includes all others, leading to extensive data duplication. After removing duplicates, the training data is reduced from 63.2K to 40.4K instances. Additionally, in the LLM-Adapters paper, the training set sizes of ARC-Challenge and ARC-Easy were reversed by mistake; here, we correct that error.

The results for LoRA and DoRA presented in Table 4 are directly taken from the original DoRA paper, where the hyperparameters are carefully tuned. Similarly, the results for HiRA are cited from its original publication. In contrast, we introduce new experimental results for PiSSA and CLOVER, both of which are optimized with the best learning rates and aligned hyperparameters. Specifically, PiSSA achieves optimal performance at a learning rate of 2e-5, while CLOVER performs best at 1e-4, as shown in Table 8.

Table 9 presents a comparison of hyperparameters for different fine-tuning methods on commonsense tasks. The target modules remain the same for LoRA, DoRA, HiRA, and PiSSA. However, DoRA introduces an additional magnitude module, leading to a slightly higher parameter count.
Since CLOVER inserts trainable parameters across layer pairs, we use the Q-K pair notation to denote its target modules. When CLOVER updates parameters within an attention head, the number of trainable parameters exactly matches that of LoRA at rank 32. To adjust the number of learnable parameters, CLOVER can either span multiple heads or split a single head into multiple blocks.

Both PiSSA and CLOVER exhibit stable training. Therefore, instead of validating every 80 steps, we omit frequent validation, improving training efficiency.

Table 9. Detailed training hyperparameters. Q-K, V-O, U-D means that CLOVER updates pairs of orthogonal vectors.

Method  Target       Evaluation Steps  LR         Scheduler  Batch Size  Warmup Steps  Epochs
LoRA    Q,K,V,U,D    80                3e-4       Linear     16          100           3
DoRA    Q,K,V,U,D    80                2e-4       Linear     16          100           3
HiRA    Q,K,V,U,D    80                1e-4/2e-4  Linear     32          100           3
PiSSA   Q,K,V,U,D    -                 2e-5       Linear     16          100           3
CLOVER  Q-K,V-O,U-D  -                 1e-4       Linear     16          100           3

C. LibriSpeech Long Dataset Target Transcript

Below is the reference text of the LibriSpeech Long dataset for comparison.

Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Up Guards and Adam paintings, and Mason's exquisite idylls are as national as a jingo poem. Mr. Birkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth, and Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, next man.

In fact, with vanilla pruning ratios of just 22.31% and 6.69% for WQ-WK and WV-WO, respectively, the model's output is already significantly degraded:

Mr. Colter is the personal of the classes, and we are glad to welcome his gospel. Nor is Mr. Colter's manner less interesting than his manner. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly he is drawn from eating and its results occur most readily to the mind. He is very dull, so very frequently, and is very Greek after all, and can discover in it but little of Rocky Ithaca. The Nell's pictures are sort of up-guard to Adam's paintings, and Mason's exquisite idylls are as national as a jingle poem. Mr. Burke and Foster's landscapes smile at one much in the same way as Mr. Parker, Mr. Flash is tits. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo and a Turkish bath, Next man.

D. Further Analysis of CLOVER's Fine-Tuning Capability

D.1. Necessity of Full-Direction Fine-Tuning

Besides enabling pruning at large ratios, CLOVER can learn linear combinations of all orthogonal vectors within each attention head, which allows it to more closely resemble full-parameter fine-tuning. To highlight the advantage of updating all orthogonal bases, we randomly sampled 16 instances from the Commonsense dataset, fed them into the model, applied SVD to the model's weight matrices, and recorded the projection magnitudes of the input features along all orthogonal directions. Figure 5 visualizes the results for the middle layer.

Figure 5. Proportion of data projections across different components in random directions (LoRA) versus orthogonal directions (PiSSA), as well as all orthogonal directions (CLOVER).

The figure reveals the following insights:

1) Without accounting for the scaling effect of singular values, the projection magnitude along the principal singular vector consistently exceeds that along other directions. This observation supports PiSSA's approach of updating the principal singular values and vectors, which leads to improved training performance. In contrast, LoRA projects onto random directions, resulting in uniform projection magnitudes across all directions.

2) The singular values of the original model reflect the importance of each direction for the pretraining task: the model amplifies components along directions with large singular values and suppresses those along directions with small ones. It is therefore crucial to account for this scaling. As shown in Figure 5c, once singular values are taken into account, the projection magnitude along the principal singular direction increases to 18%.

3) Although more of the data projects onto the principal singular vectors at higher ranks, 82% of the feature components still project onto other directions. In the extreme case where a task is entirely orthogonal to the vectors used by PiSSA, training on that task may yield zero gradients, limiting learning capacity. Under the same rank constraint, 94% of the feature components in LoRA project outside the LoRA adapter, making it even more susceptible to this zero-gradient problem. Since CLOVER updates all orthogonal directions, as shown in Figure 5d, it effectively mitigates this issue. Consequently, CLOVER outperforms both LoRA and PiSSA in multi-task learning, even with the same number of (or fewer) learnable parameters (Section 4.6).
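The projection statistic above can be sketched in a few lines; this is a minimal illustration rather than the paper's exact measurement code, with `W` and `X` standing in for a layer's weight matrix and the features collected from the sampled instances:

```python
import torch

def projection_proportions(W: torch.Tensor, X: torch.Tensor,
                           scale_by_sigma: bool = False) -> torch.Tensor:
    """Fraction of feature energy along each input-side singular direction of W.

    W: (d_in, d_out) weight matrix; X: (n_tokens, d_in) input features.
    With scale_by_sigma=True, each direction's energy is weighted by its
    squared singular value, mirroring how the layer amplifies that direction.
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)  # W = U diag(S) Vh
    proj = X @ U                        # coordinates of X in the singular basis
    energy = proj.pow(2).sum(dim=0)     # per-direction energy
    if scale_by_sigma:
        energy = energy * S.pow(2)
    return energy / energy.sum()        # proportions, summing to 1

# toy usage with random stand-ins for the weight and the features
torch.manual_seed(0)
W, X = torch.randn(1024, 1024), torch.randn(16, 1024)
p = projection_proportions(W, X, scale_by_sigma=True)
print(p[:4], float(p.sum()))            # share of the leading directions; sums to 1
```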
D.2. Visualizing Rank Updates

To demonstrate that CLOVER achieves full-rank updates, we multiply the updated singular values by their corresponding singular vectors (S_QK is merged into the Key layer, S_VO into the Value layer, and S_UD into the Up layer) and perform SVD on the resulting weight change relative to the base model. We compare against LoRA and Full Fine-Tuning. Figure 6 shows the singular values for the middle layer of LLaMA-2-7B: CLOVER and Full Fine-Tuning achieve full-rank updates, while LoRA is constrained by its low-rank design.

Figure 6. The weight update is low-rank for LoRA, but full-rank for Full Fine-Tuning and CLOVER.
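This check reduces to computing the numerical rank of a weight difference. A minimal sketch under that assumption, where `W_base` and `W_tuned` are hypothetical stand-ins for a layer's weights before and after fine-tuning (with the singular values already merged as described above):

```python
import torch

def update_rank(W_base: torch.Tensor, W_tuned: torch.Tensor,
                tol: float = 1e-5) -> int:
    """Numerical rank of the fine-tuning update delta = W_tuned - W_base."""
    s = torch.linalg.svdvals(W_tuned - W_base)  # singular values, descending
    return int((s > tol * s[0]).sum())          # count values above tolerance

# toy illustration: a rank-32 (LoRA-style) update vs. a dense update
torch.manual_seed(0)
W = torch.randn(1024, 1024)
lora_update = torch.randn(1024, 32) @ torch.randn(32, 1024)
print(update_rank(W, W + lora_update))               # 32: capped by adapter rank
print(update_rank(W, W + torch.randn(1024, 1024)))   # ~1024: full-rank update
```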
D.3. CLOVER Avoids Intruder Dimensions

Recent research (Shuttleworth et al., 2024) has highlighted an issue with LoRA referred to as the "intruder dimensions" phenomenon. As illustrated in Figure 7b, LoRA introduces new random directions into the model that have large magnitudes and therefore precede all of the original singular vectors. The study suggests that these intruder dimensions can degrade the model's performance and exacerbate catastrophic forgetting during continual learning with LoRA. In contrast, CLOVER avoids this issue by fixing all orthogonal bases and updating only the combinations of the vectors. As a result, the changes introduced by CLOVER fine-tuning closely resemble those produced by full-parameter fine-tuning, as shown in Figures 7a and 7c.

Figure 7. The intruder dimensions phenomenon in LoRA, which does not appear in Full Fine-Tuning or CLOVER.

D.4. Cross-Layer Orthogonal Vectors in the Value and Output Layers

In the main text, we presented the orthogonalization process only for the Q-K pair; here, we give the corresponding procedure for the V-O pair. For the Up-Down layers, the output dimension of the Up layer can additionally be reshaped into (block number × block size), and the orthogonal decomposition is then performed within each block.

$$
\begin{aligned}
Y &= \operatorname{attn}(Q_h, K_h)\, V W_O, && V = X W_V \in \mathbb{R}^{b \times h \times n \times d} && (1) \\
  &= \operatorname{attn}(Q_h, K_h)\, X W_V W_O, && W_V W_O = W_{VO} = U S V \in \mathbb{R}^{h \times D \times D} && (2) \\
  &= \operatorname{attn}(Q_h, K_h)\, X U S V, && S_{[:, :r_{vo}, :r_{vo}]} = S_{VO} \in \mathbb{R}^{h \times r_{vo} \times r_{vo}},\; S_{[:, r_{vo}:, r_{vo}:]} = 0,\; r_{vo} \le d && (3) \\
  &= \operatorname{attn}(Q_h, K_h)\, X U_{VO} S_{VO} V_{VO}, && U_{VO} \in \mathbb{R}^{D \times h \times r_{vo}},\; V_{VO} \in \mathbb{R}^{h \times r_{vo} \times D}. && (4)
\end{aligned}
$$

Through this series of transformations, W_V and W_O can be equivalently replaced by the orthogonal vectors U_VO and V_VO, together with the diagonal matrix S_VO. Since r_vo ≤ d, the zero singular values and their corresponding singular vectors can be safely pruned. After guided pruning, S_VO can be merged into U_VO and V_VO, incurring no additional computational overhead.

E. Visualizing More Attention Heads

In Section 4.2, we presented only the first attention head of the first layer. Here, we provide a broader view by showing more attention heads. Figure 8 plots the L2 norm of all Q-K heads in the first, middle, and last layers of Whisper-Large-v3, and Figure 9 does the same for ViT-bigG. Across all layers and all attention heads, CLOVER consistently represents the entire attention head with fewer orthogonal bases. This property is the foundation of CLOVER's effectiveness in improving pruning.

Figure 8. The L2 norm for the 0-th, 15-th, and 31-st attention layers in the Whisper-Large-v3 encoder. The blue line ("Absorb and Decompose") shows the per-dimension norms after redundancy removal with CLOVER, while the orange line ("Vanilla") shows the L2 norm computed directly for each dimension.

Figure 9. The L2 norm for the 0-th, 23-rd, and 47-th attention layers in ViT-bigG. The blue line shows the per-dimension norms after redundancy removal with CLOVER, while the orange line shows the L2 norm computed directly for each dimension.
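For reference, the per-dimension norms plotted in Figures 8 and 9 can be reproduced schematically as follows. This is a self-contained toy in which random matrices stand in for one head's query/key projections (not the actual Whisper or ViT weights); we deliberately construct a redundant head whose effective rank is well below the head dimension:

```python
import torch

torch.manual_seed(0)
D, d = 1280, 64                    # model dim and head dim (toy values)

# One head's query/key projections; their product W_QK = W_q @ W_k.T is the
# cross-layer matrix that CLOVER decomposes. The head is built to have an
# effective rank of only 16, far below the head dimension of 64.
B = torch.randn(D, 16)
W_q = B @ torch.randn(16, d) / d ** 0.5
W_k = B @ torch.randn(16, d) / d ** 0.5

# "Vanilla" view: per-dimension L2 norms of the original query vectors.
vanilla = W_q.norm(dim=0).sort(descending=True).values

# "Absorb and Decompose" view: SVD of W_QK; the singular values give the
# norms along the orthonormal basis that represents the whole head.
U, S, Vh = torch.linalg.svd(W_q @ W_k.T, full_matrices=False)
clover = S[:d]

print(vanilla[:8])                 # many comparable, non-zero norms
print(clover[:8], clover[20:24])   # energy concentrates in few directions; tail ~0
```

As in the figures, the vanilla norms are all of similar magnitude (so direct pruning would remove useful directions), whereas after cross-layer orthogonalization the norms beyond the head's effective rank are nearly zero and can be pruned safely.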