# LASP: Linear Attention Sequence Parallelism

Published in Transactions on Machine Learning Research (05/2025)

Weigao Sun (sunweigao@outlook.com), Shanghai AI Laboratory
Zhen Qin (zhenqin950102@gmail.com), TapTap
Dong Li (liddalidd@gmail.com), Shanghai AI Laboratory
Xuyang Shen (xuyangshen1122@gmail.com), Shanghai AI Laboratory
Yu Qiao (qiaoyu@pjlab.org.cn), Shanghai AI Laboratory
Yiran Zhong (zhongyiran@gmail.com), Shanghai AI Laboratory

Reviewed on OpenReview: https://openreview.net/forum?id=gG8sQUUtN7

## Abstract

Sequence parallelism (SP) is a prevalent strategy for handling long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods such as linear attention, existing SP approaches do not take advantage of their right-product-first property, resulting in sub-optimal communication efficiency and usability. In this paper, we introduce Linear Attention Sequence Parallelism (LASP), an efficient SP approach designed for linear attention-based transformer models. Specifically, we design an efficient point-to-point ring-style communication mechanism that leverages the right-product kernel trick of linear attention, which sharply decreases the communication overhead compared with existing SP methods. We enhance the computation efficiency of LASP by performing kernel fusion and intermediate state caching, making the implementation of LASP hardware-friendly on GPUs. Furthermore, we meticulously ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with very long sequences. We also discuss the generalization of LASP to other linear sequence modeling methods. Extensive experiments on linear attention-based models are conducted with sequence lengths varying from 2K to 4096K. LASP scales the sequence length up to 4096K on 128 GPUs, which is 8× longer than existing SP methods can handle.
Code is available at: https://github.com/OpenNLPLab/LASP.

## 1 Introduction

Linear sequence modeling methods (Katharopoulos et al., 2020; Choromanski et al., 2022; Sun et al., 2025a), including linear attention (Qin et al., 2024d), state space models (Dao & Gu, 2024) and linear RNNs (Qin et al., 2024e), are becoming increasingly popular due to their faster training and inference speed and modeling performance comparable to vanilla Softmax attention-based transformer models (Vaswani et al., 2017; Zeng et al., 2022; Touvron et al., 2023a;b; Team, 2023). The hybrid architecture, which interleaves Softmax attention and linear attention Transformer layers, has proven to be an effective balance between their respective strengths. This approach has been successfully implemented in large-scale commercial models such as Minimax-01 (Li et al., 2025) and Tencent Hunyuan Turbo-S (Tencent, 2025), as well as in smaller-scale hybrid models like Samba (Ren et al., 2024) and Jamba (Lieber et al., 2024). As the size of large language models (LLMs) increases and sequence lengths extend, the memory capacity of a single GPU becomes a significant constraint on the maximum sequence length a large model can handle. To address this, Sequence Parallelism (SP) techniques (Li et al., 2022; Korthikanti et al., 2022) are employed, which partition a long sequence into multiple sub-sequences to be processed on separate devices. However, current implementations of SP methods do not fully exploit the right-product advantages of linear-complexity attention mechanisms (Qin et al., 2024b). This results in less-than-optimal parallelism efficiency and reduced usability for linear sequence modeling methods. In this paper, we present the Linear Attention Sequence Parallelism (LASP) approach for efficient SP on models with linear sequence modeling.
Our approach takes linear attention (Katharopoulos et al., 2020) as an instance to design a sophisticated point-to-point (P2P) ring-style communication mechanism, operating during both the forward and backward passes among devices within a node or across multiple nodes. This design maximizes the utilization of the right-product kernel trick in linear attention by exchanging only a single intermediate state, instead of both the key and value states as in other counterparts. Notably, our approach is independent of attention head partitioning, allowing it to be applied to models with varying numbers or styles of attention heads, such as multi-head, multi-query, and grouped-query attention. This flexibility exceeds the capabilities of existing SP methods in Megatron-LM (Shoeybi et al., 2019; Korthikanti et al., 2022) or DeepSpeed (Jacobs et al., 2023). Our implementation of LASP incorporates system engineering optimizations such as kernel fusion and KV state caching, resulting in significantly enhanced execution efficiency. Furthermore, we have taken great care to ensure the compatibility of LASP with various (sharded) distributed data-parallel (DDP) (Li et al., 2020) training methods, which we refer to as data-sequence hybrid parallelism. Through extensive experiments with linear transformer models of different parameter counts, cluster sizes, and sequence lengths, we demonstrate the performance and efficiency of LASP when used with different DDP instances. Specifically, LASP can extend the sequence length up to 4096K on 128 GPUs, which is 8× longer than existing SP methods can handle. Our primary contributions can be summarized as follows:

- **A new SP approach called LASP** that is designed for linear sequence modeling methods. LASP is able to perform sequence-level distributed training on 8× longer sequences than existing SP methods while being significantly faster.
- **Sequence length-independent communication overhead.**
Our proposed P2P ring-style communication strategy leverages the right-product kernel trick of linear attention to ensure that the exchange of linear attention intermediate states is independent of the sequence length.
- **GPU-friendly implementation.** We optimize the execution of LASP on GPU hardware through meticulous system engineering, including kernel fusion and KV state caching.
- **Data-parallel compatibility.** LASP is compatible with all batch-level DDP methods, including PyTorch/Legacy DDP, FSDP, and ZeRO-series optimizers.

## 2 Related Work

**Linear Attention.** Linear Transformer models bypass the use of Softmax attention by adopting various approximation methods (Katharopoulos et al., 2020; Peng et al., 2021; Qin et al., 2022a; Shen et al., 2024) instead. The central idea is to use the "kernel trick" to speed up the calculation of the attention matrix, specifically by multiplying keys and values first to avoid the computationally intensive N × N matrix multiplication (Sun et al., 2025b). For instance, Katharopoulos et al. (2020) use the 1 + elu activation function, while Qin et al. (2022b) utilize the cosine function to imitate Softmax characteristics. TransNormerLLM (Qin et al., 2024a) proposes Lightning Attention to accelerate linear attention via optimized IO operations, while Lightning Attention-2 (Qin et al., 2024c) enhances efficiency by separately processing inter- and intra-block computations. RetNet (Sun et al., 2023) integrates retention with attention for parallel training and linear-time inference. GLA (Yang et al., 2023) introduces data-independent gating and a hardware-efficient training algorithm. DeltaNet (Schlag et al., 2021) and its parallelized version (Yang et al., 2024b) improve long-context performance using a delta-rule update. GSA (Zhang et al., 2024), inspired by GLA, incorporates bounded-memory slot control to enhance recall-heavy tasks.
Despite significant research progress, the adoption of linear attention and similar linear sequence modeling techniques in commercial large-scale models remains limited. However, some companies have started exploring their use. For example, Minimax-01 incorporates Lightning Attention (Li et al., 2025), a variant of linear attention; Tencent's Hunyuan Turbo-S employs Mamba2 (Tencent, 2025), a variant of state space model; and Together.AI integrates StripedHyena (Poli et al., 2023), a long convolution model.

**Memory-Efficient Attention.** Rabe & Staats (2021) first employ the online Softmax technique to compute numerically stable attention scores sequentially, resulting in linear memory for attention, yet still requiring quadratic time complexity. FlashAttention (Dao et al., 2022; Dao, 2023) employs tiling to minimize the number of memory reads/writes between the GPU's high-bandwidth memory (HBM) and on-chip SRAM, reducing time and memory during training, while PagedAttention (Kwon et al., 2023) optimizes the utilization of KV cache memory by reducing waste and enabling adaptable sharing among batched requests during inference. Ring Attention (Liu et al., 2023) reduces memory requirements for Transformer models when handling long sequences by distributing sequences across multiple devices and overlapping the communication of key-value blocks with blockwise attention computation.

**Sequence Parallelism.** SP, a widely used method for training on long sequences, has been integrated into many large model training frameworks, including Megatron-LM, DeepSpeed, and Colossal-AI. Megatron-LM (Shoeybi et al., 2019) implements SP along with model (tensor) parallelism (MP) to perform large matrix multiplications on GPUs. However, MP partitions the attention heads, which limits the maximum parallelism degree to less than the number of attention heads.
DeepSpeed-Ulysses (Jacobs et al., 2023) uses an all-to-all communication primitive to reduce communication volume, but it also partitions attention heads and thus faces issues similar to Megatron-LM.

### 3.1 Preliminary

**Softmax Attention.** Consider the standard attention (Vaswani et al., 2017) computation with causal masking in the transformer architecture, formulated as:

$$O = \mathrm{Softmax}\left((QK^\top/\sqrt{d}) \odot M\right)V, \tag{1}$$

where d denotes the hidden dimension. The matrices Q, K, V ∈ R^{N×d} represent the query, key, and value matrices, respectively. These matrices are linear projections of the input X ∈ R^{N×d}, i.e., Q = XW_Q, K = XW_K, V = XW_V. The output matrix is denoted as O ∈ R^{N×d}, and M ∈ R^{N×N} represents the causal mask matrix. The Softmax(·) operation introduces quadratic time complexity relative to the input sequence length N, limiting the scalability of vanilla transformers to extended input sequences.

**Linear Attention.** Linear attention was originally proposed in (Katharopoulos et al., 2020), with the elimination of the Softmax operation (Vaswani et al., 2017). Qin et al. (2022a; 2024a) propose to replace the Softmax operation with a normalization operation Norm(·), which turns to

$$O = \mathrm{Norm}\left((QK^\top \odot M)V\right). \tag{2}$$

For bidirectional tasks, the above formulation simplifies to O = Norm((QK^⊤)V). Then, by the associativity of matrix products, it can be equivalently transformed into a right-product version:

$$O = \mathrm{Norm}\left(Q(K^\top V)\right). \tag{3}$$

Figure 1: Visualization of LASP. Left: The chunk-level linear attention computation with a causal mask can be segmented into two categories: intra-chunk and inter-chunk computations. Intra-chunk computations, corresponding to the diagonal elements (in diagonal orange boxes) of the mask matrix, utilize traditional left-product methods.
Inter-chunk computations, corresponding to the lower triangular boxes, employ efficient right-product methods. Right: This panel illustrates the P2P communication mechanism employed by LASP. The input sequence X is divided into multiple sub-sequence chunks {…, Xi, Xi+1, …}, each processed by a different model instance on a distinct device. For each device i, Qi, Ki, and Vi are computed from its respective input chunk Xi. Notably, the communication operations between devices are complementary in the forward and backward passes: in the forward pass, KV matrices are sent from device i to device (i + 1), and in the backward pass, dKV matrices are sent back from device (i + 1) to device i.

This linear attention formulation facilitates recurrent prediction with a computational complexity of O(Nd²), and the recurrent update of K^⊤V, without needing to compute the entire attention matrix, makes its inference efficient. While linear complexity offers significant advantages in computational efficiency and memory optimization for linear attention, computation and memory usage on a single GPU still grow proportionally with the sequence length N. This can exceed the memory capacity of a single GPU, such as the 80GB limit of an NVIDIA A100, for exceptionally long sequences. Achieving zero-redundancy (at the sequence level) training for such long sequences with linear attention-based LLMs across GPU clusters remains an open problem, and the causal setting further intensifies the challenge. To address this, we propose LASP as a solution for parallelizing linear attention training at the sequence level, even in the causal setting. LASP tiles sequences over the cluster: following this idea of tiling, LASP partitions the input sequence into multiple sub-sequence chunks and distributes these chunks across different GPUs.
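The left- vs right-product equivalence of Eq. (3) that LASP exploits can be sketched in a few lines. This is an illustrative example, not the paper's code; tiny pure-Python matrices keep it dependency-free. Without the causal mask, (QK^⊤)V and Q(K^⊤V) are equal, but the former builds an N × N attention matrix (O(N²d) work) while the latter builds only a d × d state (O(Nd²) work).

```python
# Illustrative sketch (not the paper's code) of the left- vs right-product
# equivalence behind Eq. (3), without the causal mask.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

N, d = 4, 2  # sequence length, head dimension
Q = [[0.1 * (s + j + 1) for j in range(d)] for s in range(N)]
K = [[0.2 * (s + 2 * j + 1) for j in range(d)] for s in range(N)]
V = [[0.3 * (s * j + 1) for j in range(d)] for s in range(N)]

left = matmul(matmul(Q, transpose(K)), V)   # builds the N x N attention matrix
right = matmul(Q, matmul(transpose(K), V))  # builds only a d x d state

assert all(abs(x - y) < 1e-9 for lr, rr in zip(left, right) for x, y in zip(lr, rr))
```

The d × d state K^⊤V is exactly the quantity LASP later communicates between devices, which is why its payload does not grow with N.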
For linear attention in a causal setting, in order to fully exploit the right-product advantage of linear attention, we categorize the attention computation for chunks into two distinct types: intra-chunks and inter-chunks. Intra-chunks involve conventional attention computation, while inter-chunks leverage the kernel tricks associated with linear attention's right-product. Further details on the mechanisms of LASP for data distribution, the forward pass, and the backward pass are expounded below. A visualization of LASP is presented in Fig. 1.

**Data Distribution.** LASP is designed for training linear transformers on long sequences in a distributed environment, achieved by partitioning the input data along its sequence dimension. Each GPU in the distributed environment trains on a subset of sub-sequences, which diminishes the large activation memory footprint associated with training on long sequences. Communication operations are introduced between GPUs to transmit intermediate states. The final trained model assimilates the knowledge derived from the entirety of the long sequences.

Algorithm 1 LASP Data Distribution
1: Input: An input sequence in embedding space X ∈ R^{N×d} with sequence length N and hidden dimension d, distributed world size W and sequence parallel size T.
2: Obtain number of sequence parallel groups G = W/T.
3: Obtain sub-sequence length (or chunk size) C = N/T.
4: Get global rank list R = get_global_rank().
5: Obtain sequence parallel source rank list Rsrc = ⌊R/T⌋ · T.
6: Along the sequence dimension, split X into T chunks {X1, X2, ..., XT}, each of size C × d.
7: Transfer copies of the data chunks {X1, X2, ..., XT} to the GPUs with rank indices in Rsrc.
8: Scatter {X1, X2, ..., XT} from Rsrc to all ranks in the respective sequence parallel groups.

Figure 2: LASP Data Distribution. Left: An example of data distribution with two input sequences and eight GPUs. Right: Complete data distribution algorithm.

For an input sequence of length N, we denote its embedding space representation as X ∈ R^{N×d} with feature dimension d. In the LASP framework, X is evenly partitioned into T chunks, where T, called the sequence parallel size, must divide the distributed world size W. These segmented data chunks are subsequently assigned to the respective GPUs. It is essential to note that different sequence parallel groups receive different data batches, whereas within the same group, all data chunks originate from an identical batch of data. A comprehensive depiction of the data distribution process in LASP is provided in Algorithm 1. Additionally, an illustrative example of data distribution in LASP is presented in Fig. 2, where the distributed world size is W = 8, the sequence parallel size is T = 4, the number of sequence parallel groups is G = 2, and the sequence parallel source rank list is Rsrc = [0, 4]. For the first batch Seq0, the input sequence X is partitioned into T chunks {X1, X2, ..., XT} along the sequence dimension and transmitted to the first rank in SP-Group0, which corresponds to global rank 0. The data chunks on global rank 0 are then scattered to global ranks {0, 1, 2, 3} within SP-Group0, where each rank retains only a single chunk. The subsequent batch Seq1 follows a similar manner, being assigned to global ranks {4, 5, 6, 7} within SP-Group1.

**Forward Pass.** To streamline the derivations, the Norm(·) operator in Eq. (2) is temporarily omitted. Additionally, we consider the normal case where W = T, i.e., G = W/T = 1.
In this scenario, the GPU with rank 0 consolidates all split sub-sequences in a batch and distributes them to all GPUs across the entire distributed world. The scenario where the sequence parallel size is not equal to the world size is discussed in Sec. 3.5. We first define kv and KV as the intermediate memory state vector and matrix, respectively. Without loss of generality, we introduce λ as the decay rate in linear attention with causal masking; choosing λ = 1 recovers ordinary linear attention (Qin et al., 2024a; Sun et al., 2023). In the forward pass of linear attention computation with causal masking, the s-th output can be calculated as

$$o_s^\top = q_s^\top \sum_{i \le s} \lambda^{s-i} k_i v_i^\top. \tag{4}$$

Rewritten in recurrent form, we have

$$kv_0 = 0 \in \mathbb{R}^{d \times d}, \quad kv_s = \lambda\, kv_{s-1} + k_s v_s^\top, \quad o_s^\top = q_s^\top (kv_s), \tag{5}$$

where

$$kv_s = \sum_{i \le s} \lambda^{s-i} k_i v_i^\top \tag{6}$$

is the activation memory state in the forward pass at the s-th input.

Algorithm 2 LASP Forward Pass
1: Input: input sequence in embedding space X ∈ R^{N×d} with sequence length N and hidden dimension d, distributed world size W, sequence parallel size T = W, decay rate λ ∈ R+.
2: Distribute the input sequence X according to Algorithm 1.
3: Obtain sub-sequence length (or chunk size) C = N/T.
4: Initialize mask M ∈ R^{C×C}, where Mij = λ^{i−j} if i ≥ j, else Mij = 0.
5: Initialize Λ = diag{λ, λ², ..., λ^C} ∈ R^{C×C}.
6: Initialize activation state KV = 0 ∈ R^{d×d}.
7: for chunk t ∈ {1, ..., T} at rank i ∈ {1, ..., W} in parallel do
8:   Calculate Qt = Xt WQ, Kt = Xt WK, Vt = Xt WV from its own data chunk, each of size C × d.
9:   Compute Ot,intra = [(Qt Kt^⊤) ⊙ M] Vt.
10: end for
11: for chunk t ∈ {1, ..., T} at rank i ∈ {1, ..., W} do
12:   Recv activation KV_{t−1} from rank (i − 1).
13:   Save KV_{t−1} as KVi for the backward computation.
14:   Compute Ot,inter = Λ Qt KV_{t−1}.
15:   Compute Ot = Ot,intra + Ot,inter.
16:   Update KVt = λ^C KV_{t−1} + (λ^C Λ^{−1} Kt)^⊤ Vt.
17:   Send activation KVt to rank (i + 1).
18: end for
19: return O = [Ot], with t ∈ {1, ..., T}.
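Algorithm 2 can be sketched in a single process: the loop over chunks below plays the role of the P2P Send/Recv ring, carrying the d × d state KV from one "rank" to the next. This is an illustrative pure-Python sketch, not the released kernels; it checks the chunkwise output against the token-level recurrence of Eq. (5).

```python
# Single-process sketch of Algorithm 2 (illustrative, not the paper's kernels).
# The chunk loop stands in for the Send/Recv ring; indices are 0-based here.

def zeros(r, c):
    return [[0.0] * c for _ in range(r)]

N, d, C, lam = 6, 2, 2, 0.9  # sequence length, head dim, chunk size, decay
Q = [[0.1 * (s + j + 1) for j in range(d)] for s in range(N)]
K = [[0.2 * (s + 2 * j + 1) for j in range(d)] for s in range(N)]
V = [[0.3 * (s * j + 1) for j in range(d)] for s in range(N)]

# Reference: token-level recurrence kv_s = lam*kv_{s-1} + k_s v_s^T, o_s = q_s^T kv_s.
kv, ref = zeros(d, d), []
for s in range(N):
    kv = [[lam * kv[a][b] + K[s][a] * V[s][b] for b in range(d)] for a in range(d)]
    ref.append([sum(Q[s][a] * kv[a][b] for a in range(d)) for b in range(d)])

# Chunkwise computation: intra-chunk (masked left-product) + inter-chunk (state).
KV, out = zeros(d, d), []
for t in range(N // C):
    Qt, Kt, Vt = (M[t * C:(t + 1) * C] for M in (Q, K, V))
    for r in range(C):
        # inter-chunk: row r of Lambda*Qt*KV_{t-1}, Lambda = diag(lam^1..lam^C)
        o = [lam ** (r + 1) * sum(Qt[r][a] * KV[a][b] for a in range(d)) for b in range(d)]
        # intra-chunk: [(Qt Kt^T) . M] Vt with M_rj = lam^(r-j) for r >= j
        for j in range(r + 1):
            coef = lam ** (r - j) * sum(Qt[r][a] * Kt[j][a] for a in range(d))
            o = [o[b] + coef * Vt[j][b] for b in range(d)]
        out.append(o)
    # state update: KV_t = lam^C KV_{t-1} + (lam^C Lambda^{-1} Kt)^T Vt
    KV = [[lam ** C * KV[a][b]
           + sum(lam ** (C - (j + 1)) * Kt[j][a] * Vt[j][b] for j in range(C))
           for b in range(d)] for a in range(d)]

assert all(abs(x - y) < 1e-9 for ro, rr in zip(out, ref) for x, y in zip(ro, rr))
```

Only the `KV` handed from one chunk iteration to the next would cross device boundaries in the distributed setting; everything else is rank-local.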
In SP, given the data chunk Xt on rank i, the query, key, and value corresponding to Xt are Qt = Xt WQ, Kt = Xt WK, Vt = Xt WV. Note that we assume T = W here, so the chunk and rank indices are equivalent, i.e., t = i. The output within the t-th chunk can be calculated as

$$O_{t,\mathrm{intra}} = [(Q_t K_t^\top) \odot M] V_t. \tag{7}$$

The intra-chunk computation has no dependencies on chunks held by other GPUs, so it can be calculated in parallel on all ranks in the distributed world. However, this result does not account for the impact of the previous (t − 1) chunks on the t-th chunk, which is called the inter-chunk part. To calculate the inter-chunk part, let us rearrange Eq. (4) as

$$o_{s+C}^\top = q_{s+C}^\top \sum_{i \le s+C} \lambda^{s+C-i} k_i v_i^\top = q_{s+C}^\top \sum_{i=C+1}^{s+C} \lambda^{s+C-i} k_i v_i^\top + \lambda^s q_{s+C}^\top \sum_{i \le C} \lambda^{C-i} k_i v_i^\top. \tag{8}$$

The first part in Eq. (8) corresponds to the computation within the current chunk, and the second part corresponds to the contribution of the previous chunks. In SP, Eq. (8) can be rewritten in chunk form as

$$O_{t,\mathrm{inter}} = \Lambda Q_t KV_{t-1}, \tag{9}$$

where KV_t = kv_{tC}. Note that the calculation of the inter-chunk part for the t-th chunk depends on the activation state of the previous (t − 1)-th chunk, i.e., KV_{t−1}, which is calculated on rank (i − 1). Thus a P2P communication operation Recv must be performed to pull KV_{t−1} from rank (i − 1) to rank i. The activation state KV_t should then be updated for the subsequent inter-chunk attention computation at the (t + 1)-th chunk. The update rule of KV_t at the t-th chunk is

$$KV_t = \sum_{s \le tC} \lambda^{tC-s} k_s v_s^\top = \lambda^C \sum_{s \le (t-1)C} \lambda^{(t-1)C-s} k_s v_s^\top + \sum_{s=(t-1)C+1}^{tC} \lambda^{tC-s} k_s v_s^\top = \lambda^C KV_{t-1} + (\mathrm{diag}\{\lambda^{C-1}, \ldots, 1\}\, K_t)^\top V_t = \lambda^C KV_{t-1} + (\lambda^C \Lambda^{-1} K_t)^\top V_t. \tag{10}$$

In correspondence with the preceding Recv operation, another P2P communication operation Send is executed to transmit the acquired KV_t in Eq. (10) to the subsequent rank (i + 1) for its inter-chunk computation. It is noteworthy that in the backward pass, the t-th chunk requires KV_{t−1} as an activation to calculate gradients. To minimize communication operations, we cache KV_{t−1} in high-bandwidth memory (HBM) to accelerate computation.
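Each Send/Recv above carries a single d × d state per linear attention layer. A back-of-envelope size check (a 2-byte element size is assumed here, as for bf16; these numbers are illustrative, not from the paper) shows why this payload is independent of the chunk length, unlike exchanging the K and V blocks of a chunk:

```python
# Back-of-envelope P2P payload sizes (sketch; 2-byte bf16 elements assumed).
d = 128                 # head dimension
C = 1 << 20             # a 1M-token chunk
bytes_per_elem = 2

lasp_payload = d * d * bytes_per_elem          # one d x d KV state
kv_block_payload = 2 * C * d * bytes_per_elem  # K and V blocks of one chunk

assert lasp_payload == 32 * 1024               # 32 KiB, independent of C
assert kv_block_payload == 512 * 1024 ** 2     # 512 MiB, grows with C
```

The gap widens as chunks get longer, which is the intuition behind the communication volumes compared in Sec. 3.3.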
Integrating both the intra and inter parts, the final forward output is

$$O_t = O_{t,\mathrm{intra}} + O_{t,\mathrm{inter}}. \tag{11}$$

We present the complete forward pass of LASP with W = T in Algorithm 2.

**Backward Pass.** For the backward pass, given do_s, we have (Katharopoulos et al., 2020)

$$dq_s^\top = do_s^\top kv_s^\top \in \mathbb{R}^{1 \times d}, \quad dk_s^\top = v_s^\top dkv_s^\top \in \mathbb{R}^{1 \times d}, \quad dv_s^\top = k_s^\top dkv_s \in \mathbb{R}^{1 \times d}, \quad dkv_s = \sum_{i \ge s} \lambda^{i-s} q_i do_i^\top \in \mathbb{R}^{d \times d}. \tag{12}$$

Writing dkv_s in recursive form, we have

$$dkv_{N+1} = 0 \in \mathbb{R}^{d \times d}, \quad dkv_{s-1} = \lambda\, dkv_s + q_{s-1} do_{s-1}^\top. \tag{13}$$

In SP, we have {Qt, Kt, Vt, Ot, dOt} corresponding to the t-th sub-sequence chunk on rank i, where t ∈ {1, ..., T} and i ∈ {1, ..., W}. As in the forward pass, the following derivations assume t = i and T = W. We first calculate dQ with respect to the t-th data chunk, which yields:

$$dQ_{t,\mathrm{intra}} = [(dO_t V_t^\top) \odot M] K_t. \tag{14}$$

Since the computation of dQ_{t,intra} is independent across chunks, it can be parallelized on all GPUs. The calculation of dQ_{t,inter}, by contrast, reflects the dependence of chunk t on chunks 1 to (t − 1). To compute the inter part, we transform Eq. (12) as

$$dq_{s+C}^\top = do_{s+C}^\top \sum_{i \le s+C} \lambda^{s+C-i} v_i k_i^\top = do_{s+C}^\top \sum_{i=C+1}^{s+C} \lambda^{s+C-i} v_i k_i^\top + \lambda^s do_{s+C}^\top \sum_{i \le C} \lambda^{C-i} v_i k_i^\top. \tag{15}$$

The first part in Eq. (15) corresponds to the intra-chunk, while the second part corresponds to the inter-chunk. In SP, we can calculate dQ_{t,inter} as

$$dQ_{t,\mathrm{inter}} = \Lambda\, dO_t\, KV_{t-1}^\top. \tag{16}$$

Note that KV_{t−1} has already been computed and cached during the forward pass, so no communication is required here to obtain it. Benefiting from the KV state caching, the calculation of dQ_{t,inter} can also be executed in parallel. Next, dK within the t-th chunk can be calculated in parallel as

$$dK_{t,\mathrm{intra}} = [(dO_t V_t^\top) \odot M]^\top Q_t. \tag{17}$$

Then we transform Eq. (12) as

$$dk_s^\top = v_s^\top \sum_{i \ge s} \lambda^{i-s} do_i q_i^\top = v_s^\top \sum_{i=s}^{C} \lambda^{i-s} do_i q_i^\top + \lambda^{C-s} v_s^\top \sum_{i \ge C+1} \lambda^{i-C} do_i q_i^\top, \tag{18}$$

where the term before the plus sign corresponds to the intra-chunk and the term after it corresponds to the inter-chunk.
The above equation can be rewritten in chunk form as follows:

$$dK_{t,\mathrm{inter}} = \lambda^C \Lambda^{-1} V_t\, dKV_{t+1}^\top. \tag{19}$$

Here a Recv operation is required to pull dKV_{t+1} from the (t + 1)-th chunk. Then, in order to compute dKV for the (t − 1)-th chunk, dKV is updated as:

$$dKV_t = \sum_{s > (t-1)C} \lambda^{s-(t-1)C} q_s do_s^\top = \lambda^C \sum_{s > tC} \lambda^{s-tC} q_s do_s^\top + \sum_{s=(t-1)C+1}^{tC} \lambda^{s-(t-1)C} q_s do_s^\top = \lambda^C dKV_{t+1} + (\Lambda Q_t)^\top dO_t. \tag{20}$$

A Send operation is then performed to push dKV_t to rank (i − 1). Finally, for dV, the intra part can be calculated as

$$dV_{t,\mathrm{intra}} = [(Q_t K_t^\top) \odot M]^\top dO_t.$$

Again we transform Eq. (12) as:

$$dv_s^\top = k_s^\top \sum_{i \ge s} \lambda^{i-s} q_i do_i^\top = k_s^\top \sum_{i=s}^{C} \lambda^{i-s} q_i do_i^\top + \lambda^{C-s} k_s^\top \sum_{i \ge C+1} \lambda^{i-C} q_i do_i^\top. \tag{21}$$

The first and second terms correspond to the computation of the intra- and inter-chunks, respectively. In SP, dV_{t,inter} can be calculated as:

$$dV_{t,\mathrm{inter}} = \lambda^C \Lambda^{-1} K_t\, dKV_{t+1}. \tag{22}$$

Combining the intra and inter parts, we obtain the final results for dQt, dKt, and dVt:

$$dQ_t = dQ_{t,\mathrm{intra}} + dQ_{t,\mathrm{inter}}, \quad dK_t = dK_{t,\mathrm{intra}} + dK_{t,\mathrm{inter}}, \quad dV_t = dV_{t,\mathrm{intra}} + dV_{t,\mathrm{inter}}. \tag{23}$$

We provide the complete backward pass of LASP in Algorithm 3 in Appendix A.1.

### 3.3 Comparison

Table 1: Communication Volume Comparison. Simplified formulation: the common factor Bd is eliminated for ease of comparison.

| Method | Full Formulation | Simplified Formulation |
|---|---|---|
| LASP | Bd²/h | d/h |
| Ring Attention | 2BNd/h | 2N/h |
| DeepSpeed-Ulysses | 4BNd/T | 4N/T |
| Megatron-SP | 2BNd + 4BNd/T | 2N + 4N/T |

In LASP, the forward pass requires communication of the KV ∈ R^{d×d} state in each linear attention module layer. The communication volume is Bd²/h, where B is the batch size and h is the number of heads. In comparison, Ring Attention also adopts P2P ring-style communication, but on the states K, V ∈ R^{C×d}, which results in a communication volume of 2BNd/h.
SP in Megatron-LM utilizes two all-gather operations after the two layer normalization layers within each transformer layer, and a reduce-scatter operation after the attention and feedforward network (FFN) layers, resulting in a communication volume of 2BNd + 4BNd/T. DeepSpeed-Ulysses uses all-to-all collective communication (Thakur et al., 2005) for the inputs Q, K, V and the output O of each attention module layer, resulting in a communication volume of 4BNd/T. Table 1 compares the communication volumes across the frameworks. Here d/h is the head dimension, commonly set to 128 (Lan et al., 2020). In practical applications where N/T ≥ 32, LASP achieves the lowest theoretical communication volume. Furthermore, the communication volume of LASP is not affected by changes in the sequence length N or the sub-sequence length C, which is a huge advantage for SP with very long sequences across large clusters.

It is worth noting that, although Ring Attention and LASP both use P2P ring-style communication, they differ on both the communication and computation sides. Communication: In both the forward and backward passes, Ring Attention communicates two states K, V ∈ R^{C×d}. In contrast, LASP communicates only a single state KV ∈ R^{d×d}, which does not depend on the sequence length, giving LASP a lower theoretical communication complexity. This makes LASP more efficient, especially in environments with slower interconnects where the communication-computation overlap may not be optimal. Computation: Ring Attention is designed for standard attention, computed in a left-product manner, i.e., ((QK^⊤)V). LASP, on the other hand, is tailored for linear attention-like sequence modeling methods, which leverage the right-product kernel trick (Q(K^⊤V)) to achieve linear-time complexity.

### 3.4 System Engineering Optimization

**Kernel Fusion.**
To improve the efficiency of LASP on GPUs, we perform kernel fusion in both the intra-chunk and inter-chunk computations, and we also fuse the updates of KV and dKV into the intra-chunk and inter-chunk computations.

**KV State Caching.** To avoid recomputing the activation KV during the backward pass, we store it in the HBM of the GPU right after computing it in the forward pass. During the subsequent backward pass, LASP accesses KV directly. It is important to note that the size of the KV activation cached in HBM is d × d, which is not affected by the sequence length N. When the input sequence length N is exceptionally large, the memory usage of KV becomes negligible.

### 3.5 Hybrid Parallelism

**Data-Sequence Hybrid Parallelism.** As illustrated in Fig. 2, LASP allows for the specification of a smaller sequence parallel size that divides the distributed world size. This configuration results in the input data being split along both the batch and sequence dimensions, a type of hybrid parallelism we call data-sequence hybrid parallelism. The ZeRO-series optimizers (Rajbhandari et al., 2020) in DeepSpeed and FSDP (Zhao et al., 2023) in PyTorch distribute model states, which include optimizer states, gradients, and model parameters, across all GPUs within the distributed environment. As variants of data parallelism, these techniques seamlessly align with LASP. Furthermore, their focus on minimizing the memory of model states complements LASP's objective of reducing activation memory on each GPU. By integrating these techniques, training large models on long sequence lengths becomes more practical.

**Compatibility with Tensor Parallelism and Pipeline Parallelism.** LASP supports both tensor parallelism (TP) and pipeline parallelism (PP).
In PP, as exemplified by the GPipe (Kim et al., 2020) scheduling method, the model is first partitioned across multiple devices, with each device holding a segment of the model. Data within a mini-batch is then divided into micro-batches, which are sequentially fed into the device holding the first segment. Each device processes its micro-batch and forwards the output to the next device in the pipeline, simultaneously preparing to receive and process the subsequent micro-batch from the preceding device. This pipelining of inputs effectively minimizes device idle time. When LASP is integrated with PP, micro-batches are substituted with sub-sequences derived from a mini-batch. Unlike in LASP alone, each device retains the intermediate states (KV in the forward pass and dKV in the backward pass) locally, rather than transmitting them to the next device. For TP, the integration with LASP is seamless: linear attention layers utilize TP to partition matrix operations across both intra-chunk and inter-chunk computations.

**Hybrid SP on Inter-layer Hybrid Models.** The hybrid SP approach for inter-layer hybrid models applies the established Ring Attention SP to Softmax attention Transformer layers while simultaneously using LASP for linear attention Transformer layers. Since the two strategies operate independently within their respective layer types, they do not interfere with each other. This straightforward approach primarily serves as a practical application of LASP to hybrid models. We conduct experiments to demonstrate the feasibility of hybrid SP in Appendix A.5.1.

## 4 Experiments

We evaluate LASP on two representative linear attention-based models: TransNormerLLM (TNL) (Qin et al., 2023b; 2024a) and Linear Transformer (Katharopoulos et al., 2020). TNL is the latest large language model built purely upon linear attention, while Linear Transformer is a classical linear transformer model recognized in the community.
Our assessment focuses on three key areas: 1) the ability of LASP to scale up the sequence length on scaled-out GPUs, 2) convergence when using LASP, and 3) speed when using LASP, compared with other SP methods. No activation checkpointing (AC) (Korthikanti et al., 2022) techniques are used in the following experiments to reduce activation memory, except for the experiments in Section 4.3.2. Although adopting AC would further enable longer sequence lengths, it would obscure the capability of our sequence parallel method LASP. All experiments are conducted on a GPU cluster equipped with 128x A100 80GB GPUs. Our implementation is built on Metaseq (Zhang et al., 2022), a PyTorch-based sequence modeling framework with FairScale (FairScale authors, 2021) integrated. For more details on hardware, software, and the experimental setup, see Appendix A.2 & A.3. Note that when implementing other SP methods (e.g., Ring Attention, DeepSpeed-Ulysses and Megatron-SP) on linear attention instances for comparison, we do not use the right-product kernel trick; we maintain each method's original communication primitives and computational manner as originally proposed for Softmax attention.

Figure 3: Scalability Evaluation of LASP on Throughput (tokens/sec) and Memory Usage.
Left: Integration of LASP with the FSDP backend; Right: Integration of LASP with the DDP backend. The TNL-1B model is used, with a batch size of 1 across up to 128x A100 80GB GPUs. A marker with a dotted line indicates an out-of-memory (OOM) occurrence. [Figure 4 panels: throughput vs. sequence length for LASP, Ring Attention, DeepSpeed-Ulysses and Megatron-SP.] Figure 4: Speed Comparison (tokens/sec) of LASP Against Ring Attention, DeepSpeed-Ulysses and Megatron-SP. A marker with a dotted line indicates an out-of-memory (OOM) occurrence. The evaluation utilizes the TNL-1B and 7B models with a batch size of 1 on 64x A100 80GB GPUs. The parallelism size for these three methods is configured to 64. 4.1 Scalability and Speed Comparison The scalability results on throughput and memory usage with varying sequence lengths and numbers of GPUs are illustrated in Fig. 3. Using LASP, we successfully scale the sequence length up to 4096K with the FSDP backend and 2048K with the DDP backend on a TNL model with 1B parameters, on 128 GPUs. We use a fixed batch size of 1 to thoroughly assess the performance of LASP across a range of sequence lengths, from a typical 2K to an exceptionally long 4096K; with the batch size held constant, the results are directly comparable, with only the sequence length varying. Importantly, LASP allows the maximum sequence length to grow linearly with the number of GPUs used. For instance, a sequence length of 512K can be trained using 16 GPUs, while 64 GPUs (4x) can train a 2048K (4x) sequence length. LASP maintains high throughput even as more GPUs are used. Furthermore, LASP demonstrates consistent scalability under both the FSDP and DDP backends.
For more quantitative scalability results of LASP, see Table 10 in Appendix A.5. We further compare TNL 1B and 7B models against existing SP methods: Ring Attention (Liu et al., 2023), DeepSpeed-Ulysses (Jacobs et al., 2023) and Megatron-SP (Korthikanti et al., 2022). All results presented in Fig. 4 are obtained on 64 GPUs. LASP demonstrates a notable enhancement in throughput, primarily due to its efficient communication design that exchanges only the linear attention intermediate states. 4.2 Convergence Table 2 presents the convergence results of two linear attention-based models, TNL (Qin et al., 2024a) and Linear Transformer (Katharopoulos et al., 2020), and one transformer model (LLaMA (Touvron et al., 2023a;b)) with softmax attention, evaluated on an epoch-by-epoch basis. The experiments were conducted on the same training corpus, the Pile (Gao et al., 2020). Both linear models have 0.4B parameters and
Table 2: Convergence Performance of LASP. All experiments use 8x A100 80G GPUs, a sequence length of 16K, and a batch size of 1. The results cover various DDP backends in conjunction with LASP. We explore the performance of two linear attention models, TransNormerLLM (TNL) and Linear Transformer, and one transformer model (LLaMA) with softmax attention, all with 0.4B parameters, across 50K updates.
Model | Parameters | Method | Loss | Method | Loss
Transformer | 0.4B | DDP | 3.727 | \ | \
TNL (Qin et al., 2024a) | 0.4B | DDP | 3.719 | LASP + DDP | 3.715
 | | Legacy DDP | 3.709 | LASP + Legacy DDP | 3.705
 | | FSDP | 3.717 | LASP + FSDP | 3.714
 | | ZeRO-1 | 3.653 | LASP + ZeRO-1 | 3.653
 | | ZeRO-2 | 3.655 | LASP + ZeRO-2 | 3.649
 | | ZeRO-3 | 3.656 | LASP + ZeRO-3 | 3.649
Linear Transformer (Katharopoulos et al., 2020) | 0.4B | DDP | 5.419 | LASP + DDP | 5.408
 | | Legacy DDP | 5.425 | LASP + Legacy DDP | 5.413
 | | FSDP | 5.428 | LASP + FSDP | 5.441
 | | ZeRO-1 | 5.114 | LASP + ZeRO-1 | 5.118
 | | ZeRO-2 | 5.105 | LASP + ZeRO-2 | 5.120
 | | ZeRO-3 | 5.110 | LASP + ZeRO-3 | 5.123
demonstrated consistent loss values when training with and without LASP. All experiments run for 50K steps. The uniform loss convergence across the various DDP backends demonstrates that LASP does not negatively affect model convergence. 4.3 Ablation Study 4.3.1 Ablation on System Engineering Optimization The system engineering optimization techniques, Kernel Fusion and KV State Caching, aim to improve the practical execution efficiency of LASP. To better understand their effects, we perform ablation studies; the results are presented in Table 3. We assess the training throughput and memory usage of a 1B TNL model with a batch size of 2 and a sequence length of 8K, using 2x A100 GPUs. The findings show that both Kernel Fusion and KV State Caching significantly enhance training throughput, with only a minimal effect on memory usage. Table 3: Ablation on the System Engineering Optimization Techniques Kernel Fusion and KV State Caching. Experiments are conducted on the TNL-1B model with a batch size of 2 and a sequence length of 8K, utilizing 2x A100 GPUs.
Kernel Fusion | KV State Cache | Throughput (tokens/s) | Memory Usage Per GPU (GB)
No | No | 37684.4 | 49.5
Yes | No | 44691.0 | 49.5
No | Yes | 41179.6 | 49.7
Yes | Yes | 45915.2 | 49.6
4.3.2 Ablation on Activation Reducing Methods LASP effectively reduces activation memory consumption during training on a per-GPU basis, and this advantage becomes even more pronounced on larger clusters due to the distributed partitioning along the sequence dimension. Another widely used technique, activation checkpointing (AC), follows a fundamentally different strategy but also contributes significantly to activation memory reduction. To further analyze their impact, we conduct ablation experiments to evaluate AC, LASP, and their combination. The results are summarized in Table 4. The experimental results indicate that when using only DDP and FSDP, the maximum sequence lengths that can be trained on a single node with 8 GPUs are 12K and 16K, respectively. Both activation checkpointing (AC) and LASP substantially extend the maximum sequence length by significantly reducing activation memory consumption per GPU, albeit with a minor decrease in throughput. A key distinction between the two is that LASP scales directly with the number of GPUs, whereas AC does not. By combining LASP with AC, we achieve maximum sequence lengths of 496K and 768K on a single node using the DDP and FSDP backends, respectively. This is made possible by the complementary benefits of three techniques: linear attention, AC, and LASP, all of which contribute to efficient training with extremely long input sequences. Table 4: Ablation on Activation Reducing Methods. Both DDP and FSDP backends are tested. A single node equipped with 8x A100 80G GPUs is used to train a TNL-1B model.
Method | Maximum Sequence Length | Throughput (tokens/sec)
DDP | 12K | 131286.0
DDP+AC | 64K | 117429.5
DDP+LASP | 96K | 126829.4
DDP+AC+LASP | 496K | 100837.8
FSDP | 16K | 145303.6
FSDP+AC | 96K | 114464.0
FSDP+LASP | 120K | 138598.8
FSDP+AC+LASP | 768K | 106578.3
5 Discussion Linear-complexity sequence modeling methods are emerging as important alternatives to traditional transformers (using softmax attention) for next-generation foundation models, owing to significantly faster training and inference coupled with performance that rivals conventional approaches. Recently, the AI community has seen rapid development of novel linear-complexity models, which have attracted considerable interest. Examples include linear attention models such as TransNormerLLM, state space models (SSM) like Mamba and Jamba, and linear RNN models including RWKV, HGRN, and Griffin. We contend that the LASP design can be seamlessly integrated into most linear-complexity models. To underscore LASP's generalization, we use a generalized form of linear attention in Appendix A.4 (Qin et al., 2024b), demonstrating that other linear-complexity models can also be accommodated within the LASP framework. Moreover, it is important to explore the compatibility of linear attention and LASP with the widely adopted softmax attention. Each mechanism has distinct advantages in different scenarios. Softmax attention is highly effective for modeling short sequences but suffers from quadratic complexity in sequence length, which limits its scalability to long-context tasks. Linear attention and similar linear sequence modeling methods, on the other hand, provide significantly better efficiency for long sequences but tend to be less effective at capturing complex dependencies. A practical way to harness the benefits of both is a hybrid architecture that alternates softmax attention and linear attention layers within Transformer models.
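As a toy illustration (the helper below is our own sketch, not from any released codebase), the inter-layer layout of the 1/4 hybrid model used in Appendix A.5.1, together with a per-layer SP assignment, can be written as:

```python
def hybrid_layout(num_layers: int, softmax_every: int = 4) -> str:
    """Return the layer pattern of a 1/softmax_every inter-layer hybrid model:
    'L' = linear attention layer (parallelized with LASP),
    'S' = softmax attention layer (parallelized with Ring Attention SP)."""
    return "".join("S" if (i + 1) % softmax_every == 0 else "L"
                   for i in range(num_layers))

# A 16-layer "1/4 hybrid": one softmax layer after every three linear layers.
layout = hybrid_layout(16)          # 'LLLSLLLSLLLSLLLS'
# Each layer type keeps its own SP strategy; the two operate independently.
sp_plan = ["Ring Attention SP" if c == "S" else "LASP" for c in layout]
```

Because the two SP strategies are confined to their own layer types, no coordination between them is required beyond sharing the same sequence partition.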
This strategy has already been implemented in large-scale commercial models such as Minimax-01 (Li et al., 2025) and Tencent Hunyuan Turbo-S (Tencent, 2025), as well as in smaller hybrid models like Samba (Ren et al., 2024), Jamba (Lieber et al., 2024), and Gated DeltaNet (Yang et al., 2024a). 6 Conclusion We presented LASP to address the limitations of existing SP methods on linear-complexity sequence modeling methods by leveraging their right-product feature, which significantly enhances communication and parallelism efficiency. Through the design of an efficient P2P ring-style communication mechanism and careful engineering optimizations, including kernel fusion and KV state caching, LASP achieves a notable reduction in communication traffic and improved hardware utilization on GPU clusters. Compatibility with all types of batch-level DDP methods makes LASP practical for large-scale distributed training with very long sequences. Our experiments highlight the advantages of LASP in scalability, speed, memory usage, and convergence. In our experimental setups, LASP achieves significantly faster sequence-level distributed training on sequence lengths up to 8x longer than out-of-the-box SP methods. Broader Impact Statement This work represents a notable advancement in artificial intelligence and machine learning, particularly in improving the efficiency and scalability of linear attention-based models. LASP enables the processing of much longer sequences than existing methods while significantly accelerating computation, making it highly beneficial for tasks like natural language understanding, genomic sequence analysis, and time-series forecasting.
However, the enhanced capabilities and efficiency introduced by LASP also raise ethical and societal considerations, such as the potential for misuse in generating persuasive but misleading content or in surveillance applications. Nevertheless, the contributions of LASP to reducing computational overhead and energy consumption in training large models may also bring positive environmental impacts. Acknowledgments This work is supported by the Shanghai AI Laboratory. Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers, 2022. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023. Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024. Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022. FairScale authors. FairScale: A general purpose modular PyTorch library for high performance and large scale training. https://github.com/facebookresearch/fairscale, 2021. Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling, 2020. Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023. Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR), 2022.
Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces, 2022. Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020. Tobias Katsch. GateLoop: Fully data-controlled linear recurrence for sequence modeling. arXiv preprint arXiv:2311.01927, 2023. Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, and Sungwoong Kim. torchgpipe: On-the-fly pipeline parallelism for training giant models. 2020. Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models, 2022. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention, 2023. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations, 2020. Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention. arXiv preprint arXiv:2501.08313, 2025. Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala.
PyTorch distributed: Experiences on accelerating data parallel training, 2020. Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, and Yang You. Sequence parallelism: Long sequence training from system perspective, 2022. Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024. Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023. Huanru Henry Mao. Fine-tuning pre-trained transformers into decaying fast weights. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 10236–10242, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.697. Eric Martin and Chris Cundy. Parallelizing linear recurrent neural nets over sequence length. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HyUNwulC-. Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guangyu Song, Xiangru Tang, Johan Wind, Stanisław Woźniak, Zhenyuan Zhang, Qinghua Zhou, Jian Zhu, and Rui-Jie Zhu. RWKV: Reinventing RNNs for the transformer era. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 14048–14077, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.936. URL https://aclanthology.org/2023.findings-emnlp.936.
Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, and Lingpeng Kong. Random feature attention. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=QtTKTdVrFBB. Michael Poli, Jue Wang, Stefano Massaroli, Jeffrey Quesnelle, Ryan Carlow, Eric Nguyen, and Armin Thomas. StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models, 12 2023. URL https://github.com/togethercomputer/stripedhyena. Zhen Qin and Yiran Zhong. Accelerating Toeplitz neural network with constant-time inference complexity. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, December 2023. Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 7025–7041, Abu Dhabi, United Arab Emirates, December 2022a. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.473. Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosFormer: Rethinking softmax in attention. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=Bl8CQrx2Up4. Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li, Yuchao Dai, Lingpeng Kong, and Yiran Zhong. Toeplitz neural network for sequence modeling. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=IxmWsm4xrua. Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Fei Yuan, Xiao Luo, et al. Scaling TransNormer to 175 billion parameters.
arXiv preprint arXiv:2307.14995, 2023b. Zhen Qin, Weixuan Sun, Kaiyue Lu, Hui Deng, Dongxu Li, Xiaodong Han, Yuchao Dai, Lingpeng Kong, and Yiran Zhong. Linearized relative positional encoding. Transactions on Machine Learning Research, 2023c. Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, and Yiran Zhong. TransNormerLLM: A faster and better large language model with improved TransNormer, 2024a. Zhen Qin, Xuyang Shen, Weigao Sun, Dong Li, Stan Birchfield, Richard Hartley, and Yiran Zhong. Unlocking the secrets of linear complexity sequence model from a unified perspective. arXiv preprint arXiv:2405.17383, 2024b. Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658, 2024c. Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Various lengths, constant speed: Efficient language modeling with lightning attention. arXiv preprint arXiv:2405.17381, 2024d. Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. HGRN2: Gated linear RNNs with state expansion. arXiv preprint arXiv:2404.07904, 2024e. Zhen Qin, Songlin Yang, and Yiran Zhong. Hierarchically gated recurrent neural network for sequence modeling. Advances in Neural Information Processing Systems, 36, 2024f. Markus N. Rabe and Charles Staats. Self-attention does not need O(n^2) memory. CoRR, abs/2112.05682, 2021. URL https://arxiv.org/abs/2112.05682. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models, 2020. Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling. arXiv preprint arXiv:2406.07522, 2024. Imanol Schlag and Jürgen Schmidhuber.
Gated fast weights for associative retrieval. 2018. Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, 2021. Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, and Yiran Zhong. Scaling laws for linear complexity language models. arXiv preprint arXiv:2406.16690, 2024. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. Jimmy T. H. Smith, Andrew Warrington, and Scott W. Linderman. Simplified state space layers for sequence modeling. CoRR, abs/2208.04933, 2022. doi: 10.48550/arXiv.2208.04933. URL https://doi.org/10.48550/arXiv.2208.04933. Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, and Yiran Zhong. CO2: Efficient distributed training with full communication-computation overlap. arXiv preprint arXiv:2401.16265, 2024. Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, and Yu Cheng. Linear-MoE: Linear sequence modeling meets mixture-of-experts. arXiv preprint arXiv:2503.05447, 2025a. Weigao Sun, Yongtuo Liu, Xiaqiang Tang, and Xiaoyu Mo. Sequence accumulation and beyond: Infinite context length on single GPU and large clusters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 20725–20733, 2025b. Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023. InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities, 2023. Tencent. Tencent Hunyuan Turbo-S. https://github.com/Tencent/llm.hunyuan.turbo-s, 2025. Rajeev Thakur, Rolf Rabenseifner, and William Gropp.
Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications, 19(1):49–66, 2005. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024a. Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. arXiv preprint arXiv:2406.06484, 2024b. Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022. Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models, 2022. Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Wei Bi, Freda Shi, Bailin Wang, Peng Zhou, and Guohong Fu. Gated slot attention for efficient linear-time sequence modeling. arXiv preprint arXiv:2409.07146, 2024. Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023. Beitong Zhou, Jun Liu, Weigao Sun, Ruijuan Chen, Claire J Tomlin, and Ye Yuan. pbSGD: Powered stochastic gradient descent methods for accelerated non-convex optimization. In IJCAI, pp. 3258–3266, 2020. A.1 Backward Pass Algorithm Algorithm 3 LASP Backward Pass 1: Input: sequence length N, distributed world size W, sequence parallel size T, decay rate λ ∈ R+, Q_t, K_t, V_t, O_t, dO_t ∈ R^(C×d) for t ∈ {1, 2, …, T}. 2: Obtain sub-sequence length (or chunk size) C = N/T.
3: Initialize mask M ∈ R^(C×C), where M_ij = λ^(i−j) if i ≥ j, else M_ij = 0.
4: Initialize Λ = diag{λ, λ^2, …, λ^C} ∈ R^(C×C).
5: Initialize dKV = 0 ∈ R^(d×d).
6: for t ∈ {1, 2, …, T} at rank i ∈ {1, 2, …, W} in parallel do
7:   Compute dQ_t,intra = [(dO_t V_t^T) ⊙ M] K_t.
8:   Compute dQ_t,inter = Λ dO_t (KV_{t−1})^T.
9:   Compute dK_t,intra = [(dO_t V_t^T) ⊙ M]^T Q_t.
10:  Compute dV_t,intra = [(Q_t K_t^T) ⊙ M]^T dO_t.
11: end for
12: for t ∈ {T, …, 2, 1} at rank i ∈ {W, …, 2, 1} do
13:  Recv activation dKV_{t+1} from rank (i + 1).
14:  Compute dK_t,inter = (λ^C Λ^(−1) V_t) (dKV_{t+1})^T.
15:  Compute dV_t,inter = (λ^C Λ^(−1) K_t) dKV_{t+1}.
16:  Load KV_i as KV_t on rank i.
17:  Combine intra- and inter-chunk parts of dQ_t, dK_t, dV_t: dQ_t = dQ_t,intra + dQ_t,inter, dK_t = dK_t,intra + dK_t,inter, dV_t = dV_t,intra + dV_t,inter.
18:  Compute dKV_t = λ^C dKV_{t+1} + (Λ Q_t)^T dO_t.
19:  Send activation dKV_t to rank (i − 1).
20: end for
21: return dQ = [dQ_t], dK = [dK_t], dV = [dV_t], with t ∈ {1, 2, …, T}.
A.2 Hardware and Software Hardware. Our experimental configuration involves a maximum of 16x DGX-A100 servers, each equipped with 8x A100 GPUs interconnected through NVSwitch, ensuring an inter-GPU bandwidth of 600GBps. For inter-node communication, we employ RoCE (RDMA over Converged Ethernet) technology, with 8 RoCE RDMA adapters in each server. This setup facilitates efficient inter-server communication with a bandwidth capacity of 800Gbps. Software. Experiments are implemented in PyTorch 2.1.1 and Triton 2.0.0 with CUDA 11.7, cuDNN 8.0, and NCCL 2.14.3. Our algorithm is developed upon Metaseq and DeepSpeed. A.3 Experimental Setup The training configuration uses the following hyperparameters: a learning rate of 0.0005 to control the optimization step size, a cap of 50,000 updates to define the training duration, and a 2,000-update warmup period to stabilize early training by gradually ramping the learning rate. Additionally, a weight decay rate of 0.01 is used for regularization to avoid over-fitting (Sun et al., 2024).
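As a numerical sanity check on the backward recursion of Algorithm 3 in Appendix A.1, the chunked gradients can be compared against the full quadratic computation. The following is our own serial NumPy simulation (the ring send/recv of dKV is replaced by a plain reverse loop over chunks):

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, d, lam = 8, 4, 3, 0.9                  # sequence length, chunk size, dim, decay
T = N // C
Q, K, V, dO = (rng.standard_normal((N, d)) for _ in range(4))

# Full (quadratic) reference gradients with the causal decay mask
gi, gj = np.arange(N)[:, None], np.arange(N)[None, :]
Mf = np.where(gi >= gj, lam ** (gi - gj), 0.0)
dQ_ref = ((dO @ V.T) * Mf) @ K
dK_ref = ((dO @ V.T) * Mf).T @ Q
dV_ref = ((Q @ K.T) * Mf).T @ dO

li, lj = np.arange(C)[:, None], np.arange(C)[None, :]
M = np.where(li >= lj, lam ** (li - lj), 0.0)          # intra-chunk mask (line 3)
Lam = lam ** np.arange(1, C + 1)                       # diag(λ, …, λ^C) (line 4)
# Forward KV states cached per chunk (KV State Caching): KV[t] is what chunk t receives
KV = np.zeros((T + 1, d, d))
for t in range(T):
    Kt, Vt = K[t*C:(t+1)*C], V[t*C:(t+1)*C]
    KV[t+1] = lam**C * KV[t] + ((lam**C / Lam)[:, None] * Kt).T @ Vt

dKV = np.zeros((d, d))                                 # initialization (line 5)
dQ, dK, dV = np.zeros_like(Q), np.zeros_like(K), np.zeros_like(V)
for t in range(T - 1, -1, -1):                         # reverse chunk order (lines 12-20)
    s = slice(t*C, (t+1)*C)
    Qt, Kt, Vt, dOt = Q[s], K[s], V[s], dO[s]
    dQ[s] = ((dOt @ Vt.T) * M) @ Kt + Lam[:, None] * (dOt @ KV[t].T)            # 7-8
    dK[s] = ((dOt @ Vt.T) * M).T @ Qt + ((lam**C / Lam)[:, None] * Vt) @ dKV.T  # 9, 14
    dV[s] = ((Qt @ Kt.T) * M).T @ dOt + ((lam**C / Lam)[:, None] * Kt) @ dKV    # 10, 15
    dKV = lam**C * dKV + (Lam[:, None] * Qt).T @ dOt   # line 18: state "sent" onward

assert np.allclose(dQ, dQ_ref) and np.allclose(dK, dK_ref) and np.allclose(dV, dV_ref)
```

The chunked recursion reproduces the quadratic gradients exactly while touching only C×C and d×d intermediates per chunk.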
The Adam optimizer, with beta values of 0.9 and 0.999, is chosen for managing the momentum and scaling of gradients, aiding effective and stable convergence (Zhou et al., 2020). Different DDP backends, including PyTorch DDP (abbr. DDP), Legacy DDP, FSDP, and the ZeRO series, are selected in experiments for cross-validation of compatibility with LASP. A.4 Generalization of LASP While LASP is initially inspired by linear attention mechanisms, we aim to show its broader applicability to various linear sequence modeling approaches. This section investigates the generalization of LASP through both theoretical analysis and empirical validation. On the theoretical side, we first define the following terms: Memory State m_t ∈ R^(k×d), Input State i_t ∈ R^d, Expand State e_t ∈ R^k, Oscillation State o_t ∈ R^(k×d), Shrink State s_t ∈ R^k, and write a general form of recurrent memory as (Qin et al., 2024b): m_t = o_t ⊙ m_{t−1} + e_t i_t^T. (24) This is a general form of the recurrence of linear attention in Eq. (5), with specific choices of o_t and e_t: kv_t = λ kv_{t−1} + k_t v_t^T. (25) The design of LASP can be seamlessly applied to models that can be expressed in the general form of Eq. (24). These models include: S4 (Gu et al., 2022), S5 (Smith et al., 2022), DSS (Gupta et al., 2022), TNN (Qin et al., 2023a), Linear Attention (Katharopoulos et al., 2020), TNL (Qin et al., 2024a), RetNet (Sun et al., 2023), Mamba (Gu & Dao, 2023), RWKV-4 (Peng et al., 2023), Cosformer (Qin et al., 2022b), Lrpe (Qin et al., 2023c), GLA (Yang et al., 2023), GateLoop (Katsch, 2023), DUR (Mao, 2022), GFW (Schlag & Schmidhuber, 2018), HGRN (Qin et al., 2024f;e), and LRN (Martin & Cundy, 2018). We list all these models and their corresponding elements in Table 5. Table 5: Checklist for Typical Linear-Complexity Sequence Modeling Methods within the Defined General Form.
For each method, the following states are outlined: Input State, Expand State, Oscillation State, Shrink State, and Memory State. If a state is directly linked to the input sequence, the subscript t is emphasized. Note that we use 1(k) ∈ R^k, where 1(k)_j = 1 for j = 1, …, k, and J(kd) = 1(k) 1(d)^T ∈ R^(k×d).
Method | Input i_t | Expand e_t | Oscillation o_t | Shrink s_t | Memory
S4 | x_t | B | A | C | k × 1
S5 | x_t | B | A | C | k × d
DSS | x_t | B | a1(k) | C | k × d
TNN | x_t | B | A | C | k × d
Linear Attention | x_t | B_t | J(kd) | C_t | k × d
TNL/RetNet | x_t | B_t | λJ(kd) | C_t | k × d
Mamba | x_t | B_t | A_t | C_t | k × d
RWKV-4 | x_t | exp(k_t) | exp(−w) | C_t | 1 × 1
Cosformer | x_t | B_t | exp(iθ)J(kd) | C_t | k × d
Lrpe | x_t | B_t | exp(iΘ)1(d) | C_t | k × d
GLA/GateLoop | x_t | B_t | g_t 1(d)^T | C_t | k × d
DUR/GFW | x_t | B_t | g_t g̃_t^T | C_t | k × d
HGRN/LRN | x_t | 1 − A_t | A_t | C_t | 1 × 1
We also give a complete explanation for each modeling method below. S4. In S4, we obtain u_t ∈ R^d through linear projection from the input x_t, and A ∈ R^(k×k), B, C ∈ R^(k×1) through SSM parameterization. The calculation is as follows: m_t = A m_{t−1} + B u_t^T, y_t = m_t^T C. Note that S4 is originally defined as channel-wise mappings f_i : R^(n×1) → R^(n×1), i = 1, …, d. S5. The recurrence equation of S5 is the same as S4, the only difference being the direct definition of the mapping R^(n×d) → R^(n×d) and B, C ∈ R^(k×d). DSS. The recurrence equation of DSS is the same as S4/S5, the only difference being the direct definition of the mapping R^(n×d) → R^(n×d) and B, C ∈ R^(k×d), A = Diag(a) ∈ R^(k×k). TNN. According to (Qin & Zhong, 2023), TNN can be losslessly converted to an SSM, where C = J(kd) ∈ R^(k×d), B ∈ R^(k×d), A = Diag(λ_1, …, λ_k) ∈ R^(k×k); we get u_t from x_t through linear projection, and it can be expressed as a recursive formula: m_t = A m_{t−1} + B u_t^T, y_t = m_t^T C. Linear Attention. In linear attention, we obtain the query q_t ∈ R^k, key k_t ∈ R^k, and value v_t ∈ R^d from the input x_t ∈ R^d through linear projection, and compute recursively as follows: kv_t = kv_{t−1} + k_t v_t^T, y_t = kv_t^T q_t. TNL/RetNet.
TNL/RetNet is a form of Linear Attention with exponential decay. The method for obtaining $q_t, k_t, v_t$ is the same as in Linear Attention, and $\lambda$ is a predefined parameter that is not learned. Its recursive calculation is:
$$\mathbf{kv}_t = \lambda\, \mathbf{kv}_{t-1} + k_t v_t^{\top}, \quad y_t = \mathbf{kv}_t^{\top} q_t.$$

Mamba. Mamba can be seen as a data-dependent S4. It uses a similar method to obtain $u_t$, $A$, $B$, $C$; the $A_t$, $B_t$, $C_t$ are then computed through $x_t$ and $A$, $B$, $C$. Its recurrence equation is defined as:
$$m_t = A_t \odot m_{t-1} + B_t u_t^{\top}, \quad y_t = m_t^{\top} C_t.$$

RWKV-4. In RWKV-4, we obtain $r_t, k_t, v_t$ through linear projections from the input $x_t$, with $w$ a learnable weight. Ignoring the denominator of RWKV-4, the recurrence equation can be simplified as:
$$m_t = \exp(-w) \odot m_{t-1} + \exp(k_t) v_t^{\top}, \quad y_t = m_t^{\top} r_t.$$
Similar to S4, RWKV-4 uses channel-wise mappings $f_i$, $i = 1, \dots, d$, of $\mathbb{R}^{n \times 1} \to \mathbb{R}^{n \times 1}$.

Cosformer. In Cosformer, we obtain the query $q_t \in \mathbb{R}^{k}$, key $k_t \in \mathbb{R}^{k}$, and value $v_t \in \mathbb{R}^{d}$ from the input $x_t \in \mathbb{R}^{d}$, together with a predefined $\theta$ (not learnable). We then recursively calculate as follows:
$$\mathbf{kv}_t = \exp(i\theta)\, \mathbf{kv}_{t-1} + k_t v_t^{\top}, \quad y_t = \mathrm{Re}[\mathbf{kv}_t^{\top}] q_t.$$

Lrpe. In Lrpe, we obtain the query $q_t \in \mathbb{R}^{k}$, key $k_t \in \mathbb{R}^{k}$, and value $v_t \in \mathbb{R}^{d}$ from the input $x_t \in \mathbb{R}^{d}$, with $\theta$ a learnable weight, and recursively calculate as follows:
$$\mathbf{kv}_t = \Lambda\, \mathbf{kv}_{t-1} + k_t v_t^{\top}, \quad \Lambda = \mathrm{diag}\{\exp(i\theta_1), \dots, \exp(i\theta_k)\}, \quad y_t = \mathrm{Re}[\mathbf{kv}_t]^{\top} q_t.$$

GLA/GateLoop. In GLA/GateLoop, we obtain the query $q_t \in \mathbb{R}^{k}$, key $k_t \in \mathbb{R}^{k}$, value $v_t \in \mathbb{R}^{d}$, and decay $g_t \in \mathbb{R}^{k}$ from the input $x_t \in \mathbb{R}^{d}$, and recursively calculate as follows:
$$\mathbf{kv}_t = \mathrm{Diag}\{g_t\}\, \mathbf{kv}_{t-1} + k_t v_t^{\top}, \quad y_t = \mathbf{kv}_t^{\top} q_t.$$

DUR/GFW. In DUR/GFW, we obtain the query $q_t \in \mathbb{R}^{k}$, key $k_t \in \mathbb{R}^{k}$, value $v_t \in \mathbb{R}^{d}$, and decays $g_t \in \mathbb{R}^{k}$, $\bar{g}_t \in \mathbb{R}^{d}$ from the input $x_t \in \mathbb{R}^{d}$, and recursively calculate as follows:
$$\mathbf{kv}_t = (g_t \bar{g}_t^{\top}) \odot \mathbf{kv}_{t-1} + k_t v_t^{\top}, \quad y_t = \mathbf{kv}_t^{\top} q_t.$$

HGRN/LRN. In HGRN/LRN, we obtain the output gate $o_t \in \mathbb{R}^{1}$, forget gate $f_t \in \mathbb{R}^{1}$, and input state $i_t \in \mathbb{R}^{1}$ from the input $x_t \in \mathbb{R}^{1}$, and recursively calculate as follows:
$$h_t = f_t h_{t-1} + (1 - f_t) i_t^{\top}, \quad y_t = h_t^{\top} o_t.$$
Similar to S4, HGRN/LRN use channel-wise mappings $f_i$, $i = 1, \dots, d$, of $\mathbb{R}^{n \times 1} \to \mathbb{R}^{n \times 1}$.
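To make the general form concrete, the following NumPy sketch (our own illustration; the function name `scan`, the toy shapes, and the scalar decay `lam` are illustrative choices, not LASP's actual implementation) runs the recurrence $m_t = o_t \odot m_{t-1} + e_t i_t^{\top}$ with a TNL/RetNet-style oscillation $o_t = \lambda J^{(kd)}$. The final assertion checks the property that LASP exploits for sequence parallelism: splitting the sequence into sub-sequences and handing the memory state across the boundary reproduces the full-sequence output.

```python
import numpy as np

def scan(inputs, expand, osc, shrink, m0=None):
    """Sequential scan of the general recurrence (Eq. 24):
        m_t = o_t * m_{t-1} + e_t i_t^T,   y_t = m_t^T s_t
    Shapes: inputs (n, d), expand/shrink (n, k), osc (n, k, d).
    m0 is the incoming memory state: zeros at the start of a sequence,
    or the state handed over from the previous sub-sequence."""
    n, d = inputs.shape
    k = expand.shape[1]
    m = np.zeros((k, d)) if m0 is None else m0
    ys = np.empty((n, d))
    for t in range(n):
        m = osc[t] * m + np.outer(expand[t], inputs[t])  # elementwise decay + rank-1 update
        ys[t] = m.T @ shrink[t]                          # per-step output
    return ys, m

# Toy TNL/RetNet-style instantiation: o_t = lam * J^{(kd)} with fixed scalar decay.
rng = np.random.default_rng(0)
n, k, d, lam = 8, 4, 6, 0.9
i = rng.standard_normal((n, d))
e = rng.standard_normal((n, k))
s = rng.standard_normal((n, k))
o = np.full((n, k, d), lam)

# Full-sequence scan vs. two "devices" each scanning half the sequence and
# passing the k-by-d memory state forward across the boundary.
y_full, _ = scan(i, e, o, s)
y_a, m_a = scan(i[:4], e[:4], o[:4], s[:4])
y_b, _ = scan(i[4:], e[4:], o[4:], s[4:], m0=m_a)
assert np.allclose(y_full, np.concatenate([y_a, y_b]))
```

Because only the $k \times d$ memory state crosses the sub-sequence boundary, the per-boundary communication volume is independent of sequence length, which is the intuition behind LASP's efficiency for every model expressible in this form.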
To empirically validate the generalization of LASP, we adopt the experimental setup from Table 2 and apply LASP to three additional linear sequence modeling methods listed in Table 5, namely Cosformer, RetNet, and Mamba. The convergence results, presented in Table 6, demonstrate that LASP does not negatively affect convergence and achieves performance on par with the baselines.

Table 6: Convergence Results of LASP on Cosformer, RetNet and Mamba. Each model has 0.4B parameters and is tested with a batch size of 2 and a sequence length of 16K.

| Model | Parameters | Method | Loss | Method | Loss |
|---|---|---|---|---|---|
| Cosformer | 0.4B | DDP | 4.001 | LASP+DDP | 4.005 |
| Cosformer | 0.4B | ZeRO-1 | 4.013 | LASP+ZeRO-1 | 3.969 |
| RetNet | 0.4B | DDP | 4.312 | LASP+DDP | 4.306 |
| RetNet | 0.4B | ZeRO-1 | 4.312 | LASP+ZeRO-1 | 4.309 |
| Mamba | 0.4B | DDP | 4.116 | LASP+DDP | 4.122 |
| Mamba | 0.4B | ZeRO-1 | 4.108 | LASP+ZeRO-1 | 4.110 |

A.5 Additional Experiment Results

A.5.1 Hybrid SP Results on Hybrid Models

We perform a small-scale experiment (8 A100 GPUs, 1B parameters, DDP backend) to evaluate the feasibility of the hybrid SP approach for hybrid models. In this setup, a 1/4 hybrid model denotes a configuration where one out of every four layers is a softmax attention Transformer layer. Using S to represent softmax attention and L for linear attention, a 16-layer 1/4 hybrid model follows the pattern LLLSLLLSLLLSLLLS. The results in Table 7 demonstrate that hybrid SP effectively extends the maximum trainable sequence length for both TNL and Linear Transformer, while incurring only a slight reduction in training speed.

Table 7: Hybrid SP Results on Inter-layer Hybrid Models. 1/4 hybrid refers to a model where one out of every four layers is a softmax attention Transformer layer. Maximum Sequence Length and Throughput (tokens/sec) are reported.
| Model | Method | Maximum Sequence Length | Throughput |
|---|---|---|---|
| 1/4 Hybrid TNL | No SP | 12K | 128684 |
| 1/4 Hybrid TNL | Hybrid SP Solution | 90K | 125397 |
| 1/4 Hybrid Linear Transformer | No SP | 12K | 129253 |
| 1/4 Hybrid Linear Transformer | Hybrid SP Solution | 90K | 125883 |

A.5.2 Evaluation Results on Downstream Tasks

We conduct an experiment with an extended training duration of 300K steps (consuming 40B tokens) to assess the performance of LASP and its evaluation results on downstream tasks. Both TNL and Linear Transformer with 0.4B parameters are investigated. We evaluate the performance of the trained models on multiple downstream benchmarks, including PIQA, HellaSwag (HS), WinoGrande (WG), ARC-E, ARC-C, OBQA, and CSR-AVG. The results are presented in Tables 8 and 9. LASP does not negatively affect downstream task performance.

A.5.3 Quantitative Scalability Results

See Table 10.

Table 8: Convergence Results of LASP with Extended 300K Steps. Both TNL and Linear Transformer with 0.4B parameters are tested with a batch size of 2 and a sequence length of 16K.

| Model | Parameters | Steps | Method | Loss | PPL | Method | Loss | PPL |
|---|---|---|---|---|---|---|---|---|
| TNL | 0.4B | 300K | DDP | 3.218 | 9.318 | LASP+DDP | 3.218 | 9.321 |
| Linear Transformer | 0.4B | 300K | DDP | 4.164 | 17.972 | LASP+DDP | 4.145 | 17.730 |

Table 9: Evaluation Results on Downstream Tasks. HS: HellaSwag, WG: WinoGrande. A higher score indicates better performance.

| Model | Method | Tokens | PIQA | HS | WG | ARC-E | ARC-C | OBQA | CSR-AVG |
|---|---|---|---|---|---|---|---|---|---|
| TNL | DDP | 40B | 55.71 | 28.21 | 51.30 | 28.87 | 23.72 | 26.00 | 35.64 |
| TNL | LASP+DDP | 40B | 54.30 | 28.17 | 51.54 | 31.27 | 24.06 | 29.60 | 36.49 |
| Linear Transformer | DDP | 40B | 52.18 | 25.68 | 49.80 | 26.81 | 25.60 | 26.40 | 34.93 |
| Linear Transformer | LASP+DDP | 40B | 52.18 | 26.07 | 49.25 | 26.22 | 26.71 | 27.00 | 35.44 |

Table 10: Quantitative Scalability Results of LASP on Throughput (tokens/sec) and Memory Usage Per GPU (GB).
Experiments are performed on TNL-1B, scaling the sequence length from 2K to 4096K with a batch size of 1. Both DDP and FSDP backends are tested.

| Sequence Length | GPUs | LASP+DDP Throughput | LASP+DDP Memory | LASP+FSDP Throughput | LASP+FSDP Memory |
|---|---|---|---|---|---|
| 2K | 16 | 1893.3 | 22.5 | 1780.5 | 6.9 |
| 2K | 32 | 1645.4 | 22.5 | 1671.2 | 6.6 |
| 2K | 64 | 1639.7 | 22.5 | 1589.8 | 6.4 |
| 2K | 128 | 1610.9 | 22.5 | 1566.2 | 6.2 |
| 4K | 16 | 3686.9 | 22.5 | 3519.9 | 6.9 |
| 4K | 32 | 3458.4 | 22.5 | 3304.4 | 6.6 |
| 4K | 64 | 3245.3 | 22.5 | 3152.2 | 6.4 |
| 4K | 128 | 3211.5 | 22.5 | 3075.7 | 6.2 |
| 8K | 16 | 7076.9 | 22.5 | 6924.8 | 6.9 |
| 8K | 32 | 7319.3 | 22.5 | 6472.9 | 6.6 |
| 8K | 64 | 6869.1 | 22.5 | 6459.4 | 6.4 |
| 8K | 128 | 6793.6 | 22.5 | 6398.4 | 6.2 |
| 16K | 16 | 14036.8 | 22.5 | 13513.7 | 6.9 |
| 16K | 32 | 14671.7 | 22.5 | 12978.9 | 6.6 |
| 16K | 64 | 13828.6 | 22.5 | 12569.4 | 6.4 |
| 16K | 128 | 13484.5 | 22.5 | 12184.5 | 6.2 |
| 32K | 16 | 28354.6 | 24.4 | 25727.2 | 6.9 |
| 32K | 32 | 27863.6 | 22.5 | 26646.4 | 6.6 |
| 32K | 64 | 25275.9 | 22.5 | 25201.4 | 6.4 |
| 32K | 128 | 24523.8 | 22.5 | 25638.9 | 6.2 |
| 64K | 16 | 52993.1 | 28.3 | 48542.8 | 11.0 |
| 64K | 32 | 53393.2 | 24.4 | 49648.6 | 6.6 |
| 64K | 64 | 52024.2 | 22.5 | 49780.5 | 6.4 |
| 64K | 128 | 51983.3 | 22.5 | 49833.3 | 6.2 |
| 128K | 16 | 107682 | 36.1 | 84901.9 | 19.0 |
| 128K | 32 | 93371.5 | 28.3 | 92718.8 | 10.6 |
| 128K | 64 | 100046 | 24.4 | 96771.6 | 6.4 |
| 128K | 128 | 95828.5 | 22.5 | 98975.9 | 6.2 |
| 256K | 16 | 202057 | 51.7 | 136765 | 35.2 |
| 256K | 32 | 190675 | 36.1 | 159326 | 18.7 |
| 256K | 64 | 193341 | 28.3 | 170996 | 10.4 |
| 256K | 128 | 187347.7 | 24.4 | 178628.4 | 6.3 |
| 512K | 16 | OOM | OOM | 201791 | 67.5 |
| 512K | 32 | 323596 | 51.7 | 250663 | 34.8 |
| 512K | 64 | 304366 | 36.1 | 284803 | 18.5 |
| 512K | 128 | 295128.5 | 28.3 | 298755 | 10.1 |
| 1024K | 16 | OOM | OOM | OOM | OOM |
| 1024K | 32 | OOM | OOM | 358478 | 67.1 |
| 1024K | 64 | 523119 | 51.7 | 437728 | 34.6 |
| 1024K | 128 | 508383 | 36.1 | 459794 | 18.2 |
| 2048K | 16 | OOM | OOM | OOM | OOM |
| 2048K | 32 | OOM | OOM | OOM | OOM |
| 2048K | 64 | OOM | OOM | 585326 | 66.9 |
| 2048K | 128 | 658432 | 51.7 | 597953 | 33.8 |
| 4096K | 16 | OOM | OOM | OOM | OOM |
| 4096K | 32 | OOM | OOM | OOM | OOM |
| 4096K | 64 | OOM | OOM | OOM | OOM |
| 4096K | 128 | OOM | OOM | 792705 | 66.2 |

(Sequence-length labels are restored from the caption's stated 2K-to-4096K doubling range; rows are grouped by sequence length across 16, 32, 64, and 128 GPUs.)