# Exploring Transformer Extrapolation

Zhen Qin1,2*, Yiran Zhong1*, Hui Deng3
1OpenNLPLab, Shanghai AI Lab, Shanghai, China; 2TapTap, Shanghai, China; 3Northwestern Polytechnical University, Shaanxi, China
{zhenqin950102, zhongyiran}@gmail.com, denghui986@foxmail.com
*These authors contributed equally. Corresponding author.

Length extrapolation has attracted considerable attention recently, since it allows transformers to be tested on sequences longer than those used in training. Previous research has shown that this property can be attained by using carefully designed Relative Positional Encodings (RPEs). While these methods perform well on a variety of corpora, the conditions for length extrapolation have yet to be investigated. This paper attempts to determine what types of RPEs allow for length extrapolation through a thorough mathematical and empirical analysis. We discover that a transformer is certain to possess this property as long as the series that corresponds to the exponential of the RPE converges. Two practices are derived from the conditions and examined in language modeling tasks on a variety of corpora. As a bonus from the conditions, we derive a new Theoretical Receptive Field (TRF) to measure the receptive field of RPEs without taking any training steps. Extensive experiments are conducted on the Wikitext-103, Books, Github, and WikiBook datasets to demonstrate the viability of our discovered conditions. We also compare TRF to the Empirical Receptive Field (ERF) across different models, showing consistently matched trends on these datasets. Code is released at https://github.com/OpenNLPLab/Rpe.

## Introduction

The transformer (Vaswani et al. 2017) is advancing steadily in the areas of natural language processing (Qin et al. 2023b; Devlin et al. 2019; Liu et al. 2019; Qin et al. 2022b,a; Liu et al. 2022; Qin and Zhong 2023), computer vision (Dosovitskiy et al. 2020; Sun et al. 2022b; Lu et al. 2022; Hao et al. 2024), and audio processing (Gong, Chung, and Glass 2021; Akbari et al. 2021; Gulati et al. 2020; Sun et al. 2022a). Although it outperforms other architectures such as RNNs (Cho et al. 2014; Qin, Yang, and Zhong 2023) and CNNs (Kim 2014; Hershey et al. 2016; Gehring et al. 2017) in many sequence modeling tasks, its lack of length extrapolation capability limits its ability to handle a wide range of sequence lengths, i.e., inference sequences must be equal to or shorter than training sequences. Increasing the training sequence length is only a temporary remedy, because the space-time complexity grows quadratically with the sequence length. Another option is to extend the inference sequence length by converting the trained full attention blocks into sliding window attention blocks (Beltagy, Peters, and Cohan 2020), but this results in significantly worse efficiency than full attention (Press, Smith, and Lewis 2022). How to resolve this issue permanently, without incurring additional costs, has emerged as a new topic.

A mainstream solution for length extrapolation is to design a Relative Positional Encoding (RPE) (Qin et al. 2023c) that concentrates attention on neighboring tokens. For example, ALiBi (Press, Smith, and Lewis 2022) applies linearly decaying biases to the attention to reduce the contribution from distant tokens. Kerple (Chi et al. 2022)
investigates shift-invariant conditionally positive definite kernels for RPEs and proposes a collection of kernels that promote the length extrapolation property; it also shows that ALiBi is one of its instances. Sandwich (Chi, Fan, and Rudnicky 2022) proposes a hypothesis to explain the secret behind ALiBi and empirically supports it by integrating the hypothesis into sinusoidal positional embeddings.

In order to investigate transformer extrapolation, we first establish a hypothesis, based on empirical analysis, regarding why existing RPE-based length extrapolation methods (Qin et al. 2023a) have the capacity to extrapolate to longer sequences in inference. Then we identify the conditions on RPEs that satisfy the hypothesis through mathematical analysis. Finally, the discovered conditions are empirically validated on a variety of corpora. Specifically, we assume that, due to their decay biases, existing RPE-based length extrapolation methods behave similarly to sliding window attention, i.e., only tokens within a certain range can influence the attention scores. A transformer can extrapolate for certain in this scenario, since the out-of-range tokens have no effect on the attention outcomes. We derive that a transformer is guaranteed to satisfy this hypothesis if the series corresponding to the exponential of its RPE converges. Based on this observation, we show that previous RPE-based methods (Press, Smith, and Lewis 2022; Chi et al. 2022) can be seen as particular instances of the conditions. Two new practices derived from the conditions are evaluated in language modeling.

The observed conditions not only shed light on the secret of length extrapolation but also offer a new perspective on computing the Theoretical Receptive Field (TRF) of RPEs. In contrast to prior approaches that require training gradients to compute receptive fields, we propose a way to calculate TRF that is based solely on the formulation of the RPE. Extensive experiments on various datasets validate the conditions. The TRF calculated by our method substantially matches the trend of the Empirical Receptive Field (ERF) in real-world scenarios.

## Preliminary

Before embarking on the journey of exploration, we introduce several preliminary concepts that will be used throughout the paper, such as softmax attention, relative positional encoding, length extrapolation, and sliding window attention. We also provide the necessary notation for the subsequent analysis, i.e., we use $M$ to denote a matrix and $m_i$ to represent the $i$-th row of $M$. The complete mathematical notation can be found in the Appendix. Following previous work (Press, Smith, and Lewis 2022), we restrict our analysis to causal language models and assume that the maximum sequence length during training is $m$.

### Softmax Attention

Softmax attention is a key component of transformers, operating on query $Q$, key $K$, and value $V$ matrices. Each matrix is a linear map that takes $X \in \mathbb{R}^{n\times d}$ as input:
$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V \in \mathbb{R}^{n\times d}, \tag{1}$$
where $n$ is the sequence length and $d$ is the dimension of the hidden feature. The output attention matrix $O \in \mathbb{R}^{n\times d}$ can be formulated as:
$$O = \mathrm{Softmax}(QK^\top/\sqrt{d})V. \tag{2}$$
To prevent information leakage in causal language modeling, a mask matrix $M \in \mathbb{R}^{n\times n}$ is used to ensure that current tokens can only see previous tokens and themselves. The lower triangular elements of $M$ are $0$, and the upper triangular ones, except for the diagonal, are $-\infty$.
Then the output attention matrix $O$ for causal language models becomes:
$$O = \mathrm{Softmax}(QK^\top/\sqrt{d} + M)V. \tag{3}$$
Note that Eq. 3 can be seen as a general form of attention, i.e., when the elements of $M$ are all $0$, Eq. 3 degenerates to Eq. 2. For ease of discussion, we use Eq. 3 to represent the attention computation.

### Relative Positional Encoding

Positional encoding is designed to inject positional bias into transformers. Absolute Positional Encoding (APE) (Vaswani et al. 2017; Gehring et al. 2017) and Relative Positional Encoding (RPE) (Su et al. 2021; Liutkus et al. 2021; Press, Smith, and Lewis 2022; Chi et al. 2022) are the two most common types. In this paper, we focus on RPE because it is the key to length extrapolation, as shown in (Press, Smith, and Lewis 2022). Attention with RPE can be written as:
$$O = \mathrm{Softmax}(QK^\top/\sqrt{d} + M + P)V, \tag{4}$$
where $P \in \mathbb{R}^{n\times n}$ is a Toeplitz matrix that encodes relative positional information, i.e., $p_{ij} = p_{i-j}$. It is worth noting that $M$ and $P$ can be merged, and the merged matrix is still a Toeplitz matrix. We use $R$ to represent the merged matrix and rewrite Eq. 4 as:
$$O = \mathrm{Softmax}(QK^\top/\sqrt{d} + R)V. \tag{5}$$

### Definition of Length Extrapolation

The property of length extrapolation allows a model to be tested on longer sequences than those used in training. Previous sequence modeling structures such as RNNs (Hochreiter and Schmidhuber 1997) and CNNs (Gehring et al. 2017) often possess this property naturally, but it is difficult for transformers. The property is only present in sliding window transformers and a few transformer variants with specifically designed RPEs (Chi et al. 2022; Press, Smith, and Lewis 2022; Chi, Fan, and Rudnicky 2022). In language modeling, a token can only see itself and previous tokens. Therefore, regardless of the sequence length, the performance should remain stable for neighboring tokens that lie within the training sequence length (Beltagy, Peters, and Cohan 2020). For tokens that fall out of this range, the performance degrades if the model does not support length extrapolation (Press, Smith, and Lewis 2022). Based on the observation above, we give a definition of length extrapolation:

Definition 0.1. For a language model $F$ and a dataset $X$, if for any $n$ it holds that
$$|\mathrm{ppl}_n(X, F) - \mathrm{ppl}_m(X, F)|/\mathrm{ppl}_m(X, F) < \delta, \tag{6}$$
then $F$ is considered to have the extrapolation property. Here $\delta > 0$ is a small constant, and $\mathrm{ppl}_n(X, F)$ denotes the perplexity of $F$ evaluated with a maximum sequence length of $n$ on the dataset $X$. Empirically, if $|\mathrm{ppl}_n(X, F) - \mathrm{ppl}_m(X, F)|/\mathrm{ppl}_m(X, F)$ becomes very large ($\gg 1$) as $n$ increases, we consider that $F$ does not have the extrapolation property.

### Sliding Window Attention

For the convenience of subsequent discussions, we define window attention at position $i$ with window size $j$ as follows:
$$o_i^j = \frac{\sum_{i-j+1 \le s \le i} \exp(q_i^\top k_s/\sqrt{d})\exp(r_{is})\, v_s}{\sum_{i-j+1 \le t \le i} \exp(q_i^\top k_t/\sqrt{d})\exp(r_{it})} = \frac{1}{C_{ij}}\sum_{i-j+1 \le s \le i} c_{is} v_s, \tag{7}$$
where $C_{ij} = \sum_{i-j+1 \le t \le i} c_{it}$, $c_{ij} = a_{ij} b_{ij}$, $a_{ij} = \exp(q_i^\top k_j/\sqrt{d})$, $b_{ij} = \exp(r_{ij})$, $j \le i$. We further assume $\|x_i\| \le l$ for $x \in \{q, k, v\}$, where $l > 0$ is a constant. Here $o_i^j$ represents the attention output of the $i$-th token when it attends to the window of $j$ tokens ending at position $i$. Note that window attention naturally possesses the length extrapolation ability.

There are two ways to run inference with window attention: nonoverlapping inference and sliding window inference, as shown on the right of Figure 1. In sliding window inference, the tokens within each sliding window must be re-encoded multiple times, making it substantially slower than the nonoverlapping variant.
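To make the notation above concrete, the following is a minimal NumPy sketch (our own illustration, not the released code) of causal attention with a merged Toeplitz bias $R$ (Eq. 5) and of the windowed output $o_i^j$ from Eq. 7; for the full window $j = i$ the two coincide. The ALiBi-style slope and all helper names are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf logits receive zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_rpe(Q, K, V, r):
    """Eq. 5: O = Softmax(QK^T/sqrt(d) + R)V, where R merges the causal mask
    with a Toeplitz RPE; r[t] is the bias at relative distance t = i - j >= 0."""
    n, d = Q.shape
    rel = np.arange(n)[:, None] - np.arange(n)[None, :]
    R = np.where(rel >= 0, r[np.clip(rel, 0, n - 1)], -np.inf)
    return softmax(Q @ K.T / np.sqrt(d) + R) @ V

def window_output(Q, K, V, r, i, j):
    """Eq. 7: o_i^j, the output of token i (1-indexed) attending only to the j
    most recent tokens, i.e. positions i-j+1 .. i."""
    d = Q.shape[1]
    s = np.arange(i - j, i)                              # 0-indexed window positions
    logits = K[s] @ Q[i - 1] / np.sqrt(d) + r[(i - 1) - s]
    return softmax(logits) @ V[s]

n, d, slope = 16, 4, 0.5
r = -slope * np.arange(n)                                # ALiBi-style linear decay bias
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

O = attention_with_rpe(Q, K, V, r)
i = n
print(np.allclose(O[i - 1], window_output(Q, K, V, r, i, j=i)))      # True: o_i^i is row i of O
print(np.linalg.norm(O[i - 1] - window_output(Q, K, V, r, i, j=4)))  # delta(i, j) for j = 4
```

The last line previews the quantity $\delta(i, j) = \|o_i^i - o_i^j\|$ that the analysis below is built around.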
In Table 1 we compare the average inference time, over a group of window sizes, of sliding window inference and nonoverlapping window inference. The sliding window variant is more than 44 times slower than the nonoverlapping one. However, as shown on the left of Figure 1, sliding window inference achieves a much lower ppl than nonoverlapping inference.

Figure 1: Sliding window inference vs. nonoverlapping inference. The right panel illustrates the difference between sliding window inference and nonoverlapping inference. The left panel shows the ppl curves of Sliding Window and Nonoverlapping Window computed by a language model at different inference window sizes.

| Method | Relative avg. inference time |
| --- | --- |
| Sliding Window | 44.35 |
| Nonoverlapping Window | 1.00 |
| ALiBi | 1.00 |

Table 1: Relative average inference time. We compute the relative average inference time of sliding window inference and nonoverlapping inference over a set of window sizes {16, 32, 64, 128, 256, 512}. We also include the ALiBi inference time as a reference.

Figure 2: Visualization of attention reweighting. We plot the reweighting schema of (a) sliding window attention and (b) the ALiBi linear decay bias. They share a similar behavior in that only neighboring tokens can influence the attention results.

## Transformer Extrapolation Exploration

In this section, we first describe the hypothesis about why existing RPE-based length extrapolation methods can extrapolate sequences in inference and provide empirical evidence for it. Then we derive the conditions for length extrapolation in detail and demonstrate that recent RPE-based length extrapolation methods (Chi et al. 2022; Press, Smith, and Lewis 2022) satisfy the conditions.

### The Hypothesis

A sliding window attention with window size $w$ is equivalent to the following RPE on full attention:
$$m_{ij} = \begin{cases} 0, & i - j \le w, \\ -\infty, & \text{otherwise}. \end{cases} \tag{8}$$
By comparing Eq. 8 with the corresponding RPE of ALiBi (Press, Smith, and Lewis 2022) in Figure 2, we can see that they share the same behavior: both concentrate attention on tokens inside a specified range. Also, in Figure 1, we show that the performance of ALiBi is similar to that of sliding window attention when the window size is sufficiently large. Based on these two observations, we make the following hypothesis:

Hypothesis 0.1. An RPE that makes a transformer extrapolatable needs to behave similarly to sliding window attention, i.e., $\delta(i, j)$ should satisfy:
$$\forall \epsilon > 0,\ \exists j_0,\ \text{s.t.}\ \forall j > j_0,\ \delta(i, j) < \epsilon, \tag{9}$$
where $\delta(i, j) \triangleq \|o_i^i - o_i^j\|$, and the window length $j$ needs to be sufficiently large.

In the following sections, we derive the conditions on RPEs that satisfy Eq. 9.

### The Conditions

Let us introduce the first lemma:

Lemma 0.2. When the following condition is satisfied, Eq. 9 holds:
$$\lim_{i \to \infty} C_{ii} \triangleq C < \infty. \tag{10}$$

Proof. When $i \le m$, the test sequence length is no longer than the maximum training sequence length $m$; taking $j = i$, we get $\|o_i^i - o_i^j\| = \|o_i^i - o_i^i\| = 0$. When $i > m$, we can reformulate Eq. 7 as:
$$o_i^i = \frac{\sum_{i-j+1\le s\le i} c_{is} v_s + \sum_{1\le s\le i-j} c_{is} v_s}{C_{ii}} = \frac{C_{ij}}{C_{ii}}\cdot\frac{\sum_{i-j+1\le s\le i} c_{is} v_s}{C_{ij}} + \frac{\sum_{1\le s\le i-j} c_{is} v_s}{C_{ii}}. \tag{11}$$
Therefore we have:
$$o_i^i - o_i^j = \left(\frac{C_{ij}}{C_{ii}} - 1\right)\frac{\sum_{i-j+1\le s\le i} c_{is} v_s}{C_{ij}} + \frac{\sum_{1\le s\le i-j} c_{is} v_s}{C_{ii}}. \tag{12}$$
Bounding the two parts with $\|v_s\| \le l$:
$$\delta(i, j) = \|o_i^i - o_i^j\| \le \left(1 - \frac{C_{ij}}{C_{ii}}\right)\frac{\sum_{i-j+1\le s\le i} c_{is}\,\|v_s\|}{C_{ij}} + \frac{\sum_{1\le s\le i-j} c_{is}\,\|v_s\|}{C_{ii}} \le \left(1 - \frac{C_{ij}}{C_{ii}}\right) l + \frac{C_{ii} - C_{ij}}{C_{ii}}\, l = 2l\left(1 - \frac{C_{ij}}{C_{ii}}\right). \tag{13}$$
According to Eq. 10, and since the tail of a convergent series is arbitrarily small, for any $C/2 > \epsilon > 0$ we can find a $j_0$ such that if $i \ge j > j_0$, then $C_{ii} - C_{ij} < \epsilon$. We can also find a $j_1$ such that if $i \ge j > j_1$, then $C - \epsilon < C_{ii} < C + \epsilon$. Taking $j_2 = \max(j_0, j_1)$, if $i \ge j > j_2$ we have:
$$C_{ii} - C_{ij} < \epsilon, \qquad C - \epsilon < C_{ii} < C + \epsilon. \tag{14}$$
So when $i \ge j > j_2$, we have:
$$\delta(i, j) \le 2\left(1 - \frac{C_{ij}}{C_{ii}}\right) l = \frac{2(C_{ii} - C_{ij})}{C_{ii}}\, l \le \frac{2\epsilon}{C - \epsilon}\, l \le \frac{2l\epsilon}{C - C/2} = \frac{4l\epsilon}{C}. \tag{15}$$
Hence, by the definition of the limit, Eq. 9 holds.

This lemma implies that, for any token, if the attention of the model focuses on its neighboring $j$ ($j \ge j_2$) tokens, the model has the length extrapolation property. The lemma matches our intuition. Does it mean that as long as an RPE follows the same principle, i.e., places more weight on the neighboring $j$ tokens, the model is guaranteed to have the length extrapolation property? In the following sections, we demonstrate that concentrating more weight on neighboring tokens does not by itself guarantee that a transformer has the length extrapolation property. Specifically, we provide a mathematical proof of sufficient conditions for an RPE to yield the length extrapolation property.

Theorem 0.3. When the following condition is satisfied, Eq. 9 holds:
$$\lim_{i\to\infty} B_{ii} < \infty, \qquad B_{ii} = \sum_{1\le t\le i} b_{it}. \tag{16}$$

Proof. Since we assume $\|q_i\| \le l$ and $\|k_i\| \le l$, we have:
$$a_{ij} = \exp(q_i^\top k_j/\sqrt{d}) \le \exp(l^2), \tag{17}$$
$$c_{ij} = a_{ij} b_{ij} \le \exp(l^2)\, b_{ij}, \qquad C_{ii} \le \exp(l^2) B_{ii}. \tag{18}$$
Therefore, Eq. 10 can be derived from Eq. 16. Combined with Lemma 0.2, the proof is concluded.

By leveraging the Toeplitz property of the RPE, Theorem 0.3 can be further simplified as:

Theorem 0.4. When the following condition is satisfied, Eq. 9 holds:
$$\lim_{i\to\infty} \sum_{t=1}^{i} b_{it} = \lim_{i\to\infty} \sum_{t=0}^{i-1} b_t < \infty. \tag{19}$$

Proof. According to the definition of RPE (i.e., $b_{it} = b_{i-t}$):
$$\sum_{1\le t\le i} b_{it} = \sum_{t=0}^{i-1} b_t. \tag{20}$$
This means that Eq. 16 is equivalent to:
$$\lim_{i\to\infty} B_{ii} = \lim_{i\to\infty} \sum_{t=0}^{i-1} b_t < \infty. \tag{21}$$

Theorem 0.4 indicates that as long as the series of $\exp(\mathrm{RPE})$ converges, the model is guaranteed to have the length extrapolation property. Based on this principle, we can mathematically determine whether an RPE allows for length extrapolation before conducting experiments, or design a variety of RPEs that support length extrapolation. In the Appendix, we show that previous methods such as ALiBi (Press, Smith, and Lewis 2022), Kerple (Chi et al. 2022), and Sandwich (Chi, Fan, and Rudnicky 2022) satisfy our derived conditions for length extrapolation.

## Theoretical Receptive Field

In the previous section, we established the conditions for length extrapolation. As an extra bonus, we can derive the Theoretical Receptive Field (TRF) for any RPE-based length extrapolation method. Let us start with the definition of the Empirical Receptive Field (ERF). ERF can be viewed as a window containing the vast majority of the information within the attention. Recalling Eq. 13 and setting $1 - C_{ij}/C_{ii} = \epsilon$, i.e., $C_{ij} = C_{ii}(1 - \epsilon)$, we can define:
$$n_{\mathrm{emp}}(\epsilon) = \inf_j \left( C_{ij} > C_{ii}(1 - \epsilon) \right),$$
where $n_{\mathrm{emp}}(\epsilon)$ is the ERF, i.e., the minimal window length required to maintain the performance within a gap of $\epsilon$. Intuitively, ERF can be viewed as the smallest window that contains the majority of the information within an attention. Since it is related to both $a_{ij}$ and $b_{ij}$, it can only be calculated after training.
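As a concrete illustration of $n_{\mathrm{emp}}$, the following sketch (our own, not the paper's code) computes the ERF of a single query position from its attention weights $c_{is}$; the ALiBi-style decay and the random content scores are stand-ins for a trained model, and the constants are arbitrary.

```python
import numpy as np

def erf(c_row, eps):
    """Empirical receptive field n_emp(eps) for one query position i: the smallest
    window size j such that C_{ij} > C_{ii} * (1 - eps), where c_row[s] holds c_{is}
    for s = 1..i, oldest token first."""
    C_ii = c_row.sum()
    # C_{ij} sums the j most recent tokens, i.e. the last j entries of c_row.
    window_sums = np.cumsum(c_row[::-1])
    return int(np.argmax(window_sums > C_ii * (1.0 - eps))) + 1

# Stand-in attention weights for one query: c_{is} = a_{is} * b_{is}, with random
# content scores a and an ALiBi-style positional factor b (illustrative only).
i, slope = 2048, 0.05
rng = np.random.default_rng(0)
a = np.exp(rng.standard_normal(i) * 0.1)      # bounded content term, cf. Eq. 17
b = np.exp(-slope * np.arange(i)[::-1])       # b_{is} = exp(-slope * (i - s))
c = a * b

print(erf(c, eps=0.05))   # window length needed to keep 95% of the attention mass
```

Because $a_{is}$ comes from a trained model in practice, this quantity is only available after training, which motivates the training-free TRF defined next.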
Now we define the TRF, which allows us to estimate the receptive field without training. To accomplish this, we consider an upper bound of $C_{ij}$. From the definition of $C_{ij}$ and Eq. 17, $C_{ij}$ is upper bounded by $\exp(l^2) B_{ij}$. Therefore, we can define the TRF $n^{b}_{\mathrm{the}}(\epsilon)$ with respect to the series $b_t$ as:
$$n_{\mathrm{the}}(\epsilon) = \inf_j \left( B_{ij} > B(1-\epsilon) \right) = \inf_j \left( \sum_{t=0}^{j-1} b_t > B(1-\epsilon) \right) = \inf_j \left( \sum_{t \ge j} b_t < B\epsilon \right), \tag{22}$$
where $B = \lim_{j\to\infty} \sum_{t=0}^{j-1} b_t$. We may at times find it difficult to give the analytical form of the partial sums of the series, but we can still compute the TRF numerically, or compare the TRFs of different RPEs using the theorem below:

Theorem 0.5. Let $\{\alpha_t\}$ and $\{\beta_t\}$ be two positive series such that:
$$\lim_{t\to\infty} \frac{\alpha_t}{\beta_t} = 0, \qquad \alpha \triangleq \lim_{j\to\infty} \sum_{t=0}^{j} \alpha_t < \infty, \qquad \beta \triangleq \lim_{j\to\infty} \sum_{t=0}^{j} \beta_t < \infty. \tag{23}$$
Then:
$$n^{\alpha}_{\mathrm{the}}(\epsilon) \le n^{\beta}_{\mathrm{the}}(\epsilon), \qquad \epsilon \to 0. \tag{24}$$

Proof. According to Eq. 23, there exists $t_0 > 0$ such that, when $t > t_0$, we have:
$$\alpha_t \le \frac{\alpha}{\beta}\, \beta_t. \tag{25}$$
Let
$$\epsilon < \epsilon_0, \quad \text{where } n^{\beta}_{\mathrm{the}}(\epsilon_0) = t_0; \tag{26}$$
then we get:
$$\sum_{t \ge n^{\beta}_{\mathrm{the}}(\epsilon)} \beta_t \le \beta\epsilon, \qquad n^{\beta}_{\mathrm{the}}(\epsilon) > t_0. \tag{27}$$
Therefore,
$$\sum_{t \ge n^{\beta}_{\mathrm{the}}(\epsilon)} \alpha_t \le \frac{\alpha}{\beta} \sum_{t \ge n^{\beta}_{\mathrm{the}}(\epsilon)} \beta_t \le \alpha\epsilon.$$
According to Eq. 22, we have:
$$n^{\alpha}_{\mathrm{the}}(\epsilon) \le n^{\beta}_{\mathrm{the}}(\epsilon). \tag{28}$$

The TRF follows the same trend as the $\exp(\mathrm{RPE})$ series: the smaller (i.e., the faster decaying) the series, the smaller the TRF. We provide several examples of how to compute the TRF in the Appendix.

## Two New RPEs

Based on the proven conditions for length extrapolation, we can design infinitely many RPEs with the length extrapolation property. Here, we propose two new RPEs to empirically validate the conditions and the hypothesis, namely:
$$\mathrm{Type\ 1}: \ b_n = \frac{1}{n^2} = \exp(-2\ln n), \qquad \mathrm{Type\ 2}: \ b_n = \exp(-\ln^2 n).$$
The corresponding TRF of Type 1 can be estimated from the tail bound
$$\sum_{i \ge j} \frac{1}{(i+1)^2} \le \int_{j}^{\infty} \frac{1}{x^2}\, dx = \frac{1}{j},$$
so that
$$n_{\mathrm{the}}(\epsilon) = \inf_j \left( B_{ij} > B(1-\epsilon) \right) \approx \inf_j \left( \frac{1}{j} < B\epsilon \right) = \Theta\!\left(\frac{1}{\epsilon}\right).$$
For Type 2, it is difficult to provide the analytical form of its TRF. However, we can prove that the TRF of Type 2 is smaller than the TRF of Type 1 using Theorem 0.5 and the observation that, for any $c_1 > 0$, there exists $c_2 > 0$ such that
$$\exp(-\ln^2 n) \le \frac{c_1}{n^2} \quad \text{for all } n > c_2,$$
i.e., $\exp(-\ln^2 n) = o(1/n^2)$.
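Eq. 22 can also be evaluated numerically for the two proposed RPEs. The sketch below is our own illustration (the truncation length, the $\epsilon$ values, and the choice $b_0 := 1$ are assumptions, since the formulas above are only defined for $n \ge 1$); it confirms the ordering predicted by Theorem 0.5.

```python
import numpy as np

def trf(b, eps):
    """Numerical TRF per Eq. 22: the smallest j whose tail sum of b_t falls below B * eps.
    b is a long, truncated array approximating b_0, b_1, ...; its total sum stands in for B."""
    B = b.sum()
    tails = B - np.cumsum(b)                  # tails[j-1] = sum_{t >= j} b_t
    return int(np.argmax(tails < B * eps)) + 1

n = np.arange(1, 1_000_000, dtype=np.float64)
type1 = np.concatenate(([1.0], 1.0 / n**2))              # b_n = 1/n^2 (b_0 := 1, an assumption)
type2 = np.concatenate(([1.0], np.exp(-np.log(n)**2)))   # b_n = exp(-ln^2 n)

for eps in (0.1, 0.05, 0.01):
    print(eps, trf(type1, eps), trf(type2, eps))
# At every eps, the Type 2 TRF comes out smaller than the Type 1 TRF, as Theorem 0.5 predicts.
```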
## Empirical Validation

### Setting

All models are implemented in Fairseq (Ott et al. 2019) and trained on 8 V100 GPUs. We use the same model architecture and training configuration for all RPE variants to ensure fairness. For Wikitext-103 (Merity et al. 2016), which is a relatively small dataset, we use a 6-layer transformer decoder with an embedding size of 512. For the other datasets, we use a 12-layer transformer decoder with an embedding size of 768. The evaluation metric is perplexity (PPL), and the maximum sequence length during training is 512. The detailed hyper-parameter settings are listed in the Appendix.

### Dataset

We conduct experiments on Wikitext-103 (Merity et al. 2016), Books (Zhu et al. 2015), Github (Gao et al. 2020), and WikiBook (Wettig et al. 2022). Wikitext-103 is a small dataset containing a preprocessed version of Wikipedia; it is widely used in NLP papers. Books contains a large number of novels, making it a good corpus for long sequence processing. Github consists of a sizable collection of open-source repositories, the majority of which are written in programming languages. WikiBook is a 22-gigabyte corpus of Wikipedia articles and books curated by Wettig et al. (2022); it is used to validate the performance of the various models on a large dataset.

### Validating the Sufficiency

To empirically validate the sufficiency of our discovered conditions, we integrate the two RPEs proposed in the previous section into transformers and test their length extrapolation capability on the Wikitext-103, Books, Github, and WikiBook datasets. We increase the inference sequence length from 512 to 9216 tokens and plot the testing PPLs of our proposed RPEs, as well as those of existing methods such as ALiBi, Kerple, and Sandwich, in Figure 3. All of these methods demonstrate good length extrapolation capability. However, the stabilized PPL may vary due to the effectiveness of the different positional encoding strategies, which is not considered in this paper. We include Sinusoidal (Vaswani et al. 2017) positional encoding as a reference method that cannot extrapolate; its PPL grows rapidly as the inference sequence length increases.

Figure 3: Sufficiency validation on the Wikitext-103, Books, Github, and WikiBook datasets (top to bottom). To test length extrapolation capability, we lengthen inference sequences from 512 to 9216 tokens and plot the testing PPLs of our proposed Type 1 and Type 2 RPEs, as well as Sinusoidal, ALiBi, Kerple-Log, Kerple-Power, and Sandwich (x-axis: inference length).

### Validating the Necessity

Although we only provide a mathematical proof for the sufficiency of our discovered conditions, we also attempt to verify their necessity empirically. Specifically, we pick two RPEs that come very close to satisfying Theorem 0.4, namely:
$$\mathrm{Example\ 1}: \ b_n = \frac{1}{n}, \qquad \mathrm{Example\ 2}: \ b_n = \frac{1}{n \ln n}.$$
Note that both of them concentrate their weight on neighboring tokens. A brief proof that these RPEs do not satisfy Theorem 0.4:
$$\sum_{n=1}^{k} \frac{1}{n} > \int_{1}^{k+1} \frac{1}{x}\,dx = \ln(k+1) \to \infty, \qquad \sum_{n=3}^{k} \frac{1}{n \ln n} > \int_{3}^{k+1} \frac{1}{x \ln x}\,dx = \ln\ln(k+1) - \ln\ln 3 \to \infty.$$
We then empirically test their length extrapolation capability on the Wikitext-103, Books, Github, and WikiBook datasets by scaling the inference sequence length from 512 to 9216 tokens. As shown in Figure 4, the PPLs of both RPEs grow rapidly as the length of the testing sequence increases, which demonstrates that neither of them can extrapolate. We also include the Type 1 RPE in Figure 4 as a reference.

Figure 4: Necessity validation on the Wikitext-103, Books, Github, and WikiBook datasets (top to bottom). We select two RPEs that do not satisfy Theorem 0.4, i.e., $b_n = \frac{1}{n}$ and $b_n = \frac{1}{n \ln n}$, and include Type 1 as a reference (x-axis: inference length).

### Validating TRF

We validate our proposed TRF by comparing its trend with that of the ERF. We plot the TRFs and ERFs of ALiBi, Kerple, Sandwich, and our proposed RPEs on the aforementioned datasets. As observed in Figure 6 and Figure 5, while the curves vary across datasets, TRF estimates a similar overall trend to the ERFs.

### Visualizing RPE

We visualize the weighting schemes of Type 1 and Type 2 in Figure 7, i.e., the heatmap of $\exp(\mathrm{RPE})$.
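The heatmaps in Figure 7 can be reproduced in spirit with a few lines (our own sketch; the diagonal value $b_0 := 1$ and the window statistic at the end are assumptions made for illustration):

```python
import numpy as np

def exp_rpe_heatmap(b, n):
    """Toeplitz weighting matrix W with W[i, j] = b_{i-j} for j < i, 1 on the diagonal
    (assumption for b_0), and 0 above the diagonal, i.e. the exp(RPE) pattern."""
    idx = np.arange(n)
    rel = idx[:, None] - idx[None, :]                 # relative distance i - j
    W = np.where(rel >= 0, b(np.maximum(rel, 1)), 0.0)
    W[np.diag_indices(n)] = 1.0
    return W

n = 64
type1 = exp_rpe_heatmap(lambda t: 1.0 / t**2, n)                 # b_n = 1/n^2
type2 = exp_rpe_heatmap(lambda t: np.exp(-np.log(t)**2), n)      # b_n = exp(-ln^2 n)

# Share of the last query's weight that falls on its 8 nearest tokens.
row1, row2 = type1[-1], type2[-1]
k = 8
print(row1[-k:].sum() / row1.sum())   # Type 1
print(row2[-k:].sum() / row2.sum())   # higher for Type 2: its weights concentrate closer
```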
Type 2 concentrates weights on closer neighboring tokens than Type 1, indicating a smaller TRF and ERF, as shown in Figure 6 and Figure 5. We also visualize the other methods in the Appendix.

Figure 5: ERF of ALiBi, Kerple-Log, Kerple-Power, Sandwich, and our proposed Type 1 and Type 2 methods on the Wikitext-103, Books, Github, and WikiBook datasets, computed from trained models. ERF is normalized for better visualization.

Figure 6: Numerically computed TRFs for existing methods and our proposed methods. TRF is normalized for visualization. The TRF of Type 1 is larger than that of Type 2, which matches Theorem 0.5 and our analysis.

Figure 7: Heatmap of $\exp(\mathrm{RPE})$ for (a) Type 1 and (b) Type 2. Type 2 concentrates weights on closer neighboring tokens than Type 1, indicating a smaller TRF.

## Conclusion

In this paper, we explore the secrets of transformer length extrapolation in language modeling. We first make a hypothesis about extrapolation and then derive sufficient conditions for an RPE to yield the length extrapolation property. A thorough mathematical analysis reveals that a transformer model is certain to be capable of length extrapolation if the series that corresponds to the exponential of its RPE converges. This observation brings an extra bonus: we can estimate the TRF of an RPE solely based on its formulation. We choose two new RPEs that satisfy the conditions and two that do not, and empirically validate the conditions on four widely used datasets. We also validate our TRFs by comparing them with ERFs on these datasets. The results show that our TRFs can accurately reflect the actual receptive fields of RPEs before training.

## Acknowledgements

This work is partially supported by the National Key R&D Program of China (NO.2022ZD0160100).

## References

Akbari, H.; Yuan, L.; Qian, R.; Chuang, W.-H.; Chang, S.-F.; Cui, Y.; and Gong, B. 2021. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. arXiv preprint arXiv:2104.11178.

Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.

Chi, T.-C.; Fan, T.-H.; Ramadge, P. J.; and Rudnicky, A. I. 2022. KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation. arXiv preprint arXiv:2205.09921.

Chi, T.-C.; Fan, T.-H.; and Rudnicky, A. I. 2022. Receptive Field Alignment Enables Transformer Length Extrapolation. arXiv preprint arXiv:2212.10356.

Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724-1734. Doha, Qatar: Association for Computational Linguistics.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186. Minneapolis, Minnesota: Association for Computational Linguistics.

Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929.

Gao, L.; Biderman, S.; Black, S.; Golding, L.; Hoppe, T.; Foster, C.; Phang, J.; He, H.; Thite, A.; Nabeshima, N.; Presser, S.; and Leahy, C. 2020. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027.

Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional Sequence to Sequence Learning. In International Conference on Machine Learning, 1243-1252. PMLR.

Gong, Y.; Chung, Y.-A.; and Glass, J. 2021. AST: Audio Spectrogram Transformer. In Proc. Interspeech 2021, 571-575.

Gulati, A.; Chiu, C.-C.; Qin, J.; Yu, J.; Parmar, N.; Pang, R.; Wang, S.; Han, W.; Wu, Y.; Zhang, Y.; and Zhang, Z. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition.

Hao, D.; Mao, Y.; He, B.; Han, X.; Dai, Y.; and Zhong, Y. 2024. Improving Audio-Visual Segmentation with Bidirectional Generation. In Proceedings of the AAAI Conference on Artificial Intelligence.

Hershey, S.; Chaudhuri, S.; Ellis, D. P. W.; Gemmeke, J. F.; Jansen, A.; Moore, R. C.; Plakal, M.; Platt, D.; Saurous, R. A.; Seybold, B.; Slaney, M.; Weiss, R. J.; and Wilson, K. W. 2016. CNN Architectures for Large-Scale Audio Classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 131-135.

Hochreiter, S.; and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation, 9(8): 1735-1780.

Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. In Conference on Empirical Methods in Natural Language Processing.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.

Liu, Z.; Li, D.; Lu, K.; Qin, Z.; Sun, W.; Xu, J.; and Zhong, Y. 2022. Neural Architecture Search on Efficient Transformers and Beyond. arXiv preprint arXiv:2207.13955.

Liutkus, A.; Cífka, O.; Wu, S.-L.; Simsekli, U.; Yang, Y.-H.; and Richard, G. 2021. Relative Positional Encoding for Transformers with Linear Complexity. In International Conference on Machine Learning, 7067-7079. PMLR.

Lu, K.; Liu, Z.; Wang, J.; Sun, W.; Qin, Z.; Li, D.; Shen, X.; Deng, H.; Han, X.; Dai, Y.; and Zhong, Y. 2022. Linear Video Transformer with Feature Fixation. arXiv preprint arXiv:2210.08164.

Merity, S.; Xiong, C.; Bradbury, J.; and Socher, R. 2016. Pointer Sentinel Mixture Models. arXiv preprint arXiv:1609.07843.

Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; and Auli, M. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv preprint arXiv:1904.01038.

Press, O.; Smith, N.; and Lewis, M. 2022. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. In International Conference on Learning Representations.

Qin, Z.; Han, X.; Sun, W.; He, B.; Li, D.; Li, D.; Dai, Y.; Kong, L.; and Zhong, Y. 2023a. Toeplitz Neural Network for Sequence Modeling. In The Eleventh International Conference on Learning Representations.
Qin, Z.; Han, X.; Sun, W.; Li, D.; Kong, L.; Barnes, N.; and Zhong, Y. 2022a. The Devil in Linear Transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 7025-7041. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.

Qin, Z.; Li, D.; Sun, W.; Sun, W.; Shen, X.; Han, X.; Wei, Y.; Lv, B.; Yuan, F.; Luo, X.; Qiao, Y.; and Zhong, Y. 2023b. Scaling TransNormer to 175 Billion Parameters. arXiv preprint arXiv:2307.14995.

Qin, Z.; Sun, W.; Deng, H.; Li, D.; Wei, Y.; Lv, B.; Yan, J.; Kong, L.; and Zhong, Y. 2022b. cosFormer: Rethinking Softmax in Attention. In International Conference on Learning Representations.

Qin, Z.; Sun, W.; Lu, K.; Deng, H.; Li, D.; Han, X.; Dai, Y.; Kong, L.; and Zhong, Y. 2023c. Linearized Relative Positional Encoding. arXiv preprint arXiv:2307.09270.

Qin, Z.; Yang, S.; and Zhong, Y. 2023. Hierarchically Gated Recurrent Neural Network for Sequence Modeling. NeurIPS.

Qin, Z.; and Zhong, Y. 2023. Accelerating Toeplitz Neural Network with Constant-time Inference Complexity. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Su, J.; Lu, Y.; Pan, S.; Wen, B.; and Liu, Y. 2021. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.

Sun, J.; Zhong, G.; Zhou, D.; Li, B.; and Zhong, Y. 2022a. Locality Matters: A Locality-Biased Linear Attention for Automatic Speech Recognition. arXiv preprint arXiv:2203.15609.

Sun, W.; Qin, Z.; Deng, H.; Wang, J.; Zhang, Y.; Zhang, K.; Barnes, N.; Birchfield, S.; Kong, L.; and Zhong, Y. 2022b. Vicinity Vision Transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, (01): 1-14.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. Advances in Neural Information Processing Systems, 30.

Wettig, A.; Gao, T.; Zhong, Z.; and Chen, D. 2022. Should You Mask 15% in Masked Language Modeling? arXiv preprint arXiv:2202.08005.

Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In The IEEE International Conference on Computer Vision (ICCV).