# waveletbased_positional_representation_for_long_context__55df9037.pdf

Published as a conference paper at ICLR 2025

WAVELET-BASED POSITIONAL REPRESENTATION FOR LONG CONTEXT

Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito NTT Human Informatics Laboratories, NTT Corporation yui.oka@ntt.com

In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered during training, thus preventing effective representation of positions in longer sequences. We analyzed conventional position encoding methods for long contexts and found the following characteristics. (1) When the representation dimension is regarded as the time axis, Rotary Position Embedding (RoPE) can be interpreted as a restricted wavelet transform using Haar-like wavelets. However, because it uses only a fixed scale parameter, it does not fully exploit the advantages of wavelet transforms, which capture the fine movements of non-stationary signals using multiple scales (window sizes). This limitation could explain why RoPE performs poorly in extrapolation. (2) Previous research as well as our own analysis indicates that Attention with Linear Biases (ALiBi) functions similarly to windowed attention, using windows of varying sizes. However, it has limitations in capturing deep dependencies because it restricts the receptive field of the model. From these insights, we propose a new position representation method that captures multiple scales (i.e., window sizes) by leveraging wavelet transforms without limiting the model's attention field. Experimental results show that this new method improves the performance of the model in both short and long contexts. In particular, our method allows extrapolation of position information without limiting the model's attention field.

1 INTRODUCTION

Several pre-trained large language models based on the Transformer architecture (Vaswani et al., 2017) have demonstrated robust capabilities in various generative tasks (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020; Touvron et al., 2023a; Jiang et al., 2023). However, limitations on the input sequence length arise due to the computational resource constraints encountered during the pre-training phase. Such constraints necessitate a determination of the maximum allowable length of sequences, hereinafter $L_{\text{train}}$, prior to the pre-training process, thus hindering the model's performance in processing sequences longer than those encountered during training. This weakness is primarily attributed to the positional encoding's ineffectiveness in handling sequences that exceed the length of those encountered during the model's training phase (Devlin et al., 2019; Press et al., 2022).

Rotary Position Embedding (RoPE) (Su et al., 2021) has become a common approach in many language models that handle long contexts, and it employs a rotation matrix to encode positional information and facilitate the processing of long sequences. To manage sequences longer than those encountered during training, various scaling strategies (Chen et al., 2023; bloc97, 2023; Peng et al., 2024; Liu et al., 2024) have been applied to RoPE, although these often require additional fine-tuning and incur further learning costs in addition to those of pre-training. In contrast, Attention with Linear Biases (ALiBi) (Press et al., 2022) can extrapolate to sequence lengths beyond the limits of pre-training without requiring additional fine-tuning. However, ALiBi limits the attention's receptive field (Chi et al., 2023) in the manner of windowed attention (Beltagy et al., 2020). For this reason, a model using ALiBi may not be able to obtain information that is in a distant dependency relationship. In this paper, we analyze conventional positional encoding methods for long contexts,


Figure 1: Overview of Wavelet-based Relative Positional Representation. As in RPE (Shaw et al., 2018), our method computes a relative positional representation $(p_{m,n})^T$ for the query $q_m$ and the key $k_n$. Instead of the learnable embedding in RPE, the position is computed based on the wavelet function. Different wavelet functions $\psi_{a,b}$ are used for each dimension of the head $d$. Furthermore, the scale parameter $a$ and the shift parameter $b$ change depending on the dimension of the head $d$.

and we propose a novel positional representation that permits extrapolation without constraining the attention mechanism's receptive field. First, we mathematically show that RoPE performs a process similar to a wavelet transform, which is considered the gold standard of time-frequency analysis. We interpreted the position of each token in the sequence as a time point in time-frequency analysis. However, RoPE does not perform a transformation in accordance with the order of positions but rather in accordance with the number of dimensions, and it does not capture the dynamic change in a signal over time. Furthermore, the values corresponding to the wavelet scale (i.e., window size) are constant, so RoPE does not make good use of the key characteristic of wavelet transforms, which is the ability to analyze signals on multiple scales. In other words, RoPE may fail to capture the dynamic change in a signal over time, such as what occurs in natural language. In this study, we also show that ALiBi provides different window sizes for each head.

Based on these insights, we propose a wavelet transform-based method, using multiple window sizes, to offer a robust and flexible approach to positional encoding. By performing a wavelet transform along the order of positions and introducing various scale parameters, our method can capture the dynamic changes in a sequence over positions in the manner of the original feature of wavelet transformation, i.e., time-frequency analysis. Following the methodology of Relative Position Representation (RPE) (Shaw et al., 2018), we implement our method with relative ease.

From our experiments on extrapolation capabilities using the WikiText-103 dataset (Merity et al., 2017), the results demonstrate that our method surpasses traditional positional encoding methods in perplexity. We also report that our method has lower perplexity than RoPE in experiments with long contexts using the Llama-2 model (Touvron et al., 2023b) and the CodeParrot dataset.

2 BACKGROUND

2.1 POSITIONAL REPRESENTATION

Within the Transformer architecture, positional encoding is employed to accurately represent the sequential position of each token. Positional encoding can be divided into two main types: absolute position, which expresses the position of a token from the static beginning of the sequence, and relative position, which expresses the position of each token in relation to the other tokens within the sequence. RoPE (Su et al., 2021), which adopts a type of absolute position, uses a rotation matrix to compute the position and then multiplies it by the query and key to represent the position. RPE (Shaw et al., 2018), based on a type of relative position, uses a learnable embedding that represents the position of distances of up to 16 or 32 tokens by clipping. Two other variations include T5 Bias (Raffel et al., 2020), which has an enlarged RPE window size, and Transformer-XL (Dai et al., 2019), which uses a sine wave for position representation instead of learnable embedding.

Position encoding plays a critical role in enabling models to effectively handle long context sequences, and it allows for extrapolation. Relative position is not a position expression that depends on the length of the sequence, so it is effective in extrapolation. ALiBi (Press et al., 2022) is an


effective position representation method for extrapolation: It uses the relative position bias of all tokens by adding a linear bias to each head's attention score, rather than using position embedding. However, ALiBi is unable to obtain information in a distant dependency relationship due to its constraints on the self-attention mechanism's receptive field (Chi et al., 2023). On the other hand, absolute position is unsuitable for extrapolation because it expresses the position of all words in the sequence. For this reason, many methods have been proposed for fine-tuning RoPE by interpolating its absolute positions (Chen et al., 2023; bloc97, 2023; Peng et al., 2024).

2.2 FREQUENCY ANALYSIS AND TIME-FREQUENCY ANALYSIS

Frequency analysis in signal processing involves analyzing the frequency components of a signal to understand its behavior. The Fourier transform (FT) (Bracewell & Bracewell, 1986) is a key method for frequency analysis, converting a signal from the time domain to the frequency domain, thus providing a global view of its frequency content. However, the FT does not provide any information about when specific frequencies occur. To address this limitation, time-frequency analysis techniques have been applied. The wavelet transform (WT) (Grossmann & Morlet, 1984; Mallat, 1989) offers a more flexible approach by analyzing the signal at multiple scales or resolutions. The WT adaptively provides high time resolution for high-frequency components and high frequency resolution for low-frequency components, making it well-suited for analyzing signals with nonstationary or transient features. This adaptability allows the wavelet transform to capture both time and frequency information with varying degrees of precision.

3 ROPE AND WAVELET TRANSFORM

3.1 PRELIMINARY

Wavelet Transform. A wavelet is a wave that decays quickly and locally as it approaches zero. A function $\psi$ defined on the real line $\mathbb{R}$ is called a wavelet function if it belongs to the space $L^2(\mathbb{R})$ of square-integrable functions and satisfies the following condition:

$$\int_{-\infty}^{\infty} |\psi(x)|^2 \, dx < \infty. \tag{1}$$

The wavelet function is defined as follows.

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}} \, \psi\!\left(\frac{t-b}{a}\right). \tag{2}$$

In this case, $b$ is the shift and $a > 0$ is the scale parameter. The scale parameter $a$ simultaneously changes the range over which the wavelet is localized as well as the wavelet's amplitude. Typical wavelets include the Haar wavelet (Haar, 1910), Ricker wavelet (Ricker, 1944), and Morlet wavelet (Bernardino & Santos-Victor, 2005). Suppose that we sample $T$ values at regular intervals from a continuous signal. Wavelet transform (WT) (Grossmann & Morlet, 1984) is the process of transforming a signal $x(t)$ into the frequency domain and time domain by computing the inner product of the wavelet function $\psi_{a,b}(t)$ and signal $x(t)$.

$$W(a, b) = \sum_{t=0}^{T-1} \psi_{a,b}(t) \, x(t). \tag{3}$$

In some cases, the term Discrete Wavelet Transform or Wavelet Transform is used to refer to multi-resolution analysis (Mallat, 1989), but in this paper we follow the original definition. We can see that the FT only converts to the frequency domain, whereas the WT converts to two domains: scale $a$ and shift $b$. For example, consider the case of a conversion to two scales and four shifts. When $a \in [2, 4]$ and $b \in [0, 1, 2, 3]$, the wavelet transform can be expressed in matrix form as follows:

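$$
\begin{bmatrix} W(2,0) \\ W(4,0) \\ W(2,1) \\ W(4,1) \\ \vdots \\ W(4,3) \end{bmatrix}
=
\begin{bmatrix}
\psi_{2,0}(0) & \psi_{2,0}(1) & \psi_{2,0}(2) & \cdots & \psi_{2,0}(T-1) \\
\psi_{4,0}(0) & \psi_{4,0}(1) & \psi_{4,0}(2) & \cdots & \psi_{4,0}(T-1) \\
\psi_{2,1}(0) & \psi_{2,1}(1) & \psi_{2,1}(2) & \cdots & \psi_{2,1}(T-1) \\
\psi_{4,1}(0) & \psi_{4,1}(1) & \psi_{4,1}(2) & \cdots & \psi_{4,1}(T-1) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\psi_{4,3}(0) & \psi_{4,3}(1) & \psi_{4,3}(2) & \cdots & \psi_{4,3}(T-1)
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(T-1) \end{bmatrix}. \tag{4}
$$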


Furthermore, since $\psi_{a,b}(t) = \psi_{a,0}(t-b)$ from Eq. (2), the matrix $\psi$ of the wavelet transform in Eq. (4) is expressed as follows.

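$$
\begin{bmatrix} W(2,0) \\ W(4,0) \\ W(2,1) \\ W(4,1) \\ \vdots \\ W(4,3) \end{bmatrix}
=
\begin{bmatrix}
\psi_{2,0}(0) & \psi_{2,0}(1) & \psi_{2,0}(2) & \cdots & \psi_{2,0}(T-1) \\
\psi_{4,0}(0) & \psi_{4,0}(1) & \psi_{4,0}(2) & \cdots & \psi_{4,0}(T-1) \\
\psi_{2,0}(-1) & \psi_{2,0}(0) & \psi_{2,0}(1) & \cdots & \psi_{2,0}(T-2) \\
\psi_{4,0}(-1) & \psi_{4,0}(0) & \psi_{4,0}(1) & \cdots & \psi_{4,0}(T-2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\psi_{4,0}(-3) & \psi_{4,0}(-2) & \psi_{4,0}(-1) & \cdots & \psi_{4,0}(T-4)
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(T-1) \end{bmatrix}. \tag{5}
$$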

Due to the characteristics of the scale parameter a, the values of the wavelet matrix become 0 or approach 0 outside a certain range that depends on the specific wavelet function.
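The matrix form of the wavelet transform can be sketched numerically. The snippet below is only an illustration under stated assumptions, not the authors' implementation: it assumes the Ricker wavelet as the base wavelet, the conventional $1/\sqrt{a}$ amplitude factor, and a toy impulse signal. The helper names `ricker`, `psi_ab`, `wavelet_matrix`, and `wavelet_transform` are hypothetical.

```python
import math

def ricker(t):
    """Ricker (Mexican-hat) wavelet, unnormalized: (1 - t^2) * exp(-t^2 / 2)."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def psi_ab(t, a, b, base=ricker):
    """Scaled and shifted wavelet, assuming the usual 1/sqrt(a) amplitude factor."""
    return base((t - b) / a) / math.sqrt(a)

def wavelet_matrix(scales, shifts, T, base=ricker):
    """One row per (a, b) pair, ordered shift-major: W(2,0), W(4,0), W(2,1), ..."""
    return [[psi_ab(t, a, b, base) for t in range(T)]
            for b in shifts for a in scales]

def wavelet_transform(x, scales, shifts, base=ricker):
    """W(a, b) = sum_t psi_{a,b}(t) x(t), computed as a matrix-vector product."""
    rows = wavelet_matrix(scales, shifts, len(x), base)
    return [sum(p * v for p, v in zip(row, x)) for row in rows]

# Toy signal (assumption): a single impulse at t = 2.
x = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
coeffs = wavelet_transform(x, scales=[2, 4], shifts=[0, 1, 2, 3])
```

With two scales and four shifts the result has eight coefficients, and the coefficient for $(a, b) = (2, 2)$ is the largest in magnitude among the scale-2 rows because that wavelet is centered on the impulse.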

RoPE. RoPE incorporates positional information directly into the self-attention mechanism by rotating the query and key vectors in complex space. When divided into even and odd dimensions, the following calculations are performed for the m-th query in each sequence. In even dimensions, RoPE is expressed as follows.

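$$
\begin{bmatrix} \tilde{q}^{m}_{0} \\ \tilde{q}^{m}_{2} \\ \vdots \\ \tilde{q}^{m}_{d-2} \end{bmatrix}
=
\begin{bmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2}
\end{bmatrix}
\begin{bmatrix} q^{m}_{0} \\ q^{m}_{1} \\ \vdots \\ q^{m}_{d-2} \\ q^{m}_{d-1} \end{bmatrix}, \tag{6}
$$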

where $q_m \in \mathbb{R}^{1 \times d}$ is the m-th query when the number of dimensions is $d$ and $\theta_i = 10000^{-2(i-1)/d}$, $i \in [1, 2, \ldots, d/2]$. For RoPE in odd dimensions, see Appendix A.1. The same process is also performed for the n-th key $k_n \in \mathbb{R}^{1 \times d}$.
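As a rough illustration of the rotation described above, the following sketch applies the standard pairwise RoPE rotation to a toy query and key. It is a minimal sketch, not the authors' code: the odd-dimension update uses the standard RoPE rotation (the paper defers its own form to Appendix A.1), the vectors `q` and `k` are made-up values, and the helper names `rope_angles`, `rope`, `even_rows`, and `dot` are hypothetical.

```python
import math

def rope_angles(m, d, base=10000.0):
    """m * theta_i for i = 1..d/2, with theta_i = base^(-2(i-1)/d)."""
    return [m * base ** (-2.0 * (i - 1) / d) for i in range(1, d // 2 + 1)]

def rope(vec, m, base=10000.0):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by its angle m * theta."""
    d = len(vec)
    out = [0.0] * d
    for i, ang in enumerate(rope_angles(m, d, base)):
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * math.cos(ang) - x1 * math.sin(ang)      # even dimension
        out[2 * i + 1] = x0 * math.sin(ang) + x1 * math.cos(ang)  # odd dimension
    return out

def even_rows(m, d, base=10000.0):
    """(d/2) x d matrix that produces only the even output dimensions."""
    rows = []
    for i, ang in enumerate(rope_angles(m, d, base)):
        row = [0.0] * d
        row[2 * i], row[2 * i + 1] = math.cos(ang), -math.sin(ang)
        rows.append(row)
    return rows

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy vectors (assumption), head dimension d = 4.
q = [0.3, -1.2, 0.5, 0.8]
k = [1.0, 0.4, -0.7, 0.2]
score_a = dot(rope(q, 5), rope(k, 3))    # positions (m, n) = (5, 3)
score_b = dot(rope(q, 12), rope(k, 10))  # same offset m - n = 2
```

The two scores agree up to floating-point error, which reflects that the rotated inner product depends only on the offset $m - n$. Each row of `even_rows` has exactly two non-zero entries, i.e., a window of width 2 that slides along the head dimension.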

3.2 THEORETICAL ANALYSIS

First, we show the wavelet transform using the following two Haar-like wavelets (Haar, 1910).

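$$
\psi(t) = \begin{cases} \cos f(t) & 0 \le t < 1, \\ -\sin f(t) & 1 \le t < 2, \\ 0 & \text{otherwise,} \end{cases}
\qquad
\psi'(t) = \begin{cases} \sin f(t) & 0 \le t < 1, \\ \cos f(t) & 1 \le t < 2, \\ 0 & \text{otherwise.} \end{cases} \tag{7}
$$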

Here, $f : \mathbb{R} \to \mathbb{R}$ is a function that satisfies $\int_{\mathbb{R}} \psi(t) \, dt = 0$ and Eq. (1). Suppose that $x(t)$ ($0 \le t \le d-1$) is a signal with $d$ elements and that the wavelet transform is performed with the wavelet $\psi$ at the scale $a = 1$. We define the shift parameter as $b_j = j - \delta(j)$ ($j = 0, 2, \ldots, d-2$). Here, $\delta(t)$ is a function defined for $0 \le t \le d-1$ such that $0 \le \delta(t) < 1$. When the wavelet function is the Haar-like wavelet $\psi(t)$ in Eq. (7), $a = 1$, and $b \in [b_0, b_2, \ldots, b_{d-2}]$, the wavelet matrix $\psi$ in the wavelet transform $w = \psi x$ can be written as follows.

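$$
\begin{bmatrix} W(1, b_0) \\ W(1, b_2) \\ \vdots \\ W(1, b_{d-2}) \end{bmatrix}
=
\begin{bmatrix}
\cos \phi_0 & -\sin \phi_1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos \phi_2 & -\sin \phi_3 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos \phi_{d-2} & -\sin \phi_{d-1}
\end{bmatrix}
\begin{bmatrix} x(0) \\ x(1) \\ \vdots \\ x(d-2) \\ x(d-1) \end{bmatrix}. \tag{8}
$$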

To simplify the notation in the matrix representation above, we write $\phi_j$ for $j = 0, 1, \ldots, d-1$, where $\phi_j = f(1 + \delta(j))$ if $j$ is odd, and $\phi_j = f(\delta(j))$ otherwise. Let $x$ be the query $q_m$, and define $f$ such that $\phi_j = \phi_{j+1} = m\theta_{j/2+1}$ for $j = 0, 2, 4, \ldots, d-2$, where $\theta_i = 10000^{-2(i-1)/d}$ and $i \in [1, 2, \ldots, d/2]$. Under this definition, the transformation matrix of Eq. (8) becomes identical to that of Eq. (6) in RoPE.¹ In other words, RoPE can be viewed as a wavelet transform using Haar-like wavelets that change amplitude on a fixed scale. Furthermore, the same result as RoPE in odd

¹ The proof of the existence of $f(t)$ that satisfies this condition is provided in Appendix A.2.


Figure 2: Heatmap of scaled attention scores via softmax normalization in ALiBi without non-overlapping inference. The vertical axis represents the query, while the horizontal axis corresponds to the key in the attention map. For clarity, values of 0.001 or more are mapped to black, while values below that are mapped to yellow. The maximum allowable length of sequences is $L_{\text{train}} = 512$, and the inference length is 1012.

dimensions can be obtained when using the second wavelet in Eq. (7) for the wavelet transform.² This wavelet transform in RoPE is performed across the number of query head dimensions $d$. Therefore, RoPE can be considered a wavelet transform along the head dimension using a wavelet with a fixed scale of 2.³

4 WINDOW SIZE VARIABILITY IN ALIBI

ALiBi has a restricted receptive field and behaves in the manner of windowed attention (Chi et al., 2023; Beltagy et al., 2020). A receptive field refers to the specific region of the input space that significantly influences the model's output, typically representing the area where the most relevant features are captured. ALiBi is expressed as

$$\mathrm{softmax}\left(q_m K^{T} + \mathrm{slope} \cdot [-(m-1), \ldots, -2, -1, 0]\right), \tag{9}$$

where the slope is a head-specific slope fixed before training and $K \in \mathbb{R}^{m \times d}$ contains the first $m$ keys. In this section, we analyze the window size in ALiBi using the attention map.

4.1 INSIGHTS FROM ATTENTION MAP ANALYSIS

A heatmap of scaled attention scores obtained through softmax normalization is shown in Figure 2. The number of heads $N$ is 8, and the slope of ALiBi is $[\frac{1}{64}, \frac{1}{128}, \frac{1}{256}]$. In extrapolation, sequences are often divided, but in this section the sequences are not divided. The experimental setting was the same as that in Section 6.1. The perplexity results are shown in Table 1.

The attention map shows that ALiBi uses multiple window sizes corresponding to relative positions and that the window size increases as the slope decreases. Moreover, previous research (Chi et al., 2023) shows that constraining the window size (slope) to a single value leads to increased perplexity. Consequently, one of the reasons ALiBi is effective, compared to a previous relative position using fixed window sizes in T5 Bias (Raffel et al., 2020), is its ability to accommodate multiple window sizes. ALiBi does not perform calculations like those in Eq. (3), so it does not exactly match the wavelet transform. However, having windows of various sizes is similar to the role of the scale parameter used in wavelet transforms.
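The relation between slope and window size can be illustrated with a small sketch. It is a simplified illustration, not the analysis behind Figure 2: it assumes that all content scores $q_m k_n^T$ are equal (so only the linear bias shapes the distribution), looks at the last query of a 1012-token sequence, and counts the keys whose softmax weight reaches the 0.001 threshold used in the heatmaps. The helper names `alibi_weights` and `effective_window` are hypothetical.

```python
import math

def alibi_weights(m, slope, content_scores=None):
    """Softmax over keys 1..m for query m with bias slope * [-(m-1), ..., -1, 0]."""
    if content_scores is None:
        content_scores = [0.0] * m  # assumption: uniform q.k scores isolate the bias
    logits = [s - slope * (m - 1 - j) for j, s in enumerate(content_scores)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def effective_window(m, slope, threshold=0.001):
    """Number of keys whose attention weight is at least `threshold`."""
    return sum(w >= threshold for w in alibi_weights(m, slope))

windows = {slope: effective_window(1012, slope)
           for slope in (1 / 64, 1 / 128, 1 / 256)}
```

Under these assumptions the count grows as the slope shrinks, and for every slope it stays well below the full sequence length, which matches the qualitative picture of a head-specific, bounded window.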

5 WAVELET-BASED POSITIONAL REPRESENTATION

Wavelet transform (WT) is a method of analyzing signals using variable-scale wavelets, and it is possible to adjust the scale of the window. This scalability allows both broad and fine signal features to be efficiently extracted by shifting the wavelet while changing the window size. In particular, this is suitable for investigating non-stationary signals. For this reason, we believe that the wavelet

² Additionally, when $\sin m\theta_i = \cos m\theta_i$, the Haar wavelet matrix and RoPE are the same when the scale is 2, and the shift is $[2, 4, \ldots, d/2]$. Refer to Appendix A.3 for the detailed proof.
³ From previous research (Tancik et al., 2020), we also hypothesized that this could be equivalent to a Fourier transform. However, this hypothesis does not hold (refer to Appendix A.4 for details).


transform approach is effective for capturing the dynamic fluctuations of signals that change over time, and it is also effective for the fluid nature of natural language, which is not constrained by periodicity. Furthermore, when extrapolating, it is important to be able to respond flexibly to changes in context and information. For this reason, we believe that the wavelet transform is also an effective method for extrapolation.

When applying wavelet transforms to positional encoding, a key question arises: Which features should be leveraged for handling long-context dependencies? Notably, RoPE shares conceptual similarities with the wavelet transform (Section 3); however, RoPE depends on absolute positional information, which limits its effective context window to the training length ($L_{\text{train}}$) and restricts its extrapolation capabilities. In contrast, ALiBi offers extrapolation capabilities by using relative position, and it supports varying window sizes (Section 4). However, ALiBi's linear bias constrains its receptive field, making it insufficient for capturing long-range dependencies. According to Press et al. (2022), conventional relative positional encoding (RPE) methods (Shaw et al., 2018; Raffel et al., 2020), which rely on a fixed window size, are similarly ineffective for extrapolation. In conclusion, we adopt relative position with flexible window sizes to handle long-context and extrapolation.

Accordingly, we propose positional representation based on wavelet transform with the following characteristics:

1. Position-based Transformation: RoPE predominantly relies on independent transformation based on the head dimensions. ALiBi employs multiple windows based on the relative position of the sentence, rather than the dimension of the head, which may contribute to its performance. Therefore, we apply a wavelet transform based on the relative position of the sentence.

2. Type of Wavelet: RoPE can be thought of as a wavelet transform using the Haar wavelet, which is the simplest wavelet. However, Haar wavelets might fall short in capturing the intricacies of natural languages. Transitioning toward the use of more sophisticated wavelet functions could enhance our approach to distilling and representing a broader spectrum of features inherent in natural languages.

3. Diversification of Window Sizes (Scale Parameters): From our analysis of ALiBi, we found that having multiple windows is effective for long contexts. The original version of RoPE works with a single fixed scale. To address this limitation, we introduce a variety of scale and shift parameters.

5.1 METHODOLOGY

Incorporating Wavelet Transform into PE. Due to the wavelet shift feature, we adopt a relative position representation, as used in ALiBi, because it is more suitable than an absolute position representation.⁴ In a Transformer model (Vaswani et al., 2017), the self-attention mechanism operates by projecting the input sequence into three distinct representations, queries ($Q$), keys ($K$), and values ($V$), using learnable weight matrices. Self-attention sublayers employ $N$ attention heads. In self-attention sublayers, $e_{m,n}$ is the attention score computed for each pair of query and key. RPE (Shaw et al., 2018) expresses position by calculating the inner product of the query and the relative position embedding. We incorporate the wavelet function into RPE as follows.

$$e_{m,n} = q_m k_n^{T} + q_m (p_{m,n})^{T}, \tag{10}$$

where $q_m$ is the m-th query ($q_m \in \mathbb{R}^{1 \times d}$, $1 \le m \le L$) of a sentence of length $L$, $k_n$ is the n-th key ($k_n \in \mathbb{R}^{1 \times d}$, $1 \le n \le L$) for $q_m$, and $d$ is the number of dimensions of each head. Here, $p_{m,n}$ is the relative position from the m-th query to the n-th key. RPE (Shaw et al., 2018) uses learnable embedding for $p_{m,n} \in \mathbb{R}^{d}$ and a fixed scale by clipping. However, instead of using learnable embeddings to represent $p_{m,n}$, we use d-pattern wavelet functions with multiple scales to calculate the position. In our method, there is no clipping, and the distance of the position expression is fixed regardless of the length of the sentence.

⁴ We also considered incorporating wavelet transforms into RoPE, but decided not to do this because it would make the computational cost even higher. A discussion on this is included in Appendix A.5.


Wavelet Function In conventional wavelets, such as in Eq. (2), the amplitude also varies depending on the scale parameter a. In the proposed method, all amplitudes are the same.

$$\psi_{a,b}(t) = \psi\!\left(\frac{t-b}{a}\right). \tag{11}$$

The variable $t$ is assigned the relative position, which is $t = m - n$. We used the Ricker wavelet (Ricker, 1944) as a base wavelet, which is formulated as follows.

$$\psi(t) = (1 - t^2) \exp\!\left(-\frac{t^2}{2}\right). \tag{12}$$

Shift and scale parameters. We use $s$ distinct patterns for the scale parameter $a$ and $\frac{d}{s}$ patterns for the shift parameter $b$.

$$(a, b) \in \{2^0, 2^1, 2^2, \ldots, 2^{s-1}\} \times \{0, 1, 2, 3, \ldots, \tfrac{d}{s} - 1\}. \tag{13}$$

The scale parameter is a power of 2 derived from the principles of the discrete wavelet transform. By combining the $\frac{d}{s}$-pattern shift parameters $b$ with the s-pattern scale parameters $a$, we generate $d$ distinct wavelets. In this way, our method can set the s-pattern context window size using the scale parameter $a$ and the d-pattern context window using both the scale parameter $a$ and the shift parameter $b$. For instance, with a head dimension of $d = 128$, we use $s = 8$ scale variants ($a \in \{2^0, 2^1, \ldots, 2^7\}$) and 16 shift variants ($b \in \{0, 1, 2, \ldots, 15\}$), resulting in $8 \times 16 = 128$ unique wavelets. Finally, $p_{m,n}$ is computed as follows.⁵

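$$p_{m,n} = \left(1 - \left(\frac{m - n - b}{a}\right)^{2}\right) \exp\!\left(-\frac{1}{2}\left(\frac{m - n - b}{a}\right)^{2}\right), \tag{14}$$

with one component for each of the $d$ pairs $(a, b)$.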
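The construction of $p_{m,n}$ can be sketched as follows. This is a minimal illustration rather than the authors' implementation (their memory-efficient version is described in Appendix A.6): it assumes that each component of $p_{m,n}$ is the unit-amplitude Ricker wavelet $\psi((t-b)/a)$ evaluated at $t = m - n$, that no extra scaling is applied to the score, and that the $(a, b)$ pairs are assigned to head dimensions in scale-major order, which the text does not specify. The helper names `wavelet_params`, `relative_position`, and `attention_score` are hypothetical.

```python
import math

def ricker(t):
    """Ricker base wavelet: (1 - t^2) * exp(-t^2 / 2)."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def wavelet_params(d, s):
    """All (a, b) pairs: s scales 2^0..2^(s-1) combined with d/s shifts 0..d/s-1."""
    assert d % s == 0, "head dimension must be divisible by the number of scales"
    # Assumption: scale-major ordering of the pairs over the d head dimensions.
    return [(2 ** i, b) for i in range(s) for b in range(d // s)]

def relative_position(m, n, params, base=ricker):
    """p_{m,n}: one wavelet value per head dimension, evaluated at t = m - n."""
    return [base((m - n - b) / a) for a, b in params]

def attention_score(q_m, k_n, m, n, params):
    """e_{m,n} = q_m k_n^T + q_m (p_{m,n})^T for a single head."""
    p = relative_position(m, n, params)
    content = sum(q * k for q, k in zip(q_m, k_n))
    position = sum(q * pv for q, pv in zip(q_m, p))
    return content + position

params = wavelet_params(d=128, s=8)  # 8 scales x 16 shifts = 128 wavelets
```

Because `relative_position` depends only on $m - n$, the same values are reused for any absolute position, and no clipping is involved. In practice the values for all offsets would typically be precomputed once per head rather than recomputed per query-key pair.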

6 SHORT-CONTEXT EXPERIMENT

6.1 EXPERIMENTAL SETTINGS

First, we conducted a small-scale experiment to compare our approach with various position encodings. We used the WikiText-103 dataset (Merity et al., 2017), which consists of over 103 million tokens of English Wikipedia articles. We performed a comparative evaluation using a Transformer-based language model (Baevski & Auli, 2019). The dimensionality of the word embedding $d_{\text{model}}$ is 1024, the number of heads $N$ is 8, the dimensionality of the heads $d$ is 128, and the number of layers is 16. The implementation was based on the fairseq (Ott et al., 2019)-based code⁶ provided in a previous work (Press et al., 2022), and all hyperparameters were set to the same values as those in the literature (Press et al., 2022).⁷ The maximum allowable lengths of sequences were set to $L_{\text{train}} = 512$ and $L_{\text{train}} = 1024$.

Compared Methods. Although $\theta = 10{,}000$ is usually used for RoPE, it has been found that extending $\theta$ to 500,000 is effective for long contexts (Xiong et al., 2024). Therefore, we compared $\theta = 10{,}000$ with $\theta = 500{,}000$. In addition to ALiBi and RoPE, the following position representations were also compared: NoPE (Kazemnejad et al., 2023), in which no position information is given, and Trans-XL (Dai et al., 2019), which is a relative positional representation that uses sine waves.

Evaluation Metric. We use perplexity as our evaluation metric. Following previous research (Press et al., 2022), we evaluated on the validation set. To evaluate sequences longer than $L_{\text{train}}$ tokens, it is common to divide the sequence into $L_{\text{train}}$-length sub-sequences, evaluate each independently, and report the average score. However, methods that use relative positions to express a wide range, such as ALiBi, Trans-XL, and the proposed method, are able to consider a wider range of contexts than $L_{\text{train}}$. For this reason, in this paper, we report not only the perplexity of non-overlapping inference but also the normal perplexity when the sequence is not divided into partial sequences. Note

⁵ Implementation tips for reducing the memory usage and improving the computational efficiency of the proposed method are included in Appendix A.6.
⁶ https://github.com/ofirpress/attention_with_linear_biases
⁷ See Appendix A.7 for more details of hyperparameters.


Table 1: Perplexity of validation set in extrapolation experiments using WikiText-103. Maximum allowable lengths of sequences in pre-training are $L_{\text{train}} = 512$ and $L_{\text{train}} = 1024$. Columns 128 to 2512 are sequence lengths for $L_{\text{train}} = 512$; columns 1024 to 5024 are sequence lengths for $L_{\text{train}} = 1024$.

Perplexity in Non-overlapping Inference with $L_{\text{train}}$

| Method | pos | 128 | 256 | 512 | 1012 | 1512 | 2512 | 1024 | 1524 | 3024 | 5024 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NoPE (Kazemnejad et al., 2023) | - | 26.38 | 23.23 | 21.53 | 21.52 | 21.53 | 21.53 | 20.81 | 21.52 | 21.49 | 21.45 |
| RoPE (Su et al., 2021) | abs | 23.82 | 20.98 | 19.39 | 19.35 | 19.39 | 19.38 | 18.42 | 19.51 | 19.52 | 19.48 |
| RoPE (Xiong et al., 2024) | abs | 23.81 | 20.95 | 19.35 | 19.32 | 19.35 | 19.33 | 18.50 | 19.53 | 19.54 | 19.50 |
| Trans-XL (Dai et al., 2019) | rel | 24.16 | 21.53 | 19.96 | 19.92 | 19.93 | 19.96 | 18.67 | 19.75 | 19.74 | 19.70 |
| ALiBi (Press et al., 2022) | rel | 24.18 | 21.32 | 19.69 | 19.64 | 19.69 | 19.64 | 18.66 | 19.64 | 19.65 | 19.62 |
| Wavelet (Ricker) | rel | 23.64 | 20.82 | 19.19 | 19.15 | 19.17 | 19.20 | 18.26 | 19.30 | 19.34 | 19.26 |

Perplexity without Non-overlapping Inference

| Method | pos | 128 | 256 | 512 | 1012 | 1512 | 2512 | 1024 | 1524 | 3024 | 5024 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NoPE (Kazemnejad et al., 2023) | - | 26.38 | 23.23 | 21.53 | 21.03 | 21.58 | 48.48 | 20.81 | 20.45 | 22.11 | 59.37 |
| RoPE (Su et al., 2021) | abs | 23.82 | 20.98 | 19.39 | 23.25 | 44.38 | 93.94 | 18.42 | 18.29 | 33.20 | 122.52 |
| RoPE (Xiong et al., 2024) | abs | 23.81 | 20.95 | 19.35 | 23.70 | 40.39 | 77.90 | 18.50 | 18.30 | 29.25 | 83.43 |
| Trans-XL (Dai et al., 2019) | rel | 24.16 | 21.53 | 19.96 | 19.09 | 18.92 | 19.05 | 18.67 | 18.25 | 18.17 | 18.76 |
| ALiBi (Press et al., 2022) | rel | 24.18 | 21.32 | 19.69 | 18.71 | 18.42 | 18.41 | 18.66 | 18.14 | 17.86 | 17.88 |
| Wavelet (Ricker) | rel | 23.64 | 20.82 | 19.19 | 18.23 | 18.00 | 17.99 | 18.26 | 17.13 | 17.14 | 17.44 |
| Haar (Fixed scale) | rel | 24.98 | 22.07 | 20.49 | 51.61 | 116.87 | 299.26 | - | - | - | - |
| Haar | rel | 23.73 | 20.89 | 19.27 | 18.34 | 18.11 | 18.17 | - | - | - | - |
| Morlet | rel | 24.15 | 21.28 | 19.65 | 19.02 | 20.46 | 26.56 | - | - | - | - |
| Gaussian | rel | 23.77 | 20.90 | 19.30 | 18.31 | 18.02 | 17.88 | - | - | - | - |

that when the sequence length is less than $L_{\text{train}}$, the scores for the perplexity of non-overlapping inference and the normal perplexity without division into partial sequences are the same. Of course, when perplexity is considered without division into partial sequences, the performance of RoPE is expected to decrease greatly because unknown values are used for RoPE when processing a sequence longer than the length encountered during training.
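The two evaluation protocols can be contrasted with a short sketch. It is only a schematic of the procedure described above, not the fairseq evaluation code: `score_fn` is a hypothetical stand-in for a language model that returns per-token natural-log probabilities for a segment, and the function names `split_non_overlapping`, `perplexity`, and `evaluate` are assumptions.

```python
import math

def split_non_overlapping(tokens, l_train):
    """Chunks of at most l_train tokens; each chunk is scored without earlier context."""
    return [tokens[i:i + l_train] for i in range(0, len(tokens), l_train)]

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def evaluate(tokens, score_fn, l_train=None):
    """Perplexity with (l_train set) or without (l_train=None) sequence division.

    score_fn(segment) -> list of per-token log-probs; hypothetical model interface.
    With l_train set, each sub-sequence is scored independently and the
    per-chunk perplexities are averaged.
    """
    if l_train is None:
        return perplexity(score_fn(tokens))
    chunks = split_non_overlapping(tokens, l_train)
    return sum(perplexity(score_fn(c)) for c in chunks) / len(chunks)

# Toy usage (assumption): a "model" that assigns probability 0.25 to every token.
uniform = lambda segment: [math.log(0.25)] * len(segment)
tokens = list(range(1012))
ppl_divided = evaluate(tokens, uniform, l_train=512)
ppl_full = evaluate(tokens, uniform)
```

With a real model the two numbers differ for sequences longer than $L_{\text{train}}$: in the divided setting no token sees more than $L_{\text{train}}$ tokens of context, whereas in the undivided setting the positional representation must handle offsets it never saw during training.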

6.2 MAIN RESULTS

The experimental results are shown in Table 1. The results of perplexity in inference without overlap show that the proposed method using wavelets achieved the lowest perplexity and was also effective for extrapolation. In RoPE, the values used during training are also used in inference without overlap, so the perplexity remains low even when the sequence length exceeds $L_{\text{train}}$. At the same time, however, perplexity is higher for ALiBi and Trans-XL than for RoPE, which is attributed to the limited context range of the position representation's applicability due to the division of the sequence into sub-sequences. In contrast, the proposed method maintains low perplexity even in the case of division into sub-sequences, suggesting that the wavelet position representation is highly effective.

On the other hand, perplexity without non-overlapping inference showed the opposite results. First, since RoPE uses absolute positions, it is necessary to use new values for unknown positions, and thus perplexity increased significantly. However, in the case of $\theta = 500{,}000$, the increase in perplexity was relatively small. On the contrary, Trans-XL and ALiBi, which use relative positions, were able to handle longer contexts, and perplexity decreased as the range of position representations expanded. In the proposed method, perplexity also decreased and the best score was achieved. Trans-XL uses a position representation based on a periodic sine wave function, but the proposed method, which uses wavelets, could further decrease perplexity. This result supports our claim (Section 5) that an approach like wavelet transformation is more effective than periodic functions in capturing the fluid nature of natural language, which is not constrained by periodicity.

6.3 ANALYSIS

6.3.1 HOW EFFECTIVE ARE THE OTHER WAVELET TYPES?

We also conducted experiments to see whether the same effect could be obtained with other wavelets. The wavelets tested were the Gaussian-based wavelet $\psi(t) = \exp(-t^2)$, the Morlet-based wavelet $\psi(t) = \exp(-t^2)\cos(at)$, and the Haar-based wavelet. Note that because our Morlet wavelet is evaluated as $\psi(t/a)$, the frequency of its cosine factor is not affected by the scale parameter $a$.


Figure 3: Heatmap of scaled attention scores via softmax normalization in the 4th head without non-overlapping inference. The vertical axis represents the query, while the horizontal axis corresponds to the key. For clarity, values of 0.001 or more are mapped to black, while values below that are mapped to yellow. The maximum allowable length of sequences in pre-training is $L_{\text{train}} = 512$ and the inference length is 1012. See Appendix A.12 for other heads.

We used the following formula for the Haar wavelet.

$$\psi(t) = \begin{cases} 1 & -0.5 \le t < 0, \\ -1 & -1 \le t < -0.5, \\ 0 & \text{otherwise.} \end{cases} \tag{15}$$

We kept the shift and scale parameters constant, only changing the wavelet function. We also tested the Haar wavelet when set to $a \in \{2^0, 2^0, 2^0, \ldots, 2^0\}$. Consequently, this restricted Haar wavelet had the same scale parameter setting as the RoPE demonstrated in Section 3.2.⁸ The graphs of these wavelet functions are shown in Appendix A.10 (Fig. 6). Extrapolation experiments were conducted under the same conditions as the experimental setup in Section 6, with $L_{\text{train}} = 512$ during training.

As shown in Table 1, the Ricker-, Haar-, and Gaussian-based wavelets had lower perplexity than the Morlet wavelet. One possibility is that complex wavelets with multiplied cosine waves, such as Morlet wavelets, are not suitable for relative positional representation. On the other hand, wavelets with all positive values, such as Gaussian-based wavelets, are expected to represent positions within a narrower distance than the window specified by the scale parameter due to softmax normalization. This suggests that wavelets with a specific range of negative values are suitable, like a Ricker wavelet, for positional representation. Although the Haar wavelet is simple, it is such a wavelet with negative values within a specific range. Therefore, it is considered effective, although not as much as a Ricker wavelet. However, when the scale parameter is restricted ($a \in \{2^0, \ldots, 2^0\}$), as in RoPE, the perplexity increases. This demonstrates the importance of having multiple scales, or in this case, window sizes. We also performed ablation studies for each shift and scale parameter (Appendix A.13) and for discrete wavelets as well as continuous wavelets (Appendix A.14).
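The sign behaviour discussed above can be checked with a small sketch of the four base wavelets. This is an illustrative reading of the formulas, not the authors' code: the Haar definition assumes the reflected form (value 1 on $[-0.5, 0)$ and $-1$ on $[-1, -0.5)$), the Morlet variant assumes that the cosine argument $a \cdot u$ is applied to the scaled variable $u = (t - b)/a$, and all function names are hypothetical.

```python
import math

def ricker(t):
    """(1 - t^2) exp(-t^2 / 2): positive near 0, negative for |t| > 1."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def gaussian(t):
    """exp(-t^2): strictly positive everywhere."""
    return math.exp(-t * t)

def haar_reflected(t):
    """Assumed reflected Haar wavelet: +1 on [-0.5, 0), -1 on [-1, -0.5)."""
    if -0.5 <= t < 0.0:
        return 1.0
    if -1.0 <= t < -0.5:
        return -1.0
    return 0.0

def scaled_morlet(t, a, b=0.0):
    """exp(-u^2) cos(a u) with u = (t - b) / a, so the cosine is cos(t - b)."""
    u = (t - b) / a
    return math.exp(-u * u) * math.cos(a * u)

grid = [i / 10.0 for i in range(-50, 51)]          # t in [-5, 5]
ricker_has_negative = any(ricker(t) < 0 for t in grid)
gaussian_all_positive = all(gaussian(t) > 0 for t in grid)
```

In this sketch the zero crossings of `scaled_morlet` stay at the same offsets $t - b$ for every scale, which is one way to read the remark that the cosine frequency is unaffected by $a$; only the Gaussian envelope widens.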

6.3.2 CAN IT HANDLE TOKENS WITH LONG-RANGE DEPENDENCIES?

Figure 3 shows the attention map of scaled attention scores obtained through softmax normalization for the proposed method. The inference length is L = 1012, and non-overlapping inference is not used. The most notable feature of the proposed method is that it is always able to attend to specific tokens. The words that always receive attention are those that are important in the sentence, such as the special token, the first token, and the subject of the sequence. In contrast, ALiBi has a restricted receptive field for attention, making it unable to capture long-distance dependencies. Like the proposed method, RoPE emphasizes important and special words, but it struggles to capture those that are farther apart. Moreover, as the sentence lengthens, RoPE loses the ability to attend to the initial word. This tendency was also seen in sentences shorter than Ltrain. Accordingly, the proposed method captures long dependencies better than these baselines without restricting the receptive field of attention.

8 Normally, the Haar wavelet is localized in the range t > 0, but the decoder model uses only the range t < 0. Therefore, we reflected the original Haar wavelet across the y-axis.


Table 2: Perplexity in non-overlapping inference with Ltrain = 4096.

| Sequence Length | 4k | 8k | 16k | 32k |
|---|---|---|---|---|
| RoPE (Xiong et al., 2024) | 9.45 | 9.33 | 9.12 | 8.90 |
| Wavelet | 9.00 | 9.01 | 8.83 | 8.60 |

7 LONG CONTEXT

7.1 EXPERIMENTAL SETTINGS

Next, we conducted a large-scale experiment using a Llama-based model (Touvron et al., 2023b). We pre-trained the Llama-2-7B model (footnote 9) from scratch. For pre-training, we used a 1B-token sample drawn from the RedPajama dataset (Computer, 2023). The maximum allowable length of sequences in pre-training was set to Ltrain = 4096. For the same reason as given in Section 6.1, we set θ = 500,000 for RoPE. Furthermore, when the scale parameter is $a \in \{2^0, 2^1, \ldots, 2^7\}$, the range within which the wavelet is localized becomes narrow. Therefore, in our method, we changed the scale parameter to $a \in \{2^2, 2^3, \ldots, 2^9\}$. The other parameters are the same as those used for the Llama-2-7B model (Touvron et al., 2023b). We used CodeParrot (footnote 10) for evaluation, which is well suited to long-distance testing because it requires an understanding of patterns and contextualization of information over long distances (footnote 11).

7.2 MAIN RESULTS

The experimental results are shown in Table 2. Regardless of whether interpolation or extrapolation was applied, the perplexity of our method was lower than that of RoPE. Our method is therefore effective even with large-scale models and long contexts. Moreover, the results in Section 6.2 show that not dividing the sequence further reduces perplexity, so our method might achieve even lower perplexity without division. We also evaluated on LongBench (Bai et al., 2024), with the results given in Appendix A.15.

In addition, position interpolation methods (Chen et al., 2023; bloc97, 2023; Peng et al., 2024; Ding et al., 2024) have been proposed to adapt RoPE for longer contexts. We believe these methods can be integrated into our approach for the following reasons. First, the parameter θ in RoPE corresponds to the scale parameter a in our method, implying compatibility between the two frameworks. Both θ and a refer to the upper limit of the number of positions to be expressed. Second, the LongRoPE paper (Ding et al., 2024) reveals that performance improves when extrapolation is avoided for the initial positions, which likely aligns with the shift parameter b in our method. Thus, it is highly likely that existing position interpolation methods will integrate seamlessly with our approach.

8 CONCLUSION

In this paper, we demonstrated that RoPE can be interpreted as a wavelet transform, and we introduced a novel positional representation method that leverages the wavelet transform's advantages, effectively capturing positional information across various window sizes. Our experimental results demonstrate the proposed method's superior performance in extrapolation tasks when compared to traditional positional representation techniques. Importantly, our approach offers the advantage of not constraining the receptive field, which allows more flexible and comprehensive analysis of positions. Calculating relative positions is known to require more resources than calculating absolute positions, so we show methods for reducing memory consumption in Appendix A.6. However, the computational overhead of calculating relative positions may still impose a bottleneck, and thus reducing it is an important direction for future work.

9 https://huggingface.co/meta-llama/Llama-2-7b
10 https://huggingface.co/datasets/codeparrot/codeparrot-clean
11 See Appendix A.8 for more details of the hyperparameters.


Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByxZX20qFQ.

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3119–3137, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.172. URL https://aclanthology.org/2024.acl-long.172.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020.

Alexandre Bernardino and José Santos-Victor. A real-time gabor primal sketch for visual attention. In Jorge S. Marques, Nicolás Pérez de la Blanca, and Pedro Pina (eds.), Pattern Recognition and Image Analysis, pp. 335–342, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. ISBN 978-3-540-32237-5.

bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation, 2023. URL https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/.

Ronald Newbold Bracewell and Ronald N Bracewell. The Fourier transform and its applications, volume 31999. McGraw-Hill, New York, 1986.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation, 2023. URL https://arxiv.org/abs/2306.15595.

Ta-Chung Chi, Ting-Han Fan, Alexander Rudnicky, and Peter Ramadge. Dissecting transformer length extrapolation via the lens of receptive field analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13522–13537, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.756. URL https://aclanthology.org/2023.acl-long.756.

Together Computer. Redpajama: An open source recipe to reproduce llama training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1285. URL https://aclanthology.org/P19-1285.

Ingrid Daubechies. Ten lectures on wavelets. Society for industrial and applied mathematics, 1992.


Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens, 2024. URL https://arxiv.org/abs/2402.13753.

W. M. Gentleman and G. Sande. Fast fourier transforms: for fun and profit. In Proceedings of the November 7-10, 1966, Fall Joint Computer Conference, AFIPS '66 (Fall), pp. 563–578, New York, NY, USA, 1966. Association for Computing Machinery. ISBN 9781450378932. doi: 10.1145/1464291.1464352. URL https://doi.org/10.1145/1464291.1464352.

A. Grossmann and J. Morlet. Decomposition of hardy functions into square integrable wavelets of constant shape. SIAM Journal on Mathematical Analysis, 15(4):723–736, 1984. doi: 10.1137/0515056. URL https://doi.org/10.1137/0515056.

A. Haar. Zur theorie der orthogonalen funktionensysteme. (erste mitteilung). Mathematische Annalen, 69:331–371, 1910. URL http://eudml.org/doc/158469.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxiv.org/abs/2310.06825.

Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 24892–24928. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/4e85362c02172c0c6567ce593122d31c-Paper-Conference.pdf.

Gregory R. Lee, Ralf Gommers, Filip Waselewski, Kai Wohlfahrt, and Aaron O'Leary. Pywavelets: A python package for wavelet analysis. Journal of Open Source Software, 4(36):1237, 2019. doi: 10.21105/joss.01237. URL https://doi.org/10.21105/joss.01237.

Yang Li, Si Si, Gang Li, Cho-Jui Hsieh, and Samy Bengio. Learnable fourier features for multidimensional spatial positional encoding. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=R0h3NUMao_U.

Xiaoran Liu, Hang Yan, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws of RoPE-based extrapolation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=JO7k0SJ5V6.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.

S.G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989. doi: 10.1109/34.192463.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Byj72udxe.


Nhat Khang Ngo, Truong Son Hy, and Risi Kondor. Multiresolution graph transformers and wavelet positional encoding for learning long-range and hierarchical structures. The Journal of Chemical Physics, 159(3):034109, 07 2023a. ISSN 0021-9606. doi: 10.1063/5.0152833. URL https://doi.org/10.1063/5.0152833.

Nhat Khang Ngo, Truong Son Hy, and Risi Kondor. Multiresolution graph transformers and wavelet positional encoding for learning hierarchical structures. arXiv preprint arXiv:2302.08647, 2023b.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u.

Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.

Norman Ricker. Wavelet functions and their polynomials. Geophysics, 9(3):314–323, 07 1944. ISSN 0016-8033. doi: 10.1190/1.1445082. URL https://doi.org/10.1190/1.1445082.

Ohad Rubin and Jonathan Berant. Retrieval-pretrained transformer: Long-range language modeling with self-retrieval, 2024. URL https://arxiv.org/abs/2306.13421.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2074. URL https://aclanthology.org/N18-2074.

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2021.

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14590–14604, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.816. URL https://aclanthology.org/2023.acl-long.816.

Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qVyeW-grC2k.

R. Tian, Z. Wu, Q. Dai, H. Hu, Y. Qiao, and Y. Jiang. Resformer: Scaling vits with multiresolution training. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22721–22731, Los Alamitos, CA, USA, jun 2023. IEEE Computer Society. doi: 10.1109/CVPR52729.2023.02176. URL https://doi.ieeecomputersociety.org/10.1109/CVPR52729.2023.02176.


Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023a. URL https://api.semanticscholar.org/CorpusID:257219404.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b. URL https://arxiv.org/abs/2307.09288.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. Encoding word order in complex embeddings. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hke-WTVtwr.

Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TrjbxzRcnf-.

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. Effective long-context scaling of foundation model. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4643–4663, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.260. URL https://aclanthology.org/2024.naacl-long.260.

Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, and Zhicheng Dou. Soaring from 4k to 400k: Extending llm's context with activation beacon, 2024. URL https://arxiv.org/abs/2401.03462.


A.1 ROTARY POSITION EMBEDDING

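RoPE incorporates positional information directly into the self-attention mechanism by rotating the query and key vectors in the complex space. When divided into even and odd dimensions, the following calculations are performed for the m-th query in each sequence. In even dimensions, RoPE is expressed as follows.

$$\begin{pmatrix} q^m_0 \\ q^m_2 \\ \vdots \\ q^m_{d-2} \end{pmatrix} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \end{pmatrix} \begin{pmatrix} q^m_0 \\ q^m_1 \\ \vdots \\ q^m_{d-2} \\ q^m_{d-1} \end{pmatrix} \tag{16}$$

In odd dimensions, RoPE is expressed as follows.

$$\begin{pmatrix} q^m_1 \\ q^m_3 \\ \vdots \\ q^m_{d-1} \end{pmatrix} = \begin{pmatrix} \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix} \begin{pmatrix} q^m_0 \\ q^m_1 \\ \vdots \\ q^m_{d-2} \\ q^m_{d-1} \end{pmatrix} \tag{17}$$

Here the left-hand sides denote the rotated components, $q_m \in \mathbb{R}^{1 \times d}$ is the m-th query when the number of dimensions is $d$, and $\theta_i = 10000^{-2(i-1)/d}$, $i \in [1, 2, \ldots, d/2]$. The same process is also performed for the n-th key $k_n \in \mathbb{R}^{1 \times d}$.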
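The even and odd rows above can be sketched together as a per-pair rotation. This is a minimal illustration in plain Python; the function name `rope_rotate` and the `base` argument are assumptions introduced here, not part of any library.

```python
import math

def rope_rotate(q, m, base=10000.0):
    """Rotate each (even, odd) pair of q by the angle m * theta_i."""
    d = len(q)
    out = [0.0] * d
    for i in range(1, d // 2 + 1):
        theta = base ** (-2.0 * (i - 1) / d)   # theta_i = base^{-2(i-1)/d}
        c, s = math.cos(m * theta), math.sin(m * theta)
        even, odd = 2 * (i - 1), 2 * (i - 1) + 1
        out[even] = c * q[even] - s * q[odd]   # even-dimension row
        out[odd] = s * q[even] + c * q[odd]    # odd-dimension row
    return out

q = [1.0, 0.0, 0.5, -0.5]
rotated = rope_rotate(q, m=3)
```

Because each pair is only rotated, the norm of the query is unchanged, and position `m = 0` leaves the query as it is.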

A.2 PROOF OF THE EXISTENCE OF f(t)

We prove the existence of $f(t)$ as described in Section 3.2 such that $\phi_j = \phi_{j+1} = m\theta_{\frac{j}{2}+1}$, where $\theta_i = 10000^{-2(i-1)/d}$ and $i \in [1, 2, \ldots, d/2]$. Here, we restrict our proof to $\psi(t)$ in Eq. (7), but a similar argument can be applied to $\psi'(t)$ by following analogous steps.

First, we revisit the definition of $\psi(t)$:

$$\psi(t) = \begin{cases} \cos f(t) & 0 \le t < 1, \\ \sin f(t) & 1 \le t < 2, \\ 0 & \text{otherwise}. \end{cases} \tag{18}$$

Here, $f : \mathbb{R} \to \mathbb{R}$ is a monotonic function such that $\int \psi(t)\,dt = 0$ and Eq. (1) are satisfied. Assume that $x(t)$ ($0 \le t \le d-1$) is a signal with $d$ elements and that the wavelet transform is performed with the wavelet $\psi$ at the single scale $a = 1$. We define the shift parameter as $b_j = j - \delta(j)$ ($j = 0, 2, \ldots, d-2$). Here, $\delta(t)$ is a monotonic function such that $0 \le \delta(t) < 1$ for $0 \le t \le d-1$.

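$$\begin{pmatrix} W(1, b_0) \\ W(1, b_2) \\ \vdots \\ W(1, b_j) \\ \vdots \\ W(1, b_{d-2}) \end{pmatrix} = \begin{pmatrix} \cos\phi_0 & \sin\phi_1 & 0 & 0 & \cdots & \cdots & 0 & 0 \\ 0 & 0 & \cos\phi_2 & \sin\phi_3 & \cdots & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \cos\phi_j & \sin\phi_{j+1} & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cdots & \cos\phi_{d-2} & \sin\phi_{d-1} \end{pmatrix} \begin{pmatrix} x(0) \\ x(1) \\ \vdots \\ x(j) \\ x(j+1) \\ \vdots \\ x(d-2) \\ x(d-1) \end{pmatrix} \tag{19}$$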

To simplify the notation in the matrix representation above, we write $\phi_j$ for $j = 0, 1, \ldots, d-1$, where $\phi_j = f(1 + \delta(j))$ if $j$ is odd, and $\phi_j = f(\delta(j))$ otherwise. We let $x$ be the query $q_m$. The function $f(t)$ is defined such that $0 < f(t) \le 2k\pi$ for $0 \le t < 1$ and $0 < f(t) \le 2k\pi$ for $1 \le t < 2$, where $k$ is the smallest natural number satisfying $m < 2k\pi$.


Do Haar-like wavelets satisfy the necessary conditions of a wavelet? Here, $f(t)$ must be a function such that $\psi(t)$ satisfies the conditions of a wavelet. Eq. (1) evidently holds for any $f$ satisfying $0 < f(t) \le 2k\pi$. Next, we consider the zero-mean property. As an example, consider $f(t)$ defined as $f(t) = 2k\pi t$ for $0 \le t < 1$ and $f(t) = 2k\pi(t-1)$ for $1 \le t < 2$. Substituting $\theta = f(t)$, so that $dt = d\theta/(2k\pi)$, we have:

$$\int \psi(t)\,dt = \frac{1}{2k\pi}\left( \int_0^{2k\pi} \cos\theta\,d\theta + \int_0^{2k\pi} \sin\theta\,d\theta \right) = 0. \tag{20}$$

Since this satisfies the zero-mean property, we conclude that there exists an f(t) such that ψ(t) is a wavelet.

Furthermore, we observe that there exists a $\delta(t)$ satisfying $\phi_j (= f(\delta(j))) = \phi_{j+1} (= f(1+\delta(j))) = 2k\pi\delta(j) = m\theta_{\frac{j}{2}+1}$ for $j = 0, 2, \ldots, d-2$. In other words, we can simply choose a function $\delta(j)$ that satisfies

$$\delta(j) = \frac{m\theta_{\frac{j}{2}+1}}{2k\pi} \quad \text{for } j = 0, 2, \ldots, d-2.$$
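The construction in this proof can be checked numerically. The sketch below assumes the example choice $f(t) = 2k\pi t$ on $[0, 1)$ and $f(t) = 2k\pi(t-1)$ on $[1, 2)$, reads the subscript of $\theta$ as $j/2 + 1$, and uses hypothetical helper names; it approximates the zero-mean integral by a midpoint sum and verifies that the chosen $\delta(j)$ reproduces the RoPE angle $m\theta$.

```python
import math

def smallest_k(m):
    """Smallest natural number k with m < 2*k*pi."""
    return int(m // (2 * math.pi)) + 1

def f(t, k):
    if 0 <= t < 1:
        return 2 * k * math.pi * t
    if 1 <= t < 2:
        return 2 * k * math.pi * (t - 1)
    raise ValueError("f is only used on [0, 2)")

def psi(t, k):
    """Haar-like wavelet: cos f(t) on [0, 1), sin f(t) on [1, 2)."""
    if 0 <= t < 1:
        return math.cos(f(t, k))
    if 1 <= t < 2:
        return math.sin(f(t, k))
    return 0.0

def theta(i, d, base=10000.0):
    return base ** (-2.0 * (i - 1) / d)

def delta(j, m, d, k):
    """delta(j) = m * theta_{j/2+1} / (2*k*pi) for even j."""
    return m * theta(j // 2 + 1, d) / (2 * k * math.pi)

m, d = 7, 8
k = smallest_k(m)

# Midpoint-rule approximation of the integral of psi over its support [0, 2).
steps = 20000
mean = sum(psi((n + 0.5) * 2 / steps, k) for n in range(steps)) * 2 / steps

# Angles recovered from delta(j) for the even and odd member of each pair.
angles = [(f(delta(j, m, d, k), k), f(1 + delta(j, m, d, k), k)) for j in range(0, d, 2)]
```

For each even `j`, both recovered angles agree with `m * theta(j // 2 + 1, d)`, which is the angle RoPE applies to that pair of dimensions.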

A.3 HAAR WAVELET

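Here, we explain the wavelet transform using the Haar wavelet, which is the simplest wavelet. The definition of the Haar wavelet is as follows.

$$\psi(t) = \begin{cases} 1 & 0 \le t < 1/2, \\ -1 & 1/2 \le t < 1, \\ 0 & \text{otherwise}, \end{cases} \qquad \phi(t) = \begin{cases} 1 & 0 \le t < 1, \\ 0 & \text{otherwise}. \end{cases} \tag{21}$$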

Haar wavelets are defined not only by a wavelet function ψ but also by a scaling function ϕ.

The method of analyzing signals by performing a discrete wavelet transform using these two functions is called multi-resolution analysis. When the scale is fixed at 2 and the shift is $b \in [0, 2, \ldots, d/2]$, the wavelet transform using the wavelet function and scaling function is expressed as follows.

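$$\begin{pmatrix} \psi_{2,0}(0) & \psi_{2,0}(1) & \psi_{2,0}(2) & \psi_{2,0}(3) & \cdots & \psi_{2,0}(T-2) & \psi_{2,0}(T-1) \\ \phi_{2,0}(0) & \phi_{2,0}(1) & \phi_{2,0}(2) & \phi_{2,0}(3) & \cdots & \phi_{2,0}(T-2) & \phi_{2,0}(T-1) \\ \psi_{2,0}(-2) & \psi_{2,0}(-1) & \psi_{2,0}(0) & \psi_{2,0}(1) & \cdots & \psi_{2,0}(T-4) & \psi_{2,0}(T-3) \\ \phi_{2,0}(-2) & \phi_{2,0}(-1) & \phi_{2,0}(0) & \phi_{2,0}(1) & \cdots & \phi_{2,0}(T-4) & \phi_{2,0}(T-3) \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \psi_{2,0}(-\tfrac{d}{2}) & \psi_{2,0}(-\tfrac{d}{2}+1) & \psi_{2,0}(-\tfrac{d}{2}+2) & \psi_{2,0}(-\tfrac{d}{2}+3) & \cdots & \psi_{2,0}(0) & \psi_{2,0}(1) \\ \phi_{2,0}(-\tfrac{d}{2}) & \phi_{2,0}(-\tfrac{d}{2}+1) & \phi_{2,0}(-\tfrac{d}{2}+2) & \phi_{2,0}(-\tfrac{d}{2}+3) & \cdots & \phi_{2,0}(0) & \phi_{2,0}(1) \end{pmatrix} \begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(T-2) \\ x(T-1) \end{pmatrix} \tag{22}$$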

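From Eq. (21), $\psi_{2,0}$ and $\phi_{2,0}$ are as follows.

$$\psi_{2,0}(t) = \begin{cases} 1/\sqrt{2} & 0 \le t < 1, \\ -1/\sqrt{2} & 1 \le t < 2, \\ 0 & \text{otherwise}, \end{cases} \qquad \phi_{2,0}(t) = \begin{cases} 1/\sqrt{2} & 0 \le t < 2, \\ 0 & \text{otherwise}. \end{cases} \tag{23}$$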

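Therefore, the Haar wavelet transform is a block-diagonal matrix of $2 \times 2$ blocks.

$$\begin{pmatrix} \psi(2, 0) \\ \phi(2, 0) \\ \psi(2, 2) \\ \phi(2, 2) \\ \vdots \\ \psi(2, T-2) \\ \phi(2, T-2) \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} & 0 & 0 & \cdots & 0 & 0 \\ 1/\sqrt{2} & 1/\sqrt{2} & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1/\sqrt{2} & -1/\sqrt{2} & \cdots & 0 & 0 \\ 0 & 0 & 1/\sqrt{2} & 1/\sqrt{2} & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1/\sqrt{2} & -1/\sqrt{2} \\ 0 & 0 & 0 & 0 & \cdots & 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ \vdots \\ x(T-2) \\ x(T-1) \end{pmatrix} \tag{24}$$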

This matrix is the Haar forward transform, expressed as a matrix multiplication, for a signal with T elements. It matches the RoPE matrix with mθ = π/4, since cos(π/4) = sin(π/4) = 1/√2.
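The correspondence with RoPE at mθ = π/4 can be confirmed with a short sketch. The function names below are hypothetical, and the matrices are built as plain nested lists for an even-length signal.

```python
import math

def haar_forward_matrix(T):
    """Block-diagonal Haar forward transform for an even-length signal."""
    s = 1.0 / math.sqrt(2.0)
    M = [[0.0] * T for _ in range(T)]
    for b in range(0, T, 2):
        M[b][b], M[b][b + 1] = s, -s        # wavelet row (difference)
        M[b + 1][b], M[b + 1][b + 1] = s, s  # scaling row (sum)
    return M

def rope_matrix(angle, T):
    """Block-diagonal rotation matrix with the same angle in every 2x2 block."""
    c, s = math.cos(angle), math.sin(angle)
    M = [[0.0] * T for _ in range(T)]
    for b in range(0, T, 2):
        M[b][b], M[b][b + 1] = c, -s
        M[b + 1][b], M[b + 1][b + 1] = s, c
    return M

def matvec(M, x):
    return [sum(row[t] * x[t] for t in range(len(x))) for row in M]

T = 4
haar = haar_forward_matrix(T)
rope = rope_matrix(math.pi / 4, T)
coeffs = matvec(haar, [4.0, 2.0, 5.0, 5.0])
```

For the example signal, each pair yields one scaled difference and one scaled sum, which is the usual detail and approximation split of the Haar transform.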


A.4 ISN T ROPE A FOURIER TRANSFORM?

We also hypothesized that this could be equivalent to a Fourier transform. However, this hypothesis does not hold. When a signal $x(t)$ that changes over time is Fourier transformed, its spectrum $F(f)$ is obtained. The process of converting an actual discrete signal $x(t)$ into a spectrum $F(f)$ is as follows.

$$F(f) = \sum_{t=0}^{T-1} x(t)\, w^{ft}. \tag{25}$$

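The Fourier transform can be expressed in matrix form as follows.

$$\begin{pmatrix} F(0) \\ F(1) \\ F(2) \\ \vdots \\ F(f) \end{pmatrix} = \begin{pmatrix} w^{0 \cdot 0} & w^{0 \cdot 1} & w^{0 \cdot 2} & \cdots & w^{0 \cdot (T-1)} \\ w^{1 \cdot 0} & w^{1 \cdot 1} & w^{1 \cdot 2} & \cdots & w^{1 \cdot (T-1)} \\ w^{2 \cdot 0} & w^{2 \cdot 1} & w^{2 \cdot 2} & \cdots & w^{2 \cdot (T-1)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w^{f \cdot 0} & w^{f \cdot 1} & w^{f \cdot 2} & \cdots & w^{f \cdot (T-1)} \end{pmatrix} \begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ \vdots \\ x(T-1) \end{pmatrix} \tag{26}$$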

Here, $f \in \mathbb{R}$ is the wave number, $T \in \mathbb{R}$ is the number of samples, and $i$ is the imaginary unit. $w = \exp(-\frac{2\pi i}{T})$ is called the twiddle factor (Gentleman & Sande, 1966), which is a complex number expressed in polar form using Euler's formula $e^{-i\theta} = \cos\theta - i\sin\theta$. In the complex plane, $w^{ft}$ represents a point on the unit circle with argument $-\frac{2\pi f t}{T}$. From this formula, we can see that the Fourier transform calculates the inner product of the entire signal with each sinusoid. In RoPE, however, the inner product with sinusoids is calculated only within each block.
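To make the contrast concrete, the following sketch builds the matrix of Eq. (26) from the twiddle factor and applies it to a short signal. It is an illustration using only the standard library, with hypothetical function names, and is not an efficient FFT.

```python
import cmath
import math

def dft_matrix(T):
    """Dense T x T matrix whose (f, t) entry is w^(f*t), with w = exp(-2*pi*i/T)."""
    w = cmath.exp(-2j * math.pi / T)
    return [[w ** (f * t) for t in range(T)] for f in range(T)]

def dft(x):
    W = dft_matrix(len(x))
    return [sum(W[f][t] * x[t] for t in range(len(x))) for f in range(len(x))]

# One period of a cosine sampled at T = 4 points.
x = [1.0, 0.0, -1.0, 0.0]
W = dft_matrix(len(x))
spectrum = dft(x)
```

Every entry of `W` lies on the unit circle, so each output coefficient mixes all T samples. A RoPE matrix of the same size has only two nonzero entries per row.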

Next, does the correspondence with the Fourier transform hold when the attention score is calculated with RoPE? The attention score of the m-th query $q_m$ and the n-th key $k_n$ with RoPE is calculated as follows.

$$\left[ (R^1_m Q^1_m)^T, \ldots, (R^{d/2}_m Q^{d/2}_m)^T \right] \begin{pmatrix} R^1_n K^1_n \\ \vdots \\ R^{d/2}_n K^{d/2}_n \end{pmatrix} = \sum_{i=1}^{d/2} (Q^i_m)^T R^i_{n-m} K^i_n, \tag{27}$$

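where $Q^{d/2}_m$ is a two-dimensional block of the query, which is split every two dimensions, and $R^{d/2}_m$ is the rotation matrix.

$$Q^{d/2}_m = \begin{pmatrix} q^{d-1}_m \\ q^{d}_m \end{pmatrix}, \quad K^{d/2}_n = \begin{pmatrix} k^{d-1}_n \\ k^{d}_n \end{pmatrix}, \quad R^{d/2}_m = \begin{pmatrix} \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix} \tag{28}$$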
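The identity in Eq. (27) says that rotating the query by position m and the key by position n separately gives the same score as rotating only the key by the relative offset n − m. The sketch below checks this numerically; all function names are hypothetical and the example vectors are arbitrary.

```python
import math

def rot(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def apply(R, v):
    return [R[0][0] * v[0] + R[0][1] * v[1], R[1][0] * v[0] + R[1][1] * v[1]]

def thetas(d, base=10000.0):
    return [base ** (-2.0 * (i - 1) / d) for i in range(1, d // 2 + 1)]

def score_absolute(q, k, m, n):
    """Left-hand side: rotate query by m and key by n, then take the dot product."""
    total = 0.0
    for i, th in enumerate(thetas(len(q))):
        Q = apply(rot(m * th), q[2 * i:2 * i + 2])
        K = apply(rot(n * th), k[2 * i:2 * i + 2])
        total += Q[0] * K[0] + Q[1] * K[1]
    return total

def score_relative(q, k, m, n):
    """Right-hand side: sum over blocks of Q^T R_{n-m} K."""
    total = 0.0
    for i, th in enumerate(thetas(len(q))):
        Q = q[2 * i:2 * i + 2]
        RK = apply(rot((n - m) * th), k[2 * i:2 * i + 2])
        total += Q[0] * RK[0] + Q[1] * RK[1]
    return total

q = [0.3, -1.2, 0.7, 0.1]
k = [1.0, 0.4, -0.5, 2.0]
lhs = score_absolute(q, k, m=5, n=9)
rhs = score_relative(q, k, m=5, n=9)
shifted = score_absolute(q, k, m=105, n=109)
```

The score depends only on the offset n − m, so shifting both positions by the same amount leaves it unchanged.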

Aligning with the Fourier transform, as illustrated in Eq. (26), requires a process involving the inner product between a frequency tensor of dimensions $f \times T$ and a signal tensor of dimensions $T \times 1$ (such as the query vector). However, RoPE operates on independent $2 \times 2$ blocks, where each block is processed separately. Consequently, RoPE's block-wise operations do not conform to the structure required by the Fourier transform. Moreover, if we focus solely on the RoPE and key operations in Eq. (27), they may appear to align with the structure of a Fourier transform. However, since the final step involves taking the inner product with the query, the overall operation does not exactly match the Fourier transform. Furthermore, the rotation factor represents a rotation in the complex plane, and even if it is expressed as in Eq. (26) using a rotation matrix, it does not completely match a rotation matrix that represents a rotation in the Euclidean plane.

Therefore, RoPE cannot be equated with the Fourier transform. Furthermore, even if it were the same as the Fourier transform, it would be unsuitable for processing non-stationary signals and thus unsuitable for processing natural language, which is a non-stationary flow.


A.5 CONSIDERATION OF WAVELET TRANSFORMATION BASED ON ROPE

Here, we explore the incorporation of wavelet transforms into RoPE (Rotary Position Embedding), following our earlier discussion of RPE (Relative Position Encoding). Integrating wavelet transforms into RoPE presents challenges for controlling computational and memory costs. In Sections 3 and 4, we highlighted the potential effectiveness of employing multiple scales for extrapolation. With this in mind, we present a simplified formula for applying various scales and wavelet transforms to RoPE, which we refer to here as a RoPE-based Wavelet.

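$$\begin{pmatrix} q^m_0 \\ q^m_1 \\ q^m_2 \\ q^m_3 \\ \vdots \\ q^m_{d-2} \\ q^m_{d-1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \cos m\theta_2 & -\sin m\theta_2 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ \sin m\theta_2 & \cos m\theta_2 & \sin m\theta_2 & \sin m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \cos m\theta_{d/2} & -\sin m\theta_{d/2} & \cos m\theta_{d/2} & -\sin m\theta_{d/2} & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ \sin m\theta_{d/2} & \cos m\theta_{d/2} & \sin m\theta_{d/2} & \sin m\theta_{d/2} & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix} \begin{pmatrix} q^m_0 \\ q^m_1 \\ q^m_2 \\ q^m_3 \\ \vdots \\ q^m_{d-2} \\ q^m_{d-1} \end{pmatrix} \tag{29}$$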

where $q_m \in \mathbb{R}^{1 \times d}$ is the m-th query when the number of dimensions is $d$ and $\theta_i = 10000^{-2(i-1)/d}$, $i \in [1, 2, \ldots, d/2]$. The same process is also performed for the n-th key $k_n \in \mathbb{R}^{1 \times d}$.

Conversely, the method introduced in Section 5 is here called RPE-based Wavelet. The key differences between RoPE-based Wavelet and RPE-based Wavelet are as follows:

Number of Scale Parameters: In RPE-based Wavelet, the scale parameters can be selected up to the maximum sequence length. However, in RoPE-based Wavelet, the selection is limited to a maximum of d.

Memory Usage: RoPE-based Wavelet requires a wavelet matrix that corresponds to the number of absolute positions m. Consequently, the memory usage is significantly higher. Unlike RoPE-based Wavelet, RPE-based Wavelet does not necessitate a wavelet matrix that matches m values, allowing the use of Tip 2 from Appendix A.6, which improves memory efficiency.

Absolute and Relative Positions: When applying wavelet transforms using RoPE-based Wavelet, it is necessary to use absolute positions. In contrast, RPE-based Wavelet can use relative positions, which enhances extrapolation.

Computational Cost: Implementing wavelet transforms via RoPE-based Wavelet requires processing both the query and the key, necessitating two calculations. RPE-based Wavelet, as discussed in Section 5, only requires one computation, since it processes only the query.

Additionally, we conducted an experiment with RoPE-based Wavelet. Unfortunately, we had to halt training because it took over five times longer than anticipated. Considering the training costs associated with large-scale language models in recent years, we believe the RoPE-based Wavelet approach is not feasible.


A.6 IMPLEMENTATION TIPS FOR WAVELET POSITION REPRESENTATION

Tip 1 Similar to RPE (Shaw et al., 2018), we used Eq. (10) in the form

$$\alpha_{ij} = \mathrm{softmax}\left( q_i K^T + q_i (p_{ij})^T \right)$$

By transforming it in this way, it is possible to reduce the computational complexity to $O(\text{batch} \cdot n \cdot \text{length}^2 \cdot d + \text{length}^2 \cdot d)$, where batch is the batch size, $n$ is the number of heads, length is the number of tokens, and $d$ is the number of dimensions of each head. The experiments in Section 6 are implemented based on the methodology introduced in this section.
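The score in Tip 1 can be sketched for a single head as follows. This is a plain-Python illustration rather than the authors' fairseq code: the function names are hypothetical, the causal restriction to j ≤ i is an assumption that reflects the decoder setting, and no additional scaling is applied beyond what the formula above shows.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(row):
    mx = max(row)
    ex = [math.exp(v - mx) for v in row]
    z = sum(ex)
    return [e / z for e in ex]

def attention_weights(Q, K, P):
    """alpha_i = softmax_j(q_i . k_j + q_i . p_ij) over the causal range j <= i.

    Q, K: lists of d-dimensional vectors; P[i][j]: d-dimensional position vector.
    """
    weights = []
    for i in range(len(Q)):
        scores = [dot(Q[i], K[j]) + dot(Q[i], P[i][j]) for j in range(i + 1)]
        weights.append(softmax(scores))
    return weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy position vectors that depend only on the offset j - i.
P = [[[0.1 * (j - i), 0.0] for j in range(3)] for i in range(3)]
alpha = attention_weights(Q, K, P)

# With all-zero position vectors the result reduces to plain content attention.
Z = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
plain = attention_weights(Q, K, Z)
```

Since the position term reuses the same query as the content term, it adds a second query product per key instead of requiring a modified key for every query position.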

Tip 2 When dealing with long contexts of over 4k tokens with a large model, the (d, length, length) memory footprint of the wavelet position representation becomes a bottleneck. Therefore, we further reduce the memory usage to (d, length) by using torch.scatter to scatter the wavelet position representation into the attention mask. In the relative position representation in the decoder, only the position information of the tokens before the current token is required, for example, 0, 1, 2, and so on. Therefore, we pre-compute the information for the distances 0, 1, 2, ..., length and reduce the memory usage by using torch.scatter to distribute it. Specifically, we prepare a (d, length) wavelet tensor and calculate the 2D inner product with the query, which has been reshaped to (length × batch, d). The resulting tensor has shape (length × batch, length) and is then scattered using torch.scatter so that each entry lands at its relative position in the attention mask. This reduces the amount of memory used from (d, length, length) to (d, length), and the calculation can be performed using operations between 2D tensors. The experiments in Section 7 are implemented based on the methodology introduced in this section.
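The indexing idea behind Tip 2 can be sketched without any tensor library. In the sketch below, a plain loop stands in for the torch.scatter step, the batch dimension is omitted, and `wavelet_table` fills the (d, length) table with placeholder values; these are all assumptions for illustration, not the authors' implementation.

```python
import math

def wavelet_table(d, length):
    """Placeholder (d, length) table: entry [c][r] is dimension c at relative distance r."""
    return [[math.cos(r / 2 ** c) * math.exp(-r / 2 ** (c + 2)) for r in range(length)]
            for c in range(d)]

def bias_compact(Q, table):
    """Position bias computed from the (d, length) table only."""
    length, d = len(Q), len(Q[0])
    # 2D product: scores[i][r] = q_i . table[:, r]
    scores = [[sum(Q[i][c] * table[c][r] for c in range(d)) for r in range(length)]
              for i in range(length)]
    # Move distance r = i - j to key position j (the role of the scatter step).
    bias = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(i + 1):
            bias[i][j] = scores[i][i - j]
    return bias

def bias_naive(Q, table):
    """Reference: materialize one d-vector per (i, j) pair, i.e. (d, length, length) values."""
    length, d = len(Q), len(Q[0])
    P = [[[table[c][i - j] if j <= i else 0.0 for c in range(d)]
          for j in range(length)] for i in range(length)]
    return [[sum(Q[i][c] * P[i][j][c] for c in range(d)) for j in range(length)]
            for i in range(length)]

Q = [[1.0, -0.5, 0.25], [0.0, 2.0, 1.0], [0.5, 0.5, -1.0], [1.5, 0.0, 0.0]]
table = wavelet_table(d=3, length=4)
compact = bias_compact(Q, table)
naive = bias_naive(Q, table)
```

Both routes give the same bias matrix, but the compact one stores only d × length position values instead of d × length × length.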

A.7 EXPERIMENTAL SETTINGS IN SHORT-CONTEXT EXPERIMENT

The parameter settings used in the extrapolation experiments were the same as those in the original ALiBi paper. The dimensionality of the word embedding $d_{\mathrm{model}}$ is 1024, the number of heads N is 8, the dimensionality of the heads d is 128, and the number of layers is 16. The implementation was based on the fairseq-based (Ott et al., 2019) code¹² provided in previous work (Press et al., 2022), and all hyperparameters were set to the same values as in that work. The number of training epochs is 205, and the batch size is 9216. The learning rate was set to 1.0, and the learning process was updated by 1e-7 every 16,000 steps.

A.8 EXPERIMENTAL SETTINGS IN LONG-CONTEXT EXPERIMENT

The dimensionality of the word embedding $d_{\mathrm{model}}$ is 4096, the number of heads N is 32, the dimensionality of the heads d is 128, and the number of layers is 32. The number of training steps is 30,000, and the batch size is 1. The learning rate was set to 0.0003. We used AdamW (Loshchilov & Hutter, 2019) as the optimizer, with $(\beta_1, \beta_2) = (0.9, 0.95)$. In accordance with previous research (Rubin & Berant, 2024; Wu et al., 2022; Zhang et al., 2024), we then used 100 sampled sequences in the training set for evaluation. In this experiment, because of the large model size and long sequence length, we report perplexity only for non-overlapping inference using $L_{\mathrm{train}}$, since the memory capacity would otherwise be exceeded.

¹² https://github.com/ofirpress/attention_with_linear_biases

A.9 RICKER WAVELET

Figures 4 and 5 show Ricker wavelets at multiple scales a.

Figure 4: Graph of compared Ricker wavelet functions with $a = [2^0, 2^1, 2^2, 2^3, 2^4]$

Figure 5: Graph of compared Ricker wavelet functions with $a = [2^5, 2^6, 2^7, 2^8, 2^9]$
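Curves like those in Figures 4 and 5 can be generated with a few lines of Python. The sketch below uses the standard textbook form of the Ricker (Mexican hat) wavelet and the usual scaled and shifted form $\psi_{a,b}(t) = \psi((t - b)/a) / \sqrt{a}$; it is assumed, not confirmed here, that the paper's normalisation matches. The function names and the position range 0 to 63 are illustrative.

```python
import math

def ricker(t):
    """Standard Ricker (Mexican hat) mother wavelet."""
    norm = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return norm * (1.0 - t * t) * math.exp(-t * t / 2.0)

def scaled_ricker(t, a, b=0.0):
    """psi_{a,b}(t) = psi((t - b) / a) / sqrt(a)."""
    return ricker((t - b) / a) / math.sqrt(a)

# Scales 2^0 ... 2^9, i.e. the values listed for Figures 4 and 5
scales = [2 ** k for k in range(10)]
positions = range(64)                      # illustrative range of relative positions
curves = {a: [scaled_ricker(t, a) for t in positions] for a in scales}
```

With this form, the wavelet at scale a crosses zero at distance a from its centre, so larger scales stay positive over a wider window of positions.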

A.10 WAVELET TYPE

Figure 6 shows graphs of the wavelets compared in Section 6.3.1. It can be seen that the simplest is the Haar wavelet, while the most complex is the Morlet wavelet.

Figure 6: Graph of compared wavelet functions. The case with scale parameter $a = 2^4$ and shift parameter $b = 0$ is shown.

A.11 EXAMPLE OF HEAT MAP AND TEXT CORRESPONDENCE

Figure 7 shows the attention map after the softmax operation for the proposed method. A notable feature of the proposed method is that it consistently attends to specific tokens. The tokens that always receive attention are those that are important in the sentence, such as the </s> token, the first token, and words that are the subject of the sequence, such as "he". Moreover, as with ALiBi, the proposed method has a different scope of attention for each head.

Figure 7: Heatmap of attention scores $e_{ij}$ after the softmax operation for the proposed method. The maximum sequence length is $L_{\max} = 512$, and the sequence length at inference is L = 1012. From left to right, heads n = 1, 2, and 4 are shown. Scores above 0.01 are mapped to black and the rest to yellow. Words that always received attention in all heads are shown in red, and words that frequently received attention only in head n = 2 are shown in blue. The middle of the text is omitted because the sequence is 1012 tokens long.

A.12 CAN IT HANDLE TOKENS WITH LONG-RANGE DEPENDENCIES?

Figure 8: Heatmap of attention scores after the softmax operation in heads 1-3 and 5-8 for ALiBi, RoPE, and our method. For clarity, values of 0.001 or more are mapped to black, while values below that are mapped to yellow.

Table 3: Perplexity of validation set in extrapolation experiments using Wikitext-103. Maximum allowable length of sequences in pre-training is $L_{\mathrm{train}} = 512$. Values are perplexity without non-overlapping inference; the numeric column headers give the sequence length.

| Wavelet | Scale a | Shift b | 128 | 256 | 512 | 1012 | 1512 | 2512 |
|---|---|---|---|---|---|---|---|---|
| Ricker | $\{2^0, 2^1, \ldots, 2^7\}$ | $\{0, 1, 2, \ldots, 15\}$ | 23.64 | 20.82 | 19.19 | 18.23 | 18.00 | 17.99 |
| Ricker | $\{2^1, 2^2, \ldots, 2^8\}$ | $\{0, 1, 2, \ldots, 15\}$ | 23.77 | 20.89 | 19.25 | 18.23 | 17.97 | 18.02 |
| Ricker | $\{2^2, 2^3, \ldots, 2^9\}$ | $\{0, 1, 2, \ldots, 15\}$ | 23.92 | 21.03 | 19.40 | 18.41 | 18.14 | 18.07 |
| Ricker | $\{2^0, 2^1, 2^2, 2^3\}$ | $\{0, 1, 2, \ldots, 31\}$ | 23.96 | 21.13 | 19.55 | 18.87 | 19.40 | 21.73 |
| Ricker | $\{2^0, 2^1\}$ | $\{0, 1, 2, \ldots, 63\}$ | 24.49 | 21.60 | 19.95 | 20.90 | 32.01 | 70.80 |
| Ricker | $\{2^0, 2^1, \ldots, 2^{15}\}$ | $\{0, 1, 2, \ldots, 7\}$ | 23.74 | 20.88 | 19.24 | 18.22 | 17.96 | 17.84 |
| Ricker | $\{2^0, 2^1, \ldots, 2^{31}\}$ | $\{0, 1, 2, 3\}$ | 23.75 | 20.86 | 19.26 | 18.24 | 17.96 | 17.84 |
| Ricker | $\{2^0, 2^1, \ldots, 2^{63}\}$ | $\{0, 1\}$ | 23.75 | 20.88 | 19.30 | 18.31 | 18.04 | 18.02 |
| Ricker | $\{2^0, 2^1, \ldots, 2^{127}\}$ | $\{0\}$ | 23.97 | 21.10 | 19.46 | 18.50 | 18.27 | 18.29 |
| Ricker | $\{2^7\}$ | $\{0, 1, 2, \ldots, 127\}$ | 24.35 | 21.45 | 19.80 | 20.68 | 20.87 | 21.31 |
| Gaussian | $\{2^0, 2^1, \ldots, 2^7\}$ | $\{0, 1, 2, \ldots, 15\}$ | 23.77 | 20.90 | 19.30 | 18.31 | 18.02 | 17.88 |
| Gaussian | $\{2^1, 2^2, \ldots, 2^8\}$ | $\{0, 1, 2, \ldots, 15\}$ | 23.92 | 21.02 | 19.41 | 18.41 | 18.15 | 18.01 |
| Gaussian | $\{2^2, 2^3, \ldots, 2^9\}$ | $\{0, 1, 2, \ldots, 15\}$ | 23.98 | 21.09 | 19.46 | 18.43 | 18.13 | 17.93 |
| Gaussian | $\{2^0, 2^1, 2^2, 2^3\}$ | $\{0, 1, 2, \ldots, 31\}$ | 23.83 | 29.96 | 19.33 | 18.43 | 18.40 | 18.94 |
| Gaussian | $\{2^0, 2^1\}$ | $\{0, 1, 2, \ldots, 63\}$ | 24.28 | 21.35 | 19.70 | 18.96 | 19.63 | 23.14 |
| Gaussian | $\{2^0, 2^1, \ldots, 2^{15}\}$ | $\{0, 1, 2, \ldots, 7\}$ | 23.72 | 20.86 | 19.24 | 18.24 | 17.95 | 17.77 |
| Gaussian | $\{2^0, 2^1, \ldots, 2^{31}\}$ | $\{0, 1, 2, 3\}$ | 23.78 | 20.92 | 19.29 | 18.30 | 18.01 | 17.85 |
| Gaussian | $\{2^0, 2^1, \ldots, 2^{63}\}$ | $\{0, 1\}$ | 23.86 | 20.98 | 19.37 | 18.46 | 18.20 | 18.10 |
| Gaussian | $\{2^0, 2^1, \ldots, 2^{127}\}$ | $\{0\}$ | 24.21 | 21.31 | 19.68 | 18.71 | 18.45 | 18.45 |
| Gaussian | $\{2^7\}$ | $\{0, 1, 2, \ldots, 127\}$ | 24.48 | 21.62 | 20.05 | 19.53 | 22.63 | 35.23 |
| Haar | - | - | 24.98 | 22.07 | 20.49 | 51.61 | 116.87 | 299.26 |
| Haar | $\{2^0, 2^1, \ldots, 2^7\}$ | $\{0, 1, 2, \ldots, 15\}$ | 23.73 | 20.89 | 19.27 | 18.34 | 18.11 | 18.17 |
| Morlet | $\{2^0, 2^1, \ldots, 2^7\}$ | $\{0, 1, 2, \ldots, 15\}$ | 24.15 | 21.28 | 19.65 | 19.02 | 20.46 | 26.56 |

A.13 ABLATION STUDY OF SCALE AND SHIFT PARAMETERS

In this section, we present the findings from our ablation study focusing on the shift and scale parameters of the Ricker and Gaussian wavelets. As indicated in Table 1, both wavelet types demonstrate substantial effectiveness in our method. To further evaluate their performance, we explored the contributions of two parameters, i.e., the scale parameter a and the shift parameter b, while keeping all other settings consistent with those outlined in Section 6.

Results The results of our experiments are summarized in Table 3. Both the Ricker and Gaussian wavelets exhibit similar trends regarding the influence of the scale and shift parameters on extrapolation performance. First, increasing the values of the scale parameter a while holding the shift parameter b constant (scales $\{2^0, 2^1, \ldots, 2^7\}$, $\{2^1, 2^2, \ldots, 2^8\}$, and $\{2^2, 2^3, \ldots, 2^9\}$, each with shifts $\{0, 1, 2, \ldots, 15\}$) maintained extrapolation performance, albeit with some fluctuations. Conversely, when we increased the number of shift parameters while decreasing the number of scale parameters (scales $\{2^0, 2^1, 2^2, 2^3\}$ with shifts $\{0, 1, 2, \ldots, 31\}$, and scales $\{2^0, 2^1\}$ with shifts $\{0, 1, 2, \ldots, 63\}$), there was a noticeable decline in performance. These findings underscore the significance of the scale parameters in extrapolation. Moreover, increasing the number of scale parameters while decreasing the number of shift parameters led to performance improvements in some instances (scales $\{2^0, 2^1, \ldots, 2^{15}\}$ with shifts $\{0, 1, 2, \ldots, 7\}$, and scales $\{2^0, 2^1, \ldots, 2^{31}\}$ with shifts $\{0, 1, 2, 3\}$). However, when the shift parameters were reduced to two or to a single value (scales $\{2^0, 2^1, \ldots, 2^{63}\}$ with shifts $\{0, 1\}$, and scales $\{2^0, 2^1, \ldots, 2^{127}\}$ with shift $\{0\}$), relying almost solely on the scale parameters resulted in a deterioration of extrapolation performance. Likewise, when the scale parameter was fixed and only the shift parameter was varied (scale $\{2^7\}$ with shifts $\{0, 1, 2, \ldots, 127\}$), the extrapolation performance decreased. This suggests the potential importance of shift parameters as well.

In conclusion, our analysis highlights the critical roles of both shift and scale parameters in the effectiveness of our wavelet-based method.
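One way to read the configurations in Table 3 is that every setting pairs its scales and shifts into the same number of (a, b) combinations, matching the head dimensionality d = 128 given in A.7. This is an inference from the table, not a statement made in the text. The sketch below enumerates the grids under that assumption; `wavelet_grid` and `table3_configs` are hypothetical names introduced for illustration.

```python
from itertools import product

def wavelet_grid(num_scales, num_shifts, first_exponent=0):
    """All (scale, shift) pairs for scales 2^e, 2^(e+1), ... and shifts 0, 1, ..."""
    scales = [2 ** (first_exponent + k) for k in range(num_scales)]
    shifts = list(range(num_shifts))
    return list(product(scales, shifts))

# (number of scales, number of shifts) for the Ricker/Gaussian rows of Table 3
table3_configs = [(8, 16), (4, 32), (2, 64), (16, 8), (32, 4), (64, 2), (128, 1), (1, 128)]
sizes = [len(wavelet_grid(s, b)) for s, b in table3_configs]

# The default setting: scales 2^0..2^7 combined with shifts 0..15
default_grid = wavelet_grid(8, 16)
```

Under this reading, the ablation trades scales against shifts at a fixed budget of 128 combinations, which is why adding scales always removes shifts and vice versa.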

A.14 ABLATION STUDY OF WAVELET TYPES

In this section, we also explored a variety of wavelet types beyond those previously discussed. In Section 6.3.1, our focus was primarily on wavelets that could be computed directly from mathematical formulas. However, in this section, we expand our inquiry to include wavelets with varying numbers of vanishing moments as well as discrete wavelet transformations. Additionally, drawing from previous research (Wang et al., 2020), we considered the necessity for a distinct approach when incorporating complex numbers into positional encoding. Consequently, our study did not encompass wavelets that incorporate complex numbers.

Wavelet types The specific wavelets under consideration in our investigation are outlined as follows:

Daubechies (db) (Daubechies, 1992) - Compactly supported orthonormal wavelets

Symlets (sym) - Wavelets with minimum asymmetry

Coiflets (coif) - The scaling and wavelet functions have the same number of vanishing moments

Meyer (dmey) - Wavelets defined in the frequency domain

Biorthogonal Spline (bior) - Two wavelets are used: one for decomposition, and the other for reconstruction

Reverse biorthogonal Spline (rbio)

The graphs of these wavelets are shown in Figures 9 and 10. As the number of vanishing moments increases, the oscillation of the wavelet becomes larger. We therefore also compared wavelets by their number of vanishing moments. The name of a wavelet is derived from its number of vanishing moments. For example, db6 is a Daubechies wavelet with 6 vanishing moments, and sym3 is a Symlet wavelet with 3 vanishing moments. Coiflet wavelets are an exception: coif3 is a Coiflet wavelet with 6 vanishing moments. The names of the bior and rbio wavelets are derived from the number of vanishing moments possessed by the decomposition and reconstruction wavelets, respectively. For example, bior3.5 is a biorthogonal wavelet that has 3 vanishing moments for the decomposition wavelet and 5 vanishing moments for the reconstruction wavelet. For biorthogonal and reverse biorthogonal wavelets, approximate values can be calculated for both the decomposition and reconstruction wavelets, but here we used only the values of the decomposition wavelets.

Experimental Settings We used PyWavelets (Lee et al., 2019)¹³ to calculate the approximate values of these wavelets. In this experiment, we calculated the approximate values by specifying the 8 levels {1, 2, ..., 8} instead of the 8 scale parameters $\{2^0, 2^1, \ldots, 2^7\}$. We used the shift parameters {0, 1, 2, ..., 15}. The other experimental settings are the same as those in Section 6.

Results The experimental results are summarized in Table 4. Overall, the extrapolation performance was suboptimal. However, because the parameters were fixed at levels {1, 2, ..., 8}, we believe that performance may improve if these levels are adjusted. Notably, the rbio1.1 wavelet demonstrated promising extrapolation capabilities, suggesting significant potential for future improvements. In contrast, the coif and dmey wavelets exhibited limited performance even with shorter sequences, indicating that they may be unsuitable for position encoding. For the other wavelets, extrapolation performance (> 512) was generally low, but interpolation performance (≤ 512) remained consistently stable, highlighting another avenue for enhancement. Furthermore, the performance of the db, bior, and rbio wavelets showed a positive correlation with the number of vanishing moments. This finding underscores the importance of the number of vanishing moments as a factor influencing performance. In conclusion, our analysis indicates that both the shape of the wavelet and the number of vanishing moments play significant roles in determining extrapolation performance. Future work should explore these relationships further to identify optimal configurations.

¹³ https://pywavelets.readthedocs.io/en/latest/index.html

Table 4: Perplexity without Non-overlapping Inference. We evaluated the validation set in extrapolation experiments using Wikitext-103. The maximum allowable length of sequences in pre-training is $L_{\mathrm{train}} = 512$. The numeric column headers give the sequence length.

| Wavelet type | 128 | 256 | 512 | 1012 | 1512 | 2512 |
|---|---|---|---|---|---|---|
| *Continuous Wavelet Families* | | | | | | |
| Ricker | 23.64 | 20.82 | 19.19 | 18.23 | 18.00 | 17.99 |
| Gaussian | 23.77 | 20.90 | 19.30 | 18.31 | 18.02 | 17.88 |
| Morlet | 24.15 | 21.28 | 19.65 | 19.02 | 20.46 | 26.56 |
| *Discrete Wavelet Families* | | | | | | |
| Haar | 23.73 | 20.89 | 19.27 | 18.34 | 18.11 | 18.17 |
| db2 | 25.22 | 22.26 | 20.64 | 30.30 | 60.27 | 130.93 |
| db4 | 25.22 | 22.47 | 21.37 | 41.78 | 51.75 | 56.18 |
| db8 | 25.19 | 22.48 | 21.58 | 26.90 | 31.55 | 39.75 |
| db16 | 25.23 | 22.43 | 21.24 | 21.15 | 22.16 | 46.65 |
| db32 | 25.12 | 22.35 | 21.14 | 21.20 | 22.40 | 38.00 |
| sym2 | 25.11 | 22.21 | 20.68 | 31.25 | 61.00 | 126.32 |
| sym4 | 25.27 | 22.56 | 21.98 | 24.70 | 26.81 | 42.81 |
| sym8 | 29.27 | 26.13 | 24.63 | 23.97 | 31.47 | 92.36 |
| coif1 | 31.24 | 28.00 | 26.24 | 64.62 | 71.06 | 97.60 |
| coif2 | 25.24 | 22.47 | 21.39 | 27.74 | 27.39 | 44.26 |
| coif4 | 49.91 | 45.15 | 42.42 | 41.07 | 56.08 | 110.27 |
| coif8 | 25.15 | 22.39 | 21.26 | 21.31 | 22.26 | 35.73 |
| coif16 | 126.38 | 117.88 | 113.42 | 132.14 | 166.77 | 230.95 |
| dmey | 30.38 | 27.12 | 25.45 | 25.88 | 46.35 | 131.48 |
| bior1.3 | 26.27 | 23.36 | 23.69 | 23.38 | 30.71 | 88.66 |
| bior2.2 | 25.28 | 22.51 | 21.59 | 29.71 | 29.43 | 50.25 |
| bior2.6 | 25.29 | 22.70 | 21.60 | 22.15 | 22.71 | 40.61 |
| bior3.1 | 26.92 | 24.02 | 22.38 | 59.30 | 113.81 | 205.54 |
| bior3.5 | 25.17 | 22.49 | 21.65 | 27.41 | 27.19 | 53.99 |
| bior3.9 | 25.24 | 22.48 | 21.51 | 21.89 | 23.86 | 50.14 |
| bior4.4 | 25.52 | 22.72 | 21.64 | 21.67 | 24.42 | 51.46 |
| bior5.5 | 25.21 | 22.55 | 21.72 | 23.43 | 24.68 | 36.30 |
| bior6.8 | 25.14 | 22.39 | 21.21 | 21.10 | 22.31 | 46.97 |
| rbio1.1 | 24.26 | 21.34 | 19.69 | 18.79 | 18.63 | 18.98 |
| rbio1.3 | 25.28 | 22.50 | 21.39 | 52.06 | 47.78 | 59.94 |
| rbio2.2 | 25.92 | 23.08 | 21.98 | 68.57 | 86.12 | 93.90 |
| rbio2.6 | 25.29 | 22.68 | 21.60 | 24.54 | 24.47 | 44.57 |

Figure 9: Graphs of the compared wavelets with level = 10. PyWavelets (Lee et al., 2019) was used to calculate the wavelets.

Figure 10: Graphs of the compared wavelets with level = 10. PyWavelets (Lee et al., 2019) was used to calculate the wavelets.

A.15 EVALUATION ON LONGBENCH

The models pre-trained in Section 7 were evaluated on LongBench (Bai et al., 2024). This evaluation was conducted using datasets that contain relatively long texts. The multi-document QA and single-document QA tasks were evaluated on all of their datasets. Since pre-training was conducted using an English dataset, evaluation was conducted using only the English datasets.

The results are shown in Figure 11. On some datasets (Qasper, MuSiQue, and QMSum), the model that adopted RoPE performed better. On NarrativeQA, the two models attained almost the same score. On the remaining tasks, the proposed method was more effective. Note that this is an evaluation of a model that was pre-trained on a small dataset (redpajama-1B). As future work, it will be necessary to pre-train the model with a larger dataset and to conduct evaluations on other benchmarks designed for long sequences, such as Long Range Arena (Tay et al., 2021).

Figure 11: Evaluation results using LongBench (Bai et al., 2024). We evaluated the models pre-trained in Section 7. The score differences between the model using the proposed wavelet-based position representation and the model using RoPE are shown. The tasks were evaluated using datasets that contain relatively long texts.

Table 5: Overview of the dataset statistics in LongBench (Bai et al., 2024). Avg len (average length) is computed using the number of words in the English datasets.

| Dataset | Avg len | Metric | Samples |
|---|---|---|---|
| NarrativeQA | 18,409 | F1 | 200 |
| Qasper | 3,619 | F1 | 200 |
| MultiFieldQA-en | 4,559 | F1 | 150 |
| HotpotQA | 9,151 | F1 | 200 |
| 2WikiMQA | 4,887 | F1 | 200 |
| MuSiQue | 11,214 | F1 | 200 |
| TriviaQA | 8,209 | F1 | 200 |
| SAMSum | 6,258 | Rouge-L | 200 |
| QMSum | 10,614 | Rouge-L | 200 |