# COSFORMER: RETHINKING SOFTMAX IN ATTENTION

Published as a conference paper at ICLR 2022

Zhen Qin¹, Weixuan Sun¹,³, Hui Deng¹,⁴, Dongxu Li³, Yunshen Wei¹, Baohong Lv¹, Junjie Yan¹, Lingpeng Kong²,⁵, Yiran Zhong¹,²

¹SenseTime Research, ²Shanghai AI Laboratory, ³Australian National University, ⁴Northwestern Polytechnical University, ⁵The University of Hong Kong

{lastnamefirstname}@sensetime.com, lpk@cs.hku.hk

ABSTRACT

Transformers have shown great success in natural language processing, computer vision, and audio processing. As one of their core components, softmax attention helps to capture long-range dependencies yet prohibits scaling up due to its quadratic space and time complexity with respect to the sequence length. Kernel methods are often adopted to reduce this complexity by approximating the softmax operator. Nevertheless, due to approximation errors, their performance varies across tasks and corpora and suffers crucial drops compared with vanilla softmax attention. In this paper, we propose a linear transformer called COSFORMER that achieves comparable or better accuracy than the vanilla transformer in both causal and cross attention. COSFORMER is based on two key properties of softmax attention: i) non-negativeness of the attention matrix; ii) a non-linear re-weighting scheme that concentrates the distribution of the attention matrix. As its linear substitute, COSFORMER fulfills these properties with a linear operator and a cosine-based distance re-weighting mechanism. Extensive experiments on language modeling and text understanding tasks demonstrate the effectiveness of our method. We further examine our method on long sequences and achieve state-of-the-art performance on the Long-Range Arena benchmark. The source code is available at COSFORMER.

1 INTRODUCTION

Figure 1: Performance (y axis), speed (x axis), and memory footprint (circle sizes) of efficient transformers on the Long-Range Arena benchmark. The proposed COSFORMER achieves all-around supremacy over competing methods in the top-left quadrant.

With years of development, the transformer model (Vaswani et al., 2017) and its variants (Zaheer et al., 2020; Wang et al., 2020; Tay et al., 2020a) have been successfully adapted to the three most popular artificial intelligence (AI) fields, i.e., natural language processing (Devlin et al., 2019; Liu et al., 2019), computer vision (Dosovitskiy et al., 2020; Carion et al., 2020; Liu et al., 2021) and audio processing (Schneider et al., 2019; Baevski et al., 2020). Compared with conventional recurrent (Hochreiter & Schmidhuber, 1997) and convolutional architectures (He et al., 2016), transformer-based architectures are generally more scalable to data volumes (Brown et al., 2020) and stronger in capturing global information with less inductive bias, thus excelling on many tasks. Dot-product attention with softmax normalization is the cornerstone of the transformer for capturing long-range dependencies. However, its quadratic space and time complexity with regard to the sequence length makes its computational overhead prohibitive, especially for long inputs.
To address this issue, numerous methods have been proposed recently, such as sparse attention matrices (Zaheer et al., 2020; Beltagy et al., 2020; Tay et al., 2020a; Kitaev et al., 2019; Child et al., 2019), low-rank representations (Wang et al., 2020) or kernel-based methods (Peng et al., 2020; Choromanski et al., 2020; Katharopoulos et al., 2020), among many others. These methods achieve reduced computational complexity with comparable performance to the vanilla attention architecture on several selected tasks or corpora. However, the improved efficiency is usually achieved by introducing additional yet often impractical assumptions on the attention matrix (Wang et al., 2020), or by approximations of the softmax operation that are valid only within constrained theoretical bounds (Choromanski et al., 2020; Peng et al., 2020). Therefore, when their assumptions are unsatisfied or when approximation errors accumulate, these methods may not always be advantageous over the vanilla architecture (Narang et al., 2021). Consequently, performance deficiencies across a broad application spectrum are often observed in these transformer variants, especially those with linear complexity. For example, the Performer (Choromanski et al., 2020), RFA (Peng et al., 2020) and Reformer (Kitaev et al., 2019) show less satisfactory performance on the GLUE benchmark (Wang et al., 2018) than the vanilla architecture, as suggested by our preliminary experiments (Tab. 2). Furthermore, many of these methods are not applicable to causal attention, which is critical for auto-regressive training. For example, techniques proposed in Linformer (Wang et al., 2020) and BigBird (Zaheer et al., 2020) are specific to cross attention.

Since the softmax operator appears to be the main hurdle while an efficient yet accurate approximation to softmax is difficult to achieve, one question naturally arises: can we replace the softmax operator with a linear function instead, while maintaining its key properties? By digging into softmax attention, we find two key properties that affect its empirical performance: (i) elements in the attention matrix are non-negative (Tsai et al., 2019; Katharopoulos et al., 2020); (ii) the non-linear re-weighting scheme acts as a stabilizer for the attention weights (Titsias, 2016; Gao & Pavel, 2017; Jang et al., 2016). These findings reveal some new insights into current approaches. For example, the linear transformer (Katharopoulos et al., 2020) achieves property (i) using an exponential linear unit (Clevert et al., 2016) activation function. However, due to the lack of a re-weighting scheme, it underperforms other efficient transformer variants on the Long-Range Arena benchmark, as shown in Figure 1, as well as on the language modeling task (Table 2) in our controlled experiments.

In this paper, we propose a new variant of linear transformer called COSFORMER that satisfies both of the above properties. Specifically, we enforce the non-negativity property by passing the features through a ReLU (Agarap, 2018) activation function before computing the similarity scores. In this way, we encourage the model to avoid aggregating negatively-correlated contextual information. Further, we adopt a cos re-weighting scheme to stabilize the attention weights. This helps the model amplify local correlations, which usually contain more relevant information for natural language tasks. Thanks to Ptolemy's theorem, our attention can be exactly decomposed into a linear form.
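For concreteness, the exact decomposition referred to here is the angle-addition identity that is spelled out later in Eqs. 10-11; in the notation of Section 2.4 (where $Q'_i = \mathrm{ReLU}(Q_i)$, $K'_j = \mathrm{ReLU}(K_j)$ and $M \ge N$ is a length constant), it reads:

$$Q'_i K_j'^{\top} \cos\!\left(\frac{\pi}{2}\cdot\frac{i-j}{M}\right) = \left(Q'_i \cos\frac{\pi i}{2M}\right)\left(K'_j \cos\frac{\pi j}{2M}\right)^{\!\top} + \left(Q'_i \sin\frac{\pi i}{2M}\right)\left(K'_j \sin\frac{\pi j}{2M}\right)^{\!\top}.$$

Because each term factors into an i-dependent part and a j-dependent part, the re-weighted attention inherits the same linear-time reordering as plain kernelized attention.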
We perform extensive experiments on both autoregressive and bidirectional language models on five public benchmarks, including WikiText-103 (Merity et al., 2017), GLUE (Wang et al., 2018), IMDB (Maas et al., 2011), AMAZON (Ni et al., 2019) and the Long-Range Arena benchmark (Tay et al., 2020b). Our model shows much better inference speed and a smaller memory footprint, while achieving on-par performance with the vanilla transformer. It is noteworthy that our method ranks 1st on the Long-Range Arena benchmark, showing more favorable performance than other competitors, which demonstrates its strong capacity for modeling long sequence inputs.

2 OUR METHOD

In this section, we provide technical details of our linear transformer, COSFORMER. The key insight of COSFORMER is to replace the non-decomposable, non-linear softmax operation with a linear operation equipped with a decomposable non-linear re-weighting mechanism. Our model is applicable to both causal and cross attention with linear time and space complexity with regard to the input sequence length, thus exhibiting strong capacity in modeling long-range dependencies.

2.1 THE GENERAL FORM OF TRANSFORMER

Given an input sequence x of length N, we first represent it in the embedding space $x \in \mathbb{R}^{N \times d}$ with feature dimension d. A transformer block $\mathcal{T}: \mathbb{R}^{N \times d} \to \mathbb{R}^{N \times d}$ with input x is defined as:

$$\mathcal{T}(x) = \mathcal{F}(\mathcal{A}(x) + x), \quad (1)$$

where $\mathcal{F}$ is a feedforward network that contains a residual connection; $\mathcal{A}$ is the self-attention function that computes the attention matrix A, which has quadratic space and time complexity with respect to N, thus becoming the computation bottleneck of $\mathcal{T}$ on long inputs. There are three key components in $\mathcal{A}$, namely query (Q), key (K) and value (V), computed through three learnable linear matrices $W_Q$, $W_K$, $W_V$: $Q = xW_Q$, $K = xW_K$, $V = xW_V$. We use $M_i$ to represent the i-th row of a matrix M; the output $O \in \mathbb{R}^{N \times d}$ of $\mathcal{A}(x)$ can then be computed as:

$$O = \mathcal{A}(x) = [O_1, \dots, O_N]^{\top}, \qquad O_i = \sum_{j=1}^{N} \frac{\mathcal{S}(Q_i, K_j)}{\sum_{j'=1}^{N} \mathcal{S}(Q_i, K_{j'})} V_j, \quad (2)$$

where $\mathcal{S}(\cdot)$ measures the similarity between queries and keys. If $\mathcal{S}(Q, K) = \exp(QK^{\top})$, Eq. 2 becomes dot-product attention with softmax normalization. In this case, the space and time complexity to compute one row $O_i$ of the output is O(N). Therefore, the total space and time complexity for computing O grows quadratically with respect to the input length.

2.2 LINEARIZATION OF SELF-ATTENTION

According to Eq. 2, we can select any similarity function to compute the attention matrix. In order to maintain a linear computation budget, one solution is to adopt a decomposable similarity function such that:

$$\mathcal{S}(Q_i, K_j) = \phi(Q_i)\phi(K_j)^{\top}, \quad (3)$$

where $\phi$ is a kernel function that maps the queries and keys to their hidden representations. Then one can rewrite Eq. 2 in terms of the kernel function as:

$$O_i = \frac{\sum_{j=1}^{N} \big(\phi(Q_i)\phi(K_j)^{\top}\big) V_j}{\sum_{j=1}^{N} \phi(Q_i)\phi(K_j)^{\top}}. \quad (4)$$

After that, attention in linear complexity is achieved via the matrix product property:

$$\big(\phi(Q)\phi(K)^{\top}\big)V = \phi(Q)\big(\phi(K)^{\top}V\big). \quad (5)$$

In this form (Eq. 5), instead of explicitly computing the attention matrix $A = QK^{\top} \in \mathbb{R}^{N \times N}$, we calculate $\phi(K)^{\top}V \in \mathbb{R}^{d \times d}$ first and then multiply by $\phi(Q) \in \mathbb{R}^{N \times d}$. Using this trick, we only incur a computation complexity of $O(Nd^2)$. Note that in typical natural language tasks, the feature dimension d of one head is always much smaller than the input sequence length N ($d \ll N$), so we can safely omit d and achieve a computation complexity of O(N), as illustrated in Figure 2.
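To make the reordering in Eq. 5 concrete, the following minimal NumPy sketch (illustrative only; the shapes and the non-negative stand-in for φ are our assumptions, not the released implementation) checks that contracting φ(K)ᵀV first gives the same output as materializing the N × N attention matrix:

```python
import numpy as np

# Verify that phi(Q) (phi(K)^T V) equals (phi(Q) phi(K)^T) V, the reordering behind Eq. 5.
rng = np.random.default_rng(0)
N, d = 1024, 64
phi_Q = np.abs(rng.standard_normal((N, d)))   # stand-in for phi(Q): any non-negative feature map
phi_K = np.abs(rng.standard_normal((N, d)))   # stand-in for phi(K)
V = rng.standard_normal((N, d))

# Quadratic route: materialize the N x N matrix phi(Q) phi(K)^T first.
out_quadratic = (phi_Q @ phi_K.T) @ V          # O(N^2 d) time, O(N^2) memory

# Linear route: contract phi(K)^T V (a d x d summary) first.
out_linear = phi_Q @ (phi_K.T @ V)             # O(N d^2) time, O(d^2) extra memory

assert np.allclose(out_quadratic, out_linear)
```

The quadratic route allocates an N × N intermediate, while the linear route never stores more than a d × d summary, which is exactly the saving exploited throughout the paper.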
Previous solutions As mentioned above, the key to linear attention is to find a decomposable similarity function $\mathcal{S}(\cdot)$ that generalizes well to different tasks. Most existing linear transformers try to find an unbiased estimate of softmax attention. For example, RFA (Peng et al., 2020) approximates the softmax operation with random feature maps using the theorem of random Fourier features (Rahimi & Recht, 2008), and the Performer (Choromanski et al., 2020) utilizes positive random features to approximate it. However, we empirically find that these methods are sensitive to the choice of sampling rate and become unstable if the sampling rate gets too high. Also, to accommodate recency bias, gating mechanisms are employed to better exploit more recent context. Another group of works attempts to directly replace softmax with a linear operation. For example, the linear transformer (Katharopoulos et al., 2020) replaces the softmax similarity function with a pure dot product $\mathcal{S} = QK^{\top}$, and uses a non-linear activation function $\phi(\cdot) = \mathrm{elu}(\cdot) + 1$ to model the pairwise relations between features. However, our controlled experiments show that this solution does not necessarily generalize well to many downstream tasks (Tab. 2) or the Long-Range Arena benchmark (Tab. 4). In this paper, we propose a new replacement for softmax that not only achieves comparable or better performance than softmax attention on a wide range of tasks, but also enjoys linear space and time complexity.

Figure 2: Illustration of the computations for vanilla self-attention (left) and linearized attention (right). The input length is N and the feature dimension is d, with d ≪ N. Tensors in the same box are associated for computation. The linearized formulation allows O(N) time and space complexity.

2.3 ANALYSIS OF SOFTMAX ATTENTION

In the vanilla transformer architecture, when $\mathcal{S}(Q, K) = \exp(QK^{\top})$, the softmax operation is applied to obtain a row-wise normalization of the attention matrix $A \in \mathbb{R}^{N \times N}$, as shown in Eq. 2. In other words, we normalize the relations of each element in the input sequence to all other elements in order to obtain a weighted aggregation of contextual information. However, apart from the good empirical performance of softmax attention, its crucial and necessary characteristics remain only loosely determined in the original transformer paper and follow-up works.

Table 1: Analysis of the softmax properties. All attention variants are implemented in the RoBERTa (Liu et al., 2019) architecture and pre-trained on the WikiText-103 (Merity et al., 2017) dataset. Loss is the validation loss. We then fine-tune these variants on each downstream dataset and report accuracy (higher is better).

| Variant | Loss | QQP | SST-2 | MNLI |
|---|---|---|---|---|
| φ_I | 2.343 | 84.23 | 76.26 | 58.27 |
| φ_LeakyReLU | 2.246 | 84.46 | 78.21 | 74.26 |
| φ_ReLU | 1.993 | 88.86 | 89.90 | 77.86 |
| softmax | 1.915 | 88.41 | 92.31 | 79.15 |

In this work, we empirically identify two key properties of the softmax operation that may play important roles in its performance: 1) it ensures that all values in the attention matrix A are non-negative; 2) it provides a non-linear re-weighting mechanism that concentrates the distribution of attention connections and stabilizes training (Titsias, 2016; Gao & Pavel, 2017; Jang et al., 2016).
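As a quick numeric illustration of property 2) (a toy example of ours, not an experiment from the paper), the snippet below normalizes the same row of scores linearly and with softmax: linear normalization preserves the ratios between scores, whereas softmax amplifies gaps exponentially and concentrates the row on its largest entries.

```python
import numpy as np

# Toy scores for one attention row (values chosen arbitrarily for illustration).
scores = np.array([3.0, 2.0, 1.0, 0.5])

linear = scores / scores.sum()                  # ratios preserved: 3 : 2 : 1 : 0.5
soft = np.exp(scores) / np.exp(scores).sum()    # ratios become exp(3) : exp(2) : exp(1) : exp(0.5)

print(linear.round(3))  # [0.462 0.308 0.154 0.077]
print(soft.round(3))    # [0.631 0.232 0.085 0.052] -> more mass on the top score
```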
To validate these assumptions, we design the following preliminary studies, summarized in Table 1. First, to validate the importance of non-negativity, we compare three instantiations of the function φ in Eq. 3: an identity mapping $\phi_I = I$ that does not preserve non-negativity, and $\phi_{\mathrm{ReLU}}(\cdot) = \mathrm{ReLU}(\cdot)$, which retains only positive input values and replaces negative values with zeros. We also add the $\phi_{\mathrm{LeakyReLU}}(\cdot) = \mathrm{LeakyReLU}(\cdot)$ variant, which likewise does not enforce non-negativity but has the same non-linearity as the ReLU variant. Second, to demonstrate the effect of non-linear re-weighting, we compare models using only $\phi_{\mathrm{ReLU}}(\cdot)$ without any re-weighting against those with the softmax operation.

From Table 1, the superior results of $\phi_{\mathrm{ReLU}}$ over $\phi_I$ and $\phi_{\mathrm{LeakyReLU}}$ demonstrate the benefit of retaining non-negative values. Our conjecture is that by retaining only positive values in the similarity matrices, the model ignores features with negative correlations, thus effectively avoiding the aggregation of irrelevant contextual information. Comparing $\phi_{\mathrm{ReLU}}$ with softmax, we observe that models with softmax re-weighting converge faster and generalize better to downstream tasks. This might be explained by the fact that softmax normalization amplifies correlated pairs, which can help identify useful patterns.

2.4 COSFORMER

Based on the observations above, we propose COSFORMER, which entirely discards the softmax normalization while still featuring the non-negativity and re-weighting mechanism. COSFORMER consists of two main components: a linear projection kernel $\phi_{\mathrm{linear}}$ and a cos-based re-weighting mechanism. Below we describe each component in detail.

Linear projection kernel $\phi_{\mathrm{linear}}$ Recalling the general form of attention in Eq. 2, we define a linear similarity as:

$$\mathcal{S}(Q, K) = s\big(\phi_{\mathrm{linear}}(Q), \phi_{\mathrm{linear}}(K)\big) = s(Q', K'), \quad (6)$$

where $\phi_{\mathrm{linear}}$ is the transformation function that maps the queries Q and keys K to our desired representations Q' and K', and s is a function that can be linearly decomposed to measure the similarity between Q' and K'. Specifically, in order to ensure a fully positive attention matrix A and to avoid aggregating negatively-correlated information, we adopt ReLU(·) as the transformation function, thereby effectively eliminating negative values:

$$\phi_{\mathrm{linear}}(x) = \mathrm{ReLU}(x). \quad (7)$$

Figure 3: (1) Attention matrix of the vanilla transformer. (2) Attention matrix of COSFORMER. (3) Attention matrix of COSFORMER without re-weighting. (4) Visualization of the cos-based distance matrix. After re-weighting, we see a smoother attention distribution along the diagonal region of the attention matrix, exhibiting a pattern similar to the vanilla transformer, which helps stabilize training.

As Q' and K' contain only non-negative values, we directly take their dot product, $s(x, y) = xy^{\top}$ with $x, y \in \mathbb{R}^{1 \times d}$, followed by a row-wise normalization to compute the attention output:

$$O_i = \frac{\sum_{j=1}^{N} f\big(\phi_{\mathrm{linear}}(Q_i), \phi_{\mathrm{linear}}(K_j)\big) V_j}{\sum_{j=1}^{N} f\big(\phi_{\mathrm{linear}}(Q_i), \phi_{\mathrm{linear}}(K_j)\big)} = \frac{\sum_{j=1}^{N} \big(\mathrm{ReLU}(Q_i)\,\mathrm{ReLU}(K_j)^{\top}\big) V_j}{\sum_{j=1}^{N} \mathrm{ReLU}(Q_i)\,\mathrm{ReLU}(K_j)^{\top}}. \quad (8)$$
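Before the linear rearrangement that follows, here is a minimal NumPy sketch of Eq. 8 in its direct quadratic form; the shapes and the small epsilon guarding against all-zero rows are our assumptions rather than details from the paper.

```python
import numpy as np

# Row-normalized attention with the ReLU feature map (Eqs. 7-8),
# still computed in its quadratic O(N^2) form at this point.
def relu_attention_quadratic(Q, K, V, eps=1e-6):
    Qp, Kp = np.maximum(Q, 0.0), np.maximum(K, 0.0)   # phi_linear = ReLU
    A = Qp @ Kp.T                                     # non-negative similarities, N x N
    A = A / (A.sum(axis=-1, keepdims=True) + eps)     # row-wise normalization
    return A @ V

rng = np.random.default_rng(0)
N, d = 128, 32
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
O = relu_attention_quadratic(Q, K, V)
print(O.shape)  # (128, 32)
```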
Based on Eq. 4, we rearrange the order of the dot products and obtain the proposed attention in linear complexity:

$$O_i = \frac{\mathrm{ReLU}(Q_i) \sum_{j=1}^{N} \mathrm{ReLU}(K_j)^{\top} V_j}{\mathrm{ReLU}(Q_i) \sum_{j=1}^{N} \mathrm{ReLU}(K_j)^{\top}}. \quad (9)$$

cos-based re-weighting mechanism The non-linear re-weighting mechanism introduced by softmax attention can concentrate the distribution of the attention weights and therefore stabilize the training process (Titsias, 2016; Gao & Pavel, 2017; Jang et al., 2016). We also empirically find that it can punish far-away connections and enforce locality in some cases. In fact, such a locality bias, i.e., that a large portion of contextual dependencies come from neighboring tokens, is commonly observed in downstream NLP tasks (Clark et al., 2019; Kovaleva et al., 2019), as shown in Figure 3 (1). Based on this assumption, what we need in order to fulfill the second property of softmax is a decomposable re-weighting mechanism that introduces a recency bias into the attention matrix. Here, we propose a cos-based re-weighting mechanism, as it perfectly fits our purpose: 1) Ptolemy's theorem ensures that the cos weights can be decomposed into two summations; 2) as shown in Figure 3 (4), the cos weighting puts more weight on neighbouring tokens and therefore enforces locality. Moreover, comparing the attention matrices in Figure 3 (2) and (3), COSFORMER enforces more locality than the variant without the re-weighting mechanism. Specifically, combining with Eq. 6, the model with cosine re-weighting is defined as:

$$s(Q'_i, K'_j) = Q'_i K_j'^{\top} \cos\!\left(\frac{\pi}{2} \cdot \frac{i - j}{M}\right). \quad (10)$$

By leveraging Ptolemy's theorem, we decompose this formulation as:

$$Q'_i K_j'^{\top} \cos\!\left(\frac{\pi(i - j)}{2M}\right) = \left(Q'_i \cos\frac{\pi i}{2M}\right)\left(K'_j \cos\frac{\pi j}{2M}\right)^{\!\top} + \left(Q'_i \sin\frac{\pi i}{2M}\right)\left(K'_j \sin\frac{\pi j}{2M}\right)^{\!\top},$$

where $i, j = 1, \dots, N$, $M \ge N$, and $Q' = \mathrm{ReLU}(Q)$, $K' = \mathrm{ReLU}(K)$. Let $Q_i^{\cos} = Q'_i \cos\frac{\pi i}{2M}$, $Q_i^{\sin} = Q'_i \sin\frac{\pi i}{2M}$, $K_j^{\cos} = K'_j \cos\frac{\pi j}{2M}$, $K_j^{\sin} = K'_j \sin\frac{\pi j}{2M}$; the output of the proposed attention module can then be expressed as:

$$O_i = \frac{\sum_{j=1}^{N} f(Q'_i, K'_j) V_j}{\sum_{j=1}^{N} f(Q'_i, K'_j)} = \frac{\sum_{j=1}^{N} Q_i^{\cos} (K_j^{\cos})^{\top} V_j + \sum_{j=1}^{N} Q_i^{\sin} (K_j^{\sin})^{\top} V_j}{\sum_{j=1}^{N} Q_i^{\cos} (K_j^{\cos})^{\top} + \sum_{j=1}^{N} Q_i^{\sin} (K_j^{\sin})^{\top}}, \quad (11)$$

where $O_i$ is the output of the attention module at the i-th position of the sequence. Detailed derivations are included in the Appendix. Without loss of generality, our method achieves linear complexity as:

$$O = \mathcal{S}(Q, K) V = \big(Q^{\cos} (K^{\cos})^{\top} + Q^{\sin} (K^{\sin})^{\top}\big) V = Q^{\cos}\big((K^{\cos})^{\top} V\big) + Q^{\sin}\big((K^{\sin})^{\top} V\big). \quad (12)$$

Figure 4: Training loss (left) and validation loss (right) of the bidirectional language modeling pre-training. In both training and validation, the proposed COSFORMER converges faster than the vanilla transformer.

Relation to positional encoding COSFORMER can be seen as a new way of introducing a relative positional bias into efficient transformers. Compared with our method, Rotary Position Embedding (Su et al., 2021) uses a more complex position embedding strategy and does not enforce non-negativity on the similarity scores as we do. Also, since it only changes the position embedding in the numerator while keeping the denominator unchanged, the summation of its attention scores does not equal 1. Stochastic Positional Encoding (Liutkus et al., 2021) uses a sampling strategy to approximate softmax and introduces relative positional encoding to linear transformers.
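The sketch below is an illustrative NumPy reimplementation of the non-causal linear form in Eqs. 11-12 under the paper's notation, not the authors' released code; setting M = N and adding a small epsilon to the denominator are our assumptions. It also checks the result against the explicit quadratic computation with the cos(π(i − j)/2M) weights.

```python
import numpy as np

def cosformer_attention(Q, K, V, eps=1e-6):
    """Linear-complexity attention of Eqs. 11-12 (non-causal, single head)."""
    N = Q.shape[0]
    M = N                                                       # assumption: M = N
    idx = np.arange(1, N + 1)
    Qp, Kp = np.maximum(Q, 0.0), np.maximum(K, 0.0)             # Q' = ReLU(Q), K' = ReLU(K)
    cos_w = np.cos(np.pi * idx / (2 * M))[:, None]
    sin_w = np.sin(np.pi * idx / (2 * M))[:, None]
    Qc, Qs = Qp * cos_w, Qp * sin_w                             # Q^cos, Q^sin
    Kc, Ks = Kp * cos_w, Kp * sin_w                             # K^cos, K^sin
    num = Qc @ (Kc.T @ V) + Qs @ (Ks.T @ V)                     # Eq. 12, O(N d^2)
    den = Qc @ Kc.sum(axis=0) + Qs @ Ks.sum(axis=0)             # denominator of Eq. 11
    return num / (den[:, None] + eps)

# Check against the explicit quadratic form with cos(pi * (i - j) / (2M)) weights.
rng = np.random.default_rng(0)
N, d = 64, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
idx = np.arange(1, N + 1)
W = np.cos(np.pi * (idx[:, None] - idx[None, :]) / (2 * N))     # re-weighting matrix
A = (np.maximum(Q, 0.0) @ np.maximum(K, 0.0).T) * W
ref = (A / (A.sum(axis=-1, keepdims=True) + 1e-6)) @ V
assert np.allclose(cosformer_attention(Q, K, V), ref, atol=1e-5)
```

The check passes because the decomposition above is exact, so, unlike random-feature approximations, there is no sampling error to accumulate.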
3 EXPERIMENTS

In this section, we experimentally validate the effectiveness of the proposed method in multiple settings. The purposes of the experiments are three-fold. First, we validate the capacity of COSFORMER for language modeling through autoregressive (Sec. 3.1) and bidirectional (Sec. 3.2) setups using WikiText-103 (Merity et al., 2017). In this way, we validate the effectiveness of the proposed linear attention module in both causal and non-causal cases. Second, we investigate the generalization ability of COSFORMER on downstream tasks by comparison with other existing transformer variants. This is achieved by performing comparative fine-tuning experiments on five datasets, including GLUE (QQP, SST-2, MNLI) (Wang et al., 2018), IMDB (Maas et al., 2011) and AMAZON (Ni et al., 2019) (Sec. 3.3). We further compare COSFORMER with other transformer variants on the Long-Range Arena benchmark (Tay et al., 2020b) to understand its ability to model long-range dependencies (Sec. 3.4) and present a comparative analysis of model efficiency (Sec. 3.5). Third, we conduct ablation studies to understand each component of COSFORMER (Sec. 3.6).

3.1 AUTOREGRESSIVE LANGUAGE MODELING

In autoregressive or left-to-right language modeling, we estimate the probability distribution of a token given its previous tokens. We use (Baevski & Auli, 2018) as our baseline model. Specifically, we adopt their large model, which has 16 cascaded layers with a projected dimension of 1024, and replace the self-attention module with our proposed linear attention module. We train our model on 8 Nvidia Tesla A100 GPUs with a sequence length of 512 for 150K updates on WikiText-103 (Merity et al., 2017) and report perplexity on the validation and test splits in Table 2.

Table 2: Perplexity (lower is better) of the language modeling pre-training task on the validation and test sets of the WikiText-103 (Merity et al., 2017) dataset.

| Model | ppl (val) | ppl (test) |
|---|---|---|
| Vanilla Transformer | 24.5 | 26.2 |
| Linear Transformer | 28.7 | 30.2 |
| RFA-Gaussian | 25.8 | 27.5 |
| RFA-arccos | 26.4 | 28.1 |
| RFA-Gate-arccos | 24.8 | 26.3 |
| RFA-Gate-Gaussian | 23.2 | 25.0 |
| COSFORMER | 23.5 | 23.1 |

We observe that although the baseline is a powerful standard transformer that requires quadratic computation complexity, COSFORMER outperforms it by a clear margin at linear computation complexity. Besides, we achieve comparable perplexity to other methods on the validation set, and outperform all competing methods on the test set by a clear gap, which further demonstrates the effectiveness of COSFORMER.

Table 3: Results of fine-tuning on downstream tasks based on the pre-trained bidirectional model. The best result is in boldface and the second best is underlined. The proposed COSFORMER achieves superb performance relative to competing efficient transformers and approaches the vanilla transformer.

| Model | QQP | SST-2 | MNLI | IMDB | AMAZON | Avg |
|---|---|---|---|---|---|---|
| Vanilla Transformer (Liu et al., 2019) | 88.41 | 92.31 | 79.15 | 92.86 | 75.79 | 85.70 |
| Performer (Choromanski et al., 2020) | 69.92 | 50.91 | 35.37 | 60.36 | 64.84 | 56.28 |
| Reformer (Kitaev et al., 2019) | 63.18 | 50.92 | 35.47 | 50.01 | 64.28 | 52.77 |
| Linear Trans. (Katharopoulos et al., 2020) | 74.85 | 84.63 | 66.56 | 91.48 | 72.50 | 78.00 |
| Longformer (Beltagy et al., 2020) | 85.51 | 88.65 | 77.22 | 91.14 | 73.34 | 83.17 |
| RFA (Peng et al., 2020) | 75.28 | 76.49 | 57.6 | 78.98 | 68.15 | 71.30 |
| COSFORMER | 89.26 | 91.05 | 76.70 | 92.95 | 76.30 | 85.25 |

3.2 BIDIRECTIONAL LANGUAGE MODEL

For bidirectional language modeling, we adopt RoBERTa (Liu et al., 2019) as the baseline model. Similarly, we replace the self-attention module in RoBERTa with the proposed linear attention module and keep the other structures unchanged.
We train this bidirectional model on 2 Nvidia Tesla A100 GPUs for 50K iterations with an input sequence length of 512. As shown in Figure 4, COSFORMER converges faster than the vanilla transformer on both the training and validation sets with comparable or smaller loss values, even though it only requires linear space and time complexity. In addition, the COSFORMER variant with the re-weighting mechanism has both notably better convergence speed and better final results than the counterpart without re-weighting, which further validates the effectiveness of our cos-based distance matrix and also demonstrates the effectiveness of the recency bias on natural language data.

3.3 DOWNSTREAM FINE-TUNING TASKS

In this section, we fine-tune the pre-trained model on downstream tasks to demonstrate the generalization ability of COSFORMER. We use the pre-trained bidirectional model and fine-tune it on three downstream text classification tasks: GLUE (QQP, SST-2, MNLI) (Wang et al., 2018), IMDB (Maas et al., 2011) and AMAZON (Ni et al., 2019). For a fair comparison, we first pre-train all competing efficient transformer variants for the same 50K iterations on WikiText-103 (Merity et al., 2017) under the same setting, and then follow the same fine-tuning protocol as RoBERTa (Liu et al., 2019) to fine-tune these methods on the downstream tasks.

From Table 3, we can see that COSFORMER outperforms the baseline (Liu et al., 2019) on three out of five datasets, and achieves either the best or second-best result on all five downstream datasets compared to competing efficient transformers. It is worth noting that although Longformer (Beltagy et al., 2020) achieves better results on MNLI than COSFORMER, it requires a computation complexity of O(Nw), where w is the window size. As shown in Figure 1, Longformer is slower and requires more memory than COSFORMER. The other competing methods (Peng et al., 2020; Choromanski et al., 2020; Kitaev et al., 2019) are all based on kernel functions and show substantial performance gaps compared with our model. This validates the effectiveness of the proposed COSFORMER model compared with other efficient transformer variants.

3.4 RESULTS ON THE LONG-RANGE ARENA BENCHMARK

To further evaluate the generalization ability of the proposed method, we train our model from scratch on the Long-Range Arena benchmark (Tay et al., 2020b), a benchmark specifically designed for efficient transformers with long input sequences, thus serving as a suitable testbed to comparatively assess the quality of efficient transformer variants. To ensure a fair comparison, we implement our method in Jax (Bradbury et al., 2018) and carefully follow the benchmark's preprocessing, data splits, model structure and training protocol. We evaluate our method on a variety of tasks including long-sequence ListOps (Nangia & Bowman, 2018), byte-level text classification (Maas et al., 2011), document retrieval using the ACL Anthology Network (Radev et al., 2013), image classification on sequences of pixels from CIFAR-10 (Krizhevsky & Hinton, 2009), and Pathfinder (Linsley et al., 2018). As shown in Table 4, COSFORMER achieves competitive results across all tasks and the best performance on ListOps and document retrieval.
For the Pathfinder task, since the two points can be located very far from each other, the locality bias we introduce has a negative impact, and our method lags slightly behind other state-of-the-art methods, although the gap between our method and the vanilla transformer is small. It is worth mentioning that COSFORMER achieves the best overall score on the Long-Range Arena benchmark, being one of only two models that surpass the vanilla transformer architecture.

Table 4: Results on the Long-Range Arena benchmark. The best result is in boldface and the second best is underlined. COSFORMER achieves the best average score across the 5 tasks.

| Model | ListOps | Text | Retrieval | Image | Pathfinder | Avg |
|---|---|---|---|---|---|---|
| Local Attention (Tay et al., 2020b) | 15.82 | 52.98 | 53.39 | 41.46 | 66.63 | 46.06 |
| Linear Trans. (Katharopoulos et al., 2020) | 16.13 | 65.9 | 53.09 | 42.34 | 75.3 | 50.55 |
| Reformer (Kitaev et al., 2019) | 37.27 | 56.1 | 53.4 | 38.07 | 68.5 | 50.67 |
| Sparse Trans. (Child et al., 2019) | 17.07 | 63.58 | 59.59 | 44.24 | 71.71 | 51.24 |
| Sinkhorn Trans. (Tay et al., 2020a) | 33.67 | 61.2 | 53.83 | 41.23 | 67.45 | 51.29 |
| Linformer (Wang et al., 2020) | 35.7 | 53.94 | 52.27 | 38.56 | 76.34 | 51.36 |
| Performer (Choromanski et al., 2020) | 18.01 | 65.4 | 53.82 | 42.77 | 77.05 | 51.41 |
| Synthesizer (Tay et al., 2021) | 36.99 | 61.68 | 54.67 | 41.61 | 69.45 | 52.88 |
| Longformer (Beltagy et al., 2020) | 35.63 | 62.85 | 56.89 | 42.22 | 69.71 | 53.46 |
| Transformer (Vaswani et al., 2017) | 36.37 | 64.27 | 57.46 | 42.44 | 71.4 | 54.39 |
| BigBird (Zaheer et al., 2020) | 36.05 | 64.02 | 59.29 | 40.83 | 74.87 | 55.01 |
| COSFORMER | 37.9 | 63.41 | 61.36 | 43.17 | 70.33 | 55.23 |

Table 5: Speed comparison on the Long-Range Arena benchmark for both training and inference at varying sequence lengths (1K-4K), in steps per second (higher is better). A ✗ marks settings where a method runs out of memory.

| Model | Inference 1K | Inference 2K | Inference 3K | Inference 4K | Train 1K | Train 2K | Train 3K | Train 4K |
|---|---|---|---|---|---|---|---|---|
| Transformer (Vaswani et al., 2017) | 25.37 | 7.83 | ✗ | ✗ | 6.95 | 2.23 | ✗ | ✗ |
| Local Attention (Tay et al., 2020b) | 57.73 | 33.19 | 23.36 | 17.79 | 13.45 | 6.71 | 4.32 | 3.09 |
| Linformer (Wang et al., 2020) | 70.09 | 39.1 | 27.05 | 20.62 | 14.75 | 7.09 | 4.52 | 3.21 |
| Reformer (Kitaev et al., 2019) | 44.21 | 21.58 | 12.74 | 8.37 | 11.58 | 4.98 | 2.94 | 1.95 |
| Sinkhorn Trans. (Tay et al., 2020a) | 43.29 | 23.58 | 16.53 | 12.7 | 11.09 | 5.57 | 3.68 | 2.68 |
| Synthesizer (Tay et al., 2021) | 20.89 | 6.24 | ✗ | ✗ | 6.36 | 2.01 | ✗ | ✗ |
| BigBird (Zaheer et al., 2020) | 20.96 | 11.5 | 8.12 | 6.15 | 6.46 | 3.2 | 2.13 | 1.53 |
| Linear Trans. (Katharopoulos et al., 2020) | 67.85 | 38.24 | 26.28 | 19.98 | 11.86 | 5.54 | 3.53 | 2.56 |
| Performer (Choromanski et al., 2020) | 74.15 | 42.31 | 29.5 | 22.44 | 14 | 6.49 | 4.1 | 2.94 |
| Longformer (Beltagy et al., 2020) | 22.99 | 6.72 | ✗ | ✗ | 4.4 | 1.3 | ✗ | ✗ |
| Sparse Trans. (Child et al., 2019) | 24.87 | 7.5 | ✗ | ✗ | 6.77 | 2.2 | ✗ | ✗ |
| COSFORMER | 58.82 | 33.45 | 22.77 | 17.42 | 12.27 | 5.72 | 3.62 | 2.64 |

3.5 EFFICIENCY COMPARISON

In this section, we compare the efficiency of COSFORMER with other models, with a focus on long input sequences. With the proposed linear attention module, we expect COSFORMER to scale comparably with other linear variants while significantly surpassing the vanilla transformer architecture. For a fair and comprehensive comparison, we implement our method and the competing methods in Jax (Bradbury et al., 2018). We use the byte-level text classification benchmark and report runtime speed during both training and inference under different sequence lengths (1K-4K). We conduct the experiments on one Nvidia A6000 GPU and also report the corresponding inference-time memory footprints, as shown in Figure 1.
As shown in Table 5 and Figure 1, most pattern-based methods (Beltagy et al., 2020; Zaheer et al., 2020; Tay et al., 2020a; 2021) and the vanilla transformer (Vaswani et al., 2017) are much slower and require considerably more memory than COSFORMER, which prevents them from scaling to longer sequences. Further, the kernel-based methods (Narang et al., 2021; Choromanski et al., 2020; Tay et al., 2020a) have comparable speed and memory overheads, but their performance is less satisfactory than COSFORMER on the metrics above. In summary, COSFORMER achieves overall better efficiency than other linear variants while maintaining superior modeling and generalization ability.

3.6 ABLATION: cos-BASED RE-WEIGHTING

Table 6: Performance comparison of COSFORMER with and without cos-based re-weighting (φ_ReLU). We evaluate on two composite metrics. Bidirectional finetune avg: average score across the 5 datasets reported in Table 3. LRA avg: average score across the 5 tasks reported in Table 4.

| Model | Bidirectional finetune avg | LRA avg |
|---|---|---|
| φ_ReLU | 85.12 | 54.20 |
| COSFORMER | 85.25 | 55.23 |

By introducing cos-based re-weighting, we provide a non-linear mechanism that concentrates the distribution of attention connections and stabilizes training. In this way, we encourage the model to better exploit the locality inductive biases commonly observed in many natural language tasks. We investigate the effect of cos-based re-weighting in two respects. First, as shown in Figure 4, adding cos-based re-weighting yields both notably better convergence speed and better final results in autoregressive language modeling. Further, in Table 6, we present a comparison between COSFORMER models with and without the re-weighting mechanism, using two composite metrics that together cover 10 different datasets from the bidirectional downstream fine-tuning tasks and the Long-Range Arena benchmark (Tay et al., 2020b). COSFORMER achieves overall better results than the counterpart without re-weighting, improving the average scores on bidirectional fine-tuning and Long-Range Arena by a clear margin. This verifies that the proposed re-weighting effectively incorporates locality inductive biases for natural language tasks.

4 RELATED WORK

This section introduces existing work on improving the efficiency of transformers, which can be broadly divided into two categories: pattern-based methods and kernel-based methods.

Pattern-based methods Pattern-based methods sparsify the attention matrix with handcrafted or learnable patterns. As an early approach, Lee et al. (2019) leverage inducing points from the sparse Gaussian process to reduce the quadratic complexity of a transformer. Child et al. (2019) reduce the complexity by applying a combination of strided and local patterns to the vanilla attention matrix. Longformer (Beltagy et al., 2020) designs fixed diagonal sliding windows combined with a global window, where the sliding-window pattern can also be extended with dilation to enlarge the receptive field. Zaheer et al. (2020) present a more powerful and expressive sparse attention mechanism, which combines multiple types of attention patterns and gives a thorough study of sparse attention. Instead of fixed patterns, Kitaev et al. (2019) and Daras et al. (2020) group the attention computation into buckets via locality-sensitive hashing, while Roy et al. (2020) use mini-batch spherical k-means.
Nevertheless, pattern-based methods can only cope with sequences up to a certain length, and their computational complexity still grows rapidly as the input sequence becomes longer.

Kernel-based methods When faced with longer input sequences, it is more efficient to reduce the theoretical complexity of the computation directly. Kernel-based methods speed up self-attention by reducing its computation complexity from quadratic to linear. Vyas et al. (2020) approximate the full attention with a fixed number of cluster attention groups, assuming that queries that are close in Euclidean space should have similar attention distributions. Peng et al. (2020) use a product of Gaussian kernel functions to approximate softmax, changing the order of the scaled dot-product calculation and thus reducing the theoretical time to linear complexity, while Choromanski et al. (2020) use a kernel based on the Haar measure instead. Wang et al. (2020) import a low-rank prior on the attention matrix and approximate softmax in an SVD-decomposition manner. Xiong et al. (2021) utilize the Nyström method with segment-means to generate a low-rank approximation of the softmax matrix. Katharopoulos et al. (2020) formalize the transformer layer as a recurrent neural network. In this paper, we demonstrate that approximating softmax is unnecessary for linearizing the self-attention module. We instead propose a new method that replaces softmax with a linear operation and a re-weighting mechanism, which reduces both time and space complexity to O(N) while maintaining accuracy.

5 CONCLUSION

We presented COSFORMER, a new efficient transformer with linear time and space complexity. COSFORMER is based on two key properties of the original softmax attention: (i) every element in the attention matrix is non-negative, so that negatively-correlated information is not included in contextual information aggregation; (ii) the non-linear re-weighting scheme concentrates the distribution of the attention matrix, in order to better exploit the locality inductive biases of sequence modeling. To fulfill these properties in COSFORMER, we utilized the ReLU function as our linear operation to ensure the non-negativity property, and proposed a new cos-based re-weighting mechanism to enforce the locality bias of the original softmax attention. Since COSFORMER is naturally decomposable, it does not suffer from the accumulated approximation errors that usually affect previous linear transformers. On causal pre-training, bidirectional pre-training, and multiple downstream text understanding tasks, COSFORMER achieves comparable or even better performance than the vanilla transformer. On the long-sequence benchmark, COSFORMER achieved state-of-the-art performance across five different tasks. Further, COSFORMER obtains a significant overall advantage in time and memory efficiency over existing efficient transformers, making it easy to scale transformers to longer input sequences.

REFERENCES

Abien Fred Agarap. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375, 2018.

Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2018.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 2020.
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs. Version 0.1, 55, 2018.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv, 2020.

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213-229. Springer, 2020.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with Performers. arXiv preprint arXiv:2009.14794, 2020.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 276-286, 2019.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In Yoshua Bengio and Yann LeCun (eds.), 4th International Conference on Learning Representations, ICLR, San Juan, Puerto Rico, 2016.

Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. SMYRF: Efficient attention using asymmetric clustering. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), NeurIPS, volume 33, pp. 6476-6489. Curran Associates, Inc., 2020.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186, 2019.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.

Bolin Gao and Lacra Pavel. On the properties of the softmax function with application in game theory and reinforcement learning. arXiv preprint arXiv:1704.00805, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156-5165. PMLR, 2020.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2019.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4365-4374, 2019.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In ICML, pp. 3744-3753, 2019.

Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charlie Windolf, and Thomas Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 152-164, 2018.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.

Antoine Liutkus, Ondřej Cífka, Shih-Lun Wu, Umut Şimşekli, Yi-Hsuan Yang, and Gaël Richard. Relative positional encoding for Transformers with linear complexity. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7067-7079. PMLR, 18-24 Jul 2021.

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142-150, 2011.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. 5th International Conference on Learning Representations, ICLR, Toulon, France, 2017.

Nikita Nangia and Samuel Bowman. ListOps: A diagnostic dataset for latent tree learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 92-99, 2018.

Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, et al. Do transformer modifications transfer across implementations and applications? arXiv preprint arXiv:2102.11972, 2021.

Jianmo Ni, Jiacheng Li, and Julian McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 188-197, 2019.
Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. Random feature attention. In International Conference on Learning Representations, 2020.

Dragomir R Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. The ACL Anthology Network corpus. Language Resources and Evaluation, 47(4):919-944, 2013.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis (eds.), Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2008.

Aurko Roy, Mohammad Taghi Saffar, David Grangier, and Ashish Vaswani. Efficient content-based sparse attention with routing transformers. In TACL, 2020.

Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. In INTERSPEECH, 2019.

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. In arXiv, 2021.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. Sparse Sinkhorn attention. In International Conference on Machine Learning, pp. 9438-9447. PMLR, 2020a.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2020b.

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. Synthesizer: Rethinking self-attention for transformer models. In International Conference on Machine Learning, pp. 10183-10192. PMLR, 2021.

Michalis K Titsias. One-vs-each approximation to softmax for scalable estimation of probabilities. arXiv preprint arXiv:1609.07410, 2016.

Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer dissection: A unified understanding of transformer's attention via the lens of kernel. In EMNLP, 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

A. Vyas, A. Katharopoulos, and F. Fleuret. Fast transformers with clustered attention. In NeurIPS, 2020.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353-355, 2018.

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In AAAI, 2021.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. In NeurIPS, 2020.
A.1 MATHEMATICAL DERIVATION OF cos-BASED RE-WEIGHTING

Following Eq. 11, we give a detailed derivation of how the output at the i-th position is obtained:

$$O_i = \frac{\sum_{j=1}^{N} f(Q'_i, K'_j) V_j}{\sum_{j=1}^{N} f(Q'_i, K'_j)}
= \frac{\sum_{j=1}^{N} \big(Q_i^{\cos} (K_j^{\cos})^{\top} + Q_i^{\sin} (K_j^{\sin})^{\top}\big) V_j}{\sum_{j=1}^{N} \big(Q_i^{\cos} (K_j^{\cos})^{\top} + Q_i^{\sin} (K_j^{\sin})^{\top}\big)}
= \frac{\sum_{j=1}^{N} Q_i^{\cos} (K_j^{\cos})^{\top} V_j + \sum_{j=1}^{N} Q_i^{\sin} (K_j^{\sin})^{\top} V_j}{\sum_{j=1}^{N} Q_i^{\cos} (K_j^{\cos})^{\top} + \sum_{j=1}^{N} Q_i^{\sin} (K_j^{\sin})^{\top}}
= \frac{Q_i^{\cos} \sum_{j=1}^{N} (K_j^{\cos})^{\top} V_j + Q_i^{\sin} \sum_{j=1}^{N} (K_j^{\sin})^{\top} V_j}{Q_i^{\cos} \sum_{j=1}^{N} (K_j^{\cos})^{\top} + Q_i^{\sin} \sum_{j=1}^{N} (K_j^{\sin})^{\top}},$$

where $i, j = 1, \dots, N$, $M \ge N$, $Q' = \mathrm{ReLU}(Q)$, $K' = \mathrm{ReLU}(K)$, and $Q_i^{\cos} = Q'_i \cos\frac{\pi i}{2M}$, $Q_i^{\sin} = Q'_i \sin\frac{\pi i}{2M}$, $K_j^{\cos} = K'_j \cos\frac{\pi j}{2M}$, $K_j^{\sin} = K'_j \sin\frac{\pi j}{2M}$. This shows that the output of the proposed COSFORMER attention can be obtained in a linear manner.

A.2 PSEUDO CODE OF COSFORMER

Algorithm 1 describes how to compute COSFORMER attention.

Algorithm 1 COSFORMER attention
    Input: Q ∈ R^{N×d1}, K ∈ R^{M×d1}, V ∈ R^{M×d2}
    Output: O ∈ R^{N×d2}
    Use M_i to represent the i-th row of a matrix M.
    Initialize A[i] = πi/(2N), O[i][j] = 0, for i = 1, ..., N, j = 1, ..., d2.
    Initialize S^cos[i][j] = 0, S^sin[i][j] = 0, T^cos[i] = 0, T^sin[i] = 0, for i = 1, ..., d1, j = 1, ..., d2.
    for i in 1, ..., M do:
        K^cos_i = K_i cos(πi / 2M), K^sin_i = K_i sin(πi / 2M)
        S^cos += (K^cos_i)^T V_i;  S^sin += (K^sin_i)^T V_i
        T^cos += K^cos_i;  T^sin += K^sin_i
    end for
    for i in 1, ..., N do:
        Q^cos_i = Q_i cos(πi / 2M), Q^sin_i = Q_i sin(πi / 2M)
        O_i = (Q^cos_i S^cos + Q^sin_i S^sin) / (Q^cos_i T^cos + Q^sin_i T^sin)
    end for

A.3 ALGORITHM TO VISUALIZE THE ATTENTION MATRIX

Algorithm 2 describes how to visualize the attention matrix as in Figure 3.

Algorithm 2 Algorithm to visualize the attention matrix
    Input: M_k ∈ R^{d×d}, k = 1, ..., n; threshold ∈ [0, 1]
    Output: M ∈ R^{d×d}
    Initialize M[i][j] = 0 for i = 1, ..., d, j = 1, ..., d.
    for k in 1, ..., n do:
        for i in 1, ..., d do:
            index = argsort(M_k[i]) (in descending order)
            p = 0
            for j in 1, ..., d do:
                l = index[j]
                p += M_k[i][l]
                M[i][l] += 1
                if p > threshold then break
            end for
        end for
    end for
    M /= n
    Use a heatmap to visualize M.
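For reference, a small NumPy translation of Algorithm 2 is sketched below; the variable names mirror the pseudocode, while the randomly generated input matrices are placeholders for attention maps extracted from a trained model.

```python
import numpy as np

def visualize_mask(mats, threshold=0.9):
    """mats: list of (d, d) row-normalized attention matrices (Algorithm 2)."""
    d = mats[0].shape[0]
    M = np.zeros((d, d))
    for Mk in mats:
        for i in range(d):
            order = np.argsort(-Mk[i])          # indices sorted by descending weight
            p = 0.0
            for l in order:
                p += Mk[i][l]                   # accumulate probability mass
                M[i][l] += 1                    # mark this connection as kept
                if p > threshold:
                    break
    return M / len(mats)                        # average over the n matrices

rng = np.random.default_rng(0)
mats = [rng.random((16, 16)) for _ in range(4)]
mats = [m / m.sum(axis=1, keepdims=True) for m in mats]   # make rows sum to 1
heat = visualize_mask(mats)
print(heat.shape, float(heat.min()), float(heat.max()))   # feed `heat` to a heatmap plot
```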
A.4 INTRODUCTION OF THE DATASETS

We train both the autoregressive and bidirectional language models on the WikiText-103 dataset, which is split by tokens; its statistics are given in Table 7. We then fine-tune the pre-trained bidirectional model on several text classification tasks. The QQP dataset contains hundreds of thousands of question pairs from the community question-answering website Quora; the network needs to determine whether the two questions in a pair are semantically equivalent. SST-2 and IMDB are collections of movie reviews; the task is to determine whether a review is positive or not. The AMAZON dataset contains millions of product reviews from Amazon; the task is to infer the rating of the product from the review text. MNLI is a crowd-sourced collection of sentence pairs; the network must decide which of the three categories (entailment, contradiction, or neutral) a given sentence pair belongs to.

The Long-Range Arena benchmark contains 5 different datasets. ListOps contains cleverly designed mathematical problems that probe the parsing ability of neural models. IMDB is also used in this benchmark to examine the text classification ability of neural models. CIFAR-10 is an image collection of various objects; this task requires models to capture 2D spatial relations between flattened pixels. In the Pathfinder task, models need to determine whether two points in a picture are connected, which examines the model's ability to acquire 2D spatial relationships. The AAN dataset is used to evaluate a model's ability to encode and store compressed representations for retrieval.

Table 7: Statistics for the datasets. A small subset of the AMAZON dataset restricted to the electronics category is used for the experiments.

| Data | Train | Valid | Test |
|---|---|---|---|
| WikiText-103 | 103M | 218K | 246K |
| QQP | 364K | - | 391K |
| SST-2 | 67K | - | 1.8K |
| MNLI | 393K | - | 20K |
| IMDB | 25K | - | 25K |
| AMAZON | 3M | 168K | 168K |
| ListOps | 90K | - | 10K |
| AAN | 147K | 18K | 17K |
| CIFAR-10 | 50K | - | 10K |
| Pathfinder | 160K | - | 20K |

A.5 QUALITATIVE RESULTS OF LRA

We provide qualitative results for the ListOps and document retrieval tasks of the Long-Range Arena benchmark (Tay et al., 2020b), with a comparison to the vanilla transformer. ListOps is a ten-way classification task which aims to predict the result of a sequence with a hierarchical structure and the operators MAX, MEAN, MEDIAN and SUM_MOD, enclosed by delimiters (brackets). The network needs to access all tokens and model the logical structure of the inputs in order to make a prediction. The document retrieval task is to decide, with a binary label, whether two long input documents are similar. This task evaluates a model's ability to encode and store compressed representations that are useful for matching and retrieval. Since the samples in LRA are very long, we substantially shorten selected samples and display them below.

Listing 1: Examples of ListOps

Input: ( ( ( ( ( ( ( [MED 7 ) 9 ) 3 ) 1 ...... 5 ) 6 ) 8 ) ] ) ) 2 ) 8 ) 9 ) 5 ) 0 ) ] ) ) 8 ) 5 ) 1 ) 2 ) ] )
Our output: 0, Transformer output: 9, Ground-truth: 0

Input: ( ( ( ( ( ( ( ( ( [SM 5 ) 6 ) 0 ) 7 ) 1 ) ( ( ( ( ( (...... ( ( ( ( ( [MIN 5 ) 8 ) 1 ) 0 ) ( ( [MED ( ( ( 8 ) 7 ) 2 ) 8 ) 1 ) 8 ) ] ) ) 7 ) ] ) ] )
Our output: 9, Transformer output: 3, Ground-truth: 9

Input: ( ( ( ( ( ( ( ( ( [MAX 7 ) 4 ) 8 ) ( ( ( ( ( ( ( ( ( ( ( [MAX 5 ) 2 ) ( ( ( ( ( ( [SM 3 ) 6 ) 9 ) ( ( ( ...... ) ) 1 ) 6 ) 4 ) 2 ) ] ) ) ] )
Our output: 9, Transformer output: 5, Ground-truth: 9

Listing 2: Examples of byte-level document retrieval

Text 1: b 1 Introduction Recent advances in Statistical Machine Translation (SMT) are widely centred around two concepts: (a) hierarchical translation processes, frequently employing Synchronous Context Free Grammars (SCFGs) and (b) transduction or synchronous rewrite processes over a linguistic ......

Text 2: b 1 Introduction Automatic Grammatical Error Correction (GEC) for non-native English language learners has attracted more and more attention with the development of natural language processing, machine learning and big-data techniques. The CoNLL-2013 shared task focuses on the problem of GEC in five different error types including determiner, preposition, noun number......

Our output: False, Transformer output: True, Ground-truth: False