# Elliptical Attention

Stefan K. Nielsen (FPT Software AI Center, Ha Noi, Vietnam; stefannvkp@fpt.com) · Laziz U. Abdullaev (Department of Mathematics, National University of Singapore, Singapore 119077; laziz.abdullaev@u.nus.edu) · Rachel S.Y. Teo (Department of Mathematics, National University of Singapore, Singapore 119077; rachel.teo@u.nus.edu) · Tan M. Nguyen (Department of Mathematics, National University of Singapore, Singapore 119077; tanmn@nus.edu.sg)

**Abstract.** Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision. This dot-product self-attention computes attention weights among the input tokens using Euclidean distance, which makes the model prone to representation collapse and vulnerable to contaminated samples. In this paper, we propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance. In particular, we define a hyper-ellipsoidal neighborhood around each query to increase the attention weights of the tokens lying in the contextually important directions. We term this novel class of attention Elliptical Attention. Our Elliptical Attention provides two benefits: 1) reducing representation collapse, and 2) enhancing the model's robustness, as Elliptical Attention pays more attention to contextually relevant information rather than focusing on some small subset of informative features. We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities. The code is publicly available at https://github.com/stefvk/Elliptical-Attention.

*Equal contribution. Correspondence to stefannvkp@fpt.com. 38th Conference on Neural Information Processing Systems (NeurIPS 2024).*

## 1 Introduction

Attention mechanisms and transformers [82] have achieved state-of-the-art performance across a wide variety of tasks in machine learning [27, 35, 75] and, in particular, within natural language processing [1, 2, 13, 65, 12], computer vision [16, 39, 78, 66, 62], and reinforcement learning [25, 5]. They have also demonstrated strong performance in knowledge transfer from pretraining tasks to various downstream tasks with weak or no supervision [63, 64, 15]. At the core of these models is the dot-product self-attention mechanism, which learns self-alignment between tokens in an input sequence by estimating the relative importance of each token with respect to all others. The mechanism then transforms each token into a weighted average of the feature representations of the other tokens, with weights proportional to the learned importance scores. The relative importance scores capture contextual information among tokens and are key to the success of the transformer architecture [83, 76, 8, 59, 36, 55].

Figure 1: Comparison of attention heatmaps. Elliptical pays attention to more relevant information. DeiT focuses on just a subset of informative features, while Elliptical considers a wider set of contextually relevant information, helping to produce more accurate and robust predictions. Attention scores are min-max scaled for visualization purposes.
Recent work has begun exploring the connections between self-attention and non-parametric kernel regression [54, 23]. Under this interpretation, there is an unknown, underlying function f mapping the tokens in the input sequence to the output sequence. The self-attention mechanism estimates f by performing Nadaraya-Watson (NW) regression with isotropic Gaussian kernels. Our work leverages this perspective on self-attention, where we notice that isotropic Gaussian kernels are spherically invariant. This has the drawback of assuming all dimensions of the feature space are equally important, meaning nearby tokens are assigned contextual relevance weights dependent only on their Euclidean distance from a query, regardless of direction. From the non-parametric regression perspective, we show that spherical invariance in the kernel causes the estimator to suffer provably higher variance. This causes two connected disadvantages in the self-attention setting. First, high variance in the estimator impairs robustness, as small contaminations in the input cause large, erroneous changes in the self-attention output. Second, the high variance of the estimator reduces the capacity of the self-attention mechanism, as hidden representations passing through the model are increasingly composed of uninformative noise.

**Contribution.** In this work, we propose Elliptical Attention, a new class of self-attention that constructs hyper-ellipsoidal, rather than hyper-spherical, neighborhoods around the attention queries. The key idea is to stretch the neighborhoods around the queries to upweight keys in directions of high importance. We achieve this by computing a Mahalanobis transformation that stretches the axes of the underlying feature space according to a learned measure of coordinate-wise relevance. Constructing hyper-ellipsoidal neighborhoods following this scheme allows the self-attention mechanism to learn higher-quality contextual representations that prevent representation collapse while simultaneously exhibiting stronger robustness. We additionally propose an estimator of coordinate-wise relevance in the self-attention mechanism that can be computed highly efficiently and with no learnable parameters. We theoretically prove that our estimator accurately estimates the relative coordinate-wise relevance in the feature space. Finally, our approach of constructing hyper-ellipsoidal neighborhoods is linked to theoretical improvements in the mean squared error (MSE) of non-parametric estimators by reducing variance without introducing bias. We demonstrate that this provable reduction in variance is related to both representation collapse and robustness, proposing a unifying framework for both phenomena. This framework is based on the geometry of the predictive neighborhood around queries in the attention mechanism. In summary, our contributions are three-fold:

1. We develop the novel Elliptical Attention, which learns better contextual representations by constructing hyper-ellipsoidal neighborhoods around queries.
2. We propose an efficient estimator of the coordinate-wise relevance in the self-attention mechanism, which requires no learnable parameters, and provide theoretical guarantees for this estimator.
3. We derive a theoretical framework unifying representation collapse and robustness in transformers based only on the implicit geometry of the attention mechanism.
We empirically demonstrate that 1) Elliptical Attention outperforms baseline self-attention models in terms of accuracy and robustness on a variety of practical benchmarks, including WikiText-103 language modelling, ImageNet-1K object classification, LRA long sequence modeling, and ADE20K image segmentation, 2) Elliptical Attention attains robust improvements with lower memory requirements and faster computational speed than baseline robust transformers, and 3) Elliptical Attention can be combined with state-of-the-art robust transformers to further boost robust performance on ImageNet-1K under adversarial attack.

Figure 2: Left: The function does not vary in the x2 axis, so we stretch the neighborhood in that direction. Right: The stretched ellipsoidal neighborhood includes 4 more keys.

**Organization.** We structure this paper as follows: In Section 2, we present preliminaries on self-attention and non-parametric kernel regression. In Section 3, we illustrate the theoretical benefits of hyper-ellipsoidal neighborhoods, demonstrate how we build the required transformation, and provide the full technical formulation of Elliptical Attention. We empirically validate the advantages of Elliptical Attention in Section 4. Related work is discussed in Section 5 before presenting concluding remarks in Section 6. Proofs, technical details, and further experiments are provided in the Appendix.

## 2 Background: Self-Attention and Non-Parametric Regression

We first provide preliminaries on the self-attention mechanism, followed by background on its connection to the Nadaraya-Watson (NW) estimator in non-parametric regression [48].

### 2.1 Self-Attention Mechanism

Given an input sequence $X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D_x}$ of $N$ feature vectors, the self-attention mechanism transforms the input to $H := [h_1, \ldots, h_N]^\top \in \mathbb{R}^{N \times D_v}$ as follows:

$$h_i = \sum_{j \in [N]} \mathrm{softmax}\!\left(\frac{q_i^\top k_j}{\sqrt{D}}\right) v_j, \quad \text{for } i = 1, \ldots, N. \tag{1}$$

The vectors $q_i$, $k_j$, and $v_j$ are the query, key, and value vectors, respectively. They are computed as $[q_1, \ldots, q_N]^\top := Q = XW_Q^\top \in \mathbb{R}^{N \times D}$, $[k_1, \ldots, k_N]^\top := K = XW_K^\top \in \mathbb{R}^{N \times D}$, and $[v_1, \ldots, v_N]^\top := V = XW_V^\top \in \mathbb{R}^{N \times D_v}$, where $W_Q, W_K \in \mathbb{R}^{D \times D_x}$ and $W_V \in \mathbb{R}^{D_v \times D_x}$ are the weight matrices. Eqn. 1 can be expressed in matrix form as:

$$H = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D}}\right) V, \tag{2}$$

where the softmax function is applied row-wise to the matrix $QK^\top/\sqrt{D}$. We refer to transformers built with Eqn. 2 as standard transformers or just transformers.
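To make Eqns. 1 and 2 concrete, the following is a minimal PyTorch sketch of the standard self-attention computation. The variable names and toy shapes are ours for illustration and are not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def standard_self_attention(X, W_Q, W_K, W_V):
    """Dot-product self-attention (Eqn. 2): H = softmax(Q K^T / sqrt(D)) V."""
    Q = X @ W_Q.T                     # (N, D)  queries
    K = X @ W_K.T                     # (N, D)  keys
    V = X @ W_V.T                     # (N, Dv) values
    D = Q.shape[-1]
    scores = Q @ K.T / D ** 0.5       # (N, N) scaled dot products q_i^T k_j / sqrt(D)
    A = F.softmax(scores, dim=-1)     # row-wise softmax gives the attention weights
    return A @ V                      # (N, Dv) each h_i is a weighted average of values

# Toy usage: N = 8 tokens, Dx = 16 input dims, D = Dv = 16 head dims.
torch.manual_seed(0)
X = torch.randn(8, 16)
W_Q, W_K, W_V = (torch.randn(16, 16) for _ in range(3))
H = standard_self_attention(X, W_Q, W_K, W_V)   # shape (8, 16)
```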
### 2.2 A Non-Parametric Regression Perspective of Self-Attention

We now present the connection between self-attention as described in Eqn. 1 and non-parametric regression. We first assume that the key and value vectors $\{k_j, v_j\}_{j \in [N]}$ are obtained from the following data generating process:

$$v = f(k) + \epsilon, \tag{3}$$

where $\epsilon$ is random zero-mean noise, $\mathbb{E}[\epsilon] = 0$, and $f$ is the unknown function to be estimated. We consider the random design setting where the keys $\{k_j\}_{j \in [N]}$ are i.i.d. samples drawn from the marginal distribution $p(k)$. We use $p(v, k)$ to denote the joint distribution of pairs $(v, k)$ obtained according to Eqn. 3. At any given new query $q$, we aim to estimate the unknown function value $f(q)$. The NW estimator is a non-parametric estimator of the unknown $f$ described by

$$f(k) = \mathbb{E}[v \mid k] = \int_{\mathbb{R}^{D_v}} v\, p(v \mid k)\, dv = \int_{\mathbb{R}^{D_v}} v\, \frac{p(v, k)}{p(k)}\, dv, \tag{4}$$

where we use the zero-mean noise assumption for the first equality and the definitions of conditional expectation and density for the second and third. Then, it can be shown that by estimating the joint density $p(v, k)$ and marginal density $p(k)$ using isotropic Gaussian kernels with bandwidth $\sigma$ and evaluating the NW estimator at a new query $q_i$, we obtain

$$\hat{f}(q_i) = \frac{\sum_{j \in [N]} v_j \exp\left(-\|q_i - k_j\|^2 / 2\sigma^2\right)}{\sum_{j \in [N]} \exp\left(-\|q_i - k_j\|^2 / 2\sigma^2\right)} \tag{5}$$

$$\propto \frac{\sum_{j \in [N]} v_j \exp\left(q_i^\top k_j / \sigma^2\right)}{\sum_{j \in [N]} \exp\left(q_i^\top k_j / \sigma^2\right)} = \sum_{j \in [N]} \mathrm{softmax}\left(q_i^\top k_j / \sigma^2\right) v_j, \tag{6}$$

where choosing $\sigma^2 = \sqrt{D}$ as the isotropic variance recovers the full attention mechanism. We present the full derivation in Appendix A.

**Limitation of self-attention.** We see in Eqn. 5 that standard self-attention computes the relative importance scores between queries and keys via Euclidean distance. Euclidean distances are spherically invariant and therefore fail to consider coordinate-wise significance in the feature space, meaning the proximity of $k_j$ to $q_i$ influences its contextual relevance equally regardless of direction.
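The step from Eqn. 5 to Eqn. 6 can be checked numerically: expanding $\|q_i - k_j\|^2$, the $\|q_i\|^2$ factor always cancels in the ratio, and when the key norms are all equal the $\|k_j\|^2$ factors cancel too, so the Gaussian-kernel NW estimator coincides exactly with softmax attention. The sketch below is our own verification of this observation, not code from the paper.

```python
import torch

torch.manual_seed(0)
N, D, Dv = 32, 8, 4
sigma2 = D ** 0.5          # sigma^2 = sqrt(D), the choice that recovers attention

# Unit-norm keys so the ||k_j||^2 factors cancel in the NW ratio.
K = torch.nn.functional.normalize(torch.randn(N, D), dim=-1)
V = torch.randn(N, Dv)
q = torch.randn(D)

# Eqn. 5: NW estimator with isotropic Gaussian kernels.
w = torch.exp(-((q - K) ** 2).sum(-1) / (2 * sigma2))
f_nw = (w[:, None] * V).sum(0) / w.sum()

# Eqn. 6: the softmax-attention form.
f_attn = torch.softmax(q @ K.T / sigma2, dim=-1) @ V

print(torch.allclose(f_nw, f_attn, atol=1e-6))   # True
```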
## 3 Elliptical Attention: Leveraging Hyper-Ellipsoids to Pay More Attention Without Losing Focus

In this section, we first present how NW regression obtains a lower MSE by taking hyper-ellipsoidal neighborhoods around queries. We then construct the required hyper-ellipsoidal transformation via a Mahalanobis metric. We present the framework relating robustness and representation collapse to the geometry of the query neighborhoods and show how our proposed scheme offers improvements in both areas. We then provide an efficient estimator of the coordinate-wise relevance before finally giving the full technical formulation of Elliptical Attention. Technical details on the implementation procedure are in Appendix E.

### 3.1 Improving NW Regression with Hyper-Ellipsoids

Distance-based estimators, such as the NW estimator, can obtain a lower MSE by taking hyper-ellipsoidal neighborhoods around queries [29, 30]. The key idea is that we wish to stretch the axes of the underlying space in the directions along which the true $f$ in Eqn. 3 varies least.

Figure 3: Representation collapse on WikiText-103. Elliptical Attention learns more diverse representations.

Figure 2 shows a situation in which $f$ does not vary equally in all directions. This is in fact a limiting case in which the function is sparse in the $x_2$ direction. The left sub-figure shows the result of stretching the Euclidean circular neighborhoods around each query in the $x_2$ direction, along which the function does not vary. The right sub-figure then shows how the resulting ellipse can include additional data points without adding bias to the model. It is a well-established result that the variance of a non-parametric estimate at a point is inversely proportional to the number of samples in that point's neighborhood, as the additional samples smooth out the effect of noise. As a result, stretching the neighborhood, as shown in the right sub-figure, decreases the variance. Crucially, including these additional samples does not cause the estimate to miss the true variation in the function, as there is no variation in the $x_2$ direction. By including points in this direction, we do not introduce bias into the estimate. Hence, we lower variance without introducing bias, obtaining a lower-MSE estimator. This intuition is formalized in Theorem 1 in Appendix C, which shows that the best achievable rate of convergence for estimators of non-sparse Lipschitz functions is of the order $O(n^{-2/(2+d)})$ for a $d$-dimensional feature space. However, when the function depends on only $R \subseteq [d]$ coordinates, the rate improves to $O(n^{-2/(2+|R|)})$. In the case of approximate sparsity, when coordinate directions exhibit differing variability, the same intuition carries over, as shown by the improvement in convergence rates in Theorem 2 in Appendix C.

We leverage this analysis from non-parametric regression to motivate our Elliptical Attention. From the regression perspective, the self-attention mechanism, which performs NW regression, is able to learn a lower-MSE estimator of the true underlying $f$ by reducing the variance of the estimator without (or with minimal) introduction of bias. From the attention perspective, this means queries pay higher attention to more relevant keys, producing more contextually meaningful attention scores and better, more robust learned representations.

### 3.2 Capturing Coordinate-wise Variability and Building the Mahalanobis Transformation

We measure the variation of $f$ in the $i$th coordinate direction by the expectation of the $L^1$ norm of the $i$th directional derivative taken over all $k \in \mathcal{X}_k$, where $\mathcal{X}_k \subseteq \mathbb{R}^D$ denotes the feature space. Roughly speaking, this quantity corresponds to the average absolute gradient of $f$ in the $i$th direction throughout the space. Formally, this quantity is defined as follows.

**Definition 1 (Coordinate-wise Variability of $f : \mathbb{R}^D \to \mathbb{R}^{D_v}$).** The coordinate-wise variability of $f : \mathbb{R}^D \to \mathbb{R}^{D_v}$ with Jacobian matrix $J_f \in \mathbb{R}^{D_v \times D}$ in the $i$th direction is given by the quantity

$$\|f_i'\|_{1,\mu} := \mathbb{E}_{k \sim \mu}\, \|J_f(k)\, e_i\|_1, \quad i \in [D],$$

where $e_i$ is an all-zero vector with a single 1 in the $i$th coordinate and $\mu$ is the marginal distribution of $k$ over support $\mathcal{X}_k$.

**Remark 1.** This definition is one of many possible. One could also take the supremum rather than the expectation, or consider second derivatives. We select this definition as averages over first derivatives are more easily estimated, and the definition still captures the intuitive properties of variability.

Denoting estimates of the coordinate-wise variability $\|f_i'\|_{1,\mu}$ by $m_i$, we can then incorporate these quantities into a distance function of the form

$$d(q, k) := \sqrt{(q - k)^\top M (q - k)}, \tag{7}$$

where $M = \mathrm{diag}(m_1, m_2, \ldots, m_D)$ is a diagonal matrix whose diagonal elements are the estimates of $\|f_i'\|_{1,\mu}$ for $i \in [D]$.

**Remark 2.** The metric described in Eqn. 7 is a form of Mahalanobis distance metric, which can be interpreted as first applying a transformation to the underlying space that stretches the coordinate axes by the diagonal elements of $M$. Therefore, using this metric within the self-attention computation produces the desired hyper-ellipsoidal neighborhoods around queries.

**Remark 3.** In practice, we maxscale the estimates to obtain $m_i \leftarrow m_i / m_{\max}$, where $m_{\max} \geq m_i$ for all $i \in [D]$. This is because we care about the relative magnitudes of the direction-wise variability as opposed to the absolute magnitudes. Under this interpretation, we identify the most variable dimension and stretch all others relative to this direction.
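As a concrete illustration of Eqn. 7 and the maxscaling in Remark 3, the sketch below builds a diagonal Mahalanobis metric from hypothetical variability estimates and shows that the same Euclidean offset counts as much nearer when it lies along a low-variability direction. All names and numbers here are our own illustration.

```python
import torch

def mahalanobis_dist(q, k, m):
    """d(q, k) = sqrt((q - k)^T M (q - k)) for diagonal M = diag(m) (Eqn. 7)."""
    diff = q - k
    return torch.sqrt((diff * m * diff).sum(-1))

def maxscale(m, eps=1e-6):
    """Remark 3: scale by the largest entry so the most variable
    direction keeps unit length and all others shrink relative to it."""
    return m / m.max().clamp_min(eps)

# Hypothetical variability estimates: f barely varies in dims 2 and 3.
m = maxscale(torch.tensor([4.0, 3.0, 0.1, 0.05]))
q = torch.zeros(4)
k1 = torch.tensor([1.0, 0.0, 0.0, 0.0])   # offset along a high-variability direction
k2 = torch.tensor([0.0, 0.0, 0.0, 1.0])   # same Euclidean offset, low-variability direction

print(mahalanobis_dist(q, k1, m))  # ~1.00 -> treated as far
print(mahalanobis_dist(q, k2, m))  # ~0.11 -> treated as near, receives more attention
```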
### 3.3 Promoting Robustness and Avoiding Representation Collapse

Before providing the technical procedure for estimating $M$ and the full technical formulation of Elliptical Attention in Section 3.5, we first theoretically analyze in Propositions 1 and 2 how the hyper-ellipsoidal transformation in Eqn. 7 improves robustness and alleviates representation collapse.

**Dimension-wise input sensitivity of Elliptical Attention and robustness.** In Lemma 1, we show that when each input component is weighted according to the Mahalanobis transformation in Eqn. 7, the impact of perturbing the $i$th input coordinate on any coordinate of the output is proportional to the corresponding weighting parameter, with a proportionality coefficient depending on the indices $i$ and $j$.

**Lemma 1.** Let $\mathcal{M} : \mathbb{R}^D \to \mathbb{R}^N$ denote the transformed Elliptical softmax operator for a given set of keys,

$$\mathcal{M}(x) := \frac{1}{\sum_{j \in [N]} \exp(x^\top M k_j)} \left[\exp(x^\top M k_1), \exp(x^\top M k_2), \ldots, \exp(x^\top M k_N)\right]^\top,$$

for weight matrix $M$ as in Eqn. 7. Then, the achievable rate of change of $\mathcal{M}(x)$ in the $i$th input dimension is proportional to $m_i$; that is, $\sup_{x \in \mathcal{X}} |J_\mathcal{M}(x)_{ji}| \propto m_i$ for all $i \in [D]$ and $j \in [N]$, where $J_\mathcal{M}$ is the Jacobian matrix of $\mathcal{M}$.

By virtue of Lemma 1, which is proven in Appendix B.1, we show in Proposition 1 that by choosing the weights as properly scaled estimates of the underlying function variability, as in Elliptical Attention, the output vectors become less prone to large errors caused by noisy input, while simultaneously respecting the dimension-wise variability pattern of the true self-attention function.

**Proposition 1 (Robustness of Elliptical Attention).** Let $f : \mathbb{R}^D \to \mathbb{R}^{D_v}$ be the true self-attention function and $\hat{f}_d$ be the Elliptical Attention estimator with metric $d$ as described in Eqn. 7. Then for any index $i \in [N]$ and noise $\epsilon \in \mathbb{R}^D$, the following error bound holds:

$$\left\|\hat{f}_d(q_i) - \hat{f}_d(q_i + \epsilon)\right\| \leq \sum_{j \in [N]} \sqrt{\mathrm{tr}(K_j^2 M^2)}\, \|v_j\|\, \|\epsilon\|, \tag{8}$$

where $\{K_j\}_{j \in [N]}$ are constant diagonal matrices that depend only on the key vectors.

Note that when the estimates are maxscaled so that $m_i \leq 1$, the achievable output error of Elliptical Attention is lower than that of standard self-attention, where $m_i = 1$ for all $i \in [D]$. Moreover, when the true function exhibits approximate sparsity in some number of dimensions (i.e., $m_i \to 0^+$ for the majority of indices), the error bound in Eqn. 8 becomes significantly tighter for Elliptical Attention. The proof of Proposition 1 is provided in Appendix B.2.

**Input smoothing and representation collapse.** In each layer, the standard self-attention mechanism fits a noisy estimate of the true function $f$, which is then fed into subsequent layers and iteratively refit. The input to each attention layer is then partially composed of noise, which is equivalent to the common regularization method of random input smoothing. We show that by reducing the noise component in each layer, Elliptical Attention maintains expressive power and resists representation collapse. This is formalized in the following proposition:

**Proposition 2 (Elliptical Attention maintains expressive power by reducing noise).** Let $h_d^\ell$ denote the output of a transformer using Elliptical Attention with metric $d$ as described in Eqn. 7, and let $h^\ell$ denote the output of a transformer using standard self-attention, at layer $\ell$. Let $\mathcal{D}$ be the sampling distribution of the data and let $c \in \mathbb{R}^D$. Then, for any $h$, $h_d$, and layer $\ell$, in expectation a standard self-attention transformer attenuates towards $c$ faster than Elliptical Attention. Formally, we have:

$$\mathbb{E}_\mathcal{D}\left\|h_d^\ell - c\right\| \geq \mathbb{E}_\mathcal{D}\left\|h^\ell - c\right\|. \tag{9}$$

The proof is provided in Appendix B.3. Proposition 2 shows that Elliptical Attention maintains better expressive power than standard self-attention. We find this empirically supported, as shown in Fig. 3.
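The sensitivity pattern in Lemma 1 is easy to probe numerically. The sketch below is our illustration with a hypothetical variability pattern: it perturbs one low-relevance input coordinate and compares how much the standard softmax weights (M = I) and the elliptical weights move.

```python
import torch

def elliptical_softmax(x, K, m):
    """Softmax over x^T M k_j with diagonal M = diag(m), as in Lemma 1."""
    return torch.softmax((x * m) @ K.T, dim=-1)

torch.manual_seed(0)
N, D = 16, 8
K = torch.randn(N, D)
x = torch.randn(D)

# Hypothetical variability pattern: only the first two coordinates matter.
m_ell = torch.tensor([1.0, 0.8] + [0.05] * (D - 2))
m_std = torch.ones(D)                  # standard attention: M = I

eps = torch.zeros(D)
eps[-1] = 0.5                          # noise injected in a low-relevance coordinate

for name, m in [("standard", m_std), ("elliptical", m_ell)]:
    delta = elliptical_softmax(x + eps, K, m) - elliptical_softmax(x, K, m)
    print(name, delta.abs().max().item())
# The elliptical weights damp the perturbation roughly in proportion to m_i,
# consistent with sup_x |J(x)_{ji}| being proportional to m_i in Lemma 1.
```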
### 3.4 An Efficient Estimator of the Coordinate-wise Variability

We propose a simple difference-based estimator that effectively captures the coordinate-wise variability of the underlying function. Our estimator is easily and efficiently computed: it requires no additional learnable parameters and demands negligible additional memory.

Let $\mathbb{E}_n$ denote the empirical mean over $n$ samples, $v^\ell(i)$ denote the $i$th component of the vector $v$ at the $\ell$th layer, and $\mathcal{X}_v^{\ell,\ell+1} = \{(v^{\ell+1}, v^\ell) : v^\ell = f(k^\ell) + \epsilon\}$ be the value feature space at neighboring layers $\ell$ and $\ell+1$, where values are generated according to the process described in Eqn. 3. Then, our approach to estimating the $i$th coordinate-wise variability is described in the following proposition.

**Proposition 3 (Coordinate-wise Variability Estimator).** Given a function $f : \mathbb{R}^D \to \mathbb{R}^{D_v}$ with $i$th directional variation $\|f_i'\|_{1,\mu}$, $i \in [D]$, and some $\delta > 0$, the directional variation can be estimated by the quantity

$$m_i := \frac{1}{\delta}\, \mathbb{E}_n^{(v^\ell, v^{\ell+1}) \in \mathcal{X}_v^{\ell,\ell+1}} \left|v^{\ell+1}(i) - v^\ell(i)\right|. \tag{10}$$

**Remark 4.** For the purposes of improving the performance of transformers by stretching the feature space according to the direction-wise variability of $f$, we note that consistent estimators of $\|f_i'\|_{1,\mu}$ for all $i \in [D]$ are sufficient but not necessary. Instead, we require only the weaker objective of accurately estimating the relative magnitudes of the direction-wise variability. That is, if $\|f_i'\|_{1,\mu} \geq \|f_j'\|_{1,\mu}$, we need only that $m_i \geq m_j$. This is because the theory requires us only to identify coordinate directions of more or less variability and shrink or stretch the space accordingly.

The intuition behind our estimator in Eqn. 10 lies in prior lines of research studying transformers as an Euler discretization of a continuous-time dynamic, usually as a system of first-order ordinary differential equations (ODEs) [40, 21, 53]. In fact, our estimator resembles the absolute value of a forward Euler discretization of the variability of the $i$th component of a value vector over time, $\partial v(i, t)/\partial t$, where the layers $\ell$ and $\ell+1$ represent consecutive time points in an interval partition with step size $\delta$. We prove that our estimator in Eqn. 10 effectively estimates the relative magnitudes of the coordinate-wise variability of $f$ in Appendix B.5.

### 3.5 Full Technical Formulation of Elliptical Attention

We now present the full formulation of Elliptical Attention. Given the distance function $d(\cdot, \cdot)$ as in Eqn. 7, where $M = \mathrm{diag}(m_1, \ldots, m_D)$ is a diagonal matrix with elements $m_i$ as in Proposition 3, the $M$-norm can be defined as $\|x\|_M := \sqrt{x^\top M x}$, which produces hyper-ellipsoidal stretching in the feature space. Then, Elliptical Attention is defined as follows.

**Definition 2 (Elliptical Attention Computation).** Let $\varphi_{d,\sigma} : \mathbb{R}^D \to \mathbb{R}$ denote the Gaussian density kernel with variance $\sigma^2 I$ equipped with the $M$-norm as defined above. Then the corresponding NW estimator at $q_i$ becomes

$$\hat{f}_{d,\mathcal{D}}(q_i) := \frac{\sum_{j \in [N]} v_j \exp\left(-\|q_i - k_j\|_M^2 / 2\sigma^2\right)}{\sum_{j \in [N]} \exp\left(-\|q_i - k_j\|_M^2 / 2\sigma^2\right)}. \tag{11}$$

Then, the Elliptical Attention output for the $i$th query $q_i$, given keys $\{k_i\}_{i=1}^N$ and values $\{v_i\}_{i=1}^N$, corresponding to the NW estimator (11) with $\sigma^2 = \sqrt{D}$, is given by

$$h_i = \frac{\sum_{j \in [N]} v_j \exp\left(q_i^\top M k_j / \sqrt{D}\right)}{\sum_{j \in [N]} \exp\left(q_i^\top M k_j / \sqrt{D}\right)} = \sum_{j \in [N]} \mathrm{softmax}\left(q_i^\top M k_j / \sqrt{D}\right) v_j, \tag{12}$$

where $M = \mathrm{diag}(m_1, \ldots, m_D)$ with $m_i$ defined as in Eqn. 10 for all $i \in [D]$. Eqn. 12 is equivalently expressed in matrix form as

$$H = \mathrm{softmax}\!\left(\frac{QMK^\top}{\sqrt{D}}\right) V. \tag{13}$$

**Remark 5.** We see from the form of Eqns. 12 and 13 that standard self-attention is recoverable by setting $M = I_D$. Under our framework, this implies that standard self-attention assumes the underlying regression function to have exactly equal variability in all coordinate directions.

Pseudocode for the Elliptical Attention computation is provided in Appendix F.12.
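Putting Eqn. 13 together with the estimator of Eqn. 10, a minimal forward pass might look like the sketch below. We assume a single head with D = Dv so the value-based estimates can scale the query-key dimensions, and we take a layer step size of δ = 1; these choices and all names are our illustration, not the released implementation (see Appendix F.12 for the paper's pseudocode).

```python
import torch
import torch.nn.functional as F

def estimate_m(v_prev, v_curr, delta=1.0, eps=1e-6):
    """Prop. 3 estimator: m_i = E_n |v^{l+1}(i) - v^l(i)| / delta,
    then maxscaled (Remark 3). No learnable parameters."""
    m = (v_curr - v_prev).abs().mean(dim=0) / delta   # (D,) mean |dv| per coordinate
    return m / m.max().clamp_min(eps)

def elliptical_attention(Q, K, V, m):
    """Eqn. 13: H = softmax(Q M K^T / sqrt(D)) V with M = diag(m)."""
    D = Q.shape[-1]
    scores = (Q * m) @ K.T / D ** 0.5                 # q_i^T M k_j / sqrt(D)
    return F.softmax(scores, dim=-1) @ V

# Toy usage: value vectors from the previous and current layer drive the estimate.
torch.manual_seed(0)
N, D = 32, 16
Q, K, V = torch.randn(N, D), torch.randn(N, D), torch.randn(N, D)
v_prev = torch.randn(N, D)            # value vectors at layer l
m = estimate_m(v_prev, V)             # coordinate-wise variability estimates
H = elliptical_attention(Q, K, V, m)  # (N, D); m = ones recovers standard attention
```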
## 4 Experimental Results

In this section, we empirically justify the advantage of Elliptical Attention over baseline transformers that take hyper-spheres around queries. We evaluate our method on robust WikiText-103 modeling under Word Swap contamination [45], ImageNet classification under a wide range of attacks [14, 67], the LRA benchmark [74], and ADE20K image segmentation [87]. We compare Elliptical Attention with state-of-the-art (SOTA) clean and robust models, including Performer [9], FourierFormer [54], Robust Vision Transformer [44], Fully Attentional Network (FAN) [89], Mixture of Gaussian Keys (MGK) [52], Mixture-of-Experts (MoE) based transformers, such as the Switch Transformer [18] and the Generalist Language Model (GLaM) [17], and robust kernel density estimation (KDE) based transformers, such as Median of Means (MoM) and Scaled Projected KDE (SPKDE) [23]. We aim to show that i) Elliptical Attention offers substantive improvements over baseline models across tasks on both clean and contaminated data; ii) Elliptical Attention attains these improvements on contaminated data while reducing memory requirements and increasing computational speed compared to comparable robust models; and iii) Elliptical Attention can be combined with SOTA robust transformers to further improve robustness with negligible increase in computational overhead. We compare Elliptical Attention with baselines of the same configuration. Results are averaged over 5 runs with different seeds. Additional results and full details on the experimental setup are in Appendix F.

### 4.1 Robust Language Modelling

**Experimental setup.** We adopt the experimental setup in [54, 23]. We pretrain and evaluate our models on the WikiText-103 benchmark in comparison with the standard baseline Transformer [82], Performer [9], Transformer-MGK [52], FourierFormer [54], and the robust kernel density estimation-based Transformers, including Transformer-SPKDE and Transformer-MoM [23]. All models use the 44M-parameter Transformer backbone. We pretrain all models on clean data for 125 epochs before attacking only the test set using a Word Swap attack, which substitutes random words with a generic AAA token at a 2.5% swap rate. We report test perplexity (PPL) as the performance metric.

Table 1: Perplexity (PPL) on WikiText-103 under Word Swap contamination. Elliptical achieves the best PPL on clean data and second best on contaminated data. Best result in bold and second best in italics.

| Model | Clean Test PPL (↓) | Contaminated Test PPL (↓) |
|---|---|---|
| Transformer [82] | 34.29 | 74.56 |
| Performer [9] | 33.49 | 73.48 |
| Transformer-MGK [52] | 33.21 | 71.03 |
| FourierFormer [54] | 32.85 | 68.33 |
| Transformer-SPKDE [23] | *32.18* | 54.97 |
| Transformer-MoM [23] | 34.68 | **52.14** |
| Transformer-Elliptical | **32.00** | *52.59* |

Table 2: Top-1 and Top-5 test accuracy on ImageNet under adversarial attacks PGD, FGSM, and SPSA with perturbation budget 1/255. Best result in bold and second best in italics.

| Method | Clean Top-1 | Clean Top-5 | FGSM Top-1 | FGSM Top-5 | PGD Top-1 | PGD Top-5 | SPSA Top-1 | SPSA Top-5 |
|---|---|---|---|---|---|---|---|---|
| DeiT [78] | 72.23 | 91.13 | 52.61 | 82.26 | 41.84 | 76.49 | 48.34 | 79.36 |
| Distill [78] | *74.32* | *93.72* | 53.24 | 84.07 | 41.72 | 76.43 | 49.56 | 80.14 |
| FourierFormer [54] | 73.25 | 91.66 | 53.08 | 83.95 | 41.34 | 76.19 | 48.79 | 79.57 |
| RVT [44] | **74.37** | **93.89** | 53.67 | 84.11 | 43.39 | 77.26 | *51.43* | *80.98* |
| DeiT-KDE [23] | 72.58 | 91.34 | 52.25 | 81.52 | 41.38 | 76.41 | 48.61 | 79.68 |
| DeiT-MoM [23] | 71.94 | 91.08 | **55.76** | **85.23** | *43.78* | *78.85* | 49.38 | 80.02 |
| DeiT-Elliptical | 72.36 | 91.33 | *54.64* | *85.18* | **44.96** | **79.35** | **56.55** | **87.26** |
**Results.** Table 1 shows our Elliptical Transformer (Elliptical) achieves the best test perplexity on clean data while also achieving the second-best test perplexity under data contamination by Word Swap [47], illustrating that Elliptical Attention is highly robust and offers substantial advantages on clean data as well.

### 4.2 Image Classification under Adversarial Attack

**Experimental setup.** We adopt the experimental setup in [23]. We train and evaluate Elliptical on ImageNet-1K against standard vision transformers, including DeiT [78] and Distill [78], as well as FourierFormer [54]. We also compare Elliptical with robust vision transformers, including DeiT-KDE [23], DeiT-MoM [23], RVT [44], and FAN [89]. The DeiT backbone is the tiny configuration with 5.7M parameters. We train all models on clean ImageNet-1K for 300 epochs before evaluating their top-1 and top-5 accuracy on the test dataset under the fast gradient sign method (FGSM) [22], projected gradient descent (PGD) [42], and simultaneous perturbation stochastic approximation (SPSA) [81]. We also present results against AutoAttack [11], which is an ensemble of auto PGD-cross entropy (APGD-CE), auto PGD-targeted (APGD-T), fast adaptive boundary-targeted (FAB-T), and Square. We display results for attacks applied individually and in default sequential mode.

**Results.** Table 2 shows Elliptical attains top robustness under PGD and SPSA and second top under FGSM while achieving highly competitive clean accuracy. DeiT-Elliptical is particularly impressive under the black-box attack SPSA, improving over the next best model, RVT [44], by 10%. Table 4 shows results on AutoAttack [11], where we see DeiT-Elliptical substantially outperforms standard DeiT on each attack individually and sequentially. We again see strong performance against the black-box attack Square, with an 8.5% improvement. When combined with the SOTA robust transformer FAN [89], Elliptical Attention improves robustness to sequential AutoAttack and all individual attacks except FAB-T, on which it still remains highly competitive. This shows Elliptical Attention can further boost robustness when combined with SOTA robust models.

### 4.3 Long Sequence Modelling on the LRA Benchmark

**Experimental setup.** We adopt the setup in [7]. For each of the 5 tasks, equation calculation (ListOps) [50], review classification (Text) [41], document retrieval (Retrieval) [61], image classification (Image) [32], and image spatial dependencies (Pathfinder) [37], we compare Elliptical with the standard Transformer [82], Linformer [26], Reformer [28], Performer [9], and Longformer [3].

**Results.** Elliptical Attention achieves the top or second-top test accuracy on every task and the top overall performance. This shows Elliptical Attention learns superior representations across a wide range of modalities in long-range contexts.

Table 3: Test accuracy on long range tasks: ListOps, Text, Retrieval, Image, and Pathfinder. Best result in bold and second best in italics.

| Dataset (seq. length) | Trans. [82] | Lin. [26] | Re. [28] | Per. [9] | Long. [3] | Elliptical |
|---|---|---|---|---|---|---|
| ListOps (2K) | 37.1 | *37.3* | 19.1 | 18.8 | 37.2 | **37.8** |
| Text (4K) | *65.0* | 55.9 | 64.9 | 63.8 | 64.6 | **65.6** |
| Retrieval (4K) | 79.4 | 79.4 | 78.6 | 78.6 | **81.0** | *80.3* |
| Image (1K) | 38.2 | 37.8 | **43.3** | 37.1 | 39.1 | *40.2* |
| Pathfinder (1K) | **74.2** | 67.6 | 69.4 | 69.9 | 73.0 | *73.2* |
| Average Accuracy | 58.5 | 55.6 | 55.1 | 53.6 | *59.0* | **59.4** |
Table 4: Top-1 and Top-5 test accuracy on ImageNet under AutoAttack applied both individually and sequentially with perturbation budget 1/255. Best result per backbone is shown in bold.

| Attack | DeiT [78] Top-1 | DeiT [78] Top-5 | DeiT-Elliptical Top-1 | DeiT-Elliptical Top-5 | FAN [89] Top-1 | FAN [89] Top-5 | FAN-Elliptical Top-1 | FAN-Elliptical Top-5 |
|---|---|---|---|---|---|---|---|---|
| Clean Data | 72.23 | 91.13 | **72.36** | **91.33** | 76.31 | 93.42 | **76.38** | **93.53** |
| APGD-CE | 27.75 | 66.48 | **31.27** | **68.28** | 35.05 | 74.56 | **36.13** | **75.69** |
| APGD-T | 27.74 | 73.37 | **29.69** | **74.39** | 35.02 | 80.46 | **36.25** | **81.30** |
| FAB-T | 71.61 | 90.54 | **71.74** | **90.81** | **76.35** | **93.65** | 76.16 | 93.45 |
| Square | 43.55 | 80.96 | **47.25** | **81.65** | 56.75 | 88.05 | **58.38** | **88.20** |
| Average | 42.66 | 77.84 | **45.00** | **78.78** | 50.79 | 84.18 | **51.73** | **84.66** |
| Sequential Attack | 26.08 | 64.18 | **27.45** | **67.77** | 33.29 | 74.52 | **34.54** | **75.67** |

Table 5: Switch Transformer language modeling.

| Model | Test PPL (↓) |
|---|---|
| Switch Transformer-medium [18] | 35.33 |
| Switch Elliptical-medium | **34.67** |
| Switch Transformer-large [18] | 31.18 |
| Switch Elliptical-large | **30.56** |

Table 6: GLaM language modeling.

| Model | Test PPL (↓) |
|---|---|
| GLaM-small [17] | 58.27 |
| GLaM-Elliptical-small | **56.69** |
| GLaM-medium [17] | 38.27 |
| GLaM-Elliptical-medium | **36.34** |

### 4.4 Image Segmentation on ADE20K

**Experimental setup.** We adopt the setup in [71]. The encoder is pretrained on ImageNet-1K following the same specification described in Section 4.2. In particular, the encoder is a DeiT-tiny backbone with 5.7M parameters pretrained for 300 epochs. After pretraining, we attach a decoder that contains a 2-layer masked transformer and finetune the full encoder-decoder model for 64 epochs on the ADE20K [88] image segmentation dataset.

**Results.** Table 7 reports pixel accuracy, mean accuracy, and mean intersection over union (IoU). Elliptical Attention boosts performance across all metrics, with intersection over union, in particular, improving by a substantive 4.7%.

### 4.5 Further Clean Data Language Modelling

**Experimental setup.** For experiments using the Switch Transformer [18] and GLaM [17] backbones, we adopt the setup in [60]. In particular, we integrate Elliptical into small (70M parameters) and medium (220M parameters) GLaM backbones and train the models on WikiText-103 for 80 and 120 epochs, respectively. We consider Switch backbones at medium (220M parameters) and large (388M parameters) configurations, both trained for 80 epochs. All models use top-2 expert routing. For the standard transformer experiments, we continue with the setup of [54] and additionally present results for Elliptical in a medium configuration with 90M parameters trained for 100 epochs.

**Results.** We present in Tables 5 and 6 the performance of Elliptical in MoE backbones. Moving from smaller to larger configurations, Elliptical maintains strong, consistent improvements in test PPL. We note particularly substantive improvements with scale in the GLaM backbone, where at the small configuration Elliptical attains a 2.7% improvement, but at the medium configuration this improvement almost doubles to 5.0%. Table 8 further shows that in the standard transformer backbone, Elliptical maintains its substantive 6.8% improvement when scaling up to a larger configuration. These results show that Elliptical Attention scales well with model size.

Table 7: Image segmentation results on ADE20K.

| Model | Pixel Acc. | Avg Acc. | Avg IoU |
|---|---|---|---|
| DeiT [78] | 77.93 | 46.30 | 35.44 |
| Elliptical | **78.46** | **48.04** | **37.09** |

Table 8: WikiText-103 results.

| Model | Test PPL (↓) |
|---|---|
| Transformer-small [82] | 34.29 |
| Elliptical-small | **32.00** |
| Transformer-medium [82] | 29.60 |
| Elliptical-medium | **27.60** |

## 5 Related Work

**Theoretical frameworks for attention.** Attention mechanisms have been studied from a range of perspectives.
[80] shows that self-attention can be derived from kernel similarity functions, and [77] points out that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. [56] formulates self-attention as the support vector expansion derived from a support vector regression problem, while [73] explains attention through nonlinear singular value decomposition of asymmetric kernels. Attention has also been explained through ordinary/partial differential equations, Gaussian mixture models, and graph-structured learning [40, 68, 51, 72, 20, 52, 31, 86]. [54, 23] show that self-attention performs Nadaraya-Watson regression with isotropic Gaussian kernels. This paper leverages this viewpoint and proposes modifying the isotropic Gaussian kernels to include a Mahalanobis metric, which can be interpreted as stretching the hyper-spherical neighborhoods of the kernel into hyper-ellipsoids.

Figure 4: ImageNet efficiency: comparison of throughput and max memory allocated for DeiT, Elliptical, RVT, RKDE, and MoM at Tiny, Small, and Base sizes. Elliptical is the most efficient robust model. Numerical analysis in Table 12 of Appendix F.

**Robust transformers.** In vision, [43] proposes an ensemble defense strategy against white-box attacks, while [44] proposes position-aware attention scaling and patch-wise augmentation. Recently, [89] proposes a fully-attentional network to attain state-of-the-art accuracy on corrupted image data. In language, [85] proposes structurally aware table-text encoding, [38] proposes a robust end-to-end transformer for crisis detection, and [33] proposes duration-based hard attention. [6, 4] integrate a Gaussian process into attention for out-of-distribution detection, and [79] develops equivariant neural functional networks for transformers. These methodologies are motivated by their respective domains and tend to have limited generalizability to other domains. Our approach, by contrast, proposes a general framework that makes no assumption on the downstream task, requires no additional parameters, and adds negligible computational overhead.

**Mahalanobis metrics.** Mahalanobis metrics have been used predominantly in classical machine learning algorithms. In nearest-neighbor (NN) classification and regression, [84, 49] learn the metric through backpropagation. In NN KL divergence estimation, [58] learns a Mahalanobis metric from density approximation. In kernel regression, [57] takes eigenvalues of the estimated Jacobian, while [29, 30] estimate the coordinate-wise variability of the true function. Our model similarly uses the coordinate-wise variability of the unknown function to form the Mahalanobis transformation, but instead uses a more efficient estimator that does not require materializing the prediction function and accommodates the self-attention setting. In general, our method is among the early work in incorporating Mahalanobis metrics into the self-attention mechanism.

## 6 Conclusion and Future Work

In this paper, we present Elliptical Attention, a novel variant of attention that computes a Mahalanobis transformation to stretch the underlying feature space in directions of high contextual relevance. This transformation can be interpreted as modifying the hyper-spherical neighborhoods around queries into hyper-ellipsoids, which upweight the attention paid to keys lying in important directions, enabling the transformer to learn better and more robust representations.
This approach makes no assumptions on the downstream task, requires no learnable parameters, and can be applied to any transformer to boost clean and robust performance. A limitation of our work is that we use the values over layers to estimate the average direction-wise gradient of the true self-attention function, which makes the estimate prone to noise. For ongoing work, we are exploring more precise estimation methods with provable convergence guarantees that do not compromise efficiency.

**Acknowledgments and Disclosure of Funding.** This research/project is supported by the National Research Foundation Singapore under the AI Singapore Programme (AISG Award No: AISG2-TC-2023-012-SGIL). This research/project is supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 1 (FY2023) (A-8002040-00-00, A-8002039-00-00). This research/project is also supported by the NUS Presidential Young Professorship Award (A-0009807-01-00). Thanks to our anonymous reviewers, who provided valuable feedback which improved the paper substantially. Thanks also to Thai Ha for the many illuminating conversations.

## References

[1] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3159-3166, 2019.

[2] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2019.

[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

[4] Long Minh Bui, Tho Tran Huu, Duy Dinh, Tan Minh Nguyen, and Trong Nghia Hoang. Revisiting kernel attention with correlated Gaussian process representation. In The 40th Conference on Uncertainty in Artificial Intelligence, 2024.

[5] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084-15097, 2021.

[6] Wenlong Chen and Yingzhen Li. Calibrating transformers via sparse Gaussian processes. In The Eleventh International Conference on Learning Representations, 2023.

[7] Yingyi Chen, Qinghua Tao, Francesco Tonin, and Johan Suykens. Primal-attention: Self-attention through asymmetric kernel SVD in primal representation. Advances in Neural Information Processing Systems, 36, 2024.

[8] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724-1734, Doha, Qatar, October 2014. Association for Computational Linguistics.

[9] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with Performers. In International Conference on Learning Representations, 2021.

[10] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing.
In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1310-1320. PMLR, 09-15 Jun 2019.

[11] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pages 2206-2216. PMLR, 2020.

[12] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978-2988, Florence, Italy, July 2019. Association for Computational Linguistics.

[13] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019.

[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE, 2009.

[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

[17] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547-5569. PMLR, 2022.

[18] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1-39, 2022.

[19] Chris D Frost and Simon G Thompson. Correcting for regression dilution bias: comparison of methods for a single predictor variable. Journal of the Royal Statistical Society: Series A (Statistics in Society), 163, 2000.

[20] Prasad Gabbur, Manjot Bilkhu, and Javier Movellan. Probabilistic attention for interactive segmentation. Advances in Neural Information Processing Systems, 34:4448-4460, 2021.

[21] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[22] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

[23] Xing Han, Tongzheng Ren, Tan Nguyen, Khai Nguyen, Joydeep Ghosh, and Nhat Ho. Designing robust transformers using robust kernel density estimation.
Advances in Neural Information Processing Systems, 36, 2024.

[24] Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, and James Kwok. Revisiting over-smoothing in BERT from the perspective of graph. In International Conference on Learning Representations, 2022.

[25] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34:1273-1286, 2021.

[26] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156-5165. PMLR, 2020.

[27] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 54(10s):1-41, 2022.

[28] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020.

[29] Samory Kpotufe and Abdeslam Boularias. Gradient weights help nonparametric regressors. In Advances in Neural Information Processing Systems, volume 25, 2012.

[30] Samory Kpotufe, Abdeslam Boularias, Thomas Schultz, and Kyoungok Kim. Gradients weights improve regression and classification. Journal of Machine Learning Research, 17(22):1-34, 2016.

[31] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618-21629, 2021.

[32] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[33] Naihan Li, Yanqing Liu, Yu Wu, Shujie Liu, Sheng Zhao, and Ming Liu. RobuTrans: A robust transformer-based text-to-speech model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8228-8235, 2020.

[34] Hua Liang, Wolfgang Härdle, and Raymond J. Carroll. Estimation in a semiparametric partially linear errors-in-variables model. Annals of Statistics, 27(5):1519-1535, Oct 1999.

[35] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 3:111-132, 2022.

[36] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In International Conference on Learning Representations, 2017.

[37] Drew Linsley, Junkyung Kim, Vijay Veerabadran, Charles Windolf, and Thomas Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in Neural Information Processing Systems, 31, 2018.

[38] Junhua Liu, Trisha Singhal, Lucienne TM Blessing, Kristin L Wood, and Kwan Hui Lim. CrisisBERT: A robust transformer for crisis classification and contextual crisis embedding. In Proceedings of the 32nd ACM Conference on Hypertext and Social Media, pages 133-141, 2021.

[39] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012-10022, 2021.

[40] Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. Understanding and improving transformer from a multi-particle dynamic system point of view.
In ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations, 2019.

[41] Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142-150, 2011.

[42] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

[43] Kaleel Mahmood, Rigel Mahmood, and Marten Van Dijk. On the robustness of vision transformers to adversarial examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7838-7847, 2021.

[44] Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue. Towards robust vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12042-12051, 2022.

[45] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017.

[46] Jeet Mohapatra, Ching-Yun Ko, Tsui-Wei Weng, Pin-Yu Chen, Sijia Liu, and Luca Daniel. Higher-order certification for randomized smoothing. In Advances in Neural Information Processing Systems, 2020.

[47] John Morris, Eli Lifland, Jin Yong Yoo, Jake Grigsby, Di Jin, and Yanjun Qi. TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Qun Liu and David Schlangen, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 119-126, Online, October 2020. Association for Computational Linguistics.

[48] Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141-142, 1964.

[49] Youssef Nader, Leon Sixt, and Tim Landgraf. DNNR: Differential nearest neighbors regression. In International Conference on Machine Learning, pages 16296-16317. PMLR, 2022.

[50] Nikita Nangia and Samuel Bowman. ListOps: A diagnostic dataset for latent tree learning. In Silvio Ricardo Cordeiro, Shereen Oraby, Umashanthi Pavalanathan, and Kyeongmin Rim, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 92-99, New Orleans, Louisiana, USA, June 2018. Association for Computational Linguistics.

[51] Tam Nguyen, Tan Nguyen, and Richard Baraniuk. Mitigating over-smoothing in transformers via regularized nonlocal functionals. Advances in Neural Information Processing Systems, 36:80233-80256, 2023.

[52] Tam Minh Nguyen, Tan Minh Nguyen, Dung DD Le, Duy Khuong Nguyen, Viet-Anh Tran, Richard Baraniuk, Nhat Ho, and Stanley Osher. Improving transformers with probabilistic attention keys. In International Conference on Machine Learning, pages 16595-16621. PMLR, 2022.

[53] Tam Minh Nguyen, Cesar A Uribe, Tan Minh Nguyen, and Richard Baraniuk. PIDformer: Transformer meets control theory. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 37776-37797. PMLR, 21-27 Jul 2024.

[54] Tan Nguyen, Minh Pham, Tam Nguyen, Khai Nguyen, Stanley Osher, and Nhat Ho. FourierFormer: Transformer meets generalized Fourier integral theorem.
Advances in Neural Information Processing Systems, 35:29319-29335, 2022.

[55] Tan Minh Nguyen, Tam Minh Nguyen, Hai Ngoc Do, Khai Nguyen, Vishwanath Saragadam, Minh Pham, Nguyen Duy Khuong, Nhat Ho, and Stanley Osher. Improving transformer with an admixture of attention heads. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.

[56] Tan Minh Nguyen, Tam Minh Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard Baraniuk, and Stanley Osher. A primal-dual framework for transformers and neural networks. In The Eleventh International Conference on Learning Representations, 2023.

[57] Yung-Kyun Noh, Masashi Sugiyama, Kee-Eung Kim, Frank Park, and Daniel D Lee. Generative local metric learning for kernel regression. Advances in Neural Information Processing Systems, 30, 2017.

[58] Yung-Kyun Noh, Masashi Sugiyama, Song Liu, Marthinus C Plessis, Frank Chongwoo Park, and Daniel D Lee. Bias reduction and metric learning for nearest-neighbor estimation of Kullback-Leibler divergence. In Artificial Intelligence and Statistics, pages 669-677. PMLR, 2014.

[59] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. In Jian Su, Kevin Duh, and Xavier Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2249-2255, Austin, Texas, November 2016. Association for Computational Linguistics.

[60] Quang Pham, Giang Do, Huy Nguyen, Trung Tin Nguyen, Chenghao Liu, Mina Sartipi, Binh T Nguyen, Savitha Ramasamy, Xiaoli Li, Steven Hoi, et al. CompeteSMoE: Effective training of sparse mixture of experts via competition. arXiv preprint arXiv:2402.02526, 2024.

[61] Dragomir R Radev, Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. The ACL Anthology network corpus. Language Resources and Evaluation, 47:919-944, 2013.

[62] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.

[63] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.

[64] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

[65] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67, 2020.

[66] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821-8831. PMLR, 2021.

[67] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211-252, 2015.

[68] Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with doubly stochastic attention.
In International Conference on Artificial Intelligence and Statistics, pages 3515-3530. PMLR, 2022.

[69] J. H. Sepanski, R. Knickerbocker, and R. J. Carroll. A semiparametric correction for attenuation. Journal of the American Statistical Association, 89(428):1366-1373, 1994.

[70] Charles J Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, pages 1040-1053, 1982.

[71] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7262-7272, 2021.

[72] Binh Tang and David S Matteson. Probabilistic transformer for time series analysis. Advances in Neural Information Processing Systems, 34:23592-23608, 2021.

[73] Qinghua Tao, Francesco Tonin, Panagiotis Patrinos, and Johan A. K. Suykens. Nonlinear SVD with asymmetric kernels: feature learning and asymmetric Nyström method. CoRR, abs/2306.07040, 2023.

[74] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2021.

[75] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1-28, 2022.

[76] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593-4601, Florence, Italy, July 2019. Association for Computational Linguistics.

[77] Rachel Teo and Tan Nguyen. Unveiling the hidden structure of self-attention via kernel principal component analysis. Advances in Neural Information Processing Systems, 2024.

[78] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347-10357. PMLR, 2021.

[79] Viet-Hoang Tran, Thieu N Vo, An Nguyen The, Tho Tran Huu, Minh-Khoi Nguyen-Nhat, Thanh Tran, Duy-Tung Pham, and Tan Minh Nguyen. Equivariant neural functional networks for transformers. arXiv preprint arXiv:2410.04209, 2024.

[80] Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer dissection: An unified understanding for transformer's attention via the lens of kernel. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4344-4353, Hong Kong, China, November 2019. Association for Computational Linguistics.

[81] Jonathan Uesato, Brendan O'Donoghue, Pushmeet Kohli, and Aaron Oord. Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, pages 5025-5034. PMLR, 2018.

[82] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[83] Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model.
In Tal Linzen, Grzegorz Chrupała, Yonatan Belinkov, and Dieuwke Hupkes, editors, Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy, August 2019. Association for Computational Linguistics. [84] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(2), 2009. [85] Jingfeng Yang, Aditya Gupta, Shyam Upadhyay, Luheng He, Rahul Goel, and Shachi Paul. TableFormer: Robust transformer modeling for table-text encoding. arXiv preprint arXiv:2203.00274, 2022. [86] Shaolei Zhang and Yang Feng. Modeling concentrated cross-attention for neural machine translation with Gaussian mixture model. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1401–1411, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. [87] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017. [88] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127:302–321, 2019. [89] Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. In International Conference on Machine Learning, pages 27378–27394. PMLR, 2022.

Supplement to "Elliptical Attention"

Table of Contents
A Full Derivation of Self-Attention as Non-Parametric Regression
B Technical Proofs
  B.1 Proof of Lemma 1
  B.2 Proof of Proposition 1
  B.3 Proof of Proposition 2
  B.4 Edge-preservation Perspective on Representation Collapse
  B.5 Proof of Proposition 3
  B.6 Lipschitz smoothness in (X, d)
C Additional Theorems
D A Consistent Estimator
E Implementation Procedure and Computational Efficiency
F Experimental Details and Additional Experiments
  F.1 Out-of-Distribution Robustness and Data Corruption on ImageNet-A,R,C
  F.2 Representation Collapse
  F.3 Head Redundancy
  F.4 Efficiency Results
  F.5 Elliptical Attention in Mixture of Expert Architectures
  F.6 Additional Adversarial Attack Results on DeiT-Small Configuration
  F.7 WikiText-103 Language Modelling and Word Swap Attack
  F.8 ImageNet Image Classification and Adversarial Attack
  F.9 LRA Long Sequence Classification
  F.10 ADE20K Image Segmentation
  F.11 Ablation Studies
  F.12 Pseudocode
G Broader Impacts
A Full Derivation of Self-Attention as Non-Parametric Regression

Recall that the NW estimator is a non-parametric estimator of the unknown function $f$ at any given query $q$, described by
\[
f(k) = \mathbb{E}[v \mid k] = \int_{\mathbb{R}^D} v\, p(v \mid k)\, dv = \int_{\mathbb{R}^D} v\, \frac{p(v, k)}{p(k)}\, dv,
\]
where the first equality comes from the noise being zero mean, the second equality comes from the definition of conditional expectation, and the final equality comes from the definition of conditional density. Eqn. 3 implies that if we can obtain good estimates of the joint density $p(v, k)$ and marginal density $p(k)$, then we can estimate the required $f(q)$. The Gaussian isotropic kernels with bandwidth $\sigma$ are given by
\[
\hat{p}_\sigma(v, k) = \frac{1}{N} \sum_{j \in [N]} \varphi_\sigma(v - v_j)\varphi_\sigma(k - k_j), \qquad \hat{p}_\sigma(k) = \frac{1}{N} \sum_{j \in [N]} \varphi_\sigma(k - k_j), \tag{14}
\]
where $\varphi_\sigma$ is the multivariate Gaussian density function with diagonal covariance matrix $\sigma^2 I_D$. Given the kernel density estimators in Eqn. 14, the unknown function can be estimated as
\[
\hat{f}(k) = \int_{\mathbb{R}^D} v\, \frac{\hat{p}_\sigma(v, k)}{\hat{p}_\sigma(k)}\, dv
= \int_{\mathbb{R}^D} v\, \frac{\sum_{j\in[N]} \varphi_\sigma(v - v_j)\varphi_\sigma(k - k_j)}{\sum_{j\in[N]} \varphi_\sigma(k - k_j)}\, dv
= \frac{\sum_{j\in[N]} \varphi_\sigma(k - k_j) \int v\, \varphi_\sigma(v - v_j)\, dv}{\sum_{j\in[N]} \varphi_\sigma(k - k_j)}
= \frac{\sum_{j\in[N]} v_j\, \varphi_\sigma(k - k_j)}{\sum_{j\in[N]} \varphi_\sigma(k - k_j)}.
\]
Then, using the definition of the Gaussian isotropic kernel and evaluating the estimated function at $q_i$, we have
\[
\hat{f}(q_i) = \frac{\sum_{j=1}^{N} v_j \exp\left(-\|q_i - k_j\|^2/2\sigma^2\right)}{\sum_{j=1}^{N} \exp\left(-\|q_i - k_j\|^2/2\sigma^2\right)}
= \frac{\sum_{j=1}^{N} v_j \exp\left[-(\|q_i\|^2 + \|k_j\|^2)/2\sigma^2\right]\exp(q_i^\top k_j/\sigma^2)}{\sum_{j=1}^{N} \exp\left[-(\|q_i\|^2 + \|k_j\|^2)/2\sigma^2\right]\exp(q_i^\top k_j/\sigma^2)}
= \frac{\sum_{j=1}^{N} v_j \exp(q_i^\top k_j/\sigma^2)}{\sum_{j=1}^{N} \exp(q_i^\top k_j/\sigma^2)}
= \sum_{j\in[N]} \mathrm{softmax}(q_i^\top k_j/\sigma^2)\, v_j,
\]
where the middle cancellation uses the assumption of normalized keys.

Remark 6. Note that, relaxing the assumption of normalized keys, the standard unnormalized self-attention score can be written as
\[
\exp(q_i^\top k_j/\sigma^2) = \exp\left(-\|q_i - k_j\|^2/2\sigma^2\right)\exp\left[(\|q_i\|^2 + \|k_j\|^2)/2\sigma^2\right] \propto \exp\left(-\|q_i - k_j\|^2/2\sigma^2\right),
\]
which shows that the dot-product self-attention scores are proportional to the NW kernel value with Euclidean distance. Hence the assumption of key normalization is sufficient to recover exactly the correspondence between self-attention and NW kernel regression, but not necessary. Analogously, the unnormalized Elliptical Attention score takes the following form:
\[
\exp(q_i^\top M k_j/\sigma^2) = \exp\left(-d(q_i, k_j)^2/2\sigma^2\right)\exp\left[(\|q_i\|_M^2 + \|k_j\|_M^2)/2\sigma^2\right] \propto \exp\left(-d(q_i, k_j)^2/2\sigma^2\right),
\]
where $d(\cdot, \cdot)$ is the Mahalanobis distance used in Eqn. 7 and $\|\cdot\|_M$ is the norm in the transformed space with metric $d$. This observation justifies the use of the transformed dot product instead of the full Mahalanobis distance metric in Eqn. 12, as it preserves the proportionality relationship between the attention computation and the corresponding non-parametric regression estimator with the chosen distance metric.
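The correspondence above is easy to verify numerically. Below is a minimal NumPy sanity check (our own illustration, not code from the paper's repository): with unit-norm keys, the Gaussian-kernel NW estimate and the softmax attention output coincide exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma = 8, 4, 1.0

K = rng.normal(size=(N, D))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # normalized keys
V = rng.normal(size=(N, D))
q = rng.normal(size=D)

# NW estimator with isotropic Gaussian kernels (Eqn. 14).
w = np.exp(-np.sum((q - K) ** 2, axis=1) / (2 * sigma**2))
nw_out = (w / w.sum()) @ V

# Softmax attention with scores q^T k_j / sigma^2.
s = K @ q / sigma**2
a = np.exp(s - s.max())
attn_out = (a / a.sum()) @ V

print(np.allclose(nw_out, attn_out))  # True: the two estimates coincide
```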
B Technical Proofs

In this section, we present the omitted theorem statements and technical proofs from the main body of the paper.

B.1 Proof of Lemma 1

Let $\mathcal{M} : \mathbb{R}^D \to \mathbb{R}^N$ be the transformed softmax operator as defined in Lemma 1. We wish to find its Jacobian matrix,
\[
J_\mathcal{M}(q) = \begin{pmatrix}
\frac{\partial \mathcal{M}_1(q)}{\partial q_1} & \frac{\partial \mathcal{M}_1(q)}{\partial q_2} & \cdots & \frac{\partial \mathcal{M}_1(q)}{\partial q_D} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{M}_N(q)}{\partial q_1} & \frac{\partial \mathcal{M}_N(q)}{\partial q_2} & \cdots & \frac{\partial \mathcal{M}_N(q)}{\partial q_D}
\end{pmatrix},
\]
to measure the sensitivity of each output dimension to a change in each input dimension. Let $\mathcal{M}_j : \mathbb{R}^D \to \mathbb{R}$ denote the $j$th component of the output vector for $j \in [N]$; that is, for a vector $q \in \mathbb{R}^D$,
\[
\mathcal{M}_j(q) = \frac{\exp(q^\top M k_j)}{\sum_{s\in[N]} \exp(q^\top M k_s)}. \tag{16}
\]
Let $q_i$ and $k_j^i$ denote the $i$th coordinates of the vectors $q$ and $k_j$, respectively. Then,
\[
\frac{\partial}{\partial q_i} \ln(\mathcal{M}_j(q)) = \frac{\partial}{\partial q_i}(q^\top M k_j) - \frac{\partial}{\partial q_i}\ln \sum_{s\in[N]} \exp(q^\top M k_s)
= m_i k_j^i - m_i \frac{\sum_{s\in[N]} k_s^i \exp(q^\top M k_s)}{\sum_{s'\in[N]} \exp(q^\top M k_{s'})}
= m_i \Big( k_j^i - \sum_{s\in[N]} k_s^i \mathcal{M}_s(q) \Big).
\]
Since the output of Eqn. 16 consists of only positive components, we have
\[
\frac{\partial \mathcal{M}_j(q)}{\partial q_i} = \frac{\partial \ln(\mathcal{M}_j(q))}{\partial q_i}\, \mathcal{M}_j(q) = m_i \Big( k_j^i - \sum_{s\in[N]} k_s^i \mathcal{M}_s(q) \Big) \mathcal{M}_j(q). \tag{17}
\]
Therefore, the triangle inequality gives
\[
\Big| \frac{\partial \mathcal{M}_j(q)}{\partial q_i} \Big| = m_i \Big| k_j^i (1 - \mathcal{M}_j(q)) - \sum_{s\in[N]\setminus\{j\}} k_s^i \mathcal{M}_s(q) \Big| \mathcal{M}_j(q)
\le m_i \Big( |k_j^i (1 - \mathcal{M}_j(q)) \mathcal{M}_j(q)| + \sum_{s\in[N]\setminus\{j\}} |k_s^i \mathcal{M}_s(q) \mathcal{M}_j(q)| \Big).
\]
We now bound each term individually. Consider the terms $s \neq j$ first. Since $0 \le \mathcal{M}_s(q) \le 1$, we can bound them as
\[
|k_s^i \mathcal{M}_s(q)\mathcal{M}_j(q)| \le |k_s^i|. \tag{18}
\]
Now recall that the inequality $ab \le (a+b)^2/4$ holds for any real numbers $a$ and $b$, with equality at $a = b$. Therefore, for the first term, we obtain
\[
|k_j^i (1 - \mathcal{M}_j(q))\mathcal{M}_j(q)| \le |k_j^i|\, \frac{(1 - \mathcal{M}_j(q) + \mathcal{M}_j(q))^2}{4} = \frac{|k_j^i|}{4}. \tag{19}
\]
Combining inequalities 17, 18, and 19, we finally arrive at
\[
|J_\mathcal{M}(q)_{ji}| = \Big| \frac{\partial \mathcal{M}_j(q)}{\partial q_i} \Big| \le m_i \Big( \frac{|k_j^i|}{4} + \sum_{s\in[N]\setminus\{j\}} |k_s^i| \Big) = \kappa_{ij}\, m_i \tag{20}
\]
for all $i \in [D]$ and $j \in [N]$, where $\kappa_{ij} \ge 0$ denotes the coefficient in the bracket.

B.2 Proof of Proposition 1

Let us estimate the distance between the two output vectors of the Elliptical Attention mechanism corresponding to clean and contaminated query inputs, namely
\[
\sum_{j\in[N]} \mathrm{softmax}(q^\top M k_j/\sigma^2)\, v_j = \sum_{j\in[N]} \mathcal{M}_j(q)\, v_j
\quad\text{and}\quad
\sum_{j\in[N]} \mathrm{softmax}((q+\epsilon)^\top M k_j/\sigma^2)\, v_j = \sum_{j\in[N]} \mathcal{M}_j(q+\epsilon)\, v_j,
\]
where $\mathcal{M}$ is defined as in Lemma 1. We omit the keys and scaling parameter for convenience since they do not affect the analysis. Then,
\[
\Big\| \sum_{j\in[N]} \big(\mathcal{M}_j(q) - \mathcal{M}_j(q+\epsilon)\big)\, v_j \Big\|
\le \sum_{j\in[N]} |\mathcal{M}_j(q) - \mathcal{M}_j(q+\epsilon)|\, \|v_j\|
\le \sum_{j\in[N]} \|\nabla \mathcal{M}_j(\hat{q})\|\, \|\epsilon\|\, \|v_j\| \tag{21}
\]
\[
\le \sum_{j\in[N]} \sqrt{\sum_{i\in[D]} \big(J_\mathcal{M}(\hat{q})_{ji}\big)^2}\, \|v_j\|\, \|\epsilon\|
\le \sum_{j\in[N]} \sqrt{\sum_{i\in[D]} \kappa_{ij}^2 m_i^2}\, \|v_j\|\, \|\epsilon\|
= \sum_{j\in[N]} \sqrt{\operatorname{tr}(K_j^2 M^2)}\, \|v_j\|\, \|\epsilon\|, \tag{22}
\]
where $K_j := \operatorname{diag}(\kappa_{1j}, \kappa_{2j}, \ldots, \kappa_{Dj})$ and $\kappa_{ij}$ is defined as in Eqn. 20. Note that (21) follows from the mean value theorem for some $\beta \in [0,1]$ and $\hat{q} := q + \beta\epsilon$, while (22) follows from Lemma 1.

It should be noted that Proposition 1 addresses the impact of noise exclusively on the query vectors. However, the resulting bound can be extended to account for noise in all tokens by employing the same technique utilized in the proof. For completeness, we also provide the extension. Let $\mathcal{M} : \underbrace{\mathbb{R}^D \times \mathbb{R}^D \times \cdots \times \mathbb{R}^D}_{N+1} \to \mathbb{R}^N$ be the Elliptical Softmax function defined as
\[
\mathcal{M}(q, k_1, \ldots, k_N) = \frac{1}{\sum_{j\in[N]} \exp(q^\top M k_j)} \begin{pmatrix} \exp(q^\top M k_1) \\ \vdots \\ \exp(q^\top M k_N) \end{pmatrix}. \tag{23}
\]
Again, take the difference between the output vectors calculated from clean and noisy tokens, namely
\[
\sum_{j\in[N]} \mathcal{M}_j(q+\epsilon_q, k_1+\epsilon_k, \ldots, k_N+\epsilon_k)\,(v_j + \epsilon_v) \tag{24}
\]
\[
\sum_{j\in[N]} \mathcal{M}_j(q, k_1, \ldots, k_N)\, v_j. \tag{25}
\]
Let $\epsilon := \max\{\|\epsilon_q\|, \|\epsilon_k\|, \|\epsilon_v\|\}$ denote the noise with the largest norm among the query, key, and value noises. Then the norm of the difference between (24) and (25) is at most
\[
\sum_{j\in[N]} |\mathcal{M}_j(q+\epsilon_q, k_1+\epsilon_k, \ldots, k_N+\epsilon_k) - \mathcal{M}_j(q, k_1, \ldots, k_N)|\, \|v_j\| + \sum_{j\in[N]} \mathcal{M}_j(q+\epsilon_q, k_1+\epsilon_k, \ldots, k_N+\epsilon_k)\, \|\epsilon_v\| \tag{26}
\]
\[
\le \sum_{j\in[N]} \Big( \|\nabla_q \mathcal{M}_j(\hat{q}, \hat{k}_1, \ldots, \hat{k}_N)\| + \sum_{s\in[N]} \|\nabla_{k_s} \mathcal{M}_j(\hat{q}, \hat{k}_1, \ldots, \hat{k}_N)\| \Big)\, \epsilon\, \|v_j\| + \epsilon, \tag{27}
\]
where the hatted arguments are intermediate points given by the mean value theorem. Following the same steps as in the proof of Lemma 1, one can derive the bound
\[
\Big| \frac{\partial \mathcal{M}_j}{\partial k_s^i} \Big| \le m_i\, |q_i| \Big( 1 - \frac{3}{4}\delta_{sj} \Big) \quad \text{for } s, j \in [N].
\]
Therefore, we obtain a bound of the form $\epsilon \big( \sum_{j\in[N]} \sum_{s} \sqrt{\operatorname{tr}(C_{s,j}^2 M^2)}\, \|v_j\| + 1 \big)$, where the $C_{s,j}$ are the diagonal matrices whose elements are the proportionality coefficients in the derived upper bounds.

B.3 Proof of Proposition 2

There are two avenues through which to see resistance to representation collapse. In this section, we provide a proof based on noise propagation through layers, which decreases representation capacity as representations in deeper layers are increasingly composed of uninformative noise. We refer the reader to Appendix B.4 for an additional lens on representation collapse, where we show that Elliptical Attention is more sensitive to the variation and local features of the underlying function.

Let the output at layer $\ell$ be denoted $h^\ell$, let the standard self-attention estimator and the Elliptical estimator fitted at layer $\ell$ be denoted $\hat{f}^\ell$ and $\hat{f}^\ell_d$ respectively, where $d$ is the Mahalanobis metric described in Eqn. 7, and let $f$ be the true underlying function described in Eqn. 3. By assumption, $\hat{f}$ is a higher-variance estimator than $\hat{f}_d$ at any layer.
The output of either estimator at layer $\ell$ can be decomposed into ground truth and noise as follows:
\[
h^\ell = \hat{f}^\ell(q^\ell) = f(q^\ell) + \epsilon^\ell \tag{30}
\]
\[
h_d^\ell = \hat{f}_d^\ell(q^\ell) = f(q^\ell) + \eta^\ell, \tag{31}
\]
where $\eta^\ell \sim \gamma(0, V_\eta)$ and $\epsilon^\ell \sim \gamma(0, V_\epsilon)$ are the noise components of the estimates at $q^\ell$ and $f(q^\ell)$ is the ground truth. By the assumption of $\hat{f}_d$ being lower variance, $V_\epsilon - V_\eta$ is a positive semi-definite matrix. We first require the following Assumption 1:

Assumption 1 (Random Input Noise Causes Estimator Attenuation). Let $\hat{f}$ be any estimator of the true function $f$, and let the input $x \sim \mu$ drawn from the marginal $\mu$ be randomly corrupted by random noise $\epsilon \sim (0, V)$ of some unknown distribution with covariance matrix $V$. Let $c$ be some constant. Then random input noise attenuates the estimator as follows:
\[
\mathbb{E}_{x,\epsilon} \|\hat{f}(x+\epsilon) - c\| \le \mathbb{E}_x \|\hat{f}(x) - c\|. \tag{32}
\]
Assumption 1 is a well-studied phenomenon in parametric regression, often referred to as attenuation bias [69], regression dilution [19], or errors-in-variables [34]. In parametric regression, it can be shown to have an exact form in which the estimated gradients of the model are attenuated towards 0 proportionally to the variance of the noise $\epsilon$. In non-parametric regression, the addition of input noise is often referred to as random smoothing or random input smoothing [46, 10] and is well known as a regularization technique that introduces bias into the model. In non-parametric models, no exact closed forms exist to express the attenuation bias, but for our purposes we only note that the attenuation exists and provide a general form of it in Assumption 1.

The outputs of (30) and (31) then become the inputs to the following layer after being self-added, normalized, projected, and linearly transformed. For notational simplicity, and because these operations do not change the analysis, we denote the input at the next layer as the previous layer output, $q^{\ell+1} = h^\ell$. We therefore have the following process:
\[
h^{\ell+1} = \hat{f}^{\ell+1}(q^{\ell+1}) = \hat{f}^{\ell+1}(h^\ell) = \hat{f}^{\ell+1}\big(\underbrace{f(q^\ell) + \epsilon^\ell}_{z^\ell}\big), \tag{33}
\]
where we see the output $h^{\ell+1}$ is obtained by fitting $\hat{f}^{\ell+1}$ to the input $z^\ell$, which is composed of the ground truth $f(q^\ell)$ and the noise $\epsilon^\ell$ passed through from the previous layer. The result then follows directly from the fact that in any given layer, the standard self-attention estimator produces noisier estimates, and that noise is then passed into the subsequent layer as input noise. That is,
\[
\mathbb{E}\|h^{\ell+1} - c\| = \mathbb{E}\|\hat{f}^{\ell+1}(q^{\ell+1}) - c\| = \mathbb{E}\|\hat{f}^{\ell+1}(f(q^\ell) + \epsilon^\ell) - c\| \tag{34}
\]
\[
\le \mathbb{E}\|\hat{f}^{\ell+1}(f(q^\ell) + \eta^\ell) - c\| \tag{35}
\]
\[
\le \mathbb{E}\|\hat{f}_d^{\ell+1}(f(q^\ell) + \eta^\ell) - c\| \tag{36}
\]
\[
= \mathbb{E}\|\hat{f}_d^{\ell+1}(q_d^{\ell+1}) - c\| = \mathbb{E}\|h_d^{\ell+1} - c\|, \tag{37}
\]
where line 35 follows from combining the fact that $\eta^\ell$ is lower variance with Assumption 1, and line 36 follows from the fact that $\mathbb{E}\|X\| \le \mathbb{E}\|Y\|$ when $X$ and $Y$ have the same mean and roughly similar distributions. Therefore we obtain, at any layer $\ell$,
\[
\mathbb{E}\|h^{\ell+1} - c\| \le \mathbb{E}\|h_d^{\ell+1} - c\|, \tag{38}
\]
as required.
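The variance mechanism underlying this proof is easy to visualize in a toy simulation. The sketch below is our own construction under illustrative assumptions (a target that varies mainly along one coordinate), not an experiment from the paper: it fits NW estimators with spherical and with stretched, Mahalanobis-style neighborhoods, and the stretched neighborhoods average over many more relevant samples and achieve markedly lower MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, noise = 512, 8, 0.1

def f(X):
    # Target varies strongly in coordinate 0 and not at all in the others.
    return np.sin(4 * X[:, 0])

X = rng.uniform(-1, 1, size=(n, D))
y = f(X) + rng.normal(0, noise, size=n)

def nw(Xq, X, y, scales, h=0.3):
    # NW estimator with axis-wise scaled squared distances (diagonal M).
    d2 = ((Xq[:, None, :] - X[None, :, :]) ** 2 * scales).sum(-1)
    W = np.exp(-d2 / (2 * h * h))
    return (W @ y) / W.sum(1)

Xq = rng.uniform(-1, 1, size=(256, D))
iso = nw(Xq, X, y, np.ones(D))                          # spherical neighborhoods
ell = nw(Xq, X, y, np.array([1.0] + [0.05] * (D - 1)))  # stretched neighborhoods

print("isotropic MSE: %.4f" % np.mean((iso - f(Xq)) ** 2))
print("elliptical MSE: %.4f" % np.mean((ell - f(Xq)) ** 2))
```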
B.4 Edge-preservation Perspective on Representation Collapse

To further substantiate our findings on the mitigation of representation collapse in transformers, we now present an additional proposition that examines this phenomenon from a different perspective. In Proposition 4, we show that Elliptical Attention reduces representation collapse by retaining important local features (bumps, etc.) better than standard self-attention in the case of sparse piece-wise constant functions.

Proposition 4 (Representation Collapse). Let $f : A \to \mathbb{R}^D$ for $A \subseteq \mathbb{R}^D$ be a piece-wise constant function with $f|_{A_i}(q) = f^i \in \mathbb{R}^D$, where $A = \bigcup_{i\in I} A_i$ for some (possibly infinite) index set $I$. Let $q_1$ and $q_2$ be queries lying in any two adjacent domain pieces with distant function values. Then the Elliptical estimates at these queries retain the distance better than the standard self-attention estimates, which is formulated as
\[
\mathbb{E}\|\hat{f}_d(q_2) - \hat{f}_d(q_1)\| \ge \mathbb{E}\|\hat{f}(q_2) - \hat{f}(q_1)\|. \tag{39}
\]
Proof. Assume all output vectors are normalized. Then the Euclidean distance between two vectors is determined by their dot product, since
\[
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2 a^\top b. \tag{40}
\]
Without loss of generality, we take $A_1$ and $A_2$ to be the two adjacent pieces, so that $f(q_i) = f^i$ for $i = 1, 2$. Denote $\hat{f}(q_i) = h_i$ and $\hat{f}_d(q_i) = h_{id}$. Then Eqn. 39 is equivalent to proving
\[
\mathbb{E}_\mathcal{D}[h_{1d}^\top h_{2d}] \le \mathbb{E}_\mathcal{D}[h_1^\top h_2], \tag{41}
\]
where the expectation is taken over the whole sampling distribution $\mathcal{D}$, with the points $q_1 \in A_1$ and $q_2 \in A_2$ fixed as described in the definition. We drop the subscript $\mathcal{D}$, as this will be the default distribution for computing expectations unless specified otherwise. Let $r_S = \operatorname{cossim}(f^1, f^2) = f^{1\top} f^2$ be the cosine similarity of the two piece-wise values. By the definition of $q_1$ and $q_2$, and since the estimates work by averaging the output vectors with a small amount of noise, we have
\[
r_S \le \min\big\{ \mathbb{E}[h_{1d}^\top h_{2d}],\; \mathbb{E}[h_1^\top h_2] \big\}.
\]
We now decompose $h_{1d}$ and $h_{2d}$ in terms of components along and orthogonal to $f^1$ and $f^2$, respectively:
\[
h_{1d} = (h_{1d}^\top f^1)\, f^1 + f^{1\perp}, \qquad h_{2d} = (h_{2d}^\top f^2)\, f^2 + f^{2\perp}, \tag{42}
\]
where $f^{i\top} f^{i\perp} = 0$. Then, for their dot product, we have
\[
h_{1d}^\top h_{2d} = \big[(h_{1d}^\top f^1) f^1 + f^{1\perp}\big]^\top \big[(h_{2d}^\top f^2) f^2 + f^{2\perp}\big]
= (h_{1d}^\top f^1)(h_{2d}^\top f^2)\, f^{1\top} f^2 + (h_{1d}^\top f^1)\, f^{1\top} f^{2\perp} + (h_{2d}^\top f^2)\, f^{2\top} f^{1\perp} + (f^{1\perp})^\top f^{2\perp}. \tag{43}
\]
The analogous decomposition of $h_1^\top h_2$ can be obtained. By Theorem 2 we have that the Elliptical estimator is lower variance, so $\mathbb{E}\|f^i - h_{id}\|^2 \le \mathbb{E}\|f^i - h_i\|^2$. This has the following implications:
1. $1 \ge \mathbb{E}[h_{id}^\top f^i] \ge \mathbb{E}[h_i^\top f^i]$, i.e., the component of $h_{id}$ along $f^i$ is larger than that of $h_i$, and hence $h_{id}^\top f^i$ is closer to 1.
2. Due to the first implication, the orthogonal component $f^{i\perp}$ becomes smaller in magnitude, so that $f^{j\top} f^{i\perp}$ and $(f^{i\perp})^\top f^{j\perp}$ are closer to 0 for Elliptical compared to standard self-attention.
These two arguments, combined with Eqn. 43, imply that in expectation $h_{1d}^\top h_{2d}$ is closer to $1 \cdot (f^{1\top} f^2) + 0 = f^{1\top} f^2 = r_S$, which, by definition, is the smallest dot product over $S$; hence $r_S \le \mathbb{E}[h_{1d}^\top h_{2d}] \le \mathbb{E}[h_1^\top h_2]$, as desired.

B.5 Proof of Proposition 3

The lemma below encapsulates the necessary calculations that will be used in the following proofs.

Lemma 2. Given a normally distributed zero-mean random variable $\xi \sim \mathcal{N}(0, \sigma^2)$, the expectation of its absolute value is $\mathbb{E}|\xi| = \sqrt{2/\pi}\, \sigma$.

Proof. Since $\xi \sim \mathcal{N}(0, \sigma^2)$, by the definition of expectation we have
\[
\mathbb{E}|\xi| = \int_{-\infty}^{\infty} |x|\, \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{x^2}{2\sigma^2}\Big)\, dx
= \int_{-\infty}^{0} (-x)\, \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{x^2}{2\sigma^2}\Big)\, dx + \int_{0}^{\infty} x\, \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{x^2}{2\sigma^2}\Big)\, dx \tag{44}
\]
\[
= 2 \int_{0}^{\infty} \frac{x}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{x^2}{2\sigma^2}\Big)\, dx = \frac{2\sigma^2}{\sqrt{2\pi\sigma^2}} = \sqrt{\frac{2}{\pi}}\, \sigma, \tag{45}
\]
where we used the variable change $x \to (-x)$ in the first integral of (44) to obtain (45).
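A quick Monte Carlo check of Lemma 2, and of the expected absolute difference of two independent Gaussians used below (our own snippet; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 0.7, 1_000_000

# Lemma 2: E|xi| = sqrt(2/pi) * sigma for xi ~ N(0, sigma^2).
xi = rng.normal(0.0, sigma, size=n)
print(np.abs(xi).mean(), np.sqrt(2 / np.pi) * sigma)

# The difference of two independent N(0, sigma^2) draws is N(0, 2*sigma^2),
# so its expected absolute value is sqrt(2/pi) * sqrt(2) * sigma = 2*sigma/sqrt(pi).
e1, e2 = rng.normal(0.0, sigma, size=n), rng.normal(0.0, sigma, size=n)
print(np.abs(e2 - e1).mean(), 2 * sigma / np.sqrt(np.pi))
```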
We derive the bounds for the impact of the noise in Eqn. 3, with respect to its variance, on our estimator in Eqn. 10 in Lemma 3. Henceforth, we omit the factor $\delta$ in Eqn. 10 since it does not affect the analysis.

Lemma 3. Given that the noise term in Eqn. 3 follows a normal distribution with zero mean and variance $\sigma^2$, the inequality
\[
\big| m_i - \mathbb{E}|f_i(k^{\ell+1}) - f_i(k^\ell)| \big| \le \frac{2}{\sqrt{\pi}}\, \sigma \tag{46}
\]
holds for all $i \in [D]$, where $f_i$ denotes the $i$th component of $f(k) = (f_1(k), f_2(k), \ldots, f_D(k))^\top$.

Proof. Since all value vectors are taken from the data-generating process in Eqn. 3, we have
\[
m_i = \mathbb{E}_{(v^\ell, v^{\ell+1}) \sim \mathcal{X}_v^{\ell,\ell+1}} |v_i^{\ell+1} - v_i^\ell| = \mathbb{E}|f_i(k^{\ell+1}) - f_i(k^\ell) + \epsilon_i^{\ell+1} - \epsilon_i^\ell|, \tag{47}
\]
where $\epsilon_i^\ell$ and $\epsilon_i^{\ell+1}$ denote the $i$th components of the noise terms $\epsilon^\ell$ and $\epsilon^{\ell+1}$, respectively. Note that for real numbers $a$ and $b$, the triangle inequality gives $|a + b| \le |a| + |b|$ and $|a + b| = |a - (-b)| \ge ||a| - |b|| \ge |a| - |b|$. Applying these and the linearity of expectation to (47), we obtain
\[
\mathbb{E}|f_i(k^{\ell+1}) - f_i(k^\ell)| - \mathbb{E}|\epsilon_i^{\ell+1} - \epsilon_i^\ell| \le m_i \le \mathbb{E}|f_i(k^{\ell+1}) - f_i(k^\ell)| + \mathbb{E}|\epsilon_i^{\ell+1} - \epsilon_i^\ell|. \tag{48}
\]
Recall that the $\epsilon_i^\ell \sim \mathcal{N}(0, \sigma^2)$ are independent. We then have $\epsilon_i^{\ell+1} - \epsilon_i^\ell \sim \mathcal{N}(0, 2\sigma^2)$, as the mean does not change while the variances accumulate when subtracting two zero-mean normal variables. Therefore, Lemma 2 gives $\mathbb{E}|\epsilon_i^{\ell+1} - \epsilon_i^\ell| = \frac{2}{\sqrt{\pi}}\sigma$. Plugging this back into the inequalities (48), we get
\[
\mathbb{E}|f_i(k^{\ell+1}) - f_i(k^\ell)| - \frac{2}{\sqrt{\pi}}\sigma \le m_i \le \mathbb{E}|f_i(k^{\ell+1}) - f_i(k^\ell)| + \frac{2}{\sqrt{\pi}}\sigma,
\]
which is equivalent to (46), as desired.

Remark 7. Note that in Lemma 3 we may also take into account possible noise in the value vectors. Let $\epsilon_{i,v}^\ell \sim \mathcal{N}(0, \sigma_v^2)$ be the noise in the value vectors, so that $m_i^\epsilon = \mathbb{E}|v_i^{\ell+1} - v_i^\ell + \epsilon_{i,v}^{\ell+1} - \epsilon_{i,v}^\ell|$. Then, applying the triangle inequality, we obtain
\[
\mathbb{E}|v_i^{\ell+1} - v_i^\ell| - \mathbb{E}|\epsilon_{i,v}^{\ell+1} - \epsilon_{i,v}^\ell| \le m_i^\epsilon \le \mathbb{E}|v_i^{\ell+1} - v_i^\ell| + \mathbb{E}|\epsilon_{i,v}^{\ell+1} - \epsilon_{i,v}^\ell|.
\]
Now applying Lemma 2 and Lemma 3, we arrive at
\[
\big| m_i^\epsilon - \mathbb{E}|f_i(k^{\ell+1}) - f_i(k^\ell)| \big| \le \frac{2}{\sqrt{\pi}}(\sigma + \sigma_v).
\]

Figure 5: Left: Evolution of mean values of key perturbations over successive layers. Right: Mean key perturbations at different layers after 300 epochs. The figures show that as the number of layers increases, mean key perturbations over layers stabilize around a constant value.

Proof of Proposition 3. We shall first make the following assumptions on the data-generating process in Eqn. 3:

Assumption 2. The underlying coordinate system in the feature space $\mathcal{X}_k$ is independent, implying that the function $f : \mathbb{R}^D \to \mathbb{R}^D$ in Eqn. 3 can be separated as $f(k) = (f_1(k_1), \ldots, f_D(k_D))^\top$.

Assumption 3. The noise term in Eqn. 3 has independent components, with each component $\epsilon_j^\ell$ following a normal distribution $\mathcal{N}(0, \sigma^2)$ for small $\sigma$, for all $j \in [D]$ and $\ell \in \mathbb{N}$.

Assumption 4. The magnitude of each component of the key perturbations across consecutive layers, defined as $|k_i^{\ell+1} - k_i^\ell|$, follows a distribution with small, layer-independent mean ($\delta$) and variance ($\nu$).

Remark 8. The assumption of layer-independence in Assumption 4, especially for deeper layers, is supported well empirically, as shown in Figure 5. Given the over-smoothing observed in transformers [24], where token representations stabilize after the initial few layers, it is also practical to assume that key perturbations across layers have relatively small mean and variance when modelled as a random process.

Proof. Under Assumptions 2, 3, and 4, we show that $\|f_i\|_{1,\mu} \ge \|f_j\|_{1,\mu}$ implies $m_i \ge m_j$ with high probability, where $m_i$ is defined as in Eqn. 10. Directly from Lemma 3, we have
\[
\big| m_i - \mathbb{E}|f_i(k^{\ell+1}) - f_i(k^\ell)| \big| \le \frac{2}{\sqrt{\pi}}\sigma.
\]
Letting $\sigma \to 0$ in this inequality, which is feasible under Assumption 3, one gets, with a small error, that
\[
m_i \approx \mathbb{E}|f_i(k^{\ell+1}) - f_i(k^\ell)|, \tag{49}
\]
which in turn implies that the impact of the noise in Eqn. 10 is negligible and the error of ignoring it can be controlled by the bounds given in (46). Now, according to the theorem statement,
\[
\|f_i\|_{1,\mu} \ge \|f_j\|_{1,\mu} \iff \mathbb{E}\|J_f(k)\, e_i\|_1 \ge \mathbb{E}\|J_f(k)\, e_j\|_1 \iff \mathbb{E}|f_i'(k_i)| \ge \mathbb{E}|f_j'(k_j)|, \tag{50}
\]
where we used the separability of $f$ given in Assumption 2, which simplifies the Jacobian matrix as
\[
J_f(k) = \begin{pmatrix}
\frac{\partial f_1(k)}{\partial k_1} & \cdots & \frac{\partial f_1(k)}{\partial k_D} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_D(k)}{\partial k_1} & \cdots & \frac{\partial f_D(k)}{\partial k_D}
\end{pmatrix}
= \begin{pmatrix}
\frac{\partial f_1(k_1)}{\partial k_1} & \cdots & \frac{\partial f_1(k_1)}{\partial k_D} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_D(k_D)}{\partial k_1} & \cdots & \frac{\partial f_D(k_D)}{\partial k_D}
\end{pmatrix}
= \begin{pmatrix}
f_1'(k_1) & 0 & \cdots & 0 \\
0 & f_2'(k_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & f_D'(k_D)
\end{pmatrix},
\]
so that $[J_f(k)]_{ii} = f_i'(k_i)$.
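As a concrete illustration of this simplification (our own worked example, not one from the original), consider the separable function
\[
f(k) = \begin{pmatrix} \sin(3k_1) \\ k_2/2 \end{pmatrix}
\quad\Longrightarrow\quad
J_f(k) = \begin{pmatrix} 3\cos(3k_1) & 0 \\ 0 & 1/2 \end{pmatrix}.
\]
For $k_1 \sim \mathcal{U}[0, 2\pi]$, we get $\mathbb{E}\|J_f(k)e_1\|_1 = 3\,\mathbb{E}|\cos(3k_1)| = 6/\pi \approx 1.91 > 1/2 = \mathbb{E}\|J_f(k)e_2\|_1$, so the first coordinate has the higher directional variability, and Proposition 3 accordingly predicts $m_1 \ge m_2$ with high probability.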
Using the definition of the derivative, inequality (50) is equivalent to
\[
\mathbb{E}\Big| \lim_{\tau\to 0} \frac{f_i(k_i^\ell + \tau) - f_i(k_i^\ell)}{\tau} \Big| \ge \mathbb{E}\Big| \lim_{\tau\to 0} \frac{f_j(k_j^\ell + \tau) - f_j(k_j^\ell)}{\tau} \Big|. \tag{51}
\]
Next, we note that for a small $\delta$, the limits in (51) can be approximated by $\frac{f_s(k_s^\ell + \delta) - f_s(k_s^\ell)}{\delta}$ for $s \in \{i, j\}$:
\[
\frac{\mathbb{E}|f_i(k_i^\ell + \delta) - f_i(k_i^\ell)|}{\delta} \ge \frac{\mathbb{E}|f_j(k_j^\ell + \delta) - f_j(k_j^\ell)|}{\delta}. \tag{52}
\]
Let us choose $\delta = \mathbb{E}|k_i^{\ell+1} - k_i^\ell|$. Then, by Chebyshev's inequality, we have for any $\varepsilon > 0$ that
\[
1 - \frac{\nu}{\varepsilon^2} \le P\big( \big| |k_i^{\ell+1} - k_i^\ell| - \delta \big| \le \varepsilon \big) = P\big( \delta - \varepsilon \le |k_i^{\ell+1} - k_i^\ell| \le \delta + \varepsilon \big). \tag{53}
\]
Given that the variance $\nu$ is sufficiently small, as in Assumption 4, inequality (53) implies that $|k_i^{\ell+1} - k_i^\ell| \approx \delta$ with high probability. Therefore, it follows from (52) with high probability that
\[
\frac{\mathbb{E}|f_i(k_i^{\ell+1}) - f_i(k_i^\ell)|}{\delta} \ge \frac{\mathbb{E}|f_j(k_j^{\ell+1}) - f_j(k_j^\ell)|}{\delta},
\]
which, due to (49), is equivalent to $m_i \ge m_j$, as desired.

B.6 Lipschitz smoothness in (X, d)

Below we show how the Lipschitz smoothness of $f$ changes when moving from the Euclidean space to the Mahalanobis-transformed space. We follow similar steps to [29] and [30], but for a more general class of functions.

Proposition 5 (Change in Lipschitz smoothness for $f$). Suppose there exists a positive constant $G_i$ such that $\|f_i(k)\| \le G_i$ for any $k \in \mathcal{X}_k$, and $m_i > 0$ for all $i \in [D]$. Then for any $q, k \in \mathcal{X}_k$, the following inequality holds:
\[
\|f(q) - f(k)\| \le \sum_{i\in[D]} \frac{G_i}{\sqrt{m_i}}\, d(q, k).
\]
Proof. Let $\omega := \frac{q - k}{\|q - k\|}$ denote the unit vector pointing from $k$ to $q$. The fundamental theorem of calculus implies that
\[
f(q) - f(k) = \int_0^{\|q-k\|} \frac{d}{dt} f(k + t\omega)\, dt = \int_0^{\|q-k\|} J_f(k + t\omega)\, \omega\, dt,
\]
where $J_f$ is the Jacobian matrix of $f$ as usual. Starting with the distance between the outputs $f(q)$ and $f(k)$, we have
\[
\|f(q) - f(k)\| = \Big\| \int_0^{\|q-k\|} \sum_{i\in[D]} \omega_i\, f_i(k + t\omega)\, dt \Big\|
\le \int_0^{\|q-k\|} \sum_{i\in[D]} |\omega_i|\, \|f_i(k + t\omega)\|\, dt
\le \sum_{i\in[D]} G_i |\omega_i| \int_0^{\|q-k\|} dt
= \sum_{i\in[D]} G_i |q_i - k_i|, \tag{54}
\]
where, as for all other vectors, $q_i$ denotes the $i$th component of the vector $q$. Now note that
\[
\sqrt{m_i}\, |q_i - k_i| = \sqrt{m_i (q_i - k_i)^2} \le \sqrt{m_i (q_i - k_i)^2 + \sum_{j\neq i} m_j (q_j - k_j)^2} = \sqrt{(q-k)^\top M (q-k)} = d(q, k). \tag{55}
\]
Combining (54) and (55), we finally attain
\[
\|f(q) - f(k)\| \le \sum_{i\in[D]} \frac{G_i}{\sqrt{m_i}}\, d(q, k), \tag{56}
\]
which completes the proof.

C Additional Theorems

The following Theorem 1 is a classic result from [70]. We refer the reader to their work for details.

Theorem 1 (Minimax rate for functions of bounded variability [70]). Let $\mathcal{F}_\lambda$ denote the class of distributions $P_{X,Y}$ on $\mathcal{X} \times [0,1]$ such that, for all $i \in [d]$, the directional derivatives of $f(x) := \mathbb{E}[Y \mid X = x]$ satisfy $\|f_i\|_{\sup} := \sup_{q\in\mathcal{X}_k} \|f_i(q)\| \le \lambda$. Then for any $f \in \mathcal{F}_\lambda$, estimator $\hat{f}$, and sample size $n \in \mathbb{N}$, there exists a $c \ge 1$ independent of $n$ satisfying
\[
\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}_\lambda} \mathbb{E}_{X^n, Y^n} \|\hat{f} - f\|_2^2 \ge c^{2/(2+d)} (d\lambda)^{2d/(2+d)} n^{-2/(2+d)}. \tag{57}
\]

Theorem 2 (Improvement in MSE for approximately sparse functions [30]). Let the norm of the largest gradient be $\lambda := \sup_{i\in[D]} \|f_i(q)\|_{\sup}$ and let $\hat{f}_d$ be an estimator in the metric space $(\mathcal{X}_q, d)$, where $d$ is defined as in Eqn. 7. Then,
\[
\mathbb{E}\|\hat{f}_d - f\|_2^2 < \inf_{\hat{f}} \sup_{\mathcal{F}_\lambda} \mathbb{E}\|\hat{f} - f\|_2^2. \tag{58}
\]
Proof. We provide an abridged proof for completeness and refer the reader to [30] for the full details. First, the full bound is described as follows:
\[
\mathbb{E}\|\hat{f}_d - f\|_2^2 \le 2 C^{2/(2+r)} \kappa_R \big( C D \lambda_d\, d(\mathcal{X}) \big)^{2r/(2+r)} n^{-2/(2+r)} < \inf_{\hat{f}} \sup_{\mathcal{F}_\lambda} \mathbb{E}\|\hat{f} - f\|_2^2, \tag{59}
\]
where $d(\mathcal{X})$ is the $d$-diameter of $\mathcal{X}$ defined as $\sup_{x,x'\in\mathcal{X}} d(x, x')$, $R \subseteq [D]$, $1 \le C_{\kappa_R} \le C(4\kappa_R)^{|R|}$, $C$ and $C_1$ are universal constants, and $\lambda_d \ge \sup_i \|f_i\|_{\sup}/\sqrt{m_i}$. Let
\[
r(\epsilon) =
\begin{cases}
|R| & \text{if } \epsilon \ge \epsilon_R/d(\mathcal{X}) \\
D - (D - |R|)\, \dfrac{\log(d(\mathcal{X})/\epsilon_R)}{\log(1/\epsilon)} & \text{if } \epsilon < \epsilon_R/d(\mathcal{X}).
\end{cases}
\]
For bandwidth $\epsilon_n$, set $r = r(\epsilon_n)$, and note $|R| \le r \le D$. Let $\epsilon > 0$, let $c$ be the same constant as in Theorem 1, and for $n \in \mathbb{N}$ define the functions $\psi_{n,d}(\epsilon) = C\epsilon^{-r(\epsilon)}/n$ and $\psi_{n,\bar{d}}(\epsilon) = C_1 \epsilon^{-D}/n$, where $C_1 = c\, (\lambda/(C\lambda_d\, d(\mathcal{X})))^D$. Also define $\phi(\epsilon) = C^2 D^2 \lambda_d^2\, d(\mathcal{X})^2\, \epsilon^2$. For any fixed $n$, let $\epsilon_{n,\bar{d}}$ be a solution to $\psi_{n,\bar{d}}(\epsilon) = \phi(\epsilon)$. Solving for $\epsilon_{n,\bar{d}}$ obtains the following lower bound on the minimax rate:
\[
2\phi(\epsilon_{n,\bar{d}}) = 2 c^{2/(2+D)} (D\lambda)^{2D/(2+D)} n^{-2/(2+D)}. \tag{60}
\]
For any $n \in \mathbb{N}$ there exists a solution $\epsilon_{n,d}$ to the equation $\psi_{n,d}(\epsilon) = \phi(\epsilon)$, since $r(\epsilon)$ is non-decreasing. Therefore it is possible to obtain the following:
\[
\mathbb{E}_{X^n, Y^n} \|\hat{f}_{n,\epsilon,d} - f\|_2^2 \le 2\phi(\epsilon_{n,d}). \tag{61}
\]
Since $\phi$ is independent of $n$, and both $\psi_{n,d}$ and $\psi_{n,\bar{d}}$ are strictly decreasing functions of $n$, we have that $\epsilon_{n,d}$ and $\epsilon_{n,\bar{d}}$ both tend to 0 as $n \to \infty$. Therefore we can define $n_0$ such that, for all $n \ge n_0$, both $\epsilon_{n,d}$ and $\epsilon_{n,\bar{d}}$ are less than $\epsilon_R/d(\mathcal{X})$. Thus, for all $n \ge n_0$, we have $\epsilon_{n,d} < \epsilon_{n,\bar{d}}$ if, for all $0 < \epsilon < \epsilon_R/d(\mathcal{X})$, $\psi_{n,d}(\epsilon) < \psi_{n,\bar{d}}(\epsilon)$, which completes the proof.

D A Consistent Estimator

In this section, we present a consistent centered-difference quotient estimator of the coordinate-wise variability, obtained by perturbing the estimated function in the $i$th direction and measuring the $L_1$ norm of the difference. Like the estimator in the main text, this estimator requires no learnable parameters or gradients. The estimator is described in the following proposition.

Proposition 6 (Consistent Estimator). Given a function $f : \mathbb{R}^D \to \mathbb{R}^{D_v}$ with $i$th directional variation $\|f_i\|_{1,\mu}$, $i \in [D]$, the directional variation can be estimated by the quantity
\[
\hat{m}_i = \mathbb{E}_n \bigg[ \frac{\|f(k + t e_i) - f(k - t e_i)\|_1}{2t} \bigg], \tag{62}
\]
where $t$ is a hyperparameter controlling the degree of locality of the estimator and $\mathbb{E}_n$ denotes the empirical expectation over $n$ samples.

Despite the simple formulation of $\hat{m}_i$ in Proposition 6, it is nonetheless a consistent estimator of the coordinate-wise variation in the underlying function. We utilize a simplified version of a theorem from [30], adapted to suit our specific needs, as the original formulation is more detailed than necessary for our purposes.

Theorem 3 (Consistency of Centered-Difference Estimator for Scalar Functions [30]). Let $\varphi : \mathbb{R}^D \to \mathbb{R}$ be a smooth scalar function and $\|\varphi_i\|_{1,\mu} := \mathbb{E}_{x\sim\mu}|e_i^\top \nabla\varphi|$ be the coordinate-wise variability of that scalar function. Then, for any direction $i$ and any $0 < \delta < 1/2$, the following bound holds with probability at least $1 - 2\delta$:
\[
\bigg| \mathbb{E}_n \bigg[ \frac{|\varphi(x + t e_i) - \varphi(x - t e_i)|}{2t} \bigg] - \|\varphi_i\|_{1,\mu} \bigg| \le O\big( n^{-1/2}\, t^{-1} \ln(2D/\delta)^{1/2} \big). \tag{63}
\]
Note that Theorem 3 differs from our setting by studying a scalar function as opposed to a vector-valued function. However, we show that the result generalizes to the latter case in Corollary 1 below via the estimator (62).

Corollary 1 (Consistency of the Estimator (62) for Vector-Valued Functions). Let $f : \mathbb{R}^D \to \mathbb{R}^{D_v}$ be a vector-valued function and let $\|f_i\|_{1,\mu}$ be defined as in Definition 1. Then, for any direction $i$ and any $0 < \delta < 1/2$, the following bound holds with probability at least $1 - 2\delta$:
\[
\big| \hat{m}_i - \|f_i\|_{1,\mu} \big| \le O\big( n^{-1/2}\, t^{-1} \ln(2D/\delta)^{1/2} \big). \tag{64}
\]
Proof. We first relate the left-hand side of (64) to its coordinate-wise differences as follows:
\[
\big| \hat{m}_i - \|f_i\|_{1,\mu} \big|
= \bigg| \mathbb{E}_n \bigg[ \frac{\|f(k + t e_i) - f(k - t e_i)\|_1}{2t} \bigg] - \mathbb{E}_{k\sim\mu} \big[ \|J_f(k)\, e_i\|_1 \big] \bigg|
\]
\[
= \bigg| \sum_{j\in[D]} \mathbb{E}_n \bigg[ \frac{|f_j(k + t e_i) - f_j(k - t e_i)|}{2t} \bigg] - \sum_{j\in[D]} \mathbb{E}_{k\sim\mu} |e_i^\top \nabla f_j| \bigg| \tag{65}
\]
\[
= \bigg| \sum_{j\in[D]} \Big( \hat{m}_i^{(j)} - \|f_i^{(j)}\|_{1,\mu} \Big) \bigg|
\le \sum_{j\in[D]} \Big| \hat{m}_i^{(j)} - \|f_i^{(j)}\|_{1,\mu} \Big| \quad \text{(triangle inequality)} \tag{66}
\]
\[
\le O\big( n^{-1/2}\, t^{-1} \ln(2D/\delta)^{1/2} \big), \quad \text{(Theorem 3)}
\]
where line 65 follows from the definition of the $\ell_1$ norm, line 66 follows from the linearity of expectation, and the superscript $j$ indicates that the case is reduced to the scalar-function case for each $j$th summand individually. Note that the probability of the last bound is at least $(1 - 2\delta/D)^D$, since each component-wise bound holds with probability at least $1 - 2\delta/D$. However, since we can choose $\delta$ small enough such that $2\delta < 1$, by Bernoulli's inequality $(1 - 2\delta/D)^D \ge 1 - D \cdot 2\delta/D = 1 - 2\delta$.
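A minimal NumPy sketch of the estimator in Eqn. 62 (our own illustration under a hypothetical callable f; not the paper's released implementation). On a toy separable function with known directional variability, the estimated ordering of the coordinates matches the ground truth:

```python
import numpy as np

def mhat_consistent(f, K, t=0.05):
    """Centered-difference quotient estimate of ||f_i||_{1,mu} (Eqn. 62).

    f : callable mapping an (n, D) array of keys to (n, Dv) outputs
    K : (n, D) array of sampled keys
    """
    n, D = K.shape
    m_hat = np.empty(D)
    for i in range(D):
        e = np.zeros(D)
        e[i] = t
        # Empirical mean of the L1 difference quotient along direction i.
        m_hat[i] = np.abs(f(K + e) - f(K - e)).sum(axis=1).mean() / (2 * t)
    return m_hat

rng = np.random.default_rng(0)
K = rng.uniform(-1, 1, size=(4096, 3))
f = lambda X: np.stack([np.sin(5 * X[:, 0]), X[:, 1], 0.1 * X[:, 2]], axis=1)
print(mhat_consistent(f, K))  # coordinate 0 largest, coordinate 2 smallest
```

Note the two forward passes of f per dimension: this is precisely the cost that Remark 9 below identifies as prohibitive inside the attention mechanism.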
Table 9: Evaluation of the performance of our model and DeiT across multiple robustness benchmarks, using appropriate evaluation metrics for each.

Dataset          ImageNet-R  ImageNet-A  ImageNet-C  ImageNet-C (Extra)
Metric           Top-1       Top-1       mCE (↓)     mCE (↓)
DeiT             32.22       7.33        72.21       63.68
DeiT-Elliptical  32.66       7.63        73.59       65.71

Remark 9. Despite the proven consistency of this estimator, we opt for the efficient estimator presented in the main body, described in Eqn. 10. This is because the consistent estimator requires materialising the prediction function, that is, computing a forward pass of the self-attention mechanism twice per dimension. This makes the consistent estimator unusable in most problem settings. We present results for the consistent estimator in Appendix F.11.

E Implementation Procedure and Computational Efficiency

Training and Inference. Since Elliptical Attention requires keys and values from the previous layer in order to compute the required transformation, we can only implement Elliptical Attention from the second layer on. We incorporate our Elliptical Attention into both the training and inference stages. Firstly, Elliptical Attention is designed to offer improvements on both clean and contaminated data, so even in the presence of completely clean train and test data it is advantageous to incorporate Elliptical Attention into both stages. Secondly, it is commonplace to encounter data contamination in test data, and indeed highly possible to encounter it in train data as well. Therefore, in the interest of robustness, we also incorporate Elliptical Attention into both stages.

Computational Efficiency. Computing the required transformation requires no learnable parameters and is obtained simply by averaging absolute differences in values over layers. These operations are therefore just of the order O(bhnD) = O(n) for batch size b, number of heads h, key/value length n, and dimension D. Hence the upper-bound time complexity of the overall Transformer is unaffected. We provide an efficiency analysis in terms of computation speed and max GPU memory allocated (calculated by CUDA max_memory_allocated) in Figure 4, which shows that, compared with baseline robust models, Elliptical is the fastest and most memory-efficient. Elliptical exhibits no perceptible slowdown versus DeiT of the same configuration and only a 0.99% increase in max memory allocated, which is why Elliptical and DeiT are shown as the same data point in Figure 4.
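For concreteness, the procedure above can be written in a few lines of PyTorch. The following single-head sketch is our own illustration under assumed tensor shapes; it mirrors the pseudocode in Appendix F.12 and additionally applies the max-scaling of the estimates referred to in Appendix F.11, so it should be read as a sketch rather than an excerpt from the released code.

```python
import torch
import torch.nn.functional as F

def elliptical_attention(Q, K, V, V_prev, delta=1.0):
    """Single-head Elliptical Attention sketch.

    Q, K, V, V_prev : (N, D) tensors, with V_prev the previous layer's values.
    """
    N, D = V.shape
    with torch.no_grad():
        # Coordinate-wise variability: average L1 change of the values across
        # layers; no learnable parameters and only O(N * D) extra work.
        m = (V - V_prev).abs().sum(dim=0) / (N * delta)
        m = m / m.max()  # max-scaling so that m lies in (0, 1]
    logits = (Q * m) @ K.T / D**0.5  # equals Q @ diag(m) @ K^T, i.e. q^T M k
    return F.softmax(logits, dim=-1) @ V
```

Since V_prev is required, such a layer can only be used from the second layer on, matching the training and inference procedure described above.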
F Experimental Details and Additional Experiments

F.1 Out-of-Distribution Robustness and Data Corruption on ImageNet-A,R,C

ImageNet-A,R,C are benchmarks capturing a range of out-of-distribution and corrupted samples. ImageNet-A contains real-world adversarially filtered images that fool current ImageNet classifiers. ImageNet-R contains various artistic renditions of object classes from the original ImageNet. ImageNet-C consists of 15 types of algorithmically generated corruptions at 5 levels of severity (e.g., blurring, pixelation, speckle noise). Given that Elliptical Attention learns attention weights dependent on the transformation M, which is itself dependent on the training data distribution, our proposed model is not designed for situations in which the test distribution is substantially different from the train distribution. This includes OOD robustness and robustness to corruption heavy enough that the underlying data distribution is fundamentally different. We nonetheless evaluate Elliptical Attention on ImageNet-A,R,C to assess these important forms of robustness as well.

Table 9 shows that Elliptical Attention is still able to offer improvements over baseline DeiT in terms of OOD robustness, while maintaining approximately the same performance as the baseline on ImageNet-C. Figure 7 shows, for the Fog and Pixelate corruptions, how Elliptical compares with DeiT over the 5 severity levels: at low severity levels Elliptical improves over DeiT, but as the severity becomes too high, Elliptical falls behind. This agrees with our expectation that as the severity grows, the distribution shifts further from the train distribution, and so Elliptical Attention is unable to improve performance.

F.2 Representation Collapse

We provide in Figure 6 additional representation collapse results for ImageNet and ADE20K, showing that across modalities Elliptical Attention resists representation collapse.

Figure 6: Additional representation collapse results on ADE20K, WikiText-103, and ImageNet. Elliptical reduces token similarity over layers across a range of modalities.

Table 10: Additional results on ImageNet, increasing the number of heads while maintaining the overall embedding dimension.

Model             Num. Heads  Head Dim.  #Params  Top-1 Accuracy  Top-5 Accuracy
DeiT              3           64         5M       72.23           91.13
Elliptical        3           64         5M       72.36           91.33
DeiT-6head        6           32         5M       72.34           91.22
Elliptical-6head  6           32         5M       73.00           91.77

F.3 Head Redundancy

We present in Table 18 head redundancy results on the two large-scale tasks, WikiText-103 language modelling and ImageNet-1K object classification. We report the mean L2 distance between vectorized attention heads, with the mean taken over a batch of size 1000 and averaged layer-wise. We see that Elliptical improves head redundancy on WikiText-103 versus the baseline transformer, while performing approximately equally to the DeiT baseline on ImageNet.

Figure 7: Comparison of DeiT versus DeiT-Elliptical accuracies on two types of ImageNet-C corruptions, namely Fog (left) and Pixelate (right). The figures show two out of many cases where DeiT-Elliptical outperforms its counterpart, with vanilla DeiT managing to exceed it only at higher severity levels.

F.4 Efficiency Results

We present here comparative efficiency results for DeiT and DeiT-Elliptical side by side at the tiny, small, and base sizes (Table 11), along with DeiT-Elliptical compared against other robust baselines (Table 12).

F.5 Elliptical Attention in Mixture of Expert Architectures

We additionally evaluate Elliptical Attention within Mixture of Expert architectures. Specifically, we show in Tables 13, 14, and 15 the performance of Elliptical Attention within the Switch Transformer [18] and Generalized Language Model (GLaM) backbones [17].

Table 11: Side-by-side efficiency comparison of DeiT and DeiT-Elliptical.

Model            Compute Speed (it/s)  Max Memory (K)  FLOPs/sample  #Params (M)
DeiT (tiny)      0.347                 2.08            1.77          5.7
DeiT-Elliptical  0.342                 2.12            1.79          5.7
% Change         -1.44%                1.92%           1.12%         -
DeiT (small)     0.297                 4.89            6.91          22.1
DeiT-Elliptical  0.289                 4.96            6.99          22.1
% Change         -2.69%                1.43%           1.16%         -
DeiT (base)      0.132                 10.27           26.37         86.6
DeiT-Elliptical  0.130                 10.54           26.63         86.6
% Change         -1.52%                2.63%           0.98%         -
Avg. % Change    1.88%                 1.99%           1.09%         -
Table 12: Efficiency comparison between Elliptical and baseline robust models.

Model            Compute Speed (it/s)  Max Memory (K)  FLOPs/sample  #Params (M)
DeiT-MoM         0.331                 2.24            1.74          5.7
DeiT-RKDE        0.271                 3.12            1.77          5.7
DeiT-SPKDE       0.168                 3.35            1.75          5.7
DeiT-RVT         0.292                 3.91            1.89          7.1
DeiT-Elliptical  0.342                 2.12            1.79          5.7

F.6 Additional Adversarial Attack Results on DeiT-Small Configuration

We present here additional results for DeiT and DeiT-Elliptical at the Small configuration [78] (22.1M parameters) under adversarial attack. Table 16 shows the results of Elliptical against PGD, FGSM, and SPSA. Table 17 shows the results of Elliptical against AutoAttack. Given the larger model size, we attack with a perturbation budget of 3/255.

F.7 WikiText-103 Language Modelling and Word Swap Attack

Dataset. The WikiText-103 dataset² contains around 268K words; its training set consists of about 28K articles with 103M tokens, corresponding to text blocks of about 3600 words. The validation and test sets consist of 60 articles with 218K and 246K tokens, respectively.

Corruption. The Word Swap TextAttack³ corrupts the data by substituting random words with a generic token 'AAA'. We follow the setup of [23] and assess models by training them on clean data before attacking only the evaluation set, using a substitution rate of 2.5%.

Model, Optimizer & Train Specification. We adopt the training regime of [54]. To this end, the small backbone uses 16 layers, 8 heads of dimension 16, a feedforward layer of size 2048, and an embedding dimension of 128, with a dropout rate of 0.1. We trained with Adam using a starting learning rate of 0.00025 and cosine scheduling under default PyTorch settings, a batch size of 96, 120 epochs, and 2000 warmup steps. The train and evaluation target lengths were set to 256.

The medium backbone uses 16 layers, 8 heads of dimension 32, a feedforward layer of size 2048, and an embedding dimension of 256, with a dropout rate of 0.1. We trained with Adam using a starting learning rate of 0.00025 and cosine scheduling under default PyTorch settings, a batch size of 56, 100 epochs, and 2000 warmup steps. The train and evaluation target lengths were set to 384. For Elliptical Attention, we use an Elliptical layer on all possible layers, 2 through 16, and a constant delta of 1.

Compute Resources. All models are trained and evaluated on two NVIDIA A100 SXM4 40GB GPUs.

²www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/
³Implementation available at github.com/QData/TextAttack

Table 13: Elliptical Switch Transformers pretrained on WikiText-103 and finetuned on Stanford Sentiment Treebank 2 (SST-2).

Model                      Test PPL (↓)  Finetune Test Acc. (↑)
Switch Transformer-medium  35.33         76.27
Switch Elliptical-medium   34.67         77.32
Switch Transformer-large   31.18         76.79
Switch Elliptical-large    30.56         78.08

Table 14: Elliptical Switch Transformers pretrained on enwik8 and finetuned on Stanford Sentiment Treebank 2 (SST-2).

Model               Test BPC (↓)  Finetune Test Acc. (↑)
Switch Transformer  1.153         63.27
Switch Elliptical   1.142         67.75

F.8 ImageNet Image Classification and Adversarial Attack

Dataset. We use the full ImageNet dataset, which contains 1.28M training images and 50K validation images. The model learns to predict the class of the input image among 1000 categories. We report top-1 and top-5 accuracy for all experiments.
Corruption. We use the attacks FGSM [22], PGD [42], and AutoAttack [11] with a perturbation budget of 1/255, while SPSA [81] uses a perturbation budget of 0.1. All attacks perturb under the l∞ norm. The PGD attack uses 20 steps with a step size of 0.15.

Model, Optimizer & Train Specification. The configuration follows the default DeiT-tiny configuration [78]. In particular, we follow the experimental setup of [23, 54]. To this end, the DeiT backbone uses 12 layers, 3 heads of dimension 64, patch size 16, a feedforward layer of size 768, and an embedding dimension of 192. We train using Adam with a starting learning rate of 0.0005 with cosine scheduling under default PyTorch settings, momentum of 0.9, a batch size of 256, 5 warmup epochs starting from 0.000001, and 10 cooldown epochs, for an overall train run of 300 epochs. The input size is 224, and we follow the default AutoAugment policy with color jitter 0.4. For Elliptical Attention, we use an Elliptical layer on all possible layers, 2 through 12, and a constant delta of 1.

Compute Resources. We train and evaluate all models on four NVIDIA A100 SXM4 40GB GPUs, with the exception of the robustness experiments on ImageNet-C, which are conducted using four NVIDIA Tesla V100 SXM2 32GB GPUs.

F.9 LRA Long Sequence Classification

Dataset. The LRA benchmark consists of 5 tasks involving long-range contexts of up to 4000 in sequence length: equation calculation (ListOps) [50], review classification (Text) [41], document retrieval (Retrieval) [61], image classification (Image) [32], and image spatial dependencies (Pathfinder) [37].

Model, Optimizer & Train Specification. We adopt the same experimental setup as [7]. To that end, the Transformer backbone is set with 2 layers, a hidden dimension of 128, 2 attention heads of dimension 32, and an embedding dimension of 64, with a dropout rate of 0.1. Built on top of the standard transformer backbone, Reformer uses 2 hashes, Performer has 256 random feature dimensions, and Linformer uses a projection dimension of 256. We train with Adam using a learning rate of 0.0001 with linear decay. We use a batch size of 32 for ListOps, Retrieval, and Text, and 256 for Image and Pathfinder32. We use 1000, 175, 312, 800, and 1000 warmup steps for ListOps, Image, Pathfinder32, Retrieval, and Text, respectively. Elliptical places the Elliptical Attention layer on the final layer (the only one possible) and uses delta equal to 1.

Table 15: Test perplexity of Elliptical GLaM on WikiText-103 language modelling.

Model                   Test PPL
GLaM-small              58.27
GLaM-Elliptical-small   56.69
GLaM-medium             38.27
GLaM-Elliptical-medium  36.34

Table 16: DeiT and DeiT-Elliptical accuracy on ImageNet under the adversarial attacks PGD, FGSM, and SPSA with the Small backbone configuration.

                  Clean Data     PGD            FGSM           SPSA
Method            Top-1  Top-5   Top-1  Top-5   Top-1  Top-5   Top-1  Top-5
DeiT-small        79.89  95.04   21.41  51.50   51.57  82.12   65.68  91.28
Elliptical-small  79.92  95.06   22.39  54.02   51.86  82.87   72.02  92.45

Table 17: DeiT and DeiT-Elliptical accuracy on ImageNet under AutoAttack with the Small backbone configuration.

                  Clean Data     APGD-CE        APGD-T         FAB-T          Square
Method            Top-1  Top-5   Top-1  Top-5   Top-1  Top-5   Top-1  Top-5   Top-1  Top-5
DeiT-small        79.89  95.04   19.18  50.75   16.54  63.84   80.66  95.09   49.98  89.17
Elliptical-small  79.92  95.06   18.88  51.07   17.30  65.28   81.64  95.59   55.89  89.36

Table 18: Head redundancy results (mean L2 distance ± error between vectorized attention heads).

Model              Num. Heads  Head Dim.  L2 Distance
WikiText-103
Transformer-Small  8           16         5.40 ± 2.21
Elliptical-Small   8           16         6.45 ± 2.38
ImageNet
DeiT               3           64         5.11 ± 1.67
Elliptical         3           64         4.98 ± 1.54
Compute Resources. All models are trained and evaluated on a single NVIDIA A100 SXM4 40GB GPU.

F.10 ADE20K Image Segmentation

Dataset. ADE20K [88] contains challenging scenes with fine-grained labels and is one of the most challenging semantic segmentation datasets. The training set contains 20,210 images with 150 semantic classes. The validation and test sets contain 2,000 and 3,352 images, respectively.

Model, Optimizer & Train Specification. We follow the experimental setup of [71]. The encoder is pretrained on ImageNet following the specification described in Appendix F.8, using the setup in [78, 54]. That is, the encoder is a DeiT backbone using 12 layers, 3 heads of dimension 64, patch size 16, a feedforward layer of size 768, and an embedding dimension of 192. We train using Adam with a starting learning rate of 0.0005 with cosine scheduling under default PyTorch settings, momentum of 0.9, a batch size of 256, 5 warmup epochs starting from 0.000001, and 10 cooldown epochs, for an overall train run of 300 epochs. The input size is 224, and we follow the default AutoAugment policy with color jitter 0.4. After pretraining the encoder, we attach as decoder a masked transformer consisting of 2 layers; each layer contains 3 heads of dimension 64, an embedding dimension of 192, and a feedforward dimension of 768. The decoder uses a dropout rate of 0.1. The full segmenter (encoder and decoder) is then finetuned using SGD with a starting learning rate of 0.001 and polynomial scheduling. The batch size is set to 8.

Compute Resources. All models are trained and evaluated on a single NVIDIA A100 SXM4 40GB GPU.

Table 19: Perplexity (PPL) of Elliptical and baselines on WikiText-103 under Word Swap data contamination. Our Elliptical method achieves substantially better robust PPL without compromising performance on clean data.

Model                  Clean Test PPL (↓)  Contaminated Test PPL (↓)
Transformer            34.29               74.56
Performer              33.49               73.48
Transformer-MGK        33.21               71.03
FourierFormer          32.85               68.33
Transformer-SPKDE      32.18               54.97
Transformer-MoM        34.68               52.14
Elliptical             32.00               52.59
Random Ablation        37.84               46.82
Elliptical-Consistent  32.95               54.67
Elliptical-Meanscale   31.94               52.78

F.11 Ablation Studies

Ablation Models. We consider the following models in our ablation studies:

Random Ablation. To validate the efficacy of our proposed estimator given in Eqn. 10, we consider an alternate model in which M is populated by weights drawn uniformly from the interval [0, 1], followed by the same max-scaling as in Elliptical.

Elliptical-Meanscale. We ablate the effect of max-scaling by considering mean-scaling of the estimates m_i. That is, each m_i is scaled as m_i/m̄ by the mean variability estimate m̄ = E_D[m_i]; a short snippet following this discussion illustrates the two scalings.

Elliptical-Consistent. We also consider the performance of Elliptical when using the consistent estimator of ‖f_i‖_{1,μ} described by Eqn. 62.

Language Modelling. Results are shown in Table 19. Amazingly, the Random Ablation model performs extremely well on contaminated data. In general, this most likely suggests that training a model with randomness injected into the attention matrix can generate some robustness benefits, which is intuitive. Less surprisingly, it comes at the cost of clean-data performance, where Random Ablation performs almost 10% worse than the baseline transformer.
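To make the difference between the two scalings concrete, a tiny hypothetical snippet (the variability values are arbitrary):

```python
import numpy as np

m = np.array([0.8, 0.2, 0.05, 0.4])  # hypothetical variability estimates

m_max = m / m.max()    # Elliptical: max-scaling, largest coordinate becomes 1
m_mean = m / m.mean()  # Elliptical-Meanscale: mean-scaling, average becomes 1

print(m_max)   # [1.     0.25   0.0625 0.5   ]
print(m_mean)  # [2.2069 0.5517 0.1379 1.1034]
```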
ImageNet Classification and Attack. Table 21 shows the ablation model's performance on both clean ImageNet and under AutoAttack. The ablation model shows a slight improvement over the DeiT baseline in Top-1 accuracy; however, Top-5 accuracy is substantially lower. Reasonable performance against AutoAttack is overall unsurprising, given that the Random Ablation model is essentially employing a random defence. Nonetheless, it still does not surpass the performance of Elliptical.

Table 20: Evaluation of the performance of our model, DeiT, and the ablation models across multiple robustness benchmarks, using appropriate evaluation metrics for each.

Dataset                ImageNet-R  ImageNet-A  ImageNet-C  ImageNet-C (Extra)
Metric                 Top-1       Top-1       mCE (↓)     mCE (↓)
DeiT                   25.38       3.65        72.21       63.68
Elliptical             31.37       6.76        73.59       65.71
Random Ablation        30.87       5.85        74.02       65.90
Elliptical-Consistent  31.46       6.71        82.92       71.74
Elliptical-Meanscale   32.66       7.63        72.28       63.79

Table 21: AutoAttack ablation study: Top-1 and Top-5 test accuracies on clean ImageNet and under AutoAttack. The ablation model fails to fit the clean data well and is highly prone to adversarial attack.

                   DeiT [78]       DeiT-Elliptical  Random Ablation
Method             Top-1  Top-5    Top-1  Top-5     Top-1  Top-5
Clean Data         72.23  91.13    72.36  91.33     71.44  91.29
APGD-CE            27.75  66.48    31.27  68.28     27.85  61.74
APGD-T             27.74  73.37    29.69  74.39     28.60  68.72
FAB-T              71.61  90.54    71.74  90.81     68.54  89.43
Square             43.55  80.96    47.25  81.65     47.24  78.87
Average            42.66  77.84    45.00  78.78     43.06  74.69
Sequential Attack  26.08  64.18    27.45  67.77     26.33  60.85

F.12 Pseudocode

Algorithm 1 presents pseudocode for implementing Elliptical Attention as given by Eqn. 13 on top of conventional self-attention.

Algorithm 1: Computation of Elliptical Attention
Input: current-layer queries Q ∈ R^{N×D}, keys K ∈ R^{N×D}, values V ∈ R^{N×D}; previous-layer values V_prev ∈ R^{N×D}; step size δ ∈ R_+; head dimension D ∈ N.

function ELLIPTICAL_ATTENTION(Q, K, V, V_prev, δ, D)
    M ← ELLIPTICAL_WEIGHTS(V, V_prev, δ)        ▷ compute weight matrix M
    logits ← Q M Kᵀ / √D                        ▷ modify the dot-product computation
    attention ← SOFTMAX(logits)
    output ← attention · V
    return output
end function

function ELLIPTICAL_WEIGHTS(V, V_prev, δ)
    with torch.no_grad():
        N ← V.size(0)                           ▷ sequence length
        value_diff ← (V − V_prev)/δ
        m ← (1/N) · NORM(value_diff, p = 1, dim = 0)   ▷ column-wise average of L1 norms
        M ← DIAG_EMBED(m)                       ▷ embed the vector into a diagonal matrix
    return M
end function

G Broader Impacts

Our research offers benefits to both clean-data performance and robust performance. We in particular show improved results in domains with wide social applicability. These include image segmentation, with benefits to self-driving cars, and language modeling, with benefits to AI chatbot assistants. We in particular show strong improvements against contamination by adversarial attack, which we hope can protect vital AI systems from malicious actors, and competitive performance in contaminated language modeling, which we hope can improve language models evaluated on imperfect data, as is often the case in the real world. There is always the possibility of misuse of AI systems; however, our research shows substantive improvements in fundamental architectures and theory, which we hope can spur further socially beneficial outcomes.

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: Our paper claims that using a Mahalanobis metric inside the self-attention mechanism to produce hyper-ellipsoidal neighborhoods around queries improves robustness and mitigates representation collapse.
We provide a theoretical framework for this with proofs and a large amount of empirical validation.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We mention in our future work (Section 6) that our estimator is noisy. In Appendix D we provide and prove a consistent estimator, but note that it is too computationally expensive, hence we opt for our noisier but more efficient estimator. As a result, we do not prove the consistency of our estimator, but only the weaker requirement that it accurately estimates the relative magnitudes of the direction-wise variability. For this we assume approximate separability of the true function.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]

Justification: All theoretical results are numbered and have a referenced section in the Appendix where all required lemmas and assumptions are stated and proven. All proofs make each step as clear as possible.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We include in Appendix F all hyperparameters, model configuration, training and inference specifications, and dataset details. For corruption, we additionally include all perturbation budgets, steps, step sizes, norm specification, and additional details. We provide the exact equation for our coordinate-wise variability estimator.
5. Open Access to Data and Code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We provide all code used, packaged into folders for each set of experiments (e.g., WikiText-103). Each folder contains a bash script to run the required result (e.g., run_elliptical.sh or run_baseline.sh). Where possible, we also include scripts to download the data. Folders also contain readme files providing information on version and package dependencies.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: In Appendix F we provide all dataset details, train and test splits, compute resources, and other experimental configuration information. We also include citations to the other authors with whom we compare and whose experimental settings we adopt.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]

Justification: In our head redundancy experiments we report error bars, in particular showing that the difference in performance between DeiT and Elliptical is not statistically significant.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: We provide an efficiency analysis figure (Figure 4) containing computation time and memory resources. We also provide in Appendix F the exact GPU resources used.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: Our research involves no human subjects or privacy concerns. Our research also has no clear links to discriminatory, unsafe, or harmful outcomes. Rather, as it is in large part concerned with robustness, particularly to adversarial attack, we hope our research might be used to defend vital AI systems from ill-intentioned actors.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: We discuss broader impacts in Appendix G. We see our improved accuracy and robustness as offering societal benefits, for example to self-driving cars through our improved image segmentation results, or through our adversarially robust vision models that defend against ill-intentioned actors.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: The paper contains no such models with high risk of misuse, such as pretrained language models, image generators, or scraped datasets. We use widely accepted, standard benchmark datasets and propose a fundamental and general improvement to a core architecture.

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: Wherever code has been borrowed from public repositories or baseline models have been implemented, we have duly credited the sources.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: We include details about training and implementation, as well as limitations, for our novel class of Elliptical Attention mechanisms.

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: We do not use any crowdsourcing nor research with human subjects.

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: We do not use any crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.