# Self-attention Networks Localize When QK-eigenspectrum Concentrates

Han Bao 1, Ryuichiro Hataya 2, Ryo Karakida 3

1 Kyoto University, 2 RIKEN AIP, 3 AIST. Correspondence to: Han Bao. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract. The self-attention mechanism prevails in modern machine learning. It has an interesting functionality of adaptively selecting tokens from an input sequence by modulating the degree of attention localization, which many researchers speculate is the basis of the powerful model performance but which complicates the underlying mechanism of the learning dynamics. In recent years, mainly two arguments have connected attention localization to model performance. One is rank collapse, where the token embeddings produced by a self-attention block become very similar across different tokens, leading to a less expressive network. The other is entropy collapse, where the attention probabilities become highly non-uniform and thus entail low entropy, making the learning dynamics more likely to be trapped in plateaus. These two failure modes may apparently contradict each other because the rank and entropy collapses are relevant to uniform and non-uniform attention, respectively. To this end, we characterize the notion of attention localization by the eigenspectrum of the query-key parameter matrix and reveal that a small eigenspectrum variance leads attention to be localized. Interestingly, a small eigenspectrum variance can prevent both rank and entropy collapses, leading to better model expressivity and trainability.

1. Introduction

Transformers have been widely adopted in language modeling (Vaswani et al., 2017), vision tasks (Dosovitskiy et al., 2021; Touvron et al., 2021), and speech recognition (Likhomanenko et al., 2021). A crucial building block in transformers is the attention mechanism, dating back to Graves (2013), which was initially designed to capture long-range signals in sequential inputs by mixing individual tokens but has also been leveraged to capture general structures of input data. After fully-attention-based language models appeared (Brown et al., 2020), the research community became interested in the functionality and benefits of attention. To mention a few findings: transformers implicitly prefer hierarchical interpretations of input sequences (Kharitonov & Chaabouni, 2021); store relational knowledge in MLP layers as an associative memory (Meng et al., 2022); tend to have tree-structured computational graphs (Murty et al., 2023b); and suddenly capture tree structures of inputs after long training epochs (Murty et al., 2023a). Theoretically, training dynamics analyses explain how vision transformers (ViT) learn spatially correlated patches (Jelassi et al., 2022), how attention selects dominant tokens (Tian et al., 2023a), stores information as an associative memory (Bietti et al., 2023), and selects max-margin tokens (Tarzanagh et al., 2023b), whereas Xie et al. (2022) explain in-context learning as a process of concept learning in Bayesian inference. Among the many aspects of attention, we specifically focus on localization: for a query token, self-attention can select only a few relevant tokens (which we call localized attention) or select many tokens nearly uniformly.
Since attention can be regarded as a token mixer, studying how it selects tokens plays a pivotal role in revealing the characteristics of the token embeddings. To this end, we pose the following research questions:

(Q1) When is self-attention localized or uniform?

(Q2) How does localization affect model performance?

Along this line, previous studies have mainly investigated these questions from the perspectives of model expressivity and training stability. On the one hand, Dong et al. (2021) and Noci et al. (2022) initiated the discussion of attention localization and theoretically showed that a network of self-attention layers without skip connections exponentially loses the rank of its hidden layers; this indicates that model expressivity is quickly lost as more self-attention layers are stacked. On the other hand, Zhai et al. (2023) empirically found that the attention entropy (the averaged Shannon entropy of an attention probability matrix) correlates with training stability. Specifically, a training loss curve tends to fall into a plateau when the attention entropy is low. Since higher entropy indicates near-uniform attention weights, their finding apparently suggests that localized attention may lead the learning dynamics to a plateau.

Up until now, these two failure modes have been discussed independently with slightly different notions of attention localization, and hence, our understanding of the blessing and curse of attention localization remains elusive.

Figure 1: Comparison of softmax S and the piecewise approximation S̃ for two-dimensional inputs.

To better comprehend this, we characterize self-attention patterns by the attention parameter matrices and reconcile the two collapse modes. We formulate the concept of localization via the signal propagation probability (Section 3), which describes how likely the signal of a specific input token propagates to the gradient of a training objective. If the signal propagation probability is high for only a few tokens, attention is regarded as localized, and the learning dynamics is dominated by those tokens. We show that the localization mode can be characterized by the eigenspectrum of the attention weight matrices (Section 4). Specifically, attention is localized in the above sense when the eigenspectrum of the query-key parameter matrix has a non-zero mean and a small variance. Furthermore, the small eigenspectrum variance is relevant to both the rank collapse and the entropy collapse (Section 5), and thus we give a unified perspective on the two notions of attention collapse. For this reason, we argue that attention collapse and its effect on model performance can be viewed more transparently through the eigenspectrum variance. Lastly, we indirectly observe the correlation between the eigenspectrum and the model performance in experiments on the WikiText dataset (Merity et al., 2016) by introducing a regularization scheme called LOCATER.

Notation. We write the vectors of all zeros and all ones as 0 and 1, respectively, whereas the i-th one-hot vector is written as e_i. A vector is written in bold face like a, and its i-th scalar element is written in non-bold as a_i. A matrix is written in capital bold face like A, and A_i denotes its i-th column vector unless otherwise noted. The identity matrix is denoted by I. The Hadamard product of A with itself is written as A^{⊙2} := A ⊙ A. We write the error function as erf and use erf(−z) = −erf(z) without explicitly mentioning it.
Infinitesimal asymptotic orders o(·) are with respect to the sequence length T unless otherwise noted.

Transformer. Let X := [x_1 x_2 . . . x_T] ∈ R^{d×T} be an input with T tokens, whose distribution is specified later. We suppose that all input sequences have the same length T, and T is occasionally taken sufficiently large. The ℓ-th (single-head) self-attention layer is defined as

A^ℓ := S((X^{ℓ−1})^⊤ W_QK X^{ℓ−1} / λ),  (1)
U^ℓ := W_V X^{ℓ−1} A^ℓ,  (2)

where W_V ∈ R^{d×d} is the value parameter matrix, W_QK (:= W_Q^⊤ W_K) ∈ R^{d×d} is the query-key parameter matrix (with joint parametrization), λ > 0 is the temperature (commonly λ = √d), and S : R^T → R^T is the softmax applied to each row. In this way, each input token in X^{ℓ−1} (embedded by W_V) is mixed by A^ℓ. Then, the transformer block (without layer normalization (Ba et al., 2016)) is defined as

Z^ℓ := U^ℓ + X^{ℓ−1},  H^ℓ := W_F2 σ(W_F1 Z^ℓ),  X^ℓ := H^ℓ + Z^ℓ,

where H^ℓ is a feed-forward network with parameters W_F1, W_F2 ∈ R^{d×d} and an (element-wise) activation σ : R → R. We omit the token embedding layer and set X^0 := X. There are two common variants of layer normalization positions, Post-LN (Vaswani et al., 2017) and Pre-LN (Xiong et al., 2020), which are applied token-wise after the residual connections (Z^ℓ and X^{ℓ+1}) and before the inputs (X^ℓ and Z^ℓ), respectively. The transformer block is stacked L times, and F(X) := X^L ∈ R^{d×T} is the output.

Learning task. We focus on causal language modeling, where a model predicts the next token given contextual tokens. Formally, given T contextual tokens X ∈ R^{d×T}, the prediction target is the (T+1)-th token y := x_{T+1} ∈ R^d. With the squared loss, the objective is

J(Θ) := (1/2) E ‖y − F(X)_T‖²,

where Θ := (W_V, W_QK, W_F1, W_F2) denotes the model parameter set, and the expectation is taken over input sequences (X, y). Here, the decoding procedure in consideration is to simply take the embedding of the last token, F(X)_T ∈ R^d. The parameters Θ are learned by minimizing J. Note that our analysis considers optimizing the query-key parameters jointly. Although such joint parametrization is less common in practice, it is convenient for the theoretical derivation of the gradients and has been used in several previous studies (Jelassi et al., 2022; Tian et al., 2023a). Interested readers may refer to a recent work revealing that the joint and separate QK-parametrizations lead to different implicit regularizations (Tarzanagh et al., 2023a).

Piecewise linear approximation of softmax. In this article, we choose to approximate the softmax function S by linearization. This is for the convenience of computing Gaussian moments while keeping the attention structure as close to the original softmax as possible in Section 4. For an input ω ∈ R^T, the softmax function is defined as

S(ω)_i := exp(ω_i) / Σ_{j∈[T]} exp(ω_j)  for all i ∈ [T].

For linearization, the first-order Taylor expansion of S(ω)_i around the origin, ⟨γ^i, ω⟩ + γ^i_0, is used, where

γ^i := ∇S(0)_i = (1/T) e_i − (1/T²) 1,  γ^i_0 := S(0)_i = 1/T.

Then, we approximate S by the piecewise linear function such that

S(ω)_i ≈ max{0, min{1, ⟨γ^i, ω⟩ + γ^i_0}} =: ⟨γ̃^i, ω⟩ + γ̃^i_0,

where (γ̃^i, γ̃^i_0) = (0, 0) if ⟨γ^i, ω⟩ + γ^i_0 < 0; (γ^i, γ^i_0) if ⟨γ^i, ω⟩ + γ^i_0 ∈ [0, 1]; and (0, 1) if ⟨γ^i, ω⟩ + γ^i_0 > 1. In vector form, the piecewise approximation S(ω) ≈ S̃(ω) is given by

S̃(ω) = Γ̃^⊤ ω + γ̃_0,  where Γ̃ := [γ̃^1 γ̃^2 . . . γ̃^T] and γ̃_0 := [γ̃^1_0, γ̃^2_0, . . . , γ̃^T_0]^⊤.

For notational simplicity, the column vectors of Γ̃ are exceptionally denoted by γ̃^i with superscripts, and the α-th element of γ̃^i is written as γ̃^i_α. The difference between S and S̃ is illustrated in Fig. 1.
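To make the setup above concrete, the following is a minimal NumPy sketch of the row-wise softmax, its clipped first-order Taylor surrogate S̃, and a single self-attention layer with joint QK-parametrization. The function names and the default temperature λ = √d are our own illustrative choices, not part of the paper.

```python
import numpy as np

def softmax_rows(M):
    """Row-wise softmax S, applied to each row of the score matrix."""
    M = M - M.max(axis=-1, keepdims=True)      # numerical stabilization
    E = np.exp(M)
    return E / E.sum(axis=-1, keepdims=True)

def piecewise_softmax(omega):
    """Piecewise-linear surrogate: clip the first-order Taylor expansion of
    softmax at the origin, <gamma^i, omega> + gamma^i_0, to [0, 1]."""
    T = omega.shape[0]
    Gamma = np.eye(T) / T - np.ones((T, T)) / T**2   # rows are gamma^i = e_i/T - 1/T^2
    gamma0 = 1.0 / T                                 # S(0)_i = 1/T
    return np.clip(Gamma @ omega + gamma0, 0.0, 1.0)

def self_attention(X, W_QK, W_V, lam=None):
    """Single-head self-attention with joint QK-parametrization, Eqs. (1)-(2):
    A = S(X^T W_QK X / lam) applied row-wise, U = W_V X A. Columns of X are tokens."""
    d = X.shape[0]
    lam = np.sqrt(d) if lam is None else lam
    A = softmax_rows(X.T @ W_QK @ X / lam)
    return W_V @ X @ A
```

For scores near the origin, where the Taylor expansion is taken, piecewise_softmax and softmax_rows agree closely on the same input; the clipping only activates once individual coordinates leave [0, 1].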
Note that a popular alternative to softmax, sparsemax (Martins & Astudillo, 2016), is also a piecewise linear function, although the functional form is slightly different from e S. Remark 1. Each (eγi, eγi 0) depends on the softmax input ω, unlike the Taylor coefficient (γi, γi 0) being independent from ω. This point matters particularly when we take expectations of terms involving (eγi, eγi 0). Remark 2. When T is sufficiently large, the coefficient vector γi = T 1ei + o(1). In this regime, γi α = o(1) for any α [T] \ {i}, so γi behaves like a selector of the i-th input. Hence, γi i = γi 0 = T 1 + o(1). 3. Signal propagation probability We analyze how much each token contributes to the learning dynamics. To this end, we formalize how much the signal of a specific input token xi propagates to the gradient J. Remark that this notion is slightly different from the contribution of an input token xi to the model output F(X)j analyzed by Kobayashi et al. (2023) recently. Uniform vs. localized softmax. The piecewise linear approximation implies that the i-th input signal is propagated to the subsequent blocks when γi, ω + γi 0 [0, 1]; otherwise, e S(ω)i = eγi, ω + eγi 0 = eγi 0, which hinders the input token xi from contributing to the self-attention layer (2). Thus, we will focus on the following quantity. Definition 1 (Signal propagation probability). Suppose that WQK is independent from X. For i [T], the signal propagation probability of the i-th token is defined as follows: ρi := P γi, ω + γi 0 [0, 1] , where ω := X WQKx T /λ and the randomness originates solely from the input tokens X. When only a few ρi are significantly larger than zero, we can interpret it as localized softmax; in this case, the selfattention (2) is dominated by a small number of tokens. By contrast, most of the tokens contribute to self-attention almost equally if ρi takes a similar value across different i; this situation is interpreted as uniform softmax. Through the lens of gradient. The signal propagation probability naturally arises in the gradient as a quantity characterizing how each token contributes to the gradient of the loss function J. Since the learning dynamics of causal language modeling is governed by the gradient flow of J, we can benefit from deriving the gradient of J to see how attention affects the learning dynamics. This is an important step to analyze the learning dynamics and implicit bias of self-attention layers, and we choose to measure the localization and uniformity through the signal propagation probability instead of other measures like the entropy. To keep the derivation concise, we consider a 1-layer transformer (where we drop the superscripts ℓ) without layer normalization and simplify the feed-forward net H by supposing the identity activation. With the approximated softmax e S, the transformer can be written as follows: F(X)T = WF{WVXe S(ω) + x T } = WF{WVXΓ ω + WVXeγ0 + x T }, where WF := WF2WF1 + I and ω := X WQKx T /λ. For this architecture, the QK-gradient is computed: WQKJ = λ 2 E[XΓPΓ X WQKx T x T ] + λ 1 E[XΓPeγ0x T ] + λ 1 E[XΓqx T ], (3) Self-attention Networks Localize When QK-eigenspectrum Concentrates 0.0 0.2 0.4 0.6 0.8 1.0 Relative token position Signal propagation prob. ( ) 0.0 0.2 0.4 0.6 0.8 1.0 Relative token position Signal propagation prob. ( ) 0.0 0.2 0.4 0.6 0.8 1.0 Relative token position Signal propagation prob. ( ) 0.0 0.2 0.4 0.6 0.8 1.0 Relative token position Signal propagation prob. 
( ) 0.0 0.2 0.4 0.6 0.8 1.0 Relative token position Signal propagation prob. ( ) 0.0 0.2 0.4 0.6 0.8 1.0 Relative token position Signal propagation prob. ( ) Figure 2: The theoretical plots of the signal propagation probability ρ(θ) with different ξ = tr(W)/ p tr(W2) and η = p tr(W2)/λ2. The vertical axes indicate relative token position θ = i/T (i: token index, T: number of tokens). Smaller θ close to zero and larger θ close to one correspond to early-site and late-site tokens, respectively. P := X W VW F WFWVX, q := X W VW F (WFx T y). When T is sufficiently large, we can drop asymptotically negligible terms with respect to T as detailed in Appendix B, and the QK-gradient (3) is simplified as follows: i,j,α,β [T ] E h eγi αeγj β(x i ˇPxj)(x β WQKx T )xαx T i , (4) where ˇP := W VW F WFWV. Now, in the gradient term (4), the summands with eγi i (introduced in Section 2) are asymptotically dominant over those with eγi α (with α = i) because eγi i = T 1 and eγi α = o(T 1). Additionally, the i-th signal propagates when eγi i > 0, which holds iff γi, ω + γi 0 [0, 1] by definition. Therefore, we are motivated to check the condition γi, ω + γi 0 [0, 1] to see whether xi contributes to the gradient (3). The signal propagation probability ρi characterizes its strength. Summary. In this section, we introduced the signal propagation probability ρi, which characterizes how likely a given token xi contributes to the learning dynamics. Specifically, eγi = 0 holds more likely with larger ρi, where xi contributes to the QK-gradient (3). Subsequently, we will analyze the quantity ρi to see the behavior of the probability vector ρ [0, 1]T . When does the mass of ρ concentrate to only a few tokens or scatter across most of the tokens? 4. When does attention localize? We derive the signal propagation probability ρi based on the following synthetic data model for the sake of clarity. Assumption 1 (Random walk). The tokens (xt)t 1 are generated by the following Gaussian random walk: x1 N(0, Σ), xt+1 N(xt, Σ). We discuss the validity of this assumption at the end of this section. To derive ρi, we resort to the Gaussian approximation of γi, ω + γi 0. Define µi := E[ γi, ω + γi 0] and vi := V[ γi, ω + γi 0]. We approximately suppose that γi, ω + γi 0 N(µi, vi). Then, the signal propagation probability is approximated: To leverage this formula, we derive µi and vi. Lemma 1. Suppose that WQK is symmetric and independent from X, and let W := WQKΣ. Under Assumption 1, for i [T], the mean µi and variance vi of γi, ω + γi 0 with the input ω := X WQKx T /λ are given as follows: Self-attention Networks Localize When QK-eigenspectrum Concentrates The proof is given in Appendix C. The symmetry of WQK is assumed for convenience. In the case of asymmetric WQK, we can redefine the signal propagation probability with the symmetrized matrix (WQK + W QK)/2. Recall that tr(W) = P i [d] wi if we write the eigenvalues of W by (w1, w2, . . . , wd). If W is real diagonalizable, tr(W2) = P i [d] w2 i holds, and tr(W)2 d tr(W2) follows due to Jensen s inequality. This implies that d tr(W2) tr(W) p d tr(W2). (5) Moreover, µi and vi are determined by the relative token location i/T. By continuously extending i/T to θ [0, 1], the signal propagation probability ρi can be extended to ρ : [0, 1] [0, 1], defined over relative token locations: ρ(θ) := Φ θ 1 η ; θ , (6) ξ := tr(W) p tr(W2) , η := Φ(z; θ) := 1 and the parameter ranges are ξ [ d] (due to the bound (5)) and η (0, ). 
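Since ρ(θ) above is obtained through a Gaussian approximation of ⟨γ^i, ω⟩ + γ^i_0, it can also be estimated directly from Definition 1 by Monte Carlo under the random walk of Assumption 1, with no closed form required. Below is a minimal sketch under exactly those two assumptions; the helper name, the temperature default λ = √d, and the sample sizes are our own choices.

```python
import numpy as np

def signal_propagation_prob(W_QK, Sigma, T=64, lam=None, n_seq=300, seed=0):
    """Monte-Carlo estimate of rho_i = P(<gamma^i, omega> + gamma^i_0 in [0, 1])
    (Definition 1) under the Gaussian random walk of Assumption 1."""
    rng = np.random.default_rng(seed)
    d = W_QK.shape[0]
    lam = np.sqrt(d) if lam is None else lam           # assumed temperature choice
    L = np.linalg.cholesky(Sigma)
    Gamma = np.eye(T) / T - np.ones((T, T)) / T**2     # rows are the Taylor coefficients gamma^i
    gamma0 = 1.0 / T
    hits = np.zeros(T)
    for _ in range(n_seq):
        increments = (L @ rng.standard_normal((d, T))).T   # i.i.d. N(0, Sigma) steps
        X = np.cumsum(increments, axis=0).T                # (d, T): columns are the walk x_1, ..., x_T
        omega = X.T @ W_QK @ X[:, -1] / lam                # softmax input for the last query token
        lin = Gamma @ omega + gamma0
        hits += ((lin >= 0.0) & (lin <= 1.0)).astype(float)
    return hits / n_seq                                    # one estimate of rho_i per token position
```

Sweeping the eigenvalue mean and scale used to build W_QK in this estimator is what Figure 5 reports, whereas the plots in Figure 2 correspond to the Gaussian-approximated ρ(θ).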
Here, ξ and η can be regarded independent (when W is independent from X) because the eigenspectrum scale tr(W2) can be modulated within the bound (5) once the eigenspectrum of W is given. Figure 2 numerically illustrates ρ(θ) with different ξ and η. From these figures, we obtain a couple of observations. Localization. ρ(θ) concentrates on fewer tokens as η increases (see |ξ| 5). By contrast, ρ(θ) behaves relatively uniformly regardless of η for small |ξ| 1. Late-/middle-/early-site focus. Focus on small η such as η = 0.001. As ξ increases to a large positive, ρ(θ) puts positive weights for only late-site tokens, i.e., θ > 0.5. By contrast, as ξ decreases to a negative, ρ(θ) focuses on early-site tokens, i.e., θ < 0.5. When η increases (see η 0.5), ρ(θ) localizes around θ = 0.5 with sufficiently large ξ (say, |ξ| 5), which indicates middle-site focus. Vanishing signal. As η increases, ρ(θ) degenerates to zero for any θ [0, 1] regardless of ξ. How ρ behaves at the limit. Subsequently, we claim the above observations formally, which is proven in Appendix C. Lemma 2. ρ(θ) satisfies the following properties. 1. (Late-/middle-site) As (ξ, η) ( , 0) with ξη r, 2 } if 0 r 2 1{ 1 r } if r > 2 . 0.0 0.2 0.4 0.6 0.8 1.0 Relative token position Signal propagation prob. ( ) ( , ) = (128, 0.01) 0.0 0.2 0.4 0.6 0.8 1.0 Relative token position Signal propagation prob. ( ) ( , ) = (512, 0.01) Figure 3: The theoretical plots of ρ(θ). For each ξ = 128, 512, the product value ξη = 1.28, 5.12, respectively. The latter is sufficiently larger than the localization threshold r = 2 and localized. 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 = XX 2 WQK 2 Attn. ent. lower bound T = 10 T = 20 T = 30 T = 40 Figure 4: Entropy lower bound (9) by Zhai et al. (2023). 2. (Early-/middle-site) As (ξ, η) ( , 0) with ξη r, 2 } if 2 r < 0 1{ 1 2 } if r < 2 . 3. (Uniformity) Fix η as a finite value. As |ξ| 0, |ρ (θ)| 0 for any θ [0, 1]. 4. (Vanishing signal) Fix ξ as a finite value. As η , ρ(θ) 0 for any θ [0, 1]. From late-/middle-/early-site focus in Lemma 2, we see interestingly that ρ(θ) localizes when ξη = tr(W)/λ asymptotically deviates from zero significantly so that |r| 2. At this limit, ρ(θ) concentrates on θ = 0.5, inducing the middle-site focus. Conversely, attention becomes relatively uniform when ξη = tr(W)/λ is kept close to zero. In Fig. 3, we numerically illustrate this regime: ρ localizes at the middle site when η = 0.02 (i.e., ξη = 5.12). Let us investigate the limiting condition of (ξ, η) for localization: When is (ξ, η) close to the limit ( , 0) while ξη r 2? Here, we focus on the eigenspectrum of W by regarding its eigenvalues (wi)i [d] as being sampled from a distribution with the mean tr(W) = P i [d] wi and scale tr(W2) = P i [d] w2 i (supposing that W is real diagonalizable). First, η 0 indicates that the scale tr(W2) should be close to zero. Next, ξη(= tr(W)/λ) r 2 means that tr(W) 2λ at the limit, i.e., tr(W) should be significantly away from zero. By combining them, we tell that ρ localizes when the eigenspectrum concentrates around a non-zero mean. This happens more likely when the embedding dimension d is excessively large to make Self-attention Networks Localize When QK-eigenspectrum Concentrates the eigenvalue sum tr(W) bounded away from zero while keeping the scale tr(W2) close to zero (i.e., keeping every eigenvalue close to zero). Thus, a larger embedding dimension d is beneficial to drive attention to localization. 
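The role of the embedding dimension d described above can be checked with a few lines of arithmetic: if every eigenvalue of W sits near c/d for a fixed c, then tr(W) = c stays bounded away from zero while tr(W²) ≈ c²/d vanishes, so ξ ≈ √d, η ≈ |c|/(λ√d), and ξη ≈ c/λ stays at the prescribed level. The sketch below illustrates this scaling, reading η as √tr(W²)/λ so that ξη = tr(W)/λ as in the text; the fixed temperature and the target value c are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.0                 # temperature held fixed here purely to isolate the role of d
c = 3.0 * lam             # target eigenvalue sum; |xi * eta| = |c|/lam >= 2 is the localization regime
for d in (8, 64, 512, 4096):
    # eigenvalues concentrated around the non-zero mean c/d with a tiny spread
    w = c / d + (1e-3 / d) * rng.standard_normal(d)
    trW, trW2 = w.sum(), (w**2).sum()
    xi = trW / np.sqrt(trW2)        # grows like sqrt(d)
    eta = np.sqrt(trW2) / lam       # shrinks like 1/sqrt(d)
    print(f"d={d:5d}  xi={xi:8.2f}  eta={eta:7.4f}  xi*eta={xi * eta:6.3f}")
```

With c/λ = 3, the product ξη stays above the localization threshold |r| = 2 of Lemma 2 while (ξ, η) approaches (∞, 0) as d grows, which is exactly the regime identified above.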
Next, from the claim of uniformity in Lemma 2, we tell that ρ fluctuates less and less with ξ closer to zero. Hence, ρ(θ) takes a similar value across different token positions θ in this limit. When tr(W) 0, ρ attains this limit. Summary. Wrapping up this section, we obtain answers to (Q1) posed in Section 1 under the random walk model. A1: When does attention localize? ρ localizes when tr(W2) is close to zero while | tr(W)| is significantly bounded away from zero, i.e., W-eigenspectrum concentrates to a non-zero mean. ρ is uniform when tr(W) is close to zero while tr(W2) remains finite, i.e., W-eigenspectrum has the zero mean and a finite variance. ρ degenerates to zero uniformly when tr(W2) is sufficiently large, i.e., W-eigenspectrum has an infinitely large variance. Remark 3 (Validity of random walk). The analysis results in this section largely owe to the random walk assumption in Assumption 1. This is reminiscent of the random walk model of word vectors proposed by Arora et al. (2016): they suppose that a discourse vector that governs the latent topic in a sentence evolves along a random walk, and observed word vectors are emitted depending on their affinity to the discourse vector. Under this model, they showed that the PMI (pointwise mutual information) between two words is close to their scaled inner product (Arora et al., 2016, Theorem 2.2). Hence, the random walk model may be semantically plausible in such a geometrical sense. Their proof essentially benefits from the random walk to claim that two discourse vectors are independent (Arora et al., 2016, Eq. (2.11)), for which a sufficient tiny drift order O(d 1/2) matters. That said, we suppose the Gaussianity in addition to their random walk. We exploit this to approximately get the closed form of ρ(θ) (6), which is a limitation of our analysis. 5. Are different collapse regimes reconcilable? We discuss the results of our analysis in Section 4 and the previous arguments related to attention uniformity. Connection to rank collapse. Dong et al. (2021) showed that self-attention blocks Uℓ(see Eq. (2)) converges to a rank-1 matrix z1 (for some z) with L without skip connections or feed-forward blocks, which is called rank collapse.1 They argued the importance of avoiding rank collapse for better expressivity because each token embedding in a rank-1 self-attention block degenerates to the same. Hence, the rank collapse is related to the failure mode attributed to the uniformity after mixing key tokens by attention, which is slightly different from what we are concerned about how each token contributes during mixing by attention (through the gradient, as discussed in Section 3). Nonetheless, the uniformity of Dong et al. (2021) can be connected to the perspective of the W-eigenspectrum. Dong et al. (2021, Theorem 2.2) proved that the convergence rate to a rank-1 matrix slows down when the matrix ℓ1-norm WQK 1, i.e., the ℓ1-norm of the vectorized matrix, is large. Because we can draw the following connection between WQK 1 and | tr(W)|: d Σ 2 WQK 2 WQK F WQK 1, (7) where the first inequality is due to the bound (5) and the Cauchy Schwarz inequality, it is sufficient (but not necessary!) to increase | tr(W)| under fixed tr(W2) to mitigate the rank collapse. This is equivalent to reducing the eigenspectrum variance d2V[wi] = d2(E[w2 i ] E[wi]2) = d tr(W2) | tr(W)|2. (8) Hence, minimizing the W-eigenspectrum variance leads to better expressivity. Connection to entropy collapse. Zhai et al. 
(2023) introduced a concept called entropy collapse, in which the average Shannon entropy of the columns of the attention matrix Aℓ(see Eq. (1)) shrinks. Intuitively speaking, low attention entropy induces localized attention. This notion of localization is akin to ours because the attention entropy measures the uniformity the attention is applied to input tokens during mixing. They empirically observed that the training loss falls into a plateau with low attention entropy, which causes training instability of transformers, and hence advocate for keeping attention less peaked during training. In Zhai et al. (2023, Theorem 3.1), the attention entropy is asymptotically lower-bounded for large T by ln(1 + T exp( ν)) + ν exp( ν/2) T 1 + exp( ν), (9) where ν := XX 2 WQK 2. This lower bound is unimodal in ν and vanishes at WQK 2 (see Fig. 4), so the attention entropy tends to be higher when WQK 2 is small. If | tr(W)| is not too small, WQK 2 is lowerbounded (see Eq. (7)) and the attention entropy may be kept 1Note that our matrix notation is different from the one used in Dong et al. (2021) so that we chose to let each column of X store a token, whereas they let each row of X store a token. Self-attention Networks Localize When QK-eigenspectrum Concentrates 10 20 30 40 Token index i Eigenvalue scale=0.05 Mean=0.04 Mean=0.1 Mean=0.25 10 20 30 40 Token index i Eigenvalue scale=0.45 Mean=0.04 Mean=0.1 Mean=0.25 10 20 30 40 Token index i Eigenvalue scale=4.0 Mean=0.04 Mean=0.1 Mean=0.25 0.04 0.065 0.1 0.16 0.25 Mean of eigvals Scale of eigvals 4.13 3.42 2.65 2.00 1.37 2.53 2.09 1.99 1.54 1.11 0.94 0.73 0.78 0.80 0.60 0.27 0.28 0.25 0.24 0.21 0.07 0.07 0.07 0.07 0.08 Attention entropy 10 20 30 40 Token index i Eigenvalue scale=0.05 Mean=0.04 Mean=0.1 Mean=0.25 10 20 30 40 Token index i Eigenvalue scale=0.45 Mean=0.04 Mean=0.1 Mean=0.25 10 20 30 40 Token index i Eigenvalue scale=4.0 Mean=0.04 Mean=0.1 Mean=0.25 0.04 0.065 0.1 0.16 0.25 Mean of eigvals Scale of eigvals 2.22 1.69 1.21 0.70 0.41 1.11 0.90 0.74 0.57 0.35 0.34 0.32 0.31 0.24 0.23 0.10 0.09 0.11 0.09 0.08 0.03 0.04 0.03 0.03 0.03 Attention entropy Figure 5: Simulated signal propagation probability. In the top and bottom rows, the results for the isotropic and anisotropic covariances (the details in the text) are shown, respectively. (Left) Signal propagation probability ρi computed over repeatedly sampled 300 random walks (Assumption 1) with 40 tokens. For each line, WQK (d = 128) is sampled 10 times with the corresponding mean and scale of the eigenvalue distribution, and the averaged ρi is denoted by the bold line. (Right) The attention entropy (Zhai et al., 2023) is computed for WQK with different eigenvalue mean-scale pairs. close to the peak of the lower bound (9). To mitigate the entropy collapse, it is sufficient (but not necessary!) to decrease tr(W2) under fixed tr(W) (which is equivalent to minimizing the eigenspectrum variance by Eq. (8)) because of the bound tr(W2) = Σ 1 F W F WQK F WQK 2, (10) where the first inequality is due to the Cauchy Schwarz inequality. Hence, minimizing the W-eigenspectrum variance helps the model to avoid the entropy collapse. Rank collapse vs. entropy collapse. At first sight, the two notions of collapse seem to contradict each other because avoiding the rank collapse leads to diverse token embeddings, whereas avoiding the entropy collapse leads to a uniform token mixer. 
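Both sides of this apparent tension are expressed through two scalar diagnostics: the attention entropy of Zhai et al. (2023) and the eigenspectrum variance of Eq. (8). Below is a small sketch of how one might compute them; the row-averaging convention and helper names are our assumptions, and the variance identity requires W to be real-diagonalizable, as assumed earlier.

```python
import numpy as np

def attention_entropy(A, eps=1e-12):
    """Average Shannon entropy of the attention distributions in A.
    A is (T, T) and stochastic along the averaged axis (here: rows)."""
    A = np.clip(A, eps, 1.0)
    return float(-(A * np.log(A)).sum(axis=-1).mean())

def eigenspectrum_variance(W_QK, Sigma):
    """d^2 times the variance of the eigenvalues of W = W_QK Sigma, i.e. Eq. (8):
    d^2 V[w_i] = d tr(W^2) - tr(W)^2 (valid when W is real-diagonalizable)."""
    W = W_QK @ Sigma
    d = W.shape[0]
    return d * np.trace(W @ W) - np.trace(W) ** 2
```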
Indeed, the matrix ℓ1-norm (that decreases under the rank collapse) and the spectral norm (that increases under the entropy collapse) are equivalent norms, and the two modes appear to be incompatible. However, as we discussed above, this trade-off is reconcilable from the viewpoint of the W-eigenspectrum. Setting the eigenspectrum mean to be bounded away from zero, we can avoid the rank collapse owing to the bound (7). Under a fixed eigenspectrum mean, minimizing the eigenspectrum scale (equivalently, minimizing its variance (8)) leads to high attention entropy due to the bound (10) and the unimodal shape of the entropy lower bound (9). This variance minimization is nothing else but the condition of attention localization. Eventually, ρ(θ) localizes and attends to spe- cific sites of tokens, as we showed in Lemma 2. Hence, the signal propagation probability offers us a better view of localization. Let us summarize our second take-home. A2: How does attention localization impact? Better expressivity: If | tr(W)| is maximized for a fixed tr(W2), the convergence to the rank collapse becomes slow. High attention entropy: If tr(W2) is minimized for a fixed | tr(W)| bounded away from zero, the attention entropy is increased. Both of the above are attributed to minimizing the W-eigenspectrum variance (8). Finally, let us reiterate that the minimization of the eigenspectrum variance is merely a sufficient but not necessary condition of the aforementioned better expressivity and high attention entropy. The apparent contradiction between the rank and entropy collapses could be reconciled by other mechanisms in practice. The aim of this work is to elucidate the connection between the collapse phenomena and the eigenspectrum. Numerical simulation. To see the relationship between the eigenspectrum, ρ, and attention entropy, we simulated the signal propagation probability ρi using synthesized random walks following Assumption 1 with the isotropic Σ = I and anisotropic Σ. To obtain an anisotropic Σ, we first sampled R Rd d from element-wise Unif( 2.5, 2.5) and computed Σ = R R/d. We sampled 300 sequences Self-attention Networks Localize When QK-eigenspectrum Concentrates 0 100 200 300 400 500 Iter [ 102] (A) Eig. spectr. mean tr(WQK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (B) Eig. spectr. scale tr(W2 QK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (C) Attn. entropy (val, 1 = 100) 100 200 300 400 500 Iter [ 102] (D) Perplexity (val, 1 = 100) 430 435 440 445 Figure 6: Experimental results of language modeling (Wiki Text-2) with d = 128 and 1-layer transformers, fixed κ1 = 100, and varying regularization intensity κ2. With stronger κ2, the eigenspectrum scale shrinks (B), the attention entropy increases (C), and the perplexity improves (D). The rightmost figure magnifies the x-range, where the perplexity attains the minimum. 0 5 10 15 20 25 30 35 Token index With Loc Ate R (κ1 = 100, κ2 = 1.0) 0 5 10 15 20 25 30 35 Iter [1.2 105] Token index No Loc Ate R Figure 7: The signal propagation probabilities are shown at each iteration over 50000 iterations. (Top) LOCATER with κ1 = 100 and κ2 = 1. A couple of light and dark horizontal stripes correspond to the attention localization. (Bottom) No LOCATER. Overall, the signal propagation probability is uniform at each time. with 40 tokens, and obtained WQK by generating 128 eigenvalues (wi)i [d] from N(mean, scale2) and composed with a sampled orthogonal basis matrix B, by the eigendecomposition formula WQK = B diag((wi)i)B . 
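A compact sketch of the W_QK construction just described: draw d eigenvalues with a prescribed mean and scale and rotate them by a random orthogonal basis B. Using scipy's ortho_group to sample B is our choice and is not specified in the paper; the mean-scale grid below is the one reported in Figure 5.

```python
import numpy as np
from scipy.stats import ortho_group  # Haar-random orthogonal basis B (our choice)

def sample_WQK(d, mean, scale, rng=None):
    """W_QK = B diag(w) B^T with eigenvalues w_i ~ N(mean, scale^2),
    mirroring the Figure 5 simulation."""
    rng = np.random.default_rng(rng)
    w = rng.normal(mean, scale, size=d)
    B = ortho_group.rvs(d, random_state=rng)
    return B @ np.diag(w) @ B.T

# Sweep over the eigenvalue mean-scale grid of Figure 5 (d = 128); the rho and
# entropy estimators sketched earlier can then be evaluated on random walks
# generated with these matrices.
for scale in (0.05, 0.45, 4.0):
    for mean in (0.04, 0.1, 0.25):
        W_QK = sample_WQK(128, mean, scale, rng=0)
        print(f"scale={scale}, mean={mean}, "
              f"tr={np.trace(W_QK):.2f}, tr(W^2)={np.trace(W_QK @ W_QK):.2f}")
```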
The signal propagation probability was averaged over 300 sequences. Figure 5 shows the averaged ρi with different mean-scale pairs and the corresponding attention entropy in the rightmost figure. As seen, ρi localizes with smaller scales and larger means, which is consistent with the conclusion in Section 4. This trend supports the validity of WQKeigenspectrum as a proxy to W-eigenspectrum. Moreover, we observe that WQK-eigenspectrum with a fixed mean and scale leads to higher attention entropy. 6. Intervening attention localization To empirically see the impact of localization on the model performance, we propose a method to control the degree of attention localization. We focus on the eigenspectrum of WQK instead of W = WQKΣ because Σ does not change during training, and the numerical simulation showed that WQK behaves as a reasonable proxy to W (see Section 5). We minimize the loss function J while minimizing the eigenspectrum scale and maintaining the mean to a fixed level: n J(Θ) + κ1 tr(W QKWQK) + κ2(tr(WQK) 1)2o , where κ1, κ2 > 0 are the regularization strengths. Here, we allow WQK to be asymmetric, and the eigenspectrum scale is represented by tr(W QKWQK). The regularization terms can be optimized fairly easily thanks to the following derivative formulae (for WQK = W QWK): WQ tr(W QKWQK) = 2WKW KWQ, WK tr(W QKWQK) = 2WQW QWK, WQ[(tr(WQK) 1)2]=[tr(WQK) 1] tr(WQK)WK, WK[(tr(WQK) 1)2]=[tr(WQK) 1] tr(WQK)WQ. Since this whole objective drives the eigenspectrum scale to a small value, the signal vanishing can be avoided automatically. We call this regularization scheme LOCATER (LOCalized ATt Ention Regularization). 7. Experiments We aim to observe the correlation between the eigenspectrum and localization. To this end, we train transformers with LOCATER and varying κ1, κ2, and see how the model performances and attention foci change over time. Setup. We used fairseq v0.12.2 (Ott et al., 2019), which is a toolkit oriented for sequence modeling, to implement and train transformers. The basic training scheme was inherited from fairseq-cli/train.py. The model is a 1-layer transformer with a single-head self-attention and Post-LN (default), and the input embedding dimension, attention embedding dimension, and feed-forward net embedding dimension are set to 128 altogether (namely, d = 128).2 Note that we use the standard softmax attention in the experiments, although our theoretical analysis was conducted with its piecewise approximation. Input data 2The experimental results with deeper transformers are shown in Appendix D. The overall trends remain alike. Self-attention Networks Localize When QK-eigenspectrum Concentrates were transformed into 64 tokens (namely, T = 64) with batch size 64. The optimizer is Adam (Kingma & Ba, 2015) with default parameters and no clip norm, and the weight decay with 0.01 is used. The learning rate is fixed to 2.5 10 5 without any scheduling. The FP16 quantizer was applied to reduce memory usage. All the other configs remain to be the same as the default in fairseq-cli/train.py. Under this config, we updated the model with 50000 iters. Language modeling. We conducted the language modeling task. The dataset we used is Wiki Text-2 (Merity et al., 2016), which is a collection of high-quality Wikipedia articles. We conduced the experiments with fixed scale regularization strength κ1 = 100 and varying mean regularization strength κ2 from 0, 10 3, 10 2, . . . , 100. The results are shown in Fig. 
6, in which stronger regularizers tend to make the eigenspectrum scale smaller. This, in turn, maintains the attention entropy higher during the updates entirely, and eventually, the model achieves better perplexity. While the better model performance with higher attention entropy has already been observed by Zhai et al. (2023), we also showed that a smaller eigenspectrum scale contributes to higher attention entropy. This empirically corroborates that attention localization leads to better model performance, probably because the attention mechanism appropriately selects relevant tokens during training. Figure 7 shows the signal propagation probability at each training iteration. We compute the signal propagation probability of token i by counting the frequency of γi, ω +γi 0 [0, 1] in a given batch. LOCATER entails salient horizontal stripes, each corresponding to attended tokens. Yet, the stripes do not appear in bulk as we analyzed in Section 4 because our synthetic data model in Assumption 1 does not perfectly align with real datasets. Nevertheless, our experiments evidently contrast the localized and uniformed attention depending on the eigenspectrum scale because no salient stripes are observed without LOCATER. In Figs. 6 and 7, we observe different learning phases for the first 104 and the rest iters. Indeed, Tian et al. (2023b) observed similar phenomena and explained that it is due to the different convergence speeds between attention weights corresponding to informative and non-informative tokens. The relationship between the WQK-eigenspectrum and this dynamics is beyond our scope and left for future work. 8. Conclusion and limitation We revealed that attention localizes when the eigenspectrum of W concentrates to a non-zero mean, or equivalently, with larger eigenspectrum mean tr(W) and smaller scale tr(W2). Based on it, LOCATER was proposed to shrink the scale tr(W2 QK) while maintaining the mean tr(WQK). Interestingly, maximizing the scale is related to mitigating both rank collapse and entropy collapse, and hence, the two apparently contradictory failure modes can be reconciled. The experiments on a real-world dataset corroborate it, though the random walk model is not perfectly satisfied. We recognize three limitations of this work. First, we rely on the strong random walk model. Although the Gaussianity may be reasonable because of usual initialization schemes of transformer embedding layers, it is interesting to consider an alternative model to capture token correlations better. Second, the formal analysis is mainly restricted to 1-layer transformers. Recent studies often consider gradient explosion in the large-depth limit from the viewpoint of layer normalization (Xiong et al., 2020; Takase et al., 2023b) and initialization (Bachlechner et al., 2021; Takase et al., 2023a). It must be fruitful to integrate these perspectives to our gradient analysis through Eq. (3). Third, why attention localization leads to better model performance still remains elusive. Whereas localization is related to avoiding rank collapse (and hence higher model expressivity), we need additional effort to fully understand the mechanism. Last but not least, let us mention concurrent work on understanding the attention mechanism. Geshkovski et al. (2023) argued that a trained multi-layer self-attention network exhibits a layer-wise dynamics similar to the Kuramoto model and token embeddings converge to a few leader tokens depending on the structure determined by the self-attention parameter matrices. 
Li et al. (2024) proved that the learning dynamics of a 1-layer self-attention network yields a querykey parameter matrix capturing the token-pair frequencies. Together with our work, it has become an interesting direction to study the implicit bias of attention through parameter eigenspectra. Impact Statement Transformer architectures have prevailed in recent years, yet the internal functionality has not been transparently understood. Our work reconciles the apparently contradictory two collapse modes, rank collapse and entropy collapse, with a new perspective of the parameter eigenspectrum. We believe this result pushes forward the understanding of transformers. Acknowledgments A part of the experiments of this research was conducted using Wisteria/Aquarius in the Information Technology Center, the University of Tokyo. Arora, S., Li, Y., Liang, Y., Ma, T., and Risteski, A. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Lin- Self-attention Networks Localize When QK-eigenspectrum Concentrates guistics, 4:385 399, 2016. Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. ar Xiv preprint ar Xiv:1607.06450, 2016. Bachlechner, T., Majumder, B. P., Mao, H., Cottrell, G., and Mc Auley, J. Re Zero is all you need: Fast convergence at large depth. In Uncertainty in Artificial Intelligence, pp. 1352 1361. PMLR, 2021. Bietti, A., Cabannes, V., Bouchacourt, D., Jegou, H., and Bottou, L. Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 36, 2023. Brookes, M. The matrix reference manual. http://www.ee.ic.ac.uk/hp/staff/dmb/ matrix/intro.html, 1998. [Online; accessed 01-September-2023]. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877 1901, 2020. Dong, Y., Cordonnier, J.-B., and Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In Proceedings of the 38th International Conference on Machine Learning, pp. 2793 2803. PMLR, 2021. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, 2021. Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. The emergence of clusters in self-attention dynamics. Advances in Neural Information Processing Systems, 36, 2023. Graves, A. Generating sequences with recurrent neural networks. ar Xiv preprint ar Xiv:1308.0850, 2013. Jelassi, S., Sander, M., and Li, Y. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35:37822 37836, 2022. Kharitonov, E. and Chaabouni, R. What they do when in doubt: a study of inductive biases in seq2seq learners. In Proceedings of the 9th International Conference on Learning Representations, 2021. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015. Kobayashi, G., Kuribayashi, T., Yokoi, S., and Inui, K. Analyzing feed-forward blocks in transformers through the lens of attention map. ar Xiv preprint ar Xiv:2302.00456, 2023. Li, Y., Huang, Y., Ildiz, M. E., Rawat, A. S., and Oymak, S. 
Mechanics of next token prediction with self-attention. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics, pp. 685 693. PMLR, 2024. Likhomanenko, T., Xu, Q., Kahn, J., Synnaeve, G., and Collobert, R. slim IPL: Language-model-free iterative pseudo-labeling. In Proc. Interspeech 2021, pp. 741 745, 2021. Martins, A. and Astudillo, R. From softmax to sparsemax: A sparse model of attention and multi-label classification. In Proceedings of the 34th International Conference on Machine Learning, pp. 1614 1623. PMLR, 2016. Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359 17372, 2022. Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. ar Xiv preprint ar Xiv:1609.07843, 2016. Murty, S., Sharma, P., Andreas, J., and Manning, C. Grokking of hierarchical structure in vanilla transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 439 448. Association for Computational Linguistics, 2023a. Murty, S., Sharma, P., Andreas, J., and Manning, C. D. Characterizing intrinsic compositionality in transformers with tree projections. In Proceedings of the 11th International Conference on Learning Representations, 2023b. Noci, L., Anagnostidis, S., Biggio, L., Orvieto, A., Singh, S. P., and Lucchi, A. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. Advances in Neural Information Processing Systems, 35: 27198 27211, 2022. Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp. 48 53, 2019. Takase, S., Kiyono, S., Kobayashi, S., and Suzuki, J. Spike no more: Stabilizing the pre-training of large language models. ar Xiv preprint ar Xiv:2312.16903, 2023a. Self-attention Networks Localize When QK-eigenspectrum Concentrates Takase, S., Kiyono, S., Kobayashi, S., and Suzuki, J. B2T connection: Serving stability and performance in deep transformers. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 3078 3095, 2023b. Tarzanagh, D. A., Li, Y., Thrampoulidis, C., and Oymak, S. Transformers as support vector machines. ar Xiv preprint ar Xiv:2308.16898, 2023a. Tarzanagh, D. A., Li, Y., Zhang, X., and Oymak, S. Maxmargin token selection in attention mechanism. Advances in Neural Information Processing Systems, 36, 2023b. Tian, Y., Wang, Y., Chen, B., and Du, S. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer. Advances in Neural Information Processing Systems, 36, 2023a. Tian, Y., Wang, Y., Zhang, Z., Chen, B., and Du, S. Jo MA: Demystifying multilayer transformers via JOint Dynamics of MLP and Attention. ar Xiv preprint ar Xiv:2310.00535, 2023b. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, pp. 10347 10357. PMLR, 2021. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30:6000 6010, 2017. Xie, S. 
M., Raghunathan, A., Liang, P., and Ma, T. An explanation of in-context learning as implicit Bayesian inference. In Proceedings of the 10th International Conference on Learning Representations, 2022. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, pp. 10524 10533. PMLR, 2020. Zhai, S., Likhomanenko, T., Littwin, E., Busbridge, D., Ramapuram, J., Zhang, Y., Gu, J., and Susskind, J. M. Stabilizing transformer training by preventing attention entropy collapse. In Proceedings of the 40th International Conference on Machine Learning, pp. 40770 40803. PMLR, 2023. Self-attention Networks Localize When QK-eigenspectrum Concentrates A. Helper lemmas Lemma 3. Let W Rd d be a symmetric matrix. Fix a, µ Rd and Σ Rd d be a covariance matrix. For x N(m, Σ), the following moment formulae hold: E[x Wx] = tr(WΣ) + m Wm (11) E[xx ] = Σ + mm (12) E[a Wxx Wx] = 2a WΣWm + a Wm{tr(WΣ) + m Wm} (13) E[x Wxx Wx] = 2 tr(WΣWΣ) + tr(WΣ)2 + 4m WΣWm + 2 tr(WΣ)m Wm + m Wmm Wm (14) The formulae in Lemma 3 are standard and cropped from Brookes (1998). Lemma 4. Let W Rd d be a symmetric matrix. For i j, suppose that xi, xj follow Assumption 1. Then, the following formulae hold: E[x i Wxi] = (i 1) tr(WΣ) (15) E[x i Wxix i Wxi] = (i2 2i + 2){2 tr(WΣWΣ) + tr(WΣ)2} (16) E[x i Wxjx j Wxi] = (i2 + ij 3i j + 4) tr(WΣWΣ) + (i2 2i + 2) tr(WΣ)2 (17) E[x i Wxjx j Wxj] = (ij i j + 2){2 tr(WΣWΣ) + tr(WΣ)2} (18) The formulae in Lemma 4 can be shown by recursively applying Lemma 3. We omit the proofs since they are elementary. B. Omitted derivations B.1. QK-gradient Here, we complement the derivations of the QK-gradient terms shown in Section 3. To get Eq. (3), we compute WQKJ: 2 E[ WQK y F(X)T 2] 2 E[ WQK{F(X) T F(X)T 2y F(X)T }] 2 E[ω ΓX W VW F WFWVXΓ ω + 2(WVXeγ0 + x T ) W F WFWVXΓ ω 2y WFWVXΓ ω] = 1 2λ2 E[x T W QKXΓPΓ X WQKx T ] + 1 λ E[(eγ0) PΓ X WQKx T ] + 1 λ E[q Γ X WQKx T ] λ2 E[XΓPΓ X WQKx T x T ] + 1 λ E[XΓPeγ0x T ] + 1 λ E[XΓqx T ]. By expanding the first term of WQKJ, we get the following: i,j,α,β [T ] E h eγi αeγj β(x i ˇPxj)(x β WQKx T )xαx T i , (19) where ˇP := W VW F WFWV. Similarly, by expanding the second and third terms of WQKJ, we get the following terms, respectively: i,α,β [T ] E h eγi αeγβ 0 (x i ˇPxβ)xαx T i , (20) i,α [T ] E eγi α{x i W VW F (WFx T y)}xαx T . (21) Self-attention Networks Localize When QK-eigenspectrum Concentrates To get Eq. (19), we expand the first term of WQKJ: E[XΓPΓ X WQKx T x T ] = E[XΓX W VW F WFWVXΓ X WQKx T x T ] = E {(WFWVX)(XΓ) } {(WFWVX)(XΓ) }WQKx T x T i [T ] (Xeγi)(WFWVxi) X j [T ] (WFWVxj)(Xeγj) WQKx T x T i,j [T ] (Xeγi){(WFWVxi) (WFWVxj)}(Xeγj) WQKx T x T i,j [T ] x i ˇPxj eγi αeγj βxαx β WQKx T x T i,j,α,β [T ] E[eγi αeγj β(x i ˇPxj)(x β WQKx T )xαx T ]. To get Eq. (20), we expand the second term of WQKJ: E[XΓPeγ0x T ] = E[XΓX W VW F WFWVXeγ0x T ] = E[{(WFWVX)(XΓ) } {(WFWVX)(eγ0x T )}] i [T ] (Xeγi)(WFWVxi) X β [T ] (WFWVxβ)(eγβ 0 x T ) i,β [T ] (Xeγi){(WFWVxi) (WFWVxβ)}(eγβ 0 x T ) i,β [T ] x i ˇPxβ eγi αeγβ 0 xαx T i,α,β [T ] E[eγi αeγβ 0 (x i ˇPxβ)xαx T ]. To get Eq. (21), we expand the third term of WQKJ: E[XΓqx T ] = E[XΓX W VW F (WFx T y)x T ] = E[{(WFWVX)(XΓ) } (WFx T y)x T ] i [T ] (Xeγi)(WFWVxi) (WFx T y)x T i,α [T ] eγi αxαx i W VW F (WFx T y)x T i,α [T ] E[eγi α{x i W VW F (WFx T y)}xαx T ]. B.2. 
Order evaluation of QK-gradient The orders of the QK-gradient terms (19), (20), and (21) are evaluated. In this subsection, we assume that the covariance matrix in Assumption 1 is Σ = I for simplicity. The following evaluation still applies with minor modifications for a general Self-attention Networks Localize When QK-eigenspectrum Concentrates Σ. For Eq. (19), the Cauchy Schwarz inequality implies that Eq. (19) = X i,j,α,β E[eγi αeγj β(x i ˇPxj)(x β WQKx T )xαx T ] i,j,α,β E[(eγi αeγj β)2] i,j,α,β E[(x i ˇPxj)2] i,j,α,β E[(x β WQKx T )2] i,j,α,β E[(xαx T ) 2] For (A), we have eγi α, eγj β T 1 by definition, and hence (A) = O(1). For (B), by using Eq. (17), (B) = T 2 X i,j E[x i ˇPxjx j ˇPxi] = T 2 X i,j {(i2 + ij 3i j + 4) tr(ˇP2) + (i2 2i + 2) tr(ˇP)2} = O(T 6). For (C), by following the same computation as Eq. (22), (C) = T 3 X β E[x β WQKx T x T WQKxβ] β {(β2 + (T 3)β (T 4)) tr(W2 QK) + (β2 2β + 2) tr(WQK)2} = O(T 6). For (D), its (i, j)-element can be evaluated as follows (no matter whether i = j or not): (D)ij = T 3 X α E[x2 α,ix2 T,j] = T 3 X α E[x2 α,i{(T α) + x2 α,j}] = O(T 5) X α E[x2 α,i] + T 3 X α E[x2 α,ix2 α,j] = O(T 6). By plugging them back, we now confirmed that |Eq. (19)| = O(T 8). The orders of Eqs. (20) and (21) can be evaluated similarly and the detailed evaluations are omitted. |Eq. (20)|2 = X i,α,β E[eγi αeγβ 0 (x i ˇPxβ)xαx T ] i,α,β E[(eγi αeγβ 0 )2] X i,α,β E[(x i ˇPxβ)2] X i,α,β E[(xαx T ) 2] = O(T) O(T 4) O(T 5) = |Eq. (20)| = O(T 5). |Eq. (21)|2 = X i,α E[eγi α{x i W VW F (WFx T y)}xαx T ] i,α E[(eγi α)2] X i,α E[(x i W VW F (WFx T y))2] X i,α E[(xαx T ) 2] = O(1) O(T 4) O(T 4) = |Eq. (21)| = O(T 4). Hence, we have |Eq. (19)| = O(T 8), |Eq. (20)| = O(T 5), and |Eq. (21)| = O(T 4), which implies that the QK-gradient (3) is asymptotically dominated by Eq. (19). Self-attention Networks Localize When QK-eigenspectrum Concentrates Lemma 1. Suppose that WQK is symmetric and independent from X, and let W := WQKΣ. Under Assumption 1, for i [T], the mean µi and variance vi of γi, ω + γi 0 with the input ω := X WQKx T /λ are given as follows: Proof. To derive the mean, we use Eq. (15). µi = 1 λT E[x i WQKx T ] 1 λT 2 X j [T ] E[x j WQKx T ] + o(1) j [T ](j 1) λT 2 tr(W) + o(1) Note that γi 0 = o(1). To derive the variance, we first compute E[x i WQKx T x j WQKx T ] (for i j T). E[x i WQKx T x j WQKx T ] = E[x i WQK(x T x T )WQKxj] = E[x i WQK{(T j)Σ + xjx j }WQKxj] = (T j) E[x i WQKΣWQKxj] + E[x i WQKxjx j WQKxj] = (T j)(i 1) tr(W2) + (ij i j + 2){2 tr(W2) + tr(W)2} = (ij + (T 2)i j (T 4)) tr(W2) + (ij i j + 2) tr(W)2, where Eq. (11) is used recursively at the second identity and Eqs. (15) and (18) are used at the fourth identity. Then, the expectation of the squared term is expanded: E[ γi, X WQKx T 2] T x i WQKx T 1 j [T ] x j WQKx T T 2 x i WQKx T x i WQKx T 2 j [T ] x i WQKx T x j WQKx T + 1 j,j [T ] x j WQKx T x j WQKx T T 2 E[x i WQKx T x i WQKx T ] | {z } (A) T 3 2 E[x i WQKx T x i WQKx T ] | {z } (B1) j>i E[x i WQKx T x j WQKx T ] | {z } (B2) ji {(ij + (T 2)i j (T 4)) tr(W2) + (ij i j + 2) tr(W)2} = (T 2i 2Ti2 i3) tr(W2) + (T 2i i3) tr(W)2 + o(T 3), j 2 . 2. (Early-/middle-site) As (ξ, η) ( , 0) with ξη r, 2 } if 2 r < 0 1{ 1 2 } if r < 2 . Self-attention Networks Localize When QK-eigenspectrum Concentrates 3. (Uniformity) Fix η as a finite value. As |ξ| 0, |ρ (θ)| 0 for any θ [0, 1]. 4. (Vanishing signal) Fix ξ as a finite value. As η , ρ(θ) 0 for any θ [0, 1]. Proof. 
To see 1: We first see that as ξ , 1 2 if θ > 1 2 0 if θ = 1 In addition, as ξ and η 0 with ξη r [0, 2], By combining them, ρ(θ) 1{θ 1 2 } at the limit. If r > 2, r 0 if θ = 1 r 1 2 if θ > 1 and ρ(θ) 1{ 1 r } at the limit. We can see 2 in the same way as 1. To see 3: First, compute ρ (θ) by using d dzerf(z) = 2 π exp( z2): ρ (θ) = 1 π exp ((θ 1 " 1 π exp ((θ 1 3 (2(2θ2 + 7 12))3/2 1 π exp ! 4θ2 θ + 5 η (2(2θ2 + 7 By noting that 0 < exp( z2) 1, |ρ (θ)| |ξ| π 3 (2(2θ2 + 7 12))3/2 4θ2 θ + 5 η (2(2θ2 + 7 = |ξ| π 1 (2(2θ2 + 7 0 as |ξ| 0. To see 4: For finite ξ, lim η Φ θ 1 which indicates that ρ(θ) 0 at the limit η . D. Additional experiments Here, we show additional results of the language modeling task with 1-/3-/6-layer transformers with different embedding dimensions d = 32, 128. For d = 128, the configurations remain the same except for the number of decoder layers as in Section 7. For d = 32, we used the learning rate 0.0001 (instead of 0.000025 used for d = 128), and the other configurations remain the same. The results are shown in Fig. 8 (d = 32, 1-layers), Fig. 9 (d = 32, 3-layers), Fig. 10 (d = 32, 6-layers), Self-attention Networks Localize When QK-eigenspectrum Concentrates 0 100 200 300 400 500 Iter [ 102] 6 (A) Eig. spectr. mean tr(WQK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (B) Eig. spectr. scale tr(W2 QK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (C) Attn. entropy (val, 1 = 100) 100 200 300 400 500 Iter [ 102] (D) Perplexity (val, 1 = 100) 465 470 475 480 Figure 8: Experimental results of language modeling (Wiki Text-2) with d = 32 with 1-layers transformers, fixed κ1 = 100, and varying regularization intensity κ2. With stronger κ2, the eigenspectrum scale shrinks (B), the attention entropy increases (C), and the perplexity improves (D). 0 100 200 300 400 500 Iter [ 102] (A) Eig. spectr. mean tr(WQK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (B) Eig. spectr. scale tr(W2 QK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (C) Attn. entropy (val, 1 = 100) 100 200 300 400 500 Iter [ 102] (D) Perplexity (val, 1 = 100) 465 470 475 480 Figure 9: Experimental results of language modeling (Wiki Text-2) with d = 32 with 3-layers transformers, fixed κ1 = 100, and varying regularization intensity κ2. With stronger κ2, the eigenspectrum scale shrinks (B), the attention entropy increases (C), and the perplexity improves (D). Self-attention Networks Localize When QK-eigenspectrum Concentrates 0 100 200 300 400 500 Iter [ 102] (A) Eig. spectr. mean tr(WQK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (B) Eig. spectr. scale tr(W2 QK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (C) Attn. entropy (val, 1 = 100) 100 200 300 400 500 Iter [ 102] (D) Perplexity (val, 1 = 100) 470 475 480 485 Figure 10: Experimental results of language modeling (Wiki Text-2) with d = 32 with 6-layers transformers, fixed κ1 = 100, and varying regularization intensity κ2. With stronger κ2, the eigenspectrum scale shrinks (B), the attention entropy increases (C), and the perplexity improves (D). 0 100 200 300 400 500 Iter [ 102] (A) Eig. spectr. mean tr(WQK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (B) Eig. spectr. scale tr(W2 QK) (train, 1 = 100) 0 100 200 300 400 500 Iter [ 102] (C) Attn. entropy (val, 1 = 100) 100 200 300 400 500 Iter [ 102] (D) Perplexity (val, 1 = 100) 455 460 465 470 Figure 11: Experimental results of language modeling (Wiki Text-2) with d = 128 with 3-layers transformers, fixed κ1 = 100, and varying regularization intensity κ2. 
With stronger κ2, the eigenspectrum scale shrinks (B), the attention entropy increases (C), and the perplexity improves (D).

Figure 12: Experimental results of language modeling (WikiText-2) with d = 128 and 6-layer transformers, fixed κ1 = 100, and varying regularization intensity κ2. With stronger κ2, the eigenspectrum scale shrinks (B), the attention entropy increases (C), and the perplexity improves (D).

Fig. 11 (d = 128, 3-layers), and Fig. 12 (d = 128, 6-layers). Unlike the 1-layer case, we monitored the statistics of W_QK in the first layer only. The overall trends are quite similar to the case of 1-layer transformers with d = 128 shown in Fig. 6: as κ2 increases, the eigenspectrum scale decreases, the attention entropy increases, and eventually, the perplexity improves (namely, decreases).