The emergence of clusters in self-attention dynamics

Borjan Geshkovski (MIT, borjan@mit.edu), Cyril Letrouit (MIT, letrouit@mit.edu), Yury Polyanskiy (MIT, yp@mit.edu), Philippe Rigollet (MIT, rigollet@mit.edu)

Abstract. Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time-dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Cluster locations are determined by the initial tokens, confirming the context-awareness of representations learned by Transformers. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP+17] that leaders appear in a sequence of tokens when processed by Transformers.

1 Introduction

The introduction of Transformers in 2017 [VSP+17] marked a turning point in the AI revolution, powering breakthroughs in natural language modeling and computer vision. With remarkable empirical success, Transformers enable large language models to compute very powerful representations using the self-attention mechanism. Yet little is known about the geometric structure of these representations. As the size of these models grows at an astonishing rate, the need to understand their inner workings is becoming a pressing scientific challenge. In this work, we make a first step in this direction by describing the geometry of learned representations.

To provide a transparent presentation of our findings, we take a leaf out of the literature on continuous-time dynamics such as neural ordinary differential equations (ODEs) [CRBD18, Wei17, HR17]. By viewing layers as a time variable, this formalism has emerged as a flexible mathematical framework to implement and study ResNets [HZRS16a] as particular discrete-time versions of a parametrized dynamics of the form

$$\dot{x}(t) = f_\theta(x(t)), \quad t \in [0, T].$$

Here $\theta$ is the trained parameter of a neural network and $f_\theta$ is characterized by the precise architecture of the ResNet.[1] In turn, an input (e.g., an image) $x(0) \in \mathbb{R}^d$ is mapped to its representation $x(T)$.

Unlike neural ODEs and ResNets, the representation map of Transformers is not solely a function of an individual input $x(0) \in \mathbb{R}^d$ but rather of a set/sequence $(x_1(0), \ldots, x_n(0))$ of $n \geq 1$ $d$-dimensional tokens. These tokens then evolve in time by interacting with each other per the self-attention mechanism. Namely, following [SABP22], we view tokens as particles, and the transformer dynamics as an interacting particle system of the form

$$\dot{x}_i(t) = \sum_{j=1}^{n} P_{ij}(t)\, V x_j(t), \quad t \in [0, \infty), \qquad (1)$$

for any $i \in [n]$, where the $P_{ij}(t)$ are the entries of an $n \times n$ stochastic matrix $P(t)$, given by

$$P_{ij}(t) := \frac{e^{\langle Q x_i(t),\, K x_j(t)\rangle}}{\sum_{\ell=1}^{n} e^{\langle Q x_i(t),\, K x_\ell(t)\rangle}}, \quad (i, j) \in [n]^2. \qquad (2)$$

Here the matrices $Q$ (Query), $K$ (Key), and $V$ (Value) are learned from data. Note that $Q$ and $K$ need not be square. The $n \times n$ matrix $P(t)$ is called the self-attention matrix.

[1] A classical choice is $\theta = (W, A, b) \in \mathbb{R}^{d\times d} \times \mathbb{R}^{d\times d} \times \mathbb{R}^d$ and $f_\theta(x) = W\sigma(Ax + b)$, where $\sigma$ is an elementwise nonlinearity such as the ReLU [HZRS16b].

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
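For readers who prefer code, here is a minimal NumPy sketch of the dynamics (1)-(2), integrated with a forward Euler step (the discretization spelled out in Remark 1 below). The step size, horizon, and the random initial tokens and weights are illustrative choices, not values taken from the paper.

```python
# Minimal sketch of the self-attention dynamics (1)-(2); forward Euler in time.
# All numerical choices (dt, T, random initialization) are illustrative.
import numpy as np

def attention_matrix(X, Q, K):
    """Row-stochastic self-attention matrix P of (2) for tokens X (n x d)."""
    logits = (X @ Q.T) @ (X @ K.T).T                 # logits[i, j] = <Q x_i, K x_j>
    logits -= logits.max(axis=1, keepdims=True)      # stabilize the softmax
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

def simulate(X0, Q, K, V, T=10.0, dt=0.01):
    """Integrate dx_i/dt = sum_j P_ij(t) V x_j(t) with an explicit Euler step."""
    X = X0.copy()
    for _ in range(int(T / dt)):
        P = attention_matrix(X, Q, K)
        X = X + dt * (P @ X) @ V.T                   # rows: sum_j P_ij V x_j
    return X, attention_matrix(X, Q, K)

rng = np.random.default_rng(0)
n, d = 40, 2
X0 = rng.standard_normal((n, d))                     # initial sequence of tokens
Q, K, V = np.eye(d), np.eye(d), np.eye(d)
X_T, P_T = simulate(X0, Q, K, V)
```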
The term "attention" stems precisely from the fact that $P_{ij}(t)$ captures the attention given by token $i$ to token $j$ relative to all tokens $\ell \in [n]$. The matrices $Q$ and $K$ in (2) warp the geometry of the input tokens, so that a trained attention matrix contains weights which indicate semantic relations between words. Such conclusions have been drawn in the context of language processing tasks in [VSP+17, Figures 3-5].

Our goal is to showcase the fact that self-attention, which is itself the core novelty of Transformers, entails a clustering effect. To that end, we focus on the pure self-attention dynamics described in (1). In particular, we do not model variations such as multiple heads, feed-forward layers, and layer normalization that are typically adjoined to the self-attention dynamics (1). However, on this last point, we note that our theoretical findings indicate that without any normalization, the dynamics (1) can diverge in some (or even all) directions over time. We leave these additional questions for future research; see Section 6.

1.1 Organization of the paper and summary of contributions

Figure 1: For $V = I_3$, tokens cluster toward the vertices of a convex polytope (Theorem 3.1).

The goal of this paper is to characterize the clustered representations of a trained Transformer by studying the asymptotic behavior of a sequence of tokens $(x_1(t), \ldots, x_n(t))$ as they evolve through the layers of a transformer architecture using the dynamics (1). In this setup, a Transformer is completely described by the weight matrices $(Q, K, V)$ obtained during training. Note that we assume that these three matrices are time-independent. While this assumption is motivated by mathematical convenience, it is worth noting that such weight-sharing scenarios are in fact used in practice (see, e.g., ALBERT [LCG+20]) as they drastically reduce the number of parameters of a network.

With parameters $(Q, K, V)$ fixed, tokens are subject to collective dynamics that we call transformer dynamics. While these dynamics are reminiscent of existing models for opinion dynamics and flocking, they present their own mathematical challenges, requiring ad-hoc tools to study their asymptotic behavior. The main conclusion of our analysis is that the set of tokens $\{x_1(t), \ldots, x_n(t)\}$, appropriately rescaled, tends to a clustered configuration as $t \to \infty$. Our theoretical findings justify the empirical observation made in [VSP+17] that leaders appear in a sequence of tokens when processed by Transformers. We now list our main contributions.

(i) As a warm-up to the geometric characterization of the limits of sequences of tokens, we show in Section 2 that when $d = 1$ and $V > 0$, the self-attention matrix $P(t)$ converges to a low-rank matrix with entries 0 and 1 as $t \to \infty$, thus revealing the emergence of a small number of leaders that drive the transformer dynamics. The restriction $d = 1$ follows from technical considerations, and some pathological phenomena may occur in higher dimensions (see Remark 5). But numerical experiments (as well as past empirical work) indicate that the result may extend to higher dimensions for almost all initial sequences of tokens.

(ii) In Section 3 we first focus on the case $V = I_d$ as a natural canonical choice that enables us to establish some of the main tools of the paper. We introduce a time rescaling, reminiscent of the layer-normalization heuristic, to alleviate the possible divergence of tokens. We show that along this scale the tokens converge to the boundary of a convex polytope.
For almost all initial sequences they even converge to the vertices of the polytope, the number of which is significantly smaller than $n$. This elucidates the clustering phenomenon. (See Fig. 1.) When $V = -I_d$, all tokens following the dynamics (1) collapse to 0.

(iii) We build on these results and in Section 4 consider the case wherein $V$ is only assumed to have a simple and positive leading eigenvalue. This setting is much closer to reality and corresponds to actual learned matrices $V$ (see Figure 10). We show that along this particular timescale, tokens cluster toward one of at most three hyperplanes, which are determined by the corresponding eigenvector.

(iv) In Section 5 we complete the results of Sections 3 and 4 by addressing the case where the leading eigenvalue has multiplicity. This results in clustering toward the vertices of a convex polytope in some directions, and a linear subspace in the others.

(v) We also prove the global existence and uniqueness of solutions of all dynamics considered in this work (including the mean-field limit). We refer the reader to Appendix A for more details.

We also observed numerically that our conclusions extend to more compound architectures (see Conjecture 4.2, Section 6 and Appendix F).

| Value | Key and Query | Limit geometry | Reference |
| --- | --- | --- | --- |
| $V = I_d$ | $Q^\top K > 0$ | vertices of a convex polytope | Theorem 3.1 |
| $\lambda_1(V) > 0$, simple | $\langle Q\varphi_1, K\varphi_1\rangle > 0$ | union of 3 parallel hyperplanes | Theorem 4.1 |
| $V$ paranormal | $Q^\top K > 0$ | polytope $\times$ subspaces | Theorem 5.1 |
| $V = -I_d$ | $Q^\top K = I_d$ | single cluster at the origin | Theorem C.5 |

Table 1: Summary of the clustering results of this work. All results except for the case $V = -I_d$ hold for the time-rescaled dynamics (4).

Remark 1 (Discrete time). While we focus on the idealized setting of self-attention dynamics in continuous time, this is solely done for convenience, and all of our methods are straightforwardly applicable to the discrete-time setting. (See also Remark 4.) The discrete-time analog of (1) with time-step $\Delta t > 0$ (equal to 1 in practice) is simply the forward Euler iteration

$$x_i((k+1)\Delta t) = x_i(k\Delta t) + \Delta t \sum_{j=1}^{n} \frac{e^{\langle Q x_i(k\Delta t),\, K x_j(k\Delta t)\rangle}}{\sum_{\ell=1}^{n} e^{\langle Q x_i(k\Delta t),\, K x_\ell(k\Delta t)\rangle}}\, V x_j(k\Delta t), \quad k \in \mathbb{N}. \qquad (3)$$

Notation. We denote by $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ the Euclidean dot product and norm respectively, and we use the shorthand $[n] := \{1, \ldots, n\}$. For any matrix $M \in \mathbb{R}^{d\times d}$, we order its eigenvalues (repeated according to multiplicity) by decreasing modulus: $|\lambda_1(M)| \geq \ldots \geq |\lambda_d(M)|$. We denote by $\|M\|_{\mathrm{op}}$ the $\ell^2$ operator norm of $M$, equal to its largest singular value. Given a set $S \subset \mathbb{R}^d$, we define the distance of a point $x \in \mathbb{R}^d$ to $S$ as $\mathrm{dist}(x, S) := \inf_{s\in S}\|x - s\|$, and we denote by $\mathrm{conv}(S)$ the convex hull of $S$.

Related work

Analysis of attention-based models. Given the widespread use of Transformers in natural language processing, there has been a surge of interest in understanding the function and significance of attention layers within these models. In [YBR+20], the authors show that when treated as discrete-time systems with additional dense layers and multiple heads appended to the core attention mechanism, Transformers exhibit the universal approximation property. In [LLH+20], the authors present, to the best of our knowledge, the first interacting particle systems perspective on Transformers.
They then leverage the similarities between Transformers (with an additional feed-forward layer compared to (1)) and convection-diffusion equations to slightly improve the performance of Transformers by employing a Strang-Marchuk splitting scheme for time discretization. In [SABP22], the authors interpret the system (1) as the characteristics of a continuity equation. Drawing on the similarities between (1) and Sinkhorn iterations, they propose a novel architecture dubbed Sinkformer, which possesses the desirable property of being a Wasserstein gradient flow.

Quadratic complexity of Transformers. The major computational challenge of Transformers is their high computational complexity, particularly when processing long sequences. Transformers require quadratic time and space complexity to process sequences, because each self-attention layer contains $n^2$ products of the form $\langle Qx_i, Kx_j\rangle$ (for $i, j \in [n]$). The empirical observation that the self-attention matrix $P$ is close to a low-rank matrix (see [LWLQ22, Section 4.4] for references) is cited as the inspiration behind Linformers [WLK+20] and the fine-tuning algorithm LoRA [HSW+22]. For both approaches, the low-rank structure is imposed rather than extracted from $P$ itself. Other methods, called sparse attention and block attention, have been proposed to reduce the quadratic complexity; see [WLK+20, Section 2.2] for references. In the spirit of these works, a foreshadowing of the clustering mechanism was invoked in [VKF20], where queries are clustered into groups, again in view of reducing the quadratic complexity of self-attention. We point out that [DCL21] previously demonstrated that without skip connections, the dynamics trivializes and all tokens quickly lump together into a single tight cluster. Our work, in contrast, shows that in the presence of skip connections a rich cluster structure emerges.

Compared to the usual BERT, ALBERT [LCG+20] uses parameter-sharing across layers, meaning that the weight matrices $Q, K, V$ in (1)-(2) do not depend on time, as in the present paper. This does not reduce the theoretical $O(n^2)$ complexity of the original Transformer, but, quoting [LCG+20], it "significantly reduce[s] the number of parameters for BERT without seriously hurting performance, thus improving parameter-efficiency. An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster. The parameter reduction techniques also act as a form of regularization that stabilizes the training and helps with generalization".

Neural collapse. Our results and conclusions bear a resemblance to some geometric aspects of neural collapse for classification tasks [PHD20]. A key geometric aspect of neural collapse is the observation that, during the training of deep neural networks, the representations of different classes in the later layers of the network tend to form tight clusters around the vertices of a simplex. The emergence of a simplex structure in the representation space provides insights into how the neural network organizes and separates the different classes.

Clustering in interacting particle systems. The transformer dynamics (1) have a strong connection to the vast literature on nonlinear systems arising in the modeling of opinion dynamics and flocking phenomena. In addition to the classical Kuramoto model describing synchronization/clustering of oscillators [Kur75, ABV+05], the model which is most similar to (1) is the Krause model [Kra00]

$$\dot{x}_i(t) = \sum_{j=1}^{n} a_{ij}\,(x_j(t) - x_i(t)), \qquad a_{ij} = \frac{\phi(\|x_i - x_j\|^2)}{\sum_{k=1}^{n} \phi(\|x_i - x_k\|^2)},$$
which is non-symmetric in general ($a_{ij} \neq a_{ji}$), much like (1). When $\phi$ is compactly supported, it has been shown in [JM14] that the particles $x_i(t)$ assemble in several clusters as $t \to \infty$. Other models of opinion dynamics and flocking have been proposed and studied, among which the Vicsek model [VCBJ+95], the Hegselmann-Krause model [HK02] and the Cucker-Smale model [CS07]. These models may also exhibit a clustering behavior under various assumptions (see [MT14, CHH+16, HKPZ19] and the references therein). The transformer dynamics are also closely related to the dynamics employed in mean-shift clustering [Che95], and this work indirectly sheds some light on its theoretical properties.

The analysis of transformer dynamics presents unique mathematical challenges that cannot be addressed using the tools developed for these more primitive models. In particular, our work demonstrates how different choices for the parameters lead to remarkably diverse clustering patterns. Much more remains to be discovered, and this work is a first attempt at a rigorous mathematical analysis of these synthetic dynamics.

2 Asymptotic low-rankness of the self-attention matrix

As mentioned in Section 1.1, numerical experiments in [WLK+20] show that the self-attention matrix $P$, defined in (2), has an almost low-rank structure. This observation has then been leveraged to reduce the quadratic complexity in the sequence length $n$ which is inherent to Transformers, resulting in a non-negligible decrease in the cost of training. As a warm-up to deriving complete geometric representations of the dynamics, our first result shows, in the simple 1-d case, that $P(t)$ indeed converges exponentially fast toward a matrix which is typically both Boolean and low-rank (see Fig. 3). Although there are clear obstructions to a rigorous extension of this result to higher dimensions (Remark 5), numerical experiments appear to show that this result holds in greater generality, for almost all initial sequences (Appendix F).

$$\begin{pmatrix} 1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 0 \\ * & * & \cdots & * \\ 0 & \cdots & 0 & 1 \\ \vdots & & \vdots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix}$$

Figure 2: Elements in $\mathcal{P}^*$, where the $P_{\sigma_i} \in \mathbb{R}^{n\times n}$ are permutation matrices and the asterisks denote arbitrary non-negative reals which add up to 1.

To set this up, we introduce the set $\mathcal{P}^*$ of $n \times n$ matrices having the form illustrated in Fig. 2, where the asterisks denote arbitrary non-negative real numbers which add up to 1. The row of asterisks may actually be any row between the first and the last one.

Theorem 2.1 (Self-attention matrix converges to a low-rank Boolean matrix). Let $d = 1$. Suppose that the scalars $(Q, K, V)$ satisfy $V > 0$ and $QK > 0$. For any initial sequence of pairwise distinct tokens $(x_1(0), \ldots, x_n(0)) \in \mathbb{R}^n$, there exists some $P^* \in \mathcal{P}^*$ such that the self-attention matrix $P(t)$ defined in (2) converges to $P^*$ as $t \to \infty$.

The proof may be found in Appendix B. The rate of convergence toward $P^*$ is in fact doubly exponential in $t$ for the coefficients outside the row of asterisks in Fig. 2. The proof of the theorem also reveals that for almost all initial sequences of pairwise distinct tokens, $P^*$ is actually of rank 1 or 2, i.e., the row of asterisks is equal to either $e_1 = (1, 0, \ldots, 0)$ or $e_n = (0, \ldots, 0, 1)$. The interpretation of Theorem 2.1 is that in the 1-d case, at most three tokens capture the attention of all tokens except at most one. Typically, these leading tokens are those carrying the largest amount of information. This is also illustrated in Fig. 4. Since the tokens $x_i$ here evolve on $\mathbb{R}$, the right-most and left-most ones (which typically tend toward $\pm\infty$) capture the attention of all the others.
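Theorem 2.1 is easy to probe numerically. The sketch below reuses the `simulate` helper from the sketch in Section 1, draws pairwise-distinct scalar tokens, runs the 1-d dynamics with $Q = K = V = 1$, and measures how close the final attention matrix is to a Boolean, low-rank matrix; the horizon and tolerances are illustrative choices.

```python
# Probing Theorem 2.1 numerically (d = 1); assumes `simulate` from the sketch above.
import numpy as np

rng = np.random.default_rng(1)
n = 40
x0 = rng.standard_normal((n, 1))        # pairwise distinct almost surely
one = np.array([[1.0]])                 # Q = K = V = 1, so V > 0 and QK > 0
x_T, P_T = simulate(x0, one, one, one, T=10.0, dt=0.01)

frac_boolean = np.mean(np.minimum(P_T, 1.0 - P_T) < 1e-3)   # fraction of near-0/1 entries
num_rank = np.linalg.matrix_rank(P_T, tol=1e-6)
print(f"near-Boolean entries: {100 * frac_boolean:.1f}%, numerical rank: {num_rank}")
# In runs like this the numerical rank typically drops to 2, matching Fig. 3
# and the structure of the limits P* described in Fig. 2.
```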
Figure 3 (panels at t = 0.0, 3.0, 5.0, 10.0, with numerical ranks 11, 23, 14, 2): An illustration of the asymptotics of $P(t)$ entailed by Theorem 2.1 for $n = 40$ tokens, with $Q = K = 1$ and $V = 1$. (See Appendix F for details on computing.) Increasing $n$ has no effect on this behavior of $P(t)$; see Fig. 11.

Figure 4 (panels at t = 0.0, 2.0, 9.0): The clouds $\{K x_i(t)\}_{i\in[20]}$ (green) and $\{Q x_j(t)\}_{j\in[20]}$ (purple) for $d = 2$, where pairs of points across the two clouds are connected by a line of width equal to $P_{ij}(t)$. Here $V > 0$ and $Q > 0$ are random matrices and $K = I_2$. The creation of clusters is reflected by the rank-2 structure of the self-attention matrix $P(t)$. This interaction echoes findings illustrated in the original paper [VSP+17], for instance Figures 3-5 therein.

3 Clustering toward vertices of convex polytopes

In the rest of the paper, we seek to taxonomize various clustering results for the solutions to (4) as $t \to \infty$, depending on the sign and the multiplicity of the eigenvalues of $V$. We begin by focusing on what may appear to be the most natural[2] case $V = I_d$, as is also done in [SABP22]. In fact, as we demonstrate (theoretically and numerically) later on, clustering is a generic phenomenon which holds under much less restrictive assumptions.

The transformer dynamics considered in (1) do not contain a layer normalization mechanism typically encountered in practice [VSP+17]. In the absence of such a device, tokens may diverge to infinity, as in Theorem 2.1. In fact, the norm of the tokens $x_i(t)$ typically diverges exponentially toward $+\infty$ for any $d$: this is expected, by analogy with the non-trivial solutions to $\dot{y}(t) = y(t)$. To remedy this situation, we take inspiration from the solution $y(t) = e^{tV} y(0)$ to $\dot{y}(t) = V y(t)$. Namely, for any $i \in [n]$ we consider the rescaled tokens $z_i(t) := e^{-tV} x_i(t)$, which solve

$$\dot{z}_i(t) = \sum_{j=1}^{n} \frac{e^{\langle Q e^{tV} z_i(t),\, K e^{tV} z_j(t)\rangle}}{\sum_{k=1}^{n} e^{\langle Q e^{tV} z_i(t),\, K e^{tV} z_k(t)\rangle}}\, V\, (z_j(t) - z_i(t)) \quad \text{for } t \in [0, \infty). \qquad (4)$$

The initial condition remains the same: $x_i(0) = z_i(0)$ for any $i \in [n]$. More importantly, the coefficients of the self-attention matrix for the rescaled tokens $z_i(t)$ are the same as those for the original tokens $x_i(t)$. Whence, the conclusion of Theorem 2.1 also applies to the dynamics (4). We see this rescaling of tokens as a mathematically justified surrogate for layer normalization. The appearance of the exponential factor within the self-attention kernel facilitates the analysis of (4) compared to (1), and it is in fact instrumental in the proofs of all results that follow. Each result on the rescaled tokens $z_i(t)$ then gives information on the dynamics of the original tokens $x_i(t)$ by virtue of the relation $x_i(t) = e^{tV} z_i(t)$.

We are now able to state the main result of this section on the case $V = I_d$. The following theorem shows that the tokens $z_i(t)$ evolving per the dynamics (4) converge to the boundary of a convex polytope as $t \to \infty$. We present here a simplified but weaker version of our result for convenience, and refer the reader to Theorem C.1 in the appendix for a complete statement.

Theorem 3.1 (Convergence to points on the boundary of a convex polytope). Suppose $V = I_d$ and $Q^\top K > 0$. Then, for any initial sequence of tokens $\{z_i(0)\}_{i\in[n]} \subset \mathbb{R}^d$, there exists a convex polytope $\mathcal{K} \subset \mathbb{R}^d$ such that for any $i \in [n]$, $z_i(t)$ converges either to 0 or to some point on $\partial\mathcal{K}$ as $t \to \infty$.
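Theorem 3.1 can likewise be probed by integrating the rescaled dynamics (4) directly: for $V = I_d$ we have $e^{tV} = e^t I_d$, so the attention logits are simply $e^{2t}\langle Q z_i, K z_j\rangle$. The sketch below uses a forward Euler discretization and then counts how many distinct limit points the tokens approach; the horizon, step size, and clustering tolerance are illustrative choices, not the paper's experimental setup.

```python
# Sketch of the rescaled dynamics (4) with V = Id and Q = K = Id (so Q^T K > 0).
import numpy as np

def simulate_rescaled_identity(Z0, Q, K, T=10.0, dt=0.005):
    """Euler integration of (4) with V = Id: dz_i/dt = sum_j P_ij (z_j - z_i)."""
    Z, t = Z0.copy(), 0.0
    for _ in range(int(T / dt)):
        logits = np.exp(2.0 * t) * (Z @ Q.T) @ (Z @ K.T).T   # <Q e^t z_i, K e^t z_j>
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        Z = Z + dt * (P @ Z - Z)       # rows of P sum to 1, hence sum_j P_ij (z_j - z_i)
        t += dt
    return Z

rng = np.random.default_rng(2)
n, d = 40, 3
Z_T = simulate_rescaled_identity(rng.standard_normal((n, d)), np.eye(d), np.eye(d))

# Greedy clustering of the final positions: tokens within `tol` of an existing
# representative are merged, so len(reps) approximates the number of limit points.
tol, reps = 5e-2, []
for z in Z_T:
    if not any(np.linalg.norm(z - r) < tol for r in reps):
        reps.append(z)
print(f"{n} tokens end up near {len(reps)} distinct points")
# Typically far fewer than n, consistent with convergence to polytope vertices (cf. Fig. 5).
```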
The convex polytope $\mathcal{K}$ is completely determined by the initial sequence of tokens and by $Q^\top K$ (refer to Claim 1). Numerical experiments (e.g., Fig. 5) also lead us to claim that for almost all initial sequences of tokens, one should expect convergence of $z_i(t)$ ($i \in [n]$) toward some vertex of $\mathcal{K}$. (Furthermore, the number of vertices of $\mathcal{K}$ is often found to be significantly smaller than $n$.) It may however happen that for initial sequences taken in some null set (not seen when tokens are drawn at random) some tokens converge to other points of the boundary $\partial\mathcal{K}$, namely in the interior of facets. On the other hand, for generic choices of initial sequences, we do not see a way to predict $\mathcal{K}$ explicitly besides running the full dynamics.

Figure 5: A toy example illustrating Theorem 3.1 with $n = 40$ tokens in $\mathbb{R}^3$. Here $Q = K = I_3$. The tokens converge to one of the vertices (leaders) of the limiting convex polytope. Recall that the points $x_i(t) = e^t z_i(t)$ (when $V = I_d$) follow the original dynamics (1).

Akin to Theorem 2.1, this result also shows the emergence of a set of leaders (given by the vertices of $\mathcal{K}$) attracting all tokens as $t$ grows. It has been experimentally observed (first in [VSP+17]) that in trained Transformers, tokens focus their attention on local leaders in a way that seems to reproduce the syntactic and semantic structure of sentences.

[2] Note that the case $V = -I_d$ may appear equally natural. For such a choice of $V$, we show in Appendix C.2 that the dynamics converge to a single cluster located at the origin. Multiplicative constants preserving the sign, i.e., $V = \pm c\, I_d$ with $c > 0$, trivially yield the same conclusions.

The proof of Theorem 3.1 is postponed to Appendix C, and amounts to a couple of effects entailed by the dynamics. First of all, the convex hull of the particles shrinks over time (Proposition C.2). This is due to the fact that the distance of the particle nearest to any half-space (not containing the particles) increases with time. On the other hand, the convex hull ought not collapse, since particles which have not concentrated near the boundary of the limiting polytope will continue to increase in magnitude until they themselves reach this boundary (Step 2 in the proof). This occurs due to the time rescaling.

Remark 2. Assuming $Q^\top K > 0$ does not seem to be essential for our conclusions; instead, it guides the direction of the proof. To emphasize the broader validity of our conclusion beyond this specific assumption, we conducted additional experiments (refer to Section F.6) which suggest that Theorem 3.1 (as well as Theorems 4.1 and 5.1 stated below) holds in more generality.

Remark 3 (Rate of convergence). Although Theorem 3.1 (as well as Theorems 4.1 and 5.1 stated below) does not specify a rate of convergence toward $\partial\mathcal{K}$, we expect (and observe through numerics) that convergence happens very quickly: after a few layers, most tokens are already clustered. What "few layers" means here necessarily depends on the typical modulus of the initial tokens, since the dynamics (1) is not invariant under multiplication of all initial conditions by a fixed real number.

Remark 4 (Discrete time). As alluded to in Remark 1, all our results extend to the discrete-time Transformers (3).
Indeed, just as in the continuous-time case, there is a natural rescaled dynamics, which is the discrete analogue of (4): if we set $R = I_d + \Delta t\, V$ and assume that $R$ is invertible (which is the case for sufficiently small $\Delta t$), then $z_i(k\Delta t) := R^{-k} x_i(k\Delta t) =: z_i^{[k]}$ satisfies

$$z_i^{[k+1]} = z_i^{[k]} + \Delta t \sum_{j=1}^{n} \frac{e^{\langle Q R^k z_i^{[k]},\, K R^k z_j^{[k]}\rangle}}{\sum_{\ell=1}^{n} e^{\langle Q R^k z_i^{[k]},\, K R^k z_\ell^{[k]}\rangle}}\, R^{-1} V\, (z_j^{[k]} - z_i^{[k]}).$$

The proofs of Theorems 2.1, C.5, 3.1, 4.1, and 5.1 carry through with straightforward modifications. Let us provide some comments on the proof of Theorem 3.1 in the discrete-time setting, for the sake of completeness. First of all, Proposition C.2 holds intuitively because for all integers $i \in [n]$ and $k \geq 1$, $z_i^{[k]} \in \mathrm{conv}\{z_j^{[k-1]} : j \in [n]\}$. We then define the candidate set of limit points as in (37), and Claim 1 holds without any change in the statement or in the proof. Then, just as in Steps 2 and 3 in the proof of Theorem C.1, we can first show that if $z_i^{[k]}$ is not already near some point in the candidate limit set, it will keep moving toward the boundary of the convex polytope. Finally, we can prove that tokens cannot circulate indefinitely between different points on the boundary. The combination of these arguments would establish the convergence of each token toward some point in the set given by (37).

4 Clustering toward hyperplanes

While $V = I_d$, considered in the previous section, is a natural example, value matrices found empirically are much more general. We now turn our attention to a significantly more general setting of value matrices, which we formalize as follows.

Definition 1. We call $(Q, K, V)$ a good triple if the two following conditions are satisfied:
- the eigenvalue of $V$ with largest modulus is real, positive, and simple; namely, $\lambda_1(V) > |\lambda_2(V)| \geq \ldots \geq |\lambda_d(V)|$;
- $\langle Q\varphi_1, K\varphi_1\rangle > 0$ for any nonzero $\varphi_1 \in \mathbb{R}^d$ lying on the line $\ker(V - \lambda_1(V)\, I_d)$.

The second condition simply states that the quadratic form $x \mapsto \langle Qx, Kx\rangle$ is positive definite along the eigenspace associated to the leading eigenvalue of $V$. Note also that if all entries of $V$ are positive, the first condition is automatically satisfied by virtue of the Perron-Frobenius theorem. In fact, this assumption is generic. On the one hand, it is satisfied by some pre-trained value matrices for ALBERT (Figure 10). On the other hand, numerical experiments indicate that a constant fraction (about 14%) of matrices drawn from the real Ginibre ensemble in dimension $d = 128$ satisfy it; this proportion is known to vanish as $d \to \infty$, albeit very slowly [RS14].

Our clustering result in the setting of good triples can be summarized as follows: the coordinate $\langle z_i(t), \varphi_1/\|\varphi_1\|\rangle$ of any token $z_i(t)$ along the eigenspace spanned by $\varphi_1$ converges, as $t \to \infty$, toward one of at most 3 real scalars. Consequently, all the tokens $z_i(t)$ converge toward one of at most three parallel hyperplanes; see Fig. 6 for an illustration.

Theorem 4.1 (Convergence toward 3 hyperplanes). Assume that $(Q, K, V)$ is a good triple in the sense of Definition 1. Then, for any initial sequence of tokens $\{z_i(0)\}_{i\in[n]} \subset \mathbb{R}^d$, there exist at most three parallel hyperplanes in $\mathbb{R}^d$ such that for any $i \in [n]$, the distance of the solution $z_i(t)$ to (4) to one of these hyperplanes converges to 0 as $t \to \infty$.

Figure 6: Illustrating Theorem 4.1 with $n = 40$ tokens in $\mathbb{R}^2$. Here $Q = K = I_2$, $V$ is a random symmetric matrix with eigenvalues $\{1.35, 0.07\}$, and $\varphi_1 = (0.76, 0.65)$. The components of the tokens in the direction of $\varphi_1$ (orange arrow) cluster over time. (See Figures 13 and 14 for examples in $\mathbb{R}^3$.)

We also observe that tokens typically cluster toward only two hyperplanes; a third one (passing through the origin) may appear for non-generic initial sequences. The hyperplanes are perpendicular to $\varphi_1$ since $V$ is diagonalizable. The proof may be found in Appendix D.
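A quick numerical illustration of Theorem 4.1, under assumptions chosen purely for convenience: $Q = K = I_d$ and a random symmetric $V$ with a simple positive leading eigenvalue, so that $(Q, K, V)$ is a good triple and $e^{tV}$ can be formed from the eigendecomposition. The spectrum, horizon, and step size below are illustrative and not taken from the paper.

```python
# Sketch for Theorem 4.1: track the components <z_i(t), phi_1> along the leading
# eigenvector of a symmetric V under the rescaled dynamics (4), with Q = K = Id.
import numpy as np

rng = np.random.default_rng(3)
n, d = 40, 2
eigvals = np.array([1.35, 0.07])                   # simple, positive leading eigenvalue (illustrative)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal eigenbasis
V = U @ np.diag(eigvals) @ U.T
phi1 = U[:, 0]                                     # unit leading eigenvector of V

def expm_sym(t):
    """Matrix exponential e^{tV} for the symmetric V above."""
    return U @ np.diag(np.exp(t * eigvals)) @ U.T

Z, t, dt, T = rng.standard_normal((n, d)), 0.0, 0.005, 8.0
while t < T:
    Y = Z @ expm_sym(t).T                          # rows e^{tV} z_i
    logits = Y @ Y.T                               # <Q e^{tV} z_i, K e^{tV} z_j> with Q = K = Id
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    Z = Z + dt * (P @ Z - Z) @ V.T                 # rows: sum_j P_ij V (z_j - z_i)
    t += dt

coords = Z @ phi1                                  # components along phi_1
print(np.round(np.sort(coords), 3))                # typically concentrate on at most 3 values
```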
The important role played by $\lambda_1(V)$ in the dynamics may be seen in (4): the component of $z_i(t)$ along $\varphi_1$ determines the size of $e^{tV} z_i(t)$ in the exponent appearing in (4). The tokens $z_j(t)$ attracting other tokens $z_i(t)$ are those for which this component along $\varphi_1$ is largest in modulus. This attraction process forms the clusters. These leaders, as in all our results, have been empirically observed to be the ones carrying the largest amount of information in the sentence (see the supplementary material of [VSP+17]).

Furthermore, Theorem 4.1 can also be interpreted in more classical machine learning terms. On the one hand, it can be seen as an instance of K-flats clustering [BM00, Vid11]: points in the input sequence are clustered, based on their intrinsic similarity, to at most 3 "flats" of dimension $d - 1$. On the other hand, it ensures that for a good triple $(Q, K, V)$, (4) generates a linearly separable representation of tokens.

Beyond a single direction? Numerical experiments (e.g., Fig. 7) indicate that a similar phenomenon emerges for more complex $V$. We formulate the following conjecture, which is a natural generalization of Theorem 4.1.

Conjecture 4.2 (Codimension conjecture). Let $k \geq 1$ be the number of eigenvalues of $V$ with positive real part. Then there exist at most three parallel Euclidean subspaces of $\mathbb{R}^d$ of codimension $k$ such that for any $i \in [n]$, the distance of $z_i(t)$ to one of these subspaces converges to 0 as $t \to \infty$.

5 A mix of hyperplanes and polytopes

We now turn our attention to an even more general version of Theorem 4.1, which does not require the leading eigenvalue of $V$ to be simple. The resulting theorem can be viewed as a combination of Theorem 4.1 and Theorem 3.1. Specifically, we assume that $V$ behaves as the identity when acting on the eigenspace of the leading eigenvalue. This property is automatically satisfied if $V$ is normal (so that its eigenvectors form an orthonormal basis), so we call such a $V$ paranormal.

Figure 7: (a) Conjecture 4.2, low-dimensional case: $n = 40$, $d = 3$ and $Q = K = I_3$, with $V$ a random matrix with eigenvalues $\{1.96, 0.22, -0.25\}$. The $k = 2$ positive eigenvalues of $V$ generate attraction between the tokens, and even convergence in the corresponding eigenspaces; this explains the codimension-$k$ statement. The negative eigenvalue generates a repulsive effect between the tokens, and we see a divergence along two lines (note the different scales between the four panels). (b) Conjecture 4.2, high-dimensional case: $n = 256$, $d = 128$, with $(Q, K, V)$ fixed random matrices and $V$ symmetric. For each coordinate $j$ corresponding to a positive eigenvalue, the variance of the set $\{\varphi_j(z_i(t)) : i \in [n]\}$ (shaded area) tends to 0 with $t$, while the mean (solid lines) converges to one of two real scalars: one positive (top panel), one negative (bottom panel). Coordinates corresponding to negative eigenvalues diverge (Fig. 15).
Definition 2. We call $(Q, K, V)$ a good triple with multiplicity if the following conditions hold:
(i) $Q^\top K$ is positive definite: $Q^\top K > 0$;
(ii) $V$ is paranormal: there exist two linear subspaces $F, G \subset \mathbb{R}^d$ which are invariant under $V$ and such that $F \oplus G = \mathbb{R}^d$, $V|_F = \lambda\,\mathrm{Id}$ for some $\lambda > 0$, and $\rho(V|_G) < \lambda$, where $\rho(\cdot)$ denotes the spectral radius (the maximal modulus of the eigenvalues).

An example of such a $V$ is used for Fig. 8. We may now state our main result in the setting of good triples with multiplicity. The proof may be found in Appendix E.

Theorem 5.1 (Clustering for $\lambda_1$ with multiplicity). Suppose that $(Q, K, V)$ is a good triple with multiplicity in the sense of Definition 2. Then, for any initial sequence $\{z_i(0)\}_{i\in[n]} \subset \mathbb{R}^d$, there exists a bounded convex polytope $\mathcal{K} \subset F$ such that, setting $\mathcal{H} := (\partial\mathcal{K} \cup \{0\}) \times G$, for any $i \in [n]$ we have $\mathrm{dist}(z_i(t), \mathcal{H}) \to 0$ as $t \to \infty$.

Figure 8: Illustrating Theorem 5.1 with $n = 40$ tokens in $\mathbb{R}^3$. As before, $Q = K = I_d$, and we take $V = \mathrm{diag}(1, 1, -1/2)$. A convex polytope $\mathcal{K}$ emerges before time 5, toward which two coordinates of the tokens cluster, and persists throughout the evolution, while the tokens diverge along the coordinate corresponding to the eigenvalue $-1/2$ (note the different scales between the four panels).

Several important directions regarding the mathematical theory of Transformers remain unexplored. An important extension of our work would amount to studying multi-headed Transformers; borrowing the notation from Remark 4, they amount to

$$x_i^{[k+1]} = x_i^{[k]} + \Delta t \sum_{h=1}^{H} \sum_{j=1}^{n} \frac{e^{\langle Q_h x_i^{[k]},\, K_h x_j^{[k]}\rangle}}{\sum_{\ell=1}^{n} e^{\langle Q_h x_i^{[k]},\, K_h x_\ell^{[k]}\rangle}}\, V_h x_j^{[k]}.$$

For each $h \in [H]$ (corresponding to a different head), the weight matrices $Q_h, K_h, V_h$ are constant. Proving clustering or convergence of the self-attention matrix for such dynamics remains an open problem. Preliminary numerical investigations seem to indicate that interesting clustering phenomena also occur in this context. A characterization of the properties of optimal weights, obtained by invoking the optimal control correspondence in the spirit of [Wei17], is also an interesting avenue for future research.

Acknowledgments and Disclosure of Funding

We thank Pierre Ablin, Léonard Boussioux, Enric Boix-Adsera, Gabriel Peyré, Yair Shenfeld and Emmanuel Trélat for helpful discussions. C.L. was supported by the Simons Foundation Grant 601948, DJ. P.R. is supported by NSF grants IIS-1838071, DMS-2022448, and CCF-2106377. Y.P. is supported in part by the MIT-IBM Watson AI Lab.

References

[ABV+05] Juan A. Acebrón, Luis L. Bonilla, Conrad J. Pérez Vicente, Félix Ritort, and Renato Spigler. The Kuramoto model: A simple paradigm for synchronization phenomena. Reviews of Modern Physics, 77(1):137, 2005.
[BM00] Paul S. Bradley and Olvi L. Mangasarian. k-plane clustering. Journal of Global Optimization, 16:23-32, 2000.
[Che95] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790-799, 1995.
[CHH+16] Junghee Cho, Seung-Yeal Ha, Feimin Huang, Chunyin Jin, and Dongnam Ko. Emergence of bi-cluster flocking for the Cucker-Smale model. Mathematical Models and Methods in Applied Sciences, 26(06):1191-1218, 2016.
[CRBD18] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.
[CS07] Felipe Cucker and Steve Smale. Emergent behavior in flocks. IEEE Transactions on Automatic Control, 52(5):852-862, 2007.
[DCL21] Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth.
In International Conference on Machine Learning, pages 2793-2803. PMLR, 2021.
[Dob79] Roland L'vovich Dobrushin. Vlasov equations. Funktsional'nyi Analiz i ego Prilozheniya, 13(2):48-58, 1979.
[Gol13] François Golse. Mean field kinetic equations. Course of Polytechnique, 2013.
[HJ12] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2012.
[HK02] Rainer Hegselmann and Ulrich Krause. Opinion dynamics and bounded confidence: models, analysis and simulation. Journal of Artificial Societies and Social Simulation (JASSS), 5(3), 2002.
[HKPZ19] Seung-Yeal Ha, Jeongho Kim, Jinyeong Park, and Xiongtao Zhang. Complete cluster predictability of the Cucker-Smale flocking model on the real line. Archive for Rational Mechanics and Analysis, 231:319-365, 2019.
[HR17] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1), 2017.
[HSW+22] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
[HZRS16a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[HZRS16b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pages 630-645. Springer, 2016.
[JM14] Pierre-Emmanuel Jabin and Sebastien Motsch. Clustering and asymptotic behavior in opinion formation. Journal of Differential Equations, 257(11):4165-4187, 2014.
[Kra00] Ulrich Krause. A discrete nonlinear and non-autonomous model of consensus. In Communications in Difference Equations: Proceedings of the Fourth International Conference on Difference Equations, page 227. CRC Press, 2000.
[Kur75] Yoshiki Kuramoto. Self-entrainment of a population of coupled non-linear oscillators. In International Symposium on Mathematical Problems in Theoretical Physics: January 23-29, 1975, Kyoto University, Kyoto/Japan, pages 420-422. Springer, 1975.
[LCG+20] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A Lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.
[LLH+20] Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. Understanding and improving transformer from a multi-particle dynamic system point of view. In International Conference on Learning Representations, 2020.
[LWLQ22] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 3:111-132, 2022.
[MT14] Sebastien Motsch and Eitan Tadmor. Heterophilious dynamics enhances consensus. SIAM Review, 56(4):577-621, 2014.
[PHD20] Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652-24663, 2020.
[PR13] Benedetto Piccoli and Francesco Rossi. Transport equation with nonlocal velocity in Wasserstein spaces: convergence of numerical schemes. Acta Applicandae Mathematicae, 124:73-105, 2013.
[PR16] Benedetto Piccoli and Francesco Rossi. On properties of the generalized Wasserstein distance.
Archive for Rational Mechanics and Analysis, 222:1339-1365, 2016.
[PRT15] Benedetto Piccoli, Francesco Rossi, and Emmanuel Trélat. Control to flocking of the kinetic Cucker-Smale model. SIAM Journal on Mathematical Analysis, 47(6):4685-4719, 2015.
[RS14] Brian Rider and Christopher D. Sinclair. Extremal laws for the real Ginibre ensemble. The Annals of Applied Probability, 24(4):1621-1651, 2014.
[SABP22] Michael E. Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pages 3515-3530. PMLR, 2022.
[VCBJ+95] Tamás Vicsek, András Czirók, Eshel Ben-Jacob, Inon Cohen, and Ofer Shochet. Novel type of phase transition in a system of self-driven particles. Physical Review Letters, 75(6):1226, 1995.
[Vid11] René Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52-68, 2011.
[VKF20] Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. Fast transformers with clustered attention. Advances in Neural Information Processing Systems, 33:21665-21674, 2020.
[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[Wei17] Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1-11, 2017.
[WHL19] Weinan E, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):10, 2019.
[WLK+20] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[YBR+20] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.