# Clustering in Causal Attention Masking

Nikita Karagodin, Yury Polyanskiy, Philippe Rigollet

This work presents a modification of the self-attention dynamics proposed by Geshkovski et al. (2023b) to better reflect the practically relevant, causally masked attention used in transformer architectures for generative AI. This modification translates into an interacting particle system that cannot be interpreted as a mean-field gradient flow. Despite this loss of structure, we significantly strengthen the results of Geshkovski et al. (2023b) in this context: while previous rigorous results focused on cases where all three matrices (Key, Query, and Value) were scaled identities, we prove asymptotic convergence to a single cluster for arbitrary key-query matrices and a value matrix equal to the identity. Additionally, we establish a connection to the classical Rényi parking problem from combinatorial geometry to make initial theoretical steps towards demonstrating the existence of meta-stable states.

1 Introduction

The introduction of the Transformer architecture (Vaswani et al., 2017) has markedly impacted the landscape of natural language processing (NLP), signaling the advent of large language models. Central to the Transformer architecture is the self-attention mechanism, a special kind of layer that distinguishes it from preceding models such as ResNets. This innovation has yielded unprecedented performance not only in machine translation and text summarization but also in areas beyond NLP, including computer vision, speech recognition, and robotics. The flexibility and efficiency of Transformers underscore their integral role in the progression of artificial intelligence. Despite their widespread use, the theoretical foundations underlying their success remain underexplored.

Following Sander et al. (2022), recent studies by Geshkovski et al. (2023a) and Geshkovski et al. (2023b) have proposed a mathematical framework to analyze Transformers as interacting particle systems, demonstrating that tokens, when modeled as particles, exhibit clustering under certain conditions on the Key, Query, and Value matrices. These works primarily focus on full (mean-field) attention mechanisms, where each token can interact with every other token. Building upon this foundation, our research extends the analysis to causal attention mechanisms, wherein each token is restricted to interact only with preceding tokens. This distinction is crucial, as causal attention is prevalent in Transformer models employed in generative AI, known as decoder architectures.

Causal attention is crucial for sequence generation tasks, ensuring that each token only attends to previous tokens and not future ones, thereby preserving the correct temporal order. This mechanism, also known as autoregressive attention, masks future tokens during attention computation to prevent the model from accessing information it hasn't generated yet. At inference time, causal attention allows the model to generate text one token at a time, using previously generated tokens to inform the next, ensuring coherent and contextually accurate sequences. This step-by-step generation process is computationally efficient, as each token is produced in a forward pass without needing to revisit previous steps.
In contrast to full attention, which considers all tokens simultaneously and is suitable for tasks like machine translation where the entire sequence is known, causal attention is essential for tasks requiring real-time, sequential output. This computational advantage explains the pervasiveness of causal attention not only in natural language processing but also in image generation with tools like DALL-E (Ramesh et al., 2021), VQGAN (Esser et al., 2021), or Parti (Yu et al., 2022) and multimodal foundation models, notably Chameleon (Team, 2024). More generally, the use of masked attention, where tokens pay attention to a subset of other tokens, has been driving recent scaling efforts and has led to state-of-the-art models such as MUSE (Chang et al., 2023) or AlphaFold 3 (Abramson et al., 2024). Causal attention can also be recast as an interacting particle system, but it requires different analytical techniques. This is the goal of our paper.

Table 1: Possible final configurations of particles.

| Largest eigenvalue | Multiplicity | Final configuration | Figure |
| --- | --- | --- | --- |
| $\lambda_{\max} > 0$ | $d$ | First particle $x_1(0)$ | 1a |
| $\lambda_{\max} > 0$ | 2 | One point in $L$ | 1b |
| $\lambda_{\max} > 0$ | 1 | Two points $\xi$ and $-\xi$ | 1c |
| $\lambda_{\max} < 0$ | 2 | Point cloud around $L$ | 1d |
| $\lambda_{\max} < 0$ | 1 | Two point clouds around $\xi$ and $-\xi$ | 1e |

Our contributions. Our main theoretical result establishes asymptotic clustering of tokens for a causal self-attention transformer modeled as an interacting particle system on the sphere (Theorem 4.1). While mathematically accurate, this asymptotic collapse to a single cluster is seldom observed numerically. Instead, particles collapse to multiple clusters and stay in this configuration for a very long time (see Fig. 2 for a representative example); such meta-stable states were already alluded to in Geshkovski et al. (2023b) and their study was recently initiated in Koubbi et al. (2024). In Section 5 we describe such meta-stable states using an analogy with the Rényi parking process (Lemma 2, Theorem 5.1). Additionally, Theorem 5.1 covers asymptotic clustering of tokens for causal self-attention with an additional cross-attention component. Moreover, we predict that, akin to linear dynamical systems, the most important factors that qualitatively describe the final particle configuration, both in the causal and full-attention cases, are the eigenvalue of the Value matrix $V$ with the largest real part, $\lambda_{\max}$, and its eigenspace $L$, while the Query and Key matrices $Q, K$ and the temperature parameter $\beta$ do not matter. Our conjectured atlas of possible meta-stable configurations is listed in Table 1. We prove the result stated in the first line of this table, namely that particles eventually collapse into a point when $V = \mathrm{Id}$, in Theorem 4.1. We remark that the assumptions of Theorem 4.1 are much weaker than for the similar results in the full-attention case (Geshkovski et al., 2023b); in particular, we put no constraints on $K$, $Q$ or $\beta$.

This work is a combination of rigorous mathematical results and non-trivial predictions based on analytical insights and numerical simulations. We summarize all limitations in Section 6.

Related work. Our work builds upon the framework of Geshkovski et al. (2023b,a), where clustering properties of transformers are analyzed as systems of particles interacting on the sphere. Specifically, Geshkovski et al.
(2023b) proved that encoder-only (i.e., unmasked) self-attention with (post-)LayerNorm leads to tokens clustering to a single point, in the limit of the number of layers going to infinity. This phenomenon is also known as consensus in the related literature on multi-agent systems (Markdahl et al., 2017; Criscitiello et al., 2024) and Kuramoto oscillators (Strogatz, 2000; Abdalla et al., 2022). The work of Geshkovski et al. (2023b) in turn expands on the original perspective brought forward by Sander et al. (2022), which identifies the self-attention layer as a measure-to-measure map; see also Vuckovic et al. (2021). More recently, Castin et al. (2024) studied the smoothness of this map in a framework that also covers causal attention. This work introduces a clever reparametrization that allows them to recast causal attention as mean-field dynamics, akin to their full attention counterpart. Using various approximations, Cowsik et al. (2024) were able to study a more realistic architecture that also includes MLP layers and produce accurate predictions for the final configuration of particles. This setup was further investigated by Agrachev and Letrouit (2024) from a geometric control perspective. We note also that clustering in the absence of a residual connection (replace $\dot{x}_k(t)$ with $x_k(t+1)$ in (SA)) was established in Wu et al. (2023). Additional effects of the residual connection are studied in Dong et al. (2021) and Zhang et al. (2024).

Figure 1: Particle trajectories for different Value matrices: (a) $V = \mathrm{diag}(1, 1, 1)$; (b) $V = \mathrm{diag}(1, 1, 0)$; (c) $V = \mathrm{diag}(1, 0, 0)$; (d) $V = \mathrm{diag}(-1, -1, -3)$; (e) $V = \mathrm{diag}(-1, -3, -3)$. In all cases we take simple Query and Key matrices $K = Q = \mathrm{Id}$, temperature $\beta = 9$ and final time $T = 5000$ for $n = 32$ particles initialized uniformly at random on the sphere. Positions of particles at time $T$ are indicated by a red dot.

2 Causal attention

Before describing our model of causal attention dynamics, we review the idea of Geshkovski et al. (2023b) for modeling the full attention dynamics. In that work, the evolution of the representations of tokens through the layers is modeled as a system of $n$ coupled ordinary differential equations (ODEs) describing the dynamics of a system of particles $x_1(t), \dots, x_n(t)$. A brief part of their derivation of the dynamics from the transformer architecture is given in Section A.1. The particle position $x_k(t)$ corresponds to the representation of the $k$-th token at layer $t$ (where for convenience, $t$ is allowed to take non-integer values) and, due to RMSNorm, the particles are forced to live on the unit sphere $S^{d-1}$. (The RMSNorm layer usually also includes a multiplication by a trainable diagonal matrix $D$, but the effect of this step can be equivalently achieved by multiplying the $K, Q, V$ matrices by $D$.) These ODEs are parametrized by three matrices, known as the query $Q$, the key $K$ and the value $V$, respectively, which are assumed to be square $d \times d$ matrices. More specifically, token $k$ evolves according to
$$\dot{x}_k(t) = P_{x_k(t)}\left(\frac{1}{Z_k(t)} \sum_{j=1}^{n} e^{\beta \langle Q x_k(t), K x_j(t) \rangle}\, V x_j(t)\right), \tag{SA}$$
where $P_x y = y - \frac{\langle x, y\rangle}{|x|^2}\,x$ is the projection onto the tangent space of $S^{d-1}$ at $x$, and $Z_k(t) = \sum_{j=1}^{n} e^{\beta \langle Q x_k(t), K x_j(t) \rangle}$ is a normalizing factor. Note that the dynamics of the $k$-th token depend on the positions of all tokens $j \in [n]$, which is a landmark characteristic of full attention leading to the so-called mean-field dynamics studied in Geshkovski et al. (2023b); see also Geshkovski et al. (2023a); Castin et al. (2024); Paul and Trélat (2024).
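To make the dynamics concrete, here is a minimal numerical sketch (not from the paper; the function names, the explicit Euler discretization, and the renormalization step are our own choices) of the right-hand side of (SA) together with a projected-Euler integrator that keeps particles on the unit sphere. Setting `causal=True` restricts the inner sum to $j \le k$, which is exactly the masked variant introduced in the next section.

```python
import numpy as np

def attention_rhs(X, Q, K, V, beta, causal=False):
    """Right-hand side of (SA)/(CSA); rows of X are tokens on the unit sphere."""
    n, d = X.shape
    logits = beta * (X @ Q.T) @ (X @ K.T).T        # logits[k, j] = beta * <Q x_k, K x_j>
    if causal:
        logits = np.where(np.tril(np.ones((n, n), dtype=bool)), logits, -np.inf)
    W = np.exp(logits - logits.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)              # attention weights, each row sums to 1
    F = W @ (X @ V.T)                              # (1/Z_k) sum_j e^{...} V x_j
    F -= np.sum(F * X, axis=1, keepdims=True) * X  # project onto the tangent space at x_k
    return F

def simulate(X0, Q, K, V, beta, dt=0.05, steps=2000, causal=True):
    """Explicit Euler step followed by renormalization onto the sphere (a crude integrator)."""
    X = X0 / np.linalg.norm(X0, axis=1, keepdims=True)
    for _ in range(steps):
        X = X + dt * attention_rhs(X, Q, K, V, beta, causal=causal)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X
```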
In this work we focus on causal attention, where the dynamics of token $k$ depend only on the positions of tokens $j \le k$. As described in the introduction, this modification is by now the dominant type of transformer architecture in generative AI. To reflect causal masking, we modify the ODE governing the dynamics of token $k$ as follows:
$$\dot{x}_k(t) = P_{x_k(t)}\left(\frac{1}{Z_k(t)} \sum_{j=1}^{k} e^{\beta \langle Q x_k(t), K x_j(t) \rangle}\, V x_j(t)\right), \tag{CSA}$$
where the normalizing factor $Z_k(t)$ is naturally updated to $Z_k(t) = \sum_{j=1}^{k} e^{\beta \langle Q x_k(t), K x_j(t) \rangle}$.

3 Single token dynamics

Note that in the (CSA) dynamics, the first token evolves fully autonomously, without the influence of the others. Thus, we start from the description of its evolution. It will also guide our understanding of the dynamics of subsequent tokens. The first token moves according to the equation $\dot{x}(t) = P_{x(t)}(V x(t))$. To state its behavior for any matrix $V$, we need a few definitions. Denote by $\lambda_{\max}$ the largest real part of all the eigenvalues of $V$. Let $L$ be the span of all generalized eigenvectors of $V$ associated to eigenvalues (potentially different) with real part equal to $\lambda_{\max}$. Let $L' \subseteq L$ be the subspace generated by only the eigenvectors in $L$ with the largest corresponding Jordan block (these vectors might correspond to different blocks and even to different eigenvalues).

Lemma 1. Let $x(t)$ be a solution of the ODE $\dot{x}(t) = P_{x(t)}(V x(t))$ defined on the unit sphere $S^{d-1}$. Then, for almost every initial value $x(0) \in S^{d-1}$, there exist $C, c > 0$ such that the following convergence rates for the geodesic distance $\mathrm{dist}$ hold: (i) exponential convergence to $L$: $\mathrm{dist}(x(t), L \cap S^{d-1}) \le C e^{-ct}$, and (ii) linear convergence to $L'$: $\mathrm{dist}(x(t), L' \cap S^{d-1}) \le c\, t^{-1}$.

This result can be derived from standard results in the theory of linear ODEs (proof in Section B.1). We note that this result is important for the other tokens as well. Indeed, for every token $x_k$, the contribution to $\dot{x}_k$ in (CSA) from the term with $j = k$ often has the biggest weight, an effect amplified by large $\beta$. In general, eigenvectors corresponding to a real eigenvalue $\lambda = \lambda_{\max}$ create a fixed set in $L \cap S^{d-1}$, while complex eigenvalues with the largest real part produce a limit torus in $L \cap S^{d-1}$. In what follows, we only consider the case where the eigenvalue with the largest real part is real itself and only has Jordan blocks of size 1. Then $L' = L$ and the convergence to $L$ is exponentially fast. Note also that when $\dim L = 1$, we have $L \cap S^{d-1} = \{\pm\xi\}$ for some unit vector $\xi$. In this case, $x(t) \to \pm\xi$ as $t \to \infty$, again with exponential speed. These observations will be important for the next section, where we describe the asymptotic configurations of tokens.

4 Final Configuration

The system of $n$ tokens that we are studying is far more complicated than a single token. Even establishing convergence to some point as $t \to \infty$ is challenging. In Geshkovski et al. (2023b), similar models were analyzed analytically by noticing that the dynamical system has the structure of a gradient flow of some potential function: $\dot{x}(t) = \nabla H(x(t))$. For such systems, groundbreaking results of Łojasiewicz (1962, 1965, 1984) (see Haraux (2012) for a self-contained overview) guarantee convergence to a critical point of $H$ assuming it is real-analytic. However, our system (CSA) does not have a gradient-flow structure and thus the techniques of Łojasiewicz are not applicable. On the other hand, we have a significant advantage in the hierarchical structure of our system, allowing us to study tokens sequentially. We have already understood the evolution of the first token. In this section, we do two things.
First, we describe, based on our analytical and numerical insights, conjectures about the asymptotic configuration $x(t)$ for $t \to \infty$. The surprising result here is that only the spectral properties of $V$ (and not $K$ or $Q$) affect the asymptotics. Second, we rigorously prove convergence to a single point for the special case $V = \mathrm{Id}$. We note that unlike the proof in Geshkovski et al. (2023b) (see also Markdahl et al. (2017); Criscitiello et al. (2024)), our result works for all $K$ and $Q$ matrices, while the proof in Geshkovski et al. (2023b) works only for $Q^\top K = V$ and Markdahl et al. (2017); Criscitiello et al. (2024) are restricted to $Q^\top K = \mathrm{Id}$.

Our main insight is that there are two major forces that drive each token: its internal force, which is described by Lemma 1, and the external force induced by all the particles preceding it, which is either attractive or repulsive depending on the sign of the top eigenvalue(s) of $V$. The balance between the two forces is defined via attention. To get a better grasp of how the external force works, we consider the case where the first (internal) force vanishes, that is, $V = \mathrm{Id}$. In this case, the tokens collapse asymptotically to a single point.

Theorem 4.1. Let $V = \mathrm{Id}$ and let $Q, K$ be arbitrary matrices. Then, for almost any starting point $(x_1(0), \dots, x_n(0))$ with respect to the volume measure on $(S^{d-1})^n$, the causal transformer dynamics (CSA) converge to a single cluster: $\forall k \in [n]$, $\lim_{t\to\infty} x_k(t) = x_1(0)$.

We prove this result in Section B.2. In the proof, the weight functions are only required to be positive and continuously differentiable ($C^1$). This flexibility suggests that incorporating time-dependence of $Q$ and $K$ might not alter the theorem's validity, but it significantly adds complexity to the proof in dealing with non-autonomous systems. Steps similar to our proof of Theorem 4.1 can be followed to study the more general case of a matrix $V \ne \mathrm{Id}$. Unfortunately, one runs into multiple technical issues with the application of the stable-manifold theorem from dynamical systems due to the emergence of critical manifolds (as opposed to critical points in the $V = \mathrm{Id}$ case). Thus, we leave the general case at the status of conjectures, which we describe next.

In what follows, we denote the eigenvalue of $V$ with the largest real part by $\lambda_{\max}$ and assume that it is real. If it is not, the limiting configuration additionally rotates with a constant speed, which complicates the discussion and so is omitted. Let $L$ denote the eigenspace of $\lambda_{\max}$. If $\lambda_{\max}$ has multiplicity 1, then we denote the corresponding unit eigenvectors by $\pm\xi$. For simplicity we assume that all Jordan blocks corresponding to $\lambda_{\max}$ have size 1.

First of all, if $\dim L = 1$ then, according to Lemma 1, every token is driven towards $\xi$ or $-\xi$ by its own force. Moreover, for $\lambda_{\max} > 0$ the force of the other tokens is attractive, while for $\lambda_{\max} < 0$ it is repulsive. Thus, for $\lambda_{\max} > 0$ all the particles collapse into $\xi$ and $-\xi$, whereas for $\lambda_{\max} < 0$ the repulsion prevents the particles from going all the way to $\pm\xi$ and the particles instead stabilize as two clouds around $\xi$ and $-\xi$. This behavior is captured in Figures 1c and 1e. (The spread of the clouds depends on the relative importance of each token's own attention, which differs with various $K, Q$. There are choices of $K$ and $Q$ that result in complex interactions without structure, for example $Q^\top K = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]$, $V = I_3$.)
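As a quick illustration of Theorem 4.1 (and not part of the formal development), one can run the integrator `simulate` sketched in Section 2 with $V = \mathrm{Id}$ and a randomly drawn key-query pair; all parameter values below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta = 16, 3, 9.0
X0 = rng.standard_normal((n, d))
Q, K = rng.standard_normal((d, d)), rng.standard_normal((d, d))
V = np.eye(d)

# `simulate` is the projected-Euler sketch from Section 2.
X = simulate(X0, Q, K, V, beta, dt=0.05, steps=50_000, causal=True)
x1 = X0[0] / np.linalg.norm(X0[0])   # the first token is stationary: P_{x_1}(x_1) = 0
print(np.max(np.linalg.norm(X - x1, axis=1)))  # shrinks toward 0, possibly only after a
                                               # very long horizon due to metastability
```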
For the case $\lambda_{\max} > 0$, we formally express the above behavior as:

Conjecture 1. Let $Q, K$ be arbitrary matrices and let $V$ be diagonalizable with $d$ distinct positive real eigenvalues. Denote the largest eigenvalue by $\lambda_{\max}$ and let $\xi$ be a unit vector with $V\xi = \lambda_{\max}\xi$. Then, for almost any starting point $(x_1(0), \dots, x_n(0))$ with respect to the volume measure on $(S^{d-1})^n$, the causal transformer dynamics (CSA) converge to two clusters: $\forall k \in [n]$, $\lim_{t\to\infty} x_k(t) \in \{\xi, -\xi\}$.

If $\lambda_{\max}$ has multiplicity at least 2, then by Lemma 1 each token is internally attracted to the eigenspace $L$. When tokens are close to $L$, the action of $V$ becomes close to $\lambda_{\max}\mathrm{Id}$, which for $\lambda_{\max} > 0$, according to Theorem 4.1, forces tokens to collapse to a singleton, while for $\lambda_{\max} < 0$ the other tokens exert a repelling force, causing particles to spread out around $L$. This behavior is captured in Figures 1b and 1d. For the case $\lambda_{\max} > 0$ we formalize it as follows:

Conjecture 2. Let $Q, K$ be arbitrary matrices and let $V$ be any matrix such that its largest eigenvalue $\lambda_{\max} > 0$ is real and has an eigenspace $L$ of dimension $\dim L \ge 2$, while for any $z \perp L$ one has $Vz \perp L$ with $\langle Vz, z\rangle < \lambda_{\max}|z|^2$. Then, for almost any initialization $(x_1(0), \dots, x_n(0))$, the causal attention dynamics (CSA) converge to one cluster. More specifically, if we define $\xi$ as the normalized $L$-component of $x_1(0)$, i.e., $\xi := y_1/|y_1|$ for $y_1 := P_L(x_1(0))$, then $\forall k \in [n]$, $\lim_{t\to\infty} x_k(t) = \xi$. (Note that $\xi$ is undefined when $x_1(0) \perp L$, but this happens with probability zero.)

Figure 2: Evolution of the system (CSA) with $K = Q = V = I_2$, $n = 200$, $d = 2$, $\beta = 64$; the displayed panels show times 10, 75, 150 and 500. Strong Rényi centers are shown in red and Rényi centers in black, with $\delta = 4\beta^{-1/2}$. Note that strong Rényi centers are visually stationary (as per Lemma 2) but do not explain all clusters. In turn, Rényi centers move and merge (one disappears between $t = 75$ and $t = 150$), but capture more meta-stable clusters.

An important practical observation is that these conjectures explain how $V$ performs dimensionality reduction: tokens converge to $L \cap S^{d-1}$ and, in that space, they move as if acted upon by the matrix $\lambda_{\max}\mathrm{Id}$ on a sphere $S^{\dim L - 1}$. For the pre-trained model of Lan et al. (2020), the spectra of the value matrices are depicted in Figure 4. Interestingly, there are heads with negative $\lambda_{\max}$. Future work will be concerned with studying real-world matrices $V$ and connecting their top eigenspaces to the semantic meaning of layers and tokens.
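The conjectured classification of Table 1 depends only on the spectrum of $V$. The following sketch (our own helper, not the authors' code; the tolerance parameter and the algebraic-multiplicity proxy for $\dim L$ are our simplifications) extracts $\lambda_{\max}$, checks whether it is attained by a real eigenvalue, and reports the corresponding row of Table 1; spectra outside the table (e.g. $\lambda_{\max} = 0$) are not distinguished.

```python
import numpy as np

def predicted_configuration(V, tol=1e-9):
    """Report the row of Table 1 suggested by the spectrum of V (conjectured classification)."""
    eigvals = np.linalg.eigvals(V)
    lam_max = eigvals.real.max()
    top = eigvals[np.abs(eigvals.real - lam_max) < tol]  # eigenvalues attaining the top real part
    if np.any(np.abs(top.imag) > tol):
        return "top eigenvalue is complex: the limiting configuration rotates (outside Table 1)"
    mult = len(top)                                      # proxy for dim L (exact if V diagonalizable)
    d = V.shape[0]
    if lam_max > 0:
        if mult == d:
            return "collapse to the first particle x_1(0) (proved for V = Id in Theorem 4.1)"
        if mult >= 2:
            return "collapse to one point in L (Conjecture 2)"
        return "collapse to the two points +xi / -xi (Conjecture 1)"
    if mult >= 2:
        return "point cloud spread around L"
    return "two point clouds around +xi and -xi"

print(predicted_configuration(np.diag([1.0, 1.0, 0.0])))  # -> one point in L (cf. Figure 1b)
```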
5 Meta-stable clustering

As we discussed earlier, perhaps the most fascinating discovery of Geshkovski et al. (2023b) is the existence of meta-stable clusters in the full-attention dynamics. It turns out that the same phenomenon persists in the causal-attention dynamics that we study here. The dynamical evolution of the system is illustrated in Fig. 2. At $t = 150$, the initially uniform distribution of 200 particles has consolidated into seven distinct clusters. While Theorem 4.1 establishes the eventual collapse into a single cluster, these intermediate clusters exhibit remarkable metastability, persisting with negligible movement over extended time periods (at least until $t = 500$ according to Fig. 2) before sequential merging occurs. We call these configurations meta-stable clusters, with three-dimensional analogues shown in Fig. 1a and 1b. Given that the time parameter in our dynamics corresponds to network depth in transformer architectures, the meta-stable configurations, rather than the final states (achieved at $t = \exp(\Omega(\beta))$), hold greater practical significance.

The emergence of meta-stable clustering and its associated dimensionality reduction may provide fundamental insights into transformers' capacity for generating efficient context-dependent representations. From a theoretical perspective, understanding meta-stable clustering presents significant challenges, as traditional techniques for asymptotic analysis, such as those used in our Theorem 4.1, prove insufficient. Recent work on full attention transformers has made partial progress in this direction. Koubbi et al. (2024) demonstrated that when the self-attention dynamics approach a nearly clustered state, they converge to a tightly clustered configuration and remain stable for an exponentially long time period. Complementing this, Bruno et al. (2024) proved that tokens initialized near the uniform distribution on the sphere spontaneously organize into a loosely clustered state. However, the bounds on the clustering tightness in this second line of work are not sufficient to trigger the convergence conditions required by the theorem of Koubbi et al. (2024).

This section presents a fundamental discovery regarding the identification of cluster centers in causal-attention dynamics. We establish three key claims. First, we demonstrate that initialization irregularities generate distinctive tokens, termed Rényi parking centers, which evolve into meta-stable cluster nuclei. While this phenomenon is primarily supported by numerical evidence (Fig. 3), it provides crucial insight into the clustering mechanism. Second, we prove that a subset of these special tokens, called strong Rényi centers, maintains near-stationarity over extended time periods (Lemma 2). Both Rényi and strong Rényi centers occur with frequency $\Theta(\beta^{(d-1)/2})$, confirming the $\sqrt{\beta}$ scaling predicted for $d = 2$ by Geshkovski et al. (2023b); see also Koubbi et al. (2024); Bruno et al. (2024). Third, we establish in Theorem 5.1 that as $t \to \infty$, all remaining tokens converge to the vicinity of one of these stationary tokens, completing the meta-stable clustering process.

This section restricts our analysis to the case $V = \mathrm{Id}$. For general matrices $V$, our empirical observations suggest that particles rapidly converge to a lower-dimensional subspace spanned by $d_1 \le d$ principal eigenvectors. Consequently, we conjecture that the number of meta-stable clusters should rather be $\beta^{(d_1-1)/2}$, where the ambient dimension $d$ is replaced by the effective dimension $d_1$. While a rigorous proof of this dimension reduction remains an open problem for future investigation, this phenomenon motivates our focus on low-dimensional cases (specifically $d = 2$) throughout this section. For convenience, we also fix $Q = K = \mathrm{Id}$, though this condition could easily be relaxed (e.g., to $Q^\top K = K^\top Q = V$). Under these assumptions, the system can be rewritten in polar coordinates $x_k = [\cos(\varphi_k), \sin(\varphi_k)]$ as
$$\dot{\varphi}_k = \frac{1}{Z_k}\sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j) - 1)} \sin(\varphi_j - \varphi_k) = \frac{1}{Z_k}\sum_{j=1}^{k} h(\varphi_j - \varphi_k), \tag{CSA-2d}$$
with interaction potential given by
$$h(x) := e^{\beta(\cos(x) - 1)} \sin x, \qquad Z_k = \sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j) - 1)}. \tag{Int Pot}$$

5.1 Rényi Parking

The prediction of the locations of meta-stable clustering centers exhibits a notable connection to the Rényi parking problem. Consider a sequence of tokens $(x_j)_{j\ge 1}$ on the sphere $S^{d-1}$ equipped with the geodesic distance $\mathrm{dist}$.
For a fixed separation parameter $\delta > 0$, we define:

Rényi centers as the subsequence $(x_{s_j})_{j\ge 1}$, where $(s_j)_{j\ge 1}$ is strictly increasing and satisfies $\mathrm{dist}(x_{s_j}, x_{s_i}) > \delta$ for all $i < j$;

Strong Rényi centers as the subsequence $(x_{s_j})_{j\ge 1}$ satisfying $\mathrm{dist}(x_{s_j}, x_i) > \delta$ for all $i < s_j$.

By construction, the set of strong Rényi centers forms a subset of the Rényi centers. As demonstrated in Section A.2, particles in our system exert maximal attractive force at distances of order at most $\beta^{-1/2}$, with rapid decay beyond this scale.

Figure 3: Total percentage of particles consumed by Rényi and strong Rényi centers over time. Here we have plotted the average and the 0.1 and 0.9 quantiles over 5000 experiments with $n = 200$, $d = 2$, $\beta = 64$, $\delta = 4\beta^{-1/2}$.

For strong Rényi centers defined with separation parameter $\delta = c\beta^{-1/2}$ (where $c$ is sufficiently large), this decay ensures negligible influence from the preceding particles, so they remain stable for a long time, a phenomenon formally established in Lemma 2. This metastability, coupled with the rapid aggregation of nearby tokens, indicates that strong Rényi centers serve as primary attractors for subsequent tokens. Rényi centers are essentially unaffected by previous particles other than previous Rényi centers, thereby generating new clustering centers. For fixed $\delta$, there are more Rényi centers than strong Rényi centers (see Section C.4 for an exact cardinality analysis). While Rényi centers better capture the meta-stable clustering effect, as illustrated in Figures 3 and 2, they lack positional stability and may converge to other centers over time. Although Rényi centers rapidly aggregate a large fraction of particles, some of these particles continue to migrate and eventually converge to strong Rényi centers. The next result shows that strong Rényi centers remain nearly fixed for a long time.

Lemma 2. Let $d = 2$ and $Q = K = V = I_2$. Consider a subsequence of strong Rényi centers $x_{s_1}, \dots, x_{s_m}$ satisfying the separation condition with constants $\varepsilon, c > 0$:
$$\min_{i < s_j} \mathrm{dist}\big(x_{s_j}(0), x_i(0)\big) \ge c(1 + 2\varepsilon)\,\beta^{-1/2}.$$
Assume that
$$c > \beta^{1/2} \arccos\!\Big(\frac{-1 + \sqrt{4\beta^2 + 1}}{2\beta}\Big). \tag{1}$$
Then for any time $T_j$ such that
$$T_j\, s_j\, h(c\beta^{-1/2}) < \varepsilon c \beta^{-1/2}, \tag{2}$$
where the interaction potential $h$ is defined in (Int Pot), the displacement of each center is bounded by
$$\max_{t \in [0, T_j]} \big|x_{s_j}(t) - x_{s_j}(0)\big| < \varepsilon c \beta^{-1/2}.$$

The key observation driving this result is that strong Rényi centers are weakly affected by all previous particles. However, though this is correct on short time scales, it needs to be checked for all times in $[0, T_j]$. A complete proof can be found in Section C.1.

Remark 1. Using the properties of $h$ derived in Section A.2, it can be shown that for $\beta > 1$ a sufficient condition for (1) to hold is simply $c > 1$, and a sufficient condition for (2) to hold is $T_j s_j < e^{c^2/2 - c^4/(24\beta)}\,\varepsilon$. Moreover, it is easy to prove that the indices $s_j$ are mostly small. Thus, we see that early strong Rényi centers are almost stationary for a time that is exponential in the square of the separation magnitude.

Rényi centers and strong Rényi centers play a fundamental role in meta-stable clustering, warranting an analysis of their properties as extreme points of a sequence. While defined here using the geodesic distance on a sphere, the definition extends naturally to distances induced by $\langle Qx, Ky\rangle$ under appropriate conditions. This generalization aligns with our observation that meta-stable clustering occurs in the subspace $L$ to which $V$ sends tokens and on which it acts as the identity. The distribution of these centers under various initialization schemes presents a key analytical challenge.
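For $d = 2$, the two definitions can be implemented directly by a greedy pass over the sequence. The sketch below (our own code, using angular distance on $S^1$; names and the sample parameters are illustrative only) returns the indices of Rényi and strong Rényi centers for a sample of angles.

```python
import numpy as np

def circle_dist(a, b):
    """Geodesic distance on S^1 between angles a and b."""
    d = np.abs(a - b) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

def renyi_centers(phi, delta):
    """Indices s_j such that phi[s_j] is delta-far from all previously selected centers."""
    centers = []
    for k, p in enumerate(phi):
        if all(circle_dist(p, phi[s]) > delta for s in centers):
            centers.append(k)
    return centers

def strong_renyi_centers(phi, delta):
    """Indices s_j such that phi[s_j] is delta-far from *all* earlier particles."""
    return [k for k, p in enumerate(phi)
            if all(circle_dist(p, phi[i]) > delta for i in range(k))]

rng = np.random.default_rng(1)
beta, n = 64.0, 200
phi = rng.uniform(0, 2 * np.pi, n)
delta = 4 / np.sqrt(beta)
print(len(renyi_centers(phi, delta)), len(strong_renyi_centers(phi, delta)))
```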
Section C.4 addresses the uniform i.i.d. case, where Rényi's classical result characterizes the expected number of centers. Extensions to general distributions and Markov processes, more relevant to language processing applications, remain open for Rényi centers due to their structural complexity, particularly in higher dimensions ($d > 2$). In contrast, strong Rényi centers are much easier to handle: even our computation of the average number of centers in Section C.4 works for any distribution, regardless of the dimension.

5.2 Fixed Meta-stable Clustering Centers

Having established the existence of $O(\sqrt{\beta})$ quasi-stationary tokens for $d = 2$ and $n \gg 1$, we next examine their role as cluster centers. While Figures 3 and 2 provide substantial numerical evidence that these tokens attract and aggregate nearby particles, a rigorous proof remains elusive. We establish instead a weaker result: when the quasi-stationary tokens are artificially frozen (analogous to cross-attention in encoder-decoder architectures), all other tokens converge to these frozen centers. This simplified model, while instructive, differs from true meta-stable clustering in important aspects detailed in Section 6. We only state our result for $d = 2$ and identity parameter matrices, as in (CSA-2d).

Theorem 5.1 (Clustering to frozen tokens for $K = Q = V = I_2$). Let $\theta_1, \dots, \theta_m$ be fixed tokens that are well separated, namely $|\theta_i - \theta_j| > c\beta^{-1/2}$ for $i \ne j$. Let $\mu_0$ be an absolutely continuous probability measure on $(S^1)^n$ and let $\varphi_1(0), \dots, \varphi_n(0) \sim \mu_0$. Consider the causal attention dynamics (CSA-2d) with additional influence from the fixed tokens $\theta_j$, which enter the evolution with additional weights $a_j \ge 1$. Specifically,
$$\dot{\varphi}_k = \frac{1}{Z_k}\left(\sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j) - 1)}\sin(\varphi_j - \varphi_k) + \sum_{j=1}^{m} a_j e^{\beta(\cos(\varphi_k - \theta_j) - 1)}\sin(\theta_j - \varphi_k)\right),$$
$$Z_k = \sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j) - 1)} + \sum_{j=1}^{m} a_j e^{\beta(\cos(\varphi_k - \theta_j) - 1)}.$$
Define $N = n + \sum_{j=1}^{m} a_j$ and $g = h'$, where $h$ is the interaction potential of (Int Pot). If $N$, $\beta$, $\varepsilon > 0$, and $c > 2 + 2\varepsilon$ satisfy
$$N\, h\big((c - 1 - 2\varepsilon)\beta^{-1/2}\big) < h\big(\varepsilon\beta^{-1/2}\big) \quad\text{and}\quad N\, \big|g\big((c - 2\varepsilon)\beta^{-1/2}\big)\big| < g\big(\varepsilon\beta^{-1/2}\big),$$
then with probability one, $\varphi(t)$ converges to an asymptotically stable critical point $\varphi^* \in (S^1)^n$ satisfying: $\forall k \in [n]$, $\exists j \in [m]$ such that $|\varphi^*_k - \theta_j| < \varepsilon\beta^{-1/2}$.

Since our dynamical system is not a gradient flow, the classical Łojasiewicz convergence theorem does not apply. Instead, we establish convergence by observing that the causal dynamics (both with and without frozen tokens) are, in fact, a sequential gradient flow, where each particle minimizes a slightly different energy potential. For such systems on $S^1$, we demonstrate convergence through an alternative approach that circumvents the Łojasiewicz framework.

Lemma 3. Consider a system of $n$ particles on $S^1$ with angular coordinates $\varphi_1, \dots, \varphi_n \in \mathbb{R}/2\pi\mathbb{Z}$ evolving according to
$$\dot{\varphi}_k = -\frac{1}{Z_k(\varphi_1, \dots, \varphi_k)}\, \partial_{\varphi_k} E_k(\varphi_1, \dots, \varphi_k),$$
where $E_1, \dots, E_n$ are $C^1$ energy functions and the $Z_k$ are $C^1$ normalization factors bounded by $0 < c < Z_k(\varphi) < C$. Assume:
1. each $E_k$ has isolated critical points in $\varphi_k$ for fixed $\varphi_j$, $j \ne k$ (satisfied under analyticity);
2. for any $k \in [n]$, critical points of the system restricted to the first $k$ particles are either strongly stable (the Jacobian has only eigenvalues with strictly negative real parts) or strongly unstable (there is an eigenvalue with a strictly positive real part).
Then for almost every initial condition $\varphi(0)$ with respect to the Lebesgue measure, $\varphi(t)$ converges to a strongly stable critical point $\varphi^*$.

The proof is deferred to Section C.2.

Remark 2. The conditions of Theorem 5.1 are satisfied under the following explicit bounds: $\varepsilon < 0.1$, $c \ge 5.5 + 2\varepsilon$, $\beta \ge (c - 1 - 2\varepsilon)^2/2$, and $N \le \frac{\varepsilon}{c-1}\exp\big(3(c - 1 - 2\varepsilon)^2/8\big)$. Note that this result requires only $\beta \gtrsim \log N$. For example, taking $\varepsilon = 0.1$ and $c = 6.5$ yields that $\beta \ge 14$ and $N \le 700$ is sufficient. See Lemma 5 for the proof.
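The explicit bounds of Remark 2 can also be checked numerically. The sketch below (our own check, not from the paper) evaluates $h$ and $g$ at the relevant arguments for $\varepsilon = 0.1$, $c = 6.5$, $\beta = 14$, $N = 700$ and verifies the two conditions of Theorem 5.1, taking the magnitude of $g$ at the larger argument where it is negative.

```python
import numpy as np

def h(x, beta):   # interaction potential (Int Pot)
    return np.exp(beta * (np.cos(x) - 1.0)) * np.sin(x)

def g(x, beta):   # its derivative h'(x)
    return np.exp(beta * (np.cos(x) - 1.0)) * (np.cos(x) - beta * np.sin(x) ** 2)

eps, c, beta, N = 0.1, 6.5, 14.0, 700
s = beta ** -0.5
cond1 = N * h((c - 1 - 2 * eps) * s, beta) < h(eps * s, beta)
# g is negative at the larger argument, so its magnitude is what must be small
cond2 = N * abs(g((c - 2 * eps) * s, beta)) < g(eps * s, beta)
print(cond1, cond2)   # both print True for these parameter values
```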
6 Limitations

Our analysis presents both theoretical and practical limitations. From a theoretical perspective, we establish two key results: (1) strong Rényi centers maintain quasi-stationarity for time scales of order $\exp(c^2/2)$ per Lemma 2, and (2) exactly stationary centers attract all remaining particles (Theorem 5.1). However, this falls short of proving meta-stable clustering, as Theorem 5.1 provides no bounds on the convergence time. Consequently, we cannot guarantee that strong Rényi centers remain sufficiently stationary during particle aggregation. A complete meta-stability theory would require demonstrating that each Rényi center captures $\Omega(n)$ particles in $O(1)$ time as $n \to \infty$. Currently, even the weaker claim of capturing $\omega(1)$ particles remains unproven, presenting a crucial direction for future research.

The practical limitations stem from two model simplifications: the use of tied weights across layers (though supported by successful implementations, see Lan et al. (2020)), and the omission of the MLP layer central to Transformer architectures. Incorporating the MLP dynamics into our theoretical framework remains a significant open challenge.

Acknowledgments

The work of NK and YP was partially supported by the MIT-IBM Watson AI Lab and the National Science Foundation under Grant No. CCF-2131115. PR is supported by NSF grants DMS-2022448 and CCF-2106377, and a gift from Apple.

References

Abdalla, P., Bandeira, A. S., Kassabov, M., Souza, V., Strogatz, S. H., and Townsend, A. (2022). Expander graphs are globally synchronising. arXiv preprint arXiv:2210.12788.

Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., Ronneberger, O., Willmore, L., Ballard, A. J., Bambrick, J., Bodenstein, S. W., Evans, D. A., Hung, C.-C., O'Neill, M., Reiman, D., Tunyasuvunakool, K., Wu, Z., Žemgulytė, A., Arvaniti, E., Beattie, C., Bertolli, O., Bridgland, A., Cherepanov, A., Congreve, M., Cowen-Rivers, A. I., Cowie, A., Figurnov, M., Fuchs, F. B., Gladman, H., Jain, R., Khan, Y. A., Low, C. M. R., Perlin, K., Potapenko, A., Savy, P., Singh, S., Stecula, A., Thillaisundaram, A., Tong, C., Yakneen, S., Zhong, E. D., Zielinski, M., Žídek, A., Bapst, V., Kohli, P., Jaderberg, M., Hassabis, D., and Jumper, J. M. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature.

Agrachev, A. and Letrouit, C. (2024). Generic controllability of equivariant systems and applications to particle systems and neural networks. arXiv preprint arXiv:2404.08289.

Bruno, G., Pasqualotto, F., and Agazzo, A. (2024). Emergence of meta-stable clustering in mean-field transformer models.

Castin, V., Ablin, P., and Peyré, G. (2024). How smooth is attention? In International Conference on Machine Learning.

Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K. P., Freeman, W. T., Rubinstein, M., Li, Y., and Krishnan, D. (2023). Muse: Text-to-image generation via masked generative transformers. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 4055–4075. PMLR.

Cowsik, A., Nebabu, T., Qi, X.-L., and Ganguli, S. (2024). Geometric dynamics of signal propagation predict trainability of transformers. arXiv preprint arXiv:2403.02579.
Criscitiello, C., McRae, A., Rebjock, Q., and Boumal, N. (2024). Synchronization on circles and spheres with nonlinear interactions. arXiv preprint arXiv:2405.18273.

Dong, Y., Cordonnier, J.-B., and Loukas, A. (2021). Attention is not all you need: pure attention loses rank doubly exponentially with depth. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 2793–2803. PMLR.

Dvoretzky, A. and Robbins, H. (1964). On the parking problem. Publ. Math. Inst. Hung. Acad. Sci., 9:209–224.

Esser, P., Rombach, R., and Ommer, B. (2021). Taming transformers for high-resolution image synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12868–12878, Los Alamitos, CA, USA. IEEE Computer Society.

Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. (2023a). The emergence of clusters in self-attention dynamics. Advances in Neural Information Processing Systems, 36:57026–57037.

Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. (2023b). A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794.

Haraux, A. (2012). Some applications of the Łojasiewicz gradient inequality. Communications on Pure and Applied Analysis, 11(6):2417–2427.

Koubbi, H., Geshkovski, B., Polyanskiy, Y., and Rigollet, P. (2024). Dynamic metastability in the self-attention model. arXiv preprint arXiv:2410.06833.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2020). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.

Łojasiewicz, S. (1962). Une propriété topologique des sous-ensembles analytiques réels. In Colloques internationaux du C.N.R.S.: Les équations aux dérivées partielles, pages 87–89, Paris. Editions du C.N.R.S.

Łojasiewicz, S. (1965). Ensembles semi-analytiques. Preprint, I.H.E.S., Bures-sur-Yvette.

Łojasiewicz, S. (1984). Sur les trajectoires du gradient d'une fonction analytique. In Geometry seminars, 1982–1983 (Bologna, 1982–1983), pages 115–117, Univ. Stud. Bologna, Bologna.

Markdahl, J., Thunberg, J., and Gonçalves, J. (2017). Almost global consensus on the n-sphere. IEEE Transactions on Automatic Control, 63(6):1664–1675.

Paul, T. and Trélat, E. (2024). From microscopic to macroscopic scale dynamics: mean field, hydrodynamic and graph limits.

Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. (2021). Zero-shot text-to-image generation. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR.

Rényi, A. (1958). On a one-dimensional problem concerning space-filling. Publ. Math. Inst. Hungar. Acad. Sci., 3:109–127.

Sander, M. E., Ablin, P., Blondel, M., and Peyré, G. (2022). Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pages 3515–3530. PMLR.

Shub, M. (2013). Global stability of dynamical systems. Springer Science & Business Media.

Strogatz, S. H. (2000). From Kuramoto to Crawford: exploring the onset of synchronization in populations of coupled oscillators. Physica D: Nonlinear Phenomena, 143(1-4):1–20.

Team, C. (2024). Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Vuckovic, J., Baratin, A., and des Combes, R. T. (2021). On the regularity of attention. arXiv preprint arXiv:2102.05628.

Wu, X., Ajorlou, A., Wu, Z., and Jadbabaie, A. (2023). Demystifying oversmoothing in attention-based graph neural networks. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S., editors, Advances in Neural Information Processing Systems, volume 36, pages 35084–35106. Curran Associates, Inc.

Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., Hutchinson, B., Han, W., Parekh, Z., Li, X., Zhang, H., Baldridge, J., and Wu, Y. (2022). Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research. Featured Certification.

Zhang, B. and Sennrich, R. (2019). Root mean square layer normalization. In Proceedings of the 33rd International Conference on Neural Information Processing Systems.

Zhang, X., Jiang, R., Gao, W., Willett, R., and Maire, M. (2024). Residual connections harm self-supervised abstract feature learning. arXiv preprint arXiv:2404.10947.

A Supplementary Material

A.1 From Transformers to ODEs

The derivation of equation (SA) was thoroughly described in Geshkovski et al. (2023b), but for completeness we briefly repeat it here to explain how the problem arises. In general, a typical Transformer architecture consists of repeated layers of multi-head attention, multi-layer perceptrons (MLP), normalization, and residual connections (Vaswani et al., 2017). In this work, we simplify this setting by focusing only on the geometric behavior of a single-head attention layer with normalization and residual connections, omitting the MLP for brevity.

One head of a standard attention layer is defined as follows. Given an input sequence represented by the token embeddings $X \in \mathbb{R}^{n\times d}$, where $n$ is the number of tokens and $d$ is the dimension of the embedding space, and matrices $W_Q, W_K, W_V$ used to compute queries, keys, and values, the attention mechanism computes a weighted sum of values based on their relevance to a query in the following form:
$$\mathrm{Attention}(X) = \mathrm{softmax}\!\left(\frac{X W_Q W_K^\top X^\top}{\sqrt{d}}\right) X W_V.$$
By adding the RMS normalization from Zhang and Sennrich (2019) and a residual connection, the transformation from layer $t$ to layer $t + 1$ is given by:
$$X_{t+1} = X_t + \mathrm{RMSNorm}\big(\mathrm{Attention}(X_t)\big). \tag{3}$$
Here, different tokens are represented as rows of the matrix $X$ for computational reasons. For consistency with the convention that vectors are represented as columns, we transpose everything and denote a sequence of tokens, encoded as particles in the $d$-dimensional embedding space $\mathbb{R}^d$, by $(x_1, \dots, x_n)$, corresponding to the columns of $X^\top$. Additionally, to simplify the notation we denote $V := W_V^\top$, $Q := W_Q^\top$, and $K := W_K^\top$, and introduce an arbitrary temperature parameter $\beta$ instead of the fixed scaling factor $1/\sqrt{d}$. With these notational adjustments, one term of the attention added to the $k$-th token can be written explicitly as:
$$\mathrm{attn}(x_1, \dots, x_n)_k = \frac{1}{Z_k}\sum_{j=1}^{n} e^{\beta\langle Q x_k, K x_j\rangle}\, V x_j, \qquad Z_k = \sum_{j=1}^{n} e^{\beta\langle Q x_k, K x_j\rangle}.$$
Equation (3) can be interpreted as a discrete derivative, with $X_{t+1} - X_t$ representing the difference between layers.
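A minimal sketch of the discrete update (3) is given below (our own code, not the authors'; the real architecture also has multiple heads, trainable RMSNorm gains, and an MLP, all of which are omitted here as in the paper).

```python
import numpy as np

def rms_norm(X, eps=1e-8):
    """Row-wise RMS normalization (Zhang and Sennrich, 2019), trainable gain omitted."""
    return X / np.sqrt(np.mean(X ** 2, axis=1, keepdims=True) + eps)

def attention(X, WQ, WK, WV, beta, causal=True):
    """Single-head attention with temperature beta in place of the 1/sqrt(d) scaling."""
    n = X.shape[0]
    logits = beta * (X @ WQ) @ (X @ WK).T
    if causal:
        logits = np.where(np.tril(np.ones((n, n), dtype=bool)), logits, -np.inf)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ (X @ WV)

def layer(X, WQ, WK, WV, beta, causal=True):
    """One residual block, X_{t+1} = X_t + RMSNorm(Attention(X_t)), cf. equation (3)."""
    return X + rms_norm(attention(X, WQ, WK, WV, beta, causal=causal))
```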
Therefore, the trajectory $X_t$ can be viewed as a discretization of a continuous flow. RMS normalization ensures that tokens remain on a scaled unit sphere, but after properly rescaling $Q, K, V$ we can assume that they stay on the standard unit sphere $S^{d-1}$. Combining all these observations, the dynamics of token propagation through layers can be expressed as
$$\dot{x}_k(t) = \frac{1}{Z_k(t)}\, P_{x_k(t)}\!\left(\sum_{j=1}^{n} e^{\beta\langle Q(t) x_k(t), K(t) x_j(t)\rangle}\, V(t) x_j(t)\right), \qquad Z_k(t) = \sum_{j=1}^{n} e^{\beta\langle Q(t) x_k(t), K(t) x_j(t)\rangle},$$
with the projector $P_x(y) := y - \langle x, y\rangle x / |x|^2$ ensuring that $x_k$ remains on the sphere. This leads to equation (SA), and applying the causal constraint, where each token attends only to the previous ones, transforms it into the causal attention equation (CSA) studied in this work.

A.2 Interaction Potential

For completeness, here we describe the key properties of the interaction potential $h(x) = e^{\beta(\cos(x) - 1)}\sin(x)$ from (Int Pot), which defines the interactions between particles, and of its derivative $g(x) = h'(x)$.

Lemma 4 (Properties of Interaction Functions). Let $h(x) = e^{\beta(\cos x - 1)}\sin x$ and $g(x) = h'(x)$. Then:
1. $h(x)$ is odd and $g(x)$ is even.
2. $h(x)$ is increasing on $[0, \tau_\beta]$ and decreasing on $[\tau_\beta, \pi]$, where $\cos\tau_\beta = \frac{-1 + \sqrt{4\beta^2 + 1}}{2\beta}$, and for $\beta \ge 1$, $(\beta + 1/2)^{-1/2} < \tau_\beta < \beta^{-1/2}$.
3. For $x > 0$, $h(x)$ is bounded by $e^{-\beta x^2/2}(x - x^3/6) < h(x) < e^{-\beta x^2/2 + \beta x^4/24}\, x$.
4. For $g(x)$, the following bounds hold: $g(x) > e^{-\beta x^2/2}(1 - x^2/2 - \beta x^2)$ for $0 < x < (\beta + 1/2)^{-1/2}$, and $g(x) > -e^{-\beta x^2/2 + \beta x^4/24}\,\beta x^2$ for $x > 0$.

Proof. 1. The oddness of $h$ and the evenness of $g$ follow directly from their definitions.
2. Computing $g(x)$ explicitly:
$$g(x) = e^{\beta(\cos(x) - 1)}\big(-\beta\sin^2(x) + \cos(x)\big) = e^{\beta(\cos(x) - 1)}\big(\beta\cos^2(x) + \cos(x) - \beta\big).$$
The sign of $g(x)$ changes at $\tau_\beta$, where $\cos\tau_\beta = \frac{-1 + \sqrt{4\beta^2 + 1}}{2\beta}$, establishing that $h(x)$ increases on $[0, \tau_\beta]$ and decreases on $[\tau_\beta, \pi]$. For the lower bound on $\tau_\beta$: for $x < (\beta + 1/2)^{-1/2}$, we have $\cos x > 1 - x^2/2 > \beta x^2 > \beta\sin^2 x$, implying $g(x) > 0$ in this region. For the upper bound on $\tau_\beta$, fix $z = \beta^{-1/2}$ and observe that it suffices to show
$$\cos\tau_\beta = \frac{-1 + \sqrt{4\beta^2 + 1}}{2\beta} > \cos\beta^{-1/2} = \cos z.$$
Using $\cos z < 1 - z^2/2 + z^4/24$, this reduces to verifying $\sqrt{1 + z^4/4} > 1 + z^4/24$, which holds for $z < 3.13$ and is thus satisfied when $\beta = z^{-2} \ge 1$.
3. and 4. The bounds on $h$ and $g$ follow from the standard inequalities $x - x^3/6 < \sin x < x$ and $1 - x^2/2 < \cos x < 1 - x^2/2 + x^4/24$, combined with our characterization of the sign of $g$ via $\tau_\beta$.

We now turn to the proof of Remark 2.

Lemma 5. Let $N, c, \varepsilon$, and $\beta$ satisfy the bounds of Remark 2: $\varepsilon < 0.1$, $c \ge 5.5 + 2\varepsilon$, $\beta \ge (c - 1 - 2\varepsilon)^2/2$, and $N < e^{3(c - 1 - 2\varepsilon)^2/8}\,\frac{\varepsilon}{c - 1}$. Then
$$N\, h\big((c - 1 - 2\varepsilon)\beta^{-1/2}\big) < h\big(\varepsilon\beta^{-1/2}\big) \quad\text{and}\quad N\, \big|g\big((c - 2\varepsilon)\beta^{-1/2}\big)\big| < g\big(\varepsilon\beta^{-1/2}\big).$$

Proof. Let $r := c - 1 - 2\varepsilon$. From the assumptions, we have $r \ge 4.5$, $\beta \ge r^2/2 > 10$, and $\varepsilon < 0.1$. We must verify
$$N < \frac{h(\varepsilon\beta^{-1/2})}{h(r\beta^{-1/2})} \quad\text{and}\quad N < \frac{g(\varepsilon\beta^{-1/2})}{\big|g\big((r + 1)\beta^{-1/2}\big)\big|}.$$
Using the bounds from Lemma 4, these inequalities reduce to
$$N < \frac{\exp(-\varepsilon^2/2)\,\varepsilon\beta^{-1/2}\,\big(1 - \varepsilon^2/(6\beta)\big)}{\exp\!\big(-r^2/2 + r^4/(24\beta)\big)\, r\beta^{-1/2}}, \qquad N < \frac{\exp(-\varepsilon^2/2)\,\big(1 - \varepsilon^2/(2\beta) - \varepsilon^2\big)}{\exp\!\big(-(r + 1)^2/2 + (r + 1)^4/(24\beta)\big)\,(r + 1)^2}.$$
Given $N < \exp(3r^2/8)\,\varepsilon/r$, it suffices to verify the two inequalities for this value of $N$.
1. First inequality: taking logarithms, we need $r^2/8 - r^4/(24\beta) \ge \varepsilon^2/2 - \ln(1 - \varepsilon^2/(6\beta))$. Since $\beta \ge r^2/2$, the left-hand side is at least $r^2/24$, while for $\varepsilon < 0.1$ and $\beta > 10$ the right-hand side is smaller than $1/50$, so the inequality holds for $r \ge 4.5$.
2. Second inequality: after simplification, using $\beta \ge r^2/2$, $\beta \ge 10$, and $\varepsilon < 0.1$, we need
$$f(r) := \frac{3r^2}{8} - \frac{(r + 1)^2}{2} + \frac{(r + 1)^4}{12 r^2} + \frac{1}{200} + 2\ln(r + 1) - \ln(9.8\, r) < 0.$$
The derivative $f'(r) = \frac{3r}{4} - (r + 1) + \frac{(r + 1)^3(r - 1)}{6 r^3} + \frac{2}{r + 1} - \frac{1}{r}$ is negative for $r \ge 4.5$ (using $r > 3$ and $2r > r + 1$), so $f$ is decreasing there. Therefore, it suffices to verify $f(4.5) \approx -4.14 < 0$.

B Final configuration

B.1 Proof of Lemma 1

Let us show that the trajectories $x(t)$ of our system can be characterized as normalized solutions of a linear homogeneous ODE in $\mathbb{R}^d$.
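Before the argument, here is a quick numerical sanity check of this claim (our own sketch, not part of the proof): normalizing the solution of the linear ODE $\dot{y} = Vy$ and integrating the projected dynamics directly produce the same trajectory, up to discretization error.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
V = rng.standard_normal((4, 4))
x0 = rng.standard_normal(4)
x0 /= np.linalg.norm(x0)

dt, steps = 1e-3, 20_000
# (a) normalize the solution of the linear ODE: s(T) = e^{TV} x0 / |e^{TV} x0|
s = expm(dt * steps * V) @ x0
s /= np.linalg.norm(s)
# (b) integrate the projected dynamics x' = P_x(Vx) directly (explicit Euler + renormalization)
x = x0.copy()
for _ in range(steps):
    v = V @ x
    x += dt * (v - np.dot(x, v) * x)
    x /= np.linalg.norm(x)
print(np.linalg.norm(x - s))   # small: the two trajectories agree up to discretization error
```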
Consider the solution $y(t)$ of
$$\dot{y}(t) = V y(t), \qquad y(0) = x(0). \tag{4}$$
For $s(t) := y(t)/\|y(t)\|$, we derive
$$\dot{s}(t) = \frac{\dot{y}(t)}{\|y(t)\|} - \frac{y(t)\,\langle y(t), \dot{y}(t)\rangle}{\|y(t)\|^{3}} = V s(t) - \langle s(t), V s(t)\rangle\, s(t) = P_{s(t)}\big(V s(t)\big).$$
Thus $x(t) \equiv s(t) = y(t)/\|y(t)\|$. The solution to (4) has the following explicit form. Let $\{J_k\}$ denote the Jordan blocks of $V$, with sizes $n_k$, eigenvalues $\lambda_k$, and generalized eigenvectors $\{\xi^k_1, \dots, \xi^k_{n_k}\}$. Then
$$y(t) = \sum_{k} e^{\lambda_k t} \sum_{i=1}^{n_k}\Big(\sum_{j=i}^{n_k} c^k_j\, \frac{t^{\,j-i}}{(j-i)!}\Big)\, \xi^k_i, \tag{5}$$
where the coefficients $\{c^k_j\}$ satisfy $\sum_k \sum_{j=1}^{n_k} c^k_j\, \xi^k_j = x(0)$. For almost all initial conditions $x(0)$ with respect to the surface measure on the sphere, all coefficients $c^k_j$ are non-zero. For complex eigenvalues $\lambda_k$, we combine conjugate terms to obtain real-valued solutions involving trigonometric functions. The asymptotic behavior follows from two observations: (i) terms with the largest $\Re(\lambda_k)$ dominate as $t \to \infty$, corresponding to convergence to $L \cap S^{d-1}$ at an exponential rate, and (ii) among these terms, those with the highest power of $t$ (i.e., the $t^{n_k - 1}\xi^k_1$ terms) determine the slower convergence to $L' \cap S^{d-1}$.

B.2 Proof of Theorem 4.1

We begin with a simple geometric lemma.

Lemma 6. Let $x, y, z \in \mathbb{R}^d$ with $\|x\| = \|y\| = 1$. If $|\langle y, z\rangle| \le \langle x, z\rangle$, then
$$\langle x, z\rangle \ge \langle x, y\rangle\,\langle y, z\rangle,$$
with equality if and only if either (i) $\langle x, z\rangle = 0$, or (ii) $|\langle y, z\rangle| = \langle x, z\rangle$ and $\langle x, y\rangle = \mathrm{sign}(\langle y, z\rangle)$.

Proof. By the Cauchy–Schwarz inequality and the hypothesis,
$$\langle x, y\rangle\,\langle y, z\rangle \le |\langle x, y\rangle|\,|\langle y, z\rangle| \le |\langle x, y\rangle|\,\langle x, z\rangle \le \langle x, z\rangle,$$
where the last inequality follows since $|\langle x, y\rangle| \le 1$ for unit vectors. The equality conditions follow from examining when each inequality in the chain becomes an equality.

We continue with the proof of Theorem 4.1.

Proof. The system of particles is governed by the equations
$$\dot{x}_k = \frac{1}{\sum_{j=1}^{k} e^{\beta\langle Q x_k, K x_j\rangle}} \sum_{j=1}^{k} e^{\beta\langle Q x_k, K x_j\rangle}\big(x_j - \langle x_j, x_k\rangle x_k\big),$$
where we omit $t$ from the notation for simplicity. This system is autonomous, so we first explore its critical points and their stability. For autonomous systems with established convergence, it is well known that for any absolutely continuous initialization, the limiting point is strongly unstable with probability zero (see (Shub, 2013, Thm. III.7, Ex. III.3) and (Geshkovski et al., 2023b, Lemma B.1)). Note that the proof in Geshkovski et al. (2023b) is stated for gradient ascent dynamics, but it readily extends to any smooth autonomous dynamics on a compact Riemannian manifold. Write
$$f_k(x) := \frac{1}{\sum_{j=1}^{k} e^{\beta\langle Q x_k, K x_j\rangle}} \sum_{j=1}^{k} e^{\beta\langle Q x_k, K x_j\rangle}\big(x_j - \langle x_j, x_k\rangle x_k\big).$$
We aim to (i) find stationary points $x$ where all $f_k(x) = 0$ and (ii) analyze the eigenvalues of the Jacobian $(\partial f_k/\partial x_j)$ at said stationary points. Any critical point must satisfy one of the following: either $x_1 = \dots = x_n = \xi$ for some $\xi \in S^{d-1}$, or there exists $s \in \{2, \dots, n\}$ such that $x_1 = \dots = x_{s-1} = \xi$ and $x_s = -\xi$. Indeed, if the first condition fails, consider the first token $x_s$ with $x_s \ne x_1$. Then $f_s(x) = 0$ implies $\xi - \langle \xi, x_s\rangle x_s = 0$, forcing $x_s = \pm\xi$, so that $x_s = -\xi$ since we required $x_s \ne x_1$.

Our goal is to show that stationary points of the second kind are limiting points with probability zero with respect to the initialization distribution. Observe that since the system formed by the first $s$ particles is independent of the subsequent ones, it suffices to show
$$\mathbb{P}\big((x_1, \dots, x_{s-1}, x_s) \to (\xi, \dots, \xi, -\xi)\big) = 0.$$
Since $x_1(t) = x_1(0)$, this reduces to
$$\mathbb{P}\big(x_1 = \xi,\ (x_2, \dots, x_{s-1}, x_s) \to (\xi, \dots, \xi, -\xi)\big) = 0.$$
By the law of total probability, it suffices to show that for almost all $\xi \in S^{d-1}$,
$$\mathbb{P}_{x_2, \dots, x_s \mid x_1 = \xi}\big((x_2, \dots, x_{s-1}, x_s) \to (\xi, \dots, \xi, -\xi)\big) = 0.$$
We are left to analyze the function $f_s$ around $(x_1, \dots, x_{s-1}, x_s) = (\xi, \dots, \xi, -\xi)$. Observe that
$$f_s(\xi, \dots, \xi, x_s) = w(\xi, x_s)\big(\xi - \langle \xi, x_s\rangle x_s\big), \qquad w(\xi, x_s) = \frac{(s-1)\, e^{\beta\langle Q x_s, K\xi\rangle}}{(s-1)\, e^{\beta\langle Q x_s, K\xi\rangle} + e^{\beta\langle Q x_s, K x_s\rangle}} > 0.$$
Observe that the Jacobian $(\partial f_k/\partial x_j)$ is block lower triangular, with diagonal blocks given by $\partial f_k/\partial x_k$. We show below that $\partial f_s/\partial x_s$ has an eigenvalue with positive real part, which is sufficient to establish strong instability. At $x_2 = \dots = x_{s-1} = \xi$, we have, as above,
$$f_s(\xi, \dots, \xi, x_s) = w(\xi, x_s)\big(\xi - \langle \xi, x_s\rangle x_s\big), \qquad w(\xi, x_s) = \frac{(s-1)\, e^{\beta\langle Q x_s, K\xi\rangle}}{(s-1)\, e^{\beta\langle Q x_s, K\xi\rangle} + e^{\beta\langle Q x_s, K x_s\rangle}} > 0.$$
The classical Jacobian in $\mathbb{R}^d$ is
$$\partial_{x_s} f_s(\xi, \dots, \xi, x_s) = \big(\xi - \langle \xi, x_s\rangle x_s\big)\, \partial_{x_s} w(\xi, x_s)^\top + w(\xi, x_s)\, \partial_{x_s}\big(\xi - \langle \xi, x_s\rangle x_s\big).$$
Hence, since $\xi - \langle\xi, x_s\rangle x_s = 0$ at $x_s = -\xi$,
$$\partial_{x_s} f_s(\xi, \dots, \xi, x_s)\big|_{x_s = -\xi} = w(\xi, -\xi)\, \partial_{x_s}\big(\xi - \langle \xi, x_s\rangle x_s\big)\big|_{x_s = -\xi}.$$
The spherical Jacobian is obtained by projecting onto the tangent space $\xi^\perp$ and is given by
$$\partial_{x_s} f_s(\xi, \dots, \xi, x_s)\big|_{x_s = -\xi} = w(\xi, -\xi)\,(I - \xi\xi^\top).$$
This linear operator acts on $\xi^\perp$ and has the eigenvalue $w(\xi, -\xi)$ with multiplicity $d - 1$, which is real and positive, confirming strong instability.

By the center-stable manifold theorem (Shub, 2013, Thm. III.7, Ex. III.3), if a point has at least one eigenvalue with positive real part, then: (i) the center-stable manifold $W^{cs}_{loc}$ has positive co-dimension, (ii) points converging to this equilibrium must enter $W^{cs}_{loc}$ at some finite time, and (iii) the set of such points has measure zero. More precisely, if a trajectory $(x_2, \dots, x_s)$ converges to $(\xi, \dots, \xi, -\xi)$, then
$$\exists m \in \mathbb{Z}_{\ge 0}:\ (x_2(m), \dots, x_s(m)) \in W^{cs}_{loc}.$$
Since our flow is a diffeomorphism, the pre-image of $W^{cs}_{loc}$ is also a manifold of positive codimension. Therefore, the set of initial conditions leading to convergence to $(\xi, \dots, \xi, -\xi)$ is contained in a countable union of measure-zero sets, making it a measure-zero set itself.

We continue with an induction on the number of particles to show that, with probability one, $x_2, \dots, x_n$ all converge to $\xi$. For the base case $k = 2$, observe that $x_2$ converges to $\xi$ except when initialized at the unstable equilibrium $x_2(0) = -\xi$. Assume next that $x_2, \dots, x_{k-1} \to \xi$ with probability one, so that for any $\varepsilon > 0$, there exists a time $T_0$ after which $\min_{j < k} \langle x_j(t), \xi\rangle > 1 - \varepsilon$ a.s. Then
$$\frac{d}{dt}\langle x_k(t), \xi\rangle = \frac{1}{\sum_{j=1}^{k} e^{\beta\langle Q x_k(t), K x_j(t)\rangle}} \sum_{j=1}^{k} e^{\beta\langle Q x_k(t), K x_j(t)\rangle}\big(\langle x_j(t), \xi\rangle - \langle x_j(t), x_k(t)\rangle\,\langle x_k(t), \xi\rangle\big).$$
From Lemma 6 we get that $\langle x_j(t), \xi\rangle - \langle x_j(t), x_k(t)\rangle\langle x_k(t), \xi\rangle > 0$ if $\langle x_j, \xi\rangle > 0$ and $|\langle x_k, \xi\rangle| < \langle x_j, \xi\rangle$. In particular, the time derivative above is positive after time $T_0$ whenever $-1 + \varepsilon < \langle x_k, \xi\rangle < 1 - \varepsilon$, since we are guaranteed that $\forall j < k$, $\langle x_j, \xi\rangle > 1 - \varepsilon > 0$ after that time. But from the center-stable manifold argument, we know that $\langle x_k, \xi\rangle$ does not converge to $-1$, so there exists a time $T_1 > T_0$ at which $\langle x_k, \xi\rangle > -1 + \varepsilon$. After this time, either $\langle x_k, \xi\rangle$ already exceeds $1 - \varepsilon$, or $x_k$ gets closer to $\xi$ (positive derivative) until $\langle x_k, \xi\rangle = 1 - \varepsilon$; in both cases $\langle x_k, \xi\rangle \ge 1 - \varepsilon$ forever after. Since this argument is valid for all $\varepsilon > 0$, we get that $x_k \to \xi$. By induction, all points converge to $\xi$ with probability one.

B.3 Spectra of V

Here we include a figure showing the spectra of the value matrices of different heads of a pre-trained transformer. Notice that most of them have a real eigenvalue attaining $\lambda_{\max}$, justifying our focus on the real case. Interestingly, there are heads with negative $\lambda_{\max}$, even though the top eigenvalue of a Gaussian matrix at initialization is far from the left half-plane. This could indicate that some properties of $\lambda_{\max} < 0$ are desirable in practice, suggesting that other ways to initialize $V$ for training are worth considering.

Figure 4: Eigenvalues of different heads of the pre-trained albert-xlarge-v2 model in the complex plane.

C Meta-stable clustering

C.1 Proof of Lemma 2

We prove the lemma for $d = 2$.

Proof. For $d = 2$, written in polar coordinates $x_k = \exp(i\varphi_k)$, the system is
$$\dot{\varphi}_k = \frac{1}{Z_k}\sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j) - 1)}\sin(\varphi_j - \varphi_k), \tag{6}$$
$$Z_k = \sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j) - 1)}.$$
Because of the $j = k$ term one has $Z_k \ge 1$, which implies
$$|\dot{\varphi}_{s_k}| \le \sum_{j=1}^{s_k} e^{\beta(\cos(\varphi_{s_k} - \varphi_j) - 1)}\,|\sin(\varphi_j - \varphi_{s_k})| = \sum_{j=1}^{s_k} h\big(|\varphi_j - \varphi_{s_k}|\big).$$
Let $\varphi_r$ be the closest particle to $\varphi_{s_k}$ among the previous ones; here $r$ depends on $s_k$ and $t$. Without loss of generality, assume that $\varphi_r \in [\varphi_{s_k}, \varphi_{s_k} + \pi]$ and denote $\Delta = \varphi_r - \varphi_{s_k}$. We are going to show that $\Delta$ cannot decrease fast. At any time one has
$$-\dot{\Delta} = \dot{\varphi}_{s_k} - \dot{\varphi}_r = \frac{1}{Z_{s_k}}\sum_{j=1}^{s_k} h(\varphi_j - \varphi_{s_k}) - \frac{1}{Z_r}\sum_{j=1}^{r} h(\varphi_j - \varphi_r).$$
Let us bound both terms separately. By the definition of $\Delta$, and as long as $\Delta > \tau_\beta$ (from Lemma 4), we can bound the first term:
$$|\dot{\varphi}_{s_k}| \le s_k\, h(\Delta). \tag{7}$$
For the second one, we only keep the terms with $\varphi_j \in [\varphi_r - \pi, \varphi_r]$, because the other ones are negative:
$$\dot{\varphi}_r \ge -\frac{1}{Z_r}\sum_{j=1}^{r} |h(\varphi_j - \varphi_r)|\,\mathbf{1}_{\varphi_j \in [\varphi_r - \pi, \varphi_r]}.$$
In other words, the only particles that drive $\varphi_r$ towards $\varphi_{s_k}$ are the ones lying in that direction. However, there are no particles closer to $\varphi_{s_k}$ than $\varphi_r$. Therefore, being in that direction implies being at least $2\Delta$-far away from $\varphi_r$. If $h(2\Delta) < 0$, there are no particles $\varphi_j \in [\varphi_r - \pi, \varphi_r]$ and the sum is zero; otherwise we use Lemma 4 to upper bound each term by $h(2\Delta)$. Thus,
$$\dot{\varphi}_r \ge -s_k \max\big(h(2\Delta), 0\big) \ge -s_k\, h(\Delta).$$
The last inequality follows from the fact that $h(\Delta) > 0$ and the monotonicity of $h$ when $\Delta > \tau_\beta$. Combining the estimates, we obtain that for $\Delta > \tau_\beta$ one has
$$\dot{\Delta} \ge -2\, s_k\, h(\Delta).$$
To prove that for $t \in [0, T_k]$ we have $\Delta(t) > c\beta^{-1/2}$, we just need to verify that $c\beta^{-1/2} > \tau_\beta$ and that initially $\Delta(0) > c\beta^{-1/2} + 2 s_k T_k h(c\beta^{-1/2})$, which is true by the definition of Rényi centers and the bounds on $c$ (1) and $T_k$ (2). Then, from (7) one has
$$\max_{t\in[0, T_k]} |\varphi_{s_k}(t) - \varphi_{s_k}(0)| \le s_k T_k\, h(c\beta^{-1/2}) < \varepsilon c \beta^{-1/2}.$$

C.2 Proof of Lemma 3

Proof. We prove that the particles converge exponentially fast to some strongly stable critical point $\varphi^*$ via induction on their number.

Induction base. In this proof the induction base for $n = 1$ follows from the induction step applied to the first particle $\varphi_1$.

Induction step. Consider $\varphi_1(t), \dots, \varphi_{n-1}(t)$. They do not depend on $\varphi_n$ and, by the induction hypothesis, converge exponentially fast to some asymptotically stable point $(\varphi^*_1, \dots, \varphi^*_{n-1})$. In particular, one has $\dot{\varphi}_k \in L^1([0, \infty))$ for $k \in [n-1]$. Consider the full derivative
$$f(t) := \frac{d}{dt} E_n(\varphi(t)) = \sum_{j<n} \frac{\partial E_n}{\partial \varphi_j}\,\dot{\varphi}_j(t) + \frac{\partial E_n}{\partial \varphi_n}\,\dot{\varphi}_n(t) = g(t) - Z_n(t)\,|\dot{\varphi}_n(t)|^2, \qquad g(t) := \sum_{j<n} \frac{\partial E_n}{\partial \varphi_j}\,\dot{\varphi}_j(t).$$
Since all partial derivatives $\partial E_n/\partial \varphi_j$ are bounded on the compact manifold and all derivatives $\dot{\varphi}_j$, $j < n$, are in $L^1([0, \infty))$, one has $g \in L^1([0, \infty))$. For the base case, i.e. $n = 1$, one simply has $g(t) \equiv 0$. Let us show that $f(t) = \frac{d}{dt}E_n(\varphi(t))$ is also in $L^1([0, \infty))$. Notice that $f$ has a finite integral on any interval,
$$\Big|\int_a^b f(t)\,dt\Big| = \big|E_n(\varphi(b)) - E_n(\varphi(a))\big| \le C < \infty, \tag{8}$$
and is upper bounded by $f(t) \le g(t)$ with $g \in L^1([0, \infty))$. Consider $f_+ := f\,\mathbf{1}_{f > 0}$ and $f_- := f_+ - f$. Clearly $f_+ \in L^1([0, \infty))$ as it is upper bounded by $|g| \in L^1([0, \infty))$. Moreover, the bound (8) shows that for all $T > 0$ one has
$$\int_0^T f_-(s)\,ds \le \int_0^T f_+(s)\,ds + C \le \|f_+\|_{L^1([0,\infty))} + C.$$
From this we get that $f_- \in L^1([0, \infty))$, and consequently $f \in L^1([0, \infty))$. Since $Z_n(t)|\dot{\varphi}_n|^2 = g(t) - f(t) \in L^1([0, \infty))$ and $Z_n(t)$ has a uniform lower bound, we obtain $\dot{\varphi}_n \in L^2([0, \infty))$.

Let us show that $\dot{\varphi}_n$ is absolutely continuous. The trajectory of all particles $\varphi(t)$ is absolutely continuous, because $\dot{\varphi}(t)$ is bounded, being a continuous vector field on a compact manifold. Then $\dot{\varphi}_n(t)$ is absolutely continuous, because it is the composition of a smooth vector field with the absolutely continuous trajectory $\varphi(t)$. Because $\dot{\varphi}_n \in L^2([0, \infty))$ and $\dot{\varphi}_n$ is absolutely continuous, it satisfies $\limsup_{t\to\infty} |\dot{\varphi}_n(t)| = 0$. In other words, because of the upper bound $Z_n(t) < C$, one has
$$\lim_{t\to\infty} \partial_{\varphi_n} E_n\big(\varphi_1(t), \dots, \varphi_n(t)\big) = 0. \tag{9}$$
Finally, we are going to prove that $\varphi_n$ converges to some critical point $\varphi^*_n$ with $\partial_{\varphi_n} E_n(\varphi^*_1, \dots, \varphi^*_n) = 0$.
Consider the set $E = \{\psi : \partial_{\varphi_n} E_n(\varphi^*_1, \dots, \varphi^*_{n-1}, \psi) = 0\}$. By assumption, the energy function $E_n(\varphi^*_1, \dots, \varphi^*_{n-1}, \psi)$ has isolated critical points with respect to $\psi$, so the set of zeroes is a finite collection of points $E = \{\psi_1, \dots, \psi_m\}$. As $\varphi_1, \dots, \varphi_{n-1}$ converge to $\varphi^*_1, \dots, \varphi^*_{n-1}$, the set $\{\varphi_n : |\partial_{\varphi_n} E_n(\varphi_1, \dots, \varphi_n)| < \varepsilon\}$ is contained in a collection of disjoint intervals around $\psi_1, \dots, \psi_m$ for small enough $\varepsilon$. Therefore, due to (9), from some moment on $\varphi_n(t)$ stays in only one of those intervals. These intervals collapse into points as we take $\varepsilon \to 0$ and $t \to \infty$, proving that $\varphi_n$ converges to some $\varphi^*_n \in E$.

It remains to prove that the observed convergence is to a strongly stable point. We know that the limiting point is critical and that, for almost any initialization, it is not strongly unstable. It is a well-known fact that for autonomous systems convergence to a strongly unstable point happens with probability zero; we already used it in Section B.2, and it can be found in (Geshkovski et al., 2023b, Lemma B.1), based on the center manifold theorem (Shub, 2013, Thm. III.7, Ex. III.3). Therefore, by the assumption on critical points, the limiting point is strongly stable. Then the convergence happens exponentially fast, because locally the whole neighbourhood of a strongly stable point is its stable manifold $W^s_{loc}$; see (Shub, 2013, Thm. III.7, Ex. III.3).

C.3 Proof of Theorem 5.1

Proof. Our proof consists of two parts. In order to obtain convergence, we are going to apply Lemma 3. The system has exactly the gradient-like form of Lemma 3 for the energy functionals
$$E_k(\varphi_1, \dots, \varphi_k) = -\frac{1}{\beta}\left(\sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j) - 1)} + \sum_{j=1}^{m} a_j e^{\beta(\cos(\varphi_k - \theta_j) - 1)}\right)$$
and the bounded normalization factors
$$Z_k(\varphi_1, \dots, \varphi_k) = \sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j) - 1)} + \sum_{j=1}^{m} a_j e^{\beta(\cos(\varphi_k - \theta_j) - 1)}.$$
Therefore, if the conditions of Lemma 3 are satisfied, we have convergence to a strongly stable critical point. In that case, it is sufficient to prove that at any strongly stable critical point $\varphi^*$, all particles $\varphi^*_k$ are $\varepsilon\beta^{-1/2}$-close to one of the centers $\theta_j$. We show this via induction on $k$, because new particles do not affect the movement of the previous ones.

To justify our use of Lemma 3, we need to check that all critical points are either strongly stable or strongly unstable, i.e., that the Jacobian at any critical point with only non-positive eigenvalues actually has no zero eigenvalues. Moreover, for our conclusion, we need to check that if all the eigenvalues are negative, then the critical point is clustered around the centers $\theta_j$. Let us start with that. In what follows we repeatedly use the properties of $h$ and $g$ from Lemma 4.

The system is of the form
$$\dot{\varphi}_k = \frac{1}{Z_k}\left(\sum_{j=1}^{k} h(\varphi_j - \varphi_k) + \sum_{j=1}^{m} a_j h(\theta_j - \varphi_k)\right) =: f_k(\varphi, \theta).$$
Since $f_k$ does not depend on $\varphi_{k+1}, \dots, \varphi_n$, the Jacobian $(\partial f_k/\partial \varphi_j)$ is lower triangular. Therefore, its eigenvalues are the diagonal entries $\partial f_k/\partial \varphi_k$. Let us assume that all of the eigenvalues are non-positive. Since at a critical point $f_k(\varphi^*, \theta)$ itself is zero, one has
$$\frac{\partial f_k}{\partial \varphi_k}(\varphi^*, \theta) = -\frac{1}{Z_k}\left(\sum_{j<k} g(\varphi^*_k - \varphi^*_j) + \sum_{j=1}^{m} a_j\, g(\varphi^*_k - \theta_j)\right).$$
This implies that one of the terms $g(\varphi^*_k - \varphi^*_j)$ or $g(\varphi^*_k - \theta_j)$ is positive. From the properties of $g$ in Section A.2, we obtain that $\varphi^*_k$ is $\tau_\beta$-close to either one of the centers $\theta_j$ or one of the previous particles $\varphi^*_j$. By the induction hypothesis, all previous particles are $\varepsilon\beta^{-1/2}$-close to the centers, i.e., $\varphi^*_k$ is in a $(\tau_\beta + \varepsilon\beta^{-1/2})$-neighbourhood of some center, without loss of generality $\theta_1$. Let us assume that $\varphi^*_k$ is $\varepsilon\beta^{-1/2}$-far from $\theta_1$. By criticality of the point, it is true that
$$\sum_{j=1}^{k} h(\varphi^*_k - \varphi^*_j) + \sum_{j=1}^{m} a_j\, h(\varphi^*_k - \theta_j) = 0.$$
By the induction hypothesis, all previous particles are $\varepsilon\beta^{-1/2}$-close to some center. Denote the set of particles that are close to $\theta_1$ by $S$.
C.3 Proof of Theorem 5.1

Proof. Our proof consists of two parts. In order to obtain convergence, we are going to apply Lemma 3. The system has exactly the gradient-like form for the energy functionals
$$E_k(\varphi_1, \dots, \varphi_k) = -\frac{1}{\beta}\Big[\sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j)-1)} + \sum_{j} a_j e^{\beta(\cos(\varphi_k - \theta_j)-1)}\Big]$$
and bounded normalization factors
$$Z_k(\varphi_1, \dots, \varphi_k) = \sum_{j=1}^{k} e^{\beta(\cos(\varphi_k - \varphi_j)-1)} + \sum_{j} a_j e^{\beta(\cos(\varphi_k - \theta_j)-1)}.$$
Therefore, if the conditions of Lemma 3 are satisfied, we have convergence to a strongly stable critical point. In that case, it is sufficient to prove that for any strongly stable critical point $\varphi^*$, all particles $\varphi^*_k$ are $\varepsilon\beta^{-1/2}$-close to one of the centers $\theta_j$. We show this via induction on $k$, because new particles do not affect the movement of the previous ones.

To justify our use of Lemma 3, we need to check that all critical points are either strongly stable or strongly unstable, i.e. that the Jacobian at any critical point with only non-positive eigenvalues actually has no zero eigenvalues. Moreover, for our conclusion, we need to check that if all the eigenvalues are negative, then the critical point is clustered around $\theta$. Let us start with that. In what follows, we repeatedly use the properties of $h$ and $g$ from Lemma 4. The system is of the form
$$\dot\varphi_k = \frac{1}{Z_k}\Big[\sum_{j=1}^{k} h(\varphi_j - \varphi_k) + \sum_{j} a_j h(\theta_j - \varphi_k)\Big] =: f_k(\varphi, \theta).$$
Since $f_k$ does not depend on $\varphi_{k+1}, \dots, \varphi_n$, the Jacobian $\big(\frac{\partial f_k}{\partial \varphi_j}\big)$ is lower-triangular. Therefore, its eigenvalues are the diagonal entries $\frac{\partial f_k}{\partial \varphi_k}$. Let us assume that all of the eigenvalues are non-positive. Since at a critical point $f_k(\varphi^*, \theta)$ itself is zero, one has
$$\frac{\partial f_k}{\partial \varphi_k}(\varphi^*, \theta) = -\frac{1}{Z_k}\Big[\sum_{j=1}^{k-1} g(\varphi^*_k - \varphi^*_j) + \sum_{j} a_j g(\varphi^*_k - \theta_j)\Big].$$
This implies that one of the terms $g(\varphi^*_k - \varphi^*_j)$ or $g(\varphi^*_k - \theta_j)$ is positive. From the properties of $g$ in Section A.2, we obtain that $\varphi^*_k$ is $\tau_\beta$-close to either one of the centers $\theta_j$ or one of the previous particles $\varphi^*_j$. By the induction hypothesis, all previous particles are $\varepsilon\beta^{-1/2}$-close to the centers, i.e. $\varphi^*_k$ is in a $(\tau_\beta + \varepsilon\beta^{-1/2})$-neighbourhood of some center, without loss of generality $\theta_1$. Let us assume that $\varphi^*_k$ is $\varepsilon\beta^{-1/2}$-far from $\theta_1$. By the criticality of the point,
$$\sum_{j=1}^{k-1} h(\varphi^*_k - \varphi^*_j) + \sum_{j} a_j h(\varphi^*_k - \theta_j) = 0.$$
By the induction hypothesis, all previous particles are $\varepsilon\beta^{-1/2}$-close to some center. Denote the set of particles that are close to $\theta_1$ by $S$. Then the criticality can be written as
$$a_1 h(\varphi^*_k - \theta_1) + \sum_{j \in S} h(\varphi^*_k - \varphi^*_j) = -\sum_{j \notin S} h(\varphi^*_k - \varphi^*_j) - \sum_{j \ge 2} a_j h(\varphi^*_k - \theta_j).$$
All of the terms on the left-hand side have the same sign, because the $\varphi^*_j$, $j \in S$, are $\varepsilon\beta^{-1/2}$-close to $\theta_1$ and $\varphi^*_k$ is $(\tau_\beta + \varepsilon\beta^{-1/2})$-close to $\theta_1$. Any other particle or center is at least $(c - 2\varepsilon)\beta^{-1/2} - \tau_\beta$ far from $\varphi^*_k$, because the centers are $c\beta^{-1/2}$-separated. The absolute value of the left-hand side can be lower bounded:
$$\Big|a_1 h(\varphi^*_k - \theta_1) + \sum_{j \in S} h(\varphi^*_k - \varphi^*_j)\Big| \ge h(\varepsilon\beta^{-1/2}).$$
The absolute value of the right-hand side can be upper bounded:
$$\Big|\sum_{j \notin S} h(\varphi^*_k - \varphi^*_j) + \sum_{j \ge 2} a_j h(\varphi^*_k - \theta_j)\Big| \le N\,h\big((c - 2\varepsilon)\beta^{-1/2} - \tau_\beta\big) < N\,h\big((c - 1 - 2\varepsilon)\beta^{-1/2}\big),$$
where in the last step we used the fact from Section A.2 that $\tau_\beta < \beta^{-1/2}$. Therefore, one has
$$N\,h\big((c - 1 - 2\varepsilon)\beta^{-1/2}\big) \ge h(\varepsilon\beta^{-1/2}),$$
which contradicts our assumption on the system parameters.

It remains to check that when the eigenvalues of the Jacobian are non-positive, they are in fact strictly negative. We already know that in that case all the particles are clustered around the centers $\theta_j$. Let us assume that $\frac{\partial f_k}{\partial \varphi_k}(\varphi^*, \theta) = 0$. We know that $\varphi^*_k$ is $\varepsilon\beta^{-1/2}$-close to some center, without loss of generality $\theta_1$. Therefore, $g(\varphi^*_k - \theta_1) \ge g(\varepsilon\beta^{-1/2})$. On the other hand, the only negative terms in the sum
$$\sum_{j=1}^{k-1} g(\varphi^*_k - \varphi^*_j) + \sum_{j} a_j g(\varphi^*_k - \theta_j) = 0$$
are from particles that are not in the neighbourhood of $\theta_1$, so they are at least $(c - 2\varepsilon)\beta^{-1/2}$-far from $\varphi^*_k$. Therefore, for the negative terms to balance the positive $g(\varphi^*_k - \theta_1)$, it should be true that
$$g(\varepsilon\beta^{-1/2}) < N\,\big|g\big((c - 2\varepsilon)\beta^{-1/2}\big)\big|.$$
This contradicts the choice of parameters.

C.4 Rényi centers vs strong Rényi centers

In this section, we estimate the number of clusters our approach predicts. As a reminder, for a sequence of particles $X_i$ on a unit sphere with geodesic distance $\mathrm{dist}$, we consider two subsequences:
- Rényi centers: $X_{s_j}$, $j \ge 1$, such that $\min_{i<j} \mathrm{dist}(X_{s_j}, X_{s_i}) > \delta$, i.e. each new center is more than $\delta$ away from all previously chosen centers;
- strong Rényi centers: $X_{s_j}$, $j \ge 1$, such that $\min_{i<s_j} \mathrm{dist}(X_{s_j}, X_i) > \delta$, i.e. each new center is more than $\delta$ away from all previous particles.

Estimating the number of elements in the first subsequence is the famous Rényi parking problem (Rényi, 1958). In the case $d = 2$, the result of Dvoretzky and Robbins (1964) implies that, as $\delta \to 0$, the average number of elements in the sequence approaches $c \cdot 2\pi/\delta$ superexponentially fast, where $c \approx 0.75$ is the Rényi constant. However, this problem becomes significantly harder in higher dimensions. In contrast, it is easy to compute the average number of elements in the second sequence in any dimension, and even for a wider class of distributions.

Lemma 7. In an infinitely long sequence $X_i$, the average number of variables $X_{s_j}$ chosen by strong Rényi parking is the inverse spherical cap surface area $1/\sigma_{d-1}(B_\delta)$. In particular, it grows as $1/\delta^{d-1}$ with the dimension and can be computed directly in lower dimensions:
$$\frac{1}{\sigma_{d-1}(B_\delta)} = \begin{cases} \pi\,\delta^{-1}, & d = 2,\\ \big(\sin^2(\delta/2)\big)^{-1}, & d = 3.\end{cases}$$

Proof. Consider a sequence of i.i.d. points $X_i$ sampled on the sphere $S^{d-1}$ according to some distribution $\mu$. Let us find the probability that the $k$-th particle $X_k$ is chosen by strong Rényi parking with distance $\delta$. This event can be written as
$$\mathbb{P}\Big[\bigcap_{j=1}^{k-1}\{\mathrm{dist}(X_k, X_j) > \delta\}\Big] = \int_{S^{d-1}} \mathbb{P}\Big[\bigcap_{j=1}^{k-1}\{\mathrm{dist}(x, X_j) > \delta\}\Big]\,d\mu(x),$$
where we used the total probability formula. Then, since the $X_j$ are i.i.d., we can write it as
$$\int_{S^{d-1}} \mathbb{P}\big(\mathrm{dist}(x, X_1) > \delta\big)^{k-1}\,d\mu(x) = \int_{S^{d-1}} \big(1 - \mu(B_\delta(x))\big)^{k-1}\,d\mu(x).$$
From this we obtain that the average number of chosen points is equal to
$$\sum_{k=1}^{\infty} \int_{S^{d-1}} \mathbb{P}\big(\mathrm{dist}(x, X_1) > \delta\big)^{k-1}\,d\mu(x) = \int_{S^{d-1}} \frac{d\mu(x)}{\mu(B_\delta(x))} = \frac{1}{\sigma_{d-1}(B_\delta)},$$
where the last equality is correct for any spherically harmonic distribution $\mu$, in particular for the uniform measure.
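The identity in Lemma 7 is easy to check by simulation. The sketch below runs the strong Rényi parking rule on the circle ($d = 2$) and compares the empirical average number of accepted points with the predicted value $\pi\delta^{-1}$; the value of $\delta$, the sequence length, and the number of Monte Carlo runs are arbitrary illustrative choices of ours.

```python
import numpy as np

def strong_renyi_count_circle(n_points, delta, rng):
    """Count the points accepted by *strong* Renyi parking on the circle S^1:
    X_k is accepted iff its geodesic distance to ALL previous points exceeds delta."""
    x = rng.uniform(0.0, 2 * np.pi, size=n_points)
    count = 0
    for k in range(n_points):
        d = np.abs((x[:k] - x[k] + np.pi) % (2 * np.pi) - np.pi)  # circular distances to predecessors
        if k == 0 or d.min() > delta:
            count += 1
    return count

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    delta, n_points, runs = 0.1, 2_000, 200
    counts = [strong_renyi_count_circle(n_points, delta, rng) for _ in range(runs)]
    print("empirical mean: %.2f   predicted pi/delta: %.2f" % (np.mean(counts), np.pi / delta))
```

The same experiment on $S^2$, with uniform samples and the spherical geodesic distance, can be compared against the $d = 3$ value in Lemma 7.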
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The work is dedicated to causal self-attention from the interacting-particle point of view. Asymptotic convergence for arbitrary key-query matrices is proven in Theorem 4.1, and the final configurations for non-identity Value matrices are presented in Table 1 and Figure 1. The connection with Rényi parking is shown in Section 5, and properties of meta-stable clusters are proven in Lemma 2 and Theorem 5.1.
Guidelines:
- The answer NA means that the abstract and introduction do not include the claims made in the paper.
- The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
- The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
- It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Significant limitations are discussed in Section 6. As mentioned in Section 4, the provided approach to asymptotic convergence is close to, but insufficient for, proving all results listed in Table 1; that is why those results are only conjectured.
Guidelines:
- The answer NA means that the paper has no limitations, while the answer No means that the paper has limitations, but those are not discussed in the paper.
- The authors are encouraged to create a separate "Limitations" section in their paper.
- The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
- The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
- The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
- The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
- If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
- While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: The proofs of all claimed results can be found in the appendix. When stated, every result refers to the section dedicated to its proof. All proofs are well-structured, rely on known results, and have all their assumptions clearly stated.
Guidelines:
- The answer NA means that the paper does not include theoretical results.
- All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
- All assumptions should be clearly stated or referenced in the statement of any theorems.
- The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
- Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
- Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: The only experiments are numerical simulations of the considered dynamics, presented in Figures 1, 2 and 3. Their main purpose is to visually support what is claimed. These calculations can be easily reproduced, as all the parameters used are listed and the system is not computationally complicated.
Guidelines:
- The answer NA means that the paper does not include experiments.
- If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
- If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
- Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
- While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
  (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
  (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
  (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
  (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: The only experiments are numerical simulations of the system, which were used to support the conjectures in Section 4 and to show the performance of the predicted meta-stable clustering centers in Figure 3. They are not claimed as main results of the paper, and if needed, the code for creating the pictures is easy to reproduce.
Guidelines:
- The answer NA means that the paper does not include experiments requiring code.
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: The paper does not include experiments.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The paper does not include experiments.
Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: The conducted research conforms with the Code of Ethics.
Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: The paper is focused on theoretical research and does not provide any direct improvement of existing models, thus there is no deployment path that could have a harmful societal impact. In general, this work aims to improve our theoretical understanding of Transformers, which are prevalent in modern state-of-the-art AI models. In this regard, it could contribute to the long-term development of the field, which we see as beneficial for society.
Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper is theoretical and does not have any model to release.
Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: The paper does not utilize any existing code. Every existing idea or intellectual result is properly referenced upon its introduction.
Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided.
- For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.