Published as a conference paper at ICLR 2025

GEOMETRY OF LIGHTNING SELF-ATTENTION: IDENTIFIABILITY AND DIMENSION

Nathan W. Henry* (University of Toronto, nathan.henry@mail.utoronto.ca), Giovanni Luca Marchetti* (Royal Institute of Technology (KTH), glma@kth.se), Kathlén Kohn* (Royal Institute of Technology (KTH), kathlen@kth.se)

ABSTRACT

We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.

Figure 1: A slice of the space of lightning self-attention mechanisms.

1 INTRODUCTION AND RELATED WORK

The self-attention mechanism is the cornerstone of the Transformer, a modern machine learning architecture that is nowadays popular in a vast variety of domains, ranging from natural language processing (Vaswani et al., 2017), to vision (Dosovitskiy et al., 2020), to sound (Huang et al., 2018). In all of these domains, self-attention mechanisms have showcased outstanding performance due to their ability to model long-range dependencies within data sequences. Lightning self-attention mechanisms (Schlag et al., 2021) are a variant of the standard mechanism where, differently from the original proposal, the attention weights are left un-normalized.
As a result, the computational complexity of a forward pass is linear with respect to the sequence length, substantially improving on the quadratic complexity of the original model. Despite their effectiveness, the theoretical understanding of self-attention mechanisms is superficial, and many aspects have yet to be clarified. In particular, understanding the geometry of function spaces defined by neural networks, typically referred to as neuromanifolds (Marchetti et al., 2025; Kohn, 2024; Calin, 2020), is a fundamental challenge due to its intimate connection to several machine learning aspects, such as sample complexity and expressivity.

*Equal contribution.

Moreover, since neural networks learn by following a gradient flow over the neuromanifold, the geometry of the latter controls several aspects of the training dynamics (Trager et al., 2019). While neuromanifolds are well-understood for several architectures, such as fully-connected (Kileel et al., 2019; Kubjas et al., 2024) and convolutional networks (Kohn et al., 2022; 2023; Shahverdi et al., 2024), they have not been considered for self-attention mechanisms.

In this work, we study the geometry of neuromanifolds associated to lightning self-attention mechanisms. These models are of algebraic nature, since they are tri-linear in their weights and cubic in the input. This enables us to analyze neuromanifolds via ideas and tools from algebraic geometry, a rich field concerned with spaces defined by polynomial equations. In particular, it is possible to compute geometric quantities such as the dimension of the neuromanifold. The latter is a measure of expressivity of the underlying model. More concretely, it is intimately linked with sample complexity. According to the Fundamental Theorem of Learning, the dimension controls, linearly, the sample complexity of learnability (Shalev-Shwartz & Ben-David, 2014).
This theory is typically formulated for (binary) classifiers¹, and the notion of dimension is a discrete one, specifically the Vapnik–Chervonenkis (VC) dimension. In the continuous setting, the dimension of the neuromanifold is the natural analogue of the combinatorial VC dimension, and controls the sample complexity of learnability. An expression for sample complexity can be used both to select the appropriate model/architecture given an available dataset, and to collect appropriate amounts of data to train a given model. This is especially important for attention-based models, which are nowadays popular in several domains, and are trained at extremely large scales.

The dimension of the neuromanifold is related to the dual question of identifiability (Grigsby et al., 2023; Bona-Pellissier et al., 2023; Fefferman et al., 1994), a problem concerned with characterizing the parameters corresponding to the same function. Geometrically, such parameters define fibers of the parametrization of the neuromanifold, and their (generic) dimension measures the difference between the dimension of the neuromanifold and the number of parameters. Therefore, characterizing fibers leads to an estimate of sample complexity which is more precise than the common practice of counting parameters. Moreover, understanding the fibers can be interesting beyond their relation to the dimension, since they control aspects of the training dynamics. Indeed, fibers induce invariances of the loss function which are data-independent, meaning that for any dataset, the loss will be constant for parameters within the same fiber. This gives rise to the phenomenon of flatness of the loss landscape (Zhao et al., 2022b), where minima are not isolated but instead belong to a continuous set.
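As a concrete illustration of this data-independent invariance, the following minimal NumPy sketch (all sizes, the dataset, and the squared-error loss are hypothetical choices) evaluates a loss on a single lightning self-attention layer at two parameter points in the same fiber, namely (A, V) and (λA, λ⁻¹V); this scaling symmetry is established in Section 3.1:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_out, t, n = 3, 2, 4, 8          # hypothetical sizes and dataset count

def lightning(A, V, X):
    # Single lightning self-attention layer in matrix form: V X X^T A X.
    return V @ X @ X.T @ A @ X

def mse_loss(A, V, data):
    # Squared-error loss against fixed targets; any data-independent
    # fiber symmetry leaves this value unchanged for every dataset.
    return sum(np.sum((lightning(A, V, X) - Y) ** 2) for X, Y in data)

A, V = rng.normal(size=(d, d)), rng.normal(size=(d_out, d))
data = [(rng.normal(size=(d, t)), rng.normal(size=(d_out, t))) for _ in range(n)]

lam = 2.7
loss_a = mse_loss(A, V, data)
loss_b = mse_loss(lam * A, V / lam, data)   # same fiber, hence same loss
assert np.isclose(loss_a, loss_b)
```

Since the two parameter points define the same function, the loss agrees for any dataset, which is exactly the flatness phenomenon described above.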
Even further, it is understood that the symmetries of the loss landscape control training dynamics, as gradient directions must be orthogonal to the fibers of the loss, which also induces a constraint on the Hessian (Kunin et al., 2020). This has recently been exploited to design optimizers that teleport along fibers (Zhao et al., 2022a), improving learning efficiency.

1.1 SUMMARY OF CONTRIBUTIONS

Our core contribution is a description of the (generic) fibers of the parametrization of lightning self-attention networks. As a consequence, the expression for the dimension of the neuromanifold follows immediately. More specifically, our results are summarized as follows.

- For a single layer of lightning self-attention (Section 3.1), we describe all the fibers, and additionally study various aspects of the geometry of the neuromanifold. Specifically, we prove that it is Euclidean closed and compute its singular and boundary points.
- For a deep lightning self-attention network (Section 3.3), we compute the generic fibers. Our proof involves a reparametrization, which can be interpreted as introducing virtual weights, and a subtle induction argument based on a closed-form algebraic expression for the network w.r.t. the new parameters. Assuming the network has a bottleneck architecture, we derive a formula for the dimension of the neuromanifold. In particular, the dimension is strictly lower than the number of parameters, with redundancies arising from a scaling symmetry, from inter-layer symmetries, and from the rank constraint on the attention weights.

¹In this context, neuromanifolds are referred to as hypothesis spaces, and are usually considered in a combinatorial version.

Lastly, we study traditional self-attention by re-introducing the softmax normalization (Section 3.4).
We prove that the parametrization of a single layer is generically one-to-one, and state a conjecture, verified via numerical experiments, for the generic fibers of deep self-attention networks.

2 LIGHTNING SELF-ATTENTION

Fix positive integers d, d', a, t ∈ N and matrices Q, K ∈ R^{a×d}, V ∈ R^{d'×d}. The latter are deemed query weights, key weights, and value weights, respectively². A self-attention mechanism is a map parametrized by (Q, K, V) sending sequences of length t of vectors in R^d to sequences of vectors in R^{d'}. Here we consider the variant of self-attention mechanisms deemed lightning, which is computationally efficient and fully algebraic.

Definition 1. The lightning self-attention mechanism associated to the weights W = (Q, K, V) is the map:

φ_W : (R^d)^t → (R^{d'})^t,  (x_i)_{1≤i≤t} ↦ ( Σ_{1≤j≤t} (x_j^⊤ K^⊤ Q x_i) V x_j )_{1≤i≤t}.  (1)

Intuitively, every component x_i of the input corresponds to a token and attends bilinearly to every other x_j, producing a scalar weight x_j^⊤ K^⊤ Q x_i ∈ R. These weights are used to aggregate the values V x_j ∈ R^{d'}, obtaining the corresponding component of the output. The map defined by Equation 1 is tri-linear in (Q, K, V) and homogeneous cubic in x. It is often convenient to write Equation 1 in matrix form: if X = (x_i)_{1≤i≤t} is interpreted as a d×t matrix, then:

φ_W(X) = V X X^⊤ K^⊤ Q X.  (2)

Moreover, we will often simplify the parametrization by introducing the attention matrix:

A = K^⊤ Q.  (3)

The latter will always be interpreted as a bilinear form (x, y) ↦ x^⊤ A y. Lightning attention mechanisms are variants of the traditional ones (Vaswani et al., 2017), where the attention weights x_j^⊤ A x_i are normalized to a probability distribution across j; see Section 3.4 for further details. The major practical advantage of the lightning variant is its computational efficiency with respect to the sequence length. Specifically, Equation 1 can be computed in O(t) time, while traditional self-attention mechanisms require O(t²) time due to normalization. The improvement in efficiency motivates the term lightning.
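The equivalence between Equation 1 and the matrix form of Equation 2, and the resulting linear-time evaluation, can be checked with a minimal NumPy sketch (all sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_out, a, t = 4, 3, 2, 6          # hypothetical sizes

Q, K = rng.normal(size=(a, d)), rng.normal(size=(a, d))
V = rng.normal(size=(d_out, d))
X = rng.normal(size=(d, t))          # t tokens as columns

# Naive O(t^2) evaluation of Eq. (1): token i attends to every token j
# with scalar weight x_j^T K^T Q x_i, which aggregates the values V x_j.
naive = np.zeros((d_out, t))
for i in range(t):
    for j in range(t):
        naive[:, i] += (X[:, j] @ K.T @ Q @ X[:, i]) * (V @ X[:, j])

# Lightning O(t) evaluation of Eq. (2): precompute the d x d statistic
# X X^T once, so the cost is linear in the sequence length t.
fast = V @ (X @ X.T) @ K.T @ Q @ X

assert np.allclose(naive, fast)
```

The key point is that X X^⊤ is a d×d matrix independent of the token index, so the double loop over token pairs collapses into a fixed number of matrix products.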
Self-attention mechanisms can be stacked in order to obtain a deep network architecture. To this end, fix positive integers t, l, d = (d_0, ..., d_l), a = (a_1, ..., a_l), and weights Q = (Q_1, ..., Q_l), K = (K_1, ..., K_l), V = (V_1, ..., V_l), with Q_i, K_i ∈ R^{a_i×d_{i−1}}, V_i ∈ R^{d_i×d_{i−1}}.

Definition 2. A deep self-attention network associated to the weights W = (Q, K, V) is the map:

φ_W : (R^{d_0})^t → (R^{d_l})^t  (4)

given by the composition φ_W = φ_{W_l} ∘ ... ∘ φ_{W_1}.

Again, a deep self-attention network is homogeneous of degree 3^l in x. Based on this, we denote by Sym_{3^l}(R^{d_0×t}, R^{d_l×t}) the vector space of homogeneous polynomial functions from R^{d_0×t} to R^{d_l×t} of degree 3^l in all the output coordinates.

Definition 3. The neuromanifold of a deep self-attention network is the image of the parametrization map W ↦ φ_W:

M_{d,a} = { φ_W | W ∈ R^{Σ_i d_{i−1}(2a_i + d_i)} } ⊆ Sym_{3^l}(R^{d_0×t}, R^{d_l×t}).  (5)

The neuromanifold is a semi-algebraic set by the Tarski–Seidenberg Theorem, meaning that it can be defined by a finite number of polynomial equalities and inequalities in Sym_{3^l}(R^{d_0×t}, R^{d_l×t}).

²An alternative standard notation for Q, K, V is W_Q, W_K, W_V. Our choice is motivated by better readability.

In this section, we study the neuromanifold of lightning attention networks, focusing on its parametrization and its dimension. Our core focus will be the description of the fibers of the parametrization map W ↦ φ_W, meaning that we will describe the sets of weights that define the same function. More precisely, the fiber of φ_W ∈ M_{d,a} is the set

{W' | φ_{W'} = φ_W}.  (6)

Once the fibers are understood, the dimension of the neuromanifold can be computed. To this end, it is actually sufficient to describe the generic fibers, i.e., the ones corresponding to almost all W or, more precisely, to W lying outside of the common zeros of a polynomial system. The co-dimension of such fibers is constant and coincides with the dimension of the neuromanifold.
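The stacked architecture of Definition 2 can be sanity-checked numerically. The following NumPy sketch (all sizes hypothetical) composes l = 2 lightning layers and verifies that the resulting network is homogeneous of degree 3^l = 9 in the input:

```python
import numpy as np

rng = np.random.default_rng(2)
t, dims, att = 5, [3, 4, 2], [2, 3]   # hypothetical: l = 2, d = (3, 4, 2), a = (2, 3)

def layer(A, V, X):
    # One lightning layer in matrix form: V X X^T A X.
    return V @ X @ X.T @ A @ X

# Per layer i: Q_i, K_i are a_i x d_{i-1}, V_i is d_i x d_{i-1};
# we store the attention matrix A_i = K_i^T Q_i together with V_i.
weights = []
for i in range(len(att)):
    Ki = rng.normal(size=(att[i], dims[i]))
    Qi = rng.normal(size=(att[i], dims[i]))
    Vi = rng.normal(size=(dims[i + 1], dims[i]))
    weights.append((Ki.T @ Qi, Vi))

def deep(X):
    for A, V in weights:
        X = layer(A, V, X)
    return X

X = rng.normal(size=(dims[0], t))
lam = 1.3
# Each layer is cubic in its input, so depth l gives degree 3^l.
assert np.allclose(deep(lam * X), lam ** (3 ** len(att)) * deep(X))
```

Each layer is cubic in its input, so composing l layers multiplies the degrees, which is the degree-3^l homogeneity used to embed the neuromanifold into Sym_{3^l}(R^{d_0×t}, R^{d_l×t}).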
In order to study the parametrization map and its fibers, it is convenient to split the problem by considering self-attention mechanisms as parametrized via the attention matrix. More precisely, we will think of self-attention mechanisms as parametrized, by abuse of notation, via weights W = (A, V), where A ∈ R^{d×d} is an arbitrary matrix, and will study the matrix multiplication map (Q, K) ↦ A = K^⊤ Q independently.

We begin by considering the latter. When a < d, the matrix multiplication map is not surjective, since A is constrained to have rank at most a. In other words, the image of this map is the determinantal variety defined as the set of matrices in R^{d×d} of rank at most a. On the other hand, the fibers of the matrix multiplication map are subtle, since they are closely related to the problem of matrix factorization. Yet, it is still possible to describe the generic fibers. To this end, note that the map exhibits the following invariance: K'^⊤ Q' = K^⊤ Q, where K' = CK and Q' = C^{−⊤} Q for an arbitrary invertible matrix C ∈ GL_a(R). Conversely, the following elementary result shows that this is the only symmetry of a generic fiber.

Lemma 3.1. Suppose that A = K^⊤ Q = K'^⊤ Q' = A' and that rk(A) = rk(A') = a ≤ d. Then there exists a unique invertible matrix C ∈ GL_a(R) such that K' = CK and Q' = C^{−⊤} Q.

Proof. See Appendix A.1.

It follows from the above result that, for a < d, the generic fibers of the matrix multiplication map are isomorphic to GL_a(R), and therefore have dimension a². This recovers the well-known formula for the dimension of the determinantal variety, which coincides with 2ad − a² = a(2d − a).

3.1 SINGLE-LAYER IDENTIFIABILITY

We now describe completely the fibers of the parametrization of a lightning self-attention mechanism in terms of the attention matrix. By abuse of notation, we will write φ_W for W = (A, V). Firstly, note that it is always possible to rescale the weights without changing the function. That is, (A, V) and (λA, λ^{−1}V) belong to the same fiber for all λ ∈ R \ {0}.
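Both symmetries introduced so far, the GL_a(R) symmetry of the factorization A = K^⊤ Q from Lemma 3.1 and the rescaling (A, V) ↦ (λA, λ⁻¹V), can be verified directly; a minimal NumPy sketch with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
a, d, d_out, t = 2, 4, 3, 5           # hypothetical sizes, with a < d

Q, K = rng.normal(size=(a, d)), rng.normal(size=(a, d))
V = rng.normal(size=(d_out, d))
C = rng.normal(size=(a, a))           # a random square matrix is generically invertible

# Symmetry of the matrix multiplication map (Lemma 3.1):
# K' = C K and Q' = C^{-T} Q leave A = K^T Q unchanged.
K2, Q2 = C @ K, np.linalg.inv(C).T @ Q
assert np.allclose(K.T @ Q, K2.T @ Q2)

# Rescaling symmetry of a single lightning layer:
# (A, V) and (lam * A, V / lam) define the same function.
A, X, lam = K.T @ Q, rng.normal(size=(d, t)), 3.5
f1 = V @ X @ X.T @ A @ X
f2 = (V / lam) @ X @ X.T @ (lam * A) @ X
assert np.allclose(f1, f2)
```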
Therefore, we will focus on the fibers up to rescaling.

Theorem 3.2. Suppose t ≥ 2. The fiber of φ_W ∈ M_{d,d',a} for a given W = (A, V) is as follows:

- If rk(A) = rk(V) = 1, given tensor decompositions A = kq^⊤ and V = hv^⊤ for some q, k, v ∈ R^d \ {0}, h ∈ R^{d'} \ {0}, the fiber consists, up to rescaling, of W and W' = (vq^⊤, hk^⊤).
- If φ_W = 0, the fiber consists of the W' = (A', V') such that A' = 0 or V' = 0.
- Otherwise, the fiber consists only of rescalings of W.

Proof. See Appendix A.2.

Note that the second condition of Theorem 3.2 is negligible, i.e., it does not hold for almost all weights W, even under the constraint rk(A) ≤ a. Moreover, if d, d' ≥ 2 or d, a ≥ 2, the first condition is negligible as well. Therefore, the generic fibers are one-dimensional. As a consequence, it is possible to compute the dimension (in the sense of algebraic geometry) of the neuromanifold, even when parametrized via queries and keys, or equivalently, when A is restricted to have rank at most a.

Corollary 3.3. Suppose that t ≥ 2 and that d, d' ≥ 2 or d, a ≥ 2. The dimension of the neuromanifold is:

dim(M_{d,d',a}) = 2ad + dd' − a² − 1 if a ≤ d, and d² + dd' − 1 otherwise.  (7)

Proof. The formula follows from the fact that the generic fibers of the parametrization are one-dimensional and that the space of pairs (A, V) with rk(A) ≤ a has dimension 2αd − α² + dd', where α = min{a, d} (see the discussion after Lemma 3.1).

3.2 SINGLE-LAYER GEOMETRY

We will now describe the geometry of the neuromanifold of a single layer in more detail. We will return to the question of identifiability for deep networks in Section 3.3. Throughout this section, we assume that t ≥ 2 and that either d, d' ≥ 2 or d, a ≥ 2.

Theorem 3.4. The neuromanifold M_{d,d',a} is closed in the Euclidean topology. Its (relative) boundary points are those φ_{(A,V)} of the form A = kq^⊤ and V = hk^⊤ for some q, k ∈ R^d, h ∈ R^{d'}. Moreover, M_{d,d',a} is not a smooth manifold: its singular points are the φ_{(A,V)} satisfying rk(A)·rk(V) ≤ 1.

Proof.
This is an amalgamation of Corollaries A.1, A.4, and A.6 in Appendix A.3.

Figure 1 provides a visualization of M_{2,1,2} for t = 2. The latter has dimension 5 and is embedded in the 40-dimensional space Sym_3(R^4, R^2). The illustration shows a rendering of the neuromanifold in a 3-dimensional affine slice of the ambient space. This slice cuts a 2-dimensional section of the neuromanifold. The yellow line denotes the singular locus, whose dotted segment emerges by taking the closure in the Zariski topology.

The above result has several consequences from a machine learning perspective and, in particular, in terms of learning dynamics. The fact that the neuromanifold is closed in the Euclidean topology implies that it contains its limit points. In particular, when the model is trained via a dynamical system, which is the case for gradient descent, any training trajectory that converges to an equilibrium in the ambient space will converge within the neuromanifold. Moreover, singularities of neuromanifolds are a central focus in Information Geometry, and specifically in Singular Learning Theory (Watanabe, 2009). According to the latter, singularities of neuromanifolds play a central role in deep learning, since the function learned by a neural network (via gradient descent) often corresponds to a singular point of the neuromanifold. In other words, singularities often attract the learning dynamics, resulting in a form of implicit bias associated to the neural architecture. According to Theorem 3.4, singularities of the neuromanifold arise exactly when both A and V have rank at most 1 (or vanish). Therefore, this result suggests an implicit bias in attention mechanisms towards inferring (extremely) low-rank functions. Such a bias has been empirically observed in a variety of neural architectures (Amari et al., 2006), and our result might suggest a mathematical explanation for this phenomenon, at least in the single-layer and lightning case.
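The single-layer results above lend themselves to quick numerical checks. The following NumPy sketch (sizes hypothetical, finite-difference Jacobian as a stand-in for a symbolic computation) first verifies the rank-1 fiber swap of Theorem 3.2, and then estimates the dimension of M_{2,1,2} for t = 2 as the rank of the Jacobian of the parametrization at a generic point, recovering the value 5 discussed for Figure 1:

```python
import numpy as np

rng = np.random.default_rng(4)

def lightning(A, V, X):
    return V @ X @ X.T @ A @ X

# (i) Rank-1 fiber swap (Theorem 3.2): (k q^T, h v^T) and (v q^T, h k^T)
# define the same function.
d, d_out, t = 3, 2, 4                 # hypothetical sizes
q, k, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
h = rng.normal(size=d_out)
X = rng.normal(size=(d, t))
f1 = lightning(np.outer(k, q), np.outer(h, v), X)
f2 = lightning(np.outer(v, q), np.outer(h, k), X)
assert np.allclose(f1, f2)

# (ii) dim M_{2,1,2} for t = 2: evaluate phi_W on enough generic inputs to
# pin down the degree-3 polynomial map, then take a Jacobian rank.
def features(w, Xs):
    A, V = w[:4].reshape(2, 2), w[4:].reshape(1, 2)
    return np.concatenate([(V @ X @ X.T @ A @ X).ravel() for X in Xs])

Xs = [rng.normal(size=(2, 2)) for _ in range(30)]
w0 = rng.normal(size=6)               # 4 entries of A plus 2 entries of V
eps = 1e-6
J = np.column_stack([(features(w0 + eps * e, Xs) - features(w0 - eps * e, Xs)) / (2 * eps)
                     for e in np.eye(6)])

# 6 parameters minus the 1-dimensional scaling fiber gives dimension 5,
# matching Corollary 3.3: 2*2*2 + 2*1 - 4 - 1 = 5.
assert np.linalg.matrix_rank(J, tol=1e-6) == 5
```

The rank drop by exactly one reflects the fact that, for these dimensions, the generic fiber is the one-dimensional rescaling orbit.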
3.3 DEEP NETWORKS

In this section, we completely describe the symmetries in the parameters of a generic function arising from a deep network of lightning self-attention layers. We show that there are only three symmetries: 1) each layer can be scaled by a constant, 2) the keys and queries within each layer can be scaled by an invertible matrix as in Lemma 3.1, and 3) the output of one layer can be scaled by an invertible matrix if the next layer cancels out this scaling. We now describe the latter type of parameter symmetry, before formally stating our main result in Theorem 3.7. To this end, consider dimension vectors d, a and weights W = (A, V) of a network with l layers. For 1 ≤ i ≤ l, we define virtual weights M_i and L_i in terms of (A, V); these provide the reparametrization (M, L) via which the generic fibers are described.

3.4 NORMALIZED SELF-ATTENTION

We now re-introduce normalization. To this end, fix a function S : R → R_{>0} that is injective and such that S(0) = 1. A typical choice is S(x) = e^{x/τ}, where τ ∈ R_{>0} is the temperature hyperparameter. The traditional self-attention mechanism is then defined for W = (A, V) and X ∈ R^{d×t} as:

φ_W(X)_i = ( Σ_{1≤j≤t} S(x_i^⊤ A x_j) V x_j ) / ( Σ_{1≤k≤t} S(x_i^⊤ A x_k) ).  (15)

Note that, for simplicity, we adhere to the convention of parametrizing via the attention matrix. Intuitively, in Equation 15 the attention weights S(x_i^⊤ A x_j) are normalized to sum to 1, forcing the model to distribute its attention across the input components. The following result describes the fibers of the parametrization, analogously to Theorem 3.2.

Theorem 3.9. Suppose that t ≥ 2. If φ_W = 0, then V = 0. Otherwise, the fiber of φ_W is a singleton {W}.

Proof. See Appendix A.8.

Therefore, differently from lightning self-attention, in this case the parametrization is generically one-to-one. We now consider the deep case. Note that, even with normalization, deep attention networks can be reparametrized via M and L, as defined in Section 3.3. Therefore, the fibers of the parametrization will be unaffected by the symmetries inside the attention matrices from Lemma 3.1 and the transformations C_i from Equation 9. However, this time no rescaling is possible.
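The loss of the rescaling symmetry can be seen directly: a minimal NumPy sketch of Equation 15 (with the hypothetical choices S(x) = e^x, i.e. τ = 1, and random sizes) shows that (λA, λ⁻¹V) no longer defines the same function once the weights are normalized:

```python
import numpy as np

rng = np.random.default_rng(6)
d, d_out, t = 3, 2, 4                 # hypothetical sizes

def normalized_attention(A, V, X):
    # Eq. (15) with S(x) = e^x: the weights S(x_i^T A x_j) are normalized
    # to sum to 1 over j before aggregating the values V x_j.
    G = np.exp(X.T @ A @ X)           # G[i, j] = S(x_i^T A x_j)
    P = G / G.sum(axis=1, keepdims=True)
    return V @ X @ P.T                # column i aggregates V x_j with weights P[i, j]

A = rng.normal(size=(d, d))
V = rng.normal(size=(d_out, d))
X = rng.normal(size=(d, t))

# Unlike the lightning variant, rescaling (A, V) -> (lam*A, V/lam) changes
# the function: S acts nonlinearly on the attention scores, so the scaling
# of A no longer cancels against the scaling of V.
lam = 2.0
f1 = normalized_attention(A, V, X)
f2 = normalized_attention(lam * A, V / lam, X)
assert not np.allclose(f1, f2)
```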
Therefore, we conjecture that the parametrization via (M, L) is generically one-to-one; in other words, that normalization only breaks the layer-wise scaling symmetry of the parametrization.

Conjecture 3.10. For normalized deep self-attention networks, the generic fibers of the parametrization via (M, L) are singletons.

In particular, suppose that for some δ it holds that d_i = δ for all 0 < i < l and d_0, d_l ≥ δ. Similarly to Corollary 3.8, the above conjecture implies that the dimension of the neuromanifold equals:

2α_1 d_0 − α_1² + δ(d_l + d_0) − δ² + Σ_{1<i