Published as a conference paper at ICLR 2025

GEOMETRY OF LIGHTNING SELF-ATTENTION: IDENTIFIABILITY AND DIMENSION

Nathan W. Henry* (University of Toronto, nathan.henry@mail.utoronto.ca), Giovanni Luca Marchetti* (Royal Institute of Technology (KTH), glma@kth.se), Kathlén Kohn* (Royal Institute of Technology (KTH), kathlen@kth.se)

ABSTRACT

We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.

Figure 1: A slice of the space of lightning self-attention mechanisms.

1 INTRODUCTION AND RELATED WORK

The self-attention mechanism is the cornerstone of the Transformer, a modern machine learning architecture that is nowadays popular in a vast variety of domains, ranging from natural language processing (Vaswani et al., 2017), to vision (Dosovitskiy et al., 2020), to sound (Huang et al., 2018). In all of these domains, self-attention mechanisms have showcased outstanding performance due to their ability to model long-range dependencies within data sequences. Lightning self-attention mechanisms (Schlag et al., 2021) are a variant of the standard mechanism where, differently from the original proposal, the attention weights are left un-normalized.
As a result, the computational complexity of a forward pass is linear with respect to the sequence length, substantially improving on the quadratic complexity of the original model. Despite their effectiveness, the theoretical understanding of self-attention mechanisms is superficial, and many aspects have yet to be clarified. In particular, understanding the geometry of function spaces defined by neural networks, typically referred to as neuromanifolds (Marchetti et al., 2025; Kohn, 2024; Calin, 2020), is a fundamental challenge due to its intimate connection to several machine learning aspects, such as sample complexity and expressivity.

*Equal contribution.

Moreover, since neural networks learn by following a gradient flow over the neuromanifold, the geometry of the latter controls several aspects of the training dynamics (Trager et al., 2019). While neuromanifolds are well-understood for several architectures, such as fully-connected (Kileel et al., 2019; Kubjas et al., 2024) and convolutional networks (Kohn et al., 2022; 2023; Shahverdi et al., 2024), they have not been considered for self-attention mechanisms.

In this work, we study the geometry of neuromanifolds associated to lightning self-attention mechanisms. These models are of algebraic nature, since they are tri-linear in their weights and cubic in the input. This enables us to analyze neuromanifolds via ideas and tools from algebraic geometry, a rich field concerned with spaces defined by polynomial equations. In particular, it is possible to compute geometric quantities such as the dimension of the neuromanifold. The latter is a measure of expressivity of the underlying model. More concretely, it is intimately linked with sample complexity. According to the Fundamental Theorem of Learning, the dimension controls, linearly, the sample complexity of learnability (Shalev-Shwartz & Ben-David, 2014).
This theory is typically formulated for (binary) classifiers¹, and the notion of dimension is a discrete one, specifically the Vapnik–Chervonenkis (VC) dimension. In the continuous setting, the dimension of the neuromanifold is the natural analogue of the combinatorial VC dimension, and controls the sample complexity of learnability. An expression for sample complexity can be used both to select the appropriate model/architecture given an available dataset, and to collect appropriate amounts of data to train a given model. This is especially important for attention-based models, which are nowadays popular in several domains, and are trained at extremely large scales.

The dimension of the neuromanifold is related to the dual question of identifiability (Grigsby et al., 2023; Bona-Pellissier et al., 2023; Fefferman et al., 1994), a problem concerned with characterizing the parameters corresponding to the same function. Geometrically, such parameters define fibers of the parametrization of the neuromanifold, and their (generic) dimension measures the difference between the dimension of the neuromanifold and the number of parameters. Therefore, characterizing fibers leads to an estimate of sample complexity which is more precise than the common practice of counting parameters. Moreover, understanding the fibers can be interesting beyond their relation to the dimension, since they control aspects of the training dynamics. Indeed, fibers induce invariances of the loss function which are data-independent, meaning that for any dataset, the loss will be constant for parameters within the same fiber. This gives rise to the phenomenon of flatness of the loss landscape (Zhao et al., 2022b), where minima are not isolated but instead belong to a continuous set.
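As a concrete illustration of this data-independent invariance, the following minimal NumPy sketch (all sizes, the dataset, and the squared-error loss are hypothetical choices) evaluates a loss on a single lightning self-attention layer at two parameter points in the same fiber, namely (A, V) and (λA, λ⁻¹V); this scaling symmetry is established in Section 3.1:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_out, t, n = 3, 2, 4, 8          # hypothetical sizes and dataset count

def lightning(A, V, X):
    # Single lightning self-attention layer in matrix form: V X X^T A X.
    return V @ X @ X.T @ A @ X

def mse_loss(A, V, data):
    # Squared-error loss against fixed targets; any data-independent
    # fiber symmetry leaves this value unchanged for every dataset.
    return sum(np.sum((lightning(A, V, X) - Y) ** 2) for X, Y in data)

A, V = rng.normal(size=(d, d)), rng.normal(size=(d_out, d))
data = [(rng.normal(size=(d, t)), rng.normal(size=(d_out, t))) for _ in range(n)]

lam = 2.7
loss_a = mse_loss(A, V, data)
loss_b = mse_loss(lam * A, V / lam, data)   # same fiber, hence same loss
assert np.isclose(loss_a, loss_b)
```

Since the two parameter points define the same function, the loss agrees for any dataset, which is exactly the flatness phenomenon described above.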
Even further, it is understood that the symmetries of the loss landscape control training dynamics, as gradient directions must be orthogonal to the fibers of the loss, which also induces a constraint on the Hessian (Kunin et al., 2020). This has recently been exploited to design optimizers that teleport along fibers (Zhao et al., 2022a), improving learning efficiency.

1.1 SUMMARY OF CONTRIBUTIONS

Our core contribution is a description of the (generic) fibers of the parametrization of lightning self-attention networks. As a consequence, the expression for the dimension of the neuromanifold follows immediately. More specifically, our results are summarized as follows.

- For a single layer of lightning self-attention (Section 3.1), we describe all the fibers, and additionally study various aspects of the geometry of the neuromanifold. Specifically, we prove that it is Euclidean closed and compute its singular and boundary points.
- For a deep lightning self-attention network (Section 3.3), we compute the generic fibers. Our proof involves a reparametrization, which can be interpreted as introducing virtual weights, and a subtle induction argument based on a closed-form algebraic expression for the network w.r.t. the new parameters. Assuming the network has a bottleneck architecture, we derive a formula for the dimension of the neuromanifold. In particular, the dimension is strictly lower than the number of parameters, with redundancies arising from a scaling symmetry, from inter-layer symmetries, and from the rank constraint on the attention weights.

¹In this context, neuromanifolds are referred to as hypothesis spaces, and are usually considered in a combinatorial version.

Lastly, we study traditional self-attention by re-introducing the softmax normalization (Section 3.4).
We prove that the parametrization of a single layer is generically one-to-one, and state a conjecture, verified via numerical experiments, for the generic fibers of deep self-attention networks.

2 LIGHTNING SELF-ATTENTION

Fix positive integers d, d', a, t ∈ N and matrices Q, K ∈ R^{a×d}, V ∈ R^{d'×d}. The latter are deemed query weights, key weights, and value weights, respectively². A self-attention mechanism is a map parametrized by (Q, K, V) sending sequences of length t of vectors in R^d to sequences of vectors in R^{d'}. Here we consider the variant of self-attention mechanisms deemed lightning, which is computationally efficient and fully algebraic.

Definition 1. The lightning self-attention mechanism associated to the weights W = (Q, K, V) is the map:

φ_W : (R^d)^t → (R^{d'})^t,  (x_i)_{1≤i≤t} ↦ ( Σ_{1≤j≤t} (x_j^⊤ K^⊤ Q x_i) V x_j )_{1≤i≤t}.  (1)

Intuitively, every component x_i of the input corresponds to a token and attends bilinearly to every other x_j, producing a scalar weight x_j^⊤ K^⊤ Q x_i ∈ R. These weights are used to aggregate the values V x_j ∈ R^{d'}, obtaining the corresponding component of the output. The map defined by Equation 1 is tri-linear in (Q, K, V) and homogeneous cubic in x. It is often convenient to write Equation 1 in matrix form: if X = (x_i)_{1≤i≤t} is interpreted as a d×t matrix, then:

φ_W(X) = V X X^⊤ K^⊤ Q X.  (2)

Moreover, we will often simplify the parametrization by introducing the attention matrix:

A = K^⊤ Q.  (3)

The latter will always be interpreted as a bilinear form (x, y) ↦ x^⊤ A y. Lightning attention mechanisms are variants of the traditional ones (Vaswani et al., 2017), where the attention weights x_j^⊤ A x_i are normalized to a probability distribution across j; see Section 3.4 for further details. The major practical advantage of the lightning variant is its computational efficiency with respect to the sequence length. Specifically, Equation 1 can be computed in O(t) time, while traditional self-attention mechanisms require O(t²) time due to normalization. The improvement in efficiency motivates the term lightning.
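The equivalence between Equation 1 and the matrix form of Equation 2, and the resulting linear-time evaluation, can be checked with a minimal NumPy sketch (all sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_out, a, t = 4, 3, 2, 6          # hypothetical sizes

Q, K = rng.normal(size=(a, d)), rng.normal(size=(a, d))
V = rng.normal(size=(d_out, d))
X = rng.normal(size=(d, t))          # t tokens as columns

# Naive O(t^2) evaluation of Eq. (1): token i attends to every token j
# with scalar weight x_j^T K^T Q x_i, which aggregates the values V x_j.
naive = np.zeros((d_out, t))
for i in range(t):
    for j in range(t):
        naive[:, i] += (X[:, j] @ K.T @ Q @ X[:, i]) * (V @ X[:, j])

# Lightning O(t) evaluation of Eq. (2): precompute the d x d statistic
# X X^T once, so the cost is linear in the sequence length t.
fast = V @ (X @ X.T) @ K.T @ Q @ X

assert np.allclose(naive, fast)
```

The key point is that X X^⊤ is a d×d matrix independent of the token index, so the double loop over token pairs collapses into a fixed number of matrix products.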
Self-attention mechanisms can be stacked in order to obtain a deep network architecture. To this end, fix positive integers t, l, d = (d_0, ..., d_l), a = (a_1, ..., a_l), and weights Q = (Q_1, ..., Q_l), K = (K_1, ..., K_l), V = (V_1, ..., V_l), with Q_i, K_i ∈ R^{a_i×d_{i−1}}, V_i ∈ R^{d_i×d_{i−1}}.

Definition 2. A deep self-attention network associated to the weights W = (Q, K, V) is the map:

φ_W : (R^{d_0})^t → (R^{d_l})^t  (4)

given by the composition φ_W = φ_{W_l} ∘ ... ∘ φ_{W_1}.

Again, a deep self-attention network is homogeneous of degree 3^l in x. Based on this, we denote by Sym_{3^l}(R^{d_0×t}, R^{d_l×t}) the vector space of homogeneous polynomial functions from R^{d_0×t} to R^{d_l×t} of degree 3^l in all the output coordinates.

Definition 3. The neuromanifold of a deep self-attention network is the image of the parametrization map W ↦ φ_W:

M_{d,a} = { φ_W | W ∈ R^{Σ_i d_{i−1}(2a_i + d_i)} } ⊆ Sym_{3^l}(R^{d_0×t}, R^{d_l×t}).  (5)

The neuromanifold is a semi-algebraic set by the Tarski–Seidenberg Theorem, meaning that it can be defined by a finite number of polynomial equalities and inequalities in Sym_{3^l}(R^{d_0×t}, R^{d_l×t}).

²An alternative standard notation for Q, K, V is W_Q, W_K, W_V. Our choice is motivated by better readability.

In this section, we study the neuromanifold of lightning attention networks, focusing on its parametrization and its dimension. Our core focus will be the description of the fibers of the parametrization map W ↦ φ_W, meaning that we will describe the sets of weights that define the same function. More precisely, the fiber of φ_W ∈ M_{d,a} is the set

{W' | φ_{W'} = φ_W}.  (6)

Once the fibers are understood, the dimension of the neuromanifold can be computed. To this end, it is actually sufficient to describe the generic fibers, i.e., the ones corresponding to almost all W or, more precisely, to W lying outside of the common zeros of a polynomial system. The co-dimension of such fibers is constant and coincides with the dimension of the neuromanifold.
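The stacked architecture of Definition 2 can be sanity-checked numerically. The following NumPy sketch (all sizes hypothetical) composes l = 2 lightning layers and verifies that the resulting network is homogeneous of degree 3^l = 9 in the input:

```python
import numpy as np

rng = np.random.default_rng(2)
t, dims, att = 5, [3, 4, 2], [2, 3]   # hypothetical: l = 2, d = (3, 4, 2), a = (2, 3)

def layer(A, V, X):
    # One lightning layer in matrix form: V X X^T A X.
    return V @ X @ X.T @ A @ X

# Per layer i: Q_i, K_i are a_i x d_{i-1}, V_i is d_i x d_{i-1};
# we store the attention matrix A_i = K_i^T Q_i together with V_i.
weights = []
for i in range(len(att)):
    Ki = rng.normal(size=(att[i], dims[i]))
    Qi = rng.normal(size=(att[i], dims[i]))
    Vi = rng.normal(size=(dims[i + 1], dims[i]))
    weights.append((Ki.T @ Qi, Vi))

def deep(X):
    for A, V in weights:
        X = layer(A, V, X)
    return X

X = rng.normal(size=(dims[0], t))
lam = 1.3
# Each layer is cubic in its input, so depth l gives degree 3^l.
assert np.allclose(deep(lam * X), lam ** (3 ** len(att)) * deep(X))
```

Each layer is cubic in its input, so composing l layers multiplies the degrees, which is the degree-3^l homogeneity used to embed the neuromanifold into Sym_{3^l}(R^{d_0×t}, R^{d_l×t}).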
In order to study the parametrization map and its fibers, it is convenient to split the problem by considering self-attention mechanisms as parametrized via the attention matrix. More precisely, we will think of self-attention mechanisms as parametrized, by abuse of notation, via weights W = (A, V), where A ∈ R^{d×d} is an arbitrary matrix, and will study the matrix multiplication map (Q, K) ↦ A = K^⊤ Q independently.

We begin by considering the latter. When a < d, the matrix multiplication map is not surjective, since A is constrained to have rank at most a. In other words, the image of this map is the determinantal variety defined as the set of matrices in R^{d×d} of rank at most a. On the other hand, the fibers of the matrix multiplication map are subtle, since they are closely related to the problem of matrix factorization. Yet, it is still possible to describe the generic fibers. To this end, note that the map exhibits the following invariance: K'^⊤ Q' = K^⊤ Q, where K' = CK and Q' = C^{−⊤} Q for an arbitrary invertible matrix C ∈ GL_a(R). Conversely, the following elementary result shows that this is the only symmetry of a generic fiber.

Lemma 3.1. Suppose that A = K^⊤ Q = K'^⊤ Q' = A' and that rk(A) = rk(A') = a ≤ d. Then there exists a unique invertible matrix C ∈ GL_a(R) such that K' = CK and Q' = C^{−⊤} Q.

Proof. See Appendix A.1.

It follows from the above result that, for a < d, the generic fibers of the matrix multiplication map are isomorphic to GL_a(R), and therefore have dimension a². This recovers the well-known formula for the dimension of the determinantal variety, which coincides with 2ad − a² = a(2d − a).

3.1 SINGLE-LAYER IDENTIFIABILITY

We now describe completely the fibers of the parametrization of a lightning self-attention mechanism in terms of the attention matrix. By abuse of notation, we will write φ_W for W = (A, V). Firstly, note that it is always possible to rescale the weights without changing the function. That is, (A, V) and (λA, λ^{−1}V) belong to the same fiber for all λ ∈ R \ {0}.
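Both symmetries introduced so far, the GL_a(R) symmetry of the factorization A = K^⊤ Q from Lemma 3.1 and the rescaling (A, V) ↦ (λA, λ⁻¹V), can be verified directly; a minimal NumPy sketch with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
a, d, d_out, t = 2, 4, 3, 5           # hypothetical sizes, with a < d

Q, K = rng.normal(size=(a, d)), rng.normal(size=(a, d))
V = rng.normal(size=(d_out, d))
C = rng.normal(size=(a, a))           # a random square matrix is generically invertible

# Symmetry of the matrix multiplication map (Lemma 3.1):
# K' = C K and Q' = C^{-T} Q leave A = K^T Q unchanged.
K2, Q2 = C @ K, np.linalg.inv(C).T @ Q
assert np.allclose(K.T @ Q, K2.T @ Q2)

# Rescaling symmetry of a single lightning layer:
# (A, V) and (lam * A, V / lam) define the same function.
A, X, lam = K.T @ Q, rng.normal(size=(d, t)), 3.5
f1 = V @ X @ X.T @ A @ X
f2 = (V / lam) @ X @ X.T @ (lam * A) @ X
assert np.allclose(f1, f2)
```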
Therefore, we will focus on the fibers up to rescaling.

Theorem 3.2. Suppose t ≥ 2. The fiber of φ_W ∈ M_{d,d',a} for a given W = (A, V) is as follows:

- If rk(A) = rk(V) = 1, given tensor decompositions A = kq^⊤ and V = hv^⊤ for some q, k, v ∈ R^d \ {0}, h ∈ R^{d'} \ {0}, the fiber consists, up to rescaling, of W and W' = (vq^⊤, hk^⊤).
- If φ_W = 0, the fiber consists of the W' = (A', V') such that A' = 0 or V' = 0.
- Otherwise, the fiber consists only of rescalings of W.

Proof. See Appendix A.2.

Note that the second condition of Theorem 3.2 is negligible, i.e., it does not hold for almost all weights W, even under the constraint rk(A) ≤ a. Moreover, if d, d' ≥ 2 or d, a ≥ 2, the first condition is negligible as well. Therefore, the generic fibers are one-dimensional. As a consequence, it is possible to compute the dimension (in the sense of algebraic geometry) of the neuromanifold, even when parametrized via queries and keys, or equivalently, when A is restricted to have rank at most a.

Corollary 3.3. Suppose that t ≥ 2 and that d, d' ≥ 2 or d, a ≥ 2. The dimension of the neuromanifold is:

dim(M_{d,d',a}) = 2ad + dd' − a² − 1 if a ≤ d, and d² + dd' − 1 otherwise.  (7)

Proof. The formula follows from the fact that the generic fibers of the parametrization are one-dimensional and that the space of pairs (A, V) with rk(A) ≤ a has dimension 2αd − α² + dd', where α = min{a, d} (see the discussion after Lemma 3.1).

3.2 SINGLE-LAYER GEOMETRY

We will now describe the geometry of the neuromanifold of a single layer in more detail. We will return to the question of identifiability for deep networks in Section 3.3. Throughout this section, we assume that t ≥ 2 and that either d, d' ≥ 2 or d, a ≥ 2.

Theorem 3.4. The neuromanifold M_{d,d',a} is closed in the Euclidean topology. Its (relative) boundary points are those φ_{(A,V)} of the form A = kq^⊤ and V = hk^⊤ for some q, k ∈ R^d, h ∈ R^{d'}. Moreover, M_{d,d',a} is not a smooth manifold: its singular points are the φ_{(A,V)} satisfying rk(A)·rk(V) ≤ 1.

Proof.
This is an amalgamation of Corollaries A.1, A.4, and A.6 in Appendix A.3.

Figure 1 provides a visualization of M_{2,1,2} for t = 2. The latter has dimension 5 and is embedded in the 40-dimensional space Sym_3(R^4, R^2). The illustration shows a rendering of the neuromanifold in a 3-dimensional affine slice of the ambient space. This slice cuts a 2-dimensional section of the neuromanifold. The yellow line denotes the singular locus, whose dotted segment emerges by taking the closure in the Zariski topology.

The above result has several consequences from a machine learning perspective and, in particular, in terms of learning dynamics. The fact that the neuromanifold is closed in the Euclidean topology implies that it contains its limit points. In particular, when the model is trained via a dynamical system, which is the case for gradient descent, any training trajectory that converges to an equilibrium in the ambient space will converge within the neuromanifold. Moreover, singularities of neuromanifolds are a central focus in Information Geometry, and specifically in Singular Learning Theory (Watanabe, 2009). According to the latter, singularities of neuromanifolds play a central role in deep learning, since the function learned by a neural network (via gradient descent) often corresponds to a singular point of the neuromanifold. In other words, singularities often attract the learning dynamics, resulting in a form of implicit bias associated to the neural architecture. According to Theorem 3.4, singularities of the neuromanifold arise exactly when both A and V have rank at most 1 (or vanish). Therefore, this result suggests an implicit bias in attention mechanisms towards inferring (extremely) low-rank functions. Such a bias has been empirically observed in a variety of neural architectures (Amari et al., 2006), and our result might suggest a mathematical explanation for this phenomenon, at least in the single-layer and lightning case.
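The single-layer results above lend themselves to quick numerical checks. The following NumPy sketch (sizes hypothetical, finite-difference Jacobian as a stand-in for a symbolic computation) first verifies the rank-1 fiber swap of Theorem 3.2, and then estimates the dimension of M_{2,1,2} for t = 2 as the rank of the Jacobian of the parametrization at a generic point, recovering the value 5 discussed for Figure 1:

```python
import numpy as np

rng = np.random.default_rng(4)

def lightning(A, V, X):
    return V @ X @ X.T @ A @ X

# (i) Rank-1 fiber swap (Theorem 3.2): (k q^T, h v^T) and (v q^T, h k^T)
# define the same function.
d, d_out, t = 3, 2, 4                 # hypothetical sizes
q, k, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
h = rng.normal(size=d_out)
X = rng.normal(size=(d, t))
f1 = lightning(np.outer(k, q), np.outer(h, v), X)
f2 = lightning(np.outer(v, q), np.outer(h, k), X)
assert np.allclose(f1, f2)

# (ii) dim M_{2,1,2} for t = 2: evaluate phi_W on enough generic inputs to
# pin down the degree-3 polynomial map, then take a Jacobian rank.
def features(w, Xs):
    A, V = w[:4].reshape(2, 2), w[4:].reshape(1, 2)
    return np.concatenate([(V @ X @ X.T @ A @ X).ravel() for X in Xs])

Xs = [rng.normal(size=(2, 2)) for _ in range(30)]
w0 = rng.normal(size=6)               # 4 entries of A plus 2 entries of V
eps = 1e-6
J = np.column_stack([(features(w0 + eps * e, Xs) - features(w0 - eps * e, Xs)) / (2 * eps)
                     for e in np.eye(6)])

# 6 parameters minus the 1-dimensional scaling fiber gives dimension 5,
# matching Corollary 3.3: 2*2*2 + 2*1 - 4 - 1 = 5.
assert np.linalg.matrix_rank(J, tol=1e-6) == 5
```

The rank drop by exactly one reflects the fact that, for these dimensions, the generic fiber is the one-dimensional rescaling orbit.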
3.3 DEEP NETWORKS

In this section, we completely describe the symmetries in the parameters of a generic function arising from a deep network of lightning self-attention layers. We show that there are only three symmetries: 1) each layer can be scaled by a constant, 2) the keys and queries within each layer can be scaled by an invertible matrix as in Lemma 3.1, and 3) the output of one layer can be scaled by an invertible matrix if the next layer cancels out this scaling. We now describe the latter type of parameter symmetry, before formally stating our main result in Theorem 3.7. To this end, consider dimension vectors d, a and weights W = (A, V) of a network with l layers. For 1 ≤ i ≤ l, we define virtual weights M_i and L_i in terms of (A, V); these provide the reparametrization (M, L) via which the generic fibers are described.

3.4 NORMALIZED SELF-ATTENTION

We now re-introduce normalization. To this end, fix a function S : R → R_{>0} that is injective and such that S(0) = 1. A typical choice is S(x) = e^{x/τ}, where τ ∈ R_{>0} is the temperature hyperparameter. The traditional self-attention mechanism is then defined for W = (A, V) and X ∈ R^{d×t} as:

φ_W(X)_i = ( Σ_{1≤j≤t} S(x_i^⊤ A x_j) V x_j ) / ( Σ_{1≤k≤t} S(x_i^⊤ A x_k) ).  (15)

Note that, for simplicity, we adhere to the convention of parametrizing via the attention matrix. Intuitively, in Equation 15 the attention weights S(x_i^⊤ A x_j) are normalized to sum to 1, forcing the model to distribute its attention across the input components. The following result describes the fibers of the parametrization, analogously to Theorem 3.2.

Theorem 3.9. Suppose that t ≥ 2. If φ_W = 0, then V = 0. Otherwise, the fiber of φ_W is a singleton {W}.

Proof. See Appendix A.8.

Therefore, differently from lightning self-attention, in this case the parametrization is generically one-to-one. We now consider the deep case. Note that, even with normalization, deep attention networks can be reparametrized via M and L, as defined in Section 3.3. Therefore, the fibers of the parametrization will be unaffected by the symmetries inside the attention matrices from Lemma 3.1 and the transformations C_i from Equation 9. However, this time no rescaling is possible.
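The loss of the rescaling symmetry can be seen directly: a minimal NumPy sketch of Equation 15 (with the hypothetical choices S(x) = e^x, i.e. τ = 1, and random sizes) shows that (λA, λ⁻¹V) no longer defines the same function once the weights are normalized:

```python
import numpy as np

rng = np.random.default_rng(6)
d, d_out, t = 3, 2, 4                 # hypothetical sizes

def normalized_attention(A, V, X):
    # Eq. (15) with S(x) = e^x: the weights S(x_i^T A x_j) are normalized
    # to sum to 1 over j before aggregating the values V x_j.
    G = np.exp(X.T @ A @ X)           # G[i, j] = S(x_i^T A x_j)
    P = G / G.sum(axis=1, keepdims=True)
    return V @ X @ P.T                # column i aggregates V x_j with weights P[i, j]

A = rng.normal(size=(d, d))
V = rng.normal(size=(d_out, d))
X = rng.normal(size=(d, t))

# Unlike the lightning variant, rescaling (A, V) -> (lam*A, V/lam) changes
# the function: S acts nonlinearly on the attention scores, so the scaling
# of A no longer cancels against the scaling of V.
lam = 2.0
f1 = normalized_attention(A, V, X)
f2 = normalized_attention(lam * A, V / lam, X)
assert not np.allclose(f1, f2)
```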
Therefore, we conjecture that the parametrization via (M, L) is generically one-to-one; in other words, that normalization only breaks the layer-wise scaling symmetry of the parametrization.

Conjecture 3.10. For normalized deep self-attention networks, the generic fibers of the parametrization via (M, L) are singletons.

In particular, suppose that for some δ it holds that d_i = δ for all 0 < i < l and d_0, d_l ≥ δ. Similarly to Corollary 3.8, the above conjecture implies that the dimension of the neuromanifold equals:

2α_1 d_0 − α_1² + δ(d_l + d_0) − δ² + Σ_{1<i