# Neural (Tangent Kernel) Collapse

Mariia Seleznova¹ Dana Weitzner² Raja Giryes² Gitta Kutyniok¹ Hung-Hsu Chou¹
¹Ludwig-Maximilians-Universität München ²Tel Aviv University

Correspondence to: Mariia Seleznova (selez@math.lmu.de). 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

This work bridges two important concepts: the Neural Tangent Kernel (NTK), which captures the evolution of deep neural networks (DNNs) during training, and the Neural Collapse (NC) phenomenon, which refers to the emergence of symmetry and structure in the last-layer features of well-trained classification DNNs. We adopt the natural assumption that the empirical NTK develops a block structure aligned with the class labels, i.e., samples within the same class have stronger correlations than samples from different classes. Under this assumption, we derive the dynamics of DNNs trained with mean squared error (MSE) loss and break them into interpretable phases. Moreover, we identify an invariant that captures the essence of the dynamics, and use it to prove the emergence of NC in DNNs with block-structured NTK. We provide large-scale numerical experiments on three common DNN architectures and three benchmark datasets to support our theory.

1 Introduction

Deep Neural Networks (DNNs) are advancing the state of the art in many real-life applications, ranging from image classification to machine translation. Yet, there is no comprehensive theory that can explain the multitude of empirical phenomena observed in DNNs. In this work, we provide a theoretical connection between two such empirical phenomena, prominent in modern DNNs: Neural Collapse (NC) and Neural Tangent Kernel (NTK) alignment.

Neural Collapse. NC [39] emerges while training modern classification DNNs past zero error to further minimize the loss. During NC, the class means of the DNN's last-layer features form a symmetric structure with maximal separation angle, while the features of each individual sample collapse to their class means. This simple structure of the feature vectors appears favourable for generalization and robustness in the literature [12, 31, 40, 47]. Though NC is common in modern DNNs, explaining the mechanisms behind its emergence is challenging, since the complex non-linear training dynamics of DNNs evade analytical treatment.

Neural Tangent Kernel. The NTK [30] describes the gradient descent dynamics of DNNs in function space, which provides a dual perspective to DNNs' evolution in parameter space. This perspective allows one to study the dynamics of DNNs analytically in the infinite-width limit, where the NTK is constant during training [30]. Hence, theoretical works often rely on the infinite-width NTK to analyze the generalization of DNNs [1, 20, 28, 49]. However, multiple authors have argued that the infinite-width limit does not fully reflect the behaviour of realistic DNNs [2, 10, 22, 27, 36, 43], since a constant NTK implies that no feature learning occurs during training.

NTK Alignment. While the infinite-width NTK is label-agnostic and does not change during training, the empirical NTK rapidly aligns with the target function in the early stages of training [5, 7, 44, 45]. In the context of classification, this manifests itself as the emergence of a block structure in the kernel matrix, where the correlations between samples from the same class are stronger than between samples from different classes.
The NTK alignment implies the so-called local elasticity of DNN training dynamics, i.e., samples from one class have little impact on samples from other classes in Stochastic Gradient Descent (SGD) updates [23]. Several recent works have also linked the local elasticity of training dynamics to the emergence of NC [33, 53]. This brings us to the main question of this paper: Is there a connection between NTK alignment and neural collapse?

Contribution. In this work, we consider a model of NTK alignment where the kernel has a block structure, i.e., it takes only three distinct values: an inter-class value, an intra-class value and a diagonal value. We describe this model in Section 3. Within the model, we establish the connection between NTK alignment and NC, and identify the conditions under which NC occurs. Our main contributions are as follows:

- We derive and analyze the training dynamics of DNNs with MSE loss and block-structured NTK in Section 4. We identify three distinct convergence rates in the dynamics, which correspond to three components of the training error: the error of the global mean, of the class means, and of each individual sample. These components play a key role in the dynamics.
- We show that NC emerges in DNNs with block-structured NTK under additional assumptions in Section 5.3. To the best of our knowledge, this is the first work to connect NTK alignment and NC. While previous contributions rely on unconstrained features models [21, 38, 48] or other imitations of DNN training dynamics [53] to derive NC (see Appendix A for a detailed discussion of related works), we consider standard gradient flow dynamics of DNNs, simplified by our assumption on the NTK structure.
- We analyze when NC does or does not occur in DNNs with NTK alignment, both theoretically and empirically. In particular, we identify an invariant of the training dynamics that provides a necessary condition for the emergence of NC in Section 5.2. Since DNNs with block-structured NTK do not always converge to NC, we conclude that NTK alignment is a more widespread phenomenon than NC.
- We support our theory with large-scale numerical experiments in Section 6. Source code to reproduce the results is available in the project's GitHub repository.

2 Preliminaries

We consider the classification problem with $C \in \mathbb{N}$ classes, where the goal is to build a classifier that returns a class label for any input $x \in X$. In this work, the classifier is a DNN trained on a dataset $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in X$ are the inputs and $y_i \in \mathbb{R}^C$ are the one-hot encodings of the class labels. We view the output function of the DNN $f: X \to \mathbb{R}^C$ as a composition of parametrized last-layer features $h: X \to \mathbb{R}^n$ and a linear classification layer parametrized by weights $W \in \mathbb{R}^{C \times n}$ and biases $b \in \mathbb{R}^C$. Then the logits of the training data $X = \{x_i\}_{i=1}^N$ can be expressed as follows:
$$f(X) = WH + b\mathbf{1}_N^\top, \qquad (1)$$
where $H \in \mathbb{R}^{n \times N}$ are the features of the entire dataset stacked as columns and $\mathbf{1}_N \in \mathbb{R}^N$ is a vector of ones. Though we omit the data dependence in the text to follow, i.e., we write $H$ without the explicit dependence on $X$, we emphasize that the features $H$ are a function of the data and the DNN's parameters, unlike in the previously studied unconstrained features models [21, 38, 48]. We assume that the dataset is balanced, i.e., there are $m := N/C$ training samples for each class. Without loss of generality, we further assume that the inputs are reordered so that $x_{(c-1)m+1}, \ldots, x_{cm}$ belong to class $c$ for all $c \in [C]$. This will make the notation much easier later on. Since the dimension of features $n$ is typically much larger than the number of classes, we also assume $n > C$ in this work.
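To make this setup concrete, the following minimal NumPy sketch builds a balanced, class-ordered one-hot label matrix and evaluates the logits in (1); the sizes are arbitrary and the features $H$ are random placeholders standing in for a DNN's last-layer features.

```python
# A minimal sketch of the setup in Section 2 (hypothetical sizes; H is a random
# stand-in for the DNN's last-layer features).
import numpy as np

C, m, n = 10, 12, 256          # classes, samples per class, feature dimension
N = C * m                      # total number of (class-ordered) samples

# One-hot labels Y in R^{C x N}: columns (c-1)m+1, ..., cm belong to class c.
labels = np.repeat(np.arange(C), m)          # [0,...,0, 1,...,1, ..., C-1,...]
Y = np.eye(C)[labels].T                      # shape (C, N)

# Last-layer features and linear classifier (random placeholders).
rng = np.random.default_rng(0)
H = rng.normal(size=(n, N))                  # features of the whole dataset
W = rng.normal(size=(C, n)) / np.sqrt(n)     # classifier weights
b = np.zeros(C)                              # classifier biases

# Logits of the training data, Eq. (1): f(X) = W H + b 1_N^T.
f_X = W @ H + b[:, None]                     # shape (C, N)

# MSE loss used later in Section 2.3: L = 0.5 * ||f(X) - Y||_F^2.
loss = 0.5 * np.linalg.norm(f_X - Y) ** 2
print(f_X.shape, loss)
```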
2.1 Neural Collapse

Neural Collapse (NC) is an empirical behaviour of classifier DNNs trained past zero error [39]. Let $\bar{h} := N^{-1}\sum_{i=1}^N h(x_i)$ denote the global features mean and $\bar{h}_c := m^{-1}\sum_{x_i \in \text{class } c} h(x_i)$, $c \in [C]$, be the class means. Furthermore, define the matrix of normalized centered class means as $M := [\tilde{h}_1/\|\tilde{h}_1\|_2, \ldots, \tilde{h}_C/\|\tilde{h}_C\|_2] \in \mathbb{R}^{n \times C}$, where $\tilde{h}_c = \bar{h}_c - \bar{h}$, $c \in [C]$. We say that a DNN exhibits NC if the following four behaviours emerge as the training time $t$ increases:

(NC1) Variability collapse: for all samples $x_i^c$ from class $c \in [C]$, where $i \in [m]$, the penultimate-layer features converge to their class means, i.e., $\|h(x_i^c) - \bar{h}_c\|_2 \to 0$.

(NC2) Convergence to a Simplex Equiangular Tight Frame (ETF): for all $c, c' \in [C]$, the class means converge to the following configuration:
$$\big|\, \|\bar{h}_c - \bar{h}\|_2 - \|\bar{h}_{c'} - \bar{h}\|_2 \,\big| \to 0, \qquad \Big\| M^\top M - \tfrac{C}{C-1}\big(I_C - \tfrac{1}{C}\mathbf{1}_C\mathbf{1}_C^\top\big) \Big\|_F \to 0.$$

(NC3) Convergence to self-duality: the class means $M$ and the final weights $W$ converge to each other: $\big\| M/\|M\|_F - W^\top/\|W\|_F \big\|_F \to 0$.

(NC4) Simplification to the Nearest Class Center (NCC) rule: the classifier converges to the NCC decision rule: $\arg\max_c (Wh(x) + b)_c \to \arg\min_c \|h(x) - \bar{h}_c\|_2$.

Though NC is observed in practice, there is currently no conclusive theory on the mechanisms of its emergence during DNN training. Most theoretical works on NC adopt the unconstrained features model, where the features $H$ are free variables that can be directly optimized [21, 38, 48]. Training dynamics of such models do not accurately reflect the dynamics of real DNNs, since they ignore the dependence of the features on the input data and the DNN's trainable parameters. In this work, we make a step towards realistic DNN dynamics by means of the Neural Tangent Kernel (NTK).

2.2 Neural Tangent Kernel

The NTK $\Theta$ of a DNN with output function $f: X \to \mathbb{R}^C$ and trainable parameters $w \in \mathbb{R}^P$ (stretched into a single vector) is given by
$$\Theta_{k,s}(x_i, x_j) := \langle \nabla_w f_k(x_i), \nabla_w f_s(x_j) \rangle, \qquad x_i, x_j \in X, \quad k, s \in [C]. \qquad (2)$$
We also define the last-layer features kernel $\Theta^h$, which is the component of the NTK corresponding to the parameters up to the penultimate layer, as follows:
$$\Theta^h_{k,s}(x_i, x_j) := \langle \nabla_w h_k(x_i), \nabla_w h_s(x_j) \rangle, \qquad x_i, x_j \in X, \quad k, s \in [n]. \qquad (3)$$
Intuitively, the NTK captures the correlations between the training samples in the DNN dynamics. While most theoretical works consider the infinite-width limit of DNNs [30, 52], where the NTK can be computed theoretically, empirical studies have also extensively explored the NTK of finite-width networks [19, 36, 45, 49]. Unlike the label-agnostic infinite-width NTK, the empirical NTK aligns with the labels during training. We use this observation in our main assumption (Section 3).

2.3 Classification with MSE Loss

We study NC for DNNs with the mean squared error (MSE) loss given by
$$\mathcal{L}(W, H, b) = \tfrac{1}{2}\|f(X) - Y\|_F^2, \qquad (4)$$
where $Y \in \mathbb{R}^{C \times N}$ is the matrix of stacked labels $y_i$. While NC was originally introduced for the cross-entropy (CE) loss [39], which is more common in classification problems, the MSE loss is much easier to analyze theoretically. Moreover, empirical observations suggest that DNNs with MSE loss achieve performance comparable to CE [14, 29, 41], which motivates the recent line of research on MSE-NC [21, 38, 48].

3 Block Structure of the NTK

Numerous empirical studies have demonstrated that the NTK becomes aligned with the labels $Y^\top Y$ during the training process [7, 32, 45]. This alignment constitutes feature learning and is associated with better performance of DNNs [9, 13].
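Kernel–target alignment is typically measured by the normalized Frobenius inner product between the kernel matrix and the label Gram matrix $Y^\top Y$ (cf. Figure 1, pane e). The following sketch computes this quantity for a toy similarity matrix standing in for an actual empirical NTK; the kernel values are made up for illustration.

```python
# Kernel-target alignment (a sketch with a toy kernel in place of the NTK).
import numpy as np

def alignment(K, G):
    """Cosine similarity <K, G>_F / (||K||_F ||G||_F) of two N x N matrices."""
    return np.sum(K * G) / (np.linalg.norm(K) * np.linalg.norm(G))

C, m = 10, 12
N = C * m
labels = np.repeat(np.arange(C), m)
Y = np.eye(C)[labels].T                      # one-hot labels, shape (C, N)
G = Y.T @ Y                                  # label Gram matrix Y^T Y, shape (N, N)

rng = np.random.default_rng(0)
X = rng.normal(size=(N, 50))
K_random = X @ X.T                           # random p.s.d. matrix: weak alignment
K_aligned = K_random + 20.0 * G              # add a class-aligned block component

print(alignment(K_random, G))                # low
print(alignment(K_aligned, G))               # substantially higher
```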
For classification problems, this alignment means that the empirical NTK develops an approximate block structure, with larger kernel values corresponding to pairs of samples $(x_i^c, x_j^c)$ from the same class [44]. Figure 1 shows an example of such a structure emergent in the empirical NTK of ResNet20 trained on MNIST.²

Figure 1: The NTK block structure of ResNet20 trained on MNIST. a) Traced kernel $\sum_{k=1}^C \Theta_{k,k}(X)$ computed on a random data subset with 12 samples from each class. The samples are ordered as described in Section 2, so that the diagonal blocks correspond to pairs of inputs from the same class. b) Traced kernel $\sum_{k=1}^n \Theta^h_{k,k}(X)$ computed on the same subset. c) Norms of the kernels $\Theta_{k,s}(X)$ for all $k, s \in [C]$. d) Norms of the kernels $\Theta^h_{k,s}(X)$ for all $k, s \in [n]$. The color bars show the values in each heatmap as a fraction of the maximal value in the heatmap. e) The alignment of the traced kernels from panes a and b with the class labels.

Motivated by these observations, we assume that the NTK and the last-layer features kernel exhibit a block structure, defined as follows:

Definition 3.1 (Block structure of a kernel). We say a kernel $\Theta: X \times X \to \mathbb{R}^{K \times K}$ has a block structure associated with $(\lambda_1, \lambda_2, \lambda_3)$ if $\lambda_1 > \lambda_2 > \lambda_3 \geq 0$ and
$$\Theta(x, x) = \lambda_1 I_K, \qquad \Theta(x_i^c, x_j^c) = \lambda_2 I_K, \qquad \Theta(x_i^c, x_j^{c'}) = \lambda_3 I_K, \qquad (5)$$
where $x_i^c$ and $x_j^c$ are two distinct inputs from the same class, and $x_j^{c'}$ is an input from a class $c' \neq c$.

Assumption 3.2. The NTK $\Theta: X \times X \to \mathbb{R}^{C \times C}$ has a block structure associated with $(\gamma_d, \gamma_c, \gamma_n)$, and the penultimate kernel $\Theta^h: X \times X \to \mathbb{R}^{n \times n}$ has a block structure associated with $(\kappa_d, \kappa_c, \kappa_n)$.

This assumption means that every kernel $\Theta_{k,k}(X) := [\Theta_{k,k}(x_i, x_j)]_{i,j \in [N]}$ corresponding to an output neuron $f_k$, $k \in [C]$, and every kernel $\Theta^h_{p,p}(X)$ corresponding to a last-layer neuron $h_p$, $p \in [n]$, is aligned with $Y^\top Y$ (see Figure 1, panes a–b). Additionally, the "non-diagonal" kernels $\Theta_{k,s}(X)$ and $\Theta^h_{k,s}(X)$, $k \neq s$, are equal to zero (see Figure 1, panes c–d).³ Moreover, if $\gamma_c \gg \gamma_n$ and $\kappa_c \gg \kappa_n$, Assumption 3.2 can be interpreted as local elasticity of DNNs, defined below.

Definition 3.3 (Local elasticity [23]). A classifier is said to be locally elastic (LE) if its prediction or feature representation on a point $x_i^c$ from class $c \in [C]$ is not significantly affected by performing SGD updates on data points from classes $c' \neq c$.

To see the relation between Assumption 3.2 and this definition, consider a Gradient Descent (GD) step with step size $\eta$ performed on a single input $x_j^{c'}$ from class $c' \neq c$. By the chain rule, a block-structured $\Theta$ implies locally elastic predictions, since
$$f^{t+1}(x_i^c) - f^t(x_i^c) = -\eta\, \Theta(x_i^c, x_j^{c'})\, \frac{\partial \mathcal{L}(x_j^{c'})}{\partial f(x_j^{c'})} + O(\eta^2), \qquad (6)$$
i.e., the magnitude of the GD step of $f(x_i^c)$ is determined by the value of $\Theta(x_i^c, x_j^{c'})$. Similarly, a block-structured kernel $\Theta^h$ implies locally elastic penultimate-layer features, because
$$h^{t+1}(x_i^c) - h^t(x_i^c) = -\eta\, \Theta^h(x_i^c, x_j^{c'})\, W^\top \frac{\partial \mathcal{L}(x_j^{c'})}{\partial f(x_j^{c'})} + O(\eta^2). \qquad (7)$$
This observation provides a connection between our work and recent contributions suggesting a connection between NC and local elasticity [33, 53].

²We provide figures illustrating the NTK block structure on other architectures and datasets in Appendix C.
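The following toy NumPy sketch illustrates Assumption 3.2 and Definition 3.3 numerically: it builds one block-structured per-neuron kernel $\Theta_{k,k}(X)$ and compares how a GD step on a single sample moves the prediction on a same-class sample versus a sample from another class, as in (6). The kernel values $(\gamma_d, \gamma_c, \gamma_n)$ are made up.

```python
# Local elasticity under a block-structured kernel (toy values, see Eq. (6)).
import numpy as np

C, m = 10, 12
N = C * m
labels = np.repeat(np.arange(C), m)
block = (labels[:, None] == labels[None, :]).astype(float)   # 1 iff same class

gamma_d, gamma_c, gamma_n = 5.0, 2.0, 0.1                     # diagonal > intra > inter
Theta = gamma_n * np.ones((N, N)) + (gamma_c - gamma_n) * block \
        + (gamma_d - gamma_c) * np.eye(N)                     # per-neuron kernel on X

# GD step on a single sample j: up to O(eta^2), the change of the prediction on
# sample i is -eta * Theta[i, j] * (residual on sample j).
eta, residual_j = 0.1, 1.0
j = 0                                    # a sample from class 0
delta_f = -eta * Theta[:, j] * residual_j

same_class = abs(delta_f[1])             # effect on another class-0 sample
other_class = abs(delta_f[m])            # effect on a class-1 sample
print(same_class / other_class)          # = gamma_c / gamma_n: local elasticity
```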
³We discuss possible relaxations of our main assumption, where the "non-diagonal" components of the last-layer kernel $\Theta^h_{k,s}$ are allowed to be non-zero, in Appendix D.

| Eigenvalue | Eigenvector | Multiplicity |
| --- | --- | --- |
| $\lambda_{\text{single}} = \gamma_d - \gamma_c$ | $v_i^c = \frac{1}{\sqrt{m(m-1)}}\big[\underbrace{m-1}_{\text{index } i},\ \underbrace{-1, \ldots, -1}_{\text{rest of class } c},\ \underbrace{0, \ldots, 0}_{\text{others}}\big]^\top$ | $C(m-1)$ |
| $\lambda_{\text{class}} = \lambda_{\text{single}} + m(\gamma_c - \gamma_n)$ | $v^c = \frac{1}{\sqrt{N(C-1)}}\big[\underbrace{C-1, \ldots, C-1}_{\text{class } c},\ \underbrace{-1, \ldots, -1}_{\text{others}}\big]^\top$ | $C - 1$ |
| $\lambda_{\text{global}} = \lambda_{\text{class}} + N\gamma_n$ | $v_0 = \frac{1}{\sqrt{N}}\mathbf{1}_N$ | $1$ |

Table 1: Eigendecomposition of the block-structured NTK.

4 Dynamics of DNNs with NTK Alignment

4.1 Convergence

As a warm up for our main results, we analyze the effects of the NTK block structure on the convergence of DNNs. Consider a GD update of an output neuron $f_k$, $k \in [C]$, with step size $\eta$:
$$f_k^{t+1}(X) = f_k^t(X) - \eta\, \Theta_{k,k}(X)\big(f_k^t(X) - Y_k\big) + O(\eta^2), \qquad k = 1, \ldots, C. \qquad (8)$$
Note that we have taken into account that $\Theta_{k,s}$ is zero for $k \neq s$ by our assumption. Denote the residuals corresponding to $f_k$ as $r_k := f_k(X) - Y_k \in \mathbb{R}^N$. Then we have the following dynamics for the residuals vector:
$$r_k^{t+1} = \big(I_N - \eta\, \Theta_{k,k}(X)\big)\, r_k^t + O(\eta^2). \qquad (9)$$
The eigendecomposition of the block-structured kernel $\Theta_{k,k}(X)$ provides important insights into this dynamics and is summarized in Table 1. We notice that the NTK has three distinct eigenvalues $\lambda_{\text{global}} \geq \lambda_{\text{class}} \geq \lambda_{\text{single}}$, which imply different convergence rates for certain components of the error. Moreover, the eigenvectors associated with each of these eigenvalues reveal the meaning of the error components corresponding to each convergence rate. Indeed, consider the projected dynamics with respect to the eigenvector $v_0$ and eigenvalue $\lambda_{\text{global}}$ from Table 1:
$$\langle r_k^{t+1}, v_0 \rangle = (1 - \eta\lambda_{\text{global}})\, \langle r_k^t, v_0 \rangle, \qquad (10)$$
where we omitted $O(\eta^2)$ for clarity. Now notice that the projection of $r_k^t$ onto the vector $v_0$ is in fact proportional to the average residual over the training set:
$$\langle r_k^t, v_0 \rangle = \tfrac{1}{\sqrt{N}}\, \langle r_k^t, \mathbf{1}_N \rangle = \sqrt{N}\, \langle r_k^t \rangle, \qquad (11)$$
where $\langle \cdot \rangle$ denotes the average over all the training samples $x_i \in X$. By a similar calculation, for all $c \in [C]$ and $i \in [m]$ we get interpretations of the remaining projections of the residual:
$$\langle r_k^t, v^c \rangle = \sqrt{\tfrac{N}{C-1}}\,\big(\langle r_k^t \rangle_c - \langle r_k^t \rangle\big), \qquad \langle r_k^t, v_i^c \rangle = \sqrt{\tfrac{m}{m-1}}\,\big(r_k^t(x_i^c) - \langle r_k^t \rangle_c\big), \qquad (12)$$
where $\langle \cdot \rangle_c$ denotes the average over samples $x_i^c$ from class $c$, and $r_k(x_i^c)$ is the $k$th component of $f(x_i^c) - y_i^c$. Combining (10), (11) and (12), we have the following convergence rates:
$$\langle r_k^{t+1} \rangle = (1 - \eta\lambda_{\text{global}})\, \langle r_k^t \rangle, \qquad (13)$$
$$\langle r_k^{t+1} \rangle_c - \langle r_k^{t+1} \rangle = (1 - \eta\lambda_{\text{class}})\,\big(\langle r_k^t \rangle_c - \langle r_k^t \rangle\big), \qquad (14)$$
$$r_k^{t+1}(x_i^c) - \langle r_k^{t+1} \rangle_c = (1 - \eta\lambda_{\text{single}})\,\big(r_k^t(x_i^c) - \langle r_k^t \rangle_c\big). \qquad (15)$$
Overall, this means that the global mean $\langle r \rangle$ of the residual converges first, then the class means, and finally the residual of each sample $r(x_i^c)$. To simplify the notation, we define the following quantities:
$$R := f(X) - Y = [r(x_1), \ldots, r(x_N)], \qquad (16)$$
$$R_{\text{class}} := \tfrac{1}{m}\, R\, Y^\top Y = [\underbrace{\bar{r}_1, \ldots, \bar{r}_1}_{m}, \ldots, \underbrace{\bar{r}_C, \ldots, \bar{r}_C}_{m}], \qquad R_1 := [\bar{r}_1, \ldots, \bar{r}_C], \qquad (17)$$
$$R_{\text{global}} := \tfrac{1}{N}\, R\, \mathbf{1}_N \mathbf{1}_N^\top = \bar{r}\, \mathbf{1}_N^\top, \qquad (18)$$
where $R \in \mathbb{R}^{C \times N}$ is the matrix of residuals, $R_{\text{class}} \in \mathbb{R}^{C \times N}$ contains the residuals averaged over each class and stacked $m$ times, and $R_{\text{global}} \in \mathbb{R}^{C \times N}$ contains the residuals averaged over the whole training set and stacked $N$ times. According to the previous discussion, $R_{\text{global}}$ converges to zero at the fastest rate, while $R$ converges at the slowest rate. The last phase, which we call the end of training, is when $R_{\text{class}}$ and $R_{\text{global}}$ have nearly vanished and can be treated as zero for the remaining training time. We will use this notion in several remarks, as well as in the proof of Theorem 5.2.
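The rates (13)–(15) and the spectrum in Table 1 are easy to verify numerically. The sketch below builds a block-structured kernel with made-up values $(\gamma_d, \gamma_c, \gamma_n)$, checks its eigenvalues and multiplicities against Table 1, and confirms that one GD step shrinks the global mean, the class-mean deviations and the per-sample deviations of the residual by exactly the three predicted factors.

```python
# Numerical check of Table 1 and the convergence rates (13)-(15) (toy values).
import numpy as np

C, m = 10, 12
N = C * m
labels = np.repeat(np.arange(C), m)
block = (labels[:, None] == labels[None, :]).astype(float)

gamma_d, gamma_c, gamma_n = 5.0, 2.0, 0.1
Theta = gamma_n * np.ones((N, N)) + (gamma_c - gamma_n) * block \
        + (gamma_d - gamma_c) * np.eye(N)

lam_single = gamma_d - gamma_c
lam_class = lam_single + m * (gamma_c - gamma_n)
lam_global = lam_class + N * gamma_n

# Spectrum matches Table 1: multiplicities C(m-1), C-1 and 1.
eigs = np.sort(np.linalg.eigvalsh(Theta))
assert np.allclose(eigs[: C * (m - 1)], lam_single)
assert np.allclose(eigs[C * (m - 1) : -1], lam_class)
assert np.isclose(eigs[-1], lam_global)

# One GD step on the residual of a single output neuron, Eq. (9).
rng = np.random.default_rng(0)
r = rng.normal(size=N)
eta = 0.5 / lam_global
r_next = r - eta * Theta @ r

def components(r):
    """Global mean, class-mean deviations and per-sample deviations of r."""
    global_mean = r.mean()
    class_means = r.reshape(C, m).mean(axis=1)
    per_sample = r - np.repeat(class_means, m)
    return global_mean, class_means - global_mean, per_sample

g0, c0, s0 = components(r)
g1, c1, s1 = components(r_next)
print(g1 / g0, 1 - eta * lam_global)        # Eq. (13)
print(c1[0] / c0[0], 1 - eta * lam_class)   # Eq. (14)
print(s1[0] / s0[0], 1 - eta * lam_single)  # Eq. (15)
```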
4.2 Gradient Flow Dynamics with Block-Structured NTK

We derive the dynamics of $H$, $W$, $b$ under Assumption 3.2 in Theorem 4.1. One can see that the block-structured kernel greatly simplifies the complicated dynamics of DNNs and highlights the role of each of the residual components identified in Section 4.1. We consider gradient flow, which is close to gradient descent for sufficiently small step size [16], to reduce the complications caused by higher-order terms. The proof is given in Appendix B.1.

Theorem 4.1. Suppose Assumption 3.2 holds. Then the gradient flow dynamics of a DNN can be written as
$$\dot{H} = -W^\top\big[(\kappa_d - \kappa_c)\, R + (\kappa_c - \kappa_n)\, m R_{\text{class}} + \kappa_n N R_{\text{global}}\big], \qquad \dot{W} = -R H^\top, \qquad \dot{b} = -R_{\text{global}}\, \mathbf{1}_N. \qquad (19)$$

We note that at the end of training, where $R_{\text{class}}$ and $R_{\text{global}}$ are zero, the system (19) reduces to
$$\dot{H} = -(\kappa_d - \kappa_c)\, \nabla_H \mathcal{L}, \qquad \dot{W} = -\nabla_W \mathcal{L}, \qquad \mathcal{L}(W, H) := \tfrac{1}{2}\|WH + b\mathbf{1}_N^\top - Y\|_F^2, \qquad (20)$$
and $\dot{b} = 0$. This system differs from the unconstrained features dynamics only by a factor of $\kappa_d - \kappa_c$ in front of $\nabla_H \mathcal{L}$. Moreover, such a form of the loss function also appears in the literature of implicit regularization [4, 6, 11], where the authors show that $WH$ converges to a low-rank matrix.

5 NTK Alignment Drives Neural Collapse

The main goal of this work is to demonstrate how NC results from the NTK block structure. To this end, in Section 5.1 we further analyze the dynamics presented in Theorem 4.1, in Section 5.2 we derive the invariant of this training dynamics, and in Section 5.3 we finally derive NC.

5.1 Features Decomposition

We first decompose the features dynamics presented in Theorem 4.1 into two parts: $H_1$, which lies in the subspace of the labels $Y$, and $H_2$, which is orthogonal to the labels and eventually vanishes. To achieve this, note that the SVD of $Y$ has the following form:
$$P^\top Y Q = \sqrt{m}\, [I_C, O], \qquad (21)$$
where $O \in \mathbb{R}^{C \times (N-C)}$ is a matrix of zeros, and $P \in \mathbb{R}^{C \times C}$ and $Q \in \mathbb{R}^{N \times N}$ are orthogonal matrices. Moreover, we can choose $P$ and $Q$ such that $P = I_C$ and
$$Q = [Q_1, Q_2], \qquad Q_1 = \tfrac{1}{\sqrt{m}}\, I_C \otimes \mathbf{1}_m \in \mathbb{R}^{N \times C}, \qquad Q_2 = I_C \otimes \tilde{Q}_2 \in \mathbb{R}^{N \times (N-C)}, \qquad (22)$$
where $\otimes$ is the Kronecker product. Note that by orthogonality, $\tilde{Q}_2 \in \mathbb{R}^{m \times (m-1)}$ has full rank and $\mathbf{1}_m^\top \tilde{Q}_2 = O$. We can now decompose $HQ$ into two components as follows:
$$HQ = \sqrt{m}\, [H_1, H_2], \qquad H_1 = \tfrac{1}{\sqrt{m}}\, H Q_1, \qquad H_2 = \tfrac{1}{\sqrt{m}}\, H Q_2. \qquad (23)$$
The following equations reveal the meaning of these two components:
$$H_1 = [\bar{h}_1, \ldots, \bar{h}_C], \qquad H_2 = \tfrac{1}{\sqrt{m}}\, [H^{(1)} \tilde{Q}_2, \ldots, H^{(C)} \tilde{Q}_2], \qquad (24)$$
where $\bar{h}_c \in \mathbb{R}^n$ is the mean of $h$ over inputs $x_i^c$ from class $c \in [C]$, and $H^{(c)} \in \mathbb{R}^{n \times m}$ is the submatrix of $H$ corresponding to samples of class $c$, i.e., $H = [H^{(1)}, \ldots, H^{(C)}]$. We see that $H_1$ is simply the matrix of the last-layer features class means, which is prominent in the NC literature. We also see that the columns of $H^{(c)} \tilde{Q}_2$ are $m-1$ different linear combinations of the $m$ vectors $h(x_i^c)$, $i \in [m]$. Moreover, the coefficients of each of these linear combinations sum to zero by the choice of $\tilde{Q}_2$. Therefore, $H_2$ must reduce to zero in case of variability collapse (NC1), when all the feature vectors within the same class become equal. We prove that $H_2$ indeed vanishes in DNNs with block-structured NTK as part of our main result (Theorem 5.2).
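The decomposition (23)–(24) is mechanical to implement. In the sketch below (made-up sizes, random features), $\tilde{Q}_2$ is built as an orthonormal basis of the orthogonal complement of $\mathbf{1}_m$; the sketch checks that $H_1$ recovers the class means exactly and that $H_2$ vanishes when the features of each class are identical, i.e., under variability collapse.

```python
# The decomposition H Q = sqrt(m) [H1, H2] of Eqs. (23)-(24) (toy sizes).
import numpy as np

C, m, n = 4, 6, 16
N = C * m
rng = np.random.default_rng(0)

# Q1 = (1/sqrt(m)) I_C (x) 1_m ;  Q2 = I_C (x) Q2_tilde with 1_m^T Q2_tilde = 0.
Q1 = np.kron(np.eye(C), np.ones((m, 1))) / np.sqrt(m)        # shape (N, C)
proj = np.eye(m) - np.ones((m, m)) / m                       # projector onto 1_m^perp
U, _, _ = np.linalg.svd(proj)
Q2_tilde = U[:, : m - 1]                                     # orthonormal, (m, m-1)
Q2 = np.kron(np.eye(C), Q2_tilde)                            # shape (N, N - C)

# Features: class means plus within-class noise, columns ordered by class.
class_means = rng.normal(size=(n, C))
H = np.repeat(class_means, m, axis=1) + 0.3 * rng.normal(size=(n, N))

H1 = H @ Q1 / np.sqrt(m)
H2 = H @ Q2 / np.sqrt(m)

# H1 equals the empirical class means.
emp_means = H.reshape(n, C, m).mean(axis=2)
print(np.allclose(H1, emp_means))                            # True

# Under variability collapse, the within-class component H2 vanishes.
H_collapsed = np.repeat(class_means, m, axis=1)
H2_collapsed = H_collapsed @ Q2 / np.sqrt(m)
print(np.linalg.norm(H2), np.linalg.norm(H2_collapsed))      # > 0  vs  ~0
```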
5.2 Invariant

We now use the former decomposition of the last-layer features to further simplify the dynamics and deduce a training invariant in Theorem 5.1. The proof is given in Appendix B.2.

Theorem 5.1. Suppose Assumption 3.2 holds. Define $H_1$ and $H_2$ as in (23). Then the class means of the residuals (defined in (17)) are given by $R_1 = WH_1 + b\mathbf{1}_C^\top - I_C$, and the training dynamics of the DNN can be written as
$$\dot{H}_1 = -W^\top R_1\big(\mu_{\text{class}} I_C + \kappa_n m\, \mathbf{1}_C \mathbf{1}_C^\top\big), \quad \dot{H}_2 = -\mu_{\text{single}}\, W^\top W H_2, \quad \dot{W} = -m\big(R_1 H_1^\top + W H_2 H_2^\top\big), \quad \dot{b} = -m R_1 \mathbf{1}_C, \qquad (25)$$
where $\mu_{\text{single}} := \kappa_d - \kappa_c$ and $\mu_{\text{class}} := \mu_{\text{single}} + m(\kappa_c - \kappa_n)$ are the two smallest eigenvalues of the kernel $\Theta^h_{k,k}(X)$ for any $k \in [n]$. Moreover, the quantity
$$E := \tfrac{1}{m}\, W^\top W - \tfrac{1}{\mu_{\text{class}}}\, H_1\big(I_C - \alpha \mathbf{1}_C \mathbf{1}_C^\top\big) H_1^\top - \tfrac{1}{\mu_{\text{single}}}\, H_2 H_2^\top \qquad (26)$$
is invariant in time. Here $\alpha := \frac{\kappa_n m}{\mu_{\text{class}} + C \kappa_n m}$.

We note that the invariant $E$ derived here resembles the conservation laws of hyperbolic dynamics, which take the form $E_{\text{hyp}} := a^2 - b^2 = \text{const}$ for time-dependent quantities $a$ and $b$. Such dynamics arise when gradient flow is applied to a loss function of the form $L(a, b) := (ab - q)^2$ for some $q$. Since the solutions of such minimization problems, given by $ab = q$, exhibit symmetry under the scaling $a \to \gamma a$, $b \to b/\gamma$, the value of the invariant $E_{\text{hyp}}$ uniquely specifies the hyperbola followed by the solution. In machine learning theory, hyperbolic dynamics arise as the gradient flow dynamics of linear DNNs [42], or in matrix factorization problems [3, 15]. Moreover, the end-of-training dynamics defined in (20) has a hyperbolic invariant given by
$$E_{\text{eot}} := W^\top W - \tfrac{1}{\mu_{\text{single}}}\, H H^\top. \qquad (27)$$
Therefore, the final phase of training exhibits a behaviour typical of hyperbolic dynamics, which is also characteristic of the unconstrained features models [21, 38]. Namely, "scaling" $W$ and $H$ by an invertible matrix does not affect the loss value but changes the dynamics' invariant. On the other hand, minimizing the invariant $E_{\text{eot}}$ has the same effect as joint regularization of $W$ and $H$ [48]. However, we also note that our invariant $E$ provides a new, more comprehensive look at the DNN dynamics. While unconstrained features models effectively make assumptions on the end-of-training invariant $E_{\text{eot}}$ to derive NC [21, 38, 48], our dynamics control the value of $E_{\text{eot}}$ through the more general invariant $E$. This way we connect the properties of the end-of-training hyperbolic dynamics with the previous stages of training.

5.3 Neural Collapse

We are finally ready to state and prove our main result, Theorem 5.2, on the emergence of NC in DNNs with NTK alignment. We include the proof in Appendix B.3.

Theorem 5.2. Assume that the NTK has a block structure as defined in Assumption 3.2. Then the DNN's training dynamics are given by the system of equations in (25). Assume further that the last-layer features are centralized, i.e., $\bar{h} = 0$, and that the dynamics invariant (26) is zero, i.e., $E = O$. Then the DNN's dynamics exhibit neural collapse as defined in (NC1)–(NC4).

Below we provide several important remarks and discuss the implications of this result:

(1) Zero invariant assumption: We assume that the invariant (26) is zero in Theorem 5.2 for simplicity and consistency with the literature. Indeed, similar assumptions arise in matrix decomposition papers, where a zero invariant guarantees "balance" of the problem [3, 15]. However, our proofs in fact only require the weaker assumption that the invariant terms containing the features $H$ are aligned with the weights $W^\top W$, i.e.,
$$W^\top W \;\propto\; \tfrac{1}{\mu_{\text{class}}}\, H_1 H_1^\top + \tfrac{1}{\mu_{\text{single}}}\, H_2 H_2^\top, \qquad (28)$$
where we have taken into account our assumption on the zero global mean $\bar{h} = 0$.

(2) Necessity of the invariant assumption: The relaxed assumption (28) on the invariant is necessary for the emergence of NC in DNNs with block-structured NTK. Indeed, NC1 implies $H_2 = O$, and NC3 implies $H_1 H_1^\top \propto W^\top W$. Therefore, DNNs that do not satisfy this assumption do not display NC. Our numerical experiments described in Section 6 strongly support this insight (see Figure 2, panes a–e). Thus, we believe that the invariant derived in this work characterizes the difference between models that do and do not exhibit NC. In practice, both the invariant norm and this alignment can be estimated directly from a trained model's last layer, as sketched below.
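The following is a sketch of one way to compute (26) and quantify (28); the alignment measure used here (a cosine similarity between the two sides of (28)) is our own assumption about a reasonable metric, not necessarily the exact quantity plotted in Figure 2b, and all inputs below are random placeholders.

```python
# Estimating the invariant (26) and the alignment condition (28) from the last
# layer of a model. W, H1, H2 and the kernel values kappa_* are random
# placeholders; in practice they come from the trained network and its
# empirical last-layer kernel. The cosine-similarity alignment below is one
# natural choice of metric (an assumption, not the paper's exact definition).
import numpy as np

C, m, n = 10, 12, 64
rng = np.random.default_rng(0)
W = rng.normal(size=(C, n)) / np.sqrt(n)     # classifier weights
H1 = rng.normal(size=(n, C))                 # class means of the features, Eq. (23)
H2 = rng.normal(size=(n, C * (m - 1)))       # within-class component, Eq. (23)

kappa_d, kappa_c, kappa_n = 5.0, 2.0, 0.1    # block values of Theta^h (Assumption 3.2)
mu_single = kappa_d - kappa_c
mu_class = mu_single + m * (kappa_c - kappa_n)
alpha = kappa_n * m / (mu_class + C * kappa_n * m)

ones = np.ones((C, C))
E = (W.T @ W) / m \
    - H1 @ (np.eye(C) - alpha * ones) @ H1.T / mu_class \
    - H2 @ H2.T / mu_single
print(np.linalg.norm(E))                     # invariant norm, cf. Figure 2a

def cos_sim(A, B):
    return np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))

lhs = W.T @ W
rhs = H1 @ H1.T / mu_class + H2 @ H2.T / mu_single
print(cos_sim(lhs, rhs))                     # alignment of the two sides of (28)
```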
(3) Zero global mean assumption: We note that the zero global mean assumption $\bar{h} = 0$ in Theorem 5.2 ensures that the biases are equal to $b = \frac{1}{C}\mathbf{1}_C$ at the end of training. This assumption is common in the NC literature [21, 38] and is well supported by our numerical experiments (see figures in Appendix C, pane i). Indeed, modern DNNs typically include some form of normalization (e.g., batch normalization layers) to improve numerical stability, and closeness of the global mean to zero is a by-product of such normalization.

(4) General biases case: Discarding the zero global mean assumption allows the biases $b$ to take an arbitrary form. In this general case, the following holds for the matrix of weights:
$$(WW^\top)^2 = \frac{m}{\mu_{\text{class}}}\Big[I_C - \alpha \mathbf{1}_C \mathbf{1}_C^\top + (1 - \alpha C)\big(C b b^\top - b \mathbf{1}_C^\top - \mathbf{1}_C b^\top\big)\Big]. \qquad (29)$$
For the optimal biases $b = \frac{1}{C}\mathbf{1}_C$, this reduces to the ETF structure that emerges in NC. Moreover, if the biases are all equal, i.e., $b = \beta\mathbf{1}_C$ for some $\beta \in \mathbb{R}$, the centralized class means still form an ETF (i.e., NC2 holds), and the weights exhibit a certain symmetric structure given by
$$WW^\top \propto I_C - \gamma\, \mathbf{1}_C \mathbf{1}_C^\top, \qquad M^\top M \propto I_C - \tfrac{1}{C}\mathbf{1}_C \mathbf{1}_C^\top, \qquad (30)$$
where $\gamma := \frac{1}{C}\big(1 - |1 - \beta C|\sqrt{1 - \alpha C}\big)$. The proof and a discussion of this result are given in Appendix B.4. In general, the angles of these two frames are different, and thus NC3 does not hold. This insight leads us to believe that normalization is an important factor in the emergence of NC.

(5) Partial NC: Our proofs and the discussion above suggest that the four phenomena that form NC do not always have to coincide. In particular, our proof of NC1 only requires the block-structured NTK and a p.s.d. invariant, which is much weaker than the total set of assumptions in Theorem 5.2. Therefore, variability collapse can occur in models that do not exhibit the ETF structure of the class means or the duality of the weights and the class means. Moreover, as shown above, NC2 can occur when NC3 does not, i.e., the ETF structure of the class means does not imply duality.

6 Experiments

We conducted large-scale numerical experiments to support our theory. While we only showcase our results on a single dataset–architecture pair in the main text (see Figure 2) and refer the rest to the appendix, the following discussion covers all our experiments.

Datasets and models. Following the seminal NC paper [39], we use three canonical DNN architectures: VGG [46], ResNet [24] and DenseNet [26]. Our datasets are MNIST [35], Fashion-MNIST [51] and CIFAR10 [34]. We choose VGG11 for MNIST and Fashion-MNIST, and VGG16 for CIFAR10. We add batch normalization after every layer in the VGG architecture, set dropout to zero and choose the dimensions of the two fully-connected layers on top of the network as 512 and 256. We use the ResNet20 architecture described in the original ResNet paper [24], and DenseNet40 with bottleneck layers, growth rate $k = 12$, and zero dropout for all the datasets.
Figure 2: ResNet20 trained on MNIST with three initialization settings and varying learning rates (see Section 6 for details). We chose a model that exhibits NC (red lines, filled markers) and a model that does not exhibit NC (blue lines, empty markers) for each initialization. The vertical lines indicate the epoch at which the training accuracy reaches 99.9% (over the last 10 batches). a) Frobenius norm of the invariant, $\|E\|_F$. b) Alignment of the invariant terms as defined in (28). c) NC1: standard deviation of $h(x_i^c)$ averaged over classes. d) NC2: $\big\| M^\top M / \|M^\top M\|_F - \Phi \big\|_F$, where $\Phi$ is an ETF. e) NC3: $\big\| W^\top / \|W\|_F - M / \|M\|_F \big\|_F$. The legend displays the test accuracy achieved by each model and the last-layer features kernel alignment given by $\big\langle \Theta^h / \|\Theta^h\|_F,\; Y^\top Y / \|Y^\top Y\|_F \big\rangle_F$ (LeCun normal init., LR=0.003: acc. 99.49%, alignment 0.45; LeCun normal init., LR=0.049: acc. 99.69%, alignment 0.93; uniform init., LR=0.003: acc. 99.60%, alignment 0.51; uniform init., LR=0.049: acc. 99.69%, alignment 0.88; He normal init., LR=0.005: acc. 99.51%, alignment 0.38; He normal init., LR=0.049: acc. 99.64%, alignment 0.92). The curves in panes a–e are smoothed by a Savitzky–Golay filter with polynomial degree 1 over a window of size 10. Panes f, g and h show the NC metrics and the test accuracy as functions of the learning rate.

Optimization and initialization. We use SGD with Nesterov momentum 0.9 and weight decay $5 \cdot 10^{-4}$. Every model is trained for 400 epochs with batches of size 120. To be consistent with the theory, we balance the batches exactly. We train every model with a set of initial learning rates spaced logarithmically in the range $\eta \in [10^{-4}, 10^{0.25}]$. The learning rate is divided by 10 every 120 epochs. On top of the varying learning rates, we try three different initialization settings for every model: (a) LeCun normal initialization (the default in Flax), (b) uniform initialization on $[-\sqrt{k}, \sqrt{k}]$, where $k = 1/n_{\ell-1}$ for a linear layer and $k = 1/(K n_{\ell-1})$ for a convolutional layer with kernel size $K$ (the default in PyTorch), (c) He normal initialization in fan_out mode.

Results. Our experiments confirm the validity of our assumptions and the emergence of NC as their result. Specifically, we make the following observations:

- While most of the DNNs that achieve high test performance exhibit NC, we are able to identify DNNs with comparable performance that do not exhibit NC (see Figure 2, panes f–h). We note that such models still achieve near-zero error on the training set in our setup.
- Comparing DNNs that do and do not exhibit NC, we find that our assumption on the invariant (see Theorem 5.2 and (28)) holds only for the models with NC (see Figure 2, panes a–e). This confirms our reasoning about the necessity of the invariant assumption for NC emergence.
- The kernels $\Theta$ and $\Theta^h$ are strongly aligned with the labels $Y^\top Y$ in the models with the best performance, which is in agreement with the NTK alignment literature and justifies our assumption on the NTK block structure.

We include the full range of experiments along with the implementation details and the discussion of required computational resources in Appendix C. Specifically, we present a figure analogous to Figure 2 for every considered dataset–architecture pair.
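For reference, the NC metrics plotted in Figure 2 (panes c–e) can be computed from the last-layer features and classifier weights along the following lines; this is one reasonable implementation of the caption's definitions (e.g., the ETF target $\Phi$ is taken Frobenius-normalized), with random features standing in for a trained model.

```python
# NC metrics as in the Figure 2 caption (one reasonable implementation; the
# features and weights below are random stand-ins for a trained model).
import numpy as np

C, m, n = 10, 12, 128
N = C * m
rng = np.random.default_rng(0)
H = rng.normal(size=(n, N))                       # features, columns ordered by class
W = rng.normal(size=(C, n))                       # classifier weights

means = H.reshape(n, C, m).mean(axis=2)           # class means, shape (n, C)
global_mean = H.mean(axis=1, keepdims=True)
M = means - global_mean                           # centered class means
M = M / np.linalg.norm(M, axis=0, keepdims=True)  # normalize columns (Section 2.1)

# NC1: within-class standard deviation of the features, averaged over classes.
nc1 = np.mean([H.reshape(n, C, m)[:, c, :].std(axis=1).mean() for c in range(C)])

# NC2: distance of the normalized class-mean Gram matrix to a simplex ETF.
Phi = (np.eye(C) - np.ones((C, C)) / C) * C / (C - 1)
G = M.T @ M
nc2 = np.linalg.norm(G / np.linalg.norm(G) - Phi / np.linalg.norm(Phi))

# NC3: self-duality between classifier weights and class means.
nc3 = np.linalg.norm(W.T / np.linalg.norm(W) - M / np.linalg.norm(M))

print(nc1, nc2, nc3)
```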
Additionally, we report the norms of matrices H1H 1 , H2H 2 , and h h , as well as the alignment of both the NTK Θ and the last-layer features kernel Θh in the end of training, to further justify our assumptions. 7 Conclusions and Broad Impact This work establishes the connection between NTK alignment and NC, and thus provides a mechanistic explanation for the emergence of NC within realistic DNNs training dynamics. It also contributes to the underexplored line of research connecting NC and local elasticity of DNNs training dynamics. The primary implication of this research is that it exposes the potential to study NC through the lens of NTK alignment. Indeed, previous works on NC focus on the top-down approach (layer-peeled models) [18, 21, 38, 48], and fundamentally cannot explain how NC develops through earlier layers of a DNN and what are the effects of depth. On the other hand, NTK alignment literature focuses on the alignment of individual layers [7], and recent theoretical results even quantify the role of each hidden layer in the final alignment [37]. Therefore, we believe that the connection between NTK alignment and NC established in this work provides a conceptually new method to study NC. Moreover, this work introduces a novel approach to facilitate theoretical analysis of DNNs training dynamics. While most theoretical works consider the NTK in the infinite-width limit to simplify the dynamics [1, 20, 28, 49], our analysis shows that making reasonable assumptions on the empirical NTK can also lead to tractable dynamics equations and new theoretical results. Thus, we believe that the analysis of DNNs training dynamics based on the properties of the empirical NTK is a promising approach also beyond NC research. 8 Limitations and Future Work The main limitation of this work is the simplifying Assumption 3.2 on the kernel structure. While the NTK of well-trained DNNs indeed has an approximate block structure (as we discuss in detail in Section 3), the NTK values also tend to display high variance in real DNNs [22, 44]. Thus, we believe that adding stochasticity to the dynamics considered in this paper is a promising direction for the future work. Moreover, the empirical NTK exhibits so-called specialization, i.e., the kernel matrix corresponding to a certain output neurons aligns more with the labels of the corresponding class [45]. In block-structured kernels, specialization implies different values in blocks corresponding to different classes. Thus, generalizing our theory to block-structured kernels with specialization is another promising short-term research goal. In addition, our theory relies on the assumption that the dataset (or the training batch) is balanced, i.e., all the classes have the same number of samples. Accounting for the effects of non-balanced datasets within the dynamics of DNNs with block-structured NTK is another possible future work direction. More generally, we believe that empirical observations are essential to demistify the DNNs training dynamics, and there are still many unknown and interesting connections between seemingly unrelated empirical phenomena. Establishing new theoretical connections between such phenomena is an important objective, since it provides a more coherent picture of the deep learning theory as a whole. Acknowledgments and Disclosure of Funding R. Giryes and G. Kutyniok acknowledge support from the LMU-TAU - International Key Cooperation Tel Aviv University 2023. R. 
Giryes is also grateful for partial support by ERC-St G SPADE grant no. 757497. G. Kutyniok is grateful for partial support by the Konrad Zuse School of Excellence in Reliable AI (DAAD), the Munich Center for Machine Learning (BMBF) as well as the German Research Foundation under Grants DFG-SPP-2298, KU 1446/31-1 and KU 1446/32-1 and under Grant DFG-SFB/TR 109, Project C09 and the Federal Ministry of Education and Research under Grant Ma Gri Do. [1] Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 74 84. PMLR, 2020. [2] Laurence Aitchison. Why bigger is not always better: on finite and infinite neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 156 164. PMLR, 2020. [3] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 244 253. PMLR, 2018. [4] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 7411 7422, 2019. [5] Alexander Atanasov, Blake Bordelon, and Cengiz Pehlevan. Neural networks as kernel learners: The silent alignment effect. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022. [6] Bubacarr Bah, Holger Rauhut, Ulrich Terstiege, and Michael Westdickenberg. Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers. Information and Inference: A Journal of the IMA, 11(1):307 353, 02 2021. [7] Aristide Baratin, Thomas George, César Laurent, R. Devon Hjelm, Guillaume Lajoie, Pascal Vincent, and Simon Lacoste-Julien. Implicit regularization via neural feature alignment. In The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, volume 130 of Proceedings of Machine Learning Research, pages 2269 2277. PMLR, 2021. [8] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. [9] Shuxiao Chen, Hangfeng He, and Weijie J. Su. Label-aware neural tangent kernel: Toward better generalization and local elasticity. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. [10] Lénaïc Chizat, Edouard Oyallon, and Francis R. Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Neur IPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 2933 2943, 2019. 
[11] Hung-Hsu Chou, Carsten Gieshoff, Johannes Maly, and Holger Rauhut. Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank. ar Xiv preprint: 2011.13772, 2020. [12] Moustapha Cissé, Piotr Bojanowski, Edouard Grave, Yann N. Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 854 863. PMLR, 2017. [13] Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz S. Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems 14: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada, pages 367 373. MIT Press, 2001. [14] Ahmet Demirkaya, Jiasi Chen, and Samet Oymak. Exploring the role of loss functions in multiclass classification. In 54th Annual Conference on Information Sciences and Systems, CISS 2020, Princeton, NJ, USA, March 18-20, 2020, pages 1 5. IEEE, 2020. [15] Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montréal, Canada, pages 382 393, 2018. [16] Omer Elkabetz and Nadav Cohen. Continuous vs. discrete optimization of deep neural networks. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 4947 4960, 2021. [17] Tolga Ergen and Mert Pilanci. Revealing the structure of deep neural networks via convex duality. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 3004 3014. PMLR, 2021. [18] C Fang, H He, Q Long, and WJ Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences of the United States of America, 118(43), 2021. [19] Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. [20] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, 2020. [21] X. Y. Han, Vardan Papyan, and David L. Donoho. Neural collapse under MSE loss: Proximity to and dynamics on the central path. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. Open Review.net, 2022. [22] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. [23] Hangfeng He and Weijie J. Su. 
The local elasticity of neural networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. Open Review.net, 2020. [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770 778. IEEE Computer Society, 2016. [25] Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2020. [26] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2261 2269. IEEE Computer Society, 2017. [27] Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent hierarchy. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 4542 4551. PMLR, 2020. [28] Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao. Why do deep residual networks generalize better than deep feedforward networks? - A neural tangent kernel perspective. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020. [29] Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs crossentropy in classification tasks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. Open Review.net, 2021. [30] Arthur Jacot, Clément Hongler, and Franck Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, Neur IPS 2018, December 3-8, 2018, Montréal, Canada, pages 8580 8589, 2018. [31] Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the generalization gap in deep networks with margin distributions. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Open Review.net, 2019. [32] Dmitry Kopitkov and Vadim Indelman. Neural spectrum alignment: Empirical study. In Artificial Neural Networks and Machine Learning - ICANN 2020 - 29th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 15-18, 2020, Proceedings, Part II, volume 12397 of Lecture Notes in Computer Science, pages 168 179. Springer, 2020. [33] Vignesh Kothapalli, Ebrahim Rasromani, and Vasudev Awatramani. Neural collapse: A review on modelling principles and generalization. ar Xiv preprint ar Xiv:2206.04041, 2022. [34] Alex Krizhevsky et al. Learning multiple layers of features from tiny images, 2009. [35] Yann Le Cun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010. [36] Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems, 33:15156 15172, 2020. 
[37] Yizhang Lou, Chris E Mingard, and Soufiane Hayou. Feature learning and signal propagation in deep neural networks. In International Conference on Machine Learning, pages 14248 14282. PMLR, 2022. [38] Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. ar Xiv preprint ar Xiv:2011.11619, 2020. [39] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652 24663, 2020. [40] Federico Pernici, Matteo Bruni, Claudio Baecchi, and Alberto Del Bimbo. Fix your features: Stationary and maximally discriminative embeddings using regular polytope (fixed classifier) networks. ar Xiv preprint ar Xiv:1902.10441, 2019. [41] Tomaso A. Poggio and Qianli Liao. Explicit regularization and implicit bias in deep network classifiers trained with the square loss. ar Xiv preprint ar Xiv:2101.00072, 2021. [42] Andrew M Saxe, James L Mc Clelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ar Xiv preprint ar Xiv:1312.6120, 2013. [43] Mariia Seleznova and Gitta Kutyniok. Analyzing finite neural networks: Can we trust neural tangent kernel theory? In Mathematical and Scientific Machine Learning, 16-19 August 2021, Virtual Conference / Lausanne, Switzerland, volume 145 of Proceedings of Machine Learning Research, pages 868 895. PMLR, 2021. [44] Mariia Seleznova and Gitta Kutyniok. Neural tangent kernel beyond the infinite-width limit: Effects of depth and initialization. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 19522 19560. PMLR, 2022. [45] Haozhe Shan and Blake Bordelon. A theory of neural tangent kernel alignment and its influence on training. ar Xiv preprint ar Xiv:2105.14301, 2021. [46] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [47] Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6398 6407, 2020. [48] Tom Tirer and Joan Bruna. Extended unconstrained features model for exploring deep neural collapse. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pages 21478 21505, 2022. [49] Tom Tirer, Joan Bruna, and Raja Giryes. Kernel-based smoothness analysis of residual networks. In Mathematical and Scientific Machine Learning, volume 145, pages 921 954, 2021. [50] Tom Tirer, Haoxiang Huang, and Jonathan Niles-Weed. Perturbation analysis of neural collapse. ar Xiv preprint ar Xiv:2210.16658, 2022. [51] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. [52] Greg Yang. Tensor programs II: neural tangent kernel for any architecture. ar Xiv preprint ar Xiv:2006.14548, 2020. [53] Jiayao Zhang, Hua Wang, and Weijie J. Su. Imitating deep learning dynamics via locally elastic stochastic differential equations. 
In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Neur IPS 2021, December 6-14, 2021, virtual, pages 6392 6403, 2021. A Related works NC with MSE loss. NC was first introduced for DNNs with cross-entropy (CE) loss, which is commonly used in classification problems [39]. Since then, numerous papers discussed NC with MSE loss, which provides more opportunities for theoretical analysis, especially after the MSE loss was shown to perform on par with CE loss for classification tasks [14, 29]. Most previous works on MSE-NC adopt the so-called unconstrained features model [21, 38, 48]. In this model, the last-layer features H are free variables that are directly optimized during training, i.e., the features do not depend on the input data or the DNN s trainable parameters. Fang et al. [18] also introduced a generalization of this approach called N-layer-peeled model, where features of the N-th-to-last layer are free variables, and studied the 1-layer-peeled model (equivalent to the unconstrained features model) with CE loss as a special case. One line of research on MSE-NC in unconstrained/layer-peeled models aims to derive global minimizers of optimization problems associated with DNNs [17, 18, 48]. In particular, Tirer et al. [48] showed that global minimizers of the MSE loss with regularization of both H and W exhibit NC. Moreover, Ergen & Pilanci [17] showed that NC emerges in global minimizers of optimization problems with general convex loss in the context of the 2-layer-peeled model. In comparison to our work, these contributions do not consider the training dynamics of DNNs, i.e., they do not discuss whether and how the model converges to the optimal solution. Another line of research on MSE-NC explicitly considers the dynamics of the unconstrained features models [21, 38]. In particular, Han et al. [21] considered the gradient flow of the unconstrained renormalized features along the "central path", where the classifier is assumed to take the form of the optimal least squares (OLS) solution for given features H. Under this assumption, they derive a closedform dynamics that implies NC. While they empirically show that DNNs are close to the central path in certain scenarios, they do not provide a theoretical justification for this assumption. The dynamics considered in their work is also distinct from the standard gradient flow dynamics of DNNs considered in our work. On the other hand, an earlier work by Mixon et al. [38] considered the gradient flow dynamics of the unconstrained features model, which is equivalent (up to rescaling) to the end-oftraining dynamics (20) that we discuss in Sections 4.2 and 5.2. Their work relies on the linearization of these dynamics to derive a certain subspace, which appears to be an invariant subspace of the non-linearized unconstrained features model dynamics. Then they show that minimizers of the loss from this subspace exhibit NC. We note that, in terms of our paper, assuming that the unconstrained features model dynamics follow a certain invariant subspace means making assumptions on the end-of-training invariant (27). In comparison to these works, we make a step towards realistic DNNs dynamics by considering the standard gradient flow of DNNs simplified by Assumption 3.2 on the NTK structure, which is supported by the extensive research on NTK alignment [7, 9, 44, 45]. 
In our setting, the NTK captures the dependence of the features on the training data, which is missing in the unconstrained features model. Moreover, while other works focus only on the dynamics that converge to NC, we show that DNNs with MSE loss may not exhibit NC in certain settings, and the invariant of the dynamics (26) characterizes the difference between models that do and do not converge to NC. Notably, works by Poggio & Liao [41] adopt a model different from the unconstrained features model to analyze gradient flow of DNNs. They consider the dynamics of homogeneous DNNs, in particular Re LU networks without biases, with normalization of the weights matrices and weights regularization. The goal of weights normalization in their model is to imitate the effects of batch normalization in DNNs training. In this model, certain fixed points of the gradient flow exhibit NC. While the approach taken in their work captures the dependence of the features on the data and the DNN s parameters, it fundamentally relies on the homogeneity of the DNN s output function. However, most DNNs that exhibit NC in practice are not homogeneous due to biases and skip-connections. NC and local elasticity. A recent extensive survey of NC literature [33] discussed local elasticity as a possible mechanism behind the emergence of NC, which has not been sufficiently explored up until now. One of the few works in this research direction is by Zhang et al. [53], who analyzed the so-called locally-elastic stochastic differential equations (SDEs) and showed the emergence of NC in their solutions. They model local elasticity of the dynamics through an effect matrix, which has only two distinct values: a larger intra-class value and a smaller inter-class value. These values characterize how much influence samples from one class have on samples from other classes in the SDEs. While the aim of their work is to imitate DNNs training dynamics through SDEs, the authors do not provide any explicit connection between their dynamics and real gradient flow dynamics of DNNs. On the other hand, we derive our dynamics directly from the gradient flow equations and connect local elasticity to the NTK, which is a well-studied object in the deep learning theory. Another work by Tirer et al. [50] provided a perturbation analysis of NC to study "inexact collapse". They considered a minimization problem with MSE loss, regularization of H and W, and additional regularization of the distance between H and a given matrix of initial features. In the "near-collapse" setting, i.e., when the initial features are already close to collapse, they showed that the optimal features can be obtained from the initial features by a certain linear transformation with a block structure, where the intra-class effects are stronger than the inter-class ones. While this transformation matrix resembles the block-structured effect matrices in locally-elastic training dynamics, it does not originate from the gradient flow dynamics of DNNs and is not related to the NTK. B.1 Proof of Theorem 4.1 Proof of Theorem 4.1. We will first derive the dynamics of hs(xc i), which is the s-th component of the last-layer features vector on sample xc i X from class c [C]. Let w RP be the trainable parameters of the network stretched into a single vector. Then its gradient flow dynamics is given by w = w L(f) = i =1 (f(X)ki Yki ) wf(X)ki , (31) where wf(X)ki RP is the component of the DNN s Jacobian corresponding to output neuron k and the input sample xc i . 
Since entries of f(X) can be written as s =1 Wks Hs i + bk = s =1 Wks hs (xc i ) + bk, (32) s =1 (f(X)ki Yki ) w(Wks hs (xc i ) + bk). (33) By chain rule, we have hs(xc i) = whs(xc i), w . Then, taking into account that whs(xc i), w(Wks hs (xc i ) + bk) = Wks whs(xc i), whs (xc i ) , (34) and that whs(xc i), whs (xc i ) = Θh s,s (xc i, xc i ) by definition of Θh, we have s =1 (f(X)ki Yki )Wks Θh s,s (xc i, xc i ). (35) Now by Assumption 3.2 we have Θh s,s = 0 if s = s . Therefore, the above expression simplifies to i =1 Θh s,s(xc i, xc i ) k=1 (f(X)ki Yki )Wks i =1 [W (WH + b1 N Y)]si Θh s,s(xc i, xc i ). To express H = hs(xc i) s,i Rn N in matrix form, it remains to express Θh s,s(xc i, xc i ) as the (i , i)-th entry of some matrix. We will separate the sum into three cases: 1) i = i , 2) i = i and c = c , and 3) c = c . According to Assumption 3.2, the first case corresponds to the multiple of identity κd IN. The second corresponds to the block matrix of size m with zeros on the diagonal, which can be written as κc(Y Y IN). The third matrix equals to κn(1N1 N Y Y). Therefore we can express the dynamics of H as follows: H = [W (WH + b1 N Y)][κd I + κc(Y Y I) + κn(1N1 N Y Y)] = (κd κc)W (WH + b1 N Y) (κc κn)W (WHY Y + mb1 N m Y) κn W (WH1N1 N + Nb1 N N Now we notice that HY Y/m is the matrix of stacked class means repeated m times each and H1N1 N/N is a matrix of the global mean repeated N times. Therefore, we have WHY Y + mb1 N m Y = m Rclass, WH1N1 N + Nb1 N N C 1C1 N = NRglobal according to the definitions of global and class-mean residuals in (18) and (17). The expressions for the gradient flow dynamics of W and b follow directly from the derivatives of f(X) w.r.t. W and b. This completes the proof. B.2 Proof of Theorem 5.1 Proof of Theorem 5.1. Recall from (23) in Section 5.1 that we have the following decomposition HQ = m[H1, H2], H1 = 1 m HQ1, H2 = 1 m HQ2 with orthogonal Q = [Q1, Q2] RN N. We now artificially add QQ (= IN) to the dynamics (19) in Theorem 4.1 and obtain HQ = (κd κc)W (WHQ + b1 NQ Y Q) (κc κn)m W ( 1 m WHQQ Y YQ + b1 NQ Y Q) κn NW ( 1 N WHQQ 1N1 NQ + b1 NQ 1 C 1C1 NQ) W = (WHQ + b1 NQ Y Q)Q H b = (WHQ + b1 NQ Y Q)Q 1N. Let us simplify the expression. Since Q1 = 1 m IC 1m and Q2 = IC Q2, we have 1 NQ = m[1 C, O], YQ = m[IC, O]. (37) Plugging (37) into (36), we see the dynamics can be decomposed into H1 = (κd κc)W (WH1 + b1 C IC) (κc κn)m W (WH1 + b1 C IC) κn NW ( 1 C WH11C1 C + b1 C 1 C 1C1 C) H2 = (κd κc)W WH2 W = m(WH1 + b1 C IC)H 1 m WH2H 2 b = m(WH1 + b1 C IC)1C. To further simplify (38), we define the following quantities µsingle := κd κc, µclass := µsingle + m(κc κn), R1 := WH1 + b1 C IC. (39) Notice that µsingle and µclass are the two largest eigenvalues of the block-structured kernel Θh s,s(X) (see Table 1 for the eigndecomposition of a block-structured matrix), and R1 is a matrix of the stacked class-mean residuals, which is also defined in (17). The the dynamics (38) simplifies to H1 = W (µclass R1 + κn N( 1 C WH11C1 C + b1 C 1 C 1C1 C)) H2 = µsingle W WH2 W = m(R1H 1 WH2H 2 ) b = m R11C. It remains to simplify the expression for H1. By using the relation 1 C WH11C1 C + b1 C 1 C 1C1 C = 1 C R11C1 C, (41) we can deduce that the dynamics for H1 in (40) can be expressed as (recalling that N = m C) H1 = W R1(µclass I + κnm1C1 C). (42) We notice that (IC + κnm µclass 1C1 C) 1 = IC α1C1 C, where α := κnm µclass+Cκnm. 
B.2 Proof of Theorem 5.1

Proof of Theorem 5.1. Recall from (23) in Section 5.1 that we have the decomposition
$$HQ = \sqrt{m}\,[H_1, H_2], \qquad H_1 = \tfrac{1}{\sqrt m}HQ_1, \qquad H_2 = \tfrac{1}{\sqrt m}HQ_2,$$
with an orthogonal $Q = [Q_1, Q_2] \in \mathbb{R}^{N\times N}$. We now artificially insert $QQ^\top (= I_N)$ into the dynamics (19) in Theorem 4.1 and obtain
$$\dot HQ = -(\kappa_d - \kappa_c)W^\top\big(WHQ + b\mathbf{1}_N^\top Q - YQ\big) - (\kappa_c - \kappa_n)m\,W^\top\Big(\tfrac1m WHQQ^\top Y^\top YQ + b\mathbf{1}_N^\top Q - YQ\Big) - \kappa_n N\,W^\top\Big(\tfrac1N WHQQ^\top\mathbf{1}_N\mathbf{1}_N^\top Q + b\mathbf{1}_N^\top Q - \tfrac1C\mathbf{1}_C\mathbf{1}_N^\top Q\Big),$$
$$\dot W = -\big(WHQ + b\mathbf{1}_N^\top Q - YQ\big)Q^\top H^\top, \qquad \dot b = -\big(WHQ + b\mathbf{1}_N^\top Q - YQ\big)Q^\top\mathbf{1}_N. \qquad (36)$$
Let us simplify these expressions. Since $Q_1 = \tfrac{1}{\sqrt m}\,I_C \otimes \mathbf{1}_m$ and $Q_2 = I_C \otimes \bar Q_2$ for a matrix $\bar Q_2 \in \mathbb{R}^{m\times(m-1)}$ with orthonormal columns orthogonal to $\mathbf{1}_m$, we have
$$\mathbf{1}_N^\top Q = \sqrt m\,[\mathbf{1}_C^\top, O], \qquad YQ = \sqrt m\,[I_C, O]. \qquad (37)$$
Plugging (37) into (36), we see that the dynamics can be decomposed into
$$\dot H_1 = -(\kappa_d - \kappa_c)W^\top\big(WH_1 + b\mathbf{1}_C^\top - I_C\big) - (\kappa_c - \kappa_n)m\,W^\top\big(WH_1 + b\mathbf{1}_C^\top - I_C\big) - \kappa_n N\,W^\top\Big(\tfrac1C WH_1\mathbf{1}_C\mathbf{1}_C^\top + b\mathbf{1}_C^\top - \tfrac1C\mathbf{1}_C\mathbf{1}_C^\top\Big),$$
$$\dot H_2 = -(\kappa_d - \kappa_c)W^\top WH_2, \qquad \dot W = -m\big(WH_1 + b\mathbf{1}_C^\top - I_C\big)H_1^\top - m\,WH_2H_2^\top, \qquad \dot b = -m\big(WH_1 + b\mathbf{1}_C^\top - I_C\big)\mathbf{1}_C. \qquad (38)$$
To further simplify (38), we define the quantities
$$\mu_{\mathrm{single}} := \kappa_d - \kappa_c, \qquad \mu_{\mathrm{class}} := \mu_{\mathrm{single}} + m(\kappa_c - \kappa_n), \qquad R_1 := WH_1 + b\mathbf{1}_C^\top - I_C. \qquad (39)$$
Notice that $\mu_{\mathrm{single}}$ and $\mu_{\mathrm{class}}$ are the two largest eigenvalues of the block-structured kernel $\Theta^h_{s,s}(X)$ (see Table 1 for the eigendecomposition of a block-structured matrix), and $R_1$ is the matrix of stacked class-mean residuals, which is also defined in (17). Then the dynamics (38) simplifies to
$$\dot H_1 = -W^\top\Big(\mu_{\mathrm{class}}R_1 + \kappa_n N\big(\tfrac1C WH_1\mathbf{1}_C\mathbf{1}_C^\top + b\mathbf{1}_C^\top - \tfrac1C\mathbf{1}_C\mathbf{1}_C^\top\big)\Big), \qquad \dot H_2 = -\mu_{\mathrm{single}}W^\top WH_2, \qquad \dot W = -m\big(R_1H_1^\top + WH_2H_2^\top\big), \qquad \dot b = -m\,R_1\mathbf{1}_C. \qquad (40)$$
It remains to simplify the expression for $\dot H_1$. Using the relation
$$\tfrac1C WH_1\mathbf{1}_C\mathbf{1}_C^\top + b\mathbf{1}_C^\top - \tfrac1C\mathbf{1}_C\mathbf{1}_C^\top = \tfrac1C R_1\mathbf{1}_C\mathbf{1}_C^\top, \qquad (41)$$
we can deduce that the dynamics of $H_1$ in (40) can be expressed as (recalling that $N = mC$)
$$\dot H_1 = -W^\top R_1\big(\mu_{\mathrm{class}}I_C + \kappa_n m\,\mathbf{1}_C\mathbf{1}_C^\top\big). \qquad (42)$$
We notice that $\big(I_C + \tfrac{\kappa_n m}{\mu_{\mathrm{class}}}\mathbf{1}_C\mathbf{1}_C^\top\big)^{-1} = I_C - \alpha\mathbf{1}_C\mathbf{1}_C^\top$, where $\alpha := \tfrac{\kappa_n m}{\mu_{\mathrm{class}} + C\kappa_n m}$. Then we can derive the invariant of the training dynamics by a direct computation of the time derivative $\dot E$, where
$$E := \tfrac1m W^\top W - \tfrac{1}{\mu_{\mathrm{class}}}H_1\big(I_C - \alpha\mathbf{1}_C\mathbf{1}_C^\top\big)H_1^\top - \tfrac{1}{\mu_{\mathrm{single}}}H_2H_2^\top. \qquad (43)$$
Indeed, (42) gives $\dot H_1(I_C - \alpha\mathbf{1}_C\mathbf{1}_C^\top) = -\mu_{\mathrm{class}}W^\top R_1$, and substituting the dynamics of $W$ and $H_2$ from (40) shows that all terms in $\dot E$ cancel. Since $\dot E = O$, the quantity $E$ remains constant in time. This completes the proof.

B.3 Proof of Theorem 5.2

We divide the proof into two main parts: the first shows the emergence of NC1, and the second shows NC2-4.

(NC1). Following the analysis in Section 3, the dynamics eventually enters the end-of-training phase (see Section 4.1). Then the dynamics in Theorem 5.1 simplifies to the following form:
$$\dot H_1 = O, \qquad \dot H_2 = -\mu_{\mathrm{single}}W^\top WH_2, \qquad \dot W = -m\,WH_2H_2^\top, \qquad \dot b = O. \qquad (44)$$
As we note in Section 4, this dynamics is similar to the gradient flow of the unconstrained features models and is an instance of the class of hyperbolic dynamics discussed in Section 5.2. During this phase the quantity
$$\bar E := \mu_{\mathrm{single}}W^\top W - m\,H_2H_2^\top = m\mu_{\mathrm{single}}\Big(E + \tfrac{1}{\mu_{\mathrm{class}}}H_1\big(I_C - \alpha\mathbf{1}_C\mathbf{1}_C^\top\big)H_1^\top\Big) \qquad (45)$$
does not change in time. Hence we can decouple the dynamics using this invariant as follows:
$$\dot H_2 = -\big(\bar E + m\,H_2H_2^\top\big)H_2, \qquad \dot W = -W\big(\mu_{\mathrm{single}}W^\top W - \bar E\big). \qquad (46)$$
Since $E$ is p.s.d. (or zero, as a special case), $\bar E$ is p.s.d. as well, and the eigendecomposition of the invariant is given by $\bar E = \sum_k c_k v_kv_k^\top$ for some coefficients $c_k \ge 0$ and a set of orthonormal vectors $v_k \in \mathbb{R}^n$. Then we also have $H_2H_2^\top = \sum_{k,l}\alpha_{kl}v_kv_l^\top$, where the $\alpha_{kl}$ are symmetric (i.e., $\alpha_{kl} = \alpha_{lk}$) and $\alpha_{kk} \ge 0$ for all $k = 1, \ldots, n$ (since $H_2H_2^\top$ is symmetric and p.s.d.). Note that the coefficients $c_k$ are constant, while the coefficients $\alpha_{kl}$ are time-dependent. Let us then write the dynamics of $\alpha_{kl}$ using the dynamics of $H_2H_2^\top$:
$$\tfrac{d}{dt}\big(H_2H_2^\top\big) = -\bar E\,H_2H_2^\top - H_2H_2^\top\bar E - 2m\big(H_2H_2^\top\big)^2. \qquad (47)$$
Then for the elements of $\alpha$ we have
$$\dot\alpha_{kl} = -\alpha_{kl}(c_k + c_l) - 2m\sum_j \alpha_{kj}\alpha_{jl}. \qquad (48)$$
For the diagonal elements $\alpha_{kk}$, this gives
$$\dot\alpha_{kk} = -2c_k\alpha_{kk} - 2m\sum_j \alpha_{kj}^2. \qquad (49)$$
Since $c_k \ge 0$, $\alpha_{kk} \ge 0$ and $\alpha_{kj}^2 \ge 0$, we get that
$$\alpha_{kk} \xrightarrow{t\to\infty} 0 \quad \text{for all } k, \qquad (50)$$
and therefore all the off-diagonal elements also tend to zero (by positive semi-definiteness, $|\alpha_{kl}| \le \sqrt{\alpha_{kk}\alpha_{ll}}$). Thus, we get that
$$H_2H_2^\top \xrightarrow{t\to\infty} O \qquad (51)$$
and hence
$$H_2 \xrightarrow{t\to\infty} O. \qquad (52)$$
Now we notice that, from the expression for $H_2$ in (24), it follows that $H_2 = O$ implies variability collapse, since it means that all the feature vectors within the same class are equal. Indeed, $H^{(c)}\bar Q_2 = O \in \mathbb{R}^{n\times(m-1)}$ means that there is a set of $m-1$ orthogonal vectors which are all also orthogonal to $[h_i(x_1^c), \ldots, h_i(x_m^c)]$ for every $i = 1, \ldots, n$, where the $x_i^c$ are the inputs from class $c$. However, there is only one vector (up to a constant) orthogonal to all the columns of $\bar Q_2$ in $\mathbb{R}^m$, and this vector is $\mathbf{1}_m$. Therefore, $[h_i(x_1^c), \ldots, h_i(x_m^c)] = \gamma\mathbf{1}_m^\top$ for some constant $\gamma$, for every $i = 1, \ldots, n$. Thus, we indeed have $h(x_1^c) = \cdots = h(x_m^c)$, which constitutes variability collapse within classes.
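The decay mechanism used above can be illustrated numerically. Below is a minimal sketch (toy dimensions, explicit Euler integration, not the paper's code) of the end-of-training dynamics (44): the initialization is chosen so that $\bar E$ is p.s.d., mirroring the assumption in the proof ($H_2$ is placed in the row space of a $W$ with known spectrum), and the run confirms that $\bar E$ stays approximately constant while $\|H_2\|_F$ collapses.

```python
# Euler simulation of the end-of-training dynamics (44):
#   dH2/dt = -mu_single * W^T W H2,   dW/dt = -m * W H2 H2^T.
# Checks: Ebar = mu_single * W^T W - m * H2 H2^T is (approximately) preserved,
# and ||H2||_F decays towards zero (within-class variability collapse).
import jax
import jax.numpy as jnp

C, n, m = 3, 8, 4
mu_single = 0.5

W = 2.0 * jnp.eye(C, n)                                    # C x n, known spectrum (toy)
G = 0.05 * jax.random.normal(jax.random.PRNGKey(0), (C, C * (m - 1)))
H2 = W.T @ G                                               # H2 in the row space of W

def ebar(W, H2):
    return mu_single * W.T @ W - m * H2 @ H2.T

E0 = ebar(W, H2)
dt, steps = 2e-3, 8000
for _ in range(steps):
    dH2 = -mu_single * W.T @ W @ H2
    dW = -m * W @ H2 @ H2.T
    H2, W = H2 + dt * dH2, W + dt * dW

print(jnp.linalg.norm(ebar(W, H2) - E0))   # small: only Euler discretization error
print(jnp.linalg.norm(H2))                 # close to zero: H2 -> O
```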
(NC2-4). Set $\beta = \tfrac1C$. We first show that a zero global feature mean implies $b = \beta\mathbf{1}_C$. At the end of training, since $R_1 = O$, we have
$$WH_1 + b\mathbf{1}_C^\top = I_C. \qquad (53)$$
On the other hand, a zero global mean implies $H_1\mathbf{1}_C = C\bar h = O$. Multiplying (53) by $\mathbf{1}_C$ on the right then gives $Cb = \mathbf{1}_C$, which is the desired expression for the biases. Given the zero global mean, we have
$$\tfrac1m W^\top W - \tfrac{1}{\mu_{\mathrm{class}}}H_1H_1^\top - \tfrac{1}{\mu_{\mathrm{single}}}H_2H_2^\top = E - \tfrac{\alpha C^2}{\mu_{\mathrm{class}}}\bar h\bar h^\top = E. \qquad (54)$$
By the proof of NC1, $H_2H_2^\top \to O$. Together with the assumption that $E$ is proportional to the limit of $W^\top W$ (or zero, as a special case), we obtain
$$\mu_{\mathrm{class}}W^\top W - m\,H_1H_1^\top \to \gamma\,W^\top W \qquad (55)$$
for some $\gamma \ge 0$. Note that since $H_1H_1^\top$ is p.s.d., this implies $\lambda_c := \mu_{\mathrm{class}} - \gamma \ge 0$. By multiplying on the left and on the right by appropriate factors, we have
$$H_1^\top\big(\lambda_c W^\top W - m\,H_1H_1^\top\big)H_1 \to O, \qquad W\big(\lambda_c W^\top W - m\,H_1H_1^\top\big)W^\top \to O. \qquad (56)$$
Consequently (according to (53)),
$$\lambda_c\big(I_C - \beta\mathbf{1}_C\mathbf{1}_C^\top\big)^2 - m\big(H_1^\top H_1\big)^2 \to O, \qquad \lambda_c\big(WW^\top\big)^2 - m\big(I_C - \beta\mathbf{1}_C\mathbf{1}_C^\top\big)^2 \to O. \qquad (57)$$
Since both $WW^\top$ and $H_1^\top H_1$ are p.s.d., taking the unique p.s.d. square roots yields
$$\sqrt{m}\,H_1^\top H_1 \to \sqrt{\lambda_c}\,\big(I_C - \beta\mathbf{1}_C\mathbf{1}_C^\top\big), \qquad \sqrt{\lambda_c}\,WW^\top \to \sqrt{m}\,\big(I_C - \beta\mathbf{1}_C\mathbf{1}_C^\top\big). \qquad (58)$$
To establish NC2, recall that $H_1 = [\bar h_1, \ldots, \bar h_C]$ and that $M$, as a normalized version of $H_1$, satisfies
$$M^\top M \propto \tfrac{1}{1-\beta}\big(I_C - \beta\mathbf{1}_C\mathbf{1}_C^\top\big) = \tfrac{C}{C-1}\big(I_C - \tfrac1C\mathbf{1}_C\mathbf{1}_C^\top\big),$$
which is the simplex ETF structure. To establish NC3, note that from (55) and (58) together it follows that the limits of $M$ and $W^\top$ only differ by a constant multiplier. To establish NC4, note that using NC3 (and the equal norms of the class means from NC2) we can write
$$\operatorname*{argmax}_c\big(Wh(x) + b\big)_c = \operatorname*{argmax}_c\big(Wh(x)\big)_c \ \ (b = \beta\mathbf{1}_C) \ \ = \operatorname*{argmax}_c\big(M^\top h(x)\big)_c \ \ (\text{NC3}) \ \ = \operatorname*{argmin}_c\big\|h(x) - \bar h_c\big\|_2.$$
This completes the proof.

B.4 General biases case

Proof. As in the proof of Theorem 5.2, at the end of training we have $WH_1 + b\mathbf{1}_C^\top = I_C$. Moreover, since $E = O$ and $H_2 \to O$, we have
$$\tfrac1m W^\top W - \tfrac{1}{\mu_{\mathrm{class}}}H_1\big(I_C - \alpha\mathbf{1}_C\mathbf{1}_C^\top\big)H_1^\top \to O. \qquad (59)$$
Multiplying the above expression by $W$ on the left and by $W^\top$ on the right, we obtain the general expression (29) for the matrix $(WW^\top)^2$ mentioned in the main text:
$$\big(WW^\top\big)^2 \to \tfrac{m}{\mu_{\mathrm{class}}}\Big(I_C - \alpha\mathbf{1}_C\mathbf{1}_C^\top + (1 - \alpha C)\big(Cbb^\top - b\mathbf{1}_C^\top - \mathbf{1}_Cb^\top\big)\Big). \qquad (60)$$
This expression implies that the rows of the weight matrix may have varying separation angles in the general biases case, i.e., there is no symmetric structure in general. However, for constant biases $b = \beta\mathbf{1}_C$, the above expression simplifies to
$$\big(WW^\top\big)^2 \to \tfrac{m}{\mu_{\mathrm{class}}}\Big(I_C - \tfrac{1 - (1-\alpha C)(1-\beta C)^2}{C}\,\mathbf{1}_C\mathbf{1}_C^\top\Big). \qquad (61)$$
Since $\alpha < 1/C$ and $(1-\beta C)^2 \ge 0$, we have $\big(1 - (1-\alpha C)(1-\beta C)^2\big)/C \le 1/C$. Therefore, the RHS of (61) is always p.s.d. and has a unique p.s.d. square root proportional to $I_C - \gamma\mathbf{1}_C\mathbf{1}_C^\top$ for some constant $\gamma \le 1/C$. Denote $\rho := \big(1 - (1-\alpha C)(1-\beta C)^2\big)/C$; then we have $\gamma = \big(1 - \sqrt{1 - C\rho}\big)/C$. Note that $\rho \le 1/C$ ensures that $\gamma$ is well defined. Then the configuration of the final weights is given by
$$WW^\top \to \sqrt{\tfrac{m}{\mu_{\mathrm{class}}}}\,\big(I_C - \gamma\mathbf{1}_C\mathbf{1}_C^\top\big). \qquad (62)$$
This means that the norms of all the weight rows are still equal, as in NC2. However, since $\gamma < 1/C$ if $\beta \neq 1/C$, the angle between these rows is smaller than in the ETF structure. We can derive the configuration of the class means similarly, by multiplying (59) by $H_1^\top$ on the left and by $H_1$ on the right. In the general biases case, we get
$$H_1^\top H_1\big(I_C - \alpha\mathbf{1}_C\mathbf{1}_C^\top\big)H_1^\top H_1 \to \tfrac{\mu_{\mathrm{class}}}{m}\Big(I_C - b\mathbf{1}_C^\top - \mathbf{1}_Cb^\top + \|b\|_2^2\,\mathbf{1}_C\mathbf{1}_C^\top\Big). \qquad (63)$$
As with the weights, we see that this is not a symmetric structure in general. Thus, NC2 does not hold in the general biases case. However, for constant biases $b = \beta\mathbf{1}_C$, the above expression simplifies to
$$H_1^\top H_1\big(I_C - \alpha\mathbf{1}_C\mathbf{1}_C^\top\big)H_1^\top H_1 \to \tfrac{\mu_{\mathrm{class}}}{m}\big(I_C - \beta\mathbf{1}_C\mathbf{1}_C^\top\big)^2. \qquad (64)$$
Analogously to the previous derivations, we get that the unique p.s.d. square root of the RHS is given by $\sqrt{\mu_{\mathrm{class}}/m}\,\big(I_C - \tilde\rho\,\mathbf{1}_C\mathbf{1}_C^\top\big)$, where $\tilde\rho := \big(1 - |1 - \beta C|\big)/C < 1/C$ for $\beta \neq 1/C$. On the other hand, the unique p.s.d. root of $I_C - \alpha\mathbf{1}_C\mathbf{1}_C^\top$ is given by $I_C - \phi\mathbf{1}_C\mathbf{1}_C^\top$, where $\phi := \big(1 - \sqrt{1 - \alpha C}\big)/C$. Thus, we have
$$\sqrt{\tfrac{m}{\mu_{\mathrm{class}}}}\;H_1^\top H_1\big(I_C - \phi\mathbf{1}_C\mathbf{1}_C^\top\big) \to I_C - \tilde\rho\,\mathbf{1}_C\mathbf{1}_C^\top. \qquad (65)$$
Therefore, the structure of the last-layer features class means is given by
$$H_1^\top H_1 \to \sqrt{\tfrac{\mu_{\mathrm{class}}}{m}}\,\big(I_C - \tilde\rho\,\mathbf{1}_C\mathbf{1}_C^\top\big)\Big(I_C + \tfrac{\phi}{1 - \phi C}\,\mathbf{1}_C\mathbf{1}_C^\top\Big) = \sqrt{\tfrac{\mu_{\mathrm{class}}}{m}}\,\big(I_C - \theta\,\mathbf{1}_C\mathbf{1}_C^\top\big), \qquad (66)$$
where $\theta := \tilde\rho - \tfrac{(1 - C\tilde\rho)\,\phi}{1 - \phi C} < 1/C$ for $\beta \neq 1/C$. Thus, similarly to the classifier weights $W$, the last-layer features class means form a symmetric structure with equal lengths and a separation angle smaller than in the ETF. However, the centralized class means, given by $M = H_1\big(I_C - \mathbf{1}_C\mathbf{1}_C^\top/C\big)$, still form the ETF structure:
$$M^\top M \to \sqrt{\tfrac{\mu_{\mathrm{class}}}{m}}\,\Big(I_C - \tfrac1C\mathbf{1}_C\mathbf{1}_C^\top\Big). \qquad (67)$$
This holds since the component proportional to $\mathbf{1}_C\mathbf{1}_C^\top$ on the RHS of equation (66) lies in the kernel of the ETF matrix $\big(I_C - \mathbf{1}_C\mathbf{1}_C^\top/C\big)$. Thus, we conclude that NC2 holds in the case of equal biases, while NC3 does not.
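The p.s.d. square-root identity used repeatedly above can be verified directly. The following minimal check (toy values of $C$ and of the coefficient; not from the paper's code) confirms that, for $c \le 1/C$, the unique p.s.d. square root of $I_C - c\,\mathbf{1}_C\mathbf{1}_C^\top$ is $I_C - \gamma\,\mathbf{1}_C\mathbf{1}_C^\top$ with $\gamma = (1 - \sqrt{1 - cC})/C$, which is the relation behind the constants $\gamma$, $\tilde\rho$ and $\phi$.

```python
# Check of the square-root identity: (I - gamma*11^T)^2 = I - c*11^T
# with gamma = (1 - sqrt(1 - c*C)) / C, and I - gamma*11^T p.s.d.
import jax.numpy as jnp

C, c = 10, 0.06                                   # toy values, c < 1/C
ones = jnp.ones((C, C))
A = jnp.eye(C) - c * ones
gamma = (1.0 - jnp.sqrt(1.0 - c * C)) / C
S = jnp.eye(C) - gamma * ones

print(jnp.allclose(S @ S, A, atol=1e-5))              # True: S^2 = A
print(bool((jnp.linalg.eigvalsh(S) >= -1e-6).all()))  # True: S is p.s.d.
```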
Remark on the $\alpha \to 0$ case. Simplifying the expressions for the constants $\gamma$ and $\theta$, which define the angles in the configurations of the weights and the class means above, we get the following:
$$\gamma = \frac{1 - \sqrt{1-\alpha C}\,|1 - \beta C|}{C}, \qquad \theta = \frac{1 - \frac{|1-\beta C|}{\sqrt{1-\alpha C}}}{C}. \qquad (68)$$
Analyzing these expressions, we find that they are equal only if $\sqrt{1 - \alpha C} = 1$, i.e., $\alpha = 0$. However, this can only hold if $\kappa_n = 0$ by the definition of $\alpha$, i.e., when the kernel $\Theta^h$ is zero on pairs of samples from different classes. While $\alpha \neq 0$ in general, there are certain settings where $\alpha$ approaches zero. Simplifying the expression for $\alpha$, we get
$$\alpha = \frac{1}{\frac{\kappa_d - \kappa_c}{\kappa_n m} + \frac{\kappa_c}{\kappa_n} + (C - 1)}. \qquad (69)$$
One can see that $\alpha \to 0$ if $C \to \infty$ or when $\kappa_c/\kappa_n \to \infty$. Since the kernel $\Theta^h$ is strongly aligned with the labels in our numerical experiments, the value of $\kappa_c/\kappa_n$ is large in practice. Thus, $\alpha$ is not zero but is indeed significantly smaller than $1/C$. Consequently, in our numerical experiments the angles $\theta$ and $\gamma$ are close to each other. However, we note that the equality of these two angles does not imply NC3, since the value of $\theta$ characterizes the angles between the non-centralized class means.

Remark on the $\alpha \to 1/C$ case. If $\alpha = 1/C$, equation (63) for the structure of the features class means with general (not equal) biases implies, after centering,
$$M^\top M \to \sqrt{\tfrac{\mu_{\mathrm{class}}}{m}}\,\Big(I_C - \tfrac1C\mathbf{1}_C\mathbf{1}_C^\top\Big), \qquad (70)$$
i.e., in this case the class means always exhibit the ETF structure, even without the assumption that all the biases are equal. Moreover, in this case $\gamma = 1/C$ as well. Thus, both NC2 and NC3 hold. While by definition $\alpha < 1/C$, we can analyze the cases when it approaches $1/C$ using expression (69) again. One can see that when $m \to \infty$ and $\kappa_c/\kappa_n \to 1$, we have $\alpha \to 1/C$. However, the requirement $\kappa_c/\kappa_n \to 1$ implies that the kernel $\Theta^h$ does not distinguish between pairs of samples from the same class and pairs from different classes. Such a property of the kernel is associated with poor generalization performance and does not occur in our numerical experiments.

C Numerical experiments

Implementation details. We use JAX [8] and Flax (a neural network library for JAX) [25] to implement all the DNN architectures and the training routines. This choice of software allows us to compute the empirical NTK of any DNN architecture effortlessly and efficiently. We compute the values of the kernels $\Theta$ and $\Theta^h$ on the whole training batch ($m = 12$ samples per class, 120 samples in total) in the case of ResNet20 and DenseNet40 to approximate the values $(\gamma_d, \gamma_c, \gamma_n)$ and $(\kappa_d, \kappa_c, \kappa_n)$, as well as the NTK alignment metrics, and we compute the invariant $E$ using these values. Since the VGG11 and VGG16 architectures are much larger (over 10 million parameters) and computing their Jacobians is very memory-intensive, we use $m = 4$ samples per class (i.e., 40 samples in total) to approximate the kernels of these models. We compute all the other training metrics displayed in panes a-e of Figures 3-11 on the whole last batch of every second training epoch for all the architectures. The test accuracy is computed on the whole test set. To produce panes f-h of the same figures, we compute the NC metrics and the test accuracy only once, after 400 epochs of training, for every learning rate. We use 30 logarithmically spaced learning rates in the range $\eta \in [10^{-4}, 10^{0.25}]$ for ResNet20 trained on MNIST and VGG11 trained on MNIST. For all the other architecture-dataset pairs we only compute the last 20 of these learning rates to reduce the computational costs, since the smallest learning rates do not yield models with acceptable performance.
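For illustration, the snippet below sketches how an empirical last-layer kernel $\Theta^h$ of the form used in this paper can be computed with JAX and Flax: per-sample Jacobians of the penultimate-layer features are stacked and contracted over the parameter dimension. The small `Features` MLP, its sizes, and the batch are hypothetical stand-ins and not the actual models or code of this project.

```python
# Sketch: empirical kernel Theta^h_{s,s'}(x_i, x_j) = <dh_s(x_i)/dw, dh_s'(x_j)/dw>
# for a toy Flax feature extractor (hypothetical stand-in for the paper's architectures).
import jax
import jax.numpy as jnp
import flax.linen as nn

class Features(nn.Module):                 # h: X -> R^n (penultimate-layer features)
    n: int = 16
    @nn.compact
    def __call__(self, x):
        x = nn.relu(nn.Dense(64)(x))
        return nn.Dense(self.n)(x)

model = Features()
x = jax.random.normal(jax.random.PRNGKey(1), (8, 20))     # 8 toy inputs in R^20
params = model.init(jax.random.PRNGKey(0), x[:1])

def h_single(params, xi):
    return model.apply(params, xi[None, :])[0]             # features of one sample, shape (n,)

def flat_jac(xi):
    # Jacobian of h(x_i) w.r.t. all parameters, flattened to shape (n, P).
    jac = jax.jacrev(h_single)(params, xi)
    leaves = jax.tree_util.tree_leaves(jac)
    return jnp.concatenate([l.reshape(l.shape[0], -1) for l in leaves], axis=1)

J = jax.vmap(flat_jac)(x)                                  # (N, n, P)
theta_h = jnp.einsum('isp,jtp->stij', J, J)                # (n, n, N, N): Theta^h_{s,t}(x_i, x_j)
traced = jnp.einsum('kkij->ij', theta_h)                   # sum_k Theta^h_{k,k}(X), the traced kernel
print(theta_h.shape, traced.shape)
```

For the full NTK $\Theta$, the logits $f(x)$ would be differentiated instead of the features $h(x)$.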
Compute. We executed the numerical experiments mainly on NVIDIA GeForce RTX 3090 Ti GPUs; each model was trained on a single GPU. In this setup, a single training run displayed in panes a-e of Figures 3-11 took approximately 3 hours for ResNet20, 6 hours for DenseNet40, 7 hours for VGG11, and 11 hours for VGG16. This adds up to a total of 312 hours to compute panes a-e of the figures. The computation time is mostly spent not on the training routine itself but on the large number of computationally heavy metrics, which are computed every second epoch of a training run. Indeed, to approximate the values of $\Theta$ and $\Theta^h$, one needs to compute $C(C+1) + n(n+1)$ kernels on a sample of size $mC$ from the dataset, and each of the kernels requires computing a gradient with respect to the numerous parameters of a DNN. Additionally, the graphs in panes f-h of the same figures take around 1.5 hours per learning rate value for ResNet20, 3 hours for DenseNet40, and 4 hours for VGG11 and VGG16, which adds up to approximately 1350 computational hours.

Results. We include experiments on the following architecture-dataset pairs:
Figure 3: VGG11 trained on MNIST
Figure 4: VGG11 trained on Fashion-MNIST
Figure 5: VGG16 trained on CIFAR10
Figure 6: ResNet20 trained on MNIST
Figure 7: ResNet20 trained on Fashion-MNIST
Figure 8: ResNet20 trained on CIFAR10
Figure 9: DenseNet40 trained on MNIST
Figure 10: DenseNet40 trained on Fashion-MNIST
Figure 11: DenseNet40 trained on CIFAR10
The experimental setup is described in Section 6. Panes a-h of Figures 3-11 are analogous to the same panes of Figure 2. We include an additional pane i, which displays the norms of the invariant terms corresponding to the feature matrix components $H_1$ and $H_2$ and the global features mean $\bar h$ at the end of training. One can see that the global features mean is relatively small in comparison with the class means in every setup, and the "variance" term $H_2$ is small for models that exhibit NC. We also add pane j, which displays the alignment of the kernels $\Theta$ and $\Theta^h$ for every model at the end of training. One can see that the kernel alignment is typically stronger in models that exhibit NC.

C.1 Additional examples of the NTK block structure

We include the following additional illustrative figures (analogous to Figure 1 in the main text) that show the NTK block structure for dataset-architecture pairs covered in our experiments:
Figure 13: VGG11 trained on MNIST
Figure 14: VGG11 trained on Fashion-MNIST
Figure 15: VGG16 trained on CIFAR10
Figure 16: ResNet20 trained on Fashion-MNIST
Figure 17: ResNet20 trained on CIFAR10
Figure 18: DenseNet40 trained on MNIST
Figure 19: DenseNet40 trained on Fashion-MNIST
Figure 20: DenseNet40 trained on CIFAR10
Overall, the block structure pattern is visible in the traced kernels in all the figures. As expected, the block structure is more pronounced in the kernels where the final alignment values are higher. While the norms of the "non-diagonal" components of the kernels are generally smaller than those of the "diagonal" components in panes c) and d), we notice that there is large variability in the norms of the "diagonal" components in some settings. This means that different neurons of the penultimate layer and different classification heads may contribute to the kernel unequally in some settings. Moreover, certain "non-diagonal" components of the last-layer kernel may have a non-negligible effect in some settings.
We discuss how one could generalize our analysis to account for these properties of the NTK in Appendix D.

C.2 Preliminary experiments with CE loss

While CE loss is a common choice for training DNN classifiers, our theoretical analysis and the experimental results only cover DNNs trained with MSE loss. For completeness, we provide experimental results for ResNet20 trained on MNIST with CE loss in Figure 12. One can see that a smaller invariant norm and a higher invariant alignment correlate with NC in the figure. However, DNNs trained with CE loss overall reach better NC metrics while having a much larger norm of the invariant in comparison with DNNs trained with MSE loss.

D Relaxation of the NTK Block-Structure Assumption

In this section, we first derive the dynamics equations of DNNs under a general block structure assumption on the last-layer kernel $\Theta^h$ (analogous to the equations presented in Theorem 4.1 and Theorem 5.1). Then we discuss a possible relaxation of Assumption 3.2 under which our main result regarding NC in Theorem 5.2 still holds.

D.1 Dynamics under the General Block Structure Assumption

We first formulate the most general form of the block structure assumption on $\Theta^h$ as follows:

Assumption D.1. Assume that $\Theta^h : X \times X \to \mathbb{R}^{n\times n}$ has the following block structure:
$$\Theta^h(x, x) = A_d + A_c + A_n, \qquad \Theta^h(x_i^c, x_j^c) = A_c + A_n, \qquad \Theta^h(x_i^c, x_j^{c'}) = A_n, \qquad (71)$$
where $A_d, A_c, A_n \in \mathbb{R}^{n\times n}$ are arbitrary p.s.d. matrices. Here $x_i^c$ and $x_j^c$ are two distinct inputs from the same class, and $x_j^{c'}$ is an input from a class $c' \neq c$.

This assumption means that every kernel matrix $\Theta^h_{k,s}(X)$, $k, s \in [1, n]$, still has at most three distinct values, corresponding to the inter-class, intra-class, and diagonal values of the kernel. However, these values are arbitrary and may depend on the choice of $k, s \in [1, n]$. Under the general block structure assumption, the gradient flow dynamics of DNNs with MSE loss takes the following form:
$$\dot H = -A_d W^\top R - m\,A_c W^\top R_{\mathrm{class}} - N A_n W^\top R_{\mathrm{global}}, \qquad \dot W = -RH^\top, \qquad \dot b = -R_{\mathrm{global}}\mathbf{1}_N. \qquad (72)$$
This is the generalized version of the dynamics presented in Theorem 4.1. Consequently, the decomposed dynamics presented in Theorem 5.1 takes the following form under the general block structure assumption:
$$\dot H_1 = -(A_d + m A_c)W^\top R_1 - m\,A_n W^\top R_1\mathbf{1}_C\mathbf{1}_C^\top, \qquad \dot H_2 = -A_d W^\top WH_2, \qquad \dot W = -m\big(R_1H_1^\top + WH_2H_2^\top\big), \qquad \dot b = -m\,R_1\mathbf{1}_C. \qquad (73)$$
The derivation of the above dynamics equations is identical to the proofs of Theorem 4.1 and Theorem 5.1 presented in Appendix B.

Rotation invariance. We notice that the dynamics of $(W, H)$ in (72) has to be rotation invariant, i.e., the equations should not be affected by a change of variables $W \to WQ$, $H \to Q^\top H$ for any orthogonal matrix $Q$. This holds since the loss function only depends on the product $WH$, which does not change under rotation. This requirement puts conditions on the behavior of $A_{d,c,n}$ under rotation. Indeed, assume that the rotation $W \to WQ$, $H \to Q^\top H$ for some $Q$ corresponds to the following change of the kernel:
$$A_{d,c,n} \to A_{d,c,n}(Q); \qquad (74)$$
then the rotation invariance of the dynamics implies the following equality for any $Q$:
$$Q\,A_d(Q)\,Q^\top W^\top R + m\,Q\,A_c(Q)\,Q^\top W^\top R_{\mathrm{class}} + N\,Q\,A_n(Q)\,Q^\top W^\top R_{\mathrm{global}} \qquad (75)$$
$$= A_d W^\top R + m\,A_c W^\top R_{\mathrm{class}} + N A_n W^\top R_{\mathrm{global}}. \qquad (76)$$
These equations are satisfied trivially with our initial assumption, where $A_{d,c,n} = A_{d,c,n}(Q) \propto I_n$. However, as we can see, any generalized assumption should specify the behavior of the kernel under rotation and satisfy the above equation. For general $A_{d,c,n}$, the following behavior under rotation trivially satisfies the above condition: $A_{d,c,n}(Q) = Q^\top A_{d,c,n}\,Q$.
This behaviour of the kernel under rotation is intuitive, since it implies that the gradients of the last-layer features $h$ are rotated in the same way as the features. However, we note that gradients of parametrized functions do not in general behave this way, since the rotation of the function has to be realized by a certain change of parameters. Consider, for instance, a one-hidden-layer linear network with weights $V$ in the first layer. Then we have $H = VX$, and a rotation $H \to Q^\top H$ corresponds to the change of parameters $V \to Q^\top V$. In this case, the kernel does not change under rotation, i.e., $A_{d,c,n}(Q) = A_{d,c,n}$.

Dynamics invariant. We note that the dynamics in (73) does not in general have an invariant analogous to the one we identified in Theorem 5.1. Indeed, if we define a quantity $E := W^\top W - c_1 H_1H_1^\top - c_2 H_2H_2^\top$ for some constants $c_1, c_2 \in \mathbb{R}$, and additionally assume centered global means $H_1\mathbf{1}_C = 0$, we get the following expression for the derivative of $E$:
$$\dot E = \big(c_1(A_d + m A_c) - m I_n\big)W^\top R_1H_1^\top + H_1R_1^\top W\big(c_1(A_d + m A_c) - m I_n\big)^\top \qquad (77)$$
$$\quad + \big(c_2 A_d - m I_n\big)W^\top WH_2H_2^\top + H_2H_2^\top W^\top W\big(c_2 A_d^\top - m I_n\big), \qquad (78)$$
which is not equal to zero for arbitrary matrices $A_d, A_c$.

D.2 Neural Collapse under the Relaxed Block Structure Assumption

We now propose a relaxation of our main assumption under which our main result regarding NC in Theorem 5.2 still holds. In terms of Assumption D.1 on the general block structure of $\Theta^h$, our initial Assumption 3.2 in the main text is the special case with $A_n = \kappa_n I_n$, $A_c = (\kappa_c - \kappa_n)I_n$, $A_d = (\kappa_d - \kappa_c)I_n$. The relaxed assumption can be formulated as follows in terms of the matrices $A_{d,c,n}$:

Assumption D.2. Assume that $A_n$ is an arbitrary p.s.d. matrix and that $(A_c, A_d)$ satisfy the following conditions:
$$A_c = \kappa_c I_n + N_c, \qquad A_d = \kappa_d I_n + N_d, \qquad (79)$$
where the rows of $N_{c,d}$ lie in $\ker(R^\top W)$, i.e., $N_{c,d}\,W^\top R = O$. Further, assume that the kernel changes under rotation with an orthogonal matrix $Q$ as follows:
$$A_{d,c,n}(Q) = Q^\top A_{d,c,n}\,Q. \qquad (80)$$
Since $A_n$ is arbitrary, this relaxation allows arbitrary non-zero values of the non-diagonal kernels $\Theta^h_{k,s}$ with $k \neq s$. The following observations justify the consistency of the above assumption. First, since $W \in \mathbb{R}^{C\times n}$, $R \in \mathbb{R}^{C\times N}$ and $N > n > C$, the matrix $R^\top W$ has a non-trivial kernel (possibly time-dependent). Second, the dynamics is rotation invariant under the assumption, i.e., equation (75) holds. Third, the expression of the assumption is itself rotation invariant, in the sense that $A_{d,c}(Q) = \kappa_{d,c} I_n + N_{d,c}(Q)$, where the rows of $N_{d,c}(Q)$ lie in $\ker(R^\top WQ)$ for any orthogonal $Q$. Under the above assumption, the derivative in (77) becomes zero, so the dynamics has an invariant of the form $E := W^\top W - c_1 H_1H_1^\top - c_2 H_2H_2^\top$. Moreover, the statement and the proof of our main Theorem 5.2 remain unchanged. Thus, DNNs satisfying the conditions of Theorem 5.2 display NC under Assumption D.2.

D.3 Discussion

The analysis of the DNN dynamics is simplified significantly by assuming that $\Theta^h$ has a block structure. However, formulating a reasonable and consistent assumption on the NTK and its components is non-trivial. Assumption 3.2, which we used in the main text, is justified by the empirical results but may not capture all the relevant properties of the NTK. We believe that studying DNN dynamics under a more general or more realistic assumption on the NTK is a promising direction for future work. The relaxed block structure assumption proposed in this section is a first step in this direction.
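To see the cancellation mechanism of Assumption D.2 concretely, the following minimal linear-algebra check (toy shapes, randomly generated $W$ and $R$, not from the paper's code) constructs a correction term $N_c$ whose rows are orthogonal to the column space of $W^\top R$ and verifies that it drops out of both the full and the class-mean residual terms of the dynamics; the p.s.d. requirement on the full matrix $A_c$ is not enforced in this illustration.

```python
# Check that a correction N_c with N_c W^T R = 0 does not affect the dynamics:
# (kappa_c I + N_c) W^T R = kappa_c W^T R, and the same holds for R_class,
# since R_class = R Y^T Y / m shares the column space of R.
import jax
import jax.numpy as jnp
jax.config.update("jax_enable_x64", True)

C, n, m = 3, 8, 4
N = C * m
kappa_c = 1.0
kw, kr, km = jax.random.split(jax.random.PRNGKey(0), 3)

W = jax.random.normal(kw, (C, n))                  # classifier weights (toy)
R = jax.random.normal(kr, (C, N))                  # residual matrix (toy)
Y = jnp.kron(jnp.eye(C), jnp.ones((1, m)))         # one-hot labels, C x N
R_class = R @ Y.T @ Y / m                          # class-mean residuals, repeated

P = W.T @ jnp.linalg.solve(W @ W.T, W)             # projector onto the row space of W
N_c = jax.random.normal(km, (n, n)) @ (jnp.eye(n) - P)   # rows orthogonal to col(W^T R)
A_c = kappa_c * jnp.eye(n) + N_c

print(bool(jnp.allclose(N_c @ W.T @ R, 0.0)))                            # True
print(bool(jnp.allclose(A_c @ W.T @ R_class, kappa_c * W.T @ R_class)))  # True
```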
Figure 3: VGG11 trained on MNIST. See Figure 2 for the description of panes a-h. i) Norms of the matrices $H_1H_1^\top$, $H_2H_2^\top$, and $\bar h\bar h^\top$ at the end of training. j) Alignment of the kernels $\Theta$ and $\Theta^h$ at the end of training. The color in panes i-j is the color of the same model in panes a-e.
Figure 4: VGG11 trained on Fashion-MNIST. Panes as in Figure 3.
Figure 5: VGG16 trained on CIFAR10. Panes as in Figure 3.
Figure 6: ResNet20 trained on MNIST. Panes as in Figure 3.
Figure 7: ResNet20 trained on Fashion-MNIST. Panes as in Figure 3.
Figure 8: ResNet20 trained on CIFAR10. Panes as in Figure 3.
Figure 9: DenseNet40 trained on MNIST. Panes as in Figure 3.
Figure 10: DenseNet40 trained on Fashion-MNIST. Panes as in Figure 3.
Figure 11: DenseNet40 trained on CIFAR10. Panes as in Figure 3.
Figure 12: ResNet20 trained on MNIST with CE loss. Panes as in Figure 3.
Figure 13: NTK block structure of VGG11 trained on MNIST. LeCun normal initialization, initial learning rate 0.131. The kernel is computed on a random data subset with 4 samples from each class. See Figure 1 for the description of panes.
Figure 14: NTK block structure of VGG11 trained on Fashion-MNIST. LeCun normal initialization, initial learning rate 0.049. The kernel is computed on a random data subset with 4 samples from each class. See Figure 1 for the description of panes.
Figure 15: NTK block structure of VGG11 trained on CIFAR10. LeCun normal initialization, initial learning rate 0.131. The kernel is computed on a random data subset with 4 samples from each class. See Figure 1 for the description of panes.
Figure 16: NTK block structure of ResNet20 trained on Fashion-MNIST. LeCun normal initialization, initial learning rate 0.094. The kernel is computed on a random data subset with 12 samples from each class. See Figure 1 for the description of panes.
Figure 17: NTK block structure of ResNet20 trained on CIFAR10. LeCun normal initialization, initial learning rate 0.068. The kernel is computed on a random data subset with 12 samples from each class. See Figure 1 for the description of panes.
Figure 18: NTK block structure of DenseNet40 trained on MNIST. LeCun normal initialization, initial learning rate 0.049. The kernel is computed on a random data subset with 12 samples from each class. See Figure 1 for the description of panes.
Figure 19: NTK block structure of DenseNet40 trained on Fashion-MNIST. LeCun normal initialization, initial learning rate 0.094. The kernel is computed on a random data subset with 12 samples from each class. See Figure 1 for the description of panes.
Figure 20: NTK block structure of DenseNet40 trained on CIFAR10. LeCun normal initialization, initial learning rate 0.094. The kernel is computed on a random data subset with 12 samples from each class. See Figure 1 for the description of panes.