# perturbation_analysis_of_neural_collapse__675ba3f8.pdf Perturbation Analysis of Neural Collapse Tom Tirer * 1 Haoxiang Huang * 2 Jonathan Niles-Weed 2 Training deep neural networks for classification often includes minimizing the training loss beyond the zero training error point. In this phase of training, a neural collapse behavior has been observed: the variability of features (outputs of the penultimate layer) of within-class samples decreases and the mean features of different classes approach a certain tight frame structure. Recent works analyze this behavior via idealized unconstrained features models where all the minimizers exhibit exact collapse. However, with practical networks and datasets, the features typically do not reach exact collapse, e.g., because deep layers cannot arbitrarily modify intermediate features that are far from being collapsed. In this paper, we propose a richer model that can capture this phenomenon by forcing the features to stay in the vicinity of a predefined features matrix (e.g., intermediate features). We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied models. For example, we prove reduction in the within-class variability of the optimized features compared to the predefined input features (via analyzing gradient flow on the central-path with minimal assumptions), analyze the minimizers in the near-collapse regime, and provide insights on the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings. 1. Introduction Modern classification systems are typically based on deep neural networks (DNNs), whose parameters are optimized *Equal contribution 1Faculty of Engineering, Bar-Ilan University, Ramat Gan, Israel 2Courant Institute of Mathematical Sciences, New York University, NY, US. Correspondence to: Tom Tirer . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). using a large amount of labeled training data. The training scheme of these networks often includes minimizing the training loss beyond the zero training error point (Hoffer et al., 2017; Ma et al., 2018; Belkin et al., 2019). In this terminal phase of training, a neural collapse (NC) behavior has been empirically observed when using either crossentropy (CE) loss (Papyan et al., 2020) or mean squared error (MSE) loss (Han et al., 2022). The NC behavior includes several simultaneous phenomena that evolve as the number of epochs grows. The first phenomenon, dubbed NC1, is decrease in the variability of the features (outputs of the penultimate layer) of training samples from the same class. The second phenomenon, dubbed NC2, is increasing similarity of the structure of the inter-class features means (after subtracting the global mean) to a simplex equiangular tight frame (ETF). The third phenomenon, dubbed NC3, is alignment of the last layer s weights with the inter-class features means. A consequence of these phenomena is that the classifier s decision rule becomes similar to nearest class center in feature space. 
Many recent works attempt to theoretically analyze the NC behavior (Mixon et al., 2020; Lu & Steinerberger, 2022; Wojtowytsch et al., 2021; Fang et al., 2021; Zhu et al., 2021; Graf et al., 2021; Ergen & Pilanci, 2021; Ji et al., 2021; Galanti et al., 2021; Tirer & Bruna, 2022; Zhou et al., 2022a; Thrampoulidis et al., 2022; Yang et al., 2022; Zhou et al., 2022b; Kothapalli, 2023). The mathematical frameworks are almost always based on variants of the unconstrained features model (UFM), proposed by (Mixon et al., 2020), which treats the (deepest) features of the training samples as free optimization variables (disconnected from data or intermediate/shallow features). Typically, in these idealized models all the minimizers exhibit exact collapse (i.e., their within-class variability is exactly 0 and an exact simplex ETF structure is demonstrated) provided that arbitrary (but nonzero) level of regularization is used. However, the features of DNNs are not free optimization variables but outputs of predetermined architectures that get training samples as input and have parameters (shared by all the samples) that are hard to optimize. Thus, usually, the deepest features demonstrate reduced NC distance metrics (such as within-class variability) compared to features of intermediate layers but do not exhibit convergence to an Perturbation Analysis of Neural Collapse exact collapse. Indeed, as can be seen in any NC paper that presents empirical results, the decrease in the NC metrics is typically finite and stops above zero at some epoch. The margin depends on the dataset complexity, architecture, hyperparameter tuning, etc. Yet, due to their over-idealization , the previously studied theoretical models mask the effects of these factors on the closeness to collapse and cannot capture the depthwise progress of collapse. In this paper, this issue is taken into account by studying a model that can force the features to stay in the vicinity of a predefined features matrix. By considering the predefined features as intermediate features of a DNN, the proposed model allows us to analyze how deep features progress from, or relate to, shallower features. We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied UFMs. Our main contributions include: We prove reduction in the within-class variability of the optimized features compared to the predefined input features. To the best of our knowledge, this is the first proof of depthwise reduction of an NC1 metric, as existing works have only demonstrated this behavior empirically.1 To obtain the aforementioned result (for arbitrary input features), we prove monotonic decrease of an NC1 metric along gradient flow on the central-path of an associated UFM with minimal assumptions (i.e., we drop all the assumptions and modifications of the flow that Han et al. (2022) did to facilitate their analysis). Moreover, we establish a separation between the behavior of the withinand between-class covariance of the features along the flow, and show that the rate of decrease of the NC1 metric is exponential in the presence of regularization. We provide a closed-form approximation for the minimizer of the newly proposed model. 
Then, focusing on the case where the input features matrix is already near collapse (e.g., the penultimate features of a well-trained DNN), we present a fine-grained analysis of our closed-form approximation, which provides insights on the effect of regularization hyperparameters on the closeness to collapse, as well as reasoning why NC1 metrics commonly plateau at lower values than metrics of other NC components (e.g., NC2).

We support our theory with experiments in practical deep learning settings.

^1 Note that layer-extended UFMs are proven to exhibit exactly zero within-class variability across all their layers rather than progressive depthwise reduction, which suggests that they are limited in modeling the depthwise behavior of practical DNNs (Tirer & Bruna, 2022; Dang et al., 2023).

2. Background

Consider a classification task with K classes and n training samples per class. Let us denote by y_k ∈ R^K the one-hot vector with 1 in its k-th entry, and by x_{k,i} ∈ R^p the i-th training sample of the k-th class. DNN-based classifiers can typically be expressed as DNN_Θ(x) = W h_θ(x) + b, where h_θ(·) : R^p → R^d (with d ≥ K) is the feature mapping composed of multiple layers (with learnable parameters θ), and W = [w_1, . . . , w_K]^⊤ ∈ R^{K×d} (w_k^⊤ denotes the k-th row of W) and b ∈ R^K are the weights and bias of the last classification layer. The network's parameters Θ = {W, b, θ} are usually learned by empirical risk minimization:

  min_Θ  (1/Kn) Σ_{k=1}^K Σ_{i=1}^n L(W h_θ(x_{k,i}) + b, y_k) + R(Θ),

where L(·, ·) is a loss function (e.g., CE or MSE^2) and R(·) is a regularization term.

^2 (Hui & Belkin, 2021) have shown that training DNN classifiers with MSE loss is a powerful strategy whose performance is similar to training with CE loss.

Let us denote the feature vector of the i-th training sample of the k-th class by h_{k,i} (i.e., h_{k,i} = h_θ(x_{k,i})), and let ⊗ denote the Kronecker product. Papyan et al. (2020) empirically demonstrated that when DNN_Θ is trained beyond the zero training error point, the (organized) features matrix H = [h_{1,1}, . . . , h_{1,n}, h_{2,1}, . . . , h_{K,n}] ∈ R^{d×Kn} gets closer to an exact "collapse structure", defined as follows.

Definition 2.1 ((Neural) Collapse structure). We say that the organized features matrix H ∈ R^{d×Kn} is collapsed, or alternatively, has a (neural) collapse structure, if H = H̄ ⊗ 1_n^⊤ for some H̄ ∈ R^{d×K}, and (H̄ − h_G 1_K^⊤)^⊤ (H̄ − h_G 1_K^⊤) = ρ (I_K − (1/K) 1_K 1_K^⊤) for some ρ > 0, where h_G = (1/K) H̄ 1_K is the global mean.

In the above definition, H = H̄ ⊗ 1_n^⊤ reflects zero within-class variability ("exact NC1"), and the property stated for H̄ − h_G 1_K^⊤ (the mean-subtracted features) is referred to as having a simplex ETF structure ("exact NC2").

Following the work of (Mixon et al., 2020), in order to mathematically show the emergence of minimizers with NC structure, most of the theoretical papers have followed the unconstrained features model (UFM) approach, where the features {h_{k,i} = h_θ(x_{k,i})} are treated as free optimization variables. Namely, they study problems of the form

  min_{W, b, {h_{k,i}}}  (1/Kn) Σ_{k=1}^K Σ_{i=1}^n L(W h_{k,i} + b, y_k) + R(W, b, {h_{k,i}}).

One such example is the work in (Tirer & Bruna, 2022), which considered a setting with regularized MSE loss:

  min_{W,H}  (1/2Kn) ‖WH − Y‖_F^2 + (λ_W/2K) ‖W‖_F^2 + (λ_H/2Kn) ‖H‖_F^2,   (1)

where H = [h_{1,1}, . . . , h_{1,n}, h_{2,1}, . . . , h_{K,n}] ∈ R^{d×Kn} is the (organized) unconstrained features matrix, Y = I_K ⊗ 1_n^⊤ ∈ R^{K×Kn} is its associated one-hot vectors matrix, and λ_W and λ_H are positive regularization hyperparameters.
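For concreteness, the objective in (1) is straightforward to evaluate numerically. The following is a minimal NumPy sketch of ours (the function name and the toy dimensions are illustrative, not from the paper):

```python
import numpy as np

def ufm_objective(W, H, Y, lam_W, lam_H):
    """Regularized bias-free MSE UFM objective of Eq. (1).

    W: (K, d) last-layer weights; H: (d, K*n) unconstrained features;
    Y = kron(I_K, ones((1, n))): (K, K*n) one-hot targets matrix.
    """
    K, Kn = Y.shape
    n = Kn // K
    fit = np.linalg.norm(W @ H - Y, "fro") ** 2 / (2 * K * n)
    reg_W = lam_W * np.linalg.norm(W, "fro") ** 2 / (2 * K)
    reg_H = lam_H * np.linalg.norm(H, "fro") ** 2 / (2 * K * n)
    return fit + reg_W + reg_H

# Toy usage: K=3 classes, n=5 samples per class, feature dimension d=8.
K, n, d = 3, 5, 8
Y = np.kron(np.eye(K), np.ones((1, n)))
W, H = np.random.randn(K, d), np.random.randn(d, K * n)
print(ufm_objective(W, H, Y, lam_W=2.0, lam_H=0.1))
```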
The model in (1) shares similarity with models in the matrix factorization literature (Koren et al., 2009; Chi et al., 2019), except for the assumption d ≥ K and the specific structure of Y. (In the model studied here, we further deviate from previous models.) These crucial differences allowed the authors to show that all the (global) minimizers of this bias-free UFM exhibit an orthogonal collapse, as stated in the following theorem.^3

Theorem 2.2 (Theorem 3.1 in (Tirer & Bruna, 2022)). Let d ≥ K and define c := √(λ_H λ_W). If c ≤ 1, then any global minimizer (W*, H*) of (1) satisfies

  h*_{k,1} = . . . = h*_{k,n} =: h̄*_k,  ∀k ∈ [K],
  ‖h̄*_1‖_2^2 = . . . = ‖h̄*_K‖_2^2 =: ρ = (1 − c) √(λ_W/λ_H),
  [h̄*_1, . . . , h̄*_K]^⊤ [h̄*_1, . . . , h̄*_K] = ρ I_K,
  w*_k = √(λ_H/λ_W) h̄*_k,  ∀k ∈ [K].

If c > 1, then (1) is minimized by (W*, H*) = (0, 0).

^3 Note that the results in (Tirer & Bruna, 2022) are stated for λ_W ↔ λ_W K and λ_H ↔ λ_H Kn (i.e., their hyperparameters absorb the factors 1/K and 1/Kn that are used here). Scaling the terms in the objective according to the number of samples, as done in (1), agrees with what is done in practice (e.g., averaging the squared errors over the minibatch samples rather than summing them). Our scaling also highlights the independence of the minimizers' properties on K and n.

In short, the theorem states that any minimizer (W*, H*) of (1) obeys H* = H̄* ⊗ 1_n^⊤ for some H̄* ∈ R^{d×K}, and W*^⊤ ∝ H̄*, H̄*^⊤ H̄* ∝ W* W*^⊤ ∝ I_K. It is not hard to show that H̄*^⊤ H̄* = ρ I_K implies that H̄* − h*_G 1_K^⊤ = H̄* (I_K − (1/K) 1_K 1_K^⊤) is a simplex ETF (see Definition 2.1).

From the structure of the problem and the theorem, we see that there are infinitely many minimizers of (1). Indeed, as can be deduced from the proof of Theorem 2.2 in (Tirer & Bruna, 2022): taking any (partial) orthonormal matrix R ∈ R^{d×K} (i.e., R^⊤ R = I_K), one can construct a minimizer for (1) simply by H* = √(ρ(λ_W, λ_H)) R ⊗ 1_n^⊤ and W* = √(λ_H/λ_W) √(ρ(λ_W, λ_H)) R^⊤.

The existing literature includes other different UFM settings where all the minimizers exhibit exact NC structures (e.g., see (Lu & Steinerberger, 2022; Wojtowytsch et al., 2021; Zhu et al., 2021; Fang et al., 2021; Tirer & Bruna, 2022; Thrampoulidis et al., 2022)). However, as discussed in Section 1, all the previously studied UFMs are idealized and their results deviate from the situation in practical DNN training, where: 1) the features do not exhibit exact collapse (e.g., since deep layers cannot arbitrarily modify intermediate features that are far from being collapsed); and 2) the setting of the hyperparameters affects the distance from an exact collapse structure. In what follows, we extend the model in (1) to overcome the two limitations mentioned above, as well as to be able to explain depthwise behavior. As will be shown, the theoretical insights that we gain are empirically aligned also with DNNs trained with CE loss and bias (beyond the setting of (1)). Potentially, our novel perturbation analysis approach, which exploits knowledge on exactly collapsed minimizers of UFMs, can also be applied to models other than (1).

3. Problem Setup

To gain insights on the practical NC behavior, in this paper we consider a model capable of analyzing the real-world situation where an exact NC structure is not reached. Motivated by (1), we consider the following model

  min_{W,H} f(W, H; H_0) = (1/2Kn) ‖WH − Y‖_F^2 + (λ_W/2K) ‖W‖_F^2 + (λ_H/2Kn) ‖H‖_F^2 + (β/2Kn) ‖H − H_0‖_F^2,   (2)

where H_0 ∈ R^{d×Kn} is an input features matrix, which is fixed, and β is a positive hyperparameter that controls the distance of H from H_0. Let us discuss the motivation for studying this model.
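Before turning to that discussion, the construction of collapsed minimizers from a partial orthonormal R, described after Theorem 2.2, can be checked numerically. The sketch below is our own illustration (NumPy, toy dimensions): it builds (W*, H*) from a random R using the expressions of Theorem 2.2 and verifies the orthogonal-collapse structure and the first-order optimality of (1).

```python
import numpy as np

d, K, n = 8, 3, 5
lam_W, lam_H = 2.0, 0.1
c = np.sqrt(lam_H * lam_W)                      # assumed <= 1 (non-degenerate case)
rho = (1 - c) * np.sqrt(lam_W / lam_H)          # squared norm of each class mean

R = np.linalg.qr(np.random.randn(d, K))[0]      # any partial orthonormal matrix (R^T R = I_K)
H_bar = np.sqrt(rho) * R                        # collapsed class means
H_star = np.kron(H_bar, np.ones((1, n)))        # H* = H_bar kron 1_n^T (exact NC1)
W_star = np.sqrt(lam_H / lam_W) * H_bar.T       # w*_k aligned with h_bar*_k (exact NC3)
Y = np.kron(np.eye(K), np.ones((1, n)))

# Orthogonal collapse: H_bar^T H_bar = rho * I_K (hence a simplex ETF after centering).
print(np.allclose(H_bar.T @ H_bar, rho * np.eye(K)))
# Stationarity of objective (1) at (W*, H*): both gradients vanish.
grad_W = (W_star @ H_star - Y) @ H_star.T / (K * n) + lam_W * W_star / K
grad_H = W_star.T @ (W_star @ H_star - Y) / (K * n) + lam_H * H_star / (K * n)
print(np.allclose(grad_W, 0), np.allclose(grad_H, 0))
```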
As before, we interpret W and H as the final weights and deepest features of the DNN, respectively. Clearly, for H_0 = 0 this model reduces to (1) (with ‖H‖_F^2 regularized by λ_H + β). Furthermore, when H_0 is nonzero but already a minimizer of (1) (and thus has an exact collapse structure), the following statement is straightforward.

Corollary 3.1. Let d ≥ K, λ_H λ_W < 1, and let (W*, H*) be a minimizer of (1). Then, the minimizer of f(W, H; H_0 = H*) (in (2)) is unique^4 and it is given by (W*, H*).

^4 Note that in both (1) and (2) the minimizer w.r.t. W is a closed-form function of H: W(H) = Y H^⊤ (H H^⊤ + n λ_W I_d)^{−1}. As such, a minimizer H* of either objective uniquely implies the associated W* = W(H*).

That is, (2) allows us to pick one of the minimizers of (1) by H_0 and transfer its orthogonal collapse properties, which are stated in Theorem 2.2, to the minimizer of (2). However, the usefulness of (2) comes from exploring cases with nonzero/non-collapsed H_0. Indeed, while H can be interpreted as the deepest features of a DNN, here we interpret H_0 as the features that are obtained in a shallower layer. In this case, 1/β can be understood as the complexity of the subnetwork from H_0 to H. We are particularly interested in the large β regime, β ≫ 1, where H_0 expresses penultimate features (only one layer before H) that significantly constrain H. Focusing on the large β regime, in this paper we provide mathematical reasoning for empirical NC behaviors that are not captured by previously studied UFMs, such as proving that the optimized H has smaller within-class variability than H_0, and analyzing how perturbations from collapse of H_0 can be mitigated by the minimizer of (2).

Lastly, note that even though the relation between H and H_0 in (2) differs from their explicit relation in DNNs, there exist networks and settings that motivate the assumption that the deepest features and the penultimate features are close to each other. For example, consider the ResNet architecture from (He et al., 2016b) that explicitly includes identity mappings, where (under our interpretation of H and H_0) the deepest features obey H = H_0 + r(H_0), where r(·) denotes a residual block. The residual term can potentially be very small if H_0 already separates the classes (e.g., it has a near NC structure). In fact, in the popular neural ODE framework (Chen et al., 2018), which is understood as the infinite depth limit of these ResNets, we inherently have that H ≈ H_0. Another example where the concept H ≈ H_0 inherently holds is deep equilibrium models (DEQ) (Bai et al., 2019). These practical DNN frameworks provide the rationale for analyzing our model. Furthermore, our theoretical results are aligned also with the empirical behavior of DNN architectures beyond the aforementioned examples (e.g., the other version of ResNet (He et al., 2016a) and plain MLPs).

4. Decrease in Within-Class Variability

As discussed above, while the features matrix H represents the output of a DNN's penultimate layer, the input matrix H_0 can be interpreted as the features of a preceding layer. Tirer & Bruna (2022) and follow-up works (Galanti et al., 2022; He & Su, 2022) have empirically presented settings where the within-class variability of the features, measured by some "NC1 metric", decreases across depth. The goal of this section is to prove such a phenomenon for the model stated in (2).
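For reference, objective (2) differs from (1) only by the proximity term; a minimal sketch (ours), mirroring the UFM objective above:

```python
import numpy as np

def perturbed_ufm_objective(W, H, Y, H0, lam_W, lam_H, beta):
    """Objective f(W, H; H0) of Eq. (2): Eq. (1) plus (beta/2Kn)*||H - H0||_F^2,
    which keeps the optimized features H in the vicinity of the input features H0
    (large beta, i.e., a "simple" subnetwork from H0 to H, means a tighter vicinity)."""
    K, Kn = Y.shape
    n = Kn // K
    return (np.linalg.norm(W @ H - Y, "fro") ** 2 / (2 * K * n)
            + lam_W * np.linalg.norm(W, "fro") ** 2 / (2 * K)
            + lam_H * np.linalg.norm(H, "fro") ** 2 / (2 * K * n)
            + beta * np.linalg.norm(H - H0, "fro") ** 2 / (2 * K * n))
```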
The theory that we provide also shows a monotonic decrease of the within-class variability (until exact collapse) along gradient flow on the central-path of the UFM stated in (1).

Let us begin with several definitions that will be used in this section. For a given set of n features for each of K classes, {h_{k,i}}, we define the per-class and global means as h̄_k := (1/n) Σ_{i=1}^n h_{k,i} and h_G := (1/Kn) Σ_{k=1}^K Σ_{i=1}^n h_{k,i}, respectively, as well as the mean features matrix H̄ := [h̄_1, . . . , h̄_K]. Next, we define the within-class and between-class d × d covariance matrices

  Σ_W(H) := (1/Kn) Σ_{k=1}^K Σ_{i=1}^n (h_{k,i} − h̄_k)(h_{k,i} − h̄_k)^⊤,
  Σ_B(H) := (1/K) Σ_{k=1}^K (h̄_k − h_G)(h̄_k − h_G)^⊤.

The within-class variability collapse (NC1) can be expressed as Σ_W(H) → 0 while Σ_B(H) ↛ 0, where the limit takes place as the training epoch increases, and Σ_B(H) ↛ 0 filters out degenerate cases such as H = 0. Several papers considered in their experiments the metric (1/K) Tr(Σ_W(H) Σ_B^†(H)), where Σ_B^† denotes the pseudo-inverse of Σ_B (Papyan et al., 2020; Han et al., 2022; Zhu et al., 2021). Yet, we believe that considering the metric

  g_NC1(H) := Tr(Σ_W(H)) / Tr(Σ_B(H))   (3)

is more amenable to theoretical analysis while capturing the desired nondegenerate collapse behavior.^5 Indeed, the trace of a covariance matrix equals zero if and only if the covariance matrix is a zero matrix (this follows from Cov^2(X, Y) ≤ Var(X) Var(Y)).

^5 The metric (1/K) Tr(Σ_W Σ_B^†) was considered in (Han et al., 2022). Yet, to state a result on this metric the authors claim (in the proof of Cor. 2) that a nonzero eigenvalue of Σ_W^{−1/2} H̄ H̄^⊤ Σ_W^{−1/2} equals the reciprocal of the associated nonzero eigenvalue of Σ_W^{1/2} (H̄ H̄^⊤)^† Σ_W^{1/2}. However, this is not correct in general (due to the inherent rank deficiency of H̄ H̄^⊤). For example, for Σ_W^{1/2} = [2 1; 1 2] and H̄ H̄^⊤ = [1 0; 0 0], we have that the single nonzero eigenvalue of the former is 5/9 while the single nonzero eigenvalue of the latter is 5.

Recall that the minimizer w.r.t. W in (2) (and (1)) has a closed-form expression that is a function of H, which is given by W(H) = Y H^⊤ (H H^⊤ + n λ_W I_d)^{−1}. Thus, the optimization in (2) is equivalent to

  H_{1/β} := argmin_H L(H) + (β/2Kn) ‖H − H_0‖_F^2,

where

  L(H) := (1/2Kn) ‖W(H) H − Y‖_F^2 + (λ_W/2K) ‖W(H)‖_F^2 + (λ_H/2Kn) ‖H‖_F^2.

For large β, the minimizer H_{1/β} can be viewed as a backward/implicit gradient descent update from H_0 with respect to the loss L. This follows from rewriting the first order optimality condition as

  (H_{1/β} − H_0) / (1/β) = −Kn ∇L(H_{1/β}).

Observing that for β → ∞ we have H_{1/β} → H_0 (formally shown in Appendix B), the above equation can be written as

  dH_t/dt |_{t=0} = −Kn ∇L(H_0),

where we think of t as β^{−1}. This naturally gives rise to the gradient flow

  dH_t/dt = −Kn ∇L(H_t),   (4)

associated with the UFM in (1). This means that results on this flow can be translated to results on the minimizer of (2) in the large β regime. Indeed, in Theorem 4.1 below, we show that g_NC1(H) monotonically decreases along this flow, which implies that g_NC1(H_{1/β}) < g_NC1(H_0) for large enough β (see the statement in Corollary 4.2 below).

Note that a flow for an objective that is equivalent to L(H) with λ_W = 0 and λ_H = 0 has been studied in (Han et al., 2022), who called it the "central path". The motivation for studying such an objective, where the optimization variable W is replaced by the optimal W(H), comes from the empirical observation in (Han et al., 2022) that the gap from the central path, measured by |‖WH − Y‖_F^2 − ‖W(H)H − Y‖_F^2|, is rather small (compared to each term) during the optimization process of practical DNNs with MSE loss.
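The quantities defined above are simple to compute; the following NumPy sketch (ours) evaluates Σ_W, Σ_B, the metric g_NC1 in (3), and the closed-form W(H), for a features matrix organized class by class as in the text:

```python
import numpy as np

def nc1_quantities(H, K, n):
    """Sigma_W, Sigma_B and g_NC1 = Tr(Sigma_W)/Tr(Sigma_B) for H of shape (d, K*n),
    whose columns k*n, ..., k*n + n - 1 hold the features of class k."""
    d = H.shape[0]
    Hc = H.reshape(d, K, n)                        # class-indexed view of the columns
    means = Hc.mean(axis=2)                        # per-class means, shape (d, K)
    g_mean = means.mean(axis=1, keepdims=True)     # global mean
    cw = Hc - means[:, :, None]                    # within-class deviations
    Sigma_W = np.einsum("dki,eki->de", cw, cw) / (K * n)
    Sigma_B = (means - g_mean) @ (means - g_mean).T / K
    return Sigma_W, Sigma_B, np.trace(Sigma_W) / np.trace(Sigma_B)

def optimal_W(H, Y, lam_W, n):
    """Closed-form minimizer over W: W(H) = Y H^T (H H^T + n*lam_W*I_d)^(-1)."""
    d = H.shape[0]
    return Y @ H.T @ np.linalg.inv(H @ H.T + n * lam_W * np.eye(d))
```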
We now state our result for gradient flow on the central path (which is proved in Appendix A).

Theorem 4.1. Assume that λ_W > 0, λ_H ≥ 0, and that H_0 has nonzero within-class variability (i.e., Σ_W(H_0) ≠ 0). Then, along the gradient flow stated in (4), we have that:

• g_NC1(H_t) strictly decreases along the flow until it reaches zero.
• t ↦ e^{2λ_H t} Tr(Σ_W(H_t)) decreases along the flow. In particular, when λ_H > 0, Tr(Σ_W(H_t)) decays exponentially.
• t ↦ e^{2λ_H t} Tr(Σ_B(H_t)) strictly increases along the flow.

Remark. Note that our gradient flow analysis has minimal assumptions. Unlike (Han et al., 2022), our flow does not assume a zero global mean (h_G = 0), λ_W = λ_H = 0, or invertibility of Σ_W. And most importantly, it does not include any engineered renormalization and projection of the gradient, contrary to the previous work. Thus, it is more similar to practical gradient descent optimization of DNNs. Our unmodified flow and minimal assumptions require a different, and more general, analysis with quite involved computations.^6

Not only does Theorem 4.1 state a strict monotonic decrease toward 0 in the NC1 metric, it further implies an exponential rate of convergence if λ_H > 0. Indeed, in this case Tr(Σ_W) decays exponentially with a rate faster than e^{−2λ_H t} (as implied by the second bullet point), while Tr(Σ_B) cannot decay exponentially with such a rate (as implied by the third bullet point). Therefore, the NC1 metric Tr(Σ_W)/Tr(Σ_B) decays exponentially. Similarly, Theorem 4.1 also provides a separation between the behavior of Tr(Σ_W) and Tr(Σ_B) along the flow. A strict separation is observed for λ_H = 0: Tr(Σ_W) decreases while Tr(Σ_B) increases.

Oftentimes, gradient flow is used as a proxy for analyzing gradient descent with a small step-size (Elkabetz & Cohen, 2021). Therefore, if we overlook the difference between optimizing the UFM in (1) jointly w.r.t. W and H and restricting the optimization to the central path (W(H), H), then our theory also provides a mathematical reasoning for the experiments in (Tirer & Bruna, 2022) that show a monotonic decrease in within-class variability during gradient descent iterations.

Finally, with our interpretation of t as β^{−1}, the following corollary is a direct consequence of Theorem 4.1 and the continuity of ∇L(H) (see Appendix B for a formal proof).

Corollary 4.2. Assume that H_0 has nonzero within-class variability (i.e., Σ_W(H_0) ≠ 0). Then, there exists some constant C = C(H_0) > 0 such that for β > C we have that g_NC1(H_{1/β}) < g_NC1(H_0).

Recall that in the large β regime we can interpret H as features of the DNN that are deeper than H_0, but such that the architecture between H_0 and H is extremely simple (e.g., they are features of adjacent layers) and thus the distance between them is constrained. Under this interpretation, Corollary 4.2 implies that layer-wise optimization of a DNN, where each time a new layer is added (so that the previous deepest features H_{1/β} are considered as the new H_0), will result in a gradual depthwise decrease of NC1. An extension of the model in (2) that includes multiple levels of optimizable parameters may be able to provide similar reasoning for the gradual depthwise decrease in NC1 that is observed in practical DNN training, where all the layers are optimized simultaneously.

^6 In more detail, all the assumptions in (Han et al., 2022) (including continual renormalization of the gradient) lead to the fact that only the singular values (and not the singular vectors) of an SNR matrix Σ_W^{−1/2}(H) H̄ vary along their flow. However, since we do not make their assumptions, we do not have such a matrix whose singular bases are fixed along the flow, and we need to approach the problem in a more general way.
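Theorem 4.1 concerns the continuous flow, but its prediction is easy to probe numerically by discretizing (4) with small explicit Euler steps (i.e., gradient descent on L with the gradient taken through the closed-form W(H)). The sketch below (ours, PyTorch autograd, toy dimensions) tracks g_NC1 along the iterates; it is a numerical illustration under a small-step assumption, not a substitute for the proof in Appendix A.

```python
import torch

def central_path_loss(H, Y, lam_W, lam_H):
    """L(H): objective (1) with W replaced by the closed-form W(H)."""
    K, Kn = Y.shape
    n, d = Kn // K, H.shape[0]
    W = Y @ H.T @ torch.linalg.inv(H @ H.T + n * lam_W * torch.eye(d))
    return ((W @ H - Y).pow(2).sum() / (2 * K * n)
            + lam_W * W.pow(2).sum() / (2 * K)
            + lam_H * H.pow(2).sum() / (2 * K * n))

def g_nc1(H, K, n):
    """Tr(Sigma_W)/Tr(Sigma_B) for an organized features matrix H of shape (d, K*n)."""
    Hc = H.reshape(H.shape[0], K, n)
    means = Hc.mean(dim=2)
    cw = Hc - means.unsqueeze(2)
    tr_W = cw.pow(2).sum() / (K * n)
    tr_B = (means - means.mean(dim=1, keepdim=True)).pow(2).sum() / K
    return (tr_W / tr_B).item()

K, n, d, lam_W, lam_H = 3, 5, 8, 2.0, 0.1
Y = torch.kron(torch.eye(K), torch.ones(1, n))
H = torch.randn(d, K * n, requires_grad=True)
step = 1e-2                                   # small step: explicit Euler discretization of (4)
for it in range(2001):
    loss = central_path_loss(H, Y, lam_W, lam_H)
    (grad,) = torch.autograd.grad(loss, H)
    with torch.no_grad():
        H -= step * K * n * grad              # dH/dt = -Kn * grad L(H)
    if it % 500 == 0:
        print(it, g_nc1(H.detach(), K, n))    # expected to decrease (Theorem 4.1)
```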
5. Analysis of the Near-Collapse Regime

In this section, we explore the behavior of the minimizers of (2) in the near-collapse regime. As stated in Corollary 3.1, if H_0 = H* is a minimizer of (1), and thus already exactly collapsed, then the minimizer of (2) is also collapsed. This is aligned with the rationale that if we have a DNN that already exhibits collapse at some intermediate layer, we would expect the subsequent layers to maintain this collapse.^7

^7 This is also aligned with empirical observations of gradual depthwise collapse in practical DNNs and with Corollary 4.2 in the limit where H_0 is nearly collapsed.

Essentially, we would like to analyze the minimizer of (2) for H_0 that is not already collapsed. Unfortunately, for general non-collapsed H_0 it is not likely that the minimizer is amenable to explicit analytical characterization. Yet, the fact that for an orthogonally collapsed H_0 = H* we get a unique minimizer (W*, H*) of (2), which is still characterized by Theorem 2.2, gives us a desirable setting for examining the minimizer of (2) obtained for H_0 = H* + δH_0 (with sufficiently small δH_0) by exploiting our knowledge of (W*, H*; H_0 = H*). Analyzing the near-collapse setting will shed light on the way that the deviation from collapse in the input features is transferred to the optimized features, e.g., the amount of interaction within/between classes and the effects of the hyperparameters. Such insights can later be examined empirically beyond the near-collapse regime.

Let us denote by (W̃*, H̃*) the minimizer of f(W, H; H_0). We are interested in studying the dependence of δW := W̃* − W* and δH := H̃* − H* on δH_0 = H_0 − H*, without the requirement of computing (W̃*, H̃*) (which lack analytical expressions). In particular, our focus is on the relation between the features δH and δH_0 (rather than δW and δH_0), both because a minimizer H̃* uniquely implies the associated W̃*, and because important aspects of NC, such as within-class variability decrease (NC1) and inter-class feature structure (NC2), consider the feature mapping rather than the last layer's weights.

We begin with establishing such a result in the following theorem (which is proved in Appendix C) for H_0 that is not necessarily a collapsed features matrix. For the reader's convenience, we present here a simplified version of a more general theorem that is stated and proved in Appendix C. Specifically, the statement here includes only the effect of δH_0 on δH and the large β regime. In what follows, we use vec(·) to denote the column-stack vectorization of a matrix.

Theorem 5.1. Let d ≥ K, and set some H_0 and δH_0. Let (Ŵ*, Ĥ*) be the minimizer of f(W, H; H_0) (with f stated in (2)). Let (W̃*, H̃*) be the minimizer of f(W, H; H̃_0 = H_0 + δH_0). Define δH := H̃* − Ĥ*. Then, for β ≫ max{1, λ_H}, with approximation accuracy of O(β^{−2}, ‖δH_0‖^2), we have that

  vec(δH) ≈ F vec(δH_0),   (5)

  F = I_{dnK} − (λ_H/β) I_{dnK} − (1/β) I_{nK} ⊗ Ŵ*^⊤ Ŵ* + (1/β) Z*,

  Z* := (E* + Ĥ*^⊤ ⊗ Ŵ*^⊤) (Ĥ* Ĥ*^⊤ ⊗ I_K + n λ_W I_{dK})^{−1} (E* + Ĥ*^⊤ ⊗ Ŵ*^⊤)^⊤,

with E* ∈ R^{dnK×Kd} whose ((i−1)K + k)-th column (for i ∈ [d] and k ∈ [K]) is given by

  E*[:, (i−1)K + k] = vec(e_{d,i} e_{K,k}^⊤ (Ŵ* Ĥ* − Y)),

and e_{d,i} is the standard basis vector in R^d with 1 in its i-th entry (a similar definition stands for e_{K,k}).

Observe that, assuming a small approximation error, Theorem 5.1 states the linear operation that transforms δH_0 into δH.
Furthermore, due to the vectorization operation, observe that the linear expression vec(δH) ≈ F vec(δH_0) has the following block-based representation:

  [vec(δH^{(1)}); . . . ; vec(δH^{(K)})] ≈ [F_{1,1} . . . F_{1,K}; . . . ; F_{K,1} . . . F_{K,K}] [vec(δH_0^{(1)}); . . . ; vec(δH_0^{(K)})],   (6)

where δH^{(k)} := δH[:, n(k−1)+1 : nk] ∈ R^{d×n} is the sub-matrix of δH that is composed of the columns associated with the k-th class (and similarly for δH_0). Namely, we have that F ∈ R^{dnK×dnK} is composed of blocks of size dn × dn. The diagonal blocks are the "intra-class blocks". Each of them shows the effect of a perturbation in a certain class in H_0 on the features of the same class in H. The off-diagonal blocks are the "inter-class blocks". Each of them shows the effect of a perturbation in a certain class in H_0 on the features of another class in H.

Recall that for H_0 = H* that is already exactly collapsed, the minimizer of f(·; H_0) is also collapsed, so Ĥ* = H* in the above theorem. Importantly, in this case the matrix in (6) transforms the deviation from exact collapse in the input features into the deviation from exact collapse in the optimized features. Thus, we have that a stronger attenuation behavior of the blocks of F (e.g., small singular values) implies that the minimizer H̃* is closer to exact collapse.

Based on specializing Theorem 5.1 to the near-collapse case, we present in the following theorem (which is proved in Appendix D) an exact analysis of the singular values of the blocks of F. (The notations σ_max(·) and σ_min(·) stand for the largest and smallest singular values of a matrix, respectively.)

Figure 1: The effect of λ_H on the spectrum of F_{k,k}.

Figure 2: Layer-wise training of MLP on CIFAR-10.

Theorem 5.2. Consider the setting of Theorem 5.1, λ_H λ_W < 1 (assumed in Theorem 2.2), d > K, β ≫ max{1, λ_H}, and the representation of (5) that is given in (6). Let H_0 be a collapsed features matrix (a minimizer of (1) for the same λ_H, λ_W as in (2)). Then, for k, k̄ ∈ [K] with k̄ ≠ k, we have that F_{k,k} is full rank, F_{k,k̄} is rank-1, and

  σ_max(F_{k,k}) = 1,
  σ_min(F_{k,k}) = 1 − β^{−1} √(λ_H/λ_W),
  σ_max(F_{k,k̄}) = 2 β^{−1} λ_H (1 − √(λ_H λ_W)).

Remark. In Appendix D we derive expressions for the complete singular value decomposition of F_{k,k} and F_{k,k̄}. Our expressions for the entire spectrum of F_{k,k} reveal its step-wise decreasing shape, as visualized in Figure 1 for β = 100, K = 4, d = 10, n = 10, λ_W = 2 and various values of λ_H. To keep the paper concise, we state in the above theorem only the results for the maximal and minimal singular values of F_{k,k}, but note that, similarly to σ_min(F_{k,k}), almost all singular values decrease as λ_H increases. Even though a small portion ((1 − K/d)/n) of the singular values equal 1 (as shown in our analysis in Appendix D), we can still gain insights on the attenuation profile, since generic perturbations are unlikely to concentrate in such an extremely low-dimensional subspace. In fact, our analysis shows that the singular vectors associated with this subspace do not affect the within-class variability, which further implies that metrics of the NC1 component are attenuated more than those of other components in the near-collapse regime. Interestingly, this theoretical observation is aligned with the empirical observation that, when exploring the NC behavior of practical DNNs, NC1 metrics commonly reach lower values than metrics of other components (e.g., NC2).
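Theorems 5.1 and 5.2 can also be verified numerically at an exactly collapsed minimizer: build F from its definition, slice it into the dn × dn blocks of (6), and compare the extreme singular values with the closed-form expressions above. The following NumPy sketch is our own check with small, arbitrary dimensions:

```python
import numpy as np

d, K, n = 6, 3, 4
lam_W, lam_H, beta = 2.0, 0.1, 100.0
c = np.sqrt(lam_H * lam_W)
rho = (1 - c) * np.sqrt(lam_W / lam_H)

# Exactly collapsed minimizer of (1), as in Theorem 2.2.
R = np.linalg.qr(np.random.randn(d, K))[0]
H_bar = np.sqrt(rho) * R
H = np.kron(H_bar, np.ones((1, n)))
W = np.sqrt(lam_H / lam_W) * H_bar.T
Y = np.kron(np.eye(K), np.ones((1, n)))

# E* of Theorem 5.1: column (i*K + k) is vec(e_i e_k^T (W*H* - Y)) (column-stack vec).
resid = W @ H - Y
E = np.zeros((d * K * n, K * d))
for i in range(d):
    for k in range(K):
        col = np.outer(np.eye(d)[:, i], np.eye(K)[:, k]) @ resid
        E[:, i * K + k] = col.flatten(order="F")

M = E + np.kron(H.T, W.T)
mid_mat = np.kron(H @ H.T, np.eye(K)) + n * lam_W * np.eye(d * K)
Z = M @ np.linalg.inv(mid_mat) @ M.T
F = (np.eye(d * K * n) - (lam_H / beta) * np.eye(d * K * n)
     - np.kron(np.eye(K * n), W.T @ W) / beta + Z / beta)

blk = d * n                                   # block size dn x dn, ordered class by class
F_intra = F[:blk, :blk]                       # F_{1,1}
F_inter = F[:blk, blk:2 * blk]                # F_{1,2}
print(np.linalg.svd(F_intra, compute_uv=False).max(), 1.0)
print(np.linalg.svd(F_intra, compute_uv=False).min(),
      1 - np.sqrt(lam_H / lam_W) / beta)
print(np.linalg.svd(F_inter, compute_uv=False).max(),
      2 * lam_H * (1 - c) / beta)
```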
From Theorem 5.2 we gain the following insights on the minimizer of (2) in the near-collapse and large β regime.

First, observe that not only do exactly collapsed minimizers have orthogonal features for different classes, but also in the near-collapse setting an intra-class block is much more dominant than each inter-class block, as follows from F_{k,k̄} being rank-1 and σ_max(F_{k,k̄}) ≪ σ_min(F_{k,k}). For generic perturbations that do not concentrate in specific low-dimensional subspaces, this implies that also before/near pure collapse, the deviation from collapse in the features of a certain class is mainly due to the deviation from collapse of the input (preceding) features of the same class, and not those of the K − 1 other classes. (See Appendix D.1 for more details, and note that this also implies preservation of per-class near-collapse.)

Second, we see that the feature mapping regularization plays the major role in approaching (near-)collapse behavior. Indeed, increasing λ_H decreases the spectral values of the (more dominant) intra-class blocks {F_{k,k}} (contrary to increasing λ_W). Recall that reducing the singular values of the blocks of F implies reducing the distance of the minimizer H̃* from exact collapse.

Third, our result on the inter-class blocks {F_{k,k̄≠k}} hints that the regularization of the last layer's weights (determined by λ_W > 0) may still have a supportive effect on reaching (near-)collapse behavior, by reducing the component of the deviation from collapse that is due to "crosstalk"/interference of features of different classes (e.g., when some classes are harder to classify than others).

In the sequel, we show that the above observations correlate with the NC behavior in practical settings.

6. Experiments

In this section, we translate the insights that are obtained for the model in (2) to what is observed with practical DNNs and datasets. We evaluate the distance of a DNN's features from exact NC using metrics that have also been used in previous works. Despite defining the metric g_NC1 in (3), here we mainly measure within-class variability using

  NC1 := (1/K) Tr(Σ_W Σ_B^†),

where we use the definitions of Section 4. (We use this metric due to its popularity, even though it is less amenable to theoretical analysis.) We measure the structure of the features using

  NC2 := ‖ H̃^⊤H̃ / ‖H̃^⊤H̃‖_F − (1/√(K−1)) (I_K − (1/K) 1_K 1_K^⊤) ‖_F,

where H̃ := H̄ − h_G 1_K^⊤, and the simplex ETF is normalized to unit Frobenius norm.

The result of Section 4 provides reasoning that justifies a depthwise decrease in within-class variability, which has already been empirically demonstrated for end-to-end training in several papers (Tirer & Bruna, 2022; Galanti et al., 2022; He & Su, 2022) (we present such experiments in Appendix E.2). Here we show this behavior also for layer-wise training, which is better represented by our model.

Figure 3: The effect of modifying the weight decay (WD) on NC metrics for ResNet18 trained on CIFAR-10. Top: MSE loss without bias; Bottom: CE loss with bias. Observe that modifying the WD in the feature mapping increases the deviation from the baseline more than modifying the WD of the last layer.
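For completeness, the two metrics used in this section can be computed as follows. This NumPy sketch is ours; it assumes the standard forms of the metrics as written above (pseudo-inverse-based NC1, and the Frobenius distance between the normalized Gram matrix of the centered class means and the unit-norm simplex ETF for NC2):

```python
import numpy as np

def nc_metrics(H, K, n):
    """NC1 = Tr(Sigma_W Sigma_B^+)/K and the ETF-distance NC2 of Section 6.
    H: (d, K*n) organized features (columns k*n, ..., k*n + n - 1 belong to class k)."""
    d = H.shape[0]
    Hc = H.reshape(d, K, n)
    means = Hc.mean(axis=2)
    g_mean = means.mean(axis=1, keepdims=True)
    cw = Hc - means[:, :, None]
    Sigma_W = np.einsum("dki,eki->de", cw, cw) / (K * n)
    Sigma_B = (means - g_mean) @ (means - g_mean).T / K
    nc1 = np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / K
    H_tilde = means - g_mean                                   # mean-subtracted class means
    gram = H_tilde.T @ H_tilde
    etf = (np.eye(K) - np.ones((K, K)) / K) / np.sqrt(K - 1)   # unit Frobenius norm
    nc2 = np.linalg.norm(gram / np.linalg.norm(gram, "fro") - etf, "fro")
    return nc1, nc2
```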
We consider the CIFAR-10 dataset and train an MLP with 1 to 10 hidden layers and a final classification layer. Each time, we add and train a hidden layer on top of the previous hidden layers, which are kept fixed. Then we compute the NC1 metrics for the deepest features (more experimental details appear in Appendix E.1). Figure 2 demonstrates a decrease in both NC1 and g_NC1 as we add more hidden layers on top of the previous ones, which are kept fixed. Note that our theory justifies such a decrease for all the layers (the features are not required to be near collapse).

Next, we turn to demonstrate the correlation of practical NC behavior with the insight gained in Section 5 that λ_H plays a bigger role than λ_W in approaching NC. Based on the equivalence of L2-regularization with weight decay (WD) in gradient-based methods, we can make the analogy of regularizing H in (2) to WD of the weights of practical DNNs in the feature mapping layers (i.e., excluding the last layer's weights). Importantly, note that this analogy is empirically justified for plain UFMs in (Zhu et al., 2021). Under this analogy, our analysis suggests that, as long as entering the zero training error phase of training is maintained, increasing (resp. decreasing) the WD in the feature mapping layers should decrease (resp. increase) the distance from exact collapse more than increasing (resp. decreasing) the WD in the classification layer. Indeed, we empirically show this behavior below. (More experiments are presented in Appendix E.2.) We note that there exists a work that empirically^8 shows that WD facilitates collapse (Rangamani & Banburski-Fahey, 2022); however, they do not examine the WD in the feature mapping and classification layers separately.

^8 Note that the claim in (Rangamani & Banburski-Fahey, 2022) that an NC solution cannot minimize the unregularized bias-free MSE loss comes from demanding that H̄, without subtracting the global mean, be a simplex ETF rather than an orthogonal frame as shown in Theorem 2.2.

We consider the CIFAR-10 dataset and examine how modifying the regularization hyperparameters affects the NC behavior of the widely used ResNet18 (He et al., 2016a) compared to a baseline setting. Specifically, as a baseline hyperparameter setting, we consider one that is used in previous works (Papyan et al., 2020; Zhu et al., 2021): default PyTorch initialization of the weights, SGD optimizer with LR 0.05 that is divided by 10 every 40 epochs, momentum of 0.9, and WD of 5e-4 for all the network's parameters. The modifications include: 1) doubling the WD only for the last (FC) layer; 2) doubling the WD only for the feature mapping (conv) layers; 3) zeroing the WD for the last layer; and 4) zeroing the WD for the feature mapping layers. Figure 3 presents the NC1 and NC2 metrics of the (deepest) features for: (Top) MSE loss with no bias in the FC layer (similar to the analyzed model); and (Bottom) CE loss with bias in the FC layer. In all the settings, we reach zero training error at approximately epoch 40. The empirical results show that modifying the WD in the feature mapping layers leads to curves with larger deviations from the baseline compared to modifying the last layer's WD, which is aligned with the theory established in Section 5 (i.e., the important role of λ_H in attenuating the dominant intra-class perturbations).
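In practice, the per-layer WD manipulations described above amount to placing the last classification layer and the remaining (feature-mapping) parameters in separate optimizer parameter groups. The following PyTorch sketch is illustrative only (it uses torchvision's ResNet18, whose last layer is the attribute fc; the paper's exact training script is not reproduced here):

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)
fc_params = list(model.fc.parameters())                 # last (classification) layer
fc_ids = {id(p) for p in fc_params}
feat_params = [p for p in model.parameters() if id(p) not in fc_ids]

base_wd = 5e-4
optimizer = torch.optim.SGD(
    [
        {"params": feat_params, "weight_decay": 2 * base_wd},  # e.g., "WDx2" for the feature mapping
        {"params": fc_params, "weight_decay": base_wd},        # last layer kept at the baseline WD
    ],
    lr=0.05, momentum=0.9,
)
```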
Reducing (zeroing) the WD in the feature mapping increases the distance from exact NC (i.e., from 0 value of the metrics), while increasing the WD decreases the gap from exact NC, as the theory predicts. The fact that sometimes (e.g., with CE loss) increasing the WD of the last layer can also decrease the gap from collapse hints that mitigating inter-class interference/correlation of features in practical deep learning settings is more significant for reaching NC than in our analysis that considers a near-collapse regime.9 Yet, both the experiments and the 9In Appendix E.2, we demonstrate the role of λW in mitigating inter-class interference of features, which is identified by our analysis, also empirically with practical DNNs. Perturbation Analysis of Neural Collapse theoretical study show that the regularization of the feature mapping has larger significance in approaching NC. 7. Conclusion The features that are learned by training practical networks on real world datasets typically do not reach exact NC. In this paper, we addressed this issue by studying a model that can force the features to stay in the vicinity of a predefined features matrix. We analyzed it for the small vicinity case and established results that cannot be obtained by the previously studied (idealized) UFMs. We proved reduction in within-class variability of the optimized features compared to the input features (via analyzing gradient flow along the central-path of a UFM with minimal assumptions, unlike existing literature). We also presented an analysis of the model s minimizer in the near-collapse regime that provides insights on the effect of the regularization hyperparameters on the closeness to collapse, which correlate with the behavior in practical deep learning settings. Importantly, note that our perturbation analysis approach, which is based on exploiting our knowledge on exactly collapsed minimizers of UFMs for studying non-collapse cases, can also be applied to models other than the one considered in this paper, such as models with different loss functions and/or multiple levels of features and/or imbalanced data. Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019. Belkin, M., Rakhlin, A., and Tsybakov, A. B. Does data interpolation contradict statistical optimality? In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1611 1619. PMLR, 2019. Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018. Chi, Y., Lu, Y. M., and Chen, Y. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239 5269, 2019. Dang, H., Nguyen, T., Tran, T., Tran, H., and Ho, N. Neural collapse in deep linear network: From balanced to imbalanced data. ar Xiv preprint ar Xiv:2301.00437, 2023. Elkabetz, O. and Cohen, N. Continuous vs. discrete optimization of deep neural networks. Advances in Neural Information Processing Systems, 34:4947 4960, 2021. Ergen, T. and Pilanci, M. Revealing the structure of deep neural networks via convex duality. In International Con- ference on Machine Learning, pp. 3004 3014. PMLR, 2021. Fang, C., He, H., Long, Q., and Su, W. J. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43), 2021. Galanti, T., Gy orgy, A., and Hutter, M. 
On the role of neural collapse in transfer learning. ar Xiv preprint ar Xiv:2112.15121, 2021. Galanti, T., Galanti, L., and Ben-Shaul, I. On the implicit bias towards minimal depth of deep neural networks. ar Xiv preprint ar Xiv:2202.09028, 2022. Graf, F., Hofer, C., Niethammer, M., and Kwitt, R. Dissecting supervised constrastive learning. In International Conference on Machine Learning, pp. 3821 3830. PMLR, 2021. Han, X., Papyan, V., and Donoho, D. L. Neural collapse under mse loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations, 2022. He, H. and Su, W. J. A law of data separation in deep learning. ar Xiv preprint ar Xiv:2210.17020, 2022. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016a. He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630 645. Springer, 2016b. Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems, 30, 2017. Hui, L. and Belkin, M. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In The Ninth International Conference on Learning Representations (ICLR), 2021. Ji, W., Lu, Y., Zhang, Y., Deng, Z., and Su, W. J. An unconstrained layer-peeled perspective on neural collapse. ar Xiv preprint ar Xiv:2110.02796, 2021. Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42(8): 30 37, 2009. Kothapalli, V. Neural collapse: A review on modelling principles and generalization. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. Perturbation Analysis of Neural Collapse Lu, J. and Steinerberger, S. Neural collapse under crossentropy loss. Applied and Computational Harmonic Analysis, 2022. Ma, S., Bassily, R., and Belkin, M. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pp. 3325 3334. PMLR, 2018. Mixon, D. G., Parshall, H., and Pi, J. Neural collapse with unconstrained features. ar Xiv preprint ar Xiv:2011.11619, 2020. Papyan, V., Han, X., and Donoho, D. L. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652 24663, 2020. Rangamani, A. and Banburski-Fahey, A. Neural collapse in deep homogeneous classifiers and the role of weight decay. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4243 4247. IEEE, 2022. Thrampoulidis, C., Kini, G. R., Vakilian, V., and Behnia, T. Imbalance trouble: Revisiting neural-collapse geometry. ar Xiv preprint ar Xiv:2208.05512, 2022. Tirer, T. and Bruna, J. Extended unconstrained features model for exploring deep neural collapse. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 21478 21505. PMLR, 2022. Wojtowytsch, S. et al. On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers. Proceedings of Machine Learning Research, 145:1 21, 2021. Yang, Y., Xie, L., Chen, S., Li, X., Lin, Z., and Tao, D. Do we really need a learnable classifier at the end of deep neural network? 
ar Xiv preprint ar Xiv:2203.09081, 2022. Zhou, J., Li, X., Ding, T., You, C., Qu, Q., and Zhu, Z. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In Proceedings of the 39th International Conference on Machine Learning, volume 162, pp. 27179 27202. PMLR, 2022a. Zhou, J., You, C., Li, X., Liu, K., Liu, S., Qu, Q., and Zhu, Z. Are all losses created equal: A neural collapse perspective. ar Xiv preprint ar Xiv:2210.02192, 2022b. Zhu, Z., Ding, T., Zhou, J., Li, X., You, C., Sulam, J., and Qu, Q. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820 29834, 2021. Perturbation Analysis of Neural Collapse A. Proof of Theorem 4.1 To prove Theorem 4.1, in addition to the within-class and between-class covariance matrices, let us define the total covariance matrix (across all classes) of the non-centered features ΣT (H) := 1 Kn i=1 hk,ih k,i. For convenience we also define the non-centered between-class covariance matrix We have the decomposition ΣT (H) = ΣW (H) + ΣB(H). Using YH = (IK 1 n )H = n H and ΣT = 1 Kn HH , we have that for each features matrix H, the optimal weight matrix W (H) is given by K H ( ΣT + λW Next, let us simplify the terms with W (H) in L(H): L(H) := 1 2Kn W (H)H Y 2 F + λW 2K W (H) 2 F + λH 2Kn H 2 F . For the first term in L(H), observe that 1 2Kn W (H)H Y 2 F = 1 2Kn Tr W (H)HH W (H) 1 Kn Tr W (H)HY + 1 = 1 2K Tr ( ΣT + λW K I) 1 ΣT ( ΣT + λW K Tr ( ΣT + λW 2K2 Tr ( ΣT + λW 1 2K Tr ( ΣT + λW where in the second equality we used ΣB = 1 K HH , and in the last equality we used ( ΣT + λW K I) 1 ΣT = I λW K ( ΣT + λW For the second term in L(H), observe that λW 2K W (H) 2 F = λW 2K Tr W (H)W (H) 2K2 Tr ( ΣT + λW Adding the two terms together, 1 2Kn W (H)H Y 2 F + λW 2K W (H) 2 F = 1 2K Tr ( ΣT + λW = 1 2K Tr ( ΣT + λW K I) 1(ΣW + λW where we used ( ΣT + λW K I) 1 ΣB = I ( ΣT + λW K I) 1(ΣW + λW Finally, for the third term in L(H) we have λH 2Kn H 2 F = λH Perturbation Analysis of Neural Collapse To conclude L(H) = 1 2K Tr (ΣW + λW K I)( ΣT + λW K I) 1 + λH Next, we are going to analyze the traces of dΣB dt , and d ΣT dt , along the flow that is stated in (4), which is repeated here for the convenience of the reader: dt = Kn L(Ht). In the following lemma, we state the required derivatives. Lemma A.1. Denote CB := ΣB( ΣT + λW K I) 1, CB := ΣB( ΣT + λW K I) 1 and CW := ΣW ( ΣT + λW K I) 1. Along the gradient flow we have CB(I CB) + (I C B)C B CW CB + C BC W (I CB CW ) CB + C B(I C B C W ) Proof. We use the notation kjl to denote the derivative w.r.t. the lth entry of hk,j. Then kjlΣB = 1 Kn(el(hk h G) + (hk h G)e l ), kjlΣW = 1 Kn el(hk,j hk) + (hk,j hk)e l , kjl ΣT = 1 Kn(elh k,j + hk,je l ), where el Rd is the one-hot vector whose lth entry is one (i.e., a standard basis vector). By the product rule, kjl L(H) = 1 2K Tr ( kjlΣW )( ΣT + λW K I) 1 + 1 2K Tr = 1 2K Tr ( kjlΣW )( ΣT + λW K I 1 kjl ΣT K I) 1(hk,j hk) K I 1 (ΣW + λW K I 1 hk,j + λHKhk,j Therefore, the gradient of L is given by K I) 1(H H 1 n ) K I 1 ΣW + λW K I 1 H + λHKH Next, we compute how each covariance matrix updates along the flow. Let ΣB(a, b) = e a ΣBeb denote the a, b-th entry of ΣB. 
We further denote C := ( ΣT + λW K I) 1, CB := ΣB( ΣT + λW K I) 1, CB := ΣB( ΣT + λW K I) 1 CW := ΣW ( ΣT + λW K I) 1 and write kjl L(H) = Lkj, el , where Lkj = 1 K2n C(hk,j hk) (I C B)Chk,j + λHKhk,j Perturbation Analysis of Neural Collapse Using the chain rule, we have that k,j,l kjlΣB(a, b)dhk,j[ℓ] k,j,l kjlΣB(a, b)( Kn kjl L(H)) l ea, el eb, hk h G + ea, hk h G el, eb el, Lkjl k,j ea, Lkj eb, hk h G + ea, hk h G Lkj, eb k,j Lk,j(hk h G) (hk h G)L k,j K e T a CB(I CB) + (I C B)C B eb 2λHe a ΣBeb Similar computation yields K e a CW CB + C BC W eb 2λHe a ΣW eb d ΣT (a, b) K e T a (I CB CW ) CB + C B(I C B C W ) eb 2λHe a ΣT eb Let TB : t 7 e2λHt Tr(ΣB) and TW : t 7 e2λHt Tr(ΣW ). The above lemma suggests that TB strictly increases along the flow, while TW decreases. Indeed, dt = e2λHt(d Tr(ΣW ) dt + 2λHTr(ΣW )) K e2λHt Tr(CW CB) K Tr(ΣW ( ΣT + λW K I) 1 ΣB( ΣT + λW The last inequality holds because the trace of the product of two positive semidefinite matrices is always non-negative (e.g. by Von-Neumann s trace inequality). Similarly K e2λHt Tr(CB(I CB)) K e2λHt Tr(ΣB( ΣT + λW K I) 1(I ΣB( ΣT + λW K e2λHt Tr(ΣB( ΣT + λW K I) 1(ΣW + λW K I)( ΣT + λW K e2λHt Tr(ΣB( ΣT + λW K I) 1ΣW ( ΣT + λW K I) 1) + λW K Tr(ΣB( ΣT + λW K2 e2λHt Tr(ΣB( ΣT + λW K I) 2) > 0, where the strict inequality again comes from Von-Neumann trace inequality, which ensures that the trace of product of a positive definite matrix and a non-zero positive semidefinite matrix is positive. Since g NC1 = TW /TB, the above computation also shows that g NC1 has to strictly decrease along the flow. Perturbation Analysis of Neural Collapse B. Proof of Corollary 4.2 Recall that the minimizer H1/β satisfies the first order equation H1/β H0 = Kn β L(H1/β). (8) We first show that H1/β H0 as β . The following lemma would be helpful. Lemma B.1. There exists a constant M > 0 independent of H, such that L(H) F M H F , for any H Rd Kn. Proof. We bound each term in the expression of L equation (7) individually. For the first term we have K I) 1(H H 1 n ) F ( ΣT + λW K I) 1 op (H H 1 n ) F λW (H H 1 n ) F 2K where op denotes the operator norm and the second inequality is due to the fact that each eigenvalue of ( ΣT + λW is no bigger than K λW . Similarly, K I 1 ΣW + λW where in the last inequality we used ( ΣT + λW K I) 1/2 op p K/λW since every eigenvalue of ( ΣT + λW K I) 1/2 is bounded by p K/λW . Denote A = ΣW + λW K I and use A + ΣB = B, we have B 1/2AB 1/2 op = (B 1/2A1/2)(B 1/2A1/2) op = (B 1/2A1/2) (B 1/2A1/2) op = A1/2B 1A1/2 op = (A 1/2(A + ΣB)A 1/2) 1 op = (I + A 1/2 ΣBA 1/2) 1 op 1. Combining the above bounds together, we have obtained for any H Rd Kn, L(H) F 1 Kn Next, we combine the lemma and the stationary equation (8) to get H1/β H0 F n KM β H1/β F n KM β H1/β H0 F + n KM Rearranging, we have the bound H1/β H0 F β n KM 1 1 H0 F . Perturbation Analysis of Neural Collapse This implies that H1/β H0 as β . Combined with the continuity of L( ) and the first order equation (8), this further implies lim β H1/β H0 1/β = Kn L(H0). Now, by chain rule, g NC1(H1/β) g NC1(H0) 1/β = H g NC1(H0), lim β H1/β H0 = H g NC1(H0), Kn L(H0) t=0 g NC1(Ht). In the last line, Ht denotes the gradient flow iterate defined in (4). By (the proof of) Theorem 4.1, when H0 is non-collapsed, t=0 g NC1(Ht) < 0 must hold. This further implies that there exists some constant C = C(H0) > 0 such that for β > C we have that g NC1(H1/β) g NC1(H0) 1/β < 0. Perturbation Analysis of Neural Collapse C. 
Proof of Theorem 5.1 Theorem 5.1, which is stated in the main body of the paper, is a simplified version of Theorem C.1. The notation in the theorem is as follows. We use vec( ) to denote the column-stack vectorization of a matrix. The derivatives are w.r.t. the vectorized matrices vec(H) and vec(W). For example, Hf Rdn K 1 stands for the derivative of f w.r.t. vec(H), and a second derivative w.r.t. vec(W) yields W Hf Rdn K Kd. Theorem C.1. Let d K, and set some H0 and δH0. Let ( ˆ W , ˆH ) be the minimizer of f(W, H; H0) (with f stated in (2)). Let ( W , H ) be the minimizer of f(W, H; H0 = H0 + δH0). Define δW := W ˆ W and δH := H ˆH . Then, with approximation accuracy of O( δH 2, δW 2, δH0 2), we have that vec(δH) F vec(δH0), vec(δW) ( W W f) 1 H W f F vec(δH0), where F = β Kn H Hf W Hf( W W f) 1 H W f 1 and all the derivatives10 of f are evaluated at the point ( ˆ W , ˆH ; H0). In particular, for β max{1, λH}, with approximation accuracy of O(β 2, δH0 2), we have vec(δH) Idn K λH β In K ˆ W ˆ W + 1 β Z vec(δH0), Z := (E + ˆH ˆ W ) ( ˆH ˆH IK + nλW Id K) 1(E + ˆH ˆ W ), with E Rdn K Kd whose ((i 1)K + k)-th column (for i [d] and k [K]) is given by E [:, (i 1)K + k] = vec(ed,ie K,k( ˆ W ˆH Y)), and ed,i is the standard vector in Rd with 1 in its ith entry (similar definition stands for e K,k). C.1. Proof of Theorem C.1 Our proof is essentially a perturbation analysis approach that exploits the fact that each of the minimizers is a stationary point of its associated objective function. Namely, the minimizer of the perturbed problem f(W, H; H0), i.e., ( W , H ), obeys that f( W , H ; H0) = Hf( W , H ; H0) W f( W , H ; H0) = 0, and the minimizer of the unperturbed problem, i.e., (W , H ) where for brevity we omit the ˆ symbol, obeys that f(W , H ; H0) = Hf(W , H ; H0) W f(W , H ; H0) We use these properties in the following first order Taylor approximation of f( W , H ; H0) around (W , H ; H0) (with accuracy of O( δH 2, δW 2, δH0 2)) that is given by Hf( W , H ; H0) W f( W , H ; H0) Hf(W , H ; H0) W f(W , H ; H0) + H Hf(W , H ; H0) W Hf(W , H ; H0) H W f(W , H ; H0) W W f(W , H ; H0) vec(δH) vec(δW) + H0 Hf(W , H ; H0) H0 W f(W , H ; H0) Recall that δH := H H , δW := W W , and δH0 = H0 H0. Since the two terms in the first line of (9) vanish, we get that vec(δH) vec(δW) H Hf W Hf H W f W W f 1 H0 Hf H0 W f vec(δH0), (10) 10The derivatives are stated in the proof. Perturbation Analysis of Neural Collapse where all the derivatives are evaluated at (W , H ; H0), which is omitted in order to simplify the presentation. As shown below, in our setting the matrix that is inverted is indeed nonsingular. We turn now to compute the derivatives. Let us denote h := vec(H), w := vec(W), and y := vec(Y). Observe that from well known identities on the Kronecker product and the vectorization operation we have 1 2Kn WH Y 2 F = 1 2Kn (Ikn W)h y 2 2 = 1 2Kn (H IK)w y 2 2. Therefore, the first order derivatives are given by Hf(W, H; H0) = 1 Kn(Ikn W )((Ikn W)h y) + λH Kn(h vec(H0)), (11) W f(W, H; H0) = 1 Kn(H IK)((H IK)w y) + λW H0 W f = 0Kd dn K. Plugging these expressions in (10) and using blockwise matrix inversion gives vec(δH) β Kn H Hf W Hf( W W f) 1 H W f 1 vec(δH0), Kn( W W f) 1 H W f H Hf W Hf( W W f) 1 H W f 1 vec(δH0), which are stated in the theorem, where all the derivatives are evaluated at the point (W , H ; H0). Let us state the second order derivatives that appear above. 
First, one can observe that H Hf(W , H ; H0) = 1 Kn In K W W + λH Kn Idn K + β W W f(W , H ; H0) = 1 Kn HH IK + λW As for the mixed partial derivative, applying W on (11), we get W Hf = w Hf = 1 Kn w (Ikn W )r + 1 Kn(Ikn W ) w((Ikn W)h y) = 1 Kn w (Ikn W )r + 1 Kn(Ikn W ) w((H IK)w y) = 1 Kn E(W, H) + 1 Kn(Ikn W )(H IK) = 1 Kn E(W, H) + 1 Kn(H W ) where r := vec(WH Y) but treated as independent of w due to the product rule, and E(W, H) := w (Ikn W )r . Denoting wk,i := W[k, i], we have that (Ikn W )r = (Ikn wk,i W )r = (Ikn ed,ie K,k)r = vec(ed,ie K,k(WH Y)), where ed,i is the standard vector in Rd with 1 in its ith entry (similar definition stands for e K,k). Perturbation Analysis of Neural Collapse H W f(W , H ; H0) = 1 Kn E + 1 Kn(H W ), W Hf(W , H ; H0) = 1 Kn E + 1 Kn(H W ), where E = E(W , H ) and E(W, H) Rdn K Kd is given by E(W, H) := vec(ed,1e K,1(WH Y)), ..., vec(ed,1e K,K(WH Y)), vec(ed,2e K,1(WH Y)), ... ..., vec(ed,de K,K(WH Y)) . We focus now on the effect the deviation δH0 = H0 H0 on the feature learning δH = H H . This requires inverting the dn K dn K matrix that links δH and δH0, which is quite challenging. Yet, from the derivatives that are stated above we observe the following vec(δH) β Kn H Hf W Hf( W W f) 1 H W f 1 vec(δH0) = Idn K + λH β Idn K + 1 β In K W W Kn β W Hf( W W f) 1 H W f 1 vec(δH0) = Idn K + λH β Idn K + 1 β In K W W 1 β Z 1 vec(δH0) Z := (E + H W ) (H H IK + nλW Id K) 1(E + H W ). Therefore, under the assumption of β max{1, λH}, which is associated with a restrictive link between H0 and H, we can use the first-order truncated Neumann series to approximate the matrix inversion (with accuracy of O(β 2)), and obtain the expression that us stated in (5): vec(δH) Idn K λH β In K W W + 1 β Z vec(δH0). Lastly, as shown in Section C.2, in the large β regime we have that O( δH ) = O( δH0 ). Therefore, the above approximation accuracy is O(β 2, δH0 2) (namely, we can omit O( δH 2)). Perturbation Analysis of Neural Collapse C.2. More on the Map H0 7 H In this section we show that the map H0 7 H (namely, the map from the input features matrix H0 to the minimizer H of problem (2)) is Lipschitz. In fact, our result shows that this map is nearly nonexpansive in the large β regime (note that since L(H) is not convex, known results of proximal mapping do not hold here). Theorem C.2. Let H be the minimizer of problem (2) for a predefined H0. For β > 11λH and λW λH < 1, the map H0 7 H is (1 11λH β ) 1-Lipschitz. Recall the first order optimality condition β L(H ), (12) where L(H) := 1 2Kn W (H)H Y 2 F + λW 2K W (H) 2 F + λH 2Kn H 2 F . We first show that L is 11λH Kn -Lipschitz. Theorem C.2 then follows immediately by triangular inequality. The following linear algebra lemma will be useful. Lemma C.3. Let A Rn n be a positive definite matrix and B Rn n be a positive semi-definite matrix with B A. ( Here B A means A B is a positive definite matrix.) Then A 1/2BA 1/2 op 1. Proof. If B is positive definite, and thus invertible, we have A 1/2BA 1/2 op = (A 1/2B1/2)(A 1/2B1/2) op = (A 1/2B1/2) (A 1/2B1/2) op = B1/2A 1B1/2 op = (B 1/2(B + A B)B 1/2) 1 op = (I + B 1/2(A B)B 1/2) 1 op 1 If B is positive semi-definite, since B A, for sufficiently small δ > 0, we have B + δI is positive definite, and it still holds that B + δI A. From the previous argument it holds that A 1/2(B + δI)A 1/2 op 1. The result follows by taking δ 0. The following lemma is an immediate application of the previous lemma, and will be useful to bound the Lipschitz norm of L. Lemma C.4. 
For any features matrix H, we have the following bounds:

  ‖(Σ_T + (λ_W/K) I)^{-1/2} Σ_T (Σ_T + (λ_W/K) I)^{-1/2}‖_op ≤ 1,     (13)
  ‖(Σ_T + (λ_W/K) I)^{-1/2} Σ_W (Σ_T + (λ_W/K) I)^{-1/2}‖_op ≤ 1,     (14)
  ‖(Σ_T + (λ_W/K) I)^{-1/2} H‖_op ≤ √(Kn),
  ‖(Σ_T + (λ_W/K) I)^{-1/2} P(H)‖_op ≤ √(Kn),

where P(H) := H − H̄ ⊗ 1_n^⊤ denotes the subtraction of the class means from the features (the operator P is indeed an orthogonal projection).

Proof. The first two inequalities are direct applications of the previous lemma, since Σ_W ⪯ Σ_T ≺ Σ_T + (λ_W/K) I. For the latter two inequalities,

  ‖(Σ_T + (λ_W/K) I)^{-1/2} H‖_op = ‖(Σ_T + (λ_W/K) I)^{-1/2} H H^⊤ (Σ_T + (λ_W/K) I)^{-1/2}‖_op^{1/2}
    ≤ √(Kn) ‖(Σ_T + (λ_W/K) I)^{-1/2} Σ_T (Σ_T + (λ_W/K) I)^{-1/2}‖_op^{1/2} ≤ √(Kn),

  ‖(Σ_T + (λ_W/K) I)^{-1/2} P(H)‖_op = ‖(Σ_T + (λ_W/K) I)^{-1/2} P(H) P(H)^⊤ (Σ_T + (λ_W/K) I)^{-1/2}‖_op^{1/2}
    ≤ √(Kn) ‖(Σ_T + (λ_W/K) I)^{-1/2} Σ_W (Σ_T + (λ_W/K) I)^{-1/2}‖_op^{1/2} ≤ √(Kn).

Proof of Theorem C.2. First, we show that ∇L is (11λ_H/(Kn))-Lipschitz. For any increment ΔH, let H_t = H + tΔH for 0 ≤ t ≤ 1. By the fundamental theorem of calculus, it is enough to show that ‖(d/dt) ∇L(H_t)‖_F ≤ (11λ_H/(Kn)) ‖ΔH‖_F for any 0 ≤ t ≤ 1. In the following, for simplicity, we write A_t = Σ_T + (λ_W/K) I and B_t = Σ_W + (λ_W/K) I (note that although we have omitted the indices, all the covariance matrices depend on t). We also denote P(H) = H − H̄ ⊗ 1_n^⊤, as above. Taking the derivative in (7), we get

  (d/dt) ∇L(H_t) = A_t^{-1} P(ΔH)
    − (1/(Kn)) A_t^{-1} ΔH H_t^⊤ A_t^{-1} P(H_t) − (1/(Kn)) A_t^{-1} H_t (ΔH)^⊤ A_t^{-1} P(H_t)
    − A_t^{-1} B_t A_t^{-1} ΔH
    + (1/(Kn)) A_t^{-1} ΔH H_t^⊤ A_t^{-1} B_t A_t^{-1} H_t + (1/(Kn)) A_t^{-1} H_t (ΔH)^⊤ A_t^{-1} B_t A_t^{-1} H_t
    − (1/(Kn)) A_t^{-1} P(ΔH) P(H_t)^⊤ A_t^{-1} H_t − (1/(Kn)) A_t^{-1} P(H_t) P(ΔH)^⊤ A_t^{-1} H_t
    + (1/(Kn)) A_t^{-1} B_t A_t^{-1} ΔH H_t^⊤ A_t^{-1} H_t + (1/(Kn)) A_t^{-1} B_t A_t^{-1} H_t (ΔH)^⊤ A_t^{-1} H_t,

where we used (d/dt) A_t = (1/(Kn))(ΔH H_t^⊤ + H_t (ΔH)^⊤), (d/dt) B_t = (1/(Kn))(P(ΔH) P(H_t)^⊤ + P(H_t) P(ΔH)^⊤), and (d/dt) A_t^{-1} = −A_t^{-1} ((d/dt) A_t) A_t^{-1}.

Next, we bound each of the terms appearing above individually. For instance,

  ‖A_t^{-1} P(ΔH)‖_F ≤ ‖A_t^{-1}‖_op ‖P(ΔH)‖_F ≤ (K/λ_W) ‖ΔH‖_F,

  (1/(Kn)) ‖A_t^{-1} ΔH H_t^⊤ A_t^{-1} P(H_t)‖_F ≤ (1/(Kn)) ‖A_t^{-1}‖_op ‖ΔH‖_F ‖H_t^⊤ A_t^{-1/2}‖_op ‖A_t^{-1/2} P(H_t)‖_op ≤ (1/(Kn)) (K/λ_W) (Kn) ‖ΔH‖_F,

where we applied Lemma C.4 to bound ‖A_t^{-1/2} P(H_t)‖_op and ‖H_t^⊤ A_t^{-1/2}‖_op (each by √(Kn)). Similarly,

  (1/(Kn)) ‖A_t^{-1} H_t (ΔH)^⊤ A_t^{-1} B_t A_t^{-1} H_t‖_F ≤ (1/(Kn)) ‖A_t^{-1/2}‖_op ‖A_t^{-1/2} H_t‖_op ‖ΔH‖_F ‖A_t^{-1/2}‖_op ‖A_t^{-1/2} B_t A_t^{-1/2}‖_op ‖A_t^{-1/2} H_t‖_op,

where we applied Lemma C.4 again to bound ‖A_t^{-1/2} B_t A_t^{-1/2}‖_op ≤ 1. All the other terms can be bounded in similar ways, by invoking the submultiplicativity of matrix norms and decomposing each term into products that involve ‖ΔH‖_F, ‖A_t^{-1}‖_op, ‖A_t^{-1/2}‖_op, and the quantities bounded in Lemma C.4. Combining all the bounds together, we have

  ‖(d/dt) ∇L(H_t)‖_F ≤ (1/(K²n))(10K λ_W + λ_H K) ‖ΔH‖_F ≤ (11λ_H/(Kn)) ‖ΔH‖_F,

where we used λ_H λ_W ≤ 1 in the last inequality. By the fundamental theorem of calculus, we can now conclude that ∇L is (11λ_H/(Kn))-Lipschitz.

Finally, from the first-order equation (12) we have H0 = H* + (Kn/β) ∇L(H*). Consider another input features matrix H̃0 and denote by H̃* the associated minimizer of (2). The first-order optimality condition gives H̃0 = H̃* + (Kn/β) ∇L(H̃*). By the triangle inequality,

  ‖H0 − H̃0‖_F ≥ ‖H* − H̃*‖_F − (Kn/β) ‖∇L(H*) − ∇L(H̃*)‖_F ≥ (1 − 11λ_H/β) ‖H* − H̃*‖_F.

It follows that the map H0 ↦ H* is (1 − 11λ_H/β)^{-1}-Lipschitz.
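The near-nonexpansiveness of the map H0 ↦ H* in the large-β regime can also be probed numerically. The sketch below is ours (not the paper's code); the dimensions and hyperparameters are arbitrary choices that satisfy the assumptions of Theorem C.2, and it only probes a single toy instance within a single optimization basin.

```python
# Numerical probe of the (near-)nonexpansiveness of H0 -> H* (cf. Theorem C.2),
# by jointly minimizing f(W, H; H0) for two nearby input feature matrices.
import torch

torch.manual_seed(1)
d, K, n = 6, 3, 4
lam_W, lam_H, beta = 0.05, 0.05, 50.0   # beta > 11*lam_H and lam_W*lam_H < 1
Y = torch.kron(torch.eye(K), torch.ones(1, n))

def f(W, H, H0):
    return (0.5 / (K * n) * ((W @ H - Y) ** 2).sum()
            + 0.5 * lam_W / K * (W ** 2).sum()
            + 0.5 * lam_H / (K * n) * (H ** 2).sum()
            + 0.5 * beta / (K * n) * ((H - H0) ** 2).sum())

def solve_H_star(H0, W_init, H_init):
    W = W_init.clone().requires_grad_(True)
    H = H_init.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([W, H], max_iter=50,
                            tolerance_grad=1e-12, tolerance_change=1e-12)
    def closure():
        opt.zero_grad()
        loss = f(W, H, H0)
        loss.backward()
        return loss
    for _ in range(100):
        opt.step(closure)
    return W.detach(), H.detach()

H0_a = torch.randn(d, K * n)
H0_b = H0_a + 0.05 * torch.randn(d, K * n)       # a nearby input features matrix

W_a, H_a = solve_H_star(H0_a, torch.randn(K, d), H0_a)
W_b, H_b = solve_H_star(H0_b, W_a, H_a)           # warm start in the same basin

ratio = torch.norm(H_a - H_b) / torch.norm(H0_a - H0_b)
print("||H*_a - H*_b|| / ||H0_a - H0_b|| =", ratio.item())
print("Theorem C.2 bound (1 - 11*lam_H/beta)^-1 =", 1.0 / (1.0 - 11 * lam_H / beta))
```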
D. Proof of Theorem 5.2

In this section we compute the entire spectrum (singular values) of the diagonal blocks ("intra-class blocks") and the off-diagonal blocks ("inter-class blocks") of the block matrix in (6). To keep the main body of the paper concise, we present in the statement of Theorem 5.2 only the results for σ_max(F_{k,k}) and σ_min(F_{k,k}) of the full-rank matrix F_{k,k}, as well as σ_max(F_{k,k′}) of the rank-1 matrix F_{k,k′} (k′ ≠ k). Recall that we consider the (non-degenerate) setting c := λ_H λ_W < 1.

Therefore, when H0 = H̄ is a minimizer of (1) (associated with W̄), from Corollary 3.1 we have that (W*, H*), the minimizer of f(W, H; H0), is also orthogonally collapsed and is characterized by Theorem 2.2 with λ_W and λ_H (independently of K, n, and d). That is, H* = H̄* ⊗ 1_n^⊤ (exact within-class collapse) with W*^⊤ ∝ H̄* and H̄*^⊤ H̄* ∝ W* W*^⊤ ∝ I_K. We denote by σ_H and σ_W the scales of H* and W* in this characterization (made explicit below); their closed-form expressions do not depend on K, n, and d, and they satisfy σ_H² σ_W² = (1 − λ_H λ_W)² = (1 − c)² < 1.

We remind the reader that vec(δH) ≈ F vec(δH0), with

  F = I_{dnK} − (λ_H/β) I_{dnK} − (1/β)(I_{nK} ⊗ W*^⊤ W*) + (1/β) Z,
  Z := (E + H*^⊤ ⊗ W*^⊤)(H* H*^⊤ ⊗ I_K + n λ_W I_{dK})^{-1}(E + H*^⊤ ⊗ W*^⊤)^⊤,

where E = E(W*, H*) and E(W, H) ∈ R^{dnK×Kd} is defined as

  E(W, H) := [ vec(e_{d,1} e_{K,1}^⊤ (WH − Y)), ..., vec(e_{d,1} e_{K,K}^⊤ (WH − Y)), vec(e_{d,2} e_{K,1}^⊤ (WH − Y)), ..., vec(e_{d,d} e_{K,K}^⊤ (WH − Y)) ],

and e_{d,i} is the standard basis vector in R^d with 1 in its i-th entry (a similar definition holds for e_{K,k}).

For the collapsed minimizer (W*, H*), we know that H* = σ_H (R ⊗ 1_n^⊤) and W* = σ_W R^⊤ for some (partial) orthonormal matrix R ∈ R^{d×K} (i.e., R^⊤ R = I_K). Therefore, we have that W* H* − Y = −c (I_K ⊗ 1_n^⊤) and that H* ⊗ W* = (1 − c)(R ⊗ 1_n^⊤ ⊗ R^⊤). Observe that the alignment of the former expression with the latter (where the locations of the dimensions d and K are swapped) is done using the matrices {e_{d,i} e_{K,k}^⊤}. Indeed, we can write

  E^⊤ = −c K_{d,K} (I_K ⊗ 1_n^⊤ ⊗ I_d),

where K_{d,K} ∈ R^{Kd×dK} is the permutation matrix that satisfies K_{d,K}^⊤ (X_1 ⊗ X_2) K_{d,K} = X_2 ⊗ X_1 for any X_1 ∈ R^{d×d} and X_2 ∈ R^{K×K}. Such a matrix K_{d,K} is also known as a commutation matrix in the matrix-theory literature. Another useful property of the commutation matrix, which we will frequently use, is that

  K_{d,K}(x ⊗ Y) = Y ⊗ x     (17)

for any x ∈ R^{K×1} and Y ∈ R^{d×m}.

Let us extract the (k, k′)-th block Z_{k,k′} ∈ R^{dn×dn} of Z. First, observe that

  Z = (E + H*^⊤ ⊗ W*^⊤)(H* H*^⊤ ⊗ I_K + n λ_W I_{dK})^{-1}(E + H*^⊤ ⊗ W*^⊤)^⊤ = (1/n) B^⊤ (A ⊗ I_K) B,
  A := (σ_H² R R^⊤ + λ_W I_d)^{-1},
  B := −c K_{d,K}(I_K ⊗ 1_n^⊤ ⊗ I_d) + (1 − c)(R ⊗ 1_n^⊤ ⊗ R^⊤).

Denote by {e_k} the standard basis vectors in R^K. To extract the (k, k′)-th block of Z, we compute

  Z_{k,k′} = (e_k ⊗ I_{dn})^⊤ Z (e_{k′} ⊗ I_{dn}) = (1/n) (B(e_k ⊗ I_{dn}))^⊤ (A ⊗ I_K) (B(e_{k′} ⊗ I_{dn})),
  B(e_k ⊗ I_{dn}) = −c K_{d,K}(e_k ⊗ (1_n^⊤ ⊗ I_d)) + (1 − c)(r_k ⊗ 1_n^⊤ ⊗ R^⊤) = −c ((1_n^⊤ ⊗ I_d) ⊗ e_k) + (1 − c)(r_k ⊗ 1_n^⊤ ⊗ R^⊤),

where in the last equality we used property (17) to swap the Kronecker product. Then,

  Z_{k,k′} = (1/n) ( −c((1_n^⊤ ⊗ I_d) ⊗ e_k) + (1 − c)(r_k ⊗ 1_n^⊤ ⊗ R^⊤) )^⊤ (A ⊗ I_K) ( −c((1_n^⊤ ⊗ I_d) ⊗ e_{k′}) + (1 − c)(r_{k′} ⊗ 1_n^⊤ ⊗ R^⊤) )
    = c² (e_k^⊤ e_{k′}) (1/n)(1_n 1_n^⊤) ⊗ A + (1 − c)² (r_k^⊤ A r_{k′}) (1/n)(1_n 1_n^⊤) ⊗ R R^⊤ − c(1 − c) (1/n)(1_n 1_n^⊤) ⊗ (A r_{k′} r_k^⊤ + r_{k′} r_k^⊤ A).

Let us write R = [r_1, r_2, ..., r_K] ∈ R^{d×K} and let r_{K+1}, ..., r_d be orthonormal vectors such that {r_i}_{i=1}^{d} forms an orthonormal basis of R^d. We know that

  A = (σ_H² R R^⊤ + λ_W I_d)^{-1} = Σ_{i=1}^{K} (1/(σ_H² + λ_W)) r_i r_i^⊤ + Σ_{j=K+1}^{d} (1/λ_W) r_j r_j^⊤.

Hence,

  r_k^⊤ A r_{k′} = δ_{k,k′}/(σ_H² + λ_W),
  A r_{k′} r_k^⊤ = (1/(σ_H² + λ_W)) r_{k′} r_k^⊤  and  r_{k′} r_k^⊤ A = (1/(σ_H² + λ_W)) r_{k′} r_k^⊤.

We can thus conclude that

  Z_{k,k′} = (1/n)(1_n 1_n^⊤) ⊗ [ δ_{k,k′} c² A + δ_{k,k′} ((1 − c)²/(σ_H² + λ_W)) R R^⊤ − (2c(1 − c)/(σ_H² + λ_W)) r_{k′} r_k^⊤ ].     (18)

When k′ ≠ k, the off-diagonal block of Z is given by

  Z_{k,k′} = −(2c(1 − c)/(σ_H² + λ_W)) (1/n)(1_n 1_n^⊤) ⊗ r_{k′} r_k^⊤,

which is a rank-1 matrix. Since the other matrices in F do not contribute to the inter-class blocks, we know that F_{k,k′} = (1/β) Z_{k,k′}. It is well known that the eigenvalues of a Kronecker product of two matrices are given by the products of their eigenvalues, and (1/n)(1_n 1_n^⊤) has exactly one nonzero eigenvalue, which equals 1. This implies that

  σ_max(F_{k,k′}) = 2c(1 − c)/(β(σ_H² + λ_W)) = 2λ_H λ_W (1 − λ_H λ_W)/(β(σ_H² + λ_W)).
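The commutation-matrix identities used in the block extraction above are easy to confirm numerically. The short sketch below is ours; it constructs K_{d,K} explicitly (for arbitrary illustrative sizes) and checks both stated properties on random matrices.

```python
# Explicit construction of the commutation matrix K_{d,K} and a check of the two
# properties used above: K^T (X1 kron X2) K = X2 kron X1 and K (x kron Y) = Y kron x.
import numpy as np

rng = np.random.default_rng(0)
d, K, m = 4, 3, 5

# K_{d,K} is the permutation matrix satisfying K_{d,K} vec(A) = vec(A^T)
# for A of size d x K (column-stack vectorization).
K_dK = np.zeros((d * K, d * K))
for i in range(d):        # row index of A
    for k in range(K):    # column index of A
        # vec(A) places A[i, k] at position k*d + i; vec(A^T) places it at i*K + k.
        K_dK[i * K + k, k * d + i] = 1.0

A = rng.standard_normal((d, K))
assert np.allclose(K_dK @ A.flatten(order="F"), A.T.flatten(order="F"))

# Property used for the blocks: K^T (X1 kron X2) K = X2 kron X1.
X1 = rng.standard_normal((d, d))
X2 = rng.standard_normal((K, K))
assert np.allclose(K_dK.T @ np.kron(X1, X2) @ K_dK, np.kron(X2, X1))

# Property (17): K_{d,K} (x kron Y) = Y kron x for x in R^K and Y in R^{d x m}.
x = rng.standard_normal((K, 1))
Y = rng.standard_normal((d, m))
assert np.allclose(K_dK @ np.kron(x, Y), np.kron(Y, x))
```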
Next, let us compute the intra-class block. Setting k′ = k in equation (18), we get

  Z_{k,k} = Σ_{i=1}^{d} μ_i (1/n)(1_n 1_n^⊤) ⊗ (r_i r_i^⊤),

with

  μ_i = c²/(σ_H² + λ_W) + (1 − c)²/(σ_H² + λ_W) − 2c(1 − c)/(σ_H² + λ_W) = (2c − 1)²/(σ_H² + λ_W),   for i = k,
  μ_i = (c² + (1 − c)²)/(σ_H² + λ_W),   for 1 ≤ i ≤ K and i ≠ k,
  μ_i = c²/λ_W = λ_H² λ_W,   for K < i ≤ d.

The intra-class block is therefore given by

  F_{k,k} = (1 − λ_H/β) I_{nd} − (σ_W²/β) I_n ⊗ (R R^⊤) + (1/β) Σ_{i=1}^{d} μ_i (1/n)(1_n 1_n^⊤) ⊗ (r_i r_i^⊤)
    = (1 − λ_H/β) Σ_{i=1}^{d} I_n ⊗ (r_i r_i^⊤) − (σ_W²/β) Σ_{i=1}^{K} I_n ⊗ (r_i r_i^⊤) + (1/β) Σ_{i=1}^{d} μ_i (1/n)(1_n 1_n^⊤) ⊗ (r_i r_i^⊤)
    = Σ_{i=1}^{d} λ_i I_n ⊗ (r_i r_i^⊤) + (1/β) Σ_{i=1}^{d} μ_i (1/n)(1_n 1_n^⊤) ⊗ (r_i r_i^⊤),

where

  λ_i = 1 − (λ_H + σ_W²)/β,   for 1 ≤ i ≤ K,     (19)
  λ_i = 1 − λ_H/β,            for K < i ≤ d.     (20)

Let s_1 = (1/√n) 1_n and let {s_j}_{j=1}^{n} be an orthonormal basis of R^n. Then, we can further write

  F_{k,k} = Σ_{i=1}^{d} Σ_{j=1}^{n} λ_i (s_j s_j^⊤) ⊗ (r_i r_i^⊤) + (1/β) Σ_{i=1}^{d} μ_i (s_1 s_1^⊤) ⊗ (r_i r_i^⊤)
    = Σ_{i=1}^{d} Σ_{j=1}^{n} λ_i (s_j ⊗ r_i)(s_j ⊗ r_i)^⊤ + (1/β) Σ_{i=1}^{d} μ_i (s_1 ⊗ r_i)(s_1 ⊗ r_i)^⊤.     (21)

One can easily verify that {s_j ⊗ r_i}_{1≤j≤n, 1≤i≤d} is an orthonormal basis of R^{nd}. So, (21) gives us the eigendecomposition of F_{k,k}. The spectral norm of F_{k,k} is therefore given by

  σ_max(F_{k,k}) = max_{1≤i≤d} max{ |λ_i|, |λ_i + μ_i/β| }.

As we consider the large-β regime, the expressions in both (19) and (20) are positive. Observe that for K < i ≤ d (associated with the over-parameterization of the model), the eigenvalue associated with the eigenvector s_1 ⊗ r_i is given by

  λ_i + μ_i/β = 1 − λ_H/β + λ_H² λ_W/β = 1 − λ_H(1 − λ_H λ_W)/β.

Note, though, that due to the Kronecker product with s_1 = (1/√n) 1_n, a perturbation in the direction of this eigenvector does not affect the variability within the k-th class at all. Furthermore, generic/practical perturbations are likely to correlate with, or have their power spectrum spread over, many components of the dn-dimensional eigenbasis of F_{k,k}, and not to concentrate in an extremely low, (d − K)-dimensional, subspace (composed only of the vectors s_1 ⊗ r_i with K < i ≤ d). Thus, we expect these eigenvectors to have a small correlation with generic perturbations.

Showing that σ_max(F_{k,k}) = 1 − λ_H(1 − λ_H λ_W)/β now reduces to eliminating the option of eigenvalues larger than this value for 1 ≤ i ≤ K. This is equivalent to having −1 + (2c − 1)² < 0 and −1 + (c² + (1 − c)²) < 0, and both are ensured under our assumption c := λ_H λ_W < 1 (the non-degenerate case of the model). Finally, observing that (19) is smaller than (20), and that the second term in (21) does not include the eigenvectors s_j ⊗ r_i for j > 1, we conclude that

  σ_min(F_{k,k}) = 1 − (λ_H + σ_W²)/β.
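The eigendecomposition claim underlying (21), namely that {s_j ⊗ r_i} diagonalizes any matrix of this Kronecker-structured form, can be confirmed numerically. The following sketch is ours; it uses arbitrary coefficients and random orthonormal bases, so it illustrates only the algebraic structure, not the specific values derived above.

```python
# Check that matrices of the form in (21),
#   M = sum_i lam_i (I_n kron r_i r_i^T) + sum_i mu_i (s_1 s_1^T kron r_i r_i^T),
# are diagonalized by {s_j kron r_i}, with eigenvalue lam_i for j > 1
# and lam_i + mu_i for j = 1.  Coefficients and sizes here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 4

# Random orthonormal basis {r_i} of R^d, and an orthonormal basis {s_j} of R^n
# whose first column spans the direction of s_1 = (1/sqrt(n)) 1_n.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
S0 = np.concatenate([np.ones((n, 1)) / np.sqrt(n),
                     rng.standard_normal((n, n - 1))], axis=1)
S, _ = np.linalg.qr(S0)

lam = rng.standard_normal(d)     # arbitrary lambda_i
mu = rng.standard_normal(d)      # arbitrary mu_i

M = sum(lam[i] * np.kron(np.eye(n), np.outer(R[:, i], R[:, i])) for i in range(d))
M += sum(mu[i] * np.kron(np.outer(S[:, 0], S[:, 0]),
                         np.outer(R[:, i], R[:, i])) for i in range(d))

for i in range(d):
    for j in range(n):
        v = np.kron(S[:, j], R[:, i])
        ev = lam[i] + (mu[i] if j == 0 else 0.0)
        assert np.allclose(M @ v, ev * v)
print("all", d * n, "Kronecker eigenvectors verified")
```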
D.1. Additional Discussion on the Results of the Theorem

Theorem 5.2 has no restricting assumptions on the number of classes K. The only assumption on K, which is common in theoretical NC papers and also holds in practice, is that d > K, i.e., that the dimension of the features is larger than the number of classes. This means that, regardless of the number of classes, the inter-class (off-diagonal) blocks have rank 1, while the intra-class (diagonal) blocks have full rank (recall that each block is of size dn × dn). Considering the conclusions from Theorem 5.2, which are stated in Section 5, if we sum up the maximal contribution of each of the K − 1 inter-class blocks of a certain class, i.e., (K − 1)σ_max(F_{k,k′}), then for guaranteeing that this sum is smaller than the minimal contribution of the intra-class block, i.e., σ_min(F_{k,k}), we may need to assume that β ≳ K. Note that this is a reasonable assumption under our large-β setting. Yet, we believe that the rank difference between the two types of blocks is a more important indicator of the dominance of the intra-class blocks, and this property is independent of the number of classes K. Specifically, since dn > K (all the more so in practice, where we even have n ≫ K), for generic perturbations (that uniformly span the entire dnK-dimensional space) the rank-1 inter-class blocks nullify much of the perturbation, contrary to the intra-class block (which has full rank). This strengthens our conclusion that the deviation from collapse of each class of the minimizer H* is dominated by the deviation from collapse of the same class in H0, rather than by the deviations of the other classes. One thing that should be kept in mind is that we analyze the near-NC regime, so we assume that the system is already not far from exact NC. Reaching this point in general might become harder when the number of classes grows.

Another point that can be raised regarding the results of Theorem 5.2 is that we do not analyze the full matrix F but rather its blocks. In fact, we believe that our analysis, which includes the complete spectral analysis of each block (F_{k,k}, F_{k,k′}) separately, is much more informative than any attempt to analyze the spectrum of the full matrix F, as it clearly distinguishes between the properties of intra- and inter-class blocks and provides insights on the roles of the regularization hyperparameters that are aligned with practical DNN training. In contrast, in the large-β regime F is full rank, which masks the rank-1 property of the inter-class (off-diagonal) blocks. Indeed, in numerical examinations of the configurations used in Figure 1, we observed that the spectrum of the full matrix F resembles a stretched version of the spectrum of the diagonal blocks and completely masks the intriguing properties of the off-diagonal blocks.

E. Additional Experiments and Experimental Details

E.1. Experimental Details for the Layer-Wise Experiment

In this section, we provide the experimental details for the layer-wise training experiment presented in Figure 2 in the main body of the paper. We train an MLP with 10 hidden layers on the CIFAR-10 dataset, where each sample is flattened to a 3072×1 vector. Each hidden layer includes 3072 fully connected neurons with default PyTorch initialization of the weights, batch normalization, and ReLU nonlinearity. We start with one hidden layer and train the MLP with 3 epochs of Adam with mini-batch size of 256, learning rate of 1e-4, and CE loss. Then, we compute NC1 metrics for the deepest features. At this point, the first outer iteration of the procedure is finished. We then fix the parameters of the existing hidden layers, insert a new hidden layer before the final classification layer, and repeat the procedure. Namely, at each outer iteration of the procedure we optimize only the deepest hidden layer, which has just been inserted with default PyTorch initialization of the weights, and the final classification layer, which is initialized with its weights from the previous outer iteration (see the code sketch below).

Let us provide more details on what led to the implementation decisions stated above. We have found that layer-wise training of DNNs (on a practical dataset, e.g., CIFAR-10 that we use here) is significantly harder than end-to-end training in terms of reaching a small training loss value. (Presumably, this is the reason that DNNs are typically trained in an end-to-end fashion.) Careful configuration of the training procedure was required for reaching a considerably low loss (though still not zero training error) and low NC1 metrics, as presented in Figure 2.
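For concreteness, the outer loop of the layer-wise procedure can be sketched as follows. This is an illustrative re-implementation by us (module names, the absence of data augmentation, and other details are our choices), not the exact script used for Figure 2.

```python
# Sketch of the layer-wise training loop described above (CIFAR-10, 10 hidden
# layers, 3 epochs of Adam per outer iteration).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = datasets.CIFAR10("./data", train=True, download=True,
                             transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=0)

width, num_classes, num_outer = 3072, 10, 10
hidden_layers = nn.ModuleList()                 # previously trained (frozen) + current layer
classifier = nn.Linear(width, num_classes).to(device)

def make_hidden():
    return nn.Sequential(nn.Linear(width, width),
                         nn.BatchNorm1d(width),
                         nn.ReLU()).to(device)

criterion = nn.CrossEntropyLoss()
for outer in range(num_outer):
    new_layer = make_hidden()                   # freshly initialized deepest hidden layer
    hidden_layers.append(new_layer)
    for layer in hidden_layers[:-1]:            # freeze the shallower hidden layers
        layer.eval()
        layer.requires_grad_(False)
    # Only the newly inserted layer and the classifier are optimized; the
    # classifier keeps its weights from the previous outer iteration.
    opt = torch.optim.Adam(list(new_layer.parameters()) +
                           list(classifier.parameters()), lr=1e-4)
    for epoch in range(3):
        for x, y in loader:
            x = x.to(device).view(x.size(0), -1)     # flatten to 3072
            y = y.to(device)
            with torch.no_grad():                    # frozen shallower layers
                for layer in hidden_layers[:-1]:
                    x = layer(x)
            feats = new_layer(x)                     # deepest features
            loss = criterion(classifier(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Here one would compute the NC1 metrics of the deepest features over the
    # full training set (omitted here; see the NC1 sketch later in this appendix).
    print(f"outer iteration {outer + 1} done, last mini-batch loss {loss.item():.4f}")
```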
From our efforts in layer-wise training of the 10-layer MLP we observed the following: the Adam optimizer worked better than SGD (which is harder to tune); layer-wise minimization with the CE loss (rather than the MSE loss) led to lower NC1 metrics; and using no more than 3 epochs per outer iteration allowed reaching lower values of the loss and of the NC1 metrics at the deeper layers. Regarding the latter observation (i.e., that more epochs per outer iteration lead to worse optimization results): when there are only one or two hidden layers, the decreases in the loss and in the NC1 metrics are larger when more epochs are used. However, when more hidden layers are then added, the optimization appears to get stuck at local minima with higher loss and NC1 metrics compared to what we get with only 3 epochs per outer iteration. As far as we understand, this behavior follows from the (extreme) nonconvexity of the problem. Importantly, note that even though we prove the depthwise decrease in within-class variability for the MSE loss (to allow rigorous mathematical analysis), it is beneficial to demonstrate alignment with the behavior for the CE loss, as this means that we are not revealing peculiar features of MSE that do not appear in other settings.

E.2. More Experiments on the Effect of the Regularization Hyperparameters

In this section, we present more experiments that examine how modifying the regularization hyperparameters affects the NC behavior of a practical DNN, ResNet18 (He et al., 2016a), compared to a baseline setting. Specifically, as a baseline hyperparameter setting, we consider the one used in previous works (Papyan et al., 2020; Zhu et al., 2021): default PyTorch initialization of the weights, SGD optimizer with mini-batch size of 256, learning rate of 0.05 that is divided by 10 every 40 epochs, momentum of 0.9, and weight decay (L2 regularization) of 5e-4 for all the network's parameters.

The first set of experiments is similar to the experiments in Section 6. These experiments support the insight gained in Section 5 that λ_H (the regularization of the feature mapping) plays a bigger role than λ_W (the regularization of the classification layer) in approaching NC. We compare the NC1 and NC2 metrics (defined in Section 6) of the baseline setting and of the following modified settings: 1) doubling the weight decay only for the last (FC) layer; 2) doubling the weight decay only for the feature mapping (conv) layers; 3) zeroing the weight decay for the last layer; and 4) zeroing the weight decay for the feature mapping layers.

In Figure 4 we consider the MNIST dataset with 3K training samples per class. Figure 4a presents the NC1 and NC2 metrics of the deepest features for the MSE loss and no bias in the FC layer. Figures 4b and 4c present the NC1 and NC2 metrics of the deepest and intermediate features (the output of the 3rd of the 4 residual blocks), respectively, for the CE loss with bias in the FC layer. In all the settings, we reach zero training error at approximately the 40th epoch. In Figure 5 we repeat the experiments with 5K training samples per class. Furthermore, repeating the experiments with 3 different random seeds for initializing the DNN's parameters yields similar curves that demonstrate the same trends. In Table 1 we report the mean and the standard deviation (SD) of the NC metrics computed for the deepest features at the 100th epoch (which is already after the NC metrics reach plateaus), for both the CIFAR-10 and the MNIST datasets.
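Before turning to the results, note that the per-layer weight-decay modifications above amount to passing separate parameter groups to the optimizer. The snippet below is a sketch by us (not the paper's training code) using torchvision's ResNet18, whose final classification layer is the module named `fc`; the scaling knobs `wd_scale_last` and `wd_scale_features` are hypothetical names for the four modified settings.

```python
# Assigning different weight decay (WD) to the last (FC) layer and to the
# feature-mapping layers of ResNet18, as in the modified settings above.
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)

wd_base = 5e-4
wd_scale_last = 2.0       # e.g., 2.0 for "WDx2 for W", 0.0 for "WD=0 for W"
wd_scale_features = 1.0   # e.g., 2.0 for "WDx2 for H", 0.0 for "WD=0 for H"

last_params = list(model.fc.parameters())
last_ids = {id(p) for p in last_params}
feature_params = [p for p in model.parameters() if id(p) not in last_ids]

optimizer = torch.optim.SGD(
    [{"params": feature_params, "weight_decay": wd_base * wd_scale_features},
     {"params": last_params, "weight_decay": wd_base * wd_scale_last}],
    lr=0.05, momentum=0.9)

# Learning rate divided by 10 every 40 epochs, as in the baseline setting.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)
```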
Similar to previous works ((Tirer & Bruna, 2022) and follow-ups), from comparing Figures 4b and 4c (as well as Figures 5b and 5c) we see that the NC distance metrics are larger for the intermediate features, which correlates with the results for our model in Section 4.

Table 1: The effect of modifying the weight decay (WD) on NC metrics for ResNet18 trained on the CIFAR-10 and MNIST datasets; mean and SD are computed over 3 random seeds. Observe that modifying the WD of the feature mapping increases the deviation from the baseline more than modifying the WD of the last layer.

              | CIFAR-10, MSE loss        | CIFAR-10, CE loss         | MNIST, MSE loss           | MNIST, CE loss
              | NC1          NC2          | NC1          NC2          | NC1          NC2          | NC1          NC2
  Baseline    | 0.0061±4e-4  0.111±1e-2   | 0.062±5e-3   0.173±1e-2   | 8e-4±5e-5    0.072±1e-2   | 0.004±3e-4   0.115±5e-3
  WDx2 for W  | 0.0055±3e-4  0.101±8e-3   | 0.040±2e-3   0.161±7e-3   | 5e-4±5e-5    0.055±1e-2   | 0.003±1e-4   0.102±3e-3
  WDx2 for H  | 0.0022±8e-5  0.070±6e-3   | 0.024±4e-3   0.131±9e-3   | 4e-4±2e-5    0.048±5e-3   | 0.002±7e-5   0.101±3e-3
  WD=0 for W  | 0.0048±2e-4  0.101±6e-3   | 0.104±8e-3   0.195±9e-3   | 1.7e-3±1e-4  0.108±2e-2   | 0.009±4e-4   0.147±5e-3
  WD=0 for H  | 0.0280±3e-3  0.226±6e-3   | 0.174±7e-3   0.331±1e-2   | 41e-3±2e-3   0.303±2e-2   | 0.031±4e-4   0.198±8e-3

Examining all the settings of Figures 4 and 5, as well as Table 1, the experiments show the important role of the regularization of the feature mapping layers in approaching NC. Namely, modifying the regularization of the feature mapping layers leads to curves with larger deviations from the baseline compared to modifying the last layer's regularization. This is aligned with the theory established in Section 5, which links increasing λ_H to reducing the dominant component of the distance from collapse of a class, namely, the deviation from collapse of its own features in preceding layers.

The second set of experiments shows the role of λ_W in mitigating the interference between the features of different classes (such interference can hinder approaching NC). To visualize this behavior we use a per-class NC1 metric, defined as

  NC1^(k) := (1/(Kn)) Tr( Σ_{i=1}^{n} (h_{k,i} − h̄_k)(h_{k,i} − h̄_k)^⊤ Σ_B^† ).

Note that the NC1 metric, which is defined in Section 6, can then be written as

  NC1 = (1/(Kn)) Tr( Σ_{k=1}^{K} Σ_{i=1}^{n} (h_{k,i} − h̄_k)(h_{k,i} − h̄_k)^⊤ Σ_B^† ) = Σ_{k=1}^{K} NC1^(k).

We also use the following metric to measure the alignment of the mean features and the last layer's weights:

  NC3 := ‖ W(H̄ − h̄_G 1_K^⊤)/‖W(H̄ − h̄_G 1_K^⊤)‖_F − (1/√(K−1))(I_K − (1/K) 1_K 1_K^⊤) ‖_F,

where the simplex ETF is normalized to unit Frobenius norm.

In Figure 6a we present the NC metrics of the deepest features for the baseline training scheme on the MNIST dataset with 3K samples per class. The other panels in Figure 6 show the NC metrics for a modified training set, where the samples of classes (digits) 4 and 9 are degraded by a uniform blur (blur kernel of size 9×9) that hardens the distinction between them. Each panel corresponds to a different value of weight decay for the last layer's parameters. Yet, in all of the settings we reached zero training error at approximately the 40th epoch. The empirical results show that a large λ_W facilitates reaching reduced NC metrics (closeness to the NC structure) by reducing the effect ("interference") of the features of the degraded samples on the features of the other classes. This is aligned with the theory established for our model in Section 5.
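To make the metric definitions concrete, the following sketch (ours) computes the per-class NC1 values and their sum from a feature array, using the trace-with-pseudo-inverse form written above. The normalization constant follows our displayed formula and should be adapted if the exact definition in Section 6 uses a different constant.

```python
# Per-class NC1 and total NC1 from features, following the displayed formulas:
#   NC1^(k) = (1/(K*n)) * Tr( sum_i (h_ki - hbar_k)(h_ki - hbar_k)^T  Sigma_B^+ ),
#   NC1     = sum_k NC1^(k).
# `features` has shape (K, n, d): n samples per each of K classes.
import numpy as np

def nc1_metrics(features):
    K, n, d = features.shape
    class_means = features.mean(axis=1)                 # (K, d)
    global_mean = class_means.mean(axis=0)              # (d,)

    # Between-class covariance and its Moore-Penrose pseudo-inverse.
    centered_means = class_means - global_mean
    Sigma_B = centered_means.T @ centered_means / K     # (d, d)
    Sigma_B_pinv = np.linalg.pinv(Sigma_B)

    per_class = []
    for k in range(K):
        dev = features[k] - class_means[k]              # (n, d) within-class deviations
        scatter_k = dev.T @ dev                         # un-normalized class scatter
        per_class.append(np.trace(scatter_k @ Sigma_B_pinv) / (K * n))
    return np.array(per_class), float(np.sum(per_class))

# Tiny synthetic example: nearly collapsed features give a small NC1.
rng = np.random.default_rng(0)
K, n, d = 4, 50, 16
means = rng.standard_normal((K, d))
features = means[:, None, :] + 0.01 * rng.standard_normal((K, n, d))
per_class_nc1, nc1 = nc1_metrics(features)
print("per-class NC1:", np.round(per_class_nc1, 6), " total NC1:", round(nc1, 6))
```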
[Figure 4: plots omitted in this text version. Three rows of panels, each showing NC metrics vs. epoch (0–100) for the curves Baseline (last/inner features), WDx2 for W, WDx2 for H, WD=0 for W, and WD=0 for H: (a) MSE loss without bias, deepest features; (b) CE loss with bias, deepest features; (c) CE loss with bias, intermediate features.]

Figure 4: The effect of modifying the weight decay (WD) on NC metrics for ResNet18 trained on MNIST with 3K samples per class. Observe that modifying the WD of the feature mapping increases the deviation from the baseline more than modifying the WD of the last layer.

[Figure 5: plots omitted in this text version. Same panel layout and curves as Figure 4: (a) MSE loss without bias, deepest features; (b) CE loss with bias, deepest features; (c) CE loss with bias, intermediate features.]

Figure 5: The effect of modifying the weight decay (WD) on NC metrics for ResNet18 trained on MNIST with 5K samples per class. Observe that modifying the WD of the feature mapping increases the deviation from the baseline more than modifying the WD of the last layer.

[Figure 6, panels (a)–(c): plots omitted in this text version. Each panel shows per-class NC1 curves (classes 0–9) and additional NC metrics vs. epoch (0–100).]
(a) Original samples, WD 5e-4 across layers.
(b) Samples of classes 4 and 9 are blurred; the last layer's WD remains 5e-4. The effect of the blurred classes on the NC metrics (average and other classes) is minor.
(c) Samples of classes 4 and 9 are blurred; the last layer's WD is reduced to 5e-5. The blurred classes affect the per-class NC1 of the other classes and the NC metrics increase.
(d) Samples of classes 4 and 9 are blurred; the last layer has no WD. The blurred classes further interfere with the other classes and the NC metrics further increase.

Figure 6: The effect of modifying the weight decay (WD) of the last layer's weights on NC metrics for ResNet18 trained on MNIST with 3K samples per class, where the samples from classes 4 and 9 are blurred. Observe that a small WD in the last layer increases the effect of the per-class NC1 curves of the blurred classes on the other classes, and also increases the other NC metrics.