# An Unconstrained Layer-Peeled Perspective on Neural Collapse

Published as a conference paper at ICLR 2022

Wenlong Ji (Peking University), Yiping Lu (Stanford University), Yiliang Zhang (University of Pennsylvania), Zhun Deng (Harvard University), Weijie J. Su (University of Pennsylvania)

ABSTRACT

Neural collapse is a highly symmetric geometry of neural networks that emerges during the terminal phase of training, with profound implications for the generalization performance and robustness of the trained networks. To understand how the last-layer features and classifiers exhibit this recently discovered implicit bias, in this paper we introduce a surrogate model called the unconstrained layer-peeled model (ULPM). We prove that gradient flow on this model converges to critical points of a minimum-norm separation problem that exhibits neural collapse at its global minimizer. Moreover, we show that the ULPM with the cross-entropy loss has a benign global landscape, which allows us to prove that all the critical points are strict saddle points except the global minimizers, which exhibit the neural collapse phenomenon. Empirically, we show that our results also hold during the training of neural networks in real-world tasks when explicit regularization or weight decay is not used.

1 INTRODUCTION

Deep learning has achieved state-of-the-art performance in various applications (LeCun et al., 2015), such as computer vision (Krizhevsky et al., 2012), natural language processing (Brown et al., 2020), and scientific discovery (Long et al., 2018; Zhang et al., 2018). Despite the empirical success of deep learning, how gradient descent or its variants lead deep neural networks to be biased toward solutions with good generalization performance on the test set remains a major open question. To develop a theoretical foundation for deep learning, many studies have investigated the implicit bias of gradient descent in different settings (Li et al., 2018; Amari et al., 2020; Vaswani et al., 2020; Soudry et al., 2018; Lyu & Li, 2019; Arora et al., 2019). It is well acknowledged that well-trained end-to-end deep architectures can effectively extract features relevant to a given label. Although theoretical analysis of deep learning has been successful in recent years (Arora et al.; Goldblum et al., 2019), most studies that analyze the properties of the final output function fail to characterize the features learned by neural networks. Recently, Papyan et al. (2020) observed that the features within each class collapse to their class mean and the class means converge to an equiangular tight frame (ETF) during the terminal phase of training, that is, the stage after achieving zero training error. This phenomenon, named neural collapse (Papyan et al., 2020), provides a clear view of how the last-layer features of the neural network evolve after interpolation and helps explain the benefit of continuing training after zero training error is reached for better generalization and robustness. To theoretically analyze the neural collapse phenomenon, Fang et al. (2021) proposed the layer-peeled model (LPM) as a simple surrogate for neural networks, where the last-layer features are modeled as free optimization variables.
In particular, in a balanced $K$-class classification problem using a neural network with $d$ neurons in the last hidden layer, the LPM takes the following form:

$$\min_{W, H} \ \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(W h_i, y_i), \quad \text{s.t.} \ \frac{1}{2}\|W\|_F^2 \le C_1, \ \frac{1}{2}\|H\|_F^2 \le C_2, \qquad (1.1)$$

where $C_1, C_2$ are positive constants. Here, $W = [w_1, w_2, \dots, w_K] \in \mathbb{R}^{K \times d}$ is the weight of the final linear classifier, $H = [h_1, h_2, \dots, h_n] \in \mathbb{R}^{d \times n}$ collects the features of the last layer, and $y_i$ is the corresponding label. The intuition behind the LPM is that modern deep networks are often highly over-parameterized, with the capacity to learn any representation of the input data. It has been shown that an equiangular tight frame (ETF), i.e., a feature configuration exhibiting neural collapse, is the only global optimum of the LPM objective (1.1) (Fang et al., 2021; Lu & Steinerberger, 2020; Wojtowytsch & E, 2020; Zhu et al., 2021). However, the feature constraints in LPMs are not equivalent to the weight decay used in practice. In this study, we directly deal with the unconstrained model and show that gradient flow can find those neural collapse solutions without the help of explicit constraints or regularization. To do this, we build a connection between neural collapse and recent theories of max-margin implicit regularization (Lyu & Li, 2019; Wei et al., 2018), and use it to provide a convergence result to first-order stationary points of a minimum-norm separation problem. Furthermore, we show that the cross-entropy loss enjoys a benign global landscape where all the critical points are strict saddles in the tangent space, except for the global minimizers, which exhibit the neural collapse phenomenon. Finally, we verify our insights via empirical experiments. In contrast to previous theoretical works on neural collapse, our analysis does not incorporate any explicit regularization or constraint on the features. A comparison with other results can be found in Table 1, and we defer a detailed discussion to Section 5.2. The reasons we investigate the unregularized objective are summarized as follows:

1. Feature regularization or constraints are still not equivalent to the weight decay used in practice. However, previous studies have shown that neural networks continue to perform well without any regularization or constraint (Zhang et al., 2021). Moreover, SGD with an exponential learning rate on an unconstrained objective is provably equivalent to SGD with weight decay (Li & Arora, 2019).

2. As shown in this study, neural collapse emerges even in the unconstrained setting, which implies that the emergence of neural collapse should be attributed to gradient descent and the cross-entropy loss rather than to explicit regularization.

3. Feature regularization or constraints can be barriers for existing theories of neural networks (Jacot et al., 2018; Lyu & Li, 2019). By allowing the features to be totally free, we hope our results can inspire further analysis that plugs in a realistic neural network.

| Reference | Contribution | Feature Norm Constraint | Feature Norm Regularization | Loss Function |
| --- | --- | --- | --- | --- |
| Papyan et al. (2020) | Empirical Results | ✗ | ✗ | CE Loss |
| Wojtowytsch & E (2020); Lu & Steinerberger (2020); Fang et al. (2021) | Global Optimum | ✓ | ✗ | CE Loss |
| Mixon et al. (2020); Poggio & Liao (2020); Han et al. (2021) | Modified Training Dynamics | ✗ | ✗ | ℓ2 Loss |
| Zhu et al. (2021) (concurrent) | Landscape Analysis | ✗ | ✓ | CE Loss |
| This paper | Training Dynamics + Landscape Analysis | ✗ | ✗ | CE Loss |

Table 1: Comparison of recent analyses of neural collapse. We provide theoretical results with minimal modification of the training objective function. Here, CE loss refers to the cross-entropy loss.
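To make the contrast between the constrained LPM (1.1) and the unconstrained model studied in this paper concrete, the following NumPy sketch optimizes the LPM objective with projected gradient descent, rescaling $W$ and $H$ back onto their norm balls after every step. The helper names, class sizes, constants $C_1, C_2$, step size, and random initialization are illustrative choices, not the paper's or Fang et al.'s experimental setup.

```python
import numpy as np

def lpm_loss_and_grads(W, H, labels):
    """Cross-entropy of the layer-peeled model: logits are W @ H."""
    Z = W @ H                                  # (K, N) logits
    Z = Z - Z.max(axis=0, keepdims=True)       # numerical stability
    P = np.exp(Z) / np.exp(Z).sum(axis=0, keepdims=True)
    N = H.shape[1]
    loss = -np.log(P[labels, np.arange(N)]).mean()
    G = P.copy()
    G[labels, np.arange(N)] -= 1.0             # d loss / d logits
    G /= N
    return loss, G @ H.T, W.T @ G              # loss, grad_W, grad_H

def project(M, C):
    """Project M onto the ball {M : 0.5 * ||M||_F^2 <= C}."""
    r = np.sqrt(2 * C)
    nrm = np.linalg.norm(M)
    return M if nrm <= r else M * (r / nrm)

K, d, n = 4, 16, 10                            # classes, feature dim, samples per class
C1, C2, lr = 1.0, 5.0, 0.5                     # illustrative constants
rng = np.random.default_rng(0)
W, H = rng.normal(size=(K, d)), rng.normal(size=(d, K * n))
labels = np.repeat(np.arange(K), n)

for step in range(2000):
    loss, gW, gH = lpm_loss_and_grads(W, H, labels)
    W, H = project(W - lr * gW, C1), project(H - lr * gH, C2)
print(f"final LPM loss: {loss:.4f}")
```

The ULPM analyzed below drops the two projections entirely and studies what plain gradient flow does on the same cross-entropy objective.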
1.1 CONTRIBUTIONS

The contributions of the present study can be summarized as follows.

1. We build a relationship between the max-margin analysis (Soudry et al., 2018; Nacson et al., 2019b; Lyu & Li, 2019) and neural collapse, and provide an implicit bias analysis of the features rather than of the output function. Although both the parameters and the features diverge to infinity, we prove that the convergent direction aligns with a KKT direction of a minimum-norm separation problem.

2. Previous works (Lyu & Li, 2019; Ji et al., 2020) only prove that gradient flow on homogeneous neural networks converges to a KKT point of the corresponding minimum-norm separation problem. However, the minimum-norm separation problem remains highly non-convex, and a KKT point need not be a neural collapse solution. In this study, we give a more detailed characterization of the convergence direction via landscape analysis.

3. Previous analyses of neural collapse rely on explicit regularization or constraints. In this study, we show that the implicit regularization effect of gradient flow is sufficient to lead to a neural collapse solution: the emergence of neural collapse should be attributed to gradient descent and the loss function, rather than to explicit regularization or constraints. We provide a detailed discussion in Section 5.2.

1.2 RELATED WORKS

Implicit Bias of Gradient Descent: To understand how gradient descent or its variants helps deep learning find solutions with good generalization performance on the test set, a recent line of research has studied the implicit bias of gradient descent in different settings. For example, gradient descent is biased toward solutions with smaller weights under the ℓ2 loss (Li et al., 2018; Amari et al., 2020; Vaswani et al., 2020) and converges to large-margin solutions under the logistic loss (Soudry et al., 2018; Nacson et al., 2019b; Lyu & Li, 2019; Chizat & Bach, 2020; Ji et al., 2020). For linear networks, Arora et al. (2019); Razin & Cohen (2020); Gidel et al. (2019) have shown that gradient descent finds low-rank approximations.

Loss Landscape Analysis: Although the optimization problems encountered in machine learning are often nonconvex, recent works have shown that their landscapes can enjoy benign properties that allow further analysis. In particular, these landscapes do not exhibit spurious local minimizers or flat saddles, and they can be optimized efficiently via gradient-based methods (Ge et al., 2015). Examples include phase retrieval (Sun et al., 2018), low-rank matrix recovery (Ge et al., 2016; 2015), dictionary learning (Sun et al., 2016; Qu et al., 2019; Laurent & Brecht, 2018), and blind deconvolution (Lau et al., 2019).

2 PRELIMINARIES AND PROBLEM SETUP

In this paper, $\|\cdot\|_F$ denotes the Frobenius norm, $\|\cdot\|_2$ denotes the matrix spectral norm, $\|\cdot\|_*$ denotes the nuclear norm, $\|\cdot\|$ denotes the vector $\ell_2$ norm, and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. We use $[K] := \{1, 2, \dots, K\}$ to denote the set of indices up to $K$.

2.1 PRELIMINARIES

We consider a balanced dataset with $K$ classes, $\bigcup_{k=1}^{K}\{x_{k,i}\}_{i=1}^{n}$. A standard fully connected neural network can be represented as

$$f(x; W_{\mathrm{full}}) = b_L + W_L\,\sigma\big(b_{L-1} + W_{L-1}\,\sigma(\cdots\sigma(b_1 + W_1 x)\cdots)\big). \qquad (2.1)$$
Here, $W_{\mathrm{full}} = (W_1, W_2, \dots, W_L)$ denotes the weight matrices of each layer, $(b_1, b_2, \dots, b_L)$ denotes the bias terms, and $\sigma(\cdot)$ denotes the nonlinear activation function, for example, ReLU or sigmoid. Let $h_{k,i} = \sigma(b_{L-1} + W_{L-1}\,\sigma(\cdots\sigma(b_1 + W_1 x_{k,i})\cdots)) \in \mathbb{R}^d$ denote the last-layer feature of the data point $x_{k,i}$, and let $\bar{h}_k = \frac{1}{n}\sum_{i=1}^{n} h_{k,i}$ denote the feature mean of the $k$-th class. To provide a formal definition of neural collapse, we first introduce the concept of a simplex equiangular tight frame (ETF):

Definition 2.1 (Simplex ETF). A matrix $M \in \mathbb{R}^{d \times K}$ is said to be a simplex equiangular tight frame (ETF) if

$$M = \sqrt{\frac{K}{K-1}}\, Q\Big(I_K - \frac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top\Big), \qquad (2.2)$$

where $Q \in \mathbb{R}^{d \times K}$ is a matrix with orthonormal columns.

Let $W = W_L = [w_1, w_2, \dots, w_K] \in \mathbb{R}^{K \times d}$ be the weight of the final-layer classifier. The four criteria of neural collapse can then be formulated precisely as follows:

(NC1) Variability collapse: as training progresses, the within-class variation of the activations becomes negligible as these activations collapse to their class means $\bar{h}_k = \frac{1}{n}\sum_{i=1}^{n} h_{k,i}$:
$$\|h_{k,i} - \bar{h}_k\| = 0, \quad \forall\, k \in [K],\, i \in [n].$$

(NC2) Convergence to a simplex ETF: the class-mean vectors (after centering by their global mean) converge to having equal lengths, forming equal-sized angles between any given pair, and being the maximally pairwise-distanced configuration subject to the previous two properties:
$$\cos\angle(\tilde{h}_k, \tilde{h}_j) = -\frac{1}{K-1}, \quad \|\tilde{h}_k\| = \|\tilde{h}_j\|, \quad \forall\, k \neq j,$$
where $\tilde{h}_k = \bar{h}_k - h_G$ and $h_G$ is the global feature mean.

(NC3) Convergence to self-duality: the linear classifiers and the class means converge to align with each other, up to appropriate rescaling; that is, there exists a universal constant $C > 0$ such that
$$w_k = C\,\tilde{h}_k, \quad \forall\, k \in [K].$$

(NC4) Simplification to nearest class-center: for a given deep-net activation $h = \sigma(b_{L-1} + W_{L-1}\,\sigma(\cdots\sigma(b_1 + W_1 x)\cdots)) \in \mathbb{R}^d$, the network classifier converges to choosing whichever class has the nearest training class mean:
$$\arg\max_{k}\ \langle w_k, h\rangle \;=\; \arg\min_{k}\ \|h - \bar{h}_k\|.$$

In this paper, we say that a point $(W, H)$ with $W \in \mathbb{R}^{K \times d}$ and $H \in \mathbb{R}^{d \times nK}$ satisfies the neural collapse conditions, or is a neural collapse solution, if the above four criteria are all satisfied for $(W, H)$.

2.2 PROBLEM SETUP

We focus on the neural collapse phenomenon, which concerns only the classifiers and features of the last layer. Since a general analysis of the highly non-smooth and non-convex neural network is difficult, we peel off the last layer of the neural network and propose the following unconstrained layer-peeled model (ULPM) as a simplification that captures the main characteristics of the training dynamics relevant to neural collapse. A similar simplification is commonly used in previous theoretical works (Lu & Steinerberger, 2020; Fang et al., 2021; Wojtowytsch & E, 2020; Zhu et al., 2021), but ours does not place any constraint or regularization on the features. It should be mentioned that although Mixon et al. (2020); Han et al. (2021) also studied the unconstrained model, their analyses adopted an approximation of the real dynamics and depended heavily on the ℓ2 loss function, which is rarely used in classification tasks. Compared with their works, we directly deal with the real training dynamics and cover the most popular cross-entropy loss in classification tasks. Let $W = [w_1, w_2, \dots, w_K] \in \mathbb{R}^{K \times d}$ and $H = [h_{1,1}, \dots, h_{1,n}, h_{2,1}, \dots, h_{K,n}] \in \mathbb{R}^{d \times Kn}$ be the matrices of classifiers and features in the last layer, where $K$ is the number of classes and $n$ is the number of data points in each class.
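Before introducing the ULPM, here is a quick numerical sanity check of Definition 2.1 and (NC2): the sketch below builds a simplex ETF from a random matrix $Q$ with orthonormal columns (via a QR factorization, assuming $d \ge K$) and verifies that its columns have equal norms and pairwise cosines equal to $-1/(K-1)$. The dimensions and helper names are illustrative choices.

```python
import numpy as np

def simplex_etf(d, K, rng):
    """M = sqrt(K/(K-1)) * Q (I_K - (1/K) 11^T) with Q^T Q = I_K (Definition 2.1)."""
    A = rng.normal(size=(d, K))
    Q, _ = np.linalg.qr(A)                       # d x K, orthonormal columns (needs d >= K)
    center = np.eye(K) - np.ones((K, K)) / K
    return np.sqrt(K / (K - 1)) * Q @ center

rng = np.random.default_rng(0)
d, K = 16, 5
M = simplex_etf(d, K, rng)

norms = np.linalg.norm(M, axis=0)
cosines = (M.T @ M) / np.outer(norms, norms)
off_diag = cosines[~np.eye(K, dtype=bool)]

print("equal column norms:", np.allclose(norms, norms[0]))                   # True
print("pairwise cosines = -1/(K-1):", np.allclose(off_diag, -1 / (K - 1)))   # True
```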
The ULPM is defined as follows:

$$\min_{W, H}\ L(W, H) \;=\; \min_{W, H}\ -\sum_{k=1}^{K}\sum_{i=1}^{n} \log\frac{\exp(w_k^\top h_{k,i})}{\sum_{j=1}^{K}\exp(w_j^\top h_{k,i})}. \qquad (2.3)$$

Here, we do not place any constraint or regularization on the features, which corresponds to the absence of weight decay in deep learning training. The objective function (2.3) is generally non-convex in $(W, H)$, and we aim to study its landscape. Furthermore, we consider the gradient flow of the objective function:

$$\frac{dW(t)}{dt} = -\nabla_W L(W(t), H(t)), \qquad \frac{dH(t)}{dt} = -\nabla_H L(W(t), H(t)).$$

3 MAIN RESULTS

In this section, we present our main results regarding the training dynamics and landscape analysis of (2.3). This section is organized as follows. First, in Section 3.1, we show the relationship between the margin and neural collapse in our surrogate model. Inspired by this relationship, we propose a minimum-norm separation problem (3.1) and show the connection between the convergence direction of the gradient flow and the KKT points of (3.1). In addition, we explicitly solve for the global optimum of (3.1) and show that it must satisfy the neural collapse conditions. However, owing to the non-convexity, we present Example 3.1 in Section 3.2, which shows that there exist bad KKT points at which a simple gradient flow can get stuck, so that it does not converge to the neural collapse solution that is proved to be optimal in Theorem 3.3. We then present our second-order analysis in Theorem 3.4, which shows that these bad points exhibit decreasing directions in the tangent space, so gradient descent and its variants can avoid these bad directions and converge only to the neural collapse solutions (Lee et al., 2016; Ge et al., 2015).

3.1 CONVERGENCE TO THE FIRST-ORDER STATIONARY POINT

Heuristically speaking, the simplex ETF (Definition 2.1) gives a set of vectors with the maximum average angle between them. As a result, neural collapse implies that the neural network tends to maximize the angles between each class and the corresponding classifiers. At a high level, such behavior is quite similar to margin maximization, which is known to be an implicit regularization effect of gradient descent and has been extensively studied in Soudry et al. (2018); Nacson et al. (2019b); Lyu & Li (2019); Ji et al. (2020). First, we illustrate the connection between the margin and neural collapse. Recall that the margin of a single data point $x_{k,i}$ and the associated feature $h_{k,i}$ is

$$q_{k,i}(W, H) := w_k^\top h_{k,i} - \max_{j \neq k} w_j^\top h_{k,i}.$$

Then, the margin of the entire dataset can be defined as $q_{\min}(W, H) := \min_{k \in [K],\, i \in [n]} q_{k,i}(W, H)$. The following theorem demonstrates that neural collapse yields the maximum-margin solution of our ULPM model:

Theorem 3.1 (Neural collapse as max-margin solution). For the ULPM model (2.3), the margin of the entire dataset always satisfies

$$q_{\min}(W, H) \;\le\; \frac{\|W\|_F^2 + \|H\|_F^2}{2(K-1)\sqrt{n}},$$

and equality holds if and only if $(W, H)$ satisfies the neural collapse conditions with $\|W\|_F = \|H\|_F$.

Based on this finding, we present an analysis of the convergence of the gradient flow on the ULPM (2.3). Following Lyu & Li (2019), we link the gradient flow on the cross-entropy loss to a minimum-norm separation problem.
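To make the ULPM objective (2.3), its gradient flow, and the bound of Theorem 3.1 concrete, the following NumPy sketch evaluates $L(W, H)$, runs plain unregularized gradient descent as an Euler discretization of the flow, and tracks the margin $q_{\min}$ against the Theorem 3.1 bound. The helper names, problem sizes, step size, and initialization are illustrative choices, not the paper's setup.

```python
import numpy as np

def ulpm_loss_and_grads(W, H, labels):
    """ULPM objective (2.3): cross-entropy on the logits Z = W @ H."""
    Z = W @ H
    Zs = Z - Z.max(axis=0, keepdims=True)                # numerical stability
    P = np.exp(Zs) / np.exp(Zs).sum(axis=0, keepdims=True)
    N = H.shape[1]
    loss = -np.log(P[labels, np.arange(N)]).sum()
    G = P.copy()
    G[labels, np.arange(N)] -= 1.0                       # dL/dZ
    return loss, G @ H.T, W.T @ G                        # loss, grad_W, grad_H

def q_min(W, H, labels):
    """Smallest margin q_{k,i} = w_k^T h_{k,i} - max_{j != k} w_j^T h_{k,i}."""
    Z = W @ H
    N = H.shape[1]
    correct = Z[labels, np.arange(N)]
    Z[labels, np.arange(N)] = -np.inf
    return (correct - Z.max(axis=0)).min()

K, d, n, lr = 4, 8, 5, 0.05
labels = np.repeat(np.arange(K), n)
rng = np.random.default_rng(0)
W, H = 0.1 * rng.normal(size=(K, d)), 0.1 * rng.normal(size=(d, K * n))

for t in range(1, 20001):
    loss, gW, gH = ulpm_loss_and_grads(W, H, labels)
    W, H = W - lr * gW, H - lr * gH                      # Euler step of the gradient flow
    if t % 5000 == 0:
        bound = (np.linalg.norm(W)**2 + np.linalg.norm(H)**2) / (2 * (K - 1) * np.sqrt(n))
        print(f"t={t:6d}  loss={loss:.4f}  q_min={q_min(W, H, labels):.3f}  "
              f"Thm 3.1 bound={bound:.3f}")
# With no constraint, the loss keeps decreasing while the parameter norms grow,
# and q_min always remains below the Theorem 3.1 bound.
```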
Theorem 3.2. For problem (2.3), let $(W(t), H(t))$ be the path of the gradient flow at time $t$. If there exists a time $t_0$ such that $L(W(t_0), H(t_0)) < \log 2$, then any limit point of

$$\big(\hat{H}(t), \hat{W}(t)\big) := \Big(\tfrac{H(t)}{\sqrt{\|W(t)\|_F^2 + \|H(t)\|_F^2}},\ \tfrac{W(t)}{\sqrt{\|W(t)\|_F^2 + \|H(t)\|_F^2}}\Big)$$

is along the direction (i.e., a constant multiple) of a Karush-Kuhn-Tucker (KKT) point of the following minimum-norm separation problem:

$$\min_{W, H}\ \frac{1}{2}\|W\|_F^2 + \frac{1}{2}\|H\|_F^2 \quad \text{s.t.}\ w_k^\top h_{k,i} - w_j^\top h_{k,i} \ge 1, \ \forall\, k \neq j \in [K],\ i \in [n]. \qquad (3.1)$$

Remark 3.1. Here, we write (3.1) as a constrained problem, but the constraint is introduced by the implicit regularization effect of the gradient flow on our ULPM objective (2.3). Since the training dynamics diverge to infinity, we wish to show that the divergence direction is closely related to neural collapse, and an appropriate normalization is needed; this is why (3.1) takes the form of a constrained optimization problem. Our goal is to show that the neural collapse phenomenon is caused by the properties of the loss function and the training dynamics rather than by an explicit regularization or constraint, which seems to be necessary in previous studies (Fang et al., 2021; Lu & Steinerberger, 2020; Wojtowytsch & E, 2020).

Remark 3.2. In Theorem 3.2, we assume that there exists a time $t_0$ such that $L(W(t_0), H(t_0)) < \log 2$. Note that the loss function can be rewritten as

$$L(W, H) = \sum_{k=1}^{K}\sum_{i=1}^{n}\log\Big(1 + \sum_{j \neq k}\exp(w_j^\top h_{k,i} - w_k^\top h_{k,i})\Big), \qquad (3.2)$$

so the requirement $L(W, H) < \log 2$ implies that $w_k^\top h_{k,i} - w_j^\top h_{k,i} > 0$ for any $k \neq j \in [K],\ i \in [n]$, which is equivalent to $q_{\min}(W, H) > 0$; that is, every feature is separated perfectly by the classifier. This assumption is common in the study of implicit bias in the nonlinear setting (Nacson et al., 2019a; Lyu & Li, 2019; Ji et al., 2020), and its validity is supported by the fact that neural collapse is found only in the terminal phase of training of deep neural networks, where the training accuracy has reached 100%. It is also an interesting direction to remove this assumption and study the early-stage dynamics of training, which is beyond the scope of this study, and we leave it to future exploration.

Theorem 3.2 indicates that the convergent direction of the gradient flow is restricted to the max-margin directions, which usually enjoy good robustness and generalization performance. In general, the KKT conditions are not sufficient for global optimality because the minimum-norm separation problem (3.1) is non-convex. On the other hand, we can precisely characterize its global optimum in the ULPM case based on Theorem 3.1:

Corollary 3.1. Every global optimum of the minimum-norm separation problem (3.1) is also a KKT point and satisfies the neural collapse conditions.

With Theorem 3.2 bridging the dynamics with the KKT points of (3.1) and Corollary 3.1 associating the global optimum of (3.1) with neural collapse, the remaining work is to close the gap between KKT points and the global optimum.

3.2 SECOND-ORDER LANDSCAPE ANALYSIS

In convex optimization problems, the KKT conditions are usually equivalent to global optimality. Unfortunately, owing to the non-convex nature of the objective (2.3), the KKT points can also be saddle points or local optima other than the global optimum. In this section, we aim to show via landscape analysis that this non-convex optimization problem is actually not scary. To be more specific, we prove that except for the global optimum given by neural collapse, all other KKT points are saddle points that can be avoided by gradient flow.
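As a concrete numerical preview of such a saddle direction, the sketch below evaluates the ULPM at a configuration of the kind used in Example 3.1 below ($K = 4$, $n = 1$, $W = H^\top$ with a 2x2 block pattern of unit entries; the specific signs are one choice made here for illustration, and the helper names are mine). At this point the gradient is parallel to $(W, H)$, so gradient flow started there only rescales the point, yet an ETF-aligned neural collapse configuration with the same total Frobenius norm attains a strictly smaller loss, in line with Theorem 3.3 below.

```python
import numpy as np

def ulpm_loss_and_grads(W, H):
    """ULPM cross-entropy with n = 1: column k of H is the single feature of class k."""
    K = W.shape[0]
    P = np.exp(W @ H)
    P /= P.sum(axis=0, keepdims=True)            # column-wise softmax
    loss = -np.log(np.diag(P)).sum()
    G = P - np.eye(K)                            # dL/d(WH)
    return loss, G @ H.T, W.T @ G                # loss, grad_W, grad_H

# One concrete sign choice for a bad KKT direction (K = 4, n = 1), with W = H^T.
W = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  1.,  0.,  0.],
              [ 0.,  0.,  1., -1.],
              [ 0.,  0., -1.,  1.]])
H = W.T
loss, gW, gH = ulpm_loss_and_grads(W, H)

# The gradient points along (W, H): gradient flow started here only rescales the point.
print("grad_W parallel to W:", np.allclose(gW, gW[0, 0] * W))   # True
print("grad_H parallel to H:", np.allclose(gH, gH[0, 0] * H))   # True

# An ETF-aligned neural collapse point with the same total norm has strictly smaller loss.
K = 4
M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)    # simplex ETF (Q = I_4)
c = np.sqrt((np.linalg.norm(W)**2 + np.linalg.norm(H)**2) / (2 * K))
W_nc, H_nc = c * M.T, c * M                                     # same ||W||_F^2 + ||H||_F^2
print("loss at the bad KKT direction :", round(loss, 4))                               # ~1.015
print("loss at the collapse solution :", round(ulpm_loss_and_grads(W_nc, H_nc)[0], 4)) # ~0.757
```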
In contrast to previous landscape analysis of non-convex problems, where people aim to show that the objective has a negative directional curvature around any stationary point (Sun et al., 2015; Zhang et al., 2020), our analysis is slightly different. Note that since the model is unconstrained, once features can be perfectly separated, the ULPM objective (2.3) will always decrease along the direction of the current point and the optimum is attained only in infinity. Although growing along all of those perfectly separating directions can let the loss function decrease to 0, the speed of decrease is quite different and there exists an optimal direction with the fastest decreasing speed. As shown in Section 3.1, first-order analysis of training dynamics fails to distinguish such an optimal direction from KKT points, and we need second-order analysis to help us fully characterize the realistic training dynamics. First, we provide an example to illustrate the motivation and necessity of second-order landscape analysis. Example 3.1 (A Motivating Example). Consider the case where K = 4, n = 1, let (W , H) be the following point: 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 One can easily verify that (W , H) enables our model to classify all features perfectly. Furthermore, we can show that it is along the direction of a KKT point of the minimum-norm separation problem Published as a conference paper at ICLR 2022 (3.1) by constructing the Lagrangian multiplier Λ = (λij)K i,j=1 as follows: 0 0 1 2 1 2 0 0 1 2 1 2 1 2 1 2 0 0 1 2 1 2 0 0 To see this, simply write down the corresponding Lagrangian (note that we aim to justify (W , H) is along the direction of a KKT point of (3.1), to make it a true KKT point , one needs to multiple 1/ 2 on W , H): L(W , H, Λ) = 1 4 W 2 F + 1 j =i λi,j(1 Simply take derivatives for W , H and Λ we find that it satisfies the KKT conditions. However, the gradient of (W , H) is: W L(W , H) = HL(W , H) = 2 + 2e 2 2 + 2e 2 + 2e2 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 We can see that the directions of the gradient and the parameter align with each other (i.e., W is parallel to W L(W , H), and H is parallel to HL(W , H)), which implies that simple gradient descent may get stuck in this direction and only grow the parameter norm. However, if we construct: 1 + α 1 + α α α 1 + α 1 + α α α α α 1 α 1 α α α 1 α 1 α By simple calculation, we find that f(α) := L(W , H ) satisfies f (α) = 0, f (α) < 0. Since W F = W F , H F = H F and ||W W ||2 F + ||H H||2 F 0 as α 0, this result implies that for any ϵ > 0, we can choose appropriate α such that: ||W ||2 F = ||W ||2 F , ||H ||2 F = ||H||2 F , ||W W ||2 F + ||H H||2 F < ϵ, L(W , H ) < L(W , H). In Example 3.1, it is shown that there are some KKT points of the minimum-norm separation problem (3.1) that are not globally optimal, but there exist some better points close to it; thus, the gradientbased method can easily avoid them (see Lee et al. (2016); Ge et al. (2015) for a detailed discussion). In the following theorem, we will show that the best directions are neural collapse solutions in the sense that the loss function is the lowest among all the growing directions. Theorem 3.3. The optimal value of the loss function (2.3) on a sphere is obtained (i.e., L(W , H) L(W , H ) for any ||W ||2 F + ||H ||2 F = ||W ||2 F + ||H||2 F ) if only if (W , H) satisfies neural collapse conditions and ||W ||F = ||H||F . Remark 3.3. Note that the second condition is necessary because neural collapse conditions do not specify the norm ratio of W and H. 
That is, if $(W, H)$ satisfies the neural collapse conditions, then for any $\alpha, \beta \in \mathbb{R}$, $(\alpha W, \beta H)$ also satisfies them, but only certain $\alpha, \beta$ are optimal. Now, we turn to the points that are not globally optimal. To formalize our discussion of the motivating Example 3.1, we first introduce the tangent space:

Definition 3.1 (Tangent space). The tangent space of $(W, H)$ is defined as the set of directions that are orthogonal to $(W, H)$:
$$T(W, H) = \big\{(\Delta W, \Delta H) \in \mathbb{R}^{K \times d} \times \mathbb{R}^{d \times nK} : \mathrm{tr}(W^\top \Delta W) + \mathrm{tr}(H^\top \Delta H) = 0\big\}.$$

Our next result justifies our observation in Example 3.1 that, for those non-optimal points, there exists a direction in the tangent space such that moving along it leads to a lower objective value.

Theorem 3.4. If $(W, H)$ is not an optimal solution in the sense of Theorem 3.3 (i.e., $(W, H)$ is not a neural collapse solution, or it is a neural collapse solution but $\|W\|_F \neq \|H\|_F$), then there exist a direction $(\Delta W, \Delta H) \in T(W, H)$ and a constant $M > 0$ such that for any $0 < \delta < M$,
$$L(W + \delta \Delta W, H + \delta \Delta H) < L(W, H). \qquad (3.3)$$
Furthermore, this implies that for any $\epsilon > 0$ there exists a point $(W', H')$ such that
$$\|W'\|_F^2 + \|H'\|_F^2 = \|W\|_F^2 + \|H\|_F^2, \quad \|W' - W\|_F^2 + \|H' - H\|_F^2 < \epsilon, \quad L(W', H') < L(W, H). \qquad (3.4)$$

Remark 3.4. The result in (3.3) gives a decreasing direction orthogonal to the direction of $(W, H)$. As shown in Example 3.1, the gradient at these non-optimal points might be parallel to $(W, H)$; thus, a first-order analysis fails to explain the prevalence of neural collapse. Here, the decreasing direction is obtained by analyzing the Riemannian Hessian matrix and finding an eigenvector corresponding to a negative eigenvalue, which further indicates that these points are first-order saddle points in the tangent space. This is why we call it a second-order landscape analysis. A formal statement and proof are presented in Appendix C. Previous works have shown that a large family of gradient-based methods avoid saddle points and converge only to minimizers (Lee et al., 2016; Ge et al., 2015; Panageas et al., 2019); thus, our landscape analysis indicates that the gradient-flow dynamics only find neural collapse directions.

4 EMPIRICAL RESULTS

[Figure 1, four panels versus training epoch for MNIST and CIFAR-10: (a) variation of the norms (classifiers and last-layer features), (b) within-class variation (average absolute difference, last-layer features), (c) cosines between pairs (average shifted cosine, classifiers and last-layer features), (d) self-duality (average absolute difference).]

Figure 1: Experiments on real datasets without weight decay. We trained a ResNet-18 on both the MNIST and CIFAR-10 datasets. The x-axes are set to a log(log(t)) scale and the y-axes are set to a log scale.

To evaluate our theory, we trained a ResNet-18 (He et al., 2016) on both the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky et al., 2009) datasets without weight decay, and tracked how the last-layer features and classifiers converge to neural collapse solutions. The results are plotted in Figure 1. Here we report the following metrics to measure the level of neural collapse (a code sketch that computes them follows the list):

1. The variation of the feature norms in the last layer (i.e., $\mathrm{Std}(\|\bar{h}_k - h_G\|)/\mathrm{Avg}(\|\bar{h}_k - h_G\|)$) and of the classifier norms (i.e., $\mathrm{Std}(\|w_k\|)/\mathrm{Avg}(\|w_k\|)$).

2. The within-class variation of the last-layer features (i.e., $\mathrm{Avg}(\|h_{k,i} - \bar{h}_k\|)/\mathrm{Avg}(\|h_{k,i} - h_G\|)$).

3. The average shifted cosine between pairs of last-layer class means (i.e., $\mathrm{Avg}_{k \neq k'}(|\cos\angle(\bar{h}_k - h_G, \bar{h}_{k'} - h_G) + 1/(K-1)|)$) and between pairs of last-layer classifiers (i.e., $\mathrm{Avg}_{k \neq k'}(|\cos\angle(w_k, w_{k'}) + 1/(K-1)|)$).

4. The self-duality distance between the features and classifiers of the same class in the last layer (i.e., $\mathrm{Avg}_k(|(\bar{h}_k - h_G)/\|\bar{h}_k - h_G\| - w_k/\|w_k\||)$).
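A sketch of how these four metrics can be computed from the last-layer features and classifier weights is given below; the tensor shapes, helper names, and the elementwise averaging in the self-duality term are illustrative choices, not the paper's implementation.

```python
import numpy as np

def nc_metrics(H, W):
    """Neural collapse metrics for features H of shape (K, n, d) and classifiers W of shape (K, d)."""
    K, n, d = H.shape
    class_means = H.mean(axis=1)                     # (K, d)
    global_mean = class_means.mean(axis=0)           # (d,)
    centered = class_means - global_mean             # (K, d)

    # Metric 1: variation of feature/classifier norms
    feat_norms = np.linalg.norm(centered, axis=1)
    clf_norms = np.linalg.norm(W, axis=1)
    norm_variation = (feat_norms.std() / feat_norms.mean(),
                      clf_norms.std() / clf_norms.mean())

    # Metric 2: within-class variation relative to the spread around the global mean
    within = np.linalg.norm(H - class_means[:, None, :], axis=2).mean()
    total = np.linalg.norm(H - global_mean, axis=2).mean()
    within_class_variation = within / total

    # Metric 3: average |cos(angle) + 1/(K-1)| over distinct pairs
    def avg_shifted_cosine(V):
        U = V / np.linalg.norm(V, axis=1, keepdims=True)
        cos = U @ U.T
        return np.abs(cos[~np.eye(K, dtype=bool)] + 1.0 / (K - 1)).mean()
    cosines = (avg_shifted_cosine(centered), avg_shifted_cosine(W))

    # Metric 4: self-duality distance between normalized centered means and classifiers
    duality = np.abs(centered / feat_norms[:, None] - W / clf_norms[:, None]).mean()
    return norm_variation, within_class_variation, cosines, duality

# Example call on random tensors (real usage would pass trained features and weights):
rng = np.random.default_rng(0)
print(nc_metrics(rng.normal(size=(10, 32, 512)), rng.normal(size=(10, 512))))
```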
A smaller value of each metric indicates a state closer to a neural collapse solution. Specifically, the within-class variation measures (NC1); the variation of the norms and the cosines between pairs measure (NC2); the self-duality distance measures (NC3); and (NC4) has been shown to be a corollary of (NC1)-(NC3) (Papyan et al., 2020). More experimental results under various settings, as well as training details, can be found in Appendix D. As shown in Figure 1, all of these metrics decrease over the training epochs. These results reveal strong evidence for the emergence of neural collapse in the unconstrained setting and provide sound support for our theory.

5 CONCLUSION AND DISCUSSION

5.1 CONCLUSION

To understand the implicit bias of neural features under gradient descent training, we built a connection between the max-margin implicit bias and the neural collapse phenomenon and studied the ULPM. We proved that the gradient flow of the ULPM converges to a KKT point of a minimum-norm separation problem whose global optimum satisfies the neural collapse conditions. Although the ULPM is non-convex, we showed that it has a benign landscape in which all stationary points are strict saddle points in the tangent space, except for the global neural collapse solutions. Our study helps to demystify the neural collapse phenomenon, which sheds light on the generalization and robustness of deep networks during the terminal phase of training in classification problems.

5.2 RELATIONSHIP WITH OTHER RESULTS ON NEURAL COLLAPSE

Theoretical analyses of neural collapse were first provided by Lu & Steinerberger (2020); Wojtowytsch & E (2020); Fang et al. (2021), who showed that the neural collapse solution is the only global minimum of a simplified non-convex objective function. In particular, Wojtowytsch & E (2020); Lu & Steinerberger (2020) studied a continuous integral form of the loss function and showed that the learned features should be uniformly distributed on the sphere. A more realistic discrete setting was studied in Fang et al. (2021), where the constraint is placed on the entire feature matrix rather than on individual features. Our result utilizes the implicit bias of the cross-entropy loss function to remove the feature norm constraint, which is not practical in real applications. Although the global optimum can be fully characterized by the neural collapse conditions, the ULPM objective is still highly non-convex. Regarding optimization, Mixon et al. (2020); Poggio & Liao (2020); Han et al. (2021) analyzed the unconstrained feature model with the ℓ2 loss and established convergence results for collapsed features under gradient descent. However, these results do not generalize to the more practical cross-entropy loss used in classification tasks: the analysis relies heavily on the ℓ2 loss to obtain a closed-form gradient flow and still requires additional approximations to guarantee global convergence. The most relevant study is the concurrent work (Zhu et al., 2021), which provides a landscape analysis of the regularized unconstrained feature model. Zhu et al.
(2021) turns the feature norm constraint in Fang et al. (2021) into feature norm regularization and still preserves the neural collapse global optimum. At the same time, it shows that the modified regularized objective shares a benign landscape, where all the critical points are strict saddles except for the global one. Although our study and Zhu et al. (2021) discover similar landscape results, we believe our characterization remains closer to the real algorithms since we do not introduce any constraints or regularization on the feature norm following the conventional setting in realist training. The regularization of features introduced in Zhu et al. (2021) is still different from weight decay regularization (Krogh & Hertz, 1992). However, weight decay on homogeneous neural networks is equivalent to gradient descent with scaling step size on an unregularized objective (Li & Arora, 2019; Zhang et al., 2018). Moreover, neural networks are found to perform well in the absence of weight decay, which highlights the importance of implicit regularization (Zhang et al., 2021). As a result, instead of explicit feature norm constraint/regularization, in this study, we consider implicit regularization brought by the gradient flow on cross-entropy. We show that implicit regularization is sufficient to lead the dynamics to converge to the neural collapse solution without the explicit regularization. ACKNOWLEDGMENTS We are grateful to Qing Qu and X.Y. Han for helpful discussions and feedback on an early version of the manuscript. Wenlong Ji is partially supported by the elite undergraduate training program of the School of Mathematical Sciences at Peking University. Yiping Lu is supported by the Stanford Interdisciplinary Graduate Fellowship (SIGF). Published as a conference paper at ICLR 2022 Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, and Ji Xu. When does preconditioning help or hurt generalization? ar Xiv preprint ar Xiv:2006.10732, 2020. Raman Arora, Sanjeev Arora, Joan Bruna, Nadav Cohen, Rong Ge, Suriya Gunasekar, Chi Jin, Jason Lee, Tengyu Ma, Behnam Neyshabua, and Zhao Song. Theory of deep learning. https://www.cs.princeton.edu/courses/archive/fall19/cos597B/ lecnotes/bookdraft.pdf/. Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. ar Xiv preprint ar Xiv:1905.13655, 2019. Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. ar Xiv preprint ar Xiv:2005.14165, 2020. Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pp. 1305 1338. PMLR, 2020. Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature. ar Xiv preprint ar Xiv:1812.01718, 2018. J. Dutta, K. Deb, Rupesh Tulshyan, and Ramnik Arora. Approximate kkt points and a proximity measure for termination. Journal of Global Optimization, 56:1463 1499, 2013. C. Fang, H. He, Q. Long, and W. Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences (in press), 2021. Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points online stochastic gradient for tensor decomposition. 
In Conference on learning theory, pp. 797 842. PMLR, 2015. Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. ar Xiv preprint ar Xiv:1605.07272, 2016. Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. ar Xiv preprint ar Xiv:1904.13262, 2019. Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, and Tom Goldstein. Truth or backpropaganda? an empirical investigation of deep learning theory. ar Xiv preprint ar Xiv:1910.00359, 2019. Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. ar Xiv preprint ar Xiv:1506.07540, 2015. X. Y. Han, Vardan Papyan, and David L. Donoho. Neural collapse under mse loss: Proximity to and dynamics on the central path, 2021. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 8580 8589, 2018. Ziwei Ji, Miroslav Dudík, Robert E Schapire, and Matus Telgarsky. Gradient descent follows the regularization path for general losses. In Conference on Learning Theory, pp. 2109 2136. PMLR, 2020. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Published as a conference paper at ICLR 2022 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097 1105, 2012. Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950 957, 1992. Yenson Lau, Qing Qu, Han-Wen Kuo, Pengcheng Zhou, Yuqian Zhang, and John Wright. Short-andsparse deconvolution a geometric approach. ar Xiv preprint ar Xiv:1908.10959, 2019. Thomas Laurent and James Brecht. Deep linear networks with arbitrary loss: All local minima are global. In International conference on machine learning, pp. 2902 2907. PMLR, 2018. Yann Le Cun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Yann Le Cun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436 444, 2015. Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on learning theory, pp. 1246 1257. PMLR, 2016. Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pp. 2 47. PMLR, 2018. Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning. ar Xiv preprint ar Xiv:1910.07454, 2019. Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. Pde-net: Learning pdes from data. In International Conference on Machine Learning, pp. 3208 3216. PMLR, 2018. Jianfeng Lu and Stefan Steinerberger. Neural collapse with cross-entropy loss. ar Xiv preprint ar Xiv:2012.08465, 2020. Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. ar Xiv preprint ar Xiv:1906.05890, 2019. 
Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. ar Xiv preprint ar Xiv:2011.11619, 2020. Mor Shpigel Nacson, Suriya Gunasekar, Jason Lee, Nathan Srebro, and Daniel Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pp. 4683 4692. PMLR, 2019a. Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3420 3428. PMLR, 2019b. Ioannis Panageas, Georgios Piliouras, and Xiao Wang. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. ar Xiv preprint ar Xiv:1906.07772, 2019. Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40): 24652 24663, 2020. Tomaso Poggio and Qianli Liao. Explicit regularization and implicit bias in deep network classifiers trained with the square loss. ar Xiv preprint ar Xiv:2101.00072, 2020. Qing Qu, Yuexiang Zhai, Xiao Li, Yuqian Zhang, and Zhihui Zhu. Analysis of the optimization landscapes for overcomplete representation learning. ar Xiv preprint ar Xiv:1912.02427, 2019. Noam Razin and Nadav Cohen. Implicit regularization in deep learning may not be explainable by norms. ar Xiv preprint ar Xiv:2005.06398, 2020. Published as a conference paper at ICLR 2022 Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ar Xiv preprint ar Xiv:1409.1556, 2014. Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1): 2822 2878, 2018. Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? ar Xiv preprint ar Xiv:1510.06096, 2015. Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere i: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853 884, 2016. Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131 1198, 2018. Sharan Vaswani, Reza Babanezhad, Jose Gallego, Aaron Mishkin, Simon Lacoste-Julien, and Nicolas Le Roux. To each optimizer a norm, to each norm its generalization. ar Xiv preprint ar Xiv:2006.06821, 2020. G Alistair Watson. Characterization of the subdifferential of some matrix norms. Linear algebra and its applications, 170:33 45, 1992. Colin Wei, Jason Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. 2018. Stephan Wojtowytsch and Weinan E. On the emergence of tetrahedral symmetry in the final and penultimate layers of neural network classifiers. ar Xiv preprint ar Xiv:2012.05420, 2020. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021. Linfeng Zhang, Jiequn Han, Han Wang, Roberto Car, and Weinan E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. 
Physical Review Letters, 120(14):143001, 2018. Yuqian Zhang, Qing Qu, and John Wright. From symmetry to geometry: Tractable nonconvex problems. ar Xiv preprint ar Xiv:2007.06753, 2020. Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. ar Xiv preprint ar Xiv:2105.02375, 2021. Published as a conference paper at ICLR 2022 A ELEMENTS OF OPTIMIZATION In this section, we introduce some basic definitions and theory about optimization. In the following discussion, we consider a standard form inequality constrained optimization problem: min x Rdf(x) s.t. gi(x) 0, i [n]. (A.1) In addition, we assume all of those functions f and gi are twice differentiable. A point x Rd is said to be feasible if and only if it satisfies all of the constraints in (A.1), i.e., gi(x) 0, i [n]. And the Lagrangian of problem (A.1) is defined as following: L(x, λ) = f(x) + i=1 λigi(x). A.1 KARUSH KUHN TUCKER CONDITIONS Now let s first introduce the definition of Karush Kuhn Tucker (KKT) point and approximate KKT point. Here we follows the definition of (ϵ, δ)-KKT point as in Lyu & Li (2019). Definition A.1 (Definition of KKT point). A feasible point is said to be KKT point of problem (A.1) if there exist λ1, λ2, , λn 0 such that the following Karush Kuhn Tucker (KKT) conditions hold: 1. f(x) + Pn i=1 λi gi(x) = 0 2. λigi(x) = 0, i [n] Definition A.2 (Definition of (ϵ, δ)-KKT point). For any ϵ, δ > 0, a feasible point of (A.1) is said to be (ϵ, δ)-KKT point of problem (A.1) if there exist λ1, λ2, , λn 0 such that: 1. || f(x) + Pn i=1 λi gi(x)|| ϵ 2. λigi(x) δ, i [n] Generally speaking, KKT conditions might not be necessary for global optimality. We need some additional regular conditions to make it necessary. For example, as shown in Dutta et al. (2013) we can require the problem to satisfy the following Mangasarian-Fromovitz constraint qualification (MFCQ): Definition A.3 (Mangasarian-Fromovitz constraint qualification (MFCQ) ). For a feasible point x of (A.1), problem (A.1) is said to satisfy (MFCQ) at x if there exist a vector v Rd such that: xgi(x), v > 0, i [n]. (A.2) Moreover, when MFCQ holds we can build a connection between approximate KKT point and KKT point, see detailed proof in Dutta et al. (2013): Theorem A.1 (Relationship between Approximate KKT point and KKT point). Let xk Rd : k N be a sequence of feasible points of (A.1) , {ϵk > 0 : k N} and {δk > 0 : k N} be two sequences of real numbers such that xk is an (ϵk, δk)-KKT point for every k, and ϵk 0, δk 0. If xk x as k + . If MFCQ (A.3) holds at x, then x is a KKT point of (P). B OMITTED PROOFS FROM SECTION 3.1 Recall our ULPM problem: min W ,H L(W , H) = exp(w k hk,i) PK j=1 exp(w j hk,i) Published as a conference paper at ICLR 2022 Let s review some basic notations defined in the main body. Let sk,i,j = w k hk,i w j hk,i, k [K], i [n], j [K], the margin of a single feature hk,i is defined to be qk,i(W , H) := minj =k sk,i,j = w k hk,i maxj =k w j hk,i. We define the margin of entire dataset as qmin = qmin(W , H) = mink [1,K],i [1,n] qk,i(W , H). We first prove Theorem 3.1 in the mainbody. Proof of Theorem 3.1. First we can find that the margin will not change if we minus a vector a for all wj, so if we denote the mean of classifier wi = wi 1 K PK i=1 wi and then we have w k hk,i maxj =k w j hk,i = w k hk,i maxj =k w j hk,i qmin(W , H), that is: w k hk,i w j hk,i qmin(W , H), j = k [K], i [n]. 
(B.1) Note that PK j=1 w j hk,i = 0 then sum this inequality over j we have: (K 1) w k hk,i X j =k w j hk,i = K w k hk,i (K 1)qmin(W , H), k [K], i [n]. By Cauchy inequality, we have: 1 2( 1 n|| wk||2 2 + n||hk,i||2 2) w k hk,i K 1 K qmin(W , H). (B.2) Sum (B.2) over k and i we have: 1 2 n(|| W ||2 F + ||H||2 F ) n(K 1)qmin(W , H). (B.3) On the other hand, we know that: || W ||2 F = i=1 wi||2 2 i=1 ||wi||2 2 = ||W ||2 F . Then we can conclude that: qmin(W , H) W 2 F + H 2 F 2(K 1) n as desired. When the equality holds, first we have || W ||2 F = ||W ||2 F which is equivalent to 1 K PK i=1 wi = 0, wi = wi. Take it back into (B.3), then we must have all of the equality holds in (B.2) and (B.1), which give us: wk = nhk,i, ||wk||2 2 = n||hk,i||2 2 = ||W ||2 F + ||H||2 F 2K . Take this into (B.1) we have: hk,i = hk,i , h k,ihj,i = w k wj = ||W ||2 F + ||H||2 F 2K(K 1) n , which implies neural collapse conditions. Now let s turn to the training dynamics and prove Theorem 3.2. Before starting the proof, we need to introduce some additional notations. Since the W and H are all optimization variables here, we denote θ = vec(W , H) as the whole parameter for simplicity and all of previous function can be defined on θ by matching the corresponding parameters. Denote ρ = θ as the norm of θ and γ = log(e L(θ) 1) ρ2 . Now we can state our first lemma to show how training dynamics of gradient flow on ULPM objective (B) is related to a KKT point of (3.1). Lemma B.1. If there exist a time t0 such that L(θ(t0)) < log 2, then for any t > t0 θ := θ/qmin(θ)1/2 is an (ϵ, δ) - approximate KKT point of the following minimum-norm separation problem. More precisely, we have γ(t) , δ = K2(K 1)n 2e γ(t)qmin(t), where: β = θ ||θ||2 , dθ is the angle between θ and its corresponding gradient. Published as a conference paper at ICLR 2022 Proof. The training dynamics is given by gradient flow: Then by the chain rule we have: It indicates that the loss function L is monotonically decreasing. If L(θ(t0)) < log 2, we have L(θ(t)) < log 2, t > t0. On the other hand, note that i=1 log(1 + X j =k e sk,i,j(t)) log(1 + exp( qmin(t))), (B.5) which gives us qmin(t) > 0, t > t0. dt , note that we can rewrite the ULPM objective function (B) as L(θ) = PK k=1 Pn i=1 log(1+ P j =k e sk,i,j). By the chain rule and the gradient flow equation we have 1 + P l =k e sk,i,l gk,i,j, where gk,i,j is the gradient of sk,i,j, i.e. gk,i,j = θsk,i,j(θ). Now let gk,i,j = gk,i,j/q1/2 min = θsk,i,j( θ) and construct λk,i,j = ρ ||g||2 e sk,i,j l =k e sk,i,l , we only need to show: j =k λk,i,j gk,i,j||2 2 1 β j =k λk,i,j(sk,i,j( θ) 1) K2(K 1)n 2eqmin γ . (B.7) To prove (B.6), we only need to compute (Recall that θ = θ/qmin(θ)1/2): j =yn λk,i,j gk,i,j||2 2 = ρ2 qmin || θ ||θ||2 g ||g||2 ||2 2 = ρ2 qmin (2 2β). γ = log(e L(θ) 1) ρ2 , L(θ) = n=1 log(1 + X j =yn e snj) log(1 + exp( qmin)). (B.8) Then we have the following inequality: Take this back into (B.8) we have (B.6) as desired. To prove (B.7), first by our construction: j =k λk,i,j(sk,i,j( θ) 1) = ρ qmin||g||2 1 + P l =k e sk,i,l (sk,i,j qmin). 
Published as a conference paper at ICLR 2022 Note that ||g||2 D g, θ ||θ||2 ρ g, θ and gk,i,j, θ = 2sk,i,j since sk,i,j = w k hk,i w j hk,i, we have: l =k e sk,i,l gk,i,j, θ l =k e sk,i,l sk,i,j j =k e sk,i,j (since sk,i,j qmin > 0 and e sk,i,l 1) (B.11) Take this inequality back into the (B.10) we have: j =k λk,i,j(sk,i,j( θ) 1) Kρ2 eqmin sk,i,j l =k e sk,i,l (sk,i,j qmin) j =k eqmin sk,i,j(sk,i,j qmin) j =k eqmin sk,i,j(sk,i,j qmin) Where the last inequality is obtained from the fact xe x 1 e, x > 0, which can be proved by some elementary calculus. Based on Lemma B.1, we have shown that the (W , H) will be a (ϵ, δ)-KKT point, if we can show (ϵ, δ) converge to zero, then by Theorem A.1 we know the limit point will be along the direction of a KKT point. Ignoring the constant term, we only have to show how γ(t), β(t) and qmin(t) evolve along time. Now we provide the following lemmas to illustrate their dynamics. The first lemma aims at proving that the norms of parameter ρ(t) and γ(t) are monotonically increasing. Lemma B.2. If there exist t0 such that L(θ(t0)) < log 2, then for any t > t0 we have: dt > 0, d γ dt 0. (B.12) Proof. We can disentangle the whole training dynamics into the following two parts: the radial part: v := ˆθˆθ dθ the tangent part: u = (I ˆθˆθ ) dθ First analyze the radial part, by the chain rule: ||v||2 = |ˆθ dθ dt |. For dρ2 dt , we have the following equation: l =k e sk,i,l sk,i,j, (B.13) Published as a conference paper at ICLR 2022 where the last equality holds by equation (B.11).Then when t > t0, we have shown that qmin(t) 0, combine this with the fact sk,i,j qmin we obtain the first inequality in (B.12) l =k e sk,i,l sk,i,j l =k e sk,i,l qmin 0. Next, we aim to prove the monotonicity of γ(t), compute the derivative of γ(t) we have: γ = log(e L(θ) 1) ρ2 , d dt log γ = d dt(log( log(e L(θ) 1)) 2 log ρ), (B.15) d dt log( log(e L(θ) 1)) = 1 log(e L(θ) 1) e L(θ) e L(θ) 1 d L(θ) e L(θ) 1. (B.16) Recall that we have: l =k e sk,i,l sk,i,j j =k e sk,i,j j =k e sk,i,j qmin i=1 log(1 + X j =k e sk,i,j) 1 log(1 + P j =k e sk,i,j) P j =k e sk,i,j j =k e sk,i,j qmin i=1 log(1 + X j =k e sk,i,j) e L(θ) 1 L(θ)e L(θ) qmin e L(θ) qmin. Then second last line is because the definition of the loss function L(θ) = PK k=1 Pn i=1 log(1 + P j =k e sk,i,j) log(1 + P j =k e sk,i,j) and the monotonicity of ex 1 xex (in fact, d dx ex 1 e x(x ex+1) x2 0, x > 0). As a result, we notice that 1 2 d dt log γ(t) 1 At the same time, we notice that v 2 = 1 ρ2 1 2 dρ2 dt log ρ on the one hand, and by the chain rule: d dt ˆθ = 1 Combine this with the radial term: 2 = v 2 + u 2 = 1 dt log ρ + ρ2 dˆθ dt on both sides, we have dt log ρ + d dt log ρ 1 dˆθ Published as a conference paper at ICLR 2022 d dt log ρ + d dt log ρ 1 dˆθ 2(e L(θ) 1). Now by equation (B.15) and (B.16) we obtain: 1 2 d dt log γ d L(θ) 2 e L(θ) 1 d dt log ρ 1 dˆθ By (B.14) we know the ρ is monotonically increasing, we have d dt log ρ > 0 and then we get the second inequality in (B.12). Lemma B.2 gives us the monotonicity of γ, note that since the loss function L(θ(t0)) < log 2, we have γ(t) γ(t0) > 0, then we can treat γ(t) in Lemma B.1 as a positive constant. The remaining work is to show qmin(t) grows to infinity and β(t) 1. To show qmin(t) , it s equivalent to show L(t) 0 and we have the following lemma: Lemma B.3. If there exist t0 such that L(θ(t0)) < log 2, then L(θ(t)) 0 and qmin(θ(t)) as t , moreover we have the convergence rate L(θ(t)) = O(1/t) Proof. 
By (B.4) and (B.13), the evolution of loss function L(θ) can be written as: dt , θ ||θ||2 l =k e sk,i,l sk,i,j)2. Combine it with (B.5),(B.14) and (B.17) we have: 1 + P l =k e sk,i,l sk,i,j 2e L(θ) 1 e L(θ) qmin 2e L(θ) 1 e L(θ) log(e L(θ) 1), which indicates: d L(θ(t)) dt 4(e L(θ(t)) 1 e L(θ(t)) log(e L(θ(t)) 1))2. (B.20) Since 0 < L(θ(t)) < L(θ(t0)) < log 2, note that: x ) > 0, lim x 0 ex 1 which implies: 1 < e L(θ(t)) 1 L(θ(t)) < 1 log(2), (B.21) On the other hand, we can find that: log(e L(θ(t)) 1) < log(e L(θ(t0)) 1) < 0, 1 < e L(θ(t)) < 2 (B.22) Combine equation (B.21) and (B.22) together, one can conclude that there exist a constant C > 0 such that for t > t0: d L(θ(t)) dt C(L(θ(t)))2 It further implies that: dt( 1 L(θ(t))), then integral on both side we have: C(t t0) < 1 L(θ(t)) 1 L(θ(t0)), and L(θ(t)) = O(1/t). Thus we must have L(θ(t)) 0 and combine this with qmin log(e L(θ(t)) 1) we know qmin(θ(t)) as desired. Published as a conference paper at ICLR 2022 To bound β(t), we first need a useful lemma to bound the changes of the direction of θ. Lemma B.4. If there exist t0 such that L(θ(t0)) < log 2, then for any t > t0 dˆθ 1 γ(t0) d dt log ρ. Proof. First we know that: dˆθ (I ˆθˆθ )dθ l =k e sk,i,l gk,i,j l =k e sk,i,l gk,i,j . (B.23) Recall our gk,i,j = sk,i,j θ and sk,i,j = w k hk,i w j hk,i, we can find that gk,i,j 2ρ. On the other hand, Combine it with (B.13) and (B.9) we have: dˆθ d dt log ρ 1 γ d dt log ρ 1 γ(t0) d dt log ρ as desired. Where the second inequality holds by multiple ρ qmin on the right hand of (B.23) and the formula of dρ2 dt in equation (B.13), and the last inequality holds since we have shown that γ(t) is monotonically increasing in (B.2) and γ(t0) > 0 since L(t0) < log 2. Let s turn back to β, though we can not show directly that it increase to one, we can find a sequence of time {tm} for each limit point such that β(tm) 1 Lemma B.5. If there exist t0 such that L(θ(t0)) < log 2, then for every limit point θ of {ˆθ(t) : t 0}, there exists a sequence of time {tm > 0 : m N} such that tm , ˆθ (tm) θ, and β (tm) 1. Proof. Recall that in equation (B.19) we have shown that: d dt log γ 2 d dt log ρ 1 dˆθ dt log ρ = 1 dt = 1 2ρ2 dρ2 dt = 1 ρ2 θ, dθ ρ(I ˆθˆθ )dθ Plug them into (B.24) we have: d dt log γ 2 dt 2 D ˆθ, dθ dt E2 d dt log ρ = 2(β 2 1) d For any t2 > t1 > t0, integrate both sides from time t2 to t1 we have: log γ(t2) log γ(t1) 2 R t2 t1 (β(t) 2 1) d dt log ρdt. By the continuity of β we know there exist a time t such that: log γ(t2) log γ(t1) 2 Z t2 t1 (β(t) 2 1) d = 2(β(t ) 2 1) Z t2 d dt log ρdt = 2(β(t ) 2 1)(log ρ(t2) log ρ(t1)). Published as a conference paper at ICLR 2022 By (B.9) we know that γ qmin ρ2 and the right hand is bounded, and the γ is monotonically increasing, then there exist γ such that γ(t) γ . Now we are ready to construct the sequence of tm, first take a sequence of {ϵm > 0, m N} such that ϵm 0. We construct tm by induction, suppose we have already find t1 < t2 < < tm 1 satisfy our requirement, since θ is a limit point of {ˆθ(t) : t > 0}, then we can find a time sm such that: ˆθ(sm) θ ϵm, log γ γ(sm) ϵ3 m. By the monotonicity and continuity of ρ we can find a time s m such that log ρ(s m) log ρ(sm) ϵm. Take t2 = s m, t1 = sm in (B.25), there exist a time tm such that: 2(β(tm) 2 1) log γ(t2) log γ(t1) log ρ(t2) log ρ(t1) ϵ2 m (B.26) on the other hand, by Lemma B.4 we have: ˆθ(tm) θ ˆθ(sm) θ + ˆθ(sm) ˆθ(tm) ϵm + 1 γ(t0)(log ρ(tm) log ρ(sm)) (1 + 1 γ(t0))ϵm. (B.27) Note that θ, dθ dt > 0, then by definition we know β > 0. 
Combine equations (B.26) and (B.27) we have β(tm) 1 and ˆθ(tm) θ as desired. Now we are ready to prove Theorem 3.2: Proof of Theorem 3.2. By Lemma B.1, we know that once t > t0, (W (t), H(t))/qmin(W (t), H(t)) is an ( q γ(t) , K2(K 1)n 2 γ(t)qmin(t)) -approximate KKT point. We have shown that γ(t) > γ(t0) > 0 in Lemma B.2, qmin in Lemma B.3 and from Lemma B.5 we know for any limit point ( W , H) of {( ˆ H(t), ˆ W (t)) := ( H(t) W (t) 2 F + H(t) 2 F , W (t) W (t) 2 F + H(t) 2 F )}, there exists a sequence of time {tm > 0 : m N} such that tm , β(tm) 1 and ( ˆ H(tm), ˆ W (tm)) ( W , H). Then ( W , H) is along the direction of a limit point of a sequence of (ϵ, δ)-approximate KKT point with ϵ, δ 0. On the other hand, we can verify that the problem (3.1) satisfies MCFQ (A.3) by simply setting v = θ, then: sk,i,j, θ = 2sk,i,j 0. Now by Theorem A.1 we know ( W , H) is along the direction of a KKT point of problem (3.1) Theorem 3.2 characterize the convergent behaviour of gradient flow, under separable conditions the limit point is along the direction of a KKT point of (3.1), next we show that the global minimum of (3.1) must satisfy neural collapse conditions by proving Corollary 3.1: Proof of Corollary 3.1. Since we have shown that the problem (3.1) satisfy MCFQ, then the KKT conditions are necessary for global optimality, we only need to show the global optimum satisfies neural collapse conditions. First the constraints in (3.1) can be transformed to be a single constraint by the definition of margin: k = j [K], i [n], w k hk,i w j hk,i 1. qmin(W , H) 1. (B.28) Note that the margin is homogeneous: qmin(αW , αH) = α2qmin(W , H), α R. Then for any point (W , H) satisfies qmin(W , H) > 0, after an appropriate scaling α, (αW , αH), α2 1/qmin(W , H) is feasible for (3.1). Take optimum among all scaling factor α we know the minimum norm is attained if and only if α2 = 1/qmin(W , H). And the optimum norm is: 1 2||αW ||2 F + 1 2||αH||2 F = 1 2qmin(W , H)(||W ||2 F + ||H||2 F ). Published as a conference paper at ICLR 2022 Then by Theorem 3.1 we have: 1 2qmin(W , H)(||W ||2 F + ||H||2 F ) 2(K 1) n. And the global optimum is attained only when (W , H) satisfies neural collapse conditions C OMITTED PROOFS FROM SECTION 3.2 To begin with, let s finish the computation in the motivating example (Example 3.1) Proof in Example 3.1. Consider the case where K = 4, n = 1, let (W , H) be the following point: 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 One can easily verify that this (W , H) enables our model to classify all of the features perfectly. Further more, we can show it is along the direction of a KKT point of the minimum-norm separation problem (3.1) by construct the Lagrangian multiplier Λ = (λij)K i,j=1 as following: 0 0 1 2 1 2 0 0 1 2 1 2 1 2 1 2 0 0 1 2 1 2 0 0 To see this, just write down the corresponding Lagrangian (note that to make it to be a true KKT point of (3.1), one needs to multiple 1/ 2 on W , H): L(W , H, Λ) = 1 4 W 2 F + 1 j =i λi,j(1 Simply take derivatives for W , H and Λ we can find it satisfies KKT conditions. On the other hand, the gradient of (W , H) is W L(W , H) = HL(W , H) = 2 + 2e 2 2 + 2e 2 + 2e2 1 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 We can find that the directions of gradient and the parameter align with each other (i.e., W is parallel to W L(W , H), and H is parallel to HL(W , H)), which implies that simple gradient descent may gets stuck in this direction and only grows the parameters norm. 
However, if we construct
$$W' = H' = \frac{1}{\sqrt{1+2\alpha^2}}\begin{pmatrix} 1+\alpha & -1+\alpha & -\alpha & -\alpha \\ -1+\alpha & 1+\alpha & -\alpha & -\alpha \\ -\alpha & -\alpha & 1+\alpha & -1+\alpha \\ -\alpha & -\alpha & -1+\alpha & 1+\alpha \end{pmatrix},$$
then $\|W'\|_F = \|W\|_F$, $\|H'\|_F = \|H\|_F$, and $\|W' - W\|_F^2 + \|H' - H\|_F^2 \to 0$ as $\alpha \to 0$. First we can compute
$$W'H' = \frac{1}{1+2\alpha^2}\begin{pmatrix} 2+4\alpha^2 & 4\alpha^2-2 & -4\alpha^2 & -4\alpha^2 \\ 4\alpha^2-2 & 2+4\alpha^2 & -4\alpha^2 & -4\alpha^2 \\ -4\alpha^2 & -4\alpha^2 & 2+4\alpha^2 & 4\alpha^2-2 \\ -4\alpha^2 & -4\alpha^2 & 4\alpha^2-2 & 2+4\alpha^2 \end{pmatrix},$$
which gives
$$L(W', H') = -4\log\frac{e^{2}}{e^{2} + e^{\frac{2(2\alpha^2-1)}{1+2\alpha^2}} + 2e^{\frac{-4\alpha^2}{1+2\alpha^2}}}.$$
Our aim is to show that for any $\epsilon > 0$ there exists $\alpha$ with $|\alpha| < \epsilon$ such that $L(W', H') < L(W, H)$. By the formulation of $L(W', H')$, it suffices to show that
$$f(\alpha) := e^{\frac{2(2\alpha^2-1)}{1+2\alpha^2}} + 2e^{\frac{-4\alpha^2}{1+2\alpha^2}} < f(0).$$
Taking the derivative of $f(\alpha)$, we have
$$f'(\alpha) = \frac{16\alpha}{(1+2\alpha^2)^2}\Big(e^{\frac{2(2\alpha^2-1)}{1+2\alpha^2}} - e^{\frac{-4\alpha^2}{1+2\alpha^2}}\Big),$$
so $f'(0) = 0$ and $f''(0) = 16\big(\tfrac{1}{e^2} - 1\big) < 0$. Since $f(\alpha)$ is twice continuously differentiable, we conclude that for any $\epsilon > 0$ we can choose an appropriate $\alpha$ such that
$$\|W'\|_F^2 = \|W\|_F^2, \quad \|H'\|_F^2 = \|H\|_F^2, \quad \|W' - W\|_F^2 + \|H' - H\|_F^2 < \epsilon, \quad L(W', H') < L(W, H).$$

Now we prove Theorem 3.3 using strategies similar to those used in proving Theorem 3.1.

Proof of Theorem 3.3. Again we rewrite the ULPM objective by introducing $s_{k,i,j} = w_k^\top h_{k,i} - w_j^\top h_{k,i}$:
$$L(W, H) = \sum_{k=1}^K\sum_{i=1}^n \log\Big(1 + \sum_{j\neq k}\exp(-s_{k,i,j})\Big).$$
In addition, centering the classifiers does not change the value of $s_{k,i,j}$: let $\tilde w_i = w_i - \frac{1}{K}\sum_{k=1}^K w_k$ for $i \in [K]$; then $s_{k,i,j} = w_k^\top h_{k,i} - w_j^\top h_{k,i} = \tilde w_k^\top h_{k,i} - \tilde w_j^\top h_{k,i}$ and $\sum_{i=1}^K \tilde w_i = 0$. First, by the strict convexity of $e^x$ and Jensen's inequality,
$$\sum_{k=1}^K\sum_{i=1}^n \log\Big(1 + \sum_{j\neq k} e^{-s_{k,i,j}}\Big) \ge \sum_{k=1}^K\sum_{i=1}^n \log\Big(1 + (K-1)\exp\Big(-\frac{1}{K-1}\sum_{j\neq k}s_{k,i,j}\Big)\Big) = \sum_{k=1}^K\sum_{i=1}^n \log\Big(1 + (K-1)\exp\Big(-\frac{K\,\tilde w_k^\top h_{k,i}}{K-1}\Big)\Big), \qquad (C.1)$$
where the last equality is obtained from
$$\sum_{j\neq k}s_{k,i,j} = \sum_{j\neq k}\big(\tilde w_k^\top h_{k,i} - \tilde w_j^\top h_{k,i}\big) = (K-1)\,\tilde w_k^\top h_{k,i} - \sum_{j\neq k}\tilde w_j^\top h_{k,i} = K\,\tilde w_k^\top h_{k,i}.$$
Now, again by the strict convexity of $x \mapsto \log(1 + (K-1)\exp(-x))$ and Jensen's inequality, we have
$$\sum_{k=1}^K\sum_{i=1}^n \log\Big(1 + (K-1)\exp\Big(-\frac{K\,\tilde w_k^\top h_{k,i}}{K-1}\Big)\Big) \ge nK\log\Big(1 + (K-1)\exp\Big(-\frac{1}{n(K-1)}\sum_{k=1}^K\sum_{i=1}^n \tilde w_k^\top h_{k,i}\Big)\Big) \ge nK\log\Big(1 + (K-1)\exp\Big(-\frac{1}{2n(K-1)}\sum_{k=1}^K\sum_{i=1}^n\Big(\frac{1}{\sqrt n}\|\tilde w_k\|^2 + \sqrt n\,\|h_{k,i}\|^2\Big)\Big)\Big) \ge nK\log\Big(1 + (K-1)\exp\Big(-\frac{\|W\|_F^2 + \|H\|_F^2}{2\sqrt n\,(K-1)}\Big)\Big), \qquad (C.2)$$
where the last inequality holds since $\|\tilde W\|_F^2 = \sum_{k=1}^K\|\tilde w_k\|^2 = \sum_{k=1}^K\|w_k\|^2 - \frac{1}{K}\big\|\sum_{k=1}^K w_k\big\|^2 \le \sum_{k=1}^K\|w_k\|^2 = \|W\|_F^2$. When all of the above inequalities hold with equality, we must have:
1. $\sum_{i=1}^K w_i = 0$, i.e., $\tilde w_i = w_i$ (the last inequality in (C.2));
2. $\tilde w_k = \sqrt n\, h_{k,i}$ for all $i \in [n]$ (the third inequality in (C.2));
3. $\|\tilde w_k\| = \|\tilde w_{k'}\|$ and $\|h_{k,i}\| = \|h_{k',j}\|$ for all $k, k' \in [K]$, $i, j \in [n]$ (the second inequality in (C.2));
4. $s_{k,i,j} = \tilde w_k^\top h_{k,i} - \tilde w_j^\top h_{k,i} = \frac{K}{K-1}\tilde w_k^\top h_{k,i}$ for all $k \neq j \in [K]$, $i \in [n]$ (the first inequality in (C.1)).
These four conditions are exactly equivalent to the neural collapse conditions, and moreover $\|W\|_F = \|H\|_F$.
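As a complement to the proof above, the following is a minimal numerical sanity check of the lower bound, written as an illustrative sketch: the dimensions, the scale $c$, and the random seed are arbitrary choices rather than quantities from the paper. It evaluates the ULPM objective at a random point and at a simplex-ETF configuration with $w_k = \sqrt n\, h_{k,i}$, confirming that the bound in (C.2) holds in general and is attained at neural collapse.

```python
import numpy as np

# Sanity check of the lower bound in the proof of Theorem 3.3 (illustrative only).
rng = np.random.default_rng(0)
K, n, d = 4, 3, 6   # classes, samples per class, feature dimension (arbitrary)

def ulpm_loss(W, H):
    # Columns of H are ordered class-major: column k*n + i holds h_{k,i}.
    total = 0.0
    for k in range(K):
        for i in range(n):
            logits = W @ H[:, k * n + i]
            margins = logits[k] - np.delete(logits, k)      # s_{k,i,j} for j != k
            total += np.log(1.0 + np.sum(np.exp(-margins)))
    return total

def lower_bound(W, H):
    sq = np.sum(W ** 2) + np.sum(H ** 2)
    return n * K * np.log(1.0 + (K - 1) * np.exp(-sq / (2.0 * np.sqrt(n) * (K - 1))))

# Random point: the bound should hold strictly.
W = rng.normal(size=(K, d))
H = rng.normal(size=(d, K * n))
print(ulpm_loss(W, H) >= lower_bound(W, H))                  # expected: True

# Neural-collapse point: simplex ETF directions, h_{k,i} = c * m_k and w_k = sqrt(n) * c * m_k.
M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)  # K unit-norm ETF directions in R^K
M = np.concatenate([M, np.zeros((K, d - K))], axis=1)         # embed into R^d
c = 0.7                                                       # arbitrary scale
H_nc = np.repeat((c * M).T, n, axis=1)
W_nc = np.sqrt(n) * c * M
print(np.isclose(ulpm_loss(W_nc, H_nc), lower_bound(W_nc, H_nc)))   # expected: True (bound is tight)
```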
The global optimality alone is not enough to illustrate how gradient flow converges to neural collapse, since there may exist bad local minima. We therefore provide the following second-order analysis to eliminate spurious local minima. First, define the cross-entropy loss on a matrix $Z \in \mathbb{R}^{K\times nK}$:
$$L(Z) = -\sum_{k=1}^K\sum_{i=1}^n \log\frac{e^{z_{k,i,k}}}{\sum_{l=1}^K e^{z_{k,i,l}}},$$
where $z_{k,i,j}$ denotes the element in the $j$-th row and $[(k-1)n+i]$-th column of $Z$. Then we have $L(W, H) = L(WH)$. Now compute the gradient of $L(Z)$ with respect to each element:
$$\frac{\partial L}{\partial z_{k,i,k}} = -1 + \frac{e^{z_{k,i,k}}}{\sum_{l=1}^K e^{z_{k,i,l}}}, \qquad \frac{\partial L}{\partial z_{k,i,j}} = \frac{e^{z_{k,i,j}}}{\sum_{l=1}^K e^{z_{k,i,l}}}, \quad k \neq j.$$
If $u \in \mathbb{R}^K$ satisfies $u^\top\nabla L(Z) = 0$, denote by $u_p$ the maximum element of $u$; then for every column indexed by $(p, i)$ we have
$$0 = u_p\Big(-1 + \frac{e^{z_{p,i,p}}}{\sum_{l=1}^K e^{z_{p,i,l}}}\Big) + \sum_{q\neq p}u_q\frac{e^{z_{p,i,q}}}{\sum_{l=1}^K e^{z_{p,i,l}}} = -\sum_{q\neq p}(u_p - u_q)\frac{e^{z_{p,i,q}}}{\sum_{l=1}^K e^{z_{p,i,l}}} \le 0, \qquad (C.3)$$
where the last inequality holds with equality if and only if $u_q = u_p$ for all $q \in [K]$. This indicates that the rank of $\nabla L(Z)$ is $K-1$ and $u^\top\nabla L(Z) = 0 \iff u \propto \mathbf{1}$. Now we are ready to prove Theorem 3.4 in the main body.

Proof of Theorem 3.4. First, compute the gradient of the ULPM objective; by the chain rule we have
$$\nabla_W L(W, H) = \nabla L(WH)H^\top, \qquad \nabla_H L(W, H) = W^\top\nabla L(WH).$$
If there exists a vector $(\Delta W, \Delta H) \in T(W, H)$ such that $\langle\nabla_W L(W, H), \Delta W\rangle + \langle\nabla_H L(W, H), \Delta H\rangle \neq 0$, we may assume this quantity is negative (otherwise take the negative direction), and then by the Taylor expansion
$$L(W + \delta\Delta W, H + \delta\Delta H) = L(W, H) + \delta\langle\nabla_W L(W, H), \Delta W\rangle + \delta\langle\nabla_H L(W, H), \Delta H\rangle + O(\delta^2),$$
we know that $(\Delta W, \Delta H)$ satisfies our requirement, i.e., it is a descent direction within the sphere. Now let us discuss the case
$$\langle\nabla_W L(W, H), \Delta W\rangle + \langle\nabla_H L(W, H), \Delta H\rangle = 0, \qquad \forall\, (\Delta W, \Delta H) \in T(W, H).$$
By the definition of $T(W, H)$, it contains all vectors orthogonal to $(W, H)$, so $(\nabla_W L(W, H), \nabla_H L(W, H))$ is parallel to $(W, H)$; that is, there exists $\lambda$ such that
$$\nabla L(WH)H^\top = \lambda W, \qquad W^\top\nabla L(WH) = \lambda H. \qquad (C.4)$$
If there does not exist $(\Delta W, \Delta H)$ satisfying the requirement, then for any feasible curve $\varphi(t) = (W(t), H(t))$ with $\varphi(0) = (W, H)$ on the sphere $S = \{(W', H') : \|W'\|_F^2 + \|H'\|_F^2 = \|W\|_F^2 + \|H\|_F^2\}$, $t = 0$ attains a local minimum of $L(\varphi(t))$ and thus
$$0 \le \frac{d^2}{dt^2}L(\varphi(t))\Big|_{t=0} = \varphi'(0)^\top\nabla^2 L(W, H)\varphi'(0) + \big\langle\nabla L(W, H), \varphi''(0)\big\rangle. \qquad (C.5)$$
On the other hand, since the curve lies on the sphere $S$, denote $h(W, H) = \|W\|_F^2 + \|H\|_F^2$; then $h(\varphi(t))$ must stay constant, and taking the second derivative we have
$$0 = \frac{d^2}{dt^2}h(\varphi(t))\Big|_{t=0} = \varphi'(0)^\top\nabla^2 h(W, H)\varphi'(0) + \big\langle\nabla h(W, H), \varphi''(0)\big\rangle. \qquad (C.6)$$
Summing these two conditions, we have
$$0 \le \frac{d^2}{dt^2}\Big(L(\varphi(t)) + \frac{|\lambda|}{2}h(\varphi(t))\Big)\Big|_{t=0} = \varphi'(0)^\top\nabla^2\Big(L + \frac{|\lambda|}{2}h\Big)(W, H)\varphi'(0) + \Big\langle\nabla\Big(L + \frac{|\lambda|}{2}h\Big)(W, H), \varphi''(0)\Big\rangle. \qquad (C.7)$$
By equation (C.4) we know that $\nabla\big(L + \frac{|\lambda|}{2}h\big)(W, H) = 0$. Note that $\varphi'(0) \in T(W, H)$ since the curve lies on $S$, and for any $(\Delta W, \Delta H) \in T(W, H)$ we can construct a curve $\varphi(t)$ such that $\varphi'(0) = (\Delta W, \Delta H)$. Then (C.7) indicates that for every $(\Delta W, \Delta H) \in T(W, H)$,
$$0 \le (\Delta W, \Delta H)^\top\nabla^2 L(W, H)(\Delta W, \Delta H) + \frac{|\lambda|}{2}(\Delta W, \Delta H)^\top\nabla^2 h(W, H)(\Delta W, \Delta H) = (\Delta W, \Delta H)^\top\nabla^2 L(W, H)(\Delta W, \Delta H) + |\lambda|\big(\|\Delta W\|_F^2 + \|\Delta H\|_F^2\big). \qquad (C.8)$$
When $\lambda = 0$: by equation (C.3) we know that $\|\nabla L(WH)\|_2 > 0$, which gives $|\lambda| < \|\nabla L(WH)\|_2$. When $\lambda \neq 0$: combining the two equations in (C.4), we obtain
$$\lambda\, W^\top W = W^\top\nabla L(WH)H^\top = \lambda\, HH^\top \ \Longrightarrow\ W^\top W = HH^\top, \qquad (C.9)$$
which further implies $\|W\|_F = \|H\|_F$ and $\|W\|_2 = \|H\|_2$. Thus we also have (note that when $W = H = 0$ we can take $\lambda$ to be zero)
$$\nabla L(WH)H^\top = \lambda W \ \Longrightarrow\ |\lambda|\,\|W\|_2 \le \|\nabla L(WH)\|_2\,\|H\|_2 \ \Longrightarrow\ |\lambda| \le \|\nabla L(WH)\|_2.$$
Now, when $|\lambda| < \|\nabla L(WH)\|_2$, we show that this contradicts (C.8). We have shown that the rank of $\nabla L(Z)$ is $K-1$, so by (C.4) and (C.9) there exists a vector $a$ such that $Wa = H^\top a = 0$. Let $u$ and $v$ be the left and right singular vectors corresponding to the largest singular value of $\nabla L(WH)$, and construct $\Delta W = ua^\top$, $\Delta H = -av^\top$; then $(\Delta W, \Delta H) \in T(W, H)$, $\Delta W\, H = W\Delta H = 0$, $\Delta W\,\Delta H = -\|a\|^2\, uv^\top$, and
$$(\Delta W, \Delta H)^\top\nabla^2 L(W, H)(\Delta W, \Delta H) + |\lambda|\big(\|\Delta W\|_F^2 + \|\Delta H\|_F^2\big) = 2\big\langle\nabla L(WH), \Delta W\,\Delta H\big\rangle + |\lambda|\big(\|\Delta W\|_F^2 + \|\Delta H\|_F^2\big) = 2\|a\|^2\big(|\lambda| - u^\top\nabla L(WH)v\big) < 0,$$
which contradicts (C.8). It then only remains to analyze the case $|\lambda| = \|\nabla L(WH)\|_2$. Construct another convex optimization problem:
$$\min_{Z}\ L(Z) + |\lambda|\,\|Z\|_*. \qquad (C.10)$$
Suppose $Z$ has the SVD $Z = U\Sigma V^\top$; the subgradient of $\|Z\|_*$ can be written as (see Watson (1992) for a proof)
$$\partial\|Z\|_* = \Big\{UV^\top + W' : W' \in \mathbb{R}^{K\times nK},\ U^\top W' = 0,\ W'V = 0,\ \|W'\|_2 \le 1\Big\}. \qquad (C.11)$$
On the other hand, writing $WH = U\Sigma V^\top$, we know that
$$\big(H^\top H\big)\big(H^\top H\big) = H^\top W^\top W H = V\Sigma^2 V^\top, \qquad \big(WW^\top\big)\big(WW^\top\big) = W H H^\top W^\top = U\Sigma^2 U^\top,$$
which indicates that $H^\top H = V\Sigma V^\top$ and $WW^\top = U\Sigma U^\top$.
Combining these with (C.4), we have
$$\nabla L(WH)\,H^\top H = \lambda\, WH \ \Longrightarrow\ \nabla L(WH)\,V\Sigma V^\top = \lambda\, U\Sigma V^\top \ \Longrightarrow\ \nabla L(WH)\,V = \lambda U,$$
$$WW^\top\nabla L(WH) = \lambda\, WH \ \Longrightarrow\ U\Sigma U^\top\nabla L(WH) = \lambda\, U\Sigma V^\top \ \Longrightarrow\ U^\top\nabla L(WH) = \lambda V^\top.$$
Since $|\lambda| = \|\nabla L(WH)\|_2$, by (C.11) we know that $-\nabla L(WH) \in |\lambda|\,\partial\|WH\|_*$. Then, by the convexity of (C.10), $WH$ is a global minimizer of (C.10). In addition, we have $\|W\|_F^2 + \|H\|_F^2 = 2\,\mathrm{tr}\,\Sigma = 2\|WH\|_*$. Moreover, previous works (Haeffele & Vidal, 2015) have shown that
$$\|Z\|_* = \min_{Z = W'H'}\ \frac{1}{2}\big(\|W'\|_F^2 + \|H'\|_F^2\big),$$
which is equivalent to $\|W'H'\|_* \le \frac{1}{2}\big(\|W'\|_F^2 + \|H'\|_F^2\big)$ for every factorization. Now, for any $(W', H')$ we know that
$$L(W, H) + \frac{|\lambda|}{2}\big(\|W\|_F^2 + \|H\|_F^2\big) = L(WH) + |\lambda|\,\|WH\|_* \le L(W'H') + |\lambda|\,\|W'H'\|_* \le L(W', H') + \frac{|\lambda|}{2}\big(\|W'\|_F^2 + \|H'\|_F^2\big),$$
which indicates that $(W, H)$ must attain the global minimum of the optimization problem
$$\min_{W', H'}\ L(W', H') + \frac{|\lambda|}{2}\big(\|W'\|_F^2 + \|H'\|_F^2\big).$$
If $(W, H)$ does not satisfy the neural collapse conditions, then by the optimality of the neural collapse solution (Theorem 3.3) there exists another point $(W', H')$ such that $L(W', H') < L(W, H)$ and $\|W'\|_F^2 + \|H'\|_F^2 = \|W\|_F^2 + \|H\|_F^2$; thus
$$L(W, H) + \frac{|\lambda|}{2}\big(\|W\|_F^2 + \|H\|_F^2\big) > L(W', H') + \frac{|\lambda|}{2}\big(\|W'\|_F^2 + \|H'\|_F^2\big),$$
which contradicts the global optimality of $(W, H)$. Therefore $(W, H)$ must satisfy all of the neural collapse conditions, and we finish the proof.

D ADDITIONAL EMPIRICAL RESULTS

Gradient Descent on the ULPM Objective. We conduct experiments on the ULPM objective (2.3) to support our theoretical results on convergence toward neural collapse. We set $N = 10$, $K = 5$, and $d = 20$, and ran gradient descent with a learning rate of 5 for $10^5$ epochs. We characterize the dynamics of the training procedure in Figure 2 along four aspects: (1) the relative variation of the centered class-mean feature norms (i.e., $\mathrm{Std}(\|\bar h_k - \bar h\|)/\mathrm{Avg}(\|\bar h_k - \bar h\|)$) and the variation of the classifier norms (i.e., $\mathrm{Std}(\|w_k\|)/\mathrm{Avg}(\|w_k\|)$); (2) the within-class variation of the last-layer features (i.e., $\mathrm{Avg}(\|h_{k,i} - \bar h_k\|)/\mathrm{Avg}(\|h_{k,i} - \bar h\|)$); (3) the cosines between pairs of last-layer features (i.e., $\mathrm{Avg}(|\cos(\bar h_k, \bar h_{k'}) + 1/(K-1)|)$) and those of the classifiers (i.e., $\mathrm{Avg}(|\cos(w_k, w_{k'}) + 1/(K-1)|)$); (4) the distance between the normalized centered classifier and the normalized last-layer feature (i.e., $\mathrm{Avg}(\|(\bar h_k - \bar h)/\|\bar h_k - \bar h\| - w_k/\|w_k\|\|)$). Empirically, we observe that all four quantities decrease at approximately the rate $O(1/\log t)$.

Figure 2: Training dynamics in the ULPM. Panels: (a) variation of the norms, (b) within-class variation, (c) cosines between pairs, (d) self-duality. The x-axes are on a $\log(\log t)$ scale and the y-axes on a log scale. (a) The dynamics of the variation of the centered class-mean feature norms (shown in blue) and of the classifier norms (shown in red); the logarithm of both terms decreases at a rate $O(1/\log t)$. (b) The dynamics of the within-class variation of the last-layer features; the logarithm of the variation converges at approximately the rate $O(1/\log t)$. (c) The dynamics of the cosines between pairs of last-layer features (shown in blue) and those of the classifiers (shown in red); the logarithm of both terms converges at approximately the rate $O(1/\log t)$. (d) The dynamics of the distance between the normalized centered classifier and the normalized last-layer feature; the logarithm of this quantity converges at approximately the rate $O(1/\log t)$ to the point of self-duality.
Details of Realistic Training. In the real-data experiments, we trained VGG-13 (Simonyan & Zisserman, 2014) and ResNet-18 (He et al., 2016) on the MNIST (LeCun et al., 1998), KMNIST (Clanuwat et al., 2018), Fashion-MNIST (Xiao et al., 2017) and CIFAR-10 (Krizhevsky et al., 2009) datasets without weight decay, using a learning rate of 0.01, momentum of 0.3, and batch size of 128. The metrics are defined in the same way as in the ULPM case, and the experimental results are reported in Figures 1, 3, 4, 5, 6, 7, and 8. All experiments were run in Python (version 3.6.9) on Google Colab.
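For reference, a minimal sketch of one architecture/dataset pair (VGG-13 on CIFAR-10) under this configuration is given below. Only the optimizer settings (SGD with learning rate 0.01, momentum 0.3, batch size 128, and no weight decay) follow the text; the data normalization constants, the stock torchvision VGG-13 head, and the epoch budget are assumptions made for illustration.

```python
import torch
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

# CIFAR-10 with standard per-channel normalization (normalization constants are assumed).
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

# Stock torchvision VGG-13 with a 10-way classifier; SGD without weight decay.
model = torchvision.models.vgg13(num_classes=10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.3, weight_decay=0.0)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(500):   # the epoch axes in Figures 3-8 run to 500; the exact budget is illustrative
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        opt.step()
```

The neural collapse metrics are then computed from the penultimate-layer activations and the weights of the final linear layer, in the same way as in the ULPM sketch above.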
Figure 3: Experiments on real datasets without weight decay. We trained a VGG-13 on the CIFAR-10 dataset. Panels: (a) variation of the norms, (b) within-class variation, (c) cosines between pairs, (d) self-duality. The x-axes are on a $\log(\log t)$ scale and the y-axes on a log scale.

Figure 4: Experiments on real datasets without weight decay. We trained a VGG-13 on the MNIST dataset. Panels: (a) variation of the norms, (b) within-class variation, (c) cosines between pairs, (d) self-duality. The x-axes are on a $\log(\log t)$ scale and the y-axes on a log scale.

Figure 5: Experiments on real datasets without weight decay. We trained a VGG-13 on the KMNIST dataset. Panels: (a) variation of the norms, (b) within-class variation, (c) cosines between pairs, (d) self-duality. The x-axes are on a $\log(\log t)$ scale and the y-axes on a log scale.

Figure 6: Experiments on real datasets without weight decay. We trained a ResNet-18 on the KMNIST dataset. Panels: (a) variation of the norms, (b) within-class variation, (c) cosines between pairs, (d) self-duality. The x-axes are on a $\log(\log t)$ scale and the y-axes on a log scale.

Figure 7: Experiments on real datasets without weight decay. We trained a VGG-13 on the Fashion-MNIST dataset. Panels: (a) variation of the norms, (b) within-class variation, (c) cosines between pairs, (d) self-duality. The x-axes are on a $\log(\log t)$ scale and the y-axes on a log scale.

Figure 8: Experiments on real datasets without weight decay. We trained a ResNet-18 on the Fashion-MNIST dataset. Panels: (a) variation of the norms, (b) within-class variation, (c) cosines between pairs, (d) self-duality. The x-axes are on a $\log(\log t)$ scale and the y-axes on a log scale.

It should be mentioned that the variation of the classifier norms stays at a low value and does not decrease in some settings (e.g., Figure 7a and Figure 8a). This phenomenon is also found in Figure 2 of Papyan et al. (2020) and may be attributed to the network architecture (e.g., batch normalization) and to characteristics of the real-world datasets. Overall, we find that neural collapse occurs for various network architectures and datasets under an unconstrained setting, which provides sound support for our theory.

E LIMITATION AND FUTURE DIRECTIONS

Currently, our analysis is restricted to the terminal phase of training and only studies the training behavior after the data are perfectly separated. Such an assumption is necessary for all implicit-regularization analyses (Lyu & Li, 2019; Ji et al., 2020; Nacson et al., 2019a) in the nonlinear setting, and how to fully characterize the early-time training dynamics is still an open problem. Moreover, we have provided a loss-landscape analysis showing that there is no spurious minimum in the tangent space. To guarantee that the training dynamics do not get stuck at saddle points and converge to a neural collapse solution, one needs to introduce randomness into the training dynamics (e.g., by using stochastic gradient descent (Ge et al., 2015)); otherwise, simple gradient flow may get stuck at saddle points, as we have shown in Example 3.1. How to characterize the training dynamics of stochastic gradient descent in the nonlinear setting is also an open problem. It is an interesting direction to build a complete characterization of convergence to neural collapse solutions by addressing the early-time training dynamics and by studying stochastic gradient descent.