# Imbalance Trouble: Revisiting Neural-Collapse Geometry

Christos Thrampoulidis, Ganesh R. Kini, Vala Vakilian, Tina Behnia
University of British Columbia; University of California, Santa Barbara

*36th Conference on Neural Information Processing Systems (NeurIPS 2022). Supported by an NSERC Undergraduate Student Research Grant, an NSERC Discovery Grant, NSF CCF-2009030, a CRG8-KAUST award, and by UBC Advanced Research Computing services.*

**Abstract.** Neural Collapse refers to the remarkable structural properties characterizing the geometry of class embeddings and classifier weights, found by deep nets when trained beyond zero training error. However, this characterization only holds for balanced data. Here we thus ask whether it can be made invariant to class imbalances. Towards this end, we adopt the unconstrained-features model (UFM), a recent theoretical model for studying neural collapse, and introduce Simplex-Encoded-Labels Interpolation (SELI) as an invariant characterization of the neural-collapse phenomenon. We prove for the UFM with cross-entropy loss and vanishing regularization that, irrespective of class imbalances, the embeddings and classifiers always interpolate a simplex-encoded label matrix and that their individual geometries are determined by the SVD factors of this same label matrix. We then present extensive experiments on synthetic and real datasets that confirm convergence to the SELI geometry. However, we caution that convergence worsens with increasing imbalances. We theoretically support this finding by showing that, unlike the balanced case, when minorities are present, ridge regularization plays a critical role in tweaking the geometry. This defines new questions and motivates further investigations into the impact of class imbalances on the rates at which first-order methods converge to their preferred solutions.

## 1 Introduction

What are the unique structural properties of models learned by training deep neural networks to zero training error? Is there an implicit bias towards solutions of certain geometry? How does this vary across training instances, architectures, and data? These questions are at the core of understanding the optimization landscape of deep nets. They are also naturally informative about the role of models, since different parameterizations might affect preferred geometries. Ultimately, such understanding makes progress towards explaining generalization of overparameterized models.

Recently, remarkable new progress in answering these questions has been made by Papyan et al. [26], who empirically discovered and formalized the so-called Neural-Collapse (NC) phenomenon. NC describes geometric properties of the learned embeddings (aka last-layer features) and of the classifier weights of deep nets, trained with cross-entropy (CE) loss and balanced data far into the zero-training-error regime. The NC phenomenon produces a remarkably simple description of a particularly symmetric geometry: (i) the embeddings of each class collapse to their class mean (see the (NC) property); and (ii) the class means align with the classifier weights and they form a simplex equiangular tight frame (see the (ETF) property). Importantly, as noted by Papyan et al. [26], this simple geometry appears to be cross-situationally invariant across different architectures and different balanced datasets.
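For intuition, here is a minimal numpy sketch (ours, not part of the paper) of the simplex-ETF structure just mentioned, under the common scaling convention where the $k$ class vectors are the columns of $I_k - \frac{1}{k}1_k 1_k^T$: all norms are equal and all pairwise angles have cosine $-1/(k-1)$.

```python
import numpy as np

k = 4
M = np.eye(k) - np.ones((k, k)) / k      # columns e_c - (1/k)1_k: a simplex ETF (up to scaling)
G = M.T @ M                              # Gram matrix of the k columns
norms = np.sqrt(np.diag(G))
cosines = G / np.outer(norms, norms)
print(np.allclose(norms, norms[0]))      # True: all norms equal
print(np.round(cosines, 3))              # off-diagonal entries all equal -1/(k-1) = -1/3
```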
Figure 1: Visualizing and contrasting the SELI and ETF geometries for $k = 4$ classes: (a) the SELI geometry under $(10, 1/2)$-STEP imbalance, together with the SEL matrix $\hat{Z}$; (b) the ETF geometry in the balanced case.

In this paper, we study Neural Collapse with imbalanced classes: Is there a (ideally equally simple) description of the geometry that is invariant across class-imbalanced datasets?

**Contributions.** We propose a new description capturing the geometric structure of learned embeddings and classifier weights on possibly class-imbalanced data, which we call the Simplex-Encoded-Labels Interpolation (SELI) geometry. This new geometry is a generalization of the ETF geometry: it recovers the latter when data are balanced or when there are only two classes, and, unlike the ETF, it remains invariant across different imbalance levels. Importantly, it too has a simple description: the matrix of learned logits interpolates a simplex-encoded label (SEL) matrix $\hat{Z}$, and the individual geometries of the embeddings and classifiers are determined by the SVD factors of this same SEL matrix. Because the particular arrangement of columns of the SEL matrix changes with the imbalance level, this also impacts the geometric arrangement of the embedding and classifier vectors. Overall, the norms and angles of these vectors admit simple closed-form expressions in terms of the imbalance characteristics and the number of classes.

We use an example to illustrate this. Fig. 1a depicts the SEL matrix $\hat{Z} \in \mathbb{R}^{4 \times 110}$ for a STEP-imbalanced $k = 4$-class dataset with two majority classes of 50 examples each and two minority classes of 5 examples each. Each column of $\hat{Z}$ includes the $k$ learned logits for each one of the 110 examples in the dataset. Each such column has exactly one entry equal to $1 - 1/k = 0.75$ and three entries equal to $-1/k = -0.25$. The corresponding geometry of the embeddings and classifiers, shown in the 3D plot, is found by an SVD of $\hat{Z}$: the left singular vectors determine the classifiers and the right ones the embeddings. Note that $\hat{Z}$ is rank 3, hence the geometry is 3D. Since embeddings collapse to their class means (see the (NC) property), we only show the four class-mean embeddings and the corresponding four classifiers. Two of each correspond to majorities and two to minorities (the latter marked with '+' in the plot). The radii of the two concentric spheres are equal to the norms of the minority classifiers (red sphere) and of the minority embeddings (blue sphere), respectively. Note that the norms of minorities and majorities are different, and so are the angles. Moreover, the classifiers are not aligned with the embeddings. Overall, the geometry is different compared to the ETF geometry of the balanced case, shown in Fig. 1b. What remains invariant across class imbalances is that the logits (i.e., the inner products between classifiers and embeddings) only take values either $1 - 1/k$ or $-1/k$, so that the logit matrix is equal to the SEL matrix. Equivalently, the learned model interpolates the simplex encoding of the labels. Below we explain the conception of this geometry and our contributions in detail.
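To make the Fig. 1 example concrete before moving on, the following numpy sketch (illustrative code of ours, not the paper's released scripts) builds the SEL matrix for this $(10, 1/2)$-STEP instance and reads off the classifier and embedding Gram matrices from its SVD; the printed norms and pairwise cosines differ between majorities and minorities, exactly as described above.

```python
import numpy as np

# SEL matrix for the Fig. 1 example: k = 4 classes, (10, 1/2)-STEP imbalance,
# two majority classes of 50 examples and two minority classes of 5 examples.
k, counts = 4, [50, 50, 5, 5]
y = np.repeat(np.arange(k), counts)
Z_hat = np.eye(k)[:, y] - 1.0 / k                  # columns e_{y_i} - (1/k)1_k

V, S, Ut = np.linalg.svd(Z_hat, full_matrices=False)
V, S, U = V[:, :k - 1], S[:k - 1], Ut[:k - 1].T    # compact SVD: Z_hat = V diag(S) U^T, rank k-1
G_W = (V * S) @ V.T                                # classifier Gram  V Lambda V^T
G_H = (U * S) @ U.T                                # embedding Gram   U Lambda U^T

w_norms = np.sqrt(np.diag(G_W))                    # classifier norms per class
h_norms = np.sqrt(np.diag(G_H))[[0, 100]]          # one majority (idx 0) and one minority (idx 100) embedding
cos_W = G_W / np.outer(w_norms, w_norms)           # pairwise classifier cosines
print(w_norms)                                     # majority classifier norms exceed minority ones
print(h_norms)                                     # majority embedding norm is smaller than minority
print(np.round(cos_W, 3))                          # majority/minority angles differ
```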
The initial major challenge was: assuming a class-imbalance-invariant geometry exists, how do we find it? To answer this question, we adopted the Unconstrained Features Model (UFM), previously introduced as a two-layer proxy model to theoretically justify neural collapse [24, 3, 41]. Motivated by deep-learning practice and by studies on the implicit bias of gradient descent (GD) for unregularized CE minimization, we analyze the geometry of solutions to an unconstrained-features support vector machine (UF-SVM) problem. We prove that, for STEP-imbalanced data, any solution of the UF-SVM follows the SELI geometry. Thus, the learned end-to-end model always interpolates a simplex label encoding. We show that (ETF) $\Rightarrow$ (SELI). However, (SELI) $\not\Rightarrow$ (ETF) unless classes are balanced or there are just two of them ($k = 2$).

Next, we analyze training of the UFM with ridge-regularized CE. Unlike previous studies, we find that in the presence of imbalances regularization matters, as it changes the geometry of solutions. In fact, we show that there is no finite regularization that leads to the SELI geometry. However, we also show that as regularization vanishes, the solutions do interpolate the SEL matrix (after appropriate normalization). Finally, we show that the SELI geometry differs from the minority-collapse phenomenon [3], since the latter does not correspond to solutions with zero training error. In fact, we show that minority collapse: (i) does not occur for small finite regularization and finite imbalance ratio, and (ii) occurs for vanishing regularization, but only asymptotically as the imbalance ratio grows.

Figure 2: Convergence of learned classifiers, embeddings, and corresponding logits to the SELI (solid lines) vs. ETF (dashed lines) geometries, measured for a ResNet-18 model trained far beyond zero training error on STEP-imbalanced CIFAR10, MNIST, and Fashion-MNIST, for different imbalance ratios $R \in \{1, 5, 10, 100\}$ (panels: (a) Classifiers, (b) Embeddings, (c) Logits, plotted against training epochs); see Sec. 5 for metrics and discussion.

We numerically test convergence to the SELI geometry on both synthetic and real class-imbalanced datasets. For different imbalance levels, the learned geometries approach the SELI geometry significantly faster compared to the ETF geometry (Fig. 2). However, convergence worsens with increasing level of imbalance. A plausible theoretical justification is that, as we show, regularization plays a critical role under imbalances. We also consistently observe better convergence rates for the classifiers. We believe our observations strongly motivate further investigations regarding potential frailties of asymptotic implicit-bias characterizations and how these might vary in multiclass and possibly imbalanced settings.

### 1.1 Related works

The original contribution by Papyan et al. [26] has attracted a lot of attention, resulting in numerous follow-ups within a short time period, e.g., [41, 12, 3, 9, 21, 24, 5, 40, 32]. (See also [9, Sec. E] for a review of the recent literature.) Several works have proposed and/or used the UFM with CE training to analyze theoretical abstractions of NC [41, 12, 3, 5]. Other works analyze the UFM with square loss [24, 9, 40, 32], and recent extensions accounting for additional layers are studied in [32]. Here, we drew particular inspiration from Zhu et al.
[41], who presented a particularly transparent and complete analysis of the optimization landscape of ridge-regularized CE minimization for the UFM under balanced data. In the same spirit, we also rely on the UFM. However, our work is, to the best of our knowledge, the first explicit geometry analysis for class-imbalanced data. See also Table 1 for a comparison.

| Geometry of global minimizers for the UFM | UF-Ridge CE (Eqn. (1)) | UF-SVM (Eqn. (2)) |
| --- | --- | --- |
| Balanced | ETF [21, 5, 41, 3, 32] | ETF [12], [Cor. 1.3] |
| Imbalanced | $\forall \lambda$: NO SELI [Prop. 1]; $\lambda \to 0$: SELI [Prop. 2]; $\lambda < 1/2$: NO minority collapse [Sec. H] | SELI [Thm. 1] |

Table 1: Summary of contributions and comparison to the most closely related work.

The only previous work on neural collapse with imbalances is [3], which was the first to note that collapse of the embeddings is preserved, but otherwise the geometry might skew away from the ETF. Also, Fang et al. [3] first proposed studying the new geometry using the UFM and appropriate convex relaxations. With this setup, they presented an intriguing finding, which they termed minority collapse: for asymptotically large imbalance levels, the minority classifiers collapse to the same vector. Instead, we derive an explicit geometric characterization of both embeddings and classifiers, for both majorities and minorities, and for all imbalance levels. Specializing these findings to vanishing regularization and imbalance ratio growing to infinity recovers and gives new insights into minority collapse. Our results also draw from and relate to the literatures on implicit bias, matrix factorization, and imbalanced deep learning. We defer a detailed discussion on these to Sec. I of the SM.

**Notation.** For a matrix $V \in \mathbb{R}^{m \times n}$, $V[i,j]$ denotes its $(i,j)$ entry, $v_j$ denotes the $j$-th column, $V^T$ its transpose, and $V^\dagger$ its pseudoinverse. We denote by $\|V\|_F$, $\|V\|_2$, $\|V\|_*$ the Frobenius, spectral, and nuclear norms of $V$. $\operatorname{tr}(V)$ denotes the trace of $V$. $\odot$ and $\otimes$ denote the Hadamard and Kronecker products. $V \succeq 0$ denotes that $V$ is positive semidefinite and $V \geq 0$ that $V$ has nonnegative entries. $\nabla_V L \in \mathbb{R}^{m \times n}$ is the gradient of a scalar function $L$ with respect to $V$. $1_m$ denotes an $m$-dimensional vector of all ones and $I_m$ the $m$-dimensional identity matrix. For vectors/matrices with all-zero entries, we simply write $0$, with dimensions understood from context. $e_j$ denotes a vector with a single non-zero entry of $1$ in the $j$-th position.

## 2 Problem setup

We adopt the unconstrained features model (UFM) [24, 3] in a $k$-class classification setting. Let $W = [w_1, w_2, \ldots, w_k] \in \mathbb{R}^{d \times k}$ be the matrix of classifier weights corresponding to the $k$ classes. Here, $d$ is the feature dimension. We assume throughout that $d \geq k - 1$. Next, we let $H = [h_1, h_2, \ldots, h_n] \in \mathbb{R}^{d \times n}$ denote a matrix of $n$ feature embeddings, each corresponding to a different example in the training set. We assume each class $c \in [k]$ has $n_c \geq 1$ examples (thus, $n_c$ embeddings) so that $\sum_{c \in [k]} n_c = n$. Without loss of generality, we assume examples are ordered: examples $i = 1, \ldots, n_1$ have labels $y_i = 1$, examples $i = n_1 + 1, \ldots, n_1 + n_2$ have labels $y_i = 2$, and so on. The UFM trains the features $h_i, i \in [n]$ (jointly with the weights $w_c, c \in [k]$) without any further constraints, i.e., by minimizing the ridge-regularized cross-entropy (CE) loss as follows [41]:

$$(\hat{W}_\lambda, \hat{H}_\lambda) \in \arg\min_{W, H} \; L(W^T H) + \lambda \|W\|_F^2 / 2 + \lambda \|H\|_F^2 / 2, \qquad (1)$$

where $L(W^T H) = \sum_{i \in [n]} \log\big(1 + \sum_{c \neq y_i} e^{-(w_{y_i} - w_c)^T h_i}\big)$ is the CE loss.
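For concreteness, a minimal numpy sketch of the objective in (1) (the function names are ours; a plain illustration, not the authors' implementation):

```python
import numpy as np
from scipy.special import logsumexp

def ce_loss(W, H, y):
    """Cross-entropy L(W^T H) of Eqn. (1): sum_i log(1 + sum_{c != y_i} exp(-(w_{y_i}-w_c)^T h_i))."""
    Z = W.T @ H                                    # k x n logit matrix
    return np.sum(logsumexp(Z, axis=0) - Z[y, np.arange(len(y))])

def ufm_objective(W, H, y, lam):
    """Ridge-regularized UFM objective of Eqn. (1)."""
    return ce_loss(W, H, y) + 0.5 * lam * (np.linalg.norm(W, 'fro') ** 2
                                           + np.linalg.norm(H, 'fro') ** 2)
```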
**UFM as a 2-layer linear net.** The formulation above does not explicitly specify inputs. Alternatively, consider training a 2-layer linear net with hidden dimension $d$, first/second layers $H$/$W$, over $n$ examples with $n$-dimensional inputs $x_i = e_i \in \mathbb{R}^n, i \in [n]$, and labels as above.

**Unconstrained-features SVM (UF-SVM).** Since neural collapse is observed when training with small/vanishing regularization [26], it is reasonable to consider an unregularized version of (1). In this special case, gradient descent (with sufficiently small step size) on (1) produces iterates that diverge in norm, but converge in direction [22, 12]. In fact, it has been recently shown that the GD solutions converge in direction to a KKT point of the following max-margin classifier [22, 12]:

$$(\hat{W}, \hat{H}) \in \arg\min_{W, H} \; \|W\|_F^2 / 2 + \|H\|_F^2 / 2 \quad \text{sub. to} \quad (w_{y_i} - w_c)^T h_i \geq 1, \;\; \forall i \in [n], \, c \neq y_i. \qquad (2)$$

For convenience, we refer to the optimization in (2) as the unconstrained-features SVM (UF-SVM). This minimization (unlike the standard SVM) is non-convex. Hence, KKT points (thus, GD convergence directions) are not necessarily global minimizers; see the discussion in Sec. 6.

**Class-imbalance model.** To streamline the presentation, we focus on a setting with STEP imbalances. This includes balanced data as a special case by setting $R = 1$.

**Definition 1 ($(R,\rho)$-STEP imbalance).** In an $(R,\rho)$-STEP imbalance setting with label-imbalance ratio $R \geq 1$ and minority fraction $\rho \in (0,1)$, the following hold. All minority (resp. majority) classes have the same sample size $n_{\min}$ (resp. $R n_{\min}$). There are $(1-\rho)k$ majority and $\rho k$ minority classes. Without loss of generality, classes $\{1, \ldots, (1-\rho)k\}$ are the majorities.

## 3 Global structure of the UF-SVM: SELI geometry

In this section, we show that the global minimizers of the non-convex program in (2) take a particularly simple form that is best described in terms of a simplex encoding of the labels.

**Definition 2 (SEL matrix).** The simplex-encoding label (SEL) matrix $\hat{Z} \in \mathbb{R}^{k \times n}$ is such that for all $c \in [k], i \in [n]$:

$$\hat{Z}[c,i] = \begin{cases} 1 - 1/k, & c = y_i \\ -1/k, & c \neq y_i \end{cases}. \qquad (3)$$

Onwards, let $\hat{Z}^T = U \Lambda V^T$ be the compact SVD of $\hat{Z}^T$. Specifically, $\Lambda$ is a positive $(k-1)$-dimensional diagonal matrix and $U \in \mathbb{R}^{n \times (k-1)}$, $V \in \mathbb{R}^{k \times (k-1)}$ have orthonormal columns. Each column $\hat{z}_i \in \mathbb{R}^k$ of $\hat{Z}$ represents a class-membership encoding of datapoint $i \in [n]$. This differs from the vanilla one-hot encoding $\hat{y}_i = e_{y_i}$ in that $\hat{z}_i = \hat{y}_i - \frac{1}{k} 1_k$. Specifically, $\hat{Z}$ has exactly $k$ different and affinely independent columns, which together with the zero vector form a $k$-dimensional simplex, motivating the SEL name. Finally, note that $\hat{Z}^T 1_k = 0$; thus, $\operatorname{rank}(\hat{Z}) = k - 1$. We gather useful properties of the eigenstructure of $\hat{Z}$ in Sec. A.

**Theorem 1 (Structure of the UF-SVM minimizers).** Suppose $d \geq k - 1$ and an $(R,\rho)$-STEP imbalance setting. Let $(\hat{W}, \hat{H})$ be any solution and $p^\star$ the optimal cost of the UF-SVM in (2). Then, $p^\star = \|\hat{Z}\|_* = \|\hat{H}\|_F^2 = \|\hat{W}\|_F^2$. Moreover, the following statements characterize the geometry of global minimizers in terms of the SEL matrix and its SVD.
(i) For the optimal logits it holds that $\hat{W}^T \hat{H} = \hat{Z}$.
(ii) The Gram matrices satisfy $\hat{H}^T \hat{H} = U \Lambda U^T$ and $\hat{W}^T \hat{W} = V \Lambda V^T$.
(iii) For some partial orthonormal matrix $R \in \mathbb{R}^{(k-1) \times d}$ (i.e., $R R^T = I_{k-1}$), $\hat{W} = R^T \Lambda^{1/2} V^T$ and $\hat{H} = R^T \Lambda^{1/2} U^T$.

We outline the theorem's proof in Sec. 3.3 and defer the details to Sec. C.1. The theorem provides an explicit characterization of the geometry of optimal embeddings and classifiers, built around the key finding that the optimal logit matrix is always equal to the SEL matrix.
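The statements of Theorem 1 are easy to verify numerically. The sketch below (our illustration; the small $(5, 1/2)$-STEP instance and the dimension $d$ are arbitrary choices) constructs $(\hat{W}, \hat{H})$ from statement (iii) and checks that the logits equal $\hat{Z}$, that all margins equal one, and that $\|\hat{W}\|_F^2 = \|\hat{H}\|_F^2 = \|\hat{Z}\|_*$.

```python
import numpy as np

rng = np.random.default_rng(0)
k, counts, d = 4, [10, 10, 2, 2], 8                    # small (5, 1/2)-STEP toy instance, d >= k-1
y = np.repeat(np.arange(k), counts); n = len(y)
Z_hat = np.eye(k)[:, y] - 1.0 / k                      # SEL matrix of Definition 2

V, S, Ut = np.linalg.svd(Z_hat, full_matrices=False)
V, S, U = V[:, :k - 1], S[:k - 1], Ut[:k - 1].T        # compact SVD factors
Q, _ = np.linalg.qr(rng.standard_normal((d, k - 1)))
R = Q.T                                                # partial orthonormal: R R^T = I_{k-1}

W = R.T @ np.diag(np.sqrt(S)) @ V.T                    # statement (iii)
H = R.T @ np.diag(np.sqrt(S)) @ U.T
logits = W.T @ H
margins = [logits[y[i], i] - np.max(np.delete(logits[:, i], y[i])) for i in range(n)]
print(np.allclose(logits, Z_hat))                      # (i): the logits interpolate the SEL matrix
print(np.allclose(margins, 1.0))                       # all margins equal 1
print(np.linalg.norm(W)**2, np.linalg.norm(H)**2, S.sum())  # each equals the nuclear norm of Z_hat
```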
**Simplicity.** The lack of symmetry in the imbalanced setting makes it a priori unclear whether a simple description of the geometry is still possible, as in the balanced case. But the theorem shows this to be the case. The key observation is that the optimal logit matrix $\hat{W}^T \hat{H}$ equals $\hat{Z}$ (cf. Statement (i)). Then, the Gram matrices of embeddings and classifiers are given simply in terms of the singular factors of the SEL matrix (cf. Statements (ii), (iii)).

**Invariance to imbalances.** Equality of the optimal logit matrix to the SEL matrix is the key invariant characterization across changing imbalances. This also implies that at optimality all margins are equal, irrespective of the imbalance type. The description of the Gram matrices in terms of the SVD of $\hat{Z}$ is also invariant. Of course, the particular arrangement of columns of $\hat{Z}$ itself depends on the values of $(R,\rho)$. In turn, the singular factors determining the geometry of embeddings and classifiers depend implicitly on the same parameters. Thus, the geometry differs for different imbalance levels; see Fig. 1 for an example.

### 3.1 Invariant properties: NC and SELI

Here, we further discuss the geometry of embeddings and classifiers induced by the SVD of the SEL matrix. The first realization is that the embeddings collapse under all settings.

**Corollary 1.1.** The UF-SVM solutions satisfy the following irrespective of imbalance:

(NC) The embeddings collapse to their class means: $\hat{h}_i = \hat{\mu}_c := \frac{1}{n_c} \sum_{j: y_j = c} \hat{h}_j$, for all $c \in [k]$ and all $i$ with $y_i = c$.

This can be inferred from Theorem 1 (specifically, from Statement (iii) and the fact that rows of $U$ corresponding to examples of the same class coincide). A more direct argument, by inspecting (2), is as follows. For any fixed (say optimal) $\hat{W}$, the minimization over $h_i$ is: (i) separable and identical for all $i$ with $y_i = c$ in the same class $c$, and (ii) strongly convex. Hence, for all $i$ with $y_i = c$, there is a unique minimizer corresponding to the fixed $\hat{W}$; this must be their class mean.

Beyond (NC), Theorem 1 specifies the exact geometry of solutions.

**Definition 3 (SELI geometry).** The embedding and classifier matrices $H \in \mathbb{R}^{d \times n}$ and $W \in \mathbb{R}^{d \times k}$ follow the simplex-encoded-labels interpolation geometry when, for some scaling $\alpha > 0$:

$$[W \;\; H]^T [W \;\; H] = \alpha \begin{bmatrix} V \Lambda V^T & \hat{Z} \\ \hat{Z}^T & U \Lambda U^T \end{bmatrix},$$

where $\hat{Z} = V \Lambda U^T$ is the SEL matrix.

**Corollary 1.2.** The UF-SVM solutions follow the SELI geometry, irrespective of imbalance.

The (SELI) geometry specifies (up to a global positive scaling) the Gram matrices $G_W = W^T W$ and $G_H = H^T H$, and the logit matrix $Z = W^T H$: the diagonals of the Gram matrices specify the squared Euclidean norms, and together with their off-diagonal entries they further specify the angles between classifiers and between embeddings. Because of (NC), the norms and angles of the embeddings are uniquely determined in terms of the norms and angles of the mean embeddings; thus, for all $i \in [n]$, $G_H[i,j] = G_H[i,\ell]$ if $y_j = y_\ell$. Finally, the norms together with the entries of the logit matrix determine the angles between the two sets of: (a) the $k$ classifiers and (b) the $k$ mean embeddings. Thus, they specify the degree of alignment between the two sets of vectors. In the next section, we show that it is in fact possible to obtain explicit closed-form formulas describing the norms, angles, and alignment of classifier and embedding vectors in terms of the imbalance characteristics and the number of classes.

**Remark 3.1 (Why "SELI"?).** For the UFM, $Z = W^T H$ is the learned end-to-end model. According to its definition, the SELI geometry implies $W^T H = \alpha \hat{Z}$. Thus, the learned model interpolates (a scaling of) the SEL matrix, motivating the naming in Definition 3.
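As an illustration of Definition 3, here is a small helper (ours; the Frobenius normalization mirrors in spirit the metrics used later in Sec. 5, although the exact normalization there may differ) that quantifies how far a given pair $(W, H)$ is from the SELI geometry:

```python
import numpy as np

def _unit(A):
    return A / np.linalg.norm(A)                        # Frobenius normalization

def seli_deviation(W, H, Z_hat):
    """Deviation of (W, H) from the SELI geometry of Definition 3, comparing
    Frobenius-normalized Grams and logits against the SEL-matrix targets."""
    r = Z_hat.shape[0] - 1                              # rank k - 1
    V, S, Ut = np.linalg.svd(Z_hat, full_matrices=False)
    V, S, U = V[:, :r], S[:r], Ut[:r].T
    targets = {"classifiers": (V * S) @ V.T,            # V Lambda V^T
               "embeddings": (U * S) @ U.T,             # U Lambda U^T
               "logits": Z_hat}
    learned = {"classifiers": W.T @ W, "embeddings": H.T @ H, "logits": W.T @ H}
    return {key: np.linalg.norm(_unit(learned[key]) - _unit(targets[key])) for key in targets}
```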
#### 3.1.1 Special case: Balanced or binary data

For the special cases of balanced or binary data, Theorem 1 recovers the ETF structure, i.e., (SELI) $\equiv$ (ETF). Let $\hat{M} = [\hat{\mu}_1, \ldots, \hat{\mu}_k]$ denote the matrix of mean embeddings.

**Corollary 1.3 ($R = 1$ or $k = 2$).** Assume balanced data ($R = 1$) or binary classification ($k = 2$). Then, any UF-SVM solution $(\hat{W}, \hat{H})$ follows the ETF geometry as defined in [26]:

(ETF) $\hat{W} \propto \hat{M}$ and $\hat{M}^T \hat{M} \propto \hat{W}^T \hat{W} \propto I_k - \tfrac{1}{k} 1_k 1_k^T$.

Thus, when data are balanced or binary: (i) the norms of the classifiers and of the embeddings are all equal; (ii) the angles between any two classifiers or any two embeddings are all equal, with cosine $-1/(k-1)$; and (iii) the set of classifiers and the set of embeddings are aligned. See Fig. 1b.

### 3.2 How the SELI geometry changes with imbalances

For $k \geq 3$ and $R > 1$, (SELI) $\not\Rightarrow$ (ETF), and the geometry is determined in terms of the SVD factors of the SEL matrix. Sec. A in the SM explicitly characterizes these SVD factors and leads, in Sec. B, to explicit closed-form formulas for the norms, angles, and alignment in terms of $R, \rho$ and $k, n_{\min}$. For example, the following lemma gives a formula for the ratio of majority and minority norms. For simplicity, assume equally many majority and minority classes.

**Lemma 3.1 (Norm ratios).** Assume an $(R, 1/2)$-STEP imbalance. Suppose $(W, H)$ satisfies the (SELI) property. Let $w_{\text{maj}}, h_{\text{maj}}$ (resp. $w_{\text{minor}}, h_{\text{minor}}$) denote majority (resp. minority) classifiers and embeddings, respectively. Then,

$$\frac{\|w_{\text{maj}}\|_2^2}{\|w_{\text{minor}}\|_2^2} = \frac{\sqrt{R}\,(1 - 2/k) + \tfrac{1}{k}\sqrt{\tfrac{R+1}{2}}}{(1 - 2/k) + \tfrac{1}{k}\sqrt{\tfrac{R+1}{2}}} \qquad \text{and} \qquad \frac{\|h_{\text{maj}}\|_2^2}{\|h_{\text{minor}}\|_2^2} = \frac{\tfrac{1}{\sqrt{R}}(1 - 2/k) + \tfrac{1}{k}\sqrt{\tfrac{2}{R+1}}}{(1 - 2/k) + \tfrac{1}{k}\sqrt{\tfrac{2}{R+1}}}.$$

Thus, $\|w_{\text{maj}}\|_2 \geq \|w_{\text{minor}}\|_2$ and $\|h_{\text{maj}}\|_2 \leq \|h_{\text{minor}}\|_2$, with equalities iff $R = 1$ or $k = 2$.

(Footnote: We assume throughout that the regularization strength is the same for embeddings and classifiers. For completeness, we treat the general case in Sec. C.3 of the SM, where it is shown that different regularization values do not change the SELI geometry as per Definition 3, apart from introducing a (global) relative scaling factor between the norms of the embeddings and the classifiers.)

The fact that CE learns majority classifiers of larger norm has been empirically observed in the imbalanced deep-learning literature [17, 18]. Lemma 3.1 provides a theoretical justification and precisely quantifies the ratio, not only for the classifiers, but also for the learned embeddings. As another example of closed-form formulas, the angles between any two majority (resp. minority) classifiers $w_{\text{maj}}, w'_{\text{maj}}$ (resp. $w_{\text{minor}}, w'_{\text{minor}}$) satisfy (see Sec. B.1.2):

$$\cos(w_{\text{maj}}, w'_{\text{maj}}) = \frac{-2 + \sqrt{\tfrac{R+1}{2R}}}{k - 2 + \sqrt{\tfrac{R+1}{2R}}} \qquad \text{and} \qquad \cos(w_{\text{minor}}, w'_{\text{minor}}) = \frac{-2 + \sqrt{\tfrac{R+1}{2}}}{k - 2 + \sqrt{\tfrac{R+1}{2}}}. \qquad (4)$$

Both formulas evaluate to $-1/(k-1)$ for $R = 1$. Also, the first is strictly decreasing and the second strictly increasing in $R$; i.e., with larger $R$ the majority classifiers move further away from each other, while the minority classifiers come closer; see Fig. 1 and Fig. 6 in the SM.

**Remark 3.2 (Asymptotics).** While we focus on finite values of $R$, computing limits of our formulas gives asymptotic characterizations as $R \to \infty$. As an example, it is easy to see from (4) that the angle between the minority classifiers collapses to zero in that limit. This phenomenon is called minority collapse by Fang et al. [3]. Here, we recover it as a special case of Theorem 1 and of SELI. Note also that the rate at which the minority angle collapses is rather slow (see also Fig. 6c in the SM). Additional details are included in Sec. B.

### 3.3 Proof sketch

We start from the following standard convex relaxation of the UF-SVM [31, 8, 41, 3]:

$$\min_{Z \in \mathbb{R}^{k \times n}} \; \|Z\|_* \quad \text{subj. to} \quad Z[y_i, i] - Z[c, i] \geq 1, \;\; \forall c \neq y_i, \, i \in [n]. \qquad (5)$$

The relaxation follows by setting $Z = W^T H$; thus, $Z$ is the logit matrix (also, the end-to-end model) of the non-convex UF-SVM. Our key technical innovation is proving that $\hat{Z}$ is the unique minimizer of (5).
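Before sketching the proof, this claim can be sanity-checked numerically on a toy instance with CVXPY (a hedged illustration of ours; the instance size and the solver choice are assumptions):

```python
import cvxpy as cp
import numpy as np

k, counts = 4, [10, 10, 2, 2]                  # small (5, 1/2)-STEP toy instance
y = np.repeat(np.arange(k), counts); n = len(y)
Z_hat = np.eye(k)[:, y] - 1.0 / k              # SEL matrix of Definition 2

Z = cp.Variable((k, n))
constraints = [Z[y[i], i] - Z[c, i] >= 1
               for i in range(n) for c in range(k) if c != y[i]]
cp.Problem(cp.Minimize(cp.normNuc(Z)), constraints).solve(solver=cp.SCS)
print(np.max(np.abs(Z.value - Z_hat)))         # ~0 (up to solver tolerance): Z_hat solves (5)
```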
There are three key ingredients in the proof. The first is a clever re-parameterization of the dual program to (5), which introduces the SEL matrix $\hat{Z}$ in the dual:

$$\max_{B \in \mathbb{R}^{n \times k}} \; \operatorname{tr}(\hat{Z} B) \quad \text{sub. to} \quad \|B\|_2 \leq 1, \;\; B 1_k = 0, \;\; B \odot \hat{Z}^T \geq 0. \qquad (6)$$

Second, we prove that $\hat{B} = U V^T$ is the unique maximizer of the re-parameterized dual problem in (6). While it is not hard to check that $\hat{B}$ optimizes a relaxation of (6), it is far from obvious that $\hat{B}$ is unique, and, even more, that it satisfies the third constraint. The key technical challenge here is that the third constraint acts entry-wise on $B$. In fact, to proceed with the proof we need that the constraint is not active, i.e., $\hat{B} \odot \hat{Z}^T > 0$, or, equivalently, that the sign pattern of the entries of $\hat{B}$ agrees with the sign pattern of the transposed SEL matrix $\hat{Z}^T$. We prove this by an explicit construction of the singular factors $U, V$ exploiting the structure of the SEL matrix. Once we have shown that $\hat{B}$ is the unique maximizer and is strictly feasible, we use the KKT conditions to prove that $\hat{Z}$ is the unique minimizer of the nuclear-norm relaxation in (5). To do this, we leverage that strict feasibility of $\hat{B}$ implies, by complementary slackness, that all constraints in the primal (5) must be active at the optimum. The proof of the theorem is completed by arguing that the relaxation (5) is tight when $d \geq k - 1$, allowing us to connect the UF-SVM minimizers to the SEL matrix $\hat{Z}$. See Sec. C.1 for details.

**Remark 3.3 (Comparison to the literature).** The common analysis strategy in all other related works is deriving tight bounds on the CE loss (or related quantities, such as the minimum margin), and then identifying the structure in the parameters that achieves those bounds. For example, [41, 5, 3] lower bound the CE loss and [12] upper bounds the minimum margin, all using a similar elegant argument based on the Cauchy–Schwarz and Jensen inequalities. The ETF geometry is then uncovered by recognizing that it uniquely achieves those bounds. It is not clear how to employ such arguments in the presence of imbalances, due to the absence of symmetry properties (e.g., alignment of classifiers with embeddings). Our proof of Theorem 1 is more direct and is in large part enabled by identifying the key role played by the logit matrix.

## 4 The role of regularization

In this section, we study the geometry of solutions $(\hat{W}_\lambda, \hat{H}_\lambda)$ of the ridge-regularized CE minimization in (1), as a function of both the imbalance and the regularization parameter $\lambda$.

### 4.1 Global minimizers as solutions to a convex relaxation

The regularized CE minimization in (1) is non-convex. Yet, its landscape is benign and the global solution can be described in terms of the solution to a convex relaxation program [41].

**Theorem 2 (Reformulated from [41]).** Let $\lambda > 0$, $d > k - 1$, and assume an $(R,\rho)$-STEP imbalance setting. Let $\hat{Z}_\lambda \in \mathbb{R}^{k \times n}$ be the unique minimizer of the convex nuclear-norm-regularized loss minimization

$$\hat{Z}_\lambda = \arg\min_{Z} \; L(Z) + \lambda \|Z\|_*, \qquad (7)$$

and denote by $\hat{Z}_\lambda = V_\lambda \Lambda_\lambda U_\lambda^T$ its SVD. Any stationary point of (1) satisfies exactly one of the following two: either it is a strict saddle, or it is a global minimizer $(\hat{W}_\lambda, \hat{H}_\lambda)$ and satisfies

$$[\hat{W}_\lambda \;\; \hat{H}_\lambda]^T [\hat{W}_\lambda \;\; \hat{H}_\lambda] = \begin{bmatrix} V_\lambda \\ U_\lambda \end{bmatrix} \Lambda_\lambda \begin{bmatrix} V_\lambda^T & U_\lambda^T \end{bmatrix}. \qquad (8)$$

This ensures that any first-order method escaping strict saddles finds a stationary point that is a global minimizer [41]. Moreover, it describes the structure of the global minimizers of (1) in terms of $\hat{Z}_\lambda$, the solution to the convex minimization in (7). Structurally, the characterization in (8) resembles the characterization in Theorem 1 regarding the UF-SVM.
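Theorem 2 suggests a simple recipe for computing the global minimizers of (1) numerically: solve the convex program (7) and read off the geometry from (8). Below is a CVXPY sketch of ours (the $\lambda$ grid and the solver choice are arbitrary assumptions); the printed quantity is the direction distance between $\hat{Z}_\lambda$ and the SEL matrix, which is the object studied in the next subsections.

```python
import cvxpy as cp
import numpy as np

k, counts = 4, [10, 10, 2, 2]                  # small imbalanced toy instance
y = np.repeat(np.arange(k), counts); n = len(y)
Z_hat = np.eye(k)[:, y] - 1.0 / k
Z_dir = Z_hat / np.sum(np.linalg.svd(Z_hat, compute_uv=False))   # Z_hat / ||Z_hat||_*

Z = cp.Variable((k, n))
ce = cp.sum(cp.hstack([cp.log_sum_exp(Z[:, i]) - Z[y[i], i] for i in range(n)]))
for lam in [1.0, 0.1, 0.01]:
    cp.Problem(cp.Minimize(ce + lam * cp.normNuc(Z))).solve(solver=cp.SCS)
    nuc = np.sum(np.linalg.svd(Z.value, compute_uv=False))
    print(lam, np.linalg.norm(Z.value / nuc - Z_dir))             # shrinks as lambda decreases
```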
However, Theorem 1 goes a step further and gives an explicit form for the logit matrix, namely the SEL matrix $\hat{Z}$. Instead, $\hat{Z}_\lambda$ in Theorem 2 is given implicitly as the solution to a convex program. In the remainder of this section, we ask: how does $\hat{Z}_\lambda$ compare to $\hat{Z}$ for different values of the regularizer? Also, how does the answer depend on the imbalance level?

**Remark 4.1.** Although not stated explicitly in this form, Theorem 2 is essentially retrieved from the proof of [41, Theorem 3.2] with two small adjustments. First, Zhu et al. [41] only consider balanced data; here, we realize that their proof actually carries over to the imbalanced setting. Second, we relax their assumption $d > k$ to $d > k - 1$ thanks to a simple observation: $1_k^T \nabla_Z L(Z) = 0$, hence the CE gradient drops rank (see Sec. E.1 for details).

### 4.2 Regularization matters

For balanced data, the minimizers $\hat{W}_\lambda, \hat{H}_\lambda$ of (1) satisfy the (NC) and (ETF) properties (up to scaling by a constant) for every value of the regularization parameter $\lambda > 0$ [41] (see also [5, 3, 21]). In our language, for all $\lambda > 0$, there exists a scalar $\alpha_\lambda$ such that the scaling $(\alpha_\lambda \hat{W}_\lambda, \alpha_\lambda \hat{H}_\lambda)$ of any global solution of the regularized CE minimization in (1) satisfies the ETF geometry. Thus, for balanced data, up to a global scaling, the geometry is: (i) insensitive to $\lambda > 0$ and (ii) the same as that of the UF-SVM minimizers. Here, we show that the situation changes drastically with imbalances: the regularization now plays a critical role and the solution is never the same as that of the UF-SVM for finite $\lambda$.

**Proposition 1 (Imbalanced data: Regularization matters).** Assume imbalanced data and $k > 2$. There do not exist a finite $\lambda > 0$ and a corresponding scaling $\alpha_\lambda$ such that the scaled solution $(\alpha_\lambda \hat{W}_\lambda, \alpha_\lambda \hat{H}_\lambda)$ of (1) follows the (SELI) geometry. Equivalently, there do not exist $\lambda > 0$ and $\alpha_\lambda$ such that a scaling of the UF-SVM solution solves (1).

*Proof.* The proof relies on Theorem 1 as follows. For the sake of contradiction, assume there exist $\lambda > 0$ and some $\alpha_\lambda > 0$ such that the scaled UF-SVM minimizer $(\alpha_\lambda \hat{W}, \alpha_\lambda \hat{H})$ solves (1). Since $(\alpha_\lambda \hat{W}, \alpha_\lambda \hat{H})$ is then a stationary point, it satisfies

$$\nabla_W L(\alpha_\lambda^2 \hat{W}^T \hat{H}) + \lambda \alpha_\lambda \hat{W} = 0 \;\Rightarrow\; \alpha_\lambda \hat{H} \big(\nabla_Z L(\alpha_\lambda^2 \hat{W}^T \hat{H})\big)^T = -\lambda \hat{W} \;\Rightarrow\; \alpha_\lambda \hat{W}^T \hat{H} \big(\nabla_Z L(\alpha_\lambda^2 \hat{W}^T \hat{H})\big)^T = -\lambda \hat{W}^T \hat{W}.$$

But, by Theorem 1, $\hat{W}^T \hat{H} = \hat{Z}$ and $\hat{W}^T \hat{W} = V \Lambda V^T$. Moreover, thanks to the special structure of $\hat{Z}$, we can check that $\nabla_Z L(\alpha_\lambda^2 \hat{Z}) = -\alpha'_\lambda \hat{Z}$ for $\alpha'_\lambda = k / (\exp(\alpha_\lambda^2) + k - 1)$; see Lemma A.1(v). With these, and denoting $\alpha''_\lambda = \alpha_\lambda \alpha'_\lambda$, we arrive at the following about the singular values of $\hat{Z}$:

$$\alpha''_\lambda \hat{Z} \hat{Z}^T = \lambda \hat{W}^T \hat{W} \;\Rightarrow\; \alpha''_\lambda V \Lambda^2 V^T = \lambda V \Lambda V^T \;\Rightarrow\; \alpha''_\lambda \Lambda^2 = \lambda \Lambda \;\Rightarrow\; \Lambda = (\lambda / \alpha''_\lambda)\, I_{k-1}.$$

Thus, all singular values of $\hat{Z}$ must be the same. However, we show in Lemma A.3 that this is not the case unless data are balanced or $k = 2$. Thus, we arrive at a contradiction.

Figure 3: Numerical study of the global solutions $(\hat{W}_\lambda, \hat{H}_\lambda)$ of (1) across the regularization path: (a) distance to SELI (SELI convergence), (b) minimum margin (SELI geometry vs. minority collapse), (c) distance to ETF (ETF convergence).

### 4.3 Vanishing regularization

As $\lambda$ vanishes, it is not hard to check that the minimizers diverge in norm, and the relevant question becomes: where do they converge in direction? The following answers this.

**Proposition 2 (Regularization path leads to UF-SVM).** Suppose $d > k - 1$ and an $(R,\rho)$-STEP imbalance. It then holds that $\lim_{\lambda \to 0} \hat{W}_\lambda^T \hat{H}_\lambda \big/ \big( \|\hat{W}_\lambda\|_F^2 / 2 + \|\hat{H}_\lambda\|_F^2 / 2 \big) = \hat{Z} / \|\hat{Z}\|_*$.

Put together with the content of the previous section: for balanced data, the solution is always the same up to global scaling.
However, for imbalanced data, the solution changes with $\lambda$, and only in the limit $\lambda \to 0$ does it align with that of the UF-SVM. Regarding the proof of the proposition: thanks to Theorems 1 and 2, it suffices to show that the solution $\hat{Z}_\lambda$ of (7) converges in direction to the SEL matrix $\hat{Z}$; see Proposition 3 in Sec. D. To show this, we critically use from Theorem 1 that $\hat{Z}$ is the unique minimizer of (5) (see Sec. E.3). Closely related results are [28, 16], which studied the regularization path of $p$-norm-regularized CE.

### 4.4 Imbalance emphasizes the impact of non-convexity

Recall the interpretation of the UFM as a two-layer linear net trained on the standard basis $e_i \in \mathbb{R}^n$. Suppose instead that we train a simple $k$-class linear classifier $\Xi \in \mathbb{R}^{k \times n}$ on the same data by minimizing ridge-regularized CE: $\min_\Xi L(\Xi) + \frac{\lambda}{2} \|\Xi\|_F^2$. It is easy to check that (after scaling) $\hat{Z}$ satisfies the first-order optimality conditions. Thus, the optimal linear classifier is such that, for all $\lambda > 0$, there exists $\alpha_\lambda$ such that $\hat{\Xi}_\lambda = \alpha_\lambda \hat{Z}$. Contrasting this to Proposition 1, we find that the end-to-end models minimizing ridge-regularized CE for a linear versus a two-layer linear network are the same (in direction) when data are balanced, but differ under imbalances.

## 5 Experiments

In the experiments, we choose $(R, 1/2)$-STEP imbalances with varying $R$ and measure convergence to either the (SELI) or the (ETF) geometry, in terms of the three metrics below, corresponding to classifiers, embeddings, and logits, respectively. Denote by $\bar{A} = A / \|A\|_F$ the Euclidean normalization of a matrix $A$ and by $G_{\bar{A}} = \bar{A}^T \bar{A}$ its Gram matrix.

**Classifiers:** We measure $\|G_{\bar{W}} - \hat{G}_W\|_F$, where $\hat{G}_W^{\text{ETF}} = I_k - \frac{1}{k} 1_k 1_k^T =: G$ and $\hat{G}_W^{\text{SELI}} = V \Lambda V^T$ (see Definition 3).

**Embeddings:** Because of (NC), it suffices to work with the matrix $M$ of mean embeddings (see Sec. 3.1). Specifically, we measure $\|G_{\bar{M}} - \hat{G}_M\|_F$, where $\hat{G}_M^{\text{ETF}} = G$ and $\hat{G}_M^{\text{SELI}}$ is computed from the $n$-dimensional $\hat{G}_H^{\text{SELI}} = U \Lambda U^T$ by only keeping the $k$ columns/rows corresponding to the first example of each class. In the deep-net experiments, we employ an additional centering of the class means with their (balanced) global mean; see Sec. G.

**Logits:** We measure $\|\bar{W}^T \bar{M} - G\|_F$. Note that, when (NC) holds, this metric is essentially analogous to measuring $\|W^T H - \hat{Z}\|_F$.

**UFM: Global minimizers.** Fig. 3 investigates the global minimizers of regularized CE (1) for the UFM and $k = 4$ classes. Thanks to Theorem 2, we obtain such minimizers by solving (7) with CVX [6] and then using (8) to infer the Gram matrices $G_W, G_M$ and the logits $W^T H$. Fig. 3c shows that the distance to ETF is large and does not approach zero for any value of $\lambda$. On the other hand, Fig. 3a numerically validates Propositions 1 and 2: the distance to SELI for all three metrics is non-zero for any finite $\lambda > 0$, but converges to zero as $\lambda \to 0$. However, this convergence is slow and the rate becomes even worse as $R$ increases. Finally, Fig. 3b depicts the minimum margins of solutions across $\lambda$. For all sufficiently small $\lambda$, the minimum margin is strictly positive (see Lemma D.4 in Sec. D.2). As a byproduct, this shows that minority collapse [3] can only possibly occur for large $\lambda$; see also Sec. H.

Figure 4: Geometry of SGD solutions when minimizing CE for the UFM; distance to SELI (solid) vs. ETF (dashed) for (a) classifiers, (b) embeddings, (c) logits, over training epochs and for $R \in \{1, 2, 10\}$.

**UFM: SGD solutions.** Fig. 4 investigates whether the solutions found by SGD are consistent with the prediction of Theorem 1 about global minimizers of the UF-SVM; a minimal sketch of this training setup is given below.
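The following PyTorch sketch is ours: the feature dimension, initialization scale, and the shortened epoch count are illustrative assumptions, while the learning rate 0.4, batch size 4, and absence of weight decay follow the description in the next paragraph.

```python
import torch

k, d, R, n_min = 4, 10, 10, 10                      # (R, 1/2)-STEP imbalance with k = 4 classes
counts = [R * n_min] * (k // 2) + [n_min] * (k // 2)
y = torch.repeat_interleave(torch.arange(k), torch.tensor(counts))
n = len(y)

W = (0.1 * torch.randn(d, k)).requires_grad_()      # classifier weights
H = (0.1 * torch.randn(d, n)).requires_grad_()      # unconstrained features, one column per example
opt = torch.optim.SGD([W, H], lr=0.4)               # constant step size, no weight decay
criterion = torch.nn.CrossEntropyLoss(reduction="sum")

for epoch in range(10_000):                         # the paper trains much longer (1e5 epochs)
    for idx in torch.randperm(n).split(4):          # batch size 4
        opt.zero_grad()
        logits = (W.T @ H[:, idx]).T                # batch x k logits of the UFM
        criterion(logits, y[idx]).backward()
        opt.step()
    # periodically compare the normalized Grams of W, H and the logits W^T H
    # against the SELI / ETF targets (Sec. 5 metrics) to reproduce Fig. 4
```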
We fix $k = 4$ and, for each $R$, we select $n_{\min}$ so that $n = \frac{R+1}{2} k\, n_{\min} \approx 400$. The weights of the UFM are optimized using SGD with constant learning rate 0.4, batch size 4, and no weight decay. We train for $10^5$ epochs, far beyond zero training error, and plot the distance to SELI and ETF over time. We observe the following: (i) SGD iterates favor the SELI, rather than the ETF, geometry; as a matter of fact, the distance to SELI decreases with epochs, suggesting an implicit bias of SGD towards global minimizers of the UF-SVM. (ii) However, convergence is rather slow and rates get worse with increasing imbalance. (iii) Also, convergence of the embeddings is more elusive compared to that of the classifiers. Interestingly, the last two observations are reminiscent of the trends we observed in Fig. 3a, suggesting connections between the regularization path and (S)GD iterates that are worth investigating further. Refer to Sec. F for additional numerical results on the UFM.

**Deep-learning experiments.** We investigate convergence to SELI when training deep nets on $(R, \rho = 1/2)$-STEP imbalanced MNIST, Fashion-MNIST, and CIFAR10 datasets with ResNet-18 [10] and VGG-13 [29]; see Sec. G.1.1 for implementation details. The convergence to SELI and ETF for the classifiers, (centered) mean embeddings, and logits is illustrated in Fig. 2 for ResNet and in Fig. 11 in the SM for VGG. The vertical dashed lines mark the zero-training-error epoch; see Sec. G.1.3. In all plots, the distance to SELI decreases as training evolves and convergence is consistently better compared to the ETF. However, convergence slows down with increasing imbalance (see $R = 100$). Also, convergence is worse for the embeddings compared to the classifiers. See Sec. G in the SM for additional results.

## 6 Outlook: Imbalance troubles and opportunities

We propose (SELI) as the class-imbalance-invariant geometry of classifiers and embeddings learned by overparameterized models when trained beyond zero training error. We arrive at it after showing that the UF-SVM global minimizers follow this geometry. Subsequently, we conjecture and present experiments supporting that: (C1) GD on the UFM leads to solutions approaching the SELI geometry asymptotically in the number of epochs (see Sec. I.2 for the connection to implicit bias); (C2) training of deep nets learns models that approach the SELI geometry (see Sec. G for additional experiments). We hope our results motivate further theoretical and experimental investigations, especially since data imbalances appear frequently across applications. Beyond that, we believe that further studies on identifying geometric structures of learned embeddings and classifiers could offer new perspectives on generalization. Our results could pave the way, since they uncover different geometries (namely, SELI geometries for different $R$ values), each leading to different generalization (worse for increasing $R$ [1]). Relatedly, we envision that further such studies will lead to algorithmic contributions in imbalanced deep learning, as they can facilitate studying the implicit-bias effect of CE adjustments and post-hoc techniques tailored to imbalanced data [2, 23, 39, 18, 17, 19, 20].

## References

[1] J. Byrd and Z. Lipton. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, pages 872–881. PMLR, 2019.

[2] K. Cao, C. Wei, A. Gaidon, N. Arechiga, and T. Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems, pages 1567–1578, 2019.

[3] C. Fang, H. He, Q. Long, and W. J.
Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43), 2021.

[4] T. Galanti, A. György, and M. Hutter. On the role of neural collapse in transfer learning. arXiv preprint arXiv:2112.15121, 2021.

[5] F. Graf, C. Hofer, M. Niethammer, and R. Kwitt. Dissecting supervised contrastive learning. In International Conference on Machine Learning, pages 3821–3830. PMLR, 2021.

[6] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, Mar. 2014.

[7] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, pages 1832–1841. PMLR, 2018.

[8] B. D. Haeffele and R. Vidal. Structured low-rank matrix factorization: Global optimality, algorithms, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6):1468–1482, 2019.

[9] X. Han, V. Papyan, and D. L. Donoho. Neural collapse under MSE loss: Proximity to and dynamics on the central path. arXiv preprint arXiv:2106.02073, 2021.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[11] L. Hui, M. Belkin, and P. Nakkiran. Limitations of neural collapse for understanding generalization in deep learning. arXiv preprint arXiv:2202.08384, 2022.

[12] W. Ji, Y. Lu, Y. Zhang, Z. Deng, and W. J. Su. An unconstrained layer-peeled perspective on neural collapse. arXiv preprint arXiv:2110.02796, 2021.

[13] Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.

[14] Z. Ji and M. Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pages 1772–1798. PMLR, 2019.

[15] Z. Ji and M. Telgarsky. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 33:17176–17186, 2020.

[16] Z. Ji, M. Dudík, R. E. Schapire, and M. Telgarsky. Gradient descent follows the regularization path for general losses. In Conference on Learning Theory, pages 2109–2136. PMLR, 2020.

[17] B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis. Decoupling representation and classifier for long-tailed recognition, 2020.

[18] B. Kim and J. Kim. Adjusting decision boundary for class imbalanced learning. IEEE Access, 8:81674–81685, 2020. doi: 10.1109/ACCESS.2020.2991231.

[19] G. R. Kini, O. Paraskevas, S. Oymak, and C. Thrampoulidis. Label-imbalanced and group-sensitive classification under overparameterization. Advances in Neural Information Processing Systems, 34:18970–18983, 2021.

[20] M. Li, X. Zhang, C. Thrampoulidis, J. Chen, and S. Oymak. AutoBalance: Optimized loss functions for imbalanced data. Advances in Neural Information Processing Systems, 34:3163–3177, 2021.

[21] J. Lu and S. Steinerberger. Neural collapse with cross-entropy loss. arXiv preprint arXiv:2012.08465, 2020.

[22] K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.

[23] A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar. Long-tail learning via logit adjustment. arXiv preprint arXiv:2007.07314, 2020.

[24] D. G. Mixon, H. Parshall, and J. Pi. Neural collapse with unconstrained features.
arXiv preprint arXiv:2011.11619, 2020.

[25] M. S. Nacson, J. Lee, S. Gunasekar, P. H. P. Savarese, N. Srebro, and D. Soudry. Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3420–3428. PMLR, 2019.

[26] V. Papyan, X. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.

[27] X. Peng, C. Lu, Z. Yi, and H. Tang. Connections between nuclear-norm and Frobenius-norm-based representations. IEEE Transactions on Neural Networks and Learning Systems, 29(1):218–224, 2016.

[28] S. Rosset, J. Zhu, and T. Hastie. Margin maximizing loss functions. In NIPS, pages 1237–1244, 2003.

[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[30] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

[31] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems, 17, 2004.

[32] T. Tirer and J. Bruna. Extended unconstrained features model for exploring deep neural collapse. arXiv preprint arXiv:2202.08087, 2022.

[33] G. Vardi, O. Shamir, and N. Srebro. On margin maximization in linear and ReLU networks. arXiv preprint arXiv:2110.02732, 2021.

[34] R. Vidal and P. Favaro. Low rank subspace clustering (LRSC). Pattern Recognition Letters, 43:47–61, 2014.

[35] K. Wang, V. Muthukumar, and C. Thrampoulidis. Benign overfitting in multiclass classification: All roads lead to interpolation. Advances in Neural Information Processing Systems, 34, 2021.

[36] J. Wright and Y. Ma. High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications. Cambridge University Press, 2022.

[37] L. Xie, Y. Yang, D. Cai, D. Tao, and X. He. Neural collapse inspired attraction-repulsion-balanced loss for imbalanced learning. arXiv preprint arXiv:2204.08735, 2022.

[38] Y. Yang, L. Xie, S. Chen, X. Li, Z. Lin, and D. Tao. Do we really need a learnable classifier at the end of deep neural network? arXiv preprint arXiv:2203.09081, 2022.

[39] H.-J. Ye, H.-Y. Chen, D.-C. Zhan, and W.-L. Chao. Identifying and compensating for feature deviation in imbalanced deep learning, 2020.

[40] J. Zhou, X. Li, T. Ding, C. You, Q. Qu, and Z. Zhu. On the optimization landscape of neural collapse under MSE loss: Global optimality with unconstrained features. arXiv preprint arXiv:2203.01238, 2022.

[41] Z. Zhu, T. Ding, J. Zhou, X. Li, C. You, J. Sulam, and Q. Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34, 2021.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] We include concrete theoretical and experimental results substantiating our claims.
   (b) Did you describe the limitations of your work? [Yes] We discuss these throughout the paper. For example, we discuss that convergence to the proposed geometry is slower in certain settings. See also Section 6.
   (c) Did you discuss any potential negative societal impacts of your work?
   [Yes] We do not foresee any potential negative societal impact of this work.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] We have reviewed the guidelines and stand by the same.
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] We state our assumptions clearly.
   (b) Did you include complete proofs of all theoretical results? [Yes] While the main body includes sketches of some of the proofs, more detailed proofs are deferred to the supplementary material.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Our code scripts and instructions are included.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] We include these details with the experimental results.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We show results over multiple runs to confirm invariance of the learned geometries across training instances.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We report these in Sec. 5 and in the SM.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Sec. 5.
   (b) Did you mention the license of the assets? [Yes] See Sec. 5.
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]