Published as a conference paper at ICLR 2024

HIGH-DIMENSIONAL SGD ALIGNS WITH EMERGING OUTLIER EIGENSPACES

Gérard Ben Arous, Courant Institute, New York University, New York, NY, USA. benarous@cims.nyu.edu
Reza Gheissari, Department of Mathematics, Northwestern University, Evanston, IL, USA. gheissari@northwestern.edu
Jiaoyang Huang, Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA, USA. huangjy@wharton.upenn.edu
Aukosh Jagannath, Department of Statistics and Actuarial Science, Department of Applied Mathematics, and Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada. a.jagannath@uwaterloo.ca

ABSTRACT

We rigorously study the joint evolution of training dynamics via stochastic gradient descent (SGD) and the spectra of empirical Hessian and gradient matrices. We prove that in two canonical classification tasks for multi-class high-dimensional mixtures and either 1 or 2-layer neural networks, the SGD trajectory rapidly aligns with emerging low-rank outlier eigenspaces of the Hessian and gradient matrices. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training, and exhibiting rank deficiency when the SGD converges to sub-optimal classifiers. This establishes some of the rich predictions that have arisen from extensive numerical studies in the last decade about the spectra of Hessian and information matrices over the course of training in overparametrized networks.

1 INTRODUCTION

Stochastic gradient descent (SGD) and its many variants are the backbone of modern machine learning algorithms (see e.g., Bottou (1999)). The training dynamics of neural networks, however, are still poorly understood in the non-convex and high-dimensional settings that are frequently encountered. A common explanation for the staggering success of neural networks, especially when overparameterized, is that the loss landscapes that occur in practice have many flat directions and a hidden low-dimensional structure within which the bulk of training occurs. To understand this belief, much attention has been paid to the Hessian of the empirical risk (and related matrices formed via gradients) along training. This perspective on the training dynamics of neural networks was proposed in LeCun et al. (2012), and numerically analyzed in depth in Sagun et al. (2017a;b). The upshot of these studies was a broad understanding of the spectrum of the Hessian of the empirical risk, which we summarize as follows:

1. It has a bulk that is dependent on the network architecture, and is concentrated around 0, becoming more so as the model becomes more overparametrized;
2. It has (relatively) few outlier eigenvalues that are dependent on the data, and evolve nontrivially along training while remaining separated from the bulk.

Since those works, these properties of the Hessian and related spectra over the course of training have seen more refined and large-scale experimentation. Papyan (2019) found a hierarchical decomposition of the Hessian for deep networks, attributing the bulk, emergent outliers, and a minibulk to three different parts of the Hessian. Perhaps most relevant to this work, Gur-Ari et al. (2019) noticed that gradient descent tends to quickly align with a low-dimensional outlier subspace of the Hessian matrix, and stay in that subspace for long subsequent times.
They postulated that this common low-dimensional structure to the SGD and Hessian matrix may be key to many classification tasks in machine learning. For a sampling of other empirical investigations of spectra of Hessians and information matrices along training, see e.g., Ghorbani et al. (2019); Papyan (2020); Li et al.; Martin & Mahoney (2019); Cohen et al. (2021); Xie et al. (2023). From a theoretical perspective, much attention has been paid to the Hessians of deep networks using random matrix theory approaches. Most of this work has focused on the spectrum at a fixed point in parameter space, most commonly at initialization. Early works in the direction include Watanabe (2007); Dauphin et al. (2014). Choromanska et al. (2015) noted similarities of neural net Hessians to spin glass Hessians, whose complexity (exponential numbers of critical points) has been extensively studied see e.g., Auffinger & Ben Arous (2013); Auffinger et al. (2013). More recently, the expected complexity has been investigated in statistical tasks like tensor PCA and generalized linear estimation Ben Arous et al. (2019); Maillard et al. (2020). In Pennington & Worah (2018) and Pennington & Bahri (2017), the empirical spectral distribution of Hessians and information matrices in single layer neural networks were studied at initialization. Liao & Mahoney (2021) studied Hessians of some non-linear models, which they referred to as generalized GLMs, also at initialization. Under the infinite-width neural tangent kernel limit of Jacot et al. (2018), Fan & Wang (2020) derived the empirical spectral distribution at initialization, and Jacot et al. (2020) studied its evolution over the course of training. In this limit the input dimension is kept fixed compared to the parameter dimension, while our interest in this paper is when the input dimension, parameter dimension, and number of samples all scale together. An important step towards understanding the evolution of Hessians along training in the highdimensional setting, is of course understanding the training dynamics themselves. Since the classical work of Robbins & Monro (1951), there has been much activity studying limit theorems for stochastic gradient descent. In the high-dimensional setting, following Saad & Solla (1995a;b), investigations have focused on finding a finite number of functions (sometimes called observables or summary statistics ), whose dynamics under the SGD are asymptotically autonomous in the highdimensional limit. For a necessarily small sampling of this rich line of work, we refer to Goldt et al. (2019); Veiga et al. (2022); Paquette et al. (2021); Arnaboldi et al. (2023); Tan & Vershynin (2019); Ben Arous et al. (2022). Of particular relevance, it was shown by Damian et al. (2022); Mousavi Hosseini et al. (2023) that for multi-index models, SGD predominantly lives in the low-dimensional subspace spanned by the ground truth parameters. A class of tasks whose SGD is amenable to this broad approach is classification of Gaussian mixture models (GMMs). With various losses and linearly separable class structures, the minimizer of the empirical risk landscape with single-layer networks was studied in Mignacco et al. (2020); Loureiro et al. (2021). A well-studied case of a Gaussian mixture model needing a two-layer network is under an XOR-type class structure; the training dynamics of SGD for this task were studied in Refinetti et al. (2021) and Ben Arous et al. 
(2022), and it was found to have a particularly rich structure, with positive probability of convergence to bad classifiers among other degenerate phenomena. Still, a simultaneous understanding of high-dimensional SGD and the spectra of the Hessian and related matrices along the training trajectory has remained largely open.

1.1 OUR CONTRIBUTIONS

In this paper, we study the interplay between the training dynamics (via SGD) and the spectral compositions of the empirical Hessian matrix and an empirical gradient second moment matrix, or simply G-matrix (2.2) (similar in spirit to an information matrix), over the course of training. We rigorously show the following phenomenology in two canonical high-dimensional classification tasks with k hidden classes:

1. Shortly into training, the empirical Hessian and empirical G-matrices have C(k) many outlier eigenvalues, and the SGD itself predominantly lives in their eigenspace. Here C(k) is explicit and depends on the performance of the classifier to which the SGD converges.
2. In multi-layer settings, this alignment happens within each layer, i.e., the first layer parameters align with the outlier eigenspaces of the corresponding blocks of the empirical Hessian and G-matrices, and likewise for the second layer parameters.
3. This alignment is not predicated on success at the classification task: when the SGD converges to a sub-optimal classifier, the empirical Hessian and G-matrices have lower-rank outlier eigenspaces, and the SGD aligns with those rank-deficient spaces.

The first model we consider is the basic example of supervised classification of general k-component Gaussian mixture models with k linearly independent classes by a single-layer neural network. In Theorem 3.1, we establish alignment of the form of Item 1 above, between each of the k one-vs-all classifiers and their corresponding blocks in the empirical Hessian and G-matrices. See also the depictions in Figures 3.1 and 3.2. In order to show this, we show that the matrices have an outlier-minibulk-bulk structure throughout the parameter space, and derive limiting dynamical equations for the trajectory of appropriate summary statistics of the SGD trajectory following the approach of Ben Arous et al. (2022). Importantly, the same low-dimensional subspace is at the heart of both the outlier eigenspaces and the summary statistics. At this level of generality, the SGD can behave very differently within the outlier eigenspace. As an example of the refined phenomenology that can arise, we further investigate the special case where the means are orthogonal; here the SGD aligns specifically with the single largest outlier eigenvalue, which itself has separated from the other k − 1 outliers along training. This is proved in Theorem 3.3 and depicted in Figure 3.3. These results are presented in Section 3.

To demonstrate our results in more complex multi-layer settings, we consider supervised classification of a GMM version of the famous XOR problem of Minsky & Papert (1969). This is one of the simplest models that requires a two-layer neural network to solve. We use a two-layer architecture with a second layer of width K. As indicated by Item 2 above, in Theorems 4.1 and 4.2 the alignment of the SGD with the matrices' outlier eigenspaces occurs within each layer, the first layer having an outlier space of rank two, and the second layer having an outlier space of rank 4 when the dynamics converges to an optimal classifier.
This second layer's alignment is especially rich: when the model is overparametrized (K large), its outlier space of rank 4 is not present at initialization, and only separates from its rank-K bulk over the course of training. This can be interpreted as a dynamical version of the well-known spectral phase transition in spiked covariance matrices of Baik et al. (2005): see Figure 4.2 for a visualization. Moreover, the SGD for this problem is known to converge to sub-optimal classifiers with probability bounded away from zero (Ben Arous et al., 2022), and we find that in these situations the alignment still occurs, but the outlier eigenspaces into which the SGD moves are rank-deficient compared to the number of hidden classes, 4: see Figure 4.3. These results are presented in Section 4.

2 PRELIMINARIES

Before turning to our results, let us begin by introducing the following framework and notation that will be relevant throughout this work. We suppose that we are given data from a distribution P_Y over pairs Y = (y, Y), where y ∈ R^k is a one-hot label vector that takes the value 1 on a class (sometimes identified with the element of [k] = {1, ..., k} on which it is 1), and Y ∈ R^d is a corresponding feature vector. In training we take as loss a function of the form L(x, Y) : R^p × R^{k+d} → R_+, where x ∈ R^p represents the network parameter. (As we are studying supervised classification, in both settings this loss will be the usual cross-entropy loss corresponding to the architecture used.) We imagine we have two data sets, a training set (Y_ℓ)_{ℓ=1}^M and a test set (Ỹ_ℓ)_{ℓ=1}^{M̃}, all drawn i.i.d. from P_Y.

Let us first define the stochastic gradient descent trained using (Y_ℓ). In order to ensure the SGD doesn't go off to infinity, we add an ℓ2 penalty term (as is common in practice) with Lagrange multiplier β. The (online) stochastic gradient descent with initialization x_0 and learning rate, or step-size, δ, will be run using the training set (Y_ℓ)_{ℓ=1}^M as follows:

    x_ℓ = x_{ℓ−1} − δ (∇L(x_{ℓ−1}, Y_ℓ) + β x_{ℓ−1}).    (2.1)

Our aim is to understand the behavior of SGD with respect to principal subspaces, i.e., outlier eigenvectors, of the empirical Hessian matrix and empirical second moment matrix of the gradient. This latter matrix is exactly the information matrix when L is the log-likelihood; in our paper L is taken to be a cross-entropy loss, so we simply refer to this as the G-matrix henceforth. We primarily consider the empirical Hessian and empirical G-matrices generated out of the test data, namely:

    ∇²R̂(x) = (1/M̃) Σ_{ℓ=1}^{M̃} ∇²L(x, Ỹ_ℓ),    and    Ĝ(x) = (1/M̃) Σ_{ℓ=1}^{M̃} ∇L(x, Ỹ_ℓ)^{⊗2}.    (2.2)

(Notice that in our online setting, it is as natural to generate these matrices with test data as with train data. See Remark 5.1 for how our results extend when training data is used.) When the parameter space naturally splits into subsets of its indices (e.g., the first-layer weights and the second-layer weights), for a subset I of the parameter coordinates, we use subscripts ∇²_{I,I} R̂ and Ĝ_{I,I} to denote the block corresponding to that subset.

To formalize the notion of alignment between the SGD and the principal directions of the Hessian and G-matrices, we introduce the following language. For a subspace B, we let P_B denote the orthogonal projection onto B; for a vector v, we let ‖v‖ be its ℓ2 norm; and for a matrix A, we let ‖A‖ = ‖A‖_op be its ℓ2-to-ℓ2 operator norm.

Definition 2.1. The alignment of a vector v with a subspace B is the ratio ρ(v, B) = ‖P_B v‖/‖v‖.
We say a vector v lives in a subspace B up to error ε if ρ(v, B) ≥ 1 − ε. For a matrix A, we let E_k(A) denote the span of the top k eigenvectors of A, i.e., the span of the k eigenvectors of A whose eigenvalues are largest in absolute value. We also use the following.

Definition 2.2. We say a matrix A lives in a subspace B up to error ε if there exists M such that Im(A − M) ⊆ B with ‖M‖_op ≤ ε‖A‖_op, where ‖·‖_op denotes the ℓ2-to-ℓ2 operator norm.

3 CLASSIFYING LINEARLY SEPARABLE MIXTURE MODELS

We begin by illustrating our results on (arguably) the most basic problem of high-dimensional multiclass classification, namely supervised classification of a k-component Gaussian mixture model with constant variance and linearly independent means using a single-layer network. (This is sometimes used as a toy model for the training dynamics of the last layer of a deep network, via the common ansatz that the output of the second-to-last layer of a deep network behaves like a linearly separable mixture of Gaussians: see e.g., the neural collapse phenomenon posited by Papyan et al. (2020).)

3.1 DATA MODEL

Let C = [k] be the collection of classes, with corresponding distinct class means (µ_a)_{a∈[k]} ⊂ R^d, covariance matrices I_d/λ, where λ > 0 can be viewed as a signal-to-noise parameter, and corresponding probabilities 0 < (p_a)_{a∈[k]} < 1 such that Σ_{a∈[k]} p_a = 1. The number of classes k = O(1) is fixed (here and throughout the paper, the o(1), O(1) and Ω(1) notations are with respect to the dimension parameter d, and may hide constants that are dimension independent such as k, β). For the sake of simplicity we take the means to be unit norm. Further, in order for the task to indeed be solvable with the single-layer architecture, we assume that the means are linearly independent, say with a fixed (i.e., d-independent) matrix of inner products (m_{ab})_{a,b} = (µ_a · µ_b)_{a,b}. Our data distribution P_Y is a mixture of the form Σ_c p_c N(µ_c, I_d/λ), with an accompanying class label y ∈ R^k. Namely, if Z_λ ∼ N(0, I_d/λ), our data is given as Y = (y, Y) where:

    y ∼ Σ_{a∈[k]} p_a δ_{1_a},    and    Y ∼ Σ_{a∈[k]} y_a µ_a + Z_λ.    (3.1)

We perform classification by training a single-layer network formed by k all-vs-one classifiers using the cross-entropy loss (equivalently, we are doing multi-class logistic regression):

    L(x, Y) = −Σ_{c∈[k]} y_c x^c · Y + log Σ_{c∈[k]} exp(x^c · Y),    (3.2)

where x = (x^c)_{c∈C} are the parameters, each of which is a vector in R^d, i.e., x ∈ R^{dk}. (Note that we can alternatively view x as a k × d matrix.)

Figure 3.1: The alignment of the SGD trajectory x^c_ℓ with E_k(∇²_{cc} R̂(x_ℓ)) (left) and E_k(Ĝ_{cc}(x_ℓ)) (right) over time, for c = 1, ..., k (shown in different colors). The x-axis is rescaled time, ℓδ. The parameters here are k = 10 classes in dimension d = 1000 with λ = 10, β = 0.01, and δ = 1/d.

3.2 RESULTS AND DISCUSSION

Our first result is that after linearly many steps, the SGD finds the subspace generated by the outlier eigenvalues of the Hessian and/or G-matrix of the test loss and lives there for future times.

Theorem 3.1. Consider the mixture of k Gaussians with loss function from (3.2), and SGD (2.1) with learning rate δ = O(1/d), regularizer β > 0, initialized from N(0, I_d/d). There exist α_0, λ_0 such that if λ ≥ λ_0 and M̃ ≥ α_0 d, the following hold. For every ε > 0, there exists T_0(ε) such that for any time horizon T_0 < T_f < M/d, with probability 1 − o_d(1),

    x^c_ℓ lives in E_k(∇²_{cc} R̂(x_ℓ)) and in E_k(Ĝ_{cc}(x_ℓ)), for every c ∈ [k],

up to O(ε + λ^{−1}) error, for all ℓ ∈ [T_0 δ^{−1}, T_f δ^{−1}].
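To make Definitions 2.1 and 2.2, and the alignment curves referred to in Theorem 3.1 and Figure 3.1, concrete, the following is a minimal numpy sketch of the quantities involved. It is not part of the paper's proofs or experiments: the helper names (top_eigenspace, alignment) and all parameter values are our own illustrative choices, used on a toy "spike plus noise" matrix.

```python
import numpy as np

def top_eigenspace(A, k):
    """Orthonormal basis for E_k(A): the span of the k eigenvectors of the
    symmetric matrix A whose eigenvalues are largest in absolute value."""
    eigvals, eigvecs = np.linalg.eigh(A)           # eigenvalues in ascending order
    idx = np.argsort(np.abs(eigvals))[::-1][:k]    # k largest |eigenvalues|
    return eigvecs[:, idx]                         # d x k matrix, orthonormal columns

def alignment(v, basis):
    """rho(v, B) = ||P_B v|| / ||v|| for a subspace B given by an orthonormal basis."""
    return np.linalg.norm(basis.T @ v) / np.linalg.norm(v)

# Toy check: a rank-k spike part plus a small bulk, mimicking the outlier/bulk picture.
rng = np.random.default_rng(0)
d, k, lam = 200, 3, 10.0
spikes = np.linalg.qr(rng.standard_normal((d, k)))[0]      # d x k orthonormal columns
noise = rng.standard_normal((d, d)) / (lam * np.sqrt(d))
A = spikes @ np.diag([5.0, 4.0, 3.0]) @ spikes.T + (noise + noise.T) / 2

v_in = spikes @ rng.standard_normal(k)     # a vector inside the spike span
v_out = rng.standard_normal(d)             # a generic vector
B = top_eigenspace(A, k)
print(alignment(v_in, B))    # close to 1: v_in "lives in" E_k(A)
print(alignment(v_out, B))   # about sqrt(k/d): no alignment
```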
This result is demonstrated in Figure 3.1, which plots the alignment of the training dynamics x^c_ℓ with the principal eigenspaces of the Hessian and G-matrix for each c ∈ [k]. As we see, the alignment increases to near 1 rapidly for all blocks in both matrices. This theorem, and all our future results, are stated using a random Gaussian initialization, scaled such that the norm of the parameters is O(1) in d. The fact that this is Gaussian is not relevant to the results, and similar results hold for other uninformative initializations with norm of O(1).

Theorem 3.1 follows from the following theorem, which shows that the SGD trajectory, its Hessian, and its G-matrix (and their top k eigenspaces) all live, up to O(1/λ) error, in Span(µ_1, ..., µ_k).

Theorem 3.2. In the setup of Theorem 3.1, the following live in Span(µ_1, ..., µ_k) up to O(ε + λ^{−1}) error with probability 1 − o_d(1):

1. The state of the SGD along training, x^c_ℓ, for every c;
2. The b, c blocks of the empirical test Hessian, ∇²_{bc} R̂(x_ℓ), for all b, c ∈ [k];
3. The b, c blocks of the empirical test G-matrix, Ĝ_{bc}(x_ℓ), for all b, c ∈ [k].

We demonstrate this result in Figure 3.2. Inside the low-rank space spanned by µ_1, ..., µ_k, the training dynamics, Hessian, and G-matrix spectra can display different phenomena depending on the relative locations of µ_1, ..., µ_k and weights p_1, ..., p_k. To illustrate the more refined alignment phenomena, let us take as a concrete example p_c = 1/k for all c, and µ_1, ..., µ_k orthonormal. With this concrete choice, we can analyze the limiting dynamical system of the SGD without much difficulty, and its relevant dynamical observables have a single stable fixed point, to which the SGD converges in linearly many, i.e., O(δ^{−1}), steps. This allows us to show more precise alignment that occurs within Span(µ_1, ..., µ_k) over the course of training.

Theorem 3.3. In the setting of Theorem 3.1, with the means (µ_1, ..., µ_k) being orthonormal, the estimate x^c_ℓ has positive Ω(1) inner product with the top eigenvector of both ∇²_{cc} R̂(x_ℓ) and Ĝ_{cc}(x_ℓ) (and negative Ω(1) inner product with the k − 1 next largest eigenvectors). Also, the top eigenvector of ∇²_{cc} R̂(x_ℓ), as well as that of Ĝ_{cc}(x_ℓ), lives in Span(µ_c) up to O(ε + λ^{−1}) error.

Figure 3.2: From left to right: plot of the entries of x^1_ℓ and of the k leading eigenvectors (in different colors) of ∇²_{11} R̂(x_ℓ) and Ĝ_{11}(x_ℓ), respectively, at the end of training, namely ℓ = 25,000 steps. Here the x-axis represents the coordinate index. The parameters are the same as in Fig. 3.1 and the means are µ_i = e_{50i}.

Figure 3.3: Left: the eigenvalues (in different colors) of ∇²_{11} R̂(x_ℓ) over the course of training. The leading k eigenvalues are separated from the bulk at all times, and the top eigenvalue, corresponding to µ_1, separates from the remaining eigenvalues soon after initialization. Right: the inner product of x^1_ℓ with the means µ_1, ..., µ_k undergoes a similar separation over the course of training. Parameters are the same as in the preceding figures.

Put together, the above three theorems describe the following rich scenario for classification of the k-GMM. At initialization, and throughout the parameter space, in each class-block the Hessian and G-matrices decompose into a rank-k outlier part spanned by µ_1, ..., µ_k, and a correction term of size O(1/λ) in operator norm. Furthermore, when initialized randomly, the SGD is not aligned with the outlier eigenspaces, but does align with them in a short O(δ^{−1}) number of steps.
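As a companion to the scenario just described, here is a small self-contained simulation in the spirit of the online SGD (2.1) on the k-GMM data (3.1) with the cross-entropy loss (3.2), tracking the alignment of each row x^c_ℓ with Span(µ_1, ..., µ_k) as in Theorem 3.2. This is only a hedged sketch: the dimension, number of steps, and helper names (sample, loss_grad) are our own choices, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, lam, beta = 500, 4, 10.0, 0.01
delta = 1.0 / d                          # step size delta = 1/d, as in the figures
steps = 20 * d                           # O(delta^{-1}) online SGD steps
mu = np.eye(d)[:k]                       # orthonormal means mu_1, ..., mu_k (as rows)

def sample():
    """One sample from the k-GMM (3.1) with equal weights p_a = 1/k."""
    a = rng.integers(k)
    y = np.zeros(k); y[a] = 1.0
    Y = mu[a] + rng.standard_normal(d) / np.sqrt(lam)
    return y, Y

def loss_grad(x, y, Y):
    """Gradient of the cross-entropy loss (3.2); x is viewed as a k x d matrix."""
    logits = x @ Y
    p = np.exp(logits - logits.max()); p /= p.sum()    # softmax pi_Y(.)
    return np.outer(p - y, Y)                          # row c is (pi_Y(c) - y_c) Y

x = rng.standard_normal((k, d)) / np.sqrt(d)           # N(0, I_d/d) initialization
P = mu.T @ mu                                          # projector onto Span(mu_1..mu_k)
for step in range(steps):
    y, Y = sample()
    x = x - delta * (loss_grad(x, y, Y) + beta * x)    # online SGD update (2.1)
    if step % (5 * d) == 0:
        rho = np.linalg.norm(x @ P, axis=1) / np.linalg.norm(x, axis=1)
        print(step, np.round(rho, 3))   # alignment of each row x^c with the mean span
```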
Moreover, when the means are orthogonal, each class block of the SGD, x^c_ℓ, in fact correlates strongly with the specific mean for its class, µ_c, and simultaneously, in the Hessian and G-matrices along training, the eigenvalue corresponding to µ_c becomes distinguished from the other k − 1 outliers. We illustrate these last two points in Figure 3.3. We also refer the reader to Section F for further numerical demonstrations. Each of these phenomena appears in more general contexts, and in even richer manners, as we will see in the following section.

4 CLASSIFYING XOR-TYPE MIXTURE MODELS VIA TWO-LAYER NETWORKS

With the above discussion in mind, let us now turn to more complex classification tasks that are not linearly separable and require the corresponding network architecture to be multilayer.

4.1 DATA MODEL

For our multilayer results, we consider the problem of classifying a 4-component Gaussian mixture whose class labels are in a so-called XOR form. More precisely, consider a mixture of four Gaussians with means µ, −µ, ν, −ν, where ‖µ‖ = ‖ν‖ = 1 and, say for simplicity, µ and ν are orthogonal, and with variances I_d/λ. There are two classes: class label 1 for the Gaussians with means ±µ, and 0 for the Gaussians with means ±ν. To be more precise, our data distribution P_Y is given by

    y ∼ (1/2)δ_1 + (1/2)δ_0,    and, given y,
    Y ∼ (1/2)N(µ, I_d/λ) + (1/2)N(−µ, I_d/λ)    if y = 1,
    Y ∼ (1/2)N(ν, I_d/λ) + (1/2)N(−ν, I_d/λ)    if y = 0.    (4.1)

This is a Gaussian version of the famous XOR problem of Minsky & Papert (1969). It is one of the simplest examples of a classification task requiring a multi-layer network to express a good classifier. We therefore use a two-layer architecture with the intermediate layer having width K ≥ 4 (any less and a Bayes-optimal classifier would not be expressible), ReLU activation function g(x) = x ∨ 0, and then sigmoid activation function σ(x) = 1/(1 + e^{−x}). The parameter space is then x = (W, v) ∈ R^{Kd+K}, where the first layer weights are denoted by W = W(x) ∈ R^{K×d} and the second layer weights are denoted by v = v(x) ∈ R^K. We use the binary cross-entropy loss on this problem,

    L(x; Y) = −y v · g(WY) + log(1 + e^{v · g(WY)}),    (4.2)

with g applied entrywise. The SGD for this classification task was studied in some detail in Refinetti et al. (2021) and Ben Arous et al. (2022), with critical step size δ = Θ(1/d).

4.2 RESULTS AND DISCUSSION

We begin our discussion with the analogue of Theorem 3.1 in this setting. As the next theorem demonstrates, in this more subtle problem the SGD still finds and lives in the principal directions of the empirical Hessian and G-matrices.¹ Here, the principal directions vary significantly across the parameter space, and the alignment phenomenon differs depending on which fixed point the SGD converges to. Moreover, the relation between the SGD and the principal directions of the Hessian and G-matrices can be seen per layer.

Theorem 4.1. Consider the XOR GMM mixture with loss function (4.2) and the corresponding SGD (2.1) with β ∈ (0, 1/8), learning rate δ = O(1/d), initialized from N(0, I_d/d). There exist α_0, λ_0 such that if λ ≥ λ_0 and M̃ ≥ α_0 d, the following hold. For every ε > 0, there exists T_0(ε) such that for any time horizon T_0 < T_f < M/d, with probability 1 − o_d(1), for all i ∈ {1, ..., K},

1. W_i(x_ℓ) lives in E_2(∇²_{W_i W_i} R̂(x_ℓ)) and in E_2(Ĝ_{W_i W_i}(x_ℓ)), and
2. v(x_ℓ) lives in E_4(∇²_{vv} R̂(x_ℓ)) and in E_4(Ĝ_{vv}(x_ℓ)),

up to O(ε + λ^{−1/2}) error,² for all ℓ ∈ [T_0 δ^{−1}, T_f δ^{−1}].
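Before the remark that follows, the data model (4.1) and loss (4.2) can be made concrete with a short sketch. This is our own illustrative code, not the paper's implementation: the helper names (sample_xor, loss_and_grads) and all parameter values are assumptions. It draws one XOR-GMM sample and performs one per-layer online SGD step of the form (2.1) for the two-layer ReLU network, using the per-sample gradients of (4.2).

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, lam = 500, 8, 10.0
mu, nu = np.eye(d)[0], np.eye(d)[1]      # orthogonal unit means

def sample_xor():
    """One sample from the XOR-type mixture (4.1): label 1 <-> +/-mu, label 0 <-> +/-nu."""
    y = rng.integers(2)
    mean = (mu if y == 1 else nu) * rng.choice([-1.0, 1.0])
    return y, mean + rng.standard_normal(d) / np.sqrt(lam)

relu = lambda t: np.maximum(t, 0.0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def loss_and_grads(W, v, y, Y):
    """Binary cross-entropy loss (4.2) and its per-layer gradients."""
    h = relu(W @ Y)                                  # hidden layer, shape (K,)
    z = v @ h
    loss = -y * z + np.logaddexp(0.0, z)
    err = sigmoid(z) - y                             # (yhat - y)
    grad_v = err * h
    grad_W = np.outer(err * v * (W @ Y > 0), Y)      # ReLU'(W_i.Y) = 1_{W_i.Y > 0}
    return loss, grad_W, grad_v

# One online SGD step of the form (2.1), applied per layer:
W = rng.standard_normal((K, d)) / np.sqrt(d)
v = rng.standard_normal(K) / np.sqrt(d)
delta, beta = 1.0 / d, 0.05
y, Y = sample_xor()
_, gW, gv = loss_and_grads(W, v, y, Y)
W -= delta * (gW + beta * W)
v -= delta * (gv + beta * v)
```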
Remark 1. The restriction to β < 1/8 in Theorems 4.1 and 4.2 is because when β > 1/8 the regularization is too strong for the SGD to be meaningful; in particular, the SGD converges ballistically to the origin in parameter space, with no discernible preference for the directions corresponding to µ, ν over the other directions. The above theorems are still valid there if the notion of error for living in a space from Definition 2.1 were additive instead of multiplicative (i.e., at most ε rather than ε‖v‖).

This theorem is demonstrated in Figure 4.1. There we have plotted the alignment of the rows of the intermediate layer with the space spanned by the top two eigenvectors of the corresponding first-layer blocks of the Hessian and G-matrices, and similarly for the final layer. As before, the above theorem follows from the following theorem, which describes the SGD trajectory, its Hessian, and its G-matrix all living, up to O(ε + λ^{−1/2}) error, in their first-layer blocks in Span(µ, ν) and (applying g entrywise) in their second-layer blocks in Span(g(W(x_ℓ)µ), g(−W(x_ℓ)µ), g(W(x_ℓ)ν), g(−W(x_ℓ)ν)).

Theorem 4.2. In the setting of Theorem 4.1, up to O(ε + λ^{−1/2}) error with probability 1 − o_d(1), the following live in Span(µ, ν):

- The first-layer weights, W_i(x_ℓ), for each i ∈ {1, ..., K},
- The first-layer empirical test Hessian ∇²_{W_i W_i} R̂(x_ℓ), for each i ∈ {1, ..., K},
- The first-layer empirical test G-matrix Ĝ_{W_i W_i}(x_ℓ), for each i ∈ {1, ..., K},

and the following live in Span(g(W(x_ℓ)µ), g(−W(x_ℓ)µ), g(W(x_ℓ)ν), g(−W(x_ℓ)ν)):

- The second-layer weights v(x_ℓ),
- The second-layer empirical test Hessian ∇²_{vv} R̂(x_ℓ),
- The second-layer empirical test G-matrix Ĝ_{vv}(x_ℓ).

¹As the second derivative of ReLU is only defined in the sense of distributions, care must be taken when studying these objects. This singularity is almost surely not encountered by the SGD trajectory.

²In this theorem and in Theorem 4.2, the big-O notation also hides constant dependencies on the initial magnitudes of v(x_0), and on T_f.

Figure 4.1: (a)-(b) depict the alignment of the first-layer weights W_i(x_ℓ) for i = 1, ..., K (in different colors) with the principal subspaces of the corresponding blocks of the Hessian and G-matrices, i.e., with E_2(∇²_{W_i W_i} R̂(x_ℓ)) and E_2(Ĝ_{W_i W_i}(x_ℓ)). (c)-(d) plot the second-layer alignment, namely of v(x_ℓ) with E_4(∇²_{vv} R̂(x_ℓ)) and E_4(Ĝ_{vv}(x_ℓ)). Parameters are d = 1000, λ = 10, and K = 20.

Figure 4.2: The eigenvalues (in different colors) of the vv blocks of the Hessian and G-matrices over time from a random initialization. Initially, there is one outlier eigenvalue due to the positivity of the ReLU activation. Along training, four outlier eigenvalues separate from the bulk, corresponding to the four hidden classes in the XOR problem. Parameters are the same as in Figure 4.1.

Let us discuss a bit more the phenomenology of the alignment in the second layer. First of all, we observe that the subspace in which the alignment occurs is a random 4-dimensional subspace of R^K, depending on the initialization and trajectory (in particular, on the choice of fixed point to which the SGD converges). Furthermore, if we imagine K to be much larger than 4, so that the model is overparametrized, then at initialization, unlike the 1-layer case studied in Section 3, the Hessian and G-matrices do not exhibit any alignment in their second-layer blocks.
In particular, the second-layer blocks of the Hessian and G-matrices look like (non-spiked) Gaussian orthogonal ensemble and Wishart matrices³ in K dimensions at initialization, and it is only over the course of training that they develop 4 outlier eigenvalues, as the first layer of the SGD begins to align with the mean vectors and the vectors (g(W(x_ℓ)ϑ))_{ϑ∈{±µ,±ν}} in turn get large enough to generate outliers. This crystallization of the last layer around these vectors over the course of training is reminiscent of the neural collapse phenomenon outlined in Papyan et al. (2020) (see also Han et al. (2022); Zhu et al. (2021)). The simultaneous emergence, along training, of outliers in the Hessian and G-matrix spectra can be seen as a dynamical version of what is sometimes referred to as the BBP transition after Baik et al. (2005) (see also Péché (2006)). This dynamical transition is demonstrated in Figure 4.2.

Finally, we recall that Ben Arous et al. (2022) found a positive probability (uniformly in d, but shrinking as the architecture is overparametrized by letting K grow) that the SGD converges to suboptimal classifiers from a random initialization. When this happens, g(W(x_ℓ)ϑ) remains small for the hidden classes ϑ ∈ {±µ, ±ν} that are not classifiable with the SGD output. In those situations, Theorem 4.2 shows that the outlier subspace in the vv-blocks that emerges will have rank smaller than 4, whereas when an optimal classifier is found it will be of rank 4: see Figure 4.3. Knowing that the classification task entails a mixture of 4 means, this may provide a method for devising a stopping rule for classification tasks of this form, by examining the rank of the outlier eigenspaces of the last-layer Hessian or G-matrix. While the probability of such sub-optimal classification is bounded away from zero, it goes to zero exponentially as the model is overparametrized via K → ∞, which gives more chances for the second-layer SGD, Hessian, and G-matrices to exhibit full 4-dimensional principal spaces. This serves as a concrete and provable manifestation of the lottery ticket hypothesis of Frankle & Carbin (2019).

³These are two of the most classical random matrix models; see Anderson et al. (2010) for more.

Figure 4.3: Evolution of the eigenvalues of the v component of G over time in rank-deficient cases. Here SGD is started from initializations that converge to suboptimal classifiers (this has uniformly positive, K-dependent, probability under a random initialization). From left to right, the SGD's classifier varies in the number of hidden classes it discerns, from 1 to 4. There is still a dynamical spectral transition, now with only a corresponding number of emerging outlier eigenvalues.

5 OUTLINE AND IDEAS OF PROOF

The proofs of our main theorems break into three key steps.

1. In Sections A and B, we show that the population Hessian and G-matrices have bulks (and possibly minibulks) that are O(1/λ) in operator norm, and finite, C(k)-rank parts. We can explicitly characterize the low-rank part's eigenvalues and eigenvectors up to O(1/λ) corrections, as functions of the parameter space, to see exactly where their emergence as outliers occurs depending on the model.
2. In Section C, we analyze the SGD trajectories for the k-GMM and XOR classification problems. We do this using the limiting effective dynamics theorem proven in Ben Arous et al. (2022) for finite families of summary statistics.
We derive ODE limits for these summary statistics, notably for the general k-GMM which was not covered in that paper: see Theorem C.7. We then pull back the limiting dynamics to finite d and expand its λfinite trajectory about its λ = solution. These latter steps involve understanding some of the stability properties of the dynamical system limits of the SGD s summary statistics. 3. In Section D, we prove concentration for the empirical Hessian and G-matrices about their population versions, in operator norm, throughout the parameter space. In some related settings, including binary mixtures of Gaussians, Mei et al. (2018) established concentration of the empirical Hessian about the population Hessian uniformly in the parameter space assuming polylogarithmic sample complexity. Our proofs are based on ϵ-nets and concentration inequalities for uniformly sub-exponential random variables, together with some twists due, for instance, to non-differentiability of the Re LU function. Remark 5.1. Our results are stated for the empirical Hessian and G-matrices generated using test data, along the SGD trajectory generated from training data. Since we are considering online SGD, the empirical Hessian and G-matrices with training data are no more relevant than those generated from test data, but the reader may still wonder whether the same behavior holds. A straightforward modification of our arguments in Section D extends our results for the k-GMM model to Hessian and G-matrices generated from train data, if we assume that M d log d rather than simply M d. See the full version for a sketch of this extension. It is an interesting mathematical question to drop this extra logarithmic factor. The extension in the XOR case is technically more involved due to the regularity of the Re LU function. See Section F for numerical demonstrations that the phenomena in the empirical matrices generated from train data are identical to those generated from test data. Published as a conference paper at ICLR 2024 AUTHOR CONTRIBUTIONS All authors contributed equally to this work. ACKNOWLEDGMENTS The authors thank anonymous referees for their helpful comments. G.B. acknowledges the support of NSF grant DMS-2134216. R.G. acknowledges the support of NSF grant DMS-2246780. The research of J.H. is supported by NSF grants DMS-2054835 and DMS-2331096. A.J. acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada Research Chairs programme. Cette recherche a et e enterprise grˆace, en partie, au soutien financier du Conseil de Recherches en Sciences Naturelles et en G enie du Canada (CRSNG), [RGPIN-2020-04597, DGECR-2020-00199], et du Programme des chaires de recherche du Canada. Greg W Anderson, Alice Guionnet, and Ofer Zeitouni. An introduction to random matrices. Number 118. Cambridge university press, 2010. Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, and Bruno Loureiro. From high-dimensional & mean-field dynamics to dimensionless odes: A unifying approach to sgd in two-layers networks. In Gergely Neu and Lorenzo Rosasco (eds.), Proceedings of Thirty Sixth Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pp. 1199 1227. PMLR, 12 15 Jul 2023. URL https://proceedings.mlr.press/v195/arnaboldi23a.html. Antonio Auffinger and G erard Ben Arous. Complexity of random smooth functions on the highdimensional sphere. Ann. Probab., 41(6):4214 4247, 2013. ISSN 0091-1798. doi: 10.1214/ 13-AOP862. 
Antonio Auffinger, G erard Ben Arous, and Jiˇr ı ˇCern y. Random matrices and complexity of spin glasses. Comm. Pure Appl. Math., 66(2):165 201, 2013. ISSN 0010-3640. doi: 10.1002/cpa. 21422. Jinho Baik, G erard Ben Arous, and Sandrine P ech e. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643 1697, 2005. G erard Ben Arous, Song Mei, Andrea Montanari, and Mihai Nica. The landscape of the spiked tensor model. Comm. Pure Appl. Math., 72(11):2282 2330, 2019. ISSN 0010-3640. doi: 10. 1002/cpa.21861. Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Q38D6xxr KHe. L eon Bottou. On-Line Learning and Stochastic Approximations. Cambridge University Press, USA, 1999. ISBN 0521652634. Anna Choromanska, MIkael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann Le Cun. The Loss Surfaces of Multilayer Networks. In Guy Lebanon and S. V. N. Vishwanathan (eds.), Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pp. 192 204, San Diego, California, USA, 09 12 May 2015. PMLR. URL https://proceedings.mlr.press/v38/ choromanska15.html. Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=jh-r Ttvk Ge M. Alexandru Damian, Jason Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In Po-Ling Loh and Maxim Raginsky (eds.), Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pp. Published as a conference paper at ICLR 2024 5413 5452. PMLR, 02 05 Jul 2022. URL https://proceedings.mlr.press/v178/ damian22a.html. Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS 14, pp. 2933 2941, Cambridge, MA, USA, 2014. MIT Press. Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linearwidth neural networks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS 20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https:// openreview.net/forum?id=r Jl-b3Rc F7. Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2232 2241. PMLR, 09 15 Jun 2019. URL https://proceedings.mlr.press/v97/ghorbani19b.html. Sebastian Goldt, Madhu Advani, Andrew M Saxe, Florent Krzakala, and Lenka Zdeborov a. 
Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. Advances in neural information processing systems, 32, 2019. Guy Gur-Ari, Daniel A. Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace, 2019. URL https://openreview.net/forum?id=Bye THs Aqt X. X.Y. Han, Vardan Papyan, and David L. Donoho. Neural collapse under MSE loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=w1Ubdv WH_R3. Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_ files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf. Arthur Jacot, Franck Gabriel, and Clement Hongler. The asymptotic spectrum of the hessian of dnn throughout training. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Skgsca NYPS. Yann A. Le Cun, L eon Bottou, Genevieve B. Orr, and Klaus-Robert M uller. Efficient Back Prop, pp. 9 48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3642-35289-8. doi: 10.1007/978-3-642-35289-8 3. URL https://doi.org/10.1007/ 978-3-642-35289-8_3. Xinyan Li, Qilong Gu, Yingxue Zhou, Tiancong Chen, and Arindam Banerjee. Hessian based analysis of SGD for Deep Nets: Dynamics and Generalization, pp. 190 198. doi: 10. 1137/1.9781611976236.22. URL https://epubs.siam.org/doi/abs/10.1137/1. 9781611976236.22. Zhenyu Liao and Michael W. Mahoney. Hessian eigenspectra of more realistic nonlinear models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id= o-RYNVOlx A8. Bruno Loureiro, Gabriele Sicuro, Cedric Gerbelot, Alessandro Pacco, Florent Krzakala, and Lenka Zdeborova. Learning gaussian mixtures with generalized linear models: Precise asymptotics in high-dimensions. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview. net/forum?id=j3e Gy NMPvh. Published as a conference paper at ICLR 2024 Antoine Maillard, G erard Ben Arous, and Giulio Biroli. Landscape complexity for the empirical risk of generalized linear models. In Jianfeng Lu and Rachel Ward (eds.), Proceedings of The First Mathematical and Scientific Machine Learning Conference, volume 107 of Proceedings of Machine Learning Research, pp. 287 327. PMLR, 20 24 Jul 2020. URL https: //proceedings.mlr.press/v107/maillard20a.html. Charles H. Martin and Michael W. Mahoney. Traditional and heavy tailed self regularization in neural network models, 2019. URL https://openreview.net/forum?id=SJe FNo Rc FQ. Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. Ann. Statist., 46(6A):2747 2774, 12 2018. doi: 10.1214/17-AOS1637. Francesca Mignacco, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborov a. Dynamical mean-field theory for stochastic gradient descent in gaussian mixture classification. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 9540 9550. Curran Associates, Inc., 2020. 
URL https://proceedings.neurips.cc/paper_files/paper/2020/ file/6c81c83c4bd0b58850495f603ab45a93-Paper.pdf. Marvin Minsky and Seymour Papert. An introduction to computational geometry. Cambridge tiass., HIT, 479:480, 1969. Alireza Mousavi-Hosseini, Sejun Park, Manuela Girotti, Ioannis Mitliagkas, and Murat A Erdogdu. Neural networks efficiently learn low-dimensional representations with SGD. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview. net/forum?id=6taykzqc PD. Vardan Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5012 5021. PMLR, 09 15 Jun 2019. URL https://proceedings. mlr.press/v97/papyan19a.html. Vardan Papyan. Traces of class/cross-class structure pervade deep learning spectra. Journal of Machine Learning Research, 21(252):1 64, 2020. URL http://jmlr.org/papers/v21/ 20-933.html. Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40): 24652 24663, 2020. doi: 10.1073/pnas.2015509117. URL https://www.pnas.org/doi/ abs/10.1073/pnas.2015509117. Courtney Paquette, Kiwon Lee, Fabian Pedregosa, and Elliot Paquette. SGD in the large: Averagecase analysis, asymptotics, and stepsize criticality. In Conference on Learning Theory, pp. 3548 3626. PMLR, 2021. S. P ech e. The largest eigenvalue of small rank perturbations of hermitian random matrices. Probability Theory and Related Fields, 134(1):127 173, Jan 2006. ISSN 1432-2064. doi: 10.1007/s00440-005-0466-z. Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2798 2806. PMLR, 06 11 Aug 2017. URL https://proceedings.mlr.press/v70/ pennington17a.html. Jeffrey Pennington and Pratik Worah. The spectrum of the fisher information matrix of a singlehidden-layer neural network. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_ files/paper/2018/file/18bb68e2b38e4a8ce7cf4f6b2625768c-Paper.pdf. Published as a conference paper at ICLR 2024 Maria Refinetti, Sebastian Goldt, Florent Krzakala, and Lenka Zdeborov a. Classifying highdimensional gaussian mixtures: Where kernel methods fail and neural networks succeed. In International Conference on Machine Learning, pp. 8936 8947. PMLR, 2021. Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statistics, 22: 400 407, 1951. ISSN 0003-4851. doi: 10.1214/aoms/1177729586. David Saad and Sara Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. Advances in neural information processing systems, 8, 1995a. David Saad and Sara A Solla. On-line learning in soft committee machines. Physical Review E, 52 (4):4225, 1995b. Levent Sagun, Leon Bottou, and Yann Le Cun. Eigenvalues of the hessian in deep learning: Singularity and beyond, 2017a. URL https://openreview.net/forum?id=B186c P9gx. Levent Sagun, Utku Evci, V. Ugur G uney, Yann N. 
Dauphin, and L eon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. Co RR, abs/1706.04454, 2017b. URL http: //arxiv.org/abs/1706.04454. Yan Shuo Tan and Roman Vershynin. Online stochastic gradient descent with arbitrary initialization solves non-smooth, non-convex phase retrieval. ar Xiv preprint ar Xiv:1910.12837, 2019. Rodrigo Veiga, Ludovic Stephan, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborov a. Phase diagram of stochastic gradient descent in high-dimensional two-layer neural networks. ar Xiv preprint ar Xiv:2202.00293, 2022. Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. Sumio Watanabe. Almost all learning machines are singular. In 2007 IEEE Symposium on Foundations of Computational Intelligence, pp. 383 388, 2007. doi: 10.1109/FOCI.2007.371500. Zeke Xie, Qian-Yuan Tang, Mingming Sun, and Ping Li. On the overlooked structure of stochastic gradients. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=H4Gsteo L0M. Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=KRODJAa6pz E. Published as a conference paper at ICLR 2024 A ANALYSIS OF THE POPULATION MATRICES: 1-LAYER NETWORKS In this section, we study the Hessian and G-matrix of the population loss for the k-GMM problem whose data distribution and loss were given in (3.1) (3.2). Specifically, we give the Hessian and G-matrices λ-large expansion, and showing they have low rank structures with top eigenspaces generated by O(1/λ) perturbations of the mean vectors (µ1, ..., µk). In Section D, we will show that the empirical matrices are well concentrated about the population matrices. A.1 PRELIMINARY CALCULATIONS AND NOTATION It helps to first fix some preliminary notation, and give expressions for derivatives of the loss function of (3.2). Differentiating that, we get xc L = ( yc + exp(xc Y ) P a exp(xa Y ))Y , (A.1) the Hessian matrix xb xc L = exp(xb Y ) P a exp(xa Y )δbc exp(xb Y ) exp(xc Y ) a exp(xa Y ))2 Y Y , (A.2) and the G-matrix xb L xc L = (ycyb πY (b)yc ybπY (c) + πY (c)πY (b)) Y Y . (A.3) Given the above, the following probability distribution over the [k] classes naturally arises, for which we reserve the notation πY ( ) M1([k]) (which is a probability measure on [k]): πY (c) = πY (c; x) := exp(xc Y ) P a [k] exp(xa Y ). (A.4) Note the dependency of πY on the point in parameter space x. Since in this section x can be viewed as fixed, we will suppress this dependence from the notation. In the rest of this section, we will denote Z N(0, Id/λ). We also denote x B = x B πY = X a xaπY (a) , x B; x B = x B; x B πY = X a (xa xa)πY (a) x 2 πY , where B πY . We consider the population Hessian 2Φ = 2E[L], block by block, using bcΦ to denote the d d block in 2 xΦ corresponding to xb xcΦ. Then, thanks to (A.2), the bc block of the population Hessian is of the form bcΦ = E [(πY (b)δbc πY (b)πY (c))Y Y ] . We also consider the population G-matrix Γ = E[ L 2], block by block, using Γbc to denote the d d block corresponding to E[ xb L xc L]. Then, thanks to (A.3), the bc block of the population Hessian is of the form Γbc = E[(ycyb πY (b)yc ybπY (c) + πY (c)πY (b)) Y Y ] . 
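The per-sample derivative formulas above lend themselves to a quick numerical sanity check. The following finite-difference sketch is our own illustration (the helper names loss and grad, and the dimensions, are hypothetical choices): it verifies that the closed-form gradient (A.1), with π_Y as in (A.4), matches the gradient of the cross-entropy loss (3.2) for a single sample.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 30, 5
x = rng.standard_normal((k, d))          # parameters, one row x^c per class
y = np.zeros(k); y[2] = 1.0              # one-hot label
Y = rng.standard_normal(d)               # feature vector

def loss(x):
    """Cross-entropy loss (3.2) for a single sample (y, Y)."""
    logits = x @ Y
    return -y @ logits + np.log(np.sum(np.exp(logits)))

def grad(x):
    """Closed form (A.1): grad_{x^c} L = (pi_Y(c) - y_c) Y, with pi_Y from (A.4)."""
    logits = x @ Y
    p = np.exp(logits - logits.max()); p /= p.sum()
    return np.outer(p - y, Y)

# Central finite differences agree with (A.1) up to O(eps^2) discretization error.
eps, G = 1e-5, grad(x)
num = np.zeros_like(x)
for c in range(k):
    for i in range(d):
        e = np.zeros_like(x); e[c, i] = eps
        num[c, i] = (loss(x + e) - loss(x - e)) / (2 * eps)
print(np.max(np.abs(num - G)))    # small (around 1e-9 or below): the formulas match
```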
For both population Hessian matrix and G-matrix, We will study the off-diagonal blocks a = c and the diagonal ones a = c separately. A.2 ANALYSIS OF THE POPULATION HESSIAN MATRIX We now compute exact expressions for the blocks of the population Hessian as λ gets large: see Lemmas A.1 A.2. We begin by studying the off-diagonal blocks: b = c, for which, bcΦ = E (πY (b)δcb πY (b)πY (c))Y 2 = X l pl E πYl(b)πYl(c)Y 2 l (A.6) Published as a conference paper at ICLR 2024 where Yl = µl + Z, i.e., it is distributed like Y given class choice l. It helps here to recall the well-known Gaussian integration by parts formula: For f that is differentiable with derivative of at most exponential growth at infinity, Ef(Zλ)Zλ = 1 Our goal is to show the following. Lemma A.1. The off-diagonal blocks of the population Hessian satisfy bcΦ(x) = E n πY (c)πY (a) h µy + 1 λ[(xc + xa) 2 x B ] 2 + 1 λ2 x B; x B io In particular they are shifts by the identity of a matrix of rank at most k2. Proof. We decompose each term in (A.6) as three terms E [πYl(b)πYl(c)] µ 2 l + 2 Sym(E [πYl(b)πYl(c)Z] µl) + E πYl(b)πYl(c)Z 2 =: (i) + (ii) + (iii) where we ve used here multi-linearity of . Here Sym(c b) = (c b + b c)/2. Let s look at this term-by-term. We leave term (i) as is. For Term (ii), we notice by Gaussian integration by parts, E [πYl(c)πYl(b)Z] µl = 1 λE [ Z(πYl(c)πYl(b))] µl λE h ((xb + xc) 2 x B πYl )πYl(b)πYl(c) i µl λ(E[πYl(b)πYl(c)](xb + xc) µl 2Ey h x B πYl πYl(b)πYl(c) i µl) . where we recall (A.5) and used the derivative calculation ZπYl(a) = Z exp[xa (µl + Z)] P b exp[xb (µl + Z)] = (xa x B πYl )πYl(a) . (A.7) This tells us that λ Sym E h (xb + xc) 2 x B πYl(b)πYl(c) i µl , notice that this is of rank at most 2 and of order O(1/λ). Finally for term (iii), we integrate by parts again, to get, for every i, j, E [πYl(b)πYl(c)Zi Zj] = 1 λE[ Zi(πYl(b)πYl(c))Zj] + 1 λE[πYl(b)πYl(c)δij] λ2 E[ Zi Zj(πYl(b)πYl(c))] + 1 λE[πYl(b)πYl(c)δij]. (A.8) so term (iii) is given by 1 λE[πYl(b)πYl(c)Id] + 1 λ2 E[ 2 Z (πYl(b)πYl(c))] . Examining this second term, E[ Zi Zj(πYl(b)πYl(c))] = E h Zj n xb i + xc i 2 x B i πYl πYl(b)πYl(c) oi = E h xb i + xc i 2 x B i πYl (xb j + xc j) 2 x B j πYl πYl(b)πYl(c) i 2E x B i ; x B j πYl πYl(b)πYl(c) where we first used (A.7), then used that Z x B πYl = Z b xb exp[xb (µl + Z)] P b exp[xb (µl + Z)] = x B x B πYl x B 2 πYl = x B; x B πYl (A.9) Published as a conference paper at ICLR 2024 1 λ2 E[ 2 Z(πYl(b)πYl(c))] = 1 n E h (xb + xc 2 x B πYl ) 2πYl(b)πYl(c) i 2E h x B; x B πYl πYl(b)πYl(c) io This term can be seen to be of rank at most k2 (each x B is a weighted sum of (xa)a). Combining all three terms above, we get that the off-diagonal block of the population Hessian matrix is given by the sum over y [k] of E [πYl(b)πYl(c)] µ 2 l + 2 h Sym(E h (xb + xa) 2 x B πYl ππYl (b)ππYl (c) i µl) i + 1 λE[ππYl (b)ππYl (c)]Id E h ((xb + xa) 2 x B πYl ) 2ππYl (b)ππYl (c) i 2E h x B; x B πYl ππYl (b)ππYl (c) i . Summing this in y we get that an off-diagonal block is of the form λ2 C), (A.11) l pl E h ππYl (c)ππYl (a) i µ 2 l , l py E [πYl(a)πYl(c)] Id + 2 X l pl Sym El h xc + xa 2 x B πYl(c)πYl(a) i µl , l pl E h (xc + xa 2 x B πYl ) 2ππYl (c)ππYl (a) i 2E h x B; x B πYl πYl(c)πYl(a) i . In summary to leading order it is A which is rank k. To next order it is O(1/λ) and that term is a full rank (identity) plus a rank at most 2k term. To next order it is O(1/λ2) and this is a covariance-type quantity with respect to the Gibbs probability πYy, with rank at most k2. 
We can group the expression (A.11) further as X l pl E n πYl(c)πYl(a) h µl + 1 λ[xc + xa 2 x B πYl ] 2 + 1 λ2 x B; x B This is of the form rank k plus rank k2 shifted by the identity (adding one more eigenvalue). Incorporating the average over l into the expectation, this is exactly the claimed expression. A.2.1 ON-DIAGONAL BLOCKS In this section, we study the aa diagonal blocks aaΦ = E (πY (a)(1 πY (a))Y 2 = X l pl E (πYl(a)(1 πYl(a))Y 2 l . (A.12) We prove the following large λ expansion. Lemma A.2. The diagonal aa-block of the population Hessian aaΦ equals E h πY (a) µ + 1 λ(xa x B πY ) 2 1 λ2 x B; x B πY i E h π(a)2 (µ + 2 λ(xa x B πY ) 2 2 λ2 x B; x B πY i + 1 λE πY (a)(1 πY (a)) Id . In particular, it is a shift by the identity of a rank at-most k2 matrix. Proof. By (A.12), the diagonal aa-block is given by an average over l [k] of E[πYl(a)(1 πYl(a))Y 2] = E[πYl(a)(µl + Z) 2] E[πYl(a)2(µl + Z) 2] . Published as a conference paper at ICLR 2024 Note that the second term was exactly what was computed Lemma A.1, setting b = c. It remains to compute the first. To this end, we proceed as before and for each fixed l, write the above as E[πYl(a)]µ 2 l + 2 Sym(E[πYl(a)Z] µl) + E[πYl(a)Z 2] . We will integrate the second and third terms by-parts. By the gradient calculations of (A.7), E[πYl(a)Z] = 1 λE[ ZπYl(a)] = 1 λE[(xa x B πYl )πYl(a)] , (A.13) and by the calculation of (A.9), E[πYl(a)Z 2] = 1 λE[πYl(a)]Id + 1 λ2 E[ 2 ZπYl(a)] λE[πYl(a)]Id + E[πYl(a)((xa x B πYl ) 2 x B; x B πYl )] . (A.14) Combining the above expressions yields E[πYl(a)Y 2 l ] = E[πYl(a)((µl + 1 λ(xa x B πYl ) 2 + 1 λ2 x B; x B πYl )]. On the otherhand, Lemma A.1, yields E[πYl(a)2Y 2 l ] = E[πYl(a)2((µl + 2 λ(xa x B πYl ) 2 + 1 λ2 x B; x B πYl )] . Combining these two and averaging over l yields the desired. A.3 ANALYSIS OF THE POPULATION G-MATRIX We now compute an exact expansion of the population G-matrix as λ gets large. Lemma A.3. For any b, c, the bc block of the population G-matrix can be written in the form: E[ xb L xc L] = A + B where A is in the span of (µc)c and B, C have operator norm bounded by 1. In particular, A = δbcpbµ 2 b pc E{πYc(b)[µc + 1 λ(xb x B πYc )] 2} pb E{πYb(c)[µb + 1 λ(xc XB πYb )] 2} l pl E{πYl(b)πYl(c)[µl 1 λ(xb + xc 2 x B πYl )] 2} , B = δbcpb (pb E[πYb(c)] + pc E[πYc(b)]) + X l pl E[πYl(b)πYl(c)] Id , C = pc E[ x B; x B πYc πYc(b)] pb E[ x B; x B πYb πYb(c)] 2 X l pl E[ x B; x B πYl πYl(b)πYl(c)] Proof. We recall from (A.3), there are four terms in the bc block of the G-matrix: E[ xb L xc L] =: (i) (ii) (iii) + (iv), (A.15) (i) := E[ycyb Y Y ], (ii) := E[πY (b)yc Y Y ], (iii) := E[ybπY (c)Y Y ], (iv) := E[πY (c)πY (b)Y Y ] The first term in (A.15) is easy to compute (i) = E[ybyc Y Y ] = δbcpb(µb µb + Id/λ). (A.16) The second and third terms are similar to each other, we will only compute the second term (ii), (ii) = E[ycπY (b)Y Y ] = pc E πYc(b)(µc + Z) 2 = pc E[πYc(b)µ 2 c ] + 2pc Sym E[πYc(b)Z µc] + pc E[πYc(b)Z 2] (A.17) Published as a conference paper at ICLR 2024 Using the calculations of (A.13) and (A.14), we conclude that (ii) = pc E[πYc(b)(µ 2 c + Id/λ)] + 1 λ2 Sym(E[(xb x B πYc ])πYc(b)] µc)]. 
λ2 pc E[((xb x B πYc ) 2 x B; x B πYc )πYc(b)] (A.18) For the last term (iv) in (A.15), it has been computed in (A.10) that (iv) is the expectation over l of E [πYl(b)πYl(c)] µ 2 l h 2 Sym(E h (xb + xc) 2 x B πYl πYl(b)πYl(c) i µl) i + 1 λE[πYl(b)πYl(c)]Id E h ((xb + xc) 2 x B πYl ) 2πYl(b)πYl(c) i 2E h x B; x B πYl πYl(b)πYl(c) i (A.19) By plugging (A.16), (A.18), its analogue for (iii), and (A.19) into (A.15), we conclude that the block E[Gxbxc] of the population G-matrix is given by E[Gxbxc] = A + B A = δbcpbµ 2 b pc E[πYc(b)]µ 2 c pb E[πYb(c)]µ 2 b + X l pl E[πYl(b)πYl(c)]µ 2 l , δbcpb (pb E[πYb(c)] + pc E[πYc(b)]) + X l pl E[πYl(b)πYl(c)] 2pc Sym(E[(xb x B πYc )πYc(b)] µc)] 2pb Sym(E[(xc x B πYb )πYb(c)] µb)] l 2pl Sym(E h (xb + xc) 2 x B πYl πYl(b)πYl(c) i µl) , C = pc E[((xb x B πYc ) 2 x B; x B πYc )πYc(b)] pb E[((xc x B πYb ) 2 x B; x B πYb )πYb(c)] l pl E h ((xb + xc) 2 x B πYl ) 2πYl(b)πYl(c) i 2E h x B; x B πYl πYl(b)πYl(c) i . Grouping tensor-squares we obtain the desired decompostion in terms of A, B, and C. B ANALYSIS OF POPULATION MATRICES: THE 2-LAYER CASE In this section we analyze the population Hessian and G-matrices for the 2-layer XOR model, and especially its λ large behavior by viewing it as a perturbation of its λ = value. Specifically, we compute its large λ expansion, and uncover an underlying low-rank structure. We will show that the empirical Hessian and G-matrices concentrate about their population versions in Section D. B.1 PRELIMINARY CALCULATIONS Recall the data model for the 2-layer XOR GMM from (4.1) and the loss function (4.2), with σ being the sigmoid function and g being Re LU. The 2-layer architecture has intermediate layer width K, so that the first layer weights W form a K d matrix, and the second layer weights v form a K-vector. Observe that v L = (y by)g(WY ) and Wi L = (y by)g (Wi Y )vi Y , by = ev g(W Y ) 1 + ev g(W Y ) = σ(v g(WY )) . (B.1) Published as a conference paper at ICLR 2024 The vv-block of the G-matrix is the K K matrix given by v L 2 = (y by)2g(WY ) 2 . (B.2) The Wi Wj block of the G-matrix corresponding to W is given by Wi L Wj L = (y by)2vivjg (Wi Y )g (Wj Y )Y 2 . (B.3) The per-layer Hessian for the loss on a given sample is then given by 2 vv L = g(WY ) 2by(1 by) , 2 Wi Wj L = ( δijvig (Wi Y )(y by) + vivjg (Wi Y )g (Wj Y )by(1 by))Y 2 (B.4) Note that we may also write the diagonal block in W of the form, 2 W L = (y by)diag(vig (Wi Y )) Y 2 + by(1 by)(v diag(g (WY ))) 2 Y 2 We assume that Wi 0 for all i, a set which we call Wc 0. Then recall that for Re Lu activation g, we have g = δ0 in the sense of distributions. Thus as long as W Wc 0, we can drop g in (B.4) for any finite λ < . Our aim is to approximate the expected values of (B.4) by their λ = versions, and show that they are within O(λ 1/2) of one another. Let F(t) = P(G > t) where G N(0, 1) be the tail probability of a standard Gaussian. Let us also introduce the notations for ϑ { µ, ν}, mϑ i = Wi ϑ and Rii = Wi Wj for 1 i, j K . These notations will reappear as summary statistics for the analysis of the SGD in the following sections. Lemma B.1. For X N(ϑ, Id/λ) and W a K d matrix, we have that ||E[g(WX) 2] g(Wϑ) 2||op K ψ2(ϑ, W, λ) (B.5) ψ2(ϑ, W, λ) = max 1 i,j K(|mϑ i |3 + λ 3/2R3/2 ii )1/3 (|mϑ j |3 + λ 3/2R3/2 jj )1/3 λ Rii |mϑ i | λ Rjj |mϑ j | Proof. By the equivalence of norms in finite dimensional vector spaces, it suffices to control the norm entry-wise at the price of a constant that depends at most on K. Case 1: Wi ϑ, Wj ϑ 0. 
In this case we have that E[g(Wi X)g(Wj X)] g(Wi ϑ)g(Wj ϑ) = E[(Wi ϑ)(Wj ϑ)1Wi X<0 Wj X<0]+ Wi Wj The absolute value of the first term is bounded, by Holder s inequality and a union bound, by E[|Wi X|3]1/3E[|Wj X|3]1/3(P(Wi X < 0) + P(Wj X < 0))1/3 . Case 2: Wi ϑ < 0 (the case Wj ϑ < 0 is symmetrical). Here we have E[g(Wi X)g(Wj X)] = E[(Wi X)(Wj X)1Wi X,Wj X>0] , which, by Holder s inequality, is bounded by E[|Wi X|3]1/3E[|Wj X|3]1/3P(Wi X > 0)1/3 . combining the two cases yields the desired. Lemma B.2. There exists a universal c > 0 such that for X N(ϑ, Id/λ) and W RK d, E[||g(WX) 2||2 op]1/2 ||Wϑ||4 + 6||Wϑ||2 λ2 ||W||2 F , (B.7) where F is the Frobenius norm. Published as a conference paper at ICLR 2024 Proof. Let Zi = Wi Z. Then E(Wi X)4 = E[(mϑ i + Zi)4] = (mϑ i )4 + 6(mϑ i )2 λ + c||Wi|| E[||g(WX) 2||2 op] = E[||g(WX)||4] ||Wϑ||4 4 + 6||Wϑ||2 λ2 ||W||2 F , with the bound the following from the fact that || ||p || ||q for p q. Lemma B.3. For X N(ϑ, Id/λ) and v RK,W RK d we have E[(σ(v g(WX))2 σ(v g(Wϑ))2)2] E[(σ (v g(WX)) σ (v g(Wϑ)))2] K ψ3( 1 λ||v||2||W||2 F ) , where ψ3(x) = x(1 + x). Proof. Observe that |σ(v g(WX))2 σ(v g(Wϑ))2| σ (v g(Wϑ))|v (g(WX) g(Wϑ))| + O(|v g(WX) g(Wϑ)|2) |v (g(WX) g(Wϑ))| + |v (g(WX) g(Wϑ))|2 , and similarly for |σ (v g(WX)) σ (v g(Wϑ))| where we used that σ, σ , σ are all uniformly bounded. By Jensen s inequality, it will suffice to bound the expectation of the quadratic terms. We begin by noting that ||g(WX) g(Wϑ)||2 X 1 i K (Wi Z)2 E[(v (g(WX) g(Wϑ)))4] ||v||4E[||WZ||4] ||v||4 1 λ2 ||W||4 F . Consequently, taking the expectation of the right-hand side of (B.9) squared, gives E[(σ(v g(WX))2 σ(v g(Wϑ))2)2] E[(v (g(WX) g(Wϑ))2] + E[(v (g(WX) g(Wϑ)))4] ||v||4 2||W||4 F + 1 λ2 ||v||4 2||W||4 F = ψ3( 1 λ||v||2 2||W||2 F ) , and the analogue of it for E[(σ (v g(WX)) σ (v g(Wϑ)))2] follows similarly. B.2 POPULATION HESSIAN ESTIMATES In this section, we study the population Hessian matrices. Our large λ approximations will be uniform over compact sets in parameter space. Thus, we will use B to denote a ball in parameter space, so that any constant dependencies on the choice of B are just dependencies on its radius. Lemma B.4. For (v, W) B the first layer block of the population Hessian matrix satisfy ||E[ 2 vv L] 1 ϑ { µ, ν} σ (v g(Wϑ))g(Wϑ) 2||op K,B ψ2(ϑ, W, λ) + 1 where ψ2 was defined in (B.6). For (v, W) B with W Wc 0, the diagonal second layer blocks of the population Hessian satisfy ||E[ 2 Wi Wi L] v2 i 4 ϑ { µ, ν} σ(vg(mϑ i ))(1 σ(vg(mϑ i )))F max ϑ { µ, ν} λ Rii |mϑ i | where we remind the reader that mϑ i = Wi ϑ, Rii = Wi Wi, and F and F are the cdf and tail of a standard Gaussian respectively. Published as a conference paper at ICLR 2024 Proof. We begin with the estimate on the vv block. Recall from (B.4) that E[ 2 vv L] = E[σ (v g(WY ))g(WY ) 2]. (B.12) By conditioning on the value of ϑ, writing Eϑ for the conditional expectation, it suffices to bound ||Eϑ[σ (v g(WY ))g(WY ) 2] σ (v g(Wϑ))g(Wϑ) 2||op , for each ϑ { µ, ν}. To this end, we write Eϑ σ (v g(WY ))g(WY ) 2 σ (v g(Wϑ))g(Wϑ) 2 = Eϑ[σ (v g(Wϑ))) (g(WY ) 2 g(Wϑ) 2)] + Eϑ[(σ (v g(WY )) σ (v g(Wϑ)))g(WY ) 2] =: (i) + (ii) By uniform boundedness of σ , we then have by (B.5) that ||(i)||op K ψ2(ϑ, W, λ). On the other hand, by (B.7) and (B.8), ||(ii)||op (E[(σ (v g(WY )) σ (v g(Wϑ)))2])1/2(E||g(WY )||2 op)1/2 B,K 1 λ . We now turn to the Wi Wi block of the population Hessian. Recall from (B.4) and the fact that W Wc 0, that E[ 2 Wi Wi L] = E[by(1 by)v2 i g (Wi Y )Y 2] , (since (g (x))2 = g (x)). 
We now aim to show the following two bounds, the second of which will give the desired bound of (B.11): for each i, we have that ||E[v2 i g (Wi Y )Y 2] v2 i 4 ϑ { µ, ν} F ϑ 2|| max ϑ { µ, ν} λ Rii |mϑ i | ||E[by(1 by)v2 i g (Wi Y )Y 2] v2 i 4 ϑ { µ, ν} σ(v g(Wϑ))(1 σ(v g(Wϑ)))F max ϑ { µ, ν} λ Rii |mϑ i | Conditioning on ϑ and pulling out v2 i in (B.13), it suffices to bound Eϑ[g (Wi Y )Y 2] = Pϑ(Wi Y > 0)ϑ 2 + 2 Sym Eϑ[1Wi Y >0Z] ϑ + Eϑ[1Wi Y >0Z 2] =: (i) + (ii) + (iii) . Term (i) is exactly the term to which we are comparing. In particular Pϑ(Wi Y > 0) = P(Wi ϑ > Wi Z) = F We thus wish to bound the operator norms of (ii) and (iii). For term (ii), observe that for u Sd 1, since Eϑ u, Z = 0, we have that Eϑ[1Wi Y >0 u, Z ] = Eϑ[1Wi Y <0 u, Z ] . Thus we may bound this, by Cauchy-Schwarz by |Em[1Wi Y >0 u, Z ]| E[ u, Z 2]1/2Pϑ((Wi Y ) sgn(mϑ i ) < 0)1/2 = 1 λ Rii |mϑ i | It follows, using that ϑ = 1, that ||(ii)||op = sup u | u, (ii)u | 2 λ Rii |mϑ i | Published as a conference paper at ICLR 2024 Finally, for term (iii) we do a naive bound to get that for every u Sd 1, | u, (iii)u | |Eϑ[1Wi Y >0 u, Z 2]| (E u, Z 4)1/2 1 In order to show (B.14), we can rewrite it as ||Eϑ[by(1 by)g (Wi Y )Y 2] σ(v g(Wϑ))(1 σ(v g(Wϑ)))F = ||Eϑ[by(1 by)g (Wi Y )Y 2] σ(v g(Wϑ))(1 σ(v g(Wϑ)))Eϑ[g (Wi Y )Y 2]|| + σ(vg(Wϑ))(1 σ(vg(Wϑ)))||Eϑ[g (Wi Y )Y 2] F ϑ 2|| =: (i) + (ii) . Since σ(vg(Wϑ))(1 σ(vg(Wϑ))) 1, (ii) on the right-hand side in (B.15) can be bounded as in (B.13) (without the vi terms). Term (i) on the right-hand side in (B.15), can be bounded as (i) Eϑ[(σ(v g(Wi Y ))(1 σ(v g(Wi Y ))) σ(v g(Wi ϑ))(1 σ(v g(Wi Y ))))2]1/2 sup ||u||2 1 Eϑ[ u, Y 4]1/2 where we used that |g (Wi Y )| 1. Since Eϑ[ u, Y 4] 1 + 1 λ + 1 λ2 for ||u||2 1, we have by (B.8) that ||(i)|| B 1 λ, from which the desired follows. B.3 BOUNDS ON POPULATION G-MATRIX We now turn to obtaining analogous λ-large approximations to the blocks of the population Gmatrix. In what follows, let yϑ = 1 if ϑ { µ} and yϑ = 0 if ϑ { ν} be the label given that the mean is ϑ. Lemma B.5. For all (v, W) B, the first layer block of the population G-matrix satisfies ||E[ v L 2] 1 ϑ { µ, ν} (yϑ σ(v g(Wϑ)))2g(Wϑ) 2|| K,B ψ2(ϑ, W, λ) + 1 and the second layer blocks satisfies, ||E[ Wi L] 2 v2 i A|| 1 λ max ϑ { µ, ν} F λ Rii |mϑ i | ϑ { µ, ν} (yϑ σ(v g(Wϑ)))2F ϑ 2 . (B.18) Proof. We begin with the v-block. Recalling (B.2), and conditioning on the mean being ϑ, it suffices to show for ϑ { µ, ν}, ||Eϑ[σ(v g(WY ))2g(WY ) 2] σ(v g(Wϑ))2g(Wϑ) 2|| K,B ψ2(ϑ, W, λ) + 1 and similarly with negative signs in the sigmoids, to account for the yϑ term when relevant. The proofs are similar, so we just do the first, with a fixed choice of ϑ. We begin by writing Eϑ[σ(v g(WY ))2g(WY ) 2] σ(v g(Wϑ))2g(Wϑ) 2 = Eϑ{[σ(v g(WY ))2 σ(v g(Wϑ))2]g(WY ) 2} + Eϑ{σ(v g(Wϑ))[g(WY ) 2 g(Wϑ) 2]} =: (i) + (ii) . We begin with bounding the operator norm of term (i). To this end, by Cauchy Schwarz, then (B.7) and (B.8), ||(i)||op Eϑ[||g(WY ) 2||2 op]1/2Eϑ[(σ(v g(WY )) σ(v g(Wϑ)))2]1/2 . B 1 λ Published as a conference paper at ICLR 2024 For term (ii), since |σ| 1, ||(ii)||op ||Eϑ[g(WY ) 2] g(Wϑ) 2||op K ψ2(ϑ, W, λ) , by (B.5). Combining these two and averaging over ϑ yields (B.16). Recalling (B.3), and conditioning on the mean being ϑ, it suffices to control the operator norm of Eϑ[(by yϑ)2g (Wi Y )Y 2] Aϑ where Aϑ is the corresponding summand in (B.18). Fix a ϑ with yϑ = 0, the choices being analogous. 
In this case we wish to bound the difference, Eϑ[σ(v g(WY ))2g (Wi Y )Y 2] σ(v g(Wϑ))2F( λ Rii mϑ i )ϑ 2 = (i) + (ii) , (i) := Eϑ[(σ(v g(WY ))2 σ(v g(Wϑ))2)g (Wi Y )Y 2] , (ii) := σ(v g(Wϑ))2 Eϑ[g (Wi Y )Y 2] F By (B.13) and the boundedness of σ, ||(ii)||op max ϑ { µ, ν} λ Rii |mϑ i | For (i), we bound its operator norm by Cauchy Schwarz, as ||(i)||op E[(σ(v g(Wi Y ))2 σ(v g(Wi ϑ))2)2]1/2 sup ||u||2 1 Eϑ[ u, Y 4] , where we used that |g (Wi Y )| 1. Since Eϑ[ u, Y 4] 1 + 1 λ + 1 λ2 for ||u||2 1, we have by (B.8) that ||(i)|| B 1 λ, from which the desired follows. C ANALYSIS OF THE SGD TRAJECTORY Our goal in this section is to show the following results for the SGD for our two classes of classification tasks. The first is our main result on stochastic gradient descent for the mixture of k-GMM s via a single-layer network. Proposition C.1. For every ϵ, β > 0, there exists T0 such that for all Tf > T0 and all λ large, the SGD for the k-GMM with step size δ = O(1/d), initialized from N(0, Id/d), does the following for all c [k] with probability 1 od(1): 1. There exists L(β) (independent of ϵ, λ) such that xc ℓ L for all ℓ [T0δ 1, Tfδ 1]; 2. There exists η(β) > 0 (independent of ϵ, λ) such that xc ℓis within O(ϵ + λ 1) distance of a point in Span(µ1, ..., µk) having xc > η. The following is our analogous main result for the XOR GMM with two layer networks. Proposition C.2. For every ϵ > 0, there exists T0 such that for all Tf > T0 and all λ large, the SGD for the XOR GMM with width K fixed, with step-size δ = O(1/d), initialized from N(0, Id/d) in its first layer weights, and N(0, 1) i.i.d. in its second layer weights, with probability 1 od(1), satisfies: 1. There is an L(K, β, v(x0)) (independent of ϵ, λ) such that Wi(xℓ) L for all i K, and v(xℓ) L for all ℓ [T0δ 1, Tfδ 1]; 2. If β < 1/8, there exists η(K, β, v(x0)) > 0 (independent of ϵ, λ) such that xℓhas |vi(xℓ)| > η and max ϑ { µ, ν} Wi(xℓ) ϑ > η for all ℓ [T0δ 1, Tfδ 1] and i K. and furthermore, it has Wi(xℓ) lives in Span(µ, ν) and v(xℓ) lives in Span((g(W(x )ϑ))ϑ { µ, ν} up to error ϵ + λ 1/2. Published as a conference paper at ICLR 2024 Note, the dependencies of L and η on the absolute initial second layer value |v(x0)| are continuous. Our approach goes by first recalling the main result of Ben Arous et al. (2022) regarding a limit theorem as d for the trajectories of summary statistics of stochastic gradient descent. We then apply this result to the task of classifying k Gaussian mixtures, obtaining ballistic limits for the SGD in Theorem C.7 that may be of independent interest. We then analyze the ballistic limits to establish Proposition C.1. Then, we recall the ballistic limits for the XOR Gaussian mixture from Ben Arous et al. (2022) and similarly establish Proposition C.2. Since we are imagining the dimensions of the parameter space and dimension space, and the step size to all be scaling with relation to one another, we use a dummy index n, in this section to encode their mutual relationship, namely d = dn, p = pn and δ = δn and then n . 
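Before turning to the analysis, we record a minimal numerical sketch of the online update underlying Proposition C.2, i.e., the 2-layer XOR network of Section B.1 trained on the XOR GMM of (4.1), including the ridge regularizer that is folded into the loss in the next paragraph. The function names are illustrative, and we assume the loss (4.2) is the binary cross-entropy of ŷ = σ(v · g(WY)), so that the per-sample gradients take the form recorded in (B.1) (up to the overall sign convention of the residual).

```python
import numpy as np

# Sketch (not the authors' code) of one online SGD step for the 2-layer XOR GMM:
# y_hat = sigmoid(v . relu(W Y)), label y = 1 for the means {+mu, -mu} and y = 0
# for {+nu, -nu}, Gaussian noise Z ~ N(0, I_d / lambda); cf. (4.1) and (B.1).

def sample_xor_gmm(mu, nu, lam, rng):
    use_mu = rng.random() < 0.5                     # class of the mean (assumed equiprobable)
    sign = 1.0 if rng.random() < 0.5 else -1.0      # +mean or -mean
    mean = sign * (mu if use_mu else nu)
    y = 1.0 if use_mu else 0.0                      # y_theta = 1 on {+-mu}, 0 on {+-nu}
    Y = mean + rng.standard_normal(mu.shape[0]) / np.sqrt(lam)
    return Y, y

def xor_sgd_step(v, W, Y, y, delta, beta):
    pre = W @ Y                                     # first-layer pre-activations W_i . Y
    h = np.maximum(pre, 0.0)                        # g = ReLU
    y_hat = 1.0 / (1.0 + np.exp(-v @ h))            # sigmoid read-out, cf. (B.1)
    err = y_hat - y                                 # cross-entropy residual (sign convention assumed)
    grad_v = err * h + beta * v                     # cf. grad_v L in (B.1), plus the ridge term
    grad_W = err * np.outer(v * (pre > 0.0), Y) + beta * W   # uses g'(W_i . Y) = 1_{W_i . Y > 0}
    return v - delta * grad_v, W - delta * grad_W
```

With the initialization of Proposition C.2 one would take the entries of W i.i.d. N(0, 1/d), the entries of v i.i.d. N(0, 1), and a step size δ of order 1/d.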
Also, to compress notation throughout this section, we will combine the loss function with the regularizer β 2 x 2 so that the stochastic gradient descent updates are indeed making gradient updates with respect to the new loss: xℓ= xℓ 1 δ L(xℓ 1, Yℓ) where L(x) = L(x) + β C.1 RECALLING THE EFFECTIVE DYNAMICS FOR SUMMARY STATISTICS Suppose that we are given a sequence of functions un C1(Rpn; Rk) for some fixed k, where un(x) = (un 1(x), ..., un k(x)), and our goal is to understand the evolution of un(xℓ). In what follows, let H(x, Y) = Ln(x, Y) Φ(x), where Φ(x) = E[ Ln(x, Y)]. (Note that since the regularizer term is non-random, H is the same with L in place of L.) Throughout the following, we suppress the dependence of H on Y and instead view H as a random function of x, denoted H(x). We let V (x) = EY [ H(x) H(x)] denote the covariance matrix for H at x. Definition C.3. A triple (un, Ln, Pn) is δn-localizable with localizing sequence (EK)K if there is an exhaustion by compacts (EK)K of Rk, and constants CK (independent of n) such that 1. maxi supx u 1 n (EK)|| 2un i ||op CK δ 1/2 n , and maxi supx u 1 n (EK)|| 3un i ||op CK; 2. supx u 1 n (EK) Φ CK, and supx u 1 n (EK) E[ H 8] CKδ 4 n ; 3. maxi supx u 1 n (EK) E[ H, un i 4] CKδ 2 n , and maxi supx u 1 n (EK) E[ 2un i , H H V 2] = o(δ 3 n ). Define the following first and second-order differential operators, i iΦ i , and Ln = 1 i,j Vij i j . (C.1) Alternatively written, An = Φ, and Ln = 1 2 V, 2 . Let Jn denote the Jacobian matrix un. Definition C.4. A family of summary statistics (un) are asymptotically closable for step-size δn if (un, Ln, Pn) are δn-localizable with localizing sequence (EK)K, and furthermore there exist locally Lipschitz functions h : Rk Rk and Σ : Rk Rk k, such that sup x u 1 n (EK) An + δn Ln un(x) h(un(x)) 0 , (C.2) sup x u 1 n (EK) δn Jn V JT n Σ(un(x)) 0 . (C.3) In this case we call h the effective drift, and Σ the effective volatility. For a function f and measure µ we let f µ denote the push-forward of µ. The main result of Ben Arous et al. (2022) was the following limit theorem for SGD trajectories as n . Published as a conference paper at ICLR 2024 Theorem C.5 ((Ben Arous et al., 2022, Theorem 2.2)). Let (xδn ℓ)ℓbe stochastic gradient descent initialized from x0 µn for µn M1(Rpn) with learning rate δn for the loss Ln( , ) and data distribution Pn. For a family of summary statistics un = (un i )k i=1, let (un(t))t be the linear interpolation of (un(xδn tδ 1 n ))t. Suppose that un are asymptotically closable with learning rate δn, effective drift h, and effective volatility Σ, and that the pushforward of the initial data has (un) µn ν weakly for some ν M1(Rk). Then (un(t))t (ut)t weakly as n , where ut solves dut = h(ut)dt + p Σ(ut)d Bt . (C.4) initialized from ν, where Bt is a standard Brownian motion in Rk. As a result, we can read off in the ballistic regime the following finite-n approximation result. Lemma C.6. In the setting of Theorem C.5, if the limiting dynamics have Σ 0, then for every T, we have with probability 1 od(1), u(x δ 1t ) ut C[0,T ] = od(1) . C.2 1-LAYER NETWORKS: BALLISTIC LIMITS Our first aim in this section is to show that the limit theorem of Theorem C.5 indeed applies to the mixture of k Gaussians. Recall the definitions of the data distribution, and the cross-entropy loss for x Rkd (i.e., xa Rd for a [k]) from (3.2). 
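As a concrete reference for the calculations that follow, here is an analogous minimal sketch for the k-GMM: the per-sample gradient of the regularized cross-entropy update just described, together with the summary statistics m_{ab} = ⟨x_a, µ_b⟩ and R⊥_{ab} introduced in Section C.2.1 below. The names are illustrative; the gradient formula ∇_{x_c}L̄ = (π_Y(c) − y_c)Y + βx_c is consistent with (C.12)–(C.13).

```python
import numpy as np

# Sketch (not the authors' code) of the online SGD for the k-GMM of (3.1) with the
# cross-entropy loss (3.2): softmax of the scores <x_c, Y>, plus the ridge term.

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sample_kgmm(means, p, lam, rng):
    # Class a ~ p, then Y = mu_a + Z with Z ~ N(0, I_d / lam).
    a = rng.choice(len(p), p=p)
    return means[a] + rng.standard_normal(means.shape[1]) / np.sqrt(lam), a

def kgmm_sgd_step(x, Y, a, delta, beta):
    # x has shape (k, d); pi plays the role of pi_Y( . ; x) from (A.4).
    pi = softmax(x @ Y)
    onehot = np.zeros(x.shape[0]); onehot[a] = 1.0
    grad = np.outer(pi - onehot, Y) + beta * x      # grad_{x_c} Lbar = (pi_Y(c) - y_c) Y + beta x_c
    return x - delta * grad

def summary_stats(x, means):
    # m_ab = <x_a, mu_b>;  R_perp_ab = <x_a^perp, x_b^perp>, the Gram matrix of the
    # parts of the rows x_a orthogonal to span(mu_1, ..., mu_k) (Section C.2.1 below).
    m = x @ means.T
    Q, _ = np.linalg.qr(means.T)                    # orthonormal basis of span(mu_1, ..., mu_k)
    x_perp = x - (x @ Q) @ Q.T
    return m, x_perp @ x_perp.T
```

In the regime of Theorem C.7 below one takes δ = c_δ/d and initializes x with i.i.d. N(0, 1/d) entries, matching the N(0, I_d/d) initialization of Proposition C.1.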
Recall that mab = µa µb, and that in order for the problem to be linearly classifiable (and therefore solvable with this loss) no mean should be the weighted barycenter of the other means, i.e., paµa = 1 b pbµb. For this task, the input dimension d = n, the parameter dimension pn = kn and the step-size will be taken to be δ = O(1/n) = O(1/d), and we will tend to use d rather than the dummy index n. It will be helpful to write Ya = µa + Zλ, and recall for a, c [k], the Gibbs probability πYa(c) = πYa(c; x) as defined in (A.4). C.2.1 THE SUMMARY STATISTICS We will show that the following family of functions form a set of localizable summary statistics un(x) = ((mab(x))a,b, (R ab)ab). m = (mab)a,b [k] where mab = xa µb R = (R ab)a,b [k] where R ab = x a x b where x a denotes the part of xa orthogonal to Span(µ1, ..., µk), i.e., x a = P xa, where we use P to be the projection operator into the orthogonal complement of Span(µ1, ..., µk). It is not hard to see that the law of the full sequence (πYa(c))a,c (and therefore its moments etc...) only depend on x through m and R , and therefore they have no finite d dependence. The following describes the ODEs obtained by taking the simultaneous d and δ = O(1/d), limit of the SGD trajectory in its summary statistics. Theorem C.7. If the step sizes δ = cδ/d, then the summary statistics un = (m, R ) are δlocalizable, and satisfy the following ballistic limit as d : mab(t) = pamab βmab X mcb P c a + Qc,µb a , (C.5) R ab(t) = βR ab + X pa Qc,R bb a + pb Qc,R aa b ) X λ E (πYc(a) 1a=c)(πYc(b) 1b=c) , where P c a, Qc,v a are the following Gaussian integrals: P c a(m, R ) = E[πYc(a)] and Qc,v a (m, R ) = E[ Zλ, v πYc(a)] (C.7) and where if v Span(µ1, ..., µk) we use the shorthand Qc,r a for r = v 2 since for such v, Qc,v a only depends on v through v 2. The case where δ = o(1/d) is read-off by formally setting cδ = 0. Published as a conference paper at ICLR 2024 Remark 2. While a priori, it may appear that the quantity Qc,v a in (C.7) depends on the dimension d, if v is one of (µ1, ..., µk), or if it is simply orthogonal to Span(µ1, ..., µk), then Qc,v a does not depend on d. Such cases are the only ones appearing in (C.5) (C.6). The same can be said for the expectation in the last term of (C.7). Proof. Theorem C.7 will follow from an application of Theorem C.5 to the summary statistics m, R for the k-GMM, so we start by verifying that the problem fits the assumptions of δlocalizability and asymptotic closability. Verifying δ-localizability Our aim is to verify the conditions of δ localizability from Definition C.3. Let us begin with some calculations, recalling that the Jacobian matrix J = u. Let a denote the derivative in the Rd coordinates corresponding to xa. For a, b, c [k], cmab = µb if c = a 0 else , and c R ab = x a a = b = c x b b = a = c 2x a a = b = c 0 else Continuing, all higher derivatives of mab are zero. The Hessian of R ab is given by abmcd = 0 and ab R ab = P if a = b 2P if a = b . (C.9) (and other blocks are zero), and higher derivatives of R ab are also zero. Let us also express the loss function as equal in distribution to a random variable whose law depends only on m and R . For ease of notation, let Rab = xa, xb and Ya = µa + Zλ . Recalling (3.2) and adding in the regularizer, we can write for each fixed x, L(x, Y) (d) = X a [k] ya maa + Ga + X b [k] yb log X a [k] emab+Ga + β a [k] Raa , (C.10) where (Ga)a is a Gaussian vector with covariance matrix (Rab), and y is a uniformly drawn 1-hot vector in Rk. 
Observe that the law of L only depends on x through the values of the summary statistics. Lemma C.8. Suppose δ = O(1/d), λ > 0 is fixed (or growing with d) and β > 0 is fixed. Then the family of summary statistics u = (m, R ) are δ-localizable with balls. Proof. Item (1) of δ-localizability. The first part is easily seen to be satisfied by (C.9) since the Hessians of statistics in m are all zero, and the Hessian of R ab is bounded in operator norm by 2 + 2k using the triangle inequality and µb = 1. Item (2) of δ-localizability. We begin with the bound on the population loss s norm. By taking expectation of (C.10) a [k] pamaa + X a [k] pa E h log X b [k] e xb,Ya i + β a [k] Raa . (C.11) Taking the derivative of this, we get for each c [k], cΦ(x) = pcµc + X a [k] pa E[πYa(c)Ya] + βxc , (C.12) where Ya = µa + Zλ and π is as in (A.4). Considering the norm of Φ , we have Φ k max c cΦ k(1 + 1 + λ 1/2 + β max a Raa) Published as a conference paper at ICLR 2024 which is bounded by a constant C(K) for m, R in a ball of radius K. Moving on to the bound on E[ H 8], first observe using c H = c L cΦ, that c H = (pcµc yc(µc + Zλ)) + πY (c)Y E[πY (c)Y ] (C.13) Taking the norm and the 8 th moment, we use the fact that u + v 8 ( u 8 + v 8), and the fact that πYa( ) is a probability mass function and therefore at most 1, to upper bound max c [k] E[ c H 8] sup a µa 8 + E Zλ 8 1 + O((d/λ)4) . As long as δn = O(1/d), since λ is uniformly bounded away from zero, this will be bounded by a constant times δ 4 n as required. Item (3) of δ-localizability. We next turn to bounding the fourth moments of the directional derivatives, starting with the statistics mab: c H, amab = (pc yc)µc, µb yc Zλ, µb + pb E[πYb(c)] ybπYb(c) pl E[πYl(c) µb, Zλ ] + ylπYl(c) µb, Zλ . Taking the fourth moment, again bounding things by the fourth moments of the terms individually up to a universal constant, then taking an expected value, we see that the first term is bounded by 1, the second by the fourth moment of a Gaussian random variable with variance 1/λ, i.e., by C/λ2, the third and fourth by 1 since π is a probability distribution, and the summands individually have fourth moments bounded by C/λ2 for the same reason. Altogether, we get E[ c H, amab 4] 1 + O(1/λ2) , satisfying the requisite bound with room to spare. Turning to the directional derivative in the direction of R ab, note that derivatives of R ab are orthogonal to (µ1, ..., µk) so for a = b, c H, b R ab = c H, b R ba = E[πY (c) x a , Zλ ] πY (c) x a , Zλ yc x a , Zλ . Considering the fourth moment of the above, using that π is bounded by 1, and x a , Zλ is distributed as a Gaussian random variable with variance R aa/λ, we obtain E[ c H, b R ab 4] (R aa/λ)2 , which is bounded by C(K) while R aa is bounded by K. The diagonal a = b is the same up to a factor of 2. The last thing to check is the second part of item (3) in δn-localizability. For this purpose, recall that V (x) = E[ H 2], and notice from equation C.13 that for the k-GMM we have c H d H = ((πY (c) yc)Y E[(πY (c) yc)Y ]) ((πY (d) yd)Y E[(πY (d) yd)Y ]) Taking expected values, we get that the cd-block of V is given by Vcd = E[ c H d H] = Cov (πY (c) yc)Y, (πY (d) yd)Y , (C.14) where Cov is the covariance matrices associated to the vectors inside it. For the second part of item (3) in localizability, we only need to consider the statistics R ab since the second derivatives of mab are all zero. For the ballistic limit it is sufficient to use the bound E[ 2R ab, c H d H Vcd 2] 2R ab 2 op E[ c H 4] . 
The first term is bounded by 2 per (C.9). The second can be seen to be bounded via the 8th moment above by O((d/λ)2) which is o(δ 3) when δ = O(1/d). Remark 3. The reader might notice that there is a lot of room in the bounds above as compared to the allowed thresholds from the δ-localizability conditions. The weaker conditions in localizability are to allow taking diffusive limit theorems about saddle points by rescaling the summary statistics by d-dependent factors, which can help with understand timescales to escape fixed point regions of the limits of Theorem C.7. This was explored in much detail for matrix and tensor PCA in Ben Arous et al. (2022). Published as a conference paper at ICLR 2024 Calculating the drift and corrector Now that we have verified that the δ-localizability conditions apply to the summary statistics u = (m, R ), we compute the limiting ODE one gets in the d limit for these statistics. We will establish individual convergence of Au to some fu(u) and convergence of δLu to some gu(u) for each u u, whence h from Definition C.4 equals f + g. Recall the differential operator A from (C.1), the expression for Φ from (C.12), and consider c [k] cΦ, cmab = aΦ, µb = paµa, µb + E[πY (a) Y, µb ] + β xa, µb . Recalling that mab = µa, µb , we get Amab = pamab + βmab + X mlb E[πYl(a)] + E[πYl(a) Zλ, µb ] . Notice that the two expected values are Gaussian expectations that only depend on x through (mcl)c and (R lm)l,m. In particular, we can take the limit as d to get the limiting drift function fmab(m, R ) = pamab + βmab + X mlb P l a + Ql,µb a . where P l a and Ql,µb a are defined per (C.7). The contribution coming from δLmab vanishes in the d limit since the second derivative of mab is zero, i.e., gmab = 0. For the drift function for R ab, since x a , µb = 0 for all a, b, we get AR ab = βR ab + X l [k] (pa Ql,x b a + pb Ql,x a b ) . In particular, we get f R ab = βR ab + X pa Ql,R bb a + pb Ql,R aa b , using Ql,R bb a = Ql,x b a since Ql,v a only depended on v through its norm when v Span(µ1, ..., µk) There is also a contribution from δLR ab; recall that L = 1 2 V, 2 and notice that 2 Vab, P + Vba, P . Plugging in for Vab from (C.14), and expanding this out, a calculation yields c [k] pc E Zλ 2(πYc(a) 1a=c)(πYc(b) 1b=c) E[Z(πYc(a) 1a=c), E[Z(πYc(b) 1b=c)] + O(1) , where to see that the extra terms are O(1), we notice that the inner products of the µ s with each other and µ s with Z s are all order 1. By Cauchy Schwarz, the inner product of the expectations is at most O( p d/λ) and will vanish when multiplied by δ = O(1/d) and the d limit is taken; on the other hand, the second moment term is of order d/λ. In particular, if δ = cδ/d, we get δLR ab = cδ X c [k] pc E h d 1/2Zλ 2(πYc(a) 1a=c)(πYc(b) 1b=c) i + o(1) . We claim that the d limit of this gives g R ab = cδ c [k] pc E h (πYc(a) 1a=c)(πYc(b) 1b=c) i (C.15) Notice that the expectation is bounded by 1 in absolute value and thus the above goes to 0 as λ . In order to show (C.15), let us consider the term coming from πYc(a)πYc(b), the other terms being Published as a conference paper at ICLR 2024 analogous or even easier since the indicator is deterministic. Using the fact that π are probabilities and therefore bounded by 1, note that E[ d 1/2Z 2πYc(a)πYc(b)] 1 λE[πYc(a)πYc(b)]| E h d 1/2Z 2 1 which goes to 0 from the standard fact that for a standard Gaussian vector Gd in Rd, one has E Gd d 2 1 2 = O(1/d). 
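For completeness, the standard fact invoked here is just the chi-square moment computation: $\|G_d\|^2$ has mean $d$ and variance $2d$, so $\mathbb{E}\big[(\|G_d\|^2/d - 1)^2\big] = \mathrm{Var}(\|G_d\|^2)/d^2 = 2/d = O(1/d)$.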
The last thing to check is that in the ballistic regime where our summary statistics are not rescaled, the limiting dynamics are indeed an ODE, i.e., there is no limiting stochastic part. Towards that purpose, notice that in any ball of values for (m, R), JV JT = O(1) , using an O(1) bound on the operator norm of V , and noticing that J 2 is bounded by a constant plus the sums of R ab. Therefore when multiplied by δ = o(1) this vanishes in the limit and therefore the diffusion matrix Σ is identically zero. The following confines this λ-finite dynamical system to a compact set for all times. Lemma C.9. For every β > 0, there exists L(β) such that for all λ large, the dynamics of Theorem C.7 stays inside the ℓ2-ball of radius L for all time. Proof. Notice that a naive bound on Qd,v a from (C.7) is at most p v /λ since πYc(a) is bounded, and P d a is always at most 1. Plugging these bounds into Theorem C.7, together with the definition of mab and the fact that µa = 1 for all a, yields the inequalities for all a, b, | mab(t) + βmab| 1 + p | R aa + βR aa| q R aa/λ + λ 1 + λ 2 . By these, for λ larger than a fixed constant, we have R aa 1 βR aa/2 which Gronwall s inequality ensures will be bounded by 1 + R aa(0) for all times t 0. Similarly, we get that for λ sufficiently large, |mab(t)| is bounded by 3 + mab(0). C.2.2 THE ZERO-NOISE LIMIT We now send λ , or simply Taylor expand in the large λ limit, to understand the behavior of the limiting ODE s we derived when λ is large. Let πc(a) = πc(a; x) := emac P b [k] embc . (C.16) This is the λ = value of πYc(a). The aim of this subsection is to establish the following. Proposition C.10. The λ limit of the ODE system from Theorem C.7 is the following dynamical system: mab(t) = pamab βmab X c [k] pcπc(a)mcb , (C.17) R ab(t) = βR ab . (C.18) Moreover, at large finite λ, the difference of the drifts in (C.5) (C.6) to the above drifts is O(λ 1). The main thing to prove is the following behavior of integrals of πYc(a) as λ . Lemma C.11. Recalling P c a and Qc,v a from (C.7), we have P c a = πc(a) + O(1/λ) . Qc,v a = O( v /λ) . Published as a conference paper at ICLR 2024 Proof. By Taylor expanding, we can write πYc(a) = πc(a) + (xa Z)exa µc P b [k] exb µc exa µc(P b [k](xb Z)exb µc) (P b [k] exb µc)2 + O((max b [k] xb Z)2) . Taking an expectation of the right-hand side, noting that xb Z = N(0, Rbb λ ), everything on the right-hand side after πd(a) is O(1/λ) for R bb that is O(1). For Qc,v a , using the Gaussian integration by parts formula and (A.7) E[(xa v)πYc(a)] E[ x B v πYc πYc(a)] Since π 1 and xb v q R bb v , this is easily seen to be O( v q R bb/λ) as claimed. Proof of Proposition C.10. For any K, uniformly over all m, R in a ball of radius K about the origin, we claim that the limit of the drifts for each of those variables converge to the claimed λ limiting drifts. This is obtained by applying the above lemma to the P and Q terms in the drifts in Theorem C.7, and finally the observation that Var(πYc(a)πYc(b)) 1 so that taking λ the last two terms in the drift for R ab in Theorem C.7 also vanish. The following gives a quantitative approximation of the ODE by its λ = limit. Corollary C.12. The trajectories of the ODEs of Theorem C.7 and Proposition C.10 are within distance O(t/λ) of one another. Proof. Let u and eu be the two solutions to the λ finite and λ = ballistic dynamics respectively. Then, while u , eu L, we have by Proposition C.10 that u eu L λ 1 . 
Per Lemma C.9, both dynamics remain confined for a large enough L(β) for all times, and therefore integrating the above gives the claim. C.3 LIVING IN SUBSPACE SPANNED BY THE MEANS We now wish to show that the SGD trajectory lives in the span of the means. This can be done by showing that on the one hand, R will stay as small as we want after an O(1) burn-in time, and on the other hand, towards the error being multiplicative in Definition 2.1, for every a, xa ℓneeds to be non-negligible. We first establish that this happens for the λ = dynamics, then pull it back to the dynamics at λ finite but large via Corollary C.12. Lemma C.13. The solution to the dynamical system of Proposition C.10 is such that for all t T0(ϵ), it is within distance ϵ of a point having R aa = 0 and maxb |mab| > cβ > 0 for every a [k]. Proof. By the expression from Proposition C.10 for the drift of R aa for a [k], the dynamical system has R aa(t) = e βt R aa(0). In particular, for any ϵ > 0, the has R aa(t) < ϵ for all t T0(ϵ). We need to show for every a, in the solution of the dynamical system, some (mab)b is bounded away from zero after some small time. Let M0 = S c{mac = 0} be the set of (m) values we would like to ensure the dynamics stays away from. First observe that the λ = dynamical system of Proposition C.10 is a gradient system for the energy function H(m, R ) = X a [k] pamaa + X a [k] pa log X b [k] emab + β 2 ( m 2 + R 2) , Published as a conference paper at ICLR 2024 so it has no recurrent orbits. It thus suffices to show that for every point in M0, the quantity maxb |mab| has a drift strictly bounded away from zero. If we show that, then the dynamics is guaranteed to leave a cβ-neighborhood of the set M0 in a finite time (uniform by continuity considerations and we are already guaranteed that the dynamics stays in a compact set by Lemma C.9. Consider a point such that mab = 0 for all b [k]. There, by Proposition C.10, mab = paµa X c [k] pcπc(a)µc, µb . We need to show that the maximum over b, of the absolute values of these, is non-zero. Indeed, if it were 0 for all b [k], then paµa = P c pcπc(a)µc because the difference of these vectors would be in Span(µ1, ..., µk) while being orthogonal to all of µ1, ..., µk. In turn, however, this is impossible by our assumption that µ1, ..., µk are linearly independent. Therefore, in the ball of radius L(β) about the origin, for every a, at least one of the drifts mab is bounded away from zero uniformly. Proof of Proposition C.1. We have shown that the limit dynamics for the summary statistics of the SGD initialized from N(0, Id/d) is the dynamical system of Theorem C.7 initialized from the deterministic initialization mab δ0 and R ab δ0 if a = b and δ1 if a = b. By Lemma C.9, there exists L(β) such that that dynamical system is confined to a ball of radius at most L for all time. Since xc 2 is encoded by a smooth function of the summary statistics R cc, (mcb)b appearing in the dynamical system of Theorem C.7, this is transferred to the SGD xℓ via Lemma C.6 for a different constant L(β) for d sufficiently large. For the second part, by Lemma C.13, the solution to the dynamical system of Theorem thm:k GMM-ballistic-limit is at distance O(λ 1) from the solution of the dynamical system (with the same initialization) of Proposition C.10, for which Lemma C.13 applies. By Lemma C.6, these get pulled back to the summary statistics applied to the SGD itself (at d sufficiently large depending on ϵ, λ). 
We therefore deduce that for all ℓ T0(ϵ)δ 1 steps, the SGD has R cc(xℓ) O(ϵ + λ 1) and has maxb | xc ℓ, µb | > cβ > 0. These imply the claim using that xc maxb | xc, µb | and that R cc(xc) is by definition the projection of xc orthogonal to (µ1, ..., µk). C.3.1 FIXED POINT ANALYSIS WITH ORTHOGONAL MEANS The behavior of the ODE system of Proposition C.10 can be sensitive to the relative location of the means (µ1, ..., µk). In order to be able to make more precise statements about the alignment of the SGD with specific eigenvectors rather than just living in Span(µ1, ..., µk), we specialize to the case where the means form an orthonormal family of vectors. Here, we can explicitly characterize the fixed points of Proposition C.10. Namely, in this subsection, assume that mab = 1a=b. Then, by (C.17) any fixed point must satisfy βmab = pa1a=b pb emab P c emcb . (C.19) At a fixed point, the function P c mcb thus must equal 0. Also, if a, c = b then mab mcb = pb β emab emcb Since the function x+cex is strictly increasing, the only solutions to this are at mab = mcb (so long as pb > 0). Combining the above two observations, at a fixed point, mab = 1 k 1mbb for all a = b . Plugging this in to (C.19), we find that at a fixed point (k 1)e 1 k 1 mbb + embb This can easily be seen to have a unique solution, and that solution must have mbb (0, pb Published as a conference paper at ICLR 2024 Therefore, as long as (pb)b [k] are all positive, the dynamical system of Proposition C.10 has a unique fixed point, call it u at (mab)a,b as above, and R ab = 0 for all a, b. As observed in the proof of Lemma C.13, the dynamical system never leaves a ball of some radius L(β) and is also a gradient system for an energy function H. Combining these facts with continuity of the drift functions, it means that for every ϵ > 0, there is a T0(ϵ) such that the solution to the SGD gets within distance T0 of that unique fixed point and stays there for all t T0. Altogether, this leaves us with the following stronger form of Proposition C.1. Proposition C.14. When the means (µ1, ..., µk) are orthonormal, beyond Proposition C.1, we further have that x is at distance O(ϵ+λ 1) of a point x such that for each c, xc has positive (bounded away from zero uniformly in ϵ, λ) inner product with µc and negative inner product with (µa)a =c. If (pc)c [k] are assumed to all be equal, then furthermore at x , πb(a) = 1 k 1(1 πc(c)) for all a, b, c such that a = b . C.4 2-LAYER NETWORKS We now turn to the analysis of the SGD in the case of multilayer networks for the XOR GMM problem (4.1). In this problem, the input dimension is d = n, the parameter dimension is p = Kd + K, and the step-size is again taken to be δ = O(1/d). Consider the following family of 4K + K 2 summary statistics of x: for 1 i j K and ϑ {µ, ν}, vi , mϑ i = Wi ϑ , R ij = W i W j , (C.20) where W i = Wi P ϑ {µ,ν} mϑ i ϑ. Use u = (v, mµ, mν, R ) for these families. In Ben Arous et al. (2022) it was shown that this family of summary statistics is δ-localizable and asymptotically closable with respect to the loss for the XOR GMM of (4.2), and the following convergence to ODE s was established. For a point x = (v, W) RK+Kd, define the quantity Ai = E Y 1Wi Y 0 y + σ(v g(WY )) , (C.21) (where we recall that g is the Re LU function and σ is the sigmoid function) and let Aϑ i = ϑ Ai , A ij = W j Ai . Furthermore, let Bij = E 1Wi Y 01Wj Y 0 y + σ(v g(WY )) 2 . (C.22) It can be observed that these functions are expressible as functions of u alone. Proposition 5.1 of Ben Arous et al. 
(2022) proved the following effective ballistic dynamics. Proposition C.15. Let un be as in (C.20) and fix any λ > 0 and δ = cδ/d. Then un(t) converges weakly to the solution of the ODE system ut = f(ut) + g(ut), initialized from limn(un) µn with fvi = mµ i Aµ i (u) + mν i Aν i (u) + A ii(u) + βvi , fmµ i = vi Aµ i + βmµ i , f R ij = vi A ij(u) + vj A ji(u) + 2βR ij , fmν i = vi Aν i + βmν i . and correctors gvi = gmµ i = gmν i = 0, and g R ij = cδ vivj λ Bij for 1 i j K. The case where δ = o(1/d) is read-off by formally setting cδ = 0. C.4.1 LARGE λ BEHAVIOR We now wish to investigate the large λ behavior of the dynamical system in Proposition C.15. Our approach to doing this is to give a large λ approximation to the drifts in the above, and then use that to show that the trajectory is close to its λ = version. The first thing we do is give large λ approximations to the quantities Ai and Bij. In what follows, we use F(x) to denote the cumulative distribution function of a standard Gaussian random variable. Published as a conference paper at ICLR 2024 Lemma C.16. Suppose Y ϑ + Zλ for a fixed vector ϑ {µ, µ, ν, ν} and fix a unit vector b Rd and a vector v RK. Then E[(b Y )1Wi Y >0σ(v g(WY ))] (b ϑ)F mϑ i max j (1 + vj)(mϑ i + λ )e (mϑ j )2λ/(16Rjj) + O(λ 1) . Proof. Let us start with a Taylor expansion of σ: For simplicity, let us use σY and σϑ to denote σ(v g(WY )) and σ(v g(Wϑ)) respectively and similarly for σ . We then have σY σϑ = σ ϑ (v (g(Wϑ) g(WY ))) + σ (o)(v (g(Wϑ) g(WY )))2 , (C.23) for some point o between v g(WY ) and v g(Wϑ). Therefore, the left-hand side of the lemma is E h 1Wi Y >0(b ϑ)σϑ i + E h 1Wi Y >0(b Z)σϑ i + E h 1Wi Y >0(b Y )σ ϑ(v (g(Wϑ) g(WY ))) i + E h 1Wi Y >0(b Y )σ (o)(v (g(Wϑ) g(WY )))2 i . (C.24) The first term of (C.24) is exactly the term we are comparing to in the left-hand side of the lemma statement. Since E[(b Z)] = 0, the absolute value of the second term of (C.24) is bounded via Cauchy Schwarz as follows: |E h 1Wi Y >0(b Z)σϑ i | = |E h (1Wi Y >0 1mϑ i >0)(b Z)σϑ i | σϑ λ F( |mϑ i | p For the next two terms we can start by rewriting g(Wj ϑ) g(Wj Y ) = (mϑ j +Wj Z)1Wj Y <0,mϑ j >0 (Wj Y )1Wj Y >0,mϑ j <0 (Wj Z)1mϑ j >0 . (C.25) Using this, the third expectation in (C.24) is σ ϑE h (b Y )1Wi Y >0 X j vj (mϑ j +Wj Z)1Wj Y <0,mϑ j >0 (Wj Y )1Wj Y >0,mϑ j <0 (Wj Z)1mϑ j >0 i . The first and second terms in the parentheticals are such that the Cauchy Schwarz inequality can be applied to bound their total contributions by Kσ a v max j (mϑ j )2 + Rjj λ )1/2F( |mϑ j | q λ/Rjj)1/2 . The last term will contribute (up to a net sign) σ ϑvj E[(b ϑ)(Wj Z)(1Wi Y >0 1mϑ i >0)1mϑ j >0] + σ ϑvj E[(b Z)(Wj Z)1Wi Y >01mϑ j >0] . The absolute value of the first term is bounded similarly to an earlier one by Cauchy Schwarz. The absolute value of the second term is bounded by dropping the indicator functions and applying Cauchy Schwarz to see that it is O(1/λ) so long as b , Wj = O(1). Finally, for the fourth term of (C.24), the square allows us to put absolute values on every term, and immediately apply the Cauchy Schwarz inequality, to bound its absolute value by σ v E[(b Y )2]1/2 max j E[(g(Wj ϑ) g(Wj Y ))4]1/2 . The fourth moment in the above expression is bounded, up to constant, by the fourth moments of each of the individual terms in (C.25). The first two of those will be at most some constant times F( |ma j | p λ/Rjj)1/4. For the last of them, we use E[(Wj Z)4]1/2 O(1/λ). 
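Indeed, with $Z = Z_\lambda \sim N(0, I_d/\lambda)$ as in the statement of the lemma, $W_j \cdot Z$ is a centered Gaussian with variance $R_{jj}/\lambda$, so $\mathbb{E}[(W_j \cdot Z)^4]^{1/2} = \sqrt{3}\, R_{jj}/\lambda = O(1/\lambda)$ uniformly over the bounded set of parameters under consideration.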
Combining all of the above bounds, and naively bounding the cdf of the Gaussian via F( |mϑ j | q λ/Rjj)1/4 e (mϑ j )2λ/(16Rjj) we arrive at the claimed bound. Published as a conference paper at ICLR 2024 As a consequence of Lemma C.16, we can deduce that the quantities appearing above satisfy the following large λ behavior. Corollary C.17. For every x such that mµ i , mν i log λ λ for all i, mµ i Aµ i = 1 4g(mµ i )σ( v g(mµ)) 1 4g( mµ i )σ( v g( mµ)) + O(λ 1) mν i Aν i = 1 4g(mν i )σ(v g(mν)) + 1 4g( mν i )σ(v g( mν)) + O(λ 1) A ij = O(λ 1) . Without the assumption on mµ i , mν i , the same holds with O(λ 1) replaced by O(λ 1/2), as long as the indicators from g(mϑ i ) = mϑ i 1mϑ i >0 are replaced by the soft indicators F(mϑ i p Proof. We begin with the estimate on mµ i Aµ i . This quantity can be split into four terms correspond- ing to whether the mean chosen for Y is µ, µ, ν, ν. Namely, if we let Ya d= a + Zλ, E[(µ Yµ)1Wi Yµ 0σ( v g(WYµ)] + E[(µ Y µ)1Wi Y µ 0σ( v g(WY µ)] + E[(µ Yν)1Wi Yν 0σ(v g(WYν)] + E[((µ Y ν)1Wi Y ν 0σ(v g(WY ν)] Now notice that each of the four quantities are of the form of Lemma C.16, with b = µ, and ϑ = µ, µ, ν, ν respectively (the change of the sigmoid possibly having a negative sign on its argument is realized by switching the sign of v since that is the only place it appears). It is easily seen that so long as |mµ i | (log λ)/ λ/Rii) = 1{mµ i > 0} + O(1/λ) . By a similar bound on the error term on the right of Lemma C.16, E[(µ Yµ)1Wi Yµ 0σ( v g(WYµ)] = 1mµ i >0σ( v g(mµ)) + O(1/λ) . and the cases where it is Yν contribute 0 + O(λ 1) since µ ν = 0. Together with the analogous bounds for mν i Aν i and A ij, this gives the first part of the corollary. If we drop the assumption on mµ i , we find from the general inequality xe x2L C uniform C, that the errors in the right-hand side of Lemma C.16 become at most O(1/ λ) as claimed. Leaving the soft indicator in place, this gives the second part of the corollary. We deduce from the above approximation and a trivial bound of O(1/λ) on Bij, a λ = limiting dynamics. However, one has to be slightly careful about the λ limit near the hyperplanes where mµ i = 0 or mν i = 0; Therefore, in what follows we envision a different limit associated to each orthant of the parameter space (associated to the signs of mµ i , mν i ). This is reasonable because in each orthant, the λ = dynamical system initialized from that orthant stays in that orthant. The following proposition captures that limiting dynamics (all orthants being written into the below simultaneously, with ambiguities at mµ i = 0 being therefore omitted). Proposition C.18. In the λ limit, the ODE from Proposition C.15 converges to g(mµ i )σ( v g(mµ)) + g( mµ i )σ( v g( mµ)) g(mν i )σ(v g(mν)) + g( mν i )σ(v g( mν)) βvi , 1mµ i >0σ( v g(mµ)) 1mµ i <0σ( v g( mµ)) βmµ i , 1mν i >0σ( v g(mν)) 1mν i <0σ( v g( mν)) βmν i , and R ij = 2βR ij for 1 i j K. Moreover, for λ large, the difference in the drifts above and those of Proposition C.15 are of order O(1/λ) if (|mµ i |, |mν i |) are all at least p (log λ)/λ, and O(1/ λ) everywhere. Published as a conference paper at ICLR 2024 C.4.2 CONFINING THE SGD TO A COMPACT SET Similar to the k-GMM case, our first aim is to confine the SGD to a compact set so long as β > 0. Lemma C.19. For every β > 0, there exists L(β, v(x0)) such that for all λ large, the dynamics of Proposition C.15 stays inside the ℓ2-ball of radius L for all time. Proof. We start by considering the evolution of the ℓ2-norm of all the parameters, (v, mµ, mν). 
If we use the shorthand gµ = g(mµ), g µ = g( mµ), σµ = σ( v gµ), σ µ = σ( v g µ) and similar quantities with ν instead of µ, a short calculation shows that we get in the λ = dynamical system of Proposition C.18 2 v gµσµ + v g µσ µ 1 2 v gνσν + v g νσ ν 2β v 2 . Notice that σµ goes to 0 as v gµ , and moreover, xσ( x) 0 as x . Therefore, there is a uniform bound of 1 (in fact, W(1/e) where W is the product log function) on v gµσµ. Similar uniform bounds apply to the other three terms, so that in total, v 2 2 2β v 2 , which implies in particular that its drift is negative when v 2 is at least L(β). Similar arguments go through mutatis mutandis for (mµ)2 and (mν)2. These then imply that the drift when λ is finite are similarly negative when v 2 = L(β) as long as λ is sufficiently large per the approximation from Proposition C.18. Altogether, this implies that for the dynamical system of Proposition C.15, all of v , mµ and mν , stay bounded by some L(β) for all time. Using that, if we consider the expression for R ii and use the boundedness of Ai, B, the drift in Proposition C.15, gives R 2 β R 2 + 4L R , The right-hand side is at most CL 1 2β R 2 for some CL, whence by Gronwall, R 2 also stays bounded by some constant L (β) for all time under the dynamics of Proposition C.15. Altogether these prove the lemma. Corollary C.20. The trajectories of the ODEs of Proposition C.15 and Proposition C.18 are within distance O(e Ct/ λ) of one another. Proof. Let u and eu be the two solutions to the λ finite and λ = ballistic limits respectively. Then u eu K (1 + L + 2β) u eu + O(λ 1/2) , where we used that L bounds the norm of u and eu for all times by Lemma C.19, and that the Lipschitz constant of the sigmoid is 1. By Gronwall s inequality, this implies that |u eu| O(e Ct/ λ) as claimed for some C depending only on β. C.4.3 LIVING IN THE PRINCIPAL DIRECTIONS NEAR FIXED POINTS The last thing to conclude is that the fixed points of the λ = dynamical system are indeed living in the principal directions as desired by Theorem 4.2. By the above arguments, for all t T0(ϵ) the dynamics is within distance ϵ of one of the fixed points of Proposition C.18. Let us recall the exact locations of those fixed points from Ben Arous et al. (2022). If 0 < β < 1/8, then let (I0, I+ µ , I µ , I+ ν , I ν ) be any disjoint (possibly empty) subsets whose union is {1, ..., K}. Corresponding to that tuple (I0, I+ µ , I µ , I+ ν , I ν ), is a set of fixed points that have R ij = 0 for all i, j, and have 1. mµ i = mν i = vi = 0 for i I0, 2. mµ i = vi > 0 such that P i I+ µ v2 i = logit( 4β) and mν i = 0 for all i I+ µ , Published as a conference paper at ICLR 2024 3. mµ i = vi > 0 such that P i I µ v2 i = logit( 4β) and mν i = 0 for all i I µ , 4. mν i = vi < 0 such that P i I+ ν v2 i = logit( 4β) and mµ i = 0 for all i I+ ν , 5. mν i = vi < 0 such that P i I ν v2 i = logit( 4β) and mµ i = 0 for all i I ν . The following observation is easy to see from the fixed point characterization described above. Observation C.21. Suppose that x is a fixed point amongst the above. Then W(x ) Span(µ, ν) , v(x ) Span(gµ(x ), g µ(x ), gν(x ), g ν(x )) , (with no error). Proof. The first claim that W(x ) Span(µ, ν) follows from the fact that R ii = 0 for all i. Furthermore, if I+ ν , I ν are empty, then it lives in Span(µ), and similarly if I+ µ , I µ are empty, then it lives in Span(ν). For the next claim, observe that gµ(x ) is the vector that is mµ in coordinates belonging to I+ µ , and 0 in all other coordinates. 
g µ(x ) is the vector that is mµ in coordinates belonging to I+ µ, and zero in others. gν(x ) is mν i for coordinates in I ν , zero else, and g ν(x ) is mν i on I+ ν . Since the I-sets are a partition of {1, ..., K} it is evident that we can express v(x ) = gµ(x ) + g µ(x ) gν(x ) g ν(x ) , implying that indeed v(x ) lives in Span(gµ(x ), g µ(x ), gν(x ), g ν(x )) It was further argued in (Ben Arous et al., 2022, Section 9.4) that there is a transition at β = 1/8 in the regularization, and as long as β < 1/8, with probability 1 with respect to the random initialization, the SGD converges to a fixed point having I0 = . Lemma C.22. Suppose β < 1/8 and consider the initialization vi(0) N(0, 1) , and mµ i (0), mν i (0), R ij(0) δ0 and R ii δ1 , (C.26) where since the dynamical system of Proposition C.10 is defined per orthant, the δ0 s are understood as 1/2-1/2 mixtures of δ0+ and δ0 . With probability 1 over this initialization, the dynamical system of Proposition C.10 converges to a fixed point as characterized above, above having I0 = . Putting the above together, we can conclude our proof of Proposition C.2. Proof of Proposition C.2. Per Proposition C.15, the limit of the summary statistics of the SGD along training is the solution of that dynamical system initialized from mµ i , mν i , R ij δ0 for i = j (with equal probability of mµ i , mν i being δ0+ and δ0 if we need to distinguish which orthant it is initialized in), R ii δ1, and vi N(0, 1) i.i.d. For the first part, notice that the norms Wi(x) 2 = R ii +(mµ i )2 +(mν i )2 and v(x) 2 = PK i=1 v2 i are expressible in terms of the summary statistics. Therefore, their boundedness follows from Lemma C.19, pulled back to the summary statistics of the SGD per Lemma C.6. For the second item, by Corollary C.20 and Lemma C.22 together with Observation C.21, for every ϵ > 0, there is a T0 such that for every Tf, for all t [T0, Tf], the dynamical system of Proposition C.15 is within distance ϵ + λ 1/2 of a point u having |vi(t)| > η and max{|mµ i |, |mν i |} > η in and having that its first layer lives in Span(µ, ν) and its second layer lives in Span((gϑ(u ))ϑ { µ, ν}). This is pulled back to the summary statistics applied to the SGD trajectory per Lemma C.6 (with the observation that (gϑ(x))ϑ are functions of only the summary statistics). Published as a conference paper at ICLR 2024 D CONCENTRATION OF HESSIAN AND G-MATRICES We recall the general forms of the empirical test Hessian matrix 2 b R(x) and G-matrix b G(x) from (2.2) respectively. Our aim in this section is to establish concentration of those empirical matrices about their population versions throughout the parameter space. D.1 HESSIAN AND G-MATRIX: 1-LAYER We first prove the concentration of the empirical Hessian and G-matrix for the k-GMM problem about their population versions, which we analyzed in depth in Section A B. This concentration will be uniform over the entire parameter space. Namely, our aim in this section is to show the following. Theorem D.1. Consider the k-GMM data model of (3.1) and sample complexity f M/d = α. There are constants c = c(k), C = C(k) (independent of λ) such that for all t > 0, the empirical Hessian matrix concentrates as sup x Rkd P(|| 2( b R(x) E[ b R(x)])||op > t) exp [cα(t t2) C]d , (D.1) and so does the empirical G-matrix sup x Rkd P(|| b G(x) E[ b G(x)]||op > t) exp [cα(t t2) C]d . (D.2) Proof. In the following we fix x Rkd and simply write b R(x), b G(x) as b R, b G. 
Let e A = (e Y 1 e Y f M) denote the test data matrix and let DH bc = diag(πe Y ℓ(c)δbc πe Y ℓ(c)πe Y ℓ(b))1 ℓ f M , (D.3) DG bc = diag((eyℓ ceyℓ b πe Y ℓ(b)eyℓ c eyℓ bπe Y ℓ(c) + πe Y ℓ(c)πe Y ℓ(b))1 ℓ f M , (D.4) where πY (c) was defined in (A.4). Then DH bc, DG bc are f M f M diagonal matrices for each pair bc [k]2. We denote the (k f M) (k f M) matrices DH = (DH bc)bc and DG = (DG bc)bc, and the dk k f M matrix e A k = Ik e A . With these notations, per (2.2), we can rewrite the Hessian and G-matrices as f M e A k DH( e A k)T , (D.5) f M e A k DG( e A k)T . (D.6) To prove that the operator norm of 2( b R E[ b R]) concentrates, we ll use a net argument over the unit ball in Rdk to show that the following concentrates sup v (Rd)k,||v||=1 v, 2( b R E[ b R])v , (D.7) where v = (vc)c [k] (Rd)k. By plugging (D.5) into (D.7), we want a concentration estimate for F(v) = v, 2( b R E[ b R])v , which we can rewrite as follows. v, e A k DH( e A k)T v v, E[ e A k DH( e A k)T ]v va, e ADH ab e AT vb va, E[ e ADH ab e AT ]vb ℓ dℓ(a, b) va, e Y ℓ vb, e Y ℓ E dℓ(a, b) va, e Y ℓ vb, e Y ℓ , Published as a conference paper at ICLR 2024 where dℓ(a, b) = [DH ab]ℓℓ. Recalling (Vershynin, 2018, Section 4.4.1) that for an ϵ-net Nϵ of {v : v = 1}, and any real symmetric matrix H the following holds 1 1 2ϵ sup v Nϵ | v, Hv | ||H||op = sup v 2=1 | v, Hv | sup v Nϵ | v, Hv | . So by a union bound and the ϵ-covering number of {v Rkd : v = 1}, we have P sup v 2=1 |F(v)| > t |Nϵ| sup v Nϵ P |F(v)| > t/2 (C/ϵ)kd P |F(v)| > t/2 , (D.9) so long as ϵ < 1/4, say. To control the last quantity P(|F(v)| > t/2), we notice that F(v) is a sum of O(1) many (in fact k2 many) terms (corresponding to each pair a, b) so it suffices by a union bound to control the concentration of each summand. That is, it suffices to understand the concentration of 1 f M dℓ(a, b) va, e Y ℓ vb, e Y ℓ E dℓ(a, b) va, e Y ℓ vb, e Y ℓ . (D.10) We recall that the test data e Y ℓ= X a [k] eyℓ aµa + Zℓ λ := µℓ+ Zℓ λ , (D.11) where Zℓ λ are i.i.d. N(0, Id/λ), and µℓ P a [k] paδµa. In this case, note that dℓ(a, b) = πe Y ℓ(b)δab πe Y ℓ(b)πe Y ℓ(a) are uniformly bounded by 2, and that for 1 ℓ f M va, e Y ℓ d= va, µℓ + va, Zℓ λ , (D.12) are i.i.d. sub-Gaussian (with norm O(1)) for fixed va, since va 2 1 and µc 2 = 1 for all c, and λ 1 say. Thus, for all a, b, the products (dℓ(a, b) va, e Y vb, e Y ℓ )ℓare i.i.d. uniformly subexponential random variables. As such, (D.10) is a sum of i.i.d. centered uniformly sub-exponential random variables, and Bernstein s inequality yields that there exists a small constant c = c(k) P(|F(v)| > t/2) exp( cf M(t t2)) . (D.13) It follows from plugging (D.13) into (D.9) and taking ϵ = 1/8, that P sup v 2=1 |F(v)| > t/2 exp cf M(t t2) + Ckd , which yields the concentration bound (D.1) for empirical Hessian matrix, by noticing that f M/d = α. For the proof of the concentration bound (D.2) for empirical G-matrix, thanks to (D.6), we can define F(v) (as in (D.8)) using DG instead of DH. Noticing that the entries DH ab are bounded by 4, the concentration of this new F(v) follows from Bernstein s inequality by the same argument. Then the concentration estimates of this new F(v) together with a epsilon-net argument gives (D.2). D.2 HESSIAN AND G-MATRIX: 2-LAYER GMM MODEL For the reader s convenience, we recall the empirical Hessian and G-matrix from the beginning of Section B. Our main aim in this subsection is to prove the following analogue of Theorem D.1 for the XOR GMM with a 2-layer network. 
There is a small problem with differentiating the Re LU activation twice exactly at zero, so let us define the set Wc 0 = T Theorem D.2. We consider the 2-layer XOR GMM model from (4.1) (4.2) with sample complexity f M/d = α. For any L, there are constants c = c(K, L), C = C(K, L) such that for all t > 0, the empirical Hessian matrix concentrates as sup (v,W ) L W Wc 0 P(|| 2( b R(v, W) E[ b R(v, W)])||op > t) exp{ [cα(t t2) C]d} , (D.14) and the empirical G-matrix concentrates as sup (v,W ) L P(|| b G(v, W) E[ b G(v, W)]||op > t) exp{ [cα(t t2) C]d} . (D.15) Published as a conference paper at ICLR 2024 Proof. We begin by recalling some expressions. Letting byℓ= σ(v g(W e Y ℓ)) as in (B.1), by (B.4), we have 2 vv b R = 1 ℓ=1 byℓ(1 byℓ)g(W e Y ℓ) 2 , 2 Wi Wj b R = 1 ℓ=1 (δijvig (Wi e Y ℓ)(eyℓ byℓ) + vivjbyℓ(1 byℓ)g (Wi e Y ℓ)g (Wj e Y ℓ)(e Y ℓ) 2 , 2 v Wj b R = 1 ℓ=1 ((eyℓ byℓ)g (Wj e Y ℓ)ej + vjbyℓ(1 byℓ)g (Wj e Y ℓ)g(W e Y ℓ)) e Y ℓ, and by (B.2) (B.3), we have ℓ=1 (eyℓ byℓ)2g(W e Y ℓ) g(W e Y ℓ) b GWi Wj = 1 ℓ=1 (eyℓ byℓ)2vivjg (Wi e Y ℓ)g (Wj e Y ℓ)e Y ℓ e Y ℓ, b Gv Wj = 1 ℓ=1 (eyℓ byℓ)2vjg (Wj e Y ℓ)g(W e Y ℓ) e Y ℓ. Recalling the data distribution from (4.1), so long as W Wc 0, almost surely, all the coordinates of WY are nonzero: as long as λ < , P( i, Wi e Y ℓ= 0) = 0 , for all W Wc 0 . (D.18) For any W Wc 0, we thus can ignore the second derivative term g (Wi e Y ℓ) in 2 Wi Wj b R, so a.s. 2 Wi Wj b R = 1 ℓ=1 (vig (Wi e Y ℓ))(vjg (Wj e Y ℓ))byℓ(1 byℓ)(e Y ℓ) 2 . (D.19) With these calculations in hand, the proof is similar to that of Theorem D.1, we will only emphasize the main differences. Let B be the ball of radius L in parameter space. Fix (v, W) B and simply write b R(v, W), b G(v, W) as b R, b G. Our goal is to prove concentration of the operator norm sup (a,u) RK (Rd)K (a, u), ( 2( b R E[ b R])(a, u) , where a = (a1, a2, , a K) and u = (u1, u2, , u K) (Rd)K. As in the proof of Theorem D.1, specifically (D.9), by an epsilon-net argument, we only need prove a concentration estimate for the following quantity F(a, u) = (a, u), 2( b R E[ b R])(a, u) , (D.20) individually per (a, u) : (a, u) = 1. The inner product on the right-hand side of (D.20) splits into (K + 1)2 terms corresponding the terms in (D.16) and (D.19): a, 2 vv b Ra = 1 ℓ=1 byℓ(1 byℓ) a, g(W e Y ℓ) 2 , ui, 2 Wi Wj b Ruj = 1 ℓ=1 (vig (Wi e Y ℓ))(vjg (Wj e Y ℓ))byℓ(1 byℓ) ui, e Y ℓ uj, e Y ℓ , (D.21) a, 2 v Wj b Ruj = 1 ℓ=1 (ajg (Wj e Y ℓ)(eyℓ byℓ) + (vjbyℓ(1 byℓ)g (Wj e Y ℓ) a, g(W e Y ℓ) ) e Y ℓ, uj . Published as a conference paper at ICLR 2024 The concentration bound for F(a, u) follows from concentration bounds for each of the above terms about their respective expected values. Each such term minus its expected value will be a sum of f M i.i.d. centered random variables. It remains to show that the summands are uniformly (in λ and BL) sub-exponential. Then the concentration of F(a, u) follows from Bernstein s inequality as in (D.13). To that end, we notice that (byℓ)(1 byℓ), eyℓare bounded by 1, and vi, vj and g are all bounded by a C(L). By the same argument as in the proof of Theorem D.1, we have that ui, e Y ℓ and uj, e Y ℓ are uniformly sub-Gaussian. Next we show that g(W e Y ℓ), a is also sub-Gaussian. In law, conditionally on the mean choice among µ, ν, we have W e Y ℓd= W( µ + Zλ) or W e Y ℓd= W( ν + Zλ). Since (a, u) B and µ = ν = 1, we have by Cauchy Schwarz that Wµ , Wν L. Then we can write g(W e Y ℓ), a = g(WZλ), a + g(W e Y ℓ) g(WZλ), a . 
By the uniform Lipschitz continuity of g, we have | g(W e Y ℓ) g(WZλ), a | a ( Wν + Wµ ) 2 . We also notice that if Z1 = Z1 g(WZλ), a = 1 i=1 aig (Wi Zλ)Wi 1 λ g a 2 W 2 2 L implying that it is a uniformly L-Lipschitz function of standard Gaussians, from which it follows that g(WZλ) is uniformly sub-Gaussian (see (Vershynin, 2018, Section 5.2.1)). For the expectation we have λ), a ]| |E[ g(0), a ]| + |E g(WZ/ λ) g(0), a | K a 2 + g E[ WZ/ Tr[WW ] C(K, L) . We conclude that g(W e Y ℓ), a is sub-Gaussian with norm bounded by O(1) (depending only on K, L.) As a consequence, each summand in (D.21) is sub-exponential, with norm uniformly bounded by O(1) (depending only on K, L.) Bernstein s inequality gives that there exists a small constant c = c(K, L) P(|F(a, u)| > t/2) exp( cf M(t t2)) . (D.22) It follows from (D.22) and the epsilon-net argument over the unit ball of (a, u) RK RKd as in (D.9), that P sup (a,u) =1 |F(a, u)| > t exp (cf M(t t2) CK2d) , which yields the concentration bound (D.14) for empirical Hessian matrix, by noticing that f M/d = α. For the proof of the concentration bound (D.15) for the empirical G-matrix, thanks to (D.17), we can define F(v) (as in (D.20)) using b G instead of 2 b R. Then we need concentration bounds for the following quantities a, b Gvva = 1 ℓ=1 (eyℓ byℓ)2 a, g(W e Y ℓ) 2Y ℓ) , ui, b GWi Wjuj = 1 ℓ=1 (eyℓ byℓ)2vivjg (Wi e Y ℓ)g (Wj e Y ℓ) ui, e Y ℓ uj, e Y ℓ , a, b Gv Wjuj = 1 ℓ=1 (eyℓ byℓ)2vjg (Wj e Y ℓ) a, g(W e Y ℓ) uj, e Y ℓ . Published as a conference paper at ICLR 2024 By the same argument as that after (D.21), each summand in (D.23) is uniformly (with constant only depending on K, L) sub-exponential. The concentration of this new F(a, u) follows from Bernstein s inequality by the same argument. Then the concentration estimates of this new F(a, u) together with a epsilon-net argument gives (D.15). E PROOFS OF MAIN THEOREMS In this section we put together the ingredients we have established in the preceding sections to deduce our main theorems, Theorems 3.1 4.2. E.1 1-LAYER NETWORK FOR MIXTURE OF k GAUSSIANS We begin with the theorems for classification of the k-GMM with a single layer network. Proof of Theorem 3.2. We prove the alignment (up to multiplicative error O(ϵ+λ 1)) of the SGD trajectory, the Hessian, and the G-matrix, with Span(µ1, ..., µk) one at a time. The results for the SGD were exactly the content of item (2) of Proposition C.1. Namely, that item tells us that the part of xc ℓorthogonal to Span(µ1, ..., µk) has norm at most O(ϵ + λ 1) while xc ℓ η O(ϵ + λ 1) η/2 for an η independent of ϵ, λ. Absorbing η 1 into the big-O, this is exactly the definition of living in a subspace per Definition 2.1. We recall from Lemma A.1, for b = c, the bc-block of the population Hessian is given by E[ 2 cb b R(x)] = X l [k] plΠl cbµ 2 l + ER cb , (E.1) where Πl cb = E[πYl(c)πYl(b)] for Yl = µl + Zλ, and where ER cb is a matrix with operator norm bounded by O(1/λ). By Lemma A.2, the aa diagonal block of the population Hessian is given by E[ 2 aa b R(x)] = X l [k] plΠl aaµ 2 l + ER aa , (E.2) where Πl aa = E[πYl(a)(1 πYl(a))] and ER aa is a matrix with norm bounded by O(1/λ). The two expressions (E.1) and (E.2), together with the concentration of the empirical Hessian matrix Theorem D.1 implies that blocks of the empirical Hessian concentrate around low rank matrices; i.e., for each x, we have 2 cb b R(x) = X 1 l k plΠl cbµ 2 l + ER cb + ( 2 cb b R(x) 2 cb E[ b R(x)]) 1 l k plΠl cbµ 2 l + b ER cb . 
E PROOFS OF MAIN THEOREMS

In this section we put together the ingredients we have established in the preceding sections to deduce our main theorems, Theorems 3.1–4.2.

E.1 1-LAYER NETWORK FOR MIXTURE OF k GAUSSIANS

We begin with the theorems for classification of the k-GMM with a single-layer network.

Proof of Theorem 3.2. We prove the alignment (up to multiplicative error $O(\epsilon + \lambda^{-1})$) of the SGD trajectory, the Hessian, and the G-matrix with $\mathrm{Span}(\mu_1,\dots,\mu_k)$, one at a time. The results for the SGD were exactly the content of item (2) of Proposition C.1. Namely, that item tells us that the part of $x^c_\ell$ orthogonal to $\mathrm{Span}(\mu_1,\dots,\mu_k)$ has norm at most $O(\epsilon + \lambda^{-1})$, while $\|x^c_\ell\| \ge \eta - O(\epsilon + \lambda^{-1}) \ge \eta/2$ for an $\eta$ independent of $\epsilon, \lambda$. Absorbing $\eta^{-1}$ into the big-$O$, this is exactly the definition of living in a subspace per Definition 2.1.

We recall from Lemma A.1 that, for $b \ne c$, the $bc$-block of the population Hessian is given by
\[
\mathbb E[\nabla^2_{cb}\hat R(x)] = \sum_{l\in[k]} p_l\,\Pi^l_{cb}\,\mu_l^{\otimes 2} + E^R_{cb}\,, \qquad (E.1)
\]
where $\Pi^l_{cb} = \mathbb E[\pi_{Y_l}(c)\pi_{Y_l}(b)]$ for $Y_l = \mu_l + Z/\sqrt\lambda$, and where $E^R_{cb}$ is a matrix with operator norm bounded by $O(1/\lambda)$. By Lemma A.2, the $aa$ diagonal block of the population Hessian is given by
\[
\mathbb E[\nabla^2_{aa}\hat R(x)] = \sum_{l\in[k]} p_l\,\Pi^l_{aa}\,\mu_l^{\otimes 2} + E^R_{aa}\,, \qquad (E.2)
\]
where $\Pi^l_{aa} = \mathbb E[\pi_{Y_l}(a)(1-\pi_{Y_l}(a))]$ and $E^R_{aa}$ is a matrix with norm bounded by $O(1/\lambda)$.

The two expressions (E.1) and (E.2), together with the concentration of the empirical Hessian matrix, Theorem D.1, imply that blocks of the empirical Hessian concentrate around low-rank matrices; i.e., for each $x$ we have
\[
\nabla^2_{cb}\hat R(x) = \sum_{1\le l\le k} p_l\,\Pi^l_{cb}\,\mu_l^{\otimes 2} + E^R_{cb} + \big(\nabla^2_{cb}\hat R(x) - \mathbb E[\nabla^2_{cb}\hat R(x)]\big) =: \sum_{1\le l\le k} p_l\,\Pi^l_{cb}\,\mu_l^{\otimes 2} + \hat E^R_{cb}\,, \qquad (E.3)
\]
where for each fixed $x$, except with probability $e^{-(c\alpha\varepsilon^2 - C)d}$, we have $\|\hat E^R_{cb}\| \le \varepsilon + \frac{1}{\lambda}$. Since the test and training data are independent of one another, this also holds at any realization of the SGD trajectory $(x_\ell)_{\ell\le T_f\delta^{-1}}$, and so long as $T_f\delta^{-1} = e^{o(d)}$, by a union bound we get
\[
\mathbb P\Big(\max_{\ell\le T_f\delta^{-1}}\Big\|\nabla^2_{cb}\hat R(x_\ell) - \sum_{l\in[k]} p_l\,\Pi^l_{cb}(x_\ell)\,\mu_l^{\otimes 2}\Big\| \ge C(\varepsilon + \lambda^{-1})\Big) = o_d(1)\,, \qquad (E.4)
\]
where we have written $\Pi^l_{cb}(x_\ell)$ to emphasize the dependence of $\Pi^l_{cb}$ on the location in parameter space. Moreover, we recall from item (1) of Proposition C.1 that there exists a constant $L$ such that for all $\lambda$ large, with probability $1 - o_d(1)$, the SGD trajectory has $\|x_\ell\| \le L$ for all $T_0\delta^{-1} \le \ell \le T_f\delta^{-1}$. It follows by the definition of $\pi$ that there exists a constant $c = c(L) > 0$ such that the coefficients in (E.1) and (E.2) are lower bounded: $p_l\Pi^l_{cb}(x_\ell),\, p_l\Pi^l_{aa}(x_\ell) \ge c$ for all $T_0\delta^{-1} \le \ell \le T_f\delta^{-1}$. Thus the first sum $\sum_{1\le l\le k} p_l\Pi^l_{cb}\mu_l^{\otimes 2}$ on the right-hand side of (E.3) is positive semi-definite, and its norm is lower bounded by $c$ (uniformly in $\epsilon, \lambda$). Together with (E.4), we conclude that the $bc$-blocks of the test Hessian $\nabla^2_{bc}\hat R(x_\ell)$ live in $\mathrm{Span}(\mu_1,\mu_2,\dots,\mu_k)$ up to error $O(\epsilon + \lambda^{-1})$. Namely, this is because they satisfy Definition 2.2 with the choice of $M = \hat E^R_{cb}$, after absorbing $c^{-1}$ into the big-$O$.

By the same argument, thanks to Lemma A.3, the $bc$-block of the population G-matrix is given by
\[
\delta_{bc}\, p_b\,\mu_b^{\otimes 2} - p_c\,\mathbb E[\pi_{Y_c}(b)]\,\mu_c^{\otimes 2} - p_b\,\mathbb E[\pi_{Y_b}(c)]\,\mu_b^{\otimes 2} + \sum_l p_l\,\mathbb E[\pi_{Y_l}(b)\pi_{Y_l}(c)]\,\mu_l^{\otimes 2} + E^G_{bc}\,, \qquad (E.5)
\]
where $E^G_{bc}$ is a matrix with norm bounded by $O(1/\lambda)$. The expression (E.5), together with the concentration of the empirical G-matrix, Theorem D.1, and a union bound over $\ell\le T_f\delta^{-1}$, implies that the blocks of the empirical G-matrix concentrate around
\[
\hat G_{bc}(x_\ell) = \delta_{bc}\, p_b\,\mu_b^{\otimes 2} - p_c\,\mathbb E[\pi_{Y_c}(b)]\,\mu_c^{\otimes 2} - p_b\,\mathbb E[\pi_{Y_b}(c)]\,\mu_b^{\otimes 2} + \sum_l p_l\,\mathbb E[\pi_{Y_l}(b)\pi_{Y_l}(c)]\,\mu_l^{\otimes 2} + \hat E^G_{bc}(x_\ell)\,, \qquad (E.6)
\]
where, except with probability $T_f\delta^{-1} e^{-(c\alpha\varepsilon^2 - C)d}$, we have $\|\hat E^G_{bc}(x_\ell)\| \le \varepsilon + \frac1\lambda$ for all $\ell\le T_f\delta^{-1}$. Namely, the analogue of (E.4) applies to $\hat G_{bc}$ about the low-rank part of the above. In order to deduce the claimed alignment of Theorem 3.2, it remains to show that each block of the matrix in (E.6) (minus $\hat E^G_{bc}$) has operator norm bounded away from zero uniformly in $\epsilon, \lambda$.

Towards this, recall that we can work on the event that $\|x_\ell\| \le L$. In the on-diagonal blocks, we can rewrite the part of (E.6) that is not $\hat E^G_{bb}$ as
\[
p_b\,\mathbb E[(1-\pi_{Y_b}(b))^2]\,\mu_b^{\otimes 2} + \sum_{l\ne b} p_l\,\mathbb E[\pi_{Y_l}(b)^2]\,\mu_l^{\otimes 2}\,. \qquad (E.7)
\]
Since $\|x_\ell\| \le L$, by Proposition C.1 the coefficients $p_b\,\mathbb E[(1-\pi_{Y_b}(b))^2]$ and $p_l\,\mathbb E[\pi_{Y_l}(b)^2]$ in (E.7) are bounded away from zero by some $c(L) > 0$ for all $(x_\ell)_{T_0\delta^{-1}\le\ell\le T_f\delta^{-1}}$; hence (E.7) is positive semi-definite with operator norm lower bounded by $c$. For $b \ne c$, we can lower bound the operator norm of the matrix
\[
- p_c\,\mathbb E[\pi_{Y_c}(b)(1-\pi_{Y_c}(c))]\,\mu_c^{\otimes 2} - p_b\,\mathbb E[\pi_{Y_b}(c)(1-\pi_{Y_b}(b))]\,\mu_b^{\otimes 2} + \sum_{l\ne b,c} p_l\,\mathbb E[\pi_{Y_l}(b)\pi_{Y_l}(c)]\,\mu_l^{\otimes 2}\,,
\]
while $\|x_\ell\| \le L$, using that if $k > 2$ then the last sum contributes some positive portion outside of $\mathrm{Span}(\mu_c,\mu_b)$, which can be used to lower bound the operator norm by some $c(L)$; and if $k = 2$, then the first two terms are the only ones, they are negative semi-definite, and they have coefficients similarly bounded away from zero by some $c(L)$.
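The way (E.3)–(E.7) get used is ultimately a perturbation statement: each block equals a positive low-rank matrix supported on $\mathrm{Span}(\mu_1,\dots,\mu_k)$, with coefficients bounded below, plus an error of operator norm $O(\varepsilon+\lambda^{-1})$, so its leading eigenspace must essentially sit inside that span. The following short Python sketch (ours, not the paper's) illustrates this stability numerically; the choices of $d$, $k$, the coefficients, the size of the perturbation, and the alignment measure (in-span mass of each top eigenvector, a stand-in for Definition 2.2) are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
d, k = 400, 5

# Orthonormal "means" mu_1,...,mu_k and a positive low-rank block
# H0 = sum_l c_l mu_l mu_l^T, mimicking the first sum in (E.3).
mus, _ = np.linalg.qr(rng.standard_normal((d, k)))
coeffs = rng.uniform(0.5, 1.0, size=k)               # bounded away from zero
H0 = (mus * coeffs) @ mus.T

# Symmetric perturbation with small operator norm, mimicking the error block.
E = rng.standard_normal((d, d)); E = (E + E.T) / 2.0
E *= 0.05 / np.linalg.norm(E, ord=2)                 # ||E||_op = 0.05
H = H0 + E

# Fraction of each of the top-k eigenvectors of H lying in Span(mu_1,...,mu_k).
evals, evecs = np.linalg.eigh(H)
top = evecs[:, -k:]                                  # top-k eigenvectors
proj = mus @ (mus.T @ top)                           # projection onto the span
print("top-k eigenvalues:", np.round(evals[-k:], 3))
print("in-span mass of each top eigenvector:", np.round(np.linalg.norm(proj, axis=0)**2, 3))

With an order-one gap between the low-rank coefficients and the perturbation norm, the printed in-span masses come out close to 1, mirroring the conclusion of the proof.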
Proof of Theorem 3.1. The theorem follows from Theorem 3.2 together with the observation that the matrices in (E.1) and (E.5) that are not the error portions $E^R_{cb}$ and $E^G_{bc}$, respectively, are of rank $k$, since they are sums of $k$ rank-one matrices and, as explained in the above proof, each of their nonzero eigenvalues is bounded away from zero uniformly in $\epsilon, \lambda$ (in a manner depending only on $L$).

Proof of Theorem 3.3. The claims regarding the SGD are proved in Proposition C.14. It remains to prove alignment of the top eigenvectors of the $cc$-blocks of the Hessian and G-matrices with $\mu_c$. Recall from (E.3) the decomposition of the empirical Hessian matrix
\[
\nabla^2_{aa}\hat R(x) =: \sum_{1\le l\le k} p_l\,\Pi^l_{aa}\,\mu_l^{\otimes 2} + \hat E^R_{aa}\,, \qquad (E.9)
\]
where $\Pi^l_{aa} = \mathbb E[\pi_{Y_l}(a)(1-\pi_{Y_l}(a))]$, and except with probability $e^{-(c\alpha\varepsilon^2 - C)d}$ we have $\|\hat E^R_{aa}\| \le \varepsilon + \frac1\lambda$. By our assumption that $\mu_1,\mu_2,\dots,\mu_k$ are orthonormal, the first part of the decomposition (E.9) can be viewed as an orthogonal (spectral) decomposition, and $\nabla^2_{aa}\hat R(x)$ is a perturbation of $\sum_{1\le l\le k} p_l\Pi^l_{aa}\mu_l^{\otimes 2}$ by $\hat E^R_{aa}$. In turn, by the same reasoning as in Lemma C.11, $\Pi^l_{aa} = \bar\Pi^l_{aa} + O(\lambda^{-1})$ where
\[
\bar\Pi^l_{aa} = \mathbb E[\bar\pi_l(a)(1-\bar\pi_l(a))]\,,
\]
for $\bar\pi_l(a) = e^{m_{al}}/\sum_b e^{m_{bl}}$ as in (C.16). By Lemma C.19, there exists $c(\beta) > 0$ such that for all $(x_\ell)_{T_0\delta^{-1}\le\ell\le T_f\delta^{-1}}$, all the coefficients $p_l\Pi^l_{aa}$ in (E.9) are lower bounded: $p_l\Pi^l_{aa} \ge c$. Thus $\sum_{1\le l\le k} p_l\Pi^l_{aa}\mu_l^{\otimes 2}$ has $k$ positive eigenvalues, $\{p_l\Pi^l_{aa}\}_{1\le l\le k}$, each of them lower bounded by $c$, with associated eigenvectors $\{\mu_l\}_{1\le l\le k}$.

We furthermore claim that the eigenvalue corresponding to $\mu_a^{\otimes 2}$ is separated from the others uniformly in $\epsilon, \lambda$. This follows from the fact, derived in Proposition C.14, that $x_\ell$ is within $O(\epsilon + \lambda^{-1})$ distance of a point at which $\bar\pi_b(a) = \frac{1}{k-1}(1 - \bar\pi_c(c))$ as long as $a \ne b$. This implies that $\bar\pi_l(a)$ for $l \ne a$ is closer to 0 than $\bar\pi_a(a)$ is to 1, whence $\bar\pi_a(a)(1-\bar\pi_a(a)) > \bar\pi_l(a)(1-\bar\pi_l(a))$ by an amount that is uniform in $\epsilon, \lambda$. This ensures that along $(x_\ell)$ the largest eigenvalue of $\sum_l p_l\Pi^l_{aa}\mu_l^{\otimes 2}$ is the one with eigenvector $\mu_a$, and the next $k-1$ are those corresponding to $(\mu_l)_{l\ne a}$. Altogether, by eigenvector stability, the top eigenvector of $\nabla^2_{aa}\hat R(x_\ell)$ lives in $\mathrm{Span}(\mu_a)$ up to error $O(\epsilon + \lambda^{-1})$, and the next $k-1$ all live in $\mathrm{Span}((\mu_l)_{l\ne a})$ up to error $O(\varepsilon + 1/\lambda)$, for all $\ell\in[T_0\delta^{-1}, T_f\delta^{-1}]$.

The statement for the empirical G-matrix is established similarly by using (E.6) as input. The part that differs from the above is to see that its top eigenvector in the $aa$-block is the one approximating $\mu_a$ and its next $k-1$ are those approximating $(\mu_l)_{l\ne a}$. For that, recall the expression (E.7) and use that, since $1 - \bar\pi_a(a) > \bar\pi_l(a)$, the coefficient of $\mu_a^{\otimes 2}$ in the $aa$-block is strictly larger than the coefficient of $\mu_l^{\otimes 2}$ for $l \ne a$.
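To make the separation used in the proof above concrete, here is a small worked instance (the numerical values are ours, purely for illustration): take $k = 3$ and suppose the trajectory is near a point with $\bar\pi_a(a) = 0.85$, so that $\bar\pi_l(a) = (1-0.85)/(k-1) = 0.075$ for $l\ne a$. Then
\[
\bar\pi_a(a)\big(1-\bar\pi_a(a)\big) = 0.85\times 0.15 = 0.1275\,, \qquad \bar\pi_l(a)\big(1-\bar\pi_l(a)\big) = 0.075\times 0.925 \approx 0.069\,,
\]
so the eigenvalue attached to $\mu_a$ exceeds the others by a fixed margin that does not degrade as $\epsilon\to 0$ or $\lambda\to\infty$; similarly, for the G-matrix coefficients in (E.7), $(1-\bar\pi_a(a))^2 = 0.0225 > 0.075^2 \approx 0.0056$.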
E.2 2-LAYER NETWORK FOR XOR GAUSSIAN MIXTURE

We now turn to proving our main theorems for the XOR Gaussian mixture model with a 2-layer network of width $K$.

Proof of Theorem 4.2. We prove the alignment individually for each of the SGD, the Hessian matrix, and the G-matrix. For the SGD trajectory, the claimed alignment was exactly the content of item (2) in Proposition C.2. Letting $F$ denote the cdf of a standard Gaussian, thanks to Lemma B.4,
\[
\mathbb E[\nabla^2_{vv} L] = \frac14\sum_{\vartheta\in\{\pm\mu,\pm\nu\}}\sigma'(v\cdot g(W\vartheta))\, g(W\vartheta)^{\otimes 2} + E^R_{vv}\,, \qquad (E.10)
\]
\[
\mathbb E[\nabla^2_{W_iW_i} L] = \frac{v_i^2}{4}\sum_{\vartheta\in\{\pm\mu,\pm\nu\}} F\big(m_i^\vartheta\sqrt{\lambda/R_{ii}}\big)\,\vartheta^{\otimes 2} + E^R_{W_iW_i}\,, \qquad (E.11)
\]
where the two error matrices satisfy $\|E^R_{vv}\|, \|E^R_{W_iW_i}\| = O(1/\sqrt\lambda)$.

The two expressions (E.10) and (E.11), together with the concentration of the empirical Hessian matrix, Theorem D.2, imply that for every $L$ and every fixed $x = (v,W)$ with $W\in\mathcal W_0^c$, the blocks of the empirical Hessian matrix satisfy
\[
\nabla^2_{vv}\hat R(v,W) = \frac14\sum_{\vartheta\in\{\pm\mu,\pm\nu\}}\sigma'(v\cdot g(W\vartheta))\, g(W\vartheta)^{\otimes 2} + \hat E^R_{vv}\,, \qquad (E.12)
\]
\[
\nabla^2_{W_iW_i}\hat R(v,W) = \frac{v_i^2}{4}\sum_{\vartheta\in\{\pm\mu,\pm\nu\}} F\big(m_i^\vartheta\sqrt{\lambda/R_{ii}}\big)\,\vartheta^{\otimes 2} + \hat E^R_{W_iW_i}\,, \qquad (E.13)
\]
where, except with probability $e^{-(c\alpha\varepsilon^2 - C)d}$, we have $\|\hat E^R_{vv}(v,W)\|_{op}, \|\hat E^R_{W_iW_i}(v,W)\|_{op} \le \varepsilon + \frac{1}{\sqrt\lambda}$. Using independence of the test and training data, recalling from item (1) of Proposition C.2 that the SGD stays confined to a ball of radius $L$ in parameter space for all $\ell\le T_f\delta^{-1}$, and from item (2) of Proposition C.2 that $\|W_i\| > 0$ for all $i$ and all $T_0\delta^{-1}\le\ell\le T_f\delta^{-1}$, taking a union bound over $\ell\le T_f\delta^{-1}$ we get, for $\alpha > \alpha_0$,
\[
\mathbb P\Big(\max_{T_0\delta^{-1}\le\ell\le T_f\delta^{-1}}\Big\|\nabla^2_{vv}\hat R(x_\ell) - \frac14\sum_{\vartheta\in\{\pm\mu,\pm\nu\}}\sigma'(v\cdot g(W\vartheta))\, g(W\vartheta)^{\otimes 2}\Big\|_{op} \ge C(\varepsilon+\lambda^{-1/2})\Big) = o(1)\,, \qquad (E.15)
\]
\[
\mathbb P\Big(\max_{T_0\delta^{-1}\le\ell\le T_f\delta^{-1}}\Big\|\nabla^2_{W_iW_i}\hat R(x_\ell) - \frac{v_i^2}{4}\sum_{\vartheta\in\{\pm\mu,\pm\nu\}} F\big(m_i^\vartheta\sqrt{\lambda/R_{ii}}\big)\,\vartheta^{\otimes 2}\Big\|_{op} \ge C(\varepsilon+\lambda^{-1/2})\Big) = o(1)\,, \qquad (E.16)
\]
where, in the quantity that the blocks of the empirical Hessian are being compared to, $v, W$ are evaluated along the SGD trajectory $x_\ell$.

It remains to show that the low-rank matrices in (E.15)–(E.16) have operator norm uniformly bounded away from zero to deduce the alignment of the empirical test Hessian with the claimed vectors. By item (2) of Proposition C.2, there exists some constant $c > 0$, uniform in $\lambda$, such that with probability $1 - o_d(1)$, for all $\ell\le T_f\delta^{-1}$ the SGD $x_\ell$ is such that
\[
\max_{\vartheta\in\{\pm\mu,\pm\nu\}}\sigma'(v\cdot g(W\vartheta))\,\|g(W\vartheta)\| = \max_{\vartheta\in\{\pm\mu,\pm\nu\}}\sigma'\Big(\sum_i v_i\, g(m_i^\vartheta)\Big)\,\|g(W\vartheta)\| > c\,. \qquad (E.17)
\]
Thus the deterministic matrix in (E.15), $\frac14\sum_{\vartheta\in\{\pm\mu,\pm\nu\}}\sigma'(v\cdot g(W\vartheta))\, g(W\vartheta)^{\otimes 2}$, is positive semi-definite and its norm is lower bounded away from 0 for all $(x_\ell)_{\ell\le T_f\delta^{-1}}$. Together with (E.15), we conclude that the second-layer test Hessian $\nabla^2_{vv}\hat R(x_\ell)$ lives in $\mathrm{Span}\big((g(W(x_\ell)\vartheta))_{\vartheta\in\{\pm\mu,\pm\nu\}}\big)$. For (E.16), by item (2) of Proposition C.2, there exists $c > 0$ (independent of $\epsilon, \lambda$) such that with probability $1 - o_d(1)$, for every $i$, it must be the case that
\[
\max_{\vartheta\in\{\pm\mu,\pm\nu\}}\frac{v_i^2}{4}\, F\big(m_i^\vartheta\sqrt{\lambda/R_{ii}}\big) > c \qquad\text{for all } (x_\ell)_{\ell\le T_f\delta^{-1}}\,. \qquad (E.18)
\]
Then the deterministic matrix in (E.16) is positive semi-definite and has norm uniformly lower bounded away from 0 for all $x_\ell$ with $\ell\le T_f\delta^{-1}$. Together with (E.16), we conclude that the first-layer test Hessian $\nabla^2_{W_iW_i}\hat R(x_\ell)$ lives in $\mathrm{Span}(\mu,\nu)$.

By the same argument, thanks to Lemma B.5, together with the concentration of the empirical G-matrix from Theorem D.2, the blocks of the empirical G-matrix concentrate around low-rank matrices
\[
\hat G_{vv}(v,W) = \frac14\sum_{\vartheta\in\{\pm\mu,\pm\nu\}}\big(\sigma(v\cdot g(W\vartheta)) - y^\vartheta\big)^2\, g(W\vartheta)^{\otimes 2} + \hat E^G_{vv}\,, \qquad (E.19)
\]
\[
\hat G_{W_iW_i}(v,W) = v_i^2\, A + \hat E^G_{W_iW_i}\,,
\]
where, recalling (B.18),
\[
A = \sum_{\vartheta\in\{\pm\mu,\pm\nu\}}\big(y^\vartheta - \sigma(v\cdot g(W\vartheta))\big)^2\, F\big(m_i^\vartheta\sqrt{\lambda/R_{ii}}\big)\,\vartheta^{\otimes 2}\,, \qquad (E.20)
\]
and for every $x = (v,W)$ of norm at most $L$, except with probability $e^{-(c\alpha\varepsilon^2 - C)d}$, we have $\|\hat E^G_{vv}\|, \|\hat E^G_{W_iW_i}\| \le \varepsilon + \frac{1}{\sqrt\lambda}$. Union bounding over the $T_f\delta^{-1}$ points along the trajectory of the SGD, as in the lead-up to (E.15)–(E.16), we get
\[
\mathbb P\Big(\max_{\ell\le T_f\delta^{-1}}\Big\|\hat G_{vv}(x_\ell) - \frac14\sum_{\vartheta\in\{\pm\mu,\pm\nu\}}\big(\sigma(v\cdot g(W\vartheta)) - y^\vartheta\big)^2 g(W\vartheta)^{\otimes 2}\Big\|_{op} \ge C(\varepsilon+\lambda^{-1/2})\Big) = o(1)\,, \qquad (E.21)
\]
\[
\mathbb P\Big(\max_{\ell\le T_f\delta^{-1}}\big\|\hat G_{W_iW_i}(x_\ell) - v_i^2 A\big\|_{op} \ge C(\varepsilon+\lambda^{-1/2})\Big) = o(1)\,, \qquad (E.22)
\]
where, in the quantity that the blocks of the empirical G-matrix are being compared to, $v, W$ are evaluated along the SGD trajectory $x_\ell$.

We recall from item (1) of Proposition C.2 that the SGD stays inside the $\ell_2$-ball of radius $L$ for all time. Thus for $x = x_\ell$, the coefficients $(\sigma(v\cdot g(W\vartheta)) - y^\vartheta)^2$, $\sigma(v\cdot g(W\vartheta))^2$, $(1-\sigma(v\cdot g(W\vartheta)))^2$ in (E.19) and (E.20) are lower bounded away from 0. By the same argument as for the test Hessian matrix, we conclude that the second-layer test G-matrix $\hat G_{vv}(x_\ell)$ lives in $\mathrm{Span}\big((g(W(x_\ell)\vartheta))_{\vartheta\in\{\pm\mu,\pm\nu\}}\big)$ and its first layer lives in $\mathrm{Span}(\mu,\nu)$, up to error $O(\epsilon + \lambda^{-1/2})$.
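As a numerical illustration of the second-layer conclusion just reached, the following Python sketch (ours, not the paper's) assembles $\hat G_{vv}$ directly from the per-sample formula for $\hat G_{vv}$ recalled in the proof of Theorem D.2 at a "trained-looking" parameter, and measures how much of each of its top four eigenvectors lies in $\mathrm{Span}(g(W\vartheta),\ \vartheta\in\{\pm\mu,\pm\nu\})$. The means, labels, the particular $(v,W)$, and the values of $K$, $\alpha$, $\lambda$ are illustrative assumptions; with $\lambda$ large the printed in-span mass should be close to 1, up to the $O(\varepsilon+\lambda^{-1/2})$ error appearing in (E.21).

import numpy as np

rng = np.random.default_rng(2)
d, K, alpha, lam = 400, 8, 10.0, 50.0
M = int(alpha * d)

mu = np.zeros(d); mu[0] = 1.0                     # illustrative orthonormal means
nu = np.zeros(d); nu[1] = 1.0
means = np.stack([mu, -mu, nu, -nu])
labels = np.array([1.0, 1.0, 0.0, 0.0])           # XOR-type reading: +/-mu -> class 1, +/-nu -> class 0

# A "trained-looking" first layer: rows with O(1) overlap on mu and nu, and a
# small second layer so that the per-sample weights stay bounded away from 0.
W = (rng.standard_normal((K, 2)) @ np.stack([mu, nu])
     + 0.1 * rng.standard_normal((K, d)) / np.sqrt(d))
v = 0.1 * rng.standard_normal(K)

idx = rng.integers(0, 4, size=M)
Y = means[idx] + rng.standard_normal((M, d)) / np.sqrt(lam)
y = labels[idx]

G = np.maximum(W @ Y.T, 0.0)                      # g(W Ytil), shape (K, M)
yhat = 1.0 / (1.0 + np.exp(-(v @ G)))             # sigmoid outputs
w = (y - yhat) ** 2                               # per-sample weight in Gvv
Gvv = (G * w) @ G.T / M                           # (1/M) sum (ytil - yhat)^2 g(W Ytil)^{otimes 2}

feats = np.maximum(W @ means.T, 0.0)              # the four vectors g(W theta), shape (K, 4)
Q, _ = np.linalg.qr(feats)                        # orthonormal basis of their span
evals, evecs = np.linalg.eigh(Gvv)
mass = np.linalg.norm(Q.T @ evecs[:, -4:], axis=0) ** 2
print("top-4 eigenvalues of Gvv:", np.round(evals[-4:], 4))
print("in-span mass of top-4 eigenvectors:", np.round(mass, 3))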
Proof of Theorem 4.1. The first-layer alignment follows from Theorem 4.2 together with the observation that the part of (E.13) besides $\hat E^R_{W_iW_i}$ is a rank-2 matrix with eigenvectors $\mu, \nu$, with both eigenvalues bounded away from zero uniformly in $(\epsilon, \lambda)$; this latter fact comes from the observation that the sum of the two cdfs $F(a) + F(-a) = 1$ always, and $v_i$ being bounded away from zero per part (2) of Proposition C.2. A similar argument ensures the same for the G-matrix's first-layer top two eigenvectors per (E.19)–(E.20), together with a uniform lower bound on the sigmoid function over all possible parameters in a ball of radius $L$, as guaranteed by part (1) of Proposition C.2.

For the alignment of the second layer, it follows from Theorem 4.2 together with the following. First observe that the part of (E.12) that is not $\hat E^R_{vv}$ is a rank-4 matrix whose range is spanned by $(g(W\vartheta))_{\vartheta\in\{\pm\mu,\pm\nu\}}$. To see that all of the corresponding eigenvalues are uniformly (in $\epsilon, \lambda$) bounded away from zero, use the uniform lower bound on the coefficients $\sigma'(v\cdot g(W\vartheta))$ while the parameters are in a ball of radius $L$ about the origin, as promised by part (1) of Proposition C.2, together with the uniform lower bound, for each $i$, on one of $(W_i\cdot\vartheta)_{\vartheta\in\{\pm\mu,\pm\nu\}}$, which holds for the SGD after $T_0(\epsilon)$ per the proof of part (2) of Proposition C.2.
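The observation about (E.13) used above — that $\sum_{\vartheta\in\{\pm\mu,\pm\nu\}}F\big(m_i^\vartheta\sqrt{\lambda/R_{ii}}\big)\,\vartheta^{\otimes 2}$ is rank 2 with eigenvectors $\mu,\nu$, and that $F(a)+F(-a)=1$ pins both eigenvalues before the $v_i^2/4$ prefactor — is elementary, but here is a minimal Python check; the dimension and the two argument values 1.3 and 0.4 are made up for illustration.

import numpy as np
from math import erf, sqrt

def F(x):                                   # standard Gaussian cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

d = 200
mu = np.zeros(d); mu[0] = 1.0               # illustrative orthonormal mu, nu
nu = np.zeros(d); nu[1] = 1.0

# The arguments for theta and -theta come in +/- pairs, here with made-up
# values 1.3 for +/-mu and 0.4 for +/-nu.
pairs = [(1.3, mu), (-1.3, -mu), (0.4, nu), (-0.4, -nu)]
Mtx = sum(F(a) * np.outer(th, th) for a, th in pairs)

evals = np.linalg.eigvalsh(Mtx)
print("two largest eigenvalues:", np.round(evals[-2:], 6))   # both equal F(a) + F(-a) = 1
print("rest of the spectrum is zero:", np.allclose(evals[:-2], 0.0))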
F ADDITIONAL FIGURES

F.1 ADDITIONAL FIGURES FOR k-GMM

In this section, we include more figures depicting the spectral transitions for the k-GMM model.

Figure F.1: $x^1$ and $x^2$ at initial and final training times. Initially they are random, and then they correlate (and anti-correlate) as per Theorem 3.2. Parameters are the same as in Figure 3.1.

Figure F.2: Further illustration of the dynamical BBP transition from Fig. 3.3. Here we see that there are initially two components of the spectrum; then, as time evolves, the top eigenvalue and the next 9 separate from each other.

Figure F.3: $x^1$ and $x^2$ at initial and final training times. Initially they are random, and then they correlate (and anti-correlate) as per Theorem 3.2. Parameters are the same as in Figure 3.1.

F.2 TRAINING DATA

We include here a few plots of the preceding discussion run with training data, as opposed to test data. In the following, the plots are generated using the Hessian of the empirical risk computed over the entire training data set, and similarly the G-matrix is generated using the entire training data set, as opposed to the test data as in the earlier plots.

F.2.1 k-COMPONENT MIXTURE MODEL

Figure F.4: Parameters are the same as in Fig. 3.1.

Figure F.5: Training-data analogue of Fig. 3.3. Left: the eigenvalues of $\nabla^2_{11}\hat R(x_\ell)$ over the course of training (but with training data). The leading $k$ eigenvalues are separated from the bulk at all times, and the top eigenvalue, corresponding to $\mu_1$, separates from the remaining eigenvalues soon after initialization. Right: the inner product of $x^1_\ell$ with the means $\mu_1,\dots,\mu_k$ undergoes a similar separation over the course of training. Parameters are the same as in the preceding figures.

F.2.2 XOR MIXTURE MODEL

Figure F.6: Training-data analogue of Fig. 4.1. (a) and (b) depict the alignment of the first-layer weights $W_i(x_\ell)$ with the principal subspaces of the corresponding blocks of the Hessian and G-matrices formed using training data, i.e., with $E_2(\nabla^2_{W_iW_i}\hat R(x_\ell))$ and $E_2(\hat G_{W_iW_i}(x_\ell))$. (c) and (d) plot the second-layer alignment, namely of $v(x_\ell)$ with $E_4(\nabla^2_{vv}\hat R(x_\ell))$ and $E_4(\hat G_{vv}(x_\ell))$. Parameters are $d = 500$, $\lambda = 10$, and $K = 20$.

Figure F.7: Training-data analogue of Fig. 4.2. The eigenvalues of the $vv$-blocks of the Hessian and G-matrices over time from a random initialization. Along training, the top four eigenvalues separate from the bulk, i.e., four outlier eigenvalues emerge, corresponding to the four hidden classes in the XOR problem. Parameters are the same as in Figure 4.1.