# Directional convergence and alignment in deep learning

Ziwei Ji, Matus Telgarsky
{ziweiji2,mjt}@illinois.edu
University of Illinois, Urbana-Champaign

In this paper, we show that although the minimizers of cross-entropy and related classification losses are off at infinity, network weights learned by gradient flow converge in direction, with an immediate corollary that network predictions, training errors, and the margin distribution also converge. This proof holds for deep homogeneous networks, a broad class of networks allowing for ReLU, max-pooling, linear, and convolutional layers, and we additionally provide empirical support not just close to the theory (e.g., AlexNet), but also on non-homogeneous networks (e.g., DenseNet). If the network further has locally Lipschitz gradients, we show that these gradients also converge in direction, and asymptotically align with the gradient flow path, with consequences on margin maximization, convergence of saliency maps, and a few other settings. Our analysis complements and is distinct from the well-known neural tangent and mean-field theories, and in particular makes no requirements on network width and initialization, instead merely requiring perfect classification accuracy. The proof proceeds by developing a theory of unbounded nonsmooth Kurdyka-Łojasiewicz inequalities for functions definable in an o-minimal structure, and is also applicable outside deep learning.

1 Introduction

Recent efforts to rigorously analyze the optimization of deep networks have yielded many exciting developments, for instance the neural tangent [Jacot et al., 2018, Du et al., 2018, Allen-Zhu et al., 2018, Zou et al., 2018] and mean-field perspectives [Mei et al., 2019, Chizat and Bach, 2018]. In these works, it is shown that small training or even testing error is possible for wide networks.

The above theories, with finite width networks, usually require the weights to stay close to initialization in certain norms. By contrast, practitioners run their optimization methods as long as their computational budget allows [Shallue et al., 2018], and if the data can be perfectly classified, the parameters are guaranteed to diverge in norm to infinity [Lyu and Li, 2019]. This raises a worry that the prediction surface can continually change during training; indeed, even on simple data, as in Figure 1, the prediction surface continues to change after perfect classification is achieved, and even with large width is not close to the maximum margin predictor from the neural tangent regime. If the prediction surface never stops changing, then the generalization behavior, adversarial stability, and other crucial properties of the predictor could also be unstable.

In this paper, we resolve this worry by guaranteeing stable convergence behavior of deep networks as training proceeds, despite this growth of weight vectors to infinity. Concretely:

1. Directional convergence: the parameters converge in direction, which suffices to guarantee convergence of many other relevant quantities, such as the prediction margins.
2. Alignment: when gradients exist, they converge in direction to the parameters, which implies various margin maximization results and saliency map convergence, to name a few.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

(a) Shallow NTK max margin. (b) Shallow net, early training. (c) Shallow net, late training.
Figure 1: Prediction surface of a shallow network on simple synthetic data with blue negative examples (−) and red positive examples (+), trained via gradient descent. Figure 1a shows the prediction surface reached by freezing activations, which is also the prediction surface of the corresponding Neural Tangent Kernel (NTK) maximum margin predictor [Soudry et al., 2017]. Figure 1b shows the same network, but now without frozen activations, at the first moment with perfect classification. Training this network much longer converges to Figure 1c.

1.1 First result: directional convergence

We show that the network parameters $W_t$ converge in direction, meaning the normalized iterates $W_t/\|W_t\|$ converge. Details are deferred to Section 3, but here is a brief overview. Our networks are $L$-positively homogeneous in the parameters, meaning scaling the parameters by $c > 0$ scales the predictions by $c^L$, and definable in some o-minimal structure, a mild technical assumption which we will describe momentarily. Our networks can be arbitrarily deep with many common types of layers (e.g., linear, convolution, ReLU, and max-pooling layers), but homogeneity rules out some components such as skip connections and biases, which all satisfy definability. We consider binary classification with either the logistic loss $\ell_{\log}(z) := \ln(1 + e^{-z})$ (binary cross-entropy) or the exponential loss $\ell_{\exp}(z) := e^{-z}$, and a standard gradient flow (infinitesimal gradient descent) for non-differentiable non-convex functions via the Clarke subdifferential. We start from an initial risk smaller than $1/n$, where $n$ denotes the number of data samples; in this way, our analysis handles the late phase of training, and can be applied after some other analysis guarantees risk below $1/n$. Under these conditions, we prove the following result, without any other assumptions about the distribution of the parameters or the width of the network (cf. Theorem 3.1):

The curve swept by $W_t/\|W_t\|$ has finite length, and thus $W_t/\|W_t\|$ converges.

Our main corollary is that prediction margins converge (cf. Corollary 3.2), meaning convergence of the normalized per-example values $y_i \Phi(x_i; W_t)/\|W_t\|^L$, where $y_i$ is the label and $\Phi(x_i; W_t)$ is the prediction on example $x_i$. These quantities are central in the study of generalization of deep networks, and their stability also implies stability of many other useful quantities [Bartlett et al., 2017, Jiang et al., 2019, 2020]. As an illustration of directional convergence and margin convergence, we plot the margin values for all examples in the standard cifar data against training iterations in Figure 2; these trajectories exhibit strong convergence behavior, both within our theory (a modified homogeneous AlexNet, as in Figure 2a), and outside of it (DenseNet, as in Figure 2b).

Directional convergence is often assumed throughout the literature [Gunasekar et al., 2018a, Chizat and Bach, 2020], but has only been established for linear predictors [Soudry et al., 2017]. It is tricky to prove because it may still be false for highly smooth functions: for instance, the homogeneous Mexican hat function satisfies all our assumptions except definability, and can be adjusted to have an arbitrary order of continuous derivatives, but its gradient flow does not converge in direction; instead it spirals [Lyu and Li, 2019].
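To make the tracked quantities concrete, here is a minimal PyTorch-style sketch (not the authors' experimental code) that monitors the direction $W_t/\|W_t\|$ and the normalized margins $y_i\Phi(x_i;W_t)/\|W_t\|^L$ during plain gradient descent; the synthetic data, width, step size, and initialization scales are illustrative placeholders, and the network is bias-free so that $L = 2$.

```python
# Sketch only: monitor directional convergence and normalized margins for a
# bias-free two-layer ReLU network (L = 2); all hyperparameters are placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, m, L_hom, lr = 32, 2, 512, 2, 1e-2
X = torch.randn(n, d)
y = torch.sign(X[:, 0] + X[:, 1])                  # linearly separable toy labels

W1 = (torch.randn(m, d) / d ** 0.5).requires_grad_()
W2 = (torch.randn(m) / m ** 0.5).requires_grad_()
params = [W1, W2]

def predict(X):
    # Phi(x; W) = sum_j W2_j * relu(W1_j . x): 2-positively homogeneous, no biases.
    return torch.relu(X @ W1.t()) @ W2

prev_dir = None
for step in range(20001):
    loss = F.softplus(-(y * predict(X))).sum()     # sum_i log(1 + exp(-p_i(W)))
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad
            p.grad = None
        W = torch.cat([p.reshape(-1) for p in params])
        direction = W / W.norm()
        if step % 2000 == 0:
            margins = y * predict(X) / W.norm() ** L_hom
            move = float('nan') if prev_dir is None else (direction - prev_dir).norm().item()
            print(step, loss.item(), move, margins.min().item())
        prev_dir = direction
```

In experiments like Figure 2, the printed directional movement shrinks and the normalized margins stabilize once the data are perfectly classified, which is exactly the behavior guaranteed by Theorem 3.1 and Corollary 3.2.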
To deal with similar pathologies in many branches of mathematics, the notion of functions definable in some o-minimal structure was developed: these are rich classes of functions built up to limit oscillations and other bad behavior. Using techniques from this literature, we build general tools, in particular unbounded nonsmooth Kurdyka-Łojasiewicz inequalities, which allow us to prove directional convergence, and may also be useful outside deep learning. More discussion of the o-minimal literature is given in Section 1.3, technical preliminaries are introduced in Section 2, and a proof overview is given in Section 3, with full details in the appendices.

(a) Margins while training H-AlexNet. (b) Margins while training DenseNet.

Figure 2: The margins of all examples in cifar, plotted against time, or rather optimization accuracy $\ln(n/\mathcal{L}(W_t))$ to remove the effect of step size and other implementation coincidences. Figure 2a shows H-AlexNet, a homogeneous version of AlexNet as described in the main text [Krizhevsky et al., 2012], which is handled by our theory. Figure 2b shows a standard DenseNet [Huang et al., 2017], which does not fit the theory in this work due to skip connections and biases, but still exhibits convergence of margins, thus suggesting a tantalizing open problem.

1.2 Second result: gradient alignment

Our second contribution, in Section 4, is that if the network has locally Lipschitz gradients, then these gradients also converge, and are aligned to the gradient flow path (cf. Theorem 4.1).

The gradient flow path, and the gradient of the risk along the path, converge to the same direction.

As a practical consequence of this, recall the use of gradients within the interpretability literature, specifically in saliency maps [Adebayo et al., 2018]: if gradients do not converge in direction, then saliency maps can change regardless of the number of iterations used to produce them. As a theoretical consequence, directional convergence and alignment imply margin maximization in a variety of situations: this holds in the deep linear case, strengthening prior work [Gunasekar et al., 2018b, Ji and Telgarsky, 2018a], and in the 2-homogeneous network case, with an assumption taken from the infinite width setting [Chizat and Bach, 2020], but presented here with finite width.

1.3 Further related work

Our analysis is heavily inspired and influenced by the work of Lyu and Li [2019], who studied margin maximization of homogeneous networks, establishing monotonicity of a smoothed margin, a quantity we also use. However, they did not prove directional convergence, and instead must work with subsequences. Their work also left open alignment and global margin maximization.

Directional convergence. A standard approach to resolve directional convergence and similar questions is to establish that the objective function in question is definable in some o-minimal structure, which, as mentioned before, limits oscillations and other complicated behavior. This literature cannot be directly applied to our setting, owing to a combination of nonsmooth layers like the ReLU and max-pooling, and the exponential function used in the cross-entropy loss; as a result, our proofs need to rebuild many o-minimal results from the ground up.

In more detail, an important problem in the o-minimal literature is the gradient conjecture of René Thom: it asks when the existence of $\lim_{t\to\infty} W_t = z$ further implies that $\lim_{t\to\infty} (W_t - z)/\|W_t - z\|$ exists, and it was established in various definable scenarios by Kurdyka et al.
[2000a, 2006] via related Kurdyka-Łojasiewicz inequalities [Kurdyka, 1998]. The underlying proof ideas can also be used to analyze $\lim_{t\to\infty} W_t/\|W_t\|$ when the weights go to infinity [Grandjean, 2007]. However, the prior results require the objective function to be either real analytic, or definable in a polynomially-bounded o-minimal structure. The first case causes the aforementioned nonsmoothness issue, and excludes many common layers in deep learning such as the ReLU and max-pooling. The second case excludes the exponential function, and means the logistic and cross-entropy losses cannot be handled. To resolve these issues, we had to redo large portions of the o-minimality theory, such as the nonsmooth unbounded Kurdyka-Łojasiewicz inequalities that can handle the exponential/logistic loss, as presented in Section 3.

Alignment. As discussed in Section 4, alignment implies the gradient flow reaches a stationary point of the limiting margin maximization objective, and therefore is related to various statements and results throughout the literature on implicit bias and margin maximization [Soudry et al., 2017, Ji and Telgarsky, 2018b]. This stationary point perspective also appears in some nonlinear works, for instance in the aforementioned work on margins by Lyu and Li [2019], which showed that subsequences of the gradient flow converge to such stationary points; in addition to fully handling the gradient flow, the present work also differs in that alignment is in general a stronger notion, in that it is unclear how to prove alignment as a consequence of convergence to KKT points. Additionally, alignment can still hold when the objective function is not definable and directional convergence is false, for example on the homogeneous Mexican hat function, which cannot be handled by the approach in [Lyu and Li, 2019, Appendix J]. As a final pointer to the literature, many implicit bias works explicitly assume directional convergence and some version of alignment [Gunasekar et al., 2018b, Chizat and Bach, 2020], but these works neither indicate a possible proof, nor provide conclusive evidence.

1.4 Experimental overview

The experiments in Figures 1 and 2 are performed in as standard a way as possible to highlight that directional convergence is a reliable property; full details are in Appendix A. Briefly, Figure 1 uses synthetic data and vanilla gradient descent (no momentum, no weight decay, etc.) on a 10,000-node-wide 2-layer squared-ReLU network and its Neural Tangent Kernel classifier; by using the squared ReLU, both our directional convergence and our alignment results apply. Figure 2 uses standard cifar, firstly with a modified homogeneous AlexNet and secondly with an unmodified DenseNet, respectively inside and outside our assumptions. SGD was used on cifar due to the training set size, and seeing how directional convergence still seems to occur suggests another open problem.

2 Preliminaries and assumptions

In this section, we first introduce the notions of Clarke subdifferentials and o-minimal structures, and then use these notions to describe the network model, gradient flow, and Assumptions 2.1 and 2.2. Throughout this paper, $\|\cdot\|$ denotes the $\ell_2$ (Frobenius) norm, and $\|\cdot\|_\sigma$ denotes the spectral norm.

Locally Lipschitz functions and Clarke subdifferentials. Consider a function $f : D \to \mathbb{R}$ with $D$ open. We say that $f$ is locally Lipschitz if for any $x \in D$, there exists a neighborhood $U$ of $x$ such that $f|_U$ is Lipschitz continuous. We say that $f$ is $\mathcal{C}^1$ if $f$ is continuously differentiable on $D$. If $f$ is locally Lipschitz, then $f$ is differentiable almost everywhere [Borwein and Lewis, 2000, Theorem 9.1.2]. The Clarke subdifferential of $f$ at $x \in D$ is defined as
$$\partial f(x) := \operatorname{conv}\left\{ \lim_{i\to\infty} \nabla f(x_i) \,\middle|\, x_i \in D,\ \nabla f(x_i) \text{ exists},\ \lim_{i\to\infty} x_i = x \right\},$$
which is nonempty, convex, and compact [Clarke, 1975], and if $f$ is continuously differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$. Vectors in $\partial f(x)$ are called subgradients, and we let $\bar\partial f(x)$ denote the unique minimum-norm subgradient:
$$\bar\partial f(x) := \arg\min_{x' \in \partial f(x)} \|x'\|.$$
In the following analysis, we use $\bar\partial f$ in many places that seem to call on $\nabla f$.
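As a concrete (and standard) example of these definitions, not specific to this paper: for the ReLU $f(z) := \max\{0, z\}$ on $\mathbb{R}$, which is locally Lipschitz but not $\mathcal{C}^1$,
$$\partial f(0) = \operatorname{conv}\{0, 1\} = [0, 1], \qquad \bar\partial f(0) = \arg\min_{z' \in [0,1]} |z'| = 0,$$
while $\partial f(z) = \{1\}$ for $z > 0$ and $\partial f(z) = \{0\}$ for $z < 0$; in particular, the minimum-norm subgradient selects $0$ at the kink.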
O-minimal structures and definable functions. Formally, an o-minimal structure is a collection $\mathcal{S} = \{\mathcal{S}_n\}_{n=1}^\infty$, where $\mathcal{S}_n$ is a set of subsets of $\mathbb{R}^n$ which includes all algebraic sets and is closed under finite union/intersection and complement, Cartesian product, and projection, and $\mathcal{S}_1$ consists of finite unions of open intervals and points. A set $A \subseteq \mathbb{R}^n$ is definable if $A \in \mathcal{S}_n$, and a function $f : D \to \mathbb{R}^m$ with $D \subseteq \mathbb{R}^n$ is definable if its graph is in $\mathcal{S}_{n+m}$. More details are given in Appendix B.

Many natural functions and operations are definable. First of all, definability of functions is stable under algebraic operations, composition, inverse, maximum and minimum, etc. Moreover, Wilkie [1996] proved that there exists an o-minimal structure where polynomials and the exponential function are definable. Consequently, definability allows many common layer types in deep learning, such as fully-connected/convolutional/ReLU/max-pooling layers, skip connections, the cross-entropy loss, etc.; moreover, they can be composed arbitrarily. As will be discussed later, what is still missing is the handling of the gradient flow on such functions.

The network model. Consider a dataset $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ are features and $y_i \in \{-1, +1\}$ are binary labels, and a predictor $\Phi(\,\cdot\,; W) : \mathbb{R}^d \to \mathbb{R}$ with parameters $W \in \mathbb{R}^k$. We make the following assumption on the predictor $\Phi$.

Assumption 2.1. For any fixed $x$, the prediction $W \mapsto \Phi(x; W)$ as a function of $W$ is locally Lipschitz, $L$-positively homogeneous for some $L > 0$, and definable in some o-minimal structure including the exponential function.

As mentioned before, homogeneity means that $\Phi(x; cW) = c^L \Phi(x; W)$ for any $c \geq 0$. This means, for instance, that linear, convolutional, ReLU, and max-pooling layers are permitted, but not skip connections and biases. Homogeneity is used heavily throughout the theoretical study of deep networks [Lyu and Li, 2019].

Given a decreasing loss function $\ell$, the total loss (or unnormalized empirical risk) is given by
$$\mathcal{L}(W) := \sum_{i=1}^n \ell\big(y_i \Phi(x_i; W)\big) = \sum_{i=1}^n \ell\big(p_i(W)\big),$$
where $p_i(W) := y_i \Phi(x_i; W)$ are also locally Lipschitz, $L$-positively homogeneous, and definable under Assumption 2.1. We consider the exponential loss $\ell_{\exp}(z) := e^{-z}$ and the logistic loss $\ell_{\log}(z) := \ln(1 + e^{-z})$, in which case $\mathcal{L}$ is also locally Lipschitz and definable.

Gradient flow. As in [Davis et al., 2020, Lyu and Li, 2019], a curve $z$ from an interval $I$ to some real space $\mathbb{R}^m$ is called an arc if it is absolutely continuous on any compact subinterval of $I$. An arc is differentiable almost everywhere, and the composition of an arc and a locally Lipschitz function is still an arc. We consider a gradient flow $W : [0, \infty) \to \mathbb{R}^k$ that is an arc and satisfies
$$\frac{\mathrm{d}W_t}{\mathrm{d}t} \in -\partial \mathcal{L}(W_t), \quad \text{for a.e. } t \geq 0. \tag{1}$$

Our second assumption is on the initial risk, and appears in prior work [Lyu and Li, 2019].

Assumption 2.2. The initial iterate $W_0$ satisfies $\mathcal{L}(W_0) < \ell(0)$.

As mentioned before, this assumption encapsulates our focus on the late training phase; some other analysis, for instance via the neural tangent kernel, can be first applied to ensure $\mathcal{L}(W_0) < \ell(0)$.
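As a quick sanity check of the two assumptions (a sketch, not part of the paper's development; the tiny bias-free network, data, and initialization are illustrative placeholders), one can verify $L$-positive homogeneity numerically and test whether an initialization already satisfies $\mathcal{L}(W_0) < \ell(0)$:

```python
# Sketch: numerically check Assumptions 2.1 (homogeneity) and 2.2 for a small
# bias-free two-layer ReLU network, so L = 2; everything here is a placeholder.
import math
import torch

torch.manual_seed(0)
n, d, m = 16, 3, 32
X, y = torch.randn(n, d), torch.sign(torch.randn(n))
W1, W2 = torch.randn(m, d), torch.randn(m)

def Phi(X, W1, W2):
    # Bias-free, no skip connections: locally Lipschitz and 2-positively homogeneous.
    return torch.relu(X @ W1.t()) @ W2

# Assumption 2.1 (homogeneity): Phi(x; cW) = c^L Phi(x; W) with L = 2 here.
c = 2.5
print(torch.allclose(Phi(X, c * W1, c * W2), c ** 2 * Phi(X, W1, W2), atol=1e-4))

# Assumption 2.2 with the logistic loss: total loss below ell(0) = ln 2 forces every
# p_i(W_0) = y_i Phi(x_i; W_0) to be positive, i.e. perfect classification at time 0.
loss0 = torch.nn.functional.softplus(-y * Phi(X, W1, W2)).sum()
print(loss0.item(), loss0.item() < math.log(2.0))
```

A random initialization typically fails the second check; in practice Assumption 2.2 is reached only after an initial phase of training.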
3 Directional convergence

We now turn to stating our main result on directional convergence and sketching its analysis. As Assumptions 2.1 and 2.2 imply $\|W_t\| \to \infty$ [Lyu and Li, 2019], we study the normalized flow $\widetilde{W}_t := W_t/\|W_t\|$, whose convergence is a formal way of studying the directional convergence of $W_t$. As mentioned before, directional convergence is false in general [Lyu and Li, 2019], but definability suffices to ensure it. Throughout, for general nonzero $W$, we will use $\widetilde{W} := W/\|W\|$.

Theorem 3.1. Under Assumptions 2.1 and 2.2, for $\ell_{\exp}$ and $\ell_{\log}$, the curve swept by $\widetilde{W}_t$ has finite length, and thus $\widetilde{W}_t$ converges.

A direct consequence of Theorem 3.1 is the convergence of the margin distribution (i.e., the normalized outputs). Due to homogeneity, for any nonzero $W$, we have $p_i(W)/\|W\|^L = p_i(\widetilde{W})$, and thus the next result follows from Theorem 3.1.

Corollary 3.2. Under Assumptions 2.1 and 2.2, for $\ell_{\exp}$ and $\ell_{\log}$, it holds that $p_i(W_t)/\|W_t\|^L$ converges for all $1 \leq i \leq n$.

Next we give a proof sketch of Theorem 3.1; the full proofs of the Kurdyka-Łojasiewicz inequalities (Lemmas 3.5 and 3.6) are given in Appendix B.3, while the other proofs are given in Appendix C.

3.1 A proof sketch of Theorem 3.1

The smoothed margin introduced in [Lyu and Li, 2019] is crucial in our analysis: given $W \neq 0$, let
$$\alpha(W) := \ell^{-1}\big(\mathcal{L}(W)\big), \qquad \text{and} \qquad \tilde\alpha(W) := \frac{\alpha(W)}{\|W\|^L}.$$
For simplicity, let $\tilde\alpha_t$ denote $\tilde\alpha(W_t)$, and $\zeta_t$ denote the length of the path swept by $\widetilde{W}_t = W_t/\|W_t\|$ from time $0$ to $t$. Lyu and Li [2019] proved that $\tilde\alpha_t$ is nondecreasing with some limit $a \in (0, \infty)$, and $\|W_t\| \to \infty$.

We invoke a standard but sophisticated tool from the definability literature to aid in proving that the total path length $\lim_{t\to\infty}\zeta_t$ is finite: formally, a function $\Psi : [0, \nu) \to \mathbb{R}$ is called a desingularizing function when $\Psi$ is continuous on $[0, \nu)$ with $\Psi(0) = 0$, and continuously differentiable on $(0, \nu)$ with $\Psi' > 0$; in words, a desingularizing function is a witness to the fact that the flow is asymptotically well-behaved. As we will sketch after stating the lemma, this immediately leads to a proof of Theorem 3.1.

Lemma 3.3. There exist $R > 0$, $\nu > 0$ and a definable desingularizing function $\Psi$ on $[0, \nu)$, such that for a.e. large enough $t$ with $\|W_t\| > R$ and $\tilde\alpha_t > a - \nu$, it holds that
$$\frac{\mathrm{d}\zeta_t}{\mathrm{d}t} \leq -c\, \frac{\mathrm{d}\Psi(a - \tilde\alpha_t)}{\mathrm{d}t} \quad \text{for some constant } c > 0.$$

To prove Theorem 3.1 from here, let $t_0$ be large enough so that the conditions of Lemma 3.3 hold for all $t \geq t_0$: then we have $\lim_{t\to\infty} \zeta_t \leq \zeta_{t_0} + c\,\Psi(a - \tilde\alpha_{t_0}) < \infty$, and thus the path length is finite.

Below we sketch the proof of Lemma 3.3, which is based on a careful comparison of $\mathrm{d}\tilde\alpha_t/\mathrm{d}t$ and $\mathrm{d}\zeta_t/\mathrm{d}t$. The proof might be hard to parse due to the extensive use of $\bar\partial$, the minimum-norm Clarke subgradient; at a first reading, the condition of local Lipschitz continuity can just be replaced with continuous differentiability, in which case the Clarke subgradient is just the normal gradient.

Given any function $f$ which is locally Lipschitz around a nonzero $W$, let
$$\partial_r f(W) := \big\langle \bar\partial f(W), \widetilde{W} \big\rangle\, \widetilde{W} \qquad \text{and} \qquad \partial_\perp f(W) := \bar\partial f(W) - \partial_r f(W)$$
denote the radial and spherical parts of $\bar\partial f(W)$, respectively.
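For illustration, the decomposition just defined is elementary to compute; the following small NumPy sketch (not from the paper, with illustrative names) applies it to an ordinary gradient $g$ at a nonzero $W$.

```python
# Sketch: radial/spherical decomposition of a gradient g at a nonzero W.
import numpy as np

def radial_spherical(g, W):
    W_dir = W / np.linalg.norm(W)          # tilde(W) = W / ||W||
    g_r = (g @ W_dir) * W_dir              # radial part: component along tilde(W)
    g_perp = g - g_r                       # spherical part: tangent to the sphere
    return g_r, g_perp

# Example: for an L-homogeneous f, Euler's theorem gives <grad f(W), W> = L f(W);
# e.g. f(W) = ||W||^2 (L = 2) has grad f(W) = 2W, which is purely radial.
W = np.array([3.0, 4.0])
g_r, g_perp = radial_spherical(2 * W, W)
print(g_r, g_perp)                          # g_perp is (numerically) zero
```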
First note the following technical characterization of $\mathrm{d}\tilde\alpha_t/\mathrm{d}t$ and $\mathrm{d}\zeta_t/\mathrm{d}t$ using the radial and spherical components of the relevant Clarke subgradients.

Lemma 3.4. It holds for a.e. $t \geq 0$ that
$$\frac{\mathrm{d}\tilde\alpha_t}{\mathrm{d}t} = \big\|\partial_r \tilde\alpha(W_t)\big\|\,\big\|\partial_r \mathcal{L}(W_t)\big\| + \big\|\partial_\perp \tilde\alpha(W_t)\big\|\,\big\|\partial_\perp \mathcal{L}(W_t)\big\|, \qquad \text{and} \qquad \frac{\mathrm{d}\zeta_t}{\mathrm{d}t} = \frac{\big\|\partial_\perp \mathcal{L}(W_t)\big\|}{\|W_t\|}.$$

For simplicity, in the discussion here we consider the case that all subgradients in Lemma 3.4 are nonzero, with the general case handled in the full proofs in the appendices. Then Lemma 3.4 implies
$$\frac{\mathrm{d}\tilde\alpha_t}{\mathrm{d}\zeta_t} = \frac{\mathrm{d}\tilde\alpha_t/\mathrm{d}t}{\mathrm{d}\zeta_t/\mathrm{d}t} = \|W_t\| \left( \frac{\|\partial_r \mathcal{L}(W_t)\|}{\|\partial_\perp \mathcal{L}(W_t)\|}\, \big\|\partial_r \tilde\alpha(W_t)\big\| + \big\|\partial_\perp \tilde\alpha(W_t)\big\| \right). \tag{2}$$

As in [Kurdyka et al., 2006, Grandjean, 2007], to bound eq. (2), we further consider two cases depending on the ratio $\|\partial_\perp \tilde\alpha(W_t)\| / \|\partial_r \tilde\alpha(W_t)\|$. If $\|\partial_\perp \tilde\alpha(W_t)\| / \|\partial_r \tilde\alpha(W_t)\| \geq c_1 \|W_t\|^{-L/3}$ for some constant $c_1 > 0$, then Lemma 3.3 follows from $\mathrm{d}\tilde\alpha_t/\mathrm{d}\zeta_t \geq \|W_t\|\, \|\partial_\perp \tilde\alpha(W_t)\|$ as given by eq. (2), and the following Kurdyka-Łojasiewicz inequality. Its proof is based on the proof idea of [Kurdyka et al., 2006, Proposition 6.3], but further handles the unbounded and nonsmooth setting.

Lemma 3.5. Given a locally Lipschitz definable function $f$ with an open domain $D \subseteq \{x \mid \|x\| > 1\}$, for any $c, \eta > 0$, there exist $\nu > 0$ and a definable desingularizing function $\Psi$ on $[0, \nu)$ such that
$$\Psi'\big(f(x)\big)\, \|x\|\, \big\|\partial_\perp f(x)\big\| \geq 1, \quad \text{if } f(x) \in (0, \nu) \text{ and } \big\|\partial_\perp f(x)\big\| \geq c\, \|x\|^{-\eta}\, \big\|\partial_r f(x)\big\|.$$

On the other hand, if $\|\partial_\perp \tilde\alpha(W_t)\| / \|\partial_r \tilde\alpha(W_t)\| \leq c_1 \|W_t\|^{-L/3}$, then a careful calculation (using Lemmas C.2 to C.4) can show that for some constants $c_2, c_3 > 0$,
$$\frac{\|\partial_r \mathcal{L}(W_t)\|}{\|\partial_\perp \mathcal{L}(W_t)\|} \geq c_2\, \|W_t\|^{2L/3}, \qquad \text{and} \qquad \big\|\partial_r \tilde\alpha(W_t)\big\| \geq c_3\, \|W_t\|^{-L/3}\, \big\|\bar\partial \tilde\alpha(W_t)\big\|.$$
It then follows from eq. (2) that $\mathrm{d}\tilde\alpha_t/\mathrm{d}\zeta_t \geq c_2 c_3\, \|W_t\|^{1+L/3}\, \|\bar\partial \tilde\alpha(W_t)\|$. In this case we give the following Kurdyka-Łojasiewicz inequality, which implies Lemma 3.3.

Lemma 3.6. Given a locally Lipschitz definable function $f$ with an open domain $D \subseteq \{x \mid \|x\| > 1\}$, for any $\lambda > 0$, there exist $\nu > 0$ and a definable desingularizing function $\Psi$ on $[0, \nu)$ such that
$$\Psi'\big(f(x)\big)\, \|x\|^{1+\lambda}\, \big\|\bar\partial f(x)\big\| \geq 1, \quad \text{if } f(x) \in (0, \nu).$$

4 Alignment between the gradient flow path and gradients

Theorem 3.1 gave our directional convergence result, namely that the normalized iterate $W_t/\|W_t\|$ converges to some direction. Next we show and discuss our alignment result: if all $p_i$ have locally Lipschitz gradients, then along the gradient flow path, $-\nabla\mathcal{L}(W_t)$ converges to the same direction as $W_t$.

Theorem 4.1. Under Assumptions 2.1 and 2.2, if all $p_i$ further have locally Lipschitz gradients, then $-\nabla\mathcal{L}(W_t)$ and $W_t$ converge to the same direction, meaning the angle between $W_t$ and $-\nabla\mathcal{L}(W_t)$ converges to zero. If all $p_i$ are twice continuously differentiable, then the same result holds without the definability condition (cf. Assumption 2.1).

Below we first sketch the proof of Theorem 4.1, with full details in Appendix D, and then in Section 4.2 present a few global margin maximization consequences, which are proved in Appendix E.

4.1 A proof sketch of Theorem 4.1

Recall that $\lim_{t\to\infty} \alpha(W_t)/\|W_t\|^L = a$. The first observation is that $\alpha(W_t)$, the smoothed margin, asymptotes to the exact margin $\min_{1 \leq i \leq n} p_i(W_t)$, which is $L$-positively homogeneous. Therefore $\alpha$ is asymptotically $L$-positively homogeneous, and formally we can show
$$\lim_{t\to\infty} \left\langle \frac{\nabla\alpha(W_t)}{\|W_t\|^{L-1}}, \frac{W_t}{\|W_t\|} \right\rangle = aL, \tag{3}$$
which can be viewed as an asymptotic version of Euler's homogeneous function theorem (cf. Lemma C.1). Consequently, the inner product between $\nabla\alpha(W_t)/\|W_t\|^{L-1}$ and $\widetilde{W}_t$ converges.

Let $\theta_t$ denote the angle between $W_t$ and $-\nabla\mathcal{L}(W_t)$, which is also the angle between $W_t$ and $\nabla\alpha(W_t)$, since $\nabla\mathcal{L}(W_t)$ and $\nabla\alpha(W_t)$ point in opposite directions by the chain rule. By [Lyu and Li, 2019, Corollary C.10], given any $\epsilon > 0$, there exists a time $t_\epsilon$ such that $\theta_{t_\epsilon} < \epsilon$. The question is whether such a small angle can be maintained after $t_\epsilon$. This is not obvious since, as mentioned above, the smoothed margin $\alpha(W_t)$ asymptotes to the exact margin $\min_{1 \leq i \leq n} p_i(W_t)$, which may be nondifferentiable even with smooth $p_i$, due to the nondifferentiability of the minimum.
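For intuition, with the exponential loss the smoothed margin has a closed form (a standard computation, consistent with the definitions above) that makes this asymptotic relationship concrete:
$$\alpha(W) = \ell_{\exp}^{-1}\big(\mathcal{L}(W)\big) = -\ln \sum_{i=1}^n e^{-p_i(W)}, \qquad \min_i p_i(W) - \ln n \;\leq\; \alpha(W) \;\leq\; \min_i p_i(W),$$
so after dividing by $\|W_t\|^L \to \infty$, the smoothed margin and the exact margin share the same limit $a$, even though the minimum itself is not differentiable.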
Consequently, the exact margin may have discontinuous Clarke subdifferentials, and since the smoothed margin asymptotes to it, it is unclear whether $\theta_t \to 0$. (This point was foreshadowed earlier, where it was pointed out that alignment is not a clear consequence of convergence to stationary points of the margin maximization objective.)

To handle this, the key to our analysis is the potential function $\mathcal{J}(W) := \|\nabla\alpha(W)\|^2 / \|W\|^{2L-2}$. Suppose at time $t$ it holds that $\big\langle \nabla\alpha(W_t)/\|W_t\|^{L-1}, \widetilde{W}_t \big\rangle$ is close to $aL$, and $\theta_t$ is very small. If $\theta_t$ becomes large again at some $t' > t$, it must follow that $\mathcal{J}(W_{t'})$ is much larger than $\mathcal{J}(W_t)$. We prove that this is impossible, by showing that
$$\lim_{t\to\infty} \int_t^\infty \max\left\{ \frac{\mathrm{d}\,\mathcal{J}(W_\tau)}{\mathrm{d}\tau},\ 0 \right\} \mathrm{d}\tau = 0, \tag{4}$$
and thus Theorem 4.1 follows. The proof of eq. (4) is motivated by the dual convergence analysis in [Ji and Telgarsky, 2019], and also uses the positive homogeneity of $\nabla p_i$ and $\nabla^2 p_i$ (which exist a.e.).

4.2 Main alignment consequence: margin maximization

A variety of (global) margin maximization results are immediate consequences of directional convergence and alignment. This subsection investigates two examples: deep linear networks, and shallow squared-ReLU networks.

Deep linear networks predict with $\Phi(x_i; W) = A_L \cdots A_1 x_i$, where the parameters $W = (A_L, \ldots, A_1)$ are organized into $L$ matrices. This setting has been considered in the literature, but the original work assumed directional convergence, alignment, and a condition on the support vectors [Gunasekar et al., 2018b]; a follow-up dropped the directional convergence and alignment assumptions, but instead assumed the support vectors span the space $\mathbb{R}^d$ [Ji and Telgarsky, 2018a]. Here, we not only drop all the aforementioned assumptions, but moreover include a proof rather than an assumption of directional convergence.

Proposition 4.2. Suppose $W_t = (A_L(t), \ldots, A_1(t))$ and $\mathcal{L}(W_0) < \ell(0)$. Then a unique linear maximum margin predictor $\bar u := \arg\max_{\|u\| \leq 1} \min_i y_i x_i^\top u$ exists, and there exist unit vectors $(v_L, \ldots, v_1, v_0)$ with $v_L = 1$ and $v_0 = \bar u$ such that
$$\lim_{t\to\infty} \frac{A_j(t)}{\|A_j(t)\|} = v_j v_{j-1}^\top \qquad \text{and} \qquad \lim_{t\to\infty} \frac{A_L(t) \cdots A_1(t)}{\|A_L(t) \cdots A_1(t)\|} = \bar u^\top.$$

Thanks to directional convergence and alignment (cf. Theorems 3.1 and 4.1), the proof boils down to writing down the gradient expression for each layer and doing some algebra.

A more interesting example is a certain 2-homogeneous case, which despite its simplicity is a universal approximator; this setting was studied by Chizat and Bach [2020], who considered the infinite width case, and established margin maximization under assumptions of directional convergence and gradient convergence. Unfortunately, it is not clear if Theorems 3.1 and 4.1 can be applied to fill these assumptions, since they do not handle infinite width, and indeed it is not clear if infinite width networks or close relatives are definable in an o-minimal structure. Instead, here we consider the finite width case, albeit with an additional assumption.

Following [Chizat and Bach, 2020, S-ReLU], organize $W_t$ into $m$ rows $(w_j(t))_{j=1}^m$, with normalizations $\theta_j(t) := w_j(t)/\|w_j(t)\|$, where $\theta_j(t) = 0$ when $w_j(t) = 0$, and consider
$$\Phi(x_i; W) := \sum_j (-1)^j \max\{0, w_j^\top x_i\}^2 \qquad \text{and} \qquad \phi_{ij}(w) := y_i (-1)^j \max\{0, w^\top x_i\}^2, \tag{5}$$
whereby $p_i(W) = \sum_j \phi_{ij}(w_j)$, and $\Phi$, $p_i$, and $\phi_{ij}$ are all 2-homogeneous and definable. (The $(-1)^j$ may seem odd, but it is an easy trick to get universal approximation without outer weights.) A minimal code sketch of this model is given below.
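The following is a sketch of the 2-homogeneous squared-ReLU model in eq. (5) in PyTorch-style code; the width, data, and labels are placeholders, and this is an illustration rather than the paper's experimental setup.

```python
# Sketch of the squared-ReLU model in eq. (5): no outer weights, no biases.
import torch

def phi(X, W):
    # Phi(x; W) = sum_j (-1)^j max{0, w_j^T x}^2
    signs = torch.tensor([(-1.0) ** j for j in range(W.shape[0])], dtype=X.dtype)
    return (torch.relu(X @ W.t()) ** 2) @ signs

torch.manual_seed(0)
m, d, n = 64, 2, 8
W = torch.randn(m, d)
X, y = torch.randn(n, d), torch.sign(torch.randn(n))

p = y * phi(X, W)                 # p_i(W) = y_i Phi(x_i; W), 2-positively homogeneous
print(p / W.norm() ** 2)          # normalized margins p_i(W) / ||W||^2
```

These normalized margins $p_i(W_t)/\|W_t\|^2$ are exactly the quantities characterized in Proposition 4.3 below.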
Proposition 4.3. Consider the setting in eq. (5), along with $\mathcal{L}(W_0) < \ell(0)$ and $\|x_i\| \leq 1$.

1. (Local guarantee.) The vector $s(t) \in \mathbb{R}^m$ with $s_j(t) := \|w_j(t)\|^2 / \|W_t\|^2$ satisfies $s(t) \to \bar s \in \Delta_m$ (the probability simplex on $m$ vertices), and $\theta_j(t) \to \bar\theta_j$ with $\bar\theta_j = 0$ if $\bar s_j = 0$, and
$$a = \lim_{t\to\infty} \min_i \frac{p_i(W_t)}{\|W_t\|^2} = \lim_{t\to\infty} \min_i \sum_j s_j(t)\, \phi_{ij}\big(\theta_j(t)\big) = \max_{s \in \Delta_m} \min_i \sum_j s_j\, \phi_{ij}(\bar\theta_j).$$

2. (Global guarantee.) Suppose the covering condition holds: there exist $t_0$ and $\epsilon > 0$ with
$$\max_j \big\|\theta_j(t_0) - \bar\theta_j\big\|_2 \leq \epsilon, \qquad \text{and} \qquad \max_{\theta \in S^{d-1}}\, \max\left\{ \min_{2 \mid j} \big\|\theta_j(t_0) - \theta\big\|,\ \min_{2 \nmid j} \big\|\theta_j(t_0) - \theta\big\| \right\} \leq \epsilon,$$
where $S^{d-1} := \{\theta \in \mathbb{R}^d : \|\theta\| = 1\}$. Then margins are approximately (globally) maximized:
$$\lim_{t\to\infty} \min_i \frac{p_i(W_t)}{\|W_t\|^2} \;\geq\; \max_{\nu \in \mathcal{P}(S^{d-1})} \min_i\, y_i \int \max\{0, x_i^\top \theta\}^2 \,\mathrm{d}\nu(\theta) \;-\; 4\epsilon,$$
where $\mathcal{P}(S^{d-1})$ is the set of signed measures on $S^{d-1}$ with mass at most $1$.

The first part (the "local guarantee") characterizes the limiting margin as the maximum margin of a linear problem obtained by taking the limiting directions $(\bar\theta_j)_{j=1}^m$ and treating the resulting $\phi_{ij}(\bar\theta_j)$ as features. The quality of this margin is bad if the limiting directions are bad, and therefore we secondly (the "global guarantee") consider a case where our margin is nearly as good as the infinite width global max margin value as defined by [Chizat and Bach, 2020, eq. (5)]; see the discussion therein for a justification of this choice, and for calling it the globally maximal margin.

The covering condition deserves further discussion. In the infinite width setting, it holds for all $\epsilon > 0$ assuming directional convergence [Chizat and Bach, 2020, Proof of Theorem D.1], but it cannot hold in such generality here, as we are dealing with finite width. Similar properties have appeared throughout the literature: Wei et al. [2018, Section 3] explicitly re-initialized network nodes to guarantee a good covering, and more generally Ge et al. [2015] added noise to escape saddle points in general optimization problems.

5 Concluding remarks and open problems

In this paper, we established that the normalized parameter vectors $W_t/\|W_t\|$ converge, and that under an additional assumption of locally Lipschitz gradients, the gradients also converge and align with the parameters. There are many promising avenues for future work based on these results. One basic line is to weaken our assumptions: dropping homogeneity to allow for DenseNet and ResNet, and analyzing finite-time methods like (stochastic) gradient descent, and moreover their rates of convergence. We also handled only the binary classification case; however, our tools should directly allow for the cross-entropy loss. Another direction is into further global margin maximization results, beyond the simple networks in Section 4.2, and into related generalization consequences of directional convergence and alignment.

Broader impact

This paper constitutes theoretical work, with an aim of enhancing human understanding, and laying the groundwork for further theoretical and applied work. The authors hope that advancing the foundations of deep networks leads moreover to a better understanding of their failure modes, and manipulation thereof, and thus an increase in safety.

Acknowledgments and disclosure of funding

The authors thank Zhiyuan Li and Kaifeng Lyu for lively discussions during an early phase of the project. The authors are grateful for support from the NSF under grant IIS-1750051, and from NVIDIA under a GPU grant.

References

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In NIPS, 2018. arXiv:1810.03292 [cs.CV].
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
Jérôme Bolte, Aris Daniilidis, Adrian Lewis, and Masahiro Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.
Jonathan Borwein and Adrian Lewis. Convex Analysis and Nonlinear Optimization. Springer Publishing Company, Incorporated, 2000.
Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In NIPS, 2018. arXiv:1805.09545 [math.OC].
Lénaïc Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv preprint arXiv:2002.04486, 2020.
Frank H. Clarke. Generalized gradients and applications. Transactions of the American Mathematical Society, 205:247–262, 1975.
Frank H. Clarke. Optimization and Nonsmooth Analysis. SIAM Classics in Applied Mathematics, 1983.
Michel Coste. An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali, Pisa, 2000.
Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D. Lee. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 20(1):119–154, 2020.
Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In COLT, 2015. arXiv:1503.02101 [cs.LG].
V. Grandjean. On the limit set at infinity of a gradient trajectory of a semialgebraic function. Journal of Differential Equations, 233(1):22–41, 2007.
Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018a.
Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018b.
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017. arXiv:1608.06993v5 [cs.CV].
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. arXiv preprint arXiv:1810.02032, 2018a.
Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018b.
Ziwei Ji and Matus Telgarsky. A refined primal-dual analysis of the implicit bias. arXiv preprint arXiv:1906.04540, 2019.
Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the generalization gap in deep networks with margin distributions. In ICLR, 2019. arXiv:1810.00113 [stat.ML].
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In ICLR, 2020. arXiv:1912.02178 [cs.LG].
Alex Krizhevsky. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. In Annales de l'institut Fourier, volume 48, pages 769–783, 1998.
Krzysztof Kurdyka, Tadeusz Mostowski, and Adam Parusiński. Proof of the gradient conjecture of R. Thom. Annals of Mathematics, 152(3):763–792, 2000a.
Krzysztof Kurdyka, Patrice Orro, and Stéphane Simon. Semialgebraic Sard theorem for generalized critical values. Journal of Differential Geometry, 56(1):67–92, 2000b.
Krzysztof Kurdyka, Adam Parusiński, et al. Quasi-convex decomposition in o-minimal structures. Application to the gradient conjecture. In Singularity Theory and its Applications, pages 137–177. Mathematical Society of Japan, 2006.
Ta Lê Loi. Lecture 1: O-minimal structures. In The Japanese-Australian Workshop on Real and Complex Singularities: JARCS III, pages 19–30. Centre for Mathematics and its Applications, Mathematical Sciences Institute, The Australian National University, 2010.
Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. 2019. arXiv:1902.06015 [stat.ML].
András Némethi and Alexandru Zaharia. Milnor fibration at infinity. Indagationes Mathematicae, 3(3):323–335, 1992.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. 2018. arXiv:1811.03600 [cs.LG].
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
Lou Van den Dries and Chris Miller. Geometric categories and o-minimal structures. Duke Mathematical Journal, 84(2):497–540, 1996.
Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs. their induced kernel. arXiv preprint arXiv:1810.05369, 2018.
Alex J. Wilkie. Model completeness results for expansions of the ordered field of real numbers by restricted Pfaffian functions and the exponential function. Journal of the American Mathematical Society, 9(4):1051–1094, 1996.
Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.