# Directional convergence and alignment in deep learning

Ziwei Ji, Matus Telgarsky
{ziweiji2,mjt}@illinois.edu
University of Illinois, Urbana-Champaign

In this paper, we show that although the minimizers of cross-entropy and related classification losses are off at infinity, network weights learned by gradient flow converge in direction, with an immediate corollary that network predictions, training errors, and the margin distribution also converge. This proof holds for deep homogeneous networks, a broad class of networks allowing for ReLU, max-pooling, linear, and convolutional layers, and we additionally provide empirical support not just close to the theory (e.g., AlexNet), but also on non-homogeneous networks (e.g., DenseNet). If the network further has locally Lipschitz gradients, we show that these gradients also converge in direction, and asymptotically align with the gradient flow path, with consequences on margin maximization, convergence of saliency maps, and a few other settings. Our analysis complements and is distinct from the well-known neural tangent and mean-field theories, and in particular makes no requirements on network width and initialization, instead merely requiring perfect classification accuracy. The proof proceeds by developing a theory of unbounded nonsmooth Kurdyka-Łojasiewicz inequalities for functions definable in an o-minimal structure, and is also applicable outside deep learning.

1 Introduction

Recent efforts to rigorously analyze the optimization of deep networks have yielded many exciting developments, for instance the neural tangent [Jacot et al., 2018, Du et al., 2018, Allen-Zhu et al., 2018, Zou et al., 2018] and mean-field perspectives [Mei et al., 2019, Chizat and Bach, 2018]. In these works, it is shown that small training or even testing error is possible for wide networks.

The above theories, with finite width networks, usually require the weights to stay close to initialization in certain norms. By contrast, practitioners run their optimization methods as long as their computational budget allows [Shallue et al., 2018], and if the data can be perfectly classified, the parameters are guaranteed to diverge in norm to infinity [Lyu and Li, 2019]. This raises a worry that the prediction surface can continually change during training; indeed, even on simple data, as in Figure 1, the prediction surface continues to change after perfect classification is achieved, and even with large width is not close to the maximum margin predictor from the neural tangent regime. If the prediction surface never stops changing, then the generalization behavior, adversarial stability, and other crucial properties of the predictor could also be unstable.

In this paper, we resolve this worry by guaranteeing stable convergence behavior of deep networks as training proceeds, despite this growth of weight vectors to infinity. Concretely:

1. Directional convergence: the parameters converge in direction, which suffices to guarantee convergence of many other relevant quantities, such as the prediction margins.
2. Alignment: when gradients exist, they converge in direction to the parameters, which implies various margin maximization results and saliency map convergence, to name a few.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

(a) Shallow NTK max margin. (b) Shallow net, early training. (c) Shallow net, late training.
Figure 1: Prediction surface of a shallow network on simple synthetic data with blue negative examples (−) and red positive examples (+), trained via gradient descent. Figure 1a shows the prediction surface reached by freezing activations, which is also the prediction surface of the corresponding Neural Tangent Kernel (NTK) maximum margin predictor [Soudry et al., 2017]. Figure 1b shows the same network, but now without frozen activations, at the first moment with perfect classification. Training this network much longer converges to Figure 1c.

1.1 First result: directional convergence

We show that the network parameters $W_t$ converge in direction, meaning the normalized iterates $W_t/\|W_t\|$ converge. Details are deferred to Section 3, but here is a brief overview. Our networks are $L$-positively homogeneous in the parameters, meaning scaling the parameters by $c > 0$ scales the predictions by $c^L$, and definable in some o-minimal structure, a mild technical assumption which we will describe momentarily. Our networks can be arbitrarily deep with many common types of layers (e.g., linear, convolution, ReLU, and max-pooling layers), but homogeneity rules out some components such as skip connections and biases, which all satisfy definability. We consider binary classification with either the logistic loss $\ell_{\log}(z) := \ln(1 + e^{-z})$ (binary cross-entropy) or the exponential loss $\ell_{\exp}(z) := e^{-z}$, and a standard gradient flow (infinitesimal gradient descent) for non-differentiable non-convex functions via the Clarke subdifferential. We start from an initial risk smaller than $1/n$, where $n$ denotes the number of data samples; in this way, our analysis handles the late phase of training, and can be applied after some other analysis guarantees risk below $1/n$. Under these conditions, we prove the following result, without any other assumptions about the distribution of the parameters or the width of the network (cf. Theorem 3.1):

The curve swept by $W_t/\|W_t\|$ has finite length, and thus $W_t/\|W_t\|$ converges.

Our main corollary is that prediction margins converge (cf. Corollary 3.2), meaning convergence of the normalized per-example values $y_i \Phi(x_i; W_t)/\|W_t\|^L$, where $y_i$ is the label and $\Phi(x_i; W_t)$ is the prediction on example $x_i$. These quantities are central in the study of generalization of deep networks, and their stability also implies stability of many other useful quantities [Bartlett et al., 2017, Jiang et al., 2019, 2020]. As an illustration of directional convergence and margin convergence, we plot the margin values for all examples in the standard cifar data against training iterations in Figure 2; these trajectories exhibit strong convergence behavior, both within our theory (a modified homogeneous AlexNet, as in Figure 2a), and outside of it (DenseNet, as in Figure 2b).

Directional convergence is often assumed throughout the literature [Gunasekar et al., 2018a, Chizat and Bach, 2020], but has only been established for linear predictors [Soudry et al., 2017]. It is tricky to prove because it may still be false for highly smooth functions: for instance, the homogeneous Mexican hat function satisfies all our assumptions except definability, and can be adjusted to have an arbitrary order of continuous derivatives, but its gradient flow does not converge in direction; instead it spirals [Lyu and Li, 2019].
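To make the tracked quantities concrete, here is a minimal PyTorch-style sketch (not the authors' experimental code) that monitors the direction $W_t/\|W_t\|$ and the normalized margins $y_i\Phi(x_i;W_t)/\|W_t\|^L$ during plain gradient descent; the synthetic data, width, step size, and initialization scales are illustrative placeholders, and the network is bias-free so that $L = 2$.

```python
# Sketch only: monitor directional convergence and normalized margins for a
# bias-free two-layer ReLU network (L = 2); all hyperparameters are placeholders.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, m, L_hom, lr = 32, 2, 512, 2, 1e-2
X = torch.randn(n, d)
y = torch.sign(X[:, 0] + X[:, 1])                  # linearly separable toy labels

W1 = (torch.randn(m, d) / d ** 0.5).requires_grad_()
W2 = (torch.randn(m) / m ** 0.5).requires_grad_()
params = [W1, W2]

def predict(X):
    # Phi(x; W) = sum_j W2_j * relu(W1_j . x): 2-positively homogeneous, no biases.
    return torch.relu(X @ W1.t()) @ W2

prev_dir = None
for step in range(20001):
    loss = F.softplus(-(y * predict(X))).sum()     # sum_i log(1 + exp(-p_i(W)))
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad
            p.grad = None
        W = torch.cat([p.reshape(-1) for p in params])
        direction = W / W.norm()
        if step % 2000 == 0:
            margins = y * predict(X) / W.norm() ** L_hom
            move = float('nan') if prev_dir is None else (direction - prev_dir).norm().item()
            print(step, loss.item(), move, margins.min().item())
        prev_dir = direction
```

In experiments like Figure 2, the printed directional movement shrinks and the normalized margins stabilize once the data are perfectly classified, which is exactly the behavior guaranteed by Theorem 3.1 and Corollary 3.2.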
To deal with similar pathologies in many branches of mathematics, the notion of functions definable in some o-minimal structure was developed: these are rich classes of functions built up to limit oscillations and other bad behavior. Using techniques from this literature, we build general tools, in particular unbounded nonsmooth Kurdyka-Łojasiewicz inequalities, which allow us to prove directional convergence, and may also be useful outside deep learning. More discussion of the o-minimal literature is given in Section 1.3, technical preliminaries are introduced in Section 2, and a proof overview is given in Section 3, with full details in the appendices.

(a) Margins while training H-AlexNet. (b) Margins while training DenseNet.

Figure 2: The margins of all examples in cifar, plotted against time, or rather optimization accuracy $\ln(n/\mathcal{L}(W_t))$ to remove the effect of step size and other implementation coincidences. Figure 2a shows H-AlexNet, a homogeneous version of AlexNet as described in the main text [Krizhevsky et al., 2012], which is handled by our theory. Figure 2b shows a standard DenseNet [Huang et al., 2017], which does not fit the theory in this work due to skip connections and biases, but still exhibits convergence of margins, thus suggesting a tantalizing open problem.

1.2 Second result: gradient alignment

Our second contribution, in Section 4, is that if the network has locally Lipschitz gradients, then these gradients also converge, and are aligned to the gradient flow path (cf. Theorem 4.1).

The gradient flow path, and the gradient of the risk along the path, converge to the same direction.

As a practical consequence of this, recall the use of gradients within the interpretability literature, specifically in saliency maps [Adebayo et al., 2018]: if gradients do not converge in direction, then saliency maps can change regardless of the number of iterations used to produce them. As a theoretical consequence, directional convergence and alignment imply margin maximization in a variety of situations: this holds in the deep linear case, strengthening prior work [Gunasekar et al., 2018b, Ji and Telgarsky, 2018a], and in the 2-homogeneous network case, with an assumption taken from the infinite width setting [Chizat and Bach, 2020], but presented here with finite width.

1.3 Further related work

Our analysis is heavily inspired and influenced by the work of Lyu and Li [2019], who studied margin maximization of homogeneous networks, establishing monotonicity of a smoothed margin, a quantity we also use. However, they did not prove directional convergence, and instead must work with subsequences. Their work also left open alignment and global margin maximization.

Directional convergence. A standard approach to resolve directional convergence and similar questions is to establish that the objective function in question is definable in some o-minimal structure, which, as mentioned before, limits oscillations and other complicated behavior. This literature cannot be directly applied to our setting, owing to a combination of nonsmooth layers like the ReLU and max-pooling, and the exponential function used in the cross-entropy loss; as a result, our proofs need to rebuild many o-minimal results from the ground up.

In more detail, an important problem in the o-minimal literature is the gradient conjecture of René Thom: it asks when the existence of $\lim_{t\to\infty} W_t = z$ further implies that $\lim_{t\to\infty} (W_t - z)/\|W_t - z\|$ exists, and it was established in various definable scenarios by Kurdyka et al.
[2000a, 2006] via related Kurdyka-Łojasiewicz inequalities [Kurdyka, 1998]. The underlying proof ideas can also be used to analyze $\lim_{t\to\infty} W_t/\|W_t\|$ when the weights go to infinity [Grandjean, 2007]. However, the prior results require the objective function to be either real analytic, or definable in a polynomially-bounded o-minimal structure. The first case causes the aforementioned nonsmoothness issue, and excludes many common layers in deep learning such as the ReLU and max-pooling. The second case excludes the exponential function, and means the logistic and cross-entropy losses cannot be handled. To resolve these issues, we had to redo large portions of the o-minimality theory, such as the nonsmooth unbounded Kurdyka-Łojasiewicz inequalities that can handle the exponential/logistic loss, as presented in Section 3.

Alignment. As discussed in Section 4, alignment implies the gradient flow reaches a stationary point of the limiting margin maximization objective, and therefore is related to various statements and results throughout the literature on implicit bias and margin maximization [Soudry et al., 2017, Ji and Telgarsky, 2018b]. This stationary point perspective also appears in some nonlinear works, for instance in the aforementioned work on margins by Lyu and Li [2019], which showed that subsequences of the gradient flow converge to such stationary points; in addition to fully handling the gradient flow, the present work also differs in that alignment is in general a stronger notion, in that it is unclear how to prove alignment as a consequence of convergence to KKT points. Additionally, alignment can still hold when the objective function is not definable and directional convergence is false, for example on the homogeneous Mexican hat function, which cannot be handled by the approach in [Lyu and Li, 2019, Appendix J]. As a final pointer to the literature, many implicit bias works explicitly assume directional convergence and some version of alignment [Gunasekar et al., 2018b, Chizat and Bach, 2020], but these works neither indicate a possible proof, nor provide conclusive evidence.

1.4 Experimental overview

The experiments in Figures 1 and 2 are performed in as standard a way as possible to highlight that directional convergence is a reliable property; full details are in Appendix A. Briefly, Figure 1 uses synthetic data and vanilla gradient descent (no momentum, no weight decay, etc.) on a 10,000-node-wide 2-layer squared-ReLU network and its Neural Tangent Kernel classifier; by using the squared ReLU, both our directional convergence and our alignment results apply. Figure 2 uses standard cifar, firstly with a modified homogeneous AlexNet and secondly with an unmodified DenseNet, respectively inside and outside our assumptions. SGD was used on cifar due to the training set size, and seeing how directional convergence still seems to occur suggests another open problem.

2 Preliminaries and assumptions

In this section, we first introduce the notions of Clarke subdifferentials and o-minimal structures, and then use these notions to describe the network model, gradient flow, and Assumptions 2.1 and 2.2. Throughout this paper, $\|\cdot\|$ denotes the $\ell_2$ (Frobenius) norm, and $\|\cdot\|_\sigma$ denotes the spectral norm.

Locally Lipschitz functions and Clarke subdifferentials. Consider a function $f : D \to \mathbb{R}$ with $D$ open. We say that $f$ is locally Lipschitz if for any $x \in D$, there exists a neighborhood $U$ of $x$ such that $f|_U$ is Lipschitz continuous. We say that $f$ is $\mathcal{C}^1$ if $f$ is continuously differentiable on $D$. If $f$ is locally Lipschitz, then $f$ is differentiable almost everywhere [Borwein and Lewis, 2000, Theorem 9.1.2]. The Clarke subdifferential of $f$ at $x \in D$ is defined as
$$\partial f(x) := \operatorname{conv}\left\{ \lim_{i\to\infty} \nabla f(x_i) \,\middle|\, x_i \in D,\ \nabla f(x_i) \text{ exists},\ \lim_{i\to\infty} x_i = x \right\},$$
which is nonempty, convex, and compact [Clarke, 1975], and if $f$ is continuously differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$. Vectors in $\partial f(x)$ are called subgradients, and we let $\bar\partial f(x)$ denote the unique minimum-norm subgradient:
$$\bar\partial f(x) := \arg\min_{x' \in \partial f(x)} \|x'\|.$$
In the following analysis, we use $\bar\partial f$ in many places that seem to call on $\nabla f$.
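As a concrete (and standard) example of these definitions, not specific to this paper: for the ReLU $f(z) := \max\{0, z\}$ on $\mathbb{R}$, which is locally Lipschitz but not $\mathcal{C}^1$,
$$\partial f(0) = \operatorname{conv}\{0, 1\} = [0, 1], \qquad \bar\partial f(0) = \arg\min_{z' \in [0,1]} |z'| = 0,$$
while $\partial f(z) = \{1\}$ for $z > 0$ and $\partial f(z) = \{0\}$ for $z < 0$; in particular, the minimum-norm subgradient selects $0$ at the kink.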
O-minimal structures and definable functions. Formally, an o-minimal structure is a collection $\mathcal{S} = \{\mathcal{S}_n\}_{n=1}^\infty$, where $\mathcal{S}_n$ is a set of subsets of $\mathbb{R}^n$ which includes all algebraic sets and is closed under finite union/intersection and complement, Cartesian product, and projection, and $\mathcal{S}_1$ consists of finite unions of open intervals and points. A set $A \subseteq \mathbb{R}^n$ is definable if $A \in \mathcal{S}_n$, and a function $f : D \to \mathbb{R}^m$ with $D \subseteq \mathbb{R}^n$ is definable if its graph is in $\mathcal{S}_{n+m}$. More details are given in Appendix B.

Many natural functions and operations are definable. First of all, definability of functions is stable under algebraic operations, composition, inverse, maximum and minimum, etc. Moreover, Wilkie [1996] proved that there exists an o-minimal structure where polynomials and the exponential function are definable. Consequently, definability allows many common layer types in deep learning, such as fully-connected/convolutional/ReLU/max-pooling layers, skip connections, the cross-entropy loss, etc.; moreover, they can be composed arbitrarily. As will be discussed later, what is still missing is the handling of the gradient flow on such functions.

The network model. Consider a dataset $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$ are features and $y_i \in \{-1, +1\}$ are binary labels, and a predictor $\Phi(\,\cdot\,; W) : \mathbb{R}^d \to \mathbb{R}$ with parameters $W \in \mathbb{R}^k$. We make the following assumption on the predictor $\Phi$.

Assumption 2.1. For any fixed $x$, the prediction $W \mapsto \Phi(x; W)$ as a function of $W$ is locally Lipschitz, $L$-positively homogeneous for some $L > 0$, and definable in some o-minimal structure including the exponential function.

As mentioned before, homogeneity means that $\Phi(x; cW) = c^L \Phi(x; W)$ for any $c \geq 0$. This means, for instance, that linear, convolutional, ReLU, and max-pooling layers are permitted, but not skip connections and biases. Homogeneity is used heavily throughout the theoretical study of deep networks [Lyu and Li, 2019].

Given a decreasing loss function $\ell$, the total loss (or unnormalized empirical risk) is given by
$$\mathcal{L}(W) := \sum_{i=1}^n \ell\big(y_i \Phi(x_i; W)\big) = \sum_{i=1}^n \ell\big(p_i(W)\big),$$
where $p_i(W) := y_i \Phi(x_i; W)$ are also locally Lipschitz, $L$-positively homogeneous, and definable under Assumption 2.1. We consider the exponential loss $\ell_{\exp}(z) := e^{-z}$ and the logistic loss $\ell_{\log}(z) := \ln(1 + e^{-z})$, in which case $\mathcal{L}$ is also locally Lipschitz and definable.

Gradient flow. As in [Davis et al., 2020, Lyu and Li, 2019], a curve $z$ from an interval $I$ to some real space $\mathbb{R}^m$ is called an arc if it is absolutely continuous on any compact subinterval of $I$. An arc is differentiable almost everywhere, and the composition of an arc and a locally Lipschitz function is still an arc. We consider a gradient flow $W : [0, \infty) \to \mathbb{R}^k$ that is an arc and satisfies
$$\frac{\mathrm{d}W_t}{\mathrm{d}t} \in -\partial \mathcal{L}(W_t), \quad \text{for a.e. } t \geq 0. \tag{1}$$

Our second assumption is on the initial risk, and appears in prior work [Lyu and Li, 2019].

Assumption 2.2. The initial iterate $W_0$ satisfies $\mathcal{L}(W_0) < \ell(0)$.

As mentioned before, this assumption encapsulates our focus on the late training phase; some other analysis, for instance via the neural tangent kernel, can be first applied to ensure $\mathcal{L}(W_0) < \ell(0)$.
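As a quick sanity check of the two assumptions (a sketch, not part of the paper's development; the tiny bias-free network, data, and initialization are illustrative placeholders), one can verify $L$-positive homogeneity numerically and test whether an initialization already satisfies $\mathcal{L}(W_0) < \ell(0)$:

```python
# Sketch: numerically check Assumptions 2.1 (homogeneity) and 2.2 for a small
# bias-free two-layer ReLU network, so L = 2; everything here is a placeholder.
import math
import torch

torch.manual_seed(0)
n, d, m = 16, 3, 32
X, y = torch.randn(n, d), torch.sign(torch.randn(n))
W1, W2 = torch.randn(m, d), torch.randn(m)

def Phi(X, W1, W2):
    # Bias-free, no skip connections: locally Lipschitz and 2-positively homogeneous.
    return torch.relu(X @ W1.t()) @ W2

# Assumption 2.1 (homogeneity): Phi(x; cW) = c^L Phi(x; W) with L = 2 here.
c = 2.5
print(torch.allclose(Phi(X, c * W1, c * W2), c ** 2 * Phi(X, W1, W2), atol=1e-4))

# Assumption 2.2 with the logistic loss: total loss below ell(0) = ln 2 forces every
# p_i(W_0) = y_i Phi(x_i; W_0) to be positive, i.e. perfect classification at time 0.
loss0 = torch.nn.functional.softplus(-y * Phi(X, W1, W2)).sum()
print(loss0.item(), loss0.item() < math.log(2.0))
```

A random initialization typically fails the second check; in practice Assumption 2.2 is reached only after an initial phase of training.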
3 Directional convergence

We now turn to stating our main result on directional convergence and sketching its analysis. As Assumptions 2.1 and 2.2 imply $\|W_t\| \to \infty$ [Lyu and Li, 2019], we study the normalized flow $\widetilde{W}_t := W_t/\|W_t\|$, whose convergence is a formal way of studying the directional convergence of $W_t$. As mentioned before, directional convergence is false in general [Lyu and Li, 2019], but definability suffices to ensure it. Throughout, for general nonzero $W$, we will use $\widetilde{W} := W/\|W\|$.

Theorem 3.1. Under Assumptions 2.1 and 2.2, for $\ell_{\exp}$ and $\ell_{\log}$, the curve swept by $\widetilde{W}_t$ has finite length, and thus $\widetilde{W}_t$ converges.

A direct consequence of Theorem 3.1 is the convergence of the margin distribution (i.e., the normalized outputs). Due to homogeneity, for any nonzero $W$, we have $p_i(W)/\|W\|^L = p_i(\widetilde{W})$, and thus the next result follows from Theorem 3.1.

Corollary 3.2. Under Assumptions 2.1 and 2.2, for $\ell_{\exp}$ and $\ell_{\log}$, it holds that $p_i(W_t)/\|W_t\|^L$ converges for all $1 \leq i \leq n$.

Next we give a proof sketch of Theorem 3.1; the full proofs of the Kurdyka-Łojasiewicz inequalities (Lemmas 3.5 and 3.6) are given in Appendix B.3, while the other proofs are given in Appendix C.

3.1 A proof sketch of Theorem 3.1

The smoothed margin introduced in [Lyu and Li, 2019] is crucial in our analysis: given $W \neq 0$, let
$$\alpha(W) := \ell^{-1}\big(\mathcal{L}(W)\big), \qquad \text{and} \qquad \tilde\alpha(W) := \frac{\alpha(W)}{\|W\|^L}.$$
For simplicity, let $\tilde\alpha_t$ denote $\tilde\alpha(W_t)$, and $\zeta_t$ denote the length of the path swept by $\widetilde{W}_t = W_t/\|W_t\|$ from time $0$ to $t$. Lyu and Li [2019] proved that $\tilde\alpha_t$ is nondecreasing with some limit $a \in (0, \infty)$, and $\|W_t\| \to \infty$.

We invoke a standard but sophisticated tool from the definability literature to aid in proving that the total path length $\lim_{t\to\infty}\zeta_t$ is finite: formally, a function $\Psi : [0, \nu) \to \mathbb{R}$ is called a desingularizing function when $\Psi$ is continuous on $[0, \nu)$ with $\Psi(0) = 0$, and continuously differentiable on $(0, \nu)$ with $\Psi' > 0$; in words, a desingularizing function is a witness to the fact that the flow is asymptotically well-behaved. As we will sketch after stating the lemma, this immediately leads to a proof of Theorem 3.1.

Lemma 3.3. There exist $R > 0$, $\nu > 0$ and a definable desingularizing function $\Psi$ on $[0, \nu)$, such that for a.e. large enough $t$ with $\|W_t\| > R$ and $\tilde\alpha_t > a - \nu$, it holds that
$$\frac{\mathrm{d}\zeta_t}{\mathrm{d}t} \leq -c\, \frac{\mathrm{d}\Psi(a - \tilde\alpha_t)}{\mathrm{d}t} \quad \text{for some constant } c > 0.$$

To prove Theorem 3.1 from here, let $t_0$ be large enough so that the conditions of Lemma 3.3 hold for all $t \geq t_0$: then we have $\lim_{t\to\infty} \zeta_t \leq \zeta_{t_0} + c\,\Psi(a - \tilde\alpha_{t_0}) < \infty$, and thus the path length is finite.

Below we sketch the proof of Lemma 3.3, which is based on a careful comparison of $\mathrm{d}\tilde\alpha_t/\mathrm{d}t$ and $\mathrm{d}\zeta_t/\mathrm{d}t$. The proof might be hard to parse due to the extensive use of $\bar\partial$, the minimum-norm Clarke subgradient; at a first reading, the condition of local Lipschitz continuity can just be replaced with continuous differentiability, in which case the Clarke subgradient is just the normal gradient.

Given any function $f$ which is locally Lipschitz around a nonzero $W$, let
$$\partial_r f(W) := \big\langle \bar\partial f(W), \widetilde{W} \big\rangle\, \widetilde{W} \qquad \text{and} \qquad \partial_\perp f(W) := \bar\partial f(W) - \partial_r f(W)$$
denote the radial and spherical parts of $\bar\partial f(W)$, respectively.
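For illustration, the decomposition just defined is elementary to compute; the following small NumPy sketch (not from the paper, with illustrative names) applies it to an ordinary gradient $g$ at a nonzero $W$.

```python
# Sketch: radial/spherical decomposition of a gradient g at a nonzero W.
import numpy as np

def radial_spherical(g, W):
    W_dir = W / np.linalg.norm(W)          # tilde(W) = W / ||W||
    g_r = (g @ W_dir) * W_dir              # radial part: component along tilde(W)
    g_perp = g - g_r                       # spherical part: tangent to the sphere
    return g_r, g_perp

# Example: for an L-homogeneous f, Euler's theorem gives <grad f(W), W> = L f(W);
# e.g. f(W) = ||W||^2 (L = 2) has grad f(W) = 2W, which is purely radial.
W = np.array([3.0, 4.0])
g_r, g_perp = radial_spherical(2 * W, W)
print(g_r, g_perp)                          # g_perp is (numerically) zero
```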
First note the following technical characterization of $\mathrm{d}\tilde\alpha_t/\mathrm{d}t$ and $\mathrm{d}\zeta_t/\mathrm{d}t$ using the radial and spherical components of the relevant Clarke subgradients.

Lemma 3.4. It holds for a.e. $t \geq 0$ that
$$\frac{\mathrm{d}\tilde\alpha_t}{\mathrm{d}t} = \big\|\partial_r \tilde\alpha(W_t)\big\|\,\big\|\partial_r \mathcal{L}(W_t)\big\| + \big\|\partial_\perp \tilde\alpha(W_t)\big\|\,\big\|\partial_\perp \mathcal{L}(W_t)\big\|, \qquad \text{and} \qquad \frac{\mathrm{d}\zeta_t}{\mathrm{d}t} = \frac{\big\|\partial_\perp \mathcal{L}(W_t)\big\|}{\|W_t\|}.$$

For simplicity, in the discussion here we consider the case that all subgradients in Lemma 3.4 are nonzero, with the general case handled in the full proofs in the appendices. Then Lemma 3.4 implies
$$\frac{\mathrm{d}\tilde\alpha_t}{\mathrm{d}\zeta_t} = \frac{\mathrm{d}\tilde\alpha_t/\mathrm{d}t}{\mathrm{d}\zeta_t/\mathrm{d}t} = \|W_t\| \left( \frac{\|\partial_r \mathcal{L}(W_t)\|}{\|\partial_\perp \mathcal{L}(W_t)\|}\, \big\|\partial_r \tilde\alpha(W_t)\big\| + \big\|\partial_\perp \tilde\alpha(W_t)\big\| \right). \tag{2}$$

As in [Kurdyka et al., 2006, Grandjean, 2007], to bound eq. (2), we further consider two cases depending on the ratio $\|\partial_\perp \tilde\alpha(W_t)\| / \|\partial_r \tilde\alpha(W_t)\|$. If $\|\partial_\perp \tilde\alpha(W_t)\| / \|\partial_r \tilde\alpha(W_t)\| \geq c_1 \|W_t\|^{-L/3}$ for some constant $c_1 > 0$, then Lemma 3.3 follows from $\mathrm{d}\tilde\alpha_t/\mathrm{d}\zeta_t \geq \|W_t\|\, \|\partial_\perp \tilde\alpha(W_t)\|$ as given by eq. (2), and the following Kurdyka-Łojasiewicz inequality. Its proof is based on the proof idea of [Kurdyka et al., 2006, Proposition 6.3], but further handles the unbounded and nonsmooth setting.

Lemma 3.5. Given a locally Lipschitz definable function $f$ with an open domain $D \subseteq \{x \mid \|x\| > 1\}$, for any $c, \eta > 0$, there exist $\nu > 0$ and a definable desingularizing function $\Psi$ on $[0, \nu)$ such that
$$\Psi'\big(f(x)\big)\, \|x\|\, \big\|\partial_\perp f(x)\big\| \geq 1, \quad \text{if } f(x) \in (0, \nu) \text{ and } \big\|\partial_\perp f(x)\big\| \geq c\, \|x\|^{-\eta}\, \big\|\partial_r f(x)\big\|.$$

On the other hand, if $\|\partial_\perp \tilde\alpha(W_t)\| / \|\partial_r \tilde\alpha(W_t)\| \leq c_1 \|W_t\|^{-L/3}$, then a careful calculation (using Lemmas C.2 to C.4) can show that for some constants $c_2, c_3 > 0$,
$$\frac{\|\partial_r \mathcal{L}(W_t)\|}{\|\partial_\perp \mathcal{L}(W_t)\|} \geq c_2\, \|W_t\|^{2L/3}, \qquad \text{and} \qquad \big\|\partial_r \tilde\alpha(W_t)\big\| \geq c_3\, \|W_t\|^{-L/3}\, \big\|\bar\partial \tilde\alpha(W_t)\big\|.$$
It then follows from eq. (2) that $\mathrm{d}\tilde\alpha_t/\mathrm{d}\zeta_t \geq c_2 c_3\, \|W_t\|^{1+L/3}\, \|\bar\partial \tilde\alpha(W_t)\|$. In this case we give the following Kurdyka-Łojasiewicz inequality, which implies Lemma 3.3.

Lemma 3.6. Given a locally Lipschitz definable function $f$ with an open domain $D \subseteq \{x \mid \|x\| > 1\}$, for any $\lambda > 0$, there exist $\nu > 0$ and a definable desingularizing function $\Psi$ on $[0, \nu)$ such that
$$\Psi'\big(f(x)\big)\, \|x\|^{1+\lambda}\, \big\|\bar\partial f(x)\big\| \geq 1, \quad \text{if } f(x) \in (0, \nu).$$

4 Alignment between the gradient flow path and gradients

Theorem 3.1 gave our directional convergence result, namely that the normalized iterate $W_t/\|W_t\|$ converges to some direction. Next we show and discuss our alignment result: if all $p_i$ have locally Lipschitz gradients, then along the gradient flow path, $-\nabla\mathcal{L}(W_t)$ converges to the same direction as $W_t$.

Theorem 4.1. Under Assumptions 2.1 and 2.2, if all $p_i$ further have locally Lipschitz gradients, then $-\nabla\mathcal{L}(W_t)$ and $W_t$ converge to the same direction, meaning the angle between $W_t$ and $-\nabla\mathcal{L}(W_t)$ converges to zero. If all $p_i$ are twice continuously differentiable, then the same result holds without the definability condition (cf. Assumption 2.1).

Below we first sketch the proof of Theorem 4.1, with full details in Appendix D, and then in Section 4.2 present a few global margin maximization consequences, which are proved in Appendix E.

4.1 A proof sketch of Theorem 4.1

Recall that $\lim_{t\to\infty} \alpha(W_t)/\|W_t\|^L = a$. The first observation is that $\alpha(W_t)$, the smoothed margin, asymptotes to the exact margin $\min_{1 \leq i \leq n} p_i(W_t)$, which is $L$-positively homogeneous. Therefore $\alpha$ is asymptotically $L$-positively homogeneous, and formally we can show
$$\lim_{t\to\infty} \left\langle \frac{\nabla\alpha(W_t)}{\|W_t\|^{L-1}}, \frac{W_t}{\|W_t\|} \right\rangle = aL, \tag{3}$$
which can be viewed as an asymptotic version of Euler's homogeneous function theorem (cf. Lemma C.1). Consequently, the inner product between $\nabla\alpha(W_t)/\|W_t\|^{L-1}$ and $\widetilde{W}_t$ converges.

Let $\theta_t$ denote the angle between $W_t$ and $-\nabla\mathcal{L}(W_t)$, which is also the angle between $W_t$ and $\nabla\alpha(W_t)$, since $\nabla\mathcal{L}(W_t)$ and $\nabla\alpha(W_t)$ point in opposite directions by the chain rule. By [Lyu and Li, 2019, Corollary C.10], given any $\epsilon > 0$, there exists a time $t_\epsilon$ such that $\theta_{t_\epsilon} < \epsilon$. The question is whether such a small angle can be maintained after $t_\epsilon$. This is not obvious since, as mentioned above, the smoothed margin $\alpha(W_t)$ asymptotes to the exact margin $\min_{1 \leq i \leq n} p_i(W_t)$, which may be nondifferentiable even with smooth $p_i$, due to the nondifferentiability of the minimum.
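For intuition, with the exponential loss the smoothed margin has a closed form (a standard computation, consistent with the definitions above) that makes this asymptotic relationship concrete:
$$\alpha(W) = \ell_{\exp}^{-1}\big(\mathcal{L}(W)\big) = -\ln \sum_{i=1}^n e^{-p_i(W)}, \qquad \min_i p_i(W) - \ln n \;\leq\; \alpha(W) \;\leq\; \min_i p_i(W),$$
so after dividing by $\|W_t\|^L \to \infty$, the smoothed margin and the exact margin share the same limit $a$, even though the minimum itself is not differentiable.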
Consequently, the exact margin may have discontinuous Clarke subdifferentials, and since the smoothed margin asymptotes to it, it is unclear whether $\theta_t \to 0$. (This point was foreshadowed earlier, where it was pointed out that alignment is not a clear consequence of convergence to stationary points of the margin maximization objective.)

To handle this, the key to our analysis is the potential function $\mathcal{J}(W) := \|\nabla\alpha(W)\|^2 / \|W\|^{2L-2}$. Suppose at time $t$ it holds that $\big\langle \nabla\alpha(W_t)/\|W_t\|^{L-1}, \widetilde{W}_t \big\rangle$ is close to $aL$, and $\theta_t$ is very small. If $\theta_t$ becomes large again at some $t' > t$, it must follow that $\mathcal{J}(W_{t'})$ is much larger than $\mathcal{J}(W_t)$. We prove that this is impossible, by showing that
$$\lim_{t\to\infty} \int_t^\infty \max\left\{ \frac{\mathrm{d}\,\mathcal{J}(W_\tau)}{\mathrm{d}\tau},\ 0 \right\} \mathrm{d}\tau = 0, \tag{4}$$
and thus Theorem 4.1 follows. The proof of eq. (4) is motivated by the dual convergence analysis in [Ji and Telgarsky, 2019], and also uses the positive homogeneity of $\nabla p_i$ and $\nabla^2 p_i$ (which exist a.e.).

4.2 Main alignment consequence: margin maximization

A variety of (global) margin maximization results are immediate consequences of directional convergence and alignment. This subsection investigates two examples: deep linear networks, and shallow squared-ReLU networks.

Deep linear networks predict with $\Phi(x_i; W) = A_L \cdots A_1 x_i$, where the parameters $W = (A_L, \ldots, A_1)$ are organized into $L$ matrices. This setting has been considered in the literature, but the original work assumed directional convergence, alignment, and a condition on the support vectors [Gunasekar et al., 2018b]; a follow-up dropped the directional convergence and alignment assumptions, but instead assumed the support vectors span the space $\mathbb{R}^d$ [Ji and Telgarsky, 2018a]. Here, we not only drop all the aforementioned assumptions, but moreover include a proof rather than an assumption of directional convergence.

Proposition 4.2. Suppose $W_t = (A_L(t), \ldots, A_1(t))$ and $\mathcal{L}(W_0) < \ell(0)$. Then a unique linear maximum margin predictor $\bar u := \arg\max_{\|u\| \leq 1} \min_i y_i x_i^\top u$ exists, and there exist unit vectors $(v_L, \ldots, v_1, v_0)$ with $v_L = 1$ and $v_0 = \bar u$ such that
$$\lim_{t\to\infty} \frac{A_j(t)}{\|A_j(t)\|} = v_j v_{j-1}^\top \qquad \text{and} \qquad \lim_{t\to\infty} \frac{A_L(t) \cdots A_1(t)}{\|A_L(t) \cdots A_1(t)\|} = \bar u^\top.$$

Thanks to directional convergence and alignment (cf. Theorems 3.1 and 4.1), the proof boils down to writing down the gradient expression for each layer and doing some algebra.

A more interesting example is a certain 2-homogeneous case, which despite its simplicity is a universal approximator; this setting was studied by Chizat and Bach [2020], who considered the infinite width case, and established margin maximization under assumptions of directional convergence and gradient convergence. Unfortunately, it is not clear if Theorems 3.1 and 4.1 can be applied to fill these assumptions, since they do not handle infinite width, and indeed it is not clear if infinite width networks or close relatives are definable in an o-minimal structure. Instead, here we consider the finite width case, albeit with an additional assumption.

Following [Chizat and Bach, 2020, S-ReLU], organize $W_t$ into $m$ rows $(w_j(t))_{j=1}^m$, with normalizations $\theta_j(t) := w_j(t)/\|w_j(t)\|$, where $\theta_j(t) = 0$ when $w_j(t) = 0$, and consider
$$\Phi(x_i; W) := \sum_j (-1)^j \max\{0, w_j^\top x_i\}^2 \qquad \text{and} \qquad \phi_{ij}(w) := y_i (-1)^j \max\{0, w^\top x_i\}^2, \tag{5}$$
whereby $p_i(W) = \sum_j \phi_{ij}(w_j)$, and $\Phi$, $p_i$, and $\phi_{ij}$ are all 2-homogeneous and definable. (The $(-1)^j$ may seem odd, but it is an easy trick to get universal approximation without outer weights.) A minimal code sketch of this model is given below.
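The following is a sketch of the 2-homogeneous squared-ReLU model in eq. (5) in PyTorch-style code; the width, data, and labels are placeholders, and this is an illustration rather than the paper's experimental setup.

```python
# Sketch of the squared-ReLU model in eq. (5): no outer weights, no biases.
import torch

def phi(X, W):
    # Phi(x; W) = sum_j (-1)^j max{0, w_j^T x}^2
    signs = torch.tensor([(-1.0) ** j for j in range(W.shape[0])], dtype=X.dtype)
    return (torch.relu(X @ W.t()) ** 2) @ signs

torch.manual_seed(0)
m, d, n = 64, 2, 8
W = torch.randn(m, d)
X, y = torch.randn(n, d), torch.sign(torch.randn(n))

p = y * phi(X, W)                 # p_i(W) = y_i Phi(x_i; W), 2-positively homogeneous
print(p / W.norm() ** 2)          # normalized margins p_i(W) / ||W||^2
```

These normalized margins $p_i(W_t)/\|W_t\|^2$ are exactly the quantities characterized in Proposition 4.3 below.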
Proposition 4.3. Consider the setting in eq. (5), along with $\mathcal{L}(W_0) < \ell(0)$ and $\|x_i\| \leq 1$.

1. (Local guarantee.) The vector $s(t) \in \mathbb{R}^m$ with $s_j(t) := \|w_j(t)\|^2 / \|W_t\|^2$ satisfies $s(t) \to \bar s \in \Delta_m$ (the probability simplex on $m$ vertices), and $\theta_j(t) \to \bar\theta_j$ with $\bar\theta_j = 0$ if $\bar s_j = 0$, and
$$a = \lim_{t\to\infty} \min_i \frac{p_i(W_t)}{\|W_t\|^2} = \lim_{t\to\infty} \min_i \sum_j s_j(t)\, \phi_{ij}\big(\theta_j(t)\big) = \max_{s \in \Delta_m} \min_i \sum_j s_j\, \phi_{ij}(\bar\theta_j).$$

2. (Global guarantee.) Suppose the covering condition holds: there exist $t_0$ and $\epsilon > 0$ with
$$\max_j \big\|\theta_j(t_0) - \bar\theta_j\big\|_2 \leq \epsilon, \qquad \text{and} \qquad \max_{\theta \in S^{d-1}}\, \max\left\{ \min_{2 \mid j} \big\|\theta_j(t_0) - \theta\big\|,\ \min_{2 \nmid j} \big\|\theta_j(t_0) - \theta\big\| \right\} \leq \epsilon,$$
where $S^{d-1} := \{\theta \in \mathbb{R}^d : \|\theta\| = 1\}$. Then margins are approximately (globally) maximized:
$$\lim_{t\to\infty} \min_i \frac{p_i(W_t)}{\|W_t\|^2} \;\geq\; \max_{\nu \in \mathcal{P}(S^{d-1})} \min_i\, y_i \int \max\{0, x_i^\top \theta\}^2 \,\mathrm{d}\nu(\theta) \;-\; 4\epsilon,$$
where $\mathcal{P}(S^{d-1})$ is the set of signed measures on $S^{d-1}$ with mass at most $1$.

The first part (the "local guarantee") characterizes the limiting margin as the maximum margin of a linear problem obtained by taking the limiting directions $(\bar\theta_j)_{j=1}^m$ and treating the resulting $\phi_{ij}(\bar\theta_j)$ as features. The quality of this margin is bad if the limiting directions are bad, and therefore we secondly (the "global guarantee") consider a case where our margin is nearly as good as the infinite width global max margin value as defined by [Chizat and Bach, 2020, eq. (5)]; see the discussion therein for a justification of this choice, and for calling it the globally maximal margin.

The covering condition deserves further discussion. In the infinite width setting, it holds for all $\epsilon > 0$ assuming directional convergence [Chizat and Bach, 2020, Proof of Theorem D.1], but it cannot hold in such generality here, as we are dealing with finite width. Similar properties have appeared throughout the literature: Wei et al. [2018, Section 3] explicitly re-initialized network nodes to guarantee a good covering, and more generally Ge et al. [2015] added noise to escape saddle points in general optimization problems.

5 Concluding remarks and open problems

In this paper, we established that the normalized parameter vectors $W_t/\|W_t\|$ converge, and that under an additional assumption of locally Lipschitz gradients, the gradients also converge and align with the parameters. There are many promising avenues for future work based on these results. One basic line is to weaken our assumptions: dropping homogeneity to allow for DenseNet and ResNet, and analyzing finite-time methods like (stochastic) gradient descent, and moreover their rates of convergence. We also handled only the binary classification case; however, our tools should directly allow for the cross-entropy loss. Another direction is into further global margin maximization results, beyond the simple networks in Section 4.2, and into related generalization consequences of directional convergence and alignment.

Broader impact

This paper constitutes theoretical work, with an aim of enhancing human understanding, and laying the groundwork for further theoretical and applied work. The authors hope that advancing the foundations of deep networks leads moreover to a better understanding of their failure modes, and manipulation thereof, and thus an increase in safety.

Acknowledgments and disclosure of funding

The authors thank Zhiyuan Li and Kaifeng Lyu for lively discussions during an early phase of the project. The authors are grateful for support from the NSF under grant IIS-1750051, and from NVIDIA under a GPU grant.

References

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In NIPS, 2018. arXiv:1810.03292 [cs.CV].
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
Peter L. Bartlett, Dylan J. Foster, and Matus J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
Jérôme Bolte, Aris Daniilidis, Adrian Lewis, and Masahiro Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.
Jonathan Borwein and Adrian Lewis. Convex Analysis and Nonlinear Optimization. Springer Publishing Company, Incorporated, 2000.
Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In NIPS, 2018. arXiv:1805.09545 [math.OC].
Lénaïc Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv preprint arXiv:2002.04486, 2020.
Frank H. Clarke. Generalized gradients and applications. Transactions of the American Mathematical Society, 205:247–262, 1975.
Frank H. Clarke. Optimization and Nonsmooth Analysis. SIAM Classics in Applied Mathematics, 1983.
Michel Coste. An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali, Pisa, 2000.
Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, and Jason D. Lee. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 20(1):119–154, 2020.
Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In COLT, 2015. arXiv:1503.02101 [cs.LG].
V. Grandjean. On the limit set at infinity of a gradient trajectory of a semialgebraic function. Journal of Differential Equations, 233(1):22–41, 2007.
Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018a.
Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018b.
Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017. arXiv:1608.06993v5 [cs.CV].
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. arXiv preprint arXiv:1810.02032, 2018a.
Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018b.
Ziwei Ji and Matus Telgarsky. A refined primal-dual analysis of the implicit bias. arXiv preprint arXiv:1906.04540, 2019.
Yiding Jiang, Dilip Krishnan, Hossein Mobahi, and Samy Bengio. Predicting the generalization gap in deep networks with margin distributions. In ICLR, 2019. arXiv:1810.00113 [stat.ML].
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In ICLR, 2020. arXiv:1912.02178 [cs.LG].
Alex Krizhevsky. Learning multiple layers of features from tiny images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. In Annales de l'institut Fourier, volume 48, pages 769–783, 1998.
Krzysztof Kurdyka, Tadeusz Mostowski, and Adam Parusiński. Proof of the gradient conjecture of R. Thom. Annals of Mathematics, 152(3):763–792, 2000a.
Krzysztof Kurdyka, Patrice Orro, and Stéphane Simon. Semialgebraic Sard theorem for generalized critical values. Journal of Differential Geometry, 56(1):67–92, 2000b.
Krzysztof Kurdyka, Adam Parusiński, et al. Quasi-convex decomposition in o-minimal structures. Application to the gradient conjecture. In Singularity Theory and its Applications, pages 137–177. Mathematical Society of Japan, 2006.
Ta Lê Loi. Lecture 1: O-minimal structures. In The Japanese-Australian Workshop on Real and Complex Singularities: JARCS III, pages 19–30. Centre for Mathematics and its Applications, Mathematical Sciences Institute, The Australian National University, 2010.
Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. 2019. arXiv:1902.06015 [stat.ML].
András Némethi and Alexandru Zaharia. Milnor fibration at infinity. Indagationes Mathematicae, 3(3):323–335, 1992.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. 2018. arXiv:1811.03600 [cs.LG].
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
Lou Van den Dries and Chris Miller. Geometric categories and o-minimal structures. Duke Mathematical Journal, 84(2):497–540, 1996.
Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs. their induced kernel. arXiv preprint arXiv:1810.05369, 2018.
Alex J. Wilkie. Model completeness results for expansions of the ordered field of real numbers by restricted Pfaffian functions and the exponential function. Journal of the American Mathematical Society, 9(4):1051–1094, 1996.
Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.