Piecewise Strong Convexity of Neural Networks

Tristan Milne
Department of Mathematics, University of Toronto
Toronto, Ontario, Canada
tmilne@math.toronto.edu

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We study the loss surface of a feed-forward neural network with ReLU nonlinearities, regularized with weight decay. We show that the regularized loss function is piecewise strongly convex on an important open set which contains, under some conditions, all of its global minimizers. This is used to prove that local minima of the regularized loss function in this set are isolated, and that every differentiable critical point in this set is a local minimum, partially addressing an open problem given at the Conference on Learning Theory (COLT) 2015; our result is also applied to linear neural networks to show that with weight decay regularization, there are no non-zero critical points in a norm ball obtaining training error below a given threshold. We also include an experimental section where we validate our theoretical work and show that the regularized loss function is almost always piecewise strongly convex when restricted to stochastic gradient descent trajectories for three standard image classification problems.

1 Introduction

Neural networks are an extremely popular tool with a variety of applications, from object recognition ([SHK+14], [HZRS15]) and speech recognition [GMH13] to the automatic creation of realistic synthetic images [WLZ+17]. The optimization problem of finding the weights of a neural network such that the resulting function approximates a target function on training data is the focus of this paper. Despite strong empirical success on this optimization problem, there is still much to learn about the landscape of the loss function beyond the fact that it is not globally convex. How many local minima does this function have? Are the local minima isolated from each other? What about the existence of local maxima or saddle points? We aim to answer several of these questions in this paper, at least for the loss function restricted to an important set of weights.

Important papers on the study of a neural network's loss function include [CHM+15] and [Kaw16]. In [CHM+15], the authors study the loss surface of a non-linear neural network. Under some assumptions, including independence of the network's input and the non-linearities, the loss function of the non-linear network can be reduced to a loss function for a linear network, which is then written as the Hamiltonian of a spin glass model. Recent ideas from spin glass theory are then used on the network's loss function, proving statements about the distribution of its critical values. This paper established an important first step in the direction of understanding the loss surface of neural networks, but leaves a gap between theory and practice due to its analysis of the expected value of the non-linear network over the behaviour of the ReLU non-linearities, which reduces the non-linear network to a linear one. Moreover, the authors assume that the components of the input data vector are independently distributed, which need not be true in important applications such as object recognition, where the pixel values of the input image are strongly correlated.

In [Kaw16], the results of [CHM+15] are improved and several strong assumptions in that paper are either weakened or eliminated.
This paper studies the loss surface of a linear neural network with a quadratic loss function, and establishes many strong conclusions, including the facts that every local minimum for such a network is a global minimum, there are no local maxima, and, under some assumptions on the rank of the product of weight matrices, the Hessian of the loss function has at least one negative eigenvalue at a saddle point, which has implications for the convergence of gradient descent via [LSJR16]. All of this is proven under some mild assumptions on the distribution of the labelled data. These results, although comprehensive for linear networks, only hold for a network with ReLU activation functions if one first takes an expectation over the network's non-linearities as in [CHM+15], and thus have limited applicability to non-linear networks.

Our contribution is to the understanding of the loss function for a neural network with ReLU non-linearities, quadratic error, and weight decay regularization. In contrast to [CHM+15] and [Kaw16], we do not take an expectation over the network's non-linearities, and instead study the network in its fully non-linear form. Without making any assumptions on the labelled data distribution, we prove that the loss function has no local maxima in any open neighbourhood where the loss function is infinitely differentiable and non-constant. We also prove that there is a non-empty open set where

1. the regularized loss function is piecewise strongly convex, with domains of convexity determined by the ReLU non-linearity,
2. every differentiable critical point of the regularized loss function is a local minimum, and
3. local minima of the regularized loss function are isolated.

This open set contains the origin, all points in a norm ball where training error is below a given threshold, and, under some conditions, all global minimizers of the regularized loss function. Our results also provide an explicit description of the set where piecewise strong convexity is guaranteed in terms of the size of the network's parameters and architecture, which allows for networks to be designed to maximize the size of this set. This set also implicitly depends on the training data.

We emphasize the similarity of these results to those proved in [CHM+15], which include the fact that there is a value of the Hamiltonian (i.e., the loss function of the network) below which critical points have a high probability of being of low index. Our results find a threshold of the same loss function under which every differentiable critical point of the regularized function in a bounded set is a local minimum, and therefore has index 0. Since we make no assumptions on the distribution of training data, we hope that our results provide at least a partial answer to the open problem given in [CLA15].

We also include an experimental section where we validate our theoretical results on a toy problem and show that, restricted to stochastic gradient descent trajectories, the loss function for commonly used neural networks on image classification problems is almost always piecewise strongly convex. This shows that even though the loss function may be non-convex outside the open set we have found, its gradient descent trajectories are similar to those of a strongly convex function, which hints at a general explanation for the success of first order optimization methods applied to neural networks.
1.1 Related Work

A related paper is [HM16], which shows that for linear residual networks, the only critical points in a norm ball centred at the origin are global minima, provided the training data is generated by a linear map with Gaussian noise. In contrast, we show that for a non-linear network with any training data, there is an open set containing the origin where every differentiable critical point is a local minimum.

Our result on the existence of regions on which the loss function is piecewise well behaved is similar to the results in [SS16], which shows that for a two layer network with a scalar output, weight space can be divided into convex regions, and on each of these the loss function has a basin-like structure which resembles a convex function in that every local minimum is a global minimum. Compared to [SS16], our conclusions apply to a smaller set in weight space, but are compatible with more network architectures (e.g. networks with more than two layers).

Another relevant paper is [SC16], which proves that for neural networks with a single hidden layer and leaky ReLUs, every differentiable local minimum of the training error is a global minimum under mild over-parametrization; they also have an extension of this result to deeper networks under stronger conditions. With small modifications to our proofs, our results can be applied to networks with leaky ReLUs, and thus our paper and [SC16] provide complementary viewpoints; if every differentiable local minimum is a global minimum, then differentiable local minima must satisfy our error threshold as long as the network has enough capacity to obtain zero training error, and thus piecewise strong convexity is guaranteed around these points. Our results in combination with [SC16] therefore describe, in some cases, the regularized loss surface around differentiable local minima of the training error.

Convergence of gradient methods for deep neural networks has recently been shown in [AZLS18] and [OS19] for some restricted architectures. Since our results on the structure of the loss function are limited to a subset of weight space, they say little about the convergence of gradient descent with arbitrary initialization. Our empirical results in Section 4, however, show that the loss function is well behaved outside this subset, and indicate methods for generating new convergence results.

2 Summary of Results

Here we give informal statements of our main theorems and outline the general argument. In Section 3.1 we start by showing that a network of ReLUs is infinitely differentiable (i.e. smooth) almost everywhere in weight space. Using the properties of subharmonic functions we then prove that the corresponding loss function is either constant or not smooth in any open neighbourhood of a local maximum. In Section 3.2, we estimate the second derivatives of the loss function in terms of the loss function itself, giving the following theorem in Section 3.3.

Theorem 1. Let ℓ be the loss function for a feed-forward neural network with ReLU non-linearities, a scalar output, a quadratic error function, and training data {a_i, f(a_i)}_{i=1}^N. Let ℓ_λ(W) = ℓ(W) + (λ/2)‖W‖² be the same loss function with weight decay regularization. Then there is an open set U, containing the origin, all points W in a norm ball where training error is below a certain threshold, and, in certain cases, all global minimizers of ℓ_λ, such that ℓ_λ is piecewise strongly convex on U.

See Theorem 3 for a more precise statement of this result.
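Stated loosely, and in the notation made precise in Section 3 below (‖·‖_* is the largest operator norm among the layer matrices, r bounds the Euclidean norms of the training inputs, and X is an arbitrary direction in weight space), the mechanism behind Theorem 1 is the second derivative estimate, valid on each region where the ReLU switching pattern is constant,

d²/dt²|_{t=0} ℓ_λ(W + tX) ≥ (λ − 2H(H + 1) r ‖W‖_*^{H−1} ℓ(W)^{1/2}) ‖X‖².

As soon as ℓ(W)^{1/2} ‖W‖_*^{H−1} < (λ − θ)/(2H(H + 1)r), the right hand side is at least θ‖X‖², and this is the defining inequality of the set U; Lemma 3 and Definition 2 make the constants and hypotheses precise.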
In Section 3.4, we use piecewise strong convexity to prove the following theorem.

Theorem 2. For the same network as in Theorem 1, every differentiable critical point of ℓ_λ in U is a local minimum, and every local minimum of ℓ_λ in U is an isolated local minimum.

We conclude our theoretical section with an application of Theorem 2. We end with Section 4, where we empirically validate our theoretical results on a toy problem and demonstrate, for more complex problems, that the loss function is almost always piecewise strongly convex on stochastic gradient descent trajectories.

3 Theoretical Results

Consider a neural network with an n_0-dimensional input a. The neural network will have H hidden layers, which means there are H + 2 layers in total, including the input layer (layer 0) and the output layer (layer H + 1). The width of the i-th layer will be denoted by n_i, and we will assume n_i > 1 for all i = 1, ..., H. We will consider scalar neural networks for simplicity (i.e. n_{H+1} = 1), though these results can be easily extended to non-scalar networks. The weights connecting the i-th layer to the (i + 1)-st layer will be given by matrices W_i ∈ R^{n_i × n_{i+1}}, with w^i_{j,k} giving the weight of the connection between neuron j in layer i and neuron k in layer i + 1. The neural network y : R^{n_0} × R^m → R is given by the function

y(a, W) = σ(W_H^T σ(W_{H−1}^T σ(... σ(W_0^T a) ...))),   (1)

where W = (W_0, ..., W_H) is the collection of all the network weights, and σ(x_1, ..., x_n) = (max(x_1, 0), ..., max(x_n, 0)) is the ReLU non-linearity. Let f : R^{n_0} → R be the target function, and let {a_i, f(a_i)}_{i=1}^N be a set of labelled training data. The loss function is given by ℓ : R^m → R,

ℓ(W) = (1/2N) ∑_{i=1}^N (f(a_i) − y(a_i, W))².

We will refer to ℓ(W) as the training error. The regularized loss function is given by ℓ_λ : R^m → R,

ℓ_λ(W) = ℓ(W) + (λ/2)‖W‖²,

where λ > 0, and ‖W‖ is the standard Euclidean norm of the weights.

We will start by writing y(a, W) as a matrix product. Define S_i : R^{n_0} × R^m → R^{n_i × n_i} by

S_i(a, W) = diag(h_i(W_{i−1}^T σ(... σ(W_0^T a) ...))),   (2)

where h_i : R^{n_i} → R^{n_i} is given by h_i(x_1, ..., x_{n_i}) = (1_{x_1>0}(x_1), ..., 1_{x_{n_i}>0}(x_{n_i})), and 1_{x>0} is the indicator function of the positive real numbers, equal to 1 if the argument is positive and zero otherwise. We will call the S_i matrices the switches of the network. It is clear that

y(a, W) = S_{H+1}(a, W) W_H^T S_H(a, W) W_{H−1}^T S_{H−1}(a, W) ... S_1(a, W) W_0^T a;

a numerical sketch of this identity appears after Lemma 2 below.

3.1 Differentiability of the neural network

Here we state that the network is differentiable, at least at the majority of points in weight space.

Lemma 1. For any a ∈ R^{n_0}, the map W ↦ y(a, W) is smooth almost everywhere in R^m.

Although the proof of this lemma is deferred to the supplementary materials, the crux of the argument is simple; the network is piecewise analytic as a function of the weights, with domains of definition delineated by the zero sets of the inputs to the ReLU non-linearities. These inputs are themselves locally analytic functions, and the zero set of a non-zero real analytic function has Lebesgue measure zero; for a concise proof of this fact, see [Mit15]. Given that y(a, W) is differentiable for almost all W, we compute some derivatives in the coordinate directions of weight space, and use the result to prove a lemma about the existence of local maxima.

Lemma 2. If W* is a local maximum of ℓ, then ℓ is not smooth on any open neighbourhood of W*, unless ℓ is constant on that neighbourhood.
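As promised above, here is a minimal numerical sketch, with arbitrary layer sizes and random weights, checking that the switch representation agrees with the ReLU forward pass of (1); it is included for illustration only and is not part of our experimental code.

```python
# Minimal numerical sketch: the switch product S_{H+1} W_H^T S_H ... S_1 W_0^T a
# agrees with the plain ReLU forward pass of (1).  Layer sizes and weights are
# arbitrary choices for illustration only.
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 5, 1]                                  # n_0, n_1, n_2, n_{H+1} with H = 2
W = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
a = rng.standard_normal(sizes[0])

def forward(a, W):
    """ReLU network of (1): sigma is applied after every layer, including the output."""
    x = a
    for Wi in W:
        x = np.maximum(Wi.T @ x, 0.0)
    return x

def forward_switches(a, W):
    """Same network written with the diagonal 0/1 switch matrices of (2)."""
    x = a
    for Wi in W:
        pre = Wi.T @ x
        S = np.diag((pre > 0).astype(float))          # switch S_i(a, W)
        x = S @ pre
    return x

assert np.allclose(forward(a, W), forward_switches(a, W))
```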
Lemma 2 is proved in the supplementary material, and involves showing that ℓ is a subharmonic function on any neighbourhood where it is smooth, and then using the maximum principle [McO03]. Note that the same proof applies to deep linear networks. These networks are everywhere differentiable, so we conclude that the loss functions of linear networks have no local maxima unless they are constant. This yields a simpler proof of Lemma 2.3 (iii) from [Kaw16].

3.2 Estimating Second Derivatives

This section is devoted to estimating the second derivatives of ℓ. This relates to the convexity of ℓ_λ, since

d²/dt²|_{t=0} ℓ_λ(W + tX) = d²/dt²|_{t=0} ℓ(W + tX) + λ‖X‖².

Hence, if there exists θ > 0 such that

d²/dt²|_{t=0} ℓ(W + tX) ≥ −θ‖X‖²,

then, provided λ > θ, the loss function ℓ_λ will be at least locally convex. To estimate the second derivative of the loss function in an arbitrary direction, we first define the following norm, which measures the maximum operator norm of the weight matrices; it is the same norm used in [HM16].

Definition 1. For a parameter W ∈ R^m, define the norm

‖W‖_* = max_{0 ≤ i ≤ H} ‖W_i‖_2,   (3)

where ‖·‖_2 denotes the standard operator norm induced by the Euclidean norm.

The proof of Lemma 1 shows that ℓ is everywhere equal to a loss function for a collection of linear neural networks which are obtained from y(a_i, W) by holding the switches {S_j(a_i, W)}_{i,j} constant. The next lemma estimates the second derivative of such a function in an arbitrary direction.

Lemma 3. Suppose ‖a_i‖_2 ≤ r for all 1 ≤ i ≤ N. Fix W ∈ R^m, and set φ : R^m → R as equal to ℓ, but with switches {S_j(a_i, W)}_{i,j} held constant as determined by W and the dataset {a_i}_{i=1}^N. The second derivative of φ in direction X ∈ R^m satisfies

d²/dt²|_{t=0} φ(W + tX) ≥ −2H(H + 1) ‖W‖_*^{H−1} ‖X‖² r φ(W)^{1/2}.   (4)

The proof of Lemma 3 is in the supplementary material, and uses standard tools.

3.3 Piecewise Convexity

Lemma 3 shows that the second derivative of ℓ in an arbitrary direction is bounded below by a term which depends on ℓ. Thus, as the loss function gets small, the second derivative of the regularized loss function will be overwhelmed by the positive part contributed by the weight decay term. This observation leads to the following definition and theorem.

Definition 2. For a fixed neural network architecture with H hidden layers, and r > 0 satisfying

‖a_i‖_2 ≤ r for all 1 ≤ i ≤ N,   (5)

define the open set U(λ, θ), where λ > θ > 0, by

U(λ, θ) = {W ∈ R^m | ℓ(W)^{1/2} ‖W‖_*^{H−1} < (λ − θ)/(2H(H + 1)r)}.   (6)

If we restrict to the norm ball B(R) = {W ∈ R^m | ‖W‖_* ≤ R}, we have

U(λ, θ) ∩ B(R) ⊇ {W ∈ B(R) | ℓ(W)^{1/2} < (λ − θ)/(2H(H + 1)r R^{H−1})}.   (7)

Thus, U(λ, θ) ∩ B(R) contains all points in B(R) obtaining training error below the threshold given in (7); this is the inclusion referred to in Theorem 1. The set U(λ, θ) is non-empty provided H > 1, as it then contains an open neighbourhood of 0 ∈ R^m. Further, U(λ, θ) contains all points W obtaining zero training error, if such points exist; [ZBH+16] gives some sufficient conditions on the network architecture for obtaining zero training error. Lemma 4 gives conditions under which U(λ, θ) contains all global minimizers of ℓ_λ.

Lemma 4. Let ε = inf_{W ∈ R^m} ℓ_λ(W). If

ε < λ^{1+1/H} / (2 (H(H + 1)r)^{2/H}),   (8)

then there exists θ such that U(λ, θ) contains all global minimizers of ℓ_λ.

Lemma 4 is proved in the supplementary materials. It follows since the term ℓ(W)^{1/2} ‖W‖_*^{H−1} is roughly bounded in terms of ℓ_λ(W). Observe that by evaluating ℓ_λ(W) at W = 0, we get the inequality

ε ≤ ℓ(0) = (1/2N) ∑_{i=1}^N f(a_i)².   (9)
Thus, as long as λ is large enough to satisfy

(1/2N) ∑_{i=1}^N f(a_i)² < λ^{1+1/H} / (2 (H(H + 1)r)^{2/H}),   (10)

Lemma 4 shows that there exists θ such that U(λ, θ) contains all global minimizers of ℓ_λ. Thus, for any problem, we have found a range of λ values for which our set U(λ, θ) is guaranteed to contain all global minimizers of the problem. Theorem 3 shows that we can characterise ℓ_λ on U(λ, θ).

Theorem 3 (Piecewise Strong Convexity). With U(λ, θ) defined as above, there exist closed sets B_i ⊆ R^m for i = 1, ..., L such that

⋃_{i=1}^L B_i ⊇ U(λ, θ),   (11)

and smooth functions φ_i : V_i ⊆ R^m → R, V_i open, satisfying

H(φ_i)(W) ⪰ θ I_m for all W ∈ V_i,   (12)

where H(φ_i) is the Hessian matrix, B_i ∩ U(λ, θ) ⊆ V_i for all i, and

ℓ_λ|_{B_i ∩ U(λ,θ)} = φ_i|_{B_i ∩ U(λ,θ)}.   (13)

The proof of Theorem 3 is in the supplementary materials. The sets B_i are obtained by enumerating all possible values of the switches {S_j(a_k, W)}_{j,k} as we vary W, and taking B_i as the topological closure of the set of all weights W giving the i-th value of the switches. The functions φ_i are obtained from ℓ_λ by fixing the switches according to the definition of B_i. Note that this theorem implies Theorem 1, with U given by U(λ, θ). This shows that when the training error is small enough in a bounded region, as prescribed by U(λ, θ) through (7), the function ℓ_λ is locally strongly convex. Note also that our estimates depend implicitly on the training data {a_i, f(a_i)}_{i=1}^N and the widths of the network's layers, since these quantities affect the size of the set U(λ, θ).

3.4 Isolated Local Minima

The next two lemmas are an application of Theorem 3, and prove Theorem 2.

Lemma 5. Every differentiable critical point of ℓ_λ in U(λ, θ) is an isolated local minimum.

Lemma 6. Every local minimum of ℓ_λ in U(λ, θ) is an isolated local minimum.

The proofs of these lemmas are given in the supplementary materials. Note the subtle difference between the two; Lemma 5 only applies to critical points where ℓ_λ is differentiable, while Lemma 6 also applies to non-differentiable local minima. Lemma 5 can be applied to prove that non-zero local minima obtaining training error below our threshold do not exist when the network is linear.

Lemma 7. If ℓ_λ is the regularized loss function for a linear neural network with H ≥ 1 and n_i > 1 for all i = 1, ..., H, then ℓ_λ has no non-zero critical points on the set U(λ) given by

U(λ) = ⋃_{θ>0} U(λ, θ) = {W ∈ R^m | ℓ(W)^{1/2} ‖W‖_*^{H−1} < λ/(2H(H + 1)r)}.   (14)

This lemma is proved in the supplementary materials, and involves the use of rotation matrices to demonstrate that any non-zero local minimum in U(λ) cannot be an isolated local minimum.

4 Experiments

We begin our experimental analysis with a regression experiment to validate our theoretical results; namely, that the region U(λ, θ) is accessed over the course of training. We used a target function f(x) = x/4, with 100 data points sampled uniformly in [−1, 1], and a neural network with H = 1, n_1 = 2, no biases, no ReLU on the output, weight decay parameter λ = 0.4, and learning rate η = 0.1. In this experiment, we measure the change in the regularized loss ℓ_λ from when gradient descent first enters U(λ) to convergence, as a fraction of the total change in ℓ_λ from initialization to convergence; this quantity measures the portion of the gradient descent trajectory contained in U(λ). We ran 1000 independent trials in PyTorch, distinguished by their random initializations, and found that the bound given in (6) is satisfied for some θ for 87.2% ± 22.3% (mean ± standard deviation) of the loss change over training.
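A minimal version of this toy experiment can be written as follows (shown schematically; initialization, stopping criteria, and logging are simplified relative to the runs reported above). With H = 1, the membership test for U(λ) in (14) reduces to ℓ(W)^{1/2} < λ/(2H(H + 1)r) with r = 1.

```python
# A minimal version of the toy experiment: f(x) = x/4, 100 points in [-1, 1],
# H = 1 hidden layer of width n_1 = 2, no biases, no ReLU on the output,
# weight decay lambda = 0.4, learning rate 0.1.  With H = 1 the membership
# test for U(lambda) in (14) reduces to ell(W)^{1/2} < lambda / (2H(H+1)r)
# with r = 1.  Initialization and stopping are simplified choices.
import torch

torch.manual_seed(0)
a = torch.rand(100, 1) * 2 - 1                       # inputs sampled uniformly in [-1, 1]
f = a / 4                                            # targets f(a_i)
lam, lr, H, r = 0.4, 0.1, 1, 1.0
threshold = lam / (2 * H * (H + 1) * r)              # bound on ell(W)^{1/2} from (14)

W0 = torch.randn(1, 2, requires_grad=True)           # weights from the input to the hidden layer
W1 = torch.randn(2, 1, requires_grad=True)           # weights from the hidden layer to the output

for step in range(2000):
    y = torch.relu(a @ W0) @ W1                      # no ReLU on the scalar output
    ell = 0.5 * ((f - y) ** 2).mean()                # training error ell(W)
    loss = ell + 0.5 * lam * (W0.pow(2).sum() + W1.pow(2).sum())   # ell_lambda(W)
    loss.backward()
    with torch.no_grad():
        for W in (W0, W1):
            W -= lr * W.grad
            W.grad.zero_()
    if step % 200 == 0:
        in_U = ell.item() ** 0.5 < threshold         # is the current iterate in U(lambda)?
        print(f"step {step:4d}  ell = {ell.item():.4f}  in U(lambda): {in_U}")
```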
A histogram of the loss change fractions over these trials is given in Figure 1, as well as a plot of the training error ℓ for a single initialization, selected to show the trajectory entering U(λ) over the course of training.

Figure 1: Left: Histograms of loss change fractions for f(x) = x/4 over 1000 trials. Right: A plot of the training loss ℓ versus time for a single training run on the same toy example. When the loss is below the horizontal dashed line, the current parameters W are in U(λ).

The histogram in Figure 1 shows that, for this simple experiment, most gradient descent trajectories inhabit U(λ) for almost the entire training run.

For more complicated architectures, we found that the bound in (6) is difficult to satisfy while still obtaining good test performance. The goal of the rest of this section is therefore to empirically determine when the loss function for standard neural network architectures is piecewise convex, which may occur before the bound in (6) is satisfied. Because the Hessian of the loss function ℓ_λ is difficult to compute for neural networks with many parameters, we turn our focus to ℓ_λ restricted to gradient descent trajectories, which is similar to the analysis in [SMG13]. To that end, let W : R_+ → R^m be a solution to the ODE

W'(t) = −∇ℓ_λ(W(t)),  W(0) = W_0,   (15)

and let γ(t) = ℓ_λ(W(t)) be the restriction of ℓ_λ to this curve. We have

γ''(t) = 2 ∇ℓ_λ(W(t))^T H(ℓ_λ)(W(t)) ∇ℓ_λ(W(t)).   (16)

If this is always positive, then ℓ_λ is piecewise convex on gradient descent trajectories. We are especially interested in the right hand side of (16) normalized by ‖∇ℓ_λ‖², due to Lemma 8.

Lemma 8. Suppose that γ(t) is C² over [0, t*] ⊆ R, and that

2 ∇ℓ_λ(W(t))^T H(ℓ_λ)(W(t)) ∇ℓ_λ(W(t)) / ‖∇ℓ_λ(W(t))‖² ≥ C for all t ∈ [0, t*].   (17)

Then we have the convergence estimate

‖∇ℓ_λ(W(t))‖² ≤ ‖∇ℓ_λ(W(0))‖² e^{−Ct} for all t ∈ [0, t*].   (18)

This lemma is proved in the supplementary material, and makes use of Grönwall's inequality. We may compute the second derivative of γ(t) while avoiding computing H(ℓ_λ) by observing that

2 ∇ℓ_λ^T H(ℓ_λ)(W) ∇ℓ_λ = ∇ℓ_λ^T ∇(‖∇ℓ_λ‖²),   (19)

and therefore we may compute (16) with two backpropagations. One notable difference between our theoretical work and the following experiments is that we focus on image classification tasks below, and hence the loss function is given by a composition of a softmax and cross entropy, as opposed to a squared Euclidean distance.

For the MNIST, CIFAR10, and CIFAR100 datasets, we produce two plots generated by training neural networks using stochastic gradient descent (SGD). The first is a plot of ℓ_λ and the normalized second derivative, as given by the left hand side of (17), versus training time, represented using logging periods, calculated over a single SGD trajectory. The second is a histogram which runs the same test over several trials, distinguished by their random initializations, and records the fraction of the change in γ(t) that occurs between the time it first becomes piecewise convex and convergence. In other words, let t_0 be the first time after which γ''(t) is always positive; in our experiments, such a t_0 always exists. Let t_1 and t = 0 be the terminal and initial times, respectively. To compute the histograms in the left column of Figure 2, we compute the loss change fraction

(γ(t_0) − γ(t_1)) / (γ(0) − γ(t_1))   (20)

over n separately initialized training trials on a given dataset.
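The normalized second derivative on the left hand side of (17) can be computed via (19) with two calls to torch.autograd.grad; the following is a minimal sketch of this computation (not the exact code used in our experiments; model, loss_fn, inputs, targets, and lam are placeholders for the architecture, criterion, mini-batch, and weight decay coefficient in use).

```python
# Sketch of the double backpropagation in (19): the normalized second derivative
# on the left hand side of (17) requires two backward passes and no explicit
# Hessian.  `model`, `loss_fn`, `inputs`, `targets`, and `lam` are placeholders
# for the architecture, criterion, mini-batch, and weight decay coefficient in use.
import torch

def normalized_second_derivative(model, loss_fn, inputs, targets, lam):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(inputs), targets) \
         + 0.5 * lam * sum(p.pow(2).sum() for p in params)          # ell_lambda(W)
    grads = torch.autograd.grad(loss, params, create_graph=True)    # first backward pass
    grad_sq = sum(g.pow(2).sum() for g in grads)                    # ||grad ell_lambda||^2
    hvp = torch.autograd.grad(grad_sq, params)                      # second pass: 2 H grad
    second_deriv = sum((g * h).sum() for g, h in zip(grads, hvp))   # = 2 grad^T H grad, cf. (16)
    return (second_deriv / grad_sq).item()
```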
We are interested in (20) as it measures how much of the SGD trajectory is spent in a piecewise convex regime. Our experimental set-up for each data set is summarized in Table 1, and all experiments were implemented in PyTorch with a mini-batch size of 128. Results are summarized in Table 2 in the form mean ± standard deviation, calculated across all trials for each data set. The column "Norm. 2nd Deriv. (%10)" in Table 2 is the 10th percentile of the normalized second derivatives in the piecewise convex regime, as calculated within each trial; we present this instead of the minimum normalized second derivative as the minimum is quite noisy, and as such may not reflect the behaviour of the second derivative in bulk.

Table 1: Experimental set-up

Dataset  | Model    | Epochs | Batch Norm. | λ        | Trials | Learn. Rate
MNIST    | LeNet-5  | 2      | No          | 5 × 10^-4 | 100    | 0.05
CIFAR10  | ResNet22 | 65     | Yes         | 1 × 10^-6 | 20     | 0.1/0.02
CIFAR100 | ResNet22 | 65     | Yes         | 1 × 10^-7 | 20     | 0.1/0.02

4.1 MNIST Experiments

In our first experiment we used the LeNet-5 architecture [LBB+98] on MNIST. The histogram generated for MNIST in Figure 2 shows that the loss function is piecewise convex over most of the SGD trajectory, independent of initialization. The upper right plot of Figure 2 confirms this; the normalized second derivative, in red, is negative for a very small amount of time, and is then consistently larger than 10, indicating that the loss function is piecewise strongly convex on most of this SGD trajectory. This is consistent with our theoretical results, as piecewise strong convexity only seems to occur after the loss is below a certain threshold.

4.2 CIFAR10 and CIFAR100 Experiments

This group of experiments deals with image classification on the CIFAR10 and CIFAR100 datasets. We trained the ResNet22 architecture with PyTorch, and produced the same plots as with MNIST. This version of ResNet22 for CIFAR10/CIFAR100 was taken from the code accompanying [FOA18], and has a learning rate of 0.1, decreasing to 0.02 after 60 epochs. In the middle left histogram in Figure 2, we see that over 20 trials on CIFAR10, every SGD trajectory was in the piecewise convex regime for its entire course. This is again reflected in the middle right plot of Figure 2, where we also see the normalized second derivative being consistently larger than 10. The test set accuracies are given in Table 2 for all three data sets, and are non-optimal; this is intentional, as we wanted to study the loss surface without optimization of hyperparameters. Note that Table 2 shows that large values of the normalized second derivative are consistently observed across all initializations, and for every dataset.

Table 2: Experimental results

Dataset  | Test Acc. (%) | Norm. 2nd Deriv. (%10) | Loss Frac.
MNIST    | 97.32 ± 1.00  | 21.19 ± 2.12           | 0.80 ± 0.09
CIFAR10  | 84.93 ± 0.35  | 8.82 ± 2.80            | 1.00 ± 0.00
CIFAR100 | 54.01 ± 0.53  | 11.80 ± 0.27           | 1.00 ± 0.02

To see if this behaviour is consistent on more challenging classification tasks, we tested the same network on CIFAR100; the results are summarized in the bottom row of Figure 2, and are consistent with CIFAR10, with only 1 trajectory out of 20 failing to be entirely piecewise convex. We conjecture that the extra convexity observed on CIFAR10/100 compared to MNIST is due to the presence of batch normalization, which has been shown to improve the optimization landscape [STIM18].
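The two per-trial statistics reported in Table 2, the loss change fraction (20) and the 10th percentile of the normalized second derivative in the piecewise convex regime, can be computed from a logged trajectory as in the following sketch (the exact bookkeeping of our runs may differ; gamma and d2 are placeholders for the logged loss values and normalized second derivatives).

```python
# Sketch of the per-trial summary statistics in Table 2.  `gamma` holds the logged
# loss values and `d2` the normalized second derivatives at the same logging
# periods; the exact bookkeeping of the reported runs may differ slightly.
import numpy as np

def summarize(gamma, d2):
    gamma, d2 = np.asarray(gamma), np.asarray(d2)
    neg = np.where(d2 <= 0)[0]
    t0 = neg[-1] + 1 if len(neg) else 0                  # first time after which d2 stays positive
    loss_frac = (gamma[t0] - gamma[-1]) / (gamma[0] - gamma[-1])   # loss change fraction (20)
    pct10 = np.percentile(d2[t0:], 10)                   # "Norm. 2nd Deriv. (%10)" column
    return loss_frac, pct10
```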
Figure 2: Left: Histograms of loss change fractions for three image classification problems over several trials. Right: A plot of the loss function and normalized second derivative versus time for a single training run on the same datasets. The normalized second derivative is clipped at 10 for legibility. Top: MNIST, Middle: CIFAR10, Bottom: CIFAR100. Best viewed in color.

5 Conclusion

We have established a number of facts about the critical points of the regularized loss function for a neural network with ReLU activation functions. In particular, there are, in a certain sense, no differentiable local maxima, and, on an important set in weight space, the loss function is piecewise strongly convex, with every differentiable critical point an isolated local minimum. We then applied this result to prove that non-zero local minima of a regularized loss function for linear networks obtaining training error below a certain threshold do not exist. Finally, we established the relevance of our theory to a toy problem through experiments, and demonstrated empirically that the loss function for standard neural networks is piecewise strongly convex along most SGD trajectories. Future directions of research include investigating if the re-parametrization offered by Residual Networks [HZRS15] can improve our analysis, as was observed in [HM16], as well as determining where ℓ_λ satisfies (17), which would allow for stronger theorems on the convergence of SGD.

6 Acknowledgements

We would like to thank the anonymous referees for their thoughtful reviews. Thanks also to Professor Adam Stinchcombe for the use of his GPUs. This work was partially supported by an NSERC PGS-D and by the Mackenzie King Open Scholarship.

References

[AZLS18] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parametrization. arXiv preprint arXiv:1811.03962, 2018.

[CHM+15] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

[CLA15] Anna Choromanska, Yann LeCun, and Gérard Ben Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756–1760, 2015.

[FOA18] Chris Finlay, Adam Oberman, and Bilal Abbasi. Improved robustness to adversarial examples using Lipschitz regularization of the loss. arXiv preprint arXiv:1810.00953, 2018.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.

[HM16] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[Kaw16] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
[LBB+98] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[LSJR16] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1246–1257, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.

[McO03] Robert C. McOwen. Partial Differential Equations: Methods and Applications. Prentice Hall, 2nd edition, 2003.

[Mit15] Boris Mityagin. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276, 2015.

[OS19] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.

[SC16] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[SMG13] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[SS16] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pages 774–782, 2016.

[STIM18] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pages 2483–2493, 2018.

[WLZ+17] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. CoRR, abs/1711.11585, 2017.

[ZBH+16] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.