# On Connected Sublevel Sets in Deep Learning

Quynh Nguyen¹

¹Department of Mathematics and Computer Science, Saarland University, Germany. Correspondence to: Quynh Nguyen.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

This paper shows that every sublevel set of the loss function of a class of deep overparameterized neural nets with piecewise linear activation functions is connected and unbounded. This implies that the loss has no bad local valleys and that all of its global minima are connected within a unique and potentially very large global valley.

1. Introduction

It has been commonly observed in deep learning that overparameterization helps in optimizing deep neural networks. In particular, several recent works (Allen-Zhu et al., 2018b; Du et al., 2018; Zou et al., 2018) have shown that if all the hidden layers of a deep network are polynomially wide in the number of training samples and the network depth, then (stochastic) gradient descent converges to a global minimum with zero training error. While these theoretical guarantees are conceptually interesting, it remains largely unclear why such simple local search algorithms succeed given the nonconvexity and worst-case NP-hardness of the problem. We are instead interested in the following question: Is there an underlying geometric structure of the loss function that intuitively supports the success of (stochastic) gradient descent in heavily overparameterized regimes?

In this paper, we shed light on this question by showing that every sublevel set of the loss is connected if one of the hidden layers is wide enough. Our key idea is to first prove that this property holds for linearly independent training data, and then to extend the result to arbitrary data by imposing an additional condition on the wide layer. In particular, we first show that if one of the hidden layers has at least as many neurons as there are training samples, then the loss function has no bad local valleys, in the sense that from any starting point in parameter space there is a continuous path on which the loss is non-increasing and gets arbitrarily close to its infimum (the global minimum). Second, if the first hidden layer has at least twice as many neurons as training samples, then we show that every sublevel set of the loss is connected. This is a stronger guarantee, as it implies not only that the loss function contains no bad local valleys where gradient descent might get stuck, but also that all finite global minima (whenever they exist) are connected within a unique global valley. Our results hold for standard deep fully connected networks with arbitrary convex losses and piecewise linear activation functions. All missing proofs are deferred to the appendix.

2. Background

Let $N$ be the number of training samples and $X = [x_1, \ldots, x_N]^T \in \mathbb{R}^{N \times d}$ the training data with $x_i \in \mathbb{R}^d$. Let $L$ be the number of layers of the network, $n_k$ the number of neurons at layer $k$, $d$ the input dimension, $m$ the output dimension, and $W_k \in \mathbb{R}^{n_{k-1} \times n_k}$ and $b_k \in \mathbb{R}^{n_k}$ the weight matrix and bias vector of layer $k$. By convention, $n_0 = d$ and $n_L = m$. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a continuous activation function specified later.
The network output at layer $k$ is the matrix $F_k \in \mathbb{R}^{N \times n_k}$ defined as
$$F_k = \begin{cases} X & k = 0, \\ \sigma\big(F_{k-1} W_k + \mathbf{1}_N b_k^T\big) & 1 \leq k \leq L-1, \\ F_{L-1} W_L + \mathbf{1}_N b_L^T & k = L. \end{cases} \quad (1)$$
Let $\theta := (W_l, b_l)_{l=1}^L$ be the set of all parameters. Let $\Omega_l$ be the space of $(W_l, b_l)$ for every layer $l \in [1, L]$, and $\Omega = \Omega_1 \times \ldots \times \Omega_L$ the whole parameter space. Let $\Omega_l^* \subseteq \Omega_l$ be the subset of parameters of layer $l$ for which the corresponding weight matrix has full rank, that is, $\Omega_l^* = \{(W_l, b_l) \mid W_l \text{ has full rank}\}$. In this paper, we often write $F_k(\theta)$ to denote the network output at layer $k$ as a function of $\theta$, but sometimes we drop the argument $\theta$ and write just $F_k$ if it is clear from the context. We also use the notation $F_k\big((W_1, b_1), \ldots, (W_L, b_L)\big)$ and $F_k\big((W_1, b_1), (W_l, b_l)_{l=2}^L\big)$.

The training loss $\Phi : \Omega \to \mathbb{R}$ is defined as
$$\Phi(\theta) = \phi(F_L(\theta)), \quad (2)$$
where $\phi : \mathbb{R}^{N \times m} \to \mathbb{R}$ is assumed to be any convex loss. Typical examples include the standard cross-entropy loss
$$\phi(F_L) = -\frac{1}{N} \sum_{i=1}^{N} \log\left(\frac{e^{(F_L)_{i y_i}}}{\sum_{k=1}^{m} e^{(F_L)_{ik}}}\right),$$
where $y_i$ is the ground-truth class of $x_i$, and the standard square loss for regression, $\phi(F_L) = \frac{1}{2}\|F_L - Y\|_F^2$, where $Y \in \mathbb{R}^{N \times m}$ is a given training output. In this paper, we denote $p^* = \inf_{G \in \mathbb{R}^{N \times m}} \phi(G)$, which serves as a lower bound on $\Phi$. Note that $p^*$ is fully determined by the choice of $\phi(\cdot)$ and thus is independent of the training data. Please also note that we make no assumption about the finiteness of $p^*$ in this paper, although for most practical loss functions, such as those mentioned above, one has $p^* = 0$.

We list below several assumptions on the activation function and will refer to them accordingly in our different results.

Assumption 2.1 $\sigma$ is strictly monotonic and $\sigma(\mathbb{R}) = \mathbb{R}$.

Note that Assumption 2.1 implies that $\sigma$ has a continuous inverse $\sigma^{-1} : \mathbb{R} \to \mathbb{R}$. The assumption is satisfied, e.g., by Leaky-ReLU.

Assumption 2.2 There do not exist non-zero coefficients $(\lambda_i, a_i)_{i=1}^p$ with $a_i \neq a_j$ for $i \neq j$ such that $\sigma(x) = \sum_{i=1}^{p} \lambda_i\, \sigma(x - a_i)$ for every $x \in \mathbb{R}$.

Assumption 2.2 is satisfied by every piecewise linear activation function except the linear one, as shown below.

Lemma 2.3 Assumption 2.2 is satisfied by any continuous piecewise linear activation function with at least two pieces, such as ReLU and Leaky-ReLU, and by the exponential linear unit (Clevert et al., 2016)
$$\sigma(x) = \begin{cases} x & x \geq 0, \\ \alpha(e^x - 1) & x < 0, \end{cases}$$
where $\alpha > 0$.

Throughout the rest of this paper, we make the following mild assumption on our training data.

Assumption 2.4 All the training samples are distinct.

A key concept of this paper is the sublevel set of a function.

Definition 2.5 For every $\alpha \in \mathbb{R}$, the $\alpha$-level set of $\Phi : \Omega \to \mathbb{R}$ is the preimage $\Phi^{-1}(\alpha) = \{\theta \in \Omega \mid \Phi(\theta) = \alpha\}$, and the $\alpha$-sublevel set of $\Phi$ is given as $L_\alpha = \{\theta \in \Omega \mid \Phi(\theta) \leq \alpha\}$.

Below we recall the standard definition of connected sets and some basic properties which are used in this paper.

Definition 2.6 (Connected set) A subset $S \subseteq \mathbb{R}^d$ is called connected if for every $x, y \in S$ there exists a continuous curve $r : [0, 1] \to S$ such that $r(0) = x$ and $r(1) = y$.

Proposition 2.7 Let $f : U \to V$ be a continuous map. If $A \subseteq U$ is connected then $f(A) \subseteq V$ is also connected.

Proposition 2.8 The Minkowski sum of two connected subsets $U, V \subseteq \mathbb{R}^n$, defined as $U + V = \{u + v \mid u \in U, v \in V\}$, is a connected set.

In this paper, $A^\dagger$ denotes the Moore–Penrose inverse of $A$. If $A$ has full row rank then it has a right inverse $A^\dagger = A^T(AA^T)^{-1}$ with $AA^\dagger = I$, and if $A$ has full column rank then it has a left inverse $A^\dagger = (A^TA)^{-1}A^T$ with $A^\dagger A = I$.
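To make the notation concrete, here is a minimal NumPy sketch of the network output $F_k$ from (1) and the loss $\Phi$ from (2). The function and variable names are illustrative only; Leaky-ReLU and the squared loss are used as one admissible choice of $\sigma$ and $\phi$, not as the paper's prescribed setup.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    # One admissible activation: strictly monotonic, sigma(R) = R (Assumption 2.1),
    # piecewise linear with two pieces (Assumption 2.2).
    return np.where(z >= 0, z, alpha * z)

def forward(X, params, k, sigma=leaky_relu):
    """Network output F_k from Eq. (1).

    X      : (N, d) training inputs.
    params : list of (W_l, b_l) for l = 1, ..., L.
    k      : layer index; k = len(params) gives the (affine) output layer F_L.
    """
    L = len(params)
    F = X                                          # F_0 = X
    for l, (W, b) in enumerate(params, start=1):
        pre = F @ W + np.ones((X.shape[0], 1)) @ b.reshape(1, -1)  # F_{l-1} W_l + 1_N b_l^T
        F = pre if l == L else sigma(pre)          # last layer is affine, hidden layers apply sigma
        if l == k:
            return F
    return F

def loss(X, Y, params):
    # Phi(theta) = phi(F_L(theta)) with the convex squared loss phi(G) = 0.5 * ||G - Y||_F^2.
    F_L = forward(X, params, k=len(params))
    return 0.5 * np.sum((F_L - Y) ** 2)

# Tiny example: N = 4 samples, d = 3, widths (n_1, n_2) = (5, 2), so L = 2 and m = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y = rng.normal(size=(4, 2))
params = [(rng.normal(size=(3, 5)), rng.normal(size=5)),
          (rng.normal(size=(5, 2)), rng.normal(size=2))]
print("Phi(theta) =", loss(X, Y, params))
```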
3. Key Result: Linearly Independent Data Leads to Connected Sublevel Sets

This section presents our key results for linearly independent data, which form the basis for the additional results in the next sections, where we analyze deep overparameterized networks with arbitrary data. Below we assume that the widths of the hidden layers are decreasing, i.e. $n_1 > \ldots > n_L$. Note that it is still possible to have $n_1 \geq d$ or $n_1 < d$. This condition is quite natural: in practice (see, e.g., Table 1 in (Nguyen & Hein, 2018)) the first hidden layer often has the largest number of neurons, after which the number of neurons decreases towards the output layer, which helps the network learn more compact representations at higher layers. We introduce the following property for a point $\theta = (W_l, b_l)_{l=1}^L$ in parameter space and refer to it later in our results and proofs.

Property 3.1 $W_l$ has full rank for every $l \in [2, L]$.

Our main result in this section is stated as follows.

Theorem 3.2 Let Assumption 2.1 hold, $\mathrm{rank}(X) = N$ and $n_1 > \ldots > n_L$ where $L \geq 2$. Then the following hold:
1. Every sublevel set of $\Phi$ is connected. Moreover, $\Phi$ can attain any value arbitrarily close to $p^*$.
2. Every non-empty connected component of every level set of $\Phi$ is unbounded.

We have the following decomposition of a sublevel set: $\Phi^{-1}((-\infty, \alpha]) = \Phi^{-1}(\alpha) \cup \Phi^{-1}((-\infty, \alpha))$. It follows that if $\Phi$ has unbounded level sets then its sublevel sets must also be unbounded. Note that the converse is not true; e.g., the standard Gaussian distribution function has unbounded sublevel sets but bounded level sets. Given that, the two statements of Theorem 3.2 together imply that every sublevel set of the loss must be both connected and unbounded. While the connectedness of sublevel sets implies that the loss function is rather well-behaved, the unboundedness of level sets intuitively implies that there are no bounded valleys in the loss surface, regardless of whether these valleys contain a global minimum or not. Clearly this also implies that $\Phi$ has no strict local minima/maxima.

In the remainder of this section, we present the proof of Theorem 3.2. The following lemmas will be helpful.

Lemma 3.3 Let the conditions of Theorem 3.2 hold and let $k \in [2, L]$. Then there is a continuous map $h : \Omega_2^* \times \ldots \times \Omega_k^* \times \mathbb{R}^{N \times n_k} \to \Omega_1$ which satisfies the following:
1. For every $\big((W_2, b_2), \ldots, (W_k, b_k), A\big) \in \Omega_2^* \times \ldots \times \Omega_k^* \times \mathbb{R}^{N \times n_k}$ it holds that
$$F_k\Big(h\big((W_l, b_l)_{l=2}^k, A\big),\, (W_l, b_l)_{l=2}^k\Big) = A.$$
2. For every $\theta^* = (W_l^*, b_l^*)_{l=1}^L$ for which all the matrices $(W_l^*)_{l=2}^k$ have full rank, there is a continuous curve from $\theta^*$ to $\Big(h\big((W_l^*, b_l^*)_{l=2}^k, F_k(\theta^*)\big),\, (W_l^*, b_l^*)_{l=2}^L\Big)$ on which the loss $\Phi$ is constant.

Proof: For every $\big((W_2, b_2), \ldots, (W_k, b_k), A\big) \in \Omega_2^* \times \ldots \times \Omega_k^* \times \mathbb{R}^{N \times n_k}$, define the value of the map $h$ as $h\big((W_l, b_l)_{l=2}^k, A\big) = (W_1, b_1)$, where $(W_1, b_1)$ is given by the following backward recursion:
$$[W_1^T, b_1]^T = [X, \mathbf{1}_N]^\dagger\, \sigma^{-1}(B_1), \qquad B_l = \big(\sigma^{-1}(B_{l+1}) - \mathbf{1}_N b_{l+1}^T\big)\, W_{l+1}^\dagger, \quad l \in [1, k-2],$$
$$B_{k-1} = \begin{cases} \big(A - \mathbf{1}_N b_L^T\big)\, W_L^\dagger & k = L, \\ \big(\sigma^{-1}(A) - \mathbf{1}_N b_k^T\big)\, W_k^\dagger & k \in [2, L-1]. \end{cases}$$
By our assumption $n_1 > \ldots > n_L$, it follows from the domain of $h$ that all the matrices $(W_l)_{l=2}^k$ have full column rank, and so they have a left inverse. Similarly, $[X, \mathbf{1}_N]$ has full row rank due to our assumption that $\mathrm{rank}(X) = N$, and so it has a right inverse. Moreover, $\sigma$ has a continuous inverse by Assumption 2.1. Thus $h$ is a continuous map, as it is a composition of continuous functions.
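The construction of $h$ is straightforward to implement numerically. Below is a minimal NumPy sketch (illustrative names, toy dimensions, Leaky-ReLU as an invertible $\sigma$ satisfying Assumption 2.1) that builds $(W_1, b_1)$ from a target output $A$ at layer $k$ via the backward recursion above, and then checks statement 1 by running the forward pass of (1).

```python
import numpy as np

def sigma(z, a=0.1):      # Leaky-ReLU, invertible on R
    return np.where(z >= 0, z, a * z)

def sigma_inv(z, a=0.1):
    return np.where(z >= 0, z, z / a)

def h(X, params_2_to_k, A, k, L):
    """Backward recursion of Lemma 3.3: returns (W_1, b_1) realizing F_k = A,
    given (W_l, b_l) for l = 2..k with full column rank and rank(X) = N."""
    N = X.shape[0]
    ones = np.ones((N, 1))
    W_k, b_k = params_2_to_k[-1]
    if k == L:                                   # base case B_{k-1}
        B = (A - ones @ b_k.reshape(1, -1)) @ np.linalg.pinv(W_k)
    else:
        B = (sigma_inv(A) - ones @ b_k.reshape(1, -1)) @ np.linalg.pinv(W_k)
    for (W, b) in reversed(params_2_to_k[:-1]):  # B_l for l = k-2, ..., 1
        B = (sigma_inv(B) - ones @ b.reshape(1, -1)) @ np.linalg.pinv(W)
    U = np.linalg.pinv(np.hstack([X, ones])) @ sigma_inv(B)   # [W_1^T, b_1]^T
    return U[:-1, :], U[-1, :]

def forward_k(X, W1, b1, params_2_to_k, k, L):
    N = X.shape[0]; ones = np.ones((N, 1))
    F = sigma(X @ W1 + ones @ b1.reshape(1, -1))
    for l, (W, b) in enumerate(params_2_to_k, start=2):
        pre = F @ W + ones @ b.reshape(1, -1)
        F = pre if l == L else sigma(pre)
    return F

# Example with N = 4 <= d = 5, decreasing widths n_1 = 3 > n_2 = 2 > n_3 = 1, and k = L = 3.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 5))                        # rank(X) = N = 4 almost surely
params = [(rng.normal(size=(3, 2)), rng.normal(size=2)),   # (W_2, b_2)
          (rng.normal(size=(2, 1)), rng.normal(size=1))]   # (W_3, b_3)
A = rng.normal(size=(4, 1))                        # arbitrary target output at layer k = L
W1, b1 = h(X, params, A, k=3, L=3)
print(np.allclose(forward_k(X, W1, b1, params, k=3, L=3), A))   # True: statement 1
```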
In the following, we prove that $h$ satisfies the two statements of the lemma.

1. Let $\big((W_2, b_2), \ldots, (W_k, b_k), A\big) \in \Omega_2^* \times \ldots \times \Omega_k^* \times \mathbb{R}^{N \times n_k}$. Since all the matrices $(W_l)_{l=2}^k$ have full column rank and $[X, \mathbf{1}_N]$ has full row rank, it holds that $W_l^\dagger W_l = I$ and $[X, \mathbf{1}_N][X, \mathbf{1}_N]^\dagger = I$, and thus we easily obtain from the above definition of $h$ that
$$B_1 = \sigma\big([X, \mathbf{1}_N][W_1^T, b_1]^T\big), \qquad B_{l+1} = \sigma\big(B_l W_{l+1} + \mathbf{1}_N b_{l+1}^T\big), \quad l \in [1, k-2],$$
$$A = \begin{cases} B_{L-1} W_L + \mathbf{1}_N b_L^T & k = L, \\ \sigma\big(B_{k-1} W_k + \mathbf{1}_N b_k^T\big) & k \in [2, L-1]. \end{cases}$$
One easily checks that this is exactly the definition of $F_k$ from (1), and thus $F_k\big(h((W_l, b_l)_{l=2}^k, A), (W_l, b_l)_{l=2}^k\big) = A$ for every $\big((W_2, b_2), \ldots, (W_k, b_k), A\big) \in \Omega_2^* \times \ldots \times \Omega_k^* \times \mathbb{R}^{N \times n_k}$.

2. Let $G_l : \mathbb{R}^{N \times n_{l-1}} \to \mathbb{R}^{N \times n_l}$ be defined as
$$G_l(Z) = \begin{cases} Z W_L^* + \mathbf{1}_N (b_L^*)^T & l = L, \\ \sigma\big(Z W_l^* + \mathbf{1}_N (b_l^*)^T\big) & l \in [2, L-1]. \end{cases}$$
For convenience, let us group the parameters of the first layer into a matrix $U = [W_1^T, b_1]^T \in \mathbb{R}^{(d+1) \times n_1}$, and similarly $U^* = [(W_1^*)^T, b_1^*]^T \in \mathbb{R}^{(d+1) \times n_1}$. Let $f : \mathbb{R}^{(d+1) \times n_1} \to \mathbb{R}^{N \times n_k}$ be the function of $(W_1, b_1)$ defined as $f(U) = G_k \circ G_{k-1} \circ \ldots \circ G_2 \circ G_1(U)$, where $G_1(U) = \sigma([X, \mathbf{1}_N] U)$. We note that this definition of $f$ is exactly $F_k$ from (1), but here we want to exploit the fact that $f$ is a function of $(W_1, b_1)$ alone, as all other parameters are fixed to the corresponding values of $\theta^*$.

Let $A = F_k(\theta^*)$. By definition we have $f(U^*) = A$ and thus $U^* \in f^{-1}(A)$. Let us denote $(W_1^h, b_1^h) = h\big((W_l^*, b_l^*)_{l=2}^k, A\big)$ and $U^h = [(W_1^h)^T, b_1^h]^T$. By applying the first statement of the lemma to $\big((W_2^*, b_2^*), \ldots, (W_k^*, b_k^*), A\big)$ we have $A = F_k\big((W_1^h, b_1^h), (W_l^*, b_l^*)_{l=2}^k\big) = f(U^h)$, which implies $U^h \in f^{-1}(A)$. So both $U^*$ and $U^h$ belong to $f^{-1}(A)$. The idea now is that if one can show that $f^{-1}(A)$ is a connected set, then there exists a connected path between $U^*$ and $U^h$ (and thus a path between $(W_1^*, b_1^*)$ and $(W_1^h, b_1^h)$) on which the output at layer $k$ is identically equal to $A$, hence the loss is invariant, which concludes the proof. In the following, we show that $f^{-1}(A)$ is indeed connected.

First, one observes that $\mathrm{range}(G_l) = \mathbb{R}^{N \times n_l}$ for every $l \in [2, k]$, since all the matrices $(W_l^*)_{l=2}^k$ have full column rank and $\sigma(\mathbb{R}) = \mathbb{R}$ by Assumption 2.1. Similarly, it follows from our assumption $\mathrm{rank}(X) = N$ that $\mathrm{range}(G_1) = \mathbb{R}^{N \times n_1}$. By the standard rules for compositions, we have $f^{-1}(A) = G_1^{-1} \circ G_2^{-1} \circ \ldots \circ G_k^{-1}(A)$, where all the inverse maps $G_l^{-1}$ have full domain. It holds that
$$G_k^{-1}(A) = \begin{cases} \big(A - \mathbf{1}_N (b_L^*)^T\big)(W_L^*)^\dagger + \{B \mid B W_L^* = 0\} & k = L, \\ \big(\sigma^{-1}(A) - \mathbf{1}_N (b_k^*)^T\big)(W_k^*)^\dagger + \{B \mid B W_k^* = 0\} & \text{else}, \end{cases}$$
which is a connected set in each case for the following reasons: 1) the kernel of any matrix is connected, 2) the Minkowski sum of two connected sets is connected by Proposition 2.8, and 3) the image of a connected set under a continuous map is connected by Proposition 2.7. By repeating the same argument for $k-1, \ldots, 2$ we conclude that $V := G_2^{-1} \circ \ldots \circ G_k^{-1}(A)$ is connected. Lastly, we have $G_1^{-1}(V) = [X, \mathbf{1}_N]^\dagger \sigma^{-1}(V) + \{B \mid [X, \mathbf{1}_N] B = 0\}$, which is also connected by the same arguments. Thus $f^{-1}(A)$ is a connected set.

Overall, we have shown in this proof that the set of $(W_1, b_1)$ which realize the same output at layer $k$ (with the parameters of the layers in between fixed) is a connected set. Since both $(W_1^*, b_1^*)$ and $h\big((W_l^*, b_l^*)_{l=2}^k, F_k(\theta^*)\big)$ belong to this solution set, there must exist a continuous path between them on which the loss $\Phi$ is constant.
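The connectivity of these preimages is easy to see concretely for a single affine layer: any two matrices realizing the same product differ by an element of the kernel $\{B \mid BW = 0\}$, and the straight segment between them stays inside the preimage. A small NumPy illustration (illustrative names, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_prev, n_next = 4, 5, 3
W = rng.normal(size=(n_prev, n_next))          # full column rank almost surely
b = rng.normal(size=n_next)
A = rng.normal(size=(N, n_next))               # target output of the affine layer
ones = np.ones((N, 1))

# One particular preimage: (A - 1_N b^T) W^dagger  (W^dagger is a left inverse here).
B0 = (A - ones @ b.reshape(1, -1)) @ np.linalg.pinv(W)

# Any K with K W = 0 (rows in the left null space of W) gives another preimage.
_, _, Vt = np.linalg.svd(W.T)
null_basis = Vt[n_next:, :]                    # rows span {v : v^T W = 0}
K = rng.normal(size=(N, null_basis.shape[0])) @ null_basis
B1 = B0 + K

# The whole segment between the two preimages realizes the same output A.
for lam in np.linspace(0.0, 1.0, 5):
    B = (1 - lam) * B0 + lam * B1
    assert np.allclose(B @ W + ones @ b.reshape(1, -1), A)
print("segment of preimages verified")
```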
The next lemma shows how to make the weight matrices full rank. Its proof can be found in the appendix.

Lemma 3.4 Let the conditions of Theorem 3.2 hold and let $\theta = (W_l, b_l)_{l=1}^L$ be any point in parameter space. Then there is a continuous curve which starts from $\theta$ and ends at some $\tilde{\theta} = (\tilde{W}_l, \tilde{b}_l)_{l=1}^L$ such that $\tilde{\theta}$ satisfies Property 3.1 and the loss $\Phi$ is constant on the curve.

Proposition 3.5 (Evard & Jafari, 1994) The set of full rank matrices $A \in \mathbb{R}^{m \times n}$ is connected for $m \neq n$.

3.1. Proof of Theorem 3.2

1. Let $L_\alpha$ be some sublevel set of $\Phi$. Let $\theta = (W_l, b_l)_{l=1}^L$ and $\theta' = (W_l', b_l')_{l=1}^L$ be arbitrary points in $L_\alpha$. Let $F_L = F_L(\theta)$ and $F_L' = F_L(\theta')$. These two quantities are computed at the beginning and never change during this proof; when we write $F_L(\theta'')$ for some $\theta''$ we mean the network output evaluated at $\theta''$. The main idea is to construct two continuous paths which start from $\theta$ and $\theta'$ respectively and are entirely contained in $L_\alpha$ (by ensuring that the loss along each individual path never exceeds its initial value), and then to show that they meet at a common point in $L_\alpha$, which implies that $L_\alpha$ is connected.

First of all, we can assume that both $\theta$ and $\theta'$ satisfy Property 3.1, because otherwise by Lemma 3.4 one can follow a continuous path from each point to some other point where this property holds, with the loss invariant on each path, so that we still stay inside $L_\alpha$. As $\theta$ and $\theta'$ satisfy Property 3.1, all the weight matrices $(W_l)_{l=2}^L$ and $(W_l')_{l=2}^L$ have full rank, and thus by applying the second statement of Lemma 3.3 with $k = L$ and using the same argument as above, we can simultaneously drive $\theta$ and $\theta'$ to the points
$$\theta = \Big(h\big((W_l, b_l)_{l=2}^L, F_L\big), (W_2, b_2), \ldots, (W_L, b_L)\Big), \qquad \theta' = \Big(h\big((W_l', b_l')_{l=2}^L, F_L'\big), (W_2', b_2'), \ldots, (W_L', b_L')\Big), \quad (3)$$
where $h : \Omega_2^* \times \ldots \times \Omega_L^* \times \mathbb{R}^{N \times m} \to \Omega_1$ is the continuous map from Lemma 3.3, which satisfies
$$F_L\Big(h\big((\hat{W}_l, \hat{b}_l)_{l=2}^L, A\big), (\hat{W}_l, \hat{b}_l)_{l=2}^L\Big) = A \quad \text{for every } \big((\hat{W}_2, \hat{b}_2), \ldots, (\hat{W}_L, \hat{b}_L), A\big) \in \Omega_2^* \times \ldots \times \Omega_L^* \times \mathbb{R}^{N \times m}. \quad (4)$$

Next, we construct a continuous path starting from $\theta$ on which the loss is constant and at whose end point all parameters from layer 2 up to layer $L$ are equal to the corresponding parameters of $\theta'$. Indeed, by applying Proposition 3.5 to the pairs of full rank matrices $(W_l, W_l')$ for every $l \in [2, L]$, we obtain continuous curves $W_2(\lambda), \ldots, W_L(\lambda)$ such that $W_l(0) = W_l$, $W_l(1) = W_l'$ and $W_l(\lambda)$ has full rank for every $\lambda \in [0, 1]$. For every $l \in [2, L]$, let $c_l : [0, 1] \to \Omega_l^*$ be the curve of layer $l$ defined as $c_l(\lambda) = \big(W_l(\lambda), (1 - \lambda) b_l + \lambda b_l'\big)$. We consider the curve $c : [0, 1] \to \Omega$ given by
$$c(\lambda) = \Big(h\big((c_l(\lambda))_{l=2}^L, F_L\big), c_2(\lambda), \ldots, c_L(\lambda)\Big).$$
One easily checks that $c(0) = \theta$, and $c$ is continuous as all the functions $h, c_2, \ldots, c_L$ are continuous. Moreover, $(c_2(\lambda), \ldots, c_L(\lambda)) \in \Omega_2^* \times \ldots \times \Omega_L^*$, and thus it follows from (4) that $F_L(c(\lambda)) = F_L$ for every $\lambda \in [0, 1]$, which leaves the loss invariant on $c$.

Since the curve $c$ starts at $\theta$ and has constant loss, we can reset $\theta$ to the end point of this curve by setting $\theta = c(1)$, while keeping $\theta'$ from (3), which together give us
$$\theta = \Big(h\big((W_l', b_l')_{l=2}^L, F_L\big), (W_2', b_2'), \ldots, (W_L', b_L')\Big), \qquad \theta' = \Big(h\big((W_l', b_l')_{l=2}^L, F_L'\big), (W_2', b_2'), \ldots, (W_L', b_L')\Big).$$
Now the parameters of $\theta$ and $\theta'$ coincide at all layers except the first. We will construct two continuous paths inside $L_\alpha$, say $c_1(\cdot)$ and $c_2(\cdot)$, which start from $\theta$ and $\theta'$ respectively, and show that they meet at a common point in $L_\alpha$. Let $\hat{Y} \in \mathbb{R}^{N \times m}$ be any matrix such that
$$\phi(\hat{Y}) \leq \min\big(\Phi(\theta), \Phi(\theta')\big). \quad (5)$$
Consider the curve $c_1 : [0, 1] \to \Omega$ defined as
$$c_1(\lambda) = \Big(h\big((W_l', b_l')_{l=2}^L, (1 - \lambda) F_L + \lambda \hat{Y}\big), (W_l', b_l')_{l=2}^L\Big).$$
Note that $c_1$ is continuous as $h$ is continuous, and it holds that $c_1(0) = \theta$ and $c_1(1) = \big(h((W_l', b_l')_{l=2}^L, \hat{Y}), (W_l', b_l')_{l=2}^L\big)$. It follows from the definition of $\Phi$, $c_1(\lambda)$ and (4) that $\Phi(c_1(\lambda)) = \phi(F_L(c_1(\lambda))) = \phi\big((1 - \lambda) F_L + \lambda \hat{Y}\big)$, and thus by convexity of $\phi$,
$$\Phi(c_1(\lambda)) \leq (1 - \lambda)\, \phi(F_L) + \lambda\, \phi(\hat{Y}) \leq (1 - \lambda)\, \Phi(\theta) + \lambda\, \Phi(\theta) = \Phi(\theta),$$
which implies that $c_1([0, 1])$ is entirely contained in $L_\alpha$. Similarly, we can construct a curve $c_2(\cdot)$ inside $L_\alpha$ which starts at $\theta'$ and satisfies $c_2(0) = \theta'$ and $c_2(1) = \big(h((W_l', b_l')_{l=2}^L, \hat{Y}), (W_l', b_l')_{l=2}^L\big)$.

So far, the curves $c_1$ and $c_2$ start at $\theta$ and $\theta'$ respectively and meet at the same point $c_1(1) = c_2(1)$. Overall, we have shown that starting from any two points in $L_\alpha$ we can find two continuous curves along which the loss never exceeds its initial value and which meet at a common point in $L_\alpha$; hence $L_\alpha$ is connected. Moreover, the point where they meet satisfies $\Phi(c_1(1)) = \phi(\hat{Y})$. By (5), $\phi(\hat{Y})$ can be chosen arbitrarily close to $p^*$, and thus $\Phi$ can attain any value arbitrarily close to $p^*$.
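The key step above is that along $c_1$ the realized network output is exactly the linear interpolation $(1-\lambda)F_L + \lambda\hat{Y}$ (by property (4) of $h$), so the loss along the path is controlled by convexity of $\phi$ alone. A minimal NumPy check of this bound (illustrative names; the squared loss is used as one admissible convex $\phi$):

```python
import numpy as np

rng = np.random.default_rng(3)
N, m = 6, 3
Y = rng.normal(size=(N, m))                    # training targets

def phi(G):
    # convex loss on the network output, here the squared loss
    return 0.5 * np.sum((G - Y) ** 2)

F_L = rng.normal(size=(N, m))                  # output realized by the current parameters theta
Y_hat = Y + 0.01 * rng.normal(size=(N, m))     # some target output with phi(Y_hat) <= Phi(theta), cf. (5)
assert phi(Y_hat) <= phi(F_L)

# Along c_1 the output is (1 - lam) F_L + lam Y_hat, so the loss obeys
# Phi(c_1(lam)) = phi((1 - lam) F_L + lam Y_hat) <= (1 - lam) phi(F_L) + lam phi(Y_hat) <= phi(F_L).
for lam in np.linspace(0, 1, 11):
    G = (1 - lam) * F_L + lam * Y_hat
    assert phi(G) <= (1 - lam) * phi(F_L) + lam * phi(Y_hat) + 1e-9
    assert phi(G) <= phi(F_L) + 1e-9
print("convexity bound along c_1 verified")
```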
2. Let $C$ be a non-empty connected component of some level set, i.e., $C \subseteq \Phi^{-1}(\alpha)$ for some $\alpha \in \mathbb{R}$. Let $\theta = (W_l, b_l)_{l=1}^L \in C$. Similarly to the above, we first use Lemma 3.4 to find a continuous path from $\theta$ to some other point where $W_2$ attains full rank, with the loss invariant on the path. From that point, we apply Lemma 3.3 with $k = 2$ to obtain another continuous path (with constant loss) which leads us to
$$\theta' := \Big(h\big((W_2, b_2), F_2(\theta)\big), (W_2, b_2), \ldots, (W_L, b_L)\Big),$$
where $h : \Omega_2^* \times \mathbb{R}^{N \times n_2} \to \Omega_1$ is a continuous map satisfying $F_2\big(h((\hat{W}_2, \hat{b}_2), A), (\hat{W}_l, \hat{b}_l)_{l=2}^L\big) = A$ for every point $(\hat{W}_l, \hat{b}_l)_{l=1}^L$ such that $\hat{W}_2$ has full rank, and every $A \in \mathbb{R}^{N \times n_2}$. Note that $\theta' \in C$ as the loss is constant on the above paths. Consider the continuous curve
$$c(\lambda) = \Big(h\big((\lambda W_2, b_2), F_2(\theta)\big), (\lambda W_2, b_2), (W_3, b_3), \ldots, (W_L, b_L)\Big), \quad \lambda \geq 1.$$
This curve starts at $\theta'$ since $c(1) = \theta'$. Moreover, $F_2(c(\lambda)) = F_2(\theta)$ for every $\lambda \geq 1$, and thus the loss is constant on this curve, meaning that the entire curve belongs to $C$. Lastly, since $W_2$ has full rank, the curve $c$ is unbounded as $\lambda \to \infty$, and thus $C$ is unbounded.

4. Large Width of One of the Hidden Layers Leads to No Bad Local Valleys

In the previous section, we showed that linearly independent training data essentially leads to connected sublevel sets. In this section, we show a first application of this result by proving the absence of bad local valleys in the loss landscape of deep and wide neural nets with arbitrary training data.

Definition 4.1 A local valley is a nonempty connected component of some strict sublevel set $L_\alpha^s := \{\theta \mid \Phi(\theta) < \alpha\}$. A bad local valley is a local valley on which the training loss $\Phi$ cannot be made arbitrarily close to $p^*$.

The main result of this section is stated as follows.

Theorem 4.2 Let Assumption 2.1 and Assumption 2.2 hold. Suppose that there exists a layer $k \in [1, L-1]$ such that $n_k \geq N$ and $n_{k+1} > \ldots > n_L$. Then the following hold:
1. The loss $\Phi$ has no bad local valleys.
2. If $k \leq L - 2$ then every local valley of $\Phi$ is unbounded.

Figure 1. Left: an example function with exponential tails where local minima do NOT exist but local/global valleys still exist. Right: a different function for which every local minimum is a global minimum, but bad local valleys still exist at both infinities (exponential tails), where local search algorithms easily get stuck.

The conditions of Theorem 4.2 are satisfied by any strictly monotonic and piecewise linear activation function such as Leaky-ReLU (see Lemma 2.3). We note that for Leaky-ReLU and other similarly homogeneous activation functions, the second statement of Theorem 4.2 is quite straightforward: if one scales all parameters of one hidden layer by some arbitrarily large factor $t > 0$ and the weight matrix of the following layer by $1/t$, then the network output is unchanged, and so every connected component of every level set (and likewise of every sublevel set) must extend to infinity and thus be unbounded.
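Here is a minimal NumPy check of this rescaling invariance for a Leaky-ReLU network (illustrative code, not from the paper): scaling $(W_2, b_2)$ by $t$ and $W_3$ by $1/t$ leaves the output unchanged because Leaky-ReLU is positively homogeneous.

```python
import numpy as np

def sigma(z, a=0.1):
    # Leaky-ReLU is positively homogeneous: sigma(t * z) = t * sigma(z) for every t > 0.
    return np.where(z >= 0, z, a * z)

def output(X, W1, b1, W2, b2, W3, b3):
    F1 = sigma(X @ W1 + b1)
    F2 = sigma(F1 @ W2 + b2)
    return F2 @ W3 + b3                     # affine output layer

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=6)
W2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)
W3, b3 = rng.normal(size=(3, 2)), rng.normal(size=2)

t = 1e6                                     # arbitrarily large scaling of hidden layer 2
out_scaled = output(X, W1, b1, t * W2, t * b2, W3 / t, b3)
print(np.allclose(out_scaled, output(X, W1, b1, W2, b2, W3, b3)))   # True (up to float error)
```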
However, for general non-homogeneous activation functions, the second statement of Theorem 4.2 is non-trivial.

The first statement of Theorem 4.2 implies that from any point in parameter space there is a continuous path on which the loss is non-increasing and gets arbitrarily close to $p^*$. At this point, one might wonder whether a function for which every local minimum is a global minimum automatically contains no bad local valleys. Unfortunately, this is not true in general. Indeed, Figure 1 shows two counter-examples in which a function has no bad local minima, yet bad local valleys still exist. The reason is that bad local valleys in general need not contain any local minimum or even critical point, even though they can have very large volume. Thus results on the global optimality of local minima alone, with no further information on the loss, are not sufficient to guarantee convergence of local search algorithms to a global minimum, especially if they are initialized in such regions. A similar phenomenon has been observed by (Sohl-Dickstein & Kawaguchi, 2019). We also note that while the statements of Theorem 4.2 imply the absence of strict local minima and of bounded local valleys, they do not rule out the possibility of non-strict bad local minima.

Overall, the statements of Theorem 4.2 together imply that every local valley is an unbounded global valley in which the loss can attain any value arbitrarily close to $p^*$. The high-level proof idea is that inside every local valley one can find a point where the feature representations of all training samples are linearly independent at the wide hidden layer, so that an application of Theorem 3.2 to the subnetwork from this wide layer to the output layer yields the result. We list below several technical lemmas which are helpful for proving Theorem 4.2.

Lemma 4.3 Let $(F, W, I)$ be such that $F \in \mathbb{R}^{N \times n}$, $W \in \mathbb{R}^{n \times p}$, $\mathrm{rank}(F) < n$, and $I \subseteq \{1, \ldots, n\}$ is a subset of columns of $F$ with $\mathrm{rank}(F(:, I)) = \mathrm{rank}(F)$; let $\bar{I}$ denote the remaining columns. Then there exists a continuous curve $c : [0, 1] \to \mathbb{R}^{n \times p}$ which satisfies the following:
1. $c(0) = W$ and $F c(\lambda) = F W$ for all $\lambda \in [0, 1]$.
2. The product $F c(1)$ is independent of $F(:, \bar{I})$.

Lemma 4.4 Let $v \in \mathbb{R}^n$ with $v_i \neq v_j$ for all $i \neq j$, and let $\sigma : \mathbb{R} \to \mathbb{R}$ satisfy Assumption 2.2. Let $S \subseteq \mathbb{R}^n$ be defined as $S = \{\sigma(v + b \mathbf{1}_n) \mid b \in \mathbb{R}\}$. Then it holds that $\mathrm{Span}(S) = \mathbb{R}^n$.

We recall the following standard result from topology (e.g., see Apostol (1974), Theorem 4.23, p. 82).

Proposition 4.5 Let $f : \mathbb{R}^m \to \mathbb{R}^n$ be a continuous function. If $U \subseteq \mathbb{R}^n$ is an open set then $f^{-1}(U)$ is also open.
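As a quick numerical illustration of Lemma 4.4 (not part of the paper), take Leaky-ReLU and a vector $v$ with distinct entries: already $n$ well-chosen biases, one beyond all the "kinks" $b = -v_i$ and one between each pair of consecutive kinks, produce $n$ linearly independent vectors $\sigma(v + b\mathbf{1}_n)$.

```python
import numpy as np

def sigma(z, a=0.1):
    return np.where(z >= 0, z, a * z)           # Leaky-ReLU (satisfies Assumption 2.2)

v = np.array([0.0, 1.0, 2.0, 3.0])              # distinct entries; kinks of b -> sigma(v + b) at b = -v_i
biases = np.array([1.0, -0.5, -1.5, -2.5])      # one bias beyond all kinks, then one between consecutive kinks

# Columns are sigma(v + b * 1_n) for the chosen biases.
S = np.stack([sigma(v + b) for b in biases], axis=1)
print(np.linalg.matrix_rank(S))                  # 4 = n, so these vectors already span R^n
```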
4.1. Proof of Theorem 4.2

1. Let $C$ be a connected component of some strict sublevel set $L_\alpha^s = \Phi^{-1}((-\infty, \alpha))$ for some $\alpha > p^*$. By Proposition 4.5, $L_\alpha^s$ is an open set and thus $C$ must be open.

Step 1: Finding a point inside $C$ where $F_k$ has full rank. Let $\theta \in C$ be such that the pre-activation outputs at the first hidden layer are distinct for all training samples. Note that such a $\theta$ always exists, since Assumption 2.4 implies that the set of $W_1$ for which this fails has Lebesgue measure zero, whereas $C$ has positive measure. This, combined with Assumption 2.1, implies that the (post-activation) outputs at the first hidden layer are distinct for all training samples. Now one can view these outputs at the first layer as inputs to the next layer and argue similarly. By repeating this argument and using the fact that $C$ has positive measure, we conclude that there exists $\theta \in C$ such that the outputs at layer $k - 1$ are distinct for all training samples, i.e., $(F_{k-1})_{i:} \neq (F_{k-1})_{j:}$ for every $i \neq j$.

Let $V$ be the pre-activation output (without the bias term) at layer $k$, in particular $V = F_{k-1} W_k = [v_1, \ldots, v_{n_k}] \in \mathbb{R}^{N \times n_k}$. Since $F_{k-1}$ has distinct rows, one can easily perturb $W_k$ so that every column of $V$ has distinct entries; note again that the set of $W_k$ for which this fails has measure zero, whereas $C$ has positive measure. Equivalently, $C$ must contain a point where every $v_j$ has distinct entries. To simplify notation, let $a = b_k \in \mathbb{R}^{n_k}$; then by definition,
$$F_k = \big[\sigma(v_1 + \mathbf{1}_N a_1), \ldots, \sigma(v_{n_k} + \mathbf{1}_N a_{n_k})\big]. \quad (6)$$
Suppose that $F_k$ has low rank, otherwise we are done. Let $r = \mathrm{rank}(F_k) < N \leq n_k$ and let $I \subseteq \{1, \ldots, n_k\}$ with $|I| = r$ be a subset of columns of $F_k$ such that $\mathrm{rank}(F_k(:, I)) = \mathrm{rank}(F_k)$, and let $\bar{I}$ denote the remaining columns. By applying Lemma 4.3 to $(F_k, W_{k+1}, I)$, we can follow a continuous path with invariant loss (i.e., entirely contained inside $C$) to arrive at some point where $F_k W_{k+1}$ is independent of $F_k(:, \bar{I})$. It remains to show how to change $F_k(:, \bar{I})$, by modifying certain parameters, so that $F_k$ attains full rank.

Let $p = |\bar{I}| = n_k - r$ and $\bar{I} = \{j_1, \ldots, j_p\}$. From (6) we have $F_k(:, \bar{I}) = [\sigma(v_{j_1} + \mathbf{1}_N a_{j_1}), \ldots, \sigma(v_{j_p} + \mathbf{1}_N a_{j_p})]$. Let $\mathrm{col}(\cdot)$ denote the column space of a matrix; then $\dim(\mathrm{col}(F_k(:, I))) = r < N$. Since $v_{j_1}$ has distinct entries, Lemma 4.4 implies that there must exist $a_{j_1} \in \mathbb{R}$ such that $\sigma(v_{j_1} + \mathbf{1}_N a_{j_1}) \notin \mathrm{col}(F_k(:, I))$, because otherwise $\mathrm{Span}\{\sigma(v_{j_1} + \mathbf{1}_N a_{j_1}) \mid a_{j_1} \in \mathbb{R}\} \subseteq \mathrm{col}(F_k(:, I))$, whose dimension is strictly smaller than $N$, contradicting Lemma 4.4. So we pick one such value for $a_{j_1}$ and follow the straight line segment between its current value and the new value. Note that the loss is invariant on this segment, since any change of $a_{j_1}$ only affects $F_k(:, \bar{I})$, which by the above construction has no influence on the loss. Moreover, at the new value of $a_{j_1}$ the rank of $F_k$ has increased by 1. Since $n_k \geq N$ by our assumption, it follows that $p \geq N - r$, and thus one can choose $a_{j_2}, \ldots, a_{j_{N-r}}$ in a similar way and finally obtain $\mathrm{rank}(F_k) = N$.
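The bias-adjustment step can be illustrated numerically (illustrative code and names, assuming Leaky-ReLU): starting from a rank-deficient feature matrix, scanning the bias of one redundant column for a value that escapes the column space of the independent columns immediately raises the rank.

```python
import numpy as np

def sigma(z, a=0.1):
    return np.where(z >= 0, z, a * z)           # Leaky-ReLU

N = 4
v = np.array([0.0, 1.0, 2.0, 3.0])              # a pre-activation column with distinct entries
F_I = np.stack([np.ones(N), v], axis=1)         # stand-in for F_k(:, I): rank r = 2 < N

def rank_with(bias):
    # rank of [F_k(:, I), sigma(v + 1_N * bias)], i.e. after resetting one redundant bias a_{j1}
    col = sigma(v + bias)
    return np.linalg.matrix_rank(np.column_stack([F_I, col]))

r = np.linalg.matrix_rank(F_I)
# With a large positive bias all entries are positive, sigma acts linearly,
# and the new column v + bias * 1_N stays inside col(F_I) = span{1_N, v}: no rank gain.
assert rank_with(10.0) == r
# Lemma 4.4 guarantees that some bias escapes col(F_k(:, I)); a coarse scan already finds one.
good = next(b for b in np.linspace(-5, 5, 41) if rank_with(b) > r)
print("bias", good, "raises the rank to", rank_with(good))
```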
Step 2: Applying Theorem 3.2 to the subnetwork above layer $k$. Suppose that we have found, from the previous step, a point $\theta'' = (W_l'', b_l'')_{l=1}^L \in C$ such that $F_k$ has full rank. Let $g : \Omega_{k+1} \times \ldots \times \Omega_L \to \mathbb{R}$ be given as
$$g\big((W_l, b_l)_{l=k+1}^L\big) = \Phi\Big((W_l'', b_l'')_{l=1}^k, (W_l, b_l)_{l=k+1}^L\Big). \quad (7)$$
We recall that $C$ is a connected component of $L_\alpha^s$; it holds that $g\big((W_l'', b_l'')_{l=k+1}^L\big) = \Phi(\theta'') < \alpha$. Now one can view $g$ as the new loss for the subnetwork from layer $k$ to layer $L$, with $F_k$ playing the role of the training data. Since $\mathrm{rank}(F_k) = N$ and $n_{k+1} > \ldots > n_L$, Theorem 3.2 implies that $g$ has connected sublevel sets and that $g$ can attain any value arbitrarily close to $p^*$. Let $\epsilon \in (p^*, \alpha)$ and let $(W_l', b_l')_{l=k+1}^L$ be any point such that $g\big((W_l', b_l')_{l=k+1}^L\big) \leq \epsilon$. Both $(W_l'', b_l'')_{l=k+1}^L$ and $(W_l', b_l')_{l=k+1}^L$ belong to the $\beta$-sublevel set of $g$, where $\beta := \max\big(\Phi(\theta''), \epsilon\big) < \alpha$, which is a connected set, so there must exist a continuous path from $(W_l'', b_l'')_{l=k+1}^L$ to $(W_l', b_l')_{l=k+1}^L$ on which the value of $g$ is not larger than $\beta < \alpha$. This combined with (7) implies that there is also a continuous path from $\theta'' = \big((W_l'', b_l'')_{l=1}^k, (W_l'', b_l'')_{l=k+1}^L\big)$ to $\theta' := \big((W_l'', b_l'')_{l=1}^k, (W_l', b_l')_{l=k+1}^L\big)$ on which the loss $\Phi$ is smaller than $\alpha$. The whole path thus lies in $L_\alpha^s$, and since $C$ is the connected component containing $\theta''$, it must hold that $\theta' \in C$. Moreover, we have $\Phi(\theta') = g\big((W_l', b_l')_{l=k+1}^L\big) \leq \epsilon$. Since $\epsilon$ can be chosen arbitrarily close to $p^*$, we conclude that the loss $\Phi$ can be made arbitrarily close to $p^*$ inside $C$, and thus $\Phi$ has no bad local valleys.

2. Let $C$ be a local valley, which by Definition 4.1 is a connected component of some strict sublevel set $L_\alpha^s = \Phi^{-1}((-\infty, \alpha))$. According to the proof of the first statement above, one can find a $\theta'' = (W_l'', b_l'')_{l=1}^L \in C$ such that $F_k(\theta'')$ has full rank. Now one can view $F_k(\theta'')$ as the training data for the subnetwork from layer $k$ to layer $L$. The new loss for this subnetwork is defined as $g\big((W_l, b_l)_{l=k+1}^L\big) = \Phi\big((W_l'', b_l'')_{l=1}^k, (W_l, b_l)_{l=k+1}^L\big)$. By our assumptions, $\sigma$ satisfies Assumption 2.1 and $n_{k+1} > \ldots > n_L$; thus this subnetwork with the new loss $g$ and training data $F_k(\theta'')$ satisfies all the conditions of Theorem 3.2 (note that $k \leq L - 2$ ensures the subnetwork has at least two layers), and so $g$ has unbounded level set components. Let $\beta := g\big((W_l'', b_l'')_{l=k+1}^L\big) = \Phi(\theta'') < \alpha$. Let $E$ be a connected component of the level set $g^{-1}(\beta)$ which contains $(W_l'', b_l'')_{l=k+1}^L$, and let
$$D = \Big\{\big((W_l'', b_l'')_{l=1}^k, (W_l, b_l)_{l=k+1}^L\big) \,\Big|\, (W_l, b_l)_{l=k+1}^L \in E\Big\}.$$
Then $D$ is connected and unbounded since $E$ is connected and unbounded. For every $\theta' \in D$ we have $\Phi(\theta') = \beta$, and thus $D \subseteq \Phi^{-1}(\beta) \subseteq L_\alpha^s$, where the last inclusion follows from $\beta < \alpha$. Moreover, we have $\theta'' = \big((W_l'', b_l'')_{l=1}^k, (W_l'', b_l'')_{l=k+1}^L\big) \in D$ and also $\theta'' \in C$; it follows that $D \subseteq C$, since $C$ is a maximal connected subset of $L_\alpha^s$. Since $D$ is unbounded, $C$ must also be unbounded, which finishes the proof.

5. Large Width of the First Hidden Layer Leads to Connected Sublevel Sets

In the previous section (Theorem 4.2), we showed that if one of the hidden layers has at least $N$ neurons then the loss function has no bad local valleys. In this section, we treat the special case where the first hidden layer has at least $2N$ neurons. Under this setting, the next theorem shows in addition that every sublevel set must also be connected.

Theorem 5.1 Let Assumption 2.1 and Assumption 2.2 hold. Suppose that $n_1 \geq 2N$ and $n_2 > \ldots > n_L$. Then every sublevel set of $\Phi$ is connected. Moreover, every connected component of every level set of $\Phi$ is unbounded.

Theorem 5.1 is a stronger result than Theorem 4.2, as it implies not only that there are no bad local valleys but also that there is a unique global valley; equivalently, all finite global minima (if they exist) must be connected. This can be seen as a generalization of a previous result (Venturi et al., 2018) from one-hidden-layer networks with the square loss to arbitrarily deep networks with convex losses. Interestingly, recent work (Draxler et al., 2018; Garipov et al., 2018) has shown that different global minima of several existing CNN architectures can be connected by a continuous path on which the loss has similar values. While our current results are not directly applicable to these models, we consider this a stepping stone towards such an extension in future work. As with the previous results, the unboundedness of level sets shown in the second statement of Theorem 5.1 implies that $\Phi$ has no bounded local valleys and no strict local extrema.

The proof of Theorem 5.1 relies on the following lemmas.

Lemma 5.2 Let $(X, W, b, V) \in \mathbb{R}^{N \times d} \times \mathbb{R}^{d \times n} \times \mathbb{R}^n \times \mathbb{R}^{n \times p}$ and let $\sigma : \mathbb{R} \to \mathbb{R}$ satisfy Assumption 2.2. Suppose that $n \geq N$ and $X$ has distinct rows.
Let $Z = \sigma(X W + \mathbf{1}_N b^T)\, V$. Then there is a continuous curve $c : [0, 1] \to \mathbb{R}^{d \times n} \times \mathbb{R}^n \times \mathbb{R}^{n \times p}$ with $c(\lambda) = (W(\lambda), b(\lambda), V(\lambda))$ satisfying:
1. $c(0) = (W, b, V)$.
2. $\sigma\big(X W(\lambda) + \mathbf{1}_N b(\lambda)^T\big)\, V(\lambda) = Z$ for all $\lambda \in [0, 1]$.
3. $\mathrm{rank}\big(\sigma(X W(1) + \mathbf{1}_N b(1)^T)\big) = N$.

Lemma 5.3 Let $(X, W, V, W') \in \mathbb{R}^{N \times d} \times \mathbb{R}^{d \times n} \times \mathbb{R}^{n \times p} \times \mathbb{R}^{d \times n}$ and let $\sigma : \mathbb{R} \to \mathbb{R}$ satisfy Assumption 2.2. Suppose that $n \geq 2N$ and $\mathrm{rank}(\sigma(X W)) = N$, $\mathrm{rank}(\sigma(X W')) = N$. Then there is a continuous curve $c : [0, 1] \to \mathbb{R}^{d \times n} \times \mathbb{R}^{n \times p}$ with $c(\lambda) = (W(\lambda), V(\lambda))$ which satisfies the following:
1. $c(0) = (W, V)$.
2. $\sigma(X W(\lambda))\, V(\lambda) = \sigma(X W)\, V$ for all $\lambda \in [0, 1]$.
3. $W(1) = W'$.

5.1. Proof of Theorem 5.1

Let $\theta = (W_l, b_l)_{l=1}^L$ and $\theta' = (W_l', b_l')_{l=1}^L$ be arbitrary points in some sublevel set $L_\alpha$. It is sufficient to show that there is a connected path between $\theta$ and $\theta'$ on which the loss is not larger than $\alpha$. The output at the first layer is given by $F_1(\theta) = \sigma([X, \mathbf{1}_N][W_1^T, b_1]^T)$ and $F_1(\theta') = \sigma([X, \mathbf{1}_N][W_1'^T, b_1']^T)$.

First, by applying Lemma 5.2 to $(X, W_1, b_1, W_2)$, we can assume that $F_1(\theta)$ has full rank, because otherwise there is a continuous path from $\theta$ to some other point where the rank condition is fulfilled, with the loss invariant on the path, so we can reset $\theta$ to this new point. Similarly, we can assume that $F_1(\theta')$ has full rank. Next, by applying Lemma 5.3 to the tuple $\big([X, \mathbf{1}_N], [W_1^T, b_1]^T, W_2, [W_1'^T, b_1']^T\big)$ and using the same argument as above, we can drive $\theta$ to some other point where the parameters of the first hidden layer agree with the corresponding values of $\theta'$. So we can assume w.l.o.g. that $(W_1, b_1) = (W_1', b_1')$. Note that in this step we modified $\theta$ but not $\theta'$, and thus $F_1(\theta')$ still has full rank.

Once the first hidden layers of $\theta$ and $\theta'$ coincide, one can view the output of this layer, say $F_1 := F_1(\theta) = F_1(\theta')$ with $\mathrm{rank}(F_1) = N$, as the new training data for the subnetwork from layer 1 to layer $L$ (with $(W_1, b_1)$ fixed). This subnetwork and the new data $F_1$ satisfy all the conditions of Theorem 3.2, and so the loss $\Phi$ restricted to this subnetwork has connected sublevel sets, which implies that there is a connected path between $(W_l, b_l)_{l=2}^L$ and $(W_l', b_l')_{l=2}^L$ on which the loss is not larger than $\alpha$. This shows that there is also a connected path between $\theta$ and $\theta'$ in $L_\alpha$, and so $L_\alpha$ must be connected.

To show that every level set component of $\Phi$ is unbounded, let $\theta \in \Omega$ be an arbitrary point. Denote $F_1 = F_1(\theta)$ and let $I \subseteq \{1, \ldots, n_1\}$ be such that $\mathrm{rank}(F_1(:, I)) = \mathrm{rank}(F_1)$. Since $\mathrm{rank}(F_1) \leq \min(N, n_1) < n_1$, we can apply Lemma 4.3 to the tuple $(F_1, W_2, I)$ to find a continuous path $W_2(\lambda)$ which drives $\theta$ to some other point where the pre-activation output at the second layer, $F_1 W_2$, is independent of $F_1(:, \bar{I})$. Note that the network output at the second layer is invariant on this path, and hence the entire path belongs to the same level set component as $\theta$. From that point, one can scale $\big(W_1(:, \bar{I}), b_1(\bar{I})\big)$ to arbitrarily large values without affecting the output. Since this path has constant loss and is unbounded, it follows that every level set component of $\Phi$ is unbounded.

6. Extension to Other Activation Functions

One way to extend our results from the previous sections to other activation functions, such as ReLU and the exponential linear unit, is to remove Assumption 2.1 from the previous theorems, as shown next.

Theorem 6.1 Let $K = \min\{n_1, \ldots, n_{L-1}\}$. Then the following hold under Assumption 2.2:
1. If $K \geq N$ then the loss $\Phi$ has no bad local valleys.
2. If $K \geq 2N$ then every sublevel set of $\Phi$ is connected.
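To make the various width conditions concrete, here is a small illustrative helper (not from the paper) that, given the layer widths $(n_1, \ldots, n_L)$ and the number of samples $N$, reports which of the stated sufficient conditions of Theorems 4.2, 5.1 and 6.1 holds (Theorem 3.2 additionally requires $\mathrm{rank}(X) = N$ and is therefore omitted here).

```python
def strictly_decreasing(widths):
    return all(a > b for a, b in zip(widths, widths[1:]))

def which_results_apply(widths, N):
    """widths = [n_1, ..., n_L]: hidden widths followed by the output width n_L = m."""
    L = len(widths)
    results = []
    # Theorem 4.2: some layer k in [1, L-1] with n_k >= N and n_{k+1} > ... > n_L.
    if any(widths[k] >= N and strictly_decreasing(widths[k + 1:]) for k in range(L - 1)):
        results.append("Thm 4.2: no bad local valleys")
    # Theorem 5.1: n_1 >= 2N and n_2 > ... > n_L.
    if widths[0] >= 2 * N and strictly_decreasing(widths[1:]):
        results.append("Thm 5.1: connected sublevel sets")
    # Theorem 6.1 (no Assumption 2.1 needed): all hidden widths >= N, resp. >= 2N.
    K = min(widths[:-1])
    if K >= N:
        results.append("Thm 6.1(1): no bad local valleys")
    if K >= 2 * N:
        results.append("Thm 6.1(2): connected sublevel sets")
    return results

# Example: N = 100 samples, hidden widths (250, 120, 50), output dimension m = 10.
print(which_results_apply([250, 120, 50, 10], N=100))
```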
It is clear that the conditions of Theorem 6.1 are much stronger than those of our previous theorems, as all the hidden layers need to be sufficiently wide. Nevertheless, it is worth mentioning that this kind of technical condition (i.e., all hidden layers sufficiently wide) has also been used in recent work (Allen-Zhu et al., 2018b; Du et al., 2018; Zou et al., 2018) to establish convergence of gradient descent methods to a global minimum. From a theoretical standpoint, these results suggest that Leaky-ReLU may in general lead to a much easier loss surface than ReLU.

7. Related Work

Many interesting theoretical results have been developed on the loss surface of neural networks (Livni et al., 2014; Choromanska et al., 2015; Haeffele & Vidal, 2017; Hardt & Ma, 2017; Xie et al., 2017; Yun et al., 2017; Lu & Kawaguchi, 2017; Pennington & Bahri, 2017; Zhou & Liang, 2018; Liang et al., 2018b;a; Zhang et al., 2018; Nouiehed & Razaviyayn, 2018; Laurent & von Brecht, 2018). There is also a whole line of research studying the convergence of learning algorithms for training neural networks (Andoni et al., 2014; Sedghi & Anandkumar, 2015; Janzamin et al., 2016; Gautier et al., 2016; Brutzkus & Globerson, 2017; Soltanolkotabi, 2017; Soudry & Hoffer, 2017; Tian, 2018; Wang et al., 2018; Ji & Telgarsky, 2019; Arora et al., 2019; Allen-Zhu et al., 2018a; Bartlett et al., 2018; Chizat & Bach, 2018), and further work studying generalization properties, which is however beyond the scope of this paper.

The closest existing result is the work of (Venturi et al., 2018), who show that if the number of hidden neurons is greater than the intrinsic dimension of the network, defined as the dimension of some function space, then the loss has no spurious valleys, and furthermore, if the number of hidden neurons is greater than twice the intrinsic dimension, then every sublevel set is connected. Their results apply to one-hidden-layer networks with the population risk and square loss. As noted by the authors, an extension of this result, in particular of the notion of intrinsic dimension, to multi-layer networks would require the number of neurons to grow exponentially with depth. Prior to this, (Safran & Shamir, 2016) showed that for a class of deep fully connected networks with common loss functions, there exists a continuous descent path between specific pairs of points in parameter space which satisfy certain conditions, and these conditions can be shown to hold with probability 1/2 as the width of the last hidden layer goes to infinity.

Most closely related in terms of the setting are the works of (Nguyen & Hein, 2017; 2018), who analyze the optimization landscape of standard deep and wide (convolutional) neural networks for multiclass problems. They both assume that the network has a wide hidden layer $k$ with $n_k \geq N$. This condition has recently been relaxed to $n_1 + \ldots + n_{L-1} \geq N$ by using flexible skip-connections (Nguyen et al., 2019). All of these results require real analytic activation functions, and thus are not applicable to the class of piecewise linear activations analyzed in this paper. Moreover, while the previous works focus on global optimality of critical points, this paper characterizes the sublevel sets of the loss function, which gives further insight and intuition into the underlying geometric structure of the optimization landscape.
Conclusion. We have shown that every sublevel set of the training loss function of a certain class of deep overparameterized neural networks is connected and unbounded.

References

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv:1811.04918, 2018a.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. arXiv:1811.03962, 2018b.

Andoni, A., Panigrahy, R., Valiant, G., and Zhang, L. Learning polynomials with neural networks. ICML, 2014.

Apostol, T. M. Mathematical Analysis. Addison-Wesley, 1974.

Arora, S., Cohen, N., Golowich, N., and Hu, W. A convergence analysis of gradient descent for deep linear neural networks. ICLR, 2019.

Bartlett, P., Helmbold, D., and Long, P. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. ICML, 2018.

Brutzkus, A. and Globerson, A. Globally optimal gradient descent for a ConvNet with Gaussian inputs. ICML, 2017.

Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. NIPS, 2018.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. AISTATS, 2015.

Clevert, D., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). ICLR, 2016.

Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. Essentially no barriers in neural network energy landscape. ICML, 2018.

Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. arXiv:1811.03804, 2018.

Evard, J. C. and Jafari, F. The set of all m×n rectangular real matrices of rank r is connected by analytic regular arcs. Proceedings of the American Mathematical Society, 1994.

Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D., and Wilson, A. G. Loss surfaces, mode connectivity, and fast ensembling of DNNs. NIPS, 2018.

Gautier, A., Nguyen, Q., and Hein, M. Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods. NIPS, 2016.

Haeffele, B. D. and Vidal, R. Global optimality in neural network training. CVPR, 2017.

Hardt, M. and Ma, T. Identity matters in deep learning. ICLR, 2017.

Janzamin, M., Sedghi, H., and Anandkumar, A. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv:1506.08473, 2016.

Ji, Z. and Telgarsky, M. Gradient descent aligns the layers of deep linear networks. ICLR, 2019.

Laurent, T. and von Brecht, J. Deep linear networks with arbitrary loss: All local minima are global. ICML, 2018.

Liang, S., Sun, R., Lee, J. D., and Srikant, R. Adding one neuron can eliminate all bad local minima. arXiv:1805.08671, 2018a.

Liang, S., Sun, R., Li, Y., and Srikant, R. Understanding the loss surface of neural networks for binary classification. ICML, 2018b.

Livni, R., Shalev-Shwartz, S., and Shamir, O. On the computational efficiency of training neural networks. NIPS, 2014.

Lu, H. and Kawaguchi, K. Depth creates no bad local minima. arXiv:1702.08580, 2017.

Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. ICML, 2017.

Nguyen, Q. and Hein, M. Optimization landscape and expressivity of deep CNNs. ICML, 2018.

Nguyen, Q., Mukkamala, M. C., and Hein, M. On the loss landscape of a class of deep neural networks with no bad local valleys. ICLR, 2019.

Nouiehed, M. and Razaviyayn, M. Learning deep models: Critical points and local openness. ICLR Workshop, 2018.

Pennington, J. and Bahri, Y. Geometry of neural network loss surfaces via random matrix theory. ICML, 2017.

Safran, I. and Shamir, O. On the quality of the initial basin in overspecified networks. ICML, 2016.

Sedghi, H. and Anandkumar, A. Provable methods for training neural networks with sparse connectivity. ICLR Workshop, 2015.

Sohl-Dickstein, J. and Kawaguchi, K. Eliminating all bad local minima from loss landscapes without even adding an extra unit. arXiv:1901.03909, 2019.

Soltanolkotabi, M. Learning ReLUs via gradient descent. NIPS, 2017.

Soudry, D. and Hoffer, E. Exponentially vanishing sub-optimal local minima in multilayer neural networks. ICLR Workshop, 2018.

Tian, Y. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. ICML, 2018.

Venturi, L., Bandeira, A. S., and Bruna, J. Spurious valleys in two-layer neural network optimization landscapes. arXiv:1802.06384v2, 2018.

Wang, G., Giannakis, G. B., and Chen, J. Learning ReLU networks on linearly separable data: Algorithm, optimality, and generalization. arXiv:1808.04685, 2018.

Xie, B., Liang, Y., and Song, L. Diverse neural network learns true target functions. AISTATS, 2017.

Yun, C., Sra, S., and Jadbabaie, A. Global optimality conditions for deep neural networks. ICLR, 2017.

Zhang, H., Shao, J., and Salakhutdinov, R. Deep neural networks with multi-branch architectures are less non-convex. arXiv:1806.01845, 2018.

Zhou, Y. and Liang, Y. Critical points of neural networks: Analytical forms and landscape properties. ICLR, 2018.

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv:1811.08888, 2018.