# Deep Learning without Poor Local Minima

Kenji Kawaguchi
Massachusetts Institute of Technology
kawaguch@mit.edu

Abstract

In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and non-convex optimization, there is still a gap between theory and practice.

1 Introduction

Deep learning has been a great practical success in many fields, including the fields of computer vision, machine learning, and artificial intelligence. In addition to its practical success, theoretical results have shown that deep learning is attractive in terms of its generalization properties (Livni et al., 2014; Mhaskar et al., 2016). That is, deep learning introduces good function classes that may have a low capacity in the VC sense while being able to represent target functions of interest well. However, deep learning requires us to deal with seemingly intractable optimization problems. Typically, training of a deep model is conducted via non-convex optimization. Because finding a global minimum of a general non-convex function is an NP-complete problem (Murty & Kabadi, 1987), a hope is that a function induced by a deep model has some structure that makes the non-convex optimization tractable. Unfortunately, it was shown in 1992 that training a very simple neural network is indeed NP-hard (Blum & Rivest, 1992). In the past, such theoretical concerns in optimization played a major role in shrinking the field of deep learning. That is, many researchers instead favored classical machine learning models (with or without a kernel approach) that require only convex optimization. While the recent great practical successes have revived the field, we do not yet know what makes optimization in deep learning tractable in theory.

In this paper, as a step toward establishing the optimization theory for deep learning, we prove a conjecture noted in (Goodfellow et al., 2016) for deep linear networks, and also address an open problem announced in (Choromanska et al., 2015b) for deep nonlinear networks. Moreover, for
both the conjecture and the open problem, we prove more general and tighter statements than those previously given (in the ways explained in each section).

2 Deep linear neural networks

Given the absence of a theoretical understanding of deep nonlinear neural networks, Goodfellow et al. (2016) noted that it is beneficial to theoretically analyze the loss functions of simpler models, i.e., deep linear neural networks. The function class of a linear multilayer neural network only contains functions that are linear with respect to inputs. However, their loss functions are non-convex in the weight parameters and thus nontrivial. Saxe et al. (2014) empirically showed that the optimization of deep linear models exhibits similar properties to those of the optimization of deep nonlinear models. Ultimately, for theoretical development, it is natural to start with linear models before working with nonlinear models (as noted in Baldi & Lu, 2012), and yet even for linear models, the understanding is scarce when the models become deep.

2.1 Model and notation

We begin by defining the notation. Let $H$ be the number of hidden layers, and let $(X, Y)$ be the training data set, with $Y \in \mathbb{R}^{d_y \times m}$ and $X \in \mathbb{R}^{d_x \times m}$, where $m$ is the number of data points. Here, $d_y \ge 1$ and $d_x \ge 1$ are the number of components (or dimensions) of the outputs and inputs, respectively. Let $\Sigma = Y X^T (X X^T)^{-1} X Y^T$. We denote the model (weight) parameters by $W$, which consists of the entries of the parameter matrices corresponding to each layer: $W_{H+1} \in \mathbb{R}^{d_y \times d_H}, \dots, W_k \in \mathbb{R}^{d_k \times d_{k-1}}, \dots, W_1 \in \mathbb{R}^{d_1 \times d_x}$. Here, $d_k$ represents the width of the $k$-th layer, where the $0$-th layer is the input layer and the $(H+1)$-th layer is the output layer (i.e., $d_0 = d_x$ and $d_{H+1} = d_y$). Let $I_{d_k}$ be the $d_k \times d_k$ identity matrix. Let $p = \min(d_H, \dots, d_1)$ be the smallest width of a hidden layer. We denote the $(j, i)$-th entry of a matrix $M$ by $M_{j,i}$. We also denote the $j$-th row vector of $M$ by $M_{j,\cdot}$ and the $i$-th column vector of $M$ by $M_{\cdot,i}$.

We can then write the output of a feedforward deep linear model, $\hat{Y}(W, X) \in \mathbb{R}^{d_y \times m}$, as
$$\hat{Y}(W, X) = W_{H+1} W_H W_{H-1} \cdots W_2 W_1 X.$$
We consider one of the most widely used loss functions, the squared error loss:
$$\bar{L}(W) = \frac{1}{2} \sum_{i=1}^{m} \big\| \hat{Y}(W, X)_{\cdot,i} - Y_{\cdot,i} \big\|_2^2 = \frac{1}{2} \big\| \hat{Y}(W, X) - Y \big\|_F^2,$$
where $\|\cdot\|_F$ is the Frobenius norm. Note that $\frac{2}{m}\bar{L}(W)$ is the usual mean squared error, for which all of our results hold as well, since multiplying $\bar{L}(W)$ by a constant in $W$ results in an equivalent optimization problem.

2.2 Background

Recently, Goodfellow et al. (2016) remarked that when Baldi & Hornik (1989) proved Proposition 2.1 for shallow linear networks, they stated Conjecture 2.2 without proof for deep linear networks.

Proposition 2.1 (Baldi & Hornik, 1989: shallow linear network) Assume that $H = 1$ (i.e., $\hat{Y}(W, X) = W_2 W_1 X$), assume that $XX^T$ and $XY^T$ are invertible, assume that $\Sigma$ has $d_y$ distinct eigenvalues, and assume that $p < d_x$, $p < d_y$ and $d_y = d_x$ (e.g., an autoencoder). Then, the loss function $\bar{L}(W)$ has the following properties:
(i) It is convex in each matrix $W_1$ (or $W_2$) when the other $W_2$ (or $W_1$) is fixed.
(ii) Every local minimum is a global minimum.

Conjecture 2.2 (Baldi & Hornik, 1989: deep linear network) Assume the same set of conditions as in Proposition 2.1 except for $H = 1$. Then, the loss function $\bar{L}(W)$ has the following properties:
(i) For any $k \in \{1, \dots, H+1\}$, it is convex in each matrix $W_k$ when $W_{k'}$ is fixed for all $k' \neq k$.
(ii) Every local minimum is a global minimum.

Baldi & Lu (2012) recently provided a proof for Conjecture 2.2 (i), leaving the proof of Conjecture 2.2 (ii) for future work.
They also noted that the case where $p \ge d_x = d_y$ is of interest, but requires further analysis, even for a shallow network with $H = 1$. An informal discussion of Conjecture 2.2 can be found in (Baldi, 1989). In Appendix D, we provide a more detailed discussion of this subject.

2.3 Results

We now state our main theoretical results for deep linear networks, which imply Conjecture 2.2 (ii) and provide further information regarding the critical points with more generality.

Theorem 2.3 (Loss surface of deep linear networks) Assume that $XX^T$ and $XY^T$ are of full rank with $d_y \le d_x$ and that $\Sigma$ has $d_y$ distinct eigenvalues. Then, for any depth $H \ge 1$ and for any layer widths and any input-output dimensions $d_y, d_H, d_{H-1}, \dots, d_1, d_x \ge 1$ (the widths can arbitrarily differ from each other and from $d_y$ and $d_x$), the loss function $\bar{L}(W)$ has the following properties:
(i) It is non-convex and non-concave.
(ii) Every local minimum is a global minimum.
(iii) Every critical point that is not a global minimum is a saddle point.
(iv) If $\mathrm{rank}(W_H \cdots W_2) = p$, then the Hessian at any saddle point has at least one (strictly) negative eigenvalue.[1]

Corollary 2.4 (Effect of deepness on the loss surface) Assume the same set of conditions as in Theorem 2.3 and consider the loss function $\bar{L}(W)$. For three-layer networks (i.e., $H = 1$), the Hessian at any saddle point has at least one (strictly) negative eigenvalue. In contrast, for networks deeper than three layers (i.e., $H \ge 2$), there exist saddle points at which the Hessian does not have any negative eigenvalue.

The assumptions of having full rank and distinct eigenvalues in the training data matrices in Theorem 2.3 are realistic and practically easy to satisfy, as discussed in previous work (e.g., Baldi & Hornik, 1989). In contrast to related previous work (Baldi & Hornik, 1989; Baldi & Lu, 2012), we do not assume the invertibility of $XY^T$, $p < d_x$, $p < d_y$, nor $d_y = d_x$. In Theorem 2.3, $p \ge d_x$ is allowed, as well as many other relationships among the widths of the layers. Therefore, we successfully proved Conjecture 2.2 (ii) and a more general statement. Moreover, Theorem 2.3 (iv) and Corollary 2.4 provide additional information regarding the important properties of saddle points.

Theorem 2.3 presents an instance of a deep model that would be tractable to train with direct greedy optimization, such as gradient-based methods. If there are poor local minima with large loss values everywhere, we would have to search the entire space,[2] the volume of which increases exponentially with the number of variables. This is a major cause of NP-hardness for non-convex optimization. In contrast, if there are no poor local minima, as Theorem 2.3 (ii) states, then saddle points are the main remaining concern in terms of tractability.[3] Because the Hessian of $\bar{L}(W)$ is Lipschitz continuous, if the Hessian at a saddle point has a negative eigenvalue, it starts appearing as we approach the saddle point. Thus, Theorem 2.3 and Corollary 2.4 suggest that for 1-hidden-layer networks, training can be done in polynomial time with a second-order method or even with a modified stochastic gradient descent method, as discussed in (Ge et al., 2015). For deeper networks, Corollary 2.4 states that there exist bad saddle points in the sense that the Hessian at the point has no negative eigenvalue. However, we know exactly when this can happen from Theorem 2.3 (iv) in our deep models. We leave the development of efficient methods to deal with such a bad saddle point in general deep models as an open problem.

[1] If $H = 1$, to be succinct, we define $W_H \cdots W_2 = W_1 \cdots W_2 \triangleq I_{d_1}$, with a slight abuse of notation.
[2] Typically, we do this by assuming smoothness in the values of the loss function.
[3] Other problems such as ill-conditioning can make it difficult to obtain a fast convergence rate.
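The contrast drawn in Corollary 2.4 can already be seen in the smallest possible example. The following minimal numerical sketch is not from the paper: it uses a scalar product network with a single made-up data point $(x, y) = (1, 1)$, so the loss is $\frac{1}{2}(w_{H+1}\cdots w_1 - 1)^2$, and checks that at the origin (a critical point) the Hessian has a strictly negative eigenvalue when $H = 1$, but is identically zero, i.e., a bad saddle, when $H = 2$.

```python
# Minimal sketch, not from the paper: a scalar "deep linear network" with one data
# point (x, y) = (1, 1), so the loss is 0.5 * (w_{H+1} * ... * w_1 - 1)^2.
# Corollary 2.4 predicts a negative Hessian eigenvalue at the origin for H = 1,
# but a bad saddle (no negative eigenvalue) for H = 2.
import numpy as np

def loss(w):
    # squared error of the product network with x = y = 1
    return 0.5 * (np.prod(w) - 1.0) ** 2

def numerical_hessian(f, w, eps=1e-4):
    # central finite differences; adequate for this smooth toy loss
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = eps * np.eye(n)[i], eps * np.eye(n)[j]
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return H

for n_weights in (2, 3):                  # one and two hidden layers (H = 1, H = 2)
    eig = np.linalg.eigvalsh(numerical_hessian(loss, np.zeros(n_weights)))
    print(f"H = {n_weights - 1}: Hessian eigenvalues at the origin = {np.round(eig, 6)}")
# Expected (up to finite-difference error):
# H = 1: approximately [-1, 1]   -> the Hessian exposes a descent direction
# H = 2: approximately [0, 0, 0] -> bad saddle: no negative curvature in the Hessian
```

Note that for $H = 2$ the origin is still a saddle point: along the direction $w_1 = w_2 = w_3 = t$, the loss $\frac{1}{2}(t^3 - 1)^2$ decreases for small $t > 0$ and increases for $t < 0$, even though the Hessian reveals no negative curvature.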
3 Deep nonlinear neural networks

Now that we have obtained a comprehensive understanding of the loss surface of deep linear models, we discuss deep nonlinear models. For a practical deep nonlinear neural network, our theoretical results so far for the deep linear models can be interpreted as follows: depending on the nonlinear activation mechanism and architecture, training would not be arbitrarily difficult. While a theoretical formalization of this intuition is left to future work, we address a recently proposed open problem for deep nonlinear networks in the rest of this section.

3.1 Models

We use the same notation as for the deep linear models, defined in the beginning of Section 2.1. The output of a deep nonlinear neural network, $\hat{Y}(W, X) \in \mathbb{R}^{d_y \times m}$, is defined as
$$\hat{Y}(W, X) = q\, \sigma_{H+1}\big(W_{H+1} \sigma_H\big(W_H \sigma_{H-1}\big(W_{H-1} \cdots \sigma_2\big(W_2 \sigma_1(W_1 X)\big) \cdots \big)\big)\big),$$
where $q \in \mathbb{R}$ is simply a normalization factor, the value of which is specified later. Here, $\sigma_k : \mathbb{R}^{d_k \times m} \to \mathbb{R}^{d_k \times m}$ is the element-wise rectified linear function:
$$\sigma_k\!\left(\begin{bmatrix} b_{11} & \cdots & b_{1m} \\ \vdots & \ddots & \vdots \\ b_{d_k 1} & \cdots & b_{d_k m} \end{bmatrix}\right) = \begin{bmatrix} \bar{\sigma}(b_{11}) & \cdots & \bar{\sigma}(b_{1m}) \\ \vdots & \ddots & \vdots \\ \bar{\sigma}(b_{d_k 1}) & \cdots & \bar{\sigma}(b_{d_k m}) \end{bmatrix},$$
where $\bar{\sigma}(b_{ij}) = \max(0, b_{ij})$. In practice, we usually set $\sigma_{H+1}$ to be an identity map in the last layer, in which case all our theoretical results still hold true.

3.2 Background

Following the work by Dauphin et al. (2014), Choromanska et al. (2015a) investigated the connection between the loss functions of deep nonlinear networks and a function well-studied via random matrix theory (i.e., the Hamiltonian of the spherical spin-glass model). They explained that their theoretical results relied on several unrealistic assumptions. Later, Choromanska et al. (2015b) suggested at the Conference on Learning Theory (COLT) 2015 that discarding these assumptions is an important open problem. The assumptions were labeled A1p, A2p, A3p, A4p, A5u, A6u, and A7p.

In this paper, we successfully discard most of these assumptions. In particular, we only use a weaker version of assumptions A1p and A5u. We refer to the part of assumption A1p (resp. A5u) that corresponds only to the model assumption as A1p-m (resp. A5u-m). Note that assumptions A1p-m and A5u-m are explicitly used in the previous work (Choromanska et al., 2015a) and included in A1p and A5u (i.e., we are not making new assumptions here).

As the model $\hat{Y}(W, X) \in \mathbb{R}^{d_y \times m}$ represents a directed acyclic graph, we can express an output from one of the units in the output layer as
$$\hat{Y}(W, X)_{j,i} = q \sum_{p=1}^{\Psi} [X_i]_{(j,p)} [Z_i]_{(j,p)} \prod_{k=1}^{H+1} w^{(k)}_{(j,p)}. \tag{1}$$
Here, $\Psi$ is the total number of paths from the inputs to each $j$-th output in the directed acyclic graph. In addition, $[X_i]_{(j,p)} \in \mathbb{R}$ represents the entry of the $i$-th sample input datum that is used in the $p$-th path of the $j$-th output. For each layer $k$, $w^{(k)}_{(j,p)} \in \mathbb{R}$ is the entry of $W_k$ that is used in the $p$-th path of the $j$-th output. Finally, $[Z_i]_{(j,p)} \in \{0, 1\}$ represents whether the $p$-th path of the $j$-th output is active ($[Z_i]_{(j,p)} = 1$) or not ($[Z_i]_{(j,p)} = 0$) for each sample $i$ as a result of the rectified linear activation.

Assumption A1p-m assumes that the $Z$'s are Bernoulli random variables with the same probability of success, $\Pr([Z_i]_{(j,p)} = 1) = \rho$ for all $i$ and $(j, p)$. Assumption A5u-m assumes that the $Z$'s are independent from the input $X$'s and the parameters $w$'s.
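As a concrete illustration of the path-sum representation in equation (1), the following small sketch (layer sizes and data are made up, with $q = 1$ and $\sigma_{H+1}$ taken as the identity; it is not code from the paper) enumerates every input-to-output path of a two-hidden-layer ReLU network and checks that the path sum reproduces the ordinary forward pass.

```python
# Small sketch (made-up layer sizes, q = 1, sigma_{H+1} = identity), not from the paper:
# for a ReLU network, the standard forward pass equals the sum in equation (1) over all
# input-to-output paths of (input entry) * (path activation indicator) * (weight product).
import itertools
import numpy as np

rng = np.random.default_rng(0)
d_x, d_1, d_2, d_y, m = 3, 4, 4, 2, 5                     # H = 2 hidden layers
X = rng.normal(size=(d_x, m))
W1, W2, W3 = (rng.normal(size=s) for s in [(d_1, d_x), (d_2, d_1), (d_y, d_2)])

relu = lambda v: np.maximum(v, 0.0)
A1 = relu(W1 @ X)
A2 = relu(W2 @ A1)
forward = W3 @ A2                                         # usual forward pass

# [Z_i]_(j,p): the path through (input a, hidden h1, hidden h2, output j) is active
# for sample i iff both hidden units on it are active.
z1 = (W1 @ X > 0).astype(float)
z2 = (W2 @ A1 > 0).astype(float)
path_sum = np.zeros((d_y, m))
for j, h2, h1, a in itertools.product(range(d_y), range(d_2), range(d_1), range(d_x)):
    path_sum[j] += X[a] * z1[h1] * z2[h2] * W1[h1, a] * W2[h2, h1] * W3[j, h2]

print(np.allclose(forward, path_sum))                     # True: eq. (1) with q = 1
```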
With assumptions A1p-m and A5u-m, we can write
$$\mathbb{E}_Z\big[\hat{Y}(W, X)_{j,i}\big] = q \sum_{p=1}^{\Psi} [X_i]_{(j,p)}\, \rho \prod_{k=1}^{H+1} w^{(k)}_{(j,p)}.$$

Choromanska et al. (2015b) noted that A6u is unrealistic because it implies that the inputs are not shared among the paths. In addition, assumption A5u is unrealistic because it implies that the activation of any path is independent of the input data. To understand all of the seven assumptions (A1p, A2p, A3p, A4p, A5u, A6u, and A7p), we note that Choromanska et al. (2015b,a) used these seven assumptions to reduce their loss functions of nonlinear neural networks to
$$L_{\text{previous}}(W) = \frac{1}{\lambda^{H/2}} \sum_{i_1, i_2, \dots, i_{H+1}=1}^{\lambda} X_{i_1, i_2, \dots, i_{H+1}} \prod_{k=1}^{H+1} w_{i_k} \quad \text{subject to} \quad \frac{1}{\lambda} \sum_{i=1}^{\lambda} w_i^2 = 1,$$
where $\lambda \in \mathbb{R}$ is a constant related to the size of the network. For our purpose, the detailed definitions of the symbols are not important ($X$ and $w$ are defined in the same way as in equation (1)). Here, we point out that the target function $Y$ has disappeared in the loss $L_{\text{previous}}(W)$ (i.e., the loss value does not depend on the target function). That is, whatever the data points of $Y$ are, their loss values are the same. Moreover, the nonlinear activation function has disappeared in $L_{\text{previous}}(W)$ (and the nonlinearity is not taken into account in $X$ or $w$). In the next section, by using only a strict subset of the set of these seven assumptions, we reduce our loss function to a more realistic loss function of an actual deep model.

Proposition 3.1 (High-level description of a main result in Choromanska et al., 2015a) Assume A1p (including A1p-m), A2p, A3p, A4p, A5u (including A5u-m), A6u, and A7p (Choromanska et al., 2015b). Furthermore, assume that $d_y = 1$. Then, the expected loss of each sample datum, $L_{\text{previous}}(W)$, has the following property: above a certain loss value, the number of local minima diminishes exponentially as the loss value increases.

3.3 Results

We now state our theoretical result, which partially addresses the aforementioned open problem. We consider loss functions for all the data points and all possible output dimensionalities (i.e., vector-valued outputs). More concretely, we consider the squared error loss with expectation,
$$L(W) = \frac{1}{2} \big\| \mathbb{E}_Z\big[\hat{Y}(W, X) - Y\big] \big\|_F^2.$$

Corollary 3.2 (Loss surface of deep nonlinear networks) Assume A1p-m and A5u-m. Let $q = \rho^{-1}$. Then, we can reduce the loss function of the deep nonlinear model $L(W)$ to that of the deep linear model $\bar{L}(W)$. Therefore, with the same set of conditions as in Theorem 2.3, the loss function of the deep nonlinear model has the following properties:
(i) It is non-convex and non-concave.
(ii) Every local minimum is a global minimum.
(iii) Every critical point that is not a global minimum is a saddle point.
(iv) The saddle points have the properties stated in Theorem 2.3 (iv) and Corollary 2.4.

Comparing Corollary 3.2 and Proposition 3.1, we can see that we successfully discarded assumptions A2p, A3p, A4p, A6u, and A7p while obtaining a tighter statement in the following sense: Corollary 3.2 states, with fewer unrealistic assumptions, that there is no poor local minimum, whereas Proposition 3.1 roughly asserts, with more unrealistic assumptions, that the number of poor local minima may not be too large. Furthermore, our model $\hat{Y}$ is strictly more general than the model analyzed in (Choromanska et al., 2015a,b) (i.e., this paper's model class contains the previous work's model class, but not vice versa).
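The reduction underlying Corollary 3.2 can be illustrated with a small Monte Carlo sketch. The sizes below are made up and the code is not from the paper; it is valid only under the stated assumptions A1p-m and A5u-m, i.e., each path indicator is drawn as an independent Bernoulli($\rho$) variable regardless of the input and the weights, so that with $q = \rho^{-1}$ the expected output coincides with the deep linear model.

```python
# Monte Carlo sketch of the reduction used in Corollary 3.2 (made-up sizes, one hidden
# layer), valid only under assumptions A1p-m and A5u-m: every path indicator [Z_i]_(j,p)
# is an independent Bernoulli(rho) variable, independent of X and W.  With q = 1/rho,
# the expectation E_Z[Y_hat(W, X)] then equals the deep linear model output W_2 W_1 X.
import numpy as np

rng = np.random.default_rng(1)
d_x, d_1, d_y, m = 2, 3, 2, 4
rho, q, n_draws = 0.5, 2.0, 100_000                        # q = 1 / rho

X = rng.normal(size=(d_x, m))
W1, W2 = rng.normal(size=(d_1, d_x)), rng.normal(size=(d_y, d_1))

# one indicator per (draw n, output j, hidden h, input a, sample i), i.e. per path
Z = rng.binomial(1, rho, size=(n_draws, d_y, d_1, d_x, m))
# sum over paths (h, a):  q * X[a, i] * Z[n, j, h, a, i] * W2[j, h] * W1[h, a]
mc_estimate = q * np.einsum('ai,njhai,jh,ha->ji', X, Z, W2, W1) / n_draws

print(np.max(np.abs(mc_estimate - W2 @ W1 @ X)))           # ~0, up to Monte Carlo error
```

Increasing `n_draws` shrinks the Monte Carlo error; the only point of the sketch is that the nonlinearity disappears in expectation under these two assumptions, which is what makes the reduction of $L(W)$ to $\bar{L}(W)$ possible.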
4 Proof Idea and Important lemmas

In this section, we provide overviews of the proofs of the theoretical results. Our proof approach largely differs from those in previous work (Baldi & Hornik, 1989; Baldi & Lu, 2012; Choromanska et al., 2015a,b). In contrast to (Baldi & Hornik, 1989; Baldi & Lu, 2012), we need a different approach to deal with the bad saddle points that start appearing when the model becomes deeper (see Section 2.3), as well as to obtain more comprehensive properties of the critical points with more generality. While the previous proofs heavily rely on the first-order information, the main parts of our proofs take advantage of the second-order information. In contrast, Choromanska et al. (2015a,b) used the seven assumptions to relate the loss functions of deep models to a function previously analyzed with a tool of random matrix theory. With no reshaping assumptions (A3p, A4p, and A6u), we cannot relate our loss function to such a function. Moreover, with no distributional assumptions (A2p and A6u) (except on the activation), our Hessian is deterministic, and therefore, even random matrix theory itself is insufficient for our purpose. Furthermore, with no spherical constraint assumption (A7p), the number of local minima in our loss function can be uncountable.

One natural strategy to proceed toward Theorem 2.3 and Corollary 3.2 would be to use the first-order and second-order necessary conditions of local minima (e.g., the gradient is zero and the Hessian is positive semidefinite).[4] However, are the first-order and second-order conditions sufficient to prove Theorem 2.3 and Corollary 3.2? Corollary 2.4 shows that the answer is negative for deep models with $H \ge 2$, while it is affirmative for shallow models with $H = 1$. Thus, for deep models, a simple use of the first-order and second-order information is insufficient to characterize the properties of each critical point. In addition to the complexity of the Hessian of the deep models, this suggests that we must strategically extract the second-order information. Accordingly, in Section 4.2, we obtain an organized representation of the Hessian in Lemma 4.3 and strategically extract the information in Lemmas 4.4 and 4.6. With the extracted information, we discuss the proofs of Theorem 2.3 and Corollary 3.2 in Section 4.3.

[4] For a non-convex and non-differentiable function, we can still have a first-order and second-order necessary condition (e.g., Rockafellar & Wets, 2009, Theorem 13.24, p. 606).

4.1 Notations

Let $M \otimes M'$ be the Kronecker product of $M$ and $M'$. Let $D_{\mathrm{vec}(W_k^T)} f(\cdot) = \frac{\partial f(\cdot)}{\partial\, \mathrm{vec}(W_k^T)}$ be the partial derivative of $f$ with respect to $\mathrm{vec}(W_k^T)$ in the numerator layout. That is, if $f : \mathbb{R}^{d_{\mathrm{in}}} \to \mathbb{R}^{d_{\mathrm{out}}}$, we have $D_{\mathrm{vec}(W_k^T)} f(\cdot) \in \mathbb{R}^{d_{\mathrm{out}} \times (d_k d_{k-1})}$. Let $\mathcal{R}(M)$ be the range (or the column space) of a matrix $M$. Let $M^-$ be any generalized inverse of $M$. When we write a generalized inverse in a condition or statement, we mean it for any generalized inverse (i.e., we omit the universal quantifier over generalized inverses, as this is clear). Let $r = (\hat{Y}(W, X) - Y)^T \in \mathbb{R}^{m \times d_y}$ be an error matrix. Let $C = W_{H+1} \cdots W_2 \in \mathbb{R}^{d_y \times d_1}$. When we write $W_k \cdots W_{k'}$, we generally intend that $k > k'$ and the expression denotes the product over $W_j$ for integers $k' \le j \le k$. For notational compactness, two additional cases can arise: when $k = k'$, the expression denotes simply $W_k$, and when $k < k'$, it denotes $I_{d_k}$. For example, in the statement of Lemma 4.1, if we set $k := H+1$, we have that $W_{H+1} W_H \cdots W_{H+2} \triangleq I_{d_y}$.

In Lemma 4.6 and the proofs of Theorem 2.3, we use the following additional notation. We denote an eigendecomposition of $\Sigma$ as $\Sigma = U \Lambda U^T$, where the entries of the eigenvalues are ordered as $\Lambda_{1,1} > \cdots > \Lambda_{d_y, d_y}$, with the corresponding orthogonal eigenvector matrix $U = [u_1, \dots, u_{d_y}]$. For each $k \in \{1, \dots, d_y\}$, $u_k \in \mathbb{R}^{d_y \times 1}$ is a column eigenvector. Let $\bar{p} = \mathrm{rank}(C) \in \{1, \dots, \min(d_y, p)\}$.
We define a matrix containing the subset of the $\bar{p}$ largest eigenvectors as $U_{\bar{p}} = [u_1, \dots, u_{\bar{p}}]$. Given any ordered set $\mathcal{I}_{\bar{p}} = \{i_1, \dots, i_{\bar{p}} \mid 1 \le i_1 < \cdots < i_{\bar{p}} \le \min(d_y, p)\}$, we define a matrix containing the subset of the corresponding eigenvectors as $U_{\mathcal{I}_{\bar{p}}} = [u_{i_1}, \dots, u_{i_{\bar{p}}}]$. Note the difference between $U_{\bar{p}}$ and $U_{\mathcal{I}_{\bar{p}}}$.

4.2 Lemmas

As discussed above, we extracted the first-order and second-order conditions of local minima as the following lemmas. The lemmas provided here are also intended to be our additional theoretical results that may lead to further insights. The proofs of the lemmas are in the appendix.

Lemma 4.1 (Critical point necessary and sufficient condition) $W$ is a critical point of $\bar{L}(W)$ if and only if for all $k \in \{1, \dots, H+1\}$,
$$D_{\mathrm{vec}(W_k^T)} \bar{L}(W)^T = \Big( \big(W_{H+1} W_H \cdots W_{k+1}\big) \otimes \big(W_{k-1} \cdots W_2 W_1 X\big)^T \Big)^T \mathrm{vec}(r) = 0.$$

Lemma 4.2 (Representation at critical point) If $W$ is a critical point of $\bar{L}(W)$, then
$$W_{H+1} W_H \cdots W_2 W_1 = C (C^T C)^- C^T Y X^T (X X^T)^{-1}.$$

Lemma 4.3 (Block Hessian with Kronecker product) Write the entries of $\nabla^2 \bar{L}(W)$ in a block form as
$$\nabla^2 \bar{L}(W) = \begin{bmatrix} D_{\mathrm{vec}(W_{H+1}^T)} \big(D_{\mathrm{vec}(W_{H+1}^T)} \bar{L}(W)\big)^T & \cdots & D_{\mathrm{vec}(W_1^T)} \big(D_{\mathrm{vec}(W_{H+1}^T)} \bar{L}(W)\big)^T \\ \vdots & \ddots & \vdots \\ D_{\mathrm{vec}(W_{H+1}^T)} \big(D_{\mathrm{vec}(W_1^T)} \bar{L}(W)\big)^T & \cdots & D_{\mathrm{vec}(W_1^T)} \big(D_{\mathrm{vec}(W_1^T)} \bar{L}(W)\big)^T \end{bmatrix}.$$
Then, for any $k \in \{1, \dots, H+1\}$,
$$D_{\mathrm{vec}(W_k^T)} \big(D_{\mathrm{vec}(W_k^T)} \bar{L}(W)\big)^T = \big(W_{H+1} \cdots W_{k+1}\big)^T \big(W_{H+1} \cdots W_{k+1}\big) \otimes \big(W_{k-1} \cdots W_1 X\big) \big(W_{k-1} \cdots W_1 X\big)^T,$$
and, for any $k \in \{2, \dots, H+1\}$,
$$D_{\mathrm{vec}(W_k^T)} \big(D_{\mathrm{vec}(W_1^T)} \bar{L}(W)\big)^T = \big(C^T (W_{H+1} \cdots W_{k+1})\big) \otimes \big(X (W_{k-1} \cdots W_1 X)^T\big) + \big[(W_{k-1} \cdots W_2)^T \otimes X\big] \big[I_{d_{k-1}} \otimes (r\, W_{H+1} \cdots W_{k+1})_{\cdot,1} \;\; \cdots \;\; I_{d_{k-1}} \otimes (r\, W_{H+1} \cdots W_{k+1})_{\cdot,d_k}\big].$$

Lemma 4.4 (Hessian semidefinite necessary condition) If $\nabla^2 \bar{L}(W)$ is positive semidefinite or negative semidefinite at a critical point, then for any $k \in \{2, \dots, H+1\}$,
$$\mathcal{R}\big((W_{k-1} \cdots W_3 W_2)^T\big) \subseteq \mathcal{R}(C^T C) \quad \text{or} \quad X r\, W_{H+1} W_H \cdots W_{k+1} = 0.$$

Corollary 4.5 If $\nabla^2 \bar{L}(W)$ is positive semidefinite or negative semidefinite at a critical point, then for any $k \in \{2, \dots, H+1\}$,
$$\mathrm{rank}(W_{H+1} W_H \cdots W_k) \ge \mathrm{rank}(W_{k-1} \cdots W_3 W_2) \quad \text{or} \quad X r\, W_{H+1} W_H \cdots W_{k+1} = 0.$$

Lemma 4.6 (Hessian positive semidefinite necessary condition) If $\nabla^2 \bar{L}(W)$ is positive semidefinite at a critical point, then
$$C (C^T C)^- C^T = U_{\bar{p}} U_{\bar{p}}^T \quad \text{or} \quad X r = 0.$$

4.3 Proof sketches of theorems

We now provide the proof sketch of Theorem 2.3 and Corollary 3.2. We complete the proofs in the appendix.

4.3.1 Proof sketch of Theorem 2.3 (ii)

By case analysis, we show that any point that satisfies the necessary conditions and the definition of a local minimum is a global minimum.

Case I: $\mathrm{rank}(W_H \cdots W_2) = p$ and $d_y \le p$: If $d_y < p$, Corollary 4.5 with $k = H+1$ implies the necessary condition of local minima that $Xr = 0$. If $d_y = p$, Lemma 4.6 with $k = H+1$ and $k = 2$, combined with the fact that $\mathcal{R}(C) \subseteq \mathcal{R}(Y X^T)$, implies the necessary condition that $Xr = 0$. Therefore, we have the necessary condition of local minima, $Xr = 0$. Interpreting the condition $Xr = 0$, we conclude that a $W$ achieving $Xr = 0$ is indeed a global minimum.

Case II: $\mathrm{rank}(W_H \cdots W_2) = p$ and $d_y > p$: From Lemma 4.6, we have the necessary condition that $C (C^T C)^- C^T = U_{\bar{p}} U_{\bar{p}}^T$ or $Xr = 0$. If $Xr = 0$, using the exact same proof as in Case I, it is a global minimum. Suppose then that $C (C^T C)^- C^T = U_{\bar{p}} U_{\bar{p}}^T$. From Lemma 4.4 with $k = H+1$, we conclude that $\bar{p} = \mathrm{rank}(C) = p$. Then, from Lemma 4.2, we write $W_{H+1} \cdots W_1 = U_p U_p^T Y X^T (X X^T)^{-1}$, which is the orthogonal projection onto the subspace spanned by the $p$ eigenvectors corresponding to the $p$ largest eigenvalues, composed with the ordinary least squares regression matrix. This is indeed the expression of a global minimum.
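Before turning to Case III, the closed-form candidate just derived can be sanity-checked numerically. The following sketch (with made-up dimensions satisfying $p < d_y$; it is not code from the paper) builds $U_p U_p^T Y X^T (XX^T)^{-1}$ and confirms that its loss matches the optimal rank-$p$ value obtained independently from the Eckart-Young theorem applied to the targets projected onto the row space of $X$, i.e., the classical reduced-rank regression solution.

```python
# Numerical sanity check of the Case II expression (made-up dimensions with p < d_y),
# not from the paper: the candidate  R* = U_p U_p^T Y X^T (X X^T)^{-1}  attains the
# optimal value of  min_R 0.5 * ||R X - Y||_F^2  over matrices R of rank at most p,
# computed independently via Eckart-Young on the targets projected onto rowspace(X).
import numpy as np

rng = np.random.default_rng(2)
d_x, d_y, m, p = 6, 5, 50, 2                      # p plays the role of the smallest width
X = rng.normal(size=(d_x, m))
Y = rng.normal(size=(d_y, m))

XXt_inv = np.linalg.inv(X @ X.T)
P = X.T @ XXt_inv @ X                             # projection onto the row space of X
Sigma = Y @ X.T @ XXt_inv @ X @ Y.T               # the matrix Sigma of Section 2.1
lam, U = np.linalg.eigh(Sigma)                    # ascending eigenvalues
U_p = U[:, np.argsort(lam)[::-1][:p]]             # eigenvectors of the p largest eigenvalues
R_star = U_p @ U_p.T @ Y @ X.T @ XXt_inv          # Case II candidate for W_{H+1} ... W_1

loss = lambda R: 0.5 * np.linalg.norm(R @ X - Y, 'fro') ** 2
s = np.linalg.svd(Y @ P, compute_uv=False)        # singular values of the projected targets
best_rank_p = 0.5 * (np.linalg.norm(Y @ (np.eye(m) - P), 'fro') ** 2 + np.sum(s[p:] ** 2))

print(np.isclose(loss(R_star), best_rank_p))      # True
```

Since any product $W_{H+1} \cdots W_1$ whose smallest hidden width is $p$ has rank at most $p$, matching this optimal rank-$p$ value is exactly what it means for the critical point in Case II to be a global minimum.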
Case III: $\mathrm{rank}(W_H \cdots W_2) < p$: We first show that if $\mathrm{rank}(C) \ge \min(p, d_y)$, every local minimum is a global minimum. Thus, we consider the case where $\mathrm{rank}(W_H \cdots W_2) < p$ and $\mathrm{rank}(C) < \min(p, d_y)$. In this case, by induction on $k = \{1, \dots, H+1\}$, we prove that we can have $\mathrm{rank}(W_k \cdots W_1) \ge \min(p, d_y)$ with arbitrarily small perturbation of each entry of $W_k, \dots, W_1$ without changing the value of $\bar{L}(W)$. Once this is proved, along with the results of Case I and Case II, we can immediately conclude that any point satisfying the definition of a local minimum is a global minimum.

We first prove the statement for the base case with $k = 1$ by using an expression of $W_1$ that is obtained from a first-order necessary condition: for an arbitrary $L_1$,
$$W_1 = (C^T C)^- C^T Y X^T (X X^T)^{-1} + \big(I - (C^T C)^- C^T C\big) L_1.$$
By using Lemma 4.6 to obtain an expression of $C$, we deduce that we can have $\mathrm{rank}(W_1) \ge \min(p, d_y)$ with arbitrarily small perturbation of each entry of $W_1$ without changing the loss value.

For the inductive step with $k \in \{2, \dots, H+1\}$, from Lemma 4.4, we use the following necessary condition for the Hessian to be (positive or negative) semidefinite at a critical point: for any $k \in \{2, \dots, H+1\}$,
$$\mathcal{R}\big((W_{k-1} \cdots W_2)^T\big) \subseteq \mathcal{R}(C^T C) \quad \text{or} \quad X r\, W_{H+1} \cdots W_{k+1} = 0.$$
We use the inductive hypothesis to conclude that the first condition is false, and thus the second condition must be satisfied at a candidate point of a local minimum. From the latter condition, with extra steps, we can deduce that we can have $\mathrm{rank}(W_k W_{k-1} \cdots W_1) \ge \min(p, d_x)$ with arbitrarily small perturbation of each entry of $W_k$ while retaining the same loss value. We conclude the induction, proving that we can have $\mathrm{rank}(C) \ge \mathrm{rank}(W_{H+1} \cdots W_1) \ge \min(p, d_x)$ with arbitrarily small perturbation of each parameter without changing the value of $\bar{L}(W)$. Upon such a perturbation, we have the case where $\mathrm{rank}(C) \ge \min(p, d_y)$, for which we have already proven that every local minimum is a global minimum.

Summarizing the above, any point that satisfies the definition (and necessary conditions) of a local minimum is indeed a global minimum. Therefore, we conclude the proof sketch of Theorem 2.3 (ii).

4.3.2 Proof sketch of Theorem 2.3 (i), (iii) and (iv)

We can prove the non-convexity and non-concavity of this function simply from its Hessian (Theorem 2.3 (i)). That is, we can show that in the domain of the function, there exist points at which the Hessian becomes indefinite. Indeed, the domain contains uncountably many points at which the Hessian is indefinite.

We now consider Theorem 2.3 (iii): every critical point that is not a global minimum is a saddle point. Combined with Theorem 2.3 (ii), which is proven independently, this is equivalent to the statement that there are no local maxima. We first show that if $W_{H+1} \cdots W_2 \neq 0$, the loss function always has some strictly increasing direction with respect to $W_1$, and hence there is no local maximum. If $W_{H+1} \cdots W_2 = 0$, we show that at a critical point, if the Hessian is negative semidefinite (i.e., a necessary condition of local maxima), we can have $W_{H+1} \cdots W_2 \neq 0$ with arbitrarily small perturbation without changing the loss value. We can prove this by induction on $k = 2, \dots, H+1$, similarly to the induction in the proof of Theorem 2.3 (ii). This means that there is no local maximum.

Theorem 2.3 (iv) follows from Theorem 2.3 (ii)-(iii) and the analyses for Case I and Case II in the proof of Theorem 2.3 (ii): when $\mathrm{rank}(W_H \cdots W_2) = p$, if $\nabla^2 \bar{L}(W) \succeq 0$ at a critical point, then $W$ is a global minimum.
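The Kronecker-product structure of the diagonal Hessian blocks in Lemma 4.3, which the arguments above repeatedly exploit, can also be checked numerically. The sketch below (toy sizes with $H = 2$ and $k = 2$; the finite-difference helper is ad hoc and not from the paper) compares the stated formula against a finite-difference Hessian taken with respect to $\mathrm{vec}(W_k^T)$, which corresponds to NumPy's row-major flattening of $W_k$.

```python
# Finite-difference check of the diagonal block in Lemma 4.3 (toy sizes, H = 2, k = 2),
# not from the paper: with respect to vec(W_2^T) -- NumPy's row-major flattening of W_2 --
# the Hessian block of L_bar equals  (W_3)^T W_3  kron  (W_1 X)(W_1 X)^T.
import numpy as np

rng = np.random.default_rng(3)
d_x, d_1, d_2, d_y, m = 3, 4, 3, 2, 10
X = rng.normal(size=(d_x, m))
Y = rng.normal(size=(d_y, m))
W1, W2, W3 = (rng.normal(size=s) for s in [(d_1, d_x), (d_2, d_1), (d_y, d_2)])

def loss_of_w2(w2_flat):
    W2_ = w2_flat.reshape(d_2, d_1)               # row-major reshape, i.e. vec(W_2^T)
    return 0.5 * np.linalg.norm(W3 @ W2_ @ W1 @ X - Y, 'fro') ** 2

def numerical_hessian(f, w, eps=1e-4):
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = eps * np.eye(n)[i], eps * np.eye(n)[j]
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * eps ** 2)
    return H

analytic = np.kron(W3.T @ W3, (W1 @ X) @ (W1 @ X).T)   # Lemma 4.3 with k = 2
numeric = numerical_hessian(loss_of_w2, W2.flatten())
print(np.allclose(analytic, numeric, atol=1e-4))       # True
```

Because $\bar{L}$ is exactly quadratic in $W_2$, the central differences incur no truncation error, so the agreement is limited only by floating-point round-off.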
4.3.3 Proof sketch of Corollary 3.2

Since the activations are assumed to be random and independent, the effect of the nonlinear activations disappears by taking the expectation. As a result, the loss function $L(W)$ is reduced to $\bar{L}(W)$.

5 Conclusion

In this paper, we addressed some open problems, pushing forward the theoretical foundations of deep learning and non-convex optimization. For deep linear neural networks, we proved the aforementioned conjecture and more detailed statements with more generality. For deep nonlinear neural networks, when compared with the previous work, we proved a tighter statement (in the way explained in Section 3) with more generality ($d_y$ can vary) and with strictly weaker model assumptions (only two assumptions out of seven). However, our theory does not yet directly apply to the practical situation. To fill the gap between theory and practice, future work would further discard the remaining two out of the seven assumptions made in previous work. Our new understanding of the deep linear models at least provides the following theoretical fact: bad local minima would arise in a deep nonlinear model, but only as an effect of adding nonlinear activations to the corresponding deep linear model. Thus, depending on the nonlinear activation mechanism and architecture, we would be able to efficiently train deep models.

Acknowledgments

The author would like to thank Prof. Leslie Kaelbling, Quynh Nguyen, Li Huan and Anirbit Mukherjee for their thoughtful comments on the paper. We gratefully acknowledge support from NSF grant 1420927, from ONR grant N00014-14-1-0486, and from ARO grant W911NF1410433.

References

Baldi, Pierre. 1989. Linear learning: Landscapes and algorithms. In Advances in Neural Information Processing Systems. pp. 65–72.

Baldi, Pierre, & Hornik, Kurt. 1989. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1), 53–58.

Baldi, Pierre, & Lu, Zhiqin. 2012. Complex-valued autoencoders. Neural Networks, 33, 136–147.

Blum, Avrim L, & Rivest, Ronald L. 1992. Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117–127.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Ben Arous, Gerard, & LeCun, Yann. 2015a. The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. pp. 192–204.

Choromanska, Anna, LeCun, Yann, & Ben Arous, Gérard. 2015b. Open Problem: The landscape of the loss surfaces of multilayer networks. In Proceedings of The 28th Conference on Learning Theory. pp. 1756–1760.

Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, & Bengio, Yoshua. 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems. pp. 2933–2941.

Ge, Rong, Huang, Furong, Jin, Chi, & Yuan, Yang. 2015. Escaping From Saddle Points: Online Stochastic Gradient for Tensor Decomposition. In Proceedings of The 28th Conference on Learning Theory. pp. 797–842.

Goodfellow, Ian, Bengio, Yoshua, & Courville, Aaron. 2016. Deep Learning. Book in preparation for MIT Press. http://www.deeplearningbook.org.

Livni, Roi, Shalev-Shwartz, Shai, & Shamir, Ohad. 2014. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems. pp. 855–863.

Mhaskar, Hrushikesh, Liao, Qianli, & Poggio, Tomaso. 2016. Learning Real and Boolean Functions: When Is Deep Better Than Shallow. Massachusetts Institute of Technology CBMM Memo No. 45.
Murty, Katta G, & Kabadi, Santosh N. 1987. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2), 117–129.

Rockafellar, R Tyrrell, & Wets, Roger J-B. 2009. Variational Analysis. Vol. 317. Springer Science & Business Media.

Saxe, Andrew M, McClelland, James L, & Ganguli, Surya. 2014. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations.

Zhang, Fuzhen. 2006. The Schur Complement and Its Applications. Vol. 4. Springer Science & Business Media.