# The Multilinear Structure of ReLU Networks

Thomas Laurent *1, James H. von Brecht *2

*Equal contribution. 1 Department of Mathematics, Loyola Marymount University, Los Angeles, CA 90045, USA. 2 Department of Mathematics and Statistics, California State University, Long Beach, Long Beach, CA 90840, USA. Correspondence to: Thomas Laurent, James H. von Brecht. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

**Abstract.** We study the loss surface of neural networks equipped with a hinge loss criterion and ReLU or leaky ReLU nonlinearities. Any such network defines a piecewise multilinear form in parameter space. By appealing to harmonic analysis we show that all local minima of such networks are non-differentiable, except for those minima that occur in a region of parameter space where the loss surface is perfectly flat. Non-differentiable minima are therefore not technicalities or pathologies; they are the heart of the problem when investigating the loss of ReLU networks. As a consequence, we must employ techniques from nonsmooth analysis to study these loss surfaces. We show how to apply these techniques in some illustrative cases.

## 1. Introduction

Empirical practice tends to show that modern neural networks have relatively benign loss surfaces, in the sense that training a deep network proves less challenging than the nonconvex and nonsmooth nature of the optimization would naively suggest. Many theoretical efforts, especially in recent years, have attempted to explain this phenomenon and, more broadly, the successful optimization of deep networks in general (Gori & Tesi, 1992; Choromanska et al., 2015; Kawaguchi, 2016; Safran & Shamir, 2016; Mei et al., 2016; Soltanolkotabi, 2017; Soudry & Hoffer, 2017; Du et al., 2017; Zhong et al., 2017; Tian, 2017; Li & Yuan, 2017; Zhou & Feng, 2017; Brutzkus et al., 2017). Despite these many efforts, the properties of the loss surfaces of neural networks remain poorly understood, and developing a coherent mathematical understanding of them is therefore one of the major open problems in deep learning.

Figure 1. (a) Parameter space Ω: the space Ω = R² decomposes into a partition of four open cells Ω_u and a closed boundary set N (solid red lines). (b) Loss surface L(ω): the loss surface is smooth inside each Ω_u and non-differentiable on N. It has two types of local minima, flat minima (cell Ω₁) and sharp minima (boundary between cells Ω₃ and Ω₄). The sharp minima must have non-zero loss.

We focus on investigating the loss surfaces that arise from feed-forward neural networks where rectified linear units (ReLUs) σ(x) := max(x, 0) or leaky ReLUs σ_α(x) := α min(x, 0) + max(x, 0) account for all nonlinearities present in the network. We allow the transformations defining the hidden layers of the network to take the form of fully connected affine transformations or convolutional transformations. By employing a ReLU-based criterion we then obtain a loss with a consistent, homogeneous structure for the nonlinearities in the network. We elect to use the binary hinge loss

$$\ell(\hat y, y) := \sigma(1 - y\hat y) \tag{1}$$

for binary classification, where $\hat y$ denotes the scalar output of the network and $y \in \{-1, 1\}$ denotes the target. Similarly, for multiclass classification we use the multiclass hinge loss

$$\ell(\hat y, r_0) := \sum_{r \neq r_0} \sigma\big(1 + \hat y_r - \hat y_{r_0}\big) \tag{2}$$

where $\hat y = (\hat y_1, \ldots, \hat y_R) \in \mathbb{R}^R$ denotes the vectorial output of the network and $r_0 \in \{1, \ldots, R\}$ denotes the target class.
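As a concrete reference for the criteria (1) and (2), the short NumPy sketch below evaluates both hinge losses on toy inputs; the function names and the example values are our own illustrative choices rather than anything specified in the paper.

```python
import numpy as np

def relu(x):
    """sigma(x) = max(x, 0), the ReLU used inside the hinge criteria."""
    return np.maximum(x, 0.0)

def binary_hinge(y_hat, y):
    """Binary hinge loss (1): sigma(1 - y * y_hat), with y in {-1, +1}."""
    return relu(1.0 - y * y_hat)

def multiclass_hinge(y_hat, r0):
    """Multiclass hinge loss (2): sum over r != r0 of sigma(1 + y_hat[r] - y_hat[r0])."""
    margins = relu(1.0 + y_hat - y_hat[r0])
    return margins.sum() - margins[r0]          # drop the r = r0 term, which equals sigma(1) = 1

# Toy usage (illustrative values only).
print(binary_hinge(y_hat=0.3, y=1))                         # 0.7: correct sign but margin < 1
print(multiclass_hinge(np.array([2.0, 0.5, -1.0]), r0=0))   # 0.0: class 0 wins every pairwise margin
```

Note that the term r = r0 in the vectorized sum always equals σ(1) = 1, which is why it is subtracted off; the same bookkeeping reappears in the rewriting (3) below.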
To see the type of structure that emerges in these networks, let Ω denote the space of network parameters and let L(ω) denote the loss. Due to the choices (1), (2) of network criteria, all nonlinearities involved in L(ω) are piecewise linear. These nonlinearities encode a partition of parameter space Ω = Ω₁ ∪ ⋯ ∪ Ω_M ∪ N into a finite number of open cells Ω_u and a closed set N of cell boundaries (c.f. figure 1). A cell Ω_u corresponds to a given activation pattern of the nonlinearities, and so L(ω) is smooth in the interior of cells and (potentially) non-differentiable on cell boundaries. This decomposition provides a description of the smooth (i.e. Ω \ N) and non-smooth (i.e. N) parts of parameter space.

Figure 2. For a certain class of networks, flat local minima are always optimal whereas sharp ones are always sub-optimal.

We begin by showing that the loss restricted to a cell Ω_u is a multilinear form. As multilinear forms are harmonic functions, an appeal to the strong maximum principle shows that non-trivial optima of the loss must occur on cell boundaries (i.e. the non-differentiable region N of parameter space). In other words, ReLU networks with hinge loss criteria do not have differentiable local minima, except for those trivial ones that occur in regions of parameter space where the loss surface is perfectly flat. Figure 1(b) provides a visual example of such a loss. As a consequence the loss function has only two types of local minima:

Type I (Flat): Local minima that occur in a flat (i.e. constant loss) cell or on the boundary of a flat cell.

Type II (Sharp): Local minima on N that are not on the boundary of any flat cell.

We then investigate type I and type II local minima in more detail. The investigation reveals a clean dichotomy. First and foremost,

Main Result 1. L(ω) > 0 at any type II local minimum.

Importantly, if zero-loss minimizers exist (which happens for most modern deep networks) then sharp local minima are always sub-optimal. This result applies to a quite general class of deep neural networks with fully connected or convolutional layers equipped with either ReLU or leaky ReLU nonlinearities. To obtain a converse we restrict our attention to fully connected networks with leaky ReLU nonlinearities. Under mild assumptions on the data we have

Main Result 2. L(ω) = 0 at any type I local minimum, while L(ω) > 0 at any type II local minimum.

Thus flat local minima are always optimal whereas sharp minima are always sub-optimal in the case where zero-loss minimizers exist. Conversely, if zero-loss minimizers do not exist then all local minima are sharp. See figure 2 for an illustration of such a loss surface.

All in all these results paint a striking picture. Networks with ReLU or leaky ReLU nonlinearities and hinge loss criteria have only two types of local minima. Sharp minima always have non-zero loss; they are undesirable. Conversely, flat minima are always optimal for certain classes of networks. In this case the structure of the loss (flat vs. sharp) provides a perfect characterization of their quality (optimal vs. sub-optimal). This analysis also shows that local minima generically occur in the non-smooth region of parameter space. Analyzing them requires invoking machinery from non-smooth, non-convex analysis. We show how to apply these techniques to study non-smooth networks in the context of binary classification.
We consider three specific scenarios to illustrate how nonlinearity and data complexity affect the loss surface of multilinear networks:

Scenario 1: A deep linear network with arbitrary data.

Scenario 2: A network with one hidden layer, leaky ReLUs and linearly separable data.

Scenario 3: A network with one hidden layer, ReLUs and linearly separable data.

The nonlinearities σ_α(x) vary from the linear regime (α = 1) to the leaky regime (0 < α < 1) and finally to the ReLU regime (α = 0) as we pass from the first to the third scenario. We show that no sub-optimal local minimizers exist in the first two scenarios. When passing to the case of paramount interest, i.e. the third scenario, a bifurcation occurs. Degeneracy in the nonlinearities (i.e. α = 0) induces sub-optimal local minima in the loss surface. We also provide an explicit description of all such sub-optimal local optima. They correspond to the occurrence of dead data points, i.e. data points that do not activate any of the neurons of the hidden layer and are therefore ignored by the network.

Our results for the second and third scenarios provide a mathematically precise formulation of a commonplace intuitive picture. A ReLU can completely turn off, and sub-optimal minima correspond precisely to situations in which a data point turns off all ReLUs in the hidden layer. As leaky ReLUs have no completely off state, such networks therefore have no sub-optimal minima.

Finally, in section 4 we conclude by investigating the extent to which these phenomena do, or do not, persist when passing to the multiclass context. The loss surface of a multilinear network with the multiclass hinge loss (2) is fundamentally different from that of a binary classification problem. In particular, the picture that emerges from our two-class results does not extend to the multiclass hinge loss. Nevertheless, we show how to obtain a similar picture of critical points by modifying the training strategy applied to multiclass problems.

Many recent works theoretically investigate the loss surface of ReLU networks. The closest to ours is (Safran & Shamir, 2016), which uses ReLU nonlinearities to partition the parameter space into basins that, while similar in spirit, differ from our notion of cells. Works such as (Keskar et al., 2016; Chaudhari et al., 2017) have empirically investigated the notion of width of a local minimizer. Conjecturally, a wide local minimum should generalize better than a narrow one and might be more likely to attract the solution generated by a stochastic gradient descent algorithm. Our flat and sharp local minima are reminiscent of these notions. Finally, some prior works have proved variants of our results in smooth situations. For instance, (Brutzkus et al., 2017) derives results about the smooth local minima occurring in scenarios 2 and 3, but they do not investigate non-differentiable local minima. Additionally, (Kawaguchi, 2016) considers our first scenario with a mean squared error loss instead of the hinge loss, while (Frasconi et al., 1997) considers our second scenario with a smooth version of the hinge loss and with sigmoid nonlinearities. Our non-smooth analogues of these results require fundamentally different techniques. We prove all lemmas, theorems and corollaries in the appendix.

## 2. Global Structure of the Loss

We begin by describing the global structure of ReLU networks with a hinge loss that arises due to their piecewise multilinear form.
Let us start by rewriting (2) as

$$\ell(\hat y, y) = -1 + \sum_{r=1}^{R} \sigma\big(1 + \hat y_r - \langle y, \hat y\rangle\big) = -1 + \Big\langle \mathbf{1},\, \sigma\big((\mathrm{Id} - \mathbf{1}\otimes y)\,\hat y + \mathbf{1}\big) \Big\rangle \tag{3}$$

where we now view the target $y \in \{0, 1\}^R$ as a one-hot vector that encodes the desired class. The term $\mathbf{1}\otimes y$ denotes the outer product between the constant vector $\mathbf{1} = (1, \ldots, 1)^T$ and the target, while $\langle y, \hat y\rangle$ refers to the usual Euclidean inner product. We consider a collection $(x^{(i)}, y^{(i)})$ of $N$ labeled data points fed through a neural network with $L$ hidden layers,

$$x^{(i,\ell)} = \sigma_\alpha\big(W^{(\ell)} x^{(i,\ell-1)} + b^{(\ell)}\big) \;\; \text{for } \ell \in [L], \qquad \hat y^{(i)} = V x^{(i,L)} + c, \tag{4}$$

so that for $\ell \in [L] := \{1, \ldots, L\}$ each $x^{(i,\ell)}$ refers to the feature vector of the $i$th data point at the $\ell$th layer (with the convention that $x^{(i,0)} = x^{(i)}$) and $\hat y^{(i)}$ refers to the output of the network for the $i$th datum. By (3) we obtain

$$L(\omega) = -1 + \sum_i \mu^{(i)} \Big\langle \mathbf{1},\, \sigma\big((\mathrm{Id} - \mathbf{1}\otimes y^{(i)})\,\hat y^{(i)} + \mathbf{1}\big) \Big\rangle \tag{5}$$

for the loss L(ω). The positive weights $\mu^{(i)} > 0$ sum to one, say $\mu^{(i)} = 1/N$ in the simplest case, but we allow for other choices to handle those situations, such as an unbalanced training set, in which non-homogeneous weights could be beneficial. The matrices $W^{(\ell)}$ and vectors $b^{(\ell)}$ appearing in (4) define the affine transformation at layer $\ell$ of the network, and $V$ and $c$ in (4) denote the weights and bias of the output layer. We allow for fully connected as well as structured models, such as convolutional networks, by imposing the assumption that each $W^{(\ell)}$ is a matrix-valued function that depends linearly on some set of parameters $\omega^{(\ell)}$,

$$W^{(\ell)}\big(c\,\omega^{(\ell)} + d\,\hat\omega^{(\ell)}\big) = c\, W^{(\ell)}\big(\omega^{(\ell)}\big) + d\, W^{(\ell)}\big(\hat\omega^{(\ell)}\big);$$

thus the collection $\omega = (\omega^{(1)}, \ldots, \omega^{(L)}, V, b^{(1)}, \ldots, b^{(L)}, c) \in \Omega$ represents the parameters of the network and Ω denotes parameter space. As the slope α of the nonlinearity decreases from α = 1 to α = 0 the network transitions from a deep linear architecture to a standard ReLU network. Finally, we let $d_\ell$ denote the dimension of the features at layer $\ell$ of the network, with the convention that $d_0 = d$ (the dimension of the input data) and $d_{L+1} = R$ (the number of classes). We use $D = d_1 + \ldots + d_{L+1}$ for the total number of neurons.

### 2.1. Partitioning Ω into Cells

The nonlinearities σ_α(x) and σ(x) account for the only sources of non-differentiability in the loss of a ReLU network. To track these potential sources of non-differentiability, for a given data point $x^{(i)}$ we define the functions

$$s^{(i,\ell)}(\omega) := \operatorname{sign}\big(W^{(\ell)} x^{(i,\ell-1)} + b^{(\ell)}\big) \;\; \text{for } \ell \in [L], \qquad s^{(i,L+1)}(\omega) := \operatorname{sign}\big((\mathrm{Id} - \mathbf{1}\otimes y^{(i)})\,\hat y^{(i)} + \mathbf{1}\big), \tag{6}$$

where sign(x) stands for the signum function that vanishes at zero. The function $s^{(i,\ell)}$ describes how data point $x^{(i)}$ activates the $d_\ell$ neurons at the $\ell$th layer, while $s^{(i,L+1)}(\omega)$ describes the corresponding activation of the loss. These activations take one of three possible states: the fully active state (encoded by a one), the fully inactive state (encoded by a minus one), or an in-between state (encoded by a zero). We then collect all of these functions into a single signature function

$$S(\omega) = \big(s^{(1,1)}(\omega), \ldots, s^{(1,L+1)}(\omega);\; \ldots\;;\; s^{(N,1)}(\omega), \ldots, s^{(N,L+1)}(\omega)\big)$$

to obtain a function $S : \Omega \to \{-1, 0, 1\}^{ND}$, since there are a total of D neurons and N data points. If S(ω) belongs to the subset $\{-1, 1\}^{ND}$ of $\{-1, 0, 1\}^{ND}$ then none of the ND entries of S(ω) vanish, and as a consequence, all of the nonlinearities are differentiable near ω; the loss L is smooth near such points. With this in mind, for a given $u \in \{-1, 1\}^{ND}$ we define the cell $\Omega_u$ as the (possibly empty) subset

$$\Omega_u := S^{-1}(u) := \{\omega \in \Omega : S(\omega) = u\}$$

of parameter space.
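To make the signature (6) and the cell decomposition concrete, here is a minimal sketch that runs the forward pass (4) for a single datum and records the activation pattern of every neuron and of the loss. The tiny architecture, the random parameters, and the helper names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def sign(x):
    """Signum that vanishes at zero, as used in (6)."""
    return np.sign(x)

def forward_and_signature(x, y_onehot, Ws, bs, V, c, alpha=0.1):
    """Forward pass (4) of a leaky-ReLU network plus the per-datum signature (6)."""
    sig = []
    h = x
    for W, b in zip(Ws, bs):
        pre = W @ h + b                          # pre-activation at layer l
        sig.append(sign(pre))                    # s^(i,l): activation pattern of layer l
        h = np.where(pre > 0, pre, alpha * pre)  # sigma_alpha
    y_hat = V @ h + c                            # network output
    loss_pre = (np.eye(len(y_hat)) - np.outer(np.ones_like(y_hat), y_onehot)) @ y_hat + 1.0
    sig.append(sign(loss_pre))                   # s^(i,L+1): activation pattern of the loss
    return y_hat, np.concatenate(sig)

# Illustrative sizes: 2 inputs, one hidden layer of 3 neurons, R = 2 classes.
rng = np.random.default_rng(0)
Ws, bs = [rng.standard_normal((3, 2))], [rng.standard_normal(3)]
V, c = rng.standard_normal((2, 3)), rng.standard_normal(2)
x, y = rng.standard_normal(2), np.array([1.0, 0.0])

y_hat, s = forward_and_signature(x, y, Ws, bs, V, c)
print(s)   # entries in {-1, 0, +1}
```

If the printed vector contains no zeros then ω lies in the interior of some cell Ω_u with u = S(ω); a zero entry signals that ω sits on the boundary set N.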
By construction L is smooth on each nonempty cell Ω_u, and so the cells Ω_u provide us with a partition of parameter space

$$\Omega = N \cup \bigcup_{u \in \{-1,1\}^{ND}} \Omega_u$$

into smooth and potentially non-smooth regions. The set N contains those ω for which at least one of the ND entries of S(ω) takes the value 0, which implies that at least one of the nonlinearities is non-differentiable at such a point. Thus N consists of points at which the loss is potentially non-differentiable. The following lemma collects the various properties of the cells Ω_u and of N that we will need.

Lemma 1. For each $u \in \{-1, 1\}^{ND}$ the cell $\Omega_u$ is an open set. If $u \neq u'$ then $\Omega_u$ and $\Omega_{u'}$ are disjoint. The set N is closed and has Lebesgue measure 0.

### 2.2. Flat and Sharp Minima

Recall that a function $\varphi : \mathbb{R}^{d_1} \times \cdots \times \mathbb{R}^{d_n} \to \mathbb{R}$ is a multilinear form if it is linear with respect to each of its inputs when the other inputs are fixed. That is,

$$\varphi(v_1, \ldots, c v_k + d w_k, \ldots, v_n) = c\,\varphi(v_1, \ldots, v_k, \ldots, v_n) + d\,\varphi(v_1, \ldots, w_k, \ldots, v_n).$$

Our first theorem forms the basis for our analytical results. It states that, up to a constant, the loss restricted to a fixed cell Ω_u is a sum of multilinear forms.

Theorem 1 (Multilinear Structure of the Loss). For each cell $\Omega_u$ there exist multilinear forms $\varphi^u_0, \ldots, \varphi^u_{L+1}$ and a constant $\varphi^u_{L+2}$ such that

$$\begin{aligned}
L|_{\Omega_u}\big(\omega^{(1)}, \ldots, \omega^{(L)}, V, b^{(1)}, \ldots, b^{(L)}, c\big)
&= \varphi^u_0\big(\omega^{(1)}, \omega^{(2)}, \omega^{(3)}, \ldots, \omega^{(L)}, V\big) \\
&\quad + \varphi^u_1\big(b^{(1)}, \omega^{(2)}, \omega^{(3)}, \ldots, \omega^{(L)}, V\big) \\
&\quad + \varphi^u_2\big(b^{(2)}, \omega^{(3)}, \ldots, \omega^{(L)}, V\big) \\
&\quad + \cdots \\
&\quad + \varphi^u_{L-1}\big(b^{(L-1)}, \omega^{(L)}, V\big) \\
&\quad + \varphi^u_L\big(b^{(L)}, V\big) + \varphi^u_{L+1}(c) + \varphi^u_{L+2}.
\end{aligned}$$

The proof relies on the fact that the signature function S(ω) is constant inside a fixed cell Ω_u, and so the network reduces to a succession of affine transformations. These combine to produce a sum of multilinear forms. Appealing to properties of multilinear forms then gives two important corollaries. Multilinear forms are harmonic functions. Using the strong maximum principle for harmonic functions (a non-constant harmonic function cannot attain a local minimum or a local maximum at an interior point of an open, connected set) we show that L does not have differentiable optima, except for the trivial flat ones.

Corollary 1 (No Differentiable Extrema). Local minima and maxima of the loss (5) occur only on the boundary set N or on those cells $\Omega_u$ where the loss is constant. In the latter case, $L|_{\Omega_u}(\omega) = \varphi^u_{L+2}$.

Our second corollary reveals the saddle-like structure of the loss.

Corollary 2 (Saddle-like Structure of the Loss). If $\omega \in \Omega \setminus N$ and the Hessian matrix $D^2 L(\omega)$ does not vanish, then it must have at least one strictly positive and one strictly negative eigenvalue.

These corollaries have implications for various optimization algorithms. At a local minimum $D^2 L$ either vanishes (flat local minima) or does not exist (sharp local minima). Therefore local minima do not carry any second-order information. Moreover, away from minima the Hessian is never positive definite and is typically indefinite. Thus an optimization algorithm using second-order (i.e. Hessian) information must pay close attention to both the indefinite and non-differentiable nature of the loss.

To investigate type I/II minima in greater depth we must therefore exploit the multilinear structure of L itself. Our first result along these lines concerns type II local minima.

Theorem 2. If ω is a type II local minimum then L(ω) > 0.

Modern networks of the form (5) typically have zero-loss global minimizers. For any such network type II (i.e. sharp) local minimizers are therefore always sub-optimal. A converse of theorem 2 holds for a restricted class of networks. That is, type I (i.e. flat) local minimizers are always optimal.
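As an informal numerical illustration of Theorem 1, the sketch below samples the loss of a small one-hidden-layer leaky ReLU network along a single entry of W while every other parameter stays fixed, and checks that the sampled values are affine in that entry as long as the signature (6) is unchanged (i.e. as long as we stay inside one cell). The architecture, data, step size and tolerance are all our own illustrative choices.

```python
import numpy as np

def loss_and_signature(W, b, V, c, X, Y, alpha=0.1):
    """Hinge loss (5) and stacked signature (6) for a one-hidden-layer leaky-ReLU net."""
    total, sig = 0.0, []
    for x, y in zip(X, Y):
        pre = W @ x + b
        h = np.where(pre > 0, pre, alpha * pre)
        y_hat = V @ h + c
        margins = (np.eye(len(y_hat)) - np.outer(np.ones_like(y_hat), y)) @ y_hat + 1.0
        total += (-1.0 + np.maximum(margins, 0.0).sum()) / len(X)   # mu_i = 1/N
        sig.append(np.concatenate([np.sign(pre), np.sign(margins)]))
    return total, np.concatenate(sig)

rng = np.random.default_rng(1)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)
V, c = rng.standard_normal((2, 4)), rng.standard_normal(2)
X = rng.standard_normal((5, 3))
Y = np.eye(2)[rng.integers(0, 2, size=5)]        # one-hot targets

# Sample the loss along a single coordinate of W and test for vanishing second differences.
base_loss, base_sig = loss_and_signature(W, b, V, c, X, Y)
vals = []
for t in np.linspace(-1e-3, 1e-3, 7):            # small steps, hoping to stay in one cell
    Wt = W.copy(); Wt[0, 0] += t
    val, sig = loss_and_signature(Wt, b, V, c, X, Y)
    assert np.array_equal(sig, base_sig), "left the cell; shrink the perturbation"
    vals.append(val)
print(np.max(np.abs(np.diff(vals, n=2))))        # ~0 up to round-off: affine within the cell
```

Vanishing second differences are exactly what one expects from a function that is multilinear, hence affine in each individual coordinate, on the cell.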
To make the optimality of type I minimizers precise we need a mild assumption on the data.

Definition 1. Fix α > 0 and a collection of weighted data points $(\mu^{(i)}, x^{(i)}, y^{(i)})$. The weighted data are rare if there exist N coefficients $\lambda^{(i)} \in \{1, \alpha, \ldots, \alpha^L\}$ and a non-zero collection of NR scalars $\varepsilon^{(i,r)} \in \{0, 1\}$ so that the system

$$\sum_{i : y^{(i)}_r = 1} \lambda^{(i)} \mu^{(i)} \varepsilon^{(i,r)} x^{(i)} = \sum_{i : y^{(i)}_r = 0} \lambda^{(i)} \mu^{(i)} \varepsilon^{(i,r)} x^{(i)}, \qquad \sum_{i : y^{(i)}_r = 1} \lambda^{(i)} \mu^{(i)} \varepsilon^{(i,r)} = \sum_{i : y^{(i)}_r = 0} \lambda^{(i)} \mu^{(i)} \varepsilon^{(i,r)} \tag{7}$$

holds for all $r \in [R]$. The data are generic if they are not rare.

As the possible choices of $\lambda^{(i)}, \varepsilon^{(i,r)}$ take on at most a finite set of values, rare data points $(\mu^{(i)}, x^{(i)}, y^{(i)})$ must satisfy one of a given finite set of linear relations. Thus (7) represents the exceptional case, and most data are generic. For example, if the $x^{(i)} \sim X^{(i)}$ come from independent samples of atomless random variables $X^{(i)}$ they are generic with probability one. Similarly, a small perturbation in the weights $\mu^{(i)}$ will usually transform data from rare to generic.

Theorem 3. Consider the loss (5) for a fully connected network. Assume that α > 0 and that the data points $(x^{(i)}, y^{(i)})$ are generic. Then L(ω) = 0 at any type I local minimum.

For most data we may pair this result with its counterpart for fully connected networks and obtain a clear picture. Desirable (zero loss) minima are always flat, while undesirable (positive loss) minima are always sharp. Analyzing sub-optimal minima therefore requires handling the non-smooth case, and we now turn to this task.

## 3. Critical Point Analysis

In this section we use machinery from non-smooth analysis (see chapter 6 of (Borwein & Lewis, 2010) for a good reference) to study critical points of the loss surface of such piecewise multilinear networks. We consider three scenarios by traveling from the deep linear case (α = 1) and passing through the leaky ReLU case (0 < α < 1) before arriving at the most common case (α = 0) of ReLU networks. We intend this journey to highlight how the loss surface changes as the level of nonlinearity increases. A deep linear network has a trivial loss surface, in that local and global minima coincide (see theorem 100 in the appendix for a precise statement and its proof). If we impose further assumptions, namely linearly separable data in a one-hidden-layer network, this benign structure persists into the leaky ReLU regime. When we arrive at α = 0 a bifurcation occurs, and sub-optimal local minima suddenly appear in classical ReLU networks.

To begin, we recall that for a Lipschitz but non-differentiable function f(ω) the Clarke subdifferential $\partial^0 f(\omega)$ of f at a point ω ∈ Ω provides a generalization of both the gradient ∇f(ω) and the usual subdifferential ∂f(ω) of a convex function. The Clarke subdifferential is defined as follows (c.f. page 133 of (Borwein & Lewis, 2010)):

Definition 2 (Clarke Subdifferential and Critical Points). Assume that a function $f : \Omega \to \mathbb{R}$ is locally Lipschitz around ω ∈ Ω, and differentiable on Ω \ M where M is a set of Lebesgue measure zero. Then the convex hull

$$\partial^0 f(\omega) := \mathrm{c.h.}\Big\{ \lim_{k \to \infty} \nabla f(\omega_k) : \omega_k \to \omega,\; \omega_k \notin M \Big\}$$

is the Clarke subdifferential of f at ω. In addition, if

$$0 \in \partial^0 f(\omega), \tag{8}$$

then ω is a critical point of f in the Clarke sense.

Figure 3. Illustration of the Clarke subdifferential.

The definition of critical point is a consistent one, in that (8) must hold whenever ω is a local minimum (c.f. page 125 of (Borwein & Lewis, 2010)). Thus the set of all critical points contains the set of all local minima.
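Before turning to figure 3, a minimal toy check of Definition 2: for f(x) = |x₁| + |x₂| the four open quadrants play the role of the cells, each carrying a constant gradient, and the uniform convex combination of these four limiting gradients vanishes, so 0 ∈ ∂⁰f(0) and the origin is a critical point in the Clarke sense. This example is ours and is not the function depicted in the figure.

```python
import numpy as np

# f(x) = |x1| + |x2| is smooth away from the axes; near the origin it has four
# "cells" (the open quadrants), and on each cell the gradient is constant.
piece_gradients = np.array([
    [ 1.0,  1.0],   # gradient of  x1 + x2  on the quadrant x1 > 0, x2 > 0
    [-1.0,  1.0],   # gradient of -x1 + x2  on x1 < 0, x2 > 0
    [-1.0, -1.0],   # gradient of -x1 - x2  on x1 < 0, x2 < 0
    [ 1.0, -1.0],   # gradient of  x1 - x2  on x1 > 0, x2 < 0
])

# The Clarke subdifferential at 0 is the convex hull of these limiting gradients.
# The uniform convex combination (theta_k = 1/4) already vanishes, so 0 is a
# critical point of f in the Clarke sense, as (8) requires at the global minimum.
theta = np.full(4, 0.25)
print(theta @ piece_gradients)        # [0. 0.]
```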
Figure 3 provides an illustration of the Clarke subdifferential. It depicts a function $f : \mathbb{R}^2 \to \mathbb{R}$ with a global minimum at the origin, which therefore defines a critical point in the Clarke sense. While the gradient of f itself does not exist at 0, its restrictions $f_k := f|_{\Omega_k}$ to the four cells $\Omega_k$ neighboring 0 have well-defined gradients $\nabla f_k(0)$ (shown in red) at the critical point. By definition the Clarke subdifferential $\partial^0 f(0)$ of f at 0 consists of all convex combinations

$$\theta_1 \nabla f_1(0) + \theta_2 \nabla f_2(0) + \theta_3 \nabla f_3(0) + \theta_4 \nabla f_4(0)$$

of these gradients; that some such combination vanishes (say, $\tfrac12 \nabla f_1(0) + \tfrac12 \nabla f_3(0) = 0$) means that 0 satisfies the definition of a critical point. Moreover, an element of the subdifferential $\partial^0 f$ naturally arises from gradient descent. A gradient-based optimization path $x^{(j+1)} = x^{(j)} - dt^{(j)}\, \nabla f(x^{(j)})$ (shown in blue) asymptotically builds, by successive accumulation at each step, a convex combination of the $\nabla f_k$ whose corresponding weights $\theta_k$ represent the fraction of time the optimization spends in each cell.

We may now show how to apply these tools in the study of ReLU networks. We first analyze the leaky regime (0 < α < 1) and then analyze the ordinary ReLU case (α = 0).

Leaky networks (0 < α < 1): Take 0 < α < 1 and consider the corresponding loss

$$L(W, v, b, c) = \sum_i \mu^{(i)} \sigma\Big[ 1 - y^{(i)} \big( v^T \sigma_\alpha(W x^{(i)} + b) + c \big) \Big] \tag{9}$$

associated to a fully connected network with one hidden layer. We shall also assume the data $\{x^{(i)}\}$ are linearly separable. In this setting we have

Theorem 4 (Leaky ReLU Networks). Consider the loss (9) with α > 0 and data $x^{(i)}$, i ∈ [N], that are linearly separable. Assume that ω = (W, v, b, c) is any critical point of the loss in the Clarke sense. Then either v = 0 or ω is a global minimum.

The loss in this scenario has two types of critical points. Critical points with v = 0 correspond to a trivial network in which all data points are mapped to a constant; all other critical points are global minima. If we further assume equally weighted classes,

$$\sum_{i : y^{(i)} = 1} \mu^{(i)} = \sum_{i : y^{(i)} = -1} \mu^{(i)},$$

then all local minima are global minima.

Theorem 5 (Leaky ReLU Networks with Equal Weights). Consider the loss (9) with α > 0 and data $x^{(i)}$, i ∈ [N], that are linearly separable. Assume that the $\mu^{(i)}$ weight both classes equally. Then every local minimum of L(ω) is a global minimum.

In other words, the loss surface is trivial when 0 < α ≤ 1.

ReLU networks (α = 0): This is the case of paramount interest. When passing from α > 0 to α = 0 a structural bifurcation occurs in the loss surface: ReLU nonlinearities generate non-optimal local minima even in a one-hidden-layer network with separable data. Our analysis provides an explicit description of all the critical points of such loss surfaces, which allows us to precisely understand the way in which sub-optimality occurs. In order to describe this structure let us briefly assume that we have a simplified model with two hidden neurons, no output bias and uniform weights. If $w_k$ denotes the kth row of W then we have the loss

$$L(W, v, b) = \frac{1}{N} \sum_i \sigma\big(1 - y^{(i)} \hat y^{(i)}\big), \quad \text{where} \quad \hat y^{(i)} = \sum_{k=1}^{2} v_k\, \sigma\big(\langle w_k, x^{(i)}\rangle + b_k\big), \tag{10}$$

for such a network. Each hidden neuron has an associated hyperplane $\langle w_k, \cdot\rangle + b_k$ as well as a scalar weight $v_k$ used to form the output.
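As a minimal sketch of this simplified model, the code below implements the loss (10) and probes a configuration in which one data point lies on the blind side of both hyperplanes; small parameter perturbations cannot lower that point's contribution, which is the mechanism behind the sub-optimal minima discussed next. The one-dimensional data set and the parameter values are invented for illustration and are not those of figure 4.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def simplified_loss(W, v, b, X, Y):
    """Loss (10): uniform-weight hinge loss of a two-hidden-neuron ReLU net, no output bias."""
    y_hat = relu(X @ W.T + b) @ v          # y_hat[i] = sum_k v_k * relu(<w_k, x_i> + b_k)
    return relu(1.0 - Y * y_hat).mean()

# Illustrative 1-D data: two points of class +1 that the hyperplanes "see",
# and one point of class -1 sitting on the blind side of both ReLUs.
X = np.array([[1.0], [2.0], [-3.0]])
Y = np.array([1.0, 1.0, -1.0])

W = np.array([[1.0], [1.0]])               # both hyperplanes open toward x > 0
b = np.array([-0.5, -0.5])
v = np.array([1.0, 1.0])

base = simplified_loss(W, v, b, X, Y)
print(base)                                 # 1/3: the dead point x = -3 contributes sigma(1) = 1

# Small random perturbations cannot wake the dead point, so the loss cannot drop
# below 1/3 in a neighbourhood: the configuration behaves like a sub-optimal minimum.
rng = np.random.default_rng(2)
for _ in range(5):
    dW, dv, db = (1e-2 * rng.standard_normal(p.shape) for p in (W, v, b))
    print(simplified_loss(W + dW, v + dv, b + db, X, Y) >= base - 1e-12)
```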
Figure 4 shows three different local minima of such a network.

Figure 4. Three different local minima of the loss L(ω) for a network with two hidden neurons and standard ReLU nonlinearities: (a) L = 0, (b) L = 3/10, (c) L = 10/10. Points belonging to class +1 (resp. -1) are denoted by circles (resp. triangles). Data points for which the loss is zero (solved points) are colored in green, while data points with non-zero loss (unsolved points) are in red. The unsolved data points always lie on the blind side of both hyperplanes.

The first panel, figure 4(a), shows a global minimum where all the data points have zero loss. Figure 4(b) shows a sub-optimal local minimum. All unsolved data points, namely those that contribute a non-zero value to the loss, lie on the blind side of the two hyperplanes. For each of these data points the corresponding network output $\hat y^{(i)}$ vanishes and so the loss is $\sigma(1 - y^{(i)} \hat y^{(i)}) = \sigma(1) = 1$ for these unsolved points. Small perturbations of the hyperplanes or of the values of the $v_k$ do not change the fact that these data points lie on the blind side of the two hyperplanes. Their loss will not decrease under small perturbations, and so the configuration is, in fact, a local minimum. The same reasoning shows that the configuration in figure 4(c), in which no data point is classified correctly, is also a local minimum.

Despite the presence of sub-optimal local minimizers, the local minima depicted in figure 4 are somehow trivial cases. They simply come from the fact that, due to inactive ReLUs, some data points are completely ignored by the network, and this fact cannot be changed by small perturbations. The next theorem essentially shows that these are the only possible sub-optimal local minima that occur. Moreover, the result holds for the case (9) of interest and not just the simplified model.

Theorem 6 (ReLU Networks). Consider the loss (9) with α = 0 and data $x^{(i)}$, i ∈ [N], that are linearly separable. Assume that ω = (W, v, b, c) is a critical point in the Clarke sense, and that $x^{(i)}$ is any data point that contributes a non-zero value to the loss. Then for each hidden neuron k ∈ [K] either (i) $\langle w_k, x^{(i)}\rangle + b_k \le 0$, or (ii) $v_k = 0$.

If $v_k = 0$ then the kth hidden neuron is unused when forming network predictions. In this case we may say the kth hyperplane is inactive, while if $v_k \neq 0$ the corresponding hyperplane is active. Theorem 6 therefore states that if a data point $x^{(i)}$ is unsolved it must lie on the blind side of every active hyperplane. So all critical points, including local minima, obey the property sketched in figure 4.

When taken together, theorems 5 and 6 provide rigorous mathematical ground for the common view that dead or inactive neurons can cause difficulties in optimizing neural networks, and that using leaky ReLU networks can overcome these difficulties. The former have sub-optimal local minimizers exactly when a data point does not activate any of the ReLUs in the hidden layer, but this situation never occurs with leaky ReLUs and so neither do sub-optimal minima.

Figure 5. Four-way classification with multiclass hinge loss (2): (a) L(ω) = 0, (b) L(ω) > 0. At left a global minimizer. At right a sub-optimal local minimizer where the analogue of theorem 6 fails.

## 4. Exact Penalties and Multi-Class Structure

These results give a clear illustration of how nonlinearity and data complexity combine to produce local minimizers in the loss surface for binary classification tasks. While we might try to analyze multi-class tasks by following down the same path, such an effort would unfortunately bring us to a quite different destination.
Specifically, the conclusion of theorem 6 fails in the multi-class case; in the presence of three or more classes a critical point may exhibit active yet unsolved data points (c.f. figure 5). This phenomenon is inherent to multiclass tasks in a certain sense, for if we use the same features $x^{(i,\ell)}$ (c.f. (4)) in a multi-layer ReLU network but apply a different network criterion $\ell(y, \hat y)$ then the phenomenon persists. For example, using the one-versus-all criterion

$$\ell(\hat y^{(i)}, y^{(i)}) := \sum_r \mu^{(i,r)} \sigma\big(1 + \hat y^{(i)}_r (-1)^{y^{(i)}_r}\big) \tag{11}$$

in place of the hinge loss (2) still gives rise to a network with non-trivial critical points (similar to figure 5) despite its more binary structure. In this way, the emergence of non-trivial critical points reflects the nature of multi-class tasks rather than some pathology of the hinge-loss network criterion itself. To arrive at the same destination our analysis must therefore take a more circuitous route. As these counterexamples suggest, if the loss L(ω) has non-trivial critical points then we must avoid them by modifying the training strategy instead. We shall employ the one-versus-all criterion (11) for this task, as this choice allows us to directly leverage our binary analyses.

Let us begin this process by recalling that $x^{(i,L)}\big(\omega^{(1)}, \ldots, \omega^{(L)}, b^{(1)}, \ldots, b^{(L)}\big)$ and $\hat y^{(i)} = V x^{(i,L)} + c$ denote the features and predictions of the network with L hidden layers, respectively. The sub-collection of parameters $\tilde\omega := \big(\omega^{(1)}, \ldots, \omega^{(L)}, b^{(1)}, \ldots, b^{(L)}\big)$ therefore determines a set of features $x^{(i,L)}$ for the network, while the parameters V, c determine a set of one-versus-all classifiers utilizing these features. We may write the loss for the rth class as

$$L^{(r)}\big(\tilde\omega, v_r, c_r\big) = \sum_i \mu^{(i,r)} \sigma\big(1 + \hat y^{(i)}_r (-1)^{y^{(i)}_r}\big) \tag{12}$$

and then form the sum over classes $L(\omega) := \big(L^{(1)} + \cdots + L^{(R)}\big)(\omega)$ to recover the total objective.

We then seek to minimize L by applying a soft-penalty approach. We introduce the R replicates

$$\omega^{(r)} = \big(\omega^{(1,r)}, \ldots, \omega^{(L,r)}, b^{(1,r)}, \ldots, b^{(L,r)}\big), \qquad r \in [R],$$

of the hidden-layer parameters $\tilde\omega$ and include a soft $\ell_2$ penalty

$$R\big(\omega^{(1)}, \ldots, \omega^{(R)}\big) := \sum_{r=1}^{R} \sum_{\ell=1}^{L} \Big( \big\| \omega^{(\ell,r)} - \bar\omega^{(\ell)} \big\|^2 + \big\| b^{(\ell,r)} - \bar b^{(\ell)} \big\|^2 \Big)$$

to enforce that the replicated parameters $\omega^{(\ell,r)}, b^{(\ell,r)}$ remain close to their corresponding means $\big(\bar\omega^{(\ell)}, \bar b^{(\ell)}\big)$ across classes. Our training strategy then proceeds to minimize the penalized loss

$$E_\gamma\big(\omega^{(1)}, \ldots, \omega^{(R)}\big) := \sum_r L^{(r)}\big(\omega^{(r)}\big) + \gamma\, R\big(\omega^{(1)}, \ldots, \omega^{(R)}\big) \tag{13}$$

for γ > 0 some parameter controlling the strength of the penalty. Remarkably, utilizing this strategy yields

Theorem 7 (Exact Penalty and Recovery of Two-Class Structure). If γ > 0 then the following hold for (13):

(i) The penalty is exact, that is, at any critical point $\big(\omega^{(1)}, \ldots, \omega^{(R)}\big)$ of $E_\gamma$ the equalities

$$\omega^{(\ell,1)} = \cdots = \omega^{(\ell,R)} = \bar\omega^{(\ell)} := \frac{1}{R}\sum_{r=1}^{R} \omega^{(\ell,r)}, \qquad b^{(\ell,1)} = \cdots = b^{(\ell,R)} = \bar b^{(\ell)} := \frac{1}{R}\sum_{r=1}^{R} b^{(\ell,r)}$$

hold for all ℓ ∈ [L].

(ii) At any critical point of $E_\gamma$ the two-class critical point relations $0 \in \partial^0 L^{(r)}\big(\bar\omega, v_r, c_r\big)$ hold for all r ∈ [R].

In other words, applying a soft-penalty approach to minimizing the original problem (12) actually yields an exact penalty method. By (i), at critical points we obtain a common set of features $x^{(i,L)}$ for each of the R binary classification problems. Moreover, by (ii) these features simultaneously yield critical points

$$0 \in \partial^0 L^{(r)}\big(\bar\omega, v_r, c_r\big) \tag{14}$$

for all of these binary classification problems. The fact that (14) may fail for critical points of L is responsible for the presence of non-trivial critical points in the context of a network with one hidden layer.
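As a minimal sketch of the penalized objective (13), the code below assembles E_γ for a one-hidden-layer network with R replicated copies of the hidden-layer parameters, the per-class one-versus-all losses (c.f. (12)), and the ℓ₂ coupling penalty R; the shapes, data, and helper names are illustrative assumptions rather than the paper's construction.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def per_class_loss(W, b, v, c, X, Y01, r, mu):
    """One-versus-all hinge loss L^(r) built from the features sigma(Wx + b)."""
    feats = relu(X @ W.T + b)                    # x^(i,1), here with alpha = 0
    scores = feats @ v + c                       # per-datum score for class r
    signs = np.where(Y01[:, r] == 1, 1.0, -1.0)  # y^(i,r) in {-1, +1}
    return np.sum(mu[:, r] * relu(1.0 - signs * scores))

def penalized_loss(replicas, vs, cs, X, Y01, mu, gamma):
    """E_gamma in (13): per-class losses on replicated parameters plus the coupling penalty."""
    W_bar = np.mean([W for W, _ in replicas], axis=0)
    b_bar = np.mean([b for _, b in replicas], axis=0)
    data_term = sum(per_class_loss(W, b, vs[r], cs[r], X, Y01, r, mu)
                    for r, (W, b) in enumerate(replicas))
    penalty = sum(np.sum((W - W_bar) ** 2) + np.sum((b - b_bar) ** 2) for W, b in replicas)
    return data_term + gamma * penalty

# Illustrative setup: 3 classes, 4 hidden neurons, 2-D inputs, 6 data points.
rng = np.random.default_rng(3)
R, K, d, N = 3, 4, 2, 6
X = rng.standard_normal((N, d))
Y01 = np.eye(R)[rng.integers(0, R, size=N)]      # one-hot labels
mu = np.full((N, R), 1.0 / N)                    # equal per-class weights
replicas = [(rng.standard_normal((K, d)), rng.standard_normal(K)) for _ in range(R)]
vs, cs = rng.standard_normal((R, K)), rng.standard_normal(R)

print(penalized_loss(replicas, vs, cs, X, Y01, mu, gamma=0.1))
```

Minimizing such an objective with any subgradient method and then checking that the replicas have collapsed onto their mean gives a direct, if informal, way to observe the exactness property asserted in theorem 7(i).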
We may therefore interpret (ii) as saying that a training strategy that uses the penalty approach will avoid pathological critical points where $0 \in \partial^0 L(\omega)$ holds but (14) does not. In this way the penalty approach provides a path forward for studying multi-class problems. Regardless of the number L of hidden layers, it allows us to form an understanding of the family of critical points (14) by reducing to a study of critical points of binary classification problems. This allows us to extend the analyses of the previous section to the multi-class context.

We may now pursue an analysis of multi-class problems by traveling along the same path that we followed for binary classification. That is, a deep linear network (α = 1) once again has a trivial loss surface (see corollaries 100 and 101 in the appendix for precise statements and proofs). By imposing the same further assumptions, namely linearly separable data in a one-hidden-layer network, we may extend this benign structure into the leaky ReLU regime. Finally, when α = 0 sub-optimal local minima appear; we may characterize them in a manner analogous to the binary case. To be precise, recall the loss

$$L(\omega) = \sum_{r=1}^{R} L^{(r)}(\omega), \qquad L^{(r)}(\omega) := \sum_i \mu^{(i,r)} \sigma\big(1 - y^{(i,r)}\big(\langle v_r, x^{(i,1)}\rangle + c_r\big)\big), \tag{15}$$

that results from the features $x^{(i,1)} = \sigma_\alpha(W x^{(i)} + b)$ of a ReLU network with one hidden layer. If the positive weights $\mu^{(i,r)} > 0$ satisfy

$$\sum_{i : y^{(i,r)} = 1} \mu^{(i,r)} = \sum_{i : y^{(i,r)} = -1} \mu^{(i,r)} = 1$$

then we say that the $\mu^{(i,r)}$ give equal weight to all classes. Appealing to the critical point relations (14) yields the following corollary. It gives the precise structure that emerges from the leaky regime 0 < α < 1 with separable data.

Corollary 3 (Multiclass with 0 < α < 1). Consider the loss (15) and its corresponding penalty (13) with γ > 0, 0 < α < 1 and data $x^{(i)}$, i ∈ [N], that are linearly separable.

(i) Assume that $\omega = (\omega^{(1)}, \ldots, \omega^{(R)})$ is a critical point of $E_\gamma$ in the Clarke sense. If $v_r \neq 0$ for all r ∈ [R] then ω is a global minimum of L and of $E_\gamma$.

(ii) Assume that the $\mu^{(i,r)}$ give equal weight to all classes. If $\omega = (\omega^{(1)}, \ldots, \omega^{(R)})$ is a local minimum of $E_\gamma$ and $v_r = 0$ for some r ∈ [R] then ω is a global minimum of L and of $E_\gamma$.

Finally, when arriving at the standard ReLU nonlinearity α = 0 a bifurcation occurs. Sub-optimal local minimizers of $E_\gamma$ can exist, but once again the manner in which these sub-optimal solutions appear is easy to describe. We let $\ell^{(i,r)}(\omega)$ denote the contribution of the ith data point $x^{(i)}$ to the loss $L^{(r)}$ for the rth class, so that $L^{(r)}(\omega) = \sum_i \mu^{(i,r)} \ell^{(i,r)}(\omega)$ gives the total loss. Appealing directly to the family of critical point relations $0 \in \partial^0 L^{(r)}(\bar\omega, v_r, c_r)$ furnished by theorem 7 yields our final corollary in the multiclass setting.

Corollary 4 (Multiclass with α = 0). Consider the loss (15) and its corresponding penalty (13) with γ > 0, α = 0 and data $x^{(i)}$, i ∈ [N], that are linearly separable. Assume that $\omega = (\omega^{(1)}, \ldots, \omega^{(R)})$ is any critical point of $E_\gamma$ in the Clarke sense. Then

$$\ell^{(i,r)}(\omega) > 0 \;\Longrightarrow\; (v_r)_k\, \sigma\big(\langle w_k, x^{(i)}\rangle + b_k\big) = 0 \quad \text{for all } k \in [K].$$

## 5. Conclusion

We conclude by painting the overall picture that emerges from our analyses. The loss of a ReLU network is a multilinear form inside each cell. Multilinear forms are harmonic functions, and so maxima or minima simply cannot occur in the interior of a cell unless the loss is constant on the entire cell. This simple harmonic-analysis reasoning leads to the following striking fact. ReLU networks do not have differentiable minima, except for trivial cases.
This reasoning is valid for any convolutional or fully connected network, with plain or leaky ReLUs, and with binary or multiclass hinge loss. Dealing with non-differentiable minima is therefore not a technicality; it is the heart of the matter. Given this dichotomy between trivial, differentiable minima on one hand and nontrivial, nondifferentiable minima on the other, it is natural to try to characterize these two classes of minima more precisely. We show that global minima with zero loss must be trivial, while minima with nonzero loss are necessarily nondifferentiable for many fully connected networks. In particular, if a network has no zero-loss minimizers then all minima are nondifferentiable.

Finally, our analysis clearly shows that local minima of ReLU networks are generically nondifferentiable. They cannot be waved away as a technicality, so any study of the loss surface of such networks must invoke nonsmooth analysis. We show how to properly use this machinery (e.g. Clarke subdifferentials) to study ReLU networks. Our goal is twofold. First, we prove that a bifurcation occurs when passing from leaky ReLU to ReLU nonlinearities, as sub-optimal minima suddenly appear in the latter case. Secondly, and perhaps more importantly, we show how to apply nonsmooth analysis in familiar settings so that future researchers can adapt and extend our techniques.

## References

Borwein, J. and Lewis, A. S. Convex analysis and nonlinear optimization: theory and examples. Second edition. Springer Science & Business Media, 2010.

Brutzkus, A., Globerson, A., Malach, E., and Shalev-Shwartz, S. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.

Chaudhari, P., Choromanska, A., Soatto, S., and LeCun, Y. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192-204, 2015.

Du, S. S., Lee, J. D., and Tian, Y. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017.

Frasconi, P., Gori, M., and Tesi, A. Successes and failures of back-propagation: A theoretical investigation. Progress in Neural Networks: Architecture, 5:205, 1997.

Gori, M. and Tesi, A. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76-86, 1992.

Kawaguchi, K. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586-594, 2016.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2016.

Li, Y. and Yuan, Y. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pp. 597-607, 2017.

Mei, S., Bai, Y., and Montanari, A. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

Safran, I. and Shamir, O. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning, pp. 774-782, 2016.

Soltanolkotabi, M. Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems, pp. 2004-2014, 2017.

Soudry, D. and Hoffer, E. Exponentially vanishing suboptimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.
Tian, Y. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, 2017.

Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning, 2017.

Zhou, P. and Feng, J. The landscape of deep learning algorithms. arXiv preprint arXiv:1705.07038, 2017.