# Optimization Landscape and Expressivity of Deep CNNs

Quynh Nguyen (Department of Mathematics and Computer Science, Saarland University, Germany), Matthias Hein (University of Tübingen, Germany). Correspondence to: Quynh Nguyen.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract. We analyze the loss landscape and expressiveness of practical deep convolutional neural networks (CNNs) with shared weights and max-pooling layers. We show that such CNNs produce linearly independent features at a wide layer which has more neurons than the number of training samples. This condition holds e.g. for the VGG network. Furthermore, we provide for such wide CNNs necessary and sufficient conditions for global minima with zero training error. For the case where the wide layer is followed by a fully connected layer, we show that almost every critical point of the empirical loss is a global minimum with zero training error. Our analysis suggests that both depth and width are very important in deep learning. While depth brings more representational power and allows the network to learn high-level features, width smoothes the optimization landscape of the loss function in the sense that a sufficiently wide network has a well-behaved loss surface with almost no bad local minima.

## 1. Introduction

It is well known that the optimization problem for training neural networks can have exponentially many local minima (Auer et al., 1996; Safran & Shamir, 2016) and NP-hardness has been shown in many cases (Blum & Rivest, 1989; Sima, 2002; Livni et al., 2014; Shamir, 2017; Shalev-Shwartz et al., 2017). However, it has been empirically observed (Dauphin et al., 2014; Goodfellow et al., 2015) that the training of state-of-the-art deep CNNs (LeCun et al., 1990; Krizhevsky et al., 2012), which are often overparameterized, is not hampered by suboptimal local minima.

In order to explain the apparent gap between hardness results and practical performance, many interesting theoretical results have been developed recently (Andoni et al., 2014; Sedghi & Anandkumar, 2015; Janzamin et al., 2016; Haeffele & Vidal, 2017; Gautier et al., 2016; Brutzkus & Globerson, 2017; Soltanolkotabi, 2017; Zhong et al., 2017; Tian, 2017; Du et al., 2018) in order to identify conditions under which one can guarantee that local search algorithms like gradient descent converge to the globally optimal solution. However, these approaches are either not practical, as they require e.g. knowledge about the data-generating measure or a modification of the network structure and objective, or they apply only to quite restricted network structures, mostly one-hidden-layer networks, and thus are not able to explain the success of deep networks in general.

Table 1. The maximum width of all layers in several state-of-the-art CNN architectures compared with the size of the ImageNet dataset (N ≈ 1200K). All numbers are lower bounds on the true width.

| CNN architecture | $M = \max_k n_k$ | $M > N$ |
| --- | --- | --- |
| VGG (A–E) (Simonyan & Zisserman, 2015) | 3000K (k = 1) | yes |
| InceptionV3 (Szegedy et al., 2015b) | 1300K (k = 3) | yes |
| InceptionV4 (Szegedy et al., 2016) | 1300K (k = 3) | yes |
| SqueezeNet (Iandola et al., 2016) | 1180K (k = 1) | no |
| ENet (Paszke et al., 2016) | 1000K (k = 1) | no |
| GoogLeNet (Szegedy et al., 2015a) | 800K (k = 1) | no |
| ResNet (He et al., 2016) | 800K (k = 1) | no |
| Xception (Chollet, 2016) | 700K (k = 1) | no |
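To make the width count in Table 1 concrete, here is a back-of-the-envelope calculation (not from the paper) for the first convolutional layer of VGG on ImageNet; the filter count and input resolution are taken from the published VGG configuration, while the exact patch count $P_0$ depends on padding and stride, so the table entry is a lower bound.

```latex
% First convolutional layer of VGG (configurations A-E): T_1 = 64 filters of size
% 3 x 3 x 3, applied at (roughly) every pixel of a 224 x 224 input, so P_0 ~ 224^2:
n_1 \;=\; T_1 \, P_0 \;\approx\; 64 \cdot 224 \cdot 224 \;\approx\; 3.2 \times 10^{6}
\;>\; N \;\approx\; 1.2 \times 10^{6} \quad \text{(ImageNet)}.
```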
For deep linear networks a quite complete picture of the loss surface has been obtained: every local minimum is a global minimum (Baldi & Hornik, 1988; Kawaguchi, 2016; Freeman & Bruna, 2017; Hardt & Ma, 2017; Yun et al., 2018). By randomizing the nonlinear part of a feedforward network with ReLU activation function and making some additional simplifying assumptions, (Choromanska et al., 2015a) relate the loss surface of neural networks to a certain spin glass model. In this model the objective value of local minima is close to the global optimum and the number of bad local minima decreases quickly with the distance to the global optimum. This is a very interesting result, but it is based on a number of unrealistic assumptions (Choromanska et al., 2015b). More recently, (Nguyen & Hein, 2017) have analyzed deep fully connected networks with general activation functions and have shown that almost every critical point is a global minimum if one layer has more neurons than the number of training points. While this result holds for networks used in practice, it requires a quite extensively overparameterized network.

In this paper we overcome the restrictions of previous work in several ways. This paper is one of the first to analyze the loss surface of deep CNNs. CNNs are of high practical interest as they learn very useful representations (Zeiler & Fergus, 2014; Mahendran & Vedaldi, 2015; Yosinski et al., 2015) with a small number of parameters. We are only aware of (Cohen & Shashua, 2016), who study the expressiveness of CNNs with max-pooling layers and ReLU activation but with rather unrealistic filters (just 1×1) and no shared weights. In our setting we allow max-pooling as well as general activation functions. Moreover, we can have an arbitrary number of filters, and we study general convolutions: the filters need not be applied to regular structures like 3×3 windows but can be patch-based, where the only condition is that all patches have the size of the filter. Convolutional layers, fully connected layers and max-pooling layers can be combined in almost arbitrary order.

We study in this paper the expressiveness and loss surface of a CNN where one layer is wide, in the sense that it has more neurons than the number of training points. While this assumption sounds quite strong at first, we want to emphasize that the popular VGG (Simonyan & Zisserman, 2015) and Inception networks (Szegedy et al., 2015b; 2016), see Table 1, fulfill this condition. We show that wide CNNs produce linearly independent feature representations at the wide layer and thus are able to fit the training data exactly (universal finite sample expressivity). This is even true with probability one when all the parameters up to the wide layer are chosen randomly (for any probability measure on the parameter space which has a density with respect to the Lebesgue measure). We think that this partially explains the results of (Zhang et al., 2017), where it is shown experimentally for several CNNs that they are able to fit random labels. Moreover, we provide necessary and sufficient conditions for global minima with zero squared loss and show for a particular class of CNNs that almost all critical points are globally optimal, which to some extent explains why wide CNNs can be optimized so efficiently. All proofs are moved to the appendix due to limited space.

## 2. Deep Convolutional Neural Networks

We first introduce our notation and definition of CNNs. Let $N$ be the number of training samples and denote by $X = [x_1, \ldots, x_N]^T \in \mathbb{R}^{N\times d}$ and $Y = [y_1, \ldots, y_N]^T \in \mathbb{R}^{N\times m}$ the input resp. output matrix for the training data $(x_i, y_i)_{i=1}^N$, where $d$ is the input dimension and $m$ the number of classes.
Let $L$ be the number of layers of the network, where each layer is either a convolutional, max-pooling or fully connected layer. The layers are indexed by $k = 0, 1, \ldots, L$, corresponding to the input layer, the first hidden layer, ..., and the output layer. Let $n_k$ be the width of layer $k$ and $f_k : \mathbb{R}^d \to \mathbb{R}^{n_k}$ the function that computes for every input its feature vector at layer $k$.

A convolutional layer consists of a set of patches of equal size, where every patch is a subset of neurons of the same layer. Throughout this paper, we assume that the patches of every layer cover the whole layer, i.e. every neuron belongs to at least one of the patches, and that there are no patches that contain exactly the same subset of neurons. This means that if one patch covers the whole layer then it must be the only patch of the layer. Let $P_k$ and $l_k$ be the number of patches resp. the size of each patch at layer $k$ for every $0 \le k < L$. For every input $x \in \mathbb{R}^d$, let $x^1, \ldots, x^{P_0} \in \mathbb{R}^{l_0}$ denote the set of patches at the input layer and $\{f_k^1(x), \ldots, f_k^{P_k}(x)\} \subset \mathbb{R}^{l_k}$ the set of patches at layer $k$. Each filter of the layer is applied to the same set of patches. We denote by $T_k$ the number of convolutional filters and by $W_k = [w_k^1, \ldots, w_k^{T_k}] \in \mathbb{R}^{l_{k-1}\times T_k}$ the corresponding parameter matrix of the convolutional layer $k$ for every $1 \le k < L$. Each column of $W_k$ corresponds to one filter. Furthermore, $b_k \in \mathbb{R}^{n_k}$ denotes the bias vector and $\sigma_k : \mathbb{R} \to \mathbb{R}$ the activation function of each layer. Note that one can use the same activation function for all layers, but we use the general form to highlight the role of the different layers. In this paper, all functions are applied componentwise, and we denote by $[a]$ the set of integers $\{1, 2, \ldots, a\}$ and by $[a, b]$ the set of integers from $a$ to $b$.

Definition 2.1 (Convolutional layer). A layer $k$ is called a convolutional layer if its output $f_k(x) \in \mathbb{R}^{n_k}$ is defined for every $x \in \mathbb{R}^d$ as
$$f_k(x)_h = \sigma_k\big(\langle w_k^t, f_{k-1}^p(x)\rangle + (b_k)_h\big) \quad (1)$$
for every $p \in [P_{k-1}]$, $t \in [T_k]$, $h := (p-1)T_k + t$.

The value of each neuron at layer $k$ is computed by first taking the inner product between a filter of layer $k$ and a patch at layer $k-1$, adding the bias and then applying the activation function. The number of neurons at layer $k$ is thus $n_k = T_k P_{k-1}$, which we denote as the width of layer $k$. Our definition of a convolutional layer is quite general, as every patch can be an arbitrary subset of neurons of the same layer, and thus it covers most existing variants in practice. Definition 2.1 includes the fully connected layer as a special case by choosing $P_{k-1} = 1$, $l_{k-1} = n_{k-1}$, $f_{k-1}^1(x) = f_{k-1}(x) \in \mathbb{R}^{n_{k-1}}$, $T_k = n_k$, $W_k \in \mathbb{R}^{n_{k-1}\times n_k}$, $b_k \in \mathbb{R}^{n_k}$. In that case there is only one patch, which is the whole feature vector at that layer.

Definition 2.2 (Fully connected layer). A layer $k$ is called a fully connected layer if its output $f_k(x) \in \mathbb{R}^{n_k}$ is defined for every $x \in \mathbb{R}^d$ as
$$f_k(x) = \sigma_k\big(W_k^T f_{k-1}(x) + b_k\big). \quad (2)$$

For some results we also allow max-pooling layers.

Definition 2.3 (Max-pooling layer). A layer $k$ is called a max-pooling layer if its output $f_k(x) \in \mathbb{R}^{n_k}$ is defined for every $x \in \mathbb{R}^d$ and $p \in [P_{k-1}]$ as
$$f_k(x)_p = \max\Big( (f_{k-1}^p(x))_1, \ldots, (f_{k-1}^p(x))_{l_{k-1}} \Big).$$

A max-pooling layer simply computes the maximum element of every patch from the previous layer. Since there are $P_{k-1}$ patches at layer $k-1$, the number of output neurons at layer $k$ is $n_k = P_{k-1}$.
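To make Definitions 2.1 to 2.3 concrete, the following minimal NumPy sketch (my own illustration, not code from the paper) computes one convolutional, one fully connected and one max-pooling layer under the conventions above, with the patches given explicitly as index lists.

```python
import numpy as np

def conv_layer(f_prev, patches, W, b, sigma):
    """Definition 2.1: f_k(x)_h = sigma(<w^t, f^p_{k-1}(x)> + (b_k)_h), h = (p-1)T_k + t."""
    P, T = len(patches), W.shape[1]
    out = np.empty(P * T)
    for p, idx in enumerate(patches):        # loop over patches of layer k-1
        for t in range(T):                   # loop over filters
            h = p * T + t                    # 0-based version of h = (p-1)T_k + t
            out[h] = sigma(W[:, t] @ f_prev[idx] + b[h])
    return out

def fc_layer(f_prev, W, b, sigma):
    """Definition 2.2: a single patch covering the whole previous layer."""
    return sigma(W.T @ f_prev + b)

def maxpool_layer(f_prev, patches):
    """Definition 2.3: the maximum of every patch; output width n_k = P_{k-1}."""
    return np.array([f_prev[idx].max() for idx in patches])

# toy usage: previous layer with 5 neurons, 1D patches of size 3 (stride 1, no padding)
f_prev = np.random.randn(5)
patches = [np.arange(i, i + 3) for i in range(3)]   # P_{k-1} = 3, l_{k-1} = 3
W = np.random.randn(3, 2)                           # T_k = 2 filters
b = np.random.randn(len(patches) * W.shape[1])
relu = lambda z: np.maximum(z, 0.0)
print(conv_layer(f_prev, patches, W, b, relu).shape)  # (6,) = T_k * P_{k-1}
```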
Reformulation of convolutional layers: For each convolutional or fully connected layer, we denote by $M_k : \mathbb{R}^{l_{k-1}\times T_k} \to \mathbb{R}^{n_{k-1}\times n_k}$ the linear map that returns for every parameter matrix $W_k \in \mathbb{R}^{l_{k-1}\times T_k}$ the corresponding full weight matrix $U_k = M_k(W_k) \in \mathbb{R}^{n_{k-1}\times n_k}$. For convolutional layers, $U_k$ can be seen as the counterpart of the weight matrix $W_k$ of a fully connected layer. We define $U_k = M_k(W_k) = W_k$ if layer $k$ is fully connected. Note that the mapping $M_k$ depends on the patch structure of the convolutional layer $k$. For example, suppose that layer $k$ has two filters of length 3, that is,
$$W_k = [w_k^1, w_k^2] = \begin{bmatrix} a & d \\ b & e \\ c & f \end{bmatrix},$$
that $n_{k-1} = 5$, and that the patches are given by a 1D convolution with stride 1 and no padding. Then
$$U_k^T = M_k(W_k)^T = \begin{bmatrix}
a & b & c & 0 & 0 \\
d & e & f & 0 & 0 \\
0 & a & b & c & 0 \\
0 & d & e & f & 0 \\
0 & 0 & a & b & c \\
0 & 0 & d & e & f
\end{bmatrix}.$$
The ordering of the rows of $U_k^T$ of a convolutional layer is determined by (1): the row index $h$ is calculated as $h = (p-1)T_k + t$, which means that for every given patch $p$ one loops over all filters $t$ and computes the corresponding output unit by taking the inner product of the $h$-th row of $U_k^T$ with the whole feature vector of the previous layer.

We assume throughout this paper that there is no non-linearity at the output layer. Ignoring max-pooling layers for the moment, the feature maps $f_k : \mathbb{R}^d \to \mathbb{R}^{n_k}$ can be written as
$$f_k(x) = \begin{cases} x & k = 0,\\ \sigma_k\big(g_k(x)\big) & 1 \le k \le L-1,\\ g_L(x) & k = L, \end{cases}$$
where $g_k : \mathbb{R}^d \to \mathbb{R}^{n_k}$ is the pre-activation function
$$g_k(x) = U_k^T f_{k-1}(x) + b_k, \quad 1 \le k \le L.$$
By stacking the feature vectors of layer $k$ of all training samples, before and after applying the activation function, into matrices, we define
$$F_k = [f_k(x_1), \ldots, f_k(x_N)]^T \in \mathbb{R}^{N\times n_k}, \qquad G_k = [g_k(x_1), \ldots, g_k(x_N)]^T \in \mathbb{R}^{N\times n_k}.$$
In this paper, we refer to $F_k$ as the output matrix at layer $k$. It follows from the above that
$$F_k = \begin{cases} X & k = 0,\\ \sigma_k(G_k) & 1 \le k \le L-1,\\ G_L & k = L, \end{cases} \quad (4)$$
where $G_k = F_{k-1}U_k + \mathbf{1}_N b_k^T$ for every $1 \le k \le L$.
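A minimal sketch (my own illustration, not code from the paper) of the map $M_k$ for the 1D example above: each filter is scattered into the row of $U_k^T$ corresponding to its patch, and the result is checked against the patch-wise inner products of Definition 2.1 (linear part only, no bias).

```python
import numpy as np

def build_U(W, n_prev, patches):
    """U_k = M_k(W_k): row h = p*T + t of U^T holds filter t placed on patch p."""
    l, T = W.shape
    U_T = np.zeros((len(patches) * T, n_prev))
    for p, idx in enumerate(patches):
        for t in range(T):
            U_T[p * T + t, idx] = W[:, t]
    return U_T.T                                     # shape (n_{k-1}, n_k)

# the example from the text: two filters of length 3, n_{k-1} = 5, stride-1 1D patches
W = np.arange(1.0, 7.0).reshape(2, 3).T              # columns are the filters w^1, w^2
patches = [np.arange(i, i + 3) for i in range(3)]    # P_{k-1} = 3 patches of size 3
U = build_U(W, 5, patches)                           # U has shape (5, 6), so n_k = 6

# check against the patch-wise computation of Definition 2.1 for N = 4 samples
F_prev = np.random.randn(4, 5)
direct = np.stack([[F_prev[i, idx] @ W[:, t] for idx in patches for t in range(2)]
                   for i in range(4)])
assert np.allclose(F_prev @ U, direct)               # F_{k-1} U_k reproduces all patches
```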
In this paper, we assume the following general condition on the structure of convolutional layers.

Assumption 2.4 (Convolutional Structure). For every convolutional layer $k$, there exists at least one parameter matrix $W_k \in \mathbb{R}^{l_{k-1}\times T_k}$ for which the corresponding weight matrix $U_k = M_k(W_k) \in \mathbb{R}^{n_{k-1}\times n_k}$ has full rank.

It is straightforward to see that Assumption 2.4 is satisfied if every neuron belongs to at least one patch and there are no identical patches. As the set of full rank matrices is a dense subset, the following result follows immediately.

Lemma 2.5. If Assumption 2.4 holds, then for every convolutional layer $k$ the set of $W_k \in \mathbb{R}^{l_{k-1}\times T_k}$ for which $U_k = M_k(W_k) \in \mathbb{R}^{n_{k-1}\times n_k}$ does not have full rank has Lebesgue measure zero.

## 3. CNNs Learn Linearly Independent Features

In this section, we show that a class of standard CNN architectures with convolutional layers, fully connected layers and max-pooling layers, together with standard activation functions like ReLU, sigmoid, softplus, etc., is able to learn linearly independent features at every wide hidden layer, that is, at every hidden layer with more neurons than the number of training samples. Our assumption on the training data is the following.

Assumption 3.1 (Training data). The patches of different training samples are non-identical, that is, $x_i^p \ne x_j^q$ for every $p, q \in [P_0]$ and $i, j \in [N]$ with $i \ne j$.

Assumption 3.1 is quite weak, especially if the size of the input patches is large. If the assumption does not hold, one can add a small perturbation to the training samples, $\{x_1 + \epsilon_1, \ldots, x_N + \epsilon_N\}$. The set of $\{\epsilon_i\}_{i=1}^N$ for which Assumption 3.1 is not fulfilled for the new dataset has measure zero. Moreover, $\{\epsilon_i\}_{i=1}^N$ can be chosen arbitrarily small so that the influence of the perturbation is negligible. Our main assumptions on the activation functions of the hidden layers are the following.

Assumption 3.2 (Activation function). The activation function $\sigma$ is continuous, non-constant, and satisfies one of the following conditions:

- There exist $\mu_+, \mu_- \in \mathbb{R}$ such that $\lim_{t\to-\infty}\sigma(t) = \mu_-$, $\lim_{t\to+\infty}\sigma(t) = \mu_+$ and $\mu_+\mu_- = 0$.
- There exist $\rho_1, \rho_2, \rho_3, \rho_4 \in \mathbb{R}_+$ such that $|\sigma(t)| \le \rho_1 e^{\rho_2 t}$ for $t < 0$ and $|\sigma(t)| \le \rho_3 t + \rho_4$ for $t \ge 0$.

Assumption 3.2 covers several standard activation functions.

Lemma 3.3. The following activation functions satisfy Assumption 3.2:

- ReLU: $\sigma(t) = \max(0, t)$
- Sigmoid: $\sigma(t) = \frac{1}{1 + e^{-t}}$
- Softplus: $\sigma_\alpha(t) = \frac{1}{\alpha}\ln(1 + e^{\alpha t})$ for some $\alpha > 0$

It is known that the softplus function is a smooth approximation of the ReLU. In particular, it holds that
$$\lim_{\alpha\to\infty}\sigma_\alpha(t) = \lim_{\alpha\to\infty}\frac{1}{\alpha}\ln(1 + e^{\alpha t}) = \max(0, t). \quad (5)$$

The first main result of this paper is the following.

Theorem 3.4 (Linearly Independent Features). Let Assumption 3.1 hold for the training samples. Consider a deep CNN architecture for which there exists some layer $1 \le k \le L-1$ such that

1. layer 1 and layer $k$ are convolutional or fully connected, while all other layers can be convolutional, fully connected or max-pooling,
2. the width of layer $k$ is larger than the number of training samples, $n_k = T_k P_{k-1} \ge N$,
3. $(\sigma_1, \ldots, \sigma_k)$ satisfy Assumption 3.2.

Then there exists a set of parameters of the first $k$ layers $(W_l, b_l)_{l=1}^k$ such that the set of feature vectors $\{f_k(x_1), \ldots, f_k(x_N)\}$ is linearly independent. Moreover, $(W_l, b_l)_{l=1}^k$ can be chosen in such a way that all weight matrices $U_l = M_l(W_l)$ have full rank for every $1 \le l \le k$.

Theorem 3.4 implies that a large class of CNNs employed in practice with standard activation functions like ReLU, sigmoid or softplus can produce linearly independent features at any hidden layer whose width is larger than the size of the training set. Figure 1 shows an example of a CNN architecture that satisfies the conditions of Theorem 3.4 at the first convolutional layer. Note that if a set of vectors is linearly independent, then it is also linearly separable. In this sense, Theorem 3.4 suggests that CNNs can produce linearly separable features at every wide hidden layer. Linear separability in neural networks has been studied by (An et al., 2015), where the authors show that a two-hidden-layer fully connected network with ReLU activation function can transform any training set to be linearly separable while approximately preserving the distances of the training data at the output layer. Compared to (An et al., 2015), our Theorem 3.4 is derived for CNNs and a wider range of activation functions. Moreover, our result shows even linear independence of features, which is stronger than linear separability. Recently, (Nguyen & Hein, 2017) have shown a similar result for fully connected networks with analytic activation functions. We want to stress that, in contrast to fully connected networks, for CNNs the condition $n_k \ge N$ of Theorem 3.4 does not imply that the network has a huge number of parameters, as the layers $k$ and $k+1$ can be chosen to be convolutional. In particular, the condition $n_k = T_k P_{k-1} \ge N$ can be fulfilled by increasing the number of filters $T_k$ or by using a large number of patches $P_{k-1}$ (however, $P_{k-1}$ is upper bounded by $n_k$), which is only possible if $l_{k-1}$ is small, as otherwise our condition on the patches cannot be fulfilled. In total the CNN has only $l_{k-1}T_k + l_k T_{k+1}$ parameters, versus $n_k(n_{k-1} + n_{k+1})$ for a fully connected network from and to layer $k$. Interestingly, the VGG network (Simonyan & Zisserman, 2015), which uses small 3×3 filters with stride 1 in the first layer, fulfills the condition $n_k \ge N$ for ImageNet at $k = 1$, as do the Inception networks (Szegedy et al., 2015b; 2016); see Table 1.
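As a rough illustration of this parameter count, the numbers below are purely hypothetical (chosen only to mimic the regime of a wide convolutional layer with small filters; they are not taken from any network in the paper):

```latex
% Hypothetical wide layer k: 3x3x3 filters (l_{k-1} = 27), T_k = 64, l_k = 576
% (e.g. a 3x3 patch over 64 channels), T_{k+1} = 64, and n_{k-1} = n_k = n_{k+1} = 10^6:
\underbrace{l_{k-1}T_k + l_k T_{k+1}}_{\text{convolutional}} \approx 27\cdot 64 + 576\cdot 64 \approx 3.9\times 10^{4}
\quad\text{vs.}\quad
\underbrace{n_k(n_{k-1}+n_{k+1})}_{\text{fully connected}} \approx 10^{6}\cdot 2\cdot 10^{6} = 2\times 10^{12}.
```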
One might ask how difficult it is to find such parameters which generate linearly independent features at a hidden layer. Our next result shows that once analytic activation functions, e.g. sigmoid or softplus, are used at the first $k$ hidden layers of the network, the linear independence of features at layer $k$ holds with probability one even if one draws the parameters of the first $k$ layers $(W_l, b_l)_{l=1}^k$ randomly, for any probability measure on the parameter space which has a density with respect to the Lebesgue measure. (A function $\sigma : \mathbb{R} \to \mathbb{R}$ is real analytic if its Taylor series about $x_0$ converges to $\sigma$ on some neighborhood of $x_0$, for every $x_0 \in \mathbb{R}$ (Krantz & Parks, 2002).)

Theorem 3.5. Let Assumption 3.1 hold for the training samples. Consider a deep CNN for which there exists some layer $1 \le k \le L-1$ such that

1. every layer from 1 to $k$ is convolutional or fully connected,
2. the width of layer $k$ is larger than the number of training samples, that is, $n_k = T_k P_{k-1} \ge N$,
3. $(\sigma_1, \ldots, \sigma_k)$ are real analytic functions and satisfy Assumption 3.2.

Then the set of parameters of the first $k$ layers $(W_l, b_l)_{l=1}^k$ for which the set of feature vectors $\{f_k(x_1), \ldots, f_k(x_N)\}$ is not linearly independent has Lebesgue measure zero.

Theorem 3.5 is a much stronger statement than Theorem 3.4, as it shows that for almost all weight configurations one gets linearly independent features at the wide layer. While Theorem 3.5 does not hold for the ReLU activation function, as it is not analytic, we note again that one can approximate the ReLU arbitrarily well by the softplus function (see (5)), which is analytic for every $\alpha > 0$, so that Theorem 3.5 applies. It is an open question whether the result also holds for the ReLU activation function itself. The condition $n_k \ge N$ is not very restrictive, as several state-of-the-art CNNs fulfill it, see Table 1. Furthermore, we would like to stress that Theorem 3.5 is not true for deep linear networks. The reason is simply that the rank of a product of matrices is at most the minimal rank among all factors. The nonlinearity of the activation function is thus critical (note that the identity activation function, $\sigma(x) = x$, does not fulfill Assumption 3.2).

To illustrate Theorem 3.5, we examine the rank of the feature matrices of the network in Figure 1. We use the MNIST dataset with $N = 60000$ training and 10000 test samples. We add small Gaussian noise $\mathcal{N}(0, 10^{-5})$ to every training sample so that Assumption 3.1 is fulfilled. We then vary the number of convolutional filters $T_1$ of the first layer from 10 to 100 and train the corresponding network with squared loss and sigmoid activation function using Adam (Kingma & Ba, 2015) and a decaying learning rate for 2000 epochs.
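Before the MNIST numbers, here is a small self-contained toy check of Theorem 3.5 (my own sketch, not the paper's experiment): with a random fully connected first layer of width $n_1 \ge N$ and an analytic activation (softplus), the feature matrix $F_1$ has numerical rank $N$, using the SVD-based rank estimate described in the caption of Table 2 below.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n1 = 200, 50, 256                 # toy sizes with n1 >= N
X = rng.standard_normal((N, d))         # distinct samples (Assumption 3.1 holds a.s.)
W1 = rng.standard_normal((d, n1))       # random parameters of the first layer
b1 = rng.standard_normal(n1)

softplus = lambda z: np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)  # stable softplus
F1 = softplus(X @ W1 + b1)              # feature matrix F_1 = sigma_1(G_1), shape (N, n1)

# numerical rank via SVD: count singular values above 0.5*sqrt(m+n+1)*sigma_max*eps
s = np.linalg.svd(F1, compute_uv=False)
thresh = 0.5 * np.sqrt(sum(F1.shape) + 1) * s[0] * np.finfo(F1.dtype).eps
print("rank(F_1) =", int((s > thresh).sum()), "of N =", N)   # expected: full rank N
```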
In Table 2 we show the smallest singular value of the feature matrices together with the corresponding training loss, training error and test error. If the number of convolutional filters is large enough (i.e. $T_1 \ge 89$), one has $n_1 = 26 \cdot 26 \cdot T_1 \ge N = 60000$, and the second condition of Theorem 3.5 is satisfied for $k = 1$. Table 2 shows that the feature matrices $F_1$ have full rank in all cases (and $F_3$ in almost all cases), in particular for $T_1 \ge 89$ as predicted by Theorem 3.5. As expected, when the feature maps of the training samples are linearly independent after the first layer ($F_1$ has rank 60000 for $T_1 \ge 89$), the training error is zero and the training loss is close to zero (the GPU uses single precision). However, as linear independence is stronger than linear separability, zero training error can already be achieved for $T_1 < 89$.

It is interesting to note that Theorem 3.5 explains previous empirical observations. In particular, (Czarnecki et al., 2017) have shown empirically that linear separability is often obtained already in the first few hidden layers of trained networks. This is done by attaching a linear classifier probe (Alain & Bengio, 2016) to every hidden layer after training the whole network with backpropagation. The fact that Theorem 3.5 holds even if the parameters of the layers up to the wide layer $k$ are chosen randomly is also in line with recent empirical observations for CNN architectures that one loses little performance if the weights of the initial layers are chosen randomly and left untrained (Jarrett et al., 2009; Saxe et al., 2011; Yosinski et al., 2014).

As a corollary of Theorem 3.4 we get the following universal finite sample expressivity for CNNs. In particular, a deep CNN with scalar output can perfectly fit any scalar-valued target on a finite number of inputs if the width of the last hidden layer is larger than the number of training samples.

Corollary 3.6 (Universal Finite Sample Expressivity). Let Assumption 3.1 hold for the training samples. Consider a standard CNN with scalar output which satisfies the conditions of Theorem 3.4 at the last hidden layer $k = L-1$. Let $f_L : \mathbb{R}^d \to \mathbb{R}$ be the output of the network, given as
$$f_L(x) = \sum_{j=1}^{n_{L-1}} \lambda_j f_{(L-1)j}(x) \quad \text{for all } x \in \mathbb{R}^d,$$
where $\lambda \in \mathbb{R}^{n_{L-1}}$ is the weight vector of the last layer. Then for every target $y \in \mathbb{R}^N$ there exist $\lambda, (W_l, b_l)_{l=1}^{L-1}$ such that $f_L(x_i) = y_i$ for every $i \in [N]$.

The expressivity of neural networks has been well studied, in particular in the universal approximation theorems for one-hidden-layer networks (Cybenko, 1989; Hornik et al., 1989). Recently, many results have shown why deep networks are superior to shallow networks in terms of expressiveness (Delalleau & Bengio, 2011; Telgarsky, 2016; 2015; Eldan & Shamir, 2016; Safran & Shamir, 2017; Yarotsky, 2016; Poggio et al., 2016; Liang & Srikant, 2017; Mhaskar & Poggio, 2016; Montufar et al., 2014; Pascanu et al., 2014; Raghu et al., 2017). While most of these results are derived for fully connected networks, (Cohen & Shashua, 2016) seem to be the first to study the expressivity of CNNs. In particular, they show that CNNs with max-pooling and ReLU units are universal in the sense that they can approximate any given function if the size of the network is unlimited. However, the number of convolutional filters in their result has to grow exponentially with the number of patches, and they do not allow shared weights, which are a standard feature of CNNs.
Corollary 3.6 shows universal finite sample expressivity, instead of universal function approximation, already for $L = 2$ and $k = 1$; that is, a single convolutional layer network can perfectly fit the training data as long as the number of hidden units is larger than the number of training samples.

Table 2. The smallest singular value $\sigma_{\min}(F_1)$ of the feature matrix $F_1$ of the first convolutional layer (similarly $\sigma_{\min}(F_3)$ for the feature matrix $F_3$ of the second convolutional layer) of the trained network in Figure 1, for a varying number of convolutional filters $T_1$. The rank of a matrix $A \in \mathbb{R}^{m\times n}$ is estimated (see Chapter 2.6.1 in (Press, 2007)) by computing the singular value decomposition of $A$ and counting the singular values which exceed the threshold $\tfrac{1}{2}\sqrt{m+n+1}\,\sigma_{\max}(A)\,\epsilon$, where $\epsilon$ is machine precision. For all filter sizes the feature matrices $F_1$ have full rank. Zero training error is attained for $T_1 \ge 30$.

| $T_1$ | size($F_1$) | rank($F_1$) | $\sigma_{\min}(F_1)$ | size($F_3$) | rank($F_3$) | $\sigma_{\min}(F_3)$ | loss (×10⁻⁵) | train error | test error |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | 60000 × 6760 | 6760 | 3.7×10⁻⁶ | 60000 × 2880 | 2880 | 2.0×10⁻² | 2.4 | 8 | 151 |
| 20 | 60000 × 13520 | 13520 | 2.2×10⁻⁶ | 60000 × 2880 | 2880 | 7.0×10⁻⁴ | 1.2 | 1 | 132 |
| 30 | 60000 × 20280 | 20280 | 1.5×10⁻⁶ | 60000 × 2880 | 2880 | 2.4×10⁻⁴ | 0.24 | 0 | 174 |
| 40 | 60000 × 27040 | 27040 | 2.0×10⁻⁶ | 60000 × 2880 | 2880 | 2.2×10⁻³ | 0.62 | 0 | 124 |
| 50 | 60000 × 33800 | 33800 | 1.3×10⁻⁶ | 60000 × 2880 | 2880 | 3.9×10⁻⁵ | 0.02 | 0 | 143 |
| 60 | 60000 × 40560 | 40560 | 1.1×10⁻⁶ | 60000 × 2880 | 2880 | 4.0×10⁻⁵ | 0.57 | 0 | 141 |
| 70 | 60000 × 47320 | 47320 | 7.5×10⁻⁷ | 60000 × 2880 | 2880 | 7.1×10⁻³ | 0.12 | 0 | 120 |
| 80 | 60000 × 54080 | 54080 | 5.4×10⁻⁷ | 60000 × 2880 | 2875 | 4.9×10⁻¹⁸ | 0.11 | 0 | 140 |
| 89 | 60000 × 60164 | 60000 | 2.0×10⁻⁸ | 60000 × 2880 | 2880 | 8.9×10⁻¹⁰ | 0.35 | 0 | 117 |
| 100 | 60000 × 67600 | 60000 | 1.1×10⁻⁶ | 60000 × 2880 | 2856 | 8.5×10⁻²⁷ | 0.04 | 0 | 139 |

For fully connected networks, universal finite sample expressivity has been studied by (Zhang et al., 2017; Nguyen & Hein, 2017; Hardt & Ma, 2017). It is shown there that a single-hidden-layer fully connected network with $N$ hidden units can express any training set of size $N$. The number of training parameters of a single-hidden-layer CNN with $N$ hidden units and scalar output is just $2N + T_1 l_0$, where $T_1$ is the number of convolutional filters and $l_0$ the size of each filter, whereas it is $Nd + 2N$ for fully connected networks. If we set the width of the hidden layer of the CNN to $n_1 = T_1 P_0 = N$ in order to fulfill the condition of Corollary 3.6, then the number of training parameters of the CNN becomes $2N + N l_0/P_0$, which is less than $3N$ if $l_0 \le P_0$, compared to $(d+2)N$ for the fully connected case. In practice one almost always has $l_0 \ll P_0$, as $l_0$ is typically a small integer and $P_0$ is on the order of the dimension of the input. Thus, the number of parameters needed to achieve universal finite sample expressivity is significantly smaller for CNNs than for fully connected networks.

Obviously, in practice it is most important that the network generalizes rather than merely fits the training data. By using shared weights and a sparse connectivity structure, CNNs seem to implicitly regularize the model to achieve good generalization performance. Thus, even though they can also fit random labels or noise (Zhang et al., 2017) due to the universal finite sample expressivity shown in Corollary 3.6, they still seem to be able to generalize well (Zhang et al., 2017).
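A hedged sketch of the mechanism behind Corollary 3.6 (my own illustration; the corollary itself constructs the hidden-layer parameters, here a random fully connected first layer, a special case of Definition 2.1, is used instead, as licensed by Theorem 3.5 for analytic activations): once the hidden features of the $N$ training points are linearly independent, output weights $\lambda$ with $F_{L-1}\lambda = y$ exist and can be found by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, n1 = 100, 20, 128                     # last hidden layer wider than the sample size
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)                  # arbitrary scalar targets

# random first layer with an analytic activation (sigmoid), cf. Theorem 3.5
W1, b1 = rng.standard_normal((d, n1)), rng.standard_normal(n1)
F1 = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))   # hidden feature matrix, rank N a.s.

# output weights of the linear last layer: solve F1 @ lam = y exactly via least squares
lam, *_ = np.linalg.lstsq(F1, y, rcond=None)
print("max |f_L(x_i) - y_i| =", np.abs(F1 @ lam - y).max())  # expected near machine precision
```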
## 4. Optimization Landscape of Deep CNNs

In this section we restrict our analysis to the least squares loss. However, since we show later that the network can produce exactly the target output (i.e. $F_L = Y$) for some choice of parameters, all our results extend to any other loss function whose global minimum is attained at $F_L = Y$, for instance the squared hinge loss analyzed in (Nguyen & Hein, 2017).

Let $\mathcal{P}$ denote the space of all parameters of the network. The final training objective $\Phi : \mathcal{P} \to \mathbb{R}$ is given as
$$\Phi\big((W_l, b_l)_{l=1}^L\big) = \frac{1}{2}\,\|F_L - Y\|_F^2, \quad (6)$$
where $F_L$ is defined as in (4), which is the same as
$$F_L = \sigma_{L-1}\big(\ldots \sigma_1(X U_1 + \mathbf{1}_N b_1^T)\ldots\big)U_L + \mathbf{1}_N b_L^T,$$
with $U_l = M_l(W_l)$ for every $1 \le l \le L$. We require the following assumptions on the architecture of the CNN.

Assumption 4.1 (CNN Architecture). Every layer in the network is a convolutional or fully connected layer and the output layer is fully connected. Moreover, there exists some hidden layer $1 \le k \le L-1$ such that the following holds:

- The width of layer $k$ is larger than the number of training samples, that is, $n_k = T_k P_{k-1} \ge N$.
- All activation functions of the hidden layers $(\sigma_1, \ldots, \sigma_{L-1})$ satisfy Assumption 3.2.
- $(\sigma_{k+1}, \ldots, \sigma_{L-1})$ are strictly increasing or strictly decreasing, and differentiable.
- The network is pyramidal from layer $k+1$ up to the output layer, that is, $n_{k+1} \ge \ldots \ge n_L$.

A typical example that satisfies Assumption 4.1 with $k = 1$ can be found in Figure 1 if one disregards the max-pooling layers and uses e.g. the sigmoid or softplus activation function.

Figure 1. An example of a CNN for a given training set of size $N \le 100 \cdot 26 \cdot 26 = 67600$. The width of each layer is $d = n_0 = 784$, $n_1 = 67600$, $n_2 = 16900$, $n_3 = 2880$, $n_4 = 720$, $n_5 = 100$, $n_6 = m = 10$. One can see that $n_1 \ge N$ and the network has a pyramidal structure from layer 2 up to the output layer, that is, $n_2 \ge \ldots \ge n_6$.

In the following, let us define for every $1 \le k \le L-1$ the subset $S_k \subseteq \mathcal{P}$ of the parameter space
$$S_k := \Big\{ (W_l, b_l)_{l=1}^L \;\Big|\; F_k,\, U_{k+2}, \ldots, U_L \text{ have full rank} \Big\}.$$
The set $S_k$ is the set of parameters where the feature matrix at layer $k$ and all weight matrices from layer $k+2$ up to the output layer have full rank. In the following, we examine conditions for global optimality in $S_k$. It is important to note that $S_k$ covers almost the whole parameter space under an additional mild condition on the activation functions.

Lemma 4.2. Let Assumption 3.1 hold for the training samples and let a deep CNN satisfy Assumption 4.1 for some layer $1 \le k \le L-1$. If the activation functions of the first $k$ layers $(\sigma_1, \ldots, \sigma_k)$ are real analytic, then the complementary set $\mathcal{P}\setminus S_k$ has Lebesgue measure zero.

In the next key lemma, we bound the objective function in terms of the magnitude of its gradient w.r.t. the full weight matrix of layer $k+1$, where $k$ is the wide layer with $n_k \ge N$. For every matrix $A \in \mathbb{R}^{m\times n}$, let $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ denote the smallest and largest singular value of $A$, and let $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$, $\|A\|_{\min} := \min_{i,j}|A_{ij}|$ and $\|A\|_{\max} := \max_{i,j}|A_{ij}|$. From (4) and (6) it follows that $\Phi$ can be seen as a function of $(U_l, b_l)_{l=1}^L$, and thus we can use $\nabla_{U_k}\Phi$. If layer $k$ is fully connected, then $U_k = M_k(W_k) = W_k$ and thus $\nabla_{U_k}\Phi = \nabla_{W_k}\Phi$. Otherwise, if layer $k$ is convolutional, then $\nabla_{U_k}\Phi$ is not the true gradient of the training objective, because the true optimization parameter is $W_k$ and not $U_k$. In this case, the true gradient of $\Phi$ w.r.t. the parameter matrix $W_k$, which consists of the convolutional filters, can be computed via the chain rule as
$$\frac{\partial \Phi}{\partial (W_k)_{rs}} = \sum_{i,j} \frac{\partial \Phi}{\partial (U_k)_{ij}}\,\frac{\partial (U_k)_{ij}}{\partial (W_k)_{rs}}.$$
Please note that even though we write the partial derivatives with respect to the matrix elements, $\nabla_{W_k}\Phi$ resp. $\nabla_{U_k}\Phi$ denote matrices of the same dimension as $W_k$ resp. $U_k$ in the following.
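A hedged sketch of this chain rule (my own illustration, not from the paper): since $M_k$ is linear and each entry of $U_k$ is either zero or a copy of one entry of $W_k$, the gradient w.r.t. $W_k$ is obtained by summing the entries of $\nabla_{U_k}\Phi$ at all positions of $U_k$ holding the corresponding filter weight. The sketch verifies this with a finite-difference check on a toy least-squares objective.

```python
import numpy as np

def build_U(W, n_prev, patches):
    """U = M_k(W): row p*T + t of U^T carries filter t on the indices of patch p."""
    T = W.shape[1]
    U_T = np.zeros((len(patches) * T, n_prev))
    for p, idx in enumerate(patches):
        for t in range(T):
            U_T[p * T + t, idx] = W[:, t]
    return U_T.T

def grad_W_from_grad_U(grad_U, W_shape, patches, T):
    """Chain rule for U = M_k(W): since (U)_{idx[r], p*T+t} = (W)_{r,t}, the derivative
    w.r.t. (W)_{r,t} sums dPhi/dU over all patches p at those positions."""
    grad_W = np.zeros(W_shape)
    for p, idx in enumerate(patches):
        for t in range(T):
            grad_W[:, t] += grad_U[idx, p * T + t]
    return grad_W

# finite-difference check on a toy objective Phi(W) = 0.5 * ||F U(W) - Y||_F^2
rng = np.random.default_rng(2)
patches = [np.arange(i, i + 3) for i in range(3)]   # stride-1 patches on 5 neurons
F = rng.standard_normal((4, 5))                     # N = 4 samples at layer k-1
Y = rng.standard_normal((4, 6))                     # n_k = 3 patches * 2 filters
W = rng.standard_normal((3, 2))
U = build_U(W, 5, patches)
grad_W = grad_W_from_grad_U(F.T @ (F @ U - Y), W.shape, patches, T=2)

eps = 1e-6
num = np.zeros_like(W)
for r in range(3):
    for s in range(2):
        Wp = W.copy(); Wp[r, s] += eps
        num[r, s] = (0.5 * np.linalg.norm(F @ build_U(Wp, 5, patches) - Y)**2
                     - 0.5 * np.linalg.norm(F @ U - Y)**2) / eps
assert np.allclose(grad_W, num, atol=1e-4)          # analytic and numeric gradients agree
```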
Lemma 4.3. Consider a deep CNN satisfying Assumption 4.1 for some hidden layer $1 \le k \le L-1$. Then it holds that
$$\|\nabla_{U_{k+1}}\Phi\|_F \;\ge\; \sigma_{\min}(F_k)\,\Big(\prod_{l=k+1}^{L-1}\sigma_{\min}(U_{l+1})\,\|\sigma_l'(G_l)\|_{\min}\Big)\,\|F_L - Y\|_F$$
and
$$\|\nabla_{U_{k+1}}\Phi\|_F \;\le\; \sigma_{\max}(F_k)\,\Big(\prod_{l=k+1}^{L-1}\sigma_{\max}(U_{l+1})\,\|\sigma_l'(G_l)\|_{\max}\Big)\,\|F_L - Y\|_F.$$

Our next main result is motivated by the empirical observation that when training over-parameterized neural networks with shared weights and sparsity structure like CNNs, there seem to be no problems with suboptimal local minima. In many cases, even when the training labels are completely random, local search algorithms like stochastic gradient descent converge to a solution with almost zero training error (Zhang et al., 2017). To better understand this phenomenon, we first characterize in Theorem 4.4 the set of points in parameter space with zero loss, and then analyze in Theorem 4.5 the loss surface for a special case of the network. We emphasize that our results hold for standard deep CNNs with convolutional layers with shared weights and fully connected layers.

Theorem 4.4 (Conditions for Zero Training Error). Let Assumption 3.1 hold for the training samples and suppose that the CNN architecture satisfies Assumption 4.1 for some hidden layer $1 \le k \le L-1$. Let $\Phi : \mathcal{P} \to \mathbb{R}$ be defined as in (6) and let $(W_l, b_l)_{l=1}^L \in S_k$. Then it holds that $\Phi\big((W_l, b_l)_{l=1}^L\big) = 0$ if and only if $\nabla_{U_{k+1}}\Phi\big((W_l, b_l)_{l=1}^L\big) = 0$.

Proof: If $\Phi\big((W_l, b_l)_{l=1}^L\big) = 0$, then it follows from the upper bound of Lemma 4.3 that $\nabla_{U_{k+1}}\Phi = 0$. For the reverse direction, one has $(W_l, b_l)_{l=1}^L \in S_k$ and thus $\operatorname{rank}(F_k) = N$ and $U_l$ has full rank for every $l \in [k+2, L]$. Thus it holds that $\sigma_{\min}(F_k) > 0$ and $\sigma_{\min}(U_l) > 0$ for every $l \in [k+2, L]$. Moreover, $(\sigma_{k+1}, \ldots, \sigma_{L-1})$ have non-zero derivative by Assumption 4.1 and thus $\|\sigma_l'(G_l)\|_{\min} > 0$ for every $l \in [k+1, L-1]$. Combined with the lower bound in Lemma 4.3, this yields $\|F_L - Y\|_F = 0$ and hence $\Phi\big((W_l, b_l)_{l=1}^L\big) = 0$.

Lemma 4.2 shows that the set of points not covered by Theorem 4.4 has measure zero if the first $k$ layers have analytic activation functions. The necessary and sufficient condition of Theorem 4.4 is rather intuitive, as it requires the gradient of the training objective to vanish w.r.t. the full weight matrix of layer $k+1$, regardless of the architecture of this layer. It turns out that if layer $k+1$ is fully connected, then this condition is always satisfied at a critical point, in which case every critical point in $S_k$ is a global minimum with exactly zero training error. This is shown in Theorem 4.5, where we consider a classification task with $m$ classes, $Z \in \mathbb{R}^{m\times m}$ is a full rank class encoding matrix, e.g. the identity matrix, and $(X, Y)$ is the training sample such that $Y_{i:} = Z_{j:}$ whenever training sample $x_i$ belongs to class $j$, for every $i \in [N]$, $j \in [m]$.

Theorem 4.5 (Loss Surface of CNNs). Let $(X, Y, Z)$ be a training set for which Assumption 3.1 holds, let the CNN architecture satisfy Assumption 4.1 for some hidden layer $1 \le k \le L-1$, and let layer $k+1$ be fully connected. Let $\Phi : \mathcal{P} \to \mathbb{R}$ be defined as in (6). Then the following holds:

- Every critical point $(W_l, b_l)_{l=1}^L \in S_k$ is a global minimum with $\Phi\big((W_l, b_l)_{l=1}^L\big) = 0$.
- There exist infinitely many global minima $(W_l, b_l)_{l=1}^L \in S_k$ with $\Phi\big((W_l, b_l)_{l=1}^L\big) = 0$.

Theorem 4.5 shows that the loss surface of this type of CNN has a rather simple structure in the sense that every critical point in $S_k$ must be a global minimum with zero training error. Note that if the activation functions up to layer $k$ are analytic, the complement of $S_k$ has measure zero (see Lemma 4.2).
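A toy numerical check (my own sketch, not from the paper) of the special case $k = L-1$ of Lemma 4.3, where the product over $l$ is empty and the bounds reduce to $\sigma_{\min}(F_k)\|F_L - Y\|_F \le \|\nabla_{U_L}\Phi\|_F \le \sigma_{\max}(F_k)\|F_L - Y\|_F$; it also illustrates the logic of Theorem 4.4: as long as $\sigma_{\min}(F_k) > 0$, a vanishing gradient forces zero loss.

```python
import numpy as np

rng = np.random.default_rng(3)
N, nk, m = 50, 64, 10                         # wide hidden layer: n_k >= N
Fk = rng.standard_normal((N, nk))             # feature matrix of the wide layer k = L-1
UL = rng.standard_normal((nk, m))             # weights of the linear output layer
bL = rng.standard_normal(m)
Y = rng.standard_normal((N, m))

FL = Fk @ UL + bL                             # linear output layer (no non-linearity)
R = FL - Y                                    # residual F_L - Y
grad_UL = Fk.T @ R                            # gradient of Phi = 0.5*||F_L - Y||_F^2 w.r.t. U_L

s = np.linalg.svd(Fk, compute_uv=False)
lo = s[-1] * np.linalg.norm(R)                # sigma_min(F_k) * ||F_L - Y||_F
hi = s[0] * np.linalg.norm(R)                 # sigma_max(F_k) * ||F_L - Y||_F
print(lo <= np.linalg.norm(grad_UL) <= hi)    # True: both bounds of Lemma 4.3 hold
```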
For critical points lying outside $S_k$, it must hold that either one of the weight matrices $\{U_{k+2}, \ldots, U_L\}$ has low rank or the set of feature vectors at layer $k$ is not linearly independent (i.e. $F_k$ has low rank). Obviously, some of these critical points can also be global minima, but we conjecture that they cannot be suboptimal local minima, for the following reasons. First, it seems unlikely that a critical point with a low rank weight matrix is a suboptimal local minimum, as this would imply that all possible full rank perturbations of the current solution have larger or equal objective value; however, there is no term in the loss function which favors low rank solutions. Even for linear networks, it has been shown by (Baldi & Hornik, 1988) that all critical points with low rank weight matrices have to be saddle points and thus cannot be suboptimal local minima. Second, a similar argument applies to a critical point outside $S_k$ at which the features are not linearly independent. In particular, any neighborhood of such a critical point contains points with linearly independent features at layer $k$, from which it is easy to reach zero loss by fixing the parameters of the first $k$ layers and optimizing the loss w.r.t. the remaining ones. This implies that every small neighborhood of the critical point contains points from which there exists a descent path leading to a global minimum with zero loss, which contradicts the critical point being a suboptimal local minimum. In summary, if there are critical points lying outside the set $S_k$, then it is very unlikely that these are suboptimal local minima; rather, they are global minima, saddle points or local maxima.

It remains an interesting open problem whether the result of Theorem 4.5 can be transferred to the case where layer $k+1$ is also convolutional. In any case, whether layer $k+1$ is fully connected or not, one may assume that a solution with zero training error still exists, as is usually the case for practical over-parameterized networks. Note, however, that Theorem 4.4 shows that at those points where the loss is zero, the gradient of $\Phi$ w.r.t. $U_{k+1}$ must be zero as well. An interesting special case of Theorem 4.5 is when the network is fully connected, in which case all results of Theorem 4.5 hold without any modification. This can be seen as a formal proof of the implicit assumption used in the recent work (Nguyen & Hein, 2017) that there exists a global minimum with exactly zero training error for the class of fully connected, deep and wide networks.

## 5. Conclusion

We have analyzed the expressiveness and loss surface of CNNs in realistic and practically relevant settings. As state-of-the-art networks fulfill exactly or approximately the condition of having a sufficiently wide convolutional layer, we think that our results help to explain why current CNNs can be trained so effectively. It would be interesting to extend the analysis to the cross-entropy loss, which currently does not fit into our framework as its global minimum does not exist when the data is linearly separable.

## Acknowledgements

The authors would like to thank the reviewers for their helpful comments on the paper and Maksym Andriushchenko for helping us to set up the experiment on the rank of the features.

## References

Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. In ICLR Workshop, 2016.
An, S., Boussaid, F., and Bennamoun, M. How can deep rectifier networks achieve linear separability and preserve distances? In ICML, 2015.

Andoni, A., Panigrahy, R., Valiant, G., and Zhang, L. Learning polynomials with neural networks. In ICML, 2014.

Auer, P., Herbster, M., and Warmuth, M. K. Exponentially many local minima for single neurons. In NIPS, 1996.

Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53–58, 1988.

Blum, A. and Rivest, R. L. Training a 3-node neural network is NP-complete. In NIPS, 1989.

Brutzkus, A. and Globerson, A. Globally optimal gradient descent for a convnet with Gaussian inputs. In ICML, 2017.

Chollet, F. Xception: Deep learning with depthwise separable convolutions, 2016. arXiv:1610.02357.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. In AISTATS, 2015a.

Choromanska, A., LeCun, Y., and Arous, G. B. Open problem: The landscape of the loss surfaces of multilayer networks. In COLT, 2015b.

Cohen, N. and Shashua, A. Convolutional rectifier networks as generalized tensor decompositions. In ICML, 2016.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989.

Czarnecki, W. M., Swirszcz, G., Jaderberg, M., Osindero, S., Vinyals, O., and Kavukcuoglu, K. Understanding synthetic gradients and decoupled neural interfaces. In ICML, 2017.

Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS, 2014.

Delalleau, O. and Bengio, Y. Shallow vs. deep sum-product networks. In NIPS, 2011.

Du, S. S., Lee, J. D., and Tian, Y. When is a convolutional filter easy to learn? In ICLR, 2018.

Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In COLT, 2016.

Freeman, C. D. and Bruna, J. Topology and geometry of half-rectified network optimization. In ICLR, 2017.

Gautier, A., Nguyen, Q., and Hein, M. Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods. In NIPS, 2016.

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. Qualitatively characterizing neural network optimization problems. In ICLR, 2015.

Haeffele, B. D. and Vidal, R. Global optimality in neural network training. In CVPR, 2017.

Hardt, M. and Ma, T. Identity matters in deep learning. In ICLR, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.

Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, 2016. arXiv:1602.07360.

Janzamin, M., Sedghi, H., and Anandkumar, A. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv:1506.08473, 2016.

Jarrett, K., Kavukcuoglu, K., and LeCun, Y. What is the best multi-stage architecture for object recognition? In CVPR, 2009.

Kawaguchi, K. Deep learning without poor local minima. In NIPS, 2016.

Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In ICLR, 2015.

Krantz, S. G. and Parks, H. R. A Primer of Real Analytic Functions. Birkhäuser, Boston, second edition, 2002.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.

Liang, S. and Srikant, R. Why deep neural networks for function approximation? In ICLR, 2017.

Livni, R., Shalev-Shwartz, S., and Shamir, O. On the computational efficiency of training neural networks. In NIPS, 2014.

Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. In CVPR, 2015.

Mhaskar, H. and Poggio, T. Deep vs. shallow networks: An approximation theory perspective, 2016. arXiv:1608.03287.

Montufar, G., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In NIPS, 2014.

Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. In ICML, 2017.

Pascanu, R., Montufar, G., and Bengio, Y. On the number of response regions of deep feedforward networks with piecewise linear activations. In ICLR, 2014.

Paszke, A., Chaurasia, A., Kim, S., and Culurciello, E. ENet: A deep neural network architecture for real-time semantic segmentation, 2016. arXiv:1606.02147.

Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. Why and when can deep but not shallow networks avoid the curse of dimensionality: a review, 2016. arXiv:1611.00740.

Press, W. H. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 2007.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In ICML, 2017.

Safran, I. and Shamir, O. On the quality of the initial basin in overspecified networks. In ICML, 2016.

Safran, I. and Shamir, O. Depth-width tradeoffs in approximating natural functions with neural networks. In ICML, 2017.

Saxe, A., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. Y. On random weights and unsupervised feature learning. In ICML, 2011.

Sedghi, H. and Anandkumar, A. Provable methods for training neural networks with sparse connectivity. In ICLR Workshop, 2015.

Shalev-Shwartz, S., Shamir, O., and Shammah, S. Failures of gradient-based deep learning. In ICML, 2017.

Shamir, O. Distribution-specific hardness of learning neural networks, 2017. arXiv:1609.01037.

Sima, J. Training a single sigmoidal neuron is hard. Neural Computation, 14:2709–2728, 2002.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Soltanolkotabi, M. Learning ReLUs via gradient descent. In NIPS, 2017.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, 2015a.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the Inception architecture for computer vision, 2015b. arXiv:1512.00567.

Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning, 2016. arXiv:1602.07261.

Telgarsky, M. Representation benefits of deep feedforward networks, 2015. arXiv:1509.08101v2.

Telgarsky, M. Benefits of depth in neural networks. In COLT, 2016.

Tian, Y. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In ICML, 2017.
Yarotsky, D. Error bounds for approximations with deep ReLU networks, 2016. arXiv:1610.01145.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In NIPS, 2014.

Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., and Lipson, H. Understanding neural networks through deep visualization. In ICML, 2015.

Yun, C., Sra, S., and Jadbabaie, A. Global optimality conditions for deep neural networks. In ICLR, 2018.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires re-thinking generalization. In ICLR, 2017.

Zhong, K., Song, Z., Jain, P., Bartlett, P., and Dhillon, I. Recovery guarantees for one-hidden-layer neural networks. In ICML, 2017.