# Maximum-and-Concatenation Networks

Xingyu Xie¹, Hao Kong¹, Jianlong Wu², Wayne Zhang³, Guangcan Liu⁴, Zhouchen Lin¹

¹Key Lab. of Machine Perception (MoE), School of EECS, Peking University. ²School of Computer Science and Technology, Shandong University. ³SenseTime Research. ⁴B-DAT and CICAEET, School of Automation, Nanjing University of Information Science and Technology. Correspondence to: Guangcan Liu, Zhouchen Lin.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

While successful in many fields, deep neural networks (DNNs) still suffer from some open problems, such as bad local minima and unsatisfactory generalization performance. In this work, we propose a novel architecture called Maximum-and-Concatenation Networks (MCN) to try to eliminate bad local minima and improve generalization ability as well. Remarkably, we prove that MCN has a very nice property: every local minimum of an (l+1)-layer MCN can be better than, or at least as good as, the global minima of the network consisting of its first l layers. In other words, by increasing the network depth, MCN can autonomously improve the quality of its local minima. What is more, it is easy to plug MCN into an existing deep model so that it also acquires this property. Finally, under mild conditions, we show that MCN can approximate certain continuous functions arbitrarily well and with high efficiency; that is, the covering number of MCN is much smaller than that of most existing DNNs such as deep ReLU networks. Based on this, we further provide a tight generalization bound to guarantee the inference ability of MCN when dealing with testing samples.

1. Introduction

Deep neural networks (DNNs) have been showing superior performance in various fields such as computer vision, speech recognition, natural language processing, and so on. At first glance, DNN learning is not an enigmatic technique, as its basic idea is quite simple and mostly about learning a possibly over-parameterized DNN from a huge number of training samples; namely,

$$\min_{\theta}\ L(\theta) := \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big), \qquad (1)$$

where $x_i \in \mathbb{R}^{d_x}$ and $y_i \in \mathbb{R}^{d_y}$ denote an input and a target, respectively, $f_\theta(\cdot)$ stands for a DNN with parameters $\theta$, and $\ell(\cdot)$ is some loss function. Notice that some kind of regularization scheme has already been implanted into the network to constrain the parameter space, even though there is no explicit regularizer imposed on $\theta$ (Arora et al., 2019a).

Despite its ordinary appearance, DNN learning is quite complicated in many ways, and current DNNs still suffer from several weaknesses: the training procedure may easily get stuck in bad local minima (i.e., local minima with large training error), and the learnt model may be prone to over-fit the training data (i.e., the testing error is large even when a small training error is obtained). Overcoming these difficulties is crucial for DNNs to solve real-world problems that are more challenging and significant, but they remain open problems. To address the issue of bad local minima, many heuristic techniques have been proposed, e.g., batch normalization (Ioffe & Szegedy, 2015), group normalization (Wu & He, 2018), dropout (Srivastava et al., 2014), etc. These techniques can be useful in certain contexts, but they may not be generally helpful and, even worse, it is hard to know when and which method should be used.
In fact, the elimination of bad local minima, i.e., having small empirical training error at all local minima, is really important for DNN learning. Some recent theories (Zhang et al., 2017; Wei & Ma, 2019; Cao & Gu, 2019; Li & Liang, 2018; Allen-Zhu et al., 2018; Arora et al., 2019c) have revealed that, whenever the local minima produce only small training error, DNNs probably have good generalization performance at these local minima. That is to say, in some cases, good local minima mean good predictors, which are the ultimate goal of supervised learning. With the hope of pursuing the property of no bad local minima, some learning theories (Kawaguchi, 2016; Arora et al., 2018; Hardt & Ma, 2016; Liang et al., 2018a;b) have been established to prove that, under certain conditions, any local minima of a certain DNN are also global minima. While impressive, existing studies are still unsatisfactory in some aspects:

- Most existing theories about "all local minima are global minima" are built upon some unrealistic network architectures, e.g., networks without activation functions, which means that they cannot be applied to common deep learning tasks. The work of (Kawaguchi & Kaelbling, 2019) considers general architectures, but it requires an additional regularizer and is limited to the shallow case. In addition, strictly speaking, the conclusion that "all local minima are global minima" cannot really ensure that a DNN has no bad local minima: whenever the adopted network itself is poorly designed, the global minima can still lead to large training error. In a word, existing studies have not provided convenient schemes that can be easily used to reduce the training error of general DNNs.

- Though small training error may bring good generalization for some specially designed DNNs (Zhang et al., 2017; Wei & Ma, 2019; Cao & Gu, 2019; Li & Liang, 2018; Allen-Zhu et al., 2018; Arora et al., 2019c), a rigorous generalization bound is still important for general DNNs to produce superior performance in practice. Research in the direction of generalization analysis is sparse, e.g., deep ReLU networks (Yarotsky, 2017). However, the covering number of deep ReLU networks is very large, which means that the approximation efficiency of the network is rather weak. What is more, to our knowledge, there is no theoretical study that addresses the issues of local minima and generalization ability simultaneously. These two problems are closely related and should be investigated at the same time.

To relieve the issues highlighted above, we propose a novel multi-layer DNN termed Maximum-and-Concatenation Networks (MCN). In our MCN, one hidden layer is formed by concatenating two parts: one is a linear transformation of the output of the previous layer, and the other is the maximum of two piecewise smooth functions. The output of the final layer is further transformed by some linear operators, so as to match the configuration of the target output. In general, the concatenation operator is a good option when designing DNNs, and it is indeed a primary cause of the superiorities of MCN over existing architectures. We prove that MCN naturally ensures the effectiveness of its learning process, i.e., the "no bad local minima" property.
To be more precise, suppose that $\theta^*$ is a global minimum of (1) with $f_\theta$ being an $l$-layer MCN (briefly, we say that $\theta^*$ is a global minimum of an $l$-layer MCN), and $\bar\theta$ is a local minimum of the $(l+1)$-layer MCN obtained by adding one layer to the former $l$-layer network. Then we have $L(\bar\theta) \le L(\theta^*)$, which means that the global minima of an $l$-layer MCN may be outperformed, or at least attained, by simply increasing the network depth. More importantly, MCN can be easily appended to many existing network architectures, and we prove that, under mild conditions, the modified DNN will inherit the nice properties of MCN. This property is achieved mainly thanks to a skip connection with a proper activation function: with the help of this skip connection, the bad local minima are moved to infinity, while the implicit regularizer carried by the network itself may encourage the optimization procedure to seek the remaining good local minima.

Notice that piecewise linear functions can approximate any Lipschitz continuous function up to arbitrarily small error, and the maximum operator can model piecewise linear functions efficiently (Telgarsky, 2016). Based on these facts, we show that MCN with sparse connections can approximate a wide range of continuous functions arbitrarily well. Our analysis framework is new and quite different from the previous studies (Lu et al., 2020; Yarotsky, 2018; 2017), which rely on Taylor expansions and require a parameter complexity of $O(N^{d_x})$, where $N \ge 1$ is a quantity that controls the approximation accuracy ($N^{-\beta}$ is the dominant term in the approximation error, where $\beta > 0$ relates to the smoothness of the target function). By sharp contrast, we show that a complexity of only $O(N(\ln N)^{d_x-2})$ suffices to approximate the target function. Based on the approximation analysis, we further investigate the generalization ability of MCN on testing samples, proving that MCN has a much smaller covering number than deep ReLU networks. Interestingly, our results suggest that the width has less effect than the depth on the generalization bound. Our results also show that, whenever the training data are exactly fitted, MCN achieves the statistically optimal rate in the minimax sense; this confirms the conjectures in (Wei & Ma, 2019; Arora et al., 2019c; Belkin et al., 2018b) that ultra-deep networks may generalize well on testing data (note that we have no intention of suggesting infinitely deep networks, as the computational cost also matters and the required amount of data in the extreme case could be huge).

To summarize, the contributions of this paper mainly include:

- We propose a novel architecture termed MCN and prove that MCN can help to overcome the issue of bad local minima. Namely, the global minima of an $l$-layer MCN can always be attained, or even outperformed, by simply increasing the network depth (Theorem 1). More importantly, we show that MCN is able to turn a possibly poorly designed DNN into a good one, which also has the nice property of "no bad local minima" under certain conditions (Corollary 4.2 and Theorem 5). These results would be more significant than (Kawaguchi, 2016; Hardt & Ma, 2016), which only show that all local minima of a certain DNN with fixed depth are global minima but provide no practical guidance for users to seek better solutions to their tasks: just finding the globally optimal solutions to some over-simplified optimization problems is essentially not enough.
- We devise a new framework to analyze the approximation ability of MCN, showing that MCN can approximate some classes of continuous functions arbitrarily well using a parameter complexity of only $O(N(\ln N)^{d_x-2})$ (Theorem 2). This is much lower than the $O(N^{d_x})$ complexity obtained by previous studies (Lu et al., 2020; Yarotsky, 2018; 2017).

- Unlike the previous analyses in (Liang et al., 2018a;b; Kawaguchi & Kaelbling, 2019), which focus on the elimination of local minima but ignore the generalization performance, we provide a rigorous analysis to guarantee the generalization ability of MCN under certain conditions (Theorem 3 and Corollary 3.1). In particular, our results show that MCN has a much smaller covering number than deep ReLU networks, revealing that the depth is more important than the width for generalization; this supports the mechanism of deep learning.

2. Model and Setting

This section introduces the technical details of MCN, as well as the setup for establishing the theoretical analysis.

2.1. Maximum-and-Concatenation Networks

Figure 1. Left: Illustration of the motivations for inventing MCN, which is indeed a generalization of the piecewise smooth function; the composition of MCNs may increase the number of pieces exponentially. Right: One block of MCN, where each layer consists of four parts.

The design of our MCN, a network that concatenates a linear part with a maximum part, is inspired by the following observations. Consider the task of shattering some points that are not linearly separable, as shown in Figure 1. Intuitively, the maximum of two hyperplanes may produce a smaller classification error than either single one of them. Therefore, we may reduce the classification error by replacing parts of the current classifier with some maximum-derived units. Such a replacement process can be repeated several times, progressively learning a refined classification surface that will be piecewise smooth. Moreover, considering the regression problem, we have a classical claim from the Stone-Weierstrass approximation theorem.

Claim 1. Any Lipschitz continuous function can be approximated arbitrarily well by a piecewise linear function.

By composing a series of maximum operators, we can easily construct a piecewise smooth function. Consider approximating the quadratic function $x \mapsto x^2$. Define the operator $T_m(x) := \max\{-x/2,\ x/2 - 2^{1-2m}\}$ and let $g_m(x) := T_m \circ T_{m-1} \circ \cdots \circ T_1(x)$. It is known that $x + \sum_{i=1}^{m} g_i(x)$ approximates $x^2$ exponentially fast in $m$ (Telgarsky, 2016). In contrast, to approximate a twice differentiable, non-piecewise-linear function $f$, it would be awkward to use some existing DNNs that need to rescale second-order differences: $\big(f(t + 2\delta x) - 2f(t + \delta x) + f(t)\big) / \big(\delta^2 f''(t)\big) \to x^2$ as $\delta \to 0$, with $f''(t) \neq 0$. Note that $\delta \to 0$ will cause the scale of the network parameters to be very large.
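As a small numerical check of this kind of exponential rate, the sketch below evaluates the standard tent-map (sawtooth) construction used by Telgarsky (2016) and Yarotsky (2017), rather than the exact operators $T_m$ above, in plain NumPy; for the $m$-term construction, the error on $[0,1]$ is known to be at most $2^{-2m-2}$.

```python
import numpy as np

def tent(t):
    # One tent map g(t) = min(2t, 2 - 2t); composing it m times yields a sawtooth with 2^(m-1) teeth.
    return np.minimum(2.0 * t, 2.0 - 2.0 * t)

def square_approx(x, m):
    # Piecewise-linear approximation of x**2 on [0, 1]:  x - sum_{k=1}^m g_k(x) / 4**k.
    out = x.copy()
    g = x.copy()
    for k in range(1, m + 1):
        g = tent(g)                  # g_k = g composed with g_{k-1}
        out -= g / 4.0 ** k
    return out

x = np.linspace(0.0, 1.0, 100001)
for m in (1, 2, 4, 8):
    err = np.max(np.abs(square_approx(x, m) - x ** 2))
    print(f"m = {m}: max error = {err:.2e}   (bound 2^-(2m+2) = {2.0 ** -(2 * m + 2):.2e})")
```

Doubling $m$ roughly squares the accuracy, which is the exponential behaviour the maximum-based construction is meant to illustrate.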
Beneath it all, the model of an $l$-layer MCN, which is indeed a mapping from input $x$ to output $y$, is designed as follows, for $k = 0, \dots, l-1$:

$$x_{k+1} = \big[\, L_{k+1}(x_k)\,;\ \gamma\big(A_{k+1}(x_0)\big) + M_{k+1}(x_k) \,\big], \qquad (2)$$

where $M_{k+1}(x_k) = \max\big\{ W_{k+1}(x_k),\ \sigma_{k+1}\big(\bar{A}_{k+1}(\hat{x}_{\hat k})\big) \big\}$ with $0 \le \hat{k} \le k$ ($\hat{x}_{\hat k}$ is the output of any intermediate layer between $x_0$ and $x_k$), $\gamma(\cdot)$ and $\sigma_{k+1}(\cdot)$ are some element-wise activation functions, $x_0 = x \in \mathbb{R}^{d_x}$ is the input data vector, $x_k \in \mathbb{R}^{d_k}$ is the output of the $k$-th layer, $[\,\cdot\,;\,\cdot\,]$ is the operator that vertically concatenates two vectors into a single one, $L_{k+1} : \mathbb{R}^{d_k} \to \mathbb{R}^{d_L}$ is a learnable linear operator (for convenience, we assume that the output of $L_{k+1}$ has a fixed dimension $d_L$ for $k = 0, \dots, l-1$; actually, our methods and theories do not need this assumption), and $A_{k+1}(\cdot)$, $\bar{A}_{k+1}(\cdot)$ and $W_{k+1}(\cdot)$ are all learnable linear operators mapping into $\mathbb{R}^{d_{k+1} - d_L}$.

In fact, as mentioned in Figure 1, MCN is a generalization of piecewise smooth functions, and it contains many existing DNNs as special cases, e.g., ResNet, Maxout Networks (Goodfellow et al., 2013) and Input Convex Neural Networks (ICNN) (Amos et al., 2017). In MCN, there are layers that directly connect the input $x_0$ to the hidden units in deeper layers. Such connections are unnecessary for traditional networks, but they are very important for achieving the nice property of "no bad local minima" which we will introduce later. The highway with the operator $L_k$ connects the training loss with a geometric projection residual in the proper setting (Section B in the supplementary material), which helps MCN perform well when it goes deeper and wider.

2.2. Setting

To analyze MCN theoretically, we consider a typical task of regression (or classification). Denote by $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$ an input vector and a target, respectively. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a training set consisting of $n$ samples, with $\{x_i\}_{i=1}^{n}$ being distinct points in $\mathbb{R}^{d_x}$. Denote by $x_{k,i}$ the output of the $k$-th layer on the $i$-th training sample $x_i$. Notice that MCN is primarily designed to learn some extrinsic structures from the data $x$, and its outputs may be inconsistent with the target $y$; e.g., they might have different dimensions. Hence, an additional mapping $\Psi : \mathbb{R}^{d_l} \to \mathbb{R}^{d_y}$ is used to further transform the network outputs, resulting in the following objective function for training an $l$-layer MCN:

$$L(\theta_l) := \frac{1}{n}\sum_{i=1}^{n} \ell\big(\Psi(x_{l,i}),\, y_i\big), \qquad (3)$$

where $\ell: \mathbb{R}^{d_y} \times \mathbb{R}^{d_y} \to \mathbb{R}$ is an arbitrary lower-bounded loss function (without loss of generality, we assume the lower bound is 0), and $\theta_l = \{\theta(L_k, W_k, A_k, \bar{A}_k)\}_{k=1}^{l}$ is the collection of all learnable parameters, with $\theta(L_k, W_k, A_k, \bar{A}_k)$ being the parameters of the operators $L_k$, $W_k$, $A_k$ and $\bar{A}_k$ defined in (2). In our setup, the extra mapping $\Psi(\cdot)$ can be either learnt or fixed (there is not much difference between these two variants, as fixing the last layer of a DNN has very little influence (Hoffer et al., 2018)), while the activation functions $\sigma_k(\cdot)$ and $\gamma(\cdot)$ are always fixed. To obtain rigorous conclusions, some technical conditions are required; for ease of presentation, we present them along with the established theorems.
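Before moving to the main results, the following is a minimal PyTorch-style sketch of the model and objective just described: one MCN block implementing the layer map (2), and an $l$-layer MCN trained with objective (3). It is an illustrative reading only, not the authors' implementation: the linear operators are dense `nn.Linear` maps, the intermediate layer $\hat{x}_{\hat k}$ is taken to be $x_k$ itself, $\gamma = \exp$ (as in Theorem 1 below), $\sigma_{k+1} = \mathrm{ReLU}$, and $\Psi$ is a fixed linear map; all class names, argument names, and widths are placeholders of ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCNBlock(nn.Module):
    """One MCN layer following Eq. (2):
        x_{k+1} = [ L(x_k) ; exp(A(x_0)) + max( W(x_k), sigma(A_bar(x_k)) ) ].
    Sketch-level choices (not fixed by the paper): dense linear operators,
    hat{x}_k taken to be x_k, gamma = exp, sigma = ReLU.
    """

    def __init__(self, d_in, d_x, d_L, d_out):
        super().__init__()
        assert d_out > d_L, "the concatenation must leave room for the maximum branch"
        d_m = d_out - d_L                  # width of the maximum branch
        self.L = nn.Linear(d_in, d_L)      # linear (concatenation) branch L_{k+1}
        self.W = nn.Linear(d_in, d_m)      # first argument of the maximum, W_{k+1}
        self.A_bar = nn.Linear(d_in, d_m)  # second argument, passed through sigma
        self.A = nn.Linear(d_x, d_m)       # skip connection A_{k+1} applied to the raw input x_0

    def forward(self, x_k, x_0):
        m = torch.maximum(self.W(x_k), F.relu(self.A_bar(x_k)))  # M_{k+1}(x_k)
        skip = torch.exp(self.A(x_0))                             # gamma(A_{k+1}(x_0))
        return torch.cat([self.L(x_k), skip + m], dim=-1)         # [ . ; . ]

class MCN(nn.Module):
    """l stacked MCN blocks followed by a fixed linear mapping Psi, as in objective (3)."""

    def __init__(self, d_x, d_hidden, d_L, d_y, num_layers):
        super().__init__()
        dims = [d_x] + [d_hidden] * num_layers
        self.blocks = nn.ModuleList(
            MCNBlock(dims[k], d_x, d_L, dims[k + 1]) for k in range(num_layers)
        )
        self.psi = nn.Linear(d_hidden, d_y)
        self.psi.requires_grad_(False)     # fixed injection Psi, matching the theory's setting

    def forward(self, x):
        x_0, x_k = x, x
        for block in self.blocks:
            x_k = block(x_k, x_0)          # every block also sees the raw input x_0
        return self.psi(x_k)

# Objective (3): empirical loss (1/n) * sum_i ell(Psi(x_{l,i}), y_i), here with a square loss.
model = MCN(d_x=16, d_hidden=64, d_L=32, d_y=1, num_layers=3)
X, Y = torch.randn(128, 16), torch.randn(128, 1)
loss = F.mse_loss(model(X), Y)
loss.backward()
```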
3. Main Results

This section presents the main results of this paper, including a couple of theories regarding the optimality, fitting ability and generalization performance of MCN. All detailed proofs of these theorems are provided in the supplementary material.

3.1. Effects of Depth

First note that an $(l+1)$-layer MCN is obtained by adding one layer to the network consisting of its first $l$ layers, i.e., $\theta_{l+1} = \{\theta_l,\ \theta(L_{l+1}, W_{l+1}, A_{l+1}, \bar{A}_{l+1})\}$. Under some mild technical conditions, we prove that the training objective (3) is non-increasing, or even monotonically decreasing, as the network goes deeper (this is not in conflict with the learning-based optimization theories (Xie et al., 2019; Liu et al., 2019), which show that their networks can converge fast and need only a small number of layers to solve optimization problems; in fact, empirically, MCN converges when the network is deep enough).

Theorem 1 (Effects of Depth). Let the activation function $\gamma(\cdot)$ be the element-wise $\exp(\cdot)$. Suppose that the loss function $\ell(\cdot)$ in (3) is differentiable and convex. Denote by $\theta_{l+1}$ any local minimum of an $(l+1)$-layer MCN. If $d_{l+1} = d_l$, then the following holds for any fixed injection $\Psi(\cdot)$:
$$L(\theta_{l+1}) \le \min_{\theta_l^*} L(\theta_l^*),$$
where $\theta_l^*$ is a global minimum of the $l$-layer MCN. Moreover, if $\ell(\cdot)$ is strongly convex and there exists $i$ such that $x_{l+1,i} \neq x_{l,i}^*$, then the inequality is strict, namely $L(\theta_{l+1}) < \min_{\theta_l^*} L(\theta_l^*)$.

The setting of fixing $\Psi(\cdot)$ is to ensure that an $(l+1)$-layer MCN and its $l$-layer part are comparable. According to the above theorem, the global minima of an $l$-layer MCN can be attained, or even outperformed, by simply increasing the network depth by one. So, in the context of MCN, increasing the network depth can not only eliminate bad local minima but also help seek good solutions that possess smaller training error, providing a theoretical interpretation of a well-known empirical observation: deeper networks usually lead to better training results. Among other things, provided that the loss function is differentiable and strongly convex, we can further prove that the training error goes to zero; but the proof needs a key theorem established in the next subsection.

Remark 1: One may worry that there exist decreasing paths to infinity, i.e., that the weights may need to diverge in order to improve the quality of local minima (Sohl-Dickstein & Kawaguchi, 2019). The previous works (Kawaguchi, 2016; Liang et al., 2018a;b) may suffer from this problem, mainly due to their explicit regularization, whose coefficient should decay to zero to ensure the consistency of optimization; hence, some parameters are driven to diverge in order to maintain the scale of the output. However, our results hold without requiring any parameter to approach zero or infinity. Furthermore, for the classification problem, this divergence problem can be solved by proper parameter regularization (Liang et al., 2019); but for the general regression problem, regularization may not work. Fortunately, under the over-parameterized setting, algorithmic analysis (Allen-Zhu et al., 2019; Du et al., 2019a) can entirely avoid the divergence risk. We leave the algorithmic analysis of MCN as our future work.

3.2. Approximation Ability

In general, it is unlikely that all mathematical functions can be approximated by DNNs. The following defines a class of functions which can be well approximated by MCN.

Condition 1. For $\beta \in \mathbb{N}$, we define a modified $\beta$-th Sobolev space on the hypercube $[-1,1]^{d_x}$:
$$\mathcal{H}^{\beta} := \Big\{ f : D^{\alpha} f \in L^2\big([-1,1]^{d_x}\big),\ \forall \alpha : |\alpha| \le \beta \Big\},$$
where $\alpha = (\alpha_1, \dots, \alpha_{d_x}) \in \mathbb{N}^{d_x}$ is a multi-index, $D^{\alpha}$ corresponds to the weak derivative operator $\partial_{x_1}^{\alpha_1} \cdots \partial_{x_{d_x}}^{\alpha_{d_x}}$ of order $|\alpha| = \alpha_1 + \cdots + \alpha_{d_x}$, and $|\alpha|_{\infty} = \max_i\{\alpha_i\}$. It is assumed that the function $f \in \mathcal{H}^{2\beta+2}$ obeys the homogeneous Neumann boundary conditions up to order $\beta$:
$$\partial_{x_j}^{2r+1} f \,\big|_{\Omega_j} = 0, \quad j = 1, \dots, d_x,\ \ r = 0, \dots, \beta - 1,$$
where $\Omega_j = \{x \in [-1,1]^{d_x} : |x_j| = 1\}$ is the boundary.

The above condition depicts a class of continuous functions $f \in L^2([-1,1]^{d_x})$ such that $f$ and its weak derivatives up to a certain order have finite $L^2$ norm. Note that the Neumann boundary condition on $[-1,1]^{d_x}$ is not harsh, as we can always extend the target function by first using sine or cosine functions to introduce the homogeneous Neumann property and then scaling it to the interval $[-1,1]^{d_x}$.

As pointed out by (Barron, 1993), a standard fully connected neural network with enough, possibly infinitely many, hidden units can approximate any continuous function on a compact domain. For MCN, we have an explicit approximation bound connecting the width and depth in a finite fashion.
Theorem 2 (Approximation Ability). Let $f$ be a vector-valued function that obeys Condition 1, and let $w \ge 0$, $p \ge 0$, $N \ge 1$ be given numbers. Define $N_d = N(\ln N)^{d_x - 2}$, and denote by $f_\theta$ the output of an MCN. Suppose that either $f_\theta$ is of width $O(N_d d_x w p \ln p)$ and depth $O(l \ln p + N^2)$, or $f_\theta$ has width $O(d_x w p \ln p)$ and depth $O(N_d l \ln p + N^2 N_d)$. Then $f$ can be approximated by MCN with proper parameters, in the sense that
$$\|f_\theta(x) - f(x)\| \le \epsilon, \quad \forall x \in [-1,1]^{d_x}, \qquad \epsilon = O\Big( d_x 2^{d_x} p^2 2^{-wl} + N^{-2\beta - 2} (\ln N)^{d_x - 1} \Big).$$
The number of non-zero parameters in $\theta$ is of the order $O\big( N_d (d_x w^2 p \ln p + N^2) \big)$.

Proof Sketch. We first construct shallow MCNs that approximate $\sin(n\pi x)$ and $\cos(n\pi x)$ for different $n \in \mathbb{N}$ exponentially fast. Then we can obtain a multivariate function $\phi_{\mathbf{n}} := \prod_{i=1}^{d_1} \sin(n_i \pi x_i) \prod_{k=d_1+1}^{d_x} \cos(n_k \pi x_k)$ by an MCN of $O(\ln d_x)$ depth, where $d_1 \le d_x$ and $\mathbf{n} \in \mathbb{N}^{d_x}$. Since the set $\{\phi_{\mathbf{n}},\ \mathbf{n} \in \mathbb{N}^{d_x}\}$ is a Fourier orthogonal basis for $L^2([-1,1]^{d_x})$, we can prove that $N(\ln N)^{d_x - 2}$ sub-MCNs suffice to approximate the target function, where $N = \prod_i n_i$. More detailed proofs can be found in the supplementary material.

Remarkably, Theorem 2 shows that MCN requires only a parameter complexity of $O\big(d_x N(\ln N)^{d_x - 2}\big)$ to approximate the target function, which is dramatically lower than the $O(d_x^{d_x} N^{d_x})$ required by deep ReLU networks (Yarotsky, 2017). This benefit mainly comes from our analysis techniques. Unlike the analyses in (Lu et al., 2020; Yarotsky, 2018), which split the input space into small hyper-cubes and use a local network to approximate the Taylor expansion on each hyper-cube, our analysis is built upon high-dimensional Fourier expansions and can therefore obtain a higher decay rate for the approximation residual. Besides, the special network architecture of MCN is another cause of the lower complexity: the maximum operator makes the power of the decay term for approximating the underlying polynomial be of the order width times depth, whereas in deep ReLU networks the decay power is just proportional to the depth. In summary, Theorem 2 illustrates that MCN with highly sparse connectivity between neurons can produce good approximation performance. This forms a good basis for establishing a tight generalization bound and eliminating bad local minima, as will be shown soon.
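The proof sketch of Theorem 2 rests on expanding the target in an orthogonal Fourier-type basis and keeping only the dominant terms. The following one-dimensional toy check (plain NumPy, not an MCN, with a target function and grid chosen by us) illustrates that mechanism: a smooth, even function on $[-1,1]$ is projected onto the orthogonal system $\{1, \cos(k\pi x)\}$, and the truncation error drops rapidly as more basis functions are kept.

```python
import numpy as np

# Project a smooth, even function on [-1, 1] onto the orthogonal cosine system
# and watch the truncation error fall as more terms are kept.
x = np.linspace(-1.0, 1.0, 20000, endpoint=False)   # uniform grid over one period
dx = x[1] - x[0]
f = np.exp(np.cos(np.pi * x))                        # smooth, even target function

def truncated_expansion(n_terms):
    # c_0 = (1/2) * integral of f over [-1, 1];  ||cos(k*pi*x)||^2 = 1 on [-1, 1]
    approx = np.full_like(x, 0.5 * np.sum(f) * dx)
    for k in range(1, n_terms + 1):
        basis = np.cos(k * np.pi * x)
        approx += (np.sum(f * basis) * dx) * basis   # cosine coefficient times basis function
    return approx

for n in (2, 4, 8, 16):
    err = np.max(np.abs(truncated_expansion(n) - f))
    print(f"{n:2d} cosine terms: max error = {err:.2e}")
```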
3.3. Generalization Bound

Theorem 2 ensures the existence of a good predictor when MCN goes deeper and wider. Now, one natural question is: does the generalization bound also shrink as the network becomes deeper? To analyze the generalization ability of DNNs, or of any other learning method, it is necessary to make some assumptions about the data. In this subsection, we set $d_y = 1$ and assume that $x_i \in [0,1]^{d_x}$ for $i = 1, \dots, n$. We consider the nonparametric regression task, i.e., there exists a target oracle function $f_0$ such that
$$y_i = f_0(x_i) + \varepsilon_i, \quad i = 1, \dots, n, \qquad (4)$$
where the noise terms $\varepsilon_i$ are assumed to be i.i.d. Gaussian and independent of $x_i$. Denote the function class of our MCN as
$$\mathcal{F}(\theta, s) := \big\{ f_\theta : \operatorname{Supp}(\theta) < s,\ \|\theta_k\|_F^2 < \infty,\ \forall k \le l \big\},$$
where $\|\theta_k\|_F$ is the Frobenius norm of all the parameters at the $k$-th layer, and $\operatorname{Supp}(\cdot)$ denotes the support of a set, i.e., $\operatorname{Supp}(\theta)$ is the number of nonzero parameters in MCN. The boundedness assumption $\operatorname{Supp}(\theta) < s$ is made on the basis of Theorem 2, which shows that an MCN with sparse connections can possess strong approximation ability. For convenience, we consider the case where the structure of $\mathcal{F}(\theta, s)$ is deterministic, i.e., the input layer of $A_k(\cdot)$ is the same for all MCNs in $\mathcal{F}(\theta, s)$.

Denote by $\mathcal{N}(\delta, \mathcal{F}(\theta, s), \|\cdot\|_1)$ the minimal number of $\ell_1$-balls with radius $\delta$ that cover $\mathcal{F}(\theta, s)$. The logarithm of $\mathcal{N}(\delta, \mathcal{F}(\theta, s), \|\cdot\|_1)$ is also called the covering number for convenience. For an operator $A$, $\|A\|_1$ denotes its $\ell_1$ norm induced by the vector $\ell_1$ norm, namely $\|A\|_1 = \max_{x \neq 0} \frac{\|A(x)\|_1}{\|x\|_1}$. Then we have the following theorem to bound the covering number (i.e., $\ln \mathcal{N}(\delta, \mathcal{F}(\theta, s), \|\cdot\|_1)$).

Theorem 3 (Covering Number of MCN). Assume that the activation function $\sigma_k(\cdot)$ is $\rho_k$-Lipschitz and $\rho_k \le \rho$ for $k = 1, \dots, l$. Then one block of MCN is $\kappa_k$-Lipschitz continuous w.r.t. its input layers, with
$$\kappa_k = \big(1 + \max\{\rho_k, 2\}\, \|\theta_k\|_1\big), \qquad \|\theta_k\|_1 := \max\big\{ \|A_k\|_1,\ \|\bar{A}_k\|_1,\ \|W_k + L_k\|_1 \big\}.$$
Moreover, we have
$$\ln \mathcal{N}\big(\mathcal{F}(\theta, s), \delta, \|\cdot\|_1\big) \le O\!\left( s\, l^2 \ln \frac{\rho\, w \prod_{k=1}^{l} \kappa_k}{\delta} \right),$$
where $w$ and $l$ are the width and depth of MCN, respectively.

The above theorem shows that the covering number of MCN is $O\big(s\, l^2 \ln(w/\delta)\big)$, where $s = \Theta\big(d_x N(\ln N)^{d_x - 2}\big)$. By contrast, to achieve the same approximation accuracy, deep ReLU networks need a covering number of $O\big(s'\, l \ln(s' w^2 l / \delta)\big)$, with $s' = \Theta(d_x^{d_x} N^{d_x})$. For high-dimensional data, i.e., when $d_x$ is large, it is clear that MCN has a much smaller covering number than deep ReLU networks, which means that the model complexity of MCN is much lower. Due to this, MCN provably owns good generalization performance, as shown in the following.

Corollary 3.1 (Generalization Bound). Consider the regression problem in (4) and assume $\max_{x \in [0,1]^{d_x}} |f_0(x)| < \infty$. Let $\hat{f}_M$ be any MCN from $\mathcal{F}(\theta, s)$, and define
$$\ell_n(f) := \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2, \qquad \Delta_n := \mathbb{E}_{f_0}\Big[ \ell_n(\hat{f}_M) - \inf_{f \in \mathcal{F}} \ell_n(f) \Big],$$
where $\mathbb{E}_{f_0}$ is the expectation taken with respect to the samples generated from the regression model (4). Define $\mathrm{dis}(\hat{f}_M, f_0) := \mathbb{E}_{f_0}\big[ (\hat{f}_M(x) - f_0(x))^2 \big]$. Then we have
$$\mathrm{dis}(\hat{f}_M, f_0) \le O\left( \Delta_n + \inf_{f \in \mathcal{F}} \mathrm{dis}(f, f_0) + \frac{s\, l^2 \ln(wn)}{n} \right).$$

This corollary is indeed a direct application of the general statistical generalization inequality in (Lu et al., 2020; Yarotsky, 2017). As we can see, the generalization bound depends on three parts, intuitively described as $\epsilon_1 + \epsilon_2 + \epsilon_3$, where $\epsilon_1$ is the gap from the obtained training loss to the globally minimal one, $\epsilon_2$ is the approximation error, and $\epsilon_3$ comes from the covering number. Notably, Theorem 1 provides a way to reduce $\epsilon_1$, and Theorem 3 ensures that a small $\epsilon_2$ does not necessarily result in a large $\epsilon_3$.

For nonparametric regression with the square loss, when the target function $f_0$ is $\beta$-smooth, it is well known that the statistically optimal estimation rate in terms of the data size is $n^{-\frac{2\beta}{2\beta + d_x}}$ (Giné & Nickl, 2016), also called the minimax estimation rate. Attaining the minimax estimation rate means that the estimator performs best in the worst case. Interestingly, when the training data are fitted exactly, MCN also owns this property.

Theorem 4 (Minimax Estimation Rate). Suppose that the density $p(\cdot)$ over some compact set $C$ satisfies $0 < p_{\min} \le p(x) \le p_{\max}$ for all $x \in C$. Assume that the target function $f_0$ is $\beta$-smooth and let $\ell(\cdot)$ in (3) be the square loss. Denote the final output of our model as $f_\theta(x_i)$, where $\theta$ collects the learnable parameters of MCN. If $f_\theta(x_i) = y_i$ for $i = 1, \dots, n$, then for any data sample $x \in \mathbb{R}^{d_x}$ located in the support set of $p$, the output of MCN satisfies the following with high probability:
$$\mathbb{E}_{S_n} \mathbb{E}_{\varepsilon}\Big[ \big( f_\theta(x) - f_0(x) \big)^2 \,\Big|\, S_n \Big] \le C\, n^{-\frac{2\beta}{2\beta + d_x}},$$
where $S_n = \{(x_i, y_i)\}_{i=1}^{n}$ and $C > 0$ is a number that depends only on the numerical range of the outputs of MCN.

In general, the above theorem confirms the phenomenon that over-parameterized DNNs do not necessarily cause over-fitting (Belkin et al., 2019; 2018b). For Theorem 4 to hold, the training error has to be reduced to zero.
This can actually be accomplished by using the techniques in (Gasca & Sauer, 2000) to link Theorem 1 and Theorem 2 together, as will be shown in the next subsection.

Remark 2: One may worry that the exact fitting assumption cannot be satisfied, since the noise follows an unbounded distribution, so that the derivatives or weights of the DNN may diverge to infinity as $n \to \infty$. However, this is not really a problem, and exact fitting can easily happen under mild conditions. On the one hand, the Gaussian distribution has an exponentially decaying tail, so we can approximately treat it as bounded. On the other hand, some recent results (Arora et al., 2019c; Du et al., 2019a; Allen-Zhu et al., 2018; Liang et al., 2020; E et al., 2019) show that DNNs with universal approximation ability can easily fit Gaussian noise without any weight diverging. Even more, exact fitting can happen near the initial state of a DNN as long as it is wide or deep enough, with the depth or width being polynomial in $n$. For MCN, we have already proved its approximation ability in Theorem 2; following the same road-map, we can conclude that its parameters do not diverge in the exact fitting case.

Remark 3: We remark that the estimator $f_\theta$ does not belong to the $\beta$-smooth function class (its smoothness depends on the architecture and the activation functions). In conclusion, even though $f_\theta$ is not $\beta$-smooth and fits the data exactly, it attains the optimal excess loss rate. We refer the readers to (Rakhlin et al., 2017) for further discussion of optimal rates in non-parametric estimation and statistical learning.

Remark 4: For any $x_i \in S_n$, $\mathbb{E}_{\varepsilon_i}\big[ (f_\theta(x_i) - f_0(x_i))^2 \mid S_n \big] = 1$, while $\mathbb{E}_{S_n}\mathbb{E}_{\varepsilon}\big[ (f_\theta(x_i) - f_0(x_i))^2 \mid S_n \big] \to 0$ as $n \to \infty$, because the measure of a specific point is 0.

3.4. No Bad Local Minima

As aforementioned, under mild technical conditions, the training error produced by MCN can be arbitrarily small when the network is deep enough.

Corollary 4.1 (Optimal Training Error). Suppose that the loss function $\ell(\cdot)$ is differentiable and strongly convex. Denote by $\theta_l$ any local minimum of an $l$-layer MCN. For any $\epsilon > 0$, there exists a $D \in \mathbb{N}$ such that $L(\theta_l) \le \epsilon$ holds for any $l > D$.

The "no bad local minima" property of MCN relies on its special network design, and it does not necessarily hold for other DNNs. In the following, we introduce two ways to refine an existing DNN that is possibly poorly designed. The first one is straightforward: simply treat the output of an existing DNN as the input $x_0$ to MCN, with the parameters of the existing network not involved in re-training. In this case, it is easy to obtain the following result.

Corollary 4.2 (Partial Training). For a fixed injection $\Psi(\cdot)$ and an existing $l_0$-layer DNN with output $h_0$, construct an $l$-layer MCN with $\gamma(\cdot)$ being the element-wise exponential and input $x_0 = h_0$. If $h_0$ is an injective function of the input $x$ and the loss $\ell(\cdot)$ is differentiable and strongly convex, then for any $\epsilon > 0$ there exists a large enough $D \in \mathbb{N}$ such that $L(\theta_l) \le \epsilon$ holds for any local minimum $\theta_l$ with $l \ge D$.

In the above corollary, the existing DNN is assumed to be fixed, and MCN is simply applied to its output. Actually, it is also feasible to re-train all the parameters, including those of both the existing network and the appended MCN blocks.

Theorem 5 (Full Training). For a fixed injection $\Psi(\cdot)$ and an existing $l_0$-layer DNN with output $h_0$, append an $l$-layer MCN at its end with $\gamma(\cdot)$ being the element-wise exponential, resulting in a new model $h_l$. Suppose that the loss $\ell(\cdot)$ is differentiable and strongly convex, and that there exist parameters making $h_0$ injective. Then, for any $\epsilon > 0$, there exists a large enough $D \in \mathbb{N}$ such that
$$\frac{1}{n}\sum_{i=1}^{n} \ell\big(\Psi(h_l(x_i)),\, y_i\big) \le \epsilon$$
holds at any local minimum of $h_l$ with $l \ge D$.

One may have noticed that the monotonic decreasing property in Theorem 1 is not enough to guarantee a globally minimal training loss. In fact, as aforementioned, Theorem 2 also plays an important role in obtaining the above results, and we need to use the techniques in (Gasca & Sauer, 2000) to link Theorem 1 and Theorem 2 together. Remarkably, the above results illustrate that MCN is not just an approach for seeking the globally optimal solution to certain optimization problems, but instead a powerful tool for helping seek better solutions to the primary task behind those optimization problems.
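As a concrete reading of the appending procedure in Corollary 4.2 and Theorem 5, the sketch below attaches MCN blocks to a pretrained backbone and, in the partial-training case, freezes the backbone's parameters. It reuses the hypothetical `MCNBlock` class from the sketch in Section 2.2 and uses torchvision's ResNet18 merely as a stand-in for the "existing DNN"; the feature width, class count, and block count are placeholders, not the paper's experimental configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class BackbonePlusMCN(nn.Module):
    """An existing DNN with MCN blocks appended, cf. Corollary 4.2 / Theorem 5.
    partial=True freezes the backbone (partial training, Corollary 4.2);
    partial=False trains everything from scratch (full training, Theorem 5).
    """

    def __init__(self, num_classes=10, num_mcn_blocks=2, partial=True):
        super().__init__()
        backbone = models.resnet18(weights=None)   # older torchvision: pretrained=False
        backbone.fc = nn.Identity()                # use ResNet18 as a 512-d feature extractor h_0
        self.backbone = backbone
        if partial:
            for p in self.backbone.parameters():
                p.requires_grad = False            # backbone not involved in re-training
        d = 512
        self.mcn_blocks = nn.ModuleList(
            MCNBlock(d_in=d, d_x=d, d_L=d // 2, d_out=d) for _ in range(num_mcn_blocks)
        )
        self.psi = nn.Linear(d, num_classes)       # final mapping Psi

    def forward(self, images):
        h0 = self.backbone(images)                 # output of the existing DNN, used as x_0
        h = h0
        for block in self.mcn_blocks:
            h = block(h, h0)
        return self.psi(h)

model = BackbonePlusMCN(partial=True)              # only the MCN blocks and Psi receive gradients
logits = model(torch.randn(4, 3, 32, 32))
loss = F.cross_entropy(logits, torch.randint(0, 10, (4,)))
loss.backward()
```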
3.5. Discussions

There is another interpretation of why MCN can eliminate bad local minima. When adopting the square loss, we find that the loss in (3) at a local minimum equals a projection residual obtained by projecting the training data onto a subspace. The subspace is spanned using the parameters in the concatenated linear part $L_k(\cdot)$ for $k = 1, \dots, l$, which means that the subspace is larger when more independent parameters are contained in the linear branch $L_k(\cdot)$; and a larger space often brings a smaller projection residual. Please see Section B in the supplementary material for more details.

To summarize, this section establishes a collection of theorems to cope with the problems of bad local minima and generalization. More precisely, first, Theorem 1 and Corollary 4.1 reveal the "no bad local minima" property of MCN, and Corollary 4.2 and Theorem 5 extend this property to other DNNs. Second, Theorem 2 shows the approximation ability of MCN, illustrating that MCN can reach the same approximation error using far fewer parameters than deep ReLU networks. The number of required parameters is far smaller than the network size, which implies that MCN allows the use of prevalent sparse patterns such as CNN structures and pruning tricks. The sparsity of the network connections further leads to the small covering number of MCN in Theorem 3. Based on this, finally, we provide the generalization bound for MCN in Corollary 3.1.

4. Experiments

4.1. Theorems Verification

Figure 2. Left: Training loss of our MCN (red) and the baseline network (green) with various numbers of layers. Right: Testing accuracy.

We conduct experiments on the commonly used CIFAR-10 dataset, with the purpose of validating our theorems as well as the effectiveness of MCN. We first construct a baseline network with 6 weighted layers, including five convolutional layers and one fully-connected layer, and then add convolutional layers to make the network deeper; it contains five max-pooling layers in total. For our MCN, we replace the convolutional layers after the third max-pooling layer with our MCN blocks. To make a fair comparison, both networks have the same number of layers and parameters, as well as the same random seed and learning rate. Also, batch normalization and ReLU are adopted by both networks. For detailed experimental settings and model configurations, please refer to the supplementary material.

Figure 2 shows the training loss and testing accuracy with different numbers of layers. According to the red line in the left part of Figure 2, the training loss of our MCN monotonically decreases as the depth increases. This is consistent with our Theorems 1 and 2.
From the red line in the right part of Figure 2, we can see that a deeper MCN achieves better testing accuracy, which demonstrates the generalization performance of MCN and confirms our Corollary 3.1 and Theorem 4. In addition, according to the green line in the right part of Figure 2, the testing accuracy of the baseline network does not monotonically increase as the network goes deeper. Therefore, the "no bad local minima" property should be a primary cause of the nice performance of MCN. In summary, compared with the baseline network, our MCN has much lower training loss as well as higher testing accuracy, revealing the superiority of MCN.

4.2. Appending MCN

To validate the merits of Corollary 4.2 and Theorem 5, we add two MCN blocks to VGG19 (Simonyan & Zisserman, 2014) and ResNet18 (He et al., 2016) as the treatment group. The two original architectures, VGG19 and ResNet18, are regarded as the first control group. For comparison, we also add two traditional convolutional layers to VGG19 and ResNet18, forming the second control group. For the treatment group and the second control group, we consider two ways to train the appended VGG19 and ResNet18 (abbreviated as Res18). The first is partial training, which treats VGG19 and Res18 as feature extractors whose parameters are not involved during training. The second is full training, which considers the appended networks as new models and trains them from scratch.

Table 1 shows the comparison results among all three groups, in terms of both training loss and testing accuracy. As we can see, plugging in traditional convolutional layers can decrease the training loss; however, appending MCN brings a larger improvement, which again shows the benefits of the "no bad local minima" property. Interestingly, full training and partial training achieve comparable performance when appending MCN, but not when appending convolutional layers. Hence, both Corollary 4.2 and Theorem 5 are practical theories. Moreover, our MCN distinctly outperforms all the competing methods; this, again, confirms the superiority of the MCN architecture.

Table 1. The training error (Err.) and testing accuracy (Acc.) of different models on the CIFAR-10 dataset. We denote by C the added two convolutional layers and by M the appended MCN blocks.

| Models | VGG19 | VGG19+C(full) | VGG19+C(part) | VGG19+M(full) | VGG19+M(part) | Res18 | Res18+C(full) | Res18+C(part) | Res18+M(full) | Res18+M(part) |
|---|---|---|---|---|---|---|---|---|---|---|
| Err. | 0.0016 | 0.0013 | 0.0015 | 0.0010 | 0.0011 | 0.0013 | 0.0011 | 0.0012 | 0.0009 | 0.0009 |
| Acc. | 92.0% | 92.4% | 92.1% | 92.8% | 92.6% | 92.7% | 93.5% | 93.1% | 93.7% | 93.8% |

Table 2. The training error (Err.) and testing accuracy (Acc.) of different models on the CIFAR-100 dataset. We denote by C the added two convolutional layers and by M the appended MCN blocks.

| Models | Res18 | Res18+C(full) | Res18+C(part) | Res18+M(full) | Res18+M(part) | ResNeXt29 | ResNeXt29+C(full) | ResNeXt29+C(part) | ResNeXt29+M(full) | ResNeXt29+M(part) |
|---|---|---|---|---|---|---|---|---|---|---|
| Err. | 0.0020 | 0.0014 | 0.0015 | 0.0009 | 0.0009 | 0.0056 | 0.0051 | 0.0054 | 0.0008 | 0.0011 |
| Acc. | 76.15% | 76.58% | 76.48% | 76.95% | 76.87% | 80.71% | 80.78% | 80.69% | 82.31% | 81.41% |

4.3. Additional Experiments

To better demonstrate the representation ability of our MCN blocks, we further conduct additional experiments on the more complex CIFAR-100 dataset, and compare with ResNeXt (Xie et al., 2017), a more powerful state-of-the-art architecture. Similar to the previous part, the two original architectures, Res18 and ResNeXt29, are regarded as the first control group.
As for the second control group, we still add two traditional convolutional layers to Res18 and ResNeXt29. Besides, we append two MCN blocks to the end of both Res18 and ResNeXt29 as the treatment group. For the treatment group and the second control group, the two ways of training the appended Res18 and ResNeXt29 remain the same as in the previous experiment: one is partial training, which treats Res18 and ResNeXt29 as feature extractors, while the other is full training, which considers the appended DNNs as new models and trains them from scratch.

We present the results, including the partial-training setting (i.e., fixing Res18 and ResNeXt29 when appending the MCN blocks), in Table 2. It can be seen that, even when handling complex data, our MCN achieves superior results. The treatment groups under the two different training methods both outperform the control groups, which is consistent with Table 1. Moreover, comparing Table 1 with Table 2 shows that our MCN blocks bring even greater improvements when handling more complex data. Please note that ordinarily appending convolutional layers cannot ensure monotone improvements in Err. and Acc. This phenomenon not only verifies Corollary 4.2 and Theorem 5 again, but also shows that our MCN has a stronger representation ability than a general linear structure, which corresponds to Theorem 2.

5. Conclusion

In this paper, we propose a novel multi-layer DNN structure termed MCN, which can approximate some classes of continuous functions arbitrarily well even with highly sparse connections. We prove that the global minima of an $l$-layer MCN may be outperformed, or at least attained, by simply increasing the network depth. More importantly, MCN can be easily appended to many existing DNNs, and the augmented DNN shares the same nice properties as MCN. Finally, we analyze the generalization ability of MCN and reveal that depth is more important than width for generalization; this supports the mechanism of deep learning. In summary, this study takes a step towards the ultimate goal of deep learning theory: understanding why DNNs can work well in a wide variety of applications.

Acknowledgments

This work is supported in part by the New Generation AI Major Project of the Ministry of Science and Technology of China (grant no. 2018AAA0102501), in part by NSF China (grant nos. 61625301 and 61731018), in part by the Major Scientific Research Project of Zhejiang Lab (grant nos. 2019KB0AC01 and 2019KB0AB02), in part by the Fundamental Research Funds of Shandong University, in part by the Future Talents Research Funds of Shandong University, in part by the Beijing Academy of Artificial Intelligence, in part by Qualcomm, and in part by the SenseTime Research Fund.

References

Adcock, B. Multivariate modified Fourier series and application to boundary value problems. Numerische Mathematik, 115(4):511–552, 2010.

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252, 2019.

Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In International Conference on Machine Learning, pp. 146–155, 2017.

Arora, S., Cohen, N., Golowich, N., and Hu, W. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.

Arora, S., Cohen, N., Hu, W., and Luo, Y. Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, pp. 7411–7422, 2019a.
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019b.

Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019c.

Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018a.

Belkin, M., Hsu, D. J., and Mitra, P. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pp. 2300–2311, 2018b.

Belkin, M., Rakhlin, A., and Tsybakov, A. B. Does data interpolation contradict statistical optimality? In International Conference on Artificial Intelligence and Statistics, pp. 1611–1619, 2019.

Cao, Y. and Gu, Q. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.

Dou, X. and Liang, T. Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. arXiv preprint arXiv:1901.07114, 2019.

Du, S. S., Wang, Y., Zhai, X., Balakrishnan, S., Salakhutdinov, R. R., and Singh, A. How many samples are needed to estimate a convolutional neural network? In Advances in Neural Information Processing Systems, pp. 373–383, 2018.

Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, 2019a.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019b.

E, W., Ma, C., Wu, L., et al. On the generalization properties of minimum-norm solutions for over-parameterized neural network models. arXiv preprint arXiv:1912.06987, 2019.

Gasca, M. and Sauer, T. Polynomial interpolation in several variables. Advances in Computational Mathematics, 12(4):377, 2000.

Giné, E. and Nickl, R. Mathematical Foundations of Infinite-Dimensional Statistical Models, volume 40. Cambridge University Press, 2016.

Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. In International Conference on Machine Learning, pp. 1319–1327, 2013.

Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.

Hardt, M. and Ma, T. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hoffer, E., Hubara, I., and Soudry, D. Fix your classifier: the marginal value of training the last weight layer. arXiv preprint arXiv:1801.04540, 2018.

Horn, R. A. and Johnson, C. R. Topics in Matrix Analysis. Cambridge University Press, 1991.

Huybrechs, D., Iserles, A., et al. From high oscillation to rapid approximation IV: Accelerating convergence. IMA Journal of Numerical Analysis, 31(2):442–468, 2011.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Kawaguchi, K. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586–594, 2016.

Kawaguchi, K. and Kaelbling, L. P. Elimination of all bad local minima in deep learning. arXiv preprint arXiv:1901.00279, 2019.

Kawaguchi, K., Huang, J., and Kaelbling, L. P. Effect of depth and width on local minima in deep learning. Neural Computation, 2019.

Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166, 2018.

Liang, S., Sun, R., Lee, J. D., and Srikant, R. Adding one neuron can eliminate all bad local minima. In Advances in Neural Information Processing Systems, pp. 4350–4360, 2018a.

Liang, S., Sun, R., Li, Y., and Srikant, R. Understanding the loss surface of neural networks for binary classification. arXiv preprint arXiv:1803.00909, 2018b.

Liang, S., Sun, R., and Srikant, R. Revisiting landscape analysis in deep neural networks: Eliminating decreasing paths to infinity. arXiv preprint arXiv:1912.13472, 2019.

Liang, T., Rakhlin, A., and Zhai, X. On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. arXiv preprint arXiv:1908.10292, 2020.

Liu, J., Chen, X., Wang, Z., and Yin, W. ALISTA: Analytic weights are as good as learned weights in LISTA. In International Conference on Learning Representations, 2019.

Lu, J., Shen, Z., Yang, H., and Zhang, S. Deep network approximation for smooth functions. arXiv preprint arXiv:2001.03040, 2020.

Luxburg, U. v. and Bousquet, O. Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5(Jun):669–695, 2004.

Ma, C., Wu, L., et al. A priori estimates of the generalization error for two-layer neural networks. arXiv preprint arXiv:1810.06397, 2018.

Maillard, O. and Munos, R. Compressed least-squares regression. In Advances in Neural Information Processing Systems, pp. 1213–1221, 2009.

Olver, S. On the convergence rate of a modified Fourier series. Mathematics of Computation, 78(267):1629–1645, 2009.

Rakhlin, A., Sridharan, K., Tsybakov, A. B., et al. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.

Rumelhart, D. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. Annals of Statistics, 2019.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. 2014.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Sohl-Dickstein, J. and Kawaguchi, K. Eliminating all bad local minima from loss landscapes without even adding an extra unit. arXiv preprint arXiv:1901.03909, 2019.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Telgarsky, M. Benefits of depth in neural networks. In Conference on Learning Theory, pp. 1517–1539, 2016.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
Wei, C. and Ma, T. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. arXiv preprint arXiv:1905.03684, 2019.

Wu, Y. and He, K. Group normalization. In European Conference on Computer Vision, pp. 3–19, 2018.

Xie, B., Liang, Y., and Song, L. Diverse neural network learns true target functions. In International Conference on Artificial Intelligence and Statistics, pp. 1216–1224, 2017.

Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5987–5995, 2017.

Xie, X., Wu, J., Zhong, Z., Liu, G., and Lin, Z. Differentiable linearized ADMM. In International Conference on Machine Learning, 2019.

Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.

Yarotsky, D. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pp. 639–649, 2018.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Machine Learning, 2017.

Zhang, X., Ling, C., and Qi, L. The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM Journal on Matrix Analysis and Applications, 33(3):806–821, 2012.