Understanding Geometry of Encoder-Decoder CNNs

Jong Chul Ye 1,2   Woon Kyoung Sung 2

1 Dept. of Bio/Brain Engineering, KAIST, Daejeon 34141, Republic of Korea. 2 Dept. of Mathematical Sciences, KAIST, Daejeon 34141, Republic of Korea. Correspondence to: Jong Chul Ye.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Encoder-decoder networks using convolutional neural network (CNN) architectures have been extensively used in the deep learning literature thanks to their excellent performance for various inverse problems. However, it is still difficult to obtain a coherent geometric view of why such an architecture gives the desired performance. Inspired by recent theoretical understanding of the generalizability, expressivity, and optimization landscape of neural networks, as well as the theory of deep convolutional framelets, here we provide a unified theoretical framework that leads to a better understanding of the geometry of encoder-decoder CNNs. Our unified framework shows that the encoder-decoder CNN architecture is closely related to a nonlinear frame representation using combinatorial convolution frames, whose expressivity increases exponentially with the depth. We also demonstrate the importance of skipped connections in terms of expressivity and optimization landscape.

1. Introduction

For the last decade, we have witnessed the unprecedented success of deep neural networks (DNNs) in various applications in computer vision, classification, medical imaging, etc. Aside from traditional applications such as classification (Krizhevsky et al., 2012), segmentation (Ronneberger et al., 2015), image denoising (Zhang et al., 2017), and super-resolution (Kim et al., 2016), deep learning approaches have already become the state-of-the-art technologies in various inverse problems in x-ray CT, MRI, etc. (Kang et al., 2017; Jin et al., 2017; Hammernik et al., 2018).

However, the more we see the success of deep learning, the more mysterious the nature of deep neural networks becomes. In particular, the amazing aspects of the expressive power, generalization capability, and optimization landscape of DNNs have become an intellectual challenge for the machine learning community, leading to many new theoretical results with varying capacities to facilitate the understanding of deep neural networks (Ge & Ma, 2017; Hanin & Sellke, 2017; Yarotsky, 2017; Nguyen & Hein, 2017; Arora et al., 2016; Du et al., 2018; Raghu et al., 2017; Bartlett et al., 2017; Neyshabur et al., 2018; Nguyen & Hein, 2018; Rolnick & Tegmark, 2017; Shen, 2018).

In inverse problems, one of the most widely employed network architectures is the so-called encoder-decoder CNN architecture (Ronneberger et al., 2015). In contrast to the simplified forms of neural networks that are often used in theoretical analysis, these encoder-decoder CNNs usually have more complicated network architectures such as symmetric network configuration, skipped connections, etc. Therefore, it is not clear how the aforementioned theory can be used to understand the geometry of encoder-decoder CNNs and to examine the origin of their superior performance.

Recently, the authors in (Ye et al., 2018) proposed the so-called deep convolutional framelets to explain the encoder-decoder CNN architecture from a signal processing perspective.
The main idea is that a data-driven decomposition of the Hankel matrix constructed from the input data provides encoder-decoder layers that have a striking similarity to encoder-decoder CNNs. However, one of the main weaknesses of the theory is that it is not clear where the exponential expressiveness comes from. Moreover, many theoretical issues of neural networks, such as generalizability and the optimization landscape, which have been extensively studied in the machine learning literature, have not been addressed.

Therefore, this work aims at filling the gap and finding the connections between machine learning and signal processing to provide a unified theoretical analysis that facilitates the geometric understanding of encoder-decoder CNNs. Accordingly, we have revealed the following geometric features of encoder-decoder CNNs:

- An encoder-decoder CNN with an over-parameterized feature layer approximates a map between two smooth manifolds that is decomposed as a high-dimensional embedding followed by a quotient map.
- An encoder-decoder CNN with ReLU nonlinearity can be understood as deep convolutional framelets that use combinatorial frames of spatially varying convolutions. Accordingly, the number of linear representations increases exponentially with the network depth. This also suggests that the input space is divided into non-overlapping areas where each area shares a common linear representation.
- We derive an explicit form of the Lipschitz condition that determines the generalization capability of encoder-decoder CNNs. The expression shows that the expressiveness of the network is not affected by the control of the Lipschitz constant.
- We provide explicit conditions under which the optimization landscape for encoder-decoder CNNs is benign. Specifically, we show that the skipped connections play important roles in smoothing out the optimization landscape.

[Figure 1: An architecture of a $\kappa$-layer symmetric encoder-decoder CNN with skipped connections. Here, $q_l$ denotes the number of channels at the $l$-th layer, $m_l$ refers to each channel dimension, and $d_l$ represents the total dimension of the feature at the $l$-th layer.]

All the proofs of the theorems and lemmas in this paper are included in the Supplementary Material.

2. Related Works

Choromanska et al. (Choromanska et al., 2015) employed the spin glass model from statistical physics to analyze the representation power of deep neural networks. Telgarsky constructed interesting classes of functions that can only be computed efficiently by deep ReLU nets, but not by shallower networks with a similar number of parameters (Telgarsky, 2016). Arora et al. (Arora et al., 2016) showed that for every natural number $k$ there exists a ReLU network with $k^2$ hidden layers and total size of $k^2$ that can only be represented using $\tfrac{1}{2}k^{k+1}-1$ neurons by networks with at most $k$ hidden layers. All these results agree that the expressive power of deep neural networks increases exponentially with the network depth.

The generalization capability has been addressed in terms of various complexity measures such as the Rademacher complexity (Bartlett & Mendelson, 2002), the VC bound (Anthony & Bartlett, 2009), the Kolmogorov complexity (Schmidhuber, 1997), etc. However, a recent work (Zhang et al., 2016) showed the intriguing result that these classical bounds are too pessimistic to explain the generalizability of deep neural networks.
Moreover, it has been repeatedly shown that over-parameterized deep neural networks, which are trained with fewer samples than the number of neurons, generalize well rather than overfit (Cohen et al., 2018; Wei et al., 2018; Brutzkus et al., 2017; Du & Lee, 2018), a phenomenon that cannot be explained by the classical complexity results.

The optimization landscape of neural networks has been another important theoretical issue. Originally observed in linear deep neural networks (Kawaguchi, 2016), the benign optimization landscape has been consistently observed in various neural networks (Du et al., 2018; Nguyen & Hein, 2018; Du et al., 2017; Nguyen & Hein, 2017). However, these theoretical works mainly focus on simplified network architectures, and we are not aware of analysis for encoder-decoder CNNs.

3. Encoder-Decoder CNNs

3.1. Definition

In this section, we provide a formal definition of encoder-decoder CNNs (E-D CNNs) to facilitate the theoretical analysis. Although our definition is for 1-dimensional signals, its extension to 2-D images is straightforward.

3.1.1. BASIC ARCHITECTURE

Consider the encoder-decoder network in Fig. 1. Specifically, the encoder network maps a given input signal $x \in \mathcal{X} \subset \mathbb{R}^{d_0}$ to a feature space $z \in \mathcal{Z} \subset \mathbb{R}^{d_\kappa}$, whereas the decoder takes this feature map as an input, processes it, and produces an output $y \in \mathcal{Y} \subset \mathbb{R}^{d_0}$. In this paper, a symmetric configuration is considered so that both the encoder and the decoder have the same number of layers, say $\kappa$; the input and output dimensions for the encoder layer $E^l$ and the decoder layer $D^l$ are symmetric:

$$E^l : \mathbb{R}^{d_{l-1}} \mapsto \mathbb{R}^{d_l}, \qquad D^l : \mathbb{R}^{d_l} \mapsto \mathbb{R}^{d_{l-1}},$$

where $l \in [\kappa]$ with $[n]$ denoting the set $\{1,\cdots,n\}$; and both the input and the output dimension is $d_0$. More specifically, the $l$-th layer input signal for the encoder layer comes from $q_{l-1}$ input channels:

$$\xi^{l-1} = \begin{bmatrix} \xi_1^{l-1\top} & \cdots & \xi_{q_{l-1}}^{l-1\top} \end{bmatrix}^\top \in \mathbb{R}^{d_{l-1}},$$

where $\top$ denotes the transpose, and $\xi_j^{l-1} \in \mathbb{R}^{m_{l-1}}$ refers to the $j$-th channel input with dimension $m_{l-1}$. Therefore, the overall input dimension is given by $d_{l-1} := m_{l-1} q_{l-1}$. Then, the $l$-th layer encoder generates the $q_l$ channel output using the convolution operation:

$$\xi_j^l = \sigma\left(\Phi^{l\top}\left(\sum_{k=1}^{q_{l-1}} \xi_k^{l-1} \circledast \overline{\psi}{}_{j,k}^{\,l}\right)\right), \quad j \in [q_l], \qquad (1)$$

where $\xi_j^l \in \mathbb{R}^{m_l}$ refers to the $j$-th channel output after the convolutional filtering with the $r$-tap filters $\psi_{j,k}^l \in \mathbb{R}^r$ and the pooling operation $\Phi^l \in \mathbb{R}^{m_{l-1}\times m_l}$, and $\sigma(\cdot)$ denotes the element-wise rectified linear unit (ReLU). More specifically, $\psi_{j,k}^l \in \mathbb{R}^r$ denotes the $r$-tap convolutional kernel that is convolved with the $k$-th input to contribute to the output of the $j$-th channel, $\circledast$ is the circular convolution via the periodic boundary condition to avoid special treatment of the convolution at the boundary, and $\overline{v}$ refers to the flipped version of the vector $v$. For the formal definition of the convolution operation used in this paper, see Appendix A in the Supplementary Material.

Moreover, as shown in Appendix B in the Supplementary Material, an equivalent matrix representation of the encoder layer is given by

$$\xi^l := \sigma\big(E^{l\top}\xi^{l-1}\big) = \begin{bmatrix} \xi_1^{l\top} & \cdots & \xi_{q_l}^{l\top} \end{bmatrix}^\top,$$

where $E^l \in \mathbb{R}^{d_{l-1}\times d_l}$ is computed by

$$E^l = \begin{bmatrix} \Phi^l \circledast \psi_{1,1}^l & \cdots & \Phi^l \circledast \psi_{q_l,1}^l \\ \vdots & \ddots & \vdots \\ \Phi^l \circledast \psi_{1,q_{l-1}}^l & \cdots & \Phi^l \circledast \psi_{q_l,q_{l-1}}^l \end{bmatrix} \qquad (2)$$

with

$$\Phi^l \circledast \psi_{i,j}^l := \begin{bmatrix} \phi_1^l \circledast \psi_{i,j}^l & \cdots & \phi_{m_l}^l \circledast \psi_{i,j}^l \end{bmatrix}. \qquad (3)$$

(Here, without loss of generality, the bias term is not explicitly shown, since it can be incorporated into the matrices $E^l$ and $D^l$ as an additional column.)
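To fix ideas, the following is a minimal NumPy sketch of a single encoder layer in the form of (1); the average-pooling matrix $\Phi^l$ and the random filters are illustrative assumptions rather than the paper's construction.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def circ_conv(x, h):
    """Circular convolution of x (length m) with an r-tap filter h (periodic boundary)."""
    m, r = len(x), len(h)
    return np.array([sum(h[k] * x[(i - k) % m] for k in range(r)) for i in range(m)])

def encoder_layer(xi_prev, psi, Phi):
    """One encoder layer in the form of Eq. (1).

    xi_prev : (q_prev, m_prev) channel inputs xi^{l-1}
    psi     : (q_out, q_prev, r) filters psi^l_{j,k}
    Phi     : (m_prev, m_out) pooling matrix, applied as Phi^T
    """
    q_out = psi.shape[0]
    out = []
    for j in range(q_out):
        # sum_k  xi_k^{l-1}  circularly convolved with the flipped filter psi^l_{j,k}
        s = sum(circ_conv(xi_prev[k], psi[j, k][::-1]) for k in range(xi_prev.shape[0]))
        out.append(relu(Phi.T @ s))        # pooling, then ReLU
    return np.stack(out)                   # (q_out, m_out)

# toy usage: q_{l-1}=2 channels of length 8, q_l=3 output channels, r=3 taps, 2x average pooling
rng = np.random.default_rng(0)
xi = rng.standard_normal((2, 8))
psi = rng.standard_normal((3, 2, 3))
Phi = np.kron(np.eye(4), 0.5 * np.ones((2, 1)))   # (8, 4) average-pooling matrix (an assumption)
print(encoder_layer(xi, psi, Phi).shape)           # (3, 4)
```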
On the other hand, the $l$-th layer input signal for the decoder layer comes from $q_l$ channel inputs, i.e. $\tilde\xi^l = \begin{bmatrix} \tilde\xi_1^{l\top} & \cdots & \tilde\xi_{q_l}^{l\top} \end{bmatrix}^\top \in \mathbb{R}^{d_l}$, and the decoder layer convolution is given by

$$\tilde\xi_j^{l-1} = \sigma\left(\sum_{k=1}^{q_l} \big(\tilde\Phi^l \tilde\xi_k^l\big) \circledast \tilde\psi_{j,k}^l\right), \quad j \in [q_{l-1}], \qquad (4)$$

where the unpooling layer is denoted by $\tilde\Phi^l \in \mathbb{R}^{m_{l-1}\times m_l}$. Note that (1) and (4) differ in the order of the pooling or unpooling layers. Specifically, a pooling operation is applied after the convolution at the encoder layer, whereas, at the decoder, an unpooling operation is performed before the convolution to maintain the symmetry of the network. In matrix form, a decoder layer is given by

$$\tilde\xi^{l-1} := \sigma\big(D^l \tilde\xi^l\big) = \begin{bmatrix} \tilde\xi_1^{l-1\top} & \cdots & \tilde\xi_{q_{l-1}}^{l-1\top} \end{bmatrix}^\top,$$

where $D^l \in \mathbb{R}^{d_{l-1}\times d_l}$ is computed by

$$D^l = \begin{bmatrix} \tilde\Phi^l \circledast \tilde\psi_{1,1}^l & \cdots & \tilde\Phi^l \circledast \tilde\psi_{1,q_l}^l \\ \vdots & \ddots & \vdots \\ \tilde\Phi^l \circledast \tilde\psi_{q_{l-1},1}^l & \cdots & \tilde\Phi^l \circledast \tilde\psi_{q_{l-1},q_l}^l \end{bmatrix}. \qquad (5)$$

3.1.2. E-D CNN WITH SKIPPED CONNECTION

As shown in Fig. 1, a skipped connection is often used to bypass an encoder layer output to a decoder layer. The corresponding filtering operation at the $l$-th layer encoder is described by

$$\chi_j^l = \sigma\left(\sum_{k=1}^{q_{l-1}} \xi_k^{l-1} \circledast \overline\psi{}_{j,k}^{\,l}\right), \qquad \xi_j^l = \sigma\left(\Phi^{l\top}\sum_{k=1}^{q_{l-1}} \xi_k^{l-1} \circledast \overline\psi{}_{j,k}^{\,l}\right), \qquad (6)$$

where $\chi_j^l$ and $\xi_j^l$ denote the skipped output and the pooled output via $\Phi^{l\top}$, respectively, after the filtering with $\psi_{j,k}^l$. As shown in Fig. 1, the skipped branch is no longer filtered at the subsequent layer, but is merged at the symmetric decoder layer:

$$\tilde\xi_j^{l-1} = \sigma\left(\sum_{k=1}^{q_l}\big(\tilde\Phi^l\tilde\xi_k^l + \chi_k^l\big)\circledast\tilde\psi_{j,k}^l\right).$$

In matrix form, the encoder layer with the skipped connection can be represented by

$$E^l : \xi^{l-1} \mapsto \begin{bmatrix}\xi^l \\ \chi^l\end{bmatrix}, \qquad \xi^l := \sigma\big(E^{l\top}\xi^{l-1}\big), \quad \chi^l := \sigma\big(S^{l\top}\xi^{l-1}\big), \qquad (7)$$

where $E^l$ is given in (2) and the skipped branch filter matrix $S^l$ is represented by

$$S^l = \begin{bmatrix} I_{m_{l-1}} \circledast \psi_{1,1}^l & \cdots & I_{m_{l-1}} \circledast \psi_{q_l,1}^l \\ \vdots & \ddots & \vdots \\ I_{m_{l-1}} \circledast \psi_{1,q_{l-1}}^l & \cdots & I_{m_{l-1}} \circledast \psi_{q_l,q_{l-1}}^l \end{bmatrix}, \qquad (8)$$

where $I_{m_{l-1}}$ denotes the $m_{l-1}\times m_{l-1}$ identity matrix. This implies that we can regard the skipped branch as the identity pooling $I_{m_{l-1}}$ applied to the filtered signals. Here, we denote the output dimension of the skipped connection as $s_l := m_{l-1} q_l$. Then, the skipped branch at the $l$-th encoder layer is merged at the $l$-th decoder layer, which is defined as

$$D^l : \begin{bmatrix}\tilde\xi^l \\ \chi^l\end{bmatrix} \mapsto \tilde\xi^{l-1}, \qquad \tilde\xi^{l-1} := \sigma\big(D^l\tilde\xi^l + \tilde S^l\chi^l\big), \qquad (9)$$

where $D^l$ is defined in (5), and $\tilde S^l$ is given by

$$\tilde S^l = \begin{bmatrix} I_{m_{l-1}} \circledast \tilde\psi_{1,1}^l & \cdots & I_{m_{l-1}} \circledast \tilde\psi_{1,q_l}^l \\ \vdots & \ddots & \vdots \\ I_{m_{l-1}} \circledast \tilde\psi_{q_{l-1},1}^l & \cdots & I_{m_{l-1}} \circledast \tilde\psi_{q_{l-1},q_l}^l \end{bmatrix}. \qquad (10)$$

3.2. Parameterization of E-D CNNs

At the $l$-th encoder (resp. decoder) layer, there are $q_l q_{l-1}$ filters that generate the $q_l$ (resp. $q_{l-1}$) output channels from the $q_{l-1}$ (resp. $q_l$) input channels. In many CNNs, the filter lengths are set to be equal across the layers. In our case, we set this length to $r$, so the number of filter coefficients for the $l$-th layer is

$$n_l := r\, q_l\, q_{l-1}, \quad l \in [\kappa].$$

These parameters should be estimated during the training phase. Specifically, by denoting the set of all parameter matrices $\mathcal{W} = \mathcal{W}_E \times \mathcal{W}_D$, where $\mathcal{W}_E := \mathbb{R}^{n_\kappa}\times\cdots\times\mathbb{R}^{n_1}$ and $\mathcal{W}_D := \mathbb{R}^{n_1}\times\cdots\times\mathbb{R}^{n_\kappa}$, we compose all layer-wise maps to define an encoder-decoder CNN as

$$z = F(W, x), \quad W \in \mathcal{W}. \qquad (11)$$

Regardless of the existence of skipped connections, note that the same number of unknown parameters is used, because the skipped connection uses the same set of filters.
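Putting the pieces together, the following is a minimal NumPy sketch (not the authors' implementation) of the overall map $F(W,x)$ with skipped connections, written directly in the matrix form of (7) and (9); the per-layer matrices are assumed to be built as in (2), (5), (8) and (10).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def ed_cnn_forward(x, E, S, D, S_til):
    """Forward pass of a kappa-layer E-D CNN with skipped connections:
        xi^l      = relu(E^l.T @ xi^{l-1}),   chi^l = relu(S^l.T @ xi^{l-1}),
        xi~^{l-1} = relu(D^l @ xi~^l + S~^l @ chi^l).
    E, S, D, S_til are lists of per-layer matrices (index 0 <-> layer 1)."""
    kappa = len(E)
    xi, skips = x, []
    for l in range(kappa):                    # encoder
        skips.append(relu(S[l].T @ xi))       # chi^l: identity pooling, bypasses Phi^l
        xi = relu(E[l].T @ xi)                # xi^l: filtered and pooled
    for l in reversed(range(kappa)):          # decoder, applied from layer kappa down to 1
        xi = relu(D[l] @ xi + S_til[l] @ skips[l])   # merge the symmetric skipped branch
    return xi
```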
4. Theoretical Analysis of E-D CNNs

4.1. Differential Topology

First, we briefly revisit the work by Shen (Shen, 2018), which gives a topological insight on E-D CNNs.

Proposition 1 (Extension of Theorem 3 in (Shen, 2018)). Let $f : \mathcal{X} \mapsto \mathcal{Y} \subset \mathbb{R}^q$ be a continuous map of smooth manifolds such that $f = g \circ h$, where $g : \mathbb{R}^p \mapsto \mathbb{R}^q$ with $p \geq q$ is a Lipschitz continuous map. If $p > 2\dim\mathcal{X}$, then for every $\epsilon > 0$ there exists a smooth embedding $\tilde h : \mathcal{X} \mapsto \mathbb{R}^p$ such that the following inequality holds for a chosen norm and all $x \in \mathcal{X}$:
$$\|f(x) - g \circ \tilde h(x)\| \leq \epsilon.$$

Here, $p > 2\dim\mathcal{X}$ comes from the weak Whitney embedding theorem (Whitney, 1936; Tu, 2011).

Note that Proposition 1 informs us that a neural network, designed as a continuous map between smooth manifolds, can be considered as an approximation of a task map that is composed of a smooth embedding followed by an additional map. In fact, this decomposition is quite general for a map between smooth manifolds, as shown in the following proposition:

Proposition 2 ((Shen, 2018)). Let $f : \mathcal{X} \mapsto \mathcal{Y} \subset \mathbb{R}^q$ be a map of smooth manifolds; then the task $f$ admits a decomposition $f = g \circ h$, where $h : \mathcal{X} \mapsto \mathcal{Z} \subset \mathbb{R}^p$ with $p > 2\dim\mathcal{X}$ is a smooth embedding. Furthermore, the task map $f$ is a quotient map if and only if the map $g$ is a quotient map.

To understand the meaning of the last sentence in Proposition 2, we briefly review the concepts of the quotient space and the quotient map (Tu, 2011). Specifically, let $\sim$ be an equivalence relation on $\mathcal{X}$. Then, the quotient space $\mathcal{Y} = \mathcal{X}/\!\sim$ is defined to be the set of equivalence classes of elements of $\mathcal{X}$. For example, we can declare images perturbed by noise to be an equivalence class, so that our quotient map is designed to map the noisy signals to their noiseless equivalent image.

It is remarkable that Proposition 1 and Proposition 2 give interpretable conditions for design parameters such as the network width (i.e., the number of channels), pooling layers, etc. For example, if there are no pooling layers, the dimensionality conditions in Proposition 1 and Proposition 2 can easily be met in practice by increasing the number of channels to more than twice the number of input channels. With pooling layers, one could calculate the required number of channels in a similar way, as illustrated in the sketch below. In general, Proposition 1 and Proposition 2 strongly suggest an encoder-decoder architecture with the constraint $d_0 \leq d_1 \leq \cdots \leq d_\kappa$ with $d_\kappa > 2 d_0$, where an encoder maps an input signal to a higher-dimensional feature space whose dimension is at least twice bigger than that of the input space. Then, the decoder determines the nature of the overall neural network.
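As a concrete illustration of this dimension counting, the following sketch computes $d_l = m_l q_l$ for a hypothetical configuration (2x pooling and tripling channel counts; the specific numbers are assumptions for illustration only) and checks the embedding condition $d_\kappa > 2 d_0$.

```python
# Hypothetical example: input of length m_0 = 256 with q_0 = 1 channel,
# 2x pooling (m_l = m_{l-1} // 2) and channel growth q_l = 3 * q_{l-1}.
m, q = 256, 1
d0 = m * q
for l in range(1, 4):                      # kappa = 3 layers
    m, q = m // 2, 3 * q
    d_l = m * q
    print(f"layer {l}: m_{l} = {m}, q_{l} = {q}, d_{l} = {d_l}")
print("d_kappa > 2*d_0:", d_l > 2 * d0)    # 864 > 512 -> True
```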
4.2. Links to the frame representation

One of the important contributions of the recent theory of deep convolutional framelets (Ye et al., 2018) is that encoder-decoder CNNs have an interesting link to multi-scale convolution framelet expansions. To see this, we first define the filter matrices $\Psi^l \in \mathbb{R}^{rq_{l-1}\times q_l}$ and $\tilde\Psi^l \in \mathbb{R}^{rq_{l-1}\times q_l}$ for the encoder and decoder:

$$\Psi^l = \begin{bmatrix}\psi_{1,1}^l & \cdots & \psi_{q_l,1}^l \\ \vdots & \ddots & \vdots \\ \psi_{1,q_{l-1}}^l & \cdots & \psi_{q_l,q_{l-1}}^l\end{bmatrix}, \qquad \tilde\Psi^l = \begin{bmatrix}\tilde\psi_{1,1}^l & \cdots & \tilde\psi_{1,q_l}^l \\ \vdots & \ddots & \vdots \\ \tilde\psi_{q_{l-1},1}^l & \cdots & \tilde\psi_{q_{l-1},q_l}^l\end{bmatrix}.$$

Then, the following proposition, which is novel and significantly extended from (Ye et al., 2018), states the importance of the frame conditions for the pooling layers and filters to obtain a convolution framelet expansion (Yin et al., 2017).

Proposition 3. Consider an encoder-decoder CNN without ReLU nonlinearities. Let $\Phi^l$ and $\tilde\Phi^l$ denote the $l$-th encoder and decoder pooling layers, respectively, and let $\Psi^l$ and $\tilde\Psi^l$ refer to the encoder and decoder filter matrices. Then, the following statements are true.

1) For the encoder-decoder CNN without skipped connection, if the following frame conditions are satisfied for all $l \in [\kappa]$:
$$\tilde\Phi^l\Phi^{l\top} = \alpha I_{m_{l-1}}, \qquad \Psi^l\tilde\Psi^{l\top} = \frac{1}{r\alpha} I_{rq_{l-1}}, \qquad (12)$$
then we have
$$x = \sum_i \langle b_i, x\rangle\, \tilde b_i, \qquad (13)$$
where $b_i$ and $\tilde b_i$ denote the $i$-th columns of the following frame basis and its dual:
$$B = E^1 E^2 \cdots E^\kappa, \qquad (14)$$
$$\tilde B = D^1 D^2 \cdots D^\kappa. \qquad (15)$$

2) For the encoder-decoder CNN with skipped connection, if the following frame conditions are satisfied for all $l \in [\kappa]$:
$$\tilde\Phi^l\Phi^{l\top} = \alpha I_{m_{l-1}}, \qquad \Psi^l\tilde\Psi^{l\top} = \frac{1}{r(\alpha+1)} I_{rq_{l-1}}, \qquad (16)$$
then (13) holds, where $b_i$ and $\tilde b_i$ denote the $i$-th columns of the following frame and its dual:
$$B_{\mathrm{skp}} \left(\in \mathbb{R}^{d_0\times(d_\kappa+\sum_{l=1}^\kappa s_l)}\right) := \begin{bmatrix} E^1\cdots E^\kappa & E^1\cdots E^{\kappa-1}S^\kappa & \cdots & E^1 S^2 & S^1\end{bmatrix}, \qquad (17)$$
$$\tilde B_{\mathrm{skp}} \left(\in \mathbb{R}^{d_0\times(d_\kappa+\sum_{l=1}^\kappa s_l)}\right) := \begin{bmatrix} D^1\cdots D^\kappa & D^1\cdots D^{\kappa-1}\tilde S^\kappa & \cdots & D^1\tilde S^2 & \tilde S^1\end{bmatrix}. \qquad (18)$$

Furthermore, the following corollary shows that the total basis and its dual indeed come from multiple convolution operations across layers:

Corollary 4. If there exist no pooling layers, then the $t$-th block of the frame basis matrices for $t \in [q_l]$ is given by
$$\big[E^1\cdots E^l\big]_t = \big[E^1\cdots E^{l-1}S^l\big]_t = \sum_{j_{l-1},\cdots,j_1=1}^{q_{l-1},\cdots,q_1} I_m \circledast \left(\psi_{j_1,1}^1 \circledast\cdots\circledast \psi_{t,j_{l-1}}^l\right),$$
$$\big[D^1\cdots D^l\big]_t = \big[D^1\cdots D^{l-1}\tilde S^l\big]_t = \sum_{j_{l-1},\cdots,j_1=1}^{q_{l-1},\cdots,q_1} I_m \circledast \left(\tilde\psi_{j_1,1}^1 \circledast\cdots\circledast \tilde\psi_{t,j_{l-1}}^l\right).$$

This suggests that the length of the convolution filters increases with the depth by cascading multiple convolution operations across the layers. While Proposition 3 informs us that the skipped connection increases the dimension of the feature space from $d_\kappa$ to $d_\kappa + \sum_{l=1}^\kappa s_l$, Corollary 4 suggests that the cascaded expression of the filters becomes more diverse for the case of encoder-decoder CNNs with skipped connection. Specifically, instead of convolving all $\kappa$ layers of filters, the skipped connection allows combinations of subsets of filters. All of these make the frame representation from the skipped connection more expressive.

4.3. Expressiveness

However, to satisfy the frame conditions (12) or (16), we need $q_l \geq r q_{l-1}$, so the number of output filter channels $q_l$ should increase exponentially. While this condition can be relaxed when the underlying signal has a low-rank Hankel matrix structure (Ye et al., 2018), the explicit use of the frame condition is still rarely observed. Moreover, in contrast to the classical wavelet analysis, the perfect reconstruction condition itself is not interesting in neural networks, since the output of the network should be different from the input due to the task-dependent processing.

Here, we claim that one of the important roles of ReLU is that it allows a combinatorial basis selection such that an exponentially large number of basis expansions becomes feasible once the network is trained. This is in contrast with the standard framelet basis estimation. For example, for given target data $Y = \begin{bmatrix} y^{(1)} & \cdots & y^{(T)}\end{bmatrix}$ and input data $X = \begin{bmatrix} x^{(1)} & \cdots & x^{(T)}\end{bmatrix}$, estimating a single frame basis and its dual as in Proposition 3 is optimal for the given training data, but the resulting network is not expressive and does not generalize well when a different type of input data is given. Thus, one of the important requirements is to allow a large number of expressions that are adaptive to different inputs.

Indeed, the ReLU nonlinearity makes the network more expressive. For example, consider a trained two-layer encoder-decoder CNN:
$$y = \tilde B\Lambda(x)B^\top x, \qquad (19)$$
where $B \in \mathbb{R}^{d_0\times d_1}$ and $\tilde B \in \mathbb{R}^{d_0\times d_1}$, and $\Lambda(x)$ is a diagonal matrix with $\{0,1\}$ elements that are determined by the ReLU output. Now, the matrix can be equivalently represented by
$$\tilde B\Lambda(x)B^\top = \sum_{i=1}^{d_1}\sigma_i(x)\,\tilde b_i b_i^\top, \qquad (20)$$
where $\sigma_i(x)$ refers to the $(i,i)$-th diagonal element of $\Lambda(x)$. Therefore, depending on the input data $x \in \mathbb{R}^{d_0}$, $\sigma_i(x)$ is either 0 or 1, so that a maximum of $2^{d_1}$ distinct configurations of the matrix can be represented using (20), which is significantly more expressive than using the single representation with the frame and its dual.
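The following is a minimal sketch of this combinatorial basis selection for the two-layer case (19)-(20); the random matrices stand in for trained weights and are illustrative assumptions only.

```python
import numpy as np

# For y = B~ Lambda(x) B^T x, the ReLU mask Lambda(x) depends on the input,
# so different inputs select different subsets of the rank-1 terms b~_i b_i^T.
rng = np.random.default_rng(0)
d0, d1, n_samples = 4, 8, 2000
B, B_til = rng.standard_normal((d0, d1)), rng.standard_normal((d0, d1))

def mask(x):
    """Diagonal ReLU pattern Lambda(x) of the encoder output B^T x."""
    return (B.T @ x > 0).astype(int)

patterns = {tuple(mask(rng.standard_normal(d0))) for _ in range(n_samples)}
print(f"{len(patterns)} distinct linear representations observed "
      f"(at most 2^{d1} = {2**d1} are possible)")

# each pattern corresponds to its own linear map  B~ diag(sigma) B^T
x = rng.standard_normal(d0)
sigma = mask(x)
y = B_til @ np.diag(sigma) @ (B.T @ x)     # = sum_i sigma_i(x) <b_i, x> b~_i
```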
This observation can be generalized as shown in Theorem 5.

Theorem 5 (Expressiveness of encoder-decoder networks). Let
$$\tilde\Upsilon^l = \tilde\Upsilon^l(x) := \tilde\Upsilon^{l-1}\tilde\Lambda^l(x)D^l, \qquad (21)$$
$$\Upsilon^l = \Upsilon^l(x) := \Upsilon^{l-1}E^l\Lambda^l(x), \qquad (22)$$
with $\Upsilon^0(x) = \tilde\Upsilon^0(x) = I_{d_0}$, and
$$M^l = M^l(x) := S^l\Lambda_S^l(x), \qquad (23)$$
$$\tilde M^l = \tilde M^l(x) := \tilde\Lambda^l(x)\tilde S^l, \qquad (24)$$
where $\Lambda^l(x)$ and $\tilde\Lambda^l(x)$ refer to the diagonal matrices from the ReLU at the $l$-th layer encoder and decoder, respectively, which have 0 or 1 values, and $\Lambda_S^l(x)$ refers to a similarly defined diagonal matrix from the ReLU at the $l$-th skipped branch of the encoder. Then, the following statements are true.

1) Under ReLUs, an encoder-decoder CNN without skipped connection can be represented by
$$y = \tilde B(x)B^\top(x)\,x = \sum_i \langle x, b_i(x)\rangle\,\tilde b_i(x), \qquad (25)$$
where
$$B(x) = \Upsilon^\kappa(x), \qquad \tilde B(x) = \tilde\Upsilon^\kappa(x). \qquad (26)$$
Furthermore, the maximum number of available linear representations is given by
$$N_{\mathrm{rep}} = 2^{\,2\sum_{i=1}^\kappa d_i - d_\kappa}. \qquad (27)$$

2) An encoder-decoder CNN with skipped connection under ReLUs is given by
$$y = \tilde B_{\mathrm{skp}}(x)B_{\mathrm{skp}}^\top(x)\,x = \sum_i \langle x, b_i^{\mathrm{skp}}(x)\rangle\,\tilde b_i^{\mathrm{skp}}(x), \qquad (28)$$
where
$$B_{\mathrm{skp}}(x) := \begin{bmatrix}\Upsilon^\kappa & \Upsilon^{\kappa-1}M^\kappa & \Upsilon^{\kappa-2}M^{\kappa-1} & \cdots & M^1\end{bmatrix}, \qquad (29)$$
$$\tilde B_{\mathrm{skp}}(x) := \begin{bmatrix}\tilde\Upsilon^\kappa & \tilde\Upsilon^{\kappa-1}\tilde M^\kappa & \tilde\Upsilon^{\kappa-2}\tilde M^{\kappa-1} & \cdots & \tilde M^1\end{bmatrix}. \qquad (30)$$
Furthermore, the maximum number of available linear representations is given by
$$N_{\mathrm{rep}} = 2^{\,2\sum_{i=1}^\kappa d_i - d_\kappa}\, 2^{\sum_{l=1}^\kappa s_l}. \qquad (31)$$

This implies that the number of representations increases exponentially with the network depth, which again confirms the expressive power of the neural network. Moreover, the skipped connection also significantly increases the expressive power of the encoder-decoder CNN. Another important consequence of Theorem 5 is that the input space $\mathcal{X}$ is partitioned into at most $N_{\mathrm{rep}}$ non-overlapping regions so that the inputs in each region share the same linear representation.

Due to the ReLU, one may wonder whether the cascaded convolutional interpretation of the frame basis in Corollary 4 still holds. A close look at the proof of Corollary 4 reveals that this is still the case. Under ReLUs, note that
$$\big(I_m\circledast\psi_{j,s}^l\big)\big(I_m\circledast\psi_{t,j}^{l+1}\big) = I_m\circledast\big(\psi_{j,s}^l\circledast\psi_{t,j}^{l+1}\big)$$
in Lemma 11 in the Supplementary Material should be replaced with
$$\big(I_m\circledast\psi_{j,s}^l\big)\,\Lambda_j^l(x)\,\big(I_m\circledast\psi_{t,j}^{l+1}\big),$$
where $\Lambda_j^l(x)$ is a diagonal matrix with 0 and 1 values due to the ReLU. This means that $\Lambda_j^l(x)$ provides a spatially varying mask to the convolution filter $\psi_{t,j}^{l+1}$, so that the net effect is a convolution with spatially varying filters originating from masked versions of $\psi_{t,j}^{l+1}$. This results in a spatially variant cascaded convolution, and the only change in the interpretation of Corollary 4 is that the basis and its dual are composed of spatially variant cascaded convolution filters. Furthermore, the ReLU works to diversify the convolution filters by masking out various filter coefficients, as illustrated in the sketch below. It is believed that this is another source of expressiveness from the same set of convolutional filters.
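The following is a small numerical check of this remark under illustrative random filters: without the mask, two circular convolution layers collapse into a single cascaded filter, whereas inserting a $\{0,1\}$ diagonal mask between them yields a spatially varying operation that no longer equals the single cascaded convolution.

```python
import numpy as np

def circulant(h, m):
    """m x m matrix C with C @ x = circular convolution of x with filter h."""
    col = np.zeros(m)
    col[:len(h)] = h
    return np.array([np.roll(col, i) for i in range(m)]).T

m = 8
rng = np.random.default_rng(1)
psi1, psi2 = rng.standard_normal(3), rng.standard_normal(3)   # two r=3 tap filters

C1, C2 = circulant(psi1, m), circulant(psi2, m)
cascade = circulant(np.convolve(psi1, psi2), m)   # I_m (*) (psi1 (*) psi2)
print(np.allclose(C2 @ C1, cascade))              # True: linear layers cascade into one filter

Lam = np.diag(np.array([1., 0., 1., 1., 0., 1., 0., 1.]))   # a ReLU mask Lambda_j^l(x)
print(np.allclose(C2 @ Lam @ C1, cascade))        # False here: the mask makes the effective filter spatially varying
```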
4.4. Generalizability

To understand the generalization capability of DNNs, recent research efforts have focused on reducing the gap by suggesting different ways of measuring the network capacity (Bartlett et al., 2017; Neyshabur et al., 2018). These works consistently showed the importance of the Lipschitz condition for the encoder and decoder parts of the networks. More specifically, we have shown that the neural network representation varies in exponentially many different forms depending on the inputs, so one may be concerned that the output might vary drastically with a small perturbation of the inputs. However, Lipschitz continuity of the neural network prevents such drastic changes. Specifically, a neural network $F(W,x)$ is Lipschitz continuous if there exists a constant $K > 0$ such that
$$\|F(W,x^{(1)}) - F(W,x^{(2)})\|_2 \leq K\,\|x^{(1)} - x^{(2)}\|_2,$$
where the Lipschitz constant $K$ can be obtained by
$$K = \sup_{x\in\mathcal{X}}\|D_2F(W,x)\|_2, \qquad (32)$$
where $D_2F(W,x)$ is the Jacobian with respect to the second variable. The following proposition shows that the Lipschitz constant of encoder-decoder CNNs is closely related to the frame basis and its dual.

Proposition 6. The Lipschitz constant for the encoder-decoder CNN without skipped connection is given by
$$K = \sup_{x\in\mathcal{X}}\big\|\tilde B(x)B^\top(x)\big\|_2, \qquad (33)$$
whereas the Lipschitz constant for the encoder-decoder CNN with skipped connection is given by
$$K = \sup_{x\in\mathcal{X}}\big\|\tilde B_{\mathrm{skp}}(x)B_{\mathrm{skp}}^\top(x)\big\|_2, \qquad (34)$$
where $B(x)$, $\tilde B(x)$, $B_{\mathrm{skp}}(x)$ and $\tilde B_{\mathrm{skp}}(x)$ are defined in (26), (29) and (30).

Recall that the input space $\mathcal{X}$ is partitioned into regions that share the same linear representation. Therefore, the local Lipschitz constant within the $p$-th partition is given by
$$K_p = \sup_{z\in\mathcal{X}_p}\big\|\tilde B(z)B^\top(z)\big\|_2 = \big\|\tilde B(z_p)B^\top(z_p)\big\|_2, \quad z_p\in\mathcal{X}_p, \qquad (35)$$
for the case of an E-D CNN without skipped connections. Here, $\mathcal{X}_p$ denotes the $p$-th input space partition, and the last equality in (35) comes from the fact that every point in $\mathcal{X}_p$ shares the same linear representation. Thus, it is easy to see that the global Lipschitz constant is given by
$$K = \sup_{x\in\mathcal{X}}\big\|\tilde B(x)B^\top(x)\big\|_2 = \sup_p K_p. \qquad (36)$$
Furthermore, Theorem 5 informs us that the number of partitions is bounded by $N_{\mathrm{rep}}$. Therefore, (36) suggests that by bounding the local Lipschitz constant within each linear region, one could control the global Lipschitz constant of the neural network. A similar observation holds for E-D CNNs with skipped connection.

One of the most important implications of (36) is that the expressiveness of the network is not affected by the control of the Lipschitz constant. This in turn is due to the combinatorial nature of the ReLU nonlinearities, which allows for an exponentially large number of linear representations.
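To make (33)-(36) concrete, the following sketch numerically evaluates the local linear representation $\tilde B(x)B^\top(x)$ of a toy E-D CNN without skips by recording its ReLU masks, and estimates the global constant $K$ by taking the maximum spectral norm over sampled inputs (a lower bound of the supremum; the random weights are illustrative assumptions, not trained filters).

```python
import numpy as np

def local_lipschitz(x, E, D):
    """Spectral norm of the local linear map B~(x) B(x)^T of an E-D CNN
    without skips (Proposition 6), obtained by recording the ReLU masks at x."""
    kappa = len(E)
    masks_enc, masks_dec = [], []
    xi = x
    for l in range(kappa):                   # encoder: xi^l = relu(E^l.T xi^{l-1})
        pre = E[l].T @ xi
        masks_enc.append((pre > 0).astype(float))
        xi = np.maximum(pre, 0)
    for l in reversed(range(kappa)):         # decoder: xi~^{l-1} = relu(D^l xi~^l)
        pre = D[l] @ xi
        masks_dec.append((pre > 0).astype(float))
        xi = np.maximum(pre, 0)
    J = np.eye(len(x))
    for l in range(kappa):                   # build B(x)^T = Lam^k E^k.T ... Lam^1 E^1.T
        J = np.diag(masks_enc[l]) @ E[l].T @ J
    for i, l in enumerate(reversed(range(kappa))):   # then B~(x) = Lam~^1 D^1 ... Lam~^k D^k
        J = np.diag(masks_dec[i]) @ D[l] @ J
    return np.linalg.norm(J, 2)

# empirical estimate of K = sup_x ||B~(x)B(x)^T||_2 over sampled inputs
rng = np.random.default_rng(0)
E = [rng.standard_normal((8, 16))]
D = [rng.standard_normal((8, 16))]
K_hat = max(local_lipschitz(rng.standard_normal(8), E, D) for _ in range(100))
print(K_hat)
```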
4.5. Optimization landscape

For a given ground-truth task map $f : \mathcal{X}\mapsto\mathcal{Y}$ and a given training data set $\{(x^{(i)}, y^{(i)})\}_{i=1}^T$ such that $y^{(i)} = f(x^{(i)})$, an encoder-decoder CNN training problem can be formulated as finding a network parameter $W$ that minimizes a specific loss function. Then, for the case of the $\ell_2$ loss,
$$C(W) = \sum_{i=1}^T \big\|F(W,x^{(i)}) - y^{(i)}\big\|^2, \qquad (37)$$
Nguyen and Hein (Nguyen & Hein, 2018) showed that over-parameterized CNNs can produce zero training errors. Their results are based on the following key lemma.

Lemma 7 ((Nguyen & Hein, 2018)). Consider an encoder-decoder CNN without skipped connection. Then, the Jacobian of the cost function in (37) with respect to $E^\kappa$ is bounded as
$$\sigma_{\min}(\Xi^\kappa)\,\min_{i\in[T]}\sigma_{\min}\!\left(\Lambda^\kappa(x^{(i)})\,\tilde\Upsilon^\kappa(x^{(i)})^\top\right)\sqrt{C(W)} \;\leq\; \big\|\nabla_{E^\kappa}C\big\|_F \;\leq\; \sigma_{\max}(\Xi^\kappa)\,\max_{i\in[T]}\sigma_{\max}\!\left(\Lambda^\kappa(x^{(i)})\,\tilde\Upsilon^\kappa(x^{(i)})^\top\right)\sqrt{C(W)},$$
where $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ denote the minimum and maximum singular values of a matrix $A\in\mathbb{R}^{n\times m}$ with $n\geq m$, respectively; $\tilde\Upsilon^\kappa$ is defined in (21); $\Xi^\kappa$ denotes the feature matrix for the training data,
$$\Xi^\kappa = \begin{bmatrix}\xi^\kappa(1) & \cdots & \xi^\kappa(T)\end{bmatrix}\in\mathbb{R}^{d_\kappa\times T};$$
and $C(W)$ is the cost in (37).

The authors in (Nguyen & Hein, 2018) further showed that if every shifted $r$-segment of the training samples is not identical to each other and $d_\kappa \geq T$, then $\Xi^\kappa$ has full column rank. Additionally, if the nonlinearity at the decoder layer is analytic, then they showed that $\tilde\Upsilon^\kappa(x)\Lambda^\kappa(x)$ almost always has full row rank. This implies that both $\sigma_{\min}(\Xi^\kappa)$ and $\sigma_{\min}\big(\Lambda^\kappa(\tilde\Upsilon^\kappa)^\top\big)$ are non-zero, so that $\nabla_{E^\kappa}C|_W = 0$ if and only if $y^{(i)} = F(W,x^{(i)})$ for all $i\in[T]$ (that is, the loss becomes zero, i.e. $C(W) = 0$). Unfortunately, this almost-always guarantee cannot be used for the ReLU nonlinearities at the decoder layers, since the ReLU nonlinearity is not analytic.

In this paper, we extend the result of (Nguyen & Hein, 2018) to the encoder-decoder CNN with skipped connection when ReLU nonlinearities are used. In addition to Lemma 7, the following lemma, which is original, holds for this case.

Lemma 8. Consider an encoder-decoder CNN with skipped connection. Then, the Jacobian of the cost function in (37) with respect to $\tilde S^l$ for $l\in[\kappa]$ is bounded as
$$\sigma_{\min}(\Gamma^l)\,\min_{i\in[T]}\sigma_{\min}\!\left(\tilde\Lambda^l(x^{(i)})\,\tilde\Upsilon^{l-1}(x^{(i)})^\top\right)\sqrt{C(W)} \;\leq\; \big\|\nabla_{\tilde S^l}C\big\|_F \;\leq\; \sigma_{\max}(\Gamma^l)\,\max_{i\in[T]}\sigma_{\max}\!\left(\tilde\Lambda^l(x^{(i)})\,\tilde\Upsilon^{l-1}(x^{(i)})^\top\right)\sqrt{C(W)},$$
where $\Gamma^l$ denotes the feature matrix from the skipped branch,
$$\Gamma^l = \begin{bmatrix}\chi^l(1) & \cdots & \chi^l(T)\end{bmatrix}\in\mathbb{R}^{s_l\times T},$$
and $C(W)$ is the cost in (37).

Lemma 8 leads to the following key result on the optimization landscape for the encoder-decoder network with skipped connections.

Theorem 9. Suppose that there exists a layer $l\in[\kappa]$ such that (i) the skipped features $\chi^l(1),\cdots,\chi^l(T)$ are linearly independent, and (ii) $\tilde\Upsilon^{l-1}(x)\tilde\Lambda^l(x)$ has full row rank for all training data $x\in\{x^{(1)},\cdots,x^{(T)}\}$. Then, $\nabla_{\tilde S^l}C|_W = 0$ if and only if $y^{(i)} = F(W,x^{(i)})$ for all $i\in[T]$ (that is, the loss becomes zero, i.e. $C(W) = 0$).

Proof. Under the assumptions, both $\sigma_{\min}(\Gamma^l)$ and $\sigma_{\min}\big(\tilde\Lambda^l(\tilde\Upsilon^{l-1})^\top\big)$ are non-zero. Therefore, Lemma 8 leads to the conclusion.

Note that the proof of the full column rank condition for $\Xi^\kappa$ in (Nguyen & Hein, 2018) is based on a constructive proof using the independence of the intermediate features $\chi^l(1),\cdots,\chi^l(T)$ for all $l\in[\kappa]$. Furthermore, for the case of ReLU nonlinearities, even when $\tilde\Upsilon^\kappa(x)\Lambda^\kappa(x)$ does not have full row rank, there are chances that $\tilde\Upsilon^{l-1}(x)\tilde\Lambda^l(x)$ has full row rank for at least one $l\in[\kappa]$. Therefore, our result has more relaxed assumptions than the optimization landscape results in (Nguyen & Hein, 2018) that rely on Lemma 7. This again confirms the advantages of the skipped connection in encoder-decoder networks.

5. Discussion and Conclusion

In this paper, we investigated the geometry of encoder-decoder CNNs from various theoretical aspects such as the differential topological view, expressiveness, generalization capability, and optimization landscape. The analysis was feasible thanks to the explicit construction of encoder-decoder CNNs using the deep convolutional framelet expansion. Our analysis showed that the advantages of encoder-decoder CNNs come from the expressiveness of the encoder and decoder layers, which originates from the combinatorial nature of ReLU for the selection of the decomposition and reconstruction frame bases. Moreover, the expressiveness of the network is not affected by controlling the Lipschitz constant to improve the generalization capability of the network. In addition, we showed that the optimization landscape can be enhanced by the skipped connection.

This analysis coincides with our empirical verification using deep neural networks for various inverse problems. For example, in a recent work on k-space deep learning (Han & Ye, 2018), we showed that a neural network for compressed sensing MRI can be more effectively designed in the k-space domain, since the frame representation is more concise in the Fourier domain.
A similar observation was made in sub-sampled ultrasound (US) imaging (Yoon et al., 2018), where we showed that the frame representation in the raw data domain is more effective in US, so that the deep network is designed in the raw data domain rather than the image domain. These empirical examples clearly show that the unified view of signal processing and machine learning suggested in this paper can help to improve the design and understanding of deep models.

Acknowledgements

The authors thank the reviewers who gave useful comments. This work was supported by the National Research Foundation (NRF) of Korea grant NRF-2016R1A2B3008104 and the KI Science Technology Leading Primary Research.

References

Anthony, M. and Bartlett, P. L. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.

Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240-6249, 2017.

Brutzkus, A., Globerson, A., Malach, E., and Shalev-Shwartz, S. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. Journal of Machine Learning Research, 38:192-204, 2015.

Cohen, G., Giryes, R., and Sapiro, G. DNN or k-NN: That is the generalize vs. memorize question. arXiv preprint arXiv:1805.06822, 2018.

Du, S. S. and Lee, J. D. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.

Du, S. S., Lee, J. D., Tian, Y., Póczos, B., and Singh, A. Gradient descent learns one-hidden-layer CNN: don't be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Ge, R. and Ma, T. On the optimization landscape of tensor decompositions. In Advances in Neural Information Processing Systems, pp. 3656-3666, 2017.

Hammernik, K., Klatzer, T., Kobler, E., Recht, M. P., Sodickson, D. K., Pock, T., and Knoll, F. Learning a variational network for reconstruction of accelerated MRI data. Magnetic Resonance in Medicine, 79(6):3055-3071, 2018.

Han, Y. and Ye, J. k-space deep learning for accelerated MRI. arXiv preprint arXiv:1805.03779, 2018.

Hanin, B. and Sellke, M. Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.

Jin, K. H., McCann, M. T., Froustey, E., and Unser, M. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509-4522, 2017.

Kang, E., Min, J., and Ye, J. C. A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction. Medical Physics, 44(10):e360-e375, 2017.

Kawaguchi, K. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pp. 586-594, 2016.

Kim, J., Lee, J. K., and Lee, K. M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646-1654, 2016.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.

Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.

Nguyen, Q. and Hein, M. Optimization landscape and expressivity of deep CNNs. In International Conference on Machine Learning, pp. 3727-3736, 2018.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In International Conference on Machine Learning, pp. 2847-2854, 2017.

Rolnick, D. and Tegmark, M. The power of deeper networks for expressing natural functions. arXiv preprint arXiv:1705.05502, 2017.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241. Springer, 2015.

Schmidhuber, J. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857-873, 1997.

Shen, H. A differential topological view of challenges in learning with feedforward neural networks. arXiv preprint arXiv:1811.10304, 2018.

Telgarsky, M. Benefits of depth in neural networks. In Conference on Learning Theory, pp. 1517-1539, June 2016.

Tu, L. W. An Introduction to Manifolds. Springer, 2011.

Wei, C., Lee, J. D., Liu, Q., and Ma, T. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369, 2018.

Whitney, H. Differentiable manifolds. Annals of Mathematics, pp. 645-680, 1936.

Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.

Ye, J. C., Han, Y., and Cha, E. Deep convolutional framelets: A general deep learning framework for inverse problems. SIAM Journal on Imaging Sciences, 11(2):991-1048, 2018.

Yin, R., Gao, T., Lu, Y. M., and Daubechies, I. A tale of two bases: Local-nonlocal regularization on image patches with convolution framelets. SIAM Journal on Imaging Sciences, 10(2):711-750, 2017.

Yoon, Y. H., Khan, S., Huh, J., Ye, J. C., et al. Efficient B-mode ultrasound image reconstruction from sub-sampled RF data using deep learning. IEEE Transactions on Medical Imaging, 2018.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhang, K., Zuo, W., Chen, Y., Meng, D., and Zhang, L. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 2017.