# Scaling Properties of Deep Residual Networks

Alain-Sam Cohen^1, Rama Cont^2, Alain Rossier^{2,1}, Renyuan Xu^2

^1 InstaDeep. ^2 Mathematical Institute, University of Oxford. Correspondence to: Alain Rossier. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Residual networks (ResNets) have displayed impressive results in pattern recognition and, recently, have garnered considerable theoretical interest due to a perceived link with neural ordinary differential equations (neural ODEs). This link relies on the convergence of network weights to a smooth function as the number of layers increases. We investigate the properties of weights trained by stochastic gradient descent and their scaling with network depth through detailed numerical experiments. We observe the existence of scaling regimes markedly different from those assumed in the neural ODE literature. Depending on certain features of the network architecture, such as the smoothness of the activation function, one may obtain an alternative ODE limit, a stochastic differential equation, or neither of these. These findings cast doubt on the validity of the neural ODE model as an adequate asymptotic description of deep ResNets and point to an alternative class of differential equations as a better description of the deep network limit.

1. Introduction

Residual networks, or ResNets, are multilayer neural network architectures in which a skip connection is introduced at every layer (He et al., 2016). This allows deep networks to be trained by circumventing vanishing and exploding gradients (Bengio et al., 1994). The increased depth in ResNets has led to commensurate performance gains in applications ranging from speech recognition (Heymann et al., 2016; Zagoruyko & Komodakis, 2016) to computer vision (He et al., 2016; Huang et al., 2016). A residual network with $L$ layers may be represented as

$$h^{(L)}_{k+1} = h^{(L)}_k + \delta^{(L)}_k\, \sigma_d\!\big(A^{(L)}_k h^{(L)}_k + b^{(L)}_k\big), \qquad (1)$$

where $h^{(L)}_k$ is the hidden state at layer $k = 0, \ldots, L$, $h^{(L)}_0 = x \in \mathbb{R}^d$ the input, $h^{(L)}_L \in \mathbb{R}^d$ the output, $\sigma\colon \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function, $\sigma_d(x) = (\sigma(x_1), \ldots, \sigma(x_d))$ its component-wise extension to $x \in \mathbb{R}^d$, and $A^{(L)}_k$, $b^{(L)}_k$, and $\delta^{(L)}_k$ are trainable network weights for $k = 0, \ldots, L-1$.

1.1. Connection to previous work

ResNets have been the focus of several theoretical studies due to a perceived link with a class of differential equations. The idea, put forth in (Haber & Ruthotto, 2018; Chen et al., 2018), is to view (1) as a discretization of a system of ordinary differential equations

$$\frac{dH_t}{dt} = \sigma_d\big(A_t H_t + b_t\big), \qquad (2)$$

where $A\colon [0,1] \to \mathbb{R}^{d\times d}$ and $b\colon [0,1] \to \mathbb{R}^{d}$ are appropriate smooth functions and $H_0 = x$. This may be justified (Thorpe & van Gennip, 2018) by assuming that

$$\delta^{(L)} \simeq \frac{1}{L}, \qquad A^{(L)}_k \to A_t, \qquad b^{(L)}_k \to b_t \qquad (3)$$

as $L$ increases and $k/L \to t$. Such models, named neural ordinary differential equations or neural ODEs (Chen et al., 2018; Dupont et al., 2019), have motivated the use of optimal control methods to train ResNets (E et al., 2019a). However, the precise link between deep ResNets and the neural ODE model (2) is unclear: in practice, the weights $A^{(L)}$ and $b^{(L)}$ result from training, and the validity of the scaling assumptions (3) for trained weights is far from obvious and has not been verified. As a matter of fact, there is empirical evidence showing that using a scaling factor $\delta^{(L)} \simeq 1/L$ can deteriorate the network accuracy (Bachlechner et al., 2020).
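To make the update rule (1) and its relation to (2) concrete, here is a minimal NumPy sketch (not the authors' code; the activation, the smooth weight functions `A_fun` and `b_fun`, and the dimensions below are illustrative assumptions). With $\delta^{(L)}_k = 1/L$ and weights sampled from smooth functions of $k/L$, one pass through (1) is an explicit Euler scheme for the neural ODE (2):

```python
import numpy as np

def resnet_forward(x, A, b, delta, sigma=np.tanh):
    """Hidden states h_0, ..., h_L of the residual network (1).

    x:     input of shape (d,)
    A:     per-layer matrices A_k^(L), shape (L, d, d)
    b:     per-layer biases   b_k^(L), shape (L, d)
    delta: per-layer factors  delta_k^(L), shape (L,)
    """
    h = [np.asarray(x, dtype=float)]
    for k in range(len(delta)):
        h.append(h[k] + delta[k] * sigma(A[k] @ h[k] + b[k]))
    return np.stack(h)

# Illustrative weights satisfying the neural ODE scaling (3): delta = 1/L and
# A_k = A(k/L), b_k = b(k/L) for hypothetical smooth functions A and b.
d, L = 4, 1000
A_fun = lambda t: np.sin(2 * np.pi * t) * np.eye(d)
b_fun = lambda t: np.cos(2 * np.pi * t) * np.ones(d)
A = np.stack([A_fun(k / L) for k in range(L)])
b = np.stack([b_fun(k / L) for k in range(L)])
h = resnet_forward(np.ones(d), A, b, np.full(L, 1.0 / L))
print(h[-1])  # approximates H_1 of the ODE dH_t/dt = sigma_d(A_t H_t + b_t)
```

Under the scaling assumptions (3), increasing $L$ in this sketch leaves the output essentially unchanged; the experiments below examine whether trained weights actually behave this way.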
Moreover, there is no guarantee that weights obtained through training have a non-zero limit which depends smoothly on the layer, as (3) would require. In fact, we present numerical experiments which point to the contrary for many ResNet architectures used in practice. These observations motivate an in-depth examination of the actual scaling behavior of weights with network depth in ResNets, and of its impact on the asymptotic behavior of those networks.

Figure 1: Trained weights as a function of $k$ for $k = 0, \ldots, L$ and $L = 9100$. Left: rescaled weights $L^{\beta} A^{(L)}_{k,(0,0)}$ for a tanh network with $\beta = 0.2$. Center: weights $A^{(L)}_{k,(0,0)}$ for a ReLU network. Right: cumulative sum $\sum_{j=0}^{k-1} A^{(L)}_{j,(0,0)}$ for a ReLU network.

1.2. Our contributions

We systematically investigate the scaling behavior of trained ResNet weights as the number of layers increases and examine the consequences of this behavior for the asymptotic properties of deep ResNets. Our code is publicly available at https://github.com/instadeepai/scaling-resnets. Our main contributions are twofold. Using the methodology described in Section 2, we design detailed numerical experiments to study the scaling of trained network weights across a range of ResNet architectures and datasets, showing the existence of at least three different scaling regimes, none of which correspond to (3). In Section 4, we show that in two of these scaling regimes, the properties of deep ResNets may be described in terms of a class of ordinary or stochastic differential equations, albeit different from the neural ODEs studied in (Chen et al., 2018; Haber & Ruthotto, 2018; Lu et al., 2020). These novel findings on the relation between ResNets and differential equations complement previous work (Thorpe & van Gennip, 2018; E et al., 2019b; Frei et al., 2019; Ott et al., 2021). In particular, our findings question the validity of the neural ODE (2) as a description of deep ResNets with trained weights.

1.3. Notations

$\|y\|$ denotes the Euclidean norm of a vector $y$. For a matrix $x$, $x^\top$ denotes its transpose, $\mathrm{Diag}(x)$ its diagonal vector, $\mathrm{Tr}(x)$ its trace and $\|x\|_F = \sqrt{\mathrm{Tr}(x^\top x)}$ its Frobenius norm. $\lfloor u \rfloor$ denotes the integer part of a positive number $u$. $N(m, v)$ denotes the Gaussian distribution with mean $m$ and (co)variance $v$, $\otimes$ denotes the tensor product, and $\mathbb{R}^{d,\otimes n} = \mathbb{R}^d \otimes \cdots \otimes \mathbb{R}^d$ ($n$ times). $\mathrm{vec}\colon \mathbb{R}^{d_1 \times \cdots \times d_n} \to \mathbb{R}^{d_1 \cdots d_n}$ denotes the vectorisation operator, and $\mathbf{1}_S$ the indicator function of a set $S$. $C^0$ is the space of continuous functions and, for $\nu \geq 0$, $C^\nu$ is the space of $\nu$-Hölder continuous functions.

2. Methodology

We identify the possible scaling regimes for the network weights, introduce the quantities needed to characterize the deep network limit, and describe the step-by-step procedure we use to analyze our numerical experiments.

2.1. Scaling hypotheses

As described in Section 1, the neural ODE limit assumes that, as $L \to \infty$,

$$A^{(L)}_{\lfloor Lt \rfloor} \longrightarrow A_t, \qquad b^{(L)}_{\lfloor Lt \rfloor} \longrightarrow b_t \qquad (4)$$

for $t \in [0,1]$, where $A\colon [0,1] \to \mathbb{R}^{d\times d}$ and $b\colon [0,1] \to \mathbb{R}^{d}$ are smooth functions (Thorpe & van Gennip, 2018). Our numerical experiments, detailed in Section 3, show that the weights generally shrink as $L$ increases (see for example Figures 2 and 4), so one cannot expect the above assumption to hold, and weights need to be renormalized in order to converge to a non-zero limit. We consider here the following more general situation, which includes (4) but allows for shrinking weights:

Hypothesis 1. There exist $A \in C^0([0,1], \mathbb{R}^{d\times d})$ and $\beta \in [0,1]$ such that

$$\forall s \in [0,1], \qquad A_s = \lim_{L\to\infty} L^{\beta} A^{(L)}_{\lfloor Ls \rfloor}. \qquad (5)$$
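As a toy illustration of Hypothesis 1 (a minimal sketch, not from the paper's codebase; the limit function, the exponent $\beta$ and the depths below are hypothetical choices), one can construct weights of the form $A^{(L)}_k = L^{-\beta} A_{k/L}$ and check numerically that the rescaled weights satisfy (5):

```python
import numpy as np

d, beta = 3, 0.2
A_limit = lambda t: np.cos(np.pi * t) * np.eye(d)   # hypothetical A in C^0([0,1], R^{d x d})

for L in (100, 1000, 10000):
    # weights built to satisfy Hypothesis 1 for this A and beta
    A_L = np.stack([L ** (-beta) * A_limit(k / L) for k in range(L)])
    s_grid = np.linspace(0.0, 1.0, 200, endpoint=False)
    gap = max(np.linalg.norm(L ** beta * A_L[int(L * s)] - A_limit(s)) for s in s_grid)
    print(L, gap)  # sup-norm gap between L^beta A^(L)_{floor(Ls)} and A_s shrinks with L
```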
Properly renormalized weights may indeed converge to a continuous function of the layer in some cases, as shown in Figure 1 (left), which displays an example of the layer dependence of trained weights for a ResNet (1) with fully-connected layers and tanh activation function, trained without explicit regularization (see Section 3.1).

However, it is not always the case that network weights converge to a smooth function of the layer, even after rescaling. For example, network weights $A^{(L)}_k$ are usually initialized to random, independent and identically distributed (i.i.d.) values, whose scaling limit would then correspond to a white noise, which cannot be represented as a function of the layer. Such scaling behavior also occurs for trained weights, as shown in Figure 1 (center). In this case, the cumulative sum $\sum_{j=0}^{k-1} A^{(L)}_j$ of the weights behaves like a random walk, which does have a well-defined scaling limit $W \in C^0([0,1], \mathbb{R}^{d\times d})$. Figure 1 (right) shows that, for a ReLU ResNet with fully-connected layers, this cumulative sum of trained weights converges to an irregular, that is, non-smooth function of the layer. This observation motivates the consideration of an alternative hypothesis where the weights $A^{(L)}_k$ are represented as the increments of a continuous function $W^A$. Combining such terms with the ones considered in Hypothesis 1, we consider the following, more general, setting:

Hypothesis 2. There exist $\beta \in [0,1)$, $A \in C^0([0,1], \mathbb{R}^{d\times d})$, and $W^A \in C^0([0,1], \mathbb{R}^{d\times d})$ non-zero such that $W^A_0 = 0$ and

$$A^{(L)}_k = L^{-\beta} A_{k/L} + W^A_{(k+1)/L} - W^A_{k/L}. \qquad (6)$$

The above decomposition is unique. Indeed, for $s \in [0,1]$,

$$L^{\beta-1} \sum_{k=0}^{\lfloor Ls \rfloor - 1} A^{(L)}_k \;=\; L^{-1} \sum_{k=0}^{\lfloor Ls \rfloor - 1} A_{k/L} + L^{\beta-1}\, W^A_{\lfloor Ls \rfloor / L} \;\longrightarrow\; \int_0^s A_r\, dr \quad \text{as } L \to \infty. \qquad (7)$$

The integral of $A$ is thus uniquely determined by the weights $A^{(L)}_k$ when $L$ is large, so $A$ can be obtained by discretization and $W^A$ by fitting the residual error in (7). In addition, Hypotheses 1 and 2 are mutually exclusive, since Hypothesis 2 requires $W^A$ to be non-zero.

Remark 2.1 (i.i.d. initialization of weights). In the special case of independent Gaussian weights $A^{(L)}_{k,mn} \sim N(0, L^{-1} d^{-2})$ i.i.d. and $b^{(L)}_{k,n} \sim N(0, L^{-1} d^{-1})$ i.i.d., where $A^{(L)}_{k,mn}$ is the $(m,n)$-th entry of $A^{(L)}_k \in \mathbb{R}^{d\times d}$ and $b^{(L)}_{k,n}$ is the $n$-th entry of $b^{(L)}_k \in \mathbb{R}^d$, we can represent the weights $\{A^{(L)}, b^{(L)}\}$ as the increments of a matrix Brownian motion, $A^{(L)}_k = d^{-1}\big(W^A_{(k+1)/L} - W^A_{k/L}\big)$, which is a special case of Hypothesis 2.

2.2. Smoothness of weights with respect to the layer

A question related to the existence of a scaling limit is the degree of smoothness of the limits $A$ or $W^A$, if they exist. To quantify the smoothness of the function mapping the layer number to the corresponding network weight, we define in Table 1 several quantities which may be viewed as discrete versions of various (semi-)norms used to measure the smoothness of functions.

Table 1: Quantities associated to a tensor $A^{(L)} \in \mathbb{R}^{L \times d \times d}$.

Maximum norm: $\max_k \|A^{(L)}_k\|_F$
$\beta$-scaled norm of increments: $L^{\beta} \max_k \|A^{(L)}_{k+1} - A^{(L)}_k\|_F$
Cumulative sum norm: $\max_k \big\| \sum_{j=0}^{k-1} A^{(L)}_j \big\|_F$
Root sum of squares: $\big( \sum_{k=0}^{L-1} \|A^{(L)}_k\|_F^2 \big)^{1/2}$

The first two norms relate to Hypothesis 1: if $A^{(L)}$ satisfies (5), then the maximum norm scales like $L^{-\beta}$, and the $\beta$-scaled norm of increments scales like $L^{-\nu}$ if the limit function $A$ is $\nu$-Hölder continuous. The last two norms relate to Hypothesis 2: if $A^{(L)}$ satisfies (6), then the cumulative sum norm scales like $L^{1-\beta}$. Furthermore, the root sum of squares gives us the regularity of $W^A$. Indeed, define the quadratic variation tensor of $W^A$ by

$$[W^A]_1 = \lim_{L\to\infty} \sum_{k=0}^{L-1} \big(W^A_{(k+1)/L} - W^A_{k/L}\big) \otimes \big(W^A_{(k+1)/L} - W^A_{k/L}\big) \in \mathbb{R}^{d,\otimes 4}.$$

Then, using (6) and the Cauchy–Schwarz inequality, we obtain an estimate of the form

$$\big|\big|\big| [W^A]_1 \big|\big|\big| \;\lesssim\; \sum_{k=0}^{L-1} \big\| A^{(L)}_k \big\|_F^2 \;+\; L^{1-2\beta}\, \|A\|_{L^2}^2, \qquad (8)$$

where $|||\cdot|||$ is the Hilbert–Schmidt norm. As $A$ is continuous on a compact domain, its $L^2$ norm is finite. Hence, if $\beta \geq 1/2$, the fact that the root sum of squares of $A^{(L)}$ is upper bounded as $L \to \infty$ implies that the quadratic variation of $W^A$ is finite.
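A minimal sketch (not the authors' implementation) of these diagnostics, evaluated on synthetic weights that follow Hypothesis 2 with a hypothetical trend $A_t = \cos(\pi t) I_d$, $\beta = 0.5$, and i.i.d. Gaussian noise increments as in Remark 2.1; the log-log slope fits anticipate Steps 1–3 of the procedure described next:

```python
import numpy as np

def max_norm(A):                 # A has shape (L, d, d)
    return np.linalg.norm(A, axis=(1, 2)).max()

def scaled_increment_norm(A, beta):
    return A.shape[0] ** beta * np.linalg.norm(np.diff(A, axis=0), axis=(1, 2)).max()

def cumsum_norm(A):
    return np.linalg.norm(np.cumsum(A, axis=0), axis=(1, 2)).max()

def root_sum_of_squares(A):
    return np.sqrt((np.linalg.norm(A, axis=(1, 2)) ** 2).sum())

def loglog_slope(depths, values):
    # least-squares slope of log(value) against log(L)
    return np.polyfit(np.log(depths), np.log(values), 1)[0]

d, beta, rng = 10, 0.5, np.random.default_rng(0)

def make_weights(L):  # Hypothesis 2: trend L^(-beta) A_{k/L} plus i.i.d. N(0, 1/(L d^2)) noise
    trend = np.stack([L ** (-beta) * np.cos(np.pi * k / L) * np.eye(d) for k in range(L)])
    return trend + rng.normal(0.0, 1.0 / (np.sqrt(L) * d), size=(L, d, d))

depths = [100, 300, 1000, 3000]
W = {L: make_weights(L) for L in depths}
print(loglog_slope(depths, [cumsum_norm(W[L]) for L in depths]))               # ~ 1 - beta
print(loglog_slope(depths, [max_norm(W[L]) for L in depths]))                  # ~ -beta
print(loglog_slope(depths, [root_sum_of_squares(W[L]) for L in depths]))       # ~ 0: bounded, so [W^A] is finite
print(loglog_slope(depths, [scaled_increment_norm(W[L], beta) for L in depths]))  # ~ 0: does not vanish, Hypothesis 1 fails
```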
2.3. Procedure for numerical experiments

Recall that Hypotheses 1 and 2 are mutually exclusive. In order to examine whether one of them, or neither, holds for the trained weights $A^{(L)}$ and $b^{(L)}$, we proceed as follows.

Step 1: We perform a logarithmic regression of the maximum norm of $\delta^{(L)}$ with respect to $L$ to deduce the scaling $\delta^{(L)} \sim L^{-\alpha}$.

Step 2: To obtain the exponent $\beta \in [0,1)$, we perform a logarithmic regression of the cumulative sum norm of $A^{(L)}$ with respect to $L$. Indeed, (7) for $s = 1$ indicates that the cumulative sum norm explodes with a slope of $1 - \beta$.

Step 3: After identifying the correct exponent $\beta$ for the weights, we compute the $\beta$-scaled norm of increments of $A^{(L)}$ to check Hypothesis 1 and measure the smoothness of the trained weights. On one hand, if the $\beta$-scaled norm of increments of $A^{(L)}$ does not vanish as $L \to \infty$, the rescaled weights cannot be represented as a continuous function of the layer, as in Hypothesis 1. On the other hand, if the $\beta$-scaled norm of increments of $A^{(L)}$ vanishes (say, as $L^{-\nu}$) when $L$ increases, it supports Hypothesis 1 with a Hölder-continuous limit function $A \in C^\nu([0,1], \mathbb{R}^{d\times d})$.

Step 4: To discriminate between Hypothesis 1 and Hypothesis 2, we decompose the cumulative sum $\sum_{j=0}^{k-1} A^{(L)}_j$ of the trained weights into a trend component $A$ and a noise component $W^A$, as shown in (7). The presence of a non-negligible noise term $W^A$ favors Hypothesis 2.

Step 5: Finally, we estimate the regularity of the term $W^A$ under Hypothesis 2. If $\beta \geq 1/2$ and the root sum of squares of $A^{(L)}$ remains bounded, we deduce by (8) that $W^A$ is of finite quadratic variation. This happens, for example, when $W^A$ has a diffusive behavior, as in the example of i.i.d. random weights.

The same procedure is used for $b^{(L)}$. Note that the scaling exponent $\beta$ may be different for $A^{(L)}$ and $b^{(L)}$.

Remark 2.2. Note that $\sigma = \mathrm{ReLU}$ is homogeneous of degree 1, so we can write

$$\delta\, \sigma_d(Ah + b) = \operatorname{sign}(\delta)\, \sigma_d\big(|\delta| A h + |\delta| b\big).$$

Hence, when analyzing the scaling of trained weights in the case of a ReLU activation with fully-connected layers, we look at the quantities $\delta^{(L)} A^{(L)}$ and $\delta^{(L)} b^{(L)}$, as they represent the total scaling of the residual connection.

3. Numerical Experiments

We investigate the scaling properties and asymptotic behavior of trained weights for residual networks as the number of layers increases. We focus on two types of architectures: fully-connected and convolutional networks.

3.1. Fully-connected layers

Architecture. We consider a regression problem where the network layers are fully-connected. We consider the network architecture (1) for two different setups:
(i) $\sigma = \tanh$, $\delta^{(L)}_k = \delta^{(L)} \in \mathbb{R}_+$ trainable;
(ii) $\sigma = \mathrm{ReLU}$, $\delta^{(L)}_k \in \mathbb{R}$ trainable.
We choose to present these two cases for the following reasons. First, both tanh and ReLU are widely used in practice. Further, having $\delta^{(L)}$ scalar makes the derivation of the limiting behavior simpler. Also, since tanh is an odd function, the sign of $\delta^{(L)}$ can be absorbed into the activation. Therefore, we can assume that $\delta^{(L)}$ is non-negative for tanh.
Regarding ReLU, having a shared $\delta^{(L)}$ would hinder the expressiveness of the network. Indeed, if for instance $\delta^{(L)} > 0$, we would get $h^{(L)}_{k+1} \geq h^{(L)}_k$ element-wise, since ReLU is non-negative. This would imply that $h^{(L)}_L \geq x$, which is not desirable. The same argument applies to the case $\delta^{(L)} < 0$. Thus, we let $\delta^{(L)}_k \in \mathbb{R}$ depend on the layer number for ReLU networks.

Datasets and training. We consider two datasets. The first one is synthetic: fix $d = 10$ and generate $N$ i.i.d. samples $x_i$ from the $d$-dimensional uniform distribution on $[-1,1]^d$. Let $K = 100$ and simulate the following dynamical system (a code sketch of this construction is given below, after Figure 3):

$$z^{x_i}_0 = x_i, \qquad z^{x_i}_k = z^{x_i}_{k-1} + K^{-1/2}\, \tanh_d\!\big( g_d(z^{x_i}_{k-1}, k, K) \big),$$

where $g_d(z, k, K) := \sin(5k\pi/K)\, z + \cos(5k\pi/K)\, \mathbf{1}_d$. The targets $y_i$ are defined as $y_i = z^{x_i}_K / \|z^{x_i}_K\|$. The motivation behind this low-dimensional dataset is to be able to train very deep residual networks on a problem where the optimal input-output map lies inside the class of functions represented by (1).

The second dataset is a low-dimensional embedding of the MNIST handwritten digits dataset (LeCun et al., 1998). Let $(\tilde{x}, c) \in \mathbb{R}^{28\times 28} \times \{0, \ldots, 9\}$ be an input image and its corresponding class. We transform $\tilde{x}$ into a lower-dimensional embedding $x \in \mathbb{R}^d$ using an untrained convolutional projection, where $d = 25$. More precisely, we stack two convolutional layers initialized randomly, apply them to the input, and flatten the downsized image into a $d$-dimensional vector. Doing so reduces the dimensionality of the problem while allowing very deep networks to reach at least 99% training accuracy. The target $y \in \mathbb{R}^d$ is the one-hot encoding of the corresponding class.

The weights are updated by stochastic gradient descent (SGD) on the unregularized mean-squared loss using batches of size $B$ and a constant learning rate $\eta$. We perform SGD updates until the loss falls below $\epsilon$, or until the maximum number of updates $T_{\max}$ is reached. We repeat the experiments for several depths $L$ varying from $L_{\min}$ to $L_{\max}$. All the hyperparameters are given in Appendix A.

Results. For the case of a tanh activation (i), we observe in Figure 2 that for both datasets, $\delta^{(L)} \sim L^{-0.7}$ clearly decreases as $L$ increases, and the cumulative sum norm of $A^{(L)}$ slightly increases when $L$ increases. We deduce that $\beta = 0.2$ for the MNIST dataset and $\beta = 0.3$ for the synthetic dataset.

Figure 2: Scaling for tanh activation and $\delta^{(L)} \in \mathbb{R}$. Left: maximum norm of $\delta^{(L)}$ with respect to $L$. Right: cumulative sum norm of $A^{(L)}$ with respect to $L$. The dashed lines are for the synthetic data and the solid lines are for MNIST. The plots are in log-log scale.

Next, we verify which of Hypothesis 1 or Hypothesis 2 holds for $A^{(L)}$. We observe in Figure 3 (left) that the $\beta$-scaled norm of increments of $A^{(L)}$ decreases like $L^{-1/2}$, suggesting that Hypothesis 1 holds, with $A$ being $1/2$-Hölder continuous. This is confirmed in Figure 3 (right), as the trend part $A$ is visibly continuous and even of class $C^1$. The noise part $W^A$ is negligible. This observation is even more striking given that the weights are trained without explicit regularization.

Figure 3: Hypothesis verification for tanh activation and $\delta^{(L)} \in \mathbb{R}$. Left: in pink, we plot in log-log scale the root sum of squares of $A^{(L)}$, and in orange the $\beta$-scaled norm of increments of $A^{(L)}$. The dashed lines are for the synthetic data and the solid lines are for MNIST. Right: decomposition of the trained weights $A^{(L)}_{k,(9,7)}$ into the trend part $A$ and the noise part $W^A$ for $L = 10321$, as defined in (6), for the synthetic dataset.
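As referenced above, here is a minimal NumPy sketch of the synthetic data generation (the sample count $N$ and the random seed are illustrative choices; this is not the authors' implementation):

```python
import numpy as np

def make_synthetic_dataset(N=1000, d=10, K=100, seed=0):
    """Inputs x_i ~ U([-1,1]^d); targets y_i = z_K^{x_i} / ||z_K^{x_i}||."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(N, d))
    z = x.copy()
    for k in range(1, K + 1):
        # g_d(z, k, K) = sin(5 k pi / K) z + cos(5 k pi / K) 1_d (scalar term broadcast over coordinates)
        g = np.sin(5 * k * np.pi / K) * z + np.cos(5 * k * np.pi / K)
        z = z + K ** (-0.5) * np.tanh(g)   # z_k = z_{k-1} + K^{-1/2} tanh_d(g_d(z_{k-1}, k, K))
    y = z / np.linalg.norm(z, axis=1, keepdims=True)
    return x, y

x_train, y_train = make_synthetic_dataset()
```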
Regarding the case of a ReLU activation function (ii), we observe in Figure 4 (left) that the cumulative sum norm of the residual connection $\delta^{(L)} A^{(L)}$ scales like $L^{0.2}$ for the synthetic dataset and like $L^{0.1}$ for the MNIST dataset, so $\beta = 0.8$, resp. $0.9$, in this case. We see in Figure 4 (right) that keeping the sign of $\delta^{(L)}_k$ is important, as the sign oscillates considerably throughout the network depth $k = 0, \ldots, L-1$.

Figure 4: Scaling for ReLU activation and $\delta^{(L)}_k \in \mathbb{R}$. Left: cumulative sum norm of $|\delta^{(L)}| A^{(L)}$ with respect to $L$, in log-log scale. Right: trained values of $\delta^{(L)}_k$ as a function of $k$, for $L = 9100$ and for the synthetic dataset.

We now verify which of Hypothesis 1 or Hypothesis 2 holds for $\delta^{(L)} A^{(L)}$. Figure 5 (left) shows that the $\beta$-scaled norm of increments scales like $L^{0.2}$ and $L^{0.4}$ as the depth increases. This suggests that there exists a noise part $W^A$. Following (8), the fact that the root sum of squares of $\delta^{(L)} A^{(L)}$ is upper bounded as $L \to \infty$ implies that $W^A$ has finite quadratic variation. These claims are also supported by Figure 5 (right): there is a non-zero trend part $A$ and a non-negligible noise part $W^A$.

Figure 5: Hypothesis verification for ReLU activation and $\delta^{(L)}_k \in \mathbb{R}$. Left: in pink, we plot in log-log scale the root sum of squares of $|\delta^{(L)}| A^{(L)}$, and in orange the $\beta$-scaled norm of increments of $|\delta^{(L)}| A^{(L)}$. The dashed lines are for the synthetic data and the solid lines for MNIST. Right: decomposition of the trained weights $|\delta^{(L)}| A^{(L)}_{k,(7,7)}$ into the trend part $A$ and the noise part $W^A$ for $L = 10321$, as defined in (6), for the synthetic dataset.

Given the scaling behavior of the trained weights, we conclude that Hypothesis 1 seems to be a plausible description for the tanh case (i), whereas Hypothesis 2 provides a better description for the ReLU case (ii). The same conclusions hold for $b^{(L)}$ as well; see Appendix B.

Role of the noise term $W^A$. A legitimate question to ask at this point is whether the noise part $W^A$ plays a significant role in the accuracy of the network. To test this, we create a residual network with denoised weights $\tilde{A}^{(L)}_k := L^{-\beta} A_{k/L}$, compute its training error, and compare it to the original training error. We observe in Figure 6 (left) that for tanh, the noise part $W^A$ is negligible and does not influence the loss. However, for ReLU, the loss with denoised weights is one order of magnitude above the original training loss, meaning that the noise part $W^A$ plays a significant role in the accuracy of the trained network.

Figure 6: Loss value, as a function of $L$, in black for the trained weights $A^{(L)}_k$ and in green for the denoised weights $\tilde{A}^{(L)}_k = L^{-\beta} A_{k/L}$. Left: tanh activation and $\delta^{(L)} \in \mathbb{R}$. Right: ReLU activation and $\delta^{(L)}_k \in \mathbb{R}$. Note that these curves are for the synthetic dataset and that we plot them in log-log scale. Also, we show in off-white the loss value range in which we consider our networks to have converged.

3.2. Convolutional layers

We now consider the original ResNet with convolutional layers introduced in (He et al., 2016). This architecture is close to the state-of-the-art methods used for image recognition. In particular, we do not include batch normalization (Ioffe & Szegedy, 2015), since it only slightly improves the performance of the network while making the analysis significantly more complicated.

Setup. The precise architecture is detailed in Appendix D. Most importantly, our network still possesses the key skip connections from (1).
Put simply, the update rule for the hidden state reads

$$h_{k+1} = \sigma\big( h_k + \Delta_k * \sigma(A_k * h_k) + F_k * h_k \big) \qquad (9)$$

for $k = 0, \ldots, L-1$, where $\sigma = \mathrm{ReLU}$. Here, $\Delta_k$, $A_k$, and $F_k$ are kernels and $*$ denotes convolution. Note that $\Delta_k$ plays the same role as $\delta^{(L)}_k$ from (1). To lighten the notation, we omit the superscripts $x$ (the input) and $L$ (the number of layers).

We train our residual networks at depths ranging from $L_{\min} = 8$ to $L_{\max} = 121$ on the CIFAR-10 dataset (Krizhevsky et al.) with the unregularized cross-entropy loss. Here, the depth is the number of residual connections. We underline that a network with $L_{\max} = 121$ is already very deep: as a comparison, a standard ResNet-152 (He et al., 2016) has depth $L = 50$ in our framework.

Figure 7: Scaling of $\Delta^{(L)}$ (left) and $A^{(L)}$ (right) against the network depth $L$ for convolutional architectures on CIFAR-10. In blue, we plot the spectral norm of the kernels $\Delta^{(L)}_k$, resp. $A^{(L)}_k$, for $k = 0, \ldots, L-1$. The red line is the maximum of these values over $k$, namely the maximum norm defined in Table 1. The plots are in log-log scale.

Results. As in Section 3.1, we investigate how the weights scale with network depth and whether Hypothesis 1 or Hypothesis 2 holds for convolutional networks. To that end, we compute the spectral norms of the linear operators defined by the convolutional kernels $\Delta^{(L)}_k$ and $A^{(L)}_k$ using the method described in (Sedghi et al., 2019); a sketch of this computation is given at the end of this subsection. Figure 7 shows the maximum norm, and hence the scaling, of $\Delta^{(L)}$ and $A^{(L)}$ against the network depth $L$. We observe that $\Delta^{(L)} \sim L^{-\alpha}$ and $A^{(L)} \sim L^{-\beta}$ with $\alpha = 0.1$ and $\beta = 0$.

Figure 8: Testing of Hypotheses 1 and 2 for $\Delta^{(L)}$ (left) and $A^{(L)}$ (right). We plot in pink the root sum of squares, and in orange the $\alpha$-scaled norm of increments of $\Delta^{(L)}$ (left) and the $\beta$-scaled norm of increments of $A^{(L)}$ (right). Plots are in log-log scale. The root sum of squares and the scaled norm of increments are defined in Table 1. We obtain $\alpha$ and $\beta$ from Figure 7.

We then use the values obtained for $\alpha$ and $\beta$ to verify our hypotheses. Figure 8 shows that both the $\alpha$-scaled norm of increments of $\Delta^{(L)}$ and the $\beta$-scaled norm of increments of $A^{(L)}$ seem to have lower bounds as the depth grows. This suggests that Hypothesis 1 does not hold for convolutional layers. We also observe that the root sum of squares stays of the same order as the depth increases. Coupled with the fact that the maximum norms of $\Delta^{(L)}$ and $A^{(L)}$ are close to constant order as the depth increases, this suggests that the scaling limit is sparse, with a finite number of weights being of constant order in $L$.
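As referenced above, here is a minimal sketch of the FFT-based spectral norm computation of Sedghi et al. (2019), assuming stride 1, circular (periodic) padding, and a kernel stored with layout (height, width, in-channels, out-channels); the kernel and input size below are illustrative:

```python
import numpy as np

def conv_spectral_norm(kernel, n):
    """Largest singular value of the linear map defined by a 2-D circular
    convolution with the given kernel on n x n inputs (Sedghi et al., 2019)."""
    # 2-D FFT of the kernel, zero-padded to the input size, over the spatial axes
    transform = np.fft.fft2(kernel, s=(n, n), axes=(0, 1))
    # singular values of the (c_in x c_out) matrix attached to each frequency pair
    singular_values = np.linalg.svd(transform, compute_uv=False)
    return float(singular_values.max())

# Illustrative use on a random 3x3 kernel with 16 input and 16 output channels:
rng = np.random.default_rng(0)
print(conv_spectral_norm(rng.standard_normal((3, 3, 16, 16)), n=32))
```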
3.3. Summary: three scaling regimes

Our experiments show different scaling behaviors depending on the network architecture, especially the smoothness of the activation function. In Section 3.1, for fully-connected layers with tanh activation and a common pre-factor $\delta^{(L)}$ across layers, we observe a behavior consistent with Hypothesis 1 for both the synthetic dataset and MNIST. In contrast, a ResNet with fully-connected layers with ReLU activation and $\delta^{(L)}_k \in \mathbb{R}$ shows behavior compatible with Hypothesis 2, both for the synthetic dataset and MNIST. In the case of convolutional architectures trained on CIFAR-10 (Section 3.2), we observe that the maximum norm of the trained weights does not decrease with the network depth and the trained weights display a sparse structure, indicating a third scaling regime corresponding to a sparse structure for both $\Delta^{(L)}$ and $A^{(L)}$. These results are consistent with previous evidence on the existence of sparse CNN representations for image recognition (Mallat, 2016).

We stress that the setup for our CIFAR-10 experiments has been chosen to approach state-of-the-art test performance with our generic architecture, as shown in Appendix C. Note that the reason we consider networks with many layers (up to $L = 10321$) is to investigate the behavior of coefficients as depth varies, not because of any claim that very deep networks are more robust or generalize better than shallower ones. In fact, most of the networks exhibit good accuracy for depths $L \geq 15$.

4. Deep Network Limit

In this section, we analyze the scaling limit of the residual network (1) under Hypotheses 1 and 2.

4.1. Setup and assumptions

We consider $\delta^{(L)} = L^{-\alpha}$ for some $\alpha \geq 0$ and

$$h^{(L)}_0 = x, \qquad h^{(L)}_{k+1} = h^{(L)}_k + L^{-\alpha}\, \sigma_d\big( A^{(L)}_k h^{(L)}_k + b^{(L)}_k \big), \qquad (10)$$
$$A^{(L)}_k = L^{-\beta} A_{k/L} + W^A_{(k+1)/L} - W^A_{k/L}, \qquad b^{(L)}_k = L^{-\beta} b_{k/L} + W^b_{(k+1)/L} - W^b_{k/L},$$

where $(W^A_t)_{t\in[0,1]}$ and $(W^b_t)_{t\in[0,1]}$ are Itô processes (Revuz & Yor, 2013) with regularity conditions specified in Appendix E.1.

Remark 4.1. Hypothesis 1 corresponds to the case $W^A \equiv 0$ and $W^b \equiv 0$. Hypothesis 2 corresponds to the case where $W^A$ and $W^b$ are non-zero.

We use the following notation for the quadratic variation of $W^A$ and $W^b$:

$$[W^A]_t = \int_0^t \Sigma^A_u\, du, \qquad [W^b]_t = \int_0^t \Sigma^b_u\, du, \qquad (11)$$

where $\Sigma^A$ and $\Sigma^b$ are bounded processes with values respectively in $\mathbb{R}^{d,\otimes 4}$ and $\mathbb{R}^{d\times d}$. Let $Q_i\colon [0,1] \times \mathbb{R}^d \to \mathbb{R}$ be the (random) quadratic form defined by

$$Q_i(t, x) := \sum_{j,k=1}^{d} x_j x_k\, \big(\Sigma^A_t\big)_{ij,ik} + \Sigma^b_{t,ii}, \qquad (12)$$

and $Q(t, x) = (Q_1(t,x), \ldots, Q_d(t,x))$. Our analysis focuses on smooth activation functions.

Assumption 4.2 (Activation function). The activation function $\sigma$ is in $C^3(\mathbb{R}, \mathbb{R})$ and satisfies $\sigma(0) = 0$, $\sigma'(0) = 1$. Moreover, $\sigma$ has a bounded third derivative $\sigma'''$.

Most smooth activation functions, including tanh, satisfy this condition. Also, the boundedness of the third derivative $\sigma'''$ could be further relaxed to some exponential growth condition; see (Peluchetti & Favaro, 2020). Finally, we assume that the hidden state dynamics $(h^{(L)}_k,\ k = 1, \ldots, L)$ given by (1) is uniformly integrable (for a precise statement, see Assumption E.3 in the Appendix). This is a reasonable assumption since both the inputs and the outputs of the network are uniformly bounded.

4.2. Informal derivation of the deep network limit

We first provide an informal analysis of the derivation of the deep network limit. Denote $t_k = k/L$ and define, for $s \in [t_k, t_{k+1}]$,

$$M^{(L),k}_s = \big(W^A_s - W^A_{t_k}\big) h^{(L)}_k + \big(W^b_s - W^b_{t_k}\big) + L^{1-\beta} A_{t_k} h^{(L)}_k (s - t_k) + L^{1-\beta} b_{t_k} (s - t_k).$$

When $\beta = 0$, we need $\alpha = 1$ to obtain a non-trivial limit. In this case, the noise terms in $M^{(L),k}_s$ vanish as $L$ increases, and the limit is the neural ODE; see Lemma 4.6 in (Thorpe & van Gennip, 2018). When $\beta > 0$, we can apply the Itô formula (Itô, 1944) to $\sigma\big(M^{(L),k}_s\big)$ for $s \in [t_k, t_{k+1})$ to obtain the approximation

$$h^{(L)}_{k+1} - h^{(L)}_k = \delta^{(L)}\, \sigma_d\big(M^{(L),k}_{t_{k+1}}\big) = D_1 + D_2 + D_3 + o(1/L), \qquad (13)$$

where

$$D_1 = L^{-\alpha} \Big[ \big(W^A_{t_{k+1}} - W^A_{t_k}\big) h^{(L)}_k + \big(W^b_{t_{k+1}} - W^b_{t_k}\big) \Big],$$
$$D_2 = \tfrac{1}{2}\, L^{-\alpha}\, \sigma''(0)\, Q\big(t_k, h^{(L)}_k\big)\, (t_{k+1} - t_k),$$
$$D_3 = L^{1-\beta-\alpha} \Big[ A_{t_k} h^{(L)}_k (t_{k+1} - t_k) + b_{t_k} (t_{k+1} - t_k) \Big].$$

When $\alpha = 0$, we see from $D_1$ that (13) admits a diffusive limit, that is, a limit with non-vanishing noise terms $W^A$ and $W^b$. In this case, $D_2$ stays bounded when $L$ increases, and $D_3$ does not explode only when $\beta \geq 1$. The case $\alpha = 0$, $\beta \geq 1$ corresponds to a stochastic differential equation (SDE) limit. When $\alpha > 0$, $D_1$ and $D_2$ vanish in the limit $L \to \infty$, and we need $\beta = 1 - \alpha$ to obtain a non-trivial ODE limit.
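A toy numerical check of this derivation in the diffusive regime $\alpha = 0$, $\beta = 1$ (a sketch under stated assumptions, not taken from the paper's code): the trend functions $A_t$, $b_t$ are hypothetical, the noise increments are Gaussian as in Remark 2.1, and $\sigma = \tanh$, so that $\sigma''(0) = 0$ and the $Q$ term in the limit drops out. The same increments drive both the residual iteration (10) and an Euler–Maruyama scheme for the limiting SDE (15), and the gap between the two terminal states shrinks as $L$ grows:

```python
import numpy as np

d = 5
rng = np.random.default_rng(1)
A_fun = lambda t: np.sin(2 * np.pi * t) * np.eye(d)   # hypothetical trend A_t
b_fun = lambda t: np.cos(2 * np.pi * t) * np.ones(d)  # hypothetical trend b_t
x = np.ones(d)

for L in (100, 1000, 10000):
    # increments of W^A and W^b with the i.i.d. Gaussian scaling of Remark 2.1
    dWA = rng.normal(0.0, 1.0 / np.sqrt(L), size=(L, d, d)) / d
    dWb = rng.normal(0.0, 1.0 / np.sqrt(L), size=(L, d)) / np.sqrt(d)
    h = x.copy()   # residual network (10) with alpha = 0, beta = 1, sigma = tanh
    H = x.copy()   # Euler-Maruyama scheme for the SDE limit (15); Q term absent since sigma''(0) = 0
    for k in range(L):
        h = h + np.tanh(dWA[k] @ h + dWb[k] + (A_fun(k / L) @ h + b_fun(k / L)) / L)
        H = H + dWA[k] @ H + dWb[k] + (A_fun(k / L) @ H + b_fun(k / L)) / L
    print(L, np.linalg.norm(h - H))   # strong-convergence gap decreases with depth
```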
We provide precise mathematical statements of these results in the next section.

4.3. Statement of the results

The following results describe the different scaling limits of the hidden state dynamics $(h^{(L)}_k,\ k = 1, \ldots, L)$ for various values of the scaling exponents $\alpha$ and $\beta$. First, we show that Hypothesis 1 with a smooth activation function leads to convergence in sup norm to an ODE limit that is different from the neural ODE behavior described in (Chen et al., 2018; Thorpe & van Gennip, 2018; Haber & Ruthotto, 2018).

Theorem 4.3 (Asymptotic behavior under Hypothesis 1). If the activation function satisfies Assumption 4.2, $0 < \alpha < 1$ and $\alpha + \beta = 1$, then the hidden state converges to the solution of the ODE

$$\frac{dH_t}{dt} = A_t H_t + b_t, \qquad H_0 = x, \qquad (14)$$

in the sense that

$$\lim_{L\to\infty} \sup_{0 \leq t \leq 1} \big\| H_t - h^{(L)}_{\lfloor Lt \rfloor} \big\| = 0.$$

In particular, this implies the convergence of the hidden state process for any typical initialization, i.e. almost surely with respect to the initialization.

Note that in Theorem 4.3, the limit (14) defines a linear input-output map behaving like a linear network (Arora et al., 2019). This is different from the neural ODE (2), where the activation function $\sigma$ appears in the limit. Interestingly, the limit is a controlled ODE in which the control parameters enter linearly in the derivative of the state; the expressivity of such equations is discussed in (Cuchiero et al., 2020).

We show that under Hypothesis 2, we may obtain either an SDE or an ODE limit. In the latter case, the limiting ODE is found to be different from the neural ODE (2).

Theorem 4.4 (Asymptotic behavior under Hypothesis 2). Let $\sigma$ be an activation function satisfying Assumption 4.2.

$\alpha = 0$ and $\beta \geq 1$: diffusive limit. Let $H$ be the solution to the SDE

$$dH_t = dW^A_t\, H_t + dW^b_t + \tfrac{1}{2}\, \sigma''(0)\, Q(t, H_t)\, dt + \mathbf{1}_{\beta=1}\big(A_t H_t + b_t\big)\, dt, \qquad (15)$$

with initial condition $H_0 = x$. If there exists $p_2 > 2$ such that $\mathbb{E}\big[\sup_{0\leq t\leq 1} \|H_t\|^{p_2}\big] < \infty$, then the hidden state converges uniformly in $L^2$ to the solution $H$ of (15):

$$\lim_{L\to\infty} \mathbb{E}\Big[ \sup_{0\leq t\leq 1} \big\| h^{(L)}_{\lfloor Lt \rfloor} - H_t \big\|^2 \Big] = 0.$$

$0 < \alpha < 1$, $\alpha + \beta = 1$: ODE limit. The hidden state converges uniformly in $L^2$ to the solution of the ODE

$$\frac{dH_t}{dt} = A_t H_t + b_t, \qquad (16)$$

with initial condition $H_0 = x$:

$$\lim_{L\to\infty} \mathbb{E}\Big[ \sup_{0\leq t\leq 1} \big\| h^{(L)}_{\lfloor Lt \rfloor} - H_t \big\|^2 \Big] = 0.$$

Note that we prove uniform convergence in $L^2$, also known as strong convergence. The detailed assumptions and a sketch of the proof for Theorems 4.3 and 4.4 are given in Appendix E. Further details and some extensions may be found in the companion paper (Cohen et al., 2021). The idea of the proof is to view the ResNet as a nonlinear Euler discretization of the limit equation, and then to bound the difference between the hidden state and a classical Euler discretization. Using an extension of the techniques in (Higham et al., 2002) to the case of equations driven by Itô processes, we show strong convergence in the following way. We first show that the drift term of (15) is locally Lipschitz (Appendix E.3). We then prove the strong convergence of the hidden state dynamics (10) by bounding the difference between the hidden state and an Euler scheme for the limiting equation. It is worth mentioning that the convergence results in (Higham et al., 2002) hold for a class of time-homogeneous diffusion processes, whereas our result holds for general Itô processes. This distinction is important for training neural networks, since the diffusion assumption involves the Markov property, which cannot be expected to hold after training with backpropagation.
In addition, we relax a technical condition from (Higham et al., 2002) which is difficult to verify in practice; see Remark E.5.

4.4. Remarks on the results

Interestingly, when the activation function $\sigma$ is smooth, all limits in Theorems 4.3 and 4.4 depend on the activation only through $\sigma'(0)$ (assumed to be 1 for simplicity) and $\sigma''(0)$. Hypotheses 1 and 2 lead to the same ODE limit when $0 < \alpha < 1$ and $\alpha + \beta = 1$. In contrast to the neural ODE (2), the characteristics of $\sigma$ away from 0 are not relevant to this limit. In addition, our proof relies on the smoothness of $\sigma$ at 0. If the activation function is not differentiable at 0, then a different limit should be expected.

The case $A \equiv 0$, $b \equiv 0$, $\alpha = 0$, and $\beta = 1$ in Theorem 4.4 is considered in (Peluchetti & Favaro, 2020), who prove weak convergence under the additional assumption that $W^A$ and $W^b$ are Brownian motions with constant drift. In our setup, $W^A$ and $W^b$ are allowed to be arbitrary Itô processes, whose increments, i.e. the network weights, are not necessarily independent nor identically distributed. This allows for a general distribution and dependence structure of weights across layers. The concept of weak convergence used in (Peluchetti & Favaro, 2020) corresponds to convergence of quantities averaged across many random i.i.d. weight initializations. In practice, as the training is done only once, the strong convergence shown in Theorems 4.3 and 4.4 is a more relevant notion for studying the asymptotic behavior of deep neural networks.

4.5. Link with numerical experiments

Let us now discuss how the analysis above sheds light on the numerical results in Sections 3.1 and 3.2. Figure 2 shows that $\beta \approx 0.3$ and $\alpha \approx 0.7$ for the synthetic dataset with fully-connected layers and tanh activation function, and Figure 3 suggests that Hypothesis 1 is more likely to hold. This corresponds to the assumptions of Theorem 4.3 with the ODE limit (14). This is also consistent with the estimated decomposition in Figure 3 (right), where the noise part tends to be negligible.

In the case of ReLU activation with fully-connected layers, we observe that $\beta + \alpha \approx 1$ from Figure 4 (left). Since ReLU is homogeneous of degree 1 (see Remark 2.2), $|\delta^{(L)}|$ can be moved inside the activation function, so without loss of generality we can assume $\alpha = 0$ and $\beta \approx 1$. If we replace the ReLU function by a smooth version $\sigma_\epsilon$, then the limit is described by the stochastic differential equation (15). The ReLU case would then correspond to a limit of this equation as $\epsilon \to 0$. The existence of such a limit is, however, non-trivial.

From the experiments with convolutional architectures, we observe that the maximum norm (Figure 7), the scaled norm of the increments, and the root sum of squares (Figure 8) are upper bounded as the number of layers $L$ increases. This indicates that the weights become sparse when $L$ is large. In this case, there is no continuous ODE or SDE limit, and Hypotheses 1 and 2 both fail. This emergence of sparse representations in convolutional networks is consistent with previous results on such networks (Mallat, 2016).

5. Conclusion

We study the scaling behavior of trained weights in deep residual networks. We provide evidence for the existence of at least three different scaling regimes that encompass differential equations and sparse scaling limits. We also theoretically characterize the ODE and SDE limits for the hidden state dynamics in deep fully-connected residual networks.
Our work contributes to a better understanding of the behavior of residual networks and the role of network depth. Our findings point to interesting questions regarding the asymptotic behavior of such networks in the case of non-smooth activation functions and more complex architectures.

References

Arora, S., Cohen, N., Golowich, N., and Hu, W. A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks. In 7th International Conference on Learning Representations (ICLR), 2019.
Bachlechner, T., Majumder, B. P., Mao, H. H., Cottrell, G. W., and McAuley, J. ReZero is All You Need: Fast Convergence at Large Depth, 2020.
Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. doi: 10.1109/72.279181.
Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems (NeurIPS) 31, pp. 6571–6583, 2018.
Cohen, A.-S., Cont, R., Rossier, A., and Xu, R. Asymptotic analysis of deep residual networks. Working paper, 2021.
Cuchiero, C., Larsson, M., and Teichmann, J. Deep neural networks, generic universal interpolation and controlled ODEs. SIAM Journal on Mathematics of Data Science, 2(3):901–919, 2020.
Dupont, E., Doucet, A., and Teh, Y. W. Augmented Neural ODEs. In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019.
E, W., Han, J., and Li, Q. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1):1–41, 2019a.
E, W., Ma, C., Wang, Q., and Wu, L. Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections. 2019b.
Frei, S., Cao, Y., and Gu, Q. Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks. In Advances in Neural Information Processing Systems (NeurIPS) 32, 2019.
Haber, E. and Ruthotto, L. Stable architectures for deep neural networks. Inverse Problems, 34(1), 2018.
He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE Computer Society, 2016.
Heymann, J., Drude, L., and Haeb-Umbach, R. Wide Residual BLSTM Network with Discriminative Speaker Adaptation for Robust Speech Recognition. In Computer Speech and Language, 2016.
Higham, D. J., Mao, X., and Stuart, A. M. Strong convergence of Euler-type methods for nonlinear stochastic differential equations. SIAM Journal on Numerical Analysis, 40(3):1041–1063, 2002.
Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Deep Networks with Stochastic Depth. In European Conference on Computer Vision, 2016.
Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456, 2015.
Itô, K. Stochastic integral. Proc. Imp. Acad., 20(8):519–524, 1944. doi: 10.3792/pia/1195572786.
Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization, 2014.
Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pp. 2278–2324, 1998.
Loshchilov, I. and Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In 5th International Conference on Learning Representations (ICLR), 2017.
Lu, Y., Ma, C., Lu, Y., Lu, J., and Ying, L. A Mean Field Analysis of Deep ResNet and Beyond: Towards Provably Optimization via Overparameterization from Depth. In International Conference on Machine Learning, pp. 6426–6436. PMLR, 2020.
Mallat, S. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150203, 2016. doi: 10.1098/rsta.2015.0203.
Ott, K., Katiyar, P., Hennig, P., and Tiemann, M. ResNet After All: Neural ODEs and Their Numerical Solution. In International Conference on Learning Representations (ICLR), 2021.
Peluchetti, S. and Favaro, S. Infinitely deep neural networks as diffusion processes. In International Conference on Artificial Intelligence and Statistics, pp. 1126–1136. PMLR, 2020.
Platen, E. and Bruti-Liberati, N. Numerical solution of stochastic differential equations with jumps in finance, volume 64. Springer Science & Business Media, 2010.
Revuz, D. and Yor, M. Continuous martingales and Brownian motion. Springer, 2013.
Sedghi, H., Gupta, V., and Long, P. M. The Singular Values of Convolutional Layers. In International Conference on Learning Representations (ICLR), 2019.
Thorpe, M. and van Gennip, Y. Deep Limits of Residual Neural Networks, 2018.
Zagoruyko, S. and Komodakis, N. Wide Residual Networks, 2016.