Journal of Machine Learning Research 26 (2025) 1-48. Submitted 6/22; Revised 2/25; Published 2/25.

Scaling ResNets in the Large-depth Regime

Pierre Marion (pierre.marion@mines.org), Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, F-75005 Paris, France
Adeline Fermanian (adeline.fermanian@califrais.fr), MINES ParisTech, PSL Research University, CBIO, F-75006 Paris, France; Institut Curie, PSL Research University, F-75005 Paris, France; INSERM, U900, F-75005 Paris, France
Gérard Biau (gerard.biau@sorbonne-universite.fr), Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, Institut universitaire de France, F-75005 Paris, France
Jean-Philippe Vert (jean-philippe.vert@mines.org), Google Research, Brain team, Paris, France
(Current affiliations: P. Marion is now at EPFL, A. Fermanian at Califrais, and J.-P. Vert at Owkin.)

Editor: Joan Bruna

Abstract

Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = 1/\sqrt{L}$; other choices lead either to explosion or to identity mapping. This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrarily to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = 1/L$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.

Keywords: ResNets, deep learning theory, neural ODE, neural network initialization, continuous-time models

1. Introduction

We begin by introducing the general context on deep residual networks, before stating our contributions and discussing related work.

©2025 Pierre Marion, Adeline Fermanian, Gérard Biau, and Jean-Philippe Vert. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/22-0664.html.

1.1 Deep Residual Neural Networks

Residual neural networks (ResNets), introduced by He et al. (2016a) in the field of computer vision, were the first deep neural network models successfully trained with several thousand layers. Since then, extensive empirical evidence has shown that increasing the depth leads to significant improvements in performance, while raising new challenges in terms of training (e.g., Wang et al., 2024). From a high-level perspective, the key feature of ResNets is the presence of skip connections between successive layers.
In mathematical terms, this means that the $(k+1)$-th hidden state $h_{k+1} \in \mathbb{R}^d$ follows sequentially from the previous hidden state via the recurrence relation

$h_{k+1} = h_k + f(h_k, \theta_{k+1}), \quad 0 \le k \le L-1, \qquad (1)$

where $f(\cdot, \theta_{k+1}) : \mathbb{R}^d \to \mathbb{R}^d$ is the layer function parameterized by $\theta_{k+1} \in \mathbb{R}^p$ and $L$ is the number of layers. The skip connection corresponds to the addition of $h_k$ on the right-hand side of (1), which is absent in classical feedforward networks. This refinement prevents instability issues during training when $L$ is large, provided training is performed in a careful way (He et al., 2015). The idea of adding skip connections has become common practice in the field of deep learning, and is today incorporated in many other models such as Transformers in natural language processing (Vaswani et al., 2017). For simplicity, in the rest of the paper, we continue to use the terminology ResNets to denote any architecture of the form (1), keeping in mind that this framework goes beyond the original model of He et al. (2016a).

The most common architectures have 50-150 layers, but ResNets can be trained with depths up to the order of a thousand layers (He et al., 2016b). Yet, the training procedure needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth increases. As pointed out by, e.g., Shao et al. (2020), these instabilities are related to a shift in the magnitude of the variance of a signal as it passes through the network. In the original approach of He et al. (2016a), the issue was mitigated by adding a normalization step, called batch normalization (Ioffe and Szegedy, 2015), which rescales the output of each layer via centering and unit variance normalization. However, this normalization stage introduces practical and theoretical difficulties, among which computational overhead and strong dependence on the batch size (see Brock et al., 2021, and the references therein). A widespread alternative to stabilize training in deep models, explored for example by Yang and Schoenholz (2017), Arpit et al. (2019), Zhang et al. (2019b), and De and Smith (2020), is to incorporate a scaling factor $\alpha_L$ in front of the residual term in (1), yielding the model

$h_{k+1} = h_k + \alpha_L f(h_k, \theta_{k+1}), \quad 0 \le k \le L-1. \qquad (2)$

There is strong evidence that this scaling factor $\alpha_L$ should depend on $L$, without however any consensus to date on the exact form of this dependence, nor on the mathematical grounding of the approach. Thus, despite progress on the empirical side, the mathematical forces in action behind the stability of deep ResNets are still poorly understood, although they are key to unlock training at arbitrary depth.

Our goal in the present paper is to take a step towards a better theoretical understanding of deep ResNets by providing a thorough probabilistic analysis of the sequence $(h_k)_{0 \le k \le L}$ at initialization when $L$ is large, and by leveraging a continuous-time interpretation of model (2) via the so-called neural ordinary differential equation (neural ODE, Chen et al., 2018) paradigm. In a nutshell, our results highlight the intimate connection that exists at initialization between stability of the learning process, the regularity of the weights, and the scaling factor $\alpha_L$. We offer in particular a proper mathematical grounding on why and how to choose the parameter $\alpha_L$ as a function of the depth $L$ and the distribution of the weights.

1.2 Our Contributions

Scaling at initialization.
The optimal parameters of Res Nets are learned by minimizing some empirical risk function via a gradient descent algorithm. As highlighted for example by Yang and Schoenholz (2017), Hanin and Rolnick (2018), and Arpit et al. (2019), a good parameter initialization of this learning phase plays a major role in the quality of the learned model, in particular to avoid vanishing gradients and deadlock at initialization, or exploding gradients and quick divergence of the model parameters at the beginning of training. Moreover, a good initialization allows the use of larger learning rates, which have been shown to correlate with better generalization (Jastrzkebski et al., 2017). It is thus of great interest to study and understand the role played by scaling of deep Res Nets at initialization. This is the context in which we place ourselves in the sequel. At initialization stage, the weights (θk)1 k L are usually chosen as (realizations of) independent and identically distributed (i.i.d.) random variables, which typically follow a uniform or Gaussian distribution on Rp. Accordingly, the sequence (hk)0 k L that results from the recursion (2) for a given input to the network takes the form of a sequence of random variables that are not i.i.d. but are actually a martingale. Thus, denoting informally by L the differentiable loss associated with the learning task (classification or regression), the distributions of (hk)0 k L and ( L hk )0 k L as L becomes large carry useful information on the stability of training. For instance, exploding gradients in the backpropagation phase of learning correspond to the fact that, with high probability, L h L , where denotes the Euclidean norm. Our first contribution, in Section 2, is to provide thorough mathematical statements on the behavior of these distributions (both for finite and infinite L), depending on the value of αL. Among other results, we show that only the choice αL 1/ L yields a non-trivial behavior at initialization, thereby confirming empirical findings in the literature (Arpit et al., 2019; De and Smith, 2020). For αL 1/ L, the norms explode exponentially fast with L, which is inappropriate for training. For αL 1/ L, the network is almost equivalent to identity, that is, h L h0. The analysis of the different cases as a function of αL is mathematically involved and makes extensive use of concentration tools from random matrix theory. The continuous approach. As noticed by several authors (Chen et al., 2018; E et al., 2019; Thorpe and van Gennip, 2023), model (2) with a scaling factor αL = 1/L (and not 1/ L) is formally similar to the discretization of a differential equation. Thus, when L tends to infinity, the weights and hidden states change continuously with the layer according to the equation d Ht dt = f(Ht, Θt), t [0, 1]. (3) Here, time t is the continuous analogue of the layer index k, H : [0, 1] Rd is a continuoustime hidden state, and Θ : [0, 1] Rp a continuous-time parameter. This important Marion, Fermanian, Biau, and Vert connection between Res Nets and differential equations has been identified in the past years under the umbrella name of neural ODE. Since the original article of Chen et al. (2018), this point of view has led to the development of a variety of new continuous-time models, together with innovative architectures and efficient training algorithms (Chang et al., 2019; Grathwohl et al., 2019; Kidger et al., 2021). 
The neural ODE paradigm also enabled to leverage the rich theory of differential equations to better understand the mechanisms at work behind deep Res Nets (E et al., 2019; Fermanian et al., 2021). However, there is a debated question in the neural ODE community about the choice αL = 1/L, which guarantees convergence of the discrete model (2) to its continuous-time counterpart (3). As a matter of fact, it seems that this choice is guided by more mathematical than practical considerations, and several authors have suggested that it is inconsistent with what is done in practice (Cohen et al., 2021; Bayer et al., 2023). Moreover, letting αL = 1/L is somewhat contradictory with the results discussed above, which highlighted that the only non-trivial limit at initialization is αL = 1/ L. Thus, as a second contribution, we clarify the problem in Section 3 by leveraging our previous results on stability. We show that the value αL = 1/ L corresponds in the continuous world to a neural stochastic differential equation (SDE) of the form (3), where now Θ : [0, 1] Rp takes the form of a continuous-time stochastic process, typically a Brownian motion. By contrast, we also prove that the neural ODE regime with αL = 1/L corresponds to the limit of a Res Net, not with i.i.d. weights as considered before, but with more complex and correlated weight distributions. For these weight distributions, the scaling αL = 1/L is also a critical value between explosion and identity. Going further, our third contribution is to exhibit in Section 4 a continuous range of regimes that are controlled by the choice of αL (beyond the cases 1/ L and 1/L) and the distribution of (θk)1 k L at initialization, derived from a continuous-time process Θ with a regularity different from a Brownian motion. More precisely, we show experimentally that there is a strong interplay (with the same three cases explosion, identity mapping, non-trivial behavior) between the choice of αL and the regularity of (θk)1 k L as a function of the layer index k. In addition, empirical evidence suggests that this interplay impacts both the behavior and performance of the networks during training, beyond initialization. 1.3 Related Work The choice of scaling for Res Nets has been discussed in many papers, without however reaching a clear consensus on the form this scaling factor should take. For instance, Hanin and Rolnick (2018) state that stability requires αL 1/L, while Zhang et al. (2019b) show that αL 1/ L is enough to ensure stability. On the other hand, Cohen et al. (2021) claim that the scaling factor observed in practice in trained Res Nets is of the form 1/Lβ with β 0.7. Other authors have proposed more complex choices for αL (e.g., Zhang et al., 2019a; Shao et al., 2020). Taking another point of view, De and Smith (2020) observe that batch normalization is empirically equivalent to taking a 1/ L normalization factor. Bachlechner et al. (2021) suggest learning a scaling parameter αk that is allowed to vary from one layer to another, whereas, in (4), αL is kept constant across layers. These authors observe a great acceleration for training compared to traditional Res Nets with no scaling. They also suggest a similar architecture for Transformers and then notice that αk 1/L at the end of training. Scaling Res Nets in the Large-depth Regime Closest to our analysis at initialization are the papers of Arpit et al. (2019) and Zhang et al. (2019b). Arpit et al. 
(2019) develop a theoretical analysis based on mean field approximation that suggests that a scaling factor αL = 1/ L prevents vanishing/exploding gradients at initialization, and provide experimental evidence that this approach is competitive with batch normalization. However, the authors do not provide rigorous mathematical statements for the three different cases αL 1/ L, and αL 1/ L, nor do they highlight the connection with the continuous-time interpretation. Interestingly, the idea of exploiting the martingale structure to analyze the magnitude of the hidden states is present in Zhang et al. (2019b), who study the convergence of gradient descent for over-parameterized Res Nets with different values of αL. Nevertheless, they consider a specific model with Gaussian weights, and only provide asymptotic results when both width and depth tend to infinity. The asymptotic limit in this particular regime has been studied by, e.g., Allen-Zhu et al. (2019) and Hayou et al. (2021). We depart from this point of view by considering a finite-width setting. The connection between the choice of scaling and the continuous-time point of view has previously been noticed by Zhang et al. (2019c), then studied in detail by Cohen et al. (2021) and Cont et al. (2022). These authors show that, under assumptions on the form of the weights, it is possible to derive limiting (stochastic or ordinary) differential equations for the hidden states. However, they do not discuss the transition between these two regimes, nor do they link differential equations regimes with the stability of the network. 2. Scaling at Initialization Our goal in this section is to study the effect of the scaling factor αL on the stability of Res Nets at initialization, assuming that the weights are i.i.d. random variables. We start by making more precise the model and the learning problem introduced in (1). 2.1 Model and Assumptions Model. The data is a sample of n pairs (xi, yi)1 i n, where xi is the input vector in Rnin and yi Rnout is the output vector to be predicted. This setting includes regression and classification (after one-hot encoding of the labels). Specifying the informal recurrence (1), for any input x Rnin, we consider the output Fπ(x) Rnout of the Res Net model defined by hk+1 = hk + αLVk+1g(hk, wk+1), 0 k L 1, Fπ(x) = Bh L, where αL > 0 is the scaling factor of the Res Net and π = (A, B, (wk)1 k L, (Vk)1 k L) are its parameters, with A Rd nin, B Rnout d, wk Rp and Vk Rd d for k = 1, . . . , L. The almost-everywhere differentiable function g : Rd Rp Rd encodes the choice of architecture. We note that the model includes initial and final linear layers in order to map the input space Rnin into the space of hidden states Rd, and symmetrically to map the last hidden state h L into the output space Rnout. These two transformations are of little interest to us, since we mostly focus on the behavior of the sequence of hidden states (hk)0 k L. Let Marion, Fermanian, Biau, and Vert us finally notice that the results of this section can be adapted to hidden layers that do not have the same width, at the cost of increased technicality. An important feature of model (4) is that the layer function takes the form of a matrixvector multiplication, which will prove crucial to make use of concentration results on random matrices. We stress that this setting is standard in practice and that it encompasses many types of Res Nets. 
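To make the setting concrete, here is a minimal NumPy sketch of the forward pass of model (4). This is our own illustration, not the authors' code: the names (`resnet_forward`, `g`) are ours, the layer function is taken as the simple choice g(h, w) = σ(h) with σ = ReLU, and the weights use the uniform U(±√(3/d)) initialization discussed later in the section. The specific choices of g covered by the analysis are listed next.

```python
import numpy as np

def resnet_forward(x, A, B, Vs, alpha_L, g):
    """Forward pass of model (4): h_{k+1} = h_k + alpha_L * V_{k+1} g(h_k, w_{k+1})."""
    h = A @ x                      # initial linear layer, R^{n_in} -> R^d
    for V in Vs:                   # one (d, d) matrix per residual layer
        h = h + alpha_L * V @ g(h)
    return B @ h                   # final linear layer, R^d -> R^{n_out}

# Example with g(h, w) = ReLU(h) and the uniform U(+-sqrt(3/d)) initialization,
# whose entries have variance 1/d.
rng = np.random.default_rng(0)
n_in, d, n_out, L, beta = 64, 40, 10, 1000, 0.5
A = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(d, n_in))
B = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_out, d))
Vs = [rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d)) for _ in range(L)]
out = resnet_forward(rng.normal(size=n_in), A, B, Vs,
                     alpha_L=L ** (-beta), g=lambda h: np.maximum(h, 0.0))
```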
It includes for example simple Res Nets where g(h, w) = σ(h) with σ the activation function, and the original Res Nets from He et al. (2016a), which have g(h, w) = Re LU(Wh + b), where the parameter is a pair w = (W, b) with W Rd d a weight matrix and b Rd a bias, and Re LU: x 7 max(x, 0) is applied element-wise. This setting also includes attention layers, where g corresponds to the scaled dot-product between keys and queries, as well as convolutional layers. The assumptions made below do not cover these latter cases, making it an interesting avenue for future research to adapt our approach to these cases. Throughout the article, we let ℓ: Rnout Rnout R+ be a loss function, differentiable w.r.t. its first parameter, for example the squared loss or the cross-entropy loss. The objective of learning is to find the optimal parameter π that minimizes the empirical risk L (π) = Pn i=1 ℓ(Fπ(xi), yi). Probabilistic setting at initialization. The minimization of the empirical risk is usually performed by stochastic gradient descent or one of its variants (Goodfellow et al., 2016, Chapter 8). The gradient descent is initialized by choosing the weights as (realizations of) i.i.d. random variables. The parameters w1, V1, . . . , w L, VL in model (4) are therefore assumed to be an i.i.d. collection of random variables, where we recall that wk Rp and Vk Rd d parameterize the k-th layer of the network. In this stochastic context, the successive hidden states h0, . . . , h L given a fixed input x are also random variables, but their distribution is not i.i.d. in fact, under our assumptions, this sequence is a martingale. To avoid unnecessary technicalities, we assume that the sequence (hk)0 k L is non-atomic. This is for example the case if the distribution of the parameters is absolutely continuous w.r.t. the Lebesgue measure. In particular, this ensures that the sequence (hk)0 k L almost surely does not hit the non-differentiability points of g. It is stressed that the distribution of the parameters are assumed to be independent of the depth, so that all the dependence on L is captured in the scaling factor αL. This model enables us to consider multiple architectures at once, via the function g. By contrast, some authors formulate the problem of scaling as a choice of the variance at initialization (e.g., Yang and Schoenholz, 2017; Wang et al., 2024), which makes the analysis architecturedependent. However, for a given architecture, these two approaches are essentially equivalent since Var(αLVk) = α2 L Var(Vk). The quantity h L h0 / h0 carries key information on the behavior of the network at initialization. On the one hand, if h L h0 h0 , the network is essentially equal to the identity function. On the other hand, if h L h0 h0 , the output of the network explodes. An intermediate situation is when h L h0 h0 . In addition, another source of information is provided by the gradients of the hidden states with respect to the empirical risk L . If L h L , the gradients do not change as they flow through the network, which means that the exact same information is backpropagated Scaling Res Nets in the Large-depth Regime throughout the network. Conversely, if L h L , the gradients explode during backpropagation. By exploiting the martingale structure of ( hk )0 k L, as well as state-ofthe-art concentration inequalities for random matrices with sub-Gaussian entries, we provide in this section probabilistic bounds on the magnitude of these various quantities. Assumptions. 
Some assumptions are needed on the choices of architecture and initialization. Recall that a centered real-valued random variable X is said to be s2 sub-Gaussian (van Handel, 2016, Chapter 3) if for all λ R, E(exp(λX)) exp(λ2s2/2). The sub-Gaussian property is a constraint on the tail of the probability distribution. As an example, Gaussian random variables on the real line are sub-Gaussian and so are bounded random variables. The following assumptions will be needed throughout the section: for any 1 k L, (A1) For some s 1, the entries of d Vk are centered i.i.d. s2 sub-Gaussian random variables, independent of d and L, with unit variance. (A2) For some C > 0, independent of d and L, and for any h Rd, 2 E g(h, wk) 2 h 2 and E g(h, wk) 8 C h 8. Assumption (A1) is mild and satisfied by all initializations used in practice. For example, the classical Glorot initialization (Glorot and Bengio, 2010) which is the default implementation in the Keras package (Chollet et al., 2015) takes the entries of Vk as uniform U( p 3/d) variables. This means that d Vk is initialized with U( 3) random variables, which satisfy (A1). Other examples include the Gaussian N(0, 1/d) initialization of He et al. (2015) and, for example, initialization with Rademacher variables. The first part of Assumption (A2) ensures that g( , wk) is not too far away from being an isometry in expectation. The second part is more technical and, roughly, allows to upper bound the deviations of the norm of g(hk 1, wk). Our next Proposition 1 shows that most classical Res Net architectures verify Assumption (A2). For the sake of readability, these models, together with their parameters, are summarized in Table 1 below. Name Recurrence relation Parameters of g res-1 Simple Res Net hk+1 = hk + αLVk+1σ(hk) wk+1 = res-2 Parametric Res Net hk+1 = hk + αLVk+1σ(Wk+1hk) wk+1 = Wk+1 res-3 Original Res Net hk+1 = hk + αLVk+1 Re LU(Wk+1hk) wk+1 = Wk+1 Table 1: Examples of Res Net architectures considered in the paper. In the first two cases, the activation function σ is such that, for all x R, a|x| |σ(x)| b|x|, 1/ 2 a < b 1. In the last two cases, Wk+1 Rd d. Proposition 1 Let res-1, res-2, and res-3 be the models defined in Table 1. Then (i) Assumption (A2) is satisfied for res-1. Marion, Fermanian, Biau, and Vert (ii) Assumption (A2) is satisfied for res-2 and res-3, as soon as the entries of d Wk+1, 0 k L 1, are centered i.i.d. sub-Gaussian random variables, independent of d and L, with unit variance. In the models res-1 and res-2, σ can be, for instance, taken as the parametric Re LU function, i.e., σ(x) = x+ + sx , where x+ (resp. x ) denotes the positive (resp. negative) part and the slope s [1/ 2, 1] is a parameter of the model. Parametric Re LU, also known as leaky Re LU (Maas et al., 2013), has been shown to outperform standard Re LU for image datasets (Xu et al., 2015). Observe also that res-2 differs from res-3 since the classical Re LU function is defined by Re LU(x) = x+ and thus does not satisfy the condition |σ(x)| a|x|. Note that there is no bias term in these three models, as this term is commonly initialized to zero, and we are interested in the behavior at initialization. Remark 2 An important architecture that is not covered by our setting is hk+1 = hk + αLσ(Wk+1hk), where compared to res-1 the weight matrix is located inside the activation function σ. 
This is due to the fact that the residual branch writes as a matrix-vector product in our model (4), which enables us to leverage matrix concentration inequalities (see, e.g., Lemmas 21 and 22). However, we check numerically that qualitatively similar results are obtained in this case (see Appendix D). This leads us to believe that similar concentration results should hold true when the matrix-vector product is followed by an element-wise activation function. We leave this mathematical study for future work. 2.2 Probabilistic Bounds on the Norm of the Hidden States The next two propositions describe how the quantity h L h0 / h0 changes as a function of Lα2 L. Proposition 3 provides a high-probability bound of interest when Lα2 L 1. In this case, we see that, with high probability, the network acts as the identity function, directly mapping h0 to h L. On the other hand, Proposition 4 provides information in the two cases Lα2 L 1 and Lα2 L 1. When Lα2 L 1, the lower bound (i) indicates an explosion with high probability of the norm of the last hidden state. On the other hand, when Lα2 L 1, the bounds (i) and (ii) show that h L randomly varies around h0 with fluctuation sizes bounded from below and above. Proposition 3 Consider a Res Net (4) such that Assumptions (A1) and (A2) are satisfied. If Lα2 L 1, then, for any δ (0, 1), with probability at least 1 δ, h0 2 2Lα2 L δ . Proposition 4 Consider a Res Net (4) such that Assumptions (A1) and (A2) are satisfied. (i) Assume that d 64 and α2 L 2 ( C+16s4)d. Then, for any δ (0, 1), with probability at least 1 δ, Scaling Res Nets in the Large-depth Regime provided that 2L exp d 64α2 Ls2 (ii) Assume that α2 L 1 C(d+128s4). Then, for any δ (0, 1), with probability at least 1 δ, We recall that the constants s and C appearing in Proposition 4 are defined respectively by Assumptions (A1) and (A2). Moreover, note that the assumptions of Proposition 4 on d and αL are mild, since in the learning tasks where deep Res Nets are involved, one typically has αL = 1/Lβ with β > 0, d 102 and L 102. Furthermore, the assumption on αL is a technical assumption used to simplify the final expression. It is also possible to derive a proof without this assumption, at the cost of a more intricate result. Note also that condition (5) is not severe since, when d and L are large, it encompasses all reasonable values of δ. Propositions 3 and 4 are interesting in the sense that they provide finite-depth high-probability bounds on the behavior of the hidden states, depending on the magnitude of Lα2 L. The results become clearer by letting αL = 1/Lβ, with β > 0, as shown in the following corollary. Corollary 5 Consider a Res Net (4) such that Assumptions (A1) and (A2) are satisfied, and let αL = 1/Lβ, with β > 0. (i) If β > 1/2, then h L h0 (ii) If β < 1/2 and d 9, then h L h0 (iii) If β = 1/2, d 64, L (1 C + 8s4)d + 96 Cs4, then, for any δ (0, 1), with probability at least 1 δ, 1 < h L h0 2 provided that Corollary 5 highlights three different asymptotic behaviors for h L , depending on the values of β. For β > 1/2, statement (i) tells that h L converges towards h0 in probability, as L tends to infinity, which means that the neural network is essentially equivalent to an identity mapping. On the other hand, for β < 1/2, the norm of h L explodes with high probability. 
Finally, for the critical value β = 1/2, we see that h L fluctuates around h0, with Marion, Fermanian, Biau, and Vert 0 250 500 750 1000 L 0 250 500 750 1000 L 0 250 500 750 1000 L Figure 1: Evolution of h L h0 / h0 as a function of L for different values of β and an i.i.d. U( p 3/d) initialization of model res-3, with d = 40. The input is a random Gaussian observation x in dimension nin = 64. The experiment is repeated with 50 independent randomizations. 1.0 1.2 1.4 1.6 0 (a) Distribution of h L / h0 1.0 1.2 1.4 1.6 0 (b) Distribution of L Figure 2: Empirical distributions of the norms for β = 1/2, L = 103, d = 100. The experiment is repeated with 104 independent randomizations. Scaling Res Nets in the Large-depth Regime a fluctuation size independent of L. Observe that the lower bound in (iii) is not trivial as soon as exp(3/8 p 11/dδ) > 1, i.e., d > 99/64δ. The message of Corollary 5 is that the only scaling leading to a non-degenerate distribution at initialization is for β = 1/2. The three statements of Corollary 5 are illustrated in Figure 1. In this experiment, we consider model res-3, a random Gaussian observation x in dimension nin = 64, and parameters initialized with a uniform distribution U( p 3/d). We refer to Appendix D for a detailed setup of all the experiments of the paper. Figure 2a shows the empirical distribution of h L / h0 when β = 1/2 for a large number of realizations. This figure illustrates in particular that our bounds are reasonably sharp, since the bounds indicate that the first quartile of the distribution is larger than 0.87 (whereas the first quartile of the empirical histogram is equal to 1.21) and the third quartile is less than 2.06 (whereas the third quartile of the empirical histogram is equal to 1.34). Determining the exact distribution of h L / h0 is an interesting avenue for future research that is beyond the scope of the present article. There is however a strong indication that the ratio follows a log-normal distribution, as confirmed by a normality test on (the log of) the empirical distribution. In a nutshell, the proofs of Propositions 3 and 4 rest upon controlling of the norm of the hidden states, which obeys the recurrence hk+1 2 = hk 2 + α2 L Vk+1g(hk, wk+1) 2 + 2αL hk, Vk+1g(hk, wk+1) , (6) where , denotes the standard scalar product in Rd. Taking the expectations on both side, one deduces with Assumptions (A1) and (A2) that E Vk+1g(hk, wk+1) 2 = E E Vk+1g(hk, wk+1) 2 hk, wk+1 = E g(hk, wk+1) 2 hk 2 (7) and E hk, Vk+1g(hk, wk+1) = E E hk, Vk+1g(hk, wk+1) hk, wk+1 = 0. (8) The equalities (7) and (8) allow deriving without further work bounds in expectation on h L , as already observed in an informal manner by Arpit et al. (2019). However, the results we are after are stronger since they involve precise quantification of the fluctuations induced by the initialization through high-probability bounds. A finer control of the deviations of Vk+1g(hk, wk+1) 2 and hk, Vk+1g(hk, wk+1) is then needed. This involves concentration inequalities on random matrices with sub-Gaussian entries, and martingale arguments. This probabilistic derivation allows improvements over earlier works in the literature that only show stability for β 1 (Hanin and Rolnick, 2018; Allen-Zhu et al., 2019). A similar proof technique was already used in Zhang et al. (2019b) to show in an asymptotic setting explosion of the forward pass for β < 1/2 and stability for β 1/2, in accordance with our results. 
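The conditional identities (7) and (8), which drive the martingale argument above, are easy to check by simulation. The following sketch (our code, with the U(±√(3/d)) initialization of Figure 1 and model res-3, g(h, w) = ReLU(Wh)) estimates the three quantities by Monte Carlo for a fixed h: the first two estimates should agree with each other and lie close to ∥h∥²/2, while the cross term should vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_mc = 40, 20_000
h = rng.normal(size=d)

sq_Vg, sq_g, cross = 0.0, 0.0, 0.0
for _ in range(n_mc):
    V = rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d))
    W = rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d))
    g = np.maximum(W @ h, 0.0)            # g(h, w) = ReLU(W h), model res-3
    sq_Vg += np.dot(V @ g, V @ g) / n_mc  # estimates E ||V g(h, w)||^2, cf. (7)
    sq_g += np.dot(g, g) / n_mc           # estimates E ||g(h, w)||^2
    cross += np.dot(h, V @ g) / n_mc      # estimates E <h, V g(h, w)>, cf. (8)

print(sq_Vg, sq_g, np.dot(h, h) / 2, cross)
```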
We extend their result in several ways, most notably by considering a sub-Gaussian initialization relaxing their Gaussian assumption, a more general architecture, and obtaining fully non-asymptotic bounds, while their approach is asymptotic both in width and depth. This makes the mathematical analysis significantly more challenging. We also introduce a novel distinction between the stability case β = 1/2 and the identity case β > 1/2, showing the presence of non-vanishing fluctuations only for β = 1/2. For completeness, we note that a revised version of their paper does provide non-asymptotic bounds (Zhang et al., 2022), Marion, Fermanian, Biau, and Vert albeit still in the more restrictive setting of a Gaussian initialization, a specific architecture, and without separating the cases β = 1/2 and β > 1/2. We next show precise non-asymptotic bounds for the gradients, using the martingale structure of the forward differentiation recurrence, a novel proof technique. 2.3 Probabilistic Bounds on the Gradients Propositions 3 and 4 provide insights on the output of the network when L is large. However, they do not carry information on the backwards dynamics of propagation of the gradients of the loss pk = L hk Rd. Assessing the dynamics of the (pk)0 k L as a function of L is important since the behavior of this sequence impacts trainability of the network at initialization. Thus, in this subsection, we are interested in quantifying the magnitude of p0 p L / p L , when L is large. Notice that, contrarily to the previous subsection where we were mostly interested in the last hidden state h L, the quantity of interest is now p0 (not p L), the gradient at index 0. The reason is that the sequence (pk)0 k L is defined backwardly, as we will see below. We also stress that (pk)0 k L is the sequence of derivatives of the loss w.r.t. the hidden states hk, and not w.r.t. the parameters. The reason for considering this sequence is that the pk are involved in the backpropagation algorithm and are therefore essential for assessing the stability of the gradient descent (see, e.g., Arpit et al., 2019). Analyzing the behavior of the sequence (pk)0 k L is challenging since, according to the backpropagation (or reverse-mode differentiation) formula, one has pk = pk+1 + αL g(hk, wk+1) h V k+1pk+1. Taking the norm, pk 2 = pk+1 2 + α2 L g(hk, wk+1) h V k+1pk+1 2 + 2αL D pk+1, g(hk, wk+1) h V k+1pk+1 E . Although the equation looks qualitatively similar to (6), it has the unpleasant feature that pk+1 depends on the whole sequence of weights w1, V1, . . . , w L, VL. This forbids applying directly the same proof techniques as for the hidden states due to the lack of adaptation of the pk to the filtration of the hidden state process. Simplifying assumptions have sometimes been made to analyze this recurrence equation, for instance assuming independence of g(hk,wk+1) h and hk (see, e.g., Yang and Schoenholz, 2017). However, such assumption remains a strong requirement, which is not verified for many network architectures (for example model res-1). Other authors resort to an ε-net argument (Allen-Zhu et al., 2019; Zhang et al., 2019b). We tackle the problem from a different point of view and propose an alternative approach based on forward-mode differentiation, valid under a much weaker assumption. The cost we pay is that we obtain results in expectation and not in high probability. Let us sketch our approach before stating the results more formally. 
We denote by z Rd an independent random variable that will be used to assess the magnitude of the gradients. For any 0 i, j L, let hj hi Rd d be the Jacobian matrix of hj with respect to hi. Recall that the (m, n)-th entry of this matrix equals the derivative of the m-th coordinate of hj w.r.t. the n-th coordinate of hi. Then, letting qk(z) = hk h0 z, we have, by the chain rule, qk+1(z) = hk+1 hk qk(z) = qk(z) + αLVk+1 g(hk, wk+1) h qk(z). (9) Scaling Res Nets in the Large-depth Regime Identity (9), which is similar to (4), expresses qk+1(z) as a function of qk(z), and therefore respects the flow of information. Next, assuming that z is random with a Gaussian distribution, it is possible to express one of our quantities of interest, p0 / p L , as a function of the last vector q L(z). Indeed, p L 2 = 1 p L 2 Ez N(0,Id) p 0 z 2 = Ez N(0,Id) q L(z) 2 , (10) where Id is the identity matrix in Rd and the second equality is a consequence of h0 z = p Lq L(z). In summary, the recurrence (9) allows us to derive bounds on the norm of q L(z), which can then transfer to p0 / p L via (10). For this, it is necessary to make the following assumption on the ratio p L/ p L : (A3) Let b = p L/ p L . Then E(b|h L) = 0 and E(b b|h L) = Id/d. It is a mild assumption, which is verified for instance if nout = 1 with squared error (for regression) or cross-entropy (for binary classification). In these cases, p L/ p L = B / B F, where F is the Frobenius norm and B is the weight matrix of the last layer. We finally need the following assumption, which is the equivalent of Assumption (A2) for the gradients. (A4) One has, almost surely, 2 E g(hk, wk+1) h qk 2 hk, qk Assumption (A4) is satisfied by all the standard architectures listed in Table 1, as shown by the next proposition. Proposition 6 Let res-1, res-2, and res-3 be the models defined in Table 1. Assume that (A1) is satisfied and σ is almost everywhere differentiable, with a σ b. Then (i) Assumption (A4) is satisfied for res-1. (ii) Assumption (A4) is satisfied for res-2 and res-3, when the entries of d Wk, 1 k L, are centered i.i.d. random variables, independent of d and L, with unit variance. The next two propositions are the counterparts of Proposition 3 and Proposition 4 for the gradient dynamics. Proposition 7 Consider a Res Net (4) such that Assumptions (A1)-(A4) are satisfied. If Lα2 L 1, then, for any δ (0, 1), with probability at least 1 δ, p L 2 2Lα2 L δ . Marion, Fermanian, Biau, and Vert Proposition 8 Consider a Res Net (4) such that Assumptions (A1)-(A4) are satisfied. Then 2α2 L L 1 E p0 p L 2 (1 + α2 L)L 1. A simple corollary of the propositions above is as follows. Corollary 9 Consider a Res Net (4) such that Assumptions (A1)-(A4) are satisfied, and take αL = 1/Lβ, with β > 0. Then (i) If β > 1/2, p0 p L (ii) If β < 1/2, (ii) If β = 1/2, 1 E p0 p L 2 Corollary 9 is illustrated in Figure 3. The experimental protocol is the same as in Figure 1, but we now track p0 and p L, the gradients of the loss L with respect to the first and the last hidden states. In accordance with our results, when β > 1/2, the gradient remains the same from one layer to another (left plot). On the other hand, the middle plot clearly shows that when β < 1/2 the gradient explodes. Once again, the case β = 1/2 (right plot) is the only one for which the distribution of gradients at initialization is non-trivial. Figure 2b illustrates that the empirical distribution of gradients in this case also seems to be log-normal. 
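The forward-mode recurrence (9) is straightforward to simulate. The sketch below (our code, not the authors') propagates a few Gaussian probes q_k(z) alongside the hidden state for model res-3 and reports the average growth of ∥q_L(z)∥²/∥q_0(z)∥², a rough proxy for the gradient ratio studied in Corollary 9; the three values of β reproduce the identity, explosion, and critical behaviors of Figure 3.

```python
import numpy as np

def jvp_growth(L, beta, d=40, n_z=20, seed=0):
    """Propagate q_k(z) = (dh_k/dh_0) z via recurrence (9) for res-3 and report
    the average of ||q_L(z)||^2 / ||q_0(z)||^2 over n_z Gaussian probes z."""
    rng = np.random.default_rng(seed)
    alpha = L ** (-beta)
    h = rng.normal(size=d)                 # stands in for h_0 = A x
    Q0 = rng.normal(size=(d, n_z))         # columns are the probes z, q_0(z) = z
    Q = Q0.copy()
    for _ in range(L):
        V = rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d))
        W = rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d))
        pre = W @ h
        mask = (pre > 0).astype(float)     # for ReLU, dg/dh = diag(1{W h > 0}) W
        Q = Q + alpha * V @ (mask[:, None] * (W @ Q))   # recurrence (9)
        h = h + alpha * V @ np.maximum(pre, 0.0)        # recurrence (4)
    return np.mean(np.sum(Q ** 2, axis=0) / np.sum(Q0 ** 2, axis=0))

for beta in (1.0, 0.25, 0.5):              # identity, explosion, critical
    print(beta, jvp_growth(L=1000, beta=beta))
```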
Our findings extend results from the literature showing explosion of the backward pass for β < 1/2 and stability for β 1/2, both in an asymptotic setting (Allen-Zhu et al., 2019; Zhang et al., 2019b; Hayou et al., 2021) and a non-asymptotic setting (Zhang et al., 2022). Similar comparisons can be drawn with these papers as in the previous section (see end of Section 2.2). Furthermore, we emphasize again that making use of the forward-mode formulation of the gradients differs from these previous works, which resorted either to an infinite-width setting or to backward-mode differentiation and an ε-net argument. In summary, this and the previous subsection both point towards the same conclusion: there are three different cases, depending on the value of β explosion when β < 1/2, nondegenerate limit when β = 1/2, and identity when β > 1/2. In the explosion case, it is well known that the network cannot be trained (Yang and Schoenholz, 2017). The theory thus points out that the value 1/2 plays a pivotal role. Remarkably, this value has a specific interpretation in the continuous-time point of view of Res Nets, in terms of SDE. This is the topic that we address in the next section. Scaling Res Nets in the Large-depth Regime 0 250 500 750 1000 L 0 250 500 750 1000 L 0 250 500 750 1000 L Figure 3: Evolution of p0 p L / p L as a function of L for different values of β and an i.i.d. U( p 3/d) initialization of model res-3, with d = 40. The input is a random Gaussian observation x in dimension nin = 64. The experiment is repeated with 50 independent randomizations. 3. Scaling in the Continuous-time Setting Starting with the discrete Res Net (4), it is tempting to let L go to infinity and consider the network as the discretization of a differential equation where the layer index k {0, . . . , L} is replaced by the time index t [0, 1]. This interpretation of deep neural networks has been popularized by Chen et al. (2018) and is often referred to as the neural ODE paradigm. Notice that this setting is different from the so-called mean-field analysis, where the width of the network is assumed to be infinite beforehand. In our setting, the width d is assumed to be finite and fixed. 3.1 Convergence Towards a SDE in the Large-depth Regime One of the main messages of Section 2 is that the standard initialization with i.i.d. parameters leads to a non-degenerate model for large values of L only if Lα2 L 1 (Propositions 3 and 4), or, equivalently, if β = 1/2 when αL = 1/Lβ (Corollary 5). Remarkably, in the continuous-time limit, this regime corresponds to the discretization of a SDE. Indeed, consider for simplicity the (discrete) Res Net model res-1 h0 = Ax, hk+1 = hk + 1 L Vk+1σ(hk), 0 k L 1, (11) where the entries of all (Vk)1 k L are assumed to be i.i.d. N(0, 1/d). Recall the following definition: Definition 10 A one-dimensional Brownian motion (Bt)t [0,1] is a continuous-time stochastic process with B0 = 0, almost surely continuous, with independent increments, and such that for any 0 s < t 1, Bt Bs N(0, t s). Now, let B : [0, 1] Rd d be a (d d)-dimensional Brownian motion, in the sense that the (Bij)1 i,j d are independent one-dimensional Brownian motions. Thus, for any Marion, Fermanian, Biau, and Vert 0 k L 1 and any 1 i, j d, we have B(k+1)/L,i,j Bk/L,i,j N 0, 1 and the increments for different values of (i, j, k) are independent. As a consequence, the recurrence (11) is equivalent in distribution to the recurrence h k+1 = h k + 1 d σ(h k )(B(k+1)/L Bk/L), 0 k L 1. 
(Note that this is true because Vk+1 has the same distribution as V k+1.) We recognize the Euler-Maruyama discretization (Kloeden and Platen, 1992) on the {k/L, 0 k L} mesh of the SDE H0 = Ax, d H t = 1 d σ(H t )d Bt, t [0, 1], (12) where the output of the network is now a function of the final value of H, that is, H1. The link between the discrete Res Net (11) and the SDE (12) is formalized in the next proposition. Proposition 11 Consider the res-1 model, where the entries of Vk are i.i.d. Gaussian N(0, 1/d) random variables. Assume that the activation function σ is Lipschitz continuous. Then the SDE (12) has a unique solution H and, for any 0 k L, E Hk/L hk c for some c > 0. Notice that the requirement that σ is Lipschitz continuous is satisfied by most classical activation functions, including Re LU. This proposition is interesting for several reasons. First, the scaling β = 1/2, which is exactly the one that yields a non-trivial dynamics at initialization, corresponds in the continuous world to a remarkably simple model of diffusion. This shows that very deep neural networks properly initialized with i.i.d. weights are equivalent to solutions of SDE. This analogy opens interesting perspectives for training deep networks using automatic differentiation for solutions of neural SDE (Li et al., 2020). Second, we stress that the emergence of a SDE instead of an ODE carries an important message. Several authors (including, e.g., Thorpe and van Gennip, 2023) have shown that, under appropriate assumptions, a deep Res Net converges in the large depth limit to an ODE and not a SDE. The reason why we obtain a SDE here is intrinsically connected with the choice of i.i.d. initialization for the weights, which makes a Brownian motion appear at the limit, as highlighted above. In other words, the i.i.d. initialization, the choice β = 1/2 (the relevant critical value exhibited in Section 2), and the emergence of a SDE are intimately linked together. On the other hand, the case β = 1 matches with an ODE if the initialization is not i.i.d., as we will see in Subsection 3.2. Finally, we point out that Proposition 11 states the convergence of a Res Net towards a SDE for the basic architecture res-1 and for Gaussian initialization. The extension to more general settings is an interesting direction of research, although clearly beyond the scope of the present paper (see, e.g., Peluchetti and Favaro, 2020, and Cohen et al., 2021, for results in this direction). Scaling Res Nets in the Large-depth Regime 3.2 Scaling in the Neural ODE Setting Convergence towards an ODE. The basic message of our Proposition 11 is that an i.i.d. initialization, together with β = 1/2, leads to a SDE rather than an ODE. A natural question is then whether a different choice of weight distributions (at initialization) and scaling can lead to a classical neural ODE. To answer this question and leave the world of i.i.d. initialization, we assume that the weights (Vk)1 k L and (wk)1 k L are discretizations of smooth functions V : [0, 1] Rd d and W : [0, 1] Rp. We then consider the general iteration (4) with αL = 1/L, that is, h0 = Ax, hk+1 = hk + 1 LVk+1g(hk, wk+1), 0 k L 1, (13) where Vk = Vk/L and wk = Wk/L. Of course, it is still possible to consider (Vk)1 k L (resp. (wk)1 k L) as random variables, by letting (Vt)t [0,1] (resp. (Wt)t [0,1]) be a continuous-time stochastic process. 
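As an illustration of this correlated regime, the sketch below (our code, with arbitrary but reasonable choices: piecewise-linear interpolation of a few i.i.d. control matrices as the smooth random paths, entries of variance 1/d, and the res-3 layer function) runs iteration (13) with weights obtained by evaluating a single Lipschitz path at t = k/L. In contrast with the i.i.d. case, the final hidden state stabilizes as L grows, consistent with convergence to a limiting continuous-time trajectory.

```python
import numpy as np

def smooth_resnet_hidden(L, d=40, n_ctrl=8, seed=0):
    """Run iteration (13), alpha_L = 1/L, with V_k = V(k/L) and w_k = W(k/L), where V and W
    are Lipschitz random paths built by linear interpolation of i.i.d. control matrices."""
    rng = np.random.default_rng(seed)
    ctrl_V = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_ctrl, d, d))
    ctrl_W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_ctrl, d, d))

    def path(ctrl, t):                      # piecewise-linear interpolation of the controls
        s = t * (n_ctrl - 1)
        i = min(int(s), n_ctrl - 2)
        lam = s - i
        return (1.0 - lam) * ctrl[i] + lam * ctrl[i + 1]

    h = rng.normal(size=d)                  # stands in for h_0 = A x
    for k in range(L):
        t = (k + 1) / L
        h = h + path(ctrl_V, t) @ np.maximum(path(ctrl_W, t) @ h, 0.0) / L
    return h

# The same random paths, discretized more and more finely: the output stabilizes with L.
for L in (100, 1000, 10000):
    print(L, round(float(np.linalg.norm(smooth_resnet_hidden(L))), 4))
```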
In this model, we shall need the following assumption: (A5) For any 1 k L, one has Vk = Vk/L and wk = Wk/L, where the stochastic processes V and W are almost surely Lipschitz continuous. More precisely, almost surely, there exist KV , KW such that, for any s, t [0, 1], Vt Vs KV |t s|, Wt Ws KW |t s|. A typical model that satisfies Assumption (A5) is obtained by letting the entries of V and W be independent Gaussian processes with expectation zero and squared exponential covariance K(x, x ) = exp( (x x )2 2ℓ2 ), where ℓ> 0 (Lederer et al., 2019). Note that the Lipschitz constants may themselves be random, depending on the Gaussian process sample. We shall also need the following requirement on g, which is satisfied by all our models as soon as σ is Lipschitz continuous: (A6) The function g is Lipschitz continuous on compact sets, in the sense that for any compact P Rp, there exists KP > 0 such that, for all h, h Rd, w P, g(h, w) g(h , w) KP h h , and for any compact D Rd, there exists KD,P > 0 such that, for all h D, w, w P, g(h, w) g(h, w ) KD,P w w . Under Assumptions (A5) and (A6), the recurrence (13) almost surely converges towards the neural ODE given by H0 = Ax, d Ht = Vtg(Ht, Wt)dt, t [0, 1], (14) as shown by the proposition below. Proposition 12 Consider model (13) such that Assumptions (A5) and (A6) are satisfied. Then the ODE (14) has a unique solution H, and, almost surely, there exists some c > 0 such that, for any 0 k L, Hk/L hk c Marion, Fermanian, Biau, and Vert It should be stressed that the transition from the discrete recurrence (13) to the continuous-time differential equation (14) relies on the assumptions that the weight sequences (wk)1 k L and (Vk)1 k L are the discretizations of smooth limiting processes W and V on the one hand, and that the scaling αL is chosen as 1/L on the other hand. From a practical perspective, Proposition 12 shows that it is possible to initialize Res Nets in the ODE regime, by choosing a smooth stochastic process, discretizing it at each layer, and taking a 1/L scaling. This is in sharp contrast with the results of Sections 2 and 3.1, which show that the usual i.i.d. procedure leads to a neural SDE. Stability and scaling. Assuming that the weights of the network are discretizations of a smooth function (Assumption (A5)), it is possible to obtain stability results, depending on the value of β, similarly to what has been done in Section 2. We show below that β = 1 is a critical value, by examining the hidden states, in the same way as β = 1/2 is a critical value in the i.i.d. setting. Similar results can be shown for the gradients. We begin by a proposition handling the cases β > 1 and β = 1. Proposition 13 Consider a Res Net (4) such that Assumptions (A5) and (A6) are satisfied. Let αL = 1/Lβ, with β > 0. (i) If β > 1, then, almost surely, (ii) If β = 1, then, almost surely, there exists some c > 0 such that The explosion case (β < 1) is more delicate to deal with. We prove it for a linear model, and leave for future work the extension to more general cases. Proposition 14 Consider the res-1 model, taking σ as the identity function. Assume that Assumption (A5) is satisfied and that V T 0 has a positive eigenvalue. Let αL = 1/Lβ, with β (0, 1). Then, almost surely, max k hk h0 The assumption of the existence of a positive eigenvalue for V 0 is mild. For instance, if the entries of V0 are i.i.d. 
random variables with finite moments of all order, G otze and Jalowy (2021) show that such an eigenvalue exists with probability at least 1 1/d for d large enough. Essentially, the proof relies on showing divergence of the hidden states along the eigenvector associated to a positive eigenvalue of V T 0 . The extension to models that do not have an identity activation function is delicate. Indeed, while we expect such divergence to generally occur with a non-linear activation function, it is technically delicate to show as one has to rule out cases where the non-linearity of the activation function induces compensations that prevent divergence. Scaling Res Nets in the Large-depth Regime 0 250 500 750 1000 L 0 250 500 750 1000 L 108 β = 0.5 0 250 500 750 1000 L Figure 4: Evolution of h L h0 / h0 as a function of L for different values of β and a smooth initialization of model res-3, with d = 40. The input is a random Gaussian observation x in dimension nin = 64. The experiment is repeated with 50 independent randomizations. 0 250 500 750 1000 L 0 250 500 750 1000 L 108 β = 0.5 0 250 500 750 1000 L Figure 5: Evolution of p0 p L / p L as a function of L for different values of β and a smooth initialization of model res-3, with d = 40. The input is a random Gaussian observation x in dimension nin = 64. The experiment is repeated with 50 independent randomizations. Marion, Fermanian, Biau, and Vert In this setting, we observe experimentally a behavior of the output and of the gradients when L grows large similar to the one explored in Section 2. This is illustrated in Figures 4 and 5, which mirror Figures 1 and 3 in Section 2. The figures clearly show that there exist three cases for the output and for the gradients: an identity case (left plots), an explosion case (middle), and a non-trivial case separating explosion and identity (right). However, the remarkable point is that the separation occurs for β = 1, and not β = 1/2, as predicted by Propositions 13 and 14. 4. Experiments We experimentally investigate in this section two questions. The first one is to know whether there exists a range of scaling factors β > 0 and weight initializations, beyond the i.i.d. and the smooth regimes. The second question is whether our analysis, which pertains to the initialization phase, provides insights into the training phase, beyond initialization. 4.1 Intermediate Regimes In order to describe the transition between the i.i.d. and smooth cases, a possible route is to consider that the weights are increments of a γ-H older stochastic process. This model is interesting insofar as the Brownian motion (SDE regime) is (1/2 ε)-H older (ε > 0) and a Lipschitz continuous stochastic process (ODE regime) is 1-H older. In line with the above, in a series of experiments, we initialize the weights as increments of a fractional Brownian motion (BH t )t [0,1]. Recall that BH is a continuous-time Gaussian process, starting at zero, with zero expectation for all t [0, 1], and covariance function E(BH s BH t ) = 1 2(|s|2H + |t|2H |t s|2H), 0 s, t 1, where H (0, 1) is called the Hurst index. This index describes the raggedness of the process, with a higher value leading to a smoother process. When H = 1/2, the process is a standard Brownian motion (Definition 10), whose increments are independent by construction. When H > 1/2, the increments of the process are positively correlated, while if H < 1/2 the increments are negatively correlated. 
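The covariance formula above is all that is needed to simulate such weights. The sketch below (our code, not the authors' implementation) samples fractional Brownian motions on the grid k/L by a Cholesky factorization of this covariance and returns their increments, which then play the role of the residual weights, up to a variance normalization left to the practitioner; H = 1/2 recovers i.i.d. Gaussian increments, while H close to 1 approaches the smooth regime.

```python
import numpy as np

def fbm_increments(L, hurst, size, rng):
    """Sample independent fractional Brownian motions on the grid k/L, k = 1..L, via a
    Cholesky factorization of E[B^H_s B^H_t] = (|s|^{2H} + |t|^{2H} - |t-s|^{2H}) / 2,
    and return their increments.  Output shape: (L,) + size."""
    t = np.arange(1, L + 1) / L
    cov = 0.5 * (t[:, None] ** (2 * hurst) + t[None, :] ** (2 * hurst)
                 - np.abs(t[:, None] - t[None, :]) ** (2 * hurst))
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(L))   # small jitter for numerical safety
    z = rng.normal(size=(L, int(np.prod(size))))
    paths = chol @ z                                     # one fBm path per column
    incs = np.diff(paths, axis=0, prepend=0.0)           # B^H_{k/L} - B^H_{(k-1)/L}
    return incs.reshape((L,) + tuple(size))

rng = np.random.default_rng(0)
d, L, H = 40, 1000, 0.8
V_incs = fbm_increments(L, H, (d, d), rng)   # V_incs[k] plays the role of V_{k+1} in (4),
                                             # up to a variance normalization
```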
Importantly, a fractional Brownian motion with Hurst index H is (H ε)-H older continuous for any ε > 0. In the limit when H 1, the trajectories converge to linear functions (whose increments satisfy (A5)). As an illustration, Figure 6 depicts three realizations of a fractional Brownian motion with H = 0.2 (left), H = 0.5 (middle), and H = 0.8 (right). In order to assess the effect of the scaling factor β and the Hurst index H, we initialize a neural network res-3 with d = 40, L = 1000, various values of β [0.2, 1.3], and with weights taken as increments of fractional Brownian motions with various Hurst indices H (0, 1). Figure 7 depicts the empirical magnitude of the output and the gradients at initialization as a function of the Hurst index H and the scaling factor β. First note that we recover the two regimes (i.i.d. and smooth) discussed so far. For H = 1/2, the i.i.d. regime kicks in, with explosion (β < 1/2, orange zone), non-trivial behavior (β = 1/2, black zone), and identity (β > 1/2, blue zone). Likewise, we see at H = 1 a similar pattern in the smooth regime, with, as predicted by Proposition 13, a critical value β = 1. Beyond these two specific cases, we observe for an index H varying in (1/2, 1) a whole range of intermediate Scaling Res Nets in the Large-depth Regime 0 200 400 600 800 1000 3 (a) H = 0.2 0 200 400 600 800 1000 (b) H = 0.5 0 200 400 600 800 1000 0.8 (c) H = 0.8 Figure 6: Examples of realizations of a fractional Brownian motion BH for different Hurst indexes H. Note that the smaller the value of H, the more irregular the trajectory is. 1.3 1.18 1.06 0.94 0.82 0.69 0.57 0.45 0.33 (a) ln10( h L h0 / h0 ) 1.3 1.18 1.06 0.94 0.82 0.69 0.57 0.45 0.33 (b) ln10( p0 p L / p L ) Figure 7: Magnitude of the outputs and of the gradients as a function of the regularity of the weights (Hurst index H) and of the scaling factor β. The orange zone corresponds to the explosion case, i.e., h L h0 h0 and p0 p L p L . The blue zone corresponds to the identity case, i.e., h L h0 h0 and p0 p L p L . Finally, the black zone is an intermediate case, where h L h0 h0 and p0 p L p L . situations, where the transition between identity and explosion seems to happen for a critical β = H. Interestingly, for H < 1/2, the transition seems to saturate at the value β = 1/2. The take-home message is that the choice of the scaling of a Res Net seems to be closely linked to the regularity of the weights as a function of the layer. More precisely, for all regimes, the critical scaling factor between explosion and identity seems to have a natural interpretation as the (H older) regularity of the underlying continuous-time stochastic process. We believe that the mathematical understanding of this connection, beyond the fractional Marion, Fermanian, Biau, and Vert Brownian motion case, is a promising research direction for the future. Finally, these experiments suggest that it is sensible to initialize a Res Net for any value of the scaling β (1/2, 1), while avoiding the identity and explosion situations, by simulating a fractional Brownian motion of Hurst index H = β and initializing the weights as the increments of this process. 4.2 Beyond Initialization At initialization, before the gradient descent, the distribution of the weights (wk)1 k L and (Vk)1 k L is chosen by the practitioner. By contrast, during and after training, control is lost on these distributions, making the picture more complex. 
In particular, the existence and characterization of a continuous-time stochastic process whose discretization matches the trained Res Net is an interesting but difficult problem. Attacking this question requires a fine understanding of the interaction between training dynamics and the regularity of the sequence of the weights during the gradient descent. However, there is experimental evidence that the trained weights exhibit strong structure as a function of the layer index k (Cohen et al., 2021; Bayer et al., 2023), and that their regularity strongly depends on the choice of initialization. Interesting theoretical preliminary results in the ODE case are reported in Sander et al. (2022) for a linear activation and further generalized by Marion et al. (2024). Figure 8 depicts this mechanism by plotting a given coordinate of wk as a function of the layer index k ranging from 1 to the depth L = 1000, after training. Note that, in this experiment, there is no bias term in the residual layers, following the formulations from Table 1. We check that adding a bias term gives qualitatively similar plots (see Appendix D). To investigate the link between regularity of the weights at initialization, scaling, and performance after training, we train Res Nets on the datasets MNIST (Deng, 2012) and CIFAR-10 (Krizhevsky, 2009). As in Subsection 4.1, we initialize the Res Nets with various scaling factors and weights that are increments of fractional Brownian motions with different regularities. Then, for each combination of weight initialization and scaling factor, the Res Net is trained using the Adam optimizer (Kingma and Ba, 2015) for 10 epochs. The model includes a zero-initialized bias term on each residual layer. The results in terms of accuracy 0 200 400 600 800 1000 1.5 (a) β = 1, smooth initialization 0 200 400 600 800 1000 1.5 (b) β = 1, i.i.d. initialization 0 200 400 600 800 1000 1.5 (c) β = 1/2, i.i.d. initialization Figure 8: Plot of a given coordinate of wk, after training, as a function of the layer index k ranging from 1 to the depth L = 1000 for three different choices of β and initializations. Scaling Res Nets in the Large-depth Regime 0.9 0.81 0.72 0.63 0.54 0.46 0.37 0.28 0.19 (a) On MNIST 0.1 0.19 0.28 0.37 0.46 0.54 0.63 0.72 0.81 (b) On CIFAR-10 Figure 9: Accuracy after training as a function of the regularity of the weights at initialization and scaling. For each point of the heatmap, the model was trained on a grid of learning rates, and the best performance is shown. are presented in Figure 9 (light orange = good performance, blue = bad performance). We observe a pattern similar to the one of Figure 7. This means that, for a given regularity, the network is unable to learn if it is initialized with a scaling too far below the critical value, which of course is connected with the gradient explosion issue discussed previously. On the other hand, and perhaps more surprisingly, the performance seems to be more or less stable in the identity region, with perhaps a small degradation in the case of CIFAR-10. This somewhat contrasts with the results from Yang and Schoenholz (2017), who exhibit a decrease in performance for i.i.d. initialization and a large scaling factor β. Note however that, in the experiments reported in Figure 7, we adapt the learning rate of the gradient descent on a grid by cross-validation. When taking a fixed learning rate, we also observe a decrease in performance for large scaling factor β. 
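For completeness, here is a skeleton (in PyTorch; our code, not the authors' implementation) of the kind of experiment behind Figure 9: a deep res-3-style network whose residual weights come from a user-supplied initializer (for instance the fBm increments sketched in Section 4.1), trained with Adam. In the actual experiment the loader would serve MNIST or CIFAR-10 batches, a zero-initialized bias would be added to each residual layer, and the learning rate would be selected on a grid for each (H, β) pair.

```python
import torch
import torch.nn as nn

class ScaledResNet(nn.Module):
    """res-3-style network: h_{k+1} = h_k + L^{-beta} V_{k+1} relu(W_{k+1} h_k)."""
    def __init__(self, n_in, d, n_out, L, beta, weight_init):
        super().__init__()
        self.inp = nn.Linear(n_in, d, bias=False)
        self.out = nn.Linear(d, n_out, bias=False)
        # weight_init(k) returns the (d, d) tensor used for layer k; plugging in fBm
        # increments gives the Hurst-index-dependent initialization of Section 4.1.
        self.V = nn.ParameterList([nn.Parameter(weight_init(k)) for k in range(L)])
        self.W = nn.ParameterList([nn.Parameter(weight_init(k)) for k in range(L)])
        self.alpha = L ** (-beta)

    def forward(self, x):
        h = self.inp(x)
        for V, W in zip(self.V, self.W):
            h = h + self.alpha * torch.relu(h @ W.T) @ V.T
        return self.out(h)

def run(model, loader, lr, epochs=10):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Toy usage on random data (a real run would replace `loader` and sweep lr on a grid).
d, L = 40, 100
init = lambda k: torch.empty(d, d).uniform_(-(3.0 / d) ** 0.5, (3.0 / d) ** 0.5)
model = ScaledResNet(n_in=64, d=d, n_out=10, L=L, beta=0.5, weight_init=init)
loader = [(torch.randn(32, 64), torch.randint(0, 10, (32,))) for _ in range(10)]
run(model, loader, lr=1e-3, epochs=1)
```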
The interplay between the learning rate and the scaling factor is one of the keys to better assess how the performance of the trained network is connected with the scaling. Acknowledgments The authors thank anonymous referees for insightful comments. They also thank S. Schoenholz for fruitful discussion. P. Marion was supported by a grant from R egion ˆIle-de-France and by a Google Ph D Fellowship award. Marion, Fermanian, Biau, and Vert Appendix A. Proofs Throughout the proofs, the i-th coordinate of a vector v is denoted by vi. Similarly, the i-th row of a matrix M is denoted by Mi, and its (i, j)-th entry by Mij. A.1 Proof of Proposition 1 Statement (i) is clear (with C = 1) since, for any h Rd, σ(h) 2 a2 h 2, b2 h 2 1 2 h 2, h 2 . With respect to statement (ii), it is enough to show that for any h Rd and any random matrix W satisfying the assumptions of the proposition, one has 2 E σ(Wh) 2 h 2 and E σ(Wh) 8 C h 8, E Re LU(Wh) 2 = h 2 2 and E Re LU(Wh) 8 C h 8. The two claims with the squared norms are consequences of Lemmas 17 and 18 in Appendix B, together with the fact that the variance of the entries of W equals 1/d. In order to prove the other two statements, first note that E( σ(Wh) 8) E( Wh 8) and E( Re LU(Wh) 8) E( Wh 8). The results are then consequences of Lemma 22 in Appendix B, which states that d2 1 + 384s4 + 3072s6. A.2 Proof of Proposition 3 According to Lemma 15 below, one has 1 + α2 L L 1 . But, for Lα2 L 1, we have (1 + α2 L)L 1 exp(Lα2 L) 1 2Lα2 L. Therefore, and the result follows from Markov s inequality. Lemma 15 Consider a Res Net (4) such that Assumptions (A1) and (A2) are satisfied. Then 1 + α2 L 2 L 1 E h L h0 2 1 + α2 L L 1 . Scaling Res Nets in the Large-depth Regime Proof (Lemma 15). Taking the squared norm of the forward update rule (4) and dividing by h0 2 yields h0 2 = 1 h0 2 hk 2 + α2 L Vk+1g(hk, wk+1) 2 + 2αL hk, Vk+1g(hk, wk+1) . (15) We deduce by Assumptions (A1) and (A2) that (1 + α2 L)E hk 2 Therefore, by recurrence, we are led to (1 + α2 L)k. (16) Now, observe that h L = h0 + αL PL 1 k=0 Vk+1g(hk, wk+1). Thus, we have k,k =0 E g(hk, wk+1) V k+1Vk +1g(hk , wk +1) By conditioning on all random variables except Vk +1 for k < k (and Vk+1 for k > k ), it is easy to see that the only non-zero terms are when k = k . This yields k=0 E Vk+1g(hk, wk+1) 2 (by Assumptions A1 and A2) k=0 (1 + α2 L)k = 1 + α2 L L 1 . = 1 + α2 L 2 Marion, Fermanian, Biau, and Vert A.3 Proof of Proposition 4 Dividing (15) by hk 2 and taking the logarithm leads to ln( hk+1 2) = ln( hk 2)+ln 1+α2 L Vk+1g(hk, wk+1) 2 hk 2 +2αL D hk hk , Vk+1g(hk, wk+1) Yk,1 = α2 L Vk+1g(hk, wk+1) 2 hk 2 , Yk,2 = 2αL D hk hk , Vk+1g(hk, wk+1) and Yk = Yk,1 + Yk,2. The proof of Proposition 4 strongly relies on the following lemma, which provides technical information on the moments of Yk,1 and Yk,2. For the sake of clarity, its proof is postponed to Appendix A.13. Lemma 16 Assume that Assumptions (A1) and (A2) are satisfied. Then (E1) E(Yk,2|hk) = 0. (E2) α2 L 2 E(Yk,1|hk) α2 L. (E3) E(Yk,1Yk,2|hk) = 0. (E4) E(Y 2 k,2|hk) 4α2 L d . (E5) E(Y 4 k,2|hk) 2048 s4α4 L d2 . (E6) E(Y 4 k,1|hk) C 3072 s6 d + 1 α8 L. (E7) E(Y 2 k,1|hk) d + 1 α4 L. For c > 0, we have h0 2 c = P ln( h L 2) ln( h0 2) ln(c) k=0 ln(1 + Yk) ln(c) k=0 Yk ln(c) (using ln(1 + x) x for x > 1). Let S = PL 1 k=0 Yk E(Yk|hk). By (E1) and (E2), k=0 E(Yk|hk) Lα2 L. So, for c > exp(Lα2 L), h0 2 c P S ln(c) k=0 E(Yk|hk) P(S ln(c) Lα2 L) P S2 ln(c) Lα2 L 2 E(S2) (ln(c) Lα2 L)2 (17) (by Markov s inequality.) 
Scaling Res Nets in the Large-depth Regime It remains to upper bound E(S2). To this aim, note that k=0 E Yk E(Yk|hk) 2 k=0 E Y 2 k 4Lα2 L d + 128 (by (E3), (E4), and (E7)) The last inequality is true for α2 L 1 C(d+128s4). Therefore, by inequality (17), we obtain, for c > exp(Lα2 L), h0 2 c 5Lα2 L d ln(c) Lα2 L 2 . We conclude that, for any δ (0, 1), with probability at least 1 δ, This shows statement (ii) of the proposition. Next, to prove statement (i), observe that c > 0, h0 2 c = P ln( h L 2) ln( h0 2) ln(c) k=0 ln(1 + Yk) ln(c) k=0 ln(1 + Yk) ln(c) and k, Yk 1 k=0 ln(1 + Yk) ln(c) and k, Yk < 1 Using the inequality ln(1 + x) x x2 for x 1/2, we obtain h0 2 c P L 1 X k=0 Yk Y 2 k ln(c) and k, Yk 1 k=0 ln(1 + Yk) ln(c) and k, Yk < 1 h0 2 c P L 1 X k=0 Yk Y 2 k ln(c) + k=0 P Yk,2 < 1 Marion, Fermanian, Biau, and Vert We handle the two terms above on the right-hand side separately. For the first term, let Zk = Yk Y 2 k and S = PL 1 k=0 Zk E(Zk|hk). Observe that, by (E1)-(E4) and (E7), k=0 E(Zk|hk) Lα2 L 2 4Lα2 L d 128 8Lα2 L, (19) where the last inequality is valid for d 64 and α2 L 1 16 C(2s4+1). Hence, for 0 < c < exp(3Lα2 L/8), k=0 Yk Y 2 k ln c = P S ln(c) k=0 E(Zk|hk) P S ln(c) 3Lα2 L 8 P S2 ln(c) 3Lα2 L 8 E(S2) ln(c) 3Lα2 L 8 2 (by Markov s inequality.) Using the cr-inequality (a + b)n 2n 1(an + bn) respectively for n = 2 and n = 4, we see that k=0 E Zk E(Zk|hk) 2 k=0 E Z2 k 2 k=0 E Y 2 k + E Y 4 k k=0 E(Y 2 k,1) + E(Y 2 k,2) + 2E(Yk,1Yk,2) + 8E(Y 4 k,1) + 8E(Y 4 k,2). By (E3)-(E7), it is easy to verify that, for d 64 and α2 L 1 ( E(S2) 10Lα2 L d . This shows that, for c < exp(3Lα2 L/8), k=0 Yk Y 2 k ln(c) 10Lα2 L d ln(c) 3Lα2 L 8 2 . To conclude the proof, it remains to upper bound the second term of inequality (18). According to inequality (21) in the proof of Lemma 16 (with t = 1/2), one has k=0 P Yk,2 < 1 2L exp d 64α2 Ls2 Scaling Res Nets in the Large-depth Regime Putting everything together, we are led to h0 2 c 10Lα2 L d ln(c) 3Lα2 L 8 2 + 2L exp d 64α2 Ls2 Take δ (0, 1). Then, if 2L exp d 64α2 Ls2 δ 11, with probability at least 1 δ, h0 2 > exp 3Lα2 L 8 Notice that this inequality is valid under the assumption α2 L 2 ( A.4 Proof of Corollary 5 Statement (i) is a consequence of Proposition 3, whereas (ii) is a consequence of Proposition 4 (i). The latter is valid under the conditions d 64 and αL 2 ( C+16s4)d, which is automatically satisfied for all L large enough. Furthermore, an inspection of the proof of Proposition 4 reveals that the divergence in high probability of h L can be proved under the relaxed assumption d 9. Indeed, the main constraint on d comes from the lower bound (19), where one needs to make sure that Lα2 L 2 4 Lα2 L d > 0, which is the case for d = 9. To prove (iii), we use a union bound on both statements of Proposition 4. A.5 Proof of Proposition 6 The first claim follows from the observation that g(hk, wk+1) σ (hk,1) 0 . . . 0 0 σ (hk,2) . . . 0 ... ... ... ... 0 0 . . . σ (hk,d) from (A1), and from the assumption on σ . Let us now prove (ii). In the rest of the proof, the subscript k is ignored to lighten the notation. Observe that σ ( W1, h ) 0 . . . 0 0 σ ( W2, h ) . . . 0 ... ... ... ... 0 0 . . . σ ( Wd, h ) Denote by D the matrix in the middle of the right-hand side. Then h q 2 h, q = E V DWq 2|h, q = E DWq 2|h, q Marion, Fermanian, Biau, and Vert For model res-2, we have h q 2 h, q = E d X j=1 Wijqj 2 σ ( Wi, h ) h, q . The conclusion follows from the hypothesis that a σ b and E Wq 2|q = q 2. 
For model res-3, we have h q 2 h, q = E d X j=1 Wijqj 2 1Pd j=1 Wijhj 0 Since the (Wij)1 i,j d are centered random variables, we conclude that h q 2 h, q = 1 j=1 Wijqj 2 q = 1 2E Wq 2|q = q 2 A.6 Proof of Proposition 7 Letting b = p L/ p L , as in Assumption (A3), and taking expectation in (10), we obtain = E(|b q L(z)|2) = 1 d E( q L(z) 2) (20) The rest of the proof is similar to the proof of Proposition 3. From (9), we have qk+1(z) 2 = qk(z) 2+α2 L Vk+1 g(hk, wk+1) h qk(z) 2 +2αL qk(z), Vk+1 g(hk, wk+1) By independence of Vk+1 from qk(z) and g(hk,wk+1) E D qk(z), Vk+1 g(hk, wk+1) h qk(z) E = 0. E Vk+1 g(hk, wk+1) h qk(z) 2 = E E Vk+1 g(hk, wk+1) h qk(z) 2 hk, wk+1, qk(z) = E g(hk, wk+1) = E E g(hk, wk+1) h qk(z) 2 hk, qk(z) . By Assumption (A3), we are led to 2α2 L E( qk(z) 2) E( qk+1(z) 2) (1 + α2 L)E( qk(z) 2), Scaling Res Nets in the Large-depth Regime and thus, by induction, since q0(z) = z and E( z 2) = d, 2α2 L k E( qk(z) 2) d(1 + 4α2 L)k. In particular, for k = L, 2α2 L L E( q L(z) 2) d(1 + α2 L)L. Therefore, by (20), 1 + 1 2α2 L L E p0 2 (1 + α2 L)L. To finish the proof, observe that 1 p L (p0 p L) z = b (q L(z) z). Using arguments similar to (20), we may write d E q L(z) z 2 Now, upon noting that q L(z) z = q L(z) q0(z) = αL PL 1 k=0 Vk+1 g(hk,wk+1) E( q L(z) z 2) = α2 L k,k =0 E qk(z) g(hk, wk+1) h V k+1Vk +1 g(h k, wk +1) Vk+1 g(hk, wk+1) k=0 (1 + α2 L)k = d (1 + α2 L)L 1 d exp(Lα2 L) 1 2d Lα2 L, for Lα2 L 1. Note that the second equality is obtained by conditioning on every random variable except Vk +1 for k < k (and Vk+1 for k > k ). Finally, by using Markov s inequality, we conclude that, for any ε > 0, P p0 p L 2 ε p L 2 2Lα2 L ε . A.7 Proof of Proposition 8 The proof of Proposition 7 reveals that (1 + α2 L)L 1. Using similar arguments, one has d E q L(z) z 2 2α2 L k = 1 + 1 Marion, Fermanian, Biau, and Vert A.8 Proof of Corollary 9 The first statement is an immediate consequence of Proposition 7. The second one is a consequence of Proposition 8 and the fact that, for β < 1, 1 + 1 L = exp L ln 1 + 1 Finally, (iii) follows from Proposition 8. A.9 Proof of Proposition 11 The proposition is a consequence of Kloeden and Platen (1992, Theorems 4.5.3 and 10.2.2) for the SDE 1 dσ(H t )d Bt. Letting a(h, t) = 0 and b(h, t) = q 1 dσ(h), we need to check the following assumptions: (H1) The functions a( , ) and b( , ) are jointly measurable on Rd [0, 1]. (H2) There exists a constant C1 > 0 such that, for any x, y Rd, t [0, 1], a(x, t) a(y, t) + b(x, t) b(y, t) C1 x y . (H3) There exists a constant C2 > 0 such that, for any x Rd, t [0, 1], a(x, t) + b(x, t) C2(1 + x ). (H4) E H0 2 < . (H5) There exists a constant C3 > 0 such that, for any x Rd, s, t [0, 1], a(x, t) a(x, s) + b(x, t) b(x, s) C3(1 + x )|t s| 1/2. Assumptions (H1), (H4), and (H5) readily follow from the definitions. Assumption (H2) is true since σ is Lipschitz continuous, and (H3) follows from σ(x) b x x 1 + x . A.10 Proof of Proposition 12 Let ψ : Rd [0, 1] Rd be defined for any h Rd, t [0, 1], by ψ(h, t) = Vtg(h, Wt). With this notation, the ODE (14) is equivalent to the initial value problem d Ht = ψ(Ht, t)dt, H0 = Ax. By Assumptions (A5) and (A6), ψ is Lipschitz continuous in its first argument, in the sense that there exists K > 0 (which may depend on the realization of V and W ) such that, for all h, h Rd, t [0, 1], ψ(h, t) ψ(h , t) K h h . Scaling Res Nets in the Large-depth Regime In addition, it is continuous in its second one. 
Thus, according to the Picard-Lindel of theorem (Theorem 23 in Appendix C), this is enough to show that the neural ODE (14) has a unique solution on [0, 1]. Note that the solution H is continuous on [0, 1] and is therefore bounded by a constant M > 0. In order to prove the approximation bound of Proposition 12, we start by proving that both ψ and H are Lipschitz continuous in t. Under (A5) and (A6), this is clear for ψ since H is bounded. Moreover, for any [s, t] [0, 1], we have Ht Hs = Z t s ψ(Hu, u)du Z t s ψ(Hu, u) du (t s) sup u [0,1] h Rd, h M Now, let K1 and K2 denote the Lipschitz constants of ψ (in both arguments) and H respectively, and, for any 0 k L, let tk = k/L. Then we have, for k 1, = Htk 1 + Z tk tk 1 ψ(Hu, u)du hk 1 1 Lψ(hk 1, tk 1) Htk 1 hk 1 + Z tk tk 1 ψ(Hu, u) ψ(hk 1, tk 1) du Htk 1 hk 1 + K1 tk 1 Hu hk 1 du + K1 tk 1 |u tk 1|du Htk 1 hk 1 + K1 tk 1 Hu Htk 1 du + K1 tk 1 |u tk 1|du Htk 1 hk 1 + (K2 + 1)K1 tk 1 |u tk 1|du Htk 1 hk 1 + (K2 + 1)K1 By recurrence, we obtain j (K2 + 1)K1 2L2 L 1 + K1 L (K2 + 1)K1 e K1 (K2 + 1)K1 which concludes the proof. A.11 Proof of Proposition 13 Starting from (4) and using Assumption (A6), one easily obtains the existence of C1 and C2 (whose values depend on the realization of V and W ) such that hk+1 (1 + C1αL) hk + C2αL. Marion, Fermanian, Biau, and Vert By recurrence, hk+1 (1 + C1αL)k h0 + C2 Hence, using αL 1/L, hk+1 exp(C1) h0 + C2 Since g is Lipschitz continuous on compact sets, it is bounded on every ball of Rd Rp. The result is then a consequence of the identity h L h0 = αL k=0 Vk+1g(hk, wk+1), since we showed that each term in the sum is bounded by some constant C3 > 0, independent of L and k. Hence we have that h L h0 C3LαL = C3L1 β, yielding the results depending on the value of β. A.12 Proof of Proposition 14 In the linear case, (4) can be written hk+1 = hk + αLVk+1hk, 0 k L 1. Take y a unit-norm eigenvector of V 0 with associated eigenvalue λ > 0. Then hk+1, y = hk + αLVk+1hk, y = hk, y + αL hk, V k+1y = hk, y + λαL hk, y + αL hk, (Vk+1 V0) y . Since V is Lipschitz and Vk+1 = Vk+1/L, there exists c such that Vk+1 V0 ck+1 | hk+1, y | (1 + λαL)| hk, y | cαL k + 1 Then, by recurrence, | h L, y | (1 + λαL)L| h0, y | cαL k=0 (k + 1)(1 + λαL)k hk (1 + λαL)L| h0, y | cαL(1 + λαL)L max k hk . Let M = | h0,y | 2cαL , and suppose that hk M for all 0 k L. Then h L | h L, y | (by the Cauchy-Schwartz inequality) (1 + λαL)L | h0, y | c MαL . Scaling Res Nets in the Large-depth Regime Then, for λαL 1, 2(1 + λαL)L| h0, y | 1 2 | h0, y |. Thus, since LαL = L1 β, we have that h L , which contradicts our assumption that hk M for all 0 k L. We deduce that, for all L large enough, max k hk > | h0, y | Furthermore, max k hk h0 h0 > | h0, y | 2c h0 αL 1 L . A.13 Proof of Lemma 16 (E1) and (E2) are simple consequences of Assumptions (A1) and (A2). To prove (E3), let f(hk, wk+1) = Vk+1g(hk, wk+1). Then E(Yk,2Yk,1|hk) = 1 hk 4 E f(hk, wk+1) 2 hk, f(hk, wk+1) hk j=1 f(hk, wk+1)2 i (hk)jf(hk, wk+1)j hk . It is easy to verify that, under Assumption (A1), each term of the sum above has zero expectation. This shows (E3). To establish (E4), we start by noting that hk , f(hk, wk+1) E2 hk = 1 hk 4 E h k f(hk, wk+1)f(hk, wk+1) hk|hk = 1 hk 4 h k E f(hk, wk+1)f(hk, wk+1) |hk hk. Clearly, E(f(hk, wk+1)if(hk, wk+1)j) = 0 for i = j. Since, furthermore, the coordinates of f(hk, wk+1) are identically distributed conditionally on hk, we obtain E f(hk, wk+1)f(hk, wk+1) |hk = 1 d E( f(hk, wk+1) 2|hk)Id. 
hk , f(hk, wk+1) E2 hk = 1 d hk 4 E( f(hk, wk+1) 2|hk)h k hk 1 by Assumptions (A1) and (A2). To prove (E5), let ϕ = Vk+1g(hk,wk+1),hk g(hk,wk+1) hk . Then, for any t > 0, P(|Yk,2| > t) = P |ϕ| > t hk 2αL g(hk, wk+1) = E P |ϕ| > t hk 2αL g(hk, wk+1) Marion, Fermanian, Biau, and Vert So, by Lemma 21 in Appendix B, P(|Yk,2| > t) E 2 exp dt2 hk 2 16α2 Ls2 g(hk, wk+1) 2 = E E 2 exp dt2 hk 2 16α2 Ls2 g(hk, wk+1) 2 E 2 exp dt2 hk 2 16α2 Ls2E( g(hk, wk+1) 2|hk) by Jensen s inequality. Finally, using Assumption (A2), we deduce that P(|Yk,2| > t) E 2 exp dt2 = 2 exp dt2 In particular, for all q 1 (see, e.g., Pauwels, 2020), E(Y 2q k,2) q! 32s2α2 L d The result is obtained by taking q = 2. Finally, (E6) and (E7) are consequences of Lemma 22 in Appendix B. Appendix B. Concentration of Sub-Gaussian Random Matrices In this appendix, we are interested in concentration of moments of sub-Gaussian matrices. We begin by two simple lemmas on second-order moments of random matrix-vector products. Lemma 17 Let W Rd d be a matrix whose entries are centered i.i.d. random variables, with finite variance, and let σ be an activation function such that, for all x R, a|x| |σ(x)| b|x|, 1/ 2 a < b 1. Then, for any x Rd, 1 2E Wx 2 E σ(Wx) 2 E Wx 2 and E Re LU(Wx) 2 = 1 Proof The first part is a consequence of the assumption on σ. To prove the equality, let Xi = Pd j=1 Wijxj. Then E Re LU(Wx) 2 = E d X j=1 Wijxj 2 1Pd j=1 Wijxj 0 i=1 X2 i 1Xi 0 . Since the (Wij)1 j d are centered and independent random variables, Xi is also centered. Hence E(X2 i 1Xi 0) = 1/2E(X2 i ), which concludes the proof. Lemma 18 Let W Rd d be a matrix whose entries are centered i.i.d. random variables, with finite variance s2. Then, for any x Rd, E Wx 2 = s2d x 2. Scaling Res Nets in the Large-depth Regime Proof For any 1 i d, |Wx|2 i = d X j=1 Wijxj 2 = j,j =1 Wij Wij xjxj . Thus, by independence, E |Wx|2 i = E d X j,j =1 Wij Wij xjxj = j=1 E(W 2 ij)x2 j = s2 x 2. (22) The result follows by summing over all i {1, . . . , d}. We now aim at deriving more involved results on concentration of linear and quadratic forms of sub-Gaussian matrices (Lemma 21 and Lemma 22). These two propositions are byproducts of the main result of Kontorovich (2014), which generalizes Mc Diarmid s inequality to sub-Gaussian variables. We start by a technical result regarding the sub Gaussian diameter introduced by Kontorovich (2014), whose definition is recalled below. Definition 19 Let X be a real-valued random variable, X an independent copy of X, and ε a Rademacher random variable, independent of X and X . Then the sub-Gaussian diameter of X is defined as the smallest t such that ε|X X | is t2 sub-Gaussian. Lemma 20 Let X be an s2 sub-Gaussian centered random variable. Then the sub-Gaussian diameter of X is less than Proof Let λ R. Then, using the notation of Definition 19, one has E(expλε|X X |) = E(expλ(X X ) 1ε=1) + E(expλε(X X) 1ε= 1) = E(expλ(X X )) = E(expλX)2 where the last equality is a consequence of the symmetry of X. We are now ready to prove the two main results of this appendix. Lemma 21 (Bound on the deviation of linear forms) Let V be a Rd d matrix whose entries are i.i.d s2/d sub-Gaussian random variables. Then, for any x, y Rd, x, y = 0, x y t 2 exp dt2 Proof For any 1 i, j d, set Xij = xi Vijyj x y . 
Let X = Rd2 endowed with the ℓ1 norm, let X be the vector in X whose (id + j)-th coordinate is Xij, and let the function ϕ be defined by Marion, Fermanian, Biau, and Vert By the triangle inequality, ϕ is a Lipschitz continuous function, with Lipschitz constant equal to 1. Observe also that Xij is a x2 i s2y2 j/d x 2 y 2 sub-Gaussian. Thus, according to Lemma 20, the sub-Gaussian diameter of Xij is less than d x y . By Kontorovich (2014, Theorem 1), for any t > 0, one has P (ϕ(X) t) 2 exp 2 Pd i,j=1 2s2x2 i y2 j d x 2 y 2 x y t 2 exp dt2 Lemma 22 (Bound of moments of quadratic forms) Let V be a Rd d matrix whose entries are i.i.d s2/d sub-Gaussian random variables, with variance 1/d. Then, for any x Rd, x = 0, d and E V x 8 Proof The proof is similar to the one of Lemma 21, with Xij = Vijxj x , X = Rd, and Each function ϕi is a Lipschitz continuous function, with Lipschitz constant equal to 1. Observe now that the random variable Xij is x2 js2/d x 2 sub-Gaussian. Thus, according to Lemma 20, the sub-Gaussian diameter of Xij is less than d x . Therefore, according to Kontorovich (2014, Theorem 1), for any t > 0, P (ϕi(X) t) 2 exp 2 Pd j=1 2s2x2 j d x 2 P | Vi, x | x t 2 exp dt2 Hence (see, e.g., Pauwels, 2020), From identity (22) in the proof of technical Lemma 18, we obtain that, for q = 1, Scaling Res Nets in the Large-depth Regime which is an improvement by a factor 8s2 over the previous upper bound. To conclude, it remains to conclude V x 4 and V x 8 with the Vi, x . To do so, observe that V x 4 = d X i=1 Vi, x 2 2 = Vi, x 2 Vj, x 2 + i=1 Vi, x 4. Hence, by independence of the (Vi)1 i d, i=1 E Vi, x 4 d2 + d2(8s2)2 d2 1 + 128s4 d (by (23) and (24)) V x 8 = d X i=1 Vi, x 2 3 = i,j,k=1 i =j =k Vi, x 2 Vj, x 2 Vj, x 2 + Vi, x 2 Vj, x 4 + i=1 Vi, x 8. i,j,k=1 i =j =k i=1 E Vi, x 8 = d(d 1)(d 2) 1 d2 + 3d(d 1)2(8s2)2 d3 + d6(8s2)3 Appendix C. A Version of the Picard-Lindel of Theorem Theorem 23 Assume that f is Lipschitz continuous in its first argument and continuous in its second one. Then, for any z Rd, the initial value problem d Ht = f(Ht, t)dt, H0 = z, (25) admits a unique solution H : [0, 1] Rd. Marion, Fermanian, Biau, and Vert Proof Let C ([s, t], Rd) be the set of continuous functions from [s, t] to Rd. For any [s, t] [0, 1], ζ Rd, let Ψ be the function Ψ : C ([s, t], Rd) C ([s, t], Rd) Y 7 v 7 ζ + Z v s f(Yu, u)du . For any Y, Y C ([s, t], Rd), v [s, t], one has, denoting by Kf the Lipschitz constant of f in its first argument, Ψ(Y )v Ψ(Y )v Z v f(Yu, u) f(Y u, u) du s Kf Yu Y u du Kf Y Y (t s). This yields Ψ(Y ) Ψ(Y ) Kf(t s) Y Y , which means that the function Ψ is Lipschitz continuous on C ([s, t], Rd) endowed with the supremum norm, with Lipschitz constant Kf(t s). So, on any interval [s, t] of length smaller than δ = 1/2Kf, the function Ψ is a contraction. Thus, by the Banach fixed-point theorem, for any initial value ζ, Ψ has a unique fixed point. Hence, there exists a unique solution to (25) on any interval of length δ with any initial condition. To obtain a solution on [0, 1] it is sufficient to concatenate these solutions. Appendix D. Detailed Experimental Setting and Additional Plots Our code is available at https://github.com/Pierre Marion23/scaling-resnets. To obtain Figures 1 to 3, we initialize Res Nets from res-3 with the hyperparameters of Table 2. 
Table 2: Hyperparameters of Figures 1 to 3

    d                     40
    n_in                  64
    n_out                 1
    L                     10 to 1000
    β                     0.25, 0.5, 1
    weight distribution   uniform on [−√(3/d), √(3/d)]
    data distribution     standard Gaussian

Figure 10: Evolution of ‖h_L − h_0‖/‖h_0‖ as a function of L for different values of β and for the model h_{k+1} = h_k + α_L σ(W_k h_k). Hyperparameters are as in Figure 1.

Figure 11: Empirical distributions of the norms for the model h_{k+1} = h_k + α_L σ(W_k h_k): (a) distribution of ‖h_L‖/‖h_0‖, (b) distribution of the corresponding gradient ratio. Hyperparameters are as in Figure 2.

Figure 12: Evolution of ‖p_0 − p_L‖/‖p_L‖ as a function of L for different values of β and for the model h_{k+1} = h_k + α_L σ(W_k h_k). Hyperparameters are as in Figure 3.

Figure 13: Evolution of ‖h_L − h_0‖/‖h_0‖ as a function of L for different values of β and a smooth initialization of the model h_{k+1} = h_k + α_L σ(W_k h_k). Hyperparameters are as in Figure 4.

Each experiment is repeated 50 times, with independent data and weight sampling. For Figures 4 and 5, we take the same hyperparameters, except for β, which now takes values in {0.5, 1, 2}, and for the weight distribution. The weights are now initialized as discretizations of a Gaussian process. More precisely, each entry of V and W is an independent Gaussian process with zero mean and an RBF kernel of variance 10⁻². We also perform the same experiments with the model h_{k+1} = h_k + α_L σ(W_k h_k), and report the results in Figures 10 to 14. Although this formulation is not covered by our theoretical results (see the discussion at the end of Section 2.1), we observe qualitatively similar results as in Figures 1 to 5. To obtain Figure 7, we take the hyperparameters of Table 3.

Table 3: Hyperparameters of Figure 7

    d                     40
    n_in                  64
    n_out                 1
    L                     1000
    β                     0.2 to 1.3
    weight distribution   fractional Brownian motion with Hurst index from 0.05 to 0.97
    data distribution     standard Gaussian

More precisely, for each 1 ≤ i, j ≤ d, we let (V_{k+1,i,j})_{0≤k≤L−1} be the increments of a fractional Brownian motion (fBm), where the various fBm involved are independent. The procedure is the same for w. In Figure 9, we use res-1, with the hyperparameters of Table 4. Each residual layer also includes a bias term, initialized to zero and trained with the same learning rate as the other weights. We train on MNIST (http://yann.lecun.com/exdb/mnist) and CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) using the Adam optimizer (Kingma and Ba, 2015) for 10 epochs. The learning rate is divided by 10 after 5 epochs. The best performance on the learning rate grid is reported in the figure. A sketch of this training protocol is given below.

Table 4: Hyperparameters of Figure 9

    d                     30
    L                     1000
    β                     0.2 to 1.3
    weight distribution   fractional Brownian motion with Hurst index from 0.05 to 0.97
    learning rate grid    10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1
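For completeness, here is a compact PyTorch sketch of this protocol, intended as an illustration rather than a faithful reimplementation: the layer function used below (a res-1-like update h_{k+1} = h_k + L^{-β}(V_{k+1} ReLU(h_k) + b_{k+1}), our reading of Table 1), the input embedding, the linear readout, and their default initializations are simplifications on our part, and the names ScaledResNet and train_one are ours; the exact architecture and data pipeline are those of the companion repository. The fgn_increments helper from the sketch of Subsection 4.1 is repeated so that the snippet is self-contained.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def fgn_increments(L, H, n_paths, rng):
    # Unit-variance fBm increments (same helper as in the Subsection 4.1 sketch).
    k = np.arange(L)
    gamma = 0.5 * ((k + 1.0) ** (2 * H) + np.abs(k - 1.0) ** (2 * H) - 2.0 * k ** (2 * H))
    chol = np.linalg.cholesky(gamma[np.abs(np.subtract.outer(k, k))] + 1e-8 * np.eye(L))
    return (chol @ rng.standard_normal((L, n_paths))).T

class ScaledResNet(nn.Module):
    # Simplified residual classifier: input embedding, L residual layers
    # h_{k+1} = h_k + L^{-beta} * (V_{k+1} ReLU(h_k) + b_{k+1}), linear readout.
    def __init__(self, in_dim, d, L, n_classes, beta, H, rng):
        super().__init__()
        self.embed = nn.Linear(in_dim, d)
        fbm = fgn_increments(L, H, d * d, rng).reshape(d, d, L).transpose(2, 0, 1) / np.sqrt(d)
        self.V = nn.Parameter(torch.as_tensor(fbm, dtype=torch.float32))
        self.b = nn.Parameter(torch.zeros(L, d))       # zero-initialized, trained bias terms
        self.readout = nn.Linear(d, n_classes)
        self.scale = L ** (-beta)

    def forward(self, x):
        h = self.embed(x.flatten(1))
        for k in range(self.V.shape[0]):
            h = h + self.scale * (F.relu(h) @ self.V[k].T + self.b[k])
        return self.readout(h)

def train_one(beta, H, lr, d=30, L=1000, epochs=10):
    tfm = transforms.ToTensor()
    train_set = datasets.MNIST("data", train=True, download=True, transform=tfm)
    test_set = datasets.MNIST("data", train=False, download=True, transform=tfm)
    model = ScaledResNet(28 * 28, d, L, 10, beta, H, np.random.default_rng(0))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.1)  # lr / 10 after 5 epochs
    for _ in range(epochs):
        for x, y in DataLoader(train_set, batch_size=128, shuffle=True):
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
        sched.step()
    model.eval()
    with torch.no_grad():
        correct = sum((model(x).argmax(1) == y).sum().item()
                      for x, y in DataLoader(test_set, batch_size=256))
    return correct / len(test_set)

# Best test accuracy over the learning rate grid, for one (beta, H) cell of the
# heatmap (very slow on CPU at this depth; move model and batches to a GPU in practice).
best_acc = max(train_one(beta=0.7, H=0.7, lr=lr) for lr in (1e-4, 1e-3, 1e-2, 1e-1, 1.0))

Looping train_one over the grids of β, Hurst index, and learning rate of Table 4 should yield a heatmap of the kind shown in Figure 9, up to the architectural simplifications mentioned above.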
Figure 14: Evolution of ‖p_0 − p_L‖/‖p_L‖ as a function of L for different values of β and a smooth initialization of the model h_{k+1} = h_k + α_L σ(W_k h_k). Hyperparameters are the same as in Figure 5.

Figure 15: Plot of a given coordinate of w_k, after training, as a function of the layer index k ranging from 1 to the depth L = 1000, for (a) β = 1, smooth initialization, (b) β = 1, i.i.d. initialization, and (c) β = 1/2, i.i.d. initialization. A bias term is trained in each residual layer.

Figure 8 is obtained by plotting a random coordinate of w_k, after training on MNIST. While there is no bias term in the model trained for this figure, Figure 15 shows the result of the same experiment when additionally training a zero-initialized bias term in each residual layer. We observe that adding a bias term does not qualitatively change the results.

References

Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via overparameterization. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 242–252. PMLR, 2019.

D. Arpit, V. Campos, and Y. Bengio. How to initialize your network? Robust initialization for WeightNorm & ResNets. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 10902–10911. Curran Associates, Inc., 2019.

T. Bachlechner, B.P. Majumder, H. Mao, G. Cottrell, and J. McAuley. ReZero is all you need: Fast convergence at large depth. In C. de Campos and M.H. Maathuis, editors, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161, pages 1352–1361. PMLR, 2021.

C. Bayer, P.K. Friz, and N. Tapia. Stability of deep neural networks via discrete rough paths. SIAM Journal on Mathematics of Data Science, 5:50–76, 2023.

A. Brock, S. De, and S.L. Smith. Characterizing signal propagation to close the performance gap in unnormalized ResNets. In International Conference on Learning Representations, 2021.

B. Chang, M. Chen, E. Haber, and E.H. Chi. AntisymmetricRNN: A dynamical system view on recurrent neural networks. In International Conference on Learning Representations, 2019.

R.T.Q. Chen, Y. Rubanova, J. Bettencourt, and D.K. Duvenaud. Neural ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 6572–6583. Curran Associates, Inc., 2018.

F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

A.-S. Cohen, R. Cont, A. Rossier, and R. Xu. Scaling properties of deep residual networks. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 2039–2048. PMLR, 2021.

R. Cont, A. Rossier, and R. Xu. Asymptotic analysis of deep residual networks. arXiv:2212.08199, 2022.

S. De and S. Smith. Batch normalization biases residual blocks towards the identity function in deep networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 19964–19975. Curran Associates, Inc., 2020.

L. Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29:141–142, 2012.

W. E, J. Han, and Q. Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6:10, 2019.

A. Fermanian, P. Marion, J.-P. Vert, and G. Biau. Framing RNN as a kernel method: A neural ODE approach. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 3121–3134. Curran Associates, Inc., 2021.
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Y.W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 249–256. PMLR, 2010.

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

W. Grathwohl, R.T.Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

F. Götze and J. Jalowy. Rate of convergence to the circular law via smoothing inequalities for log-potentials. Random Matrices: Theory and Applications, 10:2150026, 2021.

B. Hanin and D. Rolnick. How to start training: The effect of initialization and architecture. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 569–579. Curran Associates, Inc., 2018.

S. Hayou, E. Clerico, B. He, G. Deligiannidis, A. Doucet, and J. Rousseau. Stable ResNet. In A. Banerjee and K. Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pages 1324–1332. PMLR, 2021.

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034. IEEE Computer Society, 2015.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016a.

K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, pages 630–645. Springer International Publishing, 2016b.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 448–456. PMLR, 2015.

S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv:1711.04623, 2017.

P. Kidger, J. Foster, X. Li, and T. Lyons. Efficient and accurate gradients for neural SDEs. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 18747–18761. Curran Associates, Inc., 2021.

D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

P.E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer, Berlin, 1992.

A. Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In E.P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 28–36. PMLR, 2014.

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

A. Lederer, J. Umlauft, and S. Hirche. Uniform error bounds for Gaussian process regression with application to safe control. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 659–669. Curran Associates, Inc., 2019.
X. Li, T.-K.L. Wong, R.T.Q. Chen, and D.K. Duvenaud. Scalable gradients and variational inference for stochastic differential equations. In C. Zhang, F. Ruiz, T. Bui, A.B. Dieng, and D. Liang, editors, Proceedings of the 2nd Symposium on Advances in Approximate Bayesian Inference, volume 118, pages 1–28. PMLR, 2020.

A. Maas, A. Hannun, and A. Ng. Rectifier nonlinearities improve neural network acoustic models. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28. PMLR, 2013.

P. Marion, Y.-H. Wu, M.E. Sander, and G. Biau. Implicit regularization of deep residual networks towards neural ODEs. In International Conference on Learning Representations, 2024.

E. Pauwels. Statistics, Optimization and Algorithms in High Dimension. Lecture Notes, Toulouse 3 Paul Sabatier University, 2020.

S. Peluchetti and S. Favaro. Infinitely deep neural networks as diffusion processes. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, volume 108, pages 1126–1136. PMLR, 2020.

M.E. Sander, P. Ablin, and G. Peyré. Do residual neural networks discretize neural ordinary differential equations? In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 36520–36532. Curran Associates, Inc., 2022.

J. Shao, K. Hu, C. Wang, X. Xue, and B. Raj. Is normalization indispensable for training deep neural networks? In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13434–13444. Curran Associates, Inc., 2020.

M. Thorpe and Y. van Gennip. Deep limits of residual neural networks. Research in the Mathematical Sciences, 10:6, 2023.

R. van Handel. Probability in High Dimension. APC 550 Lecture Notes, Princeton University, 2016.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc., 2017.

H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and F. Wei. DeepNet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46:6761–6774, 2024.

B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853, 2015.

G. Yang and S. Schoenholz. Mean field residual networks: On the edge of chaos. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 2865–2873. Curran Associates, Inc., 2017.

H. Zhang, Y.N. Dauphin, and T. Ma. Fixup initialization: Residual learning without normalization. In International Conference on Learning Representations, 2019a.

H. Zhang, D. Yu, M. Yi, W. Chen, and T.-Y. Liu. Convergence theory of learning overparameterized ResNet: A full characterization. arXiv:1903.07120v4, 2019b.
H. Zhang, D. Yu, M. Yi, W. Chen, and T.-Y. Liu. Stabilize deep ResNet with a sharp scaling factor τ. Machine Learning, 111:3359–3392, 2022.

J. Zhang, B. Han, L. Wynter, B.K.H. Low, and M. Kankanhalli. Towards robust ResNet: A small step but a giant leap. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4285–4291. International Joint Conferences on Artificial Intelligence Organization, 2019c.