Journal of Machine Learning Research 26 (2025) 1-48. Submitted 6/22; Revised 2/25; Published 2/25.

Scaling ResNets in the Large-depth Regime

Pierre Marion (pierre.marion@mines.org), Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, F-75005 Paris, France
Adeline Fermanian (adeline.fermanian@califrais.fr), MINES ParisTech, PSL Research University, CBIO, F-75006 Paris, France; Institut Curie, PSL Research University, F-75005 Paris, France; INSERM, U900, F-75005 Paris, France
Gérard Biau (gerard.biau@sorbonne-universite.fr), Sorbonne Université, CNRS, Laboratoire de Probabilités, Statistique et Modélisation, Institut universitaire de France, F-75005 Paris, France
Jean-Philippe Vert (jean-philippe.vert@mines.org), Google Research, Brain team, Paris, France
(Current affiliations: P. Marion is now at EPFL, A. Fermanian at Califrais, and J.-P. Vert at Owkin.)

Editor: Joan Bruna

Abstract

Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = 1/\sqrt{L}$; other choices lead either to explosion or to identity mapping. This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrarily to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = 1/L$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.

Keywords: ResNets, deep learning theory, neural ODE, neural network initialization, continuous-time models

1. Introduction

We begin by introducing the general context on deep residual networks, before stating our contributions and discussing related work.

©2025 Pierre Marion, Adeline Fermanian, Gérard Biau, and Jean-Philippe Vert. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/22-0664.html.

1.1 Deep Residual Neural Networks

Residual neural networks (ResNets), introduced by He et al. (2016a) in the field of computer vision, were the first deep neural network models successfully trained with several thousand layers. Since then, extensive empirical evidence has shown that increasing the depth leads to significant improvements in performance, while raising new challenges in terms of training (e.g., Wang et al., 2024). From a high-level perspective, the key feature of ResNets is the presence of skip connections between successive layers.
In mathematical terms, this means that the $(k+1)$-th hidden state $h_{k+1} \in \mathbb{R}^d$ follows sequentially from the previous hidden state via the recurrence relation

$h_{k+1} = h_k + f(h_k, \theta_{k+1}), \quad 0 \le k \le L-1, \qquad (1)$

where $f(\cdot, \theta_{k+1}) : \mathbb{R}^d \to \mathbb{R}^d$ is the layer function parameterized by $\theta_{k+1} \in \mathbb{R}^p$ and $L$ is the number of layers. The skip connection corresponds to the addition of $h_k$ on the right-hand side of (1), which is absent in classical feedforward networks. This refinement prevents instability issues during training when $L$ is large, provided training is performed in a careful way (He et al., 2015). The idea of adding skip connections has become common practice in the field of deep learning, and is today incorporated in many other models such as Transformers in natural language processing (Vaswani et al., 2017). For simplicity, in the rest of the paper, we continue to use the terminology ResNets to denote any architecture of the form (1), keeping in mind that this framework goes beyond the original model of He et al. (2016a).

The most common architectures have 50-150 layers, but ResNets can be trained with depths up to the order of a thousand layers (He et al., 2016b). Yet, the training procedure needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth increases. As pointed out by, e.g., Shao et al. (2020), these instabilities are related to a shift in the magnitude of the variance of a signal as it passes through the network. In the original approach of He et al. (2016a), the issue was mitigated by adding a normalization step, called batch normalization (Ioffe and Szegedy, 2015), which rescales the output of each layer via centering and unit variance normalization. However, this normalization stage introduces practical and theoretical difficulties, among which computational overhead and strong dependence on the batch size (see Brock et al., 2021, and the references therein). A widespread alternative to stabilize training in deep models, explored for example by Yang and Schoenholz (2017), Arpit et al. (2019), Zhang et al. (2019b), and De and Smith (2020), is to incorporate a scaling factor $\alpha_L$ in front of the residual term in (1), yielding the model

$h_{k+1} = h_k + \alpha_L f(h_k, \theta_{k+1}), \quad 0 \le k \le L-1. \qquad (2)$

There is strong evidence that this scaling factor $\alpha_L$ should depend on $L$, without however any consensus to date on the exact form of this dependence, nor on the mathematical grounding of the approach. Thus, despite progress on the empirical side, the mathematical forces in action behind the stability of deep ResNets are still poorly understood, although they are key to unlock training at arbitrary depth.

Our goal in the present paper is to take a step towards a better theoretical understanding of deep ResNets by providing a thorough probabilistic analysis of the sequence $(h_k)_{0 \le k \le L}$ at initialization when $L$ is large, and by leveraging a continuous-time interpretation of model (2) via the so-called neural ordinary differential equation (neural ODE, Chen et al., 2018) paradigm. In a nutshell, our results highlight the intimate connection that exists at initialization between stability of the learning process, the regularity of the weights, and the scaling factor $\alpha_L$. We offer in particular a proper mathematical grounding on why and how to choose the parameter $\alpha_L$ as a function of the depth $L$ and the distribution of the weights.

1.2 Our Contributions

Scaling at initialization.
The optimal parameters of Res Nets are learned by minimizing some empirical risk function via a gradient descent algorithm. As highlighted for example by Yang and Schoenholz (2017), Hanin and Rolnick (2018), and Arpit et al. (2019), a good parameter initialization of this learning phase plays a major role in the quality of the learned model, in particular to avoid vanishing gradients and deadlock at initialization, or exploding gradients and quick divergence of the model parameters at the beginning of training. Moreover, a good initialization allows the use of larger learning rates, which have been shown to correlate with better generalization (Jastrzkebski et al., 2017). It is thus of great interest to study and understand the role played by scaling of deep Res Nets at initialization. This is the context in which we place ourselves in the sequel. At initialization stage, the weights (θk)1 k L are usually chosen as (realizations of) independent and identically distributed (i.i.d.) random variables, which typically follow a uniform or Gaussian distribution on Rp. Accordingly, the sequence (hk)0 k L that results from the recursion (2) for a given input to the network takes the form of a sequence of random variables that are not i.i.d. but are actually a martingale. Thus, denoting informally by L the differentiable loss associated with the learning task (classification or regression), the distributions of (hk)0 k L and ( L hk )0 k L as L becomes large carry useful information on the stability of training. For instance, exploding gradients in the backpropagation phase of learning correspond to the fact that, with high probability, L h L , where denotes the Euclidean norm. Our first contribution, in Section 2, is to provide thorough mathematical statements on the behavior of these distributions (both for finite and infinite L), depending on the value of αL. Among other results, we show that only the choice αL 1/ L yields a non-trivial behavior at initialization, thereby confirming empirical findings in the literature (Arpit et al., 2019; De and Smith, 2020). For αL 1/ L, the norms explode exponentially fast with L, which is inappropriate for training. For αL 1/ L, the network is almost equivalent to identity, that is, h L h0. The analysis of the different cases as a function of αL is mathematically involved and makes extensive use of concentration tools from random matrix theory. The continuous approach. As noticed by several authors (Chen et al., 2018; E et al., 2019; Thorpe and van Gennip, 2023), model (2) with a scaling factor αL = 1/L (and not 1/ L) is formally similar to the discretization of a differential equation. Thus, when L tends to infinity, the weights and hidden states change continuously with the layer according to the equation d Ht dt = f(Ht, Θt), t [0, 1]. (3) Here, time t is the continuous analogue of the layer index k, H : [0, 1] Rd is a continuoustime hidden state, and Θ : [0, 1] Rp a continuous-time parameter. This important Marion, Fermanian, Biau, and Vert connection between Res Nets and differential equations has been identified in the past years under the umbrella name of neural ODE. Since the original article of Chen et al. (2018), this point of view has led to the development of a variety of new continuous-time models, together with innovative architectures and efficient training algorithms (Chang et al., 2019; Grathwohl et al., 2019; Kidger et al., 2021). 
The neural ODE paradigm also enabled to leverage the rich theory of differential equations to better understand the mechanisms at work behind deep Res Nets (E et al., 2019; Fermanian et al., 2021). However, there is a debated question in the neural ODE community about the choice αL = 1/L, which guarantees convergence of the discrete model (2) to its continuous-time counterpart (3). As a matter of fact, it seems that this choice is guided by more mathematical than practical considerations, and several authors have suggested that it is inconsistent with what is done in practice (Cohen et al., 2021; Bayer et al., 2023). Moreover, letting αL = 1/L is somewhat contradictory with the results discussed above, which highlighted that the only non-trivial limit at initialization is αL = 1/ L. Thus, as a second contribution, we clarify the problem in Section 3 by leveraging our previous results on stability. We show that the value αL = 1/ L corresponds in the continuous world to a neural stochastic differential equation (SDE) of the form (3), where now Θ : [0, 1] Rp takes the form of a continuous-time stochastic process, typically a Brownian motion. By contrast, we also prove that the neural ODE regime with αL = 1/L corresponds to the limit of a Res Net, not with i.i.d. weights as considered before, but with more complex and correlated weight distributions. For these weight distributions, the scaling αL = 1/L is also a critical value between explosion and identity. Going further, our third contribution is to exhibit in Section 4 a continuous range of regimes that are controlled by the choice of αL (beyond the cases 1/ L and 1/L) and the distribution of (θk)1 k L at initialization, derived from a continuous-time process Θ with a regularity different from a Brownian motion. More precisely, we show experimentally that there is a strong interplay (with the same three cases explosion, identity mapping, non-trivial behavior) between the choice of αL and the regularity of (θk)1 k L as a function of the layer index k. In addition, empirical evidence suggests that this interplay impacts both the behavior and performance of the networks during training, beyond initialization. 1.3 Related Work The choice of scaling for Res Nets has been discussed in many papers, without however reaching a clear consensus on the form this scaling factor should take. For instance, Hanin and Rolnick (2018) state that stability requires αL 1/L, while Zhang et al. (2019b) show that αL 1/ L is enough to ensure stability. On the other hand, Cohen et al. (2021) claim that the scaling factor observed in practice in trained Res Nets is of the form 1/Lβ with β 0.7. Other authors have proposed more complex choices for αL (e.g., Zhang et al., 2019a; Shao et al., 2020). Taking another point of view, De and Smith (2020) observe that batch normalization is empirically equivalent to taking a 1/ L normalization factor. Bachlechner et al. (2021) suggest learning a scaling parameter αk that is allowed to vary from one layer to another, whereas, in (4), αL is kept constant across layers. These authors observe a great acceleration for training compared to traditional Res Nets with no scaling. They also suggest a similar architecture for Transformers and then notice that αk 1/L at the end of training. Scaling Res Nets in the Large-depth Regime Closest to our analysis at initialization are the papers of Arpit et al. (2019) and Zhang et al. (2019b). Arpit et al. 
(2019) develop a theoretical analysis based on mean field approximation that suggests that a scaling factor αL = 1/ L prevents vanishing/exploding gradients at initialization, and provide experimental evidence that this approach is competitive with batch normalization. However, the authors do not provide rigorous mathematical statements for the three different cases αL 1/ L, and αL 1/ L, nor do they highlight the connection with the continuous-time interpretation. Interestingly, the idea of exploiting the martingale structure to analyze the magnitude of the hidden states is present in Zhang et al. (2019b), who study the convergence of gradient descent for over-parameterized Res Nets with different values of αL. Nevertheless, they consider a specific model with Gaussian weights, and only provide asymptotic results when both width and depth tend to infinity. The asymptotic limit in this particular regime has been studied by, e.g., Allen-Zhu et al. (2019) and Hayou et al. (2021). We depart from this point of view by considering a finite-width setting. The connection between the choice of scaling and the continuous-time point of view has previously been noticed by Zhang et al. (2019c), then studied in detail by Cohen et al. (2021) and Cont et al. (2022). These authors show that, under assumptions on the form of the weights, it is possible to derive limiting (stochastic or ordinary) differential equations for the hidden states. However, they do not discuss the transition between these two regimes, nor do they link differential equations regimes with the stability of the network. 2. Scaling at Initialization Our goal in this section is to study the effect of the scaling factor αL on the stability of Res Nets at initialization, assuming that the weights are i.i.d. random variables. We start by making more precise the model and the learning problem introduced in (1). 2.1 Model and Assumptions Model. The data is a sample of n pairs (xi, yi)1 i n, where xi is the input vector in Rnin and yi Rnout is the output vector to be predicted. This setting includes regression and classification (after one-hot encoding of the labels). Specifying the informal recurrence (1), for any input x Rnin, we consider the output Fπ(x) Rnout of the Res Net model defined by hk+1 = hk + αLVk+1g(hk, wk+1), 0 k L 1, Fπ(x) = Bh L, where αL > 0 is the scaling factor of the Res Net and π = (A, B, (wk)1 k L, (Vk)1 k L) are its parameters, with A Rd nin, B Rnout d, wk Rp and Vk Rd d for k = 1, . . . , L. The almost-everywhere differentiable function g : Rd Rp Rd encodes the choice of architecture. We note that the model includes initial and final linear layers in order to map the input space Rnin into the space of hidden states Rd, and symmetrically to map the last hidden state h L into the output space Rnout. These two transformations are of little interest to us, since we mostly focus on the behavior of the sequence of hidden states (hk)0 k L. Let Marion, Fermanian, Biau, and Vert us finally notice that the results of this section can be adapted to hidden layers that do not have the same width, at the cost of increased technicality. An important feature of model (4) is that the layer function takes the form of a matrixvector multiplication, which will prove crucial to make use of concentration results on random matrices. We stress that this setting is standard in practice and that it encompasses many types of Res Nets. 
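To make the setting concrete, here is a minimal NumPy sketch of the forward pass of model (4). This is our own illustration, not the authors' code: the names (`resnet_forward`, `g`) are ours, the layer function is taken as the simple choice g(h, w) = σ(h) with σ = ReLU, and the weights use the uniform U(±√(3/d)) initialization discussed later in the section. The specific choices of g covered by the analysis are listed next.

```python
import numpy as np

def resnet_forward(x, A, B, Vs, alpha_L, g):
    """Forward pass of model (4): h_{k+1} = h_k + alpha_L * V_{k+1} g(h_k, w_{k+1})."""
    h = A @ x                      # initial linear layer, R^{n_in} -> R^d
    for V in Vs:                   # one (d, d) matrix per residual layer
        h = h + alpha_L * V @ g(h)
    return B @ h                   # final linear layer, R^d -> R^{n_out}

# Example with g(h, w) = ReLU(h) and the uniform U(+-sqrt(3/d)) initialization,
# whose entries have variance 1/d.
rng = np.random.default_rng(0)
n_in, d, n_out, L, beta = 64, 40, 10, 1000, 0.5
A = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(d, n_in))
B = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_out, d))
Vs = [rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d)) for _ in range(L)]
out = resnet_forward(rng.normal(size=n_in), A, B, Vs,
                     alpha_L=L ** (-beta), g=lambda h: np.maximum(h, 0.0))
```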
It includes for example simple Res Nets where g(h, w) = σ(h) with σ the activation function, and the original Res Nets from He et al. (2016a), which have g(h, w) = Re LU(Wh + b), where the parameter is a pair w = (W, b) with W Rd d a weight matrix and b Rd a bias, and Re LU: x 7 max(x, 0) is applied element-wise. This setting also includes attention layers, where g corresponds to the scaled dot-product between keys and queries, as well as convolutional layers. The assumptions made below do not cover these latter cases, making it an interesting avenue for future research to adapt our approach to these cases. Throughout the article, we let ℓ: Rnout Rnout R+ be a loss function, differentiable w.r.t. its first parameter, for example the squared loss or the cross-entropy loss. The objective of learning is to find the optimal parameter π that minimizes the empirical risk L (π) = Pn i=1 ℓ(Fπ(xi), yi). Probabilistic setting at initialization. The minimization of the empirical risk is usually performed by stochastic gradient descent or one of its variants (Goodfellow et al., 2016, Chapter 8). The gradient descent is initialized by choosing the weights as (realizations of) i.i.d. random variables. The parameters w1, V1, . . . , w L, VL in model (4) are therefore assumed to be an i.i.d. collection of random variables, where we recall that wk Rp and Vk Rd d parameterize the k-th layer of the network. In this stochastic context, the successive hidden states h0, . . . , h L given a fixed input x are also random variables, but their distribution is not i.i.d. in fact, under our assumptions, this sequence is a martingale. To avoid unnecessary technicalities, we assume that the sequence (hk)0 k L is non-atomic. This is for example the case if the distribution of the parameters is absolutely continuous w.r.t. the Lebesgue measure. In particular, this ensures that the sequence (hk)0 k L almost surely does not hit the non-differentiability points of g. It is stressed that the distribution of the parameters are assumed to be independent of the depth, so that all the dependence on L is captured in the scaling factor αL. This model enables us to consider multiple architectures at once, via the function g. By contrast, some authors formulate the problem of scaling as a choice of the variance at initialization (e.g., Yang and Schoenholz, 2017; Wang et al., 2024), which makes the analysis architecturedependent. However, for a given architecture, these two approaches are essentially equivalent since Var(αLVk) = α2 L Var(Vk). The quantity h L h0 / h0 carries key information on the behavior of the network at initialization. On the one hand, if h L h0 h0 , the network is essentially equal to the identity function. On the other hand, if h L h0 h0 , the output of the network explodes. An intermediate situation is when h L h0 h0 . In addition, another source of information is provided by the gradients of the hidden states with respect to the empirical risk L . If L h L , the gradients do not change as they flow through the network, which means that the exact same information is backpropagated Scaling Res Nets in the Large-depth Regime throughout the network. Conversely, if L h L , the gradients explode during backpropagation. By exploiting the martingale structure of ( hk )0 k L, as well as state-ofthe-art concentration inequalities for random matrices with sub-Gaussian entries, we provide in this section probabilistic bounds on the magnitude of these various quantities. Assumptions. 
Some assumptions are needed on the choices of architecture and initialization. Recall that a centered real-valued random variable X is said to be s2 sub-Gaussian (van Handel, 2016, Chapter 3) if for all λ R, E(exp(λX)) exp(λ2s2/2). The sub-Gaussian property is a constraint on the tail of the probability distribution. As an example, Gaussian random variables on the real line are sub-Gaussian and so are bounded random variables. The following assumptions will be needed throughout the section: for any 1 k L, (A1) For some s 1, the entries of d Vk are centered i.i.d. s2 sub-Gaussian random variables, independent of d and L, with unit variance. (A2) For some C > 0, independent of d and L, and for any h Rd, 2 E g(h, wk) 2 h 2 and E g(h, wk) 8 C h 8. Assumption (A1) is mild and satisfied by all initializations used in practice. For example, the classical Glorot initialization (Glorot and Bengio, 2010) which is the default implementation in the Keras package (Chollet et al., 2015) takes the entries of Vk as uniform U( p 3/d) variables. This means that d Vk is initialized with U( 3) random variables, which satisfy (A1). Other examples include the Gaussian N(0, 1/d) initialization of He et al. (2015) and, for example, initialization with Rademacher variables. The first part of Assumption (A2) ensures that g( , wk) is not too far away from being an isometry in expectation. The second part is more technical and, roughly, allows to upper bound the deviations of the norm of g(hk 1, wk). Our next Proposition 1 shows that most classical Res Net architectures verify Assumption (A2). For the sake of readability, these models, together with their parameters, are summarized in Table 1 below. Name Recurrence relation Parameters of g res-1 Simple Res Net hk+1 = hk + αLVk+1σ(hk) wk+1 = res-2 Parametric Res Net hk+1 = hk + αLVk+1σ(Wk+1hk) wk+1 = Wk+1 res-3 Original Res Net hk+1 = hk + αLVk+1 Re LU(Wk+1hk) wk+1 = Wk+1 Table 1: Examples of Res Net architectures considered in the paper. In the first two cases, the activation function σ is such that, for all x R, a|x| |σ(x)| b|x|, 1/ 2 a < b 1. In the last two cases, Wk+1 Rd d. Proposition 1 Let res-1, res-2, and res-3 be the models defined in Table 1. Then (i) Assumption (A2) is satisfied for res-1. Marion, Fermanian, Biau, and Vert (ii) Assumption (A2) is satisfied for res-2 and res-3, as soon as the entries of d Wk+1, 0 k L 1, are centered i.i.d. sub-Gaussian random variables, independent of d and L, with unit variance. In the models res-1 and res-2, σ can be, for instance, taken as the parametric Re LU function, i.e., σ(x) = x+ + sx , where x+ (resp. x ) denotes the positive (resp. negative) part and the slope s [1/ 2, 1] is a parameter of the model. Parametric Re LU, also known as leaky Re LU (Maas et al., 2013), has been shown to outperform standard Re LU for image datasets (Xu et al., 2015). Observe also that res-2 differs from res-3 since the classical Re LU function is defined by Re LU(x) = x+ and thus does not satisfy the condition |σ(x)| a|x|. Note that there is no bias term in these three models, as this term is commonly initialized to zero, and we are interested in the behavior at initialization. Remark 2 An important architecture that is not covered by our setting is hk+1 = hk + αLσ(Wk+1hk), where compared to res-1 the weight matrix is located inside the activation function σ. 
This is due to the fact that the residual branch writes as a matrix-vector product in our model (4), which enables us to leverage matrix concentration inequalities (see, e.g., Lemmas 21 and 22). However, we check numerically that qualitatively similar results are obtained in this case (see Appendix D). This leads us to believe that similar concentration results should hold true when the matrix-vector product is followed by an element-wise activation function. We leave this mathematical study for future work. 2.2 Probabilistic Bounds on the Norm of the Hidden States The next two propositions describe how the quantity h L h0 / h0 changes as a function of Lα2 L. Proposition 3 provides a high-probability bound of interest when Lα2 L 1. In this case, we see that, with high probability, the network acts as the identity function, directly mapping h0 to h L. On the other hand, Proposition 4 provides information in the two cases Lα2 L 1 and Lα2 L 1. When Lα2 L 1, the lower bound (i) indicates an explosion with high probability of the norm of the last hidden state. On the other hand, when Lα2 L 1, the bounds (i) and (ii) show that h L randomly varies around h0 with fluctuation sizes bounded from below and above. Proposition 3 Consider a Res Net (4) such that Assumptions (A1) and (A2) are satisfied. If Lα2 L 1, then, for any δ (0, 1), with probability at least 1 δ, h0 2 2Lα2 L δ . Proposition 4 Consider a Res Net (4) such that Assumptions (A1) and (A2) are satisfied. (i) Assume that d 64 and α2 L 2 ( C+16s4)d. Then, for any δ (0, 1), with probability at least 1 δ, Scaling Res Nets in the Large-depth Regime provided that 2L exp d 64α2 Ls2 (ii) Assume that α2 L 1 C(d+128s4). Then, for any δ (0, 1), with probability at least 1 δ, We recall that the constants s and C appearing in Proposition 4 are defined respectively by Assumptions (A1) and (A2). Moreover, note that the assumptions of Proposition 4 on d and αL are mild, since in the learning tasks where deep Res Nets are involved, one typically has αL = 1/Lβ with β > 0, d 102 and L 102. Furthermore, the assumption on αL is a technical assumption used to simplify the final expression. It is also possible to derive a proof without this assumption, at the cost of a more intricate result. Note also that condition (5) is not severe since, when d and L are large, it encompasses all reasonable values of δ. Propositions 3 and 4 are interesting in the sense that they provide finite-depth high-probability bounds on the behavior of the hidden states, depending on the magnitude of Lα2 L. The results become clearer by letting αL = 1/Lβ, with β > 0, as shown in the following corollary. Corollary 5 Consider a Res Net (4) such that Assumptions (A1) and (A2) are satisfied, and let αL = 1/Lβ, with β > 0. (i) If β > 1/2, then h L h0 (ii) If β < 1/2 and d 9, then h L h0 (iii) If β = 1/2, d 64, L (1 C + 8s4)d + 96 Cs4, then, for any δ (0, 1), with probability at least 1 δ, 1 < h L h0 2 provided that Corollary 5 highlights three different asymptotic behaviors for h L , depending on the values of β. For β > 1/2, statement (i) tells that h L converges towards h0 in probability, as L tends to infinity, which means that the neural network is essentially equivalent to an identity mapping. On the other hand, for β < 1/2, the norm of h L explodes with high probability. 
Finally, for the critical value β = 1/2, we see that h L fluctuates around h0, with Marion, Fermanian, Biau, and Vert 0 250 500 750 1000 L 0 250 500 750 1000 L 0 250 500 750 1000 L Figure 1: Evolution of h L h0 / h0 as a function of L for different values of β and an i.i.d. U( p 3/d) initialization of model res-3, with d = 40. The input is a random Gaussian observation x in dimension nin = 64. The experiment is repeated with 50 independent randomizations. 1.0 1.2 1.4 1.6 0 (a) Distribution of h L / h0 1.0 1.2 1.4 1.6 0 (b) Distribution of L Figure 2: Empirical distributions of the norms for β = 1/2, L = 103, d = 100. The experiment is repeated with 104 independent randomizations. Scaling Res Nets in the Large-depth Regime a fluctuation size independent of L. Observe that the lower bound in (iii) is not trivial as soon as exp(3/8 p 11/dδ) > 1, i.e., d > 99/64δ. The message of Corollary 5 is that the only scaling leading to a non-degenerate distribution at initialization is for β = 1/2. The three statements of Corollary 5 are illustrated in Figure 1. In this experiment, we consider model res-3, a random Gaussian observation x in dimension nin = 64, and parameters initialized with a uniform distribution U( p 3/d). We refer to Appendix D for a detailed setup of all the experiments of the paper. Figure 2a shows the empirical distribution of h L / h0 when β = 1/2 for a large number of realizations. This figure illustrates in particular that our bounds are reasonably sharp, since the bounds indicate that the first quartile of the distribution is larger than 0.87 (whereas the first quartile of the empirical histogram is equal to 1.21) and the third quartile is less than 2.06 (whereas the third quartile of the empirical histogram is equal to 1.34). Determining the exact distribution of h L / h0 is an interesting avenue for future research that is beyond the scope of the present article. There is however a strong indication that the ratio follows a log-normal distribution, as confirmed by a normality test on (the log of) the empirical distribution. In a nutshell, the proofs of Propositions 3 and 4 rest upon controlling of the norm of the hidden states, which obeys the recurrence hk+1 2 = hk 2 + α2 L Vk+1g(hk, wk+1) 2 + 2αL hk, Vk+1g(hk, wk+1) , (6) where , denotes the standard scalar product in Rd. Taking the expectations on both side, one deduces with Assumptions (A1) and (A2) that E Vk+1g(hk, wk+1) 2 = E E Vk+1g(hk, wk+1) 2 hk, wk+1 = E g(hk, wk+1) 2 hk 2 (7) and E hk, Vk+1g(hk, wk+1) = E E hk, Vk+1g(hk, wk+1) hk, wk+1 = 0. (8) The equalities (7) and (8) allow deriving without further work bounds in expectation on h L , as already observed in an informal manner by Arpit et al. (2019). However, the results we are after are stronger since they involve precise quantification of the fluctuations induced by the initialization through high-probability bounds. A finer control of the deviations of Vk+1g(hk, wk+1) 2 and hk, Vk+1g(hk, wk+1) is then needed. This involves concentration inequalities on random matrices with sub-Gaussian entries, and martingale arguments. This probabilistic derivation allows improvements over earlier works in the literature that only show stability for β 1 (Hanin and Rolnick, 2018; Allen-Zhu et al., 2019). A similar proof technique was already used in Zhang et al. (2019b) to show in an asymptotic setting explosion of the forward pass for β < 1/2 and stability for β 1/2, in accordance with our results. 
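The conditional identities (7) and (8), which drive the martingale argument above, are easy to check by simulation. The following sketch (our code, with the U(±√(3/d)) initialization of Figure 1 and model res-3, g(h, w) = ReLU(Wh)) estimates the three quantities by Monte Carlo for a fixed h: the first two estimates should agree with each other and lie close to ∥h∥²/2, while the cross term should vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_mc = 40, 20_000
h = rng.normal(size=d)

sq_Vg, sq_g, cross = 0.0, 0.0, 0.0
for _ in range(n_mc):
    V = rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d))
    W = rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d))
    g = np.maximum(W @ h, 0.0)            # g(h, w) = ReLU(W h), model res-3
    sq_Vg += np.dot(V @ g, V @ g) / n_mc  # estimates E ||V g(h, w)||^2, cf. (7)
    sq_g += np.dot(g, g) / n_mc           # estimates E ||g(h, w)||^2
    cross += np.dot(h, V @ g) / n_mc      # estimates E <h, V g(h, w)>, cf. (8)

print(sq_Vg, sq_g, np.dot(h, h) / 2, cross)
```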
We extend their result in several ways, most notably by considering a sub-Gaussian initialization relaxing their Gaussian assumption, a more general architecture, and obtaining fully non-asymptotic bounds, while their approach is asymptotic both in width and depth. This makes the mathematical analysis significantly more challenging. We also introduce a novel distinction between the stability case β = 1/2 and the identity case β > 1/2, showing the presence of non-vanishing fluctuations only for β = 1/2. For completeness, we note that a revised version of their paper does provide non-asymptotic bounds (Zhang et al., 2022), Marion, Fermanian, Biau, and Vert albeit still in the more restrictive setting of a Gaussian initialization, a specific architecture, and without separating the cases β = 1/2 and β > 1/2. We next show precise non-asymptotic bounds for the gradients, using the martingale structure of the forward differentiation recurrence, a novel proof technique. 2.3 Probabilistic Bounds on the Gradients Propositions 3 and 4 provide insights on the output of the network when L is large. However, they do not carry information on the backwards dynamics of propagation of the gradients of the loss pk = L hk Rd. Assessing the dynamics of the (pk)0 k L as a function of L is important since the behavior of this sequence impacts trainability of the network at initialization. Thus, in this subsection, we are interested in quantifying the magnitude of p0 p L / p L , when L is large. Notice that, contrarily to the previous subsection where we were mostly interested in the last hidden state h L, the quantity of interest is now p0 (not p L), the gradient at index 0. The reason is that the sequence (pk)0 k L is defined backwardly, as we will see below. We also stress that (pk)0 k L is the sequence of derivatives of the loss w.r.t. the hidden states hk, and not w.r.t. the parameters. The reason for considering this sequence is that the pk are involved in the backpropagation algorithm and are therefore essential for assessing the stability of the gradient descent (see, e.g., Arpit et al., 2019). Analyzing the behavior of the sequence (pk)0 k L is challenging since, according to the backpropagation (or reverse-mode differentiation) formula, one has pk = pk+1 + αL g(hk, wk+1) h V k+1pk+1. Taking the norm, pk 2 = pk+1 2 + α2 L g(hk, wk+1) h V k+1pk+1 2 + 2αL D pk+1, g(hk, wk+1) h V k+1pk+1 E . Although the equation looks qualitatively similar to (6), it has the unpleasant feature that pk+1 depends on the whole sequence of weights w1, V1, . . . , w L, VL. This forbids applying directly the same proof techniques as for the hidden states due to the lack of adaptation of the pk to the filtration of the hidden state process. Simplifying assumptions have sometimes been made to analyze this recurrence equation, for instance assuming independence of g(hk,wk+1) h and hk (see, e.g., Yang and Schoenholz, 2017). However, such assumption remains a strong requirement, which is not verified for many network architectures (for example model res-1). Other authors resort to an ε-net argument (Allen-Zhu et al., 2019; Zhang et al., 2019b). We tackle the problem from a different point of view and propose an alternative approach based on forward-mode differentiation, valid under a much weaker assumption. The cost we pay is that we obtain results in expectation and not in high probability. Let us sketch our approach before stating the results more formally. 
We denote by z Rd an independent random variable that will be used to assess the magnitude of the gradients. For any 0 i, j L, let hj hi Rd d be the Jacobian matrix of hj with respect to hi. Recall that the (m, n)-th entry of this matrix equals the derivative of the m-th coordinate of hj w.r.t. the n-th coordinate of hi. Then, letting qk(z) = hk h0 z, we have, by the chain rule, qk+1(z) = hk+1 hk qk(z) = qk(z) + αLVk+1 g(hk, wk+1) h qk(z). (9) Scaling Res Nets in the Large-depth Regime Identity (9), which is similar to (4), expresses qk+1(z) as a function of qk(z), and therefore respects the flow of information. Next, assuming that z is random with a Gaussian distribution, it is possible to express one of our quantities of interest, p0 / p L , as a function of the last vector q L(z). Indeed, p L 2 = 1 p L 2 Ez N(0,Id) p 0 z 2 = Ez N(0,Id) q L(z) 2 , (10) where Id is the identity matrix in Rd and the second equality is a consequence of h0 z = p Lq L(z). In summary, the recurrence (9) allows us to derive bounds on the norm of q L(z), which can then transfer to p0 / p L via (10). For this, it is necessary to make the following assumption on the ratio p L/ p L : (A3) Let b = p L/ p L . Then E(b|h L) = 0 and E(b b|h L) = Id/d. It is a mild assumption, which is verified for instance if nout = 1 with squared error (for regression) or cross-entropy (for binary classification). In these cases, p L/ p L = B / B F, where F is the Frobenius norm and B is the weight matrix of the last layer. We finally need the following assumption, which is the equivalent of Assumption (A2) for the gradients. (A4) One has, almost surely, 2 E g(hk, wk+1) h qk 2 hk, qk Assumption (A4) is satisfied by all the standard architectures listed in Table 1, as shown by the next proposition. Proposition 6 Let res-1, res-2, and res-3 be the models defined in Table 1. Assume that (A1) is satisfied and σ is almost everywhere differentiable, with a σ b. Then (i) Assumption (A4) is satisfied for res-1. (ii) Assumption (A4) is satisfied for res-2 and res-3, when the entries of d Wk, 1 k L, are centered i.i.d. random variables, independent of d and L, with unit variance. The next two propositions are the counterparts of Proposition 3 and Proposition 4 for the gradient dynamics. Proposition 7 Consider a Res Net (4) such that Assumptions (A1)-(A4) are satisfied. If Lα2 L 1, then, for any δ (0, 1), with probability at least 1 δ, p L 2 2Lα2 L δ . Marion, Fermanian, Biau, and Vert Proposition 8 Consider a Res Net (4) such that Assumptions (A1)-(A4) are satisfied. Then 2α2 L L 1 E p0 p L 2 (1 + α2 L)L 1. A simple corollary of the propositions above is as follows. Corollary 9 Consider a Res Net (4) such that Assumptions (A1)-(A4) are satisfied, and take αL = 1/Lβ, with β > 0. Then (i) If β > 1/2, p0 p L (ii) If β < 1/2, (ii) If β = 1/2, 1 E p0 p L 2 Corollary 9 is illustrated in Figure 3. The experimental protocol is the same as in Figure 1, but we now track p0 and p L, the gradients of the loss L with respect to the first and the last hidden states. In accordance with our results, when β > 1/2, the gradient remains the same from one layer to another (left plot). On the other hand, the middle plot clearly shows that when β < 1/2 the gradient explodes. Once again, the case β = 1/2 (right plot) is the only one for which the distribution of gradients at initialization is non-trivial. Figure 2b illustrates that the empirical distribution of gradients in this case also seems to be log-normal. 
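The forward-mode recurrence (9) is straightforward to simulate. The sketch below (our code, not the authors') propagates a few Gaussian probes q_k(z) alongside the hidden state for model res-3 and reports the average growth of ∥q_L(z)∥²/∥q_0(z)∥², a rough proxy for the gradient ratio studied in Corollary 9; the three values of β reproduce the identity, explosion, and critical behaviors of Figure 3.

```python
import numpy as np

def jvp_growth(L, beta, d=40, n_z=20, seed=0):
    """Propagate q_k(z) = (dh_k/dh_0) z via recurrence (9) for res-3 and report
    the average of ||q_L(z)||^2 / ||q_0(z)||^2 over n_z Gaussian probes z."""
    rng = np.random.default_rng(seed)
    alpha = L ** (-beta)
    h = rng.normal(size=d)                 # stands in for h_0 = A x
    Q0 = rng.normal(size=(d, n_z))         # columns are the probes z, q_0(z) = z
    Q = Q0.copy()
    for _ in range(L):
        V = rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d))
        W = rng.uniform(-np.sqrt(3.0 / d), np.sqrt(3.0 / d), size=(d, d))
        pre = W @ h
        mask = (pre > 0).astype(float)     # for ReLU, dg/dh = diag(1{W h > 0}) W
        Q = Q + alpha * V @ (mask[:, None] * (W @ Q))   # recurrence (9)
        h = h + alpha * V @ np.maximum(pre, 0.0)        # recurrence (4)
    return np.mean(np.sum(Q ** 2, axis=0) / np.sum(Q0 ** 2, axis=0))

for beta in (1.0, 0.25, 0.5):              # identity, explosion, critical
    print(beta, jvp_growth(L=1000, beta=beta))
```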
Our findings extend results from the literature showing explosion of the backward pass for β < 1/2 and stability for β 1/2, both in an asymptotic setting (Allen-Zhu et al., 2019; Zhang et al., 2019b; Hayou et al., 2021) and a non-asymptotic setting (Zhang et al., 2022). Similar comparisons can be drawn with these papers as in the previous section (see end of Section 2.2). Furthermore, we emphasize again that making use of the forward-mode formulation of the gradients differs from these previous works, which resorted either to an infinite-width setting or to backward-mode differentiation and an ε-net argument. In summary, this and the previous subsection both point towards the same conclusion: there are three different cases, depending on the value of β explosion when β < 1/2, nondegenerate limit when β = 1/2, and identity when β > 1/2. In the explosion case, it is well known that the network cannot be trained (Yang and Schoenholz, 2017). The theory thus points out that the value 1/2 plays a pivotal role. Remarkably, this value has a specific interpretation in the continuous-time point of view of Res Nets, in terms of SDE. This is the topic that we address in the next section. Scaling Res Nets in the Large-depth Regime 0 250 500 750 1000 L 0 250 500 750 1000 L 0 250 500 750 1000 L Figure 3: Evolution of p0 p L / p L as a function of L for different values of β and an i.i.d. U( p 3/d) initialization of model res-3, with d = 40. The input is a random Gaussian observation x in dimension nin = 64. The experiment is repeated with 50 independent randomizations. 3. Scaling in the Continuous-time Setting Starting with the discrete Res Net (4), it is tempting to let L go to infinity and consider the network as the discretization of a differential equation where the layer index k {0, . . . , L} is replaced by the time index t [0, 1]. This interpretation of deep neural networks has been popularized by Chen et al. (2018) and is often referred to as the neural ODE paradigm. Notice that this setting is different from the so-called mean-field analysis, where the width of the network is assumed to be infinite beforehand. In our setting, the width d is assumed to be finite and fixed. 3.1 Convergence Towards a SDE in the Large-depth Regime One of the main messages of Section 2 is that the standard initialization with i.i.d. parameters leads to a non-degenerate model for large values of L only if Lα2 L 1 (Propositions 3 and 4), or, equivalently, if β = 1/2 when αL = 1/Lβ (Corollary 5). Remarkably, in the continuous-time limit, this regime corresponds to the discretization of a SDE. Indeed, consider for simplicity the (discrete) Res Net model res-1 h0 = Ax, hk+1 = hk + 1 L Vk+1σ(hk), 0 k L 1, (11) where the entries of all (Vk)1 k L are assumed to be i.i.d. N(0, 1/d). Recall the following definition: Definition 10 A one-dimensional Brownian motion (Bt)t [0,1] is a continuous-time stochastic process with B0 = 0, almost surely continuous, with independent increments, and such that for any 0 s < t 1, Bt Bs N(0, t s). Now, let B : [0, 1] Rd d be a (d d)-dimensional Brownian motion, in the sense that the (Bij)1 i,j d are independent one-dimensional Brownian motions. Thus, for any Marion, Fermanian, Biau, and Vert 0 k L 1 and any 1 i, j d, we have B(k+1)/L,i,j Bk/L,i,j N 0, 1 and the increments for different values of (i, j, k) are independent. As a consequence, the recurrence (11) is equivalent in distribution to the recurrence h k+1 = h k + 1 d σ(h k )(B(k+1)/L Bk/L), 0 k L 1. 
(Note that this is true because Vk+1 has the same distribution as V k+1.) We recognize the Euler-Maruyama discretization (Kloeden and Platen, 1992) on the {k/L, 0 k L} mesh of the SDE H0 = Ax, d H t = 1 d σ(H t )d Bt, t [0, 1], (12) where the output of the network is now a function of the final value of H, that is, H1. The link between the discrete Res Net (11) and the SDE (12) is formalized in the next proposition. Proposition 11 Consider the res-1 model, where the entries of Vk are i.i.d. Gaussian N(0, 1/d) random variables. Assume that the activation function σ is Lipschitz continuous. Then the SDE (12) has a unique solution H and, for any 0 k L, E Hk/L hk c for some c > 0. Notice that the requirement that σ is Lipschitz continuous is satisfied by most classical activation functions, including Re LU. This proposition is interesting for several reasons. First, the scaling β = 1/2, which is exactly the one that yields a non-trivial dynamics at initialization, corresponds in the continuous world to a remarkably simple model of diffusion. This shows that very deep neural networks properly initialized with i.i.d. weights are equivalent to solutions of SDE. This analogy opens interesting perspectives for training deep networks using automatic differentiation for solutions of neural SDE (Li et al., 2020). Second, we stress that the emergence of a SDE instead of an ODE carries an important message. Several authors (including, e.g., Thorpe and van Gennip, 2023) have shown that, under appropriate assumptions, a deep Res Net converges in the large depth limit to an ODE and not a SDE. The reason why we obtain a SDE here is intrinsically connected with the choice of i.i.d. initialization for the weights, which makes a Brownian motion appear at the limit, as highlighted above. In other words, the i.i.d. initialization, the choice β = 1/2 (the relevant critical value exhibited in Section 2), and the emergence of a SDE are intimately linked together. On the other hand, the case β = 1 matches with an ODE if the initialization is not i.i.d., as we will see in Subsection 3.2. Finally, we point out that Proposition 11 states the convergence of a Res Net towards a SDE for the basic architecture res-1 and for Gaussian initialization. The extension to more general settings is an interesting direction of research, although clearly beyond the scope of the present paper (see, e.g., Peluchetti and Favaro, 2020, and Cohen et al., 2021, for results in this direction). Scaling Res Nets in the Large-depth Regime 3.2 Scaling in the Neural ODE Setting Convergence towards an ODE. The basic message of our Proposition 11 is that an i.i.d. initialization, together with β = 1/2, leads to a SDE rather than an ODE. A natural question is then whether a different choice of weight distributions (at initialization) and scaling can lead to a classical neural ODE. To answer this question and leave the world of i.i.d. initialization, we assume that the weights (Vk)1 k L and (wk)1 k L are discretizations of smooth functions V : [0, 1] Rd d and W : [0, 1] Rp. We then consider the general iteration (4) with αL = 1/L, that is, h0 = Ax, hk+1 = hk + 1 LVk+1g(hk, wk+1), 0 k L 1, (13) where Vk = Vk/L and wk = Wk/L. Of course, it is still possible to consider (Vk)1 k L (resp. (wk)1 k L) as random variables, by letting (Vt)t [0,1] (resp. (Wt)t [0,1]) be a continuous-time stochastic process. 
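As an illustration of this correlated regime, the sketch below (our code, with arbitrary but reasonable choices: piecewise-linear interpolation of a few i.i.d. control matrices as the smooth random paths, entries of variance 1/d, and the res-3 layer function) runs iteration (13) with weights obtained by evaluating a single Lipschitz path at t = k/L. In contrast with the i.i.d. case, the final hidden state stabilizes as L grows, consistent with convergence to a limiting continuous-time trajectory.

```python
import numpy as np

def smooth_resnet_hidden(L, d=40, n_ctrl=8, seed=0):
    """Run iteration (13), alpha_L = 1/L, with V_k = V(k/L) and w_k = W(k/L), where V and W
    are Lipschitz random paths built by linear interpolation of i.i.d. control matrices."""
    rng = np.random.default_rng(seed)
    ctrl_V = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_ctrl, d, d))
    ctrl_W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_ctrl, d, d))

    def path(ctrl, t):                      # piecewise-linear interpolation of the controls
        s = t * (n_ctrl - 1)
        i = min(int(s), n_ctrl - 2)
        lam = s - i
        return (1.0 - lam) * ctrl[i] + lam * ctrl[i + 1]

    h = rng.normal(size=d)                  # stands in for h_0 = A x
    for k in range(L):
        t = (k + 1) / L
        h = h + path(ctrl_V, t) @ np.maximum(path(ctrl_W, t) @ h, 0.0) / L
    return h

# The same random paths, discretized more and more finely: the output stabilizes with L.
for L in (100, 1000, 10000):
    print(L, round(float(np.linalg.norm(smooth_resnet_hidden(L))), 4))
```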
In this model, we shall need the following assumption: (A5) For any 1 k L, one has Vk = Vk/L and wk = Wk/L, where the stochastic processes V and W are almost surely Lipschitz continuous. More precisely, almost surely, there exist KV , KW such that, for any s, t [0, 1], Vt Vs KV |t s|, Wt Ws KW |t s|. A typical model that satisfies Assumption (A5) is obtained by letting the entries of V and W be independent Gaussian processes with expectation zero and squared exponential covariance K(x, x ) = exp( (x x )2 2ℓ2 ), where ℓ> 0 (Lederer et al., 2019). Note that the Lipschitz constants may themselves be random, depending on the Gaussian process sample. We shall also need the following requirement on g, which is satisfied by all our models as soon as σ is Lipschitz continuous: (A6) The function g is Lipschitz continuous on compact sets, in the sense that for any compact P Rp, there exists KP > 0 such that, for all h, h Rd, w P, g(h, w) g(h , w) KP h h , and for any compact D Rd, there exists KD,P > 0 such that, for all h D, w, w P, g(h, w) g(h, w ) KD,P w w . Under Assumptions (A5) and (A6), the recurrence (13) almost surely converges towards the neural ODE given by H0 = Ax, d Ht = Vtg(Ht, Wt)dt, t [0, 1], (14) as shown by the proposition below. Proposition 12 Consider model (13) such that Assumptions (A5) and (A6) are satisfied. Then the ODE (14) has a unique solution H, and, almost surely, there exists some c > 0 such that, for any 0 k L, Hk/L hk c Marion, Fermanian, Biau, and Vert It should be stressed that the transition from the discrete recurrence (13) to the continuous-time differential equation (14) relies on the assumptions that the weight sequences (wk)1 k L and (Vk)1 k L are the discretizations of smooth limiting processes W and V on the one hand, and that the scaling αL is chosen as 1/L on the other hand. From a practical perspective, Proposition 12 shows that it is possible to initialize Res Nets in the ODE regime, by choosing a smooth stochastic process, discretizing it at each layer, and taking a 1/L scaling. This is in sharp contrast with the results of Sections 2 and 3.1, which show that the usual i.i.d. procedure leads to a neural SDE. Stability and scaling. Assuming that the weights of the network are discretizations of a smooth function (Assumption (A5)), it is possible to obtain stability results, depending on the value of β, similarly to what has been done in Section 2. We show below that β = 1 is a critical value, by examining the hidden states, in the same way as β = 1/2 is a critical value in the i.i.d. setting. Similar results can be shown for the gradients. We begin by a proposition handling the cases β > 1 and β = 1. Proposition 13 Consider a Res Net (4) such that Assumptions (A5) and (A6) are satisfied. Let αL = 1/Lβ, with β > 0. (i) If β > 1, then, almost surely, (ii) If β = 1, then, almost surely, there exists some c > 0 such that The explosion case (β < 1) is more delicate to deal with. We prove it for a linear model, and leave for future work the extension to more general cases. Proposition 14 Consider the res-1 model, taking σ as the identity function. Assume that Assumption (A5) is satisfied and that V T 0 has a positive eigenvalue. Let αL = 1/Lβ, with β (0, 1). Then, almost surely, max k hk h0 The assumption of the existence of a positive eigenvalue for V 0 is mild. For instance, if the entries of V0 are i.i.d. 
random variables with finite moments of all order, G otze and Jalowy (2021) show that such an eigenvalue exists with probability at least 1 1/d for d large enough. Essentially, the proof relies on showing divergence of the hidden states along the eigenvector associated to a positive eigenvalue of V T 0 . The extension to models that do not have an identity activation function is delicate. Indeed, while we expect such divergence to generally occur with a non-linear activation function, it is technically delicate to show as one has to rule out cases where the non-linearity of the activation function induces compensations that prevent divergence. Scaling Res Nets in the Large-depth Regime 0 250 500 750 1000 L 0 250 500 750 1000 L 108 β = 0.5 0 250 500 750 1000 L Figure 4: Evolution of h L h0 / h0 as a function of L for different values of β and a smooth initialization of model res-3, with d = 40. The input is a random Gaussian observation x in dimension nin = 64. The experiment is repeated with 50 independent randomizations. 0 250 500 750 1000 L 0 250 500 750 1000 L 108 β = 0.5 0 250 500 750 1000 L Figure 5: Evolution of p0 p L / p L as a function of L for different values of β and a smooth initialization of model res-3, with d = 40. The input is a random Gaussian observation x in dimension nin = 64. The experiment is repeated with 50 independent randomizations. Marion, Fermanian, Biau, and Vert In this setting, we observe experimentally a behavior of the output and of the gradients when L grows large similar to the one explored in Section 2. This is illustrated in Figures 4 and 5, which mirror Figures 1 and 3 in Section 2. The figures clearly show that there exist three cases for the output and for the gradients: an identity case (left plots), an explosion case (middle), and a non-trivial case separating explosion and identity (right). However, the remarkable point is that the separation occurs for β = 1, and not β = 1/2, as predicted by Propositions 13 and 14. 4. Experiments We experimentally investigate in this section two questions. The first one is to know whether there exists a range of scaling factors β > 0 and weight initializations, beyond the i.i.d. and the smooth regimes. The second question is whether our analysis, which pertains to the initialization phase, provides insights into the training phase, beyond initialization. 4.1 Intermediate Regimes In order to describe the transition between the i.i.d. and smooth cases, a possible route is to consider that the weights are increments of a γ-H older stochastic process. This model is interesting insofar as the Brownian motion (SDE regime) is (1/2 ε)-H older (ε > 0) and a Lipschitz continuous stochastic process (ODE regime) is 1-H older. In line with the above, in a series of experiments, we initialize the weights as increments of a fractional Brownian motion (BH t )t [0,1]. Recall that BH is a continuous-time Gaussian process, starting at zero, with zero expectation for all t [0, 1], and covariance function E(BH s BH t ) = 1 2(|s|2H + |t|2H |t s|2H), 0 s, t 1, where H (0, 1) is called the Hurst index. This index describes the raggedness of the process, with a higher value leading to a smoother process. When H = 1/2, the process is a standard Brownian motion (Definition 10), whose increments are independent by construction. When H > 1/2, the increments of the process are positively correlated, while if H < 1/2 the increments are negatively correlated. 
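The covariance formula above is all that is needed to simulate such weights. The sketch below (our code, not the authors' implementation) samples fractional Brownian motions on the grid k/L by a Cholesky factorization of this covariance and returns their increments, which then play the role of the residual weights, up to a variance normalization left to the practitioner; H = 1/2 recovers i.i.d. Gaussian increments, while H close to 1 approaches the smooth regime.

```python
import numpy as np

def fbm_increments(L, hurst, size, rng):
    """Sample independent fractional Brownian motions on the grid k/L, k = 1..L, via a
    Cholesky factorization of E[B^H_s B^H_t] = (|s|^{2H} + |t|^{2H} - |t-s|^{2H}) / 2,
    and return their increments.  Output shape: (L,) + size."""
    t = np.arange(1, L + 1) / L
    cov = 0.5 * (t[:, None] ** (2 * hurst) + t[None, :] ** (2 * hurst)
                 - np.abs(t[:, None] - t[None, :]) ** (2 * hurst))
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(L))   # small jitter for numerical safety
    z = rng.normal(size=(L, int(np.prod(size))))
    paths = chol @ z                                     # one fBm path per column
    incs = np.diff(paths, axis=0, prepend=0.0)           # B^H_{k/L} - B^H_{(k-1)/L}
    return incs.reshape((L,) + tuple(size))

rng = np.random.default_rng(0)
d, L, H = 40, 1000, 0.8
V_incs = fbm_increments(L, H, (d, d), rng)   # V_incs[k] plays the role of V_{k+1} in (4),
                                             # up to a variance normalization
```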
Importantly, a fractional Brownian motion with Hurst index H is (H ε)-H older continuous for any ε > 0. In the limit when H 1, the trajectories converge to linear functions (whose increments satisfy (A5)). As an illustration, Figure 6 depicts three realizations of a fractional Brownian motion with H = 0.2 (left), H = 0.5 (middle), and H = 0.8 (right). In order to assess the effect of the scaling factor β and the Hurst index H, we initialize a neural network res-3 with d = 40, L = 1000, various values of β [0.2, 1.3], and with weights taken as increments of fractional Brownian motions with various Hurst indices H (0, 1). Figure 7 depicts the empirical magnitude of the output and the gradients at initialization as a function of the Hurst index H and the scaling factor β. First note that we recover the two regimes (i.i.d. and smooth) discussed so far. For H = 1/2, the i.i.d. regime kicks in, with explosion (β < 1/2, orange zone), non-trivial behavior (β = 1/2, black zone), and identity (β > 1/2, blue zone). Likewise, we see at H = 1 a similar pattern in the smooth regime, with, as predicted by Proposition 13, a critical value β = 1. Beyond these two specific cases, we observe for an index H varying in (1/2, 1) a whole range of intermediate Scaling Res Nets in the Large-depth Regime 0 200 400 600 800 1000 3 (a) H = 0.2 0 200 400 600 800 1000 (b) H = 0.5 0 200 400 600 800 1000 0.8 (c) H = 0.8 Figure 6: Examples of realizations of a fractional Brownian motion BH for different Hurst indexes H. Note that the smaller the value of H, the more irregular the trajectory is. 1.3 1.18 1.06 0.94 0.82 0.69 0.57 0.45 0.33 (a) ln10( h L h0 / h0 ) 1.3 1.18 1.06 0.94 0.82 0.69 0.57 0.45 0.33 (b) ln10( p0 p L / p L ) Figure 7: Magnitude of the outputs and of the gradients as a function of the regularity of the weights (Hurst index H) and of the scaling factor β. The orange zone corresponds to the explosion case, i.e., h L h0 h0 and p0 p L p L . The blue zone corresponds to the identity case, i.e., h L h0 h0 and p0 p L p L . Finally, the black zone is an intermediate case, where h L h0 h0 and p0 p L p L . situations, where the transition between identity and explosion seems to happen for a critical β = H. Interestingly, for H < 1/2, the transition seems to saturate at the value β = 1/2. The take-home message is that the choice of the scaling of a Res Net seems to be closely linked to the regularity of the weights as a function of the layer. More precisely, for all regimes, the critical scaling factor between explosion and identity seems to have a natural interpretation as the (H older) regularity of the underlying continuous-time stochastic process. We believe that the mathematical understanding of this connection, beyond the fractional Marion, Fermanian, Biau, and Vert Brownian motion case, is a promising research direction for the future. Finally, these experiments suggest that it is sensible to initialize a Res Net for any value of the scaling β (1/2, 1), while avoiding the identity and explosion situations, by simulating a fractional Brownian motion of Hurst index H = β and initializing the weights as the increments of this process. 4.2 Beyond Initialization At initialization, before the gradient descent, the distribution of the weights (wk)1 k L and (Vk)1 k L is chosen by the practitioner. By contrast, during and after training, control is lost on these distributions, making the picture more complex. 
In particular, the existence and characterization of a continuous-time stochastic process whose discretization matches the trained Res Net is an interesting but difficult problem. Attacking this question requires a fine understanding of the interaction between training dynamics and the regularity of the sequence of the weights during the gradient descent. However, there is experimental evidence that the trained weights exhibit strong structure as a function of the layer index k (Cohen et al., 2021; Bayer et al., 2023), and that their regularity strongly depends on the choice of initialization. Interesting theoretical preliminary results in the ODE case are reported in Sander et al. (2022) for a linear activation and further generalized by Marion et al. (2024). Figure 8 depicts this mechanism by plotting a given coordinate of wk as a function of the layer index k ranging from 1 to the depth L = 1000, after training. Note that, in this experiment, there is no bias term in the residual layers, following the formulations from Table 1. We check that adding a bias term gives qualitatively similar plots (see Appendix D). To investigate the link between regularity of the weights at initialization, scaling, and performance after training, we train Res Nets on the datasets MNIST (Deng, 2012) and CIFAR-10 (Krizhevsky, 2009). As in Subsection 4.1, we initialize the Res Nets with various scaling factors and weights that are increments of fractional Brownian motions with different regularities. Then, for each combination of weight initialization and scaling factor, the Res Net is trained using the Adam optimizer (Kingma and Ba, 2015) for 10 epochs. The model includes a zero-initialized bias term on each residual layer. The results in terms of accuracy 0 200 400 600 800 1000 1.5 (a) β = 1, smooth initialization 0 200 400 600 800 1000 1.5 (b) β = 1, i.i.d. initialization 0 200 400 600 800 1000 1.5 (c) β = 1/2, i.i.d. initialization Figure 8: Plot of a given coordinate of wk, after training, as a function of the layer index k ranging from 1 to the depth L = 1000 for three different choices of β and initializations. Scaling Res Nets in the Large-depth Regime 0.9 0.81 0.72 0.63 0.54 0.46 0.37 0.28 0.19 (a) On MNIST 0.1 0.19 0.28 0.37 0.46 0.54 0.63 0.72 0.81 (b) On CIFAR-10 Figure 9: Accuracy after training as a function of the regularity of the weights at initialization and scaling. For each point of the heatmap, the model was trained on a grid of learning rates, and the best performance is shown. are presented in Figure 9 (light orange = good performance, blue = bad performance). We observe a pattern similar to the one of Figure 7. This means that, for a given regularity, the network is unable to learn if it is initialized with a scaling too far below the critical value, which of course is connected with the gradient explosion issue discussed previously. On the other hand, and perhaps more surprisingly, the performance seems to be more or less stable in the identity region, with perhaps a small degradation in the case of CIFAR-10. This somewhat contrasts with the results from Yang and Schoenholz (2017), who exhibit a decrease in performance for i.i.d. initialization and a large scaling factor β. Note however that, in the experiments reported in Figure 7, we adapt the learning rate of the gradient descent on a grid by cross-validation. When taking a fixed learning rate, we also observe a decrease in performance for large scaling factor β. 
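For completeness, here is a skeleton (in PyTorch; our code, not the authors' implementation) of the kind of experiment behind Figure 9: a deep res-3-style network whose residual weights come from a user-supplied initializer (for instance the fBm increments sketched in Section 4.1), trained with Adam. In the actual experiment the loader would serve MNIST or CIFAR-10 batches, a zero-initialized bias would be added to each residual layer, and the learning rate would be selected on a grid for each (H, β) pair.

```python
import torch
import torch.nn as nn

class ScaledResNet(nn.Module):
    """res-3-style network: h_{k+1} = h_k + L^{-beta} V_{k+1} relu(W_{k+1} h_k)."""
    def __init__(self, n_in, d, n_out, L, beta, weight_init):
        super().__init__()
        self.inp = nn.Linear(n_in, d, bias=False)
        self.out = nn.Linear(d, n_out, bias=False)
        # weight_init(k) returns the (d, d) tensor used for layer k; plugging in fBm
        # increments gives the Hurst-index-dependent initialization of Section 4.1.
        self.V = nn.ParameterList([nn.Parameter(weight_init(k)) for k in range(L)])
        self.W = nn.ParameterList([nn.Parameter(weight_init(k)) for k in range(L)])
        self.alpha = L ** (-beta)

    def forward(self, x):
        h = self.inp(x)
        for V, W in zip(self.V, self.W):
            h = h + self.alpha * torch.relu(h @ W.T) @ V.T
        return self.out(h)

def run(model, loader, lr, epochs=10):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Toy usage on random data (a real run would replace `loader` and sweep lr on a grid).
d, L = 40, 100
init = lambda k: torch.empty(d, d).uniform_(-(3.0 / d) ** 0.5, (3.0 / d) ** 0.5)
model = ScaledResNet(n_in=64, d=d, n_out=10, L=L, beta=0.5, weight_init=init)
loader = [(torch.randn(32, 64), torch.randint(0, 10, (32,))) for _ in range(10)]
run(model, loader, lr=1e-3, epochs=1)
```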
The interplay between the learning rate and the scaling factor is one of the keys to better assess how the performance of the trained network is connected with the scaling. Acknowledgments The authors thank anonymous referees for insightful comments. They also thank S. Schoenholz for fruitful discussion. P. Marion was supported by a grant from R egion ˆIle-de-France and by a Google Ph D Fellowship award. Marion, Fermanian, Biau, and Vert Appendix A. Proofs Throughout the proofs, the i-th coordinate of a vector v is denoted by vi. Similarly, the i-th row of a matrix M is denoted by Mi, and its (i, j)-th entry by Mij. A.1 Proof of Proposition 1 Statement (i) is clear (with C = 1) since, for any h Rd, σ(h) 2 a2 h 2, b2 h 2 1 2 h 2, h 2 . With respect to statement (ii), it is enough to show that for any h Rd and any random matrix W satisfying the assumptions of the proposition, one has 2 E σ(Wh) 2 h 2 and E σ(Wh) 8 C h 8, E Re LU(Wh) 2 = h 2 2 and E Re LU(Wh) 8 C h 8. The two claims with the squared norms are consequences of Lemmas 17 and 18 in Appendix B, together with the fact that the variance of the entries of W equals 1/d. In order to prove the other two statements, first note that E( σ(Wh) 8) E( Wh 8) and E( Re LU(Wh) 8) E( Wh 8). The results are then consequences of Lemma 22 in Appendix B, which states that d2 1 + 384s4 + 3072s6. A.2 Proof of Proposition 3 According to Lemma 15 below, one has 1 + α2 L L 1 . But, for Lα2 L 1, we have (1 + α2 L)L 1 exp(Lα2 L) 1 2Lα2 L. Therefore, and the result follows from Markov s inequality. Lemma 15 Consider a Res Net (4) such that Assumptions (A1) and (A2) are satisfied. Then 1 + α2 L 2 L 1 E h L h0 2 1 + α2 L L 1 . Scaling Res Nets in the Large-depth Regime Proof (Lemma 15). Taking the squared norm of the forward update rule (4) and dividing by h0 2 yields h0 2 = 1 h0 2 hk 2 + α2 L Vk+1g(hk, wk+1) 2 + 2αL hk, Vk+1g(hk, wk+1) . (15) We deduce by Assumptions (A1) and (A2) that (1 + α2 L)E hk 2 Therefore, by recurrence, we are led to (1 + α2 L)k. (16) Now, observe that h L = h0 + αL PL 1 k=0 Vk+1g(hk, wk+1). Thus, we have k,k =0 E g(hk, wk+1) V k+1Vk +1g(hk , wk +1) By conditioning on all random variables except Vk +1 for k < k (and Vk+1 for k > k ), it is easy to see that the only non-zero terms are when k = k . This yields k=0 E Vk+1g(hk, wk+1) 2 (by Assumptions A1 and A2) k=0 (1 + α2 L)k = 1 + α2 L L 1 . = 1 + α2 L 2 Marion, Fermanian, Biau, and Vert A.3 Proof of Proposition 4 Dividing (15) by hk 2 and taking the logarithm leads to ln( hk+1 2) = ln( hk 2)+ln 1+α2 L Vk+1g(hk, wk+1) 2 hk 2 +2αL D hk hk , Vk+1g(hk, wk+1) Yk,1 = α2 L Vk+1g(hk, wk+1) 2 hk 2 , Yk,2 = 2αL D hk hk , Vk+1g(hk, wk+1) and Yk = Yk,1 + Yk,2. The proof of Proposition 4 strongly relies on the following lemma, which provides technical information on the moments of Yk,1 and Yk,2. For the sake of clarity, its proof is postponed to Appendix A.13. Lemma 16 Assume that Assumptions (A1) and (A2) are satisfied. Then (E1) E(Yk,2|hk) = 0. (E2) α2 L 2 E(Yk,1|hk) α2 L. (E3) E(Yk,1Yk,2|hk) = 0. (E4) E(Y 2 k,2|hk) 4α2 L d . (E5) E(Y 4 k,2|hk) 2048 s4α4 L d2 . (E6) E(Y 4 k,1|hk) C 3072 s6 d + 1 α8 L. (E7) E(Y 2 k,1|hk) d + 1 α4 L. For c > 0, we have h0 2 c = P ln( h L 2) ln( h0 2) ln(c) k=0 ln(1 + Yk) ln(c) k=0 Yk ln(c) (using ln(1 + x) x for x > 1). Let S = PL 1 k=0 Yk E(Yk|hk). By (E1) and (E2), k=0 E(Yk|hk) Lα2 L. So, for c > exp(Lα2 L), h0 2 c P S ln(c) k=0 E(Yk|hk) P(S ln(c) Lα2 L) P S2 ln(c) Lα2 L 2 E(S2) (ln(c) Lα2 L)2 (17) (by Markov s inequality.) 
Scaling Res Nets in the Large-depth Regime It remains to upper bound E(S2). To this aim, note that k=0 E Yk E(Yk|hk) 2 k=0 E Y 2 k 4Lα2 L d + 128 (by (E3), (E4), and (E7)) The last inequality is true for α2 L 1 C(d+128s4). Therefore, by inequality (17), we obtain, for c > exp(Lα2 L), h0 2 c 5Lα2 L d ln(c) Lα2 L 2 . We conclude that, for any δ (0, 1), with probability at least 1 δ, This shows statement (ii) of the proposition. Next, to prove statement (i), observe that c > 0, h0 2 c = P ln( h L 2) ln( h0 2) ln(c) k=0 ln(1 + Yk) ln(c) k=0 ln(1 + Yk) ln(c) and k, Yk 1 k=0 ln(1 + Yk) ln(c) and k, Yk < 1 Using the inequality ln(1 + x) x x2 for x 1/2, we obtain h0 2 c P L 1 X k=0 Yk Y 2 k ln(c) and k, Yk 1 k=0 ln(1 + Yk) ln(c) and k, Yk < 1 h0 2 c P L 1 X k=0 Yk Y 2 k ln(c) + k=0 P Yk,2 < 1 Marion, Fermanian, Biau, and Vert We handle the two terms above on the right-hand side separately. For the first term, let Zk = Yk Y 2 k and S = PL 1 k=0 Zk E(Zk|hk). Observe that, by (E1)-(E4) and (E7), k=0 E(Zk|hk) Lα2 L 2 4Lα2 L d 128 8Lα2 L, (19) where the last inequality is valid for d 64 and α2 L 1 16 C(2s4+1). Hence, for 0 < c < exp(3Lα2 L/8), k=0 Yk Y 2 k ln c = P S ln(c) k=0 E(Zk|hk) P S ln(c) 3Lα2 L 8 P S2 ln(c) 3Lα2 L 8 E(S2) ln(c) 3Lα2 L 8 2 (by Markov s inequality.) Using the cr-inequality (a + b)n 2n 1(an + bn) respectively for n = 2 and n = 4, we see that k=0 E Zk E(Zk|hk) 2 k=0 E Z2 k 2 k=0 E Y 2 k + E Y 4 k k=0 E(Y 2 k,1) + E(Y 2 k,2) + 2E(Yk,1Yk,2) + 8E(Y 4 k,1) + 8E(Y 4 k,2). By (E3)-(E7), it is easy to verify that, for d 64 and α2 L 1 ( E(S2) 10Lα2 L d . This shows that, for c < exp(3Lα2 L/8), k=0 Yk Y 2 k ln(c) 10Lα2 L d ln(c) 3Lα2 L 8 2 . To conclude the proof, it remains to upper bound the second term of inequality (18). According to inequality (21) in the proof of Lemma 16 (with t = 1/2), one has k=0 P Yk,2 < 1 2L exp d 64α2 Ls2 Scaling Res Nets in the Large-depth Regime Putting everything together, we are led to h0 2 c 10Lα2 L d ln(c) 3Lα2 L 8 2 + 2L exp d 64α2 Ls2 Take δ (0, 1). Then, if 2L exp d 64α2 Ls2 δ 11, with probability at least 1 δ, h0 2 > exp 3Lα2 L 8 Notice that this inequality is valid under the assumption α2 L 2 ( A.4 Proof of Corollary 5 Statement (i) is a consequence of Proposition 3, whereas (ii) is a consequence of Proposition 4 (i). The latter is valid under the conditions d 64 and αL 2 ( C+16s4)d, which is automatically satisfied for all L large enough. Furthermore, an inspection of the proof of Proposition 4 reveals that the divergence in high probability of h L can be proved under the relaxed assumption d 9. Indeed, the main constraint on d comes from the lower bound (19), where one needs to make sure that Lα2 L 2 4 Lα2 L d > 0, which is the case for d = 9. To prove (iii), we use a union bound on both statements of Proposition 4. A.5 Proof of Proposition 6 The first claim follows from the observation that g(hk, wk+1) σ (hk,1) 0 . . . 0 0 σ (hk,2) . . . 0 ... ... ... ... 0 0 . . . σ (hk,d) from (A1), and from the assumption on σ . Let us now prove (ii). In the rest of the proof, the subscript k is ignored to lighten the notation. Observe that σ ( W1, h ) 0 . . . 0 0 σ ( W2, h ) . . . 0 ... ... ... ... 0 0 . . . σ ( Wd, h ) Denote by D the matrix in the middle of the right-hand side. Then h q 2 h, q = E V DWq 2|h, q = E DWq 2|h, q Marion, Fermanian, Biau, and Vert For model res-2, we have h q 2 h, q = E d X j=1 Wijqj 2 σ ( Wi, h ) h, q . The conclusion follows from the hypothesis that a σ b and E Wq 2|q = q 2. 
For model res-3, we have h q 2 h, q = E d X j=1 Wijqj 2 1Pd j=1 Wijhj 0 Since the (Wij)1 i,j d are centered random variables, we conclude that h q 2 h, q = 1 j=1 Wijqj 2 q = 1 2E Wq 2|q = q 2 A.6 Proof of Proposition 7 Letting b = p L/ p L , as in Assumption (A3), and taking expectation in (10), we obtain = E(|b q L(z)|2) = 1 d E( q L(z) 2) (20) The rest of the proof is similar to the proof of Proposition 3. From (9), we have qk+1(z) 2 = qk(z) 2+α2 L Vk+1 g(hk, wk+1) h qk(z) 2 +2αL qk(z), Vk+1 g(hk, wk+1) By independence of Vk+1 from qk(z) and g(hk,wk+1) E D qk(z), Vk+1 g(hk, wk+1) h qk(z) E = 0. E Vk+1 g(hk, wk+1) h qk(z) 2 = E E Vk+1 g(hk, wk+1) h qk(z) 2 hk, wk+1, qk(z) = E g(hk, wk+1) = E E g(hk, wk+1) h qk(z) 2 hk, qk(z) . By Assumption (A3), we are led to 2α2 L E( qk(z) 2) E( qk+1(z) 2) (1 + α2 L)E( qk(z) 2), Scaling Res Nets in the Large-depth Regime and thus, by induction, since q0(z) = z and E( z 2) = d, 2α2 L k E( qk(z) 2) d(1 + 4α2 L)k. In particular, for k = L, 2α2 L L E( q L(z) 2) d(1 + α2 L)L. Therefore, by (20), 1 + 1 2α2 L L E p0 2 (1 + α2 L)L. To finish the proof, observe that 1 p L (p0 p L) z = b (q L(z) z). Using arguments similar to (20), we may write d E q L(z) z 2 Now, upon noting that q L(z) z = q L(z) q0(z) = αL PL 1 k=0 Vk+1 g(hk,wk+1) E( q L(z) z 2) = α2 L k,k =0 E qk(z) g(hk, wk+1) h V k+1Vk +1 g(h k, wk +1) Vk+1 g(hk, wk+1) k=0 (1 + α2 L)k = d (1 + α2 L)L 1 d exp(Lα2 L) 1 2d Lα2 L, for Lα2 L 1. Note that the second equality is obtained by conditioning on every random variable except Vk +1 for k < k (and Vk+1 for k > k ). Finally, by using Markov s inequality, we conclude that, for any ε > 0, P p0 p L 2 ε p L 2 2Lα2 L ε . A.7 Proof of Proposition 8 The proof of Proposition 7 reveals that (1 + α2 L)L 1. Using similar arguments, one has d E q L(z) z 2 2α2 L k = 1 + 1 Marion, Fermanian, Biau, and Vert A.8 Proof of Corollary 9 The first statement is an immediate consequence of Proposition 7. The second one is a consequence of Proposition 8 and the fact that, for β < 1, 1 + 1 L = exp L ln 1 + 1 Finally, (iii) follows from Proposition 8. A.9 Proof of Proposition 11 The proposition is a consequence of Kloeden and Platen (1992, Theorems 4.5.3 and 10.2.2) for the SDE 1 dσ(H t )d Bt. Letting a(h, t) = 0 and b(h, t) = q 1 dσ(h), we need to check the following assumptions: (H1) The functions a( , ) and b( , ) are jointly measurable on Rd [0, 1]. (H2) There exists a constant C1 > 0 such that, for any x, y Rd, t [0, 1], a(x, t) a(y, t) + b(x, t) b(y, t) C1 x y . (H3) There exists a constant C2 > 0 such that, for any x Rd, t [0, 1], a(x, t) + b(x, t) C2(1 + x ). (H4) E H0 2 < . (H5) There exists a constant C3 > 0 such that, for any x Rd, s, t [0, 1], a(x, t) a(x, s) + b(x, t) b(x, s) C3(1 + x )|t s| 1/2. Assumptions (H1), (H4), and (H5) readily follow from the definitions. Assumption (H2) is true since σ is Lipschitz continuous, and (H3) follows from σ(x) b x x 1 + x . A.10 Proof of Proposition 12 Let ψ : Rd [0, 1] Rd be defined for any h Rd, t [0, 1], by ψ(h, t) = Vtg(h, Wt). With this notation, the ODE (14) is equivalent to the initial value problem d Ht = ψ(Ht, t)dt, H0 = Ax. By Assumptions (A5) and (A6), ψ is Lipschitz continuous in its first argument, in the sense that there exists K > 0 (which may depend on the realization of V and W ) such that, for all h, h Rd, t [0, 1], ψ(h, t) ψ(h , t) K h h . Scaling Res Nets in the Large-depth Regime In addition, it is continuous in its second one. 
Thus, according to the Picard-Lindel of theorem (Theorem 23 in Appendix C), this is enough to show that the neural ODE (14) has a unique solution on [0, 1]. Note that the solution H is continuous on [0, 1] and is therefore bounded by a constant M > 0. In order to prove the approximation bound of Proposition 12, we start by proving that both ψ and H are Lipschitz continuous in t. Under (A5) and (A6), this is clear for ψ since H is bounded. Moreover, for any [s, t] [0, 1], we have Ht Hs = Z t s ψ(Hu, u)du Z t s ψ(Hu, u) du (t s) sup u [0,1] h Rd, h M Now, let K1 and K2 denote the Lipschitz constants of ψ (in both arguments) and H respectively, and, for any 0 k L, let tk = k/L. Then we have, for k 1, = Htk 1 + Z tk tk 1 ψ(Hu, u)du hk 1 1 Lψ(hk 1, tk 1) Htk 1 hk 1 + Z tk tk 1 ψ(Hu, u) ψ(hk 1, tk 1) du Htk 1 hk 1 + K1 tk 1 Hu hk 1 du + K1 tk 1 |u tk 1|du Htk 1 hk 1 + K1 tk 1 Hu Htk 1 du + K1 tk 1 |u tk 1|du Htk 1 hk 1 + (K2 + 1)K1 tk 1 |u tk 1|du Htk 1 hk 1 + (K2 + 1)K1 By recurrence, we obtain j (K2 + 1)K1 2L2 L 1 + K1 L (K2 + 1)K1 e K1 (K2 + 1)K1 which concludes the proof. A.11 Proof of Proposition 13 Starting from (4) and using Assumption (A6), one easily obtains the existence of C1 and C2 (whose values depend on the realization of V and W ) such that hk+1 (1 + C1αL) hk + C2αL. Marion, Fermanian, Biau, and Vert By recurrence, hk+1 (1 + C1αL)k h0 + C2 Hence, using αL 1/L, hk+1 exp(C1) h0 + C2 Since g is Lipschitz continuous on compact sets, it is bounded on every ball of Rd Rp. The result is then a consequence of the identity h L h0 = αL k=0 Vk+1g(hk, wk+1), since we showed that each term in the sum is bounded by some constant C3 > 0, independent of L and k. Hence we have that h L h0 C3LαL = C3L1 β, yielding the results depending on the value of β. A.12 Proof of Proposition 14 In the linear case, (4) can be written hk+1 = hk + αLVk+1hk, 0 k L 1. Take y a unit-norm eigenvector of V 0 with associated eigenvalue λ > 0. Then hk+1, y = hk + αLVk+1hk, y = hk, y + αL hk, V k+1y = hk, y + λαL hk, y + αL hk, (Vk+1 V0) y . Since V is Lipschitz and Vk+1 = Vk+1/L, there exists c such that Vk+1 V0 ck+1 | hk+1, y | (1 + λαL)| hk, y | cαL k + 1 Then, by recurrence, | h L, y | (1 + λαL)L| h0, y | cαL k=0 (k + 1)(1 + λαL)k hk (1 + λαL)L| h0, y | cαL(1 + λαL)L max k hk . Let M = | h0,y | 2cαL , and suppose that hk M for all 0 k L. Then h L | h L, y | (by the Cauchy-Schwartz inequality) (1 + λαL)L | h0, y | c MαL . Scaling Res Nets in the Large-depth Regime Then, for λαL 1, 2(1 + λαL)L| h0, y | 1 2 | h0, y |. Thus, since LαL = L1 β, we have that h L , which contradicts our assumption that hk M for all 0 k L. We deduce that, for all L large enough, max k hk > | h0, y | Furthermore, max k hk h0 h0 > | h0, y | 2c h0 αL 1 L . A.13 Proof of Lemma 16 (E1) and (E2) are simple consequences of Assumptions (A1) and (A2). To prove (E3), let f(hk, wk+1) = Vk+1g(hk, wk+1). Then E(Yk,2Yk,1|hk) = 1 hk 4 E f(hk, wk+1) 2 hk, f(hk, wk+1) hk j=1 f(hk, wk+1)2 i (hk)jf(hk, wk+1)j hk . It is easy to verify that, under Assumption (A1), each term of the sum above has zero expectation. This shows (E3). To establish (E4), we start by noting that hk , f(hk, wk+1) E2 hk = 1 hk 4 E h k f(hk, wk+1)f(hk, wk+1) hk|hk = 1 hk 4 h k E f(hk, wk+1)f(hk, wk+1) |hk hk. Clearly, E(f(hk, wk+1)if(hk, wk+1)j) = 0 for i = j. Since, furthermore, the coordinates of f(hk, wk+1) are identically distributed conditionally on hk, we obtain E f(hk, wk+1)f(hk, wk+1) |hk = 1 d E( f(hk, wk+1) 2|hk)Id. 
hk , f(hk, wk+1) E2 hk = 1 d hk 4 E( f(hk, wk+1) 2|hk)h k hk 1 by Assumptions (A1) and (A2). To prove (E5), let ϕ = Vk+1g(hk,wk+1),hk g(hk,wk+1) hk . Then, for any t > 0, P(|Yk,2| > t) = P |ϕ| > t hk 2αL g(hk, wk+1) = E P |ϕ| > t hk 2αL g(hk, wk+1) Marion, Fermanian, Biau, and Vert So, by Lemma 21 in Appendix B, P(|Yk,2| > t) E 2 exp dt2 hk 2 16α2 Ls2 g(hk, wk+1) 2 = E E 2 exp dt2 hk 2 16α2 Ls2 g(hk, wk+1) 2 E 2 exp dt2 hk 2 16α2 Ls2E( g(hk, wk+1) 2|hk) by Jensen s inequality. Finally, using Assumption (A2), we deduce that P(|Yk,2| > t) E 2 exp dt2 = 2 exp dt2 In particular, for all q 1 (see, e.g., Pauwels, 2020), E(Y 2q k,2) q! 32s2α2 L d The result is obtained by taking q = 2. Finally, (E6) and (E7) are consequences of Lemma 22 in Appendix B. Appendix B. Concentration of Sub-Gaussian Random Matrices In this appendix, we are interested in concentration of moments of sub-Gaussian matrices. We begin by two simple lemmas on second-order moments of random matrix-vector products. Lemma 17 Let W Rd d be a matrix whose entries are centered i.i.d. random variables, with finite variance, and let σ be an activation function such that, for all x R, a|x| |σ(x)| b|x|, 1/ 2 a < b 1. Then, for any x Rd, 1 2E Wx 2 E σ(Wx) 2 E Wx 2 and E Re LU(Wx) 2 = 1 Proof The first part is a consequence of the assumption on σ. To prove the equality, let Xi = Pd j=1 Wijxj. Then E Re LU(Wx) 2 = E d X j=1 Wijxj 2 1Pd j=1 Wijxj 0 i=1 X2 i 1Xi 0 . Since the (Wij)1 j d are centered and independent random variables, Xi is also centered. Hence E(X2 i 1Xi 0) = 1/2E(X2 i ), which concludes the proof. Lemma 18 Let W Rd d be a matrix whose entries are centered i.i.d. random variables, with finite variance s2. Then, for any x Rd, E Wx 2 = s2d x 2. Scaling Res Nets in the Large-depth Regime Proof For any 1 i d, |Wx|2 i = d X j=1 Wijxj 2 = j,j =1 Wij Wij xjxj . Thus, by independence, E |Wx|2 i = E d X j,j =1 Wij Wij xjxj = j=1 E(W 2 ij)x2 j = s2 x 2. (22) The result follows by summing over all i {1, . . . , d}. We now aim at deriving more involved results on concentration of linear and quadratic forms of sub-Gaussian matrices (Lemma 21 and Lemma 22). These two propositions are byproducts of the main result of Kontorovich (2014), which generalizes Mc Diarmid s inequality to sub-Gaussian variables. We start by a technical result regarding the sub Gaussian diameter introduced by Kontorovich (2014), whose definition is recalled below. Definition 19 Let X be a real-valued random variable, X an independent copy of X, and ε a Rademacher random variable, independent of X and X . Then the sub-Gaussian diameter of X is defined as the smallest t such that ε|X X | is t2 sub-Gaussian. Lemma 20 Let X be an s2 sub-Gaussian centered random variable. Then the sub-Gaussian diameter of X is less than Proof Let λ R. Then, using the notation of Definition 19, one has E(expλε|X X |) = E(expλ(X X ) 1ε=1) + E(expλε(X X) 1ε= 1) = E(expλ(X X )) = E(expλX)2 where the last equality is a consequence of the symmetry of X. We are now ready to prove the two main results of this appendix. Lemma 21 (Bound on the deviation of linear forms) Let V be a Rd d matrix whose entries are i.i.d s2/d sub-Gaussian random variables. Then, for any x, y Rd, x, y = 0, x y t 2 exp dt2 Proof For any 1 i, j d, set Xij = xi Vijyj x y . 
Let X = Rd2 endowed with the ℓ1 norm, let X be the vector in X whose (id + j)-th coordinate is Xij, and let the function ϕ be defined by Marion, Fermanian, Biau, and Vert By the triangle inequality, ϕ is a Lipschitz continuous function, with Lipschitz constant equal to 1. Observe also that Xij is a x2 i s2y2 j/d x 2 y 2 sub-Gaussian. Thus, according to Lemma 20, the sub-Gaussian diameter of Xij is less than d x y . By Kontorovich (2014, Theorem 1), for any t > 0, one has P (ϕ(X) t) 2 exp 2 Pd i,j=1 2s2x2 i y2 j d x 2 y 2 x y t 2 exp dt2 Lemma 22 (Bound of moments of quadratic forms) Let V be a Rd d matrix whose entries are i.i.d s2/d sub-Gaussian random variables, with variance 1/d. Then, for any x Rd, x = 0, d and E V x 8 Proof The proof is similar to the one of Lemma 21, with Xij = Vijxj x , X = Rd, and Each function ϕi is a Lipschitz continuous function, with Lipschitz constant equal to 1. Observe now that the random variable Xij is x2 js2/d x 2 sub-Gaussian. Thus, according to Lemma 20, the sub-Gaussian diameter of Xij is less than d x . Therefore, according to Kontorovich (2014, Theorem 1), for any t > 0, P (ϕi(X) t) 2 exp 2 Pd j=1 2s2x2 j d x 2 P | Vi, x | x t 2 exp dt2 Hence (see, e.g., Pauwels, 2020), From identity (22) in the proof of technical Lemma 18, we obtain that, for q = 1, Scaling Res Nets in the Large-depth Regime which is an improvement by a factor 8s2 over the previous upper bound. To conclude, it remains to conclude V x 4 and V x 8 with the Vi, x . To do so, observe that V x 4 = d X i=1 Vi, x 2 2 = Vi, x 2 Vj, x 2 + i=1 Vi, x 4. Hence, by independence of the (Vi)1 i d, i=1 E Vi, x 4 d2 + d2(8s2)2 d2 1 + 128s4 d (by (23) and (24)) V x 8 = d X i=1 Vi, x 2 3 = i,j,k=1 i =j =k Vi, x 2 Vj, x 2 Vj, x 2 + Vi, x 2 Vj, x 4 + i=1 Vi, x 8. i,j,k=1 i =j =k i=1 E Vi, x 8 = d(d 1)(d 2) 1 d2 + 3d(d 1)2(8s2)2 d3 + d6(8s2)3 Appendix C. A Version of the Picard-Lindel of Theorem Theorem 23 Assume that f is Lipschitz continuous in its first argument and continuous in its second one. Then, for any z Rd, the initial value problem d Ht = f(Ht, t)dt, H0 = z, (25) admits a unique solution H : [0, 1] Rd. Marion, Fermanian, Biau, and Vert Proof Let C ([s, t], Rd) be the set of continuous functions from [s, t] to Rd. For any [s, t] [0, 1], ζ Rd, let Ψ be the function Ψ : C ([s, t], Rd) C ([s, t], Rd) Y 7 v 7 ζ + Z v s f(Yu, u)du . For any Y, Y C ([s, t], Rd), v [s, t], one has, denoting by Kf the Lipschitz constant of f in its first argument, Ψ(Y )v Ψ(Y )v Z v f(Yu, u) f(Y u, u) du s Kf Yu Y u du Kf Y Y (t s). This yields Ψ(Y ) Ψ(Y ) Kf(t s) Y Y , which means that the function Ψ is Lipschitz continuous on C ([s, t], Rd) endowed with the supremum norm, with Lipschitz constant Kf(t s). So, on any interval [s, t] of length smaller than δ = 1/2Kf, the function Ψ is a contraction. Thus, by the Banach fixed-point theorem, for any initial value ζ, Ψ has a unique fixed point. Hence, there exists a unique solution to (25) on any interval of length δ with any initial condition. To obtain a solution on [0, 1] it is sufficient to concatenate these solutions. Appendix D. Detailed Experimental Setting and Additional Plots Our code is available at https://github.com/Pierre Marion23/scaling-resnets. To obtain Figures 1 to 3, we initialize Res Nets from res-3 with the hyperparameters of Table 2. 
Table 2: Hyperparameters of Figures 1 to 3

    d                     40
    n_in                  64
    n_out                 1
    L                     10 to 1000
    β                     0.25, 0.5, 1
    weight distribution   uniform on [−√(3/d), √(3/d)]
    data distribution     standard Gaussian

Figure 10: Evolution of ‖h_L − h_0‖/‖h_0‖ as a function of L for different values of β and for the model h_{k+1} = h_k + α_L σ(W_k h_k). Hyperparameters are as in Figure 1.

Figure 11: Empirical distributions of the norms for the model h_{k+1} = h_k + α_L σ(W_k h_k): (a) distribution of ‖h_L‖/‖h_0‖, (b) distribution of the corresponding gradient ratio. Hyperparameters are as in Figure 2.

Figure 12: Evolution of ‖p_0 − p_L‖/‖p_L‖ as a function of L for different values of β and for the model h_{k+1} = h_k + α_L σ(W_k h_k). Hyperparameters are as in Figure 3.

Figure 13: Evolution of ‖h_L − h_0‖/‖h_0‖ as a function of L for different values of β and a smooth initialization of the model h_{k+1} = h_k + α_L σ(W_k h_k). Hyperparameters are as in Figure 4.

Each experiment is repeated 50 times, with independent data and weight sampling. For Figures 4 and 5, we take the same hyperparameters, except for β, which now takes values in {0.5, 1, 2}, and for the weight distribution. The weights are now initialized as discretizations of a Gaussian process. More precisely, each entry of V and W is an independent Gaussian process with zero mean and an RBF kernel of variance 10⁻². We also perform the same experiments with the model h_{k+1} = h_k + α_L σ(W_k h_k), and report the results in Figures 10 to 14. Although this formulation is not covered by our theoretical results (see the discussion at the end of Section 2.1), we observe qualitatively similar results as in Figures 1 to 5. To obtain Figure 7, we take the hyperparameters of Table 3.

Table 3: Hyperparameters of Figure 7

    d                     40
    n_in                  64
    n_out                 1
    L                     1000
    β                     0.2 to 1.3
    weight distribution   fractional Brownian motion with Hurst index from 0.05 to 0.97
    data distribution     standard Gaussian

More precisely, for each 1 ≤ i, j ≤ d, we let (V_{k+1,i,j})_{0≤k≤L−1} be the increments of a fractional Brownian motion (fBm), where the various fBm involved are independent. The procedure is the same for w. In Figure 9, we use res-1, with the hyperparameters of Table 4. Each residual layer also includes a bias term, initialized to zero and trained with the same learning rate as the other weights. We train on MNIST (http://yann.lecun.com/exdb/mnist) and CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) using the Adam optimizer (Kingma and Ba, 2015) for 10 epochs. The learning rate is divided by 10 after 5 epochs. The best performance on the learning rate grid is reported in the figure. A sketch of this training protocol is given below.

Table 4: Hyperparameters of Figure 9

    d                     30
    L                     1000
    β                     0.2 to 1.3
    weight distribution   fractional Brownian motion with Hurst index from 0.05 to 0.97
    learning rate grid    10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 1
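For completeness, here is a compact PyTorch sketch of this protocol, intended as an illustration rather than a faithful reimplementation: the layer function used below (a res-1-like update h_{k+1} = h_k + L^{-β}(V_{k+1} ReLU(h_k) + b_{k+1}), our reading of Table 1), the input embedding, the linear readout, and their default initializations are simplifications on our part, and the names ScaledResNet and train_one are ours; the exact architecture and data pipeline are those of the companion repository. The fgn_increments helper from the sketch of Subsection 4.1 is repeated so that the snippet is self-contained.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def fgn_increments(L, H, n_paths, rng):
    # Unit-variance fBm increments (same helper as in the Subsection 4.1 sketch).
    k = np.arange(L)
    gamma = 0.5 * ((k + 1.0) ** (2 * H) + np.abs(k - 1.0) ** (2 * H) - 2.0 * k ** (2 * H))
    chol = np.linalg.cholesky(gamma[np.abs(np.subtract.outer(k, k))] + 1e-8 * np.eye(L))
    return (chol @ rng.standard_normal((L, n_paths))).T

class ScaledResNet(nn.Module):
    # Simplified residual classifier: input embedding, L residual layers
    # h_{k+1} = h_k + L^{-beta} * (V_{k+1} ReLU(h_k) + b_{k+1}), linear readout.
    def __init__(self, in_dim, d, L, n_classes, beta, H, rng):
        super().__init__()
        self.embed = nn.Linear(in_dim, d)
        fbm = fgn_increments(L, H, d * d, rng).reshape(d, d, L).transpose(2, 0, 1) / np.sqrt(d)
        self.V = nn.Parameter(torch.as_tensor(fbm, dtype=torch.float32))
        self.b = nn.Parameter(torch.zeros(L, d))       # zero-initialized, trained bias terms
        self.readout = nn.Linear(d, n_classes)
        self.scale = L ** (-beta)

    def forward(self, x):
        h = self.embed(x.flatten(1))
        for k in range(self.V.shape[0]):
            h = h + self.scale * (F.relu(h) @ self.V[k].T + self.b[k])
        return self.readout(h)

def train_one(beta, H, lr, d=30, L=1000, epochs=10):
    tfm = transforms.ToTensor()
    train_set = datasets.MNIST("data", train=True, download=True, transform=tfm)
    test_set = datasets.MNIST("data", train=False, download=True, transform=tfm)
    model = ScaledResNet(28 * 28, d, L, 10, beta, H, np.random.default_rng(0))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.1)  # lr / 10 after 5 epochs
    for _ in range(epochs):
        for x, y in DataLoader(train_set, batch_size=128, shuffle=True):
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
        sched.step()
    model.eval()
    with torch.no_grad():
        correct = sum((model(x).argmax(1) == y).sum().item()
                      for x, y in DataLoader(test_set, batch_size=256))
    return correct / len(test_set)

# Best test accuracy over the learning rate grid, for one (beta, H) cell of the
# heatmap (very slow on CPU at this depth; move model and batches to a GPU in practice).
best_acc = max(train_one(beta=0.7, H=0.7, lr=lr) for lr in (1e-4, 1e-3, 1e-2, 1e-1, 1.0))

Looping train_one over the grids of β, Hurst index, and learning rate of Table 4 should yield a heatmap of the kind shown in Figure 9, up to the architectural simplifications mentioned above.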
Figure 14: Evolution of ‖p_0 − p_L‖/‖p_L‖ as a function of L for different values of β and a smooth initialization of the model h_{k+1} = h_k + α_L σ(W_k h_k). Hyperparameters are the same as in Figure 5.

Figure 15: Plot of a given coordinate of w_k, after training, as a function of the layer index k ranging from 1 to the depth L = 1000, for (a) β = 1, smooth initialization, (b) β = 1, i.i.d. initialization, and (c) β = 1/2, i.i.d. initialization. A bias term is trained in each residual layer.

Figure 8 is obtained by plotting a random coordinate of w_k, after training on MNIST. While there is no bias term in the model trained for this figure, Figure 15 shows the result of the same experiment when additionally training a zero-initialized bias term in each residual layer. We observe that adding a bias term does not qualitatively change the results.

References

Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via overparameterization. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 242–252. PMLR, 2019.

D. Arpit, V. Campos, and Y. Bengio. How to initialize your network? Robust initialization for WeightNorm & ResNets. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 10902–10911. Curran Associates, Inc., 2019.

T. Bachlechner, B.P. Majumder, H. Mao, G. Cottrell, and J. McAuley. ReZero is all you need: Fast convergence at large depth. In C. de Campos and M.H. Maathuis, editors, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161, pages 1352–1361. PMLR, 2021.

C. Bayer, P.K. Friz, and N. Tapia. Stability of deep neural networks via discrete rough paths. SIAM Journal on Mathematics of Data Science, 5:50–76, 2023.

A. Brock, S. De, and S.L. Smith. Characterizing signal propagation to close the performance gap in unnormalized ResNets. In International Conference on Learning Representations, 2021.

B. Chang, M. Chen, E. Haber, and E.H. Chi. AntisymmetricRNN: A dynamical system view on recurrent neural networks. In International Conference on Learning Representations, 2019.

R.T.Q. Chen, Y. Rubanova, J. Bettencourt, and D.K. Duvenaud. Neural ordinary differential equations. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 6572–6583. Curran Associates, Inc., 2018.

F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

A.-S. Cohen, R. Cont, A. Rossier, and R. Xu. Scaling properties of deep residual networks. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 2039–2048. PMLR, 2021.

R. Cont, A. Rossier, and R. Xu. Asymptotic analysis of deep residual networks. arXiv:2212.08199, 2022.

S. De and S. Smith. Batch normalization biases residual blocks towards the identity function in deep networks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 19964–19975. Curran Associates, Inc., 2020.

L. Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29:141–142, 2012.

W. E, J. Han, and Q. Li. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6:10, 2019.

A. Fermanian, P. Marion, J.-P. Vert, and G. Biau. Framing RNN as a kernel method: A neural ODE approach. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 3121–3134. Curran Associates, Inc., 2021.
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Y.W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9, pages 249–256. PMLR, 2010.

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

W. Grathwohl, R.T.Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

F. Götze and J. Jalowy. Rate of convergence to the circular law via smoothing inequalities for log-potentials. Random Matrices: Theory and Applications, 10:2150026, 2021.

B. Hanin and D. Rolnick. How to start training: The effect of initialization and architecture. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 569–579. Curran Associates, Inc., 2018.

S. Hayou, E. Clerico, B. He, G. Deligiannidis, A. Doucet, and J. Rousseau. Stable ResNet. In A. Banerjee and K. Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pages 1324–1332. PMLR, 2021.

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034. IEEE Computer Society, 2015.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016a.

K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, pages 630–645. Springer International Publishing, 2016b.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 448–456. PMLR, 2015.

S. Jastrzębski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD. arXiv:1711.04623, 2017.

P. Kidger, J. Foster, X. Li, and T. Lyons. Efficient and accurate gradients for neural SDEs. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 18747–18761. Curran Associates, Inc., 2021.

D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

P.E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer, Berlin, 1992.

A. Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In E.P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 28–36. PMLR, 2014.

A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

A. Lederer, J. Umlauft, and S. Hirche. Uniform error bounds for Gaussian process regression with application to safe control. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 659–669. Curran Associates, Inc., 2019.
X. Li, T.-K.L. Wong, R.T.Q. Chen, and D.K. Duvenaud. Scalable gradients and variational inference for stochastic differential equations. In C. Zhang, F. Ruiz, T. Bui, A.B. Dieng, and D. Liang, editors, Proceedings of the 2nd Symposium on Advances in Approximate Bayesian Inference, volume 118, pages 1–28. PMLR, 2020.

A. Maas, A. Hannun, and A. Ng. Rectifier nonlinearities improve neural network acoustic models. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28. PMLR, 2013.

P. Marion, Y.-H. Wu, M.E. Sander, and G. Biau. Implicit regularization of deep residual networks towards neural ODEs. In International Conference on Learning Representations, 2024.

E. Pauwels. Statistics, Optimization and Algorithms in High Dimension. Lecture Notes, Toulouse 3 Paul Sabatier University, 2020.

S. Peluchetti and S. Favaro. Infinitely deep neural networks as diffusion processes. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, volume 108, pages 1126–1136. PMLR, 2020.

M.E. Sander, P. Ablin, and G. Peyré. Do residual neural networks discretize neural ordinary differential equations? In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 36520–36532. Curran Associates, Inc., 2022.

J. Shao, K. Hu, C. Wang, X. Xue, and B. Raj. Is normalization indispensable for training deep neural networks? In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13434–13444. Curran Associates, Inc., 2020.

M. Thorpe and Y. van Gennip. Deep limits of residual neural networks. Research in the Mathematical Sciences, 10:6, 2023.

R. van Handel. Probability in High Dimension. APC 550 Lecture Notes, Princeton University, 2016.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc., 2017.

H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and F. Wei. DeepNet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46:6761–6774, 2024.

B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853, 2015.

G. Yang and S. Schoenholz. Mean field residual networks: On the edge of chaos. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 2865–2873. Curran Associates, Inc., 2017.

H. Zhang, Y.N. Dauphin, and T. Ma. Fixup initialization: Residual learning without normalization. In International Conference on Learning Representations, 2019a.

H. Zhang, D. Yu, M. Yi, W. Chen, and T.-Y. Liu. Convergence theory of learning overparameterized ResNet: A full characterization. arXiv:1903.07120v4, 2019b.
H. Zhang, D. Yu, M. Yi, W. Chen, and T.-Y. Liu. Stabilize deep ResNet with a sharp scaling factor τ. Machine Learning, 111:3359–3392, 2022.

J. Zhang, B. Han, L. Wynter, B.K.H. Low, and M. Kankanhalli. Towards robust ResNet: A small step but a giant leap. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4285–4291. International Joint Conferences on Artificial Intelligence Organization, 2019c.