# Regularization in ResNet with Stochastic Depth

Soufiane Hayou (Department of Statistics, University of Oxford, United Kingdom), Fadhel Ayed (Huawei Technologies, France). *Equal contribution.*

**Abstract.** Regularization plays a major role in modern deep learning. From classic techniques such as $L_1$ and $L_2$ penalties to other noise-based methods such as Dropout, regularization often yields better generalization properties by avoiding overfitting. Recently, Stochastic Depth (SD) has emerged as an alternative regularization technique for residual neural networks (ResNets) and has proven to boost the performance of ResNets on many tasks [Huang et al., 2016]. Despite the recent success of SD, little is known about this technique from a theoretical perspective. This paper provides a hybrid analysis combining perturbation analysis and signal propagation to shed light on different regularization effects of SD. Our analysis allows us to derive principled guidelines for choosing the survival rates used for training with SD.

## 1 Introduction

Stochastic Depth (SD) is a well-established regularization method that was first introduced by Huang et al. [2016]. It is similar in principle to Dropout [Hinton et al., 2012, Srivastava et al., 2014] and DropConnect [Wan et al., 2013]. It belongs to the family of noise-based regularization techniques, which includes other methods such as noise injection in the data [Webb, 1994, Bishop, 1995] and noise injection throughout the network [Camuto et al., 2020]. While Dropout (resp. DropConnect) consists of removing some neurons (resp. weights) at each iteration, SD randomly drops full layers and only updates the weights of the resulting subnetwork at each training iteration. As a result of this mechanism, SD can only be used with residual neural networks (ResNets).

There exists a stream of papers in the literature on the regularization effect of Dropout for linear models [Wager et al., 2013, Mianjy and Arora, 2019, Helmbold and Long, 2015, Cavazza et al., 2017]. Recent work by Wei et al. [2020] extended this analysis to deep neural networks using second-order perturbation analysis, disentangling the explicit regularization of Dropout on the loss function from its implicit regularization on the gradient. Similarly, Camuto et al. [2020] studied the explicit regularization effect induced by adding Gaussian noise to the activations and empirically illustrated the benefits of this regularization scheme. However, to the best of our knowledge, no analytical study of SD exists in the literature. This paper aims to fill this gap by studying the regularization effect of SD from an analytical point of view; this allows us to derive principled guidelines on the choice of the survival probabilities for network layers. Concretely, our contributions are four-fold:

- We show that SD acts as an explicit regularizer on the loss function by penalizing a notion of information discrepancy between keeping and removing certain layers.
- We prove that the uniform mode, defined as the choice of constant survival probabilities, is related to maximum regularization using SD.
- We study the large-depth behaviour of SD and show that in this limit, SD mimics Gaussian noise injection by implicitly adding data-adaptive Gaussian noise to the pre-activations.
- By defining the training budget $\bar{L}$ as the desired average depth, we show the existence of two different regimes: a small budget regime and a large budget regime. We introduce a new algorithm called SenseMode to compute the survival rates under a fixed training budget, and we provide a series of experiments that validate our Budget hypothesis introduced in Section 5.

## 2 Stochastic Depth Neural Networks

Stochastic depth neural networks were first introduced by Huang et al. [2016]. They are standard residual neural networks with random depth. In practice, each block of the residual network is multiplied by a random Bernoulli variable $\delta_l$ ($l$ is the block's index) that is equal to $1$ with some survival probability $p_l$ and $0$ otherwise. The mask is re-sampled after each training iteration, making the gradient act solely on the subnetwork composed of the blocks with $\delta_l = 1$. We consider a slightly different version where we apply the binary mask to the pre-activations instead of the activations. We define a depth-$L$ stochastic depth ResNet by

$$
\begin{aligned}
y_0(x; \delta) &= \Psi_0(x, W_0),\\
y_l(x; \delta) &= y_{l-1}(x; \delta) + \delta_l \Psi_l(y_{l-1}(x; \delta), W_l), \qquad 1 \le l \le L,\\
y_{out}(x; \delta) &= \Psi_{out}(y_L(x; \delta), W_{out}),
\end{aligned}
\tag{1}
$$

where $W_l$ are the weights in the $l$-th layer, $\Psi_l$ is a mapping that defines the nature of the layer, $y_l$ are the pre-activations, and $\delta = (\delta_l)_{1 \le l \le L}$ is a vector of Bernoulli variables with survival parameters $p = (p_l)_{1 \le l \le L}$; $\delta$ is re-sampled at each iteration. For the sake of simplification, we consider constant-width ResNets and we further denote by $N$ the width, i.e. for all $l \in [L-1]$, $y_l \in \mathbb{R}^N$. The output function of the network is given by $s(y_{out})$, where $s$ is some convenient mapping for the learning task, e.g. the Softmax mapping for classification tasks. We denote by $o$ the dimension of the network output, i.e. $s(y_{out}) \in \mathbb{R}^o$, which is also the dimension of $y_{out}$. For our theoretical analysis, we consider a Vanilla model with residual blocks composed of a fully connected linear layer, $\Psi_l(x, W) = W\phi(x)$, where $\phi$ is the activation function. The weights are initialized with He init [He et al., 2015], e.g. for ReLU, $W^l_{ij} \sim \mathcal{N}(0, 2/N)$.

There are no principled guidelines for choosing the survival probabilities. However, the original paper by Huang et al. [2016] proposes two alternatives that appear to make empirical consensus, the uniform and linear modes, described by

$$
\text{Uniform: } p_l = p_L, \qquad \text{Linear: } p_l = 1 - \frac{l}{L}(1 - p_L),
$$

where we conveniently parameterize both alternatives using $p_L$.

**Training budget $\bar{L}$.** We define the training budget $\bar{L}$ as the desired average depth of the subnetworks sampled with SD. The user typically fixes this budget; e.g., a small budget can be necessary when training is conducted on small-capacity devices.

**Depth of the subnetwork.** Given the mode $p = (p_l)_{1 \le l \le L}$, after each iteration the subnetwork has depth $L_\delta = \sum_{l=1}^{L} \delta_l$ with average $L_p := \mathbb{E}_\delta[L_\delta] = \sum_{l=1}^{L} p_l$. Given a budget $\bar{L}$, there is an infinite number of modes $p$ such that $L_p = \bar{L}$. In the next lemma, we provide probabilistic bounds on $L_\delta$ using standard concentration inequalities. We also show that with a fixed budget $\bar{L}$, the uniform mode is linked to maximal variability.

**Lemma 1 (Concentration of $L_\delta$).** For any $\beta \in (0, 1)$, with probability at least $1 - \beta$,

$$
|L_\delta - L_p| \le v_p\, u^{-1}\!\left(\frac{\log(2/\beta)}{v_p}\right), \tag{2}
$$

where $L_p = \mathbb{E}[L_\delta] = \sum_{l=1}^{L} p_l$, $v_p = \mathrm{Var}[L_\delta] = \sum_{l=1}^{L} p_l(1 - p_l)$, and $u(t) = (1 + t)\log(1 + t) - t$. Moreover, for a given average depth $L_p = \bar{L}$, the upper bound in Eq. (2) is maximal for the uniform choice of survival probabilities $p = (\bar{L}/L, \ldots, \bar{L}/L)$.

Figure 1: (a) Distribution of $L_\delta$ for a ResNet100 with average survival rate $\bar{L}/L = 0.5$, for the uniform and linear modes. (b) Training time of Dropout and SD on CIFAR10 with ResNet56 for 100 epochs.
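To make Lemma 1 concrete, the following minimal NumPy sketch (ours, not the authors' code; the normalization making the linear mode match the budget is our assumption) samples SD masks under a fixed budget and compares the spread of $L_\delta$ across the two modes. The uniform mode shows the largest variability, as in Fig. 1a.

```python
import numpy as np

rng = np.random.default_rng(0)
L, L_bar = 100, 50          # depth and training budget (average depth), as in Fig. 1a
n_draws = 100_000           # number of sampled masks

# Uniform mode: p_l = L_bar / L for every layer.
p_uniform = np.full(L, L_bar / L)

# Linear mode: p_l = 1 - (l/L)(1 - p_L), with p_L chosen so that sum_l p_l = L_bar.
l = np.arange(1, L + 1)
one_minus_pL = (1 - L_bar / L) * 2 * L / (L + 1)
p_linear = 1 - (l / L) * one_minus_pL

for name, p in [("uniform", p_uniform), ("linear", p_linear)]:
    masks = rng.random((n_draws, L)) < p      # Bernoulli(p_l) mask, one row per iteration
    depths = masks.sum(axis=1)                # L_delta = sum_l delta_l
    v_p = np.sum(p * (1 - p))                 # Var[L_delta], as in Lemma 1
    print(f"{name:7s} mean={depths.mean():6.2f} (target {L_bar}) "
          f"std={depths.std():5.2f} sqrt(v_p)={np.sqrt(v_p):5.2f}")
```

Both modes concentrate around $L_p = \bar{L} = 50$, but the empirical standard deviation (and hence the Bennett-type bound of Lemma 1) is largest for the uniform mode.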
Lemma 1 shows that, with high probability, the depth of the subnetwork obtained with SD is within an $\ell_1$ error of $v_p\, u^{-1}(\log(2/\beta)/v_p)$ of the average depth $L_p$. Given a fixed budget $\bar{L}$, this segment is maximized by the uniform mode $p = (\bar{L}/L, \ldots, \bar{L}/L)$; Fig. 1a highlights this result. This was expected, since the variance of the depth $L_\delta$ is also maximized by the uniform mode. The uniform mode corresponds to maximum entropy of the random depth, which intuitively results in maximum regularization. We depict this behaviour in more detail in Section 4.

**SD vs Dropout.** From a computational point of view, SD has the advantage of reducing the effective depth during training. Depending on the chosen budget, the subnetworks might be significantly shallower than the full network (Lemma 1). This depth compression can be effectively leveraged to reduce training time (Fig. 1b). This is not the case with Dropout. Indeed, even if the dropout probabilities are chosen so that, on average, we keep the same number of parameters as with SD, we still have to multiply matrices $L$ times during the forward/backward propagation; it is not straightforward to leverage the sparsity induced by Dropout for computational gain. In practice, Dropout requires an additional step of sampling a mask for every neuron, resulting in longer training times than without Dropout (Fig. 1b). However, there is a trade-off between how small the budget is and the performance of the model trained with SD (Section 6).

## 3 Effect of Stochastic Depth at initialization

Empirical evidence strongly suggests that Stochastic Depth allows training deeper models [Huang et al., 2016]. Intuitively, at each iteration SD updates only the parameters of a subnetwork with average depth $L_p = \sum_l p_l < L$, which could potentially alleviate any exploding/vanishing gradient issue. This phenomenon is often faced when training ultra-deep neural networks. To formalize this intuition, we consider the model's asymptotic properties at initialization in the infinite-width limit $N \to +\infty$. This regime has been the focus of several theoretical studies [Neal, 1995, Poole et al., 2016, Schoenholz et al., 2017, Yang, 2020, Xiao et al., 2018, Hayou et al., 2019, 2020, 2021a], since it allows to derive analytically the distributions of different quantities of untrained neural networks. Specifically, randomly initialized ResNets, as well as other commonly used architectures such as fully connected feedforward networks, convolutional networks, and LSTMs, are equivalent to Gaussian processes in the infinite-width limit. An important ingredient in this theory is the Gradient Independence assumption, which we now state formally.

**Assumption 1 (Gradient Independence).** In the infinite-width limit, we assume that the weights $\widetilde{W}$ used for back-propagation are an i.i.d. version of the weights $W$ used for forward propagation.

Assumption 1 is ubiquitous in the literature on signal propagation in deep neural networks. It has been used to derive theoretical results on signal propagation in randomly initialized deep neural networks [Schoenholz et al., 2017, Poole et al., 2016, Yang and Schoenholz, 2017, Hayou et al., 2021b,a] and is also a key tool in the derivation of the Neural Tangent Kernel [Jacot et al., 2018, Arora et al., 2019, Hayou et al., 2020].
Recently, it has been shown by Yang [2020] that Assumption 1 yields the exact computation of the gradient covariance in the infinite-width limit. See Appendix A0.3 for a detailed discussion of this assumption. Throughout the paper, we provide numerical results that substantiate the theoretical results derived under this assumption, and we show that it yields an excellent match between theory and experiments. Leveraging this assumption, Yang and Schoenholz [2017] and Hayou et al. [2021a] proved that ResNets suffer from exploding gradients at initialization. We show in the next proposition that SD helps mitigate this exploding gradient behaviour at initialization in infinite-width ResNets.

Figure 2: Empirical illustration of Proposition 1 ((a) $\bar{L}/L = 0.5$, standard ResNet; (b) $\bar{L}/L = 0.7$, standard ResNet) and of Stable ResNet ((c) $\bar{L}/L = 0.7$, stable ResNet). Comparison of the growth rate of the gradient magnitude $q_l(x, z)$ at initialization for a Vanilla ResNet50 with width 512. The y-axes of (a) and (b) are in log scale; the y-axis of (c) is in linear scale. The expectation is computed using 500 Monte-Carlo (MC) samples.

**Proposition 1.** Let $\phi = \mathrm{ReLU}$ and $\mathcal{L}(x, z) = \ell(y_{out}(x; \delta), z)$ for $(x, z) \in \mathbb{R}^d \times \mathbb{R}^o$, where $\ell(z, z')$ is some differentiable loss function. Let

$$
q_l(x, z) = \mathbb{E}_{W, \delta}\!\left[\frac{\|\nabla_{y_l} \mathcal{L}\|^2}{\|\nabla_{y_L} \mathcal{L}\|^2}\right],
$$

where the numerator and denominator are respectively the squared norms of the gradients with respect to the inputs of the $l$-th and $L$-th layers. Then, in the infinite-width limit, under Assumption 1, for all $l \in [L]$ and $(x, z) \in \mathbb{R}^d \times \mathbb{R}^o$, we have:

- with Stochastic Depth, $q_l(x, z) = \prod_{k=l+1}^{L}(1 + p_k)$;
- without Stochastic Depth (i.e. $\delta = 1$), $q_l(x, z) = 2^{L-l}$.

Table 1: Gradient magnitude growth rate for a Vanilla ResNet50 with width 512 and training budget $\bar{L}/L = 0.7$. Empirical vs. theoretical value (in parentheses) of the growth rate of $q_l(x, z)$ at initialization, for the standard (no SD), uniform, and linear modes. The expectation is computed using 500 MC samples.

| $l$ | Standard | Uniform | Linear |
|---|---|---|---|
| 0 | 2.001 (2) | 1.705 (1.7) | 1.694 (1.691) |
| 10 | 2.001 (2) | 1.708 (1.7) | 1.633 (1.629) |
| 20 | 2.001 (2) | 1.707 (1.7) | 1.569 (1.573) |
| 30 | 2.001 (2) | 1.716 (1.7) | 1.555 (1.516) |
| 40 | 1.999 (2) | 1.739 (1.7) | 1.530 (1.459) |

Proposition 1 indicates that, with or without SD, the gradient explodes exponentially at initialization as it backpropagates through the network. However, with SD, the exponential growth is controlled by the mode $p$. Intuitively, if we choose $p_l \ll 1$ for some layer $l$, then the contribution of this layer to the exponential growth is negligible, since $1 + p_l \approx 1$. From a practical point of view, the choice $p_l \ll 1$ means that the $l$-th layer is hardly present in any subnetwork during training, thus making its contribution to the gradient negligible on average (w.r.t. $\delta$). For a ResNet with $L = 50$ and uniform mode $p = (1/2, \ldots, 1/2)$, SD reduces the gradient explosion by six orders of magnitude. Figs. 2a and 2b illustrate the exponential growth of the gradient for the uniform and linear modes, compared with the growth without SD. We compare the empirical and theoretical growth rates of the gradient magnitude in Table 1; the results show a good match between our theoretical result (under Assumption 1) and the empirical ones. Further analysis can be found in Appendix A4.
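The theoretical column of Table 1 can be recovered from Proposition 1 alone. The NumPy sketch below is our own illustration: we read the tabulated "growth rate" at layer $l$ as the per-layer geometric mean $q_l^{1/(L-l)}$, and we normalize the linear mode to the budget; both readings are our assumptions, and they reproduce the standard and uniform columns exactly and the linear column approximately.

```python
import numpy as np

L = 50                      # depth, as in Table 1
budget_ratio = 0.7          # average survival rate L_bar / L

# Survival probabilities for the three modes of Table 1.
l = np.arange(1, L + 1)
p_uniform = np.full(L, budget_ratio)
one_minus_pL = (1 - budget_ratio) * 2 * L / (L + 1)   # makes the linear mode average to 0.7
p_linear = 1 - (l / L) * one_minus_pL
p_standard = np.ones(L)                               # no SD: every layer is kept

def growth_rate(p, l0):
    """Per-layer geometric growth rate q_{l0}^{1/(L - l0)}, with
    q_{l0} = prod_{k > l0} (1 + p_k) from Proposition 1."""
    factors = 1 + p[l0:]                              # (1 + p_k) for k = l0+1, ..., L
    return np.exp(np.mean(np.log(factors)))

for l0 in [0, 10, 20, 30, 40]:
    print(f"l={l0:2d}  standard={growth_rate(p_standard, l0):.3f}  "
          f"uniform={growth_rate(p_uniform, l0):.3f}  "
          f"linear={growth_rate(p_linear, l0):.3f}")
```

For the standard network the rate is exactly $2$; for the uniform mode it is $1 + \bar{L}/L = 1.7$ at every layer; for the linear mode it decreases with depth, since the deeper blocks have smaller survival probabilities.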
**Stable ResNet.** Hayou et al. [2021a] have shown that introducing a scaling factor $1/\sqrt{L}$ in front of the residual blocks is sufficient to avoid the exploding gradient at initialization, as illustrated in Fig. 2c. The hidden layers of a Stable ResNet (with SD) are given by

$$
y_l(x; \delta) = y_{l-1}(x; \delta) + \frac{\delta_l}{\sqrt{L}}\, \Psi_l(y_{l-1}(x; \delta), W_l), \qquad 1 \le l \le L. \tag{3}
$$

The intuition behind the choice of the scaling factor $1/\sqrt{L}$ comes from the variance of $y_l$. At initialization, with a standard ResNet (Eq. (1)), we have $\mathrm{Var}[y_l] = \mathrm{Var}[y_{l-1}] + \Theta(1)$, which implies $\mathrm{Var}[y_l] = \Theta(l)$. With a Stable ResNet (Eq. (3)), this becomes $\mathrm{Var}[y_l] = \mathrm{Var}[y_{l-1}] + \Theta(1/L)$, resulting in $\mathrm{Var}[y_l] = \Theta(1)$ (see Hayou et al. [2021a] for more details). In the rest of the paper, we restrict our analysis to Stable ResNets; this will help isolate the regularization effect of SD in the limit of large depth, without any variance or gradient explosion issues.

Nevertheless, the natural connection between SD and Dropout, coupled with the line of work on the regularization effect induced by the latter [Wager et al., 2013, Mianjy and Arora, 2019, Helmbold and Long, 2015, Cavazza et al., 2017, Wei et al., 2020], suggests that the benefits of SD are not limited to controlling the magnitude of the gradient. Using a second-order Taylor expansion, Wei et al. [2020] have shown that Dropout induces an explicit regularization on the loss function. Intuitively, one should expect a similar effect with SD. In the next section, we elucidate the explicit regularization effect of SD on the loss function, and we shed light on another regularization effect of SD that occurs in the large-depth limit.

## 4 Regularization effect of Stochastic Depth

### 4.1 Explicit regularization on the loss function

Consider a dataset $\mathcal{D} = \mathcal{X} \times \mathcal{T}$ consisting of $n$ (input, target) pairs $\{(x_i, t_i)\}_{1 \le i \le n}$ with $(x_i, t_i) \in \mathbb{R}^d \times \mathbb{R}^o$. Let $\ell : \mathbb{R}^o \times \mathbb{R}^o \to \mathbb{R}$ be a smooth loss function, e.g. the quadratic loss, the cross-entropy loss, etc. Define the model loss for a single sample $(x, t) \in \mathcal{D}$ by

$$
\mathcal{L}(W, x; \delta) = \ell(y_{out}(x; \delta), t), \qquad \mathcal{L}(W, x) = \mathbb{E}_\delta\left[\ell(y_{out}(x; \delta), t)\right],
$$

where $W = (W_l)_{0 \le l \le L}$. The empirical loss is given by $\mathcal{L}(W) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_\delta\left[\ell(y_{out}(x_i; \delta), t_i)\right]$.

To isolate the regularization effect of SD on the loss function, we use a second-order approximation of the loss around $\delta = 1$; this allows us to marginalize out the mask $\delta$. The full derivation is provided in Appendix A2, and a sketch of the marginalization step is given below. Let $z_l(x; \delta) = \Psi_l(W_l, y_{l-1}(x; \delta))$ denote the activations. For a pair $(x, t) \in \mathcal{D}$, we obtain

$$
\mathcal{L}(W, x) \approx \bar{\mathcal{L}}(W, x) + \frac{1}{2L}\sum_{l=1}^{L} p_l(1 - p_l)\, g_l(W, x), \tag{4}
$$

where $\bar{\mathcal{L}}(W, x) \approx \ell(y_{out}(x; p), t)$ (more precisely, $\bar{\mathcal{L}}(W, x)$ is the second-order Taylor approximation of $\ell(y_{out}(x; p), t)$ around $p = 1$),² and

$$
g_l(W, x) = z_l(x; 1)^T\, \nabla^2_{y_l}[\ell \circ G_l](y_l(x; 1))\, z_l(x; 1),
$$

where $G_l$ is the function defined by $y_{out}(x; 1) = G_l\!\left(y_{l-1}(x; 1) + \frac{1}{\sqrt{L}}\, z_l(x; 1)\right)$.

² Note that we could obtain Eq. (4) using the Taylor expansion around $\delta = p$. However, in this case the Hessian would depend on $p$, which complicates the analysis of the role of $p$ in the regularization term.

The first term $\bar{\mathcal{L}}(W, x)$ in Eq. (4) is the loss function of the average network (i.e. the network obtained by replacing $\delta$ with its mean $p$). Thus, Eq. (4) shows that training with SD amounts to training the average network with an explicit regularization term that implicitly depends on the weights $W$.

**SD enforces flatness.** The presence of the Hessian in the penalization term provides a geometric interpretation of the regularization induced by SD: it enforces a notion of flatness determined by the Hessian of the loss function with respect to the hidden activations $z_l$. This flatness is naturally inherited by the weights, thus leading to flatter minima. Keskar et al. [2016], Jastrzebski et al. [2018], and Yao et al. [2018] showed empirically that flat minima yield better generalization than minima with large second derivatives of the loss. Wei et al. [2020] have shown that a similar behaviour occurs in networks with Dropout.
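To see where Eq. (4) comes from, here is our condensed sketch of the marginalization step (the full derivation is in Appendix A2; this compressed version is ours). We write $\ell(\delta) := \ell(y_{out}(x; \delta), t)$, with all derivatives evaluated at $\delta = 1$.

```latex
% Second-order Taylor expansion of \ell(\delta) around \delta = 1:
\ell(\delta) \approx \ell(1) + \sum_{l}(\delta_l - 1)\,\partial_{\delta_l}\ell
  + \tfrac{1}{2}\sum_{l,k}(\delta_l - 1)(\delta_k - 1)\,\partial^2_{\delta_l \delta_k}\ell .

% Take the expectation over \delta, using E[\delta_l - 1] = p_l - 1 and
% Cov(\delta_l, \delta_k) = p_l(1 - p_l)\,\mathbf{1}_{l = k}:
\mathbb{E}_\delta[\ell(\delta)] \approx
  \underbrace{\ell(1) + \sum_l (p_l - 1)\,\partial_{\delta_l}\ell
  + \tfrac{1}{2}\sum_{l,k}(p_l - 1)(p_k - 1)\,\partial^2_{\delta_l \delta_k}\ell}_{\bar{\mathcal{L}}(W, x)}
  \;+\; \tfrac{1}{2}\sum_l p_l(1 - p_l)\,\partial^2_{\delta_l \delta_l}\ell .

% Since y_{out} = G_l\big(y_{l-1} + \tfrac{\delta_l}{\sqrt{L}}\, z_l\big), the chain rule gives
\partial^2_{\delta_l \delta_l}\ell
  = \tfrac{1}{L}\, z_l^{T}\, \nabla^2_{y_l}[\ell \circ G_l](y_l(x;1))\, z_l
  = \tfrac{1}{L}\, g_l(W, x),

% which yields the penalty \tfrac{1}{2L}\sum_l p_l(1 - p_l)\, g_l(W, x) of Eq. (4).
```

The first bracket collects the second-order expansion of the average network's loss, while the diagonal covariance term produces the explicit regularizer.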
Let $J_l(x) = \nabla_{y_l} G_l(y_l(x; 1))$ be the Jacobian of the output layer with respect to the hidden layer $y_l$ with $\delta = 1$, and let $H(x) = \nabla^2_z \ell(z)\big|_{z = y_{out}(x; 1)}$ be the Hessian of the loss function $\ell$. The Hessian matrix inside the penalization term $g_l(W, x)$ can be decomposed as in [LeCun et al., 2012, Sagun et al., 2017]:

$$
\nabla^2_{y_l}[\ell \circ G_l](y_l(x; 1)) = J_l(x)^T H(x) J_l(x) + \Gamma_l(x),
$$

where $\Gamma_l$ depends on the Hessian of the network output. $\Gamma_l$ is generally not positive semi-definite, and therefore cannot be seen as a regularizer. Moreover, it has been shown empirically that the first term generally dominates and drives the regularization effect [Sagun et al., 2017, Wei et al., 2020, Camuto et al., 2020]. Thus, we restrict our analysis to the regularization induced by the first term, and we consider the new version of $g_l$ defined by

$$
g_l(W, x) = \zeta_l(x, W)^T H(x)\, \zeta_l(x, W) = \mathrm{Tr}\!\left(H(x)\, \zeta_l(x, W)\, \zeta_l(x, W)^T\right), \tag{5}
$$

where $\zeta_l(x, W) = J_l(x)\, z_l(x; 1)$. The quality of this approximation is discussed in Appendix A4.

**Information discrepancy.** The vector $\zeta_l$ represents a measure of the information discrepancy between keeping and removing the $l$-th layer. Indeed, $\zeta_l$ measures the sensitivity of the model output to the $l$-th layer:

$$
y_{out}(x; 1) - y_{out}(x; 1_{-l}) \approx \partial_{\delta_l} y_{out}(x; \delta)\big|_{\delta = 1} = \frac{1}{\sqrt{L}}\,\zeta_l(x, W),
$$

where $1_{-l}$ is the vector with $1$ everywhere and $0$ in the $l$-th coordinate. With this in mind, the regularization term $g_l$ in Eq. (5) is largest when the information discrepancy is well aligned with the Hessian of the loss function: SD penalizes mostly the layers whose information discrepancy violates flatness, confirming our intuition above.

**Quadratic loss.** For the quadratic loss $\ell(z, z') = \|z - z'\|_2^2$, the Hessian $H(x) = 2I$ is isotropic, i.e. it does not favor any direction over the others; intuitively, we expect the penalization to be similar across layers. In this case, $g_l(W, x) = 2\|\zeta_l(x, W)\|_2^2$, and the loss is given by

$$
\mathcal{L}(W) \approx \bar{\mathcal{L}}(W) + \frac{1}{2L}\sum_{l=1}^{L} p_l(1 - p_l)\, g_l(W), \tag{6}
$$

where $g_l(W) = \frac{2}{n}\sum_{i=1}^{n} \|\zeta_l(x_i, W)\|_2^2$ is the regularization term marginalized over the inputs $\mathcal{X}$. Eq. (6) shows that the mode $p$ has a direct impact on the regularization term induced by SD. The latter penalizes most the layers whose survival probability $p_l$ is close to $50\%$; the mode $p = (1/2, \ldots, 1/2)$ is therefore a universal maximizer of the regularization term for fixed weights $W$. However, given a training budget $\bar{L}$, the mode $p$ that maximizes the regularization term $\frac{1}{2L}\sum_{l=1}^{L} p_l(1 - p_l)\, g_l(W)$ depends on the values of $g_l(W)$. We show this in the next lemma.

**Lemma 2 (Max regularization).** Consider the empirical loss $\mathcal{L}$ given by Eq. (6) for some fixed weights $W$ (e.g. $W$ could be the weights at any training step of SGD). Then, given a training budget $\bar{L}$, the regularization is maximal for

$$
p^*_l = \min\!\left(1,\; \max\!\left(0,\; \tfrac{1}{2} + C\, g_l(W)^{-1}\right)\right),
$$

where $C$ is a normalizing constant that has the same sign as $\bar{L} - \frac{L}{2}$. The global maximum is obtained for $p_l = 1/2$.

Figure 3: Distribution of $g_l(W)$ across the layers at initialization for a Vanilla ResNet50 with width 512.

Lemma 2 shows that under a fixed budget, the mode $p^*$ that maximizes the regularization induced by SD is generally layer-dependent (i.e. not uniform). However, we show next that at initialization, on average (w.r.t. $W$), $p^*$ is uniform.
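The optimal mode of Lemma 2 can be computed with a one-dimensional search over the normalizing constant $C$, since $\sum_l p^*_l$ is non-decreasing in $C$. Below is a minimal NumPy sketch (ours, not the authors' code; the function name and the example values of $g_l$ are illustrative):

```python
import numpy as np

def max_reg_mode(g, budget, tol=1e-10):
    """Lemma 2 sketch: maximize sum_l p_l (1 - p_l) g_l subject to
    sum_l p_l = budget and 0 <= p_l <= 1, by bisection on the constant C
    in p_l* = clip(1/2 + C / g_l, 0, 1)."""
    g = np.asarray(g, dtype=float)

    def p_of(C):
        return np.clip(0.5 + C / g, 0.0, 1.0)

    # C = -max(g) forces every p_l to 0, C = +max(g) forces every p_l to 1.
    lo, hi = -np.max(g), np.max(g)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_of(mid).sum() < budget:
            lo = mid
        else:
            hi = mid
    return p_of(0.5 * (lo + hi))

# Example: heterogeneous g_l with L = 10 layers and a small budget L_bar = 3.
g = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 0.8, 2.5, 1.2, 0.9, 2.0])
p_star = max_reg_mode(g, budget=3.0)
print(p_star, p_star.sum())   # sums to the budget; here C < 0 since 3 < 10/2
```

The same budget-matching device, a scalar constant tuned so that the survival rates sum to $\bar{L}$, reappears in the SenseMode algorithm of Section 5.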
**Theorem 1 ($p^*$ is uniform at initialization).** Assume $\phi = \mathrm{ReLU}$ and that the weights $W$ are initialized with $\mathcal{N}(0, \frac{2}{N})$. Then, in the infinite-width limit, under Assumption 1, for all $l \in [1 : L]$ we have $\mathbb{E}_W[g_l(W)] = \mathbb{E}_W[g_1(W)]$. As a result, given a budget $\bar{L}$, the average regularization term $\frac{1}{2L}\sum_{l=1}^{L} p_l(1 - p_l)\, \mathbb{E}_W[g_l(W)]$ is maximal for the uniform mode $p = (\bar{L}/L, \ldots, \bar{L}/L)$.

The proof of Theorem 1 is based on results from the theory of signal propagation in deep neural networks; we provide an overview of this theory in Appendix A0. Theorem 1 shows that, given a training budget $\bar{L}$ and a randomly initialized ResNet with $\mathcal{N}(0, 2/N)$ weights and $N$ large, the average (w.r.t. $W$) maximal regularization at initialization is almost achieved by the uniform mode. This is because the coefficients $\mathbb{E}_W[g_l(W)]$ are all equal under Assumption 1, which we highlight in Fig. 3. As a result, we would intuitively expect the uniform mode to perform best when the budget is large, e.g. $L$ is large and $\bar{L} \approx L$, since in this case, at each iteration, we update the weights of an overparameterized subnetwork, which requires more regularization than in the small budget regime. We formalize this intuition in Section 5. In the next section, we show that SD is linked to another regularization effect that only occurs in the large-depth limit; in this limit, SD mimics Gaussian noise injection methods by adding Gaussian noise to the pre-activations.

### 4.2 Stochastic Depth mimics Gaussian noise injection

Recent work by Camuto et al. [2020] studied the regularization effect of Gaussian noise injection (GNI) on the loss function and showed that adding isotropic Gaussian noise to the activations $z_l$ improves generalization by acting as a regularizer on the loss. The authors suggested adding zero-mean Gaussian noise parameterized by its variance: at training time $t$, this amounts to replacing $z^t_l$ by $z^t_l + \mathcal{N}(0, \sigma^2_l I)$, where $z^t_l$ is the value of the activations in the $l$-th layer at time $t$, and $\sigma^2_l$ is a parameter that controls the noise level. Empirically, adding this noise tends to boost performance by making the model robust to over-fitting.

Using a perturbation analysis similar to that of the previous section, we show that when the depth is large, SD mimics GNI by implicitly adding non-isotropic, data-adaptive Gaussian noise to the pre-activations $y_l$ at each training iteration. We bring to the reader's attention that the following analysis holds throughout training (it is not limited to initialization) and does not require the infinite-width regime.

Consider an arbitrary neuron $y^i_{\alpha L}$ in the $(\alpha L)$-th layer for some fixed $\alpha \in (0, 1)$. $y^i_{\alpha L}(x; \delta)$ can be approximated using a first-order Taylor expansion around $\delta = 1$. We obtain

$$
y^i_{\alpha L}(x; \delta) \approx \bar{y}^i_{\alpha L}(x) + \frac{1}{\sqrt{L}}\sum_{l=1}^{\alpha L} \eta_l \left\langle z_l,\, \nabla_{y_l} G^i_l(y_l(x; 1)) \right\rangle, \tag{7}
$$

where $G^i_l$ is defined by $y^i_{\alpha L}(x; 1) = G^i_l(y_l(x; 1))$, $\eta_l = \delta_l - p_l$, and

$$
\bar{y}^i_{\alpha L}(x) = y^i_{\alpha L}(x; 1) + \frac{1}{\sqrt{L}}\sum_{l=1}^{\alpha L}(p_l - 1)\left\langle z_l,\, \nabla_{y_l} G^i_l(y_l(x; 1)) \right\rangle \approx y^i_{\alpha L}(x; p).
$$

Let $\gamma_{\alpha, L}(x) = \frac{1}{\sqrt{L}}\sum_{l=1}^{\alpha L} \eta_l \left\langle z_l,\, \nabla_{y_l} G^i_l(y_l(x; 1)) \right\rangle$. With SD, $y^i_{\alpha L}(x; \delta)$ can therefore be seen as a perturbed version of $y^i_{\alpha L}(x; p)$ (the pre-activation of the average network) with noise $\gamma_{\alpha, L}(x)$. The scaling factor $1/\sqrt{L}$ ensures that $\gamma_{\alpha, L}$ remains bounded (in $\ell_2$ norm) as $L$ grows; without this scaling, the variance of $\gamma_{\alpha, L}$ would generally explode. The term $\gamma_{\alpha, L}$ captures the randomness of the binary mask $\delta$ and, up to a factor $\sqrt{\alpha}$, resembles the scaled mean in the Central Limit Theorem (CLT): it can be written as $\gamma_{\alpha, L}(x) = \sqrt{\alpha}\,\frac{1}{\sqrt{\alpha L}}\sum_{l=1}^{\alpha L} X_{l, L}(x)$, where $X_{l, L}(x) = \eta_l \left\langle z_l,\, \nabla_{y_l} G^i_l(y_l(x; 1)) \right\rangle$.
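To make the "scaled mean" picture concrete, the following NumPy/SciPy sketch (ours; the coefficients `mu` are synthetic stand-ins for $\langle z_l, \nabla_{y_l} G^i_l \rangle$) simulates $\gamma_{\alpha, L}$ over many mask draws, checks its variance against $\frac{1}{L}\sum_l p_l(1 - p_l)\mu_l^2$, and tracks the excess kurtosis, which vanishes under the Gaussian limit formalized in Theorem 2 below.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, p_bar, n_draws = 0.5, 0.7, 20_000

for L in [10, 100, 1000]:
    d = int(alpha * L)                               # gamma sums over the first alpha*L layers
    mu = rng.normal(size=d)                          # synthetic stand-ins for <z_l, grad G_l^i>
    p = np.full(d, p_bar)                            # uniform mode
    eta = (rng.random((n_draws, d)) < p) - p         # eta_l = delta_l - p_l (centered Bernoulli)
    gamma = (eta * mu).sum(axis=1) / np.sqrt(L)      # gamma_{alpha,L}(x), as in Eq. (7)
    var_th = np.sum(p * (1 - p) * mu**2) / L         # variance predicted by the CLT picture
    print(f"L={L:4d}  var: emp={gamma.var():.4f} th={var_th:.4f}  "
          f"excess kurtosis={stats.kurtosis(gamma):+.3f}")   # -> 0 under Gaussianity
```

The empirical variance matches the prediction at every depth, while the excess kurtosis shrinks toward zero as $L$ grows.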
Ideally, we would like to apply the CLT to conclude on the Gaussianity of $\gamma_{\alpha, L}(x)$ in the large-depth limit. However, the random variables $X_{l, L}$ are generally not i.i.d. (they have different variances), and they also depend on $L$; thus, the standard CLT argument fails. Fortunately, there is a more general form of the CLT, known as Lindeberg's CLT, which we use in the proof of the next theorem.

**Theorem 2.** Let $x \in \mathbb{R}^d$, $X_{l, L}(x) = \eta_l\, \mu_{l, L}(x)$ where $\mu_{l, L}(x) = \left\langle z_l,\, \nabla_{y_l} G^i_l(y_l(x; 1)) \right\rangle$, and $\sigma^2_{l, L}(x) = \mathrm{Var}_\delta[X_{l, L}(x)] = p_l(1 - p_l)\, \mu_{l, L}(x)^2$ for $l \in [L]$. Assume that:

1. there exists $a \in (0, 1/2)$ such that for all $L$ and all $l \in [L]$, $p_l \in (a, 1 - a)$;
2. $\displaystyle \lim_{L \to \infty} \frac{\max_{k \in [L]} \mu^2_{k, L}(x)}{\sum_{l=1}^{L} \mu^2_{l, L}(x)} = 0$;
3. $\displaystyle v_{\alpha, \infty}(x) := \lim_{L \to \infty} \frac{1}{L}\sum_{l=1}^{L} \sigma^2_{l, L}(x)$ exists and is finite.

Then $\gamma_{\alpha, L}(x) \xrightarrow[L \to \infty]{\mathcal{D}} \mathcal{N}\!\left(0,\, \alpha\, v_{\alpha, \infty}(x)\right)$.

Figure 4: (Theorem 2) The ratio in the second condition as a function of the depth $L$ and the training epoch.

Fig. 4 provides an empirical verification of the second condition of Theorem 2 across training epochs. There is a clear downtrend as the depth increases; this trend is consistent throughout training, which supports the validity of the second condition of Theorem 2 at all training times. Theorem 2 shows that training a ResNet with SD implicitly adds the noise $\gamma_{\alpha, L}(x)$ to $y^i_{\alpha L}$. This noise becomes asymptotically normally distributed,³ confirming that SD implicitly injects input-dependent Gaussian noise in this limit. Camuto et al. [2020] studied GNI in the context of input-independent noise and concluded on the benefit of such methods for the overall performance of the trained network. We empirically confirm the results of Theorem 2 in Section 6 using different statistical normality tests. Similarly, we study the implicit regularization effect of SD on the gradient in Appendix A3, and show that, under some assumptions, SD also acts implicitly on the gradient by adding Gaussian noise in the large-depth limit.

³ The limiting variance $v_{\alpha, \infty}(x)$ depends on the input $x$, suggesting that $\gamma_{\alpha, L}(\cdot)$ might converge in distribution to a Gaussian process in the limit of large depth, under stronger assumptions. We leave this for future work.

## 5 The Budget Hypothesis

We have seen in Section 4 that, given a budget $\bar{L}$, the uniform mode is linked to maximal regularization with SD at initialization (Theorem 1). Intuitively, for fixed weights $W$, the magnitude of standard regularization methods such as $\|\cdot\|_1$ or $\|\cdot\|_2$ penalties correlates with the number of parameters: the larger the model, the bigger the penalization term. Hence, in our case, we would expect the regularization term to correlate (in magnitude) with the number of parameters, or equivalently, with the number of trainable layers. Assuming $L \gg 1$, and given a fixed budget $\bar{L}$, the number of trainable layers at each training iteration is close to $\bar{L}$ (Lemma 1). Hence, the magnitude of the regularization term should depend on how large or small the budget $\bar{L}$ is compared to $L$.

**Small budget regime ($\bar{L}/L \ll 1$).** In this regime, the effective depth $L_\delta$ of the subnetwork is small compared to $L$. As the ratio $\bar{L}/L$ gets smaller, the sampled subnetworks become shallower, suggesting that the regularization need not be maximal in this case, and therefore that $p$ should not be uniform, in accordance with Theorem 1. Another way to look at this is through the bias-variance trade-off principle: as $\bar{L}/L \to 0$, the variance of the model decreases (and its bias increases), suggesting that less regularization is needed. The increase in bias inevitably causes a deterioration of the performance; we call this the Budget-performance trade-off. To derive a more sensible choice of $p$ for small budget regimes, we introduce a new algorithm based on the information discrepancy of Section 4.
We call this algorithm Sensitivity Mode or briefly Sense Mode. This algorithm works in two steps: 1. Compute the sensitivity (S) of the loss w.r.t the layer at initialization using the approximation, Sl = L(W ; 1) L(W ; 1l) δl L(W ; δ)|δ=1. Sl is a measure of the sensitivity of the loss to keeping/removing the lth layer. 2. Use a mapping ϕ to map S to the mode, p = ϕ(S), where ϕ is a linear mapping from the range of S to [pmin, 1] and pmin is the minimum survival rate (fixed by the user). φ is the linear mapping from the range of S to the segment [pmin, 1]. In other words, pl = pmin + α Sl, where the constant alpha is chosen in order to satisfy the budget constraint: P l pl = L. Large budget regime ( L/L 1). In this regime, the effective depth Lδ of the subnetworks is close to L, and thus, we are in the overparameterized regime where maximal regularization could boost the performance of the model by avoiding over-fitting. Thus, we anticipate the uniform mode to perform better than other alternatives in this case. We are now ready to formally state our hypothesis, Budget hypothesis. Assuming L 1, the uniform mode outperforms Sense Mode in the large budget regime, while Sense Mode outperforms the uniform mode in the small budget regime. We empirically validate the Budget hypothesis and the Budget-performance trade-off in Section 6. 6 Experiments The objective of this section is two-fold: we empirically verify the theoretical analysis developed in sections 3 and 4 with a Vanilla Res Net model on a toy regression task; we also empirically validate the Budget Hypothesis on the benchmark datasets CIFAR-10 and CIFAR-100 [Krizhevsky et al., 2009]. Notebooks and code to reproduce all experiments, plots and tables presented are available in the supplementary material. We perform comparisons at constant training budgets. Implementation details: Vanilla Stable Res Net is composed of identical residual blocks each formed of a Linear layer followed by Re LU. Res Net110 follows [He et al., 2016, Huang et al., 2016]; it comprises three groups of residual blocks; each block consists of a sequence Convolution Batch Norm-Re LU-Convolution-Batch Norm. We use the adjective "Stable" (Stable Vanilla Res Net, Stable Res Net110) to indicate that we scale the blocks using a factor 1/ L as described in Section 3. We build on an open-source implementation of standard Res Net4. The toy regression task consists of estimating the function fβ : x 7 sin(βT x), where the inputs x and parameter β are in R256, sampled from a standard Gaussian. CIFAR-10, CIFAR-100 contain 32-by-32 color images, representing respectively 10 and 100 classes of natural scene objects. We present here our main conclusions. Further implementation details and other insightful results are in the Appendix A4. Gaussian Noise Injection: We proceed by empirically verifying the Gaussian behavior of the neurons as described in Theorem 4. For each input x, we sample 200 masks and the corresponding y(x; δ). We then compute the p-value pvx of the Shapiro-Wilk test of normality [Shapiro and Wilk, 1965]. In Fig. 5 we represent the distribution of the p-values {pvx | x X}. We can see that the Gaussian behavior holds throughout training (left). On the right part of the Figure, we can see that the Normal behavior becomes accurate after approximately 20 layers. In the Appendix we report further experiments with different modes, survival rates, and a different test of normality to verify both Theorem 2 and the critical assumption 2. 
Figure 5: Empirical verification of Theorem 2 on a Vanilla ResNet100 with width 128, average survival probability $\bar{L}/L = 0.7$, and uniform mode. Distribution of the p-values of Shapiro's normality test as a function of the training epoch (left) and of the depth in the network (right). The tests are performed on the final output neuron $y_L(x)$ (left) and on an arbitrary neuron per layer (right).

**Empirical verification of the Budget hypothesis.** We compare the three modes (Uniform, Linear, and SenseMode) on two benchmark datasets, over a grid of survival proportions. The reported values and standard deviations are obtained from four runs. For SenseMode, we use the simple rule $p_l \propto |S_l|$, where $S_l$ is the sensitivity (see Section 5). We report the results for Stable ResNet110 in Table 2; results for Stable ResNet56 are reported in Appendix A4.

Table 2: Comparison of the modes of selection of the survival probabilities, at fixed budget, for Stable ResNet110.

(a) CIFAR10

| $\bar{L}/L$ | Uniform | SenseMode | Linear |
|---|---|---|---|
| 0.1 | 17.2 ± 0.3 | 15.4 ± 0.4 | |
| 0.2 | 10.3 ± 0.4 | 9.3 ± 0.5 | |
| 0.3 | 7.7 ± 0.2 | 7.0 ± 0.3 | |
| 0.4 | 7.4 ± 0.3 | 7.3 ± 0.4 | |
| 0.5 | 6.8 ± 0.1 | 7.3 ± 0.2 | 9.1 ± 0.1 |
| 0.6 | 6.3 ± 0.2 | 6.9 ± 0.1 | 7.5 ± 0.2 |
| 0.7 | 5.9 ± 0.1 | 7.3 ± 0.3 | 6.4 ± 0.2 |
| 0.8 | 5.7 ± 0.1 | 6.6 ± 0.2 | 6.1 ± 0.2 |
| 0.9 | 5.7 ± 0.1 | 6.2 ± 0.2 | 6.0 ± 0.2 |
| 1 | 6.37 ± 0.12 | | |

(b) CIFAR100

| $\bar{L}/L$ | Uniform | SenseMode | Linear |
|---|---|---|---|
| 0.1 | 55.2 ± 0.4 | 51.3 ± 0.6 | |
| 0.2 | 38.3 ± 0.3 | 36.4 ± 0.4 | |
| 0.3 | 31.4 ± 0.2 | 30.1 ± 0.5 | |
| 0.4 | 30.9 ± 0.2 | 28.5 ± 0.4 | |
| 0.5 | 28.4 ± 0.3 | 29.5 ± 0.5 | 36.5 ± 0.4 |
| 0.6 | 26.5 ± 0.4 | 29.9 ± 0.6 | 30.9 ± 0.4 |
| 0.7 | 25.8 ± 0.1 | 29.5 ± 0.3 | 27.3 ± 0.3 |
| 0.8 | 25.5 ± 0.1 | 30.0 ± 0.3 | 25.7 ± 0.2 |
| 0.9 | 25.5 ± 0.3 | 28.3 ± 0.2 | 25.5 ± 0.2 |
| 1 | 26.5 ± 0.2 | | |

The empirical results are coherent with the Budget hypothesis. When the training budget is large, i.e. $\bar{L}/L \ge 0.5$, the Uniform mode outperforms the others. We note nevertheless that for $\bar{L}/L \ge 0.9$, the Linear and Uniform modes have similar performance; this seems reasonable, as the uniform and linear survival probabilities become very close for such survival proportions. When the budget is low, i.e. $\bar{L}/L < 0.5$, SenseMode outperforms the Uniform mode, thus confirming the Budget hypothesis. (The Linear mode cannot be used with budgets $\bar{L} < L/2$ when $L \gg 1$, since its average survival rate satisfies $\frac{1}{L}\sum_l p_l \ge \frac{1}{2} - \frac{1}{2L} \approx \frac{1}{2}$.) Table 2 also shows a clear Budget-performance trade-off.

## 7 Related work

The regularization effect of Dropout in the context of linear models has been the topic of a stream of papers [Wager et al., 2013, Mianjy and Arora, 2019, Helmbold and Long, 2015, Cavazza et al., 2017]. This analysis was recently extended to neural networks by Wei et al. [2020], whose authors used an approach similar to ours to depict the explicit and implicit regularization effects of Dropout. To the best of our knowledge, our paper is the first to provide analytical results for the regularization effect of SD and to study the large-depth behaviour of SD, showing that the latter mimics Gaussian noise injection [Camuto et al., 2020]. A further analysis of the implicit regularization effect of SD is provided in Appendix A3.

## 8 Limitations and extensions

In this work, we provided an analytical study of the regularization effect of SD using a second-order Taylor approximation of the loss function. Although the remaining higher-order terms are usually dominated by the second-order approximation (see the quality of the second-order approximation in Appendix A4), they might also be responsible for other regularization effects. This is a common limitation in the literature on noise-based regularization in deep neural networks [Camuto et al., 2020, Wei et al., 2020].
Further research is needed to isolate the effect of higher-order terms. We also believe that SenseMode opens an exciting research direction, given that the low budget regime has not been much explored in the literature yet. One can probably obtain even better results with more elaborate maps $\varphi$ such that $p = \varphi(S)$. Another interesting extension, of an algorithmic nature, is the dynamic use of SenseMode throughout training. We are currently investigating this topic, which we leave for future work.

## References

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646-661. Springer, 2016.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058-1066, 2013.

A. R. Webb. Functional approximation by feed-forward networks: a least-squares approach to generalization. IEEE Transactions on Neural Networks, 5(3):363-371, 1994.

Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108-116, 1995.

A. Camuto, M. Willetts, U. Simsekli, S. J. Roberts, and C. C. Holmes. Explicit regularisation in Gaussian noise injections. In Advances in Neural Information Processing Systems, volume 33, pages 16603-16614, 2020.

Stefan Wager, Sida Wang, and Percy S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pages 351-359, 2013.

Poorya Mianjy and Raman Arora. On dropout and nuclear norm regularization. arXiv preprint arXiv:1905.11887, 2019.

David P. Helmbold and Philip M. Long. On the inductive bias of dropout. The Journal of Machine Learning Research, 16(1):3403-3454, 2015.

Jacopo Cavazza, Pietro Morerio, Benjamin Haeffele, Connor Lane, Vittorio Murino, and René Vidal. Dropout as a low-rank regularizer for matrix factorization. arXiv preprint arXiv:1710.05092, 2017.

C. Wei, S. Kakade, and T. Ma. The implicit and explicit regularization effects of dropout. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 10181-10192. PMLR, 2020.

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

R. M. Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 1995.

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, 2016.

S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations, 2017.

G. Yang. Tensor programs III: Neural matrix laws. arXiv preprint arXiv:2009.10685, 2020.

L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and P. Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In ICML, 2018.
S. Hayou, A. Doucet, and J. Rousseau. On the impact of the activation function on deep neural networks training. In International Conference on Machine Learning, 2019.

S. Hayou, A. Doucet, and J. Rousseau. Mean-field behaviour of neural tangent kernel for deep neural networks. arXiv preprint arXiv:1905.13654, 2020.

S. Hayou, E. Clerico, B. He, G. Deligiannidis, A. Doucet, and J. Rousseau. Stable ResNet. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, pages 1324-1332, 2021a.

G. Yang and S. Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems, pages 7103-7114, 2017.

S. Hayou, J. F. Ton, A. Doucet, and Y. W. Teh. Robust pruning at initialization. In International Conference on Learning Representations, 2021b.

A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, 2018.

S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, 2019.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

Stanislaw Jastrzebski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the relation between the sharpest directions of DNN loss and the SGD step length. arXiv preprint arXiv:1807.05031, 2018.

Zhewei Yao, Amir Gholami, Qi Lei, Kurt Keutzer, and Michael W. Mahoney. Hessian-based analysis of large batch training and robustness to adversaries. In Advances in Neural Information Processing Systems, pages 4949-4959, 2018.

Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-48. Springer, 2012.

Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Samuel Sanford Shapiro and Martin B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591-611, 1965.

J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, 2019.

A. G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.

T. Lillicrap, D. Cownden, D. Tweed, and C. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(13276), 2016.

A. Neelakantan, L. Vilnis, Quoc V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
Ralph B. D'Agostino. Transformation to normality of the null distribution of g1. Biometrika, 57(3):679-681, 1970. doi: 10.1093/biomet/57.3.679. URL https://doi.org/10.1093/biomet/57.3.679.